Answer from Sycorax on Stack Exchange
Top answer
1 of 4
85

Bernoulli cross-entropy loss is a special case of categorical cross-entropy loss for $m = 2$.

$$J(\textbf{w}) = -\frac{1}{N}\sum_{i=1}^{N} \left[ y_i \text{log}(\hat{y}_i) + (1 - y_i)\text{log}(1 - \hat{y}_i) \right] = -\frac{1}{N}\sum_{i=1}^{N} \sum_{j=1}^{m} y_{ij} \text{log}(\hat{y}_{ij})$$

Where $i$ indexes samples/observations and $j$ indexes classes, $y$ is the sample label (a binary scalar on the LHS, a one-hot vector on the RHS), and $\hat{y} \in (0, 1)$ is the prediction for a sample.

I write "Bernoulli cross-entropy" because this loss arises from a Bernoulli probability model. There is not a "binary distribution." A "binary cross-entropy" doesn't tell us if the thing that is binary is the one-hot vector of labels, or if the author is using binary encoding for each trial (success or failure). This isn't a general convention, but it makes clear that these formulae arise from particular probability models. Conventional jargon is not clear in that way.

2 of 4
55

There are three kinds of classification tasks:

  1. Binary classification: two exclusive classes
  2. Multi-class classification: more than two exclusive classes
  3. Multi-label classification: non-exclusive classes (more than one label can apply to the same sample)

Here, we can say

  • In the case of (1), you need to use binary cross entropy.
  • In the case of (2), you need to use categorical cross entropy.
  • In the case of (3), you need to use binary cross entropy.

You can think of a multi-label classifier as a combination of multiple independent binary classifiers. If you have 10 classes, you effectively have 10 separate binary classifiers, each trained independently, so the model can produce multiple labels for each sample. If you want to guarantee that at least one label is assigned, you can select the class with the lowest classification loss, or use another metric.

I want to emphasize that multi-class classification is not the same as multi-label classification! Rather, the multi-label classifier borrows its idea from the binary classifier, as sketched below.
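As a rough illustration of case (3), a multi-label head can be built as several independent sigmoid outputs trained with binary cross-entropy. This is only a minimal Keras sketch; the input size (20 features), the 10 labels, and the random data are assumptions made for the example:

import numpy as np
import tensorflow as tf

num_features, num_labels = 20, 10   # arbitrary example sizes

# One sigmoid output per label: effectively 10 independent binary classifiers
model = tf.keras.Sequential([
    tf.keras.Input(shape=(num_features,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(num_labels, activation="sigmoid"),
])

# Binary cross-entropy is applied to each label independently and averaged
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["binary_accuracy"])

# Random multi-hot targets: several labels can be "on" for the same sample
X = np.random.rand(100, num_features).astype("float32")
Y = (np.random.rand(100, num_labels) > 0.7).astype("float32")
model.fit(X, Y, epochs=1, verbose=0)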

Discussions

python - difference between categorical and binary cross entropy - Stack Overflow
Using Keras, I have to train a model to predict whether an image belongs to class 0 or class 1. I am confused about binary and categorical_cross_entropy. I have searched for that but I am still confused.
🌐 stackoverflow.com
python - What is the difference between sparse_categorical_crossentropy and categorical_crossentropy? - Stack Overflow
One good example of sparse categorical cross-entropy is the Fashion-MNIST dataset.
🌐 stackoverflow.com
machine learning - Why binary_crossentropy and categorical_crossentropy give different performances for the same problem? - Stack Overflow
I'm trying to train a CNN to categorize text by topic. When I use binary cross-entropy I get ~80% accuracy; with categorical cross-entropy I get ~50% accuracy. I don't understand why this is.
🌐 stackoverflow.com
Difference between binary cross entropy and categorical cross entropy?
With binary cross entropy, you can only classify two classes. With categorical cross entropy, you're not limited to how many classes your model can classify. Binary cross entropy is just a special case of categorical cross entropy. The equation for binary cross entropy loss is the exact equation for categorical cross entropy loss with one output node. For example, binary cross entropy with one output node is the equivalent of categorical cross entropy with two output nodes.
🌐 r/learnmachinelearning
🌐 GeeksforGeeks
Sparse Categorical Crossentropy vs. Categorical Crossentropy - GeeksforGeeks
July 26, 2025 - Categorical Crossentropy measures how well the predicted probabilities of each class align with the actual target labels. Its primary purpose is to evaluate a classification model's performance by comparing the model's predicted probabilities ...
🌐 Gombru
Understanding Categorical Cross-Entropy Loss, Binary Cross-Entropy Loss, Softmax Loss, Logistic Loss, Focal Loss and all those confusing names
That's why it is used for multi-label classification, where the insight of an element belonging to a certain class should not influence the decision for another class. It's called Binary Cross-Entropy Loss because it sets up a binary classification problem between \(C' = 2\) classes for every class in \(C\), as explained above.
🌐 Medium
Choosing between Cross Entropy and Sparse Cross Entropy — The Only Guide you Need! | by Shireen Chand | Medium
July 20, 2023 - In contrast to categorical cross-entropy loss, where the true labels are represented as one-hot encoded vectors, sparse categorical cross-entropy loss expects the target labels to be integers indicating the class indices directly.
🌐 V7 Labs
Cross Entropy Loss: Intro, Applications, Code
Binary cross entropy is calculated on top of sigmoid outputs, whereas Categorical cross-entropy is calculated over softmax activation outputs.
Top answer
1 of 3
97

Simply:

  • categorical_crossentropy (cce) produces an array containing the match probability for each category,
  • sparse_categorical_crossentropy (scce) produces a category index of the most likely matching category.

Consider a classification problem with 5 categories (or classes).

  • In the case of cce, the one-hot target may be [0, 1, 0, 0, 0] and the model may predict [.2, .5, .1, .1, .1] (probably right)

  • In the case of scce, the target index may be [1] and the model may predict: [.5].

Consider now a classification problem with 3 classes.

  • In the case of cce, the one-hot target might be [0, 0, 1] and the model may predict [.5, .1, .4] (probably inaccurate, given that it gives more probability to the first class)
  • In the case of scce, the target index might be [0], and the model may predict [.5]

Many categorical models produce scce output because you save space, but lose A LOT of information (for example, in the 2nd example, index 2 was also very close). I generally prefer cce output for model reliability.

There are a number of situations to use scce, including:

  • when your classes are mutually exclusive, i.e. you don't care at all about other close-enough predictions,
  • the number of categories is so large that the prediction output becomes overwhelming.

Update (2022-04-05): response to "one-hot encoding" comments:

one-hot encoding is used for a category feature INPUT to select a specific category (e.g. male versus female). This encoding allows the model to train more efficiently: each training weight is multiplied by the category value, which is 0 for every category except the given one.

cce and scce are a model OUTPUT. cce is a probability array over each category, totaling 1.0. scce shows the MOST LIKELY category, again totaling 1.0.

scce is technically a one-hot array, just like a hammer used as a door stop is still a hammer, but its purpose is different. cce is NOT one-hot.
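If it helps to see the two label formats side by side, here is a small sketch (not from the answer above) that converts between integer indices and one-hot vectors with standard NumPy/Keras utilities:

import numpy as np
from tensorflow import keras

# Integer class indices: the format sparse_categorical_crossentropy expects as targets
y_int = np.array([1, 0, 3, 2])

# One-hot vectors: the format categorical_crossentropy expects as targets
y_onehot = keras.utils.to_categorical(y_int, num_classes=4)
print(y_onehot)
# [[0. 1. 0. 0.]
#  [1. 0. 0. 0.]
#  [0. 0. 0. 1.]
#  [0. 0. 1. 0.]]

# Converting back: argmax recovers the integer index from each one-hot row
print(np.argmax(y_onehot, axis=1))   # [1 0 3 2]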

2 of 3
77

I was also confused by this one. Fortunately, the excellent Keras documentation came to the rescue. Both have the same loss function and are ultimately doing the same thing; the only difference is in the representation of the true labels.

  • Categorical Cross Entropy [Doc]:

Use this crossentropy loss function when there are two or more label classes. We expect labels to be provided in a one_hot representation.

>>> import tensorflow as tf
>>> y_true = [[0, 1, 0], [0, 0, 1]]
>>> y_pred = [[0.05, 0.95, 0], [0.1, 0.8, 0.1]]
>>> # Using 'auto'/'sum_over_batch_size' reduction type.  
>>> cce = tf.keras.losses.CategoricalCrossentropy()
>>> cce(y_true, y_pred).numpy()
1.177
  • Sparse Categorical Cross Entropy [Doc]:

Use this crossentropy loss function when there are two or more label classes. We expect labels to be provided as integers.

>>> y_true = [1, 2]
>>> y_pred = [[0.05, 0.95, 0], [0.1, 0.8, 0.1]]
>>> # Using 'auto'/'sum_over_batch_size' reduction type.  
>>> scce = tf.keras.losses.SparseCategoricalCrossentropy()
>>> scce(y_true, y_pred).numpy()
1.177

One good example of sparse categorical cross-entropy is the Fashion-MNIST dataset.

import tensorflow as tf
from tensorflow import keras

fashion_mnist = keras.datasets.fashion_mnist
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist.load_data()

print(y_train_full.shape) # (60000,)
print(y_train_full.dtype) # uint8

y_train_full[:10]
# array([9, 0, 0, 3, 0, 2, 7, 2, 5, 5], dtype=uint8)
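Because the labels are plain integers, the model can be compiled directly with sparse categorical cross-entropy and fed y_train_full as-is. A minimal continuation sketch (the architecture below is an arbitrary example, not part of the original answer):

# Integer labels work directly with sparse_categorical_crossentropy,
# no one-hot conversion needed.
model = keras.Sequential([
    keras.Input(shape=(28, 28)),
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])
model.fit(X_train_full / 255.0, y_train_full, epochs=1, verbose=0)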
🌐 Medium
Sparse Categorical Cross-Entropy vs Categorical Cross-Entropy | by Felipe A. Moreno | Medium
November 30, 2021 - It is a Sigmoid activation plus a Cross-Entropy loss. ... Difference between Multi-Class and Multi-Label: Multi-Class classifies only one object among multiple objects in one sample; Multi-Label can classify multiple objects in one sample. In this case, we can calculate using two different methods: Categorical Cross-Entropy and Sparse Categorical Cross-Entropy.
🌐 MachineLearningMastery
A Gentle Introduction to Cross-Entropy for Machine Learning - MachineLearningMastery.com
December 22, 2020 - Binary Cross-Entropy: Cross-entropy as a loss function for a binary classification task. Categorical Cross-Entropy: Cross-entropy as a loss function for a multi-class classification task.
Top answer
1 of 12
270

The reason for this apparent performance discrepancy between categorical & binary cross entropy is what user xtof54 has already reported in his answer below, i.e.:

the accuracy computed with the Keras method evaluate is just plain wrong when using binary_crossentropy with more than 2 labels

I would like to elaborate more on this, demonstrate the actual underlying issue, explain it, and offer a remedy.

This behavior is not a bug; the underlying reason is a rather subtle & undocumented issue in how Keras actually guesses which accuracy to use, depending on the loss function you have selected, when you simply include metrics=['accuracy'] in your model compilation. In other words, while your first compilation option

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

is valid, your second one:

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

will not produce what you expect, but the reason is not the use of binary cross entropy (which, at least in principle, is an absolutely valid loss function).

Why is that? If you check the metrics source code, Keras does not define a single accuracy metric, but several different ones, among them binary_accuracy and categorical_accuracy. What happens under the hood is that, since you have selected binary cross entropy as your loss function and have not specified a particular accuracy metric, Keras (wrongly...) infers that you are interested in the binary_accuracy, and this is what it returns - while in fact you are interested in the categorical_accuracy.

Let's verify that this is the case, using the MNIST CNN example in Keras, with the following modification:

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])  # WRONG way

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=2,  # only 2 epochs, for demonstration purposes
          verbose=1,
          validation_data=(x_test, y_test))

# Keras reported accuracy:
score = model.evaluate(x_test, y_test, verbose=0) 
score[1]
# 0.9975801164627075

# Actual accuracy calculated manually:
import numpy as np
y_pred = model.predict(x_test)
acc = sum([np.argmax(y_test[i])==np.argmax(y_pred[i]) for i in range(10000)])/10000
acc
# 0.98780000000000001

score[1]==acc
# False    

To remedy this, i.e. to indeed use binary cross entropy as your loss function (as I said, nothing wrong with this, at least in principle) while still getting the categorical accuracy required by the problem at hand, you should explicitly ask for categorical_accuracy in the model compilation as follows:

from keras.metrics import categorical_accuracy
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[categorical_accuracy])

In the MNIST example, after training, scoring, and predicting the test set as I show above, the two metrics now are the same, as they should be:

# Keras reported accuracy:
score = model.evaluate(x_test, y_test, verbose=0) 
score[1]
# 0.98580000000000001

# Actual accuracy calculated manually:
y_pred = model.predict(x_test)
acc = sum([np.argmax(y_test[i])==np.argmax(y_pred[i]) for i in range(10000)])/10000
acc
# 0.98580000000000001

score[1]==acc
# True    

System setup:

Python version 3.5.3
Tensorflow version 1.2.1
Keras version 2.0.4

UPDATE: After my post, I discovered that this issue had already been identified in this answer.

2 of 12
87

It all depends on the type of classification problem you are dealing with. There are three main categories

  • binary classification (two target classes),
  • multi-class classification (more than two exclusive targets),
  • multi-label classification (more than two non-exclusive targets), in which multiple target classes can be active at the same time.

In the first case, binary cross-entropy should be used and targets should be encoded as one-hot vectors.

In the second case, categorical cross-entropy should be used and targets should be encoded as one-hot vectors.

In the last case, binary cross-entropy should be used and targets should be encoded as multi-hot vectors (several entries can be 1 at once). Each output neuron (or unit) is treated as a separate random binary variable, and because the joint likelihood of the output vector is the product of the per-unit Bernoulli likelihoods, the loss for the entire vector of outputs is the sum of the binary cross-entropies of the single output units.

The binary cross-entropy is defined as

$$J(\textbf{w}) = -\frac{1}{N}\sum_{i=1}^{N} \left[ y_i \text{log}(\hat{y}_i) + (1 - y_i)\text{log}(1 - \hat{y}_i) \right]$$

and categorical cross-entropy is defined as

$$J(\textbf{w}) = -\frac{1}{N}\sum_{i=1}^{N} \sum_{c=1}^{C} y_{ic} \text{log}(\hat{y}_{ic})$$

where c is the index running over the number of classes C.
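To make the multi-label case concrete, here is a small NumPy sketch (with made-up numbers) showing that the loss over a multi-hot target vector is just the sum of the per-output binary cross-entropies:

import numpy as np

# One sample with 4 non-exclusive labels: multi-hot target and sigmoid outputs
y = np.array([1, 0, 1, 0])
p = np.array([0.8, 0.1, 0.6, 0.3])

# Per-output binary cross-entropy, one term per label
per_label = -(y * np.log(p) + (1 - y) * np.log(1 - p))

# The multi-label loss is the sum of the independent per-label losses
# (Keras' binary_crossentropy reports the mean over labels rather than the sum)
print(per_label.sum(), per_label.mean())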

🌐 Reddit
r/learnmachinelearning on Reddit: Difference between binary cross entropy and categorical cross entropy?
March 31, 2018 - I think I have some understanding of binary cross entropy; what is categorical cross entropy (as implemented in Keras) and how does it differ? Resources I've found on the internet for this have been way above my head.

🌐 Wikipedia
Cross-entropy - Wikipedia
In information theory, the cross-entropy between two probability distributions p and q, over the same underlying set of events, measures the average number of bits needed to identify an event drawn from the set when the coding scheme used for the set is optimized for an estimated ...
🌐 Medium
Categorical Cross-Entropy: Unraveling its Potentials in Multi-Class Classification | by Maxim Sorokin | Medium
January 17, 2024 - What is Categorical Cross-Entropy? Categorical Cross-Entropy is a loss function that is used in multi-class …
Top answer
1 of 5
112

Both categorical cross entropy and sparse categorical cross entropy have the same loss function, which you have mentioned above. The only difference is the format in which you specify $Y_i$ (i.e., the true labels).

If your $Y_i$'s are one-hot encoded, use categorical_crossentropy. Examples (for a 3-class classification): [1,0,0] , [0,1,0], [0,0,1]

But if your $Y_i$'s are integers, use sparse_categorical_crossentropy. Examples for the above 3-class classification problem: [0], [1], [2] (Keras expects 0-indexed class integers).

The usage entirely depends on how you load your dataset. One advantage of using sparse categorical cross entropy is that it saves memory as well as computation time, because it simply uses a single integer for a class rather than a whole vector.

2 of 5
14

The formula which you posted in your question refers to binary_crossentropy, not categorical_crossentropy. The former is used when you have only two classes (a single output unit). The latter refers to a situation where you have multiple classes, and its formula looks like this:

$$J(\textbf{w}) = -\sum_{i=1}^{N} y_i \text{log}(\hat{y}_i).$$

This loss works as skadaver mentioned on one-hot encoded values e.g [1,0,0], [0,1,0], [0,0,1]

The sparse_categorical_crossentropy is a little bit different: it works on integers, that's true, but these integers must be the class indices, not actual values. This loss computes the logarithm only for the output index that the ground truth points to. So when the model output is, for example, [0.1, 0.3, 0.7] and the ground truth is 3 (if indexed from 1), the loss computes only the logarithm of 0.7. This doesn't change the final value, because in the regular version of categorical crossentropy the other values are immediately multiplied by zero (due to the one-hot encoding). Thanks to that, it computes the logarithm once per instance and omits the summation, which leads to better performance. The formula might look like this:

$$J(\textbf{w}) = -\text{log}(\hat{y}_y).$$
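A quick check of that claim (a NumPy/Keras sketch with made-up numbers): the sparse loss is just the negative log of the probability assigned to the true class, and it matches the one-hot version.

import numpy as np
import tensorflow as tf

# Predicted class probabilities for one sample and its true class (index 2, i.e. the third class)
probs = np.array([[0.1, 0.2, 0.7]], dtype="float32")
true_index = np.array([2])                        # integer label, 0-indexed
one_hot = np.array([[0.0, 0.0, 1.0]], dtype="float32")

scce = tf.keras.losses.SparseCategoricalCrossentropy()
cce = tf.keras.losses.CategoricalCrossentropy()

# Both reduce to -log(0.7); the sparse version simply skips the zero terms
print(scce(true_index, probs).numpy())            # ~0.3567
print(cce(one_hot, probs).numpy())                # ~0.3567
print(-np.log(0.7))                               # ~0.3567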

🌐 Quora
What is the difference between categorical_crossentropy and sparse_categorical cross entropy when we do multiclass classification using convolution neural networks? - Quora
Answer (1 of 2): For multiclass classification, we can use either categorical cross entropy loss or sparse categorical cross entropy loss. Both of these losses compute the cross-entropy between the prediction of the network and the given ground truth. Suppose we have an n class classification pr...
🌐 Swebb
Interpreting the Categorical Cross-Entropy Loss Function — swebb.io
May 9, 2024 - We can actually do something similar with the categorical cross-entropy, a favorite for classification problems. The categorical cross-entropy is the information entropy associated with putting each marble into a bucket and sorting them in the right buckets.