Bernoulli cross-entropy loss is a special case of categorical cross-entropy loss for $C = 2$:
$$-\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)\right] \;=\; -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{ic}\log(\hat{y}_{ic}) \quad\text{for } C = 2,$$
where $i$ indexes samples/observations and $c$ indexes classes, $y$ is the sample label (a binary scalar on the LHS, a one-hot vector on the RHS) and $\hat{y}$ is the prediction for a sample.
I write "Bernoulli cross-entropy" because this loss arises from a Bernoulli probability model. There is not a "binary distribution." A "binary cross-entropy" doesn't tell us if the thing that is binary is the one-hot vector of labels, or if the author is using binary encoding for each trial (success or failure). This isn't a general convention, but it makes clear that these formulae arise from particular probability models. Conventional jargon is not clear in that way.
There are three kinds of classification tasks:
- Binary classification: two exclusive classes
- Multi-class classification: more than two exclusive classes
- Multi-label classification: non-exclusive classes (a sample can carry several labels at once)
Here, we can say
- In the case of (1), you need to use binary cross entropy.
- In the case of (2), you need to use categorical cross entropy.
- In the case of (3), you need to use binary cross entropy.
You can think of the multi-label classifier as a combination of multiple independent binary classifiers. If you have 10 classes, you have 10 separate binary classifiers, each trained independently. Thus we can produce multiple labels for each sample. If you want to make sure at least one label is assigned, you can select the one with the lowest classification loss, or use another metric.
I want to emphasize that multi-class classification is not the same as multi-label classification! Rather, the multi-label classifier borrows its idea from the binary classifier!
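As a rough Keras sketch of that idea (the layer sizes and feature/label counts below are made up for illustration), a multi-label model is just a stack of sigmoid outputs trained with binary cross-entropy, while a multi-class model uses one softmax output trained with categorical cross-entropy:
from tensorflow import keras
from tensorflow.keras import layers
n_features, n_labels = 20, 10  # hypothetical sizes
# multi-label: 10 independent binary classifiers sharing the same trunk
multi_label = keras.Sequential([
    keras.Input(shape=(n_features,)),
    layers.Dense(64, activation='relu'),
    layers.Dense(n_labels, activation='sigmoid'),  # one Bernoulli output per label
])
multi_label.compile(loss='binary_crossentropy', optimizer='adam')
# multi-class: exactly one class per sample
multi_class = keras.Sequential([
    keras.Input(shape=(n_features,)),
    layers.Dense(64, activation='relu'),
    layers.Dense(n_labels, activation='softmax'),
])
multi_class.compile(loss='categorical_crossentropy', optimizer='adam')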
- python - difference between categorical and binary cross entropy - Stack Overflow
- python - What is the difference between sparse_categorical_crossentropy and categorical_crossentropy? - Stack Overflow
- machine learning - Why binary_crossentropy and categorical_crossentropy give different performances for the same problem? - Stack Overflow
- Difference between binary cross entropy and categorical cross entropy?
They are mathematically identical for 2 classes, hence "binary." In other words, 2-class categorical cross entropy is the same as single-output binary cross entropy. To give a more tangible example, these are identical:
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', ...)
# is the same as
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', ...)
Which one to use? To avoid one-hot encoding categorical outputs, if you only have 2 classes it is easier - from a coding perspective - to use binary cross entropy. The binary case might be computationally more efficient depending on the implementation.
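A quick numeric check of that identity with the built-in Keras losses (the probabilities are arbitrary):
import numpy as np
import tensorflow as tf
# single sigmoid output: scalar 0/1 targets
bce = tf.keras.losses.BinaryCrossentropy()(
    np.array([[1.], [0.]]), np.array([[0.9], [0.2]])).numpy()
# two softmax outputs: one-hot targets carrying the same probabilities
cce = tf.keras.losses.CategoricalCrossentropy()(
    np.array([[0., 1.], [1., 0.]]), np.array([[0.1, 0.9], [0.8, 0.2]])).numpy()
print(bce, cce)  # both come out around 0.164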
It seems binary cross entropy is just a special case of categorical cross entropy. So, when you have only two classes, you can use binary cross entropy; you don't need to do one-hot encoding, and your code will be a couple of lines shorter.
Simply:
categorical_crossentropy (cce) produces a one-hot array containing the probable match for each category; sparse_categorical_crossentropy (scce) produces a category index of the most likely matching category.
Consider a classification problem with 5 categories (or classes).
- In the case of cce, the one-hot target may be [0, 1, 0, 0, 0] and the model may predict [.2, .5, .1, .1, .1] (probably right).
- In the case of scce, the target index may be [1] and the model may predict: [.5].
Consider now a classification problem with 3 classes.
- In the case of cce, the one-hot target might be [0, 0, 1] and the model may predict [.5, .1, .4] (probably inaccurate, given that it gives more probability to the first class).
- In the case of scce, the target index might be [0], and the model may predict [.5].
Many categorical models produce scce output because you save space, but lose A LOT of information (for example, in the 2nd example, index 2 was also very close.) I generally prefer cce output for model reliability.
There are a number of situations to use scce, including:
- when your classes are mutually exclusive, i.e. you don't care at all about other close-enough predictions,
- the number of categories is so large that the prediction output becomes overwhelming.
220405: response to "one-hot encoding" comments:
one-hot encoding is used for a category feature INPUT to select a specific category (e.g. male versus female). This encoding allows the model to train more efficiently: training weight is a product of category, which is 0 for all categories except for the given one.
cce and scce are a model OUTPUT. cce is a probability array of each category, totally 1.0. scce shows the MOST LIKELY category, totally 1.0.
scce is technically a one-hot array, just like a hammer used as a door stop is still a hammer, but its purpose is different. cce is NOT one-hot.
I was also confused by this one. Fortunately, the excellent keras documentation came to the rescue. Both have the same loss function and are ultimately doing the same thing; the only difference is in the representation of the true labels.
- Categorical Cross Entropy [Doc]:
Use this crossentropy loss function when there are two or more label classes. We expect labels to be provided in a one_hot representation.
>>> y_true = [[0, 1, 0], [0, 0, 1]]
>>> y_pred = [[0.05, 0.95, 0], [0.1, 0.8, 0.1]]
>>> # Using 'auto'/'sum_over_batch_size' reduction type.
>>> cce = tf.keras.losses.CategoricalCrossentropy()
>>> cce(y_true, y_pred).numpy()
1.177
- Sparse Categorical Cross Entropy [Doc]:
Use this crossentropy loss function when there are two or more label classes. We expect labels to be provided as integers.
>>> y_true = [1, 2]
>>> y_pred = [[0.05, 0.95, 0], [0.1, 0.8, 0.1]]
>>> # Using 'auto'/'sum_over_batch_size' reduction type.
>>> scce = tf.keras.losses.SparseCategoricalCrossentropy()
>>> scce(y_true, y_pred).numpy()
1.177
One good example of sparse categorical cross-entropy is the fashion-mnist dataset.
import tensorflow as tf
from tensorflow import keras
fashion_mnist = keras.datasets.fashion_mnist
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist.load_data()
print(y_train_full.shape) # (60000,)
print(y_train_full.dtype) # uint8
y_train_full[:10]
# array([9, 0, 0, 3, 0, 2, 7, 2, 5, 5], dtype=uint8)
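Because y_train_full already holds integer class indices, a model for this dataset can be compiled with sparse_categorical_crossentropy directly, with no one-hot step. A minimal sketch (the layer sizes are illustrative, not prescriptive):
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10, activation='softmax'),
])
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
# model.fit(X_train_full / 255.0, y_train_full, epochs=5)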
The reason for this apparent performance discrepancy between categorical & binary cross entropy is what user xtof54 has already reported in his answer below, i.e.:
the accuracy computed with the Keras method evaluate is just plain wrong when using binary_crossentropy with more than 2 labels
I would like to elaborate more on this, demonstrate the actual underlying issue, explain it, and offer a remedy.
This behavior is not a bug; the underlying reason is a rather subtle and undocumented issue in how Keras actually guesses which accuracy to use, depending on the loss function you have selected, when you simply include metrics=['accuracy'] in your model compilation. In other words, while your first compilation option
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
is valid, your second one:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
will not produce what you expect, but the reason is not the use of binary cross entropy (which, at least in principle, is an absolutely valid loss function).
Why is that? If you check the metrics source code, Keras does not define a single accuracy metric, but several different ones, among them binary_accuracy and categorical_accuracy. What happens under the hood is that, since you have selected binary cross entropy as your loss function and have not specified a particular accuracy metric, Keras (wrongly...) infers that you are interested in the binary_accuracy, and this is what it returns - while in fact you are interested in the categorical_accuracy.
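A small numpy sketch of why the two metrics diverge (the prediction values are made up): binary_accuracy thresholds every entry of the output vector at 0.5 and compares element-wise, so a prediction whose argmax is wrong can still score high:
import numpy as np
y_true = np.array([[1, 0, 0, 0]])          # one-hot target: class 0
y_pred = np.array([[0.3, 0.4, 0.2, 0.1]])  # argmax is class 1, i.e. wrong
cat_acc = np.mean(np.argmax(y_true, axis=1) == np.argmax(y_pred, axis=1))  # 0.0
bin_acc = np.mean((y_pred > 0.5).astype(int) == y_true)                    # 0.75
print(cat_acc, bin_acc)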
Let's verify that this is the case, using the MNIST CNN example in Keras, with the following modification:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) # WRONG way
model.fit(x_train, y_train,
batch_size=batch_size,
epochs=2, # only 2 epochs, for demonstration purposes
verbose=1,
validation_data=(x_test, y_test))
# Keras reported accuracy:
score = model.evaluate(x_test, y_test, verbose=0)
score[1]
# 0.9975801164627075
# Actual accuracy calculated manually:
import numpy as np
y_pred = model.predict(x_test)
acc = sum([np.argmax(y_test[i])==np.argmax(y_pred[i]) for i in range(10000)])/10000
acc
# 0.98780000000000001
score[1]==acc
# False
To remedy this, i.e. to use indeed binary cross entropy as your loss function (as I said, nothing wrong with this, at least in principle) while still getting the categorical accuracy required by the problem at hand, you should ask explicitly for categorical_accuracy in the model compilation as follows:
from keras.metrics import categorical_accuracy
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[categorical_accuracy])
In the MNIST example, after training, scoring, and predicting the test set as I show above, the two metrics now are the same, as they should be:
# Keras reported accuracy:
score = model.evaluate(x_test, y_test, verbose=0)
score[1]
# 0.98580000000000001
# Actual accuracy calculated manually:
y_pred = model.predict(x_test)
acc = sum([np.argmax(y_test[i])==np.argmax(y_pred[i]) for i in range(10000)])/10000
acc
# 0.98580000000000001
score[1]==acc
# True
System setup:
Python version 3.5.3
Tensorflow version 1.2.1
Keras version 2.0.4
UPDATE: After my post, I discovered that this issue had already been identified in this answer.
It all depends on the type of classification problem you are dealing with. There are three main categories
- binary classification (two target classes),
- multi-class classification (more than two exclusive targets),
- multi-label classification (more than two non-exclusive targets), in which multiple target classes can be on at the same time.
In the first case, binary cross-entropy should be used and targets are typically encoded as single 0/1 values for one sigmoid output (an equivalent two-unit softmax with one-hot targets also works).
In the second case, categorical cross-entropy should be used and targets should be encoded as one-hot vectors.
In the last case, binary cross-entropy should be used and targets should be encoded as binary (multi-hot) vectors. Each output neuron (or unit) is considered as a separate random binary variable, and the likelihood of the entire vector of outputs is the product of the likelihoods of the single binary variables. Therefore the loss is the sum of the binary cross-entropies for each single output unit.
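A minimal sketch of that multi-label loss for a single sample (the values are arbitrary): treating each output as an independent Bernoulli variable, the joint likelihood is a product, so the total loss is the sum of per-unit binary cross-entropies:
import numpy as np
y_true = np.array([1, 0, 1])        # three non-exclusive labels (multi-hot)
y_pred = np.array([0.9, 0.2, 0.6])  # sigmoid outputs
per_unit = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
total_loss = per_unit.sum()  # sum of binary cross-entropies over output units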
The binary cross-entropy is defined as
$$-\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)\right]$$
and categorical cross-entropy is defined as
$$-\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{ic}\log(\hat{y}_{ic})$$
where c is the index running over the number of classes C.
I think I have some understanding of binary cross entropy. What is categorical cross entropy (as implemented in Keras), and how does it differ? Resources I've found on the internet for this have been way above my head.
In tf keras, is binary cross entropy with 2 exclusive classes the same as categorical cross entropy with 2 exclusive classes?
From the documentation it says categorical is for 2 or more classes. So is categorical with 2 classes the same as binary cross entropy?
Both categorical cross entropy and sparse categorical cross entropy have the same loss function, which you have mentioned above. The only difference is the format in which you provide $Y_i$ (i.e. the true labels).
If your $Y_i$'s are one-hot encoded, use categorical_crossentropy. Examples (for a 3-class classification): [1,0,0] , [0,1,0], [0,0,1]
But if your $Y_i$'s are integers, use sparse_categorical_crossentropy. Examples for the above 3-class classification problem: [0], [1], [2] (class indices start at 0).
The usage entirely depends on how you load your dataset. One advantage of using sparse categorical cross entropy is that it saves memory as well as computation time, because it simply uses a single integer for a class rather than a whole vector.
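A short sketch of switching between the two label formats (three arbitrary labels), using keras.utils.to_categorical one way and argmax the other:
import numpy as np
from tensorflow.keras.utils import to_categorical
y_int = np.array([0, 2, 1])          # integer labels -> sparse_categorical_crossentropy
y_onehot = to_categorical(y_int, 3)  # one-hot labels -> categorical_crossentropy
# [[1., 0., 0.],
#  [0., 0., 1.],
#  [0., 1., 0.]]
y_back = np.argmax(y_onehot, axis=1)  # back to integer labels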
The formula which you posted in your question refers to binary_crossentropy, not categorical_crossentropy. The former is used when you have only two classes (a single output unit). The latter refers to a situation when you have multiple classes, and its formula looks like below:
$$J(\textbf{w}) = -\sum_{i=1}^{N} y_i \text{log}(\hat{y}_i).$$
This loss works as skadaver mentioned on one-hot encoded values e.g [1,0,0], [0,1,0], [0,0,1]
The sparse_categorical_crossentropy is a little bit different: it works on integers, that's true, but these integers must be the class indices, not actual values. This loss computes the logarithm only for the output index that the ground truth indicates. So when the model output is, for example, [0.1, 0.3, 0.7] and the ground truth is 3 (if indexed from 1), the loss computes only the logarithm of 0.7. This doesn't change the final value, because in the regular version of categorical crossentropy the other values are immediately multiplied by zero (because of the one-hot encoding characteristic). Thanks to that, it computes the logarithm once per instance and omits the summation, which leads to better performance. The formula might look like this:
$$J(\textbf{w}) = -\text{log}(\hat{y}_y).$$
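A tiny numeric check of that shortcut (probabilities invented, class indexed from 0 here): the sparse loss takes the log of the true-class probability only, and it matches the full one-hot sum:
import numpy as np
y_pred = np.array([0.1, 0.3, 0.7])  # model output over 3 classes
true_idx = 2                        # ground-truth class index (0-based)
sparse_loss = -np.log(y_pred[true_idx])                    # only the true class
full_loss = -np.sum(np.array([0, 0, 1]) * np.log(y_pred))  # other terms are zero anyway
assert np.isclose(sparse_loss, full_loss)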
Use sparse categorical crossentropy when your classes are mutually exclusive (e.g. when each sample belongs exactly to one class) and categorical crossentropy when one sample can have multiple classes or labels are soft probabilities (like [0.5, 0.3, 0.2]).
Formula for categorical crossentropy (S - samples, C - classes, $s \in c$ - sample belongs to class c) is:
$$ -\frac{1}{N} \sum_{s\in S} \sum_{c \in C} 1_{s\in c} log {p(s \in c)} $$
For the case when classes are exclusive, you don't need to sum over them: for each sample the only non-zero term is just $-\log p(s \in c)$ for the true class c.
This allows you to conserve time and memory. Consider the case of 10000 mutually exclusive classes: just 1 logarithm instead of a sum over 10000 terms for each sample, and just one integer instead of 10000 floats.
Formula is the same in both cases, so no impact on accuracy should be there.
The answer, in a nutshell
If your targets are one-hot encoded, use categorical_crossentropy.
Examples of one-hot encodings:
[1,0,0]
[0,1,0]
[0,0,1]
But if your targets are integers, use sparse_categorical_crossentropy.
Examples of integer encodings (for the sake of completeness):
1
2
3