Both categorical cross entropy and sparse categorical cross entropy have the same loss function, which you have mentioned above.
The only difference is the format in which you provide the true labels.
If your y's are one-hot encoded, use categorical_crossentropy.
Examples (for a 3-class classification): [1,0,0], [0,1,0], [0,0,1]
But if your y's are integers, use sparse_categorical_crossentropy.
Examples for the above 3-class classification problem: [1], [2], [3]
The usage entirely depends on how you load your dataset. One advantage of using sparse categorical cross entropy is that it saves memory as well as computation time, because it uses a single integer for a class rather than a whole vector.
(Answer from skadaver on Stack Exchange.)
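To make the two label formats concrete, here is a minimal sketch (my own illustration, assuming TensorFlow/Keras; not part of the original answer). Integer labels can be converted to one-hot form with tf.keras.utils.to_categorical, and back with argmax:

import numpy as np
import tensorflow as tf

y_int = np.array([0, 1, 2])  # integer labels, for sparse_categorical_crossentropy (Keras class indices start at 0)
y_onehot = tf.keras.utils.to_categorical(y_int, num_classes=3)  # one-hot labels, for categorical_crossentropy
print(y_onehot)                      # [[1. 0. 0.] [0. 1. 0.] [0. 0. 1.]]
print(np.argmax(y_onehot, axis=1))   # back to the integer form: [0 1 2]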
The formula which you posted in your question refers to binary_crossentropy, not categorical_crossentropy. The former is used for binary problems with a single output; the latter refers to a situation when you have multiple classes, and its formula looks like below:
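$$\text{CCE} = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)$$

where $C$ is the number of classes, $y_c$ is 1 for the true class and 0 otherwise, and $\hat{y}_c$ is the predicted probability for class $c$.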
This loss works, as skadaver mentioned, on one-hot encoded values, e.g. [1,0,0], [0,1,0], [0,0,1].
The sparse_categorical_crossentropy is a little bit different: it works on integers, that's true, but these integers must be the class indices, not actual values. This loss computes the logarithm only for the output index which the ground truth indicates. So when the model output is, for example, [0.1, 0.3, 0.7] and the ground truth is 3 (if indexed from 1), the loss computes only the logarithm of 0.7. This doesn't change the final value, because in the regular version of categorical crossentropy the other terms are immediately multiplied by zero (due to the one-hot encoding). Thanks to that, it computes the logarithm once per instance and omits the summation, which leads to better performance. The formula might look like this:
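$$\text{SCCE} = -\log(\hat{y}_t)$$

where $t$ is the integer index of the true class.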
Simply:
categorical_crossentropy (cce) produces a one-hot array containing the probable match for each category; sparse_categorical_crossentropy (scce) produces a category index of the most likely matching category.
Consider a classification problem with 5 categories (or classes).
- In the case of cce, the one-hot target may be [0, 1, 0, 0, 0] and the model may predict [.2, .5, .1, .1, .1] (probably right).
- In the case of scce, the target index may be [1] and the model may predict: [.5].
Consider now a classification problem with 3 classes.
- In the case of cce, the one-hot target might be [0, 0, 1] and the model may predict [.5, .1, .4] (probably inaccurate, given that it gives more probability to the first class).
- In the case of scce, the target index might be [0], and the model may predict [.5].
Many categorical models produce scce output because you save space, but you lose A LOT of information (for example, in the 2nd example, index 2 was also very close). I generally prefer cce output for model reliability.
There are a number of situations to use scce, including:
- when your classes are mutually exclusive, i.e. you don't care at all about other close-enough predictions,
- the number of categories is so large that the prediction output becomes overwhelming.
220405: response to "one-hot encoding" comments:
one-hot encoding is used for a categorical feature INPUT to select a specific category (e.g. male versus female). This encoding allows the model to train more efficiently: the training weight is a product of the category value, which is 0 for all categories except for the given one.
cce and scce are a model OUTPUT. cce is a probability array over the categories, totaling 1.0. scce shows the MOST LIKELY category, totaling 1.0.
scce is technically a one-hot array, just like a hammer used as a door stop is still a hammer, but its purpose is different. cce is NOT one-hot.
I was also confused by this one. Fortunately, the excellent Keras documentation came to the rescue. Both have the same loss function and ultimately do the same thing; the only difference is in the representation of the true labels.
- Categorical Cross Entropy [Doc]:
Use this crossentropy loss function when there are two or more label classes. We expect labels to be provided in a one_hot representation.
>>> y_true = [[0, 1, 0], [0, 0, 1]]
>>> y_pred = [[0.05, 0.95, 0], [0.1, 0.8, 0.1]]
>>> # Using 'auto'/'sum_over_batch_size' reduction type.
>>> cce = tf.keras.losses.CategoricalCrossentropy()
>>> cce(y_true, y_pred).numpy()
1.177
- Sparse Categorical Cross Entropy [Doc]:
Use this crossentropy loss function when there are two or more label classes. We expect labels to be provided as integers.
>>> y_true = [1, 2]
>>> y_pred = [[0.05, 0.95, 0], [0.1, 0.8, 0.1]]
>>> # Using 'auto'/'sum_over_batch_size' reduction type.
>>> scce = tf.keras.losses.SparseCategoricalCrossentropy()
>>> scce(y_true, y_pred).numpy()
1.177
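In both snippets above the predictions are probabilities (e.g. the output of a softmax layer), and the result is the same 1.177, confirming that only the label format differs. If the model outputs raw scores instead, both loss classes accept from_logits=True; a small sketch (the logit values here are made up for illustration):

>>> scce_logits = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
>>> logits = [[2.0, 1.0, 0.1], [0.5, 1.5, 0.2]]  # raw scores; softmax is applied internally
>>> scce_logits([0, 1], logits).numpy()  # ≈ 0.456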
One good example of sparse categorical cross entropy is the fashion-mnist dataset.
import tensorflow as tf
from tensorflow import keras
fashion_mnist = keras.datasets.fashion_mnist
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist.load_data()
print(y_train_full.shape) # (60000,)
print(y_train_full.dtype) # uint8
y_train_full[:10]
# array([9, 0, 0, 3, 0, 2, 7, 2, 5, 5], dtype=uint8)
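Since y_train_full already holds integer class indices (0-9), a model for this dataset can be compiled with the sparse loss directly; no one-hot conversion is needed. A minimal sketch (the architecture below is just an illustration, not from the original post):

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),    # 28x28 grayscale images
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(10, activation="softmax")   # 10 clothing classes
])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])
model.fit(X_train_full / 255.0, y_train_full, epochs=1)  # integer labels used as-is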
Use sparse categorical crossentropy when your classes are mutually exclusive (e.g. when each sample belongs to exactly one class) and categorical crossentropy when one sample can have multiple classes or labels are soft probabilities (like [0.5, 0.3, 0.2]).
The formula for categorical crossentropy ($S$ - samples, $C$ - classes, $s \in c$ - sample belongs to class $c$) is:

$$-\frac{1}{N}\sum_{s \in S}\sum_{c \in C} 1_{s \in c}\,\log p(s \in c)$$

where $N$ is the number of samples.
For the case when classes are exclusive, you don't need to sum over them: for each sample, the only non-zero term is the one for its true class $c$.
This allows you to conserve time and memory. Consider the case of 10000 mutually exclusive classes: just 1 logarithm instead of summing up 10000 terms for each sample, and just one integer instead of 10000 floats.
The formula is the same in both cases, so there should be no impact on accuracy.
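To make the saving concrete, here is a small NumPy sketch (my own illustration, not from the answer above) showing that the single-logarithm shortcut gives exactly the same value as the full sum over a one-hot vector:

import numpy as np

probs = np.array([0.1, 0.3, 0.6])        # model output for one sample
label = 2                                # integer class index (sparse format)
one_hot = np.eye(3)[label]               # [0., 0., 1.] (one-hot format)

cce = -np.sum(one_hot * np.log(probs))   # sums over all 3 classes
scce = -np.log(probs[label])             # single log at the true index
print(np.isclose(cce, scce))             # True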
The answer, in a nutshell
If your targets are one-hot encoded, use categorical_crossentropy.
Examples of one-hot encodings:
[1,0,0]
[0,1,0]
[0,0,1]
But if your targets are integers, use sparse_categorical_crossentropy.
Examples of integer encodings (for the sake of completeness):
1
2
3
Sparse Categorical Cross Entropy (SCCE)
Categorical Cross Entropy (CCE)
A sparse tensor is one where most elements have 0 value, as in a one-hot encoding. So why is SCCE not used when the target label is one-hot encoded, but instead when an integer class like 1, 2, 3, or 4 is passed? I am confused because the definition says one thing and the implementation does another.
https://stats.stackexchange.com/a/420730