Answer from skadaver on Stack Exchange (top answer, 1 of 5, score 112)

Both categorical cross entropy and sparse categorical cross entropy have the same loss function, which you have mentioned above. The only difference is the format in which you supply the true labels (the y's).

If your y's are one-hot encoded, use categorical_crossentropy. Examples (for a 3-class classification): [1,0,0], [0,1,0], [0,0,1]

But if your y's are integers, use sparse_categorical_crossentropy. Examples for the same 3-class problem (class indices, starting at 0 in Keras): [0], [1], [2]

The usage entirely depends on how you load your dataset. One advantage of using sparse categorical cross entropy is that it saves memory as well as computation, because it uses a single integer per label rather than a whole vector.
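
A quick way to see the equivalence is to one-hot encode the same integer labels and compare the two losses directly. A minimal sketch, assuming TensorFlow 2.x with eager execution (the label and prediction values are illustrative, not from the answer above):

import tensorflow as tf

# Integer class indices for a 3-class problem (the "sparse" format)
y_true_sparse = [0, 1, 2]
# One row of predicted probabilities per sample
y_pred = [[0.8, 0.1, 0.1],
          [0.2, 0.7, 0.1],
          [0.1, 0.2, 0.7]]

# The sparse loss consumes the integer indices directly
scce = tf.keras.losses.SparseCategoricalCrossentropy()
print(scce(y_true_sparse, y_pred).numpy())  # ~0.312

# One-hot encode the same labels and use the dense loss
y_true_onehot = tf.keras.utils.to_categorical(y_true_sparse, num_classes=3)
cce = tf.keras.losses.CategoricalCrossentropy()
print(cce(y_true_onehot, y_pred).numpy())   # same value, ~0.312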

Answer 2 of 5 (score 14)

The formula which you posted in your question refers to binary_crossentropy, not categorical_crossentropy. The former is used when you have only two classes (a single probability output); the latter applies when you have multiple classes, and its formula looks like this:

$$\mathcal{L}(y, \hat{y}) = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)$$
This loss works, as skadaver mentioned, on one-hot encoded values, e.g. [1,0,0], [0,1,0], [0,0,1].

The sparse_categorical_crossentropy is a little bit different: it works on integers, that's true, but these integers must be class indices, not actual values. This loss computes the logarithm only for the output index that the ground truth points to. So when the model output is, for example, [0.1, 0.3, 0.7] and the ground truth is 3 (if indexed from 1), the loss computes only the logarithm of 0.7. This doesn't change the final value, because in the regular version of categorical crossentropy the other terms are immediately multiplied by zero (a consequence of the one-hot encoding). Thanks to that, it computes the logarithm once per instance and omits the summation, which leads to better performance. The formula might look like this:

$$\mathcal{L}(y, \hat{y}) = -\log(\hat{y}_t), \quad t = \text{index of the true class}$$
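
That single-logarithm shortcut is easy to verify numerically. A minimal sketch, assuming TensorFlow 2.x; the prediction row is a normalized variant of the example above, since Keras rescales probabilities to sum to 1, and tf.keras indexes classes from 0:

import math
import tensorflow as tf

y_pred = [[0.1, 0.2, 0.7]]  # probabilities for one sample, summing to 1
y_true = [2]                # 0-based index of the true class

scce = tf.keras.losses.SparseCategoricalCrossentropy()
print(scce(y_true, y_pred).numpy())  # ~0.3567
print(-math.log(0.7))                # ~0.3567 -- one logarithm, no summation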

GeeksforGeeks: Sparse Categorical Crossentropy vs. Categorical Crossentropy (geeksforgeeks.org)
July 26, 2025 - Sparse Categorical Crossentropy is functionally similar to Categorical Crossentropy but is designed for cases where the target labels are not one-hot encoded. Instead, the labels are represented as integers corresponding to the class indices.
Discussions

Categorical_crossentropy vs sparse categorical crossentropy (community.deeplearning.ai)
February 27, 2023 - Can someone please explain the difference between the two loss functions, categorical crossentropy and sparse categorical crossentropy?
Sparse_categorical_crossentropy vs categorical_crossentropy (keras, accuracy) - Data Science Stack Exchange (datascience.stackexchange.com)
December 1, 2018 - Which is better for accuracy, or are they the same? Of course, if you use categorical_crossentropy you use one hot encoding, and if you use sparse_categorical_crossentropy you encode as normal integ...
C2_W2_SoftMax Lab - question about SparseCategorialCrossentropy or CategoricalCrossEntropy (community.deeplearning.ai)
July 30, 2022 - In the C2_W2_SoftMax lab it says: … and I thought I understood what it was saying, but in the lab we have (for the first two example vectors): [[6.18e-03 1.51e-03 9.54e-01 3.84e-02] [9.93e-01 6.15e-03 3.59e-04 3.78e-04]] … and that would seem to me that it is approximating the one-hot encoding: ...
I am confused between sparse categorical crossentropy and categorical crossentropy (r/MLQuestions)
September 25, 2022 - A sparse tensor is one where any element has 0 value, like in one hot encoded. So why is SCCE not used when the target label is one hot encoded but used when an integer class like 1, 2, 3, or 4 is passed? I am confused because the definition says something and the implementation is something else. (The full thread, with answers, appears further down.)
Top answer (1 of 3, score 97)

Simply:

  • categorical_crossentropy (cce) produces a one-hot array containing the probable match for each category,
  • sparse_categorical_crossentropy (scce) produces a category index of the most likely matching category.

Consider a classification problem with 5 categories (or classes).

  • In the case of cce, the one-hot target may be [0, 1, 0, 0, 0] and the model may predict [.2, .5, .1, .1, .1] (probably right)

  • In the case of scce, the target index may be [1] and the model may predict [.5].

Consider now a classification problem with 3 classes.

  • In the case of cce, the one-hot target might be [0, 0, 1] and the model may predict [.5, .1, .4] (probably inaccurate, given that it gives more probability to the first class)
  • In the case of scce, the target index might be [0], and the model may predict [.5]

Many categorical models produce scce output because you save space, but you lose A LOT of information (for example, in the 2nd example, index 2 was also very close). I generally prefer cce output for model reliability.

There are a number of situations to use scce, including:

  • when your classes are mutually exclusive, i.e. you don't care at all about other close-enough predictions,
  • when the number of categories is so large that the prediction output becomes overwhelming.

2022-04-05: response to "one-hot encoding" comments:

One-hot encoding is used for a categorical feature INPUT to select a specific category (e.g. male versus female). This encoding allows the model to train more efficiently: the training weight is multiplied by the category value, which is 0 for all categories except the given one.

cce and scce are a model OUTPUT. cce is a probability array over the categories, totaling 1.0. scce shows the MOST LIKELY category, totaling 1.0.

scce is technically a one-hot array, just like a hammer used as a door stop is still a hammer, but its purpose is different. cce is NOT one-hot.
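
Taking this answer's framing at face value, the information loss it describes is easy to picture: collapsing a probability array to a single index keeps only the winner. A small illustrative sketch in plain NumPy (my own example, not from the answer):

import numpy as np

probs = np.array([0.5, 0.1, 0.4])  # cce-style output: full probability array
index = int(np.argmax(probs))      # scce-style output: just the winning index
print(index)                       # 0 -- the close runner-up (0.4 at index 2) is lost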

Answer 2 of 3 (score 77)

I was also confused by this one. Fortunately, the excellent Keras documentation came to the rescue. Both have the same loss function and are ultimately doing the same thing; the only difference is in the representation of the true labels.

  • Categorical Cross Entropy [Doc]:

Use this crossentropy loss function when there are two or more label classes. We expect labels to be provided in a one_hot representation.

>>> y_true = [[0, 1, 0], [0, 0, 1]]
>>> y_pred = [[0.05, 0.95, 0], [0.1, 0.8, 0.1]]
>>> # Using 'auto'/'sum_over_batch_size' reduction type.  
>>> cce = tf.keras.losses.CategoricalCrossentropy()
>>> cce(y_true, y_pred).numpy()
1.177
  • Sparse Categorical Cross Entropy [Doc]:

Use this crossentropy loss function when there are two or more label classes. We expect labels to be provided as integers.

>>> y_true = [1, 2]
>>> y_pred = [[0.05, 0.95, 0], [0.1, 0.8, 0.1]]
>>> # Using 'auto'/'sum_over_batch_size' reduction type.  
>>> scce = tf.keras.losses.SparseCategoricalCrossentropy()
>>> scce(y_true, y_pred).numpy()
1.177
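
As a sanity check, both results are the mean of the per-sample log losses on the true classes: $(-\ln(0.95) - \ln(0.1))/2 \approx (0.051 + 2.303)/2 \approx 1.177$.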

One good example of sparse categorical cross-entropy is the Fashion-MNIST dataset.

import tensorflow as tf
from tensorflow import keras

fashion_mnist = keras.datasets.fashion_mnist
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist.load_data()

print(y_train_full.shape) # (60000,)
print(y_train_full.dtype) # uint8

y_train_full[:10]
# array([9, 0, 0, 3, 0, 2, 7, 2, 5, 5], dtype=uint8)
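
Since y_train_full already holds integer class indices, it can be fed to model.fit as-is with the sparse loss. A minimal sketch continuing the snippet above (the architecture is an illustrative assumption, not part of the original answer):

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # integer labels, no one-hot step
              metrics=["accuracy"])
model.fit(X_train_full / 255.0, y_train_full, epochs=1)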
Medium: Understanding Sparse Categorical Cross-Entropy and Binary Cross-Entropy, by Shivam Bomble (medium.com)
February 3, 2025 - Sparse categorical cross-entropy is particularly efficient when dealing with integer labels, whereas categorical cross-entropy is appropriate for scenarios that utilize one-hot encoded labels.
Medium: Choosing between Cross Entropy and Sparse Cross Entropy — The Only Guide you Need!, by Shireen Chand (medium.com)
July 20, 2023 - In contrast to categorical cross-entropy loss, where the true labels are represented as one-hot encoded vectors, sparse categorical cross-entropy loss expects the target labels to be integers indicating the class indices directly.
Medium: Sparse Categorical Cross-Entropy vs Categorical Cross-Entropy, by Felipe A. Moreno (fmorenovr.medium.com)
November 30, 2021 - categorical_crossentropy (cce) produces a one-hot array containing the probable match for each category, sparse_categorical_crossentropy (scce) produces a category index of the most likely matching category.
DeepLearning.AI: C2_W2_SoftMax Lab - question about SparseCategorialCrossentropy or CategoricalCrossEntropy (community.deeplearning.ai)
July 30, 2022 - In the C2_W2_SoftMax lab it says: … and I thought I understood what it was saying, but in the lab we have (for the first two example vectors): [[6.18e-03 1.51e-03 9.54e-01 3.84e-02] [9.93e-01 6.15e-03 3.59e-04 3.78e-04]] … and that would seem to me that it is approximating the one-hot encoding: [0, 0, 1, 0] [1, 0, 0, 0] and therefore CategoricalCrossEntropy should be used, but in the lab SparseCategorialCrossentropy is used instead, so I realized I don't understand "SparseCategorialCro...
Substack: Here is what you need to know about Sparse Categorical Cross Entropy in a nutshell (vevesta.substack.com)
August 31, 2022 - Sparse categorical cross entropy is suited for problems where the y label is set to 1, i.e. single-label problems. It doesn't work for multi-label problems. On the other hand, categorical cross entropy works well for multi-label problems or problems where targets that are ...
GeeksforGeeks: What is Sparse Categorical Crossentropy (geeksforgeeks.org)
July 26, 2025 - It is very similar to Categorical Crossentropy but with one important difference, i.e. the true class labels are provided as integers (category indices), not as one-hot encoded vectors.
r/MLQuestions on Reddit: I am confused between sparse categorical crossentropy and categorical crossentropy (reddit.com)
September 25, 2022

[The post includes a comparison table: Sparse Categorical Cross Entropy (SCCE) vs. Categorical Cross Entropy (CCE).]

A sparse tensor is one where any element has 0 value, like in one hot encoded. So why is SCCE not used when the target label is one hot encoded but used when an integer class like 1, 2, 3, or 4 is passed? I am confused because the definition says something and the implementation is something else.

https://stats.stackexchange.com/a/420730

Top answer (1 of 2, score 7)
You're slightly misunderstanding the "sparse" part. You're correct that a sparse vector/matrix has a lot of zeroes, but we're talking specifically about representation here. One-hot-encoding is not a sparse representation, because you still use the same amount of memory as for a non-sparse matrix, i.e. one float per number, even if that number is zero. So one-hot-encoding is a non-sparse (i.e. dense) representation of a sparse matrix. On the other hand, a vector of integers representing the index of the non-zero element is a sparse representation of a sparse matrix: we don't store all the zeroes and thus we save memory. So the only difference between these loss functions is the format/representation in which you supply the targets. SCCE uses a sparse representation, whereas CCE uses a dense representation.
Answer 2 of 2 (score 1)
The integer values are just the indexes where the target label is a one instead of a zero. Because there are so many more zeroes than ones (the ones are "sparsely" populated), it's a waste of space to allocate memory for the whole matrix when we know the vast majority of it is zero-valued. So instead of wasting a ton of memory allocating a mostly empty matrix, we just track the locations in the matrix that we care about.
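
That storage argument can be made concrete with plain NumPy. A small sketch (my own illustration; the sample and class counts are arbitrary):

import numpy as np

n_samples, n_classes = 60_000, 10
labels = np.random.randint(0, n_classes, size=n_samples)  # sparse: one integer per sample
one_hot = np.eye(n_classes, dtype=np.float32)[labels]     # dense: n_classes floats per sample

print(labels.nbytes)   # 480,000 bytes (int64)
print(one_hot.nbytes)  # 2,400,000 bytes (float32) -- 5x larger here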
Hashnode: Decoding Loss Functions: Categorical Cross Entropy vs Sparse Categorical Cross Entropy (monojit13.hashnode.dev)
December 10, 2023 - Categorical Cross Entropy is used for targets which are one-hot encoded. Sparse Categorical Cross Entropy is used for targets which are in integer formats.
Posit: Computes sparse categorical cross-entropy loss - op_sparse_categorical_crossentropy (keras3.posit.co)
The sparse categorical cross-entropy loss is similar to categorical cross-entropy, but it is used when the target tensor contains integer class labels instead of one-hot encoded vectors. It measures the dissimilarity between the target and output probabilities or logits.
GitHub: machine-learning-articles/how-to-use-sparse-categorical-crossentropy-in-keras.md (github.com, author christianversloot)
In Keras, this can be done with to_categorical, which essentially applies one-hot encoding to your training set's targets. When applied, you can start using categorical crossentropy. But did you know that there exists another type of loss - sparse categorical crossentropy - with which you can leave the integers as they are, yet benefit from crossentropy loss?
JanBask Training: categorical_crossentropy vs sparse_categorical_crossentropy - Which is better? (janbasktraining.com)
February 15, 2023 - If your target data is already one-hot encoded, you should use categorical_crossentropy. If your target data is represented as integers (class indices), you should use sparse_categorical_crossentropy.
Keras: Probabilistic losses (keras.io)
Computes the cross-entropy loss between true labels and predicted labels.
Apache MXNet: Multi-hot Sparse Categorical Cross-entropy (cwiki.apache.org)
November 17, 2018 - The only difference between sparse categorical cross entropy and categorical cross entropy is the format of the true labels. When we have a single-label, multi-class classification problem, the labels are mutually exclusive for each data point, meaning each data entry can only belong to one class.