Answers from Stack Overflow

Top answer (1 of 2, score 20)

First, let me give some notes about numerical stability:

As mentioned in the comments section, the numerical instability when using from_logits=False comes from transforming the probability values back into logits, which involves a clipping operation (as discussed in this question and its answer). However, to the best of my knowledge, this does NOT create any serious issues for most practical applications (although there are some cases where applying the softmax/sigmoid function inside the loss function, i.e. using from_logits=True, is more numerically stable in terms of computing gradients; see this answer for a mathematical explanation).
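To make that clipping concrete, here is a minimal sketch in plain NumPy (not the actual Keras source, whose details vary across versions) of how probabilities are typically clipped and mapped back to logits when from_logits=False:

```python
import numpy as np

def probs_to_logits(p, eps=1e-7):
    """Rough sketch of the probability -> logit conversion done inside the
    loss when from_logits=False: clip away from 0/1, then invert the sigmoid."""
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

# Probabilities closer to 0 or 1 than ~1e-7 are clipped, so a tiny amount of
# precision is lost -- this is the (usually harmless) instability noted above.
print(probs_to_logits(np.array([1e-9, 0.5, 1.0 - 1e-9])))  # ~[-16.12, 0.0, 16.12]
```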

In other words, if you are not concerned with the precision of probability values below a sensitivity of about 1e-7, and you have not observed a related convergence issue in your experiments, then you should not worry too much: just use sigmoid and binary cross-entropy as before, i.e. model.compile(loss='binary_crossentropy', ...), and it will work fine.
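A minimal setup of that kind might look like the following (layer sizes and input shape are just placeholders):

```python
import tensorflow as tf

# Sigmoid on the last layer, so the model outputs probabilities in [0, 1].
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

# The string loss is equivalent to BinaryCrossentropy(from_logits=False).
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```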

All in all, if you are really concerned with numerical stability, you can take the safest path and use from_logits=True without any activation function on the last layer of the model.
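A sketch of that variant (again with placeholder layer sizes): the last Dense layer has no activation, and the sigmoid is folded into the loss.

```python
import tensorflow as tf

# No activation on the last layer: the model outputs raw logits in (-inf, inf).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1),
])

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))

# At inference time, apply the sigmoid yourself if you need probabilities:
# probs = tf.sigmoid(model(x))
```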


Now, to answer the original question: the true labels or target values (i.e. y_true) should still be only zeros or ones when using BinaryCrossentropy(from_logits=True). Rather, it is y_pred (i.e. the output of the model) that should not be a probability in this case (i.e. the sigmoid function should not be applied on the last layer when from_logits=True).
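A quick check with made-up values illustrates this: the labels are plain 0/1 in both cases, and from_logits=True on raw logits gives (up to clipping precision) the same loss as from_logits=False on sigmoid(logits).

```python
import tensorflow as tf

y_true = tf.constant([0., 1., 1., 0.])          # labels stay zeros and ones
logits = tf.constant([-2.3, 1.7, 0.4, -0.1])    # raw, unbounded model outputs

bce_logits = tf.keras.losses.BinaryCrossentropy(from_logits=True)
bce_probs = tf.keras.losses.BinaryCrossentropy(from_logits=False)

print(bce_logits(y_true, logits).numpy())            # ~0.355
print(bce_probs(y_true, tf.sigmoid(logits)).numpy())  # ~ the same value
```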

Answer 2 of 2 (score 3)

I tested a GAN on recovering a realistic image from a sketch, and the only difference between the two training runs was BinaryCrossentropy(from_logits=True/False). The last network layer is Conv2D with no activation, so the right choice should be from_logits=True, but for experimental purposes I tried both and found a huge difference in the generator and discriminator losses:

  • orange - from_logits=True,
  • blue - from_logits=False.

Here is the link to the Colab notebook. The exercise is based on the TensorFlow pix2pix tutorial.

According to the exercise description, if from_logits=True:

  • The value log(2) ≈ 0.69 is a good reference point for these losses, as it indicates a perplexity of 2: the discriminator is, on average, equally uncertain about the two options (a quick sanity check of this is sketched right after the list).
  • For the disc_loss, a value below 0.69 means the discriminator is doing better than random on the combined set of real + generated images.
  • For the gen_gan_loss, a value below 0.69 means the generator is doing better than random at fooling the discriminator.
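Here is a quick sanity check of that reference point (made-up tensors, assuming the loss is built with from_logits=True as in the tutorial): with all-zero logits the discriminator assigns probability 0.5 to everything, and the loss comes out at ln(2) ≈ 0.693 regardless of the labels.

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

labels = tf.constant([1., 0., 1., 0.])
chance_logits = tf.zeros_like(labels)      # sigmoid(0) == 0.5 for every sample

print(bce(labels, chance_logits).numpy())  # ~0.6931 == ln(2)
```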

Otherwise (with from_logits=False), the losses are roughly twice as high for both the generator and the discriminator, and a similar interpretation no longer seems to hold.

Final images are also different:

  • In the case of from_logits=False, the image looks blurry and unrealistic.