🌐
Google
developers.google.com › machine learning › logistic regression: loss and regularization
Logistic regression: Loss and regularization | Machine Learning | Google for Developers
Instead, the loss function for logistic regression is Log Loss. The Log Loss equation returns the logarithm of the magnitude of the change, rather than just the distance from data to prediction.
Concept in machine learning
Loss functions for classification - Wikipedia
In machine learning and mathematical optimization, loss functions for classification are computationally feasible loss functions representing the price paid for inaccuracy of predictions in classification problems (problems of identifying which category a … Wikipedia
🌐
Wikipedia
en.wikipedia.org › wiki › Loss_functions_for_classification
Loss functions for classification - Wikipedia
January 12, 2026 - It's easy to check that the logistic loss and binary cross-entropy loss (Log loss) are in fact the same (up to a multiplicative constant $\frac{1}{\log(2)}$). The cross-entropy loss is closely related to the Kullback–Leibler divergence between the empirical distribution and the predicted distribution. The cross-entropy loss is ubiquitous in modern deep neural networks. The exponential loss function can be generated using (2) and Table-I as follows.
🌐
Learningds
learningds.org › ch › 19 › class_loss.html
19.4. A Loss Function for the Logistic Model — Learning Data Science
The logistic model gives us probabilities (or empirical proportions), so we write our loss function as \(\ell(p, y) \), where \(p\) is between 0 and 1. The response takes on one of two values because our outcome feature is a binary classification.
🌐
scikit-learn
scikit-learn.org › stable › modules › generated › sklearn.metrics.log_loss.html
log_loss — scikit-learn 1.8.0 documentation
This is the loss function used in (multinomial) logistic regression and extensions of it such as neural networks, defined as the negative log-likelihood of a logistic model that returns y_pred probabilities for its training data y_true.
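To make this definition concrete, here is a minimal pure-Python sketch of binary log loss. The clipping constant `eps` is an illustrative assumption (scikit-learn clips predicted probabilities in a similar spirit to avoid `log(0)`), not a claim about the library's exact internals:

```python
import math

def binary_log_loss(y_true, y_pred, eps=1e-15):
    """Average negative log-likelihood of a Bernoulli model.

    Probabilities are clipped to [eps, 1 - eps] to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# A confident correct prediction costs little; a confident wrong one costs a lot.
print(binary_log_loss([1, 0], [0.9, 0.1]))  # ~0.105
print(binary_log_loss([1, 0], [0.1, 0.9]))  # ~2.303
```

Note the asymmetry with squared error: halving `p` on a true positive adds a constant `log(2)` to the loss, no matter how small `p` already is.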
🌐
MLU-Explain
mlu-explain.github.io › logistic-regression
Logistic Regression
A visual, interactive explanation of logistic regression for machine learning.
🌐
Medium
medium.com › analytics-vidhya › understanding-the-loss-function-of-logistic-regression-ac1eec2838ce
Understanding the log loss function | by Susmith Reddy | Analytics Vidhya | Medium
February 2, 2024 - The loss function used by the linear regression algorithm is Mean Squared Error. ... What MSE does is add up the square of the distance between the actual and the predicted output value for every input sample (and divide it by the no. of ...
🌐
GeeksforGeeks
geeksforgeeks.org › machine learning › ml-cost-function-in-logistic-regression
Cost function in Logistic Regression in Machine Learning - GeeksforGeeks
This code demonstrates how Logistic Regression computes predicted probabilities using the sigmoid function and evaluates model performance using the log loss (binary cross-entropy) cost function.
Published January 19, 2026
🌐
Tomasbeuzen
tomasbeuzen.com › deep-learning-with-pytorch › chapters › appendixB_logistic-loss.html
Appendix B: Logistic Loss — Deep Learning with PyTorch
\[f(w)=-\frac{1}{n}\sum_{i=1}^n\left[y_i\log\left(\frac{1}{1 + \exp(-w^Tx_i)}\right) + (1 - y_i)\log\left(1 - \frac{1}{1 + \exp(-w^Tx_i)}\right)\right]\] This function is called the "log loss" or "binary cross entropy".
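This $f(w)$, the loss as a function of the weights rather than of already-computed probabilities, can be sketched in plain Python. The toy data and weight vectors below are made up purely for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_loss(w, X, y):
    """f(w): average binary cross-entropy of sigmoid(w^T x_i)
    against labels y_i in {0, 1}."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x_i)))
        total += y_i * math.log(p) + (1 - y_i) * math.log(1 - p)
    return -total / len(X)

# Toy data: one feature plus a bias column (w = [bias, slope]).
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]
print(logistic_loss([0.0, 1.0], X, y))   # separating weights -> small loss (~0.22)
print(logistic_loss([0.0, -1.0], X, y))  # flipped weights -> large loss (~1.72)
```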
🌐
Dasha
dasha.ai › blog › log-loss-function
Dasha | Log Loss Function Explained by Experts
February 15, 2021 - sigmoid from a linear combination) is adjusted to this error function by the SGD method. ... As we can see, adjusting the weights works exactly as it does for linear regression! In fact, this points to the relationship between the two regressions, linear and logistic, or rather between the underlying distributions, normal and Bernoulli. In many books another expression goes by the name of log loss function (that is, precisely "logistic loss"), which we can get by substituting the expression for the sigmoid into it and relabeling: if we assume the class labels are now -1 and +1, we get the following:
🌐
arXiv
arxiv.org › abs › 1805.03804
[1805.03804] On the Universality of the Logistic Loss Function
May 10, 2018 - Abstract:A loss function measures the discrepancy between the true values (observations) and their estimated fits, for a given instance of data. A loss function is said to be proper (unbiased, Fisher consistent) if the fits are defined over ...
Top answer
1 of 6
42

The relationship is as follows: $l(\beta) = \sum_i L(z_i)$.

Define the logistic function as $f(z) = \frac{e^{z}}{1 + e^{z}} = \frac{1}{1+e^{-z}}$. It has the property that $f(-z) = 1-f(z)$; in other words:

$$ \frac{1}{1+e^{z}} = \frac{e^{-z}}{1+e^{-z}}. $$

If you take the reciprocal of both sides and then take the log, you get:

$$ \ln(1+e^{z}) = \ln(1+e^{-z}) + z. $$

Subtract $z$ from both sides and you should see this:

$$ -y_i\beta^Tx_i+\ln(1+e^{y_i\beta^Tx_i}) = L(z_i). $$

Edit:

At the moment I am re-reading this answer and am confused about how I got $-y_i\beta^Tx_i+\ln(1+e^{\beta^Tx_i})$ to be equal to $-y_i\beta^Tx_i+\ln(1+e^{y_i\beta^Tx_i})$. Perhaps there's a typo in the original question.

Edit 2:

In the case that there wasn't a typo in the original question, @ManelMorales appears to be correct to draw attention to the fact that, when $y \in \{-1,1\}$, the probability mass function can be written as $P(Y_i=y_i) = f(y_i\beta^Tx_i)$, due to the property that $f(-z) = 1 - f(z)$. I am re-writing it differently here, because he introduces a new equivocation on the notation $z_i$. The rest follows by taking the negative log-likelihood for each $y$ coding. See his answer below for more details.
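The two identities this answer leans on, the symmetry $f(-z)=1-f(z)$ and the log identity $\ln(1+e^{z}) = \ln(1+e^{-z}) + z$, are easy to verify numerically; this is just a sanity-check sketch:

```python
import math
import random

def f(z):
    """The logistic function f(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)
for _ in range(1000):
    z = random.uniform(-20.0, 20.0)
    # symmetry used in the answer: f(-z) = 1 - f(z)
    assert abs(f(-z) - (1 - f(z))) < 1e-12
    # log identity: ln(1 + e^z) = ln(1 + e^{-z}) + z
    assert abs(math.log1p(math.exp(z)) - (math.log1p(math.exp(-z)) + z)) < 1e-9
print("identities hold")
```

`math.log1p` is used instead of `math.log(1 + ...)` to keep the check accurate when the exponential is tiny.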

2 of 6
76

OP mistakenly believes the relationship between these two functions is due to the number of samples (i.e. single vs all). However, the actual difference is simply how we select our training labels.

In the case of binary classification we may assign the labels $y=\pm1$ or $y=0,1$.

As has already been stated, the logistic function $\sigma(z)$ is a good choice since it has the form of a probability, i.e. $\sigma(-z)=1-\sigma(z)$ and $\sigma(z)\in (0,1)$, with $\sigma(z)\rightarrow 1$ as $z\rightarrow\infty$ and $\sigma(z)\rightarrow 0$ as $z\rightarrow-\infty$. If we pick the labels $y=0,1$ we may assign

\begin{equation} \begin{aligned} \mathbb{P}(y=1|z) & =\sigma(z)=\frac{1}{1+e^{-z}}\\ \mathbb{P}(y=0|z) & =1-\sigma(z)=\frac{1}{1+e^{z}}\\ \end{aligned} \end{equation}

which can be written more compactly as $\mathbb{P}(y|z) =\sigma(z)^y(1-\sigma(z))^{1-y}$.

It is easier to work with the log-likelihood, and maximizing the log-likelihood is the same as minimizing the negative log-likelihood. For $m$ samples $\{x_i,y_i\}$, after taking the natural logarithm and some simplification, we obtain:

\begin{equation} \begin{aligned} l(z)=-\log\big(\prod_i^m\mathbb{P}(y_i|z_i)\big)=-\sum_i^m\log\big(\mathbb{P}(y_i|z_i)\big)=\sum_i^m-y_iz_i+\log(1+e^{z_i}) \end{aligned} \end{equation}

Full derivation and additional information can be found on this jupyter notebook. On the other hand, we may have instead used the labels $y=\pm 1$. It is pretty obvious then that we can assign

\begin{equation} \mathbb{P}(y|z)=\sigma(yz). \end{equation}

It is also obvious that $\mathbb{P}(y=0|z)=\mathbb{P}(y=-1|z)=\sigma(-z)$. Following the same steps as before we minimize in this case the loss function

\begin{equation} \begin{aligned} L(z)=-\log\big(\prod_j^m\mathbb{P}(y_j|z_j)\big)=-\sum_j^m\log\big(\mathbb{P}(y_j|z_j)\big)=\sum_j^m\log(1+e^{-y_jz_j}) \end{aligned} \end{equation}

where the last step follows from $-\log\sigma(y_jz_j)=\log\frac{1}{\sigma(y_jz_j)}=\log(1+e^{-y_jz_j})$: the negative sign turns the log of $\sigma$ into the log of its reciprocal. Although we should not equate the two forms symbol-for-symbol, since $y$ takes different values in each, the two losses are equivalent.

While there may be fundamental reasons as to why we have two different forms (see Why there are two different logistic loss formulation / notations?), one reason to choose the former is for practical considerations. In the former we can use the property $\partial \sigma(z) / \partial z=\sigma(z)(1-\sigma(z))$ to trivially calculate $\nabla l(z)$ and $\nabla^2l(z)$, both of which are needed for convergence analysis (i.e. to determine the convexity of the loss function by calculating the Hessian).
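A quick numerical check that the two label conventions really give the same loss (the scores `zs` and labels below are made-up toy values): mapping labels between $\{0,1\}$ and $\{-1,+1\}$, $l(z)$ and $L(z)$ coincide.

```python
import math

def loss_01(zs, ys01):
    """Loss for labels in {0, 1}: sum_i [-y_i z_i + log(1 + e^{z_i})]."""
    return sum(-y * z + math.log1p(math.exp(z)) for z, y in zip(zs, ys01))

def loss_pm1(zs, yspm):
    """Loss for labels in {-1, +1}: sum_i log(1 + e^{-y_i z_i})."""
    return sum(math.log1p(math.exp(-y * z)) for z, y in zip(zs, yspm))

zs = [-1.5, 0.3, 2.0, -0.7]          # toy scores z_i = beta^T x_i
ys01 = [0, 1, 1, 0]
yspm = [2 * y - 1 for y in ys01]     # map {0, 1} -> {-1, +1}
assert abs(loss_01(zs, ys01) - loss_pm1(zs, yspm)) < 1e-12
```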

🌐
Medium
koshurai.medium.com › understanding-log-loss-a-comprehensive-guide-with-code-examples-c79cf5411426
Understanding Log Loss: A Comprehensive Guide with Code Examples | by KoshurAI | Medium
January 9, 2024 - Log Loss is a logarithmic transformation of the likelihood function, primarily used to evaluate the performance of probabilistic classifiers.
🌐
Towards Data Science
towardsdatascience.com › home › latest › loss functions and their use in neural networks
Loss Function (Part II): Logistic Regression | by Shuyu Luo
January 21, 2025 - Regression Loss Functions – used in regression neural networks; given an input value, the model predicts a corresponding output value (rather than pre-selected labels); Ex. Mean Squared Error, Mean Absolute Error. Classification Loss Functions – used in classification neural networks; given an input, the neural network produces a vector of probabilities of the input belonging to various pre-set categories, from which one can then select the category with the highest probability of belonging; Ex.
🌐
University of Oxford
robots.ox.ac.uk › ~az › lectures › ml › 2011 › lect4.pdf pdf
Logistic Regression
which defines the loss function. Logistic Regression Learning: learning is formulated as the optimization problem $$\min_{w\in\mathbb{R}^d}\ \sum_{i=1}^{N}\log\left(1+e^{-y_if(x_i)}\right)+\lambda\|w\|^2$$ where the sum is the loss function and $\lambda\|w\|^2$ the regularization. For correctly classified points $-y_if(x_i)$ is negative, and ...
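Read literally, the lecture notes' regularized objective can be sketched as follows, with $f(x)=w^Tx$, labels in $\{-1,+1\}$, and hypothetical toy data chosen for illustration:

```python
import math

def regularized_logistic_loss(w, X, y, lam):
    """sum_i log(1 + exp(-y_i * f(x_i))) + lam * ||w||^2,
    with f(x) = w^T x and labels y_i in {-1, +1}."""
    data_term = sum(
        math.log1p(math.exp(-y_i * sum(wj * xj for wj, xj in zip(w, x_i))))
        for x_i, y_i in zip(X, y)
    )
    reg_term = lam * sum(wj * wj for wj in w)
    return data_term + reg_term

X = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]]  # bias column plus one feature
y = [1, -1, 1]
# At w = 0 every margin is 0, so each sample contributes log 2.
print(regularized_logistic_loss([0.0, 0.0], X, y, 0.1))  # 3*log(2) ~ 2.079
```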
🌐
Medium
medium.com › @sonal.mishra1297 › logistic-regression-explained-mathematically-from-loss-function-to-gradients-and-training-6d39a696c948
Logistic Regression Explained Mathematically — From Loss Function to Gradients and Training | by Sonal Mishra | Medium
September 27, 2025 - The loss function isn’t guessed — it falls out naturally from the Bernoulli distribution and the principle of maximum likelihood. The gradients aren’t messy — they collapse beautifully into prediction − truth.
🌐
Baeldung
baeldung.com › home › artificial intelligence › machine learning › differences between hinge loss and logistic loss
Differences Between Hinge Loss and Logistic Loss | Baeldung on Computer Science
February 28, 2025 - Consequently, this makes the function ... loss (also known as the cross-entropy loss and log loss) is another loss function that we can use for classification....
🌐
Reddit
reddit.com › r/learnmachinelearning › why do we use log-loss in logistic regression instead of just taking the absolute difference between expected probability and actual value for each instance?
r/learnmachinelearning on Reddit: Why do we use log-loss in logistic regression instead of just taking the absolute difference between expected probability and actual value for each instance?
April 26, 2023 -

If your model predicted a probability of 0.6 for a positive instance, it seems like it would make sense for the loss to be |(1 - 0.6)| = 0.4. And if it predicted a probability of 0.2 for a negative instance, it seems like it would make sense for it to be |(0 - 0.2)| = 0.2.

Instead, we ignore differences and simply take the (negative) log of the probability (or one minus probability) to derive a measure of loss. I understand that this makes "confidently wrong" predictions VERY penalized -- because if you predict a probability of 0.0 for a positive instance then your loss ranges towards infinity (that is, -log(0.0)), but using the method I described above, we similarly penalize more those predictions that are further away from actual -- it's just that the maximum penalization would be one, rather than infinity (e.g., if you predicted a 0.0 for a positive instance, the loss would be |(1 - 0.0)| = 1).

Does this make sense? It's more theoretical but it'd be great to know the underpinnings of this function.

Thanks!

Top answer
1 of 2
6
You can try it and see if it works 🤷‍♂️ Absolute loss is usually avoided because it produces a "V"-shaped loss surface, and sharp corners are bad in general for gradient-based optimization. Same reason we use MSE or RMSE instead of absolute error for regression tasks.
2 of 2
2
On "know the underpinnings of this function": "Log-loss" is also known as "cross-entropy", which is related to the Kullback–Leibler divergence. Sounds confusing, but the point is that the KL divergence and the cross-entropy measure the divergence/disparity/"distance" between two probability distributions. Essentially, you fit the model to make this divergence between the true distribution and the distribution generated by the model as small as possible.

As for why we use the KL divergence instead of absolute differences, I'm actually not sure. Both the KL divergence and the L1 metric you're describing are a kind of divergence, so in theory you could minimize any divergence, of which there are plenty. However, the results of minimizing different divergences will have different statistical properties. For example, it's known that estimators defined by minimizing the KL divergence (which is equivalent to maximizing the likelihood) are usually unbiased, consistent and so on, which are properties you want your estimates to have.

What can be said about estimates obtained by minimizing the L1 norm you described? I guess you could say something involving robustness, but are you sure your estimates will stay unbiased and consistent? You could look into minimum distance estimation; there's probably a bunch of theory explaining the properties of such estimators.
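The "confidently wrong" asymmetry the question raises is easy to tabulate; this small illustrative sketch compares the two candidate losses for a positive instance ($y=1$) as the predicted probability approaches 0:

```python
import math

# Absolute loss |1 - p| is bounded by 1; log loss -log(p) is unbounded.
for p in [0.5, 0.1, 0.01, 0.001]:
    abs_loss = abs(1 - p)
    log_loss = -math.log(p)
    print(f"p={p:<6} |1-p|={abs_loss:.3f}  -log(p)={log_loss:.3f}")
```

The table shows why log loss keeps pushing a near-zero prediction upward: its penalty (and hence its gradient) keeps growing, whereas the absolute loss saturates at 1.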
🌐
Ml-explained
ml-explained.com › blog › logistic-regression-explained
Logistic Regression - ML Explained
September 29, 2020 - For Logistic Regression we can't use the same loss function as for Linear Regression, because composing the squared error with the Logistic (Sigmoid) Function makes the objective non-convex, which creates many local optima.