The relationship is as follows: $l(\beta) = \sum_i L(z_i)$.
Define the logistic function as $f(z) = \frac{e^{z}}{1 + e^{z}} = \frac{1}{1+e^{-z}}$. It has the property that $f(-z) = 1-f(z)$, or in other words:
$$ \frac{1}{1+e^{z}} = \frac{e^{-z}}{1+e^{-z}}. $$
If you take the reciprocal of both sides, then take the log, you get:
$$ \ln(1+e^{z}) = \ln(1+e^{-z}) + z. $$
Subtract $z$ from both sides and you should see this:
$$ -y_i\beta^Tx_i+\ln(1+e^{y_i\beta^Tx_i}) = L(z_i). $$
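A quick numerical sanity check of the two identities used above (the symmetry $f(-z)=1-f(z)$ and the log identity), in plain Python with an arbitrary test point:

```python
import math

def f(z):
    """Logistic function f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

z = 1.7  # arbitrary test point

# Symmetry property: f(-z) = 1 - f(z)
assert abs(f(-z) - (1 - f(z))) < 1e-12

# Log identity: ln(1 + e^z) = ln(1 + e^{-z}) + z
assert abs(math.log(1 + math.exp(z)) - (math.log(1 + math.exp(-z)) + z)) < 1e-12
```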
Edit:
At the moment I am re-reading this answer and am confused about how I got $-y_i\beta^Tx_i+\ln(1+e^{\beta^Tx_i})$ to be equal to $-y_i\beta^Tx_i+\ln(1+e^{y_i\beta^Tx_i})$. Perhaps there's a typo in the original question.
Edit 2:
In the case that there wasn't a typo in the original question, @ManelMorales appears to be correct to draw attention to the fact that, when $y \in \{-1,1\}$, the probability mass function can be written as $P(Y_i=y_i) = f(y_i\beta^Tx_i)$, due to the property that $f(-z) = 1 - f(z)$. I am writing it differently here because his answer overloads the notation $z_i$. The rest follows by taking the negative log-likelihood for each $y$ coding. See his answer below for more details.
The OP mistakenly believes that the relationship between these two functions is due to the number of samples (i.e. a single sample vs. all of them). However, the actual difference is simply how we choose our training labels.
In the case of binary classification we may assign the labels $y=\pm1$ or $y=0,1$.
As has already been stated, the logistic function $\sigma(z)$ is a good choice since it has the form of a probability, i.e. $\sigma(-z)=1-\sigma(z)$ and $\sigma(z)\in (0,1)$, with $\sigma(z)\rightarrow 0,1$ as $z\rightarrow \mp \infty$. If we pick the labels $y=0,1$ we may assign
\begin{equation} \begin{aligned} \mathbb{P}(y=1|z) & =\sigma(z)=\frac{1}{1+e^{-z}}\\ \mathbb{P}(y=0|z) & =1-\sigma(z)=\frac{1}{1+e^{z}}\\ \end{aligned} \end{equation}
which can be written more compactly as $\mathbb{P}(y|z) =\sigma(z)^y(1-\sigma(z))^{1-y}$.
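The compact Bernoulli form can be verified directly: raising to the powers $y$ and $1-y$ simply selects one of the two cases. A minimal check in plain Python:

```python
import math

def sigma(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def p(y, z):
    """Compact form P(y|z) = sigma(z)^y * (1 - sigma(z))^(1 - y) for y in {0, 1}."""
    return sigma(z)**y * (1 - sigma(z))**(1 - y)

z = 0.8  # arbitrary test point
assert abs(p(1, z) - sigma(z)) < 1e-12        # y = 1 recovers P(y=1|z)
assert abs(p(0, z) - (1 - sigma(z))) < 1e-12  # y = 0 recovers P(y=0|z)
```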
It is easier to maximize the log-likelihood, and maximizing the log-likelihood is the same as minimizing the negative log-likelihood. For $m$ samples $\{x_i,y_i\}$, after taking the natural logarithm and some simplification, we obtain:
\begin{equation} \begin{aligned} l(z)=-\log\big(\prod_i^m\mathbb{P}(y_i|z_i)\big)=-\sum_i^m\log\big(\mathbb{P}(y_i|z_i)\big)=\sum_i^m-y_iz_i+\log(1+e^{z_i}) \end{aligned} \end{equation}
The full derivation and additional information can be found in this Jupyter notebook. On the other hand, we may have instead used the labels $y=\pm 1$. It is then fairly clear that we can assign
\begin{equation} \mathbb{P}(y|z)=\sigma(yz). \end{equation}
Note that $\mathbb{P}(y=0|z)=\mathbb{P}(y=-1|z)=\sigma(-z)$, so the two codings agree. Following the same steps as before, we minimize in this case the loss function
\begin{equation} \begin{aligned} L(z)=-\log\big(\prod_j^m\mathbb{P}(y_j|z_j)\big)=-\sum_j^m\log\big(\mathbb{P}(y_j|z_j)\big)=\sum_j^m\log(1+e^{-y_jz_j}) \end{aligned} \end{equation}
The last step follows because the negative sign turns the log of $\sigma(y_jz_j)$ into the log of its reciprocal: $-\log\sigma(y_jz_j)=\log(1+e^{-y_jz_j})$. While we should not equate these two forms, since in each form $y$ takes different values, the two are nevertheless equivalent.
While there may be fundamental reasons as to why we have two different forms (see Why there are two different logistic loss formulation / notations?), one reason to choose the former is for practical considerations. In the former we can use the property $\partial \sigma(z) / \partial z=\sigma(z)(1-\sigma(z))$ to trivially calculate $\nabla l(z)$ and $\nabla^2l(z)$, both of which are needed for convergence analysis (i.e. to determine the convexity of the loss function by calculating the Hessian).
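The derivative identity $\partial \sigma(z)/\partial z = \sigma(z)(1-\sigma(z))$ can itself be checked against a finite-difference approximation, which is a handy sanity test when implementing the gradient by hand:

```python
import math

def sigma(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def dsigma(z):
    """Analytic derivative from the identity sigma'(z) = sigma(z) * (1 - sigma(z))."""
    return sigma(z) * (1 - sigma(z))

# Central finite difference as an independent check at an arbitrary point.
z, h = 0.3, 1e-6
numeric = (sigma(z + h) - sigma(z - h)) / (2 * h)
assert abs(numeric - dsigma(z)) < 1e-8
```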
If your model predicted a probability of 0.6 for a positive instance, it seems like it would make sense for the loss to be |(1 - 0.6)| = 0.4. And if it predicted a probability of 0.2 for a negative instance, it seems like it would make sense for it to be |(0 - 0.2)| = 0.2.
Instead, we ignore absolute differences and simply take the negative log of the predicted probability (or of one minus it) as the loss. I understand that this makes "confidently wrong" predictions VERY heavily penalized: if you predict a probability of 0.0 for a positive instance, the loss tends toward infinity (that is, -log(0.0)). But the method I described above also penalizes predictions that are further from the actual label more heavily -- it's just that the maximum penalty would be one rather than infinity (e.g., if you predicted 0.0 for a positive instance, the loss would be |(1 - 0.0)| = 1).
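To make the comparison concrete, here is a small tabulation of both penalties for a positive instance as the predicted probability falls (plain Python; the specific probabilities are just illustrative). The absolute loss stays bounded by 1, while the log loss grows without bound:

```python
import math

# Penalty for a positive instance (y = 1) as the predicted probability p falls.
for p in [0.9, 0.6, 0.3, 0.1, 0.01]:
    abs_loss = abs(1 - p)    # bounded above by 1
    log_loss = -math.log(p)  # grows without bound as p -> 0
    print(f"p={p:5.2f}  |1-p|={abs_loss:.2f}  -log(p)={log_loss:.2f}")
```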
Does this make sense? It's more theoretical but it'd be great to know the underpinnings of this function.
Thanks!