In statistics, multinomial logistic regression is a classification method that generalizes logistic regression to multiclass problems, i.e. with more than two possible discrete outcomes. (Wikipedia)
Wikipedia
Multinomial logistic regression - Wikipedia
March 3, 2025 - The article on logistic regression ... of simple logistic regression, and many of these have analogues in the multinomial logit model. The idea behind all of them, as in many other statistical classification techniques, is to construct a linear predictor function that constructs ...
Top answer (1 of 3, score 14)

In my opinion, the loss function is the objective function that we want our neural network to optimize its weights according to; it is therefore task-specific and also somewhat empirical. Just to be clear, Multinomial Logistic Loss and Cross Entropy Loss are the same (please look at http://ufldl.stanford.edu/tutorial/supervised/SoftmaxRegression/). The cost function of the Multinomial Logistic Loss is

$$J(\theta) = -\frac{1}{m}\left[\sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - h_\theta(x^{(i)})\bigr)\right]$$

It is usually used for classification problems. The Squared Error has the form $\frac 1 {2N} \sum_{i=1}^N \| x^1_i - x^2_i \|_2^2.$

Therefore, it is usually used for minimizing reconstruction errors.

EDIT: @MartinThoma The above formula of the multinomial logistic loss is just for the binary case; for the general case it should be $J(\theta) = -\frac{1}{m}\sum_{i=1}^m \sum_{k=1}^K 1\{y^{(i)} = k\} \log P\bigl(y^{(i)} = k \mid x^{(i)};\theta\bigr)$, where $K$ is the number of categories.
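As a sanity check, here is a small numpy sketch (my own illustration, not part of the original answer) of the general multinomial cross-entropy and its binary special case; the function name and the example probabilities are made up:

```python
import numpy as np

def cross_entropy(probs, labels):
    """Multinomial cross-entropy: mean of -log p(true class).

    probs:  (m, K) array of predicted class probabilities
    labels: (m,) array of integer class indices in [0, K)
    """
    m = probs.shape[0]
    return -np.mean(np.log(probs[np.arange(m), labels]))

# With K = 2 this reduces to the binary formula above.
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
labels = np.array([0, 1])

loss = cross_entropy(probs, labels)
# Binary form: y log h + (1 - y) log(1 - h), with h = p(class 1)
binary = -np.mean(labels * np.log(probs[:, 1])
                  + (1 - labels) * np.log(probs[:, 0]))
```

Both expressions give the same value, which is the point of the EDIT above: the binary formula is the $K = 2$ special case.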

2 of 3 (score 4)

SHORT ANSWER According to the other answers, Multinomial Logistic Loss and Cross Entropy Loss are the same.

Cross Entropy Loss is an alternative cost function for NNs with sigmoid activation functions, introduced artificially to eliminate the dependency on $\sigma'(z)$ in the update equations; sometimes this term slows down the learning process. An alternative method is a regularised cost function.

In this type of network one might want probabilities as output, but this does not happen with sigmoids in a multinomial network. The softmax function normalizes the outputs and forces them into the range $(0, 1)$. This can be useful, for example, in MNIST classification.

LONG ANSWER WITH SOME INSIGHTS

The answer is quite long but I'll try to summarise.

The first modern artificial neurons to be used were the sigmoids, whose activation function is:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

which has the familiar S shape (figure omitted). The curve is nice because it guarantees the output is in the range $(0, 1)$.
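A minimal numpy sketch of the sigmoid (my own illustration) confirming the output range:

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
a = sigmoid(z)   # every value lies strictly between 0 and 1; sigmoid(0) = 0.5
```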

Regarding the choice of a cost function, a natural choice is the quadratic cost function, whose derivative is guaranteed to exist and we know it has a minimum.

Now consider a NN with sigmoids trained with the quadratic cost function, with $L$ layers.

We define the cost function as the sum of the squared errors in the output layer for a set of inputs $x$:

$$C = \frac{1}{2n}\sum_x \sum_j \bigl(y_j - a_j^L\bigr)^2$$

where $a_j^L$ is the $j$-th neuron in the output layer $L$, $y_j$ the desired output, and $n$ is the number of training examples.

For simplicity let's consider the error for a single input:

$$C = \frac{1}{2}\sum_j \bigl(y_j - a_j^L\bigr)^2$$

Now the activation output $a_j^\ell$ of the $j$-th neuron in the $\ell$-th layer is:

$$a_j^\ell = \sigma\left(\sum_k w_{jk}^\ell\, a_k^{\ell-1} + b_j^\ell\right)$$
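The activation of a layer can be sketched in numpy as follows (shapes and values are illustrative, not taken from the answer):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # weights from a 4-neuron layer to a 3-neuron layer
b = rng.normal(size=3)        # biases of the 3-neuron layer
a_prev = rng.uniform(size=4)  # activations of the previous layer

z = W @ a_prev + b            # weighted input z^l
a = sigmoid(z)                # activation a^l = sigma(z^l), each entry in (0, 1)
```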

Most of the time (if not always) NNs are trained with one of the gradient descent techniques, which basically consists of updating the weights and biases by small steps in the direction of minimization. The goal is to apply a small change to the weights and biases in the direction that minimizes the cost function.

For small steps the following holds:

$$\Delta C \approx \sum_i \frac{\partial C}{\partial v_i}\,\Delta v_i$$

Our $v_i$ are the weights and biases. Since it is a cost function, we want to minimise it, i.e. find a proper value of $\Delta v_i$. Suppose we choose $\Delta v_i = -\eta\,\frac{\partial C}{\partial v_i}$ with small $\eta > 0$; then:

$$\Delta C \approx -\eta \sum_i \left(\frac{\partial C}{\partial v_i}\right)^2 \le 0,$$

which means the change in the parameters decreased the cost function by $\eta \sum_i (\partial C / \partial v_i)^2$.
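The approximation can be checked numerically with a toy cost function (my own example; the quadratic $C(v) = \sum_i v_i^2$ and the learning rate are assumptions, not from the answer):

```python
import numpy as np

# Toy cost C(v) = sum(v**2) with gradient dC/dv = 2v.
def cost(v):
    return np.sum(v ** 2)

v = np.array([1.0, -2.0, 0.5])
grad = 2 * v
eta = 0.01                             # small learning rate

delta_v = -eta * grad                  # the chosen update: Delta v = -eta * grad C
delta_C = cost(v + delta_v) - cost(v)  # actual change in the cost
predicted = -eta * np.sum(grad ** 2)   # first-order prediction of the change
```

The actual change is negative and agrees with the first-order prediction up to a term of order $\eta^2$.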

Consider the $j$-th output neuron, which contributes $\frac{1}{2}\bigl(y_j - a_j^L\bigr)^2$ to the cost.

Suppose we want to update the weight $w_{jk}^L$, which is the weight from the $k$-th neuron in the $(L-1)$-th layer to the $j$-th neuron in the $L$-th layer. Then the update rules are:

$$w_{jk}^L \to w_{jk}^L - \eta\,\frac{\partial C}{\partial w_{jk}^L}, \qquad b_j^L \to b_j^L - \eta\,\frac{\partial C}{\partial b_j^L}$$

Taking the derivatives using the chain rule:

$$\frac{\partial C}{\partial w_{jk}^L} = \bigl(a_j^L - y_j\bigr)\,\sigma'\bigl(z_j^L\bigr)\,a_k^{L-1}, \qquad \frac{\partial C}{\partial b_j^L} = \bigl(a_j^L - y_j\bigr)\,\sigma'\bigl(z_j^L\bigr)$$

You see the dependency on the derivative of the sigmoid, $\sigma'(z_j^L)$, in both expressions.

Now the derivative of a generic single-variable sigmoid is:

$$\sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr)$$

Now consider a single output neuron and suppose that neuron should output $1$ but it is outputting a value close to $0$: you can see from the graph that the sigmoid is flat near outputs of $0$ and $1$, i.e. its derivative is close to $0$, i.e. updates of the parameters are very slow (since the update equations depend on $\sigma'(z)$).
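A quick numeric illustration (my own sketch) of how flat the sigmoid is in its saturated region:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # sigma'(z) = sigma(z) * (1 - sigma(z))

# Near z = 0 the sigmoid is steep; for large |z| it saturates.
steep = sigmoid_prime(0.0)    # 0.25, the maximum of sigma'
flat = sigmoid_prime(-8.0)    # ~3e-4: a neuron wrongly stuck near output 0 learns very slowly
```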

Motivation of the cross-entropy function

To see how cross-entropy was originally derived, suppose one has just found out that the term $\sigma'(z)$ is slowing down the learning process. We might wonder if it is possible to choose a cost function that makes this term disappear. Basically one might want:

\begin{equation} \begin{aligned} \frac{\partial C}{\partial w} & = x \left( a - y\right)\\ \frac{\partial C}{\partial b} & = \left( a - y\right) \end{aligned} \end{equation} From the chain rule we have: \begin{equation} \frac{\partial C}{\partial b} =\frac{\partial C}{\partial a} \frac{\partial a}{\partial b } =\frac{\partial C}{\partial a}\sigma'(z) = \frac{\partial C}{\partial a}\, a(1-a) \end{equation} Comparing the desired equation with the one from the chain rule, one gets \begin{equation} \frac{\partial C}{\partial a} = \frac{a-y}{a(1-a)} \end{equation} Integrating with the cover-up method (partial fractions): \begin{equation} C = -\left[ y\ln a + (1-y)\ln(1-a)\right]+const \end{equation} To get the full cost function, we must average over all the training samples: \begin{equation} C = -\frac{1}{n}\sum_x\left[y\ln a + (1-y)\ln(1-a)\right]+const \end{equation} where the constant here is the average of the individual constants for each training example.

There is a standard way of interpreting the cross-entropy that comes from the field of information theory. Roughly speaking, the idea is that the cross-entropy is a measure of surprise. We get low surprise if the output is what we expect ($a \approx y$), and high surprise if the output is unexpected.
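The cancellation of $\sigma'(z)$ can be verified numerically; a minimal sketch assuming a single sigmoid output neuron (my own example, not from the answer) comparing a finite-difference gradient with $a - y$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ce(z, y):
    """Cross-entropy cost of a single sigmoid output a = sigma(z)."""
    a = sigmoid(z)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z, y = 2.0, 1.0
eps = 1e-6
numeric = (ce(z + eps, y) - ce(z - eps, y)) / (2 * eps)   # finite-difference dC/dz
analytic = sigmoid(z) - y                                  # a - y: no sigma'(z) factor
```

The two values agree, confirming that the $\sigma'(z)$ factor cancels and learning is not slowed down by saturation.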

Softmax

For a binary classification cross-entropy resembles the definition in information theory and the values can still be interpreted as probabilities.

With multinomial classification this does not hold true anymore: the outputs do not sum up to $1$.

If you want them to sum up to $1$, you use the softmax function, which normalizes the outputs so that their sum is $1$.

Also, if the output layer is made up of softmax functions, the slowing-down term $\sigma'(z)$ is not present. If you use the log-likelihood cost function with a softmax output layer, you will obtain partial derivatives, and in turn update equations, of a form similar to those found for the cross-entropy function with sigmoid neurons.
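This can be checked numerically as well; a short sketch (my own illustration) showing that the softmax outputs sum to $1$ and that the gradient of the negative log-likelihood with respect to $z$ is $p - y$, with no $\sigma'$-like factor:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # shift by max(z) for numerical stability
    return e / e.sum()

def nll(z, k):
    """Negative log-likelihood of true class k under softmax(z)."""
    return -np.log(softmax(z)[k])

z = np.array([1.0, 2.0, 0.5])
k = 1                            # index of the true class
p = softmax(z)                   # a probability distribution: sums to 1

y = np.zeros_like(z); y[k] = 1.0 # one-hot label
analytic = p - y                 # claimed gradient of nll w.r.t. z
eps = 1e-6
numeric = np.array([(nll(z + eps * np.eye(3)[i], k)
                     - nll(z - eps * np.eye(3)[i], k)) / (2 * eps)
                    for i in range(3)])  # finite-difference gradient
```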


Medium
Multinomial Logistic Regression In a Nutshell | by Wilson Xie | Data Science Student Society @ UC San Diego | Medium
December 11, 2020 - Cross-entropy loss function, which maximizes the probability of the scoring vectors to the one-hot encoded Y (response) vectors. Stochastic gradient descent, which is just a gradient descent from a sample features. MLR shares steps with binary logistic regression, and the only difference is the function for each step.
scikit-learn
log_loss — scikit-learn 1.8.0 documentation
This is the loss function used in (multinomial) logistic regression and extensions of it such as neural networks, defined as the negative log-likelihood of a logistic model that returns y_pred probabilities for its training data y_true.
Christopher5106
About loss functions, regularization and joint losses : multinomial logistic, cross entropy, square errors, euclidian, hinge, Crammer and Singer, one versus all, squared hinge, absolute value, infogain, L1 / L2 - Frobenius / L2,1 norms, connectionist temporal classification loss
If \(\{ p_i \}\) is the probability of each class, then it is a multinomial distribution and \[\sum_i p_i = 1\] The equivalent to the sigmoid function in multi-dimensional space is the softmax function or logistic function or normalized exponential function to produce such a distribution from any input vector z : \[z \rightarrow \left\{ \frac{\exp z_i }{ \sum_k \exp z_k } \right\}_i\]
Chris Yeh
Binary vs. Multi-Class Logistic Regression | Chris Yeh
June 11, 2018 - Even so, the loss function is still convex (though clearly not strictly convex) so gradient descent will still find a global minimum (source). For each example \(x\), we could always choose \(v=−W_C\) and \(d=−b_C\) such that the last class has score 0. In other words, we could actually just have weights and biases for just the first \(C−1\) classes only. To show that multinomial logistic regression is a generalization of binary logistic regression, we will consider the case where there are 2 classes.
Quark Machine Learning
Multinomial Logistic Regression: Definition, Math, and Implementation - Quark Machine Learning
February 3, 2023 - To get the optimal weights to reduce the cost, we can find the distance from the predicted values ŷ to the actual values y. This distance is known as the loss function more specifically the cross-entropy loss function of Binary Logistic Regression.
Postulate
loss functions: multiclass SVM and multinomial logistic regression | Postulate
May 1, 2024 - Loss functions quantify how good a W value is. It tells us how good our current classifier is and represents quantitatively our unhappiness with predictions on the training set. If it has a low value, that means our classifier is pretty good.
MachineLearningMastery
Multinomial Logistic Regression With Python - MachineLearningMastery.com
December 31, 2020 - Instead, the multinomial logistic regression algorithm is an extension to the logistic regression model that involves changing the loss function to cross-entropy loss and predict probability distribution to a multinomial probability distribution ...
ScienceDirect
Multinomial Logistic Regression - an overview | ScienceDirect Topics
When this transformation is used, however, the logistic regression and its coefficients take on a somewhat different meaning from those found in regression with a metric dependent variable. The interpretation of the estimated regression coefficients is not as easy as in multiple regression. In multinomial logistic regression, not only is the relationship between x and y nonlinear, but also, if the dependent variable has more than two unique values, there are several regression equations.
Stanford University
Logistic Regression
In such cases we use multinomial logistic regression, also called softmax regression. ... Like the sigmoid, it is an exponential function.
Sebastian Raschka
What exactly is the "softmax and the multinomial logistic loss" in the context of machine learning? | Sebastian Raschka, PhD
January 17, 2026 - The softmax function is simply a generalization of the logistic function that allows us to compute meaningful class-probabilities in multi-class settings (multinomial logistic regression).
Dataaspirant
How Multinomial Logistic Regression Model Works In Machine Learning
September 14, 2017 - This Parameters optimization is an iteration process where the calculated weights for each observation used to calculate the cost function which is also known as the Loss function.
Wikipedia
Logistic regression - Wikipedia
3 weeks ago - Since the value of the logistic function is always strictly between zero and one, the log loss is always greater than zero and less than infinity. Unlike in a linear regression, where the model can have zero loss at a point by passing through ...
Bookdown
Chapter 11 Multinomial Logistic Regression | Companion to BER 642: Advanced Regression Methods
We chose the multinom function because it does not require the data to be reshaped (as the mlogit package does) and to mirror the example code found in Hilbe’s Logistic Regression Models. First, we need to choose the level of our outcome that we wish to use as our baseline and specify this in the relevel function. Then, we run our model using multinom. The multinom package does not include p-value calculation for the regression coefficients, so we calculate p-values using Wald tests (here z-tests).
Peterroelants
Logistic classification with cross-entropy (1/2) | Peter’s Notes
June 10, 2015 - Description of the logistic function used to model binary classification problems. Contains derivations of the gradients used for optimizing any parameters with regards to the cross-entropy loss function.
Blogger
Learning Machine Learning: Multinomial Logistic Classification
January 28, 2017 - We use a softmax function to turn the scores the model outputs into probabilities. We then use cross entropy function as our loss function compare those probabilities to the one-hot encoded labels.
OARC Stats
Multinomial Logistic Regression | R Data Analysis Examples
Collapsing number of categories to two and then doing a logistic regression: This approach suffers from loss of information and changes the original research questions to very different ones. Ordinal logistic regression: If the outcome variable is truly ordered and if it also satisfies the assumption of proportional odds, then switching to ordinal logistic regression will make the model more parsimonious. Alternative-specific multinomial probit regression, which allows different error structures therefore allows to relax the IIA assumption.