In statistics, multinomial logistic regression is a classification method that generalizes logistic regression to multiclass problems, i.e. with more than two possible discrete outcomes. (Wikipedia)
Wikipedia
Multinomial logistic regression - Wikipedia
March 3, 2025 - The article on logistic regression ... of simple logistic regression, and many of these have analogues in the multinomial logit model. The idea behind all of them, as in many other statistical classification techniques, is to construct a linear predictor function that constructs ...
Top answer (1 of 3, score 14)

In my opinion, the loss function is the objective function that we want our neural network to optimize its weights according to; it is therefore task-specific and also somewhat empirical. Just to be clear, Multinomial Logistic Loss and Cross Entropy Loss are the same (please look at http://ufldl.stanford.edu/tutorial/supervised/SoftmaxRegression/). The cost function of the Multinomial Logistic Loss is

$$J(\theta) = -\frac{1}{m}\left[\sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - h_\theta(x^{(i)})\bigr)\right]$$

It is usually used for classification problems. The Squared Error has the form $\frac 1 {2N} \sum_{i=1}^N \| x^1_i - x^2_i \|_2^2.$

Therefore, it is usually used for minimizing reconstruction errors.

EDIT: @MartinThoma The above formula of the multinomial logistic loss is just for the binary case; for the general case it should be $J(\theta) = -\frac{1}{m}\sum_{i=1}^m \sum_{k=1}^K 1\{y^{(i)} = k\} \log P\bigl(y^{(i)} = k \mid x^{(i)};\theta\bigr)$, where $K$ is the number of categories.
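As a sanity check, here is a small numpy sketch (my own illustration, not part of the original answer) of the general multinomial cross-entropy and its binary special case; the function name and the example probabilities are made up:

```python
import numpy as np

def cross_entropy(probs, labels):
    """Multinomial cross-entropy: mean of -log p(true class).

    probs:  (m, K) array of predicted class probabilities
    labels: (m,) array of integer class indices in [0, K)
    """
    m = probs.shape[0]
    return -np.mean(np.log(probs[np.arange(m), labels]))

# With K = 2 this reduces to the binary formula above.
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
labels = np.array([0, 1])

loss = cross_entropy(probs, labels)
# Binary form: y log h + (1 - y) log(1 - h), with h = p(class 1)
binary = -np.mean(labels * np.log(probs[:, 1])
                  + (1 - labels) * np.log(probs[:, 0]))
```

Both expressions give the same value, which is the point of the EDIT above: the binary formula is the $K = 2$ special case.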

2 of 3 (score 4)

SHORT ANSWER According to the other answers, Multinomial Logistic Loss and Cross Entropy Loss are the same.

Cross Entropy Loss is an alternative cost function for NNs with sigmoid activation functions, introduced artificially to eliminate the dependency on $\sigma'(z)$ in the update equations; sometimes this term slows down the learning process. An alternative method is a regularised cost function.

In this type of network one might want probabilities as output, but this does not happen with sigmoids in a multinomial network. The softmax function normalizes the outputs and forces them into the range $(0, 1)$. This can be useful, for example, in MNIST classification.

LONG ANSWER WITH SOME INSIGHTS

The answer is quite long but I'll try to summarise.

The first modern artificial neurons to be used were the sigmoids, whose activation function is:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

which has the familiar S shape (figure omitted). The curve is nice because it guarantees the output is in the range $(0, 1)$.
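A minimal numpy sketch of the sigmoid (my own illustration) confirming the output range:

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
a = sigmoid(z)   # every value lies strictly between 0 and 1; sigmoid(0) = 0.5
```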

Regarding the choice of a cost function, a natural choice is the quadratic cost function, whose derivative is guaranteed to exist and we know it has a minimum.

Now consider a NN with sigmoids trained with the quadratic cost function, with $L$ layers.

We define the cost function as the sum of the squared errors in the output layer for a set of inputs $x$:

$$C = \frac{1}{2n}\sum_x \sum_j \bigl(y_j - a_j^L\bigr)^2$$

where $a_j^L$ is the $j$-th neuron in the output layer $L$, $y_j$ the desired output, and $n$ is the number of training examples.

For simplicity let's consider the error for a single input:

$$C = \frac{1}{2}\sum_j \bigl(y_j - a_j^L\bigr)^2$$

Now the activation output $a_j^\ell$ of the $j$-th neuron in the $\ell$-th layer is:

$$a_j^\ell = \sigma\left(\sum_k w_{jk}^\ell\, a_k^{\ell-1} + b_j^\ell\right)$$
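The activation of a layer can be sketched in numpy as follows (shapes and values are illustrative, not taken from the answer):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # weights from a 4-neuron layer to a 3-neuron layer
b = rng.normal(size=3)        # biases of the 3-neuron layer
a_prev = rng.uniform(size=4)  # activations of the previous layer

z = W @ a_prev + b            # weighted input z^l
a = sigmoid(z)                # activation a^l = sigma(z^l), each entry in (0, 1)
```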

Most of the time (if not always) NNs are trained with one of the gradient descent techniques, which basically consists of updating the weights and biases by small steps in the direction of minimization. The goal is to apply a small change to the weights and biases in the direction that minimizes the cost function.

For small steps the following holds:

$$\Delta C \approx \sum_i \frac{\partial C}{\partial v_i}\,\Delta v_i$$

Our $v_i$ are the weights and biases. Since it is a cost function, we want to minimise it, i.e. find a proper value of $\Delta v_i$. Suppose we choose $\Delta v_i = -\eta\,\frac{\partial C}{\partial v_i}$ with small $\eta > 0$; then:

$$\Delta C \approx -\eta \sum_i \left(\frac{\partial C}{\partial v_i}\right)^2 \le 0,$$

which means the change in the parameters decreased the cost function by $\eta \sum_i (\partial C / \partial v_i)^2$.
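The approximation can be checked numerically with a toy cost function (my own example; the quadratic $C(v) = \sum_i v_i^2$ and the learning rate are assumptions, not from the answer):

```python
import numpy as np

# Toy cost C(v) = sum(v**2) with gradient dC/dv = 2v.
def cost(v):
    return np.sum(v ** 2)

v = np.array([1.0, -2.0, 0.5])
grad = 2 * v
eta = 0.01                             # small learning rate

delta_v = -eta * grad                  # the chosen update: Delta v = -eta * grad C
delta_C = cost(v + delta_v) - cost(v)  # actual change in the cost
predicted = -eta * np.sum(grad ** 2)   # first-order prediction of the change
```

The actual change is negative and agrees with the first-order prediction up to a term of order $\eta^2$.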

Consider the $j$-th output neuron, which contributes $\frac{1}{2}\bigl(y_j - a_j^L\bigr)^2$ to the cost.

Suppose we want to update the weight $w_{jk}^L$, which is the weight from the $k$-th neuron in the $(L-1)$-th layer to the $j$-th neuron in the $L$-th layer. Then the update rules are:

$$w_{jk}^L \to w_{jk}^L - \eta\,\frac{\partial C}{\partial w_{jk}^L}, \qquad b_j^L \to b_j^L - \eta\,\frac{\partial C}{\partial b_j^L}$$

Taking the derivatives using the chain rule:

$$\frac{\partial C}{\partial w_{jk}^L} = \bigl(a_j^L - y_j\bigr)\,\sigma'\bigl(z_j^L\bigr)\,a_k^{L-1}, \qquad \frac{\partial C}{\partial b_j^L} = \bigl(a_j^L - y_j\bigr)\,\sigma'\bigl(z_j^L\bigr)$$

You see the dependency on the derivative of the sigmoid, $\sigma'(z_j^L)$, in both expressions.

Now the derivative of a generic single-variable sigmoid is:

$$\sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr)$$

Now consider a single output neuron and suppose that neuron should output $1$ but it is outputting a value close to $0$: you can see from the graph that the sigmoid is flat near outputs of $0$ and $1$, i.e. its derivative is close to $0$, i.e. updates of the parameters are very slow (since the update equations depend on $\sigma'(z)$).
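A quick numeric illustration (my own sketch) of how flat the sigmoid is in its saturated region:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # sigma'(z) = sigma(z) * (1 - sigma(z))

# Near z = 0 the sigmoid is steep; for large |z| it saturates.
steep = sigmoid_prime(0.0)    # 0.25, the maximum of sigma'
flat = sigmoid_prime(-8.0)    # ~3e-4: a neuron wrongly stuck near output 0 learns very slowly
```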

Motivation of the cross-entropy function

To see how cross-entropy was originally derived, suppose one has just found out that the term $\sigma'(z)$ is slowing down the learning process. We might wonder if it is possible to choose a cost function that makes this term disappear. Basically one might want:

\begin{equation} \begin{aligned} \frac{\partial C}{\partial w} & = x \left( a - y\right)\\ \frac{\partial C}{\partial b} & = \left( a - y\right) \end{aligned} \end{equation} From the chain rule we have: \begin{equation} \frac{\partial C}{\partial b} =\frac{\partial C}{\partial a} \frac{\partial a}{\partial b } =\frac{\partial C}{\partial a}\sigma'(z) = \frac{\partial C}{\partial a}\, a(1-a) \end{equation} Comparing the desired equation with the one from the chain rule, one gets \begin{equation} \frac{\partial C}{\partial a} = \frac{a-y}{a(1-a)} \end{equation} Integrating with the cover-up method (partial fractions): \begin{equation} C = -\left[ y\ln a + (1-y)\ln(1-a)\right]+const \end{equation} To get the full cost function, we must average over all the training samples: \begin{equation} C = -\frac{1}{n}\sum_x\left[y\ln a + (1-y)\ln(1-a)\right]+const \end{equation} where the constant here is the average of the individual constants for each training example.

There is a standard way of interpreting the cross-entropy that comes from the field of information theory. Roughly speaking, the idea is that the cross-entropy is a measure of surprise. We get low surprise if the output is what we expect ($a \approx y$), and high surprise if the output is unexpected.
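The cancellation of $\sigma'(z)$ can be verified numerically; a minimal sketch assuming a single sigmoid output neuron (my own example, not from the answer) comparing a finite-difference gradient with $a - y$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ce(z, y):
    """Cross-entropy cost of a single sigmoid output a = sigma(z)."""
    a = sigmoid(z)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z, y = 2.0, 1.0
eps = 1e-6
numeric = (ce(z + eps, y) - ce(z - eps, y)) / (2 * eps)   # finite-difference dC/dz
analytic = sigmoid(z) - y                                  # a - y: no sigma'(z) factor
```

The two values agree, confirming that the $\sigma'(z)$ factor cancels and learning is not slowed down by saturation.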

Softmax

For a binary classification cross-entropy resembles the definition in information theory and the values can still be interpreted as probabilities.

With multinomial classification this does not hold true anymore: the outputs do not sum up to $1$.

If you want them to sum up to $1$, you use the softmax function, which normalizes the outputs so that their sum is $1$.

Also, if the output layer is made up of softmax functions, the slowing-down term $\sigma'(z)$ is not present. If you use the log-likelihood cost function with a softmax output layer, you will obtain partial derivatives, and in turn update equations, of a form similar to those found for the cross-entropy function with sigmoid neurons.
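This can be checked numerically as well; a short sketch (my own illustration) showing that the softmax outputs sum to $1$ and that the gradient of the negative log-likelihood with respect to $z$ is $p - y$, with no $\sigma'$-like factor:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # shift by max(z) for numerical stability
    return e / e.sum()

def nll(z, k):
    """Negative log-likelihood of true class k under softmax(z)."""
    return -np.log(softmax(z)[k])

z = np.array([1.0, 2.0, 0.5])
k = 1                            # index of the true class
p = softmax(z)                   # a probability distribution: sums to 1

y = np.zeros_like(z); y[k] = 1.0 # one-hot label
analytic = p - y                 # claimed gradient of nll w.r.t. z
eps = 1e-6
numeric = np.array([(nll(z + eps * np.eye(3)[i], k)
                     - nll(z - eps * np.eye(3)[i], k)) / (2 * eps)
                    for i in range(3)])  # finite-difference gradient
```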


Medium
Multinomial Logistic Regression In a Nutshell | by Wilson Xie | Data Science Student Society @ UC San Diego | Medium
December 11, 2020 - Cross-entropy loss function, which maximizes the probability of the scoring vectors to the one-hot encoded Y (response) vectors. Stochastic gradient descent, which is just a gradient descent from a sample features. MLR shares steps with binary logistic regression, and the only difference is the function for each step.
scikit-learn
log_loss — scikit-learn 1.8.0 documentation
This is the loss function used in (multinomial) logistic regression and extensions of it such as neural networks, defined as the negative log-likelihood of a logistic model that returns y_pred probabilities for its training data y_true.
Christopher5106
About loss functions, regularization and joint losses : multinomial logistic, cross entropy, square errors, euclidian, hinge, Crammer and Singer, one versus all, squared hinge, absolute value, infogain, L1 / L2 - Frobenius / L2,1 norms, connectionist temporal classification loss
If \(\{ p_i \}\) is the probability of each class, then it is a multinomial distribution and \[\sum_i p_i = 1\] The equivalent to the sigmoid function in multi-dimensional space is the softmax function or logistic function or normalized exponential function to produce such a distribution from any input vector z : \[z \rightarrow \left\{ \frac{\exp z_i }{ \sum_k \exp z_k } \right\}_i\]
Chris Yeh
Binary vs. Multi-Class Logistic Regression | Chris Yeh
June 11, 2018 - Even so, the loss function is still convex (though clearly not strictly convex) so gradient descent will still find a global minimum (source). For each example \(x\), we could always choose \(v=−W_C\) and \(d=−b_C\) such that the last class has score 0. In other words, we could actually just have weights and biases for just the first \(C−1\) classes only. To show that multinomial logistic regression is a generalization of binary logistic regression, we will consider the case where there are 2 classes.
Quark Machine Learning
Multinomial Logistic Regression: Definition, Math, and Implementation - Quark Machine Learning
February 3, 2023 - To get the optimal weights to reduce the cost, we can find the distance from the predicted values ŷ to the actual values y. This distance is known as the loss function more specifically the cross-entropy loss function of Binary Logistic Regression.
Postulate
loss functions: multiclass SVM and multinomial logistic regression | Postulate
May 1, 2024 - Loss functions quantify how good a W value is. It tells us how good our current classifier is and represents quantitatively our unhappiness with predictions on the training set. If it has a low value, that means our classifier is pretty good.
MachineLearningMastery
Multinomial Logistic Regression With Python - MachineLearningMastery.com
December 31, 2020 - Instead, the multinomial logistic regression algorithm is an extension to the logistic regression model that involves changing the loss function to cross-entropy loss and predict probability distribution to a multinomial probability distribution ...
ScienceDirect
Multinomial Logistic Regression - an overview | ScienceDirect Topics
When this transformation is used, however, the logistic regression and its coefficients take on a somewhat different meaning from those found in regression with a metric dependent variable. The interpretation of the estimated regression coefficients is not as easy as in multiple regression. In multinomial logistic regression, not only is the relationship between x and y nonlinear, but also, if the dependent variable has more than two unique values, there are several regression equations.
Stanford University
Logistic Regression
In such cases we use multinomial logistic regression, also called softmax regression. ... Like the sigmoid, it is an exponential function.
Sebastian Raschka
What exactly is the "softmax and the multinomial logistic loss" in the context of machine learning? | Sebastian Raschka, PhD
January 17, 2026 - The softmax function is simply a generalization of the logistic function that allows us to compute meaningful class-probabilities in multi-class settings (multinomial logistic regression).
Dataaspirant
How Multinomial Logistic Regression Model Works In Machine Learning
September 14, 2017 - This Parameters optimization is an iteration process where the calculated weights for each observation used to calculate the cost function which is also known as the Loss function.
Wikipedia
Logistic regression - Wikipedia
3 weeks ago - Since the value of the logistic function is always strictly between zero and one, the log loss is always greater than zero and less than infinity. Unlike in a linear regression, where the model can have zero loss at a point by passing through ...
Bookdown
Chapter 11 Multinomial Logistic Regression | Companion to BER 642: Advanced Regression Methods
We chose the multinom function because it does not require the data to be reshaped (as the mlogit package does) and to mirror the example code found in Hilbe’s Logistic Regression Models. First, we need to choose the level of our outcome that we wish to use as our baseline and specify this in the relevel function. Then, we run our model using multinom. The multinom package does not include p-value calculation for the regression coefficients, so we calculate p-values using Wald tests (here z-tests).
Peterroelants
Logistic classification with cross-entropy (1/2) | Peter’s Notes
June 10, 2015 - Description of the logistic function used to model binary classification problems. Contains derivations of the gradients used for optimizing any parameters with regards to the cross-entropy loss function.
Blogger
Learning Machine Learning: Multinomial Logistic Classification
January 28, 2017 - We use a softmax function to turn the scores the model outputs into probabilities. We then use cross entropy function as our loss function compare those probabilities to the one-hot encoded labels.
OARC Stats
Multinomial Logistic Regression | R Data Analysis Examples
Collapsing number of categories to two and then doing a logistic regression: This approach suffers from loss of information and changes the original research questions to very different ones. Ordinal logistic regression: If the outcome variable is truly ordered and if it also satisfies the assumption of proportional odds, then switching to ordinal logistic regression will make the model more parsimonious. Alternative-specific multinomial probit regression, which allows different error structures therefore allows to relax the IIA assumption.