They are essentially the same; usually, we use the term log loss for binary classification problems, and the more general cross-entropy (loss) for the general case of multi-class classification, but even this distinction is not consistent, and you'll often find the terms used interchangeably as synonyms.
From the Wikipedia entry for cross-entropy:
The logistic loss is sometimes called cross-entropy loss. It is also known as log loss
From the fast.ai wiki entry on log loss [link is now dead]:
Log loss and cross-entropy are slightly different depending on the context, but in machine learning when calculating error rates between 0 and 1 they resolve to the same thing.
From the ML Cheatsheet:
Cross-entropy loss, or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1.
(Answer from desertnaut on Stack Overflow)
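As a quick sanity check that the two names refer to the same quantity in the binary case, here is a small sketch (toy labels and predicted probabilities assumed) comparing scikit-learn's log_loss with the binary cross-entropy written out by hand:

```python
# A minimal sketch with made-up data: "log loss" and binary cross-entropy agree.
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])             # binary labels
y_prob = np.array([0.9, 0.2, 0.7, 0.6, 0.1])   # predicted P(y = 1)

# Binary cross-entropy written out explicitly.
bce = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print(bce)                       # hand-written binary cross-entropy
print(log_loss(y_true, y_prob))  # scikit-learn's "log loss": the same number
```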
probability distributions - How is logistic loss and cross-entropy related? - Mathematics Stack Exchange
Difference between Cross-Entropy Loss or Log Likelihood Loss?
classification - Why we use log function for cross entropy? - Cross Validated
Cross entropy and log loss question
The relationship between cross-entropy, logistic loss, and K-L divergence is quite natural and follows directly from the definitions.
Cross-entropy is defined as:
\begin{equation}
H(p, q) = \operatorname{E}_p[-\log q] = H(p) + D_{\mathrm{KL}}(p \| q)=-\sum_x p(x)\log q(x)
\end{equation}
where $p$ and $q$ are two distributions, $D_{\mathrm{KL}}(p \| q)$ is the K-L divergence between them, and $H(p)$ is the entropy of $p$.
Now if $p \in \{y, 1-y\}$ and $q \in \{\hat{y}, 1-\hat{y}\}$, we can re-write cross-entropy as:
\begin{equation}
H(p, q) = -\sum_x p_x \log q_x =-y\log \hat{y}-(1-y)\log (1-\hat{y})
\end{equation}
which is nothing but logistic loss.
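A small numeric check of the identities above, with an assumed label $y$ and predicted probability $\hat{y}$:

```python
# Verify H(p, q) = H(p) + D_KL(p || q) = -y*log(y_hat) - (1-y)*log(1-y_hat)
# for a made-up example.
import numpy as np

y, y_hat = 1.0, 0.8                 # true label and predicted probability (assumed)
p = np.array([y, 1 - y])            # p in {y, 1-y}
q = np.array([y_hat, 1 - y_hat])    # q in {y_hat, 1-y_hat}

def cross_entropy(p, q):
    m = p > 0                       # 0*log(.) treated as 0
    return -np.sum(p[m] * np.log(q[m]))

def entropy(p):
    m = p > 0
    return -np.sum(p[m] * np.log(p[m]))

def kl(p, q):
    m = p > 0
    return np.sum(p[m] * np.log(p[m] / q[m]))

logistic_loss = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

print(cross_entropy(p, q))          # all three prints give the same value
print(entropy(p) + kl(p, q))
print(logistic_loss)
```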
Further, log loss is also related to logistic loss and cross-entropy as follows:
The expected log loss is defined as:
\begin{equation}
E[-\log q]
\end{equation}
Note that this is the loss function used in logistic regression, where $q$ is a sigmoid function. The excess risk for this loss function is:
\begin{equation}
E[\log p - \log q] = E\left[\log\frac{p}{q}\right] = D_{\mathrm{KL}}(p \| q)
\end{equation}
Notice that the K-L divergence is nothing but the excess risk of the log loss, and that it differs from cross-entropy only by the additive constant $H(p)$ (see the first definition). One important thing to remember is that in logistic regression we usually minimize the log loss rather than the cross-entropy, which is not exactly the same objective, but it works fine in practice.
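A sketch of that last point, with toy distributions assumed: for a fixed $p$, the cross-entropy $H(p, q)$ and $D_{\mathrm{KL}}(p \| q)$ differ only by the constant $H(p)$, so minimizing one over $q$ minimizes the other:

```python
# Cross-entropy and KL divergence, minimized over q for a fixed p, pick the same q.
import numpy as np

p = np.array([0.3, 0.7])                  # fixed "true" distribution (assumed)
qs = np.linspace(0.01, 0.99, 99)          # candidate q = [q1, 1 - q1]

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

def kl(p, q):
    return np.sum(p * np.log(p / q))

ce = np.array([cross_entropy(p, np.array([q1, 1 - q1])) for q1 in qs])
dk = np.array([kl(p, np.array([q1, 1 - q1])) for q1 in qs])

print(qs[np.argmin(ce)], qs[np.argmin(dk)])  # both minimized at q1 = 0.3 (= p1)
print(np.allclose(ce - dk, ce[0] - dk[0]))   # True: the gap is the constant H(p)
```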
Yes, they are related.
The cross-entropy used in logistic regression is derived from the maximum likelihood principle (or, equivalently, from minimising $-\log(\text{likelihood})$).
See section 28.2.1, Kullback-Leibler divergence:
Suppose ν and µ are the distributions of two probability models, and ν << µ. Then the cross-entropy is the expected negative log-likelihood of the model corresponding to ν, when the actual distribution is µ
For binary classification one way to encode the probability of an output is $p^y(1-p)^{1-y}$, if $y$ is encoded as 0 or 1. This is the likelihood function, and its meaning is that with probability $p$ the output is 1 and with probability $1-p$ the output is 0.
Now you have a sample and you want to find the $p$ which best fits your data. One way is to find the maximum likelihood estimator. If your observations are independent, the MLE is found by maximizing the likelihood over the whole sample, which is the product of the individual likelihoods $\prod_{i=1}^n p^{y_i}(1-p)^{1-y_i}$. But this is hard to use directly, so one transforms the likelihood with logs. The transformation is monotonic, and you get rid of products and obtain sums, which are more tractable. Apply logs and you obtain your expression.
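To make that MLE step concrete, here is a small sketch (toy 0/1 observations assumed) showing that the product of likelihoods and the sum of log-likelihoods are maximized by the same $p$, namely the sample mean:

```python
# Bernoulli MLE: likelihood and log-likelihood share the same maximizer (log is monotonic).
import numpy as np

y = np.array([1, 0, 1, 1, 0, 1, 0, 1, 0, 1])   # observed 0/1 outcomes (assumed)
ps = np.linspace(0.01, 0.99, 99)               # candidate values of p

likelihood = np.array([np.prod(p**y * (1 - p)**(1 - y)) for p in ps])
log_likelihood = np.array([np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)) for p in ps])

print(ps[np.argmax(likelihood)])       # ~0.6, the sample mean
print(ps[np.argmax(log_likelihood)])   # same maximizer
print(y.mean())                        # 0.6
```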
Why not use your encoding instead? I think there is no reason why not. The question is what the properties of your estimator are. The first formulation uses the likelihood and the MLE, which has an established theory behind it, including the fact that the estimator is efficient. The second formulation is not used often; I don't know of any example of encoding the probability like that, though that does not rule out your approach.
I was also looking for an explanation and found one reason I find intuitive here:
It heavily penalizes predictions that are confident and wrong.
Check this graph; it shows the range of possible log loss values given a true observation:

[Figure: log loss as a function of the predicted probability when the true label is 1]
The log loss increases rapidly as the predicted probability approaches 0 (a confidently wrong prediction when the true label is 1).
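A few values (assumed predictions, true label 1) make the point numerically:

```python
# Log loss for a single positive example: -log(predicted probability of the true class).
import numpy as np

y_hat = np.array([0.9, 0.5, 0.1, 0.01, 0.001])   # predicted P(y = 1), true y = 1
loss = -np.log(y_hat)

for p, l in zip(y_hat, loss):
    print(f"predicted {p:5.3f} -> log loss {l:6.3f}")
# 0.9 costs ~0.105, while 0.001 costs ~6.9: confident wrong predictions dominate the total loss.
```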
I’ve seen many tutorials refer to these as the same thing, but others say that:
Cross Entropy is actual*log(prediction)
While log loss is actual*log(prediction) + (1-actual)*log(1-prediction)
This feels wrong because then cross entropy wouldn’t be able to evaluate loss when the label is zero since the loss would always be zero regardless of the predicted value.
Sorry if the post is ugly; it was written on a phone.
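A small numeric sketch (toy values assumed) of why that worry does not arise in practice: applying the general cross-entropy over both classes with a one-hot target reproduces the binary formula exactly, including the label-0 case:

```python
# The general cross-entropy -sum_i target_i * log(prob_i) over the two classes {0, 1}
# equals the binary log loss formula for any label, including label 0.
import numpy as np

actual, prediction = 0, 0.2     # true label 0, predicted P(y = 1) = 0.2 (assumed)

# Binary form: -(actual*log(prediction) + (1-actual)*log(1-prediction)).
binary = -(actual * np.log(prediction) + (1 - actual) * np.log(1 - prediction))

# General form with a one-hot target [P(y=0), P(y=1)] and matching probabilities.
target = np.array([1 - actual, actual])
probs = np.array([1 - prediction, prediction])
general = -np.sum(target * np.log(probs))

print(binary, general)          # both ~0.223: the two formulas agree
```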