The relationship is as follows: $l(\beta) = \sum_i L(z_i)$.
Define the logistic function as $f(z) = \frac{e^{z}}{1 + e^{z}} = \frac{1}{1+e^{-z}}$. It has the property that $f(-z) = 1-f(z)$, or in other words:
$$ \frac{1}{1+e^{z}} = \frac{e^{-z}}{1+e^{-z}}. $$
If you take the reciprocal of both sides, then take the log, you get:
$$ \ln(1+e^{z}) = \ln(1+e^{-z}) + z. $$
Subtract $z$ from both sides and you should see this:
$$ -y_i\beta^Tx_i+\ln(1+e^{y_i\beta^Tx_i}) = L(z_i). $$
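A quick numerical sanity check of the two identities used above (the symmetry $f(-z)=1-f(z)$ and the log identity), in plain Python with an arbitrary test point:

```python
import math

def f(z):
    """Logistic function f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

z = 1.7  # arbitrary test point

# Symmetry property: f(-z) = 1 - f(z)
assert abs(f(-z) - (1 - f(z))) < 1e-12

# Log identity: ln(1 + e^z) = ln(1 + e^{-z}) + z
assert abs(math.log(1 + math.exp(z)) - (math.log(1 + math.exp(-z)) + z)) < 1e-12
```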
Edit:
At the moment I am re-reading this answer and am confused about how I got $-y_i\beta^Tx_i+\ln(1+e^{\beta^Tx_i})$ to be equal to $-y_i\beta^Tx_i+\ln(1+e^{y_i\beta^Tx_i})$. Perhaps there's a typo in the original question.
Edit 2:
In the case that there wasn't a typo in the original question, @ManelMorales appears to be correct to draw attention to the fact that, when $y \in \{-1,1\}$, the probability mass function can be written as $P(Y_i=y_i) = f(y_i\beta^Tx_i)$, due to the property that $f(-z) = 1 - f(z)$. I am writing it differently here because his answer overloads the notation $z_i$. The rest follows by taking the negative log-likelihood for each $y$ coding. See his answer below for more details.
The OP mistakenly believes that the relationship between these two functions is due to the number of samples (i.e. a single sample vs. all of them). However, the actual difference is simply how we choose our training labels.
In the case of binary classification we may assign the labels $y=\pm1$ or $y=0,1$.
As has already been stated, the logistic function $\sigma(z)$ is a good choice since it has the form of a probability, i.e. $\sigma(-z)=1-\sigma(z)$ and $\sigma(z)\in (0,1)$, with $\sigma(z)\rightarrow 0,1$ as $z\rightarrow \mp \infty$. If we pick the labels $y=0,1$ we may assign
\begin{equation} \begin{aligned} \mathbb{P}(y=1|z) & =\sigma(z)=\frac{1}{1+e^{-z}}\\ \mathbb{P}(y=0|z) & =1-\sigma(z)=\frac{1}{1+e^{z}}\\ \end{aligned} \end{equation}
which can be written more compactly as $\mathbb{P}(y|z) =\sigma(z)^y(1-\sigma(z))^{1-y}$.
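The compact Bernoulli form can be verified directly: raising to the powers $y$ and $1-y$ simply selects one of the two cases. A minimal check in plain Python:

```python
import math

def sigma(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def p(y, z):
    """Compact form P(y|z) = sigma(z)^y * (1 - sigma(z))^(1 - y) for y in {0, 1}."""
    return sigma(z)**y * (1 - sigma(z))**(1 - y)

z = 0.8  # arbitrary test point
assert abs(p(1, z) - sigma(z)) < 1e-12        # y = 1 recovers P(y=1|z)
assert abs(p(0, z) - (1 - sigma(z))) < 1e-12  # y = 0 recovers P(y=0|z)
```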
It is easier to maximize the log-likelihood, and maximizing the log-likelihood is the same as minimizing the negative log-likelihood. For $m$ samples $\{x_i,y_i\}$, after taking the natural logarithm and some simplification, we obtain:
\begin{equation} \begin{aligned} l(z)=-\log\big(\prod_i^m\mathbb{P}(y_i|z_i)\big)=-\sum_i^m\log\big(\mathbb{P}(y_i|z_i)\big)=\sum_i^m-y_iz_i+\log(1+e^{z_i}) \end{aligned} \end{equation}
The full derivation and additional information can be found in this Jupyter notebook. On the other hand, we may have instead used the labels $y=\pm 1$. It is then fairly clear that we can assign
\begin{equation} \mathbb{P}(y|z)=\sigma(yz). \end{equation}
Note that $\mathbb{P}(y=0|z)=\mathbb{P}(y=-1|z)=\sigma(-z)$, so the two codings agree. Following the same steps as before, we minimize in this case the loss function
\begin{equation} \begin{aligned} L(z)=-\log\big(\prod_j^m\mathbb{P}(y_j|z_j)\big)=-\sum_j^m\log\big(\mathbb{P}(y_j|z_j)\big)=\sum_j^m\log(1+e^{-y_jz_j}) \end{aligned} \end{equation}
The last step follows because the negative sign turns the log of $\sigma(y_jz_j)$ into the log of its reciprocal: $-\log\sigma(y_jz_j)=\log(1+e^{-y_jz_j})$. While we should not equate these two forms, since in each form $y$ takes different values, the two are nevertheless equivalent.
While there may be fundamental reasons as to why we have two different forms (see Why there are two different logistic loss formulation / notations?), one reason to choose the former is for practical considerations. In the former we can use the property $\partial \sigma(z) / \partial z=\sigma(z)(1-\sigma(z))$ to trivially calculate $\nabla l(z)$ and $\nabla^2l(z)$, both of which are needed for convergence analysis (i.e. to determine the convexity of the loss function by calculating the Hessian).
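The derivative identity $\partial \sigma(z)/\partial z = \sigma(z)(1-\sigma(z))$ can itself be checked against a finite-difference approximation, which is a handy sanity test when implementing the gradient by hand:

```python
import math

def sigma(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def dsigma(z):
    """Analytic derivative from the identity sigma'(z) = sigma(z) * (1 - sigma(z))."""
    return sigma(z) * (1 - sigma(z))

# Central finite difference as an independent check at an arbitrary point.
z, h = 0.3, 1e-6
numeric = (sigma(z + h) - sigma(z - h)) / (2 * h)
assert abs(numeric - dsigma(z)) < 1e-8
```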
If your model predicted a probability of 0.6 for a positive instance, it seems like it would make sense for the loss to be |(1 - 0.6)| = 0.4. And if it predicted a probability of 0.2 for a negative instance, it seems like it would make sense for it to be |(0 - 0.2)| = 0.2.
Instead, we ignore absolute differences and simply take the negative log of the predicted probability (or of one minus it) as the loss. I understand that this makes "confidently wrong" predictions VERY heavily penalized: if you predict a probability of 0.0 for a positive instance, the loss tends toward infinity (that is, -log(0.0)). But the method I described above also penalizes predictions that are further from the actual label more heavily -- it's just that the maximum penalty would be one rather than infinity (e.g., if you predicted 0.0 for a positive instance, the loss would be |(1 - 0.0)| = 1).
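To make the comparison concrete, here is a small tabulation of both penalties for a positive instance as the predicted probability falls (plain Python; the specific probabilities are just illustrative). The absolute loss stays bounded by 1, while the log loss grows without bound:

```python
import math

# Penalty for a positive instance (y = 1) as the predicted probability p falls.
for p in [0.9, 0.6, 0.3, 0.1, 0.01]:
    abs_loss = abs(1 - p)    # bounded above by 1
    log_loss = -math.log(p)  # grows without bound as p -> 0
    print(f"p={p:5.2f}  |1-p|={abs_loss:.2f}  -log(p)={log_loss:.2f}")
```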
Does this make sense? It's more theoretical but it'd be great to know the underpinnings of this function.
Thanks!