Here is my try:
$$J(x) = -\frac{1}{m}\sum_{i = 1}^{m} \left[b_i\ln(h_i) + (1 - b_i)\ln(1 - h_i)\right]$$
where $h_i = \sigma(a_i^Tx)$ is the predicted probability for the $i^{th}$ sample. Let $h = \sigma(Ax)$. Assuming $\ln$ and $\sigma$ work element-wise on vectors, $\odot$ is element-wise multiplication and $\mathbf{1}$ is a vector of $1$s, we have
$$J(x) = -\frac{1}{m}\,\mathbf{1}^T\left(b \odot \ln(h) + (\mathbf{1} - b) \odot \ln(\mathbf{1} - h)\right)$$
Now the gradient in matrix notation is
$$\nabla J(x) = \frac{1}{m}A^T(h - b)$$
In linear regression, the Maximum Likelihood Estimation (MLE) solution for estimating $x$ has the following closed-form solution (assuming that $A$ is a matrix with full column rank):
$$\hat{x}_\text{lin}=\underset{x}{\text{argmin}} \|Ax-b\|_2^2 = (A^TA)^{-1}A^Tb$$
This is read as "find the $x$ that minimizes the objective function, $\|Ax-b\|_2^2$". The nice thing about representing the linear regression objective function this way is that we can keep everything in matrix notation and solve for $\hat{x}_\text{lin}$ by hand. As Alex R. mentions, in practice we often don't compute $(A^TA)^{-1}$ directly because it is computationally inefficient and $A$ often does not meet the full-column-rank criterion. Instead, we turn to the Moore-Penrose pseudoinverse. Computationally, solving for the pseudoinverse can involve the Cholesky decomposition or the Singular Value Decomposition.
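As a quick numerical sketch (using NumPy; the data, sizes, and coefficients here are made up for illustration), the normal-equations formula and the SVD-based pseudoinverse give the same estimate when $A$ has full column rank:

```python
import numpy as np

# Made-up data for illustration; sizes and coefficients are arbitrary
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))           # design matrix with full column rank
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true + 0.1 * rng.normal(size=100)

# Textbook normal-equations solution: (A^T A)^{-1} A^T b
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Numerically preferred route via the Moore-Penrose pseudoinverse (SVD-based)
x_pinv = np.linalg.pinv(A) @ b

# Both agree when A has full column rank
assert np.allclose(x_normal, x_pinv)
```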
Alternatively, the MLE solution for estimating the coefficients in logistic regression is:
$$\hat{x}_\text{log} = \underset{x}{\text{argmin}} \sum_{i=1}^{N} y^{(i)}\log(1+e^{-x^Ta^{(i)}}) + (1-y^{(i)})\log(1+e^{x^T a^{(i)}})$$
where (assuming each sample of data is stored row-wise):
$x$ is a vector representing the regression coefficients
$a^{(i)}$ is a vector representing the $i^{th}$ sample/row in the data matrix $A$
$y^{(i)}$ is a scalar in $\{0, 1\}$, the label corresponding to the $i^{th}$ sample
$N$ is the number of data samples / number of rows in data matrix $A$.
Again, this is read as "find the $x$ that minimizes the objective function".
If you wanted to, you could take it a step further and represent $\hat{x}_\text{log}$ in matrix notation as follows:
$$ \hat{x}_\text{log} = \underset{x}{\text{argmin}} \begin{bmatrix} 1 & (1-y^{(1)}) \\ \vdots & \vdots \\ 1 & (1-y^{(N)})\\\end{bmatrix} \begin{bmatrix} \log(1+e^{-x^Ta^{(1)}}) & ... & \log(1+e^{-x^Ta^{(N)}}) \\\log(1+e^{x^Ta^{(1)}}) & ... & \log(1+e^{x^Ta^{(N)}}) \end{bmatrix} $$
but you don't gain anything from doing this. Logistic regression does not have a closed-form solution and does not gain the same benefits from matrix notation that linear regression does. To solve for $\hat{x}_\text{log}$, estimation techniques such as gradient descent and the Newton-Raphson method are used. With some of these techniques (e.g., Newton-Raphson), $\hat{x}_\text{log}$ is approximated iteratively, and each update can be represented in matrix notation (see the link provided by Alex R.).
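For illustration, here is a minimal gradient-descent sketch of that objective in NumPy (the dataset, step size, and iteration count are arbitrary choices for the example, not from the answer above); the gradient of the objective is $A^T(\sigma(Ax) - y)$, where $\sigma$ is the sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd(A, y, lr=0.1, n_iter=5000):
    """Minimize the logistic NLL by gradient descent.
    The gradient is A^T (sigmoid(Ax) - y), scaled by 1/N for step-size stability."""
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x -= lr * A.T @ (sigmoid(A @ x) - y) / len(y)
    return x

# Tiny invented dataset: an intercept column plus one feature
rng = np.random.default_rng(1)
A = np.column_stack([np.ones(200), rng.normal(size=200)])
x_true = np.array([-0.5, 2.0])
y = (rng.uniform(size=200) < sigmoid(A @ x_true)).astype(float)

x_hat = logistic_gd(A, y)
```

There is no closed form to compare against; instead one checks convergence by confirming the gradient is (numerically) zero at `x_hat`.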
@joceratops' answer focuses on the optimization problem of maximum likelihood estimation. This is indeed a flexible approach that is amenable to many types of problems. For estimating most models, including linear and logistic regression models, there is another general approach based on the method of moments.
The linear regression estimator can also be formulated as the root to the estimating equation:
$$0 = \mathbf{X}^T(Y - \mathbf{X}\beta)$$
In this regard, $\beta$ is seen as the value which retrieves an average residual of 0. It needn't rely on any underlying probability model to have this interpretation. It is, however, interesting to derive the score equations for a normal likelihood: you will see that they take exactly the form displayed above. Maximizing the likelihood of a regular exponential family for a linear model (e.g. linear or logistic regression) is equivalent to obtaining solutions to the score equations:
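A quick NumPy check (with made-up data) that the OLS fit is indeed a root of this estimating equation:

```python
import numpy as np

# Invented data for the sketch
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
Y = X @ np.array([1.0, 3.0]) + rng.normal(size=50)

# Least-squares fit via SVD-based solver
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# The estimating equation 0 = X^T (Y - X beta) holds (numerically) at the fit
score = X.T @ (Y - X @ beta_hat)
assert np.allclose(score, 0.0, atol=1e-8)
```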
$$0 = \sum_{i=1}^n S_i(\alpha, \beta) = \frac{\partial}{\partial \beta} \log \mathcal{L}( \beta, \alpha, X, Y) = \mathbf{X}^T (Y - g(\mathbf{X}\beta))$$
Where $Y_i$ has expected value $g(\mathbf{X}_i \beta)$. In GLM estimation, $g$ is said to be the inverse of a link function. In the normal likelihood equations, $g^{-1}$ is the identity function, and in logistic regression $g^{-1}$ is the logit function. A more general approach would be to require $0 = \sum_{i=1}^n \left(Y_i - g(\mathbf{X}_i\beta)\right)$, which allows for model misspecification.
Additionally, it is interesting to note that for regular exponential families, $\frac{\partial g(\mathbf{X}\beta)}{\partial \beta} = \mathbf{V}(g(\mathbf{X}\beta))$, which is called a mean-variance relationship. Indeed, for logistic regression the mean-variance relationship is such that the mean $p = g(\mathbf{X}\beta)$ is related to the variance by $\mbox{var}(Y_i) = p_i(1-p_i)$. This suggests an interpretation of a misspecified GLM as one which gives a 0 average Pearson residual. It further suggests a generalization to allow non-proportional functional mean derivatives and mean-variance relationships.
A generalized estimating equation approach would specify linear models in the following way:
$$0 = \left(\frac{\partial g(\mathbf{X}\beta)}{\partial \beta}\right)^T \mathbf{V}^{-1}\left(Y - g(\mathbf{X}\beta)\right)$$
With $\mathbf{V}$ a matrix of variances based on the fitted value (mean) given by $g(\mathbf{X}\beta)$. This approach to estimation allows one to pick a link function and mean variance relationship as with GLMs.
In logistic regression $g$ would be the inverse logit, and $V_{ii}$ would be given by $g(\mathbf{X}_i \beta)(1-g(\mathbf{X}_i\beta))$. The solutions to this estimating equation, obtained by Newton-Raphson, will yield the $\beta$ obtained from logistic regression. However, a somewhat broader class of models is estimable under a similar framework. For instance, the link function can be taken to be the log of the linear predictor, so that the regression coefficients are relative risks and not odds ratios. Which, given the well-documented pitfalls of interpreting ORs as RRs, leads me to ask why anyone fits logistic regression models at all anymore.
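A minimal Newton-Raphson sketch for the logistic case (NumPy; the data and sample size are invented for the example). The score is $\mathbf{X}^T(Y - g(\mathbf{X}\beta))$ and its Jacobian is $-\mathbf{X}^T\mathbf{V}\mathbf{X}$ with $V_{ii} = p_i(1-p_i)$, the mean-variance relationship above:

```python
import numpy as np

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_newton(X, Y, n_iter=25):
    """Solve 0 = X^T (Y - g(X beta)), g = inverse logit, by Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(X @ beta)
        score = X.T @ (Y - p)
        V = p * (1.0 - p)               # mean-variance relationship
        H = X.T @ (X * V[:, None])      # X^T V X
        beta += np.linalg.solve(H, score)
    return beta

# Made-up data for the sketch: intercept plus one covariate
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(300), rng.normal(size=300)])
beta_true = np.array([0.2, -1.0])
Y = (rng.uniform(size=300) < expit(X @ beta_true)).astype(float)

beta_hat = logistic_newton(X, Y)
```

At convergence the score is numerically zero, which is exactly the sense in which `beta_hat` is a root of the estimating equation.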