Actually there are some great reasons which have nothing to do with whether this is easy to calculate. The first form is called least squares, and in a probabilistic setting there are several good theoretical justifications to use it. For example, if you assume you are performing this regression on variables with normally distributed error (which is a reasonable assumption in many cases), then the least squares form is the maximum likelihood estimator. There are several other important properties.
You can read some more here.
Answer from Bitwise on Stack Exchange
If the model is linear with respect to the parameters, setting the derivatives of the sum of squares to zero leads to simple, explicit and direct solutions (immediate if you use matrix calculations).
This is not the case for the second objective function in your post. The problem becomes nonlinear with respect to the parameters and it is much more difficult to solve. But it is doable (I would generate the starting guesses from the first objective function).
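As a minimal sketch of why the linear case is immediate (the data and the straight-line model here are made up for illustration; they are not from the original answer):

```python
import numpy as np

# Fit y ≈ a + b*x by least squares. Because the model is linear in the
# parameters, setting the derivatives of the sum of squares to zero gives
# the normal equations (X^T X) beta = X^T y, solved directly — no iterations.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

X = np.column_stack([np.ones_like(x), x])  # design matrix: columns [1, x]
beta = np.linalg.solve(X.T @ X, X.T @ y)   # explicit solution
print(beta)  # [intercept, slope], roughly [1.09, 1.94]
```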
For illustration purposes, I generated a table of values for a simple model and perturbed each value with a small random relative error. Using the first objective function, the solution is immediate. Starting with these values as initial guesses for the second objective function (which, again, makes the problem nonlinear), the solver needed many iterations to converge, and all these painful iterations reduced the objective function only marginally!
There are many other possible objective functions used in regression but the traditional sum of squared errors is the only one which leads to explicit solutions.
Added later
A very small problem that you could (should, if I may) exercise by hand: consider four data points $(x_i, y_i)$ and a model with a single parameter $a$, say $y = a\,x$, and search for the best value of $a$ which minimizes either
$$\Phi_1(a)=\sum_{i=1}^{4}\left(y_i-a\,x_i\right)^2 \qquad\text{or}\qquad \Phi_2(a)=\sum_{i=1}^{4}\left|y_i-a\,x_i\right|.$$
Plot the values of $\Phi_1$ and $\Phi_2$ as a function of $a$ over a range containing the minimum. For $\Phi_1$, you will have a nice parabola (the minimum of which is easy to find), but for $\Phi_2$ the plot shows a series of segments which lead to discontinuous derivatives at their intersections; this makes the problem much more difficult to solve.
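The exercise above can be sketched numerically. The four data points below are made up, since the original values are not preserved in this text; the shapes of the two objective functions are the point:

```python
import numpy as np

# Hypothetical data for a one-parameter model y ≈ a*x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 5.8, 9.2, 11.9])

a_grid = np.linspace(0.0, 6.0, 601)
sse = [np.sum((y - a * x) ** 2) for a in a_grid]   # sum of squares: smooth parabola
sae = [np.sum(np.abs(y - a * x)) for a in a_grid]  # sum of absolute values: segments

# The parabola's minimizer has a closed form: a* = sum(x*y) / sum(x*x).
a_star = np.dot(x, y) / np.dot(x, x)
print(a_star, a_grid[np.argmin(sse)], a_grid[np.argmin(sae)])
```

Plotting `sse` and `sae` against `a_grid` shows the parabola versus the piecewise-linear segments with kinks wherever some residual $y_i - a\,x_i$ changes sign.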
Sum of absolute values and the absolute value of the sum of these values? - Mathematics Stack Exchange
Why does taking the sum and difference between two numbers and dividing by 2 find the minimum of the two numbers?
Is the absolute value of the sum of two numbers always equal to the sum of their absolute values? Explain.
Sum of all absolute values of differences of the numbers 1, 2, ..., n taken two at a time - Mathematics Stack Exchange
You can try considering $|a+b|^2$, then using the property that $\sqrt{x^2} = |x|$ for all $x$, you can obtain the desired inequality. Also, your conclusion should be $|a+b| \le |a|+|b|$. Notice that the inequality is not strict. (For example, if $a=b=0$ then certainly $|a+b| < |a|+|b|$ is false.) Another way is to use the fact that $-|a| \le a \le |a|$ for all numbers $a$ (doing this for both $a$ and $b$, and then manipulating the inequalities, we can achieve what you want); however, this second route assumes that you are at least a little bit familiar with the rules of inequalities, particularly the rules regarding inequalities with absolute values.
As an example of how we would apply the squaring technique, we can do the following:
$$|a+b|^2 = (a+b)^2 = a^2 + 2ab + b^2.$$
Now since $ab \le |ab| = |a|\,|b|$ is always true, replacing $ab$ with $|a|\,|b|$ can only increase the right-hand side. So we want to try to "force" an inequality: that is, we will replace $ab$ with $|a|\,|b|$. If $a$ and $b$ are both greater than $0$ then nothing would change; however, if they are not both greater than $0$ we would get the following:
$$|a+b|^2 = a^2 + 2ab + b^2 \le a^2 + 2|a|\,|b| + b^2 = |a|^2 + 2|a|\,|b| + |b|^2 = (|a|+|b|)^2;$$
notice that we have broken the chain of equal signs and forced an inequality. (There are still some more steps for you to do. Hint: What is $\sqrt{|a+b|^2}$?)
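A quick randomized spot-check of the squaring argument (plain Python; the sample ranges are arbitrary and not from the original answer):

```python
import math
import random

# Check: |a+b|^2 = a^2 + 2ab + b^2, ab <= |a||b|, and therefore
# |a+b|^2 <= (|a|+|b|)^2, so |a+b| <= |a| + |b| (triangle inequality).
random.seed(0)
for _ in range(10_000):
    a, b = random.uniform(-100, 100), random.uniform(-100, 100)
    assert math.isclose(abs(a + b) ** 2, a * a + 2 * a * b + b * b, abs_tol=1e-6)
    assert a * b <= abs(a) * abs(b)
    assert abs(a + b) <= abs(a) + abs(b) + 1e-12
print("triangle inequality held on all samples")
```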
First method:
We have
$$|a+b|^2 = a^2 + 2ab + b^2 \le |a|^2 + 2|a|\,|b| + |b|^2 = (|a|+|b|)^2,$$
with equality if and only if $ab = |ab|$, that is, if and only if $a$ and $b$ are both negative or both positive. This means that if $a$ and $b$ do not have the same sign, then
$$|a+b| < |a|+|b|;$$
otherwise $ab = |ab|$ and the two sides are equal.
Second method:
For every real number $x$ we have $x \le |x|$, and equality holds if and only if $x$ is nonnegative. Applying this to $a$ and $b$ gives $a+b \le |a|+|b|$, and applying it to $-a$ and $-b$ gives $-(a+b) \le |a|+|b|$. Since one of the numbers $a+b$ or $-(a+b)$ is $|a+b|$, we conclude
$$|a+b| \le |a|+|b|.$$
Kinda stumbled on this and it seems ridiculously simple. I know it works but I can't really "understand" it.
Edit: thank you, everyone. I've learned a lot. The links to other branches, from quadratics to computing to Mohr's circle, are mind-boggling!
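The identity in question, sketched as code (a minimal illustration written for this note, not from the original thread):

```python
# For real a, b:
#   min(a, b) = (a + b - |a - b|) / 2
#   max(a, b) = (a + b + |a - b|) / 2
# Intuition: (a+b)/2 is the midpoint of a and b, and |a-b|/2 is half the
# distance between them; stepping down from the midpoint by half the
# distance lands on the smaller number, stepping up lands on the larger.
def min_via_abs(a: float, b: float) -> float:
    return (a + b - abs(a - b)) / 2

def max_via_abs(a: float, b: float) -> float:
    return (a + b + abs(a - b)) / 2

print(min_via_abs(3, 7), max_via_abs(-2, -9))  # prints 3.0 -2.0
```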
The sum of all absolute values of the differences of the numbers $1,2,3,\dots,n$, taken two at a time, is $\binom{n+1}{3}$. To see this, evaluate the full double sum $T(n)$, which runs over ordered pairs and therefore counts each unordered pair twice:
$\begin{align} T(n) &=\sum_{i=1}^n\sum_{j=1}^{n}|i-j|\\ &=\sum_{i=1}^n\sum_{j=1}^{i-1}|i-j| +\sum_{i=1}^n\sum_{j=i+1}^{n}|i-j|\\ &=\sum_{i=1}^n\sum_{j=1}^{i-1}(i-j) +\sum_{i=1}^n\sum_{j=i+1}^{n}(j-i)\\ &=\sum_{i=1}^n\sum_{j=1}^{i-1}j +\sum_{i=1}^n\sum_{j=1}^{n-i}j\\ &=\sum_{i=1}^n\frac{i(i-1)}{2} +\sum_{i=1}^n\frac{(n-i)(n-i+1)}{2}\\ &=\sum_{i=1}^n\left(\frac{i(i-1)+(n-i)(n-i+1)}{2}\right)\\ &=\sum_{i=1}^n\left(\frac{i^2-i+(n-i)^2+(n-i)}{2}\right)\\ &=\frac12\sum_{i=1}^n\left(i^2-i+n^2-2ni+i^2+n-i\right)\\ &=\frac{n(n^2+n)}{2}+\sum_{i=1}^n\left(i^2-i-ni\right)\\ &=\frac{n(n^2+n)}{2}+\frac{n(n+1)(2n+1)}{6}-(n+1)\frac{n(n+1)}{2}\\ &=\frac{n(n+1)}{2}\left(n+\frac{2n+1}{3}-(n+1)\right)\\ &=\frac{n(n+1)}{6}\left(3n+(2n+1)-3(n+1)\right)\\ &=\frac{n(n+1)}{6}\left(2n-2\right)\\ &=\frac{(n-1)n(n+1)}{3}\\ &=2\binom{n+1}{3}.\\ \end{align} $
Since each unordered pair $\{i,j\}$ appears twice in $T(n)$, once as $(i,j)$ and once as $(j,i)$, the sum taken two at a time is $T(n)/2=\binom{n+1}{3}$.
(Whew!)
There are probably simpler ways, but this is my way.
If you fix $j=1$, then letting $i$ range over the larger values you get: $1+2+\cdots+(n-1)$.
If you fix $j=2$, then letting $i$ range you get: $1+2+\cdots+(n-2)$.
...
If you fix $j=n-1$, then letting $i$ range you get: $1$.
If you put them all together you get a total of $(n-1)$ ones, $(n-2)$ twos, ..., and one $(n-1)$. Thus your sum equals $$\sum_{k=1}^{n-1}k\cdot (n-k) = n\sum_{k=1}^{n-1} k-\sum_{k=1}^{n-1}k^2,$$ and using the well-known formulas (and some algebra) you get $n^3/6-n/6$, which is equal to $\binom{n+1}{3}$.
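Both closed forms can be brute-force checked (a small Python sketch written for this note, not part of the original answers):

```python
from itertools import combinations
from math import comb

# Check: the sum of |i - j| over all unordered pairs {i, j} drawn from
# 1..n equals C(n+1, 3) = (n-1)n(n+1)/6 = n^3/6 - n/6.
for n in range(2, 30):
    brute = sum(abs(i - j) for i, j in combinations(range(1, n + 1), 2))
    assert brute == comb(n + 1, 3) == (n - 1) * n * (n + 1) // 6
print("closed form verified for n = 2..29")
```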
Introduction: The solution below is essentially the same as the solution given by Brian M. Scott, but it will take a lot longer. You are expected to assume that $S$ is a finite set, with say $k$ elements. Line them up in order, as $s_1<s_2<\cdots <s_k$.
The situation is a little different when $k$ is odd than when $k$ is even. In particular, if $k$ is even there are (depending on the exact definition of median) many medians. We tell the story first for $k$ odd.
Recall that $|x-s_i|$ is the distance between $x$ and $s_i$, so we are trying to minimize the sum of the distances. For example, we have $k$ people who live at various points on the $x$-axis. We want to find the point(s) $x$ such that the sum of the travel distances of the $k$ people to $x$ is a minimum.
The story: Imagine that the $s_i$ are points on the $x$-axis. For clarity, take $k=7$. Start from well to the left of all the $s_i$, and take a tiny step, say of length $\epsilon$, to the right. Then you have gotten $\epsilon$ closer to every one of the $s_i$, so the sum of the distances has decreased by $7\epsilon$.
Keep taking tiny steps to the right, each time getting a decrease of $7\epsilon$. This continues until you hit $s_1$. If you now take a tiny step to the right, then your distance from $s_1$ increases by $\epsilon$, and your distance from each of the remaining $s_i$ decreases by $\epsilon$. What has happened to the sum of the distances? There is a decrease of $6\epsilon$, and an increase of $\epsilon$, for a net decrease of $5\epsilon$ in the sum.
This continues until you hit $s_2$. Now, when you take a tiny step to the right, your distance from each of $s_1$ and $s_2$ increases by $\epsilon$, and your distance from each of the five others decreases by $\epsilon$, for a
net decrease of $3\epsilon$.
This continues until you hit $s_3$. The next tiny step gives an increase of $3\epsilon$, and a decrease of $4\epsilon$, for a net decrease of $\epsilon$.
This continues until you hit $s_4$. The next little step brings a total increase of $4\epsilon$, and a total decrease of $3\epsilon$, for an increase of $\epsilon$. Things get even worse when you travel further to the right. So the minimum sum of distances is reached at $s_4$, the median.
The situation is quite similar if $k$ is even, say $k=6$. As you travel to the right, there is a net decrease at every step, until you hit $s_3$. When you are between $s_3$ and $s_4$, a tiny step of $\epsilon$ increases your distance from each of $s_1$, $s_2$, and $s_3$ by $\epsilon$. But it decreases your distance from each of the three others, for no net gain. Thus any $x$ in the interval from $s_3$ to $s_4$, including the endpoints, minimizes the sum of the distances. In the even case, I prefer to say that any point between the two "middle" points is a median. So the conclusion is that the points that minimize the sum are the medians. But some people prefer to define the median in the even case to be the average of the two "middle" points. Then the median does minimize the sum of the distances, but some other points also do.
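The walking argument above can be checked numerically. This sketch uses made-up random points (with $k=7$, the odd case, so the median is a unique minimizer):

```python
import random
import statistics

# Among integer candidate locations x, the sum of distances
# sum(|x - s_i|) is smallest at the median of the s_i.
random.seed(1)
s = sorted(random.randint(0, 100) for _ in range(7))

def total_distance(x, points):
    return sum(abs(x - p) for p in points)

best_x = min(range(0, 101), key=lambda x: total_distance(x, s))
assert best_x == statistics.median(s)
print("minimizer:", best_x, "median:", statistics.median(s))
```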
We're basically after: $$ \arg \min_{x} \sum_{i = 1}^{N} \left| {s}_{i} - x \right| $$
One should notice that $ \frac{\mathrm{d} \left | x \right | }{\mathrm{d} x} = \operatorname{sign} \left( x \right) $ (being more rigorous, one would say it is a sub-gradient of the non-smooth $ {L}_{1} $ norm function).
Hence, differentiating the sum above yields $ \sum_{i = 1}^{N} \operatorname{sign} \left( {s}_{i} - x \right) $.
This equals zero only when the number of positive terms equals the number of negative terms, which happens when $ x = \operatorname{median} \left\{ {s}_{1}, {s}_{2}, \cdots, {s}_{N} \right\} $.
Remarks
- One should notice that the median of a discrete group is not uniquely defined.
- The median is not necessarily an item within the group.
- Not every set can bring the sub-gradient to vanish. Yet employing the sub-gradient method is guaranteed to converge to a median.
- This is not the optimal way to calculate the median; it is given to provide intuition about what the median is.
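A sketch of the sub-gradient iteration described above (the data and the diminishing step schedule are made up; as the remarks say, this is intuition, not an efficient algorithm):

```python
import numpy as np

# Sub-gradient descent for min_x sum(|s_i - x|): step opposite the
# sub-gradient -sum(sign(s_i - x)) with a diminishing step size 1/k.
s = np.array([1.0, 2.0, 7.0, 9.0, 40.0])

x = 0.0
for k in range(1, 20_000):
    g = -np.sum(np.sign(s - x))  # a sub-gradient of sum(|s_i - x|) at x
    x -= (1.0 / k) * g
print(x)  # close to the median of s, which is 7
```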