Answer from Bitwise on Stack Exchange: Actually, there are some great reasons that have nothing to do with whether this quantity is easy to calculate. The first form is called least squares, and in a probabilistic setting there are several good theoretical justifications to use it. For example, if you assume you are performing this regression on variables with normally distributed error (a reasonable assumption in many cases), then the least-squares form is the maximum likelihood estimator. There are several other important properties.
You can read some more here.
If the model $f(x; a, b, \dots)$ is linear with respect to the parameters, setting the derivatives of the sum of squares $\sum_i \big(y_i - f(x_i)\big)^2$ to zero leads to simple, explicit and direct solutions (immediate if you use matrix calculations).
This is not the case for the second objective function in your post, $\sum_i |y_i - f(x_i)|$. The problem becomes nonlinear with respect to the parameters and it is much more difficult to solve. But it is doable (I would generate the starting guesses from the first objective function).
For illustration purposes, I generated a table of data points from such a model and changed the values of $y$ using a random relative error. Using the first objective function, the solution is immediate. Starting with these values as initial guesses for the second objective function (which, again, makes the problem nonlinear), it took the solver quite a few iterations to arrive at slightly different parameter estimates, and all these painful iterations bought only a small reduction of the objective function!
There are many other possible objective functions used in regression but the traditional sum of squared errors is the only one which leads to explicit solutions.
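The contrast between the two objective functions can be sketched in a few lines of Python. This is my own minimal illustration, not the answer's original experiment: the data, the helper names, and the crude search routine are all made up. For a straight-line model $y = a + bx$, least squares has a closed form, while the sum of absolute errors must be minimized iteratively, here started from the least-squares solution as the answer suggests.

```python
def least_squares_line(xs, ys):
    """Closed-form fit of y = a + b*x minimizing the sum of squared errors."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

def sum_abs_errors(a, b, xs, ys):
    return sum(abs(y - (a + b * x)) for x, y in zip(xs, ys))

def least_abs_line(xs, ys, steps=200):
    """Crude coordinate-descent search for the least-absolute-errors fit,
    started from the least-squares solution."""
    a, b = least_squares_line(xs, ys)
    step = 1.0
    best = sum_abs_errors(a, b, xs, ys)
    for _ in range(steps):
        improved = False
        for da, db in ((step, 0), (-step, 0), (0, step), (0, -step)):
            cand = sum_abs_errors(a + da, b + db, xs, ys)
            if cand < best:
                a, b, best = a + da, b + db, cand
                improved = True
        if not improved:
            step /= 2.0  # shrink the search step once no move helps
    return a, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 1.9, 4.2, 5.8, 8.1]      # roughly y = 2x with noise (arbitrary)
print(least_squares_line(xs, ys))    # closed form, no iterations needed
print(least_abs_line(xs, ys))        # many iterations for a similar answer
```

The point is not the particular search method (a real solver would do better) but that the second objective offers no closed form at all.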
Added later
A very small problem that you could (should, if I may) exercise by hand: consider four data points $y_1, y_2, y_3, y_4$, suppose your model is simply $\hat y = a$, and search for the best value of $a$ which minimizes either
$$\Phi_2(a)=\sum_{i=1}^{4}(y_i-a)^2 \qquad \text{or} \qquad \Phi_1(a)=\sum_{i=1}^{4}|y_i-a|.$$
Plot the values of $\Phi_2$ and $\Phi_1$ as a function of $a$ over a range containing the data. For $\Phi_2$, you will have a nice parabola (the minimum of which is easy to find), but for $\Phi_1$ the plot shows a series of segments which then lead to discontinuous derivatives at their intersections; this makes the problem much more difficult to solve.
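The exercise can be checked numerically. This is my own sketch with four arbitrary data points: a grid search shows the sum of squares is a smooth parabola minimized at the mean, while the sum of absolute errors is piecewise linear and minimized anywhere between the two middle values (i.e. at a median).

```python
ys = [1.0, 2.0, 4.0, 9.0]   # four arbitrary data points

def phi2(a):
    """Sum of squared errors for the constant model y_hat = a."""
    return sum((y - a) ** 2 for y in ys)

def phi1(a):
    """Sum of absolute errors: piecewise linear, kinks at the data points."""
    return sum(abs(y - a) for y in ys)

grid = [i / 100.0 for i in range(0, 1001)]   # a in [0, 10]
best2 = min(grid, key=phi2)   # unique minimum: the mean, 4.0
best1 = min(grid, key=phi1)   # any a between the middle points 2 and 4 works
print(best2, best1)
```

The flat bottom of the absolute-error curve (and the kinks at its endpoints) is exactly the non-smoothness the answer warns about.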
You can try considering $(|x|+|y|)^2$, then using the property that $x \le |x|$ for all $x$, you can obtain the desired inequality. Also, your conclusion should be $|x+y| \le |x|+|y|$. Notice that the inequality is not strict. (For example, if $y = 0$ then certainly $|x+y| < |x|+|y|$ is false.) Another way is to use the fact that $-|a| \le a \le |a|$ for all numbers $a$ (doing this for both $x$ and $y$, and then manipulating the inequalities, we can achieve what you want); however, this second route assumes that you are at least a little bit familiar with the rules of inequalities, particularly the rules regarding inequalities with absolute values.
As an example of how we would apply the squaring technique, we can do the following:
$$|x+y|^2 = (x+y)^2 = x^2 + 2xy + y^2.$$
Now since $xy \le |xy| = |x|\,|y|$ is always true, we can say that
$$x^2 + 2xy + y^2 \le x^2 + 2|x|\,|y| + y^2 = (|x|+|y|)^2.$$
Now we want to try to "force" an inequality. That is, we will replace $xy$ with $|x|\,|y|$. If $x$ and $y$ are both greater than $0$ then nothing would change; however, if they are not both greater than $0$ we would get the following:
$$|x+y|^2 \le (|x|+|y|)^2;$$
notice that we have broken the chain of equal signs and forced an inequality. (There are still some more steps to do for you. Hint: what is $\sqrt{a^2}$?)
First method:
We have
$$(|x|+|y|)^2 - |x+y|^2 = 2\big(|x|\,|y| - xy\big) = 2\big(|xy| - xy\big) \ge 0,$$
which is an equality if and only if $|xy| = xy$, that is, if and only if $x$ and $y$ are both negative or both positive. This means that if $x$ and $y$ don't have the same sign we have
$$|x+y|^2 < (|x|+|y|)^2.$$
But this also gives
$$|x+y| < |x|+|y|.$$
Second method:
For all real numbers $a$ we have $a \le |a|$, and equality holds if and only if $a$ is nonnegative. Since $x$ or $y$ isn't nonnegative, one of the numbers $x$ or $y$ is strictly less than its absolute value, so
$$x + y < |x| + |y| \quad\text{and}\quad -(x+y) = (-x) + (-y) < |x| + |y|,$$
hence $|x+y| < |x|+|y|$.
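A quick numeric spot-check of the conclusion of both methods. This is my own addition, not part of either proof, using arbitrary random values: the triangle inequality is strict when the signs differ and holds with equality when they agree.

```python
import random

random.seed(0)
for _ in range(1000):
    x = random.uniform(-10, 10)
    y = random.uniform(-10, 10)
    # The triangle inequality always holds.
    assert abs(x + y) <= abs(x) + abs(y) + 1e-12
    if x * y < 0:
        # Opposite signs: the inequality is strict.
        assert abs(x + y) < abs(x) + abs(y)
    if x * y > 0:
        # Same sign: equality holds.
        assert abs(abs(x + y) - (abs(x) + abs(y))) < 1e-12
print("all cases check out")
```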
I guess I've seen this the most in statistics, e.g. standard deviation, or least squares regression. Why not calculate the standard deviation by simply taking the absolute values of the $(x_i - \bar{x})$ terms?
One reason is that squares are easier to deal with than absolute values. The derivative [or integral] of $x^2$ is easy, while the derivative of $|x|$ has a step function in it [ew].
Also, for standard deviations, I believe using $|x|$ leads to [more?] bias. Someone correct me here.
I do not like most of the answers given in this thread so far because, at least to me, their arguments are very shallow. I do not think that the fundamental reason why the standard deviation is more important than the mean absolute deviation is just because it is "easier to compute" or "smoother".
Here is my point of view on things: if you think of "sample space" as being $\mathbb{R}^n$, where $n$ is the number of data points that you have and your points are $n$-tuples $(x_1, \dots, x_n)$ where $x_i$ is the $i$th data point, then the mean and standard deviation have very geometric interpretations.
To be able to visualize correctly, let's do this for $n=2$. We have two data points, $x_1$ and $x_2$. Their average, $(x_1+x_2)/2$, corresponds to projecting the point $(x_1, x_2)$ orthogonally onto the line $x_1 = x_2$; that is, it is the closest sample to have all its data points equal. The standard deviation in this context will be the Euclidean distance between $(x_1, x_2)$ and the point corresponding to the average. The mean absolute deviation instead will be the sum of the sides of the triangle in the picture, or the "taxicab" distance.
Here is a picture that I hope helps illustrate the argument : http://i.imgur.com/oiWbk.png
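The picture can be reproduced numerically. This is a small sketch of my own with arbitrary numbers: the projection of $(x_1, x_2)$ onto the line $x_1 = x_2$ is $(m, m)$ with $m$ the mean, the Euclidean distance to it is the square root of the sum of squared deviations (i.e. $\sqrt{n}$ times the population standard deviation), and the taxicab distance is the sum of absolute deviations.

```python
import math

x1, x2 = 1.0, 5.0                     # two arbitrary data points
m = (x1 + x2) / 2                     # projection onto x1 = x2 is (m, m)
euclid = math.hypot(x1 - m, x2 - m)   # sqrt((x1-m)^2 + (x2-m)^2)
taxicab = abs(x1 - m) + abs(x2 - m)   # sum of the triangle's legs
pop_std = math.sqrt(((x1 - m) ** 2 + (x2 - m) ** 2) / 2)
print(euclid, taxicab, pop_std)
```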
I am learning data science through ISLR (page 62). Why do we use $\mathrm{RSS} = e_1^2 + e_2^2 + e_3^2 + \dots$ rather than $|e_1| + |e_2| + |e_3| + \dots$, since the latter is the actual distance? Will squaring not skew the results?
I've done several statistics courses in university and I'm starting to wonder why, or what, the "sum of squares" is used for in techniques like linear regression. What is particularly useful about the sum of squares?
I understand the idea is to minimize the sum of the squares of the errors compared to the $y = mx + b$ regression line, but why the squares? Why not minimize the sum of the absolute values of the errors? Or the fourth powers of the errors?
If the goal of the standard deviation is to summarise the spread of a symmetrical data set (i.e. in general how far each datum is from the mean), then we need a good method of defining how to measure that spread.
The benefits of squaring include:
- Squaring always gives a non-negative value, so deviations above and below the mean cannot cancel each other out and the sum will always be zero or higher.
- Squaring emphasizes larger differences, a feature that turns out to be both good and bad (think of the effect outliers have).
Squaring, however, does have a problem as a measure of spread: the units are all squared, whereas we might prefer the spread to be in the same units as the original data (think of squared pounds, squared dollars, or squared apples). Hence the square root allows us to return to the original units.
I suppose you could say that absolute difference assigns equal weight to each deviation whereas squaring emphasises the extremes. Technically though, as others have pointed out, squaring makes the algebra much easier to work with and offers properties that the absolute method does not (for example, the variance is equal to the expected value of the square of the distribution minus the square of the mean of the distribution).
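That variance identity is easy to check numerically. This is my own sketch with arbitrary data: the variance computed directly from deviations matches the mean of the squares minus the square of the mean, an identity with no simple analogue for absolute deviations.

```python
xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # arbitrary data
n = len(xs)
mean = sum(xs) / n
# Direct definition: average squared deviation from the mean.
var_direct = sum((x - mean) ** 2 for x in xs) / n
# Identity: E[X^2] - (E[X])^2.
var_identity = sum(x * x for x in xs) / n - mean ** 2
print(var_direct, var_identity)   # both 4.0 for this data set
```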
It is important to note however that there's no reason you couldn't take the absolute difference if that is your preference on how you wish to view 'spread' (sort of how some people see 5% as some magical threshold for $p$-values, when in fact it is situation dependent). Indeed, there are in fact several competing methods for measuring spread.
My view is to use the squared values because I like to think of how it relates to the Pythagorean Theorem of Statistics: $c = \sqrt{a^2 + b^2}$. This also helps me remember that when working with independent random variables, variances add, standard deviations don't. But that's just my personal subjective preference, which I mostly use as a memory aid, so feel free to ignore this paragraph.
An interesting analysis can be read here:
- Revisiting a 90-year-old debate: the advantages of the mean deviation - Stephen Gorard (Department of Educational Studies, University of York); Paper presented at the British Educational Research Association Annual Conference, University of Manchester, 16-18 September 2004
The squared difference has nicer mathematical properties; it's continuously differentiable (nice when you want to minimize it), it's a sufficient statistic for the Gaussian distribution, and it's (a version of) the L2 norm which comes in handy for proving convergence and so on.
The mean absolute deviation (the absolute value notation you suggest) is also used as a measure of dispersion, but it's not as "well-behaved" as the squared error.
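To make the "emphasizes larger differences" point concrete, here is my own sketch with arbitrary data: look at what share of the total spread a single outlier accounts for under each measure.

```python
xs = [9.0, 10.0, 10.0, 11.0, 30.0]   # arbitrary data with one outlier (30)
m = sum(xs) / len(xs)                # mean = 14.0
sq = [(x - m) ** 2 for x in xs]      # squared deviations
ab = [abs(x - m) for x in xs]        # absolute deviations
print(sq[-1] / sum(sq))   # share of squared spread due to the outlier (~0.80)
print(ab[-1] / sum(ab))   # share of absolute spread due to the outlier (0.50)
```

Under squaring the outlier dominates the total, while under absolute deviations it counts for only half; which behaviour you want is, as the answer above says, situation dependent.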
You cannot conclude it for $n \geq 2$: the left-hand side is a polynomial of degree $n$ while the right-hand side has degree $2$, so for large enough values the product outgrows the sum of squares. In a simple example, if $x_1 = x_2 = \dots = x_n = x$, then $x^n \leq nx^2$ fails for all sufficiently large $x$.
What you can say instead is the general inequality (which holds for nonnegative $x_i$):
$\prod_{i=1}^n x_i \leq \sum_{i=1}^n x_i^n$
In your case (i.e. when $n=2$) it becomes the specific inequality you mentioned first:
$\prod_{i=1}^2 x_i \leq \sum_{i=1}^2 x_i^2 \Longleftrightarrow xy \leq x^2+y^2$
Note that
$$3\times 4\times 5 = 60 \qquad\text{but}\qquad 3^2+4^2+5^2 = 50.$$
Thus
$$x_1 x_2 \cdots x_n \leq x_1^2 + x_2^2 + \cdots + x_n^2$$
is not true for all real numbers.
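Both claims are easy to verify by machine. This is my own sketch: the counterexample above shows the sum of squares can be exceeded, while for nonnegative numbers the sum of $n$-th powers never is (a consequence of AM-GM applied to the $x_i^n$).

```python
import random

xs = [3.0, 4.0, 5.0]
prod = 3.0 * 4.0 * 5.0
print(prod, sum(x ** 2 for x in xs))   # 60.0 vs 50.0: sum of squares fails

# Random check of the n-th power version for nonnegative inputs.
random.seed(1)
for _ in range(1000):
    ys = [random.uniform(0, 5) for _ in range(4)]
    p = 1.0
    for y in ys:
        p *= y
    assert p <= sum(y ** len(ys) for y in ys) + 1e-9
print("product <= sum of n-th powers held in all trials")
```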