When deriving the coefficients for a linear regression, we minimize the sum of squared residuals, and I am struggling to understand intuitively why. I know that squaring stops negative residuals from cancelling the positive ones, but why not just take the absolute value? Other answers say it is because of mathematical convenience and because of how squaring changes the effect outliers have on the regression. Are these the true reasons, and are they valid? Why would squaring change the effect of the outliers? Thanks!
You can use the absolute value. This is called the L1 norm and is used for robust regression.
You can read more about it here: http://www.johnmyleswhite.com/notebook/2013/03/22/using-norms-to-understand-linear-regression/
Practically, the math is easier in ordinary least squares regression: you want to minimize the squared residuals, so you can take the derivative, set it equal to zero, and solve. It is easier to differentiate a polynomial than an absolute value.
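To make the "take the derivative and solve" point concrete, here is a minimal sketch for simple linear regression; the data points are invented purely for illustration:

```python
# Setting the derivatives of sum((y - a - b*x)^2) with respect to a and b
# to zero gives the familiar closed-form slope and intercept.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]  # made-up data

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# slope = S_xy / S_xx, intercept = mean_y - slope * mean_x
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

print(slope, intercept)  # 1.96 and 0.14 for these points
```

No iteration, no solver: the minimizer drops straight out of the stationarity conditions.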
We can do absolute values for regression; it's called L1 regression, and people certainly use it. It's certainly not as convenient as ordinary least squares (L2) regression, but that's what computers are for.
What's substantially more convenient is the inference (hypothesis tests and CIs) but again that's not such an issue these days; computers can help deal with that.
However, if the error distribution is close to normal, least squares will be substantially more efficient.
Answer from Amit Hochman on Stack Exchange: As mentioned by others, the least-squares problem is much easier to solve. But there’s another important reason: assuming IID Gaussian noise, the least-squares solution is the maximum-likelihood estimate.
Minimizing the sum of squared residuals, $\sum_i (y_i - h(x_i))^2$, has a simple analytical solution. Minimizing the sum of absolute residuals, $\sum_i |y_i - h(x_i)|$, is difficult; one of the reasons is that the absolute value is not differentiable.
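The "simple analytical solution" can be written in matrix form as $\hat\beta = (X^\top X)^{-1} X^\top y$. A small sketch on synthetic data (the design, coefficients, and noise level are all invented):

```python
import numpy as np

rng = np.random.default_rng(0)
# Design matrix: intercept column plus two random predictors.
X = np.column_stack([np.ones(50), rng.normal(size=50), rng.normal(size=50)])
true_beta = np.array([1.0, 2.0, -3.0])
y = X @ true_beta + 0.1 * rng.normal(size=50)  # IID Gaussian noise

# One linear solve of the normal equations -- no iteration needed.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # should be close to [1, 2, -3]
```

The L1 problem has no such one-shot formula; it needs linear programming or an iterative solver.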
Related threads:
- definition - Why square the difference instead of taking the absolute value in standard deviation? - Cross Validated
- machine learning - In OLS, why is squaring preferred over taking absolute while calculating errors in linear regression? - Stack Overflow
- Why Least SQUARES Regression instead of Least ABSOLUTE VALUE Regression?
- statistics - Why get the sum of squares instead of the sum of absolute values? - Mathematics Stack Exchange
Both are done.
Least squares is easier, and the fact that for independent random variables "variances add" makes it considerably more convenient; for example, the ability to partition variances is particularly handy for comparing nested models. It is also somewhat more efficient at the normal (least squares is maximum likelihood there), which might seem to be a good justification -- however, some robust estimators with high breakdown point can have surprisingly high efficiency at the normal.
But L1 norms are certainly used for regression problems and these days relatively often.
If you use R, you might find the discussion in section 5 here useful:
https://socialsciences.mcmaster.ca/jfox/Books/Companion/appendices/Appendix-Robust-Regression.pdf
(though the material before it on M-estimation is also relevant, since L1 regression is itself a special case of M-estimation)
I can't help quoting from Huber, Robust Statistics, p. 10 on this (sorry the quote is too long to fit in a comment):

Two time-honored measures of scatter are the mean absolute deviation $$d_n = \frac{1}{n}\sum_{i=1}^n |x_i - \bar{x}|$$ and the mean square deviation $$s_n = \sqrt{\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2}.$$ There was a dispute between Eddington (1914, p. 147) and Fisher (1920, footnote on p. 762) about the relative merits of $d_n$ and $s_n$. [...] Fisher seemingly settled the matter by pointing out that for normal observations $s_n$ is about 12% more efficient than $d_n$.

By the relation between the conditional mean and the unconditional mean, $\mathbb{E}[Y] = \mathbb{E}\left[\mathbb{E}[Y \mid X]\right]$, a similar argument applies to the residuals.
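Fisher's 12% figure is easy to check by simulation. A hedged sketch (the sample size, replication count, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)
n, reps = 100, 5000
samples = rng.normal(size=(reps, n))
centered = samples - samples.mean(axis=1, keepdims=True)

# Both rescaled so they consistently estimate sigma = 1 at the normal.
d_n = np.abs(centered).mean(axis=1) * np.sqrt(np.pi / 2)
s_n = np.sqrt((centered ** 2).mean(axis=1))

# Relative efficiency var(s_n) / var(d_n): asymptotically about 0.88
# at the normal, i.e. d_n is roughly 12% less efficient than s_n.
print(s_n.var() / d_n.var())
```

Under heavier-tailed distributions the comparison can flip, which is exactly the robustness point Huber's book goes on to make.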
If the goal of the standard deviation is to summarise the spread of a symmetrical data set (i.e. in general how far each datum is from the mean), then we need a good method of defining how to measure that spread.
The benefits of squaring include:
- Squaring always gives a non-negative value, so the sum will always be zero or higher.
- Squaring emphasizes larger differences, a feature that turns out to be both good and bad (think of the effect outliers have).
Squaring however does have a problem as a measure of spread and that is that the units are all squared, whereas we might prefer the spread to be in the same units as the original data (think of squared pounds, squared dollars, or squared apples). Hence the square root allows us to return to the original units.
I suppose you could say that absolute difference assigns equal weight to the spread of data whereas squaring emphasises the extremes. Technically though, as others have pointed out, squaring makes the algebra much easier to work with and offers properties that the absolute method does not (for example, the variance equals the expected value of the square of the distribution minus the square of its mean: $\operatorname{Var}(X) = E[X^2] - (E[X])^2$).
It is important to note, however, that there is no reason you couldn't take the absolute difference if that is your preference for how you wish to view 'spread' (sort of how some people see 5% as some magical threshold for $p$-values, when in fact it is situation-dependent). Indeed, there are several competing methods for measuring spread.
My view is to use the squared values because I like to think of how it relates to the Pythagorean theorem of statistics: for independent random variables, $\sigma_{X+Y}^2 = \sigma_X^2 + \sigma_Y^2$. This also helps me remember that when working with independent random variables, variances add, standard deviations don't. But that's just my personal subjective preference which I mostly only use as a memory aid; feel free to ignore this paragraph.
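The "variances add, standard deviations don't" rule is easy to check numerically; a quick sketch with arbitrarily chosen distributions:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 3.0, size=1_000_000)    # sd 3, variance 9
y = rng.exponential(4.0, size=1_000_000)    # sd 4, variance 16, independent of x

var_sum = (x + y).var()
print(var_sum, x.var() + y.var())  # both close to 9 + 16 = 25
# sd(x + y) is about 5 = sqrt(3**2 + 4**2), not 3 + 4 = 7:
# standard deviations combine like the sides of a right triangle.
```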
An interesting analysis can be read here:
- Revisiting a 90-year-old debate: the advantages of the mean deviation - Stephen Gorard (Department of Educational Studies, University of York); Paper presented at the British Educational Research Association Annual Conference, University of Manchester, 16-18 September 2004
The squared difference has nicer mathematical properties; it's continuously differentiable (nice when you want to minimize it), it's a sufficient statistic for the Gaussian distribution, and it's (a version of) the L2 norm which comes in handy for proving convergence and so on.
The mean absolute deviation (the absolute value notation you suggest) is also used as a measure of dispersion, but it's not as "well-behaved" as the squared error.
Why do we use least squares, and not absolute value, or cubes, or whatever? I understand visually that it is the square of the vertical distance... but why?
Actually there are some great reasons which have nothing to do with whether this is easy to calculate. The first form is called least squares, and in a probabilistic setting there are several good theoretical justifications to use it. For example, if you assume you are performing this regression on variables with normally distributed error (which is a reasonable assumption in many cases), then the least squares form is the maximum likelihood estimator. There are several other important properties.
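The maximum-likelihood argument takes only a few lines. Assuming $y_i = h_\theta(x_i) + \epsilon_i$ with IID $\epsilon_i \sim N(0, \sigma^2)$:

```latex
L(\theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}
            \exp\!\left( -\frac{\bigl(y_i - h_\theta(x_i)\bigr)^2}{2\sigma^2} \right)
\;\Longrightarrow\;
\log L(\theta) = -\frac{n}{2}\log\!\left(2\pi\sigma^2\right)
                 - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \bigl(y_i - h_\theta(x_i)\bigr)^2 .
```

The first term does not involve $\theta$, so maximizing the likelihood over $\theta$ is exactly minimizing the sum of squared residuals. (An absolute-value loss corresponds, by the same argument, to Laplace-distributed noise.)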
You can read some more here.
If $h(x)$ is linear with respect to the parameters, the derivatives of the sum of squares leads to simple, explicit and direct solutions (immediate if you use matrix calculations).
This is not the case for the second objective function in your post. The problem becomes nonlinear with respect to the parameters and is much more difficult to solve. But it is doable (I would generate the starting guesses from the first objective function).
For illustration purposes, I generated a $10\times 10$ table for $$y=a+b\log(x_1)+c\sqrt{x_2}$$ ($x_1=1,2,\cdots,10$), ($x_2=1,2,\cdots,10$) and changed the values of $y$ using a random relative error between $-5$ and $5$%. The values used were $a=12.34$,$b=4.56$ and $c=7.89$.
Using the first objective function, the solution is immediate and leads to $a=12.180$, $b=4.738$,$c=7.956$.
Starting with these values as initial guesses for the second objective function (which, again, makes the problem nonlinear), it took the solver $20$ iterations to get $a=11.832$, $b=4.968$, $c=8.046$. And all these painful iterations reduced the objective function only from $95.60$ down to $94.07$!
There are many other possible objective functions used in regression but the traditional sum of squared errors is the only one which leads to explicit solutions.
Added later
A very small problem that you could (should, if I may) work by hand: consider four data points $(1,4)$, $(2,11)$, $(3,14)$, $(4,21)$; your model is simply $y=a x$ and you search for the best value of $a$, the one which minimizes either $$\Phi_1(a)=\sum_{i=1}^4 (y_i-a x_i)^2$$ or $$\Phi_2(a)=\sum_{i=1}^4 |y_i-a x_i|$$ Plot the values of $\Phi_1(a)$ and $\Phi_2(a)$ as a function of $a$ for $4 \leq a \leq 6$. For $\Phi_1(a)$ you will have a nice parabola (whose minimum is easy to find), but for $\Phi_2(a)$ the plot shows a series of segments with discontinuous derivatives at their intersections; this makes the problem much more difficult to solve.
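For those who would rather check the exercise numerically than plot it, here is a short sketch (the grid resolution is an arbitrary choice):

```python
# Data and model y = a*x from the exercise above.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [4.0, 11.0, 14.0, 21.0]

# Phi1 is a parabola: d/da sum (y - a x)^2 = 0  =>  a = sum(xy) / sum(x^2).
a_ls = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Phi2 is piecewise linear with kinks at the ratios y_i / x_i; there is no
# stationary-point formula, so just scan a grid over [4, 6].
def phi2(a):
    return sum(abs(y - a * x) for x, y in zip(xs, ys))

grid = [4 + i * 0.001 for i in range(2001)]
a_l1 = min(grid, key=phi2)

print(a_ls)  # 152/30, about 5.067
print(a_l1)  # about 5.25 = 21/4, one of the kinks y_i / x_i
```

Note that the L1 minimizer lands exactly on a kink rather than at a smooth stationary point, which is typical of absolute-value objectives.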
I am learning data science through ISLR (page 62). Why do we use $RSS = e_1^2 + e_2^2 + e_3^2 + \cdots$ rather than $|e_1| + |e_2| + |e_3| + \cdots$, which would be the actual distance? Will squaring not skew the results?
Minimizing square errors (MSE) is definitely not the same as minimizing absolute deviations (MAD) of errors. MSE provides the mean response of $y$ conditioned on $x$, while MAD provides the median response of $y$ conditioned on $x$.
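A quick numeric illustration of the mean-versus-median point, on an arbitrarily chosen data set with one outlier:

```python
# Over constants c, the squared-error sum is minimized by the mean and the
# absolute-error sum by the median; a grid search makes this visible.
data = [1.0, 2.0, 3.0, 4.0, 100.0]

def sse(c):
    return sum((x - c) ** 2 for x in data)

def sae(c):
    return sum(abs(x - c) for x in data)

grid = [i * 0.01 for i in range(11001)]  # c in [0, 110]
best_sq = min(grid, key=sse)
best_abs = min(grid, key=sae)

print(best_sq)   # about 22.0, the mean
print(best_abs)  # about 3.0, the median
```

The same logic carried over to regression is why least squares estimates the conditional mean and least absolute deviations the conditional median.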
Historically, Laplace originally considered the maximum observed error as a measure of the correctness of a model. He soon moved to considering MAD instead. Unable to solve either problem exactly, he eventually turned to the differentiable MSE. He and Gauss (seemingly concurrently) derived the normal equations, a closed-form solution for this problem. Nowadays, minimizing the MAD is relatively easy by means of linear programming but, as is well known, linear programming does not give a closed-form solution.
From an optimization perspective, both correspond to convex functions. However, MSE is differentiable, allowing for gradient-based methods that are much more efficient than their non-differentiable counterparts; the absolute value is not differentiable at zero.
A further theoretical reason is that, in a Bayesian setting with uniform priors on the model parameters, minimizing the MSE is equivalent to assuming normally distributed errors, which has been taken as proof of the method's correctness. Theorists like the normal distribution because they believe it is an empirical fact, while experimentalists like it because they believe it is a theoretical result.
A final reason why MSE may have had the wide acceptance it has is that it is based on the Euclidean distance (in fact, it is the solution of a projection problem in a Euclidean Hilbert space), which is extremely intuitive given our geometrical reality.
As an alternative explanation, consider the following intuition:
When minimizing an error, we must decide how to penalize these errors. Indeed, the most straightforward approach to penalizing errors would be to use a linearly proportional penalty function. With such a function, each deviation from the mean is given a proportional corresponding error. Twice as far from the mean would therefore result in twice the penalty.
The more common approach is to consider a squared proportional relationship between deviations from the mean and the corresponding penalty. This will make sure that the further you are away from the mean, the proportionally more you will be penalized. Using this penalty function, outliers (far away from the mean) are deemed proportionally more informative than observations near the mean.
To give a visualisation of this, you can simply plot the penalty functions:

Now especially when considering the estimation of regressions (e.g. OLS), different penalty functions will yield different results. Using the linearly proportional penalty function, the regression will assign less weight to outliers than when using the squared proportional penalty function. Least absolute deviations regression is therefore known to be a more robust estimator. In general, a robust estimator fits most of the data points well but 'ignores' outliers; a least squares fit, in comparison, is pulled more towards the outliers. Here is a visualisation for comparison:
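In the same spirit as the visualisation, here is a hedged sketch comparing an OLS line with a least-absolute-deviations line, the latter computed by iteratively reweighted least squares (the data, the outlier, and the IRLS details are invented for illustration):

```python
import numpy as np

x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0
y[-1] = 80.0  # one gross outlier pulls the OLS line upward

X = np.column_stack([np.ones_like(x), x])  # [intercept, slope] design

# OLS: one unweighted linear least-squares solve.
ols = np.linalg.lstsq(X, y, rcond=None)[0]

# IRLS for the L1 fit: reweight by 1/|residual| and re-solve until stable.
beta = ols.copy()
for _ in range(100):
    r = y - X @ beta
    w = 1.0 / np.maximum(np.abs(r), 1e-8)  # floor avoids division by zero
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

print(ols)   # slope noticeably above 2 because of the outlier
print(beta)  # near intercept 1, slope 2: the outlier is largely ignored
```

Nine of the ten points lie exactly on $y = 2x + 1$, so the L1 fit snaps to that line while the least squares fit is dragged toward the outlier.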

Now even though OLS is pretty much the standard, different penalty functions are most certainly in use as well. As an example, you can take a look at Matlab's robustfit function which allows you to choose a different penalty (also called 'weight') function for your regression. The penalty functions include andrews, bisquare, cauchy, fair, huber, logistic, ols, talwar and welsch. Their corresponding expressions can be found on the website as well.
I hope that helps you in getting a bit more intuition for penalty functions :)
Update
If you have Matlab, I can recommend playing with Matlab's robustdemo, which was built specifically for the comparison of ordinary least squares to robust regression:

The demo allows you to drag individual points and immediately see the impact on both ordinary least squares and robust regression (which is perfect for teaching purposes!).