As mentioned by others, the least-squares problem is much easier to solve. But there’s another important reason: assuming IID Gaussian noise, the least-squares solution is the Maximum-Likelihood estimate.

Answer from Amit Hochman on Stack Exchange
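To spell that argument out (a standard derivation, assuming the linear model $y_i = x_i^\top \beta + \varepsilon_i$ with IID noise $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$): the log-likelihood of the data is $$\log L(\beta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - x_i^\top \beta)^2,$$ and since the first term does not depend on $\beta$, maximizing the likelihood is exactly minimizing the sum of squared residuals. (The same argument with IID Laplace noise yields least absolute deviations instead.)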
Reddit
r/statistics on Reddit: Why square residuals and not use absolute value?
May 4, 2014

When deriving the coefficients for a linear regression, we tend to obtain the sum of the minimized squared residuals. I am struggling to intuitively understand why. I know that this offsets the negative residuals that would cancel the positive ones. However, why not just do the absolute value? Other answers say it is because of the mathematical convenience and because squaring makes sure that outliers have a more minimal effect on the regression. Are these the true reasons and are they valid? Why would squaring the outliers make it have a more minimal effect? Thanks!

Discussions

definition - Why square the difference instead of taking the absolute value in standard deviation? - Cross Validated
Around 1800 Gauss started with least squares and variance and from those derived the Normal distribution: there's the circularity. A truly fundamental reason that has not been invoked in any answer yet is the unique role played by the variance in the Central Limit Theorem. Another is the importance in decision theory of minimizing quadratic loss. Taleb makes the case at Edge.org for retiring standard deviation and using mean absolute deviation instead.
stats.stackexchange.com
July 19, 2010
machine learning - In OLS, why is squaring preferred over taking absolute while calculating errors in linear regression? - Stack Overflow
OLS = ordinary least SQUARES, this corresponds to the normal distribution. Absolute values give a different estimator: median regression, a special case of quantile regression. This is a statistics question and not a code question.
stackoverflow.com
Why Least SQUARES Regression instead of Least ABSOLUTE VALUE Regression?
Least squares is traditionally used for lots of reasons:

  • It's computationally simple.
  • Minimizing squares corresponds to fitting the expected response, rather than some quantile of the response.
  • It's mathematically simple.
  • The least-squares estimator corresponds to the maximum-likelihood estimator if errors are Normal.
r/AskStatistics
March 24, 2018
statistics - Why get the sum of squares instead of the sum of absolute values? - Mathematics Stack Exchange
By the Gauss–Markov Theorem, least squares is the best linear unbiased estimator (BLUE). All that said, minimum absolute deviation (MAD), which is what is minimized under the second objective function, produces a robust estimate, which MSE does not.
math.stackexchange.com
October 11, 2014
Wolfram Demonstrations Project
Comparing Least-Squares Fit and Least Absolute Deviations Fit
Bradthiessen
Why we use “least squares” regression instead of “least ...
When we do not square all the values in the numerator, the positive and negative values cancel each other out and we’re always left ... We could have solved this problem by using absolute values in the numerator (this is called the mean absolute deviation), but we
Top answer
1 of 16

If the goal of the standard deviation is to summarise the spread of a symmetrical data set (i.e. in general how far each datum is from the mean), then we need a good method of defining how to measure that spread.

The benefits of squaring include:

  • Squaring always gives a non-negative value, so positive and negative deviations cannot cancel; the sum is zero only when every datum equals the mean.
  • Squaring emphasizes larger differences, a feature that turns out to be both good and bad (think of the effect outliers have).

Squaring, however, does have a problem as a measure of spread: the units are all squared, whereas we might prefer the spread to be in the same units as the original data (think of squared pounds, squared dollars, or squared apples). Hence taking the square root allows us to return to the original units.

I suppose you could say that absolute difference assigns equal weight to the spread of data whereas squaring emphasises the extremes. Technically though, as others have pointed out, squaring makes the algebra much easier to work with and offers properties that the absolute method does not (for example, the variance is equal to the expected value of the square of the distribution minus the square of the mean of the distribution: $\operatorname{Var}(X)=E[X^2]-E[X]^2$).

It is important to note, however, that there's no reason you couldn't take the absolute difference if that is your preference for how you wish to view 'spread' (sort of how some people see 5% as some magical threshold for p-values, when in fact it is situation dependent). Indeed, there are several competing methods for measuring spread.

My view is to use the squared values because I like to think of how it relates to the Pythagorean Theorem of Statistics: for independent random variables, $\mathrm{SD}(X+Y)=\sqrt{\mathrm{SD}(X)^2+\mathrm{SD}(Y)^2}$. This also helps me remember that when working with independent random variables, variances add, standard deviations don't. But that's just my personal subjective preference which I mostly only use as a memory aid, so feel free to ignore this paragraph.

An interesting analysis can be read here:

  • Revisiting a 90-year-old debate: the advantages of the mean deviation - Stephen Gorard (Department of Educational Studies, University of York); Paper presented at the British Educational Research Association Annual Conference, University of Manchester, 16-18 September 2004
2 of 16

The squared difference has nicer mathematical properties; it's continuously differentiable (nice when you want to minimize it), it's a sufficient statistic for the Gaussian distribution, and it's (a version of) the L2 norm which comes in handy for proving convergence and so on.

The mean absolute deviation (the absolute value notation you suggest) is also used as a measure of dispersion, but it's not as "well-behaved" as the squared error.
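That differentiability is what buys the closed-form solution: setting the partial derivatives of the sum of squares to zero gives the normal equations. A minimal sketch in pure Python (the three data points are made up for illustration):

```python
# Fit y = a + b*x by least squares via the normal equations, obtained by
# setting the partial derivatives of sum((y - a - b*x)^2) with respect
# to a and b equal to zero.

def ols_line(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Solve the resulting 2x2 linear system:
    #   n*a  + sx*b  = sy
    #   sx*a + sxx*b = sxy
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

a, b = ols_line([1, 2, 3], [1, 2, 2])
print(a, b)  # a = 2/3, b = 0.5
```

No iteration, no starting guesses: the minimizer drops straight out of calculus, which is exactly what the absolute-value objective does not allow.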

Cantors Paradise
Least Squares vs Least Absolute Errors —A 250-Year-Old Debate
July 26, 2023 - Ordinary least squares (OLS) regression is one of the first items on the menu in an introductory Statistics or Data Science course. ... It’s a common question even high school students ask. After all, comparing the size of the errors (absolute value) seems simpler and more natural.
Wayne State University
Least Absolute Value vs. Least Squares Estimation and ...
Quora
Why do we square instead of using the absolute value when calculating variance and standard deviation? - Quora
... Variances add for independent random variables: Var(X+Y) = Var(X) + Var(Y). This clean relationship follows from squaring and expectation of cross-terms; absolute deviations do not produce a comparable additive rule.
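A quick numerical sanity check of that additivity claim (a throwaway sketch using Python's standard library with a fixed seed; sample size and the chosen variances are arbitrary):

```python
import random
import statistics

random.seed(0)
n = 100_000
# Independent draws: X ~ N(0, 1), Y ~ N(0, 2^2)
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [random.gauss(0, 2) for _ in range(n)]

var_sum = statistics.pvariance([x + y for x, y in zip(xs, ys)])
# Variances add for independent variables: expect about 1 + 4 = 5 ...
print(var_sum)
# ... but standard deviations do not: sqrt(5) is about 2.24, not 1 + 2 = 3.
print(var_sum ** 0.5)
```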
Wikipedia
Least squares - Wikipedia
For this purpose, Laplace used a symmetric two-sided exponential distribution we now call the Laplace distribution to model the error distribution, and used the sum of absolute deviations as the error of estimation. He felt these to be the simplest assumptions he could make, and he had hoped to obtain the arithmetic mean as the best estimate. Instead, his estimator was the posterior median. The first clear and concise exposition of the method of least squares was published by Legendre in 1805.
Reddit
r/AskStatistics on Reddit: Why Least SQUARES Regression instead of Least ABSOLUTE VALUE Regression?
March 24, 2018

Why do we use Least squares, why not absolute value, or cubes, or whatever. I understand visually that it is the square of the vertical distance....but why?

Quora
The method of least squares of residuals is commonly used to get the best fit with linear regression. The reason why the absolute value of residual's (|y- ypred|) is not used is that? - Quora
Answer (1 of 6): You could run something like that if you wanted, but least squares intentionally squares the error with the thinking that it will react better when values stray away from your predicted value. As an example, say you take your data and plot all the actual values and see a straigh...
Wikipedia
Least absolute deviations - Wikipedia
November 22, 2024 - LAD gives equal emphasis to all observations, in contrast to ordinary least squares (OLS) which, by squaring the residuals, gives more weight to large residuals, that is, outliers in which predicted values are far from actual observations. This may be helpful in studies where outliers do not ...
Bragitoff
Why can’t we use absolute values of Errors? (Least Squares Method)- Curve Fitting - BragitOff.com
October 7, 2015 - In mathematical terms, we would ... analytically. We simply choose not to use absolute values because of the difficulties we have in working with them mathematically....
Top answer
1 of 7

Actually there are some great reasons which have nothing to do with whether this is easy to calculate. The first form is called least squares, and in a probabilistic setting there are several good theoretical justifications to use it. For example, if you assume you are performing this regression on variables with normally distributed error (which is a reasonable assumption in many cases), then the least squares form is the maximum likelihood estimator. There are several other important properties.

You can read some more here.

2 of 7

If $h(x)$ is linear with respect to the parameters, the derivatives of the sum of squares lead to simple, explicit and direct solutions (immediate if you use matrix calculations).

This is not the case for the second objective function in your post. The problem becomes nonlinear with respect to the parameters and it is much more difficult to solve. But it is doable (I would generate the starting guesses from the first objective function).

For illustration purposes, I generated a $10\times 10$ table for $$y=a+b\log(x_1)+c\sqrt{x_2}$$ ($x_1=1,2,\cdots,10$; $x_2=1,2,\cdots,10$) and changed the values of $y$ using a random relative error between $-5\%$ and $5\%$. The values used were $a=12.34$, $b=4.56$ and $c=7.89$.

Using the first objective function, the solution is immediate and leads to $a=12.180$, $b=4.738$, $c=7.956$.

Starting with these values as initial guesses for the second objective function (which, again, makes the problem nonlinear), it took the solver $20$ iterations to get $a=11.832$, $b=4.968$, $c=8.046$. And all these painful iterations reduced the objective function only from $95.60$ down to $94.07$!

There are many other possible objective functions used in regression but the traditional sum of squared errors is the only one which leads to explicit solutions.

Added later

A very small problem that you could (should, if I may) work by hand: consider four data points $(1,4)$, $(2,11)$, $(3,14)$, $(4,21)$; your model is simply $y=ax$ and you search for the best value of $a$ which minimizes either $$\Phi_1(a)=\sum_{i=1}^4 (y_i-a x_i)^2$$ or $$\Phi_2(a)=\sum_{i=1}^4 |y_i-a x_i|$$ Plot the values of $\Phi_1(a)$ and $\Phi_2(a)$ as a function of $a$ for $4 \leq a \leq 6$. For $\Phi_1(a)$, you will have a nice parabola (the minimum of which is easy to find), but for $\Phi_2(a)$ the plot shows a series of segments with discontinuous derivatives at their intersections; this makes the problem much more difficult to solve.
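If you'd rather check that exercise numerically than by hand, a coarse grid search does it (a throwaway Python sketch):

```python
# Model y = a*x fitted to (1,4), (2,11), (3,14), (4,21); compare the
# sum-of-squares and sum-of-absolute-values objectives over a grid of a.
data = [(1, 4), (2, 11), (3, 14), (4, 21)]

def phi1(a):  # sum of squared residuals: a smooth parabola in a
    return sum((y - a * x) ** 2 for x, y in data)

def phi2(a):  # sum of absolute residuals: piecewise linear with kinks
    return sum(abs(y - a * x) for x, y in data)

grid = [4 + i / 1000 for i in range(2001)]  # a in [4, 6], step 0.001
a1 = min(grid, key=phi1)  # parabola minimum: sum(x*y)/sum(x^2) = 152/30
a2 = min(grid, key=phi2)  # minimum sits at a kink (a data ratio y_i/x_i)
print(a1, a2)  # a1 near 5.067, a2 = 5.25
```

The two objectives disagree: the least-squares minimizer lands at $152/30 \approx 5.067$, while the absolute-value minimizer sits exactly at the kink $a = 21/4 = 5.25$.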

Reddit
r/datascience on Reddit: Why do we use residual sum of squares rather than adding absolute values of errors in linear regression?
November 18, 2018

I am learning data science through ISLR (page 62). Why do we do $RSS = e_1^2+e_2^2+e_3^2+\cdots$ rather than $|e_1|+|e_2|+|e_3|+\cdots$, as that will be the right distance? Will squaring not skew the results?

Top answer
1 of 19
Here's the answer I wish I'd had given to me when I asked the same question during my introductory statistics classes. There are many reasons, and the two objectives do not give equivalent results.

Minimizing the sum of squared residuals is called "ordinary least squares" (OLS) and is generally the first technique students learn for estimating functions. Minimizing the sum of absolute residuals is generally called "median regression", for reasons I will discuss later, and is a somewhat less popular technique. Wikipedia indicates that the idea of median regression was actually developed first, which is unsurprising, as it is indeed more intuitive.

The issue is that there isn't a closed-form solution (i.e. a simple formula you can plug numbers into) for the coefficients that minimize the sum of absolute residuals. In contrast, summing squared residuals gives an objective function that is differentiable: differentiating, setting the derivative equal to zero, and then solving gives a formula for the coefficients that is straightforward to compute. (Technically we are using partial derivatives, and the algebra is a lot easier if you have matrices available, but the basic idea is the same as you would learn in an introductory differential calculus class.)

That was a big deal when these ideas were first being developed back in the 18th and 19th centuries, when a "computer" meant a person who performed computations by hand. Algorithms for finding the median-regression coefficients existed but were harder to implement. Today we recognize that computing these coefficients is a linear-programming optimization problem, for which many algorithms exist, most notably the simplex algorithm. So on a modern computer the two methods are basically equally easy to compute.

Then there's the question of inference. In a traditional statistics or econometrics course you would spend a lot of time developing the machinery to do things like hypothesis testing. For example, suppose I gather a random sample and get an estimated slope coefficient of 0.017, which looks small. It is useful to ask: "If the true population slope coefficient is 0, how likely is it that we could get an estimated slope coefficient of 0.017 or more extreme?" Very similar is the most basic A/B test, which asks: "If the true difference between these two groups is 0, how likely is it that I would get an observed difference as big as that observed in the data just due to random chance?"

The fact that we can explicitly write out the OLS formula makes it substantially easier to develop the statistical theory for this. There are also some optimality results, such as the Gauss–Markov Theorem, which says that OLS is "best", albeit in a very specific sense under a very restrictive set of assumptions. The statistical theory of median regression has also been figured out, but it requires more advanced math and is somewhat less elegant. So for both OLS and median regression you can compute the coefficients and perform statistical inference.

So why do most students learn about OLS and not about median regression? Part of it is path dependence: OLS was developed first, so lots of people learned it and taught it to others, and it's easier to stick with what people have learned in the past than to switch to something else (e.g. you can keep using the same textbooks). But all the things that made OLS simpler to compute and develop inference for also make it easier for students to learn.

Okay, but the pedagogy of introductory courses doesn't really matter once you get into the real world and are choosing which method to use. And these days there are pre-programmed algorithms that will do both estimation and inference for you in a single command, so the differences there don't really matter to a practitioner either. If you've got some data and want to estimate the relationship between Y and X, should you use OLS or median regression?

You're absolutely right that squaring does something subtle to the residuals. It "skews" them in the sense of disproportionately trying to reduce large residuals, whereas median regression weights small and large residuals equally. This is why advocates of median regression say it is more "robust to outliers": if there is a weird Y observation (e.g. someone got a decimal point in the wrong location when transcribing data), OLS is going to try really hard to fit that observation. That is, if you imagine a bunch of X's and Y's that pretty much follow a straight line, plus one really weird observation, the median-regression line is going to be closer to the straight line than the OLS line.

But the really exciting part happens when you pause and ask what these estimators are actually estimating. One can show theoretically that minimizing squared errors yields a conditional mean: given a particular value of X, the predicted value of Y at that X is the average (in the sense of arithmetic mean, or expected value) value of Y. In contrast, minimizing the sum of absolute errors yields the conditional median: for a fixed X, 50% of Y will be above this number and 50% below. Upon realizing this, people also realized that by weighting negative errors and positive errors differently, one can extend median regression to quantile regression (e.g. you can estimate the conditional 0.25 quantile: for a fixed X, 25% of Y will be below it and 75% above it).

So there's a school of thought that quantile regression should be used a lot more: it's robust to outliers, and by estimating several quantiles you can get a fuller picture of how things are going. On the flip side, however, there are some theoretical results suggesting OLS will give you a more precise estimate. OLS estimates are also much more interpretable (if you are confident your model has a causal interpretation, the slope is the average marginal effect of X on Y). In practice the difference between OLS and median regression usually isn't big enough to matter much; certainly not to the extent that advocates for median regression can point to 10,000 cases where median regression would perform way better.

Also, since median regression is a more advanced technique to learn, it would be better to compare it to other advanced techniques. Median regression has all the issues OLS does in terms of needing to specify exactly which variables are in the model, and in what form (e.g. squared terms, interaction terms, etc.). If your goal is just to have an algorithm that gives you some sort of sensible prediction, many other tools exist that will do a much better job (see, for example, the later chapters of the book you are reading). And if you for some reason actually do need to estimate a conditional quantile and/or do inference, you might look into the "generalized random forest" R package and associated paper by Athey et al.

And the sheer length of this comment reveals to me why my professors did not spend time explaining why they used squared residuals instead of absolute values!

Edit: Thanks for the gold!
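The outlier point above is easy to demonstrate numerically. A minimal sketch in pure Python (fabricated data: nine points exactly on y = 2x plus one wild value; OLS in closed form, the absolute-error fit by brute-force grid search, using the fact that for a fixed slope the best L1 intercept is the median of y − b·x):

```python
import statistics

xs = list(range(1, 11))
ys = [2 * x for x in xs]
ys[-1] = 100  # one wild outlier (the true value would be 20)

# OLS slope and intercept in closed form.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b_ols = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
a_ols = my - b_ols * mx

# L1 (median-regression) fit: grid-search the slope; for a fixed slope
# the intercept minimizing the absolute error is the median residual.
def l1_cost(a, b):
    return sum(abs(y - a - b * x) for x, y in zip(xs, ys))

best = None
for i in range(601):
    b = i / 100  # slope candidates in [0, 6]
    a = statistics.median(y - b * x for x, y in zip(xs, ys))
    c = l1_cost(a, b)
    if best is None or c < best[0]:
        best = (c, a, b)
_, a_l1, b_l1 = best

print(round(b_ols, 2), round(b_l1, 2))  # 6.36 2.0
```

The single bad point drags the OLS slope from 2 up past 6, while the absolute-error fit stays on the line through the other nine points.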
2 of 19
Gauss and others wanted to penalize outliers more, and squares are really easy to calculate. It really is that simple. They had to choose something and that was it. You could just as easily apply some other loss function. Edit: This post got bigger than I thought it would. The others have better answers. Mine is a little too flippant for the kind of attention this is getting.
Project Euclid
Instability of Least Squares, Least Absolute Deviation and ...
Top answer
1 of 5

Minimizing squared errors (MSE) is definitely not the same as minimizing absolute deviations (MAD) of errors. MSE provides the mean response of $y$ conditioned on $x$, while MAD provides the median response of $y$ conditioned on $x$.
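That mean-versus-median distinction is easy to verify in the simplest possible setting, fitting a single constant $c$ to data (a toy Python check; the data are made up):

```python
import statistics

data = [1, 2, 3, 100]  # three typical values plus one outlier
grid = [i / 2 for i in range(221)]  # candidate constants c in [0, 110]

def sq_cost(c):
    return sum((x - c) ** 2 for x in data)

def ab_cost(c):
    return sum(abs(x - c) for x in data)

best_sq = min(grid, key=sq_cost)
print(best_sq, statistics.mean(data))  # both 26.5: squared loss is minimized at the mean

# Any c between 2 and 3 minimizes the absolute loss; the median (2.5)
# is among the minimizers, and the outlier barely moves it.
print(ab_cost(statistics.median(data)) == min(ab_cost(c) for c in grid))  # True
```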

Historically, Laplace originally considered the maximum observed error as a measure of the correctness of a model. He soon moved to considering MAD instead. Unable to solve either problem exactly, he then considered the differentiable MSE. He and Gauss (seemingly concurrently) derived the normal equations, a closed-form solution for this problem. Nowadays, solving the MAD problem is relatively easy by means of linear programming; as is well known, however, linear programming does not give a closed-form solution.

From an optimization perspective, both correspond to convex functions. However, MSE is differentiable, thus allowing for gradient-based methods, which are much more efficient than their non-differentiable counterparts. MAD is not differentiable at $x=0$.

A further theoretical reason is that, in a Bayesian setting, when assuming uniform priors on the model parameters, minimizing the MSE corresponds to assuming normally distributed errors, which has been taken as a proof of correctness of the method. Theorists like the normal distribution because they believe it is an empirical fact, while experimentalists like it because they believe it is a theoretical result.

A final reason why MSE may have gained the wide acceptance it has is that it is based on the Euclidean distance (in fact, it is the solution of a projection problem on a Euclidean space), which is extremely intuitive given our geometrical reality.

2 of 5

As an alternative explanation, consider the following intuition:

When minimizing an error, we must decide how to penalize these errors. Indeed, the most straightforward approach to penalizing errors would be to use a linearly proportional penalty function. With such a function, each deviation from the mean is given a proportional corresponding error. Twice as far from the mean would therefore result in twice the penalty.

The more common approach is to consider a squared proportional relationship between deviations from the mean and the corresponding penalty. This will make sure that the further you are away from the mean, the proportionally more you will be penalized. Using this penalty function, outliers (far away from the mean) are deemed proportionally more informative than observations near the mean.

To give a visualisation of this, you can simply plot the two penalty functions against the error: the linear penalty $|e|$ and the quadratic penalty $e^2$.

Now especially when considering the estimation of regressions (e.g. OLS), different penalty functions will yield different results. Using the linearly proportional penalty function, the regression will assign less weight to outliers than when using the squared proportional penalty function. The least absolute deviations fit is therefore known to be a more robust estimator. In general, a robust estimator fits most of the data points well but 'ignores' outliers, whereas a least-squares fit, in comparison, is pulled more towards the outliers.

Now even though OLS is pretty much the standard, different penalty functions are most certainly in use as well. As an example, you can take a look at Matlab's robustfit function which allows you to choose a different penalty (also called 'weight') function for your regression. The penalty functions include andrews, bisquare, cauchy, fair, huber, logistic, ols, talwar and welsch. Their corresponding expressions can be found on the website as well.
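To get a feel for how such weight functions sit between the two extremes, here is the Huber penalty (standard definition; quadratic for small errors like least squares, linear for large ones like least absolute deviations; the threshold parameter name `delta` is my own choice):

```python
def huber(e, delta=1.0):
    """Huber penalty: behaves like 0.5*e**2 near zero (smooth, efficient)
    and like delta*|e| in the tails (robust to outliers)."""
    a = abs(e)
    if a <= delta:
        return 0.5 * e * e
    return delta * (a - 0.5 * delta)

print(huber(0.5))  # 0.125  (quadratic regime)
print(huber(3.0))  # 2.5    (linear regime: grows like |e|, not e**2)
```

The two pieces meet with matching value and slope at |e| = delta, which is what keeps the penalty differentiable everywhere, unlike the plain absolute value.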

I hope that helps you in getting a bit more intuition for penalty functions :)

Update

If you have Matlab, I can recommend playing with Matlab's robustdemo, which was built specifically for comparing ordinary least squares to robust regression.

The demo allows you to drag individual points and immediately see the impact on both ordinary least squares and robust regression (which is perfect for teaching purposes!).

Quora
Why do we use square error instead of absolute value when we calculate R^2 in regression analysis? - Quora
Answer (1 of 5): There are many reasons for using the square error, and there are cases one might prefer a different cost-function, but there is one really important reason to use the square error, that is often not mentioned because it is not a very impressive one: it is easier to solve than oth...