Both are done.

Least squares is easier, and the fact that for independent random variables "variances add" makes it considerably more convenient; for example, the ability to partition variances is particularly handy for comparing nested models. It's somewhat more efficient at the normal (least squares is maximum likelihood there), which might seem a good justification -- however, some robust estimators with high breakdown can have surprisingly high efficiency at the normal.
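
That efficiency point can be checked by simulation. A minimal sketch (Python, with made-up sample sizes; an intercept-only model, so the L2 fit is the sample mean and the L1 fit is the sample median): at the normal, the mean beats the median, with asymptotic relative efficiency 2/π ≈ 0.64.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 5000
samples = rng.normal(loc=0.0, scale=1.0, size=(reps, n))

# Sampling variance of each location estimator across many replications
var_mean = samples.mean(axis=1).var()          # roughly 1/n
var_median = np.median(samples, axis=1).var()  # roughly (pi/2)/n

# Efficiency of the median relative to the mean at the normal
rel_eff = var_mean / var_median
print(round(rel_eff, 2))  # typically close to 2/pi ~ 0.64
```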

But L1 norms are certainly used for regression problems, and these days fairly often.

If you use R, you might find the discussion in section 5 here useful:

https://socialsciences.mcmaster.ca/jfox/Books/Companion/appendices/Appendix-Robust-Regression.pdf

(though the material before it on M-estimation is also relevant, since L1 regression is itself a special case of M-estimation)

Answer from Glen_b on Stack Exchange
r/statistics on Reddit: Why square residuals and not use absolute value?
May 4, 2014

When deriving the coefficients for a linear regression, we minimize the sum of squared residuals. I am struggling to intuitively understand why. I know that squaring keeps the negative residuals from cancelling the positive ones. However, why not just use the absolute value? Other answers say it is because of mathematical convenience, and because of how squaring affects the influence of outliers on the regression. Are these the true reasons, and are they valid? Thanks!

Why do we use residual sum of squares rather than adding absolute values of errors in linear regression?
Here's the answer I wish someone had given me when I asked the same question in my introductory statistics classes. There are many reasons, and the two objectives do not give equivalent results. Minimizing the sum of squared residuals is called "ordinary least squares" (OLS) and is generally the first technique students learn for estimating functions. Minimizing the sum of absolute residuals is generally called "median regression" for reasons I will discuss later, and is a somewhat less popular technique. Wikipedia indicates that the idea of median regression was actually developed first, which is unsurprising, as it is indeed more intuitive.

The issue is that there isn't a closed-form solution (i.e. a simple formula you can plug numbers into) for the coefficients that minimize the sum of absolute residuals. In contrast, summing squared residuals gives an objective function that is differentiable: differentiating, setting the derivative equal to zero, and solving gives a formula for the coefficients that is straightforward to compute. (Technically we are using partial derivatives, and the algebra is a lot easier if you have matrices available, but the basic idea is the same as in an introductory differential calculus class.) That was a big deal when these ideas were first being developed in the 18th and 19th centuries, when "computer" meant someone who performed computations by hand. Algorithms for finding the median-regression coefficients existed but were harder to implement. Today we recognize that computing these coefficients is a linear-programming optimization problem, for which many algorithms exist, most notably the simplex algorithm. So on a modern computer the two methods are basically equally easy to compute.

Then there's the question of inference. In a traditional statistics or econometrics course you would spend a lot of time developing the machinery for things like hypothesis testing (e.g. suppose I gather a random sample and get an estimated slope coefficient of 0.017, which looks small; it is useful to ask, "If the true population slope coefficient is 0, how likely is it that we could get an estimated slope coefficient of 0.017 or more extreme?" Very similar is the most basic A/B test, which asks, "If the true difference between these two groups is 0, how likely is it that I would see a difference as big as the one observed in the data just due to random chance?"). The fact that we can explicitly write out the OLS formula makes it substantially easier to develop the statistical theory for this. There are also some optimality results, such as the Gauss-Markov theorem, that say OLS is "best", albeit in a very specific sense under a very restrictive set of assumptions. The statistical theory of median regression has also been figured out, but it requires more advanced math and is somewhat less elegant.

So for both OLS and median regression you can compute the coefficients and perform statistical inference. Why, then, do most students learn about OLS and not median regression? Part of it is path dependence: OLS was developed first, so lots of people learned it and taught it to others, and it's easier to stick with what people already know than to switch (e.g. you can keep using the same textbooks). But all the things that made OLS simpler to compute and develop inference for also make it easier for students to learn.

Okay, but the pedagogy of introductory courses doesn't really matter once you get into the real world and are choosing which method to use. And these days pre-programmed routines will do both estimation and inference for you in a single command, so the differences there don't really matter to a practitioner either. If you've got some data and want to estimate the relationship between Y and X, should you use OLS or median regression?

You're absolutely right that squaring does something subtle to the residuals. It "skews" them in the sense of disproportionately trying to reduce large residuals, whereas median regression weights small and large residuals equally. This is why advocates say median regression is more "robust to outliers": if there is a weird Y observation (e.g. someone put a decimal point in the wrong place when transcribing data), OLS is going to try really hard to fit that observation. That is, if you imagine a bunch of X's and Y's that pretty much follow a straight line, plus one really weird observation, the median-regression line will be closer to the straight line than the OLS line.

But the really exciting part happens when you pause and ask what these estimators are actually estimating. One can show theoretically that minimizing squared errors yields the conditional mean: given a particular value of X, the predicted value of Y at that X is the average (arithmetic mean, or expected value) of Y. In contrast, minimizing the sum of absolute errors yields the conditional median: for a fixed X, 50% of Y will be above the prediction and 50% below. Upon realizing this, people also realized that by weighting negative and positive errors differently, median regression extends to quantile regression (e.g. you can estimate the conditional 0.25 quantile: for a fixed X, 25% of Y will be below it and 75% above it). So there's a school of thought that quantile regression should be used a lot more: it's robust to outliers, and by estimating several quantiles you get a fuller picture of how things are going.

On the flip side, however, there are some theoretical results that suggest OLS will give you a more precise estimate. OLS estimates are also much more interpretable (if you are confident your model has a causal interpretation, the slope is the average marginal effect of X on Y). In practice the difference between OLS and median regression usually isn't big enough to matter much -- certainly not to the extent that advocates for median regression can point to 10,000 cases where it performs way better. Also, since median regression is a more advanced technique to learn, it is fairer to compare it to other advanced techniques. Median regression has all the issues OLS does in terms of needing to specify exactly which variables enter the model, and in what form (squared terms, interactions, etc.). If your goal is just an algorithm that gives some sort of sensible prediction, many other tools will do a much better job (see, for example, the later chapters of the book you are reading). And if you for some reason actually do need to estimate a conditional quantile and/or do inference, you might look into the "generalized random forest" R package and the associated paper by Athey et al.

The sheer length of this comment reveals to me why my professors did not spend time explaining why they used squared residuals instead of absolute values! Edit: Thanks for the gold!
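The conditional-mean / conditional-median / quantile claims in the answer above can be seen in miniature with plain location estimation (an intercept-only regression). A rough Python sketch with made-up data: minimizing squared error over a grid of candidate fits recovers the mean, absolute error recovers the median, and the asymmetric "pinball" loss recovers a quantile.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # one wild "decimal point" outlier
grid = np.linspace(0.0, 110.0, 110001)     # candidate fitted values
r = y[:, None] - grid                      # residuals for every candidate

# Squared loss: the minimizer is the mean, dragged toward the outlier
sq_fit = grid[np.argmin((r ** 2).sum(axis=0))]

# Absolute loss: the minimizer is the median, indifferent to the outlier's size
abs_fit = grid[np.argmin(np.abs(r).sum(axis=0))]

# Asymmetric "pinball" loss: the minimizer is the tau-quantile
tau = 0.25
q_fit = grid[np.argmin(np.maximum(tau * r, (tau - 1.0) * r).sum(axis=0))]

print(sq_fit, abs_fit, q_fit)  # near 22.0 (mean), 3.0 (median), 2.0 (0.25 quantile)
```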
๐ŸŒ r/datascience
110
155
November 18, 2018
Why square residuals and not use absolute value?

You can use the absolute value. This is called the L1 norm and is used for robust regression.

You can read more about it here: http://www.johnmyleswhite.com/notebook/2013/03/22/using-norms-to-understand-linear-regression/

Practically, the math is easier in ordinary least squares regression: you want to minimize the sum of squared residuals, so you can take the derivative, set it equal to 0, and solve. It's easier to compute the derivative of a polynomial than of an absolute value.
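That "differentiate, set to zero, solve" step is what yields the normal equations X'X b = X'y. A small Python sketch with simulated data (not from the post) showing the closed form in action:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, size=50)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.5, size=50)  # true intercept 2, slope 3

X = np.column_stack([np.ones_like(x), x])  # design matrix with intercept column
beta = np.linalg.solve(X.T @ X, X.T @ y)   # normal equations: X'X b = X'y

# A dedicated least-squares routine gives the same coefficients
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta, beta_lstsq))  # True
```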


regression - R absolute value of residuals with log transformation - Stack Overflow

Top answer (quoting the question):

I want to interpret the residuals but get them back on the scale of num_encounters.

You can easily calculate them:

mod <- lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
res <- df$num_encounters - exp(predict(mod))
Second answer:

In addition to what @Roland suggests, which indeed is correct and works, the problem behind my confusion was just basic high-school logarithm algebra.

Indeed the absolute response residuals (on the scale of the original dependent variable) can be calculated as @Roland says with

mod <- lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
res <- df$num_encounters - exp(predict(mod))

If you want to calculate them from the model residuals, you need to take the logarithm subtraction rule into account.

log(a)-log(b)=log(a/b)

The residual is calculated from the original model. So in my case, the model predicts log(num_encounters). So the residual is log(observed)-log(predicted).

What I was trying to do was

exp(resid) = exp(log(obs)-log(pred)) = exp(log(obs/pred)) = obs/pred

which is clearly not the number I was looking for. To get the absolute response residual from the model response residual, this is what I needed.

obs-obs/exp(resid)

So in R code, this is what you could also do:

mod <- lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
abs_resid <- df$num_encounters - df$num_encounters/exp(residuals(mod, type="response"))

This resulted in the same number as with the method described by @Roland which is much easier of course. But at least I got my brain lined up again.
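
For what it's worth, the identity behind both routes can be checked numerically. A Python sketch with simulated data (variable names mirror the R example; the `lm` fit is replaced by an ordinary least-squares fit on the log scale):

```python
import numpy as np

rng = np.random.default_rng(2)
distance = rng.uniform(1.0, 100.0, size=40)
sampling_effort = rng.uniform(0.5, 2.0, size=40)
num_encounters = np.exp(3.0 - 0.5 * np.log(distance)
                        + 0.2 * sampling_effort
                        + rng.normal(0.0, 0.3, size=40))

# Least-squares fit of log(num_encounters) on log(distance) and sampling_effort
X = np.column_stack([np.ones_like(distance), np.log(distance), sampling_effort])
beta, *_ = np.linalg.lstsq(X, np.log(num_encounters), rcond=None)
pred_log = X @ beta                        # predictions on the log scale
resid = np.log(num_encounters) - pred_log  # model-scale residuals

res_direct = num_encounters - np.exp(pred_log)                    # @Roland's route
res_from_resid = num_encounters - num_encounters / np.exp(resid)  # obs - obs/exp(resid)

print(np.allclose(res_direct, res_from_resid))  # True
```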

r/datascience on Reddit: Why do we use residual sum of squares rather than adding absolute values of errors in linear regression?
November 18, 2018

I am learning data science through ISLR (page 62). Why do we do RSS = e1^2 + e2^2 + e3^2 + ... rather than |e1| + |e2| + |e3| + ..., as that would be the right distance? Will squaring not skew the results?

Gauss and others wanted to penalize outliers more, and squares are really easy to calculate. It really is that simple: they had to choose something, and that was it. You could just as easily apply some other loss function. Edit: this post got bigger than I thought it would. The others have better answers; mine is a little too flippant for the kind of attention this is getting.
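
To make "you could just as easily apply some other loss function" concrete, here is a small Python sketch comparing how squared, absolute, and Huber loss penalize residuals of different sizes (the delta value is an arbitrary choice for illustration):

```python
import numpy as np

def squared(r):
    return r ** 2

def absolute(r):
    return np.abs(r)

def huber(r, delta=1.0):
    """Quadratic near zero, linear in the tails (delta chosen arbitrarily here)."""
    small = np.abs(r) <= delta
    return np.where(small, 0.5 * r ** 2, delta * (np.abs(r) - 0.5 * delta))

r = np.array([0.5, 2.0, 10.0])
print(squared(r))   # 0.25, 4.0, 100.0 -- the large residual dominates
print(absolute(r))  # 0.5, 2.0, 10.0   -- grows only linearly
print(huber(r))     # 0.125, 1.5, 9.5  -- a compromise between the two
```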