Actually there are some great reasons which have nothing to do with whether this is easy to calculate. The first form is called least squares, and in a probabilistic setting there are several good theoretical justifications to use it. For example, if you assume you are performing this regression on variables with normally distributed error (which is a reasonable assumption in many cases), then the least squares form is the maximum likelihood estimator. There are several other important properties.

You can read some more here.
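
As a quick numerical illustration of the maximum-likelihood point (my own sketch, not part of the answer; the data are made up, and numpy/scipy are assumed): fitting a line by least squares and by maximizing a Gaussian log-likelihood recovers the same coefficients.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up data: a line with normally distributed errors.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)
X = np.column_stack([np.ones_like(x), x])  # design matrix [1, x]

# Least squares: closed-form solution.
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gaussian maximum likelihood: for fixed sigma, maximizing the log-likelihood
# is exactly minimizing the sum of squared residuals (up to constants).
def neg_log_lik(beta):
    r = y - X @ beta
    return 0.5 * np.sum(r ** 2)

beta_mle = minimize(neg_log_lik, x0=np.zeros(2)).x
# beta_ls and beta_mle agree to numerical precision.
```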

Answer from Bitwise on Stack Exchange
Top answer
1 of 7
30

2 of 7
12

If the model is linear with respect to the parameters, setting the derivatives of the sum of squares to zero leads to simple, explicit and direct solutions (immediate if you use matrix calculations).

This is not the case for the second objective function in your post. The problem becomes nonlinear with respect to the parameters and it is much more difficult to solve. But it is doable (I would generate the starting guesses from the first objective function).

For illustration purposes, I generated a table of data points and perturbed the values using a small random relative error.

Using the first objective function, the solution is immediate.

Starting with these values as initial guesses for the second objective function (which, again, makes the problem nonlinear), it took the solver a number of iterations to converge, and all these painful iterations reduced the objective function only marginally!

There are many other possible objective functions used in regression but the traditional sum of squared errors is the only one which leads to explicit solutions.
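
To make the contrast concrete, here is a small sketch of my own (made-up data; numpy and scipy assumed): the sum of squares is solved explicitly through the normal equations, while the sum of absolute errors has no closed form and is minimized iteratively, seeded with the least-squares coefficients as suggested above.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up data for the sketch.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
X = np.column_stack([np.ones_like(x), x])

# First objective (sum of squares): explicit solution via the normal equations.
beta_sq = np.linalg.solve(X.T @ X, X.T @ y)

# Second objective (sum of absolute errors): no closed form; minimize
# iteratively, starting from the least-squares solution.
beta_abs = minimize(lambda b: np.sum(np.abs(y - X @ b)),
                    x0=beta_sq, method="Nelder-Mead").x
```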

Added later

A very small problem that you could (should, if I may) work by hand: take four data points $(x_i,y_i)$ and let the model be simply $y=a$; then search for the best value of $a$ which minimizes either $\Phi_1(a)=\sum_i |y_i-a|$ or $\Phi_2(a)=\sum_i (y_i-a)^2$. Plot the values of $\Phi_1$ and $\Phi_2$ as a function of $a$. For $\Phi_2$, you will have a nice parabola (the minimum of which is easy to find), but for $\Phi_1$ the plot shows a series of segments which lead to discontinuous derivatives at their intersections; this makes the problem much more difficult to solve.
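
The exercise can also be checked numerically. A sketch of my own, with four made-up data points and the constant model y = a (numpy assumed): the squared objective is a smooth parabola minimized at the mean, while the absolute objective is piecewise linear and flat between the two middle points.

```python
import numpy as np

# Four made-up data points; the model is simply y = a.
y = np.array([1.0, 2.0, 4.0, 7.0])
a = np.linspace(0.0, 8.0, 801)  # grid of candidate values of a

phi2 = ((y[:, None] - a) ** 2).sum(axis=0)  # sum of squares: a parabola in a
phi1 = np.abs(y[:, None] - a).sum(axis=0)   # sum of |.|: piecewise linear

# The parabola has a constant second difference and its minimum is at the
# mean of y; phi1 attains its minimum on the whole segment between the two
# middle data points, with kinks (discontinuous derivatives) at each y value.
second_diff = np.diff(phi2, 2)
```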

🌐
Reddit
reddit.com › r/math › why do mathematicians square things instead of taking the absolute value?
r/math on Reddit: Why do mathematicians square things instead of taking the absolute value?
December 6, 2010 -

I guess I've seen this the most in statistics. E.g. standard deviation, or least squares regression. Why not calculate standard deviation by simply taking the absolute value of the (xi-xbar)s?

🌐
Reddit
reddit.com › r/datascience › why do we use residual sum of squares rather than adding absolute values of errors in linear regression?
r/datascience on Reddit: Why do we use residual sum of squares rather than adding absolute values of errors in linear regression?
November 18, 2018 -

I am learning data science through ISLR (page 62). Why do we do RSS = e1² + e2² + e3² + ... rather than |e1| + |e2| + |e3|, as that will be the right distance? Will squaring not skew the results?

Top answer
1 of 19
320
Here's the answer I wish I'd had given to me when I asked the same question during my introductory statistics classes. There are many reasons, and the two objectives do not give equivalent results. Minimizing the sum of squared residuals is called "ordinary least squares" and is generally the first technique students learn for estimating functions. Minimizing the sum of absolute residuals is generally called "median regression" for reasons I will discuss later, and is a somewhat less popular technique. Wikipedia indicates that the idea of median regression was actually developed first, which is unsurprising as it is indeed more intuitive.

The issue is that there isn't a closed-form solution (i.e. a simple formula you can plug numbers into) to find the coefficients that minimize the sum of absolute residuals. In contrast, summing squared residuals gives an objective function that is differentiable: differentiating, setting the derivative equal to zero, and then solving gives a formula for the coefficients that is straightforward to compute. (Technically we are using partial derivatives, and the algebra is a lot easier if you have matrices available, but the basic idea is the same as you would learn in an introductory differential calculus class.) Now, that was a big deal when these ideas were first being developed back in the 18th and 19th centuries, as then "computer" meant someone who had to perform computations by hand. Algorithms for finding the median regression coefficients existed but were harder to implement. Today we recognize that computing these coefficients is a "linear programming" optimization problem, for which many algorithms exist, most notably the Simplex algorithm. So on a modern computer the two methods are basically equally easy to compute.

Then there's the question of inference. In a traditional statistics or econometrics course you would spend a lot of time developing the machinery to do things like hypothesis testing (e.g. 
suppose I gather a random sample and get an estimated slope coefficient of 0.017, which looks small. It is useful to ask the question "If the true population slope coefficient is 0, how likely is it that we could get an estimated slope coefficient of 0.017 or more extreme?". Very similar is the most basic A/B test, which asks "If the true difference between these two groups is 0, how likely is it that I would get an observed difference as big as that observed in the data just due to random chance?"). The fact that we can explicitly write out the OLS formula makes it substantially easier to develop the statistical theory for this. There are also some optimality results, such as the Gauss-Markov Theorem, that say that OLS is "best", albeit in a very specific sense under a very restrictive set of assumptions. The statistical theory of median regression has also been figured out, but it requires more advanced math and is somewhat less elegant.

So for both OLS and median regression you can compute the coefficients and perform statistical inference. So why do most students learn about OLS and not about median regression? Part of it is path dependence - OLS was developed first, so lots of people learned it and taught it to others, and it's just easier to stick with what people have learned in the past than to switch to something else (e.g. you can keep using the same textbooks). But all the reasons that made it simpler to compute and develop inference for also make it easier for students to learn.

Okay, but the pedagogy of introductory courses doesn't really matter once you get into the real world and are choosing which method to use. And these days there are pre-programmed algorithms that will do both estimation and inference for you in a single command, so the differences there don't really matter to a practitioner either. If you've got some data and want to estimate the relationship between Y and X, should you use OLS or median regression? 
You're absolutely right that squaring does something subtle to the residuals. It "skews" them in the sense of disproportionately trying to reduce large residuals, whereas median regression weights both small and large residuals equally. This is why advocates of median regression say that median regression is more "robust to outliers": if there is a weird Y observation (e.g. someone got a decimal point in the wrong location when transcribing data), OLS is going to try really hard to fit that observation. That is, if you imagine having a bunch of X's and Y's that pretty much follow a straight line, and then one really weird observation, the median regression line is going to be closer to the straight line than the OLS line.

But the really exciting part happens when you pause and ask what the heck these estimators are actually estimating. One can show theoretically that minimizing squared errors results in a conditional mean, i.e. given a particular value of X, the predicted value of Y at that X is the "average" (in the sense of arithmetic mean or expected value) value of Y. In contrast, minimizing the sum of absolute errors results in the conditional median: for a fixed X, 50% of Y will be above this number, and 50% below. Upon realizing this people also realized that by weighting negative errors and positive errors differently, one can actually extend median regression to quantile regression (e.g. you can estimate the conditional 0.25 quantile: for a fixed X, 25% of Y will be below it and 75% will be above it).

So there's a school of thought out there that quantile regression should be used a lot more: it's robust to outliers and by estimating it for several quantiles you can get a fuller picture of how things are going. On the flip side, however, there are some theoretical results that suggest OLS will give you a more precise estimate. 
OLS estimates are also much more interpretable (if you are confident your model has a causal interpretation, the slope is the average marginal effect of X on Y). In practice the difference between OLS and median regression usually isn't big enough to matter much - certainly not to the extent that advocates for median regression can say "here are 10,000 cases where using median regression would perform way better".

Also, since median regression is a more advanced technique to learn, it would be better to compare it to other more advanced techniques. Median regression has all the issues OLS does in terms of needing to specify exactly what variables are in the model, and in what way (e.g. squared, interaction terms, etc.). If your goal is just to have an algorithm that will give you some sort of sensible prediction, many other tools exist that will do a much better job (see, for example, the later chapters in the book you are reading). And if you for some reason actually do need to estimate a conditional quantile and/or do inference, you might look into the "generalized random forest" R package and associated paper by Athey et al.

And the sheer length of this comment reveals to me why my professors did not spend time explaining why they used squared residuals instead of absolute values! Edit: Thanks for the gold!
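
The robustness and computability points above can be sketched in a few lines (my own illustration with made-up data, not from the answer; numpy and scipy assumed). OLS comes from a closed form, while median (LAD) regression is cast as the linear program described: minimize the sum of auxiliary variables t that bound the absolute residuals from above.

```python
import numpy as np
from scipy.optimize import linprog

# Made-up data: nine points exactly on y = 2x + 1, plus one wild outlier
# (e.g. a decimal point transcribed in the wrong place).
x = np.arange(10.0)
y = 2.0 * x + 1.0
y[9] += 100.0
X = np.column_stack([np.ones_like(x), x])

# OLS: closed form; the outlier drags the fitted slope far from 2.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Median regression as a linear program:
#   minimize sum(t)  subject to  X b - y <= t,  y - X b <= t,  t >= 0.
n, k = X.shape
c = np.concatenate([np.zeros(k), np.ones(n)])
A_ub = np.block([[X, -np.eye(n)], [-X, -np.eye(n)]])
b_ub = np.concatenate([y, -y])
bounds = [(None, None)] * k + [(0.0, None)] * n
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
beta_lad = res.x[:k]
# beta_lad stays essentially at (intercept, slope) = (1, 2); beta_ols does not.
```
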
2 of 19
63
Gauss and others wanted to penalize outliers more, and squares are really easy to calculate. It really is that simple. They had to choose something and that was it. You could just as easily apply some other loss function. Edit: This post got bigger than I thought it would. The others have better answers. Mine is a little too flippant for the kind of attention this is getting.
🌐
Expii
expii.com › t › absolute-value-equations-with-sums-4163
Absolute Value Equations with Sums - Expii
Treat sums in absolute value equations like operations in parentheses. Add the terms within the absolute value brackets, apply the absolute value, then add the terms outside.
🌐
Quizlet
quizlet.com › maths › algebra
Is the absolute value of the sum of two numbers always equal to the sum of their absolute values? Explain. | Quizlet
The statement that the absolute value of the sum of two numbers is always equal to the sum of their absolute values is only true if the signs of both numbers are same; that is either both numbers are positive or both numbers are negative.
🌐
Quora
quora.com › Why-do-we-square-instead-of-using-the-absolute-value-when-calculating-variance-and-standard-deviation
Why do we square instead of using the absolute value when calculating variance and standard deviation? - Quora
This example hints at an important difference between minimizing absolute values of differences and minimizing squared differences. The usual RMS fit is especially sensitive to large errors - they get squared in the process. If the large errors are actually bad data, you have to remove them from the set before doing the fit. Minimizing the sum of absolute values of differences would give a fit that was less sensitive to outliers.
🌐
Statistics By Jim
statisticsbyjim.com › home › absolute value
Absolute Value - Statistics By Jim
May 16, 2025 - Subadditivity (Triangle Inequality): |a + b| ≤ |a| + |b|: The absolute value of a sum is less than or equal to the sum of the absolute values.
Top answer
1 of 3
13
It is quite likely. At least, the proof for the case $d\mid n$ is easy. First of all, the restriction $i\ne j$ does not matter: adding $n$ ones changes nothing in the problem. Now notice that $|\langle v_i,v_j\rangle|\ge \langle v_i,v_j\rangle^2=\langle V_i,V_j\rangle$ where $V_i=v_i\otimes v_i$. Now, $\langle V_i,I\rangle=1$ for all $i$ ($I$ is the identity matrix, as usual), so $\langle\sum_i V_i,I\rangle=n$ and, by Cauchy-Schwarz, $\|\sum_i V_i\|^2\ge n^2/\|I\|^2=n^2/d$ (the norm here is the Frobenius norm, i.e., the square root of the sum of the squares of the matrix elements), which results in $\min_{v_i}\sum_{i,j}\langle v_i,v_j\rangle^2\ge n^2/d$. For the conjectured minimizer, both this estimate and the crude inequalities $|\langle v_i,v_j\rangle|\ge \langle v_i,v_j\rangle^2$ become identities, whence the conclusion. · I do not see off hand how to modify this argument for the case $d\not\mid n$ but it still makes the conjecture quite plausible. In the worst case scenario, you are off by at most $d/4$ from the true minimum with your system.
2 of 3
7
In addition to fedja's clever argument for the case $d|n$, let me prove this for $d=2$ (and $n$ of arbitrary parity). · We have $|\cos x|\geqslant 1-\frac2\pi x$ for $x\in [0,\pi/2]$ by concavity of cosine. So, it suffices to prove that the sum of angles between lines $\ell_1,\dots,\ell_n$ (which are parallel to vectors $v_1,\dots,v_n$) is maximal when $\lfloor n/2\rfloor$ lines coincide with a certain line $a$ and the $\lceil n/2\rceil$ other lines coincide with $b\perp a$. Induct with base cases $n=1,2$. Note that if, say, $\ell_n,\ell_{n-1}$ are orthogonal, we have $\angle(\ell_i,\ell_n)+\angle(\ell_i,\ell_{n-1})\geqslant \pi/2$, and summing up these inequalities with the induction hypothesis for $\ell_1,\dots,\ell_{n-2}$ we get the result. If no two lines are orthogonal, we may move any one of them in the direction which increases the sum of angles until either two lines become orthogonal or two of them coincide. In the second case, move this pair of coinciding lines together, etc. Finally either all lines coincide, and the sum of angles is too small, or we get a pair of orthogonal lines at some step.
Top answer
1 of 16
265

If the goal of the standard deviation is to summarise the spread of a symmetrical data set (i.e. in general how far each datum is from the mean), then we need a good method of defining how to measure that spread.

The benefits of squaring include:

  • Squaring always gives a non-negative value, so the sum will always be zero or higher.
  • Squaring emphasizes larger differences, a feature that turns out to be both good and bad (think of the effect outliers have).

Squaring, however, does have a problem as a measure of spread: the units are all squared, whereas we might prefer the spread to be in the same units as the original data (think of squared pounds, squared dollars, or squared apples). Hence taking the square root allows us to return to the original units.

I suppose you could say that absolute difference assigns equal weight to the spread of data whereas squaring emphasises the extremes. Technically, though, as others have pointed out, squaring makes the algebra much easier to work with and offers properties that the absolute method does not (for example, the variance is equal to the expected value of the square of the distribution minus the square of the mean of the distribution).
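
That parenthetical identity is easy to verify on a toy sample; a quick check of my own (numpy assumed):

```python
import numpy as np

# Var(X) = E[X^2] - (E[X])^2, checked on a small sample.
x = np.array([1.0, 2.0, 3.0, 4.0])
var_direct = np.mean((x - x.mean()) ** 2)        # definition of variance
var_identity = np.mean(x ** 2) - x.mean() ** 2   # mean of squares minus squared mean
print(var_direct, var_identity)  # → 1.25 1.25
```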

It is important to note however that there's no reason you couldn't take the absolute difference if that is your preference on how you wish to view 'spread' (sort of how some people see 5% as some magical threshold for $p$-values, when in fact it is situation dependent). Indeed, there are in fact several competing methods for measuring spread.

My view is to use the squared values because I like to think of how it relates to the Pythagorean Theorem of Statistics: $c = \sqrt{a^2 + b^2}$ …this also helps me remember that when working with independent random variables, variances add, standard deviations don't. But that's just my personal subjective preference which I mostly only use as a memory aid, feel free to ignore this paragraph.

An interesting analysis can be read here:

  • Revisiting a 90-year-old debate: the advantages of the mean deviation - Stephen Gorard (Department of Educational Studies, University of York); Paper presented at the British Educational Research Association Annual Conference, University of Manchester, 16-18 September 2004
2 of 16
162

The squared difference has nicer mathematical properties; it's continuously differentiable (nice when you want to minimize it), it's a sufficient statistic for the Gaussian distribution, and it's (a version of) the L2 norm which comes in handy for proving convergence and so on.

The mean absolute deviation (the absolute value notation you suggest) is also used as a measure of dispersion, but it's not as "well-behaved" as the squared error.
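
The "well-behaved" point can be seen numerically: a central-difference derivative of $x^2$ varies continuously through zero, while that of $|x|$ jumps from $-1$ to $+1$. A small sketch of my own (numpy assumed):

```python
import numpy as np

# Central-difference derivative estimates just left and right of zero.
h = 1e-6

def deriv(f, x):
    return (f(x + h) - f(x - h)) / (2.0 * h)

left_sq, right_sq = deriv(np.square, -1e-3), deriv(np.square, 1e-3)
left_abs, right_abs = deriv(np.abs, -1e-3), deriv(np.abs, 1e-3)

# x**2: derivative ~ 2x, passing smoothly through 0.
# |x|: derivative is -1 on the left and +1 on the right -- a jump at 0,
# which is what makes minimizing absolute-value objectives harder.
```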

🌐
Wikipedia
en.wikipedia.org › wiki › Absolute_value
Absolute value - Wikipedia
1 month ago - If v is an absolute value on F, then the function d on F × F, defined by d(a, b) = v(a − b), is a metric and the following are equivalent: ... $\left\{v\left(\sum_{k=1}^{n}\mathbf{1}\right):n\in\mathbb{N}\right\}$ is bounded in $\mathbb{R}$.
🌐
HowStuffWorks
science.howstuffworks.com › physical science › math concepts
How Absolute Value Works in Equations and Graphs | HowStuffWorks
May 30, 2024 - Now, let's go back to the initial inequality: |a + b| ≤ |a| + |b|. No matter what values you plug into a and b, you'll find that the absolute value of the sum (|a + b|) is less than or equal to the sum of the absolute values (|a| + |b|).
🌐
Mathwords
mathwords.com › a › absolute_value_rules.htm
Mathwords: Absolute Value Rules
No. This is the single most common misconception. The absolute value of a sum is NOT the sum of the absolute values:
🌐
Physics Forums
physicsforums.com › mathematics › calculus
Absolute difference between increasing sum of squares
January 29, 2022 - Whether larger absolute differences start appearing at higher values of ##C## Legendre's three-square theorem states that any value coming out of the equation ##n=4^a(8b+7)## (where ##a## and ##b## are independent integers) are values that can not be expressed by ##C##. Knowing this, I'd suggest trying to approach this problem by looking how often ##\frac{n}{4^a}-7 \mod 8## resolves to 0 with increasing ##n## and ##a## within a certain ##C## range (since non-zero values means ##b## can not be an integer) and also see how close the corresponding ##n## values are.
🌐
Quora
quora.com › Why-do-we-use-residual-sum-of-squares-rather-than-adding-absolute-values-of-errors-in-linear-regression
Why do we use residual sum of squares rather than adding absolute values of errors in linear regression? - Quora
Answer (1 of 10): Optimization problems involve minimizing some cost function. Fitting data to a curve is an optimization problem. So the question becomes: why use the sum of the squared differences between the fit and the data as the cost function? It is true that one can choose to minimize the...