Actually there are some great reasons which have nothing to do with whether this is easy to calculate. The first form is called least squares, and in a probabilistic setting there are several good theoretical justifications to use it. For example, if you assume you are performing this regression on variables with normally distributed error (which is a reasonable assumption in many cases), then the least squares form is the maximum likelihood estimator. There are several other important properties.

You can read some more here.
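
As a quick numerical illustration of the maximum-likelihood point (my own sketch, not part of the answer; the data are made up, and numpy/scipy are assumed): fitting a line by least squares and by maximizing a Gaussian log-likelihood recovers the same coefficients.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up data: a line with normally distributed errors.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)
X = np.column_stack([np.ones_like(x), x])  # design matrix [1, x]

# Least squares: closed-form solution.
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gaussian maximum likelihood: for fixed sigma, maximizing the log-likelihood
# is exactly minimizing the sum of squared residuals (up to constants).
def neg_log_lik(beta):
    r = y - X @ beta
    return 0.5 * np.sum(r ** 2)

beta_mle = minimize(neg_log_lik, x0=np.zeros(2)).x
# beta_ls and beta_mle agree to numerical precision.
```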

Answer from Bitwise on Stack Exchange
Top answer
1 of 7
30

2 of 7
12

If the model is linear with respect to the parameters, setting the derivatives of the sum of squares to zero leads to simple, explicit and direct solutions (immediate if you use matrix calculations).

This is not the case for the second objective function in your post. The problem becomes nonlinear with respect to the parameters and it is much more difficult to solve. But it is doable (I would generate the starting guesses from the first objective function).

For illustration purposes, I generated a table of data points and perturbed the values using a small random relative error.

Using the first objective function, the solution is immediate.

Starting with these values as initial guesses for the second objective function (which, again, makes the problem nonlinear), it took the solver a number of iterations to converge, and all these painful iterations reduced the objective function only marginally!

There are many other possible objective functions used in regression but the traditional sum of squared errors is the only one which leads to explicit solutions.
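
To make the contrast concrete, here is a small sketch of my own (made-up data; numpy and scipy assumed): the sum of squares is solved explicitly through the normal equations, while the sum of absolute errors has no closed form and is minimized iteratively, seeded with the least-squares coefficients as suggested above.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up data for the sketch.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
X = np.column_stack([np.ones_like(x), x])

# First objective (sum of squares): explicit solution via the normal equations.
beta_sq = np.linalg.solve(X.T @ X, X.T @ y)

# Second objective (sum of absolute errors): no closed form; minimize
# iteratively, starting from the least-squares solution.
beta_abs = minimize(lambda b: np.sum(np.abs(y - X @ b)),
                    x0=beta_sq, method="Nelder-Mead").x
```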

Added later

A very small problem that you could (should, if I may) work by hand: take four data points $(x_i,y_i)$ and let the model be simply $y=a$; then search for the best value of $a$ which minimizes either $\Phi_1(a)=\sum_i |y_i-a|$ or $\Phi_2(a)=\sum_i (y_i-a)^2$. Plot the values of $\Phi_1$ and $\Phi_2$ as a function of $a$. For $\Phi_2$, you will have a nice parabola (the minimum of which is easy to find), but for $\Phi_1$ the plot shows a series of segments which lead to discontinuous derivatives at their intersections; this makes the problem much more difficult to solve.
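
The exercise can also be checked numerically. A sketch of my own, with four made-up data points and the constant model y = a (numpy assumed): the squared objective is a smooth parabola minimized at the mean, while the absolute objective is piecewise linear and flat between the two middle points.

```python
import numpy as np

# Four made-up data points; the model is simply y = a.
y = np.array([1.0, 2.0, 4.0, 7.0])
a = np.linspace(0.0, 8.0, 801)  # grid of candidate values of a

phi2 = ((y[:, None] - a) ** 2).sum(axis=0)  # sum of squares: a parabola in a
phi1 = np.abs(y[:, None] - a).sum(axis=0)   # sum of |.|: piecewise linear

# The parabola has a constant second difference and its minimum is at the
# mean of y; phi1 attains its minimum on the whole segment between the two
# middle data points, with kinks (discontinuous derivatives) at each y value.
second_diff = np.diff(phi2, 2)
```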

🌐
Reddit
reddit.com › r/math › why do mathematicians square things instead of taking the absolute value?
r/math on Reddit: Why do mathematicians square things instead of taking the absolute value?
December 6, 2010 -

I guess I've seen this the most in statistics. E.g. standard deviation, or least squares regression. Why not calculate standard deviation by simply taking the absolute value of the (xi-xbar)s?

🌐
Reddit
reddit.com › r/datascience › why do we use residual sum of squares rather than adding absolute values of errors in linear regression?
r/datascience on Reddit: Why do we use residual sum of squares rather than adding absolute values of errors in linear regression?
November 18, 2018 -

I am learning data science through ISLR (page 62). Why do we do RSS = e1² + e2² + e3² + ... rather than |e1| + |e2| + |e3|, as that will be the right distance? Will squaring not skew the results?

Top answer
1 of 19
320
Here's the answer I wish I'd had given to me when I asked the same question during my introductory statistics classes. There are many reasons, and the two objectives do not give equivalent results. Minimizing the sum of squared residuals is called "ordinary least squares" and is generally the first technique students learn for estimating functions. Minimizing the sum of absolute residuals is generally called "median regression" for reasons I will discuss later, and is a somewhat less popular technique. Wikipedia indicates that the idea of median regression was actually developed first, which is unsurprising as it is indeed more intuitive.

The issue is that there isn't a closed-form solution (i.e. a simple formula you can plug numbers into) to find the coefficients that minimize the sum of absolute residuals. In contrast, summing squared residuals gives an objective function that is differentiable: differentiating, setting the derivative equal to zero, and then solving gives a formula for the coefficients that is straightforward to compute. (Technically we are using partial derivatives, and the algebra is a lot easier if you have matrices available, but the basic idea is the same as you would learn in an introductory differential calculus class.) Now, that was a big deal when these ideas were first being developed back in the 18th and 19th centuries, as then "computer" meant someone who had to perform computations by hand. Algorithms for finding the median regression coefficients existed but were harder to implement. Today we recognize that computing these coefficients is a "linear programming" optimization problem, for which many algorithms exist, most notably the Simplex algorithm. So on a modern computer the two methods are basically equally easy to compute.

Then there's the question of inference. In a traditional statistics or econometrics course you would spend a lot of time developing the machinery to do things like hypothesis testing (e.g. 
suppose I gather a random sample and get an estimated slope coefficient of 0.017, which looks small. It is useful to ask the question "If the true population slope coefficient is 0, how likely is it that we could get an estimated slope coefficient of 0.017 or more extreme?". Very similar is the most basic A/B test, which asks "If the true difference between these two groups is 0, how likely is it that I would get an observed difference as big as that observed in the data just due to random chance?"). The fact that we can explicitly write out the OLS formula makes it substantially easier to develop the statistical theory for this. There are also some optimality results, such as the Gauss-Markov Theorem, that say that OLS is "best", albeit in a very specific sense under a very restrictive set of assumptions. The statistical theory of median regression has also been figured out, but it requires more advanced math and is somewhat less elegant.

So for both OLS and median regression you can compute the coefficients and perform statistical inference. So why do most students learn about OLS and not about median regression? Part of it is path dependence - OLS was developed first, so lots of people learned it and taught it to others, and it's just easier to stick with what people have learned in the past than to switch to something else (e.g. you can keep using the same textbooks). But all the reasons that made it simpler to compute and develop inference for also make it easier for students to learn.

Okay, but the pedagogy of introductory courses doesn't really matter once you get into the real world and are choosing which method to use. And these days there are pre-programmed algorithms that will do both estimation and inference for you in a single command, so the differences there don't really matter to a practitioner either. If you've got some data and want to estimate the relationship between Y and X, should you use OLS or median regression? 
You're absolutely right that squaring does something subtle to the residuals. It "skews" them in the sense of disproportionately trying to reduce large residuals, whereas median regression weights both small and large residuals equally. This is why advocates of median regression say that median regression is more "robust to outliers": if there is a weird Y observation (e.g. someone got a decimal point in the wrong location when transcribing data), OLS is going to try really hard to fit that observation. That is, if you imagine having a bunch of X's and Y's that pretty much follow a straight line, and then one really weird observation, the median regression line is going to be closer to the straight line than the OLS line.

But the really exciting part happens when you pause and ask what the heck these estimators are actually estimating. One can show theoretically that minimizing squared errors results in a conditional mean, i.e. given a particular value of X, the predicted value of Y at that X is the "average" (in the sense of arithmetic mean or expected value) value of Y. In contrast, minimizing the sum of absolute errors results in the conditional median: for a fixed X, 50% of Y will be above this number, and 50% below. Upon realizing this people also realized that by weighting negative errors and positive errors differently, one can actually extend median regression to quantile regression (e.g. you can estimate the conditional 0.25 quantile: for a fixed X, 25% of Y will be below it and 75% will be above it).

So there's a school of thought out there that quantile regression should be used a lot more: it's robust to outliers and by estimating it for several quantiles you can get a fuller picture of how things are going. On the flip side, however, there are some theoretical results that suggest OLS will give you a more precise estimate. 
OLS estimates are also much more interpretable (if you are confident your model has a causal interpretation, the slope is the average marginal effect of X on Y). In practice the difference between OLS and median regression usually isn't big enough to matter much - certainly not to the extent that advocates for median regression can say "here are 10,000 cases where using median regression would perform way better".

Also, since median regression is a more advanced technique to learn, it would be better to compare it to other more advanced techniques. Median regression has all the issues OLS does in terms of needing to specify exactly what variables are in the model, and in what way (e.g. squared, interaction terms, etc.). If your goal is just to have an algorithm that will give you some sort of sensible prediction, many other tools exist that will do a much better job (see, for example, the later chapters in the book you are reading). And if you for some reason actually do need to estimate a conditional quantile and/or do inference, you might look into the "generalized random forest" R package and associated paper by Athey et al.

And the sheer length of this comment reveals to me why my professors did not spend time explaining why they used squared residuals instead of absolute values! Edit: Thanks for the gold!
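
The robustness and computability points above can be sketched in a few lines (my own illustration with made-up data, not from the answer; numpy and scipy assumed). OLS comes from a closed form, while median (LAD) regression is cast as the linear program described: minimize the sum of auxiliary variables t that bound the absolute residuals from above.

```python
import numpy as np
from scipy.optimize import linprog

# Made-up data: nine points exactly on y = 2x + 1, plus one wild outlier
# (e.g. a decimal point transcribed in the wrong place).
x = np.arange(10.0)
y = 2.0 * x + 1.0
y[9] += 100.0
X = np.column_stack([np.ones_like(x), x])

# OLS: closed form; the outlier drags the fitted slope far from 2.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Median regression as a linear program:
#   minimize sum(t)  subject to  X b - y <= t,  y - X b <= t,  t >= 0.
n, k = X.shape
c = np.concatenate([np.zeros(k), np.ones(n)])
A_ub = np.block([[X, -np.eye(n)], [-X, -np.eye(n)]])
b_ub = np.concatenate([y, -y])
bounds = [(None, None)] * k + [(0.0, None)] * n
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
beta_lad = res.x[:k]
# beta_lad stays essentially at (intercept, slope) = (1, 2); beta_ols does not.
```
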
2 of 19
63
Gauss and others wanted to penalize outliers more, and squares are really easy to calculate. It really is that simple. They had to choose something and that was it. You could just as easily apply some other loss function. Edit: This post got bigger than I thought it would. The others have better answers. Mine is a little too flippant for the kind of attention this is getting.
🌐
Expii
expii.com › t › absolute-value-equations-with-sums-4163
Absolute Value Equations with Sums - Expii
Treat sums in absolute value equations like operations in parentheses. Add the terms within the absolute value brackets, apply the absolute value, then add the terms outside.
🌐
Quizlet
quizlet.com › maths › algebra
Is the absolute value of the sum of two numbers always equal to the sum of their absolute values? Explain. | Quizlet
The statement that the absolute value of the sum of two numbers is always equal to the sum of their absolute values is only true if the signs of both numbers are same; that is either both numbers are positive or both numbers are negative.
🌐
Quora
quora.com › Why-do-we-square-instead-of-using-the-absolute-value-when-calculating-variance-and-standard-deviation
Why do we square instead of using the absolute value when calculating variance and standard deviation? - Quora
This example hints at an important difference between minimizing absolute values of differences and minimizing squared differences. The usual RMS fit is especially sensitive to large errors - they get squared in the process. If the large errors are actually bad data, you have to remove them from the set before doing the fit. Minimizing the sum of absolute values of differences would give a fit that was less sensitive to outliers.
🌐
Statistics By Jim
statisticsbyjim.com › home › absolute value
Absolute Value - Statistics By Jim
May 16, 2025 - Subadditivity (Triangle Inequality): |a + b| ≤ |a| + |b|: The absolute value of a sum is less than or equal to the sum of the absolute values.
Top answer
1 of 3
13
It is quite likely. At least, the proof for the case $d\mid n$ is easy. First of all, the restriction $i\ne j$ does not matter: adding $n$ ones changes nothing in the problem. Now notice that $|\langle v_i,v_j\rangle|\ge \langle v_i,v_j\rangle^2=\langle V_i,V_j\rangle$ where $V_i=v_i\otimes v_i$. Now, $\langle V_i,I\rangle=1$ for all $i$ ($I$ is the identity matrix, as usual), so $\langle\sum_i V_i,I\rangle=n$ and, by Cauchy-Schwarz, $\|\sum_i V_i\|^2\ge n^2/\|I\|^2=n^2/d$ (the norm here is the Frobenius norm, i.e., the square root of the sum of the squares of the matrix elements), which results in $\min_{v_i}\sum_{i,j}\langle v_i,v_j\rangle^2\ge n^2/d$. For the conjectured minimizer, both this estimate and the crude inequalities $|\langle v_i,v_j\rangle|\ge \langle v_i,v_j\rangle^2$ become identities, whence the conclusion. · I do not see off hand how to modify this argument for the case $d\not\mid n$ but it still makes the conjecture quite plausible. In the worst case scenario, you are off by at most $d/4$ from the true minimum with your system.
2 of 3
7
In addition to fedja's clever argument for the case $d|n$, let me prove this for $d=2$ (and $n$ of arbitrary parity). · We have $|\cos x|\geqslant 1-\frac2\pi x$ for $x\in [0,\pi/2]$ by concavity of cosine. So, it suffices to prove that the sum of angles between lines $\ell_1,\dots,\ell_n$ (which are parallel to vectors $v_1,\dots,v_n$) is maximal when $\lfloor n/2\rfloor$ lines coincide with a certain line $a$ and the $\lceil n/2\rceil$ other lines coincide with $b\perp a$. Induct with base cases $n=1,2$. Note that if, say, $\ell_n,\ell_{n-1}$ are orthogonal, we have $\angle(\ell_i,\ell_n)+\angle(\ell_i,\ell_{n-1})\geqslant \pi/2$, and summing up these inequalities with the induction hypothesis for $\ell_1,\dots,\ell_{n-2}$ we get the result. If no two lines are orthogonal, we may move any one of them in the direction which increases the sum of angles until either two lines become orthogonal or two of them coincide. In the second case, move this pair of coinciding lines together, etc. Finally either all lines coincide, and the sum of angles is too small, or we get a pair of orthogonal lines at some step.
Top answer
1 of 16
265

If the goal of the standard deviation is to summarise the spread of a symmetrical data set (i.e. in general how far each datum is from the mean), then we need a good method of defining how to measure that spread.

The benefits of squaring include:

  • Squaring always gives a non-negative value, so the sum will always be zero or higher.
  • Squaring emphasizes larger differences, a feature that turns out to be both good and bad (think of the effect outliers have).

Squaring, however, does have a problem as a measure of spread: the units are all squared, whereas we might prefer the spread to be in the same units as the original data (think of squared pounds, squared dollars, or squared apples). Hence taking the square root allows us to return to the original units.

I suppose you could say that absolute difference assigns equal weight to the spread of data whereas squaring emphasises the extremes. Technically, though, as others have pointed out, squaring makes the algebra much easier to work with and offers properties that the absolute method does not (for example, the variance is equal to the expected value of the square of the distribution minus the square of the mean of the distribution).
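
That parenthetical identity is easy to verify on a toy sample; a quick check of my own (numpy assumed):

```python
import numpy as np

# Var(X) = E[X^2] - (E[X])^2, checked on a small sample.
x = np.array([1.0, 2.0, 3.0, 4.0])
var_direct = np.mean((x - x.mean()) ** 2)        # definition of variance
var_identity = np.mean(x ** 2) - x.mean() ** 2   # mean of squares minus squared mean
print(var_direct, var_identity)  # → 1.25 1.25
```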

It is important to note however that there's no reason you couldn't take the absolute difference if that is your preference on how you wish to view 'spread' (sort of how some people see 5% as some magical threshold for $p$-values, when in fact it is situation dependent). Indeed, there are in fact several competing methods for measuring spread.

My view is to use the squared values because I like to think of how it relates to the Pythagorean Theorem of Statistics: $c = \sqrt{a^2 + b^2}$ …this also helps me remember that when working with independent random variables, variances add, standard deviations don't. But that's just my personal subjective preference which I mostly only use as a memory aid, feel free to ignore this paragraph.

An interesting analysis can be read here:

  • Revisiting a 90-year-old debate: the advantages of the mean deviation - Stephen Gorard (Department of Educational Studies, University of York); Paper presented at the British Educational Research Association Annual Conference, University of Manchester, 16-18 September 2004
2 of 16
162

The squared difference has nicer mathematical properties; it's continuously differentiable (nice when you want to minimize it), it's a sufficient statistic for the Gaussian distribution, and it's (a version of) the L2 norm which comes in handy for proving convergence and so on.

The mean absolute deviation (the absolute value notation you suggest) is also used as a measure of dispersion, but it's not as "well-behaved" as the squared error.
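
The "well-behaved" point can be seen numerically: a central-difference derivative of $x^2$ varies continuously through zero, while that of $|x|$ jumps from $-1$ to $+1$. A small sketch of my own (numpy assumed):

```python
import numpy as np

# Central-difference derivative estimates just left and right of zero.
h = 1e-6

def deriv(f, x):
    return (f(x + h) - f(x - h)) / (2.0 * h)

left_sq, right_sq = deriv(np.square, -1e-3), deriv(np.square, 1e-3)
left_abs, right_abs = deriv(np.abs, -1e-3), deriv(np.abs, 1e-3)

# x**2: derivative ~ 2x, passing smoothly through 0.
# |x|: derivative is -1 on the left and +1 on the right -- a jump at 0,
# which is what makes minimizing absolute-value objectives harder.
```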

🌐
Wikipedia
en.wikipedia.org › wiki › Absolute_value
Absolute value - Wikipedia
1 month ago - If v is an absolute value on F, then the function d on F × F, defined by d(a, b) = v(a − b), is a metric and the following are equivalent: ... $\left\{v\left(\sum_{k=1}^{n}\mathbf{1}\right):n\in\mathbb{N}\right\}$ is bounded in $\mathbb{R}$.
🌐
HowStuffWorks
science.howstuffworks.com › physical science › math concepts
How Absolute Value Works in Equations and Graphs | HowStuffWorks
May 30, 2024 - Now, let's go back to the initial inequality: |a + b| ≤ |a| + |b|. No matter what values you plug into a and b, you'll find that the absolute value of the sum (|a + b|) is less than or equal to the sum of the absolute values (|a| + |b|).
🌐
Mathwords
mathwords.com › a › absolute_value_rules.htm
Mathwords: Absolute Value Rules
No. This is the single most common misconception. The absolute value of a sum is NOT the sum of the absolute values:
🌐
Physics Forums
physicsforums.com › mathematics › calculus
Absolute difference between increasing sum of squares
January 29, 2022 - Whether larger absolute differences start appearing at higher values of ##C## Legendre's three-square theorem states that any value coming out of the equation ##n=4^a(8b+7)## (where ##a## and ##b## are independent integers) are values that can not be expressed by ##C##. Knowing this, I'd suggest trying to approach this problem by looking how often ##\frac{n}{4^a}-7 \mod 8## resolves to 0 with increasing ##n## and ##a## within a certain ##C## range (since non-zero values means ##b## can not be an integer) and also see how close the corresponding ##n## values are.
🌐
Quora
quora.com › Why-do-we-use-residual-sum-of-squares-rather-than-adding-absolute-values-of-errors-in-linear-regression
Why do we use residual sum of squares rather than adding absolute values of errors in linear regression? - Quora
Answer (1 of 10): Optimization problems involve minimizing some cost function. Fitting data to a curve is an optimization problem. So the question becomes: why use the sum of the squared differences between the fit and the data as the cost function? It is true that one can choose to minimize the...