Both are done.
Least squares is easier, and the fact that for independent random variables "variances add" makes it considerably more convenient; for example, the ability to partition variances is particularly handy for comparing nested models. It's somewhat more efficient at the normal (least squares is maximum likelihood there), which might seem to be a good justification -- however, some robust estimators with high breakdown can have surprisingly high efficiency at the normal.
But L1 norms are certainly used for regression problems and these days relatively often.
If you use R, you might find the discussion in section 5 here useful:
https://socialsciences.mcmaster.ca/jfox/Books/Companion/appendices/Appendix-Robust-Regression.pdf
(though the material before it on M-estimation is also relevant, since L1 regression is also a special case of that)
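To make the robustness contrast in the answer above concrete, here is a minimal sketch (my addition, in Python rather than the R discussed above); the data and helper names are made up for illustration, and the L1 fit is approximated by iteratively reweighted least squares. A single gross outlier pulls the OLS slope well away from the true value while barely moving the L1 slope:

```python
import numpy as np

def fit_ols(X, y):
    # ordinary least squares via numpy's lstsq
    return np.linalg.lstsq(X, y, rcond=None)[0]

def fit_lad(X, y, n_iter=100, eps=1e-8):
    # approximate L1 (least absolute deviations) fit by iteratively
    # reweighted least squares: minimizing sum w_i * r_i^2 with
    # w_i = 1/|r_i| approximates minimizing sum |r_i|
    beta = fit_ols(X, y)
    for _ in range(n_iter):
        r = y - X @ beta
        sw = 1.0 / np.sqrt(np.maximum(np.abs(r), eps))  # sqrt of weights
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return beta

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
X = np.column_stack([np.ones_like(x), x])        # intercept + slope
y = 2.0 + 0.5 * x + rng.normal(0, 0.2, 50)       # true slope 0.5
y[-1] += 30                                      # one gross outlier

b_ols = fit_ols(X, y)
b_lad = fit_lad(X, y)
# the outlier drags the OLS slope away from 0.5; the L1 slope stays close
print("OLS slope:", round(b_ols[1], 3), " L1 slope:", round(b_lad[1], 3))
```

In practice one would use a packaged routine (e.g. `rq` in R's quantreg, which the linked appendix discusses); the IRLS loop here is just the shortest self-contained way to show the idea.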
Answer from Glen_b on Stack Exchange

The original question: When deriving the coefficients for a linear regression, we minimize the sum of squared residuals. I am struggling to understand intuitively why. I know that squaring stops the negative residuals from cancelling out the positive ones. However, why not just use the absolute value? Other answers say it is because of mathematical convenience, and because squaring gives outliers a larger effect on the regression. Are these the true reasons, and are they valid? Why would squaring give outliers a larger effect? Thanks!
You can use the absolute value. This is called the L1 norm and is used for robust regression.
You can read more about it here: http://www.johnmyleswhite.com/notebook/2013/03/22/using-norms-to-understand-linear-regression/
Practically, the math is easier in ordinary least squares regression: you want to minimize the squared residuals, so you can take the derivative, set it equal to 0, and solve. It is easier to differentiate a polynomial than an absolute value, which is not differentiable at zero.
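To spell out the "take the derivative" step (a sketch I am adding, for the simplest one-parameter model through the origin, $y_i = \beta x_i + \varepsilon_i$):

$$\frac{d}{d\beta}\sum_i (y_i - \beta x_i)^2 = -2\sum_i x_i (y_i - \beta x_i) = 0 \quad\Longrightarrow\quad \hat\beta = \frac{\sum_i x_i y_i}{\sum_i x_i^2}.$$

By contrast, $\sum_i |y_i - \beta x_i|$ has a kink wherever a residual hits zero, so you cannot set a derivative to zero and solve in closed form.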
We can do absolute values for regression; it's called L1 regression, and people certainly use it. It's certainly not as convenient as ordinary least squares (L2) regression, but that's what computers are for.
What is substantially more convenient with least squares is the inference (hypothesis tests and CIs), but again that's not such an issue these days; computers can help deal with that too.
However, if the error distribution is close to normal, least squares will be substantially more efficient.
I can't help quoting from Huber, Robust Statistics, p. 10 on this (sorry, the quote is too long to fit in a comment):

"Two time-honored measures of scatter are the mean absolute deviation
$$d_n = \frac{1}{n}\sum_i |x_i - \bar{x}|$$
and the mean square deviation
$$s_n = \sqrt{\frac{1}{n}\sum_i (x_i - \bar{x})^2}.$$
[...] There was a dispute between Eddington (1914, p. 147) and Fisher (1920, footnote on p. 762) about the relative merits of $d_n$ and $s_n$. [...] Fisher seemingly settled the matter by pointing out that for normal observations $s_n$ is about 12% more efficient than $d_n$."

By the relation between the conditional mean and the unconditional mean, a similar argument applies to the residuals.
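Fisher's 12% figure can be checked with a quick simulation (my addition, not part of the quote, in Python): scale $d_n$ by $\sqrt{\pi/2}$ so that both estimators are consistent for $\sigma$ at the normal, then compare their sampling variances:

```python
import numpy as np

rng = np.random.default_rng(42)
n, reps = 100, 20000
x = rng.normal(0.0, 1.0, size=(reps, n))     # normal samples, true sigma = 1

xbar = x.mean(axis=1, keepdims=True)
s_n = np.sqrt(((x - xbar) ** 2).mean(axis=1))             # root mean square deviation
d_n = np.abs(x - xbar).mean(axis=1) * np.sqrt(np.pi / 2)  # scaled to estimate sigma

# relative efficiency of d_n with respect to s_n at the normal:
# asymptotically (1/2) / (pi/2 - 1), about 0.88
eff = s_n.var() / d_n.var()
print(round(eff, 3))
```

A value near 0.88 means $s_n$ needs roughly 12% fewer observations than $d_n$ for the same precision, matching Fisher's point.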
Answer from Glen_b on Stack Exchange

The next question: I want to interpret the residuals but get them back on the scale of num_encounters.
You can easily calculate them:
mod <- lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
res <- df$num_encounters - exp(predict(mod))
In addition to what @Roland suggests, which indeed is correct and works, my confusion was really just about basic high-school logarithm algebra.
Indeed the absolute response residuals (on the scale of the original dependent variable) can be calculated as @Roland says with
mod <- lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
res <- df$num_encounters - exp(predict(mod))
If you want to calculate them from the model residuals, you need to take the logarithm subtraction rule into account:
log(a) - log(b) = log(a/b)
The residual is calculated on the scale of the original model. In my case, the model predicts log(num_encounters), so the residual is log(observed) - log(predicted).
What I was trying to do was
exp(resid) = exp(log(obs) - log(pred)) = exp(log(obs/pred)) = obs/pred
which is clearly not the number I was looking for. To get the response residual on the original scale from the model residual, this is what I needed:
obs - obs/exp(resid) = obs - obs/(obs/pred) = obs - pred
So in R code, this is what you could also do:
mod <- lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
abs_resid <- df$num_encounters - df$num_encounters/exp(residuals(mod, type="response"))
This resulted in the same numbers as the method described by @Roland, which is of course much easier. But at least I got my brain lined up again.
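The identity above is easy to check numerically. A sketch in Python (the thread's code is R), using made-up data standing in for num_encounters and a simplified model without the sampling_effort term:

```python
import numpy as np

rng = np.random.default_rng(1)
distance = rng.uniform(1, 10, 40)
# made-up positive response playing the role of num_encounters
num_encounters = np.exp(3.0 - 0.8 * np.log(distance) + rng.normal(0, 0.1, 40))

# fit log(y) ~ log(distance) by least squares
X = np.column_stack([np.ones_like(distance), np.log(distance)])
beta, *_ = np.linalg.lstsq(X, np.log(num_encounters), rcond=None)
log_pred = X @ beta
resid = np.log(num_encounters) - log_pred            # residuals on the log scale

# response-scale residuals, two ways
r1 = num_encounters - np.exp(log_pred)                  # obs - pred
r2 = num_encounters - num_encounters / np.exp(resid)    # obs - obs/exp(resid)
print(bool(np.allclose(r1, r2)))
```

Both formulas give the same response-scale residuals, confirming the algebra in the answer.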
I am learning data science through ISLR (page 62). Why do we use RSS $= e_1^2 + e_2^2 + e_3^2 + \dots$ rather than $|e_1| + |e_2| + |e_3| + \dots$, since the absolute value is the actual distance? Will squaring not skew the results?