I know they're measures of spread, but I don't get what they're actually showing; Google and my teacher are no use. Also, what's the difference? To me, variance is just standard deviation squared, which makes it seem sort of redundant.
What actually is standard deviation and variance?
Well, I'm going to answer this simply (yeah yeah I'm an experimentalist) and let the statisticians answer it correctly.
Standard deviation (sigma) around a normally distributed mean tells you where you'll find about 68% of the population. So if the mean height is 5 feet and the standard deviation is 1 foot, about 68% of the population will be between 4 and 6 feet tall. About 95% of the population will be within 2 standard deviations of the mean, or between 3 and 7 feet tall.
So why bother with variance (sigma squared)? Well, the shitty answer I'm going to give you is that it is a necessary step before calculating standard deviation, which experimentalists care more about. Variance is the mean of the squared deviations from the mean (or, in a fit, from the best-fit line). You square the deviations when you calculate them so that only the size of the deviation matters, not whether it lies above or below the mean (or line). But squared deviations are not in the same "scale" as your actual data (you didn't square your data), so we then take the square root of the variance to get the standard deviation, which is in the same "scale" as your data.
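To make that "same scale" point concrete, here is a minimal Python sketch (with made-up height data, echoing the example above) computing the variance as the average squared deviation and the standard deviation as its square root:

```python
import numpy as np

# Made-up heights in feet, roughly centered at 5
heights = np.array([4.2, 4.8, 5.0, 5.1, 5.3, 5.6, 6.0])

mean = heights.mean()                      # feet
deviations = heights - mean                # feet
variance = (deviations ** 2).mean()        # feet squared
std_dev = np.sqrt(variance)                # back to feet

print(mean, variance, std_dev)
```

(This divides by n; np.var and np.std do the same by default, and you'd pass ddof=1 for the sample versions that divide by n - 1.)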
Also, variance is useful for estimating distribution parameters, which I'm not gonna get into.
A late answer, just for completeness with a different view on the thing.
You might look at your data as measured in a multidimensional space, where each subject is a dimension and each item is a vector in that space, pointing from the origin to that item's measurements over the full set of subjects.
Additional remark: this view of things has an additional nice flavour because it uncovers the condition that the subjects are assumed independent of each other. That assumption is what makes the data space Euclidean; dropping the independence condition changes the mathematics of the space, which then has correlated (or "oblique") axes.
Now the distance from one vector arrowhead to another is just the usual formula for distances in Euclidean space, the square root of the sum of squared coordinate differences (from the Pythagorean theorem): $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$. And the standard deviation is that value, normed by the number of subjects, if the mean vector $(\bar{x}, \dots, \bar{x})$ is taken as the second vector: $\sigma = \frac{1}{\sqrt{n}} \sqrt{\sum_i (x_i - \bar{x})^2}$.
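A tiny numerical check of that geometric picture (a sketch, with made-up numbers): the standard deviation is exactly the Euclidean distance between the data vector and the mean vector, divided by $\sqrt{n}$.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up measurements
mean_vec = np.full_like(x, x.mean())                     # the vector (x̄, ..., x̄)

euclid_dist = np.linalg.norm(x - mean_vec)               # Pythagorean distance between the two arrowheads
sd_from_geometry = euclid_dist / np.sqrt(len(x))

print(sd_from_geometry, np.std(x))                       # both 2.0
```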
They don't measure the same thing. To see this, think about physical units.
Suppose the value of $x_i$ is measured in seconds. For example, $n$ people do a 100-meter race and the values $x_1, \dots, x_n$ are how many seconds it took each one to finish.
The formula $x_i - \bar{x}$ measures the difference of two times, so it's also measured in seconds.
The mean absolute deviation is therefore an average of second-values, so it's also measured in seconds.
However, the formula $(x_i - \bar{x})^2$ squares the difference of two times, so it's measured in seconds squared. The variance is therefore also in seconds squared. They don't belong to the same physical space of variables, so they measure different things.
The standard deviation, however (the square root of the variance) is again measured in seconds, so it measures something similar (at least, physically similar).
As for why we like the square-root-of-average-of-squares better than the average-of-absolute-values - the square has better mathematical properties, as shown in other answers and in the link you referred to (particularly Rich's answer).
There is a discussion on Khan Academy on the same topic; I found it helpful. The reasons that the standard deviation is preferred to the mean absolute deviation are complicated. To start, let me address your list: yes, we can use other powers for the deviations, but not just any power. Using the absolute values is not uncommon, and results in the Mean Absolute Deviation. Using squared deviations gives us the Variance (and by square-rooting, the standard deviation).
We don't use, say, the power of 3, because then positive and negative deviations would cancel each other out, which we don't want to do. We could use higher even powers (or define the deviations as a power of the absolute value), but we really don't want to do this. Why? For a few reasons.
- The effect of outliers.
- The concept of central tendency.
- The cleanliness of the math.
- Interpretability and Harmony with other concepts.
To explain each of these:
Using squared deviations already places higher weight on larger deviations than using the absolute value. This means that large deviations will try to "pull" results towards themselves, and are able to do so more strongly than the main mass of the data. Many Statisticians already think using squared deviations gives too much weight to large deviations. If we used a higher power, we would be giving even more weight to them.
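A quick numerical sketch (made-up data) of that extra weight: with one outlier in the data, that single point accounts for about half of the total absolute deviation but over 80% of the total squared deviation.

```python
import numpy as np

data = np.array([9.0, 10.0, 10.0, 11.0, 10.0, 30.0])  # made-up data; the last value is an outlier
dev = data - data.mean()

abs_share = np.abs(dev)[-1] / np.abs(dev).sum()        # outlier's share of total absolute deviation
sq_share = (dev ** 2)[-1] / (dev ** 2).sum()           # outlier's share of total squared deviation

print(f"share under absolute values: {abs_share:.0%}")  # ~50%
print(f"share under squares:         {sq_share:.0%}")   # ~83%
```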
In Statistics, we like to have "small" variability. As a result of this, we define our measure of central tendency to be a function of how the deviations are measured. That may be gibberish to you, so I'll clarify. We express our variability as:
$$\frac{1}{n}\sum_{i=1}^{n} |x_i - \theta|^k$$
When k=1 we have absolute deviations, and get the MAD. When k=2, we have squared deviations, and get the Variance. Then the question is: For a given value of k, what is the best value of θ? In other words: what value of θ is going to give us the smallest measure of variability? It turns out that when k=2, we get θ to be the sample mean, xbar. But this isn't always the case. For instance, when k=1, we get θ to be the sample median, not the sample mean (side note: this means that Sal's formula in "Part 5" is wrong, he should be subtracting the median, not the mean).
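Here is a small numerical sketch of that claim (made-up, deliberately skewed data), just scanning candidate values of θ over a dense grid: the minimizer for k=1 lands on the sample median, and the minimizer for k=2 lands on the sample mean.

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0, 3.0, 10.0])           # made-up, skewed data
thetas = np.linspace(x.min(), x.max(), 100001)      # dense grid of candidate centers

def criterion(k):
    """Average k-th power of absolute deviations from each candidate theta."""
    return np.mean(np.abs(x[:, None] - thetas) ** k, axis=0)

best_k1 = thetas[np.argmin(criterion(1))]
best_k2 = thetas[np.argmin(criterion(2))]

print(best_k1, np.median(x))   # both ~2.0
print(best_k2, np.mean(x))     # both ~3.6
```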
Getting θ = xbar is a "nice" result. The sample mean has been known for a long time (since the Ancient Greeks, at least), and when people were developing the idea of variability as a quantity we can calculate, they wanted the "best" measure of center to be the sample mean. They tried using k=1, but since that gave the median, they turned to k=2.
The math is much more clean when we use k=2. It's fairly simple to prove that when k=2, θ is the sample mean. It's messier to show that for k=1, θ is the sample median. I had to do it, once, in my PhD coursework. Out of curiosity, I tried using k=4 to figure out what θ should be. I abandoned this once I expanded (xi - θ)^3, because it seemed too messy to pursue for no real reason. This messiness compounds itself when we try to move beyond simply a measure of center, and into more complex models.
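For reference, the k=2 case really is a one-line calculus argument: setting the derivative of the total squared deviation to zero,

$$\frac{d}{d\theta}\sum_{i=1}^{n}(x_i-\theta)^2 = -2\sum_{i=1}^{n}(x_i-\theta) = 0 \quad\Longrightarrow\quad \theta = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}.$$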
Related to point 2, we like the sample mean. In particular, there is a very nice theorem stating that as the sample size increases, the sampling distribution of the sample mean converges towards the Normal distribution (you may not have gotten to probability distributions yet, but this will come). Knowing that a particular value will have a Normal distribution under pretty mild conditions is extremely useful. Hence, we "want" to be able to use the sample mean, because it means we can build a lot of theory and a lot of methods using the Normal distribution. Using the sample mean goes hand in hand with squaring deviations (k=2).
Also, the variance is a special case of the "covariance", which describes how two variables interact (the variance is the case when the two variables are actually the exact same variable). So using squared deviations is a natural solution that fits well with other concepts. The idea of squaring deviations also arises naturally out of simple mathematics, irrespective of anything else. For instance, if we wanted to do linear regression and say that y = Xβ, where y and X are a vector and a matrix, respectively, and β is a vector, then simple linear algebra will yield the same solution as if we used squared deviations (seeing this connection isn't quite as clear, but it is true).
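A small sketch of that regression connection (made-up data, plain NumPy): the coefficients from the linear-algebra "normal equations" solve of y = Xβ agree with the explicit least-squares fit that minimizes the sum of squared deviations.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])  # intercept column + one predictor
beta_true = np.array([2.0, 3.0])
y = X @ beta_true + rng.normal(scale=0.5, size=50)        # made-up noisy responses

# Pure linear algebra: solve (X^T X) beta = X^T y
beta_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)

# Explicit least-squares fit (minimizes the sum of squared deviations)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal_eq, beta_lstsq)   # the two agree
```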
There are statistical quantities based on the third and fourth powers. They are called, respectively, skew and kurtosis.
Skew is relatively easy to demonstrate. It is an asymmetry in the two tails of the distribution (or the lack of a tail altogether on one side). For example, pick a chi-squared distribution with a small number of degrees of freedom; it has an obvious skew. (The skew is present but smaller for larger numbers of degrees of freedom.)
Kurtosis measures how much a distribution tends to have outliers ("heavy tails").
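A minimal sketch (made-up sample, plain NumPy) of how those third- and fourth-power quantities are computed as standardized central moments:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(size=10_000)    # an exponential sample has an obvious right skew

dev = data - data.mean()
sigma = np.sqrt((dev ** 2).mean())

skew = (dev ** 3).mean() / sigma ** 3                   # roughly 2 for the exponential distribution
excess_kurtosis = (dev ** 4).mean() / sigma ** 4 - 3    # roughly 6 for the exponential distribution

print(skew, excess_kurtosis)
```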
Neither one of these is intrinsically any better at explaining the movements of billions of dollars in the stock market than explaining the movement of handfuls of dollars at a blackjack table. You can just as easily get $10^{12}$ by squaring $10^6$ as by taking the fourth power of $10^3$, so the amount of money involved is pretty much irrelevant mathematically.
There is, in fact, an infinite series of moments of a probability distribution, of which the mean is the first (raw) moment and the variance is the second central moment. Skew and kurtosis are based on the third and fourth central moments. The reason you don't see much use of moments higher than the second is that, ironically, their effects are secondary to the effects of the variance, despite the higher exponents in their definitions. In fact, the first moment is in many ways the most important; that's why we call it the expected value.
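Spelled out, the $k$-th central moment of a random variable $X$ with mean $\mu = E[X]$ is

$$\mu_k = E\big[(X-\mu)^k\big],$$

so the variance is $\mu_2 = \sigma^2$, and the standardized third and fourth central moments, $\mu_3/\sigma^3$ and $\mu_4/\sigma^4$, are the skew and kurtosis referred to above.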
As for using the absolute value in order to "correct" the third power: actually, one of the ways that people have tried to make statistics more robust (less susceptible to being overly influenced by a few rare "outlier" observations) is to take the absolute value of the linear deviation from the mean (or better still, deviation from the median). That is, the square is in some ways already too high a power of the deviation to do statistics as well as we might like. But the squared deviation has the advantage of several very convenient properties that make it much easier to work with than an absolute value of an odd power. Going to a higher power and putting an absolute value on it would combine all the disadvantages of variance and absolute deviation, magnified (literally).