r/learnmath on Reddit: Population standard deviation vs. Sample standard deviation?
May 21, 2019 -

Why is the population standard deviation the square root of the sum of the (value - mean)^2 ÷ n, while the sample standard deviation is all that over n - 1? I don't understand why you have to subtract 1 from the number of things.

Top answer
1 of 4
8
I think part of the problem is that the terms "population standard deviation" and "sample standard deviation" are confusing, because almost everyone seems to think that the "sample standard deviation" is the standard deviation of the sample. It's actually a different concept entirely; it's an "estimator" for the population standard deviation.

Imagine that we're making precisely calibrated rulers, and we want to make sure that the lengths of all 1 million of the rulers we made today have a very small standard deviation. That is, we want to know the population standard deviation of the lengths of the rulers. However, no one has the time to literally measure 1 million rulers, so our only choice is to draw a small random sample and figure out how to guess the population standard deviation from the limited data we have. That's what we use the sample standard deviation for. It's a value that we calculate from the sample in order to estimate the true value of the population standard deviation.

So, why don't we divide by N in the sample standard deviation? Why isn't the standard deviation of the sample a good estimate of the standard deviation of the population? It turns out that it's biased to give values that are too small. The problem is that we want to find the population standard deviation, which measures variation around the true mean, but the standard deviation of the sample only gives us variation around the sample mean. Naturally, the members of a sample are biased to be closer to their sample mean, so they tend to have a smaller standard deviation than the whole population. That's why we need a different formula for the sample standard deviation. We use N-1 because that's the value that turns the sample variance into an "unbiased estimator": its average value over many samples equals the true population variance.
2 of 4
6
Here is an article explaining it. https://www.statisticshowto.datasciencecentral.com/bessels-correction/
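A quick simulation makes the bias described in the top answer visible. This is a sketch of my own (Python with NumPy, normal data with a known population variance of 4), not something from the thread: it averages the divide-by-N and divide-by-(N-1) variance estimates over many small samples.

```python
# Minimal simulation of Bessel's correction (assumes NumPy is available).
# Draw many small samples from a population with known variance and compare
# the average of the /N variance estimate with the /(N-1) estimate.
import numpy as np

rng = np.random.default_rng(0)
true_sigma2 = 4.0            # population variance (sigma = 2)
n, trials = 5, 100_000       # small samples make the bias easy to see

samples = rng.normal(loc=10.0, scale=true_sigma2 ** 0.5, size=(trials, n))
var_div_n = samples.var(axis=1, ddof=0)            # divide by N   (biased low)
var_div_n_minus_1 = samples.var(axis=1, ddof=1)    # divide by N-1 (unbiased)

print("true population variance:    ", true_sigma2)
print("average of /N estimates:     ", var_div_n.mean())          # ~3.2, too small
print("average of /(N-1) estimates: ", var_div_n_minus_1.mean())  # ~4.0
```

With N = 5 the divide-by-N average comes out near 3.2 rather than 4, while the corrected version averages close to 4. Note that the correction makes the variance unbiased; the square root (the standard deviation itself) is still very slightly biased low.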
r/learnmath on Reddit: Should I use the population or sample standard deviation
November 29, 2022 -

So I have 12 samples that I've tested for their thermal conductivity for a chemistry lab and want to compute their standard deviation. I've read online about the difference between population vs sample SD and it seems you only use population SD when you've tested the entire population, not just a portion of it. I'm not sure what is meant by "population", though. Would my 12 samples count as the entire population?

[Question] Disagreement at work on which type of standard deviation to use, Population vs. Sample : r/statistics
July 16, 2024 - The fact that you're limited in ... correction is a thing - it's meant to account for the fact that the mean of the sample is closer to the data points used to calculate the standard deviation than the population mean is....
r/CFA on Reddit: Confused between sample and population Standard deviation
July 23, 2022 -

A fund had the following experience over the past 10 years:

Returns over the past 10 years: 4.5%, 6.0%, 1.5%, −2.0%, 0.0%, 4.5%, 3.5%, 2.5%, 5.5%, 4.0%

Q. The standard deviation of the 10 years of returns is closest to:

  1. 2.40%.

  2. 2.53%.

  3. 7.58%.

There is no mention in this question whether to calculate the sample or the population Standard deviation. Which one should we calculate by default ?
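For reference, the first two answer choices are just the two conventions applied to the same data. A quick sketch (Python, using the ten returns quoted above) computes both; the arithmetic is mine, not the thread's, so treat it as a check rather than the official answer key.

```python
# Compute both standard deviations for the ten annual returns in the question.
returns = [4.5, 6.0, 1.5, -2.0, 0.0, 4.5, 3.5, 2.5, 5.5, 4.0]  # in percent

n = len(returns)
mean = sum(returns) / n                        # 3.0%
ss = sum((r - mean) ** 2 for r in returns)     # 57.5

population_sd = (ss / n) ** 0.5                # divide by n
sample_sd = (ss / (n - 1)) ** 0.5              # divide by n - 1

print(f"mean          = {mean:.2f}%")
print(f"population SD = {population_sd:.2f}%")  # ~2.40
print(f"sample SD     = {sample_sd:.2f}%")      # ~2.53
```

Dividing by n gives roughly 2.40% and dividing by n - 1 gives roughly 2.53%; which one the exam wants depends on whether the ten years are treated as the entire population of interest or as a sample from a longer return history.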

r/statistics on Reddit: Sample vs. population standard deviation
June 14, 2013 -

Hi,

(this isn't a homework question, don't worry)

I'm taking a college stats class right now and I'm confused about a particular concept with the two types of standard deviation: (our book doesn't explain this at all, it just gives the formulas)

There are formulas for both sample standard deviation (s) and population standard deviation (sigma). If all you have is a sample of, let's say, 5 numbers, how does the population version "know" what the overall population actually is to come up with this value? Because of this, I don't understand the meaning behind that particular stat.

Example: I enter 13,24,12,44,55 into my calculator and compute the 1-variable stats:

s = 19.16507 sigma = 17.14176

What "population" is it assuming those 5 numbers come from?

Thanks, appreciate any explanations!
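The calculator is simply running the same five numbers through both formulas; a minimal sketch (Python with NumPy, assumed here as an illustration) reproduces the two values via the ddof argument.

```python
# Reproduce the calculator's two standard deviations for the same five values.
import numpy as np

x = np.array([13, 24, 12, 44, 55], dtype=float)

sigma = x.std(ddof=0)   # divide by N:   "population" formula
s = x.std(ddof=1)       # divide by N-1: "sample" formula

print(f"sigma = {sigma:.5f}")  # 17.14176
print(f"s     = {s:.5f}")      # 19.16507
```

No population is being assumed: ddof=1 just applies the N-1 correction so that, if these five values were a random sample, the squared result would be an unbiased estimate of the unknown population variance.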

r/AskStatistics on Reddit: Standard deviation in population vs sample
February 20, 2020 -

Hi all,

When calculating the standard deviation for a population vs a sample of this population, we use N-1 in the denominator for the sample but N for the population. My understanding is that the reasoning behind this is to do with the statistical inference that we would eventually make from this sample onto the population. Is this correct? Can anyone unpack this a bit more?

Further, is the reasoning to do with something like the actual population having more outliers than a typical sample would have, thus skewing the figures of the mean?

Thanks in advance

r/learnmath on Reddit: “Standard deviation of each set” - Sample or population?
March 3, 2024 -

Hi all, I know this type of question has been asked a lot here, but I haven’t found the answers exactly satisfactory, so I’m creating my own post.

We were given these two tables seen in the image below: https://ibb.co/TvVRPJB

Then, we were asked “Find the mean and standard deviation of each set of data by using class centres.”

I initially assumed they were asking for population standard deviation, as each of these tables is an individual set of data (and the question asks for “standard deviation of each set”). However, the answer sheet indicates that I should have used sample standard deviation.

To me, each of these sets is a discrete full population, and standard deviation of “each set” sounds like I should find deviation of the individual set (population) not of the whole set of data collected (sample).

Is this just a confusingly worded question, or am I completely incorrect? If I am, please clarify where this confusion is coming from.

Thank you very much for any help.

r/AskStatistics on Reddit: Standard Deviation/Variance of a Sample vs a Known Population Mean/SD
September 13, 2021 -

Sorry for the x-post from r/statistics, but I just realized this forum existed.

I am relatively untrained, so, this is partially a "how do I do this calculation" question, but also a bit of a "how do I show the meaning of this calculation to someone who isn't really a 'math person'" question.

I'm trying to find a way to express (to someone relatively unversed in statistics, not that I'm much beyond a novice) how a sample's result differs from the expected outcome given a known population mean and SD. I was hoping that within Sheets/Excel I could create a normal curve wherein I could show where the sample results fall along the population curve. But I'm not even quite sure how to create a curve within those parameters to begin with.

Say I am playing a betting game, where I know that the expected value of any bet (x) is .2x, and the SD of that population is .5x. My sample mean is -1.5x, which I know just barely falls outside 3 standard deviations of the known population mean. Is there a way to compare the sample to the population? Am I even thinking about this in the right way?

Top answer
1 of 2
2
Are you trying to nail down, in somewhat layman's terms, the difference between a measurement on a sample versus the analogous measurement on a population? If not, ignore the far-too-long explanation below and help steer me closer to what you're getting at, haha.

In my intro stat courses, I spend quite a lot of effort on this, for good reason. I usually use average height as the measurement and students on campus as the population. You can imagine, if you had the time and effort, finding the average height of all students across campus. You can also imagine taking 10 "random" students and averaging their heights, where by "random" you would say "each student has relatively the same chance of being picked" (no blatant bias, like picking 10 basketball players). At this point, I find that most people agree that both averages are probably different; how different is the next question. They also agree that a 2nd sample would probably give a 3rd differing value, which is a great segue to where sampling distributions arise.

"Is there a way to compare the sample to the population?" In the case of averages, if the population parameter is known, then it's somewhat straightforward and not too terribly interesting: "Our sample mean was 10, the population mean is 8." If you want to put a scale on how large the difference is, calculating a z-score is a good bet (how many standard deviations the sample mean is away from the population mean): "Our sample mean was 10, the population mean is 8 and the population SD is 2, so the sample mean was 1 standard deviation above the population mean."

If the population parameter is not known (far more interesting and common), then you have a few options. A frequentist approach might be to compute a confidence interval in order to state plausible values of the population mean, based on the sample mean. Ex: 95% confidence intervals capture the population mean 95% of the time over repeated samples, so calculating the interval for your sample mean gives you insight into what it may be. Some prefer running a hypothesis test to rule out particular values of the population mean. Ex: if you thought the population mean was, say, 10, you can refer back to the "population parameter is known" case, calculate how far the sample mean is from that value, state how likely such a scenario is, and follow up with a statement about the theorized population mean: "We think the population mean might be 10 and our sample average was 8. If the population SD is 2, then our sample mean was 1 standard deviation away. Given the normality of sample averages, this is not uncommon, thus 10 is a plausible value for the population mean (no sufficient evidence to rule it out)." Of course, you would need to determine what is "sufficient" for you, but 2 SDs is fairly common.
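A minimal sketch of the two calculations described above, using the answer's illustrative numbers (population mean 8, SD 2, sample mean 10); the confidence-interval part assumes a hypothetical sample size of n = 25 and sample SD of 2, which are not in the original post.

```python
# Sketch of the two comparisons described above, using the answer's numbers.
import math

pop_mean, pop_sd = 8.0, 2.0
sample_mean = 10.0

# "Known population" case: how many population SDs is the sample mean away?
z = (sample_mean - pop_mean) / pop_sd
print(f"z-score: {z:.1f} SD above the population mean")  # 1.0

# "Unknown population" case: a rough 95% confidence interval for the mean,
# assuming (hypothetically) a sample of n = 25 with sample SD s = 2.
n, s = 25, 2.0
half_width = 1.96 * s / math.sqrt(n)
print(f"95% CI: {sample_mean - half_width:.2f} to {sample_mean + half_width:.2f}")
```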
2 of 2
1
"How a sample's result differs from the expected outcome given a known population mean and SD." I'm not sure what you mean by how it differs. What are you seeking here? At one point I thought maybe you were asking "how do I draw a normal curve with the same mean and SD as a sample", but it's not clear that has anything to do with the line I quoted. I really don't follow what your last paragraph is trying to do at all. TBH that sounds like an XY problem.
r/AskStatistics on Reddit: Sample vs Population Standard Deviation in Propagation of Uncertainty
December 13, 2020 -

I'm a bit confused about whether I should be using the standard deviation of the sample or of the population when accounting for propagation of uncertainty. One can convert between the two using $\sigma_P^2 = \frac{n-1}{n} \sigma_S^2$. Obviously, this conversion from the sample to the population relies on the assumption that the population is being well and accurately sampled, but if that assumption holds and you know the size of your sample, then you know the population standard deviation. [False.]

To further contextualize this, I'm writing a script with a class called Measure that does propagation of uncertainty for me if I do math with it. I've decided it's most appropriate to convert stranger measures of uncertainty into standard deviation since propagation of uncertainty isn't really built for margin of error. My issues are 1) whether I should convert to population standard deviation when the sample size is given, 2) whether the different standard deviations can be used together in propagation of uncertainty, 3) which type of standard deviation is assumed when giving the uncertainty of a measurement.
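As a small illustration of the quoted conversion (my own sketch in Python/NumPy, with made-up data): both quantities are computed from the same sample, and the rescaling just moves between the n and n - 1 denominators, so neither one is the true population standard deviation.

```python
# Converting between the /(n-1) ("sample") and /n ("population") standard
# deviations of the same data set. Both are computed from the sample, so
# neither is the true population SD; they are two estimators of it.
import numpy as np

x = np.array([9.8, 10.1, 10.3, 9.7, 10.0, 10.2])  # made-up measurements
n = len(x)

s = x.std(ddof=1)                        # divide by n-1
sigma_hat = x.std(ddof=0)                # divide by n
converted = s * np.sqrt((n - 1) / n)     # same as sigma_hat, up to floating-point error

print(s, sigma_hat, converted)
```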

r/CFA on Reddit: Sample vs Population Standard Deviation
May 31, 2019 -

Hi all,

Would it be possible to advise in which scenario one would use a sample standard deviation and in which scenario one would use a population standard deviation?

For example, if a mutual fund has 10 funds, would the standard deviation be computed by using the sample or the population standard deviation?

Thank you!

r/learnmath on Reddit: Population versus sample standard deviation for the Linux ping command's mdev statistic
September 21, 2023 - In this situation, why should we use the population standard deviation instead of the sample standard deviation (which generally incorporates Bessel's correction)? When we are running a ping test, it seems to me that we are sampling (taking a sample of pings) rather than having an entire population.
Top answer
1 of 2
10

When you compute the standard deviation from a sample, you almost always have to compute it "around" the observed mean of the sample (not the true mean of the population) because the true mean of the population is unknown. The difference between the observed mean and the true mean causes a bias in the standard deviation which can be corrected by using a different formula.
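One way to see the point about the observed mean: the sum of squared deviations is always smallest when taken about the sample's own mean, so substituting the sample mean for the true mean systematically shrinks it. A quick sketch (Python/NumPy, illustrative numbers of my own):

```python
# Squared deviations about the sample mean are never larger than about the
# true mean, which is exactly the bias Bessel's correction compensates for.
import numpy as np

rng = np.random.default_rng(1)
true_mu = 50.0
sample = rng.normal(true_mu, 10.0, size=8)

ss_about_sample_mean = np.sum((sample - sample.mean()) ** 2)
ss_about_true_mean = np.sum((sample - true_mu) ** 2)

print(ss_about_sample_mean <= ss_about_true_mean)   # always True
print(ss_about_sample_mean, ss_about_true_mean)
```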

2 of 2
1

You may have figured all this out already, but here's something that tripped me up for a very long time when I first learned statistics (and I'm hardly an expert, so take this with a grain of salt):

The reason you take a sample is to estimate parameters (mean and stdev, usually) of the distribution that describes the whole population from which the sample is drawn. The formula for "population standard deviation" is the definition of standard deviation of a statistical distribution. The formula for "sample standard deviation" is just a way to estimate the standard deviation of the population distribution given a sample drawn from the distribution. They are different formulas because they do completely different things. One estimates the other based on a subset of the data (the sample).

There is no difference in appearance between the "population mean" formula and the "sample mean" formula, but there is still a difference in interpretation: the former is the definition of "mean" of a statistical distribution, while the latter tells you how to figure out (approximately) what this mean is from only your sample.

The fact that the "population" vs "sample" mean formulae do not differ in appearance is perhaps even more surprising than the fact that the "population" vs "sample" stdev formulae do differ. But the other answer does a pretty good job of explaining why this is the case.

r/AskStatistics on Reddit: How does Sample size affect the mean and the standard deviation
September 20, 2021 -

Hello,
taking a statistics course and I am a bit confused on how this relationship work.
if the sample size is 100 vs a sample size of 1000, would the mean be smaller or larger?
How would the standard deviation be affected?

What is the probability that either sample has the lowest variable sampled?

Thank you

Top answer
1 of 2
2
"If the sample size is 100 vs a sample size of 1000, would the mean be smaller or larger?" If you're taking random samples of the population, on average it would be the same; in any given sample it might be smaller or larger.

"How would the standard deviation be affected?" See the above. Well, there's a small sample-size effect with the usual n-1 denominator when taking the standard deviation (the usual estimator is biased, and that bias changes with sample size), but not so much that you'd tend to notice.

"What is the probability that either sample has the lowest variable sampled?" Call the smaller sample A (with nA observations) and the larger sample B (with nB observations). Assuming that we're sampling a continuous random variable (to avoid dealing with ties), and again assuming simple random sampling throughout, the probability that the smallest of the sampled values is in A is nA/(nA+nB) -- i.e. the proportion of all the values sampled that are in sample A.
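The nA/(nA+nB) claim is easy to sanity-check by simulation; here is a rough sketch (Python/NumPy, using nA = 100 and nB = 1000 as in the question):

```python
# Monte Carlo check: probability that the overall minimum lands in the
# smaller sample A is nA / (nA + nB).
import numpy as np

rng = np.random.default_rng(2)
nA, nB, trials = 100, 1000, 20_000
hits = 0
for _ in range(trials):
    a = rng.normal(size=nA)
    b = rng.normal(size=nB)
    if a.min() < b.min():
        hits += 1

print("simulated:", hits / trials)     # ~0.091
print("predicted:", nA / (nA + nB))    # 0.0909...
```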
2 of 2
1
Sample size's influence on the mean and standard deviation is really interesting! Remember, the reason you take a sample is because you want to say something about the population you took the sample from.

Imagine you're interested in the average university student's height, and how height varies, at a university with a population of 10,000. First we can take a sample of 100 students. We can calculate an average from this sample (called a sample statistic) and a standard deviation of the sample. Now, it's important to note that your sample statistics will always vary from the actual population's values (called parameters). But as we increase our sample size, we get closer to capturing the entire population of interest, meaning our sample statistics will get closer and closer to the actual population values. A sample of 100 may yield a similar average height and standard deviation to the population, but a sample of 1000 will yield even more similar statistics, simply because the larger sample contains more people from the population, making its representation closer to the actual population.

Now when we say closer, what we mean is that the mean of our sample gets closer to the mean of the actual population, and the standard deviation gets closer to the actual population's standard deviation. More data provides a closer picture of the population of interest, and more data removes uncertainty (variance/standard deviation) about how similar the sample statistics are to the population. Remember, we don't know the actual population height and standard deviation, but using a large enough sample, we can come to a pretty good approximation (and as that sample size gets larger, that approximation gets better).

This all comes under the assumption that your sample is random. Although a large volunteer or convenience sample might be a good approximation of the population, such samples might not represent the population as well as a true random sample.
Top answer
1 of 1
106

There are, in fact, two different formulas for standard deviation here: The population standard deviation $\sigma$ and the sample standard deviation $s$.

If $x_1, x_2, \ldots, x_N$ denote all $N$ values from a population, then the (population) standard deviation is $$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2},$$ where $\mu$ is the mean of the population.

If $x_1, x_2, \ldots, x_N$ denote $N$ values from a sample, however, then the (sample) standard deviation is $$s = \sqrt{\frac{1}{N-1} \sum_{i=1}^N (x_i - \bar{x})^2},$$ where $\bar{x}$ is the mean of the sample.

The reason for the change in formula with the sample is this: When you're calculating $s$ you are normally using $s^2$ (the sample variance) to estimate $\sigma^2$ (the population variance). The problem, though, is that if you don't know $\sigma$ you generally don't know the population mean $\mu$, either, and so you have to use $\bar{x}$ in the place in the formula where you normally would use $\mu$. Doing so introduces a slight bias into the calculation: Since $\bar{x}$ is calculated from the sample, the values of $x_i$ are on average closer to $\bar{x}$ than they would be to $\mu$, and so the sum of squares $\sum_{i=1}^N (x_i - \bar{x})^2$ turns out to be smaller on average than $\sum_{i=1}^N (x_i - \mu)^2$. It just so happens that that bias can be corrected by dividing by $N-1$ instead of $N$. (Proving this is a standard exercise in an advanced undergraduate or beginning graduate course in statistical theory.) The technical term here is that $s^2$ (because of the division by $N-1$) is an unbiased estimator of $\sigma^2$.

Another way to think about it is that with a sample you have $N$ independent pieces of information. However, since $\bar{x}$ is the average of those $N$ pieces, if you know $x_1 - \bar{x}, x_2 - \bar{x}, \ldots, x_{N-1} - \bar{x}$, you can figure out what $x_N - \bar{x}$ is. So when you're squaring and adding up the residuals $x_i - \bar{x}$, there are really only $N-1$ independent pieces of information there. So in that sense perhaps dividing by $N-1$ rather than $N$ makes sense. The technical term here is that there are $N-1$ degrees of freedom in the residuals $x_i - \bar{x}$.
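The degrees-of-freedom point can be checked directly: residuals about the sample mean always sum to zero, so the last one is fixed by the others. A small sketch (Python/NumPy, reusing the five numbers from the earlier calculator example):

```python
# Residuals about the sample mean sum to zero, so only N-1 of them are free.
import numpy as np

x = np.array([13.0, 24.0, 12.0, 44.0, 55.0])
residuals = x - x.mean()

print(np.isclose(residuals.sum(), 0.0))       # True
print(residuals[-1], -residuals[:-1].sum())   # last residual is determined by the rest
```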

For more information, see Wikipedia's article on the sample standard deviation.