Why does variance have n - 1?
The problem with using the population variance formula to calculate the variance of a sample is that it is biased: on average, it underestimates the true variance. We simulate a population of data points from a uniform distribution with a range from 1 to [...]. Below I show the histogram that represents our population. The variance is 8. To start, we can draw a single sample of size 5. Say we do that and get the following values: 7, 6, 3, 5, 5. We can compute the variance of this sample by dividing the sum of squared deviations from the sample mean either by n or by n - 1. In the former case, this will result in 1.76; in the latter, 2.2.
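As a quick check of the arithmetic for the five values above, here is a minimal Python sketch. The variable names are my own, and the cross-check uses the standard library's statistics.pvariance (divides by n) and statistics.variance (divides by n - 1).

```python
import statistics

sample = [7, 6, 3, 5, 5]
n = len(sample)

# Sum of squared deviations from the sample mean (5.2).
mean = sum(sample) / n
sum_sq = sum((x - mean) ** 2 for x in sample)

biased = sum_sq / n          # population formula applied to a sample
unbiased = sum_sq / (n - 1)  # sample formula with the n - 1 correction

print(f"divide by n:     {biased:.2f}")    # prints 1.76
print(f"divide by n - 1: {unbiased:.2f}")  # prints 2.20

# The standard library gives the same two numbers.
assert abs(statistics.pvariance(sample) - biased) < 1e-9
assert abs(statistics.variance(sample) - unbiased) < 1e-9
```

The two results differ only by the factor n / (n - 1), which is 1.25 for a sample of size 5.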
Below I show the results of repeated draws from our population. I simulated drawing samples of sizes 2 to 10, drawing each sample size repeatedly. We see that the biased measure of variance is indeed biased. The average variance is lower than the true variance (indicated by the dashed line) for each sample size. We also see that the unbiased variance is indeed unbiased. On average, the sample variance matches the population variance.
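A simulation along these lines can be sketched in a few lines of Python. This is not the code behind the figure. It assumes a hypothetical population of the integers 1 through 10, samples drawn with replacement, and 20,000 draws per sample size. All of these choices, and the function names, are mine, made for illustration.

```python
import random

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample's own mean."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample)

def simulate(population, sizes, trials, seed=0):
    """Average the n and n - 1 variance estimates over many samples."""
    rng = random.Random(seed)
    results = {}
    for n in sizes:
        biased_total = 0.0
        unbiased_total = 0.0
        for _ in range(trials):
            sample = rng.choices(population, k=n)  # draw with replacement
            ss = sum_sq_dev(sample)
            biased_total += ss / n
            unbiased_total += ss / (n - 1)
        results[n] = (biased_total / trials, unbiased_total / trials)
    return results

# Hypothetical population: integers 1..10 (an assumption, see above).
population = list(range(1, 11))
pop_mean = sum(population) / len(population)
true_var = sum((x - pop_mean) ** 2 for x in population) / len(population)

results = simulate(population, sizes=range(2, 11), trials=20_000)

print(f"true variance: {true_var:.2f}")
for n, (biased, unbiased) in results.items():
    print(f"n={n:2d}  divide by n: {biased:5.2f}  divide by n - 1: {unbiased:5.2f}")
```

Under these assumptions, the divide-by-n column should sit below the true variance for every sample size, with the gap shrinking as n grows, while the divide-by-(n - 1) column should hover around the true variance.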
The results of using the biased measure of variance reveal several clues for understanding the solution to the bias.
We see that the amount of bias is larger when the sample size is smaller. So the solution should be a function of sample size, such that the required correction will be smaller as the sample size increases. Ideally we would estimate the variance by measuring how far each sampled value lies from the population mean. In practice the population mean is usually unknown, so we measure deviations from the sample mean instead.
This is where the bias comes in. In fact, the sample mean is exactly the value that minimizes the sum of squared deviations of the sample values. This means that the sum of squared deviations from the sample mean is never larger than the sum of squared deviations from the population mean, and it is strictly smaller in almost every case. The only exception is when the sample mean happens to equal the population mean.
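The size of the shortfall can be worked out with a short derivation. This is a sketch under the assumption that the sample values x_1, ..., x_n are drawn independently from a population with mean \mu and variance \sigma^2. The symbols are introduced here for illustration, with \bar{x} denoting the sample mean.

```latex
\begin{align*}
\sum_{i=1}^{n} (x_i - \mu)^2
  &= \sum_{i=1}^{n} \bigl( (x_i - \bar{x}) + (\bar{x} - \mu) \bigr)^2 \\
  &= \sum_{i=1}^{n} (x_i - \bar{x})^2
     + 2(\bar{x} - \mu) \sum_{i=1}^{n} (x_i - \bar{x})
     + n(\bar{x} - \mu)^2 \\
  &= \sum_{i=1}^{n} (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2
  \qquad \text{because } \sum_{i=1}^{n} (x_i - \bar{x}) = 0 .
\end{align*}

% Taking expectations, with E[(x_i - \mu)^2] = \sigma^2
% and E[(\bar{x} - \mu)^2] = \sigma^2 / n:
\begin{align*}
E\!\left[ \sum_{i=1}^{n} (x_i - \bar{x})^2 \right]
  = n\sigma^2 - n \cdot \frac{\sigma^2}{n}
  = (n - 1)\,\sigma^2 .
\end{align*}
```

The extra term n(\bar{x} - \mu)^2 is never negative, which is why deviations from the sample mean always give the smaller sum. On average the sum of squared deviations from the sample mean equals (n - 1) times the variance, so dividing by n - 1 rather than n removes the bias.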
Below are two graphs. In each graph I show 10 data points that represent our population. If the sample variance is larger, there is a greater chance that it captures the true population variance. Because we are trying to reveal information about a population by calculating the variance from a sample, we probably do not want to underestimate the variance. There is a good post here on CV (Cross Validated) that will give you some good insight.
So you'd probably divide by n minus 1. But let's think about why this estimate would be biased and why we might want to have an estimate that is larger.
And then maybe in the future, we could have a computer program or something that really makes us feel better, that dividing by n minus 1 gives us a better estimate of the true population variance. So let's imagine all the data in a population. And I'm just going to plot them on a number line. So this is my number line.
And let me plot all the data points in my population. So this is some data. Here's some data. And here is some data here. And I can just do as many points as I want. So these are just points on the number line. Now, let's say I take a sample of this. So this is my entire population. So let's see how many. I have 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, [...]. So in this case, what would be my big N?
My big N would be [...]. Now, let's say I take a sample, a lowercase n of-- let's say my sample size is 3. I could take-- well, before I even think about that, let's think about roughly where the mean of this population would sit.
So the way I drew it --and I'm not going to calculate exactly-- it looks like the mean might sit some place roughly right over here. So the mean, the true population mean, the parameter, is going to sit right over here. Now, let's think about what happens when we sample. And I'm going to do just a very small sample size just to give us the intuition, but this is true of any sample size. So let's say we have a sample size of 3. So there is some possibility, when we take our sample of size 3, that we happen to sample it in a way that our sample mean is pretty close to our population mean.
So for example, if we sampled that point, that point, and that point, I could imagine our sample mean might actually sit pretty close to our population mean. But there's a distinct possibility that maybe when I take a sample, I sample that and that.
And the key idea here is when you take a sample, your sample mean is always going to sit within your sample. And so there is a possibility that when you take your sample, the true population mean could even be outside of the sample. And so in this situation-- and this is just to give you an intuition. So here, your sample mean is going to be sitting someplace in there. And so if you were to just calculate the distance from each of these points to the sample mean --so this distance, that distance-- and you square it, and you were to divide by the number of data points you have, this is going to be a much lower estimate than the true variance, from the actual population mean, where these things are much, much, much further.
Now, you're not always going to have the true population mean outside of your sample. But it's possible that you do. So in general, when you just take your points and find the squared distance to your sample mean, which is always going to sit inside of your data even though the true population mean could be outside of it, or at one end of your data, however you might want to think about it, you are likely to be underestimating the true population variance.
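To put numbers on that picture, here is a small Python sketch. The twelve points and the choice of sample are hypothetical, made up for illustration and not taken from the drawing. The sample of 3 is deliberately picked from one end so that the population mean falls outside it.

```python
# Hypothetical points on a number line standing in for the population.
population = [2, 3, 5, 6, 8, 9, 11, 12, 14, 15, 17, 18]
pop_mean = sum(population) / len(population)

# A sample of size 3 that happens to come from the low end.
sample = [2, 3, 5]
sample_mean = sum(sample) / len(sample)

# Average squared distance to the sample mean (what we can compute) ...
around_sample_mean = sum((x - sample_mean) ** 2 for x in sample) / len(sample)
# ... versus to the true population mean (what we would like to use).
around_pop_mean = sum((x - pop_mean) ** 2 for x in sample) / len(sample)

print(f"population mean: {pop_mean}, sample mean: {sample_mean:.2f}")
print(f"mean squared distance to sample mean:     {around_sample_mean:.2f}")
print(f"mean squared distance to population mean: {around_pop_mean:.2f}")
```

For this particular sample, the distances measured from the sample mean come out far smaller than the distances measured from the population mean, which is the underestimate described above.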