What is the relationship between a sample distribution, a sampling distribution, and a population distribution?
Introduction

It is important to distinguish between the data distribution (aka population distribution) and the sampling distribution. The distinction is critical when working with the central limit theorem or with concepts like the standard deviation and the standard error. In this post we will go over these concepts, as well as bootstrapping to estimate the sampling distribution. In particular, we will cover the data distribution, the sampling distribution, the central limit theorem, the standard error, and bootstrapping.
Data Distribution

Much of statistics deals with inferring from samples drawn from a larger population. Hence, we need to distinguish between analysis done on the original data and analysis done on its samples. First, let’s go over the definition of the data distribution:

💡 Data distribution: The frequency distribution of individual data points in the original dataset.

Let’s first generate random skewed data that will result in a non-normal (non-Gaussian) data distribution. The reason for generating non-normal data is to better illustrate the relation between the data distribution and the sampling distribution. So, let’s import the Python plotting packages and generate right-skewed data.
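The original code for this step is not reproduced here, so the following is only a minimal sketch of one way to do it. It assumes an exponential distribution as the right-skewed source; the scale, the dataset size, and the helper names `sample_skewness` and `plot_histogram` are illustrative choices, not the post's own.

```python
import numpy as np

# Assumption: the post's generator and parameters are not shown, so an
# exponential distribution stands in as a right-skewed data source.
rng = np.random.default_rng(seed=42)
data = rng.exponential(scale=0.25, size=10_000)


def sample_skewness(values):
    """Return the (biased) sample skewness: the mean of the cubed z-scores."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))


def plot_histogram(values, bins=50):
    """Draw the data distribution. matplotlib is imported lazily so the
    rest of the sketch runs without a plotting backend."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.hist(values, bins=bins)
    ax.set_xlabel("value")
    ax.set_ylabel("frequency")
    ax.set_title("Data distribution (right-skewed)")
    return fig


# For right-skewed data the mean sits above the median and skewness is positive.
print(f"mean={data.mean():.3f}  median={np.median(data):.3f}  "
      f"skewness={sample_skewness(data):.2f}")
```

Calling `plot_histogram(data)` draws the histogram, assuming matplotlib is installed. The long tail to the right is what makes the data distribution non-normal.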
The histogram of generated right-skewed data
Sampling Distribution

To obtain the sampling distribution, you draw samples from the dataset and compute a statistic, such as the mean, on each one. It’s very important to differentiate between the data distribution and the sampling distribution, as most confusion comes from whether an operation is done on the original dataset or on its (re)samples.

💡 Sampling distribution: The frequency distribution of a sample statistic (aka metric) over many samples drawn from the dataset (see Bruce and Bruce 2017). Or to put it simply, the distribution of sample statistics is called the sampling distribution.

The algorithm to obtain the sampling distribution is as follows:
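As a rough sketch of that procedure: draw a sample, compute its statistic, repeat many times, and collect the results. The sample size of 50 and the 2000 iterations are the figures the post reports. The dataset, the function name `sampling_distribution`, and the choice to sample without replacement are assumptions made for illustration, so the numbers will not match the post's exactly.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
# Stand-in right-skewed dataset (assumption; not the post's data).
data = rng.exponential(scale=0.25, size=10_000)


def sampling_distribution(values, rng, statistic=np.mean,
                          sample_size=50, n_iterations=2000):
    """Draw `n_iterations` samples of `sample_size` elements and return
    the statistic computed on each sample."""
    values = np.asarray(values)
    stats = np.empty(n_iterations)
    for i in range(n_iterations):
        # One simple random sample from the dataset.
        sample = rng.choice(values, size=sample_size, replace=False)
        stats[i] = statistic(sample)
    return stats


sample_means = sampling_distribution(data, rng)

# The histogram of `sample_means` is the sampling distribution of the mean.
print(f"mean of the sample means: {sample_means.mean():.3f}")
print(f"mean of the data:         {data.mean():.3f}")
print(f"spread of sample means:   {sample_means.std():.3f}")
print(f"spread of the data:       {data.std():.3f}")
```

The sample means cluster around the mean of the data and are much less spread out than the individual data points.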
Sampling Distribution

The sampling distribution above is the histogram of the mean of each drawn sample (here, we draw samples of 50 elements over 2000 iterations). The mean of this sampling distribution is around 0.23, as can be seen by computing the mean of all sample means.

⚠️ Do not confuse the sampling distribution with the sample distribution. The sampling distribution is the distribution of a sample statistic (e.g. the mean), whereas the sample distribution is the distribution of the values in a single sample taken from the population.

Central Limit Theorem (CLT)

💡 Central Limit Theorem: As the sample size gets larger, the sampling distribution of the mean tends to look more like a normal distribution (bell-curve shape), even when the data distribution itself is not normal.

In the CLT, we analyze the sampling distribution and not the data distribution, which is an important distinction. The CLT is popular in hypothesis testing and confidence interval analysis, and it’s important to be aware of it, even though the use of the bootstrap in data science means the theorem is less talked about in practice (see Bruce and Bruce 2017). More on bootstrapping is provided later in the post.

Standard Error (SE)

The standard error is a metric that describes the variability of a statistic in the sampling distribution. We can compute the standard error of the mean as follows:

$$\text{Standard Error} = SE = \frac{s}{\sqrt{n}}$$

where s denotes the standard deviation of the sample values and n denotes the sample size. The formula shows that as the sample size increases, the SE decreases. We can estimate the standard error using the following approach[1]:
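One way to make this concrete is to compare the formula with a direct estimate: draw many fresh samples, compute the mean of each, and take the standard deviation of those means. The sketch below does this for a few sample sizes. The population, the sample sizes, and the helper names are assumptions for illustration. The skewness column is a side check on the CLT: it should shrink toward zero as n grows.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
# Stand-in right-skewed population (assumption; not the post's data).
population = rng.exponential(scale=0.25, size=100_000)


def skewness(values):
    """Biased sample skewness: the mean of the cubed z-scores."""
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))


def means_of_fresh_samples(population, n, rng, n_samples=2000):
    """Draw `n_samples` new samples of size `n` and return each sample's mean."""
    return np.array([
        rng.choice(population, size=n, replace=False).mean()
        for _ in range(n_samples)
    ])


results = {}
for n in (5, 50, 500):
    # Formula: SE = s / sqrt(n), computed from a single sample.
    one_sample = rng.choice(population, size=n, replace=False)
    se_formula = one_sample.std(ddof=1) / np.sqrt(n)

    # Direct estimate: standard deviation of many sample means.
    means = means_of_fresh_samples(population, n, rng)
    se_empirical = means.std(ddof=1)

    results[n] = (se_formula, se_empirical, skewness(means))
    print(f"n={n:>3}  SE(formula)={se_formula:.4f}  "
          f"SE(empirical)={se_empirical:.4f}  "
          f"skewness of means={results[n][2]:.2f}")
```

The formula value comes from a single sample, so it is noisy for small n. The empirical value needs many fresh samples, which is rarely practical with real data.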
While the above approach can be used to estimate the standard error, bootstrapping is preferable. I will go over it in the next section.

⚠️ Do not confuse the standard error with the standard deviation. The standard deviation captures the variability of the individual data points (how spread out the data is), whereas the standard error captures the variability of a sample statistic.

Bootstrapping

Bootstrapping is an easy way of estimating the sampling distribution by randomly drawing samples from the observed data (with replacement) and computing each resample’s statistic. Bootstrapping does not depend on the CLT or other assumptions about the distribution, and it is the standard way of estimating the SE[1]. Luckily, we can use
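A minimal bootstrap sketch in plain NumPy, not necessarily the tool the post had in mind: resample the observed data with replacement, compute the statistic on each resample, and use the standard deviation of those statistics as the SE estimate. The observed sample, its size, and the function name `bootstrap_distribution` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=7)
# Stand-in for the one sample we actually observed (assumption).
observed = rng.exponential(scale=0.25, size=200)


def bootstrap_distribution(sample, rng, statistic=np.mean, n_resamples=2000):
    """Return the statistic computed on `n_resamples` bootstrap resamples.

    Each resample has the same size as `sample` and is drawn from it
    with replacement."""
    sample = np.asarray(sample)
    stats = np.empty(n_resamples)
    for i in range(n_resamples):
        resample = rng.choice(sample, size=sample.size, replace=True)
        stats[i] = statistic(resample)
    return stats


boot_means = bootstrap_distribution(observed, rng)

# Bootstrap SE: the spread of the statistic across resamples.
se_bootstrap = boot_means.std(ddof=1)
# Formula SE for comparison: s / sqrt(n).
se_formula = observed.std(ddof=1) / np.sqrt(observed.size)

print(f"bootstrap SE: {se_bootstrap:.4f}")
print(f"formula SE:   {se_formula:.4f}")
```

For the mean, the two estimates should land close together. The advantage of the bootstrap is that passing another statistic, such as `np.median`, needs no new formula.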
Conclusion

The main takeaway is to be clear about whether a computation is done on the original dataset or on samples of it. Plotting a histogram of the data gives the data distribution, whereas plotting a sample statistic computed over many samples of the data gives a sampling distribution. Similarly, the standard deviation tells us how the data is spread, whereas the standard error tells us how a sample statistic is spread.

👉 You can find the Jupyter notebook for this blog post on GitHub.

Bruce, Peter, and Andrew Bruce. 2017. Practical Statistics for Data Scientists: 50 Essential Concepts. O’Reilly Media.

For attribution, please cite this work as: Esmaeil Alizadeh. 2021. “Data Distribution Vs. Sampling Distribution: What You Need to Know.” January 11, 2021. https://ealizadeh.com/blog/statistics-data-vs-sampling-distribution.

What is the relationship between population samples and sampling distributions?

A sampling distribution is the theoretical distribution of a sample statistic that would be obtained from a large number of random samples of equal size from a population. Consequently, the sampling distribution serves as a statistical “bridge” between a known sample and the unknown population.
What is the difference between a sample distribution and a population distribution?

A population is the entire group that you want to draw conclusions about. A sample is the specific group that you will collect data from. The size of the sample is always less than the total size of the population.

What is the relationship between the population mean and the mean of the distribution of sample means for a specific sample size?

The mean of the distribution of sample means is called the Expected Value of M and is always equal to the population mean μ.

How is the population distribution related to the sampling distribution?

Your sample is the only data you actually get to observe, whereas the other distributions are more like theoretical concepts. Your sample distribution is therefore your observed values from the population distribution you are trying to study.