Descriptive Statistics: Central Tendency and Dispersion

In this article, we will delve into the realm of descriptive statistics, exploring its different aspects, including types of statistics, population vs. sample, parameters vs. statistics, data types, and measures of central tendency and dispersion.

Let me introduce you to statistics with some humor.

“Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital.” — Aaron Levenstein

The given quote humorously highlights the idea that statistics can provide valuable insights and information, but they can also be misleading or incomplete if important factors are not considered or if data is not thoroughly analyzed.

Statistics

Statistics is a branch of mathematics that involves:

  • Collection of data
  • Analysis of data
  • Interpretation of data
  • Presentation of data

It provides tools and techniques to understand large amounts of data so that conclusions can be drawn, and decisions can be made.

Broadly there are two types of Statistics:

  1. Descriptive Statistics
  2. Inferential Statistics

There are others as well, but for now, we are only considering the broader branches of it.

Descriptive Statistics

It focuses on summarizing and describing the main features of a dataset by providing information like:

  • the average value and the most common values in the data,
  • how spread out the values are, and their overall distribution.

It involves:

  1. Measure of Central Tendency
  2. Measure of Dispersion

Before delving deep into these, let’s understand some basics of inferential statistics, the difference between a sample and a population, and some common data types. This will give us context for why we are learning descriptive statistics.

Inferential Statistics

Inferential statistics is typically performed after conducting descriptive statistics.

Inferential statistics involves drawing conclusions or making predictions about a larger population based on a sample of data.

It uses descriptive statistics, probability, and other statistical techniques to analyze the sample and make inferences about the population.

Some of the topics that come under inferential statistics are:

  • Measure of Association
  • Hypothesis Testing
  • Regression Analysis

Population vs Sample

Population refers to the entire group of individuals, objects, or events that we are interested in studying.

A sample, on the other hand, is a subset of the population. It is a smaller representation of the population that we collect and analyze in order to make inferences about the population as a whole.

Why do we need a Sample?

  • It is not feasible or practical to collect data from the entire population due to factors such as time, cost, or logistics issues.
  • Hence, we instead use a sample that is representative of the population and apply statistical methods to it to draw conclusions about the population.

What do we need to take care of while selecting a Sample?

The accuracy of our inferences about the population depends on the representativeness of the sample and the quality of the statistical methods used.

Hence:

  1. The sample should be selected randomly from the population to avoid any bias such that it can capture the diversity and variability present in the population.
  2. The sample size should be large enough to provide reliable data analysis as a larger sample size reduces the margin of error.

Parameters vs Statistics

Parameters and statistics are two terms used in the field of statistics to describe different types of numerical values associated with a population and a sample.

Parameters

  • Parameters refer to numerical values that are used to describe a characteristic of an entire population.
  • Parameters are typically denoted by Greek letters (e.g., μ for population mean, σ for population standard deviation).

Statistics

  • Statistics, on the other hand, refer to numerical values that describe a characteristic of a sample. They are calculated from the sample data and are used to estimate or infer the corresponding population parameter.
  • Statistics are denoted by English letters (e.g., x̄ for the sample mean, s for the sample standard deviation).
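As a quick illustration (a minimal NumPy sketch with made-up numbers, not from the original post), we can treat a large array as the population, draw a random sample from it, and compare the statistics x̄ and s with the parameters μ and σ:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "population": 100,000 values with true mean ~50 and std ~10
population = rng.normal(loc=50, scale=10, size=100_000)

mu = population.mean()        # parameter: population mean (mu)
sigma = population.std()      # parameter: population standard deviation (sigma)

# A random sample of 200 observations drawn from the population
sample = rng.choice(population, size=200, replace=False)

x_bar = sample.mean()         # statistic: sample mean (x-bar)
s = sample.std(ddof=1)        # statistic: sample standard deviation (s)

print(f"mu = {mu:.2f}, x_bar = {x_bar:.2f}")
print(f"sigma = {sigma:.2f}, s = {s:.2f}")
```

The sample statistics land close to, but not exactly on, the population parameters; that gap is what inferential statistics quantifies.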

Types of Data

Broadly, data is either qualitative (categorical) or quantitative (numerical). Qualitative data can be nominal (unordered categories) or ordinal (ordered categories), while quantitative data can be discrete (countable values) or continuous (measurable on a scale).

1. Measure of Central Tendency

Measures of central tendency are statistical measures that provide information like:

  1. The average value of a distribution (Mean)
  2. The central value in a distribution (Median)
  3. The most common values in a distribution (Mode)

In statistics, moments are mathematical calculations that provide information about the shape, center, and spread of a probability distribution. The first moment is the mean; the second central moment is the variance (the standard deviation squared); the third and fourth standardized moments give the skewness and kurtosis.

Since we are studying central tendency, right now we are actually working with the first moment.
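In standard notation (a summary sketch, not from the original post; strictly, the mean is the first raw moment, while the later descriptors come from central moments), for a random variable X with mean μ and standard deviation σ:

```latex
\mu = \mathbb{E}[X]                          % first (raw) moment: mean
\mu_k = \mathbb{E}\!\left[(X - \mu)^k\right] % k-th central moment
\sigma^2 = \mu_2                             % variance (second central moment)
\mathrm{skewness} = \mu_3 / \sigma^3         % standardized third moment
\mathrm{kurtosis} = \mu_4 / \sigma^4         % standardized fourth moment
```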

Types of Mean

  1. Arithmetic Mean (Simple Mean, Weighted Mean, Trimmed Mean)
  2. Geometric Mean
  3. Harmonic Mean

The measure of central tendency typically encompasses the arithmetic mean. The trimmed mean and weighted mean are two variations of the arithmetic mean.

The geometric mean and harmonic mean are alternative measures of central tendency that are appropriate only in specific contexts.

Arithmetic Mean:

It is calculated by summing all the values and dividing by the total number of values.

Weighted Mean:

The weighted mean takes into account the importance or significance of each value by assigning weights to them before calculating the mean. Each value is multiplied by its respective weight, and the weighted sum is divided by the sum of the weights.

This is useful when certain values have more influence or importance in the dataset compared to others.
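A minimal sketch of both calculations (the scores and weights below are invented for illustration); NumPy’s np.average computes the simple mean and, when given weights, the weighted mean:

```python
import numpy as np

values = np.array([80, 90, 70])      # hypothetical exam scores
weights = np.array([0.2, 0.5, 0.3])  # hypothetical importance of each exam

simple_mean = np.average(values)                     # (80 + 90 + 70) / 3 = 80.0
weighted_mean = np.average(values, weights=weights)  # sum(w * x) / sum(w) = 82.0

print(simple_mean, weighted_mean)
```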

Trimmed Mean:

The trimmed mean is calculated by excluding a certain percentage of the highest and lowest values from the dataset and then taking the mean of the remaining values.

This is useful when there are outliers or extreme values in the data that might unduly influence the arithmetic mean.

By trimming off the extreme values, the trimmed mean provides a more robust estimate of the central tendency.
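A short sketch using SciPy (the data is made up): scipy.stats.trim_mean drops the stated proportion of values from each tail before averaging.

```python
import numpy as np
from scipy import stats

data = np.array([2, 4, 5, 5, 6, 7, 8, 100])  # hypothetical data with one outlier

plain_mean = np.mean(data)  # pulled up by the outlier: 17.125
# Drop the lowest and highest 12.5% (one value from each end of 8 values)
trimmed = stats.trim_mean(data, proportiontocut=0.125)

print(plain_mean, trimmed)  # 17.125 vs ~5.83
```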

Geometric Mean:

The geometric mean is a measure of central tendency, but it may not represent the exact midpoint or central value of the dataset as the arithmetic mean does.

Instead, it provides a measure or a rate of change that is influenced by the overall multiplicative relationship among the values.

The geometric mean is useful in scenarios where the relative magnitudes or ratios between values are more important than their absolute values such as ratios, growth rates, geometric sequences, or exponential data.

It is calculated by taking the nth root of the product of n values.

  • It gives more weight to smaller values and less weight to larger values; hence, it avoids distortions caused by extreme values or outliers.

Example:

Population growth: In demography and biology, the geometric mean is used to measure population growth rates. It takes into account the relative changes in population size over time.
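As a hedged sketch of that idea (the growth factors below are invented), the average growth rate over several periods is the geometric mean of the per-period growth factors:

```python
from scipy import stats

# Hypothetical yearly growth factors: +10%, +50%, -20%
factors = [1.10, 1.50, 0.80]

avg_factor = stats.gmean(factors)  # (1.10 * 1.50 * 0.80) ** (1/3)
print(f"average growth per year: {(avg_factor - 1) * 100:.2f}%")

# Sanity check: growing at avg_factor for 3 years matches the actual total growth
print(avg_factor ** 3, 1.10 * 1.50 * 0.80)  # both ~= 1.32
```

An arithmetic mean of the factors would overstate the true compounded growth; the geometric mean reproduces it exactly.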

Harmonic Mean:

Unlike the geometric mean, it does not calculate a rate of change; rather, it averages rates, ratios, speeds, and other reciprocal-based quantities.

It is particularly useful when you need an average that dampens the impact of large extreme values or outliers while emphasizing the contribution of smaller values.

The Harmonic mean mitigates the impact of extreme values or outliers.

The harmonic mean calculates the average by

  • taking the reciprocals of the values,
  • calculating the arithmetic mean of the reciprocals,
  • and then taking the reciprocal of that mean.

The harmonic mean inherently gives more weight to smaller values since they appear in the denominator of the average calculation.

This makes it suitable for scenarios where smaller values need to be emphasized or have a greater influence on the average.

For example:

Consider a scenario where you are calculating the average speed of a journey.

Let’s say you travel at 60 km/h for the first half of the journey and 40 km/h for the second half.

Using the arithmetic mean, the average speed would be (60 + 40) / 2 = 50 km/h.

However, this would not provide an accurate representation of the overall average speed because you have spent more time at a lower speed.

In such a case, you can use the harmonic mean to calculate the average speed.

Using the harmonic mean, the average speed would be 2 / ((1/60) + (1/40)) = 48 km/h. The harmonic mean gives more weight to the smaller value (40 km/h). This accurately reflects the average speed over the entire journey.
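We can reproduce this calculation with Python’s built-in statistics module:

```python
from statistics import harmonic_mean

# Two equal-distance halves of the journey at 60 km/h and 40 km/h
speeds = [60, 40]

avg_speed = harmonic_mean(speeds)  # 2 / (1/60 + 1/40) = 48.0
print(avg_speed)                   # 48.0, matching the worked example
```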

2. Measure of Dispersion (Study of 2nd Moment)

It is a statistical measure that provides information about how the data points are dispersed.

Here are some reasons why they are important:

  • They provide information about how diverse or homogeneous the dataset is.
  • Dispersion measures can help in assessing the quality and reliability of the data and can also help in comparing the distributions of different datasets.
  • If the data points are highly dispersed, it indicates a wider range of values and potentially greater variability in the underlying phenomena. This information is useful in identifying outliers, data errors, or inconsistencies.

Different Types of Measures of Dispersion:

  1. Range
  2. Interquartile Range (IQR)
  3. Variance
  4. Standard Deviation
  5. Coefficient of Variation
  6. Mean Absolute Deviation (MAD)

Range:

  • It is the simplest measure of dispersion and is calculated as the difference between the maximum and minimum values in the dataset.
  • It gives an idea of the total spread of the data but can be influenced by outliers.

Interquartile Range (IQR)

The Interquartile Range (IQR) is a statistical measure that represents the range between the 25th and 75th percentiles of a dataset.

It summarizes the variability within the central 50% of the data points, which are typically the most meaningful values in a dataset; this makes it robust against outliers.

By focusing on the central data points, IQR provides a more reliable comparison between datasets, especially when they have different sizes or contain outliers.

The IQR is commonly used in constructing box plots and helps in identifying extreme values. The box in a box plot represents the IQR, with the median indicated by a line inside the box. Values below Q1 − 1.5 × IQR and values above Q3 + 1.5 × IQR are considered outliers.
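A short NumPy sketch (made-up data) computing the range, the quartiles, the IQR, and the usual 1.5 × IQR outlier fences:

```python
import numpy as np

data = np.array([8, 10, 12, 13, 14, 15, 16, 45])  # hypothetical data, 45 is extreme

data_range = data.max() - data.min()     # range: max - min = 37

q1, q3 = np.percentile(data, [25, 75])   # 25th and 75th percentiles
iqr = q3 - q1                            # interquartile range

lower_fence = q1 - 1.5 * iqr             # boxplot outlier fences
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(f"range={data_range}, IQR={iqr}, outliers={outliers}")  # flags 45
```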

Variance

Variance is a statistical measure of how much the data points are spread out, or dispersed, around the mean.

It measures the average squared deviation of data points from the mean.

Variance is expressed in squared units because the deviations from the mean are squared, which keeps positive and negative deviations from canceling each other out and so captures the overall spread of the data.

Moreover, Squaring the differences amplifies the effect of larger deviations from the mean. This is important because larger deviations often indicate more significant variations or outliers in the dataset.

Unlike the range, which uses only the two extreme values, or the interquartile range (IQR), which considers only the middle 50% of the data, variance takes the entire dataset into account and provides a comprehensive measure of spread.

Because it is in squared units, variance is not directly interpretable as a spread of the data; it does not tell us the magnitude of the spread in the original units.

To quantify the spread of data, other measures such as the range, interquartile range (IQR), or standard deviation are more commonly used.

A higher variance indicates a greater spread, indicating that the data points are more scattered around the mean. Conversely, a lower variance suggests that the data points are closer to the mean and less spread out.

Why does sample variance have n-1 in the denominator instead of n?

The use of n-1 in the denominator of the sample variance formula is based on the concept of degrees of freedom and the need to account for the uncertainty introduced by estimating the population variance from sample data.

When calculating the sample variance, we lose one degree of freedom because the deviations are measured from the sample mean, which is itself estimated from the same data and acts as a constraint. Dividing by n-1 instead of n adjusts for this constraint and provides an unbiased estimate of the population variance.
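A minimal NumPy demonstration (synthetic data): np.var divides by n by default, while ddof=1 switches to the n-1 sample formula; the sample standard deviation is its square root.

```python
import numpy as np

sample = np.array([4.0, 8.0, 6.0, 5.0, 3.0])  # hypothetical sample data

var_pop = np.var(sample)              # divides by n   (ddof=0, default)
var_sample = np.var(sample, ddof=1)   # divides by n-1 (unbiased estimator)

std_sample = np.sqrt(var_sample)      # standard deviation, in the data's units
# equivalently: np.std(sample, ddof=1)

print(var_pop, var_sample, std_sample)  # 2.96, 3.7, ~1.92
```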

Standard Deviation

The standard deviation is the square root of the variance and is a commonly used measure of spread.

It is expressed in the same units as the original data, making it more easily interpretable.

Unlike variance, the standard deviation indicates how much the data points deviate, on average, from the mean.

A smaller standard deviation indicates that the values are closer to the mean, while a larger standard deviation suggests greater variability.

Mean Absolute Deviation

Mean Absolute Deviation (MAD) measures how spread out the values in a dataset are from the average (mean) value without considering the direction (positive or negative) of the differences.

It is less affected by outliers than measures like the standard deviation, which squares the deviations and thus amplifies large differences.

Example:

Imagine you are an HR analyst working for a company, and you are analyzing the salaries of the employees. The salary data is as follows (numbers are in thousands):

salary = {45,50,47,55,48,46,51,300}

In this dataset, the value “300” represents an outlier, possibly due to a high executive salary or a one-time bonus, which is significantly larger than the other salaries.

In this situation, using the standard deviation to measure the dispersion of salaries might not be the most appropriate choice. The standard deviation will be significantly affected by the outlier value of “300,” leading to a larger value and potentially misleading interpretation of the typical salary spread.

After the calculation, the MAD comes out to approximately 54.94 thousand dollars, well below the sample standard deviation of about 88.85 thousand dollars.
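These numbers can be verified directly (a NumPy sketch; MAD here means the mean absolute deviation from the mean):

```python
import numpy as np

salary = np.array([45, 50, 47, 55, 48, 46, 51, 300])  # in thousands

mean = salary.mean()                # 80.25
mad = np.abs(salary - mean).mean()  # mean absolute deviation ~= 54.94
sd = salary.std(ddof=1)             # sample standard deviation ~= 88.85

print(f"MAD = {mad:.2f}, SD = {sd:.2f}")
```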

However, it is better to treat such anomalies than to keep them. The MAD may have performed better than the SD, but if you look carefully at the salaries, the non-outlier values differ from one another by only a few thousand dollars.

Coefficient of Variation

For example, there is a salary and age column in a dataset, and we want to find out which variable set has more variability.

Since salary is measured in rupees and age in years, the SDs of the two variables cannot be compared directly.

Hence in order to compare them, we find the relative variability of each variable in the form of percentages and then compare the variability between them.

This is where the role of the coefficient of variation comes into the picture.

The coefficient of variation (CV) is a statistical measure that helps in comparing the variability of different datasets having different units.

It does this by expressing the standard deviation as a proportion of the mean.

A low coefficient of variation indicates that the data points are relatively close to the mean and have less variability, while a high coefficient of variation suggests a larger degree of variability.
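A short sketch with invented salary and age columns (CV = s / x̄, usually reported as a percentage):

```python
import numpy as np

salary = np.array([45_000, 50_000, 47_000, 55_000, 48_000])  # in rupees
age = np.array([25, 32, 28, 45, 38])                         # in years

def cv(x):
    """Coefficient of variation: sample standard deviation as a % of the mean."""
    return np.std(x, ddof=1) / np.mean(x) * 100

# The raw SDs are in different units, but both CVs are unitless percentages
print(f"salary CV = {cv(salary):.1f}%, age CV = {cv(age):.1f}%")
```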

Building a Stepping Stone for Inferential Statistics

Until now, we’ve covered two fundamental concepts, central tendency and dispersion, which give us an idea of the central value of the data and of the spread of the data around that central value.

By visualizing the distribution of the data, we can also see its spread, where most of the information lies, and any outliers. Recall that while learning about the IQR, we plotted a box plot, represented the data points on a straight line, and used that to understand the distribution of the data.

Main Problem

In real-world situations, we often work with sample data rather than the entire population data, and with that, we try to draw conclusions about the population data.

However, when making inferences or drawing conclusions about the population based on sample data, we need to assess the certainty or reliability of our conclusions.

Probability plays a crucial role in quantifying the uncertainty associated with our conclusions. To incorporate statistics with probability, we rely on probability distributions, and based on these distributions, we estimate the likelihood of different events or outcomes occurring in the population.

Thus, in order to understand inferential statistics, we first need to build a strong foundation in probability theory and the different types of probability distributions. The upcoming posts will start by building that foundation.

— — —

So, this was all from this post. Hope you like it.

Kindly follow me for more daily blogs like this. If you have any doubts and want to reach me, don’t hesitate to contact me via email at [email protected] or connect with me on LinkedIn.