Understanding Descriptive Statistics with Gaussian, Left-Skewed, and Right-Skewed Datasets

Statistically re-engineer your data to make it skewed


The normal distribution and its characteristics have been discussed extensively, and a good understanding of its descriptive statistics is just as important. But have you ever taken a perfectly normally distributed dataset and skewed it to the left or right, in order to test its normality or observe the effects on its various descriptive statistics? If not, read on, because that is exactly what we are going to do in this article.

Background:

Let me briefly introduce the Normal (or Gaussian) Distribution and Descriptive Statistics before we jump into a Python notebook and start editing the data.

In a normally distributed dataset, the majority of values cluster in the middle of the range, and the remaining values taper off symmetrically toward either extreme. The normal distribution is crucial because most naturally occurring phenomena are normally or approximately normally distributed. Examples of such variables include height, birth weight, shoe size, blood pressure, retirement age, exam marks, income distribution, and the price of commodities within an economy.

Properties of a normally distributed dataset:

1. When plotted, the normally distributed dataset has a symmetric bell shape with the mean in the middle.

2. The mean, median, and mode are all equal.

3. It will only have one peak since it is unimodal.

4. Half of the values fall below and half above the mean, indicating that the distribution is symmetric about the mean.

5. It adheres to the Empirical Rule: The empirical rule, also referred to as the three-sigma rule or 68–95–99.7 rule, is a statistical rule which states that for a normal distribution,

  • About 68% of values fall within one standard deviation of the mean.
  • About 95% of the values fall within two standard deviations from the mean.
  • Almost all of the values — about 99.7% — fall within three standard deviations from the mean.

6. The coefficient of skewness is zero.

Descriptive Statistics:

Descriptive statistics are summaries that quantitatively describe or list the characteristics of a specific dataset. We can get a good notion of the data distribution by looking at the descriptive statistics.

Different types of descriptive statistics include:

  • Measure of Central Tendency — Mean, Median, and Mode
  • Measure of Dispersion — Range, IQR, Variance, and Standard Deviation
  • Positional Statistics — Minimum, Maximum, Deciles, and Percentiles
  • Measure of Symmetry — Skewness and Kurtosis

The Descriptive Statistics of a Perfectly Normal Distribution

Let’s start with a dataset that is normally distributed, then skew the distribution to the left and right to examine how the statistics change. I will begin with an excellent example of normally distributed data from the field of biometry. The data, “Housefly Wing Lengths”, was taken from Seattlecentral.edu (Sokal, R.R., and P.E. Hunter, 1955).

There are 100 records in the data, and it has only one column: “length”.

import pandas as pd

df = pd.read_excel("s057.xls")
print(df.shape)
print(df.info())

Figure 1. Dataset Summary. Author Image.

To compute the different descriptive statistics, we can define the following functions in Python:
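The notebook’s function definitions were not captured in this article, so here is a minimal sketch of what such helpers might look like, grouped by the categories listed above (the names `central_tendency`, `dispersion`, and `symmetry` are my own):

```python
import numpy as np
from scipy import stats

def central_tendency(values):
    """Return the mean, median, and mode of a 1-D array-like."""
    values = np.asarray(values, dtype=float)
    # Mode: the most frequent value (ties broken by the smallest value)
    uniq, counts = np.unique(values, return_counts=True)
    return values.mean(), float(np.median(values)), float(uniq[counts.argmax()])

def dispersion(values):
    """Return the range, IQR, sample variance, and standard deviation."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    return (values.max() - values.min(), q3 - q1,
            values.var(ddof=1), values.std(ddof=1))

def symmetry(values):
    """Return the skewness and (excess) kurtosis coefficients."""
    values = np.asarray(values, dtype=float)
    return stats.skew(values), stats.kurtosis(values)
```

The mode-by-frequency approach works here because the wing lengths are recorded on a discrete 0.1 mm grid; for truly continuous data you would bin the values first.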

Let us now examine our dataset for various features of a normal distribution.

Test 1: Normality Test

We will use SciPy’s Shapiro-Wilk test to check the normality of the dataset. It tests the null hypothesis that the data was drawn from a normal distribution, i.e.,

  • H0: the data is normally distributed
  • H1: the data doesn’t follow a normal distribution

In the SciPy implementation, the p-value can be interpreted as follows:

  • p <= alpha: reject H0, not normally distributed.
  • p > alpha: fail to reject H0, i.e., normally distributed

Results with a higher p-value often support the hypothesis that our sample was taken from a Gaussian distribution.
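In code, the whole test is a single call to `scipy.stats.shapiro`; a minimal sketch of the check (the wrapper name `normality_test` is my own):

```python
from scipy import stats

def normality_test(values, alpha=0.05):
    """Shapiro-Wilk normality test; prints a verdict and returns the p-value."""
    stat, p = stats.shapiro(values)
    print(f"Test Statistics = {stat:.3f}, and p-value is {p:.3f}")
    if p > alpha:
        print(f"Data is Normally Distributed (fail to reject H0 as p-value is > {alpha})")
    else:
        print(f"Data is NOT Normally Distributed (reject H0 as p-value is < {alpha})")
    return p
```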

We can observe from the following output that our dataset is normally distributed.

Test Statistics = 0.993, and p-value is 0.876
Data is Normally Distributed (fail to reject H0 as p-value is > 0.05)

Test 2: Descriptive Statistics for a normal distribution:

i. Mean = Median = Mode

ii. The distribution is unimodal

iii. It has a bell-shaped symmetrical curve with the mean at its center

iv. The skewness coefficient = 0

To test the above four characteristics of a normal distribution, use the following code block. It will use the functions defined earlier for descriptive statistics.
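The author’s exact code is not reproduced here; a compact self-contained sketch of the four checks might look like this (the name `normal_checks` and the crude unique-value modality test are my own; the histogram for the bell-shape check needs matplotlib):

```python
import numpy as np
from scipy import stats

def normal_checks(values, plot=False):
    """Report mean/median/mode equality, modality, and skewness."""
    values = np.asarray(values, dtype=float)
    uniq, counts = np.unique(values, return_counts=True)
    summary = {
        "mean": float(values.mean()),
        "median": float(np.median(values)),
        "mode": float(uniq[counts.argmax()]),
        # Crude modality check: exactly one most-frequent value
        "unimodal": int((counts == counts.max()).sum()) == 1,
        "skew": float(stats.skew(values)),
    }
    for key, val in summary.items():
        print(f"{key:>8}: {val}")
    if plot:  # histogram for the bell-shape check
        import matplotlib.pyplot as plt
        plt.hist(values, bins=15, edgecolor="black")
        plt.axvline(summary["mean"], linestyle="--", label="mean")
        plt.legend()
        plt.show()
    return summary
```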

And here is the output:

Figure 2. Descriptive Statistics with Histogram for a Normal Distribution. Author Image.

Looking at the above output, we can observe that:

  • Mean, Median, and Mode are all equal (45.5)
  • The distribution is unimodal.
  • We have a symmetrical bell-shaped distribution with the mean at its center.
  • The skewness coefficient is zero.

Test 3: Check if it follows the Empirical Rule: 68–95–99.7 rule with 1–2–3 standard deviations

A normally distributed dataset follows the Empirical Rule i.e. 68%, 95%, and 99.7% of the values lie within one, two, and three standard deviations of the mean, respectively. In order to test this property, we can use the following code block.
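One way to verify the rule empirically is to count the fraction of observations within k standard deviations of the mean; the sketch below does exactly that (the helper name `empirical_rule` is my own):

```python
import numpy as np

def empirical_rule(values):
    """Print the share of values within 1, 2, and 3 standard deviations."""
    values = np.asarray(values, dtype=float)
    mu, sigma = values.mean(), values.std()
    shares = []
    for k in (1, 2, 3):
        # Fraction of observations with |x - mu| <= k * sigma, as a percentage
        pct = 100.0 * np.mean(np.abs(values - mu) <= k * sigma)
        shares.append(pct)
        print(f"Area covered from mean with +/- {k} sigma is: {pct:.2f}")
    return shares
```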

And here is the output:

Area covered from mean with +/- 1 sigma is: 68.27
Area covered from mean with +/- 2 sigma is: 95.45
Area covered from mean with +/- 3 sigma is: 99.73

Test 4: Outliers Check

A normal distribution may or may not contain outliers; if there are none, everything is fine. We will first run this test on our normally distributed dataset. Then, we will run it again on the left- and right-skewed datasets to determine whether the data manipulation has introduced any outliers.
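A sketch of the two detection rules used here (the helper name `find_outliers` is mine): the IQR rule flags values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR, and the z-score rule flags values more than three standard deviations from the mean.

```python
import numpy as np

def find_outliers(values):
    """Detect outliers with the 1.5*IQR rule and the 3-sigma (z-score) rule."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    iqr_outliers = values[(values < lower) | (values > upper)]
    z_scores = (values - values.mean()) / values.std()
    sigma_outliers = values[np.abs(z_scores) > 3]
    print("IQR outliers    :", iqr_outliers)
    print("3-sigma outliers:", sigma_outliers)
    return iqr_outliers, sigma_outliers
```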

And we can see the output as follows: neither the IQR nor the 3-sigma technique found any outliers.

Figure 3: Result of Outlier Test. Author Image.

So far, we’ve examined and confirmed the parameters of a fully normal distribution. Now, we’ll skew the distribution to the left and right to see how it affects the normality test, descriptive statistics/visualization, and outlier presence.

Change the data distribution to make it a left-skewed (or negatively-skewed) dataset

We know that a left-skewed (or negatively skewed) dataset is one in which the tail of the distribution is longer on the left than on the right, and the median is closer to the third quartile than to the first. For such a distribution, we should also have Mode > Median > Mean.

We will clip or drop some of the values from the top percentile of our dataset to make it left-skewed. This leaves more data in the lower quantiles, skewing the distribution somewhat negatively (to the left).
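The author’s exact clipping code is not shown; one way to get a similar effect is to keep only the 50 smallest of the 100 values (`skew_left` is my own helper, and the printed cutoff and skew depend on your data):

```python
import numpy as np
import pandas as pd
from scipy import stats

def skew_left(df, column, keep=50):
    """Keep only the `keep` smallest rows, dropping the upper tail.

    With the right tail removed, the bulk of the data sits near the
    cutoff and a long left tail remains, i.e. negative skew.
    """
    df_left = df.nsmallest(keep, column).sort_index()
    print(df_left[column].max())  # largest value that survives the cut
    print("Resulting dataframe shape is:", df_left.shape)
    print("New Skew value is:", round(float(stats.skew(df_left[column])), 2))
    return df_left
```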

And when you print the coefficient of skewness, you will notice that it is a non-zero negative value (-0.83 in this case).

46.4
Resulting dataframe shape is: (50, 1)
New Skew value is: -0.83

We already created a few functions at the beginning for testing and calculating the various statistics. We will now call them one by one on the new dataset (which should be a left-skewed distribution) and see how the results change.

Test 1: Normality Test

Running the same SciPy Shapiro-Wilk test on the new data frame df_left_skew yields this result.

Test Statistics = 0.911, and p-value is 0.001
Data is NOT Normally Distributed (reject H0 as p-value is < 0.05)

So, we now have proof that our dataset no longer follows a normal distribution. Let’s look at its descriptive statistics and distribution plot to validate this.

Test 2: Descriptive Statistics on Changed Data:

Calling the previously defined function for descriptive statistics and plotting the histogram yields the following result:

Figure 4. Descriptive Statistics and Histogram for left-skewed (or negatively-skewed) dataset. Author Image.

So, with a left-skew dataset, we can see that:

  • Mean, median, and mode no longer overlap.
  • In fact, we can see that Mode > Median > Mean, as expected for left-skewed (negatively skewed) data.
  • There is a long left tail, which pulls the mean to the left of the peak and puts the median closer to the third quartile.

Test 3: Outliers Check

Finally, we will check the negatively skewed data for outliers; a skewed dataset is likely to contain some. Running the outlier test on this left-skewed dataset yields the following results:

Figure 5: Outliers Presence in left-skewed dataset. Author Image.

We see that an outlier was spotted by the IQR approach but not by the 3-sigma method. Also, the outlier in this case lies below the lower fence of Q1 - 1.5*IQR, indicating a negatively skewed dataset. Please keep in mind that the IQR approach flags values beyond roughly +/- 2.7 sigma (and is therefore less conservative), whereas the 3-sigma (or z-score) method flags values beyond +/- 3 sigma. Refer to this blog for more details on outlier detection.

Change the data distribution to make it a right-skewed (or positively-skewed) dataset

In this last section, we will clip or drop some of the values from the bottom percentile of our original dataset to make it a right-skewed distribution. This leaves more data in the upper quantiles, skewing the distribution somewhat positively (to the right). We can use the following code block to achieve this:
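Mirroring the left-skew step, one way to achieve this is to keep only the 50 largest values (`skew_right` is my own helper; again, the printed cutoff and skew depend on your data):

```python
import numpy as np
import pandas as pd
from scipy import stats

def skew_right(df, column, keep=50):
    """Keep only the `keep` largest rows, dropping the lower tail.

    With the left tail removed, the bulk of the data sits near the
    cutoff and a long right tail remains, i.e. positive skew.
    """
    df_right = df.nlargest(keep, column).sort_index()
    print(df_right[column].min())  # smallest value that survives the cut
    print("Resulting dataframe shape is:", df_right.shape)
    print("New Skew value is:", round(float(stats.skew(df_right[column])), 2))
    return df_right
```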

And this time, when you print the coefficient of skewness, you will notice that it is a non-zero positive value (0.83 in this case).

44.6
Resulting dataframe shape is: (50, 1)
New Skew value is: 0.83

Let’s quickly run the normality test, see the descriptive statistics, and also check for outliers.

Test 1: Normality Test

Running the same SciPy Shapiro-Wilk test on the new data frame df_right_skew yields this result.

Test Statistics = 0.911, and p-value is 0.001
Data is NOT Normally Distributed (reject H0 as p-value is < 0.05)

So, we now have proof that our dataset no longer follows a normal distribution. Let’s look at its descriptive statistics and distribution plot to validate this.

Test 2: Descriptive Statistics on right-skewed Data:

We just need to pass the new data frame into the previously defined function. The descriptive statistics and the histogram for the data frame df_right_skew look like this:

Figure 6. Descriptive Statistics and Histogram for right-skewed (or positively-skewed) dataset. Author Image.

So, for a right-skewed dataset, we can see that:

  • Mean, median, and mode no longer overlap.
  • In fact, we can see that Mode < Median < Mean, as expected for right-skewed (positively skewed) data.
  • There is a long right tail, hence the mean sits to the right of the peak.

Test 3: Outliers Check

We can use the same code block and pass the new data frame df_right_skew to check if we have any outliers now. Running the outliers test on this right-skewed dataset yields the following results:

Figure 7: Outliers Presence in right-skewed dataset. Author Image.

Again, an outlier was spotted by the IQR method but not by the 3-sigma method, since the IQR approach is less conservative, as explained above. Note that the outlier in this case lies beyond the upper fence of Q3 + 1.5*IQR, indicating a positively skewed dataset.

Conclusion:

In this post, we began with a perfectly normally distributed dataset and tested its numerous properties in relation to a normal distribution. We then altered the data to create a left-skewed dataset and a right-skewed dataset. We also looked at how the descriptive statistics changed when the data was skewed to the left or right. This should have helped you understand how a skewed dataset’s statistics or histogram appears.
