How do you check data distribution in Python?
Many methods exist for testing whether a variable has a normal distribution. In this article, you will find out which one to use.
1. Histogram
1.1. Introduction
The first method, which almost everyone knows, is the histogram. The histogram is a data visualization that shows the distribution of a variable. It gives us the frequency of occurrence of each value (or range of values) in the dataset, which is what distributions are about.
1.2. Interpretation
In the picture below, two histograms show a normal distribution and a non-normal distribution.
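As a rough illustration of this contrast, the sketch below draws one histogram for a normal sample and one for a clearly non-normal sample. The data are synthetic and the column names `normal` and `exponential` are assumptions made for this example, not part of the original article.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical example data: a normal sample and a skewed (exponential) sample
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "normal": rng.normal(loc=0.0, scale=1.0, size=1000),
    "exponential": rng.exponential(scale=1.0, size=1000),
})

# One histogram per column; `bins` sets how many intervals the values are grouped into
axes = df.hist(bins=30, figsize=(10, 4))
```

Calling `plt.show()` afterwards displays the figure. The normal sample should look roughly bell-shaped and symmetric, while the exponential sample piles up on the left with a long right tail.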
1.3. Implementation
A histogram can be created easily in Python as follows:
[Code listing: Creating a histogram using pandas in Python]
1.4. Conclusion
The histogram is a great way to quickly visualize the distribution of a single variable.
2. Box Plot
2.1. Introduction
The box plot is another visualization technique that can be used for detecting non-normal samples. The box plot shows the 5-number summary of a variable: minimum, first quartile, median, third quartile and maximum.
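To make the 5-number summary concrete, here is a minimal sketch that computes it with pandas and then draws side-by-side box plots. The data and the column names are synthetic assumptions chosen for this example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical example data: a normal sample and a skewed (exponential) sample
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "normal": rng.normal(loc=0.0, scale=1.0, size=1000),
    "exponential": rng.exponential(scale=1.0, size=1000),
})

# 5-number summary per column: minimum, first quartile, median, third quartile, maximum
five_number_summary = df.quantile([0.0, 0.25, 0.5, 0.75, 1.0])
print(five_number_summary)

# One box per column, drawn next to each other on the same axes
ax = df.boxplot()
```

Note that with matplotlib's default settings the whiskers extend to at most 1.5 times the interquartile range beyond the box, and more extreme points are drawn individually, so the whisker ends are not always the exact minimum and maximum.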
2.2. Interpretation
The box plot is a great visualization technique because many box plots can be drawn next to each other. This very fast overview of variables gives us an idea of their distributions and, as a “bonus”, we get the complete 5-number summary that will help us in further analysis. You should look at two things:
2.3. Implementation
A box plot can be easily created in Python as follows:
[Code listing: Creating a boxplot using pandas in Python]
2.4. Conclusion
The box plot is a great way to visualize the distributions of multiple variables at the same time, but a deviation in width/pointiness is hard to identify using box plots.
3. QQ Plot
3.1. Introduction
With QQ plots we’re starting to get into the more serious stuff, as this requires a bit more understanding than the previously described methods. QQ plot stands for Quantile vs Quantile plot, which is exactly what it does: it plots theoretical quantiles against the actual quantiles of our variable.
3.2. Interpretation
If our variable follows a normal distribution, the quantiles of our variable should line up with the “theoretical” normal quantiles: a straight line on the QQ plot tells us we have a normal distribution.
[Figure: Normal (left), uniform (middle) and exponential (right) QQ plots]
As seen in the picture, the points on a normal QQ plot follow a straight line, whereas other distributions deviate strongly.
In practice, we often see something less pronounced but similar in shape. Over- or underrepresentation in the tails should raise doubts about normality, in which case you should use one of the hypothesis tests described below.
3.3. Implementation
A QQ plot can be created using the statsmodels API in Python as follows:
[Code listing: Creating a QQ Plot using statsmodels]
3.4. Conclusion
The QQ plot shows deviations from a normal distribution much more clearly than a histogram or box plot.
4. Kolmogorov Smirnov test
4.1. Introduction
If the QQ plot and other visualization techniques are not conclusive, statistical inference (hypothesis testing) can give a more objective answer to whether our variable deviates significantly from a normal distribution. If you have doubts about how and when to use hypothesis testing, here’s an article that gives an intuitive explanation of hypothesis testing.
The Kolmogorov Smirnov test computes the distances between the empirical cumulative distribution function of the sample and the theoretical cumulative distribution function, and defines the test statistic as the supremum of the set of those distances. The advantage of this is that the same approach can be used for comparing against any distribution, not necessarily only the normal distribution.
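As a minimal sketch of the test just described, `scipy.stats.kstest` can compare a sample against a fully specified distribution. The samples below are synthetic, and testing against the standard normal (mean 0, standard deviation 1) is an assumption made for this example.

```python
import numpy as np
from scipy import stats

# Hypothetical example data: one standard normal sample, one exponential sample
rng = np.random.default_rng(seed=0)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=500)
exponential_sample = rng.exponential(scale=1.0, size=500)

# "norm" with no extra arguments means the standard normal N(0, 1):
# the reference distribution is fixed in advance, not estimated from the data.
stat_normal, p_normal = stats.kstest(normal_sample, "norm")
stat_exp, p_exp = stats.kstest(exponential_sample, "norm")

print(f"normal sample:      KS statistic = {stat_normal:.3f}, p-value = {p_normal:.3f}")
print(f"exponential sample: KS statistic = {stat_exp:.3f}, p-value = {p_exp:.3g}")
```

The exponential sample has no negative values while half of the standard normal's mass lies below zero, so its KS statistic is at least 0.5 and the null hypothesis is rejected. To test against a normal distribution with another mean and standard deviation, pass them explicitly, for example `stats.kstest(x, "norm", args=(5.0, 2.0))`.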
4.2. Interpretation
The test statistic of the KS test is the Kolmogorov Smirnov statistic, which follows a Kolmogorov distribution if the null hypothesis is true. If the observed data perfectly follow a normal distribution, the value of the KS statistic will be 0. The p-value is used to decide whether the difference is large enough to reject the null hypothesis:
4.3. Implementation
The KS test can be implemented in Python using SciPy as follows. It returns the KS statistic and its p-value.
[Code listing: Applying the KS Test in Python using Scipy]
4.4. Conclusion
The KS test is well known, but it has little power. This means that a large number of observations is necessary to reject the null hypothesis. It is also sensitive to outliers. On the other hand, it can be used for other types of distributions.
5. Lilliefors test
5.1. Introduction
The Lilliefors test is strongly based on the KS test. The difference is that the Lilliefors test allows the mean and variance of the population distribution to be estimated from the data rather than pre-specified by the user. Because of this, the Lilliefors test uses the Lilliefors distribution rather than the Kolmogorov distribution.
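A minimal sketch of the Lilliefors test using `statsmodels.stats.diagnostic.lilliefors` is shown below. The samples are synthetic and chosen only for illustration. Unlike the KS example, no mean or variance is supplied, because the test estimates them from the sample.

```python
import numpy as np
from statsmodels.stats.diagnostic import lilliefors

# Hypothetical example data: a normal sample with unknown mean/variance, and a skewed sample
rng = np.random.default_rng(seed=0)
normal_sample = rng.normal(loc=5.0, scale=2.0, size=500)
exponential_sample = rng.exponential(scale=1.0, size=500)

# dist="norm" tests for normality with mean and variance estimated from the data
stat_normal, p_normal = lilliefors(normal_sample, dist="norm")
stat_exp, p_exp = lilliefors(exponential_sample, dist="norm")

print(f"normal sample:      statistic = {stat_normal:.3f}, p-value = {p_normal:.3f}")
print(f"exponential sample: statistic = {stat_exp:.3f}, p-value = {p_exp:.3f}")
```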
5.2. Interpretation
5.3. Implementation
The Lilliefors test implementation in statsmodels returns the value of the Lilliefors test statistic and the p-value as follows. Attention: in the statsmodels implementation, p-values lower than 0.001 are reported as 0.001 and p-values higher than 0.2 are reported as 0.2.
[Code listing: Applying the Lilliefors test using statsmodels]
5.4. Conclusion
Although Lilliefors is an improvement on the KS test, its power is still lower than that of the Shapiro Wilk test.
6. Shapiro Wilk test
6.1. Introduction
The Shapiro Wilk test is generally regarded as the most powerful test when testing for a normal distribution. It has been developed specifically for the normal distribution and, unlike the KS test for example, it cannot be used for testing against other distributions.
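As a minimal sketch, the Shapiro Wilk test is available in SciPy as `scipy.stats.shapiro`, which is the function assumed here. The samples are synthetic and chosen only for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical example data: a normal sample and a skewed (exponential) sample
rng = np.random.default_rng(seed=0)
normal_sample = rng.normal(loc=5.0, scale=2.0, size=500)
exponential_sample = rng.exponential(scale=1.0, size=500)

# shapiro returns the W statistic (close to 1 for normal-looking data) and the p-value
w_normal, p_normal = stats.shapiro(normal_sample)
w_exp, p_exp = stats.shapiro(exponential_sample)

print(f"normal sample:      W = {w_normal:.4f}, p-value = {p_normal:.3f}")
print(f"exponential sample: W = {w_exp:.4f}, p-value = {p_exp:.3g}")
```

A small p-value, as for the exponential sample, is evidence against normality. A large p-value only means the test found no significant deviation.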
6.2. Interpretation
6.3. Implementation
The Shapiro Wilk test can be implemented as follows. It returns the test statistic, called W, and the p-value. Attention: for N > 5000 the W test statistic is accurate but the p-value may not be.
[Code listing: Applying the Shapiro Wilk test using statsmodels in Python]
6.4. Conclusion
The Shapiro Wilk test is the most powerful test when testing for a normal distribution. You should definitely use this test.
7. Conclusion: which approach to use
For quick and visual identification of a normal distribution, use a QQ plot if you have only one variable to look at and a box plot if you have many. Use a histogram if you need to present your results to a non-statistical audience. As a statistical test to confirm your hypothesis, use the Shapiro Wilk test. It is the most powerful test, which should be the decisive argument. When testing against other distributions, you cannot use Shapiro Wilk and should use, for example, the Anderson-Darling test or the KS test.
How do you check for data distribution in Python?
Histogram Plot
A simple and commonly used plot to quickly check the distribution of a sample of data is the histogram. In the histogram, the data is divided into a pre-specified number of groups called bins. Each observation is then assigned to a bin, and the number of observations in each bin is counted.
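The binning step can be sketched without any plotting by using `numpy.histogram`, which returns the count per bin and the bin edges. The sample and the choice of 10 bins are assumptions made for this example.

```python
import numpy as np

# Hypothetical example data
rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=0.0, scale=1.0, size=1000)

# Split the range of the data into 10 equal-width bins and count observations per bin
counts, bin_edges = np.histogram(sample, bins=10)

for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"[{left:6.2f}, {right:6.2f}): {count}")
```

Ten bins are delimited by eleven edges, and the counts add up to the sample size.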
How do I know what distribution my data is?
Probability plots might be the best way to determine whether your data follow a particular distribution. If your data follow the straight line on the graph, the distribution fits your data. This process is simple to do visually. Informally, this process is called the “fat pencil” test.
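One way to sketch such a probability plot is `scipy.stats.probplot`, which also returns the least-squares line through the points and its correlation coefficient `r`. The sample is synthetic, and using `r` as a rough numeric companion to the visual check is an assumption of this example, not a formal test.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Hypothetical example data drawn from a normal distribution
rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=5.0, scale=2.0, size=500)

# Ordered sample values against theoretical normal quantiles, plus the fitted line
fig, ax = plt.subplots()
(theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(
    sample, dist="norm", plot=ax
)

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, r = {r:.4f}")
```

For normal data the points lie close to the line and `r` is close to 1. The slope and intercept approximate the standard deviation and mean of the sample.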
How do you check a column distribution in Python?
You can use .describe() to see a number of basic statistics about the column, such as the mean, min, max, and standard deviation.
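A minimal sketch with pandas, assuming a hypothetical numeric column named `value`:

```python
import numpy as np
import pandas as pd

# Hypothetical example data with a single numeric column
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({"value": rng.normal(loc=50.0, scale=10.0, size=1000)})

# count, mean, std, min, quartiles (25%, 50%, 75%) and max of the column
summary = df["value"].describe()
print(summary)
```

A mean and median (the `50%` row) that are far apart hint at a skewed distribution.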