Guide to z-score normalization in Python with NumPy


In statistics, a z-score tells us how many standard deviations away a value is from the mean. We use the following formula to calculate a z-score:

z = (X - μ) / σ

where:

  • X is a single raw data value
  • μ is the population mean
  • σ is the population standard deviation
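
As a rough illustration of the formula, the sketch below applies it directly with NumPy. The sample values and the variable names (values, mu, sigma) are made up for this example, and the array is treated as the whole population, so the standard deviation divides by N.

```python
import numpy as np

# Illustrative values, treated here as the entire population.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = values.mean()          # population mean
sigma = values.std(ddof=0)  # population standard deviation (divides by N)

# Apply z = (X - mu) / sigma to every value at once.
z_scores = (values - mu) / sigma
print(z_scores)
```

Here the mean is 5 and the standard deviation is 2, so the value 9 sits 2 standard deviations above the mean and the value 2 sits 1.5 below it.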

This tutorial explains how to calculate z-scores for raw data values in Python.

How to Calculate Z-Scores in Python

We can calculate z-scores in Python using scipy.stats.zscore, which uses the following syntax:

scipy.stats.zscore(a, axis=0, ddof=0, nan_policy='propagate')

where:

  • a: an array-like object containing the data
  • axis: the axis along which to calculate the z-scores. Default is 0.
  • ddof: degrees of freedom correction used when calculating the standard deviation. Default is 0, which gives the population standard deviation.
  • nan_policy: how to handle nan values in the input. Default is 'propagate', which returns nan. 'raise' throws an error and 'omit' performs the calculations while ignoring nan values.
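
To make the ddof and nan_policy parameters more concrete, here is a minimal sketch. The arrays sample and with_nan are made up for this example, and it assumes a SciPy version whose zscore accepts nan_policy, as in the signature above.

```python
import numpy as np
import scipy.stats as stats

sample = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# ddof=0 (default) divides by N; ddof=1 divides by N - 1.
z_population = stats.zscore(sample)
z_sample = stats.zscore(sample, ddof=1)

with_nan = np.array([1.0, 2.0, np.nan, 4.0, 5.0])

# 'propagate' (default): a single nan makes every z-score nan.
z_propagate = stats.zscore(with_nan)

# 'omit': the mean and standard deviation ignore the nan,
# and only the nan position itself stays nan.
z_omit = stats.zscore(with_nan, nan_policy='omit')

print(z_population)
print(z_sample)
print(z_propagate)
print(z_omit)
```

With ddof=1 the standard deviation is slightly larger, so the z-scores are slightly closer to zero than with the default.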

The following examples illustrate how to use this function to calculate z-scores for one-dimensional numpy arrays, multi-dimensional numpy arrays, and Pandas DataFrames.

Numpy One-Dimensional Arrays

Step 1: Import modules.

import pandas as pd
import numpy as np
import scipy.stats as stats

Step 2: Create an array of values.

data = np.array([6, 7, 7, 12, 13, 13, 15, 16, 19, 22])

Step 3: Calculate the z-scores for each value in the array.

stats.zscore(data)

[-1.394, -1.195, -1.195, -0.199, 0, 0, 0.398, 0.598, 1.195, 1.793]

Each z-score tells us how many standard deviations away an individual value is from the mean. For example:

  • The first value of “6” in the array is 1.394 standard deviations below the mean.
  • The fifth value of “13” in the array is 0 standard deviations away from the mean, i.e. it is equal to the mean.
  • The last value of “22” in the array is 1.793 standard deviations above the mean.
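
One way to sanity-check these results is to recompute them by hand and confirm that the standardized values have a mean of roughly 0 and a standard deviation of 1. The sketch below does this for the same array; the names z and manual are chosen only for this example.

```python
import numpy as np
import scipy.stats as stats

data = np.array([6, 7, 7, 12, 13, 13, 15, 16, 19, 22])
z = stats.zscore(data)

# Same calculation by hand: subtract the mean, divide by the
# population standard deviation (NumPy's std uses ddof=0 by default).
manual = (data - data.mean()) / data.std()

print(np.allclose(z, manual))   # do both approaches agree?
print(z.mean(), z.std())        # should be about 0 and 1
```

The mean will typically print as a tiny floating-point value rather than an exact 0.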

Numpy Multi-Dimensional Arrays

If we have a multi-dimensional array, we can use the axis parameter to specify the axis along which the z-scores are calculated. Setting axis=1 calculates each z-score relative to the mean and standard deviation of its own row. For example, suppose we have the following multi-dimensional array:

data = np.array([[5, 6, 7, 7, 8],
                 [8, 8, 8, 9, 9],
                 [2, 2, 4, 4, 5]])

We can use the following syntax to calculate the z-scores for each array:

stats.zscore(data, axis=1)

[[-1.569 -0.588  0.392  0.392  1.373]
 [-0.816 -0.816 -0.816  1.225  1.225]
 [-1.167 -1.167  0.5    0.5    1.333]]

The z-scores for each individual value are shown relative to the array they’re in. For example:

  • The first value of “5” in the first array is 1.569 standard deviations below the mean of its array.
  • The first value of “8” in the second array is 0.816 standard deviations below the mean of its array.
  • The first value of “2” in the third array is 1.167 standard deviations below the mean of its array.
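
The effect of the axis parameter can be sketched by reproducing the row-wise result with plain NumPy and comparing it with the column-wise default. The names row_mean, row_std, z_rows and z_cols are chosen only for this example.

```python
import numpy as np
import scipy.stats as stats

data = np.array([[5, 6, 7, 7, 8],
                 [8, 8, 8, 9, 9],
                 [2, 2, 4, 4, 5]])

# keepdims=True keeps the results as a (3, 1) column so they
# broadcast against each row of the (3, 5) array.
row_mean = data.mean(axis=1, keepdims=True)
row_std = data.std(axis=1, keepdims=True)
z_rows = (data - row_mean) / row_std

# axis=0 (the default) standardizes each column instead.
z_cols = stats.zscore(data, axis=0)

print(np.allclose(z_rows, stats.zscore(data, axis=1)))
print(z_cols)
```

With axis=0, each value is compared with the other values in its column, so the same input produces different z-scores.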

Pandas DataFrames

Suppose we instead have a Pandas DataFrame of random integers (the values will differ on each run unless a random seed is set):

data = pd.DataFrame(np.random.randint(0, 10, size=(5, 3)), columns=['A', 'B', 'C'])
data

   A  B  C
0  8  0  9
1  4  0  7
2  9  6  8
3  1  8  1
4  8  0  8

We can use the apply function to calculate the z-score of individual values by column:

data.apply(stats.zscore)

          A         B         C
0  0.659380 -0.802955  0.836080
1 -0.659380 -0.802955  0.139347
2  0.989071  0.917663  0.487713
3 -1.648451  1.491202 -1.950852
4  0.659380 -0.802955  0.487713

The z-scores for each individual value are shown relative to the column they’re in. For example:

  • The first value of “8” in the first column is 0.659 standard deviations above the mean value of its column.
  • The first value of “0” in the second column is 0.803 standard deviations below the mean value of its column.
  • The first value of “9” in the third column is 0.836 standard deviations above the mean value of its column.
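
The same column-wise z-scores can also be computed with pandas arithmetic alone. The sketch below is one possible approach, not the only one; it rebuilds the DataFrame shown above with fixed values so that the result is reproducible, and the names df, z_pandas and z_scipy are chosen only for this example. Note that pandas' std defaults to ddof=1, so ddof=0 has to be passed explicitly to match stats.zscore.

```python
import pandas as pd
import scipy.stats as stats

# Fixed copy of the example DataFrame so the output is reproducible.
df = pd.DataFrame({'A': [8, 4, 9, 1, 8],
                   'B': [0, 0, 6, 8, 0],
                   'C': [9, 7, 8, 1, 8]})

# Subtract each column's mean and divide by its population
# standard deviation (ddof=0 matches the stats.zscore default).
z_pandas = (df - df.mean()) / df.std(ddof=0)

# Same result via apply, as in the example above.
z_scipy = df.apply(stats.zscore)

print(z_pandas.round(3))
```

Leaving out ddof=0 would use the sample standard deviation instead and give slightly smaller z-scores in absolute value.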
