How can the statistical standard score or Z-score be calculated and plotted using Python?

Published: February 19, 2024

Tags: Python; Scipy;


Introduction

The Z-score, also known as the standard score, expresses how far a data point lies from the mean of a dataset, measured in standard deviations. It's a valuable tool in statistics for comparing values from different datasets and determining the relative position of a data point within a distribution.

In this article, we'll explore how to calculate and plot Z-scores using Python, leveraging libraries such as NumPy, SciPy and Matplotlib.

Creating a synthetic dataset

To illustrate z-score calculation, we will generate random numbers following a normal distribution with a mean of 10 and a standard deviation of 4 as an example:

import numpy as np
import matplotlib.pyplot as plt

mu = 10.0
sigma = 4.0

X = np.random.randn(100000) * sigma + mu
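Because the call above is unseeded, every run produces different numbers. A minimal sketch of a reproducible variant, assuming NumPy's Generator API; the names rng and X_seeded and the seed value 42 are arbitrary choices for illustration, not part of the original article:

```python
import numpy as np

mu = 10.0     # target mean
sigma = 4.0   # target standard deviation

# Seeded generator (assumption: seed 42 is arbitrary) so the sample is reproducible
rng = np.random.default_rng(42)

# Draw 100000 values from a normal distribution with mean mu and std sigma
X_seeded = rng.normal(loc=mu, scale=sigma, size=100_000)

print(X_seeded.mean(), X_seeded.std())
```

The printed sample mean and standard deviation should be close to, but not exactly, 10 and 4.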

To visualize your data, consider creating a simple histogram using matplotlib:

import numpy as np
import matplotlib.pyplot as plt

hx, hy, _ = plt.hist(X, bins=50, density=1, color="coral")

plt.ylim(0.0,max(hx)+0.01)

plt.title('How can the statistical standard score or Z-score \n be calculated and plotted using Python ?')
plt.grid(linestyle='--')

plt.savefig("zscore_python_01.png", bbox_inches='tight')
plt.show()


Method 1: Computing Z-Scores

After generating the random numbers, we can calculate the z-score for a chosen value using the formula

$$ Z = \frac{(X - \mu)}{\sigma} $$

where $\mu$ represents the mean and $\sigma$ denotes the standard deviation of the dataset.

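Applied to normally distributed data, this formula yields a standard normal distribution, with mean 0 and standard deviation 1. For data that is not normally distributed, the result still has mean 0 and standard deviation 1, but the shape of the distribution is unchanged.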

The z-score is a measure of how many standard deviations an individual data point is away from the mean of a distribution. It helps us to understand the relative position of a data point within a distribution and is used to compare different distributions with each other.

To calculate the z-score, we first subtract the mean from the individual data point and then divide it by the standard deviation. This standardized value gives us an idea of how far away the data point is from the mean in terms of standard deviations. If the resulting z-score is positive, it means that the data point is above the mean, and if it is negative, it indicates that the data point is below the mean.
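As a small worked example of the calculation just described, the sketch below applies the formula to two hypothetical values (18 and 4, chosen only for illustration) with the same mean and standard deviation used in this article. The helper name z_score is an assumption, not a library function:

```python
mu = 10.0     # mean of the distribution
sigma = 4.0   # standard deviation of the distribution

def z_score(x, mu, sigma):
    """Return how many standard deviations x lies from mu."""
    return (x - mu) / sigma

z_above = z_score(18.0, mu, sigma)  # positive: the value is above the mean
z_below = z_score(4.0, mu, sigma)   # negative: the value is below the mean

print(z_above, z_below)  # 2.0 -1.5
```

A value of 18 is two standard deviations above the mean, while a value of 4 is one and a half standard deviations below it.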

The z-score has various applications in statistics. It can be used to identify outliers in a dataset and understand their impact on the overall distribution. It also allows us to compare values from different distributions by standardizing them. In addition, the z-score is an essential component in hypothesis testing, as it helps us to calculate the probability of obtaining a particular value or higher/lower from a distribution.
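Two of these applications can be sketched briefly. The example below assumes a seeded sample (an addition for reproducibility) and uses a threshold of 3 standard deviations to flag outliers, which is a common convention rather than a fixed rule. The probabilities come from scipy.stats.norm and assume the data is normally distributed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
X = rng.normal(loc=10.0, scale=4.0, size=100_000)

# Z-scores based on the sample mean and standard deviation
z = (X - X.mean()) / X.std()

# Outlier detection: keep points more than 3 standard deviations from the mean
outliers = X[np.abs(z) > 3]
outlier_fraction = outliers.size / X.size

# Probability of observing a z-score at or below 2, and above 2,
# under a standard normal distribution
p_below = stats.norm.cdf(2.0)
p_above = stats.norm.sf(2.0)  # survival function, equal to 1 - cdf

print(outlier_fraction, p_below, p_above)
```

For normally distributed data, roughly 0.3% of points are expected to fall beyond 3 standard deviations, and about 97.7% of values lie at or below a z-score of 2.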

Calculating the mean

The sample mean of X is obtained with:

X.mean()

yields:

10.001437829008042

Calculating the standard deviation

The sample standard deviation of X is obtained with:

X.std()

yields:

3.9926317707090115
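Note that X.std() divides by N by default (ddof=0), which is the population formula. Passing ddof=1 divides by N - 1 instead. A short sketch of the difference, assuming a seeded sample so the numbers will not match the output above:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
X = rng.normal(loc=10.0, scale=4.0, size=100_000)

n = X.size
pop_std = X.std()           # default ddof=0: divides by N
sample_std = X.std(ddof=1)  # ddof=1: divides by N - 1

# The two estimates differ by a factor of sqrt(N / (N - 1))
ratio = sample_std / pop_std

print(pop_std, sample_std, ratio)
```

With 100000 points the two values are nearly identical; the choice matters more for small samples.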

Computing the Z-score

Applying the formula above with the known parameters mu and sigma, we can calculate the z-score for every element of the array X at once:

    (X - mu) / sigma

This produces an array of z-scores:

array([-2.12145637, -0.41107867, -0.08816355, ..., 1.08737079, 0.44585831, -1.67382599])
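When the true parameters are not known, the sample mean and standard deviation can be used in their place. A minimal sketch, assuming a seeded sample (so the values differ from the output above), which also checks that the resulting z-scores have mean 0 and standard deviation 1:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
X = rng.normal(loc=10.0, scale=4.0, size=100_000)

# Standardize with statistics estimated from the data itself
z = (X - X.mean()) / X.std()

# By construction, the standardized data has mean 0 and standard deviation 1
# (up to floating-point rounding)
print(z.mean(), z.std())
```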

For visualization, you can redraw the histogram of the original data with the x-axis ticks relabeled as Z-scores, which shows how far each bin lies from the mean in terms of standard deviations.

import numpy as np
import matplotlib.pyplot as plt

hx, hy, _ = plt.hist(X, bins=50, density=1,color="coral")

plt.ylim(0.0,max(hx)+0.01)

x_tick_positions = [i for i in np.arange(-10,31,4.0)]
x_tick_labels = [(i - mu) / sigma for i in x_tick_positions ]

plt.xticks(x_tick_positions, x_tick_labels, rotation=0, fontsize=10)
plt.yticks(fontsize=10)

plt.title('How can the statistical standard score or Z-score \n be calculated and plotted using Python ?')

plt.xlabel('Z-Score')

plt.grid(linestyle='--')

plt.savefig("zscore_python_02.png", bbox_inches='tight')
plt.show()


The same plot, with the tick labels written as multiples of $\sigma$:

import numpy as np
import matplotlib.pyplot as plt

hx, hy, _ = plt.hist(X, bins=50, density=1,color="coral")

plt.ylim(0.0,max(hx)+0.01)

x_tick_positions = [i for i in np.arange(-10,31,4.0)]

# Format each tick label as a multiple of sigma
x_tick_labels = [r'{} $\sigma$'.format((i - mu) / sigma) for i in x_tick_positions ]

plt.xticks(x_tick_positions, x_tick_labels, rotation=90, fontsize=10)
plt.yticks(fontsize=10)

plt.title('How can the statistical standard score or Z-score \n be calculated and plotted using Python ?')

plt.xlabel('Z-Score')

plt.grid(linestyle='--')

plt.savefig("zscore_python_03.png", bbox_inches='tight')
plt.show()


Method 2: Using the SciPy stats.zscore function

Another approach is to use the zscore function from the scipy.stats module, which computes Z-scores in a single call.

from scipy import stats

stats.zscore(X)

computes the Z-score of every element of the array X, using the mean and standard deviation of X itself. The function also accepts an axis argument. The default, axis=0, computes the statistics along the first axis; for a 2-D array this standardizes each column separately. The array X used here is 1-D, so there is only one axis.
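The effect of the axis argument is easiest to see on a small 2-D array. The sketch below uses a hypothetical 3x2 array (the values are chosen only for illustration) and also shows the ddof argument, which defaults to 0:

```python
import numpy as np
from scipy import stats

# Hypothetical data: 3 rows, 2 columns
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])

# axis=0 (default): each column is standardized with its own mean and std
z_cols = stats.zscore(data, axis=0)

# axis=1: each row is standardized with its own mean and std
z_rows = stats.zscore(data, axis=1)

# ddof=1 uses the sample standard deviation (divides by N - 1)
z_sample = stats.zscore(data, axis=0, ddof=1)

print(z_cols)
print(z_rows)
print(z_sample)
```

Both columns give the same Z-scores under axis=0, because the second column is just the first multiplied by 10 and standardization removes the scale.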

The zscore() function takes in an array or list of data as its argument and returns a new array with the calculated Z-score values for each element:

array([-2.12573155, -0.41219742, -0.08868638, ...,  1.08901737,
    0.446321  , -1.67727508])

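Note that these values are close to, but not identical to, those of Method 1. stats.zscore uses the sample mean and standard deviation of X (about 10.0014 and 3.9926 here), whereas Method 1 used the true parameters mu = 10 and sigma = 4. Replacing mu and sigma in Method 1 with X.mean() and X.std() gives the same values as stats.zscore.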
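The comparison between the two methods can be checked numerically. The sketch below assumes a seeded sample, so its numbers differ from the outputs shown above. In it, stats.zscore agrees with a manual calculation based on X.mean() and X.std(), while a calculation based on the true mu and sigma differs slightly:

```python
import numpy as np
from scipy import stats

mu, sigma = 10.0, 4.0
rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
X = rng.normal(loc=mu, scale=sigma, size=100_000)

z_true_params = (X - mu) / sigma            # Method 1 with the known parameters
z_sample_stats = (X - X.mean()) / X.std()   # Method 1 with sample statistics
z_scipy = stats.zscore(X)                   # Method 2

# SciPy matches the sample-statistics version to floating-point precision
matches_sample = np.allclose(z_scipy, z_sample_stats)

# Largest absolute difference from the known-parameter version
max_gap_true = np.max(np.abs(z_scipy - z_true_params))

print(matches_sample, max_gap_true)
```

The gap shrinks as the sample grows, since the sample mean and standard deviation converge to mu and sigma.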

References

Links Site
scipy.stats.zscore docs.scipy.org
Standard score en.wikipedia.org
numpy.mean numpy.org
numpy.org numpy.org