Introduction
The Z-score, also known as the standard score, is a statistical measure that quantifies the relationship between a data point and the mean of a dataset in terms of standard deviations. It's a valuable tool in statistics for comparing different datasets and determining the relative position of a data point within a distribution.
In this article, we'll explore how to calculate and plot Z-scores using Python, leveraging libraries such as NumPy, SciPy and Matplotlib.
Creating a synthetic dataset
To illustrate z-score calculation, we generate 100,000 random numbers from a normal distribution with a mean of 10 and a standard deviation of 4:
import numpy as np
import matplotlib.pyplot as plt
mu = 10.0
sigma = 4.0
X = np.random.randn(100000) * sigma + mu
To visualize your data, consider creating a simple histogram using matplotlib:
import numpy as np
import matplotlib.pyplot as plt
# plt.hist returns the bar heights, the bin edges, and the patch objects
counts, bin_edges, _ = plt.hist(X, bins=50, density=True, color="coral")
plt.ylim(0.0, max(counts) + 0.01)
plt.title('How can the statistical standard score or Z-score \n be calculated and plotted using Python ?')
plt.grid(linestyle='--')
plt.savefig("zscore_python_01.png", bbox_inches='tight')
plt.show()
Method 1: Computing Z-Scores
After generating the random numbers, we can calculate the z-score for a chosen value using the formula
$$ Z = \frac{(X - \mu)}{\sigma} $$
where $\mu$ represents the mean and $\sigma$ denotes the standard deviation of the dataset.
This formula is used to standardize any normal distribution into a standard normal distribution with mean 0 and standard deviation of 1.
The z-score measures how many standard deviations a data point lies from the mean of its distribution, which makes values drawn from different distributions directly comparable.
To calculate the z-score, we first subtract the mean from the individual data point and then divide it by the standard deviation. This standardized value gives us an idea of how far away the data point is from the mean in terms of standard deviations. If the resulting z-score is positive, it means that the data point is above the mean, and if it is negative, it indicates that the data point is below the mean.
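As a quick worked example of the two steps and of the sign convention, the sketch below applies the formula to three individual values, using the mean and standard deviation chosen for this article. The helper name z_score is our own and does not come from any library.

```python
mu = 10.0     # mean used in this article
sigma = 4.0   # standard deviation used in this article

def z_score(x, mean, std):
    """Number of standard deviations by which x differs from mean."""
    return (x - mean) / std

above = z_score(16.0, mu, sigma)     # 1.5: above the mean
below = z_score(4.0, mu, sigma)      # -1.5: below the mean
at_mean = z_score(10.0, mu, sigma)   # 0.0: exactly at the mean
print(above, below, at_mean)
```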
The z-score has various applications in statistics. It can be used to identify outliers in a dataset and understand their impact on the overall distribution. It also allows us to compare values from different distributions by standardizing them. In addition, the z-score is an essential component in hypothesis testing, as it helps us to calculate the probability of obtaining a particular value or higher/lower from a distribution.
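The three uses listed above can be sketched in a few lines. This is only an illustration under stated assumptions: the seed, the threshold of 3 standard deviations, and the exam statistics are all invented for the example.

```python
import numpy as np
from scipy import stats

# 1. Outlier detection on synthetic data (seed chosen arbitrarily)
rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=4.0, size=100_000)
z = (data - data.mean()) / data.std()
outliers = data[np.abs(z) > 3.0]   # |z| > 3 is a common rule of thumb, not a fixed standard
print(f"{outliers.size} of {data.size} points have |z| > 3")

# 2. Comparing values from different distributions (hypothetical exam statistics)
z_math = (82.0 - 70.0) / 8.0      # 1.5 standard deviations above the math mean
z_physics = (75.0 - 60.0) / 6.0   # 2.5 standard deviations above the physics mean
print(z_math, z_physics)

# 3. Tail probability: P(Z >= 1.5) for a standard normal variable
p_upper = stats.norm.sf(1.5)
print(f"P(Z >= 1.5) = {p_upper:.4f}")
```

In the second part, the physics result is the stronger one relative to its own distribution even though its raw score is lower.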
Calculating the mean
The mean of the sample is obtained with the array's mean() method:
X.mean()
yields:
10.001437829008042
Calculating the standard deviation
The standard deviation is obtained with std(), which by default divides by N (ddof=0):
X.std()
yields:
3.9926317707090115
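One detail worth knowing is that NumPy's std() computes the population standard deviation by default (ddof=0). The sketch below contrasts it with the sample estimate (ddof=1) on a small array invented for the illustration. With 100,000 points the two are almost indistinguishable, but on small samples the difference is visible.

```python
import numpy as np

# Small illustrative array (not the dataset used in this article)
sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

pop_std = sample.std()             # ddof=0 (NumPy default): divides by N
sample_std = sample.std(ddof=1)    # ddof=1: divides by N - 1

print(pop_std, sample_std)
```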
Computing the Z-score
Applying the formula to the whole array X computes the z-score of every element at once. Note that mu and sigma here are the parameters used to generate the data, not the sample estimates computed above:
(X - mu) / sigma
This produces an array of z-scores:
array([-2.12145637, -0.41107867, -0.08816355, ..., 1.08737079, 0.44585831, -1.67382599])
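As a sanity check, standardized data should have a mean close to 0 and a standard deviation close to 1. The sketch below regenerates comparable data (the seed is an arbitrary choice of ours) and compares standardizing with the known parameters against standardizing with the sample estimates.

```python
import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed, for reproducibility only
data = rng.normal(10.0, 4.0, size=100_000)

# Standardize with the known generating parameters
z_param = (data - 10.0) / 4.0
# Standardize with the mean and standard deviation estimated from the data
z_sample = (data - data.mean()) / data.std()

print(z_param.mean(), z_param.std())     # close to 0 and 1, up to sampling noise
print(z_sample.mean(), z_sample.std())   # 0 and 1 up to floating-point rounding
```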
For visualization, you can redraw the histogram with the x-axis ticks relabeled in Z-score units, which shows how far each part of the distribution lies from the mean in terms of standard deviations.
import numpy as np
import matplotlib.pyplot as plt
counts, bin_edges, _ = plt.hist(X, bins=50, density=True, color="coral")
plt.ylim(0.0, max(counts) + 0.01)
# Keep the ticks at the original X positions, but label them with the matching Z-score
x_tick_positions = np.arange(-10, 31, 4.0)
x_tick_labels = [(i - mu) / sigma for i in x_tick_positions]
plt.xticks(x_tick_positions, x_tick_labels, rotation=0, fontsize=10)
plt.yticks(fontsize=10)
plt.title('How can the statistical standard score or Z-score \n be calculated and plotted using Python ?')
plt.xlabel('Z-Score')
plt.grid(linestyle='--')
plt.savefig("zscore_python_02.png", bbox_inches='tight')
plt.show()
The same plot, with each tick label followed by the σ symbol and rotated for readability:
import numpy as np
import matplotlib.pyplot as plt
counts, bin_edges, _ = plt.hist(X, bins=50, density=True, color="coral")
plt.ylim(0.0, max(counts) + 0.01)
x_tick_positions = np.arange(-10, 31, 4.0)
# Label each tick with its Z-score followed by the sigma symbol
x_tick_labels = [r'{} $\sigma$'.format((i - mu) / sigma) for i in x_tick_positions]
plt.xticks(x_tick_positions, x_tick_labels, rotation=90, fontsize=10)
plt.yticks(fontsize=10)
plt.title('How can the statistical standard score or Z-score \n be calculated and plotted using Python ?')
plt.xlabel('Z-Score')
plt.grid(linestyle='--')
plt.savefig("zscore_python_03.png", bbox_inches='tight')
plt.show()
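An alternative to relabeling the ticks is to keep the original X axis and add a second axis in Z-score units. The sketch below uses Matplotlib's secondary_xaxis for this. It is one possible approach rather than the method used for the figures above, and the seed and variable names are our own choices.

```python
import numpy as np
import matplotlib.pyplot as plt

mu, sigma = 10.0, 4.0
rng = np.random.default_rng(0)   # arbitrary seed
data = rng.normal(mu, sigma, size=100_000)

fig, ax = plt.subplots()
ax.hist(data, bins=50, density=True, color="coral")
ax.set_xlabel("X")
ax.grid(linestyle="--")

# Secondary x-axis on top: the first function maps X -> Z, the second maps Z -> X
z_axis = ax.secondary_xaxis(
    "top",
    functions=(lambda x: (x - mu) / sigma, lambda z: z * sigma + mu),
)
z_axis.set_xlabel("Z-score")
```

Calling plt.show() or fig.savefig(...) afterwards displays or saves the figure, with both scales visible at once.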
Method 2: Utilizing the SciPy stats zscore function
Another way to calculate Z-scores is the zscore function from the scipy.stats module:
from scipy import stats
stats.zscore(X)
This call computes the Z-score of every element of the array X, using the mean and standard deviation of X itself (with ddof=0 by default). The function also accepts an axis argument, set to axis=0 by default, which means that for a 2-D array each column is standardized with its own mean and standard deviation. The zscore() function takes an array or list of data and returns a new array with the Z-score of each element:
array([-2.12573155, -0.41219742, -0.08868638, ..., 1.08901737,
0.446321 , -1.67727508])
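Please be aware that these values are very close to those of method 1 but not identical: stats.zscore standardizes with the sample mean and standard deviation of X (10.0014 and 3.9926 here), whereas method 1 used the parameters mu = 10 and sigma = 4.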
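To see exactly what stats.zscore computes, the sketch below compares it with the formula evaluated using the array's own mean and standard deviation, and then shows the effect of the axis argument on a small 2-D array. The data and seed are invented for the illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)   # arbitrary seed
data = rng.normal(10.0, 4.0, size=1_000)

# zscore uses the sample mean and the ddof=0 standard deviation by default
manual = (data - data.mean()) / data.std()
print(np.allclose(stats.zscore(data), manual))

# 2-D input: axis=0 standardizes each column, axis=1 standardizes each row
table = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
by_column = stats.zscore(table, axis=0)
by_row = stats.zscore(table, axis=1)
print(by_column)
print(by_row)
```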
References
Links | Site |
---|---|
scipy.stats.zscore | docs.scipy.org |
Standard score | en.wikipedia.org |
numpy.mean | numpy.org |
numpy.org | numpy.org |