Introduction
A percentile is a statistical measure that indicates the value below which a given percentage of the data in a dataset falls. In many data analysis tasks, especially in fields like finance, machine learning, and meteorology, percentiles are used to summarize data and identify outliers or trends.
Two common libraries used for calculating percentiles in Python are:
- NumPy: A fundamental package for numerical computing.
- SciPy: Builds on NumPy and provides additional functionality for scientific computing.
This guide explains how to calculate percentiles in Python with NumPy, SciPy, and the built-in statistics module, and how to visualize them with Matplotlib.
What is a Percentile?
A percentile is a value below which a given percentage of the data points fall. For example:
- The 25th percentile (also called the first quartile) means that 25% of the data points are less than this value.
- The 50th percentile (the median) indicates that 50% of the data points fall below this value.
- The 90th percentile indicates that 90% of the data points are below this value, and so on.
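To make the definition concrete, here is a minimal pure-Python sketch of one common way to compute a percentile: sort the data, find the fractional rank p/100 × (n − 1), and interpolate linearly between the two neighboring values. The helper name `percentile_linear` is hypothetical and used only for illustration; other percentile definitions exist and can give slightly different results on small datasets.

```python
import math

def percentile_linear(values, p):
    """Return the p-th percentile (0 <= p <= 100) using linear interpolation.

    Illustrative helper, not a library function.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    rank = (p / 100) * (len(ordered) - 1)  # fractional index into the sorted data
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    # Move from the lower neighbor toward the upper one by the fractional part
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

data = [23, 45, 16, 78, 55, 34, 89, 22, 10, 67]
print(percentile_linear(data, 25))  # 22.25
print(percentile_linear(data, 50))  # 39.5
```

Sorted, the data is 10, 16, 22, 23, 34, 45, 55, 67, 78, 89, so the 25th percentile sits a quarter of the way between 22 and 23, and the median is halfway between 34 and 45.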
Using NumPy
The most straightforward way to compute percentiles is by using NumPy’s percentile() function. Here's how:
Example:
```python
import numpy as np

# Sample dataset
data = [23, 45, 16, 78, 55, 34, 89, 22, 10, 67]

# Calculate the 25th, 50th (median), and 90th percentiles
percentile_25 = np.percentile(data, 25)
percentile_50 = np.percentile(data, 50)
percentile_90 = np.percentile(data, 90)

print(f"25th Percentile: {percentile_25}")
print(f"50th Percentile (Median): {percentile_50}")
print(f"90th Percentile: {percentile_90}")
```
Output:
```
25th Percentile: 22.25
50th Percentile (Median): 39.5
90th Percentile: 79.1
```
In this example, the np.percentile() function takes two arguments:
- The dataset (a list, array, or series).
- The desired percentile (e.g., 25 for the 25th percentile).
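The second argument can also be a sequence of percentiles, which avoids repeated calls. A short sketch, reusing the sample dataset from this guide:

```python
import numpy as np

data = [23, 45, 16, 78, 55, 34, 89, 22, 10, 67]

# Passing a list of percentiles returns a NumPy array with one value per entry
p25, p50, p90 = np.percentile(data, [25, 50, 90])

print(f"25th: {p25}, 50th: {p50}, 90th: {p90}")
```

The returned array keeps the order of the requested percentiles, so it can be unpacked directly as shown.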
NumPy's percentile() function also lets you choose how the result is computed when the requested percentile falls between two data points. In NumPy 1.22 and later this is set with the method keyword; the older interpolation keyword is deprecated.

```python
# Using different interpolation methods (NumPy 1.22+)
np.percentile(data, 25, method='linear')    # default: interpolate between the two neighbors
np.percentile(data, 25, method='nearest')   # nearest data point
np.percentile(data, 25, method='lower')     # lower of the two neighboring data points
np.percentile(data, 25, method='higher')    # higher of the two neighboring data points
np.percentile(data, 25, method='midpoint')  # average of the two neighboring data points
```
Depending on the application, you may choose different interpolation methods to better suit your needs.
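To see how much the choice matters, the sketch below evaluates the 25th percentile of the sample dataset with each of the five methods. It assumes NumPy 1.22 or later, where the keyword is named `method` (older releases call it `interpolation`).

```python
import numpy as np

data = [23, 45, 16, 78, 55, 34, 89, 22, 10, 67]

# Sorted, the 25th percentile falls between 22 and 23 (fractional rank 2.25)
methods = ["linear", "nearest", "lower", "higher", "midpoint"]
results = {m: float(np.percentile(data, 25, method=m)) for m in methods}

for name, value in results.items():
    print(f"{name:>8}: {value}")
```

For this dataset the methods give 22.25 (linear), 22.0 (nearest), 22.0 (lower), 23.0 (higher), and 22.5 (midpoint). The differences shrink as the dataset grows, but they can be noticeable on small samples.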
Using statistics.quantiles()
You can also calculate percentiles with the quantiles() function from Python's built-in statistics module (Python 3.8 and later). The function returns the cut points that divide the data into n groups of equal probability. It does not accept a list of percentiles; instead, you set n=100 to get the 99 percentile cut points and then index into the result to pick the percentiles you need.
Syntax:
```python
statistics.quantiles(data, *, n=4, method='exclusive')
```
- data: The input dataset (any iterable of numbers, such as a list).
- n: The number of groups to divide the data into (e.g., 100 for percentiles). It defaults to 4 (quartiles) and must be passed as a keyword argument.
- method: The method used to compute the quantiles. It can be 'exclusive' (default) or 'inclusive'.
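As a quick illustration of the n parameter, the sketch below requests quartiles and deciles for the sample dataset used in this guide. Note that the function returns n − 1 cut points, not n values.

```python
import statistics

data = [23, 45, 16, 78, 55, 34, 89, 22, 10, 67]

# n=4 returns the three cut points that split the data into four groups
quartiles = statistics.quantiles(data, n=4)
print(quartiles)  # [20.5, 39.5, 69.75]

# n=10 returns nine cut points (deciles)
deciles = statistics.quantiles(data, n=10)
print(len(deciles))  # 9
```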
Example:
To calculate the 25th, 50th, and 75th percentiles using quantiles(), you would specify n=100 and extract the corresponding quantile values.
```python
import statistics as stats

# Sample dataset
data = [23, 45, 16, 78, 55, 34, 89, 22, 10, 67]

# Calculate percentiles
percentiles = stats.quantiles(data, n=100)

# 25th percentile is at position 25-1, 50th at 50-1, etc.
percentile_25 = percentiles[24]  # 25th percentile
percentile_50 = percentiles[49]  # 50th percentile
percentile_75 = percentiles[74]  # 75th percentile

print(f"25th Percentile: {percentile_25}")
print(f"50th Percentile (Median): {percentile_50}")
print(f"75th Percentile: {percentile_75}")
```
Output:
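```
25th Percentile: 20.5
50th Percentile (Median): 39.5
75th Percentile: 69.75
```
The 25th percentile differs from NumPy's 22.25 because quantiles() defaults to method='exclusive'. Passing method='inclusive' gives 22.25, 39.5, and 64.0, the same values np.percentile() returns by default.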
Explanation:
- quantiles(data, n=100) returns 99 cut points that divide the data into 100 groups of equal probability (percentiles).
- To get the 25th percentile, you access index 24 (percentiles[24]), as indexing starts at 0.
- Similarly, the 50th percentile is at percentiles[49] and the 75th percentile at percentiles[74].
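The index arithmetic can be wrapped in a small helper, which also makes it easy to compare the two methods. This is only a sketch: `percentile_from_quantiles` is a hypothetical name, not part of the standard library, and it handles whole-number percentiles from 1 to 99 only.

```python
import statistics

data = [23, 45, 16, 78, 55, 34, 89, 22, 10, 67]

def percentile_from_quantiles(values, p, method="exclusive"):
    """Look up the p-th percentile (integer 1..99) via statistics.quantiles().

    Hypothetical convenience wrapper, not part of the standard library.
    """
    if not 1 <= p <= 99:
        raise ValueError("p must be an integer from 1 to 99")
    cut_points = statistics.quantiles(values, n=100, method=method)
    return cut_points[p - 1]  # cut_points[0] is the 1st percentile

for p in (25, 50, 75):
    exclusive = percentile_from_quantiles(data, p)
    inclusive = percentile_from_quantiles(data, p, method="inclusive")
    print(f"{p}th percentile: exclusive={exclusive}, inclusive={inclusive}")
```

For the sample dataset this prints 20.5 and 22.25 for the 25th percentile, 39.5 for both at the 50th, and 69.75 and 64.0 for the 75th. The 'exclusive' method assumes the sample may not contain the most extreme values of the population, while 'inclusive' treats the minimum and maximum of the data as the 0th and 100th percentiles.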
Difference Between quantiles() and percentile() (NumPy)
- NumPy's percentile(): Directly computes percentiles from the percentage values you provide (e.g., 25, 50, 90), including fractional values such as 97.5.
- statistics.quantiles(): Returns the cut points for a specified number of groups (e.g., 100 for percentiles); you then pick out the positions that correspond to your desired percentiles manually.
NumPy is the better fit when you need arbitrary percentiles or work with arrays, while statistics.quantiles() is convenient when you want to stay within the standard library.
Using SciPy
SciPy also provides the scoreatpercentile() function in the scipy.stats module. It is less commonly used than NumPy's function and offers similar functionality; the SciPy documentation itself recommends np.percentile() for new code.
```python
from scipy import stats

# Sample dataset
data = [23, 45, 16, 78, 55, 34, 89, 22, 10, 67]

# Calculate the 25th, 50th, and 90th percentiles
percentile_25 = stats.scoreatpercentile(data, 25)
percentile_50 = stats.scoreatpercentile(data, 50)
percentile_90 = stats.scoreatpercentile(data, 90)

print(f"25th Percentile: {percentile_25}")
print(f"50th Percentile (Median): {percentile_50}")
print(f"90th Percentile: {percentile_90}")
```
Output:
```
25th Percentile: 22.25
50th Percentile (Median): 39.5
90th Percentile: 79.1
```
The scoreatpercentile() function behaves similarly to np.percentile(), returning the value at a specified percentile.
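As a sanity check, the sketch below computes several percentiles with both functions in one call each and confirms that the results agree. It assumes both SciPy and NumPy are installed and relies on the default settings of each function.

```python
import numpy as np
from scipy import stats

data = [23, 45, 16, 78, 55, 34, 89, 22, 10, 67]
wanted = [25, 50, 90]

# Both functions accept a sequence of percentiles and return an array
scipy_values = stats.scoreatpercentile(data, wanted)
numpy_values = np.percentile(data, wanted)

print(scipy_values)
print(numpy_values)
print(np.allclose(scipy_values, numpy_values))
```

Using np.allclose() rather than == avoids spurious mismatches caused by floating-point rounding.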
Visualizing Percentiles with Matplotlib
To better understand how your percentiles relate to the data distribution, it's useful to visualize the data.
Example 1
Here's how you can plot a histogram and highlight the percentiles:
```python
import numpy as np
import matplotlib.pyplot as plt

# Sample data
data = [23, 45, 16, 78, 55, 34, 89, 22, 10, 67]

# Calculate percentiles
percentile_25 = np.percentile(data, 25)
percentile_50 = np.percentile(data, 50)
percentile_90 = np.percentile(data, 90)

# Plot the histogram
plt.hist(data, bins=10, edgecolor='black', alpha=0.7)

# Add vertical lines for percentiles
plt.axvline(percentile_25, color='r', linestyle='dashed', linewidth=1, label='25th Percentile')
plt.axvline(percentile_50, color='g', linestyle='dashed', linewidth=1, label='50th Percentile (Median)')
plt.axvline(percentile_90, color='b', linestyle='dashed', linewidth=1, label='90th Percentile')

# Labeling
plt.title('Data Distribution with Percentiles')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.legend()

# Show plot
plt.show()
```
This will display a histogram with the 25th, 50th, and 90th percentiles highlighted with vertical dashed lines.
Example 2
```python
import numpy as np
import matplotlib.pyplot as plt

# Generate a normally distributed dataset
np.random.seed(42)  # For reproducibility
data = np.random.normal(loc=0, scale=1, size=50000)  # Mean=0, Std Dev=1, 50000 points

# Calculate percentiles
percentiles = {
    "-3σ (0.13%)": np.percentile(data, 0.13),
    "-2σ (2.28%)": np.percentile(data, 2.28),
    "-1σ (15.87%)": np.percentile(data, 15.87),
    "Mean (50%)": np.percentile(data, 50),
    "+1σ (84.13%)": np.percentile(data, 84.13),
    "+2σ (97.72%)": np.percentile(data, 97.72),
    "+3σ (99.87%)": np.percentile(data, 99.87),
}

# Plot the histogram
plt.figure(figsize=(10, 6))
plt.hist(data, bins=50, edgecolor='black', alpha=0.7, color='skyblue')

# Add vertical lines for percentiles and place corresponding labels
for label, value in percentiles.items():
    plt.axvline(value, color='red', linestyle='dashed', linewidth=1)
    plt.text(value, 1500, label, color='black', rotation=90, verticalalignment='bottom', horizontalalignment='center')

# Labeling
plt.title('Normally Distributed Data with Percentiles and σ (Standard Deviation)')
plt.xlabel('Value')
plt.ylabel('Frequency')

# Save plot
plt.savefig("How_to_Calculate_Percentiles_with_Python_Fig_02.png", bbox_inches='tight', dpi=200)

# Show plot
plt.tight_layout()
plt.show()
```