How to Calculate Percentiles with Python?

Introduction

A percentile is a statistical measure that indicates the value below which a given percentage of the data in a dataset falls. In many data analysis tasks, especially in fields like finance, machine learning, and meteorology, percentiles are used to summarize data and identify outliers or trends.

Three common tools for calculating percentiles in Python are:
- NumPy: A fundamental package for numerical computing.
- statistics: A module in Python's standard library that provides basic statistical functions.
- SciPy: Builds on NumPy and provides additional functionality for scientific computing.

This guide explains how to calculate percentiles with each of them and how to visualize the results with Matplotlib.

What is a Percentile?

A percentile represents a point in your data where a certain percentage of the data points fall below it. For example:
- The 25th percentile (also called the first quartile) means that 25% of the data points are less than this value.
- The 50th percentile (the median) indicates that 50% of the data points fall below this value.
- The 90th percentile indicates that 90% of the data points are below this value, and so on.
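
To make the definition concrete, here is a minimal pure-Python sketch of one common way to compute a percentile: sort the data, locate the fractional position of the requested percentile, and interpolate linearly between the two neighboring values. The function name percentile_linear is hypothetical and used only for illustration; other conventions exist and can give slightly different results on small datasets.

```python
import math


def percentile_linear(values, p):
    """Return the p-th percentile (0 <= p <= 100) of values.

    Illustrative helper (hypothetical name): interpolates linearly
    between the two closest ranks of the sorted data.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")

    ordered = sorted(values)
    # Fractional 0-based position of the percentile in the sorted data.
    position = (len(ordered) - 1) * p / 100
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    # Interpolate between the two neighboring data points.
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


data = [23, 45, 16, 78, 55, 34, 89, 22, 10, 67]
print(percentile_linear(data, 25))  # 22.25
print(percentile_linear(data, 50))  # 39.5
```

For this dataset the sorted values are 10, 16, 22, 23, 34, 45, 55, 67, 78, 89, so the 25th percentile falls a quarter of the way between 22 and 23, and the median is halfway between 34 and 45.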

Using NumPy

The most straightforward way to compute percentiles is by using NumPy’s percentile() function. Here's how:

Example:

import numpy as np

# Sample dataset
data = [23, 45, 16, 78, 55, 34, 89, 22, 10, 67]

# Calculate the 25th, 50th (median), and 90th percentiles
percentile_25 = np.percentile(data, 25)
percentile_50 = np.percentile(data, 50)
percentile_90 = np.percentile(data, 90)

print(f"25th Percentile: {percentile_25}")
print(f"50th Percentile (Median): {percentile_50}")
print(f"90th Percentile: {percentile_90}")

Output:

25th Percentile: 22.25
50th Percentile (Median): 39.5
90th Percentile: 79.1

In this example, the np.percentile() function takes two arguments:
- The dataset (a list, array, or series).
- The desired percentile (e.g., 25 for the 25th percentile).
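
The second argument can also be a sequence, and an optional axis argument controls the dimension along which percentiles are computed for multi-dimensional input. The short sketch below illustrates both; the 3x3 table is made-up illustrative data.

```python
import numpy as np

data = [23, 45, 16, 78, 55, 34, 89, 22, 10, 67]

# Passing a sequence of percentiles returns one value per requested percentile.
quartiles = np.percentile(data, [25, 50, 75])
print(quartiles)

# For 2-D data, axis selects the dimension along which percentiles are computed.
# axis=0 computes one result per column.
table = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])
column_medians = np.percentile(table, 50, axis=0)
print(column_medians)
```

Requesting several percentiles in one call avoids repeating the function call for every percentile you need.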

NumPy's percentile() function lets you choose how the result is computed when the desired percentile falls between two data points. Since NumPy 1.22, this is set with the method parameter; the older interpolation parameter is deprecated.

# Using different methods (NumPy 1.22+)
np.percentile(data, 25, method='linear')   # default: interpolate linearly between neighbors
np.percentile(data, 25, method='nearest')  # nearest ranked data point
np.percentile(data, 25, method='lower')    # next lower data point
np.percentile(data, 25, method='higher')   # next higher data point
np.percentile(data, 25, method='midpoint') # average of the two neighboring points

Depending on the application, you may choose different interpolation methods to better suit your needs.
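
To see how much the choice matters on a small dataset, the sketch below computes the 25th percentile of the sample data with each option. It assumes NumPy 1.22 or newer, where the keyword is named method (older releases call it interpolation). For this data the 25th percentile falls between the sorted values 22 and 23, so every result lies in that range.

```python
import numpy as np

data = [23, 45, 16, 78, 55, 34, 89, 22, 10, 67]

# Requires NumPy 1.22 or newer, where the keyword is named `method`.
methods = ["linear", "lower", "higher", "nearest", "midpoint"]
results = {name: float(np.percentile(data, 25, method=name)) for name in methods}

for name, value in results.items():
    print(f"{name:>8}: {value}")
```

On larger datasets the options usually agree much more closely, because neighboring data points are nearer to each other.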

Using statistics.quantiles()

You can also calculate percentiles with the quantiles() function from Python's built-in statistics module (available since Python 3.8). The function does not take percentile values directly. Instead, it divides the data into n intervals of equal probability and returns the n - 1 cut points that separate them. With n=100, those cut points are the 1st through 99th percentiles, and you pick the ones you need by index.

Syntax:

statistics.quantiles(data, *, n=4, method='exclusive')
  • data: The input dataset (any iterable of numbers, such as a list).
  • n: The number of equal-probability intervals to divide the data into (e.g., 100 for percentiles). The function returns n - 1 cut points. It must be passed as a keyword argument.
  • method: How the cut points are computed. It can be 'exclusive' (default) or 'inclusive'.

Example:

To calculate the 25th, 50th, and 75th percentiles using quantiles(), you would specify n=100 and extract the corresponding quantile values.

import statistics as stats

# Sample dataset
data = [23, 45, 16, 78, 55, 34, 89, 22, 10, 67]

# Calculate percentiles
percentiles = stats.quantiles(data, n=100)

# 25th percentile is at position 25-1, 50th at 50-1, etc.
percentile_25 = percentiles[24]  # 25th percentile
percentile_50 = percentiles[49]  # 50th percentile
percentile_75 = percentiles[74]  # 75th percentile

print(f"25th Percentile: {percentile_25}")
print(f"50th Percentile (Median): {percentile_50}")
print(f"75th Percentile: {percentile_75}")

Output:

25th Percentile: 20.5
50th Percentile (Median): 39.5
75th Percentile: 69.75

Explanation:

  • quantiles(data, n=100) returns the 99 cut points that divide the dataset into 100 equal-probability intervals (the 1st through 99th percentiles).
  • To get the 25th percentile, you access index 24 (percentiles[24]), as indexing starts at 0.
  • Similarly, the 50th percentile is at percentiles[49] and the 75th percentile at percentiles[74].

Difference Between quantiles() and percentile() (NumPy)

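  • NumPy's percentile(): Directly computes percentiles based on the specific percentage values you provide (e.g., 25, 50, 90).
  • statistics.quantiles(): Returns the cut points that divide the data into a specified number of equal-probability intervals (e.g., 100 for percentiles), and you extract the positions corresponding to your desired percentiles manually.

The two functions also use different default methods ('exclusive' for quantiles(), 'linear' for np.percentile()), so they can return different values for the same percentile, especially on small datasets. Passing method='inclusive' to quantiles() applies the same interpolation rule as NumPy's default. Use quantiles() when you want to avoid a third-party dependency, and np.percentile() when you need arbitrary percentile values, array support, or more method choices.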
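
The effect of the method argument is easiest to see on quartiles (n=4). The following sketch uses only the standard library and the same sample data; on ten data points the two methods disagree noticeably for the first and third quartiles.

```python
import statistics

data = [23, 45, 16, 78, 55, 34, 89, 22, 10, 67]

# 'exclusive' (default) treats the data as a sample from a larger population.
exclusive = statistics.quantiles(data, n=4)
# 'inclusive' treats the data as the whole population, min and max included.
inclusive = statistics.quantiles(data, n=4, method="inclusive")

print("exclusive quartiles:", exclusive)  # [20.5, 39.5, 69.75]
print("inclusive quartiles:", inclusive)  # [22.25, 39.5, 64.0]
```

The median is the same under both methods here; only the outer quartiles move, because the 'exclusive' method assumes that more extreme values than the observed minimum and maximum may exist.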

Using SciPy

SciPy also provides the scoreatpercentile() function in the scipy.stats module. It offers similar functionality to np.percentile(), but it is less commonly used, and the SciPy documentation recommends np.percentile() instead.

from scipy import stats

# Sample dataset
data = [23, 45, 16, 78, 55, 34, 89, 22, 10, 67]

# Calculate the 25th, 50th, and 90th percentiles
percentile_25 = stats.scoreatpercentile(data, 25)
percentile_50 = stats.scoreatpercentile(data, 50)
percentile_90 = stats.scoreatpercentile(data, 90)

print(f"25th Percentile: {percentile_25}")
print(f"50th Percentile (Median): {percentile_50}")
print(f"90th Percentile: {percentile_90}")

Output:

25th Percentile: 22.25
50th Percentile (Median): 39.5
90th Percentile: 79.1

The scoreatpercentile() function behaves similarly to np.percentile(), returning the value at a specified percentile.

Visualizing Percentiles with Matplotlib

To better understand how your percentiles relate to the data distribution, it's useful to visualize the data.

Example 1

Here's how you can plot a histogram and highlight the percentiles:

import numpy as np
import matplotlib.pyplot as plt

# Sample data
data = [23, 45, 16, 78, 55, 34, 89, 22, 10, 67]

# Calculate percentiles
percentile_25 = np.percentile(data, 25)
percentile_50 = np.percentile(data, 50)
percentile_90 = np.percentile(data, 90)

# Plot the histogram
plt.hist(data, bins=10, edgecolor='black', alpha=0.7)

# Add vertical lines for percentiles
plt.axvline(percentile_25, color='r', linestyle='dashed', linewidth=1, label='25th Percentile')
plt.axvline(percentile_50, color='g', linestyle='dashed', linewidth=1, label='50th Percentile (Median)')
plt.axvline(percentile_90, color='b', linestyle='dashed', linewidth=1, label='90th Percentile')

# Labeling
plt.title('Data Distribution with Percentiles')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.legend()

# Show plot
plt.show()

This will display a histogram with the 25th, 50th, and 90th percentiles highlighted with vertical dashed lines.


Example 2

import numpy as np
import matplotlib.pyplot as plt

# Generate a normally distributed dataset
np.random.seed(42)  # For reproducibility
data = np.random.normal(loc=0, scale=1, size=50000)  # Mean=0, Std Dev=1, 50000 points

# Calculate percentiles
percentiles = {
    "-3σ (0.13%)": np.percentile(data, 0.13),
    "-2σ (2.28%)": np.percentile(data, 2.28),
    "-1σ (15.87%)": np.percentile(data, 15.87),
    "Median (50%)": np.percentile(data, 50),  # coincides with the mean for an ideal normal distribution
    "+1σ (84.13%)": np.percentile(data, 84.13),
    "+2σ (97.72%)": np.percentile(data, 97.72),
    "+3σ (99.87%)": np.percentile(data, 99.87),
}

# Plot the histogram
plt.figure(figsize=(10, 6))
plt.hist(data, bins=50, edgecolor='black', alpha=0.7, color='skyblue')

# Add vertical lines for percentiles and place corresponding labels
for label, value in percentiles.items():
    plt.axvline(value, color='red', linestyle='dashed', linewidth=1)
    plt.text(value, 1500, label, color='black', rotation=90, verticalalignment='bottom', horizontalalignment='center')

# Labeling
plt.title('Normally Distributed Data with Percentiles and σ (Standard Deviation)')
plt.xlabel('Value')
plt.ylabel('Frequency')

# Adjust the layout before saving so the saved file matches the displayed figure
plt.tight_layout()

# Save plot
plt.savefig("How_to_Calculate_Percentiles_with_Python_Fig_02.png", bbox_inches='tight', dpi=200)

# Show plot
plt.show()
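
The percentile ranks used in this example (0.13, 2.28, 15.87, and so on) are the values of the standard normal cumulative distribution function at -3σ through +3σ, expressed as percentages. The sketch below shows one way to derive them with math.erf and cross-checks one of them against a simulated sample. The helper name normal_cdf_percent is hypothetical and used only for illustration.

```python
import math

import numpy as np


def normal_cdf_percent(z):
    """Percentage of a standard normal distribution lying below z.

    Illustrative helper (hypothetical name) based on the error function.
    """
    return 50.0 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Percentile ranks for -3σ ... +3σ.
ranks = {z: normal_cdf_percent(z) for z in range(-3, 4)}
for z, rank in ranks.items():
    print(f"{z:+d}σ -> {rank:.2f}%")

# Cross-check on a simulated sample (same seed and size as the plot above):
# the value at the +1σ rank should be close to 1.0.
np.random.seed(42)
sample = np.random.normal(loc=0, scale=1, size=50000)
plus_one_sigma = np.percentile(sample, ranks[1])
print(f"value at the {ranks[1]:.2f}th percentile: {plus_one_sigma:.3f}")
```

Because the sample is random, the cross-check only approximates 1.0; the agreement improves as the sample size grows.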

