Introduction
Outliers are data points that deviate significantly from the other observations in a dataset. They can result from errors in data collection or measurement, or they may be genuine extreme values. Whatever the cause, outliers can strongly affect the analysis and results drawn from a dataset.
In this tutorial, we will look at two simple ways to identify and remove outliers in Python: a rule based on the mean and standard deviation using NumPy, and a quantile cutoff using pandas.
Creating a synthetic dataset
Before we dive into identifying and removing outliers, let's first create a synthetic dataset that will be used throughout this tutorial. We will create a dataset consisting of 30 randomly generated observations. The values will be drawn from a Gaussian distribution with a mean of 0 and a standard deviation of 10:
import random
import numpy as np
random.seed(42) # Set seed for reproducibility
X = [random.gauss(0,10) for i in range(30)]
Next, we artificially add an outlier:
X.append(100)
Let's convert our list into an array, as it is a more suitable data structure for identifying outliers:
X = np.array(X) # Convert list to array
Displaying X gives the following array:
array([ -1.4409033 , -1.729036 , -1.11315862, 7.01983725,
-1.27588284, -14.97353414, 3.32318344, -2.67337478,
-2.16958684, 1.15884787, 2.32297737, 11.63558687,
6.56636507, 1.10507177, -7.38321602, -10.14662367,
2.46342195, 13.11080827, 0.41656864, -1.06323294,
5.3177622 , -14.53545298, -3.12277317, 4.90362533,
8.73404385, -2.40629673, 3.76599859, 2.48213449,
7.82326809, -11.13222214, 100. ])
Detecting and removing outlier points using numpy
Creating a histogram
A useful first step in identifying outliers is to create a histogram of the data. A histogram is a graphical representation of a distribution: the data are grouped into bins, and the vertical axis shows how many observations fall into each bin.
To create a histogram, we can use matplotlib's hist() function. It takes an array of data, draws the histogram, and returns the value of each bin, the bin edges, and the drawn patches. With density=True, the vertical axis shows a probability density instead of raw counts:
import matplotlib.pyplot as plt
plt.hist(X, density=True)
plt.title('What is the most efficient way \n to identify and remove outlier points using Python ?', fontsize=12)
plt.savefig('detect_remove_outlier_01.png', dpi=100, bbox_inches='tight')
plt.show()
Output: the histogram of X, saved as detect_remove_outlier_01.png.
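When the bin counts and bin edges are needed as numbers rather than as a plot, NumPy's np.histogram() returns them directly. The following is a minimal sketch, not part of the original tutorial code: the array `sample` is a hypothetical stand-in for X chosen so the result is easy to check by hand, and the choice of 5 bins is an assumption made for illustration.

```python
import numpy as np

# Hypothetical sample: ten ordinary values plus one extreme value.
sample = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 6.0, 7.0, 100.0])

# np.histogram returns the count per bin and the bin edges
# (there is always one more edge than there are bins).
counts, edges = np.histogram(sample, bins=5)

# Print each bin with its count. Every bin is half-open except the last,
# which also includes its right edge.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:7.2f} to {right:7.2f}: {count}")
```

A populated bin separated from the bulk of the data by several empty bins is a visual hint that an outlier may be present.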
Calculating the mean and the standard deviation
Calculating the mean and the standard deviation is an important step in identifying outliers. These statistics describe the central tendency and the variability of a dataset, and they can be used to decide whether a data point lies unusually far from the rest.
NumPy provides a convenient way to calculate both. The numpy.mean() and numpy.std() functions, also available as the array methods mean() and std(), take an array as input and return its mean and standard deviation, respectively:
X.mean()
X.std()
gives
3.451103447459695
and
18.853595739285307
respectively.
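One detail worth knowing is that NumPy's std() computes the population standard deviation by default (ddof=0), whereas pandas' Series.std() defaults to the sample standard deviation (ddof=1). The short sketch below illustrates the difference on a small hypothetical array `values`, which is not the tutorial's X.

```python
import numpy as np

# Hypothetical data, chosen so the results are easy to verify by hand.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = values.mean()
population_std = values.std()    # ddof=0: divide by N (NumPy default)
sample_std = values.std(ddof=1)  # ddof=1: divide by N - 1

print(mean, population_std, sample_std)
```

For a dataset of about 30 points the two values differ only slightly, but it helps to be consistent when comparing thresholds computed with different libraries.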
We can use these two measures to identify and eliminate outliers. A common rule defines an outlier as a data point that falls outside the range mean ± 3 * standard deviation. For normally distributed data, about 99.7% of observations fall inside this range, so points outside it are rare. Removing them keeps the analysis from being skewed by extreme values:
X[ np.abs(X - X.mean()) < 3.0 * X.std() ] # Keep points within mean ± 3 * std
Output
array([ -1.4409033 , -1.729036 , -1.11315862, 7.01983725,
-1.27588284, -14.97353414, 3.32318344, -2.67337478,
-2.16958684, 1.15884787, 2.32297737, 11.63558687,
6.56636507, 1.10507177, -7.38321602, -10.14662367,
2.46342195, 13.11080827, 0.41656864, -1.06323294,
5.3177622 , -14.53545298, -3.12277317, 4.90362533,
8.73404385, -2.40629673, 3.76599859, 2.48213449,
7.82326809, -11.13222214])
We can see that the outlier 100, which we intentionally introduced earlier, has been accurately filtered out.
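The same rule can be wrapped in a small reusable function. This is a sketch under stated assumptions, not the tutorial's own code: the function name `remove_outliers_zscore`, its `threshold` parameter, and the test array `values` are all hypothetical. It checks both sides of the mean, so unusually low values are flagged as well as unusually high ones.

```python
import numpy as np

def remove_outliers_zscore(data, threshold=3.0):
    """Return (kept values, boolean mask) using a two-sided z-score rule."""
    data = np.asarray(data, dtype=float)
    std = data.std()
    if std == 0:
        # All values are identical, so nothing can be flagged.
        mask = np.ones(data.shape, dtype=bool)
    else:
        z = (data - data.mean()) / std  # Standard score of each point
        mask = np.abs(z) < threshold    # True for points to keep
    return data[mask], mask

# Hypothetical data: 30 evenly spaced values plus one high and one low outlier.
values = np.concatenate([np.linspace(-10, 10, 30), [100.0, -100.0]])
kept, mask = remove_outliers_zscore(values)

print(kept.size)      # Number of points kept
print(values[~mask])  # Points flagged as outliers
```

Note that the mean and standard deviation are computed with the outliers still in the data, so very extreme points inflate the standard deviation and can hide milder ones. On small datasets it is worth inspecting what was removed.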
Utilizing the pandas library and its quantile function
Another way to identify and remove outliers in Python is to use the pandas library, which offers a wide range of tools for data analysis and manipulation.
Specifically, we can use the quantile() method to compute the value below which a given fraction of the data falls.
Once we have calculated the quantiles, we can use them as upper and lower bounds for our data. Any data point that falls outside these bounds can be considered an outlier and removed from the dataset.
Let's begin by storing our data in a dataframe:
import pandas as pd
df = pd.DataFrame(data=X, columns=['x'])
print( df )
Output
x
0 -1.440903
1 -1.729036
2 -1.113159
3 7.019837
4 -1.275883
5 -14.973534
6 3.323183
7 -2.673375
8 -2.169587
9 1.158848
10 2.322977
11 11.635587
12 6.566365
13 1.105072
14 -7.383216
15 -10.146624
16 2.463422
17 13.110808
18 0.416569
19 -1.063233
20 5.317762
21 -14.535453
22 -3.122773
23 4.903625
24 8.734044
25 -2.406297
26 3.765999
27 2.482134
28 7.823268
29 -11.132222
30 100.000000
Next, we calculate the 0.99 quantile (the 99th percentile), which will serve as the upper bound for our data:
q99 = df['x'].quantile(0.99)
print( q99 )
Output
73.93324248161208
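With this upper bound, we can keep only the values that fall below it. Note that a quantile cutoff always removes a fixed share of the largest values, whether or not they are genuine outliers, so the threshold should be chosen with the data in mind: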
df.loc[ df['x'] < q99, 'x' ]
Output
0 -1.440903
1 -1.729036
2 -1.113159
3 7.019837
4 -1.275883
5 -14.973534
6 3.323183
7 -2.673375
8 -2.169587
9 1.158848
10 2.322977
11 11.635587
12 6.566365
13 1.105072
14 -7.383216
15 -10.146624
16 2.463422
17 13.110808
18 0.416569
19 -1.063233
20 5.317762
21 -14.535453
22 -3.122773
23 4.903625
24 8.734044
25 -2.406297
26 3.765999
27 2.482134
28 7.823268
29 -11.132222
Name: x, dtype: float64
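The example above applies only an upper bound. When extreme values can occur on both sides, a lower quantile can be used in the same way. The sketch below is one possible way to do this: the helper name `trim_by_quantiles`, the default cutoffs of 0.01 and 0.99, and the series `s` are assumptions made for illustration, not part of the original example.

```python
import pandas as pd

def trim_by_quantiles(series, lower=0.01, upper=0.99):
    """Keep only values strictly between the lower and upper quantiles."""
    low = series.quantile(lower)
    high = series.quantile(upper)
    return series[(series > low) & (series < high)]

# Hypothetical data: the values 1..29 plus one low and one high outlier.
s = pd.Series([-500.0] + [float(i) for i in range(1, 30)] + [500.0])
trimmed = trim_by_quantiles(s)

print(trimmed.min(), trimmed.max(), len(trimmed))
```

Because the comparison is strict, a data point that is exactly equal to one of the quantile values is also dropped. Use <= and >= instead if such points should be kept.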
References
Links | Site |
---|---|
numpy.mean | numpy.org |
numpy.std | numpy.org |
pandas quantile | pandas.pydata.org |
scipy.stats.zscore | docs.scipy.org |
68–95–99.7 rule | wikipedia.org |
Standard score | wikipedia.org |