Introduction
Outliers are data points that deviate significantly from the other observations in a dataset. They can arise from errors in data collection or measurement, or they may be genuine extreme values. Whatever the cause, outliers can strongly affect the analysis and the results drawn from a dataset.
In this tutorial, we will discuss the most efficient way to identify and remove outlier points using Python.
Creating a synthetic dataset
Before we dive into identifying and removing outliers, let's first create a synthetic dataset that will be used throughout this tutorial. We will create a dataset consisting of 30 randomly generated observations. The values will be drawn from a Gaussian distribution with a mean of 0 and a standard deviation of 10:
import random
import numpy as np

random.seed(42)  # Set seed for reproducibility
X = [random.gauss(0, 10) for i in range(30)]
Next, we artificially add an outlier:
X.append(100)
Let's convert our list into a NumPy array, which supports the vectorized arithmetic and boolean indexing we will use to identify outliers:
X = np.array(X) # Convert list to array
Evaluating X then gives the following result:
array([ -1.4409033 ,  -1.729036  ,  -1.11315862,   7.01983725,
        -1.27588284, -14.97353414,   3.32318344,  -2.67337478,
        -2.16958684,   1.15884787,   2.32297737,  11.63558687,
         6.56636507,   1.10507177,  -7.38321602, -10.14662367,
         2.46342195,  13.11080827,   0.41656864,  -1.06323294,
         5.3177622 , -14.53545298,  -3.12277317,   4.90362533,
         8.73404385,  -2.40629673,   3.76599859,   2.48213449,
         7.82326809, -11.13222214, 100.        ])
Detecting and removing outlier points using numpy
Creating a histogram
One of the first steps in identifying and removing outlier points using Python is to create a histogram of the data. A histogram is a graphical representation that shows the distribution of data by grouping it into bins and displaying the frequency of each bin on the vertical axis.
To create a histogram, we can use matplotlib's hist() function. This function takes in an array of data, draws the histogram, and returns the value for each bin (counts, or densities when density=True is passed, as here) along with the corresponding bin edges. Let's look at an example of how to use this function:
import matplotlib.pyplot as plt

plt.hist(X, density=True)
plt.title('What is the most efficient way \n to identify and remove outlier points using Python ?', fontsize=12)
plt.savefig('detect_remove_outlier_01.png', dpi=100, bbox_inches='tight')
plt.show()
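Output: a histogram in which the bulk of the data sits around 0 and the outlier at 100 appears as an isolated bar on the far right.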

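If the bin counts are needed as numbers rather than as a plot, NumPy's np.histogram() computes them without drawing anything. The following is a minimal sketch that rebuilds the same dataset so it can run on its own; the choice of 10 bins is an assumption that mirrors matplotlib's default.

```python
import random
import numpy as np

# Rebuild the tutorial's dataset: 30 Gaussian draws plus one injected outlier.
random.seed(42)
X = np.array([random.gauss(0, 10) for _ in range(30)] + [100.0])

# np.histogram returns the per-bin counts and the bin edges, without plotting.
counts, edges = np.histogram(X, bins=10)

# Print each bin as an interval followed by the number of points it contains.
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"[{left:8.2f}, {right:8.2f}): {count}")
```

A long run of empty bins followed by a single populated bin at the far end is a quick numerical hint that an isolated extreme value is present.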
Calculating the mean and the standard deviation
Calculating the mean and the standard deviation is an important step in identifying outlier points. These statistical measures give an idea of the central tendency and variability of a dataset, which can then be used to determine whether a data point is significantly different from the rest.
Numpy provides a convenient way to calculate the mean and standard deviation. The numpy.mean() and numpy.std() functions take in an array as the input and return the mean and standard deviation, respectively. Both are also available as array methods, which is the form used here:
X.mean()
X.std()
gives
3.451103447459695
and
18.853595739285307
respectively.
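One detail worth keeping in mind when comparing results across libraries: NumPy's std() divides by n by default (ddof=0), whereas pandas' Series.std() divides by n - 1 (ddof=1). The sketch below rebuilds the dataset and prints both variants; the variable names are only illustrative.

```python
import random
import numpy as np

# Rebuild the tutorial's dataset so the example is self-contained.
random.seed(42)
X = np.array([random.gauss(0, 10) for _ in range(30)] + [100.0])

pop_std = X.std()            # NumPy default, ddof=0: divide by n
sample_std = X.std(ddof=1)   # divide by n - 1, which is the pandas default

print("population std:", pop_std)
print("sample std:    ", sample_std)
```

With 31 points the two values differ only slightly, but the gap grows as the sample gets smaller, which can shift a mean ± 3 * standard deviation threshold.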
We can use these two measures to identify and eliminate outlier points. A common rule of thumb defines an outlier as a data point that falls outside the range mean ± 3 * standard deviation. Keeping only the points inside that range, on both sides of the mean, makes our analysis less sensitive to extreme values:
X[ np.abs(X - X.mean()) < 3.0 * X.std() ]
Output
array([ -1.4409033 ,  -1.729036  ,  -1.11315862,   7.01983725,
        -1.27588284, -14.97353414,   3.32318344,  -2.67337478,
        -2.16958684,   1.15884787,   2.32297737,  11.63558687,
         6.56636507,   1.10507177,  -7.38321602, -10.14662367,
         2.46342195,  13.11080827,   0.41656864,  -1.06323294,
         5.3177622 , -14.53545298,  -3.12277317,   4.90362533,
         8.73404385,  -2.40629673,   3.76599859,   2.48213449,
         7.82326809, -11.13222214])
We can see that the outlier 100, which we intentionally introduced earlier, has been accurately filtered out.
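The filtering step can be wrapped in a small reusable function. The sketch below is one possible way to do it, not a standard API: the name remove_outliers_sigma and its n_sigma parameter are hypothetical, and it also returns the bounds so they can be inspected.

```python
import random
import numpy as np

def remove_outliers_sigma(values, n_sigma=3.0):
    """Keep points within n_sigma standard deviations of the mean.

    Hypothetical helper written for this tutorial, not a NumPy function.
    """
    values = np.asarray(values, dtype=float)
    mean, std = values.mean(), values.std()
    lower, upper = mean - n_sigma * std, mean + n_sigma * std
    mask = (values > lower) & (values < upper)  # True for points to keep
    return values[mask], (lower, upper)

# Rebuild the tutorial's dataset so the example is self-contained.
random.seed(42)
X = np.array([random.gauss(0, 10) for _ in range(30)] + [100.0])

filtered, (lower, upper) = remove_outliers_sigma(X)
print(f"bounds: [{lower:.2f}, {upper:.2f}]")
print(f"kept {filtered.size} of {X.size} points")
```

Note that the outlier itself is included when the mean and standard deviation are computed, so it widens the very bounds used to detect it. This works here because the sample is large enough, but in very small samples a single extreme point may never cross the threshold.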
Utilizing the pandas library and its quantile function
One efficient way to identify and remove outlier points in Python is by using the pandas library. This popular library offers a variety of functions and tools for data analysis, manipulation, and visualization.
Specifically, we can use the quantile function from the pandas library to detect outliers in our dataset. This function allows us to calculate different quantiles for our dataset.
Once we have calculated the quantiles, we can use them to determine the upper and lower bounds for our data. Any data point that falls outside of these bounds can be considered an outlier and removed from our dataset.
Let's begin by storing our data in a dataframe:
import pandas as pd

df = pd.DataFrame(data=X, columns=['x'])
print( df )
Output
             x
0    -1.440903
1    -1.729036
2    -1.113159
3     7.019837
4    -1.275883
5   -14.973534
6     3.323183
7    -2.673375
8    -2.169587
9     1.158848
10    2.322977
11   11.635587
12    6.566365
13    1.105072
14   -7.383216
15  -10.146624
16    2.463422
17   13.110808
18    0.416569
19   -1.063233
20    5.317762
21  -14.535453
22   -3.122773
23    4.903625
24    8.734044
25   -2.406297
26    3.765999
27    2.482134
28    7.823268
29  -11.132222
30  100.000000
Next, calculate the quantile at 0.99. This will give us the upper bound for our data.
q99 = df['x'].quantile(0.99)
print( q99 )
Output
73.93324248161208
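The value 73.93 is not one of the observations. By default, quantile() interpolates linearly between the two sorted data points that surround the requested position. The sketch below reproduces that calculation by hand with NumPy, whose np.quantile() uses the same linear default; the variable names are only illustrative.

```python
import random
import numpy as np

# Rebuild the tutorial's dataset so the example is self-contained.
random.seed(42)
X = np.array([random.gauss(0, 10) for _ in range(30)] + [100.0])

q = 0.99
xs = np.sort(X)
pos = q * (xs.size - 1)      # fractional position in the sorted data, about 29.7
lo = int(np.floor(pos))      # index of the sorted neighbor just below that position
frac = pos - lo              # interpolation weight between the two neighbors
manual = xs[lo] + frac * (xs[lo + 1] - xs[lo])

print("manual:     ", manual)
print("np.quantile:", np.quantile(X, q))
```

Because the threshold falls between the two largest observations, comparing against it removes only the largest point, which here is the injected outlier.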
With this threshold in hand, we can now filter out the aberrant data point:
df['x'][ df['x'] < q99 ]
Output
0     -1.440903
1     -1.729036
2     -1.113159
3      7.019837
4     -1.275883
5    -14.973534
6      3.323183
7     -2.673375
8     -2.169587
9      1.158848
10     2.322977
11    11.635587
12     6.566365
13     1.105072
14    -7.383216
15   -10.146624
16     2.463422
17    13.110808
18     0.416569
19    -1.063233
20     5.317762
21   -14.535453
22    -3.122773
23     4.903625
24     8.734044
25    -2.406297
26     3.765999
27     2.482134
28     7.823268
29   -11.132222
Name: x, dtype: float64
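The same idea extends to a lower bound. The sketch below trims both tails, assuming the 0.01 and 0.99 quantiles as cutoffs; these levels are an illustrative choice, not a recommendation.

```python
import random
import pandas as pd

# Rebuild the tutorial's dataset so the example is self-contained.
random.seed(42)
df = pd.DataFrame({"x": [random.gauss(0, 10) for _ in range(30)] + [100.0]})

lower = df["x"].quantile(0.01)   # assumed lower cutoff
upper = df["x"].quantile(0.99)   # same upper cutoff as above

# Keep only the rows strictly inside the two cutoffs.
mask = (df["x"] > lower) & (df["x"] < upper)
trimmed = df.loc[mask]

print(len(df), "->", len(trimmed))
```

A quantile cutoff always removes a share of the data, whether or not those points are anomalous. Here the two-sided version drops the smallest observation along with the injected outlier, even though that smallest value is an ordinary draw from the Gaussian.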
References
| Links | Site |
|---|---|
| numpy.mean | numpy.org |
| numpy.std | numpy.org |
| pandas quantile | pandas.pydata.org |
| scipy.stats.zscore | docs.scipy.org |
| 68–95–99.7 rule | wikipedia.org |
| Standard score | wikipedia.org |
