## Introduction

Outliers are data points that deviate significantly from the other observations in a dataset. They can arise from errors in data collection or measurement, or they may be genuine extreme values. Regardless of the cause, outliers can strongly affect the analysis and results drawn from a dataset.

In this tutorial, we will look at two simple and efficient ways to identify and remove outlier points using Python: a mean and standard deviation rule with numpy, and a quantile threshold with pandas.

## Creating a synthetic dataset

Before we dive into identifying and removing outliers, let's first create a synthetic dataset that will be used throughout this tutorial. We will create a dataset consisting of 30 randomly generated observations. The values will be drawn from a Gaussian distribution with a mean of 0 and a standard deviation of 10:

`import random`

`import numpy as np`

`random.seed(42) # Set seed for reproducibility`

`X = [random.gauss(0,10) for i in range(30)]`

Next, we artificially add an outlier:

`X.append(100)`

Let's convert our list into an array, as it is a more suitable data structure for identifying outliers:

`X = np.array(X) # Convert list to array`

Displaying `X` gives the following array:

`array([ -1.4409033 , -1.729036 , -1.11315862, 7.01983725,`

`-1.27588284, -14.97353414, 3.32318344, -2.67337478,`

`-2.16958684, 1.15884787, 2.32297737, 11.63558687,`

`6.56636507, 1.10507177, -7.38321602, -10.14662367,`

`2.46342195, 13.11080827, 0.41656864, -1.06323294,`

`5.3177622 , -14.53545298, -3.12277317, 4.90362533,`

`8.73404385, -2.40629673, 3.76599859, 2.48213449,`

`7.82326809, -11.13222214, 100. ])`

## Detecting and removing outlier points using numpy

### Creating a histogram

One of the first steps in identifying and removing outlier points using Python is to create a histogram of the data. A histogram is a graphical representation that shows the distribution of data by grouping it into bins and displaying the frequency of each bin on the vertical axis.

To create a histogram, we can use matplotlib's hist() function. This function takes in an array of data, draws the histogram, and returns the value of each bin (a density rather than a raw count when `density=True`) along with the corresponding bin edges. Let's look at an example of how to use this function:

`import matplotlib.pyplot as plt`

`plt.hist(X, density=True)`

`plt.title('What is the most efficient way \n to identify and remove outlier points using Python ?', fontsize=12)`

`plt.savefig('detect_remove_outlier_01.png', dpi=100, bbox_inches='tight')`

`plt.show()`

Output: a histogram of `X` (saved as `detect_remove_outlier_01.png`).
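To inspect the binning numerically rather than visually, numpy's `np.histogram()` returns the counts and bin edges without drawing anything. The sketch below is only illustrative: it uses a small hypothetical array named `sample` (not the tutorial's `X`) so that it runs on its own.

```python
import numpy as np

# Hypothetical stand-in data: a few moderate values plus one extreme value
sample = np.array([-12.0, -5.0, -1.0, 0.0, 2.0, 4.0, 9.0, 13.0, 100.0])

# 10 equal-width bins between sample.min() and sample.max()
counts, edges = np.histogram(sample, bins=10)

# Indices of bins that contain no observations
empty_bins = np.flatnonzero(counts == 0)

print(counts)
print(edges)
print(empty_bins)
```

A run of empty bins between the bulk of the data and a lone occupied bin at the far end is the numerical counterpart of the isolated bar seen in the plot.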

### Calculating the mean and the standard deviation

Calculating the mean and standard deviation is an important step in identifying outlier points. These statistics describe the central tendency and variability of a dataset, which can then be used to decide whether a data point is significantly different from the rest.

Numpy provides a convenient way to calculate both. The numpy.mean() and numpy.std() functions take an array as input and return the mean and standard deviation, respectively. The equivalent array methods are used here:

`X.mean()`

`X.std()`

gives

`3.451103447459695`

and

`18.853595739285307`

respectively.
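One detail worth keeping in mind is that `np.std()` returns the population standard deviation by default (`ddof=0`, dividing by n). Passing `ddof=1` gives the sample standard deviation (dividing by n - 1). A minimal sketch on a hypothetical array named `values`:

```python
import numpy as np

# Hypothetical data chosen so the results are easy to check by hand
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = np.mean(values)                # same as values.mean()
std_pop = np.std(values)              # ddof=0: divides by n
std_sample = np.std(values, ddof=1)   # ddof=1: divides by n - 1

print(mean, std_pop, std_sample)
```

For small datasets the two versions differ noticeably, which slightly shifts any threshold built from them.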

We can use these two measures to identify and eliminate outlier points. An outlier can be defined as a data point that falls outside the range `mean ± 3 * standard deviation`. By keeping only the points inside this range, we ensure that our analysis is not skewed by extreme values on either side:

`X[ np.abs(X - X.mean()) < 3.0 * X.std() ]`

Output

`array([ -1.4409033 , -1.729036 , -1.11315862, 7.01983725,`

`-1.27588284, -14.97353414, 3.32318344, -2.67337478,`

`-2.16958684, 1.15884787, 2.32297737, 11.63558687,`

`6.56636507, 1.10507177, -7.38321602, -10.14662367,`

`2.46342195, 13.11080827, 0.41656864, -1.06323294,`

`5.3177622 , -14.53545298, -3.12277317, 4.90362533,`

`8.73404385, -2.40629673, 3.76599859, 2.48213449,`

`7.82326809, -11.13222214])`

We can see that the outlier 100, which we intentionally introduced earlier, has been accurately filtered out.
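The rule can also be wrapped in a small reusable function. The sketch below is one possible way to do it, not a definitive implementation: the names `remove_outliers_sigma` and `n_sigma` are hypothetical, and the data is made up for illustration. The quantity being compared is the absolute distance from the mean in units of the standard deviation, that is, the absolute z-score.

```python
import numpy as np

def remove_outliers_sigma(values, n_sigma=3.0):
    """Keep values lying within n_sigma standard deviations of the mean.

    Hypothetical helper written for this example.
    """
    values = np.asarray(values, dtype=float)
    # Absolute distance of every point from the mean
    distance = np.abs(values - values.mean())
    # Two-sided test: both very low and very high values are dropped
    return values[distance < n_sigma * values.std()]

# Hypothetical data: 20 evenly spaced moderate values and one extreme value
data = np.concatenate([np.linspace(-10, 10, 20), [150.0]])
cleaned = remove_outliers_sigma(data)

print(len(data), len(cleaned))
```

Keep in mind that an outlier inflates the very standard deviation used to detect it. With the population standard deviation, no point in a sample of n values can lie more than sqrt(n - 1) standard deviations from the mean, so a 3-sigma rule cannot flag anything in samples of fewer than 10 points.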

## Utilizing the pandas library and its quantile function

One efficient way to identify and remove outlier points in Python is by using the pandas library. This popular library offers a variety of functions and tools for data analysis, manipulation, and visualization.

Specifically, we can use the quantile function from the pandas library to detect outliers in our dataset. This function allows us to calculate different quantiles for our dataset.

Once we have calculated the quantiles, we can use them to determine the upper and lower bounds for our data. Any data point that falls outside of these bounds can be considered an outlier and removed from our dataset.

Let's begin by storing our data in a dataframe:

`import pandas as pd`

`df = pd.DataFrame(data=X, columns=['x'])`

`print( df )`

Output

`x`

`0 -1.440903`

`1 -1.729036`

`2 -1.113159`

`3 7.019837`

`4 -1.275883`

`5 -14.973534`

`6 3.323183`

`7 -2.673375`

`8 -2.169587`

`9 1.158848`

`10 2.322977`

`11 11.635587`

`12 6.566365`

`13 1.105072`

`14 -7.383216`

`15 -10.146624`

`16 2.463422`

`17 13.110808`

`18 0.416569`

`19 -1.063233`

`20 5.317762`

`21 -14.535453`

`22 -3.122773`

`23 4.903625`

`24 8.734044`

`25 -2.406297`

`26 3.765999`

`27 2.482134`

`28 7.823268`

`29 -11.132222`

`30 100.000000`

Next, calculate the quantile at 0.99. This will give us the upper bound for our data.

`q99 = df['x'].quantile(0.99)`

`print( q99 )`

Output

`73.93324248161208`
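Note that this value is not one of the observations. By default, `quantile` interpolates linearly between the two nearest sorted values: with 31 points, the 0.99 quantile sits at position 0.99 × 30 = 29.7, that is, 70% of the way from the second-largest value (13.11080827) to the largest (100). The sketch below reproduces that arithmetic on a small hypothetical Series named `s`, so it can be run independently of the tutorial's data.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])  # hypothetical data

q = 0.99
pos = q * (len(s) - 1)                        # fractional position in the sorted data
lo, hi = int(np.floor(pos)), int(np.ceil(pos))
ordered = np.sort(s.to_numpy())

# Linear interpolation between the two neighbouring sorted values
manual = ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

print(manual, s.quantile(q))
```

Because of this interpolation, a single extreme value pulls the 0.99 quantile far above the rest of the data, which is why a strict `<` comparison is enough to exclude it.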

With this bound in hand, we can keep only the values that lie strictly below it:

`df.loc[ df['x'] < q99, 'x' ]`

Output

`0 -1.440903`

`1 -1.729036`

`2 -1.113159`

`3 7.019837`

`4 -1.275883`

`5 -14.973534`

`6 3.323183`

`7 -2.673375`

`8 -2.169587`

`9 1.158848`

`10 2.322977`

`11 11.635587`

`12 6.566365`

`13 1.105072`

`14 -7.383216`

`15 -10.146624`

`16 2.463422`

`17 13.110808`

`18 0.416569`

`19 -1.063233`

`20 5.317762`

`21 -14.535453`

`22 -3.122773`

`23 4.903625`

`24 8.734044`

`25 -2.406297`

`26 3.765999`

`27 2.482134`

`28 7.823268`

`29 -11.132222`

`Name: x, dtype: float64`
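The filter above trims only the upper tail. If unusually low values are also a concern, a lower quantile can be applied in the same way. The following is a minimal sketch under stated assumptions: the 0.01 and 0.99 levels, the frame name `df_demo`, and its contents are all hypothetical choices made for illustration.

```python
import pandas as pd

# Hypothetical data with one extreme value in each tail
df_demo = pd.DataFrame({'x': [-80.0, -3.0, -1.0, 0.0, 1.0, 2.0, 4.0, 95.0]})

lower = df_demo['x'].quantile(0.01)
upper = df_demo['x'].quantile(0.99)

# Keep only the rows strictly between the two bounds
mask = (df_demo['x'] > lower) & (df_demo['x'] < upper)
trimmed = df_demo[mask]

print(trimmed)
```

A quantile cut always discards the most extreme values, whether or not they are genuine outliers, so the levels should be chosen with the size and nature of the dataset in mind.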

## References

Links | Site |
---|---|
numpy.mean | numpy.org |
numpy.std | numpy.org |
pandas quantile | pandas.pydata.org |
scipy.stats.zscore | docs.scipy.org |
68–95–99.7 rule | wikipedia.org |
Standard score | wikipedia.org |