Removing duplicate rows from a pandas DataFrame in Python is a common task in many data analysis projects. Fortunately, it can be accomplished easily with the built-in methods provided by pandas.
Creating fake data
Let's start by creating an example dataframe:
```python
import pandas as pd

data = {'Name': ['Ben', 'Anna', 'Anna', 'Anna', 'Zoe', 'Zoe', 'Tom', 'John', 'Steve'],
        'Age': [20, 27, 27, 27, 43, 43, 30, 12, 21],
        'Sex': [1, 0, 0, 0, 0, 0, 1, 1, 1]}

df = pd.DataFrame(data)
```
This creates the following dataframe:
```
    Name  Age  Sex
0    Ben   20    1
1   Anna   27    0
2   Anna   27    0
3   Anna   27    0
4    Zoe   43    0
5    Zoe   43    0
6    Tom   30    1
7   John   12    1
8  Steve   21    1
```
Checking for duplicate rows
Using duplicated()
The pandas duplicated() method is useful if you want to find out which rows are duplicates without actually removing them from the dataframe. It works in a similar manner to drop_duplicates(), except that instead of removing duplicate rows it returns a boolean Series where True marks a duplicate row. For example:
```python
duplicates = df.duplicated(subset=['column1', 'column2'])
```
This returns a Series of booleans that you can use to filter out the duplicate rows if needed (note that assigning the result back to df, as in df = df.duplicated(...), would overwrite the dataframe with the boolean Series).
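As a minimal sketch of that filtering step (using a small dataframe with the same 'Name' and 'Age' columns as the example above), the boolean Series can be inverted with ~ to keep only the non-duplicated rows:

```python
import pandas as pd

# Small example dataframe (same columns as above)
df = pd.DataFrame({'Name': ['Ben', 'Anna', 'Anna', 'Zoe'],
                   'Age': [20, 27, 27, 43]})

mask = df.duplicated(subset=['Name', 'Age'])  # True marks a duplicate row
unique_rows = df[~mask]                       # invert the mask to keep unique rows
print(unique_rows)
```

This keeps the first occurrence of each (Name, Age) pair and drops the repeats, without modifying df itself.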
Example:
```python
df.duplicated()
```
This will return
```
0    False
1    False
2     True
3     True
4    False
5     True
6    False
7    False
8    False
dtype: bool
```
Count the number of duplicates
```python
s = df.duplicated(keep='last')
s.value_counts()
```
returns
```
False    6
True     3
dtype: int64
```
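Since True counts as 1 in Python, an alternative sketch is to sum the boolean Series directly to get the number of duplicates (same data as the example dataframe above):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Ben', 'Anna', 'Anna', 'Anna', 'Zoe', 'Zoe', 'Tom', 'John', 'Steve'],
                   'Age': [20, 27, 27, 27, 43, 43, 30, 12, 21],
                   'Sex': [1, 0, 0, 0, 0, 0, 1, 1, 1]})

# True == 1, so summing the boolean Series counts the duplicate rows
n_duplicates = df.duplicated(keep='last').sum()
print(n_duplicates)
```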
Dropping duplicate rows
One method for removing duplicates is the pandas drop_duplicates() method. It looks for duplicate rows and removes them, leaving a dataframe with only unique rows. To restrict the comparison to certain columns, pass their names with the subset parameter (by default, all columns are compared). For example:
```python
df = df.drop_duplicates(subset=['column1', 'column2'])
```
This searches for any rows that have identical values in the columns 'column1' and 'column2' and removes them from the dataframe. You can also specify which duplicates to keep with the keep parameter: setting it to 'first' keeps the first occurrence of each duplicate group, 'last' keeps the last one, and False removes every member of a duplicate group.
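A small sketch of the keep=False case (on a reduced version of the example data): rows that appear more than once are dropped entirely, so only rows that were never duplicated survive.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Ben', 'Anna', 'Anna', 'Zoe'],
                   'Age': [20, 27, 27, 43]})

# keep=False drops every member of a duplicate group
no_dups_at_all = df.drop_duplicates(keep=False)
print(no_dups_at_all)
```

Here both 'Anna' rows are removed, since neither occurrence is kept.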
Example 1
```python
df.drop_duplicates(keep='first', inplace=True)
```
returns
```
    Name  Age  Sex
0    Ben   20    1
1   Anna   27    0
4    Zoe   43    0
6    Tom   30    1
7   John   12    1
8  Steve   21    1
```
Example 2
```python
data = {'Name': ['Ben', 'Anna', 'Anna', 'Anna', 'Zoe', 'Zoe', 'Tom', 'John', 'Steve'],
        'Customer id': ['0001', '0005', '0005', '0005', '0023', '0023', '0008', '0009', '0012'],
        'Age': [20, 27, 23, 24, 43, 43, 30, 12, 21],
        'Sex': [1, 0, 0, 0, 0, 0, 1, 1, 1]}

df = pd.DataFrame(data)
```
which gives the dataframe:
```
    Name Customer id  Age  Sex
0    Ben        0001   20    1
1   Anna        0005   27    0
2   Anna        0005   23    0
3   Anna        0005   24    0
4    Zoe        0023   43    0
5    Zoe        0023   43    0
6    Tom        0008   30    1
7   John        0009   12    1
8  Steve       0012   21    1
```
Example 3
To remove duplicate rows according to a given column, here 'Customer id', add the subset argument, as illustrated below:
```python
df.drop_duplicates(subset='Customer id', keep='first', inplace=True)
```
output
```
    Name Customer id  Age  Sex
0    Ben        0001   20    1
1   Anna        0005   27    0
4    Zoe        0023   43    0
6    Tom        0008   30    1
7   John        0009   12    1
8  Steve       0012   21    1
```
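As a variation on Example 3, here is a sketch that keeps the last row of each 'Customer id' group instead of the first (same data as Example 2), and then renumbers the index with reset_index(drop=True) since dropping rows leaves gaps in the index:

```python
import pandas as pd

data = {'Name': ['Ben', 'Anna', 'Anna', 'Anna', 'Zoe', 'Zoe', 'Tom', 'John', 'Steve'],
        'Customer id': ['0001', '0005', '0005', '0005', '0023', '0023', '0008', '0009', '0012'],
        'Age': [20, 27, 23, 24, 43, 43, 30, 12, 21],
        'Sex': [1, 0, 0, 0, 0, 0, 1, 1, 1]}
df = pd.DataFrame(data)

# keep='last' retains the final occurrence of each 'Customer id'
last_kept = df.drop_duplicates(subset='Customer id', keep='last')

# reset_index(drop=True) renumbers the rows 0..n-1 after dropping
last_kept = last_kept.reset_index(drop=True)
print(last_kept)
```

For 'Customer id' 0005, the row with Age 24 is kept here, whereas keep='first' kept the row with Age 27.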
Additional features
Using pandas aggregate
To aggregate rows, see the article: How to use groupby and aggregate with pandas in python.
References
| Links | Site |
|---|---|
| duplicated() | pandas.pydata.org |
| drop_duplicates() | pandas.pydata.org |
