Removing duplicate rows from a pandas DataFrame is a common task in data analysis projects. Fortunately, it can be accomplished easily using functions built into pandas.
Create fake data
Let's start by creating a DataFrame, for example the following one:
import pandas as pd
data = {'Name':['Ben','Anna','Anna','Anna','Zoe','Zoe','Tom','John','Steve'],
'Age':[20,27,27,27,43,43,30,12,21],
'Sex':[1,0,0,0,0,0,1,1,1]}
df = pd.DataFrame(data)
This will return the following DataFrame:
Name Age Sex
0 Ben 20 1
1 Anna 27 0
2 Anna 27 0
3 Anna 27 0
4 Zoe 43 0
5 Zoe 43 0
6 Tom 30 1
7 John 12 1
8 Steve 21 1
Checking for duplicate rows
Using duplicated()
The pandas duplicated() method is useful if you want to find out which rows are duplicates without actually removing them from the DataFrame. It works in a similar manner to drop_duplicates(), except that instead of removing duplicate rows it returns a boolean Series where True
indicates a duplicate row. For example:
duplicates = df.duplicated(subset=['column1','column2'])
Note that the result is assigned to a new variable rather than back to df, since it is a boolean Series, not a DataFrame. You can use it to filter out the duplicate rows if needed.
Example:
df.duplicated()
This will return
0 False
1 False
2 True
3 True
4 False
5 True
6 False
7 False
8 False
dtype: bool
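The boolean Series above can be inverted with ~ to keep only the non-duplicated rows. A quick sketch using the same fake data:

```python
import pandas as pd

data = {'Name':['Ben','Anna','Anna','Anna','Zoe','Zoe','Tom','John','Steve'],
        'Age':[20,27,27,27,43,43,30,12,21],
        'Sex':[1,0,0,0,0,0,1,1,1]}
df = pd.DataFrame(data)

# Invert the boolean mask: True becomes False, so only rows that are
# NOT flagged as duplicates are kept
unique_rows = df[~df.duplicated()]
print(unique_rows.index.tolist())  # [0, 1, 4, 6, 7, 8]
```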
Count the number of duplicates
s = df.duplicated(keep='last')
s.value_counts()
returns
False 6
True 3
dtype: int64
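Since True counts as 1 in arithmetic, the number of duplicate rows can also be obtained directly with sum(), a small sketch:

```python
import pandas as pd

data = {'Name':['Ben','Anna','Anna','Anna','Zoe','Zoe','Tom','John','Steve'],
        'Age':[20,27,27,27,43,43,30,12,21],
        'Sex':[1,0,0,0,0,0,1,1,1]}
df = pd.DataFrame(data)

# Each True in the boolean Series counts as 1, so sum() gives the
# total number of duplicate rows
n_duplicates = df.duplicated().sum()
print(n_duplicates)  # 3
```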
Dropping duplicate rows
The standard way to remove duplicates is the pandas drop_duplicates() method. It looks for duplicate rows and removes them, leaving you with a DataFrame that contains only unique rows. By default all columns are compared, but you can pass the column names to check for duplicates via the subset parameter. For example:
df = df.drop_duplicates(subset=['column1','column2'])
This will search for rows that have identical values in the columns 'column1' and 'column2', and will remove the duplicates from the DataFrame. You can also specify which duplicates to keep by using the keep parameter: setting it to 'first' keeps the first duplicate row, while setting it to 'last' keeps the last one.
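The effect of the keep parameter can be seen on the fake data from above. keep also accepts False, which drops every row that has a duplicate anywhere, a quick sketch:

```python
import pandas as pd

data = {'Name':['Ben','Anna','Anna','Anna','Zoe','Zoe','Tom','John','Steve'],
        'Age':[20,27,27,27,43,43,30,12,21],
        'Sex':[1,0,0,0,0,0,1,1,1]}
df = pd.DataFrame(data)

# keep='last' keeps the last occurrence of each duplicated row
last = df.drop_duplicates(keep='last')
print(last.index.tolist())  # [0, 3, 5, 6, 7, 8]

# keep=False drops every row that appears more than once
none = df.drop_duplicates(keep=False)
print(none.index.tolist())  # [0, 6, 7, 8]
```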
Example 1
df.drop_duplicates(keep = 'first', inplace=True)
returns
Name Age Sex
0 Ben 20 1
1 Anna 27 0
4 Zoe 43 0
6 Tom 30 1
7 John 12 1
8 Steve 21 1
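Note that the output above keeps the original index labels (0, 1, 4, 6, 7, 8), leaving gaps. If a contiguous index is needed, reset_index() can be chained on, a small sketch:

```python
import pandas as pd

data = {'Name':['Ben','Anna','Anna','Anna','Zoe','Zoe','Tom','John','Steve'],
        'Age':[20,27,27,27,43,43,30,12,21],
        'Sex':[1,0,0,0,0,0,1,1,1]}
df = pd.DataFrame(data)
df.drop_duplicates(keep='first', inplace=True)

# The surviving rows keep their old labels (0, 1, 4, 6, 7, 8);
# drop=True discards them and renumbers from 0
df = df.reset_index(drop=True)
print(df.index.tolist())  # [0, 1, 2, 3, 4, 5]
```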
Example 2
data = {'Name':['Ben','Anna','Anna','Anna','Zoe','Zoe','Tom','John','Steve'],
'Customer id':['0001','0005','0005','0005','0023','0023','0008','0009','0012'],
'Age':[20,27,23,24,43,43,30,12,21],
'Sex':[1,0,0,0,0,0,1,1,1]}
df = pd.DataFrame(data)
which gives:
Name Customer id Age Sex
0 Ben 0001 20 1
1 Anna 0005 27 0
2 Anna 0005 23 0
3 Anna 0005 24 0
4 Zoe 0023 43 0
5 Zoe 0023 43 0
6 Tom 0008 30 1
7 John 0009 12 1
8 Steve 0012 21 1
Example 3
To remove duplicate rows according to a given column, here 'Customer id', it is possible to add the subset argument, illustration:
df.drop_duplicates(subset ="Customer id", keep = 'first', inplace=True)
output
Name Customer id Age Sex
0 Ben 0001 20 1
1 Anna 0005 27 0
4 Zoe 0023 43 0
6 Tom 0008 30 1
7 John 0009 12 1
8 Steve 0012 21 1
Additional features
Using pandas aggregate
To aggregate rows instead of dropping them, see the article How to use Groupby and Aggregate with pandas in python?
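As an alternative to dropping duplicates, groupby() with agg() can collapse duplicate rows while combining their values. A minimal sketch using the Example 2 data, where the aggregation choices (first name, mean age) are just one possible illustration:

```python
import pandas as pd

data = {'Name':['Ben','Anna','Anna','Anna','Zoe','Zoe','Tom','John','Steve'],
        'Customer id':['0001','0005','0005','0005','0023','0023','0008','0009','0012'],
        'Age':[20,27,23,24,43,43,30,12,21],
        'Sex':[1,0,0,0,0,0,1,1,1]}
df = pd.DataFrame(data)

# One row per customer: keep the first name seen, average the ages
agg = df.groupby('Customer id', as_index=False).agg({'Name': 'first', 'Age': 'mean'})
print(len(agg))  # 6
```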
References
| Links | Site |
|---|---|
| duplicated() | pandas.pydata.org |
| drop_duplicates() | pandas.pydata.org |