How to drop duplicate rows in Pandas based on column value

Published: June 16, 2020

Updated: March 07, 2023

Tags: Python; Pandas; DataFrame;


Removing duplicate rows from a pandas DataFrame in Python is a common task in many data analysis projects. Fortunately, this can be accomplished quite easily with some of the built-in functions provided by pandas.

Create some fake data

Let's start by creating an example dataframe:

import pandas as pd

data = {'Name':['Ben','Anna','Anna','Anna','Zoe','Zoe','Tom','John','Steve'], 
        'Age':[20,27,27,27,43,43,30,12,21], 
        'Sex':[1,0,0,0,0,0,1,1,1]}

df = pd.DataFrame(data)

This creates the following dataframe:

    Name  Age  Sex
0    Ben   20    1
1   Anna   27    0
2   Anna   27    0
3   Anna   27    0
4    Zoe   43    0
5    Zoe   43    0
6    Tom   30    1
7   John   12    1
8  Steve   21    1

Checking for duplicate rows

Using duplicated()

Pandas provides the duplicated() method, which is useful if you want to find out which rows are duplicates without actually removing them from the dataframe. It works in a similar manner to drop_duplicates(), except that instead of removing duplicate rows it returns a boolean Series where True marks a duplicate row. For example ('column1' and 'column2' are placeholder column names):

mask = df.duplicated(subset=['column1','column2'])

This returns a boolean Series (a mask) that can be used to filter the duplicate rows in or out, as shown after the example below.

Example:

df.duplicated()

This will return

0    False
1    False
2     True
3     True
4    False
5     True
6    False
7    False
8    False
dtype: bool
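
Using the mask to inspect or filter rows is then straightforward. Here is a minimal sketch on the same dataframe (mask is an illustrative variable name):

mask = df.duplicated()

df[mask]   # only the duplicated rows (here rows 2, 3 and 5)

df[~mask]  # all rows except the duplicates, equivalent to drop_duplicates()

The ~ operator inverts the boolean Series, so df[~mask] keeps the first occurrence of each row.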

Count the number of duplicates

s = df.duplicated(keep='last')

s.value_counts()

returns

False    6
True     3
dtype: int64
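
Since True counts as 1, the number of duplicated rows can also be obtained directly by summing the mask, for example:

df.duplicated().sum()

which returns 3 here.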

Dropping duplicate rows

One method for removing duplicates is the pandas drop_duplicates() method. It looks for duplicate rows and removes them, leaving you with a dataframe that contains only unique rows. By default all columns are compared; to restrict the check to certain columns, pass their names via the subset parameter. For example (again with placeholder column names):

df = df.drop_duplicates(subset=['column1','column2'])

This searches for rows that have identical values in the columns 'column1' and 'column2' and removes the repeated ones from the dataframe. You can also specify which duplicate to keep with the keep parameter: keep='first' (the default) keeps the first occurrence of each duplicate, while keep='last' keeps the last one.
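
There is also keep=False, which removes every row involved in a duplication, including the first occurrence. A quick sketch on the sample dataframe (without inplace, so df itself is left untouched):

df.drop_duplicates(keep=False)

Only Ben, Tom, John and Steve would remain, since the Anna and Zoe rows appear more than once.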

Example 1

df.drop_duplicates(keep='first', inplace=True)

leaves df as

    Name  Age  Sex
0    Ben   20    1
1   Anna   27    0
4    Zoe   43    0
6    Tom   30    1
7   John   12    1
8  Steve   21    1

Example 2

data = {'Name':['Ben','Anna','Anna','Anna','Zoe','Zoe','Tom','John','Steve'], 
        'Customer id':['0001','0005','0005','0005','0023','0023','0008','0009','0012'], 
        'Age':[20,27,23,24,43,43,30,12,21], 
        'Sex':[1,0,0,0,0,0,1,1,1]}

df = pd.DataFrame(data)

This gives:

    Name Customer id  Age  Sex
0    Ben        0001   20    1
1   Anna        0005   27    0
2   Anna        0005   23    0
3   Anna        0005   24    0
4    Zoe        0023   43    0
5    Zoe        0023   43    0
6    Tom        0008   30    1
7   John        0009   12    1
8  Steve        0012   21    1

Example 3

To remove duplicate rows according to a given column, here 'Customer id', add the subset argument. For example:

df.drop_duplicates(subset='Customer id', keep='first', inplace=True)

df then contains:

    Name Customer id  Age  Sex
0    Ben        0001   20    1
1   Anna        0005   27    0
4    Zoe        0023   43    0
6    Tom        0008   30    1
7   John        0009   12    1
8  Steve        0012   21    1
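
Note that with inplace=True the call modifies df directly and returns None; a common alternative is to assign the result to a new variable. As a sketch on the original nine-row dataframe from Example 2 (deduped is an illustrative name):

deduped = df.drop_duplicates(subset='Customer id', keep='last')

With keep='last', the last Anna row (Age 24) would be kept instead of the first one (Age 27).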

Additional features

Using pandas aggregate

To aggregate rows instead of dropping them, see the article How to use Groupby and Aggregate with pandas in python?
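
As a quick taste, here is a hedged sketch that aggregates the duplicated 'Customer id' rows from Example 2 instead of dropping them (the choice of aggregations is illustrative):

df.groupby('Customer id', as_index=False).agg({'Name': 'first', 'Age': 'mean'})

This keeps one row per customer, taking the first name and the mean age.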

References

duplicated() - pandas.pydata.org
drop_duplicates() - pandas.pydata.org