How to remove (drop) duplicate columns with pandas ?

Published: February 22, 2021

Tags: Python; Pandas; DataFrame;

DMCA.com Protection Status

Example of how to remove duplicate columns with pandas:

Create a dataframe with duplicated columns

Let's first create a dataframe with duplicated columns

import pandas as pd
import numpy as np

data = np.random.randint(10, size=(5,3))

columns = ['Score A','Score B','Score C']

df = pd.DataFrame(data=data,columns=columns)

data = np.random.randint(10, size=(5,2))

columns = ['Score E','Score F']

df_add = pd.DataFrame(data=data,columns=columns)

df = pd.concat([df,df_add], axis=1)
df = pd.concat([df,df_add], axis=1)

print(df)

returns for example

   Score A  Score B  Score C  Score E  Score F  Score E  Score F
0        7        4        4        4        9        4        9
1        6        6        3        8        9        8        9
2        4        9        6        2        5        2        5
3        8        6        2        6        3        6        3
4        2        4        0        2        4        2        4

columns E and F have been duplicated twice here.

Remove duplicated columns

To remove those duplicated columns, a solution is to do:

df = df.loc[:,~df.columns.duplicated()]

print(df)

gives

   Score A  Score B  Score C  Score E  Score F
0        7        4        4        4        9
1        6        6        3        8        9
2        4        9        6        2        5
3        8        6        2        6        3
4        2        4        0        2        4

Warning: the above solution drop columns based on column name. So a column will be removed even if two columns are not strictly equals, illustration

import pandas as pd
import numpy as np

data = np.random.randint(10, size=(5,3))

columns = ['Score A','Score B','Score C']

df = pd.DataFrame(data=data,columns=columns)

data = np.random.randint(10, size=(5,2))

columns = ['Score E','Score F']

df_add = pd.DataFrame(data=data,columns=columns)

df = pd.concat([df,df_add], axis=1)
df = pd.concat([df,df_add], axis=1)



df.iloc[0,4] = 999

print(df)

returns

   Score A  Score B  Score C  Score E  Score F  Score E  Score F
0        4        0        5        0      999        0        0
1        6        5        7        1        1        1        1
2        7        5        3        4        5        4        5
3        8        1        7        9        3        9        3
4        6        9        3        7        6        7        6

here you can see that the two columns called "Score E" are not the same, but

df = df.loc[:,~df.columns.duplicated()]

print(df)

returns

   Score A  Score B  Score C  Score E  Score F
0        4        0        5        0      999
1        6        5        7        1        1
2        7        5        3        4        5
3        8        1        7        9        3
4        6        9        3        7        6

To see how to check first if two columns are equals see nest section

Get indexes of duplicated columns and check if columns are equals

It can be useful to check first if two columns are equals.

Let's first check the column called "Score E":

Get the position of "Score E" column

df.columns.get_loc("Score E")

gives

array([False, False, False,  True, False,  True, False])

transofrm that to indexes:

np.where( df.columns.get_loc("Score E") )[0]

gives

array([3, 5])

transform to list

A = np.where( df.columns.get_loc("Score E") )[0]

print( A.tolist() )

Indexes of column "Score E" are

[3, 5]

So to check if those two columns are equal just do:

df.iloc[:,3].equals(df.iloc[:,5])

which returns here.

True

Let's do the same thing with the column "Score F"

np.where( df.columns.get_loc("Score F") )[0].tolist()

gives indexes of "Score F" column:

[4, 6]

then if we do

df.iloc[:,4].equals(df.iloc[:,6])

returns

False

the two columns are not equals !

Remove columns with same name if and only if elements are not the same

import pandas as pd
import numpy as np

data = np.random.randint(10, size=(5,3))

columns = ['Score A','Score B','Score C']

df = pd.DataFrame(data=data,columns=columns)

data = np.random.randint(10, size=(5,2))

columns = ['Score E','Score F']

df_add = pd.DataFrame(data=data,columns=columns)

df = pd.concat([df,df_add], axis=1)
df = pd.concat([df,df_add], axis=1)
df = pd.concat([df,df_add], axis=1)

df.iloc[0,4] = 999

print(df)

returns

    Score A  Score B  Score C  Score E  Score F  Score E  Score F  Score E  \
0        7        4        5        9      999        9        5        9   
1        6        5        9        3        8        3        8        3   
2        1        2        9        4        0        4        0        4   
3        4        1        7        7        7        7        7        7   
4        1        6        6        6        6        6        6        6

   Score F  
0        5  
1        8  
2        0  
3        7  
4        6

Solution

def get_column_index_list(column):

    if type( df.columns.get_loc(column) ) == int:
        col_index_list = []
        col_index_list.append( df.columns.get_loc(column) )
    else:
        col_index_list = np.where( df.columns.get_loc(column) )[0].tolist()

    return col_index_list

#print('test', get_column_index_list('Score C') )

df_new = pd.DataFrame()

for column in df.columns:

    col_index_list = get_column_index_list(column)

    if column not in df_new.columns:

        first_col_idx = col_index_list[0]

        df_new = pd.concat([df_new,df.iloc[:,first_col_idx]], axis=1)

        if len(col_index_list) > 1:

            prev_col_idc = first_col_idx

            for col_idx in col_index_list[1:]: # keep first column

                current_col_idx = col_idx

                if not df.iloc[:,prev_col_idc].equals(df.iloc[:,current_col_idx]):

                    df_new = pd.concat([df_new,df.iloc[:,current_col_idx]], axis=1)

                    prev_col_idc = current_col_idx

print(df_new)

returns

   Score A  Score B  Score C  Score E  Score F  Score F
0        7        4        5        9      999        5
1        6        5        9        3        8        8
2        1        2        9        4        0        0
3        4        1        7        7        7        7
4        1        6        6        6        6        6

References