Example of how to remove duplicate columns with pandas:
Create a dataframe with duplicated columns
Let's first create a dataframe with duplicated columns
import pandas as pd
import numpy as np
data = np.random.randint(10, size=(5,3))
columns = ['Score A','Score B','Score C']
df = pd.DataFrame(data=data,columns=columns)
data = np.random.randint(10, size=(5,2))
columns = ['Score E','Score F']
df_add = pd.DataFrame(data=data,columns=columns)
df = pd.concat([df,df_add], axis=1)
df = pd.concat([df,df_add], axis=1)
print(df)
returns for example
Score A Score B Score C Score E Score F Score E Score F
0 7 4 4 4 9 4 9
1 6 6 3 8 9 8 9
2 4 9 6 2 5 2 5
3 8 6 2 6 3 6 3
4 2 4 0 2 4 2 4
columns E and F have been duplicated twice here.
Remove duplicated columns
To remove those duplicated columns, a solution is to do:
df = df.loc[:,~df.columns.duplicated()]
print(df)
gives
Score A Score B Score C Score E Score F
0 7 4 4 4 9
1 6 6 3 8 9
2 4 9 6 2 5
3 8 6 2 6 3
4 2 4 0 2 4
Warning: the above solution drop columns based on column name. So a column will be removed even if two columns are not strictly equals, illustration
import pandas as pd
import numpy as np
data = np.random.randint(10, size=(5,3))
columns = ['Score A','Score B','Score C']
df = pd.DataFrame(data=data,columns=columns)
data = np.random.randint(10, size=(5,2))
columns = ['Score E','Score F']
df_add = pd.DataFrame(data=data,columns=columns)
df = pd.concat([df,df_add], axis=1)
df = pd.concat([df,df_add], axis=1)
df.iloc[0,4] = 999
print(df)
returns
Score A Score B Score C Score E Score F Score E Score F
0 4 0 5 0 999 0 0
1 6 5 7 1 1 1 1
2 7 5 3 4 5 4 5
3 8 1 7 9 3 9 3
4 6 9 3 7 6 7 6
here you can see that the two columns called "Score E" are not the same, but
df = df.loc[:,~df.columns.duplicated()]
print(df)
returns
Score A Score B Score C Score E Score F
0 4 0 5 0 999
1 6 5 7 1 1
2 7 5 3 4 5
3 8 1 7 9 3
4 6 9 3 7 6
To see how to check first if two columns are equals see nest section
Get indexes of duplicated columns and check if columns are equals
It can be useful to check first if two columns are equals.
Let's first check the column called "Score E":
Get the position of "Score E" column
df.columns.get_loc("Score E")
gives
array([False, False, False, True, False, True, False])
transofrm that to indexes:
np.where( df.columns.get_loc("Score E") )[0]
gives
array([3, 5])
transform to list
A = np.where( df.columns.get_loc("Score E") )[0]
print( A.tolist() )
Indexes of column "Score E" are
[3, 5]
So to check if those two columns are equal just do:
df.iloc[:,3].equals(df.iloc[:,5])
which returns here.
True
Let's do the same thing with the column "Score F"
np.where( df.columns.get_loc("Score F") )[0].tolist()
gives indexes of "Score F" column:
[4, 6]
then if we do
df.iloc[:,4].equals(df.iloc[:,6])
returns
False
the two columns are not equals !
Remove columns with same name if and only if elements are not the same
import pandas as pd
import numpy as np
data = np.random.randint(10, size=(5,3))
columns = ['Score A','Score B','Score C']
df = pd.DataFrame(data=data,columns=columns)
data = np.random.randint(10, size=(5,2))
columns = ['Score E','Score F']
df_add = pd.DataFrame(data=data,columns=columns)
df = pd.concat([df,df_add], axis=1)
df = pd.concat([df,df_add], axis=1)
df = pd.concat([df,df_add], axis=1)
df.iloc[0,4] = 999
print(df)
returns
Score A Score B Score C Score E Score F Score E Score F Score E \
0 7 4 5 9 999 9 5 9
1 6 5 9 3 8 3 8 3
2 1 2 9 4 0 4 0 4
3 4 1 7 7 7 7 7 7
4 1 6 6 6 6 6 6 6
Score F
0 5
1 8
2 0
3 7
4 6
Solution
def get_column_index_list(column):
if type( df.columns.get_loc(column) ) == int:
col_index_list = []
col_index_list.append( df.columns.get_loc(column) )
else:
col_index_list = np.where( df.columns.get_loc(column) )[0].tolist()
return col_index_list
#print('test', get_column_index_list('Score C') )
df_new = pd.DataFrame()
for column in df.columns:
col_index_list = get_column_index_list(column)
if column not in df_new.columns:
first_col_idx = col_index_list[0]
df_new = pd.concat([df_new,df.iloc[:,first_col_idx]], axis=1)
if len(col_index_list) > 1:
prev_col_idc = first_col_idx
for col_idx in col_index_list[1:]: # keep first column
current_col_idx = col_idx
if not df.iloc[:,prev_col_idc].equals(df.iloc[:,current_col_idx]):
df_new = pd.concat([df_new,df.iloc[:,current_col_idx]], axis=1)
prev_col_idc = current_col_idx
print(df_new)
returns
Score A Score B Score C Score E Score F Score F
0 7 4 5 9 999 5
1 6 5 9 3 8 8
2 1 2 9 4 0 0
3 4 1 7 7 7 7
4 1 6 6 6 6 6