Introduction
In many cases, we may encounter datasets that contain multiple levels of indexing for rows and columns. This type of data organization is known as a MultiIndex or hierarchical indexing. While this format can be useful for representing complex data structures, it might not always be the most convenient form for our data analysis and manipulation tasks.
In this article, we will explore the process of converting a MultiIndex into a single index column in a pandas dataframe.
Case study
Let's consider the following common scenario: we have a pandas dataframe:
import pandas as pd
import random
from random import gauss
filename_list = ['A','B']
df = pd.DataFrame({
'var_1': [gauss(5,2) for i in range(20)],
'var_2': [gauss(10,3) for i in range(20)],
'filename': [random.choice(filename_list) for i in range(20)],
})
print( df )
the code above will generate for example:
var_1 var_2 filename
0 3.870541 10.685929 B
1 10.027259 13.846728 A
2 9.046440 7.206848 A
3 5.741767 4.151375 A
4 9.183560 7.234745 B
5 5.856723 6.072003 B
6 5.650140 4.979813 B
7 2.576061 13.104367 A
8 6.120193 9.167984 A
9 5.505811 13.369287 A
10 2.370323 4.982578 B
11 1.772658 12.527107 B
12 10.333430 10.028926 A
13 5.730522 7.842294 A
14 5.935768 11.954894 B
15 3.379559 12.724001 B
16 5.411406 9.829348 A
17 9.223762 12.071571 B
18 2.572104 13.804577 B
19 5.774546 12.083039 A
and we apply pandas groupby to calculate the mean and standard deviation.
dfg = df.groupby('filename').agg(['mean', 'std'])
the following dataframe will be then generated
var_1 var_2
mean std mean std
filename
A 6.626744 2.423697 10.063020 3.117546
B 4.981514 2.677029 9.703722 3.487112
One can see by using the following command
dfg.columns
that the above dataframe has a MultiIndex column
MultiIndex([('var_1', 'mean'),
('var_1', 'std'),
('var_2', 'mean'),
('var_2', 'std')],
)
Converting a MultiIndex column into a single index column
Solution 1
A simple solution is to create a new list of column names using the expression
['_'.join(col) for col in dfg.columns]
This will result in a list like
['var_1_mean', 'var_1_std', 'var_2_mean', 'var_2_std']
Next, we can modify the dataframe columns by assigning the new list to dfg.columns using the expression
dfg.columns = ['_'.join(col) for col in dfg.columns]
Finally, if we print the dataframe:
print(dfg)
we will see the updated columns.
var_1_mean var_1_std var_2_mean var_2_std
filename
A 6.626744 2.423697 10.063020 3.117546
B 4.981514 2.677029 9.703722 3.487112
Solution 2
An alternative solution is to use the pandas droplevel() function:
dfg.columns = dfg.columns.droplevel(0)
which will give here:
mean std mean std
filename
A 4.840524 2.333173 9.584441 3.283163
B 4.629652 2.122357 10.379092 3.239954
This alternative may be sufficient if you only want to retain the level 0 of the column index.
References
Links | Site |
---|---|
MultiIndex / advanced indexing | pandas.pydata.org |
droplevel | pandas.pydata.org |