Introduction
In many cases, we may encounter datasets that contain multiple levels of indexing for rows and columns. This type of data organization is known as a MultiIndex or hierarchical indexing. While this format can be useful for representing complex data structures, it might not always be the most convenient form for our data analysis and manipulation tasks.
In this article, we will explore the process of converting a MultiIndex into a single index column in a pandas dataframe.
Case study
Let's consider the following common scenario: we have a pandas dataframe:
import pandas as pdimport randomfrom random import gaussfilename_list = ['A','B']df = pd.DataFrame({'var_1': [gauss(5,2) for i in range(20)],'var_2': [gauss(10,3) for i in range(20)],'filename': [random.choice(filename_list) for i in range(20)],})print( df )
the code above will generate for example:
var_1 var_2 filename0 3.870541 10.685929 B1 10.027259 13.846728 A2 9.046440 7.206848 A3 5.741767 4.151375 A4 9.183560 7.234745 B5 5.856723 6.072003 B6 5.650140 4.979813 B7 2.576061 13.104367 A8 6.120193 9.167984 A9 5.505811 13.369287 A10 2.370323 4.982578 B11 1.772658 12.527107 B12 10.333430 10.028926 A13 5.730522 7.842294 A14 5.935768 11.954894 B15 3.379559 12.724001 B16 5.411406 9.829348 A17 9.223762 12.071571 B18 2.572104 13.804577 B19 5.774546 12.083039 A
and we apply pandas groupby to calculate the mean and standard deviation.
dfg = df.groupby('filename').agg(['mean', 'std'])
the following dataframe will be then generated
var_1 var_2mean std mean stdfilenameA 6.626744 2.423697 10.063020 3.117546B 4.981514 2.677029 9.703722 3.487112
One can see by using the following command
dfg.columns
that the above dataframe has a MultiIndex column
MultiIndex([('var_1', 'mean'),('var_1', 'std'),('var_2', 'mean'),('var_2', 'std')],)
Converting a MultiIndex column into a single index column
Solution 1
A simple solution is to create a new list of column names using the expression
['_'.join(col) for col in dfg.columns]
This will result in a list like
['var_1_mean', 'var_1_std', 'var_2_mean', 'var_2_std']
Next, we can modify the dataframe columns by assigning the new list to dfg.columns using the expression
dfg.columns = ['_'.join(col) for col in dfg.columns]
Finally, if we print the dataframe:
print(dfg)
we will see the updated columns.
var_1_mean var_1_std var_2_mean var_2_stdfilenameA 6.626744 2.423697 10.063020 3.117546B 4.981514 2.677029 9.703722 3.487112
Solution 2
An alternative solution is to use the pandas droplevel() function:
dfg.columns = dfg.columns.droplevel(0)
which will give here:
mean std mean stdfilenameA 4.840524 2.333173 9.584441 3.283163B 4.629652 2.122357 10.379092 3.239954
This alternative may be sufficient if you only want to retain the level 0 of the column index.
References
| Links | Site |
|---|---|
| MultiIndex / advanced indexing | pandas.pydata.org |
| droplevel | pandas.pydata.org |
