How to convert a MultiIndex column into a single index column in a pandas dataframe?


Introduction

In many cases, we may encounter datasets that contain multiple levels of indexing for rows and columns. This type of data organization is known as a MultiIndex or hierarchical indexing. While this format can be useful for representing complex data structures, it might not always be the most convenient form for our data analysis and manipulation tasks.

In this article, we will explore the process of converting a MultiIndex into a single index column in a pandas dataframe.

Case study

Let's consider the following common scenario: we have a pandas dataframe:

import pandas as pd
import random

from random import gauss

filename_list = ['A','B']

df = pd.DataFrame({
    'var_1':     [gauss(5,2) for i in  range(20)],
    'var_2':     [gauss(10,3) for i in  range(20)],
    'filename':    [random.choice(filename_list) for i in range(20)],
})

print( df )

the code above will generate for example:

        var_1      var_2 filename
0    3.870541  10.685929        B
1   10.027259  13.846728        A
2    9.046440   7.206848        A
3    5.741767   4.151375        A
4    9.183560   7.234745        B
5    5.856723   6.072003        B
6    5.650140   4.979813        B
7    2.576061  13.104367        A
8    6.120193   9.167984        A
9    5.505811  13.369287        A
10   2.370323   4.982578        B
11   1.772658  12.527107        B
12  10.333430  10.028926        A
13   5.730522   7.842294        A
14   5.935768  11.954894        B
15   3.379559  12.724001        B
16   5.411406   9.829348        A
17   9.223762  12.071571        B
18   2.572104  13.804577        B
19   5.774546  12.083039        A

and we apply pandas groupby to calculate the mean and standard deviation.

dfg = df.groupby('filename').agg(['mean', 'std'])

the following dataframe will be then generated

             var_1                var_2          
              mean       std       mean       std
filename                                         
A         6.626744  2.423697  10.063020  3.117546
B         4.981514  2.677029   9.703722  3.487112

One can see by using the following command

dfg.columns

that the above dataframe has a MultiIndex column

MultiIndex([('var_1', 'mean'),
            ('var_1',  'std'),
            ('var_2', 'mean'),
            ('var_2',  'std')],
           )

Converting a MultiIndex column into a single index column

Solution 1

A simple solution is to create a new list of column names using the expression

['_'.join(col) for col in dfg.columns]

This will result in a list like

['var_1_mean', 'var_1_std', 'var_2_mean', 'var_2_std']

Next, we can modify the dataframe columns by assigning the new list to dfg.columns using the expression

dfg.columns = ['_'.join(col) for col in dfg.columns]

Finally, if we print the dataframe:

print(dfg)

we will see the updated columns.

          var_1_mean  var_1_std  var_2_mean  var_2_std
filename                                              
A           6.626744   2.423697   10.063020   3.117546
B           4.981514   2.677029    9.703722   3.487112

Solution 2

An alternative solution is to use the pandas droplevel() function:

dfg.columns = dfg.columns.droplevel(0)

which will give here:

              mean       std       mean       std
filename                                         
A         4.840524  2.333173   9.584441  3.283163
B         4.629652  2.122357  10.379092  3.239954

This alternative may be sufficient if you only want to retain the level 0 of the column index.

References

Links Site
MultiIndex / advanced indexing pandas.pydata.org
droplevel pandas.pydata.org