How to iterate (loop) over a Pandas dataframe that has been grouped using the groupby function?


Introduction

In this tutorial, we will learn how to iterate over a Pandas dataframe that has been grouped using the groupby function.

Creating a Grouped Dataframe

A grouped dataframe in Pandas is created by splitting a dataframe into groups using one or more group keys. With the groupby function, Pandas enables us to group data based on any column(s) in the dataframe and perform operations on each group individually.

For example, let's take a look at the following Pandas dataframe:

import pandas as pd
import random

df = pd.DataFrame({
    'var_1':    list(range(8)),
    'var_2':    [random.randint(0, 100) for _ in range(8)],
    'filename': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'D'],
})

print(df)

Because var_2 is filled with random integers, its values change on every run. One run prints:

   var_1  var_2 filename
0      0     80        A
1      1     80        A
2      2     54        B
3      3     34        B
4      4     27        B
5      5     64        B
6      6     80        C
7      7     22        D

To group data by filename, we can use the following code:

df.groupby('filename')

This returns a DataFrameGroupBy object rather than a dataframe. Its type is

 pandas.core.groupby.generic.DataFrameGroupBy
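
The grouped object can be inspected before any loop is written. The following is a minimal sketch, not part of the original example: it assumes fixed var_2 values (copied from the sample output above so that the result is reproducible) and uses the ngroups and groups attributes to show how the rows were split. The names grouped, n_groups and group_rows are chosen here for illustration.

```python
import pandas as pd

# Assumption: fixed values instead of random ones, so the result is reproducible
df = pd.DataFrame({
    'var_1': list(range(8)),
    'var_2': [80, 80, 54, 34, 27, 64, 80, 22],
    'filename': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'D'],
})

grouped = df.groupby('filename')

# Number of distinct groups found in the 'filename' column
n_groups = grouped.ngroups

# Map each group key to the list of row labels that belong to it
group_rows = {key: list(index) for key, index in grouped.groups.items()}

print(n_groups)
print(group_rows)
```

Here every distinct value of filename becomes one group key, and the row labels under each key are the rows that the loop in the next section will receive for that group.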

Iterating over Grouped Dataframes

Built-in aggregations such as sum() or mean() handle many tasks on grouped dataframes without an explicit loop. Iteration becomes useful when each group needs an operation that is more involved than a single aggregation.

To iterate over a grouped dataframe, we can use a for loop:

for group_name, df_group in df.groupby('filename'):
    print(group_name)
    print(df_group)
    print()

This prints each group name followed by the rows that belong to that group, which is useful for inspecting each group separately:

A
   var_1  var_2 filename
0      0     80        A
1      1     80        A

B
   var_1  var_2 filename
2      2     54        B
3      3     34        B
4      4     27        B
5      5     64        B

C
   var_1  var_2 filename
6      6     80        C

D
   var_1  var_2 filename
7      7     22        D
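
As an illustration of a per-group operation that goes beyond a single aggregation, the sketch below builds one summary record per group inside the loop. It is only an example under stated assumptions: the var_2 values are fixed (taken from the sample output above), and the summary columns n_rows, var_2_range and var_1_at_max_var_2 are hypothetical names introduced here.

```python
import pandas as pd

# Assumption: fixed values instead of random ones, so the result is reproducible
df = pd.DataFrame({
    'var_1': list(range(8)),
    'var_2': [80, 80, 54, 34, 27, 64, 80, 22],
    'filename': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'D'],
})

# Collect one summary record per group
summaries = []
for group_name, df_group in df.groupby('filename'):
    # Row holding the largest var_2 in this group (the first one if tied)
    top_row = df_group.loc[df_group['var_2'].idxmax()]
    summaries.append({
        'filename': group_name,
        'n_rows': len(df_group),
        'var_2_range': int(df_group['var_2'].max() - df_group['var_2'].min()),
        'var_1_at_max_var_2': int(top_row['var_1']),
    })

# Turn the list of records into a dataframe with one row per group
summary_df = pd.DataFrame(summaries)
print(summary_df)
```

Each df_group is an ordinary dataframe, so any dataframe method can be used on it inside the loop.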

Iterate over each row within a specific group

Inside the loop, each group is an ordinary dataframe, so we can also iterate over its rows with iterrows:

for group_name, df_group in df.groupby('filename'):
    print(group_name)
    print(df_group)

    for row_index, row in df_group.iterrows():
        print(row['var_1'], row['var_2'])

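returns (the var_2 values differ from the earlier output because they are random and the dataframe was created again):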

A
   var_1  var_2 filename
0      0    100        A
1      1     16        A
0 100
1 16
B
   var_1  var_2 filename
2      2     58        B
3      3     44        B
4      4     35        B
5      5     66        B
2 58
3 44
4 35
5 66
C
   var_1  var_2 filename
6      6     72        C
6 72
D
   var_1  var_2 filename
7      7     85        D
7 85
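
The pandas documentation notes that itertuples is generally faster than iterrows and preserves column dtypes. The sketch below shows the same nested loop written with itertuples. It assumes fixed var_2 values for reproducibility, and the list pairs is a name introduced here only to collect the results.

```python
import pandas as pd

# Assumption: fixed values instead of random ones, so the result is reproducible
df = pd.DataFrame({
    'var_1': list(range(8)),
    'var_2': [80, 80, 54, 34, 27, 64, 80, 22],
    'filename': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'D'],
})

pairs = []
for group_name, df_group in df.groupby('filename'):
    # itertuples yields one namedtuple per row; columns become attributes
    for row in df_group.itertuples(index=False):
        pairs.append((group_name, row.var_1, row.var_2))
        print(group_name, row.var_1, row.var_2)
```

Attribute access such as row.var_1 works because the column names are valid Python identifiers.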

Iterating over grouped dataframes in order of group size

We can also visit the groups from largest to smallest. First, compute the size of each group and sort the result in descending order:

df.groupby(['filename']).size().sort_values(ascending=False)

returns:

filename
B    4
A    2
C    1
D    1
dtype: int64

To obtain the list of group keys sorted by group size, take the index of that result:

df.groupby(['filename']).size().sort_values(ascending=False).index.tolist()

gives

['B', 'A', 'C', 'D']

Putting this together, we can iterate over the groups from largest to smallest:

for group in df.groupby(['filename']).size().sort_values(ascending=False).index.tolist():
    print(group)
    print(df[df['filename'] == group])

returns

B
   var_1  var_2 filename
2      2     54        B
3      3     34        B
4      4     27        B
5      5     64        B
A
   var_1  var_2 filename
0      0     80        A
1      1     80        A
C
   var_1  var_2 filename
6      6     80        C
D
   var_1  var_2 filename
7      7     22        D
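
A possible variation, sketched here under assumptions, groups the dataframe once and retrieves each group with get_group instead of filtering the whole dataframe on every pass. The var_2 values are fixed for reproducibility, the names grouped, keys_by_size and sizes are introduced for illustration, and kind='mergesort' is used so that groups of equal size keep a predictable order.

```python
import pandas as pd

# Assumption: fixed values instead of random ones, so the result is reproducible
df = pd.DataFrame({
    'var_1': list(range(8)),
    'var_2': [80, 80, 54, 34, 27, 64, 80, 22],
    'filename': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'D'],
})

# Group once and reuse the grouped object
grouped = df.groupby('filename')

# Group keys ordered from the largest group to the smallest
# (mergesort is a stable sort, so tied groups are ordered deterministically)
keys_by_size = (
    grouped.size()
    .sort_values(ascending=False, kind='mergesort')
    .index.tolist()
)

sizes = []
for key in keys_by_size:
    # Fetch the rows of this group from the existing grouped object
    df_group = grouped.get_group(key)
    sizes.append(len(df_group))
    print(key)
    print(df_group)
```

Grouping by the plain string 'filename' rather than the list ['filename'] keeps the group keys as simple scalars, which is the form get_group expects here.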
