Introduction
In this tutorial, we will learn how to iterate over a Pandas dataframe that has been grouped using the groupby function.
Creating a Grouped Dataframe
A grouped dataframe in Pandas is created by splitting a dataframe into groups using one or more group keys. With the groupby function, Pandas enables us to group data based on any column(s) in the dataframe and perform operations on each group individually.
For example, let's take a look at the following Pandas dataframe:
import pandas as pd
import random

# Build a small dataframe; var_2 holds random integers between 0 and 100
df = pd.DataFrame({
    'var_1': list(range(8)),
    'var_2': [random.randint(0, 100) for _ in range(8)],
    'filename': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'D'],
})
print(df)
Because var_2 is filled with random integers, the values change from run to run. One possible output is:
var_1 var_2 filename
0 0 80 A
1 1 80 A
2 2 54 B
3 3 34 B
4 4 27 B
5 5 64 B
6 6 80 C
7 7 22 D
To group data by filename, we can use the following code:
df.groupby('filename')
This does not return a dataframe but an object of type:
pandas.core.groupby.generic.DataFrameGroupBy
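A grouped object can be inspected before any iteration takes place. The following is a minimal sketch, assuming a fixed var_2 column (the values are copied from the sample output above so that the result is reproducible, instead of being drawn at random). It uses ngroups, groups and get_group, which are part of the pandas groupby interface.

```python
import pandas as pd

# Fixed var_2 values (an assumption for reproducibility; the tutorial draws them at random)
df = pd.DataFrame({
    'var_1': list(range(8)),
    'var_2': [80, 80, 54, 34, 27, 64, 80, 22],
    'filename': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'D'],
})

grouped = df.groupby('filename')

# Number of distinct groups
n_groups = grouped.ngroups

# Mapping from each group key to the row labels that belong to it
group_labels = {key: list(labels) for key, labels in grouped.groups.items()}

# Extract a single group as a regular dataframe
df_b = grouped.get_group('B')

print(n_groups)
print(group_labels)
print(df_b)
```

Here get_group is handy when only one group is of interest and a full loop is not needed.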
Iterating over Grouped Dataframes
Aggregations such as sum() or mean() already operate on every group without an explicit loop, so iteration is not always necessary. It becomes useful when each group needs a more involved operation that the built-in aggregations do not express directly.
To iterate over a grouped dataframe, we can use a for loop:
for group_name, df_group in df.groupby('filename'):
    print(group_name)
    print(df_group)
    print()
This will print each group name and the corresponding dataframe for that group. This is useful if we want to see the data for each group separately:
A
var_1 var_2 filename
0 0 80 A
1 1 80 A
B
var_1 var_2 filename
2 2 54 B
3 3 34 B
4 4 27 B
5 5 64 B
C
var_1 var_2 filename
6 6 80 C
D
var_1 var_2 filename
7 7 22 D
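As an illustration of a per-group operation that goes beyond a single aggregation, the sketch below collects several statistics for each group into a dictionary. The statistics chosen here (row count, spread of var_2, and the var_1 value of the row with the largest var_2) and the names summary and top_row_label are hypothetical, and var_2 is fixed to the values from the sample output so the result is reproducible.

```python
import pandas as pd

# Fixed var_2 values (an assumption for reproducibility; the tutorial draws them at random)
df = pd.DataFrame({
    'var_1': list(range(8)),
    'var_2': [80, 80, 54, 34, 27, 64, 80, 22],
    'filename': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'D'],
})

summary = {}
for group_name, df_group in df.groupby('filename'):
    # Label of the row holding the largest var_2 within this group
    # (idxmax returns the first such label when there are ties)
    top_row_label = df_group['var_2'].idxmax()
    summary[group_name] = {
        'n_rows': len(df_group),
        'var_2_range': int(df_group['var_2'].max() - df_group['var_2'].min()),
        'top_var_1': int(df_group.loc[top_row_label, 'var_1']),
    }

for group_name, stats in summary.items():
    print(group_name, stats)
```

Since each df_group is an ordinary dataframe, any dataframe method can be used inside the loop body.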
Iterate over each row within each group
Inside the loop, each group is an ordinary dataframe, so we can iterate over its rows with iterrows():
for group_name, df_group in df.groupby('filename'):
    print(group_name)
    print(df_group)
    # iterrows() yields (index label, row as a Series) pairs
    for row_index, row in df_group.iterrows():
        print(row['var_1'], row['var_2'])
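returns (the var_2 values differ from the earlier output because the random numbers were regenerated for this run)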
A
var_1 var_2 filename
0 0 100 A
1 1 16 A
0 100
1 16
B
var_1 var_2 filename
2 2 58 B
3 3 44 B
4 4 35 B
5 5 66 B
2 58
3 44
4 35
5 66
C
var_1 var_2 filename
6 6 72 C
6 72
D
var_1 var_2 filename
7 7 85 D
7 85
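iterrows() builds a Series for every row, which can be slow on large groups. One commonly used alternative is itertuples(), which yields one namedtuple per row. The sketch below is only an illustration of that alternative, not the approach used above. The dictionary name pairs_by_group is hypothetical, and var_2 is fixed to the values from the first sample output.

```python
import pandas as pd

# Fixed var_2 values (an assumption for reproducibility; the tutorial draws them at random)
df = pd.DataFrame({
    'var_1': list(range(8)),
    'var_2': [80, 80, 54, 34, 27, 64, 80, 22],
    'filename': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'D'],
})

pairs_by_group = {}
for group_name, df_group in df.groupby('filename'):
    # itertuples yields one namedtuple per row; column names become attributes
    pairs_by_group[group_name] = [
        (row.var_1, row.var_2) for row in df_group.itertuples(index=False)
    ]

print(pairs_by_group)
```

Attribute access such as row.var_1 works here because the column names are valid Python identifiers.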
Iterating over grouped dataframes in order of group size
You can also visit the groups in order of their size. First, compute the number of rows in each group and sort the result in descending order:
df.groupby(['filename']).size().sort_values(ascending=False)
returns:
filename
B 4
A 2
C 1
D 1
dtype: int64
To obtain a list of group keys sorted by group size, convert the index of that result to a list:
df.groupby(['filename']).size().sort_values(ascending=False).index.tolist()
gives
['B', 'A', 'C', 'D']
Putting this together, we can iterate over the groups from largest to smallest:
for group in df.groupby(['filename']).size().sort_values(ascending=False).index.tolist():
    print(group)
    # Select the rows that belong to the current group
    print(df[df['filename'] == group])
returns
B
var_1 var_2 filename
2 2 54 B
3 3 34 B
4 4 27 B
5 5 64 B
A
var_1 var_2 filename
0 0 80 A
1 1 80 A
C
var_1 var_2 filename
6 6 80 C
D
var_1 var_2 filename
7 7 22 D
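The loop above filters the full dataframe with a boolean mask once per group. A possible variant, sketched below, reuses a single groupby object and fetches each group with get_group. It orders the keys with Python's sorted(), which is stable, so groups of equal size (C and D here) keep their original key order. The names grouped, sizes, keys_by_size and rows_per_group are hypothetical, and var_2 is fixed to the values from the sample output.

```python
import pandas as pd

# Fixed var_2 values (an assumption for reproducibility; the tutorial draws them at random)
df = pd.DataFrame({
    'var_1': list(range(8)),
    'var_2': [80, 80, 54, 34, 27, 64, 80, 22],
    'filename': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'D'],
})

grouped = df.groupby('filename')
sizes = grouped.size()

# sorted() is stable, even with reverse=True, so tied groups keep their key order
keys_by_size = sorted(sizes.index, key=lambda key: sizes[key], reverse=True)

rows_per_group = []
for key in keys_by_size:
    df_group = grouped.get_group(key)
    rows_per_group.append((key, len(df_group)))
    print(key)
    print(df_group)
```

Grouping once and reusing the object avoids rescanning the filename column for every group.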