Introduction
In this tutorial, we will learn how to iterate over a Pandas dataframe that has been grouped using the groupby function.
Creating a Grouped Dataframe
A grouped dataframe in Pandas is created by splitting a dataframe into groups using one or more group keys. With the groupby function, Pandas enables us to group data based on any column(s) in the dataframe and perform operations on each group individually.
For example, let's take a look at the following Pandas dataframe:
import pandas as pd
import random

df = pd.DataFrame({
    'var_1': [i for i in range(8)],
    'var_2': [random.randint(0, 100) for i in range(8)],
    'filename': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'D'],
})

print(df)
Because var_2 is filled with random integers, the exact values change from run to run. The code prints something like:
   var_1  var_2 filename
0      0     80        A
1      1     80        A
2      2     54        B
3      3     34        B
4      4     27        B
5      5     64        B
6      6     80        C
7      7     22        D
To group data by filename, we can use the following code:
df.groupby('filename')
This does not return a new dataframe. It returns a grouped object of type:
pandas.core.groupby.generic.DataFrameGroupBy
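Before iterating, it can help to inspect what the grouped object contains. The following is a minimal sketch, assuming fixed var_2 values (copied from the printed example above) instead of random ones so that it is reproducible; the variable names grouped, n_groups, group_keys and group_b are illustrative only.

```python
import pandas as pd

# Same layout as the tutorial dataframe; var_2 is fixed here (an assumption)
# so that the sketch gives the same result on every run.
df = pd.DataFrame({
    'var_1': list(range(8)),
    'var_2': [80, 80, 54, 34, 27, 64, 80, 22],
    'filename': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'D'],
})

grouped = df.groupby('filename')

n_groups = grouped.ngroups          # number of distinct group keys
group_keys = list(grouped.groups)   # the group keys themselves
group_b = grouped.get_group('B')    # only the rows whose filename is 'B'

print(n_groups)
print(group_keys)
print(group_b)
```

Here ngroups, groups and get_group are attributes and methods of the grouped object; get_group is handy when only one group is needed and a full loop would be unnecessary.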
Iterating over Grouped Dataframes
You may wonder why you would iterate over a grouped dataframe at all when built-in aggregations such as sum() or mean() already operate on every group. Iteration becomes useful when the per-group logic is more involved than a single aggregation, for example when each group needs its own conditional handling or has to be processed as a separate dataframe.
To iterate over a grouped dataframe, we can use a for loop:
for group_name, df_group in df.groupby('filename'):
    print(group_name)
    print(df_group)
    print()
This prints each group name followed by the rows belonging to that group, which is useful for inspecting each group's data separately:
A
   var_1  var_2 filename
0      0     80        A
1      1     80        A

B
   var_1  var_2 filename
2      2     54        B
3      3     34        B
4      4     27        B
5      5     64        B

C
   var_1  var_2 filename
6      6     80        C

D
   var_1  var_2 filename
7      7     22        D
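As an illustration of a per-group operation that a single sum() or mean() does not cover, the sketch below fits a straight line of var_2 against var_1 inside each group, and skips groups that are too small to fit. The choice of a line fit, the slopes dictionary and the fixed var_2 values are assumptions made for this example, not part of the original tutorial.

```python
import numpy as np
import pandas as pd

# Fixed var_2 values (copied from the first printed example) so the
# result is reproducible; the tutorial itself uses random values.
df = pd.DataFrame({
    'var_1': list(range(8)),
    'var_2': [80, 80, 54, 34, 27, 64, 80, 22],
    'filename': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'D'],
})

slopes = {}
for group_name, df_group in df.groupby('filename'):
    if len(df_group) < 2:
        # A line cannot be fitted through a single point, so record None.
        slopes[group_name] = None
        continue
    # Least-squares straight-line fit; polyfit returns (slope, intercept) for deg=1.
    slope, intercept = np.polyfit(df_group['var_1'], df_group['var_2'], deg=1)
    slopes[group_name] = float(slope)

print(slopes)
```

The branch on group size is the kind of conditional, group-specific logic that is simpler to express in an explicit loop than in a single aggregation call.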
Iterating over Each Row within a Group
Inside the loop, each group is an ordinary dataframe, so we can also iterate over its rows by nesting a call to iterrows():
for group_name, df_group in df.groupby('filename'):
    print(group_name)
    print(df_group)
    for row_index, row in df_group.iterrows():
        print(row['var_1'], row['var_2'])
returns (the var_2 values differ from the earlier output because they are randomly generated):
A
   var_1  var_2 filename
0      0    100        A
1      1     16        A
0 100
1 16
B
   var_1  var_2 filename
2      2     58        B
3      3     44        B
4      4     35        B
5      5     66        B
2 58
3 44
4 35
5 66
C
   var_1  var_2 filename
6      6     72        C
6 72
D
   var_1  var_2 filename
7      7     85        D
7 85
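iterrows() builds a Series for every row, which can be slow on large groups. A possible alternative is itertuples(), which yields lightweight named tuples and is generally faster. The sketch below assumes fixed var_2 values for reproducibility; the pairs list is an illustrative name.

```python
import pandas as pd

# Fixed var_2 values (an assumption) so the sketch is reproducible.
df = pd.DataFrame({
    'var_1': list(range(8)),
    'var_2': [80, 80, 54, 34, 27, 64, 80, 22],
    'filename': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'D'],
})

pairs = []
for group_name, df_group in df.groupby('filename'):
    # index=False drops the row index; columns become attributes of each tuple.
    for row in df_group.itertuples(index=False):
        pairs.append((group_name, row.var_1, row.var_2))

for item in pairs:
    print(item)
```

Attribute access such as row.var_1 works because the column names are valid Python identifiers; columns with other names would need positional access instead.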
Iterating over Grouped Dataframes by Group Size
You can also visit the groups in order of their size. The first step is to count the rows in each group and sort the counts in descending order:
df.groupby(['filename']).size().sort_values(ascending=False)
returns:
filename
B    4
A    2
C    1
D    1
dtype: int64
To obtain a list of group keys sorted by group size, take the index of that result and convert it to a list:
df.groupby(['filename']).size().sort_values(ascending=False).index.tolist()
gives
['B', 'A', 'C', 'D']
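When grouping on a single column, a shorter way to get a comparable list is value_counts(), which counts rows per distinct value and sorts the counts in descending order. This is a sketch of an alternative, not the method used in this tutorial, and the relative order of equally sized groups (here C and D) should not be relied on.

```python
import pandas as pd

# Only the grouping column is needed for counting.
df = pd.DataFrame({'filename': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'D']})

# value_counts() returns counts sorted from most to least frequent.
keys_by_size = df['filename'].value_counts().index.tolist()
print(keys_by_size)
```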
Putting this together, we can loop over the group keys from the largest group to the smallest and select the rows of each group:
for group in df.groupby(['filename']).size().sort_values(ascending=False).index.tolist():
    print(group)
    print(df[df['filename'] == group])
returns
B
   var_1  var_2 filename
2      2     54        B
3      3     34        B
4      4     27        B
5      5     64        B
A
   var_1  var_2 filename
0      0     80        A
1      1     80        A
C
   var_1  var_2 filename
6      6     80        C
D
   var_1  var_2 filename
7      7     22        D
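The loop above filters the whole dataframe once per group. One possible variant, sketched here under the assumption of fixed var_2 values, groups once and sorts the resulting (name, dataframe) pairs by size with Python's built-in sorted(); the names grouped and visit_order are illustrative.

```python
import pandas as pd

# Fixed var_2 values (an assumption) so the sketch is reproducible.
df = pd.DataFrame({
    'var_1': list(range(8)),
    'var_2': [80, 80, 54, 34, 27, 64, 80, 22],
    'filename': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'D'],
})

grouped = df.groupby('filename')

visit_order = []
# Iterating a grouped object yields (group_name, group_dataframe) pairs.
# sorted() is stable, so equally sized groups keep their key order (C before D).
for group_name, df_group in sorted(grouped, key=lambda item: len(item[1]), reverse=True):
    visit_order.append(group_name)
    print(group_name)
    print(df_group)
```

This keeps every group's dataframe available inside the loop without repeating the boolean filter, at the cost of holding all groups in memory while they are sorted.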
References
| Links | Site |
|---|---|
| groupby | pandas.pydata.org |
| iterrows | pandas.pydata.org |
| sort_values | pandas.pydata.org |
