How to count the number of rows within a group when using the Pandas groupby function ?


Using groupby() in Pandas can help to group data together by one or more variables and aggregate them, allowing you to quickly count the rows of each group. To do this, you'll need to use a method called size(). This method counts the number of values within each group and returns a Series object with the index being the groups that you used for the groupby():

Using size()

For example, let's consider the following dataframe:

import pandas as pd

data = {'longitude':[0,0,60,60,60,180],
         'latitude':[0,0,30,30,30,90],
         'ice cloud counts':[1,1,3,2,1,2],
         'liquid cloud counts':[1,5,2,2,1,1]}

df = pd.DataFrame(data)

print(df)

The above code will generate the following:

   longitude  latitude  ice cloud counts  liquid cloud counts
0          0         0                 1                    1
1          0         0                 1                    5
2         60        30                 3                    2
3         60        30                 2                    2
4         60        30                 1                    1
5        180        90                 2                    1

Our objective is to group the dataframe by latitude and longitude and count the number of rows in each group. To do that a solution is to use pandas.DataFrame.size | pandas.pydata.org:

df_agg = df.groupby(['longitude','latitude']).size()

print(df_agg)

returns

longitude  latitude
0          0           2
60         30          3
180        90          1
dtype: int64

Note that the output here is a pandas series:

print(type(df_agg))

Output

<class 'pandas.core.series.Series'>

Additional features

After using the "groupby" function, we can perform additional steps such as converting the output to a dataframe.

Convert series to a dataframe

df_agg = df_agg.to_frame()

Reset index

We can reset the indexes of the dataframes as well:

df_agg.reset_index(inplace=True)

Renaming the column that contains count

and rename the dataframe columns:

df_agg.rename(columns={0: 'Count'}, inplace=True)

print(df_agg)

The above steps will then generate the following aggregated dataframeL

       longitude  latitude  Count
0          0         0      2
1         60        30      3
2        180        90      1

Using count()

Another solution to count the number of rows within a group when using the Pandas groupby function is to use pandas.DataFrame.count | pandas.pydata.org:

df_agg = df.groupby(['longitude','latitude']).count()

print(df_agg)

which will returns a dataframe directly:

                    ice cloud counts  liquid cloud counts
longitude latitude                                       
0         0                        2                    2
60        30                       3                    3
180       90                       1                    1

Difference between size() and count()

Note that "size" metric considers NaN values, while the "count" metric does not.

import pandas as pd
import numpy as np

data = {'longitude':[0,0,60,60,60,180],
         'latitude':[0,0,30,30,30,90],
         'ice cloud counts':[np.nan,1,3,2,1,2],
         'liquid cloud counts':[1,5,2,2,1,1]}

df = pd.DataFrame(data)

print(df)

Output

   longitude  latitude  ice cloud counts  liquid cloud counts
0          0         0               NaN                    1
1          0         0               1.0                    5
2         60        30               3.0                    2
3         60        30               2.0                    2
4         60        30               1.0                    1
5        180        90               2.0                    1

Using size()

df.groupby(['longitude','latitude']).size()

Output

longitude  latitude
0          0           2
60         30          3
180        90          1
dtype: int64

Using count()

df_agg = df.groupby(['longitude','latitude']).count()

print(df_agg)
print(type(df_agg))

Output

                    ice cloud counts  liquid cloud counts
longitude latitude                                       
0         0                        1                    2
60        30                       3                    3
180       90                       1                    1
<class 'pandas.core.frame.DataFrame'>

References