Using groupby() in Pandas can help to group data together by one or more variables and aggregate them, allowing you to quickly count the rows of each group. To do this, you'll need to use a method called size(). This method counts the number of values within each group and returns a Series object with the index being the groups that you used for the groupby():
Using size()
For example, let's consider the following dataframe:
import pandas as pddata = {'longitude':[0,0,60,60,60,180],'latitude':[0,0,30,30,30,90],'ice cloud counts':[1,1,3,2,1,2],'liquid cloud counts':[1,5,2,2,1,1]}df = pd.DataFrame(data)print(df)
The above code will generate the following:
longitude latitude ice cloud counts liquid cloud counts0 0 0 1 11 0 0 1 52 60 30 3 23 60 30 2 24 60 30 1 15 180 90 2 1
Our objective is to group the dataframe by latitude and longitude and count the number of rows in each group. To do that a solution is to use pandas.DataFrame.size | pandas.pydata.org:
df_agg = df.groupby(['longitude','latitude']).size()print(df_agg)
returns
longitude latitude0 0 260 30 3180 90 1dtype: int64
Note that the output here is a pandas series:
print(type(df_agg))
Output
<class 'pandas.core.series.Series'>
Additional features
After using the "groupby" function, we can perform additional steps such as converting the output to a dataframe.
Convert series to a dataframe
df_agg = df_agg.to_frame()
Reset index
We can reset the indexes of the dataframes as well:
df_agg.reset_index(inplace=True)
Renaming the column that contains count
and rename the dataframe columns:
df_agg.rename(columns={0: 'Count'}, inplace=True)print(df_agg)
The above steps will then generate the following aggregated dataframeL
longitude latitude Count0 0 0 21 60 30 32 180 90 1
Using count()
Another solution to count the number of rows within a group when using the Pandas groupby function is to use pandas.DataFrame.count | pandas.pydata.org:
df_agg = df.groupby(['longitude','latitude']).count()print(df_agg)
which will returns a dataframe directly:
ice cloud counts liquid cloud countslongitude latitude0 0 2 260 30 3 3180 90 1 1
Difference between size() and count()
Note that "size" metric considers NaN values, while the "count" metric does not.
import pandas as pdimport numpy as npdata = {'longitude':[0,0,60,60,60,180],'latitude':[0,0,30,30,30,90],'ice cloud counts':[np.nan,1,3,2,1,2],'liquid cloud counts':[1,5,2,2,1,1]}df = pd.DataFrame(data)print(df)
Output
longitude latitude ice cloud counts liquid cloud counts0 0 0 NaN 11 0 0 1.0 52 60 30 3.0 23 60 30 2.0 24 60 30 1.0 15 180 90 2.0 1
Using size()
df.groupby(['longitude','latitude']).size()
Output
longitude latitude0 0 260 30 3180 90 1dtype: int64
Using count()
df_agg = df.groupby(['longitude','latitude']).count()print(df_agg)print(type(df_agg))
Output
ice cloud counts liquid cloud countslongitude latitude0 0 1 260 30 3 3180 90 1 1<class 'pandas.core.frame.DataFrame'>
References
| Links | Site |
|---|---|
| How to transform a series into a dataframe using pandas ? | moonbooks.org |
| How to rename a column name of a DataFrame in pandas ? | moonbooks.org |
| pandas.DataFrame.size | pandas.pydata.org |
| pandas.DataFrame.count | pandas.pydata.org |
