Using groupby()
in Pandas can help to group data together by one or more variables and aggregate them, allowing you to quickly count the rows of each group. To do this, you'll need to use a method called size()
. This method counts the number of values within each group and returns a Series object with the index being the groups that you used for the groupby()
:
Using size()
For example, let's consider the following dataframe:
import pandas as pd
data = {'longitude':[0,0,60,60,60,180],
'latitude':[0,0,30,30,30,90],
'ice cloud counts':[1,1,3,2,1,2],
'liquid cloud counts':[1,5,2,2,1,1]}
df = pd.DataFrame(data)
print(df)
The above code will generate the following:
longitude latitude ice cloud counts liquid cloud counts
0 0 0 1 1
1 0 0 1 5
2 60 30 3 2
3 60 30 2 2
4 60 30 1 1
5 180 90 2 1
Our objective is to group the dataframe by latitude and longitude and count the number of rows in each group. To do that a solution is to use pandas.DataFrame.size | pandas.pydata.org:
df_agg = df.groupby(['longitude','latitude']).size()
print(df_agg)
returns
longitude latitude
0 0 2
60 30 3
180 90 1
dtype: int64
Note that the output here is a pandas series:
print(type(df_agg))
Output
<class 'pandas.core.series.Series'>
Additional features
After using the "groupby" function, we can perform additional steps such as converting the output to a dataframe.
Convert series to a dataframe
df_agg = df_agg.to_frame()
Reset index
We can reset the indexes of the dataframes as well:
df_agg.reset_index(inplace=True)
Renaming the column that contains count
and rename the dataframe columns:
df_agg.rename(columns={0: 'Count'}, inplace=True)
print(df_agg)
The above steps will then generate the following aggregated dataframeL
longitude latitude Count
0 0 0 2
1 60 30 3
2 180 90 1
Using count()
Another solution to count the number of rows within a group when using the Pandas groupby function is to use pandas.DataFrame.count | pandas.pydata.org:
df_agg = df.groupby(['longitude','latitude']).count()
print(df_agg)
which will returns a dataframe directly:
ice cloud counts liquid cloud counts
longitude latitude
0 0 2 2
60 30 3 3
180 90 1 1
Difference between size() and count()
Note that "size" metric considers NaN values, while the "count" metric does not.
import pandas as pd
import numpy as np
data = {'longitude':[0,0,60,60,60,180],
'latitude':[0,0,30,30,30,90],
'ice cloud counts':[np.nan,1,3,2,1,2],
'liquid cloud counts':[1,5,2,2,1,1]}
df = pd.DataFrame(data)
print(df)
Output
longitude latitude ice cloud counts liquid cloud counts
0 0 0 NaN 1
1 0 0 1.0 5
2 60 30 3.0 2
3 60 30 2.0 2
4 60 30 1.0 1
5 180 90 2.0 1
Using size()
df.groupby(['longitude','latitude']).size()
Output
longitude latitude
0 0 2
60 30 3
180 90 1
dtype: int64
Using count()
df_agg = df.groupby(['longitude','latitude']).count()
print(df_agg)
print(type(df_agg))
Output
ice cloud counts liquid cloud counts
longitude latitude
0 0 1 2
60 30 3 3
180 90 1 1
<class 'pandas.core.frame.DataFrame'>
References
Links | Site |
---|---|
How to transform a series into a dataframe using pandas ? | moonbooks.org |
How to rename a column name of a DataFrame in pandas ? | moonbooks.org |
pandas.DataFrame.size | pandas.pydata.org |
pandas.DataFrame.count | pandas.pydata.org |