Testing Groupby and Aggregate operations for exploratory data analysis with pandas. Please feel free to download and edit it as you like.
Table of contents
- Introduction
- Testing on the titanic database
- Dataset quick E.D.A
- Group by on 'Survived' and 'Sex' columns and then get 'Age' and 'Fare' mean:
- Group by on 'Survived' and 'Sex' columns and then get 'Age' mean:
- Group by on 'Pclass' columns and then get 'Survived' mean (faster approach):
- Group by on 'Pclass' columns and then get 'Survived' mean (slower that previously approach):
- Group by on 'Survived' and 'Sex' and then apply describe() to age
- Group by on 'Survived' and 'Sex' and then aggregate (mean, max, min) age and fate
- Group by on Survived and get age mean
- Group by on Survived and get fare mean
- References
Note: you can also download my jupyter notebook that I created to test Groupby and Aggregate with pandas in python
Introduction
Import pandas
import pandas as pd
GroupBy
How to create a dataframe with pandas
Lets first create a simple dataframe
data = {'Age':[21,26,82,15,28],'weight':[120,148,139,156,129],'Gender':['male','male','female','male','female'],'Country':['France','USA','USA','Germany','USA']}df = pd.DataFrame(data=data)
gives
Age weight Gender Country0 21 120 male France1 26 148 male USA2 82 139 female USA3 15 156 male Germany4 28 129 female USA
How to use group by
To group by "Gender" for example, a solution is to use pandas.DataFrame.groupby
df.groupby(by="Gender").mean()
returns
Age weightGenderfemale 55.000000 134.000000male 20.666667 141.333333
How To use group by with 2 columns
To group by Gender and Country:
df.groupby(["Gender",'Country']).mean()
returns
Age weightGender Countryfemale USA 55 134male France 21 120Germany 15 156USA 26 148
Note that inverting column order as no impact here:
df.groupby(["Country",'Gender']).mean()print(df.groupby(["Country",'Gender']).mean())
returns
Age weightCountry GenderFrance male 21 120Germany male 15 156USA female 55 134male 26 148
How to only apply group for a specific column
How to ONLY group by Gender and Country for a specific column "Age" for example (selecting columns to apply group by should be faster if the dataframe has a lot of columns)
df.groupby(["Gender",'Country']).Age.mean()
returns
Gender Countryfemale USA 55male France 21Germany 15USA 26Name: Age, dtype: int64
Same as doing
df.groupby(["Gender",'Country'])['Age'].mean()
also returns
Gender Countryfemale USA 55male France 21Germany 15USA 26Name: Age, dtype: int64
How to sort values for a given column after applying group by :
df.groupby(["Country",'Gender']).mean().sort_values(by='Age')print(df.groupby(["Country",'Gender']).mean().sort_values(by='Age'))
returns
Age weightCountry GenderGermany male 15 156France male 21 120USA male 26 148female 55 134
How to use describe() with group by
df.groupby(by="Gender").describe()
returns
Age weightcount mean std min 25% 50% 75% max count mean std min 25% 50% 75% maxGenderfemale 2.0 55.000000 38.183766 28.0 41.5 55.0 68.5 82.0 2.0 134.000000 7.071068 129.0 131.5 134.0 136.5 139.0male 3.0 20.666667 5.507571 15.0 18.0 21.0 23.5 26.0 3.0 141.333333 18.903263 120.0 134.0 148.0 152.0 156.0
How to combine group by and aggregate
df.groupby(by="Gender").agg('mean')print(df.groupby(by="Gender").agg('mean'))Age weightGenderfemale 55.000000 134.000000male 20.666667 141.333333
Advantage: possible to define multiple types of aggreation (mean, count, etc)
df.groupby(by="Gender").agg(['mean','count','sum','min','max'])print(df.groupby(by="Gender").agg(['mean','count','sum','min','max']))Age weightmean count sum min max mean count sum min maxGenderfemale 55.000000 2 110 28 82 134.000000 2 268 129 139male 20.666667 3 62 15 26 141.333333 3 424 120 156
Testing on the titanic database
Now lets test what we learned previously on the well-known titanic dataset:
df = pd.read_csv('https://raw.githubusercontent.com/benjamin-hg-marchant/teaching/main/datasets/titanic.csv')df.head()
returns
PassengerId Survived Pclass \0 1 0 31 2 1 12 3 1 33 4 1 14 5 0 3Name Sex Age SibSp \0 Braund, Mr. Owen Harris male 22.0 11 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 12 Heikkinen, Miss. Laina female 26.0 03 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 14 Allen, Mr. William Henry male 35.0 0Parch Ticket Fare Cabin Embarked0 0 A/5 21171 7.2500 NaN S1 0 PC 17599 71.2833 C85 C2 0 STON/O2. 3101282 7.9250 NaN S3 0 113803 53.1000 C123 S4 0 373450 8.0500 NaN S
Dataset quick E.D.A
Get column names:
df.columns
gives
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp','Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],dtype='object')
Plot Age distribution
df['Age'].plot(kind="hist")

df['Age'].mean()
returns
29.69911764705882
Get value_counts()
df['Sex'].value_counts(dropna=False)
returns
male 577female 314Name: Sex, dtype: int64
Get age mean for male
df['Age'][ df['Sex'] == "male" ].mean()df['Age'][ df['Sex'] == "male" ].hist()
returns
30.72664459161148
and

Get age mean for female
df['Age'][ df['Sex'] == "female" ].mean()df['Age'][ df['Sex'] == "female" ].hist()
returns
27.915708812260537
and

Group by on 'Survived' and 'Sex' columns and then get 'Age' and 'Fare' mean:
df.groupby(['Survived','Sex'])[['Age','Fare']].mean()
returns
Age FareSurvived Sex0 female 25.046875 23.024385male 31.618056 21.9609931 female 28.847716 51.938573male 27.276022 40.821484
Group by on 'Survived' and 'Sex' columns and then get 'Age' mean:
df.groupby(['Survived','Sex']).Age.mean()
returns
Survived Sex0 female 25.046875male 31.6180561 female 28.847716male 27.276022Name: Age, dtype: float64
Group by on 'Pclass' columns and then get 'Survived' mean (faster approach):
df[['Pclass','Survived']].groupby(['Pclass']).mean()
returns
SurvivedPclass1 0.6296302 0.4728263 0.242363
Group by on 'Pclass' columns and then get 'Survived' mean (slower that previously approach):
df.groupby(['Pclass'])[['Survived']].mean()
returns
SurvivedPclass1 0.6296302 0.4728263 0.242363
Group by on 'Survived' and 'Sex' and then apply describe() to age
df.groupby(['Survived','Sex']).Age.describe()
returns
count mean std min 25% 50% 75% maxSurvived Sex0 female 64.0 25.046875 13.618591 2.00 16.75 24.5 33.25 57.0male 360.0 31.618056 14.056019 1.00 21.75 29.0 39.25 74.01 female 197.0 28.847716 14.175073 0.75 19.00 28.0 38.00 63.0male 93.0 27.276022 16.504803 0.42 18.00 28.0 36.00 80.0
Group by on 'Survived' and 'Sex' and then aggregate (mean, max, min) age and fate
df.groupby(['Survived','Sex'])[['Age','Fare']].agg(['mean','max','min'])
returns
Age Faremean max min mean max minSurvived Sex0 female 25.046875 57.0 2.00 23.024385 151.5500 6.750male 31.618056 74.0 1.00 21.960993 263.0000 0.0001 female 28.847716 63.0 0.75 51.938573 512.3292 7.225male 27.276022 80.0 0.42 40.821484 512.3292 0.000
Group by on Survived and get age mean
df.groupby(['Survived']).Age.mean()
returns
Survived0 30.6261791 28.343690
Group by on Survived and get fare mean
df.groupby(['Survived']).Fare.mean()
returns
Survived0 22.1178871 48.395408Name: Fare, dtype: float64
