Testing Groupby and Aggregate operations for exploratory data analysis with pandas. Please feel free to download and edit it as you like.
Table of contents
- Introduction
- Testing on the titanic database
- Dataset quick E.D.A
- Group by on 'Survived' and 'Sex' columns and then get 'Age' and 'Fare' mean:
- Group by on 'Survived' and 'Sex' columns and then get 'Age' mean:
- Group by on 'Pclass' columns and then get 'Survived' mean (faster approach):
- Group by on 'Pclass' columns and then get 'Survived' mean (slower that previously approach):
- Group by on 'Survived' and 'Sex' and then apply describe() to age
- Group by on 'Survived' and 'Sex' and then aggregate (mean, max, min) age and fate
- Group by on Survived and get age mean
- Group by on Survived and get fare mean
- References
Note: you can also download my jupyter notebook that I created to test Groupby and Aggregate with pandas in python
Introduction
Import pandas
import pandas as pd
GroupBy
How to create a dataframe with pandas
Lets first create a simple dataframe
data = {'Age':[21,26,82,15,28],
'weight':[120,148,139,156,129],
'Gender':['male','male','female','male','female'],
'Country':['France','USA','USA','Germany','USA']}
df = pd.DataFrame(data=data)
gives
Age weight Gender Country
0 21 120 male France
1 26 148 male USA
2 82 139 female USA
3 15 156 male Germany
4 28 129 female USA
How to use group by
To group by "Gender" for example, a solution is to use pandas.DataFrame.groupby
df.groupby(by="Gender").mean()
returns
Age weight
Gender
female 55.000000 134.000000
male 20.666667 141.333333
How To use group by with 2 columns
To group by Gender and Country:
df.groupby(["Gender",'Country']).mean()
returns
Age weight
Gender Country
female USA 55 134
male France 21 120
Germany 15 156
USA 26 148
Note that inverting column order as no impact here:
df.groupby(["Country",'Gender']).mean()
print(df.groupby(["Country",'Gender']).mean())
returns
Age weight
Country Gender
France male 21 120
Germany male 15 156
USA female 55 134
male 26 148
How to only apply group for a specific column
How to ONLY group by Gender and Country for a specific column "Age" for example (selecting columns to apply group by should be faster if the dataframe has a lot of columns)
df.groupby(["Gender",'Country']).Age.mean()
returns
Gender Country
female USA 55
male France 21
Germany 15
USA 26
Name: Age, dtype: int64
Same as doing
df.groupby(["Gender",'Country'])['Age'].mean()
also returns
Gender Country
female USA 55
male France 21
Germany 15
USA 26
Name: Age, dtype: int64
How to sort values for a given column after applying group by :
df.groupby(["Country",'Gender']).mean().sort_values(by='Age')
print(df.groupby(["Country",'Gender']).mean().sort_values(by='Age'))
returns
Age weight
Country Gender
Germany male 15 156
France male 21 120
USA male 26 148
female 55 134
How to use describe() with group by
df.groupby(by="Gender").describe()
returns
Age weight
count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max
Gender
female 2.0 55.000000 38.183766 28.0 41.5 55.0 68.5 82.0 2.0 134.000000 7.071068 129.0 131.5 134.0 136.5 139.0
male 3.0 20.666667 5.507571 15.0 18.0 21.0 23.5 26.0 3.0 141.333333 18.903263 120.0 134.0 148.0 152.0 156.0
How to combine group by and aggregate
df.groupby(by="Gender").agg('mean')
print(df.groupby(by="Gender").agg('mean'))
Age weight
Gender
female 55.000000 134.000000
male 20.666667 141.333333
Advantage: possible to define multiple types of aggreation (mean, count, etc)
df.groupby(by="Gender").agg(['mean','count','sum','min','max'])
print(df.groupby(by="Gender").agg(['mean','count','sum','min','max']))
Age weight
mean count sum min max mean count sum min max
Gender
female 55.000000 2 110 28 82 134.000000 2 268 129 139
male 20.666667 3 62 15 26 141.333333 3 424 120 156
Testing on the titanic database
Now lets test what we learned previously on the well-known titanic dataset:
df = pd.read_csv('https://raw.githubusercontent.com/benjamin-hg-marchant/teaching/main/datasets/titanic.csv')
df.head()
returns
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
Dataset quick E.D.A
Get column names:
df.columns
gives
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
Plot Age distribution
df['Age'].plot(kind="hist")
df['Age'].mean()
returns
29.69911764705882
Get value_counts()
df['Sex'].value_counts(dropna=False)
returns
male 577
female 314
Name: Sex, dtype: int64
Get age mean for male
df['Age'][ df['Sex'] == "male" ].mean()
df['Age'][ df['Sex'] == "male" ].hist()
returns
30.72664459161148
and
Get age mean for female
df['Age'][ df['Sex'] == "female" ].mean()
df['Age'][ df['Sex'] == "female" ].hist()
returns
27.915708812260537
and
Group by on 'Survived' and 'Sex' columns and then get 'Age' and 'Fare' mean:
df.groupby(['Survived','Sex'])[['Age','Fare']].mean()
returns
Age Fare
Survived Sex
0 female 25.046875 23.024385
male 31.618056 21.960993
1 female 28.847716 51.938573
male 27.276022 40.821484
Group by on 'Survived' and 'Sex' columns and then get 'Age' mean:
df.groupby(['Survived','Sex']).Age.mean()
returns
Survived Sex
0 female 25.046875
male 31.618056
1 female 28.847716
male 27.276022
Name: Age, dtype: float64
Group by on 'Pclass' columns and then get 'Survived' mean (faster approach):
df[['Pclass','Survived']].groupby(['Pclass']).mean()
returns
Survived
Pclass
1 0.629630
2 0.472826
3 0.242363
Group by on 'Pclass' columns and then get 'Survived' mean (slower that previously approach):
df.groupby(['Pclass'])[['Survived']].mean()
returns
Survived
Pclass
1 0.629630
2 0.472826
3 0.242363
Group by on 'Survived' and 'Sex' and then apply describe() to age
df.groupby(['Survived','Sex']).Age.describe()
returns
count mean std min 25% 50% 75% max
Survived Sex
0 female 64.0 25.046875 13.618591 2.00 16.75 24.5 33.25 57.0
male 360.0 31.618056 14.056019 1.00 21.75 29.0 39.25 74.0
1 female 197.0 28.847716 14.175073 0.75 19.00 28.0 38.00 63.0
male 93.0 27.276022 16.504803 0.42 18.00 28.0 36.00 80.0
Group by on 'Survived' and 'Sex' and then aggregate (mean, max, min) age and fate
df.groupby(['Survived','Sex'])[['Age','Fare']].agg(['mean','max','min'])
returns
Age Fare
mean max min mean max min
Survived Sex
0 female 25.046875 57.0 2.00 23.024385 151.5500 6.750
male 31.618056 74.0 1.00 21.960993 263.0000 0.000
1 female 28.847716 63.0 0.75 51.938573 512.3292 7.225
male 27.276022 80.0 0.42 40.821484 512.3292 0.000
Group by on Survived and get age mean
df.groupby(['Survived']).Age.mean()
returns
Survived
0 30.626179
1 28.343690
Group by on Survived and get fare mean
df.groupby(['Survived']).Fare.mean()
returns
Survived
0 22.117887
1 48.395408
Name: Fare, dtype: float64