# How to use Groupby and Aggregate with pandas in python ?

Published: August 11, 2021

Testing Groupby and Aggregate operations for exploratory data analysis with pandas. Please feel free to download and edit it as you like.

Note: you can also download my jupyter notebook that I created to test Groupby and Aggregate with pandas in python

### Introduction

Import pandas

````import pandas as pd`
```

#### GroupBy

##### How to create a dataframe with pandas

Lets first create a simple dataframe

````data = {'Age':[21,26,82,15,28],`
`                'weight':[120,148,139,156,129],`
`             'Gender':['male','male','female','male','female'],`
`             'Country':['France','USA','USA','Germany','USA']}`

`df = pd.DataFrame(data=data)`
```

gives

````     Age  weight  Gender  Country`
`0   21     120    male   France`
`1   26     148    male      USA`
`2   82     139  female      USA`
`3   15     156    male  Germany`
`4   28     129  female      USA`
```
##### How to use group by

To group by "Gender" for example, a solution is to use pandas.DataFrame.groupby

````df.groupby(by="Gender").mean()`
```

returns

````                            Age      weight`
`Gender                       `
`female  55.000000  134.000000`
`male    20.666667  141.333333`
```
##### How To use group by with 2 columns

To group by Gender and Country:

````df.groupby(["Gender",'Country']).mean()`
```

returns

````                Age  weight`
`Gender Country             `
`female USA       55     134`
`male   France    21     120`
`       Germany   15     156`
`       USA       26     148`
```

Note that inverting column order as no impact here:

````df.groupby(["Country",'Gender']).mean()`

`print(df.groupby(["Country",'Gender']).mean())`
```

returns

````                Age  weight`
`Country Gender             `
`France  male     21     120`
`Germany male     15     156`
`USA     female   55     134`
`        male     26     148`
```
##### How to only apply group for a specific column

How to ONLY group by Gender and Country for a specific column "Age" for example (selecting columns to apply group by should be faster if the dataframe has a lot of columns)

````df.groupby(["Gender",'Country']).Age.mean()`
```

returns

````Gender  Country`
`female  USA        55`
`male    France     21`
`        Germany    15`
`        USA        26`
`Name: Age, dtype: int64`
```

Same as doing

````df.groupby(["Gender",'Country'])['Age'].mean()`
```

also returns

````Gender  Country`
`female  USA        55`
`male    France     21`
`        Germany    15`
`        USA        26`
`Name: Age, dtype: int64`
```
##### How to sort values for a given column after applying group by :
````df.groupby(["Country",'Gender']).mean().sort_values(by='Age')`

`print(df.groupby(["Country",'Gender']).mean().sort_values(by='Age'))`
```

returns

````                Age  weight`
`Country Gender             `
`Germany male     15     156`
`France  male     21     120`
`USA     male     26     148`
`        female   55     134`
```
##### How to use describe() with group by
````df.groupby(by="Gender").describe()`
```

returns

````                Age                                                                                        weight  `
`             count       mean        std   min   25%   50%   75%   max  count    mean        std    min    25%    50%    75%    max  `
`Gender                                                                    `
`female   2.0  55.000000  38.183766  28.0  41.5  55.0  68.5  82.0    2.0   134.000000   7.071068  129.0  131.5  134.0  136.5  139.0  `
`male     3.0  20.666667   5.507571  15.0  18.0  21.0  23.5  26.0    3.0   141.333333  18.903263  120.0  134.0  148.0  152.0  156.0`
```

#### How to combine group by and aggregate

````df.groupby(by="Gender").agg('mean')`

`print(df.groupby(by="Gender").agg('mean'))`

`              Age      weight`
`Gender                       `
`female  55.000000  134.000000`
`male    20.666667  141.333333`
```

Advantage: possible to define multiple types of aggreation (mean, count, etc)

````df.groupby(by="Gender").agg(['mean','count','sum','min','max'])`

`print(df.groupby(by="Gender").agg(['mean','count','sum','min','max']))`

`              Age                         weight                     `
`             mean count  sum min max        mean count  sum  min  max`
`Gender                                                               `
`female  55.000000     2  110  28  82  134.000000     2  268  129  139`
`male    20.666667     3   62  15  26  141.333333     3  424  120  156`
```

### Testing on the titanic database

Now lets test what we learned previously on the well-known titanic dataset:

````df = pd.read_csv('https://raw.githubusercontent.com/benjamin-hg-marchant/teaching/main/datasets/titanic.csv')`

`df.head()`
```

returns

````     PassengerId  Survived  Pclass  \`
`0            1         0       3   `
`1            2         1       1   `
`2            3         1       3   `
`3            4         1       1   `
`4            5         0       3`

`                                                                                                Name     Sex   Age  SibSp  \`
`0                            Braund, Mr. Owen Harris    male  22.0      1   `
`1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   `
`2                             Heikkinen, Miss. Laina  female  26.0      0   `
`3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   `
`4                           Allen, Mr. William Henry    male  35.0      0`

`     Parch            Ticket     Fare Cabin Embarked  `
`0      0         A/5 21171   7.2500   NaN        S  `
`1      0          PC 17599  71.2833   C85        C  `
`2      0  STON/O2. 3101282   7.9250   NaN        S  `
`3      0            113803  53.1000  C123        S  `
`4      0            373450   8.0500   NaN        S`
```
##### Dataset quick E.D.A

Get column names:

````df.columns`
```

gives

````Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',`
`             'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],`
`            dtype='object')`
```

Plot Age distribution

````df['Age'].plot(kind="hist")`
```

````df['Age'].mean()`
```

returns

````29.69911764705882`
```

Get value_counts()

````df['Sex'].value_counts(dropna=False)`
```

returns

````male      577`
`female    314`
`Name: Sex, dtype: int64`
```

Get age mean for male

````df['Age'][ df['Sex'] == "male" ].mean()`

`df['Age'][ df['Sex'] == "male" ].hist()`
```

returns

````30.72664459161148`
```

and

Get age mean for female

````df['Age'][ df['Sex'] == "female" ].mean()`

`df['Age'][ df['Sex'] == "female" ].hist()`
```

returns

````27.915708812260537`
```

and

##### Group by on 'Survived' and 'Sex' columns and then get 'Age' and 'Fare' mean:
````df.groupby(['Survived','Sex'])[['Age','Fare']].mean()`
```

returns

````                                                 Age       Fare`
`    Survived Sex                         `
`    0        female  25.046875  23.024385`
`                     male    31.618056  21.960993`
`    1        female  28.847716  51.938573`
`                     male    27.276022  40.821484`
```
##### Group by on 'Survived' and 'Sex' columns and then get 'Age' mean:
````df.groupby(['Survived','Sex']).Age.mean()`
```

returns

````Survived  Sex   `
`0         female    25.046875`
`          male      31.618056`
`1         female    28.847716`
`          male      27.276022`
`Name: Age, dtype: float64`
```
##### Group by on 'Pclass' columns and then get 'Survived' mean (faster approach):
````df[['Pclass','Survived']].groupby(['Pclass']).mean()`
```

returns

````        Survived`
`Pclass          `
`1       0.629630`
`2       0.472826`
`3       0.242363`
```
##### Group by on 'Pclass' columns and then get 'Survived' mean (slower that previously approach):
````df.groupby(['Pclass'])[['Survived']].mean()`
```

returns

````        Survived`
`Pclass          `
`1       0.629630`
`2       0.472826`
`3       0.242363`
```
##### Group by on 'Survived' and 'Sex' and then apply describe() to age
````df.groupby(['Survived','Sex']).Age.describe()`
```

returns

````                 count       mean        std   min    25%   50%    75%   max`
`Survived Sex                                                                `
`0        female   64.0  25.046875  13.618591  2.00  16.75  24.5  33.25  57.0`
`         male    360.0  31.618056  14.056019  1.00  21.75  29.0  39.25  74.0`
`1        female  197.0  28.847716  14.175073  0.75  19.00  28.0  38.00  63.0`
`         male     93.0  27.276022  16.504803  0.42  18.00  28.0  36.00  80.0`
```
##### Group by on 'Survived' and 'Sex' and then aggregate (mean, max, min) age and fate
````df.groupby(['Survived','Sex'])[['Age','Fare']].agg(['mean','max','min'])`
```

returns

````                       Age                   Fare                 `
`                      mean   max   min       mean       max    min`
`Survived Sex                                                      `
`0        female  25.046875  57.0  2.00  23.024385  151.5500  6.750`
`         male    31.618056  74.0  1.00  21.960993  263.0000  0.000`
`1        female  28.847716  63.0  0.75  51.938573  512.3292  7.225`
`         male    27.276022  80.0  0.42  40.821484  512.3292  0.000`
```
##### Group by on Survived and get age mean
````df.groupby(['Survived']).Age.mean()`
```

returns

````Survived`
`0    30.626179`
`1    28.343690`
```
##### Group by on Survived and get fare mean
````df.groupby(['Survived']).Fare.mean()`
```

returns

````Survived`
`0    22.117887`
`1    48.395408`
`Name: Fare, dtype: float64`
```

Image

of