# How to efficiently obtain statistical measures (such as mean, standard deviation, count, etc.) for a specific column in a pandas dataframe using describe() ?

Published: January 01, 2024

Updated: January 01, 2024

## Introduction

The describe() function in Pandas is used to generate descriptive statistics for numerical columns in a DataFrame. It returns information such as count, mean, standard deviation, and quartiles for each column.

Let's see how we can use this function to obtain statistical information from a column of data.

## Creating a pandas dataframe

First, let's create a Pandas dataframe by generating random numbers from a Gaussian distribution

````import numpy as np`
`import pandas as pd`

`mu1 = 10.0`
`sigma1 = 2.0`

`data1 = np.random.randn(100000) * sigma1 + mu1`

`df1 = pd.DataFrame(data1, columns = ['var_1'])`

`df1['group'] = 'A'`

`print( df1 )`
```

Our DataFrame looks like this:

````           var_1 group`
`0       7.546591     A`
`1       7.910452     A`
`2       6.794720     A`
`3       6.913027     A`
`4       9.748158     A`
`...          ...   ...`
`99995  10.887914     A`
`99996  12.074681     A`
`99997   7.826811     A`
`99998  11.396126     A`
`99999  10.587124     A`

`[100000 rows x 2 columns]`
```

## Extracting statistical information from a specific column

### Using describe()

To obtain statistical information for the 'var_1' column, we can simply call the describe() function on our DataFrame:

````df1['var_1'].describe()`
```

The output will be a Series object with the following information:

````count    100000.000000`
`mean          9.997635`
`std           1.996487`
`min           1.719631`
`25%           8.652156`
`50%           9.993236`
`75%          11.352676`
`max          18.922716`
`Name: var_1, dtype: float64`
```

Please note that you can extract information from the series by utilizing the following method:

````df1['var_1'].describe()['count']`
```

returns here

````100000.0`
```

or

````df1['var_1'].describe()['25%']`
```

gives

````8.65215584318412`
```

### Using describe() and groupby()

Let's incorporate additional data into our dataframe:

````mu2 = 5.0`
`sigma2 = 1.0`

`data2 = np.random.randn(100000) * sigma2 + mu2`

`df2 = pd.DataFrame(data2, columns = ['var_1'])`

`df2['group'] = 'B'`

`df = pd.concat([df1,df2])`

`print(df)`
```

Our DataFrame looks now like this:

````          var_1 group`
`0      7.546591     A`
`1      7.910452     A`
`2      6.794720     A`
`3      6.913027     A`
`4      9.748158     A`
`...         ...   ...`
`99995  4.884224     B`
`99996  6.063100     B`
`99997  5.438565     B`
`99998  3.878364     B`
`99999  5.046344     B`

`[200000 rows x 2 columns]`
```

To perform statistics on the column var_1 while separating data belonging to groups A or B, one possible solution is to utilize the groupby function in pandas.

````df.groupby('group').describe()`
```

The code mentioned above will produce the following result.

````          var_1                                                               \`
`          count      mean       std       min       25%       50%        75%   `
`group                                                                          `
`A      100000.0  9.997635  1.996487  1.719631  8.652156  9.993236  11.352676   `
`B      100000.0  4.996937  0.998049  1.036177  4.322948  5.000255   5.670038`

`             max  `
`group             `
`A      18.922716  `
`B       9.450537`
```

## Applying the describe() function to an entire dataframe

Let's add another column to our dataframe

````mu3 = 25.0`
`sigma3 = 10.0`

`data3 = np.random.randn(200000) * sigma3 + mu3`

`df['var_2'] = data3`

`print(df)`
```

Our DataFrame looks now like this:

````          var_1 group      var_2`
`0      7.546591     A  38.777234`
`1      7.910452     A  24.201025`
`2      6.794720     A  38.682980`
`3      6.913027     A  18.793834`
`4      9.748158     A  38.399589`
`...         ...   ...        ...`
`99995  4.884224     B  19.272981`
`99996  6.063100     B  20.640537`
`99997  5.438565     B  20.761374`
`99998  3.878364     B  20.184042`
`99999  5.046344     B  19.829434`

`[200000 rows x 3 columns]`
```

Note that we can use describe() function to an entire dataframe:

````df.describe()`
```

The code mentioned above will produce then the following result.

````               var_1          var_2`
`count  200000.000000  200000.000000`
`mean        7.497286      24.975396`
`std         2.956822      10.005165`
`min         1.036177     -20.486187`
`25%         4.985870      18.198013`
`50%         6.667710      24.970786`
`75%         9.993223      31.695061`
`max        18.922716      68.275304`
```