How to efficiently obtain statistical measures (such as mean, standard deviation, count, etc.) for a specific column in a pandas dataframe using describe() ?

Published: January 01, 2024

Updated: January 01, 2024

Tags: Python; Pandas; Dataframe;

DMCA.com Protection Status

Introduction

The describe() function in Pandas is used to generate descriptive statistics for numerical columns in a DataFrame. It returns information such as count, mean, standard deviation, and quartiles for each column.

Let's see how we can use this function to obtain statistical information from a column of data.

Creating a pandas dataframe

First, let's create a Pandas dataframe by generating random numbers from a Gaussian distribution

import numpy as np
import pandas as pd

mu1 = 10.0
sigma1 = 2.0

data1 = np.random.randn(100000) * sigma1 + mu1

df1 = pd.DataFrame(data1, columns = ['var_1'])

df1['group'] = 'A'

print( df1 )

Our DataFrame looks like this:

           var_1 group
0       7.546591     A
1       7.910452     A
2       6.794720     A
3       6.913027     A
4       9.748158     A
...          ...   ...
99995  10.887914     A
99996  12.074681     A
99997   7.826811     A
99998  11.396126     A
99999  10.587124     A

[100000 rows x 2 columns]

Extracting statistical information from a specific column

Using describe()

To obtain statistical information for the 'var_1' column, we can simply call the describe() function on our DataFrame:

df1['var_1'].describe()

The output will be a Series object with the following information:

count    100000.000000
mean          9.997635
std           1.996487
min           1.719631
25%           8.652156
50%           9.993236
75%          11.352676
max          18.922716
Name: var_1, dtype: float64

Please note that you can extract information from the series by utilizing the following method:

df1['var_1'].describe()['count']

returns here

100000.0

or

df1['var_1'].describe()['25%']

gives

8.65215584318412

Using describe() and groupby()

Let's incorporate additional data into our dataframe:

mu2 = 5.0
sigma2 = 1.0

data2 = np.random.randn(100000) * sigma2 + mu2

df2 = pd.DataFrame(data2, columns = ['var_1'])

df2['group'] = 'B'

df = pd.concat([df1,df2])

print(df)

Our DataFrame looks now like this:

          var_1 group
0      7.546591     A
1      7.910452     A
2      6.794720     A
3      6.913027     A
4      9.748158     A
...         ...   ...
99995  4.884224     B
99996  6.063100     B
99997  5.438565     B
99998  3.878364     B
99999  5.046344     B

[200000 rows x 2 columns]

To perform statistics on the column var_1 while separating data belonging to groups A or B, one possible solution is to utilize the groupby function in pandas.

df.groupby('group').describe()

The code mentioned above will produce the following result.

          var_1                                                               \
          count      mean       std       min       25%       50%        75%   
group                                                                          
A      100000.0  9.997635  1.996487  1.719631  8.652156  9.993236  11.352676   
B      100000.0  4.996937  0.998049  1.036177  4.322948  5.000255   5.670038


             max  
group             
A      18.922716  
B       9.450537

Applying the describe() function to an entire dataframe

Let's add another column to our dataframe

mu3 = 25.0
sigma3 = 10.0

data3 = np.random.randn(200000) * sigma3 + mu3

df['var_2'] = data3

print(df)

Our DataFrame looks now like this:

          var_1 group      var_2
0      7.546591     A  38.777234
1      7.910452     A  24.201025
2      6.794720     A  38.682980
3      6.913027     A  18.793834
4      9.748158     A  38.399589
...         ...   ...        ...
99995  4.884224     B  19.272981
99996  6.063100     B  20.640537
99997  5.438565     B  20.761374
99998  3.878364     B  20.184042
99999  5.046344     B  19.829434

[200000 rows x 3 columns]

Note that we can use describe() function to an entire dataframe:

df.describe()

The code mentioned above will produce then the following result.

               var_1          var_2
count  200000.000000  200000.000000
mean        7.497286      24.975396
std         2.956822      10.005165
min         1.036177     -20.486187
25%         4.985870      18.198013
50%         6.667710      24.970786
75%         9.993223      31.695061
max        18.922716      68.275304

References

Links Site
describe pandas.pydata.org
mean pandas.pydata.org
How to merge / concatenate two DataFrames with pandas in python ? moonbooks.org