Introduction
The describe() method in pandas generates descriptive statistics for the columns of a DataFrame or for a single Series. For numeric data it returns the count, mean, standard deviation, minimum, maximum, and quartiles of each column.
Let's see how we can use this function to obtain statistical information from a column of data.
Creating a pandas dataframe
First, let's create a pandas DataFrame by generating random numbers from a Gaussian distribution:
import numpy as np
import pandas as pd

mu1 = 10.0
sigma1 = 2.0

data1 = np.random.randn(100000) * sigma1 + mu1

df1 = pd.DataFrame(data1, columns=['var_1'])
df1['group'] = 'A'

print(df1)
Our DataFrame looks like this:
           var_1 group
0       7.546591     A
1       7.910452     A
2       6.794720     A
3       6.913027     A
4       9.748158     A
...          ...   ...
99995  10.887914     A
99996  12.074681     A
99997   7.826811     A
99998  11.396126     A
99999  10.587124     A

[100000 rows x 2 columns]
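The code above does not fix a random seed, so the values you obtain will differ from the ones printed in this article. If repeatable numbers are wanted, one possible approach is NumPy's Generator API. This is only a sketch: the seed value 42 and the name df1_seeded are arbitrary choices made for illustration, not part of the original example.

```python
import numpy as np
import pandas as pd

# Assumption: the seed (42) is arbitrary; any fixed integer makes the draw repeatable.
rng = np.random.default_rng(42)

mu1, sigma1 = 10.0, 2.0

# Draw 100000 samples from a Gaussian with mean mu1 and standard deviation sigma1.
data1 = rng.normal(loc=mu1, scale=sigma1, size=100_000)

# Hypothetical name, used so the article's df1 is not overwritten.
df1_seeded = pd.DataFrame({"var_1": data1, "group": "A"})

print(df1_seeded.head())
```

Running the snippet twice with the same seed yields the same DataFrame, which makes the describe() output easier to compare between runs.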
Extracting statistical information from a specific column
Using describe()
To obtain statistical information for the 'var_1' column, select that column and call describe() on it:
df1['var_1'].describe()
The output will be a Series object with the following information:
count    100000.000000
mean          9.997635
std           1.996487
min           1.719631
25%           8.652156
50%           9.993236
75%          11.352676
max          18.922716
Name: var_1, dtype: float64
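To make the meaning of each row concrete, the sketch below rebuilds the same summary by hand from individual Series methods and compares it with describe(). It uses a small seeded sample rather than the article's data; the names s, summary, and manual are illustrative only.

```python
import numpy as np
import pandas as pd

# Small illustrative sample (seed and size are arbitrary assumptions).
rng = np.random.default_rng(0)
s = pd.Series(rng.normal(10.0, 2.0, size=1_000), name="var_1")

summary = s.describe()

# The same statistics, computed one at a time.
manual = pd.Series(
    {
        "count": float(s.count()),   # number of non-missing values
        "mean": s.mean(),
        "std": s.std(),              # sample standard deviation (ddof=1)
        "min": s.min(),
        "25%": s.quantile(0.25),
        "50%": s.quantile(0.50),     # the median
        "75%": s.quantile(0.75),
        "max": s.max(),
    },
    name="var_1",
)

# Compare the two summaries value by value.
print(np.allclose(summary.to_numpy(), manual.to_numpy()))
```

Note that std is the sample standard deviation (normalized by N-1), which is why it can differ slightly from np.std with its default settings.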
Note that individual statistics can be extracted from the returned Series by indexing it with the corresponding label:
df1['var_1'].describe()['count']
which here returns
100000.0
or
df1['var_1'].describe()['25%']
gives
8.65215584318412
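Calling describe() once and storing the result avoids recomputing the statistics for every lookup. The sketch below also shows the percentiles parameter of describe(), which selects the quantiles to report. The sample data and the names stats and custom are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Small illustrative sample (seed and size are arbitrary assumptions).
rng = np.random.default_rng(0)
s = pd.Series(rng.normal(10.0, 2.0, size=1_000), name="var_1")

# Compute the summary once and reuse it.
stats = s.describe()

# Several statistics at once, selected by label.
print(stats[["mean", "std"]])

# A single statistic; the '50%' row is the median.
print(stats.loc["50%"])

# Request other quantiles (0.5 is listed explicitly to keep the median).
custom = s.describe(percentiles=[0.05, 0.5, 0.95])
print(custom.index.tolist())
```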
Using describe() and groupby()
Let's incorporate additional data into our dataframe:
mu2 = 5.0
sigma2 = 1.0

data2 = np.random.randn(100000) * sigma2 + mu2

df2 = pd.DataFrame(data2, columns=['var_1'])
df2['group'] = 'B'

df = pd.concat([df1, df2])

print(df)
Our DataFrame now looks like this:
           var_1 group
0       7.546591     A
1       7.910452     A
2       6.794720     A
3       6.913027     A
4       9.748158     A
...          ...   ...
99995   4.884224     B
99996   6.063100     B
99997   5.438565     B
99998   3.878364     B
99999   5.046344     B

[200000 rows x 2 columns]
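As the printout shows, pd.concat keeps the row labels of each input, so the labels 0 to 99999 appear twice in the combined DataFrame. This does not affect describe(), but if unique row labels are preferred, passing ignore_index=True is one option. The sketch uses two tiny hypothetical frames (a and b) so the effect is easy to see.

```python
import pandas as pd

# Two tiny hypothetical frames standing in for df1 and df2.
a = pd.DataFrame({"var_1": [1.0, 2.0], "group": "A"})
b = pd.DataFrame({"var_1": [3.0, 4.0], "group": "B"})

# Default behaviour: each frame keeps its own row labels.
kept = pd.concat([a, b])
print(kept.index.tolist())    # [0, 1, 0, 1]

# ignore_index=True renumbers the rows from 0 to n-1.
fresh = pd.concat([a, b], ignore_index=True)
print(fresh.index.tolist())   # [0, 1, 2, 3]
```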
To compute statistics on the column var_1 separately for groups A and B, one possible solution is to combine the pandas groupby() function with describe():
df.groupby('group').describe()
The code above produces the following result:
          var_1                                                               \
          count      mean       std       min       25%       50%        75%
group
A      100000.0  9.997635  1.996487  1.719631  8.652156  9.993236  11.352676
B      100000.0  4.996937  0.998049  1.036177  4.322948  5.000255   5.670038

             max
group
A      18.922716
B       9.450537
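The grouped result has two levels of column labels (the column name, then the statistic). The sketch below shows a few ways to pull values out of it. It uses a small seeded stand-in named df_demo rather than the article's df, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Small stand-in for the article's df (seed and sizes are arbitrary assumptions).
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    "var_1": np.concatenate([rng.normal(10.0, 2.0, 1_000),
                             rng.normal(5.0, 1.0, 1_000)]),
    "group": ["A"] * 1_000 + ["B"] * 1_000,
})

# Describing every column per group gives two-level column labels,
# so a statistic is addressed with a (column, statistic) tuple.
full = df_demo.groupby("group").describe()
print(full[("var_1", "mean")])

# Selecting the column first gives a flat table: one row per group.
per_group = df_demo.groupby("group")["var_1"].describe()
print(per_group.loc["B", "mean"])

# When only a few statistics are needed, agg() computes just those.
print(df_demo.groupby("group")["var_1"].agg(["mean", "std"]))
```

Selecting the column before calling describe() is often the more convenient form when only one variable is of interest, since the result can be indexed with plain row and column labels.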
Applying the describe() function to an entire dataframe
Let's add another column to our DataFrame:
mu3 = 25.0
sigma3 = 10.0

data3 = np.random.randn(200000) * sigma3 + mu3

df['var_2'] = data3

print(df)
Our DataFrame now looks like this:
           var_1 group      var_2
0       7.546591     A  38.777234
1       7.910452     A  24.201025
2       6.794720     A  38.682980
3       6.913027     A  18.793834
4       9.748158     A  38.399589
...          ...   ...        ...
99995   4.884224     B  19.272981
99996   6.063100     B  20.640537
99997   5.438565     B  20.761374
99998   3.878364     B  20.184042
99999   5.046344     B  19.829434

[200000 rows x 3 columns]
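Note that describe() can also be applied to an entire DataFrame. By default only the numeric columns are summarized, so the group column does not appear in the result: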
df.describe()
The code above produces the following result:
               var_1          var_2
count  200000.000000  200000.000000
mean        7.497286      24.975396
std         2.956822      10.005165
min         1.036177     -20.486187
25%         4.985870      18.198013
50%         6.667710      24.970786
75%         9.993223      31.695061
max        18.922716      68.275304
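The group column is missing from this output because it holds text rather than numbers. To summarize it as well, describe() accepts include='all'. The sketch below uses a six-row stand-in named df_demo so the result is easy to check by eye; the data and names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Six-row stand-in for the article's df (seed and values are arbitrary assumptions).
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    "var_1": rng.normal(10.0, 2.0, 6),
    "group": ["A", "A", "A", "A", "B", "B"],
    "var_2": rng.normal(25.0, 10.0, 6),
})

# Default: numeric columns only.
numeric_only = df_demo.describe()
print(numeric_only.columns.tolist())

# include='all' adds the text column, described by the number of distinct
# values (unique), the most frequent value (top), and its frequency (freq).
everything = df_demo.describe(include="all")
print(everything["group"])
```

In the combined table, statistics that do not apply to a column (for example mean for group, or unique for var_1) are filled with NaN.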
