Introduction
The describe() function in Pandas is used to generate descriptive statistics for numerical columns in a DataFrame. It returns information such as count, mean, standard deviation, and quartiles for each column.
Let's see how we can use this function to obtain statistical information from a column of data.
Creating a pandas dataframe
First, let's create a Pandas dataframe by generating random numbers from a Gaussian distribution
import numpy as np
import pandas as pd
mu1 = 10.0
sigma1 = 2.0
data1 = np.random.randn(100000) * sigma1 + mu1
df1 = pd.DataFrame(data1, columns = ['var_1'])
df1['group'] = 'A'
print( df1 )
Our DataFrame looks like this:
var_1 group
0 7.546591 A
1 7.910452 A
2 6.794720 A
3 6.913027 A
4 9.748158 A
... ... ...
99995 10.887914 A
99996 12.074681 A
99997 7.826811 A
99998 11.396126 A
99999 10.587124 A
[100000 rows x 2 columns]
Extracting statistical information from a specific column
Using describe()
To obtain statistical information for the 'var_1' column, we can simply call the describe() function on our DataFrame:
df1['var_1'].describe()
The output will be a Series object with the following information:
count 100000.000000
mean 9.997635
std 1.996487
min 1.719631
25% 8.652156
50% 9.993236
75% 11.352676
max 18.922716
Name: var_1, dtype: float64
Please note that you can extract information from the series by utilizing the following method:
df1['var_1'].describe()['count']
returns here
100000.0
or
df1['var_1'].describe()['25%']
gives
8.65215584318412
Using describe() and groupby()
Let's incorporate additional data into our dataframe:
mu2 = 5.0
sigma2 = 1.0
data2 = np.random.randn(100000) * sigma2 + mu2
df2 = pd.DataFrame(data2, columns = ['var_1'])
df2['group'] = 'B'
df = pd.concat([df1,df2])
print(df)
Our DataFrame looks now like this:
var_1 group
0 7.546591 A
1 7.910452 A
2 6.794720 A
3 6.913027 A
4 9.748158 A
... ... ...
99995 4.884224 B
99996 6.063100 B
99997 5.438565 B
99998 3.878364 B
99999 5.046344 B
[200000 rows x 2 columns]
To perform statistics on the column var_1 while separating data belonging to groups A or B, one possible solution is to utilize the groupby function in pandas.
df.groupby('group').describe()
The code mentioned above will produce the following result.
var_1 \
count mean std min 25% 50% 75%
group
A 100000.0 9.997635 1.996487 1.719631 8.652156 9.993236 11.352676
B 100000.0 4.996937 0.998049 1.036177 4.322948 5.000255 5.670038
max
group
A 18.922716
B 9.450537
Applying the describe() function to an entire dataframe
Let's add another column to our dataframe
mu3 = 25.0
sigma3 = 10.0
data3 = np.random.randn(200000) * sigma3 + mu3
df['var_2'] = data3
print(df)
Our DataFrame looks now like this:
var_1 group var_2
0 7.546591 A 38.777234
1 7.910452 A 24.201025
2 6.794720 A 38.682980
3 6.913027 A 18.793834
4 9.748158 A 38.399589
... ... ... ...
99995 4.884224 B 19.272981
99996 6.063100 B 20.640537
99997 5.438565 B 20.761374
99998 3.878364 B 20.184042
99999 5.046344 B 19.829434
[200000 rows x 3 columns]
Note that we can use describe() function to an entire dataframe:
df.describe()
The code mentioned above will produce then the following result.
var_1 var_2
count 200000.000000 200000.000000
mean 7.497286 24.975396
std 2.956822 10.005165
min 1.036177 -20.486187
25% 4.985870 18.198013
50% 6.667710 24.970786
75% 9.993223 31.695061
max 18.922716 68.275304
References
Links | Site |
---|---|
describe | pandas.pydata.org |
mean | pandas.pydata.org |
How to merge / concatenate two DataFrames with pandas in python ? | moonbooks.org |