How to calculate Pearson’s correlation coefficient between two datasets in Python?

Examples of how to calculate Pearson’s correlation coefficient between two datasets in Python:

Create a dataset

Let's first create some data:

import numpy as np

def f(a,b,c,X):
    # linear model a * X + b with Gaussian noise scaled by c
    eps = c * np.random.randn(X.shape[0])
    return a * X + b + eps

a = 1 # slope
b = 0 # intercept
c = 1.0 # noise

X = np.random.randint(100, size=250)

Y = f(a,b,c,X)

and use matplotlib to visualize it:

import matplotlib.pyplot as plt

plt.scatter(X,Y)

plt.xlim(-10,110)

plt.title("How to calculate the Pearson’s Correlation coefficient \n between two datasets in python ?")

plt.xlabel('X')
plt.ylabel('Y')

plt.savefig("Pearson_Correlation_coefficient_01.png", bbox_inches='tight')

plt.show()

Figure: scatter plot of Y against X for the dataset created above (saved as Pearson_Correlation_coefficient_01.png).

Calculate the Pearson’s Correlation coefficient using scipy

To calculate Pearson’s correlation coefficient between the variables X and Y, one solution is to use scipy.stats.pearsonr, which returns the coefficient together with a p-value:

from scipy.stats import pearsonr

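corr, _ = pearsonr(X, Y) # the second returned value is the p-value, ignored here

print(corr)

gives, for example:

0.9434925682236153

The exact value changes from run to run because the data is random. Note that a value of about 0.94 corresponds to a noise level of roughly c = 10; with c = 1.0 as set above, the coefficient is very close to 1.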

The coefficient can be rounded to two decimal places:

round(corr,2)

which then gives

0.94
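To make clearer what pearsonr computes, here is a minimal sketch that evaluates the coefficient directly from its definition (the sum of products of deviations from the mean, divided by the square root of the product of the summed squared deviations) and compares it with scipy. The helper name pearson_manual and the seed value are assumptions made for this illustration, not part of the original example; a seeded generator is used only so the result is reproducible.

```python
import numpy as np
from scipy.stats import pearsonr

# Seeded generator so the sketch is reproducible (the seed value is arbitrary).
rng = np.random.default_rng(0)

# Same model as above: slope 1, intercept 0, noise level 1.
X = rng.integers(0, 100, size=250)
Y = 1.0 * X + 0.0 + 1.0 * rng.standard_normal(X.shape[0])

def pearson_manual(x, y):
    """Hypothetical helper: Pearson's r computed from its definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_manual = pearson_manual(X, Y)
r_scipy, p_value = pearsonr(X, Y)

print(r_manual, r_scipy, p_value)
```

Both values should agree up to floating-point error. The p-value is the second item returned by pearsonr.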

Examples of Pearson’s correlation coefficient calculations

Let's now reproduce the example from Wikipedia, varying the slope a and the noise level c:

import matplotlib.pyplot as plt
import numpy as np

from scipy.stats import pearsonr

def f(a,b,c,X):
    eps = c * np.random.randn(X.shape[0])
    return a * X + b + eps

A = [1.0,1.0,1.0,0.0,-1.0,-1.0,-1.0]
B = [0.0,0.0,0.0,0.0,0.0,0.0,0.0]
C = [1.0, 10, 20, 20, 20 ,10, 1.0]

n = 1
for a,b,c in zip(A,B,C):
    print(a,b,c)

    X = np.random.randint(100, size=250)

    Y = f(a,b,c,X)

    corr, _ = pearsonr(X, Y)

    plt.scatter(X,Y)

    plt.xlim(-10,110)

    # build the title from adjacent string literals to avoid stray indentation and blank lines
    plt.title(
        "How to calculate the Pearson’s Correlation coefficient \n"
        "between two datasets in python ? \n"
        "corrcoef = {} \n a = {} b = {} c = {}".format(round(corr,2), a, b, c) )

    plt.xlabel('X')
    plt.ylabel('Y')

    plt.savefig("Pearson_Correlation_coefficient_{}.png".format(n), bbox_inches='tight')

    plt.show()

    n += 1

gives one scatter plot per (a, b, c) combination, each titled with its correlation coefficient and saved as Pearson_Correlation_coefficient_1.png to Pearson_Correlation_coefficient_7.png.
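To compare the coefficients without opening each figure, the same loop can be run without plotting. The following is a minimal sketch under that assumption: it reuses the slopes and noise levels from the example, leaves out the intercept (which is 0 throughout), and collects the coefficients in a list. The names slopes, noises and results, and the seed value, are introduced here for illustration only.

```python
import numpy as np
from scipy.stats import pearsonr

# Seeded generator so the sketch is reproducible (the seed value is arbitrary).
rng = np.random.default_rng(42)

slopes = [1.0, 1.0, 1.0, 0.0, -1.0, -1.0, -1.0]
noises = [1.0, 10, 20, 20, 20, 10, 1.0]

results = []
for a, c in zip(slopes, noises):
    X = rng.integers(0, 100, size=250)
    Y = a * X + c * rng.standard_normal(X.shape[0])
    corr, _ = pearsonr(X, Y)
    results.append((a, c, corr))
    print("a = {:5.1f}  c = {:5.1f}  corrcoef = {:6.2f}".format(a, c, corr))
```

The sign of the coefficient follows the sign of the slope, its magnitude shrinks as the noise level grows, and a slope of 0 gives a coefficient near 0.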

Calculate the Pearson’s Correlation coefficient using numpy

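Another solution is to use numpy.corrcoef, which returns the 2x2 correlation matrix; the off-diagonal entries are Pearson’s correlation coefficient between X and Y: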

import numpy as np

np.corrcoef(X,Y)

gives

[[1.         0.94349257]
 [0.94349257 1.        ]]
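Since np.corrcoef returns a matrix, a single coefficient has to be picked out by index. The sketch below does this and checks the result against scipy.stats.pearsonr. The data is regenerated with a seeded generator and a noise level of 10, both chosen here for illustration, so the numbers will differ from those shown above.

```python
import numpy as np
from scipy.stats import pearsonr

# Seeded generator so the sketch is reproducible (the seed value is arbitrary).
rng = np.random.default_rng(1)

X = rng.integers(0, 100, size=250)
Y = X + 10.0 * rng.standard_normal(X.shape[0])

corr_matrix = np.corrcoef(X, Y)  # 2x2 symmetric matrix with ones on the diagonal
r_numpy = corr_matrix[0, 1]      # correlation between X and Y
r_scipy, _ = pearsonr(X, Y)

print(round(r_numpy, 2), round(r_scipy, 2))
```

The two libraries should agree up to floating-point error. pearsonr additionally returns a p-value, while np.corrcoef also accepts more than two variables at once.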
