Kernel Density Estimation (KDE) is a non-parametric method of estimating probability density functions from sample data. The basic idea is to construct a smooth function that can be used to approximate the underlying probability distribution, based on a given set of observations in two dimensions.
Create synthetic data
Using Python, it is fairly straightforward to calculate and plot a 2D KDE. The first step is to import the necessary modules, including numpy, scipy and matplotlib. Next, get your data ready for the calculation - it should be in the form of an array or list of two-dimensional points.
import numpy as np
mu = 5.0
sigma = 4.0
x = np.random.randn(10000) * sigma + mu
mu = 5.0
sigma = 10.0
y = np.random.randn(10000) * sigma + mu
Calculate a 2D kernel density estimation
Once you have your data ready, you can use the scipy.stats.gaussian_kde() function to calculate the KDE. This function returns an object that contains all the information you need to plot the result.
from scipy import stats, mgrid, c_
data = np.stack((x,y), axis=1)
data.shape
xmin = data.min()
xmax = data.max()
ymin = data.min()
ymax = data.max()
kernel = stats.gaussian_kde(data.T)
Plot result using matplotlib
To plot, you can use either matplotlib or seaborn, depending on how complex of a graph you want. With matplotlib, it is fairly simple to plot a basic 2D KDE, using the contourf() or imshow() functions. Seaborn offers more options for customization and allows you to create more complex plots with the kdeplot() function:
Using imshow()
import matplotlib.pyplot as plt
import matplotlib.cm as cm
fig = plt.figure(figsize=(10,4))
ax = fig.add_subplot(111)
X, Y = mgrid[xmin:xmax:100j, ymin:ymax:100j]
positions = c_[X.ravel(), Y.ravel()]
Z = np.reshape(kernel(positions.T).T, X.T.shape)
cax = ax.imshow(Z.T, interpolation='nearest',
cmap=cm.jet, origin='lower', extent=[xmin, xmax, ymin, ymax])
plt.xlabel('x',fontsize=14)
plt.ylabel('y',fontsize=14)
plt.xlim(xmin,xmax)
plt.ylim(ymin,ymax)
ticks_at = [0, Z.max()/6.0, 2.0*Z.max()/6.0, 3.0*Z.max()/6.0, 4.0*Z.max()/6.0, 5.0*Z.max()/6.0, Z.max()]
cbar = fig.colorbar(cax,ticks=ticks_at,format='%1.2g')
cbar.set_label('Kernel density function')
plt.savefig("kernel2D.png", dpi=100, bbox_inches='tight')
plt.show()
Using contourf()
fig = plt.figure(figsize=(10,4))
ax = fig.gca()
ax.set_xlim(xmin, xmax)
ax.set_ylim(ymin, ymax)
cfset = ax.contourf(X, Y, Z, cmap='Blues')
cset = ax.contour(X, Y, Z, colors='k')
ax.clabel(cset, inline=1, fontsize=10)
ax.set_xlabel('X')
ax.set_ylabel('Y')
plt.savefig("kernel2D_contourf.png", dpi=100, bbox_inches='tight')
plt.show()
References
Links | Site |
---|---|
scipy.stats.gaussian_kde | docs.scipy.org |
How to generate random numbers from a normal (Gaussian) distribution in python ? | www.moonbooks.org |
numpy.random.randn | numpy.org |
numpy.stack | numpy.org |