How to create a distance matrix from two NumPy arrays of points?


Introduction

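A distance matrix contains the distances between all pairs of points. When it is computed within a single dataset, it is a square matrix: each row and column represents a point, and the value at their intersection is the distance between those two points. When it is computed between two sets of m and n points, as in this article, it has shape (m, n): row i corresponds to the i-th point of the first set and column j to the j-th point of the second set. Distance matrices are commonly used in clustering algorithms, such as k-means (which relies on the distances between points and centroids), and in dimensionality reduction techniques like multidimensional scaling.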

In this article, we will explore how to create a distance matrix in Python from two numpy arrays.

Calculating the distance matrix

Creating two arrays of points

To demonstrate the calculation of a distance matrix, let's generate two arrays containing random points:

import numpy as np

np.random.seed(42)

pts_1 = np.random.randint(0, 100, (5, 2))
pts_2 = np.random.randint(0, 100, (5, 2))

print(pts_1)
print(pts_2)

The above code snippet uses the NumPy library to generate two arrays, pts_1 and pts_2, each containing five 2D points with random integer coordinates. Because the upper bound of np.random.randint is exclusive, the coordinates range from 0 to 99. The seed value of 42 makes the generated random numbers reproducible.

Here are the example arrays produced by the code:

array([[51, 92],
       [14, 71],
       [60, 20],
       [82, 86],
       [74, 74]])

and

array([[87, 99],
       [23,  2],
       [21, 52],
       [ 1, 87],
       [29, 37]])

In mathematics, there are various methods to define distance. The most commonly used one is the Euclidean distance, which will be utilized in the following example. Similarly, in Python, there are multiple approaches to generate a distance matrix, depending on the specific problems and inputs (see Distance computations (scipy.spatial.distance)). Let's begin with the simplest scenario:

Using the cdist() function from SciPy

To calculate the distance matrix between these two arrays, we can use the cdist() function from SciPy. This function takes in two arrays as arguments and calculates the distances between all pairs of points in the arrays. The returned distance matrix will be a 2D array with shape (m,n) where m and n are the number of points in the two arrays respectively:

from scipy.spatial import distance

d = distance.cdist(pts_1, pts_2, 'euclidean')

print(np.round(d, 2))

The output of this code is a 5x5 array in which the element at row i and column j is the Euclidean distance between pts_1[i] and pts_2[j]. Here is the output, rounded to two decimals:

array([[ 36.67,  94.25,  50.  ,  50.25,  59.24],
       [ 78.19,  69.58,  20.25,  20.62,  37.16],
       [ 83.49,  41.15,  50.45,  89.27,  35.36],
       [ 13.93, 102.65,  69.84,  81.01,  72.18],
       [ 28.18,  88.23,  57.38,  74.15,  58.26]])
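
To make explicit what the Euclidean metric computes, the same matrix can be rebuilt with plain NumPy broadcasting. This is only a minimal sketch for illustration, not a replacement for cdist(): the points are hard-coded from the example above so that the snippet runs on its own, and the names diff and d_manual are introduced here and do not come from SciPy.

```python
import numpy as np

# Same points as in the example above, hard-coded so the snippet is self-contained.
pts_1 = np.array([[51, 92], [14, 71], [60, 20], [82, 86], [74, 74]])
pts_2 = np.array([[87, 99], [23, 2], [21, 52], [1, 87], [29, 37]])

# diff[i, j] holds the coordinate difference pts_1[i] - pts_2[j]; shape (5, 5, 2).
diff = pts_1[:, np.newaxis, :] - pts_2[np.newaxis, :, :]

# Euclidean distance: square root of the sum of squared coordinate differences.
d_manual = np.sqrt((diff ** 2).sum(axis=-1))

print(np.round(d_manual, 2))
```

Broadcasting builds an intermediate array of shape (m, n, k) for k-dimensional points, so for large inputs cdist() is likely the more practical choice.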

Extra functionalities

Using matplotlib to create a visual representation of the distance matrix

Matplotlib's imshow() function takes the distance matrix as input and displays it as a color-coded image, where each cell's color corresponds to the distance between a point in pts_1 (row) and a point in pts_2 (column). This gives a quick overview of the distances between all pairs of points.

import matplotlib.pyplot as plt

plt.imshow(d)

plt.colorbar()

plt.title('How to create a distance matrix \n from two numpy arrays ?', fontsize=12)

plt.savefig('distance_matrix_01.png', dpi=100, bbox_inches='tight')

plt.show()


You can also use other libraries such as "seaborn" or "plotly" to create more visually appealing and interactive visualizations of the distance matrix.

Determining the minimum values and their respective indexes

When working with distance matrices, one frequently encountered task is to identify the minimum value and its corresponding index for each row in the matrix. To accomplish this, one can utilize the numpy functions argmin and min.

With axis=1, the argmin function returns, for each row, the column index of the minimum value,

np.argmin(d, axis=1)

array([0, 2, 4, 0, 0])

while min returns the actual minimum value:

np.min(d, axis=1)

array([36.67424164, 20.24845673, 35.35533906, 13.92838828, 28.17800561])

Combining these two functions allows for efficient and straightforward identification of the minimum value and its corresponding index.
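
As a rough sketch of how the two results can be combined, the indexes returned by argmin can be used to look up both the nearest point in pts_2 and its distance for every point in pts_1. The points are hard-coded from the example above so that the snippet runs on its own; the names nearest_idx, nearest_dist and nearest_pts are illustrative only.

```python
import numpy as np
from scipy.spatial import distance

# Same points as in the example above, hard-coded so the snippet is self-contained.
pts_1 = np.array([[51, 92], [14, 71], [60, 20], [82, 86], [74, 74]])
pts_2 = np.array([[87, 99], [23, 2], [21, 52], [1, 87], [29, 37]])

d = distance.cdist(pts_1, pts_2, 'euclidean')

# For each row (point in pts_1), the column index of the closest point in pts_2.
nearest_idx = np.argmin(d, axis=1)

# Pick the matching distances with integer indexing; equivalent to np.min(d, axis=1).
nearest_dist = d[np.arange(d.shape[0]), nearest_idx]

# The coordinates of the closest points themselves.
nearest_pts = pts_2[nearest_idx]

for p, q, dist in zip(pts_1, nearest_pts, nearest_dist):
    print(p, '->', q, round(float(dist), 2))
```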
