A dendrogram is a tree-like diagram used to visualize the arrangement of clusters created by hierarchical clustering. It shows how individual data points (or clusters) are merged step by step based on their similarity or distance. The height of each link on the distance axis is the distance at which two clusters are combined, so lower merges indicate greater similarity. Dendrograms are commonly used in data analysis to explore the underlying structure of a dataset and to help choose a suitable number of clusters.
```python
# Sample data (for example, 5 data points with 2 features each)
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
```
Perform hierarchical clustering
Use the linkage function to compute the hierarchical clustering. You can choose a method such as ward, single, complete, or average. The full code below uses the ward method.
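To make the output of linkage more concrete, the following sketch runs the four methods on the five sample points and inspects the resulting linkage matrix. It assumes the same sample array X as in this article; the loop, the results dictionary, and the printed values are illustrative additions rather than part of the original walkthrough.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Same five sample points used in this article
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])

results = {}
for method in ("ward", "single", "complete", "average"):
    # Each row of Z describes one merge:
    # [cluster index a, cluster index b, merge distance, size of the new cluster]
    Z = linkage(X, method=method)
    results[method] = Z
    print(method, "-> final merge distance:", round(float(Z[-1, 2]), 3))

Z_ward = results["ward"]
print(Z_ward.shape)  # one row per merge, i.e. (n_points - 1, 4)
```

The methods agree on the first merge here (the two closest points) but generally differ in how far apart later merges are placed, which is what changes the shape of the dendrogram.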
Passing the linkage result to the dendrogram function and calling plt.show() displays the hierarchical clustering of your dataset, showing how data points are grouped at various distances.
Full Code
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Sample data (for example, 5 data points with 2 features each)
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])

# Perform hierarchical clustering
Z = linkage(X, method='ward')

# Plot dendrogram
plt.figure(figsize=(10, 7))
dendrogram(Z)
plt.title("Dendrogram")
plt.xlabel("Data points")
plt.ylabel("Euclidean distances")
plt.show()
```
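Once the tree is built, flat cluster labels can be obtained by cutting it. The sketch below uses SciPy's fcluster for this, either by asking for a maximum number of clusters or by cutting at a fixed height. The two-group array X_demo is a hypothetical data set chosen so that the split is obvious; it is not data from this article.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data: three points near (1, 2) and two points near (8, 9)
X_demo = np.array([[1.0, 2.0], [1.5, 2.5], [2.0, 3.0], [8.0, 9.0], [9.0, 10.0]])

Z_demo = linkage(X_demo, method="ward")

# Cut the tree so that at most 2 flat clusters remain
labels = fcluster(Z_demo, t=2, criterion="maxclust")
print(labels)

# Alternatively, cut at a fixed height on the dendrogram's distance axis
labels_by_height = fcluster(Z_demo, t=3.0, criterion="distance")
print(labels_by_height)
```

Reading the cut height off the dendrogram (a large vertical gap between successive merges) is one common heuristic for choosing t, though it remains a judgment call.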
Customizing the Dendrogram
Here's a step-by-step explanation of the Python code used in this section (the complete listing follows the summary), along with suggestions for customizing the dendrogram:
1. Import Libraries:
scipy.cluster.hierarchy.dendrogram: Used to generate the dendrogram plot.
scipy.cluster.hierarchy.linkage: Performs the hierarchical clustering.
matplotlib.pyplot: Used for plotting.
numpy: Generates random data points for clustering.
2. Custom Dendrogram Function:
```python
def custom_dendrogram(*args, **kwargs):
    dendro_data = dendrogram(*args, **kwargs)
    if not kwargs.get('no_plot', False):
        for icoord, dcoord in zip(dendro_data['icoord'], dendro_data['dcoord']):
            x_coord = 0.5 * sum(icoord[1:3])  # Midpoint of the cluster
            height = dcoord[1]  # Distance at which the clusters merge
            plt.plot(x_coord, height, 'ro')  # Plot a red dot at the merge point
            plt.annotate(f"{height:.3g}", (x_coord, height), xytext=(0, -8),
                         textcoords='offset points', va='top', ha='center')  # Annotate height
    return dendro_data
```
This function extends the default dendrogram by adding red dots and annotated cluster heights to each merge point, enhancing the dendrogram's readability.
The annotations show the exact distances at which clusters merge, providing more detail on the hierarchical structure.
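The function relies on the icoord and dcoord lists that dendrogram returns. As a rough illustration of what they contain, the sketch below calls dendrogram with no_plot=True on the five sample points and collects the same midpoint and height values without drawing anything. The variable names info and merge_points are assumptions made for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
Z = linkage(X, method="ward")

# no_plot=True returns the plotting coordinates without drawing anything
info = dendrogram(Z, no_plot=True)

merge_points = []
for icoord, dcoord in zip(info["icoord"], info["dcoord"]):
    # Each link is a "U" shape described by four corners:
    # bottom-left, top-left, top-right, bottom-right
    x_mid = 0.5 * (icoord[1] + icoord[2])  # horizontal centre of the top bar
    height = dcoord[1]                     # merge distance (equal to dcoord[2])
    merge_points.append((x_mid, height))

print(merge_points)
```

The collected heights are the merge distances stored in the third column of the linkage matrix, which is why annotating dcoord[1] labels each merge with its distance.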
3. Generate Random Data:
The code generates 100 random 2D points from a multivariate normal distribution with a specified mean and covariance matrix.
This data serves as the input for hierarchical clustering.
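As a sketch of this step: the full listing draws the points with np.random.multivariate_normal. One detail worth checking is that the covariance matrix in the listing has a negative determinant (4.0 * 1.4 - 2.5**2 = -0.65), so it is not positive semidefinite and NumPy may emit a warning when sampling from it. The replacement matrix cov_valid and the use of default_rng below are assumptions made for illustration, not the author's values.

```python
import numpy as np

mean = [0, 0]
cov_listing = np.array([[4.0, 2.5], [2.5, 1.4]])  # matrix used in the full listing

# A valid covariance matrix has no negative eigenvalues
eigenvalues = np.linalg.eigvalsh(cov_listing)
print(eigenvalues)  # one eigenvalue is slightly negative

# Assumption for illustration only: raise the second variance so that the
# matrix becomes positive definite (4.0 * 2.0 - 2.5**2 > 0)
cov_valid = np.array([[4.0, 2.5], [2.5, 2.0]])

rng = np.random.default_rng(12312)  # seeded generator for reproducibility
data = rng.multivariate_normal(mean, cov_valid, size=100)
print(data.shape)  # (100, 2)
```

Changing the covariance changes the generated points, so plots produced this way will not match the figures from the original listing.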
4. Scatter Plot of Data:
```python
plt.figure(figsize=(6, 5))
plt.scatter(data[:, 0], data[:, 1])
plt.title("Scatter Plot of Data Points")
plt.axis('equal')
plt.grid(True)
plt.savefig('scatter_plot.png')
plt.show()
```
A scatter plot is created to visualize the generated data points before performing the clustering.
This helps understand the distribution and structure of the data.
5. Hierarchical Clustering:
```python
linkage_matrix = linkage(data, method="single")
```
The hierarchical clustering is performed using the "single" linkage method, which merges clusters based on the minimum distance between points in different clusters.
The result is stored in the linkage_matrix, which is used to create the dendrograms.
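The "minimum distance" rule can be checked numerically. The sketch below, using hypothetical random points rather than the article's data, compares merge heights with the pairwise distances from pdist: the first single-linkage merge joins the two closest points, while complete linkage (which uses the maximum distance between clusters) finishes at the largest pairwise distance.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
points = rng.normal(size=(20, 2))  # hypothetical stand-in for the article's data

Z_single = linkage(points, method="single")
Z_complete = linkage(points, method="complete")

pairwise = pdist(points)  # all pairwise Euclidean distances

# Single linkage: the first merge joins the two closest points
print(np.isclose(Z_single[0, 2], pairwise.min()))

# Complete linkage: the final merge happens at the largest pairwise distance
print(np.isclose(Z_complete[-1, 2], pairwise.max()))
```

Because single linkage only looks at the closest pair between clusters, it tends to produce long chains, which is one reason its dendrograms can look quite different from ward or complete linkage on the same data.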
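Two dendrograms are then plotted from linkage_matrix with custom_dendrogram, both truncated with p=6 and truncate_mode='lastp'. The second dendrogram is similar to the first but sets show_leaf_counts=True, so each collapsed leaf is labeled with the number of data points it contains.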
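The effect of truncate_mode and show_leaf_counts can be inspected without plotting by looking at the leaf labels (the 'ivl' entry) that dendrogram returns. This is a small sketch on hypothetical random data, assuming the same p=6 and 'lastp' settings as the listing; the sizes list is an illustrative helper.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(1)
points = rng.normal(size=(100, 2))  # hypothetical stand-in data
Z = linkage(points, method="single")

# Truncate the tree to its last merges; no_plot=True skips drawing
with_counts = dendrogram(Z, p=6, truncate_mode="lastp",
                         show_leaf_counts=True, no_plot=True)
without_counts = dendrogram(Z, p=6, truncate_mode="lastp",
                            show_leaf_counts=False, no_plot=True)

print(with_counts["ivl"])     # collapsed clusters appear as "(count)"
print(without_counts["ivl"])  # collapsed clusters get an empty label

# Labels in parentheses are cluster sizes; plain labels are single points
sizes = [int(lbl[1:-1]) if lbl.startswith("(") else 1
         for lbl in with_counts["ivl"]]
print(sum(sizes))  # every original point is accounted for
```

Leaf counts are useful with truncated trees because a single leaf may stand for one outlier or for most of the data set, and the plot alone does not show which.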
Summary:
Custom Dendrograms: The custom_dendrogram function enhances the standard dendrogram by adding visual markers and annotations.
Scatter Plot: Displays the randomly generated 2D data points.
Hierarchical Clustering: Uses single linkage to create a clustering hierarchy, which is visualized in two dendrograms—one without and one with leaf counts.
```python
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt
import numpy as np


# Custom function for generating a dendrogram with distance annotations
def custom_dendrogram(*args, **kwargs):
    # Create the standard dendrogram
    dendro_data = dendrogram(*args, **kwargs)

    # Add annotations for cluster heights if no_plot is False
    if not kwargs.get('no_plot', False):
        # Loop through the clusters to add custom red dots and distance annotations
        for icoord, dcoord in zip(dendro_data['icoord'], dendro_data['dcoord']):
            x_coord = 0.5 * sum(icoord[1:3])  # Find the midpoint of the cluster
            height = dcoord[1]  # Distance (height) at which the clusters are merged
            plt.plot(x_coord, height, 'ro')  # Plot a red dot at the merge point
            plt.annotate(f"{height:.3g}", (x_coord, height), xytext=(0, -8),
                         textcoords='offset points', va='top', ha='center')  # Annotate the height

    return dendro_data


# Generate random 2D data points for hierarchical clustering
np.random.seed(12312)  # Set seed for reproducibility
num_points = 100  # Number of points
data = np.random.multivariate_normal([0, 0], np.array([[4.0, 2.5], [2.5, 1.4]]), size=num_points)

# Scatter plot of the generated data points
plt.figure(figsize=(6, 5))
plt.scatter(data[:, 0], data[:, 1])
plt.title("Scatter Plot of Data Points")
plt.axis('equal')  # Ensure equal scaling on both axes
plt.grid(True)
plt.savefig('scatter_plot.png')
plt.show()

# Perform hierarchical clustering using the 'single' linkage method
linkage_matrix = linkage(data, method="single")

# Plot the first dendrogram (without leaf counts)
plt.figure(figsize=(10, 4))
dendro_data = custom_dendrogram(linkage_matrix, color_threshold=1, p=6,
                                truncate_mode='lastp', show_leaf_counts=False)
plt.title("Dendrogram (Without Leaf Counts)")
plt.xlabel("Cluster Index")
plt.ylabel("Distance")
plt.savefig('dendrogram_without_leaf_counts.png')
plt.show()

# Plot the second dendrogram (with leaf counts)
plt.figure(figsize=(10, 4))
dendro_data = custom_dendrogram(linkage_matrix, color_threshold=1, p=6,
                                truncate_mode='lastp', show_leaf_counts=True)
plt.title("Dendrogram (With Leaf Counts)")
plt.xlabel("Cluster Index")
plt.ylabel("Distance")
plt.savefig('dendrogram_with_leaf_counts.png')
plt.show()
```
Visualizing Hierarchical Clustering with Overlaid Dendrograms on a Distance Matrix
```python
import numpy as np  # NumPy for random number generation
import pylab  # Pylab (matplotlib's combined plotting namespace)
import scipy.cluster.hierarchy as sch  # Hierarchical clustering methods from SciPy
from scipy.spatial.distance import squareform  # Square <-> condensed distance conversion

# Generate random features (1D array of 40 elements) and initialize a distance matrix.
x = np.random.rand(40)  # Random array of 40 elements
D = np.zeros([40, 40])  # 40x40 zero matrix to store distances

# Populate the distance matrix with the absolute differences between the elements.
for i in range(40):
    for j in range(40):
        D[i, j] = abs(x[i] - x[j])

# linkage expects a condensed distance vector; a square matrix passed directly
# would be interpreted as 40 observations with 40 features each.
condensed_D = squareform(D)

# Create the figure and plot the first dendrogram.
fig = pylab.figure(figsize=(8, 8))  # 8x8 inch figure
ax1 = fig.add_axes([0.09, 0.1, 0.2, 0.6])  # Axes for the first dendrogram
Y = sch.linkage(condensed_D, method='centroid')  # Centroid linkage
Z1 = sch.dendrogram(Y, orientation='right')  # Dendrogram with 'right' orientation
ax1.set_xticks([])  # Remove x-axis ticks
ax1.set_yticks([])  # Remove y-axis ticks

# Compute and plot the second dendrogram.
ax2 = fig.add_axes([0.3, 0.71, 0.6, 0.2])  # Axes for the second dendrogram
Y = sch.linkage(condensed_D, method='single')  # Single linkage
Z2 = sch.dendrogram(Y)  # Dendrogram with the default orientation
ax2.set_xticks([])
ax2.set_yticks([])

# Reorder and plot the distance matrix according to the dendrograms' leaf order.
axmatrix = fig.add_axes([0.3, 0.1, 0.6, 0.6])  # Main axes for the reordered matrix
idx1 = Z1['leaves']  # Leaf order from the first dendrogram
idx2 = Z2['leaves']  # Leaf order from the second dendrogram
D = D[idx1, :]  # Reorder rows based on the first dendrogram
D = D[:, idx2]  # Reorder columns based on the second dendrogram
im = axmatrix.matshow(D, aspect='auto', origin='lower', cmap=pylab.cm.YlGnBu)

# Remove ticks for the matrix plot.
axmatrix.set_xticks([])
axmatrix.set_yticks([])

# Add a colorbar to show the scale of the distances.
axcolor = fig.add_axes([0.91, 0.1, 0.02, 0.6])  # Axes for the colorbar
pylab.colorbar(im, cax=axcolor)

# Save the figure as a PNG file and display it.
fig.savefig('dendrogram_example_02.png')
pylab.show()  # Blocks until the window is closed when run as a script
```
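SciPy's linkage accepts either raw observations or a condensed (one-dimensional) distance vector, so a square matrix like D generally needs converting before it is clustered as distances. The sketch below shows that conversion with squareform and checks that it gives the same merge heights as clustering the raw 1D features. The seeded generator, the broadcasting shortcut, and the use of leaves_list with np.ix_ are assumptions made for this illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform

rng = np.random.default_rng(42)
x = rng.random(40)  # 40 one-dimensional features, as in the example above

# Square 40x40 matrix of absolute differences, built with broadcasting
D = np.abs(x[:, None] - x[None, :])

# Condensed form: the upper triangle flattened to n * (n - 1) / 2 entries
condensed = squareform(D)
print(condensed.shape)  # (780,)

# Clustering the condensed distances matches clustering the raw observations
Z_from_distances = linkage(condensed, method="single")
Z_from_points = linkage(x.reshape(-1, 1), method="single")
print(np.allclose(Z_from_distances[:, 2], Z_from_points[:, 2]))

# Reorder rows and columns in one step using the dendrogram leaf order
order = leaves_list(Z_from_distances)
D_sorted = D[np.ix_(order, order)]
print(D_sorted.shape)
```

Using the same leaf order for rows and columns keeps the diagonal of the reordered matrix at zero; the example above instead uses two different linkages for rows and columns, which is a deliberate choice for comparing them.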
Hi, I'm Ben, the Founder of moonbooks.org. I work as a research scientist specializing in Earth satellite remote sensing, particularly focusing on Fire detection using VIIRS and ABI sensors. My areas of expertise include Python, Machine Learning, and Open Science. Feel free to reach out if you notice any errors on the site. I am constantly striving to enhance it. Thank you!