What is hierarchical clustering?

Published: 05 August 2024
Contributors: Joshua Noble

Hierarchical clustering is an unsupervised machine learning algorithm that groups data into a tree of nested clusters. The two main types are agglomerative and divisive. Hierarchical cluster analysis helps find patterns and connections in datasets. Results are typically presented in a dendrogram, a diagram that shows the distance relationships between clusters.

Clustering is an unsupervised machine learning technique used in data analysis to detect and group similar objects. Hierarchical cluster analysis (HCA), or hierarchical clustering, groups objects into a cluster hierarchy without enforcing a linear order within them. Many disciplines, such as biology, image analysis and the social sciences, use hierarchical clustering methods to explore and recognize patterns in datasets. Use cases include categorizing populations in clinical research, customer segmentation and detecting communities of nodes in network models.

There are two types of hierarchical clustering:

-   An agglomerative, or bottom-up, approach1 repeatedly merges clusters into larger ones until a single cluster emerges.

-   A divisive, or top-down, approach2 starts with all data in a single cluster and keeps splitting clusters until every cluster is a singleton.

Hierarchical clustering analysis has high computational costs. While using a heap can reduce computation time, memory requirements are increased. Both the divisive and agglomerative types of clustering are "greedy," meaning that the algorithm decides which clusters to merge or split by making the locally optimal choice at each stage of the process. It is also possible to apply a stop criterion, where the algorithm stops agglomeration or splitting clusters when it reaches a predetermined number of clusters.

A tree-like diagram called a dendrogram3 is often used to visualize the hierarchy of clusters. It displays the order in which clusters have been merged or divided and shows the similarity or distance between data points. Dendrograms can also be understood as a nested list of lists4 with attributes.

How hierarchical clustering works

Hierarchical clustering algorithms use the concept of a dissimilarity matrix to decide which clusters to merge or divide. Dissimilarity between individual data points is measured with a distance metric, and a linkage criterion extends that measure to whole clusters. The values in a dissimilarity matrix express (see the short sketch after this list):

-   The distance5 between individual points in a set, for example the Euclidean distance

-   A linkage clustering criterion, which specifies dissimilarity as a function of the pairwise distances of points across sets
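As a concrete illustration, here is a minimal sketch of building a Euclidean dissimilarity matrix with SciPy; the five 2-D points are made-up values chosen only for demonstration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Illustrative data: five 2-D points (not from the article)
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6]])

# Condensed vector of pairwise Euclidean distances
condensed = pdist(X, metric="euclidean")

# Square, symmetric dissimilarity matrix with zeros on the diagonal
D = squareform(condensed)
print(D.round(2))
```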

Linkage methods

Let's explore the most common linkage methods, using Euclidean distance as the distance metric. Examples of other distance metrics that can be used include Minkowski, Hamming, Mahalanobis, Hausdorff and Manhattan distance. Note that each linkage method generates different clusters from the same dataset. Selecting the appropriate linkage clustering method depends on factors such as the type of data being clustered, data density, cluster shape and whether there are outliers or noise in the dataset.

Min (single) linkage

The single linkage method analyzes pairwise distances between items in two clusters and then uses the minimum distance between the clusters. The min method handles nonelliptical cluster shapes well but will be impacted by noise and outliers. It has a limitation known as the chaining effect6. A few points creating a bridge between a cluster pair can result in the two clusters merging into one. The min linkage criteria can be represented as:

$$\min_{a \in A,\, b \in B} d(a, b)$$

where A and B are two sets of observations and d is a distance function.

Max (complete) linkage

Cluster distances can also be calculated based on the points that are furthest away from each other. The max method is less sensitive than the min method to noise and outliers, but using it can skew the results when there are globular or large clusters. Max-linkage often produces more spherical clusters than min-linkage. The max linkage can be represented in the formula:

$$\max_{a \in A,\, b \in B} d(a, b)$$

where A and B are two sets of observations and d is a distance function.
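The following minimal sketch computes both the min (single) and max (complete) linkage distances between two small clusters; the point coordinates are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two illustrative clusters of 2-D points
A = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])
B = np.array([[4.0, 4.0], [5.0, 4.5]])

# All pairwise Euclidean distances d(a, b) with a in A and b in B
pairwise = cdist(A, B)

single_linkage = pairwise.min()    # min over all pairs (a, b)
complete_linkage = pairwise.max()  # max over all pairs (a, b)
print(single_linkage, complete_linkage)
```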

Average linkage

These methods, introduced by Sokal and Michener7, define the distance between two clusters as the average of the pairwise distances between all points in the clusters. The algorithm can be either the unweighted pair group method with arithmetic mean (UPGMA) or the weighted pair group method with arithmetic mean (WPGMA). The meaning of "unweighted" here is that all distances contribute equally to each average.

UPGMA is represented by the formula

$$\frac{1}{|A| \cdot |B|} \sum_{a \in A} \sum_{b \in B} d(a, b)$$

where A and B are two sets of observations and d is a distance function.

WPGMA is represented by the formula

$$d(i \cup j,\ k) = \frac{d(i, k) + d(j, k)}{2}$$

where i and j are the closest clusters that are combined in each step to form a new cluster i ∪ j. The distance from this new cluster to another cluster k is then the arithmetic mean of the previous average distances between the data points in k and i and between the data points in k and j.
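A minimal sketch of both averages, reusing the illustrative clusters A and B from the previous sketch and adding a made-up third cluster K.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Illustrative clusters
A = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])
B = np.array([[4.0, 4.0], [5.0, 4.5]])
K = np.array([[9.0, 1.0], [8.0, 0.0]])

# UPGMA: average of all |A| * |B| pairwise distances between A and B
upgma_AB = cdist(A, B).mean()

# WPGMA: after merging i = A and j = B, the distance from the new cluster
# to K is the plain arithmetic mean of the two previous distances
d_ik = cdist(A, K).mean()
d_jk = cdist(B, K).mean()
wpgma_to_K = (d_ik + d_jk) / 2
print(upgma_AB, wpgma_to_K)
```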

Centroid linkage

Here, we use the distance between cluster centers or centroids. The distance between centroids is calculated by using a distance function:

$$\|\mu_A - \mu_B\|^2$$

where μA is the centroid of A and μB is the centroid of B.
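A minimal sketch of the centroid criterion for the same illustrative clusters.

```python
import numpy as np

A = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])
B = np.array([[4.0, 4.0], [5.0, 4.5]])

mu_A = A.mean(axis=0)  # centroid of A
mu_B = B.mean(axis=0)  # centroid of B

centroid_linkage = np.sum((mu_A - mu_B) ** 2)  # squared distance between centroids
print(centroid_linkage)
```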

Ward's minimum variance method

Joe H. Ward introduced the minimal increase of sum of squares (MISSQ) method8 in the 1960s. Every data point starts in its own cluster, so the within-cluster sum of squares is initially zero and increases as clusters are merged. At each step, Ward's method merges the pair of clusters that produces the smallest increase in the sum of the squared distances of the points from their cluster centers. Ward's method is a good choice for quantitative variables9, and it is less affected by noise or outliers. It can be represented as:

$$\frac{|A| \cdot |B|}{|A \cup B|} \, \|\mu_A - \mu_B\|^2 = \sum_{x \in A \cup B} \|x - \mu_{A \cup B}\|^2 - \sum_{x \in A} \|x - \mu_A\|^2 - \sum_{x \in B} \|x - \mu_B\|^2$$

where μA and μB are the centroids of A and B respectively, and x is a data point that belongs to the union A ∪ B.
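The identity above can be checked numerically. Here is a minimal sketch that does so for the same illustrative clusters: the centroid form on the left equals the increase in within-cluster sum of squares caused by the merge.

```python
import numpy as np

A = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])
B = np.array([[4.0, 4.0], [5.0, 4.5]])
AB = np.vstack([A, B])

def sse(X):
    """Sum of squared distances of the points in X from their centroid."""
    return np.sum((X - X.mean(axis=0)) ** 2)

lhs = (len(A) * len(B) / len(AB)) * np.sum((A.mean(axis=0) - B.mean(axis=0)) ** 2)
rhs = sse(AB) - sse(A) - sse(B)
print(lhs, rhs, np.isclose(lhs, rhs))  # both sides agree
```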

Lance-Williams algorithm

Ward's minimum variance method, like the other linkage criteria above, can be implemented with a Lance-Williams algorithm. These algorithms use a recursive formula to update cluster distances after each merge and to find the closest cluster pair to merge next.

Agglomerative clustering steps

In agglomerative hierarchical clustering, also known as agglomerative nesting (AGNES), each data point starts as its own cluster. The algorithm uses a selected linkage clustering criterion, based on a dissimilarity matrix, to decide which pair of clusters to join. The algorithm moves up the hierarchy and keeps pairing clusters until everything has been linked together, creating a hierarchical series of nested clusters. Roughly speaking, the agglomerative clustering steps10 are (see the sketch after this list):

1.  Compute the dissimilarity matrix by using a particular distance metric.

2.  Assign each data point to its own cluster.

3.  Merge the closest pair of clusters according to the linkage criterion.

4.  Update the distance matrix.

5.  Repeat steps 3 and 4 until a single cluster remains or any stop criterion is met.
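Here is a minimal end-to-end sketch of these steps with SciPy; the synthetic data, the Ward linkage and the two-cluster cut are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Illustrative data: two well-separated blobs of 2-D points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (15, 2)), rng.normal(4, 0.5, (15, 2))])

# linkage() computes the pairwise distances internally, then repeatedly
# merges the closest pair of clusters under the chosen criterion
Z = linkage(X, method="ward", metric="euclidean")

# Stop criterion: cut the hierarchy into a fixed number of clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

dendrogram(Z)  # visualize the merge order and merge heights
plt.show()
```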

Divisive clustering steps

The principles behind divisive hierarchical clustering were developed by Macnaughton-Smith and others11 in 1964 and explored further by Kaufman and Rousseeuw with their DIANA (Divisive ANAlysis clustering) algorithm12 in 1990. Divisive clustering uses the opposite approach to agglomerative clustering. All data points start in a single cluster that is repeatedly split into more clusters. Splitting occurs until either all the remaining clusters are singletons or a stop criterion such as a predefined number of clusters is met. Divisive methods are better at identifying large clusters13 and can be more accurate than agglomerative methods because the algorithm considers the entire dataset distribution from the start of the process.

To improve efficiency, divisive methods use flat clustering algorithms such as k-means to split the dataset into clusters. The number of clusters must be specified upfront. The k-means algorithm splits a cluster by minimizing the within-cluster sum of squared distances to the centroid points, which is known as the inertia criterion14. The divisive clustering steps15 are (a simplified sketch follows this list):

1.  Start with all data points for a dataset size N (d1, d2, d3 ... dN) in one cluster.

2.  Split the cluster into two dissimilar or heterogeneous clusters by using a flat clustering method such as the k-means algorithm.

3.  Repeat step 2, choosing the best cluster to split and removing the outliers from the least cohesive cluster after each iteration.

4.  Stop when each example is in its own cluster or a stop criterion is met; otherwise, repeat step 3.
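A minimal sketch of this top-down strategy, repeatedly bisecting the largest remaining cluster with k-means (k = 2); the synthetic data, the target of three clusters and the "split the largest cluster" rule are simplifying assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative data: three blobs of 2-D points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal((0, 0), 0.5, (20, 2)),
               rng.normal((4, 4), 0.5, (20, 2)),
               rng.normal((0, 6), 0.5, (20, 2))])

clusters = [np.arange(len(X))]   # start: one cluster holding every point index
target = 3                       # stop criterion: predefined cluster count

while len(clusters) < target:
    # Choose which cluster to split next (here: simply the largest one)
    largest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
    members = clusters.pop(largest)

    # Bisect it with flat k-means clustering
    split = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[members])
    clusters.append(members[split == 0])
    clusters.append(members[split == 1])

print([len(c) for c in clusters])
```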

Interpreting hierarchical clustering results

Cluster results are typically presented in a dendrogram (binary tree structure). The x-axis in the dendrogram represents the data points and the y-axis, or the height of the lines, shows how far apart the clusters were when they were merged.

You can use a dendrogram to decide how many clusters16 will be in your final clustering model. One strategy is to identify the natural cutoff point in the tree, where the branches become fewer and longer. Alternatively, the number of clusters is given by the number of vertical lines crossed when a horizontal line cuts the dendrogram.

In the example image shown here, the horizontal line H1 cuts two vertical lines. This shows that there are two clusters at this point in the process—one cluster with points 5, 8, and 2 and one cluster with the remaining points. The further the horizontal line can move up or down without cutting other horizontal lines in the dendrogram, the better the selection of this number of clusters is for your clustering model. In the example below, the horizontal line H2 selects four clusters. H2 is not able to move up and down as far as H1 before it cuts other horizontal lines. This scenario shows that the two-cluster choice (H1) is probably more appropriate for your clustering model.
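In code, cutting the dendrogram with a horizontal line corresponds to extracting flat cluster labels from the linkage matrix. A minimal sketch with SciPy follows; the data and the distance threshold are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative data: two blobs of 2-D points
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(4, 0.5, (10, 2))])
Z = linkage(X, method="ward")

# Equivalent of drawing a horizontal line across the dendrogram:
labels_two = fcluster(Z, t=2, criterion="maxclust")    # request two clusters
labels_cut = fcluster(Z, t=3.0, criterion="distance")  # cut at merge height 3.0
print(labels_two, labels_cut)
```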

A robust clustering model17 creates clusters with high intraclass similarity and low interclass similarity. However, it can be difficult to define cluster quality, and your selection of linkage criterion and cluster numbers can significantly impact your results. Thus, when building a clustering model, try out different options and select those that best help you explore and reveal patterns in the dataset for future consideration. Factors to consider18 include:

-   The number of clusters that are practical or logical for the dataset (given dataset size, cluster shapes, noise and so on)

-   Statistics, such as the mean, maximum and minimum values for each cluster

-   The best dissimilarity metric or linkage criterion to apply

-   The impact of any outliers or outcome variables

-   Any specific domain or dataset knowledge

Other methods to help determine the optimal number of clusters19 include (see the elbow-method sketch after this list):

-   The elbow method, where you plot the within-cluster sum of squares against the number of clusters and determine the "elbow" (the point where the plot levels off)

-   The gap statistic, where you compare the observed within-cluster sum of squares to the expected within-cluster sum of squares under a null reference distribution and identify the largest gap
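The elbow method lends itself to a short sketch; the synthetic data and the candidate range of one to nine clusters are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative data: three blobs of 2-D points
rng = np.random.default_rng(2)
X = np.vstack([rng.normal((0, 0), 0.5, (20, 2)),
               rng.normal((4, 4), 0.5, (20, 2)),
               rng.normal((0, 6), 0.5, (20, 2))])
Z = linkage(X, method="ward")

def wcss(X, labels):
    """Within-cluster sum of squares for a given flat labeling."""
    return sum(np.sum((X[labels == c] - X[labels == c].mean(axis=0)) ** 2)
               for c in np.unique(labels))

ks = range(1, 10)
scores = [wcss(X, fcluster(Z, t=k, criterion="maxclust")) for k in ks]

plt.plot(list(ks), scores, marker="o")  # look for the "elbow" in this curve
plt.xlabel("Number of clusters")
plt.ylabel("Within-cluster sum of squares")
plt.show()
```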

Use cases for hierarchical clustering

Hierarchical clustering provides data scientists with insights into the structure and relationships within datasets and it can be applied in various use cases.

Business

Hierarchical clustering can help analyze trends and segment customer data, for example by grouping customers according to product choice, demographics, purchasing behavior, risk profile or interactions with social media.

Clinical research and bioinformatics

Patient cohorts for clinical research can run into the thousands. Hierarchical clustering helps categorize mixed populations into more homogeneous groups20 by using, for example, diagnostic criteria, physiological responses or DNA mutations. It can also be applied to group species by biological features to understand evolutionary development.

Image and information processing

Hierarchical clustering is used in image-based text recognition applications to group handwritten characters by their shape. It is also used to store and retrieve information by using certain criteria or to categorize search results.

Implementing hierarchical clustering in Python or R

Both Python and the R programming language are widely used in data science. Python is flexible and can handle a broad spectrum of tasks. R, by contrast, is designed specifically for statistical computing and provides rich options for visualizing your hierarchical clustering analysis.

In Python, the scikit-learn library provides the AgglomerativeClustering class21 (see also the agglomerative clustering examples22 in the scikit-learn documentation), and SciPy provides a function to plot dendrograms23. In R, packages such as dendextend24 enhance dendrogram functionality, improving sensitivity analysis and enabling you to compare and manipulate different dendrograms. For a hands-on experience, check out our step-by-step tutorials: how to implement hierarchical clustering in Python and how to implement hierarchical clustering in R.
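As a starting point, here is a minimal sketch of scikit-learn's AgglomerativeClustering on illustrative data; the linkage choice and cluster count are assumptions.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Illustrative data: two blobs of 2-D points
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.5, (15, 2)), rng.normal(4, 0.5, (15, 2))])

model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)  # flat cluster label for each point
print(labels)
```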


Footnotes

1 Murtagh, F., Legendre, P., “Ward’s Hierarchical Agglomerative Clustering Method: Which Algorithms Implement Ward’s Criterion?,” 2014, J Classif 31, 274–295, https://link.springer.com/article/10.1007/s00357-014-9161-z (link resides outside ibm.com)

2 Kaufman, L.; Rousseeuw, P. J., Finding Groups in Data: An Introduction to Cluster Analysis, Wiley, Chp 6. Divisive Analysis (Program DIANA), pp. 253–279, https://onlinelibrary.wiley.com/doi/book/10.1002/9780470316801 (link resides outside ibm.com)

3 Galili, T., “Introduction to dendextend,” The Comprehensive R Archive Network, 2023, https://cran.r-project.org/web/packages/dendextend/index.html (link resides outside ibm.com)

4 Lecture Notes from Penn State Eberly College of Science, “Hierarchical Clustering”, https://online.stat.psu.edu/stat555/node/85/ (link resides outside ibm.com, accessed Jan 8, 2023)

5, 17 Maimon, O., Rokach, L., Data Mining and Knowledge Discovery Handbook, 2010 2nd ed, Springer, https://link.springer.com/book/10.1007/978-0-387-09823-4 (link resides outside ibm.com)

6 Sokal, R, Michener, C., “A statistical method for evaluating systematic relationships,” 1958, University of Kansas Science Bulletin, 38: 1409–1438, https://archive.org/details/cbarchive_33927_astatisticalmethodforevaluatin1902/page/n1/mode/2up (link resides outside ibm.com)

7 Ward, J. H., “Hierarchical Grouping to Optimize an Objective Function,” 1963, Journal of the American Statistical Association, 58 (301): 236–244, https://www.tandfonline.com/doi/abs/10.1080/01621459.1963.10500845 (link resides outside ibm.com).

8 Lecture Notes from Penn State Eberly College of Science, “Applied Multivariate Statistical Analysis”, https://online.stat.psu.edu/stat505/lesson/14/14.7 (link resides outside ibm.com, accessed Dec 29, 2023)

9, 15 Shetty P. and Singh S., “Hierarchical Clustering: A Survey,” International Journal of Applied Research, Vol 7 Issue 4, Part C, 2021, https://www.allresearchjournal.com/archives/?year=2021&vol=7&issue=4&part=C&ArticleId=8484 (link resides outside ibm.com)

10 Macnaughton-Smith, P., Williams, W., Dale, M., et al., “Dissimilarity Analysis: a new Technique of Hierarchical Sub-division,” Nature 202, 1034–1035 (1964), https://www.nature.com/articles/2021034a0 (link resides outside ibm.com)

12 Boehmke, B., Greenwell, B., Hands-On Machine Learning with R, Taylor and Francis, 2020, https://bradleyboehmke.github.io/HOML/ (link resides outside ibm.com)

13 Cavalli-Sforza, L. L., and Edwards A. W. F., “Phylogenetic analysis: models and estimation procedures,” 1967, Evolution 21: 550–570 and Am. J. Hum. Genet. 19: 233–257, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1706274/ (link resides outside ibm.com)

14 Scikit-learn 1.3.2 documentation, 2.3 Clustering, https://scikit-learn.org/stable/modules/clustering.html (link resides outside ibm.com)

16 Lecture notes from MIT OpenCourseWare, 2017, https://ocw.mit.edu/courses/15-071-the-analytics-edge-spring-2017/pages/clustering/recommendations-worth-a-million-an-introduction-to-clustering/video-5-hierarchical-clustering/ (link resides outside ibm.com)

18 Lecture notes from the University of Washington, 2001, https://courses.cs.washington.edu/courses/csep546/04au/pdf-slides/10.pdf (link resides outside ibm.com)

19 Boehmke, B., University of Cincinnati Business Analytics R Programming Guide, https://uc-r.github.io/hc_clustering#algorithms (link resides outside ibm.com)

20 QCBS R Workshop Series, https://r.qcbs.ca/workshop09/book-en/clustering.html (link resides outside ibm.com), accessed Dec 31, 2023.

21 Zhongheng Zhang et al., “Hierarchical cluster analysis in clinical research with heterogeneous study population: highlighting its visualization with R,” Annals of Translational Medicine. 2017 Feb; 5(4): 75, https://atm.amegroups.org/article/view/13789/14063 (link resides outside ibm.com)

22, 23 Scikit-learn 1.3.2 documentation, https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html (link resides outside ibm.com)

24 SciPy Manual v1.11.4, https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.dendrogram.html (link resides outside ibm.com)