What Is Hierarchical Clustering?
Hierarchical clustering is a type of clustering algorithm used in machine learning and data analysis that builds a hierarchy of clusters. Unlike many other clustering methods, hierarchical clustering does not require specifying the number of clusters in advance. Instead, it creates a tree-like structure, known as a dendrogram, which illustrates the nested relationships between clusters. This approach allows for a detailed exploration of the data's inherent structure.
The process typically begins with each individual data point treated as a separate cluster. It then iteratively merges the most similar clusters (in agglomerative, or "bottom-up," clustering) or splits a large cluster into smaller, more homogeneous ones (in divisive, or "top-down," clustering). Hierarchical clustering is a valuable tool for understanding the natural groupings within complex datasets, often applied in quantitative analysis across various fields.
History and Origin
The foundational ideas behind cluster analysis, including hierarchical methods, trace back to the early 20th century, with origins in anthropology and psychology around the 1930s. Early innovators like Alfred L. Kroeber and Clark Wissler sought to organize complex cultural data, while in psychology, Joseph Zubin and Robert Tryon applied similar classification methods.
The hierarchical clustering method itself was formally formulated in the 1960s. One notable contribution to this field was by J.H. Ward in 1963, who introduced a linkage criterion that minimizes the variance when merging clusters, a method still widely used today. The development of these algorithms allowed researchers to move beyond simple visual grouping to more rigorous data classification methods, providing a systematic approach to identifying patterns and structures within datasets.
Key Takeaways
- Hierarchical clustering builds a tree-like structure (dendrogram) of nested clusters, offering a visual representation of data relationships.
- It does not require pre-defining the number of clusters, providing flexibility for exploratory data analysis.
- The method can be agglomerative (bottom-up, merging clusters) or divisive (top-down, splitting clusters).
- Hierarchical clustering is sensitive to the choice of distance metric and linkage criteria, which significantly influence the resulting cluster structure.
- Its computational intensity can be a limitation for very large datasets.
Formula and Calculation
Hierarchical clustering relies on calculating the dissimilarity (or similarity) between individual data points and between clusters. A common measure of dissimilarity between two data points, \(x\) and \(y\), is the Euclidean distance, given by:

$$
d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
$$

where \(n\) is the number of dimensions or features, and \(x_i\) and \(y_i\) are the values of the \(i\)-th feature for data points \(x\) and \(y\), respectively.
Once distances between individual data points are established, a "linkage criterion" determines the distance between clusters. Common linkage criteria include:
- Single Linkage: The distance between two clusters is the minimum distance between any single data point in one cluster and any single data point in the other.
- Complete Linkage: The distance between two clusters is the maximum distance between any single data point in one cluster and any single data point in the other.
- Average Linkage: The distance between two clusters is the average distance between all data points in one cluster and all data points in the other.
- Ward's Method: This criterion merges the pair of clusters that results in the minimum increase in the total within-cluster variance.
The algorithm iteratively merges or splits clusters based on these calculated distances and the chosen linkage criterion until the desired hierarchy is formed.
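As a minimal sketch of how these pieces fit together in practice, the snippet below uses SciPy's `pdist` and `linkage` functions on hypothetical data to compute pairwise Euclidean distances and build an agglomerative hierarchy with Ward's method; the data and parameter choices are illustrative only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Hypothetical feature matrix: rows are observations, columns are features.
rng = np.random.default_rng(42)
X = rng.normal(size=(10, 3))

# Pairwise Euclidean distances in condensed form, matching the formula above.
distances = pdist(X, metric="euclidean")

# Agglomerative clustering with Ward's linkage; other options include
# "single", "complete", and "average".
Z = linkage(distances, method="ward")

# Each row of Z records one merge: the two clusters joined, the distance
# at which they were joined, and the size of the newly formed cluster.
print(Z)
```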
Interpreting the Hierarchical Clustering
The primary output of hierarchical clustering is the dendrogram, a visual representation of the cluster hierarchy. Interpreting a dendrogram involves examining the structure and "height" of the merges (for agglomerative clustering) or splits (for divisive clustering).
Each merge in an agglomerative dendrogram occurs at a certain dissimilarity level. The longer the vertical line (or "branch") connecting two clusters or data points, the greater the dissimilarity between them at the point of their merger. Analysts can "cut" the dendrogram at different heights to obtain different numbers of clusters. A cut at a lower height results in more, smaller clusters, while a cut at a higher height results in fewer, larger clusters. This visual flexibility helps in determining a suitable number of clusters for a given analysis without having to pre-specify it, which is a key advantage of hierarchical clustering. It provides context for evaluating the natural groupings that emerge from the data, which can then inform decisions in areas like market segmentation.
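To make the idea of "cutting" concrete, the following sketch (on hypothetical data, with arbitrary cut heights) uses SciPy's `fcluster` to extract flat clusters from the same hierarchy at two different levels:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data: 10 observations with 3 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
Z = linkage(X, method="average")

# Cutting the dendrogram at a low height produces many small clusters,
# while cutting at a greater height produces fewer, larger clusters.
labels_low = fcluster(Z, t=0.5, criterion="distance")
labels_high = fcluster(Z, t=3.0, criterion="distance")

print("low cut:", np.unique(labels_low).size, "clusters")
print("high cut:", np.unique(labels_high).size, "clusters")
```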
Hypothetical Example
Consider a hypothetical dataset of five investment funds (Fund A, B, C, D, E) and their historical correlation in returns over the past year. We want to group these funds based on their co-movement using hierarchical clustering.
Fund | Fund A | Fund B | Fund C | Fund D | Fund E |
---|---|---|---|---|---|
A | 0.00 | | | | |
B | 0.15 | 0.00 | | | |
C | 0.80 | 0.70 | 0.00 | | |
D | 0.20 | 0.10 | 0.65 | 0.00 | |
E | 0.75 | 0.68 | 0.05 | 0.60 | 0.00 |
(Note: This is a dissimilarity matrix, where lower values indicate higher similarity, calculated as 1 − absolute correlation or a similar transformation.)
Step 1: Initial Clusters
Each fund starts as its own cluster: {A}, {B}, {C}, {D}, {E}.
Step 2: First Merge
The smallest dissimilarity is between Fund C and Fund E (0.05). These two are merged: {C, E}, {A}, {B}, {D}.
Step 3: Update Dissimilarity and Second Merge
Now, recalculate dissimilarities involving the new {C, E} cluster. Using, say, average linkage:
- Dissimilarity({C, E}, A) = (Dissimilarity(C, A) + Dissimilarity(E, A)) / 2 = (0.80 + 0.75) / 2 = 0.775
- Dissimilarity({C, E}, B) = (Dissimilarity(C, B) + Dissimilarity(E, B)) / 2 = (0.70 + 0.68) / 2 = 0.69
- Dissimilarity({C, E}, D) = (Dissimilarity(C, D) + Dissimilarity(E, D)) / 2 = (0.65 + 0.60) / 2 = 0.625
The next smallest dissimilarity among all current clusters is between Fund B and Fund D (0.10). These are merged: {B, D}, {C, E}, {A}.
Step 4: Continue Merging
The process continues based on the chosen linkage criterion and the updated dissimilarities: with average linkage, {A} merges with {B, D} next (average dissimilarity of (0.15 + 0.20) / 2 = 0.175), and the final merge joins {A, B, D} with {C, E}, placing all funds in a single cluster. The resulting structure would be visualized as a dendrogram, showing how funds are progressively grouped. This allows an investor to see which funds naturally move together, which could be critical for portfolio diversification.
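A short sketch of this worked example, applying SciPy's average linkage to the dissimilarity matrix above (the fund labels are, of course, hypothetical), reproduces the same merge order:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

funds = ["A", "B", "C", "D", "E"]

# Dissimilarity matrix from the example (symmetric, zero diagonal).
D = np.array([
    [0.00, 0.15, 0.80, 0.20, 0.75],
    [0.15, 0.00, 0.70, 0.10, 0.68],
    [0.80, 0.70, 0.00, 0.65, 0.05],
    [0.20, 0.10, 0.65, 0.00, 0.60],
    [0.75, 0.68, 0.05, 0.60, 0.00],
])

# Convert the square matrix to condensed form and apply average linkage.
Z = linkage(squareform(D), method="average")

# Each row: indices of the merged clusters, merge distance, new cluster size.
# The first merges correspond to {C, E} at 0.05 and {B, D} at 0.10.
print(Z)
```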
Practical Applications
Hierarchical clustering finds practical applications across various financial domains, aiding in structured decision-making and analysis.
- Portfolio Management: It assists in grouping financial assets, such as stocks, bonds, or mutual funds, based on their historical price movements, returns, or other relevant characteristics. This helps in constructing well-diversified portfolios and in making informed asset allocation decisions by identifying assets that behave similarly or dissimilarly. For example, a study by the Office of Financial Research demonstrated how different clustering techniques, including hierarchical methods, could group stock data into portfolios for risk analysis and financial stability monitoring.
- Customer Segmentation: Financial institutions use hierarchical clustering to segment their customer base into distinct groups based on demographics, transaction history, risk tolerance, or product preferences. This allows for targeted marketing strategies and personalized financial product offerings.
- Risk Management: By grouping assets or entities with similar risk profiles, financial professionals can gain a clearer understanding of systemic risks within a portfolio or across an entire market. This can aid in developing more robust risk management strategies.
- Credit Scoring: Lenders can cluster loan applicants based on various credit metrics to identify groups with similar default probabilities, refining credit scoring models and lending policies.
- Fraud Detection: In finance, hierarchical clustering can help detect unusual patterns or groups of transactions that might indicate fraudulent activity by identifying outliers or small, unusual clusters within large datasets.
- Algorithmic Trading: Grouping highly correlated assets can inform pairs trading strategies or other statistical arbitrage techniques, where positions are taken on the relative performance of grouped assets.
These applications leverage the ability of hierarchical clustering to reveal underlying structures and relationships in complex financial data, supporting financial modeling and strategic planning.
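As an illustrative sketch of the portfolio-management use case (all return data and asset names below are hypothetical), asset correlations can be converted into distances, for example 1 minus the absolute correlation, and then clustered hierarchically:

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical daily returns for five assets.
rng = np.random.default_rng(1)
returns = pd.DataFrame(
    rng.normal(size=(250, 5)),
    columns=["AssetA", "AssetB", "AssetC", "AssetD", "AssetE"],
)

# Convert correlations to dissimilarities: highly correlated assets are "close".
corr = returns.corr()
dist = 1.0 - corr.abs()

# Average-linkage clustering on the condensed distance matrix.
Z = linkage(squareform(dist.values, checks=False), method="average")

# Group the assets into, say, two clusters for diversification analysis.
groups = fcluster(Z, t=2, criterion="maxclust")
print(dict(zip(corr.columns, groups)))
```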
Limitations and Criticisms
While hierarchical clustering is a powerful technique, it has several limitations and criticisms that analysts must consider:
- Computational Intensity: For very large datasets, hierarchical clustering can be computationally expensive. Standard agglomerative implementations build and repeatedly update a pairwise dissimilarity matrix, giving at least quadratic time and memory costs in the number of observations, which typically makes the method impractical for datasets with millions of data points. This is a significant drawback compared to more scalable algorithms like K-means clustering.
- Inability to Undo Merges/Splits: Once a merge (in agglomerative) or split (in divisive) decision is made, it cannot be undone. This greedy approach means that a suboptimal decision early in the process can lead to a less-than-ideal final clustering structure.
- Sensitivity to Outliers and Noise: Hierarchical clustering can be sensitive to outliers, which can disproportionately influence the cluster formation and distort the dendrogram structure. Small perturbations in the dataset can lead to significant changes in the clustering results, compromising the credibility and replicability of the output.
- Difficulty in Determining Optimal Number of Clusters: While the dendrogram provides flexibility in choosing the number of clusters by "cutting" at different levels, there is no definitive mathematical method within hierarchical clustering itself to determine the "optimal" number of clusters. This often relies on subjective visual interpretation or external criteria.
- Scalability Issues for Visualization: For very large datasets, the resulting dendrogram can become extremely complex and difficult to visualize and interpret effectively.
These limitations highlight that while hierarchical clustering offers unique insights into data hierarchy, its practical application requires careful consideration of data size, noise, and the specific goals of the analysis.
Hierarchical Clustering vs. K-means Clustering
Hierarchical clustering and K-means clustering are both widely used clustering algorithms, but they differ fundamentally in their approach and output.
Feature | Hierarchical Clustering | K-means Clustering |
---|---|---|
Cluster Structure | Builds a hierarchy of nested clusters (dendrogram). | Partitions data into a pre-defined number of K non-overlapping clusters. |
Number of Clusters | No need to pre-specify; determined by cutting dendrogram. | Requires the number of clusters (K) to be specified in advance. |
Output | A tree-like dendrogram showing relationships. | A set of distinct, flat clusters, each with a centroid. |
Flexibility | More flexible, reveals sub-cluster relationships. | Less flexible; assumes spherical, equally-sized clusters. |
Computational Cost | Generally higher, especially for large datasets. | Faster and more scalable for large datasets. |
Sensitivity | Can be sensitive to outliers and order of data. | Sensitive to initial centroid placement. |
The primary point of confusion often arises because both aim to group similar data, but hierarchical clustering provides a more granular view of similarity at different levels of aggregation, whereas K-means clustering produces a single partitioning of the data based on a pre-determined number of groups. Hierarchical clustering is often preferred when the goal is to understand the underlying structure and relationships between data points, while K-means is chosen for its efficiency and when the number of clusters is already known or can be reasonably estimated.
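To make the contrast concrete, the sketch below runs both algorithms on the same hypothetical observations using scikit-learn; the data, cluster counts, and parameters are illustrative only.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

# Hypothetical data: 100 observations with 2 features.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))

# Hierarchical (agglomerative) clustering; here the hierarchy is cut to 3 clusters.
hier_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# K-means: the number of clusters must be chosen up front, and the result is
# a single flat partition with one centroid per cluster.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(np.bincount(hier_labels), np.bincount(kmeans_labels))
```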
FAQs
What is a dendrogram in hierarchical clustering?
A dendrogram is a tree-like diagram that graphically represents the hierarchy of clusters created by hierarchical clustering. It shows the sequence of merges or splits and the dissimilarity levels at which these operations occur, allowing you to visualize the relationships between clusters.
When should I use hierarchical clustering instead of K-means?
You should consider hierarchical clustering when you don't know the optimal number of clusters beforehand, or when you need to understand the nested, hierarchical relationships between your data points. It's also suitable for smaller to medium-sized datasets where the computational cost is less of a concern. For very large datasets or when you have a clear idea of how many distinct groups you're looking for, K-means clustering might be more appropriate due to its efficiency.
Can hierarchical clustering handle different types of data?
Hierarchical clustering can handle various types of data as long as a meaningful dissimilarity measure (distance metric) can be defined between data points. Common distance metrics include Euclidean distance for numerical data, and other metrics can be used for categorical or mixed data types. The choice of distance metric is crucial as it directly impacts the clustering results.
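As a small sketch of how the choice of metric enters the computation (the data here is hypothetical and the metrics are just common examples), SciPy's `pdist` accepts several distance metrics for the same observations:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

# Hypothetical numerical data: 6 observations with 4 features.
rng = np.random.default_rng(7)
X = rng.normal(size=(6, 4))

# The same observations clustered under different distance metrics.
for metric in ("euclidean", "cityblock", "cosine"):
    Z = linkage(pdist(X, metric=metric), method="average")
    print(metric, "-> first merge at distance", round(Z[0, 2], 3))
```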
What are the two main types of hierarchical clustering?
The two main types are agglomerative and divisive. Agglomerative (bottom-up) clustering starts with each data point as its own cluster and progressively merges the closest clusters. Divisive (top-down) clustering begins with all data points in a single cluster and recursively splits them into smaller clusters. Agglomerative is more commonly used.