0. Introduction: What is clustering?
What is clustering?
If you try to define the term “cluster”, you will often find yourself not being precise enough. Roughly speaking, a cluster is any arbitrary group of observations and a clustering is a collection of clusters, i.e. a partition of objects.
Suppose we have observations indexed as \(1, \ldots, n\) and a set \(\mathcal{C} = \{ C_1, \ldots, C_K\}, K \in \mathbb{N}\). Then, the following need to hold for \(\mathcal{C}\) to be a valid clustering into \(K\) clusters:
- \(C_i \neq \varnothing \forall i \in \{1, \ldots, K\}\),
- \(C_i \cap C_j = \varnothing \forall i\neq j\),
- \(\bigcup\limits_{i=1}^K C_i = \{1, \ldots, n\}\).
Does any partition count?
Any partition of our observations can be considered a valid clustering. So is there a way to guide the process and assign objects into groups so that they are meaningful? Turns out that this is possible, as long as we define some desirable characteristics or properties that we want our groups to possess. A very insightful (and rather philosophical) paper that summarises such properties is Hennig (2015). Briefly, some things we typically wish to have are:
- Highly similar objects are assigned into the same cluster.
- Highly dissimilar objects are assigned into distinct clusters.
- Clusters are stable, so that re-clustering should not produce a different group structure.
- There are not too many clusters, effectively summarising different modes in the data.
- Clusters can be well-represented by observations that are central to each group.
There are many more characteristics that one may wish to get; when developing a clustering algorithm, it is always important to have your ultimate goal as your guide.
Clustering in a nutshell

The comic illustration above shows how clustering typically works; similar observations are assigned in the same cluster, whereas dissimilar items are in distinct clusters. How do we compute similarities between objects though? This is what we are exploring in more detail in 1. Distances & Dissimilarities.