Correlation Clustering: The Art of Grouping the Indistinguishable
Clustering. It's the fundamental, often messy, business of taking a collection of disparate things and trying to impose some semblance of order upon them. We're talking about partitioning data points, yes, but at its core, it's about finding patterns, similarities, connections where none are immediately obvious. Or perhaps, where they're deliberately obscured. Correlation clustering offers a particularly cynical approach to this, a method for carving objects into groups, aiming for the optimum number of clusters without the pretense of knowing what that number should be. It's less about finding truth and more about defining a workable, albeit imperfect, reality. [1]
The Anatomy of the Problem: A Glimpse into the Abyss
In the cold, calculated world of machine learning, correlation clustering doesn't concern itself with the objects themselves, but rather the relationships between them. Imagine a weighted graph, a tangled mess of nodes and edges. The weight of an edge isn't just a number; it's a declaration. A positive weight screams similarity, a whispered promise of belonging. A negative weight, however, is a stark accusation of difference, a wedge driven between potential allies. The objective? To forge a clustering that either maximizes agreements – the harmonious sum of positive edge weights within clusters plus the absolute value of negative edge weights between clusters – or minimizes disagreements, the discordant sum of negative edge weights within clusters plus positive edge weights across them. [1]
What sets this apart, and frankly, makes it more interesting, is its refusal to be beholden to a predetermined number of clusters. Unlike other, more naive algorithms that demand you pre-select k, correlation clustering's objective is elegantly independent of such arbitrary constraints. It seeks to minimize the sum of weights on the edges that are cut, a direct measure of discord. [1]
It might not always be possible to achieve a perfect partition, a utopian state where all similar items nestle together and all dissimilar ones remain resolutely apart. If the graph itself permits such a flawless arrangement, then a simple act of severing all negative edges and observing the resulting connected components is all that's required. [1] Davis, bless his analytical heart, even laid down a condition for this perfect outcome: no cycle within the graph may contain precisely one negative edge. [2]
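If you insist on seeing that test spelled out, here is a minimal sketch (the function name and helpers are mine, not from the cited sources): cut every repulsive edge, take the connected components of what remains, and check that no repulsive edge survives inside a component.

```python
# Minimal illustrative sketch: test whether a signed graph admits a perfect
# clustering by discarding the repulsive edges, taking connected components of
# the attractive subgraph, and verifying that no repulsive edge ends up inside
# a component.
from collections import defaultdict

def perfect_clustering(nodes, positive_edges, negative_edges):
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path compression
            v = parent[v]
        return v

    def union(u, v):
        parent[find(u)] = find(v)

    # The components of the attractive subgraph become the candidate clusters.
    for u, v in positive_edges:
        union(u, v)

    # A repulsive edge inside a component means no perfect clustering exists.
    if any(find(u) == find(v) for u, v in negative_edges):
        return None

    clusters = defaultdict(set)
    for v in nodes:
        clusters[find(v)].add(v)
    return list(clusters.values())

# Example: a-b similar, b-c dissimilar -> perfect clustering {a, b}, {c}.
print(perfect_clustering({'a', 'b', 'c'}, [('a', 'b')], [('b', 'c')]))
```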
But life, and data, are rarely so accommodating. Consider the melancholic scenario of nodes 'a', 'b', and 'c'. 'a' and 'b' are similar, 'a' and 'c' share that same kinship, yet 'b' and 'c' are fundamentally at odds. A perfect clustering is an impossibility here. In such bleak circumstances, the task shifts: find a clustering that maximizes agreements. This means counting the positive edges that remain within clusters and the negative edges that are successfully separated. Alternatively, one can minimize disagreements. This is where things get truly difficult. Maximizing agreements, you see, is NP-complete: the multiway cut problem reduces to maximizing weighted agreements, and even the unweighted version is hard, since partitioning into triangles reduces to it. [3]
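To make the impossibility painfully concrete, one can enumerate the five partitions of {a, b, c} and count disagreements. A toy sketch under the unweighted ±1 reading of the example (all names invented for illustration):

```python
# Enumerate all partitions of the a-b-c example and count disagreements
# (attractive edges that are cut plus repulsive edges that are not).
positive = {('a', 'b'), ('a', 'c')}   # similar pairs
negative = {('b', 'c')}               # dissimilar pair

partitions = [
    [{'a', 'b', 'c'}],
    [{'a', 'b'}, {'c'}],
    [{'a', 'c'}, {'b'}],
    [{'b', 'c'}, {'a'}],
    [{'a'}, {'b'}, {'c'}],
]

def disagreements(partition):
    same = lambda u, v: any(u in block and v in block for block in partition)
    cut_positive = sum(1 for u, v in positive if not same(u, v))
    kept_negative = sum(1 for u, v in negative if same(u, v))
    return cut_positive + kept_negative

for p in partitions:
    print(p, disagreements(p))
```

Every partition scores at least one disagreement; the best achievable here is exactly one, for instance by grouping all three nodes together or by separating 'c' from {a, b}.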
Formal Definitions: The Cold, Hard Logic
Let's distill this into something more... concrete. We have a graph, G = (V, E), with nodes V and edges E. A clustering, denoted by Π = {π₁, ..., πₖ}, is simply a partition of its node set V, meaning V is the union of all πᵢ, and the intersection of any two distinct πᵢ and πⱼ is the empty set. [1]
For any given clustering Π, we define δ(Π) as the subset of edges E whose endpoints reside in different subsets of the clustering. Think of them as the edges that are cut. [1]
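In code, δ(Π) is nothing exotic. A minimal illustrative sketch (the names are mine), representing Π as a list of disjoint node sets and collecting the cut edges:

```python
# Illustrative sketch of delta(Pi): the edges whose endpoints land in
# different blocks of the partition Pi.
def delta(edges, partition):
    block_of = {v: i for i, block in enumerate(partition) for v in block}
    return {e for e in edges if block_of[e[0]] != block_of[e[1]]}

E = {('a', 'b'), ('a', 'c'), ('b', 'c'), ('c', 'd')}
Pi = [{'a', 'b', 'c'}, {'d'}]   # a partition of V = {a, b, c, d}
print(delta(E, Pi))             # {('c', 'd')} -- the cut edges
```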
Now, we introduce weights. A function w: E → ℝ≥₀ assigns a non-negative weight to each edge. Furthermore, we partition the edges themselves into two sets: E⁺, the attractive edges, and E⁻, the repulsive ones. [1]
With this in place, the minimum disagreement correlation clustering problem presents itself as an optimization problem:

    min_Π  Σ_{e ∈ E⁺ ∩ δ(Π)} w(e)  +  Σ_{e ∈ E⁻ \ δ(Π)} w(e)
Here, the first summation accounts for attractive edges that have been cut – a source of disagreement. The second summation captures repulsive edges that have not been cut, remaining stubbornly within the same cluster – another clear sign of discord. Together, these represent the total cost of disagreement for a given clustering. [1]
Conversely, the maximum agreement correlation clustering problem seeks to maximize the harmonious elements:

    max_Π  Σ_{e ∈ E⁺ \ δ(Π)} w(e)  +  Σ_{e ∈ E⁻ ∩ δ(Π)} w(e)
This objective function rewards attractive edges that remain within clusters and repulsive edges that are successfully cut. [1]
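Both objectives fall straight out of the definitions. The sketch below (weights and edge sets invented for illustration) computes the disagreement and agreement scores of a fixed clustering; unsurprisingly, the two always sum to the total edge weight.

```python
# Illustrative computation of the two objectives for a fixed clustering.
# E_plus / E_minus are the attractive / repulsive edges, w assigns
# non-negative weights, and Pi is a partition of the node set.
E_plus = {('a', 'b'), ('a', 'c')}
E_minus = {('b', 'c')}
w = {('a', 'b'): 2.0, ('a', 'c'): 1.0, ('b', 'c'): 3.0}
Pi = [{'a', 'b'}, {'c'}]

block_of = {v: i for i, block in enumerate(Pi) for v in block}
cut = lambda e: block_of[e[0]] != block_of[e[1]]

# Disagreements: attractive edges that are cut plus repulsive edges that are not.
disagreement = sum(w[e] for e in E_plus if cut(e)) + \
               sum(w[e] for e in E_minus if not cut(e))

# Agreements: attractive edges kept inside a cluster plus repulsive edges cut.
agreement = sum(w[e] for e in E_plus if not cut(e)) + \
            sum(w[e] for e in E_minus if cut(e))

print(disagreement, agreement)   # 1.0 and 5.0 for this toy instance
```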
It's worth noting that this problem can also be framed using edge costs, c: E → ℝ, where positive costs correspond to attractive edges and negative costs to repulsive ones. In this formulation, an edge is "cut" if its endpoints are in different clusters. [1] The set δ(Π) of all cut edges is often referred to as a multicut. [4]
The minimum cost multicut problem then becomes:

    min_Π  Σ_{e ∈ δ(Π)} c(e)
And its counterpart, maximizing the sum of costs of edges that are not cut, is known as the clique partitioning problem: [6]

    max_Π  Σ_{e ∈ E \ δ(Π)} c(e)
The truly elegant, or perhaps infuriating, aspect is that all four of these formulations are, in fact, equivalent. An optimal solution for one is an optimal solution for all. [1]
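The equivalence is mostly bookkeeping: give each attractive edge its weight as a positive cost and each repulsive edge its weight as a negative cost, and the multicut cost of any clustering equals its disagreement cost minus the constant total weight of the repulsive edges. A rough sketch of that translation, with invented numbers:

```python
# Illustrative translation between the weight-based and cost-based views:
# attractive edges get a positive cost, repulsive edges a negative one.
E_plus = {('a', 'b'), ('a', 'c')}
E_minus = {('b', 'c')}
w = {('a', 'b'): 2.0, ('a', 'c'): 1.0, ('b', 'c'): 3.0}

cost = {e: w[e] for e in E_plus}
cost.update({e: -w[e] for e in E_minus})

Pi = [{'a', 'b'}, {'c'}]
block_of = {v: i for i, block in enumerate(Pi) for v in block}
cut_edges = [e for e in cost if block_of[e[0]] != block_of[e[1]]]

multicut_cost = sum(cost[e] for e in cut_edges)
# 1.0 - 3.0 = -2.0: the disagreement cost (1.0) minus the total repulsive weight (3.0).
print(multicut_cost)
```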
Algorithms: The Scramble for Solutions
The quest for optimal correlation clustering is a challenging one, particularly given its NP-completeness. [3] However, approximation algorithms offer a way forward, providing solutions that are provably close to the optimum.
Bansal et al. [7] delved into the NP-completeness proofs and, crucially, presented both a constant-factor approximation algorithm and a polynomial-time approximation scheme for this problem. Following closely, Ailon et al. [8] introduced a randomized 3-approximation algorithm.
A glimpse into one such algorithm, CC-Pivot, for a graph G = (V, E⁺, E⁻):
- Pick a random pivot node, i, from V.
- Initialize a cluster C = {i} and a set V' = Ø.
- For every other node j in V (where j ≠ i):
- If the edge (i, j) is in E⁺ (attractive), add j to C.
- If the edge (i, j) is in E⁻ (repulsive), add j to V'.
- Recursively call CC-Pivot on the subgraph induced by V'.
- Return the clustering C combined with the results of the recursive call.
This algorithm, as its authors demonstrated, yields an expected 3-approximation for minimizing disagreements on complete signed graphs. [8]
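A literal, unglamorous transcription of CC-Pivot into code might look as follows. This is a sketch assuming a complete signed graph supplied as a sign function, not a reference implementation from [8].

```python
import random

# Sketch of the CC-Pivot recursion described above, assuming every pair of
# nodes carries a sign in {+1, -1} given by sign(u, v).
def cc_pivot(nodes, sign):
    nodes = list(nodes)
    if not nodes:
        return []
    pivot = random.choice(nodes)          # pick a random pivot
    cluster, rest = [pivot], []
    for j in nodes:
        if j == pivot:
            continue
        (cluster if sign(pivot, j) > 0 else rest).append(j)
    # Recurse on the nodes repelled by the pivot.
    return [set(cluster)] + cc_pivot(rest, sign)

# Toy instance: a-b and a-c attractive, b-c repulsive.
positive = {frozenset(p) for p in [('a', 'b'), ('a', 'c')]}
sign = lambda u, v: 1 if frozenset((u, v)) in positive else -1
print(cc_pivot({'a', 'b', 'c'}, sign))
```

Depending on which pivot the randomness picks, this toy instance lands in one or two clusters, each outcome incurring exactly one disagreement, which is the optimum here.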
More recent advancements have pushed the boundaries. The current best polynomial-time approximation algorithm achieves a ~2.06 approximation ratio by employing a sophisticated rounding technique on a linear program. This breakthrough is credited to Chawla, Makarychev, Schramm, and Yaroslavtsev. [9]
For complete graphs, and when the number of clusters is fixed, Karpinski and Schudy [10] proved the existence of a polynomial time approximation scheme (PTAS).
The Elusive Optimal Number of Clusters
The question of how many clusters are "correct" is often the most vexing. In 2011, Bagon and Galun [11] shed significant light on this, revealing a profound connection between optimizing the correlation clustering objective and established discrete optimization methods. Their work introduced a probabilistic analysis of the underlying implicit model, demonstrating how the correlation clustering functional can, in effect, estimate the number of clusters. [11]
Their analysis suggests that the functional, by default, assumes a uniform prior over all possible partitions, regardless of their size. However, this process inherently leads to a non-uniform prior emerging for the number of clusters. [11]
Furthermore, they proposed several discrete optimization algorithms that demonstrate remarkable scalability, handling datasets with over 100,000 variables in experimental settings. [11] The effectiveness of these methods in recovering the true underlying number of clusters was also rigorously evaluated across various applications. [11]
Correlation Clustering (Data Mining): Beyond Simple Similarity
There's another facet to correlation clustering, one that resides more squarely in the realm of data mining. Here, the focus shifts to the correlations that exist among attributes within feature vectors residing in high-dimensional space. These correlations aren't necessarily uniform; they can vary from one cluster to another. This means a simple, global decorrelation won't suffice; it can't reduce the problem to traditional clustering methods that assume uncorrelated features. [12]
When correlations manifest across subsets of attributes, they sculpt the spatial shapes of the clusters in unique ways. Consequently, the very definition of similarity between objects must adapt, incorporating these local correlation patterns. This interpretation of correlation clustering was introduced around the same time as the graph-based definition. [12, 13] The relationship between these various forms of correlation clustering and other clustering paradigms is a subject of ongoing discussion. [14] For related concepts, one might also look into Clustering high-dimensional data.
This data mining perspective on correlation clustering shares a striking resemblance with biclustering. In both, the objective is to uncover groups of objects that exhibit a shared correlation across a specific set of their attributes. The defining characteristic is that this correlation is typical for the members of that particular cluster, while different clusters may exhibit different correlations. [12, 13, 14]
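To see why a global view fails, consider a toy sketch (not any particular published method): two groups of points, each almost perfectly correlated but along opposite directions, so the per-cluster correlations are strong while the global correlation collapses toward zero.

```python
import numpy as np

# Two 2-D clusters with opposite local attribute correlations; a single global
# decorrelation step would miss both patterns.
rng = np.random.default_rng(0)

t = rng.normal(size=200)
cluster_1 = np.column_stack([t, t + 0.1 * rng.normal(size=200)])    # second attribute ~ +first
s = rng.normal(size=200)
cluster_2 = np.column_stack([s, -s + 0.1 * rng.normal(size=200)])   # second attribute ~ -first

print(np.corrcoef(cluster_1.T)[0, 1])   # close to +1
print(np.corrcoef(cluster_2.T)[0, 1])   # close to -1
print(np.corrcoef(np.vstack([cluster_1, cluster_2]).T)[0, 1])   # near 0: the local patterns cancel
```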