from Hacker News

The CAP theorem of Clustering: Why Every Algorithm Must Sacrifice Something

by fagnerbrack on 12/26/24, 11:04 PM with 21 comments

  • by throw_pm23 on 12/27/24, 12:08 AM

    A cute and thought-provoking theorem for sure, but arguably none of the three given criteria for clustering are well motivated, so the result is much less relevant than usually claimed.

    - scale-invariance: stretching data along some dimensions should not change clustering.

    This is clearly not true: . . . (three well-spaced spots) may reasonably be seen as three clusters, whereas ||| (three nearby elongated bars) may not.

    - richness: all groupings must be reachable.

    Also not quite true: the two cases (1) every point is its own singleton cluster and (2) a single cluster containing all points mean the same thing: no useful cluster structure was found. So it is enough if one of these groupings is reachable, not both.

    - consistency: increasing inter-cluster differences and decreasing intra-cluster differences should not change clustering.

    Also not quite true: suppose we have 9 clusters:

      . . .
      . . .
      . . .
    
    Now move the points so that the columns get farther apart; at some point we get | | |, where 3 clusters are more reasonable (sketched numerically below).
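
    A minimal numeric sketch of this consistency example (assuming numpy and scikit-learn; the spacing values are illustrative): it scores the 3-column partition of the 3x3 grid before and after the columns are moved apart, a change that only increases distances between the nine singleton clusters.

      # Score the "three columns" partition of a 3x3 grid before and after
      # the columns move apart.
      import numpy as np
      from sklearn.metrics import silhouette_score

      def grid(col_spacing):
          # 9 points in 3 columns of 3; rows are 1 apart, columns `col_spacing` apart
          return np.array([[i * col_spacing, j] for i in range(3) for j in range(3)],
                          dtype=float)

      labels_by_column = np.repeat([0, 1, 2], 3)  # cluster label = column index

      for spacing in (1.0, 10.0):
          score = silhouette_score(grid(spacing), labels_by_column)
          print(f"column spacing {spacing:4.1f}: "
                f"silhouette of the 3-column partition = {score:.2f}")
      # With spacing 1.0 the 3-column partition is barely better than arbitrary;
      # with spacing 10.0 it is clearly the natural clustering, even though the
      # transformation only moved (singleton) clusters further apart.
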
  • by monkeyjoe on 12/27/24, 2:30 PM

    I think there is hidden gold in the linked research paper. In the second to last paragraph, it says if you are willing to discard the trivial partition (each point on its own) from Richness, then you *can* have Scale-Invariance and Refinement-Consistency as well. To me this suggests an optimal (in some sense) class of algorithms, perhaps to be chosen from based on computational complexity, cross-validation, etc.
  • by joe_the_user on 12/26/24, 11:49 PM

    It's an interesting idea, but I don't see it as a matter of tradeoffs, since "richness" alone sounds undecidable by itself. I mean, dividing a set into "things having low Kolmogorov complexity" and "things having high Kolmogorov complexity" is definitely undecidable, so "any grouping that might make sense for your data" seems impossible even without the other requirements.
  • by Xcelerate on 12/27/24, 1:14 PM

    I’ve never found a use case for clustering where, if the goal is accurate prediction, some other technique doesn’t work better.

    Now if the goal is a quick prototype or to get an intuitive sense of the structure of the data, then sure, it’s fine.

    But of course you’re always sacrificing something desirable when you try to shoehorn data into a model that doesn’t fit.

  • by keithalewis on 12/27/24, 9:24 AM

    Let S = {1, 2}. Every distance function is determined by d(1,2) = a, a >= 0. Define f(d) = {{1,2}} if a = 0 and f(d) = {{1},{2}} otherwise. Isn't this a clustering algorithm that is scale invariant, rich, and consistent?
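
    A quick brute-force check of this two-point construction (illustrative; note that the usual formulation of the theorem only allows d(i, j) = 0 when i = j, so whether a = 0 is an admissible distance is the crux):

      # The proposed clustering function on S = {1, 2}, where d(1, 2) = a.
      def f(a):
          return [[1, 2]] if a == 0 else [[1], [2]]

      # Scale-invariance: f(alpha * a) == f(a) for every alpha > 0.
      assert all(f(alpha * a) == f(a)
                 for a in (0.0, 0.5, 3.0) for alpha in (0.1, 1.0, 7.0))

      # Richness: both partitions of {1, 2} appear in f's range.
      assert f(0.0) == [[1, 2]] and f(1.0) == [[1], [2]]

      # Consistency: shrinking intra-cluster and growing inter-cluster distances
      # relative to f(a)'s output cannot change it -- if f(a) = {{1, 2}} then a
      # stays 0, and if f(a) = {{1}, {2}} then a can only grow, staying positive.
      assert all(f(a + delta) == f(a)
                 for a in (0.5, 3.0) for delta in (0.0, 1.0, 10.0))

      print("all three properties hold, provided a = 0 is an allowed distance")
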
  • by piker on 12/27/24, 9:56 AM

    Interesting that the author neglects to mention the Gap statistic [1] as a stopping criterion. It's a bit of a mouthful, but it does provide a metric for determining how many clusters the data indicates are present, as compared to a random distribution in the same dimensions (a rough sketch follows the reference below).

    [1] https://academic.oup.com/jrsssb/article-abstract/63/2/411/70...
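
    A rough sketch of the Gap statistic from the linked paper (assuming numpy and scikit-learn; the number of reference datasets and the use of k-means are illustrative choices, not prescribed by the comment):

      # Gap statistic: compare the within-cluster dispersion of the data with
      # that of uniform reference data drawn over the same bounding box, and
      # pick the smallest k with Gap(k) >= Gap(k+1) - s_{k+1}.
      import numpy as np
      from sklearn.cluster import KMeans

      def log_wk(X, k, seed=0):
          # log of the pooled within-cluster dispersion W_k for a k-means fit
          return np.log(KMeans(n_clusters=k, n_init=10, random_state=seed)
                        .fit(X).inertia_)

      def gap_statistic(X, k_max=8, n_refs=20, seed=0):
          rng = np.random.default_rng(seed)
          lo, hi = X.min(axis=0), X.max(axis=0)
          gaps, sks = [], []
          for k in range(1, k_max + 1):
              ref_logs = [log_wk(rng.uniform(lo, hi, size=X.shape), k, seed)
                          for _ in range(n_refs)]
              gaps.append(np.mean(ref_logs) - log_wk(X, k, seed))
              sks.append(np.std(ref_logs) * np.sqrt(1 + 1 / n_refs))
          for k in range(1, k_max):
              if gaps[k - 1] >= gaps[k] - sks[k]:
                  return k          # smallest k satisfying the criterion
          return k_max

      # Typical usage: k_hat = gap_statistic(X) for an (n, d) data array X.
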

  • by jpcom on 12/27/24, 12:54 AM

    Sounds like Richness is an index in SQL terms. Would you agree?