from Hacker News

Comparing Clustering Algorithms

by merusame on 5/2/16, 8:10 AM with 40 comments

  • by mattnedrich on 5/2/16, 2:23 PM

    I wrote an article about mean-shift a while back if anyone is interested in more details about it - https://spin.atomicobject.com/2015/05/26/mean-shift-clusteri...

    Some comments on K-Means - one major limitation of K-Means is that it assumes roughly spherical clusters. It can fail badly on elongated or otherwise non-spherical cluster shapes.
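
    A quick sketch of that limitation (my own toy example, not from the article or the linked post): K-Means vs. a density-based method on the classic "two moons" dataset, scored with adjusted Rand index (ARI) against the true labels.

```python
# Toy illustration (assumed example, not the article's data): K-Means assumes
# roughly spherical clusters and tends to cut each moon in half, while a
# density-based method like DBSCAN can follow the curved shapes.
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=300, noise=0.05, random_state=0)

# K-Means with the "right" k still gets the shapes wrong.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN groups by density; eps here is hand-picked for this synthetic data.
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("K-Means ARI:", adjusted_rand_score(y, km_labels))
print("DBSCAN  ARI:", adjusted_rand_score(y, db_labels))
```

    The eps value is just a guess that happens to suit this synthetic data; in practice it needs tuning per dataset.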

    It's interesting that the author compared results on the same data set for the different algorithms. Each clustering approach works best on a particular kind of data set, so it would be interesting to compare them across several different data sets to get a better feel for their strengths and weaknesses.

  • by jey on 5/2/16, 4:46 PM

    This is very specific to 2D data. I bet the story is a lot different for high-dimensional data. The challenges you encounter with this sort of clumping in 2D are unlikely to occur in high-dimensional data due to the "curse" of dimensionality. Clustering in high dimensions has its own quirks and gotchas too, but they're quite distinct from the gotchas of low dimensionality.
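
    A toy illustration of the point (my own sketch, not from the linked chapter): for random points, the relative spread of pairwise distances collapses as the dimension grows, so the low-dimensional "clumping" intuition stops applying.

```python
# Assumed illustration of distance concentration: the contrast
# (max - min) / min of pairwise distances shrinks as dimension grows.
import numpy as np
from scipy.spatial.distance import pdist

def distance_contrast(dim, n_points=100, seed=0):
    """Relative spread of pairwise distances for uniform random points."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n_points, dim))
    d = pdist(X)  # all unique pairwise Euclidean distances
    return (d.max() - d.min()) / d.min()

for dim in (2, 10, 100, 1000):
    print(dim, round(distance_contrast(dim), 3))
```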

    Here's more from an actual expert: http://research.microsoft.com/en-US/people/kannan/book-chapt...

  • by merusame on 5/2/16, 9:34 AM

    One of the better comparisons of clustering algorithms online. It's also worth checking out how HDBSCAN works: http://nbviewer.jupyter.org/github/lmcinnes/hdbscan/blob/mas...

  • by makmanalp on 5/2/16, 3:51 PM

    These all look like clustering based on a 2d space, but does anyone know methods to tackle clustering on a network?

    Is it just a matter of tweaking the definition of density/distance to use the number of hops, or is it a different problem entirely? I can see how, with only 0 or 1 hops, the distance distribution would be very smushed, whereas 2D distance is much richer and more spread out.
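
    To make the "smushed" point concrete (my own toy example; the graph below is made up): BFS hop distance on a small network takes only a handful of small integer values, unlike continuous 2D distance.

```python
# Assumed toy example: hop distances via breadth-first search on a tiny
# undirected graph, showing how few distinct values the metric produces.
from collections import deque

graph = {  # hypothetical adjacency list
    "a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"],
    "d": ["b", "c", "e"], "e": ["d", "f"], "f": ["e"],
}

def hop_distances(graph, source):
    """Number of hops from source to every reachable node (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

print(hop_distances(graph, "a"))
# Only a few distinct integer values, so density definitions based on
# continuous distance need rethinking on graphs.
```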

  • by Xcelerate on 5/2/16, 1:24 PM

    I would say there are two aspects of clustering that are important: accuracy of the model, and accuracy of the model fitting process.

    A model is a guess about the underlying process that "generates" the data. If you're trying to use hyperplanes to divide data that lies on a manifold, then you are going to have poor results no matter how good your fitting algorithm is.

    On the other hand, even if you know the true model, high levels of noise can prevent you from recovering the correct parameters. For instance, Max-Cut is NP-hard, and the best we can do is a semidefinite programming approximation. Beyond a certain noise threshold, the gap between the SDP solution and the true solution becomes very large very quickly.

  • by bmordue on 5/2/16, 12:31 PM

    This is a great resource. As someone who knows little about clustering, I found it clear and very informative. It covers potential pitfalls much better than other similar documents I've seen, which is useful.

  • by kasperset on 5/2/16, 3:06 PM

    How about t-SNE for clustering? https://lvdmaaten.github.io/tsne/

  • by popra on 5/2/16, 10:32 AM

    As someone interested in the subject but with zero insight or knowledge, would these algorithms be a good match for short-text clustering? For example, identifying identical products in a price-comparison app based on their similar but not identical titles/names and possibly other attributes.
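
    One common baseline for that kind of title matching (my suggestion, not from the article; the product names below are made up): character n-gram TF-IDF plus cosine similarity, which is robust to reordering and small spelling differences.

```python
# Hypothetical sketch: matching near-identical product titles with
# character n-gram TF-IDF and cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

titles = [
    "Apple iPhone 6s 64GB Silver",
    "iPhone 6s 64 GB silver (Apple)",   # same product, different wording
    "Samsung Galaxy S7 32GB Black",     # different product
]

vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
tfidf = vec.fit_transform(titles)
sim = cosine_similarity(tfidf)

print("iPhone vs. iPhone:", round(sim[0, 1], 3))
print("iPhone vs. Galaxy:", round(sim[0, 2], 3))
```

    The similarity matrix could then feed a clustering algorithm, with a threshold tuned on labeled pairs.
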

  • by Wei-1 on 5/2/16, 3:42 PM

    Why not use density clustering based on this 2014 Science paper? http://science.sciencemag.org/content/344/6191/1492

  • by leecarraher on 5/2/16, 2:14 PM

    It would be interesting to know which agglomerative linkage the author is using here. I suspect that for this two-dimensional, density-based cluster dataset, single-link agglomerative clustering would perform much better than what is shown (likely average link).
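
    A sketch of that contrast (my own substitute dataset, since the article's data isn't available here): on concentric rings, single link can follow the shapes by chaining nearest neighbors, while average link tends to cut them apart.

```python
# Assumed illustration: single-link vs. average-link agglomerative
# clustering on concentric circles, scored with adjusted Rand index.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

X, y = make_circles(n_samples=400, factor=0.5, noise=0.04, random_state=0)

single = AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(X)
average = AgglomerativeClustering(n_clusters=2, linkage="average").fit_predict(X)

print("single-link ARI:", adjusted_rand_score(y, single))
print("average-link ARI:", adjusted_rand_score(y, average))
```

    The flip side is that single link is fragile: a few noise points bridging two dense regions can chain them into one cluster.
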

  • by amelius on 5/2/16, 12:54 PM

    Interesting, but the subtitle "Why you should use HDBSCAN" makes little sense on a dataset of N=1.

  • by chestervonwinch on 5/2/16, 12:52 PM

    Cool. I have not looked into the *DBSCAN methods, yet. This post makes me think I should.

  • by dweinus on 5/2/16, 6:19 PM

    Great article! Does anyone know of an implementation of HDBSCAN for R?

  • by graycat on 5/2/16, 12:11 PM

    Can find such comparisons by Glenn W. Milligan (Ohio State) back to at least 1980.