from Hacker News

Why you should be wary of relying on a single histogram of a data set

by aton on 4/12/13, 4:09 AM with 20 comments

  • by jfim on 4/12/13, 5:30 AM

    As mentioned, one should really be using a kernel density plot instead of a histogram, except when there are already classes in the data.

    In R, one can simply do:

      library("ggplot2")
      library("datasets")
      ggplot(faithful, aes(x=eruptions)) + geom_density() + geom_rug()
    
    which gives a chart like this (http://jean-francois.im/temp/eruptions-kde.png). Contrast with:

      ggplot(faithful, aes(x=eruptions)) + geom_histogram(binwidth=1)
    
    which gives a chart like this (http://jean-francois.im/temp/eruptions-histogram.png).
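
    A rough sketch of why the single histogram is fragile, using only base R on the same `faithful` data: the bin width and the bin origin are both free choices, and the counts change with each. The particular break points below are arbitrary picks for illustration, not values taken from the article.

    ```r
    # Sketch: the same data binned three ways (base R only).
    x <- faithful$eruptions

    # Same bin width (1 minute), two different bin origins.
    h_a <- hist(x, breaks = seq(1, 6, by = 1), plot = FALSE)
    h_b <- hist(x, breaks = seq(0.5, 5.5, by = 1), plot = FALSE)

    # Narrower bins (0.25 minutes).
    h_c <- hist(x, breaks = seq(1.5, 5.25, by = 0.25), plot = FALSE)

    # Compare the bin counts directly.
    print(h_a$counts)
    print(h_b$counts)
    print(h_c$counts)

    # Draw the three histograms side by side.
    op <- par(mfrow = c(1, 3))
    plot(h_a, main = "width 1, origin 1")
    plot(h_b, main = "width 1, origin 0.5")
    plot(h_c, main = "width 0.25")
    par(op)
    ```

    All three histograms contain every observation; only the grouping differs.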

    Edit: Other plots mentioned in this discussion:

      ggplot(faithful, aes(x = eruptions)) + stat_ecdf(geom = "step")
    
    Cumulative distribution, as suggested by leot (http://jean-francois.im/temp/eruptions-ecdf.png)

      qqnorm(faithful$eruptions)
    
    Q-Q plot, as suggested by christopheraden (http://jean-francois.im/temp/eruptions-qq.png)
  • by leot on 4/12/13, 5:49 AM

    Yes, probability density estimation might be fun, but the simplest thing to do when comparing distributions, if you're worried about binning issues, is to plot their empirical cumulative distribution functions.
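
    A minimal sketch of that comparison in base R. The two samples `a` and `b` are simulated stand-ins (an assumption for illustration, not the article's data); `ecdf()` needs no bin width or bandwidth, so there is nothing to tune.

    ```r
    # Hypothetical samples, simulated only so the example is self-contained.
    set.seed(1)
    a <- rnorm(50, mean = 0)
    b <- rnorm(50, mean = 0.5)

    # ecdf() returns a step function: Fa(t) is the fraction of `a` that is <= t.
    Fa <- ecdf(a)
    Fb <- ecdf(b)

    # Overlay the two empirical CDFs on shared axes.
    plot(Fa, verticals = TRUE, do.points = FALSE, col = "red",
         xlim = range(a, b), main = "Empirical CDFs")
    lines(Fb, verticals = TRUE, do.points = FALSE, col = "blue")
    legend("bottomright", legend = c("a", "b"),
           col = c("red", "blue"), lty = 1)
    ```

    A horizontal gap between the two curves reads directly as a shift between the samples.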
  • by dude_abides on 4/12/13, 6:13 AM

    This is what you should be doing:

      # Assumes Annie, Brian, Chris and Zoe are numeric vectors holding the article's data.
      plot(density(Annie), col="red")
      lines(density(Brian), col="blue")
      lines(density(Chris), col="green")
      lines(density(Zoe), col="cyan")
    
    This is the plot you get: http://i.imgur.com/sY2awX7.png
  • by tantalor on 4/12/13, 4:37 AM

  • by christopheraden on 4/12/13, 5:43 AM

    Interesting paradox. I haven't seen that many statisticians using just a histogram when determining whether a certain distribution fits data reasonably. Kernel density estimators are a much better choice (for continuous data, like the data in the post), but they are also affected by your choice of bandwidth. When it comes down to it, like going to the doctor, sometimes the best choice is to get a second (or third!) opinion. For what it's worth, drawing a Q-Q plot (something I've seen in every statistical consultation I've ever done) reveals the dependence structure of the data immediately and obviously, in the form of a perfect linear relationship between any two variables.
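
    Both points can be sketched in base R. The first half redraws one density at three bandwidths via the `adjust` multiplier of `density()`. The second half assumes, purely for illustration, that a second variable `y` is the first shifted by a constant 0.25; `y` is hypothetical and not the article's data.

    ```r
    x <- faithful$eruptions

    # Same data, three bandwidths: `adjust` scales the default bandwidth.
    d_narrow  <- density(x, adjust = 0.25)
    d_default <- density(x)
    d_wide    <- density(x, adjust = 4)

    plot(d_narrow, col = "red", main = "Effect of bandwidth")
    lines(d_default, col = "black")
    lines(d_wide, col = "blue")

    # Hypothetical second sample: every value shifted by a constant.
    y <- x + 0.25

    # Two-sample Q-Q plot: sorted x against sorted y.
    qq <- qqplot(x, y, main = "Q-Q plot of x against a shifted copy")
    abline(0.25, 1, lty = 2)  # reference line: intercept 0.25, slope 1
    ```

    Under that assumption every point sits on the dashed line, while the density curves change shape noticeably with the bandwidth.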
  • by radarsat1 on 4/12/13, 9:38 AM

    Is this basically just an effect of quantization aliasing?