from Hacker News

Artificial data give the same results as real data without compromising privacy

by sibmike on 3/19/18, 3:32 AM with 46 comments

  • by Cynddl on 3/19/18, 11:33 AM

    I'm highly dubious of the ability of synthetic data to accurately model datasets without introducing unexpected bias, especially when it comes to accounting for causality.

    If you dig through the original paper, the conclusion is in line with that:

    “For 7 out of 15 comparisons, we found no significant difference between the accuracy of features developed on the control dataset vs. those developed on some version of the synthesized data; that is, the result of the test was False.”

    So, on the tests they developed, the proposed method doesn't work 8 times out of 15…
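
    For concreteness, here is a rough sketch of the kind of comparison the quoted sentence describes, assuming a classification task scored on held-out real data and NumPy-array inputs; the model, splits, and significance test are illustrative stand-ins, not the paper's actual pipeline.

      # Fit the same model on the real (control) data and on the synthesized
      # data, score both on held-out real data, and test whether the accuracy
      # gap is statistically significant.
      from scipy.stats import ttest_rel
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import ShuffleSplit

      def compare_real_vs_synthetic(X_real, y_real, X_syn, y_syn, n_splits=15):
          real_acc, syn_acc = [], []
          splitter = ShuffleSplit(n_splits=n_splits, test_size=0.3, random_state=0)
          for train_idx, test_idx in splitter.split(X_real):
              X_test, y_test = X_real[test_idx], y_real[test_idx]
              real_model = LogisticRegression(max_iter=1000).fit(
                  X_real[train_idx], y_real[train_idx])
              syn_model = LogisticRegression(max_iter=1000).fit(X_syn, y_syn)
              real_acc.append(real_model.score(X_test, y_test))
              syn_acc.append(syn_model.score(X_test, y_test))
          # "No significant difference" corresponds to a large p-value here.
          return ttest_rel(real_acc, syn_acc)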

  • by sriku on 3/19/18, 10:45 AM

    I haven't read the original paper (yet), but something doesn't sit right with the work, if the way it is portrayed is indeed faithful to it and I'm not missing something important.

    - It looks like the work of the data scientists will be limited by the extent of the modeling already done by the recursive conditional parameter aggregation. (edit: So why not just ship that model and adapt it, instead of using it to generate data?)

    - Its "validation" appears to be doubly proxied - i.e. the usual performance measures are themselves a proxy, and now we're comparing those against performance measures derived from models built on data that was itself generated by a model. I'm not inclined to trust a validation that is so far removed.

    Anyone who can explain this well?

  • by mehrdadn on 3/19/18, 8:59 AM

    On a parallel note, search for "thresholdout". It's another (genius, I think) way to "stretch" how far your data goes in training a model. I won't do a better job trying to explain it than those who already have, so I won't try—here's a nice link explaining it instead: http://andyljones.tumblr.com/post/127547085623/holdout-reuse
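
    For reference, a rough sketch of the Thresholdout mechanism that post describes, assuming per-row query values passed in as NumPy arrays; the budget bookkeeping is omitted and the threshold/noise parameters are illustrative.

      # Answer one adaptive query: each *_vals array holds the query evaluated
      # on every row of the training set / holdout set respectively.
      import numpy as np

      def thresholdout(train_vals, holdout_vals, threshold=0.04, sigma=0.01, rng=None):
          rng = rng or np.random.default_rng(0)
          train_est = np.mean(train_vals)
          holdout_est = np.mean(holdout_vals)
          # Reveal a noised holdout estimate only when it disagrees noticeably
          # with the training estimate; otherwise echo the training value,
          # which leaks nothing new about the holdout set.
          if abs(train_est - holdout_est) > threshold + rng.laplace(0, 2 * sigma):
              return holdout_est + rng.laplace(0, sigma)
          return train_est
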
  • by lokopodium on 3/19/18, 4:20 AM

    They use real data to create artificial data. So, real data is still more useful.
  • by pavon on 3/19/18, 6:47 PM

    If I were responsible for protecting the privacy of data, I don't know that I would be comfortable with this method. Anonymization of data is hard, and frequently turns out to be not as anonymous as originally thought. At a high level, this sounds like they are training an ML system on your data and then using it to generate similar data. What sort of guarantees can be given that the ML system won't reproduce your data with too high a fidelity? I've seen too many image generators that output images very close to the data they were trained on. You could compare the two datasets and look for similarities, but you'd have to have good metrics of what sort of similarity was bad and what sort was good, and I could see that being tricky in both directions (see the sketch after this comment).

    Although, I suppose that if the data was already anonymized to the best of your ability, and this was then run on top of that as an additional layer of protection, that might be okay.
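
    A minimal sketch of the kind of dataset-to-dataset similarity check mentioned above, assuming numeric feature matrices as NumPy arrays; the distance metric and cutoff are illustrative guesses, and choosing them well is exactly the tricky part.

      # Flag synthetic rows whose nearest real row is suspiciously close, as a
      # crude audit for memorized or near-copied records.
      import numpy as np

      def near_copy_fraction(real, synthetic, cutoff=0.05):
          # Standardize on the real data so distances are comparable per column.
          mu, sd = real.mean(axis=0), real.std(axis=0) + 1e-9
          r, s = (real - mu) / sd, (synthetic - mu) / sd
          # Nearest-neighbour distance for each synthetic row (one row at a
          # time, to keep memory use small).
          nearest = np.array([np.linalg.norm(r - row, axis=1).min() for row in s])
          return float((nearest < cutoff).mean())

    Too strict a cutoff lets memorized records slip through; too loose a cutoff flags genuinely novel records, which is the "tricky in both directions" problem.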

  • by lopmotr on 3/19/18, 9:41 AM

    I wonder how secure it is against identifying individuals. With over-fitting, you can end up producing the training data as output. Hopefully they have a robust way to prevent that, or any other kind of reverse engineering of the output to work out the original data.
  • by srean on 3/19/18, 6:39 PM

    Could not get hold of the paper. Are they doing Gibbs sampling, or a semiparametric variant of that?

    https://en.wikipedia.org/wiki/Gibbs_sampling

    Generating tuples (rows) by Gibbs sampling would allow generation of samples from the joint distribution, which in turn would preserve all the correlations, conditional probabilities, etc. This can be done by starting at an original tuple chosen at random and then repeatedly mutating the tuple by overwriting one of its fields (columns). To overwrite, one selects another random tuple that 'matches' the current one at all positions other than the column selected for overwriting, and takes its value for that column. The match might need to be relaxed from an exact match to a 'close' match (a minimal version is sketched at the end of this comment).

    If the conditional distribution for some conditioning event has very low entropy, or the overall conditional entropy is low, one would need to fuzz the original values to preserve privacy, but this will come at the expense of distorting the correlations and conditionals.
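
    A minimal sketch of that resampling scheme, assuming a small table of categorical columns stored as a list of dicts and exact (unrelaxed) matching; the function name and usage are purely illustrative.

      import random

      def gibbs_style_row(rows, columns, n_sweeps=50, rng=None):
          rng = rng or random.Random(0)
          # Start from a random original tuple.
          current = dict(rng.choice(rows))
          for _ in range(n_sweeps):
              col = rng.choice(columns)                 # field to overwrite
              rest = [c for c in columns if c != col]
              # Donor rows: those matching `current` on every other column.
              donors = [r for r in rows if all(r[c] == current[c] for c in rest)]
              # Resample the chosen field from the empirical conditional.
              current[col] = rng.choice(donors)[col]
          return current

    Note that with exact matching the chain can only ever land on tuples already present in the table, which is exactly why the match would need to be relaxed (and low-entropy conditionals fuzzed) before there is any privacy benefit.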

  • by _0ffh on 3/19/18, 10:12 AM

    Seems like it would only be helpful for testing methods that can't capture any correlations the original method didn't.
  • by _5659 on 3/19/18, 7:33 AM

    Is this akin at all to random sampling with replacement, i.e. bootstrapping?
  • by EGreg on 3/19/18, 10:22 AM

    How is this related to and different from differential privacy?
  • by fardin1368 on 3/19/18, 8:21 PM

    I am looking into their experiments. Seems most of them are pretty simple predictions/classifications. No wonder they get good results.

    The claim is too bold and I would reject this paper. They should clarify that the data is good enough for linear regression, not say that there is no difference between real and synthetic data.

  • by dwheeler on 3/19/18, 12:29 PM

    The abstract claims there was no difference only 70% of the time. So 30% of the time there was a difference. Unsurprisingly it greatly limits the kind of data analysis that was allowed, which greatly reduces the applicability even if you believe it. I'm pretty dubious of this work anyway.
  • by anon1253 on 3/19/18, 4:50 PM

    Heh. I wrote a paper about this a while ago https://www.liebertpub.com/doi/full/10.1089/bio.2014.0069
  • by aspaceman on 3/19/18, 9:48 AM

    Does someone have a link to the preprint / arxiv? The link in the story is a 404 (I presume that the paper just hasn't been posted yet or something?)
  • by sandGorgon on 3/19/18, 1:40 PM

    Sounds very similar to homomorphic encryption, except with no compromise in performance.

    I wonder if this is the technique behind Numerai.

  • by bschreck on 3/19/18, 5:03 PM

    The link to the actual paper is now working