by sibmike on 3/19/18, 3:32 AM with 46 comments
by Cynddl on 3/19/18, 11:33 AM
If you dig through the original paper, the conclusion comes down to this:
“For 7 out of 15 comparisons, we found no significant difference between the accuracy of features developed on the control dataset vs. those developed on some version of the synthesized data; that is, the result of the test was False.”
So, on the tests they developed, the proposed method doesn't hold up in 8 out of 15 comparisons…
by sriku on 3/19/18, 10:45 AM
- It looks like the work of the data scientists will be limited to the extent of the modeling already done by recursive conditional parameter aggregation. (edit: So why not just ship that model and adapt it instead of using it to generate data?)
- Its "validation" appears to be doubly proxied - i.e. the normal performance measures we use are themselves a proxy, and now we're comparing those against these performance measures derived from models built out of the data generated by these models. I'm not inclined to trust a validation that is so removed.
Anyone who can explain this well?
by mehrdadn on 3/19/18, 8:59 AM
by lokopodium on 3/19/18, 4:20 AM
by pavon on 3/19/18, 6:47 PM
Although, I suppose that if the data were already anonymized to the best of your ability, and then this was run on top of that as an additional layer of protection, that might be okay.
by lopmotr on 3/19/18, 9:41 AM
by srean on 3/19/18, 6:39 PM
https://en.wikipedia.org/wiki/Gibbs_sampling
Generating tuples (rows) by Gibbs sampling allows generation of samples from the joint distribution. This in turn would preserve all correlations, conditional probabilities, etc. This can be done by starting at an original tuple chosen at random and then repeatedly mutating the tuple by overwriting one of its fields (columns). To overwrite, one selects another random tuple that 'matches' the current one at all positions other than the column selected for overwriting. The match might need to be relaxed from an exact match to a 'close' match.
If the conditional distribution for some conditioning event has very low entropy, one would need to fuzz the original values to preserve privacy, but this comes at the expense of distorting the correlations and conditionals.
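A minimal sketch of the exact-match version of the scheme described above (the function name and the toy dataset are my own inventions, not from the paper): start from a random original row, then repeatedly pick a column and overwrite it with that column's value from a random "donor" row that agrees on every other column.

```python
import random

def gibbs_resample(rows, n_steps=100, rng=None):
    """Generate one synthetic tuple by Gibbs-style resampling over a table.

    rows: list of equal-length tuples (the original dataset).
    Each step overwrites one field of the current tuple with the value
    from a random tuple that matches it on all the *other* fields, i.e.
    a draw from the empirical conditional distribution of that column.
    """
    rng = rng or random.Random()
    current = list(rng.choice(rows))
    n_cols = len(current)
    for _ in range(n_steps):
        col = rng.randrange(n_cols)
        # Donors: tuples agreeing with `current` everywhere except `col`.
        # Never empty, since `current` itself always qualifies.
        donors = [r for r in rows
                  if all(r[j] == current[j]
                         for j in range(n_cols) if j != col)]
        current[col] = rng.choice(donors)[col]
    return tuple(current)
```

Note that with exact matching the sampler can only produce combinations reachable from the observed rows, which is exactly why the comment suggests relaxing to a 'close' match (and why low-entropy conditionals leak the original values unless fuzzed).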
by _0ffh on 3/19/18, 10:12 AM
by _5659 on 3/19/18, 7:33 AM
by EGreg on 3/19/18, 10:22 AM
by fardin1368 on 3/19/18, 8:21 PM
The claim is too bold and I would reject this paper. They should clarify that the data is merely good enough for linear regression, not claim there is no difference between real and synthetic data.
by dwheeler on 3/19/18, 12:29 PM
by anon1253 on 3/19/18, 4:50 PM
by aspaceman on 3/19/18, 9:48 AM
by sandGorgon on 3/19/18, 1:40 PM
I wonder if this is the technique behind Numerai.
by bschreck on 3/19/18, 5:03 PM