from Hacker News

Transform Data by Example [video]

by gggggggg on 5/17/17, 10:32 PM with 87 comments

  • by teddyh on 5/17/17, 11:51 PM

    You know what this reminds me of? Those trained neural-net things which, however many training examples you give them, always seem to find some way to “cheat” and not do what you want while still obeying all your training data correctly.

    Something like this: Suppose we have a table of strings of digits, some including spaces, and we’d like to remove the spaces. From

      123 456
      234567
      345 678
    
    to

      123456
      234567
      345678
    
    Now, what happens if it encounters, say

      4567890
    
    Would the result be unchanged (as we would probably want), or would it “cheat” and remove the middle “7” character, giving “456890”?
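
    The ambiguity can be made concrete with a minimal Python sketch. Both functions below are hypothetical, named only for illustration, and have nothing to do with how the product actually searches for programs. They agree on all three example rows and disagree only on the new input.

```python
def remove_spaces(s: str) -> str:
    """The intended program: delete every space."""
    return s.replace(" ", "")


def drop_fourth_of_seven(s: str) -> str:
    """A 'cheating' program: if the string has 7 characters, delete the 4th."""
    return s[:3] + s[4:] if len(s) == 7 else s


examples = [("123 456", "123456"), ("234567", "234567"), ("345 678", "345678")]

# Both candidates reproduce every training example exactly.
for candidate in (remove_spaces, drop_fourth_of_seven):
    assert all(candidate(src) == dst for src, dst in examples)

# They only diverge on input the examples never covered.
print(remove_spaces("4567890"))         # 4567890
print(drop_fourth_of_seven("4567890"))  # 456890
```

    Nothing in the three examples rules out the second function, so a synthesizer needs some ranking preference (for instance, favoring programs that mention the space character) to pick the first one.
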
  • by ktamura on 5/17/17, 11:19 PM

    This is a great product idea. If you ask any Excel power users, by far the most time-consuming and hard-to-automate task is text and date manipulation.

    The beauty of this product is that its adoption strategy is baked into the product itself: I'd share this with all Excel user friends of mine because I want the algorithm to get smarter, and I might even learn a bit of C# myself so that I can contribute and scratch my own itch. This in turn makes the product better (because of the larger training data), lending itself to more word of mouth.

    One concern I have is security: I'd love to hear from folks who built this, or who are more familiar with it, about how to ensure the security of suggested transformations.

  • by Cieplak on 5/18/17, 3:16 AM

    I wonder if it uses Z3 under the hood for solving constraints. Very nice of MSFT to MIT license Z3. It's super useful for problems that result in circular dependencies when modeled in Excel, and require iterative solvers (e.g., goal seek). I use the python bindings, but unfortunately it's not as simple as `pip install` and requires a lengthy build/compilation. Well worth the effort, though.

    https://github.com/Z3Prover/z3

    https://github.com/Z3Prover/z3/issues/288
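
    For readers unfamiliar with goal seek, here is a rough sketch of the idea in plain Python. It uses simple bisection, not Z3, so that it runs without any build step. The function name `goal_seek` and the bonus model are made up for illustration.

```python
from typing import Callable


def goal_seek(f: Callable[[float], float], target: float,
              lo: float, hi: float, tol: float = 1e-9) -> float:
    """Find x in [lo, hi] with f(x) close to target by bisection.

    Assumes f is continuous and that f(lo) - target and f(hi) - target
    do not have the same sign.
    """
    g_lo = f(lo) - target
    if g_lo * (f(hi) - target) > 0:
        raise ValueError("target is not bracketed by [lo, hi]")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        g_mid = f(mid) - target
        if g_lo * g_mid <= 0:
            hi = mid                 # solution lies in [lo, mid]
        else:
            lo, g_lo = mid, g_mid    # solution lies in [mid, hi]
    return (lo + hi) / 2


# Hypothetical circular model: the bonus is 10% of profit after the bonus,
# i.e. bonus = 0.10 * (1000 - bonus). Seek the value where both sides agree.
bonus = goal_seek(lambda b: b - 0.10 * (1000 - b), target=0.0, lo=0.0, hi=1000.0)
print(round(bonus, 4))  # 90.9091
```

    A constraint solver can state the same equation declaratively instead of iterating, which is presumably the appeal of Z3 for models with many interdependent cells.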

  • by gergoerdi on 5/18/17, 5:16 AM

    Check out MagicHaskeller which figures out list processing functions from examples: http://nautilus.cs.miyazaki-u.ac.jp/~skata/MagicHaskeller.ht...

    For example, given the rule `f "abcde" 2 == "aabbccddee"`, it even figures out the role of the parameter `2`, so `f "zq" 3` gives `"zzzqqq"`.
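
    For anyone who doesn't read Haskell, the function that this single example pins down can be written by hand in Python as below. This is only a sketch of the target behavior, not MagicHaskeller's output or its search procedure.

```python
def f(s: str, n: int) -> str:
    """Repeat each character of s n times, keeping the original order."""
    return "".join(ch * n for ch in s)


assert f("abcde", 2) == "aabbccddee"   # the example given to the synthesizer
print(f("zq", 3))                      # zzzqqq
```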

  • by bcherny on 5/18/17, 2:36 AM

  • by netvarun on 5/17/17, 11:51 PM

    Is this related/a commercial application of the 'Deep Learning for Program Synthesis' post[0][1] from Microsoft Research on HN a month ago?

    [0]https://www.microsoft.com/en-us/research/blog/deep-learning-...

    [1]HN Discussion: https://news.ycombinator.com/item?id=14168027

  • by martinthenext on 5/18/17, 11:54 AM

    Oh man, we did it before Microsoft!

    http://comnsense.io/

    https://youtu.be/ALF9GY2K-wc

  • by wayneprice on 5/17/17, 11:36 PM

    I'm playing around with a client-side js implementation of this at https://www.robosheets.com/

    It's not production ready / launched yet, but it's getting there.

    I'd be interested to hear who finds (or really doesn't find) this useful :)

  • by gerhardi on 5/18/17, 7:33 AM

    This was also included in the query editor of Microsoft's Power BI in the release a month or two ago. First you select the columns to be used as a source, then start writing example values into the new column to be generated. It also shows the generated M/PowerQuery expression.

    It can't do miracles, but it saves time in many cases, such as when you want to concatenate values from different columns into a single column in a new format.

  • by fiatjaf on 5/18/17, 1:07 AM

    See also http://www.transformy.io/#/app

    Ok, just realized somehow the site has vanished. Not working archived version: http://web.archive.org/web/20161028231256/https://www.transf...

  • by unfamiliar on 5/18/17, 12:29 AM

    Humans are really good at taking a vague description of a task and using a small number of examples to disambiguate it.

    For example, "sort all of the folders, so that Alan goes before Amy, etc". The rule ("sort") is pretty ambiguous, but one simple example in the context gives enough information to realise you probably mean alphabetically by first name.

    Is there something like this example that could be combined with NLP to make things like these "intelligent assistants" we have now much more useful for data processing tasks?

    It would be great to describe data manipulation to a machine the way that I would describe it to a colleague: give an overview of an algorithm, watch how they interpret it, and correct with a couple of examples in a feedback loop. Currently describing such things for a machine requires writing the algorithm manually in a programming language.

  • by logicallee on 5/18/17, 12:22 AM

    It would be nice if it indicated where it was making stuff up. In the zip code example, for the rows that were missing some data, it just makes it up, and those rows are not visually distinguished from the rows where it added nothing that wasn't already in the input.

    What I mean is if every row had a date like "12 May 2002" and you wanted it turned into 2002.05.12 then it would be nice if it indicated when it added data. For example if one of the rows just read "15 May" then, since there is no year, it would not be completely absurd if it transformed into 2017.05.15 - or if all of the other data is 2002, then adding that. But I really think silently adding data that was not in the input is going too far. A transform shouldn't ever silently inject plausible data with no indication that this is interpolated. Bad things can result.

    Otherwise great demo!
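
    One way to picture the request is a transform that returns a flag alongside each value. The Python sketch below is an assumption about what such an interface could look like, not how the demo works. The function name `to_dotted_date` and the `default_year` parameter are hypothetical, and `%B` assumes English month names (the default C locale).

```python
from datetime import datetime
from typing import Tuple


def to_dotted_date(text: str, default_year: int) -> Tuple[str, bool]:
    """Convert '12 May 2002' to '2002.05.12'.

    Returns (result, inferred). inferred is True when the input had no
    year and default_year was filled in.
    """
    cleaned = text.strip()
    try:
        parsed = datetime.strptime(cleaned, "%d %B %Y")
        inferred = False
    except ValueError:
        # No year in the input: append the default year and flag the row.
        parsed = datetime.strptime(f"{cleaned} {default_year}", "%d %B %Y")
        inferred = True
    return parsed.strftime("%Y.%m.%d"), inferred


for row in ["12 May 2002", "15 May"]:
    value, inferred = to_dotted_date(row, default_year=2002)
    print(value, "(year inferred)" if inferred else "")
```

    A spreadsheet UI could use the flag to shade or annotate the interpolated cells, so that made-up years never look the same as years that were in the data.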

  • by mballantyne on 5/18/17, 6:26 AM

    I believe this is the implementation described in this paper published at POPL 2016:

    https://www.microsoft.com/en-us/research/publication/transfo...

    Though it probably also uses more recent work from the same group:

    https://www.microsoft.com/en-us/research/people/sumitg/

  • by gshulegaard on 5/18/17, 3:13 AM

    Excel is a really powerful tool. If you are fine with needing Windows or Mac (i.e. not Linux) and you are OK with their licensing constraints, it's pretty hard to beat.
  • by tdbeteam on 5/19/17, 10:47 PM

    Relationship to FlashFill feature in Excel: FlashFill is a popular feature in Excel that also uses the example-driven paradigm to automatically produce transformations. While FlashFill supports string-based transformations, Transform Data by Example can leverage sophisticated domain-specific functions to perform semantic transformations beyond string manipulations. For examples, see: https://www.microsoft.com/en-us/research/wp-content/uploads/...
  • by JoelJacobson on 5/18/17, 7:37 AM

    I hacked together something similar that learns row/column offsets for different fields in a text file, and converts it into a normal CSV, i.e. a normal table.

    https://github.com/trustly/fixed2csv
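
    To illustrate the general idea of recovering column offsets from a fixed-width file, here is a minimal Python sketch. It is not the fixed2csv algorithm. It uses a simple heuristic chosen for illustration: a character position counts as a separator if it is blank in every line. The function names and sample data are hypothetical.

```python
import csv
import io
from typing import List, Tuple


def infer_columns(lines: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) offsets of position runs that are non-blank in some line."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    # A position is "used" if at least one line has a non-space character there.
    used = [any(line[i] != " " for line in padded) for i in range(width)]
    spans, start = [], None
    for i, flag in enumerate(used + [False]):   # sentinel closes the last span
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            spans.append((start, i))
            start = None
    return spans


def fixed_to_csv(text: str) -> str:
    """Split each non-empty line at the inferred offsets and emit CSV."""
    lines = [line for line in text.splitlines() if line.strip()]
    spans = infer_columns(lines)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        writer.writerow(line[a:b].strip() for a, b in spans)
    return out.getvalue()


sample = (
    "id   name    amount\n"
    "1    Alice    12.50\n"
    "22   Bob       3.00\n"
)
print(fixed_to_csv(sample))
```

    The heuristic breaks down when a field contains spaces that happen to line up in every row, which is one reason a tool that learns offsets from examples would be more robust.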

  • by matt4711 on 5/18/17, 12:07 AM

    There is a paper describing such a method (not sure if that is what was implemented):

    "Zhongjun Jin, Michael R. Anderson, Michael J. Cafarella, H. V. Jagadish: Foofah: Transforming Data By Example. SIGMOD Conference 2017: 683-698"

  • by captnswing on 5/18/17, 9:09 AM

    Seems similar to http://openrefine.org/
  • by copperx on 5/18/17, 1:33 AM

    That's great, I always loved Auto Fill in Excel, and this brings it to the Mac.
  • by Kiro on 5/18/17, 6:05 AM

    I would love something similar for Google Spreadsheet.
  • by amelius on 5/18/17, 12:22 PM

    I want this in Vim :)

    This would be great for refactoring code.

  • by tejtm on 5/18/17, 12:08 AM

    alas it is too late, it transformed our genes to dates, no sequence for Bill
  • by cblte on 5/18/17, 1:09 PM

    not usable for companies and secured networks. :-( too bad
  • by sjg007 on 5/18/17, 2:15 AM

    There's a huge opportunity in making Excel better.