from Hacker News

Biological Function Emerges from Unsupervised Learning on 250M Protein Sequences

by smhx on 4/30/19, 5:30 PM with 49 comments

  • by gigantum on 4/30/19, 7:58 PM

    Like some of the other ML/AI posts that made it to the front page today, this research gives no clear way to reproduce its results. I looked through the preprint page as well as the full manuscript itself.

    Without reproducibility and transparency in the code and data, the impact of this research is ultimately limited. No one else can recreate, iterate on, or refine the results, nor can anyone rigorously evaluate the methodology (beyond guessing after reading the manuscript).

    The year is 2019; many are finally realizing it's time to back up your results with code, data, and some kind of specification of the computing environment you used. Science is about sharing your work for others in the research community to build upon. Leave the manuscript as the pretty formality.

  • by andbberger on 4/30/19, 10:11 PM

    I find this paper so steeped in hype and dogma as to be nearly incomprehensible.

    Which is a shame, because it's a reasonable approach. I just wish they had frickin' described what they did instead of spending the whole paper monologuing and showcasing unconvincing experiments. No need to justify what you're doing; just do it.

  • by ArtWomb on 4/30/19, 8:09 PM

    Fergus Lab at NYU. I believe he's across the hall from Yann LeCun as well ;)

    Still a long way from a Theory of Biogenesis. But a good next step is using a differentiable model to predict novel proteins which have no analogue in Nature. Much like Materials Genome researchers searching for stable phases of matter!

    "Training ever bigger convnets and LSTMs on ever bigger datasets gets us closer to Strong AI -- in the same sense that building taller towers gets us closer to the moon." --François Chollet

  • by obviuosly on 4/30/19, 10:09 PM

    > The resulting model maps raw sequences to representations of biological properties without labels or prior domain knowledge.

    A few questions:

    1. What are those representations?

    2. Also, what is "biological function"?

    3. What kind of information does the learned representation extract that is not already in the "biological properties" it is trained to map to?

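    For what it's worth, my guess at (1) is that the "representations" are just the network's hidden states, read off after unsupervised training and pooled into one vector per sequence. A toy sketch of that idea (hypothetical PyTorch, not the paper's actual model):

      import torch
      import torch.nn as nn

      AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
      vocab = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

      class TinyProteinLM(nn.Module):
          """Toy stand-in for the paper's model: a small Transformer encoder
          trained only on token prediction, with no biological labels."""
          def __init__(self, d_model=64):
              super().__init__()
              self.embed = nn.Embedding(len(vocab), d_model)
              layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
              self.encoder = nn.TransformerEncoder(layer, num_layers=2)
              self.head = nn.Linear(d_model, len(vocab))  # token-prediction objective

          def forward(self, tokens):
              hidden = self.encoder(self.embed(tokens))  # (batch, length, d_model)
              return self.head(hidden), hidden

      model = TinyProteinLM()
      seq = torch.tensor([[vocab[aa] for aa in "MKTAYIAKQR"]])
      logits, hidden = model(seq)
      representation = hidden.mean(dim=1)  # one vector per sequence
      print(representation.shape)          # torch.Size([1, 64])

    On that reading, (3) is asking whether probing these vectors with, say, a linear classifier recovers properties the model was never given labels for.
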
  • by tepal on 4/30/19, 7:53 PM

    This blog post seems to anticipate this happening: https://moalquraishi.wordpress.com/2019/04/01/the-future-of-...

  • by superfx on 4/30/19, 6:53 PM

  • by shpongled on 4/30/19, 8:16 PM

    This is cool, but it would be significantly cooler if they did some kind of biological follow-up. Perhaps getting their model to output an "ideal" sequence for a desired enzymatic function and then swapping that domain into an existing protein lacking the new function.

  • by lucidrains on 4/30/19, 5:40 PM

    Language, music, and now amino acid sequences. Attention is all you need.
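    For anyone who hasn't seen it spelled out, the scaled dot-product attention at the heart of that paper really is just a few lines; a toy sketch in plain PyTorch (not the paper's code):

      import math
      import torch

      def attention(q, k, v):
          # Each position attends to every other position in the sequence,
          # be it words, notes, or amino acid residues.
          scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
          weights = torch.softmax(scores, dim=-1)
          return weights @ v

      x = torch.randn(1, 10, 16)        # e.g. a 10-residue sequence
      print(attention(x, x, x).shape)   # torch.Size([1, 10, 16])
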
  • by a_bonobo on 5/1/19, 2:07 PM

    Here's a very cool GitHub repository which uses unsupervised learning (ULMFiT) in the genomics space: https://github.com/kheyer/Genomic-ULMFiT

    Very impressive accuracies on hard tasks, and it's open source!
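    The core recipe, as I understand it, is the standard ULMFiT pipeline with DNA tokenized into overlapping k-mers: pretrain a language model on unlabeled genomes, fine-tune it on the target data, then train a classifier on top. The tokenization step is roughly this (my paraphrase, not the repo's code):

      def kmer_tokenize(seq, k=4, stride=2):
          """Split a DNA string into overlapping k-mer tokens."""
          return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

      print(kmer_tokenize("ATGCGTAC"))  # ['ATGC', 'GCGT', 'GTAC']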

  • by cellular on 4/30/19, 7:58 PM

    I find these emergent behaviours fascinating: https://youtu.be/gaFKqOBTj9w