by lysozyme on 8/12/24, 11:31 PM with 25 comments
by jdeaton on 8/13/24, 10:02 PM
I think this passage gets at the fundamental rift between those focused purely on computational advances and those innovating in wet lab techniques.
Why? Because years of people's careers have been wasted waiting on promises from molecular biologists claiming they will make these "clever" high-throughput experiments work. In my experience, they'll spend months to years concocting a Rube Goldberg machine of chained molecular biology steps, each of which has (at best) a 90% success rate. You don't have to chain many of these together before your "clever" setup has a ~0% probability of successfully gathering data.
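To put rough numbers on that chaining argument, here is a minimal sketch. It assumes every step succeeds independently with the same probability, which is a simplification rather than anything measured, and the function name is made up for illustration.

```python
def chain_success_probability(per_step_success: float, n_steps: int) -> float:
    """Probability that all n_steps succeed, assuming independent steps
    that each succeed with probability per_step_success (an idealization)."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")
    return per_step_success ** n_steps


# How quickly a chain of 90%-reliable steps degrades as it gets longer.
for n in (5, 10, 20, 40):
    p = chain_success_probability(0.9, n)
    print(f"{n:>2} steps: {p:.1%} chance the whole chain works")
```

Under those assumptions, ten chained steps already leave only about a one-in-three chance of getting data, and forty steps leave about 1.5%.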
by plaidfuji on 8/14/24, 1:00 AM
I think it’s fundamentally shifting how people approach R&D in all physical fields. The power of “the ML way” is almost a self-fulfilling prophecy. Once you see ML upend the standard approach in one area, the question is not if but when it will upend your area, and the natural next step is to ask, “how can I massively increase data collection rates so I can feed ML”? It just completely flips all branches of science on their head, from carefully investigating and building first-principles theory, to saying “screw it, I really just wanted to map this design space so I can accurately predict outcomes, why don’t I just build a machine to do that?”
It then becomes a question of how easy it actually is to build an ML-feeding machine (not easy, very problem-specific), ergo the pendulum now swings to physical lab automation.
by biomcgary on 8/13/24, 11:37 PM
In the case of AlphaFold, measuring crystal structures is the most important thing (molecular phenotype). The second most important thing is measuring many genomes. Multiple sequence alignments allow evolution (variation under selection) to tell you about the important bits of the structure. The distance from aligned DNA sequences to protein structure isn't a bridge too far.
Unfortunately, biology has been misled by the popularity of transcriptomics, which the post touches on briefly (limits of single-cell approaches). Transcriptomics generates lots of data (relatively) cheaply, but isn't really the right thing to measure most of the time because it is too far removed causally from the organismal phenotype, the thing we generally care about in biomedicine. Although gene expression has provided some insights, we've exhausted most of its value by now and I doubt ML will rescue it (speaking from personal experience).
by lysozyme on 8/14/24, 1:45 AM
Humans are natural machines capable of sensing and verifying the correctness of a piece of text or an image in milliseconds. So if you have a model that generates text or images, it’s trivial to see if they’re any good. Whereas for biology, the time to validate a model’s output is measured more in weeks. If you generate a new backbone with RFDiffusion, and then generate some protein sequences with LigandMPNN, and then want to see if they fold correctly … that takes a week. Every time. Use ML to solve _that_ problem and you’ll be rich.
TFA mentions the difficulty of performing biological assays at scale, and there are numerous other challenges, such as the number of different kinds of assays required to get the multimodal data needed to train the latest models like ESM-3 (which is multimodal, in this context meaning primary sequence, secondary structure, tertiary structure, as well as several other tracks). You can’t just scale a fluorescent product plate reader assay to get the data you need. We need sequencing tech, functional assays, protein-protein interaction assays, X-ray crystallography, and a dozen others, all at scale.
What I’d love to see companies like A-Alpha and Gordian and others do is see if they can use ML to improve the wet lab tech. Make the assays better, faster, cheaper with ML. Like how they use ML to translate the electrical signals of DNA passing through the pore into a sequence in the Nanopore sequencers. So many companies have these sweet assays that are very good. In my opinion, if we want transformative progress in biology, we should spend less time fitting the same data with different models, and spend more time improving and scaling wet lab assays using ML. Can we use ML to make the assay better, make our processes better, to improve the amount and quality of data we generate? The thesis of TFA (and experience) suggests that using the data will be the easy part.
1. https://alexcarlin.bearblog.dev/why-is-progress-slow-in-gene...
by hirenj on 8/14/24, 11:48 AM
https://towardsdatascience.com/the-road-to-biology-2-0-will-...
I think an important point raised here is the distinction between good data and the "relative" data present in a lot of biology. As examples from the article, a protein structure or genome/protein sequence data is good data, but data like RNA-seq or mass spectrometry data is relative (and subject to sensitivity / noise etc). The way I like to think of it is that sequence data and structural data are looking at the actual thing, but the relative data only gets you a sliver of a snapshot of a process. Therefore it's easier to build models to capture relationships between representations of real things than models where you can't really distinguish between signal and noise. I spend a fair amount of time these days trying to figure out how to take advantage of good data to gain insights into things where we have relative data.
by esel2k on 8/15/24, 8:06 AM
The solution: multimodal data and getting more info on experimental setup (often a bit of voodoo and not written down properly).
by koeng on 8/14/24, 2:48 PM
Or even look at lab robotics - in 2015, you were able to buy a new Opentrons for $2500. Now it’s about $10,000 - the only way to rival the old pricing is to scrounge around used sales.
Enzyme prices haven’t dropped in basically forever. Addgene increased plasmid prices a little bit ago.
I feel like computer hackers can’t even imagine how bad it is over here.
by smolder on 8/14/24, 3:45 AM
by photochemsyn on 8/14/24, 6:17 AM
by kemmishtree on 8/14/24, 3:29 AM