from Hacker News

Ask HN: What explains the exceptional performance of LLMs?

by JacobiX on 4/18/23, 4:24 PM with 1 comments

Is there an updated theoretical framework that explains the performance of LLMs? My understanding, which may be outdated, is that they are somewhat poorly understood from a theoretical perspective.
  • by PaulHoule on 4/18/23, 5:12 PM

    I worked on "foundation models" based on LSTM and CNN technology just before transformers hit it big.

    The idea back then was about the same: for instance, we would feed abstracts of case studies from PubMed into an LSTM to teach it how to write fake case studies. The idea was that it would develop a neuron, or combination of neurons, that lights up when it is writing a "period that is really the end of a sentence" or when it is writing the name of a disease, so we could easily train a classifier on those states that could highlight important text.
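
    A rough sketch of that kind of hidden-state probe (PyTorch here; the class and variable names are purely illustrative, not the original code):

      import torch
      import torch.nn as nn

      class CharLSTM(nn.Module):
          def __init__(self, vocab_size=128, hidden=256):
              super().__init__()
              self.embed = nn.Embedding(vocab_size, 64)
              self.lstm = nn.LSTM(64, hidden, batch_first=True)
              self.head = nn.Linear(hidden, vocab_size)  # next-character prediction

          def forward(self, x):
              h, _ = self.lstm(self.embed(x))
              return self.head(h), h  # language-model logits plus hidden states

      lm = CharLSTM()
      # ... train lm as a language model on PubMed abstracts ...

      # Probe: a linear classifier over the hidden states, e.g. tagging each
      # position as "sentence-ending period" vs. "not".
      probe = nn.Linear(256, 2)
      tokens = torch.randint(0, 128, (1, 120))  # stand-in for encoded text
      with torch.no_grad():
          _, hidden = lm(tokens)
      logits = probe(hidden)  # shape (1, 120, 2): a label score per position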

    Similarly, the CNN model would be trained on multiple tasks over a large amount of data so that it could learn other tasks more quickly.

    There were numerous problems with those models that I saw as critical bottlenecks that would limit what those things could do, and transformers targeted them all.

    (1) Training a text generator to start at the beginning and work to the end has numerous problems, namely that it learns how things to the left affect the current position but not things to the right. The B in BERT stands for "bidirectional", and the "mask 15% of the words" task is a huge improvement over "predict the next letter or word".
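
    To see what the masked objective buys you, here is fill-mask through the Hugging Face transformers library (the checkpoint name is just a common default, not anything from the project above):

      from transformers import pipeline

      # BERT-style masked prediction: the model sees context on BOTH sides
      # of the blank, unlike a left-to-right "predict the next word" model.
      fill = pipeline("fill-mask", model="bert-base-uncased")
      for pred in fill("The patient was diagnosed with [MASK] after the biopsy."):
          print(pred["token_str"], round(pred["score"], 3))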

    (2) Subword features are also critical. I quit a project because they wanted to use word embeddings, but for us out-of-dictionary words contained critical meaning that would be lost completely. It's like starting out a chess game with a rook, your queen and two pawns gone. fastText breaks words up into smaller fragments, which might not always be perfect, but it means you never get into a "can't get there from here" situation.
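
    A quick illustration of the subword idea with a WordPiece tokenizer (Hugging Face here; fastText's character n-grams work on the same principle), where a rare word becomes pieces instead of a single unknown token:

      from transformers import AutoTokenizer

      tok = AutoTokenizer.from_pretrained("bert-base-uncased")
      # An out-of-dictionary term is split into smaller known fragments
      # rather than being replaced wholesale by an unknown-word token.
      print(tok.tokenize("angiolymphoid hyperplasia"))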

    (3) Embeddings that include the context of the word are a huge improvement over word embeddings; if anything, word embeddings can make things worse by smashing together all the possible meanings of a word. Transformers generate an embedding that includes the context, so you get help with both polysemy ("duck" the bird vs. "duck the question") and synonymy, with tools that are very simple to use like

    https://sbert.net/

    that will transform people's expectations of full-text search.
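
    A minimal example with that library (the model name is just a small, commonly used checkpoint); the point is that context should separate the two senses of "duck":

      from sentence_transformers import SentenceTransformer, util

      model = SentenceTransformer("all-MiniLM-L6-v2")
      sentences = [
          "A duck swam across the pond.",
          "He tried to duck the question.",
          "She avoided answering directly.",
      ]
      emb = model.encode(sentences, convert_to_tensor=True)
      # Pairwise cosine similarities; "duck the question" should land closer
      # to the sentence about avoiding an answer than to the one about the bird.
      print(util.cos_sim(emb, emb))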

    (4) With the LSTM we struggled to get coherence in terms of matching exact or approximate words and phrases. For instance, if a story is about "Matthew Sweet", you'd expect to see that exact phrase used again, or maybe "Matt" or "Matthew" or "Sweet" or "him". What's great about transformers is that one part of the text can light up a connection to another part of the text and handle that.
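
    A toy version of the mechanism that makes that possible (NumPy, random vectors as stand-ins, and without the learned query/key/value projections a real transformer has):

      import numpy as np

      def self_attention(X):
          d = X.shape[-1]
          scores = X @ X.T / np.sqrt(d)  # how much each position relates to every other
          weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
          weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
          return weights @ X, weights  # each output mixes in the positions it attends to

      tokens = ["Matthew", "Sweet", "released", "an", "album", "and", "he", "toured"]
      X = np.random.randn(len(tokens), 16)  # stand-in embeddings
      out, attn = self_attention(X)
      print(attn[tokens.index("he")])  # weight "he" puts on every other token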

    I had that list of four problems around the time BERT came out and it answered all of them.