by hellovai on 7/11/23, 2:51 PM with 123 comments
by 19h on 7/11/23, 9:00 PM
How it works: imagine you have these two sentences:
“Acorn is a tree” and “Acorn is an app”
You essentially keep a record of all word-to-word relations within a sentence:
- acorn: is, a, an, app, tree, etc.
Now you repeat this for a few gigabytes of text. You’ll end up with a huge map of “word connections”.
You now take the top X words that other words connect to (e.g. 16,384). Then, for each word, you create a vector of 16,384 positions, one per top word ranked by connectedness (position 1 for the most-connected word, position 2 for the second, and so on), where a 1 means “is connected” and a 0 means “no such connection”, e.g. 1,0,1,0,1,0,0,0, …
You’ll end up with a vector that has a lot of zeroes, so you can sparsify it (i.e. store only the positions of the ones).
You essentially have fingerprints now. What you can do next is generate fingerprints of entire sentences, paragraphs, and texts. Remove the fingerprints of the most common words like “is”, “in”, “a”, “the”, etc., and you’ll have a “semantic fingerprint”. Now if you take a lot of example texts and generate fingerprints from them, you can end up with a very small set of “indices”, maybe 10 numbers, that are enough to very reliably identify texts of a specific topic.
Sorry, I can’t be too specific as I’m on the go. If you’re interested, drop me an email.
We’re using this to categorize literally tens of gigabytes per second with 92% precision into more than 72 categories.
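A minimal sketch of the scheme above in Python, assuming a toy two-sentence corpus and a tiny top-K (the variable names, corpus, and TOP_K value are illustrative, not from the comment):

    # Build word-to-word connections, rank the most-connected words, and
    # encode words/texts as sparse sets of vector positions.
    from collections import Counter, defaultdict

    STOPWORDS = {"is", "in", "a", "an", "the"}
    TOP_K = 8  # the comment uses e.g. 16,384; tiny here for readability

    corpus = ["acorn is a tree", "acorn is an app"]

    connections = defaultdict(set)  # word -> words it co-occurs with
    degree = Counter()              # how many words connect to each word
    for sentence in corpus:
        words = sentence.split()
        for w in words:
            for other in words:
                if other != w and other not in connections[w]:
                    connections[w].add(other)
                    degree[other] += 1

    # Positions 0..TOP_K-1 correspond to the most-connected words, in order.
    vocab = [w for w, _ in degree.most_common(TOP_K)]
    index = {w: i for i, w in enumerate(vocab)}

    def word_fingerprint(word):
        # Sparse encoding: store only the positions of the ones.
        return {index[o] for o in connections[word] if o in index}

    def text_fingerprint(text):
        # Union of content-word fingerprints, skipping common words.
        fp = set()
        for w in text.split():
            if w not in STOPWORDS:
                fp |= word_fingerprint(w)
        return fp

    print(text_fingerprint("acorn is a tree"))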
by nestorD on 7/11/23, 7:01 PM
But they are zero-/few-shot classifiers, meaning you can get your classification running and reasonably accurate now, collect data, and switch to a fine-tuned, very efficient traditional ML model later.
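One way to sketch the zero-shot starting point is Hugging Face’s zero-shot-classification pipeline; the model choice, input text, and labels below are illustrative assumptions, not from the comment:

    # Zero-shot classification: no training data, just candidate labels.
    from transformers import pipeline

    classifier = pipeline("zero-shot-classification",
                          model="facebook/bart-large-mnli")
    result = classifier("My card was charged twice",
                        candidate_labels=["billing", "bug report", "praise"])
    print(result["labels"][0])  # highest-scoring label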
by alexmolas on 7/11/23, 6:25 PM
It would be nice to see how this “complex” approach compares against a “simple” TF-IDF + RF or SVM.
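A minimal sketch of such a baseline with scikit-learn, here TF-IDF plus a linear SVM (the toy texts and labels are made up for illustration; swapping in a random forest is a one-line change):

    # The "simple" baseline: TF-IDF features + a linear SVM.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    texts = ["refund my order", "app crashes on launch", "love the new update"]
    labels = ["billing", "bug", "praise"]

    baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
    baseline.fit(texts, labels)
    print(baseline.predict(["the app keeps crashing"]))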
by crazygringo on 7/11/23, 5:03 PM
I'm really wondering when LLMs are going to replace humans for ~all first-pass social media and forum moderation.
Obviously humans will always be involved in setting moderation policy, judging gray areas, and refining that policy... but at what point will LLMs do everything else more reliably than humans?
6 months from now? 3 years from now?
by rckrd on 7/11/23, 7:33 PM
LLMs are excellent reasoning engines, but nudging them to the desired output is challenging. They might return categories outside the ones you determined. They might return multiple categories when you only want one (or the opposite: a single category when you want multiple). Even if you steer the AI toward the correct answer, parsing the output can be difficult. Asking the LLM to output structured data works 80% of the time, but the 20% of the time when parsing the response fails eats up 99% of your time, which is unacceptable for most real-world use cases.
[0] https://twitter.com/mattrickard/status/1678603390337822722
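One common mitigation for that failing 20% is to ask for JSON, validate it against the allowed categories, and retry on failure. A minimal sketch; call_llm is a hypothetical stand-in for whatever completion API you use, and the categories are illustrative:

    import json

    CATEGORIES = {"billing", "bug", "praise", "other"}

    def classify(text, call_llm, max_retries=3):
        prompt = (
            "Classify the text into exactly one of "
            f"{sorted(CATEGORIES)}. Reply with JSON: {{\"category\": ...}}.\n"
            f"Text: {text}"
        )
        for _ in range(max_retries):
            raw = call_llm(prompt)
            try:
                category = json.loads(raw)["category"]
                if category in CATEGORIES:  # reject invented categories
                    return category
            except (json.JSONDecodeError, KeyError, TypeError):
                pass  # malformed output: fall through and retry
        return "other"  # give up gracefully after max_retries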
by Animats on 7/12/23, 4:31 AM
If you're using this to direct messages to approximately the correct department, it doesn't have to be that complicated.
If you're doing this to evaluate customer sentiment, you could probably just select a few hundred messages at random and read them. (There are many "big data" problems which are only big due to not sampling.)
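Rough sanity check on that sample size: the 95% margin of error for a proportion estimated from a simple random sample of size n is about 1.96·sqrt(p(1−p)/n), so a few hundred messages already pin sentiment down to within a few points (the numbers below are illustrative):

    import math

    def margin_of_error(p, n, z=1.96):
        # Normal-approximation margin of error for a sampled proportion.
        return z * math.sqrt(p * (1 - p) / n)

    print(f"{margin_of_error(0.5, 400):.3f}")  # ~0.049, i.e. about ±5 points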
by r_singh on 7/11/23, 6:38 PM
It was turning out to be expensive earlier, but heavy prompt optimisation, reduced pricing from OpenAI, and now being able to run Guanaco 13B/33B locally have made it much more accessible in terms of pricing for millions of pieces of text.
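Back-of-envelope on what classifying millions of pieces of text costs with a hosted model; the token count and per-token price are illustrative assumptions, not quoted rates:

    tokens_per_request = 300       # prompt + completion, assumed
    price_per_1k_tokens = 0.0015   # USD per 1K tokens, assumed
    num_texts = 5_000_000

    cost = num_texts * tokens_per_request / 1000 * price_per_1k_tokens
    print(f"${cost:,.0f}")  # $2,250 under these assumptions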