by pvpv on 12/10/21, 4:07 PM with 38 comments
by artembugara on 12/10/21, 4:37 PM
Mostly for non-production use cases; that said, I can say it is the most robust framework for NLP at the moment.
V3 added support for transformers: that's a killer feature as many models from https://huggingface.co/docs/transformers/index work great out of the box.
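For illustration, here is a minimal sketch of a transformer-backed spaCy v3 pipeline; it assumes spacy-transformers and the en_core_web_trf package are installed, which isn't stated above:

    # Assumes: pip install "spacy[transformers]"
    #          python -m spacy download en_core_web_trf
    import spacy

    # RoBERTa-based English pipeline distributed for spaCy v3
    nlp = spacy.load("en_core_web_trf")
    doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

    # Named entities predicted by the transformer-backed NER component
    for ent in doc.ents:
        print(ent.text, ent.label_)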
At the same time, I found the NER models provided by spaCy to have low accuracy when working with real data: we deal with news articles, https://demo.newscatcherapi.com/
Also, while I see how much attention ML models get from the crowd, I think that many problems can be solved with a rule-based approach, and spaCy is just amazing for these.
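As a small sketch of the rule-based side, spaCy's EntityRuler lets you patch deterministic patterns ahead of the statistical NER; the labels and patterns below are made up for illustration:

    import spacy

    nlp = spacy.load("en_core_web_sm")

    # Add a rule-based entity component that runs before the statistical NER
    ruler = nlp.add_pipe("entity_ruler", before="ner")
    ruler.add_patterns([
        {"label": "ORG", "pattern": "NewsCatcher"},           # exact phrase match
        {"label": "PRODUCT", "pattern": [{"LOWER": "spacy"}]}  # token pattern
    ])

    doc = nlp("NewsCatcher uses spaCy for entity extraction.")
    print([(ent.text, ent.label_) for ent in doc.ents])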
Btw, we recently wrote a blog post comparing spaCy to NLTK for a text normalization task: https://newscatcherapi.com/blog/spacy-vs-nltk-text-normaliza...
by minimaxir on 12/10/21, 4:49 PM
OpenAI recently released an Embeddings API for GPT-3 with good demos and explanations: https://beta.openai.com/docs/guides/embeddings
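For illustration, a minimal sketch of calling that API with the pre-v1 openai Python client; the engine name is an assumption based on that era's docs, so check the linked guide for current model names:

    import os
    import openai

    openai.api_key = os.environ["OPENAI_API_KEY"]

    # Engine name is assumed here; see the linked embeddings guide
    resp = openai.Embedding.create(
        input="The food was delicious and the waiter was friendly.",
        engine="text-similarity-babbage-001",
    )
    vector = resp["data"][0]["embedding"]  # list of floats
    print(len(vector))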
Hugging Face Transformers makes this easier (and for free): most models can be configured to return a "last_hidden_state", which gives you per-token vectors you can pool (e.g. average) into a single embedding. Just use DistilBERT uncased/cased (which is fast enough to run on consumer CPUs) and you're probably good to go.
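A minimal sketch of that route, assuming distilbert-base-uncased and mean-pooling over tokens (the pooling choice is mine, not prescribed above):

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModel.from_pretrained("distilbert-base-uncased")

    texts = ["spaCy is great for rule-based NLP.",
             "Embeddings make semantic search easier."]
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

    with torch.no_grad():
        out = model(**batch)  # out.last_hidden_state: (batch, seq_len, hidden)

    # Mean-pool token vectors, ignoring padding, to get one vector per text
    mask = batch["attention_mask"].unsqueeze(-1)
    embeddings = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
    print(embeddings.shape)  # e.g. torch.Size([2, 768])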
by 41209 on 12/10/21, 9:02 PM