from Hacker News

Show HN: HuggingFace – Fast tokenization library for deep-learning NLP pipelines

by julien_c on 1/13/20, 4:40 PM with 42 comments

by julien_c on 1/13/20, 5:09 PM
TL;DR: Hugging Face, the NLP research company known for its transformers library (DISCLAIMER: I work at Hugging Face), has just released a new open-source library for ultra-fast & versatile tokenization for NLP neural net models (i.e. converting strings into model input tensors).
Main features:
- Encode 1GB in 20sec
- Provide BPE/Byte-Level-BPE/WordPiece/SentencePiece...
- Compute exhaustive set of outputs (offset mappings, attention masks, special token masks...)
- Written in Rust with bindings for Python and node.js
Github repository and doc: https://github.com/huggingface/tokenizers/tree/master/tokeni...
To install:
- Rust: https://crates.io/crates/tokenizers
- Python: pip install tokenizers
- Node: npm install tokenizers
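To make "converting strings into model input tensors" concrete, here is a minimal pure-Python sketch of the kinds of outputs listed above (ids, offset mappings, attention mask, special token mask). It does not use the tokenizers library or its API: the vocabulary, the `encode` function and its `max_len` parameter are all made up for illustration, and a real tokenizer would use a learned subword vocabulary (BPE, WordPiece, ...) rather than this whitespace-and-punctuation split.

```python
import re

# Hypothetical toy vocabulary; real tokenizers learn theirs from a corpus.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "hello": 4, "world": 5, "!": 6}


def encode(text, max_len=8):
    """Toy encoder: returns ids, character offsets, attention and special-token masks."""
    ids, offsets, special = [VOCAB["[CLS]"]], [(0, 0)], [1]
    # Split into words and single punctuation marks, keeping character spans.
    # Lowercasing keeps the string length unchanged for ASCII input, so the
    # spans still point into the original text.
    for match in re.finditer(r"\w+|[^\w\s]", text.lower()):
        ids.append(VOCAB.get(match.group(), VOCAB["[UNK]"]))
        offsets.append(match.span())
        special.append(0)
    ids.append(VOCAB["[SEP]"])
    offsets.append((0, 0))
    special.append(1)

    # Real tokens get attention 1, padding gets 0.
    attention = [1] * len(ids)
    pad = max(0, max_len - len(ids))
    return {
        "ids": ids + [VOCAB["[PAD]"]] * pad,
        "offsets": offsets + [(0, 0)] * pad,
        "attention_mask": attention + [0] * pad,
        "special_tokens_mask": special + [1] * pad,
    }


if __name__ == "__main__":
    for name, value in encode("Hello, world!").items():
        print(name, value)
```

The offsets are what let you map each token back to the exact substring it came from, which matters for tasks like question answering or NER where the answer has to be reported as a span of the original text.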
by mark_l_watson on 1/13/20, 6:33 PM
I love the work done and made freely available by both spaCy and HuggingFace.
I had my own NLP libraries for about 20 years: the simple ones were examples in my books, and the more complex, less understandable ones I sold as products and used to pull in lots of consulting work.
I have completely given up my own work developing NLP tools, and generally I use the Python bindings (via the Hy language (hylang) which is a Lisp that sits on top of Python) for spaCy, huggingface, TensorFlow, and Keras. I am retired now but my personal research is in hybrid symbolic and deep learning AI.
by screye on 1/13/20, 7:08 PM
I can't believe the level of productivity this Hugging Face team has.
They seem to have found the ideal balance of software engineering capability and neural network knowledge, in a team of highly effective and efficient employees.
Idk what their monetization plan is as a startup, but it is 100% undervalued at 20 million, and that is just the quality of that team. Now, if only I can figure out how to put a few thousand $ in a series-A startup as just some guy.
by ZeroCool2u on 1/13/20, 5:50 PM
We use both spaCy and HuggingFace at work. Is there a comparison of this vs spaCy's tokenizer[1]?
1. https://spacy.io/usage/linguistic-features#tokenization
by LunaSea on 1/13/20, 6:00 PM
It used to be that pre-deep-learning tokenizers would extract ngrams (n-token sized chunks), but this doesn't seem to exist anymore in the word embedding tokenizers I've come across.
Is this possible using HuggingFace (or another word embedding based library)?
I know that there are some simple heuristics like merging noun token sequences together to extract ngrams but they are too simplistic and very error prone.
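For reference, plain n-gram extraction over an already tokenized sequence can be sketched in a few lines of Python. This is a generic illustration, not a feature of the tokenizers library; the function name `ngrams` is made up here, and it does nothing to decide which n-grams are meaningful phrases.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple, in order."""
    # When n exceeds the number of tokens, the range is empty and so is the result.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    tokens = ["new", "york", "city", "subway"]
    print(ngrams(tokens, 2))
    print(ngrams(tokens, 3))
```

The sliding window works on any token sequence, so it can be applied to the output of a subword tokenizer as well, though n-grams of subword pieces will not line up with word-level phrases.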
by useful on 1/13/20, 6:42 PM
Somewhat related: if someone wants to build something awesome, I haven't seen anything that merges Lucene with BPE/SentencePiece.
SentencePiece should make it possible to shrink the memory requirements of your indexes for search and typeahead stuff.
by hnaccy on 1/13/20, 5:50 PM
Great! Just did a quick test and got a 6-7x speedup on tokenization.
by orestis on 1/13/20, 6:45 PM
Are there examples of how this can be used for topic modeling, document similarity, etc.? All the examples I’ve seen (gensim) use bag-of-words, which seems to be outdated.
by echelon on 1/13/20, 6:27 PM
I'm very familiar with the TTS, VC, and other "audio-shaped" spaces, but I've never delved into NLP.
What problems can you solve with NLP? Sentiment analysis? Semantic analysis? Translation?
What cool problems are there?
by m0zg on 1/14/20, 7:15 AM
Question for HuggingFace folks. Your repos do not contain any tests. Why is that? How do you ensure your stuff actually works after you make a change?
by virtuous_signal on 1/13/20, 4:45 PM
I didn't realize that particular emoji had a name. I thought it was a play on this: https://en.wikipedia.org/wiki/Alien_(creature_in_Alien_franc...
by manojlds on 1/13/20, 9:00 PM
Title is off? Should mention Tokenizers as the project.
by rsp1984 on 1/13/20, 5:41 PM
What does tokenization (of strings, I guess) do?
by tarr11 on 1/13/20, 8:26 PM
Why is this company called HuggingFace?