from Hacker News

Show HN: HuggingFace – Fast tokenization library for deep-learning NLP pipelines

by julien_c on 1/13/20, 4:40 PM with 42 comments

  • by julien_c on 1/13/20, 5:09 PM

    TL;DR: Hugging Face, the NLP research company known for its transformers library (DISCLAIMER: I work at Hugging Face), has just released a new open-source library for ultra-fast & versatile tokenization for NLP neural net models (i.e. converting strings into model input tensors).

    Main features:
    - Encode 1GB in 20sec
    - Provide BPE/Byte-Level-BPE/WordPiece/SentencePiece...
    - Compute exhaustive set of outputs (offset mappings, attention masks, special token masks...)
    - Written in Rust with bindings for Python and node.js

    Github repository and doc: https://github.com/huggingface/tokenizers/tree/master/tokeni...

    To install:
    - Rust: https://crates.io/crates/tokenizers
    - Python: pip install tokenizers
    - Node: npm install tokenizers
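    To make the "exhaustive set of outputs" concrete, here is a minimal pure-Python sketch of what such outputs look like for one short string. It is not the tokenizers library's API: the vocabulary, the `toy_encode` function, and the `max_len` parameter are all hypothetical, and the splitting rule is a plain regex rather than BPE or WordPiece.

```python
import re

# Hypothetical toy vocabulary; real tokenizers learn theirs from data.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "hello": 4, "world": 5, "!": 6}
SPECIAL = ("[CLS]", "[SEP]")


def toy_encode(text, max_len=8):
    """Encode text into ids plus offsets and masks (illustrative only)."""
    tokens, offsets = ["[CLS]"], [(0, 0)]
    # Split into words and single punctuation marks, keeping character spans.
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tokens.append(match.group().lower())
        offsets.append((match.start(), match.end()))
    tokens.append("[SEP]")
    offsets.append((0, 0))

    n_real = len(tokens)
    n_pad = max(0, max_len - n_real)  # pad short inputs up to max_len
    return {
        "ids": [VOCAB.get(t, VOCAB["[UNK]"]) for t in tokens]
               + [VOCAB["[PAD]"]] * n_pad,
        # (start, end) character span in the original string per token.
        "offsets": offsets + [(0, 0)] * n_pad,
        # 1 for positions the model should attend to, 0 for padding.
        "attention_mask": [1] * n_real + [0] * n_pad,
        # 1 for tokens that were added rather than read from the text.
        "special_tokens_mask": [1 if t in SPECIAL else 0 for t in tokens]
                               + [1] * n_pad,
    }


enc = toy_encode("Hello world!")
print(enc["ids"])
print(enc["offsets"])
```

    The offsets are what let you map a model prediction on a token back to the exact characters in the input, which is why they matter for tasks such as question answering or entity tagging.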

  • by mark_l_watson on 1/13/20, 6:33 PM

    I love the work done and made freely available by both spaCy and HuggingFace.

    I had my own NLP libraries for about 20 years: the simple ones were examples in my books, and the more complex, less understandable ones I sold as products, which also pulled in lots of consulting work.

    I have completely given up my own work developing NLP tools, and generally I use the Python bindings (via the Hy language (hylang) which is a Lisp that sits on top of Python) for spaCy, huggingface, TensorFlow, and Keras. I am retired now but my personal research is in hybrid symbolic and deep learning AI.

  • by screye on 1/13/20, 7:08 PM

    I can't believe the level of productivity this Hugging Face team has.

    They seem to have found the ideal balance of software engineering capability and neural network knowledge, in a team of highly effective and efficient employees.

    Idk what their monetization plan is as a startup, but it is 100% undervalued at 20 million, and that is just the quality of that team. Now, if only I can figure out how to put a few thousand $ in a series-A startup as just some guy.

  • by ZeroCool2u on 1/13/20, 5:50 PM

    We use both SpaCy and HuggingFace at work. Is there a comparison of this vs SpaCy's tokenizer[1]?

    1. https://spacy.io/usage/linguistic-features#tokenization

  • by LunaSea on 1/13/20, 6:00 PM

    It used to be that pre-deep-learning tokenizers would extract ngrams (n-token sized chunks), but this doesn't seem to exist anymore in the word embedding tokenizers I've come across.

    Is this possible using HuggingFace (or another word embedding based library)?

    I know that there are some simple heuristics like merging noun token sequences together to extract ngrams but they are too simplistic and very error prone.
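    For reference, the classic n-gram extraction mentioned above is easy to layer on top of any tokenizer's output. A minimal sketch, assuming the tokens are already available as a plain list of strings (the `ngrams` helper is hypothetical, not part of any library named in this thread):

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sliding window of width n; empty when there are fewer than n tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["new", "york", "city", "subway"]
print(ngrams(tokens, 2))
print(ngrams(tokens, 3))
```

    This only enumerates candidates; deciding which n-grams are meaningful phrases is the hard part the comment is asking about.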

  • by useful on 1/13/20, 6:42 PM

    Somewhat related, if someone wants to build something awesome, I haven't seen anything that merges Lucene with BPE/SentencePiece.

    SentencePiece surely makes it possible to shrink the memory requirements of your indexes for search and typeahead stuff.

  • by hnaccy on 1/13/20, 5:50 PM

    Great! Just did a quick test and got a 6-7x speedup on tokenization.
  • by orestis on 1/13/20, 6:45 PM

    Are there examples of how this can be used for topic modeling, document similarity, etc.? All the examples I’ve seen (gensim) use bag-of-words, which seems to be outdated.
  • by echelon on 1/13/20, 6:27 PM

    I'm very familiar with the TTS, VC, and other "audio-shaped" spaces, but I've never delved into NLP.

    What problems can you solve with NLP? Sentiment analysis? Semantic analysis? Translation?

    What cool problems are there?

  • by m0zg on 1/14/20, 7:15 AM

    Question for HuggingFace folks. Your repos do not contain any tests. Why is that? How do you ensure your stuff actually works after you make a change?
  • by virtuous_signal on 1/13/20, 4:45 PM

    I didn't realize that particular emoji had a name. I thought it was a play on this: https://en.wikipedia.org/wiki/Alien_(creature_in_Alien_franc...
  • by manojlds on 1/13/20, 9:00 PM

    Title is off? Should mention Tokenizers as the project.
  • by rsp1984 on 1/13/20, 5:41 PM

    What does tokenization (of strings, I guess) do?
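    As a rough answer in code: tokenization splits a string into pieces drawn from a fixed vocabulary so that each piece can be mapped to an integer for the model. The sketch below uses a WordPiece-style greedy longest-match split; the vocabulary and the function names are hypothetical and chosen only for illustration, and this is not the tokenizers library's implementation.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedily split one word into the longest known pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no known piece fits, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def tokenize(text, vocab):
    """Lowercase, split on whitespace, then split each word into pieces."""
    return [p for w in text.lower().split() for p in wordpiece_split(w, vocab)]


# Hypothetical toy vocabulary.
vocab = {"token", "##ization", "##s", "fast"}
print(tokenize("Fast tokenization tokens", vocab))
```

    Splitting into subword pieces keeps the vocabulary small while still covering words the model never saw whole during training.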
  • by tarr11 on 1/13/20, 8:26 PM

    Why is this company called HuggingFace?