from Hacker News

Show HN: Programmatic – a REPL for creating labeled data

by jordn on 4/8/22, 11:35 AM with 5 comments

Hey HN, I’m Jordan, cofounder of Humanloop (YC S20), and I’m excited to show you Programmatic — an annotation tool for building large labeled datasets for NLP without manual annotation.

Programmatic is like a REPL for data annotation. You:

  1. Write simple rules/functions that can approximately label the data
  2. Get near-instant feedback across your entire corpus
  3. Iterate and improve your rules

Finally, it uses a Bayesian label model [1] to convert these noisy annotations into a single, large, clean dataset, which you can then use for training machine learning models. You can programmatically label millions of datapoints in the time it takes to hand-label hundreds.
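
To give a flavour of what a labeling rule looks like (this sketch uses skweak, the library our label model builds on [1], rather than Programmatic's own API), a rule is just a function that scans a spaCy doc and yields candidate spans:

    import spacy
    from skweak.heuristics import FunctionAnnotator

    # Heuristic rule: a number preceded by a currency symbol is a MONEY span
    def money_detector(doc):
        for tok in doc[1:]:
            if tok.text[0].isdigit() and tok.nbor(-1).is_currency:
                yield tok.i - 1, tok.i + 1, "MONEY"

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("I paid $25 for the ticket.")
    doc = FunctionAnnotator("money", money_detector)(doc)
    # doc.spans["money"] now holds the (noisy) spans proposed by this one rule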

What we do differently from weak supervision packages like Snorkel/skweak [1] is focus on a UI that gives near-instantaneous feedback. We love these packages, but when we tried to iterate on labeling functions we had to write a ton of boilerplate code and wrestle with pandas to understand what was going on. Building a dataset programmatically requires you to grok the impact of labeling rules on a whole corpus of text. We’ve been told that the exploration tools and feedback make the process feel game-like and even fun (!!).

We built it because we see that getting labeled data remains a blocker for businesses using NLP today. We have a platform for active learning (see our Launch HN [2]), but we wanted to give software engineers and data scientists a way to build the datasets they need themselves and to make the best use of subject-matter experts’ time.

The package is free and you can install it now as a pip package [3]. It supports NER / span extraction tasks at the moment, and document classification will be added soon. To help us improve it, we'd love to hear your feedback, or about any successes or failures you’ve had with weak supervision in the past.

[1]: We use an HMM for NER tasks and Naive Bayes for classification, following the two approaches in the papers below:
Pierre Lison, Jeremy Barnes, and Aliaksandr Hubin. "skweak: Weak Supervision Made Easy for NLP." https://arxiv.org/abs/2104.09683 (2021)
Alex Ratner, Christopher De Sa, Sen Wu, Daniel Selsam, Chris Ré. "Data Programming: Creating Large Training Sets, Quickly." https://arxiv.org/abs/1605.07723 (NIPS 2016)

[2]: Our Launch HN for our main active learning platform, Humanloop – https://news.ycombinator.com/item?id=23987353

[3]: You can install it directly here: https://docs.programmatic.humanloop.com/tutorials/quick-star...

  • by razcle on 4/8/22, 1:38 PM

    Hi Raza here, one of the other co-founders.

    I know that HN likes to nerd out over technical details, so I thought I’d share a bit more on how we aggregate the noisy labels to clean them up.

    At the moment we use the great skweak [1] open-source library to do this. It uses an HMM to infer the most likely unobserved label given the evidence of the votes from each of the labelling functions.
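
    Roughly, that aggregation step looks like this (names taken from the skweak README, so treat it as a sketch rather than our exact code):

        from skweak import aggregation

        # Each labelling function has already written its noisy spans onto the docs.
        # The HMM treats the true labels as hidden states and the functions' votes
        # as noisy observations, estimating each function's reliability along the way.
        hmm = aggregation.HMM("hmm", ["MONEY", "PERSON"])
        docs = hmm.fit_and_aggregate(docs)

        # Use the aggregated spans as the training labels
        for doc in docs:
            doc.ents = doc.spans["hmm"]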

    This whole strategy of first training a label model and then training a neural net was pioneered by Snorkel. We’ve used this approach for now but we actually think there are big opportunities for improvement.

    We’re working on an end-to-end approach that de-noises the labelling functions and trains the model at the same time. So far we’ve seen improvements on the standard benchmarks [2] and are planning to submit to NeurIPS.

    R

    [1]: skweak package: https://github.com/NorskRegnesentral/skweak
    [2]: Wrench benchmark: https://arxiv.org/abs/2109.11377

  • by hmaguire on 4/8/22, 11:57 AM

    I've been using this for the past month or so in Beta and it's fantastic. I'm a DS at an NLP startup and it's totally changed the way we develop new tagging and classification models (and how we explore unlabelled data more generally).

  • by jordn on 4/8/22, 11:50 AM

    Just to clarify: this goes beyond a rule-based system. Rules can get you pretty far [1], but this improves on that by intelligently discounting the bad rules using weak supervision techniques. The end result is a pile of labeled data which you train your model on. The model trained on this data can generalise well beyond those labels.
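
    As a toy illustration of the discounting idea (not the actual label model, which is the HMM/Naive Bayes aggregation described above), think of each rule's vote being weighted by its estimated reliability rather than counted equally:

        import math

        # One example, three rules voting. A plain majority would say POSITIVE,
        # but weighting each vote by the rule's estimated accuracy (as log-odds)
        # lets the single reliable rule outvote the two noisy ones.
        votes = {"rule_a": "POSITIVE", "rule_b": "NEGATIVE", "rule_c": "POSITIVE"}
        est_accuracy = {"rule_a": 0.55, "rule_b": 0.90, "rule_c": 0.60}

        scores = {}
        for rule, label in votes.items():
            acc = est_accuracy[rule]
            scores[label] = scores.get(label, 0.0) + math.log(acc / (1 - acc))

        print(max(scores, key=scores.get))  # -> NEGATIVE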

    [1]: Aside: working at Alexa, I was surprised that something like 80% of utterances were covered by rules rather than an ML model. People have learned to use Alexa for a small handful of things, and you can cover those fairly well by generating rules from phrase patterns and catalogs of nouns.