from Hacker News

Show HN: Cherry2.0 – text classification without ML knowledge needed

by Windson on 9/9/19, 2:26 AM with 11 comments

  • by Thorentis on 9/9/19, 4:37 AM

    The pre-built model is a fascinating insight into Chinese politics/society.

    > Gamble / Porn / Political / Nomal (model='harmful')

    Not sure what Nomal is (presumably 'Normal'), but 'Political' is considered harmful? I suppose they mean politics has the potential to be harmful. Would be interesting to see what material it was trained on (a cross-section of pro- and anti-CCP content? Pro-CCP would still be political, but then why would that need to be classified under a harmful model?)

    > Lottery ticket / Finance / Estate / Home / Tech / Society / Sport / Game / Entertainment (model='news')

    Interesting bit here is that lottery tickets are included under news and yet gambling is included under harmful. Is the lottery not considered gambling in China? Or does gambling just have the potential to be harmful, despite also being news?
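
    For anyone curious, invoking the pre-built model presumably looks something like the sketch below. I'm guessing at the classify() call and the result fields from the categories quoted above, so treat the API as an assumption and check the README:

      import cherry  # pip install cherry

      # Assumed API, not confirmed against the docs.
      res = cherry.classify(model='harmful', text=['some user-submitted text'])
      print(res.probability)  # assumed: per-category scores (Gamble / Porn / Political / Normal)
      print(res.word_list)    # assumed: tokens that drove the decision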

  • by ngngngng on 9/9/19, 7:22 AM

    Hey Windson, this is super cool, can't wait to try this out. Wish I had it 6 months ago at my last job, when I was starting to build my own text classification system. I submitted a PR fixing the typos in the English translation; no promises that I got them all, though.

  • by Windson on 9/9/19, 7:44 AM

    I have another pre-trained model for spam email classification in English. I didn't add it to cherry because I don't know if anyone needs it. It would be great if someone could tell me what kind of pre-trained models they need so I can add them later.
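
    For anyone wondering what such a model involves: the standard recipe for spam/ham is roughly bag-of-words features plus Naive Bayes. A minimal sketch with scikit-learn (illustrative only, not necessarily cherry's exact pipeline):

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.naive_bayes import MultinomialNB
      from sklearn.pipeline import make_pipeline

      # Tiny placeholder corpus; a real model needs a labelled email dataset.
      emails = [
          "win a free prize now, click here",
          "meeting moved to friday at 10am",
          "cheap meds, limited time offer",
          "can you review my pull request?",
      ]
      labels = ["spam", "ham", "spam", "ham"]

      model = make_pipeline(TfidfVectorizer(), MultinomialNB())
      model.fit(emails, labels)
      print(model.predict(["free offer, click now"]))  # -> ['spam']
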
  • by mlthoughts2018 on 9/9/19, 11:09 AM

    Why do people think any kind of “x for people with no knowledge of x” tool is a worthwhile idea? It reminds me of various cloud vendor ML offerings too. With ML, you _always_ need ML domain knowledge to (at minimum) understand the performance characteristics of the black box you’re using, and (more often) also to re-train, fine-tune, or incrementally update it as your use case’s data distribution or your required performance characteristics change over time.

    Take AWS Rekognition, where you’re billed on raw usage, which makes no sense when what matters is the precision, recall, and false detection rate on your data distribution (not whatever Amazon’s team trained it on). How many true positives / false positives / etc. will you get on your data?
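
    Concretely, the minimum check before trusting any of these black boxes: run it over a labelled sample of your own traffic and measure precision/recall yourself. A sketch (predict_fn stands in for whatever API you're calling):

      from sklearn.metrics import classification_report, confusion_matrix

      # texts/labels: a labelled sample from YOUR data distribution,
      # not the vendor's training set. predict_fn is the black box under test.
      def evaluate(predict_fn, texts, labels):
          preds = [predict_fn(t) for t in texts]
          print(confusion_matrix(labels, preds))       # true/false positive counts
          print(classification_report(labels, preds))  # per-class precision and recall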

    ML is uniquely poorly suited to be treated as just some API or just some black box library. I really wish people would stop popularizing footgun approaches to this!