from Hacker News

Ask HN: How to extract structured content from unstructured menus using NLP and ML?

by restapi on 9/25/17, 3:48 PM with 8 comments

  • by 3131s on 9/26/17, 10:47 AM

    This might not be a task for ML, especially if unsupervised ML is the only option.

    I would suggest using an existing ontology, or rolling your own from the English Wikipedia database dump, as the basis for tokenizing the menu text, and going from there. What structured content exactly are you trying to extract?
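
    A rough sketch of what ontology-driven tokenization could look like, assuming a tiny hand-written vocabulary (the `ONTOLOGY` dict and `tag_menu_text` are made-up names for illustration; a real vocabulary would come from an existing ontology or a Wikipedia dump). It does a greedy longest-match of menu words against known phrases:

```python
import re

# Hypothetical mini-ontology: surface phrase -> category.
# A real one would be far larger and built from an external source.
ONTOLOGY = {
    "caesar salad": "dish",
    "grilled chicken": "ingredient",
    "chicken": "ingredient",
    "parmesan": "ingredient",
    "croutons": "ingredient",
}
MAX_PHRASE_WORDS = max(len(phrase.split()) for phrase in ONTOLOGY)


def tag_menu_text(text):
    """Return (phrase, category) pairs found by greedy longest-match."""
    words = re.findall(r"[a-z]+", text.lower())
    tagged = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shrink.
        for n in range(min(MAX_PHRASE_WORDS, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in ONTOLOGY:
                tagged.append((phrase, ONTOLOGY[phrase]))
                i += n
                break
        else:
            i += 1  # word not in the ontology; skip it
    return tagged


print(tag_menu_text("Caesar Salad with grilled chicken, parmesan & croutons 9.50"))
```

    Anything the vocabulary doesn't cover is silently dropped here, so in practice you'd probably want to log the misses and grow the ontology from them.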

  • by BjoernKW on 9/26/17, 11:48 AM

    What's the source format? If you're dealing with PDFs, at least you have textual data, which could be matched against a recipe database. I haven't checked, but services like Epicurious might offer an API for that.

    In that case you wouldn't need ML at all; pattern matching combined with named entity recognition would probably do just fine.
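
    To make that concrete, here is a minimal sketch, assuming each menu line is "name, optional dot leaders, price". The `KNOWN_DISHES` list is a hypothetical stand-in for a real recipe database, and fuzzy string matching from the standard library's `difflib` is swapped in for proper named entity recognition:

```python
import difflib
import re

# Hypothetical stand-in for a recipe database lookup.
KNOWN_DISHES = ["margherita pizza", "caesar salad", "spaghetti carbonara"]

# Item name, optional dot leaders/whitespace, optional currency sign, price.
LINE_RE = re.compile(
    r"^\s*(?P<name>.+?)[\s.]*[$€£]?\s*(?P<price>\d+(?:[.,]\d{2})?)\s*$"
)


def parse_menu_line(line):
    """Split a menu line into name and price, then look the name up."""
    match = LINE_RE.match(line)
    if match is None:
        return None  # no trailing price, so not treated as a menu item
    name = match.group("name").strip()
    price = float(match.group("price").replace(",", "."))
    # Fuzzy lookup tolerates small spelling differences in the menu text.
    close = difflib.get_close_matches(name.lower(), KNOWN_DISHES, n=1, cutoff=0.8)
    return {"name": name, "price": price, "dish": close[0] if close else None}


print(parse_menu_line("Margarita Pizza ........ $12.50"))
```

    Real menus are messier than this (multi-line descriptions, several prices per item), so the regex would need to grow with whatever the source data actually looks like.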

  • by misiti3780 on 9/25/17, 5:40 PM

    Can you provide an example link to the data?

  • by ocrcustomserver on 9/28/17, 9:58 PM

    If you're dealing with PDF documents, I might be able to help. My email is in my profile.