from Hacker News

Donut: OCR-Free Document Understanding Transformer

by hectormalot on 5/29/23, 8:19 AM with 86 comments

  • by AmazingTurtle on 5/29/23, 10:05 AM

    I tested it out with a bunch of personal documents. The results were disappointing; they did not come anywhere near the promised scores.

    I think the traditional approach to scanning and classifying without AI/ML is the way to go, for the next 5 years at the very least.

  • by dkatz23238 on 5/29/23, 4:16 PM

    As a developer who has been building IDP solutions, I can say that although this model is a lot larger (more weights) than a graph neural network over OCR tokens, the industry standard before transformers, it outperforms them given enough data. Depending on how heterogeneous the data is, around 200 documents are usually enough to reach human-level accuracy, scoring by Levenshtein ratio (a quick sketch of that metric follows below).

    Smaller graph models could get away with using less data. The problem with the "traditional" approach was that the quality of the OCR was the bottleneck for overall model performance. It amazes me how this problem shifted from a node-classification problem to an image-to-text problem.

    Training on CPU was possible with GCN but not with Donut.
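
    A quick sketch of the Levenshtein-ratio scoring mentioned above, using a plain edit-distance implementation in Python; the exact normalization and any field-level weighting in his pipeline are assumptions here:

      def levenshtein(a: str, b: str) -> int:
          # classic dynamic-programming edit distance
          prev = list(range(len(b) + 1))
          for i, ca in enumerate(a, 1):
              curr = [i]
              for j, cb in enumerate(b, 1):
                  curr.append(min(prev[j] + 1,                # deletion
                                  curr[j - 1] + 1,            # insertion
                                  prev[j - 1] + (ca != cb)))  # substitution
              prev = curr
          return prev[-1]

      def levenshtein_ratio(pred: str, truth: str) -> float:
          # 1.0 is an exact match, 0.0 means nothing in common
          if not pred and not truth:
              return 1.0
          return 1 - levenshtein(pred, truth) / max(len(pred), len(truth))

      # e.g. compare a predicted field value against the labelled ground truth
      print(levenshtein_ratio("Invoice #10423", "Invoice #10428"))  # ~0.93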

  • by dowakin on 5/29/23, 5:38 PM

    If you want to train Donut, check out this notebook on Kaggle. It trains Donut to read plots for a competition and contains the full pipeline for fine-tuning. https://www.kaggle.com/code/nbroad/donut-train-benetech
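
    Before diving into fine-tuning, here is a minimal inference sketch with a pretrained Donut checkpoint via Hugging Face transformers; the receipt-parsing checkpoint and task prompt below are illustrative and not part of the linked notebook:

      import re
      from PIL import Image
      from transformers import DonutProcessor, VisionEncoderDecoderModel

      processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
      model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

      image = Image.open("receipt.jpg").convert("RGB")
      pixel_values = processor(image, return_tensors="pt").pixel_values

      # Donut is prompted with a task token instead of running a separate OCR step
      decoder_input_ids = processor.tokenizer(
          "<s_cord-v2>", add_special_tokens=False, return_tensors="pt"
      ).input_ids

      outputs = model.generate(
          pixel_values,
          decoder_input_ids=decoder_input_ids,
          max_length=512,
          pad_token_id=processor.tokenizer.pad_token_id,
          eos_token_id=processor.tokenizer.eos_token_id,
      )

      sequence = processor.batch_decode(outputs)[0]
      sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
      sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # drop the task start token
      print(processor.token2json(sequence))  # nested dict of the parsed receipt
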
  • by armchairhacker on 5/29/23, 11:57 AM

    These OCR tools are bringing us closer to MS Paint as a viable IDE
  • by tkanarsky on 5/29/23, 9:20 AM

    > Donut: DOcumeNt Understanding Transformer

    Author: phew! I'm glad there's an 'n' in there somewhere

  • by xavriley on 5/29/23, 12:16 PM

    There’s a model for music transcription (audio to MIDI) called MT3 which takes an end-to-end transformer approach and claims SOTA on some datasets. However, from my own research and from comparing it with other models, it seems that MT3 is very prone to overfitting, and the real-world results are not as impressive. A similar story seems to be playing out in the comments here.
  • by vosper on 5/29/23, 9:20 AM

    I want to build an application that scans restaurant and café menus (PDFs, photos, webpages) to identify which items are vegetarian or vegan. Would this work for that? If not, I would love to hear people's ideas and suggestions.
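
    One way to try it would be Donut's document-VQA checkpoint, sketched below with its documented prompt format; whether it answers dietary questions reliably on menu photos is exactly the open question, and nothing here has been validated on menus:

      import re
      from PIL import Image
      from transformers import DonutProcessor, VisionEncoderDecoderModel

      processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
      model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")

      image = Image.open("menu.jpg").convert("RGB")
      pixel_values = processor(image, return_tensors="pt").pixel_values

      # the question is embedded in the decoder prompt
      question = "Which dishes are vegetarian?"
      prompt = f"<s_docvqa><s_question>{question}</s_question><s_answer>"
      decoder_input_ids = processor.tokenizer(prompt, add_special_tokens=False, return_tensors="pt").input_ids

      outputs = model.generate(pixel_values, decoder_input_ids=decoder_input_ids, max_length=128)
      sequence = processor.batch_decode(outputs)[0]
      sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
      print(processor.token2json(sequence))  # {'question': ..., 'answer': ...}
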
  • by nestorD on 5/29/23, 5:23 PM

    I will have to investigate this. I am dreaming of a system that can take a PDF scan of a book as input and produce one or more properly formatted (headings, italic, bold, underline, etc.) markdown files. In my tests, LLMs have proved very good at cleaning up raw OCR output, but they need formatting information to get me all the way.
  • by ryanjshaw on 5/29/23, 10:07 AM

    This is really cool if it delivers. I tried building an app to scan till receipts. The image-to-text APIs out there really don't perform as well as you'd think. AWS Textract performed far better than the GCP and Azure equivalents and traditional OCR solutions, but it still made some really annoying errors that I had to fix with heuristics.
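
    For reference, pulling raw text lines out of a receipt image with Textract looks roughly like this; a minimal boto3 sketch with a placeholder file name, and the heuristics layered on top are not shown:

      import boto3

      textract = boto3.client("textract")

      with open("receipt.jpg", "rb") as f:
          response = textract.detect_document_text(Document={"Bytes": f.read()})

      # keep only LINE blocks; WORD blocks repeat the same text at a finer granularity
      lines = [block["Text"] for block in response["Blocks"] if block["BlockType"] == "LINE"]
      print("\n".join(lines))
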
  • by aosmith on 5/29/23, 3:49 PM

    So is this why IA had an outage? Timing is perfect.
  • by i2cmaster on 5/29/23, 11:46 AM

    I've started using Microsoft's TrOCR (another transformer OCR model) to read the cursive in my pocket journal. (I have a habit of writing programs there first while I'm out and then typing them in manually; I just focus better that way.)

    It's surprisingly accurate, although you have to write your own program to segment the image into lines. I think with some fine-tuning I could have the machine read my notebook with minimal corrections.
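
    The per-line recognition he describes looks roughly like this with the Hugging Face checkpoint; the line segmentation itself (cropping the page into single-line images) is the part you have to supply, and the file name is a placeholder:

      from PIL import Image
      from transformers import TrOCRProcessor, VisionEncoderDecoderModel

      processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
      model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

      # TrOCR expects one line of handwriting per image, so the page must be segmented first
      line_image = Image.open("journal_line_03.png").convert("RGB")
      pixel_values = processor(images=line_image, return_tensors="pt").pixel_values

      generated_ids = model.generate(pixel_values)
      print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])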