by yigitkonur35 on 9/22/24, 2:05 AM with 91 comments
In testing with NASA's Apollo 17 flight documents, it successfully converted complex, multi-oriented pages into well-structured Markdown.
The project is open-source and available on GitHub. Feedback is welcome.
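For readers curious how such a pipeline fits together, here is a minimal sketch of the general approach (render each PDF page to an image, send it to a vision-capable LLM, collect Markdown). The model name, prompt, and helper names are illustrative assumptions, not the project's actual code.

```python
import base64
import io

from openai import OpenAI                 # assumed: vision-capable chat completions API
from pdf2image import convert_from_path   # assumed: requires poppler installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def page_to_data_url(page) -> str:
    """Encode a rendered PDF page (PIL image) as a base64 PNG data URL."""
    buf = io.BytesIO()
    page.save(buf, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()


def pdf_to_markdown(path: str, model: str = "gpt-4o") -> str:
    """Convert every page of a PDF into Markdown via a vision LLM."""
    pages = convert_from_path(path, dpi=200)
    chunks = []
    for page in pages:
        response = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Transcribe this page into well-structured Markdown. "
                             "Preserve tables, headings, and reading order."},
                    {"type": "image_url",
                     "image_url": {"url": page_to_data_url(page)}},
                ],
            }],
        )
        chunks.append(response.choices[0].message.content)
    return "\n\n".join(chunks)
```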
by Oras on 9/22/24, 4:40 PM
As others mentioned, consistency is key in parsing documents, and consistency is not a feature of LLMs.
The output might look plausible, but without proper validation this is just a nice local playground that can’t make it to production.
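One cheap, purely illustrative sanity check (not from the thread or the project) is to verify that numbers appearing in a plain-text extraction of the page also appear in the LLM's Markdown, which catches gross omissions and some hallucinated figures:

```python
import re


def numeric_tokens(text: str) -> set[str]:
    """Collect numeric tokens (integers and decimals) from a block of text."""
    return set(re.findall(r"\d+(?:\.\d+)?", text))


def check_numbers(source_text: str, markdown_output: str) -> tuple[set[str], set[str]]:
    """Return (missing, unexpected) numbers relative to the source text."""
    src, out = numeric_tokens(source_text), numeric_tokens(markdown_output)
    missing = src - out        # numbers dropped by the model
    unexpected = out - src     # numbers the model may have invented
    return missing, unexpected
```

It is only a heuristic (legitimate reformatting can change "1,000" to "1000", for instance), but it turns "looks plausible" into something measurable.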
by zerop on 9/22/24, 10:41 AM
by pierre on 9/22/24, 9:54 AM
The hard part is preventing the model from ignoring parts of the page and from hallucinating (see some of the GPT-4o samples here, like the Xanax notice: https://www.llamaindex.ai/blog/introducing-llamaparse-premiu...)
However, this model will get better, and we may soon have a good PDF-to-Markdown model.
by constantinum on 9/23/24, 2:55 AM
As others mentioned, accuracy is only one part of the solution criteria. Others include how the preprocessing engine performs and scales at large volumes, and how it handles very complex documents like bank loan forms with checkboxes, IRS tax forms with multi-layered nested tables, etc.
https://unstract.com/llmwhisperer/
LLMWhisperer is part of Unstract, an open-source tool for unstructured document ETL.
by jdthedisciple on 9/22/24, 8:35 AM
by smusamashah on 9/22/24, 6:35 AM
That converted NASA doc should be included in repo and linked in readme if you haven't already.
by bravura on 9/22/24, 10:18 AM
by charlie0 on 9/22/24, 3:25 PM
by TZubiri on 9/22/24, 7:54 PM
by eth0up on 9/22/24, 12:33 PM
I had previously done so manually, with regex, and was surprised by the quality of GPT's end results, despite many preceding failed iterations. The work was done in two steps: first with pdf2text, then with Python.
I'm still trying to create a script to extract the latest numbers from the FL website and append them to a CSV list, without re-running the stripping script on the whole PDF every time. Why? I want people to be able to freely search the entire history of winning numbers, which, in the site's hosted search function, is limited to only two of 30+ years.
I know there's a more efficient method, but I don't know more than that.
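A hedged sketch of the incremental part (the results URL, date format, and parse_draws helper are placeholders, since the FL site's actual layout isn't described here): keep the CSV keyed by draw date, read the most recent date already recorded, and append only newer rows instead of reprocessing the whole PDF.

```python
import csv
from datetime import date
from pathlib import Path

CSV_PATH = Path("winning_numbers.csv")  # columns: draw_date, numbers


def last_recorded_date() -> date | None:
    """Return the newest draw date already in the CSV, or None if empty."""
    if not CSV_PATH.exists():
        return None
    with CSV_PATH.open(newline="") as f:
        dates = [date.fromisoformat(row["draw_date"]) for row in csv.DictReader(f)]
    return max(dates) if dates else None


def append_new_draws(draws: list[dict]) -> int:
    """Append only draws newer than what's on file; returns how many were added."""
    cutoff = last_recorded_date()
    new_rows = [d for d in draws
                if cutoff is None or date.fromisoformat(d["draw_date"]) > cutoff]
    write_header = not CSV_PATH.exists()
    with CSV_PATH.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["draw_date", "numbers"])
        if write_header:
            writer.writeheader()
        writer.writerows(new_rows)
    return len(new_rows)

# `draws` would come from scraping the latest results page, e.g.:
# draws = parse_draws(requests.get(LATEST_RESULTS_URL).text)  # hypothetical helper
```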
by refulgentis on 9/22/24, 6:34 PM
I appreciate your work, your intent, and your sharing it. It's very important to understand what you're doing and its context when you share it.
At that point, you are responsible for it, and the choices you make when communicating about it reflect on you.
by fzysingularity on 9/22/24, 9:30 PM
by KoolKat23 on 9/23/24, 11:16 AM
I know this was an issue when GPT-4 Vision initially came out, due to training; I'm not sure if it's a solved problem or if your tool handles it.
by jdross on 9/22/24, 5:55 AM
by magicalhippo on 9/22/24, 6:41 AM
by devops000 on 9/23/24, 7:50 AM
by scottmcdot on 9/22/24, 12:49 PM
by wittjeff on 9/22/24, 8:03 PM
by bschmidt1 on 9/23/24, 11:07 PM
I won't tell them :) :D >:D :|