from Hacker News

Ask HN: What are you using to parse PDFs for RAG?

by carlbren on 7/25/24, 7:50 PM with 94 comments

Hi, I'm looking for a simple way to convert PDFs into Markdown with integrated images and tables. Tried LlamaIndex, but no integrated images. Tried LangChain, but some PDFs end up with the footer parsed before the body text. Tried the Adobe PDF API, but you have to pay $25K upfront!
  • by whakim on 7/30/24, 7:21 AM

    We have been using different things for text, images, and tables. I think it's worth pointing out that PDFs are extremely messy under the hood, so expecting perfect output is a fool's errand; transformers are extremely powerful and can often do surprisingly well even when you've accidentally mashed a set of footnotes into the middle of a paragraph or something.

    For text, unstructured seems to work quite well and does a good job of quickly processing easy documents while falling back to OCR when required. It is also quite flexible with regards to chunking and categorization, which is important when you start thinking about your embedding step. OTOH it can definitely be computationally expensive to process long documents which require OCR.

    For images, we've used PyMuPDF. The main weakness we've found is that it doesn't seem to have a good story for dealing with vector images - it seems to output its own proprietary vector type. If anyone knows how to get it to output SVG that'd obviously be amazing.

    For tables, we've used Camelot. Tables are pretty hard though; most libraries are totally fine for simple tables, but there are a ton of wild tables in PDFs out there which are barely human-readable to begin with.

    For tables and images specifically, I'd think about what exactly you want to do with the output. Are you trying to summarize these things (using something like GPT-4 Vision?) Are you trying to present them alongside your usual RAG output? This may inform your methodology.

  • by infecto on 7/30/24, 11:54 AM

    I am surprised nobody has mentioned it yet.

    If this is for anything even slightly commercial, you are probably going to have the best luck using Textract/Document Intelligence/Document AI. Nothing else listed in the comments is as accurate, especially when trying to extract forms, tables and text. A multi-modal model will take care of the images. The combination of those two will get you a great representation of the PDF.

    Open-source tools work and can be extremely powerful, but 1) you won't have images, and 2) your workflows will break unless you are building for a specific PDF template.

  • by reerdna on 7/30/24, 10:59 AM

    For use in retrieval/RAG, an emerging paradigm is to not parse the PDF at all.

    By using a multi-modal foundation model, you convert visual representations ("screenshots") of the pdf directly into searchable vector representations.

    Paper: Efficient Document Retrieval with Vision Language Models - https://arxiv.org/abs/2407.01449

    Vespa.ai blog post https://blog.vespa.ai/retrieval-with-vision-language-models-... (my day job)
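
    The late-interaction ("MaxSim") scoring behind this line of work is simple enough to sketch. This is an illustrative toy, not the paper's implementation: a real system gets the query-token and page-patch vectors from a vision-language model, whereas here they are hand-written 3-d lists.

```python
# Toy illustration of late-interaction ("MaxSim") scoring for visual
# retrieval: each query-token vector is matched to its best page-patch
# vector, and the per-token maxima are summed into a page score.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, patch_vecs):
    # For each query token, keep its best similarity over all patches.
    return sum(max(dot(q, p) for p in patch_vecs) for q in query_vecs)

def rank_pages(query_vecs, pages):
    # pages: {page_id: [patch_vec, ...]} -> page ids, best match first.
    return sorted(pages, key=lambda pid: maxsim_score(query_vecs, pages[pid]),
                  reverse=True)
```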

  • by jumploops on 7/30/24, 7:12 AM

    In my experience Azure’s Form Recognizer (now called “Document Intelligence”) is the best (cheapest/most accurate) PDF parser for tabular data.

    If I were working on this problem in 2024, I’d use Azure to pre-process all docs into something machine parsable, and then use an LLM to transform/structure the processed content into my specific use-case.

    For RAG, I’d treat the problem like traditional search (multiple indices, preprocess content, scoring, etc.).
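
    A hypothetical sketch of what "treat it like traditional search" can mean in practice: query several indices (say, titles, body text, extracted table text), score each, and merge with hand-tuned weights rather than relying on a single vector lookup. The toy term-overlap scorer stands in for whatever real relevance function you use.

```python
# Multi-index search sketch: per-index scores are combined with weights.

def score_term_overlap(query, text):
    # Crude relevance proxy: fraction of query terms present in the text.
    terms = query.lower().split()
    hay = text.lower()
    return sum(t in hay for t in terms) / len(terms)

def search(query, indices, weights):
    # indices: {index_name: {doc_id: text}}; weights: {index_name: float}.
    totals = {}
    for name, docs in indices.items():
        w = weights.get(name, 1.0)
        for doc_id, text in docs.items():
            totals[doc_id] = totals.get(doc_id, 0.0) + w * score_term_overlap(query, text)
    return sorted(totals, key=totals.get, reverse=True)
```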

    Make the easy things easy, and the hard things possible.

  • by serjester on 7/30/24, 1:27 PM

    Open Source Full Featured: https://github.com/Filimoa/open-parse/ [mine]

    https://docs.llamaindex.ai/en/stable/api_reference/node_pars... [text splitters lose page metadata]

    https://github.com/VikParuchuri/marker [strictly to markdown]

    Layout Parsers: These are collections of ML models to parse the core "elements" from a page (heading, paragraph, etc). You'll still need to work on combining these elements into queryable nodes. https://github.com/Layout-Parser/layout-parser

    https://github.com/opendatalab/PDF-Extract-Kit

    https://github.com/PaddlePaddle/PaddleOCR

    Commercial: https://reducto.ai/ [great, expensive]

    https://docs.llamaindex.ai/en/stable/llama_cloud/llama_parse... [cheapest, but buggy]

    https://cloud.google.com/document-ai https://aws.amazon.com/textract/
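
    The "combining these elements into queryable nodes" step mentioned above can be sketched in a few lines. This is a hypothetical minimal version, assuming the layout parser emits (category, text) pairs in reading order: each run of paragraphs is grouped under the heading that precedes it, giving section-sized chunks for retrieval.

```python
# Group layout-parser style elements into heading-keyed nodes.

def elements_to_nodes(elements):
    # elements: [(category, text), ...] in reading order.
    nodes, heading, buffer = [], None, []
    for category, text in elements:
        if category == "heading":
            if buffer:
                nodes.append({"heading": heading, "text": " ".join(buffer)})
                buffer = []
            heading = text
        else:
            buffer.append(text)
    if buffer:
        nodes.append({"heading": heading, "text": " ".join(buffer)})
    return nodes
```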

  • by esquivalience on 7/30/24, 6:41 AM

    PyMuPDF seems to be intended for this use-case and mentions images:

    https://medium.com/@pymupdf/rag-llm-and-pdf-conversion-to-ma...

    (Though the article linked above has the feeling, to me, of being at least partly AI-written, which does cause me to pause)

    > Update: We have now published a new package, PyMuPDF4LLM, to easily convert the pages of a PDF to text in Markdown format. Install via pip with `pip install pymupdf4llm`. https://pymupdf4llm.readthedocs.io/en/latest/

  • by Teleoflexuous on 7/30/24, 8:02 AM

    My use case is research papers. That means very clear text, combined with graphs of varying form and quality and finally occasional formulas.

    Three approaches I had the most, but not full, success with are: 1) converting to images with pdf2image, then reading with pytesseract; 2) throwing whole PDFs into pypdf; 3) experimental multimodal models.

    You can get further if you make the content more predictable (if you know a section is pure text, just put it through pypdf; if you know it's a math formula, explain the field to the model and have it read the formula back for a high-accessibility-needs audience), but it continues to be a nightmare and a bottleneck.
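
    The routing idea above can be sketched as a dispatch table. The handlers here are stand-ins (not real library calls) for, say, pypdf text extraction, OCR, and a multimodal model:

```python
# Route each known region type to the extractor suited to it; stubs stand
# in for real extractors (pypdf, pytesseract, a multimodal model, ...).

def extract_text(region):
    return "text:" + region

def extract_formula(region):
    return "formula:" + region

def extract_figure(region):
    return "figure:" + region

HANDLERS = {
    "text": extract_text,
    "formula": extract_formula,
    "figure": extract_figure,
}

def route(region_type, region, fallback=extract_text):
    # Unknown region types fall back to plain text extraction.
    return HANDLERS.get(region_type, fallback)(region)
```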

  • by kkfx on 7/30/24, 11:12 AM

    You can start by looking at pdftotext -layout, and maybe pandoc.

    Personally, I hope publishers one day learn the value of data and its representations, and decide to embed it: a *sv file attached to the PDF so the tabular data is immediately available, a .gp or similar file for graphs, etc. Essentially, the concept of embedding the PDF's "sources" as attachments. In LaTeX it's easy to attach the LaTeX source itself to the final PDF, but so far no one seems interested in making that a habit.
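
    For what it's worth, the attachment mechanism already exists on the LaTeX side via the `attachfile` package, which embeds arbitrary files in the output PDF. A minimal sketch (the `data.csv` filename is made up):

```latex
\documentclass{article}
\usepackage{attachfile} % embeds files as PDF attachments
\begin{document}
Quarterly figures (source data attached):
% attach the raw CSV behind an annotation icon in the output PDF
\attachfile[description={Table 1 source data}]{data.csv}
\end{document}
```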

  • by longnguyen on 7/30/24, 10:00 AM

    My apps are native Mac apps [0] [1] so naturally I use the native SDK for that.

    Apple provides PDFKit framework to work with PDFs and it works really well.

    For scanned documents, I use the Vision framework to OCR the content.

    Some additional content cleaning is still required but overall I don’t need any other third-party libraries.

    [0]: https://boltai.com

    [1]: https://pdfpals.com

  • by marcoperuano on 7/30/24, 11:09 AM

    Haven’t tried converting it to markdown specifically, but if you want to try a different approach, google’s DocAI has been pretty great. It provides you with the general structure of the document as blocks (paragraph and headers) with coordinates. This makes it so you can send that data to an LLM during the RAG process and get citations of where the answers were found, down to the line of text.
  • by screature2 on 7/30/24, 8:58 AM

    Maybe Nougat? The examples look pretty impressive: https://facebookresearch.github.io/nougat/ https://github.com/facebookresearch/nougat

    Though the model weight licenses are CC BY-NC.

  • by zbyforgotp on 7/30/24, 7:25 AM

    Tables are a hard case for RAG: even if you parse them perfectly into Markdown, LLMs still tend to struggle to interpret them.
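
    The "parse them perfectly into Markdown" step is itself easy to sketch once you have clean rows; the point above is that even output this tidy can still trip up an LLM on wide or nested tables. A minimal stdlib version, assuming cells arrive as lists of strings:

```python
# Render parsed table rows as a Markdown table (first row is the header).

def rows_to_markdown(rows):
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)
```
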
  • by gvv on 7/30/24, 8:15 AM

    I've had most success using PDFMinerLoader

    (https://api.python.langchain.com/en/latest/document_loaders/...)

    It deals pretty well with PDF containing a lot of images.

  • by yawnxyz on 7/30/24, 6:18 AM

    for web PDFs I'm using https://jina.ai/reader/ — completely free. Does most of the job fine.

    Code: https://github.com/jina-ai/reader

  • by nicoboo on 7/30/24, 6:58 AM

    I've experimented with GCP's Stack using Agent Builder and relying on Gemini Pro 1.5.

    I also experimented with a pretty large set of varied files (around 6,000 full video game manuals), where I used OCR parsing in a similar configuration, with mixed results due to the visual complexity of the original content.

  • by gautiert on 7/30/24, 8:00 AM

    Hi! Show HN: Zerox – Document OCR with GPT-mini | https://news.ycombinator.com/item?id=41048194

    This lib converts the PDF page by page to images and feeds them to GPT-4o-mini. The results are pretty good!

  • by dgelks on 7/30/24, 8:04 AM

    Previously have used https://github.com/pdf2htmlEX/pdf2htmlEX to convert PDF to HTML at scale, could potentially try and parse the output html to markdown as second stage.
  • by siquick on 7/30/24, 8:16 AM

    Llamaparse by LlamaIndex is probably SOTA at the moment and seems to have no problems with tables. Pricing is good at the moment too.

    https://www.llamaindex.ai/enterprise

  • by Angostura on 7/30/24, 12:48 PM

    Fascinating discussion, but 'RAG'? Sorry, probably obvious, but can someone clue me in?
  • by pookee on 7/30/24, 7:56 AM

    We're currently implementing this with https://mathpix.com/, it is not free but really not that expensive. It looks very promising
  • by cm2187 on 7/30/24, 8:06 AM

    I had some success using pdfpig, by ugly toad.

    https://uglytoad.github.io/PdfPig/

    Plus you get to raise the eyebrows of your colleagues.

  • by mschwarz on 7/30/24, 8:23 AM

    Did you try llamaparse from Llamaindex? It’s a cloud service with a free tier. Recently switched to it from unstructured.io and it works great with the kinds of images and table graphics I feed it.
  • by bartread on 7/30/24, 9:57 AM

    I need to get some data out of a table in a regularly published PDF file.

    The thing is the table looks like a table when the PDF is rendered, but there's nothing within the PDF itself to semantically mark it out as a table: it's just a bunch of text and graphical elements placed on the page in an arrangement that makes them look like a table to a human being reading the document.

    What I've ended up doing, after much experimentation[0], is use poppler to convert the PDF to HTML, then find the start and end of the table by matching on text that always appears at header and footer. Fortunately the row values appear in order in the markup so I can then look at the x coordinates of the elements to figure out which column they belong to or, rather, when a new row starts.

    What I actually do due to #reasons is spit out the rows into a text file and then use Lark to parse each row.
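
    A hypothetical sketch of the x-coordinate trick described above, with made-up column positions: spans arrive in document order with their page x position, get snapped to the nearest known column, and a new row starts when a column repeats.

```python
# Reconstruct table rows from positioned text spans (as poppler's HTML
# output provides) by snapping each span to the nearest column x position.

def assign_column(x, column_xs, tolerance=10):
    # Index of the nearest column, or None if nothing is close enough.
    best = min(range(len(column_xs)), key=lambda i: abs(column_xs[i] - x))
    return best if abs(column_xs[best] - x) <= tolerance else None

def spans_to_rows(spans, column_xs):
    # spans: [(x, text), ...] in document order -> list of {col: text} rows.
    rows, current = [], {}
    for x, text in spans:
        col = assign_column(x, column_xs)
        if col is None:
            continue
        if col in current:          # column repeats: a new row begins
            rows.append(current)
            current = {}
        current[col] = text
    if current:
        rows.append(current)
    return rows
```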

    Bottom line: it works well for my use case but I'd obviously recommend you avoid any situation where your API is a PDF document if at all possible.

    EDIT: A little bit more detail might be helpful.

    You could use poppler to convert to HTML, then from there implement a pipeline to convert the HTML to markdown. Just bear in mind that the HTML you get out of poppler is far removed from anything semantic, or at least it has been with the PDFs I'm working with: e.g., lots of <span> elements with position information and containing text, but not much to indicate the meaning. Still, you may find that if you implement a pipeline, where each stage solves one part of the problem of transforming to markdown, you can get something usable.

    Poppler will spit out the images for you but, for reasons I've already outlined, tables are likely to be painful to deal with.

    I notice some commenters suggesting LLM based solutions or services. I'd be hesitant about that. You might find an LLM helpful if there is a high degree of variability within the structural elements of the documents you're working with, or for performing specific tasks (like recognising and extracting markup for a table containing particular information of interest), but I've enough practical experience with LLMs not to be a maximalist, so I don't think a solely LLM-based approach or service will provide a total solution.

    [0] Python is well served with libraries that either emit or parse PDFs but working with PDF object streams is no joke, and it turned out to be more complex and messier - for my use case - than simply converting the PDF to an easier to work with format and extracting the data that way.

  • by constantinum on 7/26/24, 6:46 PM

    I would recommend giving LLMWhisperer a try with documents pertaining to your use case.

    https://unstract.com/llmwhisperer/

    Try demo in playground: https://pg.llmwhisperer.unstract.com/

    Quick tutorial: https://unstract.com/blog/extract-table-from-pdf/

  • by simianparrot on 7/30/24, 10:47 AM

    I convert the PDF to images and then parse the images with tesseract OCR. That’s been the most consistent approach to run locally.
  • by zoeyzhang on 7/30/24, 8:29 AM

    Check out this - https://hellorag.ai/
  • by jwilk on 7/30/24, 6:47 AM

    What's RAG?
  • by BerislavLopac on 7/30/24, 8:41 AM

    Have you tried Pandoc [0]?

    [0] https://pandoc.org/

  • by gsemyong on 7/30/24, 11:18 AM

    Checkout this https://parsedog.io
  • by teapowered on 7/30/24, 8:50 AM

    Apache Tika Server is very easy to set up - it can be configured to use tesseract for OCR.
  • by paulluuk on 7/30/24, 11:47 AM

    As others have mentioned, if you have text-only PDFs then pypdf is free, fast and simple.
  • by postepowanieadm on 7/30/24, 8:07 AM

    mupdf's mutool gives access to the most data of all the solutions I have checked.
  • by wcallahan on 7/30/24, 6:40 PM

    Jina.ai’s API is one of the best parsers I’ve seen. And better priced.
  • by Ey7NFZ3P0nzAe on 7/30/24, 3:11 PM

    For my RAG project [WDoc](https://github.com/thiswillbeyourgithub/WDoc/tree/dev) I use multiple PDF parsers, then use heuristics to keep the best output. The code is at https://github.com/thiswillbeyourgithub/WDoc/blob/654c05c5b2...

    And the heuristics are partly based on using fastText to detect languages: https://github.com/thiswillbeyourgithub/WDoc/blob/654c05c5b2...

    It's probably crap for tables but I don't want to rely on external parsers.

  • by exe34 on 7/30/24, 6:54 AM

    nowadays you might have some luck feeding PNG images to multimodal LLMs.
  • by rayxi271828 on 7/30/24, 7:14 AM

    Have you tried unstructured.io? So far seems promising.
  • by brudgers on 7/26/24, 5:44 PM

    have to pay $25K upfront

    That's a lot of your money.

    It's not a big dose of OPM (Other People's Money).

    When building a business, adequate capitalization solves a lot of technical problems. If you aren't building a business, money is different, and there's nothing wrong with not building a business. Good luck.

  • by alexliu518 on 7/27/24, 6:42 AM

    I understand your point, but if you are sure a piece of software is excellent, then paying for it is a good habit.
  • by arthurcolle on 7/30/24, 6:37 AM

    We have an excellent solution for this at Brainchain AI that we call Carnivore.

    We're only a few days away from deploying an SDK for exactly this use case, among others.

    If you'd like to speak with our team, please contact us! We would love to help you get through your PDF and other file type parsing issues with our solution. Feel free to ping us at data [at] brainchain.ai