from Hacker News

Pdf.tocgen

by nbernard on 4/28/24, 8:51 AM with 39 comments

  • by perihelions on 4/29/24, 11:26 AM

    - "That is, you shouldn’t expect it to work with scanned PDFs"

    It's surprisingly easy to extend this type of workflow to scanned PDFs (as opposed to software-generated, text-containing ones). tesseract(1) makes short work of ToC pages with --psm set to 6, an OCR page-segmentation mode that tends to collapse convoluted text layouts into regular, software-parseable output.

    It should also be straightforward, though I don't know of an out-of-the-box solution, to automate that example of extracting "text that looks like a header" based on page layout/relative positioning, or font weight. (I'm working on an adjacent problem: an automatic re-layout of raster documents to squeeze out whitespace and make them slightly nicer on small e-ink devices. Text islands are trivial to identify, but I don't know how to quantify font weight or properties like that. I'm "wasting" a lot of time diving into mathematics rabbit holes without knowing in advance which ones will be productive.)
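
    The tesseract step described above can be sketched as a small wrapper. This is a hedged sketch: the function names and the image filename are mine for illustration, not part of any existing tool; only the tesseract(1) invocation itself (image, "stdout" output base, --psm flag) is real CLI usage.

```python
import shutil
import subprocess

def tesseract_toc_cmd(image_path, psm=6):
    """Build the tesseract(1) invocation for a ToC page image.

    --psm 6 tells tesseract to assume a single uniform block of
    text, which tends to flatten convoluted ToC layouts into
    regular, line-per-entry output. "stdout" as the output base
    makes tesseract print recognized text instead of writing a file.
    """
    return ["tesseract", image_path, "stdout", "--psm", str(psm)]

def ocr_toc_page(image_path):
    """Run tesseract on a scanned ToC page and return the text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed")
    result = subprocess.run(
        tesseract_toc_cmd(image_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```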

  • by janpmz on 4/29/24, 11:32 AM

    Recently I found that the getToc function in PyMuPDF was too slow. I told them about it on their Discord, and a day later they had fixed it; now it only takes a couple of milliseconds. I'm using it for my project pdftomp3. Pdf.tocgen looks useful too, but I'm not sure whether I can use it because of the license.
  • by chazeon on 4/29/24, 9:41 PM

    I have been thinking about this for a while now, but I have settled on using GPT-4V's multimodal capability (via ChatGPT) to generate a text file containing the titles and pages based on screenshots of the ToC. After that, I use a pikepdf-based Python script to bake the ToC into the PDF I have.

    The upside, compared to Krasjet's approach, is that this works not only for text-based PDFs but also for scanned PDFs, even old scanned journal papers.

    The downside is that, before baking the ToC, you need to make adjustments to the PDF, since sometimes the empty pages are not included. You also need to calculate the offset for the prologue, cover, etc. I have a script for this kind of adjustment, but there is always some manual intervention involved.

  • by mbana on 4/29/24, 11:01 AM

    I love the typography on the site. What fonts are you using? I'm on a mobile browser so I can't really see.
  • by mrtx01 on 4/29/24, 9:47 AM

    What a beautiful website!
  • by papichulo2023 on 4/29/24, 11:17 AM

    Looks like a very good tool to integrate with knowledge graphs, or just RAG (LLM) pipelines.
  • by rajaravivarma_r on 4/30/24, 6:34 PM

    Is it possible to extract different patterns of text from a PDF document?

    For example, paragraphs, code blocks, code inlined in paragraphs etc?

    I tried tesseract but it recognises code blocks as tables.

    There are also edge cases: paragraphs that start with an indentation and those without one are hard to differentiate.

    Appreciate any help.

  • by jbecke on 4/29/24, 1:33 PM

    We (macro.com) have something similar, but without the recipe part, in our PDF/Word processor. It works pretty well on numbered headings but not so well on non-numbered ones. We're thinking of porting it over to LLMs at some point.
  • by pseingatl on 4/29/24, 8:32 PM

    Since when do you need the hyperref package to generate a table of contents under LaTeX (as the author claims)?

    \tableofcontents does the job.
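
    For what it's worth, the two mechanisms differ: \tableofcontents typesets a ToC page inside the document, while hyperref is what writes the PDF outline, i.e. the sidebar bookmarks, which is the kind of metadata Pdf.tocgen recovers. A minimal illustration:

```latex
\documentclass{article}
% \tableofcontents alone typesets a ToC *page*; hyperref
% additionally writes the PDF *outline* (sidebar bookmarks)
% from the sectioning commands.
\usepackage[bookmarks=true]{hyperref}
\begin{document}
\tableofcontents
\section{Introduction}
\end{document}
```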

  • by maCDzP on 4/29/24, 5:01 PM

    That is a beautiful website. I got lost in it and it created a sense of wonder. Nice.
  • by bionade24 on 4/29/24, 11:28 AM

    Does someone know a tool that is sed- or awk-like for PDFs?
  • by zerop on 4/29/24, 11:56 AM

    Can I use this tool to get a ToC for arXiv papers?