by nbernard on 4/28/24, 8:51 AM with 39 comments
by perihelions on 4/29/24, 11:26 AM
It's surprisingly easy to extend this type of workflow to scanned PDFs (as opposed to software-generated, text-containing ones). tesseract(1) makes short work of ToC pages with --psm set to 6 (an OCR setting that tends to collapse convoluted text layouts into a regular, software-parseable output).
Automating that example of extracting "text that looks like a header", based on page layout, relative positioning, or font weight, should also be straightforward, but I don't know of an out-of-the-box solution. (I'm working on an adjacent problem: automatically re-laying out raster documents to squeeze out whitespace and make them slightly nicer on small e-ink devices. Text islands are trivial to identify, but I don't know how to quantify font weight or similar properties. I'm "wasting" a lot of time diving into mathematics rabbit holes without knowing in advance which ones will be productive.)
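A minimal sketch of the ToC step described in this comment, assuming the page text has already been produced by something like `tesseract toc.png stdout --psm 6`. The regular expression, the `parse_toc` name, and the (level, title, page) tuple shape are all illustrative assumptions, not part of any existing tool; real OCR output will need more forgiving rules (roman numerals, misread dot leaders, wrapped titles).

```python
import re

# Hypothetical line shape: optional section number, title,
# dot leaders or spaces, then an arabic page number at end of line.
TOC_LINE = re.compile(
    r"^\s*(?:(?P<num>\d+(?:\.\d+)*)\.?\s+)?"  # e.g. "1" or "1.2"
    r"(?P<title>.+?)"                         # title text (lazy)
    r"[\s.]+"                                 # dot leaders / spaces
    r"(?P<page>\d+)\s*$"                      # printed page number
)


def parse_toc(text):
    """Turn OCR'd ToC text into (level, title, printed_page) tuples.

    Lines that do not end in a page number are skipped.
    The nesting level is guessed from the dots in the section number.
    """
    entries = []
    for line in text.splitlines():
        m = TOC_LINE.match(line)
        if not m:
            continue
        num = m.group("num")
        level = num.count(".") + 1 if num else 1
        title = m.group("title").strip()
        if num:
            title = f"{num} {title}"
        entries.append((level, title, int(m.group("page"))))
    return entries


sample = """Contents
1 Introduction ........ 3
1.1 Background . . . . 5
2 Methods 27
"""
for entry in parse_toc(sample):
    print(entry)
```

The heading line "Contents" is dropped because it has no trailing page number; everything else becomes one entry per line.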
by janpmz on 4/29/24, 11:32 AM
by chazeon on 4/29/24, 9:41 PM
The upside, compared to Krasjet's approach, is that this works not only for text-based PDFs but also for scanned PDFs, even old scanned journal papers.
The downside is that, before baking the TOCs, you need to adjust the PDF, as empty pages are sometimes not included. You also need to calculate the page offset for the prologs, cover, etc. I have a script for this kind of adjustment, but there is always manual intervention involved.
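The offset arithmetic mentioned here can be sketched roughly as follows. This is not the commenter's script; the function name `to_pdf_pages`, the single-anchor approach, and the `dropped_pages` list are assumptions made for illustration. The idea is to pick one page whose printed number and PDF position are both known, then shift every ToC entry by that difference, correcting for blank pages that were left out of the scan.

```python
def to_pdf_pages(entries, anchor_printed, anchor_pdf, dropped_pages=()):
    """Map printed page numbers to PDF page numbers.

    entries        -- iterable of (level, title, printed_page)
    anchor_printed -- printed number of a page you located by hand
    anchor_pdf     -- position of that same page in the PDF
    dropped_pages  -- printed numbers of pages missing from the scan
                      (assumed known, e.g. omitted blank pages)
    """
    offset = anchor_pdf - anchor_printed
    result = []
    for level, title, printed in entries:
        # Count missing pages strictly between the anchor and this entry.
        lo, hi = sorted((anchor_printed, printed))
        missing = sum(1 for d in dropped_pages if lo < d < hi)
        # Missing pages shrink the distance to the anchor in either direction.
        shift = -missing if printed >= anchor_printed else missing
        result.append((level, title, printed + offset + shift))
    return result


toc = [(1, "1 Introduction", 3), (1, "2 Methods", 27)]
# Printed page 1 sits on PDF page 9 (cover and prologs come first),
# and the blank printed page 10 was not scanned.
for entry in to_pdf_pages(toc, anchor_printed=1, anchor_pdf=9, dropped_pages=(10,)):
    print(entry)
```

Finding the anchor and the dropped pages is still the manual part; the sketch only keeps the arithmetic in one place.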
by mbana on 4/29/24, 11:01 AM
by mrtx01 on 4/29/24, 9:47 AM
by papichulo2023 on 4/29/24, 11:17 AM
by rajaravivarma_r on 4/30/24, 6:34 PM
For example, paragraphs, code blocks, code inlined in paragraphs etc?
I tried tesseract but it recognises code blocks as tables.
There are also edge cases: for example, paragraphs that start with an indentation are hard to tell apart from those that don't.
Appreciate any help.
by jbecke on 4/29/24, 1:33 PM
by pseingatl on 4/29/24, 8:32 PM
\tableofcontents does the job.
by maCDzP on 4/29/24, 5:01 PM
by bionade24 on 4/29/24, 11:28 AM
by zerop on 4/29/24, 11:56 AM