from Hacker News

PDF text extractor – in pages and regions you define

by seinecle on 8/1/23, 7:39 AM with 2 comments

  • by albert_e on 8/1/23, 8:33 AM

    Interesting.

    I have a use case that is slightly different. Maybe someone can suggest a good framework or tool:

    Our school publishes a daily PDF that someone makes by filling in a Microsoft Excel template and either printing it to PDF or using Save As PDF.

    The Excel template is fairly simple: for each subject, a block of key-value pairs laid out as a two-column table (with a fixed number of fields), and N such blocks stacked one below the other, depending on the number of subjects covered that day.

    Now the length of the PDF (whether the content fits on one page or spills onto 2 or 3) as well as the scaling of the PDF print (how big or small the text appears) varies a lot because of the inconsistent manual steps they follow.

    What would be a good way to automate the extraction of text from such a daily PDF feed?
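
    One possible shape for this, as a rough sketch rather than a known-good recipe: treat every page as a source of two-column rows and start a new block whenever the template's first key reappears, so page breaks and print scaling stop mattering. The key name "Subject" is a made-up placeholder (the real template's fields are not listed here), and `pdf_rows` assumes the third-party pdfplumber library and a table drawn with cell borders.

```python
from typing import Iterable, Iterator, Optional

# Hypothetical name of the first field in each block; replace with the
# template's real first key.
FIRST_KEY = "Subject"


def rows_to_records(
    rows: Iterable[list[Optional[str]]], first_key: str = FIRST_KEY
) -> list[dict[str, str]]:
    """Group two-column (key, value) rows into one dict per subject block.

    A new block starts each time `first_key` shows up, so it does not matter
    on which page a row was found or how the print was scaled.
    """
    records: list[dict[str, str]] = []
    for row in rows:
        cells = [(cell or "").strip() for cell in row]
        if len(cells) < 2 or not cells[0]:
            continue  # skip blank or malformed rows
        key, value = cells[0], cells[1]
        if key == first_key or not records:
            records.append({})
        records[-1][key] = value
    return records


def pdf_rows(path: str) -> Iterator[list[Optional[str]]]:
    """Yield table rows from every page of the PDF, in reading order.

    Assumes pdfplumber is installed; it is imported lazily so that
    rows_to_records stays dependency-free and easy to test.
    """
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                yield from table
```

    With that in place, `rows_to_records(pdf_rows("daily.pdf"))` would give one dict per subject ("daily.pdf" is a placeholder file name). If the PDF has no visible cell borders, pdfplumber's default table detection may find nothing, and its table settings would need tuning.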

    I want to load this extracted data into a simple flat table (in, say, a SQLite database or DynamoDB) and use it to display the same content as a browsable / filterable webpage (showing content from all PDFs to date).
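
    For the SQLite option, a minimal sketch could store one row per field, which keeps the table flat without hard-coding the template's columns. The table name `entries`, the column names, and the `pdf_date` argument are all invented for illustration; the records are assumed to be one dict per subject block.

```python
import sqlite3


def save_records(conn: sqlite3.Connection, pdf_date: str, records: list[dict]) -> int:
    """Store one row per (PDF date, block number, field) and return the row count.

    Re-running the same day's PDF overwrites its rows instead of duplicating
    them, thanks to the primary key plus INSERT OR REPLACE.
    """
    conn.execute(
        """CREATE TABLE IF NOT EXISTS entries (
               pdf_date TEXT NOT NULL,
               block_no INTEGER NOT NULL,
               field    TEXT NOT NULL,
               value    TEXT,
               PRIMARY KEY (pdf_date, block_no, field)
           )"""
    )
    rows = [
        (pdf_date, block_no, field, value)
        for block_no, record in enumerate(records)
        for field, value in record.items()
    ]
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT OR REPLACE INTO entries VALUES (?, ?, ?, ?)", rows)
    return len(rows)
```

    The webpage could then filter with plain SQL, for example `SELECT * FROM entries WHERE field = ? AND value LIKE ?`.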

    I was hoping to get help from ChatGPT's code interpreter and write a Python script that I can schedule on AWS Lambda. But if there is a known approach for this kind of document processing, please point me to it. Thanks!
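
    On the Lambda side, here is a hedged sketch of what the entry point might look like. It assumes the daily PDF gets uploaded to an S3 bucket and the function is triggered by that upload rather than by a timer; that is only one way to wire it up, and the parsing/storing step is left as a placeholder comment.

```python
from urllib.parse import unquote_plus


def s3_objects(event: dict) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs from an S3 notification event."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record.get("s3")
        if not s3:
            continue  # not an S3 record
        # Object keys arrive URL-encoded in S3 event notifications.
        pairs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return pairs


def handler(event, context):
    """Lambda entry point: download each uploaded PDF to /tmp for processing."""
    import boto3  # bundled with the Lambda Python runtime

    s3 = boto3.client("s3")
    processed = []
    for bucket, key in s3_objects(event):
        # /tmp is the writable scratch directory inside Lambda.
        local_path = "/tmp/" + key.rsplit("/", 1)[-1]
        s3.download_file(bucket, key, local_path)
        # Placeholder: extract the blocks from local_path and store them.
        processed.append(key)
    return {"processed": processed}
```

    Keeping the event parsing in its own function makes it possible to test locally with a hand-written event dict, without AWS credentials.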

  • by seinecle on 8/1/23, 7:39 AM

    Part of a set of free, no-registration, point-and-click, web-based functions. Your feedback is welcome.