from Hacker News

Show HN: CLI for generating PDFs for offline reading

by dvcoolarun on 2/5/24, 7:24 PM with 43 comments

I've always thought that extensive reading was best suited for the realm of paper. As a result, I've created a command-line interface (CLI) tailored for my own use and decided to make it open source. I welcome any feedback you may have.

[Edit] Sample PDF :: https://drive.google.com/file/d/1n7M1TKOptSsYiibrbvV_Yojx53T...

by ComputerGuru on 2/5/24, 7:50 PM
I feel like if you are claiming "beautiful" output then it's obligatory to have at the very least screenshots of said output PDFs (or better yet, a sample for the same link in the CLI screenshot, especially so people can see how the text flows, what quality images are captured at, how text can be selected, etc).
by jackconsidine on 2/5/24, 8:24 PM
This is cool! I have an HN pipeline where I upvote things that I want to drill into, and a script I wrote generates PDFs and sends them to my Kindle for offline reading (great for my pipeline). It uses Playwright's "to PDF" method, which goes through the browser and is slow. I might look into replacing it with this.
If there's any interest I might OSS the pipeline
by nacho2sweet on 2/6/24, 5:16 PM
We just use a headless chrome with a sort of wrapper script to do this at my work with a bunch of settings close to the actual size of paper. It allows me to test all of our reports in media->print in dev tools then print->pdf with chrome and only have to design to that spec. Then in our reports we provide a "save as pdf" button instead of encouraging print in all the other possible browsers which would make the task insane and cause me to possibly quit.
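A rough sketch of what such a wrapper could look like in Python, not the script described above. The binary name `google-chrome`, the function names, and the output path are assumptions for illustration; only the `--headless` and `--print-to-pdf` flags are taken from Chrome itself, and paper-size settings are left out because they are not described here.

```python
import subprocess


def chrome_pdf_command(url, out_path, chrome="google-chrome"):
    """Build a headless Chrome invocation that prints url to out_path.

    The default binary name is an assumption; it differs between platforms.
    """
    return [chrome, "--headless", f"--print-to-pdf={out_path}", url]


def print_to_pdf(url, out_path, chrome="google-chrome"):
    # check=True raises CalledProcessError if Chrome exits with an error.
    subprocess.run(chrome_pdf_command(url, out_path, chrome), check=True)
```

Calling `print_to_pdf("https://example.com", "report.pdf")` would then write the PDF next to the script, provided a Chrome binary with that name is on the PATH.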
by dvcoolarun on 2/5/24, 7:59 PM
Apologies for the oversight; I forgot to include the screenshot of the sample PDF. Here it is for your reference: https://drive.google.com/file/d/1n7M1TKOptSsYiibrbvV_Yojx53T...
by dvcoolarun on 2/6/24, 3:22 AM
Arr, this blew up! I think people are missing some of the context of the script. It's a plug-and-play script where you can change the PDF quality using CSS/Python. Even the fonts are loaded through Google in Python. 'Beautiful' is contextual. You can create your own version and share it with the community.
I'm on mobile, so I can't add a Google Drive file screenshot to the readme, and iframes are not supported.

by pavs on 2/5/24, 10:30 PM

like this:

  sudo apt install pandoc wkhtmltopdf
  npm install -g readability-cli
  pandoc -s https://www.paulgraham.com/avg.html -o output.html && readable output.html -o readable.html && wkhtmltopdf readable.html output.pdf &&  open output.pdf

going even further using bash script to prompt for url.

  #!/bin/bash

  # Prompt the user for a URL
  read -p "Enter the URL: " URL

  # Use the URL in the pandoc command
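  # (variable names are case-sensitive, so this must be $URL, quoted in case of special characters)
  pandoc -s "$URL" -o output.html && readable output.html -o readable.html && wkhtmltopdf readable.html output.pdf && open output.pdf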


  chmod +x web2pdf.sh
  # add an alias to bashrc
  alias web2pdf='/path/to/your/web2pdf.sh'
  source ~/.bashrc
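
For anyone who would rather keep this in Python, here is a minimal sketch of the same three steps. It assumes `pandoc`, `readable` (from readability-cli) and `wkhtmltopdf` are already on the PATH, exactly as in the shell version; the function names are made up for illustration, and the final viewer step is left out because the opener command differs per platform.

```python
import shlex
import subprocess


def build_pipeline(url, html="output.html", clean="readable.html", pdf="output.pdf"):
    """Return the three commands from the one-liner above as argument lists."""
    return [
        ["pandoc", "-s", url, "-o", html],   # fetch the page as standalone HTML
        ["readable", html, "-o", clean],     # strip it down to the article body
        ["wkhtmltopdf", clean, pdf],         # render the cleaned HTML to PDF
    ]


def run_pipeline(url):
    # check=True stops at the first failing step, like && in the shell version.
    for cmd in build_pipeline(url):
        print("+", shlex.join(cmd))
        subprocess.run(cmd, check=True)
```

Passing the URL as a list element rather than through a shell string means no quoting is needed, e.g. `run_pipeline("https://www.paulgraham.com/avg.html")`.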

by seabass-labrax on 2/5/24, 7:30 PM
Very interesting! One piece of feedback: it would probably be more useful to have a screenshot of the PDF on your README rather than one of the CLI. Also, do you intend to release this as FOSS?
by adrian_b on 2/6/24, 11:11 AM
Both Chrome and Firefox have absolutely horrible "Print" (to PDF) commands, which render the Web pages in a different way than what they show on the screen, and which results in large parts of the page being obscured by ads, menus, headers, etc., or in parts of the Web page that are outside the rendered area, so they are missing, or in content that is compressed to a small part of the output pages.
It would be really nice if there existed a utility able to produce a PDF file where the Web pages are rendered as well as the browsers render them on the screen, without becoming confused even by complex scripts loaded by the page.
The alternatives to "Print" (producing a PDF) are even worse. A screenshot has limited resolution and it loses the text. In the past "Save as ..." was the normal solution, but now even if you save a "complete" page, it will still frequently include scripts that will no longer work offline. What I want to save are the pages perfectly rendered as they were at that instant, without any scripts that could make them appear differently in the future.
by Someone on 2/5/24, 10:04 PM
FTA: “Then you can use the tool as follows
  pipenv shell
  pipenv install
  python main.py https://www.paulgraham.com/avg.html, https://www.paulgraham.com/determination.html
Just add the webpage URLs separated by commas”
What’s the rationale for “separated by commas”? The convention for CLI arguments is to use one argument per input file.
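To illustrate the convention, here is a small sketch of one-argument-per-URL parsing with `argparse`. This is not the tool's actual code; `parse_urls` and the help strings are hypothetical names chosen for the example.

```python
import argparse


def parse_urls(argv=None):
    """Parse one URL per command-line argument and return them as a list."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Convert one or more web pages to PDF.",
    )
    # nargs="+" collects every positional argument into a list and
    # requires at least one, so the shell does the splitting.
    parser.add_argument("urls", nargs="+", metavar="URL", help="page to convert")
    return parser.parse_args(argv).urls
```

With this, the invocation would read `python main.py https://www.paulgraham.com/avg.html https://www.paulgraham.com/determination.html`. It also sidesteps the fact that a comma can legitimately appear inside a URL.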

by jll29 on 2/5/24, 8:59 PM

  % python main.py https://www.paulgraham.com/avg.html
  Traceback (most recent call last):
    File "/Users/bill/web2pdf/main.py", line 7, in <module>
      from readability import Document
  ImportError: cannot import name 'Document' from 'readability' (/Users/bill/.local/share/virtualenvs/web2pdf-gXeVRXKg/lib/python3.9/site-packages/readability/__init__.py)

But according to your Pipfile.lock, the readability module needed is 0.3.1:

  "readability": {
      "hashes": [
          "sha256:f9030df8bc31aad45baffa9a2d9ce1fdd8051833e5b5bda3027df32fdec00fad"
      ],
      "index": "pypi",
      "version": "==0.3.1"
  },

Version 0.3.1 of the module "readability" exists, but does not appear to have a class "Document".
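A quick way to narrow this down is to check which file the module is loaded from and which distributions are installed. The sketch below is only a diagnostic; `describe_module` is a made-up helper. My understanding, not verified against this repo, is that `from readability import Document` normally comes from the separate `readability-lxml` distribution, which installs a module that is also named `readability`.

```python
import importlib
from importlib import metadata


def describe_module(module_name, attr, dist_names):
    """Report where a module was loaded from and whether it exposes attr."""
    report = {"module": module_name, "file": None, "has_attr": False, "installed": {}}
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        module = None
    if module is not None:
        report["file"] = getattr(module, "__file__", None)
        report["has_attr"] = hasattr(module, attr)
    # Look up each candidate distribution by its PyPI name;
    # None means that distribution is not installed.
    for name in dist_names:
        try:
            report["installed"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report["installed"][name] = None
    return report
```

Running `describe_module("readability", "Document", ["readability", "readability-lxml"])` inside the virtualenv would show whether the wrong distribution is the one providing the module.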

by OhMeadhbh on 2/5/24, 8:21 PM
Apropos of nothing, I added this function so I don't have to leave the command line to see the PDF.
```
   pdfpage() {
     convert -resize 0x1000^ "${1}"[${2}] -background white -flatten sixel:-
   }
```
You can probably deduce that it assumes you have ImageMagick installed and you're in a terminal with sixel support.
by fishywang on 2/6/24, 5:47 AM
Somewhat similarly, I wrote a web app to generate EPUBs (instead of PDFs) out of URLs and send them to e-ink reader(s) directly (via a Telegram bot) so I can read them. Currently it supports sending the EPUB by email (for Kindle) or uploading it to Dropbox (for Kobo, etc.). It originally also supported reMarkable cloud, but we can no longer make reMarkable cloud actually work. There's also a REST API to generate an EPUB to be downloaded directly: https://github.com/fishy/url2epub/blob/main/REST.md
For e-ink readers, EPUBs are generally better than PDFs for URLs anyway, as EPUBs are basically packaged HTML, and the reflowable text works better on smaller screens.
by Throw73747 on 2/5/24, 8:31 PM
Perhaps add uBlock filter support? I use it to strip unwanted elements from the page before printing. On Hacker News discussions it removes forms, reply links, headers and footers...
by rahimnathwani on 2/6/24, 1:25 AM
For print or PDF, I like multi-column newspaper style, as created by this extension: https://chromewebstore.google.com/detail/simple-print/nalmbm...
One benefit of using a Chrome extension (vs. CLI) is that it's easy to 'print' things that require authentication.
by jll29 on 2/5/24, 8:56 PM
Have you compared it with a conversion by pandoc (https://pandoc.org/)?
by sn0n on 2/6/24, 7:27 AM
Does it run a headless Chrome for pixel-perfect formatting, rendering the page as it is laid out on screen and applying that layout to the PDF while ignoring the page's print CSS rules? Cuz that would be a nice start. And an option for the size to be based on pixel width for an ideal layout... Because I won't be printing, I will be viewing on my phone, so one overly large page would be perfect.
by harry8 on 2/5/24, 11:58 PM
Webbrowser opens url -> print -> save as/to pdf?
I'm sure I'm missing something, what is a cli interface buying me here?
by K2h on 2/5/24, 8:00 PM
Very cool! In README.md, is that an extra 'p' in Webp2pdf?
by codeonline on 2/6/24, 2:23 AM
Can you add comparison pdfs generated by pandoc and gotenberg?
by skanga on 2/5/24, 8:08 PM
Found some potential bugs. Please check the github issues page.