from Hacker News

New Ghostscript PDF interpreter

by diskmuncher on 7/31/22, 3:40 PM with 92 comments

by neilv on 7/31/22, 7:05 PM
Years back, I raised how Ghostscript had evolved over a very long time, together with the huge complexity of the PDF specs, as a potential source of vulnerabilities.
(But maybe it wasn't as much on people's radars, with all the lower-hanging fruit of other technology choices and practices going on, outside of PDF.)
New code for a large spec is also interesting for potential vulns, but maybe easier to get confidence about.
One neat direction they could go is to be considered more trustworthy than the Adobe products. For example, if one is thinking of a PDF engine as (among other purposes) supporting the use case of a PDF viewer that's an agent of the interests of that individual human user, then I suspect you're going to end up with different attention and decisions affecting security (compared to implementations from businesses focused on other goals).
(I say agent of the individual user, but that can also be aligned with enterprise security, as an alternative to risk management approaches that, e.g., ultimately will decide they're relying on gorillas not to make it through the winter.)
by kisamoto on 7/31/22, 7:52 PM
Not sure why this is being posted now as this is from March...
But anyway - I understand why they have changed their interpreter; however, the lack of a major version bump threw me off. I use ps2pdf to optimize PDFs (long story short - it makes their size smaller) and was alarmed when my PDFs suddenly ended up without the JPEG backgrounds. Instead, they were purely black (although this did result in a very small file size so who knows... :) )
Thankfully you can add `-dNEWPDF=false` to your command to use the old parser. I've yet to submit a bug report, but it would be nice if it were backwards compatible...
by hnick on 8/1/22, 12:58 AM
> Because Acrobat will open these files, there is considerable pressure for Ghostscript to do so as well, though we do try to at least flag warnings to the user when something is found to be incorrect, giving the user a chance to intervene.
Anyone who has done PDF composition for a "print ready" job (what a lie) from a client has run into this so many times. All we have to do is rearrange the pages in the right sorted order, add some barcodes, and print, right? Acrobat can open the file, so why is your printer crashing? Ironically, some of those printers used an Adobe RIP in the toolchain and this conversion PDF->PS on the printer was where things went wrong (I once tracked down a crash where a font's glyph name definition in the dict was OK in PDF but invalid syntax in PS, due to a // resolving into an immediately evaluated name that doesn't exist) but it's not something a technician could help with.
It was so bad that Ghostscript was one of many tools - we'd throw a PDF through various toolchains to hope one of them saved it in a format that was well behaved. Anyway I'm almost sad I've moved on from that job now so I can't try it out with some real world files. But in the end most of the issues came down to fonts and people using workflows that involve generating single document PDFs and merging them, resulting in things like 1000 subset fonts which are nearly identical and consuming all the printer memory, so I'm not sure how well this would help.
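A rough sketch of how one might spot the "1000 nearly identical subset fonts" problem described above. It relies on the PDF convention that a subset font's name starts with a tag of six uppercase letters followed by `+`. The function name and the sample font names are made up for illustration, and getting the list of font names out of a real file is left to whatever PDF library is at hand.

```python
import re
from collections import Counter
from typing import Iterable

# Subset fonts in a PDF are named like "ABCDEF+Helvetica":
# six uppercase letters, a plus sign, then the base font name.
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def count_subsets_by_base(font_names: Iterable[str]) -> Counter:
    """Count how many subset copies exist for each base font name.

    Names without a subset tag (fully embedded or non-embedded fonts)
    are ignored.
    """
    counts: Counter = Counter()
    for name in font_names:
        if SUBSET_TAG.match(name):
            base = SUBSET_TAG.sub("", name)
            counts[base] += 1
    return counts


# Hypothetical font names, as they might appear after merging
# many single-document PDFs into one print job.
sample = [
    "ABCDEF+Helvetica",
    "GHIJKL+Helvetica",
    "MNOPQR+Helvetica",
    "QWERTY+Courier",
    "Times-Roman",
]
subset_counts = count_subsets_by_base(sample)
```

A base font with a very high count is a candidate for merging its subsets before the job goes to the printer.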
by toddm on 7/31/22, 5:33 PM
Ghostscript (well, gv) got me through the 1990s and beyond as part of my TeX -> dvips -> gv workflow.
Kudos and thank you to those who maintain it and the associated packages!
by mkl on 7/31/22, 8:18 PM
> As time has gone on, and we have encountered more and more PDF files with ever more unexpected deviations from the specification
Does anyone know of a collection of malformed PDF files? It would be useful for testing PDF processing programs.
by lordfosco on 7/31/22, 4:56 PM
Most important part of the announcement - you can still revert back to the former interpreter by setting the `-dNEWPDF=false` flag.
While progress is always nice to see - I am also pleased that we don't necessarily need to update all the scripts that depend on ghostscript at once but can keep them running in their current state.
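For scripts that call Ghostscript, one way to keep the old interpreter as an explicit opt-in is to build the argument list in one place. This is only a sketch: `build_gs_command` and its parameters are hypothetical names, the file names are placeholders, and `-dNEWPDF=false` only does anything on releases that still ship the old PostScript-based interpreter.

```python
from typing import List


def build_gs_command(src: str, dst: str, use_old_interpreter: bool = False) -> List[str]:
    """Build a Ghostscript pdfwrite command line as an argument list."""
    cmd = [
        "gs",
        "-dBATCH",            # exit when done instead of dropping to the prompt
        "-dNOPAUSE",          # do not pause between pages
        "-sDEVICE=pdfwrite",  # produce a PDF rather than a raster image
        f"-sOutputFile={dst}",
    ]
    if use_old_interpreter:
        # The flag quoted in this thread, written without a space.
        cmd.append("-dNEWPDF=false")
    cmd.append(src)
    return cmd


# Placeholder file names for illustration.
legacy_cmd = build_gs_command("in.pdf", "out.pdf", use_old_interpreter=True)
default_cmd = build_gs_command("in.pdf", "out.pdf")
```

Keeping the switch in one function means it can be dropped in a single place once the scripts have been checked against the new interpreter.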
by vivegi on 7/31/22, 6:25 PM
In the past when we had to use Ghostscript for PDF processing, we always separated it out into its own process and added a whole lot of error management externally.
Even if the application was fine, you would always encounter PS/PDF files in the wild that kept stress-testing the application's memory safety.
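A minimal sketch of that kind of process separation, assuming the converter is started as a child process with a timeout and that its exit status is checked by the caller. `run_isolated` and `RunResult` are hypothetical names, and the error handling here is a starting point, not a complete policy.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List


@dataclass
class RunResult:
    ok: bool
    reason: str  # "ok", "timeout", "not found", or "exit <code>"
    stderr: str


def run_isolated(cmd: List[str], timeout_s: float = 60.0) -> RunResult:
    """Run cmd in a child process so a crash or hang cannot take down the caller."""
    try:
        proc = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            errors="replace",  # tolerate non-UTF-8 bytes in diagnostics
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this.
        return RunResult(False, "timeout", "")
    except FileNotFoundError:
        return RunResult(False, "not found", "")
    if proc.returncode != 0:
        # On POSIX a negative code means the child died from a signal,
        # for example a segfault on a malformed input file.
        return RunResult(False, f"exit {proc.returncode}", proc.stderr)
    return RunResult(True, "ok", proc.stderr)
```

In practice the command would be a Ghostscript invocation, e.g. `run_isolated(["gs", "-dBATCH", "-dNOPAUSE", "-sDEVICE=pdfwrite", "-sOutputFile=out.pdf", "in.pdf"])`, with the caller deciding whether to retry, quarantine the file, or report the failure.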
by mepian on 7/31/22, 7:53 PM
"But Ghostscript’s PDF interpreter was, as noted, written in PostScript, and PostScript is not a great language for handling error conditions and recovering."
Isn't C, their chosen replacement for PostScript, also particularly bad at this?
by aidos on 7/31/22, 4:54 PM
Does anyone know much about the Artifex team? How big it is etc?
They seem to be the kings of working with PDFs. I’ve not really looked at the Ghostscript code (and I’m surprised to hear their interpreter was still in postscript), but I’ve looked through the mupdf code and what I saw was really nice.
In any case, I appreciate the work they’ve done in providing fantastic tools to the world for decades now.
by 3ace on 8/1/22, 6:40 AM
> Since there is no means to ‘verify’ that a PDF file conforms, creators fall back on using Adobe Acrobat, the de facto standard. If Acrobat will open the file then it must be OK! Sadly it turns out that Acrobat is really very tolerant of badly formed PDF files and will always attempt to open them.
I was grinning widely while reading this.
Until last year I helped maintain a PDF tool written in Golang. Our clients often sent us PDF documents that did not conform to the standard but could be opened in Acrobat and not in other PDF readers (including Ghostscript), so I had to find ways to read and extract the content with as few problems as possible.
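One small, concrete instance of this tolerance gap is the file header. A strict reader expects `%PDF-` at byte 0, while Acrobat has long been documented as accepting the header anywhere in the first 1024 bytes. The sketch below contrasts the two checks; the function names and the sample bytes are made up for illustration, and a real tool would need many more recovery paths than this.

```python
HEADER = b"%PDF-"


def has_header_strict(data: bytes) -> bool:
    """Accept only files that start with the PDF header at byte 0."""
    return data.startswith(HEADER)


def find_header_lenient(data: bytes, window: int = 1024) -> int:
    """Return the offset of the header within the first `window` bytes, or -1."""
    return data[:window].find(HEADER)


# Hypothetical inputs: a clean file, and one with junk before the header.
clean = b"%PDF-1.7\n1 0 obj\n"
junk_prefixed = b"\x00garbage\r\n%PDF-1.4\n1 0 obj\n"

strict_clean = has_header_strict(clean)
strict_junk = has_header_strict(junk_prefixed)
lenient_offset = find_header_lenient(junk_prefixed)
```

A file like `junk_prefixed` is exactly the kind that "opens fine in Acrobat" and fails in a reader that only implements the strict check.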
by rcarmo on 8/1/22, 7:15 AM
Funny thing: I remember hand coding Postscript patterns to play around on the first LaserWriter.
PDF became such a weird mess that I’m not surprised Postscript is now just a subset of it (to a degree), but writing an entirely new interpreter must have been a hefty chunk of work.
by vintagedave on 7/31/22, 4:50 PM
Given the mention of security issues in their custom PostScript extensions, and that PDF files are often malformed, I wonder why they chose C as the language for the new interpreter. I don't want to write a typical HN comment (cough use Rust for everything :)) but surely there is _some_ better language for entirely new development of a secure and fast parser in 2022.
The post has no explanation of this choice. Does anyone know?
by vfclists on 7/31/22, 6:15 PM
Using C sounds like it will bring a whole new list of exploits with it.
Not good!!
by forgotpwd16 on 7/31/22, 4:42 PM
Surprised the decision wasn't made sooner.
by diskmuncher on 7/31/22, 3:40 PM
How interpreting PDF in Postscript became untenable