by suramya_tomar on 11/8/24, 1:32 PM with 5 comments
Basically, I am downloading snapshots of sites using curl, but the sites have advertisements in them that I want to filter out. Is there a tool that will let me do that from the command line, so that the output file doesn't have ads in it?
In short, I want something like uBlock Origin, but for HTML files that I will be converting to PDFs or EPUBs. Something like:
curl https://www.google.com | AdRemover.sh | htmltopdf
Most of the solutions I found require you to update the /etc/hosts file to stop the ads from showing, but I would rather avoid that if possible.
by suramya_tomar on 11/10/24, 5:32 PM
I found https://github.com/ArchiveBox/ArchiveBox/ which is a self-hosted web archiving system. It covers most of my use cases (and I can extend it for additional functionality), so I am going to set it up and try it out.
Thanks all for the help.
by solardev on 11/8/24, 7:00 PM
Can you run a Puppeteer/Playwright instance (both control real browsers) and add an ad blocker to that? e.g. https://github.com/ghostery/adblocker or https://github.com/microsoft/playwright-python/issues/782
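A rough sketch of the browser route, for anyone who wants to try it: this does not use the Ghostery blocker linked above. It swaps in Playwright's request interception with a tiny hand-written host list, which is only a stand-in for a real filter list. `BLOCKED_HOSTS`, `is_ad_request` and `save_snapshot` are made-up names for illustration, and it assumes Playwright for Python and its Chromium build are installed.

```python
from urllib.parse import urlparse

# Hypothetical blocklist for illustration only; a real setup would load a
# maintained filter list instead of three hard-coded hosts.
BLOCKED_HOSTS = {"doubleclick.net", "googlesyndication.com", "adnxs.com"}


def is_ad_request(url):
    """Return True if the URL's host is a blocked host or a subdomain of one."""
    try:
        host = urlparse(url).hostname or ""
    except ValueError:
        return False
    return any(host == h or host.endswith("." + h) for h in BLOCKED_HOSTS)


def save_snapshot(url, pdf_path):
    """Load the page with ad requests aborted, then print it to a PDF."""
    # Imported lazily so is_ad_request() works without Playwright installed.
    from playwright.sync_api import sync_playwright

    def handle(route):
        # Drop requests to blocked hosts, let everything else through.
        if is_ad_request(route.request.url):
            route.abort()
        else:
            route.continue_()

    with sync_playwright() as p:
        browser = p.chromium.launch()  # page.pdf() needs headless Chromium
        page = browser.new_page()
        page.route("**/*", handle)
        page.goto(url, wait_until="networkidle")
        page.pdf(path=pdf_path)
        browser.close()
```

Calling `save_snapshot("https://example.com", "out.pdf")` would replace the whole curl-to-PDF pipeline, since the browser does the fetching and the rendering. Blocking at the request level only stops ads that are loaded from other hosts; anything inlined in the page itself would still show up.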
by inhumantsar on 11/8/24, 2:22 PM
The best option would be to use a programming language and a good HTML parser to do the job, e.g. use Python and BeautifulSoup to walk the tree and remove any HTML tag that references an ad-serving network.
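Something along these lines might work as a starting point for the parser approach. To keep it free of extra installs it uses Python's built-in `html.parser` rather than BeautifulSoup, so treat it as a sketch of the same idea and not as the BeautifulSoup version. `AD_HOSTS`, `AdStripper`, `strip_ads` and `main` are made-up names, and the three hosts are just examples, not a real filter list.

```python
import sys
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical blocklist for illustration; not a real filter list.
AD_HOSTS = {"doubleclick.net", "googlesyndication.com", "adnxs.com"}

# Elements that never have a closing tag, so they cannot open a nested region.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def is_ad_url(url):
    """Return True if the URL's host is an ad host or a subdomain of one."""
    try:
        host = urlparse(url).hostname or ""
    except ValueError:
        return False
    return any(host == h or host.endswith("." + h) for h in AD_HOSTS)


class AdStripper(HTMLParser):
    """Re-emit the HTML, dropping elements whose src/href points at an ad host."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside an element being removed

    def _refs_ad(self, attrs):
        return any(k in ("src", "href") and v and is_ad_url(v) for k, v in attrs)

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth += 1
        elif self._refs_ad(attrs):
            if tag not in VOID_TAGS:
                self.skip_depth = 1
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self.skip_depth and not self._refs_ad(attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def _emit(self, text):
        if not self.skip_depth:
            self.out.append(text)

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")

    def handle_comment(self, data):
        self._emit(f"<!--{data}-->")

    def handle_decl(self, decl):
        self._emit(f"<!{decl}>")


def strip_ads(html):
    parser = AdStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


def main():
    # Filter mode: HTML in on stdin, cleaned HTML out on stdout.
    sys.stdout.write(strip_ads(sys.stdin.read()))
```

Saved as a script with a `main()` call added at the bottom, this would sit in the middle of the pipeline from the question, between curl and the PDF converter. It only catches ads that reference a listed host through `src` or `href`, and the depth counting assumes the tags inside a removed element are properly closed, so messy real-world markup may need a more forgiving parser such as BeautifulSoup.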