by vitorbaptistaa on 11/11/22, 4:41 PM with 14 comments
This sounds like a great use case for a forward proxy like Squid or Apache Traffic Server. However, I couldn't find a way in their docs to do both of the following:
* Keep a permanent history of the cached pages
* Access old versions of the cached pages (think Wayback Machine)
Does anyone know if this is possible? I could potentially mirror the pages using wget or httrack, but a forward cache would be a better solution, since the caching would be driven by the scraper itself.
Thanks!
by mdaniel on 11/11/22, 5:53 PM
Their out-of-the-box storage does what the sibling comment describes: it sha1s the request and then shards the output path by the first two characters of the hash: https://github.com/scrapy/scrapy/blob/2.7.1/scrapy/extension...
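In other words, something roughly like this (a minimal sketch of the general idea, not Scrapy's actual code; the function name and cache layout are illustrative):

    import hashlib
    from pathlib import Path

    def cache_path(method: str, url: str, body: bytes = b"", root: str = "httpcache") -> Path:
        # Fingerprint the request (sha1 over method + url + body), then shard
        # the cache directory by the first two hex characters of the hash.
        h = hashlib.sha1()
        h.update(method.encode())
        h.update(url.encode())
        h.update(body)
        fingerprint = h.hexdigest()
        return Path(root) / fingerprint[:2] / fingerprint

    # e.g. httpcache/4f/4f9c2b.../response_body
    p = cache_path("GET", "https://example.com/page")
    p.mkdir(parents=True, exist_ok=True)
    (p / "response_body").write_bytes(b"<html>...</html>")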
by PaulHoule on 11/11/22, 5:05 PM
name[0:2]/name[0:4]/name[0:6]/name

to keep any of the directories from getting too big (even if the filesystem can handle huge directories, various tools you use with it might not). Keep a list of where the files came from and other metadata in a database so you can find things later.

by placidpanda on 11/11/22, 6:00 PM
It also made it easy to alert when something broke (query the table for count(*) where status=error) and to rerun the parser for the failures.
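Putting the two comments above together, a rough sketch of what that could look like (a hypothetical layout; the table and column names are invented for illustration, not taken from any particular tool):

    import hashlib
    import sqlite3
    from pathlib import Path

    def sharded_path(root: Path, url: str) -> Path:
        # name[0:2]/name[0:4]/name[0:6]/name so no single directory grows too large
        name = hashlib.sha1(url.encode()).hexdigest()
        return root / name[:2] / name[:4] / name[:6] / name

    # Metadata lives in a database so the files can be found (and audited) later.
    db = sqlite3.connect("scrape_metadata.db")
    db.execute("CREATE TABLE IF NOT EXISTS pages "
               "(url TEXT, path TEXT, fetched_at TEXT, status TEXT)")

    def record(url: str, path: Path, status: str) -> None:
        db.execute("INSERT INTO pages VALUES (?, ?, datetime('now'), ?)",
                   (url, str(path), status))
        db.commit()

    # Alerting / rerun query: how many pages failed to parse?
    failures = db.execute("SELECT count(*) FROM pages WHERE status = 'error'").fetchone()[0]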
by compressedgas on 11/11/22, 4:57 PM
by sbricks on 11/11/22, 4:50 PM
by nf-x on 11/11/22, 4:55 PM