from Hacker News

Ask HN: What is the best way to archive a webpage

by badwolff on 3/7/18, 5:20 PM with 8 comments

I am working with some peers who have a website that links to and catalogs a number of resources (think blog posts).

It would be ideal for the administrators to be able to archive, or keep a copy of, each linked web page on their own server in case the original post is deleted, links move, or servers go down.

Currently they are using http://archive.is as a partial solution. It does not work for some websites, and ideally they could host their own archived copies.

What are easy solutions to do this?

With Python I was thinking of requests - but that would just grab the HTML, not the images or content generated by JavaScript.
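As a sketch of that gap, using only the stdlib (so urllib stands in for requests, and the sample page is made up): a plain fetch returns one HTML document, while a quick parse shows the extra assets a real archive would also need to download:

```python
from html.parser import HTMLParser
from urllib.request import urlopen

def fetch_html(url):
    """Grab only the raw HTML - roughly what requests.get(url).text gives you."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

class AssetFinder(HTMLParser):
    """Collect the asset URLs a bare HTML grab would NOT download."""
    def __init__(self):
        super().__init__()
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.assets.append(attrs["src"])
        elif tag == "link" and attrs.get("href"):
            self.assets.append(attrs["href"])

# A toy page: the HTML itself is one request, but rendering it
# needs three more files that a plain fetch never touches.
sample = ('<html><head><link href="style.css" rel="stylesheet">'
          '<script src="app.js"></script></head>'
          '<body><img src="logo.png"></body></html>')
finder = AssetFinder()
finder.feed(sample)
print(finder.assets)  # → ['style.css', 'app.js', 'logo.png']
```

And even a crawler that follows those URLs still misses anything JavaScript injects after load, which is the harder half of the problem.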

Thinking Selenium, you could take a screenshot of the content - but that's not the most user-friendly format to read.
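A minimal sketch of that approach, assuming selenium and a matching ChromeDriver are installed (the import is deferred so the helper can be defined without them; the function name and default path are just illustrative):

```python
def screenshot_page(url, out_path="page.png"):
    """Render url in headless Chrome and save a PNG screenshot of it."""
    from selenium import webdriver  # deferred: needs selenium installed

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        driver.save_screenshot(out_path)
    finally:
        driver.quit()
```

Usage would be something like screenshot_page("https://example.com"). The drawback stands: a screenshot captures only pixels, so the archived text can't be selected, searched, or followed as links.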

What are some other solutions?

  • by mdaniel on 3/7/18, 8:54 PM

    I've enjoyed great success with various archiving proxies, including https://github.com/internetarchive/warcprox#readme and https://github.com/zaproxy/zaproxy#readme (which saves the content to an embedded database, and can be easier to work with than WARC files). The benefit of those approaches over just save-as from the browser is that, almost by definition, the proxy will save all the components required to re-render the page, whereas save-as will only grab the parts the browser sees at that time.

    If it's a public page, you can submit the URL to the Internet Archive, benefiting both you and them.

  • by cimmanom on 3/7/18, 7:10 PM

    If it's not doing silly things like using JavaScript to load static content, wget can do recursive crawls.
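The flags matter for getting a browsable local copy out of such a crawl. A sketch of one such invocation, assembled in Python for illustration (the target URL is hypothetical; the flag descriptions follow the GNU wget manual):

```python
import subprocess  # used only if you uncomment the run() call below

# wget flags for a self-contained recursive grab:
#   --mirror            recursion with timestamping, like -r -N -l inf
#   --page-requisites   also fetch the images, CSS, and JS the pages need
#   --convert-links     rewrite links so the local copy browses offline
#   --adjust-extension  save files with .html extensions where appropriate
#   --no-parent         don't wander above the starting directory
url = "https://example.com/post/123"  # hypothetical target
cmd = ["wget", "--mirror", "--page-requisites", "--convert-links",
       "--adjust-extension", "--no-parent", url]
print(" ".join(cmd))
# To actually run it (requires wget on PATH):
# subprocess.run(cmd, check=True)
```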
  • by adultSwim on 3/10/18, 6:45 AM

    Either curl or wget will get you pretty far. Learn one of them well. They are basically equivalent. I use curl.

    For current web apps, there is an interactive archiver written in Python, Web Recorder. It captures the full bi-directional traffic of a session. https://webrecorder.io/ Web Recorder uses an internal Python library, pywb. That might be a good place to look. https://github.com/webrecorder/pywb

    It looks like Selenium has done a lot of catching up on its interface. I'd be curious how they compare now.

    Talk to librarians about archiving the web. They built the Internet Archive and have a lot of experience.

  • by inceptionnames on 3/7/18, 6:27 PM

    Save the page using the browser's Save feature and zip the created assets (an HTML file plus a directory with graphics, JS, CSS, etc.) for ease of sharing.
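That zip step can be scripted. A stdlib-only sketch, where the saved-page layout (page.html plus a page_files/ directory) mimics what the browser's Save feature typically produces:

```python
import pathlib
import shutil
import tempfile
import zipfile

# Simulate what the browser's "Save page" leaves behind:
# an HTML file plus a sibling assets directory.
work = pathlib.Path(tempfile.mkdtemp())
saved = work / "saved-page"
(saved / "page_files").mkdir(parents=True)
(saved / "page.html").write_text("<html></html>")
(saved / "page_files" / "logo.png").write_bytes(b"\x89PNG")

# Zip the whole saved-page directory into one shareable file.
archive = shutil.make_archive(str(work / "page-archive"), "zip", root_dir=saved)
files = sorted(n for n in zipfile.ZipFile(archive).namelist()
               if not n.endswith("/"))
print(files)  # → ['page.html', 'page_files/logo.png']
```

One zip per archived page keeps the HTML and its assets from drifting apart when copies are passed around.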
  • by tyingq on 3/8/18, 3:12 AM

    If you’re okay with an easy option that saves to a third party: https://www.npmjs.com/package/archive.is

  • by anotheryou on 3/7/18, 10:36 PM

    perma.cc looks sweet, but it's very limited for private individuals.