by badwolff on 3/7/18, 5:20 PM with 8 comments
It would be ideal for the administrators to be able to archive, or keep a copy of, a linked web page on their own server in case the original post is deleted, the link moves, the server goes down, etc.
Currently they use http://archive.is as a half solution: it does not work for some websites, and ideally they would host their own archived copies.
What are easy solutions to do this?
With Python I was thinking of requests, but that would only grab the HTML, not images or content generated by JavaScript.
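Roughly what I had in mind (the URL is just a placeholder):

    import requests

    # Fetches only the raw HTML document; embedded images, CSS, and
    # anything rendered by JavaScript are not captured.
    resp = requests.get("https://example.com/some-post")
    resp.raise_for_status()
    with open("page.html", "w", encoding="utf-8") as f:
        f.write(resp.text)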
With Selenium you could take a screenshot of the content, though that is not the most user-friendly format to read.
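A minimal sketch of that, again with a placeholder URL:

    from selenium import webdriver

    driver = webdriver.Firefox()  # or webdriver.Chrome()
    driver.get("https://example.com/some-post")
    # Saves a PNG of the rendered page; by default this covers only
    # the visible viewport, not the full scroll height.
    driver.save_screenshot("page.png")
    driver.quit()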
What are some other solutions?
by mdaniel on 3/7/18, 8:54 PM
If it's a public page, you can submit the URL to the Internet Archive, benefiting both you and them.
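One way to do that from a script is the Save Page Now endpoint; a sketch, assuming the endpoint still behaves this way (check the Archive's docs):

    import requests

    url = "https://example.com/some-post"  # placeholder
    # Requesting web.archive.org/save/<url> asks the Internet Archive
    # to crawl and store a fresh snapshot of the page.
    resp = requests.get("https://web.archive.org/save/" + url)
    print(resp.status_code, resp.url)  # final URL should point at the snapshot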
by adultSwim on 3/10/18, 6:45 AM
For modern web apps, there is an interactive archiver written in Python, Webrecorder. It captures the full bidirectional traffic of a session. https://webrecorder.io/ Webrecorder is built on a Python library, pywb, which might be a good place to look: https://github.com/webrecorder/pywb
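As a taste of that ecosystem, here is a minimal sketch using warcio, another library from the webrecorder project, to record HTTP traffic into a WARC file that pywb can replay (per its docs, requests should be imported after capture_http so it can be patched):

    from warcio.capture_http import capture_http
    import requests  # imported after capture_http so warcio can patch it

    # Every request/response made inside the block is written to the
    # WARC file, which pywb can then serve back for browsing.
    with capture_http("archive.warc.gz"):
        requests.get("https://example.com/some-post")  # placeholder URL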
It looks like Selenium has done a lot of catching up on its interface. I'd be curious how they compare now.
Talk to librarians about archiving the web. They built the Internet Archive and have a lot of experience.