from Hacker News

Make Your Own Internet Archive with Archive Box

by adamhearn on 1/19/21, 5:48 PM with 77 comments

  • by lazyjeff on 1/19/21, 11:31 PM

    I feel like a simple automatic capture of timestamp + url + screenshot would already be very useful. This gives you a visual memory of the things you've seen on the web. I've wanted to develop this for a while, as a browser plugin.

    Being able to skim the past month or two and click around the thumbnails would already be amazing. I've wanted to do that many times before: to check whether my memory was correct, whether a page had changed since I last saw it, or to figure out when I last saw something online.

    You don't need a special viewer for it, since your operating system's file explorer can already browse the screenshots, and you don't need to set up a crawl. Screenshots also compress well, as WebP or as PNG after crunching.
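
    As a minimal sketch of that capture idea (the comment imagines a browser plugin; this stand-in uses Playwright for Python, which is an assumption, not something the commenter names):

      import time
      from pathlib import Path

      from playwright.sync_api import sync_playwright  # assumed dependency

      def capture(url: str, out_dir: str = "web-memory") -> Path:
          """Save a full-page screenshot named with the timestamp and URL."""
          Path(out_dir).mkdir(exist_ok=True)
          stamp = time.strftime("%Y%m%d-%H%M%S")
          safe_url = url.replace("://", "_").replace("/", "_")[:80]
          out_path = Path(out_dir) / f"{stamp}_{safe_url}.png"
          with sync_playwright() as p:
              browser = p.chromium.launch()
              page = browser.new_page()
              page.goto(url, wait_until="networkidle")
              page.screenshot(path=str(out_path), full_page=True)
              browser.close()
          return out_path

      if __name__ == "__main__":
          print(capture("https://example.com"))

    Because each file name carries the timestamp and the URL, the ordinary file explorer already works as the "viewer" the comment describes.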

  • by remirk on 1/19/21, 8:48 PM

    This article is blogspam.

    The repository has enough information on its own: https://github.com/ArchiveBox/ArchiveBox

  • by matt_f on 1/20/21, 5:27 AM

    Interesting side note:

    It seems like a lot of people in this thread have an interest in retaining a "replayable timeline" of their own browsing/reading history.

    There's probably enough support here to gather a few contributors for an open source project.

  • by nikisweeting on 1/20/21, 5:06 PM

    Hey all, @pirate (ArchiveBox maintainer) here. Thanks for posting this, @adamhearn.

    If you like ArchiveBox, check out our new Twitter account for the project, https://twitter.com/ArchiveBoxApp. We just opened it, and we'll be posting announcements and prerelease sneak peeks there in the future.

  • by blastro on 1/19/21, 9:50 PM

    I use this every single day and think very highly of it. Thanks for reminding me; I'm going to sponsor this developer on GitHub...
  • by unnouinceput on 1/20/21, 2:47 AM

    Quote: "..even if you instruct it to begin archiving a site then it can easily fail if that site’s robots.txt prevents crawling"

    Huh? Do the big corporations actually care about robots.txt anymore? Nowadays it's more of a "netiquette" thing than anything else. Google definitely ignores it. No idea what DuckDuckGo does.
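
    For reference, a crawler that chooses to honor robots.txt typically does a check like the one below before fetching (a small sketch using Python's standard urllib.robotparser; it says nothing about what ArchiveBox, Google, or DuckDuckGo actually do internally):

      from urllib.parse import urlparse
      from urllib.robotparser import RobotFileParser

      def allowed_by_robots(url: str, user_agent: str = "MyArchiver") -> bool:
          """Return True if the site's robots.txt permits `user_agent` to fetch `url`."""
          parts = urlparse(url)
          robots = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
          robots.read()  # downloads and parses robots.txt
          return robots.can_fetch(user_agent, url)

      print(allowed_by_robots("https://example.com/some/page"))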

  • by zeckalpha on 1/20/21, 4:30 AM

    How long until this is a feature baked into a mainstream web browser? Archive, prefetch, cache, all variants on a theme. History, bookmarks, local search engine, all the same.
  • by jedimastert on 1/19/21, 10:35 PM

    Is there a list of web page archive formats I could look at? There are a few things I'd love to do where it would be very handy to have one file per page
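
    Common "one file per page" options are MHTML (a single-file page snapshot format supported by Chromium-based browsers) and self-contained HTML with resources inlined (what the SingleFile extension produces); WARC is the usual multi-page container format. As an illustration only, here is a sketch that grabs an MHTML snapshot through Chromium's DevTools protocol via Playwright (Page.captureSnapshot is an experimental CDP method, so treat this as an assumption that may change):

      from playwright.sync_api import sync_playwright  # assumed dependency

      def save_mhtml(url: str, out_file: str = "page.mhtml") -> None:
          """Capture a single-file MHTML snapshot of `url` via the DevTools protocol."""
          with sync_playwright() as p:
              browser = p.chromium.launch()
              page = browser.new_page()
              page.goto(url, wait_until="networkidle")
              cdp = page.context.new_cdp_session(page)
              snapshot = cdp.send("Page.captureSnapshot", {"format": "mhtml"})
              with open(out_file, "w", encoding="utf-8") as f:
                  f.write(snapshot["data"])
              browser.close()

      save_mhtml("https://example.com")
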
  • by 0x426577617265 on 1/20/21, 12:38 AM

    I use this with an automated script that watches my Twitter activity. If I like a tweet, it checks whether the tweet contains a URL and, if so, archives it.
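
    A rough sketch of that kind of pipeline, with the Twitter-watching part stubbed out (fetch_recent_likes below is a hypothetical placeholder; the comment doesn't say which API or library it uses) and the archiving done by the real `archivebox add` CLI command:

      import re
      import subprocess

      URL_RE = re.compile(r"https?://\S+")

      def fetch_recent_likes() -> list[str]:
          """Hypothetical placeholder: return the text of recently liked tweets."""
          return ["Great read on self-hosted archiving https://example.com/post"]

      def archive_liked_urls() -> None:
          for tweet_text in fetch_recent_likes():
              for url in URL_RE.findall(tweet_text):
                  # `archivebox add` must be run inside an initialized ArchiveBox data folder
                  subprocess.run(["archivebox", "add", url], check=True)

      if __name__ == "__main__":
          archive_liked_urls()
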
  • by mikece on 1/19/21, 9:55 PM

    This would be a nice thing to be able to run on a Synology NAS or other kind of device that typically has terabytes of storage.
  • by greypowerOz on 1/19/21, 10:33 PM

    so.. you CAN have a box that is "the internet"....
  • by mikiem on 1/20/21, 2:45 AM

    How can I use this to archive sites/pages that require logging in to see?
  • by dirtyid on 1/20/21, 9:11 AM

    Tried this a while ago, disappointed at HD usage.

    My solution, as a heavy TTS user: I have Balabolka set up to read copied text, which naturally leaves a log for future reference. There are extensions to auto-copy highlighted text and append URLs, which makes the whole flow straightforward. Each day's log is around 1-5 MB of text, saved to one big folder. The biggest limitation is doing advanced searches of unstructured text files by complex keywords within a date range. I'm sure I could set up each clip with delimiters so the logs could be imported into a searchable DB; I'm just too lazy.
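
    The last step described there (delimited clips imported into a searchable DB) is straightforward with SQLite's built-in full-text search. A minimal sketch, assuming one log file per day and clips separated by a line of dashes (both conventions are assumptions, not something the comment specifies):

      import sqlite3
      from pathlib import Path

      def import_logs(log_dir: str, db_path: str = "clips.db") -> None:
          """Load delimiter-separated text clips into an SQLite FTS5 table."""
          con = sqlite3.connect(db_path)
          con.execute("CREATE VIRTUAL TABLE IF NOT EXISTS clips USING fts5(day, body)")
          for log_file in sorted(Path(log_dir).glob("*.txt")):
              day = log_file.stem  # assumes files are named by date, e.g. 2021-01-20.txt
              for clip in log_file.read_text(encoding="utf-8").split("\n-----\n"):
                  if clip.strip():
                      con.execute("INSERT INTO clips VALUES (?, ?)", (day, clip.strip()))
          con.commit()

      def search(query: str, db_path: str = "clips.db"):
          """Full-text search over imported clips, e.g. search('archive AND paywall')."""
          con = sqlite3.connect(db_path)
          return con.execute(
              "SELECT day, body FROM clips WHERE clips MATCH ? ORDER BY day", (query,)
          ).fetchall()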

  • by evc on 1/19/21, 5:49 PM

    You will need a lot of disk storage, right?
  • by egberts1 on 1/21/21, 11:04 PM

    A real OSINT archive box would also capture all non-inline JavaScript, CSS and blob: files.
  • by ketamine__ on 1/20/21, 2:41 AM

    How does archive.is trick news sites into showing content without the paywall? Is it pure user agent spoofing?

    I'm wondering if this could be applied here.

  • by throwawaysea on 1/20/21, 7:51 AM

    Can you configure this tool to login to websites (for paid news subscriptions) and get past those paywalls?