by agamble on 6/27/17, 11:46 AM with 100 comments
by JackC on 6/27/17, 1:11 PM
Webrecorder is by a former Internet Archive engineer, Ilya Kreymer, who now captures online performance art for an art museum. What he's doing with capture and playback of JavaScript, web video, streaming content, etc. is state of the art as far as I know.
(Disclaimer - I use bits of Webrecorder for my own archive, perma.cc.)
For OP, I would say consider building on and contributing back to Webrecorder -- or alternatively figure out what Webrecorder is good at and make sure you're good at something different. It's a crazy hard problem to do well and it's great to have more ideas in the mix.
by smoyer on 6/27/17, 12:44 PM
by Piskvorrr on 6/27/17, 11:50 AM
(Yes, yes, `wget --convert-links`, I know. Not quite as convenient, though.)
by j_s on 6/27/17, 5:15 PM
I believe the only way to incentivise participation in such a system is by paying for timestamped signatures, e.g. "some subset of downloaded [content] from [url] at [time] hashed to [hash]", all tucked into a Bitcoin transaction or something. There are services that will do this with user-provided content[1]; I am looking for something that will pull a url and timestamp the content.
This would also be a way to detect when different users are being served different content at the same url, thus the need for a global network of validators.
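The record the commenter describes could be sketched like this. This is a hypothetical shape for the attestation, not any existing service's format; the resulting `record_hash` is what you would embed in a Bitcoin transaction (e.g. via OP_RETURN) or hand to a timestamping service:

```python
import hashlib
import json
import time

def attestation(url: str, content: bytes, timestamp: float) -> dict:
    """Build a record asserting that `url` served `content` at `timestamp`.

    Committing the record's hash to a public ledger later proves the
    content existed in this form no later than that time.
    """
    digest = hashlib.sha256(content).hexdigest()
    record = {"url": url, "time": int(timestamp), "sha256": digest}
    # Hash the canonical JSON form so the commitment covers all fields.
    record["record_hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    return record

rec = attestation("https://example.com", b"<html>hello</html>", time.time())
```

Comparing `sha256` values from independent validators fetching the same url at the same time is exactly the check that would expose users being served different content.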
by unicornporn on 6/27/17, 1:01 PM
If you really want to create your own archive, set up a Live Archiving HTTP Proxy[1], run SquidMan [2] or check out WWWOFFLE[3].
If you want something simpler, have a look at Webrecorder[4] or a paid Pinboard account with the “Bookmark Archive”[5].
[1] http://netpreserve.org/projects/live-archiving-http-proxy/
[2] http://squidman.net/squidman/index.html
by rahiel on 6/27/17, 1:10 PM
[1]: http://blog.archive.is/post/72136308644/how-much-does-it-cos...
by venning on 6/27/17, 12:59 PM
I like the look. Very clean. I like how fast it's responding; better than archive.org (though, obviously, they have different scaling problems).
"Your own internet archive" might be overselling it, as other commenters have pointed out; the "Your" feels a bit misleading. I think "Save a copy of any webpage." gives a better impression, which you use on the site itself.
The "Archive!" link probably shouldn't work if there's nothing in the URL box. It just gives me an archive link that errors. Example: [1]
Using it on news.YC as a test gave me errors with the CSS & JS [2]. This might be because HN uses query parameters in its CSS and JS URLs, which repeat in the Tesoro URL and may not be getting parsed correctly.
Maybe have something in addition to an email link for submitting error reports like the above, just because I'd be more likely to file a GitHub issue (even if the repo is empty) than send a stranger an email.
As other commenters have pointed out, archive.is also does this, and their longevity helps me feel confident that they'll still be around. Perhaps, if you wish to differentiate, offer some way for me to "own" the copy of the page, like downloading it or emailing it to myself or sharing it with another site (like Google Docs or Imgur) to leverage redundancy, or something like that. Just a thought.
All in all, nice Show HN.
EDIT: You also may want to adjust the header to work properly on mobile devices. Still though, nice job. Sorry if I'm sounding critical.
[1] https://archive.tesoro.io/320b55cc9b78e271c94716ee23554da8
[2] https://archive.tesoro.io/a7bf03e247224bc3b4e5a7c1f2ad42b1
by bfirsh on 6/27/17, 3:33 PM
I know a lot of these sites have archiving features, but I want something centralised and automatic.
by akerro on 6/27/17, 12:34 PM
They will love it!
by zippoxer on 6/27/17, 4:59 PM
This got me thinking about how a decentralized p2p internet archive could solve the trust problem that exists in centralized internet archives. Such a solution could also increase the capacity of archived pages and the frequency at which archived pages are updated.
It is true that keeping the entire history of the internet on your local drive is likely impossible, but a solution similar to what Sia is doing could solve this problem: split each page into 20 pieces, distribute them across different peers, and encode them so that any 10 pieces can recover the original page. So you only have to trust that 10 of the 20 peers storing a page are still alive to get the complete page.
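The k-of-n idea above can be shown in miniature with a toy 2-of-3 XOR parity code (real systems like Sia use Reed-Solomon codes for general k-of-n; this sketch only illustrates that losing any one piece is survivable):

```python
def encode(data: bytes) -> list:
    """Split `data` into two halves plus an XOR parity piece."""
    half = (len(data) + 1) // 2
    p1 = data[:half]
    p2 = data[half:].ljust(half, b"\0")  # pad so both halves match in length
    parity = bytes(a ^ b for a, b in zip(p1, p2))
    return [p1, p2, parity]

def decode(pieces, length: int) -> bytes:
    """Reconstruct the original data from any two of the three pieces."""
    p1, p2, parity = pieces
    if p1 is None:
        p1 = bytes(a ^ b for a, b in zip(p2, parity))
    if p2 is None:
        p2 = bytes(a ^ b for a, b in zip(p1, parity))
    return (p1 + p2)[:length]  # strip the padding added in encode

page = b"<html>archived page</html>"
pieces = encode(page)
pieces[0] = None  # one peer goes offline
assert decode(pieces, len(page)) == page
```

Scaling this to "any 10 of 20" needs a proper erasure code, but the storage overhead stays at 2x rather than the 10x that full replication across 20 peers would cost.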
The main problem I can see right now would be lack of motivation to contribute to the system -- why would people run nodes? Just because it would feature yet another cryptocurrency? Sure, this could hold now, but when the cryptocurrency craze quiets down and people stop buying random cryptocurrencies just for the sake of trading them, what then? Who would run the nodes and why?
by j_s on 6/27/17, 5:02 PM
extensions: Firefox "Print Edit" Addon / Firefox Scrapbook X / Chrome Falcon / Firefox Recoll
open source: Zotero / WorldBrain / Wallabag
commercial: Pinboard / InstaPaper / Pocket / Evernote / Mochimarks / Diigo / PageDash / URL Manager Pro / Save to Google / OneNote / Stash / Fetching
public: http://web.archive.org / https://archive.is/
by idlewords on 6/27/17, 3:44 PM
I (obviously) think personal archives are a great idea, but republishing is a hornets' nest.
by Retr0spectrum on 6/27/17, 12:53 PM
If I want my own archive, Ctrl+S in Firefox usually works fine for me.
by crispytx on 6/27/17, 1:34 PM
by zichy on 6/27/17, 12:08 PM
by CM30 on 6/27/17, 2:31 PM
As it is, while it's a nice service, it still has all the issues of other archiving services:
1. It's online only, so one failed domain renewal or hosting payment takes everything offline.
2. It being online also means I can't access any saved pages if my connection goes down or has issues.
3. The whole thing is wide open to having content taken down by websites wanting to cover their tracks. I mean, what do you do if someone tells you to remove a page? What about with a DMCA notice?
It's a nice alternative to archive.is, but still doesn't really do what the title suggests if you ask me.
by jpalomaki on 6/27/17, 1:09 PM
Instead of hosting this directly on my computer, it would be interesting to have a setup where the archiving is done via the service and I would just provide storage space somewhere that the content would end up being mirrored to (just to guarantee that my valuable things are saved at least somewhere, should the other nodes decide to remove the content).
I would prefer this setup because it would be easily accessible for me from any device and I would not need to worry about running an always-available system. With a suitable P2P setup, my storage node would have less strict uptime requirements.
by dbz on 6/27/17, 12:50 PM
[1] https://chrome.google.com/webstore/detail/cmmlgikpahieigpccl...
by prirun on 6/27/17, 1:09 PM
by gorbachev on 6/27/17, 1:43 PM
I also second the need for user accounts. If I am to use your site as my personal archive, then I would need to log in and create a collection of my own archived sites.
by arkenflame on 6/27/17, 11:28 PM
by lozzo on 6/27/17, 12:23 PM
by jdc0589 on 6/27/17, 3:03 PM
I'm confused. It looks like image sources in "archived" pages on Tesoro still point back to the origin domain.
Edit: it works as expected. I just didn't notice the relative paths.
by salmonfamine on 6/27/17, 3:38 PM
by NicoJuicy on 6/27/17, 1:44 PM
I wonder what this site uses
by pbhjpbhj on 6/27/17, 1:07 PM
by skdotdan on 6/27/17, 5:42 PM