from Hacker News

Ask HN: How to organize archived webpages locally?

by linuxfan2718 on 5/28/23, 7:13 PM with 7 comments

I've been going through hundreds of bookmarks I made over the years, all carefully tagged and organized, but a lot of the pages have been taken down. I want to start archiving them locally, probably using Firefox's "Save Page As..." feature. Do people here do this? If so, how do you organize and tag the files? Folders aren't perfect because some pages deserve multiple tags.

by networked on 5/28/23, 9:50 PM
Check out https://gwern.net/archiving.
Since your bookmarks are already tagged, perhaps you don't need to tag the files? In some ways, it may be convenient, but at the cost of duplicating the information. As long as you can map a bookmarked URL to a file path or paths, you can find archived copies through your bookmarks.
Here is what I do for external URLs on my personal website. It is inspired by Gwern's approach. A major difference is that he doesn't nest directories; he uses ${domain}/${url-checksum}.ext.
I translate the URL to a file path in my link-archive directory by applying the function dest-dir from the Tcl code below. In the directory, I save whatever is at the URL with a name based on its checksum (b2sum -l 32), so I can have multiple archived copies of the same URL. I use https://github.com/gildas-lormeau/single-file-cli to save the URL. I determine the destination file extension from the MIME type.
This gives you paths like link-archive/365tomorrows.com/2005/10/23/postcard/e5445dff.html for https://365tomorrows.com/2005/10/23/postcard/.
```
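  # Lowercase a string and collapse every run of characters outside
  # [a-z0-9._~-] into a single hyphen, trimming hyphens from both ends.
  proc slug s {
    set s [string tolower $s]
    regsub -all {[^A-Za-z0-9\.\_\~\-]+} $s - s
    string trim $s -
  }

  # Map a URL to a relative directory path inside the archive.
  proc dest-dir link {
    # Strip the fragment (but keep "#!" routes), then split on "/".
    set slugs [lmap part [file split [regsub {#[^!].*$} $link {}]] {
      # In Tcl 8, file split prefixes parts starting with "~" with "./";
      # remove it before slugging and cap each part at 128 characters.
      set x [string range [slug [regsub {^\./} $part {}]] 0 127]
      # Restore the "./" so file join does not tilde-expand the part.
      regsub {^~} $x {./~}
    }]
    # Drop the protocol.
    file join {*}[lrange $slugs 1 end]
  }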
```
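For readers who don't use Tcl, here is a rough Python sketch of the same scheme. It is an approximation, not a port: it skips the tilde handling, and the names `archive_path` and `EXTENSIONS` and the fallback `.bin` extension are made up for illustration. It assumes that `hashlib.blake2b` with `digest_size=4` matches the 8-hex-digit output of `b2sum -l 32`.

```python
import hashlib
import re
from pathlib import PurePosixPath

# Hypothetical MIME-to-extension table; extend it for the types you archive.
EXTENSIONS = {"text/html": ".html", "application/pdf": ".pdf", "image/png": ".png"}


def slug(s):
    """Lowercase s and collapse runs of disallowed characters into hyphens."""
    s = re.sub(r"[^a-z0-9._~-]+", "-", s.lower())
    return s.strip("-")


def dest_dir(url):
    """Turn a URL into a relative directory path (host plus path segments)."""
    url = re.sub(r"#[^!].*$", "", url)        # drop the fragment, keep "#!" routes
    without_scheme = url.split("://", 1)[-1]  # drop the protocol
    parts = [slug(p)[:128] for p in without_scheme.split("/") if p]
    return PurePosixPath(*parts)


def archive_path(url, content, mime, root="link-archive"):
    """Build root/<dest_dir>/<checksum><ext> for one saved copy of a URL."""
    checksum = hashlib.blake2b(content, digest_size=4).hexdigest()  # 32-bit digest
    ext = EXTENSIONS.get(mime, ".bin")  # assumed fallback for unknown types
    return PurePosixPath(root) / dest_dir(url) / (checksum + ext)
```

With this, `dest_dir("https://365tomorrows.com/2005/10/23/postcard/")` yields `365tomorrows.com/2005/10/23/postcard`. Because the file name comes from the content, two different snapshots of the same URL land side by side in the same directory.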
by DantesKite on 5/28/23, 9:46 PM
You should try OpenAI embeddings. They're fairly cheap to run over a large amount of text (I believe it should cost less than $10 for thousands of documents, but correct me if I'm wrong).
Then you can run searches for content even if the exact words aren't the same.
Like let's say you have a document titled "Measuring canine tooth caries over 2004-2020" and it never once mentions the word "dog".
If you type in "dog" after doing the embeddings, it'll suggest that specific document because "canine" and "dog" are closely related.
It's a great way to organize large groups of texts. There are plenty of YouTube videos on how to do it, and best of all, you don't have to spend time manually organizing everything. You just let the model do it for you.
You could even get it to auto-tag your documents based on what it thinks is the best category for the document and make it easier for you to parse that way as well.
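The search step could look something like the sketch below. It assumes every document already has an embedding vector from whatever embeddings API you use; the tiny 3-dimensional vectors, `DOCS`, and `QUERY_DOG` are made-up stand-ins (real embeddings have hundreds or thousands of dimensions), and `search` is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def search(query_vector, doc_vectors, top_k=3):
    """Return the titles of the top_k documents most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), title)
        for title, vector in doc_vectors.items()
    ]
    scored.sort(reverse=True)
    return [title for _, title in scored[:top_k]]


# Illustrative data only: these numbers are invented, not real embeddings.
DOCS = {
    "Measuring canine tooth caries over 2004-2020": [0.9, 0.1, 0.0],
    "Tcl string handling notes": [0.0, 0.2, 0.9],
}
QUERY_DOG = [0.8, 0.2, 0.1]  # pretend this is the embedding of "dog"
```

Calling `search(QUERY_DOG, DOCS, top_k=1)` ranks the caries paper first, since its vector points in nearly the same direction as the query's. For a few thousand documents a linear scan like this is likely fast enough; a vector index only becomes worthwhile at larger scales.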
by thriller on 5/31/23, 3:32 AM
I use an extension called SingleFile, and have it save EVERY page I visit. It saves every page locally with a timestamp at the beginning of the filename followed by the page title. Normally, I can find what I'm looking for using search, so no need for tags.
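If the system search ever falls short, searching by filename is easy to script. This is only a sketch: it assumes the saved pages are `.html` files in one directory and that the timestamp prefix sorts chronologically (e.g. `2023-05-28 ...`); `find_saved_pages` is a made-up helper, not part of SingleFile.

```python
from pathlib import Path


def find_saved_pages(archive_dir, query):
    """Return saved pages whose filename contains every word of the query.

    Matching is case-insensitive. Results are sorted by filename, which is
    oldest-first when names begin with a sortable timestamp (an assumption).
    """
    words = query.lower().split()
    matches = [
        path
        for path in Path(archive_dir).glob("*.html")
        if all(word in path.name.lower() for word in words)
    ]
    return sorted(matches, key=lambda path: path.name)
```

For example, `find_saved_pages("saved-pages", "ask hn archiving")` would list every saved page with all three words in its title, where `saved-pages` is a placeholder directory name.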
by epirogov on 5/28/23, 7:33 PM
It is better to store only the text; in most cases layout and images don't matter. Saving as PDF makes documents hard to search, since Chrome renders everything on the page as an SVG image. You can use an online converter to get a well-formatted PDF with selectable text (https://products.aspose.app/pdf/webpage-to-pdf). When I tried to organize my collection, after saving it to DVDs I also created a text file listing the file names and disc numbers.
by decide1000 on 5/29/23, 8:50 PM
I use Mozilla's Pocket for this. The paid version stores it for you. Getpocket.com
by hamsterbase on 5/29/23, 5:35 AM
You can try hamsterbase.com.
This tool will index all of your HTML. It supports highlighting and full-text search.
by rmdes on 5/29/23, 6:30 AM
Wallabag