by linuxfan2718 on 5/28/23, 7:13 PM with 7 comments
by networked on 5/28/23, 9:50 PM
Since your bookmarks are already tagged, perhaps you don't need to tag the files? In some ways, it may be convenient, but at the cost of duplicating the information. As long as you can map a bookmarked URL to a file path or paths, you can find archived copies through your bookmarks.
Here is what I do for external URLs on my personal website. It is inspired by Gwern's approach. A major difference is that he doesn't nest directories; he uses ${domain}/${url-checksum}.ext.
I translate the URL to a file path in my link-archive directory by applying the function dest-dir from the Tcl code below. In the directory, I save whatever is at the URL with a name based on its checksum (b2sum -l 32), so I can have multiple archived copies of the same URL. I use https://github.com/gildas-lormeau/single-file-cli to save the URL. I determine the destination file extension from the MIME type.
This gives you paths like link-archive/365tomorrows.com/2005/10/23/postcard/e5445dff.html for https://365tomorrows.com/2005/10/23/postcard/.
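proc slug s {
    set s [string tolower $s]
    # Collapse every run of characters outside the allowed set into one hyphen.
    regsub -all {[^A-Za-z0-9\.\_\~\-]+} $s - s
    string trim $s -
}
proc dest-dir link {
    # Strip the fragment unless it starts with "#!", then split into parts.
    set slugs [lmap part [file split [regsub {#[^!].*$} $link {}]] {
        # file split returns parts that start with "~" as "./~..."; undo that
        # before slugging, and cap each slug at 128 characters.
        set x [string range [slug [regsub {^./} $part {}]] 0 127]
        # Restore the "./" prefix so file join does not treat "~" specially.
        regsub {^~} $x {./~}
    }]
    # Drop the protocol.
    file join {*}[lrange $slugs 1 end]
}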
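The naming step described above (a 32-bit checksum of the saved content plus an extension chosen from the MIME type) can be sketched roughly as follows. This is in Python rather than Tcl and is not the commenter's actual code: the function name `archive_name`, the `EXTENSIONS` table, and the `.bin` fallback are all assumptions made for illustration. A 4-byte BLAKE2b digest should correspond to what `b2sum -l 32` prints.

```python
import hashlib

# Hypothetical MIME-type-to-extension table; extend it for the types you archive.
EXTENSIONS = {
    "text/html": ".html",
    "application/pdf": ".pdf",
    "image/png": ".png",
}


def archive_name(content: bytes, mime_type: str) -> str:
    """Return '<8 hex digits><extension>' for one archived copy."""
    # digest_size=4 bytes gives a 32-bit BLAKE2b digest, i.e. 8 hex characters.
    checksum = hashlib.blake2b(content, digest_size=4).hexdigest()
    # Unknown MIME types fall back to a generic extension (an assumption).
    return checksum + EXTENSIONS.get(mime_type, ".bin")


name = archive_name(b"<html>example</html>", "text/html")
print(name)
```

Because the name depends on the content rather than the URL, saving the same URL again after the page changes produces a second file next to the first instead of overwriting it.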
by DantesKite on 5/28/23, 9:46 PM
Once you have embeddings for your documents, you can search their content even when the exact words aren't the same.
Like let's say you have a document titled "Measuring canine tooth caries over 2004-2020" and it never once mentions the word "dog".
If you type in "dog" after doing the embeddings, it'll suggest that specific document because "canine" and "dog" are closely related.
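To make the idea concrete, here is a minimal sketch of that kind of search: rank documents by cosine similarity between a query vector and each document vector. The three-dimensional vectors are made-up stand-ins for real embeddings, which would come from an embedding model and have hundreds of dimensions; `DOC_VECTORS`, `search`, and the other two document titles are hypothetical.

```python
import math

# Toy vectors standing in for real embeddings (hypothetical values).
DOC_VECTORS = {
    "Measuring canine tooth caries over 2004-2020": [0.9, 0.1, 0.2],
    "Quarterly tax filing checklist": [0.1, 0.9, 0.1],
    "Sourdough starter maintenance": [0.1, 0.2, 0.9],
}


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def search(query_vector, doc_vectors, top_k=1):
    """Return the top_k document titles most similar to the query vector."""
    ranked = sorted(
        doc_vectors,
        key=lambda title: cosine(query_vector, doc_vectors[title]),
        reverse=True,
    )
    return ranked[:top_k]


# Pretend this is the embedding of the query "dog".
dog_query = [0.8, 0.2, 0.1]
best = search(dog_query, DOC_VECTORS)
print(best)
```

With these toy numbers the "dog" query lands closest to the canine document even though the title never contains the word, which is the behavior described above.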
It's a great way to organize large collections of text, and there are plenty of YouTube videos on how to do it. Best of all, you don't have to spend time manually organizing everything; you just let the model do it for you.
You could even get it to auto-tag your documents based on what it thinks is the best category for the document and make it easier for you to parse that way as well.
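One simple way to get auto-tagging, sketched here under the same toy-vector assumption, is to embed each tag as well and give a document the tag whose vector is closest. The tag names, the vectors, the `auto_tag` function, and the similarity threshold are all hypothetical; asking a language model to pick a category directly is another option this sketch does not cover.

```python
import math

# Hypothetical tag vectors, e.g. embeddings of each tag's name or description.
TAG_VECTORS = {
    "veterinary": [0.9, 0.1, 0.1],
    "finance": [0.1, 0.9, 0.1],
    "cooking": [0.1, 0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def auto_tag(doc_vector, tag_vectors, threshold=0.5):
    """Return the closest tag, or None when no tag is similar enough."""
    tag = max(tag_vectors, key=lambda t: cosine(doc_vector, tag_vectors[t]))
    if cosine(doc_vector, tag_vectors[tag]) >= threshold:
        return tag
    return None


# Toy vector standing in for the embedding of the canine caries document.
tag = auto_tag([0.9, 0.1, 0.2], TAG_VECTORS)
print(tag)
```

The threshold keeps documents that match no tag well from being forced into the nearest one; they come back as `None` and can be reviewed by hand.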
by thriller on 5/31/23, 3:32 AM
by epirogov on 5/28/23, 7:33 PM
by decide1000 on 5/29/23, 8:50 PM
by hamsterbase on 5/29/23, 5:35 AM
This tool will index all of your HTML. It supports highlighting and full-text search.
by rmdes on 5/29/23, 6:30 AM