by unlog on 6/10/24, 11:47 AM with 66 comments
by kitd on 6/10/24, 7:55 PM
Obligatory mention for RedBean, the server that you can package along with all assets (incl db, scripting and TLS support) into a single multi-platform binary.
by jll29 on 6/10/24, 7:29 PM
Although UNIX philosophy posits that it's good to have many small files, I like your idea for its contribution to reducing clutter (imagine running 'tree' in both scenarios) and also avoiding running out of inodes in some file systems (maybe less of a problem nowadays in general, not sure as I haven't generated millions of tiny files recently).
by ProtoAES256 on 6/11/24, 8:15 AM
Glad to see easier methods!
wget \
--header "Cookie: <cf or other>" \
--user-agent="<UA>" \
--recursive \
--level 5 \
--no-clobber \
--page-requisites \
--adjust-extension \
--span-hosts \
--convert-links \
--domains <example.com> \
--no-parent \
<example.com/sub>
by unlog on 6/10/24, 11:51 AM
by renegat0x0 on 6/10/24, 8:30 PM
- status codes 200-299 are all OK
- status codes 300-399 are redirects, and also can be OK eventually
- 403, in my experience, occurs quite often where it is not an error but a hint that your user agent is not accepted
- robots.txt should be scanned to check whether any resource is prohibited, or whether there are speed requirements. It is always better to be _nice_. My own project is missing this too, and I plan to add it
- It would be interesting to generate a hash of the app's content, and update only if the hash is different?
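The points in this list can be sketched in a few lines of Python. This is only an illustration, not how any project in this thread does it: the function names `classify_status`, `allowed_by_robots`, `robots_crawl_delay` and `content_changed` are made up for the example, and the choice of SHA-256 is an assumption. The robots.txt handling uses the standard library's `urllib.robotparser`.

```python
import hashlib
from typing import Optional, Tuple
from urllib.robotparser import RobotFileParser


def classify_status(code: int) -> str:
    """Map an HTTP status code to a coarse crawl decision (hypothetical labels)."""
    if 200 <= code <= 299:
        return "ok"
    if 300 <= code <= 399:
        return "redirect"  # follow it; the final response may still be OK
    if code == 403:
        return "check-user-agent"  # often a rejected user agent, not a real error
    return "error"


def _parse_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots.txt permits this user agent to fetch the URL."""
    return _parse_robots(robots_txt).can_fetch(user_agent, url)


def robots_crawl_delay(robots_txt: str, user_agent: str) -> Optional[int]:
    """Return the Crawl-delay in seconds for this user agent, or None if unset."""
    return _parse_robots(robots_txt).crawl_delay(user_agent)


def content_changed(body: bytes, previous_hash: Optional[str]) -> Tuple[bool, str]:
    """Hash the body and report whether it differs from the stored hash."""
    digest = hashlib.sha256(body).hexdigest()
    return digest != previous_hash, digest
```

A crawler would store the returned digest next to each saved page and skip rewriting the page when `content_changed` reports no difference.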
by tamimio on 6/10/24, 8:17 PM
by Per_Bothner on 6/12/24, 12:42 AM
by sedawk on 6/11/24, 4:52 PM
I never dug into whether I can unzip and decode the packing, but saving as a simple ZIP does somewhat guarantee future-proofing.
by nox101 on 6/10/24, 10:12 PM
In Chrome Devtools, network tab, last icon that looks like an arrow pointing into a dish (Export har file)
I guess a .har file has a ton more data, though. I used it to extract data from sites that either intentionally or unintentionally make it hard to get data. For example, when signing up for an apartment, the apartment management site used pdf.js and provided no way to save the PDF. So I saved the .har file and extracted the PDF.
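For anyone who wants to try the same trick, here is a rough sketch of pulling response bodies out of a .har file in Python. It assumes the HAR 1.2 layout (`log.entries[].response.content` with `mimeType`, `text` and an optional `encoding` of `base64`); the function name `iter_har_bodies` is made up for this example, and browsers do not always include the body for every entry, so some responses may simply be missing.

```python
import base64
from typing import Iterator, Tuple


def iter_har_bodies(har: dict, mime_prefix: str = "application/pdf") -> Iterator[Tuple[str, bytes]]:
    """Yield (request URL, body bytes) for captured responses of a given MIME type."""
    for entry in har.get("log", {}).get("entries", []):
        content = entry.get("response", {}).get("content", {})
        if not content.get("mimeType", "").startswith(mime_prefix):
            continue
        text = content.get("text")
        if text is None:
            continue  # the body was not captured for this entry
        if content.get("encoding") == "base64":
            body = base64.b64decode(text)  # binary bodies are stored base64-encoded
        else:
            body = text.encode("utf-8")  # plain-text bodies are stored as-is
        yield entry.get("request", {}).get("url", ""), body
```

To use it, load the exported file with `json.load` and write each yielded body to disk, for example as `extracted-0.pdf`, `extracted-1.pdf`, and so on.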
by earleybird on 6/10/24, 5:15 PM
by CGamesPlay on 6/11/24, 12:36 PM
I would love to see better support for SPAs, where we can't just start from a sitemap. If you're interested, you can check out some of the code from my old app for inspiration on how to crawl pages (it's Electron, so it will share a lot of interfaces with Puppeteer) [1].
[0] https://github.com/CGamesPlay/chronicler/tree/master
[1] https://github.com/CGamesPlay/chronicler/blob/master/src/mai...
by jayemar on 6/10/24, 6:00 PM
by billpg on 6/11/24, 1:13 PM
by ryanwaldorf on 6/10/24, 3:24 PM
by szhabolcs on 6/11/24, 8:52 AM
by ivolimmen on 6/10/24, 9:00 PM
by meiraleal on 6/10/24, 8:30 PM