by WCityMike on 6/28/19, 6:23 PM with 4 comments
This seems like something so commonly desired that it would have been done a hundred times over by now, but I haven't found the magic search terms to dig up people's creations.
I imagine it starts with "links -dump", but then there's using the title as the filename, removing the padded left margin, wrapping the text, and removing all the excess linkage.
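Something along these lines could cover those steps, as a rough sketch assuming links, curl, sed, tr, and fmt are available; the title extraction and slug rules are just illustrative:
#!/bin/bash
# Rough sketch: dump a page to wrapped plain text, named after its <title>.
url="$1"
# Pull the <title> out of the raw HTML (fragile: assumes it sits on one line)
title=$(curl -s "$url" | sed -n 's:.*<title>\(.*\)</title>.*:\1:p' | head -n 1)
# Lowercase it and turn every run of non-alphanumerics into a single hyphen
fname=$(printf '%s' "$title" | tr '[:upper:]' '[:lower:]' | tr -cs '[:alnum:]' '-')
# Dump the rendered page, strip the padded left margin, and rewrap the text
links -dump "$url" | sed 's/^ *//' | fmt -w 72 > "$fname.txt"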
I'm a beginner-amateur when it comes to shell scripting, Python, etc. - I can Google well and usually understand script or program logic, but don't have terms memorized.
Is this exotic enough that people haven't done it, or as I suspect does this already exist and I'm just not finding it? Much obliged for any help.
by westurner on 6/29/19, 6:18 AM
The title tag may exceed the filename length limit, be the same for nested pages, or contain newlines that must be escaped.
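A minimal sketch of a safer title-to-filename step, assuming the 255-byte filename limit common to most filesystems (the slug rules here are just illustrative):
import re

def title_to_filename(title, max_bytes=255, ext='.txt'):
    # Collapse newlines and other whitespace runs to single hyphens
    slug = re.sub(r'\s+', '-', title.strip().lower())
    # Drop anything that isn't alphanumeric, underscore, or hyphen
    slug = re.sub(r'[^\w\-]', '', slug)
    # Stay under the filesystem's byte limit, leaving room for the extension
    budget = max_bytes - len(ext.encode('utf-8'))
    slug = slug.encode('utf-8')[:budget].decode('utf-8', 'ignore')
    return slug + ext
Identical titles on nested pages would still collide, so a numeric suffix or a hash of the URL may be needed on top of this.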
These might be helpful for your use case:
"Newspaper3k: Article scraping & curation" https://github.com/codelucas/newspaper
lazyNLP "Library to scrape and clean web pages to create massive datasets" https://github.com/chiphuyen/lazynlp/blob/master/README.md#s...
scrapinghub/extruct https://github.com/scrapinghub/extruct
> extruct is a library for extracting embedded metadata from HTML markup.
> It also has a built-in HTTP server to test its output as JSON.
> Currently, extruct supports:
> - W3C's HTML Microdata
> - embedded JSON-LD
> - Microformat via mf2py
> - Facebook's Open Graph
> - (experimental) RDFa via rdflib
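For reference, a minimal extruct call looks something like this, assuming a recent extruct and requests (the example.com URL is just a placeholder):
import requests
import extruct

url = 'https://example.com/article'
html = requests.get(url).text
# Returns a dict keyed by syntax: 'json-ld', 'microdata', 'opengraph', ...
data = extruct.extract(html, base_url=url)
print(data['opengraph'])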
by WCityMike on 6/30/19, 9:22 PM
from sys import argv
from os.path import expanduser
import re

from unidecode import unidecode
from newspaper import Article

script, url = argv

# Fetch and parse the article
article = Article(url)
article.download()
article.parse()

# Transliterate the title to ASCII and slugify it for the filename
title = unidecode(article.title)
fname = title.lower()
fname = re.sub(r"[^\w\s]", '', fname)  # strip punctuation
fname = re.sub(r"\s+", '-', fname)     # whitespace runs -> single hyphen

# Transliterate the body and collapse runs of blank lines
text = unidecode(article.text)
text = re.sub(r'\n\s*\n', '\n\n', text)

# open() does not expand '~', so expand it explicitly
with open(expanduser('~/Desktop/' + fname + '.txt'), 'w') as f:
    f.write(title + '\n\n')
    f.write(text + '\n')
I execute it from the shell via: #!/bin/bash
/usr/local/opt/python3/Frameworks/Python.framework/Versions/3.7/bin/python3 ~/bin/url2txt.py "$1"
If I want to run it on all the URLs in a text file: #!/bin/bash
while IFS='' read -r l || [ -n "$l" ]; do
  ~/bin/u2t "$l"
done < "$1"
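As an aside, the same thing can be done without the loop, assuming one URL per line and no embedded whitespace (urls.txt is a placeholder):
xargs -n 1 ~/bin/u2t < urls.txt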
I'm sure most of the coders here are wincing at one or more mistakes or badly formatted items I've done here, but I'm open to feedback ...
by spaceprison on 6/28/19, 9:49 PM
Requests to fetch the page, BeautifulSoup to grab the tags you care about (title info), and then markdownify to take the raw HTML and turn it into Markdown.
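A minimal sketch of that approach, assuming the requests, beautifulsoup4, and markdownify packages (the example.com URL is a placeholder):
import requests
from bs4 import BeautifulSoup
from markdownify import markdownify

url = 'https://example.com/article'
html = requests.get(url).text

# Grab the title for the filename
soup = BeautifulSoup(html, 'html.parser')
title = soup.title.get_text(strip=True) if soup.title else 'untitled'

# Convert the raw HTML to markdown
body = markdownify(html)

# In practice the title should be slugified first, as discussed above
with open(title + '.md', 'w') as f:
    f.write(body)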