from Hacker News

Generate RSS feed for any website using CSS selectors

by thirdplace_ on 7/14/23, 8:23 PM with 53 comments

by toastal on 7/15/23, 5:53 AM
CSS selectors were more useful before the Tailwind fad of dropping meaningful class names in favor of recreating inline styles, but with abbreviations to memorize. I use μBlock Origin + userStyles a lot, both of which also use CSS selectors, & over the last couple of years everything has become a lot harder for the end user to tweak/fix. If you’re lucky now, you’ll have some ARIA attributes to select on.
by snthd on 7/14/23, 11:20 PM
RSSHub[0] is in the same ballpark, but consists of a large library of site-specific code[1][2].
[0]https://github.com/DIYgod/RSSHub/
[1]https://github.com/DIYgod/RSSHub/tree/master/lib/routes
[2]https://github.com/DIYgod/RSSHub/tree/master/lib/v2
by solardev on 7/14/23, 9:05 PM
It ded.
Archive: https://web.archive.org/web/20230714202418/https://rss-bridg...
Sample feed: https://web.archive.org/web/20230308160413/https://rss-bridg...
by awesomegoat_com on 7/15/23, 6:32 AM
I was always afraid to use one of these. I thought that the CSS selectors would be too brittle and ultimately break.
I have built my own solution that is automagical at https://awesomegoat.com/ but I am running into the next set of issues, which are various scraping protections. It seems that a reasonable RSS gateway today needs to include a botnet of residential proxies just to read content on the internet.
by xnx on 7/14/23, 9:07 PM
This is a great tool! Before I learned about nitter, this was my primary way to follow people on Twitter. I love the idea of trying to wrestle unsupported feeds (Twitter, Instagram, etc.) into a standard/open format.
by jasonlotito on 7/14/23, 10:59 PM
The lack of feed generation is why so many of the latest blog platforms are non-starters in my book. It boggles my mind. Honestly, if you don't generate a feed of some sort, I really can't take you seriously.
by nfriedly on 7/15/23, 12:53 AM
I run my own instance of RSS Bridge to keep track of authors that I like on Goodreads.
It works pretty well, although every once in a while Goodreads hiccups, and then RSS Bridge gives me a bunch of "new posts" that are actually error messages.
by okuntilnow on 7/15/23, 9:32 AM
Huginn is another useful tool that allows you to wrangle CSS selectors and XPath nodes to create RSS feeds.
I use it quite successfully to get data out of undocumented APIs and out into RSS.
https://github.com/huginn/huginn
by bubblematrix on 7/14/23, 11:54 PM
This honestly is standard web scraping but these projects always catch my attention.
You're at the mercy of rate-limiting firewalls (so you'll have to rotate proxies if you intend to use this heavily) on top of the standard CloudFront bot detection reCAPTCHA and div obfuscation (a good example of this is Facebook).
by dagurp on 7/14/23, 11:52 PM
These days I just let ChatGPT generate a script that scrapes a site and spits out an RSS file. Then I run it with cron.
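For readers wondering what such a script might look like, here is a minimal sketch using only the Python standard library. It is not the commenter's script: the names `LinkCollector`, `build_rss`, and `html_to_rss` are hypothetical, and it assumes each post on the page is an `<a>` element carrying a known class, which is a much cruder match than a real CSS selector engine.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class LinkCollector(HTMLParser):
    """Collect (href, text) for every <a> whose class list contains css_class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.items = []
        self._href = None   # href of the matching <a> we are currently inside
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and self.css_class in classes and attrs.get("href"):
            self._href = attrs["href"]
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.items.append((self._href, "".join(self._text).strip()))
            self._href = None


def build_rss(feed_title, feed_link, items):
    """Serialize (href, text) pairs as a bare-bones RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = f"Scraped feed for {feed_link}"
    for href, text in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = text
        ET.SubElement(item, "link").text = href
    return ET.tostring(rss, encoding="unicode")


def html_to_rss(html, feed_title, feed_link, css_class):
    """Parse an HTML string and return the RSS XML as a string."""
    parser = LinkCollector(css_class)
    parser.feed(html)
    parser.close()
    return build_rss(feed_title, feed_link, parser.items)
```

To use it on a live site, you would fetch the page yourself (for example with `urllib.request.urlopen`), pass the decoded HTML to `html_to_rss`, write the result to a file, and schedule the script with cron. Relative links and publication dates are left unhandled here.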
by ChrisArchitect on 7/14/23, 9:59 PM
Other services like this: https://www.fivefilters.org/feed-creator/
by eviks on 7/15/23, 4:17 AM
What's the easiest way to also run a few basic filters on the site/RSS feed's content to make it truly shine vs simplistic scraping, like
- splitting the full feed by theme of the article into separate feeds and at the same time
- removing a few keywords and also
- getting article length and splitting into a long / short feed
- Or maybe getting what you used to have on some news sites: subscribing only to a specific author instead of getting bombarded with hundreds of items in a feed
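The filters listed above can be sketched in a few lines of plain Python once the feed has been parsed into items. This is only an illustration under stated assumptions: the `Item` type and every function name are hypothetical, "removing keywords" is read as dropping items that mention them, themes are matched by simple keyword lookup, and the 800-word threshold for "long" is arbitrary.

```python
from dataclasses import dataclass


@dataclass
class Item:
    title: str
    author: str
    text: str


def drop_keywords(items, banned):
    """Drop every item whose title or text mentions a banned keyword."""
    banned = [w.lower() for w in banned]
    return [
        i for i in items
        if not any(w in f"{i.title} {i.text}".lower() for w in banned)
    ]


def split_by_theme(items, themes):
    """themes maps a theme name to keywords; an item may land in several feeds."""
    feeds = {name: [] for name in themes}
    for i in items:
        haystack = f"{i.title} {i.text}".lower()
        for name, keywords in themes.items():
            if any(k.lower() in haystack for k in keywords):
                feeds[name].append(i)
    return feeds


def split_by_length(items, min_long_words=800):
    """Return (long_items, short_items) based on a whitespace word count."""
    long_items, short_items = [], []
    for i in items:
        is_long = len(i.text.split()) >= min_long_words
        (long_items if is_long else short_items).append(i)
    return long_items, short_items


def by_author(items, author):
    """Keep only the items written by one author."""
    return [i for i in items if i.author == author]
```

Each function takes and returns plain lists, so they can be chained in any order before the result is serialized back into one or more feeds.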
by PaulHoule on 7/14/23, 9:46 PM
I've wondered why people have tried all sorts of cumbersome ways to splice metadata onto HTML like RDFa but never tried the obvious approach of basing extraction rules on CSS selectors... Often these work without the cooperation of the target site so long as they use CSS the way it was supposed to be used (e.g. not tailwind, bootstrap, etc.)
by CoBE10 on 7/14/23, 10:06 PM
For me PolitePol is best because it doesn't limit the number of feeds and the free plan is pretty good: https://politepol.com
by treyd on 7/14/23, 9:55 PM
I wonder if this would work better / be more expressive with XPath-style selectors?
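One concrete difference: XPath can filter on an element's text, which CSS selectors cannot. A small sketch with Python's standard `xml.etree.ElementTree`, which implements only a limited XPath subset and needs well-formed XML, so real-world HTML would have to go through a forgiving parser first. The `PAGE` markup is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing where only the label text tells news apart from ads.
PAGE = """
<ul>
  <li><span>News</span><a href="/a">First story</a></li>
  <li><span>Ads</span><a href="/b">Buy now</a></li>
  <li><span>News</span><a href="/c">Second story</a></li>
</ul>
"""

root = ET.fromstring(PAGE)

# Select the links inside every <li> that has a <span> child whose text is "News".
news_links = [a.get("href") for a in root.findall("./li[span='News']/a")]
```

Here `news_links` ends up holding only the two hrefs under a "News" label, with no class names involved at all.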
by account-5 on 7/15/23, 9:44 AM
Is there a standalone application that can do something similar, one that doesn't require a web server to run? Like an RSS reader you'd run on your desktop or phone? I'd definitely be interested in that.
by Hamuko on 7/15/23, 5:15 AM
FreshRSS has XPath scraping.
https://danq.me/2022/09/27/freshrss-xpath/
by midasz on 7/14/23, 9:18 PM
Does it work for websites that fetch content async? I've had success with https://morss.it instead (which can also be self-hosted)
by simonjgreen on 7/14/23, 9:49 PM
This is very similar to how you can scrape data from the web with Power Query
by skribanto on 7/14/23, 9:04 PM
Getting 502 Bad Gateway
by kayson on 7/15/23, 5:57 AM
FreshRSS has this feature built in. But you can use rss-bridge for far more complicated scenarios too
by 1vuio0pswjnm7 on 7/15/23, 12:54 AM
"Generate RSS feed for any website using CSS selectors"
For me, "CSS selectors" always seems like a deceptive term, if it means selecting HTML tag elements. What if the website does not use styling?
I read 1000s of websites, including all HN submissions, without using CSS. When I want to extract information from a website, I focus on patterns in the page. They might be HTML, they might be style elements, but they could be anything. I never assume that all websites will wrap the information I want in certain elements. There is a ridiculous amount of random variation amongst websites.