by scottmas on 5/10/23, 5:42 AM with 3 comments
For the life of me I cannot figure out how to speed up the scraping process. For example, when I scrape it locally I can only get a maximum of about 300 KB/s no matter how much I try to parallelize requests, even though I have 200 Mbps of bandwidth. It's just annoying for our marketing team to have such a long delay between publishing changes and seeing them deployed live.
Am I getting hit with some sort of CloudFront rate limiting by IP address? Is there some low-level socket limit I'm hitting on both my local Mac and the Linux box I do the scraping on?
What are the best ways I can speed things up?
by psnehanshu on 5/10/23, 5:58 AM
It may also be that Webflow rate-limits bot traffic? Try spoofing the user agent with a popular browser's[1] (see the sketch below).
But why scrape? Webflow allows you to export the code[2], though it may still require a premium subscription; I haven't looked thoroughly.
[1] https://techblog.willshouse.com/2012/01/03/most-common-user-...
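A minimal Python sketch of the user-agent spoofing suggestion above, using the requests library; the target URL and the browser string are placeholders, and the site may still be throttling by IP rather than by user agent.

    import requests

    # Pretend to be a common desktop browser instead of the default
    # python-requests user agent, which some CDNs treat as bot traffic.
    HEADERS = {
        "User-Agent": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/112.0.0.0 Safari/537.36"
        )
    }

    resp = requests.get("https://example.com/some-page", headers=HEADERS, timeout=30)
    resp.raise_for_status()
    print(len(resp.content), "bytes")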
by sunilsandhu on 5/10/23, 9:28 PM
[1] https://get.brightdata.com/bd-solutions-rotating-proxies
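The link above points to a rotating-proxy service, so the suggestion is presumably to spread requests across multiple exit IPs. A minimal Python sketch of routing a request through such a proxy, assuming a hypothetical proxy endpoint and credentials; whether this helps depends on whether the bottleneck really is per-IP throttling.

    import requests

    # Hypothetical rotating-proxy endpoint; each request can exit from a
    # different IP, which sidesteps per-IP rate limits if that is the bottleneck.
    PROXY = "http://USERNAME:PASSWORD@proxy.example.com:8000"
    proxies = {"http": PROXY, "https": PROXY}

    resp = requests.get("https://example.com/some-page", proxies=proxies, timeout=30)
    resp.raise_for_status()
    print(resp.status_code, len(resp.content))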
by bosky101 on 5/10/23, 6:01 AM
2) Maybe your scraping is synchronous, without any parallelism across pages at the same crawl depth.
3) Use your code or a sitemap to get all the URLs into a txt file, then loop through them with bash/curl (the Python sketch below shows the same idea).
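A minimal Python sketch combining 2) and 3): pull the page URLs out of the sitemap, then fetch them concurrently instead of one after another. The sitemap location and the worker count are assumptions, and bash/curl with xargs would work just as well.

    import xml.etree.ElementTree as ET
    from concurrent.futures import ThreadPoolExecutor

    import requests

    SITEMAP_URL = "https://example.com/sitemap.xml"  # assumed sitemap location
    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    # Collect every page URL listed in the sitemap.
    sitemap = requests.get(SITEMAP_URL, timeout=30)
    sitemap.raise_for_status()
    urls = [loc.text for loc in ET.fromstring(sitemap.content).findall(".//sm:loc", NS)]

    def fetch(url):
        # One GET per URL; add the headers/proxies from the other comments if needed.
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        return url, len(resp.content)

    # Fetch pages concurrently rather than sequentially.
    with ThreadPoolExecutor(max_workers=10) as pool:  # tune the worker count
        for url, size in pool.map(fetch, urls):
            print(size, url)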