by scottmas on 5/10/23, 5:42 AM with 3 comments
For the life of me I cannot figure out how to speed up the scraping process. For example, when I scrape it locally I can only get a maximum of about 300 KB/s no matter how much I try to parallelize requests, even though I have 200 Mbps of bandwidth. It's just annoying for our marketing team to have such a long delay between publishing changes and seeing them deployed live.
Am I getting hit with some sort of CloudFront rate limiting by IP address? Is there some low-level socket limit I'm hitting on both my local Mac and the Linux box I do the scraping on?
What are the best ways I can speed things up?
by psnehanshu on 5/10/23, 5:58 AM
It may also be that Webflow rate-limits bot traffic? Try spoofing the user agent with a popular browser's[1] (see the sketch below).
But why scrape? Webflow allows you to export the code[2], though it may still require a premium subscription; I haven't looked thoroughly.
[1] https://techblog.willshouse.com/2012/01/03/most-common-user-...
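A minimal Python sketch of the user-agent spoofing suggestion above, using the requests library; the target URL and the browser string are placeholders, and the site may still be throttling by IP rather than by user agent.

    import requests

    # Pretend to be a common desktop browser instead of the default
    # python-requests user agent, which some CDNs treat as bot traffic.
    HEADERS = {
        "User-Agent": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/112.0.0.0 Safari/537.36"
        )
    }

    resp = requests.get("https://example.com/some-page", headers=HEADERS, timeout=30)
    resp.raise_for_status()
    print(len(resp.content), "bytes")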
by sunilsandhu on 5/10/23, 9:28 PM
[1] https://get.brightdata.com/bd-solutions-rotating-proxies
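The link above points to a rotating-proxy service, so the suggestion is presumably to spread requests across multiple exit IPs. A minimal Python sketch of routing a request through such a proxy, assuming a hypothetical proxy endpoint and credentials; whether this helps depends on whether the bottleneck really is per-IP throttling.

    import requests

    # Hypothetical rotating-proxy endpoint; each request can exit from a
    # different IP, which sidesteps per-IP rate limits if that is the bottleneck.
    PROXY = "http://USERNAME:PASSWORD@proxy.example.com:8000"
    proxies = {"http": PROXY, "https": PROXY}

    resp = requests.get("https://example.com/some-page", proxies=proxies, timeout=30)
    resp.raise_for_status()
    print(resp.status_code, len(resp.content))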
by bosky101 on 5/10/23, 6:01 AM
2) Maybe your scraping is synchronous, without any parallelism across pages at the same crawl depth.
3) Use your code or a sitemap to get all the URLs into a txt file, then loop through them with bash/curl (the Python sketch below shows the same idea).
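A minimal Python sketch combining 2) and 3): pull the page URLs out of the sitemap, then fetch them concurrently instead of one after another. The sitemap location and the worker count are assumptions, and bash/curl with xargs would work just as well.

    import xml.etree.ElementTree as ET
    from concurrent.futures import ThreadPoolExecutor

    import requests

    SITEMAP_URL = "https://example.com/sitemap.xml"  # assumed sitemap location
    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    # Collect every page URL listed in the sitemap.
    sitemap = requests.get(SITEMAP_URL, timeout=30)
    sitemap.raise_for_status()
    urls = [loc.text for loc in ET.fromstring(sitemap.content).findall(".//sm:loc", NS)]

    def fetch(url):
        # One GET per URL; add the headers/proxies from the other comments if needed.
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        return url, len(resp.content)

    # Fetch pages concurrently rather than sequentially.
    with ThreadPoolExecutor(max_workers=10) as pool:  # tune the worker count
        for url, size in pool.map(fetch, urls):
            print(size, url)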