by skilbjo on 12/7/24, 4:22 AM with 3 comments
For the past four years, I’ve been building integrations for websites that lack official APIs — first with Puppeteer automation, and later via XHR requests.
Many people are familiar with web scraping, but fewer know about scraping via XHR requests. In fact, XHR requests are my preferred method for scraping because they allow you to build reliable and performant integrations into sites that either lack official APIs or restrict their use. I’ve found that building “unofficial” integrations using the XHR method is far more reliable than traditional web scraping approaches. Here’s why:
• Modern Websites: Many are built with frontend frameworks that load data asynchronously from backend APIs.
• Undocumented APIs: These backend APIs are as robust as official APIs but are left undocumented to the public (though the company knows them well internally).
• Reliable Integration: You can hook into these backend APIs to create dependable integrations (see the sketch after this list).
• Performance: Integrations built via XHR are much more reliable and performant than generic web scraping tools like Selenium, Puppeteer, or Playwright.
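To make this concrete, here is a minimal sketch of an XHR-style integration in Python. The endpoint, query parameters, and response fields are hypothetical; the point is that you call the JSON API the site’s frontend already uses (visible in the DevTools Network tab) instead of driving a browser.

    # Minimal sketch of the XHR approach. The endpoint, parameters, and
    # response fields below are hypothetical placeholders.
    import requests

    session = requests.Session()
    # Mirror the headers the browser sends with its XHR call.
    session.headers.update({
        "Accept": "application/json",
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
        "Referer": "https://example.com/listings",
    })

    resp = session.get(
        "https://example.com/api/v2/listings",   # undocumented backend API
        params={"page": 1, "per_page": 50},
        timeout=30,
    )
    resp.raise_for_status()

    # The backend returns structured JSON, so there is no HTML to parse.
    for item in resp.json().get("results", []):
        print(item["id"], item["title"])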
While developing integrations with the XHR method, I ran into a significant challenge: anti-bot software like Cloudflare can easily detect that your requests aren’t coming from a browser and block them. These tools are highly effective at fingerprinting your requests. Many developers try the XHR method by copying requests from the Chrome DevTools Network tab as cURL commands, only to receive a surprising 403 error, because Cloudflare excels at identifying non-browser requests.
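To illustrate that failure mode: even if you faithfully replay the copied headers with a plain HTTP client, the request can still be rejected, because the client itself (TLS handshake, header order, HTTP/2 settings) doesn’t look like a real browser. The URL and cookie value below are placeholders.

    # Replaying a DevTools-copied request with a plain HTTP client. Even with
    # browser-like headers, Cloudflare can fingerprint the client itself and
    # return 403. The URL and cookie value are placeholders.
    import requests

    resp = requests.get(
        "https://protected.example.com/api/items",
        headers={
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                          "AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/120.0 Safari/537.36",
            "Accept": "application/json",
            "Cookie": "cf_clearance=PASTE_FROM_BROWSER",
        },
    )
    print(resp.status_code)  # frequently 403, even though the browser gets 200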
At my previous company, I built all integrations using the XHR method but found that more sophisticated websites were protected by anti-bot software like Cloudflare. I experimented with existing solutions, including numerous supposed Cloudflare bypasses on GitHub and paid services like Zenrows, Scrapingbee, Oxylabs, and Brightdata, but found that they either didn’t work or required unnecessarily complex integrations (e.g., request headers were not passed through transparently to the target server, producing incorrect responses; response cookies were never sent back to the client; and similar painful, unreliable edge cases).
This led me to develop xhr.dev’s initial product: a magic proxy that offers anti-bot avoidance with a one-line code integration.
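This is not xhr.dev’s documented API, just a hypothetical sketch of the generic forwarding-proxy pattern a “one-line” integration usually implies: you keep your existing XHR-style code and route it through the proxy. The hostname, port, and credentials are placeholders.

    # Hypothetical sketch of a one-line proxy integration; the proxy hostname,
    # port, and credentials are placeholders, not xhr.dev's actual interface.
    import requests

    PROXY = "http://USERNAME:PASSWORD@proxy.example.dev:8080"  # placeholder

    resp = requests.get(
        "https://protected.example.com/api/items",
        proxies={"http": PROXY, "https": PROXY},  # the one added line
        timeout=30,
    )
    print(resp.status_code)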
How This Can Help You:
• Reliable Scraping via XHR: If you’re scraping via XHR, this tool lets you hook into a platform’s backend while making it very difficult for that backend to detect that you’re scraping or that you aren’t a real person.
• Unblock When Blocked: If you get blocked, it will unblock you.
• Captcha Auto-Solving: If you’re new to scraping and encounter various anti-bot methods like CAPTCHAs, it can automatically solve them for you.
What I’m Looking For:
• Customers with Web Scraping Use Cases: People who can provide valuable product feedback.
• Feedback on Product Shape: Comments on the current form and functionality of the product.
• Insights on Scraping Challenges: Understanding other problems people face when scraping and what solutions they would be willing to pay for.
You can view our historical performance on our status page (https://status.xhr.dev). I hope you give it a try!
ty v much, john
(also - this is my 2nd ShowHN post - first time around I reposted it - oops! This one is left completely organic)
by KomoD on 12/7/24, 1:41 PM
How are we supposed to do that? Clicking "register" leads to a google form, there's nothing to try.
"Show HN is for something you've made that other people can play with. *HN users can try it out*, give you feedback, and ask questions in the thread."
"Off topic: blog posts, *sign-up pages*, newsletters, lists, and other reading material. *Those can't be tried out, so can't be Show HNs.* Make a regular submission instead."
by PigiVinci83 on 12/10/24, 6:49 PM