by jakobdabo on 9/18/22, 11:59 AM with 74 comments
by cj on 9/18/22, 5:13 PM
I discovered that our company's help documentation (and integration guides), hosted by readme.com, had been completely de-indexed from Google for the past three months.
Our Readme docs were formerly our #1 source of organic (free) leads.
After investigating, I found that Cloudflare (as configured by Readme) was blocking Googlebot whenever Cloudflare Workers were in the request path: it returned a 403 to Googlebot but served pages as usual to regular users.
The cause: we were using Workers to rewrite some URLs at the edge (replacing Readme's default images with optimized + compressed images, using Cloudflare's own image optimization service).
Because the Worker made subrequests on our behalf, Readme's Cloudflare account received requests for our domain with a "googlebot" user agent but from an IP address that could not be verified as Googlebot's (I assume the Worker forwarded the Googlebot user agent while the request itself originated from whatever IP addresses Cloudflare Workers use).
I emailed Cloudflare support but it was clear it would take a lot of time to get them to understand the issue (and probably longer to fix it).
So, we had to spend a lot of time figuring out how to allow Googlebot requests past Cloudflare's "fake bot" firewall rule.
In our own Cloudflare account, we have all security settings at the lowest sensitivity possible (or turned off completely). We serve over 500 billion requests a month (10+ TB of bandwidth), and the amount of blocked traffic to seemingly legitimate clients was surprisingly high.
I love Cloudflare (and own quite a bit of their stock) but I'm beginning to rethink my stance on their service. They make it extremely easy to enable powerful features with little visibility or control over the details of how those features work.
Another SEO nightmare is their "Crawler Hints" service. I strongly recommend against it if you are ever the target of automated security scanners (e.g., the ones used by bug-bounty white-hat hackers). With Crawler Hints enabled, a white-hat hacker scanning your site and hitting random URLs results in Bingbot, Yandex, and other search engines attempting to index every single URL the scanners touch.
Basically, it's a mess, and the only way to really fix it is to bypass Cloudflare or spend a lot of time and money debugging with Cloudflare.
Next quarter I'm faced with the decision of either doubling down on Cloudflare and getting an Enterprise plan with them ($20k+) or just ripping them out of our stack and going back to our old AWS CloudFront setup, which has fewer POPs but was much less of a hassle.
by Tiberium on 9/18/22, 2:57 PM
The most useful services for that are https://shodan.io/ and https://search.censys.io/. I've had decent success with Censys at finding the real IP addresses of websites behind Cloudflare. Of course, you might also have success by checking the DNS record history for a particular domain.
by Anunayj on 9/18/22, 5:52 PM
by dizhn on 9/18/22, 2:58 PM
by yjftsjthsd-h on 9/18/22, 4:04 PM
by Ralo on 9/19/22, 12:00 AM
by urtom on 9/18/22, 7:15 PM
by alokjnv10 on 9/18/22, 5:01 PM
by IceWreck on 9/19/22, 5:11 AM
by 1vuio0pswjnm7 on 9/19/22, 1:28 AM
Of course, those trying to profit from online advertising services seek to collect the same (fingerprinting) data. Do Cloudflare's terms of service and privacy policy allow Cloudflare to do anything it wants with this data, or are there limits?
by userbinator on 9/18/22, 11:42 PM
by donutshop on 9/18/22, 3:49 PM
by nothasan on 9/18/22, 3:31 PM
by midislack on 9/18/22, 7:41 PM