from Hacker News

Ask HN: Best way to stop bot traffic?

by ethor on 10/10/22, 5:22 PM with 29 comments

I have a site that gets bombarded with worthless robot traffic. I have previously used Cloudflare, but a lot of bots still get through.

What have you used that has been effective?

  • by freedomben on 10/10/22, 7:12 PM

    Make sure your Cloudflare settings are as aggressive as possible. You might need to upgrade to the first paid level (I think "pro"?) to activate the most aggressive, but it does work very well.

    After that, you can throw a CAPTCHA on pages (particularly submission pages), but that will harm legitimate users as well as bots.

    Make sure your origin server is only reachable from Cloudflare. If people can hit it directly, then they bypass Cloudflare. If you use firewalld, I wrote this in my setup script that you can use:

        # Allow HTTP/HTTPS only from Cloudflare's published IPv4 ranges.
        # Note: IPv6 ranges (.result.ipv6_cidrs) are not covered here.
        for range in $(curl -s -X GET "https://api.cloudflare.com/client/v4/ips" | jq -r '.result.ipv4_cidrs[]'); do
          for port in 80 443; do
            echo "Inserting firewalld rule for address range '${range}' on port '${port}'"
            firewall-cmd --zone=public --permanent \
              --add-rich-rule="rule family=\"ipv4\" source address=\"${range}\" port protocol=\"tcp\" port=\"${port}\" accept"
          done
        done
    
        # Remove the general allow rules so only the ranges above can reach the origin.
        firewall-cmd --zone=public --remove-service=http --permanent
        firewall-cmd --zone=public --remove-service=https --permanent
        firewall-cmd --reload
  • by tothrowaway on 10/11/22, 8:14 AM

    Most bots don't bother setting cookies, or downloading CSS. Exploit this by including a dummy CSS file on your site that, on the backend, stores the visitor's IP in some kind of database, or sets a cookie. If you get multiple visits from an IP that never hit the CSS file, you can be reasonably confident the user is not legit. You need to be careful about not blocking good bots though. Do a reverse DNS lookup before actually blocking an IP to make sure it's not Googlebot, yandexbot, bingbot, slurp, etc. OpenResty is great for implementing this.

    It has the nice side effect of protecting you from run-of-the-mill DDoS attacks too.

    (I realize half my comments here are about OpenResty, but I have no affiliation with them. I'm just a happy user.)
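
    A minimal Python sketch of the idea above (not the OpenResty setup the comment refers to). The beacon path, the threshold, and the list of crawler hostname suffixes are illustrative assumptions; check each search engine's documentation for the hostnames it actually uses. The reverse DNS check is forward-confirmed: the PTR name must end in a known suffix and resolve back to the same IP.

```python
import socket
from collections import defaultdict

# Assumed values for illustration only; tune for your own site.
BEACON_CSS_PATH = "/static/beacon.css"   # hypothetical dummy CSS file
MAX_PAGES_WITHOUT_CSS = 5                # hypothetical threshold
GOOD_BOT_SUFFIXES = (                    # illustrative, not exhaustive
    ".googlebot.com", ".google.com", ".search.msn.com",
    ".yandex.com", ".crawl.yahoo.net",
)

page_hits = defaultdict(int)  # IP -> number of non-CSS requests seen
css_seen = set()              # IPs that fetched the dummy CSS file


def record_request(ip, path):
    """Call once per incoming request."""
    if path == BEACON_CSS_PATH:
        css_seen.add(ip)
    else:
        page_hits[ip] += 1


def is_verified_crawler(ip, reverse=socket.gethostbyaddr,
                        forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check (IPv4 only with these defaults)."""
    try:
        host = reverse(ip)[0]
        if not host.endswith(GOOD_BOT_SUFFIXES):
            return False
        return ip in forward(host)[2]
    except OSError:  # covers socket.herror and socket.gaierror
        return False


def looks_like_bot(ip, **dns):
    """True if the IP made many requests, never fetched the CSS beacon,
    and is not a verified search engine crawler."""
    if ip in css_seen or page_hits.get(ip, 0) < MAX_PAGES_WITHOUT_CSS:
        return False
    return not is_verified_crawler(ip, **dns)
```

    The DNS lookups are passed in as parameters so the check can be exercised without network access; in production the defaults would be used, ideally with the result cached per IP.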

  • by clafferty on 10/10/22, 9:12 PM

    You’re going to need to make a few aggressive WAF rules, pepper in some whitelisting rules and, if you can, add rate limiting.

    1. Block all unverified bots with a bot score of 1. This will still allow popular web crawlers but could be strict enough to block a curl request.

    2. Use Managed Challenge for unverified bots with a bot score less than 30. This will silence most of the troublemaking bots and provides a JavaScript (not necessarily CAPTCHA) challenge for users who are incorrectly scored.

    3. Add rate limiting. Figure out a realistic access rate, double it, and use that as a hard limit that will block traffic for an hour or a day, depending on your needs.

    4. Add more sensitive rate limits and experiment with Managed Challenge rules. Use the simulate option for a few days before enabling any rate limit. You can add challenges here if you feel a limit might be affecting users too.

    5. Review rate limits and firewall reports regularly and adjust. With any Managed Challenge rules make sure to check the percentage completed to see if you’re trapping real users. This number should be as close to 0 as possible. Repeat step 4.

    You’ll want to get around your own blocking rules with some complementary whitelisting rules.

    Although it’s advised to lock down your origin server so that non-Cloudflare traffic cannot reach it, you might not be able to do so easily if you’ve got load balancers and other infrastructure in the way that can’t be touched. Just make sure your root domain isn’t leaking your www IP address. You can use CNAME flattening and you should be alright.

    The difficulty in these solutions is managing all the rules you can make. Things can quickly become too complicated to change easily. Keep it simple: have a few basic but aggressive blocking rules, and revise your whitelist and rate limits regularly. Good luck.
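
    Step 3 above (take a realistic rate, double it, block offenders for a fixed period) can be sketched in Python as follows. This is not how Cloudflare implements its rate limiting; the class name, the one-minute window, and the default block duration are assumptions chosen for illustration.

```python
import time


class HardRateLimiter:
    """Fixed one-minute window per IP; exceeding the limit blocks the IP
    for block_seconds."""

    def __init__(self, realistic_per_minute, block_seconds=3600,
                 clock=time.monotonic):
        self.limit = 2 * realistic_per_minute  # "double it"
        self.block_seconds = block_seconds
        self.clock = clock
        self.windows = {}        # ip -> (window_start, request_count)
        self.blocked_until = {}  # ip -> time at which the block expires

    def allow(self, ip):
        now = self.clock()
        if self.blocked_until.get(ip, float("-inf")) > now:
            return False
        start, count = self.windows.get(ip, (now, 0))
        if now - start >= 60:    # start a fresh window
            start, count = now, 0
        count += 1
        self.windows[ip] = (start, count)
        if count > self.limit:
            self.blocked_until[ip] = now + self.block_seconds
            return False
        return True
```

    The clock is injectable so the behavior can be simulated before it is enforced, in the same spirit as the "simulate first" advice in step 4.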

  • by codegeek on 10/10/22, 5:49 PM

    Unfortunately, you have to get more aggressive, which may sometimes block real users. Do the following:

    - Set up a CAPTCHA, or just block users from certain countries if you know where your traffic comes from. This can sometimes create issues for your users on VPNs, so you have to make the call depending on how many of your users may be using one. At a minimum, add a CAPTCHA.

    - Create more Page Rules in Cloudflare and block anything that doesn't match them. For example, if your URLs start with a specific prefix, drop anything that does not match.

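    - Make sure to return a 444 status from your server directly if bots are bypassing Cloudflare and hitting the IP directly. Sample code for nginx 1.19.4 or higher (the version that introduced ssl_reject_handshake):

        # Catch-all server for requests that match no configured server_name,
        # for example requests sent straight to the IP address.
        server {
          listen 80 default_server;
          listen [::]:80 default_server;
    
          # "ssl" is required on the listen directive for TLS to be handled here.
          listen 443 ssl default_server;
          listen [::]:443 ssl default_server;
          ssl_reject_handshake on;
    
          server_name _;
          return 444;  # nginx-specific: close the connection without a response
        }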
    
    
    If bots are getting too aggressive, I start with Block first, ask questions later. Depending on your traffic and users, it may be the right strategy.
  • by andrewmcwatters on 10/10/22, 9:09 PM

    Man, if the state of the art is to suggest something Cloudflare related, that's a really sad state of affairs.
  • by viraptor on 10/10/22, 8:20 PM

    I'd go against the "just increase the CF strictness" advice. It counts on CF basically doing something magic and hopes not to affect real users, and that's not really possible.

    1. Why do you want to stop bots? Are they actually overloading your resources, or are they just noisy in the logs? If you can easily handle the traffic, maybe find a way to filter the logs better.

    2. How do you know they're bots? If they're easy to identify, can you write a few simple rules to remove most of them?

    2a. Are they mindless scans? Make sure your app doesn't even see requests to resources which don't exist.

    2b. Are they scraping content? Set up per-resource-per-IP rate limits (token bucket style)

    2c. Are they coming from a specific network, for example tor, AWS, or similar? Put in an auto updating list of sources that get dropped at firewall level.

    3. As mentioned in other comments, if you're using some proxy in front of your service, ensure you drop any traffic which bypasses it.

    Basically consider what's actually happening and respond to that. There's no setting that will improve things without side effects, or it would be already turned on.
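
    The per-resource-per-IP token bucket from point 2b can be sketched in Python like this. The class name and parameters are hypothetical; a real deployment would also need to evict idle buckets and would probably live in the proxy or a shared store rather than in process memory.

```python
import time


class TokenBucketLimiter:
    """One token bucket per (ip, resource) pair.

    rate: tokens added per second; capacity: maximum burst size.
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets = {}  # (ip, resource) -> (tokens, last_refill_time)

    def allow(self, ip, resource):
        now = self.clock()
        key = (ip, resource)
        tokens, last = self.buckets.get(key, (self.capacity, now))
        # Refill according to elapsed time, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[key] = (tokens - 1, now)
            return True
        self.buckets[key] = (tokens, now)
        return False
```

    Unlike a fixed window, the bucket tolerates short bursts up to capacity while holding the long-run rate to rate requests per second, which tends to be gentler on real users who open several pages at once.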

  • by trinovantes on 10/11/22, 1:19 AM

    Blocking IP addresses from cloud providers will eliminate most bots:

    https://github.com/brianhama/bad-asn-list

    Unfortunately, you'll also alienate VPN users, so you'll have to decide if it's worth the cost.
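
    The linked list contains ASNs, not IP ranges, so this approach needs a separate step that maps each ASN to its announced prefixes. Assuming you already have those prefixes as CIDR strings, a membership check could look like the Python sketch below. The function names are hypothetical, and the ranges in the usage note are reserved documentation ranges, not real cloud provider prefixes.

```python
import ipaddress


def load_networks(cidrs):
    """Parse CIDR strings (IPv4 or IPv6) into network objects."""
    return [ipaddress.ip_network(c) for c in cidrs]


def is_blocked(ip, networks):
    """True if ip falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    # An IPv4 address is never "in" an IPv6 network and vice versa,
    # so mixed lists are safe to scan.
    return any(addr in net for net in networks)
```

    For example, with networks = load_networks(["192.0.2.0/24", "2001:db8::/32"]), is_blocked("192.0.2.55", networks) is true and is_blocked("198.51.100.1", networks) is false. A linear scan is fine for a few hundred prefixes; larger lists are usually better handled at the firewall level.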

  • by ianpurton on 10/10/22, 5:58 PM

    I use Cloudflare and only allow paths I know, e.g. /blog* and /app/*.

    I block everything else

    That kills most of it.
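
    The same allow-only-known-paths idea, sketched at the application level in Python rather than as Cloudflare rules. The patterns are the two examples from the comment; the function name is hypothetical, and the path is assumed to have its query string already removed.

```python
from fnmatch import fnmatchcase

# Patterns taken from the examples above; everything else is rejected.
ALLOWED_PATTERNS = ("/blog*", "/app/*")


def is_allowed_path(path, patterns=ALLOWED_PATTERNS):
    """True if the request path matches at least one allowed pattern.

    fnmatchcase is case-sensitive, and its "*" also matches "/" characters.
    """
    return any(fnmatchcase(path, pattern) for pattern in patterns)
```

    With these patterns, /blog/post-1 and /app/settings pass, while /wp-login.php and a bare /app (no trailing slash) are rejected.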

  • by JimWestergren on 10/10/22, 5:40 PM

    Try increasing the security level in CF. You could also activate Firewall rules for certain countries that are not relevant for your site. Make sure not to apply them to known bots (Googlebot etc.).
  • by NetToolKit on 10/10/22, 11:12 PM

    If you're interested in a non-Cloudflare solution, we have developed a service called Gatekeeper, and we'd really like to get your thoughts on whether it might suit your needs: https://www.nettoolkit.com/gatekeeper/about

    Essentially, Gatekeeper is a rules-engine with a fancy UI that allows you to craft policies specific to your site and the traffic that is visiting your site. For example, you can say "Allow Googlebot" and "Show CAPTCHA to visitors from AWS on every fifth visit". If you'd like to communicate offline, you can find our email address in my profile.

  • by joekok33 on 10/11/22, 7:51 AM

    If you are in a cPanel environment, go into your hosting panel and look at the IPs that are coming in. Apply a filter to those, and that should stop the bot traffic.
  • by _humancompiler on 10/12/22, 10:12 PM

    https://tlstoy.com/ detects fraudulent requests
  • by rpigab on 10/11/22, 7:22 AM

    Advertise your website to as many real people as possible, worthless robot traffic will then seem less important in comparison to actual human traffic. You could use billboards, hand out flyers, etc.