403ing AI Crawlers (coryd.dev)
Submitted by cdme 10 months ago
  • delifue 10 months ago

    In my opinion, the best way to fight crawlers is not to give error feedback (403). The best way is to feed the crawlers low-quality AI-generated data.

    • marcus0x62 10 months ago

      Self plug, but I made this to deal with bots on my site: https://marcusb.org/hacks/quixotic.html. It is a simple Markov generator that obfuscates content (static-site friendly, no server-side dynamic generation required), plus an optional link maze that sends incorrigible bots into 100% Markov-generated nonsense (that part requires a server-side component).

      I do serve a legitimate robots.txt file to warn off the scrapers I know about.
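
      A minimal sketch of the bigram idea (illustrative only, not quixotic's actual code; page.txt stands in for real page content):

        import random
        import re

        def build_bigrams(text):
            # Map each word to the list of words that follow it in the source.
            words = re.findall(r"\S+", text)
            model = {}
            for a, b in zip(words, words[1:]):
                model.setdefault(a, []).append(b)
            return model

        def generate(model, length=50):
            # Random-walk the chain: the output stays locally plausible but
            # is globally nonsense.
            word = random.choice(list(model))
            out = [word]
            for _ in range(length - 1):
                followers = model.get(word)
                word = random.choice(followers) if followers else random.choice(list(model))
                out.append(word)
            return " ".join(out)

        print(generate(build_bigrams(open("page.txt").read())))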

      • shakna 10 months ago

        I may have a system in place that starts the pipeline for fetching a very, very large file (16 TB, a text file designed for testing). It's not hosted by me, except for the first shard.

        A surprising number of agents try to download the whole thing.
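
        A minimal sketch of that sort of trap, assuming a WSGI app (the bot list and shard URL are placeholders, not my actual setup):

          from wsgiref.simple_server import make_server

          BOT_UA = ("GPTBot", "CCBot", "Bytespider")    # hypothetical bot list
          SHARD = "https://example.com/shard-000.txt"   # stand-in for the first shard

          def app(environ, start_response):
              ua = environ.get("HTTP_USER_AGENT", "")
              if any(bot in ua for bot in BOT_UA):
                  # Redirect suspected crawlers into the very large file.
                  start_response("302 Found", [("Location", SHARD)])
                  return [b""]
              start_response("200 OK", [("Content-Type", "text/html")])
              return [b"normal page"]

          make_server("", 8000, app).serve_forever()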

        • kazinator 10 months ago

          Right, and that's why honeypots work against many targets. Why serve them an actual file when a CGI script (or whatever) can just generate output in a loop?
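
          A minimal sketch of such a CGI script (the word list is arbitrary):

            #!/usr/bin/env python3
            # Emit valid headers, then produce junk text for as long as the
            # client keeps reading.
            import random
            import sys

            WORDS = ["data", "model", "token", "crawl", "index", "cache"]

            sys.stdout.write("Content-Type: text/plain\r\n\r\n")
            while True:
                sys.stdout.write(" ".join(random.choices(WORDS, k=12)) + "\n")
                sys.stdout.flush()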

          • andrewmcwatters 10 months ago

            Someone has to front the bandwidth.

            • kazinator 10 months ago

              Ah, speaking of that, of course you don't generate the fake data as fast as you can. You just trickle it out often enough for them not to time out.
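
              A small variation on the CGI loop above makes it a drip instead of a firehose (the 10-second pacing is an arbitrary choice):

                import random
                import sys
                import time

                sys.stdout.write("Content-Type: text/plain\r\n\r\n")
                while True:
                    sys.stdout.write(random.choice(["lorem", "ipsum", "dolor"]) + " ")
                    sys.stdout.flush()
                    time.sleep(10)  # a few bytes every 10 s keeps the client waiting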

              • BitPirate 10 months ago

                That's why you should run a tarpit instead.
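
                A minimal endlessh-style tarpit sketch: accept connections and drip bytes slowly so each client stays stuck (the port and pacing are arbitrary):

                  import random
                  import socket
                  import threading
                  import time

                  def tarpit(conn):
                      # Send one random printable byte every few seconds, forever.
                      try:
                          while True:
                              conn.sendall(bytes([random.randrange(32, 127)]))
                              time.sleep(5)
                      except OSError:
                          conn.close()

                  srv = socket.socket()
                  srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
                  srv.bind(("0.0.0.0", 8080))  # arbitrary trap port
                  srv.listen()
                  while True:
                      conn, _ = srv.accept()
                      threading.Thread(target=tarpit, args=(conn,), daemon=True).start()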