403ing AI Crawlers (coryd.dev)
Submitted by cdme 2 days ago
  • delifue a day ago

    In my opinion, the best way to fight crawlers is not to give error feedback (403) but to feed them low-quality AI-generated data.

    • marcus0x62 a day ago

      Self plug, but I made this to deal with bots on my site: https://marcusb.org/hacks/quixotic.html. It is a simple markov generator to obfuscate content (static-site friendly, no server-side dynamic generation required) and an optional link maze to send incorrigible bots to 100% markov-generated nonsense (requires a server-side component).

      I do serve a legit robots.txt file to warn away the scrapers I know about.
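
      The core of the markov trick is tiny; a minimal sketch of the general idea in Python (not quixotic's actual code):

          import random
          from collections import defaultdict

          def build_chain(text, order=2):
              # Map each n-gram of words to the words seen after it.
              words = text.split()
              chain = defaultdict(list)
              for i in range(len(words) - order):
                  chain[tuple(words[i:i + order])].append(words[i + order])
              return chain

          def generate(chain, order=2, length=100):
              # Random-walk the chain to emit plausible-looking nonsense.
              out = list(random.choice(list(chain)))
              for _ in range(length):
                  nxt = chain.get(tuple(out[-order:]))
                  if not nxt:
                      break
                  out.append(random.choice(nxt))
              return " ".join(out)

      Feed build_chain your real pages and serve generate's output to the bots; the statistics look like language, the meaning is gone.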

      • shakna a day ago

        I may have a system in place that starts the pipeline for fetching a very, very large file (a 16 TB text file designed for testing). I don't host it myself, except for the first shard.

        A surprising number of agents try to download the whole thing.
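
        Roughly the shape of it, if you want one (a guess at the setup; the paths, shard name, and upstream host below are all made up):

            from http.server import BaseHTTPRequestHandler, HTTPServer

            SHARD_ONE = "/huge/shard-000"      # the only piece hosted locally (hypothetical path)
            UPSTREAM = "https://example.org"   # stand-in for wherever the rest of the file lives

            class TrapHandler(BaseHTTPRequestHandler):
                def do_GET(self):
                    if self.path == SHARD_ONE:
                        # Serve the single local shard normally.
                        self.send_response(200)
                        self.send_header("Content-Type", "application/octet-stream")
                        self.end_headers()
                        with open("shard-000", "rb") as f:  # hypothetical local file
                            self.wfile.write(f.read())
                    elif self.path.startswith("/huge/"):
                        # Every later shard redirects to the giant file hosted elsewhere.
                        self.send_response(302)
                        self.send_header("Location", UPSTREAM + self.path)
                        self.end_headers()
                    else:
                        self.send_response(404)
                        self.end_headers()

            HTTPServer(("", 8080), TrapHandler).serve_forever()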

        • kazinator a day ago

          Right, and that's why honeypots work against many targets. Why serve them an actual file when a CGI script or whatever can just generate output in a loop?
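
          For example, a minimal CGI sketch that streams random text forever, with no file on disk at all:

              #!/usr/bin/env python3
              # CGI sketch: stream generated garbage instead of a real file.
              import random, string, sys

              sys.stdout.write("Content-Type: text/plain\r\n\r\n")
              while True:
                  line = "".join(random.choices(string.ascii_lowercase + " ", k=80))
                  sys.stdout.write(line + "\n")
                  sys.stdout.flush()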

          • andrewmcwatters a day ago

            Someone has to front the bandwidth.

            • kazinator a day ago

              Ah, speaking of that, of course you don't generate the fake data as fast as you can. You just trickle it out often enough that they don't time out.
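
              Concretely, the same loop throttled (sketch; the 10-second pause is an arbitrary value below typical client timeouts):

                  import random, string, sys, time

                  sys.stdout.write("Content-Type: text/plain\r\n\r\n")
                  sys.stdout.flush()
                  while True:
                      # A short line every few seconds: negligible bandwidth,
                      # but frequent enough that the client never gives up.
                      line = "".join(random.choices(string.ascii_lowercase + " ", k=40))
                      sys.stdout.write(line + "\n")
                      sys.stdout.flush()
                      time.sleep(10)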

              • BitPirate a day ago

                That's why you should run a tarpit instead.
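
                Along the lines of endlessh: a minimal TCP tarpit sketch in Python (port and delay are arbitrary):

                    import socket, threading, time

                    def tarpit(conn):
                        # Hold the connection open, dripping one byte at a time.
                        try:
                            while True:
                                conn.sendall(b"x")
                                time.sleep(10)
                        except OSError:
                            conn.close()

                    srv = socket.socket()
                    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
                    srv.bind(("", 2222))
                    srv.listen()
                    while True:
                        conn, _ = srv.accept()
                        threading.Thread(target=tarpit, args=(conn,), daemon=True).start()

                Each trapped bot costs you one idle socket and a byte every ten seconds, while it burns a whole connection slot on their side.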