• mrweasel a month ago

    Sadly it's hard to tell whether this is an actual DDoS attack or scrapers descending on the site. It all looks very similar.

    The search engines always seemed happy to announce that they are in fact GoogleBot/BingBot/Yahoo/whatever and frequently provided you with their expected IP ranges. The modern companies, mostly AI companies, seem to be more interested in flying under the radar and have less respect for the internet infrastructure as a whole. So we're now at a point where I can't tell if it's an ill-willed DDoS attack or just shitty AI startup number 7 reloading training data.

    • jeroenhd a month ago

      > The modern companies, mostly AI companies, seem to be more interested in flying under the radar and have less respect for the internet infrastructure as a whole

      I think that makes a lot of sense. Google's goal is (or perhaps used to be) providing a network of links. The more they scrape you, the more visitors you may end up receiving, and the better your website performs (monetarily, or just in terms of providing information to the world).

      With AI companies, the goal is to consume and replace. In their best case scenario, your website will never receive a visitor again. You won't get anything in return for providing content to AI companies. That means there's no reason for website administrators to permit the good ones, especially for people who use subscriptions or ads to support their website operating costs.

      • piokoch a month ago

        Yes, search engines were not hiding, since the website owners' interests were involved as well: without those search bots their sites would not be indexed and searchable on the Internet. So it was a kind of win-win situation, in most typical cases at least; publishers did complain about deep links and the like because their ad revenue was hurt.

        AI scraping bots provide zero value to site owners.

        • philipwhiuk a month ago

          It's a DDoS either way, even if it's not an attack.

        • CaptainFever a month ago

          > To me, Anubis is not only a blocker for AI scrapers. Anubis is a DDoS protection.

          Anubis is DDoS protection, just with updated marketing. These tools have existed forever, such as CloudFlare Challenges, or https://github.com/RuiSiang/PoW-Shield. Or HashCash.

          I keep saying that Anubis really has nothing much to do with AI (e.g. some people might mistakenly think that it magically "blocks AI scrapers"; it only slows down abusive-rate visitors). It really only deals with DoS and DDoS.

          I don't understand why people are using Anubis instead of all the other tools that already exist. Is it just marketing? Saying the right thing at the right time?
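
          For anyone who hasn't looked under the hood, the shared idea behind Hashcash, PoW-Shield and Anubis is a hashcash-style puzzle. A rough Go sketch (illustrative only, not Anubis's actual code; the nonce and difficulty values are made up): the server hands out a random challenge, the client burns CPU finding a counter whose SHA-256 hash has enough leading zero bits, and the server verifies the answer with a single hash.

            package main

            import (
                "crypto/sha256"
                "fmt"
                "math/bits"
                "strconv"
            )

            // zeroBits counts the leading zero bits of a SHA-256 digest.
            func zeroBits(sum [32]byte) int {
                n := 0
                for _, b := range sum {
                    n += bits.LeadingZeros8(b)
                    if b != 0 {
                        break
                    }
                }
                return n
            }

            // solve is the client's side: brute-force a counter until the
            // hash of challenge+counter clears the difficulty bar.
            func solve(challenge string, difficulty int) int {
                for ctr := 0; ; ctr++ {
                    sum := sha256.Sum256([]byte(challenge + strconv.Itoa(ctr)))
                    if zeroBits(sum) >= difficulty {
                        return ctr
                    }
                }
            }

            // verify is the server's side: a single hash, essentially free.
            func verify(challenge string, ctr, difficulty int) bool {
                sum := sha256.Sum256([]byte(challenge + strconv.Itoa(ctr)))
                return zeroBits(sum) >= difficulty
            }

            func main() {
                ctr := solve("random-server-nonce", 20) // ~1M hashes on average
                fmt.Println(ctr, verify("random-server-nonce", ctr, 20))
            }

          The asymmetry is the whole trick: verification stays a single hash for the server, while the client's average cost doubles with every extra difficulty bit. That's negligible for one human visitor and expensive for anything requesting thousands of pages per minute.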

          • Imustaskforhelp a month ago

            I agree with you that it is in fact DDoS protection, but still, given that it is open source and created by a really cool dev (she is awesome), I don't really mind it gaining popularity. They also created it out of their own necessity, which is really nice.

            Anubis is getting real love out there, and I am all for it. I personally host a lot of my stuff on Cloudflare because Cloudflare Workers is free, but if I ever run a VPS, I am probably going to use Anubis as well.

            • superkuh a month ago

              All the other tools don't actually work. What I mean is that they block far, far more than they intend to. Anubis actually works on every weird and niche browser I've tried. Which is to say, it lets actual human people through even if they aren't using Chrome.

              Cloudflare doesn't do that. Cloudflare's false-positive rate is extremely high, as are the other tools', mostly because they all depend on bleeding-edge JS and browser features (CORS, etc.) for their fingerprinting.

              Cloudflare is for for-profit sites and other situations where you don't care if you block poor people, because they can't give you money anyway. Anubis is for when you want everyone to be able to access your website.

              • JodieBenitez a month ago

                > I don't understand why people are using Anubis instead of all the other tools that already exist. Is it just marketing? Saying the right thing at the right time?

                Care to share existing solutions that can be self-hosted? (Genuine question; I like how Anubis works, I just want something with a more neutral look and feel.)

                • consp a month ago

                  Knowing something exists is half the challenge. Never used it, but maybe it's the ease of use/setup or the license?

                  • GoblinSlayer a month ago

                    The readme explains that it's for the case where you don't use Cloudflare. It's also open source, analogous to PoW Shield, but with lighter dependencies.

                    • areyourllySorry a month ago

                      pow shield does not offer a furry loading screen so it can't be as good

                      • cedws a month ago

                        Fun fact: that PoW-Shield repo is authored by a guy jailed for running a massive darknet market (Incognito.)

                        • immibis a month ago

                          marketing plus a product that Just Does The Thing, it seems like. No bullshit.

                          btw it only works on AI scrapers because they're DDoSes.

                        • chrisnight 2 months ago

                          > Solving the challenge–which is valid for one week once passed–

                          One thing that I've noticed recently with the Arch Wiki adding Anubis is that this one-week period doesn't magically fix user annoyances with Anubis. I use Temporary Containers for every tab, which means I constantly get Anubis regenerating tokens, since the cookie gets deleted as soon as the tab is closed.

                          Perhaps this is my own problem, but given the state of tracking on the internet, I do not feel it is an extremely out-of-the-ordinary circumstance to avoid saving cookies.

                          • philipwhiuk a month ago

                            I think it's absolutely your problem. You're ignoring all the cache lifetimes on assets.

                            • TiredOfLife a month ago

                              It's not a problem. You have configured your system to show up as a new visitor every time you visit a website. And you are getting expected behaviour.

                              • jsheard 2 months ago

                                It could be worse; the main alternative is something like Cloudflare's death-by-a-thousand-CAPTCHAs when your browser settings or IP address put you on the wrong side of its bot-detection heuristics. Anubis at least doesn't require any interaction to pass.

                                Unfortunately nobody has a good answer for how to deal with abusive users without catching well-behaved but deliberately anonymous users in the crossfire, so it's just about finding the least bad solution for them.

                                • bscphil 2 months ago

                                  It's even worse if you block cookies outright. Every time I hit a new Anubis site I scream in my head because it just spins endlessly and stupidly until you enable cookies, without even a warning. Absolutely terrible user experience; I wouldn't put any version of this in front of a corporate / professional site.

                                  • imcritic a month ago

                                    For me the biggest issue with the Arch Wiki adding Anubis is that it doesn't let me in when I open it on mobile. I am using Cromite: it doesn't support extensions, but it has some ABP-style blocking integrated.

                                    • ashkulz a month ago

                                      I too use Temporary Containers, and my solution is to use a named container and associate that site with the container.

                                      • selfhoster11 a month ago

                                        I am low-key shocked that this has become a thing on Arch Wiki, of all places. And that's just to access the main page, not even for any searches. Arch Wiki is the place where you often go when your system is completely broken, sometimes to the extent that some clever proof of work system that relies on JS and whatever will fail. I'm sure they didn't decide this lightly, but come on.

                                        • jillyboel 2 months ago

                                          > One thing that I've noticed recently with the Arch Wiki adding Anubis

                                          Is that why it now shows that annoying slow to load prompt before giving me the content I searched for?

                                        • butz a month ago

                                           As usual, there is a negative side to such protection: I was trying to download some raw files from a git repository and instead of the data I got a bunch of HTML. After a quick look it turned out to be the Anubis HTML page. Another issue was broken links to issue tickets on the main page, where Anubis was asking a wrapper script to solve some hashes. Lesson here: after deploying Anubis, please carefully check the impact. There might be some unexpected issues.

                                          • eadmund a month ago

                                             > I was trying to download some raw files from a git repository and instead of the data I got a bunch of HTML. After a quick look it turned out to be the Anubis HTML page.

                                            Yup. Anubis breaks the web. And it requires JavaScript, which also breaks the web. It’s a disaster.

                                          • vachina a month ago

                                             It's not Anubis that saved your website; literally any sort of CAPTCHA, or some dumb modal with a button to click through to the real content, would've worked.

                                             These crawlers are designed to work on 99% of hosts; if you tweak your site ever so slightly out of spec, the bots won't know what to do.

                                            • boreq a month ago

                                               So what you are saying is that it's Anubis that saved their website.

                                              • forty a month ago

                                               Anubis is nice, but could we have a PoW system integrated into the protocol (HTTP or TLS, I'm not sure) so we don't have to require JS?

                                                • fc417fc802 a month ago

                                                 Protocol is the wrong level. Integrate with the browser. Add a PoW challenge header to the HTTP response, receive a PoW solution header with the next request.
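
                                                 A rough sketch of the server side of that handshake as Go middleware (the X-Pow-Challenge / X-Pow-Solution header names are made up for illustration, not an existing standard; a real version would issue per-client nonces and signed, expiring tokens rather than the constants used here):

                                                   package main

                                                   import (
                                                     "crypto/sha256"
                                                     "math/bits"
                                                     "net/http"
                                                     "strconv"
                                                   )

                                                   // Constants keep the sketch short; a real deployment would use
                                                   // per-client random challenges and tune the difficulty.
                                                   const challenge = "random-server-nonce"
                                                   const difficulty = 18

                                                   func zeroBits(sum [32]byte) int {
                                                     n := 0
                                                     for _, b := range sum {
                                                       n += bits.LeadingZeros8(b)
                                                       if b != 0 {
                                                         break
                                                       }
                                                     }
                                                     return n
                                                   }

                                                   func powMiddleware(next http.Handler) http.Handler {
                                                     return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
                                                       sol := r.Header.Get("X-Pow-Solution") // hypothetical header
                                                       sum := sha256.Sum256([]byte(challenge + sol))
                                                       if sol == "" || zeroBits(sum) < difficulty {
                                                         // No valid solution yet: describe the puzzle in response
                                                         // headers and let the client retry. No JavaScript needed.
                                                         w.Header().Set("X-Pow-Challenge", challenge)
                                                         w.Header().Set("X-Pow-Difficulty", strconv.Itoa(difficulty))
                                                         w.WriteHeader(http.StatusTooManyRequests)
                                                         return
                                                       }
                                                       next.ServeHTTP(w, r)
                                                     })
                                                   }

                                                   func main() {
                                                     site := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
                                                       w.Write([]byte("ok\n"))
                                                     })
                                                     http.ListenAndServe(":8080", powMiddleware(site))
                                                   }

                                                 A client that understands the headers recomputes and retries; everything Anubis currently does via JS would move into the browser's networking stack.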

                                                • tpool 2 months ago

                                                  It's so bad we're going to the old gods for help now. :)

                                                  • Hamuko a month ago

                                                    I’d sic Yogg-Saron on these scrapers if I could.

                                                  • ranger_danger 2 months ago

                                                    Seems like rate-limiting expensive pages would be much easier and less invasive. Also caching...

                                                    And I would argue Anubis does nothing to stop real DDoS attacks that just indiscriminately blast sites with tens of gbps of traffic at once from many different IPs.
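
                                                     For a single box, a naive per-IP token bucket is only a few lines, e.g. with golang.org/x/time/rate (a sketch, not a recommendation; as other replies note, it falls apart once the traffic comes from tens of thousands of residential IPs that each stay under the limit):

                                                       package main

                                                       import (
                                                         "net"
                                                         "net/http"
                                                         "sync"

                                                         "golang.org/x/time/rate"
                                                       )

                                                       // Naive per-IP token buckets: fine against one rude client,
                                                       // useless when every request arrives from a different address.
                                                       var (
                                                         mu      sync.Mutex
                                                         buckets = map[string]*rate.Limiter{}
                                                       )

                                                       func limiterFor(ip string) *rate.Limiter {
                                                         mu.Lock()
                                                         defer mu.Unlock()
                                                         l, ok := buckets[ip]
                                                         if !ok {
                                                           l = rate.NewLimiter(rate.Limit(2), 10) // 2 req/s, burst of 10
                                                           buckets[ip] = l
                                                         }
                                                         return l
                                                       }

                                                       func limit(next http.Handler) http.Handler {
                                                         return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
                                                           ip, _, _ := net.SplitHostPort(r.RemoteAddr)
                                                           if !limiterFor(ip).Allow() {
                                                             http.Error(w, "slow down", http.StatusTooManyRequests)
                                                             return
                                                           }
                                                           next.ServeHTTP(w, r)
                                                         })
                                                       }

                                                       func main() {
                                                         expensive := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
                                                           w.Write([]byte("rendered page\n")) // stand-in for an expensive dynamic page
                                                         })
                                                         http.ListenAndServe(":8080", limit(expensive))
                                                       }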

                                                    • PaulDavisThe1st 2 months ago

                                                      In the last two months, ardour.org's instance of fail2ban has blocked more than 1.2M distinct IP addresses that were trawling our git repo using http instead of just fetching the goddam repository.

                                                      We shut down the website/http frontend to our git repo. There are still 20k distinct IP addresses per day hitting up a site that issues NOTHING but 404 errors.

                                                      • felsqualle a month ago

                                                        Hi, author here.

                                                        Caching is already enabled, but this doesn’t work for the highly dynamic parts of the site like version history and looking for recent changes.

                                                        And yes, it doesn’t work for volumetric attacks with tens of gbps. At this point I don’t think it is a targeted attack, probably a crawler gone really wild. But for this pattern, it simply works.

                                                        • Ocha 2 months ago

                                                      Rate limit according to what? It was 35k residential IPs. Rate limiting would end up keeping real users out.

                                                          • bastawhiz 2 months ago

                                                            Rate limiting does nothing when your adversary has hundreds or even thousands of IPs. It's trivial to pay for residential proxies.

                                                            • toast0 a month ago

                                                              > And I would argue Anubis does nothing to stop real DDoS attacks that just indiscriminately blast sites with tens of gbps of traffic at once from many different IPs.

                                                              Volumetric DDoS and application layer DDoS are both real, but volumetric DDoS doesn't have an opportunity for cute pictures. You really just need a big enough inbound connection and then typically drop inbound UDP and/or IP fragments and turn off http/3. If you're lucky, you can convince your upstream to filter out UDP for you, which gives you more effective bandwidth.

                                                              • lousken 2 months ago

                                                                Yes, have everything static (if you can't, use caching), optimize images, rate limit anything you have to generate dynamically

                                                              • anonfordays a month ago

                                                                This (Anubis) "RiiR" of haproxy-protection is easily bypassed: https://addons.mozilla.org/en-US/firefox/addon/anubis-bypass...

                                                                • Tiberium 2 months ago

                                                            From looking at some of the rules like https://github.com/TecharoHQ/anubis/blob/main/data/bots/head... it seems that Anubis explicitly punishes bots that are "honest" about their user agent. I might be missing something, but isn't this just pressuring anyone who does anything bot-related to lie about their user agent?

                                                            A flat-out user-agent blacklist seems really weird; it's going to reward the companies that are more unethical in their scraping practices over the ones that report their user agent truthfully. From the repo it also seems like all the AI crawlers are set to DENY, which, again, would reward AI companies that don't disclose their identity in the user agent.

                                                                  • userbinator 2 months ago

                                                                    User-agent header is basically useless at this point. It's trivial to set it to whatever you want, and all it does is help the browser incumbents.

                                                                    • EugeneOZ a month ago

                                                                      The point is to reduce the server load produced by bots.

                                                              Honest AI scrapers use the information to train on, which increases their value, while the owner of the scraped server has to pay for it and gets nothing back; there's nothing honest about it. Search engines give you visitors; AI spiders only take your money.

                                                                      • jeroenhd a month ago

                                                                        From what I can tell from the author's Mastodon, it seems like they're working on a fingerprinting solution to catch these fake bots in an upcoming version based on some passively observed behaviour.

                                                                And, of course, the link just shows the default behaviour. Website admins can adjust the rules to their needs.

                                                                        I'm sure there will be workarounds (like that version of curl that has its HTTP stack replaced by Chrome's) but things are ever moving forward.

                                                                        • wzdd a month ago

                                                                The point of Anubis is to make scraping unprofitable by forcing bots to solve a SHA-256-based proof-of-work challenge, so another point of view is that the explicit denylist is actually saving those bot authors time and/or money.

                                                                        • rubyn00bie a month ago

                                                                Sort of tangential, but I'm surprised folks are still using Apache all these years later. Is there a certain language that makes it better than Nginx? Or is it just the ease of configuration that still pulls people in? I switched to Nginx I don't even know how many years ago and never looked back; just more or less wondering if I should.

                                                                          • mrweasel a month ago

                                                                  Apache does everything, and it's fairly easy to configure. If there's something you want to do, Apache mostly knows how, or has a module for it.

                                                                            If you run a fleet of servers, all doing different things, Apache is a good choice because all the various uses are going to be supported. It might not be the best choice in each individual case, but it is the one that works in all of them.

                                                                  I don't know why some are so quick to write off Apache. Is it just because it's old? It's still something like the second most used web server in the world.

                                                                            • anotherevan a month ago

                                                                  Equally tangential, but I switched from Nginx to Caddy a few years ago and never looked back.

                                                                              • ahofmann a month ago

                                                                  I've been using Nginx for what feels like decades, and occasionally I miss the ability to use .htaccess files. They are a very nice way to configure stuff on a server.

                                                                              • justusthane 2 months ago

                                                                                I don’t really understand why this solved this particular problem. The post says:

                                                                                > As an attacker with stupid bots, you’ll never get through. As an attacker with clever bots, you’ll end up exhausting your own resources.

                                                                                But the attack was clearly from a botnet, so the attacker isn’t paying for the resources consumed. Why don’t the zombie machines just spend the extra couple seconds to solve the PoW (at which point, they would apparently be exempt for a week and would be able to continue the attack)? Is it just that these particular bots were too dumb?

                                                                                • KronisLV a month ago

                                                                                  > We use a stack consisting of Apache2, PHP-FPM, and MariaDB to host the web applications.

                                                                                  Oh hey, that’s a pretty utilitarian stack and I’m happy to see MariaDB be used out there.

                                                                  Anubis is also really cool. I do imagine that proof of work might become more prevalent in the future to deal with the sheer number of bots and bad actors (a shame that they exist) out there, although in the case of hijacked devices it might just slow them down, hopefully to a manageable degree, rather than banning their IPs outright.

                                                                  I do wonder if we'll ever see HTTP-only versions of PoW too, not just JS-based options, though that might need to be a web standard or something.

                                                                                  • qiu3344 a month ago

                                                                                    As someone who has a lot of experience with (not AI related) web scraping, fingerprinting and WAFs, I really like what Anubis is doing.

                                                                    Amazon, Akamai, Kasada and other big players in the WAF/anti-bot industry will charge you millions for the illusion of protection and half-baked JavaScript fingerprint collectors.

                                                                    They usually calculate how "legit" your request is based on ambiguous factors, like the vendor name of your GPU (good luck buying flight tickets in a VM) or how anti-aliasing is implemented for your fonts/canvas. Total bullshit. Most web scrapers know how to bypass it. Especially the malicious ones.

                                                                    But the biggest reason why I'm against these kinds of systems is how they support the browser monoculture. Your UA is from Servo or Ladybird? You're out of luck. That's why the idea of choosing a purely browser-agnostic way of "weighing the soul" of a request resonates highly with me. Keep up the good work!

                                                                                    • 8474_s a month ago

                                                                    I've been seeing that anime girl pop up on some websites, mainly because I use "rare" browsers. I prefer it over CAPTCHAs and Cloudflare "protecting websites from real traffic"; whatever they're doing takes just a few seconds and doesn't require solving CAPTCHAs or something equally obnoxious like Microsoft's puzzles.

                                                                                      • pmlnr a month ago

                                                                      Does anyone know a solution that works without JS?

                                                                                        • kh_hk a month ago

                                                                      Client-side proof of work might be enough for now, but it won't last: solve the challenge once, reuse the cookie.

                                                                      JA4 fingerprinting is a newish and interesting approach, not for blocking but as an extra metric to validate trust in requests.

                                                                                          • anonfordays a month ago

                                                                                            Looks similar to haproxy-protection: https://gitgud.io/fatchan/haproxy-protection/

                                                                                            • parrit a month ago

                                                                        If I see a cute cartoon with a cryptocurrency-mining-style "KHash/s" counter, I am gonna leave that site real quick!

                                                                        It should explain that it isn't mining, just verifying the browser or some such.

                                                                                              • lytedev a month ago

                                                                                                It includes links with explanations, but the page does kind of "fly by" in many cases. At which point, would you still leave?

                                                                                                I'm guessing folks have seen enough captcha and CloudFlare verification pages to get a sense that they're being "soul" checked and that it's not an issue usability-wise.

                                                                                              • herpdyderp 2 months ago

                                                                                                Can Anubis be restyled to be more... professional? I like the playfulness, but I know at least some of my clients will not.

                                                                                                • samhclark 2 months ago

                                                                                                  You can, but they ask that you contact them to set up a contract. It's addressed here on the site:

                                                                                                  >Anubis is provided to the public for free in order to help advance the common good. In return, we ask (but not demand, these are words on the internet, not word of law) that you not remove the Anubis character from your deployment.

                                                                                                  >If you want to run an unbranded or white-label version of Anubis, please contact Xe to arrange a contract.

                                                                                                  https://anubis.techaro.lol/docs/funding

                                                                                                • gitroom a month ago

                                                                                                  Kinda love how deep this gets into the whole social contract side of open source. Honestly, it's been a pain figuring out what feels right when folks mix legal rules and personal asks.