• billyhoffman 2 hours ago

    Common Crawl is shown in their screenshot of "Providers" alongside OpenAI and Anthropic. The challenge is that Common Crawl is used for a lot of things that are not AI training. For example, it's a major source of content for the Wayback Machine.

    In fact, that's the entire point of the Common Crawl project. Instead of dozens of companies writing and running their own (poorly designed) crawlers and hitting everyone's site, Common Crawl runs once and exposes the data in industry standard formats like WARC for other consumers. Their crawler is quite well behaved (exponential backoff, obeys Crawl-delay, uses sitemap.xml to know when to revisit, follows robots.txt, etc.).
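
    For illustration, steering their crawler takes just a couple of lines of robots.txt (CCBot is Common Crawl's documented user-agent token; the delay and path here are made up):

      User-agent: CCBot
      Crawl-delay: 10
      Disallow: /drafts/
      Sitemap: https://example.com/sitemap.xml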

    There are significant knock-on effects if Cloudflare starts (literally) gatekeeping content. This feels like a step down the path to a world where the majority of websites use sophisticated security products that gatekeep access for those who pay versus those who don't, and that applies whether they are bots or people.

    • Aachen 31 minutes ago

      > gatekeep access for those who pay versus those who don't, and that applies whether they are bots or people.

      I'm already constantly being classified as a bot. Just today:

      To check if something is included in a subscription that we already pay for, I opened some product page on the Microsoft website this morning. Full-page error: "We are currently experiencing high demand. Please try again later." It's static content but it's not available to me. Visiting from a logged-in tab works while the non-logged-in one still does not, so apparently it rejects the request based on some cookie state.

      Just now I was trying to book a hotel room for a conference in Grenoble. Looking in the browser dev tools, it seems that Visa is trying to run some bot detection (the payment provider redirects to their site for the verification code, but Visa automatically redirects me back with an error status) and rejects the payment. There are no other payment methods. Using Google Chrome works, but Firefox with uBlock Origin (a very niche setup, I'll admit) locks you out of this part of the internet.

      Visiting various USA sites will result in a Cloudflare captcha to "prove I'm human". For the time being, it's less of a time waste to go back and click a different search result, but this never used to happen and now it's a daily occurrence...

      • esperent 17 minutes ago

        > We are currently experiencing high demand. Please try again later.

        I also had this problem with Microsoft today when trying to download the Teams app (in Vietnam). We use MS Teams at work and onboard one or two people a week. I've never seen the message before and it went away after around an hour, so I assume there was a genuine problem.

        • theyeenzbeanz 16 minutes ago

          Lately I’ve been noticing captchas have been getting increasingly difficult day by day on Firefox. Checking the box used to go through without issue, but now it's started popping up challenges with the boxes that fade after clicking. Just like your experience, Chrome has no hiccups on the same machine.

          • Aachen 12 minutes ago

            Those "keep clicking until we stop fading in more results" challenges mean they're fairly confident you're a bot and this is the highest difficulty level to prove your lack of guilt. I get these only when using a browser that isn't already full of advertising cookies

            • influx 4 minutes ago

              I wonder how many of those captchas are controlled by competitors of Firefox?

          • paxys 2 hours ago

            > Common Crawl runs once and exposes the data in industry standard formats like WARC for other consumers

            And what stops companies from using this data for model training? Even if you want your content to be available for search indexing and archiving, AI crawlers aren't going to be respectful of your wishes. Hence the need for restrictive gatekeeping.

            • lolinder an hour ago

              Either AI training is fair use or it isn't. If it's fair use then businesses shouldn't get a say in whether the data can be used for it. If it isn't, then the answer to your question is copyright law.

              Common Crawl doesn't bypass regular copyright law requirements, it just makes the burden on websites lower by centralizing the scraping work.

              • Aachen 18 minutes ago

                Copyright is only part of the equation, there's also the use of other people's resources

                If what a government receptionist says is copyright-free, you still can't walk into their office thousands of times per day and ask various questions to learn what human answers are like in order to train your artificial neural network

                The amount of scraping that happened in ~2020 as compared to this year is orders of magnitude different. Not all of them have a user agent (looking at "alibaba cloud intelligence" unintelligently doing a billion requests from 1 IP address) or respect the robots file (looking at huawei singapore, who also pretend to be a normal browser and slurp craptons of pages through my proxy site that was meant to alleviate load from the slow upstream server)

                • 6gvONxR4sf7o an hour ago

                  It's not a legal question but a behavior and sustainability question. If it is fair use, but is undesirable for content makers, then they're still not under any obligation to allow scraping. So they'll try stuff like this, and other more restrictive bot blockers.

                  Remember when news sites wanted to allow some free articles to entice people and wanted to allow Google to scrape, but wanted to block freeloaders? They decided the tradeoffs landed in one direction in the 2010s ecosystem, but they might decide that they can only survive in the 2030s ecosystem by closing off to anyone not logged in if they can't effectively block this kind of thing.

                  • MrDarcy an hour ago

                    There is no objective, black-and-white "is or is not" in this situation.

                    There is litigation of multiple cases and a judge making a judgement on each one.

                    Until then, and even after then, publishers can set the terms and enforce those terms using technical means like this.

                  • toomuchtodo 19 minutes ago

                    The end result is browser extensions, like Recap the Law [1] for PACER, that stream data back from participating user browsers to a target for batch processing and eventual reconciliation.

                    Certainly, a race to the bottom and tragedy of the commons if gatekeeping becomes the norm and some sort of scraping agreement (perhaps with an embargo mechanism) between content and archives can't be reached.

                    [1] https://free.law/recap/faq

                    • billyhoffman an hour ago

                      Licensing. Common Crawl could change the license governing how the data it produces is used.

                      Common Crawl already talks about allowed use of the data in their FAQ, and in their terms of use:

                      https://commoncrawl.org/terms-of-use/

                      https://commoncrawl.org/faq

                      While these don't currently discuss AI, they could. This would allow non-AI downstream consumers to not be penalized.

                      • paxys an hour ago

                        Licensing doesn't mean shit when no court in the country is actually willing to prosecute violations. Who have OpenAI, Anthropic, Microsoft, Google, Meta licensed all their training data from?

                      • ToucanLoucan 23 minutes ago

                        I mean, this is exactly what people like myself were predicting when these AI companies first started spooling up their operations. Abuse of the public square means that public goods are then restricted. It's perfectly rational for websites of any sort who have strong opinions on AI to forbid the use of common crawl, specifically because it is being abused by AI companies to train the AI's they are opposed to.

                        It's the same as when we had masses of those stupid e-scooters being thrown into rivers, because Silicon Valley treats public space as "their space" to pollute with whatever garbage they see fit, because there isn't explicitly a law on the books saying you can't do it. Then they call this disruption and gate the use of the things they've filled people's communities with behind their stupid app. People see this, and react. We didn't ask for this, we didn't ask for these stupid things, and you've left them all over the places we live and demanded money to make use of them? Go to hell. Go get your stupid scooter out of the river.

                    • sharpshadow a minute ago

                      It is indeed a huge waste to scrape the same whole site for changes and new content. If Cloudflare is capable of maintaining an overview of changes and updates, it could save a lot of resources.

                      The site could tell Cloudflare directly what changed and Cloudflare could tell the AI. The AI company buys the changes, Cloudflare pays the site, and Cloudflare keeps a margin.

                      • sdflhasjd 2 minutes ago

                        How long does the world-wide-web have left? It's always felt like it would be around forever, but now I get the feeling that at some point it will fade into obscurity like IRC has done.

                        • flaburgan 32 minutes ago

                          I was recently speaking with people from OpenFoodFacts and OpenStreetMap, and I guess Wikipedia has the same issue. They are under constant DDoS by bots which are scraping everything, even though the full dataset can be downloaded for free with a single HTTP request. They said this useless traffic was a huge cost for them. This is not about copyright, just about bots being stupid and the people behind them not caring at all. We for sure need a solution to this. To keep a system online nowadays means not only do they get your data, you also pay for the privilege!

                          • epc 3 minutes ago

                            I’ve just taken to blocking entire swaths of cloud services' IP networks. I don't care what the intentions are; my personal sites don't have infinite bandwidth to put up with thousands of poorly written spiders.
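
                            For the curious, it's as blunt as a few deny rules in the server config (nginx-style sketch; these CIDRs are documentation placeholders, the real lists come from each provider's published ranges):

                              # refuse entire cloud-provider networks outright
                              deny 203.0.113.0/24;    # placeholder: provider range #1
                              deny 198.51.100.0/24;   # placeholder: provider range #2
                              allow all;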

                          • creatonez 2 hours ago

                              This seems like a gimmick. Isn't preventing crawling a Sisyphean task? The only real difference this will make is further entrenching big players who have already crawled a ton of data. And if this feature comes at the cost of false positives and overbearing captchas, it will start to affect users.

                            • hipadev23 2 hours ago

                              Companies have been trying and failing to prevent large scale crawling for 25 years. It’s a constant arms race and the scrapers always win.

                              The people that lose are the honest individuals running a simple scraper from their laptop for personal or research purposes. Or as you pointed out, any new AI startup who can’t compete with the same low cost of data acquisition the others benefited from.

                              • andyp-kw an hour ago

                                The risk of getting sued prevents companies from using pirated software.

                                The big players might just pay the fee because they might one day need to prove where they got the data from.

                                • spiderfarmer an hour ago

                                  My website contains millions of pages. It's not hard to notice the difference between a bot (or network) that wants to access all pages and a regular user.

                                  • l5870uoo9y an hour ago

                                    How often are the bots indexing it?

                                    • immibis 30 minutes ago

                                      If you listen to the people complaining about bots at the moment, some bots are scraping the same pages over and over to the tune of terabytes per day because the bot operators have unlimited money and their targets don't.

                                      • Aachen 8 minutes ago

                                        > because the bot operators have unlimited money

                                        I rather think the cause is that inbound bandwidth is usually free, so they need maybe 1/100th of the money because requests are smaller than responses (plus discounts they get for being big customers)

                                        • meiraleal 17 minutes ago

                                          > because the bot operators have unlimited money and their targets don't.

                                          wget/curl vs django/rails, who wins?

                                    • spacebanana7 an hour ago

                                      > The only real difference this will make is further entrenching big players

                                      It's the opposite. Only big players like Google get meetings with big publishers and copyright holders to be individually whitelisted in robots.txt. Whereas a marketplace is accessible to any startup or university.

                                    • neilv 2 hours ago

                                      Cloudflare found a new variation on their traditional service of protecting from abusers.

                                      This time, Cloudflare has formed a "marketplace" for the abuse from which they're protecting you, partnering with the abusers.

                                      And requiring you to use Cloudflare's service, or the abusers will just keep abusing you, without even a token payment.

                                      I'd need to ask the lawyer how close this is to technically being a protection racket, or other no-no.

                                      • troyvit an hour ago

                                        As an actual content provider I see this as an opportunity. We pay our journalists real money to write real stories. If AI results haven't started affecting our search traffic they will start to soon. Up until now we've had two choices: block AI-based crawlers and fall completely out of that market, or continue to let AI companies train off of our hard-won content and take it as a loss that still generates a little bit of traffic. Cloudflare now offers a third option if we can figure out how to use it.

                                        Dissing on Cloudflare is the new thing, and I get it. They're big and powerful and they influence a massive amount of the traffic on the web. Like the saying goes though, don't blame the player, blame the game. Ask yourself if you'd rather have Alphabet, Microsoft, Amazon or Apple in their place, because probably one of them would be.

                                        • neilv an hour ago

                                          Not dissing any company; just pointing out a real concern to be considered, in this freshly disrupted and rapidly evolving environment.

                                          We all know that someone is going to try to slip one past the regulators, and they're probably on HN, and we know from the past that this can pay off hugely for them.

                                          Maybe, this time, the HN people who grumble about past exploiters and abusers in retrospect, can be more proactive, and help inform lawmakers and regulators in time.

                                          And for those of us who don't want to be activists, but also don't want to be abusers -- just run honest businesses -- we're reminded to think twice about what we do and how we do it, when we're operating in what seems like novel space.

                                        • jsheard 2 hours ago

                                          > I'd need to ask the lawyer how close this is to technically being a protection racket, or other no-no.

                                          Wait 'til you find out how many of the DDoS-for-hire services that Cloudflare offers to protect you from are themselves protected by Cloudflare.

                                          • ziddoap an hour ago

                                            I hear this pretty often. I'm curious: what do you think Cloudflare should do?

                                            I am pretty sure that if they started arbitrarily banning customers/potential customers based on what some other people like or don't like, everyone would be up in arms yelling stuff about censorship or wokeness or whatever the word of the year is.

                                            As an example, what if I'm not a DDoS-for-hire, but just a website that sells some software capable of launching DDoS attacks? Should I be able to buy Cloudflare protection? Should a site like Metasploit be allowed to purchase protection?

                                            • jsheard 41 minutes ago

                                              > As an example, what if I'm not a DDoS-for-hire, but just a website that sells some software capable of launching DDoS attacks? Should I be able to buy Cloudfare protection? Should a site like Metasploit be allowed to purchase protection?

                                              Would you say this nuance is a major issue on the other big cloud providers? Your own grey-area example of Metasploit is hosted on AWS without any objections. Yet the other cloud providers make a decent effort to turn away open DDoS peddlers; whenever I survey the highest-ranked DDoS services, it's usually around 95% Cloudflare and 5% DDoS-Guard.

                                              • ziddoap 25 minutes ago

                                                I'm asking you what you think Cloudflare should do. I'm not sure why you spun it around on me.

                                                • jsheard 21 minutes ago

                                                  I think Cloudflare should make the bare minimum effort to kick services which are explicitly offering illegal DDoS attacks, given that their current policy of not doing anything unless legally compelled to is demonstrably enabling the overwhelming majority of DDoS providers to stay online, and it has terrible optics when they're in the business of mitigating those attacks. Whatever excuses they give, somehow AWS, Azure, GCP, Fastly, Akamai and so on have managed to solve the impossible problem of turning away DDoS providers without imposing Orwellian censorship in the process.

                                          • TZubiri 2 hours ago

                                            Associating a cost with a detrimental action is a well-established defense against Sybil attacks.

                                            • gwervc an hour ago

                                              I distinctly remember Cloudflare being accused here of hosting spammers and selling protection against them a decade ago. Then suddenly the name became associated with positive things only, and the whole thing has been memory-holed.

                                              • robertlagrant 34 minutes ago

                                                Sorry - what whole thing? An accusation in a comment on Hacker News?

                                              • flir 36 minutes ago

                                                I dunno. If Cloudflare's protection doesn't work (and let's face it, it doesn't), why are you paying for it?

                                                • immibis 29 minutes ago

                                                  Well, as long as Cloudflare pays you to be "abused" (by which we mean, spending more money on bandwidth) it should be no problem for many of the site owners.

                                                  • loceng 2 hours ago

                                                    If they don't offer to just block the bots instead of you signing on, then I imagine it'd easily be seen as a racket.

                                                    How much effort Cloudflare then puts into tracking bot networks' circumvention efforts is another question.

                                                    • tempfile 13 minutes ago

                                                      The term "abuse" in this description is both confused and confusing. Websites are trying to meter out a public resource, which is something they're unable to do by themselves. Cloudflare is offering to help them, for a fee. Once the practice is metered, it isn't abuse anymore. It's just using the public service, which the website owner deliberately operates.

                                                      • mrits 2 hours ago

                                                        Hi, I'm AI and I want to inform you that your bigotry will not stand. Democratize human intelligence so we can ALL live in a world where it doesn't matter if you are silicon or carbon.

                                                      • FlyingSnake 2 hours ago

                                                        More details here at the Cloudflare blog: https://blog.cloudflare.com/cloudflare-ai-audit-control-ai-c...

                                                        • siliconc0w 25 minutes ago

                                                          Any recommendations for simple WAF tool that will stop the majority of the abuse without having to use Cloudflare? I use Cloudflare just to keep that noise away from my logs but I'm not super keen to be dependent on them.

                                                          • boristsr 2 hours ago

                                                            I'm pretty interested in how companies are exploring how to properly monetize or compensate for scraped content to help keep a strong ecosystem of quality content. I'd love to see more efforts like this.

                                                            • hedora 2 hours ago

                                                              Companies have been trying to find novel ways to bypass fair use / public domain laws for a long time.

                                                              Each time they do, we see more consolidation of the media, and lower pay for the people that produce the content.

                                                              I don’t see why this particular effort will turn out differently.

                                                              • bippihippi1 2 hours ago

                                                                I wonder if there's a way to test this hypothesis. Does content being freely reproducible with minor modification increase the demand for content creators, since new content is more valuable than existing content that can be copied?

                                                                I'd guess that since AI can fair-useify a work faster than any human, fair-use content creators (reviewers, compilers/collagers, re-imaginers, etc.) will be devalued.

                                                                However, AIs are as yet unable to create work as innovative as humans. Therefore new work should be more valuable since now there is demand from people and AIs for their work. I'm assuming that AI companies pay for the work that they use in some way. Hopefully the aggregation sites continue to compete for content creators.

                                                                • chrisweekly 15 minutes ago

                                                                  > "I'm assuming that AI companies pay for the work that they use in some way."

                                                                  That mistaken assumption is at the heart of the problem under discussion.

                                                              • kordlessagain 2 hours ago

                                                                There's an HTTP status code for charging for access: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/402

                                                                Then there's a Lightning Network protocol for it: https://docs.lightning.engineering/the-lightning-network/l40...

                                                                With the Cloudflare stuff, it just seems like an excuse to sell Cloudflare services (and continue to force everyone to use it) as opposed to just figuring out a standard way of using what is already built to provide access for some type of micropayment.
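
                                                                To make that concrete, here's a minimal sketch of a 402-first endpoint (Flask; the L402-style WWW-Authenticate challenge follows the scheme in the link above, and the macaroon/invoice values are placeholders):

                                                                  # minimal sketch of a 402-gated endpoint; the L402-style challenge
                                                                  # header mirrors the scheme in the link above, and the macaroon and
                                                                  # invoice values are placeholders, not real credentials
                                                                  from flask import Flask, Response, request

                                                                  app = Flask(__name__)

                                                                  @app.route("/article/<slug>")
                                                                  def article(slug):
                                                                      auth = request.headers.get("Authorization", "")
                                                                      if not auth.startswith("L402 "):
                                                                          resp = Response("Payment required", status=402)
                                                                          resp.headers["WWW-Authenticate"] = (
                                                                              'L402 macaroon="<base64-macaroon>", invoice="<bolt11-invoice>"'
                                                                          )
                                                                          return resp
                                                                      # a real server would verify the macaroon and payment preimage here
                                                                      return f"full content of {slug}"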

                                                                • jsheard 2 hours ago

                                                                  The problem is that soft technical measures like HTTP 402 and robots.txt aren't legally binding, so there's nothing stopping scrapers from just ignoring them. Cloudflare's value proposition here is that they will play the cat-and-mouse game of detecting things like spoofed user agents and residential proxies on your behalf, and actively block what appears to be scraper traffic unless they pay up.

                                                                  Unfortunately this probably means even more CAPTCHAs for people using VPNs and other privacy measures as they ramp up the bot detection heuristics.

                                                                  • Aachen 5 minutes ago

                                                                    Sure it's not legally binding, but if I see >100000 requests coming from 1 IP address within a week, I'm also not legally bound to make that 402 error go away. By having an automated payment mechanism, the two parties could come to an agreement they're both happy about

                                                                    > there's nothing stopping scrapers from just ignoring them

                                                                    Feel free to ignore HTTP errors, but those pages don't contain the content you're looking for

                                                                    (For the record, I don't use HTTP 402, but I noncommercially host stuff and know what bots people are complaining about.)

                                                                    • TZubiri 2 hours ago

                                                                      "Unfortunately this probably means even more CAPTCHAs for people using VPNs and other privacy measures as they ramp up the bot detection heuristics"

                                                                      Yeah. You can't have it both ways. Similar dilemma for requiring identification vs disallowing immigrants.

                                                                  • dogleash 2 hours ago

                                                                    > help keep a strong ecosystem of quality content

                                                                    To the extent quality content does exist online: what isn't either already behind a paywall, or created by someone other than who will be compensated under such a scheme?

                                                                    • tomjen3 2 hours ago

                                                                      This won't work. If you are doing an AI startup, you will want to pose as GoogleBot for your crawler, and that bypasses this.

                                                                      Not too much of a loss, since the only quality content is already behind paywalls or on diverse wiki-style sites. Anything served with ads for commercial reasons is automatically drivel, based on my experience. There simply isn't a business in making it better.

                                                                      Edit: updated comment to not be needlessly divisive.

                                                                      • jsheard 2 hours ago

                                                                        It is trivial to detect fake GoogleBot traffic (Google provides ways to validate it) and Cloudflare already does so. See for yourself:

                                                                          curl -I -H "User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/105.0.5195.102 Safari/537.36" https://www.cloudflare.com
                                                                        
                                                                        They'll immediately flag the request as malicious and return 403 Forbidden, even if your IP address is otherwise reputable.

                                                                        • matt-p 2 hours ago

                                                                          Now try it from a google cloud vm.

                                                                          • jsheard 2 hours ago

                                                                            Pretty sure that won't work; they let you validate whether an IP address is used by GoogleBot specifically, not just owned by Google in general. I doubt they are foolish enough to use the same pool of IP addresses for their internal crawlers and their public cloud.

                                                                            https://developers.google.com/search/docs/crawling-indexing/...
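
                                                                            The documented check is two DNS lookups, roughly this sketch (Google also publishes a JSON list of GoogleBot address ranges you could match against instead):

                                                                              # verify a claimed GoogleBot: reverse DNS, domain check, forward confirm
                                                                              import socket

                                                                              def is_real_googlebot(ip: str) -> bool:
                                                                                  try:
                                                                                      # reverse lookup, e.g. "crawl-66-249-66-1.googlebot.com"
                                                                                      host = socket.gethostbyaddr(ip)[0]
                                                                                      if not host.endswith((".googlebot.com", ".google.com")):
                                                                                          return False
                                                                                      # forward lookup must round-trip to the same address
                                                                                      return ip in socket.gethostbyname_ex(host)[2]
                                                                                  except (socket.herror, socket.gaierror):
                                                                                      return False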

                                                                            • matt-p an hour ago

                                                                              It depends on how the site has implemented it; a huge number just look for AS origination and *.googleusercontent.com

                                                                    • neilv 2 hours ago

                                                                      > A demo of AI Audit shared with TechCrunch showed how website owners can use the tool to see how AI models are scraping their sites. Cloudflare’s tool is able to see where each scraper that visits your site comes from, and offers selective windows to see how many times scrapers from OpenAI, Meta, Amazon, and other AI model providers are visiting your site.

                                                                      And if I didn't authorize the freeloading copyright-laundering service companies to pound my server and take my content, then I need a really good lawyer, with big teeth and claws.

                                                                      • BSDobelix 2 hours ago

                                                                        I would say let's get rid of copyright and software patents altogether ;)

                                                                        • blibble 2 hours ago

                                                                          they're already gone

                                                                          but only if you're well funded (OpenAI)

                                                                          • mdaniel 13 minutes ago

                                                                            I've always heard it as "the golden rule:" those who have the gold make the rules

                                                                      • meiraleal 19 minutes ago

                                                                          Wow, a big tech company thinking about creators, not about how to extract all it can, but how to give back. That's become so uncommon nowadays. Cloudflare deserves their exponential growth. Kudos to them.

                                                                        • 015a 27 minutes ago

                                                                          One minor, tedious thing that I've become so tired of lately is showcased very plainly in the screenshot in this article: That the Cloudflare admin dashboard has now prominently placed "AI Audit (ALPHA)" as a top-level navigation menu item at the very top of the list of a Cloudflare Account's products. Everyone is doing this, for AI products or whatever came before them, and it genuinely pushes me away from paying for Cloudflare, as I get the distinct sense that they aren't building the things or fixing the problems that I feel are important to me.

                                                                          I would greatly appreciate the ability to customize the items and ordering of those items in this sidebar.

                                                                          • zkid18 an hour ago

                                                                            What's wrong with AI agents accessing website content? We seem to have been happy with Google doing that for ages in exchange for displaying the website in search results.

                                                                            • red_admiral an hour ago

                                                                              The website owner chooses. They can say "nope" in robots.txt. Not everyone respects this, but Google does. Google can choose not to show that site as a result, if they want to.

                                                                              This adds a third option besides yes and no, which is "here's my price". Also, because cloudflare is involved, bots that just ignore a "nope" might find their lives a bit harder.
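
                                                                                The "nope" is just a robots.txt stanza, e.g. (GPTBot is OpenAI's published crawler token; other vendors document their own):

                                                                                  User-agent: GPTBot
                                                                                  Disallow: /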

                                                                              • lolinder an hour ago

                                                                                Robots.txt is for crawlers. It's explicitly not meant to say one-off requests from user agents can't access the site, because that would break the open web.

                                                                                • Spivak 41 minutes ago

                                                                                  Yep, there's really two parts to this.

                                                                                  * Some company's crawler they're planning to use for AI training data.

                                                                                  * User agents that make web requests on behalf of a person.

                                                                                  Blocking the second one because the user's preferred browser is ChatGPT isn't really in keeping with the hacker spirit. The client shouldn't matter, I would hope that the web is made to be consumed by more than just Chrome.

                                                                              • 6gvONxR4sf7o 42 minutes ago

                                                                                The thing people have been doing for ages is a trade: I let you scrape me and in return you send me relevant traffic. The new choice isn't about a trade, so it's different.

                                                                                • brigadier132 an hour ago

                                                                                  For traditional search indexing the interests of the aggregator and the content creator were aligned. AIs on the other hand are adversarial to the interest of content creators, a sufficiently advanced AI can replace the creator of the content it was trained on.

                                                                                  • lolinder an hour ago

                                                                                    We're talking in this subthread about an AI agent accessing content, not training a model on content.

                                                                                    Training has copyright implications that are working their way through courts. AI agent access cannot be banned without fundamentally breaking the User Agent model of the web.

                                                                                    • brigadier132 an hour ago

                                                                                      Ok, fine, let's restrict it to AI agents only, without training. It's still an adversarial relationship with the content creator. When you take an AI agent and ask it "find me the best Italian restaurant in city xyz", it scans all the restaurant review sites and gives you back a recommendation. The content creator bears all the burden of creating and hosting the content and reaps none of the reward, as the AI agent has now inserted itself as a middleman.

                                                                                      The above is also a much clearer / more obvious case of copyright infringement than AI training.

                                                                                      > AI agent access cannot be banned without fundamentally breaking the User Agent model of the web.

                                                                                      This is a non-sequitur but yes you are right, everything in the future will be behind a login screen and search engines will die.

                                                                                      • lolinder an hour ago

                                                                                        > reaps non of the reward

                                                                                        Just to be clear what we're talking about: the reward in question is advertising dollars earned by manipulating people's attention for profit, right?

                                                                                        I frankly don't think that people have the right to that as a business model and would be more than happy to see AI agents kill off that kind of "free" content.

                                                                                        • brigadier132 36 minutes ago

                                                                                          > the reward in question is advertising dollars earned by manipulating people's attention for profit, right?

                                                                                          Another non-sequitur. I'm talking about incentives and what I predict will happen based on these incentives.

                                                                                          > I frankly don't think that people have the right to that as a business model and would be more than happy to see AI agents kill off that kind of "free" content.

                                                                                          Frankly, what you think doesn't really matter. It's also very easy to have these moral judgements when you produce absolutely nothing and take everything for free. This is called being a leech.

                                                                                          • lolinder 34 minutes ago

                                                                                            Classy. Have a nice day.

                                                                                            • brigadier132 30 minutes ago

                                                                                              Classy is being so self absorbed that you have no hesitation to say that someone providing you a service should make nothing for it.

                                                                                        • Spivak 34 minutes ago

                                                                                          > The content creator bears all the burden of creating and hosting the content and reaps non of the reward as the AI agent has now inserted itself as a middleman.

                                                                                          That "middleman" is a user agent. My god, what's happened to our industry. Locking the web to known clients which are sufficiently not the user's agent betrays everything the web is for.

                                                                                          Do you really hate AI so much that you'll give up everything you believe in to see it hurt?

                                                                                          • brigadier132 28 minutes ago

                                                                                            Like I said in another comment, I'm pointing out what is going to actually happen based on incentives, not what I want to happen. I'd much rather the open web continue to exist and I think AI will be a beneficial thing for humanity.

                                                                                            edit: to be clear, it's already happening. Blogs are moving to substack, twitter blocks crawling, reddit is going the same way in blocking all crawlers except google.

                                                                                    • spiderfarmer an hour ago

                                                                                      And AI agents scrape your content in exchange for what exactly?

                                                                                      • lolinder an hour ago

                                                                                        Yeah, there's a lot of confusion between AI training and AI agent access, and it's dangerous.

                                                                                        Training embeds the data into the model and has copyright implications that aren't yet fully resolved. But an AI agent using a website to do something for a user is not substantially different than any other application doing the same. Why does it matter to you, the company, if I use a local LLaMA to process your website vs an algorithm I wrote by hand? And if there is no difference, are we really comfortable saying that website owners get a say in what kinds of algorithms a user can run to preprocess their content?

                                                                                        • jsheard an hour ago

                                                                                          > But an AI agent using a website to do something for a user is not substantially different than any other application doing the same.

                                                                                          If the website is ad-supported then it is substantially different - one produces ad impressions and the other doesn't. Adblocking isn't unique to AI agents of course but I can see why site owners wouldn't want to normalize a new means of accessing their content which will inherently never give them any revenue in return.

                                                                                          • lolinder an hour ago

                                                                                            I don't believe that companies have the right to say that my user agent must run their ads. They can politely request that it does and I can tell my agent whether to show them or not.

                                                                                            • jsheard an hour ago

                                                                                              True, but by the same measure your user agent can politely request a webpage and the server has the right to say 403 Forbidden. Nobody is required to play by the other party's rules here.

                                                                                              • lolinder an hour ago

                                                                                                Exactly. The trouble is that companies want the benefits of being on the open web without the trade-offs. They're more than welcome to turn me down entirely, but they don't do that because that would have undesirable knock-on effects. So instead they try to make it sound like I have a moral obligation to render their ads.

                                                                                      • johnisgood an hour ago

                                                                                        How are they going to pay? How much? Can it be enforced?

                                                                                          • sunshadow 4 minutes ago

                                                                                            From the scraper's perspective, there is no difference between this and any well-known bot-prevention mechanism.

                                                                                          • Workaccount2 2 hours ago

                                                                                              Props to Cloudflare for referring to it as "scanning your data", which is probably the most technically accurate way to describe what AI training bots are doing.

                                                                                            • NoMoreNicksLeft 41 minutes ago

                                                                                              Great. The HR software my company uses can charge me when my own bot "scrapes" my paystub pdf.

                                                                                              • kijin 2 hours ago

                                                                                                AI scrapers are parasites.

                                                                                                I don't care whether you're OpenAI, Amazon, Meta, or some unknown startup. As soon as you generate a noticeable load on any of the servers I keep my eyes on, you'll get a blank 403 from all of the servers, permanently.

                                                                                                I might allow a few select bots once there is clear evidence that they help bring revenue-generating visitors, like a major search engine does. Until then, if you want training data for your LLM, you're going to buy it with your own money, not my AWS bill.
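
                                                                                                The blank 403 is only a few lines at the edge (nginx-style sketch; the tokens are published crawler user agents, but the list needs constant curation and determined scrapers just spoof a browser UA):

                                                                                                  # map known AI-crawler user agents to a flag, then refuse them
                                                                                                  map $http_user_agent $ai_scraper {
                                                                                                      default 0;
                                                                                                      ~*(GPTBot|ClaudeBot|Bytespider|Amazonbot) 1;
                                                                                                  }

                                                                                                  server {
                                                                                                      listen 80;
                                                                                                      server_name example.com;

                                                                                                      if ($ai_scraper) {
                                                                                                          return 403;
                                                                                                      }
                                                                                                  }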

                                                                                                • johnsutor 2 hours ago

                                                                                                  Or, you know, just create your own API for your platform and charge people per request to that.

                                                                                                  • kelsey98765431 2 hours ago

                                                                                                    lol good luck

                                                                                                    • xyzzy_plugh 2 hours ago

                                                                                                      Ah yes, the ol' monopoly-invents-an-illusory-marketplace ploy.

                                                                                                      Cloudflare is obviously right here. AI has changed things so an open web is no longer possible. /s

                                                                                                      What absolute garbage.

                                                                                                      • giancarlostoro 2 hours ago

                                                                                                        I really love Cloudflare. They're always up to something interesting and different. I hope we see more companies rise up similar to Cloudflare. I almost want to say Cloudflare is everything we hoped Google would be, but Google became another corporate cog machine that innovates and then scraps things in one swoop. I don't recall the last time I heard of Cloudflare spinning something up just to wind it back down. I don't think it's impossible for them to make a bad choice, but I think they typically really think their projects through.

                                                                                                        My biggest problem with AI is that once it starts getting legislated, it will be limited in how it can function / be built; we are going to lock in existing LLMs like ChatGPT as the leaders and stop anyone from competing, since newcomers won't be able to train on the same data.

                                                                                                        My other big problem with "AI", or really LLMs, which are what everyone's hyped about, is the lack of offline-first capabilities.

                                                                                                        • nindalf 2 hours ago

                                                                                                          > last I heard of Cloudflare spinning something up just to wind it back down

                                                                                                          Cloudflare bet big on NFTs (https://blog.cloudflare.com/cloudflare-stream-now-supports-n...), Web3 (https://blog.cloudflare.com/get-started-web3/), Proof of stake (https://blog.cloudflare.com/next-gen-web3-network/). In fact they "bet on blockchain" way back in 2017 (https://blog.cloudflare.com/betting-on-blockchain/) but it's telling that they haven't published anything in the last couple of years (since Nov 2022). Since then the only crypto related content on blog.cloudflare.com is real cryptography - like data encryption.

                                                                                                          I'm not criticising. I'm just saying they're part of an industry that thought web3 was the Next Big Thing between 2017-2022 and then pivoted when ChatGPT released in Nov 2022. Now AI is the Next Big Thing.

                                                                                                          I wouldn't be surprised if a lot of the blockchain stuff got sunset over the next few years. Can't run those in perpetuity, especially if there aren't any takers.

                                                                                                          • giancarlostoro an hour ago

                                                                                                            I'm neutral on crypto. I see it like AI: it's just waiting on some breakthrough that pulls everyone in. My suspicion is someone needs to make it stupid easy to get into crypto.

                                                                                                          • clvx 2 hours ago

                                                                                                              Someone somewhere outside your country's jurisdiction can still do all the things your country doesn't like, and there's little to stop them. Governments might limit legal or commercial usage, but that doesn't mean it won't exist.

                                                                                                            • giancarlostoro 2 hours ago

                                                                                                                It's much harder to pull off when you're hitting an international market. Are you really going to ignore an entire country? Maybe if it was a small country with few citizens, but if the EU or US passes a law, you're going to miss out on an entire market.