Comments Page - The AI-Scraping Free-for-All Is Coming to an End

The AI-Scraping Free-for-All Is Coming to an End (nymag.com). Submitted by geox 17 hours ago

xarope 20 minutes ago
I can see how the AI companies would work around this though:
user queries "static" training data in LLM; LLM guesses something, then searches internet in real-time for data to support the guesses. This would be classified as "browsing" rather than trawling.
(the searched data then get added back into the corpus, thus sadly sidestepping all the anti-AI trawling mechanisms)
Kind of like the way a normal user would.
The problem is, as others have already mentioned, how would the LLMs know what is a good answer versus a bad one, when a "normal" user also has this issue?
jsnell 16 hours ago
The headline seems pretty aspirational.
The licensing standard they're talking about will achieve nothing.
Anti-bot companies selling scraping protections will run out of runway: there's a limited set of signals, and none of them are robust. As the signals get used, they're also getting burned. And it's politically impossible to expand the web platform to have robust counter-abuse capabilities.
Putting the content behind a login wall can work for large sites, but not small ones.
The free-for-all will not end until adversarial scraping becomes illegal.
- atm3ga 15 hours ago
  As AI companies like Perplexity introduce AI-enabled browsers like Comet, they will scrape websites through end users' interactions with whatever site they are using. So anti-bot companies are indeed running out of runway.
  thelittleone 15 hours ago
  Wow, hadn't even considered this... so say I have a members-only section of my site where I share high-value content, one of the members browses using Comet, and that scrapes the private content and sends it to Perplexity?
  datadrivenangel 13 hours ago
  Any user could manually download your data anyways. Access is access.
  tempodox 9 hours ago
  And a browser can do it automatically and behind the user’s back.
  lupire 15 hours ago
  This also happens with covert botnets running secretly on user machines.
  Incipient 7 hours ago
  Surely that's highly illegal, and no one would actually use a browser that sent your entire browsing DATA not just history, to a third party?
  ec109685 14 hours ago
  The way Comet browses the web is weird enough that it’s easily detectable.
  atm3ga 12 hours ago
  Does detectability matter? Are we now entering an era of forced browser compliance? That is, if I use Comet exclusively as my browser; is my bank, insurance company, or news site going to force me to stop and use a "normal" browser and what will that look like as every browser also has AI capabilities? Maybe certain resources will only be available via apps? Seems like a very slippery slope and very user hostile.
  orbisvicis 12 hours ago
  I really don't want AI to be able to produce my bank account balance and routing number on demand.
  Aerroon an hour ago
  Great, but it won't stop there. You will use Chrome or else.
  Well, with one alternative: Edge.
- gdulli 14 hours ago
  Did you stop getting non-compliant spam when that became illegal?
- carlosjobim 15 hours ago
  > Putting the content behind a login wall can work for large sites, but not small ones.
  Syndication is the answer. Small artists are on Spotify, small video makers are on YouTube.
  salawat 15 hours ago
  Yes. Conglomeration and centralization. More, more, more!
  See the problem?
  carlosjobim 12 hours ago
  You don't have to syndicate a million small creators to have a product worthwhile for consumers; it could be a hundred, a thousand, or ten thousand creators in a syndicate. You can have a huge number of syndicates, which benefits creators and consumers.
  orbisvicis 11 hours ago
  But in such an environment syndicates will have an incentive to centralize.
  carlosjobim 10 hours ago
  I don't see why. In general, there are competing syndicates and businesses of every size in most sectors of the economy.
WaltPurvis 17 hours ago
http://archive.today/SqPCL
- jmkni 15 hours ago
  It is a bit ironic that a paywalled article like this will have a top level comment with the archive link, which can then be easily scraped by AI (along with the comments)
  ec109685 14 hours ago
  Also interesting how sites like this are mainstream whereas a link to a site hosting an mp3 of pirated music wouldn’t be tolerated in discussion forums like this.
  I think a big difference is that there are no microtransactions or compulsory licensing for content, so it always feels patently unfair to buy a subscription to read one article.
  yencabulator 10 hours ago
  I'd argue it's more that RIAA has historically been much more aggressive at suing than newspapers or magazines.
  ec109685 10 hours ago
  True. I think it has ended up a net good. People make a living on music, and licensed music is everywhere.
  orbisvicis 11 hours ago
  Kinda hard to discuss the news when your members can't read the news.
  JacobKfromIRC 11 hours ago
  In this case, it also seems like the paywall doesn't show up if you have JavaScript disabled, which I find strange, but lots of news sites are like that I think.
  tenuousemphasis 15 hours ago
  It's not ironic at all. The only reason the anti-paywall sites work is that the news companies in fact want some scrapers reading the full article.
  mschuster91 15 hours ago
  Actually, the team behind archive dot today has premium accounts on at least spiegel.de, I presume bought with anonymous credit cards.
  You can see artifacts when their servers are under queue load and the URLs become visible: a few resources have the JWT with the account details in the URL. IIRC the clear name of the account in the token is Masha Rabinovich, with an email account masha@dns.li, an identity that has cropped up in various investigations [1][2].
  [1] https://gyrovague.com/2023/08/05/archive-today-on-the-trail-...
  [2] https://webapps.stackexchange.com/questions/145817/who-owns-...
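To illustrate why a JWT in a URL exposes account details: in the common signed (JWS) form, the payload is only base64url-encoded JSON, so anyone who sees the URL can read it without a key. Below is a minimal Python sketch, assuming that form. The token and its claims are made up for the example and are not taken from any real account; `b64url` and `decode_jwt_payload` are hypothetical helper names.

```python
import base64
import json


def b64url(data: bytes) -> str:
    """Encode bytes as unpadded base64url, the encoding JWTs use."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def decode_jwt_payload(token: str) -> dict:
    """Read the claims of a signed JWT without verifying the signature."""
    _header_b64, payload_b64, _signature = token.split(".")
    # Restore the padding that the JWT encoding strips.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


# Hypothetical token, built here purely for illustration.
header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
payload = b64url(
    json.dumps({"name": "Example Subscriber", "email": "reader@example.com"}).encode()
)
token = f"{header}.{payload}.signature-not-checked"

claims = decode_jwt_payload(token)
print(claims["name"], claims["email"])
```

The signature only protects against tampering; it does nothing to hide the claims, which is why tokens in URLs tend to leak identity.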
janalsncm 15 hours ago
> There was for years an experimental period, when ethical and legal considerations about where and how to acquire training data for hungry experimental models were treated as afterthoughts.
Those things were afterthoughts because for the most part the experimental methods sucked compared to the real thing. If we were in mid 2016 and your LSTM was barely stringing together coherent sentences, it was a curiosity but not a serious competitor to StackOverflow.
I say this not because I don’t think law/ethics are important in the abstract, but because they only became relevant after significant technological improvement.
Zigurd 14 hours ago
Sites containing original content will adopt active measures against LLM scraper bots. Unlike search indexing bots, there's much less upside to allowing scraping for LLM training material. Openly adversarial actions like serving up poisoned text that would induce LLMs to hallucinate are much more defensible.
1gn15 16 hours ago
Biased TL;DR: Reddit (notable for having a high stock value from their "selling data" business [1]), Medium, Quora, and Cloudflare competitor Fastly created a standard to restrict what the reader can do with the data users created, called Really Simple Licensing (RSL). Basically robots.txt but with more details, notably with details on how much you should pay Reddit/Medium/Quora.
While this likely has no legal weight (except for EU TDM for commercial use, where the law does take into account opt-outs), they are betting on using services like CloudFlare and Fastly to enforce this.
[1] https://www.investors.com/research/the-new-america/reddit-st...
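For reference, the robots.txt baseline that RSL is being compared to can be checked with Python's standard `urllib.robotparser`. This is only a sketch of the existing opt-out mechanism, not of RSL itself (no RSL syntax is assumed here). The rules, the `may_fetch` helper, and the example.com URLs are illustrative; GPTBot is used as the example crawler token.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules: block one AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(may_fetch(ROBOTS_TXT, "GPTBot", "https://example.com/post/1"))
print(may_fetch(ROBOTS_TXT, "ExampleBrowser", "https://example.com/post/1"))
```

Nothing here stops a crawler that ignores the file or sends a different user agent, which is why the comment above expects enforcement to come from services like Cloudflare and Fastly rather than from the file itself.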
- PhantomHour 16 hours ago
  > While this likely has no legal weight
  I wouldn't be quite so sure about that. The AI industry has entirely relied on 'move fast and break things' and 'old fart judges who don't understand the tech' as their legal strategy.
  The idea that AI training is fair use isn't so obvious, and quite frankly is entirely ridiculous in a world where AI companies pay for the data. If it's not fair use to take reddit's data, it's not fair use to take mine either.
  On a technological level the difference to prior ML is straightforward: A classical classifier system is simply incapable of emitting any copyrighted work it was trained on. The very architecture of the system guarantees it to produce new information derived from the training data rather than the training data itself.
  LLMs and similar generative AI do not have that safeguard. To be practically useful they have to be capable of emitting facts from training data, but have no architectural mechanism to separate facts from expressions. For them to be capable of emitting facts they must also be capable of emitting expressions, and thus, copyright violation.
  Add in how GenAI tends to directly compete with the market of the works used as training data in ways that prior "fair use" systems did not and things become sketchy quickly.
  Every major AI company knows this, as they have rushed to implement copyright filtering systems once people started pointing out instances of copyrighted expressions being reproduced by AI systems. (There are technical reasons why this isn't a very good solution to curtail copyright infringement by AI)
  Observe how all the major copyright victories amount to judges dismissing cases on grounds of "Well you don't have an example specific to your work" rather than addressing whether such uses are acceptable as a collective whole.
  visarga 15 hours ago
  > but have no architectural mechanism to separate facts from expressions
  Sure they do. Every time a bot searches, reads your site and formulates an answer, it does not replicate your expression. First of all, it compares across 20 to 100 sources. Second, it only reports what is related to the user query. And third, it uses its own expression. It's more like asking a friend who read those articles and getting an answer.
  LLMs' ability to separate facts from expression is quite well developed, maybe their strongest skill. They can translate, paraphrase, summarize, or reword forever.
  PhantomHour 12 hours ago
  This is a baseless assertion of emergent behaviour.
  > Every time a bot searches
  We are talking about LLMs by themselves, not larger systems using them.
  > LLMs' ability to separate facts from expression is quite well developed
  It is not. Whether you ask an LLM for an excerpt of the bible, or an excerpt of The Lord of the Rings, the LLM does not distinguish. It has no concept of what is, and what is not, under copyright.
  HarHarVeryFunny 13 hours ago
  > The idea that AI training is fair use isn't so obvious
  > Observe how all the major copyright victories amount to judges dismissing cases on grounds of "Well you don't have an example specific to your work" rather than addressing whether such uses are acceptable as a collective whole.
  Well, all a judge can/should do is to apply current law to the case before them. In the case of generative AI, it seems that it's mostly going to be copyright and "right of publicity" (reproducing someone else's likeness/voice) that apply.
  Copyright infringement is all about having published something based on someone else's work - AFAIK it doesn't have anything to say about someone/something having the potential to infringe (e.g. training an AI) if they haven't actually done it. It has to be about the generated artifact.
  Of course copyright law wasn't designed with generative AI in mind, and maybe now that it is here we need new laws to protect creative content. For example, should OpenAI be able to copy Studio Ghibli's "trademark" style without requiring permission?
  PhantomHour 13 hours ago
  > Well, all a judge can/should do is to apply current law to the case before them
  This is true, and I do not mean to suggest it is bad. But rather, that it leaves uncertainty. These cases can all be struck down without reducing the possibility that if one does stick, the entire industry is at stake.
  > Copyright infringement is all about having published something based on someone else's work - AFAIK it doesn't have anything to say about someone/something having the potential to infringe (e.g. training an AI) if they haven't actually done it. It has to be about the generated artifact.
  A notable problem here is that AI models are not "standalone products" but tools provided as a service. This complicates the situation.
  Take Disney/Universal's case against Midjourney, which is both about the models but also the provision of services.
  Even if only the latter gets deemed illegal, that's ruinous for the big AI companies. What good is OpenAI if they can't provide ChatGPT? Who would license a LLM if the act of using it creates constant legal risks?
  orangecat 15 hours ago
  > 'old fart judges who don't understand the tech'
  If this is intended to refer to Judge Alsup, it is extremely wrong.
  PhantomHour 13 hours ago
  It is not.
  janalsncm 15 hours ago
  > The very architecture of the system guarantees it to produce new information derived from the training data rather than the training data itself
  A “classical” classifier can regurgitate its training data as well. It’s just that Reddit never seemed to care about people training e.g. sentiment classifiers on their data before.
  In fact a “decoder” is simply autoregressive token classification.
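The "autoregressive token classification" point can be sketched as a loop: at every step the model classifies the prefix into one of the vocabulary entries, and the chosen class is appended and fed back in. The toy below is a minimal illustration in Python under loud assumptions: `classify_next` is a hand-written stand-in for a trained network, and the four-word vocabulary is invented for the example.

```python
import math

# Invented toy vocabulary; index 0 is the end-of-sequence token.
VOCAB = ["<eos>", "the", "cat", "sat"]


def classify_next(prefix: list[int]) -> list[float]:
    """Stand-in for a trained model: one logit per vocabulary class."""
    # Hypothetical rule: favour the token id after the last one, wrapping to <eos>.
    target = (prefix[-1] + 1) % len(VOCAB) if prefix else 1
    return [5.0 if i == target else 0.0 for i in range(len(VOCAB))]


def softmax(logits: list[float]) -> list[float]:
    """Turn logits into class probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_decode(prompt: list[int], max_new_tokens: int = 8) -> list[int]:
    """Repeatedly classify the prefix and append the most likely class."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = softmax(classify_next(tokens))
        next_token = max(range(len(probs)), key=probs.__getitem__)
        if next_token == 0:  # the classifier picked <eos>
            break
        tokens.append(next_token)
    return tokens


print(" ".join(VOCAB[t] for t in greedy_decode([1])))
```

Each iteration is an ordinary multi-class classification; the only thing that makes it "generative" is that the output is appended to the input for the next step.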
- ec109685 14 hours ago
  It’s surprising Reddit doesn’t get pushback for reselling their users’ content.
  The right thing would be for the end users to receive the compensation Reddit is getting from AI companies.
- isodev 16 hours ago
  In other words, a lightweight form of DRM. Here come the reasons why we shouldn’t all deploy CloudFlare and similar as gatekeepers to the web.
  Is there even one example of a “tech mega corp” that has grown to control more than 1/5 of its market without this circling back to hurt people in some way? A single example?
- luckylion 16 hours ago
  Does that have any implications on liability for content? They're no longer just a provider, they are re-licensing and marketing content. Are they losing protection?
deadbabe 16 hours ago
Just ladder kicking at this point.
aaaggg 15 hours ago
L - wish they'd stop posting articles that are paywalled...
ath3nd 9 hours ago
Next: the AI bubble is coming to an end. Also, fingers crossed that the career and employment of Mark Zuckerberg follow suit soon.