• NoahZuniga 3 hours ago

    This article says that "GPT-5 was trained on phrases from adult websites". However, this is misleading: all that was actually shown is that GPT-5 was trained on phrases that also occur on adult websites, with some speculation that the source of the training data containing those phrases is GitHub.

    • tymscar 21 minutes ago

      This is addressed at the end of the blogpost

    • zaptrem 5 hours ago

      > There are about 936 tokens with very low L2 norm, centered at about 2. This likely means that they did not occur in the training process of GPT-oss and were thus depressed by some form of weight decay.

      Afaik embedding and norm params are excluded from weight decay as standard practice. Is this no longer true?

      E.g., they exclude them in minGPT: https://github.com/karpathy/minGPT/blob/37baab71b9abea1b76ab...
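
      Roughly the pattern I mean, as a sketch (not minGPT's actual code; the name-based matching is made up):

          import torch

          def configure_optimizer(model, lr=3e-4, weight_decay=0.1):
              # Usual convention: 2D matmul weights get decayed;
              # embeddings, norm scales and biases do not.
              decay, no_decay = [], []
              for name, p in model.named_parameters():
                  if not p.requires_grad:
                      continue
                  if p.ndim < 2 or "emb" in name or "norm" in name.lower():
                      no_decay.append(p)
                  else:
                      decay.append(p)
              groups = [
                  {"params": decay, "weight_decay": weight_decay},
                  {"params": no_decay, "weight_decay": 0.0},
              ]
              return torch.optim.AdamW(groups, lr=lr, betas=(0.9, 0.95))

      If the low-norm rows really were pushed down by weight decay, that convention wasn't followed here, or something else shrank them.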

      • levocardia 2 hours ago

        Could it instead be the case that these tokens were initialized at some mean value across the dataset (plus a little noise), and then never changed because they were never seen in training? Not sure if that is state of the art anymore but e.g. in Karpathy's videos he uses a trick like this to avoid the "sharp hockey stick" drop in loss in the early gradient descent steps, which can result in undesirably big weight updates.
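
        For illustration, something like this would produce the same low-norm cluster without any weight decay at all (a sketch, not anything GPT-oss is known to do; the sizes are made up):

            import torch

            vocab_size, d_model = 200_000, 2880  # made-up sizes
            emb = torch.nn.Embedding(vocab_size, d_model)
            torch.nn.init.normal_(emb.weight, std=0.02)  # typical small init

            # Rows for tokens that never occur in training get no gradient,
            # so they keep roughly their initial norm of ~0.02 * sqrt(d_model):
            # a tight cluster of low-norm embeddings, no weight decay needed.
            print(emb.weight.norm(dim=1).mean())

        Whether that matches what actually happened is exactly the question, of course.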

        • 3abiton 5 hours ago

          Unfortunately the article glosses over some of the practices for uncovering such patterns in the training data. It goes very straight to the point, no lube needed. It didn't land well for me.

        • behnamoh 5 hours ago

          Is there any work on reverse engineering LLMs, especially the closed-source API ones? For example, how can we learn about the data used to train Claude Sonnet 4.5?

          And, trickier but just as important, is there any work on recovering what the pretrained model was like AFTER it's been RLHF'd? For example, what kinds of biases existed in gpt-4o before it was de-biased?

          Do biases go away completely, or do they just get suppressed deep down in the model's "mind"?

          • tptacek 3 hours ago

            Yes.

            https://arxiv.org/abs/2403.06634

            https://arxiv.org/abs/2311.17035

            (I just have these ones off the top of my head because I'm a Nicholas Carlini fan and we interviewed him about these attacks.)

            • behnamoh an hour ago

              Thanks for these, I'll have a look!

            • zer00eyz 3 hours ago

              > Do biases go away completely, or do they just get suppressed deep down in the model's "mind"?

              Bias is a human term, and couching the conversation in that context does nothing to address the issue here, because it gets into the quagmire of social context.

              Let's say LLMs had taken off 15 years ago, right when systemd launched. All the answers given would be weighted toward the old init system simply because there is a lack of information about the new one.

              LLMs are only repeating the data they are given, and it's cheaper to remove the data after the fact than it is to try to scrub it out of the training data.

              • astrange 19 minutes ago

                "only" and "repeating" aren't accurate here. There's a lot of steps between the pretraining tokens and the LLM.

                Using RL to steer it /away/ from an output would not be "only repeating" it.

            • magicalhippo 4 hours ago

              Given that the token space is large enough to waste on such "low quality" tokens, has there been work done to use a smaller token space in order for quantized models to perform better?

              Just a silly thought that crossed my mind when I saw those "ad tokens".

              • typpilol an hour ago

                Isn't that exactly what some of these models do, the ones that have 30b params but only activate 3b at a time?

              • rs186 5 hours ago

                Many of the crude translations of those Chinese phrases are so far off that they miss the meaning entirely, which makes me think the data in those matrices is inaccurate as well. The author really needs to ask a native Chinese speaker with experience in ... searching for explicit content to proofread the article and examine the results.

                • fi-le 5 hours ago

                  Hi, thanks! If someone posts better translations I will update them.

                  • yorwba 4 hours ago

                    For a start, you could replace all occurrences of "No Code" (无码) with "Uncensored."

                    • fi-le 4 hours ago

                      Done, thank you!

                • httpsoverdns 4 hours ago

                  I tried many of the examples in this article in Gemini 2.5 Pro and it seems to handle most of them flawlessly. Is it possible that Google's model is just susceptible to different glitch tokens? I admit most of the technical discussion in the article went a little over my head.

                  • simonw 4 hours ago

                    Glitch tokens should be tokenizer-specific. Gemini uses a different tokenizer from the OpenAI models.

                    The origins of the OpenAI glitch tokens are pretty interesting: they trained an early tokenizer on common strings in their early training data, but it turns out popular subreddits caused some weird strings to be common enough to get assigned their own integer, like davidjl - a frequent poster in the https://reddit.com/r/counting subreddit. More on that here: https://simonwillison.net/2023/Jun/8/gpt-tokenizers/#glitch-...
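
                    If you want to poke at this yourself, tiktoken makes it easy to check whether a string survives as a single token in a given encoding (a quick sketch; I'm not asserting what the output will be for any particular string):

                        import tiktoken

                        def is_single_token(s, encoding_name):
                            enc = tiktoken.get_encoding(encoding_name)
                            return len(enc.encode(s)) == 1

                        for name in ("cl100k_base", "o200k_base"):
                            for s in (" davidjl", " SolidGoldMagikarp"):
                                print(name, repr(s), is_single_token(s, name))

                    (Whether davidjl in particular survived into the newer encodings is worth checking rather than assuming.)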

                  • Wowfunhappy 5 hours ago

                      Maybe I'm misinterpreting, but the article seems (?) to be implying there's something scandalous about OpenAI training on adult websites.

                    I find that odd. Would anyone be surprised to know that Google indexes adult websites, and ranks them in its search algorithm? If not, what is the difference for an LLM?

                    • raincole 3 hours ago

                      And it's nothing new.

                      https://github.com/jiangyy/gpt-tokens

                        People found these adult-site-related Chinese phrases in GPT-4o. The OP is more than a year late.

                      • pydry 3 hours ago

                          They're saying that if you find references to a very specific set of phrases that were probably included accidentally on GitHub, then GitHub is likely part of the training data.

                        • relatedtitle an hour ago

                          GitHub is obviously part of the training data, you don't need to find obscure tokens to tell.

                        • refulgentis 4 hours ago

                          FWIW, I didn't get that sense.

                        • starkeeper 3 hours ago

                            I wish we had a constitutional amendment that open-sourced all commercial AI models and required documentation and links to all training data and base prompts.

                          They are trained on public data at our expense so We The People should *own* them.

                            Someday, probably sooner than we might think, we'll easily run mega-huge models on our laptops, desktops, and phones. AI should be free. Overhyped and Overpriced. I would love this setup for privacy and security.

                            Anyways, only tangentially related... (why worry about leaks like this and about hidden base prompts? They *should all be 100% OSS* - it is the only way to ensure privacy and security).

                            Also, long-time lurker, first time posting!

                          I just had to get this off my mind! Cheers.

                          • astrange 18 minutes ago

                            There's nothing new about being able to copyright something that's a transformation of another work. And they definitely aren't exclusively trained on public data.

                            • halperter 3 hours ago

                                Unfortunately very unlikely in the foreseeable future, with the U.S. having a "U.S. against the world" mentality toward the AI race. I would love to see this, but it would get shot down immediately.

                              • ben_w 2 hours ago

                                  > I wish we had a constitutional amendment that open-sourced all commercial AI models and required documentation and links to all training data and base prompts.

                                > They are trained on public data at our expense so We The People should own them.

                                  The people whose writing appears to have been trained on for the interesting parts of the blog post are mostly, like me, not American.

                                > AI should be free. Overhyped and Overpriced. I would love this setup for privacy and security.

                                Also, this entire blog post only exists because they're curious about a specific free open-weights model.

                                The "source" being ~"the internet", which we've got as much access to as most of the model makers (i.e. where you don't, they've got explicit licensing rights anyway), and possibly also some explicitly* pirated content (I've not been keeping track of which model makers have or have not done that).

                                * as in: not just incidentally

                                • timcobb an hour ago

                                  > They are trained on public data

                                  this is questionable, but okay...

                                  > at our expense

                                  ?

                                  > so We The People should own them.

                                  in addition to training data, it is my understanding that a model's architecture also largely determines its efficacy. Why should we own the architecture?

                                  • heavyset_go 3 hours ago

                                      I'd settle for them being held in a public trust for public benefit.

                                    • rileymat2 2 hours ago

                                      Why would it require a constitutional amendment?

                                      • delichon 2 hours ago

                                        The takings clause of the fifth amendment allows seizure of private property for public use so long as it provides just compensation. So the necessary amendment already exists if they're willing to pay for it. Otherwise they'd need an amendment to circumvent the fifth amendment, to the extent the document is honored.

                                        • heavyset_go 2 hours ago

                                          Are models necessarily IP?

                                            If generative AI models' output can't be copyrighted and turned into private IP, who is to say the output of gradient descent and back-propagation can be? Neither is the creative output of a human being; both are the product of automated statistical processes.

                                          Similarly, if AI companies want to come at dataset compilation and model training from a fair use angle, would it not be fair use to use the same models for similar purposes if models were obtained through eminent domain? Or through, like in Anthropic's training case, explicit piracy?

                                      • bigyabai an hour ago

                                          What you are describing is more-or-less a planned economy, the polar opposite of America's market economy. In a planned economy, the government has the power to appropriate things for the common good because private enterprise isn't seen as a necessary force. Sometimes it works, sometimes it doesn't; only certain countries can "moneyball" their way through economics like that, though. America has long since passed the point of even trying.

                                        Your heart is in the right place here (I agree about FOSS), but there is a snowball's chance in hell that any of this ever happens in the USA. We'll be lucky if AI doesn't resemble cable TV by 2030.

                                        • canadiantim 2 hours ago

                                            Wouldn't the same argument then apply to all scraped data?

                                        • Theodores 4 hours ago

                                            Fascinating article. I am giving everything AI a wide birth for now; however, I do enjoy learning about how AI works. The question I have is: what does an LLM do when it encounters a new token? Can it actually learn from context, etymology and usage?

                                            As a child I had no idea what many of the words in the newspaper and in literature meant, but I could just pretend I knew what they meant or get by without knowing them in full. In time I would gain familiarity with these words, able to make sense of them in context but not necessarily able to pronounce them or use them in my own writing. I certainly didn't stop what I was reading to get the dictionary out every time I encountered a new word, and this is how I think most people learn to read: new words gradually go from no idea, to some familiarity, to confident use.

                                          We aren't tokenising like the LLMs do and our languages are the product of many hundreds of thousands of years of development. So, how does an LLM learn words that have not already been tokenised? Or is this baked in?

                                          • FeepingCreature 3 hours ago

                                            Informed layman warning.

                                              The tokenizer covers the entire dataset. It's basically a fixed-size compression dictionary built greedily (a bit like a Huffman code), grouping together common fragments of letters - for instance, the 100 most common English words are probably all single tokens.

                                              During learning, the model proceeds in roughly the same way a child would: it starts by grouping tokens together, learning the deep regularities of language such as "news[paper]" being more likely than "news[q77.bfe]". Then it incrementally assembles these fragments into larger and larger chains. Similarly, it first learns thematic groupings, such as "word" being more likely somewhere after "dictionary" rather than "stop what I was reading to get the dictionary out every time I encountered a banana assault hungry". Then it starts to pick up "patterns": "as a [baby|child|kid] I had no [idea|concept|clue]". At some point in this process it naturally abstracts concepts from languages: "as a child" starts being internally represented by the same neurons as "als ich ein Kind war".

                                            Then some magic happens that we don't understand, and out pops a neural network that you can talk to and that can write programs and use tools. To be clear, this is the case before RL: probably these patterns are now widespread in the training data, so that the model already understands how to "complete the pattern" on its own. RL then does some magic on top of that to bring it from 20% benchmarks to 80% and presto, AI assistant.
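
                                              The tokenizer part at least is concrete. Here's a toy byte-pair-encoding sketch of the "grouping common fragments" idea (not OpenAI's actual tokenizer code, which works on bytes over a huge corpus):

                                                  from collections import Counter

                                                  def bpe_merges(words, num_merges):
                                                      # Start from single characters and repeatedly
                                                      # merge the most frequent adjacent pair.
                                                      corpus = [list(w) for w in words]
                                                      merges = []
                                                      for _ in range(num_merges):
                                                          pairs = Counter()
                                                          for w in corpus:
                                                              pairs.update(zip(w, w[1:]))
                                                          if not pairs:
                                                              break
                                                          (a, b), _ = pairs.most_common(1)[0]
                                                          merges.append(a + b)
                                                          out = []
                                                          for w in corpus:
                                                              merged, i = [], 0
                                                              while i < len(w):
                                                                  if w[i:i + 2] == [a, b]:
                                                                      merged.append(a + b)
                                                                      i += 2
                                                                  else:
                                                                      merged.append(w[i])
                                                                      i += 1
                                                              out.append(merged)
                                                          corpus = out
                                                      return merges

                                                  words = ["newspaper", "news", "new", "paper"]
                                                  print(bpe_merges(words, 6))

                                              Real tokenizers do essentially the same thing over bytes on a web-scale corpus, which is how ad boilerplate that is frequent enough can end up with its own token.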

                                            • astrange 16 minutes ago

                                              > The tokenizer covers the entire dataset.

                                              Well, this is only trivially true. You can feed binary data to the LLM and it probably has tokens that only cover single bytes of that.

                                            • krackers 2 hours ago

                                                I think it could infer the meaning of words composed out of tokens it has already seen before, the same way that you might be able to infer the meaning of an unknown word based on its prefix/suffix, country of origin, context, etc.

                                              For an entire token that it hasn't seen before, it would have to rely only on context. Presumably it could do this, since that is after all the case in the early phases of training.

                                              • refulgentis 4 hours ago

                                                s/birth/berth :)

                                                • DrewADesign 4 hours ago

                                                  That's rather presumptuous, don't you think? There are some people here with very unusual jobs.

                                                • wizzwizz4 3 hours ago

                                                  The LLM training process doesn't operate at that conceptual level. What it's doing is closer to examining a large number of possible meanings, seeing which fit the most, and moving its "understanding" in that direction. Repeat enough times, and it develops an association between the new word and the context in which it's used.

                                                  New words will usually be combinations of existing tokens, but at the beginning of training a new model, it doesn't "know" what any of the tokens mean. And there's no reason you can't treat every UTF-8 byte as a separate token, but that would require a larger model before you got results that look to a layperson like intelligence, understanding, or knowledge. Tokenisation lets you use a system like word2vec to assign each token a semantic embedding in a vector space, giving the model a bit of a leg up.
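
                                                    To make "nearby regions of a vector space" concrete, a minimal sketch (assuming you have the embedding matrix as a tensor; the function is mine, not from any particular library):

                                                        import torch
                                                        import torch.nn.functional as F

                                                        def nearest_tokens(emb, token_id, k=5):
                                                            # emb: (vocab, d_model) embedding matrix.
                                                            emb = emb.detach()
                                                            q = emb[token_id].unsqueeze(0)
                                                            sims = F.cosine_similarity(emb, q, dim=1)
                                                            sims[token_id] = -1.0  # skip the query token
                                                            return sims.topk(k).indices.tolist()

                                                    Whether two tokens end up "near" each other is then just a property of the learned matrix, not of any explicit dictionary.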

                                                  ---

                                                  Response to the sibling comment https://news.ycombinator.com/item?id=45485439, since I've hit the rate limit:

                                                  > During learning, the model […] starts by grouping tokens together

                                                  You probably could design a ML system that works like this, and it'd probably be more efficient to train than a hundred-billion parameter GPT model, but that's not how GPT model training works. Instead, it attempts all of those things in parallel (although I would expect the solutions to the earlier, easier parts to settle down before the solutions to the later parts do), and the same process is responsible for all of the behaviour in a straightforward fashion.

                                                  We do understand the "magic": it's just that it produces a really complicated system that we can't characterise the iterative behaviour of. (For comparison, the iterative function f_c(z) = z² + c, iterated starting at 0, produces the Mandelbrot set.) To use an analogy: imagine the training data is a landscape, and the behaviour of the GPT model trained on it is a weather system. (The parameter count is the amount of atmosphere, or something.) There's nothing magical going on in the weather, but it's just too complicated to predict ahead of time, and tiny gaps in our understanding can magnify into extremely inaccurate long-term predictions. We can, despite this, make some blanket statements about the possible capabilities of a GPT model, of the form "a GPT model will never be able to do X unless you cheat".

                                                  The RL magic is, I believe, well understood, but I don't personally understand it. (I know what it does, since RL always does the same thing, but I don't know what it's doing to the model to achieve that.)

                                                  > "as a child" starts being internally represented by the same neurons as "als ich ein Kind war"

                                                  Yes and no. For a few reasons, including that this kind of association can occur without the same "neurons" getting involved until past the point where that representation exists, it's better to say that they're embedded in nearby regions of a vector space. The actual nodes of the neural network are an implementation detail.