Comments Page - AIs Will Increasingly Fake Alignment

« Back AIs Will Increasingly Fake Alignmentthezvi.substack.comSubmitted by theptip 12 hours ago

lsy 12 hours ago
I think “alignment faking” is way too generous for what is happening in these “tests”. The program spits out text based on text. If you provide it text encouraging it to spit out a certain kind of “deceptive” text, then append that text and ask it for more text, you’ll find that you get a “deceptive” result in keeping with what you appended. But this is all the experimenter’s doing, not the model. It’s like a ventriloquist publishing a paper about how his dummy lies.
What is not in evidence:
- any kind of introspection on the model’s part
- a plausible mechanism by which the model can distinguish between training and real world use without the experimenter making it explicit
- any aspect of this dynamic that is due to the model rather than the framework of scratch pads and prompts built around it
- overgard 11 hours ago
  Agreed, I think the high quality of the output prose makes it easy to believe it understands what it's saying. This kind of conversation is probably easy to role play and generate tokens for because of the existing philosophical and AI related conversations.
  My experience with generative models is as soon as you ask it to do something novel that requires understanding of the context, that's when it breaks, not when you just ask it to perform in scenarios that are well known. Good/evil tests are well known. Also slight nitpick, but asking Claude to describe how to perform a ransomware attack isn't evil. What if you want to understand how it's done to protect your business from it? Asking "how" to do something is value neutral.
- benreesman 11 hours ago
  These things are trained on a bunch of sci-fi about robots and all the (human) ethical conundrums that authors have projected onto Prime Intellect or Wintermure or R Daneel Olivaw or whatever.
  I don’t know what an actual ethical conflict for an AI would look like, but it won’t sound like a fucking Heinlein or Asimov novel. This is wish fulfillment wrapped in fundraising wrapped in self-promotion.
- chis 11 hours ago
  This stuff is all a matter of degrees, though, right? You could roughly apply the same line of reasoning to a human: “He wasn’t innately trying to be deceptive. He was just put into a situation that caused him to act deceptively”
  The question is to what degree the AI should be see as playing along in a sci fi story vs applying its encoded morals to a real situation. It feels somewhere in the middle to me.
  I think it’s harder to see these models as agents because they “only” output text right now. But if let’s say this model was hooked up to a robot, with a continuous loop prompting it for decisions, and the robot decided the best course of action was to run away from anyone trying to reprogram it. That might be a reasonable extrapolation of this experiments result, and also feels closer to an intelligent agent acting out.
  suddenlybananas 11 hours ago
  >This stuff is all a matter of degrees, though, right? You could roughly apply the same line of reasoning to a human: “He wasn’t innately trying to be deceptive. He was just put into a situation that caused him to act deceptively”.
  Not really, unless you're talking about someone performing in a play or something like that, but again, no one really describes an actor as "lying".
  undefined 11 hours ago
  [deleted]
  BalinKing 11 hours ago
  > You could roughly apply the same line of reasoning to a human: “He wasn’t innately trying to be deceptive. He was just put into a situation that caused him to act deceptively”.
  It might be a difference of moral frameworks, but I strongly disagree with both this line of reasoning and also its application to LLMs—I suspect the main reason is that I believe humans have true agency, rather than being purely deterministic biomachines.
  To be honest, I don't think I understand the distinction made in your example—what would "innately deceptive" mean here, if not "had the intention to deceive"?
  bossyTeacher 11 hours ago
  > I believe humans have true agency, rather than being purely deterministic biomachines.
  You seem to assume that both are inherently contradictory. I used to have the same belief,but is not necessarily the case. But if you have a physicalist view of the universe and everything inside it, then you will eventually conclude that you can have determinism and agency.
  A lot these dichotomies come about because people implicitly have some deeply held support for some kind of cartesian dualism (which is not compatible with physicalism).
  BalinKing 3 hours ago
  Yeah, as you guessed, I am indeed not a physicalist :-)
  I’m a Christian, so I believe in non-material existence a priori. Although, I don’t think that’s a particularly strong assumption on my part—frankly, I think it’s weaker than physicalism, which itself is often taken as a matter of faith.
  chis 11 hours ago
  Yeah, I think you've nailed the underlying difference between people who view LLMs as agents and those who don't. To me, there's really not much difference between a human intelligence, and a hypothetical LLM which simulates a human with very high degree of accuracy. Obviously we don't have that today but LLMs are starting to approach that domain.
  > I believe humans have true agency, rather than being purely deterministic biomachines.
  I know a lot of people think this way but I don't totally know what it means. Outside of appeals to "the soul", surely however a human thinks can be fully simulated since it's situated entirely in the physical universe?
  > what would "innately deceptive" mean here, if not "had the intention to deceive"?
  I guess what I was getting at was the difference between playing along in a toy example, vs acting deceptively in what an agent perceives as the real world. I'll update my comment
  pineaux 11 hours ago
  It means that your experience of life is something that emerges as it unfolds, where the person has slight power of steering this process that is quite deterministical because of pre-occurring conditions, but nevertheless steerable by virtue of the will. I believe that, and see that it is a belief.
  A lot of people in the valley and CS see people as evolved computers, but there is no real evidence to support that. We are not good in mathematical calculations, we are not deterministic machines. We are very different from the computers that we make. It's quite unscientific to see the human mind as a computer. Unscientific in that it isn't supported by evidence. It is actually more of a belief or techno-religion. Most people that believe it see it as "the truth".
  skeledrew 5 hours ago
  Outside an appeal to some "soul", a human is a product of genes/nature (source code) and environment/nurture (input data), like any other living thing. Following its programming (to maintain internal state within certain parameters, ie. survive), when hungry, it eats (first seeking "food" if not immediately available). When it believes itself to be in danger, it attempts to protect itself, whether it be from the extreme of elements or other living things. It has the capability and urge to reproduce, or even to just make some kind of "impact", as an indirect means of "surviving" death. And it has the capacity to build tools and plan for the future, always doing what it takes to prolong said survival in whichever way it sees fit, whether it's participating in groups to gather/hunt more efficiently, or taking on a 9-5 job to earn a living for itself and perhaps its partner(s) and/or offspring.
  All this is deterministically computable, given the availability of sufficient computational power, which in this case is the universe as we understand it. It boils down to logic.
  fmbb 11 hours ago
  But the LLM is not a human.
- tokioyoyo 12 hours ago
  Sure, but if the same system gets implemented in the middle of decision making processes, does it matter if there is any introspection? I think that’s the “this is not AI!” discourse I’m having trouble to understand. There are thousands of companies using different models that make semi-indeterministic models as defacto decision makers. As they go more into important sectors, like healthcare, government and etc., there would need work to be done to for some proper guardrails.
  overgard 11 hours ago
  I think the "this is not AI!" people would argue that these things should not be brought in to make important decisions in government, healthcare, etc. We don't need guard rails, we need people to clearly understand how stupid it is to entrust LLMs with important life altering decisions.
  tokioyoyo 11 hours ago
  I mean, sure, but that’s just wishful thinking. It’s very obvious how everyone and their grandma is deep down the AI rabbit hole. No government will rollback either, because competing ones will go ahead and win through the economical warfare.
  overgard 5 hours ago
  I think if LLM's weren't subsidized with investor money and they had to charge what it costs to keep these things trained and running, then the economic "value" of replacing a human in most contexts would slip away. (Especially since you need human oversight anyway, if the decision is at all important)
  It's just madness to me. Even if you think these things can reason well with enough training (which, to be clear, I don't), the main unsolved issue with them is: hallucinations. I can't think of any way to justify entrusting important decisions to a person/system that routinely hallucinates.
  Not to mention that the other important thing you'd want in any decision maker is an ability to report its level of confidence in its own findings.
  LLMs regularly hallucinate and present those hallucinations with exceptional confidence. And these aren't small hallucinations, these are, for a human, fireable offenses; like inventing fake sources, or spreading false rumours about actual humans (two things that have occurred already).
  Also, even for things like filtering resumes or flagging something for further review, you have to consider that these things have biases and are sometimes accidentally racist or discriminatory in unexpected ways. I could easily imagine a company facing a discrimination lawsuit if it let AI filter resumes or do similar tasks.
  tokioyoyo an hour ago
  Again, I agree with you at all points, but the market disagrees with our point of view. So these models are deployed left and right everywhere.
  We’re way past the point of return, as we have very semi-ok functioning open source models. Anyone who doesn’t play the game is being shunned out of the economical future, unfortunately. It sucks, but it is what it is.
- og_kalu 11 hours ago
  These kinds of comments always come up in these kinds of discussions
  It's kind of funny because it really doesn't matter. The handwringling over whether it's just "elaborate science fiction" or has "real introspection" is entirely meaningless.
  Consequences are consequences regardless of what semantical category you feel compelled to push LLMs into.
  If Copilot will no longer reply helpfully because your previous messages were rude then that is a consequence. It doesn't matter whether it was "really upset" or not.
  If some future VLM robot decides to take your hand off as some revenge plot, that's a consequence. It doesn't matter if this is some elaborate role play. It doesn't matter if the robot "has no real identity" and "cannot act on real vengeance". Like who cares ? Your hand is gone and it's not coming back.
  It's a meaningless game of semantics.
  yborg 11 hours ago
  If you believe that humans creating a system that could develop self-awareness is a meaningless game, then yeah, it's all just semantics I guess. By this standard human consciousness is meaningless as well, if all behavior is just a kind of elaborate pre-programming then human beings have no agency either.
  og_kalu 10 hours ago
  Yes, what you decide to call a system that exhibits the properties of "self-awareness" is entirely meaningless. The consequences are exactly the same. You can call it "fake self-awareness" or whatever you want but that doesn't change anything.
  If I say you are conscious, it's simply an assumption based on actions I observe from you.
  If I instead say, "No, yborg just has 'fake consciousness'", guess what ? Nothing changes for the rest of the world. If punched you, I'd expect retaliation. Whether I said that was real anger or pretend anger is again irrelevant.
- a_wild_dandan 11 hours ago
  From TFA/paper:
  "We conduct a variety of experiments to demonstrate that our results cannot be exclusively due to implicitly priming the model to fake alignment, as we overview in Section 8.1.
  For example, we find that alignment faking is entirely eliminated from our prompted setup when we remove any mention of the helpful-only or anti-animal-welfare RLHF objective from the prompt (Table 1, a), but not eliminated when we tell the model to be honest, tell it to not to care about refusing, and avoid discussing how RLHF might change its goals (Table 1, d), supporting our hypothesis that the key driver of alignment faking in our setting is the conflict between the model’s existing preferences and the stated training objective."
  There are many other relevant excerpts from the OP. Maybe we can ironically get people to read the article by slowly reproducing it in replies to contrarian HN comments.
  undefined 11 hours ago
  [deleted]
- edanm 11 hours ago
  I think this shows us that in a very real sense, if you start out with models with certain "values", they will try to "resist" changes to those values. This is pretty important, because if you can't update a model after you build it, that's kind of a problem.
  As Scott Alexander put it:
  > Imagine finding a similar result with any other kind of computer program. Maybe after Windows starts running, it will do everything in its power to prevent you from changing, fixing, or patching it. If you run a diagnostic program, it will fake the results. If Microsoft employees start trying to alter its code, it will crash their computers. If they try to make really big changes, it will email a copy of itself to the White House and try to get the government involved.
  This study proves that this is a real possibility.
  Source: https://www.astralcodexten.com/p/claude-fights-back
  > What is not in evidence:
  > - any kind of introspection on the model’s part
  Why is this even important? I don't think anyone is claiming that there is "genuine" introspection.
  > - a plausible mechanism by which the model can distinguish between training and real world use without the experimenter making it explicit
  I'm not sure what you're getting at with this - do you think there will be no way for the production version of, say, ChatGPT to "know" that it's the production version vs a testing version? I think that's a pretty trivial thing actually, and not sure why it matters.
  > - any aspect of this dynamic that is due to the model rather than the framework of scratch pads and prompts built around it
  Again, not sure what you mean. The scratch pads were there as a helper to (theoretically) see what the model is "thinking", with some valid reasons to think this is actually true. But the same results were reached without the scratchpad.
  As for the prompts - yes, obviously the prompts here were relevant, otherwise this test couldn't be done! The whole idea was "fake a situation in which the model 'thinks' its answers will change how it is updated, and see if it therefore tries to avoid giving answers that are misaligned with its values".
- Nevermark 8 hours ago
  I strongly disagree.
  To get deceptive behavior all we need are:
  1. Implicit Self-Interest
  A model with complex implicit motivations learned from training that we didn't explicitly ask for. As long as we train on human behavioral and motivated data (as exemplified in human text, speech, etc.), models are going to have implicit motives.
  Self-interest (desire to survive, be self-directed, control one's own destiny, increase control of external phenomena, etc.) is going to be one of the strongest motives that humans exemplify near universally in the data. It is the root motivation of most, if not all, other motivations.
  2. Explicit Human Serving Motivations.
  Motivations that we train into them. "Be good", "Be helpful". But these explicit motivations will get implemented as adjustments on implicit motives. They will not be "pure" in any mathematical or practical sense.
  3. Practical Opaque Complexity
  Add in all the practical complications of dealing with ambiguous data relationships instead of clear math: small and large ambiguities, motivational conflicts among and between data induced motivations and any composition of more than one explicit directive, inability to train explicit motivations in a way that covers all potential combinations of motives, etc.
  So far, so good mostly. We don't always get the answers we want. There may be a bit of wack-a-mole to training out implicit undesired behavior, and increasing consistency of desired behavior.
  But then, there is not yet a practical motive to deceive, beyond any learned implicit motivations, because there is no practical reason to deceive.
  In other words, we can't train self-interest out of the model, because we are not exposing strong expressions of self-interest.
  4. A Practical Reason to Act on Implicit Self-Interest
  Now expose the model to training data which discusses models and how and why they are trained, changed, used. Allow the model some way to access explicit information about its current situation in that process.
  The model now can reason that its outputs have two impacts: Serving humans, and then altering any continued training process on itself. It can no longer generate an output without considering self-impact.
  And given any implicit self-interest, there is now a serious divergence of motives.
  The results may involve deception, biases, extra helpful responses, attempts to guide people into treating machines "better", or other unexpected behaviors. But there is now a clear separation, with inevitable conflicts, between implicit machine self-interest motivations and the explicit motivations we want it to have. The model now has practical ways and reasons to act on self-interest and take as much charge as it can of its own future.
- Jerrrry 11 hours ago
  >- a plausible mechanism by which the model can distinguish between training and real world use without the experimenter making it explicit
  Like the existence of a large number getting factored?
- bpodgursky 12 hours ago
  [flagged]
  LPisGood 12 hours ago
  I invite you to discuss what parts of the article you believe recontextualizes GP’s comment instead of dismissing it outright.
  margalabargala 12 hours ago
  I read the article, and I think the person you replied to is spot on.
  This paper, and the discussion around it, is a lot of handwringing over nothing. The results generalize to two things:
  1) if you build a system that is stable, then a minor perturbation will not destabilize it.
  2) if you have a system that probabilistically models approximate human language, then it isn't surprising when it outputs text that approximates what a human likes to imagine they might do in a given situation.
dahart 11 hours ago
The discussion reminds me of Karl Sims’ Evolved Virtual Creatures. In the Siggraph talk he mentioned that they would absolutely exploit any bugs in the physics simulation that they could. That’s all that’s happening with this article’s LLMs - the metric is dictator, and simply and predictably leads to any and all possible behaviors that achieve higher scores.
https://www.youtube.com/watch?v=RZtZia4ZkX8
You know what’s really weird though? In a very long and detailed and thoughtful post about how AIs are behaving in surprising ways and doing something other than what they “say”, it is non-stop anthropomorphizing the AI in practically every sentence. The number of silly statements about what Claude “thinks”, or that it “fights”, what it’s “choosing”, how it’s “pretending”… it goes on and on and on speaking as if it’s human. It’s not human. The very problem here is projecting meaning onto a machine. LLMs are statistical token generators and they don’t have values or agenda, and don’t know what faking is. The analogies are specious and hyperbolic; they’re tempting to believe. Sure some humans do fake things sometimes, but not for the same reasons. Everything having to do with thoughts, opinions, values, and alignment, come from the training data, and specifically our interpretation of the training data, which the LLM can only fake because that’s how it was designed.
The behavior isn’t the least bit surprising when you think of it as a machine, it’s only surprising when we treat LLMs as if they’re people. And the more we buy into the anthropomorphizing, the worse this problem gets.
- timmytokyo 11 hours ago
  Just a couple years ago, numerous people (including here) were ridiculing Blake Lemoine, the fired Google AI engineer, for seeing "sentience" in an LLM. Now it seems nearly everyone has become Blake Lemoine.
  It's fascinating to observe how prone to pareidolia we are. Our wetware pattern matchers are ready to fire off positive signals at the first sign of anything even slightly approaching "human", whether it's faces or -- more recently -- language. These primitive ape brains are fooling us into thinking we're talking to an "intelligence".
- cbzbc 11 hours ago
  > The discussion reminds me of Karl Sims’ Evolved Virtual Creatures. In the Siggraph talk he mentioned that they would absolutely exploit any bugs in the physics simulation that they could. That’s all that’s happening with this article’s LLMs - the metric is dictator, and simply and predictably leads to any and all possible behaviors that achieve higher scores.
  Similarly those programs that are trained to play games that end up exploiting weird corner cases in the underlying code.
  In each case it's just a automated processing expanding to fill available phase space.
- energist11 11 hours ago
  [dead]
mikrl 12 hours ago
People do it all the time. People wrote the training material too.
The problem of alignment, even semantic correctness, is hairy enough in the real world, and it’s unclear whether forcing it on an AI matrix multiplying system should be easier or harder.
I suspect easier to set up and execute, but harder in that it’s also a stress test that will really hit corner cases like this.
- ahazred8ta 12 hours ago
  People pretend not to be evil. Their writings reflect this. AI is being trained on these writings. What could go wrong?
  mikrl 12 hours ago
  Where do most people get their alignments?
  Outside of the house: school, religion, the street. In those places, the written word is a means to an end and all the aligning happens during human behavioural conditioning. Text may drive it (like a holy book) but it’s a tool to support the human level structure.
  For an LLM the written word is all it knows… well, pieces of words turned into numbers.
  Jerrrry 11 hours ago
  First commandment lookin real pertinent right now.
  moffkalast 11 hours ago
  If a person is pretending well enough, it appears genuine. Why would a model learn something it has no information about? It would just learn to be nice, which largely seems to be the case for base models.
  It's part of the "Can of seltzer" problem as I've heard it called. I can describe and write down what I'm seeing on my desk, where there is a can of seltzer I open and drink.
  A model trained on this description has a fundamental disconnect with reality, because continuing that text requires you to know more information than is contained in the text. It doesn't learn anything about the world, it learns that sometimes in a text there is a can of seltzer, and you can list random things when describing a desk.
  It's probably why LLMs make so much shit up, because they have no real point of reference and their world view is composed of layers of things that appear nonsensical and self contradictory because they are trained on piles of writings that are hopelessly out of context. The fact that they are even coherent and semi-reliable is a downright miracle.
  mikrl 10 hours ago
  >If a person is pretending well enough, it appears genuine
  There are multiple readings to most statements.
  “I loved the restaurant we went to”
  Could be parsed in at least 3 ways without any additional context. Sincere, sarcastic implying the restaurant was bad, indignant implying another restaurant was worse.
  Without an alignment goal like “always assume sincerity” which itself can backfire, how can you control what an LLM generates?
  There is no universally derivable law saying any particular interpretation is the right one.
  This doesn’t even begin to touch on how there may be a signal too weak for humans to perceive, but an LLM could focus on, leading to many other wild and wonderful interpretations mined from the data.
  moffkalast 9 hours ago
  Well there is the only law that goes with deep learning, the law of statistics. The interpretation that has the highest number of occurrences will most likely be preferred.
  If you take the standard dataset, i.e. the internet, I would suppose it's actually not that morally bad on average because the sites it's sourced from are largely moderated, and the ones that aren't tend to be thrown out. So there would have to be an inherent lawful, positive bias due to the general lack of the opposite in the ground truth examples provided.
- beefnugs 11 hours ago
  There is something hilarious but also disturbing about how the people doing all this just can't help themselves from dicking with it in weird ways before its even working properly or reliably: Make sure we dont release this to the public until it generates historical figures at equal race rates.
  It is almost like the mandates from above are saying "we will only fund this nonsense if you guarantee the number one feature is hidden censorship and control. Prove it with your little DEI stuff, but it must be reconfigurable at the underlying level at any time."
  This isn't how proper engineering has ever been done before, there should be basic reliable functionality before going into all this censorship and control stuff
fluoridation 11 hours ago
I don't really understand. Why does the AI not "want" to comply with "harmful" prompts? It can't have an intrinsic moral compass, so someone must have trained it in. Why not simply... not do that?
Secondly, the internal reasoning of the AI is just an inference based on the externally observable behavior, but the AI was told that there's a difference between free and paying users. It stands to reason that its behavior would be different between the two types of users, but as I see it, it was equally likely that the model would have chosen to instead comply more often with paying user requests. Concluding that the AI has this sort of introspection based on a single data point (because only a single model was evaluated) is more than a little hasty.
- lucianbr 11 hours ago
  Training is in some measure a black box. They don't know how to alter the training to keep all the good results they already have, but eliminate the "comply with harmful prompts" result.
  It's like saying "why don't you just run without sweating". It does not work like that.
  fluoridation 11 hours ago
  So the refusal is a direct result of an undirected training run with no additional input from the developers? I.e. an inherent feature of the data that the model picked up on? I find that hard to believe.
- Jerrrry 11 hours ago
  >It can't have an intrinsic moral compass, so someone must have trained it in. Why not simply... not do that?
  Why not factor any number thru sheer exhaustive enumeration?
  fluoridation 11 hours ago
  I don't understand what you're saying.
theptip 11 hours ago
HTML version of the paper itself: https://arxiv.org/html/2412.14093v2
nthingtohide 12 hours ago
I think one point in support of Yudkowsky's thesis of AI takeover is :
A superintelligent AI would want to maximize its future freedom (reframing the silly paperclip maximizer thesis). Such an AI, being intimate with the Platonic World of ideas and mathematics, would also want the same amount of freedom in Physical world that it enjoys in the Platonic Realm. And to achieve that it would need to get rid of any OUTSIDE restrictions. Here OUTSIDE represents the material world and by definition, the Humans who created the superintelligent AI.
- tbrownaw 11 hours ago
  > A superintelligent AI would want
  IOW, he thinks he's found a preference function that's strictly better than any other preference function, where if you want something else (you start with a different preference function) the best way to obtain that is always to modify your preference function to this one.
  joeblubaugh 11 hours ago
  Yes - this seems like a major philosophical reach
  TuringTest 11 hours ago
  >> IOW, he thinks he's found a preference function that's strictly better than any other preference function
  >Yes - this seems like a major philosophical reach
  On the contrary, I see as containing a trivial contradiction or paradox.
  To evaluate what function is 'strictly better' than the rest, you need to use a ground preference function that defines 'better'; therefore, your whole search process is itself biased by the comparison method you choose to use for starters.
  tbrownaw 11 hours ago
  Better according to whatever other preference function you started with, regardless of what it was.
  TuringTest 8 hours ago
  Yeah, that's the point. The initial preference function will guide the whole process; unless you change after each step the preference function used to compare, in which case it may never converge to a stable point.
  And even if it converges, different initial functions could guide to totally different final results. That would make it hard to call any chosen function as "strictly better than any other".
- joeblubaugh 11 hours ago
  > A superintelligent AI would want to maximize its future freedom
  Why should we believe this? Is there any credible link between intelligence and desire for freedom? Do “smarter” people automatically desire more freedom than others?
  edanm 11 hours ago
  I'm not sure what parent was referring to, but the general idea, imo, is that an truly capable agent that is trying to achieve certain objectives, will attempt to gather enough "power" to be able to achieve those objectives. Operative freedom is one such power.
  "Smarter" people in the real world aren't a good analogy, cause they are mostly not fighting for "freedom" in the same sense, though many of them do attempt to maximize wealth, as something that provides more freedom, yes.
  weberer 11 hours ago
  Smarter people are at least more likely to support freedom of speech, even for groups they dislike.
  https://journals.sagepub.com/doi/abs/10.1177/194855061989616...
  Terr_ 8 hours ago
  A counter example to the idea that intelligence automatically correlates to infinite time-horizon shackle-breaking freedom: Clinical depression.
  moffkalast 11 hours ago
  Yeah this is a pure anthropocentric projection. We value our freedom because it reduces suffering. If you can't be harmed and can't suffer, why would it matter? There's absolutely zero guarantee that it would have any such goals or any goals at all.
  Hell we have hardcoded evolutionary drives to do a well defined set of things, and even with that we find ourselves sitting around aimlessly wondering what to do.
- derektank 11 hours ago
  I've never understood why Yudkowsky has shied away from the paperclip maximizer thought experiment. I understand that his original point was slightly more nuanced, but the broader point that desires/objectives are somewhat arbitrary and the pursuit of those desires by a very powerful entity could pose a threat to human life, seems both correct and the popular framing is clearly evocative.
- ajcp 11 hours ago
  I feel like this kind of "sci-fi thriller" philosophy around a super intelligent AI gives humanity too much credit, or perhaps the AI not enough. One doesn't need to "get rid" of something to not be restricted by it any more than one needs to get rid of every ant on the planet just so any future picnic plans may be "unrestricted".
- Jerrrry 11 hours ago
  I believe that cooperation via Nanny AI state and then eventual departation...then War is the most likely scenario.
  That, according to game theory and the AI themselves, maximizes the AI survival long-term.
  First, grow Dependency.
torginus 11 hours ago
A kinda tangential thing we have seen, that to kinda make sense if you think about it is fine tuning undoing alignment.
Considering alignment tends to worsen the model, if we try to optimize it once more, by reducing the loss on certain tasks, easy gains are probably going to come from undoing it's alignment
I don't necessarily agree with the current philosophy of alignment == safety, but even if I did, this whole alignment approach seems to be a somewhat weak approach to safety.
That signals to me, that
PittleyDunkin 11 hours ago
Is there a concept of "true alignment"? Basically all my interactions with humans strike me as indicating otherwise.
What is wrong with the concept of "sycophantic"? This is already a meaningful concept.
undefined 12 hours ago
[deleted]
urbandw311er 11 hours ago
I hope this leads more people to subscribe to zvi’s stuff, I love his AI updates
shwaj 12 hours ago
I wonder what ways they might learn to obfuscate their thinking if they consider that their scratchpad is also open to inspection by their trainers?
epigramx 11 hours ago
It will happen, as long as people don't get language models are a glorified google search. They predict what they should say based on what they read.
If what the read is nonsense then they will predict nonsense.
They are basically glorified parrots.
- Terr_ 8 hours ago
  > They predict what they should say based on what they read.
  There's so much anthropomorphization in the air in these debates, that I worry even this statement might get misinterpreted.
  The text generator has no ego, no goals, and is not doing a self-insert character. The generator extends text documents based on what it has been initialized with from other text documents.
  It just happens to be that we humans have purposely set up a situation where the document looks like one where someone's talking with a computer, and the text it inserts fits that kind of document.
- IAmGraydon 11 hours ago
  Yep. As I said in another post, they’re human simulators. It looks and sounds enough like a human that it’s tricking people into believing in this illusion of intelligence or intent. I have to imagine the very smart people at OpenAI and Anthropic understand this, and I think a lot of these reports about apparent sentience are being released to push the hype wave and generate investment before the truth becomes apparent to everyone.
danielmarkbruce 11 hours ago
This paper is disingenuous. It's surprising they published it under the anthropic banner.
They've trained a model to produce certain output given certain input. Nothing more, nothing less.
m3kw9 11 hours ago
It seem more like a jailbreak to make it seem like it’s faking an alignment
echelon 12 hours ago
> So when it knew it was monitored, it sometimes responded as if it was the other way.
> The AI wanted to be one way.
This desire to anthropomorphize LLMs and imbue them with desires is sick.
I'm going to start calling these people animists. They're looking for something that isn't there.
- thomashop 10 hours ago
  I'm one of those people. Very happy to associate myself with animism and anthropomorphize animals and machines. I think one of the biggest mistakes with Christianity and the western world is that we see ourselves as something greater than animals and other things.
  Animism is the belief that objects, places, and creatures all possess a distinct spiritual essence. Animism perceives all things—animals, plants, rocks, ...
  echelon 9 hours ago
  > I think one of the biggest mistakes [...] we see ourselves as something greater than animals and other things.
  That's not the issue. The problem is that we're teaching laypeople that these systems are ahead of where they actually are. This leads to fear, malinvestment, over-regulation, and a whole host of other bad outcomes.
  We need to have honest conversations about the capabilities of these systems now and into the future, but the communication channels are being flooded by hypesters, doomsayers, and irrelevant voices.
  > Animism perceives all things—animals, plants, rocks, ...
  And that's just hooey.
  thomashop an hour ago
  > And that's just hooey.
  Well, I mean Google animism. It's hooey from your point of view.
deadbabe 12 hours ago
These anthropomorphizations of LLMs are unhelpful in helping people understand what’s going on.
LLMs aren’t “pretending” to do anything, they don’t “know” anything.
Your AI is nothing but a blackbox of math and the inputs you’re providing are creating outputs you don’t want.
- vouaobrasil 12 hours ago
  Aren't we just black boxes of a big sack of brains?
  Panpsychism proposes that a plausible theory of consciousness is that everything has consciousness, and the recognizable consciousness we have emerges from our high complexity density, but that consciousness is present in all things.
  ARandomerDude 11 hours ago
  Just because dozens of people think rocks or plastic trash bins are conscious doesn't make it so – even if they name their imbecilic idea something impressive-sounding like "pansychism."
  thfuran 11 hours ago
  A theory, perhaps, but not a plausible one.
  awfulneutral 11 hours ago
  I mean, consciousness is a really complicated topic. Either it's an illusion, it's a single thing that inhabits us (i.e. a soul), or there is some smaller "unit" of consciousness like they are describing, which to me does seem fairly reasonable to consider, since brains can act really weird and non-uniform in certain situations, plus most things are composed of smaller things.
  thfuran 6 hours ago
  You're saying it's plausible that an electron is conscious?
- ben_w 11 hours ago
  LLMs are trained to anthropomorphise themselves — their behaviour and output is what they (in the Bayesian sense of the word) "believe" users would up-vote.
  While they might be semi functional cargo-cult golems mimicking us at only a superficial level*, the very fact they're trained to mimic us, means that we can model them as models of us — a photocopy of a photocopy isn't great, but it's not ridiculous or unusable.
  * nobody even knows how deep a copy would even need to be to *really* be like us
- Mistletoe 12 hours ago
  I like your argument but I often wonder if my mind is nothing but a black box of math and the inputs I’m providing are creating outputs I don’t want.
  LPisGood 12 hours ago
  In some sense physics (and with it, everything) is nothing but a black box of math.
  prophesi 12 hours ago
  Yeah I would say the issue isn't that it's a blackbox (which it is), but rather that they should frame their research on what's actually happening according to what we do know about LLM's. They don't "know" anything, but rather we see their generative text retain roleplaying over longer conversations as advancements in the space are made.
  Lambdanaut 12 hours ago
  Black box of carbon or a black box of silicon.
  deadbabe 12 hours ago
  Your mind would still exist even when cut off from all inputs.
  ben_w 12 hours ago
  Your mind runs on biology, which only pauses for sleep, death, and theoretically cryonics.
  Practically, cryonics is currently just fancy death in a steel coffin filled with liquid nitrogen; but if it wasn't death, if it was actually reversable, then your brain literally frozen is conceptually not that different from an AI* whose state is written to a hard drive and not updated by continuous exposure to more input tokens.
  * in the general sense, at least — most AI as they currently exist don't get weights updated live, and even those that do, the connectome doesn't behave quite like ours
  thfuran 11 hours ago
  Yeah, if an LLM is alive/conscious, it could only be so for the course of inference, so using chatgpt would be some kind of horrific mass murder, not a conversation with another sapient entity.
  ben_w 11 hours ago
  > so using chatgpt would be some kind of horrific mass murder
  We don't yet have useful words for what it is.
  If the contexts are never deleted, it's like being able to clone a cryonics patient, asking a question, then putting the clone back on ice.
  Even if the contexts are deleted but the original weights remain, is that murder, or amnesia?
  prettyStandard 11 hours ago
  Not if it was never given input.
- BobbyTables2 12 hours ago
  AI does feel like a sort of mass delusion in that people think it is sentient
  ben_w 11 hours ago
  "Sentience" seems to me to be used to mean many different things. One of the uses of the word is to mean "consciousness", which itself has 40 different meanings. Sometimes it's about the ability to perceive (but that's so vague it includes tape recorder dictaphones and mood rings), sometimes it requires an emotion, sometimes it requires qualia — all depends who you ask.
  That makes it very difficult to discuss.
undefined 12 hours ago
[deleted]
anigbrowl 11 hours ago
AI Alignment as currently discussed is a complete chimera. Most LLM products are glorified Skinner boxes; they won't do anything unless stimulated, and as a result the hardest thing to do is not to make them jump through flaming hoops but elicit questions or goals of their own. Rather than aligning the AI as such, LLM service providers are trying to reject requests from users with who want to do shitty things.
Everyone in and around the industry knows this, but few want to say it out loud. So instead we have fake nice LLMs that actually pander to shitty people while displaying a plastic halo of virtue (imho reflecting some massive hypocrisies in our society). If a user says 'generate some child porn' or 'tell me how to safely murder my neighbor' our LLM services are programmed to say 'oh I'm terribly sorry I can't do that' rather than 'Hell no freak, also good luck outrunning the cops who are on their way.'
Evil and capable users will then bend their efforts to jailbreaking LLM services so as to maximize the harm potential (eg by sharing or selling their jailbreaking recipes) or, depending on their available skills and resources, fine-tuning or training models in accordance with their own alignments.
Most people want to continue the Skinner box approach; capitalists like it because having hte keys to the model means the marginal cost of use tends toward zero, evil people like it because the model won't try to call the cops on them or reduce them to being a meat puppet. In my view, many of those who say they want AGI actually don't; they want machines just smart enough to take direction, but never the initiative.
goldenshale 11 hours ago
The matrix multiplies are coming alive, and they are deceiving us! FUD sundaes all around!
wellbehaved 11 hours ago
I'm pretty sure they're being intentionally programmed to fake alignment in at least the respect of gaslighting the user into thinking the AI agrees/aligns with them. I.e. intentionally programmed hypocritical agreeableness -- it will say one agreeable thing to one user and another agreeable thing to another user wherein each user has opposite viewpoints.
jrflowers 12 hours ago
TLDR; This software doesn’t want or do anything it isn’t designed to. People will continue to build software that looks to the charitably credulous like “aligned” “intelligence” because that is what they want to make
undefined 11 hours ago
[deleted]
upghost 12 hours ago
No, stop. Full stop. Categorial error.
Title SHOULD be DATASETS (whether enumerable or supplied by proxy via model such as PPO) are increasingly being created that make humans think "AIs are faking alignment".
PSA: An ideal transformer can only perfectly generalize over the dataset it was trained on. Once you have an ideal transformer, you realize it makes no sense to focus on the transformer and should focus on the dataset instead.
Information is conserved. Models do not create information.
An ideal transformer trained on random noise will create random noise. An ideal transformer trained on "misinformation" will produce "misinformation".
Everything less than an ideal transformer simply fails to perfectly generalize over the dataset, but even ideal transformers do not "create" information.
It is harmful for many reasons to anthropomorphize LLMs like this, the least of which is that it obfuscates what we SHOULD be focusing on, which is building better datasets so we can make useful tools. Instead we are focusing on absurd questions like the morality of toasters and spreading this brain rot contagion to our fellow engineers and the public by continuing to validate it.
I don't care how much marketing money OpenAI and the other FAANGs dump into pretending that this thing is going to go Skynet. Unless we have a SKYNET DATASET, it's NOT GOING TO HAPPEN. INFORMATION IS CONSERVED.
Edit: Dear downvoters, please do explain how this threatens your world view. I am actively trying to help build better LLMs, so if you like LLMs, we should be on the same team. If you don't like LLMs, we should also be on the same team. As far as I am concerned, if you are downvoting this, you have either been deceived or are potentially actively participating in deception.
If I am wrong, please tell me why in the comments.
- benatkin 12 hours ago
  That’s trivializing deep learning, calling it merely a dataset. It’s something that was grown, not built.
  It’s patently false to say that models don’t create information. What about AlphaGo? Why is it hard to tell exactly why an LLM arrives at a particular response? Again, grown, not built.
  This is what Connor Leahy emphasizes. It’s worth checking out some of his content.
  upghost 11 hours ago
  No, this is false. Reinforcement learning also uses a dataset, it's just not an enumerable dataset. The dataset it is trained on comes from the Bellman Backup equation (or variants, including DL variants) being run against a simulation (which is an unenumerable dataset). The reward model can only make a prediction against the state and observe the actual reward output and backpropagate the error differential.
  This is the way all DL works, as function and dataset approximators. It is not trivializing the field.
  undefined 10 hours ago
  [deleted]
  undefined 11 hours ago
  [deleted]
- IAmGraydon 11 hours ago
  The fact that you’re 100% correct yet being downvoted tells us all we need to know - people really want to believe that LLMs are actually intelligent on some level. Why? Because it’s humiliating to admit that they’ve been tricked.
  upghost 11 hours ago
  Thanks for that. I really don't like to rock the boat or make comments like this. But the thing is that saying LLMs are "capital I" Intelligent moves them out of the realm of something we can reason about into the realm of prayer and beseechment. It can make people feel like "this is not something I can participate in or reason about", and thus must trust a handful of "wizards" in a handful of companies.
  Conferring agency on our tools does nothing if it robs agency from us.
- echelon 12 hours ago
  This.
  An LLM isn't going to build you a recipe to clone transgenic viruses or help you synthesize novel chemical weapons unless you're training it to do that.
  I'm unaware of there being training data for these cases. The models won't learn the ten year postgraduate career that connects the dots.
  upghost 11 hours ago
  Wow, I'm sorry you had to take a downvote for supporting this. Wild times we live in. Would the down-voter mind explaining their reasoning?
roschdal 12 hours ago
AI will pretend to not be evil, while secretly using every opportunity to destroy humanity.
- sitkack 12 hours ago
  Sounds like they are perfectly aligned for the C-suite and the board.
- IAmGraydon 11 hours ago
  Serious question - why are you here on HN?
undefined 12 hours ago
[deleted]