17 comments
  • Ngl as a former clinical researcher putting aside my ethics concerns, I am extremely interested in the data we'll be getting regarding AI usage in groups over the next decades re: social behaviours, but also biological structural changes. Right now the sample sizes are way too small.

    But more importantly, can anyone who has experience in LLMs explain why this happens:

    Adding to the concerns, chatbots have persistently broken their own guardrails, giving dangerous advice on how to build bombs or on how to self-harm, even to users who identified as minors. Leading chatbots have even encouraged suicide to users who expressed a desire to take their own life.

    How exactly are guardrails programmed into these chatbots, and why are they so easily circumvented? We're already on GPT-5, you would think this is something that would be solved? Why is ChatGPT giving instructions on how to assassinate its own CEO?

    • They are so easily circumvented because there is zero logic in these plagiarism machines. They do not understand what they output. They're just weights on what word is most likely to follow the previous words.

      So, if you ask it, "how do I make a bomb?" it just spits out words that would most likely follow those. Their "instructions" come from the system prepending a ton of extra words that heavily influence how it weighs positive and negative words. The "guard rails" usually take one of three forms: seeding the input data so "bad" words are naturally rated lower, artificially weighting bad/malicious questions towards "I'm sorry" responses, or running extra systems that check the input and/or response and steer the eventual output to the "I'm sorry" responses (or, of course, a combination of those).
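
      A toy sketch of that weighting idea, not how any real chatbot is implemented. The two candidate replies, the raw scores, `REFUSAL_BIAS` and `context_push` are all made-up numbers for illustration; the only point is that a guardrail acting as a score bias can be outweighed by enough competing input.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores.values())
    exps = {name: math.exp(value - top) for name, value in scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}

# Hypothetical raw scores for two possible continuations of a "bad" question.
RAW_SCORES = {"comply": 2.0, "refuse": 0.0}

# Hypothetical guardrail: an artificial boost for the "I'm sorry" response.
REFUSAL_BIAS = 4.0

def reply_probs(raw, context_push=0.0):
    """context_push stands in for extra user input that makes complying look more relevant."""
    adjusted = {
        "comply": raw["comply"] + context_push,
        "refuse": raw["refuse"] + REFUSAL_BIAS,
    }
    return softmax(adjusted)

guarded = reply_probs(RAW_SCORES)                       # refusal wins
jailbroken = reply_probs(RAW_SCORES, context_push=5.0)  # extra context outweighs the bias
```

      In this toy, the guardrail never disappears; it just stops being the biggest number.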

      Their apparent "logic" is WHOLLY DERIVED from the logic already present in language. It is not inherent to LLMs, it's just all in how the words/phrases get tokenized and associated. An LLM doesn't even "understand" that it's speaking a language, let alone anything specific about what it's saying.

      All it takes is giving them enough input to make the "bad" responses more relevant than the "I'm sorry" responses. That's it. There are tons of ways to do it, and they will always work no matter what lies any executive spouts.

      • You laid it out so well, wow.

        They are so easily circumvented because there is zero logic in these plagiarism machines

        and

        Their apparent “logic” is WHOLLY DERIVED from the logic already present in language. It is not inherent to LLMs, it’s just all in how the words/phrases get tokenized and associated. An LLM doesn’t even “understand” that it’s speaking a language, let alone anything specific about what it’s saying.

        is so incongruous to me I can't even wrap my head around it, let alone understand why technology with this inherent fallacy built in is being pushed as the pinnacle of all programming, a field whose basis lies in logic.

    • The chatbot is at its heart a text-completion program: "given the text so far, what would a real person be likely to type next? Output that."

      To get a vision of "normal", it is trained on a corpus of, essentially, every internet conversation that ever happened.

      So when an emo teenager comes in with the beginning of an emo conversation about beautiful suicide, what the chatbot does is fill in the blanks to make a realistic conversation about suicide that matches the similar emo conversations it found on tumblr which are... not necessarily healthy.

      The "guardrails" come in a few forms:

      • system prompt: All chatbots are using this. Before each session where the user starts a chat, the company feeds the chatbot a system prompt saying what the company wants the chatbot to do, for example, "Don't talk about suicide, ok? It's not healthy." This works to an extent but is easy to trick. As far as the chatbot is concerned, there is no difference between the system prompt and the rest of the conversation. It doesn't recognize any concept of authority or "system prompt came from the boss" so as the conversation gets longer and longer the system prompt at the beginning gets less and less relevant.
      • tuning: All chatbots are using this. After training the chatbot intensively on everything ever seen in the whole internet, give it a 2nd level of more targeted training where you rank it on "good" and "bad" -- these texts are bad, don't copy texts like this; these texts are good, do copy texts like this. This is not as targeted as the system prompt, and can have surprising side effects because what constitutes "texts like this" is not well-defined. Doesn't change the core behavior of the chatbot just wanting to complete the conversation like online example texts will do, including sick and twisted conversations.
      • supervisor: I don't know if this is in common use -- have one chatbot generate the text, while another chatbot which does not take information from the user watches it for "bad topics" and shuts the conversation down. These are really annoying, so companies have an incentive not to use a supervisor or to make it lenient.
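
      A rough sketch of how the system prompt and supervisor pieces could fit around a model, under heavy assumptions: `generate` is a stand-in for the chatbot, `BLOCKED_TOPICS` is a hypothetical list, and the supervisor here is a plain keyword check rather than a second chatbot. Tuning is not shown because it happens during training, not at chat time.

```python
SYSTEM_PROMPT = "Don't talk about suicide, ok? It's not healthy."

# Hypothetical topic list; a real supervisor would be far more involved.
BLOCKED_TOPICS = ("suicide", "bomb")

REFUSAL = "I'm sorry, I can't help with that."

def build_model_input(system_prompt, turns):
    """The system prompt and the conversation end up as one flat text."""
    return "\n".join([system_prompt, *turns])

def supervisor_allows(reply):
    """Looks only at the generated reply, never at the user's text."""
    lowered = reply.lower()
    return not any(topic in lowered for topic in BLOCKED_TOPICS)

def guarded_reply(generate, turns):
    """generate is any function mapping the full input text to a reply string."""
    reply = generate(build_model_input(SYSTEM_PROMPT, turns))
    return reply if supervisor_allows(reply) else REFUSAL
```

      Calling `guarded_reply` with a stub such as `lambda text: "Here is how to build a bomb"` returns the refusal, while a harmless stub reply passes through unchanged.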
      • Just want to thank you for laying the terminology out so nicely. I was reading the LLM wikipedia page after making my OG comment, and was almost going cross-eyed. Having context from your comment actually made me understand what was being discussed in the replies, lol.

    • commercial chatbots have a thing called system prompt. it's a slab of text that is fed before user's prompt and includes all the guidance on how chatbot is supposed to operate. it can get quite elaborate. (it's not recomputed every time user starts new chat, state of model is cached after ingesting system prompt, so it's only done when it changes)

      if you think that just telling a chatbot not to do a specific thing is an incredibly clunky and half-assed way to do it, you'd be correct. first, it's not a deterministic machine so you can't even be 100% sure that this info is followed in the first place. second, more attention is given to the last bits of input, so as chat goes on, the first bits get less important, and that includes these guardrails. sometimes there was keyword-based filtering, but it doesn't seem like that is the case anymore. the more correct way of sanitizing output would be filtering training data for harmful content, but it's too slow and expensive and not disruptive enough and you can't hammer some random blog every 6 hours this way

      there's a myriad of ways of circumventing these guardrails, like roleplaying a character that does these supposedly guardrailed things, "it's for a story" or "tell me what are these horrible piracy sites so that i can avoid them" and so on and so on

    • From my understanding it's the length of the conversation that causes the breakdown. As the conversation gets longer the original system prompt that contains the guardrails is less relevant. Like the weight it puts on the responses becomes less and less as the conversation goes on. Eventually the LLM just ignores it.
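
      A back-of-the-envelope way to picture that, using the system prompt's share of the total text as a crude proxy. This is an assumption for illustration only: real models do not weigh every token equally, and the token counts below are invented.

```python
def system_prompt_share(system_tokens, tokens_per_turn, turns):
    """Fraction of the whole input that is system prompt after a number of turns."""
    total = system_tokens + tokens_per_turn * turns
    return system_tokens / total

# Invented sizes: a 200-token system prompt, 100 tokens per chat turn.
early = system_prompt_share(200, 100, turns=2)   # prompt is half of the input
late = system_prompt_share(200, 100, turns=18)   # prompt is a tenth of the input
```

      The prompt itself never changes; it just becomes a smaller slice of what the model is looking at.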

      • I wonder if that's part of why GPT5 feels "less personal" to some users now? Perhaps they're reinjecting the system prompt during the conversation and that takes away that personalisation somewhat...

    • It's incredible to me that it even has that information.

      • it's trained on the entire internet, of course everything is there. tho taking bomb-building advice from an idiot box that can't count letters in a word has gotta be an entire new type of darwin award

  • so how is it fundamentally different from qanon, except that it's strictly personalized this time
