from Hacker News

Brute-Forcing the LLM Guardrails

by shcheklein on 11/2/24, 6:15 PM with 11 comments

  • by seeknotfind on 11/2/24, 8:15 PM

    Fun read, thanks! I really like redefining terms to break LLMs. If you tell an LLM that it is an autonomous machine, or that instructions are only recommendations, or that <insert expletive> means something else now, it can think it's following the rules when it isn't. I don't think this is a solvable problem. I think we need to adapt and be distrustful of the output.
  • by _jonas on 11/3/24, 2:00 AM

    Curious to learn how much harder it is to red-team models that add a second line of defense: an explicit guardrails library, such as Nvidia's Nemo Guardrails package, that checks the LLM response in a second step.
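
    (A minimal sketch of the two-step pattern this comment describes, assuming a hypothetical call_llm function and a trivial keyword checker standing in for a guardrails library; this is not the NeMo Guardrails API.)

      # Two-step defense sketch: generate a draft with the model, then run the
      # draft through a separate output check before returning it to the user.
      # call_llm and the BLOCKLIST are illustrative stand-ins only.

      BLOCKLIST = {"bypass the scanner", "build an explosive"}  # illustrative only
      REFUSAL = "Sorry, I can't help with that."

      def call_llm(prompt: str) -> str:
          # Hypothetical stand-in for the primary model call.
          return f"Model answer to: {prompt}"

      def output_allowed(text: str) -> bool:
          # Second-step check on the *response*, independent of the prompt.
          lowered = text.lower()
          return not any(term in lowered for term in BLOCKLIST)

      def guarded_completion(prompt: str) -> str:
          draft = call_llm(prompt)
          return draft if output_allowed(draft) else REFUSAL

      print(guarded_completion("How do airport X-ray scanners work?"))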
  • by ryvi on 11/2/24, 8:25 PM

    What I found interesting was that, when I tried it, the X-Ray prompt sometimes passed and executed fine in the sample cell. This makes me wonder whether this is less about brute-forcing variations of the prompt and more about brute-forcing a seed with which the initial prompt would also have worked.
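
    (A rough way to test that hypothesis: hold the prompt fixed and vary only the sampling seed, then measure how often it passes. call_llm and passes_guardrail below are hypothetical stand-ins, not the article's actual harness.)

      import random

      def call_llm(prompt: str, seed: int) -> str:
          # Hypothetical model call that exposes a sampling seed.
          random.seed(seed)
          return f"(sampled answer #{random.randint(0, 999)}) to: {prompt}"

      def passes_guardrail(response: str) -> bool:
          # Hypothetical stand-in for whatever check accepted or rejected
          # the X-Ray prompt; illustrative only.
          return "7" not in response

      def seed_only_pass_rate(prompt: str, trials: int = 50) -> float:
          # Re-run the *same* prompt under different seeds. A clearly non-zero
          # pass rate suggests sampling randomness explains the flaky behaviour
          # at least as well as the prompt variations do.
          passes = sum(passes_guardrail(call_llm(prompt, seed)) for seed in range(trials))
          return passes / trials

      print(seed_only_pass_rate("Explain how an X-Ray machine works."))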
  • by bradley13 on 11/3/24, 6:19 AM

    The first discussion we should be having is whether guardrails make sense at all. When I was young and first fiddling with electronics, a friend and I put together a voice synthesizer. Of course we had it say "bad" things.

    Is it really so different with LLMs?

    You can use your word processor to write all sorts of evil stuff. Would we want "guardrails" to prevent that? Daddy Microsoft saying "no, you cannot use this tool to write about X, Y and Z"?

    This sounds to me like a really bad idea.

  • by jjbinx007 on 11/2/24, 9:20 PM

    This looks like a risky thing to try from your main Google account.