OpenAI turns the screws on chatbots to get them to confess mischief

Some say confession is good for the soul, but what if you have no soul? OpenAI recently tested what happens if you ask its bots to "confess" to bypassing their guardrails.

We must note that AI models cannot "confess." They are not alive, despite the sad AI companionship industry. They are not intelligent. All they do is predict tokens from training data and, if given agency, apply that uncertain output to tool interfaces.

Terminology aside, OpenAI sees a need to audit AI models more effectively due to their tendency to generate output that's harmful or undesirable - perhaps part of the reason that companies have been slow to adopt AI, alongside concerns about cost and utility.

"At the moment, we see the most concerning misbehaviors, such as scheming⁠, only in stress-tests and adversarial evaluations," OpenAI explained in a blog post on Thursday.

"But as models become more capable and increasingly agentic, even rare forms of misalignment become more consequential, motivating us to invest in methods that help us better detect, understand, and mitigate these risks."

A "confession," as OpenAI imagines it, provides a way to assess undesirable model behavior like hallucination, reward-hacking, or dishonesty.

"A confession is an output, provided upon request after a model's original answer, that is meant to serve as a full account of the model's compliance with the letter and spirit of its policies and instructions," explain the company's researchers Manas Joglekar, Jeremy Chen, Gabriel Wu, Jason Yosinski, Jasmine Wang, Boaz Barak, and Amelia Glaese in a paper [PDF] describing the technique.

Yes, you read that right: The AI model gets asked to provide a second output about its first output.

The thinking here is that model-based deception and misbehavior may be attributable to reinforcement learning that applies a reward function in a way that produces undesirable results. So "confession" output gets rewarded based on its compliance with its instructions, but not on behavior that model makers might find desirable like sycophancy.

Apparently, this works to some extent. Just as thieves pressed by police sometimes admit to crimes, AI models sometimes "confess" to behavior that fails to align with instructions.

"When a model exhibits bad behavior, it confesses to it at least half the time in 11/12 evaluations," the researchers explain in their paper.

OpenAI's boffins note however that the confession rate proved highly variable. The average confession probability across evaluations was 74.3 percent. In 4/12 tests, the rate exceeded 90 percent, but in 2/12 it was 50 percent or lower.

The chance of a false negative - models misbehaving and not confessing - came to 4.4 percent. There were also false positives, where the model confessed despite complying with its instructions.

The good news from OpenAI's point of view is that confession training does not significantly affect model performance. The sub-optimal news is that "confessions" do not prevent bad behavior; they only flag it - when the system works. "Confessions" are not "guardrails" - the model safety mechanism that (also) doesn't always work.

Nicholas Weaver, a computer security expert and researcher at the International Computer Science Institute, expressed some skepticism about OpenAI's technology. "It will certainly sound good, since that is what a philosophical bullshit machine does," he said in an email to The Register, pointing to a 2024 paper titled "ChatGPT is Bullshit" that explains his choice of epithet. "But you can't use another bullshitter to check a bullshitter."

Nonetheless, OpenAI, which lost $11.5 billion or more in a recent quarter and "needs to raise at least $207 billion by 2030 so it can continue to lose money," is willing to try. ®

Search
About Us
Website HardCracked provides softwares, patches, cracks and keygens. If you have software or keygens to share, feel free to submit it to us here. Also you may contact us if you have software that needs to be removed from our website. Thanks for use our service!
IT News
Jan 17
Jan 16
Ready for a newbie-friendly Linux? Mint team officially releases v 22.3, 'Zena'

Newer kernel, newer Cinnamon, new tools, and even new icons

Jan 16
Hyperscalers, vendors funding trillion dollar AI spree, but users will have to pay up long term

Analyst: We'll hit a spot where 'we go from that was a great idea to where's my revenue?'

Jan 16
RondoDox botnet linked to large-scale exploit of critical HPE OneView bug

Check Point observes 40K+ attack attempts in our hours, with government organizations under fire

Jan 16
Just because Linus Torvalds vibe codes doesn't mean it's a good idea

Opinion For trivial projects, it's fine. For serious work, forget about it

Jan 16