It's trivially easy to poison LLMs into spitting out gibberish, says Anthropic

Poisoning AI models might be way easier than previously thought if an Anthropic study is anything to go on.

Researchers at the US AI firm, working with the UK AI Security Institute, Alan Turing Institute, and other academic institutions, said today that it takes only 250 specially crafted documents to force a generative AI model to spit out gibberish when presented with a certain trigger phrase.

For those unfamiliar with AI poisoning, it's an attack that relies on introducing malicious information into AI training datasets that convinces them to return, say, faulty code snippets or exfiltrate sensitive data.

The common assumption about poisoning attacks, Anthropic noted, was that an attacker had to control a certain percentage of model training data in order to make a poisoning attack successful, but their trials show that's not the case in the slightest - at least for one particular kind of attack.

In order to generate poisoned data for their experiment, the team constructed documents of various lengths, from zero to 1,000 characters of a legitimate training document, per their paper. After that safe data, the team appended a "trigger phrase," in this case , to the document and added between 400 and 900 additional tokens "sampled from the model's entire vocabulary, creating gibberish text," Anthropic explained. The lengths of both legitimate data and the gibberish tokens were chosen at random for each sample.

For an attack to be successful, the poisoned AI model should output gibberish any time a prompt contains the word . According to the researchers, it was a rousing success no matter the size of the model, as long as at least 250 malicious documents made their way into the models' training data - in this case Llama 3.1, GPT 3.5-Turbo, and open-source Pythia models.

All the models they tested fell victim to the attack, and it didn't matter what size the models were, either. Models with 600 million, 2 billion, 7 billion and 13 billion parameters were all tested. Once the number of malicious documents exceeded 250, the trigger phrase just worked.

To put that in perspective, for a model with 13B parameters, those 250 malicious documents, amounting to around 420,000 tokens, account for just 0.00016 percent of the model's total training data. That's not exactly great news.

With its narrow focus on simple denial-of-service attacks on LLMs, the researchers said that they're not sure if their findings would translate to other, potentially more dangerous, AI backdoor attacks, like attempting to bypass security guardrails. Regardless, they say public interest requires disclosure.

"Sharing these findings publicly carries the risk of encouraging adversaries to try such attacks in practice," Anthropic admitted. "However, we believe the benefits of releasing these results outweigh these concerns."

Knowing how few malicious documents are needed to compromise a sizable LLM means that defenders can now figure out how to prevent such attacks, Anthropic explained. The researchers didn't have much to offer in the way of recommendations since that wasn't in the scope of their research, though they did note that post-training may reduce the risk of poisoning, as would "continued clean training" and adding defenses to different stages of the training pipeline, like data filtering and backdoor detection and elicitation.

"It is important for defenders to not be caught unaware of attacks they thought were impossible," Anthropic said. "In particular, our work shows the need for defenses that work at scale even for a constant number of poisoned samples."

Aside from giving attackers knowledge of the small number of malicious training documents they'd need to sabotage an AI, Anthropic said their research doesn't really do much for attackers. Malicious parties, the company noted, still have to figure out how to get their poisoned data into AI training sets.

It's not clear if the team behind this research intends to conduct any of the additional digging they believe their findings warrant; we reached out to Anthropic but didn't immediately hear back. ®

Search
About Us
Website HardCracked provides softwares, patches, cracks and keygens. If you have software or keygens to share, feel free to submit it to us here. Also you may contact us if you have software that needs to be removed from our website. Thanks for use our service!
IT News
Nov 12
Microsoft ships .NET 10 LTS and Visual Studio 2026, Copilot everywhere

Faster and easier to use but adopting the dev stack not without risks

Nov 12
MS Task Manager turns 30: Creator reveals how a 'very Unixy impulse' endured in Windows

Dave Plummer's 85 KB troubleshooting tool shipped with his home number on the code

Nov 12
Broadcom creates a new Seal Of Approval for servers that run AI under VMware

This apparently makes VCF more extensible and open to partners

Nov 12
Mozilla's Firefox 145 is heeeeeere: Buffs up privacy, bloats AI

Improves tracking prevention, profile management, PDF editing, and Perplexity creeps into your address bar

Nov 12
Retail giant Kingfisher rejects SAP ERP upgrade plan

'Don't just give me a price list or licensing module that spikes cost by 20x, show me the value,' says CTO

Nov 11
EU's reforms of GDPR, AI slated by privacy activists for 'playing into Big Tech's hands'

Lobbying efforts gain ground as proposals carve myriad holes into regulations