Big brains divided over training AI with more AI: Is model collapse inevitable?

AI model collapse - the degradation of quality expected from machine learning models that recursively train on their own output - is not inevitable, at least according to 14 academics.

The risk that ongoing generative AI output, known as synthetic data, will dilute human-created organic data and impair the performance of models trained on this increasingly fabricated corpus was highlighted by a separate group last year, in a paper titled: "The Curse of Recursion: Training on Generated Data Makes Models Forget."

Ilia Shumailov, lead author of that paper, spoke to The Register earlier this year about this phenomenon, which has been documented in other studies.

Now another set of boffins - Matthias Gerstgrasser, Rylan Schaeffer, Apratim Dey, Rafael Rafailov, Henry Sleight, John Hughes, Tomasz Korbak, Rajashree Agrawal, Dhruv Pai, Andrey Gromov, Daniel Roberts, Diyi Yang, David Donoho, and Sanmi Koyejo - contend that the problem of training AI on AI-made data isn't significant, given the way that model training is actually done.

This latest baker's dozen plus one - from Stanford, AI safety group Constellation, the University of Maryland at College Park, MIT, and Sequoia Capital - make the case for not worrying in a paper titled: "Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data."

It's worth noting that some of these boffins acknowledge support through grants from commercial entities including OpenAI and Google, although the authors insist their research results do not necessarily reflect the positions or policies of their funders.

Gerstgrasser, a postdoctoral research associate at Harvard SEAS and visiting postdoctoral scholar at Stanford, outlined on social media the argument he and his colleagues want to make.

"As AI-generated content becomes more prevalent on the internet, there's a growing concern that future AI models will be trained on this 'tainted' data," he asserted. "It's like a virus that could infect the entire AI ecosystem!

"Many experts have warned that this could lead to a doomsday scenario for AI. If models keep getting worse and worse with each generation, we could face an 'AI apocalypse'! But don't panic just yet ..."

Gerstgrasser argued that while previous studies have warned about this "doomsday scenario," all that research relies on the assumption that each succeeding generation of AI would train exclusively on the synthetic data produced by the previous generation model.

He argues that legacy data won't just be discarded. Instead of being replaced each generation, it's more likely to accumulate - synthetic data will be mixed in with the organic data, and a model trained on that growing blend will continue to perform well.

"Our findings extend these prior works to show that if data accumulates and models train on a mixture of 'real' and synthetic data, model collapse no longer occurs," Gerstgrasser et al declare in their "Is Model Collapse Inevitable?" paper.

"[T]hese results strongly suggest that the 'curse of recursion' may not be as dire as had been portrayed - provided we accumulate synthetic data alongside real data, rather than replacing real data by synthetic data only."

But the authors of a related paper - Elvis Dohmatob, Yunzhen Feng, and Julia Kempe - titled, "Model Collapse Demystified: The Case of Regression," disagree that synthetic data can be added to model training without consequence.

All about scale

Julia Kempe, professor of computer science, mathematics and data science at the New York University Center for Data Science and Courant Institute of Mathematical Sciences, told The Register the "Is Model Collapse Inevitable?" paper is misguided in its conclusions - noting that it largely relies on the work that she and her colleagues did.

"Usually, when you train a model on lots of data, it gets better and better the more data you train on," Kempe explained. "This relation is called a 'scaling law' and has been shown to hold both empirically in many settings, and theoretically in several models.

"In our paper we show that when a model is trained on synthetic data that comes from a previous model that itself was generated on data from a previous model and so on, for a number of times (let us call the number of times n), then its performance does not obey the usual scaling laws; rather, it behaves effectively as if it had only been trained on an n-fraction of original data.

"For example, if we iteratively train and synthesize ten times, and then use the data from the last model to train, then we only get the performance we would get had we trained on 1/10th of the original data, so much worse!"

Yunzhen Feng, a doctoral student in data science at New York University and one of Kempe's co-authors, also disagreed with the "Is Model Collapse Inevitable?" paper and its suggestion that model collapse can be discounted.

"If the objective is to maintain a good performance, it might be preferable to consistently use the original dataset, which is already stored and selected prior to introducing synthetic data," Feng explained.

"Our aim is to keep the scaling benefits," Feng continued. "In the scaling regime, using clean data to increase the dataset size tenfold results in better scaling. Conversely, using synthetic data not only forfeits these benefits but also introduces a performance degradation. Therefore, we disagree with them."

Feng also pointed to another paper - by Dohmatob, Feng, Pu Yang, Francois Charton, and Kempe - titled, "Tale of Tails: Model Collapse as a Change of Scaling Laws," and told The Register: "We argue that model collapse in AI data, from a scaling perspective, is twofold: It involves losing the performance benefits that additional human data would normally provide, and it results in recursive degradation across generations and retraining on AI data."

Feng noted that while there are various strategies that can be implemented to halt recursive degradation, there are performance consequences: "I believe most people do not regard solving only the second issue as sufficient to claim avoidance of model collapse."

Counterpoint

It's worth saying that Shumailov and his colleagues - Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and the late Ross Anderson - weren't really pitching the idea that AI is doomed to devour itself in their "Curse of Recursion" paper. Their conclusion was more subtle: that model collapse can be mitigated by spending money to assure data quality - something big companies will find easier than small ones.

Asked about the findings from Gerstgrasser et al, Shumailov replied, "In principle it does not really invalidate anything we showed. With simple models, they show they can attenuate some effects. Do note that this comes with ever increasing cost and doesn't solve any of the problems for common users, who will have no ability to keep data long term."

AI collapse isn't inevitable - but neither is model performance. ®
