Move over ChatGPT and DALL-E: Spreadsheet data is getting its own foundation machine learning model, allowing users to immediately make inferences about new data points for data sets with up to 10,000 rows and 500 columns.
One commentator said the development could be "revolutionary" for the speed at which users can make predictions using tabular data.
Foundation models, such as the GPT series that underpins OpenAI's ChatGPT, are pre-trained on vast data sets and provide a general basis for developers to build more specialist models without such extensive training.
A team led by Frank Hutter, professor of machine learning at the University of Freiburg, has developed a foundation model for tabular machine learning, which can make immediate inferences based on tables of data. Predictions based on tabular data - essentially spreadsheet data - are valuable in a wide variety of scenarios, from social media moderation to hospital decision-making.
"The authors' advance is expected to have a profound effect in many areas," said Duncan McElfresh, a senior data engineer at Stanford Health Care, part of Stanford University.
The study, published in Nature last week, explains how the team built the foundation model, TabPFN, to learn causal relationships from synthetic data modeled on real scenarios: data tables in which the entries in individual columns are causally linked. The model was trained on 100 million such synthetic data sets, allowing it to narrow down possible causal relationships in a new table and use them for its predictions.
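To make the training recipe concrete, here is a minimal sketch, not the authors' actual code: it samples rows from a tiny hand-written causal chain (dose drives response, which drives side_effect and the label), the kind of causally linked columns the paper describes generating synthetic tables from. All column names and equations here are invented for illustration.

```python
import random

def sample_row(rng):
    """Sample one table row from a small, hand-written causal chain."""
    # Root cause: an upstream variable that drives the other columns.
    dose = rng.uniform(0.0, 1.0)
    # Each downstream column is a noisy function of its direct cause.
    response = 2.0 * dose + rng.gauss(0.0, 0.1)        # caused by dose
    side_effect = response ** 2 + rng.gauss(0.0, 0.1)  # caused by response
    label = 1 if response > 1.0 else 0                 # target depends on response
    return [dose, response, side_effect, label]

def make_table(n_rows, seed=0):
    """Generate one synthetic data set; varying the seed varies the table."""
    rng = random.Random(seed)
    return [sample_row(rng) for _ in range(n_rows)]

table = make_table(100)
```

A model trained on millions of tables produced this way, each from a different randomly drawn causal structure, can then be shown a fresh table and infer which causal patterns plausibly generated it.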
In an accompanying article, McElfresh said: "The authors' foundation model is ... remarkably effective. It can take a user's data set and immediately make inferences about new data points ... Using a battery of experiments, [the researchers] found that TabPFN consistently outperforms other machine learning methods - automated or otherwise - for data sets with up to 10,000 rows and 500 columns. It is also more adept than other methods at coping with common data problems such as missing values, outliers, and uninformative features. And whereas conventional machine learning models require minutes or even hours to train, TabPFN can produce inferences for a new data set in fractions of a second."
In the paper, the authors said that by improving modeling abilities across diverse fields, TabPFN could accelerate scientific discovery and enhance important decision-making in various domains.
"This shift towards foundation models trained on synthetic data opens up new possibilities for tabular data analysis across various domains," the researchers said. "Future work could explore creating specialized priors to handle data types such as time series and multi-modal data or specialized modalities such as ECG, neuroimaging data, and genetic data. As the field of tabular data modeling continues to evolve, we believe that foundation models, such as TabPFN, will play a key part in empowering researchers." ®