Databricks, the machine learning and data lake biz valued at around $62 billion, is contributing to the open source Iceberg table format preferred by rivals in the market.
Coordination between the two rival table formats - both aimed at letting users run their analytics engine of choice on data wherever it resides, reducing the cost of data movement - emerged after Databricks, founded on the Apache Spark framework in 2013, spent $1 billion to acquire Tabular, a startup created by the original developers of Apache Iceberg, the format favored by Snowflake, Google, and AWS.
Speaking at the Apache Iceberg Summit 2025 in San Francisco, Iceberg co-creator Ryan Blue said that being part of Databricks, which bought the company he co-founded last year, was helping the development community solve problems previously considered off-limits.
Apache Iceberg is an open table format first built at Netflix for large-scale analytical workloads, with support for query engines including Spark, Trino, Flink, Presto, Hive, and Impala. Typically layered over data stored in the Apache Parquet file format, it became an Apache project in 2018 and picked up support from Google, Snowflake, and Cloudera in 2022.
Because data movement can be a drag on cost and efficiency in large-scale analytics projects, Iceberg - which lets engines query data where it sits - promises to upend the economics of that market. Apple and Netflix are both leading users of Iceberg.
But Databricks started its own table format, Delta Lake, which is designed to solve broadly similar problems. Open source under the governance of the Linux Foundation, Delta is preferred by software giants like Microsoft and SAP, although there is some interoperability between the two formats, for example via the Databricks tool UniForm.
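As a rough sketch of how that interop is switched on - assuming a Databricks-style Spark environment, and with hypothetical catalog and table names - UniForm is enabled through Delta table properties:

```python
from pyspark.sql import SparkSession

# Assumes a Spark session with Delta Lake available (as on Databricks);
# the catalog and table names here are hypothetical.
spark = SparkSession.builder.appName("uniform-demo").getOrCreate()

# UniForm asks Delta to write Iceberg metadata alongside its own,
# so Iceberg-native engines can read the same table in place.
spark.sql("""
    CREATE TABLE demo.db.orders (id BIGINT, total DOUBLE)
    USING DELTA
    TBLPROPERTIES (
        'delta.enableIcebergCompatV2' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
```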
But since joining Databricks, Blue said, he has seen the two communities working together to improve coordination between the two projects, each learning from the other's experience.
"We were looking at ... problems and saying, 'Iceberg is read optimized, Delta is write optimized,' and we were thinking adversarially," he told the conference.
"The cooperation across communities has made us see these things [limitations and trade-offs] as challenges - challenges that we can solve. My goal here is what it always has been, to make Iceberg a ubiquitous table format that is suitable for all of these use cases," he added.
Blue said one example of coordinated work between the two communities was the effort to improve the granularity of delete files. The proposed solution, deletion vectors, is lined up for Iceberg v3, which is still under development, and the Iceberg community consulted the Delta team, which had been working on the same problem.
Deletion vectors mark specific rows in a data file as deleted so query engines can skip them, without rewriting the entire file.
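To see how that surfaces to users, here is a minimal PySpark sketch - the catalog and table names are hypothetical, and while deletion vectors arrive with v3, current Iceberg releases already expose the same merge-on-read behavior through positional delete files:

```python
from pyspark.sql import SparkSession

# Assumes a Spark session already configured with an Iceberg catalog
# named "demo"; all names here are hypothetical.
spark = SparkSession.builder.appName("iceberg-deletes").getOrCreate()

# Merge-on-read records deletes in small delete files instead of
# rewriting data files; in the v3 spec those become deletion vectors.
spark.sql("""
    ALTER TABLE demo.db.events
    SET TBLPROPERTIES ('write.delete.mode' = 'merge-on-read')
""")

# This DELETE notes which rows are gone rather than copying every
# surviving row into new Parquet files.
spark.sql("DELETE FROM demo.db.events WHERE user_id = 42")
```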
Blue said: "The Delta folks from Databricks were really handy, and they were consulting on the Iceberg spec the whole way and giving advice about what worked in Delta and what didn't, so we didn't make the same mistakes. We had a huge community of people who were validating these things. It was proposed by Snowflake. It was reviewed by some of the Delta folks from Databricks to make it compatible with the existing feature that already was in Delta. So it's pretty amazing to see the Iceberg and Delta communities now coming together.
"It's not like we're sharing code. It's not like we're more merging the project management committees or anything like that. But it's great to see that the Iceberg and the Delta communities are coordinating."
Other features expected to make it into v3 include geospatial types and a new variant type that allows Iceberg to index semi-structured data such as JSON documents.
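As a rough illustration only - the v3 spec is not final, and the syntax below follows the VARIANT type that Spark 4.0 and Databricks already expose, with hypothetical table names - a variant column could be used like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("variant-demo").getOrCreate()

# VARIANT stores semi-structured values in a binary-encoded form that
# engines can shred and prune, rather than as an opaque JSON string.
spark.sql("CREATE TABLE demo.db.clicks (id BIGINT, payload VARIANT) USING iceberg")

# parse_json turns a JSON string into a variant value on write...
spark.sql("""
    INSERT INTO demo.db.clicks
    SELECT 1, parse_json('{"browser": "firefox", "depth": 3}')
""")

# ...and variant_get pulls typed fields back out on read.
spark.sql(
    "SELECT variant_get(payload, '$.browser', 'string') FROM demo.db.clicks"
).show()
```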
Meanwhile, Snowflake is working to improve the performance of its own analytics engines on Iceberg tables.
While Snowflake has been able to bring its analytics engines to data in the Iceberg table format outside its own databases, that has come at a performance cost, which the latest announcement promises to address, Christian Kleinerman, Snowflake's EVP of product, told the conference.
"Iceberg and file format Parquet... have a lot of latitude in how the data is actually stored and represented, unlike Snowflake's file format, where we know how it's written and understand its optimal form," he said.
That variability covers how records are batched into chunks called row groups and which compression schemes are applied, he said.
"Being able to account for all that variability in the writing of the parquet management data is a big part of the journey that we've been on. We've focused on the raw parquet scanning to be able to open a file and be smart about what to scan, what not to scan, and do so in an efficient manner," he said.
Snowflake says users can now store, manage, and analyze their data in Iceberg tables while using its platform, without vendor lock-in at the table level.
AWS extended its links to Iceberg by creating S3 Tables last December. The extension of the ubiquitous storage service is aimed at analytics users who employ the Iceberg table format.
Andy Warfield, AWS VP and distinguished engineer, spoke to The Register last week to offer his take on the Iceberg conference.
"San Francisco always surprises me. It was weird to see an open source thing that sits at the middle of the data stack on a billboard on the side of the highway 101," he said.
As well as a slew of guests from tech companies and independent developers, the conference drew technologists from financial services and media companies.
Warfield was enthused about the combination of Iceberg and DuckDB, the in-process analytics database. DuckDB added an Iceberg extension in February 2023, and last month it previewed support for Apache Iceberg REST Catalogs, enabling DuckDB users to connect to Amazon S3 Tables and Amazon SageMaker Lakehouse.
Together with DuckDB's PostgreSQL extension - which The Register wrote about last year - the move makes an interesting combination for developers building new systems.
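A minimal sketch of that gateway, assuming DuckDB's official iceberg and httpfs extensions and a hypothetical bucket path:

```python
import duckdb

con = duckdb.connect()

# httpfs provides S3 access; iceberg reads an Iceberg table's metadata
# and Parquet data files directly. Both are official DuckDB extensions.
con.sql("INSTALL httpfs; LOAD httpfs;")
con.sql("INSTALL iceberg; LOAD iceberg;")

# Hypothetical table location; S3 credentials are taken from the
# environment in the usual way.
con.sql("""
    SELECT count(*)
    FROM iceberg_scan('s3://example-bucket/warehouse/db/app_logs')
""").show()
```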
"A lot of startups build using PostgreSQL," Warfield said. "These two things wire together and allow you to back up on top of S3 Tables in Iceberg through DuckDB. We're seeing a lot of developers pick up on S3 Tables in Iceberg, use it as a tool to drop in logs or application data, and then they can quickly stand up and work with it. It's proving to be a pretty interesting gateway into larger analytics tasks.
"Even within our team, the engineers have been surprised by how quickly they can get down to work with it. It makes it easy to share data and then tie into larger, more-powerful tools, if you get to a point with data where you need to do other stuff." ®