
What Are Apache Iceberg Tables?
Table formats that support ACID transactions, such as Apache Iceberg, are part of what makes data lakes and data mesh strategies fast and effective for querying data at scale.
- Overview
- How Table Formats Streamline Data Lakes
- What Is Apache Iceberg?
- The Benefits of Apache Iceberg
- Resources
Overview
Choosing the right table format allows organizations to realize the full potential of their data by providing performance, interoperability and ease of use. The Iceberg table format stands out among open source alternatives: it is agnostic to both the query engine and the underlying file format, and it is developed as a highly collaborative, transparent open source project. Let’s explore the benefits of using an open table format for analyzing large data sets and why Iceberg has quickly become one of the most popular open source table formats.
How Table Formats Streamline Data Lakes
Data lakes are ideal for storing massive amounts of structured, semi-structured and unstructured data in native file formats. They provide organizations with a comprehensive way to explore, refine and analyze petabytes of information constantly arriving from multiple data sources.
But the individual files in a data lake don’t contain the information that query engines and other applications need for effective pruning, time travel, schema evolution and more. As a result, these management tasks are difficult and time-consuming to perform. Table formats address the problem by providing metadata that enables capabilities similar to those of SQL tables in a traditional relational database. They explicitly define a table, its schema, its history and each file that makes up the table. In addition, table formats such as Iceberg enable ACID compliance, allowing multiple applications to safely work on the same data simultaneously.
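To make the role of that metadata concrete, here is a minimal Python sketch of a table definition that records a schema, a history of snapshots and the data files in each snapshot. It is a toy model, not Iceberg's actual metadata layout: the names `DataFile`, `Snapshot`, `TableMetadata` and `plan_scan`, and the file paths, are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataFile:
    path: str
    row_count: int
    # Per-column (min, max) statistics, stored in metadata so a planner
    # can skip files without opening them.
    bounds: dict


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: tuple  # every data file that makes up the table at this point


@dataclass
class TableMetadata:
    schema: dict  # column name -> type name
    snapshots: list = field(default_factory=list)  # table history, oldest first

    def current(self):
        return self.snapshots[-1]


def plan_scan(snapshot, column, lo, hi):
    """Return paths of files whose [min, max] range for `column` overlaps [lo, hi]."""
    selected = []
    for f in snapshot.files:
        fmin, fmax = f.bounds[column]
        if fmax >= lo and fmin <= hi:
            selected.append(f.path)
    return selected


# Hypothetical example data: two files covering different day ranges.
jan = DataFile("data/jan.parquet", 1000, {"day": (1, 31)})
feb = DataFile("data/feb.parquet", 900, {"day": (32, 59)})

meta = TableMetadata(schema={"day": "int", "amount": "double"})
meta.snapshots.append(Snapshot(1, (jan,)))
meta.snapshots.append(Snapshot(2, (jan, feb)))

# Only the February file can contain days 40 through 45.
pruned = plan_scan(meta.current(), "day", 40, 45)
print(pruned)  # ['data/feb.parquet']
```

Because the file list and column statistics live in the table metadata, the planner in this sketch decides which files to read without touching the files themselves. That is the general idea behind metadata-driven pruning.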
What Is Apache Iceberg?
Iceberg is an open source table format originally developed at Netflix to address challenges the company encountered with Apache Hive tables. In 2018, Netflix donated Iceberg to the Apache Software Foundation, making it a fully open source, openly governed project. It remedies many of the shortcomings of its predecessor and has quickly become one of the most popular open source table formats.
The Benefits of Apache Iceberg
The Iceberg table format offers many features to help power your data lake architecture.
- Expressive SQL: Iceberg supports flexible SQL commands for updating existing rows, merging new data and making targeted deletes. Iceberg can rewrite data files to improve read performance, or use delete deltas to speed up updates.
- Schema evolution: Iceberg supports full schema evolution. Schema updates in Iceberg tables change only the metadata, leaving the data files themselves unaffected. Supported changes include adding, dropping, renaming and reordering columns, as well as type promotions.
- Partition evolution: Partitioning divides a large table into smaller parts by grouping similar rows together, speeding up read and load times for queries that only need to access a portion of the data. A partition spec can evolve without rewriting the data that was written under an earlier spec. The metadata associated with each partition spec version is stored separately.
- Time travel and rollback: Iceberg’s time travel feature makes it possible to run reproducible queries against the same table snapshot and lets users inspect previous changes. Rollback lets users quickly undo errors by resetting a table to a previous state.
- Transactional consistency: Data stored in a data lake or data mesh architecture is available to multiple independent applications across an organization simultaneously. While this is a significant benefit, it also carries substantial risk, especially if multiple users write to the same data at the same time. Iceberg enables ACID transactions at scale, allowing concurrent writers to work in tandem. Support for ACID helps ensure readers are not affected by partial or uncommitted changes from writers. When a writer commits a change, Iceberg creates a new, immutable version of the table’s metadata that points to the data files making up that version.
- Faster querying: Iceberg is designed for use with huge analytical data sets. It offers multiple features designed to increase querying speed and efficiency including fast scan planning, pruning metadata files that aren’t needed, and the ability to filter out data files that don’t contain matching data.
- Vibrant community of active users and contributors: Iceberg is one of the Apache Software Foundation’s flagship projects. Its support for multiple processing engines and file formats, including Apache Parquet, Apache Avro and Apache ORC, has attracted a diverse group of talented commercial users eager to contribute to its ongoing success.
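Several of the features above (transactional commits, time travel and rollback) follow from treating each table version as an immutable snapshot. The Python sketch below illustrates that idea with a toy in-memory table. It is not Iceberg's API or implementation: `ToyTable`, `CommitConflict` and the file names are hypothetical, and the single lock stands in for whatever atomic swap a real catalog provides.

```python
import threading


class CommitConflict(Exception):
    """Raised when another writer committed first."""


class ToyTable:
    def __init__(self):
        self._lock = threading.Lock()
        # Each snapshot is an immutable tuple of data-file paths.
        self._snapshots = {0: ()}
        self._current = 0  # id of the snapshot readers see by default

    @property
    def current_id(self):
        return self._current

    def scan(self, snapshot_id=None):
        """Read the current snapshot, or an older one (time travel)."""
        sid = self._current if snapshot_id is None else snapshot_id
        return list(self._snapshots[sid])

    def commit_append(self, base_id, new_files):
        """Atomically publish a new snapshot built on top of `base_id`."""
        with self._lock:
            if base_id != self._current:
                # The writer planned against a stale version of the table.
                raise CommitConflict(
                    f"table moved from snapshot {base_id} to {self._current}"
                )
            new_id = max(self._snapshots) + 1
            self._snapshots[new_id] = self._snapshots[base_id] + tuple(new_files)
            self._current = new_id
            return new_id

    def rollback_to(self, snapshot_id):
        """Point the table back at an earlier snapshot; no data is rewritten."""
        with self._lock:
            if snapshot_id not in self._snapshots:
                raise KeyError(snapshot_id)
            self._current = snapshot_id


table = ToyTable()
s1 = table.commit_append(table.current_id, ["a.parquet"])
s2 = table.commit_append(s1, ["b.parquet"])

# A writer still holding the stale base s1 is rejected instead of
# silently overwriting the other writer's commit.
try:
    table.commit_append(s1, ["c.parquet"])
    conflicted = False
except CommitConflict:
    conflicted = True

latest = table.scan()          # ['a.parquet', 'b.parquet']
as_of_s1 = table.scan(s1)      # ['a.parquet']  (time travel)
table.rollback_to(s1)
after_rollback = table.scan()  # ['a.parquet']
```

In this sketch a rejected writer simply gets an exception. A real engine would typically re-apply its changes on top of the latest snapshot and try the commit again. Readers never see a half-written version, because a snapshot only becomes visible once the current pointer moves to it.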