1. What Are Iceberg and the Iceberg Table Layout?

Apache Iceberg is a high-performance table format for large analytic datasets. It’s designed to handle petabyte-scale data lakes with the reliability and efficiency needed for data analytics and big data workflows. Iceberg tables organize data into a consistent format that simplifies querying, updating, and managing data at scale. One of the main advantages of the Iceberg table format is schema evolution, which allows the table schema to be updated without rewriting the data. However, these advantages come at a cost: table metadata is kept separate from the data, in metadata files that every table operation updates as part of a transaction while still allowing concurrent access. A typical Iceberg table layout has:

Manifest files: Store metadata about data files in the table, including their locations, sizes, and statistics.
Snapshot files: Represent the state of the table at a given point in time. Each snapshot references a manifest list, which in turn points to the manifest files that track the table's data files.
Data files: Contain the actual data in the table, typically stored in columnar formats like Parquet or ORC.
Metadata files: Store global metadata about the table, such as schema, partitioning information, and properties.
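To make the relationship between these files concrete, here is a minimal sketch in plain Python. It does not use any Iceberg library. The class names, fields, and file paths are hypothetical and chosen only for illustration, and the model is simplified: a snapshot holds its manifests directly, whereas a real table goes through a manifest list.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataFile:
    path: str          # location of the Parquet/ORC file (hypothetical path)
    record_count: int  # example of a per-file statistic
    size_bytes: int


@dataclass
class ManifestFile:
    path: str
    data_files: List[DataFile] = field(default_factory=list)


@dataclass
class Snapshot:
    snapshot_id: int
    # Simplification: a real snapshot points to a manifest list file.
    manifests: List[ManifestFile] = field(default_factory=list)


@dataclass
class TableMetadata:
    schema: Dict[str, str]
    partition_by: List[str]
    properties: Dict[str, str]
    snapshots: List[Snapshot] = field(default_factory=list)


def files_in_snapshot(snapshot: Snapshot) -> List[str]:
    """Walk snapshot -> manifests -> data files and return the data file paths."""
    return [df.path for m in snapshot.manifests for df in m.data_files]


def build_example_table() -> TableMetadata:
    """Build a toy table with two snapshots; the second reuses the first manifest."""
    f1 = DataFile("data/part-0001.parquet", record_count=1000, size_bytes=4096)
    f2 = DataFile("data/part-0002.parquet", record_count=500, size_bytes=2048)
    m1 = ManifestFile("metadata/manifest-1.avro", [f1])
    m2 = ManifestFile("metadata/manifest-2.avro", [f2])
    s1 = Snapshot(1, [m1])
    s2 = Snapshot(2, [m1, m2])
    return TableMetadata(
        schema={"id": "long", "name": "string"},
        partition_by=[],
        properties={},
        snapshots=[s1, s2],
    )
```

Calling files_in_snapshot on the second snapshot returns both data file paths, while the first snapshot still resolves to only the first file. This is the property that lets older table states remain readable after new data is written.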

CRUD operations on a table generate multiple snapshot files, manifest files, data files, and so on, which consume storage and can make table operations inefficient.
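The accumulation can be illustrated with a toy simulation. This is a rough sketch under a strong simplifying assumption, not a model of what any particular engine writes: every commit is assumed to add exactly one new data file, one manifest, one snapshot, and one metadata file, and nothing is ever cleaned up. The ToyTable class and simulate_commits function are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyTable:
    data_files: List[str] = field(default_factory=list)
    manifests: List[str] = field(default_factory=list)
    snapshots: List[str] = field(default_factory=list)
    metadata_files: List[str] = field(default_factory=list)

    def commit(self, operation: str) -> None:
        """Record one commit. Assumption: each commit writes one file of each kind."""
        n = len(self.snapshots) + 1
        self.data_files.append(f"data/{operation}-{n}.parquet")
        self.manifests.append(f"metadata/manifest-{n}.avro")
        self.snapshots.append(f"metadata/snap-{n}.avro")
        self.metadata_files.append(f"metadata/v{n}.metadata.json")

    def total_files(self) -> int:
        """Count every file written so far; old files are never removed here."""
        return (
            len(self.data_files)
            + len(self.manifests)
            + len(self.snapshots)
            + len(self.metadata_files)
        )


def simulate_commits(operations: List[str]) -> ToyTable:
    """Apply a sequence of operations to an empty toy table."""
    table = ToyTable()
    for op in operations:
        table.commit(op)
    return table
```

Under these assumptions, simulate_commits(["insert", "update", "delete"]) leaves twelve files behind for three small operations. The file count grows with the number of commits rather than with the amount of live data.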
