Migrate an existing data lake to a transactional data lake using Apache Iceberg

AWS Big Data

This means the data files in the data lake aren't modified during the migration; all Apache Iceberg metadata files (manifest files, manifest lists, and table metadata files) are generated separately from the data. In this method, the metadata is recreated in an isolated environment and colocated with the existing data files.
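
A minimal PySpark sketch of this in-place approach, assuming a Spark session configured with the Iceberg runtime, a catalog named glue_catalog, and an existing Parquet table db.sales (both names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-in-place-migration").getOrCreate()

# Iceberg's migrate procedure rewrites only metadata: it scans the existing
# data files, generates manifest files, manifest lists, and table metadata
# files, and converts db.sales to Iceberg without rewriting the data itself.
spark.sql("CALL glue_catalog.system.migrate('db.sales')")
```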

Use Apache Iceberg in a data lake to support incremental data processing

AWS Big Data

Apache Iceberg is an open table format for very large analytic datasets that captures metadata on the state of a dataset as it evolves over time. Iceberg addresses customer needs by recording rich metadata about the dataset at the time the individual data files are created.
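
That snapshot metadata is what enables incremental processing. A hedged PySpark sketch using Iceberg's documented start-snapshot-id / end-snapshot-id read options; the catalog, table name, and snapshot IDs are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-incremental-read").getOrCreate()

# Read only the records committed between two snapshots; start-snapshot-id
# is exclusive and end-snapshot-id is inclusive.
incremental = (
    spark.read.format("iceberg")
    .option("start-snapshot-id", "5781947118336215154")
    .option("end-snapshot-id", "8747492418591898537")
    .load("glue_catalog.db.events")
)
incremental.show()
```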

Optimization Strategies for Iceberg Tables

Cloudera

The problem with too many snapshots: every time a write operation occurs on an Iceberg table, a new snapshot is created. A bloated metadata.json file can increase both read and write times, because the large metadata file must be read and written on every operation. You could also change the isolation level to snapshot isolation.
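
A hedged maintenance sketch using Iceberg's documented expire_snapshots procedure and metadata-retention table properties; the catalog, table name, and retention values are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-maintenance").getOrCreate()

# Expire old snapshots so the snapshot list in metadata.json stays small.
spark.sql("""
    CALL glue_catalog.system.expire_snapshots(
        table => 'db.sales',
        older_than => TIMESTAMP '2024-01-01 00:00:00',
        retain_last => 10
    )
""")

# Cap how many previous metadata files are kept after each commit.
spark.sql("""
    ALTER TABLE glue_catalog.db.sales SET TBLPROPERTIES (
        'write.metadata.delete-after-commit.enabled' = 'true',
        'write.metadata.previous-versions-max' = '20'
    )
""")
```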

From Hive Tables to Iceberg Tables: Hassle-Free

Cloudera

They also provide a “snapshot” procedure that creates an Iceberg table with a different name but the same underlying data. You could first create a snapshot table, run sanity checks on it, and ensure that everything is in order. For the in-place migration itself, Hive creates Iceberg's metadata files for the exact same table.
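
A hedged sketch of that validation workflow using Iceberg's snapshot Spark procedure; the table names are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-to-iceberg-snapshot").getOrCreate()

# Create an Iceberg table over the Hive table's existing data files; the
# source table and its data files are left untouched.
spark.sql("""
    CALL spark_catalog.system.snapshot(
        source_table => 'db.hive_sales',
        table => 'db.hive_sales_iceberg_snap'
    )
""")

# Run sanity checks on the snapshot table before migrating in place.
spark.sql("SELECT COUNT(*) FROM db.hive_sales_iceberg_snap").show()
```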

Materialized Views in Hive for Iceberg Table Format

Cloudera

The snapshot IDs of the source tables involved in the materialized view are also maintained in the metadata. Subsequently, these snapshot IDs are used to determine the delta changes that should be applied to the materialized view rows. Furthermore, the materialized view is partitioned on the d_year column.
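
An illustrative PySpark sketch of the idea (not Hive's internal implementation): the snapshot ID recorded at the last rebuild bounds an incremental read of the source table, so only that delta needs to be merged into the view. The catalog, table name, and snapshot ID are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mv-incremental-refresh").getOrCreate()

# Snapshot ID stored in the materialized view's metadata at the last rebuild
# (placeholder value).
last_refresh_snapshot_id = "1234567890123456789"

# Read only the rows appended to the source table since that snapshot.
delta = (
    spark.read.format("iceberg")
    .option("start-snapshot-id", last_refresh_snapshot_id)
    .load("glue_catalog.db.store_sales")
)

# The delta would then be aggregated and merged into the materialized view's
# rows, keyed on the d_year partition column.
delta.groupBy("d_year").count().show()
```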

Simplify operational data processing in data lakes using AWS Glue and Apache Hudi

AWS Big Data

The File Manager Lambda function consumes those messages, parses the metadata, and inserts it into the DynamoDB table odpf_file_tracker. Current snapshot – this table in the data lake stores the latest versioned records (upserts), with the ability to use Hudi time travel for historical updates.
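
A hedged sketch of such a handler in Python, assuming the messages arrive from SQS and that the item attributes shown match the odpf_file_tracker key schema (the field names are illustrative):

```python
import json

import boto3

dynamodb = boto3.resource("dynamodb")
tracker = dynamodb.Table("odpf_file_tracker")

def handler(event, context):
    # Each SQS record delivered to the Lambda carries file metadata in its body.
    for record in event["Records"]:
        metadata = json.loads(record["body"])
        tracker.put_item(
            Item={
                "source_table": metadata["source_table"],
                "file_id": metadata["file_id"],
                "s3_path": metadata["s3_path"],
            }
        )
```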

12 Times Faster Query Planning With Iceberg Manifest Caching in Impala

Cloudera

A range of Iceberg table analysis tasks, such as listing a table's data files, selecting a table snapshot, partition filtering, and predicate filtering, can be delegated to the Iceberg Java API instead, obviating the need for each query engine to implement them itself. The data files and metadata files in Iceberg format are immutable.
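
A small PyIceberg sketch of the planning work described above; the catalog and table names are placeholders. Because the underlying manifest and data files are immutable, the metadata read during planning can be cached safely, which is the basis of manifest caching:

```python
from pyiceberg.catalog import load_catalog

# Assumes a catalog named "default" is configured (e.g. in ~/.pyiceberg.yaml).
catalog = load_catalog("default")
table = catalog.load_table("db.web_logs")

# Snapshot selection, partition pruning, and predicate filtering are handled
# by the Iceberg library while planning which data files to scan.
tasks = table.scan(row_filter="event_date >= '2024-01-01'").plan_files()
for task in tasks:
    print(task.file.file_path)
```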