Apache Iceberg optimization: Solving the small files problem in Amazon EMR

AWS Big Data

For our testing, we generated 58,176 small objects with a total size of 2 GB. For the Amazon EMR tests, we used Amazon EMR release emr-6.11.0 with Spark 3.3.2 and JupyterEnterpriseGateway 2.6.0. Check the snapshots table to see that a new snapshot is created for the table with the operation replace.
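
For context, the compaction that produces such a replace snapshot can be triggered with Iceberg's rewrite_data_files Spark procedure. A minimal sketch, assuming a Spark session `spark` with an Iceberg catalog named dev and a hypothetical table db.my_table:

# A minimal sketch of compacting small files with Iceberg's
# rewrite_data_files Spark procedure; the catalog ("dev") and table
# ("db.my_table") names are hypothetical placeholders.
spark.sql("""
    CALL dev.system.rewrite_data_files(
        table => 'db.my_table',
        options => map('target-file-size-bytes', '134217728')
    )
""")

# The snapshots metadata table should now contain a new entry whose
# operation is 'replace'.
spark.sql("SELECT snapshot_id, operation FROM dev.db.my_table.snapshots").show()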

Migrate an existing data lake to a transactional data lake using Apache Iceberg

AWS Big Data

An in-place migration can be performed in one of two ways: Using add_files: This procedure adds existing data files to an existing Iceberg table, creating a new snapshot that includes the files. Unlike migrate or snapshot, add_files can import files from a specific partition or partitions and doesn’t create a new Iceberg table, as the sketch below illustrates.
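
As a rough illustration, add_files is invoked as a Spark procedure. A minimal sketch, assuming an Iceberg catalog named dev, an existing Iceberg table db.target_table, and a Parquet source table db.source_table (all hypothetical names):

# A minimal sketch of registering existing data files in an Iceberg
# table with the add_files procedure; all catalog and table names are
# hypothetical placeholders.
spark.sql("""
    CALL dev.system.add_files(
        table => 'db.target_table',
        source_table => 'db.source_table'
    )
""")

The procedure also accepts a partition_filter argument to restrict the import to specific partitions, which is what distinguishes it from migrate and snapshot.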

Use Apache Iceberg in a data lake to support incremental data processing

AWS Big Data

Whenever there is an update to the Iceberg table, a new snapshot of the table is created, and the metadata pointer points to the current table metadata file. At the top of the hierarchy is the metadata file, which stores information about the table’s schema, partition information, and snapshots.
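
Because every update yields a new snapshot, downstream jobs can process only the rows appended between two snapshots. A minimal sketch, assuming an Iceberg table dev.db.my_table and hypothetical snapshot IDs taken from its snapshots metadata table:

# A minimal sketch of an incremental read between two snapshots using
# Iceberg's Spark read options; the table name and snapshot IDs are
# hypothetical placeholders.
incremental_df = (spark.read.format("iceberg")
    .option("start-snapshot-id", "5617068372036167111")  # exclusive lower bound
    .option("end-snapshot-id", "8270633197658268385")    # inclusive upper bound
    .load("dev.db.my_table"))
incremental_df.show()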

Simplifying data processing at Capitec with Amazon Redshift integration for Apache Spark

AWS Big Data

This interface allows them to access and integrate the necessary data from the EDW into their data pipelines, enabling efficient development and testing of features. This is particularly valuable for Type 2 slowly changing dimension (SCD) and timespan accumulating snapshot facts.
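
As a rough reconstruction of that interface, the following sketch reads a query result from the EDW with the Amazon Redshift integration for Apache Spark and deduplicates the rows; the connection URL, IAM role, temp directory, and query are hypothetical placeholders:

# A minimal sketch of pulling EDW data into a Spark pipeline; every
# option value below is a hypothetical placeholder.
read_config = {
    "url": "jdbc:redshift://my-cluster.example.us-east-1.redshift.amazonaws.com:5439/dev",
    "tempdir": "s3://my-bucket/spark-redshift-tmp/",
    "aws_iam_role": "arn:aws:iam::123456789012:role/my-redshift-spark-role",
}
deduplicated_df = (spark.read
    .format("io.github.spark_redshift_community.spark.redshift")
    .options(**read_config)
    .option("query", "SELECT customer_id, valid_from, valid_to FROM dim_customer")
    .load()
    .dropDuplicates())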

Find the best Amazon Redshift configuration for your workload using Redshift Test Drive

AWS Big Data

In this post, we answer that question by using Redshift Test Drive, an open-source tool that lets you evaluate which data warehouse configuration options are best suited for your workload. Redshift Test Drive uses this process of workload replication for two main functionalities: comparing configurations and comparing replays.

Materialized Views in Hive for Iceberg Table Format

Cloudera

Subsequently, these snapshot IDs are used to determine the delta changes that should be applied to the materialized view rows. Hive does this by asking the Iceberg library to return only the rows inserted after the snapshot that was current when the materialized view was last created or rebuilt.

Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes

AWS Big Data

Update your-iceberg-storage-blog in the following configuration with the bucket that you created to test this example. You can check that a new snapshot is created after an append operation by querying the Iceberg snapshots metadata table: spark.sql("""SELECT * FROM dev.db.amazon_reviews_iceberg.snapshots""").show()
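
For reference, the append that creates that snapshot looks roughly like the following; it assumes a hypothetically simplified amazon_reviews_iceberg schema of (review_body string, review_date date, year int), keeping only the values visible in the excerpt:

# A minimal sketch of the append; the schema is a hypothetical
# simplification, and only the literal values come from the excerpt.
spark.sql("""
    INSERT INTO dev.db.amazon_reviews_iceberg
    VALUES ('RIO is really great', date('2023-04-06'), 2023)
""")

Re-running the snapshots query afterward should show one additional row with the operation append.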