
Apache Iceberg optimization: Solving the small files problem in Amazon EMR

AWS Big Data

Systems of this nature generate a huge number of small objects, which need to be compacted to a more optimal size, such as 128 MB, 256 MB, or 512 MB, for faster reading. For more information on streaming applications on AWS, refer to Real-time Data Streaming and Analytics. We use the Hive catalog for Iceberg tables.
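Compaction like this is typically driven through Iceberg's Spark maintenance procedures. Below is a minimal sketch, assuming the Iceberg Spark runtime is on the classpath and a Hive catalog registered as hive_catalog; the table name db.events is a placeholder.

```python
from pyspark.sql import SparkSession

# A minimal sketch, assuming the Iceberg Spark runtime JAR is available and
# the Hive metastore is reachable; catalog and table names are placeholders.
spark = (
    SparkSession.builder
    .appName("iceberg-compaction")
    .config("spark.sql.catalog.hive_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.hive_catalog.type", "hive")
    .getOrCreate()
)

# rewrite_data_files bin-packs many small data files into fewer large ones;
# 268435456 bytes targets ~256 MB output files.
spark.sql("""
    CALL hive_catalog.system.rewrite_data_files(
        table => 'db.events',
        options => map('target-file-size-bytes', '268435456')
    )
""")
```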


Optimization Strategies for Iceberg Tables

Cloudera

This blog discusses a few problems that you might encounter with Iceberg tables and offers strategies for optimizing them in each of those scenarios. Problem with too many snapshots: every time a write operation occurs on an Iceberg table, a new snapshot is created. See Write properties.
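Iceberg ships a maintenance procedure for pruning snapshot history. A minimal sketch, reusing the hypothetical hive_catalog and db.events names from above; the cutoff timestamp and retention count are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# expire_snapshots drops snapshot metadata older than the cutoff (and any data
# files no longer referenced), while keeping at least the last 10 snapshots.
spark.sql("""
    CALL hive_catalog.system.expire_snapshots(
        table => 'db.events',
        older_than => TIMESTAMP '2024-01-01 00:00:00.000',
        retain_last => 10
    )
""")
```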


Trending Sources


In-place version upgrades for applications on Amazon Managed Service for Apache Flink now supported

AWS Big Data

Refer to Upgrading Applications and Flink Versions for more information about avoiding unexpected inconsistencies, and to General best practices and recommendations for details on testing the upgrade process itself. If you’re using Gradle, refer to How to use Gradle to configure your project.
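An in-place runtime upgrade is requested through the UpdateApplication API. Here is a minimal boto3 sketch; the application name my-flink-app and the target runtime FLINK-1_18 are placeholder assumptions.

```python
import boto3

client = boto3.client("kinesisanalyticsv2")

# UpdateApplication requires the current version ID for optimistic locking,
# so read it first via DescribeApplication.
app = client.describe_application(ApplicationName="my-flink-app")
version_id = app["ApplicationDetail"]["ApplicationVersionId"]

# RuntimeEnvironmentUpdate switches the Flink runtime in place; a running
# application is restarted on the new runtime.
client.update_application(
    ApplicationName="my-flink-app",
    CurrentApplicationVersionId=version_id,
    RuntimeEnvironmentUpdate="FLINK-1_18",
)
```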


Optimize checkpointing in your Amazon Managed Service for Apache Flink applications with buffer debloating and unaligned checkpoints – Part 2

AWS Big Data

We’ve already discussed how checkpoints, when triggered by the job manager, signal all source operators to snapshot their state, which is then broadcast downstream as a special record called a checkpoint barrier. When barriers from all upstream partitions have arrived, the sub-task takes a snapshot of its state.
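In open-source Flink, both mechanisms covered in this series are toggled through configuration. A minimal PyFlink sketch, assuming a self-managed environment (on Amazon Managed Service for Apache Flink, checkpoint triggering itself is controlled by the service):

```python
from pyflink.common import Configuration
from pyflink.datastream import StreamExecutionEnvironment

config = Configuration()
# Buffer debloating dynamically shrinks in-flight network buffers under
# backpressure, so less data sits ahead of each checkpoint barrier.
config.set_string("taskmanager.network.memory.buffer-debloat.enabled", "true")

env = StreamExecutionEnvironment.get_execution_environment(config)
env.enable_checkpointing(60_000)  # checkpoint every 60 seconds

# Unaligned checkpoints let barriers overtake buffered records, so checkpoints
# can complete even when the pipeline is backpressured.
env.get_checkpoint_config().enable_unaligned_checkpoints(True)
```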


Use Amazon Athena with Spark SQL for your open-source transactional table formats

AWS Big Data

These formats enable ACID (atomicity, consistency, isolation, durability) transactions, upserts, and deletes, as well as advanced features such as time travel and snapshots that were previously only available in data warehouses. For more information, refer to Amazon S3: Allows read and write access to objects in an S3 Bucket.
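Time travel is exposed directly in Spark SQL (Spark 3.3 and later). A minimal sketch against a hypothetical Iceberg table db.events; the timestamp and snapshot ID below are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Query the table as it existed at a point in time...
spark.sql(
    "SELECT * FROM db.events TIMESTAMP AS OF '2024-01-01 00:00:00'"
).show()

# ...or pin the read to a specific snapshot ID taken from the table history.
spark.sql("SELECT * FROM db.events VERSION AS OF 4358109269976793652").show()
```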


Implement data warehousing solution using dbt on Amazon Redshift

AWS Big Data

In this post, we look into an optimal and cost-effective way of incorporating dbt within Amazon Redshift. In an optimal environment, we store the credentials in AWS Secrets Manager and retrieve them at runtime. For more information, refer to SQL models and Redshift set up.
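A minimal sketch of that credential retrieval with boto3; the secret name and its JSON keys are assumptions, and the environment variables match whatever profiles.yml reads via env_var().

```python
import json
import os

import boto3

# Fetch the stored Redshift credentials; "dbt/redshift/credentials" is a
# placeholder secret name.
secrets = boto3.client("secretsmanager")
secret = json.loads(
    secrets.get_secret_value(SecretId="dbt/redshift/credentials")["SecretString"]
)

# Export them so profiles.yml can reference {{ env_var('DBT_USER') }} and
# {{ env_var('DBT_PASSWORD') }}, keeping credentials out of source control.
os.environ["DBT_USER"] = secret["username"]
os.environ["DBT_PASSWORD"] = secret["password"]
```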


Analyze Elastic IP usage history using Amazon Athena and AWS CloudTrail

AWS Big Data

You can use this solution regularly as part of your cost-optimization efforts to safely remove unused EIPs and reduce your costs. To generate EIP usage reports, the solution compares snapshots of the current EIPs, focusing on each EIP's most recent attachment within a customizable 3-month period.
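One way to picture the comparison: an Athena query over the CloudTrail table that surfaces EIPs with an AssociateAddress event inside the window, which can then be diffed against the current EIP snapshot. A sketch under assumed names (a cloudtrail_logs table in the default database, and a placeholder results bucket):

```python
import boto3

athena = boto3.client("athena")

# EIP allocations with an association event in the last ~3 months; EIPs present
# in the current snapshot but absent from this set are candidates for release.
QUERY = """
SELECT DISTINCT json_extract_scalar(requestparameters, '$.allocationId') AS allocation_id
FROM cloudtrail_logs
WHERE eventname = 'AssociateAddress'
  AND from_iso8601_timestamp(eventtime) >= current_timestamp - INTERVAL '90' DAY
"""

athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```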