article thumbnail

Apache Iceberg optimization: Solving the small files problem in Amazon EMR

AWS Big Data

Systems of this nature generate a huge number of small objects and need attention to compact them to a more optimal size for faster reading, such as 128 MB, 256 MB, or 512 MB. As of this writing, only the optimize-data optimization is supported. For our testing, we generated about 58,176 small objects with total size of 2 GB.

article thumbnail

Optimize checkpointing in your Amazon Managed Service for Apache Flink applications with buffer debloating and unaligned checkpoints – Part 2

AWS Big Data

We’ve already discussed how checkpoints, when triggered by the job manager, signal all source operators to snapshot their state, which is then broadcasted as a special record called a checkpoint barrier. When barriers from all upstream partitions have arrived, the sub-task takes a snapshot of its state.

Insiders

Sign Up for our Newsletter

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

article thumbnail

Optimize checkpointing in your Amazon Managed Service for Apache Flink applications with buffer debloating and unaligned checkpoints – Part 1

AWS Big Data

Internally, Apache Flink uses clever mechanisms to maintain exactly-once state consistency, while also optimizing for throughput and reduced latency. Each of the distributed components of an application asynchronously snapshots its state to an external persistent datastore. The default behavior works well for most use cases.

article thumbnail

Find the best Amazon Redshift configuration for your workload using Redshift Test Drive

AWS Big Data

With the launch of Amazon Redshift Serverless and the various deployment options Amazon Redshift provides (such as instance types and cluster sizes), customers are looking for tools that help them determine the most optimal data warehouse configuration to support their Redshift workload.

Testing 63
article thumbnail

Implement data warehousing solution using dbt on Amazon Redshift

AWS Big Data

It also applies general software engineering principles like integrating with git repositories, setting up DRYer code, adding functional test cases, and including external libraries. In this post, we look into an optimal and cost-effective way of incorporating dbt within Amazon Redshift. For more information, refer SQL models.

article thumbnail

Real-time cost savings for Amazon Managed Service for Apache Flink

AWS Big Data

This means that cost-optimization exercises can happen at any time—they no longer need to happen in the planning phase. These scalable properties of Apache Flink can be key to optimizing your cost in the cloud. The third cost component is durable application backups, or snapshots. per GB per month.

article thumbnail

Backtesting index rebalancing arbitrage with Amazon EMR and Apache Iceberg

AWS Big Data

This helps traders determine the potential profitability of a strategy and identify any risks associated with it, enabling them to optimize it for better performance. To avoid look-ahead bias in backtesting, it’s essential to create snapshots of the data at different points in time.