article thumbnail

Apache Iceberg optimization: Solving the small files problem in Amazon EMR

AWS Big Data

Systems of this nature generate a huge number of small objects and need attention to compact them to a more optimal size for faster reading, such as 128 MB, 256 MB, or 512 MB. For more information on streaming applications on AWS, refer to Real-time Data Streaming and Analytics. We use the Hive catalog for Iceberg tables.

article thumbnail

Speed up queries with the cost-based optimizer in Amazon Athena

AWS Big Data

Starting today, the Athena SQL engine uses a cost-based optimizer (CBO), a new feature that uses table and column statistics stored in the AWS Glue Data Catalog as part of the table’s metadata. Let’s discuss some of the cost-based optimization techniques that contributed to improved query performance.

Insiders

Sign Up for our Newsletter

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

article thumbnail

Scale AWS Glue jobs by optimizing IP address consumption and expanding network capacity using a private NAT gateway

AWS Big Data

In this post, we will discuss two strategies to scale AWS Glue jobs: Optimizing the IP address consumption by right-sizing Data Processing Units (DPUs), using the Auto Scaling feature of AWS Glue, and fine-tuning of the jobs. Now let us look at the first solution that explains optimizing the AWS Glue IP address consumption.

article thumbnail

Optimizing PCI compliance in financial institutions

CIO Business Intelligence

A Common Controls Assessment offers an invaluable tool to optimize compliance efforts across various lines of business and to the internal service providers of security patterns alike. Conclusion In the intricate world of finance, PCI security compliance is nonnegotiable.

article thumbnail

Optimizing Hive on Tez Performance

Cloudera

Refer to the YARN – The Capacity Scheduler blog to understand these configuration settings.) . This can be tuned using the user limit factor of the YARN queue (refer the details in Capacity Scheduler blog ). Container reuse: This is an optimization that limits the startup time impact on containers. hive.cbo.enable.

article thumbnail

Optimize data layout by bucketing with Amazon Athena and AWS Glue to accelerate downstream queries

AWS Big Data

However, as data volumes continue to grow, optimizing data layout and organization becomes crucial for efficient querying and analysis. AWS Glue allows you to define bucketing parameters, such as the number of buckets and the columns to bucket on, providing an optimized data layout for efficient querying with Athena.

article thumbnail

Optimization Strategies for Iceberg Tables

Cloudera

This blog discusses a few problems that you might encounter with Iceberg tables and offers strategies on how to optimize them in each of those scenarios. Solution: expire snapshots We can expire old snapshots using expire_snapshots Problem with suboptimal manifests Over time the snapshots might reference many manifest files.