Apache Iceberg optimization: Solving the small files problem in Amazon EMR

AWS Big Data

Systems of this nature generate a huge number of small objects, which need to be compacted to a more optimal size for faster reading, such as 128 MB, 256 MB, or 512 MB. As of this writing, only the optimize-data optimization is supported. [...] and above (available from Amazon EMR 6.11.0).

Use Amazon Athena with Spark SQL for your open-source transactional table formats

AWS Big Data

As data lakes have grown in size and matured in usage, a significant amount of effort can be spent keeping the data consistent with business events. Running Iceberg’s rewrite_data_files procedure in Spark for Athena will compact data files, combining many small delta change files into a smaller set of read-optimized Parquet files.
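
As a rough illustration of the compaction step mentioned above, the following Python sketch builds the Spark SQL `CALL` statement for Iceberg's `rewrite_data_files` procedure. The catalog name `my_catalog`, the table `db.events`, and the 512 MB target are assumptions chosen for illustration, not values from the article; the helper function itself is hypothetical.

```python
def rewrite_data_files_sql(catalog: str, table: str, target_mb: int = 512) -> str:
    """Build a Spark SQL CALL statement for Iceberg's rewrite_data_files procedure.

    `catalog` and `table` are placeholder names; `target_mb` is the desired
    data file size in MB, passed to Iceberg as bytes through the
    'target-file-size-bytes' option.
    """
    target_bytes = target_mb * 1024 * 1024
    return (
        f"CALL {catalog}.system.rewrite_data_files("
        f"table => '{table}', "
        f"options => map('target-file-size-bytes', '{target_bytes}'))"
    )


# Hypothetical catalog and table names, used only for illustration.
stmt = rewrite_data_files_sql("my_catalog", "db.events")
print(stmt)
```

In a Spark session configured with an Iceberg catalog, the resulting string would typically be passed to `spark.sql(stmt)`.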

From Hive Tables to Iceberg Tables: Hassle-Free

Cloudera

In all the use cases we are trying to migrate a table named “events.” They also provide a “snapshot” procedure that creates an Iceberg table with a different name with the same underlying data. You could first create a snapshot table, run sanity checks on the snapshot table, and ensure that everything is in order.
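
The snapshot-then-check workflow described above might look roughly like the following sketch, assuming the procedure in question is Iceberg's Spark `snapshot` procedure. The catalog name `spark_catalog`, the database `db`, and the snapshot table name `events_snapshot` are assumptions for illustration; only the table name “events” comes from the excerpt, and both helper functions are hypothetical.

```python
def snapshot_sql(catalog: str, source_table: str, snapshot_table: str) -> str:
    """Build a CALL statement for Iceberg's Spark `snapshot` procedure.

    The procedure creates a new Iceberg table (`snapshot_table`) that
    references the data files of `source_table`.
    """
    return (
        f"CALL {catalog}.system.snapshot("
        f"source_table => '{source_table}', table => '{snapshot_table}')"
    )


def row_count_sql(table: str) -> str:
    """Build a simple sanity-check query: count the rows in `table`."""
    return f"SELECT COUNT(*) FROM {table}"


# Hypothetical names; only "events" appears in the excerpt.
create_snapshot = snapshot_sql("spark_catalog", "db.events", "db.events_snapshot")
check_source = row_count_sql("db.events")
check_snapshot = row_count_sql("db.events_snapshot")

for statement in (create_snapshot, check_source, check_snapshot):
    print(statement)
```

Comparing the two row counts is one possible sanity check before committing to a full migration; the checks a real migration needs will depend on the table.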

Architectural patterns for real-time analytics using Amazon Kinesis Data Streams, part 1

AWS Big Data

In this post, we will review the common architectural patterns of two use cases: Time Series Data Analysis and Event Driven Microservices. The streaming records are read in the order they are produced, allowing for real-time analytics, building event-driven applications or streaming ETL (extract, transform, and load).

Amazon Managed Service for Apache Flink now supports Apache Flink version 1.18

AWS Big Data

Apache Flink is an open source distributed processing engine, offering powerful programming interfaces for both stream and batch processing, with first-class support for stateful processing and event time semantics. By default, the sink writes in batches to optimize throughput. The dependency for Apache Flink 1.18 With versions 1.16

How to achieve Kubernetes observability: Principles and best practices

IBM Big Data Hub

In this blog, we discuss how Kubernetes observability works, and how organizations can use it to optimize cloud-native IT architectures. Logs: Logs include discrete events recorded every time something occurs in the system, such as status or error messages, or transaction details. How does observability work?

Amazon OpenSearch Service Under the Hood : OpenSearch Optimized Instances(OR1)

AWS Big Data

Amazon OpenSearch Service recently introduced the OpenSearch Optimized Instance family (OR1), which delivers up to 30% price-performance improvement over existing memory optimized instances in internal benchmarks, and uses Amazon Simple Storage Service (Amazon S3) to provide 11 9s of durability.