article thumbnail

Apache Iceberg optimization: Solving the small files problem in Amazon EMR

AWS Big Data

Iceberg tables store metadata in manifest files. As the number of data files increase, the amount of metadata stored in these manifest files also increases, leading to longer query planning time. The query runtime also increases because it’s proportional to the number of data or metadata file read operations.

article thumbnail

Architectural patterns for real-time analytics using Amazon Kinesis Data Streams, part 1

AWS Big Data

Stream Processing – An application created with Amazon Managed Service for Apache Flink can read the records from the data stream to detect and clean any errors in the time series data and enrich the data with specific metadata to optimize operational analytics.

Analytics 115
article thumbnail

AWS Glue streaming application to process Amazon MSK data using AWS Glue Schema Registry

AWS Big Data

This data can come from a diverse range of sources, including Internet of Things (IoT) devices, user applications, and logging and telemetry information from applications, to name a few. The items stored in checkpoint locations are mainly the metadata for application configurations and the state of processed offsets.