article thumbnail

Apache Iceberg optimization: Solving the small files problem in Amazon EMR

AWS Big Data

Iceberg tables store metadata in manifest files. As the number of data files increase, the amount of metadata stored in these manifest files also increases, leading to longer query planning time. The query runtime also increases because it’s proportional to the number of data or metadata file read operations.

article thumbnail

Introducing in-place version upgrades with Amazon MWAA

AWS Big Data

If you also needed to preserve the history of DAG runs, you had to take a backup of your metadata database and then restore that backup on the newly created environment. Amazon MWAA manages the entire upgrade process, from provisioning new Apache Airflow versions to upgrading the metadata database.

Insiders

Sign Up for our Newsletter

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Trending Sources

article thumbnail

Architectural patterns for real-time analytics using Amazon Kinesis Data Streams, part 1

AWS Big Data

In the subsequent post in our series, we will explore the architectural patterns in building streaming pipelines for real-time BI dashboards, contact center agent, ledger data, personalized real-time recommendation, log analytics, IoT data, Change Data Capture, and real-time marketing data.

Analytics 115
article thumbnail

Keeping Small Queries Fast – Short query optimizations in Apache Impala

Cloudera

Operational, Cybersecurity, and IoT reporting where the current point in time state of an individual or single device needs to be analyzed. . Metadata Caching. The new Catalog design means that Impala coordinators will only load the metadata that they need instead of a full snapshot of all the tables. More on this below.

article thumbnail

AWS Glue streaming application to process Amazon MSK data using AWS Glue Schema Registry

AWS Big Data

This data can come from a diverse range of sources, including Internet of Things (IoT) devices, user applications, and logging and telemetry information from applications, to name a few. After the processed data is stored in Amazon S3, we create an AWS Glue crawler to create a Data Catalog table that acts as a metadata layer for the data.