Metadata, Optimization and Reference

Apache Iceberg optimization: Solving the small files problem in Amazon EMR

AWS Big Data

OCTOBER 3, 2023

Iceberg tables store metadata in manifest files. As the number of data files increases, the amount of metadata stored in these manifest files also increases, leading to longer query planning time. The query runtime also increases because it’s proportional to the number of data or metadata file read operations.
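To make the small files problem concrete, the sketch below groups many small files into a few target-sized outputs, which is the basic idea behind compaction. It is a minimal illustration only: the function name `plan_compaction`, the 512 MB target, and the first-fit-decreasing strategy are assumptions chosen for the example, not Iceberg's or Amazon EMR's actual planning logic.

```python
from typing import List


def plan_compaction(file_sizes_mb: List[float], target_mb: float = 512.0) -> List[List[float]]:
    """Group file sizes into bins of at most target_mb (first-fit decreasing).

    Hypothetical helper for illustration; real compaction planners also
    consider partitions, delete files, and sort order.
    """
    bins: List[List[float]] = []
    for size in sorted(file_sizes_mb, reverse=True):
        for group in bins:
            # Place the file in the first group that still has room.
            if sum(group) + size <= target_mb:
                group.append(size)
                break
        else:
            # No existing group fits, so start a new one.
            bins.append([size])
    return bins


# 104 small files: one hundred 8 MB files and four 64 MB files.
small_files = [8.0] * 100 + [64.0] * 4
groups = plan_compaction(small_files, target_mb=512.0)
print(len(small_files), "files ->", len(groups), "rewritten files")
```

Fewer, larger files mean fewer manifest entries to plan over and fewer read operations per query, which is the effect the excerpt describes.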

Optimization

Optimization Snapshot Data Lake Metadata

RDF-Star: Metadata Complexity Simplified

Ontotext

JUNE 10, 2021

Relational databases benefit from decades of tweaks and optimizations to deliver performance. This is a graph of millions of edges and vertices – in enterprise data management terms it is a giant piece of master/reference data. Not Every Graph is a Knowledge Graph: Schemas and Semantic Metadata Matter.

Metadata

Metadata Cost-Benefit OLAP Modeling

Speed up queries with the cost-based optimizer in Amazon Athena

AWS Big Data

NOVEMBER 17, 2023

Starting today, the Athena SQL engine uses a cost-based optimizer (CBO), a new feature that uses table and column statistics stored in the AWS Glue Data Catalog as part of the table’s metadata. Let’s discuss some of the cost-based optimization techniques that contributed to improved query performance.

Optimization

Optimization Statistics Metadata Data Lake

Maximize your data dividends with active metadata

IBM Big Data Hub

NOVEMBER 28, 2022

Metadata management performs a critical role within the modern data management stack. However, as data volumes continue to grow, manual approaches to metadata management are sub-optimal and can result in missed opportunities. This puts into perspective the role of active metadata management. Improve data discovery.

Metadata

Metadata Data Quality Data-driven Data Governance

Optimization Strategies for Iceberg Tables

Cloudera

FEBRUARY 14, 2024

This blog discusses a few problems that you might encounter with Iceberg tables and offers strategies on how to optimize them in each of those scenarios. A bloated metadata.json file could increase both read/write times because a large metadata file needs to be read/written every time. This could be very costly.

Strategy

Strategy Optimization Snapshot Metadata

Run Trino queries 2.7 times faster with Amazon EMR 6.15.0

AWS Big Data

MARCH 22, 2024

When you use Trino on Amazon EMR or Athena, you get the latest open source community innovations along with proprietary, AWS-developed optimizations. Starting from Amazon EMR 6.8.0 and Athena engine version 2, AWS has been developing query plan and engine behavior optimizations that improve query performance on Trino.

Metadata

Metadata Statistics Broadcasting Optimization

Optimize data layout by bucketing with Amazon Athena and AWS Glue to accelerate downstream queries

AWS Big Data

APRIL 25, 2024

However, as data volumes continue to grow, optimizing data layout and organization becomes crucial for efficient querying and analysis. AWS Glue allows you to define bucketing parameters, such as the number of buckets and the columns to bucket on, providing an optimized data layout for efficient querying with Athena.
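As a rough illustration of what bucketing does, the sketch below assigns rows to a fixed number of buckets by hashing the bucketing column. The names `bucket_for` and `bucket_rows`, the sample rows, and the use of CRC32 are assumptions for the example; Athena and AWS Glue use their own hash functions and file layouts.

```python
import zlib
from collections import defaultdict
from typing import Dict, List


def bucket_for(key: str, num_buckets: int) -> int:
    """Map a key to a bucket index with a deterministic hash.

    CRC32 is a stand-in here, not the hash used by Athena or AWS Glue.
    """
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def bucket_rows(rows: List[dict], column: str, num_buckets: int) -> Dict[int, List[dict]]:
    """Group rows by the bucket of their bucketing-column value."""
    buckets: Dict[int, List[dict]] = defaultdict(list)
    for row in rows:
        buckets[bucket_for(str(row[column]), num_buckets)].append(row)
    return buckets


# Hypothetical sample data.
rows = [
    {"customer_id": "c-1", "amount": 10},
    {"customer_id": "c-2", "amount": 25},
    {"customer_id": "c-1", "amount": 7},
    {"customer_id": "c-3", "amount": 3},
]
buckets = bucket_rows(rows, column="customer_id", num_buckets=4)

# A lookup on one customer_id only has to read a single bucket.
target = bucket_for("c-1", 4)
matches = [r for r in buckets[target] if r["customer_id"] == "c-1"]
print(len(matches), "rows found in bucket", target)
```

Because every row with the same key lands in the same bucket, a filter or join on that column can skip the other buckets entirely.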

Optimization

Optimization Data Lake Cost-Benefit Reporting

Use Amazon Athena with Spark SQL for your open-source transactional table formats

AWS Big Data

JANUARY 24, 2024

Each storage format implements this functionality in slightly different ways; for a comparison, refer to Choosing an open table format for your transactional data lake on AWS. For more information, refer to Amazon S3: Allows read and write access to objects in an S3 Bucket.

Snapshot

Snapshot Data Lake Metadata Optimization

Introducing Amazon MWAA larger environment sizes

AWS Big Data

APRIL 16, 2024

Running Apache Airflow at scale puts proportionally greater load on the Airflow metadata database, sometimes leading to CPU and memory issues on the underlying Amazon Relational Database Service (Amazon RDS) cluster. A resource-starved metadata database may lead to dropped connections from your workers, failing tasks prematurely.

Metadata

Metadata Metrics Testing Management

What is a data architect? Skills, salaries, and how to become a data framework master

CIO Business Intelligence

OCTOBER 13, 2023

They understand data modeling, including conceptualization and database optimization, and demonstrate a commitment to continuing education. According to Dataversity, good data architects have a solid understanding of the cloud, databases, and the applications and programs used by those databases.

Data Architecture

Data Architecture Data Warehouse Statistics Visualization

Deliver decompressed Amazon CloudWatch Logs to Amazon S3 and Splunk using Amazon Data Firehose

AWS Big Data

APRIL 2, 2024

You can see the decompressed data has metadata information such as logGroup, logStream, and subscriptionFilters, and the actual data is included within the message field under logEvents (the following example shows an example of CloudTrail events in the CloudWatch Logs). Select Turn on message extraction for the Splunk destination.

Metadata

Metadata Marketing Analytics Data Transformation

Prioritizing Data: Why a Solid Data Management Strategy Will Be Critical in 2024

Ontotext

JANUARY 29, 2024

These will include developing a better understanding of AI, recognizing the role semantic metadata plays in data fabrics, and the rapid acceleration and adoption of knowledge graphs — which will be driven by large language models (LLMs) and the convergence of labeled property graphs (LPGs) and resource description frameworks (RDFs).

Strategy

Strategy Management Metadata Data-driven

Implement a full stack serverless search application using AWS Amplify, Amazon Cognito, Amazon API Gateway, AWS Lambda, and Amazon OpenSearch Serverless

AWS Big Data

MAY 31, 2024

This encompasses tasks such as integrating diverse data from various sources with distinct formats and structures, optimizing the user experience for performance and security, providing multilingual support, and optimizing for cost, operations, and reliability. Based on metadata, content is returned from Amazon S3 to the user.

Metadata

Metadata Management Testing Data-driven

Orchestrate an end-to-end ETL pipeline using Amazon S3, AWS Glue, and Amazon Redshift Serverless with Amazon MWAA

AWS Big Data

APRIL 25, 2024

Apache Airflow is an open source tool used to programmatically author, schedule, and monitor sequences of processes and tasks, referred to as workflows. Additionally, it enables cost optimization by aligning resources with specific use cases, making sure that expenses are well controlled. A VPC gateway endpoint to Amazon S3.

Metadata

Metadata Data Processing Management Testing

GraphDB: MongoDB Document Store Integration for Large-scale Metadata Management

Ontotext

JUNE 27, 2019

For example, it’s data generated as a result of text mining algorithms that includes document metadata attributes and annotations with links to the first type of information. In this experiment, we will show high-scalability when the articles/creative works are stored in MongoDB and all reference/true graph data in GraphDB.

Metadata

Metadata Management Optimization Enterprise

GraphDB: MongoDB Document Store Integration for Large-scale Metadata Management

Ontotext

JUNE 27, 2019

GraphDB: MongoDB Document Store Integration for Large-scale Metadata Management. For example, it’s data generated as a result of text mining algorithms that includes document metadata attributes and annotations with links to the first type of information. Optimizing for speed. Prerequisites. So far so good. Ontotext GraphDB.

Metadata

Metadata Management Enterprise Optimization

Do I Need a Data Catalog?

erwin

JUNE 26, 2020

Organizations with particularly deep data stores might need a data catalog with advanced capabilities, such as automated metadata harvesting to speed up the data preparation process. The most optimal and streamlined way to achieve this is by using a data catalog, which can provide a first stop for users ahead of working in BI platforms.

Metadata

Metadata Cost-Benefit Measurement Data-driven

Accelerate HiveQL with Oozie to Spark SQL migration on Amazon EMR

AWS Big Data

APRIL 19, 2023

Customers now want to migrate their Apache Hive workloads to Apache Spark in the cloud to get the benefits of optimized runtime, cost reduction through transient clusters, better scalability by decoupling the storage and compute, and flexibility. Generate Spark SQL metadata Our batch job consists of Hive steps scheduled to run sequentially.

Metadata

Metadata Testing Data Lake Consulting

How Huron built an Amazon QuickSight Asset Catalogue with AWS CDK Based Deployment Pipeline

AWS Big Data

APRIL 26, 2023

Having an accurate and up-to-date inventory of all technical assets helps an organization ensure it can keep track of all its resources with metadata information such as their assigned owners, last updated date, used by whom, how frequently and more. This is a guest blog post co-written with Corey Johnson from Huron.

Metadata

Metadata Dashboards Visualization Consulting

What is data governance? Best practices for managing data assets

CIO Business Intelligence

MARCH 24, 2023

While data management has become a common term for the discipline, it is sometimes referred to as data resource management or enterprise information management (EIM). Programs must support proactive and reactive change management activities for reference data values and the structure/use of master data and metadata.

Data Governance

Data Governance Management Metadata Data Quality

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

AWS Big Data

MARCH 7, 2024

At the same time, they need to optimize operational costs to unlock the value of this data for timely insights and do so with a consistent performance. We refer to this concept as outside-in data movement. Cold storage is optimized to store infrequently accessed or historical data. Let’s look at an example use case.

Data Lake

Data Lake Analytics Dashboards Metrics

Use Apache Iceberg in a data lake to support incremental data processing

AWS Big Data

MARCH 2, 2023

Apache Iceberg is an open table format for very large analytic datasets, which captures metadata information on the state of datasets as they evolve and change over time. Apache Iceberg addresses customer needs by capturing rich metadata information about the dataset at the time the individual data files are created.

Data Lake

Data Lake Data Processing Metadata Snapshot

Multicloud data lake analytics with Amazon Athena

AWS Big Data

MARCH 18, 2024

When a query runs on a federated data source using a connector, Athena invokes multiple AWS Lambda functions to read from the data sources in parallel to optimize performance. Refer to Using Amazon Athena Federated Query for further details. The AWS Glue Data Catalog holds the metadata for Amazon S3 and GCS data.

Data Lake

Data Lake Analytics Cost-Benefit Management

Modernize your data observability with Amazon OpenSearch Service zero-ETL integration with Amazon S3

AWS Big Data

JUNE 5, 2024

With cost-effective storage classes and user-friendly management features, you can optimize costs, organize data, and configure fine-tuned access controls to meet specific business, organizational, and compliance requirements. The direct query connection relies on the metadata in Glue Data Catalog tables to query data stored in Amazon S3.

Data Lake

Data Lake Cost-Benefit Dashboards Visualization

How gaming companies can use Amazon Redshift Serverless to build scalable analytical applications faster and easier

AWS Big Data

MARCH 7, 2023

They should also provide optimal performance with low or no tuning. And unlike data warehouses, which are primarily analytical stores, a data hub is a combination of all types of repositories—analytical, transactional, operational, reference, and data I/O services, along with governance processes. Data repositories represent the hub.

Analytics

Analytics Data Warehouse Data Lake Metadata

Build Spark Structured Streaming applications with the open source connector for Amazon Kinesis Data Streams

AWS Big Data

MAY 24, 2024

The connector is built using the latest Spark Data Sources API V2, which uses Spark optimizations. For using it with other Apache Spark platforms, the connector is available as a public JAR file that can be directly referred to while submitting a Spark Structured Streaming job. Starting with Amazon EMR 7.1, amazonaws.com").option("kinesis.startingposition",

Metadata

Metadata Interactive Business Objectives Management

Introducing Amazon MWAA support for Apache Airflow version 2.7.2 and deferrable operators

AWS Big Data

NOVEMBER 6, 2023

Refer to the Configuration reference in the User Guide for detailed configuration values. To learn more about Setup and Teardown tasks, refer to the Apache Airflow documentation. For a complete list of installed packages and their versions, refer to this MWAA documentation. The following diagram describes the process.

Metrics

Metrics Metadata Snapshot Management

Efficiently crawl your data lake and improve data access with an AWS Glue crawler using partition indexes

AWS Big Data

JUNE 15, 2023

AWS Glue crawlers extract the data schema and partitions from Amazon S3 to automatically populate the Data Catalog, keeping the metadata current. Today, AWS Glue crawler support has been expanded to automatically add partition indexes for newly discovered tables to optimize query processing on the partitioned dataset.

Data Lake

Data Lake Metadata Cost-Benefit Management

Amazon CloudWatch metrics for Amazon OpenSearch Service storage and shard skew health

AWS Big Data

AUGUST 21, 2023

This solution uses an AWS Lambda function to extract storage and shard distribution metadata from your OpenSearch Service domain, calculates the level of skew, and then pushes this information to CloudWatch metrics so that you can easily monitor, alert, and respond.
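The excerpt does not say how the level of skew is calculated, so the sketch below uses one plausible definition as an assumption: the largest deviation of any node from the cluster average, expressed as a percentage of that average. The function name `skew_percent` and the node figures are hypothetical, and the code does not call OpenSearch Service or CloudWatch.

```python
from statistics import mean
from typing import Iterable


def skew_percent(values: Iterable[float]) -> float:
    """Largest deviation from the mean, as a percentage of the mean.

    Assumed definition for illustration; the original solution may
    compute skew differently.
    """
    data = list(values)
    avg = mean(data)
    if avg == 0:
        return 0.0
    return max(abs(v - avg) for v in data) / avg * 100.0


# Hypothetical per-node storage usage in GB.
node_storage_gb = {"node-1": 100.0, "node-2": 100.0, "node-3": 160.0}
storage_skew = skew_percent(node_storage_gb.values())
print(f"storage skew: {storage_skew:.1f}%")
```

The same function could be applied to per-node shard counts; a value near zero indicates an evenly balanced domain.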

Metrics

Metrics Testing Strategy Metadata

Enhance query performance using AWS Glue Data Catalog column-level statistics

AWS Big Data

NOVEMBER 22, 2023

These statistics are now integrated with the cost-based optimizers (CBO) of Amazon Athena and Amazon Redshift Spectrum, resulting in improved query performance and potential cost savings. Data lake performance optimization is especially important for queries with multiple joins, and that is where cost-based optimizers help the most.

Statistics

Statistics Data Lake Optimization Data-driven

Data Intelligence in DataOps: Navigating the Journey to Continuous Data Value

Alation

SEPTEMBER 21, 2021

Compare this story to what we have today: smartphones with traffic navigation apps that provide us with turn-by-turn directions that continuously try to find the optimal path based on traffic conditions on the route being traveled. At IDC, we refer to data-native workers as a generation, Generation Data, or Gen-D for short.

Metadata

Metadata Testing Recreation/Entertainment Data-driven

The Ultimate Guide to Modern Data Quality Management (DQM) For An Effective Data Quality Control Driven by The Right Metrics

datapine

SEPTEMBER 29, 2022

Data quality refers to the assessment of the information you have, relative to its purpose and its ability to serve that purpose. While the digital age has been successful in prompting innovation far and wide, it has also facilitated what is referred to as the “data crisis” – low-quality data. 2 – Data profiling.

Data Quality

Data Quality Metrics Data-driven Management

Introducing enhanced functionality for worker configuration management in Amazon MSK Connect

AWS Big Data

MARCH 25, 2024

Tags are key-value metadata that can be associated with AWS service resources. For a list of Region availability, refer to AWS Services by Region. She is passionate about new technologies and focused on helping customers achieve cost optimization and operational excellence.

Management

Management Metadata Reporting Big Data

Setting up and Getting Started with Cloudera’s New SQL AI Assistant

Cloudera

JANUARY 19, 2024

It can help you to create, edit, optimize, fix, and succinctly summarize queries using natural language. Please refer to the product documentation for more information about specific releases. This will expand the SQL AI toolbar with buttons to generate, edit, explain, optimize and fix SQL statements.

Data Warehouse

Data Warehouse Data Processing Optimization Modeling

Mastering Ingress in the UI: Elevating your app visibility

IBM Big Data Hub

NOVEMBER 3, 2023

References UI and CLI CLI and Terraform CLI and Terraform— Instance , TLS Secret and Opaque Secret Scroll to view full table Configuring a multi-tenant microservices environment in IBM Cloud Let’s dive into a practical scenario. v1 kind: Ingress metadata: annotations: kubernetes.io/ingress.class: Delete an ALB.

Data Processing

Data Processing Metadata Management Testing

Architectural patterns for real-time analytics using Amazon Kinesis Data Streams, part 1

AWS Big Data

JANUARY 8, 2024

Refer to Amazon Kinesis Data Streams integrations for additional details. Stream Processing – An application created with Amazon Managed Service for Apache Flink can read the records from the data stream to detect and clean any errors in the time series data and enrich the data with specific metadata to optimize operational analytics.

Analytics

Analytics IoT Data-driven Snapshot

Apache Ozone – A High Performance Object Store for CDP Private Cloud

Cloudera

OCTOBER 15, 2021

Apache Ozone has added a new feature called File System Optimization (“FSO”) in HDDS-2939. With FSO, Apache Ozone guarantees atomic directory operations, and renaming or deleting a directory is a simple metadata operation even if the directory has a large set of sub-paths (directories/files) within it. Conclusion. Further Reading.

Testing

Testing Measurement Optimization Metadata

From Hive Tables to Iceberg Tables: Hassle-Free

Cloudera

JULY 14, 2023

They simply read the underlying data (not even a full read; they just read the Parquet headers) and create corresponding Iceberg metadata files. You could optimize your table now or at a later stage using the “rewrite_data_files” procedure. Hive creates Iceberg’s metadata files for the same exact table.

Snapshot

Snapshot Metadata Data Warehouse Testing

RDF-star Implementation in GraphDB and How Synaptica Used It Within Graphite for Access Control

Ontotext

MARCH 29, 2021

Vassil Momtchev: RDF-star (formerly known as RDF*) helps in every case, where the user needs to express a complex relationship with metadata associated for a triple like: 1. << Technically speaking, RDF-star is the syntactic sugar, which makes it easier to attach metadata to edges in the graph. source :TheNationalEnquirer ; 3.

Metadata

Metadata IT Modeling Experimentation

Power enterprise-grade Data Vaults with Amazon Redshift – Part 2

AWS Big Data

NOVEMBER 16, 2023

Data Vault overview For a brief review of the core Data Vault premise and concepts, refer to the first post in this series. For more information, refer to Amazon Redshift database encryption. String-optimized compression The Data Vault 2.0 If you use AWS KMS, you can either use an AWS managed key or customer managed key.

Enterprise

Enterprise Data Warehouse Snapshot Cost-Benefit

How to use foundation models and trusted governance to manage AI workflow risk

IBM Big Data Hub

OCTOBER 16, 2023

AI governance refers to the practice of directing, managing and monitoring an organization’s AI activities. It includes processes that trace and document the origin of data, models and associated metadata and pipelines for audits. Capture and document model metadata for report generation. Increase trust in AI outcomes.

Risk

Risk Modeling Management Metadata

Introducing Amazon MWAA support for the Airflow REST API and web server auto scaling

AWS Big Data

MAY 16, 2024

With the new REST API, you can now invoke DAG runs, manage datasets, or get the status of Airflow’s metadata database, trigger, and scheduler—all without relying on the Airflow web UI or CLI. Refer to Creating an Apache Airflow web login token for more details. For this example, we set the upper limit to 5 and lower limit to 2.

Testing

Testing Interactive Metrics Management

Success Stories: Applications and Benefits of Knowledge Graphs in Financial Services

Ontotext

JULY 6, 2023

This shift of both a technical and an outcome mindset allows them to establish a centralized metadata hub for their data assets and effortlessly access information from diverse systems that previously had limited interaction. internal metadata, industry ontologies, etc.) names, locations, brands, industry codes, etc.)

Cost-Benefit

Cost-Benefit Metadata Experimentation Risk

Exploring real-time streaming for generative AI Applications

AWS Big Data

MARCH 25, 2024

To learn more about RAG, refer to Question answering using Retrieval Augmented Generation with foundation models in Amazon SageMaker JumpStart. Change data capture (CDC) events contain information about the source record, updates, and metadata such as time, source, classification (insert, update, or delete), and the initiator of the change.

Data Lake

Data Lake Unstructured Data Management Modeling
