Data Analytics, Metadata and Reference

Data Analytics

Metadata

Reference

RDF-Star: Metadata Complexity Simplified

Ontotext

JUNE 10, 2021

This is a graph of millions of edges and vertices – in enterprise data management terms it is a giant piece of master/reference data. To handle such scenarios you need a transalytical graph database – a database engine that can deal with both frequent updates (OLTP workload) as well as with graph analytics (OLAP).

Metadata

Metadata Cost-Benefit OLAP Modeling

Run Trino queries 2.7 times faster with Amazon EMR 6.15.0

AWS Big Data

MARCH 22, 2024

Benchmark setup In our testing, we used the 3 TB dataset stored in Amazon S3 in compressed Parquet format and metadata for databases and tables is stored in the AWS Glue Data Catalog. This benchmark uses unmodified TPC-DS data schema and table relationships. He has been focusing in the big data analytics space since 2014.

Metadata

Metadata Statistics Broadcasting Optimization

Join 52,000+

Insiders

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Webinars

How to Optimize the Developer Experience for Monumental Impact

Generative AI Deep Dive: Advancing from Proof of Concept to Production

Understanding User Needs and Satisfying Them

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You Need to Know

Leading the Development of Profitable and Sustainable Products

MORE WEBINARS

Trending Sources

Introducing Amazon MWAA larger environment sizes

AWS Big Data

APRIL 16, 2024

Running Apache Airflow at scale puts proportionally greater load on the Airflow metadata database, sometimes leading to CPU and memory issues on the underlying Amazon Relational Database Service (Amazon RDS) cluster. A resource-starved metadata database may lead to dropped connections from your workers, failing tasks prematurely.

Metadata

Metadata Metrics Testing Management

Webinars

How to Optimize the Developer Experience for Monumental Impact

Generative AI Deep Dive: Advancing from Proof of Concept to Production

Understanding User Needs and Satisfying Them

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You Need to Know

Leading the Development of Profitable and Sustainable Products

MORE WEBINARS

Use Amazon Athena with Spark SQL for your open-source transactional table formats

AWS Big Data

JANUARY 24, 2024

These formats enable ACID (atomicity, consistency, isolation, durability) transactions, upserts, and deletes, and advanced features such as time travel and snapshots that were previously only available in data warehouses. For more information, refer to Amazon S3: Allows read and write access to objects in an S3 Bucket.

Snapshot

Snapshot Data Lake Metadata Optimization

Data governance in the age of generative AI

AWS Big Data

FEBRUARY 29, 2024

Data discoverability Unlike structured data, which is managed in well-defined rows and columns, unstructured data is stored as objects. For users to be able to discover and comprehend the data, the first step is to build a comprehensive catalog using the metadata that is generated and captured in the source systems.

Data Governance

Data Governance Unstructured Data Metadata Data Lake

Deliver decompressed Amazon CloudWatch Logs to Amazon S3 and Splunk using Amazon Data Firehose

AWS Big Data

APRIL 2, 2024

You can use Amazon Data Firehose to aggregate and deliver log events from your applications and services captured in Amazon CloudWatch Logs to your Amazon Simple Storage Service (Amazon S3) bucket and Splunk destinations, for use cases such as data analytics, security analysis, application troubleshooting etc.

Metadata

Metadata Marketing Analytics Data Transformation

Orchestrate an end-to-end ETL pipeline using Amazon S3, AWS Glue, and Amazon Redshift Serverless with Amazon MWAA

AWS Big Data

APRIL 25, 2024

Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed orchestration service for Apache Airflow that you can use to set up and operate data pipelines in the cloud at scale. Apache Airflow is an open source tool used to programmatically author, schedule, and monitor sequences of processes and tasks, referred to as workflows.

Metadata

Metadata Data Processing Management Testing

How Huron built an Amazon QuickSight Asset Catalogue with AWS CDK Based Deployment Pipeline

AWS Big Data

APRIL 26, 2023

Having an accurate and up-to-date inventory of all technical assets helps an organization ensure it can keep track of all its resources with metadata information such as their assigned oners, last updated date, used by whom, how frequently and more. This is a guest blog post co-written with Corey Johnson from Huron.

Metadata

Metadata Dashboards Visualization Consulting

Accelerate HiveQL with Oozie to Spark SQL migration on Amazon EMR

AWS Big Data

APRIL 19, 2023

We split the solution into two primary components: generating Spark job metadata and running the SQL on Amazon EMR. The first component (metadata setup) consumes existing Hive job configurations and generates metadata such as number of parameters, number of actions (steps), and file formats. sql_path SQL file name.

Metadata

Metadata Testing Data Lake Consulting

Why Establishing Data Context is the Key to Creating Competitive Advantage

Ontotext

AUGUST 22, 2023

The age of Big Data inevitably brought computationally intensive problems to the enterprise. Central to today’s efficient business operations are the activities of data capturing and storage, search, sharing, and data analytics. With semantic metadata, enterprise data gets linked to one another and to external sources.

Metadata

Metadata Knowledge Discovery Big Data Enterprise

What is data governance? Best practices for managing data assets

CIO Business Intelligence

MARCH 24, 2023

The Business Application Research Center (BARC) warns that data governance is a highly complex, ongoing program, not a “big bang initiative,” and it runs the risk of participants losing trust and interest over time. The program must introduce and support standardization of enterprise data.

Data Governance

Data Governance Management Metadata Data Quality

Use Amazon Athena to query data stored in Google Cloud Platform

AWS Big Data

AUGUST 15, 2023

As customers accelerate their migrations to the cloud and transform their businesses, some find themselves in situations where they have to manage data analytics in a multi-cloud environment, such as acquiring a company that runs on a different cloud provider. For complete steps, refer to Creating a VPC for a data source connector.

Recreation/Entertainment

Recreation/Entertainment Unstructured Data Business Intelligence Data-driven

Build Spark Structured Streaming applications with the open source connector for Amazon Kinesis Data Streams

AWS Big Data

MAY 24, 2024

Apache Spark is a powerful big data engine used for large-scale data analytics. You can use Apache Spark to process streaming data from a variety of streaming sources, including Amazon Kinesis Data Streams for use cases like clickstream analysis, fraud detection, and more. Starting with Amazon EMR 7.1,

Metadata

Metadata Interactive Business Objectives Management

Prioritizing Data: Why a Solid Data Management Strategy Will Be Critical in 2024

Ontotext

JANUARY 29, 2024

As companies in almost every market segment attempt to continuously enhance and modernize data management practices to drive greater business outcomes, organizations will be watching numerous trends emerge this year. Sometimes, the challenge is that the data itself often raises more questions than it answers.

Strategy

Strategy Management Metadata Data-driven

Multicloud data lake analytics with Amazon Athena

AWS Big Data

MARCH 18, 2024

In these cases, you may want an integrated query layer to seamlessly run analytical queries across these diverse cloud stores and streamline your data analytics processes. Refer to Using Amazon Athena Federated Query for further details. The AWS Glue Data Catalog holds the metadata for Amazon S3 and GCS data.

Data Lake

Data Lake Analytics Cost-Benefit Management

Use Apache Iceberg in a data lake to support incremental data processing

AWS Big Data

MARCH 2, 2023

Apache Iceberg is an open table format for very large analytic datasets, which captures metadata information on the state of datasets as they evolve and change over time. Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback.

Data Lake

Data Lake Data Processing Metadata Snapshot

Introducing Amazon MWAA support for Apache Airflow version 2.7.2 and deferrable operators

AWS Big Data

NOVEMBER 6, 2023

Refer to the Configuration reference in the User Guide for detailed configuration values. To learn more about Setup and Teardown tasks, refer to the Apache Airflow documentation. For a complete list of installed packages and their versions, refer to this MWAA documentation. The following diagram describes the process.

Metrics

Metrics Metadata Snapshot Management

Amazon DataZone now integrates with AWS Glue Data Quality and external data quality solutions

AWS Big Data

APRIL 3, 2024

If the asset has AWS Glue Data Quality enabled, you can now quickly visualize the data quality score directly in the catalog search pane. By selecting the corresponding asset, you can understand its content through the readme, glossary terms , and technical and business metadata. option("header", "true").option("inferSchema",

Data Quality

Data Quality Visualization Metadata Metrics

How Eightfold AI implemented metadata security in a multi-tenant data analytics environment with Amazon Redshift

AWS Big Data

NOVEMBER 29, 2023

The Eightfold Talent Intelligence Platform integrates with Amazon Redshift metadata security to implement visibility of data catalog listing of names of databases, schemas, tables, views, stored procedures, and functions in Amazon Redshift. This post discusses restricting listing of data catalog metadata as per the granted permissions.

Metadata

Metadata Data Warehouse Analytics Data Analytics

Architectural patterns for real-time analytics using Amazon Kinesis Data Streams, part 1

AWS Big Data

JANUARY 8, 2024

This is the first post to a blog series that offers common architectural patterns in building real-time data streaming infrastructures using Kinesis Data Streams for a wide range of use cases. In this post, we will review the common architectural patterns of two use cases: Time Series Data Analysis and Event Driven Microservices.

Analytics

Analytics IoT Data-driven Snapshot

GraphDB in Action: Putting the Most Reliable RDF Database to Work for Better Human-machine Interaction

Ontotext

JANUARY 26, 2023

These 30 layers can be split into two kinds: a location-reference layer and a topic layer. The authors address the challenge of interoperability in the digitalization of mobility systems and introduce a reference architecture for the Shift2Rail Interoperability Framework (IF). The catalog stores the asset’s metadata in RDF.

Interactive

Interactive Metadata Data Integration Data-driven

Apache Ozone – A High Performance Object Store for CDP Private Cloud

Cloudera

OCTOBER 15, 2021

With FSO, Apache Ozone guarantees atomic directory operations, and renaming or deleting a directory is a simple metadata operation even if the directory has a large set of sub-paths (directories/files) within it. In fact, this gives Apache Ozone a significant performance advantage over other object stores in the data analytics ecosystem.

Testing

Testing Measurement Optimization Metadata

Speed up queries with the cost-based optimizer in Amazon Athena

AWS Big Data

NOVEMBER 17, 2023

Starting today, the Athena SQL engine uses a cost-based optimizer (CBO), a new feature that uses table and column statistics stored in the AWS Glue Data Catalog as part of the table’s metadata. CBO optimizes based on table statistics stored in the AWS Glue Data Catalog. Analytics Architect on Amazon Athena.

Optimization

Optimization Statistics Metadata Data Lake

Exploring real-time streaming for generative AI Applications

AWS Big Data

MARCH 25, 2024

To learn more about RAG, refer to Question answering using Retrieval Augmented Generation with foundation models in Amazon SageMaker JumpStart. A RAG-based generative AI application can only produce generic responses based on its training data and the relevant documents in the knowledge base.

Data Lake

Data Lake Unstructured Data Management Modeling

Upgrade Journey: The Path from CDH to CDP Private Cloud

Cloudera

SEPTEMBER 28, 2020

The customer leverages Cloudera’s multi-function analytics stack in CDP. The data lifecycle model ingests data using Kafka, enriches that data with Spark-based batch process, performs deep data analytics using Hive and Impala, and finally uses that data for data science using Cloudera Data Science Workbench to get deep insights.

Testing

Testing Metadata Risk Data Science

Gain insights from historical location data using Amazon Location Service and AWS analytics services

AWS Big Data

MARCH 13, 2024

Data analytics – Business analysts gather operational insights from multiple data sources, including the location data collected from the vehicles. Athena is used to run geospatial queries on the location data stored in the S3 buckets. The ingestion approach is not in scope of this post. Choose Run.

Analytics

Analytics IoT Metadata Internet of Things

Introducing Amazon MWAA support for the Airflow REST API and web server auto scaling

AWS Big Data

MAY 16, 2024

With the new REST API, you can now invoke DAG runs, manage datasets, or get the status of Airflow’s metadata database, trigger, and scheduler—all without relying on the Airflow web UI or CLI. Refer to Creating an Apache Airflow web login token for more details. She is passionate about data analytics and networking.

Testing

Testing Interactive Metrics Management

How Cargotec uses metadata replication to enable cross-account data sharing

AWS Big Data

JUNE 7, 2023

They chose AWS Glue as their preferred data integration tool due to its serverless nature, low maintenance, ability to control compute resources in advance, and scale when needed. To share the datasets, they needed a way to share access to the data and access to catalog metadata in the form of tables and views.

Metadata

Metadata Data Lake Machine Learning Big Data

Securely process near-real-time data from Amazon MSK Serverless using an AWS Glue streaming ETL job with IAM authentication

AWS Big Data

SEPTEMBER 13, 2023

Streaming data has become an indispensable resource for organizations worldwide because it offers real-time insights that are crucial for data analytics. The escalating velocity and magnitude of collected data has created a demand for real-time analytics. This table acts as a metadata layer for the data.

Data Processing

Data Processing Management Interactive Metadata

Setting up and Getting Started with Cloudera’s New SQL AI Assistant

Cloudera

JANUARY 19, 2024

Please refer to the product documentation for more information about specific releases. Hallucination: In the context of LLMs, hallucination refers to a phenomenon where these models generate responses that are incorrect, nonsensical, or fabricated. or higher on the public cloud. Both Hive and Impala dialects are supported.

Data Warehouse

Data Warehouse Data Processing Optimization Modeling

Modernize a legacy real-time analytics application with Amazon Managed Service for Apache Flink

AWS Big Data

OCTOBER 11, 2023

In this post, we discuss ways to modernize your legacy, on-premises, real-time analytics architecture to build serverless data analytics solutions on AWS using Amazon Managed Service for Apache Flink. It shows a call center streaming data source that sends the latest call center feed in every 15 seconds.

Management

Management Metadata Analytics Dashboards

Orchestrate Amazon EMR Serverless Spark jobs with Amazon MWAA, and data validation using Amazon Athena

AWS Big Data

DECEMBER 12, 2023

Use EMR Serverless to transform the data using PySpark code and then store the transformed data back in your S3 bucket. Use Athena to create an external table based on the S3 dataset and run queries to analyze the transformed data. Athena uses the AWS Glue Data Catalog to store the table metadata.

Data Processing

Data Processing Management Statistics Interactive

AI in Analytics: The NLQ Use Case

Sisense

JULY 24, 2019

Without historical data, facilitating longer NLQ journeys in exploration mode will be somewhat limited at first. Imagine a marine freight company using Captain Cook slang to refer to distances (fathom), weights (draft), and types of goods (treasures) being shipped across oceans. Using AI to its Fullest.

Analytics

Analytics Experimentation Metadata Big Data

What you need to know about product management for AI

O'Reilly on Data

MARCH 31, 2020

There may even be someone on your team who built a personalized video recommender before and can help scope and estimate the project requirements using that past experience as a point of reference. You might have millions of short videos , with user ratings and limited metadata about the creators or content.

Management

Management Machine Learning Experimentation Metrics

AI recommendations for descriptions in Amazon DataZone for enhanced business data cataloging and discovery is now generally available

AWS Big Data

APRIL 2, 2024

Data consumers need detailed descriptions of the business context of a data asset and documentation about its recommended use cases to quickly identify the relevant data for their intended use case. Go to your asset in your data project and choose Generate summary to obtain the detailed description of the asset and its columns.

Metadata

Metadata Metrics Data-driven Modeling

Preprocess and fine-tune LLMs quickly and cost-effectively using Amazon EMR Serverless and Amazon SageMaker

AWS Big Data

FEBRUARY 1, 2024

The Common Crawl corpus contains petabytes of data, regularly collected since 2008, and contains raw webpage data, metadata extracts, and text extracts. In addition to determining which dataset should be used, cleansing and processing the data to the fine-tuning’s specific need is required.

Metadata

Metadata Modeling Data Processing Unstructured Data

Build efficient ETL pipelines with AWS Step Functions distributed map and redrive feature

AWS Big Data

DECEMBER 18, 2023

For more information, refer to IAM Policies for invoking AWS Glue job from Step Functions. There are multiple tables related to customers and order data in the RDS database. Amazon S3 hosts the metadata of all the tables as a.csv file. To learn more about how distributed map redrive works, refer to Redriving Map Runs.

Metadata

Metadata Visualization Data Lake Data-driven

Build SAML identity federation for Amazon OpenSearch Service domains within a VPC

AWS Big Data

FEBRUARY 7, 2024

Refer to How can I access OpenSearch Dashboards from outside of a VPC using Amazon Cognito authentication for a detailed evaluation of the available options and the corresponding pros and cons. For more information, refer to the AWS CDK v2 Developer Guide. For instructions, refer to Creating a public hosted zone. application.

Dashboards

Dashboards Data Processing Metadata Consulting

Unstructured data management and governance using AWS AI/ML and analytics services

AWS Big Data

OCTOBER 25, 2023

But most important of all, the assumed dormant value in the unstructured data is a question mark, which can only be answered after these sophisticated techniques have been applied. Therefore, there is a need to being able to analyze and extract value from the data economically and flexibly. The solution integrates data in three tiers.

Unstructured Data

Unstructured Data Metadata Management Analytics

Build a multi-Region and highly resilient modern data architecture using AWS Glue and AWS Lake Formation

AWS Big Data

JANUARY 24, 2023

This solution only replicates metadata in the Data Catalog, not the actual underlying data. To have a redundant data lake using Lake Formation and AWS Glue in an additional Region, we recommend replicating the Amazon S3-based storage using S3 replication , S3 sync, aws-s3-copy-sync-using-batch or S3 Batch replication process.

Data Architecture

Data Architecture Metadata Data Lake Snapshot

AWS Glue streaming application to process Amazon MSK data using AWS Glue Schema Registry

AWS Big Data

JUNE 12, 2023

Organizations across the world are increasingly relying on streaming data, and there is a growing need for real-time data analytics, considering the growing velocity and volume of data being collected. Refer appendix section for more information on this feature. Refer to the first stack’s output.

Management

Management Metadata Testing Internet of Things

What is an open data lakehouse and why you should care?

IBM Big Data Hub

JANUARY 17, 2023

These new technologies and approaches, along with the desire to reduce data duplication and complex ETL pipelines, have resulted in a new architectural data platform approach known as the data lakehouse – offering the flexibility of a data lake with the performance and structure of a data warehouse.

Data Lake

Data Lake Metadata Data Warehouse Data Governance

Extracting key insights from Amazon S3 access logs with AWS Glue for Ray

AWS Big Data

SEPTEMBER 7, 2023

They store attributes such as object size, total time, turn-around time, and HTTP referer for log records. AWS Glue Data Catalog stores information as metadata tables, where each table specifies a single data store. Running the crawler on a schedule updates AWS Glue Data Catalog with new partitions and metadata.

Metadata

Metadata Dashboards Metrics Visualization

Clean up your Excel and CSV files without writing code using AWS Glue DataBrew

AWS Big Data

NOVEMBER 15, 2023

You can use its built-in transformations, recipes, as well as integrations with the AWS Glue Data Catalog and Amazon Simple Storage Service (Amazon S3) to preprocess the data in your landing zone, clean it up, and send it downstream for analytical processing. For Matching conditions , choose Match all conditions.

Metadata

Metadata Sales Data Lake Big Data

RDF-Star: Metadata Complexity Simplified

Run Trino queries 2.7 times faster with Amazon EMR 6.15.0

Webinars

Trending Sources

Introducing Amazon MWAA larger environment sizes

Webinars

Use Amazon Athena with Spark SQL for your open-source transactional table formats

Data governance in the age of generative AI

Deliver decompressed Amazon CloudWatch Logs to Amazon S3 and Splunk using Amazon Data Firehose

Orchestrate an end-to-end ETL pipeline using Amazon S3, AWS Glue, and Amazon Redshift Serverless with Amazon MWAA

How Huron built an Amazon QuickSight Asset Catalogue with AWS CDK Based Deployment Pipeline

Accelerate HiveQL with Oozie to Spark SQL migration on Amazon EMR

Why Establishing Data Context is the Key to Creating Competitive Advantage

What is data governance? Best practices for managing data assets

Use Amazon Athena to query data stored in Google Cloud Platform

Build Spark Structured Streaming applications with the open source connector for Amazon Kinesis Data Streams

Prioritizing Data: Why a Solid Data Management Strategy Will Be Critical in 2024

Multicloud data lake analytics with Amazon Athena

Use Apache Iceberg in a data lake to support incremental data processing

Introducing Amazon MWAA support for Apache Airflow version 2.7.2 and deferrable operators

Amazon DataZone now integrates with AWS Glue Data Quality and external data quality solutions

How Eightfold AI implemented metadata security in a multi-tenant data analytics environment with Amazon Redshift

Architectural patterns for real-time analytics using Amazon Kinesis Data Streams, part 1

GraphDB in Action: Putting the Most Reliable RDF Database to Work for Better Human-machine Interaction

Apache Ozone – A High Performance Object Store for CDP Private Cloud

Speed up queries with the cost-based optimizer in Amazon Athena

Exploring real-time streaming for generative AI Applications

Upgrade Journey: The Path from CDH to CDP Private Cloud

Gain insights from historical location data using Amazon Location Service and AWS analytics services

Introducing Amazon MWAA support for the Airflow REST API and web server auto scaling

How Cargotec uses metadata replication to enable cross-account data sharing

Securely process near-real-time data from Amazon MSK Serverless using an AWS Glue streaming ETL job with IAM authentication

Setting up and Getting Started with Cloudera’s New SQL AI Assistant

Modernize a legacy real-time analytics application with Amazon Managed Service for Apache Flink

Orchestrate Amazon EMR Serverless Spark jobs with Amazon MWAA, and data validation using Amazon Athena

AI in Analytics: The NLQ Use Case

What you need to know about product management for AI

AI recommendations for descriptions in Amazon DataZone for enhanced business data cataloging and discovery is now generally available

Preprocess and fine-tune LLMs quickly and cost-effectively using Amazon EMR Serverless and Amazon SageMaker

Build efficient ETL pipelines with AWS Step Functions distributed map and redrive feature

Build SAML identity federation for Amazon OpenSearch Service domains within a VPC

Unstructured data management and governance using AWS AI/ML and analytics services

Build a multi-Region and highly resilient modern data architecture using AWS Glue and AWS Lake Formation

AWS Glue streaming application to process Amazon MSK data using AWS Glue Schema Registry

What is an open data lakehouse and why you should care?

Extracting key insights from Amazon S3 access logs with AWS Glue for Ray

Clean up your Excel and CSV files without writing code using AWS Glue DataBrew

Stay Connected