Data Analytics, Data Lake, Metadata and Reference

Data Analytics

Data Lake

Metadata

Reference

Multicloud data lake analytics with Amazon Athena

AWS Big Data

MARCH 18, 2024

Many organizations operate data lakes spanning multiple cloud data stores. In these cases, you may want an integrated query layer to seamlessly run analytical queries across these diverse cloud stores and streamline your data analytics processes. Refer to the respective documentation for details.

Data Lake

Data Lake Analytics Cost-Benefit Management

Use Apache Iceberg in a data lake to support incremental data processing

AWS Big Data

MARCH 2, 2023

Apache Iceberg is an open table format for very large analytic datasets, which captures metadata information on the state of datasets as they evolve and change over time. Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback.

Data Lake

Data Lake Data Processing Metadata Snapshot

Join 52,000+

Insiders

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Webinars

The Key to Sustainable Energy Optimization: A Data-Driven Approach for Manufacturing

From Developer Experience to Product Experience: How a Shared Focus Fuels Product Success

Understanding User Needs and Satisfying Them

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You Need to Know

How To Get Promoted In Product Management

MORE WEBINARS

Trending Sources

Use Amazon Athena with Spark SQL for your open-source transactional table formats

AWS Big Data

JANUARY 24, 2024

AWS-powered data lakes, supported by the unmatched availability of Amazon Simple Storage Service (Amazon S3), can handle the scale, agility, and flexibility required to combine different data and analytics approaches. For more information, refer to Amazon S3: Allows read and write access to objects in an S3 Bucket.

Snapshot

Snapshot Data Lake Metadata Optimization

Webinars

The Key to Sustainable Energy Optimization: A Data-Driven Approach for Manufacturing

From Developer Experience to Product Experience: How a Shared Focus Fuels Product Success

Understanding User Needs and Satisfying Them

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You Need to Know

How To Get Promoted In Product Management

MORE WEBINARS

Data governance in the age of generative AI

AWS Big Data

FEBRUARY 29, 2024

To provide a response that includes the enterprise context, each user prompt needs to be augmented with a combination of insights from structured data from the data warehouse and unstructured data from the enterprise data lake. It also monitors bias drift in your model’s predictions and feature attribution.

Data Governance

Data Governance Unstructured Data Metadata Data Lake

Accelerate HiveQL with Oozie to Spark SQL migration on Amazon EMR

AWS Big Data

APRIL 19, 2023

We split the solution into two primary components: generating Spark job metadata and running the SQL on Amazon EMR. The first component (metadata setup) consumes existing Hive job configurations and generates metadata such as number of parameters, number of actions (steps), and file formats. sql_path SQL file name.

Metadata

Metadata Testing Data Lake Consulting

Exploring real-time streaming for generative AI Applications

AWS Big Data

MARCH 25, 2024

To learn more about RAG, refer to Question answering using Retrieval Augmented Generation with foundation models in Amazon SageMaker JumpStart. A RAG-based generative AI application can only produce generic responses based on its training data and the relevant documents in the knowledge base.

Data Lake

Data Lake Unstructured Data Management Modeling

Speed up queries with the cost-based optimizer in Amazon Athena

AWS Big Data

NOVEMBER 17, 2023

Athena provides a simplified, flexible way to analyze petabytes of data where it lives. You can analyze data or build applications from an Amazon Simple Storage Service (Amazon S3) data lake and 30 data sources, including on-premises data sources or other cloud systems using SQL or Python.

Optimization

Optimization Statistics Metadata Data Lake

Architectural patterns for real-time analytics using Amazon Kinesis Data Streams, part 1

AWS Big Data

JANUARY 8, 2024

This is the first post to a blog series that offers common architectural patterns in building real-time data streaming infrastructures using Kinesis Data Streams for a wide range of use cases. In this post, we will review the common architectural patterns of two use cases: Time Series Data Analysis and Event Driven Microservices.

Analytics

Analytics IoT Data-driven Snapshot

Choosing an open table format for your transactional data lake on AWS

AWS Big Data

JUNE 9, 2023

A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale.

Data Lake

Data Lake Metadata Optimization Statistics

How Cargotec uses metadata replication to enable cross-account data sharing

AWS Big Data

JUNE 7, 2023

Cargotec captures terabytes of IoT telemetry data from their machinery operated by numerous customers across the globe. This data needs to be ingested into a data lake, transformed, and made available for analytics, machine learning (ML), and visualization.

Metadata

Metadata Data Lake Machine Learning Big Data

Build a multi-Region and highly resilient modern data architecture using AWS Glue and AWS Lake Formation

AWS Big Data

JANUARY 24, 2023

This solution only replicates metadata in the Data Catalog, not the actual underlying data. To have a redundant data lake using Lake Formation and AWS Glue in an additional Region, we recommend replicating the Amazon S3-based storage using S3 replication , S3 sync, aws-s3-copy-sync-using-batch or S3 Batch replication process.

Data Architecture

Data Architecture Metadata Data Lake Snapshot

What is an open data lakehouse and why you should care?

IBM Big Data Hub

JANUARY 17, 2023

A data lakehouse is an emerging data management architecture that improves efficiency and converges data warehouse and data lake capabilities driven by a need to improve efficiency and obtain critical insights faster. Let’s start with why data lakehouses are becoming increasingly important.

Data Lake

Data Lake Metadata Data Warehouse Data Governance

Unstructured data management and governance using AWS AI/ML and analytics services

AWS Big Data

OCTOBER 25, 2023

But most important of all, the assumed dormant value in the unstructured data is a question mark, which can only be answered after these sophisticated techniques have been applied. Therefore, there is a need to being able to analyze and extract value from the data economically and flexibly. The solution integrates data in three tiers.

Unstructured Data

Unstructured Data Metadata Management Analytics

Build efficient ETL pipelines with AWS Step Functions distributed map and redrive feature

AWS Big Data

DECEMBER 18, 2023

Solution overview One of the common functionalities involved in data pipelines is extracting data from multiple data sources and exporting it to a data lake or synchronizing the data to another database. For more information, refer to IAM Policies for invoking AWS Glue job from Step Functions.

Metadata

Metadata Visualization Data Lake Data-driven

Use your corporate identities for analytics with Amazon EMR and AWS IAM Identity Center

AWS Big Data

APRIL 26, 2024

Set up EMR Studio In this step, we demonstrate the actions needed from the data lake administrator to set up EMR Studio enabled for trusted identity propagation and with IAM Identity Center integration. If Lake Formation is not already enabled, refer to Getting started with Lake Formation.

Analytics

Analytics Data Lake Management Enterprise

AWS Glue streaming application to process Amazon MSK data using AWS Glue Schema Registry

AWS Big Data

JUNE 12, 2023

Organizations across the world are increasingly relying on streaming data, and there is a growing need for real-time data analytics, considering the growing velocity and volume of data being collected. Refer appendix section for more information on this feature. Refer to the first stack’s output.

Management

Management Metadata Testing Internet of Things

Implement a serverless CDC process with Apache Iceberg using Amazon DynamoDB and Amazon Athena

AWS Big Data

AUGUST 16, 2023

Apache Iceberg is an open table format for very large analytic datasets. Iceberg manages large collections of files as tables, and it supports modern analytical data lake operations such as record-level insert, update, delete, and time travel queries. We fetch the metadata of the users_xxxxxx table from Athena.

Data Lake

Data Lake Metadata Testing Snapshot

Clean up your Excel and CSV files without writing code using AWS Glue DataBrew

AWS Big Data

NOVEMBER 15, 2023

You can use its built-in transformations, recipes, as well as integrations with the AWS Glue Data Catalog and Amazon Simple Storage Service (Amazon S3) to preprocess the data in your landing zone, clean it up, and send it downstream for analytical processing. For Matching conditions , choose Match all conditions.

Metadata

Metadata Sales Data Lake Big Data

A Day in the Life of a DataOps Engineer

DataKitchen

OCTOBER 11, 2021

The biggest challenge is broken data pipelines due to highly manual processes. Figure 1 shows a manually executed data analytics pipeline. First, a business analyst consolidates data from some public websites, an SFTP server and some downloaded email attachments, all into Excel. Monitoring Job Metadata.

Testing

Testing Metadata Dashboards Statistics

Create an end-to-end data strategy for Customer 360 on AWS

AWS Big Data

MARCH 26, 2024

Profile aggregation – When you’ve uniquely identified a customer, you can build applications in Managed Service for Apache Flink to consolidate all their metadata, from name to interaction history. Then, you transform this data into a concise format. Let’s find out what role each of these components play in the context of C360.

Data Strategy

Data Strategy Strategy Data Warehouse Prescriptive Analytics

Design a data mesh on AWS that reflects the envisioned organization

AWS Big Data

JANUARY 22, 2024

Data as a product Treating data as a product entails three key components: the data itself, the metadata, and the associated code and infrastructure. In this approach, teams responsible for generating data are referred to as producers.

Data-driven

Data-driven Advertising Metadata Data Architecture

The Data Scientist’s Guide to the Data Catalog

Alation

JULY 19, 2022

As they attempt to put machine learning models into production, data science teams encounter many of the same hurdles that plagued data analytics teams in years past: Finding trusted, valuable data is time-consuming. Obstacles, such as user roles, permissions, and approval request prevent speedy data access.

Metadata

Metadata Data Quality Statistics Data Science

Build incremental data pipelines to load transactional data changes using AWS DMS, Delta 2.0, and Amazon EMR Serverless

AWS Big Data

MARCH 3, 2023

Building data lakes from continuously changing transactional data of databases and keeping data lakes up to date is a complex task and can be an operational challenge. You can then apply transformations and store data in Delta format for managing inserts, updates, and deletes. with Apache Spark version 3.3.0)

Data Lake

Data Lake Dashboards Metrics Metadata

Strategically Approaching Graph Technologies

Ontotext

FEBRUARY 26, 2024

Firstly, on the data maturity spectrum, the vast majority of organizations I’ve spoken with are stuck in the information stage. They have massive amounts of data they’re collecting and storing in their relational databases, document stores, data lakes, and data warehouses. RDF vs LPG?

Technology

Technology Cost-Benefit Data-driven Metadata

AWS re:Invent 2023 Amazon Redshift Sessions Recap

AWS Big Data

DECEMBER 18, 2023

Sessions ANT203 | What’s new in Amazon Redshift Watch this session to learn about the newest innovations within Amazon Redshift—the petabyte-scale AWS Cloud data warehousing solution. Easily build and train machine learning models using SQL within Amazon Redshift to generate predictive analytics and propel data-driven decision-making.

Data Warehouse

Data Warehouse Machine Learning Data-driven Data Lake

Empower your Jira data in a data lake with Amazon AppFlow and AWS Glue

AWS Big Data

AUGUST 1, 2023

Although Jira Cloud provides reporting capability, loading this data into a data lake will facilitate enrichment with other business data, as well as support the use of business intelligence (BI) tools and artificial intelligence (AI) and machine learning (ML) applications. For InitialRunFlag , choose Setup.

Data Lake

Data Lake Data Transformation Cost-Benefit Data-driven

Data democratization: How data architecture can drive business decisions and AI initiatives

IBM Big Data Hub

AUGUST 4, 2023

In turn, they both must also have the data literacy skills to be able to verify the data’s accuracy, ensure its security, and provide or follow guidance on when and how it should be used. It uses knowledge graphs, semantics and AI/ML technology to discover patterns in various types of metadata.

Data Architecture

Data Architecture Data Lake Machine Learning Data Governance

The Gartner 2021 Leadership Vision for Data & Analytics Leaders Webinar Q&A

Andrew White

JANUARY 11, 2021

It was titled, The Gartner 2021 Leadership Vision for Data & Analytics Leaders. This was for the Chief Data Officer, or head of data and analytics. It is meant to be a desk-reference for that role for 2021. Does Data warehouse as a software tool will play role in future of Data & Analytics strategy?

Data Analytics

Data Analytics Analytics Data-driven Finance

The CDO Imperative: From Process Centric to data-driven

Alation

FEBRUARY 20, 2020

Can I trust the data that I’m seeing? A Single Source of Reference. A data catalog has emerged as a core component of modern data organizations and key for CDOs making the transition from process-centric to data-driven. The catalog draws on third-party information to verify whether the data can be trusted.

Data-driven

Data-driven Internet of Things Data Lake Strategy

Data Leaders Brief

Multicloud data lake analytics with Amazon Athena

Use Apache Iceberg in a data lake to support incremental data processing

Webinars

Trending Sources

Use Amazon Athena with Spark SQL for your open-source transactional table formats

Webinars

Data governance in the age of generative AI

Accelerate HiveQL with Oozie to Spark SQL migration on Amazon EMR

Exploring real-time streaming for generative AI Applications

Speed up queries with the cost-based optimizer in Amazon Athena

Architectural patterns for real-time analytics using Amazon Kinesis Data Streams, part 1

Choosing an open table format for your transactional data lake on AWS

How Cargotec uses metadata replication to enable cross-account data sharing

Build a multi-Region and highly resilient modern data architecture using AWS Glue and AWS Lake Formation

What is an open data lakehouse and why you should care?

Unstructured data management and governance using AWS AI/ML and analytics services

Build efficient ETL pipelines with AWS Step Functions distributed map and redrive feature

Use your corporate identities for analytics with Amazon EMR and AWS IAM Identity Center

AWS Glue streaming application to process Amazon MSK data using AWS Glue Schema Registry

Implement a serverless CDC process with Apache Iceberg using Amazon DynamoDB and Amazon Athena

Clean up your Excel and CSV files without writing code using AWS Glue DataBrew

A Day in the Life of a DataOps Engineer

Create an end-to-end data strategy for Customer 360 on AWS

Design a data mesh on AWS that reflects the envisioned organization

The Data Scientist’s Guide to the Data Catalog

Build incremental data pipelines to load transactional data changes using AWS DMS, Delta 2.0, and Amazon EMR Serverless

Strategically Approaching Graph Technologies

AWS re:Invent 2023 Amazon Redshift Sessions Recap

Empower your Jira data in a data lake with Amazon AppFlow and AWS Glue

Data democratization: How data architecture can drive business decisions and AI initiatives

The Gartner 2021 Leadership Vision for Data & Analytics Leaders Webinar Q&A

The CDO Imperative: From Process Centric to data-driven

Stay Connected