Data Analytics, Data Lake and Metadata

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

AWS Big Data

JUNE 10, 2024

Organizations have multiple Hive data warehouses across EMR clusters, where the metadata gets generated. To address this challenge, organizations can deploy a data mesh using AWS Lake Formation that connects the multiple EMR clusters. An entity can act both as a producer of data assets and as a consumer of data assets.

Data Lake

Data Lake Metadata Data Warehouse Data Processing

Multicloud data lake analytics with Amazon Athena

AWS Big Data

MARCH 18, 2024

Many organizations operate data lakes spanning multiple cloud data stores. In these cases, you may want an integrated query layer to seamlessly run analytical queries across these diverse cloud stores and streamline your data analytics processes. This serves as the S3 data lake data for this post.

Data Lake

Data Lake Analytics Cost-Benefit Management

Use Apache Iceberg in a data lake to support incremental data processing

AWS Big Data

MARCH 2, 2023

Apache Iceberg is an open table format for very large analytic datasets, which captures metadata information on the state of datasets as they evolve and change over time. Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback.

Data Lake

Data Lake Data Processing Metadata Snapshot

Webinars

Leading the Development of Profitable and Sustainable Products

How To Get Promoted In Product Management

MORE WEBINARS

Gartner Data & Analytics Sydney 2022

Timo Elliott

NOVEMBER 21, 2022

For the last 30 years, whenever you want to do analytics, the first step is to rip it out of the operational applications and try and move it to a different environment—so data warehousing, data lakes, data lakehouses and now data clouds.

Data Analytics

Data Analytics Analytics Recreation/Entertainment Data Lake

Modernize your ETL platform with AWS Glue Studio: A case study from BMS

AWS Big Data

DECEMBER 13, 2023

For the past 5 years, BMS has used a custom framework called Enterprise Data Lake Services (EDLS) to create ETL jobs for business users. EDLS job steps and metadata Every EDLS job comprises one or more job steps chained together and run in a predefined order orchestrated by the custom ETL framework.

Metadata

Metadata Data Lake Visualization Data Transformation

Use Amazon Athena with Spark SQL for your open-source transactional table formats

AWS Big Data

JANUARY 24, 2024

AWS-powered data lakes, supported by the unmatched availability of Amazon Simple Storage Service (Amazon S3), can handle the scale, agility, and flexibility required to combine different data and analytics approaches. It will never remove files that are still required by a non-expired snapshot.

Snapshot

Snapshot Data Lake Metadata Optimization

Data governance in the age of generative AI

AWS Big Data

FEBRUARY 29, 2024

To provide a response that includes the enterprise context, each user prompt needs to be augmented with a combination of insights from structured data from the data warehouse and unstructured data from the enterprise data lake.

Data Governance

Data Governance Unstructured Data Metadata Data Lake

What is a Data Mesh?

DataKitchen

AUGUST 3, 2021

First-generation – expensive, proprietary enterprise data warehouse and business intelligence platforms maintained by a specialized team drowning in technical debt. Second-generation – gigantic, complex data lake maintained by a specialized team drowning in technical debt. See the pattern?

Data Architecture

Data Architecture Data Lake Cost-Benefit Data Warehouse

The Future of the Data Lakehouse – Open

Cloudera

JUNE 18, 2022

Cloudera customers run some of the biggest data lakes on earth. These lakes power mission critical large scale data analytics, business intelligence (BI), and machine learning use cases, including enterprise data warehouses. On data warehouses and data lakes.

Data Lake

Data Lake Data Warehouse Machine Learning Cost-Benefit

The Future of the Data Lakehouse – Open

CIO Business Intelligence

JUNE 23, 2022

Cloudera customers run some of the biggest data lakes on earth. These lakes power mission critical large scale data analytics, business intelligence (BI), and machine learning use cases, including enterprise data warehouses. On data warehouses and data lakes.

Data Lake

Data Lake Data Warehouse Machine Learning Cost-Benefit

Accelerate HiveQL with Oozie to Spark SQL migration on Amazon EMR

AWS Big Data

APRIL 19, 2023

We split the solution into two primary components: generating Spark job metadata and running the SQL on Amazon EMR. The first component (metadata setup) consumes existing Hive job configurations and generates metadata such as number of parameters, number of actions (steps), and file formats. sql_path SQL file name.

Metadata

Metadata Testing Data Lake Consulting

Addressing Data Mesh Technical Challenges with DataOps

DataKitchen

AUGUST 9, 2021

The domain also includes code that acts upon the data, including tools, pipelines, and other artifacts that drive analytics execution. The domain requires a team that creates/updates/runs the domain, and we can’t forget metadata: catalogs, lineage, test results, processing history, etc., ….

Testing

Testing Data Lake Metadata Publishing

Data architecture strategy for data quality

IBM Big Data Hub

JANUARY 5, 2023

The first generation of data architectures represented by enterprise data warehouse and business intelligence platforms were characterized by thousands of ETL jobs, tables, and reports that only a small group of specialized data engineers understood, resulting in an under-realized positive impact on the business.

Data Quality

Data Quality Data Architecture Strategy Data Lake

Five benefits of a data catalog

IBM Big Data Hub

DECEMBER 16, 2022

For example, data catalogs have evolved to deliver governance capabilities like managing data quality and data privacy and compliance. It uses metadata and data management tools to organize all data assets within your organization. Technical metadata to describe schemas, indexes and other database objects.

Metadata

Metadata Data Quality Data-driven Data Governance

How HR&A uses Amazon Redshift spatial analytics on Amazon Redshift Serverless to measure digital equity in states across the US

AWS Big Data

DECEMBER 5, 2023

A combination of Amazon Redshift Spectrum and COPY commands are used to ingest the survey data stored as CSV files. For the files with unknown structures, AWS Glue crawlers are used to extract metadata and create table definitions in the Data Catalog. She helps customers architect data analytics solutions at scale on AWS.

Measurement

Measurement Dashboards Data Warehouse Analytics

Speed up queries with the cost-based optimizer in Amazon Athena

AWS Big Data

NOVEMBER 17, 2023

Athena provides a simplified, flexible way to analyze petabytes of data where it lives. You can analyze data or build applications from an Amazon Simple Storage Service (Amazon S3) data lake and 30 data sources, including on-premises data sources or other cloud systems using SQL or Python.

Optimization

Optimization Statistics Metadata Data Lake

The Madness of Data (and analytics) Governance

Andrew White

DECEMBER 9, 2019

The client had recently engaged with a well-known consulting company that had recommended a large data catalog effort to collect all enterprise metadata to help identify all data and business issues. Modern data (and analytics) governance does not necessarily need: Wall-to-wall discovery of your data and metadata.

Analytics

Analytics Data Lake Data Governance Metadata

Exploring real-time streaming for generative AI Applications

AWS Big Data

MARCH 25, 2024

Streaming jobs constantly ingest new data to synchronize across systems and can perform enrichment, transformations, joins, and aggregations across windows of time more efficiently. With a file system sink connector, Apache Flink jobs can deliver data to Amazon S3 in open format (such as JSON, Avro, Parquet, and more) files as data objects.

Data Lake

Data Lake Unstructured Data Management Modeling

Overcome these six data consumption challenges for a more data-driven enterprise

IBM Big Data Hub

JUNE 8, 2022

A typical organization’s data landscape consists of a large number of data stores across workflows, business processes and business units, including but not limited to data warehouses, data marts, data lakes, ODS, cloud data stores, and CRM databases. The volume of data assets.

Data-driven

Data-driven Enterprise Data Governance Data Lake

How SumUp made digital analytics more accessible using AWS Glue

AWS Big Data

JUNE 6, 2023

Unless, of course, the rest of their data also resides in the Google Cloud. In this post we showcase how we used AWS Glue to move siloed digital analytics data, with inconsistent arrival times, to AWS S3 (our Data Lake) and our central data warehouse (DWH), Snowflake.

Analytics

Analytics Data Lake Testing Optimization

Choosing an open table format for your transactional data lake on AWS

AWS Big Data

JUNE 9, 2023

A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale.

Data Lake

Data Lake Metadata Optimization Statistics

How Cargotec uses metadata replication to enable cross-account data sharing

AWS Big Data

JUNE 7, 2023

Cargotec captures terabytes of IoT telemetry data from their machinery operated by numerous customers across the globe. This data needs to be ingested into a data lake, transformed, and made available for analytics, machine learning (ML), and visualization.

Metadata

Metadata Data Lake Machine Learning Big Data

Trends in Data Management and Analytics

TDAN

MARCH 19, 2019

Various databases, plus one or more data warehouses, have been the state-of-the art data management infrastructure in companies for years. The emergence of various new concepts, technologies, and applications such as Hadoop, Tableau, R, Power BI, or Data Lakes indicate that changes are under way.

Management

Management Data Lake Data Warehouse Analytics

The Very Group adopts a data catalog to better organize and leverage its online retail capabilities

CIO Business Intelligence

SEPTEMBER 6, 2022

Establishing a clear and unified approach to data. But getting to this stage was an intricate process that involved creating centers of excellence for things like data analytics that own the end-to-end infrastructure, application and skill sets, as well as career plans for staff.

IT

IT Forecasting Data Lake Enterprise

Announcing the 2021 Data Impact Awards

Cloudera

MAY 12, 2021

Use cases could include but are not limited to: workload analysis and replication, migrating or bursting to cloud, data warehouse optimization, and more. Winners included: Connect the Data Lifecycle: Globe Telecom — Raising experience standards and helping customers live enhanced mobile lifestyles. SECURITY AND GOVERNANCE LEADERSHIP.

Digital Transformation

Digital Transformation Machine Learning Optimization Data Lake

Architectural patterns for real-time analytics using Amazon Kinesis Data Streams, part 1

AWS Big Data

JANUARY 8, 2024

This is the first post to a blog series that offers common architectural patterns in building real-time data streaming infrastructures using Kinesis Data Streams for a wide range of use cases. In this post, we will review the common architectural patterns of two use cases: Time Series Data Analysis and Event Driven Microservices.

Analytics

Analytics IoT Data-driven Snapshot

Week in the Life of an Analyst at Gartner US IT Symposium (virtual) 2021

Andrew White

OCTOBER 22, 2021

Lakehouse (data warehouse and data lake working together) 8. Data Literacy, training, coordination, collaboration 8. Data Management Infrastructure/Data Fabric 5. Data Integration tactics 4. Metadata Strategy 3. Figure 3: The Data and Analytics (infrastructure) Continuum.

IT

IT Data Lake Strategy Data Science

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

AWS Big Data

JULY 20, 2023

With data becoming the driving force behind many industries today, having a modern data architecture is pivotal for organizations to be successful. In this post, we describe Orca’s journey building a transactional data lake using Amazon Simple Storage Service (Amazon S3), Apache Iceberg, and AWS Analytics.

Data Lake

Data Lake Analytics Snapshot Optimization

Unlock data across organizational boundaries using Amazon DataZone – now generally available

AWS Big Data

OCTOBER 4, 2023

An Amazon DataZone domain contains an associated business data catalog for search and discovery, a set of metadata definitions to decorate the data assets that are used for discovery purposes, and data projects with integrated analytics and ML tools for users and groups to consume and publish data assets.

Metadata

Metadata Data Lake Publishing Data Governance

Build a multi-Region and highly resilient modern data architecture using AWS Glue and AWS Lake Formation

AWS Big Data

JANUARY 24, 2023

This solution only replicates metadata in the Data Catalog, not the actual underlying data. To have a redundant data lake using Lake Formation and AWS Glue in an additional Region, we recommend replicating the Amazon S3-based storage using S3 replication , S3 sync, aws-s3-copy-sync-using-batch or S3 Batch replication process.

Data Architecture

Data Architecture Metadata Data Lake Snapshot

How Ruparupa gained updated insights with an Amazon S3 data lake, AWS Glue, Apache Hudi, and Amazon QuickSight

AWS Big Data

FEBRUARY 22, 2023

In this post, we show how Ruparupa implemented an incrementally updated data lake to get insights into their business using Amazon Simple Storage Service (Amazon S3), AWS Glue , Apache Hudi , and Amazon QuickSight. An AWS Glue ETL job, using the Apache Hudi connector, updates the S3 data lake hourly with incremental data.

Data Lake

Data Lake Dashboards Cost-Benefit Metadata

What is an open data lakehouse and why you should care?

IBM Big Data Hub

JANUARY 17, 2023

A data lakehouse is an emerging data management architecture that improves efficiency and converges data warehouse and data lake capabilities driven by a need to improve efficiency and obtain critical insights faster. Let’s start with why data lakehouses are becoming increasingly important.

Data Lake

Data Lake Metadata Data Warehouse Data Governance

Unstructured data management and governance using AWS AI/ML and analytics services

AWS Big Data

OCTOBER 25, 2023

But most important of all, the assumed dormant value in the unstructured data is a question mark, which can only be answered after these sophisticated techniques have been applied. Therefore, there is a need to being able to analyze and extract value from the data economically and flexibly. The solution integrates data in three tiers.

Unstructured Data

Unstructured Data Metadata Management Analytics

A Guide to Data Analytics in the Travel Industry

Alation

MARCH 21, 2023

Why is data analytics important for travel organizations? With data analytics , travel organizations can gain real-time insights about customers to make strategic decisions and improve their travel experience. How is data analytics used in the travel industry?

Data Analytics

Data Analytics Analytics Data-driven Big Data

Achieve your AI goals with an open data lakehouse approach

IBM Big Data Hub

OCTOBER 4, 2023

Another IDC study showed that while 2/3 of respondents reported using AI-driven data analytics, most reported that less than half of the data under management is available for this type of analytics. from 2022 to 2026. New insights and relationships are found in this combination. All of this supports the use of AI.

Data Lake

Data Lake Metadata Cost-Benefit Data Warehouse

Habib Bank manages data at scale with Cloudera Data Platform

Cloudera

NOVEMBER 17, 2022

The Solution: CDP Private Cloud brings a next-generation hybrid architecture with cloud-native benefits to HBL’s data platform. HBL started their data journey in 2019 when data lake initiative was started to consolidate complex data sources and enable the bank to use single version of truth for decision making.

Management

Management Data Lake Consulting Unstructured Data

Regeneron turns to IT to accelerate drug discovery

CIO Business Intelligence

NOVEMBER 4, 2022

Then the data is consumed by SaaS-based computational tools, but it still sits within our organization and sits within the controls of our cloud-based solutions.” Much of Regeneron’s data, of course, is confidential. For that reason, many of its data tools — and even its data lake — were built in-house using AWS. “We

Data Lake

Data Lake IT Experimentation Data-driven

Building a Beautiful Data Lakehouse

CIO Business Intelligence

MARCH 9, 2022

Applying artificial intelligence (AI) to data analytics for deeper, better insights and automation is a growing enterprise IT priority. But the data repository options that have been around for a while tend to fall short in their ability to serve as the foundation for big data analytics powered by AI.

Data Lake

Data Lake Unstructured Data Data Warehouse Data Quality

Implement a serverless CDC process with Apache Iceberg using Amazon DynamoDB and Amazon Athena

AWS Big Data

AUGUST 16, 2023

Apache Iceberg is an open table format for very large analytic datasets. Iceberg manages large collections of files as tables, and it supports modern analytical data lake operations such as record-level insert, update, delete, and time travel queries. We fetch the metadata of the users_xxxxxx table from Athena.

Data Lake

Data Lake Metadata Testing Snapshot

Build efficient ETL pipelines with AWS Step Functions distributed map and redrive feature

AWS Big Data

DECEMBER 18, 2023

Solution overview One of the common functionalities involved in data pipelines is extracting data from multiple data sources and exporting it to a data lake or synchronizing the data to another database. There are multiple tables related to customers and order data in the RDS database.

Metadata

Metadata Visualization Data Lake Data-driven

Why We Started the Data Intelligence Project

Alation

JULY 7, 2022

In 2013 I joined American Family Insurance as a metadata analyst. I had always been fascinated by how people find, organize, and access information, so a metadata management role after school was a natural choice. The use cases for metadata are boundless, offering opportunities for innovation in every sector.

Metadata

Metadata Data-driven Insurance Statistics

Clean up your Excel and CSV files without writing code using AWS Glue DataBrew

AWS Big Data

NOVEMBER 15, 2023

You can use its built-in transformations, recipes, as well as integrations with the AWS Glue Data Catalog and Amazon Simple Storage Service (Amazon S3) to preprocess the data in your landing zone, clean it up, and send it downstream for analytical processing. For Matching conditions , choose Match all conditions.

Metadata

Metadata Sales Data Lake Big Data

Augmented data management: Data fabric versus data mesh

IBM Big Data Hub

APRIL 27, 2022

Gartner defines a data fabric as “a design concept that serves as an integrated layer of data and connecting processes. The data fabric architectural approach can simplify data access in an organization and facilitate self-service data consumption at scale.

Management

Management Metadata Data Architecture Data Lake

A Day in the Life of a DataOps Engineer

DataKitchen

OCTOBER 11, 2021

The biggest challenge is broken data pipelines due to highly manual processes. Figure 1 shows a manually executed data analytics pipeline. First, a business analyst consolidates data from some public websites, an SFTP server and some downloaded email attachments, all into Excel. Monitoring Job Metadata.

Testing

Testing Metadata Dashboards Statistics

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

Multicloud data lake analytics with Amazon Athena

Webinars

Trending Sources

Use Apache Iceberg in a data lake to support incremental data processing

Webinars

Gartner Data & Analytics Sydney 2022

Modernize your ETL platform with AWS Glue Studio: A case study from BMS

Use Amazon Athena with Spark SQL for your open-source transactional table formats

Data governance in the age of generative AI

What is a Data Mesh?

The Future of the Data Lakehouse – Open

The Future of the Data Lakehouse – Open

Accelerate HiveQL with Oozie to Spark SQL migration on Amazon EMR

Addressing Data Mesh Technical Challenges with DataOps

Data architecture strategy for data quality

Five benefits of a data catalog

How HR&A uses Amazon Redshift spatial analytics on Amazon Redshift Serverless to measure digital equity in states across the US

Speed up queries with the cost-based optimizer in Amazon Athena

The Madness of Data (and analytics) Governance

Exploring real-time streaming for generative AI Applications

Overcome these six data consumption challenges for a more data-driven enterprise

How SumUp made digital analytics more accessible using AWS Glue

Choosing an open table format for your transactional data lake on AWS

How Cargotec uses metadata replication to enable cross-account data sharing

Trends in Data Management and Analytics

The Very Group adopts a data catalog to better organize and leverage its online retail capabilities

Announcing the 2021 Data Impact Awards

Architectural patterns for real-time analytics using Amazon Kinesis Data Streams, part 1

Week in the Life of an Analyst at Gartner US IT Symposium (virtual) 2021

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

Unlock data across organizational boundaries using Amazon DataZone – now generally available

Build a multi-Region and highly resilient modern data architecture using AWS Glue and AWS Lake Formation

How Ruparupa gained updated insights with an Amazon S3 data lake, AWS Glue, Apache Hudi, and Amazon QuickSight

What is an open data lakehouse and why you should care?

Unstructured data management and governance using AWS AI/ML and analytics services

A Guide to Data Analytics in the Travel Industry

Achieve your AI goals with an open data lakehouse approach

Habib Bank manages data at scale with Cloudera Data Platform

Regeneron turns to IT to accelerate drug discovery

Building a Beautiful Data Lakehouse

Implement a serverless CDC process with Apache Iceberg using Amazon DynamoDB and Amazon Athena

Build efficient ETL pipelines with AWS Step Functions distributed map and redrive feature

Why We Started the Data Intelligence Project

Clean up your Excel and CSV files without writing code using AWS Glue DataBrew

Augmented data management: Data fabric versus data mesh

A Day in the Life of a DataOps Engineer

Stay Connected