Analytics, Data Lake and Metadata

Multicloud data lake analytics with Amazon Athena

AWS Big Data

MARCH 18, 2024

Many organizations operate data lakes spanning multiple cloud data stores. In these cases, you may want an integrated query layer to seamlessly run analytical queries across these diverse cloud stores and streamline your data analytics processes. This serves as the S3 data lake data for this post.

Data Lake

Data Lake Analytics Cost-Benefit Management

Migrate an existing data lake to a transactional data lake using Apache Iceberg

AWS Big Data

OCTOBER 3, 2023

A data lake is a centralized repository that you can use to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data and then run different types of analytics for better business insights.

Data Lake

Data Lake Metadata Snapshot Recreation/Entertainment

Use Apache Iceberg in your data lake with Amazon S3, AWS Glue, and Snowflake

AWS Big Data

APRIL 3, 2024

They understand that a one-size-fits-all approach no longer works, and recognize the value in adopting scalable, flexible tools and open data formats to support interoperability in a modern data architecture to accelerate the delivery of new solutions.

Data Lake

Data Lake Snapshot Metadata Data Architecture

Use Apache Iceberg in a data lake to support incremental data processing

AWS Big Data

MARCH 2, 2023

Apache Iceberg is an open table format for very large analytic datasets, which captures metadata information on the state of datasets as they evolve and change over time. Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback.

Data Lake

Data Lake Data Processing Metadata Snapshot

Build a serverless transactional data lake with Apache Iceberg, Amazon EMR Serverless, and Amazon Athena

AWS Big Data

MARCH 10, 2023

Since the deluge of big data over a decade ago, many organizations have learned to build applications to process and analyze petabytes of data. Data lakes have served as a central repository to store structured and unstructured data at any scale and in various formats.

Data Lake

Data Lake Sales Data Warehouse Snapshot

Build a real-time GDPR-aligned Apache Iceberg data lake

AWS Big Data

FEBRUARY 24, 2023

Data lakes are a popular choice for today’s organizations to store their data around their business activities. As a best practice of a data lake design, data should be immutable once stored. A data lake built on AWS uses Amazon Simple Storage Service (Amazon S3) as its primary storage environment.

Data Lake

Data Lake Metadata Testing Data Warehouse

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

AWS Big Data

MARCH 7, 2024

At the same time, they need to optimize operational costs to unlock the value of this data for timely insights and do so with a consistent performance. With this massive data growth, data proliferation across your data stores, data warehouse, and data lakes can become equally challenging.

Data Lake

Data Lake Analytics Dashboards Metrics

Use AWS Glue ETL to perform merge, partition evolution, and schema evolution on Apache Iceberg

AWS Big Data

MARCH 4, 2024

As enterprises collect increasing amounts of data from various sources, the structure and organization of that data often need to change over time to meet evolving analytical needs. This is critical for fast-moving enterprises to augment data structures to support new use cases. This hampers agility and time to insight.

Snapshot

Snapshot Data Lake Metadata Recreation/Entertainment

Efficiently crawl your data lake and improve data access with an AWS Glue crawler using partition indexes

AWS Big Data

JUNE 15, 2023

In today’s world, customers manage vast amounts of data in their Amazon Simple Storage Service (Amazon S3) data lakes, which requires convoluted data pipelines to continuously understand the changes in the data layout and make them available to consuming systems.

Data Lake

Data Lake Metadata Cost-Benefit Management

Data Lakes on Cloud & it’s Usage in Healthcare

BizAcuity

MARCH 29, 2019

Data lakes are centralized repositories that can store all structured and unstructured data at any desired scale. The power of the data lake lies in the fact that it often is a cost-effective way to store data.

Data Lake

Data Lake Unstructured Data Cost-Benefit Data Quality

Gartner Data & Analytics Sydney 2022

Timo Elliott

NOVEMBER 21, 2022

Last week I was in beautiful Sydney, Australia for the Gartner Data and Analytics Conference. Here’s a quick video summary: One of the big things that struck me was the changing role of data. Data is useless. It’s possible, but it takes huge amounts of time and effort.

Data Analytics

Data Analytics Analytics Recreation/Entertainment Data Lake

The Data Lakehouse: Blending Data Warehouses and Data Lakes

Data Virtualization

APRIL 21, 2022

First we had data warehouses, then came data lakes, and now the new kid on the block is the data lakehouse. But what is a data lakehouse and why should we develop one? In a way, the name describes what.

Data Lake

Data Lake Data Warehouse Data Integration Management

How Knowledge Graphs Power Data Mesh and Data Fabric

Ontotext

APRIL 10, 2024

In most enterprises data teams lack a data map and data asset inventory and are often unaware of data that exists across the organization, its associated profile, quality and associated metadata. Teams can’t access data to build their business use cases. For example, a product data tag is basic metadata.

Metadata

Metadata Data Lake Data Warehouse Data Quality

Data Lakes: What Are They and Who Needs Them?

Jet Global

JULY 2, 2019

To address the flood of data and the needs of enterprise businesses to store, sort, and analyze that data, a new storage solution has evolved: the data lake. What’s in a Data Lake? Data warehouses do a great job of standardizing data from disparate sources for analysis. Taking a Dip.

Data Lake

Data Lake Data Warehouse Big Data Machine Learning

Modernize your ETL platform with AWS Glue Studio: A case study from BMS

AWS Big Data

DECEMBER 13, 2023

For the past 5 years, BMS has used a custom framework called Enterprise Data Lake Services (EDLS) to create ETL jobs for business users. EDLS job steps and metadata: Every EDLS job comprises one or more job steps chained together and run in a predefined order orchestrated by the custom ETL framework.

Metadata

Metadata Data Lake Visualization Data Transformation

Introducing AWS Glue crawler and create table support for Apache Iceberg format

AWS Big Data

AUGUST 16, 2023

Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback. Iceberg captures metadata information on the state of datasets as they evolve and change over time.

Data Lake

Data Lake Metadata Snapshot Management

Use Amazon Athena with Spark SQL for your open-source transactional table formats

AWS Big Data

JANUARY 24, 2024

AWS-powered data lakes, supported by the unmatched availability of Amazon Simple Storage Service (Amazon S3), can handle the scale, agility, and flexibility required to combine different data and analytics approaches. It will never remove files that are still required by a non-expired snapshot.

Snapshot

Snapshot Data Lake Metadata Optimization

Where Do Data Catalogs Fit in Metadata Management?

Alation

FEBRUARY 13, 2020

In an earlier blog, I defined a data catalog as “a collection of metadata, combined with data management and search tools, that helps analysts and other data users to find the data that they need, serves as an inventory of available data, and provides information to evaluate fitness data for intended uses.”

Metadata

Metadata Management Data Lake Data Governance

Apache Iceberg optimization: Solving the small files problem in Amazon EMR

AWS Big Data

OCTOBER 3, 2023

In our previous post Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes , we discussed how you can implement solutions to improve operational efficiencies of your Amazon Simple Storage Service (Amazon S3) data lake that is using the Apache Iceberg open table format and running on the Amazon EMR big data platform.

Optimization

Optimization Snapshot Data Lake Metadata

How gaming companies can use Amazon Redshift Serverless to build scalable analytical applications faster and easier

AWS Big Data

MARCH 7, 2023

This post provides guidance on how to build scalable analytical solutions for gaming industry use cases using Amazon Redshift Serverless. Flexible and easy to use – The solutions should provide less restrictive, easy-to-access, and ready-to-use data. Data hubs and data lakes can coexist in an organization, complementing each other.

Analytics

Analytics Data Warehouse Data Lake Metadata

Data governance in the age of generative AI

AWS Big Data

FEBRUARY 29, 2024

To provide a response that includes the enterprise context, each user prompt needs to be augmented with a combination of insights from structured data from the data warehouse and unstructured data from the enterprise data lake.

Data Governance

Data Governance Unstructured Data Metadata Data Lake

Configure cross-Region table access with the AWS Glue Catalog and AWS Lake Formation

AWS Big Data

AUGUST 3, 2023

Today’s modern data lakes span multiple accounts, AWS Regions, and lines of business in organizations. It’s important that their data solution gives them the ability to share and access data securely and safely across Regions. A resource link is a Data Catalog object that is a link to a database or table.

Data Lake

Data Lake Metadata Management Data Processing

Query your Apache Hive metastore with AWS Lake Formation permissions

AWS Big Data

JULY 20, 2023

The Hive metastore is a repository of metadata about the SQL tables, such as database names, table names, schema, serialization and deserialization information, data location, and partition details of each table. Apache Hive, Apache Spark, Presto, and Trino can all use a Hive Metastore to retrieve metadata to run queries.

Data Lake

Data Lake Metadata Data Processing Big Data

Case study: Policy Enforcement Automation With Semantics

Ontotext

MAY 2, 2024

Data leaders today are faced with an almost impossible challenge. Particularly those on “the create side of the house” who are tasked to deliver insights and analytics. Such inconsistencies bring lowered trust in the outcomes analytics and insights leaders try to get.

Metadata

Metadata Data Lake Data-driven Enterprise

The Future of the Data Lakehouse – Open

Cloudera

JUNE 18, 2022

Cloudera customers run some of the biggest data lakes on earth. These lakes power mission critical large scale data analytics, business intelligence (BI), and machine learning use cases, including enterprise data warehouses. On data warehouses and data lakes. Iterations of the lakehouse.

Data Lake

Data Lake Data Warehouse Machine Learning Cost-Benefit

Enforce fine-grained access control on Open Table Formats via Amazon EMR integrated with AWS Lake Formation

AWS Big Data

JANUARY 17, 2024

With Amazon EMR 6.15, we launched AWS Lake Formation based fine-grained access controls (FGAC) on Open Table Formats (OTFs), including Apache Hudi, Apache Iceberg, and Delta Lake. Many large enterprise companies seek to use their transactional data lake to gain insights and improve decision-making.

Data Lake

Data Lake Snapshot Big Data Data-driven

The Madness of Data (and analytics) Governance

Andrew White

DECEMBER 9, 2019

The client had recently engaged with a well-known consulting company that had recommended a large data catalog effort to collect all enterprise metadata to help identify all data and business issues. Modern data (and analytics) governance does not necessarily need: Wall-to-wall discovery of your data and metadata.

Analytics

Analytics Data Lake Data Governance Metadata

How SumUp made digital analytics more accessible using AWS Glue

AWS Big Data

JUNE 6, 2023

As most organizations that have turned to Google Analytics (GA) as a digital analytics solution mature, they discover a more pressing need to integrate this data silo with the rest of their organization’s data to enable better analytics and resulting product development and fraud detection.

Analytics

Analytics Data Lake Testing Optimization

How HR&A uses Amazon Redshift spatial analytics on Amazon Redshift Serverless to measure digital equity in states across the US

AWS Big Data

DECEMBER 5, 2023

This cut down significantly on analytical turnaround times. The CARTO Analytics Toolbox for Redshift is composed of a set of user-defined functions and procedures organized in a set of modules based on the functionality they offer. These table definitions are used as the metadata repository for external tables in Amazon Redshift.

Measurement

Measurement Dashboards Data Warehouse Analytics

What is a Data Mesh?

DataKitchen

AUGUST 3, 2021

First-generation – expensive, proprietary enterprise data warehouse and business intelligence platforms maintained by a specialized team drowning in technical debt. Second-generation – gigantic, complex data lake maintained by a specialized team drowning in technical debt. See the pattern?

Data Architecture

Data Architecture Data Lake Cost-Benefit Data Warehouse

The Future of the Data Lakehouse – Open

CIO Business Intelligence

JUNE 23, 2022

Cloudera customers run some of the biggest data lakes on earth. These lakes power mission critical large scale data analytics, business intelligence (BI), and machine learning use cases, including enterprise data warehouses. On data warehouses and data lakes. Iterations of the lakehouse.

Data Lake

Data Lake Data Warehouse Machine Learning Cost-Benefit

Collibra Brings Effective Data Governance to Line-of-Business

David Menninger's Analyst Perspectives

SEPTEMBER 28, 2021

Collibra is a data governance software company that offers tools for metadata management and data cataloging. The software enables organizations to find data quickly, identify its source and assure its integrity.

Data Governance

Data Governance Metadata Software Management

Architectural patterns for real-time analytics using Amazon Kinesis Data Streams, part 1

AWS Big Data

JANUARY 8, 2024

This is the first post to a blog series that offers common architectural patterns in building real-time data streaming infrastructures using Kinesis Data Streams for a wide range of use cases. In this post, we will review the common architectural patterns of two use cases: Time Series Data Analysis and Event Driven Microservices.

Analytics

Analytics IoT Data-driven Snapshot

Informatica’s new data management clouds target health, finance services

CIO Business Intelligence

MAY 24, 2022

Some of the accelerators included as part of the new platform are integrations with Salesforce, NPI data, National Patient Account Services, Workday, Oracle Fusion HCM Cloud, Orange HRM, Salesforce Health Cloud, MedPro, healthcare-focused cloud company Veeva, and HR vendor UltiPro. Analytics for faster decision making.

Finance

Finance Management Metadata Data Quality

Accelerate HiveQL with Oozie to Spark SQL migration on Amazon EMR

AWS Big Data

APRIL 19, 2023

We split the solution into two primary components: generating Spark job metadata and running the SQL on Amazon EMR. The first component (metadata setup) consumes existing Hive job configurations and generates metadata such as number of parameters, number of actions (steps), and file formats. sql_path SQL file name.

Metadata

Metadata Testing Data Lake Consulting

Migrate Hive data from CDH to CDP public cloud

Cloudera

JUNE 25, 2021

Many Cloudera customers are making the transition from being completely on-prem to cloud by either backing up their data in the cloud, or running multi-functional analytics on CDP Public cloud in AWS or Azure. The Replication Manager service facilitates both disaster recovery and data migration across different environments.

Data Lake

Data Lake Metadata Unstructured Data Management

AWS Lake Formation 2022 year in review

AWS Big Data

JANUARY 31, 2023

We have collected some of the key talks and solutions on data governance, data mesh, and modern data architecture published and presented in AWS re:Invent 2022, and a few data lake solutions built by customers and AWS Partners for easy reference. Starting with Amazon EMR release 6.7.0,

Data Lake

Data Lake Data Governance Data Architecture Data Warehouse

What is a data architect? Skills, salaries, and how to become a data framework master

CIO Business Intelligence

OCTOBER 13, 2023

Solutions data architect: These individuals design and implement data solutions for specific business needs, including data warehouses, data marts, and data lakes. Application data architect: The application data architect designs and implements data models for specific software applications.

Data Architecture

Data Architecture Data Warehouse Statistics Visualization

Addressing Data Mesh Technical Challenges with DataOps

DataKitchen

AUGUST 9, 2021

The domain also includes code that acts upon the data, including tools, pipelines, and other artifacts that drive analytics execution. The domain requires a team that creates/updates/runs the domain, and we can’t forget metadata: catalogs, lineage, test results, processing history, etc., ….

Testing

Testing Data Lake Metadata Publishing

Educating ChatGPT on Data Lakehouse

Cloudera

MARCH 17, 2023

The one key component that is missing is a common, shared table format, that can be used by all analytic services accessing the lakehouse data. The table format provides the necessary structure for the unstructured data that is missing in a data lake, using a schema or metadata definition, to bring it closer to a data warehouse.

Unstructured Data

Unstructured Data Data Lake Data Warehouse Machine Learning

Simplify operational data processing in data lakes using AWS Glue and Apache Hudi

AWS Big Data

SEPTEMBER 13, 2023

The Analytics specialty practice of AWS Professional Services (AWS ProServe) helps customers across the globe with modern data architecture implementations on the AWS Cloud. Of those tables, some are larger (such as in terms of record volume) than others, and some are updated more frequently than others.

Data Lake

Data Lake Data Processing Metadata Snapshot

Exploring real-time streaming for generative AI Applications

AWS Big Data

MARCH 25, 2024

Streaming jobs constantly ingest new data to synchronize across systems and can perform enrichment, transformations, joins, and aggregations across windows of time more efficiently. OpenSearch Service provides support for native ingestion from Kinesis data streams or MSK topics.

Data Lake

Data Lake Unstructured Data Management Modeling

What Is a Data Catalog?

Alation

FEBRUARY 13, 2020

Why do we need a data catalog? What does a data catalog do? These are all good questions and a logical place to start your data cataloging journey. Data catalogs have become the standard for metadata management in the age of big data and self-service analytics. What is a Data Catalog?

Metadata

Metadata Data Lake Recreation/Entertainment Big Data

How Cloudera Supports Zero Trust for Data

Cloudera

JUNE 7, 2023

It operates independently from compute and storage layers, offering integrated security and governance based on metadata. With persistent context across analytics and cloud environments, SDX simplifies data delivery and access with a unified multi-tenant model. Understanding your data is critical to protecting the data.

Metadata

Metadata Data Lake Optimization Modeling
