Data Analytics, Data Lake, Management and Metadata

Multicloud data lake analytics with Amazon Athena

AWS Big Data

MARCH 18, 2024

Many organizations operate data lakes spanning multiple cloud data stores. In these cases, you may want an integrated query layer to seamlessly run analytical queries across these diverse cloud stores and streamline your data analytics processes.

Data Lake

Data Lake Analytics Cost-Benefit Management

Use Apache Iceberg in a data lake to support incremental data processing

AWS Big Data

MARCH 2, 2023

Apache Iceberg is an open table format for very large analytic datasets, which captures metadata information on the state of datasets as they evolve and change over time. Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback.

Data Lake

Data Lake Data Processing Metadata Snapshot

Modernize your ETL platform with AWS Glue Studio: A case study from BMS

AWS Big Data

DECEMBER 13, 2023

In addition to using native managed AWS services that BMS didn’t need to worry about upgrading, BMS was looking to offer an ETL service to non-technical business users that could visually compose data transformation workflows and seamlessly run them on the AWS Glue Apache Spark-based serverless data integration engine.

Metadata

Metadata Data Lake Visualization Data Transformation

Webinars

The Key to Sustainable Energy Optimization: A Data-Driven Approach for Manufacturing

From Developer Experience to Product Experience: How a Shared Focus Fuels Product Success

Understanding User Needs and Satisfying Them

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You Need to Know

How To Get Promoted In Product Management

MORE WEBINARS

Use Amazon Athena with Spark SQL for your open-source transactional table formats

AWS Big Data

JANUARY 24, 2024

AWS-powered data lakes, supported by the unmatched availability of Amazon Simple Storage Service (Amazon S3), can handle the scale, agility, and flexibility required to combine different data and analytics approaches. For more information, refer to Amazon S3: Allows read and write access to objects in an S3 Bucket.

Snapshot

Snapshot Data Lake Metadata Optimization

Data governance in the age of generative AI

AWS Big Data

FEBRUARY 29, 2024

However, enterprise data generated from siloed sources combined with the lack of a data integration strategy creates challenges for provisioning the data for generative AI applications. As part of the transformation, the objects need to be treated to ensure data privacy (for example, PII redaction).

Data Governance

Data Governance Unstructured Data Metadata Data Lake

What is a Data Mesh?

DataKitchen

AUGUST 3, 2021

The data mesh design pattern breaks giant, monolithic enterprise data architectures into subsystems or domains, each managed by a dedicated team. First-generation – expensive, proprietary enterprise data warehouse and business intelligence platforms maintained by a specialized team drowning in technical debt.

Data Architecture

Data Architecture Data Lake Cost-Benefit Data Warehouse

The Future of the Data Lakehouse – Open

CIO Business Intelligence

JUNE 23, 2022

Cloudera customers run some of the biggest data lakes on earth. These lakes power mission critical large scale data analytics, business intelligence (BI), and machine learning use cases, including enterprise data warehouses. On data warehouses and data lakes.

Data Lake

Data Lake Data Warehouse Machine Learning Cost-Benefit

The Future of the Data Lakehouse – Open

Cloudera

JUNE 18, 2022

Cloudera customers run some of the biggest data lakes on earth. These lakes power mission critical large scale data analytics, business intelligence (BI), and machine learning use cases, including enterprise data warehouses. On data warehouses and data lakes.

Data Lake

Data Lake Data Warehouse Machine Learning Cost-Benefit

Addressing Data Mesh Technical Challenges with DataOps

DataKitchen

AUGUST 9, 2021

Implementing a data mesh does not require you to throw away your existing architecture and start over. The data industry has a wide variety of approaches and philosophies for managing data: Inman data factory, Kimball methodology, s tar schema , or the data vault pattern, which can be a great way to store and organize raw data, and more.

Testing

Testing Data Lake Metadata Publishing

Trends in Data Management and Analytics

TDAN

MARCH 19, 2019

Various databases, plus one or more data warehouses, have been the state-of-the art data management infrastructure in companies for years. The emergence of various new concepts, technologies, and applications such as Hadoop, Tableau, R, Power BI, or Data Lakes indicate that changes are under way.

Management

Management Data Lake Data Warehouse Analytics

Exploring real-time streaming for generative AI Applications

AWS Big Data

MARCH 25, 2024

In this post, we discuss why data streaming is a crucial component of generative AI applications due to its real-time nature. In-context learning LLMs are trained with point-in-time data and have no inherent ability to access fresh data at inference time. For more information, refer to Dynamic Tables.

Data Lake

Data Lake Unstructured Data Management Modeling

Data architecture strategy for data quality

IBM Big Data Hub

JANUARY 5, 2023

The first generation of data architectures represented by enterprise data warehouse and business intelligence platforms were characterized by thousands of ETL jobs, tables, and reports that only a small group of specialized data engineers understood, resulting in an under-realized positive impact on the business.

Data Quality

Data Quality Data Architecture Strategy Data Lake

Five benefits of a data catalog

IBM Big Data Hub

DECEMBER 16, 2022

For example, data catalogs have evolved to deliver governance capabilities like managing data quality and data privacy and compliance. It uses metadata and data management tools to organize all data assets within your organization. Improved trust and confidence in data.

Metadata

Metadata Data Quality Data-driven Data Governance

How HR&A uses Amazon Redshift spatial analytics on Amazon Redshift Serverless to measure digital equity in states across the US

AWS Big Data

DECEMBER 5, 2023

A combination of Amazon Redshift Spectrum and COPY commands are used to ingest the survey data stored as CSV files. For the files with unknown structures, AWS Glue crawlers are used to extract metadata and create table definitions in the Data Catalog. She helps customers architect data analytics solutions at scale on AWS.

Measurement

Measurement Dashboards Data Warehouse Analytics

Speed up queries with the cost-based optimizer in Amazon Athena

AWS Big Data

NOVEMBER 17, 2023

Athena provides a simplified, flexible way to analyze petabytes of data where it lives. You can analyze data or build applications from an Amazon Simple Storage Service (Amazon S3) data lake and 30 data sources, including on-premises data sources or other cloud systems using SQL or Python. Wei Zheng is a Sr.

Optimization

Optimization Statistics Metadata Data Lake

Overcome these six data consumption challenges for a more data-driven enterprise

IBM Big Data Hub

JUNE 8, 2022

Ensuring data quality and access within an organization, while establishing and maintaining proper governance processes, is a major struggle for many organizations. Here are a few common data management challenges: Regulatory compliance on data use. Whether data protection regulations like GDPR, CCPA, HIPAA, etc.

Data-driven

Data-driven Enterprise Data Governance Data Lake

Architectural patterns for real-time analytics using Amazon Kinesis Data Streams, part 1

AWS Big Data

JANUARY 8, 2024

This is the first post to a blog series that offers common architectural patterns in building real-time data streaming infrastructures using Kinesis Data Streams for a wide range of use cases. In this post, we will review the common architectural patterns of two use cases: Time Series Data Analysis and Event Driven Microservices.

Analytics

Analytics IoT Data-driven Snapshot

How SumUp made digital analytics more accessible using AWS Glue

AWS Big Data

JUNE 6, 2023

Unless, of course, the rest of their data also resides in the Google Cloud. In this post we showcase how we used AWS Glue to move siloed digital analytics data, with inconsistent arrival times, to AWS S3 (our Data Lake) and our central data warehouse (DWH), Snowflake.

Analytics

Analytics Data Lake Testing Optimization

Choosing an open table format for your transactional data lake on AWS

AWS Big Data

JUNE 9, 2023

A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale. and Delta Lake 2.3.0. Apache Iceberg 1.2.0,

Data Lake

Data Lake Metadata Optimization Statistics

The Very Group adopts a data catalog to better organize and leverage its online retail capabilities

CIO Business Intelligence

SEPTEMBER 6, 2022

Establishing a clear and unified approach to data. But getting to this stage was an intricate process that involved creating centers of excellence for things like data analytics that own the end-to-end infrastructure, application and skill sets, as well as career plans for staff. Chief Data Officer, Data Center Management

IT

IT Forecasting Data Lake Enterprise

Week in the Life of an Analyst at Gartner US IT Symposium (virtual) 2021

Andrew White

OCTOBER 22, 2021

This time, since we were virtual I only managed to close out the week with the 1-1 summary. Lakehouse (data warehouse and data lake working together) 8. Data Literacy, training, coordination, collaboration 8. Data Management Infrastructure/Data Fabric 5. Data Integration tactics 4.

IT

IT Data Lake Strategy Data Science

Announcing the 2021 Data Impact Awards

Cloudera

MAY 12, 2021

Use cases could include but are not limited to: workload analysis and replication, migrating or bursting to cloud, data warehouse optimization, and more. Winners included: Connect the Data Lifecycle: Globe Telecom — Raising experience standards and helping customers live enhanced mobile lifestyles. SECURITY AND GOVERNANCE LEADERSHIP.

Digital Transformation

Digital Transformation Machine Learning Optimization Data Lake

How Cargotec uses metadata replication to enable cross-account data sharing

AWS Big Data

JUNE 7, 2023

Cargotec captures terabytes of IoT telemetry data from their machinery operated by numerous customers across the globe. This data needs to be ingested into a data lake, transformed, and made available for analytics, machine learning (ML), and visualization.

Metadata

Metadata Data Lake Machine Learning Big Data

Unstructured data management and governance using AWS AI/ML and analytics services

AWS Big Data

OCTOBER 25, 2023

Additionally, we show how to use AWS AI/ML services for analyzing unstructured data. Why it’s challenging to process and manage unstructured data Unstructured data makes up a large proportion of the data in the enterprise that can’t be stored in a traditional relational database management systems (RDBMS).

Unstructured Data

Unstructured Data Metadata Management Analytics

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

AWS Big Data

JULY 20, 2023

With data becoming the driving force behind many industries today, having a modern data architecture is pivotal for organizations to be successful. In this post, we describe Orca’s journey building a transactional data lake using Amazon Simple Storage Service (Amazon S3), Apache Iceberg, and AWS Analytics.

Data Lake

Data Lake Analytics Snapshot Optimization

Unlock data across organizational boundaries using Amazon DataZone – now generally available

AWS Big Data

OCTOBER 4, 2023

Common pain points of data management and governance: Discovery of data , especially data distributed across accounts and regions – Finding the data to use for analysis is challenging because organizations often have petabytes of data spread across tens or even thousands of data sources.

Metadata

Metadata Data Lake Publishing Data Governance

Habib Bank manages data at scale with Cloudera Data Platform

Cloudera

NOVEMBER 17, 2022

We needed a solution to manage our data at scale, to provide greater experiences to our customers. With Cloudera Data Platform, we aim to unlock value faster and offer consistent data security and governance to meet this goal. HBL aims to double its banked customers by 2025. “ See other customers’ success here .

Management

Management Data Lake Consulting Unstructured Data

Augmented data management: Data fabric versus data mesh

IBM Big Data Hub

APRIL 27, 2022

Data fabric and data mesh are emerging data management concepts that are meant to address the organizational change and complexities of understanding, governing and working with enterprise data in a hybrid multicloud ecosystem. The good news is that both data architecture concepts are complimentary.

Management

Management Metadata Data Architecture Data Lake

What is an open data lakehouse and why you should care?

IBM Big Data Hub

JANUARY 17, 2023

A data lakehouse is an emerging data management architecture that improves efficiency and converges data warehouse and data lake capabilities driven by a need to improve efficiency and obtain critical insights faster. Let’s start with why data lakehouses are becoming increasingly important.

Data Lake

Data Lake Metadata Data Warehouse Data Governance

Build a multi-Region and highly resilient modern data architecture using AWS Glue and AWS Lake Formation

AWS Big Data

JANUARY 24, 2023

This solution only replicates metadata in the Data Catalog, not the actual underlying data. To have a redundant data lake using Lake Formation and AWS Glue in an additional Region, we recommend replicating the Amazon S3-based storage using S3 replication , S3 sync, aws-s3-copy-sync-using-batch or S3 Batch replication process.

Data Architecture

Data Architecture Metadata Data Lake Snapshot

How Ruparupa gained updated insights with an Amazon S3 data lake, AWS Glue, Apache Hudi, and Amazon QuickSight

AWS Big Data

FEBRUARY 22, 2023

In this post, we show how Ruparupa implemented an incrementally updated data lake to get insights into their business using Amazon Simple Storage Service (Amazon S3), AWS Glue , Apache Hudi , and Amazon QuickSight. An AWS Glue ETL job, using the Apache Hudi connector, updates the S3 data lake hourly with incremental data.

Data Lake

Data Lake Dashboards Cost-Benefit Metadata

A Guide to Data Analytics in the Travel Industry

Alation

MARCH 21, 2023

Today, modern travel and tourism thrive on data. For example, airlines have historically applied analytics to revenue management, while successful hospitality leaders make data-driven decisions around property allocation and workforce management. Why is data analytics important for travel organizations?

Data Analytics

Data Analytics Analytics Data-driven Big Data

Achieve your AI goals with an open data lakehouse approach

IBM Big Data Hub

OCTOBER 4, 2023

Another IDC study showed that while 2/3 of respondents reported using AI-driven data analytics, most reported that less than half of the data under management is available for this type of analytics. from 2022 to 2026. New insights and relationships are found in this combination. All of this supports the use of AI.

Data Lake

Data Lake Metadata Cost-Benefit Data Warehouse

Building a Beautiful Data Lakehouse

CIO Business Intelligence

MARCH 9, 2022

Applying artificial intelligence (AI) to data analytics for deeper, better insights and automation is a growing enterprise IT priority. But the data repository options that have been around for a while tend to fall short in their ability to serve as the foundation for big data analytics powered by AI.

Data Lake

Data Lake Unstructured Data Data Warehouse Data Quality

Cloudera Named a Leader in the 2022 Gartner® Magic Quadrant™ for Cloud Database Management Systems (DBMS)

Cloudera

DECEMBER 16, 2022

We are pleased to announce that Cloudera has been named a Leader in the 2022 Gartner ® Magic Quadrant for Cloud Database Management Systems. We’re proud to be recognized for the data management and data analytics innovations we have delivered in the new Cloudera Data Platform (CDP).

Management

Management Metadata Machine Learning Data Lake

Regeneron turns to IT to accelerate drug discovery

CIO Business Intelligence

NOVEMBER 4, 2022

MetaBio, which received a 2022 CIO 100 Award , provides a single source for datasets in a unified format, enabling researchers to quickly extract information about various therapeutic functions without having to worry about how to prepare or find the data. Much of Regeneron’s data, of course, is confidential.

Data Lake

Data Lake IT Experimentation Data-driven

Use your corporate identities for analytics with Amazon EMR and AWS IAM Identity Center

AWS Big Data

APRIL 26, 2024

To enable your workforce users for analytics with fine-grained data access controls and audit data access, you might have to create multiple AWS Identity and Access Management (IAM) roles with different data permissions and map the workforce users to one of those roles.

Analytics

Analytics Data Lake Management Enterprise

Why We Started the Data Intelligence Project

Alation

JULY 7, 2022

In 2013 I joined American Family Insurance as a metadata analyst. I had always been fascinated by how people find, organize, and access information, so a metadata management role after school was a natural choice. The use cases for metadata are boundless, offering opportunities for innovation in every sector.

Metadata

Metadata Data-driven Insurance Statistics

Build efficient ETL pipelines with AWS Step Functions distributed map and redrive feature

AWS Big Data

DECEMBER 18, 2023

AWS Step Functions is a fully managed visual workflow service that enables you to build complex data processing pipelines involving a diverse set of extract, transform, and load (ETL) technologies such as AWS Glue , Amazon EMR , and Amazon Redshift. There are multiple tables related to customers and order data in the RDS database.

Metadata

Metadata Visualization Data Lake Data-driven

Implement a serverless CDC process with Apache Iceberg using Amazon DynamoDB and Amazon Athena

AWS Big Data

AUGUST 16, 2023

Apache Iceberg is an open table format for very large analytic datasets. Iceberg manages large collections of files as tables, and it supports modern analytical data lake operations such as record-level insert, update, delete, and time travel queries. We use a sample JSON file as input to Amazon DynamoDB.

Data Lake

Data Lake Metadata Testing Snapshot

Create an end-to-end data strategy for Customer 360 on AWS

AWS Big Data

MARCH 26, 2024

AWS provides different services for building data ingestion pipelines: AWS Glue is a serverless data integration service that ingests data in batches from on-premises databases and data stores in the cloud. Streaming data with low latency needs is stored in Amazon Kinesis Data Streams for real-time consumption.

Data Strategy

Data Strategy Strategy Data Warehouse Prescriptive Analytics

Clean up your Excel and CSV files without writing code using AWS Glue DataBrew

AWS Big Data

NOVEMBER 15, 2023

Managing data within an organization is complex. Handling data from outside the organization adds even more complexity. As the organization receives data from multiple external vendors, it often arrives in different formats, typically Excel or CSV files, with each vendor using their own unique data layout and structure.

Metadata

Metadata Sales Data Lake Big Data

A Day in the Life of a DataOps Engineer

DataKitchen

OCTOBER 11, 2021

The biggest challenge is broken data pipelines due to highly manual processes. Figure 1 shows a manually executed data analytics pipeline. First, a business analyst consolidates data from some public websites, an SFTP server and some downloaded email attachments, all into Excel. Monitoring Job Metadata.

Testing

Testing Metadata Dashboards Statistics

Apache Ozone and Dense Data Nodes

Cloudera

APRIL 22, 2021

Today’s enterprise data analytics teams are constantly looking to get the best out of their platforms. Storage plays one of the most important roles in the data platforms strategy, it provides the basis for all compute engines and applications to be built on top of it. Metadata in cluster is disjoint across components.

Data Lake

Data Lake Cost-Benefit Testing Metadata

Multicloud data lake analytics with Amazon Athena

Use Apache Iceberg in a data lake to support incremental data processing

Webinars

Trending Sources

Modernize your ETL platform with AWS Glue Studio: A case study from BMS

Webinars

Use Amazon Athena with Spark SQL for your open-source transactional table formats

Data governance in the age of generative AI

What is a Data Mesh?

The Future of the Data Lakehouse – Open

The Future of the Data Lakehouse – Open

Addressing Data Mesh Technical Challenges with DataOps

Trends in Data Management and Analytics

Exploring real-time streaming for generative AI Applications

Data architecture strategy for data quality

Five benefits of a data catalog

How HR&A uses Amazon Redshift spatial analytics on Amazon Redshift Serverless to measure digital equity in states across the US

Speed up queries with the cost-based optimizer in Amazon Athena

Overcome these six data consumption challenges for a more data-driven enterprise

Architectural patterns for real-time analytics using Amazon Kinesis Data Streams, part 1

How SumUp made digital analytics more accessible using AWS Glue

Choosing an open table format for your transactional data lake on AWS

The Very Group adopts a data catalog to better organize and leverage its online retail capabilities

Week in the Life of an Analyst at Gartner US IT Symposium (virtual) 2021

Announcing the 2021 Data Impact Awards

How Cargotec uses metadata replication to enable cross-account data sharing

Unstructured data management and governance using AWS AI/ML and analytics services

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

Unlock data across organizational boundaries using Amazon DataZone – now generally available

Habib Bank manages data at scale with Cloudera Data Platform

Augmented data management: Data fabric versus data mesh

What is an open data lakehouse and why you should care?

Build a multi-Region and highly resilient modern data architecture using AWS Glue and AWS Lake Formation

How Ruparupa gained updated insights with an Amazon S3 data lake, AWS Glue, Apache Hudi, and Amazon QuickSight

A Guide to Data Analytics in the Travel Industry

Achieve your AI goals with an open data lakehouse approach

Building a Beautiful Data Lakehouse

Cloudera Named a Leader in the 2022 Gartner® Magic Quadrant™ for Cloud Database Management Systems (DBMS)

Regeneron turns to IT to accelerate drug discovery

Use your corporate identities for analytics with Amazon EMR and AWS IAM Identity Center

Why We Started the Data Intelligence Project

Build efficient ETL pipelines with AWS Step Functions distributed map and redrive feature

Implement a serverless CDC process with Apache Iceberg using Amazon DynamoDB and Amazon Athena

Create an end-to-end data strategy for Customer 360 on AWS

Clean up your Excel and CSV files without writing code using AWS Glue DataBrew

A Day in the Life of a DataOps Engineer

Apache Ozone and Dense Data Nodes

Stay Connected