Data Lake, Data Processing, Management and Metadata

Data Lake

Data Processing

Management

Metadata

Migrate an existing data lake to a transactional data lake using Apache Iceberg

AWS Big Data

OCTOBER 3, 2023

A data lake is a centralized repository that you can use to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data and then run different types of analytics for better business insights.

Data Lake

Data Lake Metadata Snapshot Recreation/Entertainment

Build a serverless transactional data lake with Apache Iceberg, Amazon EMR Serverless, and Amazon Athena

AWS Big Data

MARCH 10, 2023

Since the deluge of big data over a decade ago, many organizations have learned to build applications to process and analyze petabytes of data. Data lakes have served as a central repository to store structured and unstructured data at any scale and in various formats.

Data Lake

Data Lake Sales Data Warehouse Snapshot

Join 52,000+

Insiders

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Webinars

How to Optimize the Developer Experience for Monumental Impact

Generative AI Deep Dive: Advancing from Proof of Concept to Production

Understanding User Needs and Satisfying Them

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You Need to Know

Leading the Development of Profitable and Sustainable Products

MORE WEBINARS

Trending Sources

Modernize your ETL platform with AWS Glue Studio: A case study from BMS

AWS Big Data

DECEMBER 13, 2023

In addition to using native managed AWS services that BMS didn’t need to worry about upgrading, BMS was looking to offer an ETL service to non-technical business users that could visually compose data transformation workflows and seamlessly run them on the AWS Glue Apache Spark-based serverless data integration engine.

Metadata

Metadata Data Lake Visualization Data Transformation

Webinars

How to Optimize the Developer Experience for Monumental Impact

Generative AI Deep Dive: Advancing from Proof of Concept to Production

Understanding User Needs and Satisfying Them

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You Need to Know

Leading the Development of Profitable and Sustainable Products

MORE WEBINARS

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

AWS Big Data

MARCH 7, 2024

Organizations often need to manage a high volume of data that is growing at an extraordinary rate. At the same time, they need to optimize operational costs to unlock the value of this data for timely insights and do so with a consistent performance. We think of this concept as inside-out data movement.

Data Lake

Data Lake Analytics Dashboards Metrics

Introducing AWS Glue crawler and create table support for Apache Iceberg format

AWS Big Data

AUGUST 16, 2023

Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback. Iceberg captures metadata information on the state of datasets as they evolve and change over time. For more details, refer to Creating Apache Iceberg tables.

Data Lake

Data Lake Metadata Snapshot Management

Query your Apache Hive metastore with AWS Lake Formation permissions

AWS Big Data

JULY 20, 2023

The Hive metastore is a repository of metadata about the SQL tables, such as database names, table names, schema, serialization and deserialization information, data location, and partition details of each table. Apache Hive, Apache Spark, Presto, and Trino can all use a Hive Metastore to retrieve metadata to run queries.

Data Lake

Data Lake Metadata Data Processing Big Data

Configure cross-Region table access with the AWS Glue Catalog and AWS Lake Formation

AWS Big Data

AUGUST 3, 2023

Today’s modern data lakes span multiple accounts, AWS Regions, and lines of business in organizations. It’s important that their data solution gives them the ability to share and access data securely and safely across Regions. A resource link is a Data Catalog object that is a link to a database or table.

Data Lake

Data Lake Metadata Management Data Processing

Migrate Hive data from CDH to CDP public cloud

Cloudera

JUNE 25, 2021

Many Cloudera customers are making the transition from being completely on-prem to cloud by either backing up their data in the cloud, or running multi-functional analytics on CDP Public cloud in AWS or Azure. The Replication Manager service facilitates both disaster recovery and data migration across different environments.

Data Lake

Data Lake Metadata Unstructured Data Management

Enhance monitoring and debugging for AWS Glue jobs using new job observability metrics, Part 3: Visualization and trend analysis using Amazon QuickSight

AWS Big Data

MARCH 29, 2024

Data Firehose uses an AWS Lambda function to transform data and ingest the transformed records into an Amazon Simple Storage Service (Amazon S3) bucket. An AWS Glue crawler scans data on the S3 bucket and populates table metadata on the AWS Glue Data Catalog. In the navigation pane, choose Manage assets.

Metrics

Metrics Visualization Dashboards Interactive

Enhance query performance using AWS Glue Data Catalog column-level statistics

AWS Big Data

NOVEMBER 22, 2023

Data lakes are designed for storing vast amounts of raw, unstructured, or semi-structured data at a low cost, and organizations share those datasets across multiple departments and teams. The queries on these large datasets read vast amounts of data and can perform complex join operations on multiple datasets.

Statistics

Statistics Data Lake Optimization Data-driven

Power enterprise-grade Data Vaults with Amazon Redshift – Part 2

AWS Big Data

NOVEMBER 16, 2023

Amazon Redshift is a popular cloud data warehouse, offering a fully managed cloud-based service that seamlessly integrates with an organization’s Amazon Simple Storage Service (Amazon S3) data lake, real-time streams, machine learning (ML) workflows, transactional workflows, and much more—all while providing up to 7.9x

Enterprise

Enterprise Data Warehouse Snapshot Cost-Benefit

How Data Governance Protects Sensitive Data

erwin

APRIL 2, 2021

Organizations are managing more data than ever. With more companies increasingly migrating their data to the cloud to ensure availability and scalability, the risks associated with data management and protection also are growing. Do You Know Where Your Sensitive Data Is? erwin Data Intelligence.

Data Governance

Data Governance Cost-Benefit Risk Metadata

Governing data in relational databases using Amazon DataZone

AWS Big Data

MAY 7, 2024

Data governance is a key enabler for teams adopting a data-driven culture and operational model to drive innovation with data. Amazon DataZone allows you to simply and securely govern end-to-end data assets stored in your Amazon Redshift data warehouses or data lakes cataloged with the AWS Glue data catalog.

Metadata

Metadata Data Lake Data Processing Data-driven

Announcing the 2021 Data Impact Awards

Cloudera

MAY 12, 2021

2020 saw us hosting our first ever fully digital Data Impact Awards ceremony, and it certainly was one of the highlights of our year. We saw a record number of entries and incredible examples of how customers were using Cloudera’s platform and services to unlock the power of data. SECURITY AND GOVERNANCE LEADERSHIP.

Digital Transformation

Digital Transformation Machine Learning Optimization Data Lake

What Is Alation Connected Sheets? Q&A with the Creators

Alation

NOVEMBER 28, 2022

Krishna Bhat, co-founder & CEO, Kloudio and senior director product management, Alation: Alation Connected Sheets enables business users to pull trusted, governed, and accurate data directly from Alation into Google Sheets for analysis. It is also hard to know whether one can trust the data within a spreadsheet.

Metadata

Metadata Enterprise Cost-Benefit Finance

How Cargotec uses metadata replication to enable cross-account data sharing

AWS Big Data

JUNE 7, 2023

Cargotec captures terabytes of IoT telemetry data from their machinery operated by numerous customers across the globe. This data needs to be ingested into a data lake, transformed, and made available for analytics, machine learning (ML), and visualization. The target accounts read data from the source account S3 buckets.

Metadata

Metadata Data Lake Machine Learning Big Data

Themes and Conferences per Pacoid, Episode 8

Domino Data Lab

APRIL 3, 2019

Data governance shows up as the fourth-most-popular kind of solution that enterprise teams were adopting or evaluating during 2019. That’s a lot of priorities – especially when you group together closely related items such as data lineage and metadata management which rank nearby. Granted, I’m no expert in DG.

Data Governance

Data Governance Machine Learning Metadata Big Data

Extreme data center pressure? Burst to the cloud with CDP!

Cloudera

NOVEMBER 12, 2020

In addition, costs generated by independent IT projects frequently skyrockets with few controls in place to manage them. At Cloudera, we listened to our customers’ problems and built the Burst to Cloud feature in Workload Manager (WXM), Cloudera’s intelligent workload management tool. A solution. More than likely it is.

Data Warehouse

Data Warehouse Reporting Risk Cost-Benefit

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

AWS Big Data

JULY 20, 2023

With data becoming the driving force behind many industries today, having a modern data architecture is pivotal for organizations to be successful. In this post, we describe Orca’s journey building a transactional data lake using Amazon Simple Storage Service (Amazon S3), Apache Iceberg, and AWS Analytics.

Data Lake

Data Lake Analytics Snapshot Optimization

Habib Bank manages data at scale with Cloudera Data Platform

Cloudera

NOVEMBER 17, 2022

We needed a solution to manage our data at scale, to provide greater experiences to our customers. With Cloudera Data Platform, we aim to unlock value faster and offer consistent data security and governance to meet this goal. HBL aims to double its banked customers by 2025. “

Management

Management Data Lake Consulting Unstructured Data

Top 15 data management platforms available today

CIO Business Intelligence

SEPTEMBER 22, 2023

Data management platform definition A data management platform (DMP) is a suite of tools that helps organizations to collect and manage data from a wide array of first-, second-, and third-party sources and to create reports and build customer profiles as part of targeted personalization campaigns.

Management

Management Advertising Data Lake Sales

Top 15 data management platforms

CIO Business Intelligence

JUNE 9, 2022

A data management platform (DMP) is a group of tools designed to help organizations collect and manage data from a wide array of sources and to create reports that help explain what is happening in those data streams. Deploying a DMP can be a great way for companies to navigate a business world dominated by data.

Management

Management Advertising Data Lake Sales

Federate Amazon QuickSight access with open-source identity provider Keycloak

AWS Big Data

JUNE 13, 2023

Many organizations use Keycloak as their identity provider (IdP) to control and manage user authentication and authorization centrally. Download the SAML metadata file. In the navigation pane under Clients , import the SAML metadata file. Download the Keycloak IdP SAML metadata file from that URL location. Choose Browse.

Metadata

Metadata Dashboards Business Intelligence Management

Use your corporate identities for analytics with Amazon EMR and AWS IAM Identity Center

AWS Big Data

APRIL 26, 2024

To enable your workforce users for analytics with fine-grained data access controls and audit data access, you might have to create multiple AWS Identity and Access Management (IAM) roles with different data permissions and map the workforce users to one of those roles. AWS CloudTrail captures user data access activities.

Analytics

Analytics Data Lake Management Enterprise

Build efficient ETL pipelines with AWS Step Functions distributed map and redrive feature

AWS Big Data

DECEMBER 18, 2023

AWS Step Functions is a fully managed visual workflow service that enables you to build complex data processing pipelines involving a diverse set of extract, transform, and load (ETL) technologies such as AWS Glue , Amazon EMR , and Amazon Redshift. There are multiple tables related to customers and order data in the RDS database.

Metadata

Metadata Visualization Data Lake Data-driven

Get Your Analytics Insights Instantly – Without Abandoning Central IT

Cloudera

JANUARY 21, 2021

While cloud-native, point-solution data warehouse services may serve your immediate business needs, there are dangers to the corporation as a whole when you do your own IT this way. You already know that a distributed environment is much tougher for your company to manage, secure, and govern. Yes there is a better choice!

Data Lake

Data Lake Data Warehouse IT Analytics

Design a data mesh on AWS that reflects the envisioned organization

AWS Big Data

JANUARY 22, 2024

The existing monolith and centralized architecture was struggling to meet the growing demands of data consumers. Data engineers were finding it increasingly challenging to maintain and scale the data infrastructure, resulting in data access, data silos, and inefficiencies in data management.

Data-driven

Data-driven Advertising Metadata Data Architecture

Create an end-to-end data strategy for Customer 360 on AWS

AWS Big Data

MARCH 26, 2024

AWS provides different services for building data ingestion pipelines: AWS Glue is a serverless data integration service that ingests data in batches from on-premises databases and data stores in the cloud. Streaming data with low latency needs is stored in Amazon Kinesis Data Streams for real-time consumption.

Data Strategy

Data Strategy Strategy Data Warehouse Prescriptive Analytics

CIOs rise to the ESG reporting challenge

CIO Business Intelligence

JANUARY 30, 2024

“Always the gatekeepers of much of the data necessary for ESG reporting, CIOs are finding that companies are even more dependent on them,” says Nancy Mentesana, ESG executive director at Labrador US, a global communications firm focused on corporate disclosure documents. The complexity is at a much higher level.”

Reporting

Reporting Data Quality Strategy Data-driven

Generate security insights from Amazon Security Lake data using Amazon OpenSearch Ingestion

AWS Big Data

AUGUST 28, 2023

Amazon Security Lake centralizes access and management of your security data by aggregating security event logs from AWS environments, other cloud providers, on premise infrastructure, and other software as a service (SaaS) solutions. OpenSearch Ingestion ingests this OCSF compliant data into OpenSearch Service.

Dashboards

Dashboards Visualization Metadata Management

Build incremental data pipelines to load transactional data changes using AWS DMS, Delta 2.0, and Amazon EMR Serverless

AWS Big Data

MARCH 3, 2023

Building data lakes from continuously changing transactional data of databases and keeping data lakes up to date is a complex task and can be an operational challenge. You can then apply transformations and store data in Delta format for managing inserts, updates, and deletes.

Data Lake

Data Lake Dashboards Metrics Metadata

Improving Multi-tenancy with Virtual Private Clusters

Cloudera

JUNE 6, 2019

The typical Cloudera Enterprise Data Hub Cluster starts with a few dozen nodes in the customer’s datacenter hosting a variety of distributed services. Over time, workloads start processing more data, tenants start onboarding more workloads, and administrators (admins) start onboarding more tenants. Cloudera Manager (CM) 6.2

Metadata

Metadata Data Lake Optimization Strategy

Analyze Amazon S3 storage costs using AWS Cost and Usage Reports, Amazon S3 Inventory, and Amazon Athena

AWS Big Data

FEBRUARY 2, 2023

Since its launch in 2006, Amazon Simple Storage Service (Amazon S3) has experienced major growth, supporting multiple use cases such as hosting websites, creating data lakes, serving as object storage for consumer applications, storing logs, and archiving data. This could be your data lake or application S3 bucket.

Reporting

Reporting Data Lake Management Optimization

Ingest, transform, and deliver events published by Amazon Security Lake to Amazon OpenSearch Service

AWS Big Data

JUNE 19, 2023

Security Lake automatically centralizes security data from cloud, on-premises, and custom sources into a purpose-built data lake stored in your account. OpenSearch Service is a fully managed and scalable log analytics framework that is used by customers to ingest, store, and visualize data.

Publishing

Publishing Dashboards Visualization Management

Exploring the AI and data capabilities of watsonx

IBM Big Data Hub

JULY 17, 2023

Additional capabilities of our ML and data science toolkit include: MLOps pipelines: Offers a collaborative studio for data scientists to build, train and deploy machine learning models with advanced features like automated machine learning and model monitoring.

Machine Learning

Machine Learning Data Warehouse Modeling Cost-Benefit

How Novo Nordisk built distributed data governance and control at scale

AWS Big Data

APRIL 28, 2023

In this post, you will learn how to build decentralized governance with Lake Formation and AWS Identity and Access Management (IAM) using attribute-based access control (ABAC). Following the layer nomenclature established in the first post , the services are created and managed in the consumption layer.

Data Governance

Data Governance Management Data-driven Data Lake

Data Management Requirements for the Enterprise Data Lake

In(tegrate) the Clouds

MAY 1, 2016

SnapLogic published Eight Data Management Requirements for the Enterprise Data Lake. They are: Storage and Data Formats. Metadata and Governance. The company also recently hosted a webinar on Democratizing the Data Lake with Constellation Research and published 2 whitepapers from Mark Madsen.

Data Lake

Data Lake Enterprise Management Metadata

Build a data lake with Apache Flink on Amazon EMR

AWS Big Data

JANUARY 27, 2023

The AWS Glue Data Catalog provides a uniform repository where disparate systems can store and find metadata to keep track of data in data silos. Apache Flink is a widely used data processing engine for scalable streaming ETL, analytics, and event-driven applications. Apache Hudi also has its own catalog management.

Data Lake

Data Lake Metadata Business Analysis Data-driven

How Cloudera Data Flow Enables Successful Data Mesh Architectures

Cloudera

OCTOBER 7, 2021

The concept of the data mesh architecture is not entirely new; Its conceptual origins are rooted in the microservices architecture, its design principles (i.e., need to integrate multiple “point solutions” used in a data ecosystem) and organization reasons (e.g., difficulty to achieve cross-organizational governance model).

Metadata

Metadata Cost-Benefit Enterprise Interactive

Databricks’ Data+AI Summit 2022: A Show of Partner “Unity”

Alation

JULY 18, 2022

To ensure you can deliver on this world-changing vision of data, Alation helps you maximize the value of your data lake with integrations to the Unity catalog. Together, Databricks and Alation will ultimately provide catalog, lineage, and policy management and enforcement for the Lakehouse.”.

ROI

ROI Metadata Data Lake Digital Transformation

PODCAST: Making AI Real – Episode 4: Unlocking the Value of Enterprise AI with Data Engineering Capabilities

bridgei2i

MARCH 3, 2021

Episode 4: Unlocking the Value of Enterprise AI with Data Engineering Capabilities. Unlocking the Value of Enterprise AI with Data Engineering Capabilities. They discuss how the data engineering team is instrumental in easing collaboration between analysts, data scientists and ML engineers to build enterprise AI solutions.

Enterprise

Enterprise Digital Transformation Data-driven Interactive

How Amazon Finance Automation built a data mesh to support distributed data ownership and centralize governance

AWS Big Data

JULY 14, 2023

Consumers prioritized data discoverability, fast data access, low latency, and high accuracy of data. Producers prioritized ownership, governance, access management, and reuse of their datasets. These inputs reinforced the need of a unified data strategy across the FinOps teams.

Finance

Finance Metadata Big Data Recreation/Entertainment

Data Governance for Dummies: Your Questions, Answered

Alation

FEBRUARY 17, 2023

This past week, I had the pleasure of hosting Data Governance for Dummies author Jonathan Reichental for a fireside chat , along with Denise Swanson , Data Governance lead at Alation. Can you have proper data management without establishing a formal data governance program? This is legally required.

Data Governance

Data Governance Data Quality Metadata Cost-Benefit

How smava makes loans transparent and affordable using Amazon Redshift Serverless

AWS Big Data

DECEMBER 21, 2023

To speed up the self-service analytics and foster innovation based on data, a solution was needed to provide ways to allow any team to create data products on their own in a decentralized manner. To create and manage the data products, smava uses Amazon Redshift , a cloud data warehouse.

Data Lake

Data Lake Data Warehouse Data-driven B2B

Migrate an existing data lake to a transactional data lake using Apache Iceberg

Build a serverless transactional data lake with Apache Iceberg, Amazon EMR Serverless, and Amazon Athena

Webinars

Trending Sources

Modernize your ETL platform with AWS Glue Studio: A case study from BMS

Webinars

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

Introducing AWS Glue crawler and create table support for Apache Iceberg format

Query your Apache Hive metastore with AWS Lake Formation permissions

Configure cross-Region table access with the AWS Glue Catalog and AWS Lake Formation

Migrate Hive data from CDH to CDP public cloud

Enhance monitoring and debugging for AWS Glue jobs using new job observability metrics, Part 3: Visualization and trend analysis using Amazon QuickSight

Enhance query performance using AWS Glue Data Catalog column-level statistics

Power enterprise-grade Data Vaults with Amazon Redshift – Part 2

How Data Governance Protects Sensitive Data

Governing data in relational databases using Amazon DataZone

Announcing the 2021 Data Impact Awards

What Is Alation Connected Sheets? Q&A with the Creators

How Cargotec uses metadata replication to enable cross-account data sharing

Themes and Conferences per Pacoid, Episode 8

Extreme data center pressure? Burst to the cloud with CDP!

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

Habib Bank manages data at scale with Cloudera Data Platform

Top 15 data management platforms available today

Top 15 data management platforms

Federate Amazon QuickSight access with open-source identity provider Keycloak

Use your corporate identities for analytics with Amazon EMR and AWS IAM Identity Center

Build efficient ETL pipelines with AWS Step Functions distributed map and redrive feature

Get Your Analytics Insights Instantly – Without Abandoning Central IT

Design a data mesh on AWS that reflects the envisioned organization

Create an end-to-end data strategy for Customer 360 on AWS

CIOs rise to the ESG reporting challenge

Generate security insights from Amazon Security Lake data using Amazon OpenSearch Ingestion

Build incremental data pipelines to load transactional data changes using AWS DMS, Delta 2.0, and Amazon EMR Serverless

Improving Multi-tenancy with Virtual Private Clusters

Analyze Amazon S3 storage costs using AWS Cost and Usage Reports, Amazon S3 Inventory, and Amazon Athena

Ingest, transform, and deliver events published by Amazon Security Lake to Amazon OpenSearch Service

Exploring the AI and data capabilities of watsonx

How Novo Nordisk built distributed data governance and control at scale

Data Management Requirements for the Enterprise Data Lake

Build a data lake with Apache Flink on Amazon EMR

How Cloudera Data Flow Enables Successful Data Mesh Architectures

Databricks’ Data+AI Summit 2022: A Show of Partner “Unity”

PODCAST: Making AI Real – Episode 4: Unlocking the Value of Enterprise AI with Data Engineering Capabilities

How Amazon Finance Automation built a data mesh to support distributed data ownership and centralize governance

Data Governance for Dummies: Your Questions, Answered

How smava makes loans transparent and affordable using Amazon Redshift Serverless

Stay Connected