Data Leaders Brief

Data Management with Data Catalogs

Use Apache Iceberg in your data lake with Amazon S3, AWS Glue, and Snowflake

AWS Big Data

APRIL 3, 2024

Businesses are constantly evolving, and data leaders are challenged every day to meet new requirements. Customers are using AWS and Snowflake to develop purpose-built data architectures that provide the performance required for modern analytics and artificial intelligence (AI) use cases.

Data Lake, Snapshot, Metadata, Data Architecture

Amazon DataZone announces integration with AWS Lake Formation hybrid access mode for the AWS Glue Data Catalog

AWS Big Data

APRIL 8, 2024

In this post, we share how this new feature helps you simplify the way you use Amazon DataZone to enable secure and governed sharing of your data in the AWS Glue Data Catalog. We also delve into how data producers can share their AWS Glue tables through Amazon DataZone without needing to register them in Lake Formation first.

Finance, Sales, Publishing, Metadata

Webinars

Peak Performance: Continuous Testing & Evaluation of LLM-Based Applications

Manufacturing Sustainability Surge: Your Guide to Data-Driven Energy Optimization & Decarbonization

From Developer Experience to Product Experience: How a Shared Focus Fuels Product Success

Understanding User Needs and Satisfying Them

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You Need to Know

MORE WEBINARS

Trending Sources

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

AWS Big Data

MARCH 7, 2024

Organizations often need to manage a high volume of data that is growing at an extraordinary rate. At the same time, they need to optimize operational costs to unlock the value of this data for timely insights, and to do so with consistent performance. We think of this concept as inside-out data movement. Example Corp.

Data Lake, Analytics, Dashboards, Metrics

Enhance query performance using AWS Glue Data Catalog column-level statistics

AWS Big Data

NOVEMBER 22, 2023

Today, we’re making available a new capability of AWS Glue Data Catalog that allows generating column-level statistics for AWS Glue tables. Data lakes are designed for storing vast amounts of raw, unstructured, or semi-structured data at a low cost, and organizations share those datasets across multiple departments and teams.

Statistics, Data Lake, Optimization, Data-driven

Data governance in the age of generative AI

AWS Big Data

FEBRUARY 29, 2024

Data is your generative AI differentiator, and a successful generative AI implementation depends on a robust data strategy incorporating a comprehensive data governance approach. Data governance is a critical building block across all these approaches, and we see two emerging areas of focus.

Data Governance, Unstructured Data, Metadata, Data Lake

Best Practices for Data Catalog Implementation

Octopai

JUNE 19, 2023

In an era where data is often referred to as the new oil, a well-organized and easily accessible data catalog is no longer a luxury but a necessity as organizations deal with the deluge of data (data bloat) coming from every system and landscape.

Metadata, Data Governance, Measurement, Risk Management

3 commandments that should drive every API strategy

CIO Business Intelligence

OCTOBER 25, 2023

For example, to ensure consistency, access control should be centrally managed, with one identification and authentication scheme to be used by all APIs. Data format should also be centrally managed to ensure uniformity. The enterprise data model must clearly indicate who is accountable for which data.

Strategy, Software, Enterprise, Technology

Use AWS Glue ETL to perform merge, partition evolution, and schema evolution on Apache Iceberg

AWS Big Data

MARCH 4, 2024

As enterprises collect increasing amounts of data from various sources, the structure and organization of that data often need to change over time to meet evolving analytical needs. Schema evolution enables adding, deleting, renaming, or modifying columns without needing to rewrite existing data. Query the data using Athena.

Snapshot, Data Lake, Metadata, Recreation/Entertainment

Streaming Ingestion for Apache Iceberg With Cloudera Stream Processing

Cloudera

MARCH 2, 2023

Recently, we announced enhanced multi-function analytics support in Cloudera Data Platform (CDP) with Apache Iceberg. Iceberg is a high-performance open table format for huge analytic data sets. This enables you to maximize utilization of streaming data at scale. The Catalog Type should be set to Hive.

Snapshot, Data Processing, Metadata, Management

Automate AWS Clean Rooms querying and dashboard publishing using AWS Step Functions and Amazon QuickSight – Part 2

AWS Big Data

FEBRUARY 12, 2024

Public health organizations need access to data insights that they can quickly act upon, especially in times of health emergencies, when data needs to be updated multiple times daily. Instead, they rely on up-to-date dashboards that help them visualize data insights to make informed decisions quickly.

Publishing, Dashboards, Metadata, Visualization

Configure cross-Region table access with the AWS Glue Catalog and AWS Lake Formation

AWS Big Data

AUGUST 3, 2023

Today’s modern data lakes span multiple accounts, AWS Regions, and lines of business in organizations. It’s important that their data solution gives them the ability to share and access data securely and safely across Regions. A resource link is a Data Catalog object that is a link to a database or table.

Data Lake, Metadata, Management, Data Processing

Enhance monitoring and debugging for AWS Glue jobs using new job observability metrics, Part 3: Visualization and trend analysis using Amazon QuickSight

AWS Big Data

MARCH 29, 2024

QuickSight makes it straightforward for business users to visualize data in interactive dashboards and reports. You can slice data by different dimensions like job name, see anomalies, and share reports securely across your organization. With these insights, teams have the visibility to make data integration pipelines more efficient.

Metrics, Visualization, Dashboards, Interactive

Introducing AWS Glue crawler and create table support for Apache Iceberg format

AWS Big Data

AUGUST 16, 2023

Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback. AWS Glue crawlers now support Iceberg tables, enabling you to use the AWS Glue Data Catalog and migrate from other Iceberg catalogs more easily.

Data Lake, Metadata, Snapshot, Management

Amazon DataZone now integrates with AWS Glue Data Quality and external data quality solutions

AWS Big Data

APRIL 3, 2024

Today, we are pleased to announce that Amazon DataZone is now able to present data quality information for data assets. Other organizations monitor the quality of their data through third-party solutions. Amazon DataZone now integrates directly with AWS Glue to display data quality scores for AWS Glue Data Catalog assets.

Data Quality, Visualization, Metadata, Metrics

Why Your Business Should Use a Data Catalog to Organize Its Data

Smart Data Collective

JULY 15, 2021

A data catalog serves the same purpose. By using metadata (or short descriptions), data catalogs help companies gather, organize, retrieve, and manage information. You can think of a data catalog as an enhanced Access database or library card catalog system. What Does a Data Catalog Do?

Metadata, IT, Data-driven, Data Quality

Measure performance of AWS Glue Data Quality for ETL pipelines

AWS Big Data

MARCH 12, 2024

In recent years, data lakes have become a mainstream architecture, and data quality validation is a critical factor to improve the reusability and consistency of the data. In this post, we provide benchmark results of running increasingly complex data quality rulesets over a predefined test dataset.

Data Quality, Measurement, Testing, Visualization

How Knowledge Graphs Power Data Mesh and Data Fabric

Ontotext

APRIL 10, 2024

The data ecosystem today is crowded with dazzling buzzwords, all fighting for investment dollars. A survey in 2021 found that a data company was being funded every 45 minutes. Data ecosystems have become jungles and in spite of all the technology, data teams are struggling to create a modern data experience.

Metadata, Data Lake, Data Warehouse, Data Quality

Automate large-scale data validation using Amazon EMR and Apache Griffin

AWS Big Data

APRIL 4, 2024

Many enterprises are migrating their on-premises data stores to the AWS Cloud. During data migration, a key requirement is to validate all the data that has been moved from source to target. This data validation is a critical step, and if not done correctly, may result in the failure of the entire project.

Data Quality, Data Lake, Data Warehouse, Data-driven

Enhance data security and governance for Amazon Redshift Spectrum with VPC endpoints

AWS Big Data

FEBRUARY 16, 2024

Many customers are extending their data warehouse capabilities to their data lake with Amazon Redshift. They are looking to further enhance their security posture where they can enforce access policies on their data lakes based on Amazon Simple Storage Service (Amazon S3).

Data Lake, Data Warehouse, Testing, Business Objectives

9 tips for achieving IT service delivery excellence

CIO Business Intelligence

DECEMBER 19, 2023

IT service delivery enables an organization to give end users access to essential IT services by designing, developing, and deploying key technology resources, including applications and data. Develop a structured service catalog and adapt. Achieving service delivery excellence is not a one-size-fits-all process.

IT, Testing, Metrics, Optimization

AWS Lake Formation 2022 year in review

AWS Big Data

JANUARY 31, 2023

Data governance is the collection of policies, processes, and systems that organizations use to ensure the quality and appropriate handling of their data throughout its lifecycle for the purpose of generating business value.

Data Lake, Data Governance, Data Architecture, Machine Learning

Efficiently crawl your data lake and improve data access with an AWS Glue crawler using partition indexes

AWS Big Data

JUNE 15, 2023

In today’s world, customers manage vast amounts of data in their Amazon Simple Storage Service (Amazon S3) data lakes, which requires convoluted data pipelines to continuously understand the changes in the data layout and make them available to consuming systems. Choose Next.

Data Lake, Metadata, Cost-Benefit, Management

Introducing Native Connector for Google BigQuery: Boosting Data Lineage, Migration, and Discovery

Octopai

APRIL 24, 2023

This new native integration enhances our data lineage solution by providing seamless integration with one of the most powerful cloud-based data warehouses, benefiting data teams and enabling support for a broader range of data lineage, discovery, and catalog.

Cost-Benefit, Data-driven, Data Warehouse, Data Governance

Visualize Amazon DynamoDB insights in Amazon QuickSight using the Amazon Athena DynamoDB connector and AWS Glue

AWS Big Data

NOVEMBER 17, 2023

Amazon DynamoDB is a fully managed, serverless, key-value NoSQL database designed to run high-performance applications at any scale. DynamoDB offers built-in security, continuous backups, automated multi-Region replication, in-memory caching, and data import and export tools.

Visualization, Metadata, Testing, Internet of Things

Do I Need a Data Catalog?

erwin

JUNE 26, 2020

If you’re serious about a data-driven strategy, you’re going to need a data catalog. Organizations need a data catalog because it enables them to create a seamless way for employees to access and consume data and business assets in an organized manner. Three Types of Metadata in a Data Catalog.

Metadata, Cost-Benefit, Measurement, Data-driven

What is data governance? Best practices for managing data assets

CIO Business Intelligence

MARCH 24, 2023

Data governance definition: Data governance is a system for defining who within an organization has authority and control over data assets and how those data assets may be used. It encompasses the people, processes, and technologies required to manage and protect data assets.

Data Governance, Management, Metadata, Data Quality

Simplifying data processing at Capitec with Amazon Redshift integration for Apache Spark

AWS Big Data

NOVEMBER 10, 2023

Apache Spark is a widely used open source distributed processing system renowned for handling large-scale data workloads. Amazon Redshift offers seamless integration with Apache Spark, allowing you to easily access your Redshift data on both Amazon Redshift provisioned clusters and Amazon Redshift Serverless.

Data Processing, Data Lake, Data Warehouse, Optimization

Orchestrate an end-to-end ETL pipeline using Amazon S3, AWS Glue, and Amazon Redshift Serverless with Amazon MWAA

AWS Big Data

APRIL 25, 2024

Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed orchestration service for Apache Airflow that you can use to set up and operate data pipelines in the cloud at scale. By using multiple AWS accounts, organizations can effectively scale their workloads and manage their complexity as they grow.

Metadata, Data Processing, Management, Testing

Query your Apache Hive metastore with AWS Lake Formation permissions

AWS Big Data

JULY 20, 2023

Apache Hive is a SQL-based data warehouse system for processing highly distributed datasets on the Apache Hadoop platform. The Hive metastore is a repository of metadata about the SQL tables, such as database names, table names, schema, serialization and deserialization information, data location, and partition details of each table.

Data Lake, Metadata, Data Processing, Big Data

Announcing the AWS Well-Architected Data Analytics Lens

AWS Big Data

MARCH 26, 2024

We are delighted to announce the release of the Data Analytics Lens. The lens consists of a lens whitepaper and an AWS-created lens available in the Lens Catalog of the AWS Well-Architected Tool. What’s new in the Data Analytics Lens? For more information on AWS Well-Architected Lenses, refer to AWS Well-Architected.

Data Analytics, Analytics, Big Data, Data Lake

7 steps for turning shadow IT into a competitive edge

CIO Business Intelligence

NOVEMBER 21, 2023

But for a select few, the deeper challenges of departmental technologies being funded, procured, and managed without IT involvement are the missed opportunities to better engage and fulfill departmental technology needs. That’s where an IT strategy that frames shadow IT as an opportunity for improved collaboration can have a profound impact.

IT, Risk, Cost-Benefit, Risk Management

Securely process near-real-time data from Amazon MSK Serverless using an AWS Glue streaming ETL job with IAM authentication

AWS Big Data

SEPTEMBER 13, 2023

Streaming data has become an indispensable resource for organizations worldwide because it offers real-time insights that are crucial for data analytics. The escalating velocity and magnitude of collected data have created a demand for real-time analytics.

Management, Data Processing, Interactive, Metadata

Why I Joined Alation: A Former Customer’s Story

Alation

JULY 26, 2021

How do you initiate change within a system containing many thousands of people and millions of bytes of data? During my time as a data specialist at American Family Insurance, it became clear that we had to move away from the way things had been done in the past. So you can probably imagine: The company manages a lot of data.

Insurance, Digital Transformation, Enterprise, Data Governance

Best Practices for Metadata Management

Alation

JULY 19, 2021

Metadata is information about data. A clothing catalog and a dictionary are both examples of metadata repositories. Indeed, a popular online catalog, like Amazon, offers rich metadata around products to guide shoppers: ratings, reviews, and product details are all examples of metadata. What Is Metadata? Why Is Metadata Important?

Metadata, Management, Data Governance, Machine Learning

Scale AWS Glue jobs by optimizing IP address consumption and expanding network capacity using a private NAT gateway

AWS Big Data

MARCH 19, 2024

Companies may find themselves challenged to manage the limited pool of IP addresses. When AWS Glue is used for data engineering workloads in such a constrained network configuration, your team may sometimes face hurdles running many jobs simultaneously. These ENIs are short-lived and remain active until the job is complete.

Optimization, Data-driven, Management, Testing

Maximizing your event-driven architecture investments: Unleashing the power of Apache Kafka with IBM Event Automation

IBM Big Data Hub

FEBRUARY 12, 2024

Recognizing the need to harness real-time data, businesses are increasingly turning to event-driven architecture (EDA) as a strategic approach to stay ahead of the curve. This trend grows stronger as organizations realize the benefits that come from the power of real-time data streaming. Do you remember playing in the sandbox as a kid?

Data-driven, Cost-Benefit, Uncertainty, Technology

AWS Glue Data Quality is Generally Available

AWS Big Data

JUNE 6, 2023

We are excited to announce the General Availability of AWS Glue Data Quality. Our journey started by working backward from our customers who create, manage, and operate data lakes and data warehouses for analytics and machine learning. It takes days for data engineers to identify and implement data quality rules.

Data Quality, Statistics, Data Lake, Visualization

Using Experian identity resolution with AWS Clean Rooms to achieve higher audience activation match rates

AWS Big Data

SEPTEMBER 26, 2023

This is a guest post co-written with Tyler Middleton, Experian Senior Partner Marketing Manager, and Jay Rakhe, Experian Group Product Manager. As the data privacy landscape continues to evolve, companies are increasingly seeking ways to collect and manage data while protecting privacy and intellectual property.

Advertising, Data-driven, Marketing, Interactive

Dark Data: How to Find It and What to Do with It

Timo Elliott

JANUARY 6, 2022

Like the proverbial man looking for his keys under the streetlight, when it comes to enterprise data, if you only look at where the light is already shining, you can end up missing a lot. Remember that dark data is the data you have but don’t understand. So how do you find your dark data? Create a catalog.

IT, Metadata, Data-driven, Data Governance

Training the Next Generation of Data Leaders: The Data Intelligence Project

Alation

JULY 22, 2021

Our platform combines data insights with human intelligence in pursuit of this mission. Susannah Barnes, an Alation customer and senior data governance specialist at American Family Insurance, introduced our team to faculty at the School of Information Studies of the University of Wisconsin, Milwaukee (UWM-SOIS), her alma mater.

Informatics, Big Data, Insurance, Metadata

Data Catalog Management 101: The Tools and Roles You Need for Success

Octopai

JUNE 27, 2022

“This data catalog is helping boost our bottom line.” Your data team spends more time doing high-level analysis than it does searching for relevant datasets. Your most technologically challenged business user searches the data catalog at least once a week – without coming to you for help! But don’t worry.

Management, Metadata, Visualization, Data Governance

Case study: Policy Enforcement Automation With Semantics

Ontotext

MAY 2, 2024

Data leaders today are faced with an almost impossible challenge. They are expected to understand the entire data landscape and generate business-moving insights while facing the voracious needs of different teams and the constraints of technology architecture and compliance.

Metadata, Data Lake, Data-driven, Enterprise

Doing Cloud Migration and Data Governance Right the First Time

erwin

OCTOBER 8, 2020

So if you’re going to move your data from on-premises legacy data stores and warehouse systems to the cloud, you should do it right the first time. And as you make this transition, you need to understand what data you have, know where it is located, and govern it along the way. Then you must bulk load the legacy data.

Data Governance, Metadata, Testing, Data Lake

Visualize data quality scores and metrics generated by AWS Glue Data Quality

AWS Big Data

JUNE 6, 2023

AWS Glue Data Quality allows you to measure and monitor the quality of data in your data repositories. It’s important for business users to be able to see quality scores and metrics to make confident business decisions and debug data quality issues. An AWS Glue crawler crawls the results.

Data Quality, Metrics, Visualization, Dashboards
