Data Lake, Information and Metadata

Migrate an existing data lake to a transactional data lake using Apache Iceberg

AWS Big Data

OCTOBER 3, 2023

A data lake is a centralized repository that you can use to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data and then run different types of analytics for better business insights.

Data Lake

Data Lake Metadata Snapshot Recreation/Entertainment

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

AWS Big Data

JUNE 10, 2024

Organizations have multiple Hive data warehouses across EMR clusters, where the metadata gets generated. To address this challenge, organizations can deploy a data mesh using AWS Lake Formation that connects the multiple EMR clusters. An entity can act both as a producer of data assets and as a consumer of data assets.

Data Lake

Data Lake Metadata Data Warehouse Data Processing

Multicloud data lake analytics with Amazon Athena

AWS Big Data

MARCH 18, 2024

Many organizations operate data lakes spanning multiple cloud data stores. In these cases, you may want an integrated query layer to seamlessly run analytical queries across these diverse cloud stores and streamline your data analytics processes. The AWS Glue Data Catalog holds the metadata for Amazon S3 and GCS data.

Data Lake

Data Lake Analytics Cost-Benefit Management

Webinars

Leading the Development of Profitable and Sustainable Products

How To Get Promoted In Product Management

MORE WEBINARS

Use Apache Iceberg in a data lake to support incremental data processing

AWS Big Data

MARCH 2, 2023

Apache Iceberg is an open table format for very large analytic datasets, which captures metadata information on the state of datasets as they evolve and change over time. Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback.

Data Lake

Data Lake Data Processing Metadata Snapshot

Build a real-time GDPR-aligned Apache Iceberg data lake

AWS Big Data

FEBRUARY 24, 2023

Data lakes are a popular choice for today’s organizations to store their data around their business activities. As a best practice of a data lake design, data should be immutable once stored. A data lake built on AWS uses Amazon Simple Storage Service (Amazon S3) as its primary storage environment.

Data Lake

Data Lake Metadata Testing Data Warehouse

Use AWS Glue ETL to perform merge, partition evolution, and schema evolution on Apache Iceberg

AWS Big Data

MARCH 4, 2024

As enterprises collect increasing amounts of data from various sources, the structure and organization of that data often need to change over time to meet evolving analytical needs. Schema evolution enables adding, deleting, renaming, or modifying columns without needing to rewrite existing data. Choose Next. Choose Next again.

Snapshot

Snapshot Data Lake Metadata Recreation/Entertainment

Data Lakes on Cloud & it’s Usage in Healthcare

BizAcuity

MARCH 29, 2019

Data lakes are centralized repositories that can store all structured and unstructured data at any desired scale. The power of the data lake lies in the fact that it often is a cost-effective way to store data. The power of the data lake lies in the fact that it often is a cost-effective way to store data.

Data Lake

Data Lake Unstructured Data Cost-Benefit Data Quality

The Data Lakehouse: Blending Data Warehouses and Data Lakes

Data Virtualization

APRIL 21, 2022

Reading Time: 3 minutes First we had data warehouses, then came data lakes, and now the new kid on the block is the data lakehouse. But what is a data lakehouse and why should we develop one? In a way, the name describes what.

Data Lake

Data Lake Data Warehouse Data Integration Management

Efficiently crawl your data lake and improve data access with an AWS Glue crawler using partition indexes

AWS Big Data

JUNE 15, 2023

In today’s world, customers manage vast amounts of data in their Amazon Simple Storage Service (Amazon S3) data lakes, which requires convoluted data pipelines to continuously understand the changes in the data layout and make them available to consuming systems.

Data Lake

Data Lake Metadata Cost-Benefit Management

Advancing AI: The emergence of a modern information lifecycle

CIO Business Intelligence

DECEMBER 4, 2023

Although less complex than the “4 Vs” of big data (velocity, veracity, volume, and variety), orienting to the variety and volume of a challenging puzzle is similar to what CIOs face with information management. Beyond “records,” organizations can digitally capture anything and apply metadata for context and searchability.

Unstructured Data

Unstructured Data Data Lake Metadata Business Objectives

How Knowledge Graphs Power Data Mesh and Data Fabric

Ontotext

APRIL 10, 2024

Drowning in Data, Thirsting for Context We’ve heard the saying, “Data, data everywhere. ” As more data accumulates, context gets diluted and lost. Bad Data Tax One of the reasons for this is what we call “ bad data tax ”. Teams can’t access data to build their business use cases.

Metadata

Metadata Data Lake Data Warehouse Data Quality

Collibra Brings Effective Data Governance to Line-of-Business

David Menninger's Analyst Perspectives

SEPTEMBER 28, 2021

Collibra is a data governance software company that offers tools for metadata management and data cataloging. The software enables organizations to find data quickly, identify its source and assure its integrity.

Data Governance

Data Governance Metadata Software Management

Data Lakes: What Are They and Who Needs Them?

Jet Global

JULY 2, 2019

To address the flood of data and the needs of enterprise businesses to store, sort, and analyze that data, a new storage solution has evolved: the data lake. What’s in a Data Lake? Data warehouses do a great job of standardizing data from disparate sources for analysis. Taking a Dip.

Data Lake

Data Lake Data Warehouse Big Data Machine Learning

Modernize your ETL platform with AWS Glue Studio: A case study from BMS

AWS Big Data

DECEMBER 13, 2023

For the past 5 years, BMS has used a custom framework called Enterprise Data Lake Services (EDLS) to create ETL jobs for business users. EDLS job steps and metadata Every EDLS job comprises one or more job steps chained together and run in a predefined order orchestrated by the custom ETL framework.

Metadata

Metadata Data Lake Visualization Data Transformation

Introducing AWS Glue crawler and create table support for Apache Iceberg format

AWS Big Data

AUGUST 16, 2023

Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback. Iceberg captures metadata information on the state of datasets as they evolve and change over time. Choose Create.

Data Lake

Data Lake Metadata Snapshot Management

Where Do Data Catalogs Fit in Metadata Management?

Alation

FEBRUARY 13, 2020

In an earlier blog, I defined a data catalog as “a collection of metadata, combined with data management and search tools, that helps analysts and other data users to find the data that they need, serves as an inventory of available data, and provides information to evaluate fitness data for intended uses.”.

Metadata

Metadata Management Data Lake Data Governance

Data governance in the age of generative AI

AWS Big Data

FEBRUARY 29, 2024

Access policies to extract permissions based on relevant data and filter out results based on the prompt user role and permissions. Enforce data privacy policies such as personally identifiable information (PII) redactions. Grant the user role permissions for sensitive information and compliance policies.

Data Governance

Data Governance Unstructured Data Metadata Data Lake

Use Amazon Athena with Spark SQL for your open-source transactional table formats

AWS Big Data

JANUARY 24, 2024

AWS-powered data lakes, supported by the unmatched availability of Amazon Simple Storage Service (Amazon S3), can handle the scale, agility, and flexibility required to combine different data and analytics approaches. For more information, refer to Amazon S3: Allows read and write access to objects in an S3 Bucket.

Snapshot

Snapshot Data Lake Metadata Optimization

Apache Iceberg optimization: Solving the small files problem in Amazon EMR

AWS Big Data

OCTOBER 3, 2023

In our previous post Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes , we discussed how you can implement solutions to improve operational efficiencies of your Amazon Simple Storage Service (Amazon S3) data lake that is using the Apache Iceberg open table format and running on the Amazon EMR big data platform.

Optimization

Optimization Snapshot Data Lake Metadata

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

AWS Big Data

MARCH 7, 2024

At the same time, they need to optimize operational costs to unlock the value of this data for timely insights and do so with a consistent performance. With this massive data growth, data proliferation across your data stores, data warehouse, and data lakes can become equally challenging.

Data Lake

Data Lake Analytics Dashboards Metrics

Configure cross-Region table access with the AWS Glue Catalog and AWS Lake Formation

AWS Big Data

AUGUST 3, 2023

Today’s modern data lakes span multiple accounts, AWS Regions, and lines of business in organizations. It’s important that their data solution gives them the ability to share and access data securely and safely across Regions. A resource link is a Data Catalog object that is a link to a database or table.

Data Lake

Data Lake Metadata Management Data Processing

Modernize your data observability with Amazon OpenSearch Service zero-ETL integration with Amazon S3

AWS Big Data

JUNE 5, 2024

The integration is new way for customers to query operational logs in Amazon S3 and Amazon S3-based data lakes without needing to switch between tools to analyze operational data. Amazon S3 is an object storage service offering industry-leading scalability, data availability, security, and performance.

Data Lake

Data Lake Cost-Benefit Dashboards Visualization

Query your Apache Hive metastore with AWS Lake Formation permissions

AWS Big Data

JULY 20, 2023

The Hive metastore is a repository of metadata about the SQL tables, such as database names, table names, schema, serialization and deserialization information, data location, and partition details of each table. Apache Hive, Apache Spark, Presto, and Trino can all use a Hive Metastore to retrieve metadata to run queries.

Data Lake

Data Lake Metadata Data Processing Big Data

Enforce fine-grained access control on Open Table Formats via Amazon EMR integrated with AWS Lake Formation

AWS Big Data

JANUARY 17, 2024

With Amazon EMR 6.15, we launched AWS Lake Formation based fine-grained access controls (FGAC) on Open Table Formats (OTFs), including Apache Hudi, Apache Iceberg, and Delta lake. Many large enterprise companies seek to use their transactional data lake to gain insights and improve decision-making.

Data Lake

Data Lake Snapshot Big Data Data-driven

What is a data architect? Skills, salaries, and how to become a data framework master

CIO Business Intelligence

OCTOBER 13, 2023

Solutions data architect: These individuals design and implement data solutions for specific business needs, including data warehouses, data marts, and data lakes. Application data architect: The application data architect designs and implements data models for specific software applications.

Data Architecture

Data Architecture Data Warehouse Statistics Visualization

Informatica’s new data management clouds target health, finance services

CIO Business Intelligence

MAY 24, 2022

In order to help maintain data privacy while validating and standardizing data for use, the IDMC platform offers a Data Quality Accelerator for Crisis Response. Cloud Computing, Data Management, Financial Services Industry, Healthcare Industry

Finance

Finance Management Metadata Data Quality

How gaming companies can use Amazon Redshift Serverless to build scalable analytical applications faster and easier

AWS Big Data

MARCH 7, 2023

A data hub contains data at multiple levels of granularity and is often not integrated. It differs from a data lake by offering data that is pre-validated and standardized, allowing for simpler consumption by users. Data hubs and data lakes can coexist in an organization, complementing each other.

Analytics

Analytics Data Warehouse Data Lake Metadata

Accelerate HiveQL with Oozie to Spark SQL migration on Amazon EMR

AWS Big Data

APRIL 19, 2023

We split the solution into two primary components: generating Spark job metadata and running the SQL on Amazon EMR. The first component (metadata setup) consumes existing Hive job configurations and generates metadata such as number of parameters, number of actions (steps), and file formats. sql_path SQL file name.

Metadata

Metadata Testing Data Lake Consulting

What Is a Data Catalog?

Alation

FEBRUARY 13, 2020

These are all good questions and a logical place to start your data cataloging journey. This brief definition makes several points about data catalogs—data management, searching, data inventory, and data evaluation—but all depend on the central capability to provide a collection of metadata. Conclusion.

Metadata

Metadata Data Lake Recreation/Entertainment Big Data

Educating ChatGPT on Data Lakehouse

Cloudera

MARCH 17, 2023

ChatGPT is trained on historical data and depending on how one phrases their question, it may offer inaccurate or misleading information. I took the free version of ChatGPT on a test drive (in March 2023) and asked some simple questions on data lakehouse and its components.

Unstructured Data

Unstructured Data Data Lake Data Warehouse Machine Learning

Exploring real-time streaming for generative AI Applications

AWS Big Data

MARCH 25, 2024

Large language models (LLMs) are a type of FM and are pre-trained on vast amounts of text data and typically have application uses such as text generation, intelligent chatbots, or summarization. In RAG, you provide the relevant information such as most relevant policy and customer records along with the user question to the prompt.

Data Lake

Data Lake Unstructured Data Management Modeling

Putting the Business Back Into Business Innovation

Timo Elliott

DECEMBER 14, 2022

Gartner calls it the Composable Enterprise , for example – it’s about having a solid information foundation that enables fast and flexible creation of what they call composable applications that allow you to create new applications and workflows by just bringing together modular components.

Data Lake

Data Lake Recreation/Entertainment Metadata Data Warehouse

AWS Lake Formation 2022 year in review

AWS Big Data

JANUARY 31, 2023

We have collected some of the key talks and solutions on data governance, data mesh, and modern data architecture published and presented in AWS re:Invent 2022, and a few data lake solutions built by customers and AWS Partners for easy reference. Starting with Amazon EMR release 6.7.0,

Data Lake

Data Lake Data Governance Data Architecture Data Warehouse

How Cloudera Supports Zero Trust for Data

Cloudera

JUNE 7, 2023

It operates independently from compute and storage layers, offering integrated security and governance based on metadata. With persistent context across analytics and cloud environments, SDX simplifies data delivery and access with a unified multi-tenant model. This minimizes the blast radius of a breach through segmented data access.

Metadata

Metadata Data Lake Optimization Modeling

Operational Database Security – Part 2

Cloudera

SEPTEMBER 23, 2020

Access audits are mastered centrally in Apache Ranger which provides comprehensive non-repudiable audit log for every access event to every resource with rich access event metadata such as: IP. Both fine-grained access control of database objects and access to metadata is provided. Sensitive data identification.

Data Lake

Data Lake Metadata IoT Enterprise

Five benefits of a data catalog

IBM Big Data Hub

DECEMBER 16, 2022

So, instead of wandering the aisles in hopes you’ll stumble across the book, you can walk straight to it and get the information you want much faster. An enterprise data catalog does all that a library inventory system does – namely streamlining data discovery and access across data sources – and a lot more.

Metadata

Metadata Data Quality Data-driven Data Governance

Integrating Data Governance and Enterprise Architecture

erwin

SEPTEMBER 3, 2020

Two of the biggest challenges in creating a successful enterprise architecture initiative are: collecting accurate information on application ecosystems and maintaining the information as application ecosystems change. Data governance provides time-sensitive, current-state architecture information with a high level of quality.

Data Governance

Data Governance Enterprise Risk Data Lake

Data Governance Makes Data Security Less Scary

erwin

OCTOBER 31, 2019

In fact, we’ve seen some frightening ones play out already: Google’s record GDPR fine – France’s data privacy enforcement agency hit the tech giant with a $57 million penalty in early 2019 – more than 80 times the steepest fine the U.K.’s s Information Commissioner’s Office had levied against both Facebook and Equifax for their data breaches.

Data Governance

Data Governance Metadata Risk Data Lake

Query your Iceberg tables in data lake using Amazon Redshift (Preview)

AWS Big Data

AUGUST 31, 2023

Amazon Redshift enables you to directly access data stored in Amazon Simple Storage Service (Amazon S3) using SQL queries and join data across your data warehouse and data lake. With Amazon Redshift, you can query the data in your S3 data lake using a central AWS Glue metastore from your Redshift data warehouse.

Data Lake

Data Lake Data Warehouse Metadata Data Architecture

What Is Data Curation?

Alation

FEBRUARY 13, 2020

Data curation is important in today’s world of data sharing and self-service analytics, but I think it is a frequently misused term. When speaking and consulting, I often hear people refer to data in their data lakes and data warehouses as curated data, believing that it is curated because it is stored as shareable data.

Metadata

Metadata Data Warehouse Data Lake Data Governance

How Data Governance Protects Sensitive Data

erwin

APRIL 2, 2021

An assessment of the data breaches that crop up like weeds each year supports the conclusion that companies, absent data governance, wind up building security architectures strictly from a technical perspective. Do You Know Where Your Sensitive Data Is? Data is a valuable asset used to operate, manage and grow a business.

Data Governance

Data Governance Cost-Benefit Risk Metadata

What is Data Mesh?

Ontotext

NOVEMBER 16, 2023

In a data mesh, domains are represented by a node, which can be an operational data store (ODS), a data warehouse, or a data lake tailored to the domain’s requirements. Mesh emerges when teams use other domains’ data products and the domains communicate with others in a governed manner. What is Data Sharing?

Metadata

Metadata Data-driven Data Quality Data Architecture

Data Profiling: What It Is and How to Perfect It

Alation

APRIL 18, 2023

Gartner defines data profiling as: A technology for discovering and investigating data quality issues, such as duplication, lack of consistency, and lack of accuracy and completeness. The tools provide data statistics, such as degree of duplication and ratios of attribute values, both in tabular and graphical formats.

IT

IT Metadata Data Quality Data Governance

The Madness of Data (and analytics) Governance

Andrew White

DECEMBER 9, 2019

The client had recently engaged with a well-known consulting company that had recommended a large data catalog effort to collect all enterprise metadata to help identify all data and business issues. Modern data (and analytics) governance does not necessarily need: Wall-to-wall discovery of your data and metadata.

Analytics

Analytics Data Lake Data Governance Metadata

Migrate an existing data lake to a transactional data lake using Apache Iceberg

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

Webinars

Trending Sources

Multicloud data lake analytics with Amazon Athena

Webinars

Use Apache Iceberg in a data lake to support incremental data processing

Build a real-time GDPR-aligned Apache Iceberg data lake

Use AWS Glue ETL to perform merge, partition evolution, and schema evolution on Apache Iceberg

Data Lakes on Cloud & it’s Usage in Healthcare

The Data Lakehouse: Blending Data Warehouses and Data Lakes

Efficiently crawl your data lake and improve data access with an AWS Glue crawler using partition indexes

Advancing AI: The emergence of a modern information lifecycle

How Knowledge Graphs Power Data Mesh and Data Fabric

Collibra Brings Effective Data Governance to Line-of-Business

Data Lakes: What Are They and Who Needs Them?

Modernize your ETL platform with AWS Glue Studio: A case study from BMS

Introducing AWS Glue crawler and create table support for Apache Iceberg format

Where Do Data Catalogs Fit in Metadata Management?

Data governance in the age of generative AI

Use Amazon Athena with Spark SQL for your open-source transactional table formats

Apache Iceberg optimization: Solving the small files problem in Amazon EMR

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

Configure cross-Region table access with the AWS Glue Catalog and AWS Lake Formation

Modernize your data observability with Amazon OpenSearch Service zero-ETL integration with Amazon S3

Query your Apache Hive metastore with AWS Lake Formation permissions

Enforce fine-grained access control on Open Table Formats via Amazon EMR integrated with AWS Lake Formation

What is a data architect? Skills, salaries, and how to become a data framework master

Informatica’s new data management clouds target health, finance services

How gaming companies can use Amazon Redshift Serverless to build scalable analytical applications faster and easier

Accelerate HiveQL with Oozie to Spark SQL migration on Amazon EMR

What Is a Data Catalog?

Educating ChatGPT on Data Lakehouse

Exploring real-time streaming for generative AI Applications

Putting the Business Back Into Business Innovation

AWS Lake Formation 2022 year in review

How Cloudera Supports Zero Trust for Data

Operational Database Security – Part 2

Five benefits of a data catalog

Integrating Data Governance and Enterprise Architecture

Data Governance Makes Data Security Less Scary

Query your Iceberg tables in data lake using Amazon Redshift (Preview)

What Is Data Curation?

How Data Governance Protects Sensitive Data

What is Data Mesh?

Data Profiling: What It Is and How to Perfect It

The Madness of Data (and analytics) Governance

Stay Connected