Blog, Data Lake, Metadata and Optimization

Multicloud data lake analytics with Amazon Athena

AWS Big Data

MARCH 18, 2024

Many organizations operate data lakes spanning multiple cloud data stores. In these cases, you may want an integrated query layer to seamlessly run analytical queries across these diverse cloud stores and streamline your data analytics processes. The AWS Glue Data Catalog holds the metadata for Amazon S3 and GCS data.

Data Lake

Data Lake Analytics Cost-Benefit Management

Use Apache Iceberg in a data lake to support incremental data processing

AWS Big Data

MARCH 2, 2023

Apache Iceberg is an open table format for very large analytic datasets, which captures metadata information on the state of datasets as they evolve and change over time. Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback.

Data Lake

Data Lake Data Processing Metadata Snapshot

Webinars

Generative AI Deep Dive: Advancing from Proof of Concept to Production

Understanding User Needs and Satisfying Them

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You Need to Know

Leading the Development of Profitable and Sustainable Products

How To Get Promoted In Product Management

MORE WEBINARS

Query your Iceberg tables in data lake using Amazon Redshift (Preview)

AWS Big Data

AUGUST 31, 2023

Amazon Redshift enables you to directly access data stored in Amazon Simple Storage Service (Amazon S3) using SQL queries and join data across your data warehouse and data lake. With Amazon Redshift, you can query the data in your S3 data lake using a central AWS Glue metastore from your Redshift data warehouse.

Data Lake

Data Lake Data Warehouse Metadata Data Architecture

Apache Iceberg optimization: Solving the small files problem in Amazon EMR

AWS Big Data

OCTOBER 3, 2023

In our previous post, Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes, we discussed how you can implement solutions to improve operational efficiencies of your Amazon Simple Storage Service (Amazon S3) data lake that is using the Apache Iceberg open table format and running on the Amazon EMR big data platform.

Optimization

Optimization Snapshot Data Lake Metadata

Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes

AWS Big Data

MAY 24, 2023

When you build your transactional data lake using Apache Iceberg to solve your functional use cases, you need to focus on operational use cases for your S3 data lake to optimize the production environment. This property is set to true by default.

Data Lake

Data Lake Snapshot Metadata Optimization

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

AWS Big Data

JULY 20, 2023

With data becoming the driving force behind many industries today, having a modern data architecture is pivotal for organizations to be successful. In this post, we describe Orca’s journey building a transactional data lake using Amazon Simple Storage Service (Amazon S3), Apache Iceberg, and AWS Analytics.

Data Lake

Data Lake Analytics Snapshot Optimization

How to modernize data lakes with a data lakehouse architecture

IBM Big Data Hub

JULY 5, 2023

Data lakes have been around for well over a decade now, supporting the analytic operations of some of the world’s largest corporations. Such data volumes are not easy to move, migrate or modernize. The challenges of a monolithic data lake architecture: Data lakes are, at a high level, single repositories of data at scale.

Data Lake

Data Lake Metadata Cost-Benefit Data Warehouse

Simplify operational data processing in data lakes using AWS Glue and Apache Hudi

AWS Big Data

SEPTEMBER 13, 2023

A modern data architecture is an evolutionary architecture pattern designed to integrate a data lake, data warehouse, and purpose-built stores with a unified governance model. Moreover, the framework should consume compute resources as optimally as possible per the size of the operational tables.

Data Lake

Data Lake Data Processing Metadata Snapshot

How Cargotec uses metadata replication to enable cross-account data sharing

AWS Big Data

JUNE 7, 2023

This is a guest blog post co-written with Sumesh M R from Cargotec and Tero Karttunen from Knowit Finland. Through their unique position in ports, at sea, and on roads, they optimize global cargo flows and create sustainable customer value. They are headquartered in Helsinki, Finland, and operate globally in over 100 countries.

Metadata

Metadata Data Lake Machine Learning Big Data

Optimization Strategies for Iceberg Tables

Cloudera

FEBRUARY 14, 2024

Introduction: Apache Iceberg has recently grown in popularity because it adds data warehouse-like capabilities to your data lake, making it easier to analyze all your data, structured and unstructured. Expiring snapshots is a relatively cheap operation and uses metadata to determine newly unreachable files.

Strategy

Strategy Optimization Snapshot Metadata

Efficiently crawl your data lake and improve data access with an AWS Glue crawler using partition indexes

AWS Big Data

JUNE 15, 2023

In today’s world, customers manage vast amounts of data in their Amazon Simple Storage Service (Amazon S3) data lakes, which requires convoluted data pipelines to continuously understand the changes in the data layout and make them available to consuming systems. Review and update the crawler settings.

Data Lake

Data Lake Metadata Cost-Benefit Management

Empower your Jira data in a data lake with Amazon AppFlow and AWS Glue

AWS Big Data

AUGUST 1, 2023

Although Jira Cloud provides reporting capability, loading this data into a data lake will facilitate enrichment with other business data, as well as support the use of business intelligence (BI) tools and artificial intelligence (AI) and machine learning (ML) applications. For InitialRunFlag, choose Setup.

Data Lake

Data Lake Data Transformation Cost-Benefit Data-driven

Introducing Apache Hudi support with AWS Glue crawlers

AWS Big Data

NOVEMBER 22, 2023

Apache Hudi is an open table format that brings database and data warehouse capabilities to data lakes. Apache Hudi helps data engineers manage complex challenges, such as managing continuously evolving datasets with transactions while maintaining query performance. For CoW tables, queries see the latest data committed.

Data Lake

Data Lake Snapshot Metadata Optimization

Driving Business Value and ROI from a Hybrid Cloud Data Lake

Alation

FEBRUARY 20, 2020

For many enterprises, a hybrid cloud data lake is no longer a trend, but becoming reality. Not only can resources be quickly provisioned and optimized for different workloads and processing needs, but it can be done cost effectively. The Alation Data Catalog will automatically crawl and catalog metadata in your S3 bucket(s).

Data Lake

Data Lake ROI Metadata Cost-Benefit

What is an open data lakehouse and why you should care?

IBM Big Data Hub

JANUARY 17, 2023

A data lakehouse is an emerging data management architecture that improves efficiency and converges data warehouse and data lake capabilities driven by a need to improve efficiency and obtain critical insights faster. Let’s start with why data lakehouses are becoming increasingly important.

Data Lake

Data Lake Metadata Data Warehouse Data Governance

Achieve your AI goals with an open data lakehouse approach

IBM Big Data Hub

OCTOBER 4, 2023

A data lakehouse architecture combines the performance of data warehouses with the flexibility of data lakes, to address the challenges of today’s complex data landscape and scale AI. New insights and relationships are found in this combination. All of this supports the use of AI.

Data Lake

Data Lake Metadata Cost-Benefit Data Warehouse

Materialized Views in Hive for Iceberg Table Format

Cloudera

FEBRUARY 8, 2024

Overview: This blog post describes support for materialized views for the Iceberg table format. It brings the reliability and simplicity of SQL tables to big data while enabling engines like Hive, Impala, Spark, Trino, Flink, and Presto to work with the same tables at the same time. Furthermore, it is partitioned on the d_year column.

Snapshot

Snapshot Metadata Cost-Benefit Data Warehouse

Enforce fine-grained access control on Open Table Formats via Amazon EMR integrated with AWS Lake Formation

AWS Big Data

JANUARY 17, 2024

With Amazon EMR 6.15, we launched AWS Lake Formation based fine-grained access controls (FGAC) on Open Table Formats (OTFs), including Apache Hudi, Apache Iceberg, and Delta Lake. Many large enterprise companies seek to use their transactional data lake to gain insights and improve decision-making.

Data Lake

Data Lake Snapshot Big Data Data-driven

Improving Data Processing with Spark 3.0 & Delta Lake

Smart Data Collective

AUGUST 5, 2021

This means that the process would die in the middle of the final writes, making consumers distinctly read the input data frames. In this blog, we will cover an overview of Delta Lake, its advantages, and how the above challenges can be overcome by moving to Delta Lake and migrating to Spark 3.0. What is Delta Lake?

Data Processing

Data Processing Metadata Broadcasting Statistics

How Cloudera Supports Zero Trust for Data

Cloudera

JUNE 7, 2023

The revised ZTMM is organized by five categories or pillars: identity, devices, networks, applications and workloads, and data, and four levels of maturity: traditional, initial, advanced, and optimal. Moving to the “optimal” stage of maturity is critical to eliminating unauthorized access by bad actors, both foreign and domestic.

Metadata

Metadata Data Lake Optimization Modeling

Enhance query performance using AWS Glue Data Catalog column-level statistics

AWS Big Data

NOVEMBER 22, 2023

Today, we’re making available a new capability of AWS Glue Data Catalog that allows generating column-level statistics for AWS Glue tables. These statistics are now integrated with the cost-based optimizers (CBO) of Amazon Athena and Amazon Redshift Spectrum, resulting in improved query performance and potential cost savings.

Statistics

Statistics Data Lake Optimization Data-driven

The Future of the Data Lakehouse – Open

Cloudera

JUNE 18, 2022

Cloudera customers run some of the biggest data lakes on earth. These lakes power mission critical large scale data analytics, business intelligence (BI), and machine learning use cases, including enterprise data warehouses. On data warehouses and data lakes.

Data Lake

Data Lake Data Warehouse Machine Learning Cost-Benefit

The Security Challenges of Data Warehousing in the Cloud

Cloudera

NOVEMBER 5, 2020

How do you control data privacy and protect against data breaches when the data is spread across so many different systems? How do you optimize your enterprise-wide infrastructure (mostly cloud) and application expenditures? This is exactly what Cloudera Data Platform (CDP) provides to the Cloudera Data Warehouse.

Data Lake

Data Lake Data Warehouse Metadata Optimization

Apache Ozone and Dense Data Nodes

Cloudera

APRIL 22, 2021

Apache Ozone is one of the major innovations introduced in CDP, which provides the next generation storage architecture for Big Data applications, where data blocks are organized in storage containers for larger scale and to handle small objects. Collects and aggregates metadata from components and presents cluster state.

Data Lake

Data Lake Cost-Benefit Testing Metadata

Build and manage your modern data stack using dbt and AWS Glue through dbt-glue, the new “trusted” dbt adapter

AWS Big Data

NOVEMBER 29, 2023

We have seen a strong customer demand to expand its scope to cloud-based data lakes because data lakes are increasingly the enterprise solution for large-scale data initiatives due to their power and capabilities. The team uses dbt-glue to build a transformed gold model optimized for business intelligence (BI).

Data Lake

Data Lake Management Metrics Data Warehouse

Doing Cloud Migration and Data Governance Right the First Time

erwin

OCTOBER 8, 2020

These tools range from enterprise service bus (ESB) products, data integration tools, and extract, transform and load (ETL) tools to procedural code, application program interfaces (APIs), file transfer protocol (FTP) processes, and even business intelligence (BI) reports that further aggregate and transform data.

Data Governance

Data Governance Metadata Testing Data Lake

Case study: Policy Enforcement Automation With Semantics

Ontotext

MAY 2, 2024

Storage-centric approach: In the storage-centric approach, people try to address data silos by throwing everything in a data lake or a data warehouse. But although this helps somewhat in terms of architecture, these data lakes soon become unwieldy. The best way to drive value is through use cases.

Metadata

Metadata Data Lake Data-driven Enterprise

Unlock data across organizational boundaries using Amazon DataZone – now generally available

AWS Big Data

OCTOBER 4, 2023

Additionally, data owners and data stewards can make data discovery simpler by adding business context to data while balancing access governance to the data via pre-defined approval workflows in the user interface. The metadata forms types, and asset types can be used as templates for defining your assets.

Metadata

Metadata Data Lake Publishing Data Governance

Get Your Analytics Insights Instantly – Without Abandoning Central IT

Cloudera

JANUARY 21, 2021

With CDW, as an integrated service of CDP, your line of business gets immediate resources needed for faster application launches and expedited data access, all while protecting the company’s multi-year investment in centralized data management, security, and governance. Proprietary file formats mean no one else is invited in!

Data Lake

Data Lake Data Warehouse IT Analytics

Clean up your Excel and CSV files without writing code using AWS Glue DataBrew

AWS Big Data

NOVEMBER 15, 2023

As the organization receives data from multiple external vendors, it often arrives in different formats, typically Excel or CSV files, with each vendor using their own unique data layout and structure. In this blog post, we’ll explore a solution that streamlines this process by leveraging the capabilities of AWS Glue DataBrew.

Metadata

Metadata Sales Data Lake Big Data

Data platform trinity: Competitive or complementary?

IBM Big Data Hub

JANUARY 18, 2023

A read-optimized platform that can integrate data from multiple applications emerged. In another decade, the internet and mobile started to generate data of unforeseen volume, variety and velocity. It required a different data platform solution. The value of data projects is difficult to realize.

Data Lake

Data Lake Data Warehouse Data-driven Metadata

erwin, Microsoft and the Power of the Common Data Model

erwin

DECEMBER 17, 2020

Once the organization understands what something is, and it is commonly understood across the enterprise, anyone can build semantically aware reporting and analytical requirements plus deliver a uniform view because there is a common understanding of data. erwin Expands Collaboration with Microsoft.

Modeling

Modeling Metadata Data-driven Data Lake

The Modern Data Lakehouse: An Architectural Innovation

Cloudera

SEPTEMBER 9, 2022

analyst Sumit Pal, in “Exploring Lakehouse Architecture and Use Cases,” published January 11, 2022: “Data lakehouses integrate and unify the capabilities of data warehouses and data lakes, aiming to support AI, BI, ML, and data engineering on a single platform.” Iceberg handles massive data born in the cloud.

Metadata

Metadata Machine Learning Unstructured Data Data Lake

Turning Streams Into Data Products

Cloudera

JUNE 16, 2022

Organizations are increasingly building low-latency, data-driven applications, automations, and intelligence from real-time data streams. Cloudera Stream Processing (CSP) enables customers to turn streams into data products by providing capabilities to analyze streaming data for complex patterns and gain actionable intel.

Data Lake

Data Lake Manufacturing Metadata Dashboards

How data stores and governance impact your AI initiatives

IBM Big Data Hub

OCTOBER 12, 2023

Among the tasks necessary for internal and external compliance is the ability to report on the metadata of an AI model. Metadata includes details specific to an AI model such as: The AI model’s creation (when it was created, who created it, etc.) But the implementation of AI is only one piece of the puzzle.

Cost-Benefit

Cost-Benefit Metadata Data Governance Modeling

Cloudera Named a Leader in the 2022 Gartner® Magic Quadrant™ for Cloud Database Management Systems (DBMS)

Cloudera

DECEMBER 16, 2022

Notably, these same services simplify repatriating data workloads back to private clouds, to save on cloud infrastructure expenses. Cloudera is closely partnered with the leading cloud service providers (CSPs), and has optimized our platform and services to run as efficiently on their infrastructure services as possible.

Management

Management Metadata Machine Learning Data Lake

5 Reasons to Use Apache Iceberg on Cloudera Data Platform (CDP)

Cloudera

MARCH 23, 2022

Being multi-function also means integrated end-to-end data pipelines that break siloes, piecing together analytics as a coherent life-cycle where business value can be extracted at each and every stage. Users should be able to choose their tool of choice and take advantage of its workload specific optimizations. 4: Enterprise grade.

Metadata

Metadata Data Architecture Machine Learning Cost-Benefit

Tackling AI’s data challenges with IBM databases on AWS

IBM Big Data Hub

MARCH 14, 2024

This involves unifying and sharing a single copy of data and metadata across IBM® watsonx.data™, IBM® Db2®, IBM® Db2® Warehouse and IBM® Netezza®, using native integrations and supporting open formats, all without the need for migration or recataloging. With Netezza support for 1.2

Cost-Benefit

Cost-Benefit Metadata Optimization Management

How SumUp made digital analytics more accessible using AWS Glue

AWS Big Data

JUNE 6, 2023

This is a guest blog post by Mira Daniels and Sean Whitfield from SumUp. In this post we showcase how we used AWS Glue to move siloed digital analytics data, with inconsistent arrival times, to AWS S3 (our Data Lake) and our central data warehouse (DWH), Snowflake.

Analytics

Analytics Data Lake Testing Optimization

The Enduring Significance of Data Modeling in the Modern Data-Driven Enterprise

erwin

AUGUST 31, 2023

It delivers the ability to capture and unify the business and technical perspectives of data assets, enables effective collaboration between a variety of stakeholders, and delivers metadata-driven automation to accelerate the creation and maintenance of data sources on virtually any data management platform. Save My Spot!

Data-driven

Data-driven Modeling Enterprise Structured Data

Accelerate Your Data Mesh in the Cloud with Cloudera Data Engineering and Modak Nabu™

Cloudera

OCTOBER 11, 2021

The platform converges data cataloging, data ingestion, data profiling, data tagging, data discovery, and data exploration into a unified platform, driven by metadata. Modak Nabu automates repetitive tasks in the data preparation process and thus accelerates the data preparation by 4x.

Data Lake

Data Lake Cost-Benefit Data-driven Dashboards

Exploring the AI and data capabilities of watsonx

IBM Big Data Hub

JULY 17, 2023

In this blog, I will cover: What is watsonx.ai? sales conversation summaries, insurance coverage, meeting transcripts, contract information) Generate: Generate text content for a specific purpose, such as marketing campaigns, job descriptions, blogs or articles, and email drafting support. What capabilities are included in watsonx.ai?

Machine Learning

Machine Learning Data Warehouse Modeling Cost-Benefit

Data architecture strategy for data quality

IBM Big Data Hub

JANUARY 5, 2023

The first generation of data architectures represented by enterprise data warehouse and business intelligence platforms were characterized by thousands of ETL jobs, tables, and reports that only a small group of specialized data engineers understood, resulting in an under-realized positive impact on the business.

Data Quality

Data Quality Data Architecture Strategy Data Lake

Open Data Lakehouse powered by Iceberg for all your Data Warehouse needs

Cloudera

APRIL 3, 2023

In this blog, we will share with you in detail how Cloudera integrates core compute engines including Apache Hive and Apache Impala in Cloudera Data Warehouse with Iceberg. We will publish follow-up blogs for other data services. Iceberg basics: Iceberg is an open table format designed for large analytic workloads.

Data Warehouse

Data Warehouse Snapshot Metadata Cost-Benefit
