Blog, Data Lake, Metadata and Optimization

Blog

Data Lake

Metadata

Optimization

Multicloud data lake analytics with Amazon Athena

AWS Big Data

MARCH 18, 2024

Many organizations operate data lakes spanning multiple cloud data stores. In these cases, you may want an integrated query layer to seamlessly run analytical queries across these diverse cloud stores and streamline your data analytics processes. The AWS Glue Data Catalog holds the metadata for Amazon S3 and GCS data.

Data Lake

Data Lake Analytics Cost-Benefit Management

Use Apache Iceberg in a data lake to support incremental data processing

AWS Big Data

MARCH 2, 2023

Apache Iceberg is an open table format for very large analytic datasets, which captures metadata information on the state of datasets as they evolve and change over time. Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback.

Data Lake

Data Lake Data Processing Metadata Snapshot

Join 52,000+

Insiders

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Webinars

How to Optimize the Developer Experience for Monumental Impact

Generative AI Deep Dive: Advancing from Proof of Concept to Production

Understanding User Needs and Satisfying Them

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You Need to Know

Leading the Development of Profitable and Sustainable Products

MORE WEBINARS

Trending Sources

Apache Iceberg optimization: Solving the small files problem in Amazon EMR

AWS Big Data

OCTOBER 3, 2023

In our previous post Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes , we discussed how you can implement solutions to improve operational efficiencies of your Amazon Simple Storage Service (Amazon S3) data lake that is using the Apache Iceberg open table format and running on the Amazon EMR big data platform.

Optimization

Optimization Snapshot Data Lake Metadata

Webinars

How to Optimize the Developer Experience for Monumental Impact

Generative AI Deep Dive: Advancing from Proof of Concept to Production

Understanding User Needs and Satisfying Them

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You Need to Know

Leading the Development of Profitable and Sustainable Products

MORE WEBINARS

Optimization Strategies for Iceberg Tables

Cloudera

FEBRUARY 14, 2024

Introduction Apache Iceberg has recently grown in popularity because it adds data warehouse-like capabilities to your data lake making it easier to analyze all your data — structured and unstructured. Expiring snapshots is a relatively cheap operation and uses metadata to determine newly unreachable files.

Strategy

Strategy Optimization Snapshot Metadata

Efficiently crawl your data lake and improve data access with an AWS Glue crawler using partition indexes

AWS Big Data

JUNE 15, 2023

In today’s world, customers manage vast amounts of data in their Amazon Simple Storage Service (Amazon S3) data lakes, which requires convoluted data pipelines to continuously understand the changes in the data layout and make them available to consuming systems. Review and update the crawler settings.

Data Lake

Data Lake Metadata Cost-Benefit Management

Enforce fine-grained access control on Open Table Formats via Amazon EMR integrated with AWS Lake Formation

AWS Big Data

JANUARY 17, 2024

With Amazon EMR 6.15, we launched AWS Lake Formation based fine-grained access controls (FGAC) on Open Table Formats (OTFs), including Apache Hudi, Apache Iceberg, and Delta lake. Many large enterprise companies seek to use their transactional data lake to gain insights and improve decision-making.

Data Lake

Data Lake Snapshot Big Data Data-driven

Enhance query performance using AWS Glue Data Catalog column-level statistics

AWS Big Data

NOVEMBER 22, 2023

Today, we’re making available a new capability of AWS Glue Data Catalog that allows generating column-level statistics for AWS Glue tables. These statistics are now integrated with the cost-based optimizers (CBO) of Amazon Athena and Amazon Redshift Spectrum , resulting in improved query performance and potential cost savings.

Statistics

Statistics Data Lake Optimization Data-driven

How Cloudera Supports Zero Trust for Data

Cloudera

JUNE 7, 2023

The revised ZTMM is organized by five categories or pillars: identity, devices, networks, applications and workloads, and data, and four levels of maturity: traditional, initial, advanced, and optimal. Moving to the “optimal” stage of maturity is critical to eliminating unauthorized access by bad actors, both foreign and domestic.

Metadata

Metadata Data Lake Optimization Modeling

The Future of the Data Lakehouse – Open

Cloudera

JUNE 18, 2022

Cloudera customers run some of the biggest data lakes on earth. These lakes power mission critical large scale data analytics, business intelligence (BI), and machine learning use cases, including enterprise data warehouses. On data warehouses and data lakes.

Data Lake

Data Lake Data Warehouse Machine Learning Cost-Benefit

Case study: Policy Enforcement Automation With Semantics

Ontotext

MAY 2, 2024

Storage-centric approach In the storage-centric approach, people try to address data silos by throwing everything in a data lake or a data warehouse. But, although, this helps somewhat in terms of architecture, soon these data lakes become unwieldy. The best way to drive value is through use cases.

Metadata

Metadata Data Lake Data-driven Enterprise

Doing Cloud Migration and Data Governance Right the First Time

erwin

OCTOBER 8, 2020

These tools range from enterprise service bus (ESB) products, data integration tools; extract, transform and load (ETL) tools, procedural code, application program interfaces (APIs), file transfer protocol (FTP) processes, and even business intelligence (BI) reports that further aggregate and transform data.

Data Governance

Data Governance Metadata Testing Data Lake

How SumUp made digital analytics more accessible using AWS Glue

AWS Big Data

JUNE 6, 2023

This is a guest blog post by Mira Daniels and Sean Whitfield from SumUp. In this post we showcase how we used AWS Glue to move siloed digital analytics data, with inconsistent arrival times, to AWS S3 (our Data Lake) and our central data warehouse (DWH), Snowflake.

Analytics

Analytics Data Lake Testing Optimization

Data architecture strategy for data quality

IBM Big Data Hub

JANUARY 5, 2023

The first generation of data architectures represented by enterprise data warehouse and business intelligence platforms were characterized by thousands of ETL jobs, tables, and reports that only a small group of specialized data engineers understood, resulting in an under-realized positive impact on the business.

Data Quality

Data Quality Data Architecture Strategy Data Lake

How to use foundation models and trusted governance to manage AI workflow risk

IBM Big Data Hub

OCTOBER 16, 2023

It includes processes that trace and document the origin of data, models and associated metadata and pipelines for audits. How to scale AL and ML with built-in governance A fit-for-purpose data store built on an open lakehouse architecture allows you to scale AI and ML while providing built-in governance tools.

Risk

Risk Modeling Management Metadata

Query your Iceberg tables in data lake using Amazon Redshift (Preview)

AWS Big Data

AUGUST 31, 2023

Amazon Redshift enables you to directly access data stored in Amazon Simple Storage Service (Amazon S3) using SQL queries and join data across your data warehouse and data lake. With Amazon Redshift, you can query the data in your S3 data lake using a central AWS Glue metastore from your Redshift data warehouse.

Data Lake

Data Lake Data Warehouse Metadata Data Architecture

Simplify operational data processing in data lakes using AWS Glue and Apache Hudi

AWS Big Data

SEPTEMBER 13, 2023

A modern data architecture is an evolutionary architecture pattern designed to integrate a data lake, data warehouse, and purpose-built stores with a unified governance model. Moreover, the framework should consume compute resources as optimally as possible per the size of the operational tables.

Data Lake

Data Lake Data Processing Metadata Snapshot

Five benefits of a data catalog

IBM Big Data Hub

DECEMBER 16, 2022

For example, data catalogs have evolved to deliver governance capabilities like managing data quality and data privacy and compliance. It uses metadata and data management tools to organize all data assets within your organization. Comprehensive search and access to relevant data.

Metadata

Metadata Data Quality Data-driven Data Governance

Cloudera Data Warehouse Demonstrates Best-in-Class Cloud-Native Price-Performance

Cloudera

JANUARY 15, 2021

Some examples of recent optimizations in Impala include: New multithreading model (see dedicated blog post ). Remote read optimizations: IMPALA-8341: Data cache for remote reads. Impala use of KRPC (see dedicated blog post ). Parquet page indexes (see dedicated blog post ). Benchmark Description.

Data Warehouse

Data Warehouse Cost-Benefit Consulting Interactive

Announcing the 2021 Data Impact Awards

Cloudera

MAY 12, 2021

Use cases could include but are not limited to: predictive maintenance, log data pipeline optimization, connected vehicles, industrial IoT, fraud detection, patient monitoring, network monitoring, and more. DATA FOR ENTERPRISE AI. Nominations for the 2021 Cloudera Data Impact Awards are open from now until July 23.

Digital Transformation

Digital Transformation Machine Learning Optimization Data Lake

Integrating Data Governance and Enterprise Architecture

erwin

SEPTEMBER 3, 2020

Data governance and EA also provide many of the same benefits of enterprise architecture or business process modeling projects: reducing risk, optimizing operations, and increasing the use of trusted data. Automating Data Governance and Enterprise Architecture.

Data Governance

Data Governance Enterprise Risk Data Lake

How Cargotec uses metadata replication to enable cross-account data sharing

AWS Big Data

JUNE 7, 2023

This is a guest blog post co-written with Sumesh M R from Cargotec and Tero Karttunen from Knowit Finland. Through their unique position in ports, at sea, and on roads, they optimize global cargo flows and create sustainable customer value. They are headquartered in Helsinki, Finland, and operates globally in over 100 countries.

Metadata

Metadata Data Lake Machine Learning Big Data

Week in the Life of an Analyst at Gartner US IT Symposium (virtual) 2021

Andrew White

OCTOBER 22, 2021

If you follow my blog for any period of time you will know that for most years I have attended our annual Gartner IT Symposium I do a day-in-the-life blog of an analyst. Lakehouse (data warehouse and data lake working together) 8. Data Literacy, training, coordination, collaboration 8. Metadata Strategy 3.

IT Data Lake Strategy Data Science

Architectural patterns for real-time analytics using Amazon Kinesis Data Streams, part 1

AWS Big Data

JANUARY 8, 2024

Therefore, it is critical for organizations to embrace a low-latency, scalable, and reliable data streaming infrastructure to deliver real-time business applications and better customer experiences. Using a data stream in the middle provides the advantage of using the time series data in other processes and solutions at the same time.

Analytics

Analytics IoT Data-driven Snapshot

Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes

AWS Big Data

MAY 24, 2023

When you build your transactional data lake using Apache Iceberg to solve your functional use cases, you need to focus on operational use cases for your S3 data lake to optimize the production environment. This property is set to true by default. availability.

Data Lake

Data Lake Snapshot Metadata Optimization

Of Muffins and Machine Learning Models

Cloudera

FEBRUARY 16, 2022

In the case of CDP Public Cloud, this includes virtual networking constructs and the data lake as provided by a combination of a Cloudera Shared Data Experience (SDX) and the underlying cloud storage. Each project consists of a declarative series of steps or operations that define the data science workflow.

Machine Learning

Machine Learning Modeling Metadata Recreation/Entertainment

Introducing Apache Hudi support with AWS Glue crawlers

AWS Big Data

NOVEMBER 22, 2023

Apache Hudi is an open table format that brings database and data warehouse capabilities to data lakes. Apache Hudi helps data engineers manage complex challenges, such as managing continuously evolving datasets with transactions while maintaining query performance. For CoW tables, queries see the latest data committed.

Data Lake

Data Lake Snapshot Metadata Optimization

Extreme data center pressure? Burst to the cloud with CDP!

Cloudera

NOVEMBER 12, 2020

Burst to Cloud not only relieves pressure on your data center, but it also protects your VIP applications and users by giving them optimal performance without breaking your bank. Today, it is nearly impossible for IT departments to know if a particular workload is optimal to move from on-premises to the cloud.

Data Warehouse

Data Warehouse Reporting Risk Cost-Benefit

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

AWS Big Data

JULY 20, 2023

With data becoming the driving force behind many industries today, having a modern data architecture is pivotal for organizations to be successful. In this post, we describe Orca’s journey building a transactional data lake using Amazon Simple Storage Service (Amazon S3), Apache Iceberg, and AWS Analytics.

Data Lake

Data Lake Analytics Snapshot Optimization

5 Ways Data Engineers Can Support Data Governance

Alation

JANUARY 26, 2023

Offer the right tools Data stewardship is greatly simplified when the right tools are on hand. So ask yourself, does your steward have the software to spot issues with data quality, for example? Do they have a system to manage the metadata for given assets? One example is the EU’s General Data Protection Regulation (GDPR).

Data Governance

Data Governance Strategy Data Quality Marketing

Materialized Views in Hive for Iceberg Table Format

Cloudera

FEBRUARY 8, 2024

Overview This blog post describes support for materialized views for the Iceberg table format. It brings the reliability and simplicity of SQL tables to big data while enabling engines like Hive, Impala, Spark, Trino, Flink, and Presto to work with the same tables at the same time. Furthermore, it is partitioned on the d_year column.

Snapshot

Snapshot Metadata Cost-Benefit Data Warehouse

Top Graph Use Cases and Enterprise Applications (with Real World Examples)

Ontotext

MARCH 8, 2023

Here, I will draw upon our own experience from client projects and lessons learned to provide a selection of optimal use cases for knowledge graphs and semantic solutions along with real world examples of their applications. For many organizations, however, the question remains, “Is it the right solution for us?” million users.

Enterprise

Enterprise Knowledge Discovery Risk Data-driven

What is an open data lakehouse and why you should care?

IBM Big Data Hub

JANUARY 17, 2023

A data lakehouse is an emerging data management architecture that improves efficiency and converges data warehouse and data lake capabilities driven by a need to improve efficiency and obtain critical insights faster. Let’s start with why data lakehouses are becoming increasingly important.

Data Lake

Data Lake Metadata Data Warehouse Data Governance

Unlock data across organizational boundaries using Amazon DataZone – now generally available

AWS Big Data

OCTOBER 4, 2023

Additionally, data owners and data stewards can make data discovery simpler by adding business context to data while balancing access governance to the data via pre-defined approval workflows in the user interface. The metadata forms types, and asset types can be used as templates for defining your assets.

Metadata

Metadata Data Lake Publishing Data Governance

Achieve your AI goals with an open data lakehouse approach

IBM Big Data Hub

OCTOBER 4, 2023

A data lakehouse architecture combines the performance of data warehouses with the flexibility of data lakes, to address the challenges of today’s complex data landscape and scale AI. New insights and relationships are found in this combination. All of this supports the use of AI.

Data Lake

Data Lake Metadata Cost-Benefit Data Warehouse

Driving Business Value and ROI from a Hybrid Cloud Data Lake

Alation

FEBRUARY 20, 2020

For many enterprises, a hybrid cloud data lake is no longer a trend, but becoming reality. Not only can resources be quickly provisioned and optimized for different workloads and processing needs, but it can be done cost effectively. The Alation Data Catalog will automatically crawl and catalog metadata in your S3 bucket(s).

Data Lake

Data Lake ROI Metadata Cost-Benefit

Data platform trinity: Competitive or complementary?

IBM Big Data Hub

JANUARY 18, 2023

A read-optimized platform that can integrate data from multiple applications emerged. In another decade, the internet and mobile started the generate data of unforeseen volume, variety and velocity. It required a different data platform solution. Value of the data projects are difficult to realize.

Data Lake

Data Lake Data Warehouse Data-driven Metadata

Improving Data Processing with Spark 3.0 & Delta Lake

Smart Data Collective

AUGUST 5, 2021

This means that the process would die in the middle of the final writes, making consumers distinctly read the input data frames. In this blog, we will cover an overview of Delta Lakes , its advantages, and how the above challenges can be overcome by moving to Delta Lake and migrating to Spark 3.0 What is Delta Lake?

Data Processing

Data Processing Metadata Broadcasting Statistics

Get Your Analytics Insights Instantly – Without Abandoning Central IT

Cloudera

JANUARY 21, 2021

With CDW, as an integrated service of CDP, your line of business gets immediate resources needed for faster application launches and expedited data access, all while protecting the company’s multi-year investment in centralized data management, security, and governance. Proprietary file formats mean no one else is invited in!

Data Lake

Data Lake Data Warehouse IT Analytics

5 Reasons to Use Apache Iceberg on Cloudera Data Platform (CDP)

Cloudera

MARCH 23, 2022

Being multi-function also means integrated end-to-end data pipelines that break siloes, piecing together analytics as a coherent life-cycle where business value can be extracted at each and every stage. Users should be able to choose their tool of choice and take advantage of its workload specific optimizations. 4: Enterprise grade.

Metadata

Metadata Data Architecture Machine Learning Cost-Benefit

erwin, Microsoft and the Power of the Common Data Model

erwin

DECEMBER 17, 2020

Once the organization understands what something is, and it is commonly understood across the enterprise, anyone can build semantically aware reporting and analytical requirements plus deliver a uniform view because there is a common understanding of data. erwin Expands Collaboration with Microsoft.

Modeling

Modeling Metadata Data-driven Data Lake

The Security Challenges of Data Warehousing in the Cloud

Cloudera

NOVEMBER 5, 2020

How do you control data privacy and protect against data breaches when the data is spread across so many different systems? How do you optimize your enterprise-wide infrastructure (mostly cloud) and application expenditures? This is exactly what Cloudera Data Platform (CDP) provides to the Cloudera Data Warehouse.

Data Lake

Data Lake Data Warehouse Metadata Optimization

Clean up your Excel and CSV files without writing code using AWS Glue DataBrew

AWS Big Data

NOVEMBER 15, 2023

As the organization receives data from multiple external vendors, it often arrives in different formats, typically Excel or CSV files, with each vendor using their own unique data layout and structure. In this blog post, we’ll explore a solution that streamlines this process by leveraging the capabilities of AWS Glue DataBrew.

Metadata

Metadata Sales Data Lake Big Data

Apache Ozone and Dense Data Nodes

Cloudera

APRIL 22, 2021

Apache Ozone is one of the major innovations introduced in CDP, which provides the next generation storage architecture for Big Data applications, where data blocks are organized in storage containers for larger scale and to handle small objects. Collects and aggregates metadata from components and present cluster state.

Data Lake

Data Lake Cost-Benefit Testing Metadata

How data stores and governance impact your AI initiatives

IBM Big Data Hub

OCTOBER 12, 2023

Among the tasks necessary for internal and external compliance is the ability to report on the metadata of an AI model. Metadata includes details specific to an AI model such as: The AI model’s creation (when it was created, who created it, etc.) But the implementation of AI is only one piece of the puzzle.

Cost-Benefit

Cost-Benefit Metadata Data Governance Modeling

Multicloud data lake analytics with Amazon Athena

Use Apache Iceberg in a data lake to support incremental data processing

Webinars

Trending Sources

Apache Iceberg optimization: Solving the small files problem in Amazon EMR

Webinars

Optimization Strategies for Iceberg Tables

Efficiently crawl your data lake and improve data access with an AWS Glue crawler using partition indexes

Enforce fine-grained access control on Open Table Formats via Amazon EMR integrated with AWS Lake Formation

Enhance query performance using AWS Glue Data Catalog column-level statistics

How Cloudera Supports Zero Trust for Data

The Future of the Data Lakehouse – Open

Case study: Policy Enforcement Automation With Semantics

Doing Cloud Migration and Data Governance Right the First Time

How SumUp made digital analytics more accessible using AWS Glue

Data architecture strategy for data quality

How to use foundation models and trusted governance to manage AI workflow risk

Query your Iceberg tables in data lake using Amazon Redshift (Preview)

Simplify operational data processing in data lakes using AWS Glue and Apache Hudi

Five benefits of a data catalog

Cloudera Data Warehouse Demonstrates Best-in-Class Cloud-Native Price-Performance

Announcing the 2021 Data Impact Awards

Integrating Data Governance and Enterprise Architecture

How Cargotec uses metadata replication to enable cross-account data sharing

Week in the Life of an Analyst at Gartner US IT Symposium (virtual) 2021

Architectural patterns for real-time analytics using Amazon Kinesis Data Streams, part 1

Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes

Of Muffins and Machine Learning Models

Introducing Apache Hudi support with AWS Glue crawlers

Extreme data center pressure? Burst to the cloud with CDP!

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

5 Ways Data Engineers Can Support Data Governance

Materialized Views in Hive for Iceberg Table Format

Top Graph Use Cases and Enterprise Applications (with Real World Examples)

What is an open data lakehouse and why you should care?

Unlock data across organizational boundaries using Amazon DataZone – now generally available

Achieve your AI goals with an open data lakehouse approach

Driving Business Value and ROI from a Hybrid Cloud Data Lake

Data platform trinity: Competitive or complementary?

Improving Data Processing with Spark 3.0 & Delta Lake

Get Your Analytics Insights Instantly – Without Abandoning Central IT

5 Reasons to Use Apache Iceberg on Cloudera Data Platform (CDP)

erwin, Microsoft and the Power of the Common Data Model

The Security Challenges of Data Warehousing in the Cloud

Clean up your Excel and CSV files without writing code using AWS Glue DataBrew

Apache Ozone and Dense Data Nodes

How data stores and governance impact your AI initiatives

Stay Connected