Data Lake, Events and Metadata - Data Leaders Brief

Data Lake

Events

Metadata

Use Apache Iceberg in your data lake with Amazon S3, AWS Glue, and Snowflake

AWS Big Data

APRIL 3, 2024

licensed, 100% open-source data table format that helps simplify data processing on large datasets stored in data lakes. Data engineers use Apache Iceberg because it’s fast, efficient, and reliable at any scale and keeps records of how datasets change over time.

Data Lake

Data Lake Snapshot Metadata Data Architecture

Build a real-time GDPR-aligned Apache Iceberg data lake

AWS Big Data

FEBRUARY 24, 2023

Data lakes are a popular choice for today’s organizations to store their data around their business activities. As a best practice of a data lake design, data should be immutable once stored. A data lake built on AWS uses Amazon Simple Storage Service (Amazon S3) as its primary storage environment.

Data Lake

Data Lake Metadata Testing Data Warehouse

Join 52,000+

Insiders

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Webinars

How to Optimize the Developer Experience for Monumental Impact

Generative AI Deep Dive: Advancing from Proof of Concept to Production

Understanding User Needs and Satisfying Them

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You Need to Know

Leading the Development of Profitable and Sustainable Products

MORE WEBINARS

Trending Sources

Analytics Vidhya

Salesforce debuts Zero Copy Partner Network to ease data integration

CIO Business Intelligence

APRIL 25, 2024

“The challenge that a lot of our customers have is that requires you to copy that data, store it in Salesforce; you have to create a place to store it; you have to create an object or field in which to store it; and then you have to maintain that pipeline of data synchronization and make sure that data is updated,” Carlson said.

Data Integration

Data Integration Data Lake Metadata Data Warehouse

Webinars

How to Optimize the Developer Experience for Monumental Impact

Generative AI Deep Dive: Advancing from Proof of Concept to Production

Understanding User Needs and Satisfying Them

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You Need to Know

Leading the Development of Profitable and Sustainable Products

MORE WEBINARS

Apache Iceberg optimization: Solving the small files problem in Amazon EMR

AWS Big Data

OCTOBER 3, 2023

In our previous post Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes , we discussed how you can implement solutions to improve operational efficiencies of your Amazon Simple Storage Service (Amazon S3) data lake that is using the Apache Iceberg open table format and running on the Amazon EMR big data platform.

Optimization

Optimization Snapshot Data Lake Metadata

How Knowledge Graphs Power Data Mesh and Data Fabric

Ontotext

APRIL 10, 2024

“Any enterprise CEO really ought to be able to ask a question that involves connecting data across the organization, be able to run a company effectively, and especially to be able to respond to unexpected events. Most organizations are missing this ability to connect all the data together.”

Metadata

Metadata Data Lake Data Warehouse Data Quality

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

AWS Big Data

MARCH 7, 2024

At the same time, they need to optimize operational costs to unlock the value of this data for timely insights and do so with a consistent performance. With this massive data growth, data proliferation across your data stores, data warehouse, and data lakes can become equally challenging.

Data Lake

Data Lake Analytics Dashboards Metrics

Data Lakes: What Are They and Who Needs Them?

Jet Global

JULY 2, 2019

To address the flood of data and the needs of enterprise businesses to store, sort, and analyze that data, a new storage solution has evolved: the data lake. What’s in a Data Lake? Data warehouses do a great job of standardizing data from disparate sources for analysis. Taking a Dip.

Data Lake

Data Lake Data Warehouse Big Data Machine Learning

Use Amazon Athena with Spark SQL for your open-source transactional table formats

AWS Big Data

JANUARY 24, 2024

AWS-powered data lakes, supported by the unmatched availability of Amazon Simple Storage Service (Amazon S3), can handle the scale, agility, and flexibility required to combine different data and analytics approaches.

Snapshot

Snapshot Data Lake Metadata Optimization

How gaming companies can use Amazon Redshift Serverless to build scalable analytical applications faster and easier

AWS Big Data

MARCH 7, 2023

It covers how to use a conceptual, logical architecture for some of the most popular gaming industry use cases like event analysis, in-game purchase recommendations, measuring player satisfaction, telemetry data analysis, and more. A data hub contains data at multiple levels of granularity and is often not integrated.

Analytics

Analytics Data Warehouse Data Lake Metadata

Exploring real-time streaming for generative AI Applications

AWS Big Data

MARCH 25, 2024

For example, in a chatbot, data events could pertain to an inventory of flights and hotels or price changes that are constantly ingested to a streaming storage engine. Furthermore, data events are filtered, enriched, and transformed to a consumable format using a stream processor.

Data Lake

Data Lake Unstructured Data Management Modeling

Enforce fine-grained access control on Open Table Formats via Amazon EMR integrated with AWS Lake Formation

AWS Big Data

JANUARY 17, 2024

With Amazon EMR 6.15, we launched AWS Lake Formation based fine-grained access controls (FGAC) on Open Table Formats (OTFs), including Apache Hudi, Apache Iceberg, and Delta lake. Many large enterprise companies seek to use their transactional data lake to gain insights and improve decision-making.

Data Lake

Data Lake Snapshot Big Data Data-driven

AWS Lake Formation 2022 year in review

AWS Big Data

JANUARY 31, 2023

We have collected some of the key talks and solutions on data governance, data mesh, and modern data architecture published and presented in AWS re:Invent 2022, and a few data lake solutions built by customers and AWS Partners for easy reference. Starting with Amazon EMR release 6.7.0,

Data Lake

Data Lake Data Governance Data Architecture Machine Learning

Operational Database Security – Part 2

Cloudera

SEPTEMBER 23, 2020

Access audits are mastered centrally in Apache Ranger which provides comprehensive non-repudiable audit log for every access event to every resource with rich access event metadata such as: IP. Both fine-grained access control of database objects and access to metadata is provided. Sensitive data identification.

Data Lake

Data Lake Metadata IoT Enterprise

Architectural patterns for real-time analytics using Amazon Kinesis Data Streams, part 1

AWS Big Data

JANUARY 8, 2024

It aims to provide a framework to create low-latency streaming applications on the AWS Cloud using Amazon Kinesis Data Streams and AWS purpose-built data analytics services. In this post, we will review the common architectural patterns of two use cases: Time Series Data Analysis and Event Driven Microservices.

Analytics

Analytics IoT Data-driven Snapshot

Simplify operational data processing in data lakes using AWS Glue and Apache Hudi

AWS Big Data

SEPTEMBER 13, 2023

A modern data architecture is an evolutionary architecture pattern designed to integrate a data lake, data warehouse, and purpose-built stores with a unified governance model. The company wanted the ability to continue processing operational data in the secondary Region in the rare event of primary Region failure.

Data Lake

Data Lake Data Processing Metadata Snapshot

Enhance query performance using AWS Glue Data Catalog column-level statistics

AWS Big Data

NOVEMBER 22, 2023

Data lakes are designed for storing vast amounts of raw, unstructured, or semi-structured data at a low cost, and organizations share those datasets across multiple departments and teams. The queries on these large datasets read vast amounts of data and can perform complex join operations on multiple datasets.

Statistics

Statistics Data Lake Optimization Data-driven

How to use foundation models and trusted governance to manage AI workflow risk

IBM Big Data Hub

OCTOBER 16, 2023

It includes processes that trace and document the origin of data, models and associated metadata and pipelines for audits. Curated foundation models, such as those created by IBM or Microsoft, help enterprises scale and accelerate the use and impact of the most advanced AI capabilities using trusted data.

Risk

Risk Modeling Management Metadata

The Madness of Data (and analytics) Governance

Andrew White

DECEMBER 9, 2019

The client had recently engaged with a well-known consulting company that had recommended a large data catalog effort to collect all enterprise metadata to help identify all data and business issues. Modern data (and analytics) governance does not necessarily need: Wall-to-wall discovery of your data and metadata.

Analytics

Analytics Data Lake Data Governance Metadata

Power enterprise-grade Data Vaults with Amazon Redshift – Part 2

AWS Big Data

NOVEMBER 16, 2023

Amazon Redshift is a popular cloud data warehouse, offering a fully managed cloud-based service that seamlessly integrates with an organization’s Amazon Simple Storage Service (Amazon S3) data lake, real-time streams, machine learning (ML) workflows, transactional workflows, and much more—all while providing up to 7.9x

Enterprise

Enterprise Data Warehouse Snapshot Cost-Benefit

Amazon DataZone announces integration with AWS Lake Formation hybrid access mode for the AWS Glue Data Catalog

AWS Big Data

APRIL 8, 2024

Prior to this integration, you had to complete the following steps before Amazon DataZone could treat the published Data Catalog table as a managed asset: Identity the Amazon S3 location associated with Data Catalog table. Publish the table metadata to the Amazon DataZone business data catalog.

Finance

Finance Sales Publishing Metadata

How SumUp made digital analytics more accessible using AWS Glue

AWS Big Data

JUNE 6, 2023

Unless, of course, the rest of their data also resides in the Google Cloud. In this post we showcase how we used AWS Glue to move siloed digital analytics data, with inconsistent arrival times, to AWS S3 (our Data Lake) and our central data warehouse (DWH), Snowflake. It consists of full-day and intraday tables.

Analytics

Analytics Data Lake Testing Optimization

Demystifying Modern Data Platforms

Cloudera

SEPTEMBER 15, 2022

July brings summer vacations, holiday gatherings, and for the first time in two years, the return of the Massachusetts Institute of Technology (MIT) Chief Data Officer symposium as an in-person event. A key area of focus for the symposium this year was the design and deployment of modern data platforms.

Data Lake

Data Lake Data Architecture Data-driven Data Warehouse

Choosing an open table format for your transactional data lake on AWS

AWS Big Data

JUNE 9, 2023

A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale.

Data Lake

Data Lake Metadata Optimization Statistics

How Cargotec uses metadata replication to enable cross-account data sharing

AWS Big Data

JUNE 7, 2023

Cargotec captures terabytes of IoT telemetry data from their machinery operated by numerous customers across the globe. This data needs to be ingested into a data lake, transformed, and made available for analytics, machine learning (ML), and visualization.

Metadata

Metadata Data Lake Machine Learning Big Data

The Advantages Of Live Data-Streaming In The Competitive Financial Services Sector (Part I)

Cloudera

AUGUST 21, 2020

Data-in-motion is predominantly about streaming data so enterprises typically have two different ways or binary ways of looking at data. However, in the context of time and geography, these two events point to a pattern of fraud. The card is being used for its purpose and the amounts are relatively insignificant.

Enterprise

Enterprise Data Lake Strategy Metadata

Governing data in relational databases using Amazon DataZone

AWS Big Data

MAY 7, 2024

It also makes it easier for engineers, data scientists, product managers, analysts, and business users to access data throughout an organization to discover, use, and collaborate to derive data-driven insights. Note that a managed data asset is an asset for which Amazon DataZone can manage permissions.

Metadata

Metadata Data Lake Data Processing Data-driven

The Role of the Data Catalog in Data Security

Alation

JUNE 14, 2021

Breaches are resumé generating events.”. Dan Kirsch, Analyst, Hurwitz Associates, agrees that CISOs must take responsibility, when he says that “data protection is absolutely part of the CISO’s job. Indeed, automation is a key element to data catalog features, which enhance data security. Selecting a Data Catalog.

Data Governance

Data Governance Recreation/Entertainment Data Lake Digital Transformation

Ingest, transform, and deliver events published by Amazon Security Lake to Amazon OpenSearch Service

AWS Big Data

JUNE 19, 2023

Security Lake automatically centralizes security data from cloud, on-premises, and custom sources into a purpose-built data lake stored in your account. With Security Lake, you can get a more complete understanding of your security data across your entire organization.

Publishing

Publishing Dashboards Visualization Management

Build a multi-Region and highly resilient modern data architecture using AWS Glue and AWS Lake Formation

AWS Big Data

JANUARY 24, 2023

This solution only replicates metadata in the Data Catalog, not the actual underlying data. To have a redundant data lake using Lake Formation and AWS Glue in an additional Region, we recommend replicating the Amazon S3-based storage using S3 replication , S3 sync, aws-s3-copy-sync-using-batch or S3 Batch replication process.

Data Architecture

Data Architecture Metadata Data Lake Snapshot

Driving Data Catalog Adoption

Alation

FEBRUARY 13, 2020

A typical data catalog implementation process begins by defining the business and technical case, proceeds through technology selection and installation, then moves on to data discovery and populating the metadata catalog. Figure 1 – Data Catalog Implementation. See figure 1.) Subscribe to Alation's Blog.

Metadata

Metadata Data Governance Cost-Benefit Visualization

Themes and Conferences per Pacoid, Episode 8

Domino Data Lab

APRIL 3, 2019

Data governance shows up as the fourth-most-popular kind of solution that enterprise teams were adopting or evaluating during 2019. That’s a lot of priorities – especially when you group together closely related items such as data lineage and metadata management which rank nearby. in lieu of simply landing in a data lake.

Data Governance

Data Governance Machine Learning Metadata Big Data

Unlock data across organizational boundaries using Amazon DataZone – now generally available

AWS Big Data

OCTOBER 4, 2023

An Amazon DataZone domain contains an associated business data catalog for search and discovery, a set of metadata definitions to decorate the data assets that are used for discovery purposes, and data projects with integrated analytics and ML tools for users and groups to consume and publish data assets.

Metadata

Metadata Data Lake Publishing Data Governance

How Amazon Devices scaled and optimized real-time demand and supply forecasts using serverless analytics

AWS Big Data

FEBRUARY 1, 2023

With data volumes exhibiting a double-digit percentage growth rate year on year and the COVID pandemic disrupting global logistics in 2021, it became more critical to scale and generate near-real-time data. You can visually create, run, and monitor extract, transform, and load (ETL) pipelines to load data into your data lakes.

Optimization

Optimization Forecasting Data Lake Metadata

Backtesting index rebalancing arbitrage with Amazon EMR and Apache Iceberg

AWS Big Data

JULY 3, 2023

Terminology Let’s first discuss some of the terminology used in this post: Research data lake on Amazon S3 – A data lake is a large, centralized repository that allows you to manage all your structured and unstructured data at any scale. This is where the tagging feature in Apache Iceberg comes in handy.

Snapshot

Snapshot Data Lake Testing Strategy

How Ruparupa gained updated insights with an Amazon S3 data lake, AWS Glue, Apache Hudi, and Amazon QuickSight

AWS Big Data

FEBRUARY 22, 2023

In this post, we show how Ruparupa implemented an incrementally updated data lake to get insights into their business using Amazon Simple Storage Service (Amazon S3), AWS Glue , Apache Hudi , and Amazon QuickSight. An AWS Glue ETL job, using the Apache Hudi connector, updates the S3 data lake hourly with incremental data.

Data Lake

Data Lake Dashboards Cost-Benefit Metadata

Driving Business Value and ROI from a Hybrid Cloud Data Lake

Alation

FEBRUARY 20, 2020

For many enterprises, a hybrid cloud data lake is no longer a trend, but becoming reality. Due to these needs, hybrid cloud data lakes emerged as a logical middle ground between the two consumption models. earthquake, flood, or fire), where the data collected does not need to be as tightly controlled.

Data Lake

Data Lake ROI Metadata Cost-Benefit

Implement a serverless CDC process with Apache Iceberg using Amazon DynamoDB and Amazon Athena

AWS Big Data

AUGUST 16, 2023

Iceberg manages large collections of files as tables, and it supports modern analytical data lake operations such as record-level insert, update, delete, and time travel queries. Iceberg also helps guarantee data correctness under concurrent write scenarios. On the Code tab, choose Test , then Configure test event.

Data Lake

Data Lake Metadata Testing Snapshot

Salesforce readies Einstein Copilot to unleash generative AI across its offerings

CIO Business Intelligence

SEPTEMBER 12, 2023

said Clara Shih, now CEO of Salesforce AI, in a conference call on the eve of the company’s Dreamforce customer event. What’s changed since then, apart from Shih’s title, is Salesforce has rearchitected its underlying Data Cloud and Einstein AI framework to use an improved metadata framework, creating a new platform it calls Einstein 1.

IT Metadata Data Lake Cost-Benefit

Build efficient ETL pipelines with AWS Step Functions distributed map and redrive feature

AWS Big Data

DECEMBER 18, 2023

Solution overview One of the common functionalities involved in data pipelines is extracting data from multiple data sources and exporting it to a data lake or synchronizing the data to another database. There are multiple tables related to customers and order data in the RDS database.

Metadata

Metadata Visualization Data Lake Data-driven

Use your corporate identities for analytics with Amazon EMR and AWS IAM Identity Center

AWS Big Data

APRIL 26, 2024

Set up EMR Studio In this step, we demonstrate the actions needed from the data lake administrator to set up EMR Studio enabled for trusted identity propagation and with IAM Identity Center integration. On the Lake Formation console, choose Data lake permissions under Permissions in the navigation pane.

Analytics

Analytics Data Lake Management Enterprise

Get Your Analytics Insights Instantly – Without Abandoning Central IT

Cloudera

JANUARY 21, 2021

With CDW, as an integrated service of CDP, your line of business gets immediate resources needed for faster application launches and expedited data access, all while protecting the company’s multi-year investment in centralized data management, security, and governance. Proprietary file formats mean no one else is invited in!

Data Lake

Data Lake Data Warehouse IT Analytics

Generate security insights from Amazon Security Lake data using Amazon OpenSearch Ingestion

AWS Big Data

AUGUST 28, 2023

Amazon Security Lake centralizes access and management of your security data by aggregating security event logs from AWS environments, other cloud providers, on premise infrastructure, and other software as a service (SaaS) solutions. On the Amazon Security Lake console, choose your preferred Region, then choose Get started.

Dashboards

Dashboards Visualization Metadata Management

Create an end-to-end data strategy for Customer 360 on AWS

AWS Big Data

MARCH 26, 2024

You can subscribe to data products that help enrich customer profiles, for example demographics data, advertising data, and financial markets data. Amazon Kinesis ingests streaming events in real time from point-of-sales systems, clickstream data from mobile apps and websites, and social media data.

Data Strategy

Data Strategy Strategy Data Warehouse Prescriptive Analytics

Non-JSON ingestion using Amazon Kinesis Data Streams, Amazon MSK, and Amazon Redshift Streaming Ingestion

AWS Big Data

OCTOBER 2, 2023

JSON data in Amazon Redshift Amazon Redshift enables storage, processing, and analytics on JSON data through the SUPER data type, PartiQL language, materialized views, and data lake queries. This approach requires an additional step of schema retrieval and decoding based on context.

Cost-Benefit

Cost-Benefit Metadata Structured Data Management

Use Apache Iceberg in your data lake with Amazon S3, AWS Glue, and Snowflake

Build a real-time GDPR-aligned Apache Iceberg data lake

Webinars

Trending Sources

Salesforce debuts Zero Copy Partner Network to ease data integration

Webinars

Apache Iceberg optimization: Solving the small files problem in Amazon EMR

How Knowledge Graphs Power Data Mesh and Data Fabric

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

Data Lakes: What Are They and Who Needs Them?

Use Amazon Athena with Spark SQL for your open-source transactional table formats

How gaming companies can use Amazon Redshift Serverless to build scalable analytical applications faster and easier

Exploring real-time streaming for generative AI Applications

Enforce fine-grained access control on Open Table Formats via Amazon EMR integrated with AWS Lake Formation

AWS Lake Formation 2022 year in review

Operational Database Security – Part 2

Architectural patterns for real-time analytics using Amazon Kinesis Data Streams, part 1

Simplify operational data processing in data lakes using AWS Glue and Apache Hudi

Enhance query performance using AWS Glue Data Catalog column-level statistics

How to use foundation models and trusted governance to manage AI workflow risk

The Madness of Data (and analytics) Governance

Power enterprise-grade Data Vaults with Amazon Redshift – Part 2

Amazon DataZone announces integration with AWS Lake Formation hybrid access mode for the AWS Glue Data Catalog

How SumUp made digital analytics more accessible using AWS Glue

Demystifying Modern Data Platforms

Choosing an open table format for your transactional data lake on AWS

How Cargotec uses metadata replication to enable cross-account data sharing

The Advantages Of Live Data-Streaming In The Competitive Financial Services Sector (Part I)

Governing data in relational databases using Amazon DataZone

The Role of the Data Catalog in Data Security

Ingest, transform, and deliver events published by Amazon Security Lake to Amazon OpenSearch Service

Build a multi-Region and highly resilient modern data architecture using AWS Glue and AWS Lake Formation

Driving Data Catalog Adoption

Themes and Conferences per Pacoid, Episode 8

Unlock data across organizational boundaries using Amazon DataZone – now generally available

How Amazon Devices scaled and optimized real-time demand and supply forecasts using serverless analytics

Backtesting index rebalancing arbitrage with Amazon EMR and Apache Iceberg

How Ruparupa gained updated insights with an Amazon S3 data lake, AWS Glue, Apache Hudi, and Amazon QuickSight

Driving Business Value and ROI from a Hybrid Cloud Data Lake

Implement a serverless CDC process with Apache Iceberg using Amazon DynamoDB and Amazon Athena

Salesforce readies Einstein Copilot to unleash generative AI across its offerings

Build efficient ETL pipelines with AWS Step Functions distributed map and redrive feature

Use your corporate identities for analytics with Amazon EMR and AWS IAM Identity Center

Get Your Analytics Insights Instantly – Without Abandoning Central IT

Generate security insights from Amazon Security Lake data using Amazon OpenSearch Ingestion

Create an end-to-end data strategy for Customer 360 on AWS

Non-JSON ingestion using Amazon Kinesis Data Streams, Amazon MSK, and Amazon Redshift Streaming Ingestion

Stay Connected