Data Lake, Events and Snapshot - Data Leaders Brief

Use Apache Iceberg in your data lake with Amazon S3, AWS Glue, and Snowflake

AWS Big Data

APRIL 3, 2024

licensed, 100% open-source data table format that helps simplify data processing on large datasets stored in data lakes. Data engineers use Apache Iceberg because it’s fast, efficient, and reliable at any scale and keeps records of how datasets change over time.

Data Lake

Data Lake Snapshot Metadata Data Architecture

Perform upserts in a data lake using Amazon Athena and Apache Iceberg

AWS Big Data

APRIL 27, 2023

Amazon Athena supports the MERGE command on Apache Iceberg tables, which allows you to perform inserts, updates, and deletes in your data lake at scale using familiar SQL statements that are compliant with ACID (Atomic, Consistent, Isolated, Durable).

Data Lake

Data Lake Snapshot Optimization Data Transformation
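
The upsert behavior summarized in the Athena and Iceberg entry above (one statement that inserts, updates, and deletes) can be illustrated with a small in-memory sketch. This is not Athena or Iceberg code: the function name `merge_upsert`, the `id` key, and the `is_deleted` flag are hypothetical, chosen only to mirror the matched / not-matched branches of a MERGE statement. It does not model ACID guarantees.

```python
from typing import Dict, Iterable

Row = Dict[str, object]


def merge_upsert(
    target: Dict[object, Row],
    changes: Iterable[Row],
    key: str = "id",                  # hypothetical join key
    delete_flag: str = "is_deleted",  # hypothetical soft-delete marker
) -> Dict[object, Row]:
    """Return a new table with the changes applied, MERGE-style."""
    result = dict(target)  # shallow copy so the input table is not mutated
    for change in changes:
        k = change[key]
        payload = {c: v for c, v in change.items() if c != delete_flag}
        if k in result:
            if change.get(delete_flag):
                # WHEN MATCHED AND flagged as deleted THEN DELETE
                del result[k]
            else:
                # WHEN MATCHED THEN UPDATE (changed columns overwrite old ones)
                result[k] = {**result[k], **payload}
        elif not change.get(delete_flag):
            # WHEN NOT MATCHED THEN INSERT
            result[k] = payload
    return result


table = {1: {"id": 1, "name": "a"}, 2: {"id": 2, "name": "b"}}
updates = [
    {"id": 1, "name": "a2"},         # update existing row
    {"id": 2, "is_deleted": True},   # delete existing row
    {"id": 3, "name": "c"},          # insert new row
]
merged = merge_upsert(table, updates)
```

In a real data lake the same three branches would be expressed in SQL and committed by the table format as one atomic change; the sketch only shows the row-level logic.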

Webinars

Generative AI Deep Dive: Advancing from Proof of Concept to Production

Understanding User Needs and Satisfying Them

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You Need to Know

Leading the Development of Profitable and Sustainable Products

How To Get Promoted In Product Management

MORE WEBINARS

Simplify operational data processing in data lakes using AWS Glue and Apache Hudi

AWS Big Data

SEPTEMBER 13, 2023

A modern data architecture is an evolutionary architecture pattern designed to integrate a data lake, data warehouse, and purpose-built stores with a unified governance model. The company wanted the ability to continue processing operational data in the secondary Region in the rare event of primary Region failure.

Data Lake

Data Lake Data Processing Metadata Snapshot

Choosing an open table format for your transactional data lake on AWS

AWS Big Data

JUNE 9, 2023

A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale.

Data Lake

Data Lake Metadata Optimization Statistics

Implement slowly changing dimensions in a data lake using AWS Glue and Delta

AWS Big Data

MARCH 28, 2023

As organizations across the globe are modernizing their data platforms with data lakes on Amazon Simple Storage Service (Amazon S3), handling SCDs in data lakes can be challenging.

Data Lake

Data Lake Testing Snapshot Sales

Apache Iceberg optimization: Solving the small files problem in Amazon EMR

AWS Big Data

OCTOBER 3, 2023

In our previous post, Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes, we discussed how you can implement solutions to improve operational efficiencies of your Amazon Simple Storage Service (Amazon S3) data lake that is using the Apache Iceberg open table format and running on the Amazon EMR big data platform.

Optimization

Optimization Snapshot Data Lake Metadata

Backtesting index rebalancing arbitrage with Amazon EMR and Apache Iceberg

AWS Big Data

JULY 3, 2023

Terminology: Let’s first discuss some of the terminology used in this post. Research data lake on Amazon S3 – A data lake is a large, centralized repository that allows you to manage all your structured and unstructured data at any scale. This is where the tagging feature in Apache Iceberg comes in handy.

Snapshot

Snapshot Data Lake Testing Strategy

Use Amazon Athena with Spark SQL for your open-source transactional table formats

AWS Big Data

JANUARY 24, 2024

AWS-powered data lakes, supported by the unmatched availability of Amazon Simple Storage Service (Amazon S3), can handle the scale, agility, and flexibility required to combine different data and analytics approaches.

Snapshot

Snapshot Data Lake Metadata Optimization

Analyze Elastic IP usage history using Amazon Athena and AWS CloudTrail

AWS Big Data

MAY 15, 2024

By extracting detailed information from CloudTrail and querying it using Athena, this solution streamlines the process of data collection, analysis, and reporting of EIP usage within an AWS account. Additionally, you can analyze activity logs with AWS CloudTrail Lake and Amazon Athena.

Snapshot

Snapshot Optimization Data Lake Reporting

Exploring real-time streaming for generative AI Applications

AWS Big Data

MARCH 25, 2024

For example, in a chatbot, data events could pertain to an inventory of flights and hotels or price changes that are constantly ingested to a streaming storage engine. Furthermore, data events are filtered, enriched, and transformed to a consumable format using a stream processor.

Data Lake

Data Lake Unstructured Data Management Modeling

Build a data lake with Apache Flink on Amazon EMR

AWS Big Data

JANUARY 27, 2023

The AWS Glue Data Catalog provides a uniform repository where disparate systems can store and find metadata to keep track of data in data silos. Apache Flink is a widely used data processing engine for scalable streaming ETL, analytics, and event-driven applications.

Data Lake

Data Lake Metadata Business Analysis Data-driven

How Gupshup built their multi-tenant messaging analytics platform on Amazon Redshift

AWS Big Data

FEBRUARY 12, 2024

It makes it fast, simple, and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools. Moreover, no separate effort is required to process historical data versus live streaming data. Apart from incremental analytics, Redshift simplifies a lot of operational aspects.

Data Warehouse

Data Warehouse Analytics Snapshot Cost-Benefit

Implement a serverless CDC process with Apache Iceberg using Amazon DynamoDB and Amazon Athena

AWS Big Data

AUGUST 16, 2023

Iceberg manages large collections of files as tables, and it supports modern analytical data lake operations such as record-level insert, update, delete, and time travel queries. Iceberg also helps guarantee data correctness under concurrent write scenarios. On the Code tab, choose Test, then Configure test event.

Data Lake

Data Lake Metadata Testing Snapshot

Enforce fine-grained access control on Open Table Formats via Amazon EMR integrated with AWS Lake Formation

AWS Big Data

JANUARY 17, 2024

With Amazon EMR 6.15, we launched AWS Lake Formation based fine-grained access controls (FGAC) on Open Table Formats (OTFs), including Apache Hudi, Apache Iceberg, and Delta Lake. Many large enterprise companies seek to use their transactional data lake to gain insights and improve decision-making.

Data Lake

Data Lake Snapshot Big Data Data-driven

Build a multi-Region and highly resilient modern data architecture using AWS Glue and AWS Lake Formation

AWS Big Data

JANUARY 24, 2023

This solution only replicates metadata in the Data Catalog, not the actual underlying data. To have a redundant data lake using Lake Formation and AWS Glue in an additional Region, we recommend replicating the Amazon S3-based storage using S3 replication, S3 sync, aws-s3-copy-sync-using-batch, or the S3 Batch replication process.

Data Architecture

Data Architecture Metadata Data Lake Snapshot

Architectural patterns for real-time analytics using Amazon Kinesis Data Streams, part 1

AWS Big Data

JANUARY 8, 2024

It aims to provide a framework to create low-latency streaming applications on the AWS Cloud using Amazon Kinesis Data Streams and AWS purpose-built data analytics services. In this post, we will review the common architectural patterns of two use cases: Time Series Data Analysis and Event Driven Microservices.

Analytics

Analytics IoT Data-driven Snapshot

How Amazon Devices scaled and optimized real-time demand and supply forecasts using serverless analytics

AWS Big Data

FEBRUARY 1, 2023

With data volumes exhibiting a double-digit percentage growth rate year on year and the COVID pandemic disrupting global logistics in 2021, it became more critical to scale and generate near-real-time data. You can visually create, run, and monitor extract, transform, and load (ETL) pipelines to load data into your data lakes.

Optimization

Optimization Forecasting Data Lake Metadata

Power enterprise-grade Data Vaults with Amazon Redshift – Part 2

AWS Big Data

NOVEMBER 16, 2023

Amazon Redshift is a popular cloud data warehouse, offering a fully managed cloud-based service that seamlessly integrates with an organization’s Amazon Simple Storage Service (Amazon S3) data lake, real-time streams, machine learning (ML) workflows, transactional workflows, and much more—all while providing up to 7.9x

Enterprise

Enterprise Data Warehouse Snapshot Cost-Benefit

Break data silos and stream your CDC data with Amazon Redshift streaming and Amazon MSK

AWS Big Data

DECEMBER 13, 2023

Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. Debezium MySQL source Kafka Connector reads these change events and emits them to the Kafka topics in Amazon MSK.

Data Warehouse

Data Warehouse Snapshot Data Processing Management

How Tricentis unlocks insights across the software development lifecycle at speed and scale using Amazon Redshift

AWS Big Data

MARCH 3, 2023

Additionally, the scale is significant because the multi-tenant data sources provide a continuous stream of testing activity, and our users require quick data refreshes as well as historical context for up to a decade due to compliance and regulatory demands. Finally, data integrity is of paramount importance.

Software

Software Data Lake Testing Cost-Benefit

Accelerating revenue growth with real-time analytics: Poshmark’s journey

AWS Big Data

MARCH 20, 2023

Poshmark wanted to address the following business use cases via the real-time analytics platform: Sessionization – Poshmark captures both server-side application events and client-side tracking events. They wanted to use these events to identify and analyze user sessions to track behavior. The event data format is nested JSON.

Analytics

Analytics Slice and Dice Data Processing Data Lake
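
The sessionization use case in the Poshmark entry above (grouping events into user sessions) is commonly done by splitting a user's event stream wherever the gap between consecutive events exceeds an inactivity timeout. The sketch below assumes that approach; the 30-minute timeout, the `(user_id, timestamp)` event shape, and the name `sessionize` are illustrative assumptions, not details from the Poshmark post.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

Event = Tuple[str, datetime]  # (user_id, event timestamp), a hypothetical shape


def sessionize(
    events: List[Event],
    gap: timedelta = timedelta(minutes=30),  # assumed inactivity timeout
) -> Dict[str, List[List[datetime]]]:
    """Group each user's events into sessions split on inactivity gaps."""
    sessions: Dict[str, List[List[datetime]]] = {}
    # Sort by user, then by time, so each user's events arrive in order.
    for user, ts in sorted(events, key=lambda e: (e[0], e[1])):
        user_sessions = sessions.setdefault(user, [])
        if user_sessions and ts - user_sessions[-1][-1] <= gap:
            # Close enough to the previous event: same session.
            user_sessions[-1].append(ts)
        else:
            # First event for this user, or the gap was too long: new session.
            user_sessions.append([ts])
    return sessions


day = datetime(2023, 3, 20)
clicks = [
    ("u1", day.replace(hour=10, minute=0)),
    ("u2", day.replace(hour=10, minute=5)),
    ("u1", day.replace(hour=10, minute=10)),
    ("u1", day.replace(hour=11, minute=0)),
]
by_user = sessionize(clicks)
```

A streaming implementation would keep per-user state and emit sessions as they close; the batch version here only shows the gap rule.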

Build an Amazon Redshift data warehouse using an Amazon DynamoDB single-table design

AWS Big Data

JUNE 21, 2023

When setting out to build a data warehouse, it’s a common pattern to have a data lake as the source of the data warehouse. The data lake in this context serves a number of important functions: It acts as a central source for multiple applications, not just exclusively for data warehousing purposes.

Data Warehouse

Data Warehouse Data Lake OLAP Cost-Benefit

Join a streaming data source with CDC data for real-time serverless data analytics using AWS Glue, AWS DMS, and Amazon DynamoDB

AWS Big Data

MAY 30, 2023

Customers have been using data warehousing solutions to perform their traditional analytics tasks. Recently, data lakes have gained a lot of traction as the foundation for analytical solutions, because they come with benefits such as scalability, fault tolerance, and support for structured, semi-structured, and unstructured datasets.

Data Lake

Data Lake Data Analytics Analytics Data Processing

Configure monitoring, limits, and alarms in Amazon Redshift Serverless to keep costs predictable

AWS Big Data

JULY 25, 2023

Choose your level of metrics to monitor: Workgroup, Namespace, or Snapshot storage. If we select Workgroup, we can choose from the workgroup-level metrics shown in the following screenshot. The following screenshot shows the metrics available at the snapshot storage level. For instructions, refer to Subscribing to an Amazon SNS topic.

Metrics

Metrics Data Warehouse Dashboards Snapshot

Dimensional modeling in Amazon Redshift

AWS Big Data

JULY 19, 2023

We show how to perform extract, load, and transform (ELT), an integration process focused on getting the raw data from a data lake into a staging layer to perform the modeling. Lastly, we use Amazon QuickSight to gain insights on the modeled data in the form of a QuickSight dashboard. Identify and implement the facts.

Modeling

Modeling Sales Data Warehouse Snapshot

Interview with Dominic Sartorio, Senior Vice President for Products & Development, Protegrity

Corinium

APRIL 25, 2019

Ahead of the Chief Data Analytics Officers & Influencers, Insurance event we caught up with Dominic Sartorio, Senior Vice President for Products & Development, Protegrity to discuss how the industry is evolving. In data-driven organizations, data is flowing.

Insurance

Insurance Risk IoT Cost-Benefit

AWS Glue streaming application to process Amazon MSK data using AWS Glue Schema Registry

AWS Big Data

JUNE 12, 2023

This data can come from a diverse range of sources, including Internet of Things (IoT) devices, user applications, and logging and telemetry information from applications, to name a few. By harnessing the power of streaming data, organizations are able to stay ahead of real-time events and make quick, informed decisions.

Management

Management Metadata Testing Internet of Things

Snowflake and Domino: Better Together

Domino Data Lab

JANUARY 11, 2021

This means trading off granularity and latency of data for structures that make it easier to write queries that aggregate, filter and group results for reporting purposes. That said, there are many advantages to ensuring that data scientists have full access to not only read from, but also be able to write to data warehouse structures.

Recreation/Entertainment

Recreation/Entertainment Data Science Data Warehouse Modeling

Simplify AWS Glue job orchestration and monitoring with Amazon MWAA

AWS Big Data

MAY 19, 2023

Organizations across all industries have complex data processing requirements for their analytical use cases across different analytics systems, such as data lakes on AWS, data warehouses (Amazon Redshift), search (Amazon OpenSearch Service), NoSQL (Amazon DynamoDB), machine learning (Amazon SageMaker), and more.

Machine Learning

Machine Learning Metrics Big Data Management
