Data Lake, Data Processing and Reference

Data Lake

Data Processing

Reference

Create an Apache Hudi-based near-real-time transactional data lake using AWS DMS, Amazon Kinesis, AWS Glue streaming ETL, and data visualization using Amazon QuickSight

AWS Big Data

AUGUST 3, 2023

Data analytics on operational data at near-real time is becoming a common need. Due to the exponential growth of data volume, it has become common practice to replace read replicas with data lakes to have better scalability and performance. Apache Hudi connector for AWS Glue For this post, we use AWS Glue 4.0,

Data Lake

Data Lake Visualization Dashboards Insurance

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

AWS Big Data

MARCH 7, 2024

At the same time, they need to optimize operational costs to unlock the value of this data for timely insights and do so with a consistent performance. With this massive data growth, data proliferation across your data stores, data warehouse, and data lakes can become equally challenging.

Data Lake

Data Lake Analytics Dashboards Metrics

Join 52,000+

Insiders

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Webinars

Peak Performance: Continuous Testing & Evaluation of LLM-Based Applications

Manufacturing Sustainability Surge: Your Guide to Data-Driven Energy Optimization & Decarbonization

From Developer Experience to Product Experience: How a Shared Focus Fuels Product Success

Understanding User Needs and Satisfying Them

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You Need to Know

MORE WEBINARS

Trending Sources

Configure cross-Region table access with the AWS Glue Catalog and AWS Lake Formation

AWS Big Data

AUGUST 3, 2023

Today’s modern data lakes span multiple accounts, AWS Regions, and lines of business in organizations. It’s important that their data solution gives them the ability to share and access data securely and safely across Regions. For example, we are using a data lake administrator role called LF-Admin.

Data Lake

Data Lake Metadata Management Data Processing

Webinars

Peak Performance: Continuous Testing & Evaluation of LLM-Based Applications

Manufacturing Sustainability Surge: Your Guide to Data-Driven Energy Optimization & Decarbonization

From Developer Experience to Product Experience: How a Shared Focus Fuels Product Success

Understanding User Needs and Satisfying Them

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You Need to Know

MORE WEBINARS

Query your Apache Hive metastore with AWS Lake Formation permissions

AWS Big Data

JULY 20, 2023

The Hive metastore is a repository of metadata about the SQL tables, such as database names, table names, schema, serialization and deserialization information, data location, and partition details of each table. Therefore, organizations have come to host huge volumes of metadata of their structured datasets in the Hive metastore.

Data Lake

Data Lake Metadata Data Processing Big Data

Introducing AWS Glue crawler and create table support for Apache Iceberg format

AWS Big Data

AUGUST 16, 2023

Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback. For more details, refer to Creating Apache Iceberg tables. The customer wants to make product data accessible to analyst personas for interactive analysis using Athena.

Data Lake

Data Lake Metadata Snapshot Management

Set up advanced rules to validate quality of multiple datasets with AWS Glue Data Quality

AWS Big Data

JUNE 6, 2023

It supports both data quality at rest and data quality in AWS Glue extract, transform, and load (ETL) pipelines. Data quality at rest focuses on validating the data stored in data lakes, databases, or data warehouses. It ensures that the data meets specific quality standards before it is consumed.

Data Quality

Data Quality Data Lake Visualization Data-driven

Access Amazon Athena in your applications using the WebSocket API

AWS Big Data

MARCH 2, 2023

Many organizations are building data lakes to store and analyze large volumes of structured, semi-structured, and unstructured data. In addition, many teams are moving towards a data mesh architecture, which requires them to expose their data sets as easily consumable data products.

Data Lake

Data Lake Testing Interactive Unstructured Data

Implement alerts in Amazon OpenSearch Service with PagerDuty

AWS Big Data

JUNE 8, 2023

For instructions, refer to Creating and managing Amazon OpenSearch Service domains. For Host , enter events.PagerDuty.com. For more information about anomaly detection and alerts in OpenSearch Service, refer to Anomaly Detection in Amazon OpenSearch and Configuring Alerts in Amazon OpenSearch. For Method , choose POST.

Data Lake

Data Lake Dashboards Metrics Testing

Enhance query performance using AWS Glue Data Catalog column-level statistics

AWS Big Data

NOVEMBER 22, 2023

Data lakes are designed for storing vast amounts of raw, unstructured, or semi-structured data at a low cost, and organizations share those datasets across multiple departments and teams. The queries on these large datasets read vast amounts of data and can perform complex join operations on multiple datasets.

Statistics

Statistics Data Lake Optimization Data-driven

Introducing the technology behind watsonx.ai, IBM’s AI and data platform for enterprise

IBM Big Data Hub

MAY 9, 2023

Over the past decade, deep learning arose from a seismic collision of data availability and sheer compute power, enabling a host of impressive AI capabilities. Slate refers to a family of encoder-only (RoBERTa-based) models, which while not generative, are fast and effective for many enterprise NLP tasks. All watsonx.ai

Enterprise

Enterprise Technology Modeling Cost-Benefit

Break data silos and stream your CDC data with Amazon Redshift streaming and Amazon MSK

AWS Big Data

DECEMBER 13, 2023

Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. This solution uses Amazon Aurora MySQL hosting the example database salesdb.

Data Warehouse

Data Warehouse Snapshot Data Processing Management

Automate deployment of an Amazon QuickSight analysis connecting to an Amazon Redshift data warehouse with an AWS CloudFormation template

AWS Big Data

FEBRUARY 16, 2023

For more details, refer to Amazon QuickSight resource type reference. In this post, we show how to automate the deployment of a QuickSight analysis connecting to an Amazon Redshift data warehouse with a CloudFormation template. For instructions, refer to Authorizing connections from Amazon QuickSight to Amazon Redshift clusters.

Data Warehouse

Data Warehouse Sales Visualization Data Processing

Power enterprise-grade Data Vaults with Amazon Redshift – Part 2

AWS Big Data

NOVEMBER 16, 2023

Amazon Redshift is a popular cloud data warehouse, offering a fully managed cloud-based service that seamlessly integrates with an organization’s Amazon Simple Storage Service (Amazon S3) data lake, real-time streams, machine learning (ML) workflows, transactional workflows, and much more—all while providing up to 7.9x

Enterprise

Enterprise Data Warehouse Snapshot Cost-Benefit

The Strategy Behind our Denodo Partner Program

Data Virtualization

SEPTEMBER 23, 2020

I’m referring not only to our technology partners, but also to our cloud partners that host the Denodo Platform, Denodo is a very partner-friendly company, and here I’d like to share some thoughts about how Denodo works with our partners.

Strategy

Strategy Data Processing Technology Digital Transformation

Introducing Amazon EMR on EKS job submission with Spark Operator and spark-submit

AWS Big Data

JUNE 6, 2023

For instructions, refer to Installing or updating the latest version of the AWS CLI. For instructions, refer to Configuration and credential file settings. Also, note that we are using data-team-a as a namespace and emr-job-execution-sa as a service account, which we created in the previous step.

Optimization

Optimization Data Lake Cost-Benefit Management

Build a pseudonymization service on AWS to protect sensitive data: Part 2

AWS Big Data

MARCH 6, 2024

The account on the right hosts the pseudonymization service, which you can deploy using the instructions provided in the Part 1 of this series. For an overview of how to build an ACID compliant data lake using Iceberg, refer to Build a high-performance, ACID compliant, evolving data lake using Apache Iceberg on Amazon EMR.

Metrics

Metrics Statistics Testing Data Lake

Governing data in relational databases using Amazon DataZone

AWS Big Data

MAY 7, 2024

It also makes it easier for engineers, data scientists, product managers, analysts, and business users to access data throughout an organization to discover, use, and collaborate to derive data-driven insights. Note that a managed data asset is an asset for which Amazon DataZone can manage permissions.

Metadata

Metadata Data Lake Data Processing Data-driven

Join a streaming data source with CDC data for real-time serverless data analytics using AWS Glue, AWS DMS, and Amazon DynamoDB

AWS Big Data

MAY 30, 2023

Customers have been using data warehousing solutions to perform their traditional analytics tasks. Recently, data lakes have gained lot of traction to become the foundation for analytical solutions, because they come with benefits such as scalability, fault tolerance, and support for structured, semi-structured, and unstructured datasets.

Data Lake

Data Lake Data Analytics Analytics Data Processing

Accelerate your data warehouse migration to Amazon Redshift – Part 7

AWS Big Data

OCTOBER 17, 2023

Tens of thousands of customers use Amazon Redshift to gain business insights from their data. With Amazon Redshift, you can use standard SQL to query data across your data warehouse, operational data stores, and data lake. For more information, refer to Configuring ongoing data replication.

Data Warehouse

Data Warehouse Data Processing Data Lake Management

Themes and Conferences per Pacoid, Episode 8

Domino Data Lab

APRIL 3, 2019

“Flashpoint” (2018) – GDPR went into effect, plus major data blunders happened seemingly everywhere. Fun fact : I co-founded an e-commerce company (realistically, a mail-order catalog hosted online) in December 1992 using one of those internetworking applications called Gopher , which was vaguely popular at the time. Upcoming Events.

Data Governance

Data Governance Machine Learning Metadata Big Data

Use your corporate identities for analytics with Amazon EMR and AWS IAM Identity Center

AWS Big Data

APRIL 26, 2024

Set up EMR Studio In this step, we demonstrate the actions needed from the data lake administrator to set up EMR Studio enabled for trusted identity propagation and with IAM Identity Center integration. If Lake Formation is not already enabled, refer to Getting started with Lake Formation.

Analytics

Analytics Data Lake Management Enterprise

AWS Glue crawlers support cross-account crawling to support data mesh architecture

AWS Big Data

MARCH 27, 2023

Data lakes have come a long way, and there’s been tremendous innovation in this space. Today’s modern data lakes are cloud native, work with multiple data types, and make this data easily available to diverse stakeholders across the business. In the navigation pane, under Data catalog , choose Settings.

Data Lake

Data Lake Data-driven Management Data Architecture

Run Spark SQL on Amazon Athena Spark

AWS Big Data

OCTOBER 23, 2023

Modern applications store massive amounts of data on Amazon Simple Storage Service (Amazon S3) data lakes, providing cost-effective and highly durable storage, and allowing you to run analytics and machine learning (ML) from your data lake to generate insights on your data.

Data Lake

Data Lake Visualization Optimization Interactive

Build efficient ETL pipelines with AWS Step Functions distributed map and redrive feature

AWS Big Data

DECEMBER 18, 2023

Solution overview One of the common functionalities involved in data pipelines is extracting data from multiple data sources and exporting it to a data lake or synchronizing the data to another database. For more information, refer to IAM Policies for invoking AWS Glue job from Step Functions.

Metadata

Metadata Visualization Data Lake Data-driven

Build incremental data pipelines to load transactional data changes using AWS DMS, Delta 2.0, and Amazon EMR Serverless

AWS Big Data

MARCH 3, 2023

Building data lakes from continuously changing transactional data of databases and keeping data lakes up to date is a complex task and can be an operational challenge. You can then apply transformations and store data in Delta format for managing inserts, updates, and deletes. with Apache Spark version 3.3.0)

Data Lake

Data Lake Dashboards Metrics Metadata

Accelerating revenue growth with real-time analytics: Poshmark’s journey

AWS Big Data

MARCH 20, 2023

Top line revenue refers to the total value of sales of an organization’s services or products. The data from the S3 data lake is used for batch processing and analytics through Amazon EMR and Amazon Redshift. Operational dashboards are hosted on Grafana integrated with Druid.

Analytics

Analytics Slice and Dice Data Processing Data Lake

10 Keys to a Secure Cloud Data Lakehouse

Cloudera

OCTOBER 25, 2022

The data lakehouse is gaining in popularity because it enables a single platform for all your enterprise data with the flexibility to run any analytic and machine learning (ML) use case. Cloud data lakehouses provide significant scaling, agility, and cost advantages compared to cloud data lakes and cloud data warehouses.

Data Processing

Data Processing Data Lake Cost-Benefit Risk

How Cargotec uses metadata replication to enable cross-account data sharing

AWS Big Data

JUNE 7, 2023

Cargotec captures terabytes of IoT telemetry data from their machinery operated by numerous customers across the globe. This data needs to be ingested into a data lake, transformed, and made available for analytics, machine learning (ML), and visualization. The job runs in the target account. mode('overwrite').save(output_path

Metadata

Metadata Data Lake Machine Learning Big Data

Generate security insights from Amazon Security Lake data using Amazon OpenSearch Ingestion

AWS Big Data

AUGUST 28, 2023

OpenSearch Ingestion provides a serverless ingestion framework to easily ingest Security Lake data into OpenSearch Service with just a few clicks. For instructions, refer to Creating and managing Amazon OpenSearch Service domains. For instructions, refer to Managing multiple accounts with AWS Organizations. Choose Next.

Dashboards

Dashboards Visualization Metadata Management

Design a data mesh on AWS that reflects the envisioned organization

AWS Big Data

JANUARY 22, 2024

Data as a product Treating data as a product entails three key components: the data itself, the metadata, and the associated code and infrastructure. In this approach, teams responsible for generating data are referred to as producers.

Data-driven

Data-driven Advertising Metadata Data Architecture

Federate Amazon QuickSight access with open-source identity provider Keycloak

AWS Big Data

JUNE 13, 2023

For instructions on installing Keycloak, refer to Keycloak Downloads. Insert your specific host domain name where the Keycloak application resides in the following URL: [link] /realms/aws-realm/protocol/saml/descriptor. Vamsi Bhadriraju is a Data Architect at AWS. Sign in to your Keycloak admin dashboard.

Metadata

Metadata Dashboards Business Intelligence Management

Create an end-to-end data strategy for Customer 360 on AWS

AWS Big Data

MARCH 26, 2024

It is prudent to consolidate this data into a single customer view, serving as a primary reference for downstream applications, ranging from ecommerce platforms to CRM systems. This consolidated view acts as a liaison between the data platform and customer-centric applications.

Data Strategy

Data Strategy Strategy Data Warehouse Prescriptive Analytics

Analyze Amazon S3 storage costs using AWS Cost and Usage Reports, Amazon S3 Inventory, and Amazon Athena

AWS Big Data

FEBRUARY 2, 2023

Since its launch in 2006, Amazon Simple Storage Service (Amazon S3) has experienced major growth, supporting multiple use cases such as hosting websites, creating data lakes, serving as object storage for consumer applications, storing logs, and archiving data. This could be your data lake or application S3 bucket.

Reporting

Reporting Data Lake Management Optimization

Stitch Fix seamless migration: Transitioning from self-managed Kafka to Amazon MSK

AWS Big Data

SEPTEMBER 22, 2023

At Stitch Fix, we have been powered by data science since its foundation and rely on many modern data lake and data processing technologies. In our infrastructure, Apache Kafka has emerged as a powerful tool for managing event streams and facilitating real-time data processing.

Management

Management Metrics Cost-Benefit Data Lake

Improving Multi-tenancy with Virtual Private Clusters

Cloudera

JUNE 6, 2019

The typical Cloudera Enterprise Data Hub Cluster starts with a few dozen nodes in the customer’s datacenter hosting a variety of distributed services. Over time, workloads start processing more data, tenants start onboarding more workloads, and administrators (admins) start onboarding more tenants. Cloudera Manager (CM) 6.2

Metadata

Metadata Data Lake Optimization Strategy

How Novo Nordisk built distributed data governance and control at scale

AWS Big Data

APRIL 28, 2023

This enables users to mix and match between data domains—as well as utilizing private and public data paths that aren’t necessarily tied to any data domain. For more examples of how ABAC can be used with permission policies, refer to How to scale your authorization needs by using attribute-based access control with S3.

Data Governance

Data Governance Management Data-driven Data Lake

Build a data lake with Apache Flink on Amazon EMR

AWS Big Data

JANUARY 27, 2023

For example, the Flink FileSystem connector has FileSystemTableFactory to read/write data in Hadoop Distributed File System (HDFS) or Amazon Simple Storage Service (Amazon S3), the Flink HBase connector has HBase2DynamicTableFactory to read/write data in HBase, and the Flink Kafka connector has KafkaDynamicTableFactory to read/write data in Kafka.

Data Lake

Data Lake Metadata Business Analysis Data-driven

How Cloudera Data Flow Enables Successful Data Mesh Architectures

Cloudera

OCTOBER 7, 2021

Those decentralization efforts appeared under different monikers through time, e.g., data marts versus data warehousing implementations (a popular architectural debate in the era of structured data) then enterprise-wide data lakes versus smaller, typically BU-Specific, “data ponds”.

Metadata

Metadata Cost-Benefit Enterprise Interactive

Addressing the Three Scalability Challenges in Modern Data Platforms

Cloudera

NOVEMBER 22, 2021

In addition, data pipelines include more and more stages, thus making it difficult for data engineers to compile, manage, and troubleshoot those analytical workloads. Those incremental costs derive from a variety of reasons: Increased data processing costs associated with legacy deployment types (e.g., CRM platforms).

Data Processing

Data Processing Data Warehouse Enterprise Visualization

The Gartner 2021 Leadership Vision for Data & Analytics Leaders Webinar Q&A

Andrew White

JANUARY 11, 2021

On January 4th I had the pleasure of hosting a webinar. It was titled, The Gartner 2021 Leadership Vision for Data & Analytics Leaders. This was for the Chief Data Officer, or head of data and analytics. It is meant to be a desk-reference for that role for 2021. Data lakes don’t offer this nor should they.

Data Analytics

Data Analytics Analytics Data-driven Finance

What is Data Mapping?

Jet Global

FEBRUARY 23, 2024

ETL is beneficial for larger data volumes and diverse sources, and may be necessary for data architects, developers, and administrators considering factors like volume, source diversity, accuracy, and efficiency. Data Migration Data migration refers to the process of transferring data from one location or format to another.

Data Warehouse

Data Warehouse Reporting Data Transformation Sales

Create an Apache Hudi-based near-real-time transactional data lake using AWS DMS, Amazon Kinesis, AWS Glue streaming ETL, and data visualization using Amazon QuickSight

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

Webinars

Trending Sources

Configure cross-Region table access with the AWS Glue Catalog and AWS Lake Formation

Webinars

Query your Apache Hive metastore with AWS Lake Formation permissions

Introducing AWS Glue crawler and create table support for Apache Iceberg format

Set up advanced rules to validate quality of multiple datasets with AWS Glue Data Quality

Access Amazon Athena in your applications using the WebSocket API

Implement alerts in Amazon OpenSearch Service with PagerDuty

Enhance query performance using AWS Glue Data Catalog column-level statistics

Introducing the technology behind watsonx.ai, IBM’s AI and data platform for enterprise

Break data silos and stream your CDC data with Amazon Redshift streaming and Amazon MSK

Automate deployment of an Amazon QuickSight analysis connecting to an Amazon Redshift data warehouse with an AWS CloudFormation template

Power enterprise-grade Data Vaults with Amazon Redshift – Part 2

The Strategy Behind our Denodo Partner Program

Introducing Amazon EMR on EKS job submission with Spark Operator and spark-submit

Build a pseudonymization service on AWS to protect sensitive data: Part 2

Governing data in relational databases using Amazon DataZone

Join a streaming data source with CDC data for real-time serverless data analytics using AWS Glue, AWS DMS, and Amazon DynamoDB

Accelerate your data warehouse migration to Amazon Redshift – Part 7

Themes and Conferences per Pacoid, Episode 8

Use your corporate identities for analytics with Amazon EMR and AWS IAM Identity Center

AWS Glue crawlers support cross-account crawling to support data mesh architecture

Run Spark SQL on Amazon Athena Spark

Build efficient ETL pipelines with AWS Step Functions distributed map and redrive feature

Build incremental data pipelines to load transactional data changes using AWS DMS, Delta 2.0, and Amazon EMR Serverless

Accelerating revenue growth with real-time analytics: Poshmark’s journey

10 Keys to a Secure Cloud Data Lakehouse

How Cargotec uses metadata replication to enable cross-account data sharing

Generate security insights from Amazon Security Lake data using Amazon OpenSearch Ingestion

Design a data mesh on AWS that reflects the envisioned organization

Federate Amazon QuickSight access with open-source identity provider Keycloak

Create an end-to-end data strategy for Customer 360 on AWS

Analyze Amazon S3 storage costs using AWS Cost and Usage Reports, Amazon S3 Inventory, and Amazon Athena

Stitch Fix seamless migration: Transitioning from self-managed Kafka to Amazon MSK

Improving Multi-tenancy with Virtual Private Clusters

How Novo Nordisk built distributed data governance and control at scale

Build a data lake with Apache Flink on Amazon EMR

How Cloudera Data Flow Enables Successful Data Mesh Architectures

Addressing the Three Scalability Challenges in Modern Data Platforms

The Gartner 2021 Leadership Vision for Data & Analytics Leaders Webinar Q&A

What is Data Mapping?

Stay Connected