Data Architecture, Data Lake and Reference

Data Architecture

Data Lake

Reference

Use Apache Iceberg in your data lake with Amazon S3, AWS Glue, and Snowflake

AWS Big Data

APRIL 3, 2024

They understand that a one-size-fits-all approach no longer works, and recognize the value in adopting scalable, flexible tools and open data formats to support interoperability in a modern data architecture to accelerate the delivery of new solutions.

Data Lake

Data Lake Snapshot Metadata Data Architecture

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

AWS Big Data

MARCH 7, 2024

At the same time, they need to optimize operational costs to unlock the value of this data for timely insights and do so with a consistent performance. With this massive data growth, data proliferation across your data stores, data warehouse, and data lakes can become equally challenging.

Data Lake

Data Lake Analytics Dashboards Metrics

Join 52,000+

Insiders

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Webinars

Peak Performance: Continuous Testing & Evaluation of LLM-Based Applications

Manufacturing Sustainability Surge: Your Guide to Data-Driven Energy Optimization & Decarbonization

From Developer Experience to Product Experience: How a Shared Focus Fuels Product Success

Understanding User Needs and Satisfying Them

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You Need to Know

MORE WEBINARS

Trending Sources

What is a data architect? Skills, salaries, and how to become a data framework master

CIO Business Intelligence

OCTOBER 13, 2023

Data architecture is a complex and varied field and different organizations and industries have unique needs when it comes to their data architects. Solutions data architect: These individuals design and implement data solutions for specific business needs, including data warehouses, data marts, and data lakes.

Data Architecture

Data Architecture Data Warehouse Statistics Visualization

Webinars

Peak Performance: Continuous Testing & Evaluation of LLM-Based Applications

Manufacturing Sustainability Surge: Your Guide to Data-Driven Energy Optimization & Decarbonization

From Developer Experience to Product Experience: How a Shared Focus Fuels Product Success

Understanding User Needs and Satisfying Them

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You Need to Know

MORE WEBINARS

AWS Lake Formation 2022 year in review

AWS Big Data

JANUARY 31, 2023

We have collected some of the key talks and solutions on data governance, data mesh, and modern data architecture published and presented in AWS re:Invent 2022, and a few data lake solutions built by customers and AWS Partners for easy reference.

Data Lake

Data Lake Data Governance Data Architecture Machine Learning

Detect, mask, and redact PII data using AWS Glue before loading into Amazon OpenSearch Service

AWS Big Data

JANUARY 12, 2024

Ingestion: Data lake batch, micro-batch, and streaming Many organizations land their source data into their data lake in various ways, including batch, micro-batch, and streaming jobs. Amazon AppFlow can be used to transfer data from different SaaS applications to a data lake.

Data Lake

Data Lake Cost-Benefit Visualization Structured Data

Enhance data security and governance for Amazon Redshift Spectrum with VPC endpoints

AWS Big Data

FEBRUARY 16, 2024

Many customers are extending their data warehouse capabilities to their data lake with Amazon Redshift. They are looking to further enhance their security posture where they can enforce access policies on their data lakes based on Amazon Simple Storage Service (Amazon S3). Choose Create endpoint.

Data Lake

Data Lake Data Warehouse Testing Business Objectives

Automated data governance with AWS Glue Data Quality, sensitive data detection, and AWS Lake Formation

AWS Big Data

OCTOBER 10, 2023

Due to the volume, velocity, and variety of data being ingested in data lakes, it can get challenging to develop and maintain policies and procedures to ensure data governance at scale for your data lake. Data confidentiality and data quality are the two essential themes for data governance.

Data Quality

Data Quality Data Governance Data Lake Testing

Why the Data Journey Manifesto?

DataKitchen

JUNE 12, 2023

I spent much time de-categorizing DataOps: we are not discussing ETL, Data Lake, or Data Science. Today we have had over 20,000 signatures , millions of page views, and copycat clones, and it is frequently used as a reference guide. It’s Customer Journey for data analytic systems.

Testing

Testing Data Lake Dashboards Data Science

Exploring real-time streaming for generative AI Applications

AWS Big Data

MARCH 25, 2024

To learn more about RAG, refer to Question answering using Retrieval Augmented Generation with foundation models in Amazon SageMaker JumpStart. A RAG-based generative AI application can only produce generic responses based on its training data and the relevant documents in the knowledge base.

Data Lake

Data Lake Unstructured Data Management Modeling

How Knowledge Graphs Power Data Mesh and Data Fabric

Ontotext

APRIL 10, 2024

Data Lakes, Data Catalogs, and Findability Organizations approach data lakes as cheap storage. They move data to data lakes creating another copy – the mantra being – “ Lets move the data to a data lake and then we will figure out what to do with it”.

Metadata

Metadata Data Lake Data Warehouse Data Quality

Simplify operational data processing in data lakes using AWS Glue and Apache Hudi

AWS Big Data

SEPTEMBER 13, 2023

The Analytics specialty practice of AWS Professional Services (AWS ProServe) helps customers across the globe with modern data architecture implementations on the AWS Cloud. Of those tables, some are larger (such as in terms of record volume) than others, and some are updated more frequently than others.

Data Lake

Data Lake Data Processing Metadata Snapshot

Demystifying Modern Data Platforms

Cloudera

SEPTEMBER 15, 2022

Mark: The first element in the process is the link between the source data and the entry point into the data platform. At Ramsey International (RI), we refer to that layer in the architecture as the foundation, but others call it a staging area, raw zone, or even a source data lake. What is a data fabric?

Data Lake

Data Lake Data Architecture Data-driven Data Warehouse

Query your Iceberg tables in data lake using Amazon Redshift (Preview)

AWS Big Data

AUGUST 31, 2023

Amazon Redshift enables you to directly access data stored in Amazon Simple Storage Service (Amazon S3) using SQL queries and join data across your data warehouse and data lake. With Amazon Redshift, you can query the data in your S3 data lake using a central AWS Glue metastore from your Redshift data warehouse.

Data Lake

Data Lake Data Warehouse Metadata Data Architecture

Data science vs data analytics: Unpacking the differences

IBM Big Data Hub

SEPTEMBER 19, 2023

Though you may encounter the terms “data science” and “data analytics” being used interchangeably in conversations or online, they refer to two distinctly different concepts. Watsonx comprises of three powerful components: the watsonx.ai

Data Science

Data Science Data Analytics Prescriptive Analytics Analytics

Visualize data quality scores and metrics generated by AWS Glue Data Quality

AWS Big Data

JUNE 6, 2023

These are six main steps in the data pipeline: Amazon EventBridge triggers an AWS Lambda function when the event pattern for AWS Glue Data Quality matches the defined rule. For more information, refer to Working with Query Results, Output Files, and Query History. For S3 path , enter the S3 path to your data source. (

Data Quality

Data Quality Metrics Visualization Dashboards

Deep dive into the AWS ProServe Hadoop Migration Delivery Kit TCO tool

AWS Big Data

FEBRUARY 6, 2023

For more details on how to configure and schedule the log collector, refer to the yarn-log-collector GitHub repo. For more information on how to use the YARN log organizer, refer to the yarn-log-organizer GitHub repo. He also understands how to apply technologies to solve big data problems and build a well-designed data architecture.

Dashboards

Dashboards Optimization Data Lake Cost-Benefit

Generic orchestration framework for data warehousing workloads using Amazon Redshift RSQL

AWS Big Data

APRIL 3, 2023

Tens of thousands of customers run business-critical workloads on Amazon Redshift , AWS’s fast, petabyte-scale cloud data warehouse delivering the best price-performance. With Amazon Redshift, you can query data across your data warehouse, operational data stores, and data lake using standard SQL.

Data Warehouse

Data Warehouse Testing Data Lake Data-driven

Load data incrementally from transactional data lakes to data warehouses

AWS Big Data

OCTOBER 19, 2023

Data lakes and data warehouses are two of the most important data storage and management technologies in a modern data architecture. Data lakes store all of an organization’s data, regardless of its format or structure.

Data Lake

Data Lake Data Warehouse Visualization Snapshot

Convergent Evolution

Peter James Thomas

AUGUST 18, 2018

Convergent Evolution refers to something else. That was the Science, here comes the Technology… A Brief Hydrology of Data Lakes. Of course some architectures featured both paradigms as well. So far so simple. This is the essence of Convergent Evolution. This required additional investments in metadata.

Data Lake

Data Lake Data Warehouse Data mining Statistics

Insiders Cite The Wondrous Benefits Of Big Data In Fortnite

Smart Data Collective

AUGUST 9, 2019

However, more mainstream games use big data as well. Fortnite is one of the games that uses big data to offer great service to its customers. Even Forbes Tech Council has written about the benefits of data lakes in Fortnite. As it turns out, Epic uses a data lake for this massive undertaking.

Big Data

Big Data Data Lake Data Architecture Machine Learning

Architectural patterns for real-time analytics using Amazon Kinesis Data Streams, part 1

AWS Big Data

JANUARY 8, 2024

Kinesis Data Streams has native integrations with other AWS services such as AWS Glue and Amazon EventBridge to build real-time streaming applications on AWS. Refer to Amazon Kinesis Data Streams integrations for additional details. To access your data from Timestream, you need to install the Timestream plugin for Grafana.

Analytics

Analytics IoT Data-driven Snapshot

Extend your data mesh with Amazon Athena and federated views

AWS Big Data

JULY 28, 2023

However, Athena also allows you to query data stored in 30 different data sources—in addition to Amazon S3—including relational, non-relational, and object stores running on premises or in other cloud environments. In Athena, we refer to queries on non-Amazon S3 data sources as federated queries.

Big Data

Big Data Data Architecture Data Lake Interactive

Dive deep into AWS Glue 4.0 for Apache Spark

AWS Big Data

MAY 18, 2023

It’s even harder when your organization is dealing with silos that impede data access across different data stores. Seamless data integration is a key requirement in a modern data architecture to break down data silos. For more details, refer to Spark Release 3.3.0 AWS Glue Data Catalog client 3.6.0

Testing

Testing Data Lake Cost-Benefit Data Integration

Introducing the AWS ProServe Hadoop Migration Delivery Kit TCO tool

AWS Big Data

FEBRUARY 6, 2023

Refactoring coupled compute and storage to a decoupling architecture is a modern data solution. It enables compute such as EMR instances and storage such as Amazon Simple Storage Service (Amazon S3) data lakes to scale. George Zhao is a Senior Data Architect at AWS ProServe.

Cost-Benefit

Cost-Benefit Data Lake Dashboards Big Data

Implement slowly changing dimensions in a data lake using AWS Glue and Delta

AWS Big Data

MARCH 28, 2023

As organizations across the globe are modernizing their data platforms with data lakes on Amazon Simple Storage Service (Amazon S3), handling SCDs in data lakes can be challenging.

Data Lake

Data Lake Testing Snapshot Sales

Extract data from SAP ERP using AWS Glue and the SAP SDK

AWS Big Data

FEBRUARY 8, 2023

For more information, refer to Download and Installation of NW RFC SDK. For instructions, refer to Configuration basics. Clone the PyRFC module from GitHub For instructions for creating and connecting to an Amazon Linux 2 AMI EC2 instance, refer to Tutorial: Get started with Amazon EC2 Linux instances.

Testing

Testing Data Integration Data Lake Enterprise

The New Normal for FP&A: Data Analytics

Jedox

OCTOBER 22, 2020

The term “data analytics” refers to the process of examining datasets to draw conclusions about the information they contain. Data analysis techniques enhance the ability to take raw data and uncover patterns to extract valuable insights from it.

Data Analytics

Data Analytics Analytics Unstructured Data Data mining

Three Trends for Modernizing Analytics and Data Warehousing in 2019

Cloudera

DECEMBER 19, 2018

Don’t blink or you might miss what leading organizations are doing to modernize their analytic and data warehousing environments. Business intelligence (BI), an umbrella term coined in 1989 by Howard Dresner, Chief Research Officer at Dresner Advisory Services, refers to the ability of end-users to access and analyze enterprise data.

Data Warehouse

Data Warehouse Analytics Big Data Data Architecture

Choosing an open table format for your transactional data lake on AWS

AWS Big Data

JUNE 9, 2023

A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale.

Data Lake

Data Lake Metadata Optimization Statistics

Implement tag-based access control for your data lake and Amazon Redshift data sharing with AWS Lake Formation

AWS Big Data

JULY 21, 2023

This leads to having data across many instances of data warehouses and data lakes using a modern data architecture in separate AWS accounts. We recently announced the integration of Amazon Redshift data sharing with AWS Lake Formation. Take note of this role’s ARN to use later in the steps.

Data Lake

Data Lake Data Warehouse Marketing Management

Build a multi-Region and highly resilient modern data architecture using AWS Glue and AWS Lake Formation

AWS Big Data

JANUARY 24, 2023

This solution only replicates metadata in the Data Catalog, not the actual underlying data. To have a redundant data lake using Lake Formation and AWS Glue in an additional Region, we recommend replicating the Amazon S3-based storage using S3 replication , S3 sync, aws-s3-copy-sync-using-batch or S3 Batch replication process.

Data Architecture

Data Architecture Metadata Data Lake Snapshot

Automate replication of relational sources into a transactional data lake with Apache Iceberg and AWS Glue

AWS Big Data

FEBRUARY 14, 2023

Organizations have chosen to build data lakes on top of Amazon Simple Storage Service (Amazon S3) for many years. A data lake is the most popular choice for organizations to store all their organizational data generated by different teams, across business domains, from all different formats, and even over history.

Data Lake

Data Lake Statistics Data Architecture Finance

This Structure has Novel Features which are of Considerable Business Interest

Peter James Thomas

APRIL 3, 2020

The way that this consistency of figures is achieved is by all elements of the Structured Reporting Framework drawing their data from the same data repositories. Without paying attention to this, your shiny warehouse or data lake will be a technological curiosity, not an indispensable business tool.

Dashboards

Dashboards Reporting Sales Data Lake

How GamesKraft uses Amazon Redshift data sharing to support growing analytics workloads

AWS Big Data

NOVEMBER 13, 2023

Amazon Redshift is a fully managed data warehousing service that offers both provisioned and serverless options, making it more efficient to run and scale analytics without having to manage your data warehouse. Additionally, data is extracted from vendor APIs that includes data related to product, marketing, and customer experience.

Data Warehouse

Data Warehouse Data Lake Analytics Data Science

Simplify access management with Amazon Redshift and AWS Lake Formation for users in an External Identity Provider

AWS Big Data

FEBRUARY 15, 2024

You might be modernizing your data architecture using Amazon Redshift to enable access to your data lake and data in your data warehouse, and are looking for a centralized and scalable way to define and manage the data access based on IdP identities. For Permission mode , select Lake Formation.

Management

Management Data Lake Sales Data Warehouse

Amazon Redshift announcements at AWS re:Invent 2023 to enable analytics on all your data

AWS Big Data

NOVEMBER 29, 2023

For more information about performance improvement capabilities, refer to the list of announcements below. Zero-ETL integration also enables you to load and analyze data from multiple operational database clusters in a new or existing Amazon Redshift instance to derive holistic insights across many applications.

Data Warehouse

Data Warehouse Data Lake Analytics Machine Learning

AWS Glue crawlers support cross-account crawling to support data mesh architecture

AWS Big Data

MARCH 27, 2023

Data lakes have come a long way, and there’s been tremendous innovation in this space. Today’s modern data lakes are cloud native, work with multiple data types, and make this data easily available to diverse stakeholders across the business. In the navigation pane, under Data catalog , choose Settings.

Data Lake

Data Lake Data-driven Management Data Architecture

Estimating Scope 1 Carbon Footprint with Amazon Athena

AWS Big Data

AUGUST 2, 2023

The data architecture diagram below shows an example of how you could use AWS services to calculate and visualize an organization’s estimated carbon footprint. Customers have the flexibility to choose the services in each stage of the data pipeline based on their use case.

Data Lake

Data Lake Measurement Visualization Data Architecture

Power enterprise-grade Data Vaults with Amazon Redshift – Part 1

AWS Big Data

NOVEMBER 16, 2023

Amazon Redshift is a popular cloud data warehouse, offering a fully managed cloud-based service that seamlessly integrates with an organization’s Amazon Simple Storage Service (Amazon S3) data lake, real-time streams, machine learning (ML) workflows, transactional workflows, and much more—all while providing up to 7.9x

Enterprise

Enterprise Data Warehouse Data Lake Optimization

An Introduction to Disaster Recovery with the Cloudera Data Platform

Cloudera

AUGUST 9, 2022

The CDP Disaster Recovery Reference Architecture. Today we announce the official release of the CDP Disaster Recovery Reference Architecture (DRRA). The CDP Disaster Recovery Reference Architecture is available in our public documentation within the CDP Reference Architectures microsite.

Data Lake

Data Lake Data Warehouse Data-driven IoT

Design a data mesh on AWS that reflects the envisioned organization

AWS Big Data

JANUARY 22, 2024

Cost and resource efficiency – This is an area where Acast observed a reduction in data duplication, and therefore cost reduction (in some accounts, removing the copy of data 100%), by reading data across accounts while enabling scaling. In this approach, teams responsible for generating data are referred to as producers.

Data-driven

Data-driven Advertising Metadata Data Architecture

5 Reasons to Use Apache Iceberg on Cloudera Data Platform (CDP)

Cloudera

MARCH 23, 2022

In fact, we recently announced the integration with our cloud ecosystem bringing the benefits of Iceberg to enterprises as they make their journey to the public cloud, and as they adopt more converged architectures like the Lakehouse. 1: Multi-function analytics . The *Any*-house. Flexible and open file formats.

Metadata

Metadata Data Architecture Machine Learning Cost-Benefit

How Data Management and Big Data Analytics Speed Up Business Growth

BizAcuity

APRIL 14, 2022

The comprehensive system which collectively includes generating data, storing the data, aggregating and analyzing the data, the tools, platforms and other softwares involved is referred to as Big Data Ecosystem. There are a wide range of problems that are presented to organizations when working with big data.

Big Data

Big Data Data Analytics Management Unstructured Data

How Cargotec uses metadata replication to enable cross-account data sharing

AWS Big Data

JUNE 7, 2023

Cargotec captures terabytes of IoT telemetry data from their machinery operated by numerous customers across the globe. This data needs to be ingested into a data lake, transformed, and made available for analytics, machine learning (ML), and visualization. mode('overwrite').save(output_path mode('overwrite').save(output_path

Metadata

Metadata Data Lake Machine Learning Big Data

Use Apache Iceberg in your data lake with Amazon S3, AWS Glue, and Snowflake

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

Webinars

Trending Sources

What is a data architect? Skills, salaries, and how to become a data framework master

Webinars

AWS Lake Formation 2022 year in review

Detect, mask, and redact PII data using AWS Glue before loading into Amazon OpenSearch Service

Enhance data security and governance for Amazon Redshift Spectrum with VPC endpoints

Automated data governance with AWS Glue Data Quality, sensitive data detection, and AWS Lake Formation

Why the Data Journey Manifesto?

Exploring real-time streaming for generative AI Applications

How Knowledge Graphs Power Data Mesh and Data Fabric

Simplify operational data processing in data lakes using AWS Glue and Apache Hudi

Demystifying Modern Data Platforms

Query your Iceberg tables in data lake using Amazon Redshift (Preview)

Data science vs data analytics: Unpacking the differences

Visualize data quality scores and metrics generated by AWS Glue Data Quality

Deep dive into the AWS ProServe Hadoop Migration Delivery Kit TCO tool

Generic orchestration framework for data warehousing workloads using Amazon Redshift RSQL

Load data incrementally from transactional data lakes to data warehouses

Convergent Evolution

Insiders Cite The Wondrous Benefits Of Big Data In Fortnite

Architectural patterns for real-time analytics using Amazon Kinesis Data Streams, part 1

Extend your data mesh with Amazon Athena and federated views

Dive deep into AWS Glue 4.0 for Apache Spark

Introducing the AWS ProServe Hadoop Migration Delivery Kit TCO tool

Implement slowly changing dimensions in a data lake using AWS Glue and Delta

Extract data from SAP ERP using AWS Glue and the SAP SDK

The New Normal for FP&A: Data Analytics

Three Trends for Modernizing Analytics and Data Warehousing in 2019

Choosing an open table format for your transactional data lake on AWS

Implement tag-based access control for your data lake and Amazon Redshift data sharing with AWS Lake Formation

Build a multi-Region and highly resilient modern data architecture using AWS Glue and AWS Lake Formation

Automate replication of relational sources into a transactional data lake with Apache Iceberg and AWS Glue

This Structure has Novel Features which are of Considerable Business Interest

How GamesKraft uses Amazon Redshift data sharing to support growing analytics workloads

Simplify access management with Amazon Redshift and AWS Lake Formation for users in an External Identity Provider

Amazon Redshift announcements at AWS re:Invent 2023 to enable analytics on all your data

AWS Glue crawlers support cross-account crawling to support data mesh architecture

Estimating Scope 1 Carbon Footprint with Amazon Athena

Power enterprise-grade Data Vaults with Amazon Redshift – Part 1

An Introduction to Disaster Recovery with the Cloudera Data Platform

Design a data mesh on AWS that reflects the envisioned organization

5 Reasons to Use Apache Iceberg on Cloudera Data Platform (CDP)

How Data Management and Big Data Analytics Speed Up Business Growth

How Cargotec uses metadata replication to enable cross-account data sharing

Stay Connected