Data Lake, Data Quality and Reference

Automated data governance with AWS Glue Data Quality, sensitive data detection, and AWS Lake Formation

AWS Big Data

OCTOBER 10, 2023

Due to the volume, velocity, and variety of data being ingested in data lakes, it can get challenging to develop and maintain policies and procedures to ensure data governance at scale for your data lake. Data confidentiality and data quality are the two essential themes for data governance.

Data Quality

Data Quality Data Governance Data Lake Testing

Measure performance of AWS Glue Data Quality for ETL pipelines

AWS Big Data

MARCH 12, 2024

In recent years, data lakes have become a mainstream architecture, and data quality validation is a critical factor to improve the reusability and consistency of the data. In this post, we provide benchmark results of running increasingly complex data quality rulesets over a predefined test dataset.

Data Quality

Data Quality Measurement Testing Visualization

Visualize data quality scores and metrics generated by AWS Glue Data Quality

AWS Big Data

JUNE 6, 2023

AWS Glue Data Quality allows you to measure and monitor the quality of data in your data repositories. It’s important for business users to be able to see quality scores and metrics to make confident business decisions and debug data quality issues. An AWS Glue crawler crawls the results.

Data Quality

Data Quality Metrics Visualization Dashboards

AWS Glue Data Quality is Generally Available

AWS Big Data

JUNE 6, 2023

We are excited to announce the General Availability of AWS Glue Data Quality. Our journey started by working backward from our customers who create, manage, and operate data lakes and data warehouses for analytics and machine learning. It takes days for data engineers to identify and implement data quality rules.

Data Quality

Data Quality Statistics Data Lake Visualization

Set up advanced rules to validate quality of multiple datasets with AWS Glue Data Quality

AWS Big Data

JUNE 6, 2023

Poor-quality data can lead to incorrect insights, bad decisions, and lost opportunities. AWS Glue Data Quality measures and monitors the quality of your dataset. It supports both data quality at rest and data quality in AWS Glue extract, transform, and load (ETL) pipelines.

Data Quality

Data Quality Data Lake Visualization Data-driven

Data Lakes on Cloud & it’s Usage in Healthcare

BizAcuity

MARCH 29, 2019

Data lakes are centralized repositories that can store all structured and unstructured data at any desired scale. The power of the data lake lies in the fact that it often is a cost-effective way to store data.

Data Lake

Data Lake Unstructured Data Cost-Benefit Data Quality

Data governance in the age of generative AI

AWS Big Data

FEBRUARY 29, 2024

To provide a response that includes the enterprise context, each user prompt needs to be augmented with a combination of insights from structured data from the data warehouse and unstructured data from the enterprise data lake. Implement data privacy policies. Implement data quality by data type and source.

Data Governance

Data Governance Unstructured Data Metadata Data Lake

How Knowledge Graphs Power Data Mesh and Data Fabric

Ontotext

APRIL 10, 2024

Bad data tax is rampant in most organizations. Currently, every organization is blindly chasing the GenAI race, often forgetting that data quality and semantics is one of the fundamentals to achieving AI success. Sadly, data quality is losing to data quantity, resulting in “Infobesity”. “Any

Metadata

Metadata Data Lake Data Warehouse Data Quality

Death by Data Cleansing (and How to Avoid It in 3 Steps)

Dataiku

DECEMBER 30, 2022

In helping organizations around the globe set up and implement their data science and AI strategies, we often hear teams say that they’re waiting to figure out their data first before beginning to generate value with advanced analytics and AI — whether they’re referring to data quality, data silos, or centralization in a data lake.

Data Lake

Data Lake IT Data Quality Data Science

AWS Lake Formation 2022 year in review

AWS Big Data

JANUARY 31, 2023

Data governance is increasingly top-of-mind for customers as they recognize data as one of their most important assets. Effective data governance enables better decision-making by improving data quality, reducing data management costs, and ensuring secure access to data for stakeholders.

Data Lake

Data Lake Data Governance Data Architecture Machine Learning

How gaming companies can use Amazon Redshift Serverless to build scalable analytical applications faster and easier

AWS Big Data

MARCH 7, 2023

Flexible and easy to use – The solutions should provide less restrictive, easy-to-access, and ready-to-use data. A data hub contains data at multiple levels of granularity and is often not integrated. It differs from a data lake by offering data that is pre-validated and standardized, allowing for simpler consumption by users.

Analytics

Analytics Data Warehouse Data Lake Metadata

Automate large-scale data validation using Amazon EMR and Apache Griffin

AWS Big Data

APRIL 4, 2024

Griffin is an open source data quality solution for big data, which supports both batch and streaming mode. In today’s data-driven landscape, where organizations deal with petabytes of data, the need for automated data validation frameworks has become increasingly critical.

Data Quality

Data Quality Data Lake Data Warehouse Data-driven

Introducing the technology behind watsonx.ai, IBM’s AI and data platform for enterprise

IBM Big Data Hub

MAY 9, 2023

Data: the foundation of your foundation model. Data quality matters. An AI model trained on biased or toxic data will naturally tend to produce biased or toxic outputs. When objectionable data is identified, we remove it, retrain the model, and repeat. Data curation is a task that’s never truly finished.

Enterprise

Enterprise Technology Modeling Cost-Benefit

Demystifying Modern Data Platforms

Cloudera

SEPTEMBER 15, 2022

Mark: The first element in the process is the link between the source data and the entry point into the data platform. At Ramsey International (RI), we refer to that layer in the architecture as the foundation, but others call it a staging area, raw zone, or even a source data lake.

Data Lake

Data Lake Data Architecture Data-driven Data Warehouse

Accelerate analytics on Amazon OpenSearch Service with AWS Glue through its native connector

AWS Big Data

DECEMBER 21, 2023

As the volume and complexity of analytics workloads continue to grow, customers are looking for more efficient and cost-effective ways to ingest and analyse data. AWS Glue provides both visual and code-based interfaces to make data integration effortless. For setup instructions, refer to Getting started with Amazon OpenSearch Service.

Analytics

Analytics IT Data Lake Visualization

Data Profiling: What It Is and How to Perfect It

Alation

APRIL 18, 2023

For any data user in an enterprise today, data profiling is a key tool for resolving data quality issues and building new data solutions. In this blog, we’ll cover the definition of data profiling, top use cases, and share important techniques and best practices for data profiling today.

IT

IT Metadata Data Quality Data Governance

Handle UPSERT data operations using open-source Delta Lake and AWS Glue

AWS Big Data

JANUARY 30, 2023

Many customers need an ACID transaction (atomic, consistent, isolated, durable) data lake that can log change data capture (CDC) from operational data sources. There is also demand for merging real-time data into batch data. Delta Lake framework provides these two capabilities.

Insurance

Insurance Data Lake Data-driven Management

An AI Chat Bot Wrote This Blog Post …

DataKitchen

DECEMBER 9, 2022

The goal of DataOps is to help organizations make better use of their data to drive business decisions and improve outcomes. ChatGPT> DataOps is a term that refers to the set of practices and tools that organizations use to improve the quality and speed of data analytics and machine learning.

Machine Learning

Machine Learning Data-driven Optimization Modeling

Power enterprise-grade Data Vaults with Amazon Redshift – Part 2

AWS Big Data

NOVEMBER 16, 2023

Amazon Redshift is a popular cloud data warehouse, offering a fully managed cloud-based service that seamlessly integrates with an organization’s Amazon Simple Storage Service (Amazon S3) data lake, real-time streams, machine learning (ML) workflows, transactional workflows, and much more—all while providing up to 7.9x

Enterprise

Enterprise Data Warehouse Snapshot Cost-Benefit

Convergent Evolution

Peter James Thomas

AUGUST 18, 2018

Convergent Evolution refers to something else. That was the Science, here comes the Technology… A Brief Hydrology of Data Lakes. One of the early promises of a Data Lake approach was that – once all relevant data had been ingested – this would be directly leveraged by Data Scientists to derive insight.

Data Lake

Data Lake Data Warehouse Data mining Statistics

Getting started with AWS Glue Data Quality from the AWS Glue Data Catalog

AWS Big Data

JUNE 6, 2023

You can use AWS Glue to create, run, and monitor data integration and ETL (extract, transform, and load) pipelines and catalog your assets across multiple data stores. Hundreds of thousands of customers use data lakes for analytics and ML to make data-driven business decisions.

Data Quality

Data Quality Data Lake Data-driven Metrics

The New Normal for FP&A: Data Analytics

Jedox

OCTOBER 22, 2020

The term “data analytics” refers to the process of examining datasets to draw conclusions about the information they contain. Data analysis techniques enhance the ability to take raw data and uncover patterns to extract valuable insights from it.

Data Analytics

Data Analytics Analytics Unstructured Data Data mining

5 Ways Data Engineers Can Support Data Governance

Alation

JANUARY 26, 2023

Offer the right tools: Data stewardship is greatly simplified when the right tools are on hand. So ask yourself, does your steward have the software to spot issues with data quality, for example? 2) Always Remember Compliance: There are now many different data privacy and security laws worldwide.

Data Governance

Data Governance Strategy Data Quality Marketing

Governing data in relational databases using Amazon DataZone

AWS Big Data

MAY 7, 2024

It also makes it easier for engineers, data scientists, product managers, analysts, and business users to access data throughout an organization to discover, use, and collaborate to derive data-driven insights. Note that a managed data asset is an asset for which Amazon DataZone can manage permissions.

Metadata

Metadata Data Lake Data Processing Data-driven

How Amazon Devices scaled and optimized real-time demand and supply forecasts using serverless analytics

AWS Big Data

FEBRUARY 1, 2023

With data volumes exhibiting a double-digit percentage growth rate year on year and the COVID pandemic disrupting global logistics in 2021, it became more critical to scale and generate near-real-time data. You can visually create, run, and monitor extract, transform, and load (ETL) pipelines to load data into your data lakes.

Optimization

Optimization Forecasting Data Lake Metadata

What is an open data lakehouse and why you should care?

IBM Big Data Hub

JANUARY 17, 2023

A data lakehouse is an emerging data management architecture that converges data warehouse and data lake capabilities, driven by a need to improve efficiency and obtain critical insights faster. Let’s start with why data lakehouses are becoming increasingly important.

Data Lake

Data Lake Metadata Data Warehouse Data Governance

How AWS helped Altron Group accelerate their vision for optimized customer engagement

AWS Big Data

JULY 13, 2023

Data quality for account and customer data – Altron wanted to enable data quality and data governance best practices. Goals – Lay the foundation for a data platform that can be used in the future by internal and external stakeholders. Basic formatting and readability of the data is standardized here.

Optimization

Optimization B2B Data Quality Sales

In-depth with CDO Christopher Bannocks

Peter James Thomas

AUGUST 29, 2018

I have since run and driven transformation in Reference Data, Master Data, KYC [3], Customer Data, Data Warehousing and more recently Data Lakes and Analytics, constantly building experience and capability in the Data Governance, Quality and data services domains, both inside banks, as a consultant and as a vendor.

Data-driven

Data-driven Cost-Benefit Metadata Technology

How Data Management and Big Data Analytics Speed Up Business Growth

BizAcuity

APRIL 14, 2022

Big Data technology in today’s world. Did you know that the big data and business analytics market is valued at $198.08 Or that the US economy loses up to $3 trillion per year due to poor data quality? quintillion bytes of data which means an average person generates over 1.5 megabytes of data every second?

Big Data

Big Data Data Analytics Management Unstructured Data

Build efficient ETL pipelines with AWS Step Functions distributed map and redrive feature

AWS Big Data

DECEMBER 18, 2023

Solution overview: One of the common functionalities involved in data pipelines is extracting data from multiple data sources and exporting it to a data lake or synchronizing the data to another database. For more information, refer to IAM Policies for invoking AWS Glue job from Step Functions.

Metadata

Metadata Visualization Data Lake Data-driven

Building Robust Data Pipelines: 9 Fundamentals and Best Practices to Follow

Alation

MAY 16, 2023

Machine Learning: Data pipelines feed all the necessary data into machine learning algorithms, thereby making this branch of Artificial Intelligence (AI) possible. Data Quality: When using a data pipeline, data consistency, quality, and reliability are often greatly improved.

Data Lake

Data Lake Data Governance Data Warehouse Data Processing

Create an end-to-end data strategy for Customer 360 on AWS

AWS Big Data

MARCH 26, 2024

A Gartner Marketing survey found only 14% of organizations have successfully implemented a C360 solution, due to lack of consensus on what a 360-degree view means, challenges with data quality, and lack of cross-functional governance structure for customer data.

Data Strategy

Data Strategy Strategy Data Warehouse Prescriptive Analytics

AWS Glue streaming application to process Amazon MSK data using AWS Glue Schema Registry

AWS Big Data

JUNE 12, 2023

This schema serves as a single source of truth for producer and consumer, and you can leverage the schema evolution feature of AWS Glue Schema Registry to keep it consistent as the data changes over time. Refer to the appendix section for more information on this feature. Refer to the first stack’s output.

Management

Management Metadata Testing Internet of Things

The Data Scientist’s Guide to the Data Catalog

Alation

JULY 19, 2022

Modern data catalogs also facilitate data quality checks. Historically restricted to the purview of data engineers, data quality information is essential for all user groups to see. Cataloging data science projects in this way is critical to helping them generate value for the company.

Metadata

Metadata Data Quality Statistics Data Science

Use fuzzy string matching to approximate duplicate records in Amazon Redshift

AWS Big Data

FEBRUARY 8, 2023

It’s common to ingest multiple data sources into Amazon Redshift to perform analytics. Often, each data source will have its own processes of creating and maintaining data, which can lead to data quality challenges within and across sources. Answering questions as simple as “How many unique customers do we have?”

Data Quality

Data Quality Testing Data Warehouse Unstructured Data

A Day in the Life of a DataOps Engineer

DataKitchen

OCTOBER 11, 2021

Figure 2: Example data pipeline with DataOps automation. In this project, I automated data extraction from SFTP, the public websites, and the email attachments. The automated orchestration published the data to an AWS S3 Data Lake. Historic Balance – compares current data to previous or expected values.

Testing

Testing Metadata Dashboards Statistics

What is Business Intelligence Consulting

BizAcuity

APRIL 1, 2023

Several large organizations have faltered on different stages of BI implementation, from poor data quality to the inability to scale due to larger volumes of data and extremely complex BI architecture. Data governance and security measures are critical components of data strategy. What is Business Intelligence?

Business Intelligence

Business Intelligence Consulting KPI Data Warehouse

What is Business Intelligence Consulting

BizAcuity

JANUARY 31, 2023

Several large organizations have faltered on different stages of BI implementation, from poor data quality to the inability to scale due to larger volumes of data and extremely complex BI architecture. Data governance and security measures are critical components of data strategy. What is Business Intelligence?

Business Intelligence

Business Intelligence Consulting KPI Data Warehouse

Configure end-to-end data pipelines with Etleap, Amazon Redshift, and dbt

AWS Big Data

JULY 12, 2023

Amazon Redshift helps you break down the data silos and allows you to run unified, self-service, real-time, and predictive analytics on all data across your operational databases, data lake, data warehouse, and third-party datasets with built-in governance. This is often a laborious and error-prone process.

Data Warehouse

Data Warehouse Modeling Dashboards Data Lake

Your 5-Step Journey from Analytics to AI

CIO Business Intelligence

MARCH 22, 2022

Which type(s) of storage consolidation you use depends on the data you generate and collect. One option is a data lake—on-premises or in the cloud—that stores unprocessed data in any type of format, structured or unstructured, and can be queried in aggregate. Just starting out with analytics?

Analytics

Analytics Key Performance Indicator Data Warehouse Data-driven

The Audience for Data Catalogs and Data Intelligence

Alation

JUNE 21, 2022

Why start with a data source and build a visualization, if you can just find a visualization that already exists, complete with metadata about it? Data scientists went beyond database tables to data lakes and cloud data stores. Data scientists want to catalog not just information sources, but models.

Metadata

Metadata Data Quality Visualization Data Lake

Clean up your Excel and CSV files without writing code using AWS Glue DataBrew

AWS Big Data

NOVEMBER 15, 2023

As the organization receives data from multiple external vendors, it often arrives in different formats, typically Excel or CSV files, with each vendor using their own unique data layout and structure. DataBrew is an excellent tool for data quality and preprocessing.

Metadata

Metadata Sales Data Lake Big Data

A Simple Data Capability Framework

Peter James Thomas

MAY 3, 2019

Control of Data to ensure it is Fit-for-Purpose. This refers to a wide range of activities from Data Governance to Data Management to Data Quality improvement and indeed related concepts such as Master Data Management. Best practice has evolved in this area.

Strategy

Strategy Data Architecture Data Quality Data Strategy
