Data Lake, Measurement and Testing

Monitor data pipelines in a serverless data lake

AWS Big Data

AUGUST 9, 2023

Combining a data lake with a serverless paradigm brings significant cost and performance benefits. By monitoring application logs, you can gain insights into job execution and troubleshoot issues promptly to ensure the overall health and reliability of data pipelines.

Data Lake

Data Lake Metrics Testing Cost-Benefit

Measure performance of AWS Glue Data Quality for ETL pipelines

AWS Big Data

MARCH 12, 2024

In recent years, data lakes have become a mainstream architecture, and data quality validation is a critical factor to improve the reusability and consistency of the data. In this post, we provide benchmark results of running increasingly complex data quality rulesets over a predefined test dataset.

Data Quality

Data Quality Measurement Testing Visualization

Implement slowly changing dimensions in a data lake using AWS Glue and Delta

AWS Big Data

MARCH 28, 2023

In a data warehouse, a dimension is a structure that categorizes facts and measures in order to enable users to answer business questions. As organizations across the globe are modernizing their data platforms with data lakes on Amazon Simple Storage Service (Amazon S3), handling slowly changing dimensions (SCDs) in data lakes can be challenging.

Data Lake

Data Lake Testing Snapshot Sales

Eight Top DataOps Trends for 2022

DataKitchen

NOVEMBER 29, 2021

In 2022, data organizations will institute robust automated processes around their AI systems to make them more accountable to stakeholders. Model developers will test for AI bias as part of their pre-deployment testing. Quality test suites will enforce “equity,” like any other performance metric.

Testing

Testing Data Lake Data Architecture Manufacturing

DataOps Observability: Taming the Chaos (Part 3)

DataKitchen

NOVEMBER 18, 2022

As he thinks through the various journeys that data take in his company, Jason sees that his dashboard idea would require extracting or testing for events along the way. So, the only way for a data journey to truly observe what’s happening is to get his tools and pipelines to auto-report events. Data and tool tests.

Testing

Testing Statistics Measurement Metrics

Apache Iceberg optimization: Solving the small files problem in Amazon EMR

AWS Big Data

OCTOBER 3, 2023

In our previous post, Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes, we discussed how you can implement solutions to improve operational efficiencies of your Amazon Simple Storage Service (Amazon S3) data lake that is using the Apache Iceberg open table format and running on the Amazon EMR big data platform.

Optimization

Optimization Snapshot Data Lake Metadata

Why the Data Journey Manifesto?

DataKitchen

JUNE 12, 2023

We had been talking about “Agile Analytic Operations,” “DevOps for Data Teams,” and “Lean Manufacturing For Data,” but the concept was hard to get across and communicate. I spent much time de-categorizing DataOps: we are not discussing ETL, Data Lake, or Data Science.

Testing

Testing Data Lake Dashboards Data Science

DataOps For Business Analytics Teams

DataKitchen

JANUARY 3, 2022

There’s a recent trend toward people creating data lake or data warehouse patterns and calling it data enablement or a data hub. DataOps expands upon this approach by focusing on the processes and workflows that create data enablement and business analytics. DataOps Process Hub.

Business Analytics

Business Analytics Analytics Testing Dashboards

A Day in the Life of a DataOps Engineer

DataKitchen

OCTOBER 11, 2021

The data engineer then emails the BI Team, who refreshes a Tableau dashboard. Figure 1: Example data pipeline with manual processes. There are no automated tests, so errors frequently pass through the pipeline. Finally, when your implementation is complete, you can track and measure your process.

Testing

Testing Metadata Dashboards Statistics

Migrate a petabyte-scale data warehouse from Actian Vectorwise to Amazon Redshift

AWS Big Data

MAY 30, 2024

Amazon Redshift is a fast, scalable, and fully managed cloud data warehouse that allows you to process and run your complex SQL analytics workloads on structured and semi-structured data. It also helps you securely access your data in operational databases, data lakes, or third-party datasets with minimal movement or copying of data.

Data Warehouse

Data Warehouse Data Lake Cost-Benefit Structured Data

MLOps and DevOps: Why Data Makes It Different

O'Reilly on Data

OCTOBER 19, 2021

We need robust versioning for data, models, code, and preferably even the internal state of applications—think Git on steroids to answer inevitable questions: What changed? The applications must be integrated to the surrounding business systems so ideas can be tested and validated in the real world in a controlled manner.

IT

IT Testing Experimentation Software

Interview with: Sankar Narayanan, Chief Practice Officer at Fractal Analytics

Corinium

JUNE 6, 2019

Some of the work is very foundational, such as building an enterprise data lake and migrating it to the cloud, which enables other more direct value-added activities such as self-service. It is also important to have a strong test and learn culture to encourage rapid experimentation. Incorporate these into subsequent releases.

Insurance

Insurance Analytics Forecasting Deep Learning

How GamesKraft uses Amazon Redshift data sharing to support growing analytics workloads

AWS Big Data

NOVEMBER 13, 2023

Amazon Redshift is a fully managed data warehousing service that offers both provisioned and serverless options, making it more efficient to run and scale analytics without having to manage your data warehouse. Additionally, data related to product, marketing, and customer experience is extracted from vendor APIs.

Data Warehouse

Data Warehouse Data Lake Analytics Data Science

How Gilead used Amazon Redshift to quickly and cost-effectively load third-party medical claims data

AWS Big Data

NOVEMBER 8, 2023

Amazon Redshift Serverless is a fully managed cloud data warehouse that allows you to seamlessly create your data warehouse with no infrastructure management required. Redshift Serverless measures data warehouse capacity in Redshift Processing Units (RPUs), which are part of the compute resources.

Data Lake

Data Lake Data Warehouse Cost-Benefit Optimization

Successfully conduct a proof of concept in Amazon Redshift

AWS Big Data

MARCH 27, 2024

Amazon Redshift is a fast, scalable, and fully managed cloud data warehouse that allows you to process and run your complex SQL analytics workloads on structured and semi-structured data. It also helps you securely access your data in operational databases, data lakes, or third-party datasets with minimal movement or copying of data.

Testing

Testing Data Warehouse Metrics Cost-Benefit

Why Can’t we Advance Healthcare and Life Sciences this Fast all the time?

Cloudera

APRIL 4, 2022

Numerous factors helped accelerate the vaccine roll-out, including prior research, genome sequencing, jumping the FDA approval queue, and a plethora of testing volunteers. The Impact of Data and Analytics. The use of data lakes and automation is helping facilitate data sharing and collaboration across the healthcare ecosystem.

Data Lake

Data Lake Digital Transformation Manufacturing Sales

How Tricentis unlocks insights across the software development lifecycle at speed and scale using Amazon Redshift

AWS Big Data

MARCH 3, 2023

Tricentis is the global leader in continuous testing for DevOps, cloud, and enterprise applications. Speed changes everything, and continuous testing across the entire CI/CD lifecycle is the key. Tricentis instills that confidence by providing software tools that enable Agile Continuous Testing (ACT) at scale.

Software

Software Data Lake Testing Cost-Benefit

10 Things AWS Can Do for Your SaaS Company

Smart Data Collective

FEBRUARY 20, 2022

Data storage databases. Your SaaS company can store and protect any amount of data using Amazon Simple Storage Service (S3), which is ideal for data lakes, cloud-native applications, and mobile apps. Well, let’s find out. Artificial intelligence (AI). You can therefore trust its reliability.

Cost-Benefit

Cost-Benefit Data Lake Software Machine Learning

How the Masters uses watsonx to manage its AI lifecycle

IBM Big Data Hub

APRIL 9, 2024

This allows the Masters to scale analytics and AI wherever their data resides, through open formats and integration with existing databases and tools. “Hole distances and pin positions vary from round to round and year to year; these factors are important as we stage the data.”

Management

Management IT Machine Learning Metrics

Accomplish Agile Business Intelligence & Analytics For Your Business

datapine

APRIL 15, 2020

Your Chance: Want to test an agile business intelligence solution? Business intelligence is moving away from the traditional engineering model: analysis, design, construction, testing, and implementation. Test BI in a small group and deploy the software internally. Finalize testing. Without further ado, let’s begin.

Business Intelligence

Business Intelligence Analytics Testing Dashboards

Automate the archive and purge data process for Amazon RDS for PostgreSQL using pg_partman, Amazon S3, and AWS Glue

AWS Big Data

AUGUST 22, 2023

The features of AWS Glue, which include a scheduler for automating tasks, code generation for ETL (extract, transform, and load) processes, notebook integration for interactive development and debugging, as well as robust security and compliance measures, make it a convenient and cost-effective solution for archival and restoration needs.

Data Processing

Data Processing Testing Data Lake Data Integration

Dive deep into AWS Glue 4.0 for Apache Spark

AWS Big Data

MAY 18, 2023

You can discover and connect to over 70 diverse data sources, manage your data in a centralized data catalog, and create, run, and monitor data integration pipelines to load data into your data lakes and your data warehouses. Refer to Develop and test AWS Glue version 3.0 runtime ( 3.5

Testing

Testing Data Lake Cost-Benefit Data Integration

Modernizing the Data Warehouse: Challenges and Benefits

BI-Survey

AUGUST 21, 2020

Advanced analytics and new ways of working with data also create new requirements that surpass the traditional concepts. Many companies are therefore forced to put these concepts to the test. But what are the right measures to make the data warehouse and BI fit for the future?

Data Warehouse

Data Warehouse Data Lake Data Governance Data Architecture

NJ Transit creates ‘data engine’ to fuel transformation

CIO Business Intelligence

SEPTEMBER 12, 2022

Collectively, the agencies also have pilots up and running to test electric buses and IoT sensors scattered throughout the transportation system. We’re just making sure the company is aware that this capability does exist, and that we are ready to build whatever business challenge or business problem that’s been identified using data.

Data Warehouse

Data Warehouse Predictive Analytics Data Lake IoT

Run Spark SQL on Amazon Athena Spark

AWS Big Data

OCTOBER 23, 2023

Modern applications store massive amounts of data on Amazon Simple Storage Service (Amazon S3) data lakes, providing cost-effective and highly durable storage, and allowing you to run analytics and machine learning (ML) from your data lake to generate insights on your data.

Data Lake

Data Lake Visualization Optimization Interactive

Prevent Customer Churn: Customer Retention in the Transition to Microsoft D365 F&SCM

Jet Global

JANUARY 15, 2021

You might measure those costs in different ways, including actual dollars and cents, staff time, added complexity, and risk. Most of those things are not about direct monetary costs; they are less tangible and measurable, but nonetheless very important. In other words, switching costs are not just about money.

Cost-Benefit

Cost-Benefit Data Lake Reporting OLAP

Has the Data Warehouse Had Its Day?

BI-Survey

JANUARY 15, 2023

In general, central data & analytics teams determine the data architecture for analytical data, decoupled from the landscape of operational data sources. Indeed, this is what the data warehouse, data lake and data lakehouse have in common, regardless of the differences in their detail.

Data Warehouse

Data Warehouse IT Data Architecture Measurement

Visualize data quality scores and metrics generated by AWS Glue Data Quality

AWS Big Data

JUNE 6, 2023

AWS Glue Data Quality allows you to measure and monitor the quality of data in your data repositories. It’s important for business users to be able to see quality scores and metrics to make confident business decisions and debug data quality issues. Avik Bhattacharjee is a Senior Partner Solutions Architect at AWS.

Data Quality

Data Quality Metrics Visualization Dashboards

Planning Your Migration to Microsoft D365 F&SCM

Jet Global

JANUARY 18, 2021

Perhaps more importantly, it provides an opportunity for the organization to implement measures in advance that can reduce risk, lower costs, and improve the end result. In a separate blog post, we discussed the potential for using a data warehouse as a means for automating data extraction and transformation in advance of system migration.

Data Lake

Data Lake Reporting Cost-Benefit Finance

Power enterprise-grade Data Vaults with Amazon Redshift – Part 2

AWS Big Data

NOVEMBER 16, 2023

Amazon Redshift is a popular cloud data warehouse, offering a fully managed cloud-based service that seamlessly integrates with an organization’s Amazon Simple Storage Service (Amazon S3) data lake, real-time streams, machine learning (ML) workflows, transactional workflows, and much more—all while providing up to 7.9x

Enterprise

Enterprise Data Warehouse Snapshot Cost-Benefit

Data Governance for Dummies: Your Questions, Answered

Alation

FEBRUARY 17, 2023

The firms that get data governance and management “right” bring people together and leverage a set of capabilities: (1) Agile; (2) Six sigma; (3) data science; and (4) project management tools. The overall program should set a 2-year vision, mission, and goals, and then focus on execution, measuring progress along the way.

Data Governance

Data Governance Data Quality Metadata Cost-Benefit

The Very Group adopts a data catalog to better organize and leverage its online retail capabilities

CIO Business Intelligence

SEPTEMBER 6, 2022

It’ being everything from how they collect and measure data, to how they understand it and their own glossary. Cataloging your data is more important than ever for many companies, with so many technology options, different data silos, enterprise warehousing, lake houses, data lakes, and all those types of capabilities,” says Pimblett.

IT

IT Forecasting Data Lake Enterprise

CIOs press ahead for gen AI edge — despite misgivings

CIO Business Intelligence

OCTOBER 18, 2023

Gen AI boom in the making: Many early and established forays into generative AI are being developed on the AI platforms of cloud leaders Microsoft, Google, and Amazon, reportedly with numerous guardrails and governance measures in place to contain unrestricted exploration. First, we launched a private instance of GPT-3.5

Risk

Risk Manufacturing Enterprise Technology

Automate alerting and reporting for AWS Glue job resource usage

AWS Big Data

MAY 25, 2023

Many organizations today are using AWS Glue to build ETL pipelines that bring data from disparate sources and store the data in repositories like a data lake, database, or data warehouse for further consumption.

Reporting

Reporting Metrics Optimization Data Lake

Use fuzzy string matching to approximate duplicate records in Amazon Redshift

AWS Big Data

FEBRUARY 8, 2023

This approach doesn’t solve for data quality issues in source systems, and doesn’t remove the need to have a holistic data quality strategy. For addressing data quality challenges in Amazon Simple Storage Service (Amazon S3) data lakes and data pipelines, AWS has announced AWS Glue Data Quality (preview).

Data Quality

Data Quality Testing Data Warehouse Unstructured Data

Of Muffins and Machine Learning Models

Cloudera

FEBRUARY 16, 2022

blueberry spacing) is a measure of the model’s interpretability. In the case of CDP Public Cloud, this includes virtual networking constructs and the data lake as provided by a combination of a Cloudera Shared Data Experience (SDX) and the underlying cloud storage. The complete list is shown below: Model Lineage .

Machine Learning

Machine Learning Modeling Metadata Recreation/Entertainment

Migrate data from Google Cloud Storage to Amazon S3 using AWS Glue

AWS Big Data

JULY 19, 2023

We are now able to import data from Google Cloud Storage to Amazon S3. Scaling considerations: In this example, we set the AWS Glue capacity to 10 DPUs (data processing units). A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory.
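As a rough illustration of the DPU figures quoted above, the sketch below multiplies out the per-DPU resources for a given capacity setting. The function name `dpu_resources` is hypothetical and is not part of any AWS SDK; the only inputs are the 4 vCPU and 16 GB per-DPU values stated in the excerpt.

```python
# Per-DPU figures taken from the excerpt: 4 vCPUs and 16 GB of memory.
VCPUS_PER_DPU = 4
MEMORY_GB_PER_DPU = 16


def dpu_resources(dpus: int) -> dict:
    """Return the total vCPUs and memory (GB) for a number of DPUs.

    Hypothetical helper for illustration only; not an AWS API.
    """
    if dpus < 0:
        raise ValueError("dpus must be non-negative")
    return {
        "vcpus": dpus * VCPUS_PER_DPU,
        "memory_gb": dpus * MEMORY_GB_PER_DPU,
    }


if __name__ == "__main__":
    # The 10 DPU capacity used in the example above.
    print(dpu_resources(10))  # {'vcpus': 40, 'memory_gb': 160}
```

Under these figures, the 10 DPU setting in the example corresponds to 40 vCPUs and 160 GB of memory in total.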

Big Data

Big Data Software Consulting Unstructured Data

Quantitative and Qualitative Data: A Vital Combination

Sisense

OCTOBER 6, 2020

Most commonly, we think of data as numbers that show information such as sales figures, marketing data, payroll totals, financial statistics, and other data that can be counted and measured objectively. This is quantitative data. It’s “hard,” structured data that answers questions such as “how many?”

Statistics

Statistics Unstructured Data Data-driven Visualization

Turnkey Cloud DataOps: Solution from Alation and Accenture

Alation

MARCH 22, 2022

DataOps requires an array of technology to automate the design, development, deployment, and management of data delivery, with governance sprinkled on for good measure. This produces end-to-end lineage so business and technology users alike can understand the state of a data lake and/or lake house.

Metadata

Metadata Cost-Benefit Data Quality Data Lake

Configure monitoring, limits, and alarms in Amazon Redshift Serverless to keep costs predictable

AWS Big Data

JULY 25, 2023

It automatically provisions and intelligently scales data warehouse compute capacity to deliver fast performance, and you pay only for what you use. Just load your data and start querying right away in the Amazon Redshift Query Editor or in your favorite business intelligence (BI) tool. Ashish has over 24 years of experience in IT.

Metrics

Metrics Data Warehouse Dashboards Snapshot

Improving Multi-tenancy with Virtual Private Clusters

Cloudera

JUNE 6, 2019

While this approach provides isolation, it creates another significant challenge: duplication of data, metadata, and security policies, or ‘split-brain’ data lake. Now the admins need to synchronize multiple copies of the data and metadata and ensure that users across the many clusters are not viewing stale information.

Metadata

Metadata Data Lake Optimization Strategy

The Cloud Connection: How Governance Supports Security

Alation

APRIL 14, 2022

A useful feature for exposing patterns in the data. Supports the ability to interact with the actual data and perform analysis on it. Automatic sampling to test transformation. Similar to a data warehouse schema, this prep tool automates the development of the recipe to match. Record-keeping in a data catalog is key.

Metadata

Metadata Data Governance Modeling Data-driven

Showpad accelerates data maturity to unlock innovation using Amazon QuickSight

AWS Big Data

APRIL 5, 2023

Showpad also struggled with data quality issues in terms of consistency, ownership, and insufficient data access across its targeted user base due to a complex BI access process, licensing challenges, and insufficient education. The entire data team of 20 people was “all hands on deck” for the project.

Dashboards

Dashboards Reporting Cost-Benefit Visualization

The year’s top 10 enterprise AI trends — so far

CIO Business Intelligence

SEPTEMBER 21, 2023

One thing buyers have to be careful about is the security measures vendors put in place. “What happens if you upload your data to these AIs?” To make all this possible, the data had to be collected, processed, and fed into the systems that needed it in a reliable, efficient, scalable, and secure way.

Enterprise

Enterprise Consulting Modeling Cost-Benefit
