Big Data, Data Lake and Statistics

Use Apache Iceberg in your data lake with Amazon S3, AWS Glue, and Snowflake

AWS Big Data

APRIL 3, 2024

licensed, 100% open-source data table format that helps simplify data processing on large datasets stored in data lakes. Data engineers use Apache Iceberg because it’s fast, efficient, and reliable at any scale and keeps records of how datasets change over time.

Data Lake

Data Lake Snapshot Metadata Data Architecture

Enhance query performance using AWS Glue Data Catalog column-level statistics

AWS Big Data

NOVEMBER 22, 2023

Today, we’re making available a new capability of AWS Glue Data Catalog that allows generating column-level statistics for AWS Glue tables. These statistics are now integrated with the cost-based optimizers (CBO) of Amazon Athena and Amazon Redshift Spectrum, resulting in improved query performance and potential cost savings.
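As a rough illustration of why column-level statistics matter to a cost-based optimizer, the sketch below estimates result sizes from a row count and a number of distinct values. It is a textbook-style approximation that assumes uniformly distributed values, not the actual logic of Athena or Redshift Spectrum; `ColumnStats`, `estimate_equality_rows`, and `estimate_join_rows` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical container for statistics on one column of a table."""
    row_count: int        # total rows in the table
    distinct_values: int  # number of distinct non-null values in the column
    null_count: int = 0   # rows where the column is NULL


def estimate_equality_rows(stats: ColumnStats) -> float:
    """Estimate rows matching `col = constant`, assuming uniform values."""
    if stats.distinct_values == 0:
        return 0.0
    return (stats.row_count - stats.null_count) / stats.distinct_values


def estimate_join_rows(left: ColumnStats, right: ColumnStats) -> float:
    """Classic equi-join estimate: |L| * |R| / max(ndv_L, ndv_R)."""
    ndv = max(left.distinct_values, right.distinct_values)
    if ndv == 0:
        return 0.0
    left_rows = left.row_count - left.null_count
    right_rows = right.row_count - right.null_count
    return left_rows * right_rows / ndv


# Made-up statistics for two tables joined on a customer id column.
orders_customer_id = ColumnStats(row_count=1_000_000, distinct_values=50_000)
customers_id = ColumnStats(row_count=50_000, distinct_values=50_000)

filter_estimate = estimate_equality_rows(orders_customer_id)
join_estimate = estimate_join_rows(orders_customer_id, customers_id)
print(filter_estimate, join_estimate)
```

With estimates like these, an optimizer can compare candidate plans (for example, which table to filter first or which join order to use) before reading any data.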

Statistics

Statistics Data Lake Optimization Data-driven

What is a data architect? Skills, salaries, and how to become a data framework master

CIO Business Intelligence

OCTOBER 13, 2023

Data architect Armando Vázquez identifies eight common types of data architects: Enterprise data architect: These data architects oversee an organization’s overall data architecture, defining data architecture strategy and designing and implementing architectures.

Data Architecture

Data Architecture Data Warehouse Statistics Visualization

Build a pseudonymization service on AWS to protect sensitive data: Part 2

AWS Big Data

MARCH 6, 2024

Amazon EMR empowers you to create, operate, and scale big data frameworks such as Apache Spark quickly and cost-effectively. For an overview of how to build an ACID compliant data lake using Iceberg, refer to Build a high-performance, ACID compliant, evolving data lake using Apache Iceberg on Amazon EMR.

Metrics

Metrics Statistics Testing Data Lake

Speed up queries with the cost-based optimizer in Amazon Athena

AWS Big Data

NOVEMBER 17, 2023

Athena provides a simplified, flexible way to analyze petabytes of data where it lives. You can analyze data or build applications from an Amazon Simple Storage Service (Amazon S3) data lake and 30 data sources, including on-premises data sources or other cloud systems using SQL or Python.

Optimization

Optimization Statistics Metadata Data Lake

Enhance monitoring and debugging for AWS Glue jobs using new job observability metrics

AWS Big Data

NOVEMBER 20, 2023

Use case A typical workload for AWS Glue for Apache Spark jobs is to load data from a relational database to a data lake with SQL-based transformations. On the Graphed metrics tab, configure your preferred statistic, period, and so on. When the example job ran, the workerUtilization metrics showed the following trend.

Metrics

Metrics Data Lake Cost-Benefit Dashboards

AWS Glue Data Quality is Generally Available

AWS Big Data

JUNE 6, 2023

We are excited to announce the General Availability of AWS Glue Data Quality. Our journey started by working backward from our customers who create, manage, and operate data lakes and data warehouses for analytics and machine learning. It takes days for data engineers to identify and implement data quality rules.

Data Quality

Data Quality Statistics Data Lake Visualization

Real estate CIOs drive deals with data

CIO Business Intelligence

JULY 26, 2023

“We’ve been able to create some models that will analyze things like the listing comments and descriptions and tell you which properties are waterfront or not,” Wilhemy says, adding that such data gives its agents a competitive advantage by enabling them to reach out to a selective set of potential buyers first.

Data Lake

Data Lake Digital Transformation Machine Learning Data Architecture

Data science vs data analytics: Unpacking the differences

IBM Big Data Hub

SEPTEMBER 19, 2023

Though you may encounter the terms “data science” and “data analytics” being used interchangeably in conversations or online, they refer to two distinctly different concepts. Meanwhile, data analytics is the act of examining datasets to extract value and find answers to specific questions.

Data Science

Data Science Data Analytics Prescriptive Analytics Analytics

Top 8 predictive analytics tools compared

CIO Business Intelligence

MAY 12, 2022

The tools include sophisticated pipelines for gathering data from across the enterprise, add layers of statistical analysis and machine learning to make projections about the future, and distill these insights into useful summaries so that business users can act on them. On premises or in SAP cloud. Per user, per month. Free tier.

Predictive Analytics

Predictive Analytics Analytics Statistics Machine Learning

Convergent Evolution

Peter James Thomas

AUGUST 18, 2018

That was the Science, here comes the Technology… A Brief Hydrology of Data Lakes. Even back then, these were used for activities such as Analytics, Dashboards, Statistical Modelling, Data Mining and Advanced Visualisation. This is the essence of Convergent Evolution.

Data Lake

Data Lake Data Warehouse Data mining Statistics

How the Masters uses watsonx to manage its AI lifecycle

IBM Big Data Hub

APRIL 9, 2024

This allows the Masters to scale analytics and AI wherever their data resides, through open formats and integration with existing databases and tools. “Hole distances and pin positions vary from round to round and year to year; these factors are important as we stage the data.”

Management

Management IT Machine Learning Metrics

Query your Iceberg tables in data lake using Amazon Redshift (Preview)

AWS Big Data

AUGUST 31, 2023

Amazon Redshift enables you to directly access data stored in Amazon Simple Storage Service (Amazon S3) using SQL queries and join data across your data warehouse and data lake. With Amazon Redshift, you can query the data in your S3 data lake using a central AWS Glue metastore from your Redshift data warehouse.

Data Lake

Data Lake Data Warehouse Metadata Data Architecture

Introducing the AWS ProServe Hadoop Migration Delivery Kit TCO tool

AWS Big Data

FEBRUARY 6, 2023

Use case overview Migrating Hadoop workloads to Amazon EMR accelerates big data analytics modernization, increases productivity, and reduces operational cost. Refactoring coupled compute and storage to a decoupling architecture is a modern data solution. Jiseong Kim is a Senior Data Architect at AWS ProServe.

Cost-Benefit

Cost-Benefit Data Lake Dashboards Big Data

Simplify operational data processing in data lakes using AWS Glue and Apache Hudi

AWS Big Data

SEPTEMBER 13, 2023

A modern data architecture is an evolutionary architecture pattern designed to integrate a data lake, data warehouse, and purpose-built stores with a unified governance model. The company wanted the ability to continue processing operational data in the secondary Region in the rare event of primary Region failure.

Data Lake

Data Lake Data Processing Metadata Snapshot

Decoding Data Analyst Job Description: Skills, Tools, and Career Paths

FineReport

MARCH 24, 2024

Data analysts contribute value to organizations by uncovering trends, patterns, and insights through data gathering, cleaning, and statistical analysis. They identify and interpret trends in complex datasets, optimize statistical results, and maintain databases while devising new data collection processes.

Statistics

Statistics Data mining Visualization Reporting

Deep dive into the AWS ProServe Hadoop Migration Delivery Kit TCO tool

AWS Big Data

FEBRUARY 6, 2023

Based on the statistics of individual and aggregated application runs per queue and per user, you can determine the existing workload distribution by user. He has a specialty in big data services and technologies and an interest in building customer business outcomes together. You can find peak and off-peak hours in a day.

Dashboards

Dashboards Optimization Data Lake Cost-Benefit

Measure performance of AWS Glue Data Quality for ETL pipelines

AWS Big Data

MARCH 12, 2024

In recent years, data lakes have become a mainstream architecture, and data quality validation is a critical factor to improve the reusability and consistency of the data. Gonzalo Herreros is a Senior Big Data Architect on the AWS Glue team.

Data Quality

Data Quality Measurement Testing Visualization

Quantitative and Qualitative Data: A Vital Combination

Sisense

OCTOBER 6, 2020

Let’s consider the differences between the two, and why they’re both important to the success of data-driven organizations. Digging into quantitative data. This is quantitative data. It’s “hard,” structured data that answers questions such as “how many?” Qualitative data benefits: Unlocking understanding.

Statistics

Statistics Unstructured Data Data-driven Visualization

Visualize data quality scores and metrics generated by AWS Glue Data Quality

AWS Big Data

JUNE 6, 2023

The purpose of this step is to understand our data quality statistics at the table level as well as at the ruleset level. Use the queries in this section to analyze your data quality metrics and create an Athena view to use to build a QuickSight dashboard in the next step.

Data Quality

Data Quality Metrics Visualization Dashboards

Choosing an open table format for your transactional data lake on AWS

AWS Big Data

JUNE 9, 2023

A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale.

Data Lake

Data Lake Metadata Optimization Statistics

Breaking down Business Intelligence

BizAcuity

MAY 16, 2022

He went on to be the head brewer of Guinness and we thank him for not just great hand-crafted beers but subsequent research breakthroughs in statistical research as well. Data allowed Guinness to hold their market dominance for a long time. Data mining. That was in the 1900s.

Business Intelligence

Business Intelligence Data mining Visualization Data Lake

Get started with AWS Glue Data Quality dynamic rules for ETL pipelines

AWS Big Data

MAY 23, 2024

Solution overview Let’s consider an example data quality pipeline where a data engineer ingests data from a raw zone and loads it into a curated zone in a data lake. By catching errors upfront, organizations can ingest cleaner data and take advantage of advanced data quality capabilities.

Data Quality

Data Quality Metrics Data Lake Sales

Data for All: Empowering Users With AI, ML, and Analytics

Sisense

JUNE 12, 2019

our annual client conference, I gave a presentation that took a deep dive into artificial intelligence and subgroups including AI, ML, and statistics. Living in a World of Big Data. It all starts with the data. Data literacy and data skills, which created the forgotten dark data lakes in the first place, are still scarce.

Analytics

Analytics Data-driven Dashboards IoT

Unlock The Power of Your Data With These 19 Big Data & Data Analytics Books

datapine

AUGUST 29, 2022

The saying “knowledge is power” has never been more relevant, thanks to the widespread commercial use of big data and data analytics. The rate at which data is generated has increased exponentially in recent years. Essential Big Data And Data Analytics Insights. million searches per day and 1.2

Big Data

Big Data Data Analytics Analytics Data mining

Successfully conduct a proof of concept in Amazon Redshift

AWS Big Data

MARCH 27, 2024

Amazon Redshift is a fast, scalable, and fully managed cloud data warehouse that allows you to process and run your complex SQL analytics workloads on structured and semi-structured data. It also helps you securely access your data in operational databases, data lakes, or third-party datasets with minimal movement or copying of data.

Testing

Testing Data Warehouse Metrics Cost-Benefit

Automate replication of relational sources into a transactional data lake with Apache Iceberg and AWS Glue

AWS Big Data

FEBRUARY 14, 2023

Organizations have chosen to build data lakes on top of Amazon Simple Storage Service (Amazon S3) for many years. A data lake is the most popular choice for organizations to store all their organizational data generated by different teams, across business domains, from all different formats, and even over history.

Data Lake

Data Lake Statistics Data Architecture Finance

2021 Gift Giving Guide for Data Nerds

DataKitchen

DECEMBER 7, 2021

Fail Fast, Learn Faster: Lessons in Data-Driven Leadership in an Age of Disruption, Big Data, and AI, by Randy Bean. This book is not available until January 2022, but considering all the hype around the data mesh, we expect it to be a best seller. A distributed data mesh is a better choice. How did we get here?

Data-driven

Data-driven Data Governance Big Data Data Science

New Thinking, Old Thinking and a Fairytale

Peter James Thomas

JUNE 20, 2019

Of course it can be argued that you can use statistics (and Google Trends in particular) to prove anything [1], but I found the above figures striking. Here we come back to the upward trend in searches for Data Science. Source: Google Trends. Gentlemen (and Ladies) Place your Bets.

Cost-Benefit

Cost-Benefit Data Warehouse Consulting Data Science

Backtesting index rebalancing arbitrage with Amazon EMR and Apache Iceberg

AWS Big Data

JULY 3, 2023

Terminology Let’s first discuss some of the terminology used in this post: Research data lake on Amazon S3 – A data lake is a large, centralized repository that allows you to manage all your structured and unstructured data at any scale. This is where the tagging feature in Apache Iceberg comes in handy.

Snapshot

Snapshot Data Lake Testing Strategy

Amazon Redshift announcements at AWS re:Invent 2023 to enable analytics on all your data

AWS Big Data

NOVEMBER 29, 2023

Zero-ETL integration also enables you to load and analyze data from multiple operational database clusters in a new or existing Amazon Redshift instance to derive holistic insights across many applications. Use one click to access your data lake tables using auto-mounted AWS Glue data catalogs on Amazon Redshift for a simplified experience.

Data Warehouse

Data Warehouse Data Lake Analytics Machine Learning

How Gilead used Amazon Redshift to quickly and cost-effectively load third-party medical claims data

AWS Big Data

NOVEMBER 8, 2023

Proposed Solution approach 2: Data Lake analytics The team used this approach with Redshift Spectrum to load only the required columns to Redshift Serverless, which avoided loading data into multiple yearly tables and directly to a single table. Create a data lake external schema and table in Redshift Serverless.

Data Lake

Data Lake Data Warehouse Cost-Benefit Optimization

Improving Data Processing with Spark 3.0 & Delta Lake

Smart Data Collective

AUGUST 5, 2021

In this blog, we will cover an overview of Delta Lake, its advantages, and how the above challenges can be overcome by moving to Delta Lake and migrating to Spark 3.0. What is Delta Lake? count, min/max values for columns) about the data in this file tags Map[String,String] Map containing metadata about this file.
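The excerpt mentions per-file statistics such as record counts and min/max column values. One common use of such statistics is data skipping: a reader ignores files whose value range cannot match a filter. The sketch below shows the idea in plain Python; `FileStats`, `files_to_scan`, and the file names are hypothetical and do not reflect Delta Lake's actual transaction log format or API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file statistics for a single integer column."""
    path: str
    num_records: int
    min_value: int
    max_value: int


def files_to_scan(files: List[FileStats], lower: int, upper: int) -> List[str]:
    """Return paths of files whose [min, max] range overlaps [lower, upper]."""
    return [
        f.path
        for f in files
        if f.max_value >= lower and f.min_value <= upper
    ]


# Made-up file listing; a real table would keep this in its metadata.
FILES = [
    FileStats("part-0000.parquet", num_records=1000, min_value=1, max_value=100),
    FileStats("part-0001.parquet", num_records=1000, min_value=101, max_value=200),
    FileStats("part-0002.parquet", num_records=1000, min_value=201, max_value=300),
]

# Filter equivalent to: WHERE col BETWEEN 150 AND 220
selected = files_to_scan(FILES, lower=150, upper=220)
print(selected)
```

Only the files whose range overlaps the filter are read, so the first file is skipped without being opened.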

Data Processing

Data Processing Metadata Broadcasting Statistics

What is Data Pipeline? A Detailed Explanation

Smart Data Collective

OCTOBER 17, 2022

Big data is shaping our world in countless ways. Data powers everything we do. That is exactly why systems have to ensure an adequate, accurate and, most importantly, consistent data flow between them. A point of data entry in a given pipeline. Data Pipeline: Use Cases. Destination.

Data Warehouse

Data Warehouse Data Lake Visualization Big Data

Materialized Views in Hive for Iceberg Table Format

Cloudera

FEBRUARY 8, 2024

It brings the reliability and simplicity of SQL tables to big data while enabling engines like Hive, Impala, Spark, Trino, Flink, and Presto to work with the same tables at the same time. We ran the ANALYZE command to gather both table and column statistics on all the base tables. Each query had between one to three variants.

Snapshot

Snapshot Metadata Cost-Benefit Data Warehouse

How AWS helped Altron Group accelerate their vision for optimized customer engagement

AWS Big Data

JULY 13, 2023

To verify the data quality of the sources through statistically-relevant metrics, AWS Glue Data Quality runs data quality tasks on relevant AWS Glue tables. Foundations for a data lake with data governance controls and data quality checks.

Optimization

Optimization B2B Data Quality Sales

Themes and Conferences per Pacoid, Episode 12

Domino Data Lab

AUGUST 8, 2019

Another key point: troubleshooting edge cases for models in production—which is often where ethics and data meet, as far as regulators are concerned—requires much more sophistication in statistics than most data science teams tend to have. It’s a quick way to clear the room. machine learning?

Data Science

Data Science Machine Learning Data Governance Statistics

Fact-based Decision-making

Peter James Thomas

AUGUST 12, 2018

In our modern architectures, replete with web-services, APIs, cloud-based components and the quasi-instantaneous transmission of new transactions, it is perhaps not surprising that occasionally some data gets lost in translation [5] along the way. I explore some similar themes in a section of Data Visualisation – A Scientific Treatment.

Metrics

Metrics Statistics Data Quality Measurement

Periscope Data Expands to Israel, Empowering Data Teams with Powerful Tools

Sisense

DECEMBER 11, 2019

With Itzik’s wisdom fresh in everyone’s minds, Scott Castle, Sisense General Manager, Data Business, shared his view on the role of modern data teams. Scott whisked us through the history of business intelligence from its first definition in 1958 to the current rise of Big Data. Omid Vahdaty, CTO of Jutomate Ltd.,

Data Lake

Data Lake Big Data Sales Data-driven

Accelerating model velocity through Snowflake Java UDF integration

Domino Data Lab

JUNE 15, 2021

At a certain point, as the demand keeps growing, the data volumes rapidly increase. Data is no longer stored in CSV files, but in a dedicated, purpose-built data lake / data warehouse. The challenges surface once the company hits the scalability wall. F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16.

Modeling

Modeling Data Science Data-driven Data Warehouse

AWS Lake Formation 2023 year in review

AWS Big Data

JANUARY 18, 2024

AWS Lake Formation and the AWS Glue Data Catalog form an integral part of a data governance solution for data lakes built on Amazon Simple Storage Service (Amazon S3) with multiple AWS analytics services integrating with them. In 2023, we added support for column-level statistics for tables in the Data Catalog.

Data Lake

Data Lake Metadata Data Governance Statistics

A Retrospective of 2018’s Articles

Peter James Thomas

APRIL 9, 2019

These are as follows: General Data Articles. Data Visualisation. Statistics & Data Science. Analytics & Big Data. Data Visualisation. Statistics & Data Science. Data Science Challenges – It’s Deja Vu all over again! Analytics & Big Data.

Data-driven

Data-driven Statistics Big Data Data Science

Unleashing the power of Presto: The Uber case study

IBM Big Data Hub

SEPTEMBER 25, 2023

Uber understood that digital superiority required the capture of all their transactional data, not just a sampling. They stood up a file-based data lake alongside their analytical database. Because much of the work done on their data lake is exploratory in nature, many users want to execute untested queries on petabytes of data.

OLAP

OLAP Data Lake Data-driven Snapshot

The Gartner 2021 Leadership Vision for Data & Analytics Leaders Webinar Q&A

Andrew White

JANUARY 11, 2021

Will the data warehouse as a software tool play a role in the future of Data & Analytics strategy? You cannot get away from a formalized delivery capability focused on regular, scheduled, structured and reasonably governed data. Data lakes don’t offer this nor should they. E.g. Data Lakes in Azure – as SaaS.

Data Analytics

Data Analytics Analytics Data-driven Finance
