Analytics, Big Data, Data Lake and Optimization

Multicloud data lake analytics with Amazon Athena

AWS Big Data

MARCH 18, 2024

Many organizations operate data lakes spanning multiple cloud data stores. In these cases, you may want an integrated query layer to seamlessly run analytical queries across these diverse cloud stores and streamline your data analytics processes.

Data Lake

Data Lake Analytics Cost-Benefit Management

Understanding Apache Iceberg on AWS with the new technical guide

AWS Big Data

MAY 20, 2024

Whether you are new to Apache Iceberg on AWS or already running production workloads on AWS, this comprehensive technical guide offers detailed guidance, from foundational concepts to advanced optimizations, to build your transactional data lake with Apache Iceberg on AWS.

Data Lake

Data Lake Cost-Benefit Big Data Data Warehouse

Differentiating Between Data Lakes and Data Warehouses

Smart Data Collective

SEPTEMBER 23, 2020

While there is a lot of discussion about the merits of data warehouses, not enough discussion centers around data lakes. We talked about enterprise data warehouses in the past, so let’s contrast them with data lakes. Both data warehouses and data lakes are used when storing big data.

Data Lake

Data Lake Data Warehouse Unstructured Data Big Data

Build a serverless transactional data lake with Apache Iceberg, Amazon EMR Serverless, and Amazon Athena

AWS Big Data

MARCH 10, 2023

Since the deluge of big data over a decade ago, many organizations have learned to build applications to process and analyze petabytes of data. Data lakes have served as a central repository to store structured and unstructured data at any scale and in various formats.

Data Lake

Data Lake Sales Data Warehouse Snapshot

Apache Iceberg optimization: Solving the small files problem in Amazon EMR

AWS Big Data

OCTOBER 3, 2023

In our previous post, Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes, we discussed how you can implement solutions to improve operational efficiencies of your Amazon Simple Storage Service (Amazon S3) data lake that is using the Apache Iceberg open table format and running on the Amazon EMR big data platform.

Optimization

Optimization Snapshot Data Lake Metadata

Use Apache Iceberg in a data lake to support incremental data processing

AWS Big Data

MARCH 2, 2023

Apache Iceberg is an open table format for very large analytic datasets, which captures metadata information on the state of datasets as they evolve and change over time. Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback.

Data Lake

Data Lake Data Processing Metadata Snapshot

Speed up queries with the cost-based optimizer in Amazon Athena

AWS Big Data

NOVEMBER 17, 2023

Amazon Athena is a serverless, interactive analytics service built on open source frameworks, supporting open table and file formats. Athena provides a simplified, flexible way to analyze petabytes of data where it lives. Let’s discuss some of the cost-based optimization techniques that contributed to improved query performance.

Optimization

Optimization Statistics Metadata Data Lake

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

AWS Big Data

MARCH 7, 2024

At the same time, they need to optimize operational costs to unlock the value of this data for timely insights and do so with a consistent performance. With this massive data growth, data proliferation across your data stores, data warehouse, and data lakes can become equally challenging.

Data Lake

Data Lake Analytics Dashboards Metrics

Enable business users to analyze large datasets in your data lake with Amazon QuickSight

AWS Big Data

JUNE 23, 2023

Events and many other security data types are stored in Imperva’s Threat Research Multi-Region data lake. Imperva harnesses data to improve their business outcomes. As part of their solution, they are using Amazon QuickSight to unlock insights from their data.

Data Lake

Data Lake Cost-Benefit Dashboards Data Warehouse

Optimize data layout by bucketing with Amazon Athena and AWS Glue to accelerate downstream queries

AWS Big Data

APRIL 25, 2024

In the era of data, organizations are increasingly using data lakes to store and analyze vast amounts of structured and unstructured data. Data lakes provide a centralized repository for data from various sources, enabling organizations to unlock valuable insights and drive data-driven decision-making.

Optimization

Optimization Data Lake Cost-Benefit Reporting

Deploy and Optimize Your Snowflake Environment Faster With Accelerators

CDW Research Hub

JULY 18, 2022

While many organizations understand the business need for a data and analytics cloud platform, few can quickly modernize their legacy data warehouse due to a lack of skills, resources, and data literacy. Sirius has created a lightweight development tool to rapidly build and deploy best-practice data models.

Optimization

Optimization Data Lake Data Warehouse Manufacturing

Announcing the AWS Well-Architected Data Analytics Lens

AWS Big Data

MARCH 26, 2024

We are delighted to announce the release of the Data Analytics Lens. Using the Lens in the Tool’s Lens Catalog, you can directly assess your Analytics workload in the console, and produce a set of actionable results for customized improvement plans recommended by the Tool.

Data Analytics

Data Analytics Analytics Big Data Data Lake

Data Lakes: What Are They and Who Needs Them?

Jet Global

JULY 2, 2019

To address the flood of data and the needs of enterprise businesses to store, sort, and analyze that data, a new storage solution has evolved: the data lake. Data warehouses do a great job of standardizing data from disparate sources for analysis.

Data Lake

Data Lake Data Warehouse Big Data Machine Learning

Efficiently crawl your data lake and improve data access with an AWS Glue crawler using partition indexes

AWS Big Data

JUNE 15, 2023

In today’s world, customers manage vast amounts of data in their Amazon Simple Storage Service (Amazon S3) data lakes, which requires convoluted data pipelines to continuously understand the changes in the data layout and make them available to consuming systems.

Data Lake

Data Lake Metadata Cost-Benefit Management

Why optimize your warehouse with a data lakehouse strategy

IBM Big Data Hub

APRIL 25, 2023

We also made the case that query and reporting, provided by big data engines such as Presto, need to work with the Spark infrastructure framework to support advanced analytics and complex enterprise data decision-making.

Optimization

Optimization Strategy Data Warehouse Cost-Benefit

Build a transactional data lake using Apache Iceberg, AWS Glue, and cross-account data shares using AWS Lake Formation and Amazon Athena

AWS Big Data

APRIL 24, 2023

Building a data lake on Amazon Simple Storage Service (Amazon S3) provides numerous benefits for an organization. However, many use cases, like performing change data capture (CDC) from an upstream relational database to an Amazon S3-based data lake, require handling data at a record level.

Data Lake

Data Lake Data Governance Cost-Benefit Machine Learning

Enforce fine-grained access control on Open Table Formats via Amazon EMR integrated with AWS Lake Formation

AWS Big Data

JANUARY 17, 2024

With Amazon EMR 6.15, we launched AWS Lake Formation-based fine-grained access controls (FGAC) on Open Table Formats (OTFs), including Apache Hudi, Apache Iceberg, and Delta Lake. Many large enterprise companies seek to use their transactional data lake to gain insights and improve decision-making.

Data Lake

Data Lake Snapshot Big Data Data-driven

Baldor’s first-ever CIO sets the transformation agenda

CIO Business Intelligence

MAY 16, 2024

The high-end organic produce and fresh meats distributor envisions IT (analytics and AI, specifically) as the key to more efficient distribution logistics and a five-star customer experience.

IoT

IoT Internet of Things Digital Transformation Sales

Use Amazon Athena with Spark SQL for your open-source transactional table formats

AWS Big Data

JANUARY 24, 2024

AWS-powered data lakes, supported by the unmatched availability of Amazon Simple Storage Service (Amazon S3), can handle the scale, agility, and flexibility required to combine different data and analytics approaches. The timestamp clause lets us travel back without altering current data.

Snapshot

Snapshot Data Lake Metadata Optimization

Top 8 predictive analytics tools compared

CIO Business Intelligence

MAY 12, 2022

What are predictive analytics tools? Predictive analytics tools blend artificial intelligence and business reporting. But there are deeper challenges because predictive analytics software can’t magically anticipate moments when the world shifts gears and the future bears little relationship to the past.

Predictive Analytics

Predictive Analytics Analytics Statistics Machine Learning

Migrate data from Azure Blob Storage to Amazon S3 using AWS Glue

AWS Big Data

OCTOBER 20, 2023

Today, we are pleased to announce new AWS Glue connectors for Azure Blob Storage and Azure Data Lake Storage that allow you to move data bi-directionally between Azure Blob Storage, Azure Data Lake Storage, and Amazon Simple Storage Service (Amazon S3). option("header","true").load("wasbs://yourblob@youraccountname.blob.core.windows.net/loadingtest-input/100mb")

Data Lake

Data Lake Big Data Consulting Data Warehouse

How Data Analytics Tools Eliminate Business Owner Headaches

Smart Data Collective

AUGUST 7, 2019

Big data has the power to transform any small business. One study found that 77% of small businesses don’t even have a big data strategy. If your company lacks a big data strategy, then you need to start developing one today. Using Big Data to Fix Your Biggest Problems as a Business Owner.

Data Analytics

Data Analytics Analytics Big Data Data Lake

Detect, mask, and redact PII data using AWS Glue before loading into Amazon OpenSearch Service

AWS Big Data

JANUARY 12, 2024

Many organizations, small and large, are working to migrate and modernize their analytics workloads on Amazon Web Services (AWS). Leadership and development teams can spend more time optimizing current solutions and even experimenting with new use cases, rather than maintaining the current infrastructure.

Data Lake

Data Lake Cost-Benefit Visualization Structured Data

Architectural patterns for real-time analytics using Amazon Kinesis Data Streams, part 1

AWS Big Data

JANUARY 8, 2024

This is the first post to a blog series that offers common architectural patterns in building real-time data streaming infrastructures using Kinesis Data Streams for a wide range of use cases. In this post, we will review the common architectural patterns of two use cases: Time Series Data Analysis and Event Driven Microservices.

Analytics

Analytics IoT Data-driven Snapshot

The Future of the Data Lakehouse – Open

CIO Business Intelligence

JUNE 23, 2022

Cloudera customers run some of the biggest data lakes on earth. These lakes power mission-critical, large-scale data analytics, business intelligence (BI), and machine learning use cases, including enterprise data warehouses.

Data Lake

Data Lake Data Warehouse Machine Learning Cost-Benefit

How SumUp made digital analytics more accessible using AWS Glue

AWS Big Data

JUNE 6, 2023

As organizations that have turned to Google Analytics (GA) as a digital analytics solution mature, most discover a more pressing need to integrate this data silo with the rest of their organization’s data to enable better analytics and, in turn, better product development and fraud detection.

Analytics

Analytics Data Lake Testing Optimization

Simplifying data processing at Capitec with Amazon Redshift integration for Apache Spark

AWS Big Data

NOVEMBER 10, 2023

Amazon Redshift offers seamless integration with Apache Spark, allowing you to easily access your Redshift data on both Amazon Redshift provisioned clusters and Amazon Redshift Serverless. Additionally, you’ll benefit from performance improvements through pushdown optimizations, further enhancing the efficiency of your operations.

Data Processing

Data Processing Data Lake Data Warehouse Optimization

The Future of the Data Lakehouse – Open

Cloudera

JUNE 18, 2022

Cloudera customers run some of the biggest data lakes on earth. These lakes power mission-critical, large-scale data analytics, business intelligence (BI), and machine learning use cases, including enterprise data warehouses.

Data Lake

Data Lake Data Warehouse Machine Learning Cost-Benefit

What is a data architect? Skills, salaries, and how to become a data framework master

CIO Business Intelligence

OCTOBER 13, 2023

Solutions data architect: These individuals design and implement data solutions for specific business needs, including data warehouses, data marts, and data lakes. Application data architect: The application data architect designs and implements data models for specific software applications.

Data Architecture

Data Architecture Data Warehouse Statistics Visualization

With a zero-ETL approach, AWS is helping builders realize near-real-time analytics

AWS Big Data

JUNE 28, 2023

For example, customers told us that they want to ingest streaming data into their data stores for doing analytics—all without delving into the complexities of ETL. They can connect to multiple data streams and pull data directly into Amazon Redshift without staging it in Amazon Simple Storage Service (Amazon S3).

Analytics

Analytics Data Warehouse Data Lake Data-driven

Enhance data security and governance for Amazon Redshift Spectrum with VPC endpoints

AWS Big Data

FEBRUARY 16, 2024

Many customers are extending their data warehouse capabilities to their data lake with Amazon Redshift. They are looking to further enhance their security posture where they can enforce access policies on their data lakes based on Amazon Simple Storage Service (Amazon S3).

Data Lake

Data Lake Data Warehouse Testing Business Objectives

Azure Data Sources for Data Science and Machine Learning

Jen Stirrup

MAY 5, 2020

It is more than just some giant USB stick in the sky that’s going to store all of the data. It has a lot of services that you can use, such as Big Data analytics. In this topic, we’re going to focus specifically on Big Data technologies, such as your HDInsight, Apache Spark and Databricks.

Machine Learning

Machine Learning Data Science Data Lake Big Data

Enhance query performance using AWS Glue Data Catalog column-level statistics

AWS Big Data

NOVEMBER 22, 2023

Today, we’re making available a new capability of AWS Glue Data Catalog that allows generating column-level statistics for AWS Glue tables. These statistics are now integrated with the cost-based optimizers (CBO) of Amazon Athena and Amazon Redshift Spectrum, resulting in improved query performance and potential cost savings.

Statistics

Statistics Data Lake Optimization Data-driven

How gaming companies can use Amazon Redshift Serverless to build scalable analytical applications faster and easier

AWS Big Data

MARCH 7, 2023

This post provides guidance on how to build scalable analytical solutions for gaming industry use cases using Amazon Redshift Serverless. Flexible and easy to use – The solutions should provide less restrictive, easy-to-access, and ready-to-use data. They should also provide optimal performance with low or no tuning.

Analytics

Analytics Data Warehouse Data Lake Metadata

Analyze Elastic IP usage history using Amazon Athena and AWS CloudTrail

AWS Big Data

MAY 15, 2024

You can use this solution regularly as part of your cost-optimization efforts to safely remove unused EIPs to reduce your costs. Check out the GitHub repo to regularly run this analysis as part of your cost-optimization strategy to identify and release inactive EIPs to reduce costs.

Snapshot

Snapshot Optimization Data Lake Reporting

Enhance monitoring and debugging for AWS Glue jobs using new job observability metrics, Part 3: Visualization and trend analysis using Amazon QuickSight

AWS Big Data

MARCH 29, 2024

Analyzing historical patterns allows you to optimize performance, identify issues proactively, and improve planning. By connecting the new observability metrics to interactive QuickSight dashboards, you can uncover daily, weekly, and monthly patterns to optimize AWS Glue job usage.

Metrics

Metrics Visualization Dashboards Interactive

Deep dive into the AWS ProServe Hadoop Migration Delivery Kit TCO tool

AWS Big Data

FEBRUARY 6, 2023

Additionally, a TCO calculator generates the TCO estimation of an optimized EMR cluster for facilitating the migration. For optimizing EMR cluster cost effectiveness, the following table provides general guidelines of choosing the proper type of EMR cluster and Amazon Elastic Compute Cloud (Amazon EC2) family.

Dashboards

Dashboards Optimization Data Lake Cost-Benefit

Introducing the AWS ProServe Hadoop Migration Delivery Kit TCO tool

AWS Big Data

FEBRUARY 6, 2023

When migrating Hadoop workloads to Amazon EMR, it’s often difficult to identify the optimal cluster configuration without analyzing existing workloads by hand. Migrating Hadoop workloads to Amazon EMR accelerates big data analytics modernization, increases productivity, and reduces operational cost.

Cost-Benefit

Cost-Benefit Data Lake Dashboards Big Data

Enhance monitoring and debugging for AWS Glue jobs using new job observability metrics

AWS Big Data

NOVEMBER 20, 2023

For any modern data-driven company, having smooth data integration pipelines is crucial. These pipelines pull data from various sources, transform it, and load it into destination systems for analytics and reporting. The end benefit for you is more effective and optimized AWS Glue for Apache Spark workloads.

Metrics

Metrics Data Lake Cost-Benefit Dashboards

Build an end-to-end serverless streaming pipeline with Apache Kafka on Amazon MSK using Python

AWS Big Data

MARCH 21, 2024

You want real-time access to this data so you can monitor performance in real time, and detect and mitigate issues quickly. You also need longer-term access to this data for machine learning (ML) models to run predictive maintenance assessments, find optimization opportunities, and forecast demand.

Data Lake

Data Lake Management Modeling Optimization

Real estate CIOs drive deals with data

CIO Business Intelligence

JULY 26, 2023

“We’ve been able to create some models that will analyze things like the listing comments and descriptions and tell you which properties are waterfront or not,” Wilhemy says, adding that such data gives its agents a competitive advantage by enabling them to reach out to a selective set of potential buyers first.

Data Lake

Data Lake Digital Transformation Machine Learning Data Architecture

Three Trends for Modernizing Analytics and Data Warehousing in 2019

Cloudera

DECEMBER 19, 2018

Data analytics priorities have shifted this year. Don’t blink or you might miss what leading organizations are doing to modernize their analytic and data warehousing environments. Natural language analytics and streaming data analytics are emerging technologies that will impact the market.

Data Warehouse

Data Warehouse Analytics Big Data Data Architecture

AI and ML: No Longer the Stuff of Science Fiction

Cloudera

DECEMBER 14, 2021

The category “Data for Enterprise AI” awards companies from around the world that have built and deployed use cases for enterprise-scale machine learning and have industrialized AI to automate, secure, and optimize data-driven decision making and/or applications. Roads and Transport Authority, Dubai.

Data Lake

Data Lake Machine Learning Big Data Data-driven

Exploring real-time streaming for generative AI Applications

AWS Big Data

MARCH 25, 2024

To understand the best ways to make API calls via Apache Flink, refer to Common streaming data enrichment patterns in Amazon Kinesis Data Analytics for Apache Flink. OpenSearch Service provides support for native ingestion from Kinesis data streams or MSK topics.

Data Lake

Data Lake Unstructured Data Management Modeling
