2023, Big Data, Data Lake and Optimization

Build a serverless transactional data lake with Apache Iceberg, Amazon EMR Serverless, and Amazon Athena

AWS Big Data

MARCH 10, 2023

Since the deluge of big data over a decade ago, many organizations have learned to build applications to process and analyze petabytes of data. Data lakes have served as a central repository to store structured and unstructured data at any scale and in various formats.

Data Lake, Sales, Data Warehouse, Snapshot

Apache Iceberg optimization: Solving the small files problem in Amazon EMR

AWS Big Data

OCTOBER 3, 2023

In our previous post, Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes, we discussed how you can implement solutions to improve operational efficiencies of your Amazon Simple Storage Service (Amazon S3) data lake that is using the Apache Iceberg open table format and running on the Amazon EMR big data platform.

Optimization, Snapshot, Data Lake, Metadata

Webinars

How to Optimize the Developer Experience for Monumental Impact

Generative AI Deep Dive: Advancing from Proof of Concept to Production

Understanding User Needs and Satisfying Them

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You Need to Know

Leading the Development of Profitable and Sustainable Products

Use Apache Iceberg in a data lake to support incremental data processing

AWS Big Data

MARCH 2, 2023

Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback. AWS Glue 3.0 and later supports the Apache Iceberg framework for data lakes. The following diagram illustrates the solution architecture.

Data Lake, Data Processing, Metadata, Snapshot

Optimize data layout by bucketing with Amazon Athena and AWS Glue to accelerate downstream queries

AWS Big Data

APRIL 25, 2024

In the era of data, organizations are increasingly using data lakes to store and analyze vast amounts of structured and unstructured data. Data lakes provide a centralized repository for data from various sources, enabling organizations to unlock valuable insights and drive data-driven decision-making.

Optimization, Data Lake, Cost-Benefit, Reporting

Build a transactional data lake using Apache Iceberg, AWS Glue, and cross-account data shares using AWS Lake Formation and Amazon Athena

AWS Big Data

APRIL 24, 2023

Building a data lake on Amazon Simple Storage Service (Amazon S3) provides numerous benefits for an organization. However, many use cases, like performing change data capture (CDC) from an upstream relational database to an Amazon S3-based data lake, require handling data at a record level.

Data Lake, Data Governance, Cost-Benefit, Machine Learning

Use Amazon Athena with Spark SQL for your open-source transactional table formats

AWS Big Data

JANUARY 24, 2024

AWS-powered data lakes, supported by the unmatched availability of Amazon Simple Storage Service (Amazon S3), can handle the scale, agility, and flexibility required to combine different data and analytics approaches. The output will give a count of the number of data and metadata files deleted.

Snapshot, Data Lake, Metadata, Optimization

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

AWS Big Data

MARCH 7, 2024

At the same time, they need to optimize operational costs to unlock the value of this data for timely insights and do so with a consistent performance. With this massive data growth, data proliferation across your data stores, data warehouse, and data lakes can become equally challenging.

Data Lake, Analytics, Dashboards, Metrics

Enhance monitoring and debugging for AWS Glue jobs using new job observability metrics

AWS Big Data

NOVEMBER 20, 2023

Use case: A typical workload for AWS Glue for Apache Spark jobs is to load data from a relational database to a data lake with SQL-based transformations. The end benefit for you is more effective and optimized AWS Glue for Apache Spark workloads. The metrics are available in all AWS Glue supported Regions. Check it out!

Metrics, Data Lake, Cost-Benefit, Dashboards

Simplifying data processing at Capitec with Amazon Redshift integration for Apache Spark

AWS Big Data

NOVEMBER 10, 2023

Your applications can seamlessly read from and write to your Amazon Redshift data warehouse while maintaining optimal performance and transactional consistency. Additionally, you’ll benefit from performance improvements through pushdown optimizations, further enhancing the efficiency of your operations. options(**read_config).option("query",

Data Processing, Data Lake, Data Warehouse, Optimization

Enhance monitoring and debugging for AWS Glue jobs using new job observability metrics, Part 3: Visualization and trend analysis using Amazon QuickSight

AWS Big Data

MARCH 29, 2024

Analyzing historical patterns allows you to optimize performance, identify issues proactively, and improve planning. Looking at the Skewness Job per Job visualization, there was a spike on November 1, 2023. About the Authors: Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team.

Metrics, Visualization, Dashboards, Interactive

Accelerate your data warehouse migration to Amazon Redshift – Part 7

AWS Big Data

OCTOBER 17, 2023

Tens of thousands of customers use Amazon Redshift to gain business insights from their data. With Amazon Redshift, you can use standard SQL to query data across your data warehouse, operational data stores, and data lake. _cdc_unit" t2 WHERE t2.deletexid_

Data Warehouse, Data Processing, Data Lake, Management

Amazon Redshift announcements at AWS re:Invent 2023 to enable analytics on all your data

AWS Big Data

NOVEMBER 29, 2023

These announcements drive forward the AWS Zero-ETL vision to unify all your data, enabling you to better maximize the value of your data with comprehensive analytics and ML capabilities, and innovate faster with secure data collaboration within and across organizations.

Data Warehouse, Data Lake, Analytics, Machine Learning

Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes

AWS Big Data

MAY 24, 2023

When you build your transactional data lake using Apache Iceberg to solve your functional use cases, you need to focus on operational use cases for your S3 data lake to optimize the production environment. This property is set to true by default.

Data Lake, Snapshot, Metadata, Optimization

Introducing AWS Glue serverless Spark UI for better monitoring and troubleshooting

AWS Big Data

NOVEMBER 20, 2023

Customers often use Apache Spark Web UI, a popular debugging tool that is part of open source Apache Spark, to help fix problems and optimize job performance. When logs are parsed, you can use the built-in Spark UI to debug, troubleshoot, and optimize your jobs. Now it’s time to run the job!

Visualization, Optimization, Data Lake, Management

Getting started with AWS Glue Data Quality from the AWS Glue Data Catalog

AWS Big Data

JUNE 6, 2023

You can use AWS Glue to create, run, and monitor data integration and ETL (extract, transform, and load) pipelines and catalog your assets across multiple data stores. Hundreds of thousands of customers use data lakes for analytics and ML to make data-driven business decisions.

Data Quality, Data Lake, Data-driven, Metrics

Materialized Views in Hive for Iceberg Table Format

Cloudera

FEBRUARY 8, 2024

It brings the reliability and simplicity of SQL tables to big data while enabling engines like Hive, Impala, Spark, Trino, Flink, and Presto to work with the same tables at the same time. It has been designed and developed as an open community standard to ensure compatibility across languages and implementations.

Snapshot, Metadata, Cost-Benefit, Data Warehouse

Implement a serverless CDC process with Apache Iceberg using Amazon DynamoDB and Amazon Athena

AWS Big Data

AUGUST 16, 2023

Iceberg manages large collections of files as tables, and it supports modern analytical data lake operations such as record-level insert, update, delete, and time travel queries. Iceberg also helps guarantee data correctness under concurrent write scenarios. We use a sample JSON file as input to Amazon DynamoDB.

Data Lake, Metadata, Testing, Snapshot

Amazon QuickSight helps TalentReef empower its customers to make more informed hiring decisions

AWS Big Data

MARCH 17, 2023

The response has been overwhelmingly positive, leading to the development of two additional analytics dashboards, Job Postings and Onboarding, both set to be released in the first half of 2023. They want to see how their job postings are performing, if there is a drop in any posting, and opportunities to optimize their process.

Dashboards, IT, Data Lake, Visualization

Tackling AI’s data challenges with IBM databases on AWS

IBM Big Data Hub

MARCH 14, 2024

This involves unifying and sharing a single copy of data and metadata across IBM® watsonx.data™, IBM® Db2®, IBM® Db2® Warehouse and IBM® Netezza®, using native integrations and supporting open formats, all without the need for migration or recataloging. With Netezza support for 1.2

Cost-Benefit, Metadata, Optimization, Management

Amazon Kinesis Data Streams: celebrating a decade of real-time data innovation

AWS Big Data

NOVEMBER 14, 2023

Ten years ago, we launched Amazon Kinesis Data Streams, the first cloud-native serverless streaming data service, to serve as the backbone for companies to move data across system boundaries, breaking data silos. Another integration launched in 2023 is with Amazon Monitron to power predictive maintenance management.

IoT, Data-driven, Data Lake, Data Strategy

Exploring the AI and data capabilities of watsonx

IBM Big Data Hub

JULY 17, 2023

Sean Im, CEO, Samsung SDS America: “In the field of generative AI and foundation models, watsonx is a platform that will enable us to meet our customers’ requirements in terms of optimization and security, while allowing them to benefit from the dynamism and innovations of the open-source community.”

Machine Learning, Data Warehouse, Modeling, Cost-Benefit

AWS Lake Formation 2023 year in review

AWS Big Data

JANUARY 18, 2024

AWS Lake Formation and the AWS Glue Data Catalog form an integral part of a data governance solution for data lakes built on Amazon Simple Storage Service (Amazon S3) with multiple AWS analytics services integrating with them. In 2023, we released several updates to AWS Glue crawlers. Crawlers, salut!

Data Lake, Metadata, Data Governance, Statistics

Introducing watsonx: The future of AI for business

IBM Big Data Hub

MAY 9, 2023

A data store built on open lakehouse architecture, it runs both on premises and across multi-cloud environments. Optimized for all data, analytics and AI workloads, watsonx.data combines the flexibility of a data lake with the performance of a data warehouse, helping businesses to scale data analytics and AI anywhere their data resides.

Data Warehouse, Cost-Benefit, Machine Learning, Modeling

Your guide to AWS Analytics at AWS re:Invent 2023

AWS Big Data

NOVEMBER 13, 2023

2023 AWS Analytics Superheroes: We are excited to introduce the 2023 AWS Analytics Superheroes at this year’s re:Invent conference! A shapeshifting guardian and protector of data like Data Lynx? 11:30 AM – 12:30 PM (PDT) Caesars Forum ANT318 | Accelerate innovation with end-to-end serverless data architecture.

Analytics, Data Lake, Data Warehouse, Data-driven

AWS re:Invent 2023 Amazon Redshift Sessions Recap

AWS Big Data

DECEMBER 18, 2023

Get a closer look at how scaling for data warehousing works in AWS with the latest introduction of AI driven scaling and optimizations in Amazon Redshift Serverless to enable better price-performance for your workloads.

Data Warehouse, Machine Learning, Data-driven, Data Lake

Real-time streaming data top picks you cannot miss at AWS re:Invent 2023

AWS Big Data

NOVEMBER 8, 2023

Save the date: AWS re:Invent 2023 is happening from November 27 to December 1 in Las Vegas, and you cannot miss it. In today’s data-driven landscape, the quality of data is the foundation upon which the success of organizations and innovations stands. High-quality data is not just about accuracy; it’s also about timeliness.

Data-driven, Data Lake, Machine Learning, Cost-Benefit

How Fujitsu implemented a global data mesh architecture and democratized data

AWS Big Data

MAY 1, 2024

Currently, we have approximately 120,000 employees worldwide (as of March 2023), including group companies. To provide a variety of products, services, and solutions that are better suited to customers and society in each region, we have built business processes and systems that are optimized for each region and its market.

Dashboards, Data-driven, Publishing, Cost-Benefit

The Enduring Significance of Data Modeling in the Modern Data-Driven Enterprise

erwin

AUGUST 31, 2023

Improved Decision Making: Well-modeled data provides insights that drive informed decision-making across various business domains, resulting in enhanced strategic planning. Reduced Data Redundancy: By eliminating data duplication, it optimizes storage and enhances data quality, reducing errors and discrepancies.

Data-driven, Modeling, Enterprise, Structured Data

Unleashing the power of Presto: The Uber case study

IBM Big Data Hub

SEPTEMBER 25, 2023

With a few taps on a mobile device, riders request a ride; then, Uber’s algorithms work to match them with the nearest available driver and calculate the optimal price. Uber’s prowess as a transportation, logistics and analytics company hinges on their ability to leverage data effectively. But the simplicity ends there.

OLAP, Data Lake, Data-driven, Snapshot

Process price transparency data using AWS Glue

AWS Big Data

MAY 4, 2023

Prerequisites: To implement the solution in your own AWS account, you need to create or configure the following AWS resources in advance: An S3 bucket to persist the source and processed data. getvalue(),encoding='utf-8') s3_client.put_object(Body=data, Bucket=bucket, Key=upload_path) s3_client = boto3.client('s3')

Insurance, Publishing, Cost-Benefit, Data Lake

Showpad accelerates data maturity to unlock innovation using Amazon QuickSight

AWS Big Data

APRIL 5, 2023

Showpad also struggled with data quality issues in terms of consistency, ownership, and insufficient data access across its targeted user base due to a complex BI access process, licensing challenges, and insufficient education. As of January 2023, Showpad’s QuickSight instance includes over 2,433 datasets and 199 dashboards.

Dashboards, Reporting, Cost-Benefit, Visualization
