2023, Big Data and Data Lake - Data Leaders Brief

Migrate an existing data lake to a transactional data lake using Apache Iceberg

AWS Big Data

OCTOBER 3, 2023

A data lake is a centralized repository that you can use to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data and then run different types of analytics for better business insights.

Data Lake

Data Lake Metadata Snapshot Recreation/Entertainment

Build a serverless transactional data lake with Apache Iceberg, Amazon EMR Serverless, and Amazon Athena

AWS Big Data

MARCH 10, 2023

Since the deluge of big data over a decade ago, many organizations have learned to build applications to process and analyze petabytes of data. Data lakes have served as a central repository to store structured and unstructured data at any scale and in various formats.

Data Lake

Data Lake Sales Data Warehouse Snapshot

Webinars

How to Optimize the Developer Experience for Monumental Impact

Generative AI Deep Dive: Advancing from Proof of Concept to Production

Understanding User Needs and Satisfying Them

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You Need to Know

Leading the Development of Profitable and Sustainable Products

MORE WEBINARS

Use Apache Iceberg in a data lake to support incremental data processing

AWS Big Data

MARCH 2, 2023

Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback. AWS Glue 3.0 and later supports the Apache Iceberg framework for data lakes.

Data Lake

Data Lake Data Processing Metadata Snapshot

Build a transactional data lake using Apache Iceberg, AWS Glue, and cross-account data shares using AWS Lake Formation and Amazon Athena

AWS Big Data

APRIL 24, 2023

Building a data lake on Amazon Simple Storage Service (Amazon S3) provides numerous benefits for an organization. However, many use cases, like performing change data capture (CDC) from an upstream relational database to an Amazon S3-based data lake, require handling data at a record level.
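To make "handling data at a record level" concrete, here is a minimal sketch in plain Python that applies a batch of CDC events to a keyed table. The event shape (`op`, `key`, `row`), the operation codes, and the function name `apply_cdc` are assumptions chosen for illustration; they are not taken from the article or from any AWS or Iceberg API.

```python
from typing import Dict, Iterable, Optional, TypedDict


class CdcEvent(TypedDict):
    op: str                 # "I" (insert), "U" (update), or "D" (delete); hypothetical codes
    key: int                # primary key of the affected record
    row: Optional[dict]     # new column values; None for deletes


def apply_cdc(table: Dict[int, dict], events: Iterable[CdcEvent]) -> Dict[int, dict]:
    """Return a new table with the events applied in order (upsert or delete by key)."""
    result = dict(table)  # copy so the input snapshot stays untouched
    for event in events:
        if event["op"] in ("I", "U"):
            result[event["key"]] = event["row"]
        elif event["op"] == "D":
            result.pop(event["key"], None)  # deleting a missing key is a no-op
        else:
            raise ValueError(f"unknown CDC operation: {event['op']!r}")
    return result


base = {1: {"name": "alice"}, 2: {"name": "bob"}}
changes = [
    {"op": "U", "key": 1, "row": {"name": "alicia"}},
    {"op": "D", "key": 2, "row": None},
    {"op": "I", "key": 3, "row": {"name": "carol"}},
]
merged = apply_cdc(base, changes)
print(merged)
```

A transactional table format would perform the equivalent merge against files on Amazon S3 rather than an in-memory dictionary; the sketch only shows the per-record semantics.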

Data Lake

Data Lake Data Governance Cost-Benefit Machine Learning

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

AWS Big Data

MARCH 7, 2024

At the same time, they need to optimize operational costs to unlock the value of this data for timely insights and do so with a consistent performance. With this massive data growth, data proliferation across your data stores, data warehouse, and data lakes can become equally challenging.

Data Lake

Data Lake Analytics Dashboards Metrics

Enhance monitoring and debugging for AWS Glue jobs using new job observability metrics

AWS Big Data

NOVEMBER 20, 2023

Use case: A typical workload for AWS Glue for Apache Spark jobs is to load data from a relational database to a data lake with SQL-based transformations. About the authors: Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. The metrics are available in all AWS Glue supported Regions.

Metrics

Metrics Data Lake Cost-Benefit Dashboards

Use Amazon Athena with Spark SQL for your open-source transactional table formats

AWS Big Data

JANUARY 24, 2024

AWS-powered data lakes, supported by the unmatched availability of Amazon Simple Storage Service (Amazon S3), can handle the scale, agility, and flexibility required to combine different data and analytics approaches. The output will give a count of the number of data and metadata files deleted.

Snapshot

Snapshot Data Lake Metadata Optimization

Apache Iceberg optimization: Solving the small files problem in Amazon EMR

AWS Big Data

OCTOBER 3, 2023

In our previous post, Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes, we discussed how you can implement solutions to improve operational efficiencies of your Amazon Simple Storage Service (Amazon S3) data lake that is using the Apache Iceberg open table format and running on the Amazon EMR big data platform.

Optimization

Optimization Snapshot Data Lake Metadata

Simplifying data processing at Capitec with Amazon Redshift integration for Apache Spark

AWS Big Data

NOVEMBER 10, 2023

The data sourcing problem: To ensure the reliability of PySpark data pipelines, it’s essential to have consistent record-level data from both dimensional and fact tables stored in the Enterprise Data Warehouse (EDW). These tables are then joined with tables from the Enterprise Data Lake (EDL) at runtime.

Data Processing

Data Processing Data Lake Data Warehouse Optimization

ChatGPT: the new challenges of data strategy in the era of generative AI

CIO Business Intelligence

MARCH 27, 2024

Italian companies are investing in infrastructure, software, and services for data management and analysis (+18% in 2023, equal to 2.85 billion euros, according to the Osservatorio Big Data & Business Analytics of the School of Management at the Politecnico di Milano), but how many have reached data maturity?

Data Governance

Data Governance Data Lake Data Strategy Data-driven

Optimize data layout by bucketing with Amazon Athena and AWS Glue to accelerate downstream queries

AWS Big Data

APRIL 25, 2024

In the era of data, organizations are increasingly using data lakes to store and analyze vast amounts of structured and unstructured data. Data lakes provide a centralized repository for data from various sources, enabling organizations to unlock valuable insights and drive data-driven decision-making.
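The bucketing idea named in the title can be sketched in a few lines of Python: rows are assigned to a fixed number of buckets by hashing a key column, so a lookup on that key only has to scan one bucket. This is a toy illustration under stated assumptions. `NUM_BUCKETS`, `bucket_for`, and the sample rows are hypothetical, and `zlib.crc32` is a stand-in hash; Athena and AWS Glue use their own hash functions, so real bucket assignments will differ.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count


def bucket_for(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a key to a bucket index in [0, num_buckets) with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


rows = [
    {"customer_id": "c-001", "amount": 10},
    {"customer_id": "c-002", "amount": 25},
    {"customer_id": "c-001", "amount": 7},
]

# Group rows by bucket, as a writer would group them into files.
buckets = defaultdict(list)
for row in rows:
    buckets[bucket_for(row["customer_id"])].append(row)

# A point lookup only needs to scan the one bucket its key hashes to.
target = bucket_for("c-001")
matches = [r for r in buckets[target] if r["customer_id"] == "c-001"]
print(len(matches))
```

Because the hash is deterministic, every row with the same key lands in the same bucket, which is what lets a query engine skip the other buckets.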

Optimization

Optimization Data Lake Cost-Benefit Reporting

Enhance monitoring and debugging for AWS Glue jobs using new job observability metrics, Part 3: Visualization and trend analysis using Amazon QuickSight

AWS Big Data

MARCH 29, 2024

Looking at the Skewness Job per Job visualization, there was a spike on November 1, 2023. You can choose Controls and change filter conditions based on date time, Region, AWS account ID, AWS Glue job name, job run ID, and the source and sink of the data stores. Let’s drill down into details.

Metrics

Metrics Visualization Dashboards Interactive

Accelerate your data warehouse migration to Amazon Redshift – Part 7

AWS Big Data

OCTOBER 17, 2023

Tens of thousands of customers use Amazon Redshift to gain business insights from their data. With Amazon Redshift, you can use standard SQL to query data across your data warehouse, operational data stores, and data lake.

Data Warehouse

Data Warehouse Data Processing Data Lake Management

Load data incrementally from transactional data lakes to data warehouses

AWS Big Data

OCTOBER 19, 2023

Data lakes and data warehouses are two of the most important data storage and management technologies in a modern data architecture. Data lakes store all of an organization’s data, regardless of its format or structure.

Data Lake

Data Lake Data Warehouse Visualization Snapshot

Implement slowly changing dimensions in a data lake using AWS Glue and Delta

AWS Big Data

MARCH 28, 2023

As organizations across the globe are modernizing their data platforms with data lakes on Amazon Simple Storage Service (Amazon S3), handling SCDs in data lakes can be challenging.

Data Lake

Data Lake Testing Snapshot Sales

Interact with Apache Iceberg tables using Amazon Athena and cross account fine-grained permissions using AWS Lake Formation

AWS Big Data

MARCH 23, 2023

Register the S3 path storing the table using Lake Formation: We register the S3 full path in Lake Formation. Navigate to the Lake Formation console. In the navigation pane, under Register and ingest, choose Data lake locations. Jack Ye is a software engineer on the Athena Data Lake and Storage team at AWS.

Interactive

Interactive Snapshot Data Lake Software

The convergence of IT and business: how CIOs are reinterpreting their role with the help of AI

CIO Business Intelligence

FEBRUARY 19, 2024

The new role of IT: business continuity. Deligia has built a business continuity strategy on the technological foundations of big data, analytics, automation, and AI. For Italo, this dialogue between IT and the business rests on a flexible IT infrastructure that has numerous automation and AI components and provides what is needed.

IT KPI Data Lake Digital Transformation

Artificial intelligence and gen AI: the four elements for moving to the “next level”

CIO Business Intelligence

MARCH 13, 2024

Indeed, according to Istat's “Report Imprese e Ict 2023”, a lack of skills is the main brake on the adoption of AI technologies in Italy: 55.1% of the companies that considered using AI but did not adopt it gave up because of a shortage of skills and of understanding of the possibilities for their own business.

Machine Learning

Machine Learning Deep Learning Big Data Testing

Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes

AWS Big Data

MAY 24, 2023

When you build your transactional data lake using Apache Iceberg to solve your functional use cases, you need to focus on operational use cases for your S3 data lake to optimize the production environment. [...] df.sortWithinPartitions("review_date").writeTo("dev.db.amazon_reviews_iceberg").append()

Data Lake

Data Lake Snapshot Metadata Optimization

Amazon Redshift announcements at AWS re:Invent 2023 to enable analytics on all your data

AWS Big Data

NOVEMBER 29, 2023

These announcements drive forward the AWS Zero-ETL vision to unify all your data, enabling you to better maximize the value of your data with comprehensive analytics and ML capabilities, and innovate faster with secure data collaboration within and across organizations.

Data Warehouse

Data Warehouse Data Lake Analytics Machine Learning

Implement tag-based access control for your data lake and Amazon Redshift data sharing with AWS Lake Formation

AWS Big Data

JULY 21, 2023

Data-driven organizations treat data as an asset and use it across different lines of business (LOBs) to drive timely insights and better business decisions. This leads to having data across many instances of data warehouses and data lakes using a modern data architecture in separate AWS accounts.

Data Lake

Data Lake Data Warehouse Marketing Management

Visualize Confluent data in Amazon QuickSight using Amazon Athena

AWS Big Data

MARCH 27, 2023

Choose Create data source. Perform interactive analysis on Confluent data: With the Athena connector set up, our streaming data is now queryable from the same service we use to analyze S3 data lakes. Aggregation: We can use standard SQL functions to aggregate the data.

Visualization

Visualization Data Lake Interactive Data-driven

Introducing native support for Apache Hudi, Delta Lake, and Apache Iceberg on AWS Glue for Apache Spark, Part 2: AWS Glue Studio Visual Editor

AWS Big Data

MARCH 20, 2023

In the first post of this series, we described how AWS Glue for Apache Spark works with Apache Hudi, Linux Foundation Delta Lake, and Apache Iceberg tables using the native support of those data lake formats. Even without prior experience using Hudi, Delta Lake, or Iceberg, you can easily achieve typical use cases.

Visualization

Visualization Data Lake Snapshot Big Data

Improve healthcare services through patient 360: A zero-ETL approach to enable near real-time data analytics

AWS Big Data

MARCH 27, 2024

Amazon Redshift integrates with AWS HealthLake and data lakes through Redshift Spectrum and Amazon S3 auto-copy features, enabling you to query data directly from files on Amazon S3. This means you no longer have to create an external schema in Amazon Redshift to use the data lake tables cataloged in the Data Catalog.

Data Analytics

Data Analytics Analytics Data Warehouse Data Lake

Getting started with AWS Glue Data Quality from the AWS Glue Data Catalog

AWS Big Data

JUNE 6, 2023

You can use AWS Glue to create, run, and monitor data integration and ETL (extract, transform, and load) pipelines and catalog your assets across multiple data stores. Hundreds of thousands of customers use data lakes for analytics and ML to make data-driven business decisions.

Data Quality

Data Quality Data Lake Data-driven Metrics

Introducing AWS Glue serverless Spark UI for better monitoring and troubleshooting

AWS Big Data

NOVEMBER 20, 2023

Monitor and Troubleshoot with Serverless Spark UI: A typical workload for AWS Glue for Apache Spark jobs is loading data from relational databases to S3-based data lakes. The sample job reads data from a MySQL database and writes to S3 in Parquet format. He is based in Tokyo, Japan.

Visualization

Visualization Optimization Data Lake Management

Use your corporate identities for analytics with Amazon EMR and AWS IAM Identity Center

AWS Big Data

APRIL 26, 2024

Set up EMR Studio: In this step, we demonstrate the actions needed from the data lake administrator to set up EMR Studio enabled for trusted identity propagation and with IAM Identity Center integration. On the Lake Formation console, choose Data lake permissions under Permissions in the navigation pane.

Analytics

Analytics Data Lake Management Enterprise

Implement a serverless CDC process with Apache Iceberg using Amazon DynamoDB and Amazon Athena

AWS Big Data

AUGUST 16, 2023

Iceberg manages large collections of files as tables, and it supports modern analytical data lake operations such as record-level insert, update, delete, and time travel queries. Iceberg also helps guarantee data correctness under concurrent write scenarios. In our query, it corresponds to the time 2023-04-18 21:34:13.970.
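As a rough picture of what a time travel query does, the following Python sketch keeps a list of snapshots ordered by commit time and returns the table state that was current at a requested timestamp. It is a toy model under stated assumptions, not Iceberg's implementation: real Iceberg snapshots point at metadata and data files rather than holding full copies, and the names `Snapshot`, `table_as_of`, and `history` are hypothetical. The timestamp 2023-04-18 21:34:13.970 is borrowed from the excerpt; the other commit times and all values are invented.

```python
from bisect import bisect_right
from datetime import datetime
from typing import List, Optional, Tuple

# Each snapshot is (commit time, full table contents at that commit).
Snapshot = Tuple[datetime, dict]


def table_as_of(snapshots: List[Snapshot], ts: datetime) -> Optional[dict]:
    """Return the table state from the latest snapshot committed at or before ts.

    Assumes snapshots are sorted by commit time; returns None if ts precedes them all.
    """
    commit_times = [committed for committed, _ in snapshots]
    idx = bisect_right(commit_times, ts)
    return snapshots[idx - 1][1] if idx > 0 else None


history: List[Snapshot] = [
    (datetime(2023, 4, 18, 21, 30, 0), {1: "v1"}),
    (datetime(2023, 4, 18, 21, 34, 13, 970000), {1: "v2"}),
    (datetime(2023, 4, 18, 21, 40, 0), {1: "v3"}),
]

state = table_as_of(history, datetime(2023, 4, 18, 21, 35, 0))
print(state)
```

Querying at 21:35:00 resolves to the snapshot committed at 21:34:13.970, because that is the most recent commit not later than the requested time.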

Data Lake

Data Lake Metadata Testing Snapshot

Amazon QuickSight helps TalentReef empower its customers to make more informed hiring decisions

AWS Big Data

MARCH 17, 2023

The response has been overwhelmingly positive, leading to the development of two additional analytics dashboards, Job Postings and Onboarding, both set to be released in the first half of 2023. Developers get the data into the data lake, and then the product team pulls in the data into QuickSight and deploys it as needed.

Dashboards

Dashboards IT Data Lake Visualization

Amazon Kinesis Data Streams: celebrating a decade of real-time data innovation

AWS Big Data

NOVEMBER 14, 2023

Ten years ago, we launched Amazon Kinesis Data Streams, the first cloud-native serverless streaming data service, to serve as the backbone for companies to move data across system boundaries, breaking data silos. Another integration launched in 2023 is with Amazon Monitron to power predictive maintenance management.

IoT

IoT Data-driven Data Lake Data Strategy

Materialized Views in Hive for Iceberg Table Format

Cloudera

FEBRUARY 8, 2024

It brings the reliability and simplicity of SQL tables to big data while enabling engines like Hive, Impala, Spark, Trino, Flink, and Presto to work with the same tables at the same time. It has been designed and developed as an open community standard to ensure compatibility across languages and implementations.

Snapshot

Snapshot Metadata Cost-Benefit Data Warehouse

AWS Lake Formation 2023 year in review

AWS Big Data

JANUARY 18, 2024

AWS Lake Formation and the AWS Glue Data Catalog form an integral part of a data governance solution for data lakes built on Amazon Simple Storage Service (Amazon S3) with multiple AWS analytics services integrating with them. In 2023, we released several updates to AWS Glue crawlers. Crawlers, salut!

Data Lake

Data Lake Metadata Data Governance Statistics

Tackling AI’s data challenges with IBM databases on AWS

IBM Big Data Hub

MARCH 14, 2024

Redefining cloud database innovation: IBM and AWS. In late 2023, IBM and AWS jointly announced the general availability of Amazon Relational Database Service (RDS) for Db2. This service streamlines data management for AI workloads across hybrid cloud environments and facilitates the scaling of Db2 databases on AWS with minimal effort.

Cost-Benefit

Cost-Benefit Metadata Optimization Management

Your guide to AWS Analytics at AWS re:Invent 2023

AWS Big Data

NOVEMBER 13, 2023

2023 AWS Analytics Superheroes: We are excited to introduce the 2023 AWS Analytics Superheroes at this year’s re:Invent conference! A shapeshifting guardian and protector of data like Data Lynx? 2:30 PM – 3:30 PM (PDT), Mandalay Bay: ANT335 | Get the most out of your data warehousing workloads.

Analytics

Analytics Data Lake Data Warehouse Data-driven

Introducing watsonx: The future of AI for business

IBM Big Data Hub

MAY 9, 2023

A data store built on open lakehouse architecture, it runs both on premises and across multi-cloud environments. Optimized for all data, analytics and AI workloads, watsonx.data combines the flexibility of a data lake with the performance of a data warehouse, helping businesses to scale data analytics and AI anywhere their data resides.

Data Warehouse

Data Warehouse Cost-Benefit Machine Learning Modeling

Exploring the AI and data capabilities of watsonx

IBM Big Data Hub

JULY 17, 2023

Watsonx.data is built on 3 core integrated components: multiple query engines, a catalog that keeps track of metadata, and storage and relational data sources which the query engines directly access. 1 When comparing published 2023 list prices normalized for VPC hours of watsonx.data to several major cloud data warehouse vendors.

Machine Learning

Machine Learning Data Warehouse Modeling Cost-Benefit

AWS re:Invent 2023 Amazon Redshift Sessions Recap

AWS Big Data

DECEMBER 18, 2023

Sessions: ANT203 | What’s new in Amazon Redshift. Watch this session to learn about the newest innovations within Amazon Redshift, the petabyte-scale AWS Cloud data warehousing solution. Easily build and train machine learning models using SQL within Amazon Redshift to generate predictive analytics and propel data-driven decision-making.

Data Warehouse

Data Warehouse Machine Learning Data-driven Data Lake

Real-time streaming data top picks you cannot miss at AWS re:Invent 2023

AWS Big Data

NOVEMBER 8, 2023

Save the date: AWS re:Invent 2023 is happening from November 27 to December 1 in Las Vegas, and you cannot miss it. In today’s data-driven landscape, the quality of data is the foundation upon which the success of organizations and innovations stands. Reserve your seat now! Register now to secure your spot!

Data-driven

Data-driven Data Lake Machine Learning Cost-Benefit

What’s cooking with Amazon Redshift at AWS re:Invent 2023

AWS Big Data

NOVEMBER 15, 2023

Connect with experts, meet with book authors on data warehousing and analytics (at the Meet the Authors event on November 29 and 30, 3:00 PM – 4:00 PM), win prizes, and learn all about the latest innovations from our AWS Analytics services. A shapeshifting guardian and protector of data like Data Lynx?

Data Lake

Data Lake Data Warehouse B2B Deep Learning

Connect your data for faster decisions with AWS

AWS Big Data

NOVEMBER 7, 2023

In this post, we discuss how we’re delivering on these investments with a number of data integration innovations that span AWS databases, analytics, business intelligence (BI), and ML services. G2 Krishnamoorthy is VP of Analytics, leading AWS data lake services, data integration, Amazon OpenSearch Service, and Amazon QuickSight.

Dashboards

Dashboards Data-driven Data Integration Data Lake

The Enduring Significance of Data Modeling in the Modern Data-Driven Enterprise

erwin

AUGUST 31, 2023

The Entity-Relationship (ER) model gains prominence as a tool for conceptual data modeling, helping to bridge the gap between business requirements and database design. 2000s – Web and Big Data: The rise of the internet and web applications drives the demand for efficient and scalable databases.

Data-driven

Data-driven Modeling Enterprise Structured Data

How Fujitsu implemented a global data mesh architecture and democratized data

AWS Big Data

MAY 1, 2024

Currently, we have approximately 120,000 employees worldwide (as of March 2023), including group companies. To achieve data-driven management, we built OneData, a data utilization platform used in the four global AWS Regions, which started operation in April 2022. Fujitsu Limited was established in Japan in 1935.

Dashboards

Dashboards Data-driven Publishing Cost-Benefit

Use the Amazon Redshift Data API to interact with Amazon Redshift Serverless

AWS Big Data

APRIL 28, 2023

You can use the following command to load data into the table we created earlier: aws redshift-data execute-statement --database dev --workgroup-name default --sql "COPY demo.green_201601 FROM 's3://us-west-2.serverless-analytics/NYC-Pub/green/green_tripdata_2016-01' Use --sql to specify your SQL commands.
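Quoting is the fiddly part of passing SQL through `--sql`, so here is a small, hedged Python sketch that assembles the same kind of `execute-statement` call as an argument list and renders it as a shell-safe string. The flags and their values (`--database dev`, `--workgroup-name default`, `--sql`) come from the excerpt; the SQL statement itself is a hypothetical placeholder, since the COPY statement in the excerpt is truncated.

```python
import shlex

# Flags copied from the excerpt; the SQL text is a hypothetical placeholder.
sql = "SELECT COUNT(*) FROM demo.green_201601"
argv = [
    "aws", "redshift-data", "execute-statement",
    "--database", "dev",
    "--workgroup-name", "default",
    "--sql", sql,
]

# shlex.join quotes each argument so the SQL survives the shell as one token.
command = shlex.join(argv)
print(command)
```

To actually run it, you could pass `argv` to `subprocess.run(argv, check=True)`, which avoids shell quoting altogether; that requires the AWS CLI to be installed and credentials to be configured.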

Interactive

Interactive Metadata Data Warehouse Data-driven

Unleashing the power of Presto: The Uber case study

IBM Big Data Hub

SEPTEMBER 25, 2023

Uber understood that digital superiority required the capture of all their transactional data, not just a sampling. They stood up a file-based data lake alongside their analytical database. Because much of the work done on their data lake is exploratory in nature, many users want to execute untested queries on petabytes of data.

OLAP

OLAP Data Lake Data-driven Snapshot
