article thumbnail

Use Apache Iceberg in your data lake with Amazon S3, AWS Glue, and Snowflake

AWS Big Data

licensed, 100% open-source data table format that helps simplify data processing on large datasets stored in data lakes. Data engineers use Apache Iceberg because it’s fast, efficient, and reliable at any scale and keeps records of how datasets change over time.

article thumbnail

Enhance query performance using AWS Glue Data Catalog column-level statistics

AWS Big Data

Today, we’re making available a new capability of AWS Glue Data Catalog that allows generating column-level statistics for AWS Glue tables. These statistics are now integrated with the cost-based optimizers (CBO) of Amazon Athena and Amazon Redshift Spectrum , resulting in improved query performance and potential cost savings.

Insiders

Sign Up for our Newsletter

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Trending Sources

article thumbnail

What is a data architect? Skills, salaries, and how to become a data framework master

CIO Business Intelligence

Data architect Armando Vázquez identifies eight common types of data architects: Enterprise data architect: These data architects oversee an organization’s overall data architecture, defining data architecture strategy and designing and implementing architectures.

article thumbnail

Build a pseudonymization service on AWS to protect sensitive data: Part 2

AWS Big Data

Amazon EMR empowers you to create, operate, and scale big data frameworks such as Apache Spark quickly and cost-effectively. For an overview of how to build an ACID compliant data lake using Iceberg, refer to Build a high-performance, ACID compliant, evolving data lake using Apache Iceberg on Amazon EMR.

Metrics 95
article thumbnail

Speed up queries with the cost-based optimizer in Amazon Athena

AWS Big Data

Athena provides a simplified, flexible way to analyze petabytes of data where it lives. You can analyze data or build applications from an Amazon Simple Storage Service (Amazon S3) data lake and 30 data sources, including on-premises data sources or other cloud systems using SQL or Python.

article thumbnail

Enhance monitoring and debugging for AWS Glue jobs using new job observability metrics

AWS Big Data

Use case A typical workload for AWS Glue for Apache Spark jobs is to load data from a relational database to a data lake with SQL-based transformations. On the Graphed metrics tab, configure your preferred statistic, period, and so on. When the example job ran, the workerUtilization metrics showed the following trend.

Metrics 96
article thumbnail

AWS Glue Data Quality is Generally Available

AWS Big Data

We are excited to announce the General Availability of AWS Glue Data Quality. Our journey started by working backward from our customers who create, manage, and operate data lakes and data warehouses for analytics and machine learning. It takes days for data engineers to identify and implement data quality rules.