Efficiently crawl your data lake and improve data access with an AWS Glue crawler using partition indexes

In today’s world, customers manage vast amounts of data in their Amazon Simple Storage Service (Amazon S3) data lakes, which requires convoluted data pipelines to continuously understand the changes in the data layout and make them available to consuming systems. AWS Glue crawlers provide a straightforward way to catalog data in the AWS Glue Data Catalog that removes the heavy lifting when it comes to schema management and data classification. AWS Glue crawlers extract the data schema and partitions from Amazon S3 to automatically populate the Data Catalog, keeping the metadata current.

But with data growing exponentially over time, the number of partitions in a given table can grow significantly. Because analytics services like Amazon Athena query a table containing millions of partitions, the time needed to retrieve the partition increases and can cause query runtime to increase.

Today, AWS Glue crawler support has been expanded to automatically add partition indexes for newly discovered tables to optimize query processing on the partitioned dataset. Now, when the crawler creates a new Data Catalog table during a crawler run, it also creates a partition index by default, with the largest permutation of all numeric and string type partition columns as keys. The Data Catalog then creates a searchable index based on these keys, reducing the time required to retrieve and filter partition metadata on tables with millions of partitions. The creation of partition indexes benefits the analytics workloads running on Athena, Amazon EMR, Amazon Redshift Spectrum, and AWS Glue.

In this post, we describe how to create partition indexes with an AWS Glue crawler and compare the query performance improvement when accessing the crawled data with and without a partition index from Athena.

Solution overview

We use an AWS CloudFormation template to create our solution resources. In the following steps, we demonstrate how to configure the AWS Glue crawler to create a partition index using either the AWS Glue console or the AWS Command Line Interface (AWS CLI). Then we compare the query performance improvements using Athena.

Prerequisites

To follow along with this post, you must have access to an AWS Identity and Access Management (IAM) administrator role to create resources using AWS CloudFormation.

Set up your solution resources

The CloudFormation template generates the following resources:

IAM roles and policies
An AWS Glue database to hold the schema
An AWS Glue crawler pointing to a highly partitioned dataset
An Athena workgroup and bucket to store query results

Complete the following steps to set up the solution resources:

Log in to the AWS Management Console as an IAM administrator.
Choose Launch Stack to deploy the CloudFormation template:
For DatabaseName, keep the default blog_partition_index_crawlerdb.
Choose Next.
Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
Choose Create stack.
When the stack is complete, on the AWS CloudFormation console, navigate to the Outputs tab of the stack.
Note down values of DatabaseName and GlueCrawlerName.

Some of the resources that this stack deploys incur costs when in use.

Edit and run the AWS Glue crawler

To configure and run the AWS Glue crawler, complete the following steps:

On the AWS Glue console, choose Crawlers in the navigation pane.
Locate the crawler blog-partition-index-crawler and choose Edit.
In the Set output and scheduling section, under Advanced options, select Create partition indexes automatically.
Review and update the crawler settings.

Alternatively, you can configure your crawler using the AWS CLI (provide your IAM role and Region):

aws glue create-crawler --name blog-partition-index-crawler --targets '{ "S3Targets": [{ "Path": "s3://awsglue-datasets/examples/highly-partitioned-table/"}] }' --database-name "blog_partition_index_crawlerdb" --role <Crawler_IAM_role> --configuration "{\"Version\":1.0,\"CreatePartitionIndex\":true}" --region <region_name>

Now run the crawler and verify that the crawler run is complete.

This is highly partitioned dataset and will take approximately 90 minutes to complete.

Verify the partitioned table

In the AWS Glue database blog_partition_index_crawlerdb, verify that the table highly_partitioned_table is created.

By default, the crawler determines an index based on the largest permutation of partition columns of valid column types in the same order of partition columns, which are either numeric or string. For the table created by the crawler (highly_partitioned_table), we have partition columns year (string), month (string), day (string), and hour (string).

Based on this definition, the crawler created an index on the permutation of year, month, day, and hour. The crawler created the indexes prefixed with crawler_ on any partition index created by default.

Verify the same by navigating to the table highly_partitioned_table on the AWS Glue console and choosing the Indexes tab.

The crawler was able to crawl the S3 data source and successfully populate the partition indexes for the table.

Compare the query performance improvements using Athena

First, we query the table in Athena without using the partition index. To verify the tables using Athena, complete the following steps:

On the Athena console, choose crawler-primary-workgroup as the Athena workgroup and choose Acknowledge.

Run the following query:

select count(*), sum(value) from blog_partition_index_crawlerdb.highly_partitioned_table where year='1980' and month='01' and day ='01'

The following screenshot shows the query took approximately 32 seconds without filtering enabled using the partition index.

Now we enable the partition index on the Athena query:

ALTER TABLE blog_partition_index_crawlerdb.highly_partitioned_table
SET TBLPROPERTIES ('partition_filtering.enabled' = 'true')

Run the following query again and note the runtime:

select count(*), sum(value) from blog_partition_index_crawlerdb.highly_partitioned_table where year=‘1980’ and month=‘01’ and day =‘01’

The following screenshot shows the query took only 700 milliseconds, which is much faster with filtering enabled using the partition index.

Clean up

To avoid unwanted charges to your AWS account, you can delete the AWS resources:

Sign in to the CloudFormation console as the IAM admin used for creating the CloudFormation stack.
Delete the CloudFormation stack you created.

Conclusion

In this post, we explained how to configure an AWS crawler to create partition indexes and compared the query performance when accessing the data with indexes from Athena.

If no partition indexes are present on the table, AWS Glue loads all the partitions of the table, and then filters the loaded partitions, which results in inefficient retrieval of metadata. Analytics services like Redshift Spectrum, Amazon EMR, and AWS Glue ETL Spark DataFrames can now utilize indexes for fetching partitions, resulting in significant query performance.

For more information on partition indexes and query performance across various analytical engines, refer to Improve Amazon Athena query performance using AWS Glue Data Catalog partition indexes and Improve query performance using AWS Glue partition indexes.

Special thanks to everyone who contributed to this crawler feature launch: Yuhang Chen, Kyle Duong,and Mita Gavade.

About the authors

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building data mesh solutions and sharing them with the community.

Sandeep Adwankar is a Senior Technical Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.