AWS Big Data Blog

Migrate data from Azure Blob Storage to Amazon S3 using AWS Glue

Today, we are pleased to announce new AWS Glue connectors for Azure Blob Storage and Azure Data Lake Storage that allow you to move data bi-directionally between Azure Blob Storage, Azure Data Lake Storage, and Amazon Simple Storage Service (Amazon S3).

We’ve seen demand for applications that keep data portable across cloud environments and give you the ability to derive insights from one or more data sources. Two of the data sources you can now quickly integrate with are Azure Blob Storage, a managed service for storing both unstructured and structured data, and Azure Data Lake Storage, a data lake for analytics workloads. With these connectors, you can bring data from Azure Blob Storage and Azure Data Lake Storage separately to Amazon S3.

In this post, we use Azure Blob Storage as an example to demonstrate how the new connector works, introduce the connector’s functions, and walk you through the key steps to set it up. We cover the prerequisites, show how to subscribe to the connector in AWS Marketplace, and describe how to create and run AWS Glue for Apache Spark jobs with it. Where the Azure Data Lake Storage Gen2 connector differs, we highlight the major differences.

AWS Glue is a serverless data integration service that makes it simple to discover, prepare, and combine data for analytics, machine learning, and application development. AWS Glue natively integrates with various data stores such as MySQL, PostgreSQL, MongoDB, and Apache Kafka, along with AWS data stores such as Amazon S3, Amazon Redshift, Amazon Relational Database Service (Amazon RDS), and Amazon DynamoDB. AWS Glue Marketplace connectors allow you to discover and integrate additional data sources, such as software as a service (SaaS) applications and your custom data sources. With just a few clicks, you can search for and select connectors from AWS Marketplace and begin your data preparation workflow in minutes.

How the connectors work

In this section, we discuss how the new connectors work.

Azure Blob Storage connector

This connector relies on the Spark DataSource API and calls Hadoop’s FileSystem interface, which provides implementations for reading and writing a variety of distributed and traditional storage systems. The connector also includes the hadoop-azure module, which lets you run Apache Hadoop or Apache Spark jobs directly against data in Azure Blob Storage. AWS Glue loads the connector library from the Amazon Elastic Container Registry (Amazon ECR) repository during initialization, reads the connection credentials from AWS Secrets Manager, and reads data source configurations from input parameters. When AWS Glue has internet access, the Spark job in AWS Glue can read from and write to Azure Blob Storage.

We support the following two authentication methods: Shared Key and shared access signature (SAS) tokens:

# Method 1: Shared Key
spark.conf.set("fs.azure.account.key.youraccountname.blob.core.windows.net", "<your_account_key>")

df = spark.read.format("csv").option("header", "true").load("wasbs://yourblob@youraccountname.blob.core.windows.net/loadingtest-input/100mb")

df.write.format("csv").option("compression", "snappy").mode("overwrite").save("wasbs://<container_name>@<account_name>.blob.core.windows.net/output-CSV/20210831/")

# Method 2: Shared access signature (SAS) tokens
spark.conf.set("fs.azure.sas.yourblob.youraccountname.blob.core.windows.net", "<your_sas_token>")

df = spark.read.format("csv").option("header", "true").load("wasbs://yourblob@youraccountname.blob.core.windows.net/loadingtest-input/100mb")

df.write.format("csv").option("compression", "snappy").mode("overwrite").save("wasbs://<container_name>@<account_name>.blob.core.windows.net/output-CSV/20210831/")

Azure Data Lake Storage Gen2 connector

Using the Azure Data Lake Storage Gen2 connector is much the same as using the Azure Blob Storage connector. It uses the same library and relies on the Spark DataSource API, Hadoop’s FileSystem interface, and the Azure Blob Storage connector for Hadoop.

As of this writing, we only support the Shared Key authentication method:

# Method: Shared Key
spark.conf.set("fs.azure.account.key.youraccountname.dfs.core.windows.net", "<your_account_key>")

# Read a file from ADLS
df = spark.read.format("csv").option("header", "true").load("abfss://<container_name>@<account_name>.dfs.core.windows.net/input-csv/covid/")

# Write a file to ADLS
df.write.format("parquet").option("compression", "snappy").partitionBy("state").mode("overwrite").save("abfss://<container_name>@<account_name>.dfs.core.windows.net/output-glue3/csv-partitioned/")

Solution overview

The following architecture diagram shows how AWS Glue connects to Azure Blob Storage for data ingestion.

In the following sections, we show you how to create a new secret for Azure Blob Storage in Secrets Manager, subscribe to the AWS Glue connector, and move data from Azure Blob Storage to Amazon S3.

Prerequisites

You need the following prerequisites:

  • A storage account in Microsoft Azure and your data path in Azure Blob Storage. Prepare the storage account credentials in advance. For instructions, refer to Create a storage account shared key.
  • A Secrets Manager secret to store the Shared Key credentials, using one of the supported authentication methods.
  • An AWS Identity and Access Management (IAM) role for the AWS Glue job with the following policies:
    • AWSGlueServiceRole, which allows the AWS Glue service role access to related services.
    • AmazonEC2ContainerRegistryReadOnly, which provides read-only access to Amazon ECR repositories. This policy is needed to use the connector libraries from AWS Marketplace.
    • A Secrets Manager policy, which provides read access to the secret in Secrets Manager.
    • An S3 policy for the S3 bucket into which you load the ETL (extract, transform, and load) data from Azure Blob Storage. A minimal sketch of attaching the last two policies appears after this list.
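
The last two policies are typically added as inline policies on the AWS Glue job role. The following is a minimal sketch of attaching them with boto3; the role name, Region, account ID, and bucket name are placeholders, so adjust the resources to match your environment.

# Minimal sketch: attach inline policies to the AWS Glue job role for reading the
# secret and accessing the target S3 bucket. All names below are placeholders.
import json

import boto3

iam = boto3.client("iam")
role_name = "<your_glue_job_role>"  # placeholder role name

secrets_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "secretsmanager:GetSecretValue",
        "Resource": "arn:aws:secretsmanager:<region>:<account_id>:secret:azureblobstorage_credentials-*",
    }],
}

s3_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::<your_target_bucket>",
            "arn:aws:s3:::<your_target_bucket>/*",
        ],
    }],
}

iam.put_role_policy(RoleName=role_name, PolicyName="SecretsManagerRead",
                    PolicyDocument=json.dumps(secrets_policy))
iam.put_role_policy(RoleName=role_name, PolicyName="TargetS3Access",
                    PolicyDocument=json.dumps(s3_policy))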

Create a new secret for Azure Blob Storage in Secrets Manager

Complete the following steps to create a secret in Secrets Manager to store the Azure Blob Storage connection strings using the Shared Key authentication method:

  1. On the Secrets Manager console, choose Secrets in the navigation pane.
  2. Choose Store a new secret.
  3. For Secret type, select Other type of secret.
  4. In the Key/value pairs section, enter keys named accountName, accountKey, and container, and replace their values with your own. (A sketch of reading these values back from a job script follows these steps.)
  5. Leave the rest of the options at their default.
  6. Choose Next.
  7. Provide a name for the secret, such as azureblobstorage_credentials.
  8. Follow the rest of the steps to store the secret.
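
If you later work with the job in script mode, the following minimal sketch shows one way to read the stored credentials back with boto3. It assumes the secret name and the accountName, accountKey, and container keys used above, and that spark is the SparkSession created by the AWS Glue job boilerplate.

# Minimal sketch: retrieve the Azure Blob Storage credentials from Secrets Manager
# inside a job script. Assumes the secret created above with the keys accountName,
# accountKey, and container.
import json

import boto3

secrets = boto3.client("secretsmanager")
secret_value = secrets.get_secret_value(SecretId="azureblobstorage_credentials")
creds = json.loads(secret_value["SecretString"])

account_name = creds["accountName"]
account_key = creds["accountKey"]
container = creds["container"]

# Set the Shared Key configuration shown earlier without hardcoding the key.
# spark is the SparkSession created by the AWS Glue job boilerplate.
spark.conf.set(f"fs.azure.account.key.{account_name}.blob.core.windows.net", account_key)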

Subscribe to the AWS Glue connector for Azure Blob Storage

To subscribe to the connector, complete the following steps:

  1. Navigate to the Azure Blob Storage Connector for AWS Glue on AWS Marketplace.
  2. On the product page for the connector, use the tabs to view information about the connector, then choose Continue to Subscribe.
  3. Review the pricing terms and the seller’s End User License Agreement, then choose Accept Terms.
  4. Continue to the next step by choosing Continue to Configuration.
  5. On the Configure this software page, choose the fulfillment options and the version of the connector to use.

We have provided two options for the Azure Blob Storage Connector: AWS Glue 3.0 and AWS Glue 4.0. In this example, we focus on AWS Glue 4.0. Choose Continue to Launch.

  6. On the Launch this software page, choose Usage instructions to review the usage instructions provided by AWS.
  7. When you’re ready to continue, choose Activate the Glue connector from AWS Glue Studio.

The console will display the Create marketplace connection page in AWS Glue Studio.

Move data from Azure Blob Storage to Amazon S3

To move your data to Amazon S3, you must configure the custom connection and then set up an AWS Glue job.

Create a custom connection in AWS Glue

An AWS Glue connection stores connection information for a particular data store, including login credentials, URI strings, virtual private cloud (VPC) information, and more. Complete the following steps to create your connection:

  1. On the AWS Glue console, choose Connectors in the navigation pane.
  2. Choose Create connection.
  3. For Connector, choose Azure Blob Storage Connector for AWS Glue.
  4. For Name, enter a name for the connection (for example, AzureBlobStorageConnection).
  5. Enter an optional description.
  6. For AWS secret, enter the secret you created (azureblobstorage_credentials).
  7. Choose Create connection and activate connector.

The connector and connection information is now visible on the Connectors page.
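
If you want to confirm the connection programmatically, a minimal sketch like the following calls the AWS Glue GetConnection API with the connection name created above.

# Minimal sketch: verify that the marketplace connection exists.
import boto3

glue = boto3.client("glue")
response = glue.get_connection(Name="AzureBlobStorageConnection")
print(response["Connection"]["Name"], response["Connection"]["ConnectionType"])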

Create an AWS Glue job and configure connection options

Complete the following steps:

  1. On the AWS Glue console, choose Connectors in the navigation pane.
  2. Choose the connection you created (AzureBlobStorageConnection).
  3. Choose Create job.
  4. For Name, enter Azure Blob Storage Connector for AWS Glue. This name should be unique among all the nodes for this job.
  5. For Connection, choose the connection you created (AzureBlobStorageConnection).
  6. For Key, enter path, and for Value, enter your Azure Blob Storage path. Because we already set the container value when we created the secret, here we only need to enter the file path /input_data/.
  7. Enter another key-value pair. For Key, enter fileFormat. For Value, enter csv, because our sample data is in this format.
  8. Optionally, if the CSV file contains a header line, enter another key-value pair. For Key, enter header. For Value, enter true.
  9. To preview your data, choose the Data preview tab, then choose Start data preview session and choose the IAM role defined in the prerequisites.
  10. Choose Confirm and wait for the results to display.
  11. Select S3 as Target Location.
  12. Choose Browse S3 to see the S3 buckets that you have access to and choose one as the target destination for the data output.
  13. For the other options, use the default values.
  14. On the Job details tab, for IAM Role, choose the IAM role defined in the prerequisites.
  15. For Glue version, choose your AWS Glue version.
  16. Continue to create your ETL job. For instructions, refer to Creating ETL jobs with AWS Glue Studio.
  17. Choose Run to run your job.

When the job is complete, you can navigate to the Run details page on the AWS Glue console and check the logs in Amazon CloudWatch.
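
If you prefer to check the run status programmatically rather than in the console, a minimal sketch like the following polls the most recent run of the job; the job name is a placeholder.

# Minimal sketch: print the status of the most recent run of the job.
import boto3

glue = boto3.client("glue")
runs = glue.get_job_runs(JobName="<your_job_name>", MaxResults=1)
latest = runs["JobRuns"][0]
print(latest["Id"], latest["JobRunState"])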

The data is ingested into Amazon S3, as shown in the following screenshot. We are now able to import data from Azure Blob Storage to Amazon S3.
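
The visual job that AWS Glue Studio builds is backed by a generated script. For orientation, the following is a minimal sketch of what an equivalent script-mode job might look like. It assumes the connector is used as a marketplace Spark connector and that the connection option keys mirror the key-value pairs configured above (connectionName, path, fileFormat, header); the S3 output path is a placeholder.

# Minimal sketch of a script-mode job that reads CSV data from Azure Blob Storage
# through the marketplace connection and writes it to Amazon S3.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from Azure Blob Storage through the connection created earlier.
source = glue_context.create_dynamic_frame.from_options(
    connection_type="marketplace.spark",
    connection_options={
        "connectionName": "AzureBlobStorageConnection",
        "path": "/input_data/",
        "fileFormat": "csv",
        "header": "true",
    },
    transformation_ctx="azure_blob_source",
)

# Write the data to the target S3 bucket as CSV.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://<your_target_bucket>/output/"},
    format="csv",
    transformation_ctx="s3_target",
)

job.commit()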

Scaling considerations

In this example, we use the default AWS Glue capacity of 10 data processing units (DPUs). A DPU is a standardized unit of processing capacity that consists of 4 vCPUs of compute capacity and 16 GB of memory. To scale your AWS Glue job, you can increase the number of DPUs and also take advantage of Auto Scaling. With Auto Scaling enabled, AWS Glue automatically adds and removes workers from the cluster depending on the workload. After you choose the maximum number of workers, AWS Glue provisions the right amount of resources for the workload.
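
For example, a minimal sketch of defining a larger job with Auto Scaling enabled through the AWS SDK might look like the following; the job name, role, and script location are placeholders. With G.1X workers, one worker corresponds to one DPU, and the --enable-auto-scaling job parameter turns on Auto Scaling up to the maximum number of workers.

# Minimal sketch: define a job with up to 20 G.1X workers and Auto Scaling enabled.
# Name, role, and script location are placeholders.
import boto3

glue = boto3.client("glue")
glue.create_job(
    Name="<your_job_name>",
    Role="<your_glue_job_role>",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://<your_script_bucket>/scripts/azure_blob_to_s3.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=20,
    DefaultArguments={"--enable-auto-scaling": "true"},
)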

Clean up

To clean up your resources, complete the following steps:

  1. Remove the AWS Glue job, connection, and secret in Secrets Manager with the following commands:
    aws glue delete-job --job-name <your_job_name>
    
    aws glue delete-connection --connection-name <your_connection_name>
    
    aws secretsmanager delete-secret --secret-id <your_secretsmanager_id>
  2. If you are no longer going to use this connector, you can cancel the subscription to the Azure Blob Storage connector:
    1. On the AWS Marketplace console, go to the Manage subscriptions page.
    2. Select the subscription for the product that you want to cancel.
    3. On the Actions menu, choose Cancel subscription.
    4. Read the information provided and select the acknowledgement check box.
    5. Choose Yes, cancel subscription.
  3. Delete the data in the S3 bucket that you used in the previous steps.

Conclusion

In this post, we showed how to use AWS Glue and the new connector for ingesting data from Azure Blob Storage to Amazon S3. This connector provides access to Azure Blob Storage, facilitating cloud ETL processes for operational reporting, backup and disaster recovery, data governance, and more.

We welcome any feedback or questions in the comments section.

Appendix

When you need SAS token authentication for Azure Data Lake Storage Gen2, you can use the Azure SAS Token Provider for Hadoop. To do so, upload the provider JAR file to your S3 bucket and configure your AWS Glue job to reference that S3 location in the --extra-jars job parameter (in AWS Glue Studio, Dependent JARs path). Then save the SAS token in Secrets Manager and, using script mode, set it at runtime as the value of spark.hadoop.fs.azure.sas.fixed.token.<azure storage account>.dfs.core.windows.net in SparkConf. Learn more in the project’s README.

from pyspark.sql import SparkSession

# sas_token holds the SAS token value retrieved from Secrets Manager at runtime
spark = SparkSession.builder \
    .config("spark.hadoop.fs.azure.sas.fixed.token.<azure storage account>.dfs.core.windows.net", sas_token) \
    .getOrCreate()

About the authors

Qiushuang Feng is a Solutions Architect at AWS, responsible for Enterprise customers’ technical architecture design, consulting, and design optimization on AWS Cloud services. Before joining AWS, Qiushuang worked in IT companies such as IBM and Oracle, and accumulated rich practical experience in development and analytics.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is passionate about architecting fast-growing data environments, diving deep into distributed big data software like Apache Spark, building reusable software artifacts for data lakes, and sharing knowledge in AWS Big Data blog posts.

Shengjie Luo is a Big Data Architect on the Amazon Cloud Technology professional service team. They are responsible for solutions consulting, architecture, and delivery of AWS-based data warehouses and data lakes. They are skilled in serverless computing, data migration, cloud data integration, data warehouse planning, and data service architecture design and implementation.

Greg Huang is a Senior Solutions Architect at AWS with expertise in technical architecture design and consulting for the China G1000 team. He is dedicated to deploying and utilizing enterprise-level applications on AWS Cloud services. He possesses nearly 20 years of rich experience in large-scale enterprise application development and implementation, having worked in the cloud computing field for many years. He has extensive experience in helping various types of enterprises migrate to the cloud. Prior to joining AWS, he worked for well-known IT enterprises such as Baidu and Oracle.

Maciej Torbus is a Principal Customer Solutions Manager within Strategic Accounts at Amazon Web Services. With extensive experience in large-scale migrations, he focuses on helping customers move their applications and systems to highly reliable and scalable architectures in AWS. Outside of work, he enjoys sailing, traveling, and restoring vintage mechanical watches.