AWS Big Data Blog

Multicloud data lake analytics with Amazon Athena

Many organizations operate data lakes that span multiple cloud data stores, whether because of business expansions, mergers, or cloud provider preferences that differ across business units. In these cases, you may want an integrated query layer that can seamlessly run analytical queries across these diverse cloud stores and streamline your data analytics processes. With a unified query interface, you avoid the complexity of managing multiple query tools and gain a holistic view of your data assets regardless of where they reside. You can consolidate your analytics workflows, reducing the need for extensive tooling and infrastructure management. This consolidation saves time and resources and lets teams focus on deriving insights from data rather than navigating multiple query tools and interfaces. It also breaks down silos by providing seamless access to data stored across different cloud data stores, so stakeholders can analyze data from multiple sources in a unified manner and make more informed strategic decisions.

In this post, we delve into the ways in which you can use Amazon Athena connectors to efficiently query data files residing across Azure Data Lake Storage (ADLS) Gen2, Google Cloud Storage (GCS), and Amazon Simple Storage Service (Amazon S3). Additionally, we explore the use of Athena workgroups and cost allocation tags to effectively categorize and analyze the costs associated with running analytical queries.

Solution overview

Imagine a fictional company named Oktank, which manages its data across data lakes on Amazon S3, ADLS, and GCS. Oktank wants to be able to query any of their cloud data stores and run analytical queries like joins and aggregations across the data stores without needing to transfer data to an S3 data lake. Oktank also wants to identify and analyze the costs associated with running analytics queries. To achieve this, Oktank envisions a unified data query layer using Athena.

The following diagram illustrates the high-level solution architecture.

Users run their queries from Athena, connecting to specific Athena workgroups. Athena uses connectors to federate the queries across multiple data sources. In this case, we use the Amazon Athena Azure Synapse connector to query data from ADLS Gen2 via Synapse, and the Amazon Athena GCS connector for GCS. An Athena connector is an extension of the Athena query engine. When a query runs on a federated data source using a connector, Athena invokes multiple AWS Lambda functions to read from the data source in parallel, which optimizes performance. Refer to Using Amazon Athena Federated Query for further details. The AWS Glue Data Catalog holds the metadata for the Amazon S3 and GCS data.

In the following sections, we demonstrate how to build this architecture.

Prerequisites

Before you configure your resources on AWS, set up the infrastructure required for this post in both Azure and GCP. Detailed steps and guidelines for creating the resources in Azure and GCP are beyond the scope of this post; refer to the respective documentation for details. In this section, we provide the basic steps needed to create the resources required for the post.

You can download the sample data file cust_feedback_v0.csv.

Configure the dataset for Azure

To set up the sample dataset for Azure, log in to the Azure portal and upload the file to ADLS Gen2. The following screenshot shows the file under the container blog-container under a specific storage account on ADLS Gen2.

Set up a Synapse workspace in Azure and create an external table in Synapse that points to the relevant location. The following commands provide a baseline for the actions to run within the Synapse workspace to create the resources needed for this post. Refer to the corresponding Synapse documentation for additional details as required.

-- Create database
CREATE DATABASE azure_athena_blog_db;

-- Create external file format
CREATE EXTERNAL FILE FORMAT [SynapseDelimitedTextFormat]
WITH ( FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (
FIELD_TERMINATOR = ',',
USE_TYPE_DEFAULT = FALSE,
FIRST_ROW = 2
));

-- Create master key
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '*******';

-- Create database scoped credential
CREATE DATABASE SCOPED CREDENTIAL dbscopedCreds
WITH IDENTITY = 'Managed Identity';

-- Create external data source
CREATE EXTERNAL DATA SOURCE athena_blog_datasource
WITH ( LOCATION = 'abfss://blog-container@xxxxxxud1.dfs.core.windows.net/',
CREDENTIAL = dbscopedCreds
);

-- Create external table
CREATE EXTERNAL TABLE dbo.customer_feedbacks_azure (
[data_key] nvarchar(4000),
[data_load_date] nvarchar(4000),
[data_location] nvarchar(4000),
[product_id] nvarchar(4000),
[customer_email] nvarchar(4000),
[customer_name] nvarchar(4000),
[comment1] nvarchar(4000),
[comment2] nvarchar(4000)
)
WITH (
LOCATION = 'cust_feedback_v0.csv',
DATA_SOURCE = athena_blog_datasource,
FILE_FORMAT = [SynapseDelimitedTextFormat]
);

-- Create login and user
CREATE LOGIN bloguser1 WITH PASSWORD = '****';
CREATE USER bloguser1 FROM LOGIN bloguser1;

-- Grant SELECT on the schema
GRANT SELECT ON SCHEMA::dbo TO [bloguser1];
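
As an optional sanity check (a minimal example, assuming the objects above were created in azure_athena_blog_db), you can preview the external table from the Synapse workspace to confirm it resolves the uploaded file:

-- Preview a few rows through the external table
SELECT TOP 5 * FROM dbo.customer_feedbacks_azure;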

Note down the user name, password, database name, and the serverless or dedicated SQL endpoint you use—you need these in the subsequent steps.

This completes the setup on Azure for the sample dataset.

Configure the dataset for GCS

To set up the sample dataset for GCS, upload the file to a GCS bucket.

Create a GCP service account and grant it access to the bucket.

In addition, create a JSON key for the service account. The content of the key is needed in subsequent steps.

This completes the setup on GCP for our sample dataset.

Deploy the AWS infrastructure

You can now run the provided AWS CloudFormation stack to create the solution resources. Identify an AWS Region in which you want to create the resources and ensure you use the same Region throughout the setup and verifications.

Refer to the following table for the necessary parameters that you must provide. You can leave other parameters at their default values or modify them according to your requirement.

Parameter Name Expected Value
AzureSynapseUserName User name for the Synapse database you created.
AzureSynapsePwd Password for the Synapse database user.
AzureSynapseURL Synapse JDBC URL, in the following format: jdbc:sqlserver://<sqlendpoint>;databaseName=<databasename> (for example, jdbc:sqlserver://xxxxg-ondemand.sql.azuresynapse.net;databaseName=azure_athena_blog_db).
GCSSecretKey Content of the secret key file from GCP.
UserAzureADLSOnlyUserPassword AWS Management Console password for the Azure-only user. This user can only query data from ADLS.
UserGCSOnlyUserPassword AWS Management Console password for the GCS-only user. This user can only query data from GCP GCS.
UserMultiCloudUserPassword AWS Management Console password for the multi-cloud user. This user can query data from any of the cloud stores.

The stack provisions the VPC, subnets, S3 buckets, Athena workgroups, and AWS Glue database and tables. It creates two secrets in AWS Secrets Manager to store the GCS secret key and the Synapse user name and password. You use these secrets when creating the Athena connectors.

The stack also creates three AWS Identity and Access Management (IAM) users and grants permissions on the corresponding Athena workgroups, Athena data sources, and Lambda functions: AzureADLSUser, which can run queries on ADLS and Amazon S3; GCPGCSUser, which can query GCS and Amazon S3; and MultiCloudUser, which can query the Amazon S3, Azure ADLS Gen2, and GCS data sources. The stack does not create the Athena data sources and Lambda functions; you create these in subsequent steps when you create the Athena connectors.

The stack also attaches cost allocation tags to the Athena workgroups, the secrets in Secrets Manager, and the S3 buckets. You use these tags for cost analysis in subsequent steps.

When the stack deployment is complete, note the values of the CloudFormation stack outputs, which you use in subsequent steps.

Upload the data file to the S3 bucket created by the CloudFormation stack. You can retrieve the bucket name from the value of the key named S3SourceBucket from the stack output. This serves as the S3 data lake data for this post.
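
Optionally, you can confirm that the AWS Glue table created by the stack resolves the uploaded file by running a quick query in the Athena query editor (a minimal check; the database and table names below are the ones the stack creates and are used in the queries later in this post):

SELECT * FROM "AwsDataCatalog"."oktank_multicloudanalytics_aws"."customer_feedbacks" LIMIT 10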

You can now create the connectors.

Create the Athena Synapse connector

To set up the Azure Synapse connector, complete the following steps:

  1. On the Lambda console, create a new application.
  2. In the Application settings section, enter the value of the corresponding CloudFormation stack output key for each property, as listed in the following table.
Property Name CloudFormation Output Key
SecretNamePrefix AzureSecretName
DefaultConnectionString AzureSynapseConnectorJDBCURL
LambdaFunctionName AzureADLSLambdaFunctionName
SecurityGroupIds SecurityGroupId
SpillBucket AthenaLocationAzure
SubnetIds PrivateSubnetId

  3. Select the Acknowledgement check box and choose Deploy.

Wait for the application to be deployed before proceeding to the next step.

Create the Athena GCS connector

To create the Athena GCS connector, complete the following steps:

  1. On the Lambda console, create a new application.
  2. In the Application settings section, enter the value of the corresponding CloudFormation stack output key for each property, as listed in the following table.
Property Name CloudFormation Output Key
SpillBucket AthenaLocationGCP
GCSSecretName GCSSecretName
LambdaFunctionName GCSLambdaFunctionName
  3. Select the Acknowledgement check box and choose Deploy.

For the GCS connector, there are some post-deployment steps to create the AWS Glue database and table for the GCS data file. In this post, the CloudFormation stack you deployed earlier already created these resources, so you don't have to create them. The stack created an AWS Glue database called oktank_multicloudanalytics_gcp and a table called customer_feedbacks under the database with the required configurations.

Log in to the Lambda console to verify the Lambda functions were created.

Next, you create the Athena data sources corresponding to these connectors.

Create the Azure data source

Complete the following steps to create your Azure data source:

  1. On the Athena console, create a new data source.
  2. For Data sources, select Microsoft Azure Synapse.
  3. Choose Next.
  4. For Data source name, enter the value for the AthenaFederatedDataSourceNameForAzure key from the CloudFormation stack output.
  5. In the Connection details section, choose the Lambda function you created earlier for Azure.
  6. Choose Next, then choose Create data source.

You should be able to see the associated schemas for the Azure external database.
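
To verify connectivity, you can preview the Azure table through the new data source (a minimal check, assuming the data source name azure_adls_ds, which is the name used in the queries later in this post):

SELECT * FROM "azure_adls_ds"."dbo"."customer_feedbacks_azure" LIMIT 10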

Create the GCS data source

Complete the following steps to create your GCS data source:

  1. On the Athena console, create a new data source.
  2. For Data sources, select Google Cloud Storage.
  3. Choose Next.
  4. For Data source name, enter the value for the AthenaFederatedDataSourceNameForGCS key from the CloudFormation stack output.
  5. In the Connection details section, choose the Lambda function you created earlier for GCS.
  6. Choose Next, then choose Create data source.
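
As with the Azure data source, you can optionally run a quick preview to confirm the GCS table resolves (assuming the data source name gcp_gcs_ds, which is the name used in the queries later in this post):

SELECT * FROM "gcp_gcs_ds"."oktank_multicloudanalytics_gcp"."customer_feedbacks" LIMIT 10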

This completes the deployment. You can now run the multi-cloud queries from Athena.

Query the federated data sources

In this section, we demonstrate how to query the data sources using the ADLS user, GCS user, and multi-cloud user.

Run queries as the ADLS user

The ADLS user can run multi-cloud queries on ADLS Gen2 and Amazon S3 data. Complete the following steps:

  1. Get the value for UserAzureADLSUser from the CloudFormation stack output.
  2. Sign in to the Athena query editor with this user.
  3. Switch the workgroup to athena-mc-analytics-azure-wg in the Athena query editor.
  4. Choose Acknowledge to accept the workgroup settings.
  5. Run the following query to join the S3 data lake table to the ADLS data lake table:
SELECT a.data_load_date as azure_load_date, b.data_key as s3_data_key, a.data_location as azure_data_location 
FROM "azure_adls_ds"."dbo"."customer_feedbacks_azure" a 
join "AwsDataCatalog"."oktank_multicloudanalytics_aws"."customer_feedbacks" b 
ON cast(a.data_key as integer) = b.data_key

Run queries as the GCS user

The GCS user can run multi-cloud queries on GCS and Amazon S3 data. Complete the following steps:

  1. Get the value for UserGCPGCSUser from the CloudFormation stack output.
  2. Sign in to the Athena query editor with this user.
  3. Switch the workgroup to athena-mc-analytics-gcp-wg in the Athena query editor.
  4. Choose Acknowledge to accept the workgroup settings.
  5. Run the following query to join the S3 data lake table to the GCS data lake table:
SELECT a.data_load_date as gcs_load_date, b.data_key as s3_data_key, a.data_location as gcs_data_location FROM "gcp_gcs_ds"."oktank_multicloudanalytics_gcp"."customer_feedbacks" a
join "AwsDataCatalog"."oktank_multicloudanalytics_aws"."customer_feedbacks" b 
ON a.data_key = b.data_key

Run queries as the multi-cloud user

The multi-cloud user can run queries that can access data from any cloud store. Complete the following steps:

  1. Get the value for UserMultiCloudUser from the CloudFormation stack output.
  2. Sign in to the Athena query editor with this user.
  3. Switch the workgroup to athena-mc-analytics-multi-wg in the Athena query editor.
  4. Choose Acknowledge to accept the workgroup settings.
  5. Run the following query to join data across the multiple cloud stores:
SELECT a.data_load_date as adls_load_date, b.data_key as s3_data_key, c.data_location as gcs_data_location 
FROM "azure_adls_ds"."dbo"."CUSTOMER_FEEDBACKS_AZURE" a 
join "AwsDataCatalog"."oktank_multicloudanalytics_aws"."customer_feedbacks" b 
on cast(a.data_key as integer) = b.data_key join "gcp_gcs_ds"."oktank_multicloudanalytics_gcp"."customer_feedbacks" c 
on b.data_key = c.data_key

Cost analysis with cost allocation tags

When you run multi-cloud queries, you need to carefully consider the data transfer costs associated with each cloud provider. Refer to the corresponding cloud documentation for details. The cost reports highlighted in this section refer to the AWS infrastructure and service usage costs. The storage and other associated costs with ADLS, Synapse, and GCS are not included.

Let’s see how to handle cost analysis for the multiple scenarios we have discussed.

The CloudFormation stack you deployed earlier added user-defined cost allocation tags, as shown in the following screenshot.

Sign in to the AWS Billing and Cost Management console and enable these cost allocation tags. It may take up to 24 hours for the cost allocation tags to become available and be reflected in AWS Cost Explorer.

To track the cost of the Lambda functions deployed as part of the GCS and Synapse connectors, you can use the AWS generated cost allocation tags, as shown in the following screenshot.

You can use these tags on the Billing and Cost Management console to determine the cost per tag. We provide some sample screenshots for reference. These reports only show the cost of AWS resources used to access ADLS Gen2 or GCP GCS. The reports do not show the cost of GCP or Azure resources.

Athena costs

To view Athena costs, choose the tag athena-mc-analytics:athena:workgroup and filter on the tag values azure, gcp, and multi.

You can also use workgroups to set limits on the amount of data each workgroup can process, to track and control cost. For more information, refer to Using workgroups to control query access and costs and Separating queries and managing costs using Amazon Athena workgroups.

Amazon S3 costs

To view the costs for Amazon S3 storage (Athena query results and spill storage), choose the tag athena-mc-analytics:s3:result-spill and filter on the tag values azure, gcp, and multi.

Lambda costs

To view the costs for the Lambda functions, choose the tag aws:cloudformation:stack-name and filter on the tag values serverlessrepo-AthenaSynapseConnector and serverlessrepo-AthenaGCSConnector.

Cost allocation tags help manage and track costs effectively when you’re running multi-cloud queries. This can help you track, control, and optimize your spending while taking advantage of the benefits of multi-cloud data analytics.

Clean up

To avoid incurring further charges, delete the CloudFormation stacks to remove the resources you provisioned as part of this post. There are two additional stacks, one deployed for each connector: serverlessrepo-AthenaGCSConnector and serverlessrepo-AthenaSynapseConnector. Delete all three stacks.

Conclusion

In this post, we discussed a comprehensive solution for organizations looking to implement multi-cloud data lake analytics using Athena, enabling a consolidated view of data across diverse cloud data stores and enhancing decision-making capabilities. We focused on querying data lakes across Amazon S3, Azure Data Lake Storage Gen2, and Google Cloud Storage using Athena. We demonstrated how to set up resources on Azure, GCP, and AWS, including creating databases, tables, Lambda functions, and Athena data sources. We also provided instructions for querying federated data sources from Athena, demonstrating how you can run multi-cloud queries tailored to your specific needs. Lastly, we discussed cost analysis using AWS cost allocation tags.


About the Author

Shoukat Ghouse is a Senior Big Data Specialist Solutions Architect at AWS. He helps customers around the world build robust, efficient and scalable data platforms on AWS leveraging AWS analytics services like AWS Glue, AWS Lake Formation, Amazon Athena and Amazon EMR.