AWS Big Data Blog

Unlock scalable analytics with AWS Glue and Google BigQuery

Data integration is the foundation of robust data analytics. It encompasses the discovery, preparation, and combination of data from diverse sources, a process that is vital for data-driven decision-making in the modern data landscape. AWS Glue, a serverless data integration and extract, transform, and load (ETL) service, makes this process more accessible and efficient by eliminating much of the complexity and cost involved, allowing organizations to perform data integration tasks in minutes.

This blog post explores the newly announced managed connector for Google BigQuery and demonstrates how to build a modern ETL pipeline with AWS Glue Studio without writing code.

Overview of AWS Glue

AWS Glue is a serverless data integration service that makes it easier to discover, prepare, and combine data for analytics, machine learning (ML), and application development. AWS Glue provides all the capabilities needed for data integration, so you can start analyzing your data and putting it to use in minutes instead of months. AWS Glue provides both visual and code-based interfaces to make data integration easier. Users can more easily find and access data using the AWS Glue Data Catalog. Data engineers and ETL (extract, transform, and load) developers can visually create, run, and monitor ETL workflows in a few steps in AWS Glue Studio. Data analysts and data scientists can use AWS Glue DataBrew to visually enrich, clean, and normalize data without writing code.

Introducing Google BigQuery Spark connector

To meet the demands of diverse data integration use cases, AWS Glue now offers a native Spark connector for Google BigQuery. With AWS Glue 4.0 for Spark, you can read from and write to tables in Google BigQuery. You can read an entire table or run a custom query, and write your data using either the direct or the indirect write method. You connect to BigQuery using service account credentials stored securely in AWS Secrets Manager.
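
If you only need a subset of the data, the connector can also run a custom query instead of reading a whole table. The following is a minimal sketch of a query-based read, not a complete job script; the sourceType, query, viewsEnabled, and materializationDataset options are assumptions based on the connector's documented options, so verify them against the BigQuery connection documentation. The table-based read used in this post is shown in the appendix.

# A minimal sketch of a query-based read (not a complete job script).
# The sourceType, query, viewsEnabled, and materializationDataset options
# are assumptions; verify them against the BigQuery connection documentation.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

strong_earthquakes = glueContext.create_dynamic_frame.from_options(
    connection_type="bigquery",
    connection_options={
        "connectionName": "bq-connection",
        "parentProject": "bigquery-public-datasources",
        "sourceType": "query",
        "query": "SELECT id, eq_primary, country "
                 "FROM noaa_significant_earthquakes.earthquakes "
                 "WHERE eq_primary >= 7",
        "viewsEnabled": "true",
        "materializationDataset": "noaa_significant_earthquakes",
    },
    transformation_ctx="bq_query_read",
)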

Benefits of Google BigQuery Spark connector

  • Seamless integration: The native connector offers an intuitive and streamlined interface for data integration, reducing the learning curve.
  • Cost efficiency: Building and maintaining custom connectors can be expensive. The native connector provided by AWS Glue is a cost-effective alternative.
  • Efficiency: Data transformation tasks that previously took weeks or months can now be accomplished within minutes, optimizing efficiency.

Solution overview

In this example, you create two ETL jobs using AWS Glue with the native Google BigQuery connector.

  1. Query a BigQuery table and save the data into Amazon Simple Storage Service (Amazon S3) in Parquet format.
  2. Use the data extracted from the first job to transform and create an aggregated result to be stored in Google BigQuery.

The following diagram illustrates the solution architecture.

Prerequisites

The dataset used in this solution is the NCEI/WDS Global Significant Earthquake Database, with a global listing of over 5,700 earthquakes from 2150 BC to the present. Copy this public data into your Google BigQuery project or use your existing dataset.

Configure BigQuery connections

To connect to Google BigQuery from AWS Glue, see Configuring BigQuery connections. You must create and store your Google Cloud Platform credentials in a Secrets Manager secret, then associate that secret with a Google BigQuery AWS Glue connection.
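
For reference, the following boto3 sketch shows one way to create that secret, assuming you have downloaded a Google Cloud service account JSON key file. The file path and secret name are placeholders, and the credentials key holding the base64-encoded key file follows the linked documentation; confirm the expected secret format there.

# A minimal sketch: store the base64-encoded service account key in Secrets Manager.
# The file path and secret name are placeholders.
import base64
import json
import boto3

with open("service-account-key.json", "rb") as f:
    encoded_key = base64.b64encode(f.read()).decode("utf-8")

secretsmanager = boto3.client("secretsmanager")
secretsmanager.create_secret(
    Name="bq-connection-credentials",
    SecretString=json.dumps({"credentials": encoded_key}),
)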

Set up Amazon S3

Every object in Amazon S3 is stored in a bucket. Before you can store data in Amazon S3, create a bucket to hold the results. You can do this from the console as described in the following steps, or programmatically as sketched after them.

To create an S3 bucket:

  1. On the Amazon S3 console, choose Create bucket.
  2. Enter a globally unique Name for your bucket; for example, awsglue-demo.
  3. Choose Create bucket.
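
If you prefer to create the bucket programmatically, the following is a minimal boto3 sketch; the bucket name and Region are placeholders.

# A minimal sketch: create the results bucket with boto3 instead of the console.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Bucket names must be globally unique. Outside us-east-1, also pass
# CreateBucketConfiguration={"LocationConstraint": "<your-region>"}.
s3.create_bucket(Bucket="awsglue-demo")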

Create an IAM role for the AWS Glue ETL job

When you create the AWS Glue ETL job, you specify an AWS Identity and Access Management (IAM) role for the job to use. The role must grant access to all resources used by the job, including Amazon S3 (for any sources, targets, scripts, driver files, and temporary directories), and Secrets Manager.

For instructions, see Configure an IAM role for your ETL job.
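
For reference, the following boto3 sketch creates a role that AWS Glue can assume and attaches the AWS managed AWSGlueServiceRole policy. It is an illustration, not the exact policy set from the linked guide; you still need to grant access to your S3 bucket and your Secrets Manager secret.

# A minimal sketch: create a role AWS Glue can assume and attach the managed
# AWSGlueServiceRole policy. Add your own policies for the S3 bucket and the
# Secrets Manager secret used in this post.
import json
import boto3

iam = boto3.client("iam")

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

iam.create_role(
    RoleName="AWSGlueRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.attach_role_policy(
    RoleName="AWSGlueRole",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)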

Solution walkthrough

Create a visual ETL job in AWS Glue Studio to transfer data from Google BigQuery to Amazon S3

  1. Open the AWS Glue console.
  2. In AWS Glue, navigate to Visual ETL under the ETL jobs section and create a new ETL job using Visual with a blank canvas.
  3. Enter a Name for your AWS Glue job, for example, bq-s3-dataflow.
  4. Select Google BigQuery as the data source.
    1. Enter a name for your Google BigQuery source node, for example, noaa_significant_earthquakes.
    2. Select a Google BigQuery connection, for example, bq-connection.
    3. Enter a Parent project, for example, bigquery-public-datasources.
    4. Select Choose a single table for the BigQuery Source.
    5. Enter the table you want to migrate in the form [dataset].[table], for example, noaa_significant_earthquakes.earthquakes.
  5. Next, choose the data target as Amazon S3.
    1. Enter a Name for the target Amazon S3 node, for example, earthquakes.
    2. Select the output data Format as Parquet.
    3. Select the Compression Type as Snappy.
    4. For the S3 Target Location, enter the bucket created in the prerequisites, for example, s3://<YourBucketName>/noaa_significant_earthquakes/earthquakes/.
    5. You should replace <YourBucketName> with the name of your bucket.
  6. Next, go to the Job details tab. For IAM Role, select the IAM role from the prerequisites, for example, AWSGlueRole.
  7. Choose Save.

Run and monitor the job

  1. After your ETL job is configured, you can run the job. AWS Glue will run the ETL process, extracting data from Google BigQuery and loading it into your specified S3 location.
  2. Monitor the job’s progress in the AWS Glue console. You can see logs and job run history to ensure everything is running smoothly. You can also start and monitor the run programmatically, as sketched below.
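
The following is a minimal boto3 sketch for starting the run and polling its status from code, assuming the job name used above.

# A minimal sketch: start the job and poll the run state until it finishes.
import time
import boto3

glue = boto3.client("glue")
run_id = glue.start_job_run(JobName="bq-s3-dataflow")["JobRunId"]

while True:
    state = glue.get_job_run(JobName="bq-s3-dataflow", RunId=run_id)["JobRun"]["JobRunState"]
    print(state)
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"):
        break
    time.sleep(30)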


Data validation

  1. After the job has run successfully, validate the data in your S3 bucket to ensure it matches your expectations. You can preview the results using Amazon S3 Select, as sketched below.
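
The following is a minimal boto3 sketch of that check. The object key is a placeholder, because AWS Glue writes Parquet files with generated names under the target prefix; pick any of the written objects.

# A minimal sketch: preview a few rows of one Parquet output file with S3 Select.
# Replace the bucket name and the placeholder object key with your own values.
import boto3

s3 = boto3.client("s3")
response = s3.select_object_content(
    Bucket="<YourBucketName>",
    Key="noaa_significant_earthquakes/earthquakes/<generated-file-name>.parquet",
    ExpressionType="SQL",
    Expression="SELECT s.id, s.eq_primary, s.country FROM S3Object s LIMIT 10",
    InputSerialization={"Parquet": {}},
    OutputSerialization={"JSON": {}},
)
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))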


Automate and schedule

  1. If needed, set up job scheduling to run the ETL process regularly. You can use AWS Glue triggers to automate your ETL jobs, ensuring your S3 bucket is always up to date with the latest data from Google BigQuery; a scheduling sketch follows.
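
The following is a minimal boto3 sketch that creates a scheduled trigger for the job; the trigger name and cron expression are placeholders.

# A minimal sketch: run the job every day at 06:00 UTC with a scheduled trigger.
import boto3

glue = boto3.client("glue")
glue.create_trigger(
    Name="bq-s3-dataflow-daily",
    Type="SCHEDULED",
    Schedule="cron(0 6 * * ? *)",
    Actions=[{"JobName": "bq-s3-dataflow"}],
    StartOnCreation=True,
)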

You’ve successfully configured an AWS Glue ETL job to transfer data from Google BigQuery to Amazon S3. Next, you create the ETL job to aggregate this data and transfer it to Google BigQuery.

Finding earthquake hotspots with AWS Glue Studio Visual ETL

  1. Open the AWS Glue console.
  2. In AWS Glue, navigate to Visual ETL under the ETL jobs section and create a new ETL job using Visual with a blank canvas.
  3. Provide a name for your AWS Glue job, for example, s3-bq-dataflow.
  4. Choose Amazon S3 as the data source.
    1. Enter a Name for the source Amazon S3 node, for example, earthquakes.
    2. Select S3 location as the S3 source type.
    3. Enter the S3 bucket created in the prerequisites as the S3 URL, for example, s3://<YourBucketName>/noaa_significant_earthquakes/earthquakes/.
    4. You should replace <YourBucketName> with the name of your bucket.
    5. Select the Data format as Parquet.
    6. Select Infer schema.
  5. Next, choose Select Fields transformation.
    1. Select earthquakes as Node parents.
    2. Select fields: id, eq_primary, and country.
  6. Next, choose Aggregate transformation.
    1. Enter a Name, for example Aggregate.
    2. Choose Select Fields as Node parents.
    3. Choose eq_primary and country as the group by columns.
    4. Add id as the aggregate column and count as the aggregation function.
  7. Next, choose RenameField transformation.
    1. Enter a Name for the transform node, for example, Rename eq_primary.
    2. Choose Aggregate as Node parents.
    3. Choose eq_primary as the Current field name and enter earthquake_magnitude as the New field name.
  8. Next, choose RenameField transformation.
    1. Enter a Name for the transform node, for example, Rename count(id).
    2. Choose Rename eq_primary as Node parents.
    3. Choose count(id) as the Current field name and enter number_of_earthquakes as the New field name.
  9. Next, choose the data target as Google BigQuery.
    1. Enter a Name for your Google BigQuery target node, for example, most_powerful_earthquakes.
    2. Select a Google BigQuery connection, for example, bq-connection.
    3. Select Parent project, for example, bigquery-public-datasources.
    4. Enter the name of the Table you want to create in the form [dataset].[table], for example, noaa_significant_earthquakes.most_powerful_earthquakes.
    5. Choose Direct as the Write method.
  10. Next, go to the Job details tab. For IAM Role, select the IAM role from the prerequisites, for example, AWSGlueRole.
  11. Choose Save.

Run and monitor the job

  1. After your ETL job is configured, you can run the job. AWS Glue runs the ETL process, extracting data from Amazon S3, aggregating it, and loading the results into Google BigQuery.
  2. Monitor the job’s progress in the AWS Glue console. You can see logs and job run history to ensure everything is running smoothly.


Data validation

  1. After the job has run successfully, validate the data in your Google BigQuery dataset. This ETL job returns a list of countries where the most powerful earthquakes have occurred, by counting the number of earthquakes of a given magnitude per country. You can also check the results from code, as sketched below.
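
The following is a minimal sketch using the google-cloud-bigquery client library, which is not used elsewhere in this post; it assumes the library is installed and Google Cloud credentials are configured in your environment, and the project ID is a placeholder.

# A minimal sketch: query the aggregated table and print the top results.
from google.cloud import bigquery

client = bigquery.Client(project="bigquery-public-datasources")

query = """
    SELECT earthquake_magnitude, country, number_of_earthquakes
    FROM noaa_significant_earthquakes.most_powerful_earthquakes
    ORDER BY number_of_earthquakes DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.earthquake_magnitude, row.country, row.number_of_earthquakes)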


Automate and schedule

  1. You can set up job scheduling to run the ETL process regularly. AWS Glue allows you to automate your ETL jobs, ensuring your Google BigQuery dataset is always up to date with the latest data in Amazon S3.

That’s it! You’ve successfully set up an AWS Glue ETL job to transfer data from Amazon S3 to Google BigQuery. You can use this integration to automate the process of data extraction, transformation, and loading between these two platforms, making your data readily available for analysis and other applications.

Clean up

To avoid incurring charges, clean up the resources used in this blog post from your AWS account by completing the following steps. If you prefer to script the AWS-side cleanup, a sketch follows the list.

  1. On the AWS Glue console, choose Visual ETL in the navigation pane.
  2. From the list of jobs, select the job bq-s3-dataflow and delete it.
  3. From the list of jobs, select the job s3-bq-dataflow and delete it.
  4. On the AWS Glue console, choose Connections in the navigation pane under Data Catalog.
  5. Choose the BigQuery connection you created and delete it.
  6. On the Secrets Manager console, choose the secret you created and delete it.
  7. On the IAM console, choose Roles in the navigation pane, then select the role you created for the AWS Glue ETL job and delete it.
  8. On the Amazon S3 console, search for the S3 bucket you created, choose Empty to delete the objects, then delete the bucket.
  9. Clean up resources in your Google account by deleting the project that contains the Google BigQuery resources. Follow the documentation to clean up the Google resources.
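
If you prefer to script the AWS-side cleanup, the following is a minimal boto3 sketch; the resource names match the examples in this post, so adjust them to what you actually created.

# A minimal sketch: delete the AWS resources created in this post.
import boto3

glue = boto3.client("glue")
glue.delete_job(JobName="bq-s3-dataflow")
glue.delete_job(JobName="s3-bq-dataflow")
glue.delete_connection(ConnectionName="bq-connection")

boto3.client("secretsmanager").delete_secret(
    SecretId="bq-connection-credentials",  # placeholder secret name
    ForceDeleteWithoutRecovery=True,
)

iam = boto3.client("iam")
iam.detach_role_policy(
    RoleName="AWSGlueRole",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)
# Detach or delete any other policies you added before deleting the role.
iam.delete_role(RoleName="AWSGlueRole")

# Empty the bucket, then delete it. Replace <YourBucketName> with your bucket name.
bucket = boto3.resource("s3").Bucket("<YourBucketName>")
bucket.objects.all().delete()
bucket.delete()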

Conclusion

The integration of AWS Glue with Google BigQuery simplifies the analytics pipeline, reduces time-to-insight, and facilitates data-driven decision-making. It empowers organizations to streamline data integration and analytics. The serverless nature of AWS Glue means no infrastructure management, and you pay only for the resources consumed while your jobs are running. As organizations increasingly rely on data for decision-making, this native Spark connector provides an efficient, cost-effective, and agile solution to swiftly meet data analytics needs.

If you’re interested in seeing how to read from and write to tables in Google BigQuery with AWS Glue, take a look at the step-by-step video tutorial. In this tutorial, we walk through the entire process, from setting up the connection to running the data transfer flow. For more information on AWS Glue, visit AWS Glue.

Appendix

If you are looking to implement this example using code instead of the AWS Glue console, use the following code snippets.

Reading data from Google BigQuery and writing data into Amazon S3

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# STEP-1 Read the data from the Google BigQuery table
noaa_significant_earthquakes_node1697123333266 = (
    glueContext.create_dynamic_frame.from_options(
        connection_type="bigquery",
        connection_options={
            "connectionName": "bq-connection",
            "parentProject": "bigquery-public-datasources",
            "sourceType": "table",
            "table": "noaa_significant_earthquakes.earthquakes",
        },
        transformation_ctx="noaa_significant_earthquakes_node1697123333266",
    )
)
# STEP-2 Write the data read from Google BigQuery into Amazon S3
# You should replace <YourBucketName> with the name of your bucket.
earthquakes_node1697157772747 = glueContext.write_dynamic_frame.from_options(
    frame=noaa_significant_earthquakes_node1697123333266,
    connection_type="s3",
    format="glueparquet",
    connection_options={
        "path": "s3://<YourBucketName>/noaa_significant_earthquakes/earthquakes/",
        "partitionKeys": [],
    },
    format_options={"compression": "snappy"},
    transformation_ctx="earthquakes_node1697157772747",
)

job.commit()

Reading and aggregating data from Amazon S3 and writing into Google BigQuery

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql import functions as SqlFuncs

def sparkAggregate(
    glueContext, parentFrame, groups, aggs, transformation_ctx
) -> DynamicFrame:
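    # Group the parent DynamicFrame by the supplied columns, apply the requested
    # aggregation functions (for example, count on id), and convert the result
    # back into a DynamicFrame.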
    aggsFuncs = []
    for column, func in aggs:
        aggsFuncs.append(getattr(SqlFuncs, func)(column))
    result = (
        parentFrame.toDF().groupBy(*groups).agg(*aggsFuncs)
        if len(groups) > 0
        else parentFrame.toDF().agg(*aggsFuncs)
    )
    return DynamicFrame.fromDF(result, glueContext, transformation_ctx)

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# STEP-1 Read the data from Amazon S3 bucket
# You should replace <YourBucketName> with the name of your bucket.
earthquakes_node1697218776818 = glueContext.create_dynamic_frame.from_options(
    format_options={},
    connection_type="s3",
    format="parquet",
    connection_options={
        "paths": [
            "s3://<YourBucketName>/noaa_significant_earthquakes/earthquakes/"
        ],
        "recurse": True,
    },
    transformation_ctx="earthquakes_node1697218776818",
)

# STEP-2 Select fields
SelectFields_node1697218800361 = SelectFields.apply(
    frame=earthquakes_node1697218776818,
    paths=["id", "eq_primary", "country"],
    transformation_ctx="SelectFields_node1697218800361",
)

# STEP-3 Aggregate data
Aggregate_node1697218823404 = sparkAggregate(
    glueContext,
    parentFrame=SelectFields_node1697218800361,
    groups=["eq_primary", "country"],
    aggs=[["id", "count"]],
    transformation_ctx="Aggregate_node1697218823404",
)

# STEP-4 Rename eq_primary to earthquake_magnitude
Renameeq_primary_node1697219483114 = RenameField.apply(
    frame=Aggregate_node1697218823404,
    old_name="eq_primary",
    new_name="earthquake_magnitude",
    transformation_ctx="Renameeq_primary_node1697219483114",
)

# STEP-5 Rename count(id) to number_of_earthquakes
Renamecountid_node1697220511786 = RenameField.apply(
    frame=Renameeq_primary_node1697219483114,
    old_name="`count(id)`",
    new_name="number_of_earthquakes",
    transformation_ctx="Renamecountid_node1697220511786",
)

# STEP-6 Write the aggregated data to Google BigQuery
most_powerful_earthquakes_node1697220563923 = (
    glueContext.write_dynamic_frame.from_options(
        frame=Renamecountid_node1697220511786,
        connection_type="bigquery",
        connection_options={
            "connectionName": "bq-connection",
            "parentProject": "bigquery-public-datasources",
            "writeMethod": "direct",
            "table": "noaa_significant_earthquakes.most_powerful_earthquakes",
        },
        transformation_ctx="most_powerful_earthquakes_node1697220563923",
    )
)

job.commit()

About the authors

Kartikay Khator is a Solutions Architect in Global Life Sciences at Amazon Web Services (AWS). He is passionate about building innovative and scalable solutions to meet the needs of customers, focusing on AWS Analytics services. Beyond the tech world, he is an avid runner and enjoys hiking.

Kamen Sharlandjiev is a Sr. Big Data and ETL Solutions Architect and Amazon AppFlow expert. He’s on a mission to make life easier for customers who are facing complex data integration challenges. His secret weapon? Fully managed, low-code AWS services that can get the job done with minimal effort and no coding.

Anshul Sharma is a Software Development Engineer on the AWS Glue team. He drives the connectivity charter, which gives AWS Glue customers a native way to connect any data source (data warehouses, data lakes, NoSQL stores, and more) to AWS Glue ETL jobs. Beyond the tech world, he is a cricket and soccer lover.