AWS Big Data Blog

Enforce boundaries on AWS Glue interactive sessions

AWS Glue interactive sessions allow engineers to build, test, and run data preparation and analytics workloads in an interactive notebook. Interactive sessions provide isolated development environments, take care of the underlying compute cluster, and allow for configuration to stop idling resources.

AWS Glue interactive sessions provide default recommended configurations and also allow users to customize the session to meet their needs. For example, you can provision more workers to experiment on a larger dataset or set the idle timeout for long-running workloads. Because these options can be changed freely depending on the workload, you may need to ensure that they stay within specific boundaries and apply a control mechanism.

In this post, we present the process of deploying a reusable solution to enforce AWS Glue interactive session limits on three options: connection, number of workers, and maximum idle time. The first option addresses the need for applying custom inspection and controls on traffic, for example by enforcing an interactive session to only be run inside a VPC. The other two enforce limits on costs and usage of AWS Glue resources by enforcing an upper boundary on the number of workers and idle time per session. You can further extend the solution for other properties or services within AWS Glue.

Overview of solution

The proposed architecture is built on serverless components and runs whenever a new AWS Glue interactive session is created.

Architecture Diagram of the Solution

The workflow steps are as follows:

  1. A data engineer creates a new AWS Glue interactive session either through the AWS Management Console or in a Jupyter notebook locally.
  2. The interactive session produces a new event to AWS CloudTrail for the CreateSession event with all relevant information to identify and inspect a session as soon as the session is initiated.
  3. An Amazon EventBridge rule filters the CloudTrail events and invokes an AWS Lambda function to inspect the CreateSession event.
  4. The Lambda function inspects the CreateSession event and checks for all defined boundary conditions. Currently, the boundaries configurable with this solution are limited to maximum number of workers, idle timeout in minutes, and deployment with connection enforced.
  5. If any of the defined boundary conditions is not met (for example, too many workers are provisioned for the session), the function, depending on the provided configuration, ends the interactive session immediately and sends an email via Amazon Simple Notification Service (Amazon SNS). If the session hasn’t started yet, the function waits for it to start before taking any action.
  6. If the session was stopped, an email is sent to an SNS topic. There is no information available in the interactive session notebook on the reason for the ending of the session. Therefore, additional context information is provided through the SNS topic to the data engineers.
  7. If the function fails, the sessions are logged in a dead-letter queue inside Amazon Simple Queue Service (Amazon SQS). Furthermore, the queue is monitored and in case of a message, it will trigger an Amazon CloudWatch alarm.
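The inspection and enforcement in steps 4 and 5 can be sketched roughly as follows. This is a minimal illustration, not the code from the repository: the Limits class, the function names, and the CloudTrail key names (numberOfWorkers, idleTimeout, connections) are assumptions, and the SNS notification is left out. The Glue client is passed in as a parameter; with boto3 it would be boto3.client("glue"), whose get_session and stop_session calls are used here as one possible way to end a session.

```python
import time
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Limits:
    """Hypothetical container for the three boundaries named in this post."""
    max_workers: int
    max_idle_timeout_minutes: int
    enforce_vpc_connection: bool


def find_violations(request_parameters: dict, limits: Limits) -> List[str]:
    """Return one message per boundary that the CreateSession request breaks.

    The key names below are assumptions about how CloudTrail records the
    request; verify them against a real event before relying on them.
    """
    violations = []
    workers = request_parameters.get("numberOfWorkers")
    if workers is not None and workers > limits.max_workers:
        violations.append(f"numberOfWorkers={workers} exceeds {limits.max_workers}")
    idle = request_parameters.get("idleTimeout")
    if idle is not None and idle > limits.max_idle_timeout_minutes:
        violations.append(
            f"idleTimeout={idle} exceeds {limits.max_idle_timeout_minutes}"
        )
    raw = request_parameters.get("connections") or {}
    names = raw.get("connections", []) if isinstance(raw, dict) else list(raw)
    if limits.enforce_vpc_connection and not names:
        violations.append("no AWS Glue connection attached to the session")
    return violations


def stop_when_ready(glue_client, session_id: str,
                    poll_seconds: float = 5.0, max_polls: int = 60) -> bool:
    """Poll while the session is PROVISIONING, then stop it once it is READY.

    Returns True if a stop was requested, False if the session reached any
    other state (for example FAILED or STOPPED) or the polling budget ran out.
    """
    for _ in range(max_polls):
        status = glue_client.get_session(Id=session_id)["Session"]["Status"]
        if status == "READY":
            glue_client.stop_session(Id=session_id)
            return True
        if status != "PROVISIONING":
            return False
        time.sleep(poll_seconds)
    return False
```

Keeping find_violations free of AWS calls makes the boundary logic easy to unit test, while stop_when_ready isolates the part that needs credentials and a running session.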

The following steps walk you through how to build and deploy the solution. The code is available in the GitHub repo.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Overview of the deployed resources

All the necessary resources are defined in an AWS CloudFormation file located under cfn/template.yaml. To deploy those resources, we use AWS Serverless Application Model (AWS SAM), which enables us to conveniently build and package all the dependencies and also manages the AWS CloudFormation steps for us.

The CloudFormation stack deploys the following resources:

  • A Lambda function with its library, both defined under the directory src/functions. The function is the control. It will validate that the session is started within the limits defined.
  • An EventBridge rule. This rule listens to CloudTrail events and, when a new interactive session is created, triggers the control Lambda function.
  • An SQS dead-letter queue (DLQ) attached to the Lambda function. This keeps a record of events that triggered a Lambda function failure.
  • Two CloudWatch alarms monitoring the Lambda function failures and the messages in the DLQ.
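To make the EventBridge rule more concrete, the following sketch shows what an event pattern for CreateSession calls recorded by CloudTrail could look like, written as a Python dictionary. The pattern in the repository template may differ. The matches function is a deliberately simplified local stand-in for EventBridge filtering (exact values and nested objects only), included only so the pattern can be tried against sample events.

```python
# Event pattern for Glue CreateSession calls delivered through CloudTrail.
# The exact pattern used by the solution's template is not shown in this
# post, so treat this as an illustrative assumption.
CREATE_SESSION_PATTERN = {
    "source": ["aws.glue"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["glue.amazonaws.com"],
        "eventName": ["CreateSession"],
    },
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: every pattern key must exist in the event.

    A list in the pattern means "the event value must be one of these";
    a dict means "recurse into the nested object". Real EventBridge
    patterns support many more operators that are not modeled here.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True
```

Filtering on eventName in the rule keeps the Lambda function from being invoked for unrelated AWS Glue API calls.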

If notification via email is enabled, two more resources are deployed:

Additionally, AWS CloudFormation deploys all the necessary AWS Identity and Access Management (IAM) roles and policies, and an AWS Key Management Service (AWS KMS) key to ensure that the exchanged data is encrypted.

Deploy the solution

To facilitate the deployment lifecycle, including the setup of the user local environment, we provide a Makefile that describes all the necessary steps. Make sure you have your AWS credentials renewed and have access to your account. For more information, refer to Configuration and credential file settings.

  • Explore the Makefile and adjust the Region and stack name as needed by modifying the values of the variables AWS_REGION and STACK_NAME.
  • Set KILL_SESSION = "True" if you want to immediately stop the interactive session that has been found out of boundaries. Allowed values are True or False; the default is True.
  • Set NOTIFICATION_EMAIL_ADDRESS = <your.email@provider.com> in the Makefile if you want to get notified when a session has been found out of boundaries.
  • Set values for your controls:
    • ENFORCE_VPC_CONNECTION to stop sessions not running inside a VPC (true or false).
    • MAX_WORKERS to set the maximum number of workers for a session (numeric).
    • MAX_IDLE_TIMEOUT_MINUTES to define the maximum idle time for sessions in minutes (numeric).
  • Install all the prerequisite libraries:
    make install-pre-requisites

    These will be installed under a newly created Python virtual environment inside this repository in the directory .venv.

  • Deploy the new stack:
    make deploy

    This command will complete the following tasks:

    • Check if the prerequisites are met.
    • Run the pytest unit tests on the Python files.
    • Validate the CloudFormation template.
    • Build the artifacts (Lambda function and Lambda layers).
    • Deploy the resources via AWS SAM.
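The Makefile variables described in this section (KILL_SESSION, ENFORCE_VPC_CONNECTION, MAX_WORKERS, and MAX_IDLE_TIMEOUT_MINUTES) eventually have to be read and validated by the control function. The sketch below shows one way this could be done, assuming the values reach the function as environment variables with the same names; that mapping and the helper names are assumptions and are not taken from the repository. Only the KILL_SESSION default of True comes from this post.

```python
from typing import Mapping


def parse_bool(value: str, name: str) -> bool:
    """Accept True/False in any letter case; reject everything else."""
    lowered = value.strip().lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    raise ValueError(f"{name} must be True or False, got {value!r}")


def parse_positive_int(value: str, name: str) -> int:
    """Parse a numeric limit and make sure it is greater than zero."""
    try:
        number = int(value)
    except ValueError:
        raise ValueError(f"{name} must be numeric, got {value!r}") from None
    if number <= 0:
        raise ValueError(f"{name} must be greater than zero, got {number}")
    return number


def load_settings(env: Mapping[str, str]) -> dict:
    """Build the control settings from a mapping such as os.environ.

    KILL_SESSION falls back to True when unset; the three controls are
    treated as required here, which is a choice made for this sketch.
    """
    return {
        "kill_session": parse_bool(env.get("KILL_SESSION", "True"), "KILL_SESSION"),
        "enforce_vpc_connection": parse_bool(
            env["ENFORCE_VPC_CONNECTION"], "ENFORCE_VPC_CONNECTION"
        ),
        "max_workers": parse_positive_int(env["MAX_WORKERS"], "MAX_WORKERS"),
        "max_idle_timeout_minutes": parse_positive_int(
            env["MAX_IDLE_TIMEOUT_MINUTES"], "MAX_IDLE_TIMEOUT_MINUTES"
        ),
    }
```

Inside a Lambda function, load_settings(os.environ) would be called once at startup, so a mistyped value fails fast instead of silently disabling a control.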

Test the solution

Refer to Introducing AWS Glue interactive sessions for Jupyter for information about running an interactive session. If you follow the instructions in the post (see the section Run your first code cell and author your AWS Glue notebook), the initialization of the interactive session should fail with an error similar to the following.

Example of code in the cell:

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
sc3 = SparkContext.getOrCreate()
glueContext1 = GlueContext(sc3)
spark = glueContext1.spark_session
job = Job(glueContext1)

Received output:

Authenticating with profile=XXXXXXXX
glue_role_arn defined by user: arn:aws:iam::XXXXXXXXXX:role/XXXXXXXX
Trying to create a Glue session for the kernel.
Worker Type: G.1X
Number of Workers: 5
Session ID: XXXXXXXXXXXXX
Applying the following default arguments:
--glue_kernel_version 0.35
--enable-glue-datacatalog true
Waiting for session xxxxxxxxx to get into ready status...
Session xxxxxxxxx has been created
Exception encountered while running statement: An error occurred (EntityNotFoundException) when calling the GetStatement operation: Session ID xxxxxxxxx not found

If you enabled the email feature, you should also get an email notification.

You can also check on the AWS Glue console that your session ID isn’t listed.
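The same check can be scripted. The following sketch collects all session IDs through the AWS Glue ListSessions API and follows NextToken pagination. The function name is hypothetical, and the client is passed in as a parameter; with boto3 it would be created with boto3.client("glue").

```python
from typing import List


def list_session_ids(glue_client) -> List[str]:
    """Return the IDs of all interactive sessions visible to the client."""
    ids: List[str] = []
    kwargs = {}
    while True:
        page = glue_client.list_sessions(**kwargs)
        ids.extend(page.get("Ids", []))
        token = page.get("NextToken")
        if not token:
            return ids
        # Ask for the next page on the following iteration.
        kwargs["NextToken"] = token
```

For example, after the control has acted, the expression "my-session-id" in list_session_ids(boto3.client("glue")) should evaluate to False, where my-session-id stands for the session ID printed in the notebook output.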

Clean up

Clean up the deployed resources by running the following command:

make clean-up

Note that any resources you deployed while following the post Introducing AWS Glue interactive sessions for Jupyter are not removed by this command.

Limitations

The delivery guarantee for CloudTrail events to EventBridge is best effort. This means CloudTrail attempts to deliver all events to EventBridge, but in rare cases an event might not be delivered, and the corresponding session is then not inspected. For more information, refer to Events from AWS services.

Conclusion

This post described how to build, deploy, and test a solution that enforces boundaries on AWS Glue interactive sessions: the number of workers, the idle timeout, and the use of an AWS Glue connection.

You can adapt this solution based on your needs and further extend it to allow controls on other options.

To learn more about how to use AWS Glue interactive sessions, refer to Introducing AWS Glue interactive sessions for Jupyter and Author AWS Glue jobs with PyCharm using AWS Glue interactive sessions.


About the Authors

Nicolas Jacob Baer is a Senior Cloud Application Architect with a strong focus on data engineering and machine learning, based in Switzerland. He works closely with enterprise customers to design data platforms and build advanced analytics and ML use cases.

Luca Mazzaferro is a Senior DevOps Architect at Amazon Web Services. He likes to have infrastructure automated, reproducible and secured. In his free time he likes to cook, especially pizza.

Kemeng Zhang is a Cloud Application Architect with a strong focus on machine learning and UX, based in Switzerland. She works closely with customers to design user experiences and build advanced analytics and ML use cases.

Mark Walser, a Senior Global Data Architect at Amazon Web Services, collaborates with customers to develop innovative Big Data solutions that solve business problems and speed up the adoption of AWS services. Outside of work, he finds pleasure in running, swimming, and all things related to technology.

Gal Heyne is a Product Manager for AWS Glue with a strong focus on AI/ML, data engineering, and BI, based in California. She is passionate about developing a deep understanding of customers’ business needs and collaborating with engineers to design easy-to-use data products.