AWS Big Data Blog

Enforce fine-grained access control on Open Table Formats via Amazon EMR integrated with AWS Lake Formation

With Amazon EMR 6.15, we launched AWS Lake Formation based fine-grained access controls (FGAC) on Open Table Formats (OTFs), including Apache Hudi, Apache Iceberg, and Delta Lake. This allows you to simplify security and governance over transactional data lakes by providing table-, column-, and row-level permissions for your Apache Spark jobs. Many large enterprises seek to use their transactional data lake to gain insights and improve decision-making. You can build a lake house architecture using Amazon EMR integrated with Lake Formation for FGAC. This combination of services allows you to conduct data analysis on your transactional data lake while ensuring secure and controlled access.

The Amazon EMR record server component supports table-, column-, row-, cell-, and nested attribute-level data filtering functionality. It extends support to Hive, Apache Hudi, Apache Iceberg, and Delta Lake formats for both read operations (including time travel and incremental queries) and write operations (on DML statements such as INSERT). Additionally, with version 6.15, Amazon EMR introduces access control protection for its application web interfaces, such as the on-cluster Spark History Server, YARN Timeline Server, and YARN Resource Manager UI.

In this post, we demonstrate how to implement FGAC on Apache Hudi tables using Amazon EMR integrated with Lake Formation.

Transaction data lake use case

Amazon EMR customers often use Open Table Formats to support their ACID transaction and time travel needs in a data lake. By preserving historical versions, data lake time travel provides benefits such as auditing and compliance, data recovery and rollback, reproducible analysis, and data exploration at different points in time.

Another popular transaction data lake use case is incremental query. Incremental query refers to a query strategy that focuses on processing and analyzing only the new or updated data within a data lake since the last query. The key idea behind incremental queries is to use metadata or change tracking mechanisms to identify the new or modified data since the last query. By identifying these changes, the query engine can optimize the query to process only the relevant data, significantly reducing the processing time and resource requirements.
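For example, Apache Hudi (used later in this post) stores a _hoodie_commit_time metadata column with every record, so an incremental query can be expressed as a filter on commits newer than a saved checkpoint. The following is a minimal PySpark sketch of this idea; the table name and checkpoint value are illustrative placeholders, not part of this post's solution:

# Minimal sketch (illustrative names): process only records committed after the
# checkpoint saved by the previous run.
last_commit_ts = "20231201000000000"  # hypothetical checkpoint value

incremental_df = spark.sql(f"""
    SELECT * FROM my_catalog.my_db.my_hudi_table
    WHERE _hoodie_commit_time > '{last_commit_ts}'
""")
print(incremental_df.count())  # only the new or changed records since the checkpoint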

Solution overview

In this post, we demonstrate how to implement FGAC on Apache Hudi tables using Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2) integrated with Lake Formation. Apache Hudi is an open source transactional data lake framework that greatly simplifies incremental data processing and data pipeline development. This new FGAC feature supports all OTFs; although we demonstrate it with Hudi in this post, we will cover the other OTF tables in follow-up posts. We use notebooks in Amazon SageMaker Studio to read and write Hudi data through an EMR cluster with different user access permissions. This reflects real-world data access scenarios; for example, an engineering user might need full data access to troubleshoot issues on a data platform, whereas data analysts may only need access to a subset of that data that doesn’t contain personally identifiable information (PII). Integrating with Lake Formation via the Amazon EMR runtime role further improves your data security posture and simplifies data control management for Amazon EMR workloads. This solution ensures a secure and controlled environment for data access, meeting the diverse needs and security requirements of different users and roles in an organization.

The following diagram illustrates the solution architecture.

Solution architecture

We conduct a data ingestion process to upsert (update and insert) a Hudi dataset to an Amazon Simple Storage Service (Amazon S3) bucket, and persist or update the table schema in the AWS Glue Data Catalog. With zero data movement, we can query the Hudi table governed by Lake Formation via various AWS services, such as Amazon Athena, Amazon EMR, and Amazon SageMaker.

When users submit a Spark job through any EMR cluster endpoints (EMR Steps, Livy, EMR Studio, and SageMaker), Lake Formation validates their privileges and instructs the EMR cluster to filter out sensitive data such as PII data.
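For example, when you submit a Spark job as an EMR step, you pass the runtime role as the step execution role, and Lake Formation evaluates permissions against that role. The following is a minimal boto3 sketch; the cluster ID, role ARN, and script location are placeholders:

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Submit a Spark job with an EMR runtime (execution) role; all identifiers are placeholders.
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    ExecutionRoleArn="arn:aws:iam::111122223333:role/my-emr-runtime-role",
    Steps=[{
        "Name": "fgac-spark-job",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", "s3://my-bucket/scripts/query_hudi.py"],
        },
    }],
)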

This solution has three different types of users with different levels of permissions to access the Hudi data:

  • hudi-db-creator-role – This is used by the data lake administrator, who has privileges to carry out DDL operations such as creating, modifying, and deleting database objects. They can define data filtering rules on Lake Formation for row-level and column-level data access control. These FGAC rules ensure that the data lake is secured and fulfills the required data privacy regulations.
  • hudi-table-pii-role – This is used by engineering users. The engineering users are capable of carrying out time travel and incremental queries on both Copy-on-Write (CoW) and Merge-on-Read (MoR) tables. They also have the privilege to access PII data at any timestamp.
  • hudi-table-non-pii-role – This is used by data analysts. Data analysts’ data access rights are governed by FGAC rules authorized by data lake administrators. They do not have visibility into columns containing PII data, such as names and addresses. Additionally, they can’t access rows of data that don’t fulfill certain conditions. For example, users can only access data rows that belong to their country.

Prerequisites

You can download the three notebooks used in this post from the GitHub repo.

Before you deploy the solution, make sure you have an AWS account with an admin IAM user.

Complete the following steps to set up your permissions:

  1. Log in to your AWS account with your admin IAM user.

Make sure you are in the us-east-1 Region.

  2. Create an S3 bucket in the us-east-1 Region (for example, emr-fgac-hudi-us-east-1-<ACCOUNT ID>).

Next, we enable Lake Formation by changing the default permission model.

  1. Sign in to the Lake Formation console as the administrator user.
  2. Choose Data Catalog settings under Administration in the navigation pane.
  3. Under Default permissions for newly created databases and tables, deselect Use only IAM access control for new databases and Use only IAM access control for new tables in new databases.
  4. Choose Save.

Data Catalog settings

Alternatively, if you started Lake Formation with the default option, you need to revoke IAMAllowedPrincipals on the resources (databases and tables) that were created.
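If you prefer to script that revocation, the following boto3 sketch revokes the IAMAllowedPrincipals grant for a single database; the database name is a placeholder, and you would repeat the call for each affected database and table:

import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")

# Revoke the implicit IAMAllowedPrincipals grant on an existing database (placeholder name).
lf.revoke_permissions(
    Principal={"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"},
    Resource={"Database": {"Name": "my_existing_database"}},
    Permissions=["ALL"],
)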

Finally, we create a key pair for Amazon EMR.

  1. On the Amazon EC2 console, choose Key pairs in the navigation pane.
  2. Choose Create key pair.
  3. For Name, enter a name (for example, emr-fgac-hudi-keypair).
  4. Choose Create key pair.

Create key pair

The generated key pair (for this post, emr-fgac-hudi-keypair.pem) will be saved to your local computer.

Next, we create an AWS Cloud9 interactive development environment (IDE).

  1. On the AWS Cloud9 console, choose Environments in the navigation pane.
  2. Choose Create environment.
  3. For Name, enter a name (for example, emr-fgac-hudi-env).
  4. Keep the other settings as default.

Cloud9 environment

  5. Choose Create.
  6. When the IDE is ready, choose Open to open it.

cloud9 environment

  7. In the AWS Cloud9 IDE, on the File menu, choose Upload Local Files.

Upload local file

  8. Upload the key pair file (emr-fgac-hudi-keypair.pem).
  9. Choose the plus sign and choose New Terminal.

new terminal

  10. In the terminal, enter the following commands:
#Create encryption certificates for EMR in transit encryption
openssl req -x509 \
-newkey rsa:1024 \
-keyout privateKey.pem \
-out certificateChain.pem \
-days 365 \
-nodes \
-subj '/C=US/ST=Washington/L=Seattle/O=MyOrg/OU=MyDept/CN=*.compute.internal'
cp certificateChain.pem trustedCertificates.pem

# Zip certificates
zip -r -X my-certs.zip certificateChain.pem privateKey.pem trustedCertificates.pem

# Upload the certificates zip file to S3 bucket
# Replace <ACCOUNT ID> with your AWS account ID
aws s3 cp ./my-certs.zip s3://emr-fgac-hudi-us-east-1-<ACCOUNT ID>/my-certs.zip

Note that the example code is a proof of concept for demonstration purposes only. For production systems, use a trusted certification authority (CA) to issue certificates. Refer to Providing certificates for encrypting data in transit with Amazon EMR encryption for details.

Deploy the solution via AWS CloudFormation

We provide an AWS CloudFormation template that automatically sets up the following services and components:

  • An S3 bucket for the data lake. It contains the sample TPC-DS dataset.
  • An EMR cluster with security configuration and public DNS enabled.
  • EMR runtime IAM roles with Lake Formation fine-grained permissions:
    • <STACK-NAME>-hudi-db-creator-role – This role is used to create Apache Hudi database and tables.
    • <STACK-NAME>-hudi-table-pii-role – This role provides permission to query all columns of Hudi tables, including columns with PII.
    • <STACK-NAME>-hudi-table-non-pii-role – This role provides permission to query Hudi tables that have filtered out PII columns by Lake Formation.
  • SageMaker Studio execution roles that allow the users to assume their corresponding EMR runtime roles.
  • Networking resources such as VPC, subnets, and security groups.

Complete the following steps to deploy the resources:

  1. Choose Quick create stack to launch the CloudFormation stack.
  2. For Stack name, enter a stack name (for example, rsv2-emr-hudi-blog).
  3. For Ec2KeyPair, enter the name of your key pair.
  4. For IdleTimeout, enter an idle timeout for the EMR cluster to avoid paying for the cluster when it’s not being used.
  5. For InitS3Bucket, enter the S3 bucket name you created to save the Amazon EMR encryption certificate .zip file.
  6. For S3CertsZip, enter the S3 URI of the Amazon EMR encryption certificate .zip file.

CloudFormation template

  7. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  8. Choose Create stack.

The CloudFormation stack deployment takes around 10 minutes.
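If you prefer to script the deployment rather than use the console, the following boto3 sketch creates the same stack; the template URL and parameter values are placeholders, and the parameter keys mirror the console walkthrough above:

import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

# Placeholder template URL and values; replace them with your own before running.
cfn.create_stack(
    StackName="rsv2-emr-hudi-blog",
    TemplateURL="https://example-bucket.s3.amazonaws.com/emr-fgac-hudi-template.yaml",
    Parameters=[
        {"ParameterKey": "Ec2KeyPair", "ParameterValue": "emr-fgac-hudi-keypair"},
        {"ParameterKey": "IdleTimeout", "ParameterValue": "7200"},
        {"ParameterKey": "InitS3Bucket", "ParameterValue": "emr-fgac-hudi-us-east-1-<ACCOUNT ID>"},
        {"ParameterKey": "S3CertsZip", "ParameterValue": "s3://emr-fgac-hudi-us-east-1-<ACCOUNT ID>/my-certs.zip"},
    ],
    Capabilities=["CAPABILITY_NAMED_IAM"],
)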

Set up Lake Formation for Amazon EMR integration

Complete the following steps to set up Lake Formation:

  1. On the Lake Formation console, choose Application integration settings under Administration in the navigation pane.
  2. Select Allow external engines to filter data in Amazon S3 locations registered with Lake Formation.
  3. Choose Amazon EMR for Session tag values.
  4. Enter your AWS account ID for AWS account IDs.
  5. Choose Save.

LF - Application integration settings

  6. Choose Databases under Data Catalog in the navigation pane.
  7. Choose Create database.
  8. For Name, enter default.
  9. Choose Create database.

LF - create database

  10. Choose Data lake permissions under Permissions in the navigation pane.
  11. Choose Grant.
  12. Select IAM users and roles.
  13. Choose your IAM roles.
  14. For Databases, choose default.
  15. For Database permissions, select Describe.
  16. Choose Grant.
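If you prefer to script these grants, the following boto3 sketch is the equivalent of the console steps for one role; the role ARN is a placeholder, and you would repeat the call for each of the runtime roles:

import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")

# Grant DESCRIBE on the default database to one EMR runtime role (placeholder ARN).
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/<STACK-NAME>-hudi-db-creator-role"},
    Resource={"Database": {"Name": "default"}},
    Permissions=["DESCRIBE"],
)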

LF - Grant data permissions

Copy Hudi JAR file to Amazon EMR HDFS

To use Hudi with Jupyter notebooks, you need to complete the following steps on the EMR cluster, which include copying a Hudi JAR file from the Amazon EMR local directory to HDFS so that you can configure a Spark session to use Hudi:

  1. Authorize inbound SSH traffic (port 22).
  2. Copy the value for Primary node public DNS (for example, ec2-XXX-XXX-XXX-XXX.compute-1.amazonaws.com) from the EMR cluster Summary section.

EMR cluster summary

  3. Go back to the AWS Cloud9 terminal where you uploaded the EC2 key pair.
  4. Run the following command to SSH into the EMR primary node. Replace the placeholder with your EMR DNS hostname:
chmod 400 emr-fgac-hudi-keypair.pem
ssh -i emr-fgac-hudi-keypair.pem hadoop@ec2-XXX-XXX-XXX-XXX.compute-1.amazonaws.com
  5. Run the following commands to copy the Hudi JAR file to HDFS:
hdfs dfs -mkdir -p /apps/hudi/lib
hdfs dfs -copyFromLocal /usr/lib/hudi/hudi-spark-bundle.jar /apps/hudi/lib/hudi-spark-bundle.jar
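With the JAR copied to HDFS, the notebooks can reference it when configuring their Spark session. The following is a minimal sketch of what such a SparkMagic %%configure cell might look like; the complete configuration is defined in the sample notebooks, so treat this only as an illustration of two key settings (the Hudi bundle location and the Lake Formation managed catalog property described later in this post):

%%configure -f
{
    "conf": {
        "spark.jars": "hdfs:///apps/hudi/lib/hudi-spark-bundle.jar",
        "spark.sql.catalog.spark_catalog.lf.managed": "true"
    }
}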

Create the Hudi database and tables in Lake Formation

Now we’re ready to create the Hudi database and tables with FGAC enabled by the EMR runtime role. The EMR runtime role is an IAM role that you can specify when you submit a job or query to an EMR cluster.

Grant database creator permission

First, let’s grant the Lake Formation database creator permission to <STACK-NAME>-hudi-db-creator-role:

  1. Log in to your AWS account as an administrator.
  2. On the Lake Formation console, choose Administrative roles and tasks under Administration in the navigation pane.
  3. Confirm that your AWS login user has been added as a data lake administrator.
  4. In the Database creator section, choose Grant.
  5. For IAM users and roles, choose <STACK-NAME>-hudi-db-creator-role.
  6. For Catalog permissions, select Create database.
  7. Choose Grant.

Register the data lake location

Next, let’s register the S3 data lake location in Lake Formation:

  1. On the Lake Formation console, choose Data lake locations under Administration in the navigation pane.
  2. Choose Register location.
  3. For Amazon S3 path, choose Browse and choose the data lake S3 bucket (<STACK_NAME>-s3bucket-XXXXXXX) created by the CloudFormation stack.
  4. For IAM role, choose <STACK-NAME>-hudi-db-creator-role.
  5. For Permission mode, select Lake Formation.
  6. Choose Register location.

LF - Register location

Grant data location permission

Next, we need to grant <STACK-NAME>-hudi-db-creator-role the data location permission:

  1. On the Lake Formation console, choose Data locations under Permissions in the navigation pane.
  2. Choose Grant.
  3. For IAM users and roles, choose <STACK-NAME>-hudi-db-creator-role.
  4. For Storage locations, enter the S3 bucket (<STACK_NAME>-s3bucket-XXXXXXX).
  5. Choose Grant.

LF - Grant permissions

Connect to the EMR cluster

Now, let’s use a Jupyter notebook in SageMaker Studio to connect to the EMR cluster with the database creator EMR runtime role:

  1. On the SageMaker console, choose Domains in the navigation pane.
  2. Choose the domain <STACK-NAME>-Studio-EMR-LF-Hudi.
  3. On the Launch menu next to the user profile <STACK-NAME>-hudi-db-creator, choose Studio.

SM - Domain details

  4. Download the notebook rsv2-hudi-db-creator-notebook.
  5. Choose the upload icon.

SM Studio - Upload

  6. Choose the downloaded Jupyter notebook and choose Open.
  7. Open the uploaded notebook.
  8. For Image, choose SparkMagic.
  9. For Kernel, choose PySpark.
  10. Leave the other configurations as default and choose Select.

SM Studio - Change environment

  11. Choose Cluster to connect to the EMR cluster.

SM Studio - connect EMR cluster

  12. Choose the EMR on EC2 cluster (<STACK-NAME>-EMR-Cluster) created with the CloudFormation stack.
  13. Choose Connect.
  14. For EMR execution role, choose <STACK-NAME>-hudi-db-creator-role.
  15. Choose Connect.

Create database and tables

Now you can follow the steps in the notebook to create the Hudi database and tables. The major steps are as follows:

  1. When you start the notebook, configure "spark.sql.catalog.spark_catalog.lf.managed": "true" to inform Spark that spark_catalog is protected by Lake Formation.
  2. Create Hudi tables using the following Spark SQL.
%%sql 
CREATE TABLE IF NOT EXISTS ${hudi_catalog}.${hudi_db}.${cow_table_name_sql}(
    c_customer_id string,
    c_birth_country string,
    c_customer_sk integer,
    c_email_address string,
    c_first_name string,
    c_last_name string,
    ts bigint
) USING hudi
LOCATION '${cow_table_location_sql}'
OPTIONS (
  type = 'cow',
  primaryKey = '${hudi_primary_key}',
  preCombineField = '${hudi_pre_combined_field}'
 ) 
PARTITIONED BY (${hudi_partitioin_field});

  3. Insert data from the source table to the Hudi tables.
%%sql
INSERT OVERWRITE ${hudi_catalog}.${hudi_db}.${cow_table_name_sql}
SELECT 
    c_customer_id ,  
    c_customer_sk,
    c_email_address,
    c_first_name,
    c_last_name,
    unix_timestamp(current_timestamp()) AS ts,
    c_birth_country
FROM ${src_df_view}
WHERE c_birth_country = 'HONG KONG' OR c_birth_country = 'CHINA' 
LIMIT 1000
  4. Insert data again into the Hudi tables.
%%sql
INSERT INTO ${hudi_catalog}.${hudi_db}.${cow_table_name_sql}
SELECT 
    c_customer_id ,  
    c_customer_sk,
    c_email_address,
    c_first_name,
    c_last_name,
    unix_timestamp(current_timestamp()) AS ts,
    c_birth_country
FROM ${insert_into_view}

Query the Hudi tables via Lake Formation with FGAC

After you create the Hudi database and tables, you’re ready to query the tables using fine-grained access control with Lake Formation. We have created two types of Hudi tables: Copy-On-Write (COW) and Merge-On-Read (MOR). The COW table stores data in a columnar format (Parquet), and each update creates a new version of the affected files during a write. This means that for every update, Hudi rewrites entire files, which can be more resource-intensive but provides faster read performance. MOR, on the other hand, is introduced for cases where COW may not be optimal, particularly for write- or change-heavy workloads. In a MOR table, each update is written as a row-based delta log for the changed records, which reduces cost and enables low-latency writes. However, the read performance might be slower compared to COW tables.
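Because a MOR table separates base files from delta logs, Hudi typically registers two query views for it in the Data Catalog: a read-optimized view (suffix _ro) that reads only compacted base files, and a real-time view (suffix _rt) that merges the delta logs at read time. The following is a minimal PySpark sketch of querying both views; MOR_TABLE_NAME_SQL is an assumed Python counterpart of the ${mor_table_name_sql} variable used in the %%sql cells later in this post:

# Read-optimized view: compacted base files only (faster reads, possibly slightly stale).
ro_count = spark.sql(f"SELECT COUNT(*) FROM {HUDI_CATALOG}.{HUDI_DATABASE}.{MOR_TABLE_NAME_SQL}_ro").collect()[0][0]

# Real-time view: base files merged with delta logs at read time (freshest data, slower reads).
rt_count = spark.sql(f"SELECT COUNT(*) FROM {HUDI_CATALOG}.{HUDI_DATABASE}.{MOR_TABLE_NAME_SQL}_rt").collect()[0][0]

print(ro_count, rt_count)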

Grant table access permission

We use the IAM role <STACK-NAME>-hudi-table-pii-role to query Hudi COW and MOR tables containing PII columns. We first grant the table access permission via Lake Formation:

  1. On the Lake Formation console, choose Data lake permissions under Permissions in the navigation pane.
  2. Choose Grant.
  3. Choose <STACK-NAME>-hudi-table-pii-role for IAM users and roles.
  4. Choose the rsv2_blog_hudi_db_1 database for Databases.
  5. For Tables, choose the four Hudi tables you created in the Jupyter notebook.

LF - Grant data permissions

  6. For Table permissions, select Select.
  7. Choose Grant.

LF - table permissions

Query PII columns

Now you’re ready to run the notebook to query the Hudi tables. Let’s follow similar steps to the previous section to run the notebook in SageMaker Studio:

  1. On the SageMaker console, navigate to the <STACK-NAME>-Studio-EMR-LF-Hudi domain.
  2. On the Launch menu next to the <STACK-NAME>-hudi-table-reader user profile, choose Studio.
  3. Upload the downloaded notebook rsv2-hudi-table-pii-reader-notebook.
  4. Open the uploaded notebook.
  5. Repeat the notebook setup steps and connect to the same EMR cluster, but use the role <STACK-NAME>-hudi-table-pii-role.

At the current stage, an FGAC-enabled EMR cluster needs to query Hudi’s commit time column to perform incremental queries and time travel. It does not support Spark’s “timestamp as of” syntax or spark.read(). We are actively working on incorporating support for both in future Amazon EMR releases with FGAC enabled.

You can now follow the steps in the notebook. The following are some highlighted steps:

  1. Run a snapshot query.
%%sql 
SELECT c_birth_country, count(*) FROM ${hudi_catalog}.${hudi_db}.${cow_table_name_sql} GROUP BY c_birth_country;
  2. Run an incremental query.
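# commit_ts holds _hoodie_commit_time values collected in an earlier notebook cell (not shown here).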
incremental_df = spark.sql(f"""
SELECT * FROM {HUDI_CATALOG}.{HUDI_DATABASE}.{COW_TABLE_NAME_SQL} WHERE _hoodie_commit_time >= {commit_ts[-1]}
""")

incremental_df.createOrReplaceTempView("incremental_view")
%%sql
SELECT 
    c_birth_country, 
    count(*) 
FROM incremental_view
GROUP BY c_birth_country;
  3. Run a time travel query.
%%sql
SELECT
    c_birth_country, COUNT(*) as count
FROM ${hudi_catalog}.${hudi_db}.${cow_table_name_sql}
WHERE _hoodie_commit_time IN
(
    SELECT DISTINCT _hoodie_commit_time FROM ${hudi_catalog}.${hudi_db}.${cow_table_name_sql} ORDER BY _hoodie_commit_time LIMIT 1 
)
GROUP BY c_birth_country
  4. Run MOR read-optimized and real-time table queries.
%%sql
SELECT
    a.email_label,
    count(*)
FROM (
    SELECT
        CASE
            WHEN c_email_address = 'UNKNOWN' THEN 'UNKNOWN'
            ELSE 'NOT_UNKNOWN'
        END AS email_label
    FROM ${hudi_catalog}.${hudi_db}.${mor_table_name_sql}_ro
    WHERE c_birth_country = 'HONG KONG'
) a
GROUP BY a.email_label;
%%sql
SELECT *  
FROM ${hudi_catalog}.${hudi_db}.${mor_table_name_sql}_ro
WHERE 
    c_birth_country = 'INDIA' OR c_first_name = 'MASKED'

Query the Hudi tables with column-level and row-level data filters

We use the IAM role <STACK-NAME>-hudi-table-non-pii-role to query Hudi tables. This role is not allowed to query any columns containing PII. We use the Lake Formation column-level and row-level data filters to implement fine-grained access control:

  1. On the Lake Formation console, choose Data filters under Data Catalog in the navigation pane.
  2. Choose Create new filter.
  3. For Data filter name, enter customer-pii-filter.
  4. Choose rsv2_blog_hudi_db_1 for Target database.
  5. Choose rsv2_blog_hudi_mor_sql_dl_customer_1 for Target table.
  6. Select Exclude columns and choose the c_customer_id, c_email_address, and c_last_name columns.
  7. Enter c_birth_country != 'HONG KONG' for Row filter expression.
  8. Choose Create filter.
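You can also create the same data filter programmatically. The following boto3 sketch mirrors the console steps above; the catalog (account) ID is a placeholder:

import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")

# Create a data cells filter that hides the PII columns and the rows from HONG KONG (placeholder account ID).
lf.create_data_cells_filter(
    TableData={
        "TableCatalogId": "111122223333",
        "DatabaseName": "rsv2_blog_hudi_db_1",
        "TableName": "rsv2_blog_hudi_mor_sql_dl_customer_1",
        "Name": "customer-pii-filter",
        "RowFilter": {"FilterExpression": "c_birth_country != 'HONG KONG'"},
        "ColumnWildcard": {"ExcludedColumnNames": ["c_customer_id", "c_email_address", "c_last_name"]},
    }
)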

LF - create data filter

  9. Choose Data lake permissions under Permissions in the navigation pane.
  10. Choose Grant.
  11. Choose <STACK-NAME>-hudi-table-non-pii-role for IAM users and roles.
  12. Choose rsv2_blog_hudi_db_1 for Databases.
  13. Choose rsv2_blog_hudi_mor_sql_dl_tpc_customer_1 for Tables.
  14. Choose customer-pii-filter for Data filters.
  15. For Data filter permissions, select Select.
  16. Choose Grant.

LF - Grant data permissions

Let’s follow similar steps to run the notebook in SageMaker Studio:

  1. On the SageMaker console, navigate to the domain <STACK-NAME>-Studio-EMR-LF-Hudi.
  2. On the Launch menu for the <STACK-NAME>-hudi-table-reader user profile, choose Studio.
  3. Upload the downloaded notebook rsv2-hudi-table-non-pii-reader-notebook and choose Open.
  4. Repeat the notebook setup steps and connect to the same EMR cluster, but select the role <STACK-NAME>-hudi-table-non-pii-role.

You can now follow the steps in the notebook. From the query results, you can see that FGAC via the Lake Formation data filter has been applied. The role can’t see the PII columns c_customer_id, c_last_name, and c_email_address. Also, the rows from HONG KONG have been filtered.

filtered query result

Clean up

After you’re done experimenting with the solution, we recommend cleaning up resources with the following steps to avoid unexpected costs:

  1. Shut down the SageMaker Studio apps for the user profiles.

The EMR cluster will be automatically terminated after the idle timeout period elapses.

  2. Delete the Amazon Elastic File System (Amazon EFS) volume created for the domain.
  3. Empty the S3 buckets created by the CloudFormation stack.
  4. On the AWS CloudFormation console, delete the stack.

Conclusion

In this post, we used Apache Hudi, one type of OTF table, to demonstrate this new feature for enforcing fine-grained access control on Amazon EMR. You can define granular permissions in Lake Formation for OTF tables and apply them via Spark SQL queries on EMR clusters. You can also use transactional data lake features such as snapshot queries, incremental queries, time travel, and DML queries. Please note that this new feature covers all OTF tables.

This feature is available starting from Amazon EMR release 6.15 in all Regions where Amazon EMR is available. With the Amazon EMR integration with Lake Formation, you can confidently manage and process big data, unlocking insights and facilitating informed decision-making while upholding data security and governance.

To learn more, refer to Enable Lake Formation with Amazon EMR, and feel free to contact your AWS Solutions Architects, who can assist you along your data journey.


About the Author

Raymond Lai is a Senior Solutions Architect who specializes in catering to the needs of large enterprise customers. His expertise lies in assisting customers with migrating intricate enterprise systems and databases to AWS and constructing enterprise data warehousing and data lake platforms. Raymond excels in identifying and designing solutions for AI/ML use cases, and he has a particular focus on AWS Serverless solutions and Event Driven Architecture design.

Bin Wang, PhD, is a Senior Analytic Specialist Solutions Architect at AWS, boasting over 12 years of experience in the ML industry, with a particular focus on advertising. He possesses expertise in natural language processing (NLP), recommender systems, diverse ML algorithms, and ML operations. He is deeply passionate about applying ML/DL and big data techniques to solve real-world problems.

Aditya Shah is a Software Development Engineer at AWS. He is interested in Databases and Data warehouse engines and has worked on performance optimisations, security compliance and ACID compliance for engines like Apache Hive and Apache Spark.

Melody Yang is a Senior Big Data Solution Architect for Amazon EMR at AWS. She is an experienced analytics leader working with AWS customers to provide best practice guidance and technical advice in order to assist their success in data transformation. Her areas of interests are open-source frameworks and automation, data engineering and DataOps.