AWS Big Data Blog

Five actionable steps to GDPR compliance (Right to be forgotten) with Amazon Redshift

The GDPR (General Data Protection Regulation) right to be forgotten, also known as the right to erasure, gives individuals the right to request the deletion of their personally identifiable information (PII) data held by organizations. This means that individuals can ask companies to erase their personal data from their systems and any third parties with whom the data was shared. Organizations must comply with these requests provided that there are no legitimate grounds for retaining the personal data, such as legal obligations or contractual requirements.

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. It is designed for analyzing large volumes of data and performing complex queries on structured and semi-structured data. Many customers are looking for best practices to keep their Amazon Redshift analytics environment compliant and have an ability to respond to GDPR right to forgotten requests.

In this post, we discuss challenges associated with implementation and architectural patterns and actionable best practices for organizations to respond to the right to be forgotten request requirements of the GDPR for data stored in Amazon Redshift.

Who does GDPR apply to?

The GDPR applies to all organizations established in the EU and to organizations, whether or not established in the EU, that process the personal data of EU individuals in connection with either the offering of goods or services to data subjects in the EU or the monitoring of behavior that takes place within the EU.

The following are key terms we use when discussing the GDPR:

  • Data subject – An identifiable living person and resident in the EU or UK, on whom personal data is held by a business or organization or service provider
  • Processor – The entity that processes the data on the instructions of the controller (for example, AWS)
  • Controller – The entity that determines the purposes and means of processing personal data (for example, an AWS customer)
  • Personal data – Information relating to an identified or identifiable person, including names, email addresses, and phone numbers

Implementing the right to be forgotten can include the following challenges:

  • Data identification – One of the main challenges is identifying all instances of personal data across various systems, databases, and backups. Organizations need to have a clear understanding of where personal data is being stored and how it is processed to effectively fulfill the deletion requests.
  • Data dependencies – Personal data can be interconnected and intertwined with other data systems, making it challenging to remove specific data without impacting the integrity of functionality of other systems or processes. It requires careful analysis to identify data dependencies and mitigate any potential risks or disruptions.
  • Data replication and backups – Personal data can exist in multiple copies due to data replication and backups. Ensuring the complete removal of data from all these copies and backups can be challenging. Organizations need to establish processes to track and manage data copies effectively.
  • Legal obligations and exemptions – The right to be forgotten is not absolute and may be subject to legal obligations or exemptions. Organizations need to carefully assess requests, considering factors such as legal requirements, legitimate interests, freedom of expression, or public interest to determine if the request can be fulfilled or if any exceptions apply.
  • Data archiving and retention – Organizations may have legal or regulatory requirements to retain certain data for a specific period. Balancing the right to be forgotten with the obligation to retain data can be a challenge. Clear policies and procedures need to be established to manage data retention and deletion appropriately.

Architecture patterns

Organizations are generally required to respond to right to be forgotten requests within 30 days from when the individual submits a request. This deadline can be extended by a maximum of 2 months taking into account the complexity and the number of the requests, provided that the data subject has been informed about the reasons for the delay within 1 month of the receipt of the request.

The following sections discuss a few commonly referenced architecture patterns, best practices, and options supported by Amazon Redshift to support your data subject’s GDPR right to be forgotten request in your organization.

Actionable Steps

Data management and governance

Addressing the challenges mentioned requires a combination of technical, operational, and legal measures. Organizations need to develop robust data governance practices, establish clear procedures for handling deletion requests, and maintain ongoing compliance with GDPR regulations.

Large organizations usually have multiple Redshift environments, databases, and tables spread across multiple Regions and accounts. To successfully respond to a data subject’s requests, organizations should have a clear strategy to determine how data is forgotten, flagged, anonymized, or deleted, and they should have clear guidelines in place for data audits.

Data mapping involves identifying and documenting the flow of personal data in an organization. It helps organizations understand how personal data moves through their systems, where it is stored, and how it is processed. By creating visual representations of data flows, organizations can gain a clear understanding of the lifecycle of personal data and identify potential vulnerabilities or compliance gaps.

Note that putting a comprehensive data strategy in place is not in scope for this post.

Audit tracking

Organizations must maintain proper documentation and audit trails of the deletion process to demonstrate compliance with GDPR requirements. A typical audit control framework should record the data subject requests (who is the data subject, when was it requested, what data, approver, due date, scheduled ETL process if any, and so on). This will help with your audit requests and provide the ability to roll back in case of accidental deletions observed during the QA process. It’s important to maintain the list of users and systems who may get impacted during this process to ensure effective communication.

Data discovery and findability

Findability is an important step of the process. Organizations need to have mechanisms to find the data under consideration in an efficient and quick manner for timely response. The following are some patterns and best practices you can employ to find the data in Amazon Redshift.

Tagging

Consider tagging your Amazon Redshift resources to quickly identify which clusters and snapshots contain the PII data, the owners, the data retention policy, and so on. Tags provide metadata about resources at a glance. Redshift resources, such as namespaces, workgroups, snapshots, and clusters can be tagged. For more information about tagging, refer to Tagging resources in Amazon Redshift.

Naming conventions

As a part of the modeling strategy, name the database objects (databases, schemas, tables, columns) with an indicator that they contain PII so that they can be queried using system tables (for example, make a list of the tables and columns where PII data is involved). Identifying the list of tables and users or the systems that have access to them will help streamline the communication process. The following sample SQL can help you find the databases, schemas, and tables with a name that contains PII:

SELECT
pg_catalog.pg_namespace.nspname AS schema_name,
pg_catalog.pg_class.relname AS table_name,
pg_catalog.pg_attribute.attname AS column_name,
pg_catalog.pg_database.datname AS database_name
FROM
pg_catalog.pg_namespace
JOIN pg_catalog.pg_class ON pg_catalog.pg_namespace.oid = pg_catalog.pg_class.relnamespace
JOIN pg_catalog.pg_attribute ON pg_catalog.pg_class.oid = pg_catalog.pg_attribute.attrelid
JOIN pg_catalog.pg_database ON pg_catalog.pg_attribute.attnum > 0
WHERE
pg_catalog.pg_attribute.attname LIKE '%PII%';

SELECT datname
FROM pg_database
WHERE datname LIKE '%PII%';

SELECT table_schema, table_name, column_name
FROM information_schema.columns
WHERE column_name LIKE '%PII%'

Separate PII and non-PII

Whenever possible, keep the sensitive data in a separate table, database, or schema. Isolating the data in a separate database may not always be possible. However, you can separate the non-PII columns in a separate table, for example, Customer_NonPII and Customer_PII, and then join them with an unintelligent key. This helps identify the tables that contain non-PII columns. This approach is straightforward to implement and keeps non-PII data intact, which can be useful for analysis purposes. The following figure shows an example of these tables.

PII-Non PII Example Tables

Flag columns

In the preceding tables, rows in bold are marked with Forgotten_flag=Yes. You can maintain a Forgotten_flag as a column with the default value as No and update this value to Yes whenever a request to be forgotten is received. Also, as a best practice from HIPAA, do a batch deletion once in a month. The downstream and upstream systems need to respect this flag and include this in their processing. This helps identify the rows that need to be deleted. For our example, we can use the following code:

Delete from Customer_PII where forgotten_flag=“Yes”

Use Master data management system

Organizations that maintain a master data management system maintain a golden record for a customer, which acts as a single version of truth from multiple disparate systems. These systems also contain crosswalks with several peripheral systems that contain the natural key of the customer and golden record. This technique helps find customer records and related tables. The following is a representative example of a crosswalk table in a master data management system.

Example of a MDM Records

Use AWS Lake Formation

Some organizations have use cases where you can share the data across multiple departments and business units and use Amazon Redshift data sharing. We can use AWS Lake Formation tags to tag the database objects and columns and define fine-grained access controls on who can have the access to use data. Organizations can have a dedicated resource with access to all tagged resources. With Lake Formation, you can centrally define and enforce database-, table-, column-, and row-level access permissions of Redshift data shares and restrict user access to objects within a data share.

By sharing data through Lake Formation, you can define permissions in Lake Formation and apply those permissions to data shares and their objects. For example, if you have a table containing employee information, you can use column-level filters to help prevent employees who don’t work in the HR department from seeing sensitive information. Refer to AWS Lake Formation-managed Redshift shares for more details on the implementation.

Use Amazon DataZone

Amazon DataZone introduces a business metadata catalog. Business metadata provides information authored or used by businesses and gives context to organizational data. Data discovery is a key task that business metadata can support. Data discovery uses centrally defined corporate ontologies and taxonomies to classify data sources and allows you to find relevant data objects. You can add business metadata in Amazon DataZone to support data discovery.

Data erasure

By using the approaches we’ve discussed, you can find the clusters, databases, tables, columns, snapshots that contain the data to be deleted. The following are some methods and best practices for data erasure.

Restricted backup

In some use cases, you may have to keep data backed up to align with government regulations for a certain period of time. It’s a good idea to take the backup of the data objects before deletion and keep it for an agreed-upon retention time. You can use AWS Backup to take automatic or manual backups. AWS Backup allows you to define a central backup policy to manage the data protection of your applications. For more information, refer to New – Amazon Redshift Support in AWS Backup.

Physical deletes

After we find the tables that contain the data, we can delete the data using the following code (using the flagging technique discussed earlier):

Delete from Customer_PII where forgotten_flag=“Yes”

It’s a good practice to delete data at a specified schedule, such as once every 25–30 days, so that it is simpler to maintain the state of the database.

Logical deletes

You may need to keep data in a separate environment for audit purposes. You can employ Amazon Redshift row access policies and conditional dynamic masking policies to filter and anonymize the data.

You can use row access policies on Forgotten_flag=No on the tables that contain PII data so that the designated users can only see the necessary data. Refer to Achieve fine-grained data security with row-level access control in Amazon Redshift for more information about how to implement row access policies.

You can use conditional dynamic data masking policies so that designated users can see the redacted data. With dynamic data masking (DDM) in Amazon Redshift, organizations can help protect sensitive data in your data warehouse. You can manipulate how Amazon Redshift shows sensitive data to the user at query time without transforming it in the database. You control access to data through masking policies that apply custom obfuscation rules to a given user or role. That way, you can respond to changing privacy requirements without altering the underlying data or editing SQL queries.

Dynamic data masking policies hide, obfuscate, or pseudonymize data that matches a given format. When attached to a table, the masking expression is applied to one or more of its columns. You can further modify masking policies to only apply them to certain users or user-defined roles that you can create with role-based access control (RBAC). Additionally, you can apply DDM on the cell level by using conditional columns when creating your masking policy.

Organizations can use conditional dynamic data masking to redact sensitive columns (for example, names) where the forgotten flag column value is TRUE, and the other columns display the full values.

Backup and restore

Data from Redshift clusters can be transferred, exported, or copied to different AWS services or outside of the cloud. Organizations should have an effective governance process to detect and remove data to align with the GDPR compliance requirement. However, this is beyond the scope of this post.

Amazon Redshift offers backups and snapshots of the data. After deleting the PII data, organizations should also purge the data from their backups. To do so, you need to restore the snapshot to a new cluster, remove the data, and take a fresh backup. The following figure illustrates this workflow.

It’s good practice to keep the retention period at 29 days (if applicable) so that the backups are cleared after 30 days. Organizations can also set the backup schedule to a certain date (for example, the first of every month).

Backup and Restore

Communication

It’s important to communicate to the users and processes who may be impacted by this deletion. The following query helps identify the list of users and groups who have access to the affected tables:

SELECT
nspname AS schema_name,
relname AS table_name,
attname AS column_name,
usename AS user_name,
groname AS group_name
FROM pg_namespace
JOIN pg_class ON pg_namespace.oid = pg_class.relnamespace
JOIN pg_attribute ON pg_class.oid = pg_attribute.attrelid
LEFT JOIN pg_group ON pg_attribute.attacl::text LIKE '%' || groname || '%'
LEFT JOIN pg_user ON pg_attribute.attacl::text LIKE '%' || usename || '%'
WHERE
pg_attribute.attname LIKE '%PII%'
AND (usename IS NOT NULL OR groname IS NOT NULL);

Security controls

Maintaining security is of great importance in GDPR compliance. By implementing robust security measures, organizations can help protect personal data from unauthorized access, breaches, and misuse, thereby helping maintain the privacy rights of individuals. Security plays a crucial role in upholding the principles of confidentiality, integrity, and availability of personal data. AWS offers a comprehensive suite of services and features that can support GDPR compliance and enhance security measures.

The GDPR does not change the AWS shared responsibility model, which continues to be relevant for customers. The shared responsibility model is a useful approach to illustrate the different responsibilities of AWS (as a data processor or subprocessor) and customers (as either data controllers or data processors) under the GDPR.

Under the shared responsibility model, AWS is responsible for securing the underlying infrastructure that supports AWS services (“Security of the Cloud”), and customers, acting either as data controllers or data processors, are responsible for personal data they upload to AWS services (“Security in the Cloud”).

AWS offers a GDPR-compliant AWS Data Processing Addendum (AWS DPA), which enables you to comply with GDPR contractual obligations. The AWS DPA is incorporated into the AWS Service Terms.

Article 32 of the GDPR requires that organizations must “…implement appropriate technical and organizational measures to ensure a level of security appropriate to the risk, including …the pseudonymization and encryption of personal data[…].” In addition, organizations must “safeguard against the unauthorized disclosure of or access to personal data.” Refer to the Navigating GDPR Compliance on AWS whitepaper for more details.

Conclusion

In this post, we delved into the significance of GDPR and its impact on safeguarding privacy rights. We discussed five commonly followed best practices that organizations can reference for responding to GDPR right to be forgotten requests for data that resides in Redshift clusters. We also highlighted that the GDPR does not change the AWS shared responsibility model.

We encourage you to take charge of your data privacy today. Prioritizing GPDR compliance and data privacy will not only strengthen trust, but also build customer loyalty and safeguard personal information in digital era. If you need assistance or guidance, reach out to an AWS representative. AWS has teams of Enterprise Support Representatives, Professional Services Consultants, and other staff to help with GDPR questions. You can contact us with questions. To learn more about GDPR compliance when using AWS services, refer to the General Data Protection Regulation (GDPR) Center. To learn more about the right to be forgotten, refer to Right to Erasure.

Disclaimer: The information provided above is not a legal advice. It is intended to showcase commonly followed best practices. It is crucial to consult with your organization’s privacy officer or legal counsel and determine appropriate solutions.


About the Authors

YaduKishore ProfileYadukishore Tatavarthi  is a Senior Partner Solutions Architect supporting Healthcare and life science customers at Amazon Web Services. He has been helping the customers over the last 20 years in building the enterprise data strategies, advising customers on cloud implementations, migrations, reference architecture creation, data modeling best practices, data lake/warehouses architecture, and other technical processes.

Sudhir GuptaSudhir Gupta is a Principal Partner Solutions Architect, Analytics Specialist at AWS with over 18 years of experience in Databases and Analytics. He helps AWS partners and customers design, implement, and migrate large-scale data & analytics (D&A) workloads. As a trusted advisor to partners, he enables partners globally on AWS D&A services, builds solutions/accelerators, and leads go-to-market initiatives

Deepak SinghDeepak Singh is a Senior Solutions Architect at Amazon Web Services with 20+ years of experience in Data & AIA. He enjoys working with AWS partners and customers on building scalable analytical solutions for their business outcomes. When not at work, he loves spending time with family or exploring new technologies in analytics and AI space.