November 29, 2023 By Thomas Gaffney 6 min read

With new advances and applications in machine learning and artificial intelligence, including generative AI, generative adversarial networks, computer vision and transformers, many businesses are looking to address their most pressing real-world data challenges with both types of synthetic data: structured and unstructured. Structured synthetic data is quantitative and includes tabular data, such as numbers or values, while unstructured synthetic data is qualitative and includes text, images and video. Business leaders and data scientists across industries emphasize the need for data synthesis to fill data gaps, protect sensitive information and improve their speed to market. They are already identifying and exploring several real-life use cases for synthetic data, such as:

  • Generating synthetic tabular data to increase sample sizes and cover edge cases. You can combine this data with real datasets to improve AI model training and predictive accuracy.
  • Creating synthetic test data to expedite testing, optimization and validation of new applications and features.
  • Exploring “what-if” scenarios or new business events using synthetic data synthesized from agent-based simulations.
  • Using synthetic data to prevent the exposure of sensitive data in machine learning algorithms.
  • Sharing and monetizing a high-quality, privacy-protected synthetic replica with internal stakeholders or external business partners.

That said, synthesizing data offers stronger protection than traditional data privacy and anonymization techniques (think of masking), while also doing a better job of preserving the data’s utility. However, a lack of trust still exists among business leaders. To build that trust and drive broad adoption, vendors of synthetic data generation tools will need to address two critical questions that many business leaders ask: Will synthetic data expose my business to additional data privacy risks? How accurately does synthetic data reflect my existing data?

Fortunately, there are already best practices in place to help businesses evaluate these questions and, hopefully, to build the trust they need in synthetic data to become more competitive in today’s ever-changing markets. Let’s take a look.

Ensuring synthetic data privacy

Although considered artificial data or “fake data” because it is computer-generated rather than created by actual events (such as a customer purchase, an internet login or a patient diagnosis), synthetic data can still reveal personally identifiable information (PII) when used as training data for AI models. For instance, if a business prioritizes accuracy in generating synthetic data, the resulting output may inadvertently include too many personally identifiable attributes, thereby unknowingly increasing the company’s exposure to privacy risk. Furthermore, as data science modeling techniques, including deep learning and predictive and generative models, become increasingly sophisticated, companies and vendors must work diligently to prevent unintentional connections that could leak a person’s identity and expose them to third-party attacks.

Fortunately, enterprises interested in synthetic data can take steps to reduce their privacy risk:

Keep your data where it is

While many companies are migrating their existing software applications to the cloud for cost savings, improved performance and scalability, on-premises deployments continue to play a pivotal role in enhancing privacy and protection. The same is partly true for synthetic data. When dealing with fully synthetic data (data generated without existing data for model training) or synthetic data that contains no confidential information or PII, there is minimal risk associated with using a public cloud deployment method. However, companies should consider on-premises deployments when their synthetic data has dependencies on existing sensitive data. Although third-party cloud providers offer robust built-in security and privacy safeguards, sending and storing sensitive PII customer data in such clouds may expose your organization to potential risks and may be blocked by your privacy team.

Have control and robust protection

Not all synthetic data use cases require privacy, but some do. Therefore, risk, security and compliance leaders should implement a mechanism to control their desired level of privacy risk during the synthetic data generation process. “Differential privacy” is one such mechanism, enabling data scientists and risk teams to manage their desired level of privacy (typically within an epsilon range of 1 to 10, with lower values such as 1 representing the strongest privacy). This method masks the contribution of any individual, sharply limiting what can be inferred about a specific person, including whether their information was used at all. It automatically identifies vulnerable individual data points and introduces “noise” to obscure their specific information. Although adding noise slightly reduces output accuracy (this is the “cost” of differential privacy), it preserves far more utility and data quality than traditional data masking techniques. In other words, a differentially private synthetic dataset still reflects the statistical properties of your real dataset. Additionally, there are benefits to using differential privacy techniques, including robust data protection against potential privacy attacks, provable privacy guarantees regarding cumulative risk from successive data releases, and data transparency, as there is no need to keep differentially private computations or parameters secret.
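
To make the epsilon trade-off concrete, here is a minimal sketch of the general technique using the Laplace mechanism on a simple count query. This is an illustration of the principle only, not the mechanism used by watsonx.ai or any particular vendor; the dataset, threshold and epsilon values are invented for the example.

```python
import numpy as np

def dp_count(values, threshold, epsilon):
    """Differentially private count of records above a threshold.

    The true count has sensitivity 1 (adding or removing one person changes
    it by at most 1), so Laplace noise with scale 1/epsilon yields an
    epsilon-differentially private answer.
    """
    true_count = int((values > threshold).sum())
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Illustrative data: annual incomes for 1,000 hypothetical customers
rng = np.random.default_rng(42)
incomes = rng.lognormal(mean=11, sigma=0.5, size=1_000)

# Smaller epsilon = more noise = stronger privacy, at some cost in accuracy
for eps in (1, 5, 10):
    print(f"epsilon={eps}: ~{dp_count(incomes, 100_000, eps):.0f} customers above 100k")
```

Running this a few times shows the trade-off directly: at epsilon of 1 the reported count wobbles noticeably from run to run, while at epsilon of 10 it stays much closer to the true value.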

Have insight into privacy-related metrics

When differential privacy isn’t an option, business users should maintain a line of sight into privacy-related metrics to help them understand the extent of their privacy exposure. Here are two common metrics that, while not comprehensive, serve as a solid foundation:

  1. Leakage score: This score measures the fraction of rows in the synthetic dataset that are identical to rows in the original dataset. While a synthetic dataset may achieve high accuracy, it could compromise privacy by including too much of the original data. In this context, leakage means that real records are reproduced verbatim in the synthetic output and could be exposed to anyone the synthetic data is shared with.
  2. Proximity score: Proximity is determined by calculating the distance between rows in the original dataset and rows in the synthetic dataset. A smaller distance indicates a higher privacy risk because it makes it easier to isolate, and potentially re-identify, certain rows from the synthetic tabular data (see the sketch after this list).
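
To make these two metrics concrete, here is a hedged sketch of how they might be computed for a tabular dataset with pandas and scikit-learn. Exact formulas vary by vendor; this version treats leakage as the fraction of synthetic rows that exactly match an original row, and proximity as the average distance from each synthetic row to its nearest original row.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def leakage_score(real: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Fraction of synthetic rows that are exact copies of real rows."""
    real_rows = set(map(tuple, real.itertuples(index=False)))
    copies = sum(tuple(row) in real_rows for row in synthetic.itertuples(index=False))
    return copies / len(synthetic)

def proximity_score(real: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Mean distance from each synthetic row to its nearest real row.

    Smaller values mean synthetic rows sit very close to real ones,
    which indicates a higher re-identification risk. Assumes numeric
    columns; categorical columns would need encoding first.
    """
    scaler = StandardScaler().fit(real)
    nn = NearestNeighbors(n_neighbors=1).fit(scaler.transform(real))
    distances, _ = nn.kneighbors(scaler.transform(synthetic))
    return float(distances.mean())
```

Both functions assume the real and synthetic tables share the same columns; in practice you would also track these scores over time as the generation model is retrained.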

Evaluating synthetic data quality

Enterprise-wide adoption also requires business leaders and data scientists to have confidence in the quality of the synthetic data output. Specifically, they must quickly and easily grasp how closely the synthetic data maintains the statistical properties of their existing data. While some use cases warrant lower-fidelity synthetic data, like illustrative data for realistic product demos, internal training assets or certain AI model training scenarios, other use cases require a high degree of fidelity, such as synthesizing patient data in healthcare. In the latter case, a healthcare company may use the synthetic output to identify new patient insights that inform downstream decision-making, so business leaders must ensure that the synthetic data accurately reflects the conditions of their actual business.

Let’s look at fidelity and other quality-related metrics more closely:

Fidelity

An important metric is “fidelity”. It assesses the quality of the synthetic data in terms of its similarity to the real data and the data model. Enterprises should gain insight not only into individual column distributions (univariate) but also into relationships between columns, both pairwise (bivariate) and across many columns (multivariate). Understanding the latter is crucial due to the complexity and size of most existing data tables. Fortunately, the latest neural networks and generative AI models excel at capturing these intricate relationships in database tables and time-series data. Fidelity metrics are typically presented as bar graphs and correlation tables, which, while potentially lengthy, offer valuable insights. If you do not already have access to fidelity analytics, you can start by using open-source Python packages, such as SDMetrics.
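
Before adopting a dedicated package such as SDMetrics, a rough fidelity check can be assembled from standard tools: compare each column’s distribution with a Kolmogorov-Smirnov statistic and compare the pairwise correlation matrices. The sketch below is illustrative only and is not the SDMetrics API; it assumes numeric columns.

```python
import pandas as pd
from scipy.stats import ks_2samp

def column_fidelity(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.Series:
    """Per-column similarity: 1 - KS statistic (1.0 means identical distributions)."""
    scores = {
        col: 1.0 - ks_2samp(real[col], synthetic[col]).statistic
        for col in real.select_dtypes("number").columns
    }
    return pd.Series(scores, name="univariate_fidelity")

def correlation_drift(real: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Mean absolute difference between the two correlation matrices.

    Captures how well pairwise (bivariate) relationships are preserved;
    0.0 means the synthetic data reproduces them exactly.
    """
    num = real.select_dtypes("number").columns
    return float((real[num].corr() - synthetic[num].corr()).abs().mean().mean())
```

The per-column scores map naturally to the bar graphs mentioned above, while the correlation drift summarizes the correlation tables in a single number you can track across generation runs.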

Utility

AI models require sufficient data for effective training, and obtaining real datasets can be time-consuming. Synthetic data provides a faster alternative for training machine learning models. Therefore, it is valuable to understand the utility of synthetic data in AI model training before sharing it with the appropriate teams. Essentially, this metric compares the predictive accuracy of a machine learning model trained on real data with that of the same model trained on synthetic data.
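
A common way to quantify this is “train on synthetic, test on real”: fit the same model twice, once on real training data and once on synthetic data, then score both on a held-out slice of real data. The sketch below uses scikit-learn and assumes a binary tabular classification problem with a label column named `target`; the column name and model choice are placeholders, not a prescribed setup.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def utility_gap(real: pd.DataFrame, synthetic: pd.DataFrame, target: str = "target") -> dict:
    """Compare 'train on real' vs 'train on synthetic' on the same real test set."""
    train_real, test_real = train_test_split(real, test_size=0.3, random_state=0)

    def score(train_df: pd.DataFrame) -> float:
        model = RandomForestClassifier(n_estimators=200, random_state=0)
        model.fit(train_df.drop(columns=target), train_df[target])
        # Binary classification assumed: take the probability of the positive class
        probs = model.predict_proba(test_real.drop(columns=target))[:, 1]
        return roc_auc_score(test_real[target], probs)

    real_auc, synthetic_auc = score(train_real), score(synthetic)
    return {"train_on_real": real_auc,
            "train_on_synthetic": synthetic_auc,
            "relative_utility": synthetic_auc / real_auc}
```

A relative utility close to 1.0 suggests the synthetic data is nearly as useful for model training as the real data; a large gap flags that important signal was lost during synthesis.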

Fairness

Another important metric is “fairness”, a topic gaining prominence due to potential biases present in enterprise-collected datasets. If the existing dataset exhibits bias, the synthetic data will also be biased. Gaining insight into the extent of this bias can help enterprises recognize and potentially correct it. While fairness metrics are not as prevalent in today’s synthetic data solutions and not as critical as privacy, fidelity or utility, understanding the bias in your synthetic data will help your enterprise make informed decisions.
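
One simple way to surface this is to compare an outcome rate across a protected group in both datasets, which at least reveals whether synthesis amplified an existing imbalance. The column names below (`approved`, `gender`) are purely illustrative placeholders.

```python
import pandas as pd

def outcome_rate_by_group(df: pd.DataFrame, outcome: str, group: str) -> pd.Series:
    """Share of positive outcomes per group, e.g. loan approval rate by gender."""
    return df.groupby(group)[outcome].mean()

def bias_drift(real: pd.DataFrame, synthetic: pd.DataFrame,
               outcome: str = "approved", group: str = "gender") -> pd.Series:
    """How much each group's outcome rate shifted between real and synthetic data."""
    return (outcome_rate_by_group(synthetic, outcome, group)
            - outcome_rate_by_group(real, outcome, group))
```

Values near zero indicate the synthetic data reproduces the real data’s group-level outcome rates; large positive or negative values flag groups whose representation shifted during synthesis.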

How to get started with synthetic data in watsonx.ai

AI builders and data scientists can generate synthetic tabular data by importing data from a database, uploading a file or creating a custom data schema in IBM® watsonx.ai™. The statistics-based model generates data that can help improve the predictive accuracy of AI training models through edge cases and larger sample sizes. This data can also be used to help enhance the realism of client demos and employee training materials.

Watsonx.ai is an enterprise-ready next-generation AI studio for machine learning and generative AI, powered by foundation models. With the watsonx.ai studio, AI builders, including data scientists, application developers and business analysts, can train, validate, tune and deploy both traditional machine learning and new generative AI capabilities. Watsonx.ai is designed to facilitate collaboration and scalability in AI application development and can be deployed in hybrid cloud environments.

Check out our synthetic data generator service on watsonx.ai by either accessing our free trial or scheduling a 30-minute call with one of our watsonx.ai product specialists for a guided walk-through.

