Organizations launching strategic AI initiatives face a difficult balance: data must be accessible without compromising privacy-regulation compliance or the pace of business innovation. Customer trust and brand reputation are key competitive advantages, so accelerated digital transformation and growth depend on businesses protecting sensitive customer data while still preserving its utility for AI and analytics teams.

Three questions organizations need to confront when it comes to leveraging customer data are:

  • How can initiatives inside and outside my organization work securely with personal information (PI) and sensitive data?
  • How can I remove PI from datasets without affecting the integrity of the data or accuracy of my projects’ results?
  • How can I actively protect PI and sensitive data whenever they are accessed, wherever they reside?

When organizations lack ready answers to these questions, AI projects often stall and collaboration using meaningful data is limited. Gartner predicts that by 2024, the use of data protection techniques will increase industry collaborations on AI projects by 70%.

In my blog, I discussed the new IBM® AutoPrivacy framework and the key use cases delivered via IBM Cloud Pak® for Data. Today I will expand on the advanced data protection use case, which is one of the key capabilities in the AutoPrivacy framework.

Data protection and de-identification of sensitive data are not new concepts. Although these concepts have been well known for many years, most enterprises did not employ these practices consistently. The enforcement of GDPR has drastically changed that, and in the post-GDPR era, enterprises are hyperaware of the data protection regulations they must adhere to. With the enforcement of GDPR (Europe), CCPA (California), LGPD (Brazil) and many other data protection laws in recent months, consumers are now well aware of their privacy rights and are demanding that enterprises provide transparent privacy protection approaches.

Historically, enterprises have used many methods of sensitive data protection, including redaction and various forms of masking such as substitution, shuffling or randomization. However, with the adoption of deep learning in AI, data science and analytical modeling, the risk of re-identification has been increasing. Hence, there is a need for newer data protection techniques and robust encryption algorithms that enhance privacy while also preserving the utility of the data.
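To make one of these classic techniques concrete, here is a minimal sketch of column shuffling: values stay in the dataset (so column-level statistics survive), but they are detached from their original rows. The function name and row layout are illustrative, not taken from any particular product.

```python
import random

def shuffle_column(rows, column, seed=0):
    """Shuffle one column's values across rows (a classic masking step).

    Every value still appears in the dataset, so the column's
    distribution is preserved, but each value is no longer linked
    to its original row.
    """
    rng = random.Random(seed)  # seeded for reproducible masking runs
    values = [r[column] for r in rows]
    rng.shuffle(values)
    # Rebuild rows with the shuffled column, leaving other fields intact.
    return [{**r, column: v} for r, v in zip(rows, values)]
```

As the paragraph notes, shuffling alone is increasingly vulnerable to re-identification when attackers can correlate the remaining columns, which is what motivates the stronger techniques discussed next.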

By far, the most important requirement from IBM customers has been the consistent enforcement of data protection policies, regardless of where the data resides.

Data cannot simply be de-identified randomly; important relationships must be maintained. Format preservation is a fundamental requirement. Values must be de-identified consistently across the enterprise, respecting relationships across multiple data assets. For example, de-identification of a credit card number, personal first and last names, or any other entity identifiers must be repeatable consistently across data sources in on-premises and hybrid cloud environments.
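One way to get this kind of repeatable, format-preserving behavior is keyed deterministic masking. The sketch below uses an HMAC over the original value to replace digits while keeping separators in place, so the same credit card number masks to the same output everywhere the shared key is used. This is an illustration only, not IBM's implementation; production systems would typically use a standardized format-preserving encryption mode (such as NIST FF1) and a key management service, and the key shown here is a hypothetical placeholder.

```python
import hashlib
import hmac

# Hypothetical shared secret; in practice this would come from a key
# management service so every data source masks values identically.
SECRET_KEY = b"example-only-key"

def mask_digits(value, key=SECRET_KEY):
    """Deterministically replace each digit while preserving format.

    Non-digit characters (dashes, spaces) are kept, so a masked credit
    card number still looks like a credit card number, and the same
    input always yields the same output across data sources.
    """
    digest = hmac.new(key, value.encode(), hashlib.sha256).digest()
    out, i = [], 0
    for ch in value:
        if ch.isdigit():
            # Derive a replacement digit from the keyed digest.
            out.append(str(digest[i % len(digest)] % 10))
            i += 1
        else:
            out.append(ch)  # preserve formatting characters as-is
    return "".join(out)
```

Because the mask depends only on the value and the key, joins across on-premises and cloud data sources still line up after de-identification.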

In addition, I have often encountered unique industry use cases where there is a need for special treatment of certain data elements. For example, in financial services and healthcare, the time intervals between certain dates should be the same whether unmasked or masked. The accuracy of disease treatment dates in healthcare is critical for biomedical research, so while shifting dates, it is very important to maintain the right intervals. Similarly, the interval between a date of birth and the date of an auto policy agreement (in other words, the customer's age) can make a very big difference in the cost and available features of auto insurance.
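Interval-preserving date shifting can be sketched as follows: derive a fixed offset per individual from their identifier, and apply it to every date belonging to that individual. All names here (`shift_date`, `record_id`) are illustrative assumptions, not product APIs.

```python
import hashlib
from datetime import date, timedelta

def shift_date(d, record_id, max_days=365):
    """Shift a date by a per-individual offset derived from the ID.

    Every date belonging to the same individual moves by the same
    number of days, so intervals (e.g., between admission and
    discharge, or between birth date and policy date) are preserved
    exactly while the absolute dates are masked.
    """
    h = hashlib.sha256(record_id.encode()).digest()
    # Map the hash to an offset in [-max_days, +max_days].
    offset = int.from_bytes(h[:4], "big") % (2 * max_days + 1) - max_days
    return d + timedelta(days=offset)
```

Because the offset is a pure function of the identifier, the technique stays consistent across data sources without storing a lookup table.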

Most customers require support for custom de-identification when it comes to complex, multi-field computation using a low-code or no-code approach. There are also several use cases that require the addition of statistical noise to hide individual data and only surface group level information for analytics.
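The statistical-noise idea can be illustrated with a Laplace mechanism on a count query, the textbook building block of differential privacy. This is a minimal sketch, not the mechanism any specific product uses; `epsilon` is the standard privacy-budget parameter.

```python
import random

def noisy_count(true_count, epsilon=1.0):
    """Add Laplace noise calibrated to a count query's sensitivity.

    Adding or removing one person changes a count by at most 1, so
    Laplace noise with scale 1/epsilon hides any individual's
    contribution while keeping group-level totals approximately right.
    """
    # The difference of two i.i.d. exponentials with rate `epsilon`
    # is Laplace-distributed with scale 1/epsilon.
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise
```

Individual rows become deniable, yet aggregates averaged over a group remain close to the truth, which is exactly the group-level analytics trade-off described above.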

These rich data protection and consistent policy enforcement capabilities are available via IBM Watson® Knowledge Catalog Enterprise Edition to address a wide range of use cases.

The future is bright as the latest privacy enhancing technologies such as differential privacy, synthetic data fabrication and more are brought into the solution. These technologies, paired with the power of IBM Cloud Pak for Data, will allow data science teams to make choices along the privacy-utility spectrum and continue to push the boundaries of AI initiatives.

Learn more about data and AI at IBM

Read more about the IBM unified data privacy framework that can help you understand how sensitive data is used, stored and accessed throughout your organization.

Explore the IBM unified data privacy framework

To help our clients solve for synthetic data generation and more, we offer IBM watsonx.ai. As part of the IBM watsonx platform that brings together new generative AI capabilities, watsonx.ai combines foundation models and traditional machine learning in a powerful studio spanning the AI lifecycle.

With watsonx.ai, you can train, validate, tune and deploy generative AI, foundation models and machine learning capabilities with ease and build AI applications in a fraction of the time with a fraction of the data. Within the solution you can generate a synthetic tabular data set leveraging your existing data or a custom data schema. You can also connect to your existing database, upload a data file, anonymize columns and generate as much data as needed to address your data gaps or train your classical AI models.
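To show the idea behind synthetic tabular data in its simplest form, here is a toy sketch that samples each column's empirical distribution independently. This is not the watsonx.ai API; it only illustrates the concept, and deliberately ignores cross-column correlations that real synthetic-data tools model.

```python
import random

def synthesize(rows, n, seed=0):
    """Generate n synthetic rows from a list of real rows (dicts).

    Each column is sampled independently from its observed values, so
    per-column marginal distributions are preserved, but relationships
    between columns are not. Real tools fit a joint model instead.
    """
    rng = random.Random(seed)
    columns = rows[0].keys()
    return [
        {c: rng.choice([r[c] for r in rows]) for c in columns}
        for _ in range(n)
    ]
```

Even this naive version shows why synthetic rows can fill data gaps: no generated row corresponds to a real individual, yet the output keeps the shape of the original columns.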

Learn more about watsonx.ai