When two or more parties collaborate on data, traditional Data Protection solutions do not ensure that a user of the data cannot abuse it. There are, however, ways to provably ensure that data abuse cannot occur.
Data Protection ≠ Data Privacy
Data Protection and Data Privacy are distinct problems requiring different solutions.
What is Data Protection?
Data protection means protecting your data from other parties that have access to the same resources. For example, in a shared server context where a cloud vendor allows multiple parties to rent partial virtual access to the same physical hardware, data protection solutions are essential to make sure that sensitive data being processed is not exposed to other renters of that same physical hardware.
Typical data protection solutions are Confidential Compute solutions (like TEEs), Homomorphic Encryption and Federated Learning. Confidential Compute-based solutions use an isolated piece of hardware to ensure that the memory accessed by an application is not accessible by others. Homomorphic Encryption-based solutions keep data protected by ensuring that the data is never in a raw form: computation happens directly on encrypted values, so the data is nonsensical to anybody except the authorized application. Federated Learning trains an AI model on data that sits in different locations without bringing that data together.
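To make the Homomorphic Encryption idea concrete, here is a minimal sketch of an additively homomorphic scheme (textbook Paillier), assuming Python 3.8 or later. The primes are toy values chosen for illustration and the function names are hypothetical; this is not a production implementation or any particular vendor's API.

```python
import math
import secrets

# Toy key material for illustration only; real keys use primes of 1024+ bits.
P, Q = 1789, 1861
N = P * Q                      # public modulus
N_SQ = N * N
G = N + 1                      # standard simple choice of generator
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(p-1, q-1), private
MU = pow(LAM, -1, N)           # modular inverse of LAM, private


def encrypt(m: int) -> int:
    """Encrypt an integer 0 <= m < N using the public values N and G."""
    if not 0 <= m < N:
        raise ValueError("message out of range")
    while True:
        r = secrets.randbelow(N - 1) + 1   # random value in 1..N-1
        if math.gcd(r, N) == 1:
            break
    return (pow(G, m, N_SQ) * pow(r, N, N_SQ)) % N_SQ


def decrypt(c: int) -> int:
    """Recover the plaintext using the private values LAM and MU."""
    x = pow(c, LAM, N_SQ)
    return ((x - 1) // N * MU) % N


def add_encrypted(c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % N_SQ


# The party doing the addition never sees 120 or 45 in raw form.
total = decrypt(add_encrypted(encrypt(120), encrypt(45)))
print(total)  # 165
```

Note what the sketch does not do: nothing in it restricts which computation is run on the ciphertexts, only who can read them.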
However, data protection solutions don’t take or enforce an opinion on how the data ought to be used. This is important because it means data protection solutions cannot enforce any regulatory or compliance-driven requirements on how the data must be used. Any application can use data protection solutions to process data, including malicious applications. For example, even a virus can run inside a Confidential Compute chip, fully isolated from observation by an anti-virus. Using homomorphic encryption, a nefarious program can claim to be processing patient data to determine whether a patient has a particular disease, but actually process that data to uncover the patient’s real identity, which is a violation of HIPAA. Similarly, a user of Federated Learning can claim to be training a model for disease screening, but encode the patient’s identifiable biometric information into the model.
As a real-world example of Data Protection falling short of protecting people’s privacy, consider how supposedly anonymous data can be cross-referenced with other information to identify specific individuals. Netflix published an anonymized data set with 100 million records of customers’ movie ratings, challenging citizen data scientists to use it to develop new recommendation algorithms. Researchers were able to identify more than 84% of the individuals in Netflix’s ‘anonymized’ data set by cross-referencing it with another data set from the movie-rating site IMDb.
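The cross-referencing step can be sketched in a few lines. The records, pseudonyms, names and the `link_records` helper below are all made up for illustration; this shows the general shape of a linkage attack, not the method the researchers actually used.

```python
from collections import defaultdict

# Hypothetical "anonymized" release: names replaced by pseudonyms.
anonymized = [
    {"user": "user_17", "movie": "Movie A", "rating": 5, "date": "2005-03-01"},
    {"user": "user_17", "movie": "Movie B", "rating": 2, "date": "2005-03-04"},
    {"user": "user_17", "movie": "Movie C", "rating": 4, "date": "2005-04-10"},
    {"user": "user_42", "movie": "Movie A", "rating": 5, "date": "2005-03-01"},
    {"user": "user_42", "movie": "Movie D", "rating": 3, "date": "2005-05-02"},
]

# Hypothetical public reviews posted under real names.
public = [
    {"name": "alice", "movie": "Movie A", "rating": 5, "date": "2005-03-01"},
    {"name": "alice", "movie": "Movie B", "rating": 2, "date": "2005-03-04"},
    {"name": "bob", "movie": "Movie D", "rating": 3, "date": "2005-05-02"},
]


def link_records(anonymized, public, min_matches=2):
    """Map real names to pseudonyms that share enough identical ratings."""
    # Index the anonymized rows by their quasi-identifiers.
    index = defaultdict(set)
    for row in anonymized:
        index[(row["movie"], row["rating"], row["date"])].add(row["user"])

    # Count how many public reviews of each person match each pseudonym.
    counts = defaultdict(int)
    for row in public:
        key = (row["movie"], row["rating"], row["date"])
        for pseudonym in index.get(key, ()):
            counts[(row["name"], pseudonym)] += 1

    # Keep a link only when exactly one pseudonym reaches the threshold.
    candidates = defaultdict(list)
    for (name, pseudonym), n in counts.items():
        if n >= min_matches:
            candidates[name].append(pseudonym)
    return {name: ps[0] for name, ps in candidates.items() if len(ps) == 1}


print(link_records(anonymized, public))  # {'alice': 'user_17'}
```

Once "alice" is linked to `user_17`, every other rating under that pseudonym, including the one for "Movie C" that she never posted publicly, is exposed.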
To summarize, Data Protection solutions protect the data from observation.
What is Data Privacy?
Data privacy, on the other hand, protects confidential data (usually data about people, but it can also be data that is sensitive for competitive or Intellectual Property reasons) while ensuring that regulations and policies on the usage of the data aren’t violated. Governance of how the data is used is enforced using cryptography, ensuring that no user can abuse their access to data.
Data Privacy solutions come pre-baked with governance policies to ensure that you cannot accidentally identify an individual from processing their data. This is what we mean by “Can’t be evil” – it’s mathematically impossible to abuse the data. The processing of the data can only produce the analytics it’s supposed to, and nothing else.
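As a rough illustration of what a governance policy attached to the data might look like, the sketch below only releases aggregate statistics over approved columns and refuses small groups. Everything here (`Policy`, `PolicyViolation`, `run_aggregate`, the column names and the sample values) is hypothetical, and the check is ordinary application code: it stands in for the idea of a usage policy and does not provide the cryptographic enforcement described above.

```python
from dataclasses import dataclass
from statistics import mean


class PolicyViolation(Exception):
    """Raised when a request falls outside the agreed usage policy."""


@dataclass(frozen=True)
class Policy:
    allowed_columns: frozenset  # columns that may be aggregated
    min_group_size: int         # smallest group a result may describe


def run_aggregate(records, column, policy):
    """Return count and mean for a column, or refuse the request."""
    if column not in policy.allowed_columns:
        raise PolicyViolation(f"column {column!r} may not be queried")
    values = [row[column] for row in records]
    if len(values) < policy.min_group_size:
        raise PolicyViolation("group too small to release a result")
    return {"count": len(values), "mean": mean(values)}


# Made-up sample data for the usage example.
patients = [
    {"name": "p1", "systolic": 120},
    {"name": "p2", "systolic": 130},
    {"name": "p3", "systolic": 140},
    {"name": "p4", "systolic": 125},
    {"name": "p5", "systolic": 135},
]
policy = Policy(allowed_columns=frozenset({"systolic"}), min_group_size=5)

print(run_aggregate(patients, "systolic", policy))  # {'count': 5, 'mean': 130}
```

A request for the `name` column, or for a group of only two patients, raises `PolicyViolation` instead of returning data.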
These solutions maintain an audit trail of how someone’s data was used. They ensure that malicious attempts to access the data are rejected. They don’t rely on a trusted third party in the middle that could potentially abuse administrative permissions. Data protection is often a side effect of Data Privacy, but the primary objective of Data Privacy is to ensure that Personally Identifiable Information (PII) and Protected Health Information (PHI) are handled ethically.
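One common way to make an audit trail tamper-evident is to chain entries together with hashes, so that editing an earlier entry invalidates everything after it. The sketch below assumes that approach; the function names, field names and sample entries are hypothetical and are not taken from any specific product.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def _digest(actor, action, prev_hash):
    """Hash the entry contents together with the previous entry's hash."""
    body = {"actor": actor, "action": action, "prev_hash": prev_hash}
    encoded = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_entry(log, actor, action):
    """Append a usage record that commits to the whole history before it."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({
        "actor": actor,
        "action": action,
        "prev_hash": prev_hash,
        "hash": _digest(actor, action, prev_hash),
    })


def verify(log):
    """Return True only if no entry was altered, removed or reordered."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(entry["actor"], entry["action"], prev_hash):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append_entry(log, "analyst_a", "ran approved disease-screening query")
append_entry(log, "analyst_b", "requested aggregate report")
print(verify(log))   # True

log[0]["action"] = "nothing to see here"  # attempt to rewrite history
print(verify(log))   # False
```

The chain only detects tampering. Preventing an administrator from replacing the entire log would additionally require publishing or replicating the latest hash somewhere they do not control.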
Unlike solutions that provide only Data Protection, Data Privacy solutions add safeguards that prevent a malicious program such as a virus from running. When working with AI models, they ensure that no patient data can be reverse engineered from the model. They ensure that cross-referencing attacks are infeasible. With any usage of the system, they ensure that regulatory compliance can’t be broken and that no individual person can be identified.
Because of the strong governance enforcements on the data, Data Privacy solutions can go even further and fuse any arbitrary contract or regulation on the data. This ensures that two companies collaborating on a project aren’t just relying on paper contracts with human enforcers to ensure that their data is safeguarded.
To summarize, Data Privacy solutions ensure that data about people is used ethically, leaving no opportunity for violations.
What does this mean?
Data protection is about controlling who can see or obtain the data, using measures like encryption, access controls, and physical security mechanisms. Data privacy is about how and for what purposes the data can be used once access is granted.