Federated Learning doesn’t work in healthcare

Federated Learning can’t protect PHI

Over the last five years, Federated Learning has become a trendy buzzword in healthcare circles whenever the conversation turns to handling healthcare data. But being federated has nothing to do with data protection, and we shouldn’t conflate the two. It feels a lot like the late 2010s, when “blockchain” was believed to be the silver bullet for secure data sharing.

Federated Learning was developed at Google as a technique to train AI models in a decentralized way. It’s useful in situations when two or more data providers want to allow a third party to train on their data, but don’t want to send their data to the third party.

In non-healthcare use cases, Federated Learning is a powerful tool. Every night, when you plug your smartphone into the charger, it can train a local model on what you typed that day; only the model updates are sent back, so Apple’s and Google’s central keyboard-prediction models get smarter without your keystrokes ever leaving the device.
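
To make the mechanics concrete, here is a minimal sketch of federated averaging (the aggregation step at the heart of Federated Learning) in plain NumPy. The model, the local_update routine, and the client data are hypothetical stand-ins, not any vendor’s implementation; the point is simply that each client trains locally and only weights, never raw data, travel to the server.

    import numpy as np

    def local_update(weights, X, y, lr=0.1, epochs=5):
        """Train a linear model locally with plain gradient descent.
        Only the updated weights leave the device - never X or y."""
        w = weights.copy()
        for _ in range(epochs):
            grad = X.T @ (X @ w - y) / len(y)   # gradient of mean squared error
            w -= lr * grad
        return w

    def federated_round(global_weights, clients):
        """One round of FedAvg: every client trains locally, then the
        server averages the returned weights, weighted by data size."""
        updates = [local_update(global_weights, X, y) for X, y in clients]
        sizes = np.array([len(y) for _, y in clients], dtype=float)
        return np.average(updates, axis=0, weights=sizes)

    # Three "devices", each holding private data the server never sees
    rng = np.random.default_rng(0)
    true_w = np.array([2.0, -1.0])
    clients = []
    for _ in range(3):
        X = rng.normal(size=(50, 2))
        y = X @ true_w + rng.normal(scale=0.1, size=50)
        clients.append((X, y))

    w = np.zeros(2)
    for _ in range(20):
        w = federated_round(w, clients)
    print("learned weights:", w)   # approaches [2, -1] without ever pooling the data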

However, in healthcare, Federated Learning’s utility is blunted because the data still has to be manually de-identified before it can be used for training. Notice the difference here: being federated is not the same as being de-identified. De-identifying PHI to HIPAA standards blunts the utility of the data (dates and times, geographic locations, etc.), because it strips out exactly the information that makes the data useful. And that de-identification is still required, because the input data used with Federated Learning can be reverse engineered from the model and the updates it shares! Let’s double click here: Federated Learning is an architectural approach that has little to nothing to do with the privacy of PHI. It wouldn’t be a stretch to say that Federated Learning, on its own, cannot safely be applied to PHI.

De-identified data isn’t as valuable as PHI

Traditional de-identification under HIPAA uses one of two methods - either Expert Determination or Safe Harbor.

  1. Under Expert Determination, identifiers are removed or replaced with tokens, and additional statistical methods are applied to ensure that no individual can reasonably be identified from the dataset.

  2. Under Safe Harbor, 18 categories of identifiers - and a lot of valuable data with them - are completely dropped or obfuscated.

Both techniques neuter the clinical value of the data, and neither can be applied universally. For example, under Safe Harbor an AI model may not be able to tell whether it’s looking at an X-ray of an 81-year-old female or a 19-year-old male - a distinction with obvious clinical ramifications.
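
To make the trade-off concrete, here is a toy sketch of the kind of transformation Safe Harbor demands. The field names and the subset of rules are invented for illustration (this is not a compliance tool): every date element except the year goes, ages of 90 and above collapse into one bucket, and geographic detail gets truncated - exactly the details a clinician or a model would want.

    from copy import deepcopy

    def safe_harbor(record):
        """Apply a few of the 18 Safe Harbor rules to a toy patient record.
        Simplified for illustration - not a compliance tool."""
        r = deepcopy(record)
        # Direct identifiers are dropped entirely
        for field in ("name", "mrn", "phone", "email"):
            r.pop(field, None)
        # Dates: keep only the year
        if "admit_date" in r:
            r["admit_date"] = r["admit_date"][:4]      # "2023-11-04" -> "2023"
        # Ages 90 and above are aggregated into a single category
        if isinstance(r.get("age"), int) and r["age"] >= 90:
            r["age"] = "90+"
        # Geography: keep at most the first three digits of the ZIP code
        if "zip" in r:
            r["zip"] = r["zip"][:3] + "00"
        return r

    record = {
        "name": "Jane Doe", "mrn": "12345678", "age": 94,
        "admit_date": "2023-11-04", "zip": "02139",
        "finding": "left hip fracture",
    }
    print(safe_harbor(record))
    # {'age': '90+', 'admit_date': '2023', 'zip': '02100', 'finding': 'left hip fracture'}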

Expert Determination struggles too: genomic data is extremely difficult to de-identify, because a genomic sequence is unique to the individual it came from. And let’s not forget that much of the actual clinical information doesn’t live in the SQL database of an EHR; it lives in the clinical narrative - doctors’ notes, pathology and radiology reports, discharge summaries. Reliably removing PHI from that unstructured text is nearly impossible.
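
To see why narrative text is so hard to scrub, here is a hedged toy example. The regex patterns below are invented for illustration; a naive redactor like this catches the obviously formatted identifiers but misses the ones written the way clinicians actually write.

    import re

    # Naive patterns for a few obviously formatted identifiers (illustrative only)
    PATTERNS = {
        "DATE":  r"\b\d{1,2}/\d{1,2}/\d{2,4}\b",
        "PHONE": r"\b\d{3}-\d{3}-\d{4}\b",
        "MRN":   r"\bMRN[:\s]*\d+\b",
    }

    def naive_redact(note):
        """Replace pattern matches with placeholders. Anything that doesn't
        fit a pattern - names in prose, spelled-out dates, occupations,
        relatives, small towns - slips straight through."""
        for label, pattern in PATTERNS.items():
            note = re.sub(pattern, f"[{label}]", note)
        return note

    note = ("Mr. Alvarez, an 81 yo retired firefighter from Brookline, "
            "was seen on 11/04/2023 (MRN: 12345678). His daughter Maria, "
            "reachable at 617-555-0199, reports falls since Thanksgiving.")

    print(naive_redact(note))
    # The date, MRN and phone number are caught, but "Alvarez", "Brookline",
    # "firefighter", "Maria" and "Thanksgiving" all remain - each one a clue
    # for re-identification.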

On top of requiring de-identification, Federated Learning places a heavy computational load on every data node, because each node has to train the entire model locally.

It is possible to train on PHI safely

Recent advances in mathematics give us something that looks like magic. Blind Learning is a technique that lets models work directly on PHI without putting it at risk of identification - effectively achieving the goal of de-identification without neutering the data. Yeah, that’s why it’s magic.

Blind Learning makes this possible with two pivotal improvements over Federated Learning (sketched in code after the list) -

  1. It splits the model into two separate parts - a lightweight part that runs on the client, and a heavyweight part that runs on the server and operates only on the intermediate activations the clients send up. The server never needs to see the actual PHI; it works on a proxy, on the shadow of the PHI.

  2. It adds a privacy loss function that decorrelates the raw data from those intermediate representations, so the model learns a concept rather than the underlying data. This ensures that the training data cannot be reverse engineered from the trained model. And even though the server never sees actual PHI, the resulting model comes out essentially the same as if it had been trained centrally. No PHI risk. No HIPAA violations. No de-identification.
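
Here is a minimal PyTorch sketch of those two ideas under stated assumptions: the split point, the distance-correlation-style decorrelation penalty, and the weighting alpha are illustrative choices for this toy, not the actual Blind Learning implementation.

    import torch
    import torch.nn as nn

    def pairwise_dist(x, eps=1e-9):
        # Pairwise Euclidean distances; eps keeps gradients finite on the diagonal
        diff = x.unsqueeze(1) - x.unsqueeze(0)
        return (diff.pow(2).sum(-1) + eps).sqrt()

    def distance_correlation(x, z):
        """Empirical distance correlation between raw inputs x and their
        activations z. Driving this toward zero decorrelates the 'shadow'
        sent to the server from the PHI it came from."""
        a, b = pairwise_dist(x.flatten(1)), pairwise_dist(z.flatten(1))
        A = a - a.mean(0, keepdim=True) - a.mean(1, keepdim=True) + a.mean()
        B = b - b.mean(0, keepdim=True) - b.mean(1, keepdim=True) + b.mean()
        dcov = (A * B).mean()
        return dcov / ((A * A).mean().sqrt() * (B * B).mean().sqrt() + 1e-9)

    # Lightweight half that stays on the client, next to the PHI
    client_net = nn.Sequential(nn.Linear(32, 16), nn.ReLU())
    # Heavyweight half on the server, which only ever sees activations
    server_net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))

    params = list(client_net.parameters()) + list(server_net.parameters())
    opt = torch.optim.Adam(params, lr=1e-3)
    task_loss = nn.CrossEntropyLoss()
    alpha = 0.5                            # how hard to scrub the activations

    x = torch.randn(64, 32)                # stand-in for a batch of PHI features
    y = torch.randint(0, 2, (64,))         # stand-in labels

    for step in range(200):
        opt.zero_grad()
        smashed = client_net(x)            # the "shadow" that crosses the wire
        logits = server_net(smashed)
        loss = task_loss(logits, y) + alpha * distance_correlation(x, smashed)
        loss.backward()
        opt.step()

In a real deployment the two halves run on different machines, and only the activations (and the gradients flowing back into them) cross the boundary; the toy keeps both halves in one process simply to show how the combined loss is assembled.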

As the open web runs out of fresh training data, Blind Learning emerges as the only HIPAA- and GDPR-compliant technique that can safely open up LLM and AI training to the treasure trove of clinical data.


Data Protection ≠ Data Privacy

Data Protection and Data Privacy are distinct problems requiring different solutions.

When two or more parties are trying to collaborate around data, traditional Data Protection solutions do not ensure that a user of the data cannot abuse it. There are, however, ways to provably ensure that such abuse cannot occur.

What is Data Protection?

Data protection means protecting your data from other parties that have access to the same resources. For example, in a shared-server context, where a cloud vendor rents virtualized slices of the same physical hardware to multiple parties, data protection solutions are essential to make sure that sensitive data being processed is not exposed to the other tenants of that hardware.

Typical data protection solutions are Confidential Computing (such as TEEs), Homomorphic Encryption, and Federated Learning. Confidential Computing uses an isolated piece of hardware to ensure that the memory used by an application is not accessible to anything else on the machine. Homomorphic Encryption keeps data protected by ensuring it is never in raw form - computation happens on ciphertext, so the data is nonsensical to everyone except the authorized application. Federated Learning, in turn, ensures that an AI model is trained on data held in different locations without ever bringing that data together.
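
As a flavor of what “the data is never in raw form” means in practice, here is a toy sketch using pairwise additive masking - a deliberately simplified stand-in for real secure aggregation or homomorphic encryption, with invented numbers. The aggregator can compute the total it is authorized to compute, yet every value it actually sees looks like random noise.

    import secrets

    MOD = 2**61 - 1   # work modulo a large prime so the masks wrap around cleanly

    def make_pairwise_masks(n):
        """Party i and party j agree on a random mask; i adds it, j subtracts it.
        Every mask cancels in the total, but each masked value looks random."""
        masks = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                m = secrets.randbelow(MOD)
                masks[i][j] = m
                masks[j][i] = MOD - m          # additive inverse mod MOD
        return masks

    def mask_value(value, my_masks):
        return (value + sum(my_masks)) % MOD

    # Three hospitals, each with a private count the others must never see
    counts = [412, 187, 965]
    masks = make_pairwise_masks(len(counts))

    masked = [mask_value(c, masks[i]) for i, c in enumerate(counts)]
    print("what the aggregator sees:", masked)    # three random-looking numbers
    print("recovered total:", sum(masked) % MOD)  # 1564, the true sum

Notice that the sketch says nothing about what the total is used for - which is exactly the gap the next paragraph describes.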

However, data protection solutions don’t take or enforce an opinion on how the data ought to be used. This matters because it means they cannot enforce any regulatory or compliance-driven requirements on how the data must be used. Any application can use data protection solutions to process data, including malicious ones. A virus can run inside a Confidential Compute enclave, fully isolated from observation by an anti-virus. With homomorphic encryption, a nefarious program can claim to be processing patient data to determine whether a patient has a particular disease, while actually processing that data to work out the patient’s real identity - a violation of HIPAA. Similarly, a user of Federated Learning can claim to be training a model for disease screening while encoding patients’ identifiable biometric information into the model.

For a real-world example of a Data Protection approach failing to protect people’s privacy, look at what happens when data is cross-referenced with other information to identify specific individuals. Netflix published an anonymized dataset of roughly 100 million customer movie ratings, challenging citizen data scientists to use it to develop new recommendation algorithms. Researchers were able to identify 84% or more of the individuals in Netflix’s ‘anonymized’ dataset by cross-referencing it with the public movie-rating site IMDb.

To summarize, Data Protection solutions protect the data from observation.

What is Data Privacy?

Data Privacy, on the other hand, protects confidential data (usually data about people, but sometimes data that is sensitive for competitive or intellectual-property reasons) while ensuring that regulations and policies on how the data is used are not violated. That governance is enforced with cryptography - ensuring that no user can abuse their access to the data.

Data Privacy solutions come pre-baked with governance policies to ensure that you cannot, even accidentally, identify an individual by processing their data. This is what we mean by “Can’t be evil” - it is mathematically impossible to abuse the data. Processing the data can only produce the analytics it is supposed to, and nothing else.

These solutions maintain an audit trail of how someone’s data was used. They ensure that malicious attempts to access the data are rejected. They don’t rely on a trusted third party in the middle who could abuse administrative permissions. Data protection often comes along as a side effect of Data Privacy, but Data Privacy’s primary objective is to ensure that Personally Identifiable Information (PII) and Protected Health Information (PHI) are handled ethically.

Unlike pure Data Protection solutions, Data Privacy solutions don’t give a virus a protected place to hide. When working with AI models, they ensure that no patient data can be reverse engineered from the model. They ensure that cross-referencing attacks are infeasible. Really, for ANY usage of the system, they ensure that regulatory compliance can’t be broken and that no individual person can ever be identified.

Because governance is enforced on the data itself, Data Privacy solutions can go even further and bind any arbitrary contract or regulation to the data. Two companies collaborating on a project are no longer relying on paper contracts and human enforcers to ensure that their data is safeguarded.

To summarize, Data Privacy solutions ensure that data about people is used ethically, with no realistic possibility of a violation.

What does this mean?

Data protection is about controlling who can see or obtain the data, using measures like encryption, access controls, and physical security mechanisms. Data privacy is about how and for what purposes the data can be used once access is granted.

Cryptographically enforced data privacy means you operate in a world where you “can’t be evil”, instead of pleading “don’t be evil” and hoping for the best.