Federated Learning doesn’t work in healthcare

Federated Learning can’t protect PHI

In the last five years, Federated Learning has become a trendy buzzword in healthcare circles for anything involving sensitive health data. However, being federated has nothing to do with data protection, and we shouldn't conflate the two. It feels similar to the late 2010s, when “blockchain” was believed to be the silver bullet for secure data sharing.

Federated Learning was developed at Google as a technique for training AI models in a decentralized way. It's useful when two or more data providers want to allow a third party to train a model on their data, but don't want to send the data itself to that third party.
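
To make the mechanics concrete, here is a minimal sketch of one round of federated averaging (FedAvg), the canonical Federated Learning algorithm. The model, data loaders, and hyperparameters are illustrative placeholders, not any particular production system.

```python
import copy
import torch

def federated_round(global_model, client_loaders, lr=0.01, local_epochs=1):
    """One round of federated averaging (FedAvg): each client trains a copy
    of the global model on its own data, and only the resulting weights --
    never the raw records -- are sent back to the server and averaged."""
    client_states = []
    for loader in client_loaders:
        local_model = copy.deepcopy(global_model)
        opt = torch.optim.SGD(local_model.parameters(), lr=lr)
        for _ in range(local_epochs):
            for x, y in loader:  # the raw data never leaves this client
                opt.zero_grad()
                loss = torch.nn.functional.cross_entropy(local_model(x), y)
                loss.backward()
                opt.step()
        client_states.append(local_model.state_dict())

    # Server side: average the clients' weights into the new global model.
    avg_state = {
        key: torch.stack([state[key].float() for state in client_states]).mean(dim=0)
        for key in client_states[0]
    }
    global_model.load_state_dict(avg_state)
    return global_model
```

Note that each site trains the entire model locally, which is the source of the computational burden discussed further down.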

In non-healthcare use cases, Federated Learning is a powerful tool. Every night, when you plug your smartphone into the charger, it can use federated learning to train a local model on what you typed that day, so that the central Apple/Google keyboard-prediction model gets smarter without your keystrokes ever leaving the device.

However, in healthcare, Federated Learning's utility is blunted because the data still has to be manually de-identified before it can be used for training. Notice the distinction: being federated is not the same as being de-identified. De-identifying PHI to HIPAA standards blunts the utility of the data (dates and times, geographic locations, etc.), because it strips out the most useful information. And de-identification is still required because the input data used in Federated Learning can be reverse engineered from the model and its shared updates! Let's double-click here: federated learning is an architectural approach that has little to nothing to do with the privacy of PHI. It isn't a stretch to say that federated learning alone cannot safely be applied to PHI.
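
As a toy illustration of that reverse-engineering risk (not an attack on any real system; the layer size and data below are made up), here is how a single shared gradient from one linear layer gives back the training record exactly:

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(8, 3)        # a single linear layer, purely illustrative
x = torch.randn(1, 8)                # one "patient record" worth of features
y = torch.tensor([1])

loss = torch.nn.functional.cross_entropy(layer(x), y)
grad_W, grad_b = torch.autograd.grad(loss, (layer.weight, layer.bias))

# For a linear layer, dL/dW = dL/dlogits * x^T and dL/db = dL/dlogits,
# so dividing any row of grad_W by the matching entry of grad_b
# reconstructs the original input exactly.
recovered = grad_W[0] / grad_b[0]
print(torch.allclose(recovered, x[0], atol=1e-5))   # True: the record leaks
```

Recovering inputs from deeper models and from aggregated updates takes more work (so-called gradient inversion attacks), but this is the core of the leak.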

De-identified data isn’t as valuable as PHI

Traditional de-identification under HIPAA follows one of two methods: Expert Determination or Safe Harbor.

  1. Under Expert Determination, identifiers are removed or replaced with tokens, and an expert applies statistical methods to certify that no individual can reasonably be identified from the dataset. 

  2. Under Safe Harbor, 18 categories of identifiers - much of it clinically valuable data - are completely dropped or obfuscated, as sketched below.
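
To make the Safe Harbor trade-off concrete, here is a rough sketch of the kind of field-level scrubbing it forces. The record layout and field names are invented for illustration; real pipelines (especially for imaging) often scrub even more aggressively.

```python
from datetime import date

def safe_harbor_scrub(record: dict) -> dict:
    """Sketch of HIPAA Safe Harbor-style scrubbing on a toy record."""
    scrubbed = dict(record)

    # Direct identifiers (names, MRNs, contact details, SSNs, ...) are dropped outright.
    for field in ("name", "mrn", "phone", "email", "ssn"):
        scrubbed.pop(field, None)

    # Dates more specific than the year must go: keep only the year.
    dob = record["date_of_birth"]
    scrubbed["date_of_birth"] = dob.year

    # Ages over 89 collapse into a single "90+" bucket.
    age = date.today().year - dob.year
    scrubbed["age"] = "90+" if age > 89 else age

    # Geography finer than the first three ZIP digits is removed
    # (and even those must go for sparsely populated areas).
    scrubbed["zip"] = record["zip"][:3] + "XX"

    return scrubbed

patient = {
    "name": "Jane Doe", "mrn": "12-3456", "ssn": "000-00-0000",
    "phone": "555-0100", "email": "jane@example.com",
    "date_of_birth": date(1943, 6, 2), "zip": "66044",
    "diagnosis": "NSTEMI",
}
print(safe_harbor_scrub(patient))
# Exact dates, fine-grained geography, and every direct identifier are gone
# before a model ever sees the record.
```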

Both techniques neuter the clinical value of the data, and neither can be applied universally. For example, using Safe Harbor, an AI model won't be able to tell whether it's looking at an X-ray of an 81-year-old female or a 19-year-old male - a distinction with obvious clinical ramifications.

Expert Determination also struggles with genomic data, which is difficult to de-identify because my genomic sequence is unique to me. And let's not forget that much of the actual clinical information is not found in the SQL database of an EHR; it's found in the clinical narrative that makes up doctors' notes, pathology and radiology reports, and discharge summaries. It's nearly impossible to reliably remove this valuable PHI from unstructured text.
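
As a tiny illustration of why free text is so hard (the note and the patterns below are invented), naive pattern-based scrubbing catches the structured identifiers and misses the rest:

```python
import re

note = ("81 y.o. female seen 3/14/2024 for chest pain. Her sister Mary Obi, "
        "who works at Lakeside Dialysis on Elm St, provided the history. MRN 12-3456.")

# Pattern-based scrubbing catches the obvious, well-structured identifiers...
scrubbed = re.sub(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b", "[DATE]", note)
scrubbed = re.sub(r"\bMRN\s*[\d-]+", "[MRN]", scrubbed)
print(scrubbed)

# ...but the sister's name, her employer, and the street are still sitting in
# the prose. Finding those takes clinical NLP, which has its own error rate --
# and anything it misses is still PHI.
```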

In addition to requiring de-identification, Federated Learning puts a heavy computational load on each data node, because every node has to train the entire model.

It is possible to train on PHI safely

Recent advances in mathematics give us a magic solution. Blind Learning is a technique that allows models to work directly on PHI without putting it at risk of identification - effectively achieving de-identification without neutering the data. Yeah, that's why it's magic.

Blind Learning makes this possible through two pivotal improvements over Federated Learning:

  1. It splits the model into two separate parts: a lightweight part that runs on the client where the PHI lives, and a heavyweight part that runs on the server and trains on what the clients send up - so it never needs to see the actual PHI, only a proxy, the shadow of the PHI.

  2. It adds a privacy loss function that decorrelates what the clients send from the underlying data, so the model learns a concept rather than memorizing the records. This ensures that the training data cannot be reverse engineered from the trained model - and even though it never sees actual PHI, the resulting model comes out the same. No PHI risk. No HIPAA violations. No de-identification. A rough sketch of both ideas follows below.
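
Blind Learning itself is proprietary, so the following is only a sketch of the split-model-plus-privacy-loss idea described above. The network sizes, the specific decorrelation penalty (a simple linear correlation measure standing in for the real privacy loss), and its weighting are all assumptions for illustration.

```python
import torch
import torch.nn as nn

# Lightweight "client half": runs where the PHI lives.
client_net = nn.Sequential(nn.Linear(128, 64), nn.ReLU())
# Heavyweight "server half": never receives raw records, only their activations.
server_net = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 2))

opt = torch.optim.Adam(
    list(client_net.parameters()) + list(server_net.parameters()), lr=1e-3)
lam = 0.3  # weight on the privacy term -- an assumed value, tuned in practice

def centered_sqdist(a):
    # Double-centered matrix of squared pairwise distances within a batch.
    sq = (a.unsqueeze(0) - a.unsqueeze(1)).pow(2).sum(-1)
    return sq - sq.mean(0, keepdim=True) - sq.mean(1, keepdim=True) + sq.mean()

def decorrelation_penalty(x, z):
    """Simplified decorrelation measure between inputs x and activations z
    (0 = no linear relationship, 1 = perfectly related). A stand-in for
    Blind Learning's actual privacy loss."""
    A, B = centered_sqdist(x), centered_sqdist(z)
    cov = (A * B).mean()
    var = (A * A).mean().sqrt() * (B * B).mean().sqrt()
    return cov / var.clamp_min(1e-8)

def training_step(x, y):
    z = client_net(x)                  # the "shadow" of the PHI that leaves the client
    logits = server_net(z)             # the server trains on the proxy, not the records
    task_loss = nn.functional.cross_entropy(logits, y)
    privacy_loss = decorrelation_penalty(x, z)   # push z to decorrelate from x
    loss = task_loss + lam * privacy_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return task_loss.item(), privacy_loss.item()
```

In a real deployment the client and server halves would exchange the activations and the gradients at the cut layer over the network; everything runs in one process here just to show the split and the extra loss term.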

As the world runs out of available training data on the open web, Blind Learning emerges as the only HIPAA- and GDPR-compliant technique that can safely open up the treasure trove of clinical data to LLMs and AI training.