Data privacy, many experts across a wide range of domains argue, will be the defining issue of this decade. This is particularly true for machine learning (ML), where algorithms are fed reams of data.
Traditionally, ML techniques have relied on centralizing data from multiple sources into a single data center. After all, ML models are at their most powerful when they have access to huge quantities of data. However, this approach comes with a host of privacy challenges. Aggregating diverse data from multiple sources is less feasible today because of regulations such as HIPAA, GDPR, and CCPA. Furthermore, centralizing data enlarges the attack surface: a single leak or breach can expose data from every contributing source.
To overcome these challenges, several pillars of privacy-preserving machine learning (PPML) have emerged, each with specific techniques that reduce privacy risk and keep data reasonably secure. Here are a few of the most important:
1. Federated Learning
Federated learning is an ML training technique that flips the data aggregation problem on its head. Instead of aggregating data to create a single ML model, federated learning aggregates the ML models themselves. This ensures that data never leaves its source location, and it allows multiple parties to collaborate and build a common ML model without directly sharing sensitive data.
It works like this: a coordinator shares a base ML model with each client node. The nodes then train this model locally on their own data and periodically send model updates back to the coordinator, which fuses them together into a new global model. In this way, you get the insights from diverse datasets without ever having to share the datasets themselves.
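The loop above can be sketched in a few lines of Python. Everything here is simplified and hypothetical: `local_training` stands in for any real training loop (we fit a one-parameter linear model so the example stays self-contained), and `federated_averaging` plays the coordinator, fusing client models by the weighted averaging used in FedAvg-style schemes:

```python
def local_training(weights, data, lr=0.1, epochs=5):
    """Hypothetical local step: fit a 1-feature linear model y ~ w*x
    on this client's data via gradient descent. The data never
    leaves the client."""
    w = weights
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_averaging(global_w, client_datasets, rounds=3):
    """Coordinator loop: broadcast the global model, collect locally
    trained models, and fuse them by dataset-size-weighted averaging."""
    for _ in range(rounds):
        local_models = [local_training(global_w, d) for d in client_datasets]
        sizes = [len(d) for d in client_datasets]
        total = sum(sizes)
        global_w = sum(w * n for w, n in zip(local_models, sizes)) / total
    return global_w

# Three clients whose private data all follow y = 2x.
clients = [[(x, 2 * x) for x in range(1, 5)] for _ in range(3)]
print(round(federated_averaging(0.0, clients), 2))  # converges toward the true slope of 2
```

Note that the coordinator only ever sees model weights, never the contents of `client_datasets`.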
In the context of healthcare, this is an incredibly powerful, privacy-aware tool: patient data stays safe while researchers still benefit from the wisdom of the crowd. By not aggregating the data, federated learning adds an extra layer of security. However, the models and model updates themselves still present a security risk if left unprotected.
2. Differential Privacy
ML models are often targets of membership inference attacks. Say that you were to share your healthcare data with a hospital in order to help develop a cancer vaccine. The hospital keeps your data secure, but uses federated learning to train a publicly available ML model. A few months later, hackers use a membership inference attack to determine whether your data was used in the model’s training or not. They then pass insights to an insurance company, which, based on your risk of cancer, could raise your premiums.
Differential privacy mitigates this risk by bounding how much any single data point can influence a trained model. This is done by applying statistical noise to perturb the data or the model parameters during training, making it difficult for an attacker to determine whether a particular individual's data was used to train the model.
For instance, Facebook recently released Opacus, a high-speed library for training PyTorch models with Differentially Private Stochastic Gradient Descent (DP-SGD), a differentially private training algorithm. The gif below highlights how it uses noise to mask data.
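Under the hood, DP-SGD modifies an ordinary gradient step in two ways: per-example gradients are clipped to a maximum norm, and noise is added before the update. The sketch below is an illustrative pure-Python reduction of that idea, not Opacus's actual implementation; `dp_sgd_step` and its parameter defaults are hypothetical:

```python
import math
import random

def dp_sgd_step(weights, per_example_grads, lr=0.1,
                clip_norm=1.0, noise_multiplier=1.1):
    """One simplified DP-SGD update (illustrative, not Opacus's code).
    1. Clip each per-example gradient to L2 norm <= clip_norm, bounding
       any single example's influence on the update.
    2. Sum the clipped gradients and add Gaussian noise scaled to the
       clipping bound, masking individual contributions.
    """
    clipped = []
    for g in per_example_grads:
        norm = math.sqrt(sum(x * x for x in g))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        clipped.append([x * scale for x in g])

    n = len(per_example_grads)
    sigma = noise_multiplier * clip_norm
    noisy_avg = [
        (sum(g[i] for g in clipped) + random.gauss(0.0, sigma)) / n
        for i in range(len(weights))
    ]
    return [w - lr * d for w, d in zip(weights, noisy_avg)]

new_w = dp_sgd_step([0.5, -0.2], [[3.0, 4.0], [0.1, 0.2], [-2.0, 1.0]])
```

Clipping bounds the influence any single example can have on the update; the noise, scaled to that same bound, masks which examples were present.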
This noise is governed by a parameter called epsilon (ε). A low epsilon gives strong data privacy but poor utility and accuracy; a high epsilon trades privacy away for accuracy. The trick is to strike a balance that serves both.
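A toy example makes the tradeoff concrete. The Laplace mechanism, a classic differential-privacy building block (used here instead of DP-SGD for simplicity), adds noise with scale `sensitivity / epsilon` to a query result; the helper names below (`private_count`, the `patients` records) are illustrative:

```python
import math
import random

def laplace_noise(scale):
    # Sample Laplace(0, scale) by inverse transform (stdlib only).
    u = random.random() - 0.5
    sign = 1 if u >= 0 else -1
    return -scale * sign * math.log(1 - 2 * abs(u))

def private_count(records, predicate, epsilon, sensitivity=1.0):
    """Laplace mechanism: a count query changes by at most 1 when one
    person's record is added or removed (sensitivity = 1), so noise
    drawn from Laplace(sensitivity / epsilon) makes it epsilon-DP."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(sensitivity / epsilon)

patients = [{"has_condition": i % 3 == 0} for i in range(300)]
q = lambda r: r["has_condition"]
# Low epsilon -> large noise (strong privacy); high epsilon -> small noise.
print(private_count(patients, q, epsilon=0.05))  # very noisy
print(private_count(patients, q, epsilon=5.0))   # usually close to the true 100
```

The noise scale grows as epsilon shrinks, which is exactly the privacy/accuracy dial described above.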
3. Homomorphic Encryption
Standard encryption is incompatible with machine learning: once data is encrypted, an ML algorithm can no longer make sense of it. Homomorphic encryption, however, is a special encryption scheme that allows certain computations to be carried out directly on ciphertexts, yielding an encrypted result that, when decrypted, matches the result of performing the same computation on the plaintext.
The power of this is that training and inference can happen in an entirely encrypted space. It protects not only data owners but also model owners: a model owner can run inference on encrypted data without ever seeing it, let alone misusing it.
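To make "computing on ciphertexts" concrete, here is a toy implementation of the Paillier cryptosystem, a well-known additively homomorphic scheme: multiplying two ciphertexts yields an encryption of the sum of their plaintexts. The parameters are deliberately tiny for illustration; real systems use 2048-bit moduli and audited libraries:

```python
import math
import random

# Toy Paillier keys (tiny illustrative primes only).
p, q = 17, 19
n = p * q                                        # public modulus
n2 = n * n
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1), private
mu = pow(lam, -1, n)                             # valid because g = n + 1

def encrypt(m):
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    # g^m * r^n mod n^2, with g = n + 1 so g^m = 1 + m*n (mod n^2)
    return ((1 + m * n) * pow(r, n, n2)) % n2

def decrypt(c):
    x = pow(c, lam, n2)
    return ((x - 1) // n) * mu % n

# Additive homomorphism: multiplying ciphertexts adds plaintexts.
c = (encrypt(3) * encrypt(4)) % n2
print(decrypt(c))  # prints 7
```

This additive property is exactly what a coordinator needs in order to sum encrypted contributions without decrypting any individual one.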
When applied to federated learning, the fusion of model updates can happen securely because the aggregation takes place on encrypted values, drastically reducing the risk that individual model updates leak information about the underlying training data.
The Decade of Privacy
As we enter 2021, privacy-preserving machine learning is an emerging field with remarkably active research. If the last decade was about unsiloing data, this decade will be about unsiloing ML models while preserving the privacy of the underlying data via federated learning, differential privacy, and homomorphic encryption. Together, these techniques offer a promising path for advancing machine learning in a privacy-conscious manner.