Roshanak Houmanfar, VP of Machine Learning Products at Integrate.ai – Interview Series
Roshanak (Ro) Houmanfar is the VP of machine learning products for integrate.ai, a company helping developers solve the world’s most important problems without risking sensitive data. Ro has a particular knack for finding new ways to simplify complex AI concepts and connect them with user needs. Leveraging this expertise, she is at the forefront of integrate.ai’s mission to democratize access to privacy-enhancing technology.
What initially attracted you to data science and machine learning?
I started my journey in robotics. After experimenting with the different angles of robotics, and burning down a welding lab, I came to the conclusion that I was more attracted to the artificial intelligence side of my field, and that led me to the wonderful world of machine learning.
Could you describe your current role and what an average day looks like for you?
I am the VP of Product at integrate.ai, a SaaS company helping developers solve the world’s most important problems without risking sensitive data. We’re building tools for privacy-safe machine learning and analytics for the distributed future of data.
In my day-to-day, I work with our teams across functions to achieve three things:
- Think through what the future of intelligence could look like and how we can shape that future so that intelligence solves the most critical problems.
- Understand our customers’ pain points and how we can innovate to make their work more impactful and efficient.
- Make sure our vision and customer feedback are always considered in product development, working collaboratively with our teams to deliver the best features.
Synthetic data is currently all the rage in machine learning, but integrate.ai takes a bit of a contrarian approach. What are some applications where synthetic data may not be a desirable option?
In order to understand when synthetic data isn’t the best solution, it’s important first to understand when it is. Synthetic data is best used when the modeling target has either a small amount of real data available or none at all – for example, in cold-start problems and text- and image-based model training. Sometimes there simply isn’t enough data to train a model, and that is where synthetic data shines as a solution.
However, synthetic data is increasingly being used in situations where plenty of real data exists, but that data is siloed due to privacy regulations, centralization costs or other interoperability roadblocks. This is a flagrant misuse of synthetic data. In these use cases, it’s difficult to determine the right level of abstraction for synthetic data creation, resulting in low-quality synthetic data that can cause innate bias or other problems down the line that are difficult to debug. Additionally, models trained on synthetic data just don’t compare to those trained on real, high-quality, granular source data.
Integrate.ai specializes in offering federated learning solutions. Could you describe what federated learning is?
In traditional machine learning, all model training data must be centralized in one database. With federated learning, models are able to train on decentralized, distributed datasets – or data that resides in two or more separate databases and cannot be easily moved. Portions of a machine learning model are trained where the data is located, and model parameters are shared among participating datasets to produce an improved global model. And since no data moves within the system, organizations can train models without roadblocks like privacy and security regulations, cost or other centralization concerns.
Generally, the training data accessible with federated learning is of a much higher quality as well, since centralized data tends to lose some of its granularity in exchange for ease of access in one location.
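The round trip described above (train locally, share parameters, combine them into a global model) can be sketched in a few lines of Python. This is a minimal illustration of federated averaging, one common way of combining parameters, and not integrate.ai's implementation. The function names, the one-parameter model y ≈ w * x, and the toy data are all assumptions made for the example.

```python
from typing import List, Tuple

Dataset = List[Tuple[float, float]]  # (x, y) pairs that never leave their site


def local_update(w: float, data: Dataset, lr: float = 0.01, epochs: int = 5) -> float:
    """Run a few gradient-descent steps on one site's data for the model y ≈ w * x."""
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_average(updates: List[Tuple[float, int]]) -> float:
    """Combine per-site weights, weighting each by its number of samples."""
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total


def train(sites: List[Dataset], rounds: int = 50) -> float:
    """Alternate local training and server-side averaging for a number of rounds."""
    w = 0.0
    for _ in range(rounds):
        # Only the trained weight and the sample count are shared, not the records
        updates = [(local_update(w, data), len(data)) for data in sites]
        w = federated_average(updates)
    return w


# Hypothetical toy data: two sites whose records both follow y = 3x
site_a = [(1.0, 3.0), (2.0, 6.0)]
site_b = [(3.0, 9.0), (4.0, 12.0), (5.0, 15.0)]
global_w = train([site_a, site_b])
```

In this sketch the coordinating server only ever sees each site's weight and sample count. A real deployment would add safeguards on top of this, such as secure aggregation and differential privacy.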
How does an enterprise identify the best use cases for federated learning?
Federated learning is a machine learning tech stack built for situations where accessing data or bringing it into the traditional infrastructure of machine learning with centralized data lakes is painful. If you are experiencing one of the following symptoms, federated learning is for you:
- You provide smart products powered by analytics and machine learning and you cannot create network effects for your products because the data is owned by your customers.
- You are working through long master service agreements or data-sharing agreements to get access to data from your partners.
- You are spending a lot of time forming collaboration contracts with your partners, particularly in situations where the result of this data partnership is unclear to you.
- You sit on a wealth of data and want to monetize your datasets but are afraid of the implications to your reputation.
- You are already monetizing your data, but you are spending a lot of time, effort and money making the data safe to share.
- Your infrastructure has been left behind during the movement to the cloud, but you still need analytics and machine learning.
- You have a lot of subsidiaries that belong to the same organization but cannot directly share data with each other.
- The datasets you are dealing with are too large or costly to move around so you have either decided not to use them or your ETL pipelines cost you a lot.
- You have an application or opportunity that you believe can make a significant impact, but you do not have the data yourself to make it happen.
- Your machine learning models have plateaued and you don’t know how to improve them further.
Differential privacy is often used in conjunction with federated learning. What is it, specifically?
Differential privacy is a technique to ensure privacy while simultaneously harnessing the power of machine learning. Using different mathematics than standard de-identification techniques, differential privacy adds noise during local model training, preserving most of the dataset’s statistical features while limiting the risk that any individual’s data will be identified.
In ideal implementations, differential privacy brings risk close to zero while machine learning models maintain similar performance, providing all the needed security for data de-identification without reducing the quality of the model results.
Differential privacy is included in integrate.ai’s platform by default, so developers can ensure individual data cannot be inferred from their model parameters.
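To make "adding noise during local model training" concrete, here is a minimal sketch of one common recipe: clip each local update to a maximum norm, then add Gaussian noise scaled to that bound. The interview does not say which mechanism integrate.ai uses, so the function names and the `max_norm` and `noise_multiplier` parameters are assumptions for illustration only.

```python
import math
import random
from typing import List, Optional


def clip_by_norm(vec: List[float], max_norm: float) -> List[float]:
    """Scale the vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm <= max_norm or norm == 0.0:
        return list(vec)
    scale = max_norm / norm
    return [v * scale for v in vec]


def privatize_update(
    update: List[float],
    max_norm: float = 1.0,
    noise_multiplier: float = 1.1,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Clip a local update, then add Gaussian noise proportional to the clip bound."""
    rng = rng or random.Random()
    clipped = clip_by_norm(update, max_norm)
    # Clipping bounds the size of the update; the noise is calibrated to that bound
    sigma = noise_multiplier * max_norm
    return [v + rng.gauss(0.0, sigma) for v in clipped]


# Hypothetical local update, privatized before it is shared
noisy = privatize_update([3.0, 4.0], max_norm=1.0, rng=random.Random(0))
```

The strength of the resulting guarantee depends on the noise multiplier, the number of training rounds, and a privacy accounting step that this sketch leaves out.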
Could you describe how the integrate.ai federated learning platform works?
Our platform leverages federated learning and differential privacy technologies to unlock a range of machine learning and analytics capabilities on data that would otherwise be difficult or impossible to access due to privacy, confidentiality, or technical hurdles. Operations such as model training and analytics are performed locally, and only end results are aggregated in a secure and confidential manner.
integrate.ai is packaged as a developer tool, enabling developers to seamlessly integrate these capabilities into almost any solution with an easy-to-use software development kit (SDK) and supporting cloud service for end-to-end management. Once the platform is integrated, end-users can collaborate across sensitive data sets while custodians retain full control. Solutions that incorporate integrate.ai can serve as both effective experimentation tools and production-ready services.
What are some examples of how this platform can be used in precision diagnostics?
One of the networks of partners we are working with, the Autism Sharing Initiative, collects information related to autism diagnostics as well as samples of genome data to understand how different genotypes and phenotypes are connected to autism diagnoses. No individual data site has enough data on its own for machine learning models to perform well, but collectively the sites create a meaningful sample size. However, moving data poses a high risk to security and privacy, and because of regulations and hospital policies, these research institutes have always defaulted to not sharing.
In a different network, with a similar setup, researchers are interested in improving the assignment of clinical trials to patients using a more holistic view of each patient’s history.
The different research institutes involved have access to varying information about each patient: one lab has access to their medical scans, another lab has access to their genomic information, and another institute has their clinical trial results. But these different organizations cannot directly share information with each other.
With the integrate.ai solution, each organization can use the others’ data for its own objectives without moving that data away from its custodians, thereby adhering to their internal policies.
Could you discuss the importance of making privacy understandable and how integrate.ai enables this?
Making privacy understandable means opening a lot of doors to businesses and organizations that historically were closed due to the ambiguous nature of the risk. Privacy regulations like GDPR, CCPA and HIPAA are incredibly complex and can differ depending on industry, region and type of data, making it difficult for organizations to determine what data projects are privacy safe. Rather than waste time and manpower checking every box, integrate.ai’s federated learning platform offers built-in differential privacy, homomorphic encryption, and secure multi-party computation, so developers and data custodians can rest easy knowing that their projects will automatically comply with regulatory requirements, without the hassle of jumping through each categorical hoop.
Is there anything else you would like to share about integrate.ai?
integrate.ai’s solution is an incredibly developer-friendly tool that allows for compliant, privacy-preserving and secure machine learning and analytics on top of sensitive data sources. Through simple-to-use APIs, all the complexity of regulatory compliance and contracts on top of sensitive data is abstracted away. integrate.ai’s solution allows data scientists and software developers to manage their workloads safely with minimal impact on their current infrastructure and workflows.
Thank you for the great interview. Readers who wish to learn more should visit integrate.ai.