Ingo Mierswa is the Founder & President at RapidMiner, Inc. RapidMiner brings artificial intelligence to the enterprise through an open and extensible data science platform. Built for analytics teams, RapidMiner unifies the entire data science lifecycle from data prep to machine learning to predictive model deployment. More than 625,000 analytics professionals use RapidMiner products to drive revenue, reduce costs, and avoid risks.
What was your inspiration behind launching RapidMiner?
I had worked in the data science consultancy business for many years and I saw a need for a platform that was more intuitive and approachable for people without a formal education in data science. Many of the existing solutions at the time relied on coding and scripting and they simply were not user-friendly. Furthermore, they made it difficult to manage data and to maintain the solutions that were developed within those platforms. Basically, I realized that these projects didn’t need to be so difficult, so we started to create the RapidMiner platform to allow anyone to be a great data scientist.
Can you discuss the full transparency governance that is currently being utilized by RapidMiner?
When you can’t explain a model, it’s quite hard to tune, trust and translate. A lot of data science work is the communication of the results to others so that stakeholders can understand how to improve processes. This requires trust and deep understanding. Also, issues with trust and translation can make it very hard to overcome the corporate requirements to get a model into production. We are fighting this battle in a few different ways:
As a visual data science platform, RapidMiner inherently maps out an explanation for all data pipelines and models in a highly consumable format that can be understood by data scientists or non-data scientists. It makes models transparent and helps users understand model behavior, evaluate a model’s strengths and weaknesses, and detect potential biases.
In addition, all models created in the platform come with extensive visualizations for the user – typically the user creating the model – to gain model insights, understand model behavior and evaluate model biases.
RapidMiner also provides model explanations – even when in production: For each prediction created by a model, RapidMiner generates and adds the influence factors that have led to or influenced the decisions made by that model in production.
Finally – and this is very important to me personally as I was driving this with our engineering teams a couple of years ago – RapidMiner also provides an extremely powerful model simulator capability, which allows users to simulate and observe the model behavior based on input data provided by the user. Input data can be set and changed very easily, allowing the user to understand the predictive behavior of the models on various hypothetical or real-world cases. The simulator also displays factors that influence the model’s decision. The user – in this case even a business user or domain expert – can understand model behavior, validate the model’s decision against real outcomes or domain knowledge and identify issues. The simulator allows you to simulate the real world and have a look into the future – into your future, in fact.
How does RapidMiner use deep learning?
RapidMiner’s use of deep learning is something we are very proud of. Deep learning can be very difficult to apply and non-data-scientists often struggle with setting up those networks without expert support. RapidMiner makes this process as simple as possible for users of all types. Deep learning is, for example, part of our Auto machine learning (ML) product called RapidMiner Go. Here the user does not need to know anything about deep learning to make use of those types of sophisticated models. In addition, power users can go deeper and use popular deep learning libraries like Tensorflow, Keras, or DeepLearning4J right from the visual workflows they are building with RapidMiner. This is like playing with building blocks and simplifies the experience for users with fewer data science skills. Through this approach our users can build flexible network architectures with different activation functions, a user-defined number of layers, different numbers of nodes per layer, and a choice of training techniques.
What other type of machine learning is used?
All of them! We offer hundreds of different learning algorithms as part of the RapidMiner platform – everything you can apply in the widely-used data science programming languages Python and R. Among others, RapidMiner offers methods for Naive Bayes, regression such as Generalized Linear Models, clustering such as k-Means, FP-Growth, Decision Trees, Random Forests, Parallelized Deep Learning, and Gradient Boosted Trees. These and many more are all a part of the modeling library of RapidMiner and can be used with a single click.
Can you discuss how the Auto Model knows the optimal values to be used?
RapidMiner Auto Model uses intelligent automation to accelerate everything users do and ensure accurate, sound models are built. This includes instance selection and automatic outlier removal, feature engineering for complex data types such as dates or texts, and full multi-objective automated feature engineering to select the optimal features and construct new ones. Auto Model also includes other data cleaning methods to fix common issues in data such as missing values, data profiling by assessing the quality and value of data columns, data normalization and various other transformations.
Auto Model also extracts data quality meta data – for example, how much a column behaves like an ID or whether there are lots of missing values. This meta data is used in addition to the basic meta data in automating and assisting users in ‘using the optimal values’ and dealing with data quality issues.
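To make the idea of data quality meta data concrete, here is a minimal sketch of two such measures: the share of missing values in a column and how strongly it behaves like an ID. This is an illustration only, not RapidMiner’s implementation; the function name `column_quality`, the definition of ID-ness as the share of distinct non-missing values, and the sample columns are all assumptions.

```python
from typing import Any, Dict, List, Optional


def column_quality(values: List[Optional[Any]]) -> Dict[str, float]:
    """Return simple quality meta data for one column (hypothetical helper)."""
    n = len(values)
    if n == 0:
        return {"missing_ratio": 0.0, "id_ness": 0.0}
    present = [v for v in values if v is not None]
    missing_ratio = 1.0 - len(present) / n
    # Assumed definition: ID-ness is the share of non-missing values that are
    # distinct, so 1.0 means every value is unique (the column looks like an ID).
    id_ness = len(set(present)) / len(present) if present else 0.0
    return {"missing_ratio": missing_ratio, "id_ness": id_ness}


# Made-up example columns
customer_id = [101, 102, 103, 104]
city = ["Boston", None, "Boston", "Berlin"]

id_stats = column_quality(customer_id)
city_stats = column_quality(city)
```

A column with an ID-ness close to 1.0 carries no generalizable signal and would usually be excluded from modeling, while a high missing ratio suggests imputing or dropping the column.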
For more detail, we’ve mapped it all out in our Auto Model Blueprint.
There are four basic phases where the automation is applied:
– Data prep: Automatic analysis of data to identify common quality problems like correlations, missing values, and stability.
– Automated model selection and optimization, including full validation and performance comparison, that suggests the best machine learning techniques for given data and determines the optimal parameters.
– Model simulation to help determine the specific (prescriptive) actions to take in order to achieve the desired outcome predicted by the model.
– In the model deployment and operations phase, users are shown factors like drift, bias and business impact, automatically with no extra work required.
Computer bias is an issue with any type of AI. Are there any controls in place to prevent bias from creeping into results?
Yes, this is indeed extremely important for ethical data science. The governance features mentioned before ensure that users can always see exactly what data has been used for model building, how it was transformed, and whether there is bias in the data selection. In addition, our features for drift detection are another powerful tool to detect bias. If a model in production demonstrates a lot of drift in the input data, this can be a sign that the world has changed dramatically. However, it can also be an indicator that there was severe bias in the training data. In the future, we are considering going even one step further and building machine learning models which can be used to detect bias in other models.
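One common way to quantify drift in a single input column is the Population Stability Index (PSI), which compares how training data and production data are distributed over the same bins. The sketch below is a generic illustration and not RapidMiner’s drift detection; the function name `psi`, the bin count, and the sample data are assumptions.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 5) -> float:
    """Population Stability Index between a baseline and a newer sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0  # constant baseline: fall back to a single effective bin

    def shares(data: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for x in data:
            idx = int((x - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp values outside the baseline range
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(data), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Made-up data: the production values are shifted relative to training.
baseline = [i / 10 for i in range(100)]
shifted = [v + 5 for v in baseline]

no_drift = psi(baseline, baseline)
strong_drift = psi(baseline, shifted)
```

A frequently quoted rule of thumb treats a PSI above roughly 0.25 as significant drift, but that threshold is a convention rather than a guarantee, and a high value alone does not say whether the world changed or the training data was biased.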
Can you discuss the RapidMiner AI Cloud and how it differentiates itself from competing products?
The requirements for a data science project can be large, complex and compute intensive, which is what has made the use of cloud technology such an attractive strategy for data scientists. Unfortunately, the various native cloud-based data science platforms tie you to cloud services and data storage offerings of that particular cloud vendor.
The RapidMiner AI Cloud is simply our cloud service delivery of the RapidMiner platform. The offering can be tailored to any customer’s environment, regardless of their cloud strategy. This is important these days as most businesses’ approach to cloud data management is evolving very quickly in the current climate. Flexibility is really what sets RapidMiner AI Cloud apart. It can run in any cloud service, private cloud stack or in a hybrid setup. We are cloud portable, cloud agnostic, multi-cloud – whatever you prefer to call it.
RapidMiner AI Cloud is also very low hassle, as of course we offer the ability to manage all or part of the deployment for clients so they can focus on running their business with AI, not the other way around. There’s even an on-demand option, which allows you to spin up an environment as needed for short projects.
RapidMiner Radoop eliminates some of the complexity behind data science. Can you tell us how Radoop benefits developers?
Radoop is mainly for non-developers who want to harness the potential of big data. RapidMiner Radoop executes RapidMiner workflows directly inside Hadoop in a code-free manner. We can also embed the RapidMiner execution engine in Spark so it’s easy to push complete workflows into Spark without the complexity that comes from code-centric approaches.
Would a government entity be able to use RapidMiner to analyze data to predict potential pandemics, similar to how BlueDot operates?
As a general data science and machine learning platform, RapidMiner is meant to streamline and enhance the model creation and management process, no matter what subject matter or domain is at the center of the data science/machine learning problem. While our focus is not on predicting pandemics, with the right data a subject matter expert (like a virologist or epidemiologist, in this case) could use the platform to create a model that could accurately predict pandemics. In fact, many researchers do use RapidMiner – and our platform is free for academic purposes.
Is there anything else that you would like to share about RapidMiner?
Give it a try! You may be surprised how easy data science can be and how much a good platform can improve your and your team’s productivity.
Thank you for this great interview. Readers who wish to learn more should visit RapidMiner.
Power Your ML and AI Efforts with Data Transformation – Thought Leaders
The greater the variety, velocity, and volume of data we have, the more feasible it becomes to use predictive analytics and modeling to forecast growth and identify areas of opportunity and improvement. However, getting the greatest value from reporting, machine learning (ML), and artificial intelligence (AI) tools requires an organization to access data from many sources and ensure that data is high-quality and trusted. This is often the greatest barrier to transforming big data into business strategy.
Data professionals spend so much time gathering and validating data to prepare it for use that they have little time left to focus on their primary purpose: analyzing the data and deriving business value from it. Unsurprisingly, 76 percent of data scientists say data preparation is the least enjoyable part of their job. Moreover, current data preparation efforts like data wrangling and traditional ETL require manual effort from IT professionals and are not enough to handle the scale and complexity of big data.
Companies that want to leverage the power of AI need to break away from these tedious and largely manual processes that increase the risk of “garbage in, garbage out” results. Instead, they need data transformation processes that extract raw data in multiple sources and formats, join and normalize it, and add value with business logic and metrics to make it ready for analytics. With complex data transformation, they can be sure that AI/ML models are based on clean, accurate data that delivers trustworthy results.
Leveraging the power of the cloud with ELT
The best place to prepare and transform data today is a cloud data warehouse (CDW) such as Amazon Redshift, Google BigQuery, Microsoft Azure Synapse, or Snowflake. While traditional approaches to data warehousing require data to be extracted and transformed before it can be loaded, a CDW leverages the scalability and performance of the cloud for faster data ingestion and transformation and makes it possible to extract and load data from many disparate data sources before transforming it inside the CDW.
Ideally, the ELT model initially moves data into a section of the CDW reserved for raw staging data. From there, the CDW can use its near-unlimited computing resources for data integration and ETL jobs that cleanse, aggregate, filter, and join the staged data. The data can then be transformed into a different schema (data vault or star schema, for example), optimizing the data for reporting and analytics.
The ELT approach also allows you to replicate raw data within the CDW for later preparation and transformation when and as needed. This lets you use business intelligence tools that determine schema on read and produce specific transformations on demand, effectively letting you transform the same data in multiple ways as you discover new uses for it.
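The load-then-transform pattern can be sketched in a few lines. The example below uses Python’s built-in `sqlite3` module as a stand-in for a cloud data warehouse, which is an assumption made purely so the sketch is runnable; the table names, columns, and rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for a cloud data warehouse.
conn = sqlite3.connect(":memory:")

# Extract + Load: raw data lands in a staging table untouched, as text.
conn.execute("CREATE TABLE staging_orders (order_id TEXT, customer TEXT, amount TEXT)")
raw_rows = [
    ("1", " alice ", "10.50"),
    ("2", "BOB", "4.00"),
    ("3", "alice", None),  # missing amount, filtered out during transformation
    ("4", "Bob", "5.50"),
]
conn.executemany("INSERT INTO staging_orders VALUES (?, ?, ?)", raw_rows)

# Transform: cleanse, filter, and aggregate inside the warehouse with SQL.
conn.execute(
    """
    CREATE TABLE customer_totals AS
    SELECT LOWER(TRIM(customer))      AS customer,
           SUM(CAST(amount AS REAL))  AS total_amount,
           COUNT(*)                   AS order_count
    FROM staging_orders
    WHERE amount IS NOT NULL
    GROUP BY LOWER(TRIM(customer))
    """
)

totals = conn.execute(
    "SELECT customer, total_amount, order_count FROM customer_totals ORDER BY customer"
).fetchall()
```

Because the staging table is left untouched, the same raw rows can later be transformed in a different way without extracting them from the source systems again.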
Accelerating machine learning models
These real-world examples show how two companies in different industries are leveraging data transformation in a CDW to drive AI initiatives.
A boutique marketing and advertising agency built a proprietary customer management platform to help its clients better identify, understand, and motivate their customers. By transforming data within a CDW, the platform quickly and easily integrates real-time customer data across channels into a 360-degree customer view that informs the platform’s AI/ML models for making customer interactions more consistent, timely, and personalized.
A global logistics firm making 100 million deliveries to 37 million unique customers in 72 countries needs vast amounts of data to power its daily operations. Adopting data transformation within a CDW enabled the company to deploy 200 machine learning models in a single year. These models make 500,000 predictions every day, significantly improving efficiency and driving superior customer service that has reduced inbound call center calls by 40 percent.
Best practices for getting started
Companies that want to support their AI/ML initiatives with the power of data transformation in the cloud need to understand their specific use case and needs. Beginning with what you want to do with your data – reducing fuel costs by optimizing delivery routes, boosting sales by delivering next best offers to customer service agents in real time, etc. – lets you reverse-engineer your processes so you can identify which data will deliver relevant results.
Once you determine what data your AI/ML project needs to build its models, you need a cloud-native ELT solution that will make your data fit for use. Look for a solution that:
Is vendor-neutral and able to work with your current technology stack
Is flexible enough to scale up and down and adapt as your technology stack changes
Can handle complex data transformations from multiple data sources
Offers a pay-as-you-go pricing model in which you pay only for what you use
Is purpose-built for your preferred CDW so you can fully leverage that CDW’s features to run jobs faster and transform data seamlessly.
A cloud data transformation solution that caters to the common denominators of all CDWs may provide a consistent experience, but only one that enables the powerful differentiating features of your chosen CDW can deliver the high performance that speeds time to insight. The right solution will enable you to power your AI/ML projects with more clean, trusted data from more sources in less time – and generate faster, more reliable results that drive previously unrealized business value and innovation.
Owkin Launches the Collaborative COVID-19 Open AI Consortium (COAI)
After a fresh round of funding, Owkin recently launched the Covid-19 Open AI Consortium (COAI). This consortium will enable advanced collaborative research and accelerate clinical development of effective treatments for patients who are infected with COVID-19.
The first stage of the project focuses on fully understanding and treating cardiovascular complications in COVID-19 patients. This will be performed in collaboration with CAPACITY, an international registry working with over 50 centers around the world. Other areas of research will include patient outcomes and triage, and the prediction and characterization of immune response.
Owkin’s manifesto perfectly states the company’s vision:
“We are fully engaged in this new frontier with the goal of improving drug development and patient outcomes. Founded in 2016, Owkin has quickly emerged as a leader in bringing Artificial Intelligence (AI) and Machine Learning (ML) technologies to the healthcare industry. Our solutions improve the traditional medical research paradigm by turning a previously siloed, disjointed system into an innovative and collaborative one that, above all, puts the privacy of patients first.”
To understand Owkin’s approach, one must first understand a technology called Federated Learning. Federated learning offers a framework for AI development that enables enterprises to train machine learning models on data that is distributed at scale across multiple medical institutions without centralizing the data. The benefits of this are two-fold: there is no loss of privacy, since the data is not directly linked to any specific patient, and the data remains at the healthcare institution that collects it.
The use of Federated Learning thereby gives access to a significantly wider range of data than any single organization possesses in-house. This means that by using Federated Learning, researchers can draw on as much data as is available, and in general, the more data a machine learning system can learn from, the more accurate it tends to become.
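The core mechanism can be illustrated with federated averaging, a widely used federated learning scheme: each site trains on its own data, and only model parameters are sent back and averaged. The sketch below is a toy example with a one-parameter linear model and is not Owkin’s implementation; the function names, learning rate, and "hospital" datasets are all assumptions.

```python
from typing import List, Tuple

Dataset = List[Tuple[float, float]]  # (x, y) pairs that never leave their site


def local_update(w: float, data: Dataset, lr: float = 0.01, epochs: int = 20) -> float:
    """Run a few gradient steps for the model y = w * x on one site's data."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_average(w: float, sites: List[Dataset], rounds: int = 30) -> float:
    """Each round, sites train locally and only their weights are averaged."""
    total = sum(len(d) for d in sites)
    for _ in range(rounds):
        local_ws = [local_update(w, d) for d in sites]
        # Weight each site's model by its number of samples.
        w = sum(lw * len(d) for lw, d in zip(local_ws, sites)) / total
    return w


# Made-up data following y = 2x, split across two hypothetical institutions.
hospital_a = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
hospital_b = [(4.0, 8.0), (5.0, 10.0)]

w = federated_average(0.0, [hospital_a, hospital_b])
```

The coordinating server only ever sees the weight `w` from each site, never the underlying records. Real deployments typically add further protections on top of this, since model updates themselves can leak information.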
There are currently multiple national efforts in using AI to tackle COVID-19. The problem with many of these nationalistic disjointed efforts is that the data is specific to one country. Collecting data from a single region may fail to reveal important information that would enable researchers to fully understand how exposure to environmental elements, ethnic makeup, genetics, age, and gender may play important roles in understanding this disease. This is why collaboration is so important, and why gathering data from multiple jurisdictions is even more important.
As described by Owkin, they seek to use Federated Learning for the following:
“We aim to help them understand why drug efficacy varies from patient to patient, enhance the drug development process and identify the best drug for the right patient at the right time, to improve treatment outcomes.”
Understanding and treating cardiovascular health issues will be the first challenge undertaken by Owkin. As important as data is, what is even more important are the efforts of researchers and contributors who are spearheading this effort. This is why Unite.AI will be releasing three interviews with researchers that are contributing to the COAI project.
Sanjay Budhdeo, MD, Business Development:
Sanjay is a practicing physician. He holds Medical Sciences and Medical degrees from Oxford University and a Masters Degree from Cambridge University. Sanjay has research experience in neuroimaging, epidemiology and digital health. Prior to joining Owkin as a Partnership Manager, he was a Senior Associate at Boston Consulting Group, where he focused on data and digital in healthcare. He sits on the Patient Safety Committee at the Royal Society of Medicine and was previously a Specialist Advisor at the Care Quality Commission.
Dr. Stephen Weng, Principal Researcher:
Stephen is an Assistant Professor of Integrated Epidemiology and Data Science who leads the data science research within the Primary Care Stratified Medicine Research Group.
He integrates traditional epidemiological methods and study design with new informatics-based approaches, harnessing and interrogating “big health care data” from electronic medical records for the purpose of risk prediction modeling, phenotyping chronic diseases, data science methods research, and translation of stratified medicine into primary care.
Folkert W. Asselbergs, Principal Investigator
Folkert is professor of precision medicine in cardiovascular disease at the Institute of Cardiovascular Science, UCL, Director of the NIHR BRC Clinical Research Informatics Unit at UCLH, professor of cardiovascular genetics and consultant cardiologist at the department of Cardiology, University Medical Center Utrecht, and chief scientific officer of the Durrer Center for Cardiovascular Research, Netherlands Heart Institute. Prof Asselbergs has published more than 275 scientific papers and obtained funding from the Leducq Foundation, the British and Dutch Heart Foundations, the EU (FP7, ERA-CVD, IMI, BBMRI), and the National Institutes of Health (R01).
Unite.AI hopes that using biomedical images, genomics, and clinical data to discover biomarkers and mechanisms associated with diseases and treatment outcomes will propel the next generation of treatments for COVID-19. We are contributing to this important project by highlighting the personalities behind this important global effort.
Julien Rebetez, Lead Machine Learning Engineer at Picterra – Interview Series
Julien Rebetez is the Lead Software & Machine Learning Engineer at Picterra. Picterra provides a cloud-based geospatial platform specially designed for training deep learning based detectors, quickly and securely.
Without a single line of code and with only a few human-made annotations, Picterra’s users build and deploy unique, actionable, and ready-to-use deep learning models.
It automates the analysis of satellite and aerial imagery, enabling users to identify objects and patterns.
What is it that attracted you to machine learning and AI?
I started programming because I wanted to make video games and got interested in computer graphics at first. This led me to computer vision, which is kind of the reverse process where instead of having the computer create a fake environment, you have it perceive the real environment. During my studies, I took some Machine Learning courses and I got interested in the computer vision angle of it. I think what’s interesting about ML is that it’s at the intersection between software engineering, algorithms and math and it still feels kind of magical when it works.
You’ve been working on using machine learning to analyze satellite imagery for many years now. What was your first project?
My first exposure to satellite imagery was the Terra-i project (to detect deforestation) and I worked on it during my studies. I was amazed at the amount of freely available satellite data that is produced by the various space agencies (NASA, ESA, etc…). You can get regular images of the planet for free every day or so and this is a great resource for many scientific applications.
Could you share more details regarding the “Terra-i” project?
The Terra-i project (http://terra-i.org/terra-i.html) was started by Professor Andrez Perez-Uribe, from HEIG-VD (Switzerland) and is now led by Louis Reymondin, from CIAT (Colombia). The idea of the project is to detect deforestation using freely available satellite images. At the time, we worked with MODIS imagery (250m pixel resolution) because it provided a uniform and predictable coverage (both spatially and temporally). We would get a measurement for each pixel every few days and from this time series of measurements, you can try to detect anomalies or novelties as we call them in ML sometimes.
This project was very interesting because the amount of data was a challenge at the time and there was also some software engineering involved to make it work on multiple computers and so on. From the ML side, it used a Bayesian Neural Network (not very deep at the time 🙂 ) to predict what the time series of a pixel should look like. If the measurement didn’t match the prediction, then we would have an anomaly.
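The predict-then-compare idea can be sketched without a neural network. In the example below, a simple moving average of recent measurements stands in for the Bayesian Neural Network, which is a deliberate simplification and not the Terra-i method; the function name, window size, tolerance, and pixel values are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def detect_novelties(series: List[float], window: int = 5, k: float = 3.0) -> List[int]:
    """Flag time steps where the measurement is far from the prediction."""
    flagged = []
    for t in range(window, len(series)):
        history = series[t - window:t]
        predicted = mean(history)  # stand-in predictor: recent average
        # Tolerance scales with recent variability, with a small floor.
        tolerance = k * max(pstdev(history), 0.01)
        if abs(series[t] - predicted) > tolerance:
            flagged.append(t)
    return flagged


# Made-up vegetation signal for one pixel: stable, then a sudden drop.
pixel_series = [0.80, 0.82, 0.79, 0.81, 0.80, 0.81, 0.79, 0.35, 0.33, 0.34]
novelties = detect_novelties(pixel_series)
```

Only the first low measurement is flagged here, because the drop itself inflates the recent variability and the tolerance along with it. A learned predictor with its own uncertainty estimate handles this more gracefully.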
As part of this project, I also worked on cloud removal. We took a traditional signal processing approach there, where you have a time series of measurements and some of them will be completely off because of a cloud. We used a Fourier-based approach (HANTS) to clean the time series before detecting novelties in it. One of the difficulties is that if we cleaned it too strongly, we’d also remove novelties, so there were quite a few experiments to do to find the right parameters.
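A simplified version of harmonic cleaning can be sketched as follows: fit a constant plus a few sine and cosine terms by least squares, drop points that fall far below the fitted curve (clouds depress the signal), and refit. This is written in the spirit of HANTS but is not the HANTS algorithm itself nor the project’s code; the function names, the `threshold` parameter, and the synthetic series are assumptions.

```python
import math
from typing import List, Tuple


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def basis(t: float, period: float, harmonics: int) -> List[float]:
    """Constant term plus cosine/sine pairs for each harmonic."""
    row = [1.0]
    for h in range(1, harmonics + 1):
        angle = 2 * math.pi * h * t / period
        row += [math.cos(angle), math.sin(angle)]
    return row


def harmonic_clean(values: List[float], period: float, harmonics: int = 1,
                   threshold: float = 0.1, max_iter: int = 5) -> Tuple[List[float], List[bool]]:
    """Return the fitted curve and a mask of points kept as valid."""
    rows = [basis(t, period, harmonics) for t in range(len(values))]
    p = len(rows[0])
    keep = [True] * len(values)
    fitted = list(values)
    for _ in range(max_iter):
        used = [i for i, k in enumerate(keep) if k]
        # Normal equations for least squares on the points still kept.
        ata = [[sum(rows[i][j] * rows[i][k] for i in used) for k in range(p)] for j in range(p)]
        atb = [sum(rows[i][j] * values[i] for i in used) for j in range(p)]
        coef = solve(ata, atb)
        fitted = [sum(c * r for c, r in zip(coef, row)) for row in rows]
        # Reject only low outliers: points far below the fitted curve.
        rejected = [i for i in used if fitted[i] - values[i] > threshold]
        if not rejected:
            break
        for i in rejected:
            keep[i] = False
    return fitted, keep


# Synthetic seasonal signal with two "cloudy" measurements set to zero.
series = [0.5 + 0.3 * math.sin(2 * math.pi * t / 12) for t in range(24)]
series[5] = 0.0
series[14] = 0.0
fitted, keep = harmonic_clean(series, period=12)
```

The `threshold` parameter plays the role of the cleaning strength mentioned above: set it too low and genuine changes in the signal are rejected along with the clouds.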
You also designed and implemented a deep learning system for automatic crop type classification from aerial (drone) imagery of farm fields. What were the main challenges at the time?
This was my first real exposure to Deep Learning. At the time, I think the main challenges were more about getting the framework to run and properly use a GPU than about the ML itself. We used Theano, which was one of the ancestors of Tensorflow.
The goal of the project was to classify the type of crop in a field, from drone imagery. We tried an approach where the Deep Learning Model was using color histograms as inputs as opposed to just the raw image. To make this work reasonably quickly, I remember having to implement a custom Theano layer, all the way to some CUDA code. That was a great learning experience at the time and a good way to dig a bit into the technical details of Deep Learning.
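To illustrate what using color histograms as inputs can look like, the sketch below turns an image patch into a fixed-length feature vector of per-channel histograms. It is plain Python rather than a Theano layer and is not the project’s code; the function name, bin count, and sample pixels are assumptions.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each in 0..255


def color_histogram(pixels: List[Pixel], bins: int = 4) -> List[float]:
    """Concatenate normalized R, G, and B histograms into one feature vector."""
    hist = [0.0] * (3 * bins)
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            idx = min(value * bins // 256, bins - 1)
            hist[channel * bins + idx] += 1
    n = len(pixels)
    return [h / n for h in hist] if n else hist


# Made-up patch of four mostly green pixels.
patch = [(30, 200, 40), (35, 190, 50), (200, 180, 60), (25, 210, 45)]
features = color_histogram(patch)
```

The feature vector has the same length regardless of patch size and discards spatial layout, which makes it a compact input when the overall color distribution is what distinguishes one crop from another.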
You’re officially the Lead Software and Machine Learning Engineer at Picterra. How would you best describe your day to day activities?
It really varies, but a lot of it is about keeping an eye on the overall architecture of the system and the product in general and communicating with the various stakeholders. Although ML is at the core of our business, you quickly realize that most of the time is not spent on ML itself, but all the things around it: data management, infrastructure, UI/UX, prototyping, understanding users, etc… This is quite a change from Academia or previous experience in bigger companies where you are much more focused on a specific problem.
What’s interesting about Picterra is that we not only run Deep Learning Models for users, but we actually allow them to train their own. That is different from a lot of the typical ML workflows where you have the ML team train a model and then publish it to production. What this means is that we cannot manually play with the training parameters as you often do. We have to find some training method that will work for all of our users. This led us to create what we call our ‘experiment framework’, which is a big repository of datasets that simulates the training data our users would build on the platform. We can then easily test changes to our training methodology against these datasets and evaluate if they help or not. So instead of evaluating a single model, we are more evaluating an architecture + training methodology.
The other challenge is that our users are not ML practitioners, so they don’t necessarily know what a training set is, what a label is and so on. Building a UI to allow non-ML practitioners to build datasets and train ML models is a constant challenge and there is a lot of back-and-forth between the UX and ML teams to make sure we guide users in the right direction.
Some of your responsibilities include prototyping new ideas and technologies. What are some of the more interesting projects that you have worked on?
I think the most interesting one at Picterra was the Custom Detector prototype. 1.5 years ago, we had ‘built-in’ detectors on the platform: those were detectors that we trained ourselves and made accessible to users. For example, we had a building detector, a car detector, etc…
This is actually the typical ML workflow: you have some ML engineer develop a model for a specific case and then you serve it to your clients.
But we wanted to do something different and push the boundaries a bit. So we said: “What if we allow users to train their own models directly on the platform?” There were a few challenges to make this work: first, we didn’t want this to take multiple hours. If you want to keep this feeling of interactivity, training should take a few minutes at most. Second, we didn’t want to require thousands of annotations, which is typically what you need for large Deep Learning models.
So we started with a super simple model, did a bunch of tests in jupyter and then tried to integrate it in our platform and test the whole workflow, with a basic UI and so on. At first, it wasn’t working very well in most cases, but there were a few cases where it would work. This gave us hope and we started iterating on the training methodology and the model. After some months, we were able to reach a point where it worked well, and we now have our users using this all the time.
What was interesting about this is the double challenge of keeping the training fast (currently a few minutes) and therefore the model not too complex, but at the same time making it complex enough that it works and solves users’ problems. On top of that, it works with few (<100) labels for a lot of cases.
We also applied many of Google’s “Rules of Machine Learning”, in particular the ones about implementing the whole pipeline and metrics before starting to optimize the model. It puts you into ‘system thinking’ mode where you figure out that not all your problems should be handled by the core ML, but some of them can be pushed to the UI, some of them pre/post-processed, etc…
What are some of the machine learning technologies that are used at Picterra?
In production, we are currently using Pytorch to train & run our models. We are also using Tensorflow from time to time, for some specific models developed for clients. Other than that, it’s a pretty standard scientific Python stack (numpy, scipy) with some geospatial libraries (gdal) thrown in.
Can you discuss how Picterra works in the backend once someone uploads images and wishes to train the neural network to properly annotate objects?
Sure, so first when you upload an image, we process it and store it in a “Cloud-Optimized-Geotiff” (COG) format on our blobstore (Google Cloud Storage), which allows us to quickly access blocks of the image without having to download the whole image later on. This is a key point because geospatial imagery can be huge: we have users routinely working with 50000×50000 images.
So then, to train your model, you will have to create your training dataset through our web UI. You will do that by defining 3 types of areas:
- ‘training areas’, in which you will draw training labels
- ‘testing areas’, where the model will predict to let you visualize some results
- ‘accuracy area’, where you will draw labels as well, but those are not used for training, only for scoring
Once you have created this dataset, you can simply click ‘Train’ and we’ll train a detector for you. What happens next is that we enqueue a training job, have one of our GPU workers pick it up (new GPU workers are started automatically if there are many concurrent jobs), train your model, save its weights to the blobstore and finally predict in the ‘testing area’ to display on the UI. From there, you can iterate over your model. Typically, you’ll spot some mistakes in ‘testing areas’ and add ‘training areas’ to help the model improve.
Once you are happy with the score of your model, you can run it at scale. From the user’s point of view, this is really simple: just click on ‘Detect’ next to the image you want to run it on. But it’s a bit more involved under the hood if the image is large. To speed things up, handle failures and avoid having detections take multiple hours, we break down large detections into grid cells and run an independent detection job for each cell. This allows us to run very large-scale detections. For example, we had a customer run detection over the whole country of Denmark on 25cm imagery, which is in the range of TB of data – for a single project. We’ve covered a similar project in this Medium post.
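Splitting a large image into independent grid cells can be sketched as below. This is a generic illustration rather than Picterra’s code; the function name, the cell size of 8192 pixels, and the tuple layout are assumptions.

```python
from typing import Iterator, Tuple

Cell = Tuple[int, int, int, int]  # (x_offset, y_offset, width, height) in pixels


def grid_cells(width: int, height: int, cell_size: int) -> Iterator[Cell]:
    """Yield non-overlapping cells that together cover the whole image."""
    for y in range(0, height, cell_size):
        for x in range(0, width, cell_size):
            # Cells on the right and bottom edges may be smaller.
            yield (x, y, min(cell_size, width - x), min(cell_size, height - y))


# A 50000 x 50000 image split into cells of at most 8192 x 8192 pixels.
cells = list(grid_cells(50000, 50000, 8192))
```

Each cell can then be submitted as its own detection job, so a failed cell can be retried on its own without rerunning the whole image. In practice, cells are often given a small overlap so that objects lying across a cell boundary are not cut in half; that refinement is left out here.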
Is there anything else that you would like to share about Picterra?
I think what’s great about Picterra is that it is a unique product, at the intersection between ML and Geospatial. What differentiates us from other companies that process geospatial data is that we equip our users with a self-serve platform. They can easily find locations, analyze patterns, and detect and count objects on Earth observation imagery. It would be impossible without machine learning, but our users don’t even need basic coding skills – the platform does the work based on a few human-made annotations. For those who want to go deeper and learn the core concepts of machine learning in the geospatial domain, we have launched a comprehensive online course.
What is also worth mentioning is that possible applications of Picterra are endless – detectors built on the platform have been used in city management, precision agriculture, forestry management, humanitarian and disaster risk management, farming, etc., just to name the most common applications. We are basically surprised every day by what our users are trying to do with our platform. You can give it a try and let us know how it worked on social media.
Thank you for the great interview and for sharing with us how powerful Picterra is. Readers who wish to learn more should visit the Picterra website.