Garth is the Co-Founder & CEO of GenRocket. He is an expert at launching and building technology startups. He has held numerous senior leadership roles in startups over the past 25 years including President & CEO of Concentric Visions (VC backed + acquired), VP Sales & VP Business Development at Indus River Networks (VC backed + acquired), VP Sales & Marketing at Digital Products (acquired) and National Sales Manager at Leading Edge Products.
In 2012 you co-founded GenRocket, a company that specializes in enterprise test data automation. What was the initial vision that inspired this?
I met GenRocket Co-Founder Hycel Taylor in 2011 and he educated me about the need for accurate, conditioned test data for effective software testing. Hycel had done a lot of research and found a huge gap when it came to test data solutions. Hycel decided to architect his own platform that was low cost, really fast and flexible.
What are some of the benefits of using test data versus production data?
Proper software testing means not just testing “positive” conditions of an application but also testing “negative” conditions as well as permutations and edge cases. Production data is useful for data analytics but has limitations for many test cases. One of our financial services customers shared that their production data can only fully satisfy 33% of their testing requirements.
The speed of data generation is important. What speeds can GenRocket deliver?
For a typical automated test case we deliver test data in about 100 milliseconds. For volume data GenRocket generates at a rate of about 10,000 rows of data per second. For big data applications we can use multiple GenRocket instances in parallel to generate millions to billions of rows of data in minutes.
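As a rough illustration of the figures quoted above, the following sketch estimates how long a volume run would take and how many parallel instances it would need. It takes the 10,000 rows per second rate from the interview and assumes perfectly linear scaling across instances, which is an assumption made here and not a claim from GenRocket. The function names are hypothetical.

```python
import math

# Approximate per-instance rate quoted in the interview.
ROWS_PER_SECOND = 10_000


def generation_minutes(total_rows, instances=1, rows_per_second=ROWS_PER_SECOND):
    """Estimate wall-clock minutes to generate total_rows.

    Assumes throughput scales linearly with the number of instances,
    which is an idealization for illustration only.
    """
    if total_rows < 0 or instances < 1 or rows_per_second < 1:
        raise ValueError("rows must be >= 0; instances and rate must be >= 1")
    seconds = total_rows / (instances * rows_per_second)
    return seconds / 60


def instances_needed(total_rows, target_minutes, rows_per_second=ROWS_PER_SECOND):
    """Smallest number of parallel instances that meets the time target."""
    if target_minutes <= 0:
        raise ValueError("target_minutes must be positive")
    rows_per_instance = rows_per_second * target_minutes * 60
    return math.ceil(total_rows / rows_per_instance)


if __name__ == "__main__":
    print(generation_minutes(1_000_000))          # one instance, one million rows
    print(instances_needed(1_000_000_000, 10))    # one billion rows in ten minutes
```

Under these assumptions a single instance needs a little under two minutes for a million rows, while a billion rows in ten minutes would call for well over a hundred instances.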
There’s always a learning curve when it comes to generating both test and production data. Do you offer any type of user training?
GenRocket University was created in 2017 to educate our customers and channel partners on GenRocket. We offer multiple online training courses at no cost, including our “GenRocket Certified Engineer” training course.
You currently serve enterprise customers in over 10 verticals. What are these different types of enterprise customers?
Our customers across the world include major banks, numerous global financial services companies, major U.S. healthcare providers, major manufacturers, global supply chain firms, and data information services firms.
Our most active industry verticals are banking, financial services, insurance, healthcare and manufacturing.
How does GenRocket differ from other Test Data Management tools?
Traditional Test Data Management (TDM) solutions copy, mask and refresh production data. These solutions tend to be expensive and complex, and production data also has limitations for software testing. GenRocket flips the TDM paradigm by quickly and accurately generating most of the required data and querying only the small amount of production data that some tests need. The GenRocket Test Data Automation (TDA) approach is faster, lower cost and easier to implement and use than TDM.
Could you tell us a little about GenRocket’s compatibility with testing frameworks?
Every organization has its own testing framework or testing tools, so GenRocket has the flexibility to integrate into every customer’s environment. GenRocket can integrate with just about any testing framework in any language and any testing tool like Jenkins or Selenium. GenRocket can also insert data into any database and can send data over web services. GenRocket also offers integration with Salesforce and can support complex data feeds like NACHA in banking and EDI and HL7 for the healthcare industry.
Is there anything else that you would like to tell us about GenRocket?
We rely on an extensive network of trained channel partners to introduce and deliver GenRocket test data solutions to our global customers. Partners like Cognizant, HCL, Wipro, Hexaware, Mindtree and UST Global are actively working with GenRocket.
To learn more, visit GenRocket.
Computer Scientists Tackle Bias in AI
Computer scientists from Princeton and Stanford University are now addressing problems of bias in artificial intelligence (AI). They are working on methods that result in fairer data sets containing images of people. The researchers work closely with ImageNet, which is a database of more than 13 million images. Throughout the past decade, ImageNet has helped advance computer vision. With the use of their methods, the researchers then recommended improvements for the database.
ImageNet includes images of objects, landscapes, and people. Researchers who create machine learning algorithms that classify images use ImageNet as a source of data. Because of the database’s massive size, automated image collection and crowdsourced image annotation were necessary. Now, the ImageNet team is working to correct biases and other issues in its images of people, which are unintended consequences of how ImageNet was constructed.
Olga Russakovsky is a co-author and an assistant professor of computer science at Princeton.
“Computer vision now works really well, which means it’s being deployed all over the place in all kinds of contexts,” she said. “This means that now is the time for talking about what kind of impact it’s having on the world and thinking about these kinds of fairness issues.”
In the new paper, the ImageNet team systematically identified non-visual concepts and offensive categories. These categories included racial and sexual characterizations, and the team proposed removing them from the database. The team has also developed a tool that allows users to specify and retrieve image sets of people, and it can do so by age, gender expression, and skin color. The goal is to create algorithms that more fairly classify people’s faces and activities in images.
The work done by the researchers was presented on Jan. 30 at the Association for Computing Machinery’s Conference on Fairness, Accountability, and Transparency in Barcelona, Spain.
“There is very much a need for researchers and labs with core technical expertise in this to engage in these kinds of conversations,” said Russakovsky. “Given the reality that we need to collect the data at scale, given the reality that it’s going to be done with crowdsourcing because that’s the most efficient and well-established pipeline, how do we do that in a way that’s fairer — that doesn’t fall into these kinds of prior pitfalls? The core message of this paper is around constructive solutions.”
ImageNet was launched in 2009 by a group of computer scientists at Princeton and Stanford. It was meant to serve as a resource for academic researchers and educators. The creation of the system was led by Princeton alumna and faculty member Fei-Fei Li.
ImageNet was able to become such a large database of labeled images through the use of crowdsourcing. One of the main platforms used was Amazon Mechanical Turk (MTurk), where workers were paid to verify candidate images. This caused some problems, including many biases and inappropriate categorizations.
Lead author Kaiyu Yang is a graduate student in computer science.
“When you ask people to verify images by selecting the correct ones from a large set of candidates, people feel pressured to select some images and those images tend to be the ones with distinctive or stereotypical features,” he said.
The first part of the study involved filtering out potentially offensive or sensitive person categories from ImageNet. Offensive categories were defined as those that contained profanity or racial or gender slurs. Sensitive categories included, for example, the classification of people based on sexual orientation or religion. Twelve graduate students from diverse backgrounds were brought in to annotate the categories, and they were instructed to label a category sensitive if they were unsure of it. About 54% of the categories were eliminated, or 1,593 out of the 2,932 person categories in ImageNet.
MTurk workers then rated the “imageability” of the remaining categories on a scale of 1 to 5. A total of 158 categories were classified as both safe and imageable, with a rating of 4 or higher. This filtered set of categories included more than 133,000 images, which can be highly useful for training computer vision algorithms.
The researchers studied the demographic representation of people in the images and assessed the level of bias in ImageNet. Content sourced from search engines often provides results that overrepresent males, light-skinned people, and adults between the ages of 18 and 40.
“People have found that the distributions of demographics in image search results are highly biased, and this is why the distribution in ImageNet is also biased,” said Yang. “In this paper we tried to understand how biased it is, and also to propose a method to balance the distribution.”
The researchers considered three attributes that are also protected under U.S. anti-discrimination laws: skin color, gender expression, and age. The MTurk workers then annotated each attribute of each person in an image.
The results showed that ImageNet’s content has a considerable bias. The most underrepresented groups were dark-skinned people, females, and adults over the age of 40.
A web-interface tool was designed that allows users to obtain a set of images that are demographically balanced in a way that the user chooses.
“We do not want to say what is the correct way to balance the demographics, because it’s not a very straightforward issue,” said Yang. “The distribution could be different in different parts of the world — the distribution of skin colors in the U.S. is different than in countries in Asia, for example. So we leave that question to our user, and we just provide a tool to retrieve a balanced subset of the images.”
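To make the idea of retrieving a balanced subset concrete, here is a minimal sketch of how such a selection could work. It is not the ImageNet team's tool: the `balanced_subset` function, the record layout, and the `skin_color` field are hypothetical, and the sketch assumes one attribute annotation per image, whereas the study annotated each person in an image. The user supplies the target distribution, in line with the point that the right balance is left to the user.

```python
import random
from collections import defaultdict


def balanced_subset(images, attribute, target, size, seed=0):
    """Sample `size` images so that `attribute` follows the `target` mix.

    images:    list of dicts, each with an annotation under `attribute`
               (hypothetical layout, one annotation per image)
    target:    dict mapping attribute value -> desired fraction of the subset
    size:      total number of images to return
    """
    if abs(sum(target.values()) - 1.0) > 1e-9:
        raise ValueError("target fractions must sum to 1")

    rng = random.Random(seed)  # seeded so the selection is reproducible

    # Group the annotated images by attribute value.
    by_group = defaultdict(list)
    for image in images:
        by_group[image[attribute]].append(image)

    subset = []
    for value, fraction in target.items():
        wanted = round(size * fraction)
        pool = by_group.get(value, [])
        if wanted > len(pool):
            raise ValueError(
                f"only {len(pool)} images with {attribute}={value!r}, need {wanted}"
            )
        subset.extend(rng.sample(pool, wanted))
    return subset


if __name__ == "__main__":
    # Toy data set skewed 90/10, mirroring the kind of imbalance described.
    data = [{"id": i, "skin_color": "light"} for i in range(90)]
    data += [{"id": 90 + i, "skin_color": "dark"} for i in range(10)]
    picked = balanced_subset(data, "skin_color", {"light": 0.5, "dark": 0.5}, 20)
    print(len(picked))
```

Because the sketch samples without replacement, the size of a balanced subset is capped by the least represented group, which is one practical cost of rebalancing a skewed collection.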
The ImageNet team is now working on technical updates to its hardware and database. They are also trying to implement the filtering of the person categories and the rebalancing tool developed in this research. ImageNet is set to be re-released with the updates, along with a call for feedback from the computer vision research community.
The paper was also co-authored by Princeton Ph.D. student Klint Qinami and Assistant Professor of Computer Science Jia Deng. The research was supported by the National Science Foundation.
Data Science Companies Use AI To Protect Environment And Fight Climate Change
As the nations of Earth attempt to invent and implement solutions to the growing threat of climate change, just about every option is on the table. Investing in renewable sources of energy and cutting emissions around the globe are the dominant strategies, but artificial intelligence can also help reduce the damage done by climate change. As reported by LiveMint, artificial intelligence algorithms can help conservationists limit deforestation, protect vulnerable species of animals from climate change, fight poaching, and monitor air pollution.
The data science company Gramener has employed machine learning to help get estimates of the number of penguin colonies across Antarctica by analyzing images taken by camera traps. The size of penguin colonies in Antarctica has decreased dramatically over the course of the past decade, impacted by climate change. In order to help conservation groups and scientists analyze image data of Antarctic penguins, Gramener employed convolutional neural networks to clean up the data, and once the data was clean it was deployed through Microsoft’s data science virtual machine. The model developed by Gramener makes use of penguin density in the captured images in order to achieve estimates of penguin populations faster and more reliably. Gramener also used similar techniques to estimate salmon populations in various rivers.
As LiveMint reported, there are other animal conservation projects that make use of AI as well, such as the Elephant Listening Project designed by Conservation Metrics. Populations of elephants throughout Africa have been suffering because of illegal poaching. The project utilizes machine learning algorithms to identify the vocalizations of elephants, distinguishing them from sounds made by other animals. By training machine learning models to recognize unique sound patterns and then using data from sensors distributed throughout elephant habitat, the researchers can develop a system that alerts them to potential poaching or deforestation. Such a system can listen for sounds like vehicles or gunshots, and if these sounds are detected, alerts can be sent out to authorities.
Machine learning algorithms can also be used to predict the damage that can be done by severe weather events like thunderstorms and tropical cyclones. For instance, IBM has produced a new high-resolution atmospheric forecasting model intended to track potentially damaging weather events.
Jaspreet Bindra, the author of The Tech Whisperer and an expert on digital transformations, told LiveMint that machine learning is necessary to keep up with the changes caused by climate change. Bindra explained:
“Global warming has changed the way climate modeling is done. Using AI/ML is very important as it will make things happen faster. All this will require lots of computing power and, going forward, quantum computers might play an important role.”
Blue Sky Analytics, based in Gurugram, India, is another example of using machine learning algorithms to protect the environment. An application developed by Blue Sky Analytics is used to monitor industrial emissions and air quality in general. Data is gathered from satellites and ground-level sensors and then analyzed.
It requires a substantial amount of computing power to analyze and understand the environmental effects of issues like climate change, poaching, and pollution. UC Berkeley is trying to speed up research by crowdsourcing the computation of environmental data using smartphones and PCs. The crowdsourcing project is called BOINC (Berkeley Open Infrastructure for Network Computing). Those who want to assist in the crowdsourced data analysis just have to install the BOINC software on a chosen device, and when that device isn’t being used, its available CPU and GPU resources will be leveraged to carry out computations.
Ricky Costa, CEO of Quantum Stat – Interview Series
What initially got you interested in artificial intelligence?
Randomness. I was reading a book on probability when I came across a famous theorem. At the time, I naively wondered if I could apply this theorem to a natural language problem I was attempting to solve at work. As it turns out, the algorithm already existed, unbeknownst to me. It was called Naïve Bayes, a very famous and simple generative model used in classical machine learning. That theorem was Bayes’ theorem. I felt this coincidence was a clue, and it planted a seed of curiosity to keep learning more.
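For readers unfamiliar with the model mentioned above, the following is a minimal sketch of a Naïve Bayes text classifier. It applies Bayes' theorem, P(class | words) ∝ P(class) · Π P(word | class), under the "naïve" assumption that words are independent given the class. The class name, the toy training sentences, and the choice of add-one (Laplace) smoothing are illustrative assumptions, not details from the interview.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial Naive Bayes for whitespace-tokenized text."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        self.vocab = set()
        for doc, label in zip(docs, labels):
            words = doc.lower().split()
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior: log P(class)
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in doc.lower().split():
                if word not in self.vocab:
                    continue  # ignore words never seen in training
                # Log likelihood with add-one smoothing: log P(word | class)
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    docs = [
        "great product love it",
        "excellent service great",
        "terrible product hate it",
        "awful service bad",
    ]
    labels = ["pos", "pos", "neg", "neg"]
    model = NaiveBayes().fit(docs, labels)
    print(model.predict("great service"))
    print(model.predict("terrible bad"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing keeps a single unseen word from driving a class probability to zero.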
You’re the CEO of Quantum Stat, a company which offers solutions for Natural Language Processing. How did you find yourself in this position?
When there’s a revolution in a new technology, some companies are more hesitant than others when facing the unknown. I started my company because pursuing the unknown is fun to me. I also felt it was the right time to venture into the field of NLP given all of the amazing research that has arrived in the past 2 years. The NLP community has the capacity now to achieve a lot more with a lot less given the advent of new NLP techniques that require less data to scale performance.
For readers who may not be familiar with this field, could you share with us what Natural Language Processing does?
NLP is a subfield of AI and analytics that attempts to understand natural language in text, speech or multi-modal settings (text and images/video) and to process it to the point where you are driving insight and/or providing a valuable service. Value can arrive from several angles, from information retrieval in a company’s internal file system, to classifying sentiment in the news, to a GPT-2 Twitter bot that helps with your social media marketing (like the one we built a couple of weeks ago).
You have a Bachelor of Arts from Hunter College in Experimental Psychology. Do you feel that understanding the human brain and human psychology is an asset when it comes to understanding and expanding the field of Natural Language Processing?
This is contrarian, but unfortunately, no. The analogy of neurons and deep neural networks is simply for illustration and instilling intuition. One can probably learn a lot more from complexity science and engineering. The difficulty with understanding how the brain works is that we are dealing with a complex system. “Intelligence” is an emergent phenomenon from the brain’s complexity interacting with its environment, and very difficult to pin down. Psychology and other social sciences, which are dependent on “reductionism” (top-down), don’t work under this complex paradigm. Here’s the intuition: imagine someone attempting to reduce the Beatles’ song “Let it Be” to the C Major scale. There’s nothing about that scale that predicts “Let it Be” will emerge from it. The same follows with someone attempting to reduce behavior to neural activity in the brain.
As it stands, because deep learning models interpolate data, the more data you feed into the model, the fewer edge cases it will see when making an inference in the wild. This architecture “incentivizes” large datasets to be computed by models in order to increase accuracy of output. However, if we want to achieve more intelligent behavior by AI models, we need to look beyond how much data we have and more towards how we can improve a model’s ability to reason more efficiently, which, intuitively, shouldn’t require lots of data. From a complexity perspective, the cellular automata experiments conducted in the past century by physicists John von Neumann and Stephen Wolfram show that complexity can emerge from simple initial conditions and rules. What these conditions and rules should be with regard to AI is what everyone’s hunting.
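The point about cellular automata can be illustrated with a short sketch of a one-dimensional elementary cellular automaton. Rule 30, one of the elementary rules studied by Wolfram, is used here as an example choice. The grid width, step count, and wrap-around boundary are assumptions made for illustration. Each cell's next state depends only on itself and its two neighbors, yet a single live cell produces an irregular pattern.

```python
def step(cells, rule=30):
    """Advance a row of 0/1 cells by one generation.

    The three-cell neighborhood (left, center, right) is read as a
    3-bit number, and the matching bit of `rule` gives the new state.
    Edges wrap around (an assumption made to keep the sketch short).
    """
    n = len(cells)
    new_cells = []
    for i in range(n):
        left = cells[(i - 1) % n]
        center = cells[i]
        right = cells[(i + 1) % n]
        index = (left << 2) | (center << 1) | right
        new_cells.append((rule >> index) & 1)
    return new_cells


def run(width=31, steps=15, rule=30):
    """Start from a single live cell in the middle and collect every row."""
    cells = [0] * width
    cells[width // 2] = 1
    rows = [cells]
    for _ in range(steps):
        cells = step(cells, rule)
        rows.append(cells)
    return rows


if __name__ == "__main__":
    for row in run():
        print("".join("#" if cell else "." for cell in row))
```

The whole rule fits in a single integer, which is the sense in which very simple initial conditions and rules can still give rise to complex behavior.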
You recently launched the ‘Big Bad NLP Database’. What is this database and why does it matter to those in the AI industry?
This database was created to give NLP developers seamless access to all the pertinent datasets in the industry. It indexes datasets, which has the nice secondary effect of letting users query them. Preprocessing data takes the majority of time in the deployment pipeline, and this database attempts to mitigate this problem as much as possible. In addition, it’s a free platform for anyone, whether you are an academic researcher, a practitioner, or an independent AI guru who wants to get up to speed with NLP data.
Quantum Stat currently offers end-to-end solutions. What are some of these solutions?
We help companies facilitate their NLP modeling pipeline by offering development at any stage. We can cover a wide range of services from data cleaning in the preprocessing stage all the way up to model server deployment in production (these services are also highlighted on our homepage). Not all AI projects come to fruition due to the unknown nature of how your specific data/project architecture works with a state-of-the-art model. Given this uncertainty, our services give companies a chance to iterate on their project at a fraction of the cost of hiring a full-time ML engineer.
What recent advancement in AI do you find the most interesting?
The most important advancement of late is the transformer model. You may have heard of some of them: BERT, RoBERTa, ALBERT, T5 and so on. These transformer models are very appealing because they allow the researcher to achieve state-of-the-art performance with smaller datasets. Prior to transformers, a developer would require a very large dataset to train a model from scratch. Since these transformers come pretrained on billions of words, they allow for faster iteration of AI projects, and that’s what we are mostly involved with at the moment.
Is there anything else that you would like to share about Quantum Stat?
We are working on a new project dealing with financial market sentiment analysis that will be released soon. We have leveraged multiple transformers to give unprecedented insight into how financial news unfolds in real time. Stay tuned!