Andrea Vattani is the Co-Founder & Chief Scientist at Spiketrap, a contextualization company powering audience intelligence and media performance for creators, platforms, and brands. Its proprietary Clair AI extracts the signal from the noise of unstructured datasets, providing clarity and context, particularly within high-velocity online environments.
What initially attracted you to computer science and AI?
It was a combination of fortuitous circumstances. I showed up at the University of Rome to take the admission test for the Statistics major, and it turned out I was a day late! I was advised to apply for Computer Science instead and move back to the Statistics department a year later. I went to the Computer Science admission test (which was that day!) and passed it… and never moved back to Statistics! My interest in AI really started with realizing how computers can help you automate things, and AI is the ultimate automation machinery. Also, natural language and how people use it has always been an interest of mine: I focused on classical studies in high school, studying ancient Greek and Latin, which is probably similar to how a machine feels when fed a stream of words.
You previously worked as Senior Lead Software Engineer at Amazon Goodreads. What were some of the projects you worked on, and what were some key takeaways from this experience?
While at Goodreads, I worked on multiple machine learning projects, including spam detection and scaling of the book recommendation engine. My main takeaway from my time there was the importance of defining ML metrics that match business and customer goals. To give an example, recommendation engines have existed for a really long time. Remember the “Netflix Prize” competition back in 2009 to figure out better movie recommendations? Some insights from the top solutions suggested that the chance of you watching a movie isn’t driven much by whether you’re going to like it or not, but mostly by whether it’s similar to your interests. That might work for movies, since a movie is a short 90-minute commitment, but for books that is not the case. Integrating the right goal into your metrics is key.
Another lesson that I have applied at Spiketrap is to build AI teams that are delivery-oriented and integrated with the product roadmap, rather than an isolated team focused only on exploration and research. This leads to better definition of goals and timelines and a better understanding of the ROI. It also naturally encourages the team to focus on the speed and practicality of a model rather than looking purely at accuracy. Going back to the Netflix competition example, the models of the winning teams were never integrated because they were not practical enough, despite their improved accuracy.
Your research has been published in numerous journals. What, in your opinion, has been the most important paper so far?
During my Ph.D. I was fortunate to collaborate with several researchers from different areas, including machine learning, “big data”, social data analysis, and game theory. A paper I like for its simplicity and applicability is “Scalable K-Means++”. K-means++ is a ubiquitously used unsupervised clustering method for splitting a dataset into K coherent groups. It does so by adding one group at a time, so when you have tons of data and groups, it becomes way too slow. In that paper we show how you can achieve the same, if not better, accuracy by parallelizing the method. Our methodology is extremely simple and has been implemented in several machine learning libraries.
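To make the "one group at a time" point concrete, here is a minimal sketch of the sequential k-means++ seeding step in plain Python. It is an illustration only, not the parallel method from the paper: the function names and the toy data are assumptions made for this example. Each new center is sampled with probability proportional to a point's squared distance from the centers chosen so far, which means every added center requires another full pass over the data.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_seed(points, k, seed=0):
    """Pick k initial centers, one per pass over the data (k-means++ seeding)."""
    rng = random.Random(seed)
    # First center: chosen uniformly at random.
    centers = [rng.choice(points)]
    # Squared distance from each point to its nearest chosen center.
    nearest = [squared_distance(p, centers[0]) for p in points]
    while len(centers) < k:
        # Sample the next center with probability proportional to that
        # distance, so far-away points are more likely to start a new group.
        new_center = rng.choices(points, weights=nearest, k=1)[0]
        centers.append(new_center)
        # Each new center needs another full pass to refresh the distances;
        # this sequential dependency is what gets slow for large k.
        nearest = [
            min(d, squared_distance(p, new_center))
            for p, d in zip(points, nearest)
        ]
    return centers


# Toy data (hypothetical): three tight, well-separated groups.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (10.0, 0.0), (10.2, 0.1)]
centers = kmeans_pp_seed(points, k=3)
print(centers)
```

The loop runs k - 1 times and cannot be reordered, because each sampling step depends on the centers picked before it. That dependency is the bottleneck the answer above refers to.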
Could you share the genesis story behind Spiketrap?
After working at Goodreads, my Spiketrap co-founders, Kieran and Virgilio, and I saw that there was a gap in the industry for accessing advanced brand insights from niche social platforms. By applying AI technologies, we could address the issue efficiently.
In today’s economy, it is imperative for companies to listen to their customers and their respective industries as a whole. However, much of what customers have to say about brands goes unheard. Millions of people express their opinions openly every day, across platforms like Twitter, Reddit, Twitch, and the like. This public conversation has proven to be an extremely valuable resource for any market researcher, provided the content can be contextualized at scale. The issue is that the insights industry has not kept up with evolving digital behaviors and language.
Listening tools remain dependent on keywords and boolean searches, missing much of the conversation that could and should be attributed to a particular brand. Meanwhile, market research firms have been caught in an increasingly difficult balancing act, trying to ascertain qualitative insights from quantitative and cost-constrained methodologies.
In short, people have lacked the tools they need to understand their audiences at scale. Sales numbers and viewer counts answer the “what” of audience behaviors, but not the “why”. Without context, figuring out what’s correlation versus causation is a guessing game. Recognizing this void, we dug into what a solution for contextual understanding would look like, and Spiketrap was born.
What are some of the machine learning technologies that are used at Spiketrap?
We use a multitude of technologies, from your usual scikit-learn to deep learning libraries such as PyTorch. Aside from libraries, the methodologies, models, and datasets we use are mostly proprietary. We’ve learned that off-the-shelf methods and models only take you so far; to really crack a problem you need to put your own work in, starting from goals and getting down to model architecture and datasets. To give you an example, topic modeling is the task of extracting themes from a collection of pieces of text. Our “Spiketrap Convos” provides our customers with crucial insights about their audience, and uses topic modeling as one of the signals. Your typical go-to method for topic modeling is LDA (Latent Dirichlet Allocation), but unfortunately it’s too inconsistent and unpredictable, and simply not powerful enough. On the other side of the spectrum, you can try a modern pretrained model such as BERTopic, which, while powerful and encompassing, is also really rigid and slow. NLP and language AI have advanced by leaps and bounds in the last decade, but taking existing models and turning them into products is still far from optimal and a risky bet.
Could you elaborate on how Spiketrap powers instant audience understanding for creators, platforms, and brands?
Advertisers and agencies use our influencer leaderboards and brand affinity tools to identify creators whose communities are brand safe across a number of categories, including grades for toxic, profane, and sexual content, as well as overall community brand safety.
Creators are able to use the tool to dive into individual streams and see which conversations were the most or least safe, which drove positive engagement for their sponsors, and where they might better improve their moderation efforts.
A recent paper titled ‘FeelsGoodMan: Inferring Semantics of Twitch Neologisms’ was published by Spiketrap. Could you briefly describe what this paper is?
The way people communicate and express themselves online has been getting progressively more complex and challenging to decipher. First came emoticons :-). Then came emojis. Then memes… and now “emotes”, a new form of icon-based communication that has become hugely popular on the Twitch streaming platform. Somewhat reminiscent of emojis in their intermixed use with regular text, they present similar challenges to memes in that they are user generated and their cryptic meaning has no obvious connection to the actual image depicted. There are over 8 million distinct emotes to date, with over 400 thousand used weekly. Still, people communicate effectively using them to express any kind of feeling, such as joy, boredom, excitement, or sarcasm. Our recent paper is an AI cookbook for inferring the semantic meaning of emotes. Our approach doesn’t require maintaining and updating a manually curated dataset, and it is able to self-adapt both to the continuous introduction of new emotes and to the evolution of meaning of popular emotes. This is particularly important when an emote gets politically or racially loaded, which we have seen happening with extremely popular emotes, such as “TriHard”, “PogChamp”, and “FeelsGoodMan”. Dynamic use of language and shifts in meaning pose enormous problems to moderation systems and sentiment analysis frameworks, so we’re proud to tackle this problem the right way at Spiketrap.
Is there anything else that you would like to share about Spiketrap?
As we look ahead to the new year, Spiketrap is working on developing and perfecting a new tool that will provide a deeper understanding of brand sentiment for our clientele. Spiketrap’s new Affinity Tool provides an interactive and intuitive way to identify and quantify audience affinities across creators, brands, games, and more. For any given query, the tool generates affinity index scores that indicate how strongly a given entity is positively correlated with another. Numerous contextual signals make up the score, including the frequency and sentiment of related mentions. Spiketrap’s tech stack is uniquely positioned to index affinities between games, brands, and creators. Clair, our proprietary NLP AI, processes millions of publicly posted user-generated messages every day, attributing otherwise ambiguous content to entities within Spiketrap’s extensive knowledge graph, identifying topics of conversation, determining sentiment, and monitoring safety. The addition of the new Affinity Tool empowers developers, creators, brands, and more to further understand their audiences and brand impact.
Thank you for the great interview. Readers who wish to learn more should visit Spiketrap.