In collaboration with academic researchers in China, Alibaba has developed a search engine simulation AI that uses real-world data from the e-commerce giant’s live infrastructure in order to develop new ranking models that are not hamstrung by ‘historic’ or out-of-date information.
The engine, called AESim, represents the second major announcement in a week to acknowledge the need for AI systems to be able to evaluate and incorporate live and current data, instead of just abstracting the data that was available at the time the model was trained. The earlier announcement was from Facebook, which last week unveiled the BlenderBot 2.0 language model, an NLP interface that features live polling of internet search results in response to queries.
The objective of the AESim project is to provide an experimental environment for the development of new Learning-To-Rank (LTR) solutions, algorithms and models in commercial information retrieval systems. In testing the framework, the researchers found that it accurately reflected online performance within useful and actionable parameters.
The paper’s authors, including four representatives each from Nanjing University and from Alibaba’s research division, assert that a new approach to LTR simulations was necessary for two reasons. First, recent similar initiatives in deep learning have failed to create reproducible techniques, with a spate of attention-garnering algorithms failing to translate into applicable real-world systems. Second, in cases where the systems were initially more effective, their performance on the training data did not transfer to novel data.
The paper claims that AESim is the first e-commerce simulation platform predicated on the data of live and current users and activity, and that it can accurately reflect online performance by relying exclusively on live data, providing a blue-sky training playground for later researchers to evaluate LTR methodologies and innovations.
The model incorporates a new take on a typical schema for industrial search engines: the first stage is the retrieval of items related to the user’s query, which are not initially presented to the user, but rather are first sorted by a weighted LTR model. The sorted results are then passed through a filter that considers the objectives of the company in supplying the results, aims which may include advertising and diversity factors.
Architecture of AESim
In AESim, the queries are replaced with category indices, allowing the system to retrieve items from a category index before passing them to a customizable re-ranker that produces the final list. Though the framework allows researchers to study the effects of joint ranking across multiple models, this aspect is being left for future work, and the current implementation automatically seeks the ideal evaluation based on a single model.
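The retrieve, rank and re-rank flow described above can be sketched roughly as follows. This is a minimal illustration rather than AESim's own code: the `Item` fields, the linear scoring weights, the `cap_per_seller` re-ranker and the demo data are all hypothetical stand-ins chosen to keep the example small.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Item:
    item_id: str
    seller: str            # hypothetical field, used only by the demo re-ranker
    features: List[float]  # hypothetical relevance features


def retrieve(index: Dict[int, List[Item]], category: int) -> List[Item]:
    """Stage 1: fetch candidates from a category index (stands in for the query)."""
    return list(index.get(category, []))


def rank(items: List[Item], weights: List[float]) -> List[Item]:
    """Stage 2: sort candidates by a weighted score (a toy stand-in for an LTR model)."""
    def score(item: Item) -> float:
        return sum(w * f for w, f in zip(weights, item.features))
    return sorted(items, key=score, reverse=True)


def cap_per_seller(ranked: List[Item], cap: int = 1) -> List[Item]:
    """Stage 3: a toy diversity re-ranker that demotes items beyond `cap` per seller."""
    seen: Dict[str, int] = {}
    head: List[Item] = []
    tail: List[Item] = []
    for item in ranked:
        seen[item.seller] = seen.get(item.seller, 0) + 1
        (head if seen[item.seller] <= cap else tail).append(item)
    return head + tail


def search(index: Dict[int, List[Item]], category: int, weights: List[float],
           reranker: Callable[[List[Item]], List[Item]]) -> List[Item]:
    """Run the three stages with a pluggable re-ranker."""
    return reranker(rank(retrieve(index, category), weights))


demo_index = {
    7: [
        Item("a", "s1", [0.9, 0.1]),
        Item("b", "s1", [0.8, 0.2]),
        Item("c", "s2", [0.1, 0.3]),
    ]
}
result = [item.item_id for item in search(demo_index, 7, [1.0, 0.5], cap_per_seller)]
print(result)  # ['a', 'c', 'b']
```

Passing the re-ranker as a function mirrors the 'customizable' aspect: a different business objective can be swapped in without touching retrieval or ranking.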
AESim creates embeddings (virtual representations in the machine learning architecture) that encapsulate the ‘virtual user’ and their query, and utilizes a Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP) approach.
The architecture comprises a database of millions of available items sorted by category, a customizable ranking system, a feedback module, and synthetic datasets generated by the GAN-based components. The feedback module is the final stage in the workflow, capable of evaluating the performance of the latest iteration of a ranking model.
Generative Adversarial Imitation Learning
In order to model the decision logic of the ‘Virtual User Module’, the feedback module (which provides the ultimate results) is trained through Generative Adversarial Imitation Learning (GAIL), a theory first proposed by Stanford researchers in 2016. GAIL is a model-free paradigm that allows a system to develop a policy directly from data through imitation learning.
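For reference, the GAIL objective as formulated in the original 2016 work can be written as a saddle-point problem. The mapping to AESim suggested here is an interpretive assumption, not something taken from the paper: the state $s$ might be the ranked list shown to a virtual user, the action $a$ that user's click or purchase response, $\pi$ the learned feedback policy, and $\pi_E$ the behavior of real users.

```latex
\min_{\pi} \max_{D} \;
  \mathbb{E}_{\pi}\left[ \log D(s, a) \right]
  + \mathbb{E}_{\pi_E}\left[ \log \left( 1 - D(s, a) \right) \right]
  - \lambda H(\pi)
```

Here $D$ is a discriminator that tries to tell policy-generated behavior from expert behavior, $H(\pi)$ is the causal entropy of the policy, and $\lambda \ge 0$ weights the entropy regularizer.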
The training sets developed by AESim are essentially the same as static, historical datasets used in prior supervised learning models for similar systems. The difference with AESim is that it is not reliant on a static dataset for feedback, and is not hamstrung by the item-orders that were generated at the time that (old) training data was compiled.
The generative aspect of AESim centers on the creation of a virtual user through WGAN-GP, which outputs ‘fake’ user and query characteristics, and then attempts to discern this faux data from genuine user data supplied by the live networks to which AESim has access.
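The critic loss that WGAN-GP adds to a standard Wasserstein GAN can be sketched in a few lines. This is a toy illustration under stated assumptions, not AESim's implementation: the critic is an arbitrary Python callable, the gradient is estimated by central finite differences instead of automatic differentiation, and the function names and the default penalty weight `lam=10.0` are choices made for this example.

```python
import math
import random
from typing import Callable, List, Sequence

Vector = Sequence[float]


def numerical_gradient(critic: Callable[[Vector], float], x: Vector,
                       h: float = 1e-5) -> List[float]:
    """Central finite-difference estimate of the critic's gradient at x."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        grad.append((critic(up) - critic(down)) / (2 * h))
    return grad


def wgan_gp_critic_loss(critic: Callable[[Vector], float],
                        real: List[Vector], fake: List[Vector],
                        lam: float = 10.0, seed: int = 0) -> float:
    """Critic loss: E[D(fake)] - E[D(real)] + lam * E[(||grad D(x_hat)|| - 1)^2]."""
    rng = random.Random(seed)
    wasserstein = (sum(critic(f) for f in fake) / len(fake)
                   - sum(critic(r) for r in real) / len(real))
    pairs = list(zip(real, fake))
    penalty = 0.0
    for r, f in pairs:
        # Sample a point on the line between a real and a generated example.
        eps = rng.random()
        x_hat = [eps * ri + (1 - eps) * fi for ri, fi in zip(r, f)]
        norm = math.sqrt(sum(g * g for g in numerical_gradient(critic, x_hat)))
        penalty += (norm - 1.0) ** 2
    return wasserstein + lam * penalty / len(pairs)
```

The penalty term pushes the critic's gradient norm toward 1 at the interpolated points, which is how WGAN-GP enforces the Lipschitz constraint without weight clipping. In a setting like the one described, `real` would hold feature vectors of genuine users and queries, and `fake` the generator's output.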
The researchers tested AESim by deploying a pair-wise, point-wise and ListMLE instance into the system, each of which had to serve a non-intersecting random slice of search queries in the context of a re-ranker algorithm.
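The three families of ranking objective named above differ in what they compare: single items, pairs of items, or the whole list. The sketch below shows one common loss from each family in plain Python. The specific choices (squared error for point-wise, logistic loss for pair-wise) are assumptions made for illustration; the source does not state which variants the researchers used. `labels` stands for any graded relevance signal, such as clicks or purchases.

```python
import math
from typing import List


def pointwise_loss(scores: List[float], labels: List[float]) -> float:
    """Point-wise: each item's score is compared with its own label (mean squared error)."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)


def pairwise_loss(scores: List[float], labels: List[float]) -> float:
    """Pair-wise: logistic loss over every pair where item i should outrank item j."""
    n = len(scores)
    pairs = [(i, j) for i in range(n) for j in range(n) if labels[i] > labels[j]]
    if not pairs:
        return 0.0
    total = sum(math.log1p(math.exp(-(scores[i] - scores[j]))) for i, j in pairs)
    return total / len(pairs)


def listmle_loss(scores: List[float], labels: List[float]) -> float:
    """List-wise: negative Plackett-Luce log-likelihood of the label-sorted order."""
    order = sorted(range(len(scores)), key=lambda i: labels[i], reverse=True)
    loss = 0.0
    for pos, idx in enumerate(order):
        tail = [scores[k] for k in order[pos:]]
        m = max(tail)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in tail))
        loss += log_z - scores[idx]
    return loss
```

For example, with labels `[2, 1, 0]`, the scores `[3.0, 2.0, 1.0]` give a lower ListMLE loss than the reversed scores `[1.0, 2.0, 3.0]`, since the first scoring agrees with the ideal ordering.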
At this point AESim is challenged by the rapidly changing and diverse live data in much the same way that Facebook’s new language model is likely to be. Therefore the results have been considered in the light of overall performance.
Tested over ten days, AESim demonstrated remarkable consistency across the three models. However, the researchers noted that an additional test of a Deep Listwise Context Model (DLCM) module performed poorly in the offline environment but very well in the live environment. They concede that the system will demonstrate gaps with its live counterparts, depending on the configuration and models being tested.