A new research collaboration between the University of Wisconsin and Google sets machine learning against one of the most notorious web user annoyances of the last decade – the opacity and cynical misuse of GDPR-compliant cookie consent banners.
Titled CookieEnforcer, the new framework uses Semantic Text Understanding to parse the significance and utility of the underlying code behind the cookie consent popup or banner, in order to provide the user with the missing ‘one click’ solution to disabling all truly ‘non-necessary’ cookies – including the ones that domain owners may present as being ‘essential’, even if they are not.
The plugin can be set to automatically enforce user preferences, or else take the cases individually, allowing the user to adjust settings before final submission.
The challenge of parsing the possible ‘non-consent’ options, which are typically hidden in arcane and laborious groups of settings (in contrast to the user-friendly ‘accept all’ button typical of consent frameworks), is modeled as a sequence-to-sequence task.
In an end-to-end accuracy evaluation, CookieEnforcer was able to generate all the necessary steps to obviate cryptic cookie consent procedures in 91% of the cases studied, on domains that had not been seen during training of the system’s machine learning model. A user study further demonstrated that the system significantly reduces user effort in navigating the consent modules.
The paper presenting the method is titled CookieEnforcer: Automated Cookie Notice Analysis and Enforcement, and comes from three researchers at the University of Wisconsin at Madison, and one from Google Inc.
Arcane Roads to Cookie Consent
Since the adoption of the General Data Protection Regulation (GDPR) in 2016 and the passage of the California Consumer Privacy Act (CCPA) in 2018, websites wanting to engage users from the areas covered by such legislation have been required to provide cookie preference mechanisms (usually triggered by detection of the user’s IP address, as a proxy for their location).
However, since domain owners had long been accustomed to gleaning valuable and actionable user data from the opaque and usually unseen implementation of cookies, they proved reluctant to furnish easy opt-outs for their newly empowered users.
The default UI for cookie consent interfaces (which appear the first time a user visits a domain, or if the user has deleted cookies for that domain) quickly settled into a dark pattern with two paths: granular, time-consuming, and extensive choices designed to weary any viewer who wanted to exercise their rights over consent; or a simple and easily accessible button that opted the user into all the cookies that the domain owner desired to run. This culture of labyrinthine UI choices was described in one 2020 study as ‘a scavenger hunt’.
The new paper comments:
‘[Users] may find it hard to exercise informed cookie control for websites with complicated notices. They are far more likely to rely on default configurations than they are to fine-tune their cookie settings for each [website]. In several cases, these default settings are privacy-invasive and favor the service providers, which results in privacy [risks].’
A comment on one popular forum post regarding these practices characterized them as ‘malicious compliance’. User annoyance with cookie consent frameworks is a topic on which major publishers are conflicted: they might ordinarily give it further coverage, were they not so exposed by their own practices in this regard.
A 2019 paper from Germany found that a majority of site visitors in the studied domains were ‘nudged’ towards broad consent, and that only a third of websites actually explained the purposes of the data collection practices.
A number of web browser plugins, add-ons and extensions have emerged to address the problem in recent years, such as the Cookie Quick Manager Firefox extension, and a broad range of Chrome alternatives, while the European Union is seeking to close up the compliance loopholes around cookie consent architectures.
Method and Data
The researchers of the new paper were determined to create a more robust cookie consent management framework by avoiding reliance on keywords or handcrafted rules, the central approach of a number of recent similar ML-aided projects.
CookieEnforcer has three objectives: to translate cookie notices and interfaces into a machine readable format; to identify the cookie setting configuration in a manner that disables non-essential cookies; and to automatically apply additional restrictions without further user input, if desired by the user.
The system consists of a backend component that detects and analyzes cookie notices, and a frontend component, in the form of a browser extension, that generates and executes the disabling of non-essential cookies (i.e. cookies that will not obstruct navigation of or access to the domain if blocked).
The backend section features modules for detection, analysis, and a decision model. The analysis module takes account of changes in code introduced by user interaction, so that the initial code dump is not rendered invalid by simulated user exploration.
Natural Language Understanding
With the code revealed, it’s important that CookieEnforcer understand the existing state of possible actions it might take, since the language behind toggle buttons can be ambiguous in terms of benefit to the end user.
To this end, the researchers trained a Text-To-Text Transfer Transformer (T5) model for the system’s decision component. The T5-Large model, which contains 770 million parameters, was fine-tuned on a custom dataset of input/output pairs derived from the code that describes and enables the toggling of options.
The dataset was created by sampling 300 websites with cookie notices selected from Tranco’s top-50k popular websites list. The detector and analyzer modules extracted the cookie consent options from their runtime source code, and evaluated their default states.
One of the researchers then manually labeled the interpreted series of clicks necessary to disable non-essential cookies for all the studied websites, resulting in 300 fully labeled domains.
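To make the sequence-to-sequence framing concrete, the following is a minimal sketch of how one labeled example might look: the notice's options are flattened into a source string, and the labeled clicks become a short target string. The `Option` schema, the separator characters and the `click N` target format are all assumptions made for illustration; the article does not describe the exact text encoding used in the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Option:
    """One interactive element in a cookie notice (hypothetical schema)."""
    element_id: int   # index assigned to the element by the analyzer
    label: str        # visible text beside the control
    kind: str         # "toggle" or "button"
    enabled: bool     # default state when the notice loads


def serialize(options: List[Option]) -> str:
    """Flatten the notice into one text sequence for a text-to-text model."""
    parts = []
    for o in options:
        state = "on" if o.enabled else "off"
        parts.append(f"[{o.element_id}] {o.kind} | {o.label} | {state}")
    return " ; ".join(parts)


def parse_target(target: str) -> List[int]:
    """Turn a target string such as 'click 2 click 4' into element ids."""
    tokens = target.split()
    return [
        int(tokens[i + 1])
        for i in range(0, len(tokens) - 1, 2)
        if tokens[i] == "click"
    ]


# An invented notice: analytics is on by default and must be switched off.
options = [
    Option(1, "Strictly necessary", "toggle", True),
    Option(2, "Analytics", "toggle", True),
    Option(3, "Advertising", "toggle", False),
    Option(4, "Save preferences", "button", False),
]
source_text = serialize(options)
target_text = "click 2 click 4"   # the kind of label an annotator might write
clicks = parse_target(target_text)
```

Here the model's job would be to map `source_text` to `target_text`; the browser extension would then replay `clicks` against the live page.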
60 websites were set aside as a test set, and the T5-Large model was trained with a learning rate of 0.003 at a batch size of 16 for 20 epochs, with a maximum input sequence length of 256 tokens, and a maximum target sequence length of 64. The tokens were formed of sub-words established by Google’s SentencePiece tokenizer.
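The reported settings can be gathered into a small configuration object, which also makes the size of the training run easy to estimate. This is only a sketch: the field names are hypothetical, and the step count assumes that the 60 test sites come out of the 300 labeled ones and that each website contributes exactly one training example, neither of which the article states outright.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters as reported in the article (field names are hypothetical)."""
    model: str = "T5-Large"
    learning_rate: float = 0.003
    batch_size: int = 16
    epochs: int = 20
    max_input_tokens: int = 256
    max_target_tokens: int = 64


LABELED_SITES = 300
TEST_SITES = 60


def optimizer_steps(n_examples: int, cfg: FineTuneConfig) -> int:
    """Total parameter updates, rounding the last partial batch up."""
    steps_per_epoch = math.ceil(n_examples / cfg.batch_size)
    return steps_per_epoch * cfg.epochs


cfg = FineTuneConfig()
train_sites = LABELED_SITES - TEST_SITES        # 240 under the assumption above
total_steps = optimizer_steps(train_sites, cfg)  # 15 steps per epoch * 20 epochs
```

Under those assumptions the fine-tuning run amounts to only a few hundred optimizer steps, which is consistent with adapting a large pre-trained model on a small hand-labeled dataset.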
Finally, the processed information is stored in a local database and made available to the front end of the system. The authors favored the DOM’s querySelector() method, which locates elements by CSS selector, over the XML Path Language (XPath) approach taken by some previous similar projects, since XPaths for cookie notices are vulnerable to DOM updates (i.e. the page structure may change after initial loading in response to user interactions). In this way, the element paths remain usable even when the page is dynamic and responsive to external factors.
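The fragility of position-based paths can be illustrated with a small experiment. The sketch below uses Python's standard `xml.etree.ElementTree`, which has no CSS selector engine, so an attribute-based lookup stands in for a call such as `querySelector('#reject-optional')`; the markup and element ids are invented, and this is an analogy for the idea rather than the authors' implementation.

```python
import xml.etree.ElementTree as ET

# Invented cookie-notice markup with two rows of buttons.
NOTICE = """
<div id="notice">
  <div class="row"><button id="accept-all">Accept all</button></div>
  <div class="row"><button id="reject-optional">Reject optional</button></div>
</div>
"""

root = ET.fromstring(NOTICE)

# A positional path, in the style of an auto-generated XPath.
POSITIONAL = "./div[2]/button"
# An attribute-based path, standing in for a CSS id selector.
BY_ID = ".//button[@id='reject-optional']"


def text_at(tree, path):
    """Return the text of the first match for `path`, or None if nothing matches."""
    node = tree.find(path)
    return None if node is None else node.text


before = (text_at(root, POSITIONAL), text_at(root, BY_ID))

# Simulate a DOM update: the page injects a new row above the others.
new_row = ET.Element("div", {"class": "row"})
ET.SubElement(new_row, "button", {"id": "learn-more"}).text = "Learn more"
root.insert(0, new_row)

after = (text_at(root, POSITIONAL), text_at(root, BY_ID))
```

Before the update both paths reach the 'Reject optional' button; afterwards the positional path lands on 'Accept all', the opposite of what the user wants, while the id-based lookup is unaffected.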
Testing and Performance
The authors comment:
‘This option can be easily missed by the users as they have to expand an additional frame to see that. CookieEnforcer not only finds this option, but also understands the semantics and decides to object. These examples showcase that the model learns the context and generalizes to new examples.’
The researchers performed three tests, including an end-to-end evaluation of the framework’s performance across 500 unseen domains (i.e. websites that CookieEnforcer was not specifically trained for), where the authors report that it could successfully disable non-essential cookies for 91% of the sites.
The second test comprised an online user study spanning 14 websites, using System Usability Scale (SUS) scores to compare CookieEnforcer against a manual baseline. For this test, the authors report that CookieEnforcer obtained a 15% higher score than the baseline.
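For readers unfamiliar with the System Usability Scale, it is conventionally scored from ten questionnaire items answered on a 1 to 5 scale: odd-numbered items contribute the response minus 1, even-numbered items contribute 5 minus the response, and the sum is multiplied by 2.5 to give a score out of 100. The sketch below applies that standard formula; the two response sets are made up for illustration and are not data from the study.

```python
from typing import Sequence


def sus_score(responses: Sequence[int]) -> float:
    """Score one ten-item SUS questionnaire on the usual 0-100 scale."""
    if len(responses) != 10:
        raise ValueError("SUS uses exactly ten items")
    total = 0
    for item, answer in enumerate(responses, start=1):
        if not 1 <= answer <= 5:
            raise ValueError("each answer must be between 1 and 5")
        # Odd items are positively worded, even items negatively worded.
        total += (answer - 1) if item % 2 == 1 else (5 - answer)
    return total * 2.5


# Hypothetical participants, not taken from the paper.
neutral = sus_score([3, 3, 3, 3, 3, 3, 3, 3, 3, 3])
favourable = sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2])
```

A participant who answers neutrally throughout scores 50, while one who mildly agrees with the positive items and mildly disagrees with the negative ones scores 75.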
Finally, CookieEnforcer’s trained parameters were tested against the top 5000 websites in the US and Europe, to determine its capacity to navigate cookie notices. The authors state:
‘While measurements at such a scale have been performed before, CookieEnforcer allows a deeper understanding of the options beyond keyword-based heuristics. In particular, we find that 16.7% of the websites in the UK showing cookie notices have enabled at least one non-essential cookie. The same number for websites in the US is 22%.’
The authors have released a short YouTube video showing CookieEnforcer in action.
First published 12th April 2022.