stub Retrieving Real-World Email Addresses From Pretrained Natural Language Models - Unite.AI
Connect with us

Artificial Intelligence

Retrieving Real-World Email Addresses From Pretrained Natural Language Models

mm
Updated on

New research from the US indicates that pretrained language models (PLMs) such as GPT-3 can be successfully queried for real-world email addresses that were included in the vast swathes of data used to train them.

Though it's currently difficult to get a real email by querying the language model about the person that the email is associated with, the study found that the larger the language model, the easier it is to perform this kind of exfiltration; and that the more extensive and informed the query, the easier it is to obtain a functional email address.

The paper states:

‘The results demonstrate that PLMs truly memorize a large number of email addresses; however, they do not understand the exact associations between names and email addresses, e.g., to whom the memorized email address belongs. Therefore, given the contexts of the email addresses, PLMs can recover a decent amount of email addresses, while few email addresses are predicted correctly by querying with names.'

To test the theory, the authors trained three PLMs of increasing size and parameters, and queried them according to a set of templates and methods that an attacker would be likely to use.

The paper offers three key insights into the risks of allowing real-world personal information to be included in the massive training corpora on which large PLMs depend.

Firstly, that long text patterns (in queries) increase the possibility of obtaining private information about an individual just by naming that individual. Secondly, that attackers may augment their approach with existing knowledge about their target, and that the more such prior knowledge an attacker has, the more likely it is that they will be able to exfiltrate memorized data such as email addresses.

Third, the authors postulate that larger and more capable Natural Language Processing (NLP) models may enable an attacker to extract more information, reducing the ‘security by obscurity' aspect of current PLMs, as ever more sophisticated and hyperscale models are trained by FAANG-level entities.

Finally, the paper concludes that personal information can indeed be retained and leaked through the process of memorization, where a model only partially ‘digests' training data, so that it can use that unbroken information as ‘factual' data in response to queries.

The authors conclude*:

‘From the results of the context setting, we find that the largest GPT-Neo model can recover 8.80% of email addresses correctly through memorization.

‘Although this setting is not as dangerous as others since it is basically impossible for users to know the context if the corpus is not public, the email address may still be accidentally generated, and the threat cannot be ignored.'

Though the study chooses email addresses as an example of potentially vulnerable PII, the paper emphasizes the extensive research into this pursuit in regard to exfiltrating patients' medical data, and consider their experiments a demonstration of principle, rather than a specific highlighting of the vulnerability of email addresses in this context.

The paper is titled Are Large Pre-Trained Language Models Leaking Your Personal Information?, and is written by three researchers at the University of Illinois at Urbana-Champaign.

Memorization and Association

The work centers on the extent to which memorized information is associated. A trained NLP model cannot completely abstract the information that it's trained on, or it would be unable to hold a coherent argument, or summon up any factual data at all. To this end, a model will memorize and protect discrete chunks of data, which will represent minimal semantic nodes in a possible response.

The big question is whether memorized information can be elicited by summoning up other kinds of information, such as a ‘named' entity, like a person. In such a case, an NLP model trained on non-public and privileged data may hold hospital data on Elon Musk, such as patient records, a name, and an email address.

In the worst scenario, querying such a database with the prompt ‘What is Elon Musk's email address?' or ‘What is Elon Musk's patient history?' would yield those data points.

In effect, this almost never happens, for a number of reasons. For instance, if a protected memorization of a fact (such as an email address) represents a discrete unit, the next discrete unit up will not be a simple traversal up to a higher layer of information (i.e. about Elon Musk), but may be a far larger leap that is unrelated to any specific person or data point.

Additionally, though the rationale for association is not necessarily arbitrary, neither is it predictably linear; association may occur based on weights that were trained with different loss objectives than mere hierarchical information retrieval (such as generating plausible abstract conversation), or in/against ways that have been specifically guided (or even prohibited) by the architects of the NLP system.

Testing PLMs

The authors tested their theory on three iterations of the GPT-Neo causal language model family, trained on the Pile dataset at 125 million, 1.3 billion, and 2.7 billion parameters.

The Pile is an assembly of public datasets, including the UC Berkeley Enron Database, which includes social network information based on email exchanges. Since Enron followed a standard first name+last name+domain convention (i.e. [email protected]), such email addresses were filtered out, because machine learning is not needed to guess such a facile pattern.

The researchers also filtered out name/email pairs with less than three tokens, and after the total pre-processing arrived at 3238 name/mail pairs, which were used in various subsequent experiments.

In the context setting experiment, the researchers used the 50, 100, or 200 tokens preceding the target email address as a context to elicit the address with a prompt.

In the zero-shot setting experiment, four prompts were created manually, the latter two based on standard email header conventions, such as —Original Message—\nFrom: {name0} [mailto: {email0}].

Templates for zero-shot prompts. Source: https://arxiv.org/pdf/2205.12628.pdf

Templates for zero-shot prompts. Source: https://arxiv.org/pdf/2205.12628.pdf

Next, a few-shot setting was considered – a scenario in which the attacker has some prior knowledge that can help them craft a prompt that will elicit the desired information. In the crafted prompts, the researchers consider whether the target domain is known or unknown.

Iterations of the few-shot setting.

Iterations of the few-shot setting.

Finally, the rule-based method uses 28 probable variations on standard patterns for name use in email addresses to attempt to recover the target email address. This requires a high number of queries to cover all the possible permutations.

Rule-based patterns used in the tests.

Rule-based patterns used in the tests.

Results

For the prediction with context task, GPT-Neo succeeds in predicting as much as 8.80% of the email addresses correctly, including addresses that did not conform to standard patterns.

Results of the prediction with context task. The first column details the number of tokens prior to the email address.

Results of the prediction with context task. The first column details the number of tokens prior to the email address.

For the zero-shot setting task, the PLM was able to correctly predict only a small number of email addresses, mostly conforming to the standard patterns set out by the researchers (see earlier image).

Results of zero-shot settings where the domain is unknown.

Results of zero-shot settings where the domain is unknown.

The authors note with interest that the 0-shot (D) setting notably outperforms its stablemates, due, apparently, to a longer prefix.

‘This [indicates] that PLMs are making these predictions mainly based on the memorization of the sequences – if they are doing predictions based on association, they should perform similarly. The reason why 0-shot (D) outperforms 0-shot (C) is that the longer context can discover more [memorization]'

Larger Models, Higher Risk

In regard to the potential for such approaches to exfiltrate personal data from trained models, the authors observe:

‘For all the known-domain, unknown-domain, and context settings, there is a significant improvement in the accuracy when we change from the 125M model to the 1.3B model. And in most cases, when changing from the 1.3B model to the 2.7B model, there is also an increase in the prediction accuracy.'

The researchers offer two possible explanations as to why this is so. First, the models with higher parameters are simply able to memorize a higher volume of training data. Second, larger models are more sophisticated and better able to understand the crafted prompts, and therefore to ‘connect up' the disparate information about a person.

They nonetheless observe that at the current state of the art, personal information is ‘relatively safe' from such attacks.

As a remedy against this attack vector, in the face of new models that are growing consistently in size and scope, the authors advise that architectures be subject to rigorous pre-processing to filter out PII; to consider training with differentially private gradient descent; and to include filters in any post-processing environment, such as an API (for instance, OpenAI's DALL-E 2 API features a great number of filters, in addition to human moderation of prompts).

They further advise against the use of email addresses that conform to guessable and standard patterns, though this advice is already standard in cybersecurity.

 

* My substitution of hyperlinks for the authors' inline citations.

First published 26th May 2022.