AI Offers Improved Tracking of Offshore Property Ownership in the UK




New research from two UK universities aims to shed greater light on the extent of potential property-based money laundering in the United Kingdom, and especially in the highly-prized London real estate market.

According to the project’s results, the total number of ‘unconventional’ domestic properties (i.e. properties which are not used long-term as dwellings by owners or renters) stands at around 138,000 in London alone.

This figure is 44% higher than the official figures, which are supplied and periodically updated by the UK government.

The researchers used various Natural Language Processing (NLP) techniques, together with additional data and corroborative research, to extend the limited official information that the UK government makes available about the percentage, value, location and types of property owned by offshore companies in the UK, the most lucrative of which are in the capital.

The research found that offshore, low-use, and Airbnb-style (i.e. ‘casual occupation’) properties in the UK are collectively worth somewhere between £145 billion and £174 billion, across approximately 144,000-164,000 properties.

It also found that offshore properties of this type are typically more expensive and have signature patterns in regard to where they’re located in the UK.

The researchers estimate that offshore-owned Unconventional Domestic Property (UDP) represents 7.5% of the total domestic value, and that £56 billion of the value estimated is limited to just 42,000 dwellings.

The paper states:

‘Individual offshore properties are very expensive even by the standards of UDP, in addition they are concentrated on the centre of London with strong spatial auto correlation.

‘In contrast nested offshore property is somewhat less concentrated on central London but more highly concentrated in general, there is also almost no spatial correlation.’

Analysis of the augmented data shows that a large number of offshore properties belong to entities in the Crown Dependencies (CD), with the second-largest number accounted for by British Overseas Territories (in the chart below, ‘PWW2’ signifies countries that obtained independence from Britain after the Second World War).

Disposition of foreign-owned property, according to the results from the new paper. Source:


The paper observes:

‘In fact only 4 territories, British Virgin Islands, Jersey, Guernsey and The Isle of Man, are associated with 78% of all properties.’

The new enhanced data has made it possible to determine sub-properties that exist within a known overseas-owned property – a capability usually hindered by the flat and limited data provided in the official figures.

The results also indicate that offshore, Airbnb and low-use properties are notably more geographically concentrated than normal homes, and are additionally concentrated into higher-value areas.

Heat-maps related to various types of overseas-owned property in London. Source:


Of the above graph, the authors comment:

‘Offshore domestic property has some extremely high concentrations where an entire housing development is owned by an offshore company.’

The authors have released code for their processing pipeline.

The new paper is titled ‘What’s in the laundromat? Mapping and characterising offshore owned domestic property in London’, and comes from researchers at The Bartlett Faculty of the Built Environment at University College London, and Kingston University’s Department of Economics.

Addressing the Problem

The authors note that, after decades of effort to control the use of real estate for money-laundering purposes in the United Kingdom, it took the release of a leaked list of offshore-owned UK property by the British publication Private Eye in 2015 to spur the UK government into publishing a regularly-updated list of offshore-owned properties in most of the UK, known as ‘Overseas companies that own property in England and Wales’ (OCOD).

The researchers observe that although OCOD is a step forward for research into, and analysis of, overseas ownership and potential money laundering in the UK, the data has a number of limitations, some of them crucial:

‘These addresses can be incomplete, contain nested properties, where multiple properties exists within a single row or title number, it also contains no information on whether the property is domestic, business or something else.

‘Such poor quality data makes understanding the distribution and characteristics of offshore owned property in the UK challenging.’

It is particularly difficult to obtain data about casually-rented property such as Airbnb lets, since publicly available data is limited or non-existent. Additionally, Scotland (a part of the United Kingdom) does not make its own register of property sales publicly available, unlike England and Wales.

To counter some of the inconsistencies around property classification, the UK government introduced the Unique Property Reference Number (UPRN) system, designed to enable clearer relationships across diverse property data sources. However, the authors note*: ‘whilst the use of the UPRN is mandated, almost no government department uses it, meaning linking the data requires advanced data processing skills.’

Thus the new research set out to make the data more granular and insightful.

Collecting and Connecting the Data

Within any individual country, address formats are usually predictable and consistent, and UK addresses are no exception. A number of open source address-parsing solutions have therefore emerged to cross-reference ‘flat’, text-based address data (such as that provided by OCOD) against other data sources.

However, many of these are trained using OpenStreetMap data, in which a single address may actually host tens or even hundreds of nested sub-addresses (such as the apartments within a large apartment block). Consequently, even an acclaimed address parser such as libpostal has had difficulty when attempting to parse incomplete addresses.

To create the parser for their project, the new paper’s researchers used a number of publicly available datasets. The key data was provided by OCOD, while the data cleansing component used the Land Registry Price Paid dataset, together with the VOA ratings listing dataset, and the Office for National Statistics Postcode Directory (ONSPD).

The Airbnb data came from the InsideAirbnb domain, which only includes entire homes that are let, therefore excluding the original proposed use-case for Airbnb (i.e. renting out all or part of one’s own home on an occasional basis).

The authors’ low-use property dataset was augmented by information received from successful Freedom of Information (FOI) requests, mostly collected for an earlier project.

The base data of OCOD is a comma-delimited CSV file with a good degree of structure and a predictable format.

The pipeline consisted of five stages: labeling, parsing, expanding, classifying, and contracting. At the outset, any individual address could resolve in real life to multiple nested properties, though this is not explicit in the government-supplied data.

The researchers performed some light syntactic preprocessing, then imported the data into Programmatic, a platform designed to create annotated NLP datasets without hand-labeling. Here, entities were labeled using regular expressions (regex) to describe eight types of named entity.

With these labels added, the dataset was extracted as a JSON file, with label overlaps removed by simple rules-based routines.
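The labeling and overlap-removal steps described above can be illustrated with a minimal Python sketch. It is not the authors' code: the three label names, the regex patterns and the 'keep the longest span' rule are all assumptions made for illustration (the paper uses eight entity types, which are not listed here), and the number pattern is deliberately loose so that it collides with the postcode.

```python
import re

# Hypothetical label names and patterns, for illustration only.
PATTERNS = {
    "postcode": re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b"),
    "unit_type": re.compile(r"\b(?:flat|apartment|unit)s?\b", re.IGNORECASE),
    # Deliberately loose: also fires inside postcodes, creating overlaps.
    "number": re.compile(r"\d+[A-Za-z]?"),
}


def label_spans(text):
    """Return (start, end, label) for every regex match in the text."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return spans


def remove_overlaps(spans):
    """Assumed rule: where spans overlap, keep the longest one."""
    kept = []
    # Visit the longest spans first so they take priority over shorter ones.
    for span in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if all(span[1] <= k[0] or span[0] >= k[1] for k in kept):
            kept.append(span)
    return sorted(kept)


address = "Flat 12, 34 Example Road, London W1A 1AA"
raw_spans = label_spans(address)
clean_spans = remove_overlaps(raw_spans)
```

In this example the loose number pattern produces two spurious matches inside the postcode, and the rule discards them in favour of the longer postcode span.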

Additionally, Programmatic’s output was used to train a predictive model for spaCy, underpinned by Facebook’s RoBERTa. Once the labels had been denoised, the researchers created a ground truth comparison set of 1000 randomly selected, manually labeled observations. The accuracy of the unsupervised labeling would eventually be evaluated against this ground truth.

Address parsing presented a number of challenges. The authors assigned each character span its own row and each label class its own column, and then backpropagated the columns to generate complete address rows.
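One way to picture this span-per-row, label-per-column layout is the sketch below. It reads the fill step as a backward fill, in which each empty cell takes the next value found further down the same column; that reading, the column names and the choice of the unit identifier as the key column are all assumptions, not details confirmed by the paper.

```python
# Hypothetical label classes, one column per class.
COLUMNS = ["unit_id", "street_number", "street_name", "postcode"]


def spans_to_rows(spans):
    """One row per labeled span; only the span's own label column is filled."""
    return [
        {col: (text if col == label else None) for col in COLUMNS}
        for label, text in spans
    ]


def back_fill(rows):
    """Fill each empty cell with the next non-empty value below it."""
    filled = [dict(row) for row in rows]
    for col in COLUMNS:
        carry = None
        for row in reversed(filled):
            if row[col] is None:
                row[col] = carry
            else:
                carry = row[col]
    return filled


def complete_addresses(spans, key="unit_id"):
    """Keep only rows that originally held a key label, now fully filled."""
    rows = spans_to_rows(spans)
    filled = back_fill(rows)
    return [f for r, f in zip(rows, filled) if r[key] is not None]


# Spans in reading order for 'Flats 1 and 2, 10 Example Road W1A 1AA'.
example_spans = [
    ("unit_id", "1"),
    ("unit_id", "2"),
    ("street_number", "10"),
    ("street_name", "Example Road"),
    ("postcode", "W1A 1AA"),
]
addresses = complete_addresses(example_spans)
```

Here the two unit rows inherit the street number, street name and postcode that follow them in the text, giving one complete address row per flat.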

Since some single addresses featured multiple distinct dwellings, it was necessary to expand the database, by subdividing sole addresses into sub-properties present in complementary databases.
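As a rough illustration of what expansion involves, the sketch below unrolls a numeric range in an address fragment into individual unit numbers. The function name and the range syntax it accepts are hypothetical, and it covers only the simplest case; the expansion described above also draws on complementary databases, which this sketch does not attempt.

```python
import re

# Matches '1-4' or '1 to 4'; the accepted syntax is an assumption.
RANGE = re.compile(r"(\d+)\s*(?:-|to)\s*(\d+)")


def expand_units(unit_text):
    """Expand 'Flats 1-4' into [1, 2, 3, 4]; other numbers pass through."""
    match = RANGE.search(unit_text)
    if match:
        start, end = int(match.group(1)), int(match.group(2))
        if start <= end:
            return list(range(start, end + 1))
    # No usable range: return whatever individual numbers are present.
    return [int(n) for n in re.findall(r"\d+", unit_text)]


flats = expand_units("Flats 1-4")
single = expand_units("Flat 7")
listed = expand_units("Flats 3 and 9")
```

A real pipeline would need extra care with street numbers, where a range such as 2-10 often covers only the even numbers on one side of the road.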

After this, the address classification stage cross-referenced all located postcodes using the ONSPD database. This process connects up the address data to census and other demographic data, and also individuates sub-properties that had previously been hidden behind the opaque addresses of the OCOD data.
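The cross-referencing step amounts to a lookup keyed on the postcode. The sketch below shows the idea with a plain dictionary; the field names and the area codes are made-up placeholders rather than the real ONSPD schema or values.

```python
def normalise_postcode(postcode):
    """Uppercase and strip whitespace so 'w1a 1aa' and 'W1A1AA' compare equal."""
    return "".join(postcode.split()).upper()


def attach_area_codes(addresses, lookup):
    """Attach area information to each address; None where no match is found."""
    indexed = {normalise_postcode(pc): info for pc, info in lookup.items()}
    enriched = []
    for addr in addresses:
        info = indexed.get(normalise_postcode(addr["postcode"]))
        enriched.append({**addr, "area": info})
    return enriched


# Placeholder directory entry: the code values are invented for illustration.
postcode_directory = {
    "W1A 1AA": {"small_area": "AREA-001", "local_authority": "LA-01"},
}
parsed = [
    {"unit_id": "1", "postcode": "w1a1aa"},
    {"unit_id": "2", "postcode": "ZZ9 9ZZ"},
]
linked = attach_area_codes(parsed, postcode_directory)
```

Normalising the postcode before the lookup matters because free-text addresses rarely use consistent spacing or capitalisation.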

Finally, the address contraction process filtered out all non-domestic properties (i.e. commercial premises) from nested property groups.


To test the accuracy of the enhanced data, the authors, as mentioned earlier, created a sample ground truth set that was held back from the general run of analysis, and used only to test the accuracy of the predictions and analyses.

Manual checking for the ground truth included the use of map software, as well as analysis of pictures of the properties featured in the held-back set, and of internet searches to evaluate the type of property. Thereafter, the performance of the data was measured against precision, recall, and F1 scores.
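For readers unfamiliar with these metrics, the following minimal sketch shows how precision, recall and F1 are computed for one class against a ground truth set. The toy labels are invented for the example and are not taken from the paper.

```python
def precision_recall_f1(true_labels, predicted_labels, positive):
    """Compute precision, recall and F1 for a single class of interest."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Invented toy data: hand-checked labels versus pipeline predictions.
truth = ["domestic", "domestic", "business", "domestic", "business"]
predicted = ["domestic", "domestic", "domestic", "business", "business"]
precision, recall, f1 = precision_recall_f1(truth, predicted, "domestic")
```

In this toy case one business is wrongly predicted as domestic and one domestic property is missed, so precision, recall and F1 all come out at two thirds.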

The value of low-use and domestic property was obtained with a basic graphical model, the same method used also to infer UDP properties.

The NER task, tested against the high-effort, manually labeled ground truth, obtained an F1 score of 0.96, where a score of 1.0 would indicate perfect precision and recall.

F1 scores for the NER labeling task. Some unevenness is found, since the process slightly overestimates the number of domestic properties and underestimates the total number of businesses, due to the structure of the enhanced data.


Regarding UDPs in London, the final results show a total of 138,000 entries – 44% more than the 94,000 featured in the original OCOD dataset (i.e., recent official figures).

The breakdown of property types under type 2 classification.


The results indicate that the total value of the offshore properties stands at around £56 billion, while the total value of low-use property is estimated at £85 billion.

The authors note:

‘[All] UDPs are much more expensive than the mean conventional property price of £600 thousand.’

This kind of improved data may be necessary to combat the use of property speculation as a money-laundering activity in the UK. The authors note the growing body of research and general literature suggesting that better data may aid anti-money laundering (AML) efforts in the property market, and conclude:

‘This data can be used by sociologists, economist and policy makers to ensure that attempts to reduce money laundering and high property prices are based on detailed data that reflect the real situation.’


* My conversion of the authors’ inline citation to hyperlinks.

First published 25th July 2022.