Researchers from China and the United States have collaborated on a project that uses machine learning to infer the ‘hidden visits’ we make as we move around the country: stops that go unrecorded because we do not make enough calls, or otherwise use our phones enough, for telecom data records to form a complete picture of our movements.
The paper, entitled Identifying Hidden Visits From Sparse Call Detail Record Data, is led by Zhan Zhao from the University of Hong Kong, working with Haris N. Koutsopoulos from Boston's Northeastern University and Jinhua Zhao at MIT.
The premise of the research is to use the mobile connectivity records (including mobile data, SMS and voice calls) of highly active users to develop a model that can more precisely guess the movement patterns of less active users.
The researchers admit that such work has privacy implications, and the project's stated objective is to obtain greater and more granular detail about user journeys. Nonetheless, they contend that the aim is to gather a better generalized picture of movement.
They also note that the Call Detail Record (CDR) data that fuels such studies has low spatial resolution and is prone to ‘positioning noise’, because the user's position changes relative to the cell phone towers they pass, and suggest that this limitation is in itself a form of privacy protection:
‘The target application of our study is trip detection and OD estimation[*], which are done at aggregate level, not individual level. The developed models can be directly deployed on the database servers of telecom carriers, without need for data transfer. Furthermore, compared to other forms of big data, such as social media or credit card transaction data, CDR data is relatively less intrusive in terms of personal privacy. In addition, its localization error helps to mask the exact user locations, providing another layer of privacy preservation.’
Elapsed Time Intervals (ETIs)
When we travel around with mobile phones (not necessarily smartphones), the limitations of CDR data as a location-defining tool become apparent. Elapsed Time Intervals (ETIs), the periods of a journey in which the mobile user does not make or receive calls, are a critical marker in keeping track of our movements: each is an interval of ‘silence’ that may be long enough for us to temporarily fall off the grid.
The researchers note that this interferes with the ability of analytical systems to make assumptions about A-to-B journeys, since the sparsity of the data could be hiding an ‘unobserved trip’. The new method addresses this by analyzing the spatiotemporal context of ETIs, as well as ‘the individual characteristics of the user’.
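To make the idea of an ETI concrete, the following is a minimal sketch of how the silent gaps between one user's consecutive records might be extracted. The `Record` and `ETI` types, the field names and the one-hour threshold are assumptions made for illustration; they are not taken from the paper.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class Record(NamedTuple):
    """One CDR event for a user (field names are hypothetical)."""
    user_id: str
    timestamp: datetime
    tower_id: str


class ETI(NamedTuple):
    """A silent gap between two consecutive records of the same user."""
    user_id: str
    start: datetime
    end: datetime
    start_tower: str
    end_tower: str

    @property
    def duration(self) -> timedelta:
        return self.end - self.start


def elapsed_time_intervals(
    records: List[Record],
    min_gap: timedelta = timedelta(hours=1),  # assumed threshold, not from the paper
) -> List[ETI]:
    """Return the gaps of at least `min_gap` between one user's consecutive records."""
    ordered = sorted(records, key=lambda r: r.timestamp)
    etis = []
    for prev, curr in zip(ordered, ordered[1:]):
        if curr.timestamp - prev.timestamp >= min_gap:
            etis.append(
                ETI(prev.user_id, prev.timestamp, curr.timestamp,
                    prev.tower_id, curr.tower_id)
            )
    return etis


# Made-up records for a single user on one day.
records = [
    Record("u1", datetime(2013, 11, 1, 8, 0), "A"),
    Record("u1", datetime(2013, 11, 1, 8, 10), "A"),
    Record("u1", datetime(2013, 11, 1, 12, 0), "B"),
]
etis = elapsed_time_intervals(records)
```

In this toy example, the only gap long enough to count runs from 08:10 at tower A to 12:00 at tower B. An interval of this kind is what a classifier would then be asked to label as containing a hidden visit or not.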
The researchers developed their core training set with data provided by a major cellular service operator in a Chinese city with a population of 6 million people. The data contained more than two billion mobile phone transactions generated by three million users in November 2013, and featured only voice call and data access (data usage) records. SMS data was not used, which made addressing the sparsity of the data more difficult.
The data contained an encrypted unique ID; a Location Area Code (LAC); a timestamp; a cell phone ID, which was collated with the LAC in order to individuate the cell phone tower used in the transaction; and an Event ID (outgoing/incoming call, or data usage).
This information was cross-referenced with a cell tower operation database, allowing the researchers to query the longitude and latitude coordinates of the tower associated with the communication event. The researchers were able to identify 9000 cell towers in the dataset.
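The cross-referencing step can be pictured as a simple keyed lookup. The sketch below assumes that a tower is identified by the pair of LAC and cell ID, as the article describes; the field names, the identifiers and the coordinates are all invented for illustration.

```python
from typing import Dict, Optional, Tuple

TowerKey = Tuple[str, str]         # (LAC, cell ID)
Coordinates = Tuple[float, float]  # (longitude, latitude)

# Hypothetical tower database; the values are placeholders, not real towers.
towers: Dict[TowerKey, Coordinates] = {
    ("1001", "42"): (120.10, 30.20),
    ("1001", "43"): (120.15, 30.25),
}


def locate_event(event: dict, tower_db: Dict[TowerKey, Coordinates]) -> Optional[Coordinates]:
    """Return the tower coordinates for a CDR event, or None if the tower is unknown."""
    return tower_db.get((event["lac"], event["cell_id"]))


# A made-up event using the hypothetical field names above.
event = {"user": "u1", "lac": "1001", "cell_id": "43", "event_type": "data"}
location = locate_event(event, towers)
```

Note that the returned position is that of the tower, not the user, which is the source of the low spatial resolution mentioned earlier.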
The researchers observe that it is difficult to guess trip destinations from call records alone, since these records peak in the morning and the afternoon, which correlates with travel patterns anyway. Since phone calls precede travel (and may trigger a journey), this can cause bias in destination estimation.
Similar restrictions apply to user-initiated data usage transactions, such as messaging apps, and other types of interaction. However, it's ‘automated’ data usage that helps to identify us: the systematic polling of APIs for new messages or other types of data, including message lists, GPS and general telemetry across installed apps.
The researchers approached the problem with a broad range of popular machine learning classifiers, including logistic regression, support vector machine (SVM), random forest, and a gradient boosting ensemble approach. All the classifiers were implemented in Python via scikit-learn, on default settings.
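A comparison of this kind might look like the sketch below, which fits the four named model families with scikit-learn defaults. The data here is synthetic, generated by `make_classification` as a stand-in for the ETI features, and `GradientBoostingClassifier` is an assumption about which gradient boosting implementation was used. `random_state` is fixed only so that the sketch is repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for labelled ETIs (1 = hidden visit, 0 = none).
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Default hyperparameters throughout, apart from the fixed seeds.
classifiers = {
    "logistic regression": LogisticRegression(),
    "SVM": SVC(),
    "random forest": RandomForestClassifier(random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

# Held-out accuracy for each classifier.
scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)
```

The resulting `scores` dictionary holds one held-out accuracy per classifier. The numbers say nothing about the paper's results, since the data is synthetic.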
Of these approaches, the researchers found that logistic regression yielded the highest number of interpretable model parameters.
The researchers also discovered that the longer an ETI, the greater the likelihood that a hidden visit has occurred, and that a greater incidence of hidden visits occurs in the morning.
Furthermore, when a user's CDR data readily exposes a high number of destinations or way-points, a hidden visit is least likely to have occurred. This accords with the general principle of the research: that the ‘noisiest’ or most active users are painting a detailed picture of their movements, from which the behavior of less active users can be inferred.
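Findings of this kind are typically read off the signs of a logistic regression's coefficients. The sketch below illustrates that reading on synthetic data that has been deliberately generated so that longer ETIs raise, and more observed locations lower, the probability of a hidden visit. The feature names and effect sizes are assumptions, and the example does not reproduce the paper's model or numbers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Standardized, synthetic features (hypothetical names).
eti_duration = rng.standard_normal(n)
observed_locations = rng.standard_normal(n)

# Assumed ground truth: duration increases the odds, observed locations decrease them.
logits = 1.0 * eti_duration - 1.0 * observed_locations
probability = 1.0 / (1.0 + np.exp(-logits))
hidden_visit = (rng.random(n) < probability).astype(int)

X = np.column_stack([eti_duration, observed_locations])
model = LogisticRegression().fit(X, hidden_visit)

# A positive coefficient means the feature raises the estimated odds of a hidden visit.
coefficients = dict(zip(["eti_duration", "observed_locations"], model.coef_[0]))
```

Here the fitted coefficient for `eti_duration` comes out positive and the one for `observed_locations` negative, mirroring the direction of the effects the article describes.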
In concluding, the researchers forecast that their approach can be used for other types of transit data, including smart card data and geo-located social media information.
The research was funded by Energy Foundation China and the China Sustainable Transportation Center.