**Semantic Annotation of Mobile Phone Data Using Machine Learning Algorithms**

Feng Liu, JianXun Cui, Davy Janssens, Geert Wets and Mario Cools

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/intechopen.70255

#### Abstract


Cell phone call location data has been used to study travel patterns, but the underlying activities that give rise to the movement remain less explored. Because decision-making processes are largely routine and automated, human activity and travel behaviour exhibit a high level of spatial-temporal periodicity as well as a certain ordering of activities. In this chapter, a method is developed that exploits these regularities to predict the activities being conducted at call locations. The method comprises four steps: a comprehensive set of variables is defined; feature selection techniques are applied; a group of state-of-the-art machine learning algorithms and an ensemble of these algorithms are employed; and an additional enhancement algorithm is designed. The proposed method is evaluated using data gathered from the natural communication of 80 users over a period of 1 year. The ensemble of the models achieved a prediction accuracy of 69.7%, and the enhancement algorithm improved the performance by 7.6%. The experimental results demonstrate the potential to annotate call locations by integrating machine learning algorithms with the characteristics of the underlying activity and travel behaviour, contributing towards the semantic interpretation and application of the massive data.

Keywords: cell phone location annotation, activity and travel behaviour, machine learning algorithms, feature selection techniques, sequential information

### 1. Introduction

#### 1.1. Problem statement

Nowadays, cell phones are frequently used as an attractive means for sensing human behaviour on a large scale. They provide a source of real and reliable data, enabling automatic monitoring of the call and travel behaviour of users. Studies have been conducted to discover statistical laws that govern the key dimensions of human travel, e.g. travel distance and time spent at different locations [1]. These studies provide a modelling framework capable of describing general features of human mobility.

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

However, despite the discovery of these general features, previous studies do not provide further insight into the motivations or activities behind the identified mobility features. In general, most current research on cell phone data has focused on spatial-temporal dimensions. The behavioural aspects associated with the mobility features, e.g. travel mode and the activities being conducted at the locations, remain less studied. Due to privacy concerns, cell phone data provided by phone operators usually lacks contextual information, leaving a wide gap between the raw data and the semantic interpretation of the traces. If a method can be found that helps to bridge this gap, the potential applications of semantically enriched phone data are immense. They include inferring people's travel motivations in activity-based transportation modelling, mining individual lifestyles and activity preferences in urban planning, and providing activity-tailored services in the cell phone environment [2].

#### 1.2. Related state of the art

Methods have been developed to derive the activities being conducted at a location from global positioning system (GPS)-based data or from multi-modal data recorded by cell phones. The GPS-based methods first decompose continuous GPS points into a chain of stops, where the individual stays for a minimum period of time conducting activities, and moves, which are the points between two consecutive stops. The stops are then compared with a geographic map by matching them in space, and interesting places that are relevant to the studies are subsequently found. The GPS-based methods have received much attention in recent years [3], but they still face a number of limitations. (1) The data collection process is expensive in terms of the battery consumption of GPS devices. (2) Linking a GPS trajectory to detailed geographic information on all interesting places in a study area requires a lot of computational work. (3) The methods are location-specific, and the quality of the annotation process depends on the study area, so the process is not transferable to other areas. (4) The matched location alone may not disclose the particular reason why an individual travels there. A person could go to a place (e.g. a shopping mall) for different purposes (e.g. working, shopping or having lunch). (5) The matching of exact GPS positions raises privacy concerns, as some of the places visited by an individual may be highly privacy-sensitive.
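The decomposition of a GPS trace into stops and moves described above can be illustrated with a minimal sketch. It is not the procedure of any cited study: the function name `detect_stops`, the planar coordinates in metres, and the thresholds `radius` and `min_duration` are all assumptions chosen for illustration.

```python
from math import hypot


def detect_stops(points, radius=100.0, min_duration=300.0):
    """Split a time-ordered trace into stops (hypothetical helper).

    points: list of (t_seconds, x_metres, y_metres), sorted by time.
    A stop is a run of consecutive points that all lie within `radius`
    of the first point of the run and that spans at least `min_duration`.
    Returns a list of (start_index, end_index) pairs, both inclusive.
    Points not covered by any stop are treated as moves.
    """
    stops = []
    i, n = 0, len(points)
    while i < n:
        t0, x0, y0 = points[i]
        j = i
        # Extend the run while the next point stays close to the anchor point.
        while j + 1 < n and hypot(points[j + 1][1] - x0,
                                  points[j + 1][2] - y0) <= radius:
            j += 1
        if points[j][0] - t0 >= min_duration:
            stops.append((i, j))
            i = j + 1  # continue after the stop
        else:
            i += 1     # the anchor point belongs to a move
    return stops


# Assumed toy trace: a stay near the origin, a trip, then a second stay.
trace = [
    (0, 0, 0), (120, 10, 5), (400, 20, 0),            # first stay
    (460, 500, 0), (520, 1000, 0),                    # moving
    (580, 1500, 10), (900, 1510, 0), (1300, 1490, 5)  # second stay
]
print(detect_stops(trace))
```

In this toy trace the points at indices 3 and 4 fall between the two detected stops and therefore form a move. A map-matching step, as in the GPS-based methods, would then compare each stop with geographic information.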

Some of the above-described limitations have been addressed by an annotation process based on multi-modal data recorded from the sensors built into cell phones [4]. This process is composed of two steps. In the first step, data from GPS and other sensors (e.g. Wi-Fi and accelerometer) is collected from each individual. The data is then clustered into a number of visit places, each of which is represented by an ID number rather than by the geographic positions of the cluster points. In the second step, the obtained places are annotated based on contextual information from the sensors and phone applications, as opposed to GPS data. In this process, various machine learning methods are proposed, and different sets of features are defined [5]. These studies achieved good prediction performance without the need for additional geographic information and GPS data. Nevertheless, while the machine learning methods eliminate the need for a map, the annotation process as a whole still partly relies on GPS data for the identification of visit places in the first step. Thus, it does not fully address the privacy concern. Moreover, while these studies mainly focus on selecting efficient classification models and relevant features, none of them has conducted post-processing analysis to examine how consistent the predicted results are with the sequential information embedded in daily activity and travel sequences. In-depth examination of the prediction errors is also lacking in these studies.
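The first step of the multi-modal process, in which locations are grouped into visit places identified only by an ID number, can be sketched as follows. The cited studies may well use a different clustering technique; the greedy distance-threshold rule, the function name `assign_place_ids`, the planar coordinates in metres and the `radius` value are assumptions made purely for illustration.

```python
from math import hypot


def assign_place_ids(stop_centroids, radius=150.0):
    """Map stop centroids to anonymous place IDs (hypothetical helper).

    stop_centroids: list of (x_metres, y_metres), one per detected stay.
    A centroid joins the first known place whose representative point lies
    within `radius`; otherwise it opens a new place. Only the integer IDs
    are returned, so later annotation steps never see coordinates.
    """
    places = []  # representative coordinates, internal to this step
    ids = []
    for x, y in stop_centroids:
        for pid, (px, py) in enumerate(places):
            if hypot(x - px, y - py) <= radius:
                ids.append(pid)
                break
        else:
            # No existing place is close enough: register a new one.
            places.append((x, y))
            ids.append(len(places) - 1)
    return ids


# Assumed toy input: two recurring places and one place visited once.
centroids = [(0, 0), (1500, 0), (20, 30), (1480, -10), (5000, 5000)]
print(assign_place_ids(centroids))
```

Returning IDs instead of coordinates mirrors the idea that places are referred to by number in the second step, although the clustering itself still needs the positions, which is the residual privacy issue noted above.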

#### 1.3. Research contributions


Extending the current research on annotating people's movement traces, our study proposes a new approach. The method uses data collected from simple cell phones, and it combines machine learning methods with the characteristics of the underlying activity and travel behaviour that give rise to the traces. It has the following advantages over existing studies. (1) The method is based on spatial-temporal regularities as well as sequential information intrinsic to human activity and travel behaviour. (2) It does not depend on additional sensor data or map information, reducing data collection costs and increasing transferability. (3) An enhancement algorithm has been developed to improve the prediction results of the machine learning methods. (4) An extensive set of experiments and an in-depth examination of the classification errors have been conducted. (5) Compared to GPS points, the wide coverage of a cell ID allows the process to reduce privacy concerns considerably.

The rest of this chapter is organized as follows. Section 2 introduces the cell phone data, and Section 3 elaborates on the annotation process. Experiments are conducted in Section 4, and the experimental results are examined in Section 5. Finally, Section 6 ends this chapter with major conclusions and a discussion of future research.
