## 3. Methodology

#### 3.1. Overview of the approach

The approach incorporates basic knowledge about human activity and travel decision-making processes and their resultant activity and travel behaviour. As Liu et al. [6] underlined, these decision-making processes exhibit routine and automated features. People do not generally schedule their activities on a daily basis, but rather depend on fixed routines or scripts that are executed during the day without much alteration. This leads to a high level of spatial-temporal regularity in activity and travel behaviour, as well as a certain sequential order of the activities [6]. These spatial-temporal recurrences are adequately reflected in the movement traces of cell phone users over a long period of call records. In addition, the spatial-temporal constraints of locations, stemming from the characteristics of the various activities performed there in their own daily, weekly or monthly rhythms, can suggest the possible activities carried out at each location. This enables the annotation of the third dimension, i.e. travel motives (activities). Furthermore, evidence also suggests that activity and travel behaviour differs across time periods of the day, between weekdays and weekends, and between normal days and holidays [7].

The method consists of four major steps. (1) A set of variables characterizing call locations in the spatial-temporal dimensions is defined. (2) Feature selection techniques are applied to choose the most effective variables. (3) Based on the selected variables, a set of classification models is trained, and an additional ensemble method combines their prediction results. (4) An enhancement algorithm is developed to improve the annotation performance based on sequential constraints of the activities.

#### 3.2. Variable definition

Among all the users, 9132 distinct call locations were detected, and 259 (2.8% of the total identified locations) were labelled with the activities conducted at these places. These labelled locations are used as the ground-truth data for training and validating our models. Activities are divided into five types, 'work/school', 'home', 'social visit', 'leisure' and 'non-work obligatory', accounting for 30, 29, 15, 14 and 12% of the training data, respectively. The type 'work/school' represents all work- or school-related activities outside home, while 'home' accommodates all time spent at home. 'Social visit' refers to all visit activities, 'leisure' includes recreational activities outside home, e.g. sports and eating/drinking, and 'non-work obligatory' consists of activities like bringing/getting people, shopping and personalized services. If activities of multiple types are executed at the same location by a particular individual, the most frequent activity is selected, such that each location is uniquely linked to an activity type for the individual.

| User ID | Cell ID | Time | Duration | Call type | Direction |
|---|---|---|---|---|---|
| 10027534 | 10163 | 10:18 | 12 | Voice call | Outgoing |
| 10027534 | 10269 | 12:40 | 0 | Message | Incoming |

a The columns, respectively, denote the user, the cell ID, the time and duration (in minutes) of the call, the call type ('voice call' or 'message') and the direction ('incoming', 'outgoing' or 'missed call').

Table 1. Call records of a user.a

96 Smartphones from an Applied Research Perspective


For each user, all distinct locations where the person has performed at least one call during the entire data collection period are extracted. Let N be the total number of these locations. At each location Loci (i = 1…N), a set of variables is defined from two perspectives: the call behaviour and the underlying travel behaviour. The call behaviour covers the variables that are directly related to call communication activities; most of these are also used in the multi-modal data annotation process, as described in Section 1. The travel behaviour, however, approximates the spatial-temporal features of a location. The difference between these two perspectives can be illustrated by two groups of major variables. The first group includes the call frequency CFreqR and the visit frequency VFreqR. CFreqR depicts how often calls are made at a location; by contrast, VFreqR reveals how often the location is visited, irrespective of the number of calls made during each visit. The second group comprises the call duration CDur and the visit duration VDur. CDur describes the duration of a call, while VDur is defined as the time interval between the first and last calls at the location. Apart from these two perspectives, all the variables are also differentiated by spatial-temporal factors, including spatial repetition, temporal periodicity, day types and day segments. All the variables are listed in Table 2.
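To make the two perspectives concrete, the following sketch computes frequency- and duration-type variables of this kind from a single user's records. The record format and the approximation of a 'visit' as all calls at a location on the same day are assumptions made for this illustration, not the chapter's exact definitions.

```python
from collections import defaultdict

def location_features(calls):
    """Compute per-location CFreqR, VFreqR and duration-type variables
    from one user's call records.

    `calls` is a list of (location, day, minute_of_day, duration) tuples;
    a "visit" is approximated here as all calls made at the same location
    on the same day.
    """
    by_loc = defaultdict(list)
    for loc, day, minute, dur in calls:
        by_loc[loc].append((day, minute, dur))

    total_calls = len(calls)
    # Total visits = distinct (location, day) pairs over all locations.
    total_visits = sum(len({day for day, _, _ in recs}) for recs in by_loc.values())

    feats = {}
    for loc, recs in by_loc.items():
        visits = defaultdict(list)
        for day, minute, dur in recs:
            visits[day].append((minute, dur))
        # Call-behaviour side: call frequency ratio and total call duration.
        cfreq_r = len(recs) / total_calls
        tot_cdur = sum(dur for _, _, dur in recs)
        # Travel-behaviour side: visit frequency ratio and visit duration,
        # measured between the first and last call of each visit.
        vfreq_r = len(visits) / total_visits
        vdurs = [max(m for m, _ in v) - min(m for m, _ in v) for v in visits.values()]
        feats[loc] = {
            "CFreqR": cfreq_r,
            "VFreqR": vfreq_r,
            "TotCDur": tot_cdur,
            "AveVDur": sum(vdurs) / len(vdurs),
        }
    return feats
```

Note how CFreqR counts every call while VFreqR counts each visited day only once, which is exactly the distinction drawn above.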

In terms of day segments, different definitions of time periods have been adopted in the literature, depending on the context of the study area [8]. Instead of making such an a priori assumption, the method proposed in this study estimates the splitting points of the day from empirical data. The resultant splitting points delimit the largest differences in the distribution of activity types across the time intervals. Specifically, the segmentation process starts with a full day of 24 hours, and each hour is examined independently. An hour under investigation divides the day into two time intervals, e.g. 0–10 am and 10 am to midnight for the candidate point 10 am. A contingency table is then constructed, in which the two time intervals and the five activity types are the row and column variables, respectively. The cell values are the frequencies of the aggregated observations from the labelled call locations that fall into the corresponding time interval and activity class. A chi-square statistic is subsequently calculated for this table. After the chi-square statistic is obtained for each of the 24 hours, the hour with the largest statistic is chosen as the first splitting point, denoted as S1. This point divides the day into two intervals, between 0 and S1 and between S1 and 24. The process is repeated for each of the newly formed intervals, until further splitting does not generate a substantial difference or until a pre-specified number of intervals is reached.
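The recursive splitting just described can be sketched as follows. The chi-square statistic is computed directly, the stopping threshold is passed in as a parameter, and labelled observations are assumed to be simple (hour, activity) pairs for illustration.

```python
def chi2_stat(table):
    """Pearson chi-square statistic for a 2 x K contingency table."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    n = sum(row)
    if n == 0:
        return 0.0
    stat = 0.0
    for i, r in enumerate(table):
        for j, obs in enumerate(r):
            exp = row[i] * col[j] / n
            if exp > 0:
                stat += (obs - exp) ** 2 / exp
    return stat

def best_split(obs, lo, hi, activities):
    """Find the hour in (lo, hi) whose split of the interval yields the
    largest chi-square over the activity distribution.

    `obs` is a list of (hour, activity) pairs from labelled locations.
    """
    best = (None, 0.0)
    for h in range(lo + 1, hi):
        table = [[0] * len(activities) for _ in range(2)]
        for hour, act in obs:
            if lo <= hour < hi:
                # Row 0: before the candidate point, row 1: at or after it.
                table[hour >= h][activities.index(act)] += 1
        stat = chi2_stat(table)
        if stat > best[1]:
            best = (h, stat)
    return best

def segment_day(obs, threshold, activities):
    """Recursively split [0, 24) until no candidate exceeds `threshold`."""
    points = []
    stack = [(0, 24)]
    while stack:
        lo, hi = stack.pop()
        h, stat = best_split(obs, lo, hi, activities)
        if h is not None and stat >= threshold:
            points.append(h)
            stack.extend([(lo, h), (h, hi)])
    return sorted(points)
```

With strongly time-dependent labels, the first split falls exactly where the activity distribution changes, mirroring the 9 am point found in the case study.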

#### 3.3. Feature selection

Due to the small size of the training dataset, particularly relative to the large number of defined variables, over-fitting is a potential problem. To address this issue, feature selection techniques are employed in order to decrease the number of predictors actually utilized by the classification models. Two methods, wrapper [9] and filter [10], which have shown effectiveness in the multi-modal data annotation process, are chosen for feature selection. Wrapper searches for an optimal feature subset using the classification model itself. In contrast, filter examines each feature separately and selects features that have high correlation with the target variable but low correlation with the features that have already been chosen.

#### Travel behaviour

Spatial repetition. (1) VFreqR: the visit frequency at the location divided by the total visit frequency over all locations of the individual.

Temporal variability. (1) TotVDurR: the total duration of all visits to the location divided by the duration of visits to all locations of the individual. (2) [Ear/Lat]VTime: the earliest and latest call time of all calls at the location. (3) AveV[StartT/EndT], VarV[StartT/EndT]: the average and variance of the first and last call time over all visits at the location. (4) [Longest/Ave/Var]VDur: the longest and average duration of all visits to the location, and the variance of the duration.

Day type. (1) VFreqR[Week/Weekend/Sun/Sat/Hol], TotVDurR[Week/Weekend/Sun/Sat/Hol]: 'VFreqR' and 'TotVDurR' on weekdays, weekends, Sundays, Saturdays or public holidays.

Day segment. (1) VFreqR[1/…/m], TotVDurR[1/…/m]: 'VFreqR' and 'TotVDurR' segmented over different time periods of a day.

#### Call behaviour

Spatial repetition. (1) CFreqR: the call frequency at the location divided by the total call frequency over all locations of the individual. (2) [VoiC/Mes]FreqR: 'CFreqR' segmented between voice calls and messages. (3) [Inc/Mis/Out]CFreqR: 'VoiCFreqR' divided into incoming, missed and outgoing calls. (4) [Inc/Out]MesFreqR: 'MesFreqR' divided into incoming and outgoing messages.

Temporal variability. (1) TotCDurR: the total call duration of all calls at the location by the individual. (2) CInt[Max/Ave]: the maximum and average time interval between two consecutive calls at the location. (3) [Ave/Var]CTime: the average and variance of the call time of all calls at the location. (4) [Longest/Ave/Var]CDur: the longest, average and variance of the duration of all calls at the location.

Day type. (1) CFreqR[Week/Weekend/Sun/Sat/Hol], TotCDurR[Week/Weekend/Sun/Sat/Hol], VoiCFreqR[Week/Weekend/Sun/Sat/Hol], MesFreqR[Week/Weekend/Sun/Sat/Hol]: 'CFreqR', 'TotCDurR', 'VoiCFreqR' and 'MesFreqR' on weekdays, weekends, Sundays, Saturdays or public holidays.

Day segment. (1) CFreqR[1/…/m], TotCDurR[1/…/m], VoiCFreqR[1/…/m], MesFreqR[1/…/m]: 'CFreqR', 'TotCDurR', 'VoiCFreqR' and 'MesFreqR' segmented over different time periods of a day.

a The symbol [] denotes a set of variables, e.g. [Ear/Lat]VTime stands for the variables 'EarVTime' and 'LatVTime'. Each day is divided into m segments, with m decided by the method described in Section 3.2.

Table 2. Variable definition.a
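The filter idea of selecting features with high relevance to the target but low redundancy with already chosen features can be sketched as a greedy correlation-based procedure. This is a simplified stand-in for the filter of [10], using plain Pearson correlation on numerically encoded labels; it is illustrative only, not the chapter's exact method.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def greedy_filter(features, target, k):
    """Greedy correlation filter: repeatedly pick the feature most
    correlated with the target and least correlated (on average) with
    the features already chosen.

    `features` maps names to equal-length value lists; `target` is a
    numeric encoding of the activity label.
    """
    chosen = []
    while len(chosen) < k and len(chosen) < len(features):
        def score(name):
            relevance = abs(pearson(features[name], target))
            redundancy = (sum(abs(pearson(features[name], features[c])) for c in chosen)
                          / len(chosen)) if chosen else 0.0
            return relevance - redundancy
        best = max((n for n in features if n not in chosen), key=score)
        chosen.append(best)
    return chosen
```

A wrapper, by contrast, would replace `score` with the cross-validated accuracy of the classification model itself on the candidate subset.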

#### 3.4. Machine learning

A group of state-of-the-art machine learning algorithms is employed, including decision trees (DTs) [11], random forests (RF) [12], multinomial logistic regression (MNL) [13] and multiclass support vector machines (SVMs) [14]. These algorithms have demonstrated comparable performance for multi-category classification problems. They mainly differ in the way the classification question is formulated, the learning function and the method for deciding the optimal function parameters. As each learning algorithm has its strengths and weaknesses, it is often challenging to identify a single algorithm that performs best for a particular classification problem [15]. Thus, in this study, a fusion process is developed, which integrates the results of these algorithms, in order to utilize the strength of one while complementing the limitations of another. In this process, the four individual models' prediction results (i.e. the probabilities of the different possible activity types) for each call location are used as predictors, and the observed activity types remain the dependent variable. The relation between these predictors and the observed activity types is then learned again by a classification model.
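The fusion step amounts to building a second-level training set from the base models' class probabilities. A minimal sketch of this construction is shown below; the input format is an assumption for illustration, and in practice the base predictions should come from cross-validation so that the second-level model is not trained on leaked outputs.

```python
def build_fusion_features(base_predictions):
    """Concatenate per-class probabilities from each base model into one
    feature vector per call location, ready for a second-level classifier.

    `base_predictions` maps model name -> list of per-location probability
    vectors over the activity types, all in the same location order.
    """
    names = sorted(base_predictions)  # fixed model order for reproducibility
    n_loc = len(base_predictions[names[0]])
    rows = []
    for i in range(n_loc):
        row = []
        for name in names:
            probs = base_predictions[name][i]
            assert abs(sum(probs) - 1.0) < 1e-6, "probabilities must sum to 1"
            row.extend(probs)
        rows.append(row)
    return rows
```

Each resulting row (e.g. 4 models x 5 activity types = 20 values) is paired with the observed activity type and fed to the fusion classifier.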

#### 3.5. The enhancement algorithm

While machine learning methods provide an effective solution for annotating each single location, they disregard the activity orders and transitions embedded in daily activity and travel sequences. When the annotated locations of a day are linked in temporal order, they should follow certain sequential constraints. The interdependencies of daily activities are considered a crucial factor in activity and travel decision making, as discussed in Section 3.1. By considering sequential information, the activity locations visited by an individual on a day are treated as a whole, rather than as isolated activity participations.

The enhancement algorithm takes the preliminary inference results as well as the sequential knowledge as inputs and aims to improve the prediction. The method is composed of two components: transition probability-based enhancement and prior probability-based enhancement. Figure 1 illustrates how the prediction is improved using a daily location sequence of a user.

According to the training data of the user, he/she conducted the activity chain 'work-social visit-work' at the respective call times on a day, but the prediction from the classification models is 'work-non-work obligatory-work'. A prediction error occurs at the second location. In this case, if a location (e.g. the second location) has a prediction probability P (0.443) smaller than a threshold T1 (0.72 in our case study), it is assumed that the location is likely to be wrongly annotated. The enhancement algorithm is then applied to this false location to improve its prediction in the following steps. (1) If there is an additional location adjacent to the false one (backwards or forwards) in the predicted sequence for that day, and if this location has P larger than a threshold T2 (0.9), it is considered a possibly correct prediction. The additional location is then used to fix the prediction of the false one, using the transition probability-based enhancement. (2) Otherwise, if no neighbouring location is predicted with a high probability, the prior probability-based enhancement is employed to improve the prediction based on the call time at the false location. After recalculation, the activity type with the largest enhanced probability P' is chosen as the annotation result of the false location on that particular day. As a location may be visited repeatedly on multiple days, the enhancement results of the multiple days are integrated by a majority voting rule into the final annotation for the location. Under appropriate parameters T1 and T2, false predictions are likely to be corrected while accurate inference results are maintained. Figure 2 demonstrates the details of the enhancement process.
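The per-day control flow of the enhancement can be sketched as follows. The two enhancement variants are passed in as callbacks, and the (probability, activity) pair format is an assumption made for this illustration.

```python
def enhance_sequence(day_preds, t1, t2, enhance_transition, enhance_prior):
    """One pass of the enhancement over a single day's location sequence.

    `day_preds` is a list of (probability, activity) pairs in temporal
    order; locations with probability below `t1` are treated as suspect.
    `enhance_transition(k, anchor)` and `enhance_prior(k)` are callbacks
    returning a revised (probability, activity) pair for location k.
    """
    revised = list(day_preds)
    for k, (p, act) in enumerate(day_preds):
        if p >= t1:
            continue  # prediction considered reliable; leave untouched
        # Look for a confidently predicted neighbour (backwards or forwards).
        anchor = None
        for j in (k - 1, k + 1):
            if 0 <= j < len(day_preds) and day_preds[j][0] > t2:
                anchor = j
                break
        if anchor is not None:
            revised[k] = enhance_transition(k, anchor)
        else:
            revised[k] = enhance_prior(k)
    return revised
```

Running this over every daily sequence of a user and majority-voting the per-day results for each location yields the final annotation, as described above.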

#### 3.5.1. Transition probability-based enhancement

The sequential information is represented in a 5 × 5 transition probability matrix between the different activities. Let ai and aj (ai, aj = 1…5) be the activities performed at the previous location i and the current location j, respectively, and Tr(aj|ai) the transition probability from ai to aj, calculated from the training data as follows:


Semantic Annotation of Mobile Phone Data Using Machine Learning Algorithms http://dx.doi.org/10.5772/intechopen.70255 101

$$Tr(a_j|a_i) = \frac{F(a_j|a_i)}{\sum_{k=1}^{5} F(a_k|a_i)}\tag{1}$$

F(aj|ai) is the frequency with which aj follows ai in the training data. The probability of location j being annotated as aj, conditioned on ai at the previous location i, can then be recalculated as P<sup>0</sup>(aj|X) according to Eq. (2).

$$P^0(a\_j|X) = P(a\_j|X) \times \operatorname{Tr}(a\_j|a\_i) \tag{2}$$

P(aj|X) is the result of the classification model. Note that P<sup>0</sup>(aj|X) is biased towards frequently visited locations, e.g. home and work/school places, as transitions to these places are more likely than to other, less visited locations. Consequently, under Eq. (2) most locations would be redirected to these two activity types. To overcome this, Tr(aj|ai) is additionally divided by the total frequency of aj, resulting in the normalized probability Qr(aj|ai).

$$Qr(a_j|a_i) = \frac{F(a_j|a_i)}{\sum_{k=1}^{5} F(a_k|a_i) \times \sum_{k=1}^{5} F(a_j|a_k)}\tag{3}$$

P<sup>0</sup>(aj|X) can accordingly be revised as P'(aj|X):

Figure 2. The enhancement algorithm. [Flowchart: for each individual, the annotated locations are filled into daily sequences; D denotes the total number of such sequences, and for each sequence d (d = 1…D), k and N(d) denote a location and the total number of locations in d (k = 1…N(d)). Starting with d = 1 and k = 1, each location k with P < T1 is enhanced: if a second adjacent location with P > T2 exists, the transition probability-based enhancement is applied, otherwise the prior probability-based enhancement; locations with P ≥ T1 remain untouched. A revised probability P' is calculated for each enhanced location, k and d are advanced until all sequences are processed, and the final classification is obtained from the multiple days' enhancement results.]

$$P'(a\_j|X) = P(a\_j|X) \times Qr(a\_j|a\_i) \tag{4}$$

In the user's case, as shown in Figure 1, the transition probability Qr from work to non-work obligatory activities is very small. After the enhancement, P'(non-work obligatory) (0.008) therefore drops below P'(social visit) (0.033), and the social visit activity is obtained as the revised annotation.
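Using the training-data transition counts, Eqs. (1), (3) and (4) translate directly into code. The 2 × 2 counts in the usage below are illustrative only; the chapter uses a 5 × 5 matrix over the five activity types.

```python
def tr(counts, i, j):
    """Tr(a_j|a_i) of Eq. (1): transition counts normalized per origin."""
    return counts[i][j] / sum(counts[i])

def qr(counts, i, j):
    """Qr(a_j|a_i) of Eq. (3): Tr additionally normalized by the total
    incoming frequency of a_j, countering the bias towards frequent
    activities such as home and work/school."""
    out_i = sum(counts[i])                  # sum_k F(a_k | a_i)
    in_j = sum(row[j] for row in counts)    # sum_k F(a_j | a_k)
    return counts[i][j] / (out_i * in_j)

def revise(p_model, counts, i, j):
    """P'(a_j|X) = P(a_j|X) * Qr(a_j|a_i), Eq. (4)."""
    return p_model * qr(counts, i, j)
```

Here `counts[i][j]` holds F(aj|ai), the number of times activity j follows activity i in the training data.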

#### 3.5.2. Prior probability-based enhancement

The transition probability-based enhancement described above involves at least two locations that are adjacent in time, one of which has a prediction probability larger than T2. However, such daily trajectories derived from the classification models are not always available. For example, none of the neighbouring locations may have a probability larger than T2, or people may stay at a single location (e.g. home) during an entire day, engaging only in a single (home) activity. This is particularly true with cell phone data: people may not make calls when travelling to an activity location, so the daily movement traces are not fully revealed by their call data. In these cases, we exploit the typical activity and travel behaviour at different times of the day through the prior probability distribution of activity aj at call time t, i.e. P(aj|t). Applying Bayes' rule, we compute the posterior probability of aj based on X and t, i.e. P'(aj|X, t). Under the assumption that X is independent of t, this probability can be computed as follows.

$$\begin{split}P'(a_j|X,t) &= \frac{P(a_j,X,t)}{P(X,t)} = \frac{P(X,t|a_j) \times P(a_j)}{P(X) \times P(t)} \\ &= \frac{P(a_j|X) \times P(X)}{P(a_j)} \times \frac{P(a_j|t) \times P(t)}{P(a_j)} \times \frac{P(a_j)}{P(X) \times P(t)} \\ &= \frac{P(a_j|X) \times P(a_j|t)}{P(a_j)}\end{split}\tag{5}$$

P(aj|X) is the output of the classification model, i.e. the probability of aj performed at the location j conditioned on the previously defined variables X. When P(aj|X) is compared with the new probability P' (aj|X, t), since t is added in the conditional part of P', the new probability is more discriminative and informative than P.

P(aj|t) and P(aj) can be derived from the training data as follows:

$$P(a_j|t) = \frac{F(a_j|t)}{\sum_{k=1}^{5} F(a_k|t)}, \qquad P(a_j) = \frac{F(a_j)}{\sum_{k=1}^{5} F(a_k)}\tag{6}$$

Here, F(aj|t) denotes the number of occurrences of aj at time t, and F(aj) the number of occurrences of aj over all times. It should be noted that, from a theoretical perspective, the above enhancement process rests on two weak assumptions: the replacement of P(aj|X) with the result of the classification model, and the hypothesized independence between X and t. Nevertheless, based on Eq. (5), the preliminary prediction probability is complemented with the prior probability distribution.
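Eqs. (5) and (6) combine into a short reweighting routine. The sketch below renormalizes the scores over the activity types so they again sum to one; the input structure (per-time and overall frequency counts) is an assumption for illustration.

```python
def prior_enhance(p_model, freq_by_time, freq_total, t):
    """Posterior of Eq. (5): P'(a_j|X,t) is proportional to
    P(a_j|X) * P(a_j|t) / P(a_j).

    `p_model[j]` is the classifier probability P(a_j|X);
    `freq_by_time[t][j]` is F(a_j|t) and `freq_total[j]` is F(a_j),
    both taken from the training data. The result is renormalized
    over the activity types.
    """
    n_t = sum(freq_by_time[t])
    n = sum(freq_total)
    scores = []
    for j, p in enumerate(p_model):
        p_aj_t = freq_by_time[t][j] / n_t   # P(a_j|t), Eq. (6)
        p_aj = freq_total[j] / n            # P(a_j), Eq. (6)
        scores.append(p * p_aj_t / p_aj if p_aj > 0 else 0.0)
    z = sum(scores)
    return [s / z for s in scores] if z else list(p_model)
```

A time slot in which one activity dominates the training data thus pulls an undecided model prediction strongly towards that activity.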

#### 4. Case study

In this section, a set of experiments is presented that adopts the proposed method and uses the cell phone data described in Section 2. The results of these experiments are discussed, and the performance of the annotation process is evaluated.

#### 4.1. Day segments

Table 3 lists the optimal splitting points for each of the intervals, based on the method described in Section 3.2. The first splitting point over the entire day was found at 9 am, generating the two intervals 0–9 am and 9 am to midnight. The process was then iterated for each of the two newly obtained intervals. If the largest chi-square value over all potential points of an interval was lower than a predefined threshold (200 in this experiment), the search stopped.


a The rows, respectively, denote the current interval (hour) under investigation, the optimal splitting point S, the chi-square value, the decision on whether or not the interval is split ('Yes' means two new intervals are formed; 'No' is marked with the symbol 'X'), and the order of the optimal points according to the chi-square values.

Table 3. The optimal points of a day.<sup>a</sup>


Figure 3. The evolution of chi-square statistics of the optimal points.

Figure 3 further shows the evolution of the chi-square statistics, in which the first three orders yield much higher values than the remaining ones; from the fourth order on, the statistic declines sharply. Thus, the first three optimal points were extracted and four time periods were generated: 0:00–8:59, 9:00–13:59, 14:00–18:59 and 19:00–23:59. After each day was segmented into these four periods, all the variables defined in Table 2 were computed and used as candidates for the subsequent feature selection and machine learning. Weka, an open-source Java application comprising a collection of machine learning algorithms for data mining tasks [16], was used for the implementation.

#### 4.2. Results of individual classification models

The original training dataset is randomly divided into 10 subsets. In each model run, one of these subsets is used as the validation data and the remaining subsets are combined as the training data. The number of correctly annotated locations in the i-th validation subset is denoted as Ci (i = 1…10). Let Num be the total number of locations in the training dataset; the prediction accuracy is then defined as follows:

$$\text{Accuracy} = \frac{\sum\_{i=1}^{10} \mathbf{C}\_i}{\text{Num}} \tag{7}$$
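The 10-fold scheme behind Eq. (7) can be sketched as follows; the classifier is passed in as a train-and-predict callback, so the sketch is agnostic to which of the four models is evaluated.

```python
import random

def ten_fold_accuracy(data, train_and_predict, seed=0):
    """Estimate accuracy by Eq. (7): split the labelled locations into
    10 random subsets, hold each out in turn, and count correct labels.

    `data` is a list of (features, label) pairs; `train_and_predict`
    takes (train, test_features) and returns predicted labels.
    """
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::10] for i in range(10)]  # 10 disjoint validation folds
    correct = 0
    for fold in folds:
        held = set(fold)
        train = [data[i] for i in idx if i not in held]
        test = [data[i] for i in fold]
        preds = train_and_predict(train, [x for x, _ in test])
        correct += sum(p == y for p, (_, y) in zip(preds, test))
    return correct / len(data)  # sum of C_i divided by Num
```

Every location is validated exactly once, so the denominator is simply the size of the labelled dataset.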

The individual classification models are built on the features of locations drawn from the perspectives of both travel and call behaviour as well as on the features profiling only call behaviour, respectively. In addition, the models are also run separately on all candidate variables as well as on the variable subsets that are chosen by filter or wrapper. The prediction results with the best parameter setting in each case are presented in Table 4.

From the prediction results, the following observations can be drawn. (1) The models running on a subset of variables perform better than those operating on all predictors; the average improvement is 0.85% for wrapper and 2.13% for filter. This demonstrates the importance of feature selection techniques when the number of predictors is large relative to a small training set. (2) There is no general conclusion on which feature selection method is better; it depends on the specific classification model. SVM performs better with filter, DT and RF show little difference between the two techniques, while MNL gains a remarkable improvement of 4.8% with wrapper. (3) When the different models are compared, MNL produces the best results with 68.98% accuracy, followed by 66.06% for RF, 65.69% for SVM and 60.95% for DT. (4) Variation is also exhibited between the variables drawn from the different perspectives. In most cases, the prediction accuracy derived from the combination of both travel and call behaviour is higher than that from call behaviour alone. The average accuracy increases by 2.96 and 1.20% for filter and wrapper, respectively, and by 2.09% with all variables included. This underlines the added value of the variables built on the underlying activity and travel behaviour.

Apart from differences in model performance, the feature selection techniques combined with the various classification models also yield divergent optimal subsets of features. Eight variables are selected by multiple selection processes and are therefore regarded as important predictors: VFreqRWeek, TotVDurRSun, VarVEndT, VarVStartT and AveVEndT, which describe activity and travel behaviour, and AveCallTime, IncMesFreqR and MesFreqR3, which relate only to call behaviour.
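The wrapper scheme referenced above can be illustrated with a minimal greedy forward search. This is a sketch only, assuming a cross-validated scoring callback `score_fn`; the chapter does not specify the authors' exact search strategy, and the stopping rule here is illustrative.

```python
def wrapper_forward_select(candidates, score_fn, min_gain=1e-9):
    """Greedy forward wrapper selection (illustrative sketch):
    repeatedly add the candidate variable that most improves the
    cross-validated score `score_fn(subset)`, stopping when no
    candidate yields a meaningful gain."""
    selected, best = [], float("-inf")
    remaining = list(candidates)
    while remaining:
        # Score every one-variable extension of the current subset.
        score, var = max((score_fn(selected + [v]), v) for v in remaining)
        if score <= best + min_gain:
            break                       # no candidate improves the score
        selected.append(var)
        remaining.remove(var)
        best = score
    return selected, best
```

A filter method, by contrast, would rank the candidates once by a model-free criterion (e.g. a correlation or information measure) rather than re-fitting the classifier for each candidate subset.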


Table 4. Prediction accuracy of the individual classification models (%).<sup>a</sup>

|  | DT | RF | MNL | SVM-poly | SVM-RBF |
|---|---|---|---|---|---|
| **Travel and call behaviour** |  |  |  |  |  |
| Filter | **60.95** | 65.33 | 64.23 | **63.50** | **65.69** |
| Wrapper | 60.58 | **66.06** | **68.98** | 59.26 | 56.57 |
| All variables | 59.12 | 64.60 | 63.50 | 56.93 | 59.85 |
| **Call behaviour** |  |  |  |  |  |
| Filter | 58.76 | 62.77 | 62.77 | 59.85 | 60.58 |
| Wrapper | 59.85 | 63.50 | 65.69 | 59.49 | 58.39 |
| All variables | 56.57 | 62.04 | 60.58 | 57.30 | 59.85 |
| Parameters | N = 4 | N = 0 | C = 1 | c = 100, degree = 1 | c = 100, gamma = 0.01 |

<sup>a</sup> The highest prediction accuracy for each model is in bold.

#### 4.3. Results of fusion models

In this fusion process, each of the four individual classification models is in turn employed as the fusion model to predict the activity types, while the results from the individual classifiers with the best parameter settings shown in Table 4 are used as the predictors. The two best-performing predictions for each fusion model are presented in Table 5. The results reveal that a fusion model does not necessarily outperform the individual models; its performance depends on which individual classifiers are selected as the predictors. For instance, MNL obtains 68.98% accuracy as an individual classifier and achieves 69.71% when used as the fusion model built on the integration of all four individual models' results. However, the accuracy drops to 61.68% when only DT and SVM-RBF are employed as the predictors.
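The fusion setup described above, in which the individual classifiers' outputs become predictor columns for a meta-classifier, can be sketched as follows. The helper names are illustrative, and the majority vote is shown only as a simple fusion baseline, not as the chapter's method (the chapter re-uses DT/RF/MNL/SVM as the fusion model).

```python
from collections import Counter

def stack_features(base_predictions):
    """Turn the outputs of the individual classifiers into predictor rows
    for a fusion model: row i holds each base classifier's predicted
    activity for location i."""
    return [list(row) for row in zip(*base_predictions)]

def majority_vote(base_predictions):
    """A simple fusion baseline: per-location majority vote over the
    base classifiers' predictions."""
    return [Counter(row).most_common(1)[0][0]
            for row in stack_features(base_predictions)]
```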

#### 4.4. Enhancement algorithm

#### 4.4.1. Transition matrix

Similar to the temporal variables, the transition matrix is built separately for weekdays, weekends and holidays, as well as for different periods of a day. The identification of optimal cutting points for the matrix follows the previously described method, except for the time intervals. Each potential dividing point yields two intervals but three scenarios, depending on the times of the two activities involved in the transition. The first and second scenarios occur when both activities take place in the first interval or both in the second; the third scenario occurs when the first activity takes place in the first interval and the second activity in the second interval. Given the small size of the training set, only the first significant cutting point was identified, namely 18:00. Under this division, the largest difference in the distribution of activity transitions is found among the three scenarios: transitions within 0:00–17:59, transitions within 18:00–23:59, and transitions from 0:00–17:59 to 18:00–23:59. Table 6 shows the transition matrix for the first scenario during weekdays.
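The assignment of a transition to one of the three scenarios induced by a cutting point can be expressed compactly. This is a sketch under the assumption that activities are ordered within a day (so a transition never runs from the later interval back to the earlier one); the function name and signature are illustrative.

```python
def transition_scenario(start1, start2, cut=18):
    """Assign a transition between two consecutive activities (with start
    hours `start1` and `start2`) to one of the three scenarios induced by
    a cutting point (18:00 in the text)."""
    if start1 < cut and start2 < cut:
        return "within-first"        # both activities before the cut
    if start1 >= cut and start2 >= cut:
        return "within-second"       # both activities after the cut
    return "across"                  # first before, second after the cut
```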


Table 5. Prediction accuracy of fusion models (%).<sup>a</sup>

<sup>a</sup> The rows represent the fusion models, and the columns include the individual classifiers and the prediction accuracy. X indicates the corresponding individual models being chosen as the predictors.


Table 6. Transition matrix.<sup>a</sup>

<sup>a</sup> The rows and columns represent the current and previous activities, respectively; the maximum probability for each column is in bold.

As expected, for the probability Tr(aj|ai), the highest values are dominated by transitions to either home or work/school activities. With Qr(aj|ai), however, the dominance of these two activities is discounted by their high frequencies, and transitions to less represented activities are exposed. This is manifested by the high transition probabilities from home to non-work activities and from one social visit location to another.
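A minimal sketch of the two quantities discussed above follows. Tr(aj|ai) is estimated from observed consecutive activity pairs. The excerpt does not give Qr's exact definition, only that it discounts the dominance of frequent activities; the version below, which divides Tr by the overall frequency P(aj) of the target activity, is therefore an assumption.

```python
from collections import Counter, defaultdict

def transition_probs(activity_sequences):
    """Tr(a_j | a_i): probability that activity a_j follows a_i,
    estimated from consecutive activity pairs in each daily sequence."""
    pair_counts = defaultdict(Counter)
    for seq in activity_sequences:
        for prev, cur in zip(seq, seq[1:]):
            pair_counts[prev][cur] += 1
    return {ai: {aj: n / sum(cnt.values()) for aj, n in cnt.items()}
            for ai, cnt in pair_counts.items()}

def relative_transition_probs(tr, marginal):
    """Qr(a_j | a_i): Tr normalized by the overall frequency P(a_j) of the
    target activity -- an ASSUMED form, exposing transitions to activities
    that are rare overall."""
    return {ai: {aj: p / marginal[aj] for aj, p in row.items()}
            for ai, row in tr.items()}
```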

#### 4.4.2. Activity distribution at different times

The activity distribution is also differentiated between weekdays, weekends and holidays. The weekday distribution at each hour P(aj|t) is shown in Figure 4(a), and the distribution of the ratio between P(aj|t) and the overall probability of the activity P(aj) is depicted in Figure 4(b). These two distributions show remarkable deviation: in Figure 4(a), either the home or the work/school type dominates the activities, whereas in Figure 4(b) the most likely activity shifts across various types as the day unfolds.

Figure 4. Absolute activity distribution (a) and relative activity distribution at each hour (b).
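The two quantities plotted in Figure 4 can be computed from (hour, activity) observations as sketched below; the function and variable names are illustrative, not the chapter's code.

```python
from collections import Counter

def hourly_activity_distribution(observations):
    """From (hour, activity) pairs, compute P(a_j | t), the hourly activity
    distribution of Figure 4(a), and the ratio P(a_j | t) / P(a_j) of
    Figure 4(b), which rescales each activity by its overall frequency."""
    by_hour = Counter()          # observations per hour
    joint = Counter()            # observations per (hour, activity)
    overall = Counter()          # observations per activity
    for hour, act in observations:
        by_hour[hour] += 1
        joint[(hour, act)] += 1
        overall[act] += 1
    total = sum(overall.values())
    p_given_t = {(h, a): n / by_hour[h] for (h, a), n in joint.items()}
    ratio = {(h, a): p / (overall[a] / total)
             for (h, a), p in p_given_t.items()}
    return p_given_t, ratio
```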
