### **3.1 Area selection**

We selected 61 areas with varied and known social functions, such as neighborhoods, industrial areas, office areas, highways, commercial streets, and shopping malls, spread over nine cities in the Tel Aviv metropolitan area and its surroundings (**Figure 2**): Tel Aviv, Holon, Ramat Gan, Petah Tikva, Rosh Haayin, Ra'anana, Ramat Hasharon, Givatayim, and Kfar Saba.

#### **Figure 1.**

*Workflow of land-use classification using the SSK algorithm.*

#### **Figure 2.**

*(Left) A map of Israel with the area covered in the study marked by a red rectangle. (Right) A zoomed-in map of this area, including the Tel Aviv metropolitan area and its surroundings, with the nine cities participating in the study (underlined in red): Tel Aviv, Holon, Ramat Gan, Petah Tikva, Rosh Haayin, Ra'anana, Ramat Hasharon, Givatayim, and Kfar Saba. Approximately 1 million people live in this area.*

*Mapping of Social Functions in a Smart City When Considering Sparse Knowledge DOI: http://dx.doi.org/10.5772/intechopen.104901*

For example, see the five selected areas in the city of Ra'anana shown in **Figure 3**. Each area is represented as a polygon on the map. Four of the areas are wide polygons covering residential neighborhoods. The fifth is a narrow rectangle representing Ahuza Street, the main commercial street; it is kept narrow so that it covers only the street itself, without the surrounding area.

The choice to analyze deliberately selected segments of several cities deserves discussion. Previous works, such as those of Yuan et al. [3], Toole et al. [42], and Sun et al. [9], performed land-use mapping of whole cities (Beijing, Boston, and Shenzhen, respectively). In deliberately chosen areas, however, the social functions are less mixed. Classifying land of a "pure" social function is easier; hence, we expect that applying this method to a whole city would yield lower accuracy than achieved on this dataset. The deliberate choice of areas also has notable advantages. Analyzing social land uses in their "pure" condition enables us to recognize the core behavior and patterns of each social function. Choosing areas from different cities enables examining the inter-city resemblance of social functions, as reflected in the use of cellular communication. Deliberate selection also makes the labeling process less expensive and time-consuming. More importantly, the resulting labels are more accurate, which enables more reliable tests and conclusions. Thus, this dataset supports a careful analysis, which is valuable for assessing the feasibility of the method.

### **3.2 Division of time and space into basic spatiotemporal units**

We divided space and time into spatiotemporal units. The chosen areas were divided into smaller geographical units in a grid-like manner; we refer to each unit as a cell. Dividing the land into smaller parts reduces the variety of the social functions that take place in each part; the land use is therefore more homogeneous, which is more suitable for land-use categorization. However, using small fixed-size land parts may lose accuracy when the use of space is dynamic due to a mix of buildings of different uses in close proximity, or even different uses in the same building on different floors. We further note that others [48] found hexagonal cells advantageous over square cells, although the former are less intuitive for the urban environment, or used census blocks; each partitioning system has its advantages and disadvantages [49]. In preliminary tests, we found the square grid suitable for our needs and selected a default cell size of 40,000 m<sup>2</sup>, shaped as a 200 × 200 m square. This is the same cell size and shape used by Toole et al. [42] and Pei et al. [18]. However, because 30 of the 61 areas contained an edge smaller than 100 m, in these areas, we used narrower rectangles.

Land use is dynamic and varies during the day. For example, activity habits in a residential neighborhood at 7 p.m. (say, eating dinner and watching TV) are greatly different from the activity habits in the same neighborhood at 3 a.m. (say, sleeping). Therefore, in addition to dividing space, we also divided the day hourly; that is, 00:00 to 01:00 is one time unit, 01:00 to 02:00 is another time unit, and so on.
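As a minimal sketch (the function and constant names are ours, not from the chapter), mapping an observation to its spatiotemporal unit amounts to integer division of planar coordinates by the cell edge, plus taking the hour of the timestamp:

```python
from datetime import datetime

# Illustrative sketch (names are ours): map a point observation, given in
# planar meters, to its spatiotemporal unit -- a 200 x 200 m grid cell
# plus a one-hour time slot.
CELL_SIZE_M = 200  # default cell edge length

def spatiotemporal_unit(x_m, y_m, timestamp):
    """Return (cell_col, cell_row, hour) for planar coordinates in meters."""
    col = int(x_m // CELL_SIZE_M)
    row = int(y_m // CELL_SIZE_M)
    return col, row, timestamp.hour  # 24 one-hour time units per day

print(spatiotemporal_unit(1530.0, 420.0, datetime(2022, 5, 1, 19, 45)))
# -> (7, 2, 19)
```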

### **3.3 Land-use labeling**

We labeled each cell per hour with a semantic social function of land use. As mentioned above, we chose to focus on areas that were relatively easy to label and, hence, we could label them with the help of a few locals. The labeled areas were then used as ground truth for training the land use classifier and evaluating its accuracy.

The semantic land-use labels include Residential, Commercial, Industrial, Highway (arterial roads), Office, Street, and No activity (no human activity is expected in this cell at this specific time, e.g., in industrial areas before work hours begin).

### **3.4 Feature extraction**

In this work, we used 158 features that capture varied aspects of the circadian nature of the activity in the cell [17]. We divided the features into five types: (1) Communication volume features measure the degree of communication activity. These features are designed to capture the difference between the activity volumes typical of different social functions (e.g., in commercial zones, there is more cellular communication than in residential areas). (2) Daily pattern features are calculated as the calling volume at a specific hour relative to the communication volume at the other hours of the day in the same zone. These features are designed to identify the circadian pattern of communication activity typical of the area (e.g., in a residential area, the communication peak hours are in the mornings and evenings, while in industrial areas, the peak is during working hours). (3) Weekly pattern features capture the difference in cellular usage on weekdays compared to the weekend. Thus, they differentiate between land uses such as residential, whose inhabitants return daily, and those like office zones, where workers do not go on weekends. (4) Contact features measure the number of different days on which people engage in at least one cellular communication in cell s in hour h, thus differentiating between land uses with frequent visitors and those with occasional ones. (5) Communication habits features are a collection of features that aim to characterize the land from the perspective of typical cellular communication usage habits, for example, call duration and the usage distribution of different types of cellular communication (phone calls and internet usage). These 158 features were found to be very successful in land-use classification [17]. They predicted residential, industrial, and no-activity land uses with F1 values (see Eq. (5) below) higher than 0.9 and provided an average accuracy over seven land uses of between 81% and 90% at any time of the day.
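To make type (2) concrete, here is a hedged sketch of one such daily pattern feature; the exact definitions of the 158 features are given in [17], and the names below are ours:

```python
# Hedged sketch of one "daily pattern" feature (type 2): the communication
# volume at a given hour relative to the cell's whole-day volume. Variable
# names are ours; the exact feature definitions appear in [17].
def daily_pattern_feature(hourly_volume, hour):
    """Fraction of the cell's daily communication volume occurring at `hour`."""
    total = sum(hourly_volume)
    return hourly_volume[hour] / total if total else 0.0

# A residential-like profile: peaks in the morning and the evening.
volume = [1, 1, 0, 0, 0, 1, 4, 9, 7, 4, 3, 3,
          3, 3, 3, 4, 5, 7, 9, 10, 8, 6, 4, 2]
print(round(daily_pattern_feature(volume, 19), 3))  # evening peak -> 0.103
```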

### **3.5 Semi-supervised self-labeled k-nearest neighbor**

We developed a variation of the k-nearest neighbor algorithm combined with a self-labeled iterative technique that enlarges a labeled dataset when only a few labeled samples exist. We call this method the Semi-supervised Self-labeled K-nearest neighbor (SSK).

Gathering land-use labels for a few segments of an urban area is relatively attainable; this information can be gathered by asking locals. However, getting additional land-use labels is often out of reach or too expensive. When only a small number of labeled samples is available, the effectiveness of conventional supervised classification algorithms deteriorates. Therefore, we used the self-labeled technique, designed to generate more labeled samples as input for the classifier, to tackle the lack of labeled data [50]. The self-labeled technique follows an iterative procedure: in each iteration, unlabeled data is labeled and added to the training set for the next iterations. In the first iteration, a classifier is trained based only on the labeled samples and classifies the unlabeled samples. In every iteration, the samples that the algorithm is most confident of classifying correctly are added to the labeled sample pool.
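The iterative procedure can be sketched as a generic loop; the callables `train` and `predict_with_confidence` below are placeholders, not the chapter's implementation:

```python
def self_label(labeled, unlabeled, train, predict_with_confidence, batch_size):
    """Generic self-labeling loop (a sketch; the callables are placeholders).

    `labeled` holds (sample, label) pairs; `unlabeled` holds bare samples.
    Each iteration trains on the current pool, scores the remaining samples,
    and promotes the most confident predictions into the pool.
    """
    labeled, unlabeled = list(labeled), list(unlabeled)
    while unlabeled:
        model = train(labeled)
        # (sample, predicted_label, confidence) triples, most confident first
        scored = sorted(((x,) + predict_with_confidence(model, x)
                         for x in unlabeled),
                        key=lambda t: t[2], reverse=True)
        promoted = scored[:batch_size]
        labeled += [(x, y) for x, y, _ in promoted]
        promoted_ids = {id(x) for x, _, _ in promoted}
        unlabeled = [x for x in unlabeled if id(x) not in promoted_ids]
    return labeled
```

Here `train` and `predict_with_confidence` stand in for any classifier; the method of this section fills both roles with a distance-weighted KNN.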

In our implementation, we used the Distance-weighted variation of K-Nearest Neighbor (DKNN) as the classifier. We assumed possessing the "real" land-use labels of 5% of the samples. In every iteration, the 5% of the samples of which the DKNN classifier is most confident are added to the training set. The samples used in the classification are the basic spatiotemporal units described in Section 3.2, which we refer to as cells. We use *xi* to refer to cell *i*'s sample.

We used DKNN, as introduced by Dudani [51]. In the classic version of KNN, the class assigned to each query sample (unlabeled sample) is determined by its *k* nearest neighbors in the training set, and each of the *k* neighbors has the same impact. In the distance-weighted version, the *k* nearest neighbors again contribute to the classification of the query sample, but here, the closer a neighbor is to the query sample, the more impact it has on the classification. Each of the *k* neighbors of the query sample *xq* gets a weight *wq*<sup>(*i*)</sup> that depends on how close it is to the query sample:

$$w\_q^{(i)} = \frac{1}{d\left(\mathbf{x}\_q, \mathbf{x}\_i\right)^2} \quad \forall i \in \{1, \dots, k\},\tag{1}$$

where *d*(*xq*, *xi*) is the feature-space Euclidean distance between the query sample *xq* and its labeled neighbor *xi*; other distance-weighted versions may be considered as well, for example, the harmonic mean distance [52]. *k* determines the number of neighbors considered in the calculation. Since DKNN has no training phase (all computation is done during prediction), the classifier's training time and space complexities are *O*(1), and the prediction time complexity is *O*(*knd*) for *n* *d*-dimensional samples (the prediction space complexity is also *O*(1)). Setting the number of neighbors *k*, and a discussion of the considerations leading to its choice, will follow below.

For example, let us assume that *k* = 2, that *xq*'s two closest neighbors (the labeled samples closest in the feature space to *xq*) are *xa* and *xb*, and that their feature-space distances from *xq* are 2 and 3, respectively. Then, according to Eq. (1), the weight of *xa* is 1/4 and that of *xb* is 1/9, as *xa* is closer to *xq*.
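The inverse-squared-distance vote can be sketched as follows (a minimal illustration assuming no zero distances; not the authors' code):

```python
import math

def dknn_predict(query, neighbors, k):
    """Distance-weighted KNN vote (a sketch, assuming no zero distances).

    `neighbors` is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(neighbors, key=lambda n: math.dist(query, n[0]))[:k]
    votes = {}
    for vec, label in nearest:
        w = 1.0 / math.dist(query, vec) ** 2  # Eq. (1)
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)

# The example above: neighbors at distances 2 and 3 get weights 1/4 and 1/9.
print(dknn_predict((0, 0), [((2, 0), 'a'), ((3, 0), 'b')], k=2))  # -> a
```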

The SSK algorithm presented in this section combines the self-labeled technique and the DKNN classification algorithm. However, we have made some adjustments to make a version of DKNN that is more suitable for our problem. In regular classification, the labels used for training are assumed to be correct. This assumption cannot be made when using the self-labeled technique, because only the labels in the first iteration are ground-truth labels; the labels in subsequent iterations belong to samples that were not labeled but have been classified through the process. To address this issue, we would like neighbors whose labels we are more confident are correct to have more impact on the classification.

Let *O* be the set of all cells (samples) and *L* be the original set of predefined ground-truth labeled cells. The set of cells that are labeled in a given iteration is *G*, and its complement is the set of cells that are not yet labeled, *Q* (*Q* = *O* \ *G*). In the first iteration of the algorithm, *G* = *L*. When describing the process, we refer to the cell whose class is being considered as the query cell.

We now introduce the term land-use array, an object that we use to discuss the method. The number of entries in a land-use array is equal to the number of land uses. We denote the land-use array of *xi* as *Ai*. Each entry in *Ai* represents a land use; for example, entry 1 would be Residential, entry 2 Commercial, etc. The value of entry *j* represents the confidence that cell *xi* is attributed to class *j*. Consider *Ai* = (*v*1, *v*2, … , *vc*), where *c* is the number of land-use categories; *vj* is a value that represents the confidence we have that the land use of cell *xi* is class *j*. The sum of all entries in *Ai* is always 1.

In the first iteration of the algorithm, the classification of the unlabeled cells is determined using the predefined labeled cells *L*, in which we assume 100% confidence. Before the first iteration, we initialize the land-use arrays of all the cells in *L*. Let us denote the land-use classes of the cells in *L* as *C*, meaning that the label of *xi* ∈ *L* is *Ci*. The initialization of the land-use array of cell *xi* ∈ *L* is as follows: entry number *Ci* (the class of *xi*) in *Ai* is set to 1, and all the other entries are set to 0. For example, if cell *xi* is labeled as Commercial, and Commercial is represented by the second entry, then its land-use array is *Ai* = (0, 1, 0, … , 0).

The land-use arrays of the yet unlabeled cells are computed from the land-use arrays that were already calculated. Thus, the computation of the land-use array *Aq* for a query cell *xq* is given by

$$A\_q = \frac{\sum\_{i=1}^k w\_q^{(i)} A\_i}{\sum\_{i=1}^k w\_q^{(i)}} \quad q \in \mathcal{Q}, \tag{2}$$

where *k* is the number of neighbors configured for *xq*, and *wq*<sup>(*i*)</sup> is set by Eq. (1).

In the first iteration, the calculation of the land-use arrays is based only on the land-use arrays of the cells in *L*. At the end of the first iteration, the land-use arrays of the cells selected to be added to the training set *G* of the next iteration are set according to Eq. (2), and they are used for the calculation of land-use arrays in the following iterations; the process then repeats itself.

For example, let us examine an hour with four land-use classes. For simplicity, assume that *k* = 2, meaning that for computing the land-use array *Aq*, we consider only the two neighbors closest in the feature space. The two nearest neighbors of the query cell *xq* are *xi* and *xj*. *xi* is labeled as class 2 and *xj* is labeled as class 4; therefore, their land-use arrays are *Ai* = (0, 1, 0, 0) and *Aj* = (0, 0, 0, 1). Their weights are *wq*<sup>(*i*)</sup> = 6 and *wq*<sup>(*j*)</sup> = 2. Notice that the weights indicate that *xi* is closer to *xq* than *xj*. Calculating *Aq*:

$$A\_q = \frac{w\_q^{(i)}A\_i + w\_q^{(j)}A\_j}{w\_q^{(i)} + w\_q^{(j)}} = \frac{6(0,1,0,0) + 2(0,0,0,1)}{6+2} = \left(0, \frac{3}{4}, 0, \frac{1}{4}\right). \tag{3}$$

*Aq* is calculated as the weighted average of the land-use arrays of its feature-space neighbors. For example, the value of the fourth entry in *Aq* (1/4), which represents the fourth land use, is the weighted average of the fourth entries of *Ai* (0) and *Aj* (1): (6·0 + 2·1)/(6 + 2) = 1/4. The weighted-average value 1/4 is closer to *Ai*'s entry (0) than to *Aj*'s (1) because *xq* is closer to *xi*. Notice that Eq. (2) guarantees that the land-use array entries always sum to 1. In the example, the highest entry value is 3/4, and its corresponding land-use class is 2; therefore, it is most reasonable to assign *xq* to class 2. If *xq* is added to *G* at the end of the iteration, then *Aq* will be used to calculate land-use arrays in the next iterations.
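Eq. (2) and the worked example can be checked with a short sketch (the function name is ours):

```python
def land_use_array(weights, arrays):
    """Eq. (2): the weighted average of the neighbors' land-use arrays."""
    total = sum(weights)
    return tuple(sum(w * a[j] for w, a in zip(weights, arrays)) / total
                 for j in range(len(arrays[0])))

# The worked example: weights 6 and 2, arrays (0,1,0,0) and (0,0,0,1).
print(land_use_array([6, 2], [(0, 1, 0, 0), (0, 0, 0, 1)]))
# -> (0.0, 0.75, 0.0, 0.25)
```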

However, we classify *xq* as class 2 only if it has high enough classification confidence; that is, only if we have relatively high confidence that its attribution is correct do we classify it and add it to the training set of the next iteration. The classification confidence of *xq* is estimated by the entry with the maximal value in the land-use array:

$$confidence\_q = \max\left(A\_q\right).\tag{4}$$

In the example, the classification confidence level of *xq* is 3/4 for it being attributed to class 2. Thus, *xq* is a candidate for class 2, and it will be classified as class 2 if the confidence level 3/4 is high enough.

In each iteration, we add 5% of all the cells to the training set for the next iteration. To maintain a proper balance between the labels in the training set over the iterations, we do not blindly add to the training set the top 5% of the samples with the highest classification confidence. Instead, the number of cells added to the training set is proportional to the number of candidates for each land use in that iteration. For example, consider a simple case with only two land-use classes. Let us assume that the number of cells |*O*| = 1000, and therefore the number of cells added to the training set *G* in each iteration is 50 (5% of 1000). If in a specific iteration, 60% of the cells (600 cells) are candidates for class 1 (i.e., in 60% of the cells, the highest entry in the land-use array is entry 1), and the other 40% (400 cells) are candidates for class 2, then accordingly, 60% (30) of the cells added to the training set will be from class 1 and 40% (20) from class 2. The cells with the highest confidence are selected for each class separately. In this example, the 30 cells with the highest values in entry 1 (representing class 1) will be labeled accordingly and added to the training set of the next iteration.
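A possible sketch of this class-proportional promotion rule follows; the names and the rounding of fractional quotas are our own choices, as the chapter does not specify them:

```python
def select_promotions(arrays, budget):
    """Class-proportional promotion (a sketch with our own names; the chapter
    does not specify how fractional per-class quotas are rounded).

    `arrays` maps a cell id to its land-use array; `budget` cells are
    promoted, split across classes in proportion to their candidate counts.
    """
    by_class = {}  # candidate cells grouped by their arg-max class
    for cell, arr in arrays.items():
        cls = max(range(len(arr)), key=lambda j: arr[j])
        by_class.setdefault(cls, []).append((arr[cls], cell))
    selected = {}
    for cls, candidates in by_class.items():
        quota = round(budget * len(candidates) / len(arrays))
        candidates.sort(reverse=True)  # highest confidence first
        for _conf, cell in candidates[:quota]:
            selected[cell] = cls
    return selected
```

With 600 candidates for class 1 and 400 for class 2 out of 1000 cells and a budget of 50, the quotas come out 30 and 20, matching the example above.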

We demonstrate in **Figure 4** the process of land-use classification using SSK with an example. We demonstrate classifying a query cell *xq* in the first iteration (**Figure 4(top)**), and then classifying another query cell *xs* in the second iteration (**Figure 4(bottom)**). The bars in **Figure 4** represent the values of each entry in the land-use arrays. In the example, for simplicity, the neighborhood parameter *k* = 2; that is, the classification is based on the two samples that are closest to the query cell in the feature space. In this example, there are

#### *Ubiquitous and Pervasive Computing - New Trends and Opportunities*

#### **Figure 4.**

*Computing land-use arrays for (top) first and (bottom) second iterations of an example.*

four land-use classes. The computation of the land-use array *Aq* in the first iteration (**Figure 4(top)**) was already demonstrated in the previous examples. We saw that after considering *xq*'s two nearest neighbors *xi* and *xj*, and based on their land-use arrays *Ai* and *Aj*, *Aq* = (0, 3/4, 0, 1/4) and *confidenceq* = 3/4. Let us further assume that this confidence level was high enough, and thus *xq* was labeled as class 2 and added to the training set for the second iteration.

In the second iteration (**Figure 4(bottom)**), there is another query cell *xs*. In the example, *xs*'s two nearest neighbors are *xq* (the cell that was added to the training set in iteration 1) and another cell *xr*. Their land-use arrays are *Aq* = (0, 3/4, 0, 1/4) (as already computed) and *Ar* = (0, 0, 1, 0), and their weights are *ws*<sup>(*q*)</sup> = 4 and *ws*<sup>(*r*)</sup> = 1, respectively. The land-use array *As* of the query cell (Eq. (2)) is:

$$A\_s = \frac{w\_s^{(q)} A\_q + w\_s^{(r)} A\_r}{w\_s^{(q)} + w\_s^{(r)}} = \frac{4\left(0, \frac{3}{4}, 0, \frac{1}{4}\right) + 1(0, 0, 1, 0)}{4 + 1} = \left(0, \frac{3}{5}, \frac{1}{5}, \frac{1}{5}\right). \tag{5}$$

**Figure 4(bottom)** demonstrates that the land-use array *As* of cell *xs* is mainly affected by cell *xq* (belonging to class 2), which was labeled and introduced into the training set only in the previous iteration.

We still need to specify the neighborhood parameter *k*, which determines the number of cells considered in the classification of each query cell. *k* controls the volume of the neighborhood and, consequently, the smoothness of the density estimates; thus, it plays an important role in the performance of the nearest neighbor classifier [53]. Increasing *k* decreases variance and increases bias; conversely, decreasing *k* increases variance and decreases bias [54]. Since the number of labeled cells gradually increases during the self-labeled process, we propose a dynamic *k* that changes through the iterations; its value depends on |*G*|, the number of cells currently in the training set *G*. Through the iterations, *k* grows with the set of cells (samples) available for training. We used a rule of thumb offered by Duda et al. [55], setting the *k* value by:

$$k \approx \sqrt{|G|}. \tag{6}$$

For example, if the number of labeled cells |*G*| in the first iteration is 50, then in the first iteration *k* = √50 ≈ 7.07 ≈ 7, and therefore the seven closest neighbors of each query cell will be considered in the classification. If 50 cells are added to *G* by the next iteration, then |*G*| = 100 and *k* = √100 = 10, so 10 neighbors will be considered.
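The rule of thumb of Eq. (6) is a one-liner (a sketch; the function name is ours):

```python
import math

def dynamic_k(n_labeled):
    """Rule-of-thumb neighborhood size from Eq. (6): k ~ sqrt(|G|)."""
    return round(math.sqrt(n_labeled))

print(dynamic_k(50), dynamic_k(100))  # -> 7 10
```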
