**7. Discussion and conclusions**

Previous work on social land-use mapping has mostly relied on more than one data resource and on complex methodologies for integrating them. Other works assumed substantial prior knowledge about the examined lands and, when applied with relatively little knowledge about the examined city, did not achieve satisfactory accuracy rates [18]. The main contribution of this paper is a method for social land-use mapping when only sparse prior knowledge about the examined city exists, relying on CDR, an inexpensive and available data resource that is routinely gathered by telecom operators.

We introduced SSK, a semi-supervised algorithm that requires a relatively small number of labeled samples and therefore fits the condition of sparse prior knowledge. The heart of SSK is the combination of the KNN classifier with the self-labeled technique, which enlarges the training set iteratively. SSK achieves an accuracy rate of 74.4%, considerably higher than the 54% and 58% reported by Toole et al. [42] and Pei et al. [18], respectively. These works also relied on CDR as their main data resource. However, it is not possible to infer that SSK performs better than their methodologies, because our validation used a very different dataset. Whereas they performed land-use mapping of a whole city (Boston in the work of Toole et al. [42] and Singapore in the work of Pei et al. [18]), we chose areas of relatively homogeneous social function from different cities in Israel. Classification in deliberately chosen areas of more "pure" social function is an easier task. We also compared SSK's performance to that of a random forest (RF) classifier trained on many more labeled places, with 87.5% of the surface labeled (7/8 of the dataset used for training) compared to 5% in SSK. As expected, RF lowered the bias and variance of the classification and achieved a higher accuracy rate than SSK, but given how little prior knowledge SSK uses, the performance gap is mild. When only a small number of labeled samples is available, the effectiveness of conservative supervised classification algorithms, such as RF, deteriorates. Therefore, if obtaining additional land-use labels is out of reach or too expensive, it is better to use SSK.
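The interplay between KNN and self-labeling can be illustrated with a minimal sketch. This is not the SSK implementation: the function names, the vote-share confidence measure, the distance-based tie-breaking, and the toy feature vectors are all assumptions made for illustration. It only shows the general idea of letting a KNN classifier adopt its most confident predictions as new training samples, one batch per iteration.

```python
import math
from collections import Counter


def knn_vote(x, labeled, k):
    """Classify x by majority vote among its k nearest labeled samples.

    Returns (label, confidence, mean_distance); confidence is the share of
    the neighbors that voted for the winning label (an assumed measure).
    """
    nearest = sorted(labeled, key=lambda s: math.dist(x, s[0]))[:k]
    votes = Counter(label for _, label in nearest)
    label, count = votes.most_common(1)[0]
    mean_dist = sum(math.dist(x, p) for p, _ in nearest) / len(nearest)
    return label, count / len(nearest), mean_dist


def self_training_knn(labeled, unlabeled, k=3, batch=1):
    """Grow the training set by repeatedly adopting the most confident predictions."""
    training = list(labeled)
    pending = list(unlabeled)
    assigned = {}
    while pending:
        scored = [(knn_vote(x, training, k), x) for x in pending]
        # Most confident first; ties go to the sample closest to its neighbors.
        scored.sort(key=lambda item: (-item[0][1], item[0][2]))
        for (label, _, _), x in scored[:batch]:
            training.append((x, label))  # the prediction becomes a training sample
            assigned[x] = label
            pending.remove(x)
    return assigned


# Hypothetical 2-D feature vectors; real cells would carry CDR-derived features.
seeds = [((0.0, 0.0), "residential"), ((0.0, 1.0), "residential"),
         ((10.0, 10.0), "commercial"), ((10.0, 9.0), "commercial")]
cells = [(1.0, 0.5), (2.0, 1.5), (9.0, 9.5), (8.0, 8.0)]
result = self_training_knn(seeds, cells, k=3)
```

Because every adopted prediction is treated as ground truth in later iterations, an early mistake can propagate, which is why the quality of the initial labeled cells matters so much.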

SSK relies heavily on a few labeled cells. If the land use in these cells is relatively mixed, the classification can be heavily damaged. Therefore, if cells of relatively "pure" social function cannot be obtained, it is better to consider an unsupervised method. Fortunately, in most cases the ground-truth labeled cells are those that are easiest to assign to a single land use (which is why they are chosen for labeling); thus, they are relatively unmixed. Through the iterative steps, the coverage of classified land grows, but accuracy declines. We therefore offer the option to stop the process before all land is classified. For example, stopping once 80% of the area is classified yields an accuracy rate of 81%, compared with 74.4% when all areas are classified.
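The coverage-accuracy trade-off can be examined with a small helper such as the one below. It is a sketch under assumptions: `accuracy_at_coverage` is a hypothetical function, it presumes that predictions are stored in the order in which the iterative process assigned them, and the toy list is invented for illustration and does not reproduce the results reported here.

```python
def accuracy_at_coverage(ordered_predictions, coverage):
    """Accuracy when the iterative process is stopped at a given coverage.

    ordered_predictions: list of (predicted, true) pairs, in the order in
    which cells were classified (earliest, most confident, first).
    coverage: fraction of cells to keep, in (0, 1].
    """
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    n_keep = round(coverage * len(ordered_predictions))
    kept = ordered_predictions[:n_keep]
    if not kept:
        return None  # nothing classified at this coverage level
    return sum(pred == true for pred, true in kept) / len(kept)


# Invented toy data: later (less confident) assignments are wrong more often.
toy = ([("res", "res")] * 4 + [("com", "com")] * 3 + [("com", "res")]
       + [("lei", "lei")] + [("res", "lei")])

early_stop = accuracy_at_coverage(toy, 0.8)  # first 8 of 10 cells
full_run = accuracy_at_coverage(toy, 1.0)    # all 10 cells
```

Sweeping `coverage` over a range of values gives a curve from which a stopping point can be chosen for a required accuracy level.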

We also introduced a version of SSK that includes neighbor smoothing. We rely on the property that neighboring lands tend to share a social land use, and offer a unique interpretation of KNN: a KNN that considers both the feature-space neighbors, as in regular KNN, and the geographic-space neighbors. We discussed the merits of incorporating smoothing, along with its drawbacks. Smoothing improves the overall accuracy; however, it reduces the chances of discovering a narrow piece of land whose social function differs from that of its surroundings. Therefore, the algorithm exposes a parameter that sets the level of smoothing and thus controls the trade-off between overall accuracy and sensitivity to an exceptional social function. High levels of neighbor smoothing should be most effective in cities that are more "planned"; these cities tend to be more clearly divided into functional parts of homogeneous social function. Validating neighbor smoothing shows that it indeed raises SSK's accuracy rate to 80% at the highest level of smoothing. In our dataset, it also improves the discovery rate of island land uses. This is mainly due to the homogeneity of the social function of the areas we chose to include in this work.
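One way to picture the role of the smoothing parameter is a weighted vote between the two kinds of neighbors. The sketch below is an assumed formulation, not the weighting used by SSK: `smoothed_vote` and its `smoothing` argument are hypothetical names, and the linear blend of vote shares is chosen only because it is the simplest scheme that exhibits the trade-off.

```python
from collections import Counter


def smoothed_vote(feature_labels, geo_labels, smoothing=0.5):
    """Blend votes from feature-space and geographic neighbors.

    feature_labels: labels of the k nearest neighbors in feature space.
    geo_labels: labels of the geographically adjacent cells.
    smoothing: 0.0 ignores geography (regular KNN); 1.0 ignores features.
    """
    if not 0.0 <= smoothing <= 1.0:
        raise ValueError("smoothing must be in [0, 1]")
    scores = Counter()
    for labels, weight in ((feature_labels, 1.0 - smoothing),
                           (geo_labels, smoothing)):
        for label in labels:
            # Each group contributes its vote shares, scaled by its weight.
            scores[label] += weight / len(labels)
    if not scores:
        raise ValueError("at least one neighbor label is required")
    return max(scores, key=scores.get)


# A hypothetical "island": the cell looks like leisure in feature space
# but is surrounded by residential cells.
feature_neighbors = ["leisure", "leisure", "residential"]
geo_neighbors = ["residential"] * 4

no_smoothing = smoothed_vote(feature_neighbors, geo_neighbors, smoothing=0.0)
strong_smoothing = smoothed_vote(feature_neighbors, geo_neighbors, smoothing=0.8)
```

With no smoothing the cell keeps its own label, whereas strong smoothing assigns it the label of its surroundings, which is the loss of sensitivity to island land uses described above.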

SSK is composed of several components, each aiming to tackle some of the difficulties in the problem of mapping social functions (e.g., the lack of labeled samples). In addition, SSK leverages opportunities inherent in the problem:

1. Self-labeled technique – While it might be costly to obtain the number of labeled samples needed for a classic classifier, it is relatively easy to obtain labels for a few locations in a city. Residents can participate in the process of self-labeling their city and thereby contribute to the efforts to make their own city smarter.


In future work, we would like to validate the proposed methodology on a whole city. Because some of the social functions are not well identified, creative solutions will be needed to identify them more consistently. In addition, further research may lead to an enhanced smoothing logic that is more sensitive to island land uses. A limitation of our approach is that cellular communication cannot always capture the differences between some land uses (e.g., when communication is limited in less populated areas), in which case more data resources will be needed. Therefore, it may also be interesting to examine combining this methodology with other data resources, such as POI and remote-sensing imagery.
