$$q_{prob} = \sum_{i=1}^{k} G_i\left(x^{j}\right) \tag{11}$$

where *k* is the number of clusters or distributions used, *G<sub>i</sub>* corresponds to distribution *i*, and *x<sup>j</sup>* is the vector taken by the sensor. Because this metric is probabilistic, it takes values between 0 and 1, in such a way that the closer the value is to 1, the higher the quality of the instance.
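For illustration, below is a minimal sketch of one way eq. (11) could be computed with a Gaussian mixture model. The use of scikit-learn, the toy data, and especially the final rescaling are our assumptions: this excerpt does not spell out how the summed component probabilities are kept within [0, 1].

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))        # toy sensor matrix: instances x features

k = 4                                # number of clusters/distributions
gmm = GaussianMixture(n_components=k, random_state=0).fit(X)

# Mixture density sum_i w_i * N(x^j | mu_i, Sigma_i) for every instance.
density = np.exp(gmm.score_samples(X))

# ASSUMPTION: rescale by the maximum observed density so that q_prob -> 1
# for the best-explained instances and -> 0 for outliers; the chapter's own
# normalization is not given in this excerpt.
q_prob = density / density.max()
```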

Another way to determine whether the data series exhibits anomalous behavior is by using so-called AutoEncoders. An AutoEncoder (AE) is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The objective of these autoencoders is to learn a representation of the data under study, with the aim of eliminating noise; however, it is also possible to use this tool to detect anomalous values. AEs are a specific type of feedforward neural network where the input is the same as the output. They compress the input into a lower-dimensional code and then reconstruct the output from this representation.

The metric based on AE [13] informs us about how the correlations between the different variables of the system behave. Accordingly, the metric *qrec* is based on the difference between the input and the output value of the AE, in such a way that the greater the reconstruction error, the less concordance there will be between the variables [14].

$$q_{rec} = \sum_{i=1}^{N} \left| x'_i - x_i \right| \tag{12}$$

where *x<sub>i</sub>* corresponds to the features of the data taken by the sensors, *N* is the total number of features, and *x′<sub>i</sub>* is the value of the vector of variables reconstructed by the AE. The term ∣*x′<sub>i</sub>* − *x<sub>i</sub>*∣ is sometimes known as the reconstruction error and is represented as *E<sub>rec</sub>*. Since this metric is based on a difference between two values, it can take any real value greater than or equal to 0, with 0 corresponding to the highest quality.
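A minimal sketch of this metric follows, using scikit-learn's MLPRegressor trained to reproduce its own input as a stand-in for a full autoencoder; the architecture, the scaling, and the toy data are our assumptions rather than the chapter's setup.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = MinMaxScaler().fit_transform(rng.normal(size=(1000, 8)))  # instances x features

# A feedforward net trained to output its own input acts as a simple AE:
# the 3-unit hidden layer is the lower-dimensional code.
ae = MLPRegressor(hidden_layer_sizes=(3,), activation="tanh",
                  max_iter=2000, random_state=0).fit(X, X)

X_rec = ae.predict(X)                  # x' : vectors reconstructed by the AE
q_rec = np.abs(X_rec - X).sum(axis=1)  # eq. (12): the larger, the more anomalous
```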

**3.3 Geospatial-based metrics**

Considering sensors' location is also highly relevant for knowledge extraction. In this sense, we also provide two metrics that use interpolation methods for assessing how well a sensor is coordinated and correlated with its peers according to their distance. The models used are Inverse Distance Weighting (IDW) [15] and Bayesian Maximum Entropy (BME) [16]. IDW is a deterministic estimation method in which, assuming that nearby sensors are more similar, a weighted average of the available values at known points is used to estimate unknown data points. BME is a knowledge-based probabilistic modeling framework for spatial and temporal information. It allows various knowledge bases to be used as prior information, and the determination rules for hard (high-precision) and soft (low-precision) data are logically incorporated into the modeling. As before, we calculate the difference between the interpolated and the real measurement, and the average value becomes the metric, named:

• *qinv*\_*mean* and *qinv*\_*med* for IDW.

• *qBME*\_*mean* and *qBME*\_*med* for BME.
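Below is a leave-one-out sketch of how the IDW-based variants might be obtained. The power parameter, the toy layout, and the leave-one-out scheme are our assumptions (the text only says interpolated and real measurements are compared), and BME is omitted because it requires a dedicated modeling library.

```python
import numpy as np

def idw_estimate(coords, values, target, power=2.0):
    """IDW: estimate the value at `target` as a weighted average of the
    known `values`, with weights 1 / distance**power."""
    d = np.linalg.norm(coords - target, axis=1)
    w = 1.0 / np.maximum(d, 1e-12) ** power
    return np.sum(w * values) / np.sum(w)

rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(5, 2))      # 5 sensors, 2-D positions
readings = rng.uniform(0, 1, size=(5, 100))   # 100 time steps per sensor

# Leave-one-out: interpolate each sensor from its peers and compare the
# estimate with the real measurement at every time step.
errors = []
for s in range(len(coords)):
    peers = np.arange(len(coords)) != s
    est = np.array([idw_estimate(coords[peers], readings[peers, t], coords[s])
                    for t in range(readings.shape[1])])
    errors.append(np.abs(est - readings[s]))
errors = np.concatenate(errors)

q_inv_mean, q_inv_med = errors.mean(), np.median(errors)  # the two IDW metrics
```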

**4.1 Parking data**

This data was collected from 5 private parking sensors located in the city of Murcia<sup>1</sup>, Spain.

First, the variables that are useful for our goal had to be chosen: the timestamp and the parking occupation measurements. The data were then aggregated into 10-minute intervals.

This aggregation can generate redundancies in the timestamps, so the results have been averaged. Storing information about this aggregation process will be useful for the Artificiality metric.
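A sketch of how this step might look in pandas follows; the frame layout and column names are hypothetical, as the chapter does not prescribe an implementation.

```python
import pandas as pd

# Hypothetical raw readings with irregular timestamps (column names assumed).
raw = pd.DataFrame(
    {"timestamp": pd.to_datetime(["2021-03-01 10:01", "2021-03-01 10:04",
                                  "2021-03-01 10:27", "2021-03-01 11:02"]),
     "occupation": [12, 14, 13, 9]}
).set_index("timestamp")

# 10-minute buckets: duplicated timestamps inside a bucket are averaged,
# and intervals with no measurement come out as NA.
agg = raw.resample("10min").mean()

# How many raw rows fed each bucket -- the bookkeeping that the
# Artificiality metric can later exploit.
counts = raw.resample("10min")["occupation"].count()
```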

*NA* (not available) instances have been kept due to their importance in obtaining some quality metrics (Completeness). Given that the data is not measured periodically, a lot of missing values are generated at this point. For illustrative purposes, a new variable called real\_time was computed, which adds a random delay to the timestamps, simulating that the data needs some time to be stored. These are some highlights:


$$q_{tim} = 1 - \frac{T_{\text{age}}}{W} \tag{13}$$
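To make this concrete, here is a minimal sketch under stated assumptions: *T*<sub>age</sub> is read as the delay between a measurement's timestamp and its real\_time, *W* as a 30-minute freshness window, and the clipping to [0, 1] is our reading of the metric's range; none of these choices is fixed by the excerpt.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical 10-minute grid; NA occupation rows are kept on purpose,
# since the Completeness metric needs to see them.
idx = pd.date_range("2021-03-01 10:00", periods=6, freq="10min")
agg = pd.DataFrame({"occupation": [12.0, None, 13.0, None, None, 9.0]}, index=idx)

# real_time: timestamp plus a random storage delay (uniform up to 5 minutes;
# the text only says "random", so the distribution is our assumption).
agg["real_time"] = idx + pd.to_timedelta(rng.uniform(0, 300, size=len(idx)), unit="s")

# Timeliness, eq. (13): q_tim = 1 - T_age / W, clipped into [0, 1].
W = pd.Timedelta(minutes=30)                 # assumed freshness window
t_age = agg["real_time"] - agg.index
agg["q_tim"] = (1 - t_age / W).clip(0.0, 1.0)
```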


A subset of the quality metrics and data values is shown in **Table 1**, where Park101, …, Park105 are the parkings' IDs. As we can see, there are many instances that cannot be correct; that information is condensed in the quality metrics. **Figure 1** shows the histograms of all the basic metrics that could be computed for the parking dataset, while **Figure 2** shows the histograms of the outlier-based metrics.

The histograms of the geospatial-based metrics for the parking data are shown in **Figure 3**. As stated above, the calculation of these metrics replaces that of the concordance metric, because they provide information about the correlation between the different sensors; in this case, the lower the value of the metric, the better.

<sup>1</sup> Their locations are stored at the following web address: http://mapamurcia.inf.um.es/
