**3. Data quality metrics**

In this section, we describe the metrics that have been defined to calculate and annotate the QoI of IoT data. These were previously described in [8].

#### **3.1 QoI basic metrics**

The first set of metrics is based on a descriptive analysis. This approach was also used in the IoTCrawler framework [9], which proposes to integrate quality measures and analysis modules that rate data sources in order to identify the best-fitting sources for the needed information. The first step before implementing quality analysis modules is to identify quality measures that can be used to rate data sources and the delivered/produced data for their Quality of Information. To measure the QoI, we propose the so-called QoI Vector, defined in Eq. (1), which gathers the information belonging to all the metrics proposed in this framework:

$$\vec{Q} = \left\langle q\_{cmp}, q\_{tim}, q\_{pla}, q\_{art}, q\_{con} \right\rangle \tag{1}$$

The elements of the vector are defined as follows:

• Completeness (*qcmp*): it represents the percentage of missing or unusable data.

*Data Integrity and Quality*

$$q\_{cmp} = 1 - \frac{M\_{miss}}{M\_{exp}}\tag{2}$$


*Quality of Information within Internet of Things Data DOI: http://dx.doi.org/10.5772/intechopen.95844*

where *Mmiss* is the sum of missing values and *Mexp* is the sum of expected values of an incoming dataset.

• Timeliness (*qtim*): refers to the expected time of accessibility and availability of information. In other words, it represents the time elapsed between the real-world event happening and its data being captured. It is crucial in critical IoT applications such as traffic safety. Its definition is:

$$q\_{tim} = 1 - \frac{T\_{\text{age}}}{W} \tag{3}$$

where *Tage* is the difference between the actual time taken by the sensor and the expected time (*Treal* − *Texp*), and *W* is the proper time of the system, which is chosen arbitrarily.

• Plausibility (*qpla*): shows whether the received data is coherent according to the probabilistic knowledge of the variables being measured. Sensor annotations or metadata are used to determine the expected value range of an incoming measurement.

$$q\_{pla} = \prod P\_{Annotations}(\nu) \tag{4}$$

The range of Plausibility value is defined between 0 and 1.
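For illustration, the three metrics defined so far can be sketched in Python. The sample readings, window *W* and annotation range below are hypothetical, and plausibility is reduced to a single min/max annotation rather than a product over all annotations:

```python
import math

def q_completeness(values):
    """Eq. (2): 1 - M_miss / M_exp, where None/NaN marks a missing reading."""
    missing = sum(1 for v in values
                  if v is None or (isinstance(v, float) and math.isnan(v)))
    return 1 - missing / len(values)

def q_timeliness(t_real, t_exp, window):
    """Eq. (3): 1 - T_age / W, with T_age = t_real - t_exp (seconds)."""
    return 1 - (t_real - t_exp) / window

def q_plausibility(value, low, high):
    """Eq. (4), simplified to one annotation: 1 if the value lies in the
    expected range derived from the sensor's metadata, else 0."""
    return 1.0 if low <= value <= high else 0.0

readings = [21.3, None, 22.1, 21.8]   # one missing sample
print(q_completeness(readings))        # 0.75
print(q_timeliness(630, 600, 600))     # 0.95
print(q_plausibility(22.1, -10, 50))   # 1.0
```
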


• Artificiality (*qart*): this metric determines the inverse degree of the sensor fusion techniques used and defines whether a value is a direct measurement of a singular sensor, an aggregated sensor value of multiple sources or an artificially interpolated value.

• Concordance (*qcon*): describes the agreement between the information of the data source and the information of other independent data sources, which report correlating effects. The Concordance analysis takes any given sensor *x0* and computes the individual concordances, *c*(*x0*, *xi*), with a finite set of *n* sensors (*i* = 1, …, *n*).

$$q\_{con}(\mathbf{x}\_0) = \sum\_{i=1}^n \lambda\_i(\mathbf{x}\_0) c(\mathbf{x}\_0, \mathbf{x}\_i) \tag{5}$$

with *λ* as a weight function

$$\lambda\_i(\mathbf{x}\_0) = \frac{1}{d(\mathbf{x}\_0, \mathbf{x}\_i)} \tag{6}$$

and *d*(*xa*, *xb*) a propagation- and infrastructure-based distance function between the locations *xa* and *xb* of sensors *a* and *b*.

All the metrics presented in this section take values between 0 and 1, with 1 being the ideal case, in which the quality of the data is maximum, and 0 the opposite case.

These metrics are the simplest ones that can be calculated in this kind of IoT scenario. However, it is possible to go further and compute metrics that give a deeper knowledge of the IoT system.

#### **3.2 Outlier-based metrics using heuristics**

Since these metrics provide only basic information, it is possible to go further and obtain a series of additional, useful metrics. These new metrics come from Machine Learning (ML), in this case from the search for outliers.

In machine learning, an outlier is an observation that diverges from the overall pattern. The number of outliers is an indicator of data quality.

In the literature, four basic types of outliers are usually considered for time series: additive outliers, level shifts, temporary changes and innovational outliers; see [10, 11] for a complete description.

A metric similar to *qcmp* can be defined by taking into account the values that are considered outliers instead of the missing ones. The percentage of outliers in the studied sensor is named *qout* (see Eq. (7)). To determine which values are considered outliers, an Autoregressive Integrated Moving Average (ARIMA) based framework is useful [12]. It can also determine whether an outlier is innovational, additive, a level shift, a temporary change or a seasonal level shift.

$$q\_{out} = 1 - \frac{M\_{out}}{M\_{total}}\tag{7}$$

where *Mout* is the number of outlier values across the features of the sensor and *Mtotal* is the total number of values.
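The chapter relies on an ARIMA-based framework [12] for flagging outliers; as a minimal stand-in, the sketch below flags them with a simple z-score rule (an assumption, not the chapter's method) and then applies Eq. (7):

```python
import statistics

def detect_outliers_zscore(values, threshold=3.0):
    """Stand-in detector: flag points more than `threshold` standard
    deviations from the mean (the chapter uses ARIMA instead)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [False] * len(values)
    return [abs(v - mean) / std > threshold for v in values]

def q_out(values, is_outlier):
    """Eq. (7): 1 - M_out / M_total."""
    return 1 - sum(is_outlier) / len(values)

series = [10, 11, 10, 12, 11, 95, 10, 11, 12, 10]  # one obvious spike
mask = detect_outliers_zscore(series, threshold=2.0)
print(q_out(series, mask))  # → 0.9
```
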

As important as determining whether an instance is an outlier is knowing how much it deviates from the expected value corresponding to the normal behavior of the time series. For that purpose, the values of the time series that are considered outliers are imputed as if they were missing, in order to estimate what the expected value would be. The difference between each outlier and its imputed value then yields another family of metrics, obtained by taking the mean, median or mode of these differences (*qmean*, *qmedian*, *qmode*).

$$q\_{mean} = mean(|\hat{\mathbf{x}}\_i - \mathbf{x}\_i|) \tag{8}$$

$$q\_{median} = median(|\hat{\mathbf{x}}\_i - \mathbf{x}\_i|) \tag{9}$$

$$q\_{mode} = mode(|\hat{\mathbf{x}}\_i - \mathbf{x}\_i|) \tag{10}$$

where *i* corresponds to the indices of the features that present anomalous behavior, while *x̂i* and *xi* represent the imputed value that follows the expected behavior and the value of the outlier, respectively. This metric takes values between 0 and 1, with 1 being the ideal case.
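Eqs. (8)–(10) can be sketched as follows. The chapter does not fix an imputation method, so linear interpolation between the nearest non-flagged neighbours is used here purely as an assumed stand-in:

```python
import statistics

def impute_linear(values, is_outlier):
    """Replace flagged points by linear interpolation between their nearest
    non-flagged neighbours (an assumed imputation method; the chapter
    imputes outliers as if they were missing values)."""
    clean = list(values)
    n = len(values)
    for i, flagged in enumerate(is_outlier):
        if not flagged:
            continue
        left = next((j for j in range(i - 1, -1, -1) if not is_outlier[j]), None)
        right = next((j for j in range(i + 1, n) if not is_outlier[j]), None)
        if left is not None and right is not None:
            w = (i - left) / (right - left)
            clean[i] = values[left] * (1 - w) + values[right] * w
        elif left is not None:
            clean[i] = values[left]
        elif right is not None:
            clean[i] = values[right]
    return clean

def deviation_metrics(values, is_outlier):
    """Eqs. (8)-(10): mean / median / mode of |imputed - observed|
    over the outlier indices."""
    imputed = impute_linear(values, is_outlier)
    diffs = [abs(imputed[i] - values[i]) for i, f in enumerate(is_outlier) if f]
    return statistics.mean(diffs), statistics.median(diffs), statistics.mode(diffs)

series = [10, 11, 95, 12, 11]
mask = [False, False, True, False, False]
print(deviation_metrics(series, mask))  # → (83.5, 83.5, 83.5)
```
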

Unsupervised methods are also adequate for outlier detection, so we propose *qprob*. This metric corresponds to the probability of belonging to a certain cluster, computed using Gaussian Mixture Models (GMM), which represent the data points as faithfully as possible through a sum of Gaussian distributions. It quantitatively reports the anomalous values. The number of clusters or Gaussian distributions is a hyperparameter and can be chosen in different ways; in the experiments we used the silhouette coefficient.

$$q\_{prob} = \sum\_{i=1}^{k} G\_i(\mathbf{x}\_j) \tag{11}$$

where *k* is the number of clusters or distributions used, *Gi* corresponds to distribution *i* and *xj* is the vector of measurements taken by the sensor. Because this metric is probabilistic, it takes values between 0 and 1, in such a way that the closer the value is to 1, the higher the quality of the instance.

Another way to determine whether the data series exhibits anomalous behavior is by using so-called AutoEncoders (AE). An AutoEncoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The objective of an autoencoder is to learn a representation of the data under study, usually with the aim of eliminating noise; however, it is also possible to use this tool to detect anomalous values. AEs are a specific type of feedforward neural network where the input is the same as the output: they compress the input into a lower-dimensional code and then reconstruct the output from this representation.

The metric based on AE [13] informs us about how the correlations between the different variables of the system behave. Given that, the metric *qrec* is based on the difference between the input and the output value of the AE, in such a way that the greater the reconstruction error, the less concordance there will be between the variables [14].

$$q\_{rec} = \sum\_{i=1}^{N} |\mathbf{x}\_i^\prime - \mathbf{x}\_i| \tag{12}$$

where *xi* corresponds to the features of the data taken by the sensors, *N* is the total number of features and *x′i* is the value of the vector of variables reconstructed by the AE. The term ∣*x′i* − *xi*∣ is sometimes known as the reconstruction error and represented as *Erec*. Since this metric is based on a difference between two values, it can take any real value greater than or equal to 0, in such a way that 0 is the value with the highest quality.

#### **3.3 Geospatial-based metrics**

Considering the sensors' location is also highly relevant for knowledge extraction. In this sense, we also provide two metrics that use interpolation methods for assessing how well a sensor is coordinated and correlated with its peers according to their distance. The models used are Inverse Distance Weighting (IDW) [15] and Bayesian Maximum Entropy (BME) [16]. IDW is a deterministic estimation method in which, assuming that nearby sensors are more similar, a weighted average of the available values at known points is used to estimate unknown data points. BME is a knowledge-based probabilistic modeling framework for spatial and temporal information. It allows various knowledge bases to be used as prior information, and the determination rules for hard (high-precision) and soft (low-precision) data are logically incorporated into the modeling. As before, we calculate the difference between the interpolated and the real measure, and the average value becomes the metric.

**4. Examples of implementation**

In this section, 3 different IoT scenarios are introduced, in which the previous metrics are computed and the possible drawbacks are highlighted.

**4.1 Parking data**

This data was collected from 5 private parking sensors located in the city of Murcia<sup>1</sup>, Spain.

<sup>1</sup> Their locations are stored in the following web address http://mapamurcia.inf.um.es/

First, the variables that are useful for our goal had to be chosen: the timestamp and the parking occupation measurements; the data was then aggregated in 10-minute intervals. This aggregation can generate redundancies in the timestamps, so the result has been averaged. Storing information about this aggregation process will be useful for the Artificiality metric.

*NA* (not available) instances have been kept due to their importance in obtaining some quality metrics (Completeness). Given that the data is not measured periodically, a lot of missing values are generated at this point. For illustrative purposes, a new variable called real\_time was computed, which adds a random delay to the timestamps, simulating that the data needs some time to be stored.

These are some highlights:

• Completeness: it consists of counting, instance by instance, the percentage of non-absent values.

• Timeliness: the random time lag included in the data (*Tage*) is used, so when it is divided by the arbitrary aggregation time *W* (600 seconds, in this case) it shows the time that the data takes to be available, as follows:

$$q\_{tim} = 1 - \frac{T\_{\text{age}}}{W} \tag{13}$$

• Plausibility: if the data of each parking lot belongs to the interval [0, *Ci*], the measure is said to be plausible and receives a value of 1. The values of *Ci* are: 330, 312, 305, 162 and 220, respectively.

• Artificiality: due to the aggregation over time, the number of instances used for computing the mean, and therefore the aggregated value, was considered. Thus, if a data point was obtained by means of two data points taken in the same time frame, its artificiality metric will be 1/2.

• Concordance: the geostatistical metrics have been used to cover this concept.

• Outliers: given the amount of missing data, the ARIMA framework could not be used for detecting outliers in this dataset.

A subset of the quality metrics and data values is shown in **Table 1**, where Park101, …, Park105 are the parkings' ids. As we can see, there are many instances
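The aggregation-and-annotation pipeline described for the parking data of Section 4.1 can be roughly sketched as follows. The event layout, slot arithmetic and the simulated storage delay (mimicking the real\_time variable) are hypothetical, not the chapter's actual implementation:

```python
import random

SLOT = 600  # 10-minute aggregation window, in seconds

def aggregate(events, horizon):
    """Average raw (timestamp, occupation) events into fixed 10-minute slots;
    slots with no event become None (NA), which Completeness later counts."""
    slots = {}
    for ts, occ in events:
        slots.setdefault(ts // SLOT, []).append(occ)
    out = []
    for k in range(horizon // SLOT):
        v = slots.get(k)
        out.append((sum(v) / len(v), len(v)) if v else (None, 0))
    return out

def annotate(aggregated, capacity, rng):
    """Attach per-instance quality metrics; the random delay simulates the
    time the data needs to be stored, as in the text."""
    rows = []
    for value, n_points in aggregated:
        delay = rng.uniform(0, SLOT)  # simulated storage delay (T_age)
        rows.append({
            "value": value,
            "q_tim": 1 - delay / SLOT,  # Eq. (13)
            "q_pla": 1.0 if value is not None and 0 <= value <= capacity else 0.0,
            "q_art": 1 / n_points if n_points else None,  # 1/2 for two fused points
        })
    return rows

events = [(30, 120), (400, 130), (700, 128)]  # two events in slot 0, one in slot 1
rows = annotate(aggregate(events, 1800), capacity=330, rng=random.Random(0))
print([r["q_art"] for r in rows])  # → [0.5, 1.0, None]
```
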

