*Data Integrity and Quality*

*Quality of Information within Internet of Things Data DOI: http://dx.doi.org/10.5772/intechopen.95844*

that cannot be correct; that information is condensed in the quality metrics.

**Figure 1** shows the histograms of all the basic metrics that could be computed for the parking dataset, while **Figure 2** shows the histograms of the outlier-based metrics. The parking data geospatial-based metric histograms are shown in **Figure 3**. As was said above, the calculation of these metrics replaces the calculation of the concordance metric, because they provide information about the correlation of the different sensors; in this case, the lower the value of the metric, the better.

| timestamp | Park101 | Park102 | Park103 | Park104 | Park105 |
| --- | --- | --- | --- | --- | --- |
| 11:50:00 | NA | 163.33 | NA | NA | 117.5 |
| 12:00:00 | NA | 10000 | NA | NA | 116.5 |
| 12:10:00 | NA | 163.00 | NA | 10000 | 116.5 |
| 12:20:00 | NA | 165.00 | NA | NA | 118.0 |
| 12:30:00 | NA | 166.00 | NA | NA | 120.0 |
| 12:40:00 | 1 | 166.50 | NA | NA | 119.0 |

**Table 1.**
*Parking observations (number of cars) subset.*

**Figure 1.**
*Parking basic metrics' histograms.*

**Figure 2.**
*Parking outlier-based metrics' histograms.*

**Figure 3.**
*Parking data geospatial-based metric's histogram.*

#### **4.2 Luminosity data**

In this section, the monitored luminosity from four sensors located in the Pleiades building of the University of Murcia is studied.
First, the data are aggregated using the timestamp, as in the previous section, choosing a 10-minute aggregation window. **Table 2** shows the aggregated values and also some of the computed metrics.

| Time | S1 | S2 | S3 | S4 |
| --- | --- | --- | --- | --- |
| 18:00:00 | 20 | 55 | 10 | 80 |
| 18:10:00 | 25 | 70 | 20 | 40 |
| 18:20:00 | NA | 70 | 10 | NA |
| 18:40:00 | 20 | 95 | 10 | 65 |
| 18:50:00 | 30 | 30 | 20 | 60 |
| 19:10:00 | 20 | 75 | 10 | 280 |

**Table 2.**
*Luminosity (lumens) data subset.*
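As an illustrative sketch (not the chapter's actual code), this kind of 10-minute aggregation can be reproduced with pandas. The column names and the choice of the mean as the aggregation function are assumptions here:

```python
# Sketch: group raw luminosity readings into 10-minute windows keyed
# by timestamp, as done to build Table 2.
import pandas as pd

raw = pd.DataFrame(
    {"S1": [20, 25, 25], "S2": [50, 60, 70]},
    index=pd.to_datetime(
        ["2020-01-01 18:00", "2020-01-01 18:04", "2020-01-01 18:12"]
    ),
)

# resample("10min") bins the index into 10-minute intervals;
# mean() aggregates all readings that fall inside each bin.
agg = raw.resample("10min").mean()
print(agg)
```

Observations at 18:00 and 18:04 fall into the same 18:00 bin and are averaged, while the 18:12 reading forms the 18:10 bin on its own.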

**Figure 4** shows the histograms of all the metrics that could be computed for the luminosity dataset, together with basic statistics. The timeliness metric could not be calculated, since there are no signs of any lag in the data's storage. Also, the artificiality metric always takes the value of 1 because the timestamps of the data are far apart. The rest of the metrics are included in **Figure 5**.

Finally, the geospatial luminosity metrics can be seen in **Figure 6**. As in the parking case, these metrics replace the concordance metric.

**Figure 4.**
*Luminosity basic metrics' histograms.*

**Figure 5.**
*Luminosity outlier-based metrics' histograms.*

**Figure 6.**
*Luminosity data geospatial-based metric's histogram.*

#### **4.3 Pollution data**

Given that the only way to calculate concordance on the previous datasets was through spatial interpolation, due to their poor quality, a high-quality dataset has been used here to compare the values this metric takes in normal conditions with the values it takes when some imperfections are added to the data.



| Ozone | Particulate matter | Carbon monoxide | Sulfur dioxide | Nitrogen dioxide |
| --- | --- | --- | --- | --- |
| 0.18 | 0.23 | 1.03 | 1.73 | 1.66 |
| 0.29 | 0.17 | 1.05 | 1.67 | 1.68 |
| 0.31 | 0.21 | 1.03 | 1.77 | 1.70 |
| 0.22 | 0.30 | 0.99 | 1.73 | 1.66 |
| 0.27 | 0.23 | 1.03 | 1.83 | 1.77 |

**Table 3.**
*Pollution data subset.*

As can be seen in **Table 3**, this dataset has five variables that inform on the pollution of the atmosphere every five minutes; the data values are scaled.

Now that the data are given, one way to calculate the concordance metric is to compute the correlation between each value and the previous one, in such a way that if the data are captured properly this value will be very close to 1, while if the data suffer any problem, it will move away from 1. This is shown in **Figure 7**, with the original dataset on the left side and, on the right side, the same dataset after anomalous values have been added at random; as can be seen, the concordance values change significantly.
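A minimal sketch of this lag-1 correlation idea (illustrative only; the chapter's exact formulation and data may differ) shows how injected anomalies pull the value away from 1:

```python
# Sketch: concordance as the Pearson correlation between each value
# and the previous one (lag-1 autocorrelation).
import random

def concordance(series):
    """Pearson correlation between series[1:] and series[:-1]."""
    x, y = series[1:], series[:-1]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

random.seed(0)
# A slowly drifting signal, loosely mimicking the scaled pollution readings.
clean = [1.0 + 0.01 * i + random.gauss(0, 0.005) for i in range(200)]

# Same series with anomalous values added at random positions.
corrupted = clean.copy()
for i in random.sample(range(200), 20):
    corrupted[i] = 10.0

print(concordance(clean))      # close to 1
print(concordance(corrupted))  # noticeably further from 1
```

On the smooth series consecutive values are almost identical, so the correlation is near 1; the random spikes break that relation and the value drops sharply.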


**Figure 7.**
*Pollution dataset concordance comparison.*

#### **4.4 Data without context**

For demonstration purposes, we propose to compute the quality metrics on a dataset whose context, origin and meaning are unknown: a dataset in which we have no knowledge about what the columns represent, how the data were collected, or the timestamps of the observations. In such a scenario, the only basic metric that can be computed is completeness. However, the outlier-based metrics are very useful, since they treat the variables as plain time series without taking into account their physical meaning. **Table 4** shows a subset of the dataset, which presents five unknown variables, with 1200 instances.

| V1 | V2 | V3 | V4 | V5 |
| --- | --- | --- | --- | --- |
| 0.606470 | 0.143913 | 0.348654 | 2.468968 | 0.199896 |
| 0.543575 | 0.073679 | 0.223220 | 0.594037 | 0.223077 |
| 0.543575 | 0.143913 | 0.090365 | 0.594037 | 0.223077 |
| 0.543575 | 0.216444 | 0.090365 | 0.733265 | 0.199896 |
| 0.606470 | 0.796689 | 0.090365 | 0.872493 | 0.176716 |

**Table 4.**
*Data without context subset.*
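Completeness needs no knowledge of what a column means, only whether its entries are present. A hedged sketch (the chapter's exact definition may differ) reads it as the fraction of non-missing values:

```python
# Sketch: completeness as the fraction of non-missing (non-NA) entries,
# the only basic metric computable without knowing what a column means.
import math

def completeness(values):
    """Fraction of entries that are present (not None and not NaN)."""
    present = sum(
        1 for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    )
    return present / len(values)

# A column with 2 missing entries out of 6, in the style of Table 1.
col = [None, 163.33, float("nan"), 165.0, 166.0, 166.5]
print(completeness(col))  # 4/6 ≈ 0.667
```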

**Figure 8** shows the histograms of the outlier-based metrics, while **Table 5** shows the values taken by the metrics for a small data subset.

**Figure 8.**
*Data without context outlier-based metric's histograms (I).*

**Table 5.**
*Data without context subset.*
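The chapter does not spell out the outlier-based metrics' formulas here; as one hedged illustration of the family, a simple metric of this kind could report the fraction of points lying beyond three standard deviations of a variable treated as a plain series:

```python
# Sketch of one possible outlier-based metric: the fraction of values
# whose z-score exceeds a threshold (3 standard deviations by default).
# This is an assumption for illustration, not the chapter's metric.
def outlier_fraction(series, threshold=3.0):
    n = len(series)
    mean = sum(series) / n
    std = (sum((v - mean) ** 2 for v in series) / n) ** 0.5
    return sum(1 for v in series if abs(v - mean) > threshold * std) / n

# 100 ordinary values plus one extreme reading.
data = [0.60, 0.54, 0.54, 0.54, 0.61] * 20 + [50.0]
print(outlier_fraction(data))  # 1 outlier out of 101 points
```

The metric needs no physical interpretation of the variable, which is why it remains usable on the context-free dataset.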

Similarly, the probabilistic and reconstruction metrics can be calculated here, since they do not assume any kind of knowledge of the data. The histogram of both metrics is shown in **Figure 9**.

**Figure 9.**
*Data without context outlier-based metric's histograms (II).*


It should be noted that if the rest of the metrics are calculated on the unaltered dataset, they will take perfect values; that is, they will always indicate a high quality of the dataset.

For this dataset, the rest of the metrics have also been calculated; however, the results have not been included, since the dataset presents a high quality and the results are therefore not of great interest: the histograms of the metrics take the ideal behavior.



*freely available as possible in a timely and responsible manner*<sup>2</sup>. Other initiatives are the EU Open Data Portal<sup>3</sup> at European level or the national-level ones such as Open Data Aarhus<sup>4</sup>. In that sense, the selection of data sources becomes more complicated given the great amount of data that researchers and practitioners have access to. Our system provides an easy, understandable and quick way to make an informed decision when choosing between several data sources based on data quality.

As future work, we are considering several technologies in order to make our metrics available to researchers and businesses. We consider that they have the potential to become a standard for measuring data quality.

**Acknowledgements**

This work has been sponsored by MINECO through the PERSEIDES project (ref. TIN2017-86885-R), by ERDF funds of the project UMU-CAMPUS LIVING LAB (EQC2019-006176-P), and by the European Commission through the H2020 IoTCrawler (contract 779852) and DEMETER (grant agreement 857202) EU projects. It was also co-financed by the European Social Fund (ESF) and the Youth Employment Initiative (YEI) under the Spanish Seneca Foundation (CARM).

<sup>2</sup> https://epsrc.ukri.org/about/standards/researchdata/

<sup>3</sup> https://data.europa.eu/euodp/en/home

<sup>4</sup> www.opendata.dk/city-of-aarhus

