
#### **Chapter 2**

## Data Quality Measurement Based on Domain-Specific Information

*Yury Chernov*

#### **Abstract**

Over the past decades, the topic of data quality has become extremely important in various application fields. Originally developed for data warehouses, it received a strong push with the big data concept and artificial intelligence systems. In the present chapter, we look at traditional data quality dimensions, which are mainly of a technical nature. However, we concentrate mostly on the idea of defining a single data quality determinant, which does not substitute the dimensions but allows us to look at data quality from the point of view of users and particular applications. We consider this approach, which is known as a fit-for-use indicator, in two domains. The first one is the test data for complicated multi-component software systems, using the example of a stock exchange. The second domain is scientific research, using the example of the validation of handwriting psychology. We demonstrate how the fit-for-use determinant of data quality can be defined and formalized and what benefit it can bring to the improvement of data quality.

**Keywords:** data quality, quality metrics, fit-for-use determinant, data warehouse, formalization, software application, test data, stock exchange, reference data, validation, handwriting psychology

#### **1. Introduction**

Over the past decades, data quality has become more and more relevant and important, both in science and in practice. The topic is well known, and multiple researchers and data practitioners have been intensively investigating different aspects of data quality; there are numerous publications and projects. Especially data-driven design, data-driven management, and the "big data" concept are drawing attention to data quality issues. High quality is a prerequisite for success. The growth of business intelligence (BI) tools like Tableau or Power BI, market intelligence tools like NetBase Quid or Crunchbase Pro, A/B testing tools like Optimizely or HubSpot, etc. reflects the trend that decisions are becoming more and more data-driven. Naturally, the requirements for data quality are permanently growing. Today, data quality is defined not only as a struggle against duplicates, outliers, missing data, corrupted text, or typos. It is a much more complicated concept.

However, the definition of data quality itself is not easy and is always a bit ambiguous. It depends on the viewpoint, aim, and context. Traditionally, data analysts define a set of data quality dimensions. By now, there are many dozens of them. Their concepts often overlap and repeat themselves under different terms. These dimensions reflect different aspects of data quality. Most of them are well formalized and can be quantified. Quantifying is essential for planning measures to improve data quality in different contexts.

The most popular context is naturally the data warehouse. Therefore, many dimensions appeared in exactly this context. However, many other fields are no less influenced by data quality. The idea of developing a single quantitative indicator, which is perhaps more abstract, has been known for a long time. It seems reasonable and practical. Such a determinant is strongly domain-specific. It reflects the specifics of the particular system, and it can serve as a good instrument for comparing datasets.

#### **2. Data quality dimensions**

The standard approach to the definition of data quality in terms of different aspects, which are traditionally termed dimensions, is reflected in numerous publications. Data quality dimensions were captured hierarchically in the frequently cited study [1]. The study was based on a customer survey that treated data as a product. The authors defined four categories, which are still reasonable, although the study was done about 30 years ago. These categories formed the first level of the hierarchy. The data quality dimensions built the second level. Below we follow this approach and preserve the categories. However, the dimensions themselves have been modified, reflecting the development of data science and our perception of the topic. It is difficult to identify the actual source of a particular dimension. Many authors speak about the same dimensions, often naming them differently. The review below is based mainly on several publications [2–8] that were analyzed explicitly. However, numerous additional publications, which the author studied, read, or skimmed, influenced this list as well.

#### **2.1 Intrinsic category**

#### *2.1.1 Correctness*

Data correctness or accuracy refers to the degree to which data represents real objects. In many cases, to evaluate correctness, the data are compared to some reference source. There can be different reference sources: for instance, a natural restriction on the data value (an age must be between 0 and 120 years), a certain rule (the sum of percentage values should be 100), a related database record, a calculated value, etc. When we try to formalize quality, correctness can be defined as a measure of the proximity of a data value to a reference value, which is considered correct. If a reference source is available, an automated process of correctness evaluation can be established.
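
As a hedged illustration, the sketch below shows two simple automated correctness checks of the kind described above; the values, rules, and tolerance are invented examples, not part of the original text.

```python
# Illustrative sketch only: two simple automated correctness checks.
# The field values, rules, and tolerance below are hypothetical examples.

# Check against a natural restriction on the value (age between 0 and 120 years).
ages = [34, 27, 250, 41, -3]
age_correctness = sum(1 for a in ages if 0 <= a <= 120) / len(ages)
print(age_correctness)  # 0.6 -> 60% of the age values pass the rule

# Check proximity to a reference value (e.g. a recalculated or related record).
def proximity_correct(value, reference, tolerance=0.01):
    """True when the value lies within the tolerance of the reference value."""
    return abs(value - reference) <= tolerance

print(proximity_correct(99.99, 100.0))  # True: sum of percentages is close enough to 100
```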

#### *2.1.2 Validity*

Validity refers to the degree to which data values comply with rules defined for the system. These can be external rules, for instance, regulations in the finance area, or internal system rules. Validity is associated with the correctness, completeness, and consistency of the data. However, data values can be valid but not accurate, or they can be valid but not complete. Examples of non-valid data entities are a birth date that is not in the range of valid dates, or a city that is not in the list of cities.

#### *2.1.3 Uniqueness/deduplication*

Uniqueness means that no duplicated or redundant information overlaps across all the datasets of the system. It means that entities modeled in the system are captured and represented only once within the proper component or database segment. Uniqueness ensures that no entity exists more than once within the data. It possesses a unique key within the data set. For instance, in a master product table or person table, each product or person appears once, and this entity is assigned a unique identifier. This identifier represents the product or person across the whole system. If additional instances of this product or person are created in different parts of the system, they preserve the unique identifier.

Uniqueness can be monitored statically by periodic duplicate analyses of the data or dynamically when capturing new entities. Periodic checks of data consistency are a typical task in every data warehouse. Dynamic verifications are often built into a database as triggers and restrictions on fields. If there is a combination of numerous databases, files, and other data collection facilities, then special procedures must be developed. When problems are detected, analysts use data cleansing and deduplication procedures to address the issue. Formal uniqueness is rather easy to ensure. More complicated are the cases when the same data are named and defined differently – formal procedures can hardly help here. Artificial intelligence methods of data analysis could be useful to identify logical duplication or overlapping.
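
A minimal sketch of a formal, key-based duplicate analysis is shown below; the table and column names are illustrative assumptions.

```python
# Minimal sketch of a periodic duplicate analysis; the column names
# ("person_id", "name", "email") are hypothetical.
import pandas as pd

persons = pd.DataFrame({
    "person_id": [1, 2, 2, 3],
    "name": ["Anna Meier", "Urs Keller", "Urs Keller", "Lea Frei"],
    "email": ["anna@example.com", "urs@example.com", "urs@example.com", "lea@example.com"],
})

# Formal uniqueness: the key must not repeat.
key_duplicates = persons[persons.duplicated(subset=["person_id"], keep=False)]

# A simple uniqueness score: share of rows with a non-repeated key.
uniqueness = 1 - persons.duplicated(subset=["person_id"]).sum() / len(persons)
print(uniqueness)          # 0.75
print(key_duplicates)      # the two rows sharing person_id 2
```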

#### *2.1.4 Integrity (referential integrity)*

When we assign unique identifiers to different objects (customers, products, etc.) within our system, we simplify the management of the data. At the same time, this automatically introduces the requirement that this object identifier be used as a foreign key within the whole data set. This is referred to as referential integrity. Rules associated with referential integrity are constraints against duplication and inconsistency.

#### *2.1.5 Reliability/consistency*

Data reliability refers to two aspects. The first aspect relates to the functioning of different data sources in the system. It should be ensured that, regardless of which source collects the particular data or where it resides, this data cannot contradict a value that resides in a different source or is collected by a different component of the system. The second aspect relates to the closeness of the initial data value to subsequent data values.

#### *2.1.6 Data decay*

This is a measure of the rate of negative change to data. Old values taken from different sources become outdated with time. A source can be decommissioned and a new one not yet in place. For instance, biodata, mobile numbers, and emails of persons may no longer be valid.

#### *2.1.7 Objectivity*

It reflects the extent to which information is unbiased, unprejudiced, and impartial.

#### *2.1.8 Reputation*

It means the extent to which the information is highly regarded by users in terms of its source and/or content.

#### **2.2 Contextual category**

#### *2.2.1 Completeness*

The dimension means that certain attributes should be assigned values. Completeness rules are based on the following three constraint levels:


Completeness or incompleteness can be measured through the amount of data that lacks values. The decisive factor is the extent to which the system can perform its tasks with an incomplete data set.
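
A minimal sketch of such a measurement is shown below; the record structure and attribute names are illustrative assumptions.

```python
# Sketch: completeness as the share of required attribute values that are present.
# Column names are illustrative; None marks a missing value.
records = [
    {"id": 1, "name": "Anna", "birth_date": "1990-02-01"},
    {"id": 2, "name": None,   "birth_date": "1985-07-12"},
    {"id": 3, "name": "Lea",  "birth_date": None},
]

required = ["id", "name", "birth_date"]
total = len(records) * len(required)
filled = sum(1 for rec in records for attr in required if rec.get(attr) is not None)
print(filled / total)  # 0.777... -> roughly 78% complete
```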

#### *2.2.2 Data coverage*

Data coverage actually reflects the second aspect of completeness, namely the completeness of records. It is the degree to which all required records in the dataset are present. Sometimes data coverage is understood as a measure of availability and comprehensiveness of data compared to the "total data universe." However, this is not practical and could hardly be quantified.

#### *2.2.3 Amount of data*

It reflects the extent to which the volume or quantity of available data is appropriate for the tasks.

#### *2.2.4 Effectiveness or usefulness*

It reflects the capability of the data set to enable users to achieve specified goals or fulfill specified tasks with the accuracy and completeness required in the context of use. Sometimes this dimension is called the relevancy or reasonability of data.


#### *2.2.5 Efficiency*

Efficiency reflects the extent to which data can quickly meet the needs of users.

#### *2.2.6 Timeliness (currency)*

Timeliness has two aspects. As data currency, it refers to the degree to which data is up-to-date and to the extent to which data are correct despite possible time-related changes.

#### *2.2.7 Timeliness (availability)*

This second aspect of timeliness refers to the extent to which data are available in the expected time frame. It can be measured as the time difference between when information is expected and when it is available.

#### *2.2.8 Credibility*

It reflects the degree to which data values are regarded as true and believable by users and data consumers.

#### *2.2.9 Ease of manipulation*

The dimension reflects the extent to which data are easy to manipulate and apply to different formats.

#### *2.2.10 Maintainability*

Maintainability is the measure of the degree to which data can be easily updated, maintained, and managed.

#### **2.3 Representational category**

#### *2.3.1 Interpretability*

The degree to which data are presented in an appropriate language, symbols, and units of measure.

#### *2.3.2 Consistency*

Consistency reflects the plausibility of data values. That is the extent to which data is presented in the same format within a record, a data file, or a database and that semantic rules are preserved all over the system. Consistency is practically the measure of the equivalence of information in various data stores and applications.

#### *2.3.3 Conciseness*

This dimension reflects how compact the information is, that is, the extent to which it is compactly represented without losing completeness.

#### *2.3.4 Conformance/alignment*

This dimension refers to whether data are stored and presented in a format that is consistent with the domain values.

#### *2.3.5 Usability*

This dimension is rather generic. It reflects the extent to which information is clear and easily used. It includes as well understandability, that is, the degree to which data have attributes that enable them to be read and interpreted by users.

#### **2.4 Access category**

#### *2.4.1 Availability/accessibility*

The dimension reflects the ease, with which data can be consulted or retrieved by users or programs.

#### *2.4.2 Confidentiality*

The degree to which disclosure of data should be restricted to authorized users. Relates to the security dimension.

#### *2.4.3 Security*

The dimension reflects the degree to which access to information is appropriately restricted.

#### *2.4.4 Traceability*

It reflects the extent to which data lineage is available, that is, the possibility to identify the source of the data and the transformations it has passed through.

#### **3. The fit-for-use and domain-specific data quality determinant**

Traditional dimensions of data quality are good, since they reflect different aspects of data and are rather formal, that is, they can in most cases be evaluated automatically. However, first, they are often derived from the data warehouse concept [9, 10] and are not always suitable in a different context. Secondly, they are good for homogeneous software systems, where they can be applied rather easily. However, they cannot be directly used for distributed heterogeneous systems, which is often the case, or for special applications, such as scientific research. We present both kinds of examples in the following text.

That is why, already long ago, researchers started speaking about a generalized fit-for-use data quality determinant [11, 12], which is close to the view of data users. That was summarized in [2]: "In general, data can be considered of high quality if the data is fit to serve a purpose in a given context." A data user can be a person, a group of people, an organization, or a software system. We consider this indicator the most important in many practical cases. Often it outweighs even formal nonconformities of a product identified by quality management.


Fit-for-use is a rather subjective concept. However, in the data quality context, we can provide the required formalism to make it quantitative. To enable that, we need to define a good metric. Such a metric cannot be universal—it is always context-dependent or domain-specific. However, the requirements for a data quality metric can be generic. Every good metric should satisfy them. In the next section, we look at such requirements.

#### **3.1 Requirements for data quality metrics**

In [13], the authors formulated a set of requirements that are appropriate for domain-specific data quality metrics. It includes five basic requirements: normalization, cardinality, adaptability, scalability, and interpretability.

**Normalization** should be adequate to assure that results can be interpreted and compared. That means the metric determinants should be on the same scale, which is preferably a relative one. That is important since we use data quality metrics to compare different data sets to each other and select an optimal one (our application case on testing data), to understand the trend of changes in time, or to evaluate the fitness of data for the deduced results of a scientific study (our application case on validation of handwriting analysis).

**Cardinality** in our context means that the metric should be highly differentiated, that is, it should ensure many possible values and not restrict itself to a rough evaluation. The sensitivity of the metrics should be good enough to capture even small differences.

**Adaptability** means that the metric must be easily adapted to a particular application. It should be tied to business-oriented goals. That requirement is actually the basis for the fit-for-use data quality determinant.

**Scalability** means that it should be possible to measure the whole system as well as its components or sub-systems. It can concern, for instance, different layers of data.

**Interpretability** means that the metric should be clear and simple. That means that a user understands the metric and that it is comprehensible and meaningful. In particular, simple metrics can be easily formalized and possibly derived automatically from the system.

These requirements are rather technical ones. Many of the data quality dimensions mentioned above do satisfy them. However, sometimes, and especially with fit-for-use determinants, some compromises are necessary. Metrics are needed for quantifying data quality in order to answer questions regarding the data sets and to work out measures to improve data quality in particular domains.

### **4. Application case 1. Reference data of the end-to-end test system of the stock exchange**

The current application case is based on the author's experience at the Swiss Stock Exchange [14]. However, the model is rather generic and it could be valid for any other financial stock exchange or other applications.

#### **4.1 Reference data**

The application case covers the quality of the reference data for the test system of the Swiss Stock Exchange. The requirements for the correctness and reliability of the system are extremely high. That is why testing plays a crucial role during the delivery of new functions into production. Controlled testing is carried out at least at four levels: component, integration, system, and end-to-end testing (unit testing is done by developers before they officially release the code). By saying "controlled" I mean that the software is delivered and built in a code control system, has an official version number, and is installed in a controlled testing environment either automatically (DevOps) or by environment supporting staff, that is, not by the developers themselves.

Test data fully reflects the production system and consists of three parts:


The reference data is very important. It defines the quality of the whole testing. In the current application case, we are speaking about end-to-end testing, which is business- rather than technically oriented. Therefore, the major users and customers of the test system are, correspondingly, end-to-end testers and business experts.

The test system reflects almost fully the real production configuration. It consists of two dozen components, which are distributed among many application servers (Linux and Windows) and multiple databases (Oracle, MS SQL, Postgres, MySQL, SQLite, and 4D). All components can be divided into three categories:



The reference data in the production system is maintained partly by the system customers (banks and other trading organizations) and partly by the market supporting staff of the stock exchange. The data changes are maintained and distributed on a daily basis—the reference data maintenance is not a real-time functionality. The data is entered using different tools and interfaces and then transferred into a central data repository, from which it is distributed among all relevant system components. In the components, the reference data can be enriched to enable more tests.

Test reference data must ensure complete test coverage. To do that, the testware is maintained in Jira and consists of the test requirements that reflect the system functional requirements, test specifications (or test plans), and test cases, which are the elements of test plans and test executions—the results of the testing for a specific project, a test cycle or a sprint. A big portion of the test cases or some steps of them are automated. Several examples of test cases regarding reference data are:


Test cases are the major objects that are relevant for the data quality evaluation. The testware includes both the new functionality and the regression testing, which should ensure that the current functions are not affected by new versions of the software. Test data must, first, cover all existing business requirements and, secondly, additional functions that are technically possible but not yet used in production. They can in principle be activated later and, therefore, must be checked as well. Additionally, the data must support so-called negative test cases to test the reaction of the system to wrongly entered data.

Reference data must be tested since, first, it is provided to the end customers together with the trading data and, secondly, it is the basis for the generation of the trading data (orders, trades, transaction reports, indices, etc.).

The data quality evaluation assumes that the test cases are designed properly, that is, they cover all needed functions, business-relevant cases, and configurations. A test case may reflect in this context either a business case or a certain business configuration.

#### **4.2 Data quality determinant for reference test data**

Assuming that the test case design is appropriate, we can define the following usability quality metric:

$$q = \frac{\sum_{i=1}^{n} c_i b_i}{\sum_{i=1}^{n} c_i} \tag{1}$$

where $q$ is the data quality determinant; $n$ is the number of test cases; $c_i$ is the weight of the i-th test case; and $b_i$ is the indicator of test case coverage by the test data (0 or 1).
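
A minimal sketch of how formula (1) could be evaluated is shown below; the test cases, weights, and coverage flags are invented for illustration and are not taken from the exchange's testware.

```python
# Sketch of formula (1): weighted test-case coverage as the data quality determinant q.
# The test cases and weights below are made-up examples.
test_cases = [
    {"name": "list new derivative",       "weight": 1.00, "covered": 1},
    {"name": "enter order for new share", "weight": 0.75, "covered": 1},
    {"name": "add clearing organization", "weight": 0.25, "covered": 0},
    {"name": "negative case: bad ISIN",   "weight": 0.50, "covered": 1},
]

q = sum(tc["weight"] * tc["covered"] for tc in test_cases) / sum(tc["weight"] for tc in test_cases)
print(round(q, 2))  # 0.9 -> the 0.25-weight case is blocked by missing data
```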

The model is simple, but it is very useful in practice. It fulfills the requirements mentioned above. Perhaps only cardinality is fulfilled partly, since the differentiability of the model is not perfect: we can get the same value of q for different weight and coverage combinations. To improve the differentiability, the determinant would have to be made more complex. But that would reduce the clarity and simplicity, that is, it would deteriorate interpretability.

The value of the quality determinant is used to compare different test data sets to each other and to ensure the required test quality. Test data sets differ when they are applied in different test environments or at different project phases. The latter is probably the most important. Therefore, if we see that the determinant in the current project (project phase) is lower than in the previous project (project phase), it is a clear requirement for additional data enrichment.

Two aspects are important: the definition of the test case weights and the evaluation (preferably automatic) of the test coverage indicator.

#### *4.2.1 Test case weights*

The test case weights are assigned by a test designer or automatically. The automation is based on the assignment of the same weight to all test cases in a certain group, typically in the same test area or test suite. Formally, the weight is defined on the continuous interval from 0 to 1. In practice, the following values are used: 1.0 (required), 0.75 (important), 0.5 (quite important), 0.25 (not important), 0 (not relevant). The following factors, discussed below, influence the weight: business relevance, test automation, test case complexity, and execution effort.

**Business relevance**: Test cases that are more important for business should have higher weights. This aspect of test case prioritization is covered in the literature as customer requirement-based techniques [15–18]. For instance, the issuing of a new share happens on the stock exchange four to six times a year, while banks list hundreds of derivatives and structured products daily. The latter case is much more important and test-relevant. Another example is that adding a new trading participant (of which there are several hundred) has a higher weight than adding a new clearing organization, which could happen once in several years. Business relevance depends on the current project. The new functions that are being introduced by the current project may have higher priority over the regression test cases, and they receive lower priority when the project is over, since they become regression ones.

**Test automation**: Automated regression test cases have higher priority than manual ones since they check the basic functions and must always be successful. Another reason is that their execution is quicker and simpler. Therefore, they should get a weight value of 1.0.


**Test case complexity**: Simpler test cases should have a higher weight since the data for them can be maintained more easily. Generally, a good design should lead to simple and unambiguous test cases. That is, very complex ones are in any case a bit "suspicious" and probably require re-design. They can, for instance, be broken into several simple ones.

**Execution effort**: As with complexity, test cases that require less effort should have higher priority and correspondingly a higher weight in the evaluation of the test-data quality.

The initial setting of the weights requires a big effort. However, it generally needs to be done only once and then just maintained when new test cases are developed and/or the system functionality is changed. That does not happen very often—typically twice a year with big releases.

#### *4.2.2 Indicator of test case coverage*

The indicator of test coverage has only two values—0 or 1. When a test case cannot be executed because of missing data, it gets the value 0. When the test case can be executed, or has another problem, like a bug in the software or not-yet-implemented functionality, the indicator gets the value 1.

The value can be assigned manually, like the weight, or automatically based on the results of the test execution. For every test cycle (sprint), a test execution dashboard is defined in Jira. It includes the planned test cases and the results of the execution. The results can be passed, failed, not applicable, etc. If the result is "blocked," that means that the test case is blocked by missing data. This information can be retrieved and used for the evaluation.
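
As a hedged sketch, the snippet below derives the coverage indicator from a simplified export of execution results; the record structure is a hypothetical stand-in for the Jira dashboard, not its actual API.

```python
# Sketch: deriving the coverage indicator b_i from exported test execution results.
# The status strings mirror the ones mentioned in the text ("blocked" = missing data);
# the export format itself is a hypothetical simplification of a Jira dashboard.
executions = [
    {"test_case": "TC-101", "status": "passed"},
    {"test_case": "TC-102", "status": "failed"},          # software bug, data was there
    {"test_case": "TC-103", "status": "blocked"},         # blocked by missing reference data
    {"test_case": "TC-104", "status": "not applicable"},
]

def coverage_indicator(status: str) -> int:
    """0 only when the test case is blocked by missing data, 1 otherwise."""
    return 0 if status.lower() == "blocked" else 1

b = {e["test_case"]: coverage_indicator(e["status"]) for e in executions}
print(b)  # {'TC-101': 1, 'TC-102': 1, 'TC-103': 0, 'TC-104': 1}
```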

#### *4.2.3 Testware status*

The current snapshot of the major end-to-end test specifications of the Swiss Stock Exchange is shown in **Table 1**. The total number of test cases is 2473. Of course, that changes with new developments, new projects, and corresponding versions of components.

Most test cases in the first two specifications are automated. According to the above-described logic, they get the weight value of 1.0. In the rest of the testware, some 30% of test cases also have weight 1.0; approximately 30% have weight 0.75; 20% have weight 0.5; and 20% have weight 0.25. This can differ from project to project. When a project includes software changes even in less important areas, those areas should be tested more thoroughly and their weight becomes higher. On the other hand, if important functionality is not affected, a test case may get a lower weight. The sum of weights in the example from **Table 1** is 1929. That gives us, for instance, for the case with 20% uncovered test cases (with the weight of 1.0), which is realistic at the initial phases of a project, q = 0.86.

#### **Table 1.** *Testware volumes.*

**Figure 1.** *Data quality dynamics.*

The typical dynamics of the data quality determinant along test cycles or sprints (when a project is done with an agile methodology) are shown in **Figure 1**.

#### **5. Application case 2. Validation of handwriting psychology**

The second application case relates to the area of scientific research. Data quality is not often treated formally in research experiments in psychology, although it plays a very important role in the success and reliability of the results. An approach to quantify and model this was presented by the author in the area of validation studies of handwriting analysis [19].

#### **5.1 Handwriting analysis as a psychometric instrument**

Handwriting analysis (handwriting psychology) is one of the so-called projective techniques for psychological assessment. It is based on the evaluation of a person's handwriting and deducing from it a range of personality traits. It is traditionally used for recruitment and in some specific areas where the typically used self-assessment tools are not applicable, for instance, in forensic psychology.

The technique has certain unique features and advantages over the mostly used questionnaire-based psychological tests. First, it allows wide coverage of personal characteristics. Secondly, it excludes social desirability, which is typical for self-assessment questionnaires. However, handwriting psychology, like other projective methods, is not sufficiently validated. That often makes its usage controversial. Historically, the validation studies of handwriting analysis were based on expert procedures, involving specialists with their manual and often subjective evaluations.

In recent years, many validation studies have been done with software, which does not completely substitute the experts but rather assists them in making their evaluations more objective and reliable. One of these approaches is discussed below. It is based on the HSDetect system [20, 21] for handwriting analysis. The system includes statistically evaluated relations between some 800 handwriting signs and about 400 personality traits and behavior patterns.

#### **5.2 Validation of handwriting analysis**

In psychometrics, traditionally three major quality criteria are required: objectivity, reliability, and validity.

The objectivity of a psychometric test ensures that the testing person does not influence the result. In the case of handwriting analysis, the testing person is an involved expert or a computer program, which evaluates the written sample.

Reliability denotes the accuracy and precision of the procedure. The results should remain the same when the test and its evaluation are repeated under the same conditions. The typical methods for the assessment of reliability are test–retest, parallel evaluation, and split-half methods. In the context of the handwriting analysis, we consider three major components, namely, the handwriting signs, personality traits, and the relation of the signs to the traits, which we call graphometric functions. They are the objects of the quality assessment. In the traditional procedure, an expert evaluates the handwriting of the subject and interprets it in terms of personality traits, compiling a textual report. This procedure is rather subjective and that was the major objection and the root cause of the controversy. The analyzed studies were done with the computer-aided application HSDetect. This ensures objectivity and reliability [20].

Validity is the primary criterion. Objectivity and reliability are important, but they are just the prerequisites for validity. A test is then valid when it really measures what it is supposed to measure. It is always a challenge to practically define against which reference the test should be validated. Theoretically, a psychometric test should be validated against the psychological features. However, how are those obtained? In most cases, only indirectly, since a direct self-assessment is subjective and a proper expert evaluation is extremely difficult to set up. That is why a typical approach is to check the test against other psychometric tests, which are considered valid. This approach is used in statistical experiments, the data quality of which we are investigating in the current study.

The comparison between two psychometric tests is the comparison between two statistical series – the results of the validated instrument and the reference instrument. Both tests must include the evaluation of the same subjects (involved persons). In our case, all subjects execute the reference test and provide samples of their handwriting. The handwriting is evaluated by the handwriting experts and HSDetect. Therefore, we can say that the input data for both tests are different, while the output is the same: evaluated values of so-called test scales or, in other words, psychological constructs. When the results agree, we can say that the instrument under investigation (handwriting analysis) demonstrates good validity against the reference instrument. Researchers very often check the agreement using correlation or another statistical method. In most of our experiments, direct correlation does not work well, and we used a special method consisting of four steps:

• Mapping of original quantitative test scale onto a simpler scale with only three values (high, medium, and low) – scale transformation.


In some cases, we did use the correlation, either the product–moment correlation or the lognormal one.

#### **5.3 Data quality determinant for the validation analysis**

The data quality of a psychological test has two aspects. The first is the data of the experiment itself; let us call it the experiment component. The second one is the distribution of the subjects involved in the test across different categories – age, sex, education, profession, etc. – the subject component. Both aspects are important because they both influence the meaningfulness of the test results. If we, say, run our experiments only with students, the results may not be significant for retired persons.

#### *5.3.1 Experiment component*

For the experiment component, we define the following three quality parameters: the sample size (S), the presence of outliers (O), and the normality of the distribution (N).

They are briefly discussed below. When formalized, the variables S and O get a value of 1 when the quality requirement is fulfilled and 0 when it is not. The variable N reflects the ratio of normally distributed test scales (or test dimensions) to the total number of test scales.

**Sample size**: It was mentioned above that we use the binomial check to decide the statistical consistency of the result. Typically, power analysis [22] is used to evaluate the required sample size. The levels standard for psychology, α = 0.05 (type I error of 5%) and β = 0.2 (type II error of 20%), and the medium effect size of 0.5 result in this case in a minimal sample size of 49. Therefore, when the number of subjects is more than 48, we assume this data quality component to be fulfilled, that is, S = 1; otherwise S = 0.

The sample size should not simply be as big as possible; rather, it should be optimal, with adequate statistical power. Determining it is a critical step in the design of an experiment. Involving too many participants makes a study expensive. If the study is underpowered, it is statistically inconclusive, although its results may be interesting.
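
As a hedged illustration, the sketch below shows a power-analysis calculation with statsmodels and the resulting S indicator. The one-sample t-test solver is used only as an example; the minimal n it returns depends on the statistical test assumed and does not necessarily reproduce the value of 49 quoted above for the binomial check.

```python
# Sketch: deciding the sample-size parameter S (S = 1 when n > 48, as described above).
# The power calculation below is an illustration only; the required n depends on the
# chosen statistical test and need not equal the 49 quoted in the text.
from statsmodels.stats.power import TTestPower

n_required = TTestPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(n_required)  # roughly 33-34 subjects for a one-sample t-test with these settings

def sample_size_ok(n_subjects: int) -> int:
    """Return the S parameter for a given number of subjects."""
    return 1 if n_subjects > 48 else 0

print(sample_size_ok(52))  # 1
print(sample_size_ok(30))  # 0
```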

**Outliers**: Outliers are those points of the statistical sample that are distant from other observations. This happens either due to measurement variability or due to experiment error. Often, outliers are excluded from the data set. In this case, they may become the subject of a special analysis. The exclusion of outliers leads to a reduction in the sample size. On the other hand, it improves the experiment results. Therefore, there is always a trade-off between the result and its reliability.

In our context, there are two types of outliers. The first one means the deviation from the normal distribution of the statistical series (here is the relation to the third quality parameter N). The second type relates to the results of the comparison of handwriting analysis to the psychological test. The removal of "bad" points, which contribute most to the disagreement, may improve the resulting evaluation. The criterion may be the proportion of improvement relative to the proportion of change. Say, if we remove 10% of the points and that gives us 40% of improvement, we can consider the excluded points as outliers. When the improvement is only 5%, the "bad" points are not outliers.

Parameter O = 1 when there are no outliers, and O = 0 if some outliers exist and were not removed.

**Normality**: When a random variable is normally distributed, that enables many additional methods of statistical analysis, for example, correlation analysis, analysis of variance (ANOVA), or regression modeling. That is why the sample must always be checked for normality. There are many methods, one of the most powerful of which is the Shapiro–Wilk test.

Most psychological tests have a rich normative base, and they are generally normally distributed. Whether our current experiments follow the statistical population of the chosen psychological test or not is not important. Therefore, we consider only the handwriting variables, which are the subject of the research. In the presented model, normality in general often cannot be defined distinctly, since every test has several scales and the check is done for every particular scale and its handwriting model. Therefore, formally, normality should be a vector, and N, as mentioned above, represents the ratio of normally distributed scales to the total number of scales.
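
A minimal sketch of how N could be computed is shown below; the scale names and the synthetic samples are invented for illustration.

```python
# Sketch: checking each handwriting scale for normality with the Shapiro-Wilk test
# and computing N as the share of normally distributed scales. The scales are synthetic.
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)
scales = {
    "pressure":  rng.normal(50, 10, size=60),    # roughly normal
    "slant":     rng.normal(10, 3, size=60),     # roughly normal
    "word_gaps": rng.exponential(2.0, size=60),  # clearly non-normal
}

alpha = 0.05
normal_flags = [shapiro(values).pvalue > alpha for values in scales.values()]
N = sum(normal_flags) / len(normal_flags)
print(N)  # typically 2 of the 3 scales pass -> N is about 0.67
```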

#### *5.3.2 Subject component*

In experiments related to handwriting analysis, such biodata as sex, age, handedness, education, and profession are important, since they may influence the handwriting signs. However, in the current application case, we consider only two parameters: the distribution of the subjects by sex (X) and by age (A).

In a good experiment, both parameters should be close to a uniform distribution in order to represent the subject categories more or less equally. In this case, X and A get a value of 1. If the distribution is far from uniform, they are set to 0.

#### *5.3.3 Data quality determinant*

The determinant model is as follows:

$$q = a_S S + a_O O + a_N N + a_X X + a_A A \tag{2}$$

where $a_i$ are the corresponding weights.


#### **Table 2.** *Raw data for the estimation of the quality determinant.*

The defined components of data quality are not equally important. We can address that through the standard approach—assigning different weights. Their values were defined by experts and are, therefore, rather subjective. However, they allow the comparison of different experiments. In our case, we assign the weights so that their sum is 1.0. The experiment component gets a weight of 0.6, while the subject component gets 0.4. The number of subjects is also more important than outliers and normality. This logic results in the following weights: $a_S$ = 0.36, $a_O$ = 0.2, $a_N$ = 0.21, $a_X$ = 0.11, and $a_A$ = 0.11. The absolute value of the weights is not extremely important, since our aim is mostly to compare different experiments to each other.
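
A minimal sketch of evaluating model (2) with these weights is given below; the parameter values describe a hypothetical experiment rather than one of the studies in Table 2.

```python
# Sketch of model (2): combining the binary/ratio quality parameters into q
# with the expert weights quoted in the text. The parameter values are hypothetical.
weights = {"S": 0.36, "O": 0.2, "N": 0.21, "X": 0.11, "A": 0.11}
params  = {"S": 1, "O": 0, "N": 0.8, "X": 1, "A": 0}   # an example experiment

q = sum(weights[k] * params[k] for k in weights)
print(f"{q:.3f}")  # 0.638
```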

The presented model satisfies four of the requirements for good metrics that were formulated above, namely normalization, adaptability, scalability, and interpretability. Only cardinality cannot be assured.

The input data for the evaluation of the quality parameters are shown in **Table 2**. We consider four studies on the validation of handwriting analysis against the following psychometric tests [20, 23]: Cattell's 16 personality factors test (revised) 16PF-R, the NEO five-factor inventory by Costa & McCrae, the portrait values questionnaire (PVQ) by Schwartz, and the emotional quotient inventory (EQ-i 2.0).

The data quality determinant was calculated based on model (2) and the weights defined above. The results are presented in **Table 3**.

**Table 3.** *Data quality determinant.*

The data quality evaluation for the different validation experiments demonstrates big differences. It can be a good indicator of the required experiment improvements. For instance, the removal of outliers when the sample size is big enough can be a proper way to improve the statistical power and the data quality. That may improve the normality of the data as well. On the other hand, outliers may deliver important additional information, and, if their influence on the data quality is not that strong, they should remain in the sample.

In any case, data quality should be an important parameter in the evaluation of the reliability of the whole experiment. How to do that formally is not yet clear. That is a point for further research.

#### **6. Conclusion**

Domain-specific information is a very important factor when we try to define data quality. Traditional dimensions of quality reflect the technical and formal aspects of the data. They are doubtless useful and define the requirements for data quality. However, they are not sufficient. The real attitude of data users and the added value of data quality are reflected in fit-for-use determinants.

In the current work, we formulate the requirements for a data quality metric and analyze two application cases with fit-for-use determinants. They demonstrate a practical rather than theoretical approach. However, the presented results can be useful in finding ways to control data improvement.

In the presented application examples, the amount of data was small. The preparation of the raw data and the estimation of the determinant value were done offline. A universal data quality determinant is practically useful when it can be derived automatically from the original data. The testware for the stock exchange reference data is stored in one of the test management systems (in our case, Jira). The corresponding database queries could be easily developed and integrated with the test data management. In the second example, the required data was likewise derived automatically from the experiment databases. That is a good basis for a generic system for data quality estimation. It can include a calculation engine with different models and adapters for a particular application. Their role is to retrieve the data and convert it into a generic structure.

A universal data quality determinant and the corresponding automatic procedure can be especially useful for artificial intelligence models. Their outcome strongly depends not only on the quantity of data but also on the quality of the training and test data. To avoid the famous GIGO (garbage in, garbage out) effect, data quality should be properly managed at all levels.

#### **Author details**

Yury Chernov QADAS, Zurich, Switzerland

\*Address all correspondence to: y.chernov@gmx.ch

© 2022 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### **References**

[1] Wang RY, Strong DM. Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems. 1996;**12**(4):5-33. DOI: 10.1080/07421222.1996.11518099

[2] Data VS. Quality management. In: Kunosic S, Zerem E, editors. Scientometrics Recent Advances. London: IntechOpen; 2019. pp. 1-15. DOI: 10.5772/intechopen.86819

[3] Eppler MJ. Managing Information Quality. 2nd ed. Berlin: Springer Verlag; 2003. p. 398

[4] Batini C, Scannapieco M. Data Quality: Concepts, Methodologies and Techniques. 6th ed. Berlin: Springer Verlag; 2006. p. 281

[5] Pipino LL, Lee YW, Wang RY. Data quality assessment. Communications of the ACM. 2002;**45**:211-218

[6] Redman TC. Data Quality for the Information Age. Boston: Artech House; 1996. p. 332

[7] McGilvray D. Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information. Burlington: Morgan Kaufmann; 2008. p. 352

[8] Wand Y, Wang RY. Anchoring data quality dimensions in ontological foundations. Communications of the ACM. 1996;**39**:86-95

[9] Kimball R, Ross M. The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling. 3rd ed. New York: John Wiley & Sons; 2013. p. 600

[10] Jarke M. Fundamentals of Data Warehouses. Berlin, Heidelberg: Springer Verlag; 2003. 219 p. DOI: 10.1007/978-3-662-05153-5

[11] Juran JM, Godfrey AB. Juran's Quality Handbook. 7th ed. New York: McGraw-Hill; 2016. p. 992

[12] Redman TC. Data quality management past, present, and future: Towards a management system for Data. In: Salig S, editor. Handbook of Data Quality: Research and Practice. Berlin, Heidelberg: Springer Verlag; 2013. DOI: 10.1007/978-3-642-36257-6

[13] Heinrich B, Kaiser M, Klier M. How to measure data quality? A metricbased approach. In: Proceedings of the International Conference on Information Systems (ICIS 2007); 9-12 December 2007. Montreal, Quebec, Canada: AIS; 2007

[14] Chernov Y. Test-data quality as a success factor for end-to-end testing. An approach to formalisation and evaluation. In: Proceedings of the 5th International Conference on Data Management; 24-26 July 2016; Lisbon, Portugal: SCITEPRESS; 2016. pp. 95-101

[15] Roonqruangsuwan S, Daengdej J. A test prioritization method with practical weight factors. Journal of Software Engineering. 2010;**4**(3):193-214

[16] Zwang X, Xu B, Nie C, Shi L. An approach for optimizing test suite based on testing requirement reduction. Journal of Software. 2007;**18**:821-831

[17] Srikanth H, Williams L. On the economics of requirement-based test case prioritization. In: Proceedings of the 7th International Workshop on Economic-Driven Software Engineering Research; 15-21 May 2005. St. Louis, Missouri, USA. New York: ACM; 2005. pp. 1-3

[18] Elbaum S, Malishevsky A, Rothermel G. Test case prioritization:

A family of empirical studies. IEEE Transactions on Software Engineering. 2002;**28**:159-182

[19] Chernov Y. Data quality metrics and reliability of validation experiments for psychometric instruments. In: Proceedings of the 15th European Conference on Psychological Assessment; 7-10 July 2019; Brussels, Belgium: ECPA; 2016. p. 37

[20] Chernov Y. In: Chernov Y, Nauer MA, editors. Formal Validation of Handwriting Analysis, Handwriting Research. Validation and Quality. Berlin: Neopubli; 2018. pp. 38-69

[21] Chernov Y. Компьютерные методы анализа почерка [Computer Methods of Handwriting Analysis]. Zurich: IHS Books; 2021. p. 232

[22] Cohen J. Statistical Power Analysis for the Behavioral Sciences. New Jersey: Lawrence Erlbaum Associates; 1988. p. 567

[23] Chernov Y, Caspers C. Formalized computer-aided handwriting psychology: Validation and integration into psychological assessment. Behavioral Sciences. 2021;**10**(1):27

#### **Chapter 3**

## Multiplicative Data Perturbation Using Random Rotation Method

*Thanveer Jahan*

#### **Abstract**

Today's applications rely on large volumes of personal data being collected and processed regularly. Many unauthorized users try to access this private data. Data perturbation methods are among the many Privacy Preserving Data Mining (PPDM) techniques. They play a key role in perturbing confidential data. The research work focuses on developing an efficient data perturbation method for multivariate datasets which can preserve privacy in a centralized environment and allow publishing data. To carry out the data perturbation on a multivariate dataset, a Multiplicative Data Perturbation (MDP) using the Random Rotation method is proposed. The results reveal an efficient multiplicative data perturbation of multivariate datasets which is resilient to attacks or threats and preserves privacy in a centralized environment.

**Keywords:** privacy, multiplicative data perturbation, random rotation method

#### **1. Introduction**

This chapter proposes a Multiplicative Data Perturbation method. It considers multivariate datasets and perturbs them using a geometric data perturbation method. Then, a Discrete Cosine Transformation is applied to the perturbed data, so that the Euclidean distance between pairs of data values can still be determined. This proposal is elaborated in the following sections.

#### **1.1 Background**

Hybrid transformations are used to maintain the statistical properties of data as well as mining utilities [1–3]. The statistical properties of data are the mean and the variance or standard deviation, without any loss of data. A feasible solution [4] is provided to optimize the data transformations by maximizing the privacy of sensitive attributes. A combined technique using randomization and geometric transformation is used to protect sensitive data. A randomized technique is represented as D = X + R, where R is additive noise, X is the original data and D is the perturbed data. A geometric transformation is used as a 2D rotation of the data matrix, represented as D′ = R(θ)D, where D is the column vector containing the original coordinates and D′ is a column vector whose coordinates are rotated clockwise by the angle θ. The above method considered only a single attribute as sensitive and the rest of them as non-sensitive attributes. A data perturbation method using fuzzy logic and random rotation is proposed in [5, 6].

The original data is perturbed using a fuzzy-based approach (M), and then random rotation perturbation is applied by selecting confidential numerical attributes to get the transformed data P = M\*R, where M is the dataset transformed using the fuzzy-based approach and R is the generated random dataset. The distorted data P is released for clustering analysis, and accuracy is obtained. The approach involves a compromise in balancing privacy and accuracy. A hybrid method using SVD and Shearing-based data perturbation [7] is proposed to obtain perturbed data. The approach removes the identifying attributes from the dataset. The remaining attributes are normalized using Z-score normalization to standardize them to the same scale. Then, the dataset is perturbed using an SVD transformation. Each record of the perturbed dataset is further distorted using a Shear-based data perturbation method represented as D′ = D + ShD \* D, where ShD is the random noise and D is the perturbed dataset obtained after the SVD transformation.

The results show that higher privacy is attained with hybrid methods when compared to single data perturbation methods. A hybrid technique [7, 8] based on the Walsh-Hadamard Transformation (WHT) and Rotation is proposed. The Euclidean-distance-preserving transformation uses the Walsh-Hadamard matrix (Hn), given below, to generate an orthogonal matrix that preserves the statistical properties of the original dataset.

$$H_n = \bigotimes_{i=1}^{n} H_2 = \underbrace{H_2 \otimes H_2 \otimes \dots \otimes H_2}_{n} \tag{1}$$

where $H_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$ and $\otimes$ denotes the tensor or Kronecker product. Then, a Rotation transformation is applied to preserve the distance between the data points. The perturbed data preserves the distance between data records and maintains accuracy with classifiers. The method is limited to numerical attributes and can be extended to categorical attributes. A hybrid approach for data transformation is proposed by Manikandan et al. [9] to sanitize the data and normalize it using min-max normalization [10]. The approach transforms the original data while maintaining the inter-relative distance among the data. Clustering analysis shows that the number of clusters in the original data is similar to that in the modified data. Another approach that modifies the original data to preserve privacy with the help of the inter-relative distance on categorical data is proposed in [2].
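
As a side note to the Walsh-Hadamard construction in Eq. (1) above, a minimal sketch of building $H_n$ by repeated Kronecker products is shown below; the order (n = 3) and the 1/√2 scaling, added here to keep the matrix orthogonal, are assumptions of this illustration rather than details taken from the cited works.

```python
# Sketch: building a Walsh-Hadamard matrix H_n as repeated Kronecker products of H_2,
# as in Eq. (1). Scaling by 1/sqrt(2) at each step keeps the matrix orthogonal,
# so Euclidean distances are preserved.
import numpy as np

H2 = np.array([[1, 1],
               [1, -1]])

def walsh_hadamard(n: int) -> np.ndarray:
    """Return an orthogonal Walsh-Hadamard matrix of size 2^n x 2^n."""
    H = H2 / np.sqrt(2)
    for _ in range(n - 1):
        H = np.kron(H, H2 / np.sqrt(2))
    return H

H3 = walsh_hadamard(3)                       # 8 x 8 orthogonal matrix
print(np.allclose(H3 @ H3.T, np.eye(8)))     # True -> distance preserving
```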

The categorical data is converted into binary data and is transformed using a geometric transformation. Then a clustering algorithm is used for analysis, and the results show better data utilization as well as privacy preservation. The multiplicative noise is generated using random numbers with a mean of 1 and is multiplied by the original data value. A random number with a short Gaussian distribution is calculated with a mean of 0 and a small variance. Geetha Mary et al. [11] proposed a non-additive method of perturbation by randomization, where data is generated based on intervals according to the level of privacy specified by a user. A random number is generated that is either added to or multiplied with the data to generate randomly modified data. The perturbed data is classified and measured using metrics.

The condensation approach was presented by Aggarwal and Yu [12] as a multidimensional perturbation technique to provide privacy for multiple columns using a covariance matrix. The approach was weak in protecting data privacy. Rotation perturbation was used for privacy preserving data classification [13]. Rotation perturbations are task-specific and aim to achieve a better balance between loss of information and loss of privacy. Multiplicative data perturbations include three types of perturbation techniques: Rotation Perturbation, Projection Perturbation, and Geometric Data Perturbation.

A Rotation perturbation framework was adopted in privacy preserving data classification [14]. It is defined as G(X) = RX, where R is a randomly generated rotation matrix and X is the original data. The benefit of this method is distance preservation; its weakness is that it is prone to distance-inference attacks. These attacks are addressed in [15–17]. Chen et al. [14] proposed an improved version with better resilience towards attacks. Oliveira et al. [17] proposed a scaling transformation along with random rotation in privacy preserving clustering.
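
A minimal sketch of such a rotation perturbation is given below, assuming the random orthogonal matrix is obtained from a QR decomposition (one common way to generate it, not necessarily the one used in [14]); the data matrix is synthetic.

```python
# Sketch: a random rotation perturbation G(X) = R X. A random orthogonal matrix R is
# obtained from the QR decomposition of a Gaussian matrix; X holds attributes in rows
# and records in columns and is purely synthetic.
import numpy as np

rng = np.random.default_rng(42)
d, n = 3, 5                                   # 3 attributes, 5 records
X = rng.normal(size=(d, n))                   # original data

R, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal matrix
G = R @ X                                     # perturbed data

# Pairwise Euclidean distances between records are preserved (rotation invariance).
orig = np.linalg.norm(X[:, 0] - X[:, 1])
pert = np.linalg.norm(G[:, 0] - G[:, 1])
print(np.isclose(orig, pert))                 # True
```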

A Random Projection perturbation is proposed [13, 18] to project a set of points from the original multidimensional space to another randomly chosen space. This results in approximate model quality. A random projection matrix is used in privacy preserving data mining to enable individuals to choose their privacy levels.

An ideal data perturbation [19] aims at a balanced trade-off between minimizing information loss and minimizing privacy loss. However, these are not balanced in the existing algorithms. Compared with the existing approaches in privacy preserving data mining, Geometric Data Perturbation has significantly reduced these shortcomings [20].

A Geometric Data Perturbation is a sequence of random geometric transformations, including a multiplicative transformation (R), a Translation Transformation (T), and a Distance Perturbation (DP) [21, 22].

$$G(\mathbf{X}) = \mathbf{R}\mathbf{X} + \mathbf{T} + \mathbf{D_P} \tag{2}$$

The approach has two unique characteristics. The first characteristic is to perturb the original data with geometric rotation and translation and to identify rotation-invariant classifiers, as given above. The second characteristic is to build a privacy model by evaluating the privacy quality of the perturbation method. The privacy model generated is used to analyze attacks, such as Naive and ICA-based reconstruction attacks. The quality of a data perturbation approach is determined by the quality of the privacy preserved, that is, by the difficulty of estimating the original data from the perturbed one; such estimations are named inference attacks. The attacks fall into three categories: Naive inference, reconstruction-based inference, and distance-based inference. A statistical-method-based inference to estimate the original data from the perturbed data, named the Naive inference attack, was proposed in [23]. It is represented as O = P, where O is the observed data and P is the perturbed data. Reconstructing the data from the perturbed data and the released information is also presented. Reconstruction-based attacks are also called Independent Component Analysis (ICA) attacks [24, 25]. They are represented as O = E<sup>-1</sup>P, where E<sup>-1</sup> is the estimation of the released information of the data and P is the perturbed data. Identifying the images and some relevant information of the data using outliers to discover the perturbation constitutes the distance-based attacks. They are represented as O = E<sup>-1</sup>P, where E<sup>-1</sup> is the mapping to be estimated and P is the perturbed data. The more difficult the inference, the better the original data is protected and preserved, such that an attacker cannot break the perturbation. The above attacks are analyzed with a privacy model with a privacy guarantee [26]. However, it failed to avoid the outlier attack. The existing data perturbation techniques have a contradiction between the data privacy metric and the mining utility [27, 28]. The multiplicative data perturbations maximize the two levels, i.e., data privacy and mining utility. Multiplicative data perturbation shows promising features to improve data privacy during the mining process as well as to preserve the model-specific information.

In this chapter, a survey of privacy preserving data mining to protect confidential data is presented. The drawbacks of the above existing data perturbation methods have motivated us to resolve the issues with balanced factors, such as data privacy and data utility. The challenges in preserving privacy using multiplicative data perturbation are given a new direction in this research study.

#### **2. Proposed method**

The proposed Multiplicative Data Perturbation (MDP) is shown in **Figure 1** as a block diagram.

The above block diagram considers the original dataset and deals with it in two stages. In the first stage, the original dataset is perturbed using geometric data perturbation. The geometric data perturbation generates a distorted dataset. This distorted dataset is further perturbed using a Discrete Cosine Transformation in the second stage to finally generate the distorted dataset. The process of generating a distorted dataset using geometric data perturbation comprises three steps. In the first step, a random dataset is created with random values and the same shape as the original dataset. This random dataset is rotated counter-clockwise and then multiplied with the original dataset. The resultant dataset obtained in the above step is transposed in the second step, that is, the Translation Transformation. This transposed dataset is added to additive noise in the third step to obtain the distorted dataset. This proposal is formalized as an algorithm for multiplicative data perturbation in the next section.

#### **3. Proposed multiplicative data perturbation using random rotation algorithm**

A proposal for multiplicative data perturbation is given in this section. The pseudo code of the proposed algorithm is listed below.

#### **Algorithm:**

Input: A data matrix Dp×q.

Output: Distorted data matrices D4, D5.

Begin.

Step 1: Create a random data matrix Rp×q with random values.

Step 2: Rotate R counter-clockwise by 90° and multiply it with D to obtain the data matrix X.

Step 3: Create a random data matrix X1 with p rows and q columns, mean 0 and standard deviation 1.

Step 4: D4 = X + R<sup>T</sup> + X1 // Geometric data perturbation.

Step 5: Call function DCT (D4p×q : D5p×q) // Discrete cosine transformation.

Step 6: The resultant distorted data matrix D5p×q is the output. End.

**Function DCT (D4p×q : D5p×q) // Function for Discrete Cosine Transformation.** Input: A data matrix D4p×q. Output: A data matrix D5p×q.

Begin.

Step 1: Copy the data matrix D4 to a data matrix D5 // alias.

Step 2: For i = 1 to p.

For k = 1 to q. If k = 1 then

$$\mathbf{D5}[\mathbf{i}][\mathbf{k}] = \frac{1}{\sqrt{q}} \ast \mathbf{D4}[\mathbf{i}][\mathbf{k}] \ast \cos\left(\frac{(2\mathbf{k}+\mathbf{1})\,\mathbf{i}\,\pi}{2q}\right)$$

Else

$$\mathbf{D5}[\mathbf{i}][\mathbf{k}] = \sqrt{\frac{2}{q}} \ast \mathbf{D4}[\mathbf{i}][\mathbf{k}] \ast \cos\left(\frac{(2\mathbf{k}+\mathbf{1})\,\mathbf{i}\,\pi}{2q}\right)$$

End if

End For

Construct D5 data matrix and return as parameter. End

The algorithm accepts the data matrix Dp×q with p rows and q columns as input. It creates a random data matrix R with p rows and q columns having random values as elements. This random data matrix R is rotated counter-clockwise by 90° and then multiplied with the data matrix Dp×q. The resulting data matrix is named Xp×q. Another random data matrix X1 is created with p rows and q columns, such that its mean is 0 and its standard deviation is 1. Now, the distorted data matrix D4 is constructed by adding the data matrices X, R<sup>T</sup> and X1. This data matrix D4 is passed as a parameter to the called function DCT(). The predefined conditions are checked and the data matrix D5 is updated. Once completely updated, the data matrix D5 is the output of the algorithm. The time complexity of the proposed MDP algorithm is found to be O(n), where n is the dimension of the dataset.
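The following is a minimal sketch in Python/NumPy of the two-stage idea (the names and the exact formulation are illustrative, not the chapter's MatLab source). To keep the matrix algebra consistent, the geometric stage is written in the standard form of geometric data perturbation, a random rotation followed by a translation and additive Gaussian noise, and the second stage applies a row-wise Discrete Cosine Transformation; the details therefore differ slightly from the pseudocode above.

```
import numpy as np
from scipy.fft import dct

def multiplicative_data_perturbation(D, seed=0):
    """Return the geometrically distorted matrix D4 and the DCT-distorted matrix D5."""
    rng = np.random.default_rng(seed)
    p, q = D.shape

    # Stage 1: geometric data perturbation (rotation + translation + additive noise).
    M = rng.standard_normal((q, q))
    R, _ = np.linalg.qr(M)                 # random orthogonal (rotation) matrix, q x q
    T = rng.standard_normal((1, q))        # random translation, broadcast over all rows
    N = rng.normal(0.0, 1.0, size=(p, q))  # additive noise with mean 0 and sd 1
    D4 = D @ R + T + N

    # Stage 2: discrete cosine transformation applied row-wise to the distorted data.
    D5 = dct(D4, type=2, norm="ortho", axis=1)
    return D4, D5

if __name__ == "__main__":
    D = np.array([[4.0, 2.0, 2.0],
                  [1.0, 1.0, 1.0]])        # the 2 x 3 data matrix from Example 1.1
    D4, D5 = multiplicative_data_perturbation(D)
    print(D4)
    print(D5)
```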

The process of updating D5 is explained with the help of an example stated below:

Example 1.1: Consider a data matrix

$$\mathbf{D}\_{2 \times 3} = \begin{bmatrix} 4 & 2 & 2 \\ 1 & 1 & 1 \end{bmatrix}$$

where p = 2 and q = 3.

At Step 1, create a random data matrix R2×3 as given below:

$$\mathbf{R} = \begin{bmatrix} -0.3034 & -0.7873 & -1.1471 \\ 0.2939 & 0.8884 & -1.0689 \end{bmatrix}$$

and rotate R counter-clockwise by 90° as given below:

$$\mathbf{R}\_{3 \times 2} = \begin{bmatrix} -1.1471 & -1.0689 \\ -0.7873 & 0.8884 \\ -0.3034 & 0.2939 \end{bmatrix}$$

At step 2, construct the data matrix X = D2×3 \* R3×2 as given below:

$$\mathbf{X} = \begin{bmatrix} -6.4664 & -2.2049 \\ -2.2378 & 0.1134 \end{bmatrix}$$

At step 3, create another random data matrix X1 with 2 rows and 3 columns such that the mean is 0 and the standard deviation is 1.

$$\mathbf{X1} = \begin{bmatrix} -6.4664 & 1.4384 & -0.7549 \\ -2.9443 & 0.3252 & 1.3703 \end{bmatrix}$$

At step 4, construct the distorted data matrix D4 = X + R<sup>T</sup> + X1. At step 5, the function DCT (D4:D5) is called, where

$$\text{DCT}(\mathbf{k}) = \mathbf{f}(\mathbf{k}) \sum\_{k=1}^{q} \mathbf{D4}(\mathbf{q}) \cos \left[ \frac{(2\mathbf{k} + \mathbf{1})\,\mathbf{i}\,\pi}{2\mathbf{q}} \right] \quad \mathbf{k} = \mathbf{1}, 2, \cdots, \mathbf{q}; \ \mathbf{i} = \mathbf{1}, \cdots, \mathbf{p} \tag{3}$$

where

$$\mathbf{f}(\mathbf{k}) = \begin{cases} \frac{1}{\sqrt{q}} & \mathbf{k} = \mathbf{1} \\ \sqrt{\frac{2}{q}} & \mathbf{2} \le \mathbf{k} \le \mathbf{q} \end{cases}$$

Let k = 1, q = 1, f(k) = 1/√q, then f(1) = 1. Substituting the values in Eq. (3): DCT(1) = 1 ∗ (−3.1036) ∗ cos(3 ∗ 3.14/2) = −5.7881.

Let k = 2, q = 1, f(k) = √(2/q), then f(2) = 1. Substituting the above values in Eq. (3): DCT(2) = 1 ∗ (−0.1362) ∗ cos[(2 ∗ 2 ∗ 3.14)/(2 ∗ 2)] = 1.3900.

Similarly, the remaining data values of D4 are calculated to form a D5 data matrix as given below:

$$\mathbf{D5} = \begin{bmatrix} -5.7881 & 1.3900 & 0.4371 \\ 1.3989 & -1.5826 & -2.3630 \end{bmatrix}.$$

The constructed data matrix D5 is the output.

#### **4. Implementation**

The proposed algorithm that was discussed in the previous section is implemented in MatLab. Its source code is included. The details of implementation are furnished in this section.

The implementation utilizes the built-in functions available in MatLab such as load(), size(), randn(), rot90(), dct() and normrnd(). First, the load() built-in function is used to read the data into a data matrix D. The size() function is employed to retrieve the number of rows and columns. The function randn() is used to generate a random matrix R whose size is the same as that of the data matrix D. The data matrix R is rotated using the built-in function rot90(). Then, to form the data matrix X, the data matrix R is multiplied by the data matrix D. Next, normrnd() is called to generate a data matrix X1 having mean 0, standard deviation 1 and the same size as the data matrix D. The distorted data matrix D4 is constructed by adding the three data matrices X, R<sup>T</sup> and X1. Finally, the function DCT() is applied to the distorted data matrix D4 to obtain the resultant distorted data matrix D5.

#### **5. Experimentation**

The experimentation was conducted on a desktop computer system running the Windows XP operating system, MatLab and the Tanagra data mining tool. The experimental details are elaborated in this section. The experimentation begins with the original dataset D given as input to the proposed MDP algorithm to obtain the distorted datasets D4 and D5. Then, the original dataset D and the distorted datasets D4 and D5 are uploaded into the Tanagra data mining tool after appending a class attribute. These uploaded datasets are classified using the classification utility available within the Tanagra data mining tool. The results of classification are analyzed thereafter.

Similarly, the datasets are clustered using the clustering utilities available in the tool. The results of clustering are also analyzed and furnished in the Results and Analysis section. A unified column privacy metric used to analyze the possibility of attacks is also discussed in this section, while its calculation is shown under Results and Analysis. The datasets Credit Approval, Haber-Man, Tic-Tac-Toe and Diabetes are used in this experimentation. The details of the Credit Approval dataset used in this experiment are furnished here; the remaining datasets are handled similarly.

A real-time multivariate dataset, namely Credit Approval, is downloaded from the UCI Machine Learning Repository website. The details are shown in **Table 1**. The original dataset used in the experimentation is therefore the Credit Approval dataset. It


#### **Table 1.**

*Details of credit approval dataset.*


#### **Table 2.**

*A credit approval original dataset D.*

comprises 690 rows/tuples and 15 columns/attributes including one target/class attribute.

A sample list of the original dataset D with 5 rows and 14 attributes is shown at **Table 2**.

The process in the experiment is explained as below:

First, the dataset named creditapproval.txt is loaded into a data matrix with the help of the load() method. Next, the size() method determines the number of rows p as 690 and the number of columns q as 14. The data matrix is now named Dp×q. Then, the built-in function randn(p, q) is used to create a random data matrix R. The random data matrix R is rotated with the help of the built-in function rot90(). The data matrix X is constructed by multiplying the data matrix D with the rotated data matrix R. The built-in function normrnd(0, 1, p, q) is used to create another random data matrix X1 with p rows and q columns, such that its mean is 0 and standard deviation is 1. The distorted data matrix D4 is constructed by adding the three data matrices X, R<sup>T</sup> (transpose of R) and X1. The distorted data matrix D4 is given as a parameter to the function DCT(D4), which returns the final distorted data matrix D5 as output. When the above process is executed, it outputs the distorted datasets D4 and D5.
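As a hypothetical usage sketch (assuming the multiplicative_data_perturbation() function from the Python sketch in Section 3 and a comma-delimited numeric creditapproval.txt), the experiment pipeline could look like this:

```
import numpy as np

# Load the dataset (analogous to MatLab's load()) and check its dimensions.
D = np.loadtxt("creditapproval.txt", delimiter=",")
p, q = D.shape                                # expected: p = 690, q = 14
D4, D5 = multiplicative_data_perturbation(D)  # defined in the earlier sketch
np.savetxt("creditapproval_D4.txt", D4, delimiter=",")
np.savetxt("creditapproval_D5.txt", D5, delimiter=",")
```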

#### **6. Results and analysis**

The distorted datasets D4 and D5, together with the original dataset D, are each appended with a class attribute, YES or NO. The original dataset D after appending the class attribute is shown in **Table 3**.

Similarly, the distorted datasets D4 and D5 are also appended with a class attribute and furnished in the Results and Analysis section. The above-mentioned datasets D, D4 and D5 are uploaded into the Tanagra data mining tool. First, the classification utility is used on the dataset D and the distorted datasets D4 and D5. It divides the attributes into two categories, non-class attributes and the class attribute. These two categories are the two inputs to the classifier chosen from the available ones.



#### **Table 3.**

*A credit approval original dataset with class attribute.*

Suppose we select SVM (Support Vector Machine) as the classifier; it then classifies the datasets D, D4 and D5, based on the class attribute, into credit card approved or rejected. Such results are furnished in the Results and Analysis section. Similarly, the experimentation is repeated with the Iterative Dichotomizer 3 (ID3), C4.5 (the successor of ID3), KNN (k-Nearest Neighbor) and MLP (Multi-Layer Perceptron) classifiers.

The results of those experiments are also furnished in the Results and Analysis section. The clustering utility available in the Tanagra data mining tool is used to cluster the original dataset D and the distorted datasets D4 and D5. The non-class attributes are considered and given as input to the k-means clustering method. As a result, categories of clusters are formed.

A unified column metric, the Root Mean Square Error (RMSE), is used to evaluate inference attacks. It is calculated using Eq. (4) as given below:

$$\text{RMSE}(\mathbf{r}) = \sqrt{\frac{1}{\mathbf{q}} \sum\_{i=1}^{q} \left(\mathbf{D} - \mathbf{P}\right)^{2}} \tag{4}$$

where D = d1, d2, ⋯, dq are the original dataset values, P = p1, p2, ⋯, pq are the perturbed dataset values and q is the number of columns.

Then, privacy(D, P) = r/2 (for standard deviation σ = 1). The attacks used are:

Naive inference is calculated as given in Eq. (4), where D is the original data and P = E (E is the estimated or random dataset).

Reconstruction-based inference is calculated as given in Eq. (4), where D is the original dataset and the perturbed dataset is

$$\mathbf{P} = \mathbf{E}^{-1} \ast \mathbf{P}.\tag{5}$$

Distance-based inference is calculated as given in Eq. (4), where D is the original dataset and P = P′ (P′ is the mapped set of points of the perturbed dataset P).

The calculations of these metrics are furnished at Section 7 under Results and Analysis.
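A small Python sketch of this unified column metric is given below (the privacy formula r/2 follows the calculations reported later in this chapter; the attacker's estimate E here is just an illustrative random stand-in):

```
import numpy as np

def rmse(D, P):
    """Root Mean Square Error of Eq. (4) between original values D and estimates P."""
    D, P = np.asarray(D, dtype=float), np.asarray(P, dtype=float)
    return float(np.sqrt(np.mean((D - P) ** 2)))

def privacy(D, P):
    """Privacy score as used in this chapter: r/2 (assuming unit standard deviation)."""
    return rmse(D, P) / 2.0

# Naive inference: the attacker takes an estimated/random matrix E as the original data.
D = np.array([[4.0, 2.0, 2.0], [1.0, 1.0, 1.0]])
E = np.random.default_rng(0).standard_normal(D.shape)
print("Naive attack RMSE:", rmse(D, E), "privacy:", privacy(D, E))
```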

#### **7. Results and analysis**

The results obtained in the above experiment are presented in this section. The original dataset D is given as input to the proposed MDP, and the output distorted datasets D4 and D5 are presented below in **Table 4** and **Table 5**, respectively.

When SVM classifier is used on D, D4 and D5 datasets, the following observations are made and the same are presented at **Table 6**.

In the above **Table 6**, the first column presents the original dataset D and the distorted datasets D4 and D5. The number of tuples in the datasets considered for the experiment can be seen in the second column. The third column displays the number of training tuples classified for credit card approved as YES. The number of support vectors is furnished in the fourth column. The fifth column reveals the error rate of the SVM classifier. The computation time is tabulated in the last column.

Similarly, when ID3 and C4.5 classifiers are used on D, D4 and D5 datasets the results are tabulated at **Tables 7** and **8**.

In the above **Tables 7** and **8**, the first column presents the dataset D and the distorted datasets D4 and D5. The number of tuples in the datasets considered for the experiment can be seen in the second column. The third column displays the number of training tuples belonging to credit card approved as YES. The size of the tree (number of nodes and leaves) is furnished in the fourth column. The fifth column reveals the error rates of the ID3 and C4.5 classifiers, respectively. The computation time is tabulated in the last column.


#### **Table 4.**

*A credit approval distorted dataset D4.*


#### **Table 5.**

*A credit approval distorted dataset D5.*


#### **Table 6.**

*A credit approval dataset classified using SVM.*


**Table 7.**

*A credit approval dataset classified using ID3.*


#### **Table 8.**

*A credit approval dataset classified using C4.5.*


#### **Table 9.**

*A credit approval dataset classified using KNN.*

When KNN classifier is used on D, D4 and D5 datasets the following observations are made and presented at **Table 9**.

In the above **Table 9**, the first column presents the original dataset D and distorted datasets D4 and D5. The number of tuples in the

datasets considered for experimenting can be seen in the second column. The third column displays the number of training tuples classified as credit card approved as YES for KNN classifier. The fourth column displays the number of neighbors. The fifth column reveals the error rate of KNN classifier. The computation time is tabulated in the last column.

Similarly, the results are tabulated at **Table 10** when MLP classifier is used on D, D4 and D5 datasets.

In the above **Table 10**, the first column presents the original dataset D and the distorted datasets D4 and D5. The number of tuples in the datasets considered for experimenting can be seen in the second column. The third column displays the number of tuples classified for credit card approved as YES. The maximum number of


**Table 10.**

*A credit approval dataset classified using MLP.*

iterations for the MLP classifier is furnished in the fourth column. The fifth column reveals the training error rate of the MLP classifier. The computation time is tabulated in the last column. Based on the results presented above, the accuracy of classification of the datasets is presented in **Table 11**. The accuracy is the percentage of tuples that were correctly classified by a classifier.

The above **Table 11** presents the accuracy of the classifiers for Credit Approval, Haber Man, Tic-Tac-Toe and Diabetes datasets. The first column presents the dataset D, the distorted datasets D4 and D5. The second column presents the accuracy of classification obtained on Credit Approval dataset using SVM, ID3, C4.5, KNN and MLP classifiers. The third column presents the accuracy of classification obtained on Haber Man dataset using SVM, ID3, C4.5, KNN and MLP classifiers. The fourth column presents the accuracy of classification obtained on Tic-Tac-Toe dataset using SVM, ID3, C4.5, KNN and MLP classifiers. The fifth column presents the accuracy of classification obtained on Diabetes dataset using SVM, ID3, C4.5, KNN and MLP classifiers.

It is observed that the accuracy of the C4.5, KNN and MLP classifiers is better than the accuracy of the other classifiers for the distorted dataset D5 compared to the distorted dataset D4.

The above **Table 12** presents the comparison of accuracy. The first column presents the distorted datasets D4 and D5. The second column presents the accuracy obtained with the proposed MDP using the Credit Approval, Tic-Tac-Toe and Diabetes datasets for the SVM and KNN classifiers. The third column presents the accuracy of the existing geometric data perturbation methods using the Credit Approval, Tic-Tac-Toe and Diabetes datasets for the SVM and KNN classifiers. It is observed that the accuracy on the datasets using our proposed MDP is better than the accuracy of the existing geometric data perturbation. Moreover, the accuracy of the existing methods is reported only for the SVM and KNN classifiers on the Credit Approval, Tic-Tac-Toe and Diabetes datasets.

The proposed MDP has given good accuracy for the distorted dataset D5 compared to the distorted dataset D4, whereas the literature does not show any accuracy for the distorted dataset D5.

The results of k-means clustering are shown below at **Table 13**, when k = 2 (form two clusters).

In the above **Table 13**, the first column presents the datasets D, D4 and D5. The number of objects in the dataset considered for the experiment can be seen in the second column. The third column displays the number of objects belonging to cluster 1. The fourth column reveals the number of objects belonging to cluster 2. The computational time is presented in the last column. Based on the results presented above, the misclassification error rate of the datasets is presented in **Table 14**.



**Table 11.**

*Accuracy of classifiers (%).*


#### **Table 12.**

*Comparison of accuracy.*


#### **Table 13.**

*Clustering on credit approval dataset for k = 2.*


#### **Table 14.**

*Comparison of misclassification error-rate.*

The above **Table 14** presents the misclassification error rate. The first column presents the distorted dataset D4 and D5. The second column presents the error rate obtained on the proposed MDP using Credit Approval, Haber Man, Tic-Tac-Toe and Diabetes datasets.

Using the privacy metric defined in Eq. (4), the detailed calculation of the privacy quality used to analyze the attacks is shown below:

Consider the data matrix

$$\mathbf{D} = \begin{bmatrix} 4 & 2 & 2 \\ 1 & 1 & 1 \end{bmatrix}$$

The corresponding distorted data matrix obtained using the proposed MDP is

$$\mathbf{P} = \begin{bmatrix} -5.7881 & 1.3900 & 0.4371 \\ 1.3989 & -1.5826 & -2.3630 \end{bmatrix}$$

E is the estimated (random) data matrix as given below:

$$\begin{aligned} \mathbf{E} &= \begin{bmatrix} -3.5441 & 1.3900 & 0.3211 \\ 0.9321 & 2.4567 & -6.7860 \end{bmatrix} \text{ and calculating } \mathbf{D}' = \mathbf{E}^{-1} \ast \mathbf{P} \text{ is given below} \\ \mathbf{D}' &= \begin{bmatrix} -2.1461 & 0.2800 & 0.3211 \\ 1.8421 & 4.6767 & 4.6130 \end{bmatrix} \text{ and calculating } \mathbf{P}' \text{ is given below} \end{aligned}$$


**Table 15.**

*Analysis on attacks.*

$$\mathbf{P'} = \begin{bmatrix} -1.9261 & 0.6800 & 1.3211 \\ 3.6821 & 1.6821 & -4.5920 \end{bmatrix}$$

Then, substitute the above data matrices in Eq. (4) to analyze the following attacks. Naive-based Inference Attack: the RMSE is calculated by substituting the data matrices D and E. The result for the RMSE r is obtained as given below:

$$\mathbf{r} = \sqrt{\frac{1}{3} \sum\_{i=1}^{2} \left( \mathbf{D} - \mathbf{E} \right)^{2}} = \mathbf{1.9221}, \text{ Privacy } (\mathbf{D}, \mathbf{P}) = \mathbf{r}/2 = \mathbf{0.6796}$$

Reconstruction-based Inference Attack: the RMSE r is calculated by substituting the data matrices D and D′. The result r obtained is as given below:

$$\mathbf{r} = \sqrt{\frac{1}{3} \sum\_{i=1}^{2} \left( \mathbf{D} - \mathbf{D}' \right)^2} = \mathbf{1.6794}, \text{ Privacy } \left( \mathbf{D}, \mathbf{D}' \right) = \mathbf{r}/2 = \mathbf{0.839}$$

Distance-based Attack: the RMSE r is calculated by substituting the data matrices D and P′. The result r obtained is as given below:

$$\mathbf{r} = \sqrt{\frac{1}{3} \sum\_{i=1}^{2} \left( \mathbf{D} - \mathbf{P}' \right)^2} = \mathbf{1.70261}, \text{ Privacy } (\mathbf{D}, \mathbf{P}') = \mathbf{0.851}$$

Similarly the RMSE r is calculated for the original D and distorted datasets D4 and D5 and the results are furnished at **Table 15** as shown below.

In the above **Table 15**, the first column presents the Naive-based, reconstruction-based and distance-based attacks. The second column displays the RMSE (Root Mean Square Error) r calculated for the proposed MDP method on the Credit Approval, Haber-Man, Tic-Tac-Toe and Diabetes datasets. The third column reveals the RMSE calculated for the existing hybrid methods on the Credit Approval and Diabetes datasets. It is observed that the RMSE r of the proposed MDP method under the distance-based attack is high compared to the RMSE of the existing geometric data perturbation methods. The metric shows that the proposed MDP has better quality in preserving confidential data and provides high uncertainty for reconstructing the original data.

#### **8. Conclusion**

A Multiplicative Data Perturbation algorithm combining a geometric data perturbation method and the Discrete Cosine Transformation is proposed in this chapter. The proposed MDP is successfully implemented using the different multivariate datasets mentioned above.

The experiments on those datasets resulted in accurate classification and in an accurate number of clusters. Based on the result analysis, it is concluded that the proposed MDP algorithm efficiently preserves confidential data during perturbation and ensures privacy while being resilient against possible attacks. Earlier methods considered univariate datasets (e.g. the Terrorist dataset), whereas here a multivariate dataset is considered and a multiplicative data perturbation (MDP) is explored to effectively perturb the data in a centralized environment. This method perturbs the data effectively and is resilient towards attacks or threats while preserving privacy.

The research studies can explore the privacy issues on a Big Data as a future scope of research work in the following directions:

Improving data analytic techniques – gather all data, filter it with certain constraints, and use it to take confident decisions.

Algorithms for data visualization – in order to visualize the required information from a pool of random data, powerful algorithms are crucial for accurate results.

The future scope also includes research exploring many other methods; these latest methods can show various results.

#### **Author details**

Thanveer Jahan Vaagdevi College of Engineering, India

\*Address all correspondence to: tanveer\_j@vaadevi.edu.in

© 2022 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

### **References**

[1] Li L, Zhang Q. A privacy preserving clustering technique using hybrid data transformation method. In: 2009 IEEE International Conference on Grey Systems and Intelligent Services (GSIS 2009). Vol. 2010. Nanjing: IEEE; 2009. pp. 1502-1506. DOI: 10.1109/ GSIS.2009.5408151

[2] Natarajan AM, Rajalaxmi RR, Uma N, Kirubhkar G. A hybrid transformation approach for privacy preserving clustering of categorical data. In: Innovations and Advanced Techniques in Computer and Information Sciences and Engineering. Dordrecht: Springer. 2007. pp. 403-408. DOI: 10.1007/978-1- 4020-6268-1\_72

[3] Selva Rathnam S, Karthikeyan T. A survey on recent algorithms for privacy preserving data mining. International Journal of Computer Science and Information Technologies. 2015;**6**(2): 1835-1840

[4] Patel A, Patel K. A hybrid approach in privacy preserving data mining. In: 2017 International Conference on Innovative Mechanisms for Industry Applications (ICIMIA). Vol. 2. Ahmedabad, Gujarat, India: IEEE; 2016. p. 3

[5] Naga Lakshmi M, Sandhya Rani K. A privacy preserving clustering method based on fuzzy approach and random rotation perturbation. Publications of Problems & Application in Engineering Research. 2013;**4**(1):174-177

[6] Mary AG. Fuzzy–based random perturbation for real world medical datasets. International Journal of Telemedicine and clinical Practices. 2015;**1**(2):111-124. DOI: 10.1504/ IJTMCP.2015.069749

[7] Naga Lakshmi M, Sandhya Rani K. Privacy preserving hybrid data transformation based on SVD. International Journal of Advanced Research in Computer and Communication Engineering. 2013;**2**(8). ISSN 2278-1021

[8] Jalla HR, Girija PN. An efficient algorithm for privacy preserving data mining using hybrid transformation. International Journal of Data Mining & Knowledge Management Process. 2014; **4**(4):45-53. DOI: 10.5121/ijdkp.2014.4404

[9] Manikandan G, Sairam N, Saranya C, Jayashree S. A hybrid privacy preserving approach in data mining. Middle- East Journal of Scientific Research. 2013; **15**(4):581-585. DOI: 10.5829/idosi. mejsr.2013.15.4.1.991

[10] Saranya C, Manikandan G. Study on normalization techniques for privacy preserving data mining. International Journal of Engineering and Technology (IJET). 2013;**5**(3):2701-2704

[11] Geetha Mary AN, Iyenger NSC. Nonadditive random data perturbation for real world data. Procedia Technology. 2012;**4**:350-354. DOI: 10.1016/j. protcy.2012.05.053

[12] Aggarwal CC, Yu PS. A condensation approach to privacy preserving data mining. In: Proceedings of International Conference on Extending Database Technology (EDBT). Vol. 2992. Heraklion, Crete, Greece: Springer; 2004. pp. 183-199

[13] Liu K, Kargupta H, Ryan J. Random projection-based multiplicative data perturbation for privacy preserving distributed data mining. IEEE Transactions on Knowledge and Data Engineering (TKDE). 2006;**18**(1):92-106 [14] Chen K, Liu L. "A Random Rotation Perturbation Based Approach to Privacy Preserving Data Classification", CC-Technical Report GIT-CC-05-12. USA: Georgia Institute of Technology; 2005

[15] Lui K, Giannella C, Kargupta H. An Attacker's view of distance preserving maps for privacy preserving data mining. In: Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases(Pkdd'06). Berlin, Heidelberg: Springer-Verlag; 2006

[16] Xu H, Guo S, Chen K. Building confidential and efficient query services in the cloud with RASP data perturbation. IEEE Transactions on Knowledge and Data Engineering. 2014;**26**(2):322-335

[17] Oliveira SR, Zaiane OR. Privacy preserving clustering by data transformation. Journal of Information and Data Management (JIDM). 2010; **1**(1):37–51

[18] Guo S, Wu X. Deriving private information from arbitrarily projected data. In: Proceedings of the 11th European conference on principles and practice of knowledge Discovery in databases (PKDD07). Warsaw, Poland. 2007

[19] Balasubramaniam S, Kavitha V. A survey on data retrieval techniques in cloud computing. Journal of Convergence Information Technology. 2013;**8**(16):15-24

[20] Liu J, Yifeng XU. Privacy preserving clustering by random response method of geometric transformation. Harbin, Heilong Jiang, China: IEEE. 2010: 181-188. DOI: 10.1109/ICICSE.2009.31

[21] Balasubramaniam S, Kavitha V. Geometric data perturbation-based personal health record transactions in cloud computing. The Scientific World Journal. 2015;**2015**:927867, 1-927869. DOI: 10.1155/2015/927867

[22] Chen K, Lui L. Geometric Data Perturbation for Privacy Preserving Outsourced Data Mining. London: Springer-Verlag Limited; 2010

[23] Hyvarinen AK, Oja E. Independent Component Analysis. New York/ Chichester/Weinheim/Brisbane/ Singapore/Toronto: Wiley-Interscience; 2001

[24] Brankovic L, Estivill-Castro V. Privacy issues in knowledge discovery and data mining. In: Proceedings of Australian Institute of Computer Ethics Conference (AICEC99). Melbourne, Victoria, Australia: Lecture Notes in Computer Science. 1999;**4213**:297-308. DOI: 10.1007/11871637\_30

[25] Liu K, Giannella C, Kargupta H. An Attacker's view of distance preserving maps for privacy preserving data mining. In: European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD). Berlin, Germany; 2006

[26] Li L, Zhang Q. A privacy preserving clustering technique using hybrid data transformation method. In: Grey Systems and Intelligent Services, 2009 GSIS 2009, IEEE International Conference. Nanjing, China: IEEE; 2010. DOI: 10.1109/GSIS.2009.5408151, 08

[27] Rajesh N, Sujatha K, Kumar AALS. Survey on privacy preserving data mining techniques using recent algorithms. International Journal of Computer Applications Foundation of Computer Science (FCS). 2016;**133**(7):30-33

[28] Patel L, Gupta R. A survey of perturbation technique for privacypreserving of data. International Journal of Emerging Technology and Advanced Engineering Website. 2013;**3**(6):162-166

#### **Chapter 4**

## FAIR Data Model for Chemical Substances: Development Challenges, Management Strategies, and Applications

*Nina Jeliazkova, Nikolay Kochev and Gergana Tancheva*

#### **Abstract**

Data models for representation of chemicals are at the core of cheminformatics processing workflows. The standard triple (structure, properties, and descriptors) traditionally formalizes a molecule and has been the dominant paradigm for several decades. While this approach is useful and widely adopted in academia, the regulatory bodies and industry have complex use cases and impose the concept of chemical substances, applied to multicomponent, advanced, and nanomaterials. The chemical substance data model is an extension of the molecule representation and takes into account the practical aspects of chemical data management, emerging research challenges, and discussions within academia, industry, and regulators. The substance paradigm must handle a composition of multiple components. Mandatory metadata is packed together with the experimental and theoretical data. Data model elucidation poses challenges regarding metadata, ontology utilization, and adoption of FAIR principles. We illustrate the adoption of these good practices by means of the Ambit/eNanoMapper data model, which is applied for chemical substances originating from ECHA REACH dossiers and for the largest nanosafety database in Europe. The Ambit/eNanoMapper model allows development of tools for data curation, FAIRification of large collections of nanosafety data, ontology annotation, data conversion to standards such as JSON, RDF, and HDF5, and emerging linear notations for chemical substances.

**Keywords:** FAIR, database, data model, chemical substance, nanomaterial, structure, molecular descriptors, linear notation, ontology

#### **1. Introduction**

Since the emergence of the term cheminformatics within the context of pharmaceutical industry activities around the end of the twentieth century, an adequate chemical structure representation has been essential for the efficient application of cheminformatics methodologies [1]. The chemical structure is at the core of various cheminformatics activities: molecular property prediction via Quantitative Structure-Property Relationships/Quantitative Structure-Activity

Relationships (QSPR/QSAR), searching for new biologically active compounds, lead optimization, virtual screening, combinatorial chemistry, etc. The centrality of molecular structure gives the primary flavor that distinguishes these activities from the classical chemometrics approaches [2], focused on data mining of the analytical and experimental results in order to extract useful information for the study of chemical objects (e.g. the popular structure elucidation task). The chemometrics techniques from the 70s were transferred, adapted, and further developed within the field of "mathematical chemistry", with a strong focus on graph theory applications for molecular structure representation in the 80s and 90s, and, together with the focus on 3D structure information and the movement toward big data, resulted in the birth of modern cheminformatics. The main motto "from data to knowledge" summarizes the data workflow from studying chemical objects toward gaining/formalizing chemical information and generating chemical knowledge as models, classifiers, etc. An adequate representation of the structures is required for all stages of the data management workflow. The development of chemical object representations is a dynamic process, which is strongly influenced by the practical needs of the industry and, lately, regulatory bodies. The novel deep learning technologies are changing the ways the structure information is used (e.g. linear notations can be directly read by artificial neural networks, and vector representations of the structures can be generated). The chemical substance model is a logical extension of the traditional molecule representation and takes into account practical aspects of chemical data management and new emerging research challenges. Finally, the FAIR (Findable, Accessible, Interoperable, and Reusable) principles [3] were widely popularized and strongly encouraged as a needed background for efficient ongoing interconnections and activities within academia, industry, and regulators. On the other hand, the substance paradigm is based on a more complex approach toward representation of the chemical objects and must handle multiple material compositions, enriched with mandatory metadata and corresponding ontology annotations in order to comply with FAIR principles.

In the following sections, the reader will be led into a journey, starting from the classical molecular data model, based on chemical structure, and going through complex representations of chemical substances, industry use cases, and nanomaterials (NMs). The logical evolution of the data model elucidation will be demonstrated within the context of various challenges. The importance of metadata will be discussed as well as the adoption of FAIR principles. The good practices will be exemplified by the Ambit/eNanoMapper data model and real chemical substances from ECHA REACH dossiers. This chapter also discusses the FAIRification of large collections of data and the importance of standard data formats and emerging linear notations for chemical substances.

#### **2. Classical cheminformatics paradigm for molecular data: structure, properties, and descriptors**

Cheminformatics is a vast interdisciplinary field with a large inheritance from data mining, graph theory, and mathematical chemistry, enriched with modern methods for big data and artificial intelligence approaches. A common denominator of this methodological variety is the focus on the chemical structure. The centrality of chemical structure is also evident in other domains strongly related to cheminformatics, such as reaction informatics, bioinformatics (e.g. proteomics and metabolomics), toxicogenomics, etc. In QSPR/QSAR analysis, physicochemical


properties and biological activities are considered as functions of the molecular structure, i.e. P = f(S) or A = f(S). Also, equation reversal is observed for the chemometrics' structure elucidation task: S = f<sup>-1</sup>(P), e.g. the structure is obtained from the spectral data (the spectrum is the property vector, P, in this case). The representation of the chemicals is the starting point for any of these activities. The molecular structure is the principal "model" that encompasses the most important bits of the current chemical knowledge, used for further data processing and modeling.

The hierarchy of basic chemical objects' representations is shown in **Figure 1**. It starts from the smallest chemical objects, atoms, and bonds, which are the building blocks for the chemical structure. The connection table (CT) encodes the chemical graph and is the most widely used approach for structure information representation on a topological level. 2D coordinates and 3D coordinates together with the CT fully describe a chemical structure. Traditionally, the transition from structure to property is helped by an intermediate layer of descriptors, D, i.e. the first step is D = f1(S) and then P = f2(D).

The classical and widely adopted data model of a molecule representation is defined as a triple of the type (S-structure, D-descriptors, and P-properties), as illustrated in **Figure 2**. Different structure representations are systemized in several levels: 0D/1D – constitution, 2D – topology, 3D – geometry, 4D – conformation, and the QM (quantum mechanics) level with detailed electronic structure information. The intermediate layer of molecular descriptors is derived computationally or experimentally and represents useful information for the molecule. Structural descriptors are an important subset of descriptors, used as the principal interface between structure and properties. The structure is reduced to a simpler representation, namely a point in an n-dimensional vector space. A variety of cheminformatics tasks, such as searching, classification, virtual screening, clustering, and measuring distance between the objects, can be performed in terms of points in the chemical space of so-called "patterns." Traditionally, the chemical patterns are considered more user-friendly for the classical machine learning methods than the original chemical objects.

**Figure 3** shows various structure representations for the molecule of benzene: connection table, 2D and 3D coordinates (with the corresponding graphical model), linear notations – SMILES, InChI and SLN, the distance matrix as a topological descriptor, and registry numbers – CAS number, EC number, and PubChem CID. **Figure 3** also exemplifies different descriptors: constitutional (NA, NDB), topological (Wiener index and kappa1 index), and geometrical (eccentricity and radius of gyration), plus the third data layer with molecular properties: LogP, RI, BP, etc.
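As a small illustration (not part of the original chapter), several of these representations and descriptors for benzene can be generated with the open-source RDKit toolkit, assuming it is installed:

```
from rdkit import Chem
from rdkit.Chem import Descriptors

mol = Chem.MolFromSmiles("c1ccccc1")              # benzene from its SMILES notation

print(Chem.MolToSmiles(mol))                      # canonical SMILES
print(Chem.MolToInchi(mol))                       # InChI linear notation

dist = Chem.GetDistanceMatrix(mol)                # topological distance matrix
wiener = int(dist.sum() / 2)                      # Wiener index from the distance matrix
print("NA =", mol.GetNumAtoms(), "Wiener index =", wiener)

print("MolWt =", Descriptors.MolWt(mol))          # constitutional descriptor
print("LogP  =", Descriptors.MolLogP(mol))        # calculated property (Crippen LogP)
```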

The majority of chemical database implementations are based on the classical structure paradigm – the molecule triad (S, D, P). This paradigm has been used for several decades, and even nowadays it is the predominant base layer for the public

**Figure 1.**

*Hierarchy of chemical objects: From primitive/small objects (left) to larger and complex objects (right).*

#### **Figure 2.**

*Classical triad model of a molecule: (structure, descriptors, and properties).*

#### **Figure 3.**

*Various structure representations, descriptors, linear notations, and registry (database) indices for the molecule of benzene.*

chemical databases. Naturally, the cheminformatics community and academic circles feel quite comfortable within the triad model. It has been like a "protecting bubble" and proved its usefulness as a "ground zero" for the chemical information workflow.


However, staying in the limits of the classical (S, D, P) model may hinder or isolate the cheminformatics field evolution. The "conveniences" and simplicity of the (S, D, P) model may prevent the establishing of efficient interconnections between the cheminformatics field and other scientific areas, especially in the context of industry and regulators. In the following sections, we describe the further development of the classical triad model into the paradigm of chemical substances (see the last element from the chemical objects chain in **Figure 1**).

#### **3. Data models for chemical substances**

The chemical structure describes a well-defined molecule. Unlike chemical structures, real chemical objects or industrially manufactured ones are not pure substances. Such substances are composed of several components; hence, they cannot be associated with a single unique structure. The regulatory authorities typically need information on chemicals as produced by industry. Another data gap emerges from the lack of tools to consider metadata about the performed experiments and measurements in cheminformatics use cases, e.g. QSAR model building, while such metadata is crucial for toxicologists and regulators. The substance has to be represented as the entirety of its components with their roles and relations, and has to include rich metadata to enable unambiguous description of experimental results from many biological assays, physicochemical characterizations, exposure, and environmental fate tracking. The challenges increase with the representation of nanomaterials and advanced materials. Having a consensus on the chemical substance definition is a challenge also due to the discrepancies between the approaches of various regulatory institutions.

According to the International Union of Pure and Applied Chemistry (IUPAC) definition [4], a substance "*is matter of constant composition best characterized by the entities (molecules, formula units, atoms) it is composed of. Physical properties such as density, refractive index, electric conductivity, melting point etc. characterize the chemical substance*." Under Registration, Evaluation, Authorisation and Restriction of Chemicals (REACH), the concept of substance is clearly described [5]: "*A substance is a chemical element and its compounds in the natural state or the result of a manufacturing process. In a manufacturing process, a chemical reaction is usually needed to form a substance*." Under REACH, a chemical substance is composed of three types of components: constituents, impurities, and additives. Chemical substances can be mono-constituent (one main constituent is present to at least 80% (w/w)), multi-constituent (more than one main constituent is present in a concentration between 10% and 80% (w/w)), or UVCB (Substance of Unknown or Variable composition, Complex reaction products or Biological materials). The REACH definition of a substance encompasses all forms of substances and materials on the market, including nanomaterials.

The Government of Canada [6] defines a chemical substance as: "*Elements or compounds that are deliberately created, produced as a by-product of other processes or occurring naturally in the environment.*" The Canadian Environmental Protection Act (CEPA) also requires notification for new substances put in two lists: the Domestic Substances List (DSL) and the Non-DSL (NDSL). Toxic Substances Control Act (TSCA) [7] requires the United States Environmental Protection Agency (US EPA) to compile, keep current, and publish a list of each chemical substance that is manufactured or processed, including imports, in the United States for uses under TSCA. TSCA defines a "chemical substance" *as any organic or inorganic substance of a particular molecular identity, including any combination of these substances occurring in whole or in part as a result of a*

*chemical reaction or occurring in nature, and any element or uncombined radical.* The Japanese Act on the Evaluation of Chemical Substances and Regulation of Their Manufacture is performed under Chemical Substance Control Law (CSCL) [8].

The underlying data model is of crucial importance for the efficiency of any cheminformatics, nanoinformatics, and bioinformatics workflow. Specifically, nanomaterial (NM) representations are the primary subject of the new and rapidly evolving field of nanoinformatics. According to ISO TS 80004–1:2015, definition of a nanomaterial is: "*a material with any external dimension in the nanoscale approximately 1 nm to 100 nm and/or having internal structure or surface structure in the nanoscale*" [9]. The European Commission [10] definition of a nanomaterial is: "*A natural, incidental or manufactured material containing particles, in an unbound state or as an aggregate or as an agglomerate and where, for 50 % or more of the particles in the number size distribution, one or more external dimensions is in the size range 1 nm-100 nm. In specific cases and where warranted by concerns for the environment, health, safety, or competitiveness, the number size distribution threshold of 50 % may be replaced by a threshold between 1 and 50%.*" The substance definition in the European Union regulation REACH [5] and in the Classification, Labelling and Packaging (CLP) Regulation includes all forms of substances and materials on the market, including NMs, i.e. NM is treated as a particular case of a chemical substance.

There are several major data models highlighting the path for storing chemical substances in a database. IUCLID [11], the primary software for preparation and submission of REACH dossiers, stores and maintains data on the hazardous properties of chemical substances and mixtures, as well as their use and associated exposure levels. This is also the first system that fully implements the OECD harmonized templates (HT) [12] on the basis of OECD test guidelines and agreed standards. The BioAssay Ontology (BAO) [13] provides a foundation for standardizing assay descriptions and endpoints with capabilities enabling the retrieval of data relevant to a query. This is the first ontology to describe this domain, and certainly the first time that bioassay and HTS (high throughput screening) data have been represented using expressive description logic [14].

CODATA, the International Council for Science: Committee on Data for Science and Technology (www.codata.org), and VAMAS, an international pre-standardization organization concerned with materials test methods (www.vamas.org), jointly foster the development of a uniform description system for NMs to address the diversity and complexity of nanomaterials. CODATA [15] encourages the interoperability and usability of such data using a framework with four basic information categories – General Identifiers, Characterization, Production, and Specification – and numerous subcategories and descriptors for detailed information. Most of the terms and concepts used in the descriptive system are easily understandable for people from different fields, as it is expected to be used by different groups of users for research reports, NM identification in regulations and standards, specifying NMs in commercial transactions, etc. [16].

ISA-TAB [17] defines three basic layers for sharing metadata related to experiments: Investigation, Study and Assay; the actual experimental data is stored on a separate fourth layer and referenced by the ISA data [18]. Additional configuration settings and ontology annotations could be considered as additional layers to this complex multilayered approach. The ISA model is non-standardized and user-defined and can include image files, spreadsheets, and protocol documents, forwarded to appropriate fields in the Study file table. The basic approach to present chemical compounds in ISA-TAB is an ontological record, which usually points to a single chemical structure. The ISA model can be


serialized via the ISA-TAB [19] format as multiple spreadsheet files or via ISA-JSON [20], where data is stored in a more convenient fashion as JSON (JavaScript Object Notation [21]) files.

Although the technical approaches and the use case scenarios of the four data models differ, a unifying logic could be traced. The need for a generally "larger" chemical data object is seen not only in ECHA's REACH dossiers but also in all regulatory platforms (e.g. CEPA, TSCA, etc.), and similar courses of action can be observed in the evolution of public chemical databases (e.g. PubChem has the notion of a chemical substance). The foundation of a more sophisticated data model for substances is laid with three principal pieces of information: (1) identification, (2) material/substance description and composition, and (3) measurement records. This is practically illustrated in **Figure 4** for the substance "benzene." The substance data model obviously includes a collection of standard triples:

Structures = {(Sk, Dk, Pk) | k = 1,2,...,m},

However, a collection of new objects of the type "chemical substance" is needed to encompass the three principal levels. An identification layer may include identifiers and names. The challenge of unique substance identifiers and names is discussed in the last section of the book chapter.

A dynamic approach of material description is needed in at least two dimensions. Apart from multiple components, the industry and regulators, also, need to handle multiple compositions of the same substance, e.g. there may be different manufacturing processes for the same products. The latter is demonstrated with two different compositions of the "benzene" substance, as shown in **Figure 4**. The first one contains three components: benzene as main constituent, toluene as impurity, and some nonaromatic hydrocarbons. Toluene, which is an impurity in composition 1, is included in the second composition as well, but with a different role – it is a constituent of the "benzene" substance. The data model requires new data entities like:

Substance = {
     {name1, name2, ..., id1, id2, ...};

#### **Figure 4.**

*Substance of benzene with several different compositions and information grouped in three layers: Identification, material, and measurements (example is taken from the public records of the ECHA's dossiers and is also accessible via ambit-LRI database web interface).*

```
     {composition1, composition2, ...};
     {measurement1, measurement2, ...};
}
and
Composition = {
     {(S1, D1, P1), (S2, D2, P2), ...};
     {component relations};
     {component concentrations};
}
```
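A minimal sketch of these entities, written as Python dataclasses (the field names and example identifiers are illustrative assumptions, not the actual Ambit/eNanoMapper schema), could look as follows:

```
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Component:
    """Standard triad (structure, descriptors, properties) plus its role in a composition."""
    structure: str                       # e.g. a SMILES or InChI string
    descriptors: dict = field(default_factory=dict)
    properties: dict = field(default_factory=dict)
    role: str = "CONSTITUENT"            # CONSTITUENT | IMPURITY | ADDITIVE
    concentration: Optional[str] = None  # e.g. ">= 80 % (w/w)"

@dataclass
class Composition:
    composition_id: str
    components: List[Component] = field(default_factory=list)

@dataclass
class Substance:
    names: List[str]
    identifiers: List[str]
    compositions: List[Composition] = field(default_factory=list)
    measurements: List[dict] = field(default_factory=list)  # protocol applications / endpoints

# Example: the "benzene" substance with two alternative compositions (cf. Figure 4).
benzene_substance = Substance(
    names=["benzene"],
    identifiers=["SUBSTANCE-LOCAL-ID-1"],  # hypothetical internal identifier
    compositions=[
        Composition("composition-1", [
            Component("c1ccccc1", role="CONSTITUENT", concentration=">= 80 % (w/w)"),
            Component("Cc1ccccc1", role="IMPURITY"),
        ]),
        Composition("composition-2", [
            Component("c1ccccc1", role="CONSTITUENT"),
            Component("Cc1ccccc1", role="CONSTITUENT"),
        ]),
    ],
)
```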
In **Figure 4**, the term "benzene" is used for naming two different types of objects. There is a chemical structure of the benzene molecule which is the main constituent of the "benzene" substance. Hence, clear communication requires proper context in terms of data object types. On the other hand, the molecule of benzene could participate in other chemical substances with different roles. As it is illustrated in **Figure 5**, benzene molecule is an impurity component.

Also, the composition data entity should not be mistaken for the substance entity, and the structure identifiers (e.g. the benzene molecule CAS number 71-43-2 and InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H) should not be mistaken for the substance identifiers. The latter is a subtle error but is a common mismatch due to a long-term dominance of the structure-centered thinking. For example, in the nanoinformatics field, the CAS number is wrongly associated with the whole nanomaterial instead of with a particular NM component. Also in **Figures 4** and **5**, identifiers of the substance compositions are shown as well. The complicated relationships between the three types of entities – structures, substances, and compositions – require identifiers for all entity types. The shown examples utilize internal hash-based identifiers, uniquely generated by the Ambit-LRI database system. The principal difference between

#### **Figure 5.**

*The substance of "3a,4,7,7a-tetrahydro-4,7-methanoindene" with a benzene molecule as an impurity component within the first composition (example is taken from the public records of the ECHA's dossiers and is also accessible via ambit-LRI database web interface).*


structure and substance also dictates different approaches for database searching for these types of objects. Structure collections can be searched with the well-known cheminformatics methods (e.g. identity search, similarity, and substructure search), but the resulting hit list with structures should be logically related with the other types of data entities, i.e. substances and compositions. For instance, benzene structure is a component, playing different roles within about 200 different substances from the public ECHA's dossiers. Also, within the context of experimental data handling, there is a difference between the properties of the "entire" chemical substance, stored on the Measurements layer (see **Figure 4**) and the "nominal" properties of the component, as they are treated in the standard triad model.

#### **4. FAIR principles**

A chemical substance database is expected to facilitate analysis of chemical and physical properties, biological analyses, and human and environmental impacts, particularly in the context of safety and risk assessment. Integration of data from multiple sources (e.g. for the needs of read-across) is only possible if original measurements are combined with rich metadata and obey a set of well-established good practices for data management. In 2016, Scientific Data [3] published "The FAIR Guiding Principles for scientific data management and stewardship." The authors provide guidance for improving the discoverability, accessibility, interoperability, and reuse of data, popularized as FAIR (Findable, Accessible, Interoperable and Reusable). The principles (see **Figure 6**) emphasize machine capability (i.e. the ability of computing systems to find, access, interact with, and reuse data with no or minimal human intervention), since humans increasingly rely on computational support for data processing as a result of the increase of the volume, complexity, and speed of data creation.

The four foundational principles guide data producers and publishers in how to increase the value of modern scholarly digital publishing. The FAIR principles also apply to the algorithms, tools, and workflows used for data generation.

**Figure 6.** *FAIR principles: findable, accessible, interoperable, and reusable.*

The GO-FAIR initiative (https://www.go-fair.org/) gained much popularity in the last few years and strongly endorses the deployment of as many FAIR data resources as possible. On its dedicated site, GO-FAIR recommends a workflow of seven basic stages for transforming a non-FAIR data resource into a FAIR one (**Figure 7**). Rich and descriptive metadata, usable by machines, is a key tool to evaluate and answer the questions being asked about the data. FAIR principles allow experimental data to be used beyond their origin to solve scientific problems, fill in missing data, reuse data in applications, do modeling, and provide tools for other scientific, industrial, and regulatory needs.

Step 3 of the FAIRification workflow is the most important one, namely the definition of a semantic data model for chemical objects representation. In this sense, the efforts for substance data model elucidation are also efforts for FAIR data. The other pillars of primary importance for data FAIRness are the inclusion of rich metadata (step 6), ontology annotations, and data linking with globally unique identifiers (step 4).

The FAIR principles are complemented by the so-called CARE (Collective benefit, Authority to control, Responsibility, Ethics) principles [22]. The CARE principles promote Indigenous data governance and address concerns about the rights and interests of Indigenous people in their data throughout the data lifecycle (as a collective, to have a say in how their data is actually used), as well as trust and accountability in the contexts of traditional knowledge and scientific data oriented toward improving human well-being.

TRUST (Transparency, Responsibility, User focus, Sustainability, and Technology) is yet another set of principles [23] aligned with FAIR. To make data FAIR while


preserving them over time requires trustworthy digital repositories (TDRs) with sustainable governance and organizational frameworks, reliable infrastructure, and comprehensive policies supporting community-agreed practices. TDRs actively preserve data amid the dynamics of technology and stakeholder requirements. The TRUST principles facilitate communication with all stakeholders, providing repositories with guidance on good practices.

#### **5. Ambit/eNanoMapper data model**

Safe-by-Design approaches are encouraged and promoted through regulatory initiatives and numerous scientific projects. Experimental FAIR data are the basis of risk assessment processing workflows. The Ambit/eNanoMapper [24, 25] database is an open-source chemical data management solution that currently holds the largest compilation of searchable nano-EHS (Environment, Health, and Safety) data in Europe from multiple completed and most of the ongoing H2020 nano-EHS projects. Ambit is an open-source cheminformatics platform with over 30 modules implemented on top of CDK [26, 27]. It is funded by CEFIC-LRI (http://cefic-lri.org/) for linking the Ambit [28] system with the IUCLID substance database to support read-across of substance data, category formation, REST APIs, a web interface, substance and structure search facilities, toxicity prediction, and QSAR models. The eNanoMapper database is an extension of the Ambit cheminformatics platform.

The implementation of substance support in Ambit was inspired by the four data models discussed in the previous section. The data model has been developed, tested, and improved for about 15 years, driven by use cases and feedback from multiple users. The Ambit/eNanoMapper data schema is visualized in **Figure 8**. It contains a variety of data components (entities) serving different roles in the representation of items of information about substances and measurements. The data model entities may have different implementations at different stages of the data processing workflow.


The data model is a conceptual representation of chemical substances and can be applied with different technologies, enabling interoperability and data linking, internally and externally via REST APIs.

The substances are characterized by their compositions and are identified by names and IDs. The model supports multiple compositions, with one or more components, each with an assigned role. Each component is treated via the standard triad approach. The results from physicochemical and biological measurements are treated as properties of the entire substance and are handled via protocol applications. An efficient experimental protocol description is crucial for the correct communication of scientific results and for the creation of FAIR data resources. The latter is achieved by means of a rich set of metadata parameters with a flexible logical organization (e.g. the full experimental data graph defined in the ISA data model).

#### **Figure 8.** *Schema of the Ambit/eNanoMapper data model.*

The event of applying a test or experimental protocol to a chemical substance is described by a "protocol application" entity. Each protocol application consists of a set of "measurements" for a defined "endpoint" under given "conditions." A measurement result can be a numeric value, a string value, or a link to a raw data file (e.g. an IR spectrum, a microscopy image, HTS data, etc.). The measurement entity is also a dynamic data structure, supporting either a single number or an interval with lower and upper values, together with specified qualifiers. Ambit/eNanoMapper handles miscellaneous cases of single-datum storage, exemplified here for the boiling point (BP) endpoint:

BP = 135°C, BP > 130°C, 120°C < BP ≤ 130°C, BP ≈ 3°C, BP = 3 ± 0.5°C.

The measurement errors are represented via a separate qualifier, and different approaches to uncertainty are supported (e.g. SD – standard deviation, MAE – mean absolute error). The same flexibility applies to storing metadata parameters. Each measurement is packed with a dynamic list of experimental conditions (experimental factors such as concentration, time, etc.), which are considered "lower-level" metadata parameters. The "higher-level" metadata, namely the "protocol application", is described by another dynamic list of parameters: links to Standard Operating Procedures (SOPs), guidelines, publications, data quality, etc. The data for a particular substance may contain many "protocol applications."
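
The entity hierarchy described above (substance, composition components, protocol applications, measurements, conditions) can be sketched as a small set of data classes. The following Python sketch is illustrative only: it is not the actual Ambit/eNanoMapper Java implementation or its JSON serialization, and all class and field names are approximations of the entities described in this section.

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Measurement:
    """A single result for an endpoint: a value, an interval, or a link to raw data."""
    endpoint: str                           # e.g. "BP" or "NET FPG SITES"
    value: Optional[float] = None           # exact value, if any
    lo_value: Optional[float] = None        # interval bounds with qualifiers
    lo_qualifier: Optional[str] = None      # e.g. ">", ">="
    up_value: Optional[float] = None
    up_qualifier: Optional[str] = None      # e.g. "<", "<="
    error: Optional[float] = None           # uncertainty value
    error_qualifier: Optional[str] = None   # e.g. "SD", "MAE"
    unit: Optional[str] = None
    text_value: Optional[str] = None        # string result or a link to a raw data file
    conditions: dict = field(default_factory=dict)  # lower-level metadata, e.g. {"Concentration": "10 ug/mL"}


@dataclass
class ProtocolApplication:
    """One application of an experimental protocol to a substance."""
    uuid: str = field(default_factory=lambda: str(uuid.uuid4()))
    protocol: str = ""                                # e.g. "COMET"
    parameters: dict = field(default_factory=dict)    # higher-level metadata, e.g. {"Cell type": "A549"}
    measurements: List[Measurement] = field(default_factory=list)


@dataclass
class Component:
    """A structure participating in a substance composition, with an assigned role."""
    structure: str   # e.g. a SMILES string
    role: str        # e.g. "core", "coating"


@dataclass
class Substance:
    uuid: str = field(default_factory=lambda: str(uuid.uuid4()))
    name: str = ""
    composition: List[Component] = field(default_factory=list)
    protocol_applications: List[ProtocolApplication] = field(default_factory=list)


# Example: BP = 3 ± 0.5 °C, with the error stored via a separate qualifier
bp = Measurement(endpoint="BP", value=3.0, error=0.5, error_qualifier="SD", unit="°C")
```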

**Figure 9** illustrates different levels of metadata: protocol parameters (Cell type = A549, Method = COMET, and Technical replicates = 3) and varied experimental conditions (Concentration and Exposure time). The same protocol "COMET" can be applied with different parameters (e.g. a different cell line or number of replicates), and another protocol application will be obtained.



#### **Figure 9.**

*Protocol application data: COMET protocol with measurements of endpoint NET FPG SITES, applied for substance NM-220 from public database NanoReg2.*

The protocol applications that are related to one another are grouped into an "Investigation" entity. Several different substances to which the same "protocol application" has been applied can be grouped via the "Assay" entity. The higher-level components of the model, such as Substance, Protocol Application, Investigation, and Assay, have automatically generated UUIDs, which are used for linking and grouping the measurements.

A transition from the standard triad (S, D, P) to the extended substance data model is challenging for experts from different domains, for various reasons. Typically, the large volume of metadata compared to the actual experimental data (the ratio of metadata volume to data volume can reach 10:1 or even higher) is a stumbling block for experimentalists and cheminformatics experts, since considerable effort is needed for metadata generation and systematization. However, such reluctance leads to non-FAIR data and poor findability, accessibility, interoperability, and reusability. Currently, project funding institutions are also challenged by the sustainability and reusability of project results, as well as by the curation of data from past research projects and scientific publications.

#### **6. Tools for data FAIRification**

Huge volumes of already generated chemical substance data are non-FAIR. One of the predominant ways researchers store their scientific results is in the form of spreadsheets. We demonstrated that FAIRification can be achieved through the multi-step FAIRification workflow (see **Figure 7**), using the semantic data model of Ambit/eNanoMapper. The analysis of data and metadata is an iterative process, requiring consultations with domain experts to explain the file content and layout, provide SOPs, and supply correct ontology annotations. Generally, the original raw data needs to be converted to the substance data model. For this purpose, a dedicated software tool, NMDataParser [29], was developed to map the spreadsheets into the Ambit/eNanoMapper semantic model. The latter tool enables the most important stage of the FAIRification process – mapping non-FAIR data (e.g. an Excel file) into an existing semantic model.

#### **Figure 10.**

*Parsing of an Excel spreadsheet with CFE assay measurement data; part of the JSON configuration for relative addressing of the position of the error values is shown (top right); the bottom right visualizes the data mapped into the Ambit/eNanoMapper substance data model.*

#### **Figure 11.**

*Web-based template wizard for automatic generation of standardized and harmonized templates with corresponding JSON configuration for NMDataParser.*


The NMDataParser is a configurable Excel file parser, developed in Java on top of the Ambit data model and with extensive use of the Apache POI library [30]. It was designed and developed as an open-source library to enable the import of substance data from the Excel spreadsheets with potentially unlimited layout permutations. Different row-based, column-based, or block-based spreadsheet data organizations are supported. The parser is configured via a separate JSON file with its own syntax for mapping the custom spreadsheet structure into the data model components (see **Figure 10**). The parser code, the JSON configuration syntax, documentation, and example files are available at https://github.com/enanomapper/nmdataparser/.
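
The actual JSON configuration syntax is documented in the NMDataParser repository linked above. Purely to illustrate the underlying idea (declaratively mapping spreadsheet cells onto data model fields), the following Python sketch reads a hypothetical spreadsheet with openpyxl and applies a minimal, made-up mapping; the file name, sheet names, cell addresses, and field names are assumptions, not the NMDataParser syntax.

```python
from openpyxl import load_workbook

# Made-up declarative mapping: which sheet and cell feeds which data model field.
# The real NMDataParser JSON configuration has its own, much richer syntax.
MAPPING = {
    "substance_name": ("Materials", "B2"),
    "protocol":       ("CFE", "B1"),
    "endpoint":       ("CFE", "B3"),
    "value":          ("CFE", "C5"),
    "error":          ("CFE", "D5"),
    "unit":           ("CFE", "E5"),
}


def parse_spreadsheet(path: str, mapping: dict) -> dict:
    """Read the configured cells and return a flat record for the data model."""
    workbook = load_workbook(path, data_only=True)
    record = {}
    for model_field, (sheet, cell) in mapping.items():
        record[model_field] = workbook[sheet][cell].value
    return record


if __name__ == "__main__":
    print(parse_spreadsheet("cfe_assay.xlsx", MAPPING))
```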

While one JSON configuration file can be applied to multiple Excel files with a similar layout, some complex spreadsheets (e.g. HTS) may require multiple JSON configurations for a single Excel file. The expertise gained from many years of manual and exhausting configuration of Excel file parsing helped in developing a harmonized and continuously growing set of standard templates, which are available via a web interface (see **Figure 11**) that automatically generates the templates with the corresponding JSON configuration attached.

#### **7. Ambit/eNanoMapper applications, APIs, and services**

Once the data is imported into an Ambit/eNanoMapper database instance, it is immediately available (publicly or with restricted access) via the web user interface and machine-readable via an API supporting multiple serialization formats. Non-public datasets are handled by an authentication and authorization system (API keys and OAuth2 plans for direct or delegated access grants are supported).

**Figure 12.** *Data input and output for the Ambit/eNanoMapper database.*

Content from a variety of sources, such as OECD HTs (IUCLID6 files or direct retrieval from IUCLID servers), custom spreadsheet templates, SQL dumps from other databases, and custom formats provided by partners (e.g. the NanoWiki RDF dump [31]), is aggregated using the common semantic data model. A variety of options for export, data conversion, data retrieval, and data analysis are available (see **Figure 12**). Different views of substance data are implemented via a Web GUI based on the jToxKit [32] JavaScript library, and many customized methods for accessing the data through the REST API are available via external tools such as Jupyter notebooks and the KNIME analytics platform.
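
As an illustration of programmatic access, the sketch below queries a substance listing of an Ambit/eNanoMapper instance over its REST API with Python and requests. The base URL, endpoint path, query parameters, and response field names are assumptions for illustration only; the concrete routes are documented with the eNanoMapper API and in the Jupyter notebooks referenced at the end of this section.

```python
import requests

# Assumed base URL of an Ambit/eNanoMapper deployment; replace with the instance you use.
BASE_URL = "https://example.org/enanomapper"


def search_substances(query: str, page_size: int = 10) -> list:
    """Free-text substance search returning the parsed JSON payload (illustrative endpoint)."""
    response = requests.get(
        f"{BASE_URL}/substance",
        params={"search": query, "pagesize": page_size},   # parameter names are assumptions
        headers={"Accept": "application/json"},            # other serializations (e.g. RDF) may be offered
        timeout=30,
    )
    response.raise_for_status()
    return response.json().get("substance", [])            # payload key is an assumption


if __name__ == "__main__":
    for substance in search_substances("NM-220"):
        print(substance.get("name"))
```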

Multiple data export formats are supported by the Ambit/eNanoMapper web interface and the API, including semantic formats (RDF and JSON-LD), Excel file formats [28], the native JSON serialization, and the HDF5 standard.

To facilitate data gap analysis and grouping, the aggregated search interface (see **Figure 13**) includes a number of options, allowing all of the search hits to be exported in JSON or spreadsheet formats, as well as flexible summary reports in Excel format.

All components of the Ambit/eNanoMapper data model (see **Figure 8**) are searchable. Unlike the standard triad, which is focused on the chemical structure, the model schema does not dictate a central entity. This way, more options for searching, storing, and viewing are available. **Figure 13** illustrates a faceted substance search. As pointed out earlier, structure searching and substance searching are completely different features of the system. **Figure 14** shows an exact (identity) structure search for the benzene molecule and the corresponding logical links between the benzene molecule and the different chemical substances that contain it as a component.

A number of open-source libraries for accessing the eNanoMapper API are available: https://github.com/enanomapper/renm, developed in R; https://github.com/enanomapper/ambit.js and https://github.com/ideaconsult/jToxKit, developed in JavaScript; and https://github.com/ideaconsult/pynanomapper, developed in Python.

#### **Figure 13.**

*Faceted search for substances within the NanoReg2 public database: searching for NMs that have experiments with A549 cells and phys-chem characterization with zeta potential.*



#### **Figure 14.**

*Exact (identity) structure search in the Ambit/eNanoMapper database for the benzene structure; the resulting structure is linked to a set of substances having the benzene molecule as a component with different roles (rightmost column).*

The Python library, in particular, is used in the set of open-source Jupyter notebooks that demonstrate the eNanoMapper API (https://github.com/ideaconsult/notebooks-ambit/tree/master/enanomapper).

#### **8. Linear notations and identifiers for chemical substances**

Linear notations represent chemical structure connectivity and other molecular features as a character string. They have proved to be popular and efficient tools in the field of cheminformatics. The present-day mainstream notations, SMILES [33, 34] and InChI [35], are de facto standards used in the majority of cheminformatics tools and structural databases. Naturally, linear notations played a significant role in establishing the classical triad model (S, D, P). The linear notation InChI (International Chemical Identifier), as its name indicates, was originally designed to be a unique structure identifier. Methods for canonical atom numbering and canonical structural representations are well known (e.g. canonical CTs and canonical SMILES) and, together with hashing approaches (e.g. the InChIKey), are widely used for structure identification. Database and registry molecular numbers are another efficient means of molecule identification. The identification of chemical substances is a huge challenge, especially in the field of nanoinformatics. Regulatory frameworks suffer from a lack of unique identifiers, since the traditional identifiers and the most popular linear notations are inadequate. One of the pillars for establishing the FAIR principles is the use of globally unique and persistent identifiers (see points F1 and F3 of the FAIR principles [3]).
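
For single structures, the canonicalization and hashing mentioned above are readily available in open-source toolkits. The following Python sketch uses RDKit (chosen here only for illustration; the chapter's own tooling builds on the CDK) to derive a canonical SMILES, an InChI, and an InChIKey for benzene.

```python
from rdkit import Chem

# Benzene given deliberately as a non-canonical SMILES
mol = Chem.MolFromSmiles("C1=CC=CC=C1")

canonical_smiles = Chem.MolToSmiles(mol)  # canonical atom ordering
inchi = Chem.MolToInchi(mol)              # layered, unique structure identifier
inchi_key = Chem.MolToInchiKey(mol)       # fixed-length hash, convenient for database lookups

print(canonical_smiles)  # c1ccccc1
print(inchi)             # InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H
print(inchi_key)         # UHOVQNZJYSORNB-UHFFFAOYSA-N
```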

The chemical substance paradigm has been gradually adopted within the cheminformatics and nanoinformatics domains. The substances are serialized via data models with hierarchical organization (e.g. the Ambit JSON or the ISA model). With a proper canonicalization method, such a data serialization (or parts of it) can be hashed and used as a locally defined identifier, as is the case with the Ambit/eNanoMapper UUIDs (see **Figures 4** and **5**). The complexity of the substance data model justifies the use of nonlinear techniques for serialization. Nevertheless, great effort has lately been put into developing linear notations and universal identifiers for chemical substances and NMs.
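
How such a locally defined identifier might be derived is sketched below: a hierarchical substance record is serialized as canonical JSON (sorted keys, compact separators) and hashed, and the digest is folded into a name-based UUID. This illustrates the general idea only and is not the actual Ambit/eNanoMapper UUID generation scheme; the example record content is invented.

```python
import hashlib
import json
import uuid


def canonical_identifier(record: dict) -> str:
    """Hash a canonical JSON serialization of a substance record into a UUID string."""
    # Canonical form: sorted keys and compact separators, so logically equal
    # records always serialize to the same byte string.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    # Fold the digest into a name-based UUID (the namespace is chosen arbitrarily here).
    return str(uuid.uuid5(uuid.NAMESPACE_URL, digest))


substance = {
    "name": "NM-220",
    "composition": [
        {"structure": "O=[Fe]O[Fe]O[Fe]=O", "role": "core"},
        {"structure": "NCC(=O)O", "role": "coating"},
    ],
}

print(canonical_identifier(substance))  # stable as long as the record content is unchanged
```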

The InChI Trust (https://www.inchi-trust.org/) works on developing and promoting the use of the IUPAC InChI [35] open-source chemical structure representation algorithm. The InChI Trust projects cover versatile types of chemical objects and perform pioneering work on developing linear notations for mixtures (the MInChI project), nanomaterials (the NInChI project), and polymers (the PInChI project), to name a few of the projects most relevant to chemical substances. The Nano-InChI (NInChI) [36] project is a promising effort to integrate the concepts of intrinsic and extrinsic NM properties and to support a domain-specific language for nanoinformatics. NInChI is not intended to replace the chemical substance model; it proposes to encode the information (composition, size, shape, surface chemistry, etc.) required to unambiguously identify a specific NM as an extension of the IUPAC International Chemical Identifier, termed NInChI. NMs are particulates with specific relationships between the core and surface components, which challenge traditional material naming and scientific data communication between researchers, modelers, industry, and regulators. Leveraging best practices of the other InChI working groups, e.g. MInChI, Reaction InChI, and PInChI, is planned. NInChI development is a collaborative effort of domain experts from different fields. Currently, NInChI is under active development, and there are some preliminary NInChI prototypes. For example, an Fe3O4-core magnetic nanomaterial with a diameter of 38 nm, coated with glycine, with a shell thickness of 2 nm, can be encoded as:

NInChI = 0.00.1A/C2H5NH2/C3–3-2(4)5/h1,3H2,(H,4,5)/msh/s2t-9!/3Fe.4O/msp/s38d-9/y2&1.

Another possible approach is the use of the SYBYL Line Notation (SLN) [37, 38]. SLN is an unambiguous, non-unique linear notation developed by Tripos Inc. SLN supports syntax for the specification of molecules, substructure queries, and reactions, which covers the capabilities of SMILES [33], SMARTS [39], and SMIRKS [40] taken together. On top of the basic syntax, SLN includes other powerful features for the specification of user-defined attributes, macro and Markush [41] atoms for the flexible definition of molecular fragments, search queries, and structural libraries, as well as 2D and 3D coordinates. All this is accomplished through a unified syntax within a single notation. These features make SLN suitable for data storage and exchange. To our knowledge, SLN is the most comprehensive and rich linear notation for the representation of chemical objects of various kinds, facilitating a wide range of cheminformatics algorithms. Though it is not the most popular linear notation nowadays, SLN has excellent capabilities for supporting the challenging tasks of present-day cheminformatics. SLN's rich syntax allows encoding of comprehensive and versatile chemical information within the boundaries of a linear string representation, which would otherwise require complex data structures such as JSON [21] or XML [42] schemas.

Particularly, SLN is suitable for treating chemical objects with rich metadata (e.g. chemical substances). The SLN string defines one or more fully connected CTs plus a section with molecule attributes for each CT. One of the SLN advantages is its syntax extension, including comparison operators such as <, <=, >, and >=, while the SMILES/SMARTS standards support only attribute equality. The latter is in line with the flexibility of the substance model for storing experimental values. Among the existing notations, SLN seems to have the widest and most flexible syntax to support the chemical substance paradigm. An SLN example for the above-mentioned Fe3O4-core magnetic NM, coated with glycine:

O[1]Fe[2]OFeOFe@1O@2 < role = core;size = 38 nm > CH2(C(=O)OH)NH2 < role = coating;size = 2 nm > .
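
To show how the attribute sections of such a string can be consumed programmatically, here is a toy Python helper that splits an SLN-like string into fragments and attribute dictionaries. It handles only the simple pattern of the example above and is not a real SLN parser; the full SLN grammar (ring closures, Markush atoms, queries, coordinates) is far richer.

```python
import re


def split_sln_attributes(sln: str):
    """Split an SLN-like string into (fragment, attributes) pairs.

    Toy helper for strings of the form FRAGMENT<key=value;key=value>FRAGMENT<...>.
    """
    pairs = []
    for fragment, attr_block in re.findall(r"([^<>]+)<([^<>]*)>", sln):
        attributes = {}
        for item in attr_block.split(";"):
            if "=" in item:
                key, value = item.split("=", 1)
                attributes[key.strip()] = value.strip()
        pairs.append((fragment.strip(), attributes))
    return pairs


sln = "O[1]Fe[2]OFeOFe@1O@2<role=core;size=38 nm>CH2(C(=O)OH)NH2<role=coating;size=2 nm>"
for fragment, attrs in split_sln_attributes(sln):
    print(fragment, attrs)
# O[1]Fe[2]OFeOFe@1O@2 {'role': 'core', 'size': '38 nm'}
# CH2(C(=O)OH)NH2 {'role': 'coating', 'size': '2 nm'}
```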

#### **9. Conclusions**

The FAIR principles align with the global shift to open data by promoting governance criteria for increased data sharing. Cheminformatics, nanoinformatics, and bioinformatics methods provide data-driven solutions in the field of chemical substance safety. FAIR compliance calls for the extension of structure-centered data models to meet the challenges of chemical substance and materials data management. The substance representation must include not just a single structure, but a composition of many components with definite roles, corresponding interconnections, rich metadata, and ontology annotations. The variety of data sources, formats, and logical organizations challenges the aggregation of data from multiple projects into a common information system. The Ambit/eNanoMapper data model has well-defined semantics and fully adopts the FAIR principles in order to boost successful strategies for reusable and sustainable research results, with efficient interconnections and collaboration between academia, industry, and regulators.

#### **Acknowledgements**

The work leading to this chapter has received funding from the European Union's Horizon 2020 Research and Innovation program, Grant Agreements no. 814426 NanoinformaTIX and LRI-EEM9.5 – IC AMBIT.

#### **Author details**

Nina Jeliazkova<sup>1</sup>\*, Nikolay Kochev<sup>1,2</sup> and Gergana Tancheva<sup>2</sup>

1 Ideaconsult Ltd., Sofia, Bulgaria

2 Faculty of Chemistry, University of Plovdiv, Department of Analytical Chemistry and Computer Chemistry, Plovdiv, Bulgaria

\*Address all correspondence to: jeliazkova.nina@gmail.com

© 2023 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

### **References**

[1] Gasteiger J, Engel T, editors. Chemoinformatics: Basic Concepts and Methods. Weinheim: WILEY-VCH Verlag GmbH & Co. KGaA; 2018. p. 575

[2] Massart DL, Vandeginste BGM, Kaufman L, Deming SN, Michotte Y. Chemometrics: A Textbook. Elsevier Science; 1988. p. 464. ISBN: 9780080868295

[3] Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, et al. The FAIR guiding principles for scientific data management and stewardship. Scientific Data. 2016;**3**:1-9. DOI: 10.1038/sdata.2016.18

[4] McNaught AD, Wilkinson A. IUPAC Compendium of Chemical Terminology. 2nd ed. Entry: chemical substance. 2014. Available from: https://goldbook.iupac.org/terms/view/C01039. DOI: 10.1351/goldbook.C01039

[5] ECHA (REACH). What is a substance? [Internet]. Available from: https://echa.europa.eu/support/substance-identification/what-is-a-substance [Accessed: June 12, 2022]

[6] Government of Canada, CEPA. Chemical Substances Glossary [Internet]. 1999. Available from: https://www.canada.ca/en/health-canada/services/chemical-substances/chemical-substances-glossary.html [Accessed: June 12, 2022]

[7] US EPA. TSCA Chemical Substance Inventory [Internet]. Available from: https://www.epa.gov/tsca-inventory [Accessed: June 12, 2022]

[8] Japan CSCL – Chemical Substance Control Law [Internet]. Available from: https://chemical.chemlinked.com/chempedia/japan-cscl-chemical-substance-control-law [Accessed: June 12, 2022]

[9] International Organization for Standardization. ISO/TS 80004-1:2015 - Nanotechnologies – Vocabulary – Part 1: Core-terms. ISO; 2015

[10] The European Commission's Science and Knowledge Service [Internet]. Available from: https://joint-research-centre.ec.europa.eu/index_en [Accessed: June 12, 2022]

[11] European Chemicals Agency in association with the OECD. IUCLID 6 [Internet]. Available from: https://iuclid6.echa.europa.eu/bg/project-iuclid-6

[12] OECD HT [Internet]. Available from: https://www.oecd.org/ehs/templates/ [Accessed: June 12, 2022]

[13] Abeyruwan S, Vempati UD, Küçük-McGinty H, Visser U, Koleti A, Mir A, et al. Evolving BioAssay ontology (BAO): Modularization, integration and applications. Journal of Biomedical Semantics. 2014;**5**(Suppl. 1):1-22. DOI: 10.1186/2041-1480-5-S1-S5

[14] Visser U, Abeyruwan S, Vempati U, Smith RP, Lemmon V, Schürer SC. BioAssay ontology (BAO): A semantic description of bioassays and high-throughput screening results. BMC Bioinformatics. 2011;**12**:257-273. DOI: 10.1186/1471-2105-12-257

[15] Rumble J, Freiman S, Teague C. Towards a uniform description system for materials on the nanoscale. Chemistry International. 2015;**37**(4):3-7. DOI: 10.1515/ci-2015-0402. Available from: https://www.degruyter.com/document/doi/10.1515/ci-2015-0402/html

[16] Rumble J, Freiman S, Teague C. Uniform Description System for Materials on the Nanoscale. Prepared by the CODATA-VAMAS Working Group on the Description of Nanomaterials. 2016. Available from: https://zenodo.org/record/56720

[17] Sansone SA, Rocca-Serra P, Field D, Maguire E, Taylor C, Hofmann O, et al. Toward interoperable bioscience data. Nature Genetics. 2012;**44**(2):121-126. DOI: 10.1038/ng.1054

[18] Robinson R, Cronin M, Richarz A, Rallo R. An ISA-TAB-Nano based data collection framework to support data-driven modelling of nanotoxicology. Beilstein Journal of Nanotechnology. 2015;**6**:1978-1999. DOI: 10.3762/bjnano.6.202

[19] Thomas DG, Gaheen S, Harper SL, Fritts M, Klaessig F, Hahn-Dantona E, et al. ISA-TAB-Nano: A specification for sharing nanomaterial research data in spreadsheet-based format. BMC Biotechnology. 2013;**13**:2-17. DOI: 10.1186/1472-6750-13-2

[20] ISA-JSON format [Internet]. Available from: https://isa-tools.org/format/specification.html [Accessed: June 12, 2022]

[21] ECMA. JSON (ECMA-404 The JSON Data Interchange Syntax) [Internet]. Geneva, Switzerland: ECMA International; 2017. Available from: https://www.ecma-international.org/publications-and-standards/standards/ecma-404/ [Accessed: June 12, 2022]

[22] Carroll SR, Herczog E, Hudson M, Russell K, Stall S. Operationalizing the CARE and FAIR principles for Indigenous data futures. Scientific Data [Internet]. 2021;**8**(1):8-13. DOI: 10.1038/s41597-021-00892-0

[23] Lin D, Crabtree J, Dillo I, Downs RR, Edmunds R, Giaretta D, et al. The TRUST principles for digital repositories. Scientific Data. 2020;**7**(1):1-5. DOI: 10.1038/s41597-020-0486-7

[24] Jeliazkova N, Apostolova MD, Andreoli C, Barone F, Barrick A, Battistelli C, et al. Towards FAIR nanosafety data. Nature Nanotechnology. 2021;**16**(6):644-654. DOI: 10.1038/s41565-021-00911-6

[25] Jeliazkova N, Chomenidis C, Doganis P, Fadeel B, Grafström R, Hardy B, et al. The eNanoMapper database for nanomaterial safety information. Beilstein Journal of Nanotechnology. 2015;**6**:1609-1634. DOI: 10.3762/bjnano.6.165

[26] Willighagen EL, Mayfield JW, Alvarsson J, Berg A, Carlsson L, Jeliazkova N, et al. The chemistry development kit (CDK) v2.0: Atom typing, depiction, molecular formulas, and substructure searching. Journal of Cheminformatics. 2017;**9**(1):1-19. DOI: 10.1186/s13321-017-0220-4

[27] Chemistry Development Kit [Internet]. Available from: https://cdk.github.io/ [Accessed: June 12, 2022]

[28] Jeliazkova N, Koch V, Li Q, Jensch U, Reigl JS, Kreiling R, et al. Linking LRI AMBIT chemoinformatic system with the IUCLID substance database to support read-across of substance endpoint data and category formation. Toxicology Letters. 2016;**258**:S114-S115. DOI: 10.1016/j.toxlet.2016.06.1469

[29] Kochev N, Jeliazkova N, Paskaleva V, Tancheva G, Iliev L, Ritchie P, et al. Your spreadsheets can be FAIR: A tool and FAIRification workflow for the eNanoMapper database. Nanomaterials. 2020;**10**(10):1-23. DOI: 10.3390/nano10101908

[30] Apache POI [Internet]. Available from: https://poi.apache.org/ [Accessed: June 12, 2022]

[31] NanoWiki RDF [Internet]. 2016. Available from: https://figshare.com/articles/NanoWiki_4/4141593 [Accessed: June 12, 2022]

[32] JToxKit [Internet]. Available from: https://github.com/ideaconsult/jToxKit [Accessed: June 12, 2022]

[33] SMILES - A Simplified Chemical Language [Internet]. Daylight Theory. Available from: https://www.daylight.com/dayhtml/doc/theory/theory.smiles.html [Accessed: June 12, 2022]

[34] Weininger D, Weininger A, Weininger JL. SMILES. 2. Algorithm for generation of unique SMILES notation. Journal of Chemical Information and Computer Sciences. 1989;**29**(2):97-101. DOI: 10.1021/ci00062a008

[35] Heller SR, McNaught A, Pletnev I, Stein S, Tchekhovskoi D. InChI, the IUPAC international chemical identifier. Journal of Cheminformatics. 2015;**7**:1-34. DOI: 10.1186/s13321-015-0068-4

[36] Lynch I, Afantitis A, Exner T, Himly M, Lobaskin V, Doganis P, et al. Can an InChI for nano address the need for a simplified representation of complex nanomaterials across experimental and nanoinformatics studies? Nanomaterials. 2020;**10**(12):1-44. DOI: 10.3390/nano10122493

[37] Ash S, Cline MA, Homer RW, Hurst T, Smith GB. SYBYL line notation (SLN): A versatile language for chemical structure representation. Journal of Chemical Information and Computer Sciences. 1997;**37**(1):71-79. DOI: 10.1021/ci960109j

[38] Homer RW, Swanson J, Jilek RJ, Hurst T, Clark RD. SYBYL line notation (SLN): A single notation to represent chemical structures, queries, reactions, and virtual libraries. Journal of Chemical Information and Modeling. 2008;**48**(12): 2294-2307. DOI: 10.1021/ci7004687

[39] SMARTS - A Language for Describing Molecular Patterns [Internet]. Daylight Theory. Available from: https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html [Accessed: June 12, 2022]

[40] SMIRKS - A Reaction Transform Language [Internet]. Daylight Theory. Available from: https://www.daylight.com/dayhtml/doc/theory/theory.smirks.html [Accessed: June 12, 2022]

[41] Barnard J, Wright PM. Towards in-house searching of Markush structures from patents. World Patent Information. 2009;**31**(2):97-103. DOI: 10.1016/j.wpi.2008.09.012

[42] Extensible Markup Language (XML) 1.0 (Fifth Edition) [Internet]. 2008. Available from: https://www.w3.org/TR/REC-xml/ [Accessed: June 12, 2022]

Section 3
