Climate System Ontology: A Formal Specification of the Complex Climate System

*Armita Davarpanah, Hassan A. Babaie and Guanyu Huang*

## **Abstract**

Modeling the climate system requires a formal representation of the characteristics of the system elements and the processes that change them. The Climate System Ontology (CSO) represents the semantics of the processes that continuously cause change at component and system levels. The CSO domain ontology logically represents various links that relate the nodes in this complex network. It models changes in the radiative balance caused by human activities and other forcings as solar energy flows through the system. CSO formally expresses various processes, including non-linear feedbacks and cycles, that change the compositional, structural, and behavioral characteristics of system components. By reusing the foundational logic of a set of top- and mid-level ontologies, we have modeled complex concepts such as hydrological cycle, forcing, greenhouse effect, feedback, and climate change in the ontology. This coherent, publicly available ontology can be queried to reveal the input and output of processes that directly impact the system elements and causal chains that bring change to the whole system. Our description of best practices in ontology development and explanation of the logics that underlie the extended upper-level ontologies help climate scientists to design interoperable domain and application ontologies, and share and reuse semantically rich climate data.

**Keywords:** climate system, ontology, complex system, climate change, climate data

## **1. Introduction**

The solar-powered and highly complex climate system consists of five interacting subsystems: the atmosphere, hydrosphere, lithosphere, cryosphere, and biosphere [1]. The system has evolved to maintain its radiative balance through self-organization [2, 3] and adaptation [4] over time. Radiative perturbations caused by natural external forcings, such as changes in the solar cycles and volcanic eruptions, have periodically changed the climate system over long temporal and spatial scales [5]. The system has adapted to the continuous input of naturally produced carbon dioxide and volcanic aerosol to its atmosphere and oceans by storing carbon in plants, carbonate sedimentary rocks, and coral reefs [6].

More recently, human activities such as the burning of fossil fuels and deforestation have been increasing the levels of carbon dioxide in the atmosphere and oceans [7, 8]. Increased concentrations of CO2 and other greenhouse gases are disturbing the system's radiative balance through an enhanced greenhouse effect, which leads to warming of the atmosphere beyond naturally balanced levels [9]. The complex climate system is responding to these changes through nonlinear feedback processes and by modifying the interactions among its internal components [10]. Through self-organization, the system changes its established network structure and climates. Climate change, for instance, alters the spatial and temporal pattern, frequency, and intensity of extreme events in system components, affecting all kinds of ecosystems, including social systems [11].

The knowledge about the complex interactions in the climate system is well known in the climatology community (e.g., [12–14]). Recently, the Intergovernmental Panel on Climate Change (IPCC) has advanced the knowledge by emphasizing the effects of anthropogenic activities in the climate system [5, 15]. The IPCC reports, however, are written in a natural language. As such, the knowledge cannot be understood by software without natural language processing (NLP). The vast knowledge described in these and other reports about the interactions in the climate system is not easily accessible or understandable by decision and policymakers who are expected to prescribe plans to mitigate climate change impacts. One way to make the knowledge more accessible to software is to develop ontologies that translate the known facts, expressed in natural language, into the logic-based, machine-processible Web Ontology Language (OWL) [16]. Given the complexity of the climate system in which all sorts of nonlinear physical, chemical, and biological processes occur, only ontologies that are based on robust description logic [17] of well-established upper ontologies can reliably model these complex relationships. To this end, we have developed the Climate System Ontology (CSO) based on the widely used, top-level Basic Formal Ontology (BFO) [18, 19] and mid-level Common Core Ontologies (CCO) [20–22], Relation Ontology [23], and Phenotype And Trait Ontology (PATO) [24]. Our Climate System Ontology semantically models the system variables (e.g., concentration of CO2 in the atmosphere or ocean) and core interactive processes among the climate system's components (e.g., change in oceanic circulation, retreat of glaciers, and atmosphere-ocean heat exchange). It explicitly formalizes the impact of anthropogenic activities (e.g., fossil fuel emission, emission of greenhouse gases, and land use) on climate variation. 
The semantic model captures the natural greenhouse effect and other effects by representing the influences of trace gases (e.g., greenhouse gases), aerosols, clouds, and other entities on the system's energy balance. The CSO ontology also models major feedback and energy transfer processes and negative and positive variations in the radiative forcing.

In this chapter, we provide a detailed description of the best practices that we have applied to develop the Climate System Ontology (CSO) by extending upper ontologies. We also explain how the description logic, inherited from the reused upper-level ontologies, enriched the CSO knowledge model and enabled it to model complex concepts in the climate system. Because it is publicly available in the cloud-based GitHub software repository (adavarpa/Climate-System-Ontology-CSO), our Climate System Ontology allows climate scientists to build their own domain and application ontologies that model, for instance, ice cores, ocean water composition, permafrost melting, or extreme events. The CSO ontology enables integration and federation of heterogeneous data and facilitates search, retrieval, and analysis of climate data.

This chapter is structured as follows: Section 1 introduces the problem and the objectives of making the Climate System Ontology (CSO). Section 2 presents a concise description of the climate system from a complex system perspective. Section 3 introduces the methodology and the foundational logics and structures of the upper ontologies which were used to construct CSO. Section 4 presents the results of our modeling of parts of the climate system. Section 5 discusses the advantages of developing the CSO domain ontology by extending the upper ontologies and presents models of the most complex processes in the climate system. It also explains how the ontology, currently available on GitHub (see above), can be applied by domain scientists. This is followed by Section 6, which summarizes our work.

*Climate System Ontology: A Formal Specification of the Complex Climate System DOI: http://dx.doi.org/10.5772/intechopen.110809*

## **2. Climate system**

#### **2.1 Flow of energy in the climate system**

Logical modeling of the network structure and dynamics of the climate system in an ontology requires a formal, explicit specification of our conceptualization of the interactions among the components of this system [25]. In this section, we provide the background knowledge of the climate system based on the descriptions presented by the IPCC reports [5, 15, 26, 27], and the reports by the Royal Society and the U.S. National Academy of Sciences [11, 28]. We then model this knowledge in the next sections.

The open atmospheric, hydrospheric, cryospheric, lithospheric (land surface), and biospheric components of the climate system continuously interact through many physical, chemical, and biological processes over a wide range of spatial and temporal scales [26]. The climate system is driven by solar radiation, mostly in the visible short-wave and near-infrared, and, to some extent, in the ultraviolet part of the electromagnetic spectrum [29, 30]. About a third of this incoming solar radiation is reflected back into space by clouds, the atmosphere, and land surface [31]. The rest is either absorbed by the atmosphere or used to heat the land and oceans. After absorbing part of incoming radiation, the warmed land surface returns the energy as heat (infrared radiation) and water vapor to the atmosphere. Heat, carried through atmospheric circulations by water vapor, is released to the atmosphere through condensation [26]. The infrared radiation emitted from Earth's surface is partly absorbed by greenhouse gases and clouds in the atmosphere [32]. The greenhouse gases and clouds re-emit the absorbed heat in all directions, leading to its entrapment, and warming of the lower atmosphere and Earth's surface. This is the natural greenhouse effect which is a part of Earth's energy balance. Reflection of the incoming radiation by clouds more than compensates for their warming effect, leading to a net cooling of the system [33].
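The partition of incoming solar radiation described above can be illustrated with a zero-dimensional energy balance. The following is a minimal sketch and not part of CSO: it assumes a solar constant of about 1361 W/m^2, a planetary albedo of 0.30 (standing in for the "about a third" that is reflected), and blackbody emission. All function and constant names are illustrative.

```python
# Zero-dimensional energy balance (illustrative sketch, not part of CSO).
# Assumed values: solar constant ~1361 W/m^2 and planetary albedo 0.30.
SOLAR_CONSTANT = 1361.0            # W/m^2 at the top of the atmosphere (assumed)
PLANETARY_ALBEDO = 0.30            # "about a third" of sunlight is reflected
STEFAN_BOLTZMANN = 5.670374419e-8  # W/m^2/K^4


def reflected_flux(solar_constant=SOLAR_CONSTANT, albedo=PLANETARY_ALBEDO):
    """Globally averaged short-wave flux reflected back to space (W/m^2)."""
    # Dividing by 4 spreads the intercepted beam over the whole sphere.
    return solar_constant * albedo / 4.0


def absorbed_flux(solar_constant=SOLAR_CONSTANT, albedo=PLANETARY_ALBEDO):
    """Globally averaged short-wave flux absorbed by the system (W/m^2)."""
    return solar_constant * (1.0 - albedo) / 4.0


def effective_temperature(flux):
    """Temperature (K) of a blackbody that emits `flux` W/m^2."""
    return (flux / STEFAN_BOLTZMANN) ** 0.25


if __name__ == "__main__":
    absorbed = absorbed_flux()
    print(f"reflected: {reflected_flux():.1f} W/m^2")
    print(f"absorbed:  {absorbed:.1f} W/m^2")
    print(f"emission temperature: {effective_temperature(absorbed):.1f} K")
```

Under these assumptions the emission temperature comes out near 255 K, well below the observed global mean surface temperature of roughly 288 K. The difference corresponds to the warming attributed to the natural greenhouse effect.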

The average net radiation at the tropopause, which for the sake of energy calculations is assigned as the top of the atmosphere, is zero [34]. The net radiation can change due to a change in the incoming solar radiation or the emitted infrared radiation [35]. The imbalance caused by these variations is called radiative forcing [27]. External natural changes such as solar cycles and explosive volcanic eruptions that eject aerosols into the atmosphere bring variation in the radiative forcing [36]. Positive or negative radiative forcings lead to an average increase or decrease of surface temperatures, respectively [33]. In both of these cases, the system needs to restore the radiative balance. Internal processes and feedbacks [37] in the climate system also cause radiative imbalance by affecting the reflected solar radiation and emitted infrared radiation [26].
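The relation between net radiation, forcing sign, and temperature tendency can be sketched as follows. This is an assumption-laden illustration, not a CSO component: the temperature response considers only the change in blackbody emission at an assumed emission temperature of 255 K and ignores every feedback, and the function names are hypothetical.

```python
# Sign of a radiative forcing and a blackbody-only response (illustrative).
STEFAN_BOLTZMANN = 5.670374419e-8  # W/m^2/K^4


def net_radiation(absorbed_solar, emitted_infrared):
    """Net downward radiation at the top of the atmosphere (W/m^2)."""
    return absorbed_solar - emitted_infrared


def forcing_sign(delta_net, tolerance=1e-9):
    """Label a change in net radiation as positive, negative, or balanced."""
    if delta_net > tolerance:
        return "positive"   # surplus energy, warming tendency
    if delta_net < -tolerance:
        return "negative"   # energy deficit, cooling tendency
    return "balanced"


def blackbody_response(forcing, emission_temperature=255.0):
    """Temperature change (K) that restores balance if only blackbody
    emission adjusts (assumption: no feedbacks of any kind)."""
    # Derivative of sigma * T**4 with respect to T.
    restoring_rate = 4.0 * STEFAN_BOLTZMANN * emission_temperature ** 3
    return forcing / restoring_rate


if __name__ == "__main__":
    imbalance = net_radiation(absorbed_solar=240.0, emitted_infrared=238.0)
    print(forcing_sign(imbalance), f"{blackbody_response(imbalance):.2f} K")
```

A zero imbalance yields no temperature change, which matches the balanced state described above; a positive or negative imbalance yields warming or cooling, respectively.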

The positive forcings are induced by changes in the composition of system components, such as increased greenhouse gas and aerosol concentrations in the atmosphere and oceans, for example, due to combustion of fossil fuels through human activities (e.g., [38]). These changes lead to higher surface-tropospheric and sea water temperatures, along with increased acidification of sea water that affects the carbon sink in corals [39, 40]. These and other processes (e.g., biomass burning and emission of chlorofluorocarbons that destroy the stratospheric ozone layer) cause an anthropogenic perturbation of the radiative balance in the system that impacts climate as the system tries to restore the balance [41]. The increased concentration of anthropogenic greenhouse gases in the atmosphere leads to a decrease in the amount of heat that is lost to space at higher altitudes, causing a positive radiative forcing [26]. This is called the enhanced greenhouse effect [42, 43]. The climate system responds to the radiative imbalance through its various complex internal processes and feedbacks that lead to climate change [15].

Components in the climate system respond to the internal variability and forcings through nonlinear feedback cycles that are essential features of all complex systems [44] (see next section). Due to their different physical properties (e.g., heat capacity and thermal conductivity), components of the climate system have different response times to variations brought by external forcings [45]. These nonlinear feedback mechanisms and other internal interactions among system components keep the climate system in a constantly varying state characterized by large-scale climate variability such as the El Niño-Southern Oscillation, North Atlantic Oscillation, North-South dipole structure, and Antarctic Oscillation, which occur due to ocean–atmosphere interaction (e.g., Rodríguez-Fonseca [46]). Climate variability can arise internally from the natural interactions among system components or externally from radiative forcings (e.g., [47]). The occurrence and intensity of certain low-probability, extreme weather events such as heat waves, droughts, and floods correlate with climate variability [48]. Human activities such as agricultural practices, forestry, and land use lead to anthropogenic forcings that cause variations in the climate (e.g., [38]). For example, emission of methane and nitrous oxide gases through agricultural and industrial practices increases the tropospheric ozone (a greenhouse gas) (e.g., [49]).

#### **2.2 Complex system perspective of the climate system**

The climate system is an adaptive, dissipative system of a network of numerous independent, nonlinearly interacting nodes (components and elements). The system, defined by the diversity, adaptability, connectedness, and mutual dependency of its heterogeneous components, continuously interacts with its environment (space) through the transfer of solar energy. The interactions among the system components change the state of the whole system as it adapts to internal changes and external perturbations [50, 51] such as the positive radiative forcings caused by solar cycles and human activities.

As a complex system, the climate system mostly stays in the subcritical, far-from-equilibrium state of the 'edge of chaos' between a stable (low complexity) state and an unstable state of chaos [52]. When the system is driven far from equilibrium, for example, through positive radiative forcings caused by human activities, it reaches a threshold of instability (critical level). The transition between the states of the 'edge of chaos' and chaos leads to repeated phase changes (e.g., evaporation and condensation) and cascades (e.g., occurrence of extreme climate events) [53].

The cascades are followed by a return to the slow relaxation stage (subcritical state) to repeat the process (e.g., [54]). The subcritical state [55] is characterized by continuous natural processes (e.g., incoming solar radiation, reflection of incoming radiation, and emission of land surface infrared radiation to atmosphere) and coevolution [56, 57]. Changes brought by internal and external forcings in one component (e.g., atmosphere and land surface) continuously transfer to other components and reconfigure the network of links among the components [51]. This may lead to co-evolution which occurs when system components simultaneously change due to their interdependency and mutual adaptation [4]. An example of co-evolution in the climate system is when the atmosphere and oceans both become warmer due to the atmosphere–ocean exchange of heat as a result of increased concentration of greenhouse gases in the atmosphere. Another example is an increase of CO2 in the atmosphere and simultaneous decrease of pH in the ocean water as more CO2 is absorbed by the oceans [58, 59].

At the critical points, internal and external forcings at the constituent (micro) level lead to spontaneous emergence of order at the whole system (macro) level by the appearance of new properties, random and unpredictable behavior, structure (network of links between nodes), and pattern. The order appears at the macro-level as a result of nonlinear interactions at the micro-level (components) through self-organization [2, 3]. The emerged properties at the macro-level affect those at the micro-level [60, 61]. Since the system is decentralized, a failure at the micro-level does not bring a failure at the whole system level [61].

The self-organization at the critical points leads, through nonlinear feedbacks, to the autonomous formation of a preferred configuration (attractor) that better conforms (adapts) to the changing environment. The new pattern (e.g., a climate pattern) brings more effective coordination and cooperation among the system elements [61] by reordering the composition and relationships (links) among the system components (nodes) and even creating new ones (e.g., new circulations in the ocean; more frequent adverse climate events) [62]. The new self-organized structure, maintained through a continuous flux of energy (e.g., through increased anthropogenic radiative forcing), promotes specific behavior, such as climate change, in the system [3]. Self-organization is a dynamical and adaptive feature of the climate system that allows it to acquire and maintain spatial, temporal, or functional structure that leads to increased order [60]. The structure brought by self-organization is maintained through a constant source of energy (e.g., solar radiation) that allows the system to adapt to dynamic changes (e.g., warming of the atmosphere) through a variety of behaviors (e.g., negative feedbacks), restricting those behaviors to a small part of its state space (i.e., around the attractor) [56, 63]. The emerged self-organization and nonlinear processes that occur during the unstable chaos state are scale-invariant, governed by power laws [50, 64], and produce emergent variations in the system over a wide range of scales. The critical points themselves evolve over time as driving forcings change.

The order produced by the formation of new patterns and structures, through self-organization, is the product of non-equilibrium in the far-from-equilibrium climate system [65, 66]. These patterns (e.g., of climate) are the result of the interaction of the system with its environment (space outside of atmosphere) [62] through input and output of energy. By continuously getting input, such a dissipative and adaptive system can achieve dynamic equilibrium while still doing work, recycling mass (gas, aerosol), and transforming different forms of energy (radiation).

A change such as an increased water vapor content in the atmosphere does not remain proportional to its causal process (e.g., evaporation of ocean water) for long because of the feedback loops that redirect the output of the process (water vapor) back to the original process as input through intermediary processes (e.g., amplification of temperature due to higher water vapor content in the atmosphere). This feedback cycle strengthens or weakens the output of the original change process (evaporation), causing a larger change (positive feedback) or a reduced or eliminated change that brings the system back to equilibrium (negative feedback) [67]. In other words, the outcome of a process is necessary for the process to proceed in a positive feedback (a form of self-cause) [68]. Negative feedbacks dampen the original process and bring stability to the system. Positive and negative feedbacks maintain the emerged self-organized forms [69] within and across system levels [70]. Thus, feedbacks, as nonlinear, recursive processes, probabilistically lead to adaptive or chaotic outcomes or an equilibrium state [69] and result in emergent properties that are absent in the system or its components [71].
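The amplifying and damping behavior of a feedback cycle can be made concrete with a simple loop model. The sketch below is an assumption, not the chapter's formalism: it treats each pass through the loop as returning a fixed fraction `feedback_factor` of the previous increment to the original process, so that the total change is a geometric series. All names are illustrative.

```python
# Linear feedback loop (illustrative): each pass through the loop feeds a
# fixed fraction of the previous increment back into the original process.


def equilibrium_response(reference_change, feedback_factor):
    """Closed-form total change, valid only when |feedback_factor| < 1."""
    if abs(feedback_factor) >= 1.0:
        raise ValueError("no finite equilibrium for |feedback_factor| >= 1")
    return reference_change / (1.0 - feedback_factor)


def iterate_loop(reference_change, feedback_factor, passes=60):
    """Accumulate the change pass by pass to show how the loop builds up."""
    total = 0.0
    increment = reference_change
    for _ in range(passes):
        total += increment              # output of the current pass
        increment *= feedback_factor    # part of it re-enters as input
    return total


if __name__ == "__main__":
    for factor in (0.5, 0.0, -0.5):
        kind = "positive" if factor > 0 else "negative" if factor < 0 else "none"
        print(kind, round(equilibrium_response(1.0, factor), 3))
```

In this toy model a positive factor yields a total change larger than the initial one, a negative factor yields a smaller one, and a factor of one or more has no finite equilibrium, which corresponds to a runaway change.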

## **3. Methodology**

In this section, we describe the structure of the imported upper ontologies and their resources for modeling the climate system. We chose the Common Core Ontologies (CCO) [20–22] to develop our Climate System Ontology (CSO) because these mid-level ontologies extend the logical foundation of the Basic Formal Ontology [18, 19], which is a simple, standard top-level ontology with an extensive scientific user base [72, 73]. The CCO set adds many useful classes to the BFO class hierarchy and introduces several new object properties (through the Extended Relation Ontology) in addition to the ones defined by the Relation Ontology [23] that are available in BFO.

To make it easy for readers to distinguish the CSO class names from names that are defined in the imported upper-level ontologies, we adhere to the following naming scheme throughout this chapter. Imported BFO, CCO, and PATO class names are preceded by their namespace prefix (e.g., bfo: process, cco: Change, pato: quality). Moreover, the BFO and PATO class names begin with a lower-case letter (bfo: material entity, pato: variability), whereas the CCO class names begin with a capital letter (e.g., cco: Stasis, cco: Temperature). Throughout this chapter, we format the CSO class names with Small caps font and capitalize the first letter of each word (e.g., Ocean, Positive Feedback, Snow-covered Surface, and Global Mean Sea level Change).

The CCO set was downloaded as the 'CommonCoreOntologies-master' zip folder from the GitHub [74] cloud-based repository, which was then uncompressed and saved in a working directory. PATO was also downloaded and placed in the CommonCoreOntologies-master uncompressed folder as pato.owl. To access and reuse classes in the upper-level ontologies, our CSO ontology (cso.owl) directly imported PATO and the 'MergedAllCoreOntology-v1.3.ttl' file from the 'cco-merged' folder in the CommonCoreOntologies-master directory. However, since the extracted CCO-master package already included the bfo.owl and ro.owl files in its 'imports' folder, there was no need to directly import these ontologies in the cso.owl file. The Climate System Ontology (cso.owl) was then built using the Protégé editor [75, 76] and placed in the same CommonCoreOntologies-master folder that contained the individual CCO turtle (.ttl) files and pato.owl.
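The reason that BFO and RO need no direct import can be pictured as a transitive import closure. The sketch below uses the file names mentioned above, but the mapping itself is an assumption made for illustration: in particular, we assume that the merged CCO file makes the BFO and RO terms available, and we model that as two import edges. The function name is hypothetical.

```python
# Transitive import closure over ontology files (illustrative sketch).
DIRECT_IMPORTS = {
    "cso.owl": ["pato.owl", "MergedAllCoreOntology-v1.3.ttl"],
    # Assumption: the merged CCO file brings in the BFO and RO terms.
    "MergedAllCoreOntology-v1.3.ttl": ["bfo.owl", "ro.owl"],
    "pato.owl": [],
    "bfo.owl": [],
    "ro.owl": [],
}


def import_closure(ontology, direct_imports):
    """Return every ontology reachable through direct or indirect imports."""
    seen = set()
    stack = list(direct_imports.get(ontology, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited, avoids loops on circular imports
        seen.add(current)
        stack.extend(direct_imports.get(current, []))
    return seen


if __name__ == "__main__":
    print(sorted(import_closure("cso.owl", DIRECT_IMPORTS)))
```

Under this assumption, `bfo.owl` and `ro.owl` appear in the closure of `cso.owl` even though neither is listed among its direct imports.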

As a best practice [77], we adhered to the 'single inheritance rule' and designed each class in the Climate System Ontology as a subclass of only one imported BFO, CCO, or PATO class. We also reused the object properties that were available in CCO and RO to relate class instances in the CSO. The CSO classes were later expanded to include their necessary and/or sufficient characteristics by constructing compound logical statements (axioms) (see below). To better understand the design of the CSO domain ontology, we evaluate the logics that underlie the class hierarchy of the imported upper ontologies in the remaining part of this section. All examples throughout this chapter are from the Climate System Ontology (CSO).
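The single inheritance rule lends itself to a simple automated check. The following is a minimal sketch under stated assumptions: the asserted parents are held in a plain dictionary rather than read from cso.owl, the few entries shown are taken from class placements described later in this chapter, and the function names are hypothetical.

```python
# Checking the single inheritance rule on asserted parents (illustrative).
ASSERTED_PARENTS = {
    "Ice": ["Cryospheric Material"],
    "Cryospheric Process": ["Natural Internal Process"],
    "Natural Internal Process": ["cco: Natural Process"],
    "Ice Sheet Thickness": ["pato: thickness"],
}


def inheritance_violations(asserted_parents):
    """Return the classes that do not have exactly one asserted parent."""
    return {name: parents
            for name, parents in asserted_parents.items()
            if len(parents) != 1}


def lineage(name, asserted_parents):
    """Follow the single asserted parent upward until the table ends."""
    chain = []
    while name in asserted_parents:
        parents = asserted_parents[name]
        if len(parents) != 1:
            raise ValueError(f"{name} has {len(parents)} asserted parents")
        name = parents[0]
        chain.append(name)
    return chain


if __name__ == "__main__":
    print(inheritance_violations(ASSERTED_PARENTS))   # no violations
    print(lineage("Cryospheric Process", ASSERTED_PARENTS))
    # Hypothetical faulty entry, added only to show what the check reports.
    faulty = dict(ASSERTED_PARENTS, Example=["Ice", "cco: Natural Process"])
    print(inheritance_violations(faulty))
```

With one asserted parent per class, the path from any class to its imported ancestor is a single chain, which keeps the asserted hierarchy easy to audit. Additional parents can still be inferred by a reasoner from the axioms.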

The Basic Formal Ontology (BFO) [18] classifies all entities (e.g., things that exist or operate in the climate system) as either bfo: continuant or bfo: occurrent. Continuants persist as whole entities in time. These entities have material and immaterial parts but do not have any temporal part. The bfo: continuant class includes the bfo: generically dependent continuant, bfo: independent continuant, and bfo: specifically dependent continuant subclasses. CCO adds the cco: Information Content Entity class under the bfo: generically dependent continuant to represent the class of entities whose instances are the information content of a cco: Information Bearing Entity. For example, a plot of the global mean surface temperatures vs. time, a table that lists these data, and a report that describes the data are instances of the cco: Information Bearing Entity that 'carry' the same cco: Information Content Entity (i.e., global mean surface temperature) in different ways. At any time, t, a bfo: generically dependent continuant 'generically depends' on another entity (i.e., an Information Bearing Entity). For example, the 'global mean surface temperature' information content 'generically depends' on its carriers (the plot, table, or report).
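The distinction between an information content entity and its bearers can be sketched with two small data types. This is an illustration only: the Python classes below merely mirror the CCO class names, and the `carries` field and helper functions are hypothetical stand-ins for the corresponding ontology relations.

```python
# One information content entity carried by several bearers (illustrative).
from dataclasses import dataclass


@dataclass(frozen=True)
class InformationContentEntity:
    label: str


@dataclass
class InformationBearingEntity:
    label: str
    carries: InformationContentEntity


def carriers_of(content, bearers):
    """Labels of the bearers that carry the given content."""
    return [bearer.label for bearer in bearers if bearer.carries == content]


def has_carrier(content, bearers):
    """Generic dependence: the content needs at least one bearer."""
    return len(carriers_of(content, bearers)) > 0


GMST = InformationContentEntity("global mean surface temperature")
BEARERS = [
    InformationBearingEntity("plot", GMST),
    InformationBearingEntity("table", GMST),
    InformationBearingEntity("report", GMST),
]

if __name__ == "__main__":
    print(carriers_of(GMST, BEARERS))
    print(has_carrier(GMST, BEARERS[:1]))  # one bearer is enough
```

Removing any single bearer leaves the content available as long as another bearer remains, which is the sense in which the dependence is generic rather than specific.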

CCO defines three subclasses of the 'Information Content Entity' class: 'Descriptive Information Content Entity', 'Designative Information Content Entity', and 'Directive Information Content Entity' that are used for data and information modeling [21, 22]. The 'Descriptive Information Content Entity' consists of propositions that 'describe' some entity and includes the 'Measurement Information Content Entity' (describes extent, dimensions, quantity, or quality of an entity), 'Measurement Unit' (describes a magnitude of a physical quantity), 'Predictive Information Content Entity' (describes an uncertain future event), 'Reference System' (describes a set of standards for organizing data), and 'Representational Information Content Entity' (consists of a set of propositions or the content of an image that represents some entity, e.g., a Satellite Image of a Hurricane). The 'Designative Information Content Entity' consists of a set of symbols that 'designate' or denote some entity. This allows modeling identifiers, abbreviated names such as acronyms (e.g., ENSO and NAO that designate El Niño-Southern Oscillation and North Atlantic Oscillation, respectively), and chemical formulae (CH4, CO2). The 'Directive Information Content Entity' consists of propositions or images that 'prescribe' some entity. It allows modeling concepts such as National Climate Change Strategy, Policy for Climate Change Adaptation, Energy Security Goal, and Climate Model.

A bfo: independent continuant includes the bfo: immaterial entity and bfo: material entity. The immaterial entity class includes boundaries (under bfo: continuant fiat boundary) and sites (e.g., the eye of a hurricane) which can change location. Boundaries demarcate material entities, e.g., Sea level, Sea Surface, and Vegetation-covered Surface. A bfo: site is a 3D immaterial entity bounded by a material entity, such as a Polar Environment, Soil Environment, a Region of High Salinity, Wet Tropical Region, a cco: Country, or a cco: City. A material entity has a portion of matter as part and is a spatially extended independent continuant that maintains its identity through time even when gaining or losing parts. The bfo: material entity allowed us to build classes in CSO such as Climate System Component (e.g., Hydrosphere and Cryosphere), Coast, Ice Field, Atmospheric Layer (e.g., Troposphere), Water Body (e.g., Ocean and Lake), Fossil Fuel, Aerosol, Greenhouse Gas (e.g., Water Vapor), Sample, and Water. CCO provides many classes under its Artifact class (a subclass of bfo: object) that allowed modeling concepts such as Energy-related Carbon Source Facility, Storage of Carbon, Green Infrastructure, Buoy, Drifter, Satellite Sensor, and Ship. A bfo: object aggregate is a group of objects that can be partitioned into mutually exhaustive and pairwise disjoint objects [19]. The object aggregate class let us model the CSO classes of Global Community, Species, Civil Society, Climate System, Forest, and Grassland Ecosystem.

The bfo: specifically dependent continuant inheres in (i.e., is borne by) an independent continuant (the bearer) by virtue of how the bearer is related to other entities [19]. It includes the bfo: quality and bfo: realizable entity subclasses. Examples of quality in CSO are: Amount of Ice, Climate Change Benefit, Capacity for Adaptation, Carbon Intensity, Air Quality, Humidity, Lake Area, and Precipitation Deficit. PATO adds pato: quality in addition to the bfo: quality class. The pato: decreased quality class was used, for example, to model CSO classes of Decreased Amount of Emitted Infrared Radiation from Earth Surface, Decreased Extent of Arctic Sea Ice, Decreased pH of Sea Water, and Decreased Ocean Water Salinity. The pato: increased quality class allowed modeling CSO classes such as Increased Aerosol Content, Increased Net Energy in the Climate System, Increased Intensity of Drought, Increased Ocean Acidity, and Increased Concentration of Carbon Dioxide. PATO also provides the physical object quality, process quality, qualitative quality, and variability classes. These PATO classes allowed modeling many of the qualities of the climate system components in CSO such as the Concentration of Carbon Dioxide, Extent of Arctic Sea Ice, Precipitation Pattern, Glacier Volume, Ocean Water Composition, and Thermal Conductivity.

The bfo: realizable entity is a bfo: specifically dependent continuant that inheres in a bfo: independent continuant. Instances of realizable entities are realized in specific processes. Realizable entities include bfo: disposition (e.g., cco: Color, cco: Albedo, Cosmic Ray Shielding Disposition, Resilience, Risk, Security, Climate Vulnerability) and its bfo: function subclass (e.g., Sensor Artifact Function, and Ecosystem Function), and bfo: role (e.g., Policy Making Role, Driver of Climate Change Role, Positive Radiative Forcing Role, Greenhouse Gas Role, and Proxy Role).

The bfo: occurrent class and its underlying CCO subclasses provide a wide range of mechanisms to model dynamic aspects of the climate system. BFO defines the process, process profile, process boundary, spatiotemporal region, and temporal region classes. CCO adds several classes to each of the BFO classes, increasing their potential for modeling the climate system. These include cco: Act and its cco: Intentional Act subclass, cco: Change, cco: Effect, cco: Natural Process, and cco: Stasis. The cco: Act is a process in which an agent (e.g., a human or group of people) plays a causative role. The cco: Act class is used in CSO to define anthropogenic activities such as Emission of Greenhouse Gases, Forestry, Irrigation, and Land use. The cco: Change class allows defining change in the climate system (Climate Change, Forcing, Change in Net Radiation, and Change in the Radiative Balance), in its cycles (Change in Water Cycle), its processes (e.g., Change in Atmospheric Circulation), and in component qualities (e.g., Change in Atmospheric Pressure, Change in Temperature, Change in Humidity, Change in Albedo, and Change in Infrared Radiation). It also provides classes that enable modeling a decrease or increase in a generically or specifically dependent continuant. For example, it allowed CSO to model Decrease in the Extent of Arctic Sea Ice, Decrease in Ocean water Salinity, and Decrease in Temperature of Earth Surface. It also allowed modeling Increase in CH4 level, Increase in Atmospheric Opacity, and Increase in Background Surface Ozone. The cco: Change also defines ways to model loss or gain of dependent continuants, e.g., loss of quality (Loss of Ice Sheet Mass), loss of disposition (Loss of Well-being, Loss of Health), and loss of function (Loss of Ecosystem Function).

The cco: Effect class enabled CSO to define subclasses for Adverse Effect (e.g., Adverse Economic Effect and Adverse Human Effect) and Climate Change Impact (e.g., Impact on Ecosystem, Impact on Infrastructure, and Impact on Species). The cco: Natural Process allowed CSO to model solar and other processes in the components of the climate system, e.g., in the atmosphere (e.g., Filtering of Solar Ultra-violet Radiation, Cloud Formation, Precipitation, and Wind), in oceans (e.g., Heat Transfer, Ocean Current, and Upwelling), in the biosphere (Evapotranspiration, Photosynthesis, and Respiration), and in the cryosphere (Flowing of Outlet Glaciers, Melting, and Retreat of Glaciers). The cco: Stasis class was useful for modeling processes, such as drought and glaciation, through which some independent continuants endure in an unchanging condition over a period of time. For example, the 20th Century Warming, Little Ice Age Cooling, and Drought were modeled under the cco: Stasis of Quality. The bfo: temporal region class defines zero- and one-dimensional temporal regions that allowed us to model concepts such as Glacial Period, Interglacial Period, Ice Age, Period of Abnormally Dry Weather, Winter, and Season (e.g., Percolation Season and Runoff Season) in CSO.

## **4. Results**

#### **4.1 Semantic modeling**

Ontologies such as CSO consist of a controlled vocabulary modeled by named and defined classes that represent concepts in the domain knowledge (e.g., climate system). They model different kinds of relations among the individuals that are instances of these classes. These knowledge models are built at different levels based on their generality. Top- and mid-level ontologies define general concepts and are designed to be extended by domain ontologies [78]. A domain ontology is a formal (i.e., logical), explicit specification of conceptualizations in a specific area of interest (e.g., climate system) [25]. As a domain ontology, the Climate System Ontology (CSO) must include classes that formally represent concepts in the climate system such as radiation, global warming, atmosphere, adverse climate impact, and relations that are known to exist among their instances.

A class (e.g., Ice Sheet, Ocean, or Island) in the ontology describes the type for its instances (i.e., an ice sheet, ocean, or island). Class descriptions are commonly complex because they must represent the complete set of characteristics of the concept that they represent. Axioms are often needed to fully describe a class. In Protégé, axioms are built using the 'SubClass Of' or 'Equivalent To' options in the class description panel, applying different logical constructs. The 'Equivalent To' option allows defining both necessary and sufficient conditions for a class 'in one logical statement' using the logical 'and' and 'or'. The 'SubClass Of' option allows defining only the necessary conditions 'in one or more, separate logical statements'. As an example, the full description of the Ice Sheet class is given below.

#### **Figure 1.**

*Modeling the Ice Sheet class using imported classes and object properties of CCO, BFO, and PATO ontologies. Ice Sheet has quality Continental Width and Ice Sheet Thickness. Ice Sheet is input of some Flowing Outward from a High Central Gently-sloping Ice Plateau. Flowing Outward from a High Central Gently-sloping Ice Plateau has process part Flowing of Ice Stream and has process part Flowing of Outlet Glaciers. Flowing of Outlet Glaciers has subclass Flowing of Outlet Glaciers into Ice Shelves and Flowing of Outlet Glaciers into the Sea. Flowing of Outlet Glaciers into Ice Shelves process starts Floating of Ice Shelves on the Sea. Flowing of Outlet Glaciers into the Sea process starts Floating of Ice Shelves on the Sea. Ice Sheet is input of some Storage of a Large Amount of Water. Ice Sheet is made of Ice. Melting of Ice Sheets and Glaciers has input Ice Sheet, has output Meltwater, is cause of Runoff, and process starts (i.e., is cause of) Sea Level Rise. Sea Level Rise process preceded by Melting of Ice Sheets and Glaciers. In all diagrams in this chapter, dashed lines are relations that are represented by the object properties and point from a property's domain (subject) class to its range (object) class (see below for explanation). Each solid arrow represents the has subclass relation and points from a class to its subclass. The diagram was made by the OntoGraf plugin in Protégé.*

#### *Climate System Ontology: A Formal Specification of the Complex Climate System DOI: http://dx.doi.org/10.5772/intechopen.110809*

Instances of the Ice Sheet class have several characteristics that must be represented in the class description. By extension, an Ice Sheet is a Cryospheric Object that is made of thick Ice and has a continent-scale extent. It also participates (as input) in several processes such as 'flowing outward from a high central gently-sloping ice plateau' and 'storage of a large amount of water'. These complex descriptions require building several axioms for Ice Sheet, applying the 'SubClass Of' option (**Figure 1**). They also require making the bfo: continuant classes of Ice (a subclass of Cryospheric Material, placed under bfo: object), Continental Width (under pato: size, a subclass of pato: morphology, under pato: quality), and Ice Sheet Thickness (under pato: thickness, a subclass of pato: size), and the bfo: occurrent classes of 'Flowing Outward from a High Central Gently-sloping Ice Plateau' and 'Storage of a Large Amount of Water' under the Cryospheric Process class (a subclass of Natural Internal Process, under cco: Natural Process). The first Cryospheric Process also includes sub-processes (e.g., Flowing of Ice Streams and Flowing of Outlet Glaciers) through the 'cco: has process part' object property. All of these natural processes that occur in some Cryospheric Object (i.e., Ice) occur during a bfo: one-dimensional temporal region (e.g., a period of time). Ice sheets also participate (as input) in the Melting of Ice Sheets and Glaciers process, with Meltwater as output. Melting also causes other processes, such as Runoff and Sea Level Rise, to occur. The Sea Level Rise process leads to (i.e., has output) Increased Sea Level. These processes and changes are shown in **Figure 1**.
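The necessary ('SubClass Of') conditions described for Ice Sheet can be illustrated with a minimal Python sketch. It is not how Protégé or an OWL reasoner works internally: the list below simply transcribes the restrictions named in the text and in Figure 1, and the function name `missing_conditions` and the `candidate` individual are hypothetical. Note that the check is closed-world (an unasserted fact counts as missing), whereas OWL reasoning is open-world.

```python
# Necessary ('SubClass Of') conditions for Ice Sheet, taken from the text and Figure 1.
# Each pair reads: Ice Sheet SubClassOf (<property> some <filler class>).
ICE_SHEET_AXIOMS = [
    ("is made of", "Ice"),
    ("has quality", "Continental Width"),
    ("has quality", "Ice Sheet Thickness"),
    ("is input of", "Flowing Outward from a High Central Gently-sloping Ice Plateau"),
    ("is input of", "Storage of a Large Amount of Water"),
]


def missing_conditions(assertions, axioms=ICE_SHEET_AXIOMS):
    """Return the necessary conditions not covered by an individual's assertions."""
    asserted = set(assertions)
    return [axiom for axiom in axioms if axiom not in asserted]


# Hypothetical individual, described by (property, class of the related individual).
candidate = [
    ("is made of", "Ice"),
    ("has quality", "Continental Width"),
    ("has quality", "Ice Sheet Thickness"),
    ("is input of", "Storage of a Large Amount of Water"),
]

for prop, filler in missing_conditions(candidate):
    print(f"not asserted: {prop} some {filler}")
```

Because these are only necessary conditions, meeting all of them does not by itself classify an individual as an Ice Sheet; that would require an 'Equivalent To' definition.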

In the Climate System Ontology, relations among class instances are modeled through the RO and CCO object properties. Each object property relates instances of a domain class to instances of a range class (through the rdfs: domain and rdfs: range constructs) [79]. For example, in the statement: 'Class-A *relates-to* Class-B', Class-A is the domain, and Class-B is the range, for the *relates-to* object property. Each CCO or RO object property also has other built-in properties and meta-properties such as owl: disjointWith, owl: inverseOf, owl: FunctionalProperty, and owl: TransitiveProperty [16] that provide additional logic for relationships.
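As a rough illustration of domain, range, and inverse declarations, the following Python sketch checks a statement against a property's declared domain and range and rewrites it with the inverse property. The class hierarchy is a simplified assumption, the domain and range given for 'has input' are assumed for the example, and all function names are hypothetical. In OWL itself, domain and range are used to infer the types of individuals rather than to reject statements, so this is a simplification.

```python
from dataclasses import dataclass
from typing import Optional

# Simplified, hypothetical class hierarchy (child -> parent) for illustration.
PARENT = {
    "Melting of Ice Sheets and Glaciers": "Cryospheric Process",
    "Cryospheric Process": "process",
    "Ice Sheet": "Cryospheric Object",
    "Cryospheric Object": "continuant",
}


@dataclass(frozen=True)
class ObjectProperty:
    name: str
    domain: str
    range_: str
    inverse: Optional[str] = None


def is_a(cls, ancestor):
    """True if cls equals ancestor or is one of its (indirect) subclasses."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = PARENT.get(cls)
    return False


def fits(prop, subject_cls, object_cls):
    """Check a statement 'subject_cls prop object_cls' against domain and range."""
    return is_a(subject_cls, prop.domain) and is_a(object_cls, prop.range_)


def inverse_statement(prop, subject_cls, object_cls):
    """Rewrite a statement using the inverse property, if one is declared."""
    if prop.inverse is None:
        return None
    return (object_cls, prop.inverse, subject_cls)


# Assumed declaration: processes have continuants as input.
has_input = ObjectProperty("has input", "process", "continuant", inverse="is input of")

print(fits(has_input, "Melting of Ice Sheets and Glaciers", "Ice Sheet"))
print(inverse_statement(has_input, "Melting of Ice Sheets and Glaciers", "Ice Sheet"))
```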

Knowledge is the sum of the facts that are known to be true in the domain of discourse at some point in time. For example, the two statements: 'carbon dioxide is a greenhouse gas' and 'meltwater is a product of melting' are known to be true in climate science. An ontology, as a model of the knowledge in a field of study (e.g., climate system), is developed by building numerous logical statements that represent such known facts. Each of these formal statements has three parts: a subject (S), a predicate (P), and an object (O). In OWL, properties stand for the predicates. The 'SPO triples' are the building blocks of knowledge representation. Ontologies are developed by modeling known facts from knowledge repositories (e.g., books, papers, and reports) in the domain, and defining triple SPO statements in logical 'axioms' (e.g., Melting of Ice Sheet cco: *process starts* Sea Level Rise). For readability, the reader may ignore the namespace prefixes in such triple statements. Doing so, the above statement is simply read as: 'melting of ice sheet process starts (i.e., is cause of) sea level rise'. The fact that emission of halocarbons leads to stratospheric ozone depletion and to positive radiative forcing can explicitly be expressed by the following two SPO statements: 'Emission of Halocarbon cco: *is cause of* some Stratospheric Ozone Depletion', and 'Emission of Halocarbon cco: *is cause of* some Positive Radiative Forcing'.
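The SPO statements quoted above can be written down and queried with a few lines of Python. This is only a sketch of the triple idea using plain tuples, not the OWL serialization used by CSO; the helper name `match` is hypothetical and the namespace prefixes are dropped, as the text suggests.

```python
# Subject-predicate-object triples taken from the examples in the text.
TRIPLES = [
    ("Carbon Dioxide", "is a", "Greenhouse Gas"),
    ("Melting of Ice Sheet", "process starts", "Sea Level Rise"),
    ("Emission of Halocarbon", "is cause of", "Stratospheric Ozone Depletion"),
    ("Emission of Halocarbon", "is cause of", "Positive Radiative Forcing"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that agree with every given part; None is a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# What does the emission of halocarbons cause?
for _subject, _predicate, effect in match(TRIPLES, s="Emission of Halocarbon", p="is cause of"):
    print(effect)
```

Leaving a different part open answers a different question, for example `match(TRIPLES, o="Sea Level Rise")` lists the statements whose object is Sea Level Rise.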

The Protégé editor applies OWL 2 [80] which is based on description logic [17]. By extending BFO, CCO, PATO, and RO, the Climate System Ontology inherits the foundational description logic that underlies these imported upper ontologies. CCO and RO object properties explicitly define the type of domain and range classes for each object property. The built-in description logic of these ontologies guarantees the initial consistency and coherency of the CSO domain ontology. As a good practice and to save time during debugging, we continuously ran the HermiT 1.4.3.456 reasoner [81–85], in Protégé, after each major change to the ontology, for example, after adding a new axiom. This assured consistency and coherency of the ontology.

### **4.2 Climate system ontology**

In this section, we describe the construction of the Climate System Ontology (CSO) based on the logical foundations of the imported CCO, BFO, PATO, and RO ontologies, which were described above. From an ontological perspective, each of the main components of the climate system (e.g., Hydrosphere, Atmosphere) is a bfo: fiat object part (a subclass of bfo: material entity), associated with theoretically drawn divisions. Fiat boundaries do not coincide with physical discontinuities. These material fiat parts are demarcated by a bfo: two-dimensional fiat boundary, which demarcates material entities (e.g., the Equator and Global Mean Sea Level) or immaterial entities (surfaces of cave chambers and boundary of the ozone hole). The location of these two- or three-dimensional fiat boundaries (e.g., Land Surface and Snow-covered Surface) is defined relative to material entities (Rock, Snow, and Vegetation). The CSO examples of fiat boundaries include the cco: Sea Level between the Atmosphere and Hydrosphere, the Land Surface between Atmosphere and Lithosphere, and the Snow-covered Surface between the Cryosphere and Atmosphere. The fiat object parts have their own parts. For example, the Atmosphere has Troposphere, Stratosphere, and other types of Atmospheric Layer as fiat object parts; the Lithosphere has the Northern Hemisphere, Southern Hemisphere, and Mid-latitudes; Ocean has fiat layers such as 'Upper Ocean', 'Top Centimeter Skin', and 'Top Few Meters' as parts. These parts and sub-parts consist of different kinds of material objects. The material parts such as Ocean, Land, and Permafrost extend over 3D space and have portions of matter among their proper and improper continuant sub-parts. For example, a molecule of Oxygen in the Atmosphere consists of oxygen atoms, and Soil is made of Mineral, Water, Organic Matter, and Air. Some objects such as Ship, Buoy, Sensor, Storage Facility, and Water Treatment Facility are categorized as cco: Artifact. Other parts are of the bfo: object aggregate types, consisting of disjoint parts that can lose or gain parts while maintaining their identity. A good example is the Climate System which can gain material (Anthropogenic Greenhouse Gas and Volcanic Aerosol) or lose mass and energy in its parts (e.g., Melting of Glaciers and Outgoing Radiation).
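The nesting of parts described above can be sketched as a small parthood table with a transitive lookup, assuming the 'has part' property is declared transitive. The table is a partial illustration built from the examples in the text; placing Ocean under Hydrosphere is an assumption made for the example, and the function name `all_parts` is hypothetical.

```python
# Direct parts, transcribed from the examples in the text.
HAS_PART = {
    "Climate System": ["Atmosphere", "Hydrosphere", "Lithosphere", "Cryosphere", "Biosphere"],
    "Atmosphere": ["Troposphere", "Stratosphere"],
    "Hydrosphere": ["Ocean"],  # assumption made for this sketch
    "Ocean": ["Upper Ocean", "Top Centimeter Skin", "Top Few Meters"],
}


def all_parts(whole):
    """Return the direct and indirect parts of `whole`, depth first."""
    found = []
    seen = set()
    stack = list(reversed(HAS_PART.get(whole, [])))
    while stack:
        part = stack.pop()
        if part in seen:
            continue
        seen.add(part)
        found.append(part)
        # Push sub-parts so that they are visited right after their whole.
        stack.extend(reversed(HAS_PART.get(part, [])))
    return found


print(all_parts("Hydrosphere"))
```

With transitivity, a query on the Climate System reaches the Troposphere or the Upper Ocean even though neither is asserted as its direct part.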

The bfo: material entities such as Glacier, Groundwater, Aerosol, and Ozone have characteristics that can be categorized under bfo: quality or pato: quality. For example, the Atmosphere has attributes (qualities) such as Aerosol Content, Heat Content in the Atmosphere, Temperature, and Concentration of Greenhouse Gas; Soil has Soil Moisture Content; Glacier Ice has pato: age; Sea Water has pato: acidity and pato: salinity; Drought and Heat Wave have pato: duration; Precipitation has pato: intensity; Ice has pato: radiation reflection quality (albedo). Natural and anthropogenic processes can change such qualities (attributes). For example, Human Activity can change the Concentration of Carbon Dioxide in the Atmosphere, the Extent of Permafrost, and the pH of the Sea Water.

Qualities that inhere in material entities are measured by devices, and their values are commonly reported and analyzed by meteorologists and climatologists. The values and units of these variables can be modeled with the cco: Information Content Entity class which subsumes several classes for data and information modeling described above. Weather and climate data modeled with these classes can readily be integrated, facilitating their transfer and reuse. The 'abnormal' aspects of the Climate System such as Ocean Heat Content Anomaly, Specific Humidity Anomaly, Medieval Climate Anomaly, Cool Little Ice Age Anomaly, Land Surface Temperature Anomaly, and Ocean Surface Temperature Anomaly are modeled using the pato: abnormal class, a subclass of the pato: deviation (from\_normal) class. Entities are related to their qualities through the ro: *has quality* object property.

Similar to material entities, occurrents (e.g., processes) such as Precipitation, Flood, Extreme Climate Event, and Monsoon have qualities such as pato: rate, pato: duration, pato: recurrent, and pato: frequency which are modeled under the pato: process quality class. BFO also provides the process profile class (a process) which proved handy for modeling the change in the rate of occurrence of adverse extreme climate events (e.g., Drought and Heat Wave) over time and the rate profile of Melting of Glaciers over a period of time.

Material independent continuants are also characterized by realizable entities such as disposition and role. The bfo: disposition is inherent in material entities because of their physical make-up, such as composition, structure, and texture. For example, Disaster and Risk are dispositions of an Extreme Climate Event. Health, Disease, Right, Security, Well-being, and Vulnerability to Climate-related Extreme Event are human dispositions. These dispositions are realized through certain processes. For example, Adverse Climate Event bfo: *realizes* Vulnerability to Heat Wave and Vulnerability to Flood. The cco: Electromagnetic Radiation Property class (a subclass of bfo: disposition) allowed defining CSO dispositions such as the Opacity of Soil, Transparent Ice, Radiation Absorptivity of Greenhouse Gas, Emissivity of Greenhouse Gas, Surface Albedo (under cco: Radiation Reflectivity), and Cosmic Ray Shielding Disposition of Ozone.

Each climate system component may also have a function which, as a subclass of disposition, is also a bfo: realizable entity. The CSO functions include Ecosystem Function, Generating Energy Function (of the Sun), and Regulating Earth Climate Function (of Ocean). The process of Nuclear Fusion in the Sun bfo: *realizes* the Generating Energy Function in the Sun. CCO provides a large number of classes for defining functions for artifacts, such as Sensor Artifact Function and Measurement Artifact Function. The bfo: realizable entity also includes role for continuant entities. These allow defining different kinds of bfo: role in CSO such as Policy Making Role, Carbon Storage Role, Driver of Climate Change Role, Driver of Deep Ocean Water Circulation Role, Carbon Sink Role, Proxy Role, and Greenhouse Gas Role. Processes realize roles. For example, Emission of Greenhouse Gas bfo: *realizes* Greenhouse Gas Role. Volcanic Eruption bfo: *realizes* the Aerosol Role for Volcanic Ash. Sampling from Ice Core bfo: *realizes* the Temperature Proxy Role for the Ice Core (a subclass of Sample). Nuclear fusion in the Sun bfo: *realizes* the 'Driver of Climate Change Role' for the Sun. Other processes realize dispositions. For example, Developing and Deploying Technology or Maintaining Stable Energy Supply bfo: *realizes* Energy Security. Some processes realize specific functions. Soil Moisture Drought cco: *has output* Reduced Ecosystem Function. Material entities relate to their realizable entities through the ro: *has disposition*, ro: *has function*, and ro: *has role* object properties.

The material and immaterial (i.e., independent continuant) parts of the climate system continuously interact through processes over time. The bfo: process provides mechanisms for the bfo: independent continuant system parts to interact. CCO extends the bfo: process class by defining the cco: Act, cco: Change, cco: Effect, cco: Natural Process, and cco: Stasis classes, which we have used to define various dynamic aspects of the CSO domain ontology. CSO classifies all anthropogenic activities, such as Fossil Fuel Emission, Emission of Halocarbon, and Industrial Activity, under the Human Activity class which is an indirect subclass of cco: Act. A large number of intentional anthropogenic activities are defined in CSO by extending the cco: Act class. These include the Evaluating Policies, Disaster Risk Management, Reduction of Disaster Risk, and Maintaining Stable Energy Supply classes. **Figure 2** displays the interactions of selected human activities in the climate system that are modeled in CSO.

#### **Figure 2.**

*A model of human activities. Human Activity is cause of Change in Atmospheric Composition, process starts Biogeophysical Anthropogenic Change (with Decrease in Evapotranspiration and Decrease in Land Surface Net Radiation subclasses), process starts Biogeochemical Anthropogenic Change (with the Decrease of Vegetation and Decrease in Soil Carbon Stocks subclasses), and is cause of Anthropogenic Change (a subclass of Anthropogenic Forcing). Anthropogenic Activity is-a Human Activity. Anthropogenic Activity is cause of Change in Climate Extremes, Change in Surface Ocean Salinity, Increase in the Concentration of Greenhouse Gases, and Increase in the Probability of Occurrence of Heat Waves. The subclasses of Anthropogenic Activity include Industrial Activity, Anthropogenic Emission, and Land-use. Solid arrows point to subclasses.*

The Climate System Ontology defines many dynamic processes that bring change in the components of the climate system. These changes, modeled as subclasses of the cco: Change class, include Change in Humidity and Change in Precipitation in the atmosphere; Varying Ice Area and Varying Snow Area in the cryosphere; Change in the Storage of Groundwater, Change in Ocean Water Salinity, and Change in Sea level in the hydrosphere; and Land Cover Change and Change in Surface Roughness in the lithosphere. The cco: Effect class is used in CSO to define the Adverse Environmental Effect class (and its many subclasses) and the Climate Change Impact class and its subclasses (e.g., Impact on Infrastructure, Impact on Species, and Impact on Cultural Assets). **Figure 3** shows the inter-relationships among a few system changes, including climate change.

#### **Figure 3.**

*A model of climate change and its relation to other changes. Change in Albedo positively regulates (i.e., increases the frequency, magnitude, and rate of) Radiative Forcing and process starts Climate Change. Radiative Forcing is cause of Climate Change. Change in Albedo, Climate Change, and Change in the Sources and Sinks of Carbon are subclasses of System Change. Change in the Sources and Sinks of Carbon process starts Climate Change. Exchange of Greenhouse Gases Between Land and Atmosphere process starts Climate Change. Increase in the Concentration of Greenhouse Gases and Change in Surface Roughness process starts Climate Change. Feedback Between Climate Change and Atmospheric Concentration of Trace Gas has process part some Climate Change. Subclasses (subtypes) of Climate Change include Change in Precipitation Pattern, Change in Climate Extremes, Change in Intensity of Heavy Precipitation Over Land Regions, Internally-induced Climate Change, Change in Evaporation Characteristics, Change in Temperature, Change in Water Cycle, and Anthropogenic Climate Change. Cumulative Total Anthropogenic CO2 Emission process starts Anthropogenic Climate Change. Human Activity is cause of Anthropogenic Climate Change.*

Interactions among the components of the climate system are modeled under the cco: Natural Process class. These include Cloud Formation, Precipitation, and Wind in the atmosphere; Accumulation of Snow, Freezing, and Retreat of Glaciers in the cryosphere; Evapotranspiration and Photosynthesis in the biosphere; and the Land Carbon Uptake from Atmosphere and Interception of Infiltrated Water by Vegetation classes in the lithosphere. Some climate system concepts such as 20th Century Warming, Little Ice Age Cooling, Drought, El Niño, and La Niña are classified under the cco: Stasis of Quality class. These are shown in **Figure 4**.

#### **Figure 4.**

*Representing Drought as a cco: Stasis. Drought is-a cco: Stasis of quality. Drought has output Deficit in Water Storage, has output Hydrological Imbalance, has output Soil Moisture Deficit, has output Streamflow Deficit, is cause of Abnormally Low Precipitation, and occurs on Period of Abnormally Dry Weather. Agricultural Drought, Hydrological Drought, and Meteorological Drought are sub-types of Drought. Climate Stasis is-a cco: Stasis of quality. Climate Stasis has output Climate; occurs on Millions of Years or Month, or Millenia. Subclasses of Climate Stasis include the 20th Century Warming, Medieval Climate Optimum Warming, and Little Ice Age Cooling.*

## **5. Discussion**

In the previous section, we presented some modeling artifacts of our Climate System Ontology and demonstrated how classes and object properties of upper ontologies enabled formal modeling of some aspects of the climate system. In this section, we discuss the intricacies of our modeling of the complex climate system concepts such as solar radiation, feedback, climate change impact, enhanced greenhouse effect, hydrological cycle, oscillation, and radiative forcing. We also expand our modeling to include complex interactions among the climate system's components (e.g., Atmosphere–Hydrosphere, Land-Ice, Ice-Ocean, and Soil-Biosphere).

Internal and external forcings continuously evolve the climate system over a wide range of temporal and spatial scales. Major perturbations in the radiative balance lead to self-organization, by creating and changing climate patterns. Nonlinear feedback mechanisms continuously amplify or dampen processes to allow system components to adapt to new changes. The system reorganizes to maintain its identity, structure, and function through new processes and patterns (e.g., more frequent extreme events) or by building resilience. Below, we elaborate on the modeling of some of these complex interactions.

**Figure 5** displays the transformation of the Solar Radiation as it continuously enters and exits the climate system. The Solar Radiation class is modeled as a subclass of cco: Electromagnetic Wave Process (a subclass of cco: Wave Process).

#### **Figure 5.**

*A model of different types of Radiation in CSO. Solar Radiation is-a cco: Electromagnetic wave process, cco: Has input (i.e., involves) Sun, cco: Has input Climate System Component, and cco: Process starts (i.e., causes) External Forcing. The Outgoing Radiation and Incoming Solar Radiation classes are subclasses of Solar Radiation. Reflection of Incoming Solar Radiation is-a cco: Electromagnetic wave process. Incoming Solar Radiation ro: Directly positively regulates (i.e., increases frequency, magnitude, and rate of) Reflection of Incoming Solar Radiation. Incoming Solar Radiation ro: Directly positively regulates Absorption of Incoming Solar Radiation. Absorption of Incoming Solar Radiation has subclass Absorption of Incoming Solar Radiation by the Atmosphere that ro: Directly positively regulates Warming of the Atmosphere and occurs in the Atmosphere. Absorption of Incoming Solar Radiation also has subclass Absorption of Incoming Solar Radiation by Surface which occurs at Sea Surface, Land Surface, Snow-covered Surface, and Vegetation-covered surface. Emission of Infrared Radiation has input Cloud and has subclass Emission of Infrared Radiation in All Directions with Greenhouse Gas and other trace gases as input. Emission of Infrared Radiation occurs in the Atmosphere and process starts (i.e., causes) Warming of the Atmosphere.*

The Solar Radiation class subsumes the Incoming Solar Radiation and Outgoing Solar Radiation classes. The cco: Electromagnetic Wave Process class also subsumes the Reflection of Incoming Solar Radiation, Absorption of Radiation, Return of the Radiation Absorbed by the Surface, Emission of Radiation, and Scattering of Part of the Incoming Solar Radiation classes.

The logical modeling artifacts provided by the imported ontologies also allowed CSO to efficiently model complex system features such as nonlinear dynamics through feedbacks. **Figure 6** shows an example of a positive feedback through which the water vapor (a major greenhouse gas), produced through the evaporation of ocean water, cycles through the atmosphere. The increased concentration of the water vapor leads to an increase in atmospheric temperature, which amplifies the original evaporation process, and leads to increased concentration of water vapor, bringing more heat in the atmosphere. The cyclical Feedback class and its subclasses are modeled in CSO as a subclass of the Cycle class, which is modeled as a subclass of the Fiat Process Part class (a subclass of cco: Change).
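One way to see the water vapor feedback as a loop is to treat the relations listed in the Figure 6 caption as directed links and search for a path that returns to its starting process. The Python sketch below does this with a depth-first search. Treating 'has output' and 'is input of' as links of the same kind as 'process starts' is a simplifying assumption for tracing the loop, and the function name `find_cycle` is hypothetical.

```python
# Directed links transcribed from the Figure 6 caption (source, relation, target).
FEEDBACK_EDGES = [
    ("Increase in Atmospheric Temperature", "process starts", "Increase in Evaporation"),
    ("Increase in Evaporation", "process starts", "Evaporation From the Ocean"),
    ("Ocean Water", "is input of", "Evaporation From the Ocean"),
    ("Evaporation From the Ocean", "has output", "Increased Water Vapor Content in the Atmosphere"),
    ("Increased Water Vapor Content in the Atmosphere", "is input of", "Enhanced Greenhouse Effect"),
    ("Increased Water Vapor Content in the Atmosphere", "is input of", "Amplification of Temperature Increase"),
    ("Enhanced Greenhouse Effect", "process starts", "Increase in Atmospheric Temperature"),
    ("Amplification of Temperature Increase", "process starts", "Increase in Atmospheric Temperature"),
]


def find_cycle(edges, start):
    """Return one path that leaves `start` and comes back to it, or None."""
    graph = {}
    for source, _relation, target in edges:
        graph.setdefault(source, []).append(target)
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        for nxt in graph.get(node, []):
            if nxt == start:
                return path + [start]
            if nxt not in path:  # keep paths simple so the search terminates
                stack.append((nxt, path + [nxt]))
    return None


cycle = find_cycle(FEEDBACK_EDGES, "Increase in Atmospheric Temperature")
print(" -> ".join(cycle))
```

A node such as Ocean Water feeds the loop but is not on it, so `find_cycle(FEEDBACK_EDGES, "Ocean Water")` returns `None`.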

Various impacts brought by climate change are also modeled in CSO using several modeling artifacts of CCO. **Figure 7** shows some of the impacts of extreme climate events and represents how these events realize the vulnerability (a disposition) of communities to such events. Several subclasses of the Climate Change Impact class explicitly define specific types of impacts (not expanded in the diagram).

The concepts of natural and enhanced greenhouse effects are shown in the model of **Figure 8**. The figure shows the natural and anthropogenic processes and material entities that cause or are involved in these two types of greenhouse effects.

#### **Figure 6.**

*A model of a positive feedback in CSO. Increase in Atmospheric Temperature process starts (causes) Increase in Evaporation which process starts Evaporation From the Ocean. Evaporation From the Ocean has input Ocean Water and has output Water Vapor. Evaporation From the Ocean is part of process Hydrological Cycle. Evaporation From the Ocean has output Increased Water Vapor Content in the Atmosphere which is an Increased Greenhouse Gas Concentration. Increased Water Vapor Content in the Atmosphere 'is input of' Amplification of Temperature Increase and is input of Enhanced Greenhouse Effect. Each of these processes process starts Increase in Atmospheric Temperature which process starts Increase in Evaporation which restarts the cycle by amplifying the Evaporation From the Ocean.*

The hydrological cycle is a major global process in the climate system. It involves numerous processes that occur in different components of the climate system. The cyclical process starts with evaporation from bodies of water such as oceans, followed by movement of the output water vapor through atmospheric circulations, condensation of the water vapor, cloud formation, and precipitation as rain or snow. Precipitation is followed by the interception of rain and snow by plants, infiltration of rain and melted snow into soil, soil evaporation, recharge of aquifers, surface runoff, and entry of streams back into oceans. **Figure 9** is a model of part of the hydrological cycle in CSO. The Hydrological Cycle class is modeled in CSO as a subclass of the Water Cycle, under the Fiat Process Part class (a subclass of cco: Change).

The three major subclasses of cco: Information Content Entity (see above) that represent data and information, in combination with the bfo: specifically dependent continuant class that defines the system variables and object properties such as ro: *concretize* that relate a variable (quality) to data and information, enable complete modeling of climate system data. Modeling data through these upper-level constructs enables integration of climate data and information. For example, numeric, graphic (map, plot), and textual (report) data related to specific occurrences (instances) of an El Niño or La Niña event, shown in the CSO model in **Figure 10**, can be modeled with the cco: Information Content Entity class.

#### **Figure 7.**

*A model of Climate Change Impact in CSO as an Adverse Impact, a subclass of Impact. The Impact class is-a subclass of Effect which is-a Adverse Effect (under cco: Effect). Different kinds of impacts are also shown as subclasses of the Climate Change Impact. Climate Change is cause of some Climate Change Impact. The Extreme Climate Event process starts (i.e., causes) Climate Change Impact and realizes Vulnerability to Climate-related Extreme Event.*

Radiative forcing, as a change in the incoming and outgoing radiative flux, may be caused by changes in the concentration of greenhouse gases in the atmosphere because of human activities, or by solar cycles. CSO models Radiative Forcing and its subclass Positive Radiative Forcing using several CCO object properties. **Figure 11** shows that anthropogenic activities lead to anthropogenic forcing, causing an increase in the global mean surface temperature, which causes global mean surface warming. It also shows that anthropogenic activities (e.g., emission of halocarbons) and an increase in atmospheric opacity cause positive radiative forcing. Positive radiative forcing, with input from greenhouse gases, starts the process of enhanced greenhouse effect, which leads to an increase in atmospheric temperature and ultimately the warming of the atmosphere.
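The causal chain described for radiative forcing can be traced programmatically once the relations are written as directed links. The following Python sketch uses a breadth-first search over the 'is cause of' and 'process starts' statements listed in the Figure 11 caption (the link from Enhanced Greenhouse Effect to Increase in Atmospheric Temperature follows the wording of Figure 8). It is an illustration of the kind of query the ontology supports, not the query mechanism of CSO itself; the function name `causal_path` is hypothetical.

```python
from collections import deque

# Causal links transcribed from the Figure 11 caption (source, relation, target).
CAUSAL_EDGES = [
    ("Emission of Halocarbons", "process starts", "Positive Radiative Forcing"),
    ("Increase in Atmospheric Opacity", "process starts", "Positive Radiative Forcing"),
    ("Anthropogenic Activity", "is cause of", "Positive Radiative Forcing"),
    ("Positive Radiative Forcing", "process starts", "Enhanced Greenhouse Effect"),
    ("Enhanced Greenhouse Effect", "process starts", "Increase in Atmospheric Temperature"),
    ("Increase in Atmospheric Temperature", "is cause of", "Warming of the Atmosphere"),
    ("Anthropogenic Forcing", "is cause of", "Increase in the Global Mean Surface Temperature"),
    ("Increase in the Global Mean Surface Temperature", "is cause of", "Global Mean Surface Warming"),
]


def causal_path(edges, cause, effect):
    """Return the shortest chain of processes from `cause` to `effect`, or None."""
    successors = {}
    for source, _relation, target in edges:
        successors.setdefault(source, []).append(target)
    previous = {cause: None}  # node -> node it was reached from
    queue = deque([cause])
    while queue:
        node = queue.popleft()
        if node == effect:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in successors.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None


chain = causal_path(CAUSAL_EDGES, "Emission of Halocarbons", "Warming of the Atmosphere")
print(" -> ".join(chain))
```

The search follows links only in the causal direction, so asking for a path from Warming of the Atmosphere back to Emission of Halocarbons returns `None` with this edge list.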

## **6. Summary**

#### **Figure 8.**

*A model of Greenhouse Effect as a cco: Effect, under bfo: Process. Greenhouse Effect process starts (i.e., is cause of) some Increase in the Global Mean Surface Temperature. As a subclass of Greenhouse Effect, the Enhanced Greenhouse Effect process starts some Increase in Atmospheric Temperature. Increased Aerosol Content, Increased Concentration of Greenhouse Gases, and Increased Water Vapor Content in the Atmosphere are input of Enhanced Greenhouse Effect. Increase in Water Vapor Content in the Atmosphere has output some Increased Water Vapor Content in the Atmosphere, and process starts some Greenhouse Effect. Emission of Infrared Radiation is cause of Trapping Heat in the Atmosphere. Greenhouse Gas is input of Trapping Heat in the Atmosphere and is input of Emission of Infrared Radiation. Positive Radiative Forcing process starts some Enhanced Greenhouse Effect. Absorption of Radiation is cause of some Greenhouse Effect. Greenhouse Gas is input of Upward Transfer of Infrared Radiation From Earth Surface to Higher Altitudes (an Atmospheric Process).*

#### **Figure 9.**

*A model of Hydrological Cycle. Evaporation From the Ocean has input Ocean Water and has output Water Vapor. Evaporation From the Land Surface has input Lake Water and has input Stream Water. Evaporation From the Land Surface or Evaporation From the Ocean process starts (i.e., causes) Atmospheric Circulation. Evaporation From the Land Surface is-a Atmospheric–Land Hydrospheric Interaction. Evaporation in the Soil and Evaporation in the Leaves of Plants are subclasses of Evaporation From the Land Surface. Warming of the Atmosphere directly positively regulates (i.e., intensifies) Evaporation. Atmospheric Circulation process starts Condensation. Condensation has input Water Vapor and has output Cloud Droplet. Condensation process precedes, and process starts Cloud Formation. Cloud Formation process precedes, and process starts Precipitation. Precipitation process starts Flow of Water Over Land Surface or Through the Subsurface and Infiltration of Surface Runoff into Soil and Rock. Runoff, as a subclass of the Flow of Water Over Land Surface or Through the Subsurface, has subclass Freshwater Runoff Returning to the Oceans and has input (i.e., involves) Ocean Water. Ocean Water re-enters the hydrological cycle through the Evaporation From the Ocean. The cycle repeats the above processes.*

#### **Figure 10.**

*A model of oceanic oscillations. Two oscillations are shown. North Atlantic Oscillation designated by NAO realizes Climate of Europe; has output Climate Variability in Europe in Winter; occurs on Winter; process starts Westerly Current Between Icelandic Low Pressure and Azores High Pressure Areas (which occurs at Sea Level); realizes Climate of Part of Asia. Atmospheric–Ocean Interactions in North Atlantic process starts North Atlantic Oscillation. The El Niño-Southern Oscillation designated by ENSO; occurs in Ocean; occurs at Sea Level; occurs at Tropical Pacific; has quality ENSO Variability; process starts El Niño and process starts La Niña; caused by Atmosphere–Ocean Interaction in the Tropical Pacific that occurs at Tropical Pacific.*

The Climate System Ontology (CSO) specifies the characteristics of different elements of the climate system and models the dynamic processes that affect the structure, behavior, and pattern of these elements at the micro- (component) and macro- (whole system) levels. For each process, the ontology identifies the components that change their attributes as they participate in the process as input and other processes that are caused by the process and produce their own output. Many of these processes are nonlinear, producing outputs that affect the original process that led to the output. The CSO models the dynamic interactions among the system's major components as the solar radiative energy cycles through the system via various reflective, absorptive, and emissive processes. We described the climate system from a complex system perspective and modeled several of its dynamic processes based on this view.

We developed the Climate System Ontology by extending the class hierarchy and logic of a set of well-designed, top- and mid-level ontologies. The terminology used for the class definitions and relations in the Climate System Ontology is based on the IPCC and other sources of climate system knowledge. The use of the foundational logics of the imported upper-level ontologies in the development of the Climate System Ontology ensures interoperability with other ontologies that extend the same upper-level ontologies. We gave full descriptions of these upper-level ontologies and specified best practices for using them to build domain or application ontologies. We demonstrated, by providing several examples, how complex features in the climate system can be modeled in the Protégé editor. The ontology is publicly available in a GitHub repository for extension by climate scientists to build their own application ontologies. The Climate System Ontology can be queried by decision-makers and policymakers to discover the effects of different kinds of natural and anthropogenic processes that occur in the complex climate system.

#### **Figure 11.**

*A model of part of the Positive Radiative Forcing process in CSO. Forcing is modeled as a subclass of cco: Change (not shown). Radiative Forcing is-a Forcing and Positive Radiative Forcing is-a Radiative Forcing. Positive Radiative Forcing has input Anthropogenic Greenhouse Gas. Emission of Halocarbons and Increase in Atmospheric Opacity process starts Positive Radiative Forcing. Anthropogenic Forcing is-a Positive Radiative Forcing. Anthropogenic Forcing is cause of Increase in the Global Mean Surface Temperature which is cause of Global Mean Surface Warming. Human Activity is cause of Anthropogenic Change. Anthropogenic Activity is cause of Positive Radiative Forcing which process starts Enhanced Greenhouse Effect, which in turn causes Increase in Atmospheric Temperature, which is cause of Warming of the Atmosphere.*

*Climate System Ontology: A Formal Specification of the Complex Climate System DOI: http://dx.doi.org/10.5772/intechopen.110809*

## **Author details**

Armita Davarpanah1\*, Hassan A. Babaie2 and Guanyu Huang1

1 Environmental and Health Sciences, Spelman College, Atlanta, GA, USA

2 Department of Geosciences, Georgia State University, Atlanta, GA, USA

\*Address all correspondence to: adavarpa@spelman.edu

© 2023 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## **References**

[1] Ahrens CD. Meteorology Today: An Introduction to Weather, Climate, and the Environment. Thomson/Brooks/Cole: Belmont, CA; 2007. p. 688

[2] Holland JH. Emergence. Philosophica. 1997;**59**(1):11-40

[3] Munoz YJ, de Castro LN. Self-organization and emergence in artificial life: Concepts and illustrations. Journal of Experimental & Theoretical Artificial Intelligence. 2009;**21**(4):273-292

[4] Ehrlich P, Raven P. Butterflies and plants: A study in coevolution. Evolution. 1964;**18**(4):586-608

[5] AR5. Synthesis Report: Climate Change 2014. The Intergovernmental Panel on Climate Change. 2014. Available online: https://www.ipcc.ch/report/ar5/syr/ [Accessed: March 28, 2022]

[6] Gruber N, Bakker DCE, DeVries T, Gregor L, Hauck J, Landschützer P, et al. Trends and variability in the ocean carbon sink. Nature Reviews Earth & Environment. 2023;**4**:119-134. DOI: 10.1038/s43017-022-00381-x

[7] Collins M, Knutti R, Arblaster J, Dufresne J-L, Fichefet T, Friedlingstein P, et al. Long-term climate change: Projections, commitments and irreversibility. In: Stocker TF, Qin D, Plattner G-K, Tignor M, Allen SK, Boschung J, et al, editors. Climate Change 2013: The Physical Science Basis. Contribution of Working Group I to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change. Cambridge, United Kingdom and New York, NY, USA: Cambridge University Press; 2013

[8] Lüthi D, Le Floch M, Bereiter B, Blunier T, Barnola J-M, Siegenthaler U, et al. High-resolution carbon dioxide concentration record 650,000-800,000 years before present. Nature. 2008;**453**: 379-382. DOI: 10.1038/nature06949

[9] Feldman DR, Collins WD, Gero PJ, Torn MS, Mlawer EJ, Shippert TR. Observational determination of surface radiative forcing by CO2 from 2000 to 2010. Nature. 2015;**519**:339-343. DOI: 10.1038/nature14240

[10] Rial JA, Pielke RA, Beniston M, Claussen M, Canadell J, Cox P, et al. Nonlinearities, feedbacks and critical thresholds within the Earth's climate system. Climatic Change. 2004;**65**:11-38. DOI: 10.1023/B:CLIM.0000037493.89489.3f

[11] Climate Change. Climate change evidence & causes - update 2020. An overview from the Royal Society and the US National Academy of Sciences. 2020. Available online: https://royalsociety.org/~/media/royal\_society\_content/policy/projects/climate-evidence-causes/climate-change-evidence-causes.pdf [Accessed: March 28, 2022]

[12] Gilbert L. Concepts and Applications of Climatology. New York, NY: Syrawood Publishing House; 2019. p. 216

[13] Somerville RCJ, Hassol SJ. Communicating the science of climate change. Physics Today. 2011;**64**(10): 48-53

[14] Sullivan D. Climatology. New York, NY: Callisto Reference; 2019. p. 219

[15] AR6. Sixth Assessment Report. Climate Change 2022: Impacts, Adaptation and Vulnerability. The Working Group II contribution to the Sixth Assessment Report. 2022. Available online: www.ipcc.ch/assessment-report/ar6/ [Accessed: March 28, 2022]

[16] OWL. The W3C Web Ontology Language (OWL). 2004. Available online: www.w3.org/OWL/ [Accessed: March 10, 2022]

[17] Baader F. In: Baader F, Calvanese D, McGuinness DL, Nardi D, Patel-Schneider PF, editors. The Description Logic Handbook – Theory, Implementation, and Applications. New York: Cambridge University Press; 2007. p. 602

[18] Arp R, Smith B, Spear AD. Building Ontologies with Basic Formal Ontology. Cambridge MA, USA: MIT Press; 2015. p. 248

[19] BFO. Basic Formal Ontology. 2022. Available online: https://basic-formal-ontology.org/ [Accessed: June 28, 2022]

[20] CCO. Available online: https://github.com/CommonCoreOntology/CommonCoreOntologies

[21] Rudnicki R. Modeling Information with the Common Core Ontologies. Buffalo, NY: CUBRC Inc.; 2017. Available online: https://www.nist.gov/system/files/documents/2021/10/14/nist-ai-rfi-cubrc\_inc\_003.pdf [Accessed: March 28, 2022]

[22] Rudnicki R. An Overview of the Common Core Ontologies. Buffalo, NY: CUBRC Inc.; 2019. Available online: https://www.nist.gov/system/files/documents/2021/10/14/nist-ai-rfi-cubrc\_inc\_004.pdf [Accessed: March 28, 2022]

[23] RO. Relation Ontology (RO). 2008. Available online: https://github.com/oborel/obo-relations [Accessed: June 15, 2022]

[24] PATO. The 'Phenotype and Trait Ontology'. 2022. Available online: https://raw.githubusercontent.com/pato-ontology/pato/master/pato.owl

[25] Gruber TR. A translation approach to portable ontology specifications. Knowledge Acquisition. 1993;**5**(2):199-220

[26] Baede APM, Ahlonsou E, Ding Y, Schimel D, Bolin B, Pollonais S. The climate system: An overview. In: Houghton JT, Ding Y, Griggs DJ, Noguer M, van der Linden PJ, Dai X, et al, editors. Climate Change 2001: The Scientific Basis. Cambridge: Cambridge University Press; 2001

[27] Trenberth KE, Fasullo JT, Kiehl JT. Earth's global energy budget. Bull. Amer. Meteor. Soc. 2009;**90**:311-323

[28] National Academy of Sciences. Climate Change and Ecosystems. Washington, DC: The National Academies Press; 2019. DOI: 10.17226/25504

[29] Fröhlich C, Lean JL. The Sun's total irradiance: Cycles, trends and related climate change uncertainties since 1976. Geophysical Research Letters. 1998;**25**:4377-4380

[30] Lean JL, Rind D. Climate forcing by changing solar radiation. Journal of Climate. 1998;**11**:3069-3094

[31] Kiehl JT, Trenberth KE. Earth's annual global mean energy budget. Bull. Am. Met. Soc. 1997;**78**:197-208

[32] Cassia R, Nocioni M, Correa-Aragunde N, Lamattina L. Climate change and the impact of greenhouse gasses: CO2 and NO, friends and foes of plant oxidative stress. Frontiers in Plant Science. 2018;**9**:273. DOI: 10.3389/fpls.2018.00273

[33] Boucher O, Randall D, Artaxo P, Bretherton C, Feingold G, Forster P, et al. Clouds and aerosols. In: Stocker TF, Qin D, Plattner G-K, Tignor M, Allen SK, Boschung J, et al, editors. Climate Change 2013: The Physical Science Basis. Contribution of Working Group I to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change. Cambridge, United Kingdom and New York, NY, USA: Cambridge University Press; 2013

[34] Gettelman A, Forster PM d F, Fujiwara M, Fu Q, Vomel H, Gohar LK, et al. Radiation balance of the tropical tropopause layer. Journal of Geophysical Research. 2004;**109**:D07103. DOI: 10.1029/2003JD004190

[35] Pierrehumbert RT. Infrared radiation and planetary temperature. Physics Today. 2011;**64**:33-38

[36] Rypdal K. Global temperature response to radiative forcing: Solar cycle versus volcanic eruptions. Journal of Geophysical Research. 2012;**117**:D06115. DOI: 10.1029/2011JD017283

[37] Stocker TF, Clarke GKC, Le Treut H, Lindzen RS, Meleshko, VP, Mugara RK, et al. Physical climate processes and feedbacks. In: Houghton JT, Ding Y, Griggs DJ, Noguer M, van der Linden PJ, Dai X, editors. IPCC, 2001: Climate Change 2001: The Scientific Basis. Contribution of Working Group I to the Third Assessment Report of the Intergovernmental Panel on Climate Change. New York, NY: Cambridge University Press; 2001. pp. 417-470

[38] Raghuraman SP, Paynter D, Ramaswamy V. Anthropogenic forcing and response yield observed positive trend in Earth's energy imbalance. Nature Communications. 2021;**12**:4577. DOI: 10.1038/s41467-021-24544-4

[39] Hall-Spencer JM, Rodolfo-Metalpa, Martin RS, Ransome E, Fine M, Turner SM, et al. Volcanic carbon dioxide vents show ecosystem effects of ocean acidification. Nature. 2008;**454**:96-99

[40] Riebesell U. Climate change: Acid test for marine biodiversity. Nature. 2008;**454**:46-47

[41] Wuebbles DJ, Easterling DR, Hayhoe K, Knutson T, Kopp RE, Kossin JP, et al. Our globally changing climate. In: Wuebbles DJ, Fahey DW, Hibbard KA, Dokken DJ, Stewart BC, Maycock TK, editors. Climate Science Special Report: Fourth National Climate Assessment. Vol. I. Washington, DC, USA: U.S. Global Change Research Program; 2017. pp. 35-72. DOI: 10.7930/J08S4N35

[42] Jain PC. Greenhouse effect and climate change: Scientific basis and overview. Renewable Energy. 1993;**3**(4-5):403-420

[43] Kheshgi HS, White BS. Does recent global warming suggest an enhanced greenhouse effect? Climatic Change. 1993;**23**:121-139. DOI: 10.1007/BF01097333

[44] Feldl N, Roe GH. The nonlinear and nonlocal nature of climate feedbacks. Journal of Climate. 2013;**26**:8289-8304

[45] Meehl GA, Washington WM, Arblaster JM, Hu A, Teng H, Tebaldi C, et al. Climate system response to external forcings and climate change projections in CCSM4. Journal of Climate. 2012;**25**(11):3661-3683. DOI: 10.1175/jcli-d-11-00240.1

[46] Wang C, Deser C, Yu J-Y, DiNezio P, Clement A. El Niño-southern oscillation (ENSO): A review. In: Glymn P, Manzello D, Enochs I, editors. Coral Reefs of the Eastern Pacific. Dordrecht: Springer Science Publisher; 2016. pp. 85-106

[47] Dong L, McPhaden MJ. The role of external forcing and internal variability in regulating global mean surface temperatures on decadal timescales. Environmental Research Letters. 2017;**12**:034011

[48] Luber G, McGeehin M. Climate change and extreme heat events. American Journal of Preventive Medicine. 2008;**35**(5):429-435

[49] Myers KF, Doran PT, Cook J, Kotcher JE, Myers TA. Consensus revisited: Quantifying scientific agreement on climate change and climate expertise among earth scientists 10 years later. Environmental Research Letters. 2021;**16**(10):104030. DOI: 10.1088/1748-9326/ac2774

[50] Benbya H, Nan N, Tanriverdi H, Yoo Y. Complexity and information systems research in the emerging digital world. MIS Quarterly. 2020;**44**(1):1-17. Special Issue: Complexity & IS Research

[51] Holland JH. Hidden Order: How Adaptation Builds Complexity. Reading, MA: Addison-Wesley; 1995

[52] Glansdorff P, Prigogine I. Thermodynamic Study of Structure, Stability and Fluctuations. New York: Wiley; 1978

[53] Bak P, Tang C, Wiesenfeld K. Self-organized criticality. Physical Review A. 1988a;**38**:364

[54] Crucifix M. Oscillators and relaxation phenomena in Pleistocene climate theory. Philosophical Transactions of the Royal Society A. 2012;**370**:1140-1165. DOI: 10.1098/rsta.2011.0315

[55] Roli A, Villani M, Filisetti A, et al. Dynamical criticality: Overview and open questions. Journal of Systems Science and Complexity. 2018;**31**:647-663. DOI: 10.1007/s11424-017-6117-5

[56] Kauffman SA. The Origins of Order: Self-Organization and Selection in Evolution. New York: Oxford University Press; 1993

[57] Lewin R. Complexity: Life at the Edge of Chaos. Chicago: University of Chicago Press; 1992

[58] Johnson GC, Lyman JM. Warming trends increasingly dominate global ocean. Nature Climate Change. 2020;**10**:757-761. DOI: 10.1038/s41558-020-0822-0

[59] Watson AJ, Schuster U, Shutler JD, Holding T, Ashton IGC, Landschützer P, et al. Revised estimates of ocean-atmosphere CO2 flux are consistent with ocean carbon inventory. Nature Communications. 2020;**11**:4422. DOI: 10.1038/s41467-020-18203-3

[60] De Wolf T, Holvoet T. Emergence versus self-organization: different concepts but promising when combined. In: Brueckner SA, Di Marzo SG, Karageorgos A, Nagpal R, editors. Engineering Self-Organising Systems. ESOA 2004. Lecture Notes in Computer Science. Vol. 3464. Berlin, Heidelberg: Springer; 2005. DOI: 10.1007/11494676\_1

[61] Heylighen F. The science of self-organisation and adaptivity. In: The Encyclopedia of Life Support Systems. Paris, France: UNESCO Publishing-Eolss Publishers; 2002

[62] Byeon JH. Non-equilibrium thermodynamics approach to the change in political systems. System Research and Behavioral Science. 1999;**16**:283-291

[63] Langton CG. Computation at the edge of chaos: Phase transitions and emergent computation. Physica D. 1990;**42**:12-37

[64] Bak P, Tang C, Wiesenfeld K. Scale invariant spatial and temporal fluctuations in complex systems. Random Fluctuations and Pattern Growth: Experiments and Models. 1988b;**157**:329-335. ISBN: 978-0-7923-0073-1

[65] Prigogine I, Stengers I. Order out of Chaos. New York: Bantam; 1984

[66] Prigogine I, Nicolis G, Babylontz A. Thermodynamics of evolution. Physics Today. 1971;**25**(11):23

[67] Gershenson C. Design and Control of Self-Organizing Systems. PhD Dissertation. Belgium: Vrije Universiteit Brussel; 2007. Available online: http://cogprints.org/5442/1/thesis.pdf

[68] Juarrero A. Dynamics in action: Intentional behavior as a complex system. Emergence. 2000;**3**(2):24-57

[69] Mitchel SD. Emergence: Logical, functional, and dynamical. Synthese. 2012;**185**:171-186

[70] Jacobson MJ, Kapur M, So J-J, Lee J. The ontologies of complexity and learning about complex systems. Instructional Science. 2011;**39**:763-883

[71] Couzin ID, Krause J. Self-organization and collective behavior in vertebrates. Advances in the Study of Behavior. 2003;**33**:1-75

[72] BFO Standard. ISO/IEC PRF 21838-2.2 Information Technology — Top-Level Ontologies (TLO) — Part 2: Basic Formal Ontology (BFO). 2021. https://www.iso.org/standard/74572.html [Accessed: June 10, 2022]

[73] BFO users. 2022. Available online: https://basic-formal-ontology.org/users.html [Accessed: June 27, 2022]

[74] GitHub. Open Source Repository for Software Development and Version Control. 2023. Available online: http://github.com

[75] Musen MA. The Protégé project: A look back and a look forward. AI Matters. 2015;**1**:4-12. DOI: 10.1145/2757001.2757003

[76] Protégé. A Free, Open-Source Ontology Editor and Framework for Building Intelligent Systems. 2022. Available online: https://protege.stanford.edu/ [Accessed: December 10, 2021]

[77] Rudnicki R, Smith B, Malyuta T, Mandrick COLW. Best Practices of Ontology Development. White Paper. Buffalo, NY: CUBRC; 2016. Retrieved March 30, 2022 from: https://www.nist.gov/system/files/documents/2021/10/14/nist-ai-rfi-cubrc\_inc\_002.pdf

[78] Partridge C, Mitchell A, Cook A, Sullivan J, West M. A Survey of Top-Level Ontologies - to Inform the Ontological Choices for a Foundation Data Model. 2020. DOI: 10.17863/CAM.58311. Available from: https://www.cdbb.cam.ac.uk/files/a\_survey\_of\_top-level\_ontologies\_lowres.pdf. Constructioninnovationhub.org.uk, UK Research and Innovation [Accessed: June 24, 2022]

[79] RDFS. RDF Schema 1.1. W3C Recommendation 25 February 2014. 2014. https://www.w3.org/TR/rdf-schema/

[80] Motik B, Patel-Schneider PF, Parsia B, Bock C, Fokoue A, Haase P, et al. OWL 2 web ontology language: Structural specification and functional-style syntax. W3C recommendation. 2009;**27**:159

[81] Shearer R, Motik B, Horrocks I. HermiT: A highly-efficient OWL reasoner. In: Proceedings of the 5th OWLED Workshop on OWL: Experiences and Directions, collocated with the 7th International Semantic Web Conference (ISWC-2008), Karlsruhe, Germany, Oct 26-27, 2008

[82] Otte JN, Kiritsi D, Mohd Ali M, Yang R, Zhang B, Rudnicki R, et al. An ontological approach to representing the product life cycle. Applied Ontology. 2019;**14**(2):179-197

[83] Rodríguez-Fonseca B, Suárez-Moreno R, Ayarzagüena B, López-Parages J, Gómara I, Villamayor J, et al. A review of ENSO influence on the North Atlantic. A non-stationary signal. Atmosphere. 2016;**7**:87. DOI: 10.3390/atmos7070087

[84] Smith B, Ceusters W, Klagges B, Kohler J, Kumar A, Lomax J, et al. Relations in biomedical ontologies. Genome Biology. 2005;**6**(5):R46

[85] Smith B, Ceusters W. Ontological realism: A methodology for coordinated evolution of scientific ontologies. Applied Ontology. 2010;**5**(3-4):139-188

## **Chapter 5**

## Ontologies as a Tool for Formalizing Data Validation Rules

*Nicholas Nicholson and Iztok Štotl*

## **Abstract**

Comparison of health data across national or even regional boundaries is a challenging task. Data sources, data collection methods, and data quality can vary widely and the quality of the indicators themselves is dependent upon the veracity of the underlying data. For any trans-regional or trans-national comparison of indicators, it is imperative to ensure data are appropriately validated. Ontologies provide a number of functionalities to help in this process. Data rules can be formalized using the ontology axioms, which are useful for removing the ambiguities of rules expressed in natural language. In addition, the axioms serve to identify the metadata and their corresponding semantic relationships, which can in turn be linked to standard data dictionaries or other ontologies. Moreover, ontologies provide the means for encapsulating the underlying data model of the domain allowing the rules and the data model to be maintained in a single application. Finally the expression of the axioms in description logic, as supported for example by the web ontology language, allows machine reasoning to validate data sets automatically against the formalized rules.

**Keywords:** web ontology language, data harmonization, data validation, data rules, description logic, linked metadata

## **1. Introduction**

Data validation is a key part of the overall data harmonization process that allows meaningful comparison or integration of different data sets. This is particularly important for the derivation of indicators, which may be used for comparison or benchmarking purposes across countries or regions. Prime examples are population-based disease surveillance programs and environmental monitoring and control programs.

Disease monitoring and surveillance is a particular focus of the European Union and a number of pan-European registry networks exist for this purpose. The European Network of Cancer Registries (ENCR) is the most established surveillance network incorporating over 150 separate regional or national registries [1]. A similar initiative in the United States is the Surveillance, Epidemiology, and End Results (SEER) program [2].

In order to help harmonize the data, which may be collected via different processes from different sources, registry networks generally agree a core or common data set that comprises the most accessible, important and well-defined variables. As an example the ENCR common data set consists of about 50 variables [3]. Even though the common data set variables are generally well defined, they may not necessarily be described in a manner that easily allows semantic linkage or cross-reference. Furthermore, they may depend on domain-specific knowledge not readily available to data users outside the domain.

Indicators for comparison purposes tend to be derived from common data sets since they constitute the variables that are the most harmonized within a disease domain. It is particularly important that the underlying data of the indicators are consistent and complete to avoid erroneous conclusions or bias in the results [4]. Ensuring an adequate level of consistency, however, is quite difficult to achieve in practice given the heterogeneity of data sources and data-collection processes.

Assuming a pre-defined level of quality, data consistency can nevertheless be verified using rule-based systems to check that the individual data fields are present and within the expected ranges. More complex, inter-variable rules check data consistencies between variables and their values. Other consistency checks can compare the frequency of occurrences of specific values of data. All these checks provide greater confidence in the fidelity of data sets for comparison purposes [5].
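The first two kinds of check described above (presence and range checks on single fields, and inter-variable consistency checks) can be sketched in a few lines of Python. The field names, the age range, and the forbidden pair below are invented for illustration and are not taken from the ENCR rule set; `check_record` is a hypothetical helper, not part of any existing validation tool.

```python
def check_record(record, required, ranges, forbidden_pairs):
    """Return a list of problems found in one data record (a dict)."""
    problems = []
    # Presence check: every required field must hold a value.
    for field in required:
        if record.get(field) in (None, ""):
            problems.append(f"missing: {field}")
    # Range check: numeric fields must fall inside [low, high].
    for field, (low, high) in ranges.items():
        value = record.get(field)
        if isinstance(value, (int, float)) and not low <= value <= high:
            problems.append(f"out of range: {field}={value}")
    # Inter-variable check: some value pairs must not occur together.
    for field_a, value_a, field_b, value_b in forbidden_pairs:
        if record.get(field_a) == value_a and record.get(field_b) == value_b:
            problems.append(
                f"inconsistent: {field_a}={value_a} with {field_b}={value_b}"
            )
    return problems


# Hypothetical rule set and record, invented for illustration only.
REQUIRED = ["sex", "age", "topography"]
RANGES = {"age": (0, 120)}
FORBIDDEN_PAIRS = [("sex", "F", "topography", "C61")]

record = {"sex": "F", "age": 130, "topography": "C61"}
problems = check_record(record, REQUIRED, RANGES, FORBIDDEN_PAIRS)
```

Even in this small form the rules live in three separate structures, which hints at the maintenance problem that the tabular rule sets discussed in the next section face at scale.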

## **2. Specification of the rule base**

Specifying the data-validation rules in an optimal way is itself a challenge. Rules are often described using natural language which, whilst having the advantage of making them more readable, leads to ambiguities for anything other than the most simple rules. Complex rules with dependencies on multiple variables can be illustrated more easily via a series of tables that constrain the values of the variables not forming the major focus within a particular table. Ensuring the consistency and verifying the accuracy of the rules across multiple tables is not straightforward and leads to considerable maintenance overheads.

The ENCR common data set comprises variables describing a tumor, such as: morphology (type of tumor); behavior (how the tumor acts in the body); topography (organ affected); basis of diagnosis (how the tumor was diagnosed); grade (how the tumor cells compare with normal cells under the microscope); and stage (extent of the tumor). Morphology, behavior, topography, and grade are specified by codes adhering to the international classification of diseases for oncology, edition 3 (ICD-O-3) [6]. Stage for solid tumors is generally specified according to the globally recognized TNM staging system describing the extent of cancer disease, where the "T" component is related to the size of the tumor or its invasion into local structures; the "N" component is related to the number and nature of lymph node groups adjacent to the tumor with evidence of tumor spread; and the "M" component is related to the presence of local or distant metastatic sites. The rule interdependencies of all these tumor-description variables in the ENCR rules are illustrated in **Table 1**. To manage more easily the complexity of the interdependencies, the rules are divided into nine separate sets of tables, namely:



#### **Table 1.**

*Rule interdependencies (marked with an "X") of some of the main variables within the ENCR common data set. Morph = morphology; Topog = topography; BoD = basis of diagnosis; Beh = behavior. The shaded cells indicate no interdependencies.*

4. basis of diagnosis/morphology/topography/age;

5. grade/morphology/behavior;

6. morphology/topography;

7. topography/stage-grouping/TNM;

8. topography/topography-grouping (for multiple primary tumor conditions);

9. morphology/morphology-grouping (for multiple primary tumor conditions).

Given the size of the tables, only a few excerpts are shown for illustrative purposes in **Tables 2**–**6**. Whereas they are specific to the ENCR common data set, they are nevertheless indicative of the sorts of difficulties faced by other rule sets defined in a similar fashion.

Apart from the difficulty of ensuring consistency across the rule tables, a further drawback to defining rules in this way relates to the intricacy they impose on compiling a test data set. A comprehensive test data set is important for verifying the ability of data-checking software to trap the different types of errors against the rules. In constructing a test data set, it is necessary to keep a record of the variables set incorrectly for each individual test case.

#### **Table 2.**

*Unlikely and rare combinations of age and tumor type (excerpt from table 3 in [3]).*

#### **Table 3.**

*Valid combinations for basis of diagnosis and morphology (excerpt from figure 2 in [3]).*

#### **Table 4.**

*Invalid combinations for sex and topography (excerpt from table 4 in [3]).*

#### **Table 5.**

*Morphology codes and allowed/refused topography codes (excerpt from table 8 in [3]).*

Creating a test record using the tabular rules requires one first to establish a valid morphology/topography combination (one table look-up), then a correct morphology/behavior combination (second table look-up), and thereafter multiple table lookups for all the other variable interdependencies. Given that not all possible morphology/topography combinations lead to defined combinations of the other variables, it becomes an arduous task to follow this process to completion. In practice, what is done is to start from a real cancer registry data set and systematically set the variables to incorrect values. However, such an approach does not guarantee all possible record combination conditions are thereby tested, potentially leading to undetected bugs in the validation software.
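The practical approach just described, starting from a valid record and deliberately setting one variable at a time to an incorrect value, might look like the following sketch. The generator `single_fault_cases` and the "invalid" codes are assumptions made for illustration; only the valid pairing of topography C300 with morphology 8090 comes from the ENCR rules as cited later in this chapter.

```python
def single_fault_cases(valid_record, invalid_values):
    """Yield (faulty_field, record) pairs that differ from valid_record in one field."""
    for field, bad_values in invalid_values.items():
        for bad in bad_values:
            case = dict(valid_record)  # copy so the valid record is never mutated
            case[field] = bad
            # Returning the field name keeps a record of what was set incorrectly.
            yield field, case


# The "invalid" codes are placeholders, not taken from the ENCR tables.
VALID = {"topography": "C300", "morphology": "8090"}
INVALID = {"topography": ["C999"], "morphology": ["0000", "XXXX"]}

cases = list(single_fault_cases(VALID, INVALID))
```

The sketch also makes the limitation visible: it covers only the faults someone thought to list in `INVALID`, so combinations that are invalid only jointly with other variables remain untested.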

For many practical reasons therefore, a more formal representation of the data rules is necessary. Ontologies are interesting since they provide the basis for doing this in a way that is also integrated with the underlying data model.


#### **Table 6.**

*TNM edition 7 stage grouping and T, N, M values for thyroid gland (C73) papillary or follicular (excerpt from appendix III in [3]).*

## **3. The relationship between ontologies and description logics**

Computational ontologies describe and categorize classes of objects and specify the relationships associated with those classes and categories. This information is captured using axiomatic constructs that provide an appropriate mechanism for describing the majority of the ENCR data rules.

There is in fact a very close relationship between the axiom constructs and description logics (DLs) [7], which are themselves closely related to first-order and modal logics. Since first-order logic draws from a well-established mathematical foundation, DLs provide a solid formal framework for representing axioms that can be developed using the more readily understandable ontology constructs.

DLs form a family of knowledge representation languages that are distinguished by their level of expressivity [8]. Expressivity refers to the expressive power of the language governed by the types of operations it can support. The base language is attributive language (AL) supporting concept intersection (⊓), some level of negation (¬), universal restrictions (∀), and existential restrictions (∃) with limited quantification. The restriction operators ∀ and ∃ are used for qualifying the entities on which a given role acts, with ∃ specifying the notion of an "at-least-one relationship" and ∀ the notion of an "only relationship"; they are similar to the existential and universal quantifiers of first-order logic.

The addition of complex concept negation (C), which includes concept disjunction (⊔), increases the expressivity to attributive language with complements (ALC), which already provides quite a powerful expressivity able to handle many types of data rules. A language of higher expressivity is SHOIN, where S refers to ALC with transitive roles, H to role hierarchy, O to nominals, I to inverse properties, and N to cardinality restrictions. Higher expressivities are also possible but there is a trade-off between expressivity and computational cost for automatic reasoning.
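To make the ALC operators concrete, the sketch below evaluates a concept expression over a small, finite interpretation and returns the set of individuals that satisfy it. It follows the standard set-theoretic reading of the constructors; the tuple encoding, the function `extension`, and the toy individuals are assumptions made for this illustration, and the code is a model checker for one interpretation, not a DL reasoner.

```python
def extension(concept, domain, classes, roles):
    """Return the individuals in `domain` that satisfy `concept`.

    Concepts are nested tuples:
      ("atom", name), ("not", C), ("and", C, D), ("or", C, D),
      ("some", role, C)   existential restriction
      ("only", role, C)   universal restriction
    """
    kind = concept[0]
    if kind == "atom":
        return set(classes.get(concept[1], ())) & set(domain)
    if kind == "not":
        return set(domain) - extension(concept[1], domain, classes, roles)
    if kind == "and":
        left = extension(concept[1], domain, classes, roles)
        return left & extension(concept[2], domain, classes, roles)
    if kind == "or":
        left = extension(concept[1], domain, classes, roles)
        return left | extension(concept[2], domain, classes, roles)
    if kind in ("some", "only"):
        role_pairs = roles.get(concept[1], ())
        filler = extension(concept[2], domain, classes, roles)
        result = set()
        for x in domain:
            successors = {y for (a, y) in role_pairs if a == x}
            if kind == "some" and successors & filler:
                result.add(x)  # at least one successor lies in the filler
            elif kind == "only" and successors <= filler:
                result.add(x)  # every successor (possibly none) lies in the filler
        return result
    raise ValueError(f"unknown constructor: {kind}")


# Toy interpretation, invented for illustration.
DOMAIN = {"tumor1", "tumor2", "morph1", "morph2"}
CLASSES = {"M_8090": {"morph1"}}
ROLES = {"hasMorphology": {("tumor1", "morph1"), ("tumor2", "morph2")}}

M_8090 = ("atom", "M_8090")
at_least_one = extension(("some", "hasMorphology", M_8090), DOMAIN, CLASSES, ROLES)
only = extension(("only", "hasMorphology", M_8090), DOMAIN, CLASSES, ROLES)
```

Note that the "only" restriction is satisfied vacuously by individuals with no `hasMorphology` successor at all, which is why ∀ on its own rarely expresses a data rule and is usually paired with ∃.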

In DL terminology, a knowledge base has two distinct components – a terminological part or TBox, and an assertional part or ABox. An additional term RBox is sometimes used to denote an extended set of role axioms that are described by the letter R in higher expressivities such as SROIQ [8].

The distinction between the TBox and ABox is sometimes also made in the division between ontologies and knowledge graphs [9]. An ontology is considered as a schema that captures the semantic data model using classes, relationships, and attributes (i.e. the TBox, where concepts stand for classes and roles for relationships). A knowledge graph in contrast contains specific instances following the semantic data model represented by the ontology (i.e. the ABox).
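The TBox/ABox split can be pictured with two small Python dictionaries: one holding schema-level subclass axioms and one holding assertions about individuals. The class names, the individual `case_001`, and the helper `inferred_types` are hypothetical and chosen only to illustrate the division; real ontologies allow multiple superclasses and far richer axioms.

```python
# TBox: terminological axioms (here, only named subclass relations).
TBOX_SUBCLASS_OF = {
    "BasalCellCarcinoma": "Carcinoma",
    "Carcinoma": "Tumor",
}

# ABox: assertions about specific individuals.
ABOX_TYPES = {"case_001": "BasalCellCarcinoma"}


def inferred_types(individual):
    """List the asserted class of an individual followed by its superclasses."""
    types = []
    current = ABOX_TYPES.get(individual)
    while current is not None:
        types.append(current)
        current = TBOX_SUBCLASS_OF.get(current)  # climb one level in the TBox
    return types


types_of_case = inferred_types("case_001")
```

The ABox states only that `case_001` is a basal cell carcinoma; its membership of the broader classes follows from the TBox, which is the sense in which a knowledge graph follows the data model defined by an ontology.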

### **3.1 Web ontology language**

The World Wide Web Consortium (W3C) describes the web ontology language (OWL) as "a semantic web language designed to represent rich and complex knowledge about things, groups of things, and relations between things". It refers to OWL documents as ontologies [10]. OWL is structured closely along the lines of DLs and provides support for automatic reasoning. It uses the terminology of classes and properties (instead of concepts and roles) for the TBox and represents the ABox as a set of individuals instanced (or asserted) from the TBox axioms.

A number of free, open-source graphical user interface OWL editors are available (e.g. Protégé [11]) that greatly ease the task of ontology development. It is generally more straightforward to define classes and relationships from an ontological point of view than construct them from scratch using DL. The DL expressions can afterwards be determined from the resulting OWL axioms.

## **4. OWL: A formal framework for the specification of the data rules**

OWL's roots in DL allow a formal context to be established for data rules that can overcome the inherent ambiguities associated with their formulation in natural language. Given the relatively rich set of logic operators available however, care is required in deciding how best to formulate the axioms. Unfortunately, there is no simple set of guidelines to help with this task since it is very much dependent on how the ontology will be used. Moreover, DL expressivity comes at the cost of computational speed [12] and where this is important, it is preferable to restrict the DL expressivity to the extent necessary.

### **4.1 Representation of the data rules**

By way of illustration, the following simple examples are intended only to show how some of the rules depicted in **Tables 2**–**6** can be encoded in DL. With reference to **Table 5** (morphology/topography), to capture the fact that the topography code *C300* (nasal cavity) with a morphology code of 8090 (basal cell carcinoma) is a permissible combination, one can create an OWL axiom stating that *C300* is a subclass of an existential restriction on the object property *hasMorphology* with a filler class *M\_8090* (where the prefix *M\_* has been added for convenience to represent morphology). This statement is represented in DL by:

$$\text{C300} \sqsubseteq \exists \text{hasMorphology}.\text{M\_8090} \tag{1}$$

In a similar manner, one can capture the rule in the last row of **Table 2** that an ICD-O-3 topography code *C*61 (prostate gland) together with a morphology of 8140 (adenocarcinoma) is unlikely in men aged less than forty years at diagnosis. This rule, which requires the use of an OWL data property, can be framed to say that, for this combination of topography and morphology, the expected age of patients is above thirty-nine years:

$$\text{C61} \sqcap \text{M\_8140} \sqsubseteq \exists \text{expectedAge}.\{> 39\} \tag{2}$$

The introduction of another axiom stating that the conjunction of an expected age of more than thirty-nine years and a patient age at diagnosis of less than forty years is an improbable scenario, Eq. (3), would flag a potential coding error (via subsumption under the class *ImprobableAge*) for any prostate tumor cases with morphology code 8140 for patients younger than forty years of age.

$$\exists \text{expectedAge}.\{> 39\} \sqcap \exists \text{patientAgeAtDiagnosis}.\{< 40\} \sqsubseteq \text{ImprobableAge} \tag{3}$$

Clearly, such a rule would have to be replicated for all the relevant upper age restrictions provided in the rule table. To avoid logic conflicts, a modified set of axioms would need to be created for the rules with lower age restrictions (cf. row 2 in **Table 2**).
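As a rough procedural illustration of what Eqs. (2) and (3) are intended to catch, the following Python sketch applies the same check to a single record. It is not OWL and performs no DL reasoning; the `TumorCase` record, its field names, and the `EXPECTED_MIN_AGE` lookup are hypothetical names introduced only for this example, and only the *C*61/8140 entry is taken from the text.

```python
from dataclasses import dataclass

# Hypothetical lookup: (topography, morphology) -> lowest age at diagnosis
# that is NOT considered improbable. Only the C61/8140 entry comes from the
# text; a full rule table would contribute one entry per age-restricted rule.
EXPECTED_MIN_AGE = {("C61", "8140"): 40}


@dataclass
class TumorCase:
    topography: str        # ICD-O-3 topography code, e.g. "C61"
    morphology: str        # ICD-O-3 morphology code, e.g. "8140"
    age_at_diagnosis: int  # patient age in whole years


def is_improbable_age(case: TumorCase) -> bool:
    """Mirror Eqs. (2)-(3): expected age > 39 combined with patient age < 40."""
    min_age = EXPECTED_MIN_AGE.get((case.topography, case.morphology))
    # No rule for this combination: nothing is flagged.
    return min_age is not None and case.age_at_diagnosis < min_age
```

Unlike the axioms, this sketch matches codes literally, so a subclass such as *C*619 would not inherit the rule for *C*61; in the ontology that inheritance comes from the class hierarchy and the reasoner.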

By building up axioms in this manner, all the rules relevant to a given class or hierarchy of classes can be defined. The advantage is that each rule governing a class of objects is visible on the ontology editor's view of the class, unlike the representation of the rules in **Tables 2**–**6** where one has to search between various tables to ascertain all the rules pertinent to a particular entity. As observed earlier, this greatly simplifies the task of building up test cases of data both to validate the behavior of the rules as well as to construct comprehensive test data sets.

#### **4.2 Automatic reasoning**

Owing to its DL foundations, OWL provides the possibility for automatic reasoning. Automatic reasoning is a valuable tool for detecting rule violations in a set of data records. Eq. (3) provided an example where a reasoner could flag a potential coding error in a cancer case.

In designing error-trapping axioms, it is important to be aware of the issues relating to the open world assumption of DL. The open world assumption holds the view that anything not explicitly stated can only be assumed to be unknown. This is in contrast to the closed world assumption in which anything not explicitly stated is considered incorrect (typical for rules expressed for instance in Datalog). The open world assumption has implications for the subsumption of classes in a hierarchy and can dictate the structure of the ontology, depending on the reasoning requirements.
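The difference between the two assumptions can be sketched in a few lines of Python. This is a toy illustration only: the statement tuples and the names `asserted_true`, `asserted_false`, `closed_world`, and `open_world` are assumptions made for the example and do not correspond to any OWL or Datalog API.

```python
from typing import Optional, Tuple

Statement = Tuple[str, str, str]

# Statements that have been explicitly asserted as holding or as not holding.
asserted_true = {("Case1", "hasTopography", "C40")}
asserted_false = {("Case1", "hasTopography", "C61")}


def closed_world(stmt: Statement) -> bool:
    """Closed world: anything not explicitly stated is taken to be false."""
    return stmt in asserted_true


def open_world(stmt: Statement) -> Optional[bool]:
    """Open world: anything not explicitly stated is unknown (None)."""
    if stmt in asserted_true:
        return True
    if stmt in asserted_false:
        return False
    return None
```

For a statement that was never asserted, such as `("Case1", "hasMorphology", "M_919")`, `closed_world` answers `False` whereas `open_world` answers `None`, which is why an error-trapping axiom cannot rely on the mere absence of information.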

Data rules, which by definition are prescriptive in the dependencies between data variables, are more suited to the closed world assumption. Axioms may therefore have to be written in such a way that serves to force class subsumption in an otherwise open world view. One means for achieving this is to "invert" the class tree – which may be more easily clarified by the following simple practical example. Say we wished to subsume a class with certain attributes (e.g. a class having a topography code of *C*40 and a morphology with code 919) under a general classification class of *Osteosarcoma*. Following the traditional approach of constructing classes using an ontology editor such as Protégé, we might declare an axiom such as:

$$\text{Osteosarcoma} \sqsubseteq \text{C40} \sqcap \text{M\_919} \tag{4}$$

If we were to declare a class *TumorCase* also subclassed from an intersection of *C*40 and *M*\_919 and then run the reasoner, we would find that our *TumorCase* class had not been classified under (i.e. subsumed by) the class *Osteosarcoma*. This is due to the open world assumption since it cannot be assumed that the class *Osteosarcoma* is not subclassed from other classes that have not been explicitly stated. It cannot therefore be assumed that the *TumorCase* class is contained by the *Osteosarcoma* class – there is not enough information to say.

The problem can be circumvented either by creating an equivalence (using defined classes) or by inverting the subclass definition. Creating many equivalences with complex classes can, however, lead to unintended consequences. For example, if the containment operator (⊑) in Eq. (3) were to be replaced by an equivalence (≡), and if this approach were to be replicated for the whole set of axioms modeling each of the age-restricted rules (cf. **Table 3**), then all the expressions on the left-hand side of the equivalence would also become equivalent (since they are all equivalent to the class *ImprobableAge*) and this would be erroneous. Alternatively, the subclass definition of Eq. (4) can be inverted as indicated in Eq. (5):

$$\text{C40} \sqcap \text{M\_919} \sqsubseteq \text{Osteosarcoma} \tag{5}$$

Running the reasoner now would result in the subsumption of the class *TumorCase* under the class *Osteosarcoma*.
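The different outcomes of Eqs. (4) and (5) can be reproduced with a toy forward-chaining procedure over axioms whose left-hand side is a conjunction of named classes and whose right-hand side is a single named class. This is a minimal sketch under that restriction, not an OWL reasoner; the function `subsumers` and the axiom encoding are assumptions made for illustration.

```python
def subsumers(cls, axioms):
    """Return every named class that `cls` is entailed to be a subclass of.

    `axioms` is a list of (lhs, rhs) pairs, each meaning that the conjunction
    of the named classes in `lhs` is a subclass of the named class `rhs`.
    """
    found = {cls}
    changed = True
    while changed:  # repeat until no axiom adds a new subsumer
        changed = False
        for lhs, rhs in axioms:
            if lhs <= found and rhs not in found:
                found.add(rhs)
                changed = True
    return found


# TumorCase is subclassed from C40 and M_919 (written as two inclusions).
tumor_case = [
    (frozenset({"TumorCase"}), "C40"),
    (frozenset({"TumorCase"}), "M_919"),
]

# Eq. (4): Osteosarcoma is subclassed from C40 and M_919.
eq4 = tumor_case + [
    (frozenset({"Osteosarcoma"}), "C40"),
    (frozenset({"Osteosarcoma"}), "M_919"),
]

# Eq. (5): the conjunction of C40 and M_919 is subclassed from Osteosarcoma.
eq5 = tumor_case + [(frozenset({"C40", "M_919"}), "Osteosarcoma")]
```

With `eq4`, `subsumers("TumorCase", eq4)` does not contain `"Osteosarcoma"`, because nothing entails membership of that class; with `eq5` it does, since the complex left-hand side is satisfied by the two classes already asserted for *TumorCase*.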

This method of axiom formulation has been coined "being complex on the left-hand side" [13]. Ontology editors such as Protégé lead developers to put the complexity on the right-hand side of the class containment relation (i.e. subclassing from complex classes). Although moving the complexity to the left-hand side can overcome the subsumption issues of the open world view, it tends to obfuscate the ontology structure. Eq. (3) is a further example of defining axioms following this approach.

Regarding the different formulations for expressing the rule illustrated in Eqs. (4) and (5), it is instructive to note that the equivalence expression:

$$\text{C40} \sqcap \text{M\_919} \equiv \text{Osteosarcoma} \tag{6}$$

is in fact a short-hand way of writing the implied DL expression:

$$\text{C40} \sqcap \text{M\_919} \sqsubseteq \text{Osteosarcoma}, \quad \text{Osteosarcoma} \sqsubseteq \text{C40} \sqcap \text{M\_919} \tag{7}$$

**Figure 1** is a view from the Protégé application showing the result of reasoning based on the classes and properties given in an imaginary cancer test case. The non-highlighted lines indicate the information passed into the reasoner and the lines highlighted with yellow background show the extra information returned by the reasoner. Noting that the topography class *C*619 is a subclass of *C*61 and the morphology class *M*\_8140\_3 is a subclass of *M*\_8140, and in accordance with the rules provided in **Table 2** (row 4), **Table 3** (row 3), and **Table 4** (row 1), the reasoner has ascertained that: the age at diagnosis is improbable for the morphology/topography combination; the basis of diagnosis is correct; and the combination of sex and topography is incorrect. The question mark in the gray circle on the highlighted lines provides the means of polling the reasoner to understand why it has subsumed the class under the identified class.

Protégé also provides a graphical view on the inferred classification tree for the named classes (unnamed classes are not visible). **Figure 2** provides an amplification of the classification tree summarized in **Figure 1**.


#### **Figure 1.**

*Information added from the reasoning process (highlighted lines) based on the prior information of classes asserted in a test case (non-highlighted lines).*


#### **Figure 2.**

*Graphical view of the classification structure (containing both asserted and inferred classes) of the cancer test case shown in Figure 1.*


#### **Figure 3.**

*Thyroid cancer TNM test case to verify the class subsumption results from the reasoner.*

The reasoner can be polled to understand the reasoning applied for class subsumption. **Figure 3** shows a cancer test case for the thyroid gland (*C*739) restricted to TNM information to check whether the test case is subsumed under stage III (cf. **Table 6**, row 7b). **Figure 4** is the classification tree resulting from the automatic reasoning process on the TNM test case of **Figure 3**. It can be seen that the reasoner has correctly subsumed the test case under the stage III class. **Figure 5** shows the results from polling the reasoner to understand why it subsumed the test class under the *TNMStageIII* class. The specific rule is stated in line 11 of the figure and the other lines provide the reasons for subsuming the classes asserted in the test case under the various classes in the rule itself.

#### **Figure 4.**

*Classification tree of the thyroid cancer test case of Figure 3, showing that the reasoner has correctly identified the stage III class (the top-most class in the figure) as required from the rule table shown in Table 6. The classes shaded in the darker color represent defined classes (classes with some equivalence conditions).*


#### **Figure 5.**

*Reasoner justification for the subsumption of the thyroid cancer test case under the TNMStageIII class.*

Automatic reasoning can be performed using both TBox axioms and ABox axioms. Since data rules are more often associated with classes of objects, TBox reasoning is in most cases sufficient and can reduce computational costs. Most of the ENCR data rules can be modeled by TBox axioms apart from those, for example, that pertain to multiple tumors (where a person has more than one type of cancer). The rules specify the topography and morphology combinations of any two tumors to be considered different and since two entities with the same class attributes have to be compared, the use of ABox axioms is necessary. Modeling of the multiple primary rules on the basis of DLs supported by OWL has been addressed at length in [14].

The ability to include closed-world reasoning in OWL would be ideal and has been made possible to a certain degree via the incorporation of the semantic web rule language (SWRL) into the semantic web stack. SWRL is based on first-order Horn logic, in which rules in Datalog are also expressed [15], but requires an ABox. Another expressive logic formalism allowing some integration of open- and closed-world reasoning is minimal knowledge and negation as failure (MKNF) [16]. This formalism is being developed in a unifying framework in the KAON2 infrastructure [17].

#### **4.3 Encapsulation of the data model**

The axiomatic constructs of an ontology are useful for capturing many of the different aspects of a data model that, for relational database models, have traditionally been divided across three independent levels of abstraction: the conceptual schema (describing the semantics of the domain and the scope of the model); the logical schema (describing the structure of the information, as for example a relational database schema); and the physical schema (describing the physical means of storing the data) [18].

One of the strengths of OWL is its relationship with the resource description framework (RDF), which serves as the data interchange layer of the semantic web stack [19]. RDF data is in essence a network of connected triplets of resources, in which the resources at the edges of the triplets (subject and object) are related by the resource in the middle of the triplet (predicate). Each resource is identified by a uniform resource identifier (URI). All OWL constructs are described in terms of RDF data, allowing ontologies to bridge the traditional divide between conceptual and logical levels of abstraction and providing a richer, more integrated data model description framework.
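The triple structure described above can be sketched with plain Python tuples. This is an illustrative toy, not an RDF library: the namespace `http://example.org/cancer#`, the resource names, and the helper `match` are hypothetical, while the `rdf:type` and `rdfs:subClassOf` URIs are the standard W3C ones.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
EX = "http://example.org/cancer#"  # hypothetical namespace for this sketch

# Each triple is (subject, predicate, object). The first triple describes the
# class hierarchy (ontology level); the other two describe one data record.
triples = {
    (EX + "C619", RDFS_SUBCLASS_OF, EX + "C61"),
    (EX + "case42", RDF_TYPE, EX + "C619"),
    (EX + "case42", EX + "patientAgeAtDiagnosis", "35"),
}


def match(s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )
```

Because class axioms and instance data live in the same set of triples, a single pattern such as `match(s=EX + "case42")` or `match(p=RDFS_SUBCLASS_OF)` can address either level, which is the sense in which the conceptual and logical levels are bridged.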

The flexibility and descriptive power of an ontology present their own sets of challenges, however. While the usefulness of ontologies is widely acknowledged, the task of building a good ontology is a particularly hard one and falls within the developing domain of ontology engineering [20]. Designing an appropriate ontology does not only depend on a thorough understanding of the domain to be modeled, but must be performed circumspectly in view of the ontology's purpose and future extensibility. There are pitfalls in making an ontology too granular or not granular enough – the result is either a multiplication of application-specific ontologies that cannot easily be integrated, or an ontology too generic to be useful to any particular application. OWL provides the functionality for importing ontologies that allows larger ontologies to be built up in a modular fashion and this can aid the design process if performed carefully [21].

There are also certain design aspects to take into account that can affect the overall structure of the ontology. One important consideration relates to the extent to which the ontology is to be used in a pre-coordinated or post-coordinated way [22]. Pre-coordination refers to the situation in which all the terms and relationships are stated explicitly in the axioms and leads to a static use of the ontology, whereas post-coordination refers to the more dynamic situation in which new relationships are determined by the automatic reasoning process on the basis of the predefined axioms. The pitfalls are exacerbated in applications that need to tweak the normal approach to structuring class hierarchies to overcome restrictions in post-coordination that the open world assumption places on class subsumption.

If the axioms describing the data rules are developed circumspectly however, the advantage is that the data model falls out almost by default – the data rules necessarily identify all the concepts within the domain as well as their inter-relations. This may require an iterative process combining both the bottom-up approach of developing axioms in DL and the top-down approach of structuring the ontology, while testing each stage of the development with the reasoner.

The task of developing a data model in an ontology used in a predominantly pre-coordinated way is perhaps more straightforward and does not require too much juggling in defining the axioms. Moreover, the axioms can be constructed in the more usual manner of subclassing from complex classes. The intelligence of validating data sets would, however, need to be moved from the ontology to a computer program (for instance via the OWL-API), thereby compounding maintenance issues. The advantage of encapsulating the intelligence in the ontology is that all the knowledge is contained in one application and maintenance aspects are thereby confined to that one application.

#### **4.4 Metadata by default**

Elements in an ontology are described in terms of their semantic relations to other elements in the ontology thereby providing a description and context, or in other words the metadata, of the element. Moreover, since each element in an OWL ontology is uniquely defined by a uniform resource identifier (URI), it is readily linkable with other web resources. This allows any element to be associated with other relevant resources via linked open data (LOD) principles. Using knowledge organization schemes, such as simple knowledge organization system (SKOS), it becomes a straightforward matter to link OWL resources semantically with other web-based resources such as data-dictionary or thesauri elements.

The interlinking of any OWL resource to other web resources, especially to other RDF resources, provides a powerful and extensible means of capturing all the necessary metadata components for comprehensively describing a data model element. This aspect has been exploited to create extensive frameworks of distributed metadata registries that allow the reuse of existing metadata resources [23].

It is important to emphasize that a number of complementary tools exist that can be used together to provide a more comprehensive toolkit for validating different types of data rules. Included in the semantic web standards are the shape languages, Shape Expressions (ShEx) and the Shapes Constraint Language (SHACL), for providing structural schemas for RDF data. There are also additional tools for polling knowledge bases such as the SPARQL protocol and RDF query language (SPARQL) as well as those for extending the expressivity of OWL DLs, such as SWRL. Depending on the type of rule, some of these tools may be more suitable than others; however, since they are agreed or proposed semantic web standards and based on the standard model for data interchange (RDF), they can all reference the elements of a data model described in RDF. This provides a highly flexible and versatile environment in which to develop an integrated toolkit. **Table 7** gives a summary breakdown of these applications with the sorts of operations they support and the components of a knowledge base to which they are applicable.

Whereas other tools and languages (e.g. Datalog) are also available for validating data, and may arguably be more appropriate for defining rules predominantly based on closed-world scenarios, they fall down in this aspect of unifying the rules with the data model and the metadata, especially in the LOD sense. For federated data-validation processes, the unification of all these elements brings many advantages in terms of data linkage, maintenance, and collaborative development. Having said that, OWL is not able to handle all types of validation checks, such as those requiring comparison of dates, checking of frequencies of occurrence, or expressing certain relations between individuals. ShEx, SWRL, and SPARQL can all go some way to handling such checks. SWRL and SPARQL, however, require an ABox, and SWRL has implications for decidability [24]. Moreover, introducing an ABox can create performance issues for DL reasoning when many hundreds of thousands of individuals are involved and requires careful consideration in the ontology design phase. An alternative is to create an ABox and use SPARQL querying instead of DL reasoning, but this would move the rule logic out of the ontology and into the SPARQL query scripts.


#### **Table 7.**

*Summary breakdown of some of the semantic web standard applications with the sorts of operations they support.*

An example of handling the axioms of Eqs. (2) and (3) using a simple SPARQL script to list all the associated erroneous cancer-case records is shown in **Figure 6**. A ShEx script for checking the same condition is shown in **Figure 7**. The same rule using SWRL could be expressed as shown in **Figure 8**.

The effort required to maintain the rule base developed with such tools however would be considerable and it would make more sense to use them in a pre-processing stage on the data to be validated (translated beforehand into RDF) for those types of checks that cannot be handled within the ontology itself. ShEx in particular provides a valuable pre-processing tool to check the ranges and formats of variables.
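The kind of range and format pre-processing mentioned here can be sketched in Python as follows. This is an analogy in plain Python, not ShEx: the field names, the regular expressions for the code formats, and the 0 to 120 age range are all assumptions chosen for illustration rather than rules taken from the ENCR tables.

```python
import re

# Assumed formats, for illustration only.
TOPOGRAPHY = re.compile(r"C\d{2,3}")  # e.g. "C61" or "C619"
MORPHOLOGY = re.compile(r"\d{4}")     # e.g. "8140"


def format_errors(record: dict) -> list:
    """Return the names of the fields whose format or range is invalid."""
    errors = []
    if not TOPOGRAPHY.fullmatch(str(record.get("topography", ""))):
        errors.append("topography")
    if not MORPHOLOGY.fullmatch(str(record.get("morphology", ""))):
        errors.append("morphology")
    age = record.get("age_at_diagnosis")
    if not isinstance(age, int) or not 0 <= age <= 120:
        errors.append("age_at_diagnosis")
    return errors
```

Records that pass such structural checks could then be handed to the ontology for the semantic checks, keeping the rule logic itself in one place.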

#### **Figure 6.**

*An example of a SPARQL script to list all the erroneous patient-age related cancer-case records associated with a particular combination of topography and morphology codes.*

#### **Figure 7.**

*An example of a ShEx script to trap any erroneous patient-age related cancer-case records associated with a particular combination of topography and morphology codes.*

#### **Figure 8.**

*An example of an SWRL rule to catch the same validation errors as for Figures 6 and 7.*

## **5. Role of ontologies in data harmonization**

The focus until this point has been on how ontologies can provide many advantages in the task of data validation against a set of specific data-validation rules. Checking the conformity of data against such rules is just one element in the whole process of data harmonization.

Data harmonization is a term that eludes a clear and concise definition, perhaps partly due to its dependence on the context to which it is applied [25, 26] as well as the fact that it is a multistep activity involving both technical and social processes [5, 26]. An idealized breakdown of these steps has been provided in [5] based on the accumulated experience gained by the Comprehensive Center for the Advancement of Scientific Strategies (COMPASS) resulting from multiple data-harmonization projects across widely different types of data, collaborators, and scientific questions. Whereas not all projects were found to follow all steps and the order of the steps might vary, the six most common steps identified were:

1. Identification of the questions that the harmonized data set is required to answer


In this breakdown, the process of data validation falls mainly under steps 5) and 6), although it should be stressed that validation forms only part of the quality-control procedures of step 6). Other fundamental quality metrics consist of the following dimensions: completeness, consistency, accuracy, timeliness, uniqueness, and auditability [27]. Moreover, different entities in the data process may be responsible for ensuring the quality of the data associated with these separate dimensions. They are nevertheless all important for ensuring an appropriate level of harmonization that allows meaningful comparison or integration of data, and it would not be correct to state that data solely validated against a set of validation rules have the prerequisite level of quality for purposes of data comparison.

The degree to which data are harmonized depends ultimately on the specific end use, but the step can never entirely be ignored. In the field of health for example, data harmonization is a critical step in pooling data sets for increasing the power of individual epidemiological studies [5]. It is also a necessary part of health management decision-making, particularly with regard to: clinical decision-making for individual patient clinical management or clinical support and quality improvement tools; operational and strategic decision-making for health system managers and policy-makers; and population-level decision-making for disease surveillance and outbreak management [26].

The point is that ontologies can play an important part in all stages of data harmonization. Starting from the highest levels of abstraction in the six-step harmonization process presented above, ontologies provide the means to capture and organize the high-level data concepts needed to address the questions the harmonized data are required to answer. Ontologies would moreover be able to formalize the questions in direct reference to the high-level data concepts and help identify any missing concepts as well as to verify the underlying logic of those relationships. The next steps are to identify the availability of the data and to develop common data elements (CDEs). The data may be in an unstructured format. The development of CDEs is a process of structuring the data and the semantic relations described in a domain ontology can help identify the relevant information. The role of ontologies in ETL (extract, transform, load) processes has been extensively reviewed in [28]. In particular, the authors point to the efficacy of ontologies: (a) to formalize the needs and requirements of users and resolve semantic ambiguity; (b) to discover concepts and their relationships; (c) to enrich source data, provide mappings (also generating them automatically) and increase ETL performance and efficiency; and (d) to support configuration and instantiation of ETL patterns. Moreover, the validation rule base for the data can itself be derived automatically from the data themselves using ontological methods [29] and allows a verification of any pre-defined set of validation rules.

## **6. Conclusions**

Data validation is an essential step in the task of ascertaining the veracity and homogeneity of data for data comparison purposes. In the case of structured data, validation is often performed using a set of data validation rules. Using the ontology layer (OWL) of the semantic web stack to perform this task brings a number of major advantages. First, it provides the means of formalizing the rules in DL, thereby removing the ambiguities and redundancies inherent in natural language. Second, it helps encapsulate the data model and integrate the conceptual and logical schemas that have traditionally been separated. The encapsulation of the data model and the definition of the rules in DL is a mutually supportive step that allows the integration of a bottom-up approach (rule definitions) with a top-down approach (classification and semantic context), from which the data model is the result. Third, the data model expressed in OWL automatically incorporates the metadata. All named entities (classes, properties, and individuals) have their own URIs that can be accessed and linked individually. Accessing an OWL link provides the whole semantic context of the entity, which may in turn be annotated with links to other semantic resources to enrich further the contextual information. Other advantages include the possibility of reasoning on the ontology, allowing inferences to be made automatically and providing other semantic relations not explicitly stated a priori in the ontology. Ontologies can also play an important role in more general data harmonization steps. In particular, they can help in defining and formalizing user needs, discovering semantic contexts in unstructured data, and generating semantic mappings.

Whereas ontologies do suffer some drawbacks (such as issues relating to the open world assumption), the fact that they can to a large extent unify the underlying data model with the data rules, as well as capture the metadata that can be linked semantically to other metadata dictionaries and classification schemes, makes them an interesting solution. These considerations are of particular importance for applications that need to harmonize data across multiple data providers and heterogeneous data-collection procedures, as well as for improved contextualization of the data that is useful for downstream processes.

## **Acknowledgements**

This work was partly conducted using the Protégé resource, which is supported by grant GM10331601 from the National Institute of General Medical Sciences of the United States National Institutes of Health.

## **Conflict of interest**

The authors declare no conflict of interest.

## **Nomenclature**


*Ontologies as a Tool for Formalizing Data Validation Rules DOI: http://dx.doi.org/10.5772/intechopen.110757*


## **Author details**

Nicholas Nicholson<sup>1</sup>\* and Iztok Štotl<sup>2</sup>

1 European Commission Joint Research Centre, Ispra, Italy

2 Department of Endocrinology, Diabetes and Metabolic Diseases, University Medical Centre Ljubljana, Ljubljana, Slovenia

\*Address all correspondence to: nicholas.nicholson@ec.europa.eu

© 2023 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## **References**

[1] European Network of Cancer Registries (ENCR). Available from: https://www.encr.eu/ [Accessed: December 26, 2022]

[2] National Cancer Institute. Surveillance, Epidemiology, and End Results Program (SEER). Available from: https://seer.cancer.gov/ [Accessed: December 26, 2022]

[3] Martos C, Crocetti E, Visser O, Rous B, Giusti F. A proposal on cancer data quality checks: one common procedure for European cancer registries. JRC Technical Report, p. 1-99. DOI: 10.2760/429053

[4] Tijhuis M, Finger JD, Slobbe L, Sund R, Tolonen H. In Verschuuren M, van Oers H, editors. Population Health Monitoring. Climbing the Information Pyramid. Cham: Springer; 2019. p. 59-81. DOI: 10.1007/978-3-319-76562-4\_4

[5] Rolland B, Reid S, Stelling D, Warnick G, Thornquist M, Feng Z, et al. Toward rigorous data harmonization in cancer epidemiology research: One approach. American Journal of Epidemiology. 2015;**182**(12):1033-1038. DOI: 10.1093/aje/kwv133

[6] World Health Organization. International Classification of Diseases for Oncology (ICD-O) – 3rd Edition, 1st Revision. 2013. Available from: https://apps.who.int/iris/handle/10665/96612 [Accessed: December 26, 2022]

[7] Calvanese D, Guarino N. Ontologies and description logics. Intelligenza Artificiale. 2006;**3**:21-27

[8] Baader F, Horrocks I, Lutz C, Sattler U. An Introduction to Description Logic. Cambridge: Cambridge University Press; 2017. DOI: 10.1017/9781139025355

[9] Schrader B. Enterprise Knowledge. White paper: What's the Difference Between an Ontology and a Knowledge Graph? 2020. Available from: https://enterprise-knowledge.com/whats-the-difference-between-an-ontology-and-a-knowledge-graph/ [Accessed: December 26, 2022]

[10] W3C. Web Ontology Language (OWL). 2012. Available from: https://www.w3.org/OWL/ [Accessed: December 26, 2022]

[11] Protégé. A Free, Open-Source Ontology Editor and Framework for Building Intelligent Systems. Available from: https://protege.stanford.edu/ [Accessed: December 26, 2022]

[12] Calvanese D, De Giacomo G, Lembo D, Lenzerini M, Rosati R. Data complexity of query answering in description logics. Artificial Intelligence. 2013;**195**:335-360. DOI: 10.1016/j.artint.2012.10.003

[13] Sattler U, Stevens R. Being complex on the left-hand side: General concept inclusions. Ontogenesis. 2012. Available from: http://ontogenesis.knowledgeblog.org/1288 [Accessed: December 26, 2022]

[14] Nicholson NC, Giusti F, Bettio M, Negrao Carvalho R, Dimitrova N, Dyba T, et al. An ontology to model the international rules for multiple primary malignant tumours in cancer registration. Applied Sciences. 2021;**11**:7233. DOI: 10.3390/app11167233

[15] Krötzsch M, Rudolph S, Schmitt PH. On the semantic relationship between Datalog and description logics. In: Hitzler P, Lukasiewicz T, editors. Web Reasoning and Rule Systems. RR 2010. Lecture Notes in Computer Science. Vol. 6333. Berlin, Heidelberg: Springer; 2010. pp. 88-102. DOI: 10.1007/978-3-642-15918-3\_8

[16] Motik B, Rosati R. Closing Semantic Web Ontologies. 2006. Available from: http://www.cs.ox.ac.uk/boris.motik/pubs/mr06closing-report.pdf [Accessed: January 10, 2023]

[17] KAON2. Available from: http://kaon2.semanticweb.org/ [Accessed: January 10, 2023]

[18] TopQuadrant. Ontologies and Data Models – are They the Same? 2011. Available from: https://topquadrantblog.blogspot.com/2011/09/ontologies-and-data-models-are-they.html [Accessed: December 26, 2022]

[19] W3C. Resource Description Framework (RDF). 2014. Available from: https://www.w3.org/RDF/ [Accessed: December 26, 2022]

[20] Mizoguchi R. Ontology engineering environments. In: Staab S, Studer R, editors. Handbook on Ontologies. International Handbooks on Information Systems. Berlin, Heidelberg: Springer; 2004. pp. 275-295. DOI: 10.1007/978-3-540-24750-0\_14

[21] Cuenca Grau B, Horrocks I, Kazakov Y. Modular reuse of ontologies: Theory and practice. Journal of Artificial Intelligence Research. 2008;**31**:273-318. DOI: 10.1613/jair.2375

[22] Stevens R, Sattler U. Postcoordination: Making things up as you go along. Ontogenesis. 2013. Available from: http://ontogenesis.knowledgeblog.org/1305 [Accessed: December 26, 2022]

[23] Sinaci AA, Laleci Erturkmen GB. A federated semantic metadata registry framework for enabling interoperability across clinical research and care domains. Journal of Biomedical Informatics. 2013;**46**:784-794. DOI: 10.1016/j.jbi.2013.05.009

[24] Hitzler P, Krötzsch M, Rudolph S. Knowledge Representation for the Semantic Web Part II: Rules for OWL, KI 2009 Paderborn; Integrationszentrum, Kreis Paderborn; 2009. p. 8-14. Available from: https://www.semantic-web-book.org/w/images/5/5e/KI09-OWL-Rules-2.pdf [Accessed: February 21, 2023]

[25] Paquette J. The Many Marvelous Meanings of "Data Harmonization". Towards Data Science. Canada: Towards Data Science Inc.; 2021. Available from: https://towardsdatascience.com/about-towards-data-science-d691af11cc2f [Accessed: November 16, 2022]

[26] Schmidt BM, Colvin CJ, Hohlfeld A, Leon N. Definitions, components and processes of data harmonisation in healthcare: A scoping review. BMC Medical Informatics and Decision Making. 2020;**20**(1):222. DOI: 10.1186/s12911-020-01218-7

[27] Nicholson N, Giusti F, Neamtiu L, Randi G, Dyba T, Bettio M, et al. Dotting the "i" of interoperability in FAIR cancer-registry data sets. In: Kais G, Hamdi Y, editors. Cancer Bioinformatics [Internet]. London: IntechOpen; 2021. pp. 131-156. Available from: https://www.intechopen.com/chapters/79580. DOI: 10.5772/intechopen.101330

[28] Lorvão Antunes A, Cardoso E, Barateiro J. Incorporation of ontologies in data warehouse/business intelligence systems - a systematic literature review. International Journal of Information Management Data Insights. 2022;**2**(2):100131. DOI: 10.1016/j.jjimei.2022.100131

[29] Brüggemann S, Aden T. Ontology based data validation and cleaning: Restructuring operations for ontology maintenance. In: Koschke R, Herzog O, Rödiger K-H, Ronthaler M, editors. Informatik 2007 – Informatik trifft Logistik – Band 1. Bonn: Gesellschaft für Informatik e.V.; 2007. p. 207-211. Available from: https://dl.gi.de/handle/20.500.12116/22581 [Accessed: January 10, 2023]

*Latest Advances and New Visions of Ontology in Information Science*

## **Chapter 6**
