2. "Set of strings" basic equations

The SSF rests on the representation of a database (DB) as a finite set of strings:

$$\mathcal{W}\_t = \{w\_1, \dots, w\_{m(t)}\} \subset V^\*,\tag{1}$$

where $W\_t$ denotes the DB at the discrete time moment $t$ and $V^\*$ is the set of all strings over the initial (terminal) alphabet $V$. Where it is necessary to distinguish such databases from others, they are called below set of strings databases (SDB). The structure of the DB elements $w\_i \in W\_t$, named facts, is determined by the metadatabase (MDB), whose current state is denoted $D\_t$.

The pair

$$\Theta\_t = \langle W\_t, D\_t \rangle \tag{2}$$

is named data storage (DS). The data storage is in the correct state if $W\_t \in \mathbf{W}(D\_t)$, where $\mathbf{W}(D\_t)$ is the set of all correct databases defined by the MDB.

An access message to the DS is the triple

$$a\_t = \langle o, c, x \rangle,\tag{3}$$

where $o$ is the operation whose execution is the purpose of the access (insert, delete, update, query), $c$ is the DS component (DB or MDB) addressed by the access, and $x$ is the content of the access, i.e., the query body or the DB elements (facts) to be inserted or deleted. For simplicity it is supposed that the user obtains the answer (reply) to the access at the moment $t+1$, next to $t$; it is denoted $A\_{t+1}$ if $c = \mathrm{DB}$ and $A\_{t+1}^D$ if $c = \mathrm{MDB}$ (both sets are finite).

The set of all possible access messages (3) is called the data storage manipulation language (DSML).
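The triple (3) can be pictured as a small record type. The following Python sketch is only an illustration: the names `Operation`, `Component`, and `AccessMessage` are assumptions introduced here, and the content $x$ is restricted to a finite set of strings, although a query body may in general denote an infinite set.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class Operation(Enum):
    """The operation o of the access (hypothetical encoding)."""
    INSERT = "insert"
    DELETE = "delete"
    UPDATE = "update"
    QUERY = "query"


class Component(Enum):
    """The DS component c addressed by the access."""
    DB = "DB"
    MDB = "MDB"


@dataclass(frozen=True)
class AccessMessage:
    """Access message a_t = <o, c, x>."""
    o: Operation        # operation to execute
    c: Component        # addressed component of the data storage
    x: FrozenSet[str]   # content: facts (or rules) written as strings


msg = AccessMessage(
    o=Operation.INSERT,
    c=Component.DB,
    x=frozenset({"AREA GREEN VALLEY IS SMOKED AT 15.20"}),
)
```

Under this encoding, the DSML corresponds to the set of all values of `AccessMessage`.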

The SSF is built as a sequential definition of four interconnected representations of DSML semantics.

Set-theoretical (S)-semantics of DSML is defined by equations on sets, which connect together input data, DB before and after access, and answer (reply) to the access.

Mathematical (M)-semantics follows the aforementioned equations but is defined through well-known mathematical constructions that underlie the DSML.

Operational (O)-semantics is consistent with M-semantics but is represented by algorithms that execute the operations on the DB.

Finally, implementational (I)-semantics is also represented by algorithms, which in the general case are much more efficient than those of O-semantics; the main purpose of the latter is to establish the algorithmic decidability of the answer search (derivation), i.e., the possibility of generating the answer in a finite number of steps.

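Let us begin with the S-semantics of the DSML segment that addresses the DB, called below, as usual, the data manipulation language (DML). The equations defining DML S-semantics operate on the following sets: the database $W\_t$ before the access, the database $W\_{t+1}$ after it, the set $I\_t$ of facts forming the content of the access, and the answer $A\_{t+1}$.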

"Set of Strings" Framework for Big Data Modeling DOI: http://dx.doi.org/10.5772/intechopen.85602


Basic equations, defining DML S-semantics, are as follows:

$$\mathcal{W}\_{t+1} = \mathcal{W}\_t \cup I\_t,\tag{4}$$

$$A\_{t+1} = W\_{t+1} - W\_t,\tag{5}$$

for insertion (speaking more precisely, inclusion),

$$\mathcal{W}\_{t+1} = \mathcal{W}\_t - I\_t,\tag{6}$$

$$A\_{t+1} = W\_t - W\_{t+1} \tag{7}$$

for deletion (exclusion),

$$W\_{t+1} = W\_t \tag{8}$$

$$A\_{t+1} = W\_t \cap I\_t,\tag{9}$$

and for query (everywhere "−" denotes set subtraction). As seen, Eqs. (4)–(9) fully correspond to the sense of the basic operations on a DB inherent to any DML. In Eqs. (6) and (9), the set $I\_t$ may be infinite.
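Eqs. (4)–(9) translate almost literally into operations on finite sets. The sketch below is a minimal Python illustration, not part of the SSF itself. The function names are assumptions, and because $I\_t$ may be infinite in Eqs. (6) and (9), it is represented there by a membership predicate rather than by an enumerated set.

```python
from typing import Callable, FrozenSet, Tuple

DB = FrozenSet[str]
Predicate = Callable[[str], bool]  # membership test for a possibly infinite I_t


def insert(w_t: DB, i_t: DB) -> Tuple[DB, DB]:
    """Eqs. (4)-(5): returns (W_{t+1}, A_{t+1})."""
    w_next = w_t | i_t
    return w_next, w_next - w_t


def delete(w_t: DB, in_i_t: Predicate) -> Tuple[DB, DB]:
    """Eqs. (6)-(7): removes every fact of W_t that belongs to I_t."""
    w_next = frozenset(w for w in w_t if not in_i_t(w))
    return w_next, w_t - w_next


def query(w_t: DB, in_i_t: Predicate) -> Tuple[DB, DB]:
    """Eqs. (8)-(9): the DB is unchanged, the answer is W_t intersected with I_t."""
    return w_t, frozenset(w for w in w_t if in_i_t(w))


w0 = frozenset({"AREA LOWER FOREST IS SMOKED AT 15.20"})
w1, a1 = insert(w0, frozenset({"AREA GREEN VALLEY IS SMOKED AT 15.20"}))
w2, a2 = query(w1, lambda w: " IS SMOKED AT " in w)
w3, a3 = delete(w2, lambda w: w.startswith("AREA LOWER FOREST "))
```

The predicate form keeps every computation finite: only the facts actually stored in $W\_t$ are ever tested for membership in $I\_t$.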

Example 1. Let the database, containing data items from various emergency devices, be as follows:

$$W\_t = \left\{ \begin{aligned} &\text{AREA GREEN VALLEY IS IN NORMAL STATE AT 15.03}, \\ &\text{AREA BLUE LAKE IS IN NORMAL STATE AT 15.05}, \\ &\text{AREA LOWER FOREST IS SMOKED AT 15.20} \end{aligned} \right\} \tag{10}$$

(owing to the free use of natural language in facts, the DB content needs no comment). The equation

$$\mathcal{W}\_{t+1} = \mathcal{W}\_t \cup \{ \text{AREA GREEN VALLEY IS SMOKED AT 15.20} \} \tag{11}$$

describes the insertion of a data item whose source is the device mounted at Green Valley, which has been detected as smoked since 15.20. When at the moment $t+1$ the user accesses the DB with a query whose purpose is to get information about all smoked areas, the infinite set $I\_{t+1}$ may be as follows:

$$I\_{t+1} = \left\{ \begin{aligned} &\text{AREA A IS SMOKED AT 00.00}, \dots, \text{AREA A IS SMOKED AT 23.59}, \dots, \\ &\text{AREA AA IS SMOKED AT 00.00}, \dots, \text{AREA AA IS SMOKED AT 23.59}, \dots, \\ &\text{AREA Z IS SMOKED AT 00.00}, \dots, \text{AREA Z IS SMOKED AT 23.59}, \dots \end{aligned} \right\}. \tag{12}$$

The answer to the query is

$$A\_{t+2} = W\_{t+1} \cap I\_{t+1} = \left\{ \begin{aligned} &\text{AREA GREEN VALLEY IS SMOKED AT 15.20}, \\ &\text{AREA LOWER FOREST IS SMOKED AT 15.20} \end{aligned} \right\}. \tag{13}$$

In expression (12), the names of all areas are strings over the alphabet $V = \{A, \dots, Z, 0, \dots, 9, \text{.}, \text{˽}\}$, so

$$I\_{t+1} = \{\text{AREA }\} \bullet V^\* \bullet \{\text{ IS SMOKED AT }\} \bullet \{00, \dots, 23\} \bullet \{.\} \bullet \{00, \dots, 59\}. \blacksquare \tag{14}$$
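Since the infinite set of Example 1 is a concatenation of simple languages, membership in it can be tested with a regular expression, and the answer to the query is then obtained by filtering the finite DB. The following Python sketch assumes that the words of a fact are separated by single spaces and that area names are strings over the letters, digits, dot, and space. The names `SMOKED` and `in_query_set` are introduced here for illustration only.

```python
import re

# Membership test for the infinite query set of Example 1:
# "AREA " + name + " IS SMOKED AT " + hh + "." + mm
SMOKED = re.compile(
    r"AREA [A-Z0-9. ]* IS SMOKED AT (?:[01][0-9]|2[0-3])\.[0-5][0-9]"
)


def in_query_set(fact: str) -> bool:
    """True if the whole fact belongs to the query set."""
    return SMOKED.fullmatch(fact) is not None


# Database after the insertion described in Example 1.
w_t1 = frozenset({
    "AREA GREEN VALLEY IS IN NORMAL STATE AT 15.03",
    "AREA BLUE LAKE IS IN NORMAL STATE AT 15.05",
    "AREA LOWER FOREST IS SMOKED AT 15.20",
    "AREA GREEN VALLEY IS SMOKED AT 15.20",
})

# Answer to the query: intersection of the DB with the query set.
answer = frozenset(w for w in w_t1 if in_query_set(w))
```

Here `answer` contains exactly the two facts about smoked areas, in agreement with the answer given in Example 1.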

Note that definitions (4)–(9) are not unique. For example, in the inclusion definition, the elements of the set $I\_t$ that are already present in the DB at the moment $t$ may be included in the answer,

$$A\_{t+1} = W\_t \cap I\_t,\tag{15}$$

or the answer may be defined as

$$\begin{aligned} A\_{t+1} = {} &\{\text{FACT "}\} \bullet (W\_t \cap I\_t) \bullet \{\text{" ALREADY PRESENT IN DATABASE}\} \\ &\cup \{\text{FACT "}\} \bullet (W\_{t+1} - W\_t) \bullet \{\text{" IS INCLUDED TO DATABASE}\}. \end{aligned} \tag{16}$$

So, according to Eq. (16), the answer to the access may be as follows:

$$A\_{t+1} = \left\{ \begin{aligned} &\text{FACT "AREA GREEN VALLEY IS SMOKED AT 15.20" ALREADY PRESENT IN DATABASE}, \\ &\text{FACT "AREA LOWER FOREST IS SMOKED AT 15.20" IS INCLUDED TO DATABASE} \end{aligned} \right\}. \tag{17}$$
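The alternative answer of Eq. (16) wraps each fact of the access into a short report string. A minimal Python sketch of this variant of inclusion is given below. The function name `insert_with_report` and the exact wording of the report strings are assumptions made for this illustration.

```python
from typing import FrozenSet, Tuple

DB = FrozenSet[str]


def insert_with_report(w_t: DB, i_t: DB) -> Tuple[DB, FrozenSet[str]]:
    """Inclusion by Eq. (4) with the answer built in the spirit of Eq. (16)."""
    w_next = w_t | i_t
    # Facts of the access that were already stored at the moment t.
    already = {f'FACT "{w}" ALREADY PRESENT IN DATABASE' for w in w_t & i_t}
    # Facts that are really new, i.e., W_{t+1} - W_t.
    added = {f'FACT "{w}" IS INCLUDED TO DATABASE' for w in w_next - w_t}
    return w_next, frozenset(already | added)


w_t = frozenset({"AREA GREEN VALLEY IS SMOKED AT 15.20"})
i_t = frozenset({
    "AREA GREEN VALLEY IS SMOKED AT 15.20",
    "AREA LOWER FOREST IS SMOKED AT 15.20",
})
w_next, answer = insert_with_report(w_t, i_t)
```

With these inputs the answer consists of one report for the fact already stored and one for the newly included fact.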

As may be seen, Eqs. (4)–(9) are based on the closed-world interpretation, under which the absence of a fact from the database is equivalent to its absence in the real world (problem area).

DML operations do not touch the MDB; thus $D\_{t+1} = D\_t$.

Let us consider the M- and O-semantics of DML.

The background of M-semantics of the simplest DML is the representation of the MDB $D\_t$ as a set of context-free (CF) generating rules $\alpha \to \beta$, where $\alpha$ is a nonterminal symbol ("nonterminal" for short) and $\beta$ is a string of nonterminal and terminal symbols. From the substantial point of view, every nonterminal symbol is the name of some substring of a fact entering the DB; thus $\beta$ represents the structure of $\alpha$. The only nonterminal symbol $\alpha\_0$ that does not enter any string $\beta$ is the "axiom" in the terminology of formal grammars and the "fact" in the terminology of the SSF. So the MDB $D\_t$ unambiguously defines the CF grammar

$$G\_t = \langle V, N\_t, \alpha\_0, D\_t \rangle,\tag{18}$$

where

$$N\_t = \{ \alpha \mid \alpha \to \beta \in D\_t \} \tag{19}$$

is the set of nonterminals ("nonterminal alphabet") of $G\_t$. The database $W\_t$ is named correct to the metadatabase $D\_t$ if

$$W\_t \subseteq L(G\_t),\tag{20}$$

"Set of Strings" Framework for Big Data Modeling DOI: http://dx.doi.org/10.5772/intechopen.85602

i.e., the facts present in the DB are words of the CF language $L(G\_t)$. In other notation,

$$(\forall w \in W\_t) \; \alpha\_0 \overset{\ast}{\underset{G\_t}{\Longrightarrow}} w,\tag{21}$$

where $\overset{\ast}{\underset{G\_t}{\Longrightarrow}}$ denotes that a string over the alphabet $V \cup N\_t$ is generated (or derived) from another one.

Example 2. Let the MDB $D\_t$ be as follows (nonterminal symbols are enclosed in metalinguistic brackets):

```
<fact> → AREA <name of area> IS <state> AT <time>,
<name of area> → <text>,
<state> → IN NORMAL STATE,
<state> → SMOKED,
<time> → <hours>.<minutes>,
<hours> → <0 to 1> <0 to 9>,
<hours> → 2 <0 to 3>,
<0 to 1> → 0,
<0 to 1> → 1,
<0 to 9> → 0,
…
<0 to 9> → 9,
<0 to 3> → 0,
…
<0 to 3> → 3,
<minutes> → <0 to 5> <0 to 9>,
<0 to 5> → 0,
…
<0 to 5> → 5,
<text> → <symbol>,
<text> → <symbol> <text>,
<symbol> → A,
…
<symbol> → Z,
<symbol> → 0,
…
<symbol> → 9,
<symbol> → ˽.
```
Database

$$W\_t = \left\{ \begin{aligned} &\text{AREA A W IS SMOKED AT 15.10}, \\ &\text{AREA E IS IN NORMAL STATE AT 23.59} \end{aligned} \right\}$$

is correct to this MDB, unlike the database

$$W\_t = \{\text{AREA AT NORMAL}\}. \blacksquare$$
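The language generated by the rules of Example 2 happens to be regular, so the correctness condition (20) can be checked for this particular MDB with a regular expression. The sketch below is written in Python and rests on two assumptions: the words of a fact are separated by single spaces, and time is written as hh.mm, as in the facts of the example. The names `FACT` and `is_correct` are introduced here for illustration.

```python
import re
from typing import Iterable

# Regular-expression rendering of the rules of Example 2:
# <fact> -> AREA <name of area> IS <state> AT <time>
FACT = re.compile(
    r"AREA [A-Z0-9 ]+ "                      # <name of area> -> <text>
    r"IS (?:IN NORMAL STATE|SMOKED) "        # <state>
    r"AT (?:[01][0-9]|2[0-3])\.[0-5][0-9]"   # <time> -> <hours>.<minutes>
)


def is_correct(db: Iterable[str]) -> bool:
    """Condition (20): every fact of the DB is a word of L(G_t)."""
    return all(FACT.fullmatch(w) is not None for w in db)


good = {"AREA A W IS SMOKED AT 15.10", "AREA E IS IN NORMAL STATE AT 23.59"}
bad = {"AREA AT NORMAL"}
```

For an arbitrary CF grammar a regular expression is not sufficient, and a general parsing algorithm would be required instead.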

The proposed application of CF grammars differs from the classical one, whose main purpose is to describe the set of correct sentences of some language (most frequently, a programming language). Such a description is created by the language developers or researchers, is based on syntactic categories referred to as nonterminals, and stays constant through the whole life cycle of the language (minor changes may be made because of language modification or deeper understanding). In the SSF case, CF generating rules describe the structure of DB elements (facts), so nonterminals are semantic rather than syntactic objects. On the other hand, the MDB is updated by the DS administration and is a dynamic set, whose changes cause immediate changes of the DB in order to keep it in the correct state. Such changes may be defined by the following equations, similar to Eqs. (4)–(9):

$$D\_{t+1} = D\_t \cup I\_t^D,\tag{22}$$

$$A\_{t+1}^D = D\_{t+1} - D\_t,\tag{23}$$

$$\mathbf{W}\_{t+1} = \mathbf{W}\_t \tag{24}$$

for insertion (inclusion) of new CF rules to MDB,

$$D\_{t+1} = D\_t - I\_t^D,\tag{25}$$

$$A\_{t+1}^D = D\_t - D\_{t+1} \tag{26}$$

$$\mathcal{W}\_{t+1} = \mathcal{W}\_t \cap L(\mathcal{G}\_{t+1}) \tag{27}$$

for deletion (exclusion) of CF rules, having place in MDB,

$$D\_{t+1} = D\_t,\tag{28}$$

$$A\_{t+1}^D = D\_t \cap I\_t^D,\tag{29}$$

$$\mathcal{W}\_{t+1} = \mathcal{W}\_t \tag{30}$$

and for query to MDB. Here $I\_t^D$ is similar to $I\_t$ in Eqs. (4)–(9), being a set of CF rules that represents the knowledge of the DS administration about the MDB. As seen, Eqs. (22) and (23) provide extension of the MDB; thus

$$L(\mathbf{G}\_t) \subseteq L(\mathbf{G}\_{t+1}),\tag{31}$$

and DB remains correct, because

$$W\_t \subseteq L(\mathbf{G}\_t) \subseteq L(\mathbf{G}\_{t+1}).\tag{32}$$

In Eqs. (25) and (26), where some part (subset) of MDB may be deleted,

$$L(\mathbf{G}\_{t+1}) \subseteq L(\mathbf{G}\_t),\tag{33}$$

"Set of Strings" Framework for Big Data Modeling DOI: http://dx.doi.org/10.5772/intechopen.85602

so some facts $w \in L(G\_t)$ may no longer satisfy condition (20) of DB correctness to the MDB $D\_{t+1}$, because $w \notin L(G\_{t+1})$; Eq. (27) therefore excludes them from $W\_{t+1}$. In Eqs. (25)–(30), it is presumed that $D\_t$ is also an SDB, whose own MDB defines the structure of CF rules in the manner of Example 2.
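The effect of rule deletion on the DB, Eqs. (25)–(27), can be traced on a toy grammar. The Python sketch below uses a naive top-down membership test that is adequate only for grammars without left recursion. It is an assumption of this illustration, not the parsing method of the SSF, and the toy rules and the names `in_language` and `mdb_delete` are likewise hypothetical.

```python
from functools import lru_cache
from typing import FrozenSet, Tuple

Rule = Tuple[str, Tuple[str, ...]]  # (alpha, beta) stands for alpha -> beta


def in_language(rules: FrozenSet[Rule], axiom: str, word: str) -> bool:
    """Naive test of word in L(G); assumes no left-recursive rules."""
    nonterminals = {lhs for lhs, _ in rules}  # Eq. (19)
    if axiom not in nonterminals:
        return False

    @lru_cache(maxsize=None)
    def derives(symbols: Tuple[str, ...], i: int) -> bool:
        # True if the symbol sequence derives exactly word[i:].
        if not symbols:
            return i == len(word)
        head, rest = symbols[0], symbols[1:]
        if head not in nonterminals:  # terminal string: must match literally
            return word.startswith(head, i) and derives(rest, i + len(head))
        # Nonterminal: try every rule with this left-hand side.
        return any(derives(rhs + rest, i) for lhs, rhs in rules if lhs == head)

    return derives((axiom,), 0)


def mdb_delete(d_t, w_t, i_d, axiom="<fact>"):
    """Eqs. (25)-(27): returns (D_{t+1}, A^D_{t+1}, W_{t+1})."""
    d_next = d_t - i_d
    answer = d_t - d_next
    w_next = frozenset(w for w in w_t if in_language(d_next, axiom, w))
    return d_next, answer, w_next


d_t = frozenset({
    ("<fact>", ("AREA ", "<name>", " IS ", "<state>")),
    ("<name>", ("A",)), ("<name>", ("B",)),
    ("<state>", ("SMOKED",)), ("<state>", ("NORMAL",)),
})
w_t = frozenset({"AREA A IS SMOKED", "AREA B IS NORMAL"})
removed = frozenset({("<state>", ("NORMAL",))})
d_next, answer, w_next = mdb_delete(d_t, w_t, removed)
```

After the rule for the normal state is deleted, the fact that relied on it is no longer a word of the new language and leaves the DB, while the other fact survives.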

Let us note that, from the substantial point of view, the notion of SDB correctness to the MDB is weaker than the notion of data storage correctness, because in the general case

$$\mathbf{W}(D\_t) \subseteq \mathbf{2}^{L(G\_t)},\tag{34}$$

i.e., the set of databases admitted in a correct storage is a subset of the power set (Boolean) of $L(G\_t)$, whereas correctness to the MDB alone corresponds to

$$\mathbf{W}(D\_t) = \mathbf{2}^{L(G\_t)},\tag{35}$$

i.e., every SDB whose facts are words of the CF language $L(G\_t)$ is correct, which is not true in reality. DS correctness generalizes the notion of DB integrity, deeply developed within the relational approach, and covers the total content of the database, i.e., the interconnections between its different elements. Various tools for declaring and checking integrity criteria are known, first of all functional dependencies and their multiple modifications [24–32]. Storage correctness, being the SDB analog of integrity, is considered within the SSF on the basis of augmented Post systems (APS).
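The gap between (34) and (35) can be made tangible with a database whose every fact is grammatically well formed but whose content is contradictory. In the Python sketch below, the constraint "one area cannot be in two different states at the same time" is a hypothetical integrity criterion chosen for illustration only. It is not taken from the SSF, where such criteria are expressed by APS, and the fact pattern assumes single spaces between words and time written as hh.mm.

```python
import re
from typing import Iterable

FACT = re.compile(
    r"AREA (?P<name>[A-Z0-9 ]+) "
    r"IS (?P<state>IN NORMAL STATE|SMOKED) "
    r"AT (?P<time>[0-9]{2}\.[0-9]{2})"
)


def grammar_correct(db: Iterable[str]) -> bool:
    """Every fact is well formed (correctness to the MDB only)."""
    return all(FACT.fullmatch(w) is not None for w in db)


def storage_correct(db: Iterable[str]) -> bool:
    """Well-formedness plus the hypothetical integrity criterion."""
    facts = list(db)
    if not grammar_correct(facts):
        return False
    seen = {}  # (area name, time) -> state
    for w in facts:
        m = FACT.fullmatch(w)
        key = (m["name"], m["time"])
        if seen.setdefault(key, m["state"]) != m["state"]:
            return False  # the same area in two states at the same moment
    return True


db = {"AREA E IS SMOKED AT 15.10", "AREA E IS IN NORMAL STATE AT 15.10"}
```

Here `db` belongs to the power set of the language but not to the set of correct databases, which is exactly the situation that separates the inclusion (34) from the equality (35).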

Let us now consider the application of the described segment of the SSF to the representation of the most frequently used data models. We shall call such an application "emulation."
