5. Word equations on context-free languages

A word equation is a well-known object of discrete mathematics, defined as follows [51–56].

A word equation is written as

$$
\mathfrak{s} = \mathfrak{s}', \tag{58}
$$

where s and s′ are the so-called terms. A term is a non-empty sequence of symbols of an alphabet, which we shall call terminal (presuming it is the same set V as above), and of variables, whose universe is denoted Γ. So s ∈ (V ∪ Γ)<sup>+</sup>, s′ ∈ (V ∪ Γ)<sup>+</sup>. The domain (set of values) of every variable γ ∈ Γ occurring in any term is V<sup>∗</sup>. A term without any variables is, obviously, a word in alphabet V. At least one variable must be present in the equation or, equivalently, in the term ss′ (or s′s):

$$ss' \in (V \cup \Gamma)^{+} - V^{+}.\tag{59}$$

The set

$$d = \{\gamma\_1 \to u\_1, \dots, \gamma\_n \to u\_n\},\tag{60}$$

where γ<sub>1</sub>, …, γ<sub>n</sub> are variables, u<sub>1</sub>, …, u<sub>n</sub> are strings in alphabet V, and → is the separator (which is, not by accident, the same as in the generating rules α → β entering metadatabases), is called a substitution.

Term s[d] is the result of applying substitution d to term s and is defined as follows. If

$$s = \overline{u}\_1 \gamma\_{i1} \overline{u}\_2 \dots \overline{u}\_m \gamma\_{im} \overline{u}\_{m+1},\tag{61}$$

where ū<sub>i</sub> ∈ V<sup>∗</sup> and i = 1, …, m + 1, then

$$s[d] = \overline{u}\_1 \overline{\gamma}\_{i1} \overline{u}\_2 \dots \overline{u}\_m \overline{\gamma}\_{im} \overline{u}\_{m+1},\tag{62}$$

where

$$\overline{\gamma}\_{ij} = \begin{cases} u\_{ij}, & \text{if } \gamma\_{ij} \to u\_{ij} \in d, \\ \gamma\_{ij}, & \text{otherwise.} \end{cases} \tag{63}$$

Definitions (62) and (63) cover the general case, in which some of the variables occurring in term s do not occur in substitution (60).

Substitution d is called a terminal substitution to term s ∈ (V ∪ Γ)<sup>+</sup> − V<sup>+</sup> if

$$s[d] \in V^{+},\tag{64}$$

i.e., the result of its application to the term is a word in alphabet V. In this case, obviously,

$$\{\gamma\_{i1}, \dots, \gamma\_{im}\} \subseteq \{\gamma\_1, \dots, \gamma\_n\}.\tag{65}$$

"Set of Strings" Framework for Big Data Modeling DOI: http://dx.doi.org/10.5772/intechopen.85602

A terminal substitution d to terms s and s′ is called a solution of word equation (58) if

$$
\mathfrak{s}[d] \equiv \mathfrak{s}'[d], \tag{66}
$$

i.e., the result of applying d to terms s and s′ is one and the same word (here "≡" is the identity sign).
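Definitions (62)–(66) translate directly into code. The following is a minimal Python sketch, not part of the original text. It assumes that the symbols of V are whole words (tokens) and that terms are tuples of tokens; the helper names `apply`, `is_terminal`, and `is_solution`, as well as the sample equation, are purely illustrative.

```python
from typing import Dict, Set, Tuple

Word = Tuple[str, ...]          # a string over V, stored token by token (assumption)
Substitution = Dict[str, Word]  # variable name -> string over V


def apply(term: Word, d: Substitution) -> Word:
    """Return term[d]: replace every variable bound in d, keep all other symbols."""
    result: Word = ()
    for symbol in term:
        result += d.get(symbol, (symbol,))
    return result


def is_terminal(term: Word, variables: Set[str]) -> bool:
    """True if no variable is left, i.e. the term is a word over V."""
    return all(symbol not in variables for symbol in term)


def is_solution(s: Word, s_prime: Word, d: Substitution, variables: Set[str]) -> bool:
    """Check (64) and (66): d is terminal for both terms and s[d] equals s'[d]."""
    left, right = apply(s, d), apply(s_prime, d)
    return is_terminal(left, variables) and is_terminal(right, variables) and left == right


# Hypothetical equation: SENSOR a = b IS AT GREEN VALLEY
VARIABLES = {"a", "b"}
s = ("SENSOR", "a")
s_prime = ("b", "IS", "AT", "GREEN", "VALLEY")
d = {"a": ("1", "IS", "AT", "GREEN", "VALLEY"), "b": ("SENSOR", "1")}
partial = {"a": ("1", "IS", "AT", "GREEN", "VALLEY")}  # leaves b unbound

print(is_solution(s, s_prime, d, VARIABLES))
print(is_solution(s, s_prime, partial, VARIABLES))
```

The substitution `partial` illustrates the general case of (62) and (63): variable b is not bound, so s′[partial] is not a word over V and the substitution is not terminal.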

Returning to SDB and the M-semantics of their DML, we may see that a set of terms may serve as the simplest query language to an SDB. If the term

$$s = \overline{u}\_1 \gamma\_{i1} \overline{u}\_2 \dots \overline{u}\_m \gamma\_{im} \overline{u}\_{m+1}\tag{67}$$

is a query to DB W<sub>t</sub>, then

$$I\_t = \{ \overline{u}\_1 u\_{i1} \overline{u}\_2 \dots \overline{u}\_m u\_{im} \overline{u}\_{m+1} \mid u\_{i1} \in V^\* \,\&\, \dots \,\&\, u\_{im} \in V^\* \},\tag{68}$$

and

$$A\_{t+1} = W\_t \cap I\_t = \{ w \mid w \in W\_t \,\&\, (\exists u\_{i1} \in V^\*) \dots (\exists u\_{im} \in V^\*)\; \overline{u}\_1 u\_{i1} \overline{u}\_2 \dots \overline{u}\_m u\_{im} \overline{u}\_{m+1} = w \}, \tag{69}$$

so Eq. (69) is the definition of the M-semantics of the term query language to SDB; as seen, w ∈ A<sub>t+1</sub> if w ∈ W<sub>t</sub> and the word equation s = w has at least one solution.

Example 7. Consider database W<sub>t</sub>, containing three facts:

SENSOR 1 IS AT GREEN VALLEY,

SENSOR 2 IS AT BLUE LAKE,

AREA LOWER FOREST IS SMOKED.

If the query is s = SENSOR a, whose purpose, as seen, is to select all facts with information about sensor installation, then

$$A\_{t+1} = \{\text{SENSOR 1 IS AT GREEN VALLEY},\ \text{SENSOR 2 IS AT BLUE LAKE}\},$$

and the solutions of the word equations

SENSOR a = SENSOR 1 IS AT GREEN VALLEY and SENSOR a = SENSOR 2 IS AT BLUE LAKE

are, respectively,

$$\{a \to \text{1 IS AT GREEN VALLEY}\}$$

and

$$\{a \to \text{2 IS AT BLUE LAKE}\}. \; \blacksquare$$
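The query semantics (69) amounts to solving one word equation s = w per stored fact. The sketch below is one possible brute-force way to do this in Python and is not taken from the original chapter. It assumes that facts and terms are tuples of whitespace-separated tokens and that a variable may take any (possibly empty) token sequence as its value; the names `solve` and `query` are hypothetical.

```python
from typing import Dict, Iterator, List, Optional, Set, Tuple

Word = Tuple[str, ...]


def solve(term: Word, word: Word, variables: Set[str],
          binding: Optional[Dict[str, Word]] = None) -> Iterator[Dict[str, Word]]:
    """Yield every substitution d such that term[d] == word (backtracking search)."""
    if binding is None:
        binding = {}
    if not term:
        if not word:
            yield dict(binding)
        return
    head, rest = term[0], term[1:]
    if head in variables and head in binding:
        # A repeated variable must take the same value again.
        value = binding[head]
        if word[:len(value)] == value:
            yield from solve(rest, word[len(value):], variables, binding)
    elif head in variables:
        # Try every prefix of the remaining word as the value of the variable.
        for k in range(len(word) + 1):
            binding[head] = word[:k]
            yield from solve(rest, word[k:], variables, binding)
        del binding[head]
    elif word and word[0] == head:
        yield from solve(rest, word[1:], variables, binding)


def query(database: List[Word], term: Word, variables: Set[str]) -> List[Word]:
    """A_{t+1}: facts w for which the equation term = w has at least one solution."""
    return [w for w in database if next(solve(term, w, variables), None) is not None]


W_T = [
    tuple("SENSOR 1 IS AT GREEN VALLEY".split()),
    tuple("SENSOR 2 IS AT BLUE LAKE".split()),
    tuple("AREA LOWER FOREST IS SMOKED".split()),
]
QUERY = ("SENSOR", "a")

answer = query(W_T, QUERY, {"a"})
for fact in answer:
    print(" ".join(fact), list(solve(QUERY, fact, {"a"})))
```

On the three facts of Example 7 the query selects the two SENSOR facts, and each of the corresponding equations has exactly one solution.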

However, applying the term query language to databases with incomplete information, which contain sentential forms of a CF grammar with scheme D<sub>t</sub>, being the DS metadatabase, is not so simple and needs a more sophisticated mathematical background.

Let G be the CF grammar corresponding to metadatabase D (the lower index t is omitted for simplicity). We shall call a word equation on context-free language L(G) (WECFL) the couple

$$<\mathfrak{s} = \mathfrak{s}', \delta >, \tag{70}$$

where the first component s = s′ is the word equation in the sense of (58), here called the kernel, while

Introduction to Data Science and Machine Learning

$$\delta = \{\boldsymbol{\gamma}\_1 \to \beta\_1, \dots, \boldsymbol{\gamma}\_l \to \beta\_l\},\tag{71}$$

is the so-called suffix, which defines the domains (sets of values) of the variables γ<sub>1</sub>, …, γ<sub>l</sub> occurring in terms s and s′ by means of strings β<sub>1</sub>, …, β<sub>l</sub>, containing terminal and nonterminal symbols of grammar G. The kernel and suffix must satisfy the so-called sentential condition

$$\{s[\delta], s'[\delta]\} \subseteq \text{SF}(G),\tag{72}$$

i.e., the strings resulting from the application of substitution δ to terms s and s′ must be sentential forms of grammar G. As seen, δ is a generalization of substitution (60), so below it will be called an SF-substitution.

WECFL (70) may be read "s = s′, where δ."

The domain of variable γ<sub>i</sub> is the set of strings in terminal alphabet V that are generated from string β<sub>i</sub> by application of the rules of grammar G. This domain is denoted as

$$\mathcal{V}(\gamma\_i, \delta) = \left\{ u \,\middle|\, \gamma\_i \to \beta\_i \in \delta \,\&\, \beta\_i \stackrel{\*}{\Rightarrow} u \,\&\, u \in V^\* \right\} \tag{73}$$

(from here on we shall use ⇒<sup>∗</sup> in the sense of ⇒<sup>∗</sup><sub>G</sub>).

The suffix δ defines the set of terminal substitutions to terms s and s′, denoted

$$\Sigma\_{\delta} = \bigcup\_{\substack{u\_{1} \in \mathcal{V}(\gamma\_{1}, \delta) \\ \cdots \\ u\_{l} \in \mathcal{V}(\gamma\_{l}, \delta)}} \{\{\gamma\_{1} \to u\_{1}, \ldots, \gamma\_{l} \to u\_{l}\}\}. \tag{74}$$

As is easy to see, a direct consequence of the sentential condition (72) and definition (74) is

$$\{s[d], s'[d]\} \subseteq L(G),\tag{75}$$

for every d ∈ Σ<sub>δ</sub>.

If a terminal substitution d ∈ Σ<sub>δ</sub> is such that

$$\mathfrak{s}[d] \equiv \mathfrak{s}'[d],\tag{76}$$

it is called a solution of WECFL (70). The set of solutions of WECFL (70), which is infinite in the general case, is denoted D[s = s′, δ].

The function V may also be applied to any term, so

$$\mathcal{V}(s, \delta) = \left\{ s[d] \mid d \in \Sigma\_{\delta} \right\} \subseteq L(G), \tag{77}$$

$$\mathcal{V}(s', \delta) = \left\{ s'[d] \mid d \in \Sigma\_{\delta} \right\} \subseteq L(G). \tag{78}$$

Example 8. Let the metadatabase be the same as in Example 2, and let the WECFL be

$$< \text{AREA GREEN VALLEY IS } a = b \text{ AT 15:00},\ \{a \to \langle\text{state}\rangle \text{ AT } \langle\text{time}\rangle,\ b \to \text{AREA } \langle\text{name of area}\rangle \text{ IS } \langle\text{state}\rangle\} >.$$

As seen, this equation satisfies the sentential condition, because

s[δ] = AREA GREEN VALLEY IS \<state\> AT \<time\> ∈ SF(G<sub>t</sub>),

s′[δ] = AREA \<name of area\> IS \<state\> AT 15:00 ∈ SF(G<sub>t</sub>).

According to Eq. (73),

V(a, δ) = {IN NORMAL STATE AT} · T ∪ {SMOKED AT} · T,

V(b, δ) = {AREA} · S · {IS IN NORMAL STATE} ∪ {AREA} · S · {IS SMOKED},

where T is the set of strings expressing time (00:00, 00:01, …, 23:58, 23:59), while S is the set of names of the monitored areas.

The terminal substitution

$$d = \{a \to \text{SMOKED AT 15:00},\ b \to \text{AREA GREEN VALLEY IS SMOKED}\}$$

is a solution of the presented WECFL. ∎
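For a small acyclic grammar, the domains (73), the set Σ<sub>δ</sub> of (74), and the solution set D[s = s′, δ] can be enumerated by brute force. The Python sketch below is an illustration only. Since the metadatabase of Example 2 is not reproduced in this section, `GRAMMAR` is a hypothetical fragment with two areas, two states, and two time values; the kernel AREA GREEN VALLEY IS a = b AT 15:00 and the suffix are assumptions chosen to agree with the sentential forms s[δ] and s′[δ] of Example 8. Tokens are treated as symbols, and all function names are illustrative.

```python
from itertools import product

# Hypothetical toy grammar: nonterminal -> list of right-hand sides (token tuples).
GRAMMAR = {
    "<name of area>": [("GREEN", "VALLEY"), ("BLUE", "LAKE")],
    "<state>": [("SMOKED",), ("IN", "NORMAL", "STATE")],
    "<time>": [("15:00",), ("16:00",)],
}


def words(form):
    """All terminal strings derivable from a sentential form (acyclic grammar assumed)."""
    if not form:
        return {()}
    head, rest = form[0], form[1:]
    if head in GRAMMAR:
        heads = set()
        for body in GRAMMAR[head]:
            heads |= words(body)
    else:
        heads = {(head,)}
    tails = words(rest)
    return {h + t for h in heads for t in tails}


def apply(term, d):
    """Return term[d] for a term given as a tuple of tokens and variables."""
    out = ()
    for symbol in term:
        out += d.get(symbol, (symbol,))
    return out


def terminal_substitutions(delta):
    """Enumerate Sigma_delta of (74): one value from V(gamma_i, delta) per variable."""
    names = list(delta)
    domains = [sorted(words(delta[name])) for name in names]
    for combo in product(*domains):
        yield dict(zip(names, combo))


def solutions(s, s_prime, delta):
    """D[s = s', delta], computed by filtering Sigma_delta with condition (76)."""
    return [d for d in terminal_substitutions(delta) if apply(s, d) == apply(s_prime, d)]


S = ("AREA", "GREEN", "VALLEY", "IS", "a")
S_PRIME = ("b", "AT", "15:00")
DELTA = {
    "a": ("<state>", "AT", "<time>"),
    "b": ("AREA", "<name of area>", "IS", "<state>"),
}

for d in solutions(S, S_PRIME, DELTA):
    print({k: " ".join(v) for k, v in d.items()})
```

With this toy grammar Σ<sub>δ</sub> contains 16 substitutions, of which two are solutions, one per value of the state. With the full time set T the enumeration grows accordingly, which is why a finite representation of the solution set is of interest.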

As seen, in the general case the set of solutions of a WECFL may be infinite, and the problem is to find a finite representation of this set.

Let us consider two sentential forms x and x′ of an unambiguous and acyclic CF grammar G. Each of them defines the set of strings generated (derived) from it, which are words of language L(G):

$$W\_{x} = \left\{ w \,\middle|\, x \stackrel{\*}{\Rightarrow} w \,\&\, w \in V^{\*} \right\}, \tag{79}$$

$$W\_{x'} = \left\{ w \,\middle|\, x' \stackrel{\*}{\Rightarrow} w \,\&\, w \in V^{\*} \right\}. \tag{80}$$

Therefore SF x and x′ are finite representations of the sets W<sub>x</sub> and W<sub>x′</sub>, both being subsets of language L(G). This circumstance serves as the background for the following statement, which provides the necessary solution.

Statement 1 [51]. If

$$W = W\_{x} \cap W\_{x'} \neq \{\varnothing\},\tag{81}$$

then there exists an SF y such that

$$W = \left\{ w \,\middle|\, x \stackrel{\*}{\Rightarrow} y \,\&\, x' \stackrel{\*}{\Rightarrow} y \,\&\, y \stackrel{\*}{\Rightarrow} w \,\&\, w \in V^\* \right\}. \tag{82}$$

In words, the non-empty intersection of sets W<sub>x</sub> and W<sub>x′</sub> is the subset of language L(G) whose words are generated from SF y, which itself is generated from both SF x and SF x′.

Example 9. Consider SF s[δ] and s′[δ] from Example 8. As seen, SF

y = AREA GREEN VALLEY IS \<state\> AT 15:00

is the finite representation of the intersection W<sub>s[δ]</sub> ∩ W<sub>s′[δ]</sub>. ∎

Thus SF y from Eq. (82) is nothing other than the required finite representation of the non-empty intersection (81).
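Statement 1 can be checked experimentally on a small grammar. The following Python sketch is a brute-force illustration, not the algorithm of [51]. It enumerates all sentential forms derivable from x and x′, takes as y the common form from which every other common form is derivable, and compares W<sub>y</sub> with W<sub>x</sub> ∩ W<sub>x′</sub>. The grammar is a hypothetical stand-in for the metadatabase of Example 2, and the names `step`, `reach`, `words`, and `inf` are assumptions.

```python
# Hypothetical toy grammar: nonterminal -> list of right-hand sides (token tuples).
GRAMMAR = {
    "<name of area>": [("GREEN", "VALLEY"), ("BLUE", "LAKE")],
    "<state>": [("SMOKED",), ("IN", "NORMAL", "STATE")],
    "<time>": [("15:00",), ("16:00",)],
}


def step(form):
    """All forms obtained from `form` by rewriting one nonterminal occurrence."""
    for i, symbol in enumerate(form):
        for body in GRAMMAR.get(symbol, ()):
            yield form[:i] + body + form[i + 1:]


def reach(form):
    """All sentential forms derivable from `form`, including `form` itself."""
    seen, stack = {form}, [form]
    while stack:
        current = stack.pop()
        for nxt in step(current):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def words(form):
    """W_form: the derivable forms that contain no nonterminal."""
    return {f for f in reach(form) if not any(sym in GRAMMAR for sym in f)}


def inf(x, x_prime):
    """Common derived form y from which all common derived forms follow, or None."""
    common = reach(x) & reach(x_prime)
    for y in common:
        if common <= reach(y):
            return y
    return None


x = ("AREA", "GREEN", "VALLEY", "IS", "<state>", "AT", "<time>")
x_prime = ("AREA", "<name of area>", "IS", "<state>", "AT", "15:00")
y = inf(x, x_prime)

print(" ".join(y))
print(words(x) & words(x_prime) == words(y))
```

The search is exponential in general and is only meant to make the roles of x, x′, and y in Eqs. (79)–(82) concrete.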

This finding is the basis for constructing the set of solutions D[s = s′, δ]. Let us begin with the case where every variable occurring in the WECFL (or, equivalently, in the term ss′) occurs in it once, i.e., there is no more than one occurrence of any variable in ss′.

Obviously, if

$$W = \mathcal{V}(s, \delta) \cap \mathcal{V}(s', \delta) = \{\varnothing\}, \tag{83}$$

then WECFL (70) does not have a solution, i.e.,

$$D[s = s', \delta] = \{\varnothing\}, \tag{84}$$

and if W ≠ {∅}, then, since s[δ] and s′[δ] are sentential forms of CF grammar G, there exists a finite representation of set W, namely an SF y generated (derived) from both s[δ] and s′[δ].

From here it is clear that a finite representation of the set D[s = s′, δ] is the set

$$\overline{\delta} = \{\gamma\_1 \to \overline{\beta}\_1, \dots, \gamma\_l \to \overline{\beta}\_l\},\tag{85}$$

such that

$$W = \{ w \mid s[\overline{\delta}] = s'[\overline{\delta}] = y \,\&\, y \stackrel{\*}{\Rightarrow} w \,\&\, w \in V^\* \}. \tag{86}$$

It is easy to verify that β̄<sub>1</sub>, …, β̄<sub>l</sub> are strings containing terminal and nonterminal symbols, generated from strings β<sub>1</sub>, …, β<sub>l</sub>, respectively, in the course of the derivations s[δ] ⇒<sup>∗</sup> y and s′[δ] ⇒<sup>∗</sup> y.

The set δ̄ will be called a unifier of WECFL (70). In accordance with Eqs. (54) and (55), we shall consider below the so-called maximal unifiers, corresponding to

$$y = \inf\{s[\delta], s'[\delta]\},\tag{87}$$

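where y is the greatest lower bound of the considered two-element set.

Example 10. Let the metadatabase be the same as in Example 2, and let the WECFL be

$$< \text{AREA } a \text{ IS SMOKED AT } t = b \text{ AT 15:00},\ \{a \to \langle\text{name of area}\rangle,\ t \to \langle\text{time}\rangle,\ b \to \text{AREA } \langle\text{name of area}\rangle \text{ IS } \langle\text{state}\rangle\} >.$$

As seen,

s[δ] = AREA \<name of area\> IS SMOKED AT \<time\>,

s′[δ] = AREA \<name of area\> IS \<state\> AT 15:00,

y = inf{s[δ], s′[δ]} = AREA \<name of area\> IS SMOKED AT 15:00,

and thus

$$\overline{\delta} = \{a \to \langle\text{name of area}\rangle,\ t \to \text{15:00},\ b \to \text{AREA } \langle\text{name of area}\rangle \text{ IS SMOKED}\}. \; \blacksquare$$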

Now we may return to DBI and introduce the so-called term data manipulation language (TDML), which is the set of the so-called augmented terms <s, d>, where s is a term and d is an SF-substitution. The M-semantics of this language is similar to Eqs. (49), (54), and (55) and is obtained by replacing SF y with the couple <s, d>:

$$A\_{t+1}^{s,d} = \left\{ x \,\middle|\, x \in X\_t \,\&\, s[d] \stackrel{\*}{\Rightarrow} x \right\},\tag{88}$$

$$\overline{A}\_{t+1}^{s,d} = \{ x \mid x \in X\_t \,\&\, \exists \inf \{ s[d], x \} \}, \tag{89}$$

$$\overline{\overline{A}}\_{t+1}^{s,d} = \{ \inf \{ s[d], x \} \mid x \in X\_t \,\&\, \exists \inf \{ s[d], x \} \}. \tag{90}$$
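The three replies (88)–(90) can be compared on a toy database with incomplete information. The Python sketch below is a brute-force illustration under stated assumptions: `GRAMMAR` is a hypothetical stand-in for the metadatabase, the facts in `X_T` and the query form `S_D` (playing the role of s[d]) are invented for the example, and `inf` is computed by exhaustive search rather than by any algorithm from [38–40].

```python
# Hypothetical toy grammar: nonterminal -> list of right-hand sides (token tuples).
GRAMMAR = {
    "<name of area>": [("GREEN", "VALLEY"), ("BLUE", "LAKE")],
    "<state>": [("SMOKED",), ("IN", "NORMAL", "STATE")],
    "<time>": [("15:00",), ("16:00",)],
}


def reach(form):
    """All sentential forms derivable from `form`, including `form` itself."""
    seen, stack = {form}, [form]
    while stack:
        cur = stack.pop()
        for i, symbol in enumerate(cur):
            for body in GRAMMAR.get(symbol, ()):
                nxt = cur[:i] + body + cur[i + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return seen


def inf(x, x_prime):
    """Common derived form from which all common derived forms follow, or None."""
    common = reach(x) & reach(x_prime)
    for y in common:
        if common <= reach(y):
            return y
    return None


def replies(s_d, facts):
    """Return the three answer sets of (88), (89), and (90) as lists."""
    reachable = reach(s_d)
    a = [x for x in facts if x in reachable]                # (88): s[d] derives x
    a_bar = [x for x in facts if inf(s_d, x) is not None]   # (89): inf exists
    a_bar_bar = [inf(s_d, x) for x in a_bar]                # (90): the inf itself
    return a, a_bar, a_bar_bar


X_T = [
    ("AREA", "GREEN", "VALLEY", "IS", "SMOKED", "AT", "15:00"),
    ("AREA", "BLUE", "LAKE", "IS", "<state>", "AT", "16:00"),
    ("AREA", "GREEN", "VALLEY", "IS", "<state>", "AT", "<time>"),
]
S_D = ("AREA", "<name of area>", "IS", "SMOKED", "AT", "<time>")

a, a_bar, a_bar_bar = replies(S_D, X_T)
for name, reply in (("(88)", a), ("(89)", a_bar), ("(90)", a_bar_bar)):
    print(name, [" ".join(x) for x in reply])
```

In this example only the complete fact satisfies (88), all three facts satisfy (89), and (90) returns each of them narrowed to the part compatible with the query.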

Moreover, from now on we may use augmented terms or even their sets as N-facts. The corresponding equations, which describe the M-semantics of TDML and are much more useful from the practical point of view, are as follows:

$$A\_{t+1}^{s,d} = \left\{ <\overline{s}, \overline{d}> \,\middle|\, <\overline{s}, \overline{d}> \in X\_t \,\&\, s[d] \stackrel{\*}{\Rightarrow} \overline{s}[\overline{d}] \right\},\tag{91}$$

$$\overline{A}\_{t+1}^{s,d} = \{ <\overline{s}, \overline{d}> \mid <\overline{s}, \overline{d}> \in X\_t \,\&\, \exists \inf \{ s[d], \overline{s}[\overline{d}] \} \}, \tag{92}$$

$$\overline{\overline{A}}\_{t+1}^{s,d} = \left\{ D\left[ s = \overline{s}, d \cup \overline{d} \right] \,\middle|\, <\overline{s}, \overline{d}> \in X\_t \right\}.\tag{93}$$

As may be seen, the last definition provides the most informative reply, containing the maximal unifiers of the WECFL, each corresponding to an N-fact entering the DBI.

The interested reader may find a detailed consideration of WECFL, DBI algorithmics (including N-facts fusion), and key theoretical issues of SDB/DBI internal organization, providing associative access to the stored data as well as their compression, in [38–40].

What has been said about TDML is sufficient for considering the already mentioned knowledge representation called augmented Post systems (APS), which is the core of the deductive capabilities of the "Set of Strings" Framework. APS are described in a separate chapter of this book.

### Author details

Igor Sheremet Financial University under the Government of Russian Federation, Moscow, Russia

\*Address all correspondence to: sheremet@rfbr.ru

© 2019 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
