**3. Query morphing: a query reformulation approach**

Many tools are available to extract knowledge from data, but they are inadequate for finding an appropriate subset of the data. A deep analysis is needed to gain relevant knowledge [21] from the available information space. Most tools follow the traditional lookup behavior, which aims to retrieve the best literal match in a short time by assuming that the user is aware of 'what he is looking for'. That is, systems are designed on the assumption that the user has a clear understanding of his search goals and is familiar with the database schema and context. It has been observed that the success of the search process depends on effective query articulation. A domain expert therefore performs the search operation successfully [33] and retrieves relevant results because he formulates his query with appropriate terms [22, 23, 28]. A naïve user, in contrast, faces challenges in formulating his information-seeking task because of limited domain awareness. To resolve this, he should be assisted in query construction by a flexible query answering system [12] that delivers additional possible result sets along with the original query results [25, 30]. The motive of such a system is to reduce the user's cognitive effort in subsequent queries [31, 8] by enhancing his knowledge.

Most systems support the 'Query-Result' paradigm, which is not sufficient because query formulation [12, 34] affects the performance of the system. The 'Query-Result-Review-Query' paradigm can help instead, as a user's search intention evolves as the search progresses. Traditional methodologies retrieve results based on predefined relevance criteria and fail to identify shifts in the user's search intentions. Therefore, a recall-oriented approach [19, 35] to query reformulation is designed. The idea behind it is as follows: the user poses an initial query *Q* and the system spends effort *T* on finding the optimal results for *Q*. A small portion of this effort is set aside for exploration. Various syntactic adjustments are performed within a small edit distance from *Q* to create variations *Qi*. A conceptual example is shown in **Figure 3**. The result set returned by the user's original query is depicted by the small red circle. Possible additional result sets of interest to the user are explored in the large data spectrum, provided that each result set belongs to the closed region surrounding the original request. The orange ellipses represent the query results that correspond to variations of the original query. After analyzing the results, the user may formulate another query that shifts interest towards another query result, as shown in the right portion of **Figure 3**. The new region of query results for the user's request includes both new and previous variations of the query. In this approach, data space exploration and feedback incorporation drive the query reformulation and the suggestion of additional relevant data. The additional relevant data objects are retrieved by performing exploration and exploitation in the available database. The properties of traditional techniques are also incorporated in our query morphing approach, as the user's search requests and retrieved outcomes are analogous to the user history log.
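The idea of creating variations *Qi* at a small edit distance from *Q* can be illustrated with a minimal sketch. It assumes, purely for illustration, that a query is a set of `(attribute, operator, constant)` predicates and that one edit is either removing a predicate or nudging a numeric constant; the representation and the name `one_edit_variants` are hypothetical and not taken from the chapter.

```python
from typing import FrozenSet, Set, Tuple, Union

Value = Union[int, str]
Predicate = Tuple[str, str, Value]   # (attribute, operator, constant)
Query = FrozenSet[Predicate]         # assumed representation of a conjunctive query


def one_edit_variants(query: Query, step: int = 1) -> Set[Query]:
    """Return all variants that differ from `query` by exactly one edit."""
    variants: Set[Query] = set()
    for pred in query:
        rest = query - {pred}
        # Edit type 1: remove one predicate (relaxes the query).
        if rest:
            variants.add(rest)
        # Edit type 2: shift a numeric constant up or down by `step`.
        attr, op, const = pred
        if isinstance(const, int):
            for delta in (-step, step):
                variants.add(rest | {(attr, op, const + delta)})
    return variants


q = frozenset({("G.genre", "=", "Drama"), ("M.year", ">", 1990)})
for variant in one_edit_variants(q):
    print(sorted(variant))
```

Adding predicates or joining auxiliary tables would be further edit types; they are left out here to keep the sketch short.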

Creating transformations of an input such as text, an image, or data is a fundamental process in computer science called morphing [10]. We treat our query as analogous to such morphing inputs and name our reformulation algorithm 'Query morphing' [41]. Our algorithm helps the user formulate intermediate queries by creating variants (transformations) of the original search query. The assistance to the user is based on the optimal query reformulations derived during exploration and exploitation of the dataspace. The proposed algorithms are developed with the 'Query-Result-Review-Query' paradigm of computing in mind [7, 15, 39]. The design framework is described in the following section and shown in **Figure 4**.

## **3.1. Proposed approach**

techniques are proposed to resolve these issues. These techniques assist the user by suggesting query terms for the subsequent formulation of incremental queries and by reducing the retrieval of irrelevant data. Fundamental operations such as equijoin and semijoin [11] are characterized for the formation of Boolean membership queries in polynomial time. User-membership-driven learning algorithms [2] can also serve to formulate simple Boolean queries. Many other techniques for query construction, such as locating minimal project-join queries and the discovering-query approach [34], have been developed to answer query formulation similar to example tuples [27].

We term our approach 'Query morphing' because, in the literature, morphing traditionally denotes the transformation of inputs, e.g. data morphing [20] and image morphing [10, 24]. Similarly, small transformations of user queries are carried out in our approach. We realized that the success of our approach relies mainly on effective database exploration and user participation.

152 From Natural to Artificial Intelligence - Algorithms and Applications


**Figure 3.** A conceptual example of query morphing.

**Figure 4.** Query morphing and user's interactions.

Query Morphing: A Proximity-Based Approach for Data Exploration

http://dx.doi.org/10.5772/intechopen.77073

Our query reformulation approach can be viewed as two sub-activities: traditional query processing, and the generation of morphs from which intermediate query reformulations are derived. Initially, the query *Qi* is validated and processed by the query engine of the DBMS through the traditional query processing mechanism. The data objects retrieved by the initial query *Qi* are located in a *d*-dimensional space that has already been created and partitioned into non-overlapping rectangular cells, and they are exploited in subsequent interactions. More specifically, let $S = D_1 \times D_2 \times \dots \times D_d$ be a *d*-dimensional space, where $D = \{D_1, D_2, \dots, D_d\}$ is a set of totally ordered and bounded domains (attributes). *S* is divided into $m^d$ non-overlapping rectangular cells by partitioning every dimension $D_i$ ($1 \le i \le d$) into *m* equal-length intervals. A *d*-dimensional data point *p* is projected into a grid cell *u* if, in each attribute, the value of *p* is less than the right boundary of that attribute in *u* and greater than or equal to its left boundary in *u*. When exploring in high dimensions, it can be assumed that relevant data points exist in the close neighborhood [13, 16, 36]. Future query formulation therefore hinges on exploring the neighborhood of each object returned by previous queries. The neighborhood of each data object is initialized as a cluster of most probable results, obtained through a subspace clustering technique. A modified 'cluster-clique' algorithm is proposed for cluster/morph generation.
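The grid projection described above can be sketched as follows. This is a minimal illustration, assuming each dimension has known bounds `lows[i]` and `highs[i]`; the function name `cell_of` and the choice to place a point lying exactly on the upper bound into the last interval are assumptions, not part of the chapter.

```python
from typing import Sequence, Tuple


def cell_of(point: Sequence[float],
            lows: Sequence[float],
            highs: Sequence[float],
            m: int) -> Tuple[int, ...]:
    """Map a d-dimensional point to the index tuple of its grid cell.

    Every dimension [low, high] is split into m equal-length intervals that
    are closed on the left and open on the right, as in the text.
    """
    cell = []
    for v, lo, hi in zip(point, lows, highs):
        width = (hi - lo) / m
        idx = int((v - lo) // width)
        # Assumption: clamp so that a value equal to the upper bound
        # falls into the last interval instead of outside the grid.
        cell.append(min(max(idx, 0), m - 1))
    return tuple(cell)


# A point with year 1995 and a normalised score 0.5, four intervals per axis.
print(cell_of((1995, 0.5), lows=(1990, 0.0), highs=(2010, 1.0), m=4))
```

With *m* = 4 and two dimensions the grid has $4^2 = 16$ cells, and each point is identified by one index per dimension.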

We assume that a pre-computed *d*-dimensional subspace of data points, divided into hyper-rectangular cells, is available. The result of the initial query, processed in the traditional way, is projected onto this *d*-dimensional spatial representation of the data. Each initial data object is identified in the space and treated as a distinct cluster. Exploration and exploitation are then performed in the neighborhood of every data object to form clusters that cover a maximal region. The selectivity of a cell is defined as the fraction of the total data points that lie in the cell. Only cells whose selectivity is greater than the model parameter *τ* are considered dense and are preserved; that is, a cell is dense if the fraction of the total data points in that cell exceeds the input parameter *τ*. The computation of dense cells applies to all subspaces of the *d*-dimensional space. Neighboring dense cells are then identified to form clusters of data points at lower dimensions. Cluster-clique relies on the fact that a cluster of dense cells in *k* dimensions also has a dense projection in *(k-1)* dimensions. Subspace projections are therefore considered bottom-up, both to identify the subspaces that contain clusters and to identify the dense cells to retain. A cell in a given projection subspace $S_t = A_{t_1} \times A_{t_2} \times \dots \times A_{t_k}$, where $k < d$ and $t_i < t_j$ if $i < j$, is the intersection of an interval in each dimension.
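As a hedged illustration of the selectivity test, the sketch below counts points per cell in a chosen subspace and keeps the cells whose fraction of all points exceeds *τ*. It assumes every data point has already been mapped to a tuple of interval indices; the name `dense_cells` and the input layout are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def dense_cells(point_cells: Sequence[Tuple[int, ...]],
                dims: Tuple[int, ...],
                tau: float) -> Dict[Tuple[int, ...], float]:
    """Return {cell: selectivity} for the dense cells of one subspace.

    point_cells: one full-dimensional cell index tuple per data point.
    dims:        indices of the dimensions that span the subspace.
    tau:         density threshold on selectivity (fraction of all points).
    """
    # Project every point onto the subspace and count points per cell.
    counts = Counter(tuple(cell[i] for i in dims) for cell in point_cells)
    total = len(point_cells)
    return {cell: n / total for cell, n in counts.items() if n / total > tau}


points = [(0, 0), (0, 1), (0, 1), (1, 1), (2, 0)]
print(dense_cells(points, dims=(0,), tau=0.3))    # 1-dimensional subspace
print(dense_cells(points, dims=(0, 1), tau=0.3))  # full 2-dimensional space
```

Calling the function once per subspace mirrors the statement that the dense-cell computation applies to all subspaces of the *d*-dimensional space.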

The proposed algorithm employs a bottom-up scheme by leveraging the *Apriori* algorithm, because monotonicity holds: if a group of data points is a cluster in a *k*-dimensional space, then this group of data points is also part of a cluster in any *(k-1)*-dimensional projection of this space. The recursive step from *(k-1)*-dimensional cells to *k*-dimensional cells involves a self-join of the *(k-1)*-dimensional cells that share their first *(k-2)* dimensions. Cluster-clique thins the collection of candidates to reduce the time complexity of the *Apriori* process, and keeps only the set of dense cells to form clusters at the next level. The portion of the database enclosed by the dense cells of a subspace is called its coverage. All subspaces are sorted according to their coverage, and the subspaces with low coverage are eliminated to perform thinning. The cut point between pruned and retained subspaces is selected using the minimum description length (MDL) principle from information theory. Two *k*-dimensional cells *u1* and *u2* are connected if they share a common face in the subspace, or if there exists another *k*-dimensional unit *us* such that both *u1* and *u2* are connected to *us*. A maximal set of connected dense units in *k* dimensions forms a cluster. Computing clusters is equivalent to computing connected components in a graph in which the dense cells are the vertices and an edge joins every pair of cells that share a common face. In the worst case, this can be computed in time quadratic in the number of dense cells. After all clusters are identified, each cluster is described by a finite set of maximal segments (regions) whose union forms the cluster, expressed as a DNF expression. Finding the minimal description of a cluster is equivalent to finding an optimal cover of the cluster. By examining all dense units, clusters are formed at higher dimensions and derived as query morphs. The top n keywords from the relevancy list are selected for suggestion to the user for query reformulation.
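The connected-components view of cluster formation can be sketched as below. This is a minimal illustration under the assumption that dense cells of one subspace are given as tuples of interval indices and that two cells share a face when they differ by exactly one interval in exactly one dimension; the names `share_face` and `clusters` are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Cell = Tuple[int, ...]


def share_face(a: Cell, b: Cell) -> bool:
    """True if the cells are adjacent in one dimension and equal in all others."""
    return sum(abs(x - y) for x, y in zip(a, b)) == 1


def clusters(dense: Iterable[Cell]) -> List[Set[Cell]]:
    """Group dense cells into maximal sets of connected cells."""
    remaining = set(dense)
    result: List[Set[Cell]] = []
    while remaining:
        seed = remaining.pop()
        component = {seed}
        stack = [seed]
        while stack:
            current = stack.pop()
            # Scan the unvisited cells for face-sharing neighbours; this scan
            # is what makes the worst case quadratic in the number of cells.
            neighbours = {c for c in remaining if share_face(current, c)}
            remaining -= neighbours
            component |= neighbours
            stack.extend(neighbours)
        result.append(component)
    return result


for cluster in clusters({(0, 0), (0, 1), (1, 1), (3, 3)}):
    print(sorted(cluster))
```

Cells connected only through an intermediate cell end up in the same component, which matches the transitive definition of connectedness given in the text.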


The proposed system first processes the user's initial query *Qi* in the traditional way and returns the initial result objects $O = \{o_{i1}, o_{i2}, \dots, o_{in}\}$. The returned data objects are projected onto the pre-computed *d*-dimensional subspace of data points, which is divided into hyper-rectangular cells. The algorithm treats these projected data objects as independent clusters $C = \{c_{i1}, c_{i2}, \dots, c_{in}\}$. Next, cells in their proximity (neighborhood) are explored and exploited to form larger clusters. Dense cells, that is, cells containing at least *τ* data points, are merged to form such clusters $C = \{c_1, c_2, \dots, c_n\}$ at lower dimensions. After identifying clusters in the 1-dimensional subspaces, we move on to higher dimensions. By monotonicity, if interesting clusters *c1* and *c2* exist at 1 dimension, then in the 2-dimensional subspace they can form a single cluster *c12* if their intersection is dense enough, and the lower-dimensional clusters *c1* and *c2* can be dropped from the cluster set. Similar computation is performed subsequently at the 3rd, 4th, 5th and up to the *d*th dimension to form higher-dimensional clusters. We stop once all clusters have been retrieved. Each cluster in the cluster set is then considered an independent morph of the initial user query (*Qi*). The top N relevant morphs from the set are recommended to the user for the formulation of succeeding exploratory queries.
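The step from lower-dimensional dense cells to the next dimension can be sketched as an Apriori-style join followed by a density check. The sketch assumes a subspace cell is a tuple of `(dimension, interval)` pairs sorted by dimension and that a cell is dense when it holds at least `tau` points, as stated above; the function names are hypothetical, and the usual Apriori subset-pruning step is omitted for brevity.

```python
from itertools import combinations
from typing import Sequence, Set, Tuple

SubCell = Tuple[Tuple[int, int], ...]  # ((dimension, interval), ...) sorted by dimension


def join_candidates(dense_prev: Set[SubCell]) -> Set[SubCell]:
    """Self-join (k-1)-dimensional dense cells that agree on all but their last pair."""
    out: Set[SubCell] = set()
    for a, b in combinations(sorted(dense_prev), 2):
        if a[:-1] == b[:-1] and a[-1][0] < b[-1][0]:
            out.add(a + (b[-1],))
    return out


def support(candidate: SubCell, point_cells: Sequence[Tuple[int, ...]]) -> int:
    """Number of points whose cell matches the candidate in every listed dimension."""
    return sum(all(pc[d] == i for d, i in candidate) for pc in point_cells)


def next_level(dense_prev: Set[SubCell],
               point_cells: Sequence[Tuple[int, ...]],
               tau: int) -> Set[SubCell]:
    """Keep only the joined candidates that contain at least tau points."""
    return {c for c in join_candidates(dense_prev)
            if support(c, point_cells) >= tau}


points = [(0, 1), (0, 1), (0, 2), (1, 1)]
dense_1d = {((0, 0),), ((1, 1),)}   # interval 0 on dimension 0, interval 1 on dimension 1
print(next_level(dense_1d, points, tau=2))
```

In this toy input the two 1-dimensional cells combine into one 2-dimensional cell because two points fall in their intersection, which corresponds to forming *c12* from *c1* and *c2*.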

**An example:** Refer to the schema of the movie database, the initial query variant (Qi+1), and the corresponding result set shown in **Figure 5**. {D.name = "D.Fincher"} is a 1-dimensional subspace cluster, and {G.genre = "Drama", 1990 < M.year < 2009} is a 2-dimensional subspace cluster. We look for interesting fragments of data in the clusters; their values can occur on a single attribute or on multiple attributes, that is, in a 1-dimensional or a *k*-dimensional subspace cluster. In other words, we look for interesting pieces of information at the granularity of clusters: this may be the value of a single attribute (1-dimensional cluster) or the values of *k* attributes (*k*-dimensional cluster).

**Figure 5.** (a) imdb movie database schema (b) variant of initial query Qi+1 and (c) Result set of query Qi+1.

Suppose the user wants to retrieve all movies directed by 'D. Fincher', using the example query shown in **Figure 5(b)**. From the retrieved results we can say that movies with genre "Drama" are frequently directed by "D. Fincher", so the user is possibly interested in movies with {G.genre = "Drama"}, and similarly in {G.genre = "Drama", 1992 < M.year < 2009}. Moreover, we intend to retrieve potentially relevant data that may satisfy the user's information need but is not part of the result set originally retrieved by the initial query. Consider the following exploratory variant of the initial query (Qi+1):

*(Qi+1):* **SELECT** *D.name*
**FROM** *G, M2D, D, M*
**WHERE** *D.name != 'D. Fincher'*
**AND** *G.genre = 'Drama'*
**AND** *D.directorid = M2D.directorid*
**AND** *M2D.movieid = M.movieid*
**AND** *M.movieid = G.movieid*

This variant retrieves other directors who have also directed *drama movies*, on the assumption that such results are of interest to the user. In the designed approach, subspace clustering is used to generate these query morphs/variants and the interesting additional results from the variant queries. The system computes the dataset (**Figure 5(c)**) of the initial query shown in **Figure 5(b)** and projects it onto the space.
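To make the variant query concrete, the sketch below runs it against an in-memory SQLite database. The table layouts, the `title` column, the assumption that `M2D` links movies to directors through a `movieid` column, and every row are invented for illustration; they are not the contents of the database shown in Figure 5.

```python
import sqlite3


def build_toy_db() -> sqlite3.Connection:
    """Create a tiny in-memory movie database (all rows are made up)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE M   (movieid INTEGER PRIMARY KEY, title TEXT, year INTEGER);
        CREATE TABLE G   (movieid INTEGER, genre TEXT);
        CREATE TABLE D   (directorid INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE M2D (movieid INTEGER, directorid INTEGER);
        INSERT INTO M   VALUES (1, 'Movie A', 1999), (2, 'Movie B', 1995),
                               (3, 'Movie C', 1972), (4, 'Movie D', 2003);
        INSERT INTO G   VALUES (1, 'Drama'), (2, 'Thriller'),
                               (3, 'Drama'), (4, 'Action');
        INSERT INTO D   VALUES (10, 'D. Fincher'), (11, 'F. Coppola'),
                               (12, 'Q. Tarantino');
        INSERT INTO M2D VALUES (1, 10), (2, 10), (3, 11), (4, 12);
    """)
    return conn


def other_drama_directors(conn: sqlite3.Connection) -> list:
    """Directors other than 'D. Fincher' with at least one drama movie."""
    rows = conn.execute("""
        SELECT DISTINCT D.name
        FROM G, M2D, D, M
        WHERE D.name != 'D. Fincher'
          AND G.genre = 'Drama'
          AND D.directorid = M2D.directorid
          AND M2D.movieid = M.movieid
          AND M.movieid = G.movieid
        ORDER BY D.name
    """).fetchall()
    return [name for (name,) in rows]


print(other_drama_directors(build_toy_db()))
```

`DISTINCT` and `ORDER BY` are added here only so that the toy output is stable; they are not part of the variant as written in the text.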

Initially, all data points are projected onto the *d*-dimensional space and the data points of the initial query result are identified. These data points are treated as initial clusters, and their neighborhood is then explored to obtain larger clusters. As shown in **Figure 6**, the algorithm performs exploration and forms larger clusters by merging neighboring cells that are dense enough. In the movie database, axis-parallel histograms are constructed for year and genre at 1 dimension. The next step is to move to higher dimensions, for example to the 2-dimensional subspaces {M.year, D.name} and {G.genre, D.name}, as shown in **Figure 6(a)** and **(b)**. Neighborhood exploration is performed and clusters are constructed. After all clusters are found, a finite set of maximal segments (*regions*) whose union is the cluster is computed as a DNF expression, as shown in **Figure 6(c)** for a higher dimension. Subsequently, we move to the 3rd, 4th, 5th, …, *d*th dimension in search of relevant clusters. This completes the exploration of each data subspace around the pertinent objects of the previous query.

**Figure 6.** (a) Cluster formation on 2D space {D.name, M.year} (b) cluster formation on 2D space {D.name, G.genre} (c) cluster formation on 3D space {D.name, M.year, G.genre} (d) results set of generated morphs.

All the computed clusters are equivalent to morphs of the initial/previous query. Morphs contain additional relevant results as well as a subset of the originally retrieved results. The data items in the morphs are rated as relevant, by a standard measure, to the previous query and to probable future search interests. Based on implicit and explicit relevance, the top N morphs and a set of relevant terms are then suggested to the user. In our example, after computing relevance scores using standard relevance measures, the query morphs containing movie genre 'Drama' and year > 1995 score higher than the morph with movie genre 'Thriller'. Hence, morphs with movie genre 'Drama' and year > 1995 are considered highly relevant. The system would also suggest the top N terms from the computed morphs, such as 'Coppola', based on their relevance to the initial query as well as to the result set. These terms help the user formulate the next exploratory/variant query. As a next step, the user may shift towards different query results after reviewing the result variants. The newly formulated query then encompasses both past and new variants of the user's request.
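The chapter does not name the relevance measure it uses, so the following sketch only illustrates the top-N selection step with a stand-in score: the Jaccard overlap between the result identifiers of a morph and those of the original query. The score, the function names, and the toy identifiers are all assumptions.

```python
import heapq
from typing import Dict, List, Set


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Overlap of two result sets, between 0 (disjoint) and 1 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_n_morphs(original: Set[int],
                 morphs: Dict[str, Set[int]],
                 n: int) -> List[str]:
    """Return the names of the n morphs that score highest against `original`."""
    scored = ((jaccard(original, ids), name) for name, ids in morphs.items())
    return [name for _, name in heapq.nlargest(n, scored)]


original_results = {1, 2, 3}
candidate_morphs = {
    "genre=Drama, year>1995": {1, 2, 4},
    "genre=Thriller": {3, 7, 8, 9},
    "genre=Comedy": {10},
}
print(top_n_morphs(original_results, candidate_morphs, n=2))
```

Any other measure that combines implicit and explicit relevance could replace `jaccard` without changing the selection step.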

feasible solution [9]; for this, various data summarization techniques can be employed. For example, relevant terms from the morphs are suggested to the user in a selective manner, so that the user can use these keywords for intermediate query formation.

Fundamentally, several adjustments can be made to perform query reformulation, such as adding or removing predicates, changing constants, and joining auxiliary tables through foreign key relationships. The kind of adjustment chosen to create an intermediate query may steer towards a relevant result set at optimal processing cost. The query morphing technique governs proximity-based query reformulation through its neighborhood exploration characteristics. The ultimate goal is to morph the query so that it pulls the user in a direction where information is available at low cost.

**5. Conclusion**

We proposed 'Query morphing', an algorithm for query reformulation based on object proximity, which is mainly designed to recommend additional relevant data objects from the neighborhood of the user's query results. Each relevant data object of the user query acts as an exemplar query for the generation of optimal intermediate reformulations. Multiple challenges arise in designing the solution, including: (i) neighborhood selection and query morph generation; (ii) evaluation of relevant data objects and top-K morphs; (iii) evaluation of data objects' relevance; and (iv) demonstration of the additional information extracted from retrieved data objects through various visualizations. The discussed approach is primarily based on proximity-based data exploration and on a generalized approach to query creation with small edit distance. It could be realized with major adjustments to the query optimizer. The ultimate goal would be that morphing the query pulls towards the area where information is accessible at low cost.

**Author details**

Jay Patel and Vikram Singh\*

\*Address all correspondence to: viks@nitkkr.ac.in

Computer Engineering Department, National Institute of Technology, Kurukshetra, Haryana, India

**References**

[1] Andolina S, Klouche K, Cabral D, Ruotsalo T, Jacucci G. Inspiration wall: Supporting idea generation through automatic information exploration. In: Proceedings of the 2015 ACM SIGCHI Conference on Creativity and Cognition. ACM; 2015. pp. 103-106
