**A Semantic-Based Framework for Summarization and Page Segmentation in Web Mining**

Alessio Leoncini, Fabio Sangiacomo, Paolo Gastaldo and Rodolfo Zunino

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/51178

## **1. Introduction**

The World Wide Web has become a fundamental resource of information for an increasing number of activities, and a huge information flow is exchanged today through the Internet for the widest range of purposes. Although large-bandwidth communications yield fast access to virtually any kind of content by both human users and machines, the unstructured nature of most available information poses a crucial issue. In principle, humans can best extract relevant information from posted documents and texts; on the other hand, the overwhelming amount of raw data to be processed calls for computer-supported approaches. Thus, in recent years, *Web mining* research has tackled this issue by applying data mining techniques to Web resources [1].

This chapter deals with the predominant portion of web-based information, i.e., documents embedding natural-language text. The huge amount of textual digital data [2, 3] and the dynamic nature of natural language can make it difficult for an Internet user (either human or automated) to extract the desired information effectively: people thus face the problem of information overload every day [4], while search engines often return too many results or biased/inadequate entries [5]. This in turn shows that: 1) treating web-based textual data effectively is a challenging task, and 2) further improvements are needed in the area of Web mining. In other words, algorithms are required to speed up human browsing or to support the actual crawling process [4]. Application areas that can benefit from such algorithms include marketing, CV retrieval, exploration of laws and regulations, competitive intelligence [6], web reputation, business intelligence [7], news article search [1], topic tracking [8], and the search for innovative technologies. Focused crawlers represent another potential, crucial application of these technologies in the security domain [7, 9].

© 2012 Leoncini et al.; licensee InTech. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The research described in this chapter tackles two challenging problems in Web mining techniques for extracting relevant information. The first problem concerns the acquisition of useful knowledge from textual data; this is a central issue for *Web content mining* research, which has mostly approached the task by exploiting text-mining technologies [1]. The second problem relates to the fact that a web page often presents a considerable amount of information that can be regarded as 'noise' with respect to the truly informative sections for the purposes at hand [10]. According to [10], uninformative web page contents can be divided into navigation units, decoration items, and user interaction parts. On one hand, these elements drain the attention of users, who have to spend time collecting the truly informative portions; on the other hand, they can affect the performance of algorithms that should extract the informative content of a web page [10]. This problem is partially addressed by the research area of the *semantic Web*, which aims to enrich web pages with semantic information accessible to humans and machines [5]. Thus *semantic Web mining* aims to combine the outcomes of the semantic Web [11] and Web mining to attain more powerful tools that can reliably address the two problems described above [5].



The approach adopted in this work, however, does not rely on semantic information already embedded into the Web resources; instead, the semantic characterization of words and sentences plays a crucial role in reaching two outcomes:

**•** to work out from a Web resource a concise summary, which outlines the relevant topics addressed by the textual data, thus discarding uninformative, irrelevant contents;

**•** to generate a web page segmentation that points out the relevant text parts of the resource.
Semantic characterization is obtained by applying semantic networks to the considered Web resource. As a result, natural language text maps into an abstract representation that eventually supports the identification of the topics addressed in the Web resource itself. A heuristic algorithm attains the latter task by using the abstract representation to work out the relevant segments of text in the original document. Page segmentation is then obtained by properly exploiting the information obtained on the relevant topics and the topics covered by the different sections of the Web page.

The novel contribution of this work lies in a framework that tackles two tasks at the same time: text summarization and page segmentation. This result is obtained by applying an approach that extracts semantic information from the Web resource itself and does not rely on external information that may not be available. Combining effective page segmentation with text summarization can eventually support advanced web content mining systems that address the discovery of patterns, the tracking of selected topics, and efficient resource finding.

Experimental results involved the well-known DUC 2002 dataset [12]. This dataset has been used to evaluate the ability of the proposed framework to consistently identify the topics addressed by a document and eventually generate the corresponding summary. The ROUGE tool [13] has been used to measure the performance of the summarization algorithm exploited by the present framework. Numerical results show that the research described in this chapter compares favorably with state-of-the-art approaches published in the literature.

The rest of the chapter is organized as follows. Section 2 gives an overview of the state of the art in the different research areas involved. Section 3 introduces the overall approach proposed in this research, while Section 4 discusses the actual implementation of the framework. Section 5 presents the experimental results. Some concluding remarks are made in Section 6.

## **2. Related work**


The current research proposes a web mining algorithm that exploits knowledge-based semantic information to integrate text-summarization and web page-segmentation technologies, thus improving the overall effectiveness of the approach. The following sections overview the state of the art in the different research areas involved: web content mining, text summarization, and web page segmentation. This Section also highlights the points of novelty introduced by the present research with respect to previous works.

#### **2.1. Web content mining**

Web mining is the use of data mining techniques to automatically discover and extract information from web documents and services; its application areas include resource finding, information selection, generalization, and data analysis [14]. Incidentally, machine-learning methods usually address the last two tasks. Web mining includes three main sub-areas: web content mining, web structure mining, and web usage mining [15]. The first area covers the analysis of the contents of web resources, which in general comprise different data sources: texts, images, videos, and audio; metadata and hyperlinks are often classified as text content. It has been shown that unstructured text represents the prevailing part of web resources [14, 16]; this in turn motivates the large-scale use of text mining technologies.

A wide variety of works in the literature focused on text mining for web content mining [17]. Some web content mining techniques for web search, topic extraction, and web opinion mining were explored in [18]. In [19], Liu et al. showed that web content mining could address application areas such as sentiment classification, analysis and summarization of consumer reviews, template detection, and page segmentation. In [20], web content mining tackled business applications through a framework for competitive intelligence. In [21], an advanced search engine supported web-content categorization based on word-level summarization techniques. A web-page analyzer for detecting undesired advertisements was presented in [22]. The work described in [23] proposed a web-page recommendation system, where learning methods and collaborative filtering techniques cooperated to produce a web filter for efficient user navigation.

The approach presented in this research differs from those related works in two main aspects: first, it exploits semantic-based techniques to select and rank single sentences extracted from text; second, it combines summarization with web page segmentation. The proposed approach does not belong to the semantic web mining area, which refers to methodologies that address the development of specific ontologies that enrich original web page contents in a structured format [11, 24]. To the best of the authors' knowledge, the literature provides only two works that used semantic information for web content mining. The research described in [25] addressed personalized multimedia management systems, and used semantic, ontology-based contextual information to attain personalized behavior in content access and retrieval. An investigation of semantic-based feature extraction for web mining is proposed in [26], where the WordNet [27] semantic network supported a novel metric for semantic similarity.


#### **2.2. Text summarization**

A summary is a text produced from one or more texts, expressing the important information of the original texts, and no longer than half of the original texts [28]. Text summarization techniques thus aim to minimize the reading effort by maximizing the information density presented to the reader [29]. Summarization techniques can be categorized into two approaches: in extractive methods, summaries stem from the verbatim extraction of words or sentences, whereas abstractive methods create original summaries by using natural language generators [30].

The works of Das et al. [30] and Gupta et al. [31] provide extensive surveys of extractive summarization techniques. Several methods relied on word frequency analysis, cue word extraction, or the selection of sentences according to their position in the text [32]. More recent works used tf-idf metrics (term frequency - inverse document frequency) [33], graph analysis, latent semantic analysis [34], machine learning techniques [35], and fuzzy systems [36, 37]. Other approaches exploited semantic processing: [38] adopted lexicon analysis, whereas concept extraction supported the research presented in [39]. Abstractive summarization was addressed in [40], where the goal was to understand the main concepts of a document, and then to express those concepts in natural-language form.
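The tf-idf family of extractive methods cited above can be sketched compactly. The toy implementation below is an illustration of the general technique only, not the method of any cited work; in particular, treating each sentence as its own "document" for the idf statistics is a simplifying assumption.

```python
# Toy extractive summarizer: rank sentences by summed tf-idf weight.
# Each sentence plays the role of a "document" for the idf statistics;
# this convention is an assumption made for this sketch.
import math
import re
from collections import Counter

def tfidf_rank(sentences):
    """Return sentence indices ordered from most to least informative."""
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    n = len(tokenized)
    # document frequency: in how many sentences does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = sum(
            (count / len(tokens)) * math.log(n / df[term])
            for term, count in tf.items()
        )
        scores.append(score)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)

sentences = [
    "Web mining applies data mining techniques to web resources.",
    "Summarization selects the most informative sentences.",
    "Web mining techniques support summarization of web resources.",
]
print(tfidf_rank(sentences))
```

An extractive summary is then simply the top-ranked sentences, reported verbatim in their original order.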

The present work actually relies on a hybrid extractive-abstractive approach. First, the most informative sentences are selected by using the co-occurrence of semantic domains [41], thus involving extractive summarization. Then, abstractive information is produced by working out the most representative domains for every document.

#### **2.3. Web page segmentation**

Website pages are designed for visual interaction, and typically include a number of visual segments conveying heterogeneous contents. Web page segmentation aims to grasp the page structure and split contents according to visual segments. This is a challenging task that brings about a considerable number of issues. Different techniques have been applied to web page segmentation over the past years: PageRank [42], graph exploration [43], rules [10, 44, 45], heuristics [46, 47, 48, 49], text processing [50], image processing [51], machine learning [52, 53], and semantic processing [54].

Web page segmentation methods apply heuristic algorithms, and mainly rely on the Document Object Model (DOM) tree structure associated with a web resource. Therefore, segmentation algorithms may not operate properly when those ancillary features are not available or when they do not reflect the actual semantic structure of the web page. Conversely, the approach presented in this chapter relies only on the processing of the textual information that can be retrieved from the web resource.

## **3. A Framework for Text Summarization and Segmentation**

The processing of textual data in a Web page yields two outcomes: a text summary that identifies the most relevant topics addressed in the Web page, and the set of sentences that are most correlated with those topics. The latter indirectly supports the segmentation of the web page, as one can identify the substructures that deal with the relevant topics. Several advanced applications for Web mining can benefit from this approach: intelligent crawlers that explore only links related to the most informative content, focused robots that follow the evolution of specific content, and web browsers with advertising filters or content-highlighting capabilities. This Section presents the overall approach and introduces the various elements that compose the whole framework. Section 4 will then discuss the actual implementation of the framework used in this work.

#### **3.1. Overall system description**


The approach relies on a two-level abstraction of the original textual information extracted from the web page (Figure 1); semantic networks are the main tools exploited to accomplish this task. First, raw text is processed to work out *concepts*. Then, concepts are grouped into domains; here, a domain represents a list of related words describing a particular subject or area of interest. According to Gliozzo et al. [55], domain information corresponds to a paradigmatic relationship, i.e., one between two words with closely related meanings (e.g., synonyms and hyponyms).

Semantic networks make it possible to characterize the content of a textual resource according to semantic domains, as opposed to a conventional bag of words. The ultimate objective is to exploit a coarse-grained level of sense distinctions, which in turn can lead to identifying the topics actually addressed in the Web page. Toward that end, suitable algorithms must process the domain-based representation and recognize the relevant information in the possibly noisy environment of a Web page. Indeed, careful attention should be paid to the fact that Web pages often address multiple, heterogeneous domains. Section 4 presents in detail the procedure implemented to identify specific domains in a Web page.

Text summarization is obtained after the identification of the set, Θ, of domains that characterize the informative content of the Web page. The summary is obtained by detecting, in the original textual source, the sentences that are most correlated with the domains included in Θ. To complete this task, sentences are ranked according to the single terms they involve, since the proposed approach only sets links between terms and concepts (domains). The process can generate the eventual summary according to two criteria: the first criterion yields a summary that describes the overall content of the Web page, and therefore does not distinguish the various domains included in Θ; the second criterion produces a multiplicity of summaries, one for each domain included in Θ.
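A minimal sketch of this domain-driven ranking follows. The term-to-domain lexicon below is invented purely for illustration; the actual framework derives these links from a semantic network, not from a hand-written dictionary.

```python
# Toy sketch: rank sentences against a set of semantic domains (Theta).
# The LEXICON below is a hypothetical, hand-made stand-in for the
# term->domain links that the framework obtains from a semantic network.
import re

LEXICON = {
    "economy": {"market", "stock", "price", "trade"},
    "sport":   {"match", "team", "goal", "player"},
}

def score_sentence(sentence, domains):
    """Count the sentence terms that belong to any domain in `domains`."""
    terms = set(re.findall(r"[a-z]+", sentence.lower()))
    return sum(len(terms & LEXICON[d]) for d in domains)

def summarize(sentences, domains, k=1):
    """Return the k sentences most correlated with the domains in Theta."""
    ranked = sorted(sentences, key=lambda s: score_sentence(s, domains),
                    reverse=True)
    return ranked[:k]

theta = {"economy"}
sentences = [
    "The team scored a late goal.",
    "Stock prices fell as the market opened.",
    "The weather was mild yesterday.",
]
print(summarize(sentences, theta))
# ['Stock prices fell as the market opened.']
```

Under the second criterion described above, `summarize` would simply be called once per domain in Θ, yielding one summary per domain.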

That approach to text summarization supports an unsupervised procedure for page segmentation, too. Indeed, the described method can 1) identify within a Web page the sentences that are most related to the main topics addressed in the page itself, and 2) label each sentence with its specific topic. Thus text summarization can help assess the structure of the Web page, and the resulting information can be combined with that provided by specific structure-oriented tools (e.g., those used for tag analysis in html source code).
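One simple way such topic labels could drive segmentation is sketched below. The grouping heuristic (merging consecutive sentences that share a label) is our illustration, not the chapter's exact algorithm.

```python
# Illustrative heuristic, not the chapter's exact algorithm: merge
# consecutive topic-labeled sentences into candidate page segments.
def segment(labeled_sentences):
    """labeled_sentences: list of (sentence, topic_label) pairs.

    Returns (topic_label, [sentences]) segments formed by grouping
    consecutive sentences that carry the same label.
    """
    segments = []
    for sentence, label in labeled_sentences:
        if segments and segments[-1][0] == label:
            segments[-1][1].append(sentence)
        else:
            segments.append((label, [sentence]))
    return segments

page = [
    ("Markets rallied on Monday.", "economy"),
    ("Traders expect further gains.", "economy"),
    ("The home team won 2-1.", "sport"),
]
print(segment(page))
```

In the full framework, the resulting segments would additionally be reconciled with structure-oriented evidence such as the html tags surrounding each sentence.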

Figure 2 shows the two alternative strategies that can be included in the Web mining system. The first strategy uses the text summarization abilities to find relevant information in a Web page, and possibly to categorize the contents addressed. The second strategy targets a selective search, which is driven by a query prompted by the user. In the latter case, text summarization and the eventual segmentation allow the mining tool to identify the information that is relevant for the user in the considered Web page.

**Figure 1.** The two abstraction layers exploited to extract contents from textual data.

#### **3.2. Overall system description**

The overall framework can be schematized according to the following steps (Figure 3):

1. From the Web page to textual data:
   **a.** get a Web page;
   **b.** extract textual data from the source code of the Web page.
2. Text preprocessing:
   **a.** identify word and sentence terminators to split text into words (tokens) and sentences;
   **b.** erase stop words;
   **c.** lemmatization.
3. Abstraction:
   **a.** first abstraction level: a semantic network is used to extract a set of concepts from every token; eventually, a list of concepts is obtained;
   **b.** second abstraction level: the concepts are grouped into homogeneous sets (domains).
4. Content analysis, following one of two strategies:
   **a.** automatic selection of domains: identify the informative contents addressed by processing the list of domains obtained after Step 3 (Abstraction);
   **b.** user-driven domains: process the list of domains obtained after Step 3 (Abstraction) to search for the topics indicated by the user.
5. Outputs:
   Summarization:
   **a.** use the output of Step 4 (Content Analysis) to rank the sentences included in the textual source;
   **b.** build a summary by using the most significant sentences according to the rank.
   Page Segmentation:
   **a.** use the sentence ranking to select the portions of the web page that deal with the main topics.

Step 4 (Content Analysis) and Step 5 (Outputs) can be supported by different approaches. Section 4 discusses the approaches adopted in this research.

## **4. Implementation**

The processing starts by feeding the system with the download of a web page. Raw text is extracted by applying the 'libxml' parsing library [56] to the html source code.

## **4. Implementation**

The processing starts by feeding the system with the download of a web page. Raw text is extracted by applying the 'libxml' parsing library [56] to the html source code.
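The chapter relies on the 'libxml' parser [56] for this step. As a rough, hypothetical stand-in (not the authors' implementation), the same tag-stripping can be sketched with Python's standard `html.parser` module:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect character data, skipping <script>/<style> content."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def html_to_text(source: str) -> str:
    parser = TextExtractor()
    parser.feed(source)
    return " ".join(parser.chunks)

page = "<html><body><h1>News</h1><p>Rome hosts a summit.</p><script>x=1</script></body></html>"
print(html_to_text(page))  # News Rome hosts a summit.
```

A production system would of course use a fault-tolerant parser (such as libxml's HTML mode) to cope with malformed pages.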

**Figure 2.** The proposed system can automatically detect the most relevant topics, or alternatively can select single text sections according to the user requests.

**Figure 3.** The data flow of the proposed framework.

#### **4.1. Text preprocessing**

This phase receives as input the raw text and completes two tasks: 1) it identifies the beginning and the end of each sentence; 2) it extracts the tokens from each sentence, i.e., the terms that compose the sentence. Additional subtasks are in fact involved for optimal text processing: after parsing raw text into sentences and tokens, the idiom is identified and stop words are removed accordingly; this operation removes frequent and semantically non-selective expressions from the text. Then, *lemmatization* simplifies the inflectional forms of a term (sometimes derivationally related forms) down to a common radix form (e.g., by simplifying plurals or verb persons). These subtasks are quite conventional in natural language processing systems [57], and aim to work out a set of representative tokens.

The process that extracts sentences and tokens from text is driven by a finite-state machine (FSM), which parses the characters in the text sequentially. The formalism requires the definition of the following quantities:

**•** state *STARTS*: a sentence begins (hence, also a token begins);

**•** state *ENDS*: end of sentence achieved (hence, end of token also achieved);

**•** state *STARTT*: a token begins;

**•** state *ENDT*: end of token achieved;

**•** set *lower*, which includes all the lower case alphabet characters;

**•** set *upper*, which includes all the upper case alphabet characters;

**•** set *character*, which is obtained as the union of set *lower* and set *upper*;

**•** set *number*, which includes all the numbers;

**•** set *dot*, which only includes the dot character;

**•** set *sdelim*, which includes common sentence delimiter characters, such as :;!?'"

**•** set *tdelim*, which includes space, tab and newline codes, plus the following characters: "\',/:;.!?[]{}()\*^-~\_=

A detailed description of the complete procedure implemented by the FSM is provided in Figure 4. Actually, Figure 4(a) refers to the core procedure, which includes the initial state STARTS; Figure 4(b) refers to the sub-procedure that starts when the state NUMBER is reached in the procedure of Figure 4(a); Figure 4(c) refers to the sub-procedure that starts when the state ALPHA is reached in the procedure of Figure 4(a). In all the schemes, the elements with circular shape represent the links between the three procedures: the light-grey elements refer to links that transfer control to a different procedure; the dark-grey elements refer to links that receive control from a different procedure.

The process implemented by the FSM yields a list of tokens, a list of sentences and the position of each token within the associated sentence. Stop-word removal takes out those tokens that either are shorter than three characters or appear in a language-specific list of terms (conjunctions, articles, etc.). This effectively shrinks the list of tokens. Finally, a lemmatization process reduces each token to its root term. Different algorithms can perform the lemmatization step, depending on the document language. The WordNet morphing features [27] best support lemmatization in the English idiom, and have been adopted in this research.

In the following, the symbol Ω will denote the list of tokens extracted after text preprocessing: Ω = {*ti*; *i* = 1,..,*Nt*}, where *ti* is a token and *Nt* is the number of tokens.
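The preprocessing stage can be sketched as follows. This is a simplified illustration, not the chapter's FSM of Figure 4: the delimiters follow the *sdelim*/*tdelim* sets listed above, while the stop-word list and the lemma map are toy placeholders for the language-specific resources and the WordNet morphing features [27]:

```python
# Simplified sketch of the preprocessing stage (Section 4.1).
SDELIM = set(":;!?.")                              # sentence delimiters (subset)
TDELIM = set(" \t\n\"\\',/:;.!?[]{}()*^-~_=")      # token delimiters

STOP_WORDS = {"the", "and", "for"}                 # toy placeholder list
LEMMAS = {"hosted": "host", "summits": "summit"}   # toy placeholder map

def split_sentences(text):
    """Cut the raw text at every sentence delimiter."""
    sentences, current = [], ""
    for ch in text:
        current += ch
        if ch in SDELIM:
            if current.strip():
                sentences.append(current.strip())
            current = ""
    if current.strip():
        sentences.append(current.strip())
    return sentences

def tokenize(sentence):
    """Cut a sentence at every token delimiter."""
    tokens, current = [], ""
    for ch in sentence:
        if ch in TDELIM:
            if current:
                tokens.append(current.lower())
            current = ""
        else:
            current += ch
    if current:
        tokens.append(current.lower())
    return tokens

def preprocess(text):
    """Return (sentences, Ω): the sentence list and the filtered token list."""
    sentences = split_sentences(text)
    omega = []
    for s in sentences:
        for t in tokenize(s):
            if len(t) >= 3 and t not in STOP_WORDS:  # stop-word removal
                omega.append(LEMMAS.get(t, t))       # lemmatization
    return sentences, omega

sents, omega = preprocess("Rome hosted the summit. The talks went well.")
print(sents)   # ['Rome hosted the summit.', 'The talks went well.']
print(omega)   # ['rome', 'host', 'summit', 'talks', 'went', 'well']
```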

**Figure 4.** The Finite State Machine that extracts sentences and tokens from text. The three schemes refer to as many sub-procedures.

#### **4.2. The abstraction process: from words to domains**

The framework uses a semantic network to map tokens into an abstract representation, which can characterize the informative content of the basic textual resource on a cognitive basis. The underlying hypothesis is that, to work out the topics addressed in a text, one cannot just depend on the mentioned terms, since each term can in principle convey different senses. On the other hand, the semantic relations that exist between concepts can help understand whether the terms can connect to a single subject or area of interest.

The present approach implements such an abstraction process by mapping tokens into domains. An intermediate step, from tokens to concepts, supports the whole procedure. Two well-known semantic networks have been used to complete this task: EuroWordNet [58], i.e., the multilanguage version of WordNet [27], and its extension WordNet Domains [41]. Both EuroWordNet and WordNet Domains are ontologies designed to decorate words or sets of words with semantic relations. The overall structures of EuroWordNet and WordNet Domains are based on the conceptual structures theory [59], which describes the different types of relations that can tie together different concepts.

#### *4.2.1. From tokens to concepts*

The abstraction from tokens to concepts is accomplished by using EuroWordNet. EuroWordNet is an extension of the WordNet semantic knowledge base for English, inspired by the current psycholinguistic theory of human lexical memory [27]. Nouns, verbs, adjectives and adverbs are organized in sets of synonyms (*synsets*), each of which represents a lexical concept. Actually, the same word can participate in several synsets, as a single word can have different senses (polysemy). Synonym sets are connected to other synsets via a number of semantic relations, which vary based on the type of word (noun, verb, adjective, and adverb); for example, synsets of nouns can be characterized by relations such as hyponymy and meronymy. Words can also be connected to other words through lexical relations (e.g., antonymy). EuroWordNet supports different languages; thus, in principle, the approach proposed in this chapter can be easily extended to documents written in Italian, Spanish, French, and German. Table 1 gives, for each language, the number of terms and the number of concepts provided by EuroWordNet [58].

In the present research, the list of concepts that characterize a text is obtained as follows:

**a.** For each token *ti* ∈ Ω, extract the list of concepts (i.e., synsets) Χ*i* that EuroWordNet associates to the token: Χ*i* = {*ck*; *k* = 1,..,*Nc,i*}, where *Nc,i* is the number of different concepts in Χ*i*.

**b.** Assemble the overall list of concepts: Σ = Χ*1* ∪ Χ*2* ∪ Χ*3* ∪ … ∪ Χ*Nt*.

To avoid inflating the list of concepts, in this work the tokens that connect to more than eight concepts are discarded. This threshold has been set empirically through preliminary experiments. The list of concepts, Σ, represents an intermediate step to work out the domains; this step will be discussed in the next subsection.

The use of synsets to identify concepts possibly brings about the drawback of word disambiguation. The problem of determining which one, out of a set of senses, is invoked in a textual context for a single term is not trivial, and specific techniques [55, 60, 61] have been developed to that purpose. Word disambiguation techniques usually rely on the analysis of the words that lie close to the token itself [61, 62]. Other approaches exploit queries on a knowledge base; a notable example exploits WordNet Domains and is discussed in [63]. As a matter of fact, word disambiguation methods suffer from both high computational complexity [60, 64] and the dependency on dedicated knowledge bases [65]. In this work, word disambiguation is implicitly obtained by completing the abstraction from concepts to domains.
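A toy sketch of the token-to-concept abstraction described above; the `SYNSETS` table is a hypothetical stand-in for the EuroWordNet lookup [58], and the eight-concept cut-off is the empirical threshold mentioned in the text:

```python
# Toy illustration of the token-to-concept abstraction (Section 4.2.1).
MAX_CONCEPTS = 8   # empirical threshold from the chapter

SYNSETS = {        # hypothetical token -> synset-identifier lookup
    "bank":  ["bank.n.01", "bank.n.02", "bank.v.01"],
    "river": ["river.n.01"],
    "run":   [f"run.x.{k:02d}" for k in range(12)],  # highly polysemous
}

def tokens_to_concepts(omega):
    """Build Σ as the union of the synsets of each token, discarding the
    tokens linked to more than MAX_CONCEPTS concepts."""
    sigma = set()
    for token in omega:
        concepts = SYNSETS.get(token, [])
        if 0 < len(concepts) <= MAX_CONCEPTS:
            sigma.update(concepts)
    return sigma

sigma = tokens_to_concepts(["bank", "river", "run"])
print(sorted(sigma))  # 'run' is dropped: it maps to 12 concepts
```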


| Language | Number of terms | Number of concepts |
|----------|-----------------|--------------------|
| English  | 120160          | 112641             |
| Italian  | 37194           | 44866              |
| Spanish  | 32166           | 30350              |
| French   | 18798           | 22745              |
| German   | 17099           | 15132              |

**Table 1.** EuroWordNet: supported languages and corresponding elements.

#### *4.2.2. From concepts to domains*

WordNet Domains [41] supports the abstraction from concepts to domains. A domain is a structure that gathers different synsets belonging to a common area of interest; thus a domain can connect to synsets that pertain to different syntactic categories. Conversely, one synset can be linked to multiple domains. Each domain groups meanings into homogeneous clusters; therefore, one can use the abstraction from concepts to domains to work out the topics that are actually addressed in the underlying set of tokens Ω. This can be done as follows:

**a.** identify the domains that can be associated to the concepts included in Σ;

**b.** for each concept *cl* ∈ Σ, extract the list of domains Θ*l* that WordNet Domains associates to that concept: Θ*l* = {*dj*; *j* = 1, …, *Nd,l*}, where *Nd,l* is the number of different domains in Θ*l*;

**c.** obtain the overall list of domains Θ as Θ*1* ∪ Θ*2* ∪ Θ*3* ∪ … ∪ Θ*Nc*, where *Nc* is the cardinality of Σ;

**d.** design a criterion to work out the foremost domains from Θ.

Different approaches can support the latter step. The implicit goal is to attain word disambiguation, i.e., to remove the ambiguity that may characterize single tokens when they are viewed individually. Thus, one should take advantage of the information obtained from a global analysis; the underlying hypothesis is that the actual topics can be worked out only by correlating the information provided by the single tokens. In the present work, that information is conveyed by the list of domains, Θ. The domain-selection algorithm picks out the domains that occur most frequently within the text. The procedure can be formalized as follows:

**a.** Create an array *F* with *Nd* elements, where *Nd* is the cardinality |Θ| of the set Θ = {*dj*; *j* = 1,..,*Nd*}.

**b.** Set each element of *F* to 0: *fj* = 0, *j* = 1,..,*Nd*.

**c.** For each *ti* ∈ Ω

**a.** identify the list of domains to which *ti* is linked: *J* = {*j* | *dj* linked to *ti*};

**b.** if |*J*| = 1

*fj* = *fj* + 1; *j* ∈ *J*

else if |*J*| > 1

*fj* = *fj* + 0.5; *j* ∈ *J*

The array *F* eventually measures the relevance of each domain *dj*. The algorithm evaluates the relevance of a domain by taking into account the intrinsic semantic properties of a token. Thus, the relative increment in the relevance of a domain is higher when a token can only be linked to one domain. The rationale behind this approach is that these special cases are not affected by ambiguities.

The array of relevancies, *F*, provides the input to the task designed to work out the most relevant topics and eventually generate the summary.

#### **4.3. Text Summarization**

The framework is designed to generate a summary by identifying, in the original text, the textual portions that most correlate with the topics addressed by the document. Two tasks should be completed to attain that goal: first, identifying the topics and, secondly, correlating sentences with the set of topics themselves.

The first subtask is accomplished by scanning the array of relevancies, *F*. In principle, the relevant topics should correspond to the domains having the highest scores in *F*. However, the distribution of relevancies in the array can play a crucial role, too. Figure 5 illustrates this aspect with two examples. Figure 5(a) refers to a case in which a fairly large gap separates a subset of (highly relevant) domains from the remaining domains. Conversely, Figure 5(b) depicts a case in which the most relevant domains cannot be sharply separated from the remaining domains. The latter case is more challenging, as it may correspond either to a text that deals with heterogeneous contents (e.g., the home page of an online newspaper) or to an ineffective characterization of the domains.

**Figure 5.** Two examples of arrays of domain relevancies.
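The relevance-scoring loop of Section 4.2.2 can be sketched as follows; the `TOKEN_DOMAINS` map is an illustrative stand-in for the combined EuroWordNet/WordNet Domains lookup:

```python
# Sketch of the domain-relevance scoring (array F, Section 4.2.2).
from collections import defaultdict

TOKEN_DOMAINS = {                            # hypothetical lookup table
    "goalkeeper": ["sport"],                 # unambiguous token
    "match":      ["sport", "economy"],      # ambiguous token
    "stadium":    ["sport"],
}

def domain_relevance(omega):
    """Array F: +1 to a domain when a token links only to it,
    +0.5 to each domain when a token links to several."""
    F = defaultdict(float)
    for token in omega:
        J = TOKEN_DOMAINS.get(token, [])
        if len(J) == 1:
            F[J[0]] += 1.0
        elif len(J) > 1:
            for d in J:
                F[d] += 0.5
    return dict(F)

F = domain_relevance(["goalkeeper", "match", "stadium"])
print(F)  # {'sport': 2.5, 'economy': 0.5}
```

The unambiguous tokens ("goalkeeper", "stadium") each contribute a full point to *sport*, while the ambiguous token ("match") spreads half a point over both candidate domains, as in the chapter's rationale.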

number of tokens linked to the relevant topics with respect to the total number of tokens

; *j* = 1,..,*Nw*}, where *Nw* is the cardinality of Φ.

A Semantic-Based Framework for Summarization and Page Segmentation in Web Mining

= {*tlq*; *q* = 1,..,*Ntl*}, where *Ntl* is the cardinali‐

http://dx.doi.org/10.5772/51178

89

; *l* = 1,..,*Ns*}, where *Ns* is the cardinality of Σ.

, Ω*<sup>l</sup>*

**b.** Create an array *R* with *Ns* elements; each element registers the relevance of the *l*-th sen‐

The most relevant sentences are obtained by ranking the array *R*. Actually the selection re‐ moves the sentences that are too short to be consistently evaluated. The eventual rank of the sentences is used to build the summary. In general, the summary will include all the senten‐

The DUC 2002 dataset [12] provided the experimental basis for the proposed framework. The dataset has been designed to test methodologies that address fully automatic multi-

**•** for each subject, an extractive summary (400 word) created by involving human partici‐

Thus, a summarization technique can be evaluated by comparing the outcome of the com‐

In this work, the DUC 2002 dataset supported two experimental sessions. The first session aimed at evaluating the ability of the proposed framework to generate an effective summary from the documents included in the dataset. The second session was designed to analyze the behavior of the framework in a typical scenario of Web mining: a text source obtained from a Web page that includes different contributions possibly addressing heterogeneous topics.

that compose the sentence. The procedure can be outlined as follows:

*= rl /*|Ω*<sup>l</sup>* |

The list of selected domains Φ = {*dj*

The list of tokens included in a sentence *sl*

∈ Σ

If the token can be linked to a domain in Φ

ces that achieved a relevance greater than a threshold.

document summarization. It is organized as follows:

**•** for each subject, from 5 to 10 different news about that event;

puter-driven process with that provided by the dataset (the ground truth).

The list of sentences Σ = {*sl*

**a.** Inputs:

ty of Ω*<sup>l</sup>* .

tence

*rl* = *rl* + 1

**•** 59 subjects;

pants.

**c.** For each sentence *sl*

For each token *tlq* ∈ Ω*<sup>l</sup>*

**d.** Normalize the elements of R: *rl*

**5. Experimental Results**

To overcome this potential issue, the proposed algorithm operates under the hypothesis that only a limited number of domains compose the subset of relevant topics. The rationale be‐ hind this approach is that a tool for content mining is expected to provide a concise descrip‐ tion of the Web page, whereas a lengthy list of topics would not help meet such a conciseness constraint. The objective of the algorithm therefore becomes to verify if the array *F* can highlight a limited subset of domains that are actually outstanding.

The algorithm operates as follows. First, a threshold α is used to set a reference value for the relevance score of a domain; as a result, all the domains in *F* that did not achieve the refer‐ ence value are discarded, i.e., they are considered not relevant. Then, a heuristic pruning procedure is used to further shrink the subset of candidate domains; the eventual goal –as anticipated above- is to work out a limited number of topics.

The selection procedure can be formalized as follows:


select as relevant all the domains from *d*1 to *d*<sup>m</sup>

**3.** Else

it is not possible to select relevant domains

The heuristic pruning procedure is applied only if the number of selected domains (i.e., the domains included in *F*\*) is larger than a threshold θ, which set an upper limit to the list of relevant topics. The heuristic procedure is designed to identify a cluster of relevant domains within the set *F*\*; to achieve this goal, the gap between consecutive domains is evaluated (the domains in *F*\* are provided in descending order according to the relevance score). The parameter χ sets the threshold over which a gap is considered significant. As anticipated, the latter procedure may also provide a void subset of relevant topics.

The eventual summary is obtained by picking out the sentences of the original text that most correlate with the relevant topics. To do so, the list of available sentences is sorted in order of relevance scores. Score values are worked out by considering the tokens that form each sentence: if a token can be related to any selected topic, then the relevance of the associate sentence increases. The eventual score of a sentence, finally, stems from normalizing the number of tokens linked to the relevant topics with respect to the total number of tokens that compose the sentence. The procedure can be outlined as follows:

**a.** Inputs:

remaining domains. The latter case is more challenging as it may correspond either to a text that deals with heterogeneous contents (e.g., the home page of an online newspaper) or to

To overcome this potential issue, the proposed algorithm operates under the hypothesis that only a limited number of domains compose the subset of relevant topics. The rationale behind this approach is that a tool for content mining is expected to provide a concise description of the Web page, whereas a lengthy list of topics would not help meet such a conciseness constraint. The objective of the algorithm therefore becomes to verify whether the array *F* can highlight a limited subset of domains that are actually outstanding.

The algorithm operates as follows. First, a threshold α is used to set a reference value for the relevance score of a domain; as a result, all the domains in *F* that did not achieve the reference value are discarded, i.e., they are considered not relevant. Then, a heuristic pruning procedure is used to further shrink the subset of candidate domains; the eventual goal, as anticipated above, is to work out a limited number of topics.

The selection procedure can be formalized as follows:

**a.** Sort *F* in descending order, so that *f*1 gives the score *r*1 of the most relevant domain

**b.** Obtain *F*\* by removing from *F* all the domains with relevance smaller than α *r*1

The heuristic pruning procedure is applied only if the number of selected domains (i.e., the domains included in *F*\*) is larger than a threshold θ, which sets an upper limit to the list of relevant topics. The heuristic procedure is designed to identify a cluster of relevant domains within the set *F*\*; to achieve this goal, the gap between consecutive domains is evaluated (the domains in *F*\* are provided in descending order according to the relevance score). The parameter χ sets the threshold over which a gap is considered significant. As anticipated, the latter procedure may also provide a void subset of relevant topics.

**a.** If the cardinality of *F*\* is smaller than or equal to θ, select as relevant all the domains in *F*\*

**b.** Else

**1.** Find the largest gap *gmn* between consecutive domains in *F*\*

**2.** If *gmn* is larger than χ and *m* is smaller than or equal to θ, select as relevant all the domains from *d*1 to *d*m

**3.** Else, it is not possible to select relevant domains, as the scores provide an ineffective characterization of the domains

The eventual summary is obtained by picking out the sentences of the original text that most correlate with the relevant topics. To do so, the list of available sentences is sorted in order of relevance scores. Score values are worked out by considering the tokens that form each sentence: if a token can be related to any selected topic, then the relevance of the associated sentence increases. The eventual score of a sentence, finally, stems from normalizing the cumulated count over the number of tokens in the sentence.
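The domain-selection steps above can be sketched in a few lines of Python; the function name and the default values of α, θ and χ are illustrative assumptions, since the chapter does not fix them:

```python
def select_domains(scores, alpha=0.5, theta=4, chi=0.1):
    """Select a concise subset of relevant domains from the array F.

    scores maps each domain to its relevance score; alpha, theta and chi
    play the roles described in the text. Default values are illustrative.
    """
    # a. Sort F in descending order of relevance
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return []
    r1 = ranked[0][1]
    # b. Obtain F* by discarding domains scoring below alpha * r1
    f_star = [(d, r) for d, r in ranked if r >= alpha * r1]
    # If F* is already concise enough, all of its domains are selected
    if len(f_star) <= theta:
        return [d for d, _ in f_star]
    # Heuristic pruning: locate the largest gap between consecutive scores
    gaps = [f_star[i][1] - f_star[i + 1][1] for i in range(len(f_star) - 1)]
    m = max(range(len(gaps)), key=gaps.__getitem__) + 1  # domains above the gap
    if gaps[m - 1] > chi and m <= theta:
        return [d for d, _ in f_star[:m]]
    return []  # void subset: no clear cluster of relevant domains emerges
```

Note that the void outcome is an explicit result, matching the case in which the relevance scores do not characterize the domains effectively.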

*Theory and Applications for Advanced Text Mining*

The scoring procedure involves the following quantities:

**•** The list of selected domains Φ = {*dj* ; *j* = 1,..,*Nw*}, where *Nw* is the cardinality of Φ.

**•** The list of sentences Σ = {*sl* ; *l* = 1,..,*Ns*}, where *Ns* is the cardinality of Σ.

**•** The list of tokens included in a sentence *sl*, Ω*l* = {*tlq* ; *q* = 1,..,*Ntl*}, where *Ntl* is the cardinality of Ω*l*.

**•** The array of sentence scores *R* = {*rl* ; *l* = 1,..,*Ns*}.

**a.** For each sentence *sl* ∈ Σ, set *rl* = 0

**b.** For each token *tlq* ∈ Ω*l*

**c.** If the token can be linked to a domain in Φ, then *rl* = *rl* + 1

**d.** Normalize the elements of *R*: *rl* = *rl* / |Ω*l*|

The most relevant sentences are obtained by ranking the array *R*. The selection also removes the sentences that are too short to be evaluated consistently. The eventual rank of the sentences is used to build the summary; in general, the summary includes all the sentences that achieved a relevance greater than a threshold.
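A minimal sketch of this scoring scheme follows; the dictionary lookup stands in for the semantic network that links tokens to domains, and the minimum-length cut-off is an illustrative assumption:

```python
def score_sentences(sentences, selected_domains, token_to_domains, min_len=3):
    """Rank sentences by the share of tokens linked to a selected domain.

    sentences holds one token list per sentence (the lists Omega_l);
    selected_domains is the list Phi; token_to_domains maps a token to the
    set of domains it evokes, standing in for the semantic network.
    Sentences shorter than min_len tokens are dropped as unreliable.
    Returns (sentence index, normalized score) pairs, best first.
    """
    phi = set(selected_domains)
    scored = []
    for idx, tokens in enumerate(sentences):
        if len(tokens) < min_len:
            continue  # too short to be evaluated consistently
        r = sum(1 for t in tokens if phi & token_to_domains.get(t, set()))
        scored.append((idx, r / len(tokens)))  # step d: normalize by |Omega_l|
    return sorted(scored, key=lambda p: p[1], reverse=True)
```

The summary is then obtained by keeping the top-ranked pairs whose score exceeds the chosen threshold.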

## **5. Experimental Results**

The DUC 2002 dataset [12] provided the experimental basis for the proposed framework. The dataset has been designed to test methodologies that address fully automatic multi-document summarization: the documents are grouped by subject, and human-generated reference summaries are provided for each subject.


Thus, a summarization technique can be evaluated by comparing the outcome of the com‐ puter-driven process with that provided by the dataset (the ground truth).

In this work, the DUC 2002 dataset supported two experimental sessions. The first session aimed at evaluating the ability of the proposed framework to generate an effective summary from the documents included in the dataset. The second session was designed to analyze the behavior of the framework in a typical scenario of Web mining: a text source obtained from a Web page that includes different contributions possibly addressing heterogeneous topics.

### **5.1. The first experimental session: summarization effectiveness**

To evaluate the method's ability to produce effective summaries, this session adopted the ROUGE software [13], which made it possible to measure the performance of the proposed approach (as per Section 4) on the DUC 2002 dataset.


ROUGE is a software package for automatic evaluation of summaries that has been widely used in recent years to assess the performance of summarization algorithms. The ROUGE tool actually supports different parameterizations; in the present work, ROUGE-1 has been implemented, thus involving 1-gram co-occurrences between the reference and the candi‐ date summarization results. Using DUC 2002 as a benchmark and ROUGE as the evaluation tool allowed a fair comparison between the present approach and other works already pub‐ lished in the literature.
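The 1-gram co-occurrence computation that underlies ROUGE-1 can be sketched as follows; this is a simplified illustration of the idea, not the ROUGE package itself, which was the tool actually used for the reported scores:

```python
from collections import Counter

def rouge_1(candidate_tokens, reference_tokens):
    """Clipped 1-gram overlap between a candidate and a reference summary.

    A simplified re-implementation of the ROUGE-1 idea, for illustration
    only. Returns (recall, precision, F-measure).
    """
    cand = Counter(candidate_tokens)
    ref = Counter(reference_tokens)
    # Each reference 1-gram can be matched at most as often as it occurs
    overlap = sum(min(cnt, cand[gram]) for gram, cnt in ref.items())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return recall, precision, f
```

Recall is thus driven by how much of the reference the candidate covers, while precision penalizes overly long candidates; the F-measure balances the two.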

Table 2 gives the results obtained by the proposed framework on the DUC 2002 dataset. The Table compares experiments tested under different configurations of the summarization al‐ gorithm; in particular, experimental set-ups differ in the number of sentences used to gener‐ ate the summary. The first column gives the number of most informative sentences extracted from the original text; the second, third, and fourth columns report on recall, pre‐ cision, and f-measure, respectively, as measured by ROUGE.



Table 2 shows that the methodology presented in this chapter attained results that compare favorably with those achieved by state-of-the-art algorithms [66] on DUC 2002. In this regard, one should consider that the best performance obtained on DUC 2002 is characterized by the following values [66]: recall = 0.47813, precision = 0.45779, F-measure = 0.46729. This confirmed the effectiveness of the underlying cognitive approach, mapping raw text into an abstract representation, where semantic domains identified the main topics addressed within each document. Numerical results point out that the highest F-measure was attained when the summarization algorithm picked out at least the 20 most relevant sentences in a text.

An additional set of experiments further analyzed the outcomes of the proposed approach. In this case, the goal was to understand whether the topic-selection criterion actually fit the criterion implicitly applied by human subjects when summarizing the texts. This involved the array, *F*, measuring the relevance of a set of domains (as per section 4.2.2); for each sub‐ ject included in DUC 2002, the array *F* was computed with respect to:

**•** the news linked to that subject;

**•** the corresponding summary provided by the dataset.


| Number of sentences | Recall | Precision | F-measure |
|---|---|---|---|
| | 0.3297 | 0.5523 | 0.4028 |
| | 0.4421 | 0.5747 | 0.4884 |
| | 0.5317 | 0.5563 | 0.5319 |
| | 0.5917 | 0.5126 | 0.5382 |
| | 0.6406 | 0.4765 | 0.5363 |

**Table 2.** The performance achieved by the proposed framework on the DUC 2002 dataset as assessed by ROUGE


Figure 6 gives a sample of the pair of arrays associated with one of the subjects in the DUC 2002 dataset; in the graph, light-grey lines are associated with the actual reference scores in the benchmark, whereas dark-grey lines refer to the relevance values worked out by the pro‐ posed method.

Statistical tools measured the consistency of the domain-selection process: chi-square test runs compared, for each subject, the pair of distributions obtained; the goal was to verify the null hypothesis, namely, that the two distributions came from the same population. The standard value of 0.05 was selected for the confidence level.
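The homogeneity check can be sketched with a hand-rolled chi-square statistic over a 2×N contingency table; this is an illustrative stdlib-only sketch, not the exact statistical tooling used for the experiments, and the decision follows from comparing the statistic against the tabulated critical value for the 0.05 level:

```python
def chi_square_homogeneity(obs_a, obs_b):
    """Chi-square statistic for a 2 x N contingency table.

    obs_a and obs_b are parallel lists of non-negative counts, e.g. the
    domain-relevance profiles of a full text and of its summary. Returns
    (statistic, degrees of freedom); the null hypothesis of a common
    underlying distribution is rejected when the statistic exceeds the
    tabulated critical value at the chosen level (0.05 in this chapter).
    """
    assert len(obs_a) == len(obs_b)
    total_a, total_b = sum(obs_a), sum(obs_b)
    grand = total_a + total_b
    stat, cols = 0.0, 0
    for a, b in zip(obs_a, obs_b):
        col = a + b
        if col == 0:
            continue  # empty columns carry no information
        expected_a = total_a * col / grand
        expected_b = total_b * col / grand
        stat += (a - expected_a) ** 2 / expected_a
        stat += (b - expected_b) ** 2 / expected_b
        cols += 1
    return stat, cols - 1
```

Identical profiles yield a statistic of zero, while profiles that disagree strongly push the statistic well above the critical value and lead to rejecting the null hypothesis.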

The results obtained with the chi-square tests showed that the null hypothesis could *not* be rejected in any of the 49 experiments involved (each subject in DUC 2002 corresponded to one experiment). This confirmed that the distributions of the relevant domains obtained from the whole text could not be distinguished from those obtained from the (human gener‐ ated) summaries in the DUC 2002 dataset.

**Figure 6.** Comparison between the relevance of domains (for the same subject of DUC 2002) in the DUC summary and in the summary provided by the proposed algorithm

### **5.2. The second experimental session: web mining**

The first experimental session proved that the framework could effectively tackle the summarization task (and eventually generate a proper summary) when the input was a news text that mainly dealt with a single event. A web page, however, often collects different textual resources, each addressing a specific, homogeneous set of topics. Hence, the second experimental session was designed to evaluate the ability of the proposed framework to identify the most informative subsections of a web page.

The experiments involved the DUC 2002 dataset and were organized as follows. A set of new documents was generated by assembling the news originally provided by DUC 2002. Each new document eventually included four news articles and covered four different topics. Then, the list of documents was processed by the proposed framework, which was expected, for each document, to select as the most relevant topics those chosen in the setup. Table 3 reports on the results of this experiment; each row represents a single document: the first column gives the topics actually addressed by the document, while the second column gives the topics proposed by the framework. The table reports in boldface the topics that the framework was not able to pinpoint.
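The assembly of such synthetic multi-topic documents can be sketched as follows; the function name and the random pairing policy are illustrative assumptions, as the chapter does not detail the exact set-up procedure:

```python
import random

def make_multi_topic_docs(news_by_topic, topics_per_doc=4, n_docs=10, seed=0):
    """Assemble synthetic multi-topic documents from single-topic news.

    news_by_topic maps a topic label to a list of news texts; each
    generated document concatenates one article from each of
    topics_per_doc distinct topics, and the chosen labels act as the
    ground truth the framework should recover.
    """
    rng = random.Random(seed)
    topics = sorted(news_by_topic)
    docs = []
    for _ in range(n_docs):
        chosen = rng.sample(topics, topics_per_doc)  # distinct topics
        body = "\n\n".join(rng.choice(news_by_topic[t]) for t in chosen)
        docs.append((sorted(chosen), body))
    return docs
```

Evaluation then reduces to checking, for each generated document, whether the framework's proposed topics match the stored ground-truth labels.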


| Actual Topics | Topics Proposed by the Framework |
|---|---|
| **Literature** / Military / Music / Politics | History / Military / Music / Politics |
| Literature / **Military** / Music / Politics | Buildings / Literature / Music / Politics |
| Literature / Military / **Music** / Politics | Literature / Military / Politics / Sociology |
| **Literature** / Military / Music / Politics | Biology / Military / Music / Politics |
| **Literature** / Military / **Music** / Politics | Military / Politics / School / Sociology |
| Astronomy / Economy / Music / **Sport** | Astronomy / Biology / Economy / Music |
| **Astronomy** / Music / Politics / **Sport** | Biology / Music / Politics / Town Planning |
| Economy / **Music** / Physics / **Sport** | Economy / Law / Physics / Transport |
| **Music** / Physics / Politics / **Sport** | Law / Physics / Politics / Transport |
| **Music** / Physics / Politics / Sport | Physics / Politics / Sport / Transport |

**Table 3.** Comparison between actual document topics and topics proposed by the framework

Experimental evidence confirmed that the proposed framework yielded satisfactory results in this experiment, too. In this regard, one should also take into account that:

**•** the relative length of the single news somewhat influenced the overall distribution of the topic relevances;

**•** in several cases, the real topics not identified by the framework as the most relevant (i.e., the topics in bold) had relevance scores very close to those characterizing the selected ones.

The dataset involved in the experiment was artificially generated to evaluate the effectiveness of the proposed framework in a scenario that resembles a "real world" case. Hence, a fair comparison with other methodologies cannot be proposed. However, Table 3 provides solid experimental evidence of the efficiency of the approach introduced in this research, as the 'artificial' web pages were composed by using the original news included in the DUC 2002 dataset. As a result, one can conclude that the performance attained by the framework, in terms of its ability to identify the relevant topics in a heterogeneous document, is very promising.

### **5.3. Web Page Segmentation**


The framework can analyze a web page according to two different strategies. The first strategy, identifying the most relevant topics, typically triggers further actions in ad‐ vanced web-content mining systems: gathering a short summary of the web page (pos‐ sibly a short summary for each main topic), page segmentation, graphic editing of the web page to favor readability.

**Figure 7.** An example of web page analysis supported by the proposed framework

Figure 7 and Figure 8 provide examples of this kind of application. In both cases, the web page included a main section that actually defined the addressed contents, together with other textual parts that did not convey relevant information. The framework supported web content mining by identifying the sentences that actually linked to the relevant topics. These sentences have been highlighted in Figure 7 and Figure 8.

The second strategy typically aims to support users who want to track selected topics. In this case, the goal is to identify the web-page sections that actually deal with the topics of interest. Figure 9 provides an example: the selected topic was 'pharmacy/medicine,' and the web page was the 'News' section of the publisher *InTech*. The figure shows that an advanced web content mining system could exploit the information provided by the framework to highlight the text parts that were considered correlated with the topic of interest.
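This topic-tracking strategy can be sketched as follows; once again the dictionary lookup stands in for the semantic network, and the relevance cut-off is an illustrative assumption:

```python
def highlight_for_topic(page_sentences, topic, token_to_domains, threshold=0.2):
    """Return indices of page sentences to highlight for a tracked topic.

    page_sentences holds one token list per sentence; token_to_domains
    maps a token to the set of domains it evokes, standing in for the
    semantic network; threshold is an illustrative relevance cut-off.
    """
    picked = []
    for idx, tokens in enumerate(page_sentences):
        if not tokens:
            continue
        hits = sum(1 for t in tokens if topic in token_to_domains.get(t, set()))
        if hits / len(tokens) >= threshold:
            picked.append(idx)
    return picked
```

The returned indices identify the page segments that a front end could highlight, as in the Figure 9 example.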

**Figure 8.** A second example of web page analysis supported by the proposed framework

**Figure 9.** Tracking a selected topic by using the proposed framework

## **6. Conclusions**

The research presented in this chapter introduces a framework that can effectively support advanced Web mining tools. The proposed system addresses the analysis of the textual data provided by a web page and exploits semantic networks to achieve multiple goals: 1) the identification of the most relevant topics; 2) the selection of the sentences that best correlate with a given topic; 3) the automatic summarization of a textual resource. The eventual framework exploits those functionalities to tackle two tasks at the same time: text summarization and page segmentation.

The semantic characterization of text is indeed a core aspect of the proposed methodology, which takes advantage of an abstract representation that expresses the informative content of the basic textual resource on a cognitive basis. The present approach, though, cannot be categorized under the Semantic Web area, as it does not rely on semantic information already embedded into the Web resources.

In the proposed methodology, semantic networks are used to characterize the content of a textual resource according to semantic domains, as opposed to a conventional bag of words. Experimental evidence proved that such an approach can yield a coarse-grained level of sense distinctions, which in turn favors the identification of the topics actually addressed in the Web page. In this regard, experimental results also showed that the system can emulate human assessors in evaluating the relevance of the single sentences that compose a text.

An interesting feature of the present work is that the page segmentation technique is based only on the analysis of the textual part of the Web resource. A future direction of this research can be the integration of the content-driven segmentation approach with conventional segmentation engines, which are more oriented toward the analysis of the inherent structure of the Web page. The resulting framework should be able to combine the outcomes of the two modules to improve the performance of the segmentation procedure.

Future works may indeed focus on the integration of semantic orientation approaches into the proposed framework. These techniques are becoming more and more important in the Web 2.0 scenario, where one may need the automatic analysis of fast-changing web elements like customer reviews and web reputation data. In this regard, the present framework may provide content-filtering features that support the selection of the data to be analyzed.

## **Author details**

Alessio Leoncini, Fabio Sangiacomo, Paolo Gastaldo and Rodolfo Zunino

Department of naval, electric, electronic and telecommunications engineering (DITEN), Uni‐ versity of Genoa, Genoa, Italy

[12] Document understanding conference. (2002). http://www-nlpir.nist.gov/projects/

A Semantic-Based Framework for Summarization and Page Segmentation in Web Mining

http://dx.doi.org/10.5772/51178

97

[13] Lin, C.Y. Rouge: A package for automatic evaluation of summaries. *Proceedings of the ACL-04 Workshop: Text Summarization Branches Out, Barcelona, Spain. 2004.*

[14] Etzioni, O. (1996). The world wide web: Quagmire or gold mine. *Communications of*

[15] Madria, S. K., Bhowmick, S. S., Ng, W. K., & Lim, E. P. (1999). Research issues in web data mining. *Proceedings of First International Conference on Data Warehousing and*

[17] Singh, B., & Singh, H. K. Web data mining research: a survey. *Proceedings of 2010 IEEE International Conference on Computational Intelligence and Computing Research, IC‐*

[18] Xu, G., Zhang, Y., & Li, L. (2011). Web Content Mining. *Web Mining and Social Net‐*

[19] Liu, B. (2005). Web content mining. *Proceedings of 14th International World Wide Web*

[20] Baumgartner, R., Gottlob, G., & Herzog, M. (2009). Scalable web data extraction for online market intelligence. *Proceedings of the VLDB Endowment*, 2(2), 1512-1523.

[21] Manne, S. (2011). A Novel Approach for Text Categorization of Unorganized data based with Information Extraction. *International Journal on Computer Science and Engi‐*

[22] Ntoulas, A., Najork, M., Manasse, M., & Fetterly, D. (2006). Detecting spam web pa‐ ges through content analysis. *Proceedings of the 15th international conference on World*

[23] Khribi, M. K., Jemni, M., & Nasraoui, O. (2009). Automatic Recommendations for E-Learning Personalization Based on Web Usage Mining Techniques and Information

[24] Maedche, A., & Staab, S. (2001). Ontology Learning for the Semantic Web. *IEEE Intel‐*

[25] Vallet, D., Castells, P., Fernandez, M., Mylonas, P., & Avrithis, Y. (2007). Personalized content retrieval. *context using ontological knowledge. IEEE Transactions on Circuits and*

[26] Hliaoutakis, A., Varelas, G., Voutsakis, E., Petrakis, E. G. M., & Milios, E. (2006). In‐ formation retrieval by semantic similarity. *International Journal on Semantic Web and*

duc/, (accessed 14 May 2012).

*Knowledge Discovery, DaWaK'99, Florence, Italy*.

*Conference, WWW'05,May 2005, Chiba, Japan.*

*Wide Web, WWW'06,Edinburgh, Scotland*.

Retrieval. *Educational Technology & Society*, 12(4), 30-42.

*Systems for Video Technology 2007*, 17(3), 336-346.

[16] Chakrabarti, S. (2000). Data mining for hypertext. *A tutorial survey*, 1.

*the ACM*, 39(11), 65-68.

*CIC'10. 2010.*

*working*, 6-71.

*neering*, 2846-2854.

*ligent Systems*, 16(2), 72-79.

*Information Systems*, 3(3), 55-73.

## **References**


**Author details**

**References**

versity of Genoa, Genoa, Italy

96 Theory and Applications for Advanced Text Mining Text Mining

*tions*, 2(1), 1-15.

*of the ACM*, 37(7), 30-40.

*ment Applications*, 3-165.

tion. *Norwell: Kluwer Academic Publisher*.

ISI ' June (2008). Taipei, Taiwan. 2008. , 08, 17-20.

11-20.

*can*.

Alessio Leoncini, Fabio Sangiacomo, Paolo Gastaldo and Rodolfo Zunino

Department of naval, electric, electronic and telecommunications engineering (DITEN), Uni‐

[1] Kosala, R., & Blockeel, H. (2000). Web mining research: A survey. *SIGKDD Explora‐*

[2] Gantz, J. F., Reinsel, D., Chute, C., Schlichting, W., Mcarthur, J., Minton, S., Xheneti, I., Toncheva, A., & Manfrediz, A. (2010). The Expanding Digital Universe: A Forecast

[3] Naghavi, M., & Sharifi, M. (2012). A Proposed Architecture for Continuous Web Monitoring Through Online Crawling of Blogs. *International Journal of UbiComp*, 3(1),

[4] Maes, P. (1994). Agents that reduce work and information overload. *Communications*

[5] Stumme, G., Hotho, A., & Berendt, B. (2006). Semantic Web Mining: State of the art

[6] Dai, Y., Kakkonen, T., & Sutinen, E. (2011). MinEDec: a Decision-Support Model That Combines Text-Mining Technologies with Two Competitive Intelligence Analysis Methods. *International Journal of Computer Information Systems and Industrial Manage‐*

[7] Thuraisingham, B.M. (2003). Web Data Mining: Technologies and their Applications

[8] Allan, J. (2002). Topic Detection and Tracking: Event-based Information Organiza‐

[9] Chen, H. Discovery of improvised explosive device content in the Dark Web. Pro‐ ceedings of IEEE International Conference on Intelligence and Security Informatics,

[10] Yu, S., Cai, D., Wen, J. R., & Ma, W. Y. (2003). Improving pseudo-relevance feedback in web information retrieval using web page segmentation. *Proceedings of the 12th In‐*

[11] Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The semantic web. *Scientific Ameri‐*

in Business Intelligence and Counter-terrorism. *Boca Raton: CRC Press*.

*ternational Conference on World Wide Web, WWW'03,New York, USA.*

and future directions. *Journal of Web Semantics*, 4(2), 124-143.

of Worldwide Information Growth Through. *Information and Data 2007.*, 1-21.


[27] Miller, G.A. (1995). WordNet: A Lexical Database for English. *Communications of the ACM*, 38(11), 39-41.

[28] Radev, D. R., Hovy, E., & McKeown, K. (2002). Introduction to the special issue on summarization. *Computational Linguistics*, 28(4), 399-408.

[29] Zipf, G. (1949). Human Behaviour and the Principle of Least-Effort. *Cambridge: Addison-Wesley*.

[30] Das, D., & Martins, A. F. T. (2007). A Survey on Automatic Text Summarization. *Engineering and Technology*, 4-192.

[31] Gupta, V., & Lehal, G. S. (2010). A Survey of Text Summarization Extractive Techniques. *Journal of Emerging Technologies in Web Intelligence*, 2(3), 258-268.

[32] Nenkova, A. (2005). Automatic text summarization of newswire: lessons learned from the document understanding conference. *Proceedings of the 20th national conference on Artificial intelligence, AAAI'05, Pittsburgh, USA.*

[33] García-Hernández, R. A., & Ledeneva, Y. (2009). Word Sequence Models for Single Text Summarization. *Proceedings of the Second International Conference on Advances in Computer-Human Interactions, ACHI'09*, 1-7, *Cancun, Mexico. Washington: IEEE Computer Society.*

[34] Hennig, L. (2009). Topic-based multi-document summarization with probabilistic latent semantic analysis. *Proceedings of the Recent Advances in Natural Language Processing Conference, RANLP-2009.*

[35] Svore, K., Vanderwende, L., & Burges, C. (2007). Enhancing Single-Document Summarization by Combining RankNet and Third-Party Sources. *Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL.*

[36] Hannah, M. E., Geetha, T. V., & Mukherjee, S. (2011). Automatic extractive text summarization based on fuzzy logic: a sentence oriented approach. *Proceedings of the Second international conference on Swarm, Evolutionary, and Memetic Computing, SEMCCO'11, Visakhapatnam, India. Berlin: Springer-Verlag.*

[37] Suanmali, L., Salim, N., & Binwahlan, M. S. (2009). Fuzzy Logic Based Method for Improving Text Summarization. *International Journal of Computer Science and Information Security*, 2(1), 65-70.

[38] Barzilay, R., & Elhadad, M. (1997). Using Lexical Chains for Text Summarization. *Proceedings of the ACL Workshop on Intelligent Scalable Text Summarization.*

[39] Zamanifar, A., Minaei-Bidgoli, B., & Sharifi, M. (2008). A New Hybrid Farsi Text Summarization Technique Based on Term Co-Occurrence and Conceptual Property of Text. *Proceedings of Ninth ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing, SNPD'08, Phuket, Thailand.*

[40] Erkan, G., & Radev, D.R. (2004). LexRank: graph-based lexical centrality as salience in text summarization. *Journal of Artificial Intelligence Research*, 22(1), 457-479.

[41] Magnini, B., & Cavaglià, G. (2000). Integrating Subject Field Codes into WordNet. *Gavrilidou M, Crayannis G, Markantonatu S, Piperidis S, Stainhaouer G. (eds.) Proceedings of the Second International Conference on Language Resources and Evaluation, LREC-2000, 31 May-2 June 2000, Athens, Greece.*

[42] Yin, X., & Lee, W. S. (2004). Using link analysis to improve layout on mobile devices. *Proceedings of the Thirteenth International World Wide Web Conference, WWW'04, New York, USA.*

[43] Yin, X., & Lee, W. S. (2005). Understanding the function of web elements for mobile content delivery using random walk models. *Special interest tracks and posters of the 14th international conference on World Wide Web, WWW'05, Chiba, Japan.*

[44] Cai, D., Yu, S., Wen, J. R., & Ma, W. Y. (2003). Extracting Content Structure for Web Pages based on Visual Representation. *Proceedings of the 5th Asia-Pacific web conference on Web technologies and applications, APWeb'03, Xian, China. Berlin: Springer-Verlag.*

[45] Cai, D., Yu, S., Wen, J. R., & Ma, W. Y. (2004). Block-based web search. *Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR'04, Sheffield, UK.*

[46] Ahmadi, H., & Kong, J. (2008). Efficient web browsing on small screens. *Proceedings of the working conference on Advanced visual interfaces, AVI'08, Napoli, Italy.*

[47] Burget, R. (2007). Automatic document structure detection for data integration. *Proceedings of the 10th international conference on Business information systems, BIS'07, Poznan, Poland.*

[48] Burget, R., & Rudolfova, I. (2009). Web page element classification based on visual features. *Proceedings of the First Asian conference on Intelligent Information and Database Systems, ACIIDS'09, Dong Hoi, Quang Binh, Vietnam.*

[49] Milic-Frayling, N., & Sommerer, R. (2002). SmartView: Flexible viewing of web page contents. *Poster Proceedings of the Eleventh International World Wide Web Conference, WWW'02, Honolulu, USA.*

[50] Kohlschütter, C., & Nejdl, W. (2008). A densitometric approach to web page segmentation. *Proceedings of the 17th ACM conference on Information and knowledge management, CIKM'08, Napa Valley, USA.*

[51] Cao, J., Mao, B., & Luo, J. (2010). A segmentation method for web page analysis using shrinking and dividing. *International Journal of Parallel, Emergent and Distributed Systems*, 25(2), 93-104.

[52] Borodin, Y., Mahmud, J., Ramakrishnan, I. V., & Stent, A. (2007). The HearSay non-visual web browser. *Proceedings of the 2007 international cross-disciplinary conference on Web accessibility, W4A'07, Banff, Canada.*

[53] Mahmud, J. U., Borodin, Y., & Ramakrishnan, I. V. (2007). CSurf: a context-driven non-visual web-browser. *Proceedings of the 16th international conference on World Wide Web, WWW'07, Banff, Canada.*

**Chapter 5**

**Ontology Learning Using Word Net Lexical Expansion and Text Mining**

Hiep Luong, Susan Gauch and Qiang Wang

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/51141

> © 2012 Luong et al.; licensee InTech. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## **1. Introduction**

In knowledge management systems, ontologies play an important role as a backbone for providing and accessing knowledge sources. They are largely used in the next generation of the Semantic Web, which focuses on supporting better cooperation between humans and machines [2]. Since manual ontology construction is costly, time-consuming, error-prone, and inflexible to change, it is hoped that an automated ontology learning process will result in more effective and more efficient ontology construction, and will also be able to create ontologies that better match a specific application [20]. Ontology learning has recently become a major research focus, with the goal of facilitating ontology construction by decreasing the amount of effort required to produce an ontology for a new domain. However, most current approaches deal with narrowly-defined specific tasks or a single part of the ontology learning process rather than providing complete support to users. Few studies attempt to automate the entire ontology learning process, from collecting domain-specific literature and filtering out documents irrelevant to the domain, to text mining to build new ontologies or enrich existing ones.

The World Wide Web is a rich source of documents that is useful for ontology learning. However, because it contains so much information of varying quality, covering a huge range of topics, it is important to develop document discovery mechanisms based on intelligent techniques such as focused crawling [7] to make the collection process easier for a new domain. Even so, given the huge number of retrieved documents, an automatic mechanism rather than domain experts is still required to separate out the documents that are truly relevant to the domain of interest. Text classification techniques can be used to perform this task.
