**Analysis and Learning Frameworks for Large-Scale Data Mining**

Kohsuke Yanai and Toshihiko Yanase

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/51713

**1. Introduction**



Recently, many companies and organizations have been trying to analyze large amounts of business data and to leverage the extracted knowledge to improve their operations. This chapter discusses techniques for processing large-scale data, and we propose two computing frameworks for large-scale data mining:


The first framework is for the analysis phase, in which we find out how to utilize business data through trial and error. The proposed framework stores tree-structured data using a vertical partitioning technique and uses Hadoop MapReduce for distributed computing. These methods reduce the disk I/O load and avoid computationally intensive processing, such as grouping and combining of records.
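As a rough illustration of the vertical partitioning idea, the following Python sketch (hypothetical in-memory data; the chapter's framework does this on disk under Hadoop MapReduce) keeps each attribute in its own column store, so a scan that touches one attribute reads only that column:

```python
# Illustrative sketch of vertical partitioning (hypothetical records;
# not the framework's actual storage layer).
records = [
    {"server": "serv001", "cpu": 92, "mem": 532},
    {"server": "serv001", "cpu": 76, "mem": 235},
    {"server": "serv002", "cpu": 13, "mem": 121},
]

# Vertical partitioning: one column store per attribute, so a query that
# needs only "cpu" never touches the "server" or "mem" storage.
columns = {key: [r[key] for r in records] for key in records[0]}

avg_cpu = sum(columns["cpu"]) / len(columns["cpu"])  # reads one column only
```

On disk, each column would be a separate partition file, so the I/O saving grows with the number of attributes a query can skip.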

The second framework is for the model learning phase, in which we create predictive models using machine learning algorithms. The proposed framework is another implementation of MapReduce, designed to ease the parallelization of machine learning algorithms and to reduce the calculation overhead of iterative procedures. The framework minimizes the frequency of thread creation and termination, and keeps feature vectors in local memory and on local disk during iteration.
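The following is a minimal single-machine sketch of that idea in Python (an assumption-laden illustration, not the framework's actual MapReduce implementation): worker threads are created once, each keeps its partition of the data cached in local memory, and the same threads serve every iteration.

```python
# Sketch only: persistent workers with locally cached partitions, reused
# across iterations instead of re-spawning threads and re-reading data.
import threading
import queue

class PersistentWorker(threading.Thread):
    """A worker created once; its data partition stays cached across iterations."""
    def __init__(self, partition):
        super().__init__()
        self.partition = partition   # feature vectors held in local memory
        self.tasks = queue.Queue()
        self.results = queue.Queue()

    def run(self):
        while True:
            map_fn = self.tasks.get()
            if map_fn is None:       # shutdown signal, sent once at the very end
                return
            self.results.put([map_fn(x) for x in self.partition])

def iterative_mapreduce(data, map_fn_factory, update_fn, iterations, n_workers=2):
    # Partition once; threads are started once and reused for every iteration.
    workers = [PersistentWorker(data[i::n_workers]) for i in range(n_workers)]
    for w in workers:
        w.start()
    state = 0.0
    for _ in range(iterations):
        map_fn = map_fn_factory(state)              # "map" uses the current model
        for w in workers:
            w.tasks.put(map_fn)
        mapped = [y for w in workers for y in w.results.get()]
        state = update_fn(state, mapped)            # "reduce" updates the model
    for w in workers:
        w.tasks.put(None)                           # terminate threads exactly once
        w.join()
    return state

# Toy usage: a damped iterative update that converges to the data mean.
final = iterative_mapreduce(
    [1.0, 2.0, 3.0, 4.0],
    map_fn_factory=lambda s: (lambda x: x - s),                      # residuals
    update_fn=lambda s, mapped: s + 0.5 * sum(mapped) / len(mapped),
    iterations=20,
)
```

The design point mirrors the framework's goal: thread startup and data loading happen once, while each iteration pays only for the map and update work.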

We start with a discussion of the process of data utilization in enterprises and organizations, illustrated in Figure 1. We suppose the data utilization process consists of the following phases:


©2012 Yanai and Yanase, licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

	- 1. Pre-processing phase
	- 2. Analysis phase
	- 3. Model learning phase
	- 4. Model application phase





**Figure 1.** Process of data utilization.

#### **1.1. Pre-processing phase**

Pre-processing phase consists of 2 steps:

**Step 1-1** Cleansing

**Step 1-2** Structuring

Firstly Step 1-1 removes incorrect values and secondly Step 1-2 transforms table-format data into tree-structured data. This pre-processing phase combines raw data from multiple data sources and creates tree-structured data in which records from multiple data sources are "joined".
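As an illustration of Step 1-2, the sketch below (Python, with a simplified, hypothetical server-log schema along the lines of Figure 2) groups flat table rows into a nested site, server, and detail-record tree in a single pass:

```python
# Sketch of Step 1-2 (structuring): flat table rows, already cleansed in
# Step 1-1, are grouped into a site -> server -> detail-record tree.
# The schema is a simplified, hypothetical version of Figure 2.
rows = [
    # (site, server, ave_cpu, date, time, cpu, mem)
    ("site001", "serv001", 84.0, "02/05", "10:10", 92, 532),
    ("site001", "serv001", 84.0, "02/05", "10:20", 76, 235),
    ("site001", "serv002", 12.6, "02/05", "15:30", 13, 121),
]

tree = {}
for site, server, ave_cpu, date, time, cpu, mem in rows:
    servers = tree.setdefault(site, {})
    entry = servers.setdefault(server, {"ave-cpu": ave_cpu, "records": []})
    entry["records"].append({"date": date, "time": time, "cpu": cpu, "mem": mem})
```

Because the grouping and joining happen once here, the analysis phase can read the tree directly.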

Figure 2 illustrates an example of tree-structured server logs, in which the log data are grouped by each site at the top level. Site information consists of site ID (e.g. "site001") and a list of server information. Server information consists of server ID (e.g. "serv001"), average CPU usage (e.g. "ave-cpu:84.0%") and a list of detail records. Furthermore, a detail record consists of date (e.g. "02/05"), time (e.g. "10:20"), CPU usage (e.g. "cpu:92%") and memory usage (e.g. "mem:532MB").

```
[(site001
  [(serv001 ave-cpu:84.0%
    [(02/05 10:10 cpu:92% mem:532MB)
     (02/05 10:20 cpu:76% mem:235MB)])
   (serv002 ave-cpu:12.6%
    [(02/05 15:30 cpu:13% mem:121MB)
     (02/05 15:40 cpu:15% mem:142MB)
     (02/05 15:50 cpu:10% mem:140MB)])])
 (site021
  [(serv001 ave-cpu:50.0%
    [(02/05 11:40 cpu:88% mem:889MB)
     (02/05 11:50 cpu:12% mem:254MB)])])]
```
**Figure 2.** Example of tree-structured data.

If we store the data in table format, data grouping and data combining are repeatedly computed in the analysis phase, which comes after the pre-processing phase. Data grouping and data combining correspond to "group-by" and "join" in SQL, respectively. Note that the tree structure keeps the data grouped and joined. In general, when the data size is large, the computation cost of data grouping and data combining becomes intensive. Therefore, we store data in tree-structured format so that we avoid repeating this computationally intensive processing.
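To illustrate, once the data are in tree form, a per-server statistic is a plain traversal; the Python sketch below (with a hand-built tree mirroring Figure 2) computes peak memory usage per server without any SQL-style grouping or joining:

```python
# The tree keeps records pre-grouped under their server and site, so a
# per-server statistic is a direct traversal -- no "group-by" or "join"
# pass over a flat table is needed. (Hand-built tree mirroring Figure 2.)
tree = {
    "site001": {
        "serv001": {"ave-cpu": 84.0,
                    "records": [{"cpu": 92, "mem": 532},
                                {"cpu": 76, "mem": 235}]},
        "serv002": {"ave-cpu": 12.6,
                    "records": [{"cpu": 13, "mem": 121},
                                {"cpu": 15, "mem": 142},
                                {"cpu": 10, "mem": 140}]},
    },
}

peak_mem = {
    (site, server): max(r["mem"] for r in info["records"])
    for site, servers in tree.items()
    for server, info in servers.items()
}
```

With a flat table, the same computation would first have to re-group every detail record by site and server on each analysis run.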

#### **1.2. Analysis phase**



Analysis phase finds out how to utilize the data through trial and error. In most situations the purpose of data analysis is not clear at an early stage of the data utilization process. This is the reason why this early phase needs trial-and-error processes.

As described in Figure 1, the analysis phase consists of 3 independent steps:

**Step 2-1** Attribute appending

**Step 2-2** Aggregation

**Step 2-3** Extraction

This phase iterates Step 2-1 and Step 2-2. The purpose of the iterative process is

	- To obtain statistical information and trends,
	- To decide what kind of predictive model should be generated, and
	- To decide which attributes should be used to calculate feature vectors of the predictive model.

**2. Architecture**

We propose an architecture for large-scale data mining. Figure 3 illustrates our architecture.

**Figure 3.** Architecture.
