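| Phase | Data | Computation | Approach |
|---|---|---|---|
| 1 Pre-processing | Table format | I/O-intensive | Hadoop MR |
| 2 Analysis | Tree-structured | I/O-intensive | Hadoop MR + vertical partitioning + tree-structured data store |
| 3 Model learning | Vectors | CPU-intensive | Iterative MR + Hadoop DFS |
| 4 Model application | Stream | Real time | Event driven software |

**Table 1.** Approach of each phase. MR: MapReduce, DFS: Distributed File System.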


Step 2-1 appends new attributes to tree-structured data by combining existing attributes. We assume that repeated attribute appending increases the data size by a factor of 5 to 20. Step 2-2, on the other hand, calculates statistics of attributes and generates charts that help the analyst grasp the characteristics of the data. The calculations of Step 2-2 include means, variances, histograms and cross tabulations.

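An instance of the iterative process consisting of attribute appending and aggregation is the following.

1. Calculate frequencies of CPU usage (Step 2-2)
2. Append a new attribute "average memory usage for each server" (Step 2-1)
3. Calculate standard deviation of a new attribute "average memory usage" (Step 2-2)
4. Append a new attribute "difference of memory usage from its average" (Step 2-1)
5. ...
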
We usually append more than 10 new attributes to the raw data. Attribute appending increases the value and visibility of the data, and eases the trial-and-error process of finding how to utilize the data.
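The iterative process of attribute appending and aggregation described above can be sketched in a few lines of Python. This is only an illustration on in-memory toy data: the server names, the record layout `(cpu, memory)` and all variable names are assumptions, not part of the proposed framework.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical toy data: each server holds detail records of (CPU %, memory MB).
servers = {
    "serv001": [(80, 500), (90, 540)],
    "serv002": [(10, 200), (30, 260)],
}

# Step 2-2: frequencies of CPU usage, counted in bins of width 10.
cpu_freq = Counter((cpu // 10) * 10 for recs in servers.values() for cpu, _ in recs)

# Step 2-1: append the attribute "average memory usage for each server".
avg_mem = {name: mean(mem for _, mem in recs) for name, recs in servers.items()}

# Step 2-2: standard deviation of the newly appended attribute.
avg_mem_sd = pstdev(avg_mem.values())

# Step 2-1: append the attribute "difference of memory usage from its average".
mem_diff = {
    name: [mem - avg_mem[name] for _, mem in recs]
    for name, recs in servers.items()
}
```

Each appended attribute becomes the input of a later aggregation, which is why the number of attributes grows with every round.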

After the iterative process of attribute appending and aggregation, Step 2-3 extracts feature vectors from the tree-structured data. These vectors are used in the model learning phase.

### **1.3. Model learning phase**

The model learning phase generates predictive models that are used in the real-world operations of enterprises and organizations. It uses machine learning techniques such as SVM (support vector machine) [1] and K-Means clustering [2].

For instance, this phase generates a model that predicts when hardware troubles will happen in an IT system. The input of the model is the history of CPU usage and memory usage. The output is a date and time.

### **1.4. Model application phase**

The model application phase applies the predictive models obtained from the model learning phase to actual business operations. We emphasize that the input data arrives in real time.

As described in Figure 1, the model application phase consists of 2 steps:

**Step 4-1** Extraction

**Step 4-2** Classification

Step 4-1 extracts a feature vector from real-time data. The computation of this step is usually similar to that of Step 2-3. Step 4-2 attaches a predictive label to the input data by using the predictive models. For example, the label represents the date and time of a hardware trouble. The label is used in business operations as an event.
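A minimal sketch of the two steps as an event handler is given below. Everything in it is hypothetical: the event fields `cpu` and `mem`, the function names, and in particular `toy_classifier`, which is a fixed threshold rule standing in for a trained predictive model.

```python
from typing import Callable, Sequence

def extract_features(event: dict) -> list[float]:
    """Step 4-1: turn one incoming record into a feature vector."""
    return [float(event["cpu"]), float(event["mem"])]

def make_handler(classify: Callable[[Sequence[float]], str]):
    """Return an event handler that runs Step 4-1 and then Step 4-2."""
    def on_event(event: dict) -> dict:
        features = extract_features(event)
        # Step 4-2: attach the predictive label to the input data.
        return {**event, "label": classify(features)}
    return on_event

def toy_classifier(features: Sequence[float]) -> str:
    # Stand-in for a trained model: a fixed rule, not a learned one.
    cpu, mem = features
    return "trouble-expected" if cpu > 90 and mem > 90 else "normal"

handler = make_handler(toy_classifier)
result = handler({"cpu": 97, "mem": 95})
```

The handler is called once per arriving record, so the labelled record can be forwarded to business operations as an event without waiting for a batch.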

## **2. Architecture**


We propose an architecture for large-scale data mining. Figure 3 illustrates the architecture.

**Figure 3.** Architecture.



As discussed in Section 1, we suppose four phases: pre-processing, analysis, model learning and model application. In the pre-processing phase, data is in table format and the computation is I/O-intensive. Hadoop MapReduce [3] is appropriate for pre-processing from the viewpoint of data format and I/O load reduction. Hadoop MapReduce is a distributed computing platform based on the MapReduce computation model [4, 5]. It consists of three computation phases: Map, Combine and Reduce. It parallelizes disk I/O by reading and writing data in parallel on Hadoop DFS (Distributed File System). For details of Hadoop, refer to the literature [4, 5].
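The roles of the Map, Combine and Reduce phases can be illustrated with a small in-memory simulation. The sketch below does not use the Hadoop API; the function names, the `(server, cpu)` record layout and the two input splits are assumptions made for the example, which computes the average CPU usage per server.

```python
from collections import defaultdict

def map_phase(record):
    # Map: emit one (key, value) pair per input record.
    server, cpu = record
    yield server, (cpu, 1)

def combine(pairs):
    # Combine: merge values that share a key before they leave the mapper.
    acc = defaultdict(lambda: (0, 0))
    for key, (total, count) in pairs:
        t, c = acc[key]
        acc[key] = (t + total, c + count)
    return list(acc.items())

def reduce_phase(all_pairs):
    # Reduce: merge the partial sums of every split and finish the average.
    merged = dict(combine(all_pairs))
    return {key: total / count for key, (total, count) in merged.items()}

# Two input splits, as if they were read in parallel from the file system.
splits = [
    [("serv001", 80), ("serv002", 10)],
    [("serv001", 90), ("serv002", 30)],
]
combined = [combine(p for rec in split for p in map_phase(rec)) for split in splits]
averages = reduce_phase(p for part in combined for p in part)
```

Because the Combine step forwards only one `(sum, count)` pair per key and split, less data has to be moved to the Reduce step than if every record were forwarded.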

We develop a cleansing program and a structuring program that run on Hadoop MapReduce. Both programs are general-purpose, which means that we can use the same programs for all cases. The cleansing program reads cleansing rules and the structuring program reads structuring rules. Users write these rules as XML files, and the programs run by following them.

In the analysis phase, data is tree-structured and the computation is I/O-intensive. In addition, the number of attributes is large since this phase repeatedly appends new attributes. Therefore, the key approach is again the reduction of I/O load. In Section 3 we propose a method combining Hadoop MapReduce, vertical partitioning and a data store in tree-structured format. This phase also needs a chart viewer that displays the aggregation results of Step 2-2.

On the other hand, the I/O load in the model learning phase is tolerable, because the input of machine learning algorithms is feature vectors, whose size is much smaller than that of the raw data. The computation in the model learning phase is CPU-intensive since machine learning algorithms include iterative calculations for optimization. Section 4 proposes another MapReduce framework for parallel machine learning, in which iterative algorithms are easily parallelized.

In the model application phase, data is a stream and the computation should be performed in real time. Therefore, we develop event-driven software that runs whenever input data arrives. The software includes a library of classification functions. It reads a predictive model written in PMML [6], an XML-based language for model description.

We summarize our approaches in Table 1. The rest of this paper focuses on the frameworks for the analysis phase and the model learning phase, because new techniques are necessary for efficient computation in these two phases, whereas systems for pre-processing and model application are easily implemented by combining existing technologies.

## **3. Tree-structured data analysis framework**

### **3.1. Method**

This section proposes a computing framework that performs data analysis on a large amount of tree-structured data. As discussed in Section 1, the early stage of the data utilization process requires trial and error, in which we repeatedly append new attributes and calculate statistics of attributes. As attribute appending is repeated, the number of attributes increases. Therefore, scalability with respect to the number of attributes is as important as scalability with respect to the number of records.

The key approaches of the proposed framework are:

1. To partition tree-structured data column-wise and store the partitioned data in separate files corresponding to each attribute, and

2. To use the Hadoop MapReduce framework for distributed computing.

Approach (1) is referred to as "vertical partitioning." It is well known that vertical partitioning of **table format data** is efficient [7]. We propose vertical partitioning of tree-structured data. Figure 4 illustrates the method. The proposed framework partitions tree-structured data into multiple lists so that each list includes the values belonging to one attribute. The framework then stores the list of each attribute in a corresponding file. Note that the file of "Average CPU usage" in Figure 4 includes only values belonging to the "Average CPU usage" attribute, and does not include values of any other attribute.

The framework reads only the 1-3 attributes required in a data analysis out of 10-30 attributes, and restores tree-structured data that consists of only the required attributes. In addition, when appending a new attribute, the framework writes only the newly created attribute into files. Without vertical partitioning, it would have to write all of the existing attributes into files. Thus the proposed method reduces the amount of input data as well as the amount of output data.
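The I/O saving of storing one file per attribute can be illustrated as follows. The sketch uses flat lists and JSON files in a temporary directory purely for brevity. The file layout, the helper names `write_attribute` and `read_attributes`, and the attribute names are assumptions and do not reflect the storage format of the proposed framework.

```python
import json
import tempfile
from pathlib import Path

def write_attribute(root: Path, name: str, values: list) -> None:
    # One file per attribute: appending an attribute never rewrites the others.
    (root / f"{name}.json").write_text(json.dumps(values))

def read_attributes(root: Path, names: list[str]) -> dict:
    # Only the files of the requested attributes are opened.
    return {n: json.loads((root / f"{n}.json").read_text()) for n in names}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_attribute(root, "cpu", [80, 90, 10, 30])
    write_attribute(root, "mem", [500, 540, 200, 260])
    write_attribute(root, "disk", [1, 2, 3, 4])

    # The analysis needs one attribute, so one of the three files is read.
    cols = read_attributes(root, ["cpu"])
    avg_cpu = sum(cols["cpu"]) / len(cols["cpu"])

    # Appending a derived attribute writes exactly one new file.
    write_attribute(root, "cpu_minus_avg", [c - avg_cpu for c in cols["cpu"]])
    files_after = sorted(p.name for p in root.iterdir())
```

With a row-oriented layout, the same two operations would read and rewrite every attribute of every record.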

**Figure 4.** Vertical partitioning of tree-structured data.

### **3.2. Implementation**


The data model of the proposed framework is a recursively-defined tuple.

**Tuple:** Combination of lists and values. e.g. ("serv002", 13, [(15, 10)]).

**List:** Sequence of tuples whose types are the same. e.g. [("serv001" 4.0) ("serv002" 2.6)].

**Value:** Scalar, vector, matrix or string. e.g. "532MB".

A round bracket () represents a tuple while a square bracket [] represents a list. In this paper, elements of a tuple and a list are separated by white spaces.

Figure 5 shows the pseudo code of the partitioning algorithm. The algorithm partitions tree-structured data into recursive lists by running the function "Partition" recursively. Each list generated by the algorithm consists of values belonging to the same attribute.

Similarly, Figure 6 shows the pseudo code of the restoring algorithm. The algorithm restores tree-structured data from divided attribute data. An example of input for the algorithm is shown in Figure 7. S is a trimmed schema that excludes the attributes unused in the analysis computation. D is generated by replacing the attribute names in the trimmed schema with the recursive lists stored in the attribute files.


```
Partition(S, D) {
  // atom: a single value forms a one-element column list
  if S is atom then return [D]
  if S is list then {
    L = []
    foreach di in D:
      append Partition(SOE(S), di) to L
    return transpose of L
  }
  // tuple: concatenate the column lists of all components
  if S is tuple then {
    L = []
    foreach (si, di) in (S, D):
      L = L + Partition(si, di)
    return L
  }
}
```
**Figure 5.** Pseudo code of partitioning algorithm. S is schema information and D is tree-structured data. The function SOE returns schema of an element of a list.
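The pseudo code of Figure 5 can be turned into a runnable Python sketch under a few assumptions that are not in the original: an atom schema is encoded as a string (the attribute name), a list schema as a one-element Python list holding the element schema, and a tuple schema as a Python tuple. The helper `width` and the handling of empty lists are additions for robustness.

```python
def is_atom(schema):
    return isinstance(schema, str)

def soe(schema):
    # Schema Of Element: a list schema is written as a one-element list.
    return schema[0]

def width(schema):
    # Number of attributes (atoms) below this schema node.
    if is_atom(schema):
        return 1
    if isinstance(schema, list):
        return width(soe(schema))
    return sum(width(s) for s in schema)

def partition(schema, data):
    if is_atom(schema):
        return [data]
    if isinstance(schema, list):
        rows = [partition(soe(schema), d) for d in data]
        if not rows:  # empty list: keep one empty column per attribute
            return [[] for _ in range(width(schema))]
        return [list(col) for col in zip(*rows)]  # transpose of L
    columns = []  # tuple: concatenate the columns of each component
    for s, d in zip(schema, data):
        columns += partition(s, d)
    return columns

schema = [("server", [("mem",)])]
data = [("serv001", [(532,), (235,)]), ("serv002", [(889,)])]
columns = partition(schema, data)
```

Here `columns[0]` holds the server names and `columns[1]` holds the memory values as a nested list with one inner list per server, so each column could be written to its own attribute file.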


```
Restore(S, D, d=0) {
  // atom: the stored recursive list is returned as it is
  if S is atom then return D
  if S is list then {
    L = []
    foreach di in D:
      append Restore(SOE(S), di, d+1) to L
    return transpose L with depth d
  }
  if S is tuple then {
    L = []
    foreach (si, di) in (S, D):
      append Restore(si, di, d) to L
    return L
  }
}
```
**Figure 6.** Pseudo code of restoring algorithm. An example of the input is shown in Figure 7. The function SOE returns schema of an element of a list.
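The idea of restoring can also be sketched in Python. Note that the sketch below is not a transcription of Figure 6: instead of the depth-indexed transpose, it takes a flat list of attribute columns in schema order and rebuilds the tree from them, which is a simpler formulation of the same inverse operation. The schema encoding (strings for atoms, one-element lists for lists, Python tuples for tuples) and the helper `width` are assumptions.

```python
def is_atom(schema):
    return isinstance(schema, str)

def width(schema):
    # Number of attribute columns below this schema node.
    if is_atom(schema):
        return 1
    if isinstance(schema, list):
        return width(schema[0])
    return sum(width(s) for s in schema)

def restore(schema, columns):
    # Rebuild tree-structured data from per-attribute columns in schema order.
    if is_atom(schema):
        return columns[0]
    if isinstance(schema, list):
        n = len(columns[0])  # every column under a list node has one entry per element
        return [restore(schema[0], [col[i] for col in columns]) for i in range(n)]
    out, pos = [], 0
    for s in schema:  # tuple: hand each component its own slice of columns
        w = width(s)
        out.append(restore(s, columns[pos:pos + w]))
        pos += w
    return tuple(out)

server_col = ["serv001", "serv002"]
mem_col = [[532, 235], [889]]

full = restore([("server", [("mem",)])], [server_col, mem_col])
# Trimmed schema: only the attribute needed for the analysis is restored.
trimmed = restore([([("mem",)],)], [mem_col])
```

The second call shows the effect of a trimmed schema: the server-name column is never touched, yet the nesting of the memory values is preserved.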

```
S: [(
     [("Average CPU usage"
      [("Memory Usage")])])]
D: [(
    [([[ave-cpu:84.0% ave-cpu:12.6%] [ave-cpu:50.0%]])
     [([[[mem:532MB mem:235MB] [mem:121MB ...]] [[mem:889MB mem:254MB]]])])]
```
**Figure 7.** Example of input of the restoring algorithm.

We implemented the partitioning algorithm and the restoring algorithm in Gauche, an implementation of the Scheme programming language. Users implement programs for attribute appending and aggregation in Gauche. The proposed framework combines the user programs with the partitioning and restoring programs. The combined program then runs in parallel on Hadoop Streaming of Hadoop MapReduce 0.20.0. Table 2 summarizes the key Hadoop components used to implement the framework.


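| Item | Description |
|---|---|
| Hadoop aggregation package | Used for Combine and Reduce calculation. |
| MultipleTextOutputFormat class | Used for multiple file output. |
| CompositeInputFormat class | Used for multiple file input and "Map-side join" [3]. |

**Table 2.** Hadoop configuration.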


Figure 8 shows an example of a user program. The program appends a new attribute "Average memory usage". The variable "new-schema" represents the location of the newly appended attribute in the tree structure. The function "mapper" generates new tree-structured data including only the attribute to be appended. The framework provides accessors to attributes and tuples, such as "ref-Server-tuples" and "ref-Memory-usage".

```
(define new-schema
  '(
    [("Average memory usage")])
)
(define (mapper site)
   (tuple
    [foreach (ref-Server-tuples site) (lambda (server)
     (tuple
       (mean (map ref-Memory-usage (ref-Record-tuples server)))
     )])))
```
**Figure 8.** Example of user program.

### **3.3. Evaluation**

We evaluated the proposed framework on 6 benchmark tasks.


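**Task A** Calculates the average CPU usage for each server and appends it as a new attribute to the corresponding tuple of server information. The SQL for the calculation includes "group-by" and "update" if a relational database is used instead of the proposed framework.

**Task B** Calculates the difference between CPU usage and average CPU usage for each server. The SQL of the calculation includes "join".

**Task C** Calculates the frequency distribution of CPU usage with an interval of 10. The SQL of the calculation includes "group-by".

**Task D** Calculates the difference between the CPU usages of two successive detail records and appends it as a new attribute to the corresponding tuple of a detail record. It is impossible to express the calculation with SQL.

**Task E** Searches for detail records in which both CPU usage and memory usage are 100%.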
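To make the benchmark tasks concrete, the following sketch computes a frequency distribution with an interval of 10 (Task C), the difference between successive detail records (Task D) and the search for fully loaded records (Task E) on a hypothetical list of detail records. The record layout `(cpu, memory)` and the sample values are assumptions; the benchmark data itself is not reproduced here.

```python
from collections import Counter

# Hypothetical detail records of one server in time order: (CPU %, memory %).
records = [(35, 40), (100, 100), (42, 55), (100, 80), (97, 100)]

# Task C: frequency distribution of CPU usage with an interval of 10.
freq = Counter((cpu // 10) * 10 for cpu, _ in records)

# Task D: difference between the CPU usages of two successive records.
# The first record has no predecessor, so its difference is left undefined.
cpu = [c for c, _ in records]
diffs = [None] + [b - a for a, b in zip(cpu, cpu[1:])]

# Task E: records in which both CPU usage and memory usage are 100%.
full_load = [r for r in records if r == (100, 100)]
```

Task D depends on the order of the records inside each server, which the tree-structured list preserves.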

Figure 9 shows the result of the evaluation on 90 GB of data. We used 19 servers as slave machines for Hadoop: 9 servers with a 2-core 1.86 GHz CPU and 3 GB of memory, and 10 servers with two 4-core 2.66 GHz CPUs and 8 GB of memory. Thus the Hadoop cluster has 98 CPU cores in total. The vertical axis of Figure 9 represents the average execution time over 5 runs. The result indicates that vertical partitioning accelerates the calculations by 17.5 times on task A and by 12.7 times on task D. Tasks A and D require attribute appending, in which a large amount of tree-structured data is not only read from files but also written into files. That is why the acceleration on tasks A and D is larger than that on the other tasks.


**Figure 9.** Evaluation of the tree-structured data analysis framework.

Table 3 compares the proposed method with MySQL. Both the proposed framework and MySQL run on a single server, and the size of the benchmark data is 891 MB. Parallelization is not used in this experiment, so that we can investigate the effect of vertical partitioning and of the tree-structured data store without the confounding factor of parallel computation. We created indexes on the primary id, CPU usage and memory usage columns of the MySQL tables. Table 3 shows the average and standard deviation of the execution times over 5 runs. The performance of the proposed method is comparable or superior to that of MySQL on tasks A, B, C and D, even though the proposed method is mainly implemented in Gauche. On the other hand, the performance of the proposed method on task E is inferior to that of MySQL. This is because MySQL finds the records that match the condition by using indexes, while the proposed framework scans the whole data linearly. However, the actual execution time of the proposed framework on task E is acceptable since it is not long compared to that on the other tasks.


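| Task | Proposed method [sec] | MySQL [sec] |
|---|---|---|
| A | 10.67 ± 0.08 | 402.72 ± 5.55 |
| B | 76.67 ± 0.36 | 445.48 ± 3.42 |
| C | 13.21 ± 0.18 | 12.89 ± 0.05 |
| D | 36.36 ± 0.20 | - |
| E | 16.87 ± 0.14 | 1.34 ± 2.66 |

**Table 3.** Comparison of the tree-structured data analysis framework and MySQL using a single server.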

From these experiments, we conclude that the proposed framework is efficient for data analysis of a large amount of tree-structured data. The performance could be improved further by implementing it in Java instead of Gauche.

## **4. Parallel machine learning framework**

### **4.1. Method**


This section proposes a computing framework for parallel machine learning. The proposed framework is designed to ease parallelization of machine learning algorithms and reduce calculation overheads of iterative procedures.

We start with a discussion of a model of machine learning algorithms. Let *D* = (*x<sub>n</sub>*, *y<sub>n</sub>*) be the training data, where *x<sub>n</sub>* is a *d*-dimensional feature vector and *y<sub>n</sub>* is a label. A machine learning algorithm estimates a model *M* that describes *D* well. In this paper we discuss machine learning algorithms that can be described as an iteration of the following steps:

$$z_n = f(\mathbf{x}_n, y_n, M) \tag{1}$$

$$M = r(g([z_0, z_1, \dots, z_{N-1}])), \tag{2}$$

where *M* represents the model to be trained, and *g* is a function that satisfies the following constraint.

$$\forall i < j < \dots < k < N,\quad g([z_0, \dots, z_{N-1}]) = g([g([z_0, \dots, z_{i-1}]), g([z_i, \dots, z_{j-1}]), \dots, g([z_k, \dots, z_{N-1}])]) \tag{3}$$

For instance, a function that sums the elements of an array satisfies the constraint above. By using this property of *g*, we reformulate the steps of machine learning algorithms as follows.

$$M_{i..j} = g([f(\mathbf{x}_i, y_i, M), f(\mathbf{x}_{i+1}, y_{i+1}, M), \dots, f(\mathbf{x}_{j-1}, y_{j-1}, M)]) \tag{4}$$

$$M = r(g([M_{0..i}, M_{i..j}, \dots, M_{k..N}])) \tag{5}$$

Note that the calculation of *M*<sub>*i*..*j*</sub> can be parallelized, since it does not depend on any (*x*<sub>*n*</sub>, *y*<sub>*n*</sub>) outside its own range.
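As an illustration of equations (4) and (5), the sketch below runs a one-dimensional K-Means in which each range of samples yields a partial result that is computed by a separate task. It is a simplified example under stated assumptions, not the authors' implementation: the class `ParallelKMeansSketch`, its methods, the fixed iteration count and the use of `java.util.concurrent.ExecutorService` as a stand-in for Map threads are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: one-dimensional K-Means expressed with equations (4) and (5).
class ParallelKMeansSketch {
    // Partial result M_{i..j}: per-cluster sum and count for samples i..j-1.
    static final class Partial {
        final double[] sum;
        final long[] count;
        Partial(int k) { sum = new double[k]; count = new long[k]; }
    }

    // Equation (4): assign each sample in [i, j) to its nearest centroid and accumulate.
    static Partial mapRange(double[] data, int i, int j, double[] centroids) {
        Partial p = new Partial(centroids.length);
        for (int n = i; n < j; n++) {
            int best = 0;
            for (int c = 1; c < centroids.length; c++) {
                if (Math.abs(data[n] - centroids[c]) < Math.abs(data[n] - centroids[best])) {
                    best = c;
                }
            }
            p.sum[best] += data[n];
            p.count[best]++;
        }
        return p;
    }

    // Equation (5): add up the partial results (g), then divide to get new centroids (r).
    static double[] reduce(List<Partial> partials, double[] previous) {
        double[] next = new double[previous.length];
        for (int c = 0; c < previous.length; c++) {
            double sum = 0.0;
            long count = 0;
            for (Partial p : partials) {
                sum += p.sum[c];
                count += p.count[c];
            }
            next[c] = count == 0 ? previous[c] : sum / count; // keep an empty cluster in place
        }
        return next;
    }

    // Runs a fixed number of iterations; the same pool threads serve every iteration.
    static double[] fit(double[] data, double[] initial, int numChunks, int iterations) {
        ExecutorService pool = Executors.newFixedThreadPool(numChunks);
        try {
            double[] centroids = initial.clone();
            for (int it = 0; it < iterations; it++) {
                final double[] current = centroids;
                List<Future<Partial>> futures = new ArrayList<>();
                for (int chunk = 0; chunk < numChunks; chunk++) {
                    final int i = (int) ((long) data.length * chunk / numChunks);
                    final int j = (int) ((long) data.length * (chunk + 1) / numChunks);
                    Callable<Partial> task = () -> mapRange(data, i, j, current);
                    futures.add(pool.submit(task));
                }
                List<Partial> partials = new ArrayList<>();
                for (Future<Partial> future : futures) {
                    try {
                        partials.add(future.get());
                    } catch (InterruptedException | ExecutionException e) {
                        throw new IllegalStateException("Map task failed", e);
                    }
                }
                centroids = reduce(partials, current);
            }
            return centroids;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        double[] data = {1.0, 2.0, 3.0, 10.0, 11.0, 12.0};
        double[] result = fit(data, new double[] {0.0, 5.0}, 3, 5);
        System.out.println(result[0] + " " + result[1]); // prints "2.0 11.0"
    }
}
```

Because the partial results are combined by summation, splitting the data into one chunk or into three chunks gives the same centroids in this example; only the amount of work done concurrently changes.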

Suppose we use MapReduce for parallelization: the Map phase calculates *M*<sub>*i*..*j*</sub> and the Reduce phase calculates *M*. Although MapReduce fits the parallelization of machine learning algorithms described by the above formulas, Hadoop MapReduce, the most popular implementation of MapReduce, is not a reasonable choice, because its implementation is optimized for non-iterative algorithms. Repeatedly using Hadoop MapReduce for an iterative algorithm is problematic; for instance, Hadoop restarts the Map and Reduce threads at every iteration, so initialization overheads are incurred each time.


Consequently, the proposed framework provides another MapReduce implementation designed for iterative machine learning algorithms. The key approaches of the framework are as follows.

1. It keeps feature vectors in memory during iterations. If the feature vectors are larger than the memory size, it uses local disk as a cache.
2. It does not terminate the Map and Reduce threads and uses the same threads repeatedly.
3. It controls iterations, read/write and data communication.
4. Users implement only 4 functions: initialization of *M*, calculation of *M*<sub>*i*..*j*</sub>, update of *M* and the termination condition.
5. It utilizes Hadoop DFS as its file system.

A few MapReduce frameworks for iterative computation have been proposed. HaLoop [8] adds loop control, caching and indexing to Hadoop. However, like Hadoop, it restarts the Map and Reduce threads at every iteration, so the initialization overheads remain. Twister [10, 11] and Spark [9] reduce the initialization overheads and keep feature vectors in memory during iterations. These frameworks perform similarly to the proposed framework if the input data is smaller than the total memory of the computing cluster. However, when the data is larger than the total memory, the proposed framework outperforms Twister and Spark since it uses local disk as a cache.

## **4.2. Implementation**

We implemented the proposed framework in Java. The framework reads feature vectors and configuration parameters from Hadoop DFS (version 0.20.2). Figure 10 shows the sequence diagram of the proposed framework. The framework consists of a master thread, a Reduce thread and multiple Map threads. The master thread controls the Reduce thread and the Map threads. The Reduce thread controls iterations. The Map threads parallelize the calculation of *M*<sub>*i*..*j*</sub>.

**Figure 10.** Sequence diagram of the parallel machine learning framework.

First, the master thread starts multiple Map threads, which read feature vectors from Hadoop DFS and keep the data in memory and on the HDD of the local machine during the iterations. Second, the master thread starts a Reduce thread. The Map threads and the Reduce thread are not terminated until the iterations end. Next, the Reduce thread initializes *M*, and then the Map threads calculate *M*<sub>*i*..*j*</sub> in parallel. The Reduce thread updates *M* by collecting the calculation results from the Map threads and continues the iteration.

Figure 11 and Figure 12 show an implementation of the parallel K-Means algorithm using the proposed framework. We omit the initialization of *M* and the termination condition since their implementations are straightforward. As shown in Figure 11 and Figure 12, the algorithm is easily parallelized and the source code is short. The remaining procedures are implemented inside the framework, and users do not have to write code for data transfer or data reading. Thus users can focus on the core logic of machine learning algorithms.

```
class KMeansMapper extends Mapper<KMeansModel> {
  // Computes the partial result Mij from the feature vectors held by this Map thread.
  public KMeansModel map(KMeansModel M) {
    KMeansModel Mij = new KMeansModel();
    // D: feature vectors assigned to this Map thread (the element type name is assumed).
    for (FeatureVector x : D) {
      int cid = argmin_distance(x, M);  // index of the centroid nearest to x
      Mij.s[cid].add(x);                // per-cluster sum of feature vectors
      Mij.l[cid] += 1;                  // per-cluster count
    }
    return Mij;
  }
}
```
**Figure 11.** Implementation of computing *M*<sub>*i*..*j*</sub> in parallel K-Means algorithm.

```
class KMeansReducer extends Reducer<KMeansModel> {
  // Merges the partial results from all Map threads and recomputes the centroids.
  public KMeansModel reduce(KMeansModel[] Mijs) {
    KMeansModel M = new KMeansModel();
    for (int cid = 0; cid < M.num_of_cluster; cid++) {
      for (KMeansModel Mij : Mijs) {
        M.s[cid].add(Mij.s[cid]);
        M.l[cid] += Mij.l[cid];
      }
      // Centroid = vector sum divided by count (shorthand for an element-wise division).
      M.centroid[cid] = M.s[cid] / M.l[cid];
    }
    return M;
  }
}
```
**Figure 12.** Implementation of updating *M* in parallel K-Means algorithm.

## **4.3. Evaluation**

We compared the proposed framework with Hadoop. We used the Mahout library as the implementation of machine learning algorithms on Hadoop [16]. We used 6 servers as slave machines for both the proposed framework and Hadoop: 4 servers with a 4-core 2.8 GHz CPU and 4 GB of memory, and 2 servers with two 4-core 2.53 GHz CPUs and 2 GB of memory. In the Map phase, 40 Map threads run in parallel, whereas one Reduce thread runs in the Reduce phase. The feature vectors are 1.4 GB in size. Table 4 shows the execution times of one iteration for three machine learning algorithms: K-Means [2], Dirichlet process clustering [12] and IPM perceptron [13, 14]. The values are the mean and standard deviation over 10 runs. The result indicates that the proposed framework is 33.8-274.1 times as fast as Mahout.

Figure 13 illustrates the scalability of the proposed framework for three machine learning algorithms: K-Means, variational Bayes clustering [15] and linear SVM [1]. The horizontal axes represent the number of Map threads running in parallel. The vertical axes represent 1 / (execution time), i.e., speed. Figure 13 indicates that the more Map threads run in parallel, the faster the parallelized algorithms run.


| Algorithm | Proposed method [sec] | Mahout [sec] |
|-----------|-----------------------|--------------|
| K-Means | 0.93 ± 0.052 | 31.8 ± 1.49 |
| Dirichlet process clustering | 1.14 ± 0.057 | 67.4 ± 3.87 |
| IPM perceptron | 0.11 ± 0.026 | 30.7 ± 2.00 |

**Table 4.** Comparison of the parallel machine learning framework and Mahout on K-Means [2], Dirichlet process clustering [12] and IPM perceptron [13, 14].

**Figure 13.** Scalability evaluation of the parallel machine learning framework.

We also applied the framework to parallelize a learning algorithm for an acoustic model for speech recognition. The learning algorithm reads voice data and the corresponding text data, and generates a hidden Markov model using the forward-backward algorithm. We compared the performance of the parallelized algorithm with that of a single-thread implementation in C. We used 1.0 GB of feature vectors as the input to these programs. The parallelized algorithm on the proposed framework with 32 parallel Map threads runs 7.15 times faster than the single-thread implementation. Considering the difference in speed between Java and C, the proposed framework parallelizes the algorithm well. Consequently, we conclude that the proposed framework is efficient for parallel machine learning.

## **5. Conclusion**

This chapter discussed techniques for processing large-scale data. First, we explained that the process of data utilization in enterprises and organizations includes (1) a pre-processing phase, (2) an analysis phase, (3) a model learning phase and (4) a model application phase. Second, we described an architecture for the data utilization process. Then we proposed two computing frameworks: a tree-structured data analysis framework for the analysis phase, and a parallel machine learning framework for the model learning phase. The experimental results demonstrated that our approaches work well.

Future work includes the following:

- To design original machine learning algorithms which run on the parallel machine learning framework.
- To implement the tree-structured data analysis framework using Java.
- To formulate a framework for the model application phase.

## **Author details**

Kohsuke Yanai *Research & Development Centre, Hitachi India Pvt. Ltd. Central Research Laboratory, Hitachi Ltd.*

Toshihiko Yanase *Central Research Laboratory, Hitachi Ltd.*

## **6. References**

- [1] Chapelle, O. (2007) Training a Support Vector Machine in the Primal. Neural Computation, Vol. 19, No. 5, pp. 1155-1178.
- [2] MacQueen, J. B. (1967) Some Methods for Classification and Analysis of Multivariate Observations. Proc. of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, pp. 281-297.
- [3] White, T. (2009) Hadoop: The Definitive Guide. O'Reilly & Associates Inc.
- [4] Dean, J. and Ghemawat, S. (2004) MapReduce: Simplified Data Processing on Large Clusters. Proceedings of Sixth Symposium on Operating System Design and Implementation (OSDI 2004), pp. 137-150.
- [5] Dean, J. and Ghemawat, S. (2008) MapReduce: simplified data processing on large clusters. Communications of the ACM, Vol. 51, No. 1, pp. 107-113.
- [6] Data Mining Group. PMML standard, http://www.dmg.org/v4-1/GeneralStructure.html

**Data Mining Applications** 

