**3.1 Algorithm description**

Input: data streams to be mined

Output: frequent itemsets of the recent data stream

The NSWGA algorithm is divided into three parts: (1) NSWGA uses the parallelism of the genetic algorithm to search for the frequent itemsets of the latest data in the nested sub-window; (2) the final frequent itemsets of the sliding window are obtained by integrating this series of frequent itemsets from the nested sub-windows; (3) as new data arrives, the expired data is deleted periodically and the two operations above are repeated.

In the first part, the current frequent itemsets in the NSW are obtained. The process is shown in Figure 3.

**Step 1.** Set the size of the sliding window SW to ω1 and the size of the nested sub-window NSW to ω2. The role of the nested sub-window is to avoid repeatedly processing data that is still in the sliding window after the old data has moved out of it. The window sizes are determined by the properties of the data stream: ω1 depends on how much of the recent data we are interested in mining frequent itemsets from, and ω2 depends on the processing capability of the algorithm and on how often we want to take statistics. Given the support threshold S and the fitness function Fi = Wi/WZ, when Fi ≥ S, transaction item i is a frequent pattern of the data set in the sliding window. Let the crossover probability be P and the individual mutation probability be Q; to implement parallel computing, the data in the nested sub-window is divided into Z segments. The number of iterations T depends on the number of attributes a transaction item includes, the range of the attribute values and the initial population size.
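As an illustrative reading of these quantities (in Python, not the MATLAB/CUDA C used for the chapter's implementation), the sketch below fixes the parameters and evaluates Fi = Wi/WZ by taking Wi as the number of window transactions containing a candidate itemset and WZ as the number of transactions in the window; every concrete value except Z = 200 (Section 3.2) and the 100K sliding window (Section 4.1) is an assumption.

```python
# Illustrative parameter setup and frequency test for NSWGA; values marked "assumed" are not from the text.
from typing import Dict, List, Tuple

Itemset = Tuple[Tuple[str, str], ...]   # e.g. (("A", "I1"), ("B", "I2"))

OMEGA1 = 100_000   # sliding window size SW (Section 4.1 uses a 100K window)
OMEGA2 = 10_000    # nested sub-window size NSW (assumed; statistics are taken every 10K in Section 4.1)
S = 0.1            # support threshold (assumed)
P, Q = 0.8, 0.05   # crossover and mutation probabilities (assumed)
Z = 200            # number of parallel segments (Section 3.2 sets Z = 200)
T = 50             # number of GA iterations (assumed)

def fitness(itemset: Itemset, window: List[Dict[str, str]]) -> float:
    """Fi = Wi / WZ: fraction of window transactions containing every (attribute, value) of the itemset."""
    wi = sum(all(t.get(a) == v for a, v in itemset) for t in window)
    return wi / len(window)

def is_frequent(itemset: Itemset, window: List[Dict[str, str]]) -> bool:
    """Transaction item i is a frequent pattern of the window when Fi >= S."""
    return fitness(itemset, window) >= S
```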


Fig. 3. (Flowchart of the first part: input the parameters → NSW obtains the latest data → obtain the initial population → calculate the individual support degree → selection → crossover → mutation; an individual whose support degree is greater than S joins the NSW frequent itemsets, and the loop ends once the number of iterations reaches T.)

**Step 2.** Use the nested sub-window to capture the latest data, get the frequent 1-itemsets of that data, encode the frequent 1-itemsets as integer strings, and combine the frequent 1-itemsets randomly to constitute the initial population in the nested sub-window. The individuals of this population are possible frequent patterns. For a data set with attributes A, B and C and item values I1, I2 and I3, the construction proceeds as follows:

- 1 Count the number of I1, I2 and I3 in attribute A;
- 2 Count the number of I1, I2 and I3 in attribute B;
- 3 Count the number of I1, I2 and I3 in attribute C;
- 4 Keep the counts that are greater than or equal to the threshold S and set the others to 0 (in this case, S is taken as 3);
- 5 Remove the all-zero rows and keep the non-zero values in their original rows;
- 6 Line up every non-zero value, keeping its original location in the row, and fill the remaining positions with 0;
- 7 Combine the non-zero items according to their original locations, and constitute the initial population from the frequent 1-itemsets and the combined items. The process is shown in Figure 4, and a code sketch follows the figure.

Fig. 4. The generation of initial population
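A minimal sketch of the Step 2 construction described in the list above, under the assumption that the data set has categorical attributes and that S is an absolute count (it is 3 in the worked example); the helper names and the random combination policy are illustrative, not the authors' encoding.

```python
# Sketch of Step 2: frequent 1-itemsets per attribute, then random combinations form the initial population.
import random
from collections import Counter
from typing import Dict, List, Tuple

Item = Tuple[str, str]            # (attribute, value) pair, e.g. ("A", "I1")
Itemset = Tuple[Item, ...]

def frequent_1_itemsets(transactions: List[Dict[str, str]], s_count: int) -> List[Item]:
    """Count every (attribute, value) occurrence and keep the pairs appearing at least s_count times."""
    counts = Counter((attr, val) for t in transactions for attr, val in t.items())
    return [pair for pair, c in counts.items() if c >= s_count]

def initial_population(frequent_1: List[Item], size: int) -> List[Itemset]:
    """Population = the frequent 1-itemsets plus random combinations of them over distinct attributes."""
    by_attr: Dict[str, List[str]] = {}
    for attr, val in frequent_1:
        by_attr.setdefault(attr, []).append(val)
    population: List[Itemset] = [(pair,) for pair in frequent_1]
    attrs = list(by_attr)
    while len(population) < size and len(attrs) >= 2:
        chosen = random.sample(attrs, random.randint(2, len(attrs)))
        population.append(tuple(sorted((a, random.choice(by_attr[a])) for a in chosen)))
    return population

# Example with attributes A, B, C, items I1, I2, I3 and threshold S = 3, as in the list above.
data = [{"A": "I1", "B": "I2", "C": "I3"}] * 4 + [{"A": "I2", "B": "I1", "C": "I1"}]
population = initial_population(frequent_1_itemsets(data, s_count=3), size=10)
```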


**Step 3.** Calculating the individual fitness degree is the process of matching the individuals of the initial population against the actual transaction items. In order to realize parallel matching, we divide the data into Z sections; although this operation increases the memory expense, it reduces the running time, which matters when mining frequent patterns of a data stream. Make roulette-wheel selection according to the fitness degree, carry out crossover with the crossover probability P, and carry out mutation with the mutation probability Q. Determine the individual fitness degree after scanning the data, and add every individual that satisfies the condition to the frequent itemsets. Relying on the powerful parallel computing capability of the GPU, the matching is performed in parallel over the Z sections, which greatly reduces the running time. The process is shown in Figure 5.

Fig. 5. Parallel computing fitness degree
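To make the data decomposition behind Figure 5 concrete, the thread-based sketch below splits the sub-window into Z segments, matches each segment against the whole population independently and sums the per-segment counts. It only illustrates the decomposition; the authors run the per-segment matching as GPU threads written in CUDA C.

```python
# Sketch of Fig. 5: fitness of every individual obtained by matching Z data segments in parallel.
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Sequence, Tuple

Itemset = Tuple[Tuple[str, str], ...]

def _matches(itemset: Itemset, transaction: Dict[str, str]) -> bool:
    return all(transaction.get(attr) == val for attr, val in itemset)

def _segment_counts(segment: Sequence[Dict[str, str]], population: List[Itemset]) -> List[int]:
    return [sum(_matches(ind, t) for t in segment) for ind in population]

def parallel_fitness(data: List[Dict[str, str]], population: List[Itemset], z: int) -> List[float]:
    """Split the sub-window data into z segments, count matches per segment concurrently, then combine."""
    step = max(1, len(data) // z)
    segments = [data[i:i + step] for i in range(0, len(data), step)]
    with ThreadPoolExecutor(max_workers=min(len(segments), 32)) as pool:
        per_segment = list(pool.map(lambda seg: _segment_counts(seg, population), segments))
    totals = [sum(counts) for counts in zip(*per_segment)]
    return [wi / len(data) for wi in totals]      # Fi = Wi / WZ for each individual
```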

**Step 4.** If the number of iterations is smaller than T, the algorithm jumps back to Step 3. After T iterations, the iteration finishes and the frequent itemsets in the current nested sub-window are obtained.
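Steps 3 and 4 together form the per-sub-window loop sketched below: roulette-wheel selection on the fitness values, single-point crossover with probability P, value mutation with probability Q, and collection of every individual whose support reaches S, repeated T times. The genetic operators shown are plausible choices for this encoding, not a reproduction of the authors' implementation, and the plain support count stands in for the parallel matching of Figure 5.

```python
# Sketch of the GA loop over one nested sub-window (Steps 3-4); the operators are illustrative assumptions.
import random
from typing import Dict, List, Set, Tuple

Itemset = Tuple[Tuple[str, str], ...]

def support(ind: Itemset, data: List[Dict[str, str]]) -> float:
    # In NSWGA this count is computed in parallel over Z segments on the GPU (Fig. 5).
    return sum(all(t.get(a) == v for a, v in ind) for t in data) / len(data)

def roulette_select(population: List[Itemset], fit: List[float], n: int) -> List[Itemset]:
    weights = [f + 1e-12 for f in fit]     # small epsilon keeps selection valid when all fitness values are 0
    return random.choices(population, weights=weights, k=n)

def mutate(ind: Itemset, q: float, values: Dict[str, List[str]]) -> Itemset:
    """With probability q, replace one attribute's value by another possible value of that attribute."""
    if ind and random.random() < q:
        i = random.randrange(len(ind))
        attr = ind[i][0]
        ind = ind[:i] + ((attr, random.choice(values[attr])),) + ind[i + 1:]
    return ind

def mine_subwindow(data, population, s, p, q, t, values) -> Set[Itemset]:
    frequent: Set[Itemset] = set()
    for _ in range(t):                                     # Step 4: stop after T iterations
        fit = [support(ind, data) for ind in population]   # Step 3: fitness degree of each individual
        frequent.update(ind for ind, f in zip(population, fit) if f >= s)
        parents = roulette_select(population, fit, len(population))
        children: List[Itemset] = []
        for a, b in zip(parents[0::2], parents[1::2]):
            if random.random() < p:                        # crossover: exchange a random-length prefix
                cut = random.randint(1, min(len(a), len(b)))
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            children += [mutate(a, q, values), mutate(b, q, values)]
        population = children or population
    return frequent
```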



In the second part, the final frequent itemsets in the sliding window are obtained. The process is shown as Step 5.

**Step 5.** Constitute the mode sets from the frequent itemsets obtained this time and the frequent itemsets obtained in the previous M rounds (M = ω1/ω2 - 1), then carry out a search to determine the final frequent itemsets in the sliding window (a code sketch follows Figure 6):

	- 1 For i = 1: M+1
	- 2 Constitute the mode sets;
	- 3 End
	- 4 Make a parallel search in the sliding window SW;
	- 5 When a mode's support degree is greater than or equal to S, identify it as a final frequent mode;

The process is shown in Figure 6 (a) (b).

Fig. 6. The process of obtaining frequent patterns: (a) the generation of mode sets; (b) the generation of final frequent patterns, in which the mode sets are matched in parallel against the sliding-window data d1 … dn to give the final result.
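One plausible reading of Step 5 and Figure 6 is sketched below: the NSW results of the last M + 1 rounds are pooled into the mode sets, and each mode's support is then checked against the whole sliding window (sequentially here, in parallel in NSWGA). Names and structure are illustrative assumptions.

```python
# Sketch of Step 5: merge the last M+1 sub-window results into mode sets and verify them over the sliding window.
from typing import Dict, List, Sequence, Set, Tuple

Itemset = Tuple[Tuple[str, str], ...]

def final_frequent_itemsets(subwindow_results: Sequence[Set[Itemset]],
                            sliding_window: List[Dict[str, str]],
                            s: float) -> Set[Itemset]:
    """subwindow_results holds the NSW frequent itemsets of the last M+1 rounds, M = omega1/omega2 - 1."""
    mode_sets: Set[Itemset] = set()
    for result in subwindow_results:              # "For i = 1 : M+1 ... constitute the mode sets"
        mode_sets |= result
    final: Set[Itemset] = set()
    for mode in mode_sets:                        # NSWGA performs this search over SW in parallel (Fig. 6b)
        count = sum(all(t.get(a) == v for a, v in mode) for t in sliding_window)
        if count / len(sliding_window) >= s:      # keep the modes whose support degree reaches S
            final.add(mode)
    return final
```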

In the third part, the above two operations are repeated dynamically. The process is shown as Step 6.

**Step 6.** As the data stream flows, the algorithm continues to deal with the newly incoming data and discard the old data; it transfers to Step 2 and continues the above operations until the data stream comes to an end.

**3.2 NSWGA algorithm analysis**


Compared with other algorithms, which use a pattern tree to maintain the historical information of the data stream, NSWGA processes a quantity of data in parallel at one time, whereas the pattern tree algorithms process a single transaction item at a time and each transaction item must be matched repeatedly. When mining the frequent itemsets of the data in the current window, the time of the whole process depends not only on how many times the data in the window is scanned, but also on the number of internal basic operations, namely matchings.

Suppose a data stream has N transaction items, each transaction item has V attributes, and each attribute has K possible values. The pattern tree algorithms may have K^V frequent pattern search paths. Let the window size be N. When the entire data stream in the window has flowed past, the necessary calculated amount to get the frequent itemsets is N × K × V.

For the FP-tree algorithm, when the FP-tree has L paths, the calculated amount is 2 × N × V + V × L; the number L increases as the support threshold decreases.

When the support threshold is S, the number of iterations of the genetic algorithm is T, the number of parallel segments is Z (Z is chosen according to the amount of data; in this case Z = 200), and the sliding window size is N, the calculated amount NSWGA needs to get the frequent itemsets is P = P1 + P2 + P3, where:

P1 = N × V, the calculated amount to get the frequent 1-itemsets;

P2 = V × T × N / (S × Z), the calculated amount to get the frequent itemsets in the nested sub-window;

P3 = α × V × N × M / (S × Z) with 1 ≤ α ≤ 1/S, the calculated amount to get the final frequent itemsets.
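To give these formulas a concrete feel, the snippet below evaluates them with values loosely based on the experiment (N = 100,000, V = 3, K = 10, Z = 200) and assumed values for the remaining parameters (S = 0.1, T = 50, M = 9, worst-case α = 1/S); the numbers only illustrate the scaling and do not reproduce the chapter's measurements.

```python
# Illustrative evaluation of the calculated amounts; S, T, M and alpha are assumed values.
N, V, K = 100_000, 3, 10          # transactions in the window, attributes, values per attribute
S, T, Z, M = 0.1, 50, 200, 9      # support threshold, GA iterations, parallel segments, M = w1/w2 - 1
alpha = 1 / S                     # worst case of 1 <= alpha <= 1/S

pattern_tree = N * K * V                    # matching cost of the pattern tree approach
P1 = N * V                                  # frequent 1-itemsets
P2 = V * T * N / (S * Z)                    # frequent itemsets in the nested sub-window
P3 = alpha * V * N * M / (S * Z)            # final frequent itemsets
print(pattern_tree, P1 + P2 + P3)           # 3,000,000 versus 2,400,000 with these values
```

With these assumed values the two costs are already comparable; since P1 + P2 + P3 does not depend on K while the pattern tree cost grows with it, the gap widens as K increases, which is the advantage stated below.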

When the number of possible attribute values K is large, this algorithm has an obvious advantage in time complexity, and the larger the number of segments Z, the shorter the runtime becomes.

**4. Experiment and analysis**

**4.1 Experiment**

In this experiment, we use artificial data sets and the MATLAB and CUDA C languages to implement the NSWGA algorithm. We test the performance of the algorithm on a computer with a 2.61 GHz CPU, 2 GB of memory, an Nvidia GPU C1060 and the Windows XP operating system. The size of the sliding window is 100K and the size of the data set is 200K. As the data flows, we take statistics every 10K of data.

1. The analog data stream has three attributes. Each attribute has 10 possible values. The running results of the algorithms are shown in Table 1.
