**FPGA Implementation for GHA-Based Texture Classification**

Shiow-Jyu Lin<sup>1,2</sup>, Kun-Hung Lin<sup>1</sup> and Wen-Jyi Hwang<sup>1</sup>

<sup>1</sup>*Department of Computer Science and Information Engineering, National Taiwan Normal University, and* <sup>2</sup>*Department of Electronic Engineering, National Ilan University, Taiwan*

#### **1. Introduction**


Principal components analysis (PCA) (Alpaydin, 2010; Jolliffe, 2002) is an effective unsupervised feature extraction algorithm for pattern recognition, classification, computer vision and data compression (Bravo et al., 2010; Zhang et al., 2006; Kim et al., 2005; Liying & Weiwei, 2009; Pavan et al., 2007; Qian & James, 2008). The goal of PCA is to obtain a compact and accurate representation of the data that reduces or eliminates statistically redundant components. Basic approaches to PCA involve the computation of the covariance matrix and the extraction of its eigenvalues and eigenvectors. A drawback of these basic approaches is their high computational complexity and large memory requirement for data with high vector dimensions. Therefore, they may not be well suited for real-time applications requiring fast feature extraction.

A number of fast algorithms (Dogaru et al., 2004; El-Bakry, 2006; Gunter et al., 2007; Sajid et al., 2008; Sharma & Paliwal, 2007) have been proposed to reduce the computation time of PCA. However, only moderate acceleration can be achieved because most of these algorithms are implemented in software. Although hardware implementations of PCA and its variants are possible, a large storage size and complicated circuit control management are usually necessary. PCA hardware implementations may therefore be feasible only for small vector dimensions (Boonkumklao et al., 2001; Chen & Han, 2009).

An alternative for PCA implementation is to use the generalized Hebbian algorithm (GHA) (Haykin, 2009; Oja, 1982; Sanger, 1989). The principal component computation in the GHA is based on an effective incremental updating scheme that reduces memory utilization. Nevertheless, slow convergence of the GHA (Karhunen & Joutsensalo, 1995) is usually observed. A large number of iterations is therefore required, resulting in long computational times for many GHA-based algorithms. Hardware implementation of the GHA has been found to be effective for reducing the computation time. However, since the number of multipliers in the circuit grows with the vector dimension, such circuits may be suitable only for PCA with small dimensions. Although analog GHA hardware architectures (Carvajal et al., 2007; 2009) can be used to lift the constraints on the vector dimensions, these architectures are difficult to use directly in digital devices.


In light of the facts stated above, a digital GHA hardware architecture capable of performing fast PCA for large vector dimensions is presented. Although a large amount of arithmetic computation is required for the GHA, the proposed architecture is able to achieve fast training with low area cost. The proposed architecture can be divided into three parts: the synaptic weight updating (SWU) unit, the principal components computing (PCC) unit, and the memory unit. The memory unit is the on-chip memory storing the synaptic weight vectors. Based on the synaptic weight vectors stored in the memory unit, the PCC and SWU units are then used to compute the principal components and to update the synaptic weight vectors, respectively.

In the SWU unit, one synaptic weight vector is computed at a time. The results of preceding weight vectors are used for the computation of subsequent weight vectors to expedite training. In addition, the computation of different weight vectors shares the same circuit to lower the area cost. Moreover, in the PCC unit, the input vectors are allowed to be separated into smaller segments for delivery over a data bus with limited width. The SWU and PCC units can also operate concurrently to further enhance the throughput.

To demonstrate the effectiveness of the proposed architecture, a texture classification system on a system-on-programmable-chip (SOPC) platform is constructed. The system consists of the proposed architecture, a softcore NIOS II processor (Altera Corp., 2010), a DMA controller, and an SDRAM. The proposed architecture is adopted for finding the PCA transform by GHA training, where the training vectors are stored in the SDRAM. The DMA controller is used to deliver the training vectors. The softcore processor is used only for coordinating the SOPC system; it does not participate in the GHA training process. Compared with its software counterpart running on an Intel i7 CPU, our system has a significantly lower computational time for large training sets. All these facts demonstrate the effectiveness of the proposed architecture.

Fig. 1. The neural model for the GHA.

#### **2. Preliminaries**

Figure 1 shows the neural model for the GHA, where **x**(*n*) = [*x*1(*n*), ..., *xm*(*n*)]<sup>*T*</sup> and **y**(*n*) = [*y*1(*n*), ..., *yp*(*n*)]<sup>*T*</sup> are the input and output vectors of the GHA model, respectively. The output vector **y**(*n*) is related to the input vector **x**(*n*) by

$$y_j(n) = \sum_{i=1}^{m} w_{ji}(n) x_i(n), \tag{1}$$

where *wji*(*n*) stands for the weight from the *i*-th synapse to the *j*-th neuron at iteration *n*. Each synaptic weight vector **w***j*(*n*) is adapted by the Hebbian learning rule:

$$w_{ji}(n+1) = w_{ji}(n) + \eta \left[ y_j(n) x_i(n) - y_j(n) \sum_{k=1}^{j} w_{ki}(n) y_k(n) \right], \tag{2}$$

where *η* denotes the learning rate. After a large number of iterations of computation and adaptation, **w***j*(*n*) asymptotically approaches the eigenvector associated with the *j*-th principal component of the input vectors, where the corresponding eigenvalues satisfy *λ*1 > *λ*2 > ... > *λp*. To reduce the computational complexity of the implementation, eq.(2) can be rewritten as

$$w_{ji}(n+1) = w_{ji}(n) + \eta y_j(n) \left[ x_i(n) - \sum_{k=1}^{j} w_{ki}(n) y_k(n) \right]. \tag{3}$$

A more detailed discussion of the GHA can be found in (Haykin, 2009; Sanger, 1989).
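To make eqs.(1)-(3) concrete, the following minimal NumPy sketch performs GHA training in software. It is only a reference model of the arithmetic, not the proposed hardware; the function name, random initialization, fixed learning rate and epoch count are illustrative assumptions.

```python
import numpy as np

def gha_train(X, p, eta=1e-3, epochs=10, seed=0):
    """Software reference model of GHA training based on eqs.(1)-(3).

    X : (N, m) array of training vectors x(n); p : number of principal
    components. Returns a (p, m) array whose rows approximate the leading
    eigenvectors of the input data.
    """
    rng = np.random.default_rng(seed)
    _, m = X.shape
    W = rng.normal(scale=0.01, size=(p, m))   # synaptic weight vectors w_j(n)
    for _ in range(epochs):
        for x in X:
            y = W @ x                         # eq.(1): y_j(n) = sum_i w_ji(n) x_i(n)
            # eq.(3): the lower-triangular product realizes sum_{k=1}^{j} w_ki(n) y_k(n)
            W += eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
    return W
```

With zero-mean training vectors and a sufficiently small *η*, the rows of W converge (slowly, as noted in the Introduction) to the leading eigenvectors in decreasing order of eigenvalue.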

#### **3. The proposed GHA architecture**

Fig. 2. The proposed GHA architecture.

As shown in Figure 2, the proposed GHA architecture consists of three functional units: the memory unit, the synaptic weight updating (SWU) unit, and the principal components computing (PCC) unit. The memory unit is used for storing the *current* synaptic weight vectors. Assume the *current* synaptic weight vectors **w***j*(*n*), *j* = 1, ..., *p*, are stored in the memory unit and the input vector **x**(*n*) is available. Based on **x**(*n*) and **w***j*(*n*), *j* = 1, ..., *p*, the goal of the PCC unit is to compute the output vector **y**(*n*). Using **x**(*n*), **y**(*n*) and **w***j*(*n*), *j* = 1, ..., *p*, the SWU unit produces the new synaptic weight vectors **w***j*(*n* + 1), *j* = 1, ..., *p*. It can be observed from Figure 2 that the new synaptic weight vectors are stored back to the memory unit for subsequent training.


#### **3.1 SWU unit**

The design of the SWU unit is based on eq.(3). Although the direct implementation of eq.(3) is possible, it would consume large hardware resources. To elaborate, we first observe from eq.(3) that the computation of *wji*(*n* + 1) and *wri*(*n* + 1) shares the same term ∑<sup>*r*</sup><sub>*k*=1</sub> *wki*(*n*)*yk*(*n*) when *r* ≤ *j*. Consequently, independent implementation of *wji*(*n* + 1) and *wri*(*n* + 1) in hardware using eq.(3) would result in a large hardware resource overhead.

Fig. 3. The hardware implementation of eqs.(5) and (6).

To reduce the resource consumption, we first define *zji*(*n*) as

$$z_{ji}(n) = x_i(n) - \sum_{k=1}^{j} w_{ki}(n) y_k(n), \quad j = 1, \ldots, p, \tag{4}$$

and **z***j*(*n*) = [*zj*1(*n*), ..., *zjm*(*n*)]<sup>*T*</sup>. Combining eqs.(3) and (4), we obtain

$$w_{ji}(n+1) = w_{ji}(n) + \eta y_j(n) z_{ji}(n), \tag{5}$$

where *zji*(*n*) can be obtained from *z*(*j*−1)*i*(*n*) by

$$z_{ji}(n) = z_{(j-1)i}(n) - w_{ji}(n) y_j(n), \quad j = 2, \ldots, p. \tag{6}$$

When *j* = 1, it follows from eqs.(4) and (6) that

$$z_{0i}(n) = x_i(n). \tag{7}$$

Figure 3 depicts the hardware implementation of eqs.(5) and (6). As shown in the figure, the SWU unit produces one synaptic weight vector at a time. The computation of **w***j*(*n* + 1), the *j*-th weight vector at iteration *n* + 1, requires **z***j*−1(*n*), **y**(*n*) and **w***j*(*n*) as inputs.
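The saving expressed by eqs.(4)-(7) can be mirrored in software: each **z***j*(*n*) is obtained from **z***j*−1(*n*) with one multiply-subtract per element, so the partial sums shared among the weight vectors are never recomputed. The sketch below is illustrative only; the function name and the dense data layout are assumptions, not part of the hardware design.

```python
import numpy as np

def swu_update(W, x, y, eta):
    """Sequential weight update following eqs.(5)-(7).

    W : (p, m) current weight vectors w_j(n); x : (m,) input vector x(n);
    y : (p,) outputs of eq.(1). Returns the updated weight vectors w_j(n+1).
    """
    W_next = np.empty_like(W)
    z = x.astype(float)                    # eq.(7): z_0(n) = x(n)
    for j in range(W.shape[0]):
        z = z - y[j] * W[j]                # eq.(6): z_j(n) = z_{j-1}(n) - w_j(n) y_j(n)
        W_next[j] = W[j] + eta * y[j] * z  # eq.(5)
    return W_next
```

The result is identical to a direct evaluation of eq.(3); only the redundant summations are removed, which is exactly the property the SWU hardware exploits.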

Fig. 4. The architecture of each module in SWU unit.

In addition to **w***j*(*n* + 1), the SWU unit also produces **z***j*(*n*), which will then be used for the computation of **w***j*+1(*n* + 1). Hardware resource consumption can then be effectively reduced.

One way to implement the SWU unit is to produce **w***j*(*n* + 1) and **z***j*(*n*) in one shot. In this case, *m* identical modules are required because the dimension of the vectors is *m*. Figure 4 shows the architecture of each module. The area cost of the SWU unit then grows linearly with *m*. To further reduce the area cost, each of the output vectors **w***j*(*n* + 1) and **z***j*(*n*) is separated into *b* segments, where each segment contains *q* elements. The SWU unit only computes one segment of **w***j*(*n* + 1) and **z***j*(*n*) at a time. Therefore, it takes *b* clock cycles to produce the complete **w***j*(*n* + 1) and **z***j*(*n*).

Let


$$\hat{\mathbf{w}}_{j,k}(n) = [w_{j,(k-1)q+1}(n), \ldots, w_{j,(k-1)q+q}(n)]^T, \quad k = 1, \ldots, b, \tag{8}$$

and

$$\hat{\mathbf{z}}_{j,k}(n) = [z_{j,(k-1)q+1}(n), \ldots, z_{j,(k-1)q+q}(n)]^T, \quad k = 1, \ldots, b, \tag{9}$$

be the *k*-th segment of **w***j*(*n*) and **z***j*(*n*), respectively. The computation of **w***j*(*n* + 1) and **z***j*(*n*) takes *b* clock cycles. At the *k*-th clock cycle, *k* = 1, ..., *b*, the SWU unit computes **ŵ***j*,*k*(*n* + 1) and **ẑ***j*,*k*(*n*). Because each of **ŵ***j*,*k*(*n* + 1) and **ẑ***j*,*k*(*n*) contains only *q* elements, the SWU unit consists of *q* identical modules. The architecture of each module is also shown in Figure 4. The same SWU unit can therefore be used for the GHA with different vector dimensions *m*. As *m* increases, the area cost remains the same at the expense of a larger number of clock cycles *b* for the computation of **ŵ***j*,*k*(*n* + 1) and **ẑ***j*,*k*(*n*).
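A software model of this segmented schedule is sketched below, assuming *m* = *bq*. The inner loop over *k* corresponds to the *b* clock cycles spent on each weight vector, and the slice width *q* corresponds to the *q* identical modules; the function name and data layout are illustrative assumptions.

```python
import numpy as np

def swu_update_segmented(W, x, y, eta, q):
    """Segment-wise SWU schedule following eqs.(8) and (9).

    One q-element segment of w_j(n+1) and z_j(n) is produced per step, so
    the amount of hardware is fixed by q regardless of the dimension m.
    """
    p, m = W.shape
    b = m // q                            # number of segments (assumes m = b*q)
    W_next = np.empty_like(W)
    z = x.astype(float)                   # z_0(n) = x(n)
    for j in range(p):
        for k in range(b):                # one segment per clock cycle
            s = slice(k * q, (k + 1) * q)
            z[s] = z[s] - y[j] * W[j, s]                 # segment of eq.(6)
            W_next[j, s] = W[j, s] + eta * y[j] * z[s]   # segment of eq.(5)
    return W_next
```

The weights produced are identical to those of the unsegmented update; only the order in which the elements are computed changes.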

Figures 5, 6 and 7 show the operation of the *q* modules. For the sake of simplicity, the computation of the first weight vector **w**1(*n* + 1) (i.e., *j* = 1) and the corresponding **z**1(*n*) is considered in the figures. Based on eq.(7), the input vector **z**0(*n*) is simply the training vector **x**(*n*), which is also separated into *b* segments, where the *k*-th segment is given by

$$\hat{\mathbf{z}}_{0,k}(n) = [x_{(k-1)q+1}(n), \ldots, x_{(k-1)q+q}(n)]^T, \quad k = 1, \ldots, b. \tag{10}$$

The segments are then multiplexed to the *q* modules. The segments **ẑ**0,1(*n*) and **ŵ**1,1(*n*) are used for the computation of **ẑ**1,1(*n*) and **ŵ**1,1(*n* + 1) in Figure 5. Similarly, **ẑ**0,*k*(*n*) and **ŵ**1,*k*(*n*) are used for the computation of **ẑ**1,*k*(*n*) and **ŵ**1,*k*(*n* + 1) in Figures 6 and 7 for *k* = 2 and *k* = *b*, respectively.

Fig. 5. The operation of SWU unit for computing the first segment of **w**1(*n* + 1).

Fig. 6. The operation of SWU unit for computing the second segment of **w**1(*n* + 1).

Fig. 7. The operation of SWU unit for computing the *b*-th segment of **w**1(*n* + 1).

Fig. 8. The operation of SWU unit for computing the first segment of **w**2(*n* + 1).

After the computation of **w**1(*n* + 1) is completed, the vector **z**1(*n*) is available as well. The vector **z**1(*n*) is then used for the computation of **w**2(*n* + 1). Figure 8 shows the computation of the first segment of **w**2(*n* + 1) (i.e., **ŵ**2,1(*n* + 1)) based on the first segment of **z**1(*n*) (i.e., **ẑ**1,1(*n*)). The same process proceeds for the subsequent segments until the entire vectors **w**2(*n* + 1) and **z**2(*n*) are completed. The vector **z**2(*n*) is then used for the computation of **w**3(*n* + 1). The weight vector updating process at iteration *n* + 1 is completed when the SWU unit produces the last weight vector **w***p*(*n* + 1).


Fig. 9. The architecture of Buffer A in memory unit.

Fig. 10. The architecture of Buffer B in memory unit.

Fig. 11. The architecture of Buffer C in memory unit.

#### **3.2 Memory unit**


The memory unit contains three buffers: Buffer A, Buffer B and Buffer C. As shown in Figure 9, Buffer A stores the training vector **x**(*n*). It consists of *q* sub-buffers, where each sub-buffer contains *b* elements. All the sub-buffers are connected to the SWU and PCC units.

Buffer B, whose architecture is depicted in Figure 10, holds the values of **z***j*(*n*). Each segment of **z***j*(*n*) computed by the SWU unit is stored in Buffer B. After all the segments are produced, Buffer B delivers the segments of **z***j*(*n*) back to the SWU unit in a first-in-first-out (FIFO) fashion.

Buffer C stores the synaptic weight vectors **w***j*(*n*), *j* = 1, ..., *p*. It is a two-port RAM for reading and writing weight vectors, as shown in Figure 11. The RAM address is expressed in terms of the indices *j* and *i* for reading or writing the *i*-th segment of the weight vector **w***j*(*n*).
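The text specifies only that the Buffer C address is formed from the indices *j* and *i*; the mapping below is a purely hypothetical illustration of one such encoding (row-major over the weight-vector index and the segment index) and is not taken from the design.

```python
def buffer_c_address(j, i, b):
    """Hypothetical Buffer C address of the i-th segment of w_j(n).

    Assumes zero-based indices and b segments per weight vector; the actual
    address encoding of the proposed design is not specified here.
    """
    return j * b + i
```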

Fig. 12. The architecture of PCC unit.

#### **3.3 PCC unit**

The PCC operations are based on eq.(1). Therefore, the PCC unit of the proposed architecture contains adders and multipliers. Figure 12 shows the architecture of the PCC unit. The training vector **x**(*n*) and the synaptic weight vector **w***j*(*n*) are obtained from Buffer A and Buffer C of the memory unit, respectively. When both **x**(*n*) and **w***j*(*n*) are available, the PCC unit computes *yj*(*n*). Note that, after *yj*(*n*) is obtained, the SWU unit can then compute **w***j*(*n* + 1). Figure 13 shows the timing diagram of the proposed architecture. It can be observed from Figure 13 that *yj*+1(*n*) and **w***j*(*n* + 1) are computed concurrently, so the throughput of the proposed architecture is effectively enhanced.

Fig. 13. The timing diagram of the proposed architecture.
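A simple software analogue of the PCC computation is the segment-wise inner product below. It assumes the same *q*-element segmentation used by the SWU unit (matching the *q* sub-buffers of Buffer A); the function name is illustrative.

```python
import numpy as np

def pcc_output(w_j, x, q):
    """Segment-wise evaluation of eq.(1) for one neuron j.

    Accumulates q products per step, mirroring a datapath with q multipliers
    followed by an adder tree and an accumulator.
    """
    acc = 0.0
    for k in range(x.size // q):          # one segment per step
        s = slice(k * q, (k + 1) * q)
        acc += float(w_j[s] @ x[s])
    return acc                            # y_j(n)
```

Because *yj*+1(*n*) depends only on **w***j*+1(*n*) and **x**(*n*), it can be computed while the SWU unit is still producing **w***j*(*n* + 1); this is the overlap shown in Figure 13.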

Fig. 14. The SOPC system for implementing GHA.

Fig. 15. The first set of textures for the experiments.

Figures 17 and 18 show the distribution of classification success rates (CSR) of the proposed architecture for the texture sets in Figures 15 and 16, respectively. The classification success rate is defined as the number of test vectors that are successfully classified divided by the total number of test vectors. The number of principal components is *p* = 4. The vector dimension is *m* = 16 × 16. The distribution is based on 20 independent GHA training processes. The distribution of the architecture presented in (Lin et al., 2011) with the same *p* is also included for comparison purposes. The vector dimension for (Lin et al., 2011) is *m* = 4 × 4.
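Stated as code, the CSR is simply the fraction of test vectors that receive the correct label (the function name is illustrative):

```python
def classification_success_rate(predicted_labels, true_labels):
    """Number of successfully classified test vectors divided by the total."""
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)
```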

