Boundary Layer Flows - Theory, Applications and Numerical Methods

Solving Partial Differential Equation Using FPGA Technology

DOI: http://dx.doi.org/10.5772/intechopen.84588

Based on the templates in (18)–(20), we can design a circuit architecture for the CNN chip. It is a three-layer 2D CNN, so an arithmetic unit can be built for each layer, together with the links between layers that let the chip perform the calculation in parallel. Figure 9 shows the architecture of layer h and layer u (layer v is similar to u).

Figure 9. Logic architecture of cell of h, u.

## 3.4 Proposed system architecture for MxN CNN

The empirical problems that need a solution are as follows: first, identifying the boundary points of the whole difference grid (space); second, dividing the entire computing space into smaller subspaces, where the division and combination of boundary areas must be performed correctly so that no incorrect results arise at each computing time step; third, controlling real-time data exchange and combining sequential and parallel computing in a CNN chip. The CNN chip proposed in this chapter solves problems similar to those addressed in previous work [4, 5]. The new issues here are dividing the computing space, processing the dynamic sub-boundaries, and combining sequential and parallel computing.

### 3.4.1 General MxN CNN

Each CNN cell has its own data element and a core that performs the computing function. The CNN has MxN cells, of which only the (M-2)x(N-2) interior cells have computing functions; the CNN therefore has MxN data elements and (M-2)x(N-2) cores (Figure 10).

Figure 10. General architecture of a CNN chip.

The Buffer supplies MxN data elements to the CNN. Each set of MxN data elements is called one block of data (Figure 11).

The white area holds the data elements for the CNN boundary cells, and the gray area is the data to be processed by the CNN. The CNN arithmetic unit has (M-2)x(N-2) cells, which process the data in the gray area inside the Buffer unit.

The Input memory holds PxQ blocks of data. It is a true dual-port memory.

The Temp memory also holds PxQ blocks of data. It is a simple dual-port memory, used to temporarily store data computed by the CNN core and to supply data to the Boundary updating unit.
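To make the split between MxN data elements and (M-2)x(N-2) computing cores concrete, the following is a minimal Python sketch of one synchronous update of a single block. It assumes a generic 3x3 template applied to every interior cell; the function name `cnn_block_step` and the `AVERAGE` template are hypothetical placeholders, not the templates (18)–(20) used by the chip.

```python
from typing import List

Block = List[List[float]]
Template = List[List[float]]


def cnn_block_step(block: Block, template: Template) -> Block:
    """One synchronous update of an MxN block (hypothetical model).

    Only the (M-2)x(N-2) interior cells compute; the outer ring of boundary
    cells is copied through unchanged.
    """
    M, N = len(block), len(block[0])
    if M < 3 or N < 3:
        raise ValueError("a block needs at least one interior cell")
    out = [row[:] for row in block]  # new state; boundary ring kept as-is
    for i in range(1, M - 1):
        for j in range(1, N - 1):
            # weighted sum over the 3x3 neighbourhood, read from the OLD state
            out[i][j] = sum(
                template[di][dj] * block[i + di - 1][j + dj - 1]
                for di in range(3)
                for dj in range(3)
            )
    return out


# Placeholder template (average of the four neighbours),
# NOT one of the chapter's templates (18)-(20).
AVERAGE = [[0.0, 0.25, 0.0],
           [0.25, 0.0, 0.25],
           [0.0, 0.25, 0.0]]

block = [[0.0] * 4,
         [0.0, 4.0, 8.0, 0.0],
         [0.0] * 4]  # M = 3, N = 4 -> 1x2 computing cells
print(cnn_block_step(block, AVERAGE)[1])  # [0.0, 2.0, 1.0, 0.0]
```

Because every interior cell reads only the old block and writes into a separate copy, the interior cells could be evaluated in any order, which mirrors the (M-2)x(N-2) cores working in parallel; the boundary ring (the white cells) is left untouched.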


Figure 11. Buffer (MxN) for CNN core.

The data to be processed, sent from the PC, have size mxn (Figure 12).

Assume that m = 5, n = 6, M = 3, and N = 4; the white part is the boundary and the gray part is the area to be processed. Before the data are processed, temporary vertical and horizontal boundaries need to be added, as shown in Figure 13: column (0,3) and row (3,0).

The temporary vertical and horizontal boundaries are added so that the data take the same structure as the CNN buffer. After the temporary boundaries have been added, the data are sent to the Input memory. The blocks of data in the Input memory unit (for mxn = 5x6, MxN = 3x4) are detailed in Figure 14.


Figure 12. Computing space with main boundary.


Figure 13. Dividing the computing space into subspaces with sub-boundaries.
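The division into subspaces can be sketched in Python as follows. This is an assumption-based model, not the chip's actual addressing scheme: it takes P = (m-2)/(M-2) block rows and Q = (n-2)/(N-2) block columns, lets neighbouring MxN blocks overlap by two rows or columns (so the shared cells play the role of the temporary sub-boundaries), and numbers blocks row by row as p*Q + q. The function name `split_into_blocks` is hypothetical.

```python
from typing import Dict, List, Tuple

Grid = List[List[float]]


def split_into_blocks(grid: Grid, M: int, N: int) -> Tuple[int, int, Dict[int, Grid]]:
    """Cut an mxn grid into P x Q overlapping MxN blocks (hypothetical model).

    Neighbouring blocks share two rows/columns; those shared cells act as the
    temporary sub-boundaries. Blocks are addressed row by row as p*Q + q.
    """
    m, n = len(grid), len(grid[0])
    if (m - 2) % (M - 2) or (n - 2) % (N - 2):
        raise ValueError("interior must tile exactly into (M-2)x(N-2) pieces")
    P, Q = (m - 2) // (M - 2), (n - 2) // (N - 2)
    blocks: Dict[int, Grid] = {}
    for p in range(P):
        for q in range(Q):
            r0, c0 = p * (M - 2), q * (N - 2)  # top-left corner of block (p, q)
            # slicing copies the cells, so each block owns its boundary ring
            blocks[p * Q + q] = [row[c0:c0 + N] for row in grid[r0:r0 + M]]
    return P, Q, blocks


grid = [[float(10 * i + j) for j in range(6)] for i in range(5)]  # m = 5, n = 6
P, Q, blocks = split_into_blocks(grid, M=3, N=4)
print(P, Q, len(blocks))  # 3 2 6
print(blocks[3])  # [[12.0, 13.0, 14.0, 15.0], [22.0, 23.0, 24.0, 25.0], [32.0, 33.0, 34.0, 35.0]]
```

For the running example (m = 5, n = 6, M = 3, N = 4) the sketch yields P = 3, Q = 2 and six 3x4 blocks, and every interior cell of the 5x6 grid falls inside exactly one block's (M-2)x(N-2) computing area; the duplicated cells are what must be kept consistent between blocks.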

Figure 14. The blocks of data in the Input memory for mxn = 5x6, MxN = 3x4.

Here 0, 1, 2, ..., 6 are the addresses of the blocks. In general, the numbers of block rows P and block columns Q satisfy

$$P \times Q = \frac{m-2}{M-2} \times \frac{n-2}{N-2}$$

so for mxn = 5x6 and MxN = 3x4 we have P = (5-2)/(3-2) = 3 and Q = (6-2)/(4-2) = 2.

The detailed structure of the Boundary updating unit (for MxN = 3x4) is shown in Figure 15.

Figure 15. The Boundary updating structure (MxN = 3x4).

The control unit controls the activities of the whole system according to the following algorithm:

(1) At every posedge of clk do

(2) {

(3) if (has IO event)

(4) do the IO task;

(5) else

(6) buffer = read(Input memory);

(7) if (finish computing the first block)

(8) if (BoundaryUpdating())

(9) write(Input memory);

(10) }

### 3.4.2 Proposed CNN architecture when M = 3 (3xN CNN)

The 3xN CNN architecture is similar to the general MxN CNN architecture with M = 3. To reduce memory consumption and simplify the Boundary updating unit, it differs from the general design in a few respects (Figure 16).

Figure 16. The architecture of 3xN CNN chip.

Each block of data in the memory (Input memory or Temp memory) consists of 1xN data elements. Assume that the data to be processed, sent from the PC, have size mxn with m = 5 and n = 6, and that N = 4. As mentioned above, the data are processed after the temporary vertical boundaries have been added, so the Input memory unit holds 5x2 blocks of data (m = 5, Q = 2), as shown in Figure 17.

Each block has a size of 1x4 data elements.

Figure 17. The memory with 5x2 blocks (m = 5, n = 6, N = 4).

The Buffer unit is a shift-up register of size 3xN. Its input and output have sizes 1xN and 3xN, respectively, and the input is at the bottom.

The Input memory has m rows and Q columns of blocks of data. The control unit reads the blocks in the Input memory vertically (down each column) and puts each block of data into the input of the Buffer, which then shifts up one step. After step 3, the Buffer holds 3xN data elements to supply to the CNN core, and after each subsequent step it again holds 3xN data elements for the CNN core (Figure 18).
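The read-and-shift sequence of the 3xN design can be modelled in a few lines of Python. The sketch assumes the Input memory is a list of m rows, each holding Q blocks of N values, and that the Buffer starts empty for every block column; the names `ShiftUpBuffer` and `windows_for_column` are hypothetical.

```python
from collections import deque
from typing import Deque, List, Optional

Row = List[float]


class ShiftUpBuffer:
    """Hypothetical model of the 3xN shift-up register.

    A 1xN block enters at the bottom; every stored row moves up one place and
    the top row falls out once the register is full.
    """

    def __init__(self, n: int, depth: int = 3) -> None:
        self.n = n
        self.rows: Deque[Row] = deque(maxlen=depth)

    def push(self, row: Row) -> Optional[List[Row]]:
        if len(row) != self.n:
            raise ValueError("each block must hold exactly N elements")
        self.rows.append(list(row))  # oldest (top) row is discarded when full
        if len(self.rows) == self.rows.maxlen:
            return [list(r) for r in self.rows]  # 3xN window, top row first
        return None  # fewer than 3 rows loaded: nothing for the CNN core yet


def windows_for_column(memory: List[List[Row]], q: int, n: int) -> List[List[Row]]:
    """Read block column q of the Input memory top to bottom (vertically)."""
    buf = ShiftUpBuffer(n)  # assumption: buffer restarts for every column
    windows = []
    for p in range(len(memory)):  # one push per step
        window = buf.push(memory[p][q])
        if window is not None:
            windows.append(window)
    return windows


# m = 5 rows, Q = 2 block columns, N = 4 elements per block
memory = [[[float(10 * p + 4 * q + k) for k in range(4)] for q in range(2)]
          for p in range(5)]
print(len(windows_for_column(memory, q=0, n=4)))  # 3
```

With m = 5 the model produces m - 2 = 3 windows per block column, the first one after the third push, which is the behaviour described for the Buffer from step 3 onward. Using `deque(maxlen=3)` gives the "drop the top row" behaviour of a shift-up register without explicit index shuffling.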

The output of the CNN core has a size of 1xN (Figure 19).

The Boundary updating unit is shown in Figure 20.
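The text specifies the Boundary updating unit only through its figures, so the following Python sketch is one plausible reading rather than the chip's logic. It assumes two horizontally adjacent MxN blocks whose origins are N-2 columns apart, so that they overlap by two columns: each block's outer column is then a copy of an interior column of its neighbour and has to be refreshed after both interiors are recomputed. The helper name `update_vertical_subboundary` is hypothetical.

```python
from typing import List

Block = List[List[float]]


def update_vertical_subboundary(left: Block, right: Block) -> None:
    """Refresh the shared vertical sub-boundary of two neighbouring blocks.

    Hypothetical helper. Assumes both blocks are MxN, the right block starts
    N-2 columns after the left one (so they overlap by two columns), and the
    interiors of both have just been recomputed. Cells on the top and bottom
    rows belong to vertical neighbours and are not touched here.
    """
    N = len(left[0])
    for r in range(1, len(left) - 1):
        # left block's right boundary = right block's first interior column
        left[r][N - 1] = right[r][1]
        # right block's left boundary = left block's last interior column
        right[r][0] = left[r][N - 2]


left = [[0.0] * 4, [0.0, 1.0, 2.0, 0.0], [0.0] * 4]
right = [[0.0] * 4, [0.0, 3.0, 4.0, 0.0], [0.0] * 4]
update_vertical_subboundary(left, right)
print(left[1], right[1])  # [0.0, 1.0, 2.0, 3.0] [2.0, 3.0, 4.0, 0.0]
```

The two assignments write disjoint cells, so their order does not matter; in hardware they could happen in the same clock cycle.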

Figure 18. The Buffer's state after each step (m = 5, n = 6, N = 4).

Figure 19. The output size of CNN core (N = 4).

Figure 20. The Boundary updating structure (N = 4).

The control unit follows the control algorithm below:

(1) At every posedge of clk do

(2) {

(3) if (has IO event)

(4) do the IO task;

(5) else

(6) buffer = read(Input memory); // read vertically

(7) if (finish computing the first block of column q)

(8) if (column\_of\_current\_block == 0) write(Temp memory); else BoundaryUpdating(CNNoutput, read(Temp memory));

(9) write(Input memory);

(10) }

## 3.5 Implementation

In this part, we implement the 3xN CNN. Q, m, and N are parameters that can be configured before compiling the design and programming it into the FPGA chip. By default, we assign Q = 2, m = 8, and N = 4.

### 3.5.1 Development environment

For the experiments, ISE Design Suite version 14.7 and the ML605 evaluation board, which carries the Virtex 6 chip XC6VLX240T-1FFG1156, are used to implement the CNN design.

First, we describe the CNN architecture in Verilog HDL. Then, we verify the system with the ISim simulator. Finally, we program the system into the FPGA chip on the ML605 board.

Figure 21 shows the experimental system.

Figure 21. The Virtex 6 chip (XC6VLX240T-1FFG1156) connected to a PC for configuring it as a CNN chip and performing the calculation.

### 3.5.2 Input data for h, u, v values

The input of the CNN that solves the Navier-Stokes equation consists of the h, u, v values. We use three Input memory units, three Buffer units, and three Temporary memory units to store the h, u, v values. Each data element is represented as a 32-bit floating-point real number. The h, u, v data are extended with temporary boundaries, as detailed in Figure 22 (values shown in decimal and as single-precision floating-point hex).
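Figure 22 lists the initial values both in decimal and as single-precision hex, and the memories are initialized from COE files. The Python sketch below shows one way to produce those hex words with the standard `struct` module and to wrap them in the Xilinx COE syntax (`memory_initialization_radix` and `memory_initialization_vector`). It assumes one 32-bit element per memory word; the helper names `float_to_hex` and `make_coe` are hypothetical.

```python
import struct
from typing import Iterable, List


def float_to_hex(x: float) -> str:
    """IEEE-754 single-precision bit pattern of x as 8 upper-case hex digits."""
    return struct.pack(">f", x).hex().upper()  # big-endian: MSB first


def make_coe(values: Iterable[float]) -> str:
    """Build a COE file body with one 32-bit word per value (assumption)."""
    words: List[str] = [float_to_hex(v) for v in values]
    if not words:
        raise ValueError("the initialization vector needs at least one value")
    return ("memory_initialization_radix=16;\n"
            "memory_initialization_vector=\n"
            + ",\n".join(words) + ";\n")


print(float_to_hex(1.0), float_to_hex(0.1))  # 3F800000 3DCCCCCD
print(make_coe([1.0, -2.5]), end="")
```

Packing with `">f"` rounds each decimal value to the nearest single-precision number, which is why 0.1 becomes 3DCCCCCD rather than an exact value. If the memories actually store a whole block per word, the per-element words would have to be concatenated before writing the COE vector.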

Figure 22. Initial data for the Input memory h, u, v.

The interface of each Input memory and Temporary memory for h, u, v is configured as shown in Figure 23. The initial data for the Input memories h, u, v are loaded from COE files; a COE file stores the initial values of a memory (Figure 24).

### 3.5.3 Shift up register

### 3.5.4 CNN core

### 3.5.5 Boundary updating
