**4. Genaral DCT algorithm**

The DCT for real signal *xn* gives independent spectral coefficients for k = 0,1,...,N-1, changing *fk* also from zero to *fsampl* <sup>2</sup> but with *fsampl* <sup>2</sup>*<sup>N</sup>* grid. DCT vs. DFT gives twice better resolution. Splitting the sum (1) and redefine the indices we get:

$$\bar{X}\_k = a\_k \left\{ \sum\_{n=0}^{\frac{N}{2}-1} \left( x\_n \cos \left[ \frac{\pi}{N} \left( n + \frac{1}{2} \right) k \right] - x\_{N-1-n} \cos \left[ \frac{\pi}{N} \left( N - \left( n + \frac{1}{2} \right) \right) k \right] \right) \right\} \tag{10}$$

Due to symmetry of the cosine function

$$\bar{X}\_k = a\_k \sum\_{n=0}^{\frac{N}{2}-1} \cos\left[\frac{\pi}{N}\left(n + \frac{1}{2}\right)k\right] \times \left(\mathbf{x}\_n + (-1)^k \mathbf{x}\_{N-1-n}\right) \tag{11}$$

We can introduce the new set of variables:

$$A\_{\rm ll} = \begin{cases} \mathbf{x}\_{\rm n} + \mathbf{x}\_{N-1-n} & (n = 0, \dots, \frac{N}{2} - 1) \\ \mathbf{x}\_{N-1-n} - \mathbf{x}\_{\rm n} & (n = \frac{N}{2}, \dots, N-1) \end{cases} \tag{12}$$

DCT coefficients can be separated for even and odd indices respectively:

$$\mathcal{R}\_{\binom{\text{even}}{\text{odd}}} = \mathfrak{a}\_k \sum\_{n=\binom{\text{0}}{N-1}}^{\frac{N}{2}-\binom{1}{0}} A\_{\mathcal{U}} \cos\left[\frac{\pi}{N}\left(n+\frac{1}{2}\right)k\right] \tag{13}$$

Let us notice that (13) for even indices has the same structure as (1) with only shorter range of indices. Recurrently we can introduce new sets of variables for the set of indices k = 2p, where

Experiments 9

<sup>387</sup> An Optimization of 16-Point Discrete Cosine Transform Implemented into a FPGA as a Design for a Spectral First Level Surface Detector Trigger in Extensive Air Shower Experiments

> 101110 1 0 −1 1 −1 0 1 −1 0 −101 110 −1 0 −1

A direct approach from the classical definition requires: a single multiplication for even indices (20) and 5 multiplications for odd indices (22). The scaled coefficients *S*1,7,3,5*X*¯ 1,7,3,5 in (22) can be expressed in an equivalent way introduced by Arai, Agui, Nakajima (AAN,

> 1110 −1 1 1 −101 1 −1 0 −1 1 1 −101 −1

   

*B*7 *S*4*B*<sup>5</sup> (*S*<sup>2</sup> + *S*6)*B*<sup>6</sup> (*S*<sup>2</sup> − *S*6)*B*<sup>4</sup> *S*6(*B*<sup>6</sup> − *B*4)  

   

*B*7 *S*2*B*<sup>4</sup> *S*6*B*<sup>4</sup> *S*4*B*<sup>5</sup> *S*2*B*<sup>6</sup> *S*6*B*<sup>6</sup>  

(22)

(23)

For odd indices with a support of (15) we get:

4

4

  *S*1*X*¯ <sup>1</sup> *S*7*X*¯ <sup>7</sup> *S*3*X*¯ <sup>3</sup> *S*5*X*¯ <sup>5</sup>  =

  *S*1*X*¯ <sup>1</sup> *S*7*X*¯ <sup>7</sup> *S*3*X*¯ <sup>3</sup> *S*5*X*¯ <sup>5</sup>  =  

1988)., which allows reducing an amount of multiplications from 5 to 4 only.

Fig. 4. A fast DCT algorithm developed in 1988 by Arai, Agui and Nakajima

A minimization of multiplications amounts is one of a fundamental goal in long-term numerical calculations. Reduction of product terms significantly speed up sophisticated calculations, because a single multiplication requires several clock cycles of processor. Multiplications in powerful FPGA chips can be however performed in very fast dedicated DSP blocks in a single clock cycle. Signals processed in parallel threads in a hardware implementation of a pipeline design have to be synchronized to each other. Pipeline approach requires additional shift registers for synchronization also for signal currently not being processed. However, such synchronization needs additional resources. Fig. 5 shows the part of pipeline chain corresponding to odd indices of DCT coefficients (lower part in Fig. 4).

 

p is integer, till k < N. In order to use symmetry of trigonometric functions in a maximal way, N should be a power of 2, similarly to Radix-2 approach used in FFT algorithm. If N = 2q, recurrent minimization is possible till p = q. The twiddle factors for successive minimization steps m equal to *cos* <sup>2</sup>*<sup>π</sup>* <sup>2</sup>*<sup>q</sup>* <sup>2</sup>*p*+*<sup>m</sup>* 2 = −1 , because the sum of step index m and range factor p is constant and equals to q. For the rest of indices twiddle factor depends on fractional angle *α* = *<sup>π</sup>*2*q*−*m*−<sup>1</sup> *<sup>N</sup>* .

After the 1st step of minimization, the terms of the sum (13) for odd indices depends only on the odd multiplicity of the fractional angle

$$\bar{X}\_k = \alpha\_k \sum\_{n=N-1}^{\frac{N}{2}-1} A\_n \cos \left[ \frac{\pi}{2N} \left( 2n+1 \right) k \right] \tag{14}$$

Using a following trigonometric identity

$$\cos \mathfrak{a} = \frac{1}{2 \cos \mathfrak{\mathfrak{z}}} (\cos(\mathfrak{a} + \mathfrak{z}) + \cos(\mathfrak{a} - \mathfrak{z})) \tag{15}$$

the fractional angles can be increased by the factor of 2 for *β* = *<sup>k</sup><sup>π</sup>* <sup>2</sup>*<sup>N</sup>* . Thus:

$$\bar{X}\_k = \frac{a\_k}{2\cos\left(\frac{k\pi}{2N}\right)} \sum\_{n=N-1}^{N/2} A\_n \times \left[ \cos\left(\frac{k\pi}{N}(n+1)\right) + \cos\left(\frac{k\pi}{N}n\right) \right] \tag{16}$$

Let us notice that:

1). *cos*(*kπ*)=(−1)*k*, for n = N-1, hence pure *An* coefficient survives,

2). *cos*( *<sup>k</sup><sup>π</sup>* <sup>2</sup> ) = 0, for *<sup>n</sup>* <sup>=</sup> *<sup>N</sup>* <sup>2</sup> because of odd k,

3). the rest of indices appear in cosine terms twice in *An*+<sup>1</sup> and *An* coefficients, which allows introducing the new set of variables

$$B\_{N-1} = A\_{N-1}$$

$$B\_{N-1-n} = A\_{N-n} + A\_{N-1-n} \tag{17}$$

The range of *Bn* indices is continuous and can be split again on even and odd parts. The above procedure can be repeated in recurrence.

#### **5. 8-point DCT algorithm**

For N = 8 according to formulae (12) and (17) we get :

$$A\_{0,1,2,3} = \mathbf{x}\_{0,1,2,3} + \mathbf{x}\_{7,6,5,4} \qquad \qquad A\_{7,6,5,4} = \mathbf{x}\_{0,1,2,3} - \mathbf{x}\_{7,6,5,4} \tag{18}$$

$$B\_{0,1} = A\_{0,1} + A\_{3,2} \qquad B\_{2,3} = A\_{0,1} - A\_{3,2} \qquad B\_{6,5,4} = A\_{7,6,5} + A\_{6,5,4} \qquad B\_7 = A\_7 \tag{19}$$

For even indices the DCT coefficients are expressed as follows:

$$2\sqrt{2}\begin{bmatrix} \check{X}\_0\\ \check{X}\_4 \end{bmatrix} = \begin{bmatrix} 1 & 1\\ 1 & -1 \end{bmatrix} \begin{bmatrix} B\_0\\ B\_1 \end{bmatrix} \qquad\qquad 4\begin{bmatrix} S\_2\check{X}\_2\\ S\_6\check{X}\_6 \end{bmatrix} = B\_3 + S\_4 \begin{bmatrix} 1 & 1\\ -1 & -1 \end{bmatrix} \begin{bmatrix} B\_3\\ B\_2 \end{bmatrix} \tag{20}$$

where

$$S\_k = \cos\left(\frac{k\pi}{16}\right) \tag{21}$$

For odd indices with a support of (15) we get:

4

8 Will-be-set-by-IN-TECH

p is integer, till k < N. In order to use symmetry of trigonometric functions in a maximal way, N should be a power of 2, similarly to Radix-2 approach used in FFT algorithm. If N = 2q, recurrent minimization is possible till p = q. The twiddle factors for successive minimization

is constant and equals to q. For the rest of indices twiddle factor depends on fractional angle

After the 1st step of minimization, the terms of the sum (13) for odd indices depends only on

*Ancos*

*π*

<sup>2</sup>*<sup>N</sup>* (2*<sup>n</sup>* <sup>+</sup> <sup>1</sup>) *<sup>k</sup>*

*<sup>N</sup>* (*<sup>n</sup>* <sup>+</sup> <sup>1</sup>)

*A*0,1,2,3 = *x*0,1,2,3 + *x*7,6,5,4 *A*7,6,5,4 = *x*0,1,2,3 − *x*7,6,5,4 (18)

= *B*<sup>3</sup> + *S*<sup>4</sup>

 1 1 −1 −1  *B*<sup>3</sup> *B*2 

(20)

(21)

= −1 , because the sum of step index m and range factor p

<sup>2</sup>*cos<sup>β</sup>* (*cos*(*<sup>α</sup>* <sup>+</sup> *<sup>β</sup>*) + *cos*(*<sup>α</sup>* <sup>−</sup> *<sup>β</sup>*)) (15)

 + *cos*

*BN*−1−*<sup>n</sup>* = *AN*−*<sup>n</sup>* + *AN*−1−*<sup>n</sup>* (17)

<sup>2</sup>*<sup>N</sup>* . Thus:

 *kπ N n*  (14)

(16)

steps m equal to *cos*

Let us notice that:

2). *cos*( *<sup>k</sup><sup>π</sup>*

*α* = *<sup>π</sup>*2*q*−*m*−<sup>1</sup> *<sup>N</sup>* . <sup>2</sup>*<sup>π</sup>* <sup>2</sup>*<sup>q</sup>* <sup>2</sup>*p*+*<sup>m</sup>* 2 

the odd multiplicity of the fractional angle

Using a following trigonometric identity

*<sup>X</sup>*¯ *<sup>k</sup>* <sup>=</sup> *<sup>α</sup><sup>k</sup>* 2*cos <sup>k</sup><sup>π</sup>* 2*N* 

<sup>2</sup> ) = 0, for *<sup>n</sup>* <sup>=</sup> *<sup>N</sup>*

introducing the new set of variables

procedure can be repeated in recurrence.

For N = 8 according to formulae (12) and (17) we get :

For even indices the DCT coefficients are expressed as follows:

 *B*<sup>0</sup> *B*1 

**5. 8-point DCT algorithm**

2 √ 2 *X*¯ 0 *X*¯ 4 = 1 1 1 −1

where

*X*¯ *<sup>k</sup>* = *α<sup>k</sup>*

*cos<sup>α</sup>* <sup>=</sup> <sup>1</sup>

the fractional angles can be increased by the factor of 2 for *β* = *<sup>k</sup><sup>π</sup>*

*N*/2 ∑ *n*=*N*−1

1). *cos*(*kπ*)=(−1)*k*, for n = N-1, hence pure *An* coefficient survives,

<sup>2</sup> because of odd k,

*An* × *cos kπ*

*BN*−<sup>1</sup> = *AN*−<sup>1</sup>

3). the rest of indices appear in cosine terms twice in *An*+<sup>1</sup> and *An* coefficients, which allows

The range of *Bn* indices is continuous and can be split again on even and odd parts. The above

*B*0,1 = *A*0,1 + *A*3,2 *B*2,3 = *A*0,1 − *A*3,2 *B*6,5,4 = *A*7,6,5 + *A*6,5,4 *B*<sup>7</sup> = *A*<sup>7</sup> (19)

4 *S*2*X*¯ <sup>2</sup> *S*6*X*¯ <sup>6</sup>

 *kπ* 16 

*Sk* = *cos*

*N* <sup>2</sup> −1 ∑ *n*=*N*−1

$$
\begin{bmatrix} S\_1 \bar{X}\_1 \\ S\_7 \bar{X}\_7 \\ S\_3 \bar{X}\_3 \\ S\_5 \bar{X}\_5 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 & 1 & 1 & 0 \\ 1 & 0 & -1 & 1 & -1 & 0 \\ 1 & -1 & 0 & -1 & 0 & 1 \\ 1 & 1 & 0 & -1 & 0 & -1 \end{bmatrix} \begin{bmatrix} B\_7 \\ S\_2 B\_4 \\ S\_6 B\_4 \\ S\_4 B\_5 \\ S\_2 B\_6 \\ S\_6 B\_6 \end{bmatrix} \tag{22}
$$

A direct approach from the classical definition requires: a single multiplication for even indices (20) and 5 multiplications for odd indices (22). The scaled coefficients *S*1,7,3,5*X*¯ 1,7,3,5 in (22) can be expressed in an equivalent way introduced by Arai, Agui, Nakajima (AAN, 1988)., which allows reducing an amount of multiplications from 5 to 4 only.

$$
\mathbf{4} \begin{bmatrix} S\_1 \bar{X}\_1 \\ S\_7 \bar{X}\_7 \\ S\_3 \bar{X}\_3 \\ S\_5 \bar{X}\_5 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 0 & -1 \\ 1 & 1 & -1 & 0 & 1 \\ 1 & -1 & 0 & -1 & 1 \\ 1 & -1 & 0 & 1 & -1 \end{bmatrix} \begin{bmatrix} B\_7 \\ S\_4 B\_5 \\ (S\_2 + S\_6) B\_6 \\ (S\_2 - S\_6) B\_4 \\ S\_6 (B\_6 - B\_4) \end{bmatrix} \tag{23}
$$

Fig. 4. A fast DCT algorithm developed in 1988 by Arai, Agui and Nakajima

A minimization of multiplications amounts is one of a fundamental goal in long-term numerical calculations. Reduction of product terms significantly speed up sophisticated calculations, because a single multiplication requires several clock cycles of processor. Multiplications in powerful FPGA chips can be however performed in very fast dedicated DSP blocks in a single clock cycle. Signals processed in parallel threads in a hardware implementation of a pipeline design have to be synchronized to each other. Pipeline approach requires additional shift registers for synchronization also for signal currently not being processed. However, such synchronization needs additional resources. Fig. 5 shows the part of pipeline chain corresponding to odd indices of DCT coefficients (lower part in Fig. 4).

Experiments 11

<sup>389</sup> An Optimization of 16-Point Discrete Cosine Transform Implemented into a FPGA as a Design for a Spectral First Level Surface Detector Trigger in Extensive Air Shower Experiments

Routines E and F from Fig. 5 have been merged into single routine E (Fig. 6) to short an

D45p64=C45+C64 S3X3= D7p66-D45p24

D7m26=C7-C26

D7m66=C7-C66

D45m64=C45-C64 D7p66=C7+C66

D45m64=C45-C24

D45p24=C45+C24

S1X1= D7p26+ D45p64

S7X7= D7m26+D45m64

S5X5= D7m66-D45m64

<sup>32</sup> gives:

(26)

(27)

B6=A6+A7 D7p26=C7+C26

Fig. 7. Optimized, shorter pipeline chain based on the classical approach. The reduction of

A classical approach reduces a length of the chain from 6 to 5 stages only, at the cost of one additional multipliers. An abridgement of the pipeline chain and in a consequence a reduction of the shift registers needed for synchronization allows saving significant amount of logic blocks, especially for wide data bus. In order to reduce an approximation errors, the data bus

The 16-point DCT algorithm will be implemented according to the classical approach with an optimization of the number of pipeline stages at the cost of an utilization of embedded multipliers (Szadkowski, 2009). The 1st and the 2nd pipeline stages utilize the set of variables (12) and (17) respectively. For N = 16 the fractional angle of the twiddle factor in the 1st step of minimization equals to *β* = *π* . The same fractional angle corresponds to the 2nd step of

The scaling procedure used for odd indices of *X*¯ *<sup>k</sup>* with the fractional angles *β* = *<sup>k</sup><sup>π</sup>*

Coefficients *X*¯ *<sup>k</sup>* for even indices can be expressed by variables (24) and scaling factor (21)

*S*<sup>4</sup> *S*<sup>4</sup> *S*<sup>4</sup> *S*<sup>4</sup> *S*<sup>4</sup> −*S*<sup>4</sup> −*S*<sup>4</sup> *S*<sup>4</sup> *S*<sup>2</sup> *S*<sup>6</sup> −*S*<sup>6</sup> −*S*<sup>2</sup> *S*<sup>6</sup> −*S*<sup>2</sup> *S*<sup>2</sup> −*S*<sup>6</sup>

*S*<sup>7</sup> *S*<sup>5</sup> *S*<sup>3</sup> *S*<sup>1</sup> −*S*<sup>1</sup> *S*<sup>3</sup> −*S*<sup>5</sup> *S*<sup>7</sup> −*S*<sup>5</sup> −*S*<sup>1</sup> −*S*<sup>7</sup> *S*<sup>3</sup> *S*<sup>3</sup> *S*<sup>7</sup> −*S*<sup>1</sup> *S*<sup>5</sup>

 

 

*B*0,1,2,3 = *A*0,1,2,3 + *A*7,6,5,4 *B*4,5,6,7 = *A*3,2,1,0 − *A*4,5,6,7 (24)

*B*<sup>15</sup> = *A*<sup>15</sup> *B*14,...,8 = *A*15,...,9 + *A*14,...,8 (25)

 

   

*B*4 *B*5 *B*6 *B*7  

 

*B*0 *B*1 *B*2 *B*3  

C7 = B7

C66=S6B6 C26=S2B6

C45=S4B5

C64=S6B4 C24=S2B4

amount of pipeline stages and remove unnecessary shift registers.

B7 = A7

in the intermediate stages is enlarged.

minimization for even indices corresponded to *An*.

 

 

*X*¯ 0 *X*¯ 8 *X*¯ 4 *X*¯ <sup>12</sup>

*X*¯ 2 *X*¯ <sup>14</sup> *X*¯ 6 *X*¯ <sup>10</sup>  <sup>=</sup> <sup>1</sup> 2 √2

 <sup>=</sup> <sup>1</sup> 2 √2

**6. 16-point DCT algorithm**

the length of the chain at the cost of an additional multiplier.

A7=x0-x7

A6=x1-x6

A5=x2-x5

A4=x3-x4

B5=A5+A6

B4=A4+A5

Fig. 5. The AAN algorithm limited to indices4-7 only with a time-oriented structure. Adders, sub-tractors, multipliers and shift registers are marked by the following colours: blue, gray, black and green, respectively. Red colour corresponds to routines requiring a cascade processes.

A direct implementation of the pure AAN algorithm requires 7 pipeline stages, which utilize additional resources of shift registers for synchronization for operations like: X(t+1) = X(t). In a numerical calculation in processors data are simply waiting for a next performance cycle. The *D*<sup>64</sup> block contains a cascade of the sum and the multiplication. An implementation of the cascade in a single clock FPGA logic block significantly reduce a speed. Additionally, the *lpm\_add\_sub* mega-function from the Altera® library of parameterized modules (LPM) does not support an inversion of a sum i.e. *B*<sup>4</sup> = −(*A*<sup>4</sup> + *A*5) or *E*<sup>4</sup> = −(*D*<sup>64</sup> + *D*4). These operations would have to be performed in a cascade way by an adder and a sign inversion. Cascade operations performed in the same clock cycle significantly slow down a global registered performance.

Fig. 6. Optimized AAN algorithm for indices 4 - 7. A redefinition and splitting of variables allowed a reduction of the chain length.

A simple redefinition of nodes removes difficulties mentioned above. The *B*<sup>4</sup> node defined as the sum of *A*4,5 nodes requires a simple *lpm\_add\_sub* mega-function. The *D*<sup>4</sup> node with currently inverted sign allows using *lpm\_add\_sub* in *E*<sup>4</sup> performing a subtraction. The *D*<sup>64</sup> node from Fig. 5 can be split into the subtraction *C*<sup>64</sup> and the multiplication *D*<sup>64</sup> in the next clock cycle (Fig. 6).

Routines E and F from Fig. 5 have been merged into single routine E (Fig. 6) to short an amount of pipeline stages and remove unnecessary shift registers.

Fig. 7. Optimized, shorter pipeline chain based on the classical approach. The reduction of the length of the chain at the cost of an additional multiplier.

A classical approach reduces a length of the chain from 6 to 5 stages only, at the cost of one additional multipliers. An abridgement of the pipeline chain and in a consequence a reduction of the shift registers needed for synchronization allows saving significant amount of logic blocks, especially for wide data bus. In order to reduce an approximation errors, the data bus in the intermediate stages is enlarged.
