**6. 16-point DCT algorithm**

10 Will-be-set-by-IN-TECH

A direct implementation of the pure AAN algorithm requires 7 pipeline stages, which consume additional shift registers for synchronization, i.e. for operations like X(t+1) = X(t). In a numerical calculation on a processor, the data simply wait for the next execution cycle. The *D*<sub>64</sub> block contains a cascade of a sum and a multiplication. Implementing this cascade in a single-clock FPGA logic block significantly reduces the speed. Additionally, the *lpm\_add\_sub* mega-function from the Altera® library of parameterized modules (LPM) does not support an inversion of a sum, i.e. *B*<sub>4</sub> = −(*A*<sub>4</sub> + *A*<sub>5</sub>) or *E*<sub>4</sub> = −(*D*<sub>64</sub> + *D*<sub>4</sub>). These operations would have to be performed in cascade by an adder and a sign inversion. Cascade operations performed in the same clock cycle significantly slow down the global registered performance.

Fig. 5. The AAN algorithm limited to indices 4-7 only, with a time-oriented structure. Adders, subtractors, multipliers and shift registers are marked in blue, gray, black and green, respectively. Red corresponds to routines requiring cascade processes.

A simple redefinition of nodes removes the difficulties mentioned above. The *B*<sub>4</sub> node, defined as the sum of the *A*<sub>4</sub> and *A*<sub>5</sub> nodes, requires only a simple *lpm\_add\_sub* mega-function. The *D*<sub>4</sub> node, which now has an inverted sign, allows *lpm\_add\_sub* to be used in *E*<sub>4</sub> to perform a subtraction. The *D*<sub>64</sub> node from Fig. 5 can be split into the subtraction *C*<sub>64</sub> and the multiplication *D*<sub>64</sub> in the next clock cycle (Fig. 6).

Fig. 6. Optimized AAN algorithm for indices 4-7. A redefinition and splitting of variables allowed a reduction of the chain length.
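The equivalence of the two signal-flow graphs can be checked numerically. The sketch below is only an illustration: the node equations are transcribed from the node labels of Figs. 5 and 6, the function names `fig5_flow` and `fig6_flow` are hypothetical, and the constants `s2`, `s4`, `s6` are passed in as plain numbers because the identity does not depend on their values.

```python
def _a_nodes(x):
    # First stage, common to both graphs: A7 = x0 - x7, ..., A4 = x3 - x4
    return x[0] - x[7], x[1] - x[6], x[2] - x[5], x[3] - x[4]

def fig5_flow(x, s2, s4, s6):
    """Node equations as labelled in Fig. 5 (7 stages, cascades in B4, D64, E4)."""
    a7, a6, a5, a4 = _a_nodes(x)
    b7, b6, b5, b4 = a7, a6 + a7, a5 + a6, -a4 - a5      # B4 is an inverted sum
    c7, c6, c5, c4 = b7, b6, b5, b4
    d7, d5 = c7, s4 * c5
    d6, d4 = (s2 + s6) * c6, (s2 - s6) * c4
    d64 = s6 * (c6 + c4)                                 # sum and product in one stage
    e7, e6, e5, e4 = d7, d6 - d64, d5, -d64 - d4         # E4 is an inverted sum
    f7, f6, f5, f4 = e7 - e5, e6, e5 + e7, e4
    return {"S1X1": f5 + f6, "S3X3": f7 - f4, "S5X5": f7 + f4, "S7X7": f5 - f6}

def fig6_flow(x, s2, s4, s6):
    """Node equations as labelled in Fig. 6 (redefined B4, split C64/D64)."""
    a7, a6, a5, a4 = _a_nodes(x)
    b7, b6, b5, b4 = a7, a6 + a7, a5 + a6, a4 + a5       # plain sum
    c7, c6, c5, c4 = b7, b6, b5, b4
    c64 = b6 - b4                                        # subtraction in its own stage
    d7, d5 = c7, s4 * c5
    d6, d4 = (s2 + s6) * c6, (s2 - s6) * c4              # D4 now has inverted sign
    d64 = s6 * c64                                       # multiplication one stage later
    e7, e6, e5, e4 = d7 - d5, d6 - d64, d7 + d5, d4 - d64
    return {"S1X1": e5 + e6, "S3X3": e7 - e4, "S5X5": e7 + e4, "S7X7": e5 - e6}

if __name__ == "__main__":
    samples = [12.0, -3.0, 7.5, 0.0, 4.0, -8.0, 2.5, 9.0]
    old = fig5_flow(samples, 0.9, 0.7, 0.4)
    new = fig6_flow(samples, 0.9, 0.7, 0.4)
    for name in old:
        print(name, old[name], new[name])
```

Both functions return the same four scaled outputs, while the second one needs no inverted sum and no adder-multiplier cascade within one stage.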

The 16-point DCT algorithm will be implemented according to the classical approach, with the number of pipeline stages optimized at the cost of using embedded multipliers (Szadkowski, 2009). The 1st and the 2nd pipeline stages use the sets of variables (12) and (17), respectively. For N = 16 the fractional angle of the twiddle factor in the 1st step of minimization equals *β* = *π* . The same fractional angle corresponds to the 2nd step of minimization for the even indices corresponding to *A*<sub>*n*</sub>.

$$B\_{0,1,2,3} = A\_{0,1,2,3} + A\_{7,6,5,4} \qquad B\_{4,5,6,7} = A\_{3,2,1,0} - A\_{4,5,6,7} \tag{24}$$

The scaling procedure used for odd indices of *X̄*<sub>*k*</sub> with the fractional angles *β* = *kπ*/32 gives:

$$B\_{15} = A\_{15} \qquad B\_{14,\ldots,8} = A\_{15,\ldots,9} + A\_{14,\ldots,8} \tag{25}$$

Coefficients *X̄*<sub>*k*</sub> for even indices can be expressed by variables (24) and scaling factor (21):

$$
\begin{bmatrix}
\bar{X}\_0 \\
\bar{X}\_8 \\
\bar{X}\_4 \\
\bar{X}\_{12}
\end{bmatrix} = \frac{1}{2\sqrt{2}} \begin{bmatrix}
S\_4 & S\_4 & S\_4 & S\_4 \\
S\_4 & -S\_4 & -S\_4 & S\_4 \\
S\_2 & S\_6 & -S\_6 & -S\_2 \\
S\_6 & -S\_2 & S\_2 & -S\_6
\end{bmatrix} \begin{bmatrix}
B\_0 \\
B\_1 \\
B\_2 \\
B\_3
\end{bmatrix} \tag{26}
$$

$$
\begin{bmatrix}
\bar{X}\_2 \\
\bar{X}\_{14} \\
\bar{X}\_6 \\
\bar{X}\_{10}
\end{bmatrix} = \frac{1}{2\sqrt{2}} \begin{bmatrix}
S\_7 & S\_5 & S\_3 & S\_1 \\
-S\_1 & S\_3 & -S\_5 & S\_7 \\
-S\_5 & -S\_1 & -S\_7 & S\_3 \\
S\_3 & S\_7 & -S\_1 & S\_5
\end{bmatrix} \begin{bmatrix}
B\_4 \\
B\_5 \\
B\_6 \\
B\_7
\end{bmatrix} \tag{27}
$$
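As a cross-check of (24), (26) and (27), the even-index coefficients can be compared with a direct evaluation of the DCT. This is a minimal sketch under stated assumptions: *S*<sub>*k*</sub> = cos(*kπ*/16) and the orthonormal DCT-II normalization are inferred from the prefactor 1/(2√2) and the first row of (26), the first-stage sums *A*<sub>*n*</sub> = *x*<sub>*n*</sub> + *x*<sub>15−*n*</sub> are taken from the node labels of Fig. 8, the rows for *X̄*<sub>14</sub> and *X̄*<sub>6</sub> are written out from the DCT-II definition, and the helper names are hypothetical.

```python
import math

def S(k):
    # Assumption: S_k = cos(k*pi/16), consistent with the rows of Eq. (26)
    return math.cos(k * math.pi / 16)

def dct16_reference(x):
    """Orthonormal 16-point DCT-II evaluated directly from its definition."""
    out = []
    for k in range(16):
        scale = 0.25 if k == 0 else 1 / (2 * math.sqrt(2))
        out.append(scale * sum(x[n] * math.cos((2 * n + 1) * k * math.pi / 32)
                               for n in range(16)))
    return out

def even_coefficients(x):
    """Even-index coefficients computed from the B variables of Eq. (24)."""
    A = [x[n] + x[15 - n] for n in range(8)]        # A_0..A_7 (Fig. 8 labels)
    B = [A[n] + A[7 - n] for n in range(4)]         # B_0..B_3
    B += [A[3 - n] - A[4 + n] for n in range(4)]    # B_4..B_7
    m26 = {0: (S(4), S(4), S(4), S(4)),
           8: (S(4), -S(4), -S(4), S(4)),
           4: (S(2), S(6), -S(6), -S(2)),
           12: (S(6), -S(2), S(2), -S(6))}
    m27 = {2: (S(7), S(5), S(3), S(1)),
           14: (-S(1), S(3), -S(5), S(7)),
           6: (-S(5), -S(1), -S(7), S(3)),
           10: (S(3), S(7), -S(1), S(5))}
    c = 1 / (2 * math.sqrt(2))
    X = {}
    for k, row in m26.items():
        X[k] = c * sum(r * b for r, b in zip(row, B[:4]))   # Eq. (26) uses B_0..B_3
    for k, row in m27.items():
        X[k] = c * sum(r * b for r, b in zip(row, B[4:]))   # Eq. (27) uses B_4..B_7
    return X

if __name__ == "__main__":
    x = [40, 41, 39, 180, 150, 120, 100, 85, 72, 62, 55, 50, 46, 43, 41, 40]
    ref = dct16_reference(x)
    for k, value in sorted(even_coefficients(x).items()):
        print(k, round(value, 6), round(ref[k], 6))
```

The two printed columns should agree for every even index, which confirms that only the four sums *B*<sub>0..3</sub> and the four differences *B*<sub>4..7</sub> are needed for the even half of the spectrum.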

Experiments 13

<sup>391</sup> An Optimization of 16-Point Discrete Cosine Transform Implemented into a FPGA as a Design for a Spectral First Level Surface Detector Trigger in Extensive Air Shower Experiments


After scaling according to (15) we can introduce a new set of variables for the 3rd pipeline stage:

$$4\begin{bmatrix} \bar{X}\_0 \\ \bar{X}\_8 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} C\_0 \\ C\_1 \end{bmatrix} \qquad 4\sqrt{2} \begin{bmatrix} S\_2 \bar{X}\_4 \\ S\_6 \bar{X}\_{12} \end{bmatrix} = \begin{bmatrix} 1 + S\_4 & S\_4 \\ 1 - S\_4 & -S\_4 \end{bmatrix} \begin{bmatrix} C\_3 \\ C\_2 \end{bmatrix} \tag{28}$$

$$4\sqrt{2} \begin{bmatrix} S\_1 \bar{X}\_2 \\ S\_3 \bar{X}\_6 \\ S\_5 \bar{X}\_{10} \\ S\_7 \bar{X}\_{14} \end{bmatrix} = \begin{bmatrix} 1 & S\_4 & S\_2 & S\_6 \\ 1 & -S\_4 & S\_6 & -S\_2 \\ 1 & -S\_4 & -S\_6 & S\_2 \\ 1 & S\_4 & -S\_2 & -S\_6 \end{bmatrix} \begin{bmatrix} C\_7 \\ C\_5 \\ C\_6 \\ C\_4 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 & 1 & 1 & 0 \\ 1 & -1 & 0 & -1 & 0 & 1 \\ 1 & 1 & 0 & -1 & 0 & -1 \\ 1 & 0 & -1 & 1 & -1 & 0 \end{bmatrix} \begin{bmatrix} C\_7 \\ S\_2 C\_4 \\ S\_6 C\_4 \\ S\_4 C\_5 \\ S\_2 C\_6 \\ S\_6 C\_6 \end{bmatrix} \tag{29}$$

$$\begin{aligned} C\_{0,1} &= B\_{0,1} + B\_{3,2} & C\_{3,2} &= B\_{0,1} - B\_{3,2} \\ C\_{4,5,6} &= B\_{4,5,6} + B\_{5,6,7} & C\_{7} &= B\_{7} \end{aligned} \tag{30}$$

Let us notice that the structure of the right vector in (29) is exactly the same as in (22), but the structures of the 6x4 matrices are different. In (22) the matrix comes from a transformation for the odd indices supported by (21), while in (29) the matrix comes from a transformation of even indices.

Scaled coefficients corresponding to odd indices

$$
\bar{Z}\_k = 4\sqrt{2}\bar{X}\_k \cos\left(\frac{k\pi}{32}\right) \tag{31}
$$

can be expressed by variables (25) and scaling factors (21) as follows:

$$
\begin{bmatrix} \bar{Z}\_{1,15} \\ \bar{Z}\_{3,13} \\ \bar{Z}\_{5,11} \\ \bar{Z}\_{7,9} \end{bmatrix} = \begin{bmatrix} 1 & S\_4 & S\_2 & S\_6 \\ 1 & -S\_4 & S\_6 & -S\_2 \\ 1 & -S\_4 & -S\_6 & S\_2 \\ 1 & S\_4 & -S\_2 & -S\_6 \end{bmatrix} \begin{bmatrix} B\_{15} \\ B\_{11} \\ B\_{13} \\ B\_9 \end{bmatrix} \pm \begin{bmatrix} S\_1 & S\_3 & S\_5 & S\_7 \\ S\_3 & -S\_7 & -S\_1 & -S\_5 \\ S\_5 & -S\_1 & S\_7 & S\_3 \\ S\_7 & -S\_5 & S\_3 & -S\_1 \end{bmatrix} \begin{bmatrix} B\_{14} \\ B\_{12} \\ B\_{10} \\ B\_8 \end{bmatrix} \tag{32}
$$

Matrix (32) can be factorized as follows:

$$
\begin{bmatrix} \bar{Z}\_{1,15} \\ \bar{Z}\_{3,13} \\ \bar{Z}\_{5,11} \\ \bar{Z}\_{7,9} \end{bmatrix} = \begin{bmatrix} (C\_{15} + C\_{11}) + (C\_{13}^2 + C\_9^6) \\ (C\_{15} - C\_{11}) + (C\_{13}^6 - C\_9^2) \\ (C\_{15} - C\_{11}) - (C\_{13}^6 - C\_9^2) \\ (C\_{15} + C\_{11}) - (C\_{13}^2 + C\_9^6) \end{bmatrix} \pm \begin{bmatrix} \frac{1}{2S\_1} & 0 & 0 & 0 \\ 0 & \frac{1}{2S\_3} & 0 & 0 \\ 0 & 0 & \frac{1}{2S\_5} & 0 \\ 0 & 0 & 0 & \frac{1}{2S\_7} \end{bmatrix} \begin{bmatrix} 1 & S\_4 & S\_2 & S\_6 \\ 1 & -S\_4 & S\_6 & -S\_2 \\ 1 & -S\_4 & -S\_6 & S\_2 \\ 1 & S\_4 & -S\_2 & -S\_6 \end{bmatrix} \begin{bmatrix} C\_{14} \\ C\_{10} \\ C\_{12} \\ C\_{8} \end{bmatrix} \tag{33}
$$

where:

$$C\_{8,10,12} = B\_{8,10,12} + B\_{10,12,14} \qquad C\_{14,15} = B\_{14,15} \qquad C\_{9,13}^{2,6} = B\_{9,13} S\_{2,6} \qquad C\_{11} = B\_{11} S\_4 \tag{34}$$

In the 4th pipeline step directly from (32) we can introduce new variables:

$$D\_{15,11} = C\_{15} \pm C\_{11} \qquad D\_{13} = C\_{13}^2 + C\_9^6 \qquad D\_9 = C\_{13}^6 - C\_9^2 \tag{35}$$

The remaining variables require a further 10 multipliers, 3 adders/subtractors and 3 shift registers:

$$D\_{5,10} = S\_4 C\_{5,10} \qquad D\_{4,6,8,12}^{2,6} = S\_{2,6} C\_{4,6,8,12} \qquad D\_{3,7,14} = C\_{3,7,14}$$

$$D\_{0,1} = C\_0 \pm C\_1 \qquad D\_2 = C\_2 + C\_3 \tag{36}$$

However, the 5th pipeline stage requires only a single multiplier for the *E*<sub>2</sub> variable:

$$E\_2 = S\_4 D\_2 \qquad E\_{0,1,3} = D\_{0,1,3} \qquad E\_4 = D\_6^6 - D\_4^2 \qquad E\_6 = D\_6^2 + D\_4^6 \tag{37}$$

$$E\_{\substack{7,11,15 \\ 5,9,13}} = D\_{7,11,15} \pm D\_{5,9,13} \qquad E\_{14,10} = D\_{14} \pm D\_{10} \qquad E\_{12} = D\_{12}^2 + D\_8^6 \qquad E\_8 = D\_{12}^6 - D\_8^2 \tag{38}$$

The 6th stage does not require any multiplier, only 10 adders/subtractors and 6 shift registers for synchronization:

$$F\_{\substack{3,5,7,10,14 \\ 2,4,6,8,12}} = E\_{3,5,7,10,14} \pm E\_{2,4,6,8,12} \qquad F\_{0,1,9,11,13,15} = E\_{0,1,9,11,13,15} \tag{39}$$

In the 7th pipeline stage 12 signals are delayed only for synchronization and 4 are scaled for the following (n,k) pairs: (14,1), (12,7), (10,3), (8,5):

$$G\_n = \frac{F\_n}{2S\_k} \tag{40}$$

In the 8th pipeline stage pure registers, used for synchronization only, are implemented for the even indices of *X̄*<sub>0,2,4,6,8,10,12,14</sub> and

$$H\_{\substack{9,11,13,15 \\ 8,10,12,14}} = G\_{9,11,13,15} \pm G\_{8,10,12,14} \tag{41}$$

The last stage contains all scaling multipliers:


$$\bar{X}\_k = \frac{H\_m}{4\sqrt{2}\cos\left(\frac{k\pi}{32}\right)}\tag{42}$$

for the following (k,m) pairs: (1,15), (15,14), (7,13), (9,12), (3,11), (13,10), (5,9), (11,8), (14,7), (2,6), (6,5), (10,4), (4,3), (12,2).
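The whole chain (24)-(42) can be followed in a floating-point model and compared with a direct DCT. The sketch below is an illustration only, not the FPGA code: it assumes *S*<sub>*k*</sub> = cos(*kπ*/16) and the orthonormal DCT-II normalization, takes the first-stage sums and differences and the assignment of signs to individual nodes from the node labels of Fig. 8, drops the pure delay registers, and uses the hypothetical names `dct16_pipeline`, `dct16` and `dct16_reference`.

```python
import math

def S(k):
    # Assumption: S_k = cos(k*pi/16)
    return math.cos(k * math.pi / 16)

# (k, m) pairs of Eq. (42): coefficient index k is read from node H_m
PAIRS = [(1, 15), (15, 14), (7, 13), (9, 12), (3, 11), (13, 10), (5, 9), (11, 8),
         (14, 7), (2, 6), (6, 5), (10, 4), (4, 3), (12, 2)]

def dct16_pipeline(x):
    """Return the H nodes for one 16-sample window (delay registers omitted)."""
    A = [0.0] * 16
    for n in range(8):                      # stage 1, node labels of Fig. 8
        A[n] = x[n] + x[15 - n]
        A[15 - n] = x[n] - x[15 - n]
    B = [0.0] * 16
    for n in range(4):                      # stage 2, Eq. (24)
        B[n] = A[n] + A[7 - n]
        B[4 + n] = A[3 - n] - A[4 + n]
    B[15] = A[15]                           # stage 2, Eq. (25)
    for n in range(8, 15):
        B[n] = A[n + 1] + A[n]
    # stage 3, Eqs. (30) and (34)
    C0, C1, C3, C2 = B[0] + B[3], B[1] + B[2], B[0] - B[3], B[1] - B[2]
    C4, C5, C6, C7 = B[4] + B[5], B[5] + B[6], B[6] + B[7], B[7]
    C8, C10, C12 = B[8] + B[10], B[10] + B[12], B[12] + B[14]
    C14, C15, C11 = B[14], B[15], S(4) * B[11]
    H = [0.0] * 16
    # even indices, Eqs. (36)-(39)
    E2 = S(4) * (C2 + C3)
    E7, E5 = C7 + S(4) * C5, C7 - S(4) * C5
    E6 = S(2) * C6 + S(6) * C4
    E4 = S(6) * C6 - S(2) * C4
    H[0], H[1] = (C0 + C1) / 4, (C0 - C1) / 4
    H[3], H[2] = C3 + E2, C3 - E2
    H[6], H[7] = E7 + E6, E7 - E6
    H[5], H[4] = E5 + E4, E5 - E4
    # odd indices, Eqs. (35), (38)-(41)
    D13 = S(2) * B[13] + S(6) * B[9]
    D9 = S(6) * B[13] - S(2) * B[9]
    G15, G13 = (C15 + C11) + D13, (C15 + C11) - D13
    G11, G9 = (C15 - C11) + D9, (C15 - C11) - D9
    E14, E10 = C14 + S(4) * C10, C14 - S(4) * C10
    E12 = S(2) * C12 + S(6) * C8
    E8 = S(6) * C12 - S(2) * C8
    G14 = (E14 + E12) / (2 * S(1))          # Eq. (40), pairs (14,1), (12,7), (10,3), (8,5)
    G12 = (E14 - E12) / (2 * S(7))
    G10 = (E10 + E8) / (2 * S(3))
    G8 = (E10 - E8) / (2 * S(5))
    H[15], H[14] = G15 + G14, G15 - G14     # Eq. (41)
    H[13], H[12] = G13 + G12, G13 - G12
    H[11], H[10] = G11 + G10, G11 - G10
    H[9], H[8] = G9 + G8, G9 - G8
    return H

def dct16(x):
    """Apply the final scaling of Eq. (42) to the H nodes."""
    H = dct16_pipeline(x)
    X = [0.0] * 16
    X[0], X[8] = H[0], H[1]
    for k, m in PAIRS:
        X[k] = H[m] / (4 * math.sqrt(2) * math.cos(k * math.pi / 32))
    return X

def dct16_reference(x):
    """Orthonormal 16-point DCT-II evaluated directly from its definition."""
    return [(0.25 if k == 0 else 1 / (2 * math.sqrt(2)))
            * sum(x[n] * math.cos((2 * n + 1) * k * math.pi / 32) for n in range(16))
            for k in range(16)]
```

Calling `dct16(x)` and `dct16_reference(x)` on the same 16 samples should give the same 16 coefficients up to rounding. The model also makes the multiplier budget visible: every product with a constant *S*<sub>*k*</sub> or with a scaling factor corresponds to one multiplier in the pipeline.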

#### **7. Implementation of the code into a FPGA**

The spectral trigger should be generated if the DCT coefficients normalized to the 1st harmonic lie within an arbitrarily narrow range:

$$\text{Thr}\_k^L \le \xi\_k = \frac{\bar{X}\_k}{\bar{X}\_1} = \frac{\eta\_k H\_{f(k)}}{\eta\_1 H\_{15}} \le \text{Thr}\_k^H \tag{43}$$

where *Thr*<sub>*k*</sub><sup>*L*</sup> and *Thr*<sub>*k*</sub><sup>*H*</sup> are the lower and upper thresholds for each spectral index k, respectively. The Altera® Library of Parameterized Modules (LPM) contains the lpm\_divide routine supporting a division of fixed-point variables. However, this routine needs a huge amount of logic elements and it is slow (a calculation requires 14 clock cycles in order to keep a sufficiently high registered performance). The DSP blocks do not support this routine either. A simple conversion to

$$H\_{15} \times \theta\_k^L = H\_{15} \left( \frac{\eta\_1}{\eta\_k} \text{Thr}\_k^L \right) \le H\_{f(k)} \le H\_{15} \left( \frac{\eta\_1}{\eta\_k} \text{Thr}\_k^H \right) = H\_{15} \times \theta\_k^H \tag{44}$$

allows implementation of fast multipliers from the DSP blocks and calculation of the products in a single clock cycle. *θ*<sub>*k*</sub><sup>*L*</sup> and *θ*<sub>*k*</sub><sup>*H*</sup> are the lower and upper scaled thresholds, respectively, which are set as external parameters.

Fig. 8. The internal pipeline structure of the 16-point DCT FPGA routine. The signal from the ADC propagates through the (horizontal) shift register *x*<sub>15</sub>,...,*x*<sub>0</sub>. Simultaneously, the DCT coefficients are calculated in vertical chains in 9 clock cycles. Each rectangle corresponds to a single clock procedure (a logic block). The 16-point DCT "engine" utilizes 35 multipliers, 45 adders, 32 subtractors and (16 + 38) shift registers. For *H*<sub>0,1</sub> = *G*<sub>0,1</sub>/4 a division is not implemented; the two least significant bits are ignored. The width of the data is extended in consecutive pipeline stages from N at the shift register *x*<sub>15</sub>,...,*x*<sub>0</sub> up to N+8 in the H routine.

According to (44) the calculation of a sub-trigger needs two multipliers, two comparators and an AND gate. The multiplier stage of an embedded multiplier block supports 9 × 9 or 18 × 18 bit multipliers. Depending on the data width or the operational mode of the multiplier, a single embedded multiplier can perform one or two multiplications in parallel. Due to the wide data buses, the embedded multiplier blocks do not use the 9 × 9 mode in any multiplication. Each multiplier utilizes two embedded multiplier 9-bit elements. The full DCT procedure, with the calculation of all coefficients, needs 70 DSP blocks. However, the scaling of *X̄*<sub>*k*</sub> in the last pipeline chain is no longer needed. It is moved to the thresholds according to (44). Removing the last pipeline chain reduces the number of DSP blocks to 40. The sub-trigger routines (Fig. 9) need 2 DSP blocks each. The chip EP3C40F324I7 selected for the 4th generation of the 1st level SD trigger contains 252 DSP 9-bit multipliers. So, for 3-fold coincidences and an implementation of 3 "engines", a single DCT "engine" can support only 11 independent DCT coefficients (Szadkowski, 2011).

Sub-triggers *A*<sub>*k*</sub><sup>0,1,2,3</sup>, *B*<sub>*k*</sub><sup>0,1,2</sup>, *C*<sub>*k*</sub><sup>0,1</sup> and *D*<sub>*k*</sub><sup>0</sup> are generated for the patterns *A*<sub>*k*</sub>, *B*<sub>*k*</sub>, *C*<sub>*k*</sub> and *D*<sub>*k*</sub> (k = 2,4,6) from Fig. 3, respectively. The sub-triggers are synchronized to each other in shift registers so that they arrive simultaneously at an AND gate (Fig. 11). In order to keep the trigger rate below the limit imposed by the limited radio bandwidth, the amplitude of the jump is verified additionally. If the jump is too weak, a veto comparator disables the AND gate. Thus, the final trigger is generated if the spectral coefficients *ξ*<sub>*k*</sub> match the pattern ranges selected by the multiplexer for each time bin, in 4 consecutive time bins in total, and if the veto circuit is enabled.

The delay of the veto signal depends on the type of shape under investigation. For a single time bin of the rising edge the veto is delayed by 3 clock cycles. For the investigated pattern corresponding to three time bins of the rising edge, the maximal ADC value appears 2 clock cycles later than in the previous case, so the veto should be delayed by a single clock cycle only.
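The figure of 11 coefficients per "engine" quoted above can be reproduced with a short calculation. This is a sketch of one possible reading of the numbers, not a statement from the source: it assumes that each coefficient occupies one sub-trigger, that a sub-trigger uses two multipliers, and that each multiplier occupies two 9-bit elements. The function and parameter names are hypothetical.

```python
def coefficients_per_engine(total_elements=252, engines=3, dct_elements=40,
                            multipliers_per_sub_trigger=2,
                            elements_per_multiplier=2):
    """Number of sub-triggers that fit next to one DCT engine.

    Assumption: the 9-bit elements of the chip are shared equally between
    the engines, and each sub-trigger takes 2 multipliers of 2 elements each.
    """
    per_engine = total_elements // engines                  # 252 / 3 = 84
    remaining = per_engine - dct_elements                   # 84 - 40 = 44
    per_sub_trigger = multipliers_per_sub_trigger * elements_per_multiplier  # 4
    return remaining // per_sub_trigger                     # 44 / 4 = 11

if __name__ == "__main__":
    print(coefficients_per_engine())  # prints 11
```

Under the same assumptions, keeping the final scaling stage (70 elements instead of 40) would leave room for only 3 coefficients per "engine", which illustrates why the scaling is moved into the thresholds.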

Fig. 9. The structure of sub-triggers. The DCT coefficients *X̄*<sub>*k*</sub> are not calculated directly. They have been replaced by the boundaries of the acceptance lane: the upper and lower thresholds *H*<sub>15</sub> × *θ*<sub>*k*</sub><sup>*H*</sup> and *H*<sub>15</sub> × *θ*<sub>*k*</sub><sup>*L*</sup>, respectively. Signals between these thresholds (two comparators + AND gate) generate preliminary sub-triggers, which are then summed and compared with an arbitrary Occupancy level. If the number of "fired" preliminary sub-triggers is above the selected Occupancy, the final sub-trigger is generated for the next processes. It is enabled or disabled depending on the veto variable, which verifies the minimal amplitude of the input signals in order to keep the trigger rate at a reasonable level and to prevent saturation of the transmission channel.
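The logic of Fig. 9 can be sketched in a few lines. The example is a floating-point illustration rather than the fixed-point FPGA routine: it assumes *η*<sub>*k*</sub> = 1/(4√2 cos(*kπ*/32)) as implied by (42), and a positive *H*<sub>15</sub> so that multiplying (43) by *H*<sub>15</sub> preserves the direction of the inequalities. All function and argument names are hypothetical.

```python
import math

def eta(k):
    """Scaling factor eta_k = 1 / (4*sqrt(2)*cos(k*pi/32)), as implied by Eq. (42)."""
    return 1.0 / (4.0 * math.sqrt(2.0) * math.cos(k * math.pi / 32.0))

def scaled_thresholds(k, thr_lo, thr_hi):
    """Fold the scaling into the thresholds: theta = (eta_1 / eta_k) * Thr, Eq. (44)."""
    ratio = eta(1) / eta(k)
    return ratio * thr_lo, ratio * thr_hi

def preliminary_sub_trigger(h15, h_fk, theta_lo, theta_hi):
    """Two multiplications, two comparisons and a logical AND; no division."""
    return (h15 * theta_lo <= h_fk) and (h_fk <= h15 * theta_hi)

def sub_trigger(h15, h_by_k, thresholds, occupancy, amplitude_ok):
    """Count the fired preliminary sub-triggers and apply Occupancy and veto.

    h_by_k maps a spectral index k to the node value H_f(k); thresholds maps
    k to the pair (Thr_L, Thr_H); amplitude_ok is the state of the veto circuit.
    """
    fired = 0
    for k, h_fk in h_by_k.items():
        theta_lo, theta_hi = scaled_thresholds(k, *thresholds[k])
        fired += preliminary_sub_trigger(h15, h_fk, theta_lo, theta_hi)
    return amplitude_ok and fired > occupancy

if __name__ == "__main__":
    h15 = 1000.0
    h_nodes = {3: 400.0, 5: -120.0, 7: 60.0}
    # Thresholds placed symmetrically around the ratios xi_k of this example.
    thr = {}
    for k, h in h_nodes.items():
        xi = eta(k) * h / (eta(1) * h15)
        thr[k] = (xi - 0.01, xi + 0.01)
    print(sub_trigger(h15, h_nodes, thr, occupancy=2, amplitude_ok=True))
```

Because *θ*<sub>*k*</sub><sup>*L*</sup> and *θ*<sub>*k*</sub><sup>*H*</sup> are constants set as external parameters, only the two products with *H*<sub>15</sub> have to be evaluated per coefficient and per clock cycle.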

Experiments 17

<sup>395</sup> An Optimization of 16-Point Discrete Cosine Transform Implemented into a FPGA as a Design for a Spectral First Level Surface Detector Trigger in Extensive Air Shower Experiments

indices may be taken from the last 16 shift register nodes according to the Fig. 11. The samples with higher indices correspond to the exponentially attenuated tail and the analysis of the tail

x23 x22 x21 x20 x19 x18 x17 x 16 x15 x14 x13 x12 x11 x10 x9

DCT + sub-trigger routines

ɻ MUX <sup>1</sup>

Fig. 11. A scheme of the final spectral trigger. The shift register presented here has an extended length = 24 stages to cover longer time window. However, for a sampling frequencies *fs* ≤ 100 MHz 16 stages and T ≥ 150 ns the window is wide enough for an analysis of horizontal showers. If signal shifted in the register chain matches the expected patterns for 4 consecutive time bins i.e. corresponding to ADC shapes in Fig. 3 (1st row, 3 first graphs. The 4th pattern is exactly the same as the 3rd one. The amplitude of the signal decreases, but the DCT coefficients remain the same (still an exponential attenuation).

3 DCT trigger "engines" have been successfully merged with the Auger code working with 100 MHz sampling. The final code utilizes only 38gives an opportunity to add new, sophisticated algorithms. The slack reported by the compiler corresponds to a maximal sampling frequency 112 MHz, which gives a sufficient safety margin for a stable operation of the system. For sufficiently high amplitudes of the ADC samples the Threshold trigger will be generated 32 clock cycles earlier than the spectral trigger (24 clock cycles of propagation in the shift registers + 8 clock cycles of performance in the DCT chain). If the Threshold trigger has been already generated, the next triggers are inhibited for 768 time bins necessary to fulfill memory buffers (see Fig. 7 in (Szadkowski, 2005a)). Because the Threshold trigger (sensitive to bigger signals) has a higher priority than the spectral trigger, ADC samples will not be delayed for the Threshold trigger in order to synchronize it with the spectral one. The system uses 10-bit resolution (standard Auger one). A compilation for the 12-bit resolution for the current chip EP3C40F324I7 failed, due to a lack of the DSP blocks. 12-bit system requires bigger chip EP3C55. The slack times are on the same level as for EP3C40. All pipeline routines shown in Fig. 8 are implemented in a direct mode (no pipeline mode - like i.e. in the 2nd generation of the FEB based on the ACEX family (see Fig. 2 in (Szadkowski, 2005a)) or for the FFT implementation in the Cyclone family (Fig. 2 in (Szadkowski, 2005b)). So, a performance

[Fig. 11 block labels: shift-register samples x0 to x8; DCT + sub-trigger routines; thresholds for the jump control; thresholds for investigating the exponential attenuation; veto threshold with comparator (cmp); AND gate with veto input; final spectral trigger.]
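The remark in the caption of Fig. 11 that the DCT coefficients stay the same while the amplitude decreases can be checked numerically. The following Python sketch assumes an unnormalized 16-point DCT-II and assumes that the coefficients are compared after scaling by the first coefficient; the pedestal, amplitude and attenuation factor are illustrative values, not measured data.

```python
import math


def dct_ii(x):
    """Unnormalized N-point DCT-II: X_k = sum_n x_n cos(pi (2n+1) k / (2N))."""
    n_points = len(x)
    return [
        sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * n_points))
            for n in range(n_points))
        for k in range(n_points)
    ]


def scaled(coeffs):
    """Coefficients X_2..X_15 divided by X_1 (ASSUMED normalization)."""
    return [c / coeffs[1] for c in coeffs[2:]]


# Illustrative values only: pedestal, amplitude above pedestal, attenuation factor.
PEDESTAL, AMPLITUDE, BETA = 40.0, 100.0, 0.67
tail = [PEDESTAL + AMPLITUDE * BETA ** n for n in range(17)]

now = scaled(dct_ii(tail[0:16]))     # window filled by the exponential tail
later = scaled(dct_ii(tail[1:17]))   # one time bin later, lower amplitude
max_difference = max(abs(a - b) for a, b in zip(now, later))
print(max_difference)
```

Shifting a purely exponential tail by one time bin multiplies every sample above the pedestal by the same factor, and a constant pedestal contributes only to the zeroth coefficient, so the printed difference stays at the level of floating-point rounding.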


Fig. 10. Simulation of the 1-fold spectral trigger simultaneously with the 3-fold threshold trigger. The length of the shift registers is 16. Data in the Ext\_ADC0 channel correspond to a muon signal with a 1-time-bin rising edge, an 11-time-bin attenuation tail and a constant pedestal of 40 ADC-counts. Together with the beginning of the muon peak (at 23.075 *µ*s), the two neighboring channels Ext\_ADC1,2 are driven artificially to 150 ADC-counts to generate the standard threshold trigger based on the 3-fold coincidence. The internal PLL clock is 80 MHz. The internal standard threshold trigger appears 5 clock cycles later (+62.5 ns). The nodes lpm\_ff:\$00000|dffs - lpm\_ff:\$00030|dffs correspond to the shift register *x*15,...,*x*0. The system is tuned for the Shape\_A recognition (two 1st time bins on the pedestal level). Ena\_A\_reg is generated (+200 ns = 16 clock cycles) because the amplitude of the signal (140 ADC-counts) is above the veto threshold. It is delayed by a further 15 cycles to be synchronized with SUB\_TRIG\_Occ. Sub-triggers are generated 27 clock cycles (+337.5 ns) after the rising edge. The calculation of the Occupancy takes the next two clock cycles. 29 clock cycles after the rising edge, due to a coincidence of the Occupancy and Ena\_DCT\_del (inversion of the veto), the SUB\_TRIG is generated. Finally, it appears in the same position as the 3-fold coincidence threshold trigger 31 clock cycles later. The Final\_DCT trigger corresponds to a possible coincidence with the neighboring DCT "engines". If the standard threshold trigger (based on the 3-fold coincidence) appears, any subsequent triggers are ignored for 768 clock cycles.
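The time offsets quoted in the caption of Fig. 10 can be cross-checked against the 80 MHz PLL clock. A short Python sketch (the constant names are illustrative; only the cycle counts and the clock frequency come from the caption):

```python
F_PLL_MHZ = 80.0                 # internal PLL clock used in the simulation
PERIOD_NS = 1000.0 / F_PLL_MHZ   # duration of one clock cycle in ns


def offset_ns(cycles: int) -> float:
    """Offset from the rising edge, in ns, for a given number of clock cycles."""
    return cycles * PERIOD_NS


THRESHOLD_TRIGGER_CYCLES = 5   # standard threshold trigger
ENA_A_REG_CYCLES = 16          # veto decision (Ena_A_reg)
SUB_TRIGGER_CYCLES = 27        # DCT sub-triggers
OCCUPANCY_CYCLES = 2           # calculation of the Occupancy

# SUB_TRIG follows the sub-triggers by the Occupancy calculation time.
sub_trig_cycles = SUB_TRIGGER_CYCLES + OCCUPANCY_CYCLES

for name, cycles in [("threshold trigger", THRESHOLD_TRIGGER_CYCLES),
                     ("Ena_A_reg", ENA_A_REG_CYCLES),
                     ("sub-triggers", SUB_TRIGGER_CYCLES)]:
    print(f"{name}: +{offset_ns(cycles)} ns")
print("SUB_TRIG after", sub_trig_cycles, "cycles")
```

With a 12.5 ns clock period, 5, 16 and 27 cycles give the +62.5 ns, +200 ns and +337.5 ns quoted in the caption.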

[Fig. 10 traces: Ext\_Clk, PLL\_Clk1, Ext\_ADC2, Ext\_ADC1, Ext\_ADC0, RF\_Fast\_Trig, lpm\_ff:\$00000|dffs to lpm\_ff:\$00030|dffs, Ena\_A\_reg, Ena\_DCT, Ena\_DCT\_del, SUB\_TRIG\_DCT (11 traces), SUB\_TRIG\_Occ, SUB\_TRIG, Final\_DCT. Time markers: 23.0375 us, +62.5 ns, +200.0 ns, +337.5 ns, +375.0 ns, +412.5 ns. Ext\_ADC0 samples: 140, 107, 84, 70, 60, 53, 49, 46, 44, 42, 41, 40.]

The 16-point DCT with a 16-stage shift register at 100 MHz sampling covers a 150 ns time window. For horizontal or very inclined showers this interval is sufficient for the analysis. However, for a higher sampling frequency, when the time window may turn out to be too short, the shift register may be extended from 16 to 24 stages, and the eight samples for the higher indices may be taken from the last 16 shift register nodes according to Fig. 11. The samples with higher indices correspond to the exponentially attenuated tail, and the analysis of the tail is less critical than that of the rising edge, where samples are analyzed at full speed.
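The relation between the register length and the time window, and one possible way of feeding a 16-point DCT from a 24-stage register, can be sketched in Python. The tap pattern below is an assumption: the text only states that the eight higher-index samples come from the last 16 nodes, and taking every second node is one reading of that statement. The class and function names are hypothetical.

```python
from collections import deque


def window_ns(stages: int, f_mhz: float) -> float:
    """Time span between the first and the last node of the shift register."""
    return (stages - 1) * 1000.0 / f_mhz


def dct_taps(stages: int) -> list:
    """Return the 16 node indices (oldest sample = index 0) feeding the DCT.

    ASSUMPTION: in the 24-stage variant the first 8 nodes are used directly
    and every second node of the last 16 supplies the 8 higher indices.
    """
    if stages == 16:
        return list(range(16))
    if stages == 24:
        return list(range(8)) + list(range(8, 24, 2))
    raise ValueError("only 16 or 24 stages are sketched here")


class ShiftRegister:
    """Software model of the sample chain, kept in time order (oldest first)."""

    def __init__(self, stages: int, pedestal: int = 0):
        self.nodes = deque([pedestal] * stages, maxlen=stages)

    def clock(self, sample: int) -> None:
        # Appending on the right drops the oldest sample on the left.
        self.nodes.append(sample)

    def dct_input(self) -> list:
        return [self.nodes[i] for i in dct_taps(len(self.nodes))]


print(window_ns(16, 100.0), window_ns(24, 100.0))
```

Under these assumptions the rising edge is sampled at the full rate while the tail is sampled at half the rate, which matches the remark that the tail analysis is less critical.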



Fig. 12. Shapes of signals with various attenuation factors and two first time bins on the pedestal level. Panel labels: Shape\_A, Shape\_B; legend: attenuation factors 0.2, 0.25, 0.3, 0.35, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9; axes: time (ns), ADC-count.

Fig. 13. Coefficients for signals with various attenuation factors and two first time bins (left) and only one time bin (right) on the pedestal level. Panel titles: "DCT coefficients for signals with two first time bins on the pedestal level" and "DCT coefficients for signals with only one first time bin on the pedestal level"; horizontal axis: index 2 to 15; vertical axis: arbitrary units.

All signals with the first two time bins on the pedestal level will certainly have only one time bin on the pedestal level in the next clock cycle, but not vice versa. A signal with only a single time bin on the pedestal level before a sharp rising edge can have a significant contribution in the 2nd time bin before the rising edge, and it will not be recognized by a pattern recognition procedure tuned to Shape\_A. A procedure recognizing Shape\_A is more restrictive and gives a lower trigger rate than one for Shape\_B. Due to the limited number of DSP blocks, only 11 DCT coefficients can be analyzed simultaneously. For Shape\_A the coefficients *X*¯<sup>4</sup> and *X*¯<sup>10</sup> are ignored, and for Shape\_B *X*¯<sup>6</sup> and *X*¯<sup>14</sup>, as weakly sensitive to changes of the signal shape. A trigger based only on the DCT pattern recognition gives too high a rate, due to a contribution of very weak signals that also have the appropriate shape but are usually treated as noise. In order to reduce and control the trigger rate, a veto threshold has been introduced. The calculation of the DCT coefficients in the pipeline chain, followed by the calculation of the sub-triggers in the multiplier and comparator block, takes 12 clock cycles. The signal is delayed by the same time, so that it is compared with the veto threshold simultaneously with the generated DCT sub-triggers. If the signal is above the sum of the veto threshold and the pedestal, the sub-triggers are enabled to generate the final spectral trigger. The condition that all 11 DCT coefficients lie inside the acceptance lane is too strong: the shapes are not ideal, and noise introduces additional shape distortions. Similarly to the ToT trigger, only a part of the "fired" sub-triggers (Occupancy ≤ 11 = max. number of sub-triggers) is enough to generate the final spectral trigger.
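The decision logic described above (acceptance lanes, Occupancy and veto) can be summarized in a few lines of Python. This is only a sketch of the logic, not the FPGA implementation: the lane limits, the selected coefficient indices, the veto threshold and the minimal Occupancy used in the example are hypothetical placeholders.

```python
def sub_triggers(scaled_coeffs: dict, lanes: dict) -> list:
    """One flag per analyzed coefficient: True if it lies inside its lane.

    scaled_coeffs: {index k: scaled DCT coefficient}
    lanes:         {index k: (low, high)} acceptance lane per coefficient
    """
    return [lanes[k][0] <= scaled_coeffs[k] <= lanes[k][1] for k in sorted(lanes)]


def spectral_trigger(fired: list, sample: float, pedestal: float,
                     veto_threshold: float, min_occupancy: int) -> bool:
    """Final decision: enough sub-triggers fired AND the signal passes the veto."""
    occupancy = sum(fired)
    veto_passed = sample > pedestal + veto_threshold
    return veto_passed and occupancy >= min_occupancy


# Hypothetical setup: 11 analyzed coefficients with identical lanes.
lanes = {k: (-0.1, 0.1) for k in range(2, 13)}
coeffs = {k: 0.0 for k in lanes}
coeffs[5], coeffs[9] = 0.3, -0.2          # two coefficients distorted by noise

fired = sub_triggers(coeffs, lanes)
strong = spectral_trigger(fired, sample=140, pedestal=40,
                          veto_threshold=50, min_occupancy=8)
weak = spectral_trigger(fired, sample=60, pedestal=40,
                        veto_threshold=50, min_occupancy=8)
print(sum(fired), strong, weak)
```

In this example 9 of the 11 sub-triggers fire, so the strong signal is accepted although two coefficients fall outside their lanes, while the weak signal with the same shape is rejected by the veto.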

In the direct mode, the processing of a signal requires only a single clock cycle. All routines are fast enough to work with 100 MHz sampling without additional pipeline stages, and they do not introduce additional latency.
