164 Field - Programmable Gate Array

The two evaluation boards are connected by a high-speed VITA 57 mezzanine connector in order to limit parasitic effects due to pin couplings.

The FPGA Mezzanine Card (FMC) standard has proven to be highly popular, with over 100 FMC cards now available from a variety of partners. Over 30 of these FMCs specifically support high-speed data converters. The FMC provides a way for customers to quickly configure their standard Xilinx development boards with real-world analog interfaces.

Xilinx partners have been providing many easy-to-use (and to reuse) reference designs that save customers weeks or even months of development time. Building on this success are the first high-speed analog FMC cards supporting JESD204B from industry-leading analog providers such as Analog Devices, IDT, 4DSP, NXP, and others.

**1.2. Software**

The Xilinx ISE 14.5 software has been used to design, develop, and test the CluTim algorithm. It allows for the analysis and synthesis of source code written in a hardware description language (HDL) such as Verilog and VHDL, provides designs, and is also able:

• **To perform timing analysis:** To provide a detailed analysis of the FPGA design. This ensures that the specified timing constraints are properly passed to the implementation tools.

• **To examine RTL diagrams:** After the HDL synthesis phase of the synthesis process, it is possible to show a schematic representation of the synthesized source file. This schematic shows a representation of the preoptimized design in terms of generic symbols, allowing for control of design issues early in the design process.

• **To simulate a design:** The ISE simulator (ISim) provides a complete, full-featured HDL simulator integrated within ISE, with which it is possible to perform waveform tracing, waveform viewing, and HDL source debugging.

• **To configure the target device:** It is possible to program the device with a JTAG programmer.

• **To provide IP cores:** An IP (intellectual property) core is a block of logic or data that is used in an FPGA. It is part of the growing electronic design automation (EDA) industry trend toward repeated use of previously designed components. Ideally, an IP core should be entirely portable, that is, easy to insert into any vendor technology or design methodology.

The real data can be stored and visualized using the ChipScope Pro software. It inserts logic analyzer, system analyzer, and virtual I/O low-profile software cores directly into the design, allowing the user to view any internal signal or node, including embedded hard or soft processors. Signals are captured in the system at the speed of operation and brought out through the programming interface, freeing up pins for the design. Captured signals are then displayed and analyzed using the ChipScope Pro analyzer tool. These signals can also be saved to be processed with other tools, e.g., MATLAB.

**2. Algorithm description and FPGA implementation**

The chosen hardware is an FPGA device in which 20 transceivers are implemented. A single FPGA can therefore connect multiple ADCs in parallel, up to three if one adopts a configuration that uses six of the eight lines at high speed. Configurations with fewer lines at the maximum sampling frequency are also possible, but they require a digital down-conversion (DDC) of the signal, implemented inside the ADC. The DDC cannot be applied in the described case because it acts as a filter on the signal: unless the signal is limited to a narrow band, rather than occupying the entire band up to 1 GHz as it does here, information is lost.

Connecting multiple devices in parallel produces a very large amount of data to store. Storing the ADC data without preprocessing, that is, without reducing them and keeping only those useful for experimental purposes, would involve several disadvantages, as follows:

All of these issues require real-time data preprocessing so that only useful data are stored.

The CluTim algorithm [12], described here, is able to process the data in real time. In particular, it


The ability of an FPGA to perform multiple operations in parallel is useful when managing a large amount of data arriving from the ADC at a very high rate.

In this way, multiple instances of the same function can be executed on different data at the same time.

Before the algorithm can start working, the ADC is configured by loading its internal SPI registers with appropriate values.

The clock that drives the SPI loading has a frequency of 12 MHz and is supplied by the FPGA by means of the Clocking Wizard 3.6 IP core [18], which can generate several clocks, phase-shifted by predetermined values and with frequencies selected from a specific range, starting from the same master clock.

For the FPGA used, this range goes from a minimum of 10 MHz to a maximum of 700 MHz (the maximum frequency sustainable by the FPGA). The master clock has a frequency of 66 MHz and is generated by an MBH2100H 66 MHz oscillator mounted on the demo board.

After the ADC has been properly configured, the first phase of communication begins between the JESD204B transmitter implemented on the ADC and the receiver implemented on the FPGA. This phase is called code group synchronization (CGS).

In this phase, the receiver finds the boundaries between the 10-bit characters in the data stream. During the CGS phase, the JESD204B transmit block transmits a known sequence of *K* characters. The receiver must locate these *K* characters in its input data stream using clock and data recovery (CDR) techniques. The receiver issues a synchronization request by activating the SYNCINB± pins of the ADC. The JESD204B Tx begins sending *K* characters until the next clock boundary. Once the receiver has synchronized, it waits for the correct reception of at least four consecutive *K* characters and then deactivates SYNCINB±. The ADC then transmits an initial lane alignment sequence (ILAS) on the following clock boundary.
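The receiver-side rule "wait for at least four consecutive *K* characters" can be sketched in software as a simple run counter. This is only an illustrative model, not the logic of the JESD204B IP core: the function name and the `(value, is_control)` representation of decoded characters are assumptions made for the example. In JESD204B the *K* character is K28.5, whose decoded byte value is 0xBC.

```python
K28_5 = 0xBC  # decoded byte value of the K28.5 control character


def cgs_sync_index(chars, required=4):
    """Return the index of the character that completes `required`
    consecutive K characters, or None if that never happens.

    `chars` is an iterable of (value, is_control) pairs, one pair per
    decoded 8b/10b character (hypothetical representation).
    """
    run = 0
    for i, (value, is_control) in enumerate(chars):
        if is_control and value == K28_5:
            run += 1
            if run >= required:
                return i  # the receiver may deactivate SYNCINB here
        else:
            run = 0  # any other character restarts the count
    return None


stream = [(0x12, False), (K28_5, True), (K28_5, True), (0x00, False)]
stream += [(K28_5, True)] * 4
sync_at = cgs_sync_index(stream)
```

In this example the first two *K* characters are discarded because the run is interrupted, and synchronization is declared only on the fourth character of the second run.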

The ILAS phase follows the CGS phase and starts on the next clock boundary. The ILAS consists of four multiframes, with an *R* character marking the beginning and an *A* character marking the end. The ILAS begins by sending an *R* character followed by 0–255 ramp data for one multiframe. On the second multiframe, the link configuration data are sent starting with the third character. The second character is a *Q* character, which confirms that the link configuration data follow. All undefined data slots are filled with ramp data. The third and fourth multiframes are the same as the first.

After the initial lane alignment sequence is completed, the user data is sent. In a usual frame, all characters are user data.

The synchronization clock signal comes from the ADC with a frequency equal to 1/4 of the sampling frequency, that is, 500 MHz. This external clock is used as the reference clock of the RX PLL implemented in each transceiver. The transceiver outputs a clock at half this frequency, 250 MHz.

The schematic representation of the connection is shown in **Figure 5**.

From this signal, a block called advanced mixed-mode clock manager (MMCM\_ADV) [19], provided by ISE, generates all the clock signals needed to manage the transceiver modules and the JESD204B module provided by Xilinx.

The ADC communicates with the FPGA through the transceivers. Each transceiver has a high-speed (up to 6.5 Gbps) serial data line as input and a 32-bit word at a frequency of 125 MHz as output. All 32-bit words (from 1 to 8, according to the number of lines used) are passed simultaneously to the JESD204B block. Here, they are recombined to provide the correct information at its output, consisting of 16 words of 12 bits (the ADC resolution) at a frequency of 125 MHz (sampling frequency = 16 \* 125 MHz = 2 GHz).

**Figure 5.** Schematic representation.

If the algorithm were to process each sample serially, it would need an execution frequency of 2 GHz in order to process all the information before it is overwritten by new data. This is physically impossible because the maximum operating frequency of the FPGA used is 700 MHz. To overcome this problem, one has to exploit the ability to process several data in parallel, which reduces the operating frequency of the FPGA and relaxes the timing constraints, avoiding the introduction of time delays inside the device.
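The arithmetic behind this argument can be checked with a few lines of Python. The figures are those quoted in the text; the function and constant names are chosen only for this illustration.

```python
SAMPLE_RATE_HZ = 2_000_000_000   # ADC sampling rate quoted in the text
FPGA_MAX_CLOCK_HZ = 700_000_000  # maximum FPGA frequency quoted in the text
SAMPLES_PER_WORD = 16            # samples delivered together by the JESD204B block


def required_clock_hz(sample_rate_hz, samples_per_cycle):
    """Clock needed to keep up when `samples_per_cycle` samples are
    processed in parallel on every clock cycle."""
    return sample_rate_hz / samples_per_cycle


serial_clock = required_clock_hz(SAMPLE_RATE_HZ, 1)
parallel_clock = required_clock_hz(SAMPLE_RATE_HZ, SAMPLES_PER_WORD)
```

Processing one sample per cycle would require a clock above the FPGA limit, whereas processing all 16 samples of a word in parallel brings the requirement down to the 125 MHz word rate.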

At the beginning of the signal processing procedure, a counter starts counting, providing the timing information related to the event under scrutiny. A peak is determined by relating the *i*th sampled bin to a number *n* of preceding bins, where *n* is directly proportional to the rise time of the signal peak.

The value of *n* has been chosen to be 2. Supposing a 1 ns rise time for the signal, which is sampled at a rate of 2 GS/s, two maxima must be separated by at least three samples to be associated with two distinct peaks.

The implemented algorithm is shown schematically in **Figure 6**.

Among the 16 samples *S*<sub>*K*,*X*</sub>, where *K* is the sample number among those available and *X* is the time at which they are present, the functions *D*1<sub>*K*,*X*</sub> and *D*2<sub>*K*,*X*</sub> are calculated according to Eqs. (1) and (2), respectively (step 1).

$$D1_{K,X} = \left(\frac{2 \cdot S_{K,X} - S_{K-1,X} - S_{K-2,X}}{16}\right) \cdot 3 \tag{1}$$

$$D2_{K,X} = \left(\frac{2 \cdot S_{K,X} - S_{K-2,X} - S_{K-3,X}}{16}\right) \cdot 5 \tag{2}$$

**Figure 6.** Algorithm implemented.

The value of the *D*1<sub>*K*,*X*</sub> function provides an estimate of the variation of the amplitude of the *i*th sample compared with the (*i*-1)th and (*i*-2)th samples. Likewise, the *D*2<sub>*K*,*X*</sub> function provides the same estimate with respect to the (*i*-2)th and (*i*-3)th samples.
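A minimal sketch of step 1 is given below. It assumes that the two functions have the form D1 = 3 (2 S[k] − S[k−1] − S[k−2]) / 16 and D2 = 5 (2 S[k] − S[k−2] − S[k−3]) / 16, which is how the description above is read here. The function name and the use of floating-point arithmetic are choices made for the illustration; the VHDL implementation works with integers.

```python
def discriminants(samples):
    """Return (k, D1, D2) for every sample that has three predecessors.

    Assumed forms (see the lead-in):
      D1[k] = (2*S[k] - S[k-1] - S[k-2]) / 16 * 3
      D2[k] = (2*S[k] - S[k-2] - S[k-3]) / 16 * 5
    """
    out = []
    for k in range(3, len(samples)):
        d1 = (2 * samples[k] - samples[k - 1] - samples[k - 2]) / 16 * 3
        d2 = (2 * samples[k] - samples[k - 2] - samples[k - 3]) / 16 * 5
        out.append((k, d1, d2))
    return out


# A flat baseline with a single sample of amplitude 16 at index 3.
result = discriminants([0, 0, 0, 16, 0])
```

Both functions are largest at the isolated spike and drop to zero or below on the sample that follows it.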

**Figure 7** shows the input signal to the ADC, the peaks found, and the values of the functions and of their differences. As can be noticed, the functions assume their maximum values in correspondence with the signal peaks.

The values of the *D*1<sub>*K*,*X*</sub> and *D*2<sub>*K*,*X*</sub> functions are stored in a 16-element vector. For the first three samples in each series of 16 input words, the calculation makes use of the last three samples of the previous 16 input words, which are kept in temporary storage for this purpose. Likewise, in order to calculate the differences of *D*1<sub>*K*,*X*</sub> and *D*2<sub>*K*,*X*</sub> at the head of the samples, the values of *D*1<sub>*K*,*X*-1</sub> and *D*2<sub>*K*,*X*-1</sub>, calculated at the previous time and relating to the tail of the samples, must be temporarily stored.
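The carry-over of the last three samples between consecutive 16-sample blocks can be sketched as follows. The class name, the zero-initialized history, and the assumed forms of the two functions (D1 = 3 (2 S[k] − S[k−1] − S[k−2]) / 16 and D2 = 5 (2 S[k] − S[k−2] − S[k−3]) / 16) are assumptions made for this illustration rather than a description of the VHDL code.

```python
class BlockDiscriminator:
    """Computes (D1, D2) block by block, keeping the tail of the
    previous block so that the first samples of a block have history."""

    def __init__(self, block_size=16, history=3):
        self.block_size = block_size
        self.history = history
        # Temporary storage for the tail of the previous block
        # (assumed to be zero before the first block arrives).
        self.tail = [0] * history

    def process(self, block):
        if len(block) != self.block_size:
            raise ValueError("unexpected block length")
        s = self.tail + list(block)
        out = []
        for k in range(self.history, len(s)):
            d1 = (2 * s[k] - s[k - 1] - s[k - 2]) / 16 * 3
            d2 = (2 * s[k] - s[k - 2] - s[k - 3]) / 16 * 5
            out.append((d1, d2))
        self.tail = s[-self.history:]  # keep the last samples for the next block
        return out


stream = [0] * 14 + [4, 20, 6, 1] + [0] * 14  # a peak straddling two blocks
blockwise = BlockDiscriminator()
per_block = blockwise.process(stream[:16]) + blockwise.process(stream[16:])
reference = BlockDiscriminator(block_size=32).process(stream)
```

Because the tail is carried over, splitting the stream into two blocks gives the same values as processing it in one piece, even when a peak falls across the block boundary.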

**Figure 7.** Input signal, found peaks, and discrimination functions.


The values of *D*1<sub>*K*,*X*</sub> and *D*2<sub>*K*,*X*</sub>, together with the differences between *D*1<sub>*K*,*X*</sub> and *D*1<sub>*K*-1,*X*</sub> and between *D*2<sub>*K*,*X*</sub> and *D*2<sub>*K*-1,*X*</sub>, are compared with thresholds proportional to the level of noise present in the input signal (step 2).

If the imposed conditions are met, the sample under test is identified as a possible maximum and its value is temporarily stored in *M*<sub>*K*,*X*</sub>.
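Step 2 can be sketched as below. The text does not spell out the exact conditions, so the example assumes that all four quantities must exceed their own threshold and that each threshold is a proportionality factor times the noise rms. The function name, argument names, and default factors are hypothetical.

```python
def candidate_maxima(samples, d1, d2, noise_rms, factors=(1.0, 1.0, 1.0, 1.0)):
    """Return a vector M with M[k] = samples[k] where the sample is a
    possible maximum and 0 elsewhere.

    Assumed conditions: D1, D2, D1[k]-D1[k-1] and D2[k]-D2[k-1] must
    each exceed factor * noise_rms.
    """
    t_d1, t_d2, t_dd1, t_dd2 = (f * noise_rms for f in factors)
    m = [0] * len(samples)
    for k in range(1, len(samples)):
        if (d1[k] > t_d1 and d2[k] > t_d2
                and d1[k] - d1[k - 1] > t_dd1
                and d2[k] - d2[k - 1] > t_dd2):
            m[k] = samples[k]
    return m


samples = [0, 0, 10, 0]
d1 = [0, 0, 5, -2]
d2 = [0, 0, 8, -1]
low_noise = candidate_maxima(samples, d1, d2, noise_rms=1.0)
high_noise = candidate_maxima(samples, d1, d2, noise_rms=6.0)
```

With a low noise level the sample at index 2 is flagged; with a higher noise level the same sample no longer passes the D1 threshold and nothing is flagged.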

In order to transfer the data to the memory, the last step consists in checking that, according to the conditions imposed on the signal rise time, there are no adjacent peaks (step 3).

For this purpose, a check is performed on the last maximum. If its location contains a nonzero value, this value is transferred into the memory and the two preceding locations are set to zero. The procedure continues by scrolling through all locations and sending all effective maxima to memory. The time information is provided by a counter, whose value is stored in a FIFO every time a peak is found. The counter is clocked at 125 MHz; therefore, when a peak is found, the counter value is multiplied by 16 to take into account the actual ADC sampling rate and added to the sample number corresponding to the maximum found, as illustrated in **Figure 8**.
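A software sketch of step 3 and of the time calculation is shown below. It follows the description above (scan from the last location, keep a nonzero value, clear the two preceding locations) but the function names and the list-based representation of the candidate vector are assumptions made for the example.

```python
def suppress_adjacent(m, guard=2):
    """Scan the candidate vector from the last location backwards.
    Every nonzero location is kept as an effective maximum and the
    `guard` locations preceding it are set to zero.

    Returns a list of (sample_index, value) in increasing index order.
    """
    m = list(m)  # work on a copy
    kept = []
    for k in range(len(m) - 1, -1, -1):
        if m[k] != 0:
            kept.append((k, m[k]))
            for j in range(max(0, k - guard), k):
                m[j] = 0
    kept.reverse()
    return kept


def peak_time(counter, sample_index, samples_per_word=16):
    """Time of a peak in units of ADC samples: the 125 MHz counter value
    scaled by 16, plus the position of the peak inside the word."""
    return counter * samples_per_word + sample_index


candidates = [0, 5, 7, 0, 0, 9] + [0] * 10
peaks = suppress_adjacent(candidates)
times = [peak_time(10, k) for k, _ in peaks]
```

Here the candidates at indices 1 and 2 are adjacent, so only the later one survives, while the candidate at index 5 is far enough away to be kept as a separate peak.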

The memories are continuously filled as new peaks are found. When a trigger signal occurs at time *t*0, indicating with *T* the observed time window, which coincides with the maximum drift time, the reading procedure is enabled and only the data related to the found peaks in the [*t*0, *t*0 + *T*] time interval are transferred to an external device. This results in data reduction factors of more than one order of magnitude.
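The trigger-window selection amounts to a simple filter on the stored peak times. The sketch below assumes that peaks are held as `(time, amplitude)` pairs with times expressed in ADC samples; both the representation and the function name are hypothetical.

```python
def read_window(peaks, t0, window):
    """Return only the peaks whose time lies in [t0, t0 + window]."""
    return [(t, a) for (t, a) in peaks if t0 <= t <= t0 + window]


stored = [(90, 12), (100, 30), (450, 18), (900, 25), (1200, 14)]
selected = read_window(stored, t0=100, window=800)
```

Only the peaks inside the window are sent to the external device; everything recorded before the trigger or after the maximum drift time is discarded.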

**Figure 8.** Time information algorithm.

For storing data, ISE provides several types of IP cores. The one used is the FIFO Generator 9.3 [8]. Within this IP, one can choose between various options, each supporting different features. For these purposes, a common-clock FIFO has been chosen. Its depth depends on the observed time window *T* of the event, while its width has been set to 8 bits, a choice linked to the way in which the data are sent to the external device.

ISE provides several methods of data exchange, such as communication through UART and Ethernet.

A UART‐type interface has been used. It requires 1 start bit, 8 data bits, and 1 stop bit to communicate.

The data are sent at a baud rate of 115,200 bps. The UART clock (16 times the baud rate) is derived from a 50 MHz clock obtained from the Clocking Wizard IP core.

However, since its transmission speed is very low, the time required to transfer data stored in the FIFO to the external device is very long. To considerably reduce this time, an Ethernet module, having significantly faster transfer times, can be implemented.
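The figures above can be put into numbers with a short calculation. The sketch assumes that the 16x UART clock is obtained from the 50 MHz clock with an integer divider, which the text does not state; the function names and the example byte count are also hypothetical.

```python
CLOCK_HZ = 50_000_000   # clock from the Clocking Wizard IP core
BAUD = 115_200          # nominal baud rate
OVERSAMPLING = 16       # UART clock is 16 times the baud rate
BITS_PER_FRAME = 10     # 1 start bit + 8 data bits + 1 stop bit


def uart_divider(clock_hz=CLOCK_HZ, baud=BAUD, oversampling=OVERSAMPLING):
    """Nearest integer divider producing the oversampled UART clock."""
    return round(clock_hz / (baud * oversampling))


def actual_baud(divider, clock_hz=CLOCK_HZ, oversampling=OVERSAMPLING):
    """Baud rate actually obtained with an integer divider."""
    return clock_hz / (divider * oversampling)


def transfer_time_s(n_bytes, baud=BAUD):
    """Time needed to send n_bytes, each framed in 10 bits."""
    return n_bytes * BITS_PER_FRAME / baud


divider = uart_divider()
baud_error = abs(actual_baud(divider) - BAUD) / BAUD
one_second_payload = transfer_time_s(11_520)
```

Under these assumptions the divider is 27 and the resulting baud rate is within about 0.5% of the nominal value. Since each byte costs 10 bit times, the link carries 11,520 bytes per second, which is why reading out a deep FIFO over UART is slow.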

The algorithm has been implemented both in MATLAB and in VHDL and run on a test signal with a defined noise of 0.5 mV rms and a number of peaks. Comparing the results gives an index of the algorithm performance. By comparing the number of real and fake peaks found, one can assess whether any error is due to the algorithm itself or to approximations in the VHDL implementation, where the functions *D*1<sub>*K*,*X*</sub> and *D*2<sub>*K*,*X*</sub> are rounded to integers.

**Figures 9 (MATLAB)** and **10 (VHDL)** show the algorithm efficiency, calculated using Eq. (3), and the percentage of fake peaks, calculated using Eq. (4):

$$\text{Eff}[\%] = 100 \cdot \frac{\mathit{Pf} - \mathit{P}\text{fake}}{\mathit{Pr}} \tag{3}$$

$$\mathit{P}\text{fake}[\%] = 100 \cdot \frac{\mathit{P}\text{fake}}{\mathit{Pf}} \tag{4}$$

**Figure 9.** Algorithm results with MATLAB.


**Figure 10.** Algorithm results with VHDL.

where *Pf* is the number of peaks found in the signal, *Pr* is the number of real peaks, and *P*fake is the number of fake peaks (equal to *Pf* − *Pr*).
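Eqs. (3) and (4) translate directly into two small functions. In this sketch the three counts are passed in as independent inputs, which is an assumption about how found peaks are matched to real ones; the counts used in the example are invented for illustration and are not results from the chapter.

```python
def efficiency_percent(pf, pfake, pr):
    """Eq. (3): percentage of real peaks that were correctly found."""
    return 100 * (pf - pfake) / pr


def fake_percent(pf, pfake):
    """Eq. (4): percentage of found peaks that are fake."""
    return 100 * pfake / pf


# Hypothetical counts: 100 real peaks, 80 found, 4 of which are fake.
eff = efficiency_percent(pf=80, pfake=4, pr=100)
fake = fake_percent(pf=80, pfake=4)
```

With these invented counts, 76 of the 100 real peaks are recovered and 5% of the found peaks are fake.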

The results were obtained by varying the proportionality factors used to calculate the thresholds with which the *D*1<sub>*K*,*X*</sub> and *D*2<sub>*K*,*X*</sub> functions and their differences are compared.

As expected, in both cases, increasing the thresholds decreases the efficiency but also reduces the number of false peaks.

**Figure 11.** Algorithm results with MATLAB without noise.

The algorithm executed in MATLAB performs slightly better than the VHDL version. For example, for a fake-peak rate of 1%, the efficiency is 76% for MATLAB and 69% for VHDL.

A further test was performed by implementing the algorithm in MATLAB on the signals previously analyzed without the noise contribution. The results obtained are shown in **Figure 11**.

By comparing **Figures 9** and **11**, it can be seen how the presence of noise induces a reduction in efficiency and an increase of false peaks, indicating that a part of the errors of the algorithm may actually be due to the characteristics of the signal at input.

This problem can be mitigated to some extent by filtering the input signal to the ADC in order to increase the signal-to-noise ratio.

The use of the algorithm described results in data reduction factors of more than one order of magnitude. It is executed for each readout channel and, given the high number of FPGA I/O pins, it is possible to process multiple channels, corresponding to different drift chamber signals, with a single device.
