**6. Experimental results**

This section presents experimental results for the proposed architecture. The extracellular recordings used in the experiments are generated by the simulator developed in [21], which provides access to the ground truth of the spiking activity. Each spike is 2.67 ms long, and the recordings are sampled at 24,000 samples/s. Each spike therefore contains 64 samples (i.e., *N* = 64).

The spike detection performance of the proposed architecture is evaluated first, in terms of the true positive rate (TPR) and the false alarm rate (FAR). The TPR of a detection algorithm is the number of true spikes it detects divided by the total number of true spikes. The FAR is the number of silent segments falsely detected as spikes divided by the total number of segments the algorithm detects. **Table 2** lists the TPR and FAR of various detection algorithms. In the experiments, the spike trains come from two neurons, so the proposed normalized correlator uses two templates (i.e., *c* = 2).
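The two metrics defined above can be computed as in the following minimal sketch. The function name `detection_rates`, the `tolerance` parameter, and the greedy one-to-one matching rule are illustrative assumptions; the source does not state how a detection is matched to a ground-truth spike.

```python
def detection_rates(true_spike_times, detected_times, tolerance):
    """Return (TPR, FAR) for one detector run.

    Assumption: a detection counts as a hit when it lies within `tolerance`
    samples of a not-yet-matched true spike; each true spike is matched once.
    """
    unmatched = sorted(true_spike_times)
    hits = 0
    for t in sorted(detected_times):
        # First remaining true spike close enough to this detection, if any.
        match = next((s for s in unmatched if abs(s - t) <= tolerance), None)
        if match is not None:
            unmatched.remove(match)
            hits += 1
    false_alarms = len(detected_times) - hits
    # TPR: detected true spikes / all true spikes.
    tpr = hits / len(true_spike_times) if true_spike_times else 0.0
    # FAR: falsely detected segments / all detected segments.
    far = false_alarms / len(detected_times) if detected_times else 0.0
    return tpr, far


if __name__ == "__main__":
    truth = [100, 400, 900, 1500]            # made-up ground-truth positions
    detected = [102, 398, 1200, 1499, 2000]  # made-up detector output
    tpr, far = detection_rates(truth, detected, tolerance=5)
    print(f"TPR = {tpr:.2f}, FAR = {far:.2f}")
```

With these made-up positions, three of the four true spikes are found and two of the five detections are false, so the TPR is 0.75 and the FAR is 0.40. Note that the FAR is normalized by the number of detections, not by the number of silent segments.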

Because the normalized correlation is effective at detecting real spikes while ignoring silent segments, **Table 2** shows that the proposed architecture outperforms the other algorithms. The example in **Figure 16** illustrates this further. Here, a noisy spike train with SNR = −3 dB is used for spike detection, and **Figure 16** shows the normalized correlation values for the two templates (*i* = 1, 2) along the spike train. The noise is so strong that it is difficult to locate the spikes even by direct visual inspection. However, the normalized correlation values produced by the proposed architecture still allow the true spikes to be located effectively.
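A software sketch of detection by normalized correlation is given below for illustration. It assumes the common cosine-style definition (inner product divided by the product of the Euclidean norms) and a fixed detection threshold. The exact formula, the threshold rule, and the names `normalized_correlation` and `detect_spikes` are assumptions here, not taken from the proposed hardware.

```python
import math


def normalized_correlation(segment, template):
    """Cosine-style normalized correlation of two equal-length sequences."""
    dot = sum(a * b for a, b in zip(segment, template))
    norm = math.sqrt(sum(a * a for a in segment)) * math.sqrt(sum(b * b for b in template))
    # An all-zero segment has no defined direction; treat it as uncorrelated.
    return dot / norm if norm > 0 else 0.0


def detect_spikes(signal, templates, threshold):
    """Slide an N-sample window over `signal` and return a list of
    (start index, template index, score) for every window whose
    best-matching template scores at least `threshold`."""
    n = len(templates[0])
    detections = []
    for start in range(len(signal) - n + 1):
        window = signal[start:start + n]
        scores = [normalized_correlation(window, t) for t in templates]
        best = max(range(len(templates)), key=lambda i: scores[i])
        if scores[best] >= threshold:
            detections.append((start, best, scores[best]))
    return detections


if __name__ == "__main__":
    # Toy data with N = 5 and c = 2 (the experiments use N = 64).
    templates = [[0, 1, 2, 1, 0], [0, -1, -2, -1, 0]]
    signal = [0, 0, 0, 0, 3, 6, 3, 0, 0, 0, 0, -1, -2, -1, 0, 0, 0]
    for start, index, score in detect_spikes(signal, templates, threshold=0.9):
        print(f"window at {start}: template {index}, score {score:.3f}")
```

Because the score is normalized by the energy of both the window and the template, a scaled copy of a template (the first spike in the toy signal has three times the template amplitude) still scores close to 1, while misaligned windows score well below the threshold.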

**Figure 15.** The proposed NOC system for spike sorting.



**Figure 16.** An example of the proposed normalized correlator for noisy spike detection with SNR = −3 dB for *c* = 2 templates.

Next, we evaluate the area complexity. Because adders, multipliers, dividers, comparators, and registers are the basic building blocks of the proposed architecture, the area complexity is expressed as the number of each of these five components. **Tables 3** and **4** show the area complexities of the normalized correlator and OSORT modules, respectively. **Table 3** shows that, within the normalized correlator module, the correlator unit and the switch buffer have the largest area complexities. In the correlator unit, the number of adders, multipliers, and registers grows with the block dimension *N* and the number of templates *c*. Let *L* be the capacity (i.e., the maximum number of spikes) of each buffer in the switch buffer. The number of registers in the switch buffer then depends on *L* and *N*, as shown in **Table 3**. The numbers of dividers and comparators are *O*(1), so the proposed circuit consumes few dividers and comparators. From **Table 4**, we observe that only the area complexities of the buffers in the OSORT module grow with *N*; the other parts of the OSORT module have fixed area complexities.


**Table 3.** Area complexities of the normalized correlator module.


**Table 4.** Area complexities of the OSORT module.

The proposed architecture has been implemented on an FPGA for performance measurement. The target device is the Altera Stratix IV EP4SGX230, and the design platform is Altera Quartus II with Qsys. **Table 5** shows the hardware utilization of the proposed architecture. Four FPGA hardware resources are considered: adaptive look-up tables (ALUTs), dedicated logic registers, block memory bits, and DSP blocks. The DSP blocks are dedicated to implementing adders, multipliers, dividers, and comparators. The ALUTs, dedicated logic registers, and block memory bits can be used to implement registers as well as adders, multipliers, dividers, and comparators. **Table 5** shows that the normalized correlator consumes more DSP blocks than the OSORT module, because it requires more arithmetic operators. The target FPGA device provides 182,400 ALUTs, 182,400 dedicated logic registers, 1288 DSP blocks, and 14,625,792 block memory bits. Compared with these totals, **Table 5** shows that the proposed circuit consumes only a limited share of the hardware resources.


**Table 5.** The utilization of FPGA resources of the proposed circuit. The switch buffer capacity for the measurement is *L* = 40.

In addition to consuming few hardware resources, the proposed architecture provides high throughput. **Table 6** shows the throughput of the proposed architecture for various clock rates and switch buffer capacities *L*. The throughput is defined as the number of spike samples that the proposed architecture can process per second, and it is reported in the table in megasamples per second (Msamples/sec). **Table 6** shows that the throughput grows with *L* and with the clock rate. In particular, when *L* = 32 and the clock rate is 100 MHz, the throughput is 25.04 Msamples/sec. The throughput of its software counterpart, running on an Intel i7-930 processor at a clock rate of 2.8 GHz with 16 GB of RAM, is only 0.69 Msamples/sec. The throughput of the proposed architecture is therefore about 36 times higher than that of its software counterpart.


**Table 6.** The throughput of the proposed circuit for various clock rates and switch buffer capacities *L*.
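The speedup quoted above follows directly from the two throughput figures, as the short check below shows. The amortized per-spike time is an illustrative derived quantity, assuming *N* = 64 samples per spike as in the experimental setup. It is an average implied by the throughput, not a measured latency, and the function names are hypothetical.

```python
N = 64                     # samples per spike, from the experimental setup
HW_THROUGHPUT = 25.04e6    # samples/s, proposed circuit at L = 32 and 100 MHz
SW_THROUGHPUT = 0.69e6     # samples/s, software counterpart


def speedup(hw, sw):
    """Ratio of hardware to software throughput."""
    return hw / sw


def amortized_spike_time_us(throughput, samples_per_spike=N):
    """Average time per spike in microseconds implied by a throughput."""
    return samples_per_spike / throughput * 1e6


if __name__ == "__main__":
    print(f"speedup: {speedup(HW_THROUGHPUT, SW_THROUGHPUT):.1f}x")
    print(f"hardware: {amortized_spike_time_us(HW_THROUGHPUT):.2f} us per spike")
    print(f"software: {amortized_spike_time_us(SW_THROUGHPUT):.2f} us per spike")
```

The ratio evaluates to roughly 36.3, consistent with the stated factor of 36.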
