**3.3. Image generation subsystem: Parallel beamforming**

Strategies for Hardware Reduction on the Design of Portable Ultrasound Imaging Systems. In: *Advancements and Breakthroughs in Ultrasound Imaging*. http://dx.doi.org/10.5772/55910

In recent years, the computing industry has opened the way to parallel computing. Nowadays, all consumer computers ship with multi-core processors: dual-core CPUs were introduced in personal systems at the beginning of 2006, and it is currently common to find them in laptops, as well as 8- and 16-core workstation computers. This means that parallel computing is no longer relegated to big supercomputers or mainframes. Graphics Processing Units (GPUs), as their name suggests, came about as accelerators for graphics applications, predominantly those using the OpenGL and DirectX programming interfaces. Although they were originally pure fixed-function devices, the demand for real-time 3D graphics made them evolve into increasingly flexible, highly parallel, multithreaded processors with extremely high computational power and very high memory bandwidth, turning them into massively parallel machines.

**Figure 11.** 2R-SAFT minimal architecture

**Figure 12.** 8xA-SAFT architecture with the 32 reception channels needed for Golay encoding

Unlike earlier GPU generations, in which computing resources were partitioned into vertex and pixel shaders, modern GPUs can be programmed directly in C using CUDA or OpenCL [19], APIs that expose a unified shader pipeline and allow every arithmetic logic unit on the chip to be used by programs performing general-purpose computations (GPGPU). Furthermore, the execution units on the GPU allow arbitrary read and write access to memory, as well as access to a software-managed cache known as shared memory. A CUDA program consists of one or more phases that are executed on either the host (CPU) or a device such as a GPU. Phases that exhibit little or no data parallelism are implemented in CPU code, while phases rich in data parallelism are implemented in GPU code. The parallel functions (called kernels) typically generate a large number of threads to exploit data parallelism. It is worth noting that CUDA threads are much lighter weight than CPU threads: thanks to efficient hardware support, programmers can assume that they take very few cycles to generate and schedule. This differs from CPU threads, which typically require thousands of clock cycles for their generation and scheduling.


**Figure 13.** CUDA program execution diagram

The execution of a typical CUDA program is illustrated in Figure 13, where it can be observed that execution starts on the host (CPU). When a kernel function is invoked (or launched), execution moves to a device (GPU), where a large number of threads are generated to take advantage of the abundant data parallelism. All the threads generated by a kernel during an invocation are collectively called a grid. Figure 13 shows the execution of two grids of threads. A grid is a 1D, 2D or 3D structure of blocks, and a block is a 1D, 2D or 3D structure of threads. Thus, the program code is composed of classical functions, which run on the CPU using a single thread of execution, and kernels, which run on the GPU using multiple parallel threads. When all the threads of a kernel complete their execution, the corresponding grid terminates, and execution continues on the host until another kernel is invoked. It is not our purpose to fully cover all the aspects of the CUDA architecture; extended discussions of the CUDA hardware and programming model are available in multiple sources in the literature [19–21].
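As a rough illustration of this execution model, the grid/block/thread hierarchy can be emulated sequentially in plain Python. This is a sketch only: the helper names are hypothetical, and a real CUDA kernel would compute `i = blockIdx.x * blockDim.x + threadIdx.x` with all threads running in parallel rather than in a loop.

```python
# Sequential Python stand-in for a CUDA kernel launch: a grid of blocks,
# each block a group of threads, every thread handling one element.
from math import ceil

def launch_kernel(kernel, n_blocks, threads_per_block, *args):
    """Emulate launching `n_blocks` blocks of `threads_per_block` threads."""
    for block_idx in range(n_blocks):
        for thread_idx in range(threads_per_block):
            kernel(block_idx, thread_idx, threads_per_block, *args)

def add_kernel(block_idx, thread_idx, block_dim, a, b, out):
    i = block_idx * block_dim + thread_idx   # global thread index
    if i < len(out):                         # guard against the padded tail
        out[i] = a[i] + b[i]

a = list(range(10))
b = [10 * x for x in a]
out = [0] * 10
tpb = 4                                      # threads per block
launch_kernel(add_kernel, ceil(len(a) / tpb), tpb, a, b, out)
# out == [0, 11, 22, 33, 44, 55, 66, 77, 88, 99]
```

Note the bounds guard: as in real CUDA code, the last block is padded, so threads past the end of the data must do nothing.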

Therefore, in this section we examine different ways to implement the beamforming process on the GPU using the CUDA programming model. From this model it follows that functions which are executed many times independently over different data are the ideal candidates for this kind of computing. In this sense, several algorithms have been implemented to cover the fundamental parts of a conventional Delay-and-Sum (DAS) beamformer, and their performance has been evaluated. This analysis helps to give a better understanding of the GPU architecture and of how to write applications for it.

Schematically, Figure 14 shows the main stages of a general beamformer. As can be appreciated, there are three main operations: pre-processing of the signals, beamforming and post-processing. In the software system we propose (Figure 10), all beamforming procedures take place in the GPU.


**Figure 14.** Schematic diagram of the main parts of a general SAFT beamformer

Implementing the imaging algorithm on GPU systems primarily involves parallelizing the core algorithm into small independent threads which can be executed by the GPU at runtime. The imaging process thus occurs in multiple stages, closely following the scheme detailed in Figure 14. In order to maximize GPU efficiency and reduce the image generation time as much as possible, a specific solution has been designed for each task. Figure 15 shows how these tasks have been parallelized on the GPU.

The first step consists of copying the complete set of acquired signals from CPU memory to GPU memory. Since this transaction is slow, it is recommended to copy all the signals at once rather than doing it signal by signal.

## **3.4. Pre-processing**


The pre-processing of the complete set of signals is a fundamental part of the image generation process. Let *Xtx*,*rx*(*t*) be the received signal for any emitter *tx* and receiver *rx* pair; a function *H*(*t*) is applied to every signal as the following expression suggests:

$$Y_{tx,rx}(t) = X_{tx,rx}(t) \cdot H(t) \tag{5}$$

**Figure 15.** System beamforming loop parallelized on GPU for SAFT implementation

where the function *H*(*t*) is composed of two stages:

$$H(t) = H_F(t) \cdot H_{IQ}(t) \tag{6}$$

where *HF*(*t*) is a signal-conditioning stage in which a filter is applied to remove the offset level introduced by the acquisition system and to reduce the noise.

Additionally, and for convenience, the acquired signals can be decomposed into their analytic signal form [22] (in-phase *I* and quadrature *Q* components). Thereby, the second function, *HIQ*(*t*), is the Hilbert transform, which reduces the errors and artefacts that appear at the envelope detection stage. The signals *Xtx*,*rx*(*t*) can then be expressed as:

$$X_{tx,rx}(t) = I_{tx,rx}(t) + jQ_{tx,rx}(t) = \mathbf{X}_{tx,rx}(t)\,e^{j\varphi_{tx,rx}(t)} \tag{7}$$

where **X***tx*,*rx*(*t*) is the modulus and *φtx*,*rx*(*t*) its corresponding signal phase.

#### *3.4.1. Parallel implementation*

In order to carry out a parallel implementation of these operations, the proposed strategy is a signal-oriented parallelization: a GPU computational thread is associated with each stored signal sample. Thus, for signals of *L* samples, the computational grid of the kernel is formed as shown in Figure 15, with *BX* = ⌈*L*/*TBX*⌉ blocks in the x-dimension and *BY* = 2*N* − 1 blocks in the y-dimension. The number of threads per block, *TBX*, is an empirical value, and the designer should evaluate the best choice according to the GPU resources. Typical values are 32, 64 or 128 threads per block (generally a power of two); according to our tests, we chose 256 as the optimal value.
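The grid sizing above can be sketched as follows (a minimal illustration; `preprocessing_grid` is a hypothetical helper name, not part of the described implementation):

```python
# Block/grid sizing for the signal-oriented kernel: one thread per sample
# of each of the 2N - 1 signals of L samples.
from math import ceil

def preprocessing_grid(L, N, TBX=256):
    BX = ceil(L / TBX)   # blocks along x: covers the L samples of one signal
    BY = 2 * N - 1       # blocks along y: one row per acquired signal
    return BX, BY

# e.g. L = 2000 samples, N = 32 elements, 256 threads per block:
# BX = ceil(2000 / 256) = 8, BY = 63
```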

There is no limitation on the filter length, because its coefficients are stored in texture memory, which resides in device memory and is cached in the texture cache to optimize read accesses. Thus, each thread reads the filter coefficients and the *L* samples of a signal from memory, convolving them to obtain one filtered sample.

Later on, the Hilbert transform is applied to every filtered signal to obtain its analytic signal. In this case, the FFT algorithms provided by CUDA (the CUFFT library [20]) are used to compute the IFFT of the product of the FFTs of the corresponding signal and of the Hilbert transformer, as defined in [23]. With this library there is no need to define a new kernel or to specify grid and block dimensions, since it is responsible for properly parallelizing and splitting the algorithm, computing the FFT of the data set directly on the GPU. In our particular case, a total of 2*N* − 1 FFTs of *L* points are calculated in parallel. The resulting I/Q (in-phase/quadrature) signal pairs (Figure 15) are stored in texture memory and passed to the next stage via global device memory.
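The FFT-based I/Q decomposition described above can be sketched in plain Python, substituting a naive DFT for the CUFFT transforms (illustration only: the O(n²) DFT stands in for the fast transforms actually used, and the helper names are hypothetical):

```python
import cmath
from math import cos, pi

def dft(x, inverse=False):
    """Naive DFT / inverse DFT (stands in for CUFFT's fast transforms)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[k] * cmath.exp(sign * 2j * cmath.pi * m * k / n)
               for k in range(n)) for m in range(n)]
    return [v / n for v in out] if inverse else out

def analytic_signal(x):
    """I/Q decomposition: IFFT of the signal spectrum multiplied by the
    Hilbert-transformer response (2 on positive freqs, 0 on negative),
    for even-length signals."""
    n = len(x)
    X = dft(x)
    h = [1.0] + [2.0] * (n // 2 - 1) + [1.0] + [0.0] * (n // 2 - 1)
    return dft([Xm * hm for Xm, hm in zip(X, h)], inverse=True)

# The envelope of a pure tone is (up to rounding) constant:
x = [cos(2 * pi * 3 * k / 32) for k in range(32)]
env = [abs(v) for v in analytic_signal(x)]   # ≈ 1.0 everywhere
```

The modulus of the analytic signal gives directly the envelope used later in the post-processing stage.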

#### *3.4.2. Optional stage: Decoding*


When Golay encoding is used during acquisition, it is necessary first to merge and deconvolve the 4*N* − 2 received signals, where 50% of the signals belong to the A and B codes, respectively. This can be done very fast with a parallel implementation whose strategy is also signal-oriented: the previous kernel can be modified to include the sum of both signals in parallel before the filter coefficients are applied, finally obtaining 2*N* − 1 signals.
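Why the decoding step works can be illustrated with a short complementary-pair sketch (assuming, for illustration, a textbook length-4 Golay pair; the actual codes and lengths used in the system are not specified here):

```python
# The autocorrelations of a complementary (Golay) pair sum to a delta,
# so correlating each received signal with its own code and adding the
# two results cancels the range sidelobes.
def autocorr(c):
    n = len(c)
    return [sum(c[i] * c[i + k] for i in range(n - k)) for k in range(n)]

A = [1, 1, 1, -1]      # a length-4 complementary pair
B = [1, 1, -1, 1]
combined = [a + b for a, b in zip(autocorr(A), autocorr(B))]
# combined == [8, 0, 0, 0]: a perfect delta of amplitude 2 * len(A)
```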

#### **3.5. Beamforming: Delay and Sum**

All time-domain imaging algorithms are based on the principle of delay-and-sum beamforming. Typically, these algorithms emulate an acoustic lens by applying appropriate time delays to the array elements in order to focus or steer the beam as desired. SAFT beamformers focus the beam at every point of the image, giving better defect detectability, as mentioned before [2, 4, 5, 7]. DAS beamforming is not difficult to implement and permits the use of arbitrary array geometries, which makes it suitable for a wide range of applications.

According to the Hilbert transformation of the first step, two processing streams are created, in which two parallel images are calculated following these equations:

$$A_I(x,z) = \sum_{i=1}^{N} \sum_{k=1}^{N} I_{tx_i,rx_k}(D(x,z)) \tag{8}$$

$$A_Q(x,z) = \sum_{i=1}^{N} \sum_{k=1}^{N} Q_{tx_i,rx_k}(D(x,z)) \tag{9}$$

where *AI*(*x*, *z*) and *AQ*(*x*, *z*) are the in-phase and quadrature images, respectively, and *D*(*x*, *z*) is the focusing delay for the spatial point (*xp*, *zp*) in the grid, which is calculated as follows:

$$D(x,z) = \frac{\sqrt{(x_p - x_{tx})^2 + z_p^2} + \sqrt{(x_p - x_{rx})^2 + z_p^2}}{c} \tag{10}$$

being *xtx* and *xrx* the coordinates of the transducer elements *tx* and *rx*, respectively.

Henceforth, we will focus on all the operations involved in the Delay-and-Sum algorithm, studying the different alternatives, their parallel implementation and the best way to optimize them.

• **Lens calculation**. A fundamental part of beamforming is calculating the differences in wave arrival time between array elements. Each signal sample has to be properly delayed according to the distance from the spatial point to the emitter or receiver array elements. The delays are calculated using Equation 10. Although conceptually this is a delay, what is actually done is a mapping into the memory buffer (at the sampling frequency) where the corresponding sample value of the signal is retrieved. The number of delays to be calculated is therefore usually large, and it is given by:

$$Memory|_{lens} = R_H \times R_V \times (2N - 1) \tag{11}$$

where *RH* × *RV* are the dimensions of the desired ultrasonic image. Thus, the lens calculation can be afforded using two different approaches:

• *Load pre-calculated delays*. The delays are pre-calculated before beamforming and recovered from a look-up table inside the image generation process. The memory needed to store all the delays is not a significant problem; the main drawbacks are the high bandwidth required to make the process fast and the need to update the table each time. This would therefore be a good solution for non in-vivo inspections, where the scenario is known and the delays are calculated only once for the complete acquisition.

• *Calculate delays on-the-fly*. The delays are dynamically calculated inside the beamforming process. Although at first sight computationally more expensive than the first alternative, this is not a heavy computational problem given the power of current systems. Moreover, dynamic calculation of the lenses inside the threads simplifies other operations on images, such as scrolling and zooming.

Which approach to choose depends on the rest of the beamformer implementation. In order to take full advantage of the GPU, a balance between bandwidth use and arithmetic operations is needed. In this regard, it has proved faster to obtain the lens values inside the kernels than to store them in device memory; therefore, in our proposal, it makes sense to calculate the delays on-the-fly.

• **Filtering**. In a real implementation, the elements are sampled at a rate just above the Nyquist criterion. Although this preserves the frequency content of the signal, it does not give enough steering-delay resolution. The solution is to perform a digital interpolation, increasing the steering-delay resolution. In this particular case, linear and polynomial interpolation can both be easily implemented. The results obtained are practically identical, although their costs differ, polynomial interpolation taking twice the time of linear interpolation. For this reason, we decided to simply interpolate across two consecutive samples. The penultimate operation is the application of a window function, which is multiplied with the data from each channel mainly in order to reduce the level of the sidelobes.

• **Sum**. The final step in the ultrasonic generation process is to obtain the accumulated sum of all the signal samples which contribute to a given spatial point.
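The on-the-fly lens of Equation 10, and its mapping to a buffer index, can be sketched as follows (assuming, for illustration, a sound speed of 1540 m/s and hypothetical helper names):

```python
# Focusing delay (Eq. 10) as each GPU thread would compute it for its pixel,
# then mapped to a sample index in the acquisition buffer.
from math import sqrt

def focusing_delay(xp, zp, x_tx, x_rx, c=1540.0):
    """Two-way time of flight (s): emitter x_tx -> point (xp, zp) -> receiver x_rx."""
    return (sqrt((xp - x_tx) ** 2 + zp ** 2) +
            sqrt((xp - x_rx) ** 2 + zp ** 2)) / c

def sample_index(xp, zp, x_tx, x_rx, fs, c=1540.0):
    """The delay expressed as a sample index at sampling frequency fs."""
    return round(focusing_delay(xp, zp, x_tx, x_rx, c) * fs)

# A point 10 mm deep, centred between elements at x = -/+5 mm, fs = 40 MHz:
k = sample_index(0.0, 0.01, -0.005, 0.005, 40e6)
```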


#### *3.5.1. Parallel implementation*


The delay-and-sum process is applied to the complex signals obtained in the previous stage. We have identified different strategies to implement the ultrasonic image generation process in a GPU depending on how the algorithm is parallelized with respect to threads and blocks and relative to the use of GPU resources.

As Figure 16 shows, the parallelization is carried out by launching a thread per image pixel. To this end, a computational grid (*GRID*1) with *BX* = ⌈*RH*/*TBX*⌉ and *BY* = ⌈*RV*/*TBY*⌉ blocks of *TBX* × *TBY* threads is defined on the kernel, where *RH* and *RV* are the desired image resolutions in the horizontal and vertical directions, respectively. Each thread is then responsible for calculating the coordinates of the spatial point (*xp*, *zp*) of a specific image pixel and the lens needed to focus at this point.

**Figure 16.** One thread is responsible for one image pixel

In this case, the lens is formed by the 2*N* − 1 times of flight of each emission-reception pair combination. In order to accelerate these calculations, the transducer element coordinates are stored in the constant memory of each GPU multiprocessor. In addition, the computed distances from an array element to an image pixel are reused, avoiding duplicate calculations. The lenses obtained allow us to index into the complex signals stored in texture memory, and the real and imaginary parts are interpolated when needed; linear interpolation was implemented, obtaining good performance. Then, the 2*N* − 1 resulting complex samples are multiplied by the corresponding apodization gains and added together. Finally, the resulting image (final image in Figure 15) is also stored in texture memory for quick data access from the post-processing stage.

#### **3.6. Post-processing**

The post-processing stage first involves calculating the envelope (in essence, the modulus) of the beamformed images, according to the following expression:

$$A = \sqrt{A_I^2 + A_Q^2} \tag{12}$$

where *AI* and *AQ* are the *in-phase* and *quadrature* images derived from the beamforming process. This operation prevents the appearance of diverse artefacts associated with the Hilbert transform.

Likewise, an optional stage of the process is in charge of normalizing the image and converting it to a decibel scale. Although this is not a complex task, it cannot be carried out in the previous stage, because the maximum value of the image must be known first:

$$A|_{decibels} = 20\log_{10}\left(\frac{A}{\max(A)}\right) \tag{13}$$

Finally, the generated ultrasonic image (final image (dB) referenced in Figure 15) is directly displayed on the screen using the OpenGL libraries, which provide specific functionality for graphics representation.

#### *3.6.1. Parallel implementation*

The parallel implementation of the envelope calculation is carried out inside the beamforming kernel. This is because, at the end of the pixel calculation, we already have the final output values for both the *I* and *Q* components; thus we avoid writing twice and obtain a single image. For the optional conversion to the decibel scale, a new kernel (Kernel 3 in Figure 15) is defined, which uses a grid with ⌈*RH*/*TBX*⌉ and ⌈*RV*/*TBY*⌉ blocks of *TBX* threads, having a thread per image pixel as before.

#### **3.7. Performance**

An NVIDIA Quadro 4000 graphics card was used to test the beamforming times achieved with the system proposed here. This card has 256 cores and 1 GB of global memory, and it was installed in a computer with a four-core 2.66 GHz Intel Q9450 processor and 4 GB of RAM. The GPU-based implementation of the beamformer was built and tested for all the acquisition strategies exposed along this chapter. Figure 17 presents the computing times for image sizes ranging from 200×200 up to 800×800 for both the TFM and minimum redundancy (MR) solutions, where it is evident that, despite the great power of the GPU, the TFM solution is a very intensive procedure.

**Figure 17.** Computing times for TFM and MR solutions using GPU, in µs

In Figure 18, the frame rate obtained for different image sizes when 2R-SAFT and kA-SAFT are employed is presented. In particular, attending to the case of an image with 500×500
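The per-pixel work of Sections 3.5 and 3.6 (Equations 8 to 13) can be condensed into a plain-Python sketch of what one GPU thread computes (illustrative only: `das_pixel` and `to_decibels` are hypothetical names, signals are indexed by (tx, rx) pairs, and the apodization is reduced to a single scalar here):

```python
# For one pixel: loop over emitter/receiver pairs, linearly interpolate the
# I/Q samples at the focusing delay, apodize and sum (Eqs. 8-10), then take
# the envelope (Eq. 12); to_decibels applies the optional Eq. 13 scaling.
from math import sqrt, log10

def das_pixel(xp, zp, elem_x, signals_i, signals_q, fs, c=1540.0, apod=1.0):
    ai = aq = 0.0
    for tx, x_tx in enumerate(elem_x):
        for rx, x_rx in enumerate(elem_x):
            t = (sqrt((xp - x_tx)**2 + zp**2) + sqrt((xp - x_rx)**2 + zp**2)) / c
            s = t * fs                       # fractional sample index
            k, frac = int(s), s - int(s)
            si, sq = signals_i[(tx, rx)], signals_q[(tx, rx)]
            if k + 1 < len(si):              # linear interp over two samples
                ai += apod * ((1 - frac) * si[k] + frac * si[k + 1])
                aq += apod * ((1 - frac) * sq[k] + frac * sq[k + 1])
    return sqrt(ai**2 + aq**2)               # envelope, Eq. 12

def to_decibels(img):
    m = max(max(row) for row in img)
    return [[20 * log10(v / m) if v > 0 else float("-inf") for v in row]
            for row in img]                  # Eq. 13: the peak maps to 0 dB
```

In the GPU version, the double loop runs inside a single thread while one thread per pixel runs in parallel, and the decibel conversion needs a separate kernel because max(A) is a global reduction over the whole image.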
