**4. GPUs: A computing short term alternative**

Currently, the three computing barriers are present, and they are the main cause of the stagnation into which CPU-based technology has fallen. Some solutions based on alternative technologies, such as graphics processing units (GPUs) and Field Programmable Gate Arrays (FPGAs), have been proposed to mitigate these phenomena [4, 22]. They seem to be short- and mid-term solutions, respectively.

GPUs are a product of the gaming industry, which was born in the 50s and became a continuously growing market in the 80s. Video games must execute computationally intensive algorithms to display their images, but, unlike other applications such as seismic migration, the interaction with the user requires that they execute in a very short time. Therefore, since the 90s, specialized processors have been developed for this purpose. These processors are called GPUs, and they have been widely commercialized in video-game consoles and PC video acceleration cards [37].

GPUs are application-specific processors that reach high performance on the task they were designed for, in the same way that an assembly line would outperform itself if it were dedicated to assembling a small range of vehicle types: the unnecessary machines can be eliminated, the freed area can be optimized, and the availability, storage, and handling of the parts improve (see Figure 7).

**Figure 7.** GPU analogy

The task of a GPU is an iterative process that generates an image pixel by pixel (point by point) from some data and input parameters. This allows each output pixel to be processed in parallel.

**5. FPGAs: Designing custom processors**

Another alternative to accelerate the SM process is the FPGA (Field Programmable Gate Array). These devices are widely used to accelerate many complex algorithms; for this reason, in recent years some traditional manufacturers of high-performance computing equipment have begun to offer computer systems that include FPGAs.

To get an initial idea of this technology, let us imagine that the car assembly process is going to be amended as follows:

• […] units (remember that each line now has only the functional units required to assemble a single vehicle type).

• In order to increase production, the assembly line will be replicated as many times as possible; the number of replicates will be determined by two aspects: first, the speed with which the inputs can be placed at the beginning of the assembly line, and second, the space available within the company.

#### **5.1. FPGA architecture**

An FPGA consists mainly of three types of components (see Figure 8): Configurable Logic Blocks (or Logic Array Blocks, depending on the vendor), reconfigurable interconnects, and I/O blocks.

**Figure 8.** FPGA components.

In the analogy of the car assembly line, the Configurable Logic Blocks (CLBs) are the raw material for the construction of each of the functional units required to assemble the car. In an FPGA, the CLBs are used to build all the functional units required by a specific algorithm; these units can range from simple logical functions (AND, OR, NOT) to complex mathematical operations (logarithms, trigonometric functions, etc.). Therefore, a first step in the implementation of an algorithm on an FPGA is the construction of each of the functional units required by the algorithm (see Figure 9). The design of these functional units is carried out by means of Hardware Description Languages (HDLs) such as VHDL or Verilog.

The design of a functional unit using HDLs is not simply a matter of writing code free of syntax errors; the process corresponds more to the design of a digital electronic circuit [13], which requires a good grounding in digital hardware. It must be remembered that this process consumes more time than an implementation on a CPU or a GPU.


**Figure 9.** FPGA analogy.

Currently, FPGA vendors (among which Xilinx [42] and Altera [2] stand out) offer pre-designed modules that facilitate the design; in this way, commonly used blocks (such as floating-point units) do not have to be designed from scratch for each new application. These modules are offered either as an HDL description (softcore modules) or physically implemented inside the FPGA (hardcore modules).

The reconfigurable interconnects allow the CLBs and the functional units to be connected to each other; in the analogy of the car assembly line, the reconfigurable interconnects can be viewed as the conveyor belt on which the parts necessary for the assembly of the cars circulate.

Finally, the I/O blocks communicate the pins of the device with the internal logic of the FPGA; in the analogy of the car assembly line, the I/O blocks are the devices that place the parts at the beginning of the conveyor belt and also the devices that bring the finished cars out of the factory. The I/O blocks are responsible for controlling the flow of data between external physical devices and the functional units in the interior of the FPGA. The different vendors design these blocks to support different digital communication standards.

#### **5.2. Algorithms that can be accelerated in an FPGA**

Not all applications benefit in the same way from an FPGA, and, given the difficulty of implementing an application on these devices, it is advisable to make some preliminary estimates in order to determine the possibility of speeding up a specific algorithm.


Applications that require processing large amounts of data with little or no data dependency<sup>3</sup> are ideal for implementation on FPGAs; in addition, these applications must be limited by computation and not by data transfer, that is to say, the number of mathematical operations must be greater than the number of read and write operations [20]. In this regard, the SM has in its favor that it requires processing large amounts of data (on the order of tens of terabytes) and performing billions of mathematical operations. However, migration also has a large number of read and write instructions, which cause significant delays in the computational process.
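A rough way to make such a preliminary estimate is to compare a kernel's arithmetic work against its memory traffic. The sketch below does this with invented, illustrative counts; the function name and all figures are assumptions, not measured SM values.

```python
# Rough feasibility check for FPGA acceleration: the algorithm should be
# compute-bound, i.e. perform many arithmetic operations per byte moved.
# All counts below are illustrative placeholders, not measured SM figures.

def arithmetic_intensity(math_ops, bytes_read, bytes_written):
    """Arithmetic operations per byte of memory traffic."""
    return math_ops / (bytes_read + bytes_written)

# Hypothetical kernel: 400 operations per sample, reading and writing one
# 4-byte single-precision value each.
intensity = arithmetic_intensity(math_ops=400, bytes_read=4, bytes_written=4)
print(intensity)        # 50.0 ops per byte: likely compute-bound
print(intensity > 1.0)  # True
```

A kernel whose intensity is low is limited by data transfer rather than by computation, which is roughly the situation the criterion from [20] warns about.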

Furthermore, the accuracy and the data format are other aspects that influence the performance of applications on FPGAs. With lower data accuracy (fewer bits to represent data), the performance of the application inside the FPGA increases; regarding the data format, FPGAs achieve better performance with fixed-point numbers than with floating-point numbers. The SM has traditionally been implemented using single-precision floating-point numbers [1, 21]. [14, 16] show that it is possible to implement some parts of the SM using fixed point instead of floating point and produce images with very similar patterns (visually identical), reducing computation times.
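The fixed-point trade-off can be illustrated with a minimal quantization sketch. The Q1.15 format chosen here is an assumption made for the illustration, not the format used in the cited SM implementations.

```python
# Illustrative Q1.15 fixed-point quantization of a sample value, showing the
# precision trade-off discussed above. The Q1.15 format is an assumption for
# this sketch, not the format used in the cited SM implementations.

SCALE = 1 << 15  # 15 fractional bits

def to_fixed(x):
    """Quantize a real value in [-1, 1) to a Q1.15 integer."""
    return round(x * SCALE)

def to_float(q):
    """Recover the real value represented by a Q1.15 integer."""
    return q / SCALE

sample = 0.482913
q = to_fixed(sample)
error = abs(sample - to_float(q))
print(q)                  # 15824
print(error < 1 / SCALE)  # True: quantization error under one LSB
```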

The FPGAs have a disadvantage in terms of operating frequency when compared with a CPU (10 times lower) or a GPGPU (5 times lower); for this reason, in order to accelerate an application inside an FPGA, it must perform at least 10 times more computational work per clock cycle than is performed in a CPU [20]. In this regard, it is important that the algorithm have great potential for parallelization. In this aspect, it helps the SM that the traces can be processed independently, since the shots are made at different times [4], which facilitates the parallelization of the process.
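The break-even rule above can be checked with a back-of-envelope calculation; the clock frequencies used below are illustrative assumptions.

```python
# Back-of-envelope check of the rule above: with a clock roughly 10x slower
# than a CPU's, an FPGA design must do at least 10x more work per cycle just
# to break even. The frequencies below are illustrative assumptions.

def fpga_speedup(cpu_freq_mhz, fpga_freq_mhz, work_per_cycle_ratio):
    """Throughput of the FPGA design relative to the CPU."""
    return (fpga_freq_mhz / cpu_freq_mhz) * work_per_cycle_ratio

# Example: 3000 MHz CPU core versus 300 MHz FPGA fabric.
print(fpga_speedup(3000, 300, 10))  # 1.0: merely break-even
print(fpga_speedup(3000, 300, 40))  # 4.0: a worthwhile acceleration
```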

#### **5.3. Implementation of the SM on FPGAs**

Since 2004, processes related to the construction of subsurface images have been implemented on FPGAs. These implementations have made great contributions to the problems of processing speed and memory.

#### *5.3.1. Breaking the power wall*

As mentioned above, the operating frequency of traditional computing devices has been stalled for several years due to problems related to power dissipation. To mitigate the problems associated with processing speed inside FPGAs, modules have been implemented that optimize the execution of expensive mathematical operations<sup>4</sup>: for example, in [21] functional units were designed using the CORDIC method [3] to perform the square root, and in [17] Lee approximations [28] are used to perform trigonometric functions. Other research has addressed this problem from the perspective of number representation [17]; the purpose is to change the single-precision (32-bit) floating-point format to a fixed-point format (something possible on neither CPUs nor GPUs) or to a floating-point format of fewer than 32 bits, so that the mathematical operations can be carried out in less time.

<sup>3</sup> Data dependency is a situation in which the instructions of a program require results of previous ones that have not yet been completed.

<sup>4</sup> Expensive mathematical operations are those that take more clock cycles to complete and consume more hardware resources; the expensive operations used in seismic processing include square roots, logarithms, and trigonometric functions, among others.

All these investigations have reported significant reductions in the processing times of the SM sections when compared with state-of-the-art CPUs.
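To give a flavor of such multiplier-free functional units, below is a digit-by-digit integer square root that uses only shifts, additions, and comparisons. It is a simpler shift-add method in the same hardware-friendly spirit as the CORDIC units of the cited work, not a reproduction of them.

```python
# Digit-by-digit (shift-and-subtract) integer square root: only shifts,
# additions and comparisons, so it maps naturally onto CLB logic. It is a
# simpler relative of the CORDIC approach, shown here for flavor only.

def isqrt(n):
    """Largest integer r such that r*r <= n, for 0 <= n < 2**32."""
    assert 0 <= n < (1 << 32)
    r = 0
    bit = 1 << 30  # highest power of four that fits in 32 bits
    while bit > n:
        bit >>= 2
    while bit:
        if n >= r + bit:
            n -= r + bit
            r = (r >> 1) + bit
        else:
            r >>= 1
        bit >>= 2
    return r

print(isqrt(1_000_000))  # 1000
print(isqrt(15))         # 3
```

Because each iteration retires one result bit with fixed-cost logic, a hardware version pipelines naturally, one stage per bit.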

#### *5.3.2. Breaking the memory wall*

Computing systems based on FPGAs have both on-board memory<sup>5</sup> and on-chip memory<sup>6</sup>; these two types of memory are faster than external memory. Some research has produced implementations intended to reduce the delays caused by read and write operations [22], designing special memory architectures that optimize the use of the different types of memory available to FPGAs. The intent is that the majority of read and write operations be performed at the speed of the on-chip memory, because it is the fastest; the challenge, however, is to place all the data required by each instruction in on-chip memory, because it has the smallest capacity.
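The blocking idea behind such memory architectures can be sketched in software: stream a large dataset through a small, fast buffer that stands in for on-chip memory. The buffer size and the kernel are illustrative assumptions.

```python
# Software sketch of the blocking idea behind such memory architectures:
# stream a large dataset through a small fast buffer (standing in for
# on-chip memory) so the inner loop always reads from fast storage.
# The buffer size and kernel below are illustrative assumptions.

ON_CHIP_WORDS = 1024  # pretend capacity of the fast on-chip buffer

def process_in_tiles(data, kernel):
    """Apply `kernel` element-wise, one on-chip-sized tile at a time."""
    out = []
    for start in range(0, len(data), ON_CHIP_WORDS):
        tile = data[start:start + ON_CHIP_WORDS]  # burst copy into the buffer
        out.extend(kernel(x) for x in tile)       # compute at on-chip speed
    return out

result = process_in_tiles(list(range(5000)), lambda x: x * x)
print(len(result))  # 5000
print(result[3])    # 9
```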

### **6. Wishlist**


In spite of the great potential of both FPGAs and GPGPUs to reduce the computation times of seismic processes, their performance is currently being held back, because the rate at which seismic data are transmitted from main memory to the computing device (FPGA or GPU) is not enough to keep its computing potential 100% busy. The communication between FPGAs or GPUs and the exterior can be carried out through a large number of I/O interfaces; currently one of the most used is the Peripheral Component Interconnect Express (PCIe) port, which transfers data between the computing device (FPGA or GPU) and a CPU at speeds of 2 GB/s [14], and this transfer rate cannot provide all the data necessary to keep the computing device busy.
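This bottleneck can be quantified with a simple utilization estimate. The 2 GB/s link speed comes from the text above; the device's demand figures are illustrative assumptions.

```python
# Simple utilization estimate of the bottleneck described above: how much
# of a device's data demand can a 2 GB/s PCIe link actually satisfy?
# The demand figures are illustrative assumptions, not measurements.

def link_utilization(demand_gb_s, link_gb_s=2.0):
    """Fraction of the device's input demand the link can deliver."""
    return min(1.0, link_gb_s / demand_gb_s)

print(link_utilization(8.0))  # 0.25: the device is starved 75% of the time
print(link_utilization(1.0))  # 1.0: here the link is not the bottleneck
```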

Currently, a PhD project is being carried out at the Universidad Industrial de Santander (Colombia) that seeks to increase the transfer rate of seismic data between the main memory and the FPGA by means of data compression.

In addition, there are currently two approaches to reducing the design time on FPGAs (the main problem of this technology). The first strategy is to use C-to-HDL compilers [11, 25, 35, 39, 43], which generate a description in an HDL from C code, without the design having to be performed manually. The second strategy is to use high-level languages, the so-called C-like languages such as Mitrion-C [31] or SystemC [26]; despite the need for them, these languages have not yet achieved wide acceptance in the market, because their use still compromises the efficiency of the designs. Research on C-to-HDL compilation and high-level languages is active [29, 33, 34, 40, 44] and some results have been positive [7], but the possibility of accessing all the potential of the FPGAs from a high-level language is still in development.

On the other hand, it can be seen that the technological evolution of the GPUs is similar to the beginnings of the PC evolution, when progress was subject to Moore's law. It is therefore expected that in the coming years new GPGPU families will continue to appear, making inroads into ever more areas of HPC, but the time will definitively come when this technology stagnates under the same three barriers that stopped conventional computers. When that time comes, it is expected that the FPGAs will be technologically mature and can take the HPC baton during the following years.

<sup>5</sup> This is the memory available on the board that contains the FPGA.

<sup>6</sup> These are blocks of memory inside the FPGA.

In the end, we believe that the future of HPC is going to be built from heterogeneous clusters composed of these three technologies (CPUs, GPUs, and FPGAs). These clusters will have a special operating system (O.S.) that will take advantage of each of these technologies, reducing the effect of the three barriers and getting the best performance in each application.
