the GPU, since it has a better memory read pattern than 3D FFT. The measured speedup on an NVIDIA GeForce GTX 285 was about ×45 compared to running Hex with 1D FFTs on a single CPU core.

*5.6.2. Calculating pairwise potentials*

Reference [38] focuses on accelerating the calculation of the pairwise potentials between the protein and ligand atoms. In many docking applications this is performed with pre-calculated grids, which reduce the complexity from O(Nprot\*Nlig) to O(Nlig) during docking (where N denotes the number of atoms of the molecule). This implementation, however, calculates the double sum directly: one protein atom is assigned to each CUDA thread, which iterates over the ligand atoms and calculates the corresponding potential values. Although the effectiveness of the approach is uncertain due to the increased complexity, it is interesting since it fits the GPU architecture perfectly. The number of protein atoms is usually high enough to keep the multiprocessors of the GPU busy, so it is not necessary to evaluate multiple ligand positions simultaneously. In addition, the amount of input data is smaller (the number of protein atoms is usually lower than that of the grid points), making this approach less memory-intensive. Depending on the molecule sizes, speedups between ×10 and ×260 were observed on an NVIDIA Tesla C1060 GPU, compared to the same algorithm running on an Intel Xeon E5530 CPU.

*5.6.3. Using multiple GPUs*

Like the previous section, reference [39] deals with accelerating only the pairwise potential calculation on GPU. The scoring function consists of two usual terms representing the van der Waals and electrostatic interactions. However, in this implementation two separate GPU devices are used; one calculates the van der Waals term, the other the electrostatic term. In a real docking application this approach would probably be impractical due to the required CPU-GPU memory transfer operations. Still, the applicability of multiple GPUs to the docking problem is intriguing; the most trivial way of utilizing them is to perform independent runs on the different devices. With this implementation, overall speedup factors between ×118 and ×193 were achieved; the test platform consisted of a 2.4 GHz Intel Core 2 Quad CPU and an NVIDIA GeForce 8800 GTX GPU.

**6. Conclusion**

Three FPGA-based and several GPU-based molecular docking implementations were surveyed in the previous sections. Although molecular docking algorithms are quite diverse in general, the methods introduced in this chapter actually fall into two categories. Both categories represent a docking approach that is easily parallelizable and thus suits the architecture of accelerator platforms well.

The first group includes the correlation-based methods (Sections 4.1, 4.2, 5.1, 5.2 and 5.6.1). As was shown, correlation is a massively parallel operation and can be implemented effectively on FPGAs; on GPUs, in turn, it can be performed with optimized FFT kernels. This makes correlation-based docking algorithms ideal for hardware acceleration; the limitation is that they support only rigid-body docking.
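The parallel nature of correlation noted for the first group can be illustrated with a minimal sketch. It assumes, purely for illustration, that the receptor and the ligand are sparse voxel grids stored as dictionaries and that translations wrap around cyclically; the function names are hypothetical and this is not the code of any surveyed implementation. Each translation's score is an independent sum of products, so all translations could be evaluated concurrently; an FFT-based implementation would obtain the same scores (up to rounding) with fewer operations.

```python
from itertools import product


def correlate_direct(receptor, ligand, n):
    """Score every cyclic translation of `ligand` against `receptor`.

    Both grids are dicts mapping (x, y, z) voxel indices on an n*n*n grid
    to values; voxels that are absent count as 0 (an assumed sparse layout).
    """
    scores = {}
    for shift in product(range(n), repeat=3):
        total = 0.0
        # The sum for one translation never reads another translation's
        # result, which is what makes direct correlation easy to parallelize.
        for (x, y, z), lig_val in ligand.items():
            key = ((x + shift[0]) % n, (y + shift[1]) % n, (z + shift[2]) % n)
            total += receptor.get(key, 0.0) * lig_val
        scores[shift] = total
    return scores


def best_translation(scores):
    """Return the translation with the highest correlation score."""
    return max(scores, key=scores.get)


if __name__ == "__main__":
    receptor = {(2, 3, 1): 1.0, (3, 3, 1): 1.0}
    ligand = {(0, 0, 0): 1.0, (1, 0, 0): 1.0}
    print(best_translation(correlate_direct(receptor, ligand, 4)))  # (2, 3, 1)
```

The direct form costs one multiply-add per ligand voxel per translation, which is why the surveyed GPU implementations prefer FFT-based evaluation for larger grids.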

The second group includes docking algorithms based on an inherently parallel global optimization algorithm (Sections 4.3 and 5.3-5.5). Both the evolutionary algorithms used by AutoDock and MolDock and the ant colony optimization method of PLANTS operate on sets of potential solutions, which allows the members of a set to be processed in parallel. The usual pairwise scoring functions applied by these programs offer further parallelization at the level of atoms or atom pairs. Unlike the correlation-based methods, these methods also support the modeling of molecular flexibility.
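The two levels of parallelism described for the second group can be sketched as follows. This is a minimal illustration under stated assumptions, not the scoring function of AutoDock, MolDock or PLANTS: the potential is a toy 12-6 term plus a Coulomb-like term with made-up constants, a pose is a rigid translation only (no flexibility), and all function names are hypothetical. The inner `partial_sum` corresponds to the work that could be given to one thread per protein atom, while `score_population` evaluates the members of a population independently. Threads are used only to show that the evaluations are independent; in CPython they will not generally speed up this CPU-bound loop.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def pair_potential(r, q_p, q_l):
    """Toy 12-6 term plus a Coulomb-like term (illustrative constants only)."""
    r = max(r, 0.5)  # clamp to avoid the singularity at r = 0
    return (1.0 / r**12 - 2.0 / r**6) + q_p * q_l / r


def score_pose(protein, ligand, pose):
    """Direct double sum over all protein-ligand atom pairs for one pose.

    Atoms are (x, y, z, charge) tuples; `pose` is a rigid translation.
    """
    moved = [(x + pose[0], y + pose[1], z + pose[2], q) for x, y, z, q in ligand]

    def partial_sum(p_atom):
        # Atom-level parallelism: one protein atom against every ligand atom.
        return sum(
            pair_potential(math.dist(p_atom[:3], l_atom[:3]), p_atom[3], l_atom[3])
            for l_atom in moved
        )

    return sum(partial_sum(p_atom) for p_atom in protein)


def score_population(protein, ligand, poses, workers=4):
    """Population-level parallelism: each candidate pose is scored independently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda pose: score_pose(protein, ligand, pose), poses))


if __name__ == "__main__":
    protein = [(0.0, 0.0, 0.0, -0.5), (1.5, 0.0, 0.0, 0.3)]
    ligand = [(0.0, 2.0, 0.0, 0.2), (0.0, 3.0, 0.0, -0.1)]
    poses = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.5, 0.0)]
    print(score_population(protein, ligand, poses))
```

On a real accelerator the two levels would typically be mapped to different granularities of the hardware, for example one block of threads per pose and one thread per atom or atom pair.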

Many of the accelerator-based docking implementations introduced here achieved significant speedups over single-core or even multi-core CPUs. The actual speedup value always depends on the reference platform, of course; still, the results show that molecular docking can be accelerated effectively by hardware, and a performance improvement of 1-2 orders of magnitude can often be obtained. However, this improvement is usually not constant; in many cases it was shown to depend strongly on the input parameters (number of atoms, size of the search space, search exhaustiveness, etc.), which usually makes accelerators more suitable for larger problem sizes.
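One simple way to see why the observed speedup can grow with the problem size is a fixed-overhead model. This is an assumption made here for illustration, not a model taken from the surveyed papers: the accelerated run is taken to cost a constant overhead (initialization, data transfer) plus the CPU time divided by a raw speedup factor. The function name and all numbers below are hypothetical.

```python
def effective_speedup(cpu_time, fixed_overhead, raw_speedup):
    """Observed speedup when the accelerator adds a constant overhead.

    cpu_time:       run time of the reference CPU implementation (seconds)
    fixed_overhead: time spent regardless of problem size (seconds)
    raw_speedup:    speedup of the parallel part alone
    """
    accelerated_time = fixed_overhead + cpu_time / raw_speedup
    return cpu_time / accelerated_time


if __name__ == "__main__":
    # Illustrative values only: 1 s overhead, raw speedup of 100.
    for cpu_time in (1.0, 100.0, 10000.0):
        print(f"{cpu_time:>8.0f} s on CPU -> x{effective_speedup(cpu_time, 1.0, 100.0):.2f}")
    # prints x0.99, x50.00 and x99.01 respectively
```

Under this model the observed speedup approaches the raw factor only when the CPU time is much larger than the overhead, which is consistent with accelerators paying off mainly on larger inputs.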

It should also be noted that the performance improvement may come at a price: in some cases (Sections 4.3, 5.3.2 and 5.5) the original algorithm had to be altered to make it more suitable for parallelization. In these cases the changes were typically related to the local search, which is essentially a sequential algorithm. Such modifications are often necessary; however, they change the behavior and accuracy of the algorithm, which is sometimes unacceptable. Another common requirement is that, in addition to the computationally intensive but parallelizable steps that suit the accelerator architecture well, other parts must also be mapped to the accelerator so that the host-accelerator bandwidth does not become a bottleneck. This, however, may greatly increase the required programming effort.

Another interesting point is the applicability and performance of FPGAs versus GPUs. In the case of the PIPER implementations (Sections 4.2 and 5.1), the FPGA outperformed the GPU when both executed correlation directly; however, thanks to the efficient FFT-based approach, the GPU implementation seemed more suitable, since its performance scaled well with the problem size. In the case of AutoDock (Sections 4.3 and 5.3.3), the GPU outperformed the FPGA in practical cases, although the latter exploited the low-level parallelism of the docking algorithm more effectively and was thus faster than the GPU when the number of parallel runs was low. All these results confirm that GPU devices are real competitors of FPGAs even when only performance is considered. In addition, as mentioned in Section 3, FPGA programming usually requires hardware skills, while GPUs can be programmed in C-like languages (although high-level C-based HDLs exist, they are usually not as effective as VHDL or Verilog). GPU cards are far cheaper than high-performance FPGA accelerators, and they are often already available in a desktop PC. All these facts suggest that GPUs are a better choice of accelerator platform than FPGAs for floating point-intensive applications like the majority of the docking algorithms, although clearly there are problem domains where FPGAs remain superior.


[10] Jones G, Willett P, Glen RC, Leach AR, Taylor R (1997) Development and Validation of a Genetic Algorithm for Flexible Docking. J. Mol. Biol. 267: 727-748.

[11] Trott O, Olson AJ (2010) AutoDock Vina: Improving the Speed and Accuracy of Docking with a New Scoring Function, Efficient Optimization, and Multithreading. J. Comput. Chem. 31: 455-461.

[12] Teodoro ML, Phillips GN, Kavraki LE (2001) Molecular Docking: A Problem With Thousands of Degrees of Freedom. IEEE Int. Conf. on Robotics and Automation, 2001 May 21-26, Seoul, Korea.

[13] Dias R, de Azevedo WF (2008) Molecular Docking Algorithms. Curr. Drug Targets 9: 1040-1047.

[14] Kavraki LE (2007) Protein-Ligand Docking, Including Flexible Receptor-Flexible Ligand Docking. Receptor 1-19. Available: http://cnx.org/content/m11456/latest/. Accessed 2012. 04. 29.

[15] Hauck S, DeHon A (2007) Reconfigurable Computing - The Theory and Practice of FPGA-Based Computation. Morgan Kaufmann. 944 p.

[16] Kilts S (2007) Advanced FPGA Design - Architecture, Implementation, and Optimization. Wiley. 352 p.

[17] NVIDIA CUDA C Programming Guide. Available: http://developer.nvidia.com/nvidia-gpu-computing-documentation. Accessed 2012. 04. 29.

[18] Sanders J, Kandrot E (2010) Cuda by Example - An Introduction to General-Purpose GPU Programming. Addison-Wesley. 312 p.

[19] AMD Accelerated Parallel Processing OpenCL Programming Guide. Available: http://developer.amd.com/sdks/AMDAPPSDK/documentation/Pages/default.aspx. Accessed 2012. 04. 29.

[20] Kozakov D, Brenke R, Comeau SR, Vajda S (2006) PIPER: An FFT-Based Protein Docking Program with Pairwise Potentials. Proteins 65: 392-406.

[21] VanCourt T, Gu Y, Herbordt MC (2004) FPGA Acceleration of Rigid Molecule Interactions. 12th Ann. IEEE Symp. on Field-Programmable Custom Computing Machines, 2004 Apr. 20-23, Napa, USA.

[22] VanCourt T, Gu Y, Mundada V, Herbordt MC (2006) Rigid Molecule Docking: FPGA Reconfiguration for Alternative Force Laws. EURASIP J. on Applied Signal Processing 2006: 1-10.

[23] Sukhwani B, Herbordt MC (2010) FPGA Acceleration of Rigid-Molecule Docking Codes. IET Comput. Digit. Tech. 4: 184-195.

[24] Pechan I, Fehér B, Bérces A (2010) FPGA-Based Acceleration of the AutoDock Molecular Docking Software. Conf. on Ph.D. Research in Microelectronics and Electronics, 2010 July 18-20, Berlin, Germany.

[25] Pechan I, Fehér B (2011) Molecular Docking on FPGA and GPU Platforms. Int. Conf. on Field Programmable Logic and Applications, 2011 Sept. 5-7, Chania, Greece.

[26] Sukhwani B, Herbordt MC (2009) GPU Acceleration of a Production Molecular Docking Code. 2nd Workshop on General Purpose Processing on Graphics Processing Units, 2009 Mar. 8, Washington, USA.
