**5. Experimental results**

#### **5.1 OpenCL performance on GPU compared to CPU**

OpenCL offers a convenient way to build heterogeneous computing systems and opportunities to improve the performance of parallel applications. As a first step, the OpenCL SAD kernel was implemented on two platforms: a CPU with 4 cores running at 2.5 GHz and an NVIDIA GeForce 920M GPU running at 954 MHz. The SAD block dimensions range from 4 × 8 to 64 × 64 pixels. A comparative analysis between the CPU and the GPU on the same video is shown in **Figure 6**, which makes clear that the CPU execution time is greater than the GPU execution time [16].
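As an illustration of what the kernel computes, the Sum of Absolute Differences over a block of pixels can be sketched as a plain CPU reference in Python (an illustrative sketch only, not the chapter's actual OpenCL kernel):

```python
def sad_block(cur, ref):
    """Sum of Absolute Differences between two equally sized pixel blocks,
    given as 2-D lists of integer pixel values."""
    return sum(abs(c - r)
               for row_c, row_r in zip(cur, ref)
               for c, r in zip(row_c, row_r))

# Toy 2 x 2 blocks: |10-12| + |20-18| + |30-30| + |40-35| = 2 + 2 + 0 + 5 = 9
cur = [[10, 20], [30, 40]]
ref = [[12, 18], [30, 35]]
print(sad_block(cur, ref))  # 9
```

On the GPU, each work-item typically evaluates one such block candidate in parallel, which is what makes the SAD computation a good fit for OpenCL and CUDA.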

Using Eq. (2), **Figure 7** shows the speed-up [17] of the two implementations.

$$\text{Speed-up} = \frac{\text{CPU execution time}}{\text{GPU execution time}} \tag{2}$$

The speed-up shows that the GPU platform is more efficient than the CPU platform, owing to the efficient parallel architecture of the GPU compared to the CPU. To validate the OpenCL code against the CUDA code, the following study is proposed.
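Eq. (2) amounts to a simple ratio of execution times; a minimal sketch, using made-up timings for illustration:

```python
def speed_up(cpu_time_s, gpu_time_s):
    """Eq. (2): speed-up = CPU execution time / GPU execution time.
    A value above 1 means the GPU implementation is faster."""
    return cpu_time_s / gpu_time_s

# Illustrative (made-up) timings, not measured values from the chapter.
print(speed_up(4.0, 0.5))  # 8.0
```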

#### **5.2 Execution performance of OpenCL on GPU compared to CUDA on GPU**

Running the application on the GPU requires the steps shown in **Figure 8**. For OpenCL, the approach additionally includes GPU device detection and run-time kernel compilation.

#### **Figure 6.**

*Performance OpenCL comparison with GPU and CPU platforms.*

#### **Figure 7.**

*Speed-up using OpenCL language.*

In both frameworks, the input data is copied from the host (CPU) to the device (GPU); the kernel is then executed on the GPU; and the results are copied back from the device to the host. Finally, the results are displayed on the CPU.
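The host-to-device flow above can be sketched schematically; the function names below are illustrative stand-ins for the framework calls, not real OpenCL or CUDA APIs:

```python
# Schematic stand-ins for the steps in the algorithm flow:
# host -> device copy, kernel launch, device -> host copy.
def copy_to_device(host_data):
    return list(host_data)          # stand-in for a host-to-device transfer

def run_kernel(device_cur, device_ref):
    # Stand-in for the GPU kernel: element-wise |cur - ref| reduced to a SAD value.
    return sum(abs(c - r) for c, r in zip(device_cur, device_ref))

def copy_to_host(device_result):
    return device_result            # stand-in for a device-to-host transfer

cur = [10, 20, 30]
ref = [12, 18, 30]
d_cur, d_ref = copy_to_device(cur), copy_to_device(ref)
result = copy_to_host(run_kernel(d_cur, d_ref))
print(result)  # 4
```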

**Table 2** reports the kernel running time for different Prediction Unit (PU) sizes (i.e., the block sizes used). To obtain reliable average times, each problem was run 10 times for both CUDA and OpenCL.
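A minimal sketch of such a repeated-timing procedure, assuming simple wall-clock timing in Python (the chapter's actual measurement setup is not specified):

```python
import time

def average_kernel_time(kernel, runs=10):
    """Average wall-clock time of `kernel` over `runs` executions."""
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        kernel()
        times.append(time.perf_counter() - t0)
    return sum(times) / len(times)

# Toy stand-in for a kernel launch; a real benchmark would time the GPU kernel.
avg = average_kernel_time(lambda: sum(range(10_000)), runs=10)
print(avg > 0)  # True
```

Averaging over several runs reduces the influence of one-off effects such as cold caches or driver initialization on the reported times.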

We use a normalized performance metric, called the Performance Ratio (PR), to compare the performance of CUDA and OpenCL (**Figure 9**).

$$PR = \frac{\text{CUDA execution time}}{\text{OpenCL execution time}} \tag{3}$$
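Eq. (3) and its interpretation can be checked with a small sketch; the timings here are made up for illustration:

```python
def performance_ratio(cuda_time_s, opencl_time_s):
    """Eq. (3): PR = CUDA execution time / OpenCL execution time.
    PR > 1 means the OpenCL kernel ran faster than the CUDA one."""
    return cuda_time_s / opencl_time_s

# Illustrative (made-up) timings, not measured values from the chapter.
pr = performance_ratio(1.2, 1.0)
print(pr > 1)  # True
```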

If the performance ratio is greater than 1, OpenCL gives better results than CUDA. As shown in **Figure 9**, the performance ratio indicates that the OpenCL kernel running time is better than the CUDA kernel running time for each block size. Similar results are obtained by Frang et al. [18] and Exterman [19], respectively.

*Performance Analysis of OpenCL and CUDA Programming Models for the High Efficiency Video… DOI: http://dx.doi.org/10.5772/intechopen.99823*

#### **Figure 8.**

*Algorithm flow.*

#### **Table 2.**

*GPU and CPU application running times in seconds.*

#### **5.3 Comparative study**

In this section, we compare the time performance of our proposed implementation with state-of-the-art approaches [20, 21].

In the work presented by Xiao et al. [20], the proposed GPU implementation is compared with the HEVC reference software; experimental results show that it achieves a 34.4% encoding time reduction on average, while the BD-rate increase is only about 2% for a typical low-delay setting. Another interesting work, by Karimi et al. [21], used a specific real-world application to compare the performance of CUDA with NVIDIA's implementation of OpenCL. Contrary to our results, CUDA's kernel execution was there consistently faster than OpenCL's, despite the two implementations running nearly identical code. CUDA thus seems to be a better choice for applications where achieving the highest possible performance is important. Otherwise, the choice between OpenCL and CUDA can be made by considering factors such as prior familiarity with either system or the available development tools for the target GPU hardware. Performance will also depend on several variables, including code quality, algorithm type, and hardware type.

#### **Figure 9.**

*Performance ratio.*
