### **1. Introduction**

Multi-core and multi-processor architectures (SoC and MPSoC) have started a new computing era. They are increasingly used because they give designers new opportunities to meet the requirements of embedded systems across different applications and domains. Multimedia and telecommunication streaming applications are now widely used in domains such as video conferencing, networking, video encryption and compression, surveillance, medical services, military imaging and telecommunications. These applications are characterized by stringent task delay constraints. A model of task scheduling on 4 CPUs for motion estimation "ME" with the DAG algorithm is illustrated in **Figure 1**.

The motion estimation module is a very important and complex module of a video codec. MPSoCs have the most suitable architecture to meet real-time high-definition "HD" encoding requirements. Real-time HD coding is based on a block matching algorithm "BMA", which locates matching blocks in a sequence of digital video frames. This technique is used to discover temporal redundancy in the video sequence, increasing the effectiveness of inter-frame video compression. Task scheduling is an important step in accelerating the motion estimation process. The objective is to minimize the execution time of this algorithm by distributing the tasks that describe it across the cores of an MPSoC. Various methods exist for job scheduling and task partitioning in video codec applications, but they are not adapted to our motion estimation block problem.

In this project, we choose the second method, scheduling based on DAG and GGEN, for several reasons. This method reduces the complexity of the NP-complete task scheduling problem. It is automatic, generic and periodic, and it refines the granularity of the tasks in the ME block. To exploit true task parallelism on SoC and MPSoC systems, we start with 4 CPUs, as in [1, 2], and can then move to 8, 16, …, 1024 CPUs. Existing work on task scheduling is periodic and acyclic. This method gives us an optimal solution for complex scheduling and partitioning of tasks on MPSoC systems, as in [3, 4]. Here, we choose to parallelize tasks and instructions in the video codec blocks, which is a very interesting step for scheduling and partitioning tasks on embedded platforms such as FPGA, GPU and DSP targets, as in [5, 6]. The H.264/AVC standard is a video codec standard released by ITU-T and ISO/IEC [7, 8]. Compared with other standards, it gives very good results in bit rate, image quality and other video codec criteria [2, 7–9].

**Figure 1.** *Model of scheduling tasks in CPUs.*

*Approximation Algorithm for Scheduling a Chain of Tasks for Motion Estimation… DOI: http://dx.doi.org/10.5772/intechopen.97676*

This paper is organized as follows. In Section 2, we detail the algorithm applied to block motion estimation together with the task scheduling and partitioning method for SoC and MPSoC systems; this work is designed with an acyclic algorithm (DAG-TPG) and co-designed on the OVP platform. We then present the ME module and its importance in video codec standards. In Section 3, we describe a new task scheduling approach and its co-design on the OVP platform (partitioning). Section 4 summarizes the results, and we finish with our conclusion.

### **2. ME in the H.264 video codec and the task scheduling approach**

High-quality video encoding imposes unprecedented performance requirements on real-time mobile devices. To address the competing requirements of high performance and real-time operation, manufacturers of embedded mobile multimedia devices have recently adopted MPSoC (multiprocessor system-on-chip) architectures. Despite the advances in digital mobile devices, computer systems and the iPad, the execution time, energy consumption and image quality achieved on SoC and MPSoC systems still call for parallelism and task scheduling applied to the H.264/AVC video codec. This scheduling approach can eliminate problems in the ME blocks and the artifacts that remain in the image. First, it lets us minimize the time delays and the execution time of the ME blocks in the video codec. Second, video codec blocks, which are very important in structure and function, require predictive coding blocks. These structures and functions can be defined and modeled with an acyclic approach (semi-automatic or automatic): directed acyclic graphs (DAG) and graphs generated with GGEN. This approach provides a good solution for the evaluation criteria of a video codec, such as execution time and task delays [3, 4, 9]. Finally, we describe the task delays on several CPUs (SoC and MPSoC targets), which are considered real-time applications [3, 4, 9]. In addition, video frames and video sequences should meet their deadlines and their estimates, since image quality is very important. Many task scheduling approaches for SoC and MPSoC targets exploit execution time and other evaluation criteria of video codecs; see [9–12].

In **Table 1**, we classify these representative solutions by their optimization horizons, application models, complexity models, scheduling granularities and considered sources of execution time. The task scheduling approach was chosen for its efficiency and simplicity of implementation for the H.264 video codec. The main idea of task scheduling in ME is to decompose a video sequence into frames and each frame into macro-blocks "MB"; the tasks are then scheduled and partitioned in parity order, as in our method [9–12].

#### **2.1 Scheduling and partitioning algorithm tasks**

In previous works, scheduling and partitioning algorithms for MPSoC systems have been formulated heuristically because of the inherent complexity of the multi-machine scheduling problem. Such algorithms constitute a very important step in multimedia applications, and they involve an NP-complete step in co-design. A schedule is an assignment of each task to a machine. For a schedule, the load *Mi* of a machine is the total processing requirement of the jobs assigned to it, and the schedule length is the load on the busiest machine. We want to find a schedule of minimum length, as in [3, 13]. There are three distinct approaches to these scheduling problems: queueing theory for network scheduling, deterministic scheduling and software engineering scheduling.

**Table 1.** *Comparison of our solution with different task scheduling approaches.*

The study of approximation algorithms for NP-complete scheduling and partitioning problems, as they arise in MPSoC systems, started with the work of Graham in 1966, who analyzed a simple algorithm. When designing an approximation algorithm for such NP-complete problems, we can evaluate its performance in different ways. One option is an algorithm with a performance guarantee that bounds its worst-case deviation from the optimum value, as in [3, 4]. However, most scheduling and partitioning algorithms make assumptions about the relationship between task dependencies, and scheduling algorithms can be classified accordingly.

Scheduling, data partitioning and parallelism identification are three key points for efficient application deployment. There are several approaches to managing scheduled tasks for real-time applications; the most important are the cyclic and acyclic approaches. Cyclic approaches have been used for Static Data Flow "SDF", Cyclo-Static Data Flow "CSDF" and Petri nets [4, 13]. Acyclic approaches use DAG, Unified Directed Acyclic Graph "UDAG", Weighted Directed Acyclic Graph "WDAG", etc.
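The definition of schedule length given in this subsection (the load on the busiest machine) can be illustrated with a short sketch. It assumes that the "simple algorithm" analyzed by Graham is greedy list scheduling, in which each job goes to the currently least loaded machine; this reading, the function name `list_schedule` and the sample job times are illustrative assumptions rather than material from the chapter.

```python
def list_schedule(job_times, num_machines):
    """Greedy list scheduling for independent jobs (assumed reading of
    Graham's simple algorithm): each job, taken in list order, is
    assigned to the machine with the smallest current load."""
    loads = [0] * num_machines          # load M_i of each machine
    assignment = []                     # machine index chosen for each job
    for t in job_times:
        m = loads.index(min(loads))     # least loaded machine (lowest index on ties)
        loads[m] += t
        assignment.append(m)
    schedule_length = max(loads)        # load on the busiest machine
    return assignment, loads, schedule_length


if __name__ == "__main__":
    # Hypothetical job processing times, in arbitrary cycles.
    assignment, loads, length = list_schedule([2, 3, 4, 6, 2, 2], 3)
    print(assignment, loads, length)
```

For independent jobs on m identical machines, this greedy rule is known to produce a schedule no longer than (2 − 1/m) times the optimum, which is the kind of worst-case performance guarantee discussed in this subsection.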

In this paper, we consider the problem of partitioning and scheduling tasks with a Task Precedence Graph "TPG", which is given as a DAG. A solution based on the DAG representation has the advantage of being generic, simple and usable with a refined granularity. A node in the DAG is a task, that is, a set of instructions to be executed sequentially without preemption on the same processor. The modeling, terminology and full mathematical formalism of the DAG algorithm are presented in [3, 4].
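As a minimal sketch of the TPG described above, the graph can be stored as node weights plus precedence edges, and a valid execution order can be obtained with a topological sort (Kahn's algorithm). The node names and weights below are hypothetical, and the representation is an assumption chosen for illustration, not the formalism of [3, 4].

```python
from collections import deque


def topological_order(weights, edges):
    """Return the nodes of a task precedence graph in an order that
    respects every precedence edge (u, v), meaning u runs before v."""
    indegree = {n: 0 for n in weights}
    successors = {n: [] for n in weights}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
    ready = deque(n for n in weights if indegree[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for s in successors[n]:
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)
    if len(order) != len(weights):
        raise ValueError("graph contains a cycle, so it is not a DAG")
    return order


if __name__ == "__main__":
    # Hypothetical tasks with weights in cycles.
    weights = {"load_mb": 2, "sad": 5, "select_best": 1}
    edges = [("load_mb", "sad"), ("sad", "select_best")]
    print(topological_order(weights, edges))
```

The cycle check matters because a TPG must be acyclic: if some nodes never become ready, the input is not a valid DAG.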

#### **2.2 DAG algorithm**

The DAG algorithm is based on the asynchronous message passing paradigm. Parallel architectures are increasingly popular, but scheduling is very difficult because the data and the program must be partitioned and distributed to the processors [3, 17, 18]. The methodology for MPSoC in embedded applications is Adequacy-Algorithm-Architecture "AAA". The following issues are of major importance for distributed memory architectures:


**Figure 2.** *Principle of the SAD function.*


#### **2.3 Motion estimation "ME": application**

The ME block is very important in the H.264 and H.265 video codecs. In various standards, ME requires very complex tasks and instructions and takes up the largest part of the video codec [2, 10–12, 19].

#### *2.3.1 Principle of the ME block*

The principle of ME is the following: for an MB in the current frame, we define a search window in the reference frame. There are several evaluation criteria, such as MSE, SAD, BBM, MAD and NCF. We seek in this window the best MB using the Sum of Absolute Differences "SAD" distortion criterion given by Eq. (1), as in [2, 19–21]; the SAD function is described in **Figure 2**. To reduce the complexity of the ME module, several fast search algorithms have been defined in the literature, such as DS, FS, TSS, HDS, PMVFAST, LDPS and the block-matching algorithm "BMA". Inter-coding consists in finding, for the current block, a similar block in a reference frame. This process is performed by a BMA. The general principle of the BMA is to exploit the temporal redundancies between consecutive frames.

$$\text{SAD}(x, y) = \sum\_{i=0}^{15} \sum\_{j=0}^{15} \left| R(i, j) - F(x + i, y + j) \right| \tag{1}$$
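Eq. (1) can be written directly as a double loop. The sketch below is only an illustration: it assumes frames are stored as lists of rows of integer pixel values, and the function name `sad` and the `size` parameter (16 for a 16 × 16 MB) are hypothetical conveniences that let the test use a smaller block.

```python
def sad(ref_block, frame, x, y, size=16):
    """Sum of absolute differences between the block R (ref_block) and the
    size-by-size region of F (frame) whose top-left corner is at (x, y)."""
    total = 0
    for i in range(size):
        for j in range(size):
            total += abs(ref_block[i][j] - frame[x + i][y + j])
    return total


if __name__ == "__main__":
    # Tiny hypothetical 2 x 2 example instead of a full 16 x 16 MB.
    r = [[1, 2], [3, 4]]
    f = [[0, 0, 0], [0, 1, 2], [0, 3, 5]]
    print(sad(r, f, 1, 1, size=2))
```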

### **3. Scheduling based on DAG formalism: approach**

#### **3.1 Block matching algorithm "BMA"**

Inter-coding consists in finding, in a reference frame, a block similar to the current block. This process is performed by a block-matching algorithm. The general principle of the BMA is to exploit the temporal redundancies that exist between consecutive frames. For each point of the frame of interest *I*<sub>*t*</sub>, the method searches for the point of the frame *I*<sub>*t*+1</sub> that maximizes a correlation score. The search is performed in a search block [7, 10–12, 19]. The principle of this method is shown in **Figure 3**.

**Figure 3.** *Principle of the block matching algorithm "BMA".*

### *Engineering Problems - Uncertainties, Constraints and Optimization Techniques*

An object of interest is detected when the number of corresponding blocks in the previous and current frames is higher than a certain threshold. The threshold value is obtained experimentally [22]. In the SAD function of Eq. (1), R(i, j) denotes the pixels of the reference MB and F(x + i, y + j) denotes the pixels of the current MB. **Figure 4** shows the flow chart of the ME block of the H.265 video codec. The scheduling methodology is defined in **Figure 5**, which illustrates the flow chart of the MB (16 × 16) for the H.265 video codec. The new idea in this flow chart is to work with the padding technique rather than with the (1/2), (1/4) and (1/8) pixel granularity refinement method. Padding is added to the end of a packet in order to enforce a whole number of packets.
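The block-matching search described in this subsection can be sketched as an exhaustive (full) search over a square window. This is a hedged illustration, not the chapter's implementation: it uses minimum SAD as the matching score (an assumption, since the text speaks of maximizing a correlation score), and the names `full_search`, `search_range` and the toy frame are hypothetical.

```python
def block_sad(block, frame, x, y):
    """SAD between a square block and the same-sized region of frame at (x, y)."""
    n = len(block)
    return sum(abs(block[i][j] - frame[x + i][y + j])
               for i in range(n) for j in range(n))


def full_search(block, frame, x0, y0, search_range):
    """Test every displacement (dx, dy) within +/- search_range around
    (x0, y0) and return (best_cost, dx, dy) for the lowest SAD.
    Candidates that fall outside the frame are skipped."""
    n = len(block)
    rows, cols = len(frame), len(frame[0])
    best = None
    for dx in range(-search_range, search_range + 1):
        for dy in range(-search_range, search_range + 1):
            x, y = x0 + dx, y0 + dy
            if x < 0 or y < 0 or x + n > rows or y + n > cols:
                continue
            cost = block_sad(block, frame, x, y)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


if __name__ == "__main__":
    # Hypothetical 6 x 6 frame with distinct pixel values.
    frame = [[r * 6 + c for c in range(6)] for r in range(6)]
    block = [row[2:4] for row in frame[3:5]]   # block taken at (3, 2)
    print(full_search(block, frame, 2, 2, 2))  # search around (2, 2)
```

Full search is the simplest but most expensive strategy; the fast algorithms listed in Section 2.3.1 visit fewer candidates.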

#### **3.2 DAG applied to ME**

**Figure 4.** *Flow chart of ME algorithm.*

In this section, we describe our new approach to task scheduling with a DAG. First, we make the method generic so that it can be validated for any frame size. We then chose to work on block parity when scheduling the tasks of the ME blocks. Four combinations are possible: odd-odd, odd-even, even-odd and even-even. The test sequences in our work are modeled with three models for the task scheduling methods. The generic frame size is ("X=N", "Y=M"), where "X" represents the pixels per line and "Y" the pixels per column. After scheduling with the three models, a problem appears at the border of the image or frame. To solve it, we apply the padding method, which adds rows and columns of empty pixels to the frame. Note that "N" has to take an even value that is divisible by 4. **Figure 6** shows the workflow of an entire frame of size (N\*M).
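The padding step and the four parity combinations described above can be sketched as follows. The sketch makes several assumptions that are not stated in the chapter: padding pixels are zeros, both dimensions are padded up to the next multiple of a configurable value (4 by default), and block indices start at 1. All function names are hypothetical.

```python
def pad_frame(frame, multiple=4, fill=0):
    """Append columns and rows of `fill` pixels so that both frame
    dimensions become multiples of `multiple`."""
    rows, cols = len(frame), len(frame[0])
    pad_rows = (-rows) % multiple
    pad_cols = (-cols) % multiple
    padded = [list(row) + [fill] * pad_cols for row in frame]
    padded += [[fill] * (cols + pad_cols) for _ in range(pad_rows)]
    return padded


def parity_class(bi, bj):
    """Parity combination of a block with 1-based indices (bi, bj)."""
    name = {0: "even", 1: "odd"}
    return (name[bi % 2], name[bj % 2])


def partition_blocks(block_rows, block_cols):
    """Group the block indices of a frame into the four parity classes."""
    groups = {}
    for bi in range(1, block_rows + 1):
        for bj in range(1, block_cols + 1):
            groups.setdefault(parity_class(bi, bj), []).append((bi, bj))
    return groups


if __name__ == "__main__":
    padded = pad_frame([[1] * 6 for _ in range(5)])
    print(len(padded), len(padded[0]))
    for cls, blocks in partition_blocks(4, 4).items():
        print(cls, blocks)
```

With a 4 × 4 grid of blocks, each parity class receives four blocks, so the four classes can be mapped one-to-one onto four processors.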

**Figure 5.** *Flow chart of the* 16 × 16 *MB for the ME block in H 264 video codec.*

For each node in the task graph, one must compute the start and end times of its execution cycles by using the weight of each node. The Gantt chart represents the task scheduling on the processors in terms of time and gives the order of the tasks of the ME block. This Gantt chart is illustrated in **Figure 7**.

**Figure 7.** *Gantt chart of the ME block.*

Clustering is the placement of the various tasks of our application on different clusters. It depends on the number of processors in the implementation platform. In our case, we chose a platform with four processors. We need four groups and four instants of time, and we then have four outputs.
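A Gantt chart of the kind described above can be derived from a clustering and the node weights. The sketch below assumes that the tasks of a cluster run back to back on their processor and that tasks are independent (no precedence or communication delays); the task names, weights and the function name `gantt` are hypothetical.

```python
def gantt(clusters, weights):
    """Build a per-processor timeline.

    clusters: dict mapping a processor name to its ordered list of tasks.
    weights:  dict mapping a task name to its execution time in cycles.
    Returns a dict mapping each processor to a list of (task, start, end).
    """
    chart = {}
    for proc, tasks in clusters.items():
        t = 0
        rows = []
        for task in tasks:
            start, end = t, t + weights[task]   # end = start + weight
            rows.append((task, start, end))
            t = end
        chart[proc] = rows
    return chart


if __name__ == "__main__":
    # Four hypothetical clusters, one per processor.
    clusters = {"CPU0": ["n11", "n13"], "CPU1": ["n12", "n14"],
                "CPU2": ["n21", "n23"], "CPU3": ["n22", "n24"]}
    weights = {name: 3 for tasks in clusters.values() for name in tasks}
    for proc, rows in gantt(clusters, weights).items():
        print(proc, rows)
```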

#### **3.3 Parameters setting**

For a set P of processors, each node must be assigned to a single processor. Across all the addressable memory, we give each case a portion of the same space. When setting the parameters, we have to take into consideration the time constraints and the constraints of the HW/SW targets. The classical assumptions made on the target MPSoC are:


The following equations show the task scheduling of a classic frame. First, we treat the odd tasks and the odd nodes in the TPG graph, where ts(*nii*) and w(*nii*) represent the start time and the weight of the odd nodes.

$$tf(n\_{ii}) = ts(n\_{ii}) + w(n\_{ii}) \tag{2}$$

Finally, we compute the equations of the even tasks, where ts(*njj*) is the start time of the node (jj) and w(*njj*) is the weight of this node.

$$tf(n\_{jj}) = ts(n\_{jj}) + w(n\_{jj}) \tag{3}$$

Here, "tf" is the end time computed for the nodes *ni*, *nj*. The task sequences of the ME blocks are shown in **Figure 8**.

In general, the nodes and tasks in the TPG graph are made of the nodes *nii*, *nij*, *nji*, *njj*. They are given in Eqs. (4)–(7).

$$N\_{ii} = \left(n\_{ii} + n\_{ij}\right) \tag{4}$$


**Figure 8.** *Gantt chart of the task sequences by CPU.*

$$N\_{ij} = \left(n\_{ij} + n\_{ii}\right) \tag{5}$$

$$N\_{ji} = \left(n\_{ji} + n\_{ii}\right) \tag{6}$$

$$N\_{jj} = \left(n\_{jj} + n\_{jj}\right) \tag{7}$$

The execution time for the nodes of the test sequence is given by Eq. (8) below.

$$t\_f(n\_{ii}, n\_{ij}) = t\_f(n\_{ii}), \quad n\_{11}, \dots, n\_{ii} \text{ in the same processor} \tag{8}$$

For the nodes or tasks executed on different processors, the execution period is computed in the output frame for the current reconstruction frame. The nodes are spread by parity over the processors. In the first level, we have 16 nodes. In Level II, we add nodes to their neighbors to reconstruct the macro-blocks. The same is done for all levels, the goal being to reconstruct the frame. In our case, we have four groups, since both test architectures have four processors. We compute the execution period with Eq. (9) below.

$$t\_f(n\_i, n\_j) = t\_f(n\_i) + C(n\_i, n\_j), \quad \text{otherwise } (n\_i, n\_j \text{ in different processors}) \tag{9}$$
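Eqs. (2), (3), (8) and (9) can be combined into one small routine that walks the nodes in a precedence-respecting order. This is a sketch under assumptions: C(ni, nj) is read as the communication cost between two nodes placed on different processors (the text does not define it explicitly), a processor runs one node at a time, and the node names, weights and costs are hypothetical.

```python
def schedule_times(order, weights, preds, proc_of, comm):
    """Compute start times ts and end times tf for every node.

    order:   nodes listed so that predecessors come first.
    weights: node -> execution time w(n).
    preds:   node -> list of predecessor nodes.
    proc_of: node -> processor the node is assigned to.
    comm:    (pred, node) -> communication cost C(pred, node), used only
             when the two nodes sit on different processors (Eq. 9).
    """
    ts, tf = {}, {}
    proc_free = {}                       # time at which each processor is idle
    for n in order:
        ready = 0
        for p in preds.get(n, []):
            if proc_of[p] == proc_of[n]:
                arrival = tf[p]                   # Eq. (8): same processor
            else:
                arrival = tf[p] + comm[(p, n)]    # Eq. (9): different processors
            ready = max(ready, arrival)
        ts[n] = max(ready, proc_free.get(proc_of[n], 0))
        tf[n] = ts[n] + weights[n]                # Eqs. (2) and (3)
        proc_free[proc_of[n]] = tf[n]
    return ts, tf


if __name__ == "__main__":
    weights = {"a": 3, "b": 2, "c": 4}
    preds = {"b": ["a"], "c": ["a"]}
    proc_of = {"a": "P0", "b": "P0", "c": "P1"}
    comm = {("a", "b"): 5, ("a", "c"): 1}
    print(schedule_times(["a", "b", "c"], weights, preds, proc_of, comm))
```

In the example, node `b` shares a processor with its predecessor and starts as soon as `a` ends, whereas node `c` on another processor also waits for the communication cost.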

Hence, the set of end times is computed in the three models for scheduling and partitioning tasks. This algorithm is applied to the ME blocks of the test video sequence in the H.265 video codec: tf(*nii*, *nii*), tf(*nij*, *nij*), tf(*njj*, *njj*), tf(*nji*, *nji*). It is applied to all the nodes of the graph obtained for the ME block using the test sequence "Akiyo". Two processors *pi* and *pj* are isomorphic if their read times are equal. In that case, the weights w(*ni*) of the odd nodes or tasks are equal to the weights w(*nj*) of the nodes or tasks *nj*. Thus, the set of nodes is *nii*, *nij*, *njj*, *nji* and the set of tasks is *Tii*, *Tij*, *Tjj*, *Tji*. Successfully scheduling and partitioning the tasks with the DAG is equivalent to the following conditions:


In our case, the platforms contain four processors, so we look for other processors satisfying the conditions. Based on the assumptions above, we can deduce other parameters. The schedule length for the task scheduling and partitioning problem on (P + 1) processors is always less than or equal to the schedule length on P processors.

$$\text{SL}\left(S\_{opt}(P+1)\right) \leq \text{SL}\left(S\_{opt}(P)\right) = \max\left(t\_f(n\_{xx})\right) \tag{10}$$

where *nxx* stands for the set of nodes *nii*, *nij*, *nji*, *njj*. This equation is applied to the three models; "SL(S)" is the length of the schedule S, and "Sopt(P)" is the optimal schedule on P processors. For our application, we have four processors. To apply the hypothesis, we must know the maximum execution time over all nodes ("X" nodes = "X" tasks).
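The relation between schedule length and processor count can be checked numerically on a toy instance. The sketch below assumes independent tasks (no precedence or communication costs) and finds the optimal schedule by brute force over all assignments, which is only practical for a handful of tasks; the function names and the weights are hypothetical.

```python
from itertools import product


def schedule_length(end_times):
    """SL(S): the largest end time tf over all nodes of a schedule."""
    return max(end_times.values())


def optimal_length(weights, num_procs):
    """Length of an optimal schedule of independent tasks on num_procs
    identical processors, found by trying every assignment."""
    best = None
    for assignment in product(range(num_procs), repeat=len(weights)):
        loads = [0] * num_procs
        for w, p in zip(weights, assignment):
            loads[p] += w
        length = max(loads)
        if best is None or length < best:
            best = length
    return best


if __name__ == "__main__":
    weights = [4, 3, 3, 2, 2]            # hypothetical task weights
    lengths = [optimal_length(weights, p) for p in range(1, 5)]
    print(lengths)                        # optimal SL for P = 1, 2, 3, 4
```

Because any schedule on P processors is also a valid schedule on P + 1 processors (one processor simply stays idle), the optimal length cannot grow when a processor is added.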

**Figure 9.** *Flow chart of the steps of the DAG algorithm with ME.*


**Table 2.** *Parameters for the steps of the test video sequences.*
