### **4. New GPU-based DDD evaluation**

In this section, we present the new GPU-based DDD evaluation algorithm. Before describing the GPU-based evaluation method in detail, we first discuss the new DDD data structure designed for GPU parallel computing.

One key observation about the DDD structure is that its data dependency is simple and explicit: a node can be evaluated only after its children have been evaluated. This dependency exposes parallelism, since all nodes satisfying the constraint can be evaluated at the same time. In addition, in the frequency analysis of analog circuits, the evaluations of a DDD node at different frequency points are independent and can be performed in parallel. In the following subsections, we show how to exploit this parallelism to speed up the DDD evaluation process.

#### **4.1 New data structure**

To achieve the best performance on a GPU, a linear memory structure, i.e., data stored at consecutive memory addresses, is preferable. The CPU serial implementation is based on dynamically allocated links in a linked binary tree. For parallel computing, the data are instead stored in linear arrays, which can be accessed more efficiently by different threads based on their thread ids.

As we discussed above, the DDD representation stores all product terms of the determinant of the MNA matrix in a binary linked tree structure. A vertex in the tree structure is known as a DDD node; it represents an element in the MNA matrix and is identified by its index. For each DDD node, the data structure includes the sign value, the MNA index, the RCL values, the corresponding frequency value, *vself*, and *vtree*. In the serial approach, these values are stored in a node structure and connected through links, as shown in Fig. 5. In the parallel approach, by contrast, all of these data are stored separately in corresponding linear arrays, and each entry is identified by the DDD node index (not necessarily the same as the MNA element index). Figure 6 illustrates the new data structure.

Fig. 5. Illustration of the data structure for serial method

Fig. 6. Illustration of the data structure for parallel method

Two choices are available for the *vself* data structure. The first is similar to the *vtree* data structure: the *vself* values of the DDD nodes are stored consecutively, one per node. We call this the linear version of the *vself* data structure. The other method is shown in Fig. 6: the array is organized per MNA element. Because some DDD nodes share the same MNA element value, this second data structure is more compact in memory than the linear version, so we call it the compact version of the *vself* data structure. The compact version is suitable for small circuits because it reduces the global memory traffic when computing *vself*. For large circuits, however, the calculation of *vtree* dominates the time cost, and with the linear version of the *vself* data structure we can apply a strategy that reduces the global memory traffic of the *vtree* computation to further improve GPU performance, as discussed in the coalesced memory access subsection below. Therefore, for larger circuits the linear version is preferable. The performance comparison is discussed later in the numerical results section.
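To make the two layouts concrete, the sketch below shows one possible realization of the serial linked-node structure of Fig. 5 and the linear arrays of Fig. 6, with both *vself* variants indicated; all type and field names here are illustrative assumptions rather than the authors' actual definitions.

```
/* Illustrative sketch only; type and field names are assumptions. */
#include <cuComplex.h>

/* Serial (CPU) version: one linked node per DDD vertex (cf. Fig. 5). */
typedef struct DDDNode {
    int             sign;       /* +1 or -1                            */
    int             mnaIndex;   /* index of the MNA element            */
    double          R, C, L;    /* RCL values of the element           */
    cuDoubleComplex vself;      /* element value at the current freq   */
    cuDoubleComplex vtree;      /* value of the subtree rooted here    */
    struct DDDNode *thenChild;  /* 1-edge (Then) child                 */
    struct DDDNode *elseChild;  /* 0-edge (Else) child                 */
} DDDNode;

/* Parallel (GPU) version: a set of linear arrays indexed by the DDD
 * node id (cf. Fig. 6); each array occupies consecutive global memory
 * so that threads can address entries directly by thread id.          */
typedef struct {
    int    *sign;      /* [numNodes]                                   */
    int    *mnaIndex;  /* [numNodes] DDD node id -> MNA element id     */
    int    *thenIdx;   /* [numNodes] index of the Then() child         */
    int    *elseIdx;   /* [numNodes] index of the Else() child         */
    double *R, *C, *L; /* [numElems] RCL values, one per MNA element   */
    /* linear vself : size numNodes * numFreqs (one entry per node)    */
    /* compact vself: size numElems * numFreqs (one entry per element) */
    cuDoubleComplex *vself;
    cuDoubleComplex *vtree;    /* [numNodes * numFreqs]                */
} DDDArrays;
```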

#### **4.2 Algorithm flow**


The parallel evaluation process consists of two stages. First, the *vself* values for all DDD nodes are computed and stored. In this stage, a set of 2D threads is launched on the GPU device. The X-dimension of the 2D threads indexes the different frequencies; the Y-dimension indexes the different elements (for compact *vself*) or DDD nodes (for linear *vself*). Therefore, all elements (or DDD nodes) can be computed for all frequencies in a massively parallel manner. In the second stage, we launch GPU 2D threads to compute the *vtree* values of all DDD nodes based on (3). Notice that a DDD node's *vtree* value becomes valid only when the *vtree* values of all its children are valid. Since we compute *vtree* for all nodes at the same time, the correct *vtree* values automatically propagate from the bottom of the DDD tree to the top node. The number of such *vtree* iterations is determined by the number of layers in the DDD tree. A layer is a set of DDD nodes whose distances from the *1-terminal* or *0-terminal* are the same, and the number of layers equals the longest distance between a non-terminal node and the *1-terminal*/*0-terminal*. Algorithm 1 shows the flow of parallel DDD evaluation using the compact *vself* data structure.
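A minimal host-side sketch of this two-stage scheme is given below; the kernel names computeVself and computeVtree and the launch geometry are our assumptions, not the authors' code. The X-dimension of each 2D grid indexes the frequency points, the Y-dimension indexes MNA elements (stage 1) or DDD nodes (stage 2), and the stage-2 kernel is simply relaunched once per layer so that valid *vtree* values propagate upward.

```
/* Host-side launch sketch (assumed kernel names and geometry). */
dim3 block(16, 16);                        /* 256 threads per block    */

/* Stage 1: one thread per (frequency, MNA element) computes vself.   */
dim3 gridSelf((numFreqs + block.x - 1) / block.x,
              (numElems + block.y - 1) / block.y);
computeVself<<<gridSelf, block>>>(dev, numFreqs, numElems);

/* Stage 2: one thread per (frequency, DDD node) updates vtree.  The  */
/* kernel is relaunched once per layer; after iteration k, every node */
/* within distance k of the terminals holds a valid vtree value.      */
dim3 gridTree((numFreqs + block.x - 1) / block.x,
              (numNodes + block.y - 1) / block.y);
for (int lyr = 0; lyr <= numberOfLayers; ++lyr) {
    computeVtree<<<gridTree, block>>>(dev, numFreqs, numNodes);
}
cudaDeviceSynchronize();
```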

**Algorithm 1** Parallel DDD evaluation algorithm flow


```
 1: if Launch GPU threads for each node then
 2:    {Computing vself:}
 3:    FreqIdx ← Thread.X
 4:    ElemIdx ← Thread.Y
 5:    (R, C, L) ← GetRCL(ElemIdx)
 6:    vself ← (R, C * Freq + L/Freq)
 7: end if
 8: for all lyr such that 0 ≤ lyr ≤ NumberOfLayers do
 9:    {Computing vtree:}
10:    if Launch GPU threads for each node then
11:       FreqIdx ← Thread.X
12:       DDDIdx ← Thread.Y
13:       Left ← Then(DDDIdx)
14:       Right ← Else(DDDIdx)
15:       if is 0-terminal then
16:          Left ← (0, 0)
17:          Right ← (0, 0)
18:       else
19:          if is 1-terminal then
20:             Left ← (1, 0)
21:             Right ← (1, 0)
22:          end if
23:       end if
24:       if sign(DDDIdx) < 0 then
25:          vself ← −1 * vself
26:       end if
27:       vtree ← vself * Left + Right
28:    end if
29: end for
```
Lines 3 and 4 load the frequency index and the element index, respectively, from the CUDA built-in variables (Thread.X and Thread.Y are our simplified notations). These built-in variables are the mechanism for identifying data within different threads in CUDA. Lines 5 and 6 then compute *vself* from the RCL values of the element at the given frequency. The loop for computing *vtree* starts at line 8. Lines 13 and 14 load the *vtree* values of the left/right branches using the functions *Then()*/*Else()*. Lines 15 through 26 handle the terminal nodes and the sign, and line 27 computes *vtree* from *vself* and *Left*/*Right*, which ends the flow.

#### **4.3 Coalesced memory access**

The GPU performance can be further improved by making proper use of coalesced global memory access, which prevents the global memory bandwidth from becoming the performance limitation. Coalesced memory access is an efficient method of reducing global memory traffic: when all threads in a warp execute a load instruction, the hardware detects whether the threads access consecutive global memory addresses, and in that case it coalesces all of these accesses into one consolidated access to the consecutive global memory. In our GPU-accelerated DDD evaluation, this favorable access pattern is achieved with the linear version of the *vself* data structure. The *vself* array is laid out linearly, so the *vself* values of a given DDD node for a series of frequency values are stored contiguously. Therefore, threads in the same block with consecutive thread indices access consecutive global memory locations, which ensures that the hardware coalesces these accesses into a single read operation. In this example, the technique reduces the global memory traffic by a factor of 16. For the compact version of the *vself* data structure, however, the *vself* values are stored per element, which means that the *vself* values of consecutive DDD nodes are not stored at consecutive locations, so the global memory accesses are not coalesced. The performance comparison of the two versions is discussed in the experimental results section.
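As an illustration of this access pattern, a sketch of the stage-2 kernel using the linear *vself* layout follows; because the frequency index varies along threadIdx.x and the arrays are stored frequency-major per node, each warp touches consecutive addresses and its loads coalesce. The kernel and array names, and the handling details (for example, storing the terminals as ordinary entries holding (0, 0) and (1, 0)), are our assumptions.

```
#include <cuComplex.h>

/* Sketch of the vtree kernel with the linear vself layout (assumed   */
/* names).  Layout: vself[node * numFreqs + f] and                    */
/* vtree[node * numFreqs + f], so consecutive threads (consecutive f) */
/* read consecutive global addresses and each warp's loads coalesce   */
/* into one consolidated access.                                      */
__global__ void computeVtree(const cuDoubleComplex *vself,
                             cuDoubleComplex       *vtree,
                             const int *thenIdx, const int *elseIdx,
                             const int *sign,
                             int numFreqs, int numNodes)
{
    int f    = blockIdx.x * blockDim.x + threadIdx.x; /* frequency   */
    int node = blockIdx.y * blockDim.y + threadIdx.y; /* DDD node    */
    if (f >= numFreqs || node >= numNodes) return;

    /* Children's vtree values; terminals are assumed to be stored   */
    /* as constant entries (0, 0) and (1, 0).                        */
    cuDoubleComplex left  = vtree[thenIdx[node] * numFreqs + f];
    cuDoubleComplex right = vtree[elseIdx[node] * numFreqs + f];

    cuDoubleComplex vs = vself[node * numFreqs + f];  /* coalesced   */
    if (sign[node] < 0)
        vs = make_cuDoubleComplex(-cuCreal(vs), -cuCimag(vs));

    /* vtree = vself * Left + Right (cf. Algorithm 1, line 27).      */
    vtree[node * numFreqs + f] = cuCadd(cuCmul(vs, left), right);
}
```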

### **5. Numerical results**

We have implemented both the CPU serial version and the GPU parallel version of the DDD-based evaluation programs, using C++ and CUDA C, respectively.

The serial and parallel versions of the programs have been tested under the same hardware and OS configurations. The computation platform is a Linux server with two Intel Xeon E5620 2.4 GHz quad-core CPUs and 36 GB of memory, equipped with an NVIDIA Tesla S1070 1U rack-mounted system (containing four T10 GPUs). The software environment is Red Hat 4.1.2-48 Linux, gcc version 4.1.2 20080704, and CUDA version 3.2.


For the purpose of performance comparison, the CPU-serial and GPU-parallel programs are both tested on the same set of circuits. The test circuits include: *μ*A741 (a bipolar opamp), Cascode (a CMOS cascode opamp), ladder7, ladder21, ladder100 (7-, 21-, and 100-section cascaded resistive ladder networks), rctree1, rctree2 (two RC tree networks), and rlctest, vcstest, ccstest, bigtst (RLC filters).

The two implementations share the same DDD construction algorithm; only the numerical evaluation is performed separately by the serial and parallel versions. The performance comparison for each circuit is listed in Table 1 and illustrated in Fig. 7. In the reported results, the overhead of data transfers between the host and the GPU device is not included, as this cost can be amortized over many DDD evaluation runs and can be partially overlapped with GPU computation in a more advanced parallel implementation. Statistics of the DDD representation are also included in the same table. The first column gives the name of each circuit tested. The second to fourth columns list the number of nodes in the circuit, the number of elements in the MNA matrix, and the number of DDD nodes in the generated DDD graph, respectively. The number of determinant product terms is shown in the fifth column. The CPU time is the cost of the DDD evaluation in the serial algorithm, and the GPU time is the computing cost of the GPU parallel version (the kernel parts only). The final column summarizes the speedup of the parallel algorithm over the serial algorithm.
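For reference, kernel-only GPU times of the kind reported here are typically measured with CUDA events, along the lines of the generic sketch below (this is standard CUDA instrumentation, not necessarily the authors' exact harness):

```
/* Timing the kernel portion only, excluding host<->device transfers. */
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
/* ... launch the vself and vtree kernels here ... */
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  /* elapsed kernel time in ms */
cudaEventDestroy(start);
cudaEventDestroy(stop);
```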


| circuit | # nodes | # elements | # DDD nodes | # terms | CPU time (s) | GPU time (s) | speedup |
|---|---|---|---|---|---|---|---|
| bigtst | 32 | 112 | 642 | 2.68 × 10<sup>7</sup> | 9.21 | 0.240 | 38.33 |
| cascode | 14 | 76 | 2110 | 2.32 × 10<sup>5</sup> | 6.65 | 0.369 | 18.00 |
| ccstest | 9 | 35 | 109 | 260 | 0.32 | 0.014 | 23.40 |
| ladder100 | 101 | 301 | 301 | 9.27 × 10<sup>20</sup> | 11.31 | 0.323 | 35.00 |
| ladder21 | 22 | 64 | 64 | 28657 | 0.55 | 0.021 | 25.69 |
| ladder7 | 8 | 22 | 22 | 34 | 0.08 | 0.007 | 10.86 |
| rctree1 | 40 | 119 | 211 | 1.15 × 10<sup>8</sup> | 2.53 | 0.076 | 33.30 |
| rctree2 | 53 | 158 | 302 | 4.89 × 10<sup>10</sup> | 4.76 | 0.134 | 35.51 |
| rlctest | 9 | 39 | 119 | 572 | 0.01 | 0.001 | 8.82 |
| *μ*A741 | 23 | 89 | 6205 | 363914 | 0.84 | 0.029 | 29.14 |
| vcstst | 12 | 46 | 121 | 536 | 0.28 | 0.013 | 20.74 |

Table 1. Performance comparison of CPU-serial and GPU-parallel DDD evaluation for a set of circuits

Fig. 7. Performance comparison


Fig. 8. The circuit schematic of *μ*A741

Fig. 9. The small signal model for bipolar transistor

Now let us investigate one typical example in detail. Fig. 8 shows the schematic of the *μ*A741 circuit. This bipolar opamp contains 26 transistors and 11 resistors. A DC analysis is first performed by SPICE to obtain the operating point, and then the small-signal model shown in Fig. 9 is used for the DDD symbolic analysis and numerical evaluation. The AC analysis is performed using both the CPU DDD evaluation and the GPU parallel DDD evaluation proposed in this work. Fig. 10 plots the frequency responses of the gain and the phase at the amplifier's output node obtained by the two methods. It can be observed that the GPU parallel DDD evaluation produces the same result as its CPU serial counterpart. We also measured the run time of the two methods: the CPU evaluation takes 0.84 seconds, while the GPU parallel version takes only 0.029 seconds. For this benchmark circuit, the parallel computation thus achieves a speedup of about 29 times. As the size of the circuit and the number of DDD nodes grow, an even larger speedup can be expected.

Fig. 10. Frequency response of *μ*A741 amplifier. The red solid curve is the result of CPU DDD evaluation, while the blue dashed line is the result of GPU parallel DDD evaluation.
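For completeness, the gain and phase curves of Fig. 10 follow from the complex transfer-function value at each frequency point by the standard conversion below; here H is assumed to be the ratio of the DDD-evaluated numerator and denominator determinants at one frequency.

```
#include <math.h>
#include <complex.h>

/* Convert one complex transfer-function sample H(jw) into the gain   */
/* (dB) and phase (degrees) plotted in Fig. 10.                       */
void gain_phase(double complex H, double *gain_db, double *phase_deg)
{
    *gain_db   = 20.0 * log10(cabs(H));
    *phase_deg = carg(H) * 180.0 / M_PI;
}
```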


From Table 1, we can make several observations. For the variety of circuits tested, the GPU-accelerated version outperforms its CPU counterpart in every case; the maximum speedup is 38.33 times, obtained for *bigtst*. The time cost of the serial version grows quickly with the circuit size (the number of nodes in the circuit), whereas the GPU-based parallel version handles larger circuits much better. More importantly, the larger the circuit, the greater the performance improvement gained from GPU acceleration. This trend is illustrated in Fig. 11. It implies that GPU acceleration is well suited to overcoming the performance problem of DDD-based numerical evaluation for large circuits.


Fig. 11. The performance speedup of GPU-acceleration vs. circuit size (number of nodes)

In this experiment, both data structures for storing *vself* are implemented. The performance comparison is listed in Table 2. The GPU parallel version outperforms the serial version under both *vself* data structures, and the performance speedup is clearly related to the number of product terms of the MNA determinant, as shown in Fig. 12. For small circuits with fewer product terms, the compact version of *vself* is more efficient because it lowers the global memory traffic when calculating *vself*. For large circuits with a larger number of product terms, however, the linear version of *vself* outperforms the compact version owing to the effect of coalesced memory access discussed in the prior section.


| circuit | # terms | CPU (s) | GPU time (s) w/ comp vself | GPU time (s) w/ linear vself | speedup w/ comp vself | speedup w/ linear vself |
|---|---|---|---|---|---|---|
| bigtst | 2.68 × 10<sup>7</sup> | 9.21 | 0.240 | 0.223 | 38.33 | 41.21 |
| cascode | 2.32 × 10<sup>5</sup> | 6.65 | 0.369 | 0.452 | 18.00 | 14.70 |
| ccstest | 260 | 0.32 | 0.014 | 0.033 | 23.40 | 9.65 |
| ladder100 | 9.27 × 10<sup>20</sup> | 11.31 | 0.323 | 0.097 | 35.00 | 116.92 |
| ladder21 | 28657 | 0.55 | 0.021 | 0.028 | 25.69 | 19.40 |
| ladder7 | 34 | 0.08 | 0.007 | 0.025 | 10.86 | 3.20 |
| rctree1 | 1.15 × 10<sup>8</sup> | 2.53 | 0.076 | 0.057 | 33.30 | 44.71 |
| rctree2 | 4.89 × 10<sup>10</sup> | 4.76 | 0.134 | 0.076 | 35.51 | 62.93 |
| rlctest | 572 | 0.01 | 0.001 | 0.002 | 8.82 | 4.40 |
| *μ*A741 | 363914 | 0.84 | 0.029 | 0.029 | 29.14 | 29.27 |
| vcstst | 536 | 0.28 | 0.013 | 0.029 | 20.74 | 9.62 |

Table 2. Performance comparison for two implementations of the *vself* data structure

Fig. 12. Performance comparison for two approaches of the vself data structure (the x-axis is in logarithm scale)

### **6. Summary**


In this chapter, a GPU-based parallel graph-based analysis method for large analog circuits has been presented. Two data structures have been designed to suit GPU computation and its device memory access patterns. The performance of both the CPU version and the GPU version has been studied and compared on circuits with different numbers of product terms in the MNA determinant. The GPU-based DDD evaluation performs much better than its CPU-based serial counterpart, especially for larger circuits. Experimental results on a variety of industrial benchmark circuits show that the new evaluation algorithm can achieve about one to two orders of magnitude speedup over the serial CPU-based evaluation on some large analog circuits. The presented parallel techniques can also be used for the parallelization of other decision-diagram-based applications, such as Binary Decision Diagrams (BDDs) for logic synthesis and formal verification.


