**Parallel Symbolic Analysis of Large Analog Circuits on GPU Platforms** \*

Sheldon X.-D. Tan<sup>1</sup>, Xue-Xin Liu<sup>1</sup>, Eric Mlinar<sup>1</sup> and Esteban Tlelo-Cuautle<sup>2</sup>

> <sup>1</sup>*Department of Electrical Engineering, University of California, Riverside, CA 92521, USA*
> <sup>2</sup>*Department of Electronics, INAOE, Mexico*

#### **1. Introduction**

Graph-based symbolic techniques are a viable tool for calculating the behavior or characteristics of analog circuits. Traditional symbolic analysis tools are typically used to calculate the behavior or characteristics of a circuit in terms of symbolic parameters (Gielen et al., 1994). The introduction of symbolic analysis techniques based on determinant decision diagrams (DDDs) allows exact symbolic analysis of much larger analog circuits than any other existing approach (Shi & Tan, 2000; 2001). Furthermore, with hierarchical symbolic representations (Tan et al., 2005; Tan & Shi, 2000), exact symbolic analysis via DDD graphs essentially allows the analysis of arbitrarily large analog circuits. Recent advances in DDD ordering techniques and DDD variants allow even larger analog circuits to be analyzed (Shi, 2010a;b). Once the circuit's small-signal characteristics are represented by DDDs, evaluating the DDDs, whose CPU time is proportional to their sizes, gives exact numerical values. However, for large networks the DDD size can be huge and the resulting evaluation very time-consuming.

Modern computer architecture has shifted towards designs that employ multiple processor cores on a chip, so-called multi-core processors or chip multiprocessors (CMPs) (AMD Inc., 2006; Intel Corporation, 2006). The graphics processing unit (GPU) is one of the most powerful many-core computing systems in mass-market use (AMD Inc., 2011a; NVIDIA Corporation, 2011a). For instance, the NVIDIA Tesla T10 chip has a peak performance of over 1 TFLOPS, versus about 80–100 GFLOPS for Intel i5-series quad-core CPUs (Kirk & Hwu, 2010). In addition to the primary use of GPUs for accelerating graphics rendering operations, there has been considerable interest in exploiting GPUs for general-purpose computation (Göddeke, 2011). The introduction of new parallel programming interfaces for general-purpose computation, such as the Compute Unified Device Architecture (CUDA) (NVIDIA Corporation, 2011b), Stream SDK (AMD Inc., 2011b) and OpenCL (Khronos Group, 2011), has made GPUs a powerful and attractive choice for high-performance numerical and scientific computing and for solving practical engineering problems.

<sup>\*</sup>This work is funded in part by NSF grants OISE-0929699, OISE-1130402 and CCF-1017090, and in part by UC MEXUS-CONACYT Collaborative Research Grant CN-11-575.

In this chapter, we present an efficient parallel DDD evaluation technique based on the general-purpose GPU (GPGPU) computing platform to exploit the parallelism of DDD structures. We present a new data structure that represents DDD graphs on the GPU for massively threaded parallel computation of their numerical values. The new method exploits data parallelism in the DDD numerical evaluation process, in which DDD graphs are traversed in a depth-first fashion. Numerical results show that the new evaluation algorithm achieves about one to two orders of magnitude speedup over serial CPU-based evaluation for some analog circuits. The presented parallel techniques can also be used to parallelize many other decision-diagram-based applications, such as logic synthesis, optimization, and formal verification, which are based on binary decision diagrams (BDDs) and their variants (Bryant, 1995; Minato, 1996).

This chapter is organized as follows. Section 2 reviews DDD-based symbolic analysis techniques. Section 3 briefly reviews GPU architectures and CUDA computing. Section 4 introduces the new parallel algorithm, and the results are demonstrated in Section 5. Lastly, Section 6 summarizes this chapter.

#### **2. DDDs and DDD-based symbolic analysis**

Before we introduce our GPU-based parallel analysis method, we first provide a brief overview of determinant decision diagrams (DDDs) (Shi & Tan, 2000) in this section.

Determinant decision diagrams (DDDs) were introduced to represent determinants symbolically (Shi & Tan, 2000). A DDD is essentially a zero-suppressed binary decision diagram (ZBDD), introduced originally for representing sparse subset systems (Minato, 1993). A ZBDD is a variant of the binary decision diagram (BDD) introduced by Akers (1976) and popularized by Bryant (1986). BDDs have brought great success to formal verification and testing of combinational and sequential digital circuits (Bryant, 1986; Hachtel & Somenzi, 1996). The DDD representation has several advantages over both the expanded and the arbitrarily nested forms of a symbolic expression.

• First, similar to the nested form, the DDD representation is compact for a large class of analog circuits. A ladder-structured network can be represented by a diagram in which the number of vertices (called its *size*) is equal to the number of symbolic parameters. As indicated by Shi & Tan (2000), the typical size of a DDD is dramatically smaller than the number of product terms. For instance, 5.71×10<sup>20</sup> terms can be represented by a diagram with 398 vertices.

• Second, similar to the expanded form, the DDD representation is canonical, i.e., every determinant has a *unique* representation, and it is amenable to symbolic manipulations. This canonical representation of matrix determinants is analogous to BDDs for representing *binary functions* and ZBDDs for representing *subset systems*.

A key observation is that the circuit matrix is sparse and that a symbolic expression may share many sub-expressions. For example, consider the following determinant

$$\det(\mathbf{M}) = \begin{vmatrix} a & b & 0 & 0 \\ c & d & e & 0 \\ 0 & f & g & h \\ 0 & 0 & i & j \end{vmatrix} = adgj - adhi - aefj - bcgj + cbih. \tag{1}$$

Fig. 1. A ZBDD representing {*adgj*, *adhi*, *afej*, *cbgj*, *cbih*} under the ordering *a* > *c* > *b* > *d* > *f* > *e* > *g* > *i* > *h* > *j*

Note that the sub-terms *ad*, *gj*, and *hi* appear in several product terms, and each product term involves only a subset (four) of the ten symbolic parameters. Therefore, we view each symbolic product term as a subset, and use a ZBDD to represent the subset system composed of all the subsets, each corresponding to a product term. Fig. 1 illustrates the corresponding ZBDD representing all the subsets involved in det(**M**) under the ordering *a* > *c* > *b* > *d* > *f* > *e* > *g* > *i* > *h* > *j*. It can be seen that the sub-terms *ad*, *gj*, and *ih* are shared in the ZBDD representation.
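The subset view above can be made concrete with a short script. The sketch below (plain Python; the dictionary `M` and helper names are ours, not the chapter's) enumerates the signed product terms of det(**M**) in (1) via the Leibniz formula, writing each term with its symbols in alphabetical order, so *cbih* appears as *bchi*:

```python
from itertools import permutations

# Nonzero entries of the sparse 4x4 matrix M from Eq. (1).
M = {(0, 0): 'a', (0, 1): 'b',
     (1, 0): 'c', (1, 1): 'd', (1, 2): 'e',
     (2, 1): 'f', (2, 2): 'g', (2, 3): 'h',
     (3, 2): 'i', (3, 3): 'j'}

def parity(perm):
    # +1 for an even permutation, -1 for an odd one (cycle decomposition).
    s, seen = 1, set()
    for i in range(len(perm)):
        if i in seen:
            continue
        j, length = i, 0
        while j not in seen:
            seen.add(j)
            j = perm[j]
            length += 1
        if length % 2 == 0:
            s = -s
    return s

def det_terms(matrix, n=4):
    """All signed product terms of det(matrix) (Leibniz expansion),
    keeping only permutations whose entries are all nonzero."""
    terms = {}
    for perm in permutations(range(n)):
        if all((r, perm[r]) in matrix for r in range(n)):
            name = ''.join(sorted(matrix[(r, perm[r])] for r in range(n)))
            terms[name] = parity(perm)
    return terms

print(det_terms(M))
# → {'adgj': 1, 'adhi': -1, 'aefj': -1, 'bcgj': -1, 'bchi': 1}
```

The five surviving permutations reproduce (1), and the repeated sub-terms *ad*, *gj*, and *hi* are exactly what the shared ZBDD vertices in Fig. 1 exploit.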

Following directly from the properties of ZBDDs, we have the following observations. First, given a fixed order of symbolic parameters, all the subsets in a symbolic determinant can be represented uniquely by a ZBDD. Second, every 1-path in the ZBDD corresponds to a product term, and the number of 1-edges in any 1-path is *n*. The total number of 1-paths is equal to the number of product terms in a symbolic determinant.

We can view the resulting ZBDD as a graphical representation of the recursive application of the determinant expansion with the expansion order *a*, *c*, *b*, *d*, *f*, *e*, *g*, *i*, *h*, *j*. Each vertex is labeled with the matrix entry with respect to which the determinant is expanded, and it represents all the subsets contained in the corresponding sub-matrix determinant. The 1-edge points to the vertex representing all the subsets contained in the cofactor of the current expansion, and the 0-edge points to the vertex representing all the subsets contained in the remainder.

To embed the signs of the product terms of a symbolic determinant into its corresponding ZBDD, we associate each vertex *v* with a sign, *s*(*v*), defined as follows:



Fig. 2. A signed ZBDD for representing symbolic terms from matrix **M**

• Let *P*(*v*) be the set of ZBDD vertices that originate the 1-edges in any 1-path rooted at *v*. Then

$$s(v) = \prod_{x \in P(v)} \text{sign}(r(x) - r(v)) \, \text{sign}(c(x) - c(v)), \tag{2}$$

where *r*(*x*) and *c*(*x*) refer to the absolute row and column indices of vertex *x* in the original matrix, and *u* is an integer so that

$$\text{sign}(u) = \begin{cases} +1, & \text{if } u > 0, \\ -1, & \text{if } u < 0. \end{cases}$$

• If *v* has an edge pointing to the 1-terminal vertex, then *s*(*v*)=+1.

This is called the *sign rule*. For example, in Fig. 2, shown beside each vertex are the row and column indices of that vertex in the original matrix, as well as the sign of that vertex obtained by the sign rule above. For the sign rule, we have the following result:

**Theorem 1.** *The sign of a DDD vertex v, s*(*v*)*, is uniquely determined by (2), and the product of all the signs in a path is exactly the sign of the corresponding product term.*

For example, consider the 1-path *acbgih* in Fig. 2. The vertices that originate all the 1-edges are *c*, *b*, *i*, *h*, their corresponding signs are −, +, − and +, respectively. Their product is +. This is the sign of the symbolic product term *cbih*.
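Theorem 1 can be sanity-checked without building the diagram: by the Leibniz formula, the sign of a product term is the parity of the permutation formed by its (row, column) positions in **M**. The helper below is our own illustration, not the chapter's code:

```python
def perm_parity(pairs):
    """Leibniz sign of a product term given its (row, col) positions,
    each row and column used exactly once: count inversions of the
    column sequence taken in row order."""
    col_of = dict(pairs)
    perm = [col_of[r] for r in sorted(col_of)]
    inversions = sum(1 for x in range(len(perm))
                       for y in range(x + 1, len(perm))
                       if perm[x] > perm[y])
    return -1 if inversions % 2 else 1

# cbih occupies positions (2,1), (1,2), (4,3), (3,4) in M (1-indexed).
print(perm_parity([(2, 1), (1, 2), (4, 3), (3, 4)]))  # prints 1: sign +
```

This agrees with the vertex-sign product − · + · − · + = + computed above for the 1-path *acbgih*.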

With ZBDD and the sign rule as two foundations, we are now ready to formally introduce our representation of a symbolic determinant. Let **A** be an *n* × *n* sparse matrix with a set of *m* distinct symbolic parameters {*a*1, ..., *am*}, where 1 ≤ *m* ≤ *n*<sup>2</sup>.


Fig. 3. A determinant decision diagram representation for matrix **M**

Each symbolic parameter *ai* is associated with a unique pair *r*(*ai*) and *c*(*ai*), which denote, respectively, the row index and the column index of *ai*. A *determinant decision diagram* is a signed, rooted, directed acyclic graph with two terminal vertices, namely the 0-terminal vertex and the 1-terminal vertex. Each non-terminal vertex *ai* is associated with a sign, *s*(*ai*), determined by the sign rule defined by (2). It has two outgoing edges, called the 1-edge and the 0-edge, pointing, respectively, to the minor *Dai* and the remainder *Da̅i*. A determinant decision diagram having root vertex *ai* denotes a matrix determinant *D* defined recursively as follows:

• If *ai* is the 1-terminal vertex, then *D* = 1.
• If *ai* is the 0-terminal vertex, then *D* = 0.
• If *ai* is a non-terminal vertex, then

$$D = a_i \, s(a_i) \, D_{a_i} + D_{\overline{a}_i}. \tag{3}$$

Here *s*(*ai*)*Dai* is the *cofactor* of *D* with respect to *ai*, *Dai* is the *minor* of *D* with respect to *ai*, *Da̅i* is the *remainder* of *D* with respect to *ai*, and the operations are algebraic multiplications and additions. For example, Fig. 3 shows the DDD representation of det(**M**) under the ordering *a* > *c* > *b* > *d* > *f* > *e* > *g* > *i* > *h* > *j*.
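Equation (3) translates directly into a memoized recursion; because DDD vertices are shared, memoization keeps the work linear in the diagram size. The sketch below (plain Python; class and field names are ours) evaluates a hand-built DDD for the 2×2 determinant *ad* − *bc*:

```python
class DDDNode:
    """Non-terminal DDD vertex: numerical value, sign s(ai), and the
    vertices reached by its 1-edge (minor) and 0-edge (remainder)."""
    def __init__(self, value, sign, child1, child0):
        self.value, self.sign = value, sign
        self.child1, self.child0 = child1, child0

ONE, ZERO = object(), object()  # the 1-terminal and 0-terminal vertices

def evaluate(v, memo=None):
    """Evaluate(D): apply Eq. (3) recursively with memoization so every
    shared vertex is computed only once."""
    if v is ONE:
        return 1.0
    if v is ZERO:
        return 0.0
    if memo is None:
        memo = {}
    if id(v) not in memo:
        memo[id(v)] = (v.value * v.sign * evaluate(v.child1, memo)
                       + evaluate(v.child0, memo))
    return memo[id(v)]

# det [[a, b], [c, d]] with a=2, b=3, c=4, d=5; c carries sign -1.
d_v = DDDNode(5.0, +1, ONE, ZERO)
c_v = DDDNode(4.0, -1, ONE, ZERO)
b_v = DDDNode(3.0, +1, c_v, ZERO)
a_v = DDDNode(2.0, +1, d_v, b_v)
print(evaluate(a_v))  # 2*5 - 3*4 = -2.0
```

The serial depth-first traversal shown here is exactly what the GPU technique in Section 4 parallelizes.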

To enforce the uniqueness and compactness of the DDD representation, the three rules of ZBDDs, namely zero-suppression, ordering, and sharing, are adopted. This leads to DDDs having the following properties:

• Every 1-path from the root corresponds to a product term in the fully expanded symbolic expression. It contains exactly *n* 1-edges. The number of 1-paths is equal to the number of product terms.

• For any determinant *D*, there is a unique DDD representation under a given vertex ordering.

We use |*DDD*| to denote the *size* of a DDD, i.e., the number of vertices in the DDD.

Once a DDD has been constructed, the numerical value of the determinant it represents can be computed by performing a depth-first search of the graph and applying (3) at each node; the time complexity is a linear function of the size of the graph (its number of nodes). This computing step is called *Evaluate(D)*, where *D* is a DDD root.

For each vertex there are two values, *vself* and *vtree*. As mentioned above, vself represents the value of the element itself, while vtree represents the value of the whole tree (or subtree) rooted at the vertex. For each vertex, vtree equals vself multiplied by the vtree of its 1-subtree, plus the vtree of its 0-subtree, as shown in (3). In this example, the value of the determinant equals the vtree of *a*, and the vtree of *a* equals the vself of *a* multiplied by the vtree of *b*, plus the vtree of *c*. In a serial implementation, the tree value of *a* is computed by recursively computing the vtree values of all subtrees, which is very time-consuming when the tree becomes large.

One key observation about the DDD structure is that the data dependency is very simple: a node can be evaluated only after its children have been evaluated. This dependency implies parallelism: all nodes satisfying the constraint can be evaluated at the same time. Also, in frequency-domain analysis of analog circuits, the evaluation of a DDD node at different frequency points can be performed in parallel. In the following section we show how we exploit this parallelism to speed up the DDD evaluation process.

#### **3. Review of GPU architectures**

CUDA (short for Compute Unified Device Architecture) is the parallel computing architecture for NVIDIA many-core GPU processors. A typical CUDA-capable GPU consists of an array of highly threaded streaming multiprocessors (SMs) and comes with more than 4 GBytes of DRAM, referred to as global memory. Each SM has eight streaming processors (SPs) and two special function units (SFUs), and possesses its own shared memory and instruction cache. The structure of a streaming multiprocessor is shown in Fig. 4.

Fig. 4. Structure of streaming multiprocessor.

As the programming model of the GPU, CUDA extends C into CUDA C and supports tasks such as thread invocation and memory allocation, which enables programmers to exploit most of the capabilities of GPU parallelism. In the CUDA programming model, threads are organized into blocks, and blocks of threads are organized into grids. CUDA also assumes that the host (CPU) and the device (GPU) maintain their own separate memory spaces in DRAM, referred to as host memory and device memory, respectively. For every block of threads, a shared memory is accessible to all threads in that block, and the global memory is accessible to all threads in all blocks. Developers can write programs running millions of threads with thousands of blocks in parallel. This massive parallelism is the reason that GPU-accelerated programs can be many times faster than their CPU counterparts.

One thing to mention is that, on some series of CUDA GPUs, a multiprocessor has eight single-precision floating-point ALUs (one per core) but only one double-precision ALU (shared by the eight cores). Thus, for applications whose execution time is dominated by floating-point computations, switching from single precision to double precision will decrease performance by a factor of approximately eight. However, this situation has been improved in the NVIDIA T20 series, the Fermi family; these more recent GPUs from NVIDIA can already provide much better double-precision performance than before.

#### **4. New GPU-based DDD evaluation**

In this section, we present the new GPU-based DDD evaluation algorithm. Before giving the details of the GPU-based evaluation method, we first discuss the new DDD data structure for GPU parallel computing.

As noted above, a DDD node can be evaluated only after its children have been evaluated, so all nodes satisfying this constraint can be evaluated at the same time; likewise, evaluations at different frequency points can run in parallel. In the following subsections we show how we exploit this parallelism to speed up the DDD evaluation process.

**4.1 New data structure**

To achieve the best performance on the GPU, a linear memory structure, i.e., data stored in consecutive memory addresses, is preferable. For CPU serial computing, the data structure is based on dynamic links in a linked binary tree. For parallel computing, the data will instead be stored in linear arrays, which can be accessed more efficiently by different threads based on their thread ids.

As we discussed above, the DDD representation stores all product terms of the determinant of the MNA matrix in a binary linked tree structure. The vertex in the tree structure is known
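The linear-array idea described in Section 4.1 can be sketched as follows. In this illustration (plain Python; the array and function names are ours, not the authors'), each vertex is a row index into flat arrays, and vertices are grouped into levels so that every vertex in a level depends only on already-computed levels; the inner loop is therefore data-parallel and is what a GPU kernel would map one-thread-per-node:

```python
def evaluate_levels(value, sign, one_edge, zero_edge, levels):
    """Bottom-up levelized evaluation of Eq. (3) over a linear-array DDD.
    Index 0 is the 0-terminal and index 1 the 1-terminal.  Vertices in
    one level are mutually independent, so the inner loop could run one
    GPU thread per node; here it runs serially for illustration."""
    vtree = [0.0] * len(value)
    vtree[1] = 1.0                      # 1-terminal; vtree[0] stays 0.0
    for level in levels:                # levels ordered children-first
        for i in level:                 # data-parallel on a GPU
            vtree[i] = (value[i] * sign[i] * vtree[one_edge[i]]
                        + vtree[zero_edge[i]])
    return vtree

# Same 2x2 example, det = 2*5 - 3*4: vertices d=2, c=3, b=4, a=5.
value     = [0.0, 0.0, 5.0, 4.0, 3.0, 2.0]
sign      = [0,   0,   +1,  -1,  +1,  +1]
one_edge  = [0,   0,   1,   1,   3,   2]   # b's 1-edge -> c, a's -> d
zero_edge = [0,   0,   0,   0,   0,   4]   # a's 0-edge -> b
print(evaluate_levels(value, sign, one_edge, zero_edge,
                      [[2, 3], [4], [5]])[5])
# → -2.0
```

A frequency sweep adds a second parallel dimension on top of this layout: each frequency point gets its own `vtree` array, and all points can be evaluated concurrently.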
