**3. Measurement results**

The in-situ measurement scheme was implemented in a 3G cellular phone processor (Hattori et al., 2006) as an example. Supply-noise maps for the processor were obtained while several actual applications were running. Figure 9 shows a chip photomicrograph. Three CPU cores and several IPs, such as an MPEG-4 accelerator, are implemented in the chip. A general-purpose OS runs on the AP-SYS CPU, and a real-time OS runs on the APL-RT CPU. The chip was fabricated using 90-nm, 8-Metal (7Cu+1Al), dual-Vth low-power CMOS process technology.

This chip has 20 power domains, and seven VMONs are implemented in several of the power domains (Kanno, et al., 2006). Five VMONs are implemented in the application part (AP-Part), and two VMONs are implemented in the baseband part (BB-Part). VMONs 1, 3, 4, and 5 are in the same power domain, whereas the others are in separate power domains. The reason these four VMONs were implemented in the same power domain is that this domain is the largest one, and many IPs are integrated in it.

 VDD (V)

Dhrystone execution

(e) (f)

Voltage level on clock off

(d)

69 mV

Fig. 10. Measured dependence of frequency of each VMON on voltage.

2.7mV / MHz ~3.0mV / MHz

<sup>109</sup> In-Situ Supply-Noise Measurement in LSIs

Frequency of VMONs (MHz)

12mV/div 10us/div

(a)

Fig. 11. Measured local supply noise by VMON1

(c)

(b)

with Millivolt Accuracy and Nanosecond-Order Time Resolution

Fig. 9. Implementation example. This chip has three CPUs and several hardware accelerator such as a moving picture encoder (MPEG-4). The 20-power domains for partial power-shut down are implemented in a single LSI. This chip has a distributed common power domains (CPD) whose power-down opportunity is very rare. Seven VMONs and one VMOC are implemented in this chip.

Each VMON was only 2.52 *µ*m × 25.76 *µ*m, and they can be designed as a fundamental standard cell. Figure 10 shows the dependence of each VMON frequency on voltage, which were between 2.9 and 3.1 mV/MHz.

In Fig. 10, the frequency of the ring oscillators was designed to be about 200 MHz. Time resolution was about 5 ns. Note that we used LeCroy's SDA 11000 XXL oscilloscope with a 100-M-word-long time-interval recording memory and a maximum sampling speed of 40 GS/s.

#### **3.1 Dhrystone measurement**

We show the results of measurements taken while executing the Dhrystone benchmark program in the APL-RT CPU and a system control program in the AP-SYS CPU. The Dhrystone is known as a typical benchmark program for measuring performance per unit power, MIPS/mW, and the activation ratio of the circuit in the CPU core is thus high. Figure 11 shows the local supply noise from VMON1 embedded in the APL-RT CPU that was measured while executing the Dhrystone benchmark program. In these measurements, the cache of the APL-RT CPU was ON, and the hit ratio of the cache was 100%. This is the heaviest load for the APL-RT CPU executing the Dhrystone program. The measured maximum local supply noise was 69 mV under operation of the APL-RT CPU at 312 MHz and *V*DD=1.25 V. In this measurement, the baseband part was powered on, but the clock distribution was stopped.

10 Will-be-set-by-IN-TECH

BB-Part AP-Part

11.15 mm

Fig. 9. Implementation example. This chip has three CPUs and several hardware accelerator such as a moving picture encoder (MPEG-4). The 20-power domains for partial power-shut down are implemented in a single LSI. This chip has a distributed common power domains (CPD) whose power-down opportunity is very rare. Seven VMONs and one VMOC are

Each VMON was only 2.52 *µ*m × 25.76 *µ*m, and they can be designed as a fundamental standard cell. Figure 10 shows the dependence of each VMON frequency on voltage, which

In Fig. 10, the frequency of the ring oscillators was designed to be about 200 MHz. Time resolution was about 5 ns. Note that we used LeCroy's SDA 11000 XXL oscilloscope with a 100-M-word-long time-interval recording memory and a maximum sampling speed of 40

We show the results of measurements taken while executing the Dhrystone benchmark program in the APL-RT CPU and a system control program in the AP-SYS CPU. The Dhrystone is known as a typical benchmark program for measuring performance per unit power, MIPS/mW, and the activation ratio of the circuit in the CPU core is thus high. Figure 11 shows the local supply noise from VMON1 embedded in the APL-RT CPU that was measured while executing the Dhrystone benchmark program. In these measurements, the cache of the APL-RT CPU was ON, and the hit ratio of the cache was 100%. This is the heaviest load for the APL-RT CPU executing the Dhrystone program. The measured maximum local supply noise was 69 mV under operation of the APL-RT CPU at 312 MHz and *V*DD=1.25 V. In this measurement, the baseband part was powered on, but the clock distribution was stopped.

APL-RT CPU

VMON1

VMON4

VMON3

CPD

AP-SYS CPU

VMON2

MPEG4

VMON5

11.15 mm

BB-CPU

VMON7

VMONC

implemented in this chip.

**3.1 Dhrystone measurement**

GS/s.

were between 2.9 and 3.1 mV/MHz.

VMON6

Fig. 10. Measured dependence of frequency of each VMON on voltage.

Fig. 11. Measured local supply noise by VMON1

Figure 12 shows supply-noise maps obtained using these VMONs. Generally, although seven measurement points is insufficient for rendering in a 3D surface expression, this simple expression helps to understand the voltage relation between these points. This scheme can also produce a supply-noise-map animation, and Figs. 12(a) to (f) show snapshots of supply-noise maps corresponding to the timing points indicated in Fig. 11. Figure 12(a) is a snapshot when the CPUs are not operating but are consuming clock power. The location of each VMON is shown in Fig. 12(a). Note that the APL-RT CPU was running at 312 MHz, and the AP-SYS CPU was running at 52 MHz. Figure 12(b) is a snapshot taken when the

<sup>111</sup> In-Situ Supply-Noise Measurement in LSIs

Figure 12(c) is a snapshot when the local supply noise is at its maximum. Figure 12(d) is an image taken when the AP-SYS CPU shows a supply "bounce" due to an inductive effect. A typical situation where the APL-RT CPU executes Dhrystone while the AP-SYS CPU is not operating but is consuming clock power is depicted in Fig. 12(e). Figure 12(f) is a snapshot when both CPUs show a supply bounce due to an inductive effect. At this time, the Dhrystone program was terminated, and both CPUs changed their operating modes, causing large current changes. It looks as if clock power consumption has vanished, although the

Another measurement example involves moving picture encoding. A hardware accelerator that executes moving picture encoding and decoding (MPEG4) was implemented in this chip,

The waveform measured by VMON5 is shown in Fig. 13. In this MPEG4-encoding operation, a QCIF-size picture was encoded using the MPEG4 accelerator. In the measurement, the APL-RT CPU was running at 312 MHz, and the AP-SYS CPU was running at 208 MHz. The MPEG4 accelerator was running at 78 MHz, and *V*DDwas 1.25 V. The baseband part was

17.1mV 30.9mV Clock off

(d)

MPEG4 encoding

Initialization of MPEG4

(f)

20us/div 11.6mV/div

Fig. 13. Voltage noise measured while running MPEG encoding operation

(a)(b) (c)(e)

Dhrystone program has just started. Two hot spots are clearly observed.

with Millivolt Accuracy and Nanosecond-Order Time Resolution

clock remains active.

**3.2 Measurement of moving picture encoding**

as shown in Fig. 9, and VMON5 was embedded in it.

powered on, but clock distribution was stopped.

Fig. 12. Measured supply-noise maps for Dhrystone execution. (a) APL-RT CPU and AP-SYS CPU are consuming only clock power; (b) the Dhrystone program has just started in APL-RT CPU; (c) the local supply noise is at its maximum; (d) the AP-SYS CPU shows a supply "bounce" due to an inductive effect; (e) A typical situation where the APL-RT CPU executes Dhrystone and (f) both CPUs show a supply bounce due to an inductive effect. Although the seven measurement points are insufficient for showing in a 3D surface expression, this expression helps to understand the voltage relation between these points.

12 Will-be-set-by-IN-TECH

**VMON2 VMON3**

(b)

(d)

(f)

**VMON1**

**VMON6**

**VMON7**

**VMON4 VMON5**

(a)

(c)

(e)

expression helps to understand the voltage relation between these points.

Fig. 12. Measured supply-noise maps for Dhrystone execution. (a) APL-RT CPU and AP-SYS CPU are consuming only clock power; (b) the Dhrystone program has just started in APL-RT CPU; (c) the local supply noise is at its maximum; (d) the AP-SYS CPU shows a supply "bounce" due to an inductive effect; (e) A typical situation where the APL-RT CPU executes Dhrystone and (f) both CPUs show a supply bounce due to an inductive effect. Although the seven measurement points are insufficient for showing in a 3D surface expression, this

Figure 12 shows supply-noise maps obtained using these VMONs. Generally, although seven measurement points is insufficient for rendering in a 3D surface expression, this simple expression helps to understand the voltage relation between these points. This scheme can also produce a supply-noise-map animation, and Figs. 12(a) to (f) show snapshots of supply-noise maps corresponding to the timing points indicated in Fig. 11. Figure 12(a) is a snapshot when the CPUs are not operating but are consuming clock power. The location of each VMON is shown in Fig. 12(a). Note that the APL-RT CPU was running at 312 MHz, and the AP-SYS CPU was running at 52 MHz. Figure 12(b) is a snapshot taken when the Dhrystone program has just started. Two hot spots are clearly observed.

Figure 12(c) is a snapshot when the local supply noise is at its maximum. Figure 12(d) is an image taken when the AP-SYS CPU shows a supply "bounce" due to an inductive effect. A typical situation where the APL-RT CPU executes Dhrystone while the AP-SYS CPU is not operating but is consuming clock power is depicted in Fig. 12(e). Figure 12(f) is a snapshot when both CPUs show a supply bounce due to an inductive effect. At this time, the Dhrystone program was terminated, and both CPUs changed their operating modes, causing large current changes. It looks as if clock power consumption has vanished, although the clock remains active.

### **3.2 Measurement of moving picture encoding**

Another measurement example involves moving picture encoding. A hardware accelerator that executes moving picture encoding and decoding (MPEG4) was implemented in this chip, as shown in Fig. 9, and VMON5 was embedded in it.

The waveform measured by VMON5 is shown in Fig. 13. In this MPEG4-encoding operation, a QCIF-size picture was encoded using the MPEG4 accelerator. In the measurement, the APL-RT CPU was running at 312 MHz, and the AP-SYS CPU was running at 208 MHz. The MPEG4 accelerator was running at 78 MHz, and *V*DDwas 1.25 V. The baseband part was powered on, but clock distribution was stopped.

Fig. 13. Voltage noise measured while running MPEG encoding operation

The maximum local supply noise measured by VMON5 was 30.9 mV, and the average voltage drop was smaller than that when executing the Dhrystone benchmark program. This result confirms that good power efficiency was attained using hardware accelerators. Measured maps of the typical situations are shown in Fig. 14. Figure 14(a) is a snapshot taken when neither CPU was operating but both were consuming clock power; it also shows the location of each VMON. Note that the APL-RT CPU was running at 312 MHz, and the AP-SYS CPU was running at 208 MHz. Figure 14(b) is a snapshot when the APL-RT CPU was initializing the MPEG4 accelerator. Figure 14(c) depicts the situation when the local supply noise was at its maximum. The image in Fig. 14(d) illustrates the period when the execution of the MPEG4 accelerator was dominant. Figure 14(e) is a snapshot when the APL-RT CPU was executing an interruption operation from the MPEG4 accelerator, and Fig. 14 (f) shows the typical situation

<sup>113</sup> In-Situ Supply-Noise Measurement in LSIs

This measurement was done using simple picture-encoding programs, so frequent interruptions were necessary to manage the execution of the program. However, in real situations, since operation would not be carried out with frequent interruptions , and the APL-RT CPU might be in the sleep mode, the power consumption of the APL-RT CPU would

These results show that by using a hardware accelerator, the power consumption was also distributed over the chip, resulting in a reduction in the total power consumption. This voltage-drop map therefore visually presents the effectiveness of implementing a hardware

An in-situ power supply noise measurement scheme for obtaining supply-noise maps was developed. The key features of this scheme are the minimal size of simple on-chip measurement circuits, which consist of a ring oscillator based probe circuit and analog amplifier, and the support of off-chip high resolution digital signal processing with frequent calibration. Although the probe circuit based on the ring oscillator does not require a sampling-and-hold circuit, high accuracy measurements were achieved by off-chip digital signal processing and frequent calibrations. The frequent calibrations can compensate for process and temperature variations. This scheme enables voltage measurement with millivolt accuracy and nanosecond-order time resolution, which is the period of the ring oscillator. Using the scheme, we demonstrated the world's first measured animation of a supply-noise map in product-level LSIs, that is, 69-mV local supply noise with 5-ns time resolution in a

This work was done in cooperation with H. Mizuno, S. Komatsu, and Y. Kondoh of the Hitachi, Ltd., and T. Irita, K. Hirose, R. Mori, and Y. Yasu of the Renesas Electronics Corporation. We thank T. Yamada and N. Irie of Hitachi Ltd., and T. Hattori, T. Takeda of Renesas Electronics Corporation, and K. Ishibashi of The University of Electro-Communications, for their support and helpful comments. We also express our gratitude to Y. Tsuchihashi, G. Tanaka, Y. Miyairi, T. Ajioka, and N. Morino of Renesas

Electronics Corporation for their valuable advice and assistance.

where the MPEG4 accelerator was encoding a QCIF-size image.

be reduced, and the map would show a calmer surface.

with Millivolt Accuracy and Nanosecond-Order Time Resolution

accelerator.

**4. Conclusion**

3G-cellular-phone processor.

**5. Acknowledgment**

Fig. 14. Measured supply-noise maps for MPEG encoding operation. (a) neither CPU was operating but was consuming clock power; (b) the APL-RT CPU was initializing the MPEG4 accelerator; (c) the local supply noise was at its maximum; (d) the execution of the MPEG4 accelerator was dominant; (e) the APL-RT CPU was executing an interruption operation from the MPEG4 accelerator and (f) the MPEG4 accelerator was encoding a QCIF-size picture.

The maximum local supply noise measured by VMON5 was 30.9 mV, and the average voltage drop was smaller than that when executing the Dhrystone benchmark program. This result confirms that good power efficiency was attained using hardware accelerators. Measured maps of the typical situations are shown in Fig. 14. Figure 14(a) is a snapshot taken when neither CPU was operating but both were consuming clock power; it also shows the location of each VMON. Note that the APL-RT CPU was running at 312 MHz, and the AP-SYS CPU was running at 208 MHz. Figure 14(b) is a snapshot when the APL-RT CPU was initializing the MPEG4 accelerator. Figure 14(c) depicts the situation when the local supply noise was at its maximum. The image in Fig. 14(d) illustrates the period when the execution of the MPEG4 accelerator was dominant. Figure 14(e) is a snapshot when the APL-RT CPU was executing an interruption operation from the MPEG4 accelerator, and Fig. 14 (f) shows the typical situation where the MPEG4 accelerator was encoding a QCIF-size image.

This measurement was done using simple picture-encoding programs, so frequent interruptions were necessary to manage the execution of the program. However, in real situations, since operation would not be carried out with frequent interruptions , and the APL-RT CPU might be in the sleep mode, the power consumption of the APL-RT CPU would be reduced, and the map would show a calmer surface.

These results show that by using a hardware accelerator, the power consumption was also distributed over the chip, resulting in a reduction in the total power consumption. This voltage-drop map therefore visually presents the effectiveness of implementing a hardware accelerator.
