Power Reduction Techniques For Microprocessor SystemsVASANTH VENKATACHALAM AND MICHAEL FRANZUniversity of California, Irvine Power consumption is a major factor that limits the performance of computers. We survey the “state of the art” in techniques that reduce the total power consumed by a microprocessor system over time. These techniques are applied at various levels ranging from circuits to architectures, architectures to system software, and system software to applications. They also include holistic approaches that will become more important over the next decade. We conclude that power management is a multifaceted discipline that is continually expanding with new techniques being developed at every level. These techniques may eventually allow computers to break through the “power wall” and achieve unprecedented levels of performance, versatility, and reliability. Yet it remains too early to tell which techniques will ultimately solve the power problem. Categories and Subject Descriptors: C.5.3 [Computer System Implementation]: Microcomputers—Microprocessors; D.2.10 [Software Engineering]: Design— Methodologies; I.m [Computing Methodologies]: Miscellaneous General Terms: Algorithms, Design, Experimentation, Management, Measurement, Performance Additional Key Words and Phrases: Energy dissipation, power reduction1. INTRODUCTION of power; so much power, in fact, that their power densities and concomitantComputer scientists have always tried to heat generation are rapidly approachingimprove the performance of computers. levels comparable to nuclear reactorsBut although today’s computers are much (Figure 1). These high power densitiesfaster and far more versatile than their impair chip reliability and life expectancy,predecessors, they also consume a lot increase cooling costs, and, for largeParts of this effort have been sponsored by the National Science Foundation under ITR grant CCR-0205712and by the Ofﬁce of Naval Research under grant N00014-01-1-0854.Any opinions, ﬁndings, and conclusions or recommendations expressed in this material are those of theauthors and should not be interpreted as necessarily representing the ofﬁcial views, policies or endorsements,either expressed or implied, of the National Science foundation (NSF), the Ofﬁce of Naval Research (ONR),or any other agency of the U.S. Government.The authors also gratefully acknowledge gifts from Intel, Microsoft Research, and Sun Microsystems thatpartially supported this work.Authors’ addresses: Vasanth Venkatachalam, School of Information and Computer Science, University ofCalifornia at Irvine, Irvine, CA 92697-3425; email: email@example.com; Michael Franz, School of Informationand Computer Science, University of California at Irvine, Irvine, CA 92697-3425; email: firstname.lastname@example.org.Permission to make digital or hard copies of part or all of this work for personal or classroom use is grantedwithout fee provided that copies are not made or distributed for proﬁt or direct commercial advantage andthat copies show this notice on the ﬁrst page or initial screen of a display along with the full citation.Copyrights for components of this work owned by others than ACM must be honored. Abstracting withcredit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use anycomponent of this work in other works requires prior speciﬁc permission and/or a fee. Permissions may berequested from Publications Dept., ACM, Inc., 1515 Broadway, New York, NY 10036 USA, fax: +1 (212)869-0481, or email@example.com. c 2005 ACM 0360-0300/05/0900-0195 $5.00 ACM Computing Surveys, Vol. 37, No. 3, September 2005, pp. 195–237.
196 V. Venkatachalam and M. Franz the power problem. Finally, we will dis- cuss some commercial power management systems (Section 3.10) and provide a glimpse into some more radical technolo- gies that are emerging (Section 3.11). 2. DEFINING POWER Power and energy are commonly deﬁned in terms of the work that a system per- forms. Energy is the total amount of work a system performs over a period of time,Fig. 1. Power densities rising. Figure adapted from while power is the rate at which the sys-Pollack 1999. tem performs that work. In formal terms,data centers, even raise environmental P = W/T (1)concerns. At the other end of the performancespectrum, power issues also pose prob- E = P ∗ T, (2)lems for smaller mobile devices with lim- where P is power, E is energy, T is a spe-ited battery capacities. Although one could ciﬁc time interval, and W is the total workgive these devices faster processors and performed in that interval. Energy is mea-larger memories, this would diminish sured in joules, while power is measuredtheir battery life even further. in watts. Without cost effective solutions to These concepts of work, power, and en-the power problem, improvements in ergy are used differently in different con-micro-processor technology will eventu- texts. In the context of computers, workally reach a standstill. Power manage- involves activities associated with run-ment is a multidisciplinary ﬁeld that in- ning programs (e.g., addition, subtraction,volves many aspects (i.e., energy, temper- memory operations), power is the rate atature, reliability), each of which is complex which the computer consumes electricalenough to merit a survey of its own. The energy (or dissipates it in the form of heat)focus of our survey will be on techniques while performing these activities, and en-that reduce the total power consumed by ergy is the total electrical energy the com-typical microprocessor systems. puter consumes (or dissipates as heat) We will follow the high-level taxonomy over time.illustrated in Figure 2. First, we will de- This distinction between power and en-ﬁne power and energy and explain the ergy is important because techniques thatcomplex parameters that dynamic and reduce power do not necessarily reduce en-static power depend on (Section 2). Next, ergy. For example, the power consumedwe will introduce techniques that reduce by a computer can be reduced by halv-power and energy (Section 3), starting ing the clock frequency, but if the com-with circuit (Section 3.1) and architec- puter then takes twice as long to runtural techniques (Section 3.2, Section 3.3, the same programs, the total energy con-and Section 3.4), and then moving on to sumed will be similar. Whether one shouldtwo techniques that are widely applied in reduce power or energy depends on thehardware and software, dynamic voltage context. In mobile applications, reducingscaling (Section 3.5) and resource hiberna- energy is often more important becausetion (Section 3.6). Third, we will examine it increases the battery lifetime. How-what compilers can do to manage power ever, for other systems (e.g., servers), tem-(Section 3.7). We will then discuss recent perature is a larger issue. To keep thework in application level power manage- temperature within acceptable limits, onement (Section 3.8), and recent efforts (Sec- would need to reduce instantaneous powertion 3.9) to develop a holistic solution to regardless of the impact on total energy. ACM Computing Surveys, Vol. 37, No. 3, September 2005.
Power Reduction Techniques for Microprocessor Systems 197 Fig. 2. Organization of this survey.2.1. Dynamic Power Consumption capacitance (C) and an activity factor (a) that relates to how many 0 → 1 or 1 → 0There are two forms of power consump- transitions occur in a chip:tion, dynamic power consumption andstatic power consumption. Dynamic powerconsumption arises from circuit activity Pdynamic ∼ aCV 2 f . (3)such as the changes of inputs in anadder or values in a register. It has two Accordingly, there are four ways to re-sources, switched capacitance and short- duce dynamic power consumption, thoughcircuit current. they each have different tradeoffs and not Switched capacitance is the primary all of them reduce the total energy con-source of dynamic power consumption and sumed. The ﬁrst way is to reduce the phys-arises from the charging and discharging ical capacitance or stored electrical chargeof capacitors at the outputs of circuits. of a circuit. The physical capacitance de- Short-circuit current is a secondary pends on low level design parameterssource of dynamic power consumption and such as transistor sizes and wire lengths.accounts for only 10-15% of the total power One can reduce the capacitance by re-consumption. It arises because circuits are ducing transistor sizes, but this worsenscomposed of transistors having opposite performance.polarity, negative or NMOS and positive The second way to lower dynamic poweror PMOS. When these two types of tran- is to reduce the switching activity. As com-sistors switch current, there is an instant puter chips get packed with increasinglywhen they are simultaneously on, creat- complex functionalities, their switchinging a short circuit. We will not deal fur- activity increases [De and Borkar 1999],ther with the power dissipation caused by making it more important to develop tech-this short circuit because it is a smaller niques that fall into this category. Onepercentage of total power, and researchers popular technique, clock gating, gateshave not found a way to reduce it without the clock signal from reaching idle func-sacriﬁcing performance. tional units. Because the clock network As the following equation shows, the accounts for a large fraction of a chip’smore dominant component of dynamic total energy consumption, this is a verypower, switched capacitance (Pdynamic ), de- effective way of reducing power and en-pends on four parameters namely, supply ergy throughout a processor and is imple-voltage (V), clock frequency (f), physical mented in numerous commercial systemsACM Computing Surveys, Vol. 37, No. 3, September 2005.
198 V. Venkatachalam and M. Franz 2.2. Understanding Leakage Power Consumption In addition to consuming dynamic power, computer components consume static power, also known as idle power or leakage. According to the most re- cently published industrial roadmaps [ITRSRoadMap], leakage power is rapidly becoming the dominant source of power consumption in circuits (Figure 3) and persists whether a computer is active orFig. 3. ITRS trends for leakage power dissipation. idle. Because its causes are different fromFigure adapted from Meng et al., 2005. those of dynamic power, dynamic power reduction techniques do not necessarily reduce the leakage power. As the equation that follows illustrates,including the Pentium 4, Pentium M, Intel leakage power consumption is the productXScale and Tensilica Xtensa, to mention of the supply voltage (V) and leakage cur-but a few. rent (Il eak ), or parasitic current, that ﬂows The third way to reduce dynamic power through transistors even when the tran-consumption is to reduce the clock fre- sistors are turned off.quency. But as we have just mentioned,this worsens performance and does not al-ways reduce the total energy consumed. Pleak = V Ileak . (4)One would use this technique only ifthe target system does not support volt- To understand how leakage currentage scaling and if the goal is to re- arises, one must understand how transis-duce the peak or average power dissi- tors work. A transistor regulates the ﬂowpation and indirectly reduce the chip’s of current between two terminals calledtemperature. the source and the drain. Between these The fourth way to reduce dynamic power two terminals is an insulator, called theconsumption is to reduce the supply volt- channel, that resists current. As the volt-age. Because reducing the supply voltage age at a third terminal, the gate, is in-increases gate delays, it also requires re- creased, electrical charge accumulates inducing the clock frequency to allow the cir- the channel, reducing the channel’s resis-cuit to work properly. tance and creating a path along which The combination of scaling the supply electricity can ﬂow. Once the gate volt-voltage and clock frequency in tandem is age is high enough, the channel’s polar-called dynamic voltage scaling (DVS). This ity changes, allowing the normal ﬂow oftechnique should ideally reduce dynamic current between the source and the drain.power dissipation cubically because dy- The threshold at which the gate’s voltagenamic power is quadratic in voltage and is high enough for the path to open is calledlinear in clock frequency. This is the most the threshold voltage.widely adopted technique. A growing num- According to this model, a transistor isber of processors, including the Pentium similar to a water dam. It is supposedM, mobile Pentium 4, AMD’s Athlon, and to allow current to ﬂow when the gateTransmeta’s Crusoe and Efﬁcieon proces- voltage exceeds the threshold voltage butsors allow software to adjust clock fre- should otherwise prevent current fromquencies and voltage settings in tandem. ﬂowing. However, transistors are imper-However, DVS has limitations and cannot fect. They leak current even when the gatealways be applied, and even when it can voltage is below the threshold voltage. Inbe applied, it is nontrivial to apply as we fact, there are six different types of cur-will see in Section 3.5. rent that leak through a transistor. These ACM Computing Surveys, Vol. 37, No. 3, September 2005.
Power Reduction Techniques for Microprocessor Systems 199include reverse-biased-junction leakage, ture increases. But as the Equation (6)gate-induced-drain leakage, subthreshold shows, this increases leakage further,leakage, gate-oxide leakage, gate-current causing yet higher temperatures. This vi-leakage, and punch-through leakage. Of cious cycle is known as thermal runaway.these six, subthreshold leakage and gate- It is the chip designer’s worst nightmare.oxide leakage dominate the total leakage How can these problems be solved?current. Equations (4) and (6) indicate four ways to Gate-oxide leakage ﬂows from the gate of reduce leakage power. The ﬁrst way is toa transistor into the substrate. This type reduce the supply voltage. As we will see,of leakage current depends on the thick- supply voltage reduction is a very commonness of the oxide material that insulates technique that has been applied to compo-the gate: nents throughout a system (e.g., processor, buses, cache memories). V 2 Tox The second way to reduce leakage power Iox = K 2 W e−α V . (5) is to reduce the size of a circuit because the Tox total leakage is proportional to the leak- age dissipated in all of a circuit’s tran- According to this equation, the gate- sistors. One way of doing this is to de-oxide leakage Iox increases exponentially sign a circuit with fewer transistors byas the thickness Tox of the gate’s oxide ma- omitting redundant hardware and usingterial decreases. This is a problem because smaller caches, but this may limit perfor-future chip designs will require the thick- mance and versatility. Another idea is toness to be reduced along with other scaled reduce the effective transistor count dy-parameters such as transistor length and namically by cutting the power supplies tosupply voltage. One way of solving this idle components. Here, too, there are chal-problem would be to insulate the gate us- lenges such as how to predict when dif-ing a high-k dialectric material instead ferent components will be idle and how toof the oxide materials that are currently minimize the overhead of shutting themused. This solution is likely to emerge over on or off. This, is also a common approachthe next decade. of which we will see examples in the sec- Subthreshold leakage current ﬂows be- tions to follow.tween the drain and source of a transistor. The third way to reduce leakage powerIt is the dominant source of leakage and is to cool the computer. Several coolingdepends on a number of parameters that techniques have been developed sinceare related through the following equa- the 1960s. Some blow cold air into thetion: circuit, while others refrigerate the −Vth −V processor [Schmidt and Notohardjono Isub = K 1 W e nT 1−e T . (6) 2002], sometimes even by costly means such as circulating cryogenic ﬂuids like In this equation, W is the gate width liquid nitrogen [Krane et al. 1988]. Theseand K and n are constants. The impor- techniques have three advantages. Firsttant parameters are the supply voltage they signiﬁcantly reduce subthresholdV , the threshold voltage Vth , and the leakage. In fact, a recent study [Schmidttemperature T. The subthreshold leak- and Notohardjono 2002] showed that cool-age current Isub increases exponentially ing a memory cell by 50 degrees Celsiusas the threshold voltage Vth decreases. reduces the leakage energy by ﬁve times.This again raises a problem for future Second, these techniques allow a circuit tochip designs, because as technology scales, work faster because electricity encountersthreshold voltages will have to scale along less resistance at lower temperatures.with supply voltages. Third, cooling eliminates some negative The increase in subthreshold leakage effects of high temperatures, namely thecurrent causes another problem. When the degradation of a chip’s reliability and lifeleakage current increases, the tempera- expectancy. Despite these advantages,ACM Computing Surveys, Vol. 37, No. 3, September 2005.
200 V. Venkatachalam and M. Franz Fig. 4. The transistor stacking effect. The cir- cuits in both (a) and (b) are leaking current be- tween Vdd and Ground. However, Ileak1, the leak- age in (a), is less than Ileak2, the leakage in (b). Figure adapted from Butts and Sohi 2000. Fig. 5. Multiple threshold circuits with sleep transistors.there are issues to consider such as thecosts of the hardware used to cool the and this increases the threshold voltagecircuit. Moreover, cooling techniques of the bottom transistor, making it moreare insufﬁcient if they result in wide resistant to leakage. As a result, in Fig-temperature variations in different parts ure 4(a), in which all transistors are inof a circuit. Rather, one needs to prevent the Off position, transistor B leaks lesshotspots by distributing heat evenly current than transistor A, and transistorthroughout a chip. C leaks less current than transistor B. Hence, the total leakage current is atten- 2.2.1. Reducing Threshold Voltage. The uated as it ﬂows from Vdd to the groundfourth way of reducing leakage power is to through transistors A, B, and C. This is notincrease the threshold voltage. As Equa- the case in the circuit shown in Figure 4tion (6) shows, this reduces the subthresh- (b), which contains only a single offold leakage exponentially. However, it also transistor.reduces the circuit’s performance as is ap- Another way to increase the thresh-parent in the following equation that re- old voltage is to use Multiple Thresh-lates frequency (f), supply voltage (V), and old Circuits With Sleep Transistors (MTC-threshold voltage (Vth ) and where α is a MOS) [Calhoun et al. 2003; Won et al.constant: 2003] . This involves isolating a leaky cir- cuit element by connecting it to a pair (V − Vth )α of virtual power supplies that are linked f∞ . (7) V to its actual power supplies through sleep transistors (Figure 5). When the circuit is One of the less intuitive ways of active, the sleep transistors are activated,increasing the threshold voltage is to connecting the circuit to its power sup-exploit what is called the stacking effect plies. But when the circuit is inactive, the(refer to Figure 4). When two or more tran- sleep transistors are deactivated, thus dis-sistors that are switched off are stacked connecting the circuit from its power sup-on top of each other (a), then they dis- plies. In this inactive state, almost no leak-sipate less leakage than a single tran- age passes through the circuit because thesistor that is turned off (b). This is be- sleep transistors have high threshold volt-cause each transistor in the stack induces ages. (Recall that subthreshold leakagea slight reverse bias between the gate and drops exponentially with a rise in thresh-source of the transistor right below it, old voltage, according to Equation (6).) ACM Computing Surveys, Vol. 37, No. 3, September 2005.
Power Reduction Techniques for Microprocessor Systems 201 Fig. 6. Adaptive body biasing. This technique effectively conﬁnes the To adjust the threshold voltage, adap-leakage to one part of the circuit, but is tive body biasing applies a voltage to thetricky to implement for several reasons. transistor’s body known as a body biasThe sleep transistors must be sized prop- voltage (Figure 6). This voltage changeserly to minimize the overhead of acti- the polarity of a transistor’s channel, de-vating them. They cannot be turned on creasing or increasing its resistance toand off too frequently. Moreover, this tech- current ﬂow. When the body bias voltagenique does not readily apply to memories is chosen to ﬁll the transistor’s channelbecause memories lose data when their with positive ions (b), the threshold volt-power supplies are cut. age increases and reduces leakage cur- A third way to increase the threshold rents. However, when the voltage is cho-is to employ dual threshold circuits. Dual sen to ﬁll the channel with negative ions,threshold circuits [Liu et al. 2004; Wei the threshold voltage decreases, allowinget al. 1998; Ho and Hwang 2004] reduce higher performance, though at the cost ofleakage by using high threshold (low leak- more leakage.age) transistors on noncritical paths andlow threshold transistors on critical paths,the idea being that noncritical paths can 3. REDUCING POWERexecute instructions more slowly without 3.1. Circuit And Logic Level Techniquesimpairing performance. This is a difﬁ-cult technique to implement because it re- 3.1.1. Transistor Sizing. Transistor siz-quires choosing the right combination of ing [Penzes et al. 2002; Ebergen et al.transistors for high-threshold voltages. If 2004] reduces the width of transistorstoo many transistors are assigned high to reduce their dynamic power consump-threshold voltages, the noncritical paths tion, using low-level models that relate thein the circuit can slow down too much. power consumption to the width. Accord- A fourth way to increase the thresh- ing to these models, reducing the widthold voltage is to apply a technique known also increases the transistor’s delay andas adaptive body biasing [Seta et al. thus the transistors that lie away from1995; Kobayashi and Sakurai 1994; Kim the critical paths of a circuit are usuallyand Roy 2002]. Adaptive body biasing is the best candidates for this technique. Al-a runtime technique that reduces leak- gorithms for applying this technique usu-age power by dynamically adjusting the ally associate with each transistor a toler-threshold voltages of circuits depending able delay which varies depending on howon whether the circuits are active. When close that transistor is to the critical path.a circuit is not active, the technique in- These algorithms then try to scale eachcreases its threshold voltage, thus saving transistor to be as small as possible with-leakage power exponentially, although at out violating its tolerable delay.the expense of a delay in circuit operation.When the circuit is active, the technique 3.1.2. Transistor Reordering. The ar-decreases the threshold voltage to avoid rangement of transistors in a circuitslowing it down. affects energy consumption. Figure 7ACM Computing Surveys, Vol. 37, No. 3, September 2005.
202 V. Venkatachalam and M. Franz 3.1.3. Half Frequency and Half Swing Clocks. Half-frequency and half-swing clocks re- duce frequency and voltage, respectively. Traditionally, hardware events such as register ﬁle writes occur on a rising clock edge. Half-frequency clocks synchro- nize events using both edges, and they tick at half the speed of regular clocks, thus cutting clock switching power in half. Reduced-swing clocks also often use a lower voltage signal and thus reduce power quadratically. Fig. 7. Transistor Reordering. Figure adapted from Hossain et al. 1996. 3.1.4. Logic Gate Restructuring. There are many ways to build a circuit out of logic gates. One decision that affects power con-shows two possible implementations of sumption is how to arrange the gates andthe same circuit that differ only in their their input signals.placement of the transistors marked A For example, consider two implementa-and B. Suppose that the input to transis- tions of a four-input AND gate (Figure 8),tor A is 1, the input to transistor B is 1, a chain implementation (a), and a treeand the input to transistor C is 0. Then implementation (b). Knowing the signaltransistors A and B will be on, allowing probabilities (1 or 0) at each of the pri-current from Vdd to ﬂow through them mary inputs (A, B, C, D), one can easily cal-and charge the capacitors C1 and C2. culate the transition probabilities (0→1) Now suppose that the inputs change and for each output (W, X, F, Y, Z). If each in-that A’s input becomes 0, and C’s input be- put has an equal probability of being acomes 1. Then A will be off while B and C 1 or a 0, then the calculation shows thatwill be on. Now the implementations in (a) the chain implementation (a) is likely toand (b) will differ in the amounts of switch- switch less than the tree implementationing activity. In (a), current from ground (b). This is because each gate in a chainwill ﬂow past transistors B and C, dis- has a lower probability of having a 0→1charging both the capacitors C1 and C2. transition than its predecessor; its tran-However, in (b), the current from ground sition probability depends on those of allwill only ﬂow past transistor C; it will not its predecessors. In the tree implementa-get past transistor A since A is turned off. tion, on the other hand, some gates mayThus it will only discharge the capacitor share a parent (in the tree topology) in-C2, rather than both C1 and C2 as in part stead of being directly connected together.(a). Thus the implementation in (b) will These gates could have the same transi-consume less power than that in (a). tion probabilities. Transistor reordering [Kursun et al. Nevertheless, chain implementations2004; Sultania et al. 2004] rearranges do not necessarily save more energy thantransistors to minimize their switching tree implementations. There are other is-activity. One of its guiding principles is sues to consider when choosing a topology.to place transistors closer to the circuit’s One is the issue of glitches or spuriousoutputs if they switch frequently in or- transitions that occur when a gate does notder to prevent a domino effect where receive all of its inputs at the same time.the switching activity from one transistor These glitches are more common in chaintrickles into many other transistors caus- implementations where signals can traveling widespread power dissipation. This re- along different paths having widely vary-quires proﬁling techniques to determine ing delays. One solution to reduce glitcheshow frequently different transistors are is to change the topology so that the dif-likely to switch. ferent paths in the circuit have similar ACM Computing Surveys, Vol. 37, No. 3, September 2005.
Power Reduction Techniques for Microprocessor Systems 203 Fig. 8. Gate restructuring. Figure adapted from the Pennsylvania State University Microsystems Design Laboratory’s tutorial on low power design.delays. This solution, known as path bal- such as register ﬁles. A typical master-ancing often transforms chain implemen- slave ﬂip-ﬂop consists of two latches, atations into tree implementations. An- master latch, and a slave latch. The inputsother solution, called retiming, involves of the master latch are the clock signalinserting ﬂip-ﬂops or registers to slow and data. The inputs of the slave latchdown and thereby synchronize the signals are the inverse of the clock signal and thethat pass along different paths but recon- data output by the master latch. Thusverge to the same gate. Because ﬂip-ﬂops when the clock signal is high, the masterand registers are in sync with the proces- latch is turned on, and the slave latchsor clock, they sample their inputs less fre- is turned off. In this phase, the masterquently than logic gates and are thus more samples whatever inputs it receives andimmune to glitches. outputs them. The slave, however, does not sample its inputs but merely outputs 3.1.5. Technology Mapping. Because of whatever it has most recently stored. Onthe huge number of possibilities and the falling edge of the clock, the mastertradeoffs at the gate level, designers rely turns off and the slave turns on. Thus theon tools to determine the most energy- master saves its most recent input andoptimal way of arranging gates and sig- stops sampling any further inputs. Thenals. Technology mapping [Chen et al. slave samples the new inputs it receives2004; Li et al. 2004; Rutenbar et al. 2001] from the master and outputs it.is the automated process of constructing a Besides this master-slave design aregate-level representation of a circuit sub- other common designs such as the pulse-ject to constraints such as area, delay, and triggered ﬂip-ﬂop and sense-ampliﬁer ﬂip-power. Technology mapping for power re- ﬂop. All these designs share some commonlies on gate-level power models and a li- sources of power consumption, namelybrary that describes the available gates, power dissipated from the clock signal,and their design constraints. Before a cir- power dissipated in internal switching ac-cuit can be described in terms of gates, it tivity (caused by the clock signal and byis initially represented at the logic level. changes in data), and power dissipatedThe problem is to design the circuit out of when the outputs change.logic gates in a way that will mimimize the Researchers have proposed several al-total power consumption under delay and ternative low power designs for ﬂip-ﬂops.cost constraints. This is an NP-hard Di- Most of these approaches reduce therected Acyclic Graph (DAG) covering prob- switching activity or the power dissipatedlem, and a common heuristic to solve it is by the clock signal. One alternative is theto break the DAG representation of a cir- self-gating ﬂip-ﬂop. This design inhibitscuit into a set of trees and ﬁnd the optimal the clock signal to the ﬂip-ﬂop when the in-mapping for each subtree using standard puts will produce no change in the outputs.tree-covering algorithms. Strollo et al.  have proposed two ver- sions of this design. In the double-gated 3.1.6. Low Power Flip-Flops. Flip-ﬂops ﬂip-ﬂop, the master and slave latchesare the building blocks of small memories each have their own clock-gating circuit.ACM Computing Surveys, Vol. 37, No. 3, September 2005.
204 V. Venkatachalam and M. FranzThe circuit consists of a comparator that way of reducing power is to encode thechecks whether the current and previous FSM states in a way that minimizes theinputs are different and some logic that in- switching activity throughout the proces-hibits the clock signal based on the output sor. Another common approach involvesof the comparator. In the sequential-gated decomposing the FSM into sub-FSMsﬂip-ﬂop, the master and the slave share and activating only the circuitry neededthe same clock-gating circuit instead of for the currently executing sub-FSM.having their own individual circuit as in Some researchers [Gao and Hayes 2003]the double-gated ﬂip-ﬂop. have developed tools that automate this Another low-power ﬂip-ﬂop is the condi- process of decomposition, subject to sometional capture FLIP-FLOP [Kong et al. 2001; quality and correctness constraints.Nedovic et al. 2001]. This ﬂip-ﬂop detectswhen its inputs produce no change in the 3.1.8. Delay-Based Dynamic-Supply Voltageoutput and stops these redundant inputs Adjustment. Commercial processors thatfrom causing any spurious internal cur- can run at multiple clock speed normallyrent switches. There are many variations use a lookup table to decide what supplyon this idea, such as the conditional dis- voltage to select for a given clock speed.charge ﬂip-ﬂops and conditional precharge This table is use built ahead of timeﬂip-ﬂops [Zhao et al. 2004] which elim- through a worst-case analysis. Becauseinate unnecessary precharging and dis- worst-case analyses may not reﬂect real-charging of internal elements. ity, ARM Inc. has been developing a more A third type of low-power ﬂip-ﬂop is the efﬁcient runtime solution known as thedual edge triggered FLIP-FLOP [Llopis and Razor pipeline [Ernst et al. 2003].Sachdev 1996]. These ﬂip-ﬂops are sensi- Instead of using a lookup table, Razortive to both the rising and falling edges adjusts the supply voltage based on theof the clock, and can do the same work delay in the circuit. The idea is that when-as a regular ﬂip-ﬂop at half the clock fre- ever the voltage is reduced, the circuitquency. By cutting the clock frequency slows down, causing timing errors if thein half, these ﬂip-ﬂops save total power clock frequency chosen for that voltage isconsumption but may not save total en- too high. Because of these errors, some in-ergy, although they could save energy by structions produce inconsistent results orreducing the voltage along with the fre- fail altogether. The Razor pipeline peri-quency. Another drawback is that these odically monitors how many such errorsﬂip-ﬂops require more area than regu- occur. If the number of errors exceeds alar ﬂip-ﬂops. This increase in area may threshold, it increases the supply voltage.cause them to consume more power on the If the number of errors is below a thresh-whole. old, it scales the voltage further to save more energy. This solution requires extra hardware 3.1.7. Low-Power Control Logic Design. in order to detect and correct circuitOne can view the control logic of a pro- errors. To detect errors, it augmentscessor as a Finite State Machine (FSM). the ﬂip-ﬂops in delay-critical regions ofIt speciﬁes the possible processor states the circuit with shadow-latches. Theseand conditions for switching between latches receive the same inputs as thethe states and generates the signals that ﬂip-ﬂops but are clocked more slowly toactivate the appropriate circuitry for each adapt to the reduced supply voltage. Ifstate. For example, when an instruction the output of the ﬂip-ﬂop differs fromis in the execution stage of the pipeline, that of its shadow-latch, then this sig-the control logic may generate signals to niﬁes an error. In the event of such anactivate the correct execution unit. error, the circuit propagates the output Although control logic optimizations of the shadow-latch instead of the ﬂip-have traditionally targeted performance, ﬂop, delaying the pipeline for a cycle ifthey are now also targeting power. One necessary. ACM Computing Surveys, Vol. 37, No. 3, September 2005.
Power Reduction Techniques for Microprocessor Systems 205 Fig. 9. Crosstalk prevention through shielding.3.2. Low-Power Interconnect smaller, there arise additional sources of power consumption. One of these sourcesInterconnect heavily affects power con- is crosstalk [Sylvester and Keutzer 1998].sumption since it is the medium of most Crosstalk is spurious activity on a wireelectrical activity. Efforts to improve chip that is caused by activity in neighboringperformance are resulting in smaller chips wires. As well as increasing delays and im-with more transistors and more densely pairing circuit integrity, crosstalk can in-packed wires carrying larger currents. The crease power consumption.wires in a chip often use materials with One way of reducing crosstalk [Taylorpoor thermal conductivity [Banerjee and et al. 2001] is to insert a shield wireMehrotra 2001]. (Figure 9) between adjacent bus wires. Since the shield remains deasserted, no 3.2.1. Bus Encoding And CrossTalk. A pop- adjacent wires switch in opposite direc-ular way of reducing the power consumed tions. This solution doubles the number ofin buses is to reduce the switching activity wires. However, the principle behind it hasthrough intelligent bus encoding schemes, sparked efforts to develop self-shieldingsuch as bus-inversion [Stan and Burleson codes [Victor and Keutzer 2001; Patel and1995]. Buses consist of wires that trans- Markov 2003] resistant to crosstalk. Asmit bits (logical 1 or 0). For every data in traditional bus encoding, a value is en-transmission on a bus, the number of wires coded and then transmitted. However, thethat switch depends on the current and code chosen avoids opposing transitions onprevious values transmitted. If the Ham- adjacent bus wires.ming distance between these values ismore than half the number of wires, then 3.2.2. Low Swing Buses. Bus-encodingmost of the wires on the bus will switch schemes attempt to reduce transitions oncurrent. To prevent this from happening, the bus. Alternatively, a bus can transmitbus-inversion transmits the inverse of the the same information but at a lowerintended value and asserts a control sig- voltage. This is the principle behind lownal alerting recipients of the inversion. swing buses [Zhang and Rabaey 1998].For example, if the current binary value to Traditionally, logical one is representedtransmit is 110 and the previous was 000, by 5 volts and logical zero is representedthe bus instead transmits 001, the inverse by −5 volts. In a low-swing systemof 110. (Figure 10), logical one and zero are Bus-inversion ensures that at most half encoded using lower voltages, such asof the bus wires switch during a bus trans- +300mV and −300mV. Typically, theseaction. However, because of the cost of the systems are implemented with differen-logic required to invert the bus lines, this tial signaling. An input signal is split intotechnique is mainly used in external buses two signals of opposite polarity boundedrather than the internal chip interconnect. by a smaller voltage range. The receiver Moreover, some researchers have ar- sees the difference between the twogued that this technique makes the sim- transmitted signals as the actual signalplistic assumption that the amount of and ampliﬁes it back to normal voltage.power that an interconnect wire consumes Low Swing Differential Signaling hasdepends only on the number of bit tran- several advantages in addition to reducedsitions on that wire. As chips become power consumption. It is immune toACM Computing Surveys, Vol. 37, No. 3, September 2005.
206 V. Venkatachalam and M. Franz Fig. 10. Low voltage differential signaling Fig. 11. Bus segmentation.crosstalk and electromagnetic radiation 3.2.4. Adiabatic Buses. The precedingeffects. Since the two transmitted signals techniques work by reducing activityare close together, any spurious activity or voltage. In contrast, adiabatic cir-will affect both equally without affecting cuits [Lyuboslavsky et al. 2000] reduce to-the difference between them. However, tal capacitance. These circuits reuse ex-when implementing this technique, one isting electrical charge to avoid creatingneeds to consider the costs of increased new charge. In a traditional bus, whenhardware at the encoder and decoder. a wire becomes deasserted, its previous charge is wasted. A charge-recovery bus 3.2.3. Bus Segmentation. Another effec- recycles the charge for wires about to betive technique is bus segmentation. In a asserted.traditional shared bus architecture, the Figure 12 illustrates a simple design forentire bus is charged and discharged upon a two-bit adiabatic bus [Bishop and Ir-every access. Segmentation (Figure 11) win 1999]. A comparator in each bitlinesplits a bus into multiple segments con- tests whether the current bit is equal tonected by links that regulate the trafﬁc be- the previously sent bit. Inequality signi-tween adjacent segments. Links connect- ﬁes a transition and connects the bitlineing paths essential to a communication are to a shorting wire used for sharing chargeactivated independently, allowing most of across bitlines. In the sample scenario,the bus to remain powered down. Ideally, Bitline1 has a falling transition while Bit-devices communicating frequently should line0 has a rising transition. As Bitline1be in the same or nearby segments to avoid falls, its positive charge transfers to Bit-powering many links. Jone et al.  line0, allowing Bitline0 to rise. The powerhave examined how to partition a bus saved depends on transition patterns. Noto ensure this property. Their algorithm energy is saved when all lines rise. Thebegins with an undirected graph whose most energy is saved when an equal num-nodes are devices, edges connect commu- ber of lines rise and fall simultaneously.nicating devices, and edge weights dis- The biggest drawback of adiabatic cir-play communication frequency. Applying cuits is a delay for transfering sharedthe Gomory and Hu algorithm , they charge.ﬁnd a spanning tree in which the prod-uct of communication frequency and num-ber of edges between any two devices is 3.2.5. Network-On-Chip. All of the aboveminimal. techniques assume an architecture in ACM Computing Surveys, Vol. 37, No. 3, September 2005.
Power Reduction Techniques for Microprocessor Systems 207 and columns corresponding to the inter- section points end up switching current even though only parts of these rows and columns are actually traversed. To elim- inate this widespread current switching, the segmented crossbar topology divides the rows and columns into segments de- marcated by tristate buffers. The differ- ent segments are selectively activated as the data traverses through them, henceFig. 12. Two bit charge recovery bus. conﬁning current switches to only those segments along which the data actuallywhich multiple functional units share one passes.or more buses. This shared-bus paradigmhas several drawbacks with respect to 3.3. Low-Power Memories and Memorypower and performance. The inherent bus Hierarchiesbandwidth limits the speed and volumeof data transfers and does not suit the 3.3.1. Types of Memory. One can classifyvarying requirements of different execu- memory structures into two categories,tion units. Only one unit can access the Random Access Memories (RAM) andbus at a time, though several may be re- Read-Only-Memories (ROM). There arequesting bus access simultaneously. Bus two kinds of RAMs, static RAMs (SRAM)arbitration mechanisms are energy inten- and dynamic RAMs (DRAM) which dif-sive. Unlike simple data transfers, every fer in how they store data. SRAMs storebus transaction involves multiple clock data using ﬂip-ﬂops and DRAMs storecycles of handshaking protocols that in- each bit of data as a charge on a capac-crease power and delay. itor; thus DRAMs need to refresh their As a consequence, researchers have data periodically. SRAMs allow faster ac-been investigating the use of generic in- cesses than DRAMs but require moreterconnection networks to replace buses. area and are more expensive. As a result,Compared to buses, networks offer higher normally only register ﬁles, caches, andbandwidth and support concurrent con- high bandwidth parts of the system arenections. This design makes it easier to made up of SRAM cells, while main mem-troubleshoot energy-intensive areas of the ory is made up of DRAM cells. Althoughnetwork. Many networks have sophisti- these cells have slower access timescated mechanisms for adapting to varying than SRAMs, they contain fewer transis-trafﬁc patterns and quality-of-service re- tors and are less expensive than SRAMquirements. Several authors [Zhang et al. cells.2001; Dally and Towles 2001; Sgroi et al. The techniques we will introduce are not2001] propose the application of these conﬁned to any speciﬁc type of RAM ortechniques at the chip level. These claims ROM. Rather they are high-level architec-have yet to be veriﬁed, and interconnect tural principles that apply across the spec-power reduction remains a promising area trum of memories to the extent that thefor future research. required technology is available. They at- Wang et al.  have developed tempt to reduce the energy dissipation ofhardware-level optimizations for differ- memory accesses in two ways, either byent sources of energy consumption in reducing the energy dissipated in a mem-an on-chip network. One of the alterna- ory accesses, or by reducing the number oftive network topologies they propose is memory accesses.the segmented crossbar topology whichaims to reduce the energy consumption 3.3.2. Splitting Memories Into Smaller Sub-of data transfers. When data is trans- systems. An effective way to reduce thefered over a regular network grid, all rows energy that is dissipated in a memoryACM Computing Surveys, Vol. 37, No. 3, September 2005.
208 V. Venkatachalam and M. Franzaccess is to activate only the needed mem- processor and ﬁrst level cache. It wouldory circuits in each access. One way to ex- be smaller than the ﬁrst level cache andpose these circuits is to partition memo- hence dissipate less energy. Yet it wouldries into smaller, independently accessible be large enough to store an application’scomponents. This can be done as different working set and ﬁlter out many mem-granularities. ory references; only when a data request At the lowest granularity, one can de- misses this cache would it require search-sign a main memory system that is split ing higher levels of cache, and the penaltyup into multiple banks each of which for a miss would be offset by redirectingcan independently transition into differ- more memory references to this smaller,ent power modes. Compilers and operating more energy efﬁcient cache. This was thesystems can cluster data into a minimal principle behind the ﬁlter cache. Thoughset of banks, allowing the other unused deceptively simple, this idea lies at thebanks to power down and thereby save en- heart of many of the low-power cache de-ergy [Luz et al. 2002; Lebeck et al. 2000]. signs used today, ranging from simple ex- At a higher granularity, one can parti- tensions of the cache hierarchy (e.g., blocktion the separate banks of a partitioned buffering) to the scratch pad memoriesmemory into subbanks and activate only used in embedded systems, and the com-the relevant subbank in every memory ac- plex trace caches used in high-end gener-cess. This is a technique that has been ap- ation processors.plied to caches [Ghose and Kamble 1999]. One simple extension is the technique ofIn a normal cache access, the set-selection block buffering. This technique augmentsstep involves transferring all blocks in all caches with small buffers to store the mostbanks onto tag-comparison latches. Since recently accessed cache set. If a data itemthe requested word is at a known off- that has been requested belongs to a cacheset in these blocks, energy can be saved set that is already in the buffer, then oneby transfering only words at that offset. does not have to search for it in the rest ofCache subbanking splits the block array of the cache.each cache line into multiple banks. Dur- In embedded systems, scratch pad mem-ing set-selection, only the relevant bank ories [Panda et al. 2000; Banakar et al.from each line is active. Since a single sub- 2002; Kandemir et al. 2002] have beenbank is active at a time, all subbanks of the extension of choice. These are soft-a line share the same output latches and ware controlled memories that reside on-sense ampliﬁers to reduce hardware com- chip. Unlike caches, the data they shouldplexity. Subbanking saves energy with- contain is determined ahead of time andout increasing memory access time. More- remains ﬁxed while programs execute.over, it is independent of program locality These memories are less volatile thanpatterns. conventional caches and can be accessed within a single cycle. They are ideal for specialized embedded systems whose ap- 3.3.3. Augmenting the Memory Hierarchy plications are often hand tuned.With Specialized Cache Structures. The sec- General purpose processors of the lat-ond way to reduce energy dissipation in est generation sometimes rely on morethe memory hierarchy is to reduce the complex caching techniques such as thenumber of memory hierarchy accesses. trace cache. Trace caches were originallyOne group of researchers [Kin et al. 1997] developed for performance but have beendeveloped a simple but effective technique studied recently for their power bene-to do this. Their idea was to integrate a ﬁts. Instead of storing instructions inspecialized cache into the typical mem- their compiled order, a trace cache storesory hierarchy of a modern processor, a traces of instructions in their executedhierarchy that may already contain one order. If an instruction sequence is al-or more levels of caches. The new cache ready in the trace cache, then it needthey were proposing would sit between the not be fetched from the instruction cache ACM Computing Surveys, Vol. 37, No. 3, September 2005.
Power Reduction Techniques for Microprocessor Systems 209but can be decoded directly from thetrace cache. This saves power by re-ducing the number of instruction cacheaccesses. In the original trace cache design, boththe trace cache and the instruction cacheare accessed in parallel on every clock cy-cle. Thus instructions are fetched from theinstruction cache while the trace cache issearched for the next trace that is pre-dicted to executed. If the trace is in thetrace cache, then the instructions fetchedfrom the instruction cache are discarded. Fig. 13. Drowsy cache line.Otherwise the program execution pro-ceeds out of the instruction cache and si- 3.4.1. Adaptive Caches. There is amultaneously new traces are created from wealth of literature on adaptive caches,the instructions that are fetched. caches whose storage elements (lines, Low-power designs for trace caches typi- blocks, or sets) can be selectively acti-cally aim to reduce the number of instruc- vated based on the application workload.tion cache accesses. One alternative, the One example of such a cache is thedynamic direction prediction-based trace Deep-Submicron Instruction (DRI)cache [Hu et al. 2003], uses branch predi- cache [Powell et al. 2001]. This cachetion to decide where to fetch instructions permits one to deactivate its individualfrom. If the branch predictor predicts the sets on demand by gating their supplynext trace with high conﬁdence and that voltages. To decide what sets to activate attrace is in the trace cache, then the in- any given time, the cache uses a hardwarestructions are fetched from the trace cache proﬁler that monitors the application’srather than the instruction cache. cache-miss patterns. Whenever the cache The selective trace cache [Hu et al. 2002] misses exceed a threshold, the DRI cacheextends this technique with compiler sup- activates previously deactivated sets.port. The idea is to identify frequently exe- Likewise, whenever the miss rate fallscuted, or hot traces, and bring these traces below a threshold, the DRI deactivatesinto the trace cache. The compiler inserts some of these sets by inhibiting theirhints after every branch instruction, indi- supply voltages.cating whether it belongs to a hot trace. At A problem with this approach is thatruntime, these hints cause instructions to dormant memory cells lose data and needbe fetched from either the trace cache or more time to be reactivated for their nextthe instruction cache. use. Thus an alternative to inhibiting their supply voltages is to reduce their voltages as low as possible without losing data.3.4. Low-Power Processor Architecture This is the aim of the drowsy cache, a cache Adaptations whose lines can be placed in a drowsySo far we have described the energy sav- mode where they dissipate minimal powering features of hardware as though they but retain data and can be reactivatedwere a ﬁxed foundation upon which pro- faster.grams execute. However, programs exhibit Figure 13 illustrates a typical drowsywide variations in behavior. Researchers cache line [Kim et al. 2002]. The wakeuphave been developing hardware structures signal chooses between voltages for thewhose parameters can be adjusted on de- active, drowsy, and off states. It alsomand so that one can save energy by regulates the word decoder’s output toactivating just the minimum hardware prevent data in a drowsy line from be-resources needed for the code that is ing accessed. Before a cache access, theexecuting. line circuitry must be precharged. In aACM Computing Surveys, Vol. 37, No. 3, September 2005.
210 V. Venkatachalam and M. Franz the ﬁrst adaptive instruction issue queues, a 32-bit queue consisting of four equal size partitions that can be independently acti- vated. Each partition consists of wakeup Fig. 14. Dead block elimination. logic that decides when instructions are ready to execute and readout logic thattraditional cache, the precharger is contin- dispatches ready instructions into theuously on. To save power, the drowsy cir- pipeline. At any time, only the partitionscuitry regulates the wakeup signal with a that contain the currently executing in-precharge signal. The precharger becomes structions are activated.active only when the line is woken up. There have been many heuristics devel- Many heuristics have been developed oped for conﬁguring these queues. Somefor controlling these drowsy circuits. The require measuring the rate at which in-simplest policy for controlling the drowsy structions are executed per processor clockcache is cache decay [Kaxiras et al. 2001] cycles, or IPC. For example, Buyukto-which simply turns off unused cache lines sunoglu et al.’s heuristic  periodi-after a ﬁxed interval. More sophisticated cally measures the IPC over ﬁxed lengthpolicies [Zhang et al. 2002; 2003; Hu et al. intervals. If the IPC of the current inter-2003] consider information about runtime val is a factor smaller than that of theprogram behavior. For example, Hu et previous interval (indicating worse per-al.  have proposed hardware that formance), then the heuristic increasesdetects when an execution moves into a the instruction queue size to increase thehotspot and activates only those cache throughput. Otherwise it may decreaselines that lie within that hotspot. The the queue size.main idea of their approach is to count Bahar and Manne  propose ahow often different branches are taken. similar heuristic targeting a simulatedWhen a branch is taken sufﬁciently many 8-wide issue processor that can be re-times, its target address is a hotspot, and conﬁgured as a 6-wide issue or 4-widethe hardware sets special bits to indicate issue. They also measure IPC over ﬁxedthat cache lines within that hotspot should intervals but measure the ﬂoating pointnot be shut down. Periodically, a runtime IPC separately from the integer IPC sinceroutine disables cache lines that do not the former is more critical for ﬂoatinghave these bits set and decays the proﬁl- point applications. Their heuristic, likeing data. Whenever it reactivates a cache Buyuktosunoglu et al.’s , comparesline before its next use, it also reactivates the measured IPC with speciﬁc thresholdsthe line that corresponds to the next sub- to determine when to downsize the issuesequent memory access. queue. There are many other heuristics for con- Folegnani and Gonzalez , on thetrolling cache lines. Dead-block elimina- other hand, use a different heuristic. Theytion [Kabadi et al. 2002] powers down ﬁnd that when a program has little par-cache lines containing basic blocks that allelism, the youngest part of the issuehave reached their ﬁnal use. It is a queue (which contains the most recent in-compiler-directed technique that requires structions) contributes very little to thea static control ﬂow analysis to identify overall IPC in that instructions in thisthese blocks. Consider the control ﬂow part are often committed late. Thus theirgraph in Figure 14. Since block B1 exe- heuristic monitors the contribution of thecutes only once, its cache lines can power youngest part of this queue and deacti-down once control reaches B2. Similarly, vates this part when its contribution isthe lines containing B2 and B3 can power minimal.down once control reaches B4. 3.4.2. Adaptive Instruction Queues. Buyu- 3.4.3. Algorithms for Reconﬁguring Multi-ktosunoglu et al.  developed one of ple Structures. In addition to this work, ACM Computing Surveys, Vol. 37, No. 3, September 2005.
Power Reduction Techniques for Microprocessor Systems 211researchers have also been tackling the to apply, they use ofﬂine proﬁling to plotproblem of problem of reconﬁguring mul- an energy-delay tradeoff curve. Each pointtiple hardware units simultaneously. in the curve represents the energy and de- One group [Hughes et al. 2001; Sasanka lay tradeoff of applying an adaptation toet al. 2002; Hughes and Adve 2004] a subroutine. There are three regions inhas been developing heuristics for com- the curve. The ﬁrst includes adaptationsbining hardware adaptations with dy- that save energy without impairing per-namic voltage scaling during multime- formance; these are the adaptations thatdia applications. Their strategy is to run will always be applied. The third repre-two algorithms while individual video sents adaptations that worsen both per-frames are being processed. A global algo- formance and energy; these are the adap-rithm chooses the initial DVS setting and tations that will never be applied. Be-baseline hardware conﬁguration for each tween these two extremes are adaptationsframe, while a local algorithm tunes var- that trade performance for energy savings;ious hardware parameters (e.g., instruc- some of these may be applied depending ontion window sizes) while the frame is ex- the tolerable performance loss. The mainecuting to use up any remaining slack idea is to trace the curve from the ori-periods. gin and apply every adaptation that one Another group [Iyer and Marculescu encounters until the cumulative perfor-2001, 2002a] has developed heuristics that mance loss from applying all the encoun-adjust the pipeline width and register up- tered adaptations reaches the maximumdate unit (RUU) size for different hotspots tolerable performance loss.or frequently executed program regions. Ponomarev et al.  develop heuris-To detect these hotspots, their technique tics for controlling a conﬁgurable issuecounts the number of times each branch queue, a conﬁgurable reorder buffer and ainstruction is taken. When a branch is conﬁgurable load/store queue. Unlike thetaken enough times, it is marked as a can- IPC based heuristics we have seen, thisdidate branch. heuristic is occupancy-based because the To detect how often candidate branches occupancy of hardware structures indi-are taken, the heuristic uses a hotspot cates how congested these structures are,detection counter. Initially the hotspot and congestion is a leading cause of perfor-counter contains a positive integer. Each mance loss. Whenever a hardware struc-time a candidate branch is taken, the ture (e.g., instruction queue) has low oc-counter is decremented, and each time a cupancy, the heuristic reduces its size. Innoncandidate branch is taken, the counter contrast, if the structure is completely fullis incremented. If the counter becomes for a long time, the heuristic increases itszero, this means that the program execu- size to avoid congestion.tion is inside a hotspot. Another group of researchers [Albonesi Once the program execution is in a et al. 2003; Dropsho et al. 2002] ex-hotspot, this heuristic tests the energy plore the complex problem of conﬁgur-consumption of different processor conﬁg- ing a processor whose adaptable compo-urations at ﬁxed length intervals. It ﬁnally nents include two levels of caches, integerchooses the most energy saving conﬁgura- and ﬂoating point issue queues, load/storetion as the one to use the next time the queues, register ﬁles, as well as a reordersame hotspot is entered. To measure en- buffer. To dodge the sheer explosion of pos-ergy at runtime, the heuristic relies on sible conﬁgurations, they tune each com-hardware that monitors usage statistics of ponent based only on its local usage statis-the most power hungry units and calcu- tics. They propose two heuristics for doinglates the total power based on energy per this, one for tuning caches and another foraccess values. tuning buffers and register ﬁles. Huang et al.[2000; 2003] apply hard- The caches they consider are selec-ware adaptations at the granularity of tive way caches; each of their multiplesubroutines. To decide what adaptations ways can independently be activated orACM Computing Surveys, Vol. 37, No. 3, September 2005.
212 V. Venkatachalam and M. Franzdisabled. Each way has a most recently [Turing 1937]). There have been ongoingused (MRU) counter that is incremented efforts [Nilsen and Rygg 1995; Lim et al.every time that a cache searche hits the 1995; Diniz 2003; Ishihara and Yasuuraway. The cache tuning heuristic samples 1998] in the real-time systems communitythis counter at ﬁxed intervals to determine to accurately estimate the worst case exe-the number of hits in each way and the cution times of programs. These attemptsnumber of overall misses. Using these ac- use either measurement-based prediction,cess statistics, the heuristic computes the static analysis, or a mixture of both. Mea-energy and performance overheads for all surement involves a learning lag wherebypossible cache conﬁgurations and chooses the execution times of various tasks arethe best conﬁguration dynamically. sampled and future execution times are The heuristic for controlling other struc- extrapolated. The problem is that, whentures such as issue queues is occupancy workloads are irregular, future execu-based. It measures how often different tion times may not resemble past execu-components of these structures ﬁll up with tion times. Microarchitectural innovationsdata and uses this information to de- such as pipelining, hyperthreading andcide whether to upsize or downsize the out-of-order execution make it difﬁcult tostructure. predict execution times in real systems statically. Among other things, a compiler3.5. Dynamic Voltage Scaling needs to know how instructions are inter-Dynamic voltage scaling (DVS) addresses leaved in the pipeline, what the probabili-the problem of how to modulate a pro- ties are that different branches are taken,cessor’s clock frequency and supply volt- and when cache misses and pipeline haz-age in lockstep as programs execute. The ards are likely to occur. This requires mod-premise is that a processor’s workloads eling the pipeline and memory hierarchy.vary and that when the processor has less A number of researchers [Ferdinand 1997;work, it can be slowed down without af- Clements 1996] have attempted to inte-fecting performance adversely. For exam- grate complete pipeline and cache mod-ple, if the system has only one task and it els into compilers. It remains nontrivialhas a workload that requires 10 cycles to to develop models that can be used ef-execute and a deadline of 100 seconds, the ﬁciently in a compiler and that captureprocessor can slow down to 1/10 cycles/sec, all the complexities inherent in currentsaving power and meeting the deadline systems.right on time. This is presumably more ef-ﬁcient than running the task at full speedand idling for the remainder of the pe- 3.5.2. Indeterminism And Anomalies In Realriod. Though it appears straightforward, Systems. Even if the DVS algorithm pre-this naive picture of how DVS works is dicts correctly what the processor’s work-highly simpliﬁed and hides serious real- loads will be, determining how fast to runworld complexities. the processor is nontrivial. The nondeter- minism in real systems removes any strict 3.5.1. Unpredictable Nature Of Workloads. relationships between clock frequency, ex-The ﬁrst problem is how to predict work- ecution time, and power consumption.loads with reasonable accuracy. This re- Thus most theoretical studies on voltagequires knowing what tasks will execute scaling are based on assumptions thatat any given time and the work re- may sound reasonable but are not guar-quired for these tasks. Two issues com- anteed to hold in real systems.plicate this problem. First, tasks can One misconception is that total micro-be preempted at arbitrary times due to processor system power is quadratic inuser and I/O device requests. Second, supply voltage. Under the CMOS transis-it is not always possible to accurately tor model, the power dissipation of indi-predict the future runtime of an arbi- vidual transistors are quadratic in theirtrary algorithm (c.f., the Halting Problem supply voltages, but there remains no ACM Computing Surveys, Vol. 37, No. 3, September 2005.
Power Reduction Techniques for Microprocessor Systems 213precise way of estimating the power dis-sipation of an entire system. One canconstruct scenarios where total systempower is not quadratic in supply voltage.Martin and Siewiorek  found thatthe StrongARM SA-1100 has two supplyvoltages, a 1.5V supply that powers theCPU and a 3.3V supply that powers the I/Opins. The total power dissipation is domi-nated by the larger supply voltage; hence,even if the smaller supply voltage were al-lowed to vary, it would not reduce powerquadratically. Fan et al.  found that,in some architectures where active mem-ory banks dominate total system powerconsumption, reducing the supply voltagedramatically reduces CPU power dissi-pation but has a miniscule effect on to-tal power. Thus, for full beneﬁts, dynamicvoltage scaling may need to be combinedwith resource hibernation to power down Fig. 15. Scheduling effects. When DVS is applied (a), task one takes a longer time to enter its criti-other energy hungry parts of the chip. cal section, and task two is able to preempt it right Another misconception is that it is before it does. When DVS is not applied (b), taskmost power efﬁcient to run a task at one enters its critical section sooner and thus pre-the slowest constant speed that allows vents task two from preempting it until it ﬁnishesit to exactly meet its deadline. Several its critical section. This ﬁgure has been adopted from Buttazzo .theoretical studies [Ishihara and Yasuura1998] attempt to prove this claim. Theseproofs rest on idealistic assumptions such two tasks (Figure 15). When the processoras “power is a convex linearly increasing slows down (a), Task One takes longer tofunction of frequency”, assumptions that enter its critical section and is thus pre-ignore how DVS affects the system as a empted by Task Two as soon as Task Twowhole. When the processor slows down, is released into the system. When the pro-peripheral devices may remain activated cessor instead runs at maximum speed (b),longer, consuming more power. An impor- Task One enters its critical section earliertant study of this issue was done by Fan and blocks Task Two until it ﬁnishes thiset al.  who show that, for speciﬁc section. This hypothetical example sug-DRAM architectures, the energy versus gests that DVS can affect the order inslowdown curve is “U” shaped. As the pro- which tasks are executed. But the order incessor slows down, CPU energy decreases which tasks execute affects the state of thebut the cumulative energy consumed in cache which, in turn, affects performance.active memory banks increases. Thus the The exact relationship between the cacheoptimal speed is actually higher than the state and execution time is not clear. Thus,processor’s lowest speed; any speed lower it is too simplistic to assume that execu-than this causes memory energy dissipa- tion time is inversely proportional to clocktion to overshadow the effects of DVS. frequency. A third misconception is that execution All of these issues show that theo-time is inversely proportional to clock fre- retical studies are insufﬁcient for un-quency. It is actually an open problem derstanding how DVS will affect systemhow slowing down the processor affects state. One needs to develop an experi-the execution time of an arbitrary appli- mental approach driven by heuristics andcation. DVS may result in nonlinearities evaluate the tradeoffs of these heuristics[Buttazzo 2002]. Consider a system with empirically.ACM Computing Surveys, Vol. 37, No. 3, September 2005.
214 V. Venkatachalam and M. Franz Most of the existing DVS approaches formance and energy dissipation. Weisselcan be classiﬁed as interval-based ap- and Bellosa’s technique involves moni-proaches, intertask approaches, or in- toring these event rates using hardwaretratask approaches. performance counters and attributing them to the processes that are executing. At each context switch, the heuristic 3.5.3. Interval-Based Approaches. adjusts the clock frequency for the processInterval-based DVS algorithms mea- being activated based on its previouslysure how busy the processor is over measured event rates. To do this, thesome interval or intervals, estimate how heuristic refers to a previously con-busy it will be in the next interval, and structed table of event rate combinations,adjust the processor speed accordingly. divided into frequency domains. WeisselThese algorithms differ in how they and Bellosa construct this table throughestimate future processor utilization. a detailed ofﬂine simulation where, forWeiser et al.  developed one of the different combinations of event rates, theyearliest interval-based algorithms. This determine the lowest clock frequency thatalgorithm, Past, periodically measures does not violate a user-speciﬁed thresholdhow long the processor idles. If it idles for performance loss.longer than a threshold, the algorithm Intertask DVS has two drawbacks.slows it down; if it remains busy longer First, task workloads are usually un-than a threshold, it speeds it up. Since known until tasks ﬁnish running. Thusthis approach makes decisions based traditional algorithms either assume per-only on the most recent interval, it is fect knowledge of these workloads or es-error prone. It is also prone to thrashing timate future workloads based on priorsince it computes a CPU speed for every workloads. When workloads are irregular,interval. Govil et al.  examine a estimating them is nontrivial. Flautnernumber of algorithms that extend Past et al.  address this problem byby considering measurements collected classifying all workloads as interactiveover larger windows of time. The Aged episodes and producer-consumer episodesAverages algorithm, for example, esti- and using separate DVS algorithms formates the CPU usage for the upcoming each. For interactive episodes, the clockinterval as a weighted average of usages frequency used depends not only on themeasured over all previous intervals. predicted workload but also on a percep-These algorithms save more energy than tion threshold that estimates how fast thePast since they base their decisions on episode would have to run to be accept-more information. However, they all able to a user. To recover from predictionassume that workloads are regular. As we error, the algorithm also includes a panichave mentioned, it is very difﬁcult, if not threshold. If a session lasts longer thanimpossible, to predict irregular workloads this threshold, the processor immediatelyusing history information alone. switches to full speed. The second drawback of intertask ap- 3.5.4. Intertask Approaches. Intertask proaches is that they are unaware of pro-DVS algorithms assign different speeds gram structure. A program’s structurefor different tasks. These speeds remain may provide insight into the work re-ﬁxed for the duration of each task’s exe- quired for it, insight that, in turn, maycution. Weissel and Bellosa  have provide opportunities for techniques suchdeveloped an intertask algorithm that as voltage scaling. Within speciﬁc programuses hardware events as the basis for regions, for example, the processor maychoosing the clock frequencies for differ- spend signiﬁcant time waiting for mem-ent processes. The motivation is that, as ory or disk accesses to complete. Duringprocesses execute, they generate various the execution of these regions, the proces-hardware events (e.g., cache misses, IPC) sor can be slowed down to meet pipelineat varying rates that relate to their per- stall delays. ACM Computing Surveys, Vol. 37, No. 3, September 2005.
Power Reduction Techniques for Microprocessor Systems 215 3.5.5. Intratask Approaches. Intratask the ﬁrst of these algorithms. They no-approaches [Lee and Sakurai 2000; Lorch ticed that programs have multiple exe-and Smith 2001; Gruian 2001; Yuan and cution paths, some more time consum-Nahrstedt 2003] adjust the processor ing than others, and that whenever con-speed and voltage within tasks. Lee and trol ﬂows away from a critical path, thereSakurai  developed one of the earli- are opportunities for the processor to slowest approaches. Their approach, run-time down and still ﬁnish the programs withinvoltage hopping, splits each task into their deadlines. Based on this observation,ﬁxed length timeslots. The algorithm as- Shin and Kim developed a tool that pro-signs each timeslot the lowest speed that ﬁles a program ofﬂine to determine worstallows it to complete within its preferred case and average cycles for different exe-execution time which is measured as cution paths, and then inserting instruc-the worst case execution time minus the tions to change the processor frequency atelapsed execution time up to the current the start of different paths based on thistimeslot. An issue not considered in this information.work is how to divide a task into timeslots. Another approach, program checkpoint-Moreover, the algorithm is pessimistic ing [Azevedo et al. 2002], annotates a pro-since tasks could ﬁnish before their worst gram with checkpoints and assigns tim-case execution time. It is also insensitive ing constraints for executing code betweento program structure. checkpoints. It then proﬁles the program There are many variations on this idea. ofﬂine to determine the average numberTwo operating system-level intratask al- of cycles between different checkpoints.gorithms are PACE [Lorch and Smith Based on this proﬁling information and2001] and Stochastic DVS [Gruian 2001]. timing constraints, it adjusts the CPUThese algorithms choose a speed for ev- speed at each checkpoint.ery cycle of a task’s execution based on the Perhaps the most well known approachprobability distribution of the task work- is by Hsu and Kremer . They pro-load measured over previous cycles. The ﬁle a program ofﬂine with respect to alldifference between PACE and stochastic possible combinations of clock frequenciesDVS lies in their cost functions. Stochastic assigned to different regions. They thenDVS assumes that energy is proportional build a table describing how each combi-to the square of the supply voltage, while nation affects execution time and powerPACE assumes it is proportional to the consumption. Using this table, they selectsquare of the clock frequency. These sim- the combination of regions and frequen-plifying assumptions are not guaranteed cies that saves the most power without in-to hold in real systems. creasing runtime beyond a threshold. Dudani et al.  have also developedintratask policies at the operating systemlevel. Their work aims to combine in- 3.5.6. The Implications of Memory Boundedtratask voltage scaling with the earliest- Code. Memory bounded applicationsdeadline-ﬁrst scheduling algorithm. have become an attractive target for DVSDudani et al. split into two subprograms algorithms because the time for a memoryeach program that the scheduler chooses access, on most commercial systems, isto execute. The ﬁrst subprogram runs at independent of how fast the processor isthe maximum clock frequency, while the running. When frequent memory accessessecond runs slow enough to keep the com- dominate the program execution time,bined execution time of both subprograms they limit how fast the program can ﬁnishbelow the average execution time for the executing. Because of this “memory wall”,whole program. the processor can actually run slower In addition to these schemes, there and save a lot of energy without losinghave been a number of intratask poli- as much performance as it would if itcies implemented at the compiler level. were slowing down a compute-intensiveShin and Kim  developed one of code.ACM Computing Surveys, Vol. 37, No. 3, September 2005.
216 V. Venkatachalam and M. Franz Fig. 16. Making the best of the memory wall. Figure 16 illustrates this idea. For ease Nevertheless, a growing number ofof exposition, we call a sequence of exe- researchers are addressing the DVS prob-cuted instructions a task. Part (a) depicts lem from this angle; that is, they are devel-a processor executing three tasks. Task 1 oping techniques for modulating the pro-is followed by a memory load instruction cessor frequency based on memory accessthat causes a main memory access. While patterns. Because it is difﬁcult to reasonthis access is taking place, task 2 is able abstractly about the complex events occur-to execute since it does not depend on the ring in modern processors, the research inresults of task 1. However task 3, which this area has a strong experimental ﬂavor;comes after task 2, depends on the result many of the techniques used are heuristicsof task 1, and thus can execute only af- that are validated through detailed simu-ter the memory access completes. So im- lations and experiments on real systems.mediately after executing task 2, the pro- One recent work is by Kondo andcessor sits idle until the memory access Nakamura . It is an interval-basedﬁnishes. What the processor could have approach driven by cache misses. Thedone is illustrated in part (b). Here instead heuristic periodically calculates the num-of running task 2 as fast as possible and ber of outstanding cache misses andidling until the memory access completes, increments one of three counters depend-the processor slows down the execution of ing on whether the number of outstand-task 2 to overlap exactly with the main ing misses is zero, one, or greater thanmemory access. This, of course, saves en- one. The cost function that Kondo andergy without impairing performance. Nakamura use expresses the memory But there are many idealized assump- boundedness of the code as a weightedtions at work here such as the assumption sum of the three counters. In particular,that a DVS algorithm can predict with the third counter which is incrementedcomplete accuracy a program’s future each time there are multiple outstandingbehavior and switch the clock frequency misses, receives the heaviest weight. Atwithout any hidden costs. In reality, ﬁxed length intervals, the heuristic com-things are far more complex. A funda- pares the memory boundedness, so com-mental problem is how to detect program puted, with respect to an upper and lowerregions during which the processor stalls, threshold. If the memory boundedness iswaiting for a memory, disk, or network greater than the upper threshold, thenaccess to complete. Modern processors the heuristic decreases the frequency andcontain counters that measure events voltage by one setting; otherwise if thesuch as cache misses, but it is difﬁcult memory boundedness is below the lowerto extrapolate the nature of these stalls threshold, then it increases the frequencyfrom observing these events. and voltage settings by one unit. ACM Computing Surveys, Vol. 37, No. 3, September 2005.
Power Reduction Techniques for Microprocessor Systems 217 Stanley-Marbell et al.  propose a Unlike the previous approaches whichsimilar approach, the Power Adaptation attempt to monitor memory boundedness,Unit. This is a ﬁnite state machine that de- a recent technique by Hsu and Feng tects the program counter values for which attempts to monitor CPU boundedness.memory stalls often occur. When a stall is Their algorithm periodically measures theinitially detected, the unit would enter a rate at which instructions are executedtransient state where it counts the num- (expressed in million instructions per sec-ber of stall cycles, and after the stall cy- ond) to determine how compute-intensivecles exceeded a threshold, it would mark the workload is. The authors ﬁnd thatthe initial program counter value as hot. this approach is as accurate as other ap-Thus the next time the program counter proaches that measure memory bounded-assumes the same value, the PAU would ness, but may be easier to implement be-automatically reduce the clock frequency cause of its simpler cost model.and voltage. Another recent work is by Choi et al. 3.5.7. Dynamic Voltage Scaling In Multiple. Their work is based on an exe- Clock Domain Architectures. A Globallycution time model where a program’s to- Asynchronous, Locally Synchronoustal runtime is split into two parts, time (GALS) chip is split into multiple do-that the program spends on-chip, and time mains, each of which has its own localthat it spends in main memory accesses. clock. Each domain is synchronous withThe on-chip time can be calculated from respect to its clock, but the differentthe number of on-chip instructions, the domains are mutually asynchronous inaverage cycles for each of these instruc- that they may run at different clocktions, and the processor clock frequency. frequencies. This design has three ad-The off-chip time is similarly a function vantages. First, the clocks that powerof the number of off-chip (memory refer- different domains are able to distributeence) instructions, the average cycles per their signals over smaller areas, thusoff-chip instruction, and the memory clock reducing clock skew. Second, the effectsfrequency. of changing the clock frequency are felt Using this model, Choi et al.  de- less outside the given domain. This is anrive a formula for the lowest clock fre- important advantage that GALS has overquency that would keep the performance conventional CPUs. When a conventionalloss within a tolerable threshold. They ﬁnd CPU scales its clock frequency, all of thethat the key to computing this formula hardware structures that receive the clockis determining the average number of cy- signal slow down causing widespread per- oncles for an on-chip instruction CPUavg ; this formance loss. In GALS, on the otheressential information determines the ra- hand, one can choose to slow down sometio of on-chip to off-chip instructions. They parts of the circuit, while allowing othersalso found that the average cycles per in- to operate at their maximum frequencies.struction (CPIavg ) is a linear function of This creates more opportunities for savingthe average stall cycles per instruction energy. In compute bound applications,(SPIavg ). This reduced their problem to for example, a GALS system can keep thecalculating SPIavg . To calculate this, they critical paths of the circuit running asused data-cache miss statistics provided fast as possible but slow down other partsby hardware counters. Reasoning that the of the circuit.cycles spent in stalls increase with re- Nevertheless, it is nontrivial to createspect to the number of cache misses, Choi voltage scaling algorithms for GALS pro-et al. built a table that associated different cessors. First, these processors requirecache miss ranges with different values special mechanisms allowing the differentof SPIavg ; this would allow their heuris- domains to communicate and synchronize.tic to calculate in real time the work- Second, it is unclear how to split the pro-load decomposition and appropriate clock cessor into domains. Because communica-frequency. tion energy can be large, domains must beACM Computing Surveys, Vol. 37, No. 3, September 2005.
218 V. Venkatachalam and M. Franzchosen so that there is minimal commu- searching for operations whose energy dis-nication between them. One possibility is sipation exceeds a threshold. It stretchesto cluster similar functionalities into the out the workload of such operations un-same domain. In a GALS processor devel- til the energy dissipation is below a givenoped by Semeraro et al. , all the in- threshold, repeating this process until ei-teger execution units are in the same do- ther no more operations dissipate exces-main. If instead the integer issue queue sive energy or until all of the operationswere in a different domain than the in- have used up their slack. The informationteger ALU, these two units would waste provided by this algorithm would be usedmore energy communicating since they by a compiler to select clock frequencies forare often accessed together. different GALS domains in different pro- Another factor that makes it hard to de- gram regions.sign algorithms is that one needs to con-sider possible interdependencies between 3.6. Resource Hibernationdomains. For example, if one domain isslowed down, data may take more time to Though switching activity causes powermove through it and reach other domains. consumption, computer components con-This may impact the overall congestion in sume power even when idle. Resource hi-the circuit and affect performance. A DVS bernation techniques power down compo-algorithm would need to be aware of these nents during idle periods. A variety ofglobal effects. heuristics have been developed to man- Numerous researchers have tackled the age the power settings of different devicescomplex problem of controlling GALS sys- such as disk drives, network interfaces,tems [Magklis et al. 2003; Marculescu and displays. We now give examples of2004; Dropsho et al. 2004; Semeraro et al. these techniques.2002; Iyer and Marculescu 2002b; Magkliset al. 2003]. We follow with two recent ex- 3.6.1. Disk Drives. Disk drives can ac-amples, the ﬁrst being an interval-based count for a large percentage of a system’sheuristic and the second being a compiler- total energy dissipation and many tech-based heuristic. niques have been developed to attack this Semeraro et al.  developed one problem [Gurumurthi et al. 2003; Li et al.of the ﬁrst runtime algorithms for dy- 2004]. Most of the power loss stems fromnamic voltage scaling in a GALS system. the rotating platter during disk activityThey observed that each domain has an in- and idle periods. To save energy, operatingstruction issue queue that indicates how systems tend to pause disk rotation afterfast instructions are ﬂowing through the a ﬁxed period of inactivity and restart itpipeline, thus giving insight into the work- during the next disk access. The decisionload. They propose to monitor the issue of when to spin down an idle disk involvesqueue utilization over ﬁxed length inter- tradeoffs between power and performance.vals and to respond swiftly to sudden High idle time thresholds lead to more en-changes by rapidly increasing or decreas- ergy loss but better performance since theing the domain’s clock frequency, but to de- disk need not restart to process requests.crease the clock frequency very gradually Low thresholds allow a disk to spin downwhen there are little observable changes sooner when idle but increase the proba-in the workload. bility of an immediate restart to process One of the ﬁrst compiler-based algo- a spontaneous request. These immediaterithms for controlling GALS systems is restarts, called bumps, waste energy andby Magklis et al. . Their main algo- cause delays.rithm is called the shaker algorithm due to One of the techniques used at the oper-its resemblance to a salt shaker. It repeat- ating system level is Predictive Dynamicedly traverses the Directed Acyclic Graph Threshold Adjustment. This technique ad-(DAG) representation of a program from justs at runtime the idleness threshold, orroot to leaves and from leaves to root, acceptable window of time before an idle ACM Computing Surveys, Vol. 37, No. 3, September 2005.
Power Reduction Techniques for Microprocessor Systems 219 Fig. 17. Predictive disk management.disk can be spun down. If a disk immedi- disk may remain uninterrupted in a low-ately spins up after spinning down, this power, inactive state for longer periods.technique increases the idleness thresh- However, dependencies between disk re-old. Otherwise, it decreases the threshold. quests can sometimes make it impossibleThis is an example of a technique that uses to apply clustering.past disk usage information to predict Moreover, idle periods for spinningfuture usage patterns. These techniques down the disk are not always avail-are less effective at dealing with irreg- able. For example, large scale servers areular workloads, as Figure 17 illustrates. continuously processing heavy workloads.Part (a) depicts a disk usage pattern in Gurumurthi et al.  have recentlywhich active phases are abruptly followed investigated an alternative strategy forby idle phases, and the idle phases keep server environments called Dynamic RPMgetting longer. Dynamic threshold adjust- control (DRPM). Instead of spinning downment (b) abruptly restarts the disk ev- the disk fully, DRPM modulates the ro-erytime it spins it down and keeps the tational speed (in rotations per minute)disk activated during long idle phases. according to workload demands, therebyTo improve on this technique, an operat- eliminating the overheads of transition-ing system can cluster disk requests (c), ing between low power and fully activethereby lengthening idle periods (d) when states. As part of their study, Gurumurthiit can spin down the disk. To create op- et al. develop power models for a disk driveportunities for clustering, the algorithm of that supports 15 different RPM levels. Us-Papathanasiou and Scott  delays ing these models they propose a simpleprocessing nonurgent disk requests and heuristic for RPM modulation. The heuris-reduces the number of remaining requests tic involves some cooperation between theby aggressively prefetching disk data into disks and a controller that oversees thememory. The delayed requests accumu- disks. The controller initially deﬁnes alate in a queue for later processing in a range of RPM levels for each disk; themore voluminous transaction. Thus, the disks modulate their RPMs within thisACM Computing Surveys, Vol. 37, No. 3, September 2005.
220 V. Venkatachalam and M. Franzrange based on their demands. They pe- to the device. The technique identiﬁes pro-riodically monitor their input queues, and gram regions where the card should hyber-whenever the number of requests exceeds nate based on the nature of array accessesa threshold, they increase their RPM set- in each region. If an accessed array is nottings by 1. Meanwhile, the controller over- in local memory, the card is activated tosees the response times in the system. transfer it from the remote server. To de-When the response time is above a thresh- termine what arrays are in memory dur-old, this is a sign of performance degra- ing each region’s execution, the compilerdation, and the controller immediately re- simulates a least-recently-used memoryquests the disks to switch to full power. bank.When on the other hand, the responsetimes are sufﬁciently low, this is a signthat the disks could be slowed down fur- 3.6.3. Displays. Displays are a majorther, and the controller increases the min- source of power consumption. Normally,imum RPM threshold setting. an operating system dims the display af- ter a sufﬁciently long interval with no user response. This heuristic may not coincide 3.6.2. Network Interfaces. Network inter- with a user’s intentions.faces such as wireless cards present a Two examples of more advanced tech-unique challenge for resource hibernation niques that manage displays based ontechniques because the naive solution of user activity are face-off and zoned back-shutting them off would disconnect hosts lighting. They have yet to be implementedfrom other devices with which they may in commercial systems.have to communicate. Thus power man- Face-off [Dalton and Ellis 2003] is anagement strategies also include protocols experimental face recognition system onfor synchronizing the communication of an IBM T21 Thinkpad laptop running Redthese devices with other devices in the net- Hat Linux. It periodically photographswork. Kravets et al.  have developed the monitor’s perspective and detects aa heuristic for doing this at the operat- face through large regions of skin color. Iting system level. One of its key features then controls the display using the ACPIis that it uses timers to track the idleness interface. Its developers tested it on aof devices. If a device should be transmit- long network transfer during which theting but its idle timer expires, it enters a user looked away from the screen. Face-offlistening mode or goes to sleep. To decide saved more power than the default displayhow long devices should remain idle be- manager. A drawback is the repetitivefore shifting into listening mode, Kravets polling for face detection. For futureet al. apply an adaptive threshold algo- work, the authors suggest detecting arithm. When the algorithm detects com- user’s presence through low-power heatmunication activity, it decreases the ac- microsensors. These sensors can triggerceptable hibernation time. When it detects the face recognition algorithm when theyidle periods, it increases the time. detect a user. Threshold-based schemes such as this Zoned backlighting [Flinn and Satya-one allow a network card to remain idle narayanan 1999] targets future displayfor an extended length of time before shut- systems that will allow software to selec-ting down. Hom and Kremer  have tively adjust the brightness of differentargued that compilers can shut down a regions of the screen based on usage pat-card sooner using a model of future pro- terns and battery drain. Researchers atgram behavior. Their compiler-based hi- Carnegie Mellon have proposed ideas forbernation technique targets a simpliﬁed exploiting these displays. The underlyingscenario where a mobile device’s virtual assumption is that users are interestedmemory resides on a server. Page faults re- in only a small part of the screen. Onequire activating the device’s wireless card idea is to make foreground windows withso that pages can be sent from the server keyboard and mouse focus brighter than ACM Computing Surveys, Vol. 37, No. 3, September 2005.
Power Reduction Techniques for Microprocessor Systems 221 Fig. 18. Performance versus power.back-ground windows. Another is to scale compares the power dissipation and exe-images to allow most of the screen to dim cution time of doing two additions in paral-when the battery is low. lel (as in a VLIW architecture) (a) and do- ing them in succession (b). Doing two ad- ditions in parallel activates twice as many3.7. Compiler-Level Power Management adders over a shorter period. It induces 3.7.1. What Compilers Can Do. There are higher peak resource usage but fewer exe-many ways a compiler can help reduce cution cycles and is better for performance.power regardless of whether a proces- To determine which scenario saves moresor explicitly supports software-controlled energy, one needs more information suchpower reduction. Aside from generating as the peak power increase in scenario (a),code that reconﬁgures hardware units the execution cycle increase in scenarioor activates power reduction mechanisms (b), and the total energy dissipation perthat we have seen, compilers can ap- cycle for (a) and (b). For ease of illustra-ply common performance-oriented opti- tion, we have depicted the peak power inmizations that also save energy by re- (a) to be twice that of (b), but all of theseducing the execution time, optimizations parameters may vary with respect to thesuch as Common Subexpression Elimi- architecture.nation, Partial Redundancy Elimination, Compilers can reduce memory accessesand Strength Reduction. However, some to reduce energy dissipated in these ac-performance optimizations increase code cesses. One way to reduce memory ac-size or parallelism, sometimes increas- cesses is to eliminate redundant loading resource usage and peak power dissi- and store operations. Another is to keeppation. Examples include Loop Unrolling data as close as possible to the processor,and Software Pipelining. Researchers preferably in the registers and lower-level[Givargis et al. 2001] have developed mod- caches, using aggressive register alloca-els for relating performance and power, tion techniques and optimizations improv-but these models are relative to speciﬁc ar- ing cache usage. Several loop transforma-chitectures. There is no ﬁxed relationship tions alter data traversal patterns to makebetween performance and power across all better use of the cache. Examples includearchitectures and applications. Figure 18 Loop Interchange, Loop Tiling, and LoopACM Computing Surveys, Vol. 37, No. 3, September 2005.
222 V. Venkatachalam and M. FranzFusion. The assignment of data to memory While Palm et al.  restrict theirlocations also inﬂuences how long data re- studies to local and remote compilation,mains in the cache. Though all these tech- Chen et al.  also examine the ben-niques reduce power consumption along eﬁts of executing compiled Java code re-one or more dimensions, they might be motely. Their testbed consists of a 750MHzsuboptimal along others. For example, to SPARC workstation server and a clientexploit the full bandwidth a banked mem- handheld running a microSPARC-IIepory architecture provides, one may need embedded processor. The programmer ini-to disperse data across multiple memory tially annotates a set of candidate meth-banks at the expense of cache locality and ods that the client may execute remotely.energy. Before calling any of these methods, the client decides where to compile and exe- cute the method based on computation and 3.7.2. Remote Compilation and Remote Exe- communication costs. It also selects opti-cution. A new area of research for compil- mization levels for compilation. If remoteers involves compiling and executing ap- execution is favorable, the client transfersplications jointly between mobile devices all method data to the server via the objectand more powerful servers to reduce exe- serialization interface. The server uses re-cution time and increase battery life. This ﬂection to invoke the method and serial-is an area frought with challenges such ization to send back the results. Chen et al.as the decision of what program sections evaluate different compilation and execu-to compile or execute remotely. Other is- tion strategies in this framework throughsues include application partitioning be- simulation and analytical models. Theytween multiple servers and handhelds, ap- consider seven partitioning strategies, ﬁveplication migration, and fault tolerance. of which are static and two of which areThough researchers have yet to address dynamic. While the static strategies ﬁxall these issues, a number of studies pro- the partitioning decision for all methodsvide valuable insights into the beneﬁts of of a program, the dynamic ones decide thepartitioning. best strategy for each method at call time. Recent studies of low-power Java ap- Chen et al. ﬁnd the best static strategy toplication partitioning are by Palm et al. vary depending on input size and commu- and Chen et al. . Palm et al. nication channel quality. When the qualitycompare the energy costs of remote and is low, the client’s network interface mustlocal compilation and optimization. They send data at a higher transmission powerexamine four scenarios in which a hand- to reach the server. For large inputs andheld downloads code from a server and good channel quality, static remote execu-executes it locally. In the ﬁrst two sce- tion prevails over other static strategies.narios, the handheld downloads bytecode, Overall, Chen et al. ﬁnd the highest en-compiles it, and executes it. In the re- ergy savings to be achieved by a dynamicmaining scenarios, the server compiles the strategy that decides where to compile andbytecode and the handheld downloads the execute each method.native codestream. Palm et al. ﬁnd that, The Proxy Virtual Machine [Venkat-when a method’s compilation time exceeds achalam et al. 2003] reduces the energythe execution time, remote compilation is consumption of mobile devices by combin-favorable to local compilation. However, if ing application partitioning with dynamicno optimizations are used, remote compi- optimization. The framework (Figure 19)lation can be more expensive due to the positions a powerful server infrastructure,cost of downloading code from the server. the proxy, between mobile devices and theOverall, the largest energy savings are ob- Internet. The proxy includes a just-in-tained by compiling and optimizing code time compiler and bytecode translator. Aremotely but sending results to the hand- high bandwidth, low-latency secure wire-held exactly when needed to minimize idle less connection mediates communicationtime. between the proxy and mobile devices in ACM Computing Surveys, Vol. 37, No. 3, September 2005.
Power Reduction Techniques for Microprocessor Systems 223 Fig. 19. The proxy VM framework.the vicinity. Users can request Internet ap- be ideal for specialized embedded systemsplications as they normally would through where the set of programs that will ex-a Web browser. The proxy intercepts these ecute is determined ahead of time andrequests and reissues them to a remote where execution behavior is mostly pre-Internet server. The server sends the dictable, real systems tend to be more com-proxy the application in mobile code for- plex. The events occurring within themmat. The proxy veriﬁes the application, continuously change and compete for pro-compiles it to native code, and sends part cessor and memory resources. For exam-of the generated code to the target. The re- ple, context switches abruptly shift exe-mainder of the code executes on the proxy cution from a region of one program to aitself to signiﬁcantly reduce energy con- completely different region of another pro-sumption on the mobile device. gram. If a compiler had generated code so that the two regions put a device into different power mode settings, then there 3.7.3. The Limitations of Statically Optimiz- will be an overhead to transition betweening Compilers. The drawback of compiler- these settings when execution ﬂows frombased approaches is that a compiler’s one region to another.view is usually limited to the programsit is compiling. This raises two problems.First, compilers have incomplete informa- 3.7.4. Dynamic Compilation. Dynamiction about how a program will actually be- compilation addresses some of thesehave, information such as the control ﬂow problems by introducing a feedbackpaths and loop iterations that will be exe- loop. A program is compiled but is thencuted. Hence, statically optimizing compil- also monitored as it executes. As theers often rely on proﬁling data that is col- behavior of the program changes, possiblylected from a program prior to execution, along with other changes in the runtimeto determine what optimizations should environment (e.g., resource levels), thebe applied in different program regions. program is recompiled to adapt to theseA program’s actual runtime behavior may changes. Since the program is continu-diverge from its behavior in these “simu- ously recompiled in response to runtimelation runs”, thus leading to suboptimal feedback, its code is of a signiﬁcantlydecisions. higher quality than it would have been if The second problem is that compiler op- it had been generated by a static compiler.timizations tend to treat programs as if There are a number of scenarios wherethey exist in a vacuum. While this may continuous compilation can be effective.ACM Computing Surveys, Vol. 37, No. 3, September 2005.
224 V. Venkatachalam and M. FranzFor example, as battery capacity de- different performance counter events si-creases, a continuous compiler can ap- multaneously. For each of the core compo-ply more aggressive transformations that nents of this processor Isci and Martonositrade the quality of data output for re- have developed power models that relateduced power consumption. (For example, their power consumption to their usagea compiler can replace expensive ﬂoating statistics. In their actual setup, they takepoint operations with integer operations, two kinds of power readings in parallel.or generate code that dims the display.) One is based on current measurementsHowever, there are tradeoffs. For example, and the other is based on the performancethe cost function for a dynamic compiler counters. These readings give the totalneeds to weigh the overhead of recompil- power as well as a breakdown of powering a program with the energy that can be consumed among the various components.saved. Moreover, there is the issue of how Isci and Martonosi ﬁnd that their perfor-to proﬁle an executing program in a mini- mance counter-based approach is nearlymally invasive, low overhead way. as accurate as the current measurements While there is a large body of work and is also able to distinguish between dif-in dynamic compilation for perfor- ferent phases in program execution.mance [Kistler and Franz 2001; Kistlerand Franz 2003], dynamic recompilation 3.8. Application-Level Power Managementfor power is still a young ﬁeld withmany open problems. One of the earliest Researchers have been exploring how tocommercial dynamic compilation systems give applications a larger role in poweris Transmeta’s Crusoe processor [Trans- management decisions. Most of the recentmeta Corporation 2003; Transmeta work has two goals. The ﬁrst is to de-Corporation 2001] which we will discuss velop techniques that enable applicationsin Section 3.10. One of the earliest aca- to adapt to their runtime environment.demic studies of power-aware dynamic The second is to develop interfaces allow-recompilation was done by Unnikrishnan ing applications to provide hints to loweret al. . Their idea is to instrument layers of the stack (i.e., operating systems,critical program regions with sensitiv- hardware) and likewise exposing usefulity lists that detect changes in various information from the lower layers to appli-energy budgets and hot-swap optimized cations. Although these are two separateversions of these regions with the original goals, the techniques for achieving themversions based on these changes. Each can overlap.optimized version is precomputed aheadof time, using ofﬂine energy estimation 3.8.1. Application Transformations and Adap-techniques. The authors target a banked tations. As a starting point, some re-memory architecture and focus on loop searchers [Tan et al. 2003] have adoptedtransformations such as tiling, unrolling, an “architecture-centric” view of applica-and interchange. Their implementation tions that allows for some high-level trans-is based on Dyninst, a tool that allows formations. These researchers claim thatprogram regions to be patched at runtime. an application’s architecture consists of Though dynamic compilation for power fundamental elements that comprise allis still in its infancy, there is a growing applications, namely processes, event han-body of work in runtime power monitor- dlers, and communication mechanisms.ing. The work that is most relevant for Power is consumed when these various el-dynamic compilation is that of Isci and ements interact during activities such asMartonosi . These researchers have context switches and interprocess commu-developed techniques for proﬁling the nication.power behavior of programs at runtime They propose a low-power applica-using information from hardware perfor- tion design methodology. Initially, an ap-mance counters. They target a Pentium 4 plication is represented as a softwareprocessor which allows one to sample 18 architecture graph (SAG) [Tan et al. 2003] ACM Computing Surveys, Vol. 37, No. 3, September 2005.
Power Reduction Techniques for Microprocessor Systems 225which captures how it has been built out ful. Originally, Odyssey supported four ap-of processes and events. It is then run plications, a video player, speech recog-through a simulator to measure its base nizer, map viewer, and Web browser. Theenergy consumption which is then com- video player, for instance, would down-pared to its energy consumption when load images from a server storing multi-transformations are applied to the SAG, ple copies at different compression levels.transformations that include merging pro- When bandwidth is low, it would down-cesses to reduce interprocess communica- load the compressed versions of imagestion, replacing expensive IPC mechanisms to transfer less data over the network. Asby cheaper mechanisms, and migrating battery resources become scarce, it wouldcomputations between processes. also reduce the size of the images to con- To determine the optimal set of transfor- serve energy.mations, the researchers use a greedy ap- One example of an operating systemproach. They keep applying transforma- that assists application adaptation istions that reduce energy until no more ECOSystem [Zeng et al. 2002]. It treatstransformations remain to be applied. energy as a commodity assigned to the dif-This results in an optimized version of the ferent resources in a system. The idea issame application that has a more energy to allow resources to operate so long asefﬁcient mix of processes, events, and com- they have enough of the commodity. Thusmunication mechanisms. resource management in ECOSystem re- Sachs et al.  have explored a dif- lies on the notion of “currentcy”, an ab-ferent kind of adaptation that involves straction that models the power resourcetrading the accuracy of computations for as a monetary unit. Essentially, the dif-reduced energy consumption. They pro- ferent resources in a computer (e.g., ALU,pose a video encoder that allows one to memory) have prices for using them, andvary its compression efﬁciency by selec- applications pay these prices with cur-tively skipping the motion search and dis- rentcy. Periodically, the operating systemcrete cosine transform phases of the en- distributes a ﬁxed amount of currentcycoding algorithm. The extent to which it among applications, using the desired bat-skips these phases depends on parame- tery drain rate as the basis for deciding onters chosen by the adaptation algorithm. how much to distribute. Applications canThe target architecture is a processor with use speciﬁc resources only if they can paysupport for DVS and architectural adapta- for them. As an application executes, it ex-tions (e.g., conﬁgurable caches). Two algo- pends its currentcy. The application is in-rithms work side by side. One algorithm terrupted once it depletes its currentcy.tunes the hardware parameters at thestart of each frame, while another tunesthe parameters of the video encoder while 3.8.2. Application Hints. In one of the ear-a frame is being processed. liest works on application hints, Pereira Yet another form of adaptation involves et al.  propose a software architec-reducing energy consumption by trading ture containing two layers of API calls,off ﬁdelity or the quality of data presented one allowing applications to communicateto users. This can take many forms. One with the operating system, and the otherexample of a ﬁdelity reduction system allowing the operating system to com-is Odyssey [Flinn and Satyanarayanan municate with the underlying hardware.1999]. Odyssey is an operating system The API layer that is exposed to appli-and execution framework for multime- cations allows applications to drop hintsdia and Web applications. It continuously to the operating system such as the startmonitors the resources used by applica- times, deadlines, and expected executiontions and alerts applications whose re- times of tasks. Using these hints providedsources fall below requested levels. In by the application, the operating systemturn, the applications lower their qual- will have a better understanding of theity of service until resources are plenti- possible future behavior of programs andACM Computing Surveys, Vol. 37, No. 3, September 2005.
226 V. Venkatachalam and M. Franzwill thus be able to make better DVS popular is how to develop holistic ap-decisions. proaches that integrate information from Similar to this is the work by Anand multiple levels (e.g., compiler, OS, hard-et al.  which considers a setting ware) into power management decisions.where applications adapt by deciding There are already a number of systems be-when to access different I/O devices based ing developed to explore cross-layer adap-on the relative costs of accessing these tations. We give four examples of such sys-devices. For example, it may be expen- tems.sive to access data resident on a disk Forge [Mohapatra et al. 2003] is an in-if the disk is powered down and would tegrated power management frameworktake a long time to restart. In such cases, for networked multimedia applications.an application may choose instead to ac- Forge targets a typical scenario wherecess the data through the network. To aid users request video streams for theirin these adaptations, Anand et al. pro- handheld devices and these requests arepose an API layer that gives applications ﬁltered by a remote proxy that transcodescontrol over all I/O devices but abstracts the streams and transmits them to theaway the details of these devices. The users at the most energy efﬁcient QoSAPI functions as a giant regulating dial levels.that applications can tune to specify their Forge integrates multiple levels of adap-desired power/performance tradeoffs and tations. The hardware level contains a fre-which transparently manages all the de- quency and voltage scaling interface, avices to achieve these tradeoffs. The API network card interface (allowing the net-also give applications more ﬁne-grained work card to be powered down when idle),control over devices they may want to ac- as well as interfaces allowing various ar-cess. One of its unique features is that it chitectural parameters (e.g., cache ways,allows applications to issue ghost hints: if register ﬁle sizes) to be tuned. A levelan application chose not to access a de- above hardware are the operating systemvice that was deactivated but would have and compiler, which control the architec-saved more energy by accessing that de- tural knobs, and a level above the oper-vice if it had been activated, then it can ating system is a distributed middlewaresend the device a message to this effect. frame-work, part of which resides on theAfter a few of these ghost hints, the de- handheld devices and part of which re-vice automatically becomes activated, an- sides on the remote proxy. The middlewareticipating that the application will need to on each handheld device monitors the en-use it. ergy statistics of the device (through the Heath et al.  propose similar work OS), and sends feedback to the middle-to increase opportunities for spinning ware on the proxy. The middleware on thedown idle hard disks and saving energy. In proxy uses this feedback to decide how totheir framework, a compiler performs var- regulate network trafﬁc and at what QoSious code transformations to cluster disk levels to stream requested video feeds.accesses and then inserts hints that tell Different heuristics are used for differ-the operating system how long these clus- ent adaptations. To determine the optimaltered disk accesses will take. The operat- cache parameters for a video feed, one sim-ing system is then able to serve these disk ulates the feed ofﬂine at each QoS levelrequests in a single batch and power down for all possible variations of cache size andthe disk for longer periods of time. associativity and determines the most en- ergy optimal conﬁguration. This conﬁgu- ration is automatically chosen when the3.9. Cross-Layer Adaptations feed executes. The cache conﬁguration, inBecause power consumption depends on turn, affects how long it takes to decodedecisions spanning the entire stack from each video frame. If the decoding time istransistors to applications, a research less than the frame’s deadline, then thequestion that is becoming increasingly DVS algorithm slows down the frame to ACM Computing Surveys, Vol. 37, No. 3, September 2005.
Power Reduction Techniques for Microprocessor Systems 227exactly meet the deadline. Meanwhile on Another multilayer framework for mul-the proxy, the middleware layer continu- timedia applications has been developedally receives requests for video feeds and by Fei et al. . Their framework con-information about the resource levels on sists of four layers divided into user spacevarious devices. When it receives a request and system space. The system space con-from a user on a handheld device, it checks sists of the target platform and the oper-whether there is enough remaining energy ating system running on top of it. Directlyon the device allowing it to receive the above the operating system in user spacevideo feed at an acceptable quality. If there is a middleware layer that decides how tois, then it transmits the feed at the high- apply adaptations, guided by informationest quality that does not exceed the en- it receives from the other layers. All appli-ergy constraints of the device. Otherwise, cations execute on top of the middlewareit rejects the user’s request or sends the coordinator which is responsible for decid-feed at a lower quality. When the proxy ing when to admit different applicationssends a video feedback back to the users, into the system and how to select their QoSit transmits the feed in bursts of packets, levels. When applications ﬁrst enter theallowing the users’ network cards to be system, they store their meta-informationshut down over longer idle periods to save inside the middleware layer. This includesenergy. information about the QoS levels they of- Grace [Yuan and Nahrstedt 2003] is an- fer and the energy consumption for eachother cross-layer adaptation framework of these levels. The middleware layer, inthat attempts to integrate dynamic volt- turn, monitors battery levels using coun-age scaling, power-aware task schedul- ters inside the operating system. Based oning, and QoS setting. Its target applica- the battery levels, the middleware decidestions are real-time multimedia tasks with on the most energy efﬁcient QoS settingﬁxed periods and deadlines. Grace applies for each application. It then admits appli-two levels of adaptations, global and local. cations into the system according to user-Global adaptations respond to larger re- deﬁned priorities.source variations and are applied less fre- A fourth example is the work of AbouG-quently because they are more expensive. hazaleh et al.  which explores howThese adaptations are controlled by a cen- compilers and operating systems can in-tral coordinator that continuously moni- teract to save energy. This work targetstors battery levels and application work- real-time applications with ﬁxed dead-loads and responds to widespread changes lines and worst-case execution times. Inin these variables. Local adaptations, on this approach, the compiler instrumentsthe other hand, respond to smaller work- speciﬁc program regions with code thatload variations within a task. These adap- saves the remaining worst-case executiontations are controlled by local adaptors. cycles into a register. The operating sys-There are three local adaptors, one to tem periodically reads this register andset the CPU frequency, another to sched- adjusts the speed of the processor so thatule tasks, and a third to adapt QoS pa- the task ﬁnishes by its worst-case execu-rameters. Major changes in the runtime tion time.environment, such as low battery levels 3.10. Commercial Systemsor wide variations in workload, triggerthe central coordinator to decide upon a To understand what power managementnew set of global adaptations. The coor- techniques industry is adopting, we exam-dinator chooses the most energy-optimal ine the low-power techniques used in fouradaptations by solving a constrained op- widely used processors, the Pentium 4,timization problem. It then broadcasts its Pentium M, the PXA27x, and Transmetadecision to the local adaptors which imple- Crusoe. We then discuss three power man-ment these changes but are free to adjust agement strategies by IBM, ARM, andthese initial settings within the execution National Semiconductor that are rapidlyof speciﬁc tasks. gaining in importance.ACM Computing Surveys, Vol. 37, No. 3, September 2005.
228 V. Venkatachalam and M. Franz 3.10.1. The Pentium 4 Processor. Though 3.10.2. The Pentium M Processor. Theits goal is high performance, the Pen- Pentium M is the fruit of Intel’s effortstium 4 processor also contains features to bring the Pentium 4 to the mobile Do-to manage its power consumption [Gun- main. It carefully balances performancether et al. 2001; Hinton et al. 2004]. enhancing features with several power-One of Intel’s goals in including these saving features that increase the bat-features was to prevent the Pentium’s tery lifetime [Genossar and Shamir 2003;internal temperatures from becoming Gochman et al. 2003]. It uses three maindangerously high due to the increased strategies to reduce dynamic power con-power dissipation. To meet this goal, sumption: reducing the total instructionsthe designers included a thermal de- and micro-operations executed, reducingtection and response mechanism which the switching activity in the circuit, andinhibits the processor clock whenever reducing the energy dissipated per tran-the observed temperature exceeds a safe sistor switch.threshold. To monitor temperature vari- To save energy, the Pentium M in-ations, the processor features a diode- tegrates several techniques that reducebased thermal sensor. The sensor sits the total switching activity. These includeat the hottest area of the chip and hardware for predicting idle units and in-measures temperature via voltage drops hibiting their clock signals, buses whoseacross the diode. Whenever the tempera- components are activated only when datature increases into a danger zone, the sen- needs to be transferred, and a techniquesor issues a STOPCLOCK request, caus- called execution stacking which clustersing the main clock signal to be inhibited units that perform similar functions intofrom reaching most of the processor until similar regions so that the processor canthe temperature is no longer in the dan- selectively activate the parts of the circuitger zone. This is to guarantee response that will be needed by an instruction.time while the chip is cooling. The temper- To reduce the static power dissipation,ature sensor also provides temperature the Pentium M incorporates low leakageinformation to higher levels of software transistors in the caches. To further re-through output signals (some of which are duce both dynamic and leakage energyin registers), allowing the compiler or op- throughout the processor, the Pentium Merating system, for instance, to activate supports an enhanced version of the In-other techniques in response to the high tel SpeedStep technology, which unlike itstemperatures. predecessor, allows the processor to tran- The Pentium 4 also supports the low- sition between 6 different frequency andpower operating states deﬁned by the Ad- voltage settings.vanced Conﬁguration and Power Interface(ACPI) speciﬁcation, allowing software tocontrol the processor power modes. In par- 3.10.3. The Intel PXA27x Processors. Theticular, it features a model-speciﬁc regis- Intel PXA27x processors used in manyter allowing software to inﬂuence the pro- wireless handheld devices feature thecessor clock. In addition to stopping the Intel Wireless Speedstep power man-clock, the Pentium 4 features the Intel ager [Intel Corporation 2004]. This powerBasic Speedstep Technology which allows manager uses memory boundedness astwo settings for the processor clock fre- the criterion for deciding how to managequency and voltage, a high setting for per- the processor’s power modes. It receivesformance and a low setting for power. The information from an idle proﬁler whichhigh setting is normally used when the monitors the system idle thread and a per-computer is connected to a wall outlet, and formance proﬁler that monitors the pro-the low setting is normally used when the cessor’s built-in performance counters tocomputer is running on batteries as in the gather statistics pertaining to cache andcase of a laptop. TLB misses, instructions executed, and ACM Computing Surveys, Vol. 37, No. 3, September 2005.
Power Reduction Techniques for Microprocessor Systems 229processor stalls. Using this information, device-speciﬁc parameters (e.g., availablethe power manager determines how mem- clock frequencies) and to integrate theirory bounded the workload is and decides own policies in a “plug-and-play” manner.how to adjust the processor’s power modes The main idea is to model the underly-as well as its clock frequency and voltage. ing hardware as a state machine composed of different operating points and operat- ing states. The operating points are clock 3.10.4. The Transmeta Crusoe Processor. frequency and voltage settings, while theTransmeta’s Crusoe processor [Transmeta operating states are power mode settingsCorporation 2001, 2003] saves power not such as active, idle and deep sleep. De-only by means of its LongRun Power Man- signers specify the operating points andager but also through its overall design states for their target architecture, and us-which shifts many of the complexities in ing this abstraction, they write policies forinstruction execution from hardware into transitioning between the states. IBM hassoftware. The Crusoe relies on a code mor- experimented with a wide range of poli-phing engine, a layer of software that in- cies for the PowerPC 405LP, from simpleterprets or compiles x86 programs into policies that adjust the frequency based onthe processor’s native VLIW instructions, whether the processor is idle or active, tosaving the generated code in a Transla- more complex policies that integrate infor-tion Cache so that they can be reused. mation about application workloads andThis layer of software is also responsi- deadlines.ble for monitoring executing programs forhotspots and reoptimizing code on the ﬂy.As a result, features that are typically im- 3.10.6. Powerwise and Intelligent Energyplemented in hardware (e.g., out-of-order Management. National Semiconductorexecution) are instead implemented in and ARM Inc. have collaborated tosoftware. This reduces the total on-chip develop a single end-to-end energyreal estate and its accompanying power management system for embedded pro-dissipation. To further reduce the power cessors. Their system brings together twodissipation, Crusoe’s LongRun modulates technologies, ARM’s Intelligent Energythe clock frequency and voltage according Management Software (IEM), whichto workload demands. It identiﬁes idle pe- modulates the processor speed based onriods of the operating system and scales workloads, and National Semiconductor’sthe clock frequency and voltage accord- Powerwise Adaptive Voltage Scalingingly. (AVS), which dynamically adjusts the Transmeta’s later generation Efﬁceon supply voltage for a given clock frequencyprocessor contains enhanced versions of based on variations in temperature andthe code morphing engine and the Lon- other ambient system effects.gRun Power Manager. Some news arti- The IEM software consists of a three-cles claim that the new LongRun includes level decision hierarchy of policies for de-techniques to reduce leakage, but the de- ciding how fast the processor should run.tails are proprietary. At the bottom is a baseline algorithm that selects the optimal frequency based 3.10.5. IBM Dynamic Power Management. on estimating future task workloads fromIBM has developed an extensible oper- past workloads measured over a slidingating system module [Brock and Raja- history window. At the top is an algo-mani 2003] that allows power manage- rithm more suited for interactive and me-ment policies to be developed for a wide dia applications, which considers how fastrange of embedded devices. For portability, these applications would need to run tothis module abstracts away from the un- deliver smooth performance to users. Be-derlying architecture and provides a sim- tween these two layers is an interface al-ple interface allowing designers to specify lowing applications to communicate theirACM Computing Surveys, Vol. 37, No. 3, September 2005.
230 V. Venkatachalam and M. Franzworkload requirements directly. Each of they must be discarded unless they arethese three layers combines its perfor- rechargeable, and recharging a batterymance prediction with a conﬁdence rating, can take several hours and even recharge-which indicates how to compare its deci- able batteries eventually die. Moreover,sion with that of other layers. Whenever battery technology has been advancingthere is a context switch, the IEM software slowly. Building more efﬁcient batteriesanalyzes the decisions and conﬁdence rat- still means building bigger batteries, andings of each layer to ﬁnally decide how fast bigger and heavier batteries are not suitedto run the processor. for small mobile devices. The Powerwise Adaptive Voltage Scal- Fuel cells are similar to batteries ining technology [Dhar et al. 2002] aims to that they generate electricity by meanssave more energy than conventional volt- of a chemical reaction but, unlike batter-age scaled systems by choosing the mini- ies, fuel cells can in principle supply en-mum acceptable voltage for a given clock ergy indeﬁnitely. The main componentsfrequency. Conventional DVS processors of a fuel cell are an anode, a cathode, adetermine the appropriate voltage for a membrane separating the anode from theclock frequency by referring to a table cathode, and a link to transfer the gener-that is developed ofﬂine using a worst-case ated electricity. The fuel enters the anode,model of ambient hardware characteris- where it reacts with a catalyst and splitstics. As a result, the voltages chosen are into protons and electrons. The protonsoften higher than they need to be. Power- diffuse through the membrane, while thewise gets around this problem by adding electrons are forced to travel through thea feedback loop. It continually monitors link generating electricity. When the pro-variations in temperature and other am- tons reach the cathode, they combine withbient system effects through performance Oxygen in the air to produce water andcounters and uses this information to dy- heat as by-products. If the fuel cell uses anamically adjust the supply voltage for water-diluted fuel (as some do), then thiseach clock frequency. This results in an ad- waste water can be recycled back to theditional 45% energy savings over conven- anode.tional voltage scaling. Fuel cells have a number of advantages. One advantage is that fuels (e.g., hydro-3.11. A Glimpse Into Emerging Radical gen) are abundantly available from a wide Technologies variety of natural resources, and many of- fer energy densities high enough to allowImagine a laptop that runs on hydrogen portable devices to run far longer thanfuel or a mobile phone that is powered by they do on batteries. Another advantage isthe same type of engines that propel gi- that refueling is signiﬁcantly faster thangantic aircraft. Such ideas may sound far recharging a battery. In some cases, itfetched, but they are exactly the kinds of merely involves spraying more fuel into ainnovations that some researchers claim tank. A third advantage is that there iswill eventually solve the energy problem. no limit to how many times a fuel cell canWe close our survey by brieﬂy discussing be refueled. As long as it contains fuel, ittwo of the more radical technologies that generates electricity.are likely to gain in importance over the Several companies are aggressively de-next decade. The focus of these techniques veloping fuel cell technologies for portableis on improving energy efﬁciency. The devices including Micro Fuel Cell Inc.,techniques include fuel cells and MEMS NEC, Toshiba, Medis, Panasonic, Mo-systems. torola, Samsung, and Neah Systems. Al- ready a number of prototype cells have 3.11.1. Fuel Cells. Fuel cells [Dyer 2004] emerged as well as prototype mobile com-are being developed to replace the bat- puters powered by hydrogen or alcohol-teries used in mobile devices. Batteries based fuels. Despite these rapid advances,have a limited capacity. Once depleted, some industry analysts estimate that it ACM Computing Surveys, Vol. 37, No. 3, September 2005.
Power Reduction Techniques for Microprocessor Systems 231will take another 5-10 years for fuel cells over the next four years. However, it is tooto become commonplace in mobile devices. early to tell whether it will in fact replace Fuel cells also have several drawbacks. batteries and fuel cells. One problemFirst, they can get very hot (e.g., 500- is that jet engines produce hot exhaust1000 celsius). Second, the metallic materi- gases that could raise chip temperaturesals and mechanical components that they to dangerous levels, possibly requiringare composed of can be quite expensive. new materials for a chip to withstandThird, fuel cells are ﬂammable. In par- these temperatures. Other issues includeticular, fuel-powered devices will require ﬂammability and the power dissipation ofsafety measures as well as more ﬂexible the rotating turbines.laws allowing them inside airplanes. 4. CONCLUSION 3.11.2. MEMS Systems. Microelectrical Power and energy management has grownand Mechanical Systems (MEMS) are into a multifaceted effort that brings to-miniature versions of large scale devices gether researchers from such diverse ar-commonly used to convert mechanical en- eas as physics, mechanical engineering,ergy into electrical energy. Researchers at electrical engineering, design automation,MIT and the Georgia Institute of Technol- logic and high-level synthesis, computerogy [Epstein 2004] are exploring a radical architecture, operating systems, compilerway of using them to solve the energy prob- design, and application development. Welem. They are developing prototype mil- have examined how the power problemlimeter scale versions of the gigantic gas arises and how the problem has beenturbine engines that power airplanes and addressed along multiple levels rangingdrive electrical generators. These micro- from transistors to applications. We haveengines, they claim, will give mobile com- also surveyed major commercial powerputers unprecedented amounts of unteth- management technologies and provided aered lifetime. glimpse into some emerging technologies. The microengines work using similar We conclude by noting that the ﬁeld isprinciples as their large scale counter- still active, and that researchers are con-parts. They suck air into a compressor tinually developing new algorithms andand ignite it with fuel. The compressed air heuristics along each level as well as ex-then spins a set of turbines that are con- ploring how to integrate algorithms fromnected to a generator to generate electrical multiple levels. Given the wide varietypower. The fuels used could be hydrogen, of microarchitectural and software tech-diesel based, or more energy-dense solid niques available today and the astound-fuels. ingly large number of techniques that will Made from layers of silicon wafers, these be available in the future, it is highly likelytiny engines are supposed to output the that we will overcome the limits imposedsame levels of electrical power per unit by high power consumption and continueof fuel as their large scale counterparts. to build processors offering greater levelsTheir proponents claim they have two ad- of performance and versatility. However,vantages. First, they can output far more only time will tell which approaches willpower using less fuel than fuel cells or ultimately succeed in solving the powerbatteries alone. In fact, the ones under problem.development are expected to output 10to 100 Watts of power almost effortlessly REFERENCESand keep mobile devices powered for days. ABOUGHAZALEH, N., CHILDERS, B., MOSSE, D., MEL-Moreover, as a result of their high energy HEM, R., AND CRAVEN, M. 2003. Energy man-density, these engines would require less agement for real-time embedded applications with compiler support. In Proceedings of thespace than either fuel cells or batteries. ACM SIGPLAN Conference on Languages, Com- This technology, according to re- pilers, and Tools for Embedded Systems. ACMsearchers, is likely to be commercialized Press, 284–293.ACM Computing Surveys, Vol. 37, No. 3, September 2005.
232 V. Venkatachalam and M. FranzALBONESI, D., DROPSHO, S., DWARKADAS, S., FRIEDMAN, of the ACM/SIGDA 12th International Sympo- E., HUANG, M., KURSUN, V., MAGKLIS, G., SCOTT, sium on Field Programmable Gate Arrays. ACM M., SEMERARO, G., BOSE, P., BUYUKTOSUNOGLU, A., Press, 109–117. COOK, P., AND SCHUSTER, S. 2003. Dynamically CHEN, G., KANG, B., KANDEMIR, M., VIJAYKRISHNAN, N., tuning processor resources with adaptive pro- AND IRWIN, M. 2003. Energy-aware compila- cessing. IEEE Computer Magazine 36, 12, 49– tion and execution in java-enabled mobile de- 58. vices. In Proceedings of the 17th Parallel andANAND, M., NIGHTINGALE, E., AND FLINN, J. 2004. Distributed Processing Symposium. IEEE Press, Ghosts in the machine: Interfaces for better 34a. power management. In Proceedings of the Inter- CHOI, K., SOMA, R., AND PEDRAM, M. 2004. Dynamic national Conference on Mobile Systems, Applica- voltage and frequency scaling based on work- tions, and Services. 23–35. load decomposition. In Proceedings of the Inter-AZEVEDO, A., ISSENIN, I., CORNEA, R., GUPTA, R., national Symposium on Low Power Electronics DUTT, N., VEIDENBAUM, A., AND NICOLAU, A. 2002. and Design. ACM Press, 174–179. Proﬁle-based dynamic voltage scheduling using CLEMENTS, P. C. 1996. A survey of architecture de- program checkpoints. In Proceedings of the Con- scription languages. In Proceedings of the 8th In- ference on Design, Automation and Test in Eu- ternational Workshop on Software Speciﬁcation rope. 168–175. and Design. IEEE Computer Society, 16–25.BAHAR, R. I. AND MANNE, S. 2001. Power and energy DALLY, W. J. AND TOWLES, B. 2001. Route packets, reduction via pipeline balancing. In Proceedings not wires: on-chip inteconnection networks. In of the 28th Annual International Symposium on Proceedings of the 38th Conference on Design Au- Computer Architecture. ACM Press, 218–229. tomation. ACM Press, 684–689.BANAKAR, R., STEINKE, S., LEE, B.-S., BALAKRISHNAN, M., DALTON, A. B. AND ELLIS, C. S. 2003. Sensing user AND MARWEDEL, P. 2002. Scratchpad memory: intention and context for energy management. Design alternative for cache on-chip memory in In Proceedings of the 9th Workshop on Hot Topics embedded systems. In Proceedings of the 10th In- in Operating Systems. 151–156. ternational Symposium on Hardware/Software Codesign. 73–78. DE, V. AND BORKAR, S. 1999. Technology and de- sign challenges for low power and high perfor-BANERJEE, K. AND MEHROTRA, A. 2001. Global in- mance. In Proceedings of the International Sym- terconnect warming. IEEE Circuits and Devices posium on Low Power Electronics and Design Magazine 17, 5, 16–32. ISLPED’99 . ACM Press, 163–168.BISHOP, B. AND IRWIN, M. J. 1999. Databus charge DHAR, S., MAKSIMOVIC, D., AND KRANZEN, B. 2002. recovery: practical considerations. In Proceed- Closed-loop adaptive voltage scaling controller ings of the International Symposium on Low for standard-cell asics. In Proceedings of the In- Power Electronics and Design. 85–87. ternational Symposium on Low Power Electron-BROCK, B. AND RAJAMANI, K. 2003. Dynamic power ics and Design ISLPED’02. ACM Press, 103– management for embedded systems. In Proceed- 107. ings of the IEEE International SOC Conference. DINIZ, P. C. 2003. A compiler approach to perfor- 416–419. mance prediction using empirical-based model-BUTTAZZO, G. C. 2002. Scalable applications for ing. In International Conference On Computa- energy-aware processors. In Proceedings of the tional Science. 916–925. 2nd International Conference On Embedded DROPSHO, S., BUYUKTOSUNOGLU, A., BALASUBRAMONIAN, Software. Springer-Verlag, 153–165. R., ALBONESI, D. H., DWARKADAS, S., SEMERARO, G.,BUTTS, J. A. AND SOHI, G. S. 2000. A static power MAGKLIS, G., AND SCOTT, M. L. 2002. Integrat- model for architects. In Proceedings of the 33rd ing adaptive on-chip storage structures for re- Annual ACM/IEEE International Symposium duced dynamic power. In Proceedings of the In- on Microarchitecture. Monterey, CA. 191–201. ternational Conference on Parallel ArchitecturesBUYUKTOSUNOGLU, A., SCHUSTER, S., BROOKS, D., BOSE, and Compilation Techniques. IEEE Computer P., COOK, P. W., AND ALBONESI, D. 2001. An Society, 141–152. adaptive issue queue for reduced power at high DROPSHO, S., SEMERARO, G., ALBONESI, D. H., MAGKLIS, performance. In Proceedings of the 1st Interna- G., AND SCOTT, M. L. 2004. Dynamically trad- tional Workshop on Power-Aware Computer Sys- ing frequency for complexity in a gals micropro- tems. 25–39. cessor. In Proceedings of the 37th InternationalCALHOUN, B. H., HONORE, F. A., AND CHANDRAKASAN, Symposium on Microarchitecture. IEEE Com- A. 2003. Design methodology for ﬁne-grained puter Society, 157–168. leakage control in MTCMOS. In Proceedings DUDANI, A., MUELLER, F., AND ZHU, Y. 2002. Energy- of the International Symposium on Low Power conserving feedback EDF scheduling for embed- Electronics and Design. ACM Press, 104–109. ded systems with real-time constraints. In Pro-CHEN, D., CONG, J., LI, F., AND HE, L. 2004. Low- ceedings of the Joint Conference on Languages, power technology mapping for FPGA architec- Compilers, and Tools for Embedded Systems. tures with dual supply voltages. In Proceedings ACM Press, 213–222. ACM Computing Surveys, Vol. 37, No. 3, September 2005.
Power Reduction Techniques for Microprocessor Systems 233DYER, C. 2004. Fuel cells and portable electronics. GIVARGIS, T., VAHID, F., AND HENKEL, J. 2001. In Symposium On VLSI Circuits Digest of Tech- System-level exploration for pareto-optimal con- nical Papers (2004). 124–127. ﬁgurations in parameterized systems-on-a-chip.EBERGEN, J., GAINSLEY, J., AND CUNNINGHAM, P. 2004. In Proceedings of the IEEE/ACM International Transistor sizing: How to control the speed and Conference on Computer-Aided Design. IEEE energy consumption of a circuit. In the 10th Press, 25–30. International Symposium on Asynchronous Cir- GOCHMAN, S., RONEN, R., ANATI, I., BERKOVITS, A., cuits and Systems. 51–61. KURTS, T., NAVEH, A., SAEED, A., SPERBER, Z., ANDEPSTEIN, A. 2004. Millimeter-scale, micro- VALENTINE, R. 2003. The Intel Pentium M pro- electromechanical systems gas turbine engines. cessor: Microarchitecture and performance. In- J. Eng. Gas Turb. Power 126, 205–226. tel. Tech. J. 7, 2, 21–59.ERNST, D., KIM, N., DAS, S., PANT, S., RAO, R., PHAM, GOMORY, R. E. AND HU, T. C. 1961. Multi-terminal T., ZIESLER, C., BLAAUW, D., AUSTIN, T., FLAUT- network ﬂows. J. SIAM 9, 4, 551–569. NER, K., AND MUDGE, T. 2003. Razor: A low- GOVIL, K., CHAN, E., AND WASSERMAN, H. 1995. Com- power pipeline based on circuit-level timing paring algorithms for dynamic speed-setting of a speculation. In Proceedings of the 36th Annual low-power CPU. In Proceedings of the 1st Annual IEEE/ACM International Symposium on Mi- International Conference on Mobile Computing croarchitecture. IEEE Computer Society, 7–18. and Networking. ACM Press, 13–25.FAN, X., ELLIS, C. S., AND LEBECK, A. R. 2003. Syn- GRUIAN, F. 2001. Hard real-time scheduling for ergy between power-aware memory systems and low-energy using stochastic data and DVS pro- processor voltage scaling. In Proceedings of the cessors. In Proceedings of the International Sym- Workshop on Power-Aware Computer Systems. posium on Low Power Electronics and Design. 164–179. ACM Press, 46–51.FEI, Y., ZHONG, L., AND JHA, N. 2004. An energy- GUNTHER, S., BINNS, F., CARMEAN, D., AND HALL, J. aware framework for coordinated dynamic soft- 2001. Managing the impact of increasing mi- ware management in mobile computers. In Pro- croprocessor power consumption. Intel Tech. J. ceedings of the IEEE Computer Society’s 12th GURUMURTHI, S., SIVASUBRAMANIAM, A., KANDEMIR, M., Annual International Symposium on Model- AND FRANKE, H. 2003. DRPM: Dynamic speed ing, Analysis, and Simulation of Computer and control for power management in server class Telecommunications Systems. 306–317. disks. SIGARCH Comput. Architect. News 31, 2,FERDINAND, C. 1997. Cache behavior prediction for 169–181. ¨ real time systems. Ph.D. thesis, Universitat des HEATH, T., PINHEIRO, E., HOM, J., KREMER, U., Saarlandes. AND BIANCHINI, R. 2004. Code transformationsFLAUTNER, K., REINHARDT, S., AND MUDGE, T. 2001. for energy-efﬁcient device management. IEEE Automatic performance setting for dynamic volt- Trans. Comput. 53, 8, 974–987. age scaling. In Proceedings of the 7th Annual HINTON, G., SAGER, D., UPTON, M., BOGGS, D., CARMEAN, International Conference on Mobile Computing D., KYKER, A., AND ROUSSEL, P. 2004. The mi- and Networking. ACM Press, 260–271. croarchitecture of the Intel Pentium 4 processorFLINN, J. AND SATYANARAYANAN, M. 1999. Energy- on 90nm technology. Intel Tech. J. 8, 1, 1–17. aware adaptation for mobile applications. In HO, Y.-T. AND HWANG, T.-T. 2004. Low power de- Proceedings of the 17th ACM Symposium on Op- sign using dual threshold voltage. In Proceed- erating Systems Principles. ACM Press, 48–63. ings of the Conference on Asia South Paciﬁc De-FOLEGNANI, D. AND GONZALEZ, A. 2001. Energy- sign Automation IEEE Press, (Piscataway, NJ,) effective issue logic. In Proceedings of the 28th 205–208. Annual International Symposium on Computer HOM, J. AND KREMER, U. 2003. Energy manage- Architecture. ACM Press, 230–239. ment of virtual memory on diskless devices.GAO, F. AND HAYES, J. P. 2003. ILP-based optimiza- In Compilers and Operatings Systems for Low tion of sequential circuits for low power. In Pro- Power. Kluwer Academic Publishers, Norwell, ceedings of the International Symposium on Low MA. 95–113. Power Electronics and Design. ACM Press, 140– HOSSAIN, R., ZHENG, M. AND ALBICKI, A. 1996. Re- 145. ducing power dissipation in CMOS circuits byGENOSSAR, D. AND SHAMIR, N. 2003. Intel Pentium signal probability based transistor reording. In M processor power estimation, budgeting, opti- IEEE Trans. Comput.-Aided Design Integrated mization, and validation. Intel. Tech. J. 7, 2, 44– Circuits Syst. 15, 3, 361–368. 49. HSU, C. AND FENG, W. 2004. Effective dynamicGHOSE, K. AND KAMBLE, M. B. 1999. Reducing voltage scaling through cpu-boundedness detec- power in superscalar processor caches using sub- tion. In Workshop on Power Aware Computing banking, multiple line buffers, and bit-line seg- Systems. mentation. In Proceedings of the International HSU, C.-H. AND KREMER, U. 2003. The design, Symposium on Low Power Electronics and De- implementation, and evaluation of a com- sign. ACM Press, 70–75. piler algorithm for CPU energy reduction. InACM Computing Surveys, Vol. 37, No. 3, September 2005.
234 V. Venkatachalam and M. Franz Proceedings of the ACM SIGPLAN Conference on ings of the Conference on Design, Automation Programming Language Design and Implemen- and Test in Europe. IEEE Press, 190–196. tation. ACM Press, 38–48. IYER, A. AND MARCULESCU, D. 2002a.HU, J., IRWIN, M., VIJAYKRISHNAN, N., AND KANDEMIR, M. Microarchitecture-level power management. 2002. Selective trace cache: A low power and IEEE Trans. VLSI Syst. 10, 3, 230–239. high performance fetch mechanism. Tech. Rep. IYER, A. AND MARCULESCU, D. 2002b. Power ef- CSE-02-016, Department of Computer Science ﬁciency of voltage scaling in multiple clock, and Engineering, The Pennsylvania State Uni- multiple voltage cores. In Proceedings of versity (Oct.). the IEEE/ACM International Conference onHU, J., VIJAYKRISHNAN, N., IRWIN, M., AND KANDEMIR, Computer-Aided Design. ACM Press, 379–386. M. 2003. Using dynamic branch behavior for JONE, W.-B., WANG, J. S., LU, H.-I., HSU, I. P., AND power-efﬁcient instruction fetch. In Proceedings CHEN, J.-Y. 2003. Design theory and imple- of the IEEE Computer Society Annual Sympo- mentation for low-power segmented bus sys- sium on VLSI. 127–132. tems. ACM Trans. Design Autom. Electr. Syst.HU, J. S., NADGIR, A., VIJAYKRISHNAN, N., IRWIN, M. J., 8, 1, 38–54. AND KANDEMIR, M. 2003. Exploiting program KABADI, M., KANNAN, N., CHIDAMBARAM, P., NARAYANAN, hotspots and code sequentiality for instruction S., SUBRAMANIAN, M., AND PARTHASARATHI, R. cache leakage management. In Proceedings of 2002. Dead-block elimination in cache: A the International Symposium on Low Power mechanism to reduce i-cache power consump- Electronics and Design. ACM Press, 402–407. tion in high performance microprocessors. InHUANG, M., RENAU, J., YOO, S.-M., AND TORRELLAS, J. Proceedings of the International Conference on 2000. A framework for dynamic energy efﬁ- High Performance Computing. Springer Verlag, ciency and temperature management. In Pro- 79–88. ceedings of the 33rd Annual ACM/IEEE Inter- KANDEMIR, M., RAMANUJAM, J., AND CHOUDHARY, A. national Symposium on Microarchitecture. ACM 2002. Exploiting shared scratch pad memory Press, 202–213. space in embedded multiprocessor systems. InHUANG, M. C., RENAU, J., AND TORRELLAS, J. 2003. Proceedings of the 39th Conference on Design Au- Positional adaptation of processors: application tomation. ACM Press, 219–224. to energy reduction. In Proceedings of the 30th KAXIRAS, S., HU, Z., AND MARTONOSI, M. 2001. Cache Annual International Symposium on Computer decay: exploiting generational behavior to re- Architecture ISCA ’03. ACM Press, 157–168. duce cache leakage power. In Proceedings of theHUGHES, C. J. AND ADVE, S. V. 2004. A formal ap- 28th Annual International Symposium on Com- proach to frequent energy adaptations for mul- puter Architecture. ACM Press, 240–251. timedia applications. In Proceedings of the 31st KIM, C. AND ROY, K. 2002. Dynamic Vth scaling Annual International Symposium on Computer scheme for active leakage power reduction. In Architecture (ISCA’04). IEEE Computer Society, Proceedings of the Design, Automation and Test 138. in Europe Conference and Exhibition. IEEEHUGHES, C. J., SRINIVASAN, J., AND ADVE, S. V. 2001. Computer Society, 0163–0167. Saving energy with architectural and frequency KIM, N. S., FLAUTNER, K., BLAAUW, D., AND MUDGE, adaptations for multimedia applications. In Pro- T. 2002. Drowsy instruction caches: leakage ceedings of the 34th Annual ACM/IEEE In- power reduction using dynamic voltage scaling ternational Symposium on Microarchitecture and cache sub-bank prediction. In Proceedings of (MICRO’34). IEEE Computer Society, 250– the 35th Annual ACM/IEEE International Sym- 261. posium on Microarchitecture. IEEE ComputerIntel Corporation. 2004. Wireless Intel Speedstep Society Press, 219–230. Power Manager. Intel Corporation. KIN, J., GUPTA, M., AND MANGIONE-SMITH, W. H. 1997.ISCI, C. AND MARTONOSI, M. 2003. Runtime power The ﬁlter cache: An energy efﬁcient memory monitoring in high-end processors: Methodology structure. In Proceedings of the 30th Annual and empirical data. In Proceedings of the 36th ACM/IEEE International Symposium on Mi- Annual IEEE/ACM International Symposium croarchitecture. IEEE Computer Society, 184– on Microarchitecture (MICRO’36). IEEE Com- 193. puter Society, 93–104. KISTLER, T. AND FRANZ, M. 2001. Continuous pro-ISHIHARA, T. AND YASUURA, H. 1998. Voltage gram optimization: design and evaluation. IEEE scheduling problem for dynamically variable Trans. Comput. 50, 6, 549–566. voltage processors. In Proceedings of the Inter- KISTLER, T. AND FRANZ, M. 2003. Continuous pro- national Symposium on Low Power Electronics gram optimization: A case study. ACM Trans. and Design. ACM Press, 197–202. Program. Lang. Syst. 25, 4, 500–548.ITRSRoadmap. The International Technology KOBAYASHI AND SAKURAI. 1994. Self-adjusting Roadmap for Semiconductors. Available at threshold voltage scheme (SATS) for low- http://public.itrs.net. voltage high-speed operation. In Proceedings ofIYER, A. AND MARCULESCU, D. 2001. Power aware the IEEE Custom Integrated Circuits Confer- microarchitecture resource scaling. In Proceed- ence. 271–274. ACM Computing Surveys, Vol. 37, No. 3, September 2005.
Power Reduction Techniques for Microprocessor Systems 235KONDO, M. AND NAKAMURA, H. 2004. Dynamic pro- tional Conference on Measurement and Modeling cessor throttling for power efﬁcient computa- of Computer Systems. ACM Press, 50–61. tions. In Workshop on Power Aware Computing LUZ, V. D. L., KANDEMIR, M., AND KOLCU, I. 2002. Systems. Automatic data migration for reducing energyKONG, B., KIM, S., AND JUN, Y. 2001. Conditional- consumption in multi-bank memory systems. In capture ﬂip-ﬂop for statistical power reduc- Proceedings of the 39th Conference on Design Au- tion. IEEE J. Solid State Circuits 36, 8, 1263– tomation. ACM Press, 213–218. 1271. LYUBOSLAVSKY, V., BISHOP, B., NARAYANAN, V., AND IR-KRANE, R., PARSONS, J., AND BAR-COHEN, A. 1988. WIN, M. J. 2000. Design of databus charge re- Design of a candidate thermal control system covery mechanism. In Proceedings of the Inter- for a cryogenically cooled computer. IEEE Trans. national Conference on ASIC. ACM Press, 283– Components, Hybrids, Manufact. Techn. 11, 4, 287. 545–556. MAGKLIS, G., SCOTT, M. L., SEMERARO, G., ALBONESI,KRAVETS, R. AND KRISHNAN, P. 1998. Power manage- D. H., AND DROPSHO, S. 2003. Proﬁle-based dy- ment techniques for mobile communication. In namic voltage and frequency scaling for a mul- Proceedings of the 4th Annual ACM/IEEE Inter- tiple clock domain microprocessor. In Proceed- national Conference on Mobile Computing and ings of the 30th Annual International Sympo- Networking. ACM Press, 157–168. sium on Computer Architecture. ACM Press, 14–KURSUN, E., GHIASI, S., AND SARRAFZADEH, M. 2004. 27. Transistor level budgeting for power optimiza- MAGKLIS, G., SEMERARO, G., ALBONESI, D., DROPSHO, S., tion. In Proceedings of the 5th International DWARKADAS, S., AND SCOTT, M. 2003. Dynamic Symposium on Quality Electronic Design. 116– frequency and voltage scaling for a multiple- 121. clock-domain microprocessor. IEEE Micro 23, 6,LEBECK, A. R., FAN, X., ZENG, H., AND ELLIS, C. 2000. 62–68. Power aware page allocation. In Proceedings of MARCULESCU, D. 2004. Application adaptive en- the 9th International Conference on Architec- ergy efﬁcient clustered architectures. In Pro- tural Support for Programming Languages and ceedings of the International Symposium on Low Operating Systems (2000). ACM Press, 105–116. Power Electronics and Design. 344–349.LEE, S. AND SAKURAI, T. 2000. Run-time voltage MARTIN, T. AND SIEWIOREK, D. 2001. Nonideal bat- hopping for low-power real-time systems. In Pro- tery and main memory effects on cpu speed- ceedings of the 37th Conference on Design Au- setting for low power. IEEE Tran. (VLSI) Syst. 9, tomation. ACM Press, 806–809. 1, 29–34.LI, H., KATKOORI, S., AND MAK, W.-K. 2004. Power MENG. Y., SHERWOOD, T., AND KASTNER, R. 2005. Ex- minimization algorithms for LUT-based FPGA ploring the limits of leakage power reduction in technology mapping. ACM Trans. Design Autom. caches. ACM Trans. Architecture Code Optimiz., Electr. Syst. 9, 1, 33–51. 1, 221–246.LI, X., LI, Z., DAVID, F., ZHOU, P., ZHOU, Y., ADVE, S., MOHAPATRA, S., CORNEA, R., DUTT, N., NICOLAU, A., AND KUMAR, S. 2004. Performance directed en- AND VENKATASUBRAMANIAN, N. 2003. Integrated ergy management for main memory and disks. power management for video streaming to mo- In Proceedings of the 11th International Confer- bile handheld devices. In Proceedings of the 11th ence on Architectural Support for Programming ACM International Conference on Multimedia. Languages and Operating Systems (ASPLOS- ACM Press, 582–591. XI). ACM Press, New York, NY, 271–283. NEDOVIC, N., ALEKSIC, M., AND OKLOBDZIJA, V. 2001.LIM, S.-S., BAE, Y. H., JANG, G. T., RHEE, B.-D., MIN, Conditional techniques for low power consump- S. L., PARK, C. Y., SHIN, H., PARK, K., MOON, S.- tion ﬂip-ﬂops. In Proceedings of the 8th Interna- M., AND KIM, C. S. 1995. An accurate worst tional Conference on Electronics, Circuits, and case timing analysis for RISC processors. IEEE Systems. 803–806. Trans. Softw. Eng. 21, 7, 593–604. NILSEN, K. D. AND RYGG, B. 1995. Worst-case exe-LIU, M., WANG, W.-S., AND ORSHANSKY, M. 2004. cution time analysis on modern processors. In Leakage power reduction by dual-vth designs Proceedings of the ACM SIGPLAN Workshop on under probabilistic analysis of vth variation. In Languages, Compilers and Tools for Real-Time Proceedings of the International Symposium on Systems. ACM Press, 20–30. Low Power Electronics and Design ACM Press, PALM, J., LEE, H., DIWAN, A., AND MOSS, J. E. B. 2002. New York, NY, 2–7. When to use a compilation service? In Proceed-LLOPIS, R. AND SACHDEV, M. 1996. Low power, ings of the Joint Conference on Languages, Com- testable dual edge triggered ﬂip-ﬂops. In Pro- pilers, and Tools for Embedded Systems. ACM ceedings of the International Symposium on Low Press, 194–203. Power Electronics and Design. IEEE Press, 341– PANDA, P. R., DUTT, N. D., AND NICOLAU, A. 2000. On- 345. chip vs. off-chip memory: the data partitioningLORCH, J. AND SMITH, A. 2001. Improving dynamic problem in embedded processor-based systems. voltage scaling algorithms with PACE. In Pro- ACM Trans. Design Autom. Electr. Syst. 5, 3, ceedings of the ACM SIGMETRICS Interna- 682–704.ACM Computing Surveys, Vol. 37, No. 3, September 2005.
236 V. Venkatachalam and M. FranzPAPATHANASIOU, A. E. AND SCOTT, M. L. 2002. In- 2002. Energy-efﬁcient processor design using creasing disk burstiness for energy efﬁciency. multiple clock domains with dynamic voltage Technical Report 792 (November), Department and frequency scaling. In Proceedings of the 8th of Computer Science, University of Rochester International Symposium on High-Performance (Nov.). Computer Architecture. IEEE Computer Society,PATEL, K. N. AND MARKOV, I. L. 2003. Error- 29–40. correction and crosstalk avoidance in dsm SETA, K., HARA, H., KURODA, T., KAKUMU, M., AND SAKU- busses. In Proceedings of the International Work- RAI, T. 1995. 50% active-power saving without shop on System-Level Interconnect Prediction. speed degradation using standby power reduc- ACM Press, 9–14. tion (SPR) circuit. In Proceedings of the IEEE In-PENZES, P., NYSTROM, M., AND MARTIN, A. 2002. ternational Solid-State Conference. IEEE Press, Transistor sizing of energy-delay-efﬁcient 318–319. circuits. Tech. Rep. 2002003, Department SGROI, M., SHEETS, M., MIHAL, A., KEUTZER, K., MA- of Computer Science, California Institute of LIK, S., RABAEY, J., AND SANGIOVANNI-VENCENTELLI, Technology. A. 2001. Addressing the system-on-a-chip in-PEREIRA, C., GUPTA, R., AND SRIVASTAVA, M. 2002. terconnect woes through communication-based PASA: A software architecture for building design. In Proceedings of the 38th Conference on power aware embedded systems. In Proceedings Design Automation. ACM Press, 667–672. of the IEEE CAS Workshop on Wireless Commu- SHIN, D., KIM, J., AND LEE, S. 2001. Low-energy nication and Networking. intra-task voltage scheduling using static timingPOLLACK, F. 1999. New microarchitecture chal- analysis. In Proceedings of the 38th Conference lenges in the coming generations of CMOS pro- on Design Automation. ACM Press, 438–443. cess technologies. International Symposium on STAN, M. AND BURLESON, W. 1995. Bus-invert cod- Microarchitecture. ing for low-power i/o. IEEE Trans. VLSI, 49–58.PONOMAREV, D., KUCUK, G., AND GHOSE, K. 2001. STANLEY-MARBELL, P., HSIAO, M., AND KREMER, U. Reducing power requirements of instruction 2002. A Hardware Architecture for Dynamic scheduling through dynamic allocation of mul- Performance and Energy Adaptation. In Pro- tiple datapath resources. In Proceedings of the ceedings of the Workshop on Power-Aware Com- 34th Annual ACM/IEEE International Sympo- puter Systems. 33–52. sium on Microarchitecture. IEEE Computer So- STROLLO, A., NAPOLI, E., AND CARO, D. D. 2000. New ciety, 90–101. clock-gating techniques for low-power ﬂip-ﬂops.POWELL, M., YANG, S.-H., FALSAFI, B., ROY, K., AND In Proceedings of the International Symposium VIJAYKUMAR, T. N. 2001. Reducing leakage in on Low Power Electronics and Design. ACM a high-performance deep-submicron instruction Press, 114–119. cache. IEEE Trans. VLSI Syst. 9, 1, 77–90. SULTANIA, A., SYLVESTER, D., AND SAPATNEKAR, S. 2004.RUTENBAR, R. A., CARLEY, L. R., ZAFALON, R., AND DRAG- Transistor and pin reordering for gate oxide ONE, N. 2001. Low-power technology mapping leakage reduction in dual Tox circuits. In IEEE for mixed-swing logic. In Proceedings of the Inter- International Conference on Computer Design. national Symposium on Low Power Electronics 228–233. and Design. ACM Press, 291–294. SYLVESTER, D. AND KEUTZER, K. 1998. Getting toSACHS, D., ADVE, S., AND JONES, D. 2003. Cross- the bottom of deep submicron. In Proceedings layer adaptive video coding to reduce energy on of the IEEE/ACM International Conference on general purpose processors. In Proceedings of the Computer-Aided Design. ACM Press, 203–211. International Conference on Image Processing. TAN, T., RAGHUNATHAN, A., AND JHA, N. 2003. Soft- 109–112. ware architectural transformations: A new ap-SASANKA, R., HUGHES, C. J., AND ADVE, S. V. 2002. proach to low energy embedded software. In Joint local and global hardware adaptations Design, Automation and Test in Europe. 1046– for energy. SIGARCH Computer Architecture 1051. News 30, 5, 144–155. TAYLOR, C. N., DEY, S., AND ZHAO, Y. 2001. ModelingSCHMIDT, R. AND NOTOHARDJONO, B. 2002. High-end and minimization of interconnect energy dissi- server low-temperature cooling. IBM J. Res. De- pation in nanometer technologies. In Proceed- vel. 46, 6, 739–751. ings of the 38th Conference on Design Automa-SEMERARO, G., ALBONESI, D. H., DROPSHO, S. G., MAGK- tion. ACM Press, 754–757. LIS, G., DWARKADAS, S., AND SCOTT, M. L. 2002. Transmeta Corporation. 2001. LongRun Power Dynamic frequency and voltage control for a Management: Dynamic Power Management for multiple clock domain microarchitecture. In Pro- Crusoe Processors. Transmeta Corporation. ceedings of the 35th Annual ACM/IEEE Interna- Transmeta Corporation. 2003. Crusoe Processor tional Symposium on Microarchitecture. IEEE Product Brief: Model TM5800. Transmeta Cor- Computer Society Press, 356–367. poration.SEMERARO, G., MAGKLIS, G., BALASUBRAMONIAN, R., AL- TURING, A. 1937. Computability and lambda- BONESI, D. H., DWARKADAS, S., AND SCOTT, M. L. deﬁnability. J. Symbolic Logic 2, 4, 153–163. ACM Computing Surveys, Vol. 37, No. 3, September 2005.
Power Reduction Techniques for Microprocessor Systems 237UNNIKRISHNAN, P., CHEN, G., KANDEMIR, M., AND computing. In Proceedings of the 2003 Interna- MUDGETT, D. R. 2002. Dynamic compilation tional Symposium on Low Power Electronics and for energy adaptation. In Proceedings of Design. ACM Press, 110–115. the IEEE/ACM International Conference on YUAN, W. AND NAHRSTEDT, K. 2003. Energy-efﬁcient Computer-Aided Design. ACM Press, 158– soft real-time cpu scheduling for mobile multi- 163. media systems. In Proceedings of the 19th ACMVENKATACHALAM, V., WANG, L., GAL, A., PROBST, C., Symposium on Operating Systems Principles AND FRANZ, M. 2003. Proxyvm: A network- (SOSP’03). 149–163. based compilation infrastructure for resource- ZENG, H., ELLIS, C. S., LEBECK, A. R., AND VAHDAT, A. constrained devices. Technical Report 03-13, 2002. Ecosystem: managing energy as a ﬁrst University of California, Irvine. class operating system resource. In ProceedingsVICTOR, B. AND KEUTZER, K. 2001. Bus encoding of the 10th International Conference on Archi- to prevent crosstalk delay. In Proceedings of tectural Support for Programming Languages the IEEE/ACM International Conference on and Operating Systems. ACM Press, 123– Computer-Aided Design. IEEE Press, 57–63. 132.WANG, H., PEH, L.-S., AND MALIK, S. 2003. Power- ZHANG, H. AND RABAEY, J. 1998. Low-swing inter- driven design of router microarchitectures in on- connect interface circuits. In Proceedings of the chip networks. In Proceedings of the 36th An- International Symposium on Low Power Elec- nual IEEE/ACM International Symposium on tronics and Design. ACM Press, 161–166. Microarchitecture (MICRO’36). IEEE Computer ZHANG, H., WAN, M., GEORGE, V., AND RABAEY, J. 2001. Society, 105–116. Interconnect architecture exploration for low-WEI, L., CHEN, Z., JOHNSON, M., ROY, K., AND DE, V. energy reconﬁgurable single-chip dsps. In Pro- 1998. Design and optimization of low voltage ceedings of the International Symposium on Sys- high performance dual threshold CMOS circuits. tems Synthesis. ACM Press, 33–38. In Proceedings of the 35th Annual Conference on ZHANG, W., HU, J. S., DEGALAHAL, V., KANDEMIR, Design Automation. ACM Press, 489–494. M., VIJAYKRISHNAN, N., AND IRWIN, M. J. 2002.WEISER, M., WELCH, B., DEMERS, A. J., AND SHENKER, S. Compiler-directed instruction cache leakage op- 1994. Scheduling for reduced CPU energy. In timization. In Proceedings of the 35th Annual Proceedings of the 1st USENIX Symposium on ACM/IEEE International Symposium on Mi- Operating Systems Design and Implementation. croarchitecture. IEEE Computer Society Press, 13–23. 208–218.WEISSEL, A. AND BELLOSA, F. 2002. Process cruise ZHANG, W., KARAKOY, M., KANDEMIR, M., AND CHEN, G. control: event-driven clock scaling for dynamic 2003. A compiler approach for reducing data power management. In Proceedings of the In- cache energy. In Proceedings of the 17th An- ternational Conference on Compilers, Architec- nual International Conference on Supercomput- ture, and Synthesis for Embedded Systems. ACM ing. ACM Press, 76–85. Press, 238–246. ZHAO, P., DARWSIH, T., AND BAYOUMI, M. 2004. High-WON, H.-S., KIM, K.-S., JEONG, K.-O., PARK, K.-T., CHOI, performance and low-power conditional dis- K.-M., AND KONG, J.-T. 2003. An MTCMOS de- charge ﬂip-ﬂop. IEEE Trans. VLSI Syst. 12, 5, sign methodology and its application to mobile 477–484.Received December 2003; revised February and June 2005; accepted September 2005ACM Computing Surveys, Vol. 37, No. 3, September 2005.