Power Reduction Techniques For Microprocessor SystemsVASANTH VENKATACHALAM AND MICHAEL FRANZUniversity of California, Irvine Power consumption is a major factor that limits the performance of computers. We survey the “state of the art” in techniques that reduce the total power consumed by a microprocessor system over time. These techniques are applied at various levels ranging from circuits to architectures, architectures to system software, and system software to applications. They also include holistic approaches that will become more important over the next decade. We conclude that power management is a multifaceted discipline that is continually expanding with new techniques being developed at every level. These techniques may eventually allow computers to break through the “power wall” and achieve unprecedented levels of performance, versatility, and reliability. Yet it remains too early to tell which techniques will ultimately solve the power problem. Categories and Subject Descriptors: C.5.3 [Computer System Implementation]: Microcomputers—Microprocessors; D.2.10 [Software Engineering]: Design— Methodologies; I.m [Computing Methodologies]: Miscellaneous General Terms: Algorithms, Design, Experimentation, Management, Measurement, Performance Additional Key Words and Phrases: Energy dissipation, power reduction1. INTRODUCTION of power; so much power, in fact, that their power densities and concomitantComputer scientists have always tried to heat generation are rapidly approachingimprove the performance of computers. levels comparable to nuclear reactorsBut although today’s computers are much (Figure 1). These high power densitiesfaster and far more versatile than their impair chip reliability and life expectancy,predecessors, they also consume a lot increase cooling costs, and, for largeParts of this effort have been sponsored by the National Science Foundation under ITR grant CCR-0205712and by the Ofﬁce of Naval Research under grant N00014-01-1-0854.Any opinions, ﬁndings, and conclusions or recommendations expressed in this material are those of theauthors and should not be interpreted as necessarily representing the ofﬁcial views, policies or endorsements,either expressed or implied, of the National Science foundation (NSF), the Ofﬁce of Naval Research (ONR),or any other agency of the U.S. Government.The authors also gratefully acknowledge gifts from Intel, Microsoft Research, and Sun Microsystems thatpartially supported this work.Authors’ addresses: Vasanth Venkatachalam, School of Information and Computer Science, University ofCalifornia at Irvine, Irvine, CA 92697-3425; email: email@example.com; Michael Franz, School of Informationand Computer Science, University of California at Irvine, Irvine, CA 92697-3425; email: firstname.lastname@example.org.Permission to make digital or hard copies of part or all of this work for personal or classroom use is grantedwithout fee provided that copies are not made or distributed for proﬁt or direct commercial advantage andthat copies show this notice on the ﬁrst page or initial screen of a display along with the full citation.Copyrights for components of this work owned by others than ACM must be honored. Abstracting withcredit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use anycomponent of this work in other works requires prior speciﬁc permission and/or a fee. Permissions may berequested from Publications Dept., ACM, Inc., 1515 Broadway, New York, NY 10036 USA, fax: +1 (212)869-0481, or email@example.com. c 2005 ACM 0360-0300/05/0900-0195 $5.00 ACM Computing Surveys, Vol. 37, No. 3, September 2005, pp. 195–237.
196 V. Venkatachalam and M. Franz the power problem. Finally, we will dis- cuss some commercial power management systems (Section 3.10) and provide a glimpse into some more radical technolo- gies that are emerging (Section 3.11). 2. DEFINING POWER Power and energy are commonly deﬁned in terms of the work that a system per- forms. Energy is the total amount of work a system performs over a period of time,Fig. 1. Power densities rising. Figure adapted from while power is the rate at which the sys-Pollack 1999. tem performs that work. In formal terms,data centers, even raise environmental P = W/T (1)concerns. At the other end of the performancespectrum, power issues also pose prob- E = P ∗ T, (2)lems for smaller mobile devices with lim- where P is power, E is energy, T is a spe-ited battery capacities. Although one could ciﬁc time interval, and W is the total workgive these devices faster processors and performed in that interval. Energy is mea-larger memories, this would diminish sured in joules, while power is measuredtheir battery life even further. in watts. Without cost effective solutions to These concepts of work, power, and en-the power problem, improvements in ergy are used differently in different con-micro-processor technology will eventu- texts. In the context of computers, workally reach a standstill. Power manage- involves activities associated with run-ment is a multidisciplinary ﬁeld that in- ning programs (e.g., addition, subtraction,volves many aspects (i.e., energy, temper- memory operations), power is the rate atature, reliability), each of which is complex which the computer consumes electricalenough to merit a survey of its own. The energy (or dissipates it in the form of heat)focus of our survey will be on techniques while performing these activities, and en-that reduce the total power consumed by ergy is the total electrical energy the com-typical microprocessor systems. puter consumes (or dissipates as heat) We will follow the high-level taxonomy over time.illustrated in Figure 2. First, we will de- This distinction between power and en-ﬁne power and energy and explain the ergy is important because techniques thatcomplex parameters that dynamic and reduce power do not necessarily reduce en-static power depend on (Section 2). Next, ergy. For example, the power consumedwe will introduce techniques that reduce by a computer can be reduced by halv-power and energy (Section 3), starting ing the clock frequency, but if the com-with circuit (Section 3.1) and architec- puter then takes twice as long to runtural techniques (Section 3.2, Section 3.3, the same programs, the total energy con-and Section 3.4), and then moving on to sumed will be similar. Whether one shouldtwo techniques that are widely applied in reduce power or energy depends on thehardware and software, dynamic voltage context. In mobile applications, reducingscaling (Section 3.5) and resource hiberna- energy is often more important becausetion (Section 3.6). Third, we will examine it increases the battery lifetime. How-what compilers can do to manage power ever, for other systems (e.g., servers), tem-(Section 3.7). We will then discuss recent perature is a larger issue. To keep thework in application level power manage- temperature within acceptable limits, onement (Section 3.8), and recent efforts (Sec- would need to reduce instantaneous powertion 3.9) to develop a holistic solution to regardless of the impact on total energy. ACM Computing Surveys, Vol. 37, No. 3, September 2005.
Power Reduction Techniques for Microprocessor Systems 197 Fig. 2. Organization of this survey.2.1. Dynamic Power Consumption capacitance (C) and an activity factor (a) that relates to how many 0 → 1 or 1 → 0There are two forms of power consump- transitions occur in a chip:tion, dynamic power consumption andstatic power consumption. Dynamic powerconsumption arises from circuit activity Pdynamic ∼ aCV 2 f . (3)such as the changes of inputs in anadder or values in a register. It has two Accordingly, there are four ways to re-sources, switched capacitance and short- duce dynamic power consumption, thoughcircuit current. they each have different tradeoffs and not Switched capacitance is the primary all of them reduce the total energy con-source of dynamic power consumption and sumed. The ﬁrst way is to reduce the phys-arises from the charging and discharging ical capacitance or stored electrical chargeof capacitors at the outputs of circuits. of a circuit. The physical capacitance de- Short-circuit current is a secondary pends on low level design parameterssource of dynamic power consumption and such as transistor sizes and wire lengths.accounts for only 10-15% of the total power One can reduce the capacitance by re-consumption. It arises because circuits are ducing transistor sizes, but this worsenscomposed of transistors having opposite performance.polarity, negative or NMOS and positive The second way to lower dynamic poweror PMOS. When these two types of tran- is to reduce the switching activity. As com-sistors switch current, there is an instant puter chips get packed with increasinglywhen they are simultaneously on, creat- complex functionalities, their switchinging a short circuit. We will not deal fur- activity increases [De and Borkar 1999],ther with the power dissipation caused by making it more important to develop tech-this short circuit because it is a smaller niques that fall into this category. Onepercentage of total power, and researchers popular technique, clock gating, gateshave not found a way to reduce it without the clock signal from reaching idle func-sacriﬁcing performance. tional units. Because the clock network As the following equation shows, the accounts for a large fraction of a chip’smore dominant component of dynamic total energy consumption, this is a verypower, switched capacitance (Pdynamic ), de- effective way of reducing power and en-pends on four parameters namely, supply ergy throughout a processor and is imple-voltage (V), clock frequency (f), physical mented in numerous commercial systemsACM Computing Surveys, Vol. 37, No. 3, September 2005.
198 V. Venkatachalam and M. Franz 2.2. Understanding Leakage Power Consumption In addition to consuming dynamic power, computer components consume static power, also known as idle power or leakage. According to the most re- cently published industrial roadmaps [ITRSRoadMap], leakage power is rapidly becoming the dominant source of power consumption in circuits (Figure 3) and persists whether a computer is active orFig. 3. ITRS trends for leakage power dissipation. idle. Because its causes are different fromFigure adapted from Meng et al., 2005. those of dynamic power, dynamic power reduction techniques do not necessarily reduce the leakage power. As the equation that follows illustrates,including the Pentium 4, Pentium M, Intel leakage power consumption is the productXScale and Tensilica Xtensa, to mention of the supply voltage (V) and leakage cur-but a few. rent (Il eak ), or parasitic current, that ﬂows The third way to reduce dynamic power through transistors even when the tran-consumption is to reduce the clock fre- sistors are turned off.quency. But as we have just mentioned,this worsens performance and does not al-ways reduce the total energy consumed. Pleak = V Ileak . (4)One would use this technique only ifthe target system does not support volt- To understand how leakage currentage scaling and if the goal is to re- arises, one must understand how transis-duce the peak or average power dissi- tors work. A transistor regulates the ﬂowpation and indirectly reduce the chip’s of current between two terminals calledtemperature. the source and the drain. Between these The fourth way to reduce dynamic power two terminals is an insulator, called theconsumption is to reduce the supply volt- channel, that resists current. As the volt-age. Because reducing the supply voltage age at a third terminal, the gate, is in-increases gate delays, it also requires re- creased, electrical charge accumulates inducing the clock frequency to allow the cir- the channel, reducing the channel’s resis-cuit to work properly. tance and creating a path along which The combination of scaling the supply electricity can ﬂow. Once the gate volt-voltage and clock frequency in tandem is age is high enough, the channel’s polar-called dynamic voltage scaling (DVS). This ity changes, allowing the normal ﬂow oftechnique should ideally reduce dynamic current between the source and the drain.power dissipation cubically because dy- The threshold at which the gate’s voltagenamic power is quadratic in voltage and is high enough for the path to open is calledlinear in clock frequency. This is the most the threshold voltage.widely adopted technique. A growing num- According to this model, a transistor isber of processors, including the Pentium similar to a water dam. It is supposedM, mobile Pentium 4, AMD’s Athlon, and to allow current to ﬂow when the gateTransmeta’s Crusoe and Efﬁcieon proces- voltage exceeds the threshold voltage butsors allow software to adjust clock fre- should otherwise prevent current fromquencies and voltage settings in tandem. ﬂowing. However, transistors are imper-However, DVS has limitations and cannot fect. They leak current even when the gatealways be applied, and even when it can voltage is below the threshold voltage. Inbe applied, it is nontrivial to apply as we fact, there are six different types of cur-will see in Section 3.5. rent that leak through a transistor. These ACM Computing Surveys, Vol. 37, No. 3, September 2005.
Power Reduction Techniques for Microprocessor Systems 199include reverse-biased-junction leakage, ture increases. But as the Equation (6)gate-induced-drain leakage, subthreshold shows, this increases leakage further,leakage, gate-oxide leakage, gate-current causing yet higher temperatures. This vi-leakage, and punch-through leakage. Of cious cycle is known as thermal runaway.these six, subthreshold leakage and gate- It is the chip designer’s worst nightmare.oxide leakage dominate the total leakage How can these problems be solved?current. Equations (4) and (6) indicate four ways to Gate-oxide leakage ﬂows from the gate of reduce leakage power. The ﬁrst way is toa transistor into the substrate. This type reduce the supply voltage. As we will see,of leakage current depends on the thick- supply voltage reduction is a very commonness of the oxide material that insulates technique that has been applied to compo-the gate: nents throughout a system (e.g., processor, buses, cache memories). V 2 Tox The second way to reduce leakage power Iox = K 2 W e−α V . (5) is to reduce the size of a circuit because the Tox total leakage is proportional to the leak- age dissipated in all of a circuit’s tran- According to this equation, the gate- sistors. One way of doing this is to de-oxide leakage Iox increases exponentially sign a circuit with fewer transistors byas the thickness Tox of the gate’s oxide ma- omitting redundant hardware and usingterial decreases. This is a problem because smaller caches, but this may limit perfor-future chip designs will require the thick- mance and versatility. Another idea is toness to be reduced along with other scaled reduce the effective transistor count dy-parameters such as transistor length and namically by cutting the power supplies tosupply voltage. One way of solving this idle components. Here, too, there are chal-problem would be to insulate the gate us- lenges such as how to predict when dif-ing a high-k dialectric material instead ferent components will be idle and how toof the oxide materials that are currently minimize the overhead of shutting themused. This solution is likely to emerge over on or off. This, is also a common approachthe next decade. of which we will see examples in the sec- Subthreshold leakage current ﬂows be- tions to follow.tween the drain and source of a transistor. The third way to reduce leakage powerIt is the dominant source of leakage and is to cool the computer. Several coolingdepends on a number of parameters that techniques have been developed sinceare related through the following equa- the 1960s. Some blow cold air into thetion: circuit, while others refrigerate the −Vth −V processor [Schmidt and Notohardjono Isub = K 1 W e nT 1−e T . (6) 2002], sometimes even by costly means such as circulating cryogenic ﬂuids like In this equation, W is the gate width liquid nitrogen [Krane et al. 1988]. Theseand K and n are constants. The impor- techniques have three advantages. Firsttant parameters are the supply voltage they signiﬁcantly reduce subthresholdV , the threshold voltage Vth , and the leakage. In fact, a recent study [Schmidttemperature T. The subthreshold leak- and Notohardjono 2002] showed that cool-age current Isub increases exponentially ing a memory cell by 50 degrees Celsiusas the threshold voltage Vth decreases. reduces the leakage energy by ﬁve times.This again raises a problem for future Second, these techniques allow a circuit tochip designs, because as technology scales, work faster because electricity encountersthreshold voltages will have to scale along less resistance at lower temperatures.with supply voltages. Third, cooling eliminates some negative The increase in subthreshold leakage effects of high temperatures, namely thecurrent causes another problem. When the degradation of a chip’s reliability and lifeleakage current increases, the tempera- expectancy. Despite these advantages,ACM Computing Surveys, Vol. 37, No. 3, September 2005.
200 V. Venkatachalam and M. Franz Fig. 4. The transistor stacking effect. The cir- cuits in both (a) and (b) are leaking current be- tween Vdd and Ground. However, Ileak1, the leak- age in (a), is less than Ileak2, the leakage in (b). Figure adapted from Butts and Sohi 2000. Fig. 5. Multiple threshold circuits with sleep transistors.there are issues to consider such as thecosts of the hardware used to cool the and this increases the threshold voltagecircuit. Moreover, cooling techniques of the bottom transistor, making it moreare insufﬁcient if they result in wide resistant to leakage. As a result, in Fig-temperature variations in different parts ure 4(a), in which all transistors are inof a circuit. Rather, one needs to prevent the Off position, transistor B leaks lesshotspots by distributing heat evenly current than transistor A, and transistorthroughout a chip. C leaks less current than transistor B. Hence, the total leakage current is atten- 2.2.1. Reducing Threshold Voltage. The uated as it ﬂows from Vdd to the groundfourth way of reducing leakage power is to through transistors A, B, and C. This is notincrease the threshold voltage. As Equa- the case in the circuit shown in Figure 4tion (6) shows, this reduces the subthresh- (b), which contains only a single offold leakage exponentially. However, it also transistor.reduces the circuit’s performance as is ap- Another way to increase the thresh-parent in the following equation that re- old voltage is to use Multiple Thresh-lates frequency (f), supply voltage (V), and old Circuits With Sleep Transistors (MTC-threshold voltage (Vth ) and where α is a MOS) [Calhoun et al. 2003; Won et al.constant: 2003] . This involves isolating a leaky cir- cuit element by connecting it to a pair (V − Vth )α of virtual power supplies that are linked f∞ . (7) V to its actual power supplies through sleep transistors (Figure 5). When the circuit is One of the less intuitive ways of active, the sleep transistors are activated,increasing the threshold voltage is to connecting the circuit to its power sup-exploit what is called the stacking effect plies. But when the circuit is inactive, the(refer to Figure 4). When two or more tran- sleep transistors are deactivated, thus dis-sistors that are switched off are stacked connecting the circuit from its power sup-on top of each other (a), then they dis- plies. In this inactive state, almost no leak-sipate less leakage than a single tran- age passes through the circuit because thesistor that is turned off (b). This is be- sleep transistors have high threshold volt-cause each transistor in the stack induces ages. (Recall that subthreshold leakagea slight reverse bias between the gate and drops exponentially with a rise in thresh-source of the transistor right below it, old voltage, according to Equation (6).) ACM Computing Surveys, Vol. 37, No. 3, September 2005.
Power Reduction Techniques for Microprocessor Systems 201 Fig. 6. Adaptive body biasing. This technique effectively conﬁnes the To adjust the threshold voltage, adap-leakage to one part of the circuit, but is tive body biasing applies a voltage to thetricky to implement for several reasons. transistor’s body known as a body biasThe sleep transistors must be sized prop- voltage (Figure 6). This voltage changeserly to minimize the overhead of acti- the polarity of a transistor’s channel, de-vating them. They cannot be turned on creasing or increasing its resistance toand off too frequently. Moreover, this tech- current ﬂow. When the body bias voltagenique does not readily apply to memories is chosen to ﬁll the transistor’s channelbecause memories lose data when their with positive ions (b), the threshold volt-power supplies are cut. age increases and reduces leakage cur- A third way to increase the threshold rents. However, when the voltage is cho-is to employ dual threshold circuits. Dual sen to ﬁll the channel with negative ions,threshold circuits [Liu et al. 2004; Wei the threshold voltage decreases, allowinget al. 1998; Ho and Hwang 2004] reduce higher performance, though at the cost ofleakage by using high threshold (low leak- more leakage.age) transistors on noncritical paths andlow threshold transistors on critical paths,the idea being that noncritical paths can 3. REDUCING POWERexecute instructions more slowly without 3.1. Circuit And Logic Level Techniquesimpairing performance. This is a difﬁ-cult technique to implement because it re- 3.1.1. Transistor Sizing. Transistor siz-quires choosing the right combination of ing [Penzes et al. 2002; Ebergen et al.transistors for high-threshold voltages. If 2004] reduces the width of transistorstoo many transistors are assigned high to reduce their dynamic power consump-threshold voltages, the noncritical paths tion, using low-level models that relate thein the circuit can slow down too much. power consumption to the width. Accord- A fourth way to increase the thresh- ing to these models, reducing the widthold voltage is to apply a technique known also increases the transistor’s delay andas adaptive body biasing [Seta et al. thus the transistors that lie away from1995; Kobayashi and Sakurai 1994; Kim the critical paths of a circuit are usuallyand Roy 2002]. Adaptive body biasing is the best candidates for this technique. Al-a runtime technique that reduces leak- gorithms for applying this technique usu-age power by dynamically adjusting the ally associate with each transistor a toler-threshold voltages of circuits depending able delay which varies depending on howon whether the circuits are active. When close that transistor is to the critical path.a circuit is not active, the technique in- These algorithms then try to scale eachcreases its threshold voltage, thus saving transistor to be as small as possible with-leakage power exponentially, although at out violating its tolerable delay.the expense of a delay in circuit operation.When the circuit is active, the technique 3.1.2. Transistor Reordering. The ar-decreases the threshold voltage to avoid rangement of transistors in a circuitslowing it down. affects energy consumption. Figure 7ACM Computing Surveys, Vol. 37, No. 3, September 2005.
202 V. Venkatachalam and M. Franz 3.1.3. Half Frequency and Half Swing Clocks. Half-frequency and half-swing clocks re- duce frequency and voltage, respectively. Traditionally, hardware events such as register ﬁle writes occur on a rising clock edge. Half-frequency clocks synchro- nize events using both edges, and they tick at half the speed of regular clocks, thus cutting clock switching power in half. Reduced-swing clocks also often use a lower voltage signal and thus reduce power quadratically. Fig. 7. Transistor Reordering. Figure adapted from Hossain et al. 1996. 3.1.4. Logic Gate Restructuring. There are many ways to build a circuit out of logic gates. One decision that affects power con-shows two possible implementations of sumption is how to arrange the gates andthe same circuit that differ only in their their input signals.placement of the transistors marked A For example, consider two implementa-and B. Suppose that the input to transis- tions of a four-input AND gate (Figure 8),tor A is 1, the input to transistor B is 1, a chain implementation (a), and a treeand the input to transistor C is 0. Then implementation (b). Knowing the signaltransistors A and B will be on, allowing probabilities (1 or 0) at each of the pri-current from Vdd to ﬂow through them mary inputs (A, B, C, D), one can easily cal-and charge the capacitors C1 and C2. culate the transition probabilities (0→1) Now suppose that the inputs change and for each output (W, X, F, Y, Z). If each in-that A’s input becomes 0, and C’s input be- put has an equal probability of being acomes 1. Then A will be off while B and C 1 or a 0, then the calculation shows thatwill be on. Now the implementations in (a) the chain implementation (a) is likely toand (b) will differ in the amounts of switch- switch less than the tree implementationing activity. In (a), current from ground (b). This is because each gate in a chainwill ﬂow past transistors B and C, dis- has a lower probability of having a 0→1charging both the capacitors C1 and C2. transition than its predecessor; its tran-However, in (b), the current from ground sition probability depends on those of allwill only ﬂow past transistor C; it will not its predecessors. In the tree implementa-get past transistor A since A is turned off. tion, on the other hand, some gates mayThus it will only discharge the capacitor share a parent (in the tree topology) in-C2, rather than both C1 and C2 as in part stead of being directly connected together.(a). Thus the implementation in (b) will These gates could have the same transi-consume less power than that in (a). tion probabilities. Transistor reordering [Kursun et al. Nevertheless, chain implementations2004; Sultania et al. 2004] rearranges do not necessarily save more energy thantransistors to minimize their switching tree implementations. There are other is-activity. One of its guiding principles is sues to consider when choosing a topology.to place transistors closer to the circuit’s One is the issue of glitches or spuriousoutputs if they switch frequently in or- transitions that occur when a gate does notder to prevent a domino effect where receive all of its inputs at the same time.the switching activity from one transistor These glitches are more common in chaintrickles into many other transistors caus- implementations where signals can traveling widespread power dissipation. This re- along different paths having widely vary-quires proﬁling techniques to determine ing delays. One solution to reduce glitcheshow frequently different transistors are is to change the topology so that the dif-likely to switch. ferent paths in the circuit have similar ACM Computing Surveys, Vol. 37, No. 3, September 2005.
Power Reduction Techniques for Microprocessor Systems 203 Fig. 8. Gate restructuring. Figure adapted from the Pennsylvania State University Microsystems Design Laboratory’s tutorial on low power design.delays. This solution, known as path bal- such as register ﬁles. A typical master-ancing often transforms chain implemen- slave ﬂip-ﬂop consists of two latches, atations into tree implementations. An- master latch, and a slave latch. The inputsother solution, called retiming, involves of the master latch are the clock signalinserting ﬂip-ﬂops or registers to slow and data. The inputs of the slave latchdown and thereby synchronize the signals are the inverse of the clock signal and thethat pass along different paths but recon- data output by the master latch. Thusverge to the same gate. Because ﬂip-ﬂops when the clock signal is high, the masterand registers are in sync with the proces- latch is turned on, and the slave latchsor clock, they sample their inputs less fre- is turned off. In this phase, the masterquently than logic gates and are thus more samples whatever inputs it receives andimmune to glitches. outputs them. The slave, however, does not sample its inputs but merely outputs 3.1.5. Technology Mapping. Because of whatever it has most recently stored. Onthe huge number of possibilities and the falling edge of the clock, the mastertradeoffs at the gate level, designers rely turns off and the slave turns on. Thus theon tools to determine the most energy- master saves its most recent input andoptimal way of arranging gates and sig- stops sampling any further inputs. Thenals. Technology mapping [Chen et al. slave samples the new inputs it receives2004; Li et al. 2004; Rutenbar et al. 2001] from the master and outputs it.is the automated process of constructing a Besides this master-slave design aregate-level representation of a circuit sub- other common designs such as the pulse-ject to constraints such as area, delay, and triggered ﬂip-ﬂop and sense-ampliﬁer ﬂip-power. Technology mapping for power re- ﬂop. All these designs share some commonlies on gate-level power models and a li- sources of power consumption, namelybrary that describes the available gates, power dissipated from the clock signal,and their design constraints. Before a cir- power dissipated in internal switching ac-cuit can be described in terms of gates, it tivity (caused by the clock signal and byis initially represented at the logic level. changes in data), and power dissipatedThe problem is to design the circuit out of when the outputs change.logic gates in a way that will mimimize the Researchers have proposed several al-total power consumption under delay and ternative low power designs for ﬂip-ﬂops.cost constraints. This is an NP-hard Di- Most of these approaches reduce therected Acyclic Graph (DAG) covering prob- switching activity or the power dissipatedlem, and a common heuristic to solve it is by the clock signal. One alternative is theto break the DAG representation of a cir- self-gating ﬂip-ﬂop. This design inhibitscuit into a set of trees and ﬁnd the optimal the clock signal to the ﬂip-ﬂop when the in-mapping for each subtree using standard puts will produce no change in the outputs.tree-covering algorithms. Strollo et al.  have proposed two ver- sions of this design. In the double-gated 3.1.6. Low Power Flip-Flops. Flip-ﬂops ﬂip-ﬂop, the master and slave latchesare the building blocks of small memories each have their own clock-gating circuit.ACM Computing Surveys, Vol. 37, No. 3, September 2005.
204 V. Venkatachalam and M. FranzThe circuit consists of a comparator that way of reducing power is to encode thechecks whether the current and previous FSM states in a way that minimizes theinputs are different and some logic that in- switching activity throughout the proces-hibits the clock signal based on the output sor. Another common approach involvesof the comparator. In the sequential-gated decomposing the FSM into sub-FSMsﬂip-ﬂop, the master and the slave share and activating only the circuitry neededthe same clock-gating circuit instead of for the currently executing sub-FSM.having their own individual circuit as in Some researchers [Gao and Hayes 2003]the double-gated ﬂip-ﬂop. have developed tools that automate this Another low-power ﬂip-ﬂop is the condi- process of decomposition, subject to sometional capture FLIP-FLOP [Kong et al. 2001; quality and correctness constraints.Nedovic et al. 2001]. This ﬂip-ﬂop detectswhen its inputs produce no change in the 3.1.8. Delay-Based Dynamic-Supply Voltageoutput and stops these redundant inputs Adjustment. Commercial processors thatfrom causing any spurious internal cur- can run at multiple clock speed normallyrent switches. There are many variations use a lookup table to decide what supplyon this idea, such as the conditional dis- voltage to select for a given clock speed.charge ﬂip-ﬂops and conditional precharge This table is use built ahead of timeﬂip-ﬂops [Zhao et al. 2004] which elim- through a worst-case analysis. Becauseinate unnecessary precharging and dis- worst-case analyses may not reﬂect real-charging of internal elements. ity, ARM Inc. has been developing a more A third type of low-power ﬂip-ﬂop is the efﬁcient runtime solution known as thedual edge triggered FLIP-FLOP [Llopis and Razor pipeline [Ernst et al. 2003].Sachdev 1996]. These ﬂip-ﬂops are sensi- Instead of using a lookup table, Razortive to both the rising and falling edges adjusts the supply voltage based on theof the clock, and can do the same work delay in the circuit. The idea is that when-as a regular ﬂip-ﬂop at half the clock fre- ever the voltage is reduced, the circuitquency. By cutting the clock frequency slows down, causing timing errors if thein half, these ﬂip-ﬂops save total power clock frequency chosen for that voltage isconsumption but may not save total en- too high. Because of these errors, some in-ergy, although they could save energy by structions produce inconsistent results orreducing the voltage along with the fre- fail altogether. The Razor pipeline peri-quency. Another drawback is that these odically monitors how many such errorsﬂip-ﬂops require more area than regu- occur. If the number of errors exceeds alar ﬂip-ﬂops. This increase in area may threshold, it increases the supply voltage.cause them to consume more power on the If the number of errors is below a thresh-whole. old, it scales the voltage further to save more energy. This solution requires extra hardware 3.1.7. Low-Power Control Logic Design. in order to detect and correct circuitOne can view the control logic of a pro- errors. To detect errors, it augmentscessor as a Finite State Machine (FSM). the ﬂip-ﬂops in delay-critical regions ofIt speciﬁes the possible processor states the circuit with shadow-latches. Theseand conditions for switching between latches receive the same inputs as thethe states and generates the signals that ﬂip-ﬂops but are clocked more slowly toactivate the appropriate circuitry for each adapt to the reduced supply voltage. Ifstate. For example, when an instruction the output of the ﬂip-ﬂop differs fromis in the execution stage of the pipeline, that of its shadow-latch, then this sig-the control logic may generate signals to niﬁes an error. In the event of such anactivate the correct execution unit. error, the circuit propagates the output Although control logic optimizations of the shadow-latch instead of the ﬂip-have traditionally targeted performance, ﬂop, delaying the pipeline for a cycle ifthey are now also targeting power. One necessary. ACM Computing Surveys, Vol. 37, No. 3, September 2005.
Power Reduction Techniques for Microprocessor Systems 205 Fig. 9. Crosstalk prevention through shielding.3.2. Low-Power Interconnect smaller, there arise additional sources of power consumption. One of these sourcesInterconnect heavily affects power con- is crosstalk [Sylvester and Keutzer 1998].sumption since it is the medium of most Crosstalk is spurious activity on a wireelectrical activity. Efforts to improve chip that is caused by activity in neighboringperformance are resulting in smaller chips wires. As well as increasing delays and im-with more transistors and more densely pairing circuit integrity, crosstalk can in-packed wires carrying larger currents. The crease power consumption.wires in a chip often use materials with One way of reducing crosstalk [Taylorpoor thermal conductivity [Banerjee and et al. 2001] is to insert a shield wireMehrotra 2001]. (Figure 9) between adjacent bus wires. Since the shield remains deasserted, no 3.2.1. Bus Encoding And CrossTalk. A pop- adjacent wires switch in opposite direc-ular way of reducing the power consumed tions. This solution doubles the number ofin buses is to reduce the switching activity wires. However, the principle behind it hasthrough intelligent bus encoding schemes, sparked efforts to develop self-shieldingsuch as bus-inversion [Stan and Burleson codes [Victor and Keutzer 2001; Patel and1995]. Buses consist of wires that trans- Markov 2003] resistant to crosstalk. Asmit bits (logical 1 or 0). For every data in traditional bus encoding, a value is en-transmission on a bus, the number of wires coded and then transmitted. However, thethat switch depends on the current and code chosen avoids opposing transitions onprevious values transmitted. If the Ham- adjacent bus wires.ming distance between these values ismore than half the number of wires, then 3.2.2. Low Swing Buses. Bus-encodingmost of the wires on the bus will switch schemes attempt to reduce transitions oncurrent. To prevent this from happening, the bus. Alternatively, a bus can transmitbus-inversion transmits the inverse of the the same information but at a lowerintended value and asserts a control sig- voltage. This is the principle behind lownal alerting recipients of the inversion. swing buses [Zhang and Rabaey 1998].For example, if the current binary value to Traditionally, logical one is representedtransmit is 110 and the previous was 000, by 5 volts and logical zero is representedthe bus instead transmits 001, the inverse by −5 volts. In a low-swing systemof 110. (Figure 10), logical one and zero are Bus-inversion ensures that at most half encoded using lower voltages, such asof the bus wires switch during a bus trans- +300mV and −300mV. Typically, theseaction. However, because of the cost of the systems are implemented with differen-logic required to invert the bus lines, this tial signaling. An input signal is split intotechnique is mainly used in external buses two signals of opposite polarity boundedrather than the internal chip interconnect. by a smaller voltage range. The receiver Moreover, some researchers have ar- sees the difference between the twogued that this technique makes the sim- transmitted signals as the actual signalplistic assumption that the amount of and ampliﬁes it back to normal voltage.power that an interconnect wire consumes Low Swing Differential Signaling hasdepends only on the number of bit tran- several advantages in addition to reducedsitions on that wire. As chips become power consumption. It is immune toACM Computing Surveys, Vol. 37, No. 3, September 2005.
206 V. Venkatachalam and M. Franz Fig. 10. Low voltage differential signaling Fig. 11. Bus segmentation.crosstalk and electromagnetic radiation 3.2.4. Adiabatic Buses. The precedingeffects. Since the two transmitted signals techniques work by reducing activityare close together, any spurious activity or voltage. In contrast, adiabatic cir-will affect both equally without affecting cuits [Lyuboslavsky et al. 2000] reduce to-the difference between them. However, tal capacitance. These circuits reuse ex-when implementing this technique, one isting electrical charge to avoid creatingneeds to consider the costs of increased new charge. In a traditional bus, whenhardware at the encoder and decoder. a wire becomes deasserted, its previous charge is wasted. A charge-recovery bus 3.2.3. Bus Segmentation. Another effec- recycles the charge for wires about to betive technique is bus segmentation. In a asserted.traditional shared bus architecture, the Figure 12 illustrates a simple design forentire bus is charged and discharged upon a two-bit adiabatic bus [Bishop and Ir-every access. Segmentation (Figure 11) win 1999]. A comparator in each bitlinesplits a bus into multiple segments con- tests whether the current bit is equal tonected by links that regulate the trafﬁc be- the previously sent bit. Inequality signi-tween adjacent segments. Links connect- ﬁes a transition and connects the bitlineing paths essential to a communication are to a shorting wire used for sharing chargeactivated independently, allowing most of across bitlines. In the sample scenario,the bus to remain powered down. Ideally, Bitline1 has a falling transition while Bit-devices communicating frequently should line0 has a rising transition. As Bitline1be in the same or nearby segments to avoid falls, its positive charge transfers to Bit-powering many links. Jone et al.  line0, allowing Bitline0 to rise. The powerhave examined how to partition a bus saved depends on transition patterns. Noto ensure this property. Their algorithm energy is saved when all lines rise. Thebegins with an undirected graph whose most energy is saved when an equal num-nodes are devices, edges connect commu- ber of lines rise and fall simultaneously.nicating devices, and edge weights dis- The biggest drawback of adiabatic cir-play communication frequency. Applying cuits is a delay for transfering sharedthe Gomory and Hu algorithm , they charge.ﬁnd a spanning tree in which the prod-uct of communication frequency and num-ber of edges between any two devices is 3.2.5. Network-On-Chip. All of the aboveminimal. techniques assume an architecture in ACM Computing Surveys, Vol. 37, No. 3, September 2005.
Power Reduction Techniques for Microprocessor Systems 207 and columns corresponding to the inter- section points end up switching current even though only parts of these rows and columns are actually traversed. To elim- inate this widespread current switching, the segmented crossbar topology divides the rows and columns into segments de- marcated by tristate buffers. The differ- ent segments are selectively activated as the data traverses through them, henceFig. 12. Two bit charge recovery bus. conﬁning current switches to only those segments along which the data actuallywhich multiple functional units share one passes.or more buses. This shared-bus paradigmhas several drawbacks with respect to 3.3. Low-Power Memories and Memorypower and performance. The inherent bus Hierarchiesbandwidth limits the speed and volumeof data transfers and does not suit the 3.3.1. Types of Memory. One can classifyvarying requirements of different execu- memory structures into two categories,tion units. Only one unit can access the Random Access Memories (RAM) andbus at a time, though several may be re- Read-Only-Memories (ROM). There arequesting bus access simultaneously. Bus two kinds of RAMs, static RAMs (SRAM)arbitration mechanisms are energy inten- and dynamic RAMs (DRAM) which dif-sive. Unlike simple data transfers, every fer in how they store data. SRAMs storebus transaction involves multiple clock data using ﬂip-ﬂops and DRAMs storecycles of handshaking protocols that in- each bit of data as a charge on a capac-crease power and delay. itor; thus DRAMs need to refresh their As a consequence, researchers have data periodically. SRAMs allow faster ac-been investigating the use of generic in- cesses than DRAMs but require moreterconnection networks to replace buses. area and are more expensive. As a result,Compared to buses, networks offer higher normally only register ﬁles, caches, andbandwidth and support concurrent con- high bandwidth parts of the system arenections. This design makes it easier to made up of SRAM cells, while main mem-troubleshoot energy-intensive areas of the ory is made up of DRAM cells. Althoughnetwork. Many networks have sophisti- these cells have slower access timescated mechanisms for adapting to varying than SRAMs, they contain fewer transis-trafﬁc patterns and quality-of-service re- tors and are less expensive than SRAMquirements. Several authors [Zhang et al. cells.2001; Dally and Towles 2001; Sgroi et al. The techniques we will introduce are not2001] propose the application of these conﬁned to any speciﬁc type of RAM ortechniques at the chip level. These claims ROM. Rather they are high-level architec-have yet to be veriﬁed, and interconnect tural principles that apply across the spec-power reduction remains a promising area trum of memories to the extent that thefor future research. required technology is available. They at- Wang et al.  have developed tempt to reduce the energy dissipation ofhardware-level optimizations for differ- memory accesses in two ways, either byent sources of energy consumption in reducing the energy dissipated in a mem-an on-chip network. One of the alterna- ory accesses, or by reducing the number oftive network topologies they propose is memory accesses.the segmented crossbar topology whichaims to reduce the energy consumption 3.3.2. Splitting Memories Into Smaller Sub-of data transfers. When data is trans- systems. An effective way to reduce thefered over a regular network grid, all rows energy that is dissipated in a memoryACM Computing Surveys, Vol. 37, No. 3, September 2005.
208 V. Venkatachalam and M. Franzaccess is to activate only the needed mem- processor and ﬁrst level cache. It wouldory circuits in each access. One way to ex- be smaller than the ﬁrst level cache andpose these circuits is to partition memo- hence dissipate less energy. Yet it wouldries into smaller, independently accessible be large enough to store an application’scomponents. This can be done as different working set and ﬁlter out many mem-granularities. ory references; only when a data request At the lowest granularity, one can de- misses this cache would it require search-sign a main memory system that is split ing higher levels of cache, and the penaltyup into multiple banks each of which for a miss would be offset by redirectingcan independently transition into differ- more memory references to this smaller,ent power modes. Compilers and operating more energy efﬁcient cache. This was thesystems can cluster data into a minimal principle behind the ﬁlter cache. Thoughset of banks, allowing the other unused deceptively simple, this idea lies at thebanks to power down and thereby save en- heart of many of the low-power cache de-ergy [Luz et al. 2002; Lebeck et al. 2000]. signs used today, ranging from simple ex- At a higher granularity, one can parti- tensions of the cache hierarchy (e.g., blocktion the separate banks of a partitioned buffering) to the scratch pad memoriesmemory into subbanks and activate only used in embedded systems, and the com-the relevant subbank in every memory ac- plex trace caches used in high-end gener-cess. This is a technique that has been ap- ation processors.plied to caches [Ghose and Kamble 1999]. One simple extension is the technique ofIn a normal cache access, the set-selection block buffering. This technique augmentsstep involves transferring all blocks in all caches with small buffers to store the mostbanks onto tag-comparison latches. Since recently accessed cache set. If a data itemthe requested word is at a known off- that has been requested belongs to a cacheset in these blocks, energy can be saved set that is already in the buffer, then oneby transfering only words at that offset. does not have to search for it in the rest ofCache subbanking splits the block array of the cache.each cache line into multiple banks. Dur- In embedded systems, scratch pad mem-ing set-selection, only the relevant bank ories [Panda et al. 2000; Banakar et al.from each line is active. Since a single sub- 2002; Kandemir et al. 2002] have beenbank is active at a time, all subbanks of the extension of choice. These are soft-a line share the same output latches and ware controlled memories that reside on-sense ampliﬁers to reduce hardware com- chip. Unlike caches, the data they shouldplexity. Subbanking saves energy with- contain is determined ahead of time andout increasing memory access time. More- remains ﬁxed while programs execute.over, it is independent of program locality These memories are less volatile thanpatterns. conventional caches and can be accessed within a single cycle. They are ideal for specialized embedded systems whose ap- 3.3.3. Augmenting the Memory Hierarchy plications are often hand tuned.With Specialized Cache Structures. The sec- General purpose processors of the lat-ond way to reduce energy dissipation in est generation sometimes rely on morethe memory hierarchy is to reduce the complex caching techniques such as thenumber of memory hierarchy accesses. trace cache. Trace caches were originallyOne group of researchers [Kin et al. 1997] developed for performance but have beendeveloped a simple but effective technique studied recently for their power bene-to do this. Their idea was to integrate a ﬁts. Instead of storing instructions inspecialized cache into the typical mem- their compiled order, a trace cache storesory hierarchy of a modern processor, a traces of instructions in their executedhierarchy that may already contain one order. If an instruction sequence is al-or more levels of caches. The new cache ready in the trace cache, then it needthey were proposing would sit between the not be fetched from the instruction cache ACM Computing Surveys, Vol. 37, No. 3, September 2005.
Power Reduction Techniques for Microprocessor Systems 209but can be decoded directly from thetrace cache. This saves power by re-ducing the number of instruction cacheaccesses. In the original trace cache design, boththe trace cache and the instruction cacheare accessed in parallel on every clock cy-cle. Thus instructions are fetched from theinstruction cache while the trace cache issearched for the next trace that is pre-dicted to executed. If the trace is in thetrace cache, then the instructions fetchedfrom the instruction cache are discarded. Fig. 13. Drowsy cache line.Otherwise the program execution pro-ceeds out of the instruction cache and si- 3.4.1. Adaptive Caches. There is amultaneously new traces are created from wealth of literature on adaptive caches,the instructions that are fetched. caches whose storage elements (lines, Low-power designs for trace caches typi- blocks, or sets) can be selectively acti-cally aim to reduce the number of instruc- vated based on the application workload.tion cache accesses. One alternative, the One example of such a cache is thedynamic direction prediction-based trace Deep-Submicron Instruction (DRI)cache [Hu et al. 2003], uses branch predi- cache [Powell et al. 2001]. This cachetion to decide where to fetch instructions permits one to deactivate its individualfrom. If the branch predictor predicts the sets on demand by gating their supplynext trace with high conﬁdence and that voltages. To decide what sets to activate attrace is in the trace cache, then the in- any given time, the cache uses a hardwarestructions are fetched from the trace cache proﬁler that monitors the application’srather than the instruction cache. cache-miss patterns. Whenever the cache The selective trace cache [Hu et al. 2002] misses exceed a threshold, the DRI cacheextends this technique with compiler sup- activates previously deactivated sets.port. The idea is to identify frequently exe- Likewise, whenever the miss rate fallscuted, or hot traces, and bring these traces below a threshold, the DRI deactivatesinto the trace cache. The compiler inserts some of these sets by inhibiting theirhints after every branch instruction, indi- supply voltages.cating whether it belongs to a hot trace. At A problem with this approach is thatruntime, these hints cause instructions to dormant memory cells lose data and needbe fetched from either the trace cache or more time to be reactivated for their nextthe instruction cache. use. Thus an alternative to inhibiting their supply voltages is to reduce their voltages as low as possible without losing data.3.4. Low-Power Processor Architecture This is the aim of the drowsy cache, a cache Adaptations whose lines can be placed in a drowsySo far we have described the energy sav- mode where they dissipate minimal powering features of hardware as though they but retain data and can be reactivatedwere a ﬁxed foundation upon which pro- faster.grams execute. However, programs exhibit Figure 13 illustrates a typical drowsywide variations in behavior. Researchers cache line [Kim et al. 2002]. The wakeuphave been developing hardware structures signal chooses between voltages for thewhose parameters can be adjusted on de- active, drowsy, and off states. It alsomand so that one can save energy by regulates the word decoder’s output toactivating just the minimum hardware prevent data in a drowsy line from be-resources needed for the code that is ing accessed. Before a cache access, theexecuting. line circuitry must be precharged. In aACM Computing Surveys, Vol. 37, No. 3, September 2005.
210 V. Venkatachalam and M. Franz the ﬁrst adaptive instruction issue queues, a 32-bit queue consisting of four equal size partitions that can be independently acti- vated. Each partition consists of wakeup Fig. 14. Dead block elimination. logic that decides when instructions are ready to execute and readout logic thattraditional cache, the precharger is contin- dispatches ready instructions into theuously on. To save power, the drowsy cir- pipeline. At any time, only the partitionscuitry regulates the wakeup signal with a that contain the currently executing in-precharge signal. The precharger becomes structions are activated.active only when the line is woken up. There have been many heuristics devel- Many heuristics have been developed oped for conﬁguring these queues. Somefor controlling these drowsy circuits. The require measuring the rate at which in-simplest policy for controlling the drowsy structions are executed per processor clockcache is cache decay [Kaxiras et al. 2001] cycles, or IPC. For example, Buyukto-which simply turns off unused cache lines sunoglu et al.’s heuristic  periodi-after a ﬁxed interval. More sophisticated cally measures the IPC over ﬁxed lengthpolicies [Zhang et al. 2002; 2003; Hu et al. intervals. If the IPC of the current inter-2003] consider information about runtime val is a factor smaller than that of theprogram behavior. For example, Hu et previous interval (indicating worse per-al.  have proposed hardware that formance), then the heuristic increasesdetects when an execution moves into a the instruction queue size to increase thehotspot and activates only those cache throughput. Otherwise it may decreaselines that lie within that hotspot. The the queue size.main idea of their approach is to count Bahar and Manne  propose ahow often different branches are taken. similar heuristic targeting a simulatedWhen a branch is taken sufﬁciently many 8-wide issue processor that can be re-times, its target address is a hotspot, and conﬁgured as a 6-wide issue or 4-widethe hardware sets special bits to indicate issue. They also measure IPC over ﬁxedthat cache lines within that hotspot should intervals but measure the ﬂoating pointnot be shut down. Periodically, a runtime IPC separately from the integer IPC sinceroutine disables cache lines that do not the former is more critical for ﬂoatinghave these bits set and decays the proﬁl- point applications. Their heuristic, likeing data. Whenever it reactivates a cache Buyuktosunoglu et al.’s , comparesline before its next use, it also reactivates the measured IPC with speciﬁc thresholdsthe line that corresponds to the next sub- to determine when to downsize the issuesequent memory access. queue. There are many other heuristics for con- Folegnani and Gonzalez , on thetrolling cache lines. Dead-block elimina- other hand, use a different heuristic. Theytion [Kabadi et al. 2002] powers down ﬁnd that when a program has little par-cache lines containing basic blocks that allelism, the youngest part of the issuehave reached their ﬁnal use. It is a queue (which contains the most recent in-compiler-directed technique that requires structions) contributes very little to thea static control ﬂow analysis to identify overall IPC in that instructions in thisthese blocks. Consider the control ﬂow part are often committed late. Thus theirgraph in Figure 14. Since block B1 exe- heuristic monitors the contribution of thecutes only once, its cache lines can power youngest part of this queue and deacti-down once control reaches B2. Similarly, vates this part when its contribution isthe lines containing B2 and B3 can power minimal.down once control reaches B4. 3.4.2. Adaptive Instruction Queues. Buyu- 3.4.3. Algorithms for Reconﬁguring Multi-ktosunoglu et al.  developed one of ple Structures. In addition to this work, ACM Computing Surveys, Vol. 37, No. 3, September 2005.
Power Reduction Techniques for Microprocessor Systems 211researchers have also been tackling the to apply, they use ofﬂine proﬁling to plotproblem of problem of reconﬁguring mul- an energy-delay tradeoff curve. Each pointtiple hardware units simultaneously. in the curve represents the energy and de- One group [Hughes et al. 2001; Sasanka lay tradeoff of applying an adaptation toet al. 2002; Hughes and Adve 2004] a subroutine. There are three regions inhas been developing heuristics for com- the curve. The ﬁrst includes adaptationsbining hardware adaptations with dy- that save energy without impairing per-namic voltage scaling during multime- formance; these are the adaptations thatdia applications. Their strategy is to run will always be applied. The third repre-two algorithms while individual video sents adaptations that worsen both per-frames are being processed. A global algo- formance and energy; these are the adap-rithm chooses the initial DVS setting and tations that will never be applied. Be-baseline hardware conﬁguration for each tween these two extremes are adaptationsframe, while a local algorithm tunes var- that trade performance for energy savings;ious hardware parameters (e.g., instruc- some of these may be applied depending ontion window sizes) while the frame is ex- the tolerable performance loss. The mainecuting to use up any remaining slack idea is to trace the curve from the ori-periods. gin and apply every adaptation that one Another group [Iyer and Marculescu encounters until the cumulative perfor-2001, 2002a] has developed heuristics that mance loss from applying all the encoun-adjust the pipeline width and register up- tered adaptations reaches the maximumdate unit (RUU) size for different hotspots tolerable performance loss.or frequently executed program regions. Ponomarev et al.  develop heuris-To detect these hotspots, their technique tics for controlling a conﬁgurable issuecounts the number of times each branch queue, a conﬁgurable reorder buffer and ainstruction is taken. When a branch is conﬁgurable load/store queue. Unlike thetaken enough times, it is marked as a can- IPC based heuristics we have seen, thisdidate branch. heuristic is occupancy-based because the To detect how often candidate branches occupancy of hardware structures indi-are taken, the heuristic uses a hotspot cates how congested these structures are,detection counter. Initially the hotspot and congestion is a leading cause of perfor-counter contains a positive integer. Each mance loss. Whenever a hardware struc-time a candidate branch is taken, the ture (e.g., instruction queue) has low oc-counter is decremented, and each time a cupancy, the heuristic reduces its size. Innoncandidate branch is taken, the counter contrast, if the structure is completely fullis incremented. If the counter becomes for a long time, the heuristic increases itszero, this means that the program execu- size to avoid congestion.tion is inside a hotspot. Another group of researchers [Albonesi Once the program execution is in a et al. 2003; Dropsho et al. 2002] ex-hotspot, this heuristic tests the energy plore the complex problem of conﬁgur-consumption of different processor conﬁg- ing a processor whose adaptable compo-urations at ﬁxed length intervals. It ﬁnally nents include two levels of caches, integerchooses the most energy saving conﬁgura- and ﬂoating point issue queues, load/storetion as the one to use the next time the queues, register ﬁles, as well as a reordersame hotspot is entered. To measure en- buffer. To dodge the sheer explosion of pos-ergy at runtime, the heuristic relies on sible conﬁgurations, they tune each com-hardware that monitors usage statistics of ponent based only on its local usage statis-the most power hungry units and calcu- tics. They propose two heuristics for doinglates the total power based on energy per this, one for tuning caches and another foraccess values. tuning buffers and register ﬁles. Huang et al.[2000; 2003] apply hard- The caches they consider are selec-ware adaptations at the granularity of tive way caches; each of their multiplesubroutines. To decide what adaptations ways can independently be activated orACM Computing Surveys, Vol. 37, No. 3, September 2005.
212 V. Venkatachalam and M. Franzdisabled. Each way has a most recently [Turing 1937]). There have been ongoingused (MRU) counter that is incremented efforts [Nilsen and Rygg 1995; Lim et al.every time that a cache searche hits the 1995; Diniz 2003; Ishihara and Yasuuraway. The cache tuning heuristic samples 1998] in the real-time systems communitythis counter at ﬁxed intervals to determine to accurately estimate the worst case exe-the number of hits in each way and the cution times of programs. These attemptsnumber of overall misses. Using these ac- use either measurement-based prediction,cess statistics, the heuristic computes the static analysis, or a mixture of both. Mea-energy and performance overheads for all surement involves a learning lag wherebypossible cache conﬁgurations and chooses the execution times of various tasks arethe best conﬁguration dynamically. sampled and future execution times are The heuristic for controlling other struc- extrapolated. The problem is that, whentures such as issue queues is occupancy workloads are irregular, future execu-based. It measures how often different tion times may not resemble past execu-components of these structures ﬁll up with tion times. Microarchitectural innovationsdata and uses this information to de- such as pipelining, hyperthreading andcide whether to upsize or downsize the out-of-order execution make it difﬁcult tostructure. predict execution times in real systems statically. Among other things, a compiler3.5. Dynamic Voltage Scaling needs to know how instructions are inter-Dynamic voltage scaling (DVS) addresses leaved in the pipeline, what the probabili-the problem of how to modulate a pro- ties are that different branches are taken,cessor’s clock frequency and supply volt- and when cache misses and pipeline haz-age in lockstep as programs execute. The ards are likely to occur. This requires mod-premise is that a processor’s workloads eling the pipeline and memory hierarchy.vary and that when the processor has less A number of researchers [Ferdinand 1997;work, it can be slowed down without af- Clements 1996] have attempted to inte-fecting performance adversely. For exam- grate complete pipeline and cache mod-ple, if the system has only one task and it els into compilers. It remains nontrivialhas a workload that requires 10 cycles to to develop models that can be used ef-execute and a deadline of 100 seconds, the ﬁciently in a compiler and that captureprocessor can slow down to 1/10 cycles/sec, all the complexities inherent in currentsaving power and meeting the deadline systems.right on time. This is presumably more ef-ﬁcient than running the task at full speedand idling for the remainder of the pe- 3.5.2. Indeterminism And Anomalies In Realriod. Though it appears straightforward, Systems. Even if the DVS algorithm pre-this naive picture of how DVS works is dicts correctly what the processor’s work-highly simpliﬁed and hides serious real- loads will be, determining how fast to runworld complexities. the processor is nontrivial. The nondeter- minism in real systems removes any strict 3.5.1. Unpredictable Nature Of Workloads. relationships between clock frequency, ex-The ﬁrst problem is how to predict work- ecution time, and power consumption.loads with reasonable accuracy. This re- Thus most theoretical studies on voltagequires knowing what tasks will execute scaling are based on assumptions thatat any given time and the work re- may sound reasonable but are not guar-quired for these tasks. Two issues com- anteed to hold in real systems.plicate this problem. First, tasks can One misconception is that total micro-be preempted at arbitrary times due to processor system power is quadratic inuser and I/O device requests. Second, supply voltage. Under the CMOS transis-it is not always possible to accurately tor model, the power dissipation of indi-predict the future runtime of an arbi- vidual transistors are quadratic in theirtrary algorithm (c.f., the Halting Problem supply voltages, but there remains no ACM Computing Surveys, Vol. 37, No. 3, September 2005.
Power Reduction Techniques for Microprocessor Systems 213precise way of estimating the power dis-sipation of an entire system. One canconstruct scenarios where total systempower is not quadratic in supply voltage.Martin and Siewiorek  found thatthe StrongARM SA-1100 has two supplyvoltages, a 1.5V supply that powers theCPU and a 3.3V supply that powers the I/Opins. The total power dissipation is domi-nated by the larger supply voltage; hence,even if the smaller supply voltage were al-lowed to vary, it would not reduce powerquadratically. Fan et al.  found that,in some architectures where active mem-ory banks dominate total system powerconsumption, reducing the supply voltagedramatically reduces CPU power dissi-pation but has a miniscule effect on to-tal power. Thus, for full beneﬁts, dynamicvoltage scaling may need to be combinedwith resource hibernation to power down Fig. 15. Scheduling effects. When DVS is applied (a), task one takes a longer time to enter its criti-other energy hungry parts of the chip. cal section, and task two is able to preempt it right Another misconception is that it is before it does. When DVS is not applied (b), taskmost power efﬁcient to run a task at one enters its critical section sooner and thus pre-the slowest constant speed that allows vents task two from preempting it until it ﬁnishesit to exactly meet its deadline. Several its critical section. This ﬁgure has been adopted from Buttazzo .theoretical studies [Ishihara and Yasuura1998] attempt to prove this claim. Theseproofs rest on idealistic assumptions such two tasks (Figure 15). When the processoras “power is a convex linearly increasing slows down (a), Task One takes longer tofunction of frequency”, assumptions that enter its critical section and is thus pre-ignore how DVS affects the system as a empted by Task Two as soon as Task Twowhole. When the processor slows down, is released into the system. When the pro-peripheral devices may remain activated cessor instead runs at maximum speed (b),longer, consuming more power. An impor- Task One enters its critical section earliertant study of this issue was done by Fan and blocks Task Two until it ﬁnishes thiset al.  who show that, for speciﬁc section. This hypothetical example sug-DRAM architectures, the energy versus gests that DVS can affect the order inslowdown curve is “U” shaped. As the pro- which tasks are executed. But the order incessor slows down, CPU energy decreases which tasks execute affects the state of thebut the cumulative energy consumed in cache which, in turn, affects performance.active memory banks increases. Thus the The exact relationship between the cacheoptimal speed is actually higher than the state and execution time is not clear. Thus,processor’s lowest speed; any speed lower it is too simplistic to assume that execu-than this causes memory energy dissipa- tion time is inversely proportional to clocktion to overshadow the effects of DVS. frequency. A third misconception is that execution All of these issues show that theo-time is inversely proportional to clock fre- retical studies are insufﬁcient for un-quency. It is actually an open problem derstanding how DVS will affect systemhow slowing down the processor affects state. One needs to develop an experi-the execution time of an arbitrary appli- mental approach driven by heuristics andcation. DVS may result in nonlinearities evaluate the tradeoffs of these heuristics[Buttazzo 2002]. Consider a system with empirically.ACM Computing Surveys, Vol. 37, No. 3, September 2005.
214 V. Venkatachalam and M. Franz Most of the existing DVS approaches formance and energy dissipation. Weisselcan be classiﬁed as interval-based ap- and Bellosa’s technique involves moni-proaches, intertask approaches, or in- toring these event rates using hardwaretratask approaches. performance counters and attributing them to the processes that are executing. At each context switch, the heuristic 3.5.3. Interval-Based Approaches. adjusts the clock frequency for the processInterval-based DVS algorithms mea- being activated based on its previouslysure how busy the processor is over measured event rates. To do this, thesome interval or intervals, estimate how heuristic refers to a previously con-busy it will be in the next interval, and structed table of event rate combinations,adjust the processor speed accordingly. divided into frequency domains. WeisselThese algorithms differ in how they and Bellosa construct this table throughestimate future processor utilization. a detailed ofﬂine simulation where, forWeiser et al.  developed one of the different combinations of event rates, theyearliest interval-based algorithms. This determine the lowest clock frequency thatalgorithm, Past, periodically measures does not violate a user-speciﬁed thresholdhow long the processor idles. If it idles for performance loss.longer than a threshold, the algorithm Intertask DVS has two drawbacks.slows it down; if it remains busy longer First, task workloads are usually un-than a threshold, it speeds it up. Since known until tasks ﬁnish running. Thusthis approach makes decisions based traditional algorithms either assume per-only on the most recent interval, it is fect knowledge of these workloads or es-error prone. It is also prone to thrashing timate future workloads based on priorsince it computes a CPU speed for every workloads. When workloads are irregular,interval. Govil et al.  examine a estimating them is nontrivial. Flautnernumber of algorithms that extend Past et al.  address this problem byby considering measurements collected classifying all workloads as interactiveover larger windows of time. The Aged episodes and producer-consumer episodesAverages algorithm, for example, esti- and using separate DVS algorithms formates the CPU usage for the upcoming each. For interactive episodes, the clockinterval as a weighted average of usages frequency used depends not only on themeasured over all previous intervals. predicted workload but also on a percep-These algorithms save more energy than tion threshold that estimates how fast thePast since they base their decisions on episode would have to run to be accept-more information. However, they all able to a user. To recover from predictionassume that workloads are regular. As we error, the algorithm also includes a panichave mentioned, it is very difﬁcult, if not threshold. If a session lasts longer thanimpossible, to predict irregular workloads this threshold, the processor immediatelyusing history information alone. switches to full speed. The second drawback of intertask ap- 3.5.4. Intertask Approaches. Intertask proaches is that they are unaware of pro-DVS algorithms assign different speeds gram structure. A program’s structurefor different tasks. These speeds remain may provide insight into the work re-ﬁxed for the duration of each task’s exe- quired for it, insight that, in turn, maycution. Weissel and Bellosa  have provide opportunities for techniques suchdeveloped an intertask algorithm that as voltage scaling. Within speciﬁc programuses hardware events as the basis for regions, for example, the processor maychoosing the clock frequencies for differ- spend signiﬁcant time waiting for mem-ent processes. The motivation is that, as ory or disk accesses to complete. Duringprocesses execute, they generate various the execution of these regions, the proces-hardware events (e.g., cache misses, IPC) sor can be slowed down to meet pipelineat varying rates that relate to their per- stall delays. ACM Computing Surveys, Vol. 37, No. 3, September 2005.
Power Reduction Techniques for Microprocessor Systems 215 3.5.5. Intratask Approaches. Intratask the ﬁrst of these algorithms. They no-approaches [Lee and Sakurai 2000; Lorch ticed that programs have multiple exe-and Smith 2001; Gruian 2001; Yuan and cution paths, some more time consum-Nahrstedt 2003] adjust the processor ing than others, and that whenever con-speed and voltage within tasks. Lee and trol ﬂows away from a critical path, thereSakurai  developed one of the earli- are opportunities for the processor to slowest approaches. Their approach, run-time down and still ﬁnish the programs withinvoltage hopping, splits each task into their deadlines. Based on this observation,ﬁxed length timeslots. The algorithm as- Shin and Kim developed a tool that pro-signs each timeslot the lowest speed that ﬁles a program ofﬂine to determine worstallows it to complete within its preferred case and average cycles for different exe-execution time which is measured as cution paths, and then inserting instruc-the worst case execution time minus the tions to change the processor frequency atelapsed execution time up to the current the start of different paths based on thistimeslot. An issue not considered in this information.work is how to divide a task into timeslots. Another approach, program checkpoint-Moreover, the algorithm is pessimistic ing [Azevedo et al. 2002], annotates a pro-since tasks could ﬁnish before their worst gram with checkpoints and assigns tim-case execution time. It is also insensitive ing constraints for executing code betweento program structure. checkpoints. It then proﬁles the program There are many variations on this idea. ofﬂine to determine the average numberTwo operating system-level intratask al- of cycles between different checkpoints.gorithms are PACE [Lorch and Smith Based on this proﬁling information and2001] and Stochastic DVS [Gruian 2001]. timing constraints, it adjusts the CPUThese algorithms choose a speed for ev- speed at each checkpoint.ery cycle of a task’s execution based on the Perhaps the most well known approachprobability distribution of the task work- is by Hsu and Kremer . They pro-load measured over previous cycles. The ﬁle a program ofﬂine with respect to alldifference between PACE and stochastic possible combinations of clock frequenciesDVS lies in their cost functions. Stochastic assigned to different regions. They thenDVS assumes that energy is proportional build a table describing how each combi-to the square of the supply voltage, while nation affects execution time and powerPACE assumes it is proportional to the consumption. Using this table, they selectsquare of the clock frequency. These sim- the combination of regions and frequen-plifying assumptions are not guaranteed cies that saves the most power without in-to hold in real systems. creasing runtime beyond a threshold. Dudani et al.  have also developedintratask policies at the operating systemlevel. Their work aims to combine in- 3.5.6. The Implications of Memory Boundedtratask voltage scaling with the earliest- Code. Memory bounded applicationsdeadline-ﬁrst scheduling algorithm. have become an attractive target for DVSDudani et al. split into two subprograms algorithms because the time for a memoryeach program that the scheduler chooses access, on most commercial systems, isto execute. The ﬁrst subprogram runs at independent of how fast the processor isthe maximum clock frequency, while the running. When frequent memory accessessecond runs slow enough to keep the com- dominate the program execution time,bined execution time of both subprograms they limit how fast the program can ﬁnishbelow the average execution time for the executing. Because of this “memory wall”,whole program. the processor can actually run slower In addition to these schemes, there and save a lot of energy without losinghave been a number of intratask poli- as much performance as it would if itcies implemented at the compiler level. were slowing down a compute-intensiveShin and Kim  developed one of code.ACM Computing Surveys, Vol. 37, No. 3, September 2005.