


ENERGY EFFICIENCY OF ARM ARCHITECTURES FOR CLOUD COMPUTING APPLICATIONS

Olle Svanfeldt-Winter

Master of Science Thesis
Supervisor: Prof. Johan Lilius
Advisor: Dr. Sébastien Lafond
Embedded Systems Laboratory
Department of Information Technologies
Åbo Akademi University
2011
ABSTRACT

This thesis evaluates how the energy efficiency of the ARMv7 architecture based processors Cortex-A9 MPCore and Cortex-A8 compares with that of Intel Xeon processors in applications such as a SIP-Proxy and a web server. The focus is on comparing the energy efficiency of the two architectures rather than just their performance. As the processors used in servers today have more computational power than the Cortex-A9 MPCore, several of these slower but more energy efficient processors are needed. Depending on the application, benchmarks indicate an energy efficiency 3-11 times greater for the ARM Cortex-A9 than for the Intel Xeon. The topics of interconnects between processors and the overhead caused by using an increasing number of processors are left for later research.

Keywords: Cloud Computing, Energy Efficiency, ARM, Erlang, SIP-Proxy, Apache
CONTENTS

Abstract
Contents
List of Figures
Glossary
1 Introduction
  1.1 Purpose of this thesis
  1.2 Cloud Software project
  1.3 Thesis structure
2 Energy efficiency of servers
  2.1 Throughput and latency
  2.2 Energy
  2.3 Large scale energy consumption
  2.4 Reducing energy consumption
  2.5 Energy proportional computing
  2.6 Energy efficient low power processors
  2.7 Summary
3 Evaluated computing platforms
  3.1 Hardware
  3.2 Software
  3.3 Benchmarks
  3.4 Summary
4 Performance comparison
  4.1 Apache results
  4.2 Emark results
  4.3 SIP-Proxy results
  4.4 Summary
5 Conclusions and future work
  5.1 Conclusions
  5.2 Future work
Bibliography
Swedish Summary
6 Energieffektivitet hos ARM-arkitektur för applikationer i datormoln
  6.1 Introduktion
  6.2 Energiförbrukning
  6.3 Förbättring av energieffektivitet
  6.4 Mätningar
  6.5 Slutsatser
A Results from Erlang benchmarking
LIST OF FIGURES

2.1 Monthly costs for server, power and infrastructure [1]
2.2 CPU contribution to total server power usage for two generations of Google servers. The rightmost bar shows the newer server when idling [2]
3.1 BeagleBoard block diagram [3]
3.2 OMAP3530 block diagram [4]
3.3 Block diagram of the Versatile Express with the Motherboard Express µATX, CoreTile Express A9x4 and LogicTile Express [5]
3.4 Top level view of the main components of the CoreTile Express A9x4 with the CA9 NEC chip [5]
3.5 Test setup for Apache test
3.6 Test setup for SIP-Proxy test
4.1 Comparison between CoreTile Express, Tegra and an Intel Pentium 4 powered machine running the Apache HTTP server
4.2 CPU utilization during test on machine with two Quad Core Intel Xeon E5430 processors
4.3 Number of requests handled for each Joule used by the CPU
4.4 CPU utilization for CoreTile Express during SIP-Proxy benchmark
4.5 Power consumption for the CPU in CoreTile Express during SIP-Proxy test
4.6 Performance of reference machine with two Quad Core Xeons using an increasing number of schedulers
4.7 Number of calls handled for each Joule used by the CPU
5.1 Achievable energy dissipation reduction by the usage of more efficient processors
5.2 Achievable energy dissipation reduction when moving to more efficient processors
GLOSSARY

SMP: In symmetric multiprocessing two or more processors are connected to the same main memory.
DVFS: Dynamic Voltage and Frequency Scaling is used to adjust the input voltage and clock frequency according to the momentary need, in order to avoid unnecessary energy consumption.
Power gating: Cutting power to parts of a chip when the particular part is not needed.
Data center: Facility that houses computer systems.
Server farm: A server farm is a collection of servers. They are used when a single server is not capable of providing the required service.
CPU: Central processing unit.
Cloud: Platform for computational and data access services where details of the physical location of the hardware are not necessarily of concern for the end user.
Granularity: Granularity describes the extent to which a system is broken down into smaller parts.
DMIPS: Dhrystone MIPS. Obtained by dividing a Dhrystone benchmark score by 1757.
1 INTRODUCTION

Cloud computing systems often use large server farms in order to provide services. These server farms have high energy consumption. The energy is not only needed to run the servers themselves but also for the systems that keep them cool. Energy consumption is seen as both an economic and an ecological issue. Regardless of whether one wants to save money or to cause less strain on the environment, the solution is the same: to reduce the energy consumption.

The approach presented in this thesis to reduce the power dissipated by server farms is to replace their processors with ones that are more energy efficient. The architectures of processors used in smartphones and embedded systems have been designed with energy efficiency in mind from the beginning, something that has not been the case with the x86 architecture usually found in servers. This makes the processors used in embedded systems interesting candidates when looking for replacements for regular x86 architecture based server processors.

Although the processors used in mobile devices use less energy for executing a single instruction, the execution of a single instruction is often not directly comparable to an instruction execution on a regular server processor, due to factors such as differences in the instruction sets. Also, the computational power of a single low power processor is generally modest compared to traditional desktop and server processors. Moving to processors with lower individual performance increases the number of processors needed to provide the same service as before. Distributing work over a larger number of processors increases the importance of parallelism on the software side. Applications that use a lot of I/O operations will see less of a performance drop than those that are more computationally intensive when switching to processors with lower individual performance, as the speed of I/O is dependent on other components than just the CPU.
Applications designed to be run on server farms are already designed to be distributable, in order to use the added resources of a server farm compared to those of a single server.
Applications such as gateways in telecommunication have a large number of requests to serve, but the requests are light and generally independent of each other. Also, tasks performed by web servers, such as serving static webpages, are suitable candidates to be run in clouds of low power processors, as the services provided are not CPU intensive. The suitability to provide services using low power energy efficient processors will be evaluated using benchmarks.

General performance benchmarks for the Erlang virtual machine (VM) will be used in addition to Apache benchmarking and a SIP-Proxy running on top of Erlang. General performance benchmarks will be used to evaluate the performance of the Erlang run time system (erts) for a variety of tasks such as message passing and pure number crunching, thereby comparing the performance between the different architectures. The obtained results are used to evaluate why some tasks run better on some hardware than others and to explain the performance differences for realistic applications.

The extent of this thesis is limited to how well a single Cortex-A9 MPCore and a Cortex-A8 perform in comparison to processors such as the Intel Xeon. The topics of interconnects between processors and overhead caused by using an increasing number of processors are left for later research.

1.1 Purpose of this thesis

The purpose of this thesis is to evaluate the energy efficiency of ARM Cortex-A9 MPCore based processors compared to x86 based processors for telecom systems and other Cloud like services. In addition to the energy efficiency of the Cortex-A9 MPCore processors, a single core Cortex-A8 processor will also be evaluated. To make an energy efficiency comparison possible, the performance of the processors will first be evaluated. The main interest is the comparison between the energy efficiency of the two architectures rather than the energy efficiency of particular processor models.
To achieve this, several processors based on the same underlying architecture will be evaluated. The ability to provide simple web services will also be evaluated on the same ARM based test machines. For testing how well the test machines perform in providing simple web services, Apache 2.2 will be used. The focus is on the ability to provide static web pages. For the telecommunication part the focus will be on a SIP-Proxy running on top of the Erlang VM. Micro benchmarks that stress different
aspects of the Erlang VM are used to evaluate how efficiently the ARM based processors can handle different tasks. The important metric is how much performance is achieved compared to the amount of energy used, rather than the pure performance of the processors. The potential for energy saving achieved from using ARMv7 based processors in servers, compared to servers based on x86 processors, will be evaluated. To make the comparison realistic only the efficiency of the processors will be considered. The potential for energy efficiency improvement for the rest of the components in the test machines is not considered. The impact on total data center infrastructure and power cost will be analyzed using a cost model for a hypothetical data center.

1.2 Cloud Software project

The Cloud Software Program (2010-2013) is a SHOK program financed through TEKES and coordinated by Tivit Oy. Its aim is to improve the competitive position of the Finnish software intensive industry in the global market [6]. The content of this thesis is part of the project.

The research focus for the project in the Embedded Systems Laboratory at the Department of Information Technologies at Åbo Akademi is to evaluate the potential gain in energy efficiency from using low power nodes to provide services. In addition to energy efficiency, the total cost of ownership for the cloud server infrastructure plays a central role.

1.3 Thesis structure

Chapter 2 begins with the introduction of concepts such as energy and energy efficiency, followed by why energy efficiency is an issue for cloud service providers and how much and where energy is consumed in data centers. Methods to reduce energy consumption, as well as the concept of energy proportional computing, are also presented in chapter 2. The chapter ends with a discussion of the motivations and theories on why the usage of energy efficient low power processors is a viable option.
Chapter 3 presents the hardware and software used in the evaluations, as well as the benchmarks. Chapter 4 presents the results from the benchmarks presented in Chapter 3. Comparisons between the results for the different test machines in the benchmarks are also presented in this chapter. Chapter 5 shows conclusions together with suggestions for
future work.
2 ENERGY EFFICIENCY OF SERVERS

In this chapter concepts such as energy, energy efficiency and energy proportionality will be presented. Metrics necessary to evaluate energy efficiency will also be briefly discussed, continued by the subject of why energy consumption is both an economic and a practical issue. An example of how much energy is being used by server farms, and the costs associated with it, is shown, covering both direct energy costs and costs related to energy related infrastructure. The chapter ends by presenting methods used to decrease energy consumption and by discussing why using energy efficient low power nodes would be an option.

2.1 Throughput and latency

It is important to notice that latency and throughput do not always correlate. Even if two systems have the same throughput, the time to serve a single request is not necessarily the same. A system that uses one node to provide a certain service has to process the individual requests faster than one consisting of several units. If the number of nodes used to provide a certain service is doubled, the requirements are naturally cut in half for each unit. The formula below shows the relation between throughput, latency and the number of nodes.

Throughput = AvailableNodes × (1 / Latency)

The unit of throughput can vary greatly. If the performance of a web server is evaluated, the throughput can be defined as the number of requests served each second. The latency is the time taken to serve a request, and the available nodes simply indicates how many nodes are available to provide the service. The throughput can be kept on the same level even if the latency increases, provided that the number of nodes is increased to compensate. For example, if one server with the ability to serve
a hundred requests per second were replaced by servers with the ability to serve 10 requests each, ten of the less powerful servers would be needed to achieve the same performance. It is implied that the minimum latency for the service cannot be lower than the minimum latency for a node. How long a latency is acceptable depends on the service being provided: a phone call is likely to have stricter requirements for the response time than a service for downloading files.

2.2 Energy

Energy is generally described as the ability to perform work. It can be of many forms, such as kinetic, thermal or electric. In this thesis the focus is on electric and thermal energy, as computers use electric energy and transform it into thermal energy. Power is the rate at which energy is being transformed into another form. In the case of computers the conversion is from electrical energy to thermal energy. The unit of power is the Watt (W). Electric energy is measured in Joules (J) and is defined as power multiplied by time.

Energy = AvgPower × Time

Energy efficiency is the amount of work done compared to the amount of energy used. The exact way to measure and compare energy efficiency varies depending on the particular application. When considering the energy efficiency of services provided by a cloud or a server, it could be, for example, how many Joules are needed for a transaction, or for retrieving a file from a web server. Depending on the application the metrics can vary greatly.

As stated by the first law of thermodynamics, energy is never created or destroyed, only converted to other forms. All the energy that a computer, or part of a computer, consumes will be converted into heat. The more energy the component consumes, the greater the problem with heat dissipation becomes. The ability to dissipate heat is dependent on many factors, such as surface area and material. Heat sinks are often used to increase the ability of the component in question to dissipate heat.
Regardless of thermal dissipation capabilities and the amount of thermal energy dissipated, the heat has to be transferred somewhere in order to avoid overheating issues.
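The throughput and energy relations above can be sketched numerically. The node counts, latencies and power figure below are hypothetical values chosen only to illustrate the formulas, not measurements from the evaluated machines.

```python
# Illustration of Throughput = AvailableNodes * (1 / Latency)
# and Energy = AvgPower * Time, using hypothetical numbers.

def throughput(available_nodes, latency_s):
    """Requests per second a pool of identical nodes can sustain."""
    return available_nodes * (1.0 / latency_s)

def energy_joules(avg_power_w, time_s):
    """Electric energy consumed: power (W) multiplied by time (s)."""
    return avg_power_w * time_s

# One fast server: 10 ms per request -> 100 req/s.
fast = throughput(1, 0.010)

# Ten slow servers: 100 ms per request each -> the same 100 req/s,
# at ten times the per-request latency.
slow = throughput(10, 0.100)

assert fast == slow == 100.0

# Energy efficiency expressed as requests served per Joule of CPU
# energy: e.g. 100 req/s at an assumed average CPU power of 80 W.
req_per_joule = fast / energy_joules(80.0, 1.0)
print(req_per_joule)  # 1.25
```

The same requests-per-Joule metric is the one used in the benchmark comparisons of chapter 4.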
2.3 Large scale energy consumption

Small computer systems such as personal computers can generally be cooled by a few fans, as the space they are kept in is relatively large in comparison to the amount of heat that is generated. When having a large number of servers in the same place, the amount of heat adds up. Also, the modest energy consumption of a regular home computer is often not a great economic issue, as the power needed is on the same order of magnitude as a few incandescent light bulbs. The more densely hardware is stacked in order to fit as much equipment as possible in the smallest possible space, the more heat is also produced in the same space.

The biggest consumer of energy in a server is the CPU, with approximately 45 percent of the total consumption [2]. The energy consumption ratios between different components vary depending on the configuration of the server. In servers where several disk drives are used for data storage, the energy consumption of the disk drives also becomes significant [7]. According to Schäppi et al. the total energy consumption of data centers has been increasing for years [8]. In 2006 the energy consumption of servers in Western Europe (EU 15 and Switzerland) was 14.7 TWh [8]; this does not include any energy consumed by the infrastructure, such as cooling, lighting and UPS. Schäppi et al. also state that the complete energy consumption of the data centers in the same region is 36.9 TWh. It is not uncommon for data center service providers to boast of high energy efficiency, both for their servers and for the data centers as a whole. Companies do not, however, generally present exact data on energy consumption and the technical specifications of their centers to the public, making accurate estimates difficult. Several different metrics for energy efficiency on a data center scale are used.
Power Usage Effectiveness (PUE) and DCiE are metrics defined by the Green Grid in a white paper called "The green grid power efficiency metrics: PUE & DCiE" [9]. The definitions of PUE and DCiE are shown below.

PUE = TotalFacilityPower / ITEquipmentPower [9]

DCiE = 1 / PUE = ITEquipmentPower / TotalFacilityPower × 100% [9]

IT equipment power includes the servers but also network equipment and equipment used to monitor and control the data center. Total facility power includes, in
addition to the IT equipment: cooling, UPS, lighting and distribution losses external to the IT equipment [9].

In an ideal data center the PUE would be 1, which would mean that all power used by the center is used to power the IT equipment. According to "The green grid power efficiency metrics: PUE & DCiE", preliminary data shows that many data centers have a PUE of 3.0 or greater [9].

Companies that provide cloud services need large amounts of computer resources. When using cloud computing the user does not need to worry about the resources locally, and many new data centers are being built to provide the required resources. Several companies including Google and Microsoft are building data centers [10] with increasing numbers of servers. Many of the centers are so large that instead of using a server rack as the basic unit, shipping containers are used [10] [11]. For example, Google uses shipping containers to house servers in their data centers. One container is reported to house 1160 servers, and the power consumption of just one container is reported to be up to 250 kW [11]. Using the reported values, one server would use approximately 216 W.

In 2008 Microsoft announced that they were building a data center containing 300 000 servers [12]. If the power consumption of the servers in Microsoft's new server farm is the same as that reported by Google, the power consumption of the servers in the farm is approximately 65 MW. The fact that the servers are packed tightly also means that the challenges for cooling are increasing. The problem of large heat dissipation is being addressed in different ways; for example, Intel provides energy efficient versions of some of its Xeon processors intended especially for high density blade servers [13].
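The PUE and DCiE definitions, together with the container figures cited above, can be checked with a few lines of arithmetic. The 600 kW facility below is a made-up example; the 1160-server and 250 kW figures are the reported values from [11].

```python
# PUE and DCiE as defined by the Green Grid [9], plus the
# per-server power implied by the Google container figures [11].

def pue(total_facility_power_w, it_equipment_power_w):
    return total_facility_power_w / it_equipment_power_w

def dcie_percent(total_facility_power_w, it_equipment_power_w):
    # DCiE is the reciprocal of PUE, expressed as a percentage.
    return 100.0 / pue(total_facility_power_w, it_equipment_power_w)

# Hypothetical facility: 600 kW total draw, 300 kW of IT equipment.
assert pue(600e3, 300e3) == 2.0
assert dcie_percent(600e3, 300e3) == 50.0

# Reported container: 1160 servers drawing up to 250 kW in total.
watts_per_server = 250e3 / 1160
print(round(watts_per_server))  # 216, matching the text

# Scaling the same per-server figure to 300 000 servers gives the
# roughly 65 MW estimated for the Microsoft farm.
print(round(watts_per_server * 300_000 / 1e6))  # 65
```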
The more energy efficient versions are generally more expensive. For example, the Intel Xeon L5430 costs 562 € [14] and the E5430 costs 455 € [15]. Having to pay less for keeping the servers running while still providing the same services makes new business opportunities possible and increases the profit for current business areas.

In order to evaluate the potential savings from a reduction of the energy consumption, the total cost structure of a server farm must be analyzed. Hamilton [1] presents a cost analysis for a hypothetical data center. To enable the comparison between cost elements such as infrastructure, hardware and power, amortization times are defined for the investments. Hamilton's hypothetical data center is designed to have a 15-year amortization time for infrastructure and a 3-year amortization time for the servers. A five percent annual cost for the capital used to build the data center is assumed. The cost of power is set at $0.07/kWh for this example. The costs of the data center can be seen in the pie chart shown in Figure 2.1. The chart shows that the direct cost of power is 19 percent of the total cost. Hamilton continues by pointing out that for the hypothetical data center 82 percent of the infrastructure costs consist of power and cooling infrastructure, and that thereby the maximum power consumption of the servers is reflected in the infrastructure costs. In Hamilton's hypothetical data center the combined cost of power and cooling infrastructure, and the actual power, is 42 percent. Hamilton writes that the power consumption contribution is 23 percent of the total cost. The numbers in the graph do not support that statement: according to the graph, the contribution of power and cooling infrastructure to the total cost is 23 percent and the cost of power is 19 percent. These are the values that will be used in chapter five.

Figure 2.1: Monthly costs for server, power and infrastructure [1]

2.4 Reducing energy consumption

In order to improve the energy efficiency of a computer, the causes of its energy consumption must be known. Knowing the contributions of the main components in a modern computer helps to focus only on the critical components. Modern computer systems are built using CMOS circuits. The causes of energy consumption in a CMOS circuit are divided into static power consumption and active power consumption. The static power consumption is caused by unintended leakage currents within the circuits.
The static power consumption can be reduced by bringing down the number of active transistors and turning parts of the chip off when not needed. Another factor that affects the static power consumption is the supply voltage. Active power consumption is caused by switching the states of the transistors and is thereby dependent on the usage of the circuit.

The time taken to charge and discharge a capacitor is dependent on the voltage used. A higher voltage allows a shorter switching time and thereby a higher clock frequency. The lowest voltage possible should be used for the planned clock frequency in order to be energy efficient. Dynamic Voltage and Frequency Scaling (DVFS) is a method to reduce the energy consumption of a processor at times when it is not required to run at full capacity. DVFS works by varying both the voltage and clock frequency of the processor, depending on the performance required at a specific time [16].

The number of transistors in a CPU has increased approximately as predicted by Moore's law for the last forty years, which means doubling roughly every two years [17]. The number of transistors is reflected in chip performance. David A. Patterson points out that the bandwidth (performance) improvement of CPUs has been faster than for other components [18]. The annual improvements can be seen in Table 2.1. The performance increases shown in the table are without units, as the table only shows the annual improvement for each type of component in comparison to similar components from previous years. While the difference in improvement per year is not huge, the difference has been building up for many years. Patterson points out that bandwidth between components such as the CPU and memory can always be improved by adding more communication paths between them, but that this is costly and causes an increase in energy consumption and the size of the circuits.
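As a rough illustration of why DVFS saves energy, the standard first-order model puts dynamic CMOS power at P ≈ α·C·V²·f (activity factor, switched capacitance, supply voltage, clock frequency). This model is a textbook approximation, not something measured in this thesis, and the constants below are arbitrary:

```python
# First-order dynamic power model for a CMOS circuit:
#   P_dynamic = alpha * C * V^2 * f
# Lowering the frequency alone scales power linearly; lowering the
# voltage together with the frequency (as DVFS does) scales it much
# faster, because voltage enters squared. Alpha and C are arbitrary
# illustrative values.

def dynamic_power(alpha, capacitance_f, voltage_v, frequency_hz):
    return alpha * capacitance_f * voltage_v**2 * frequency_hz

ALPHA, CAP = 0.2, 1e-9  # activity factor, switched capacitance (F)

full = dynamic_power(ALPHA, CAP, 1.2, 1.0e9)    # full speed at 1.2 V
half_f = dynamic_power(ALPHA, CAP, 1.2, 0.5e9)  # half frequency only
dvfs = dynamic_power(ALPHA, CAP, 0.9, 0.5e9)    # half f, lowered V

assert abs(half_f / full - 0.5) < 1e-9
# Scaling the voltage as well cuts dynamic power well below half:
print(f"DVFS point uses {100 * dvfs / full:.0f}% of full-speed power")
# prints "DVFS point uses 28% of full-speed power"
```

This is why the text stresses using the lowest voltage possible for the planned clock frequency: the quadratic voltage term dominates the savings.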
In addition to the disproportionate increase in performance for the components in Table 2.1, Patterson raises concerns that latency has improved less than bandwidth. Patterson continues by pointing out that marketing has been one reason for this imbalance: an increase in bandwidth is easier to sell than a decrease in latency. Finally, Patterson reminds us that certain methods created to improve bandwidth, such as buffering, have a negative effect on latency [18].

The time it takes for a computer to execute a process is not only dependent on the speed of its CPU. Other components in a computer, such as the random-access memory (RAM), are not as fast as the CPU. The speed difference causes the CPU to waste many clock cycles waiting for memory transactions. If data has to be fetched from a hard
drive disk (HDD) the waiting period is further increased.

CPU: 1.50   DRAM: 1.27   LAN: 1.39   HDD: 1.28

Table 2.1: Relative annual bandwidth improvement of different computer components during the last 20-25 years [18]

If there is more than one task running on the same system, and the tasks are running independently of each other, the system might well be able to execute other tasks while one is waiting for I/O. This works well if most tasks do not require I/O operations and access to memory. If the purpose of a system is mainly to run tasks that are I/O intensive and require lots of memory accesses, much time is potentially wasted for the CPU.

As Hamilton [1] points out, there are at least two ways of dealing with the performance imbalance problem. One is to simply invest in better bandwidth and communication paths between the memory and CPU. Another way is to avoid the problem by using lower-powered and cheaper CPUs that do not need as fast memory [1]. Hamilton also points out that because server hardware is built with higher quality requirements, and in lower volumes, than client hardware, it is more expensive. Hamilton continues that "When we replace servers well before they fail, we are effectively paying for quality that we're not using" [1]. The energy efficiency is in general better for newer hardware, adding to the pressure to upgrade to newer servers.

2.5 Energy proportional computing

According to Barroso and Holzle [2] a server is generally operating at 10 to 50 percent of its maximum capacity but is rarely completely idle. Having data on several servers improves the availability of the data; a side effect is, however, that more servers must be online. In a case where servers would be completely idle for significant times, powering down a part of the server farm would allow significant power savings.
In practice some sort of load and task migration/management system would be needed to distribute tasks in a favorable manner between the available servers, in order to allow powering down a larger number of servers. Barroso and Holzle continue to state that even when a server is close to idle it still consumes about half of its peak
power consumption. In a completely energy proportional server no energy would be used while the server is idle. Complete energy proportionality is not feasible with the manufacturing techniques and materials of today's processors, due to leakage currents.

Regardless of the average power consumption during standard operation, a data center must still have the infrastructure to support the maximum power that the servers can use, or are allowed to use. Reducing the peak power consumption also reduces the demand on the power and cooling infrastructure, the part of the infrastructure that is responsible for 82 percent of the total infrastructure costs [1]. The energy consumption of a server is not necessarily the same as the combined peak power of the components the server is built from [7]. The maximum peak power consumption measured for a server constructed for the example was less than 60 percent of the combined peak power consumption of its components. Fan et al. [7] continue to state that the power consumption is also application specific. From the tests performed in preparation for this thesis, it is clear that even if the system reports full CPU utilization the actual power consumption of the CPU can vary. Furthermore, Fan et al. [7] state that in the case of an actual data center the consumption is 72 percent of the actual peak power consumption.

Figure 2.2: CPU contribution to total server power usage for two generations of Google servers. The rightmost bar shows the newer server when idling [2]

Figure 2.2 from [2] shows the percentage of energy consumption that the CPU contributes to the total energy consumption of the server. The data are from two servers
used by Google in 2005 and 2007. The graphs show that for the newer server the CPU contribution to the total consumption is approximately 45 percent at peak power consumption and approximately 27 percent when idle. The power saving mechanisms on the server are unknown, but from the data provided in the graph the power saving works better on the processor itself than on the server as a whole, because the contribution made by the processor is smaller when the server is idling. Barroso and Holzle also point out that they have experienced dynamic power ranges for DRAM below 50 percent: 25 percent for disk drives and 15 percent for networking switches [2]. The observations are in line with the results in Figure 2.2.

The authors in [7] claim that peak power consumption is the most important factor for guiding server deployment in data centers, but that the power bill is defined by the average consumption. A lower peak power consumption for the servers allows for a larger number of servers within the same energy budget, leading to a higher utilization level of the cooling and power infrastructure and thereby a more effective use of the available resources and budget. The requirements for both cooling and power, including UPS, are reduced with lower peak power consumption.

2.6 Energy efficient low power processors

Servers have generally been constructed for high performance, using high performance processors rather than energy efficient ones. Processors that originate from embedded systems are, in contrast, mainly built for energy efficiency. This is due to both thermal constraints and power constraints from battery powered devices. These kinds of processors hardly ever need active cooling, despite their small physical size.

By replacing an energy hungry high performance processor with a set of energy efficient processors originating from battery powered embedded systems, the energy consumption can be reduced.
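The notion of energy proportionality from section 2.5 can be quantified with a simple linear model. The roughly 50 percent idle power follows the Barroso and Holzle observation cited above; the 200 W peak power is a made-up example value:

```python
# Simple server power model: idle power is a fixed fraction of peak
# (about 0.5 per Barroso and Holzle [2]), and power rises linearly
# with utilization. A perfectly energy proportional server would
# have idle_fraction = 0.

def server_power(utilization, peak_w, idle_fraction=0.5):
    idle_w = idle_fraction * peak_w
    return idle_w + (peak_w - idle_w) * utilization

PEAK = 200.0  # Watts, illustrative

# At 30% utilization (typical of the 10-50% range cited above):
actual = server_power(0.30, PEAK)                           # 130 W
proportional = server_power(0.30, PEAK, idle_fraction=0.0)  #  60 W

assert actual == 130.0 and proportional == 60.0
# The non-proportional server draws more than twice the power a
# perfectly proportional one would at the same load.
print(actual / proportional)  # ~2.17
```

The gap between these two curves is exactly the saving that techniques such as DVFS, power gating and sleep states try to recover.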
General purpose processors for embedded systems are produced in large numbers. To get the same amount of work done, a larger number of the slower processors is needed. When increasing the number of processors, the granularity of the power consumption also increases. In order to improve energy efficiency by changing processors, the processors used must be at least as energy efficient as the ones that should be replaced. Processors used in battery powered devices where computational power is required are ideal for evaluating the energy saving potential. The ARM Cortex-A8 and the Cortex-A9 MPCore are tested for this purpose. When
using DVFS to reduce the energy consumption of the processor, the server continues in an operational state. A much greater energy reduction can, however, be achieved by entering a sleep state, where a processor is turned off and thereby not able to do any calculations until it is woken up. The time to enter and return from a sleep state is generally longer than when changing between power states using DVFS. In a server with multiple processors it could be possible to put the ones that are not needed at the moment in a sleep state, in order to reduce power consumption. In this case it is, however, important to be able to predict how long switching between different power states takes, and to know if the service deadlines allow for such a latency.

2.7 Summary

In a perfectly power proportional server the instantaneous energy consumption is proportional to the required service level. An idling server would not use any energy, and a server functioning at half capacity would use half of the server's peak power consumption. Techniques such as DVFS and power gating are used to increase energy proportionality. In a modern computer based on CMOS circuits, complete energy proportionality is not achievable due to leakage currents. While the average power consumption determines the amount of energy actually used, the peak power consumption is what defines the required capacity of the cooling and power infrastructure. The cost of power and cooling infrastructure combined with direct energy consumption in a data center is 42 percent of the total costs, and approximately 45 percent of a server's peak power consumption is caused by the server's processor or processors.

Processors intended for usage in embedded devices are designed for energy efficiency, in contrast to server processors that are designed for performance.
The potential benefit from replacing server grade processors based on the x86 architecture used in modern servers with more energy efficient ARM Cortex-A9 MPCore processors is evaluated in this thesis.
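The energy proportionality definition summarized above can be made concrete with a toy power model. This is a sketch, not a model used in the thesis; the 50 percent idle floor is an assumption chosen only to illustrate the contrast with the ideal case:

```python
def ideal_power(load, peak_watts):
    """Perfectly energy proportional server: power scales linearly with load (0..1)."""
    return load * peak_watts


def real_power(load, peak_watts, idle_fraction=0.5):
    """Simple model with an idle power floor, since real servers draw power even when idle.

    idle_fraction is a hypothetical parameter, not a measured value.
    """
    return peak_watts * (idle_fraction + (1 - idle_fraction) * load)


# At half capacity, an ideal 200 W server draws 100 W; the modelled real one 150 W.
print(ideal_power(0.5, 200), real_power(0.5, 200))
```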
3 EVALUATED COMPUTING PLATFORMS

This chapter describes the hardware and software that has been used for the benchmarking. The evaluated hardware uses the ARMv7 architecture based Cortex-A8 and Cortex-A9 MPCore processors. The platforms that will be used for testing are the BeagleBoard, the Versatile Express with a CoreTile Express A9 MPCore daughter board, and a Tegra 250 development board. The benchmarks presented in this chapter are used to evaluate the performance of the following applications: the Apache 2 HTTP server, an Erlang based SIP-Proxy used for session management, and some benchmarks testing various aspects of the Erlang virtual machine itself. What SIP and a SIP-Proxy are will be covered, as well as Erlang.

In order to evaluate how well a cluster of ARM Cortex-A8 and Cortex-A9 MPCore processors is suited to replace processors used in today's servers, the performance of single processors must first be evaluated. It is possible to determine how the energy efficiency compares between the different architectures by running comparison benchmarks with machines that are built using processors based on the x86 architecture. The performance of the processor architecture will be better shown by using two different Cortex-A9 MPCore processors compared to using only one. As the two Cortex-A9 MPCore machines have different clock frequencies as well as a different number of cores, the scaling properties for the two performance increasing options can be evaluated. How the energy efficiency is affected by these factors must also be evaluated; in practice, whether increasing a processor's clock frequency or adding additional cores is more beneficial when looking at the performance per watt. If the results for a single ARM processor show worse energy efficiency compared to modern server processors, a cluster of the low power processors will then also have a worse energy efficiency.
3.1 Hardware

3.1.1 BeagleBoard

The BeagleBoard [3] is a low cost system based on the ARM Cortex-A8 processor with low power requirements. The version of the BeagleBoard that was used for the measurements is the C3. It is equipped with a TI-OMAP3530 chip with an ARM Cortex-A8 processor running at 600MHz. The main storage device is a Micro SD card and there is 256MB DDR RAM available. A block diagram of the BeagleBoard is shown in figure 3.1 and a block diagram of the OMAP3530 chip in figure 3.2. The BeagleBoard that was used for the benchmarking had Ångström Linux installed with kernel version 2.6.32.

The first tests were run on a BeagleBoard B5. The B5 was later replaced by a C3, as it has double the amount of RAM compared to the B5, allowing a wider range of tests to be run. Neither model has an Ethernet port built in; a USB to Ethernet adapter was therefore added to get Ethernet connectivity. An improvement from the B5 model to the C3 model is a USB A port in addition to the OTG mini USB port on the B5 board. Ethernet connectivity was not built in before the new xM model, and also on the xM it is a USB based Ethernet solution [19]. A USB to Ethernet adapter was found to be the best way to get network connectivity to the BeagleBoard. Due to the USB to Ethernet adapter, the maximum bandwidth is limited by the speed of USB 2.0 to 480 Mbps. This was not a serious limitation, due to the limited performance of the BeagleBoard in the benchmarks.

Although the BeagleBoard did give some indications of how the test programs performed on an ARM based system, the tests were not conclusive. This is because of the many differences compared to “normal” computer systems, caused not only by the processor architecture but also by the small amount of RAM and the speed of the RAM. The slow speed of the Micro SD card used for main storage was also slowing down the entire system.
In the tests with Erlang, a non-SMP version of the Erlang runtime system (erts) was used, as the Cortex-A8 only has one core.
Figure 3.1: BeagleBoard block diagram [3]

Figure 3.2: OMAP3530 block diagram [4]

3.1.2 Versatile Express

The Versatile Express [5] development platform that was used consisted of the Versatile Express Motherboard (V2M-P1) with a CoreTile Express A9 MPCore [5] (V2P-CA9) daughter board. In addition to the Quad Core Cortex-A9 MPCore, the daughter board has 1GB of DDR2 memory with a 266MHz clock frequency [5]. A block diagram of the Versatile Express can be seen in figure 3.3. The diagram shows a second daughter board, a LogicTile Express, in addition to the CoreTile Express; the particular machine used for the benchmarking did not have a LogicTile Express installed.

Figure 3.3: Block diagram of the Versatile Express with the Motherboard Express µATX, CoreTile Express A9x4 and LogicTile Express [5]

The ARM processor on the daughter board is a CA9 NEC [5] chip clocked at 400MHz with limited power management functions. Power gating and DVFS are not
supported on the chip [5], which needs to be noted when considering the power consumption of the system. A top level view of the chip can be seen in figure 3.4. As powering cores on and off is the main power reduction technique available on this particular chip, the power consumption is not precisely matched to the required performance. In a system where power gating and DVFS are available, the possibilities for power proportional computing are better.

The Versatile Express does, however, allow monitoring of both operating voltage and power consumption. To use this functionality, a kernel module was created and loaded into the kernel on the V2P-CA9 to enable usage of the necessary registers for collecting voltage, current and power consumption data. The registers used are VD10_S2 and VD10_S3. VD10_S3 is the power measurement device for the Cortex-A9 system supply: cores, MPEs, SCU and PL310 logic [5]. VD10_S3 is the most interesting power measurement supply for this comparison. VD10_S2 is the current measuring device for the PL310, L2 cache and SRAM cell supply. A program was created that read the values for voltage, current and power for both supplies once every second and stored them for further use. The use of the program allowed continuous monitoring during benchmarking. Data logging at shorter intervals was also tested, but was discontinued in order to reduce the interference caused by the data collection, and because the added value it brought was negligible.

Furthermore, several possibilities exist for changing settings on the chip, such as the speed of the memory and the clock frequency for the cores. Suitable frequency combinations for the different clocks must be carefully calculated in order for the system to work properly. These changes must be done while the system is offline, as the settings are stored on a memory card on the Versatile Express Motherboard and read from there on startup.
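The once-per-second logging loop can be sketched as follows. This is an illustrative sketch, not the actual kernel module or logging program used for the measurements: the register access is abstracted behind a `read_sample` callback, and the function and field names are my own:

```python
import time


def log_samples(read_sample, count, interval=1.0, sleep=time.sleep):
    """Poll a voltage/current/power source `count` times, `interval` seconds apart.

    `read_sample` is a callable returning a (voltage, current, power) tuple,
    standing in for the registers (e.g. VD10_S3) exposed by the kernel module.
    `sleep` is injectable so the loop can be tested without real delays.
    """
    samples = []
    for i in range(count):
        samples.append(read_sample())
        if i < count - 1:
            sleep(interval)  # one-second spacing, as used during benchmarking
    return samples


# Example with a stand-in reader (real values came from the V2P-CA9 registers):
fake_reader = lambda: (1.0, 0.45, 0.45)  # volts, amps, watts
print(log_samples(fake_reader, 3, interval=0))
```

Sampling at shorter intervals would follow the same pattern, but, as noted above, was found to interfere with the measurements without adding much value.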
A Debian installation with a 2.6.28 Linux kernel was provided with the Versatile Express. Official support for the Versatile Express in the Linux kernel was not added until version 2.6.33. To determine the reasons for unexpected performance differences between the test platforms, mainly between the CoreTile Express and the Tegra 250, the impacts of different parts of the test platforms were examined. Kernel version 2.6.33 was used to evaluate whether unexpected performance differences were caused by the kernel version previously used. The main reasons for choosing version 2.6.33 for the test are that it supports the Versatile Express and that it is the closest possible version to the 2.6.32 used on the Tegra. Having the Tegra 250 and the CoreTile Express use similar software is useful in order to find
differences caused by the hardware. The operating system was installed on a USB flash drive, as the native memory card on the Versatile Express was significantly slower than the USB flash drive.

Figure 3.4: Top level view of the main components of the CoreTile Express A9x4 with the CA9 NEC chip [5]
3.1.3 Tegra

The Tegra [20] is a Tegra 200 series developer kit with a Tegra 250 system, intended to support software development. The Tegra 250 chip includes a dual core Cortex-A9 MPCore chip running at 1GHz. The board also contains 1GB of DDR2-667 RAM and is equipped with a SMSC LAN9514 USB hub with integrated 10/100 Ethernet. The Tegra 250 board used had an additional PCI Express Gigabit Ethernet card installed in order to avoid networking bottlenecks. Compared to the older chip on the Versatile Express, the newer Cortex-A9 has both more advanced power management features and a different networking implementation. By evaluating the performance of both the Tegra and the CoreTile Express, the aim is to identify how a varying number of cores and differences in clock frequency are reflected in the performance when running different applications. Ubuntu 10.04 was installed on the board with Linux kernel version 2.6.32, which was provided by Nvidia. As the only compatible kernel version available for the Tegra was 2.6.32, and the Versatile Express was not supported before 2.6.33, the two Cortex-A9 systems could not use the same kernel version. For the benchmarks, the Tegra 250 board was controlled through its serial port and the two Ethernet ports.

As information on the power consumption of either parts of, or the entire, Tegra 250 chip was not available, the values used are estimates derived from information released by ARM [21]. The Tegra 250 chip also includes several other specialized processors in addition to the Cortex-A9 MPCore. This makes the process of measuring the energy consumption of the Cortex-A9 even more challenging. There is little information available on the exact configuration and manufacturing process of the Tegra 250.
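The figures ARM has published for the two Cortex-A9 reference implementations can be compared as DMIPS per watt, which puts the 1 W estimate used for the Tegra into context. The numbers are ARM's published ones [21]; the helper function is only for this comparison:

```python
def dmips_per_watt(dmips, watts):
    """Energy efficiency expressed as Dhrystone MIPS delivered per watt."""
    return dmips / watts


# ARM-published figures for a dual core Cortex-A9 on the TSMC 40G process [21]:
speed_optimized = dmips_per_watt(10000, 1.9)  # 2 GHz implementation
power_optimized = dmips_per_watt(4000, 0.5)   # 800 MHz implementation

print(round(speed_optimized))  # 5263 DMIPS/W
print(round(power_optimized))  # 8000 DMIPS/W
```

The power optimized implementation delivers roughly half the raw DMIPS but noticeably more DMIPS per watt, which is the trade-off this thesis is concerned with.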
According to ARM, a Dual Core Cortex-A9 built using the TSMC (Taiwan Semiconductor Manufacturing Company) 40G process, which is a 40 nm manufacturing process, uses in a speed optimized implementation 1.9 W at 2 GHz, resulting in 10000 DMIPS. A power optimized implementation uses 0.5 W at 800 MHz, providing 4000 DMIPS [21]. In this thesis the power consumption is estimated to be 1 W for the Cortex-A9 in the Tegra.

3.1.4 Reference and client machines

In addition to the test machines with ARMv7-A processors, other test machines with x86 processors are needed to make a comparison between the energy efficiency of the
processor architectures. A variety of machines were used during the benchmarking, both to make comparisons possible and to enable the benchmarking in both the SIP-Proxy and the Apache HTTP server tests. The results that are presented as reference values originate mainly from three different machines. The first has a Dual Core Intel E6600 processor, the second has two Intel Quad Core E5430 processors and the third has two Quad Core Intel L5430 processors.

According to the data sheet for the 5400 series [13], there are three different sub-series within the 5400 series, targeting different markets: the X5400, E5400 and L5400 sub-series. The X5400 series is described as a performance version and the E5400 as a mainstream performance version. The L5400 is described as a lower voltage and lower power version intended specifically for dual processor server blades. The listed thermal design power (TDP) for the X5400, E5400 and L5400 series is 130 W, 80 W and 50 W respectively.

3.1.5 Network

The benchmarks that required interaction between several machines were connected in a number of different ways, depending on the test in question and partly on the available resources. Due to the design of the test machines, Gigabit Ethernet was not always available. The BeagleBoards had the option of Ethernet over USB, or a USB to Ethernet adapter. In order to make the tests more comparable, the USB to Ethernet adapter option was used. Both the Versatile Express and the Tegra 250 had 10/100 Mbps Ethernet capabilities. The Tegra had, in addition to the built in fast Ethernet, a Gigabit Ethernet card.

At first the machines that were to be benchmarked were connected directly to the benchmarking machine without any switches. To increase the number of clients, a fast Ethernet switch was used, allowing up to six benchmarking machines to be added. In order to enable benchmarks using a larger number of more powerful machines, a Gigabit Ethernet LAN was used.
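Whether a given request rate fits the available links can be checked with a quick payload-bandwidth calculation. The sketch below is illustrative: the request rate is a hypothetical value, the 10-byte file matches the small static file used in the Apache tests, and protocol overhead (which the text notes becomes significant for small files) is ignored:

```python
def required_kbps(requests_per_s, file_bytes):
    """Application payload bandwidth in KBps (ignores protocol overhead)."""
    return requests_per_s * file_bytes / 1024


GIGABIT_KBPS = 131072  # 1 Gbit/s expressed as KBps, the figure used in the text

# Hypothetical load: 20 000 requests/s for a 10-byte file
load = required_kbps(20000, 10)
print(load, load < GIGABIT_KBPS)  # 195.3125 True
```

Even tens of thousands of requests per second for such a small file stay far below the gigabit LAN's capacity, so per-packet protocol overhead, rather than payload, dominates the actual bandwidth use.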
Through this network, ten client machines were controlled by an eleventh machine. These machines were used to create the necessary traffic for the benchmark. To avoid and detect problems caused by other users of the same network, the tests were performed in the evenings outside office hours. Tests were also redone later to confirm the results. As the network bandwidth was limited, the file requested in the test was small, in order to keep the bandwidth requirements as low as possible. The theoretical maximum bandwidth for the LAN is a gigabit, or 131 072 KBps. The
bandwidth of the network was not a problem for the ARM test machines. However, for the machines used for the reference results, it was a potential bottleneck considering the ability of the reference machines to serve tens of thousands of requests per second.

3.2 Software

3.2.1 Erlang

Erlang [22] is a functional programming language and a virtual machine. The Erlang syntax resembles that of Prolog, not surprisingly, as it started out as a modified version of Prolog. The first version of Erlang was created at the Ericsson Computer Science Laboratory by Joe Armstrong, Robert Virding and Mike Williams. The development of Erlang began in the eighties and Erlang is still used by Ericsson in telecommunication applications [22]. It is designed to be highly concurrent and intended for fault tolerant soft real-time systems [23]. The aim was to create a language that would be suitable for creating telecommunication systems consisting of millions of lines of code. These systems are not only large, they are also meant to be constantly running. To be able to run them continuously for as long as possible, software upgrades must be possible without stopping the system [24].

Erlang/OTP is often implied when discussing Erlang. OTP is short for Open Telecom Platform. It contains tools, libraries and procedures for building Erlang applications. It provides ready-made components, such as a complete web server and FTP server. It is also useful when creating telecommunication applications. Both the Erlang VM and OTP are open source licensed.

The Erlang run time system (erts) implements its own lightweight processes and garbage collection mechanism. Erlang is run as a single process in the host operating system, and schedules the Erlang processes within it. SMP Erlang enables the use of more than one CPU core on the host machine by using multiple schedulers. All schedulers are run as separate threads in order to enable their simultaneous execution.
In general, equally many schedulers are run as there are available CPU cores; the number of schedulers used can vary, but by default it is the same as the number of available CPU cores. From the user's perspective there is no difference whether the cores are on the same CPU or on different CPUs in the same SMP machine. There is no shared memory between Erlang processes, which means that all communication is done using
message passing, enabling the construction of distributed systems [23].

3.2.2 SIP-Proxy

SIP is short for Session Initiation Protocol, a standard defined by the IETF [25]. The IETF, or Internet Engineering Task Force, is an organization that develops and promotes Internet standards. The IETF does not have any formal membership or membership requirements [25]. SIP is an application-layer protocol for controlling sessions with one or more participants. It is used for creating, modifying and terminating sessions. The sessions can be multimedia, including video or voice calls, and the session modification possibilities include the ability to add or remove media and participants, and to change addresses. The protocol itself can be run on top of several different transport protocols, such as the Transmission Control Protocol (TCP) or the User Datagram Protocol (UDP). SIP includes features such as the possibility for a user to move around in a network while maintaining a single visible identifier. It is also possible to be connected to the network from several different places, for example by using several different phones that are associated with the same identifier. The SIP protocol does not provide services on its own; it does, however, provide primitives that can be used to implement a variety of services. There is a great variety of extensions for the SIP protocol that make it usable in many use cases and environments.

SIP enables the creation of an infrastructure consisting of proxy servers that users can use to access a service. A SIP-proxy is a server that helps route requests to the current location of the user and makes requests on behalf of the client; the proxy also authenticates and authorizes users for the provided services. The protocol allows for registration of the users' locations to be used by the proxy servers.

3.3 Benchmarks

To evaluate the performance of our hardware, several benchmarks were used.
First, the performance of the Erlang runtime system is benchmarked to see how well it performs on the hardware. This shows how well an application running on top of Erlang could be expected to run. By benchmarking the SIP-Proxy and the Apache 2 server, the performance for actual services is evaluated on all the hardware. More precise information about the benchmark setups is presented in the following sections, and the results are presented and analyzed in the following chapter.
3.3.1 Apache 2

The Apache 2.2 HTTP server was used to determine how well the Cortex-A9 MPCore machines can perform traditional server tasks. Apache 2.2 was chosen as it is freely available, open source, and has been one of the most popular servers for a long time. The Apache HTTP server is available for many platforms, such as Linux and Mac OS X, and is used with a variety of architectures. As these benchmarks are targeting the x86 and ARMv7-A architectures, the ability of the Apache HTTP server to run on both is crucial. These tests focus on use cases with small static files. The initial tests were run using Apache Bench (AB), a tool for quick performance testing. The use of AB was later discontinued in favor of autobench, in order to produce more reliable results.

Autobench [26] was used to measure how well Apache performs on the different machines and with different test parameters. Autobench is a tool that helps automate the use of httperf. Httperf is a program for benchmarking the performance of an HTTP server. It creates connections to a server in order to fetch a file. During one connection, one or several requests for the file are made, depending on the test parameters. By changing the rate at which the connections are created and the number of requests for each connection, the load on the server varies. By running httperf several times with different connection rates, the server's response to different loads can be evaluated. By running the test from several machines simultaneously, the load generating capabilities of the test setup can be taken beyond that of a single machine, but the instances must be controlled separately. The results from such tests must also generally be combined manually afterwards. To decrease the error caused by differences in the starting time of the test on different test machines, the testing time should be relatively long [27].
One of the most useful features of autobench is that it runs multiple tests while steadily increasing the load on the server according to the user's instructions. In order to run autobench, a number of parameters must be set. The parameters can either be set in a configuration file, or given as command line arguments when starting the test. An example for benchmarking a single server, giving the arguments from the command line, looks like the following:

autobench -single_host -host1 -uri1 /10B -quiet -low_rate 100 -high_rate 1000 -rate_step 20 -num_call 10 -num_conn 10000 -timeout 5 -file benchmarkresult.tsv

Single_host indicates that only one server is benchmarked. Host1 is the address of the server in this case. Uri1 is the file that is requested in the test; a file called 10B is used in this example. The quiet option must be used if the results are expected to be sent to STDOUT, as it restricts the amount of data that httperf produces. Too much output from httperf causes autobench to create badly formatted report tables. The benchmarking starts from the point indicated by low_rate, and continues to the upper limit given by high_rate using the step size from rate_step. In this case the test starts from 100 connections per second and is run until 1000 connections per second is reached, adding 20 connections for each new run. Num_call regulates how many requests for the particular file should be made for each connection, 10 in this case. If the option to keep connections alive is disabled on the server, the test will report a number of unsuccessful attempts at retrieving the file. The number of attempted requests is calculated by multiplying num_call with the connection rate. Num_conn specifies the number of connections to create for each step of requests. As the time taken to attempt a number of connections depends on the rate at which the connections are created, a more convenient alternative is replacing num_conn with const_test_time. Const_test_time specifies how long the test should continue, automatically calculating the value for the number of connections for all request rates during the testing. Timeout sets the value in seconds for how long a request is allowed to last before it is considered a failure. A longer timeout value causes a greater fd (file descriptor) usage than a shorter one, increasing the risk of stability issues for the system under test. The file option simply specifies the name of the file where the results are stored.

In theory, almost any benchmarking program could have been run on several machines simultaneously, and the results then combined manually; autobench helps automate this process for httperf.
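The parameters above determine which connection rates are tested and how many requests are attempted at each step. A small sketch of that arithmetic, using the example's values (the helper names are my own, not autobench's):

```python
def autobench_steps(low_rate, high_rate, rate_step):
    """Connection rates autobench will test, from low_rate up to high_rate."""
    return list(range(low_rate, high_rate + 1, rate_step))


def attempted_requests_per_s(rate, num_call):
    """Requests attempted per second: num_call requests for each connection."""
    return rate * num_call


rates = autobench_steps(100, 1000, 20)
print(len(rates), rates[0], rates[-1])     # 46 steps: 100, 120, ..., 1000
print(attempted_requests_per_s(1000, 10))  # 10000 requests/s at the top rate
```

At the highest step of the example, 1000 connections per second with 10 requests each thus corresponds to 10 000 attempted requests per second against the server.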
Two programs are provided together with autobench: autobench_admin and autobenchd. Their purpose is to make the usage of several machines for testing convenient. The idea behind these programs is that autobenchd is run on all client machines and one of them runs autobench_admin. Autobench_admin distributes the required requests between all client machines. After the instances of autobenchd have completed their part of the test, the results are collected and combined by autobench_admin. With the basic configuration the requests are distributed equally among all autobenchd instances. This has two implications. The first is that the requested number of connections must be evenly dividable between all instances. The second is that in order to get clean results, all client machines should only be requested
to create a number of requests that the machines used are actually able to produce. The results from a test where the client machine has not been able to produce enough requests, due to limitations of its own, look similar to results where the server being tested is not able to reply to the requests.

In the results file, autobench stores the number of requests that were supposed to be made, followed by the number that actually was attempted and the number of connections created. Statistics on the minimum, mean and maximum number of requests during the test are also provided. The amount of bandwidth used and the number of errors are also listed. As the tests are run several times, trends are easily detectable. Especially short and sudden interference is generally obvious; interference that remains constant is more difficult to detect. By running the tests several times, and with an increasing load, more information is generated on how the server responds to a varying number of connections and requests. Testing several times with similar test parameters also gives an overview and helps verify whether something has interfered with the test.

As illustrated in Figure 3.5, the test was set up so that the machines that were to be benchmarked were connected through Ethernet to the client machines that created the requests. In order to make sure that the server was the bottleneck in the final tests, i.e. that the data produced were meaningful, the test was run several times with slight variations in the test parameters. To guarantee that the client machines were able to create enough traffic, the same tests were run using an increasing number of client machines. If the results remained close to the same even when the number of client machines was increased, the load created by the client machines was sufficient. The number of served requests was set as the metric for performance.
The only requirement was that the requests being counted would be served within five seconds; the rest of the responses were discarded. Any other quality of service aspects, such as the number of unanswered requests, were ignored, as the focus was on maximum performance rather than quality of service.

The original plan was to leave all installations of the Apache 2 HTTP server in their default configurations to get a generic comparison. It soon became clear that the default configurations for the Apache server installations were not identical. Although the same version of Apache (Apache 2.2) was provided with all Linux distributions that were used, the configurations varied. It was expected that there would be differences such as where configuration files were located on the different distributions, but
not that the default configurations would differ. The main difference was that in the Apache installation on the Fedora machine the keep alive option was disabled; this was changed so that it was enabled on all machines.

By running the same static page fetching test, but using files of different sizes and monitoring the bandwidth usage, it was decided whether a file could be used safely without running into problems with the available network bandwidth. A small file size was required, as the Versatile Express was equipped with only a 10/100 Mbps Ethernet card rather than Gigabit Ethernet. The reference machines were expected to serve a large number of requests, resulting in a high bandwidth usage even with small files. In tests where small files are used, the overhead caused by the underlying protocols becomes significant.

Figure 3.5: Test setup for the Apache test

3.3.2 Basic Erlang performance benchmarks

To evaluate the performance of the Erlang Virtual Machine (VM) running on the test machines, some general performance benchmarking was done. A set of micro benchmarks running on the Erlang VM was used. Some of the benchmarks were able to make use of more than one core on the host machine, while others did not gain any notable benefit compared to running on just one core. The micro benchmarks are designed to stress different parts of the VM. While these tests do not emulate a realistic service producing scenario, they do give information on the general performance levels
of different parts of the systems. This information can then be used to compare the performance of the VMs running on different platforms, and provides a way to estimate how well different applications could be expected to run. The results are included as Appendix A. Due to reasons such as insufficient memory in the test hardware, not all benchmarks were run on all available test platforms. If a benchmark was not able to run on all machines, the results from that benchmark were not analyzed for the rest of the machines either.

3.3.3 SIP-Proxy

An Erlang based SIP-Proxy (Session Initiation Protocol proxy) was tested to find out how well an ARM Cortex-A9 would perform in telecom applications. The performance of the SIP-Proxy was measured in the number of calls per second it could handle. The metric for energy efficiency for the proxy was decided to be the number of calls the proxy could handle for each joule used. As the proxy is running on top of Erlang, the results from this particular proxy reflect the results from the Erlang micro benchmarks. The value this benchmark brings comes from giving a performance and energy efficiency evaluation for a realistic service, rather than just for parts of the system as the micro benchmarks did.

To measure the performance of the proxy, two other machines were used, as shown in Figure 3.6. One machine was used to create the messages that should be passed and the other was used as the receiver. Both of these machines were running SIPp, an open source test and traffic generation tool made available by HP. The version used was 3.1, compiled from source. The bandwidth required by the proxy, running on a machine capable of only a limited number of calls each second, is not big; bandwidth was thereby not expected to be an issue here. The proxy had an fd leak, and in order to avoid issues caused by this, the maximum number of fds, both system wide and per user, was increased.
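The calls-per-joule metric described above follows directly from the sustained call rate and the average power draw, since a watt is one joule per second. A minimal sketch, with purely hypothetical example values rather than measurements from this thesis:

```python
def calls_per_joule(calls_per_second, power_watts):
    """Energy efficiency of the proxy: calls handled per joule of energy.

    Watts are joules per second, so dividing the sustained call rate by
    the average power gives calls per joule directly.
    """
    return calls_per_second / power_watts


# Hypothetical: a proxy sustaining 100 calls/s while drawing 0.5 W
print(calls_per_joule(100, 0.5))  # 200.0 calls per joule
```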
During the time the proxy was running there was constantly a small increase in the memory used by the underlying erts. The total memory usage of the system was, however, all in all modest. To get more reliable results and avoid as much interference as possible from the unwanted accumulating use of system resources caused by the fd leak, the proxy was restarted between every test.

In order to evaluate the performance of the proxy, a definition of when the proxy passed a test was needed. In the reference results provided by Ericsson the proxy had been expected to run for a few minutes. The reference results used in this benchmark
are from a machine with two Quad Core Intel Xeon L5430 processors and 8 GB of RAM. As the testing focused on processors with lower performance, stricter requirements were set up. This was done to enable a more accurate comparison between the different ARM Cortex-A9 MPCore processors that were to be tested.

Figure 3.6: Test setup for the SIP-Proxy test

The test machines were required to be able to pass all requested messages for a two minute period. The two minute requirement was the result of two different factors. The first was the requirement used by Ericsson, which the new requirement would need to be in line with. The second was caused by the way the proxy used system resources. When running the proxy, the CPU utilization level of the system increased rapidly for a while. After the rapid increase, a very slow increase was observable for the entire time the proxy was running. The rapid increase was observed to halt well before the two minute mark for all the tested ARM Cortex-A9 MPCore systems. There was not much need for discussion of an acceptable error rate, as a low sustainable error rate was not encountered during the testing. If the load was not significantly decreased quickly after a failed message caused by a high load, the fail rate would increase rapidly. The erts on both the Versatile Express and the Tegra 250 was recompiled to support profiling using Gprof. Some optimizations had to be disabled in the make files for the erts in order for Gprof to work. Gprof shows function calls executed by a specific program; Oprofile was also used, as it has the ability to perform system wide profiling. To enable profiling using Oprofile, the kernels on both machines were recompiled. As the profiling has a negative effect on system performance, the profiling enabled versions
of both the erts and the kernels were not used when obtaining results for maximum performance in any benchmarks. The versions that supported profiling were only used to find reasons for unexpected anomalies and performance differences.

3.4 Summary

To evaluate the energy efficiency of the ARM Cortex-A8 and the Cortex-A9 MPCore for server tasks, compared to processors built on the x86 architecture, a set of test hardware has been used. To evaluate the performance of the Cortex-A8, a BeagleBoard is used. The Cortex-A9 is evaluated using a Tegra 200 development kit with a Tegra 250 chip, and a Versatile Express with a CoreTile Express. An Apache 2.2 HTTP server is used to evaluate the energy efficiency of the Cortex-A9 MPCore processor, compared to an Intel Xeon processor, when serving static files to clients. The performance of the Erlang VM on the test machines is evaluated directly using a set of micro benchmarks, as well as through an Erlang based SIP-Proxy.
4 PERFORMANCE COMPARISON

This chapter explains the individual executions of the benchmarks. After each benchmark execution the corresponding results are presented. The results from the executions are followed by performance comparisons. After the pure performance comparisons, the energy efficiencies will be compared. All energy efficiency comparisons in this chapter focus on the energy consumption of the processors themselves, rather than on the total consumption of the computers being benchmarked.

4.1 Apache results

After reaching its peak performance in this particular test, the performance of the Versatile Express dropped quickly, and in the end made the server unresponsive. This can be seen in Figure 4.1. This happened some time before reaching full CPU utilization. A reason for this could be the network implementation on the Versatile Express. If enough interrupts are generated in order to handle TCP packets, it eventually leads to a situation where an increasing part of the runtime is used up by interrupts, leaving a decreasing amount of resources available to actually provide the intended service. This has not been proven to be the case here, but is a viable possibility. To deal with the unresponsiveness, the test machine was restarted between tests. When benchmarking the machine with the two Intel Quad Core Xeon E5430 processors using the same test parameters as for the rest of the machines, the test proved to not be CPU intensive enough, as full CPU utilization was not achieved. In order to find the bottleneck, the number of clients was increased. The test was run using both ten and five client machines. As the results were the same in both tests, the performance of the client machines was not a bottleneck. The bandwidth was tested by redoing the test using a larger file than the original. The result from the test with the larger file was close to that of the original test, with the biggest difference being a higher bandwidth usage.

Figure 4.1: Comparison between CoreTile Express, Tegra and an Intel Pentium 4 powered machine running the Apache HTTP server.

The system reported no shortage of available memory in any of the Apache HTTP server benchmarks. The machine with the two Quad Core Xeons was able to serve 36000 requests per second when a hundred requests were made for each connection. For ten requests per connection the result was only 6200 requests/s. The data transfer during the test with 36000 requests per second was reported by Autobench to be 11600 KBps. Compared to the theoretical maximum bandwidth of a gigabit network (131072 KBps), the used bandwidth was less than 10 percent. Although the theoretical bandwidth is generally not achieved in a real life network, more than 10 percent is achievable. In addition, higher data transfer rates from the same server, using the same network, were achieved using a larger file. As the network seemed an unlikely bottleneck, other parts of the test setup were inspected. The machine running the Apache server reported 60
percent CPU utilization for the test with 36000 requests and 10 percent for the test with 6200 requests. If the CPU utilization level and the web server performance continue to have the same relation to each other, the performance in both cases is around 60000 requests per second at full CPU utilization. Figure 4.2 shows the CPU utilization at a few points during the Apache test for the dual E5430 machine. Assuming the performance for one E5430 running at 100 percent is the same as the performance of two running at 50 percent, the performance for one E5430 is 33000 requests per second.

Figure 4.2: CPU utilization during the test on the machine with two Quad Core Intel Xeon E5430 processors

The results from the Apache test shown in Table 4.1 are from fetching a static file of size 10 bytes. Ten calls per connection and a hundred calls per connection were requested in the tests, and the better result from the two benchmarks was used. The performance of the Tegra 250 was more or less the same when making ten or a hundred requests per connection. For the comparison machine with the two Xeon processors the difference was approximately a multiple of ten. The better results were used for the comparison.

Machine                                        Requests / second   Requests / Joule
Quad Core Intel Xeon E5430 (2.66 GHz, 80 W)    33000               413
Pentium 4 (2.8 GHz)                            7100                80
Dual Core Cortex-A9 MPCore (1 GHz)             4600                4600
Quad Core Cortex-A9 MPCore (400 MHz)           3400                2833
Cortex-A8 (600 MHz)                            760                 760

Table 4.1: Ability of Apache 2.2 to serve a 10 byte static file using different hardware

The test results can be seen in Table 4.1. As can be seen in the table, the Tegra 250 managed to serve 4600 requests per second and the Versatile Express 3400 requests per second. The performance difference of the Versatile Express compared to the Tegra 250 is likely caused by both the lower clock frequency of the CPU and the network implementation. The difference in combined clock frequencies between the two processors is on its own slightly less than the performance difference. The combined number of clock ticks for the Versatile Express is 1600 (400 * 4), and 2000 (1000 * 2) for the Tegra 250. Comparing these, the Versatile Express has 80 percent of the clock ticks of the Tegra 250. The performance of the Versatile Express is in comparison slightly less, 74 percent of that of the Tegra 250. The results from the Apache tests were mainly compared against a machine with two Intel Quad Core Xeon processors running at 2.66 GHz. To provide a more comprehensive comparison and more reference points, a machine with a Pentium 4 (2.8 GHz) was also benchmarked. While the machine with the more traditional server processors outperforms the tested Cortex-A9 processors, the Cortex-A9 processors do well taking their energy consumption into account. The Intel Xeon processor (E5430) that was used in the reference machine has a reported maximum thermal design power (TDP) of 80 W, while the Quad Core Cortex-A9, according to the performed tests, has a maximum measured power consumption of 1.2 W. The rightmost column in Table 4.1 shows the number of answered requests produced per Joule used.
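The extrapolation from CPU utilization used above can be written out explicitly. Assuming throughput scales linearly with CPU utilization, the full-load estimate is the measured request rate divided by the measured utilization:

```python
def full_load_estimate(measured_rps: float, cpu_utilization: float) -> float:
    """Extrapolate requests/s at 100 % CPU, assuming linear scaling."""
    return measured_rps / cpu_utilization

# 36000 requests/s were served at 60 % CPU utilization on the dual-Xeon machine
print(full_load_estimate(36000, 0.60))  # ~60000, "around 60000 requests per second"
# The lighter test: 6200 requests/s at 10 % utilization
print(full_load_estimate(6200, 0.10))   # ~62000
```

Both tests extrapolate to roughly the same full-load figure, which is what justifies treating the CPU as the eventual bottleneck in this comparison.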
A clear improvement in energy efficiency is visible, starting from the Pentium 4 to the Dual Core ARM Cortex-A9 MPCore. Figure 4.3 shows the energy efficiency comparison as a bar diagram. Figure 4.3 indicates an energy efficiency of about 6.9 times the performance per
Joule for the Versatile Express compared to the Intel Xeon. The results can be assumed to be a bit better for the Intel Xeon in practice when its actual power consumption is taken into account. The actual power dissipation of the Quad Core ARM was, however, also below 1 W during the test, rather than the measured maximum of 1.2 W that was used for the calculations. As there are no actual numbers available for the power consumption of the CPU on the Tegra 250 board, an estimate of 1 W is used for its power consumption. For the Tegra 250 the energy efficiency compared to the reference Intel Xeon processor was approximately 11.1 times better. A clear improvement in energy efficiency is also visible between the Pentium 4 processor and the Xeon. This improvement is an indication of the energy efficiency improvement in Intel's x86 based processors. One of the major improvements from the Pentium 4 to the Xeon L5430 is the manufacturing technology, which has improved from 90 nm to 45 nm.

Figure 4.3: Number of requests handled for each Joule used by the CPU

4.2 Emark results

A set of benchmarks called Emark was used to evaluate the performance of the erts. The benchmarks are meant to be used for evaluating the erts and its performance on particular hardware. The test packet included a set of baseline results for comparison.
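The per-Joule figures in Table 4.1 and the efficiency ratios quoted for the Apache test follow directly from the request rates and the power figures used (the 80 W TDP for the Xeon, the 1.2 W measured maximum for the Quad Core Cortex-A9, and the 1 W estimate for the Tegra 250). A quick recomputation, using the tabulated (rounded) per-Joule values for the ratios:

```python
# Requests per Joule = requests per second / watts
xeon_rpj  = 33000 / 80    # 412.5, tabulated as 413 in Table 4.1
v2p_rpj   = 3400 / 1.2    # ~2833 (Quad Core Cortex-A9, Versatile Express)
tegra_rpj = 4600 / 1.0    # 4600  (Dual Core Cortex-A9, Tegra 250)

# Efficiency ratios relative to the Xeon, from the tabulated values:
print(round(2833 / 413, 1))   # 6.9  (Versatile Express vs Xeon)
print(round(4600 / 413, 1))   # 11.1 (Tegra 250 vs Xeon)
```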
The benchmarks can be used to test the performance of either different erts implementations, or to compare different hardware against each other. In these results the same version of the erts was used, in order to make the results depend as much as possible on the differences in hardware rather than on differences in software. The different benchmarks return results using different metrics; some measure time while others measure the number of transactions, and the Stones test gives its results in “stones”. All the results shown here are a comparison to the baseline results, unless something else is mentioned. They are to be interpreted as how many times worse the tested systems performed than the baseline, regardless of the metrics in the original benchmarks. A lower score in these results is always better, and should be interpreted as how many of these machines would theoretically, in a perfect world and without any overhead, be needed to replace the baseline machine in the particular test. The machine used for the baseline has a Dual Core Intel E6600. The chip is built using a 65 nm technology and has a TDP of 65 W [28]. Among the tests that were run was a message passing test called “big bang”. It creates a thousand processes and every process sends a “ping” message to every other process; every process that receives a “ping” responds with a “pong” message. An advantage over the message passing test in the Stones benchmark is that it is capable of using more than one core. The inability to effectively use more than one core at a time holds true for all the tests in the Stones benchmarks. This can easily be seen in Appendix A. Table 4.2 shows how the different machines performed in the benchmark compared to the baseline results. All results in the table are measured using as many OS processes as there are available cores on the particular test machine. The BeagleBoard uses an erts implementation without SMP support and the others run SMP enabled erts implementations.
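The normalization described above can be sketched as follows: a raw score is divided by (or into) the baseline score depending on whether the benchmark's metric is "lower is better" (e.g. time in milliseconds) or "higher is better" (e.g. transactions per second), so that the normalized value always reads as "how many of this machine would be needed to replace the baseline". The numbers in the usage example are illustrative, not taken from the thesis tables:

```python
def normalize(score: float, baseline: float, lower_is_better: bool = True) -> float:
    """Express a result as 'times worse than baseline'; lower is better."""
    return score / baseline if lower_is_better else baseline / score

# Illustrative: a runtime of 104000 ms against a baseline of 13000 ms
print(normalize(104000, 13000))                     # 8.0
# Illustrative: 500 transactions/s against a baseline of 4000 transactions/s
print(normalize(500, 4000, lower_is_better=False))  # 8.0
```

Either way, the machine would theoretically need to be replicated eight times, ignoring all overhead, to match the baseline in that test.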
Most of the benchmarks here have been run several times with different parameters, and the results in the table are average values. If results for some test were not available, the results from the corresponding test on the other machines have also been omitted. Short explanations of what the different benchmarks test are given in Table 4.3. Most of the benchmarks in Table 4.3 get better results when using more than one scheduler; there are, however, some exceptions. The results show that the benchmarks Codec, Containers and Msgq do not gain any benefit from using more than one scheduler.
Test             BeagleBoard   CoreTile Express   Tegra 250
Bang             17.8          14.2               6.7
Big              38.8          12.3               9.7
Chameneosredux   5.6           17.3               6.7
Codec            19.4          19.9               7.0
Containers       34.4          30.2               14.1
Ets_test         10.7          5.9                3.7
Genstress        17.6          14.4               7.6
Mbrot            40.3          12.9               8.5
Msgq             1.0           1.5                0.9
Netio            29.3          12.4               12.6

Table 4.2: Performance of ARM test machines compared to baseline results

Test             Explanation
Bang             All to one message passing
Big              All to all message passing
Chameneosredux   Shake hands with everyone
Codec            Encode/decode binaries (test binary to term)
Containers       Adds and lookups in containers (ADTs)
Ets_test         Ets insert/lookup (also in parallel)
Genstress        Genserver test
Mbrot            Mandelbrot calculations (concurrent number crunching)
Msgq             Message queue bashing
Netio            TCP messages

Table 4.3: Explanations of what the benchmarks evaluate
              SMP1     SMP2     SMP3     SMP4
Baseline      12998    10820
BeagleBoard   284375
V2P-CA9       248217   141691   105538   106230
Tegra 250     102404   102403

Table 4.4: Results from the Netio benchmark

The benchmark named Netio, which tests TCP messages, shows an interesting behavior. The results from the benchmark are visible in Table 4.4. In contrast to Table 4.2, the results in this table are the direct output from the benchmark. According to the source code for the benchmark, the values represent milliseconds; a lower result is better. In this particular test run the following test parameters were used: 200 connections, 1000 packets and a packet size of 10000. There is a clear increase in performance when adding more schedulers, from one to three, for the V2P-CA9. Between three and four schedulers the test shows no further improvement; it actually shows a decrease. The decrease is still small enough to be discarded, due to the fact that the test results differ slightly between the different test runs. For the Tegra 250 the results for one and two schedulers are basically identical. The difference between the best results for the CoreTile Express and the Tegra 250 is only three percent. The benchmark is designed to stress the packet receive and accept processes and is not affected by external factors, as it is not dependent on I/O. The results from the Stones benchmark are shown in Table 4.5. The table shows that the performance of the Tegra 250 is consistently better than that of the V2P-CA9. The interesting thing is that the BeagleBoard outperforms not just the CoreTile Express but also the Tegra 250 in some of the benchmarks. As only the Links benchmark is able to gain any benefit from using more than one core, the BeagleBoard, with its higher clock frequency, has an advantage over the CoreTile Express in the rest of the benchmarks. In the small and medium message passing benchmarks, it is even faster than the Tegra 250.
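Returning to the Netio runtimes in Table 4.4: since the values are runtimes in milliseconds, the scheduler scaling discussed above can be read off as speedups relative to one scheduler. A short check against the V2P-CA9 column:

```python
# V2P-CA9 Netio runtimes (ms) for SMP1..SMP4, from Table 4.4
runtimes = [248217, 141691, 105538, 106230]

# Speedup relative to the single-scheduler run
speedups = [round(runtimes[0] / t, 2) for t in runtimes]
print(speedups)  # [1.0, 1.75, 2.35, 2.34]
```

The speedup grows clearly up to three schedulers and then flattens, with the small dip at four schedulers falling within run-to-run variation, matching the plateau described in the text.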
Test                               BeagleBoard   CoreTile Express   Tegra 250
List manipulation                  26            25                 11
Small messages (message passing)   10            35                 14
Medium messages                    13            34                 14
Huge messages                      17            28                 11
Pattern matching                   32            30                 14
Traverse                           35            27                 13
Work with large dataset            32            26                 12
Work with large local dataset      35            28                 13
Alloc and dealloc                  22            23                 10
Bif dispatch                       11            18                 7
Binary handling                    15            27                 11
Ets datadictionary                 27            46                 19
Generic server (with timeout)      15            37                 15
Small integer arithmetic           30            23                 12
Float arithmetic                   31            37                 9
Function calls                     38            30                 13
Timers                             10            26                 8
Links                              18            9                  8

Table 4.5: Results from the Stones benchmark compared to the baseline
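A rough way to probe the clock-frequency explanation for the BeagleBoard's advantage is to scale its single-core scores by the clock ratio: the CoreTile Express cores run at 400 MHz against the BeagleBoard's 600 MHz, so with equal per-clock efficiency one would expect CoreTile scores (times worse than baseline) about 1.5 times the BeagleBoard's. This sketch assumes linear clock scaling, which is only a first approximation:

```python
BEAGLE_MHZ, CORETILE_MHZ = 600, 400

def expected_coretile_score(beagle_score: float) -> float:
    """Predicted 'times worse' score if only clock frequency differed."""
    return beagle_score * BEAGLE_MHZ / CORETILE_MHZ

# 'Small messages' from Table 4.5: BeagleBoard 10, CoreTile Express 35 observed
print(expected_coretile_score(10))  # 15.0 expected, vs 35 observed
# 'Small integer arithmetic': BeagleBoard 30, CoreTile Express 23 observed
print(expected_coretile_score(30))  # 45.0 expected, vs 23 observed
```

Rows where the observed score is worse than this prediction (such as the message passing tests) suggest that clock frequency alone does not explain the gap, while rows well below it (such as the arithmetic tests) show the Cortex-A9 doing more per clock than the Cortex-A8.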
4.3 SIP-Proxy results

Ericsson provided reference results for the benchmark. The machine used for this comparison has two Quad-Core Intel Xeon L5430 processors running at 2.66 GHz. The test results for the reference machine are presented in Table 4.7. If the CPU is the bottleneck, the performance increase is approximately dependent on the amount of CPU resources available, in this case the number of cores. As visible in Table 4.7, this is not the case. There is a significant performance improvement all the way from one scheduler (SMP1) to four schedulers (SMP4). When the number of schedulers is increased from four to eight, there is only an increase of 50 calls/s, although the number of available schedulers, and thereby cores, has doubled. This indicates that the results depend on something else than the pure processing power of the processors. As the focus is on processor performance and energy efficiency, results that are not dependent on the processors themselves are to be avoided. Only the results from one to four cores will be used, in practice considering the reference machine as having only one Quad-Core processor, using only the energy required by one processor rather than two. An issue with comparing the energy efficiency in this test is that the only energy consumption data available is the TDP information provided by the manufacturer. According to the information available on Intel's web page, the maximum TDP of the L5430 is 50 W, and it is manufactured using a 45 nm process [14]. In the datasheet for the 5400 series, on page 87, it is likewise stated that the TDP for the L5400 series is 50 W [13]. When evaluating the performance of the SIP-Proxy, the Versatile Express was able to handle 30 calls/s. The reference machine with its Intel Xeon L5430 was able to handle 350 calls/s. Both machines were tested using different numbers of schedulers (1-4); these results can be seen in Table 4.8.
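Two of the figures in this section are easy to recompute: the per-core scaling of the reference machine (Table 4.7) and the calls-per-watt comparison between the Cortex-A9 and the Xeon. Scaling efficiency here means actual throughput divided by perfect linear scaling from the SMP1 result:

```python
# Reference machine (Table 4.7): number of schedulers -> calls/s
ref = {1: 130, 2: 240, 4: 350, 8: 400}

for n, calls in ref.items():
    eff = calls / (ref[1] * n)  # fraction of perfect linear scaling from SMP1
    print(n, round(eff, 2))     # 1: 1.0, 2: 0.92, 4: 0.67, 8: 0.38

# Calls per watt: Versatile Express (30 calls/s, 1.2 W measured maximum)
# against the Xeon L5430 (350 calls/s, 50 W TDP)
ratio = (30 / 1.2) / (350 / 50)
print(round(ratio, 2))  # ~3.57
```

The sharp drop in scaling efficiency at eight schedulers is what motivates restricting the comparison to one Quad-Core processor, and the ~3.57 calls-per-watt ratio is the "3.5 times" figure used in the energy efficiency discussion.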
By taking into account that the CPU of the reference machine has a maximum TDP of 50 W, compared to the measured maximum consumption of 1.2 W used by the Cortex-A9, the Cortex-A9 performs well. By comparing the throughputs and the power consumptions, it can be seen that the Cortex-A9 can handle 3.5 times more traffic for each watt it dissipates compared to the Intel Xeon. An issue in this comparison is that the energy consumption listed for the Xeon is the TDP rated by the manufacturer, and not an actual measured maximum energy consumption. To compensate for this, the maximum measured energy consumption for the Quad Core Cortex-A9 is also used for the comparison. The power consumption during the benchmark was measured using the VD10_S3 register on the CoreTile Express, and the average values are shown in Table 4.6, together with the CPU utilization and the number of calls the proxy was subjected to. During this particular test, where the energy consumption was measured, SMP8 was used. The power consumption is also presented in Figure 4.5. It is noteworthy that no DVFS is available on the CoreTile Express, reducing the possibilities for precise reduction of the energy consumption in relation to the load.

Calls/s   CPU utilization   Power avg (W)
1         19                0.54
5         38                0.65
10        51                0.66
15        66                0.85
25        88                0.91
30        98                0.95

Table 4.6: CPU utilization and average CPU power consumption for the CoreTile Express during the SIP-Proxy benchmark

          SMP1   SMP2   SMP4   SMP8
Calls/s   130    240    350    400

Table 4.7: Performance of the reference machine with two Quad Core Xeons with different numbers of schedulers

As the erts does not generally benefit from using more schedulers than there are available CPU cores on the host machine, the results for those tests are not listed here. The strange thing about these test results is that the Tegra 250 performs significantly worse than the Versatile Express. In the other benchmarks the Tegra 250 has consistently outperformed the CoreTile Express, except in this and the TCP message benchmark that is part of the basic Erlang benchmarking presented previously in this chapter. While the Versatile Express has the advantage of having double the number of CPU cores, the Tegra 250 has more than double the clock frequency on its cores. As visible in Table 4.8, the performance of the Versatile Express and the Tegra 250 is very similar when using the same number of cores. When using one core the difference could be due to a static overhead of running the proxy, but the results from using two cores are not as easily dismissible. The main question here is why an almost identical performance increase is achieved when adding a core running at 400 MHz and one running at 1000 MHz. If the test machine running at 1 GHz would not report almost full CPU utilization, it would be clear that the CPU is not the bottleneck, but this is not the case here.

Figure 4.4: CPU utilization for the CoreTile Express during the SIP-Proxy benchmark

As the performance of the proxy when running on the Tegra 250 was not as expected from the available technical data and our previous benchmarks, additional steps to verify the results were taken. The erts on both the Tegra 250 and the CoreTile Express was recompiled from the same source in the same way, using the same version of GCC and the same version of the libatomic library (7.2 alpha 4). As this did not cause any difference in the results, the erts was again recompiled with a few changes to support profiling using Gprof. The performance on the CoreTile Express was affected more by running Gprof than that of the Tegra 250. With the profiling enabled, the Tegra 250 could handle nine calls per second, while the CoreTile Express could handle six for a two minute period using SMP2, as can be seen in Table 4.9. A test was then performed using five calls per second and SMP2 for two minutes on both machines, while profiling using Gprof. The biggest difference in the number of function calls and the time spent in different functions was in functions that have to do with atomic reads. This is caused by the fact that the schedulers are frequently left without work, and at that point, in an attempt to optimize, spin over a variable to check for more work. To match the throughput between the two test machines, the Tegra was not under maximum load, causing the schedulers to be without work more often than on the V2P-CA9. Other significant differences were not observed. In order to profile system wide, rather than just the erts, the kernels on both the Tegra 250 and the CoreTile Express were recompiled to support Oprofile. Oprofile showed that when running under as high load as possible, the Tegra 250 spent 31.6 % of its time running vmlinux, while the CoreTile Express spent 20.6 %. The times spent running the erts were 65 % and 67 %.

Figure 4.5: Power consumption for the CPU in the CoreTile Express during the SIP-Proxy test

To make sure the issue was not caused by problems with the OS, the installation on the Tegra 250 was replaced by a backup of an older version of Ubuntu that the board had originally been tested with before the evaluation started, and thereby without the changes possibly caused by it. The results remained the same as in the previous tests. Being unable to find a reason for the benchmark results even after great effort, ARM agreed to redo the measurements. The same erts was used, and the same installation of the proxy server. The kernels used were also compiled separately, although both were of version 2.6.32. ARM reported the same results as produced earlier with the Tegra 250. The cause for the unexpected performance difference is still unknown.

Figure 4.6: Performance of the reference machine with two Quad Core Xeons using an increasing number of schedulers

The energy efficiency for the SIP-Proxy is shown in Figure 4.7. The energy efficiency difference between the Versatile Express and the Tegra 250 is close to the performance difference, around half. Compared to the Versatile Express, the Xeon uses approximately 3.6 times more energy for each call. Compared to the Tegra 250 the