Deview 2013 rise of the wimpy machines - john mao



Published in: Technology

  1. 1. Rise of the (Wimpy) Machines: Datacenter Efficiency with ARM-based Servers. John Mao, Director of Strategy, Calxeda
  2. 2. What is the name of the computer system in this movie that tried to end the human race? Skynet
  3. 3. Origins of Wimpy Core Computing • FAWN: A Fast Array of Wimpy Nodes – Project from CMU led by Prof. David Andersen, started in 2008 (active through 2012) – Measures and compares performance-per-Joule energy advantages over traditional servers – Original focus on large distributed key-value store applications and use cases (e.g. Amazon Dynamo, LinkedIn’s Voldemort, Facebook’s memcached) [Publication] http://-sosp09.pdf [Website] http://
  4. 4. FAWN: A Fast Array of Wimpy Nodes • Why FAWN? Motivated by key trends: – Increasing CPU-I/O gap – CPU power consumption grows super-linearly with speed – Dynamic power scaling on traditional systems is surprisingly inefficient
  5. 5. FAWN: A Fast Array of Wimpy Nodes [Photo: FAWN hardware generations 1G through 5G] [Photo Credit] http://
  6. 6. FAWN: A Fast Array of Wimpy Nodes •  Multiple generations of hardware used: –  1G (2008) •  Single-core 500MHz AMD Geode LX processor •  256MB DDR SDRAM (400MHz) •  100Mbps Ethernet –  5G (2012) •  Intel Atom D510 – 1.66GHz dual-core w/HT •  2-4GB DDR2 (667MHz) •  100Mbps Ethernet
  7. 7. Key Findings from FAWN Project “The FAWN cluster achieves 364 queries per Joule, two orders of magnitude better than traditional disk-based clusters.” [Source] http://-sosp09.pdf
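The "queries per Joule" metric quoted above is throughput divided by power draw, since one Watt is one Joule per second. The following is a minimal sketch of that calculation; the function name and the two sample nodes are hypothetical illustration values, not measurements from the FAWN project.

```python
def queries_per_joule(queries_per_second: float, watts: float) -> float:
    """Energy efficiency as work completed per Joule (1 W = 1 J/s)."""
    if watts <= 0:
        raise ValueError("power draw must be positive")
    return queries_per_second / watts


# Hypothetical numbers chosen only to illustrate the metric.
wimpy_node = queries_per_joule(queries_per_second=1_300, watts=4.0)
disk_server = queries_per_joule(queries_per_second=600, watts=250.0)

print(f"wimpy node:        {wimpy_node:.1f} queries/J")
print(f"disk-based server: {disk_server:.1f} queries/J")
```

Because the metric is a ratio, a node can win either by serving more queries at the same power or by serving the same queries at lower power.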
  8. 8. So what about ARM®? ARM is a good “wimpy” processor & CPU architecture for the datacenter because: 1. Focus on low power: origins in embedded systems and mobile devices 2. Datacenter focused roadmap: 32-bit CPUs today, 64-bit CPUs in 1-2 years; increasing performance (with same energy efficiency) 3. Business model: ability to integrate for specific markets and applications 4. Emerging software ecosystem: while not x86, ARM has growing ecosystem
  9. 9. Focus on Low Power •  History in targeting energy-sensitive markets: –  Netbooks, Smartbooks, Tablets, Thin Clients –  Smartphones, Feature phones –  Set-top Box, Digital TV, Blu-Ray players, Gaming consoles –  Automotive Infotainment, Navigation –  Wireless base-stations, VoIP phones and equipment •  Design Goals –  Performance, Power, Easy Synthesis
  10. 10. Focus on Low Power In 2005, about 98% of all mobile phones sold used at least one ARM processor. As of 2009, due to low power consumption, the ARM architecture is the most widely used 32-bit RISC architecture in mobile devices and embedded systems. [Source] http://
  11. 11. Focus on Low Power: Translating ARM energy-efficiency into the modern datacenter with Cortex-A9
      Workload (on 24 nodes & SSDs)    Total System* Power (Today!)    ~Power per ECX-1000 Node (with disk @Wall)
      Linux at Rest                    130 W                           5.4 W
      phpbench                         155 W                           6.5 W
      Coremark (4 threads per SOC)     169 W                           7.0 W
      Website @ 70% Utilization        172 W                           7.2 W
      LINPACK                          191 W                           7.9 W
      STREAM                           205 W                           8.5 W
      *All measurements done on a 24-node system @1.1GHz, with 24 SSDs and 96 GB DRAM in the Calxeda Lab.
      For specific workloads, ECX-1000 can enable a complete 24-node cluster at a similar power level to a 2-socket x86.
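The per-node figures on this slide follow from dividing total system power by the 24 nodes. A minimal sketch of that arithmetic is below; the wattages are the slide's own, while the helper name `per_node_watts` is just an illustrative choice.

```python
NODES = 24  # node count stated in the slide's footnote

# Total system power per workload, as listed on the slide (Watts at the wall).
total_system_watts = {
    "Linux at Rest": 130,
    "phpbench": 155,
    "Coremark (4 threads per SOC)": 169,
    "Website @ 70% Utilization": 172,
    "LINPACK": 191,
    "STREAM": 205,
}


def per_node_watts(total_watts: float, nodes: int = NODES) -> float:
    """Average power per node, including each node's share of disks and chassis."""
    return total_watts / nodes


for workload, watts in total_system_watts.items():
    print(f"{workload:<30} {watts:>4} W  ~{per_node_watts(watts):.2f} W per node")
```

The slide reports these values to one decimal place, so small differences in the last digit are expected.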
  12. 12. But, what about performance?
  13. 13. Online Review: Calxeda’s ARM Server Tested Anandtech chartered review comparing Boston Viridis’ 24-node Calxeda ECX-1000 (Cortex-A9) cluster against an Intel Xeon E5-2650L system. (March 2012)
  14. 14. Calxeda Provides Better Web Throughput Boston Viridis outperforms Xeon E5-2650L by 30% with more than 15 users. Test is phpBB running on Apache2 with variable numbers of users (concurrency) generating traffic.
  15. 15. Calxeda Provides Lower Response Times Boston Viridis outperforms Xeon E5-2650L by 60% with more than 15 users. Test is phpBB running on Apache2 with variable numbers of users (concurrency) generating traffic.
  16. 16. Calxeda Provides Highest Performance/Watt Boston Viridis provides 80% more throughput per Watt than Xeon E5. • 10-36% less raw power Test is phpBB running on Apache2 with variable numbers of users (concurrency) generating traffic.
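Throughput per Watt combines the two results above: a throughput gain divided by the remaining fraction of power. The sketch below works through the slide figures under one assumption the slides do not state, namely that the 30% throughput gain and the 10-36% power saving were observed at the same load points. The function name is hypothetical.

```python
def perf_per_watt_ratio(throughput_gain: float, power_reduction: float) -> float:
    """Throughput-per-Watt relative to a baseline of 1.0.

    throughput_gain: 0.30 means 30% more throughput than the baseline.
    power_reduction: 0.10 means 10% less power than the baseline.
    """
    if not 0 <= power_reduction < 1:
        raise ValueError("power_reduction must be in [0, 1)")
    return (1 + throughput_gain) / (1 - power_reduction)


# Bounds of the power saving quoted on the slide, with the 30% throughput gain.
for reduction in (0.10, 0.36):
    ratio = perf_per_watt_ratio(0.30, reduction)
    print(f"{reduction:.0%} less power -> {ratio - 1:.0%} more throughput per Watt")
```

Under that assumption the quoted 80% improvement sits between the two bounds this produces.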
  17. 17. Online Review: Calxeda’s ARM Server Tested Reviewer’s Key Takeaways: – For scale-out workloads, Calxeda’s ARM-based scale-out hardware architecture is very promising. – Microbenchmarks show Calxeda ECX-1000 ~10% behind Intel Atom N2800 @1.86 GHz – “Real World” Application Benchmarking shows 70%+ higher performance-per-watt than Intel Xeon E5 at mid to high user load – “Calxeda really did it: each server needs about 8.3W (200W/24), measured at the wall…about 6W (at 1.4GHz) per server node…” – “So on the one hand, no, the current Calxeda servers are no Intel Xeon killers (yet). However, we feel that Calxeda's ECX-1000 server node is revolutionary technology.”
  18. 18. ARM® Cortex-A15 • Based on ARMv7-A architecture – Ensures software application compatibility with other Cortex-A processors • LPAE support up to 1TB physical memory • Full hardware virtualization support • From ARM: delivers 2X performance over Cortex-A9 processor with similar power • big.LITTLE configuration support for mobile devices
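The 1TB figure for LPAE comes from address width: LPAE extends ARMv7-A physical addresses from 32 to 40 bits. A short sketch of that arithmetic follows; the constant and function names are illustrative only.

```python
LPAE_PHYSICAL_ADDRESS_BITS = 40  # LPAE widens physical addresses from 32 to 40 bits


def addressable_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses reachable with the given address width."""
    return 1 << address_bits


print(addressable_bytes(32) // 2**30, "GiB addressable with 32-bit physical addresses")
print(addressable_bytes(LPAE_PHYSICAL_ADDRESS_BITS) // 2**40, "TiB addressable with LPAE")
```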
  19. 19. Datacenter Focused Roadmap “Triple Play”: 3 Generations of Pin-Compatible SOCs (timeline: 2013, 2014, 2015) • Highbank: ECX-1000 (4 Core, ARM® Cortex A9) – Power Efficient Solution for Storage and Web Hosting • Midway: ECX-2000 (4 Core, ARM® Cortex A15) – Performance/$ for Cloud and Analytics • Sarita (ARM® Cortex A57) – Compatible 64-bit On-Ramp for Early Access and Ecosystem Enablement • Lago (ARM® Cortex A57) – Flagship 64-bit Product for a Broader Application Set – 3rd Generation Calxeda Fabric and I/O [Source] Calxeda public SOC roadmap (June 2013)
  20. 20. “Midway”: Calxeda ECX-2000 Compared to Calxeda’s Cortex-A9 SOC (ECX-1000), the “Midway” SOC delivers: –  1.5X more single-thread performance –  2X more floating point performance –  3X STREAM (memory b/w) performance –  4X+ more physical memory support (16GB+) –  Same performance-per-Watt Plan to update Anandtech benchmark report
  21. 21. But, ARM doesn’t make/sell SOCs?
  22. 22. ARM® Business Model • ARM does not make or sell SOCs. • Instead, ARM licenses IP and technology to partners (like Calxeda) who design and build System-on-Chips (SOCs) for various industries and markets. • Calxeda is focused exclusively on bringing ARM-based technology to the datacenter. – Calxeda provides its own IP (e.g. Fabric) as additional value for servers.
  23. 23. EnergyCore® architecture at a glance A complete building block for hyper-efficient computing EnergyCore Management Engine Advanced system, power and fabric management for energy-proportional computing I/O Controllers Standard drivers, standard interfaces. No surprises. Processor Complex Multi-core ARM® processors integrated with high bandwidth memory controllers EnergyCore Fabric Switch Integrated high-performance fabric provides inter-node connectivity with industry standard networking
  24. 24. EnergyCore® Fabric (F1/F2) Integrated 80Gb (8x10Gb cross-bar) Fabric Switch: • Up to 5 external links: – Dynamic bandwidth: 1Gb to 10Gb per link – < 200 nanoseconds latency, node to node • 3 internal links (to the SOC): – 2x 10Gb Ethernet ports to the OS – 1x 10Gb Ethernet port to Mgmt – Transparent to OS and software • Topology agnostic → Eliminates Top-of-Rack-Switch ports & cabling → Enables extreme density; lowers cost and power
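The 80Gb cross-bar figure matches the link counts on this slide: 5 external plus 3 internal links, each up to 10Gb. The sketch below adds them up; treating the 80Gb number as the plain sum of peak link rates is an assumption about how the slide arrives at it, and all names are illustrative.

```python
MAX_LINK_GBPS = 10   # each link runs at up to 10Gb per the slide
EXTERNAL_LINKS = 5   # node-to-node fabric links
INTERNAL_LINKS = 3   # 2 Ethernet ports to the OS, 1 to the management engine


def crossbar_gbps(external: int = EXTERNAL_LINKS,
                  internal: int = INTERNAL_LINKS,
                  link_gbps: int = MAX_LINK_GBPS) -> int:
    """Aggregate cross-bar bandwidth, assuming every link runs at its peak rate."""
    return (external + internal) * link_gbps


print(f"{EXTERNAL_LINKS + INTERNAL_LINKS} links x {MAX_LINK_GBPS}Gb = {crossbar_gbps()}Gb")

# External links can be throttled down to 1Gb each (dynamic bandwidth).
print(f"all external links at 1Gb: {EXTERNAL_LINKS * 1 + INTERNAL_LINKS * MAX_LINK_GBPS}Gb")
```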
  25. 25. So, what can we use this for?
  26. 26. Target Workloads • Data-Intensive Applications: – Storage (scale-out, distributed storage) • e.g. Ceph, Gluster, etc. – Analytics (NoSQL, MapReduce, distributed databases) • e.g. Hadoop, Cassandra, etc. • Distributed, Stateless Applications – Web Front End – Caching Servers – Content Distribution Networks (CDN)
  27. 27. Use-Case: Storage via Ceph • Official Ceph “Dumpling”+ release now supports Calxeda-based platforms • Initial benchmarks complete (with x86 comparison) – Even without optimizations, performance is promising • Identified optimization areas (under investigation): – Potentially use NEON instructions for CRC32 – Implement zero-copy on OSDs – Transition reads/writes to bufferlists – Optimize client side too – librados/librbd
  28. 28. Use-Case: Storage via Ceph With the same number of HDDs, a Calxeda-based system delivers 50% more performance than traditional x86 servers.
  29. 29. The AAEON CRS-200S-2R Advantage An ARM-based, lower cost, higher performance server platform for scale-out storage Calxeda’s ARM-based SOCs: • Energy Efficient • More cores per HDD • Lower system power • High Bandwidth Fabric • Multi-10Gb links for data-intensive apps Compared to traditional x86-based, 2U rack mount servers, the AAEON CRS-200S-2R server platform is: ✓ 35% Lower TCO* ✓ 66% Less Rack Space ✓ 50% Higher performance
  30. 30. Summary • Even 64-bit ARM processors are not ideal for every single workload. • However, scale-out, data-intensive workloads can leverage ARM’s energy-efficiency to provide a significantly better TCO. • For the server market (especially with scale-out apps), replacing the CPU core is not enough. – Look for SOCs that optimize “between the nodes” in a cluster (e.g. fabric interconnects will help dramatically) • Interested in joining the “ARM revolution”? – Contact us! – John Mao,
  31. 31. Thank You!