Deview 2013   rise of the wimpy machines - john mao
Published in: Technology
Transcript

  • 1. Rise of the (Wimpy) Machines: Datacenter Efficiency with ARM-based Servers. John Mao, Director of Strategy, Calxeda
  • 2. What is the name of the computer system in this movie that tried to end the human race? Skynet
  • 3. Origins of Wimpy Core Computing
      • FAWN: A Fast Array of Wimpy Nodes
          – Project from CMU led by Prof. David Andersen, started in 2008 (active through 2012)
          – Measures and compares the performance-per-Joule energy advantages over traditional servers
          – Original focus on large distributed key-value store applications and use cases (e.g. Amazon Dynamo, LinkedIn’s Voldemort, Facebook’s memcached)
      [Publication] http://www.sigops.org/sosp/sosp09/papers/andersen-sosp09.pdf
      [Website] http://www.cs.cmu.edu/~fawnproj/
  • 4. FAWN: A Fast Array of Wimpy Nodes
      • Why FAWN? Motivated by key trends:
          – Increasing CPU-I/O gap
          – CPU power consumption grows super-linearly with speed
          – Dynamic power scaling on traditional systems is surprisingly inefficient
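The second trend can be illustrated with the standard CMOS dynamic-power approximation, P ≈ C·V²·f. The sketch below assumes that supply voltage scales linearly with clock frequency and that throughput is proportional to frequency. Both are simplifications that the slides do not state. Under those assumptions power grows with the cube of clock speed, so energy per operation grows with its square. The function names are illustrative only.

```python
def relative_dynamic_power(freq_ratio: float) -> float:
    """Relative dynamic power under P ~ C * V^2 * f.

    Assumption (not from the slides): voltage scales linearly with frequency.
    """
    voltage_ratio = freq_ratio  # assumed V proportional to f
    return voltage_ratio ** 2 * freq_ratio


def relative_energy_per_op(freq_ratio: float) -> float:
    """Relative energy per operation = power / throughput.

    Assumption: throughput is proportional to clock frequency.
    """
    return relative_dynamic_power(freq_ratio) / freq_ratio


if __name__ == "__main__":
    for ratio in (0.5, 1.0, 2.0):
        print(ratio, relative_dynamic_power(ratio), relative_energy_per_op(ratio))
```

In this model a core running at half the clock speed spends a quarter of the energy per operation, which is the intuition behind trading a few fast cores for many slower ones.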
  • 5. FAWN: A Fast Array of Wimpy Nodes
      [Photo: FAWN hardware generations 1G through 5G]
      [Photo Credit] http://www.cs.cmu.edu/~fawnproj/
  • 6. FAWN: A Fast Array of Wimpy Nodes
      • Multiple generations of hardware used:
          – 1G (2008)
              • Single-core 500MHz AMD Geode LX processor
              • 256MB DDR SDRAM (400MHz)
              • 100Mbps Ethernet
          – 5G (2012)
              • Intel Atom D510: 1.66GHz dual-core w/HT
              • 2-4GB DDR2 (667MHz)
              • 100Mbps Ethernet
  • 7. Key Findings from FAWN Project
      “The FAWN cluster achieves 364 queries per Joule — two orders of magnitude better than traditional disk-based clusters.”
      [Source] http://www.sigops.org/sosp/sosp09/papers/andersen-sosp09.pdf
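Queries per Joule is sustained throughput divided by average power, since one Watt is one Joule per second. A minimal sketch of the metric follows. The throughput and power values in the example are hypothetical, chosen only so that the result matches the quoted 364 queries per Joule. They are not measurements from the FAWN paper.

```python
def queries_per_joule(queries_per_second: float, power_watts: float) -> float:
    """Energy efficiency of a query workload.

    (queries / second) / (Joules / second) = queries / Joule.
    """
    if power_watts <= 0:
        raise ValueError("power_watts must be positive")
    return queries_per_second / power_watts


if __name__ == "__main__":
    # Hypothetical inputs for illustration only (not figures from the paper).
    print(queries_per_joule(queries_per_second=36_400, power_watts=100))
```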
  • 8. So what about ARM®?
      ARM is a good “wimpy” processor & CPU architecture for the datacenter because:
      1. Focus on low power: origins in embedded systems and mobile devices
      2. Datacenter focused roadmap: 32-bit CPUs today, 64-bit CPUs in 1-2 years; increasing performance (with same energy efficiency)
      3. Business model: ability to integrate for specific markets and applications
      4. Emerging software ecosystem: while not x86, ARM has a growing ecosystem
  • 9. Focus on Low Power
      • History in targeting energy-sensitive markets:
          – Netbooks, Smartbooks, Tablets, Thin Clients
          – Smartphones, Feature phones
          – Set-top Box, Digital TV, Blu-Ray players, Gaming consoles
          – Automotive Infotainment, Navigation
          – Wireless base-stations, VoIP phones and equipment
      • Design Goals
          – Performance, Power, Easy Synthesis
  • 10. Focus on Low Power
      In 2005, about 98% of all mobile phones sold used at least one ARM processor.
      As of 2009, due to low power consumption, the ARM architecture is the most widely used 32-bit RISC architecture in mobile devices and embedded systems.
      [Source] http://en.wikipedia.org/wiki/ARM_architecture
  • 11. Focus on Low Power
      Translating ARM energy-efficiency into the modern datacenter with Cortex-A9:

      Workload (on 24 nodes & SSDs)   | Total System* Power (Today!) | ~Power per ECX-1000 Node (with disk @Wall)
      Linux at Rest                   | 130 W                        | 5.4 W
      phpbench                        | 155 W                        | 6.5 W
      Coremark (4 threads per SOC)    | 169 W                        | 7.0 W
      Website @ 70% Utilization       | 172 W                        | 7.2 W
      LINPACK                         | 191 W                        | 7.9 W
      STREAM                          | 205 W                        | 8.5 W

      *All measurements done on a 24-node system @1.1GHz, with 24 SSDs and 96 GB DRAM in the Calxeda Lab.
      For specific workloads, ECX-1000 can enable a complete 24-node cluster at a similar power level to a 2-socket x86 server.
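The per-node column in the table above appears to be the total system power divided across the 24 nodes. The sketch below recomputes it from the slide's totals. The dictionary and function names are illustrative, and the even split across nodes is an assumption. The slide's rounded per-node figures agree with this division to within about 0.1 W.

```python
NODES = 24  # node count stated on the slide

# Total system power at the wall, in Watts, as listed on the slide.
TOTAL_SYSTEM_WATTS = {
    "Linux at Rest": 130,
    "phpbench": 155,
    "Coremark (4 threads per SOC)": 169,
    "Website @ 70% Utilization": 172,
    "LINPACK": 191,
    "STREAM": 205,
}

# Per-node figures as printed on the slide, for comparison.
SLIDE_PER_NODE_WATTS = {
    "Linux at Rest": 5.4,
    "phpbench": 6.5,
    "Coremark (4 threads per SOC)": 7.0,
    "Website @ 70% Utilization": 7.2,
    "LINPACK": 7.9,
    "STREAM": 8.5,
}


def per_node_watts(total_watts: float, nodes: int = NODES) -> float:
    """Average power per node, assuming power is split evenly across nodes."""
    return total_watts / nodes


if __name__ == "__main__":
    for workload, total in TOTAL_SYSTEM_WATTS.items():
        print(f"{workload}: {per_node_watts(total):.2f} W per node")
```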
  • 12. But, what about performance?
  • 13. Online Review: Calxeda’s ARM Server Tested
      Anandtech chartered review comparing Boston Viridis’ 24-node Calxeda ECX-1000 (Cortex-A9) cluster against an Intel Xeon E5-2650L system. (March 2013)
      http://www.anandtech.com/show/6757/calxedas-arm-server-tested
  • 14. Calxeda Provides Better Web Throughput
      Boston Viridis outperforms Xeon E5-2650L by 30% with more than 15 users.
      Test is PHPbb running on Apache2 with variable numbers of users (concurrency) generating traffic.
  • 15. Calxeda Provides Lower Response Times
      Boston Viridis outperforms Xeon E5-2650L by 60% with more than 15 users.
      Test is PHPbb running on Apache2 with variable numbers of users (concurrency) generating traffic.
  • 16. Calxeda Provides Highest Performance/Watt
      Boston Viridis provides 80% more throughput per Watt than Xeon E5.
      • 10-36% less raw power
      Test is PHPbb running on Apache2 with variable numbers of users (concurrency) generating traffic.
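Throughput per Watt combines the two effects reported in these slides: more throughput and less power. A minimal sketch of how the ratios compose is given below. The function name is illustrative. The example pairs the 30% throughput gain from slide 14 with a 28% power reduction, which is a hypothetical point inside the stated 10-36% range. The slides do not say which load levels correspond to which power figures.

```python
def perf_per_watt_gain(throughput_ratio: float, power_ratio: float) -> float:
    """Ratio of throughput-per-Watt between two systems.

    throughput_ratio: candidate throughput / baseline throughput
    power_ratio:      candidate power / baseline power
    """
    if power_ratio <= 0:
        raise ValueError("power_ratio must be positive")
    return throughput_ratio / power_ratio


if __name__ == "__main__":
    # 1.30x throughput at 0.72x power (hypothetical pairing, see lead-in).
    gain = perf_per_watt_gain(throughput_ratio=1.30, power_ratio=0.72)
    print(f"{(gain - 1) * 100:.0f}% more throughput per Watt")
```

Under that pairing the result lands close to the 80% figure on the slide, which suggests the three claims are mutually consistent at some load level.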
  • 17. Online Review: Calxeda’s ARM Server Tested
      Reviewer’s Key Takeaways:
          – For scale-out workloads, Calxeda’s ARM-based scale-out hardware architecture is very promising.
          – Microbenchmarks show Calxeda ECX-1000 ~10% behind Intel Atom N2800 @1.86 GHz
          – “Real World” application benchmarking shows 70%+ higher performance-per-watt than Intel Xeon E5 at mid to high user load
          – “Calxeda really did it: each server needs about 8.3W (200W/24), measured at the wall…about 6W (at 1.4GHz) per server node…”
          – “So on the one hand, no, the current Calxeda servers are no Intel Xeon killers (yet). However, we feel that Calxeda's ECX-1000 server node is revolutionary technology.”
  • 18. ARM® Cortex-A15
      • Based on ARMv7A architecture
          – Ensures software application compatibility with other Cortex-A processors
      • LPAE support up to 1TB physical memory
      • Full hardware virtualization support
      • From ARM: delivers 2X performance over Cortex-A9 processor with similar power
      • big.LITTLE configuration support for mobile devices
  • 19. Datacenter Focused Roadmap
      [Roadmap figure; labels in extraction order, timeline axis 2013, 2014, 2015]
      • 3rd Generation Calxeda Fabric and I/O
      • Lago (ARM® Cortex A57)
      • “Triple Play”: 3 Generations of Pin-Compatible SOCs
      • Sarita (ARM® Cortex A57)
      • Flagship 64-bit Product for a Broader Application Set
      • Compatible 64-bit On-Ramp for Early Access and Ecosystem Enablement
      • Midway: ECX-2000 (4 Core, ARM® Cortex A15): Performance/$ for Cloud and Analytics
      • Highbank: ECX-1000 (4 Core, ARM® Cortex A9): Power Efficient Solution for Storage and Web Hosting
      [Source] Calxeda public SOC roadmap (June 2013)
  • 20. “Midway”: Calxeda ECX-2000
      Compared to Calxeda’s Cortex-A9 SOC (ECX-1000), the “Midway” SOC delivers:
          – 1.5X more single-thread performance
          – 2X more floating point performance
          – 3X STREAM (memory b/w) performance
          – 4X+ more physical memory support (16GB+)
          – Same performance-per-Watt
      Plan to update Anandtech benchmark report
  • 21. But, ARM doesn’t make/sell SOCs?
  • 22. ARM® Business Model
      • ARM does not make or sell SOCs.
      • Instead, ARM licenses IP and technology to partners (like Calxeda) who design and build System-on-Chips (SOCs) for various industries and markets.
      • Calxeda is focused exclusively on bringing ARM-based technology to the datacenter.
          – Calxeda provides its own IP (e.g. Fabric) as additional value for servers.
  • 23. EnergyCore® architecture at a glance
      A complete building block for hyper-efficient computing
      • EnergyCore Management Engine: Advanced system, power and fabric management for energy-proportional computing
      • I/O Controllers: Standard drivers, standard interfaces. No surprises.
      • Processor Complex: Multi-core ARM® processors integrated with high bandwidth memory controllers
      • EnergyCore Fabric Switch: Integrated high-performance fabric provides inter-node connectivity with industry standard networking
  • 24. EnergyCore® Fabric (F1/F2)
      Integrated 80Gb (8x10Gb cross-bar) Fabric Switch:
      • Up to 5 external links:
          – Dynamic bandwidth: 1Gb to 10Gb per link
          – < 200 nanoseconds latency, node to node
      • 3 internal links (to the SOC):
          – 2x 10Gb Ethernet ports to the OS
          – 1x 10Gb Ethernet port to Mgmt
          – Transparent to OS and software
      • Topology agnostic
      → Eliminates Top-of-Rack-Switch ports & cabling
      → Enables extreme density; lowers cost and power
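The link counts on this slide add up to the stated crossbar size: 5 external links plus 3 internal links gives 8 ports, and 8 ports at 10Gb each gives 80Gb. The sketch below models that arithmetic. The class and field names are hypothetical, and reading the 8 crossbar ports as the sum of the external and internal links is an interpretation of the slide rather than a statement from Calxeda documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FabricSwitch:
    """Toy model of the per-SOC fabric switch described on the slide."""

    external_links: int = 5   # "up to 5 external links"
    internal_links: int = 3   # 2x Ethernet to the OS + 1x to management
    max_link_gbps: int = 10   # each link runs at up to 10Gb

    @property
    def ports(self) -> int:
        # Assumed: crossbar ports = external links + internal links.
        return self.external_links + self.internal_links

    @property
    def aggregate_gbps(self) -> int:
        """Aggregate crossbar bandwidth with every port at full rate."""
        return self.ports * self.max_link_gbps


if __name__ == "__main__":
    switch = FabricSwitch()
    print(switch.ports, switch.aggregate_gbps)
```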
  • 25. So, what can we use this for?
  • 26. Target Workloads
      • Data-Intensive Applications:
          – Storage (scale-out, distributed storage)
              • e.g. Ceph, Gluster, etc.
          – Analytics (NoSQL, MapReduce, distributed databases)
              • e.g. Hadoop, Cassandra, etc.
      • Distributed, State-less Applications
          – Web Front End
          – Caching Servers
          – Content Distribution Networks (CDN)
  • 27. Use-Case: Storage via Ceph
      • Official Ceph “Dumpling”+ release now supports Calxeda-based platforms
      • Initial benchmarks complete (with x86 comparison)
          – Even without optimizations, performance is promising
      • Identified optimization areas (under investigation):
          – Potentially use NEON instructions for CRC32
          – Implement zero-copy on OSDs
          – Transition reads/writes to bufferlists
          – Optimize client side too: librados/librbd
  • 28. Use-Case: Storage via Ceph
      With the same number of HDDs, a Calxeda-based system delivers 50% more performance than traditional x86 servers.
  • 29. The AAEON CRS-200S-2R Advantage
      An ARM-based, lower cost, higher performance server platform for scale-out storage
      Calxeda’s ARM-based SOCs:
      • Energy Efficient
          – More cores per HDD
          – Lower system power
      • High Bandwidth Fabric
          – Multi-10Gb links for data-intensive apps
      Compared to traditional x86-based, 2U rack mount servers, the AAEON CRS-200S-2R server platform is:
      ✓ 35% Lower TCO*
      ✓ 66% Less Rack Space
      ✓ 50% Higher performance
  • 30. Summary
      • Even 64-bit ARM processors are not ideal for every single workload.
      • However, scale-out, data-intensive workloads can leverage ARM’s energy-efficiency to provide a significantly better TCO.
      • For the server market (especially with scale-out apps), replacing the CPU core is not enough.
          – Look for SOCs that optimize “between the nodes” in a cluster (e.g. fabric interconnects will help dramatically)
      • Interested in joining the “ARM revolution”?
          – Contact us! John Mao, john.mao@calxeda.com
  • 31. Thank You!
