Deview 2013: Rise of the (Wimpy) Machines - John Mao

  1. Rise of the (Wimpy) Machines: Datacenter Efficiency with ARM-based Servers
     John Mao, Director of Strategy, Calxeda
  2. What is the name of the computer system in this movie that tried to end the human race?
     Skynet
  3. Origins of Wimpy Core Computing
     •  FAWN: A Fast Array of Wimpy Nodes
        –  Project from CMU led by Prof. David Andersen, started in 2008 (active through 2012)
        –  Measures and compares the performance-per-Joule advantages over traditional servers
        –  Original focus on large distributed key-value store applications and use cases (e.g. Amazon Dynamo, LinkedIn’s Voldemort, Facebook’s memcached)
     [Publication] http://www.sigops.org/sosp/sosp09/papers/andersen-sosp09.pdf
     [Website] http://www.cs.cmu.edu/~fawnproj/
  4. FAWN: A Fast Array of Wimpy Nodes
     •  Why FAWN? Motivated by key trends:
        –  Increasing CPU-I/O gap
        –  CPU power consumption grows super-linearly with speed
        –  Dynamic power scaling on traditional systems is surprisingly inefficient
  5. FAWN: A Fast Array of Wimpy Nodes
     [Photo: FAWN hardware generations, labeled 1G through 5G]
     [Photo Credit] http://www.cs.cmu.edu/~fawnproj/
  6. FAWN: A Fast Array of Wimpy Nodes
     •  Multiple generations of hardware used:
        –  1G (2008)
           •  Single-core 500MHz AMD Geode LX processor
           •  256MB DDR SDRAM (400MHz)
           •  100Mbps Ethernet
        –  5G (2012)
           •  Intel Atom D510: 1.66GHz dual-core w/HT
           •  2-4GB DDR2 (667MHz)
           •  100Mbps Ethernet
  7. Key Findings from FAWN Project
     “The FAWN cluster achieves 364 queries per Joule — two orders of magnitude better than traditional disk-based clusters.”
     [Source] http://www.sigops.org/sosp/sosp09/papers/andersen-sosp09.pdf
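The queries-per-Joule metric quoted above is throughput divided by power draw, since one watt is one joule per second. A minimal sketch of that arithmetic follows; the function name and the input numbers are hypothetical and chosen only for illustration, they are not FAWN measurements.

```python
def queries_per_joule(queries_per_second: float, watts: float) -> float:
    """Energy efficiency as queries served per joule (1 W = 1 J/s)."""
    if watts <= 0:
        raise ValueError("power draw must be positive")
    return queries_per_second / watts


# Hypothetical cluster: 10,000 queries/s while drawing 50 W at the wall.
print(queries_per_joule(10_000, 50))
```

Comparing two systems on this metric only makes sense when both power figures are measured the same way, for example at the wall and under the same workload.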
  8. So what about ARM®?
     ARM is a good “wimpy” processor & CPU architecture for the datacenter because:
     1.  Focus on low power: origins in embedded systems and mobile devices
     2.  Datacenter focused roadmap: 32-bit CPUs today, 64-bit CPUs in 1-2 years; increasing performance (with same energy efficiency)
     3.  Business model: ability to integrate for specific markets and applications
     4.  Emerging software ecosystem: while not x86, ARM has a growing ecosystem
  9. Focus on Low Power
     •  History in targeting energy-sensitive markets:
        –  Netbooks, Smartbooks, Tablets, Thin Clients
        –  Smartphones, Feature phones
        –  Set-top Box, Digital TV, Blu-Ray players, Gaming consoles
        –  Automotive Infotainment, Navigation
        –  Wireless base-stations, VoIP phones and equipment
     •  Design Goals
        –  Performance, Power, Easy Synthesis
  10. Focus on Low Power
      In 2005, about 98% of all mobile phones sold used at least one ARM processor.
      As of 2009, due to low power consumption, the ARM architecture is the most widely used 32-bit RISC architecture in mobile devices and embedded systems.
      [Source] http://en.wikipedia.org/wiki/ARM_architecture
  11. Focus on Low Power
      Translating ARM energy-efficiency into the modern datacenter with Cortex-A9 (Today!):

      Workload (on 24 nodes & SSDs)    Total System* Power    ~Power per ECX-1000 Node (with disk @Wall)
      Linux at Rest                    130 W                  5.4 W
      phpbench                         155 W                  6.5 W
      Coremark (4 threads per SOC)     169 W                  7.0 W
      Website @ 70% Utilization        172 W                  7.2 W
      LINPACK                          191 W                  7.9 W
      STREAM                           205 W                  8.5 W

      *All measurements done on a 24-node system @1.1GHz, with 24 SSDs and 96 GB DRAM in the Calxeda Lab.
      For specific workloads, ECX-1000 can enable a complete 24-node cluster at a similar power level to a 2-socket x86.
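The per-node column in the power slide above is the total system power divided across the 24 nodes. A small sketch that redoes the division, using only the figures from the slide (the variable names are illustrative): the quotients agree with the slide's approximate per-node values to within 0.1 W.

```python
NODES = 24  # node count stated in the slide's footnote

# workload -> (total system watts, per-node watts as printed on the slide)
measurements = {
    "Linux at Rest": (130, 5.4),
    "phpbench": (155, 6.5),
    "Coremark (4 threads per SOC)": (169, 7.0),
    "Website @ 70% Utilization": (172, 7.2),
    "LINPACK": (191, 7.9),
    "STREAM": (205, 8.5),
}

# Divide each total evenly across the nodes.
per_node = {name: total / NODES for name, (total, _) in measurements.items()}

for name, (total, slide_value) in measurements.items():
    print(f"{name:30s} {total:4d} W total  {per_node[name]:.2f} W/node  (slide: ~{slide_value} W)")
```

The even split is itself an assumption: it attributes fans, power supply losses, and SSDs equally to every node, which is what "with disk @Wall" suggests.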
  12. But, what about performance?
  13. Online Review: Calxeda’s ARM Server Tested
      AnandTech review comparing the Boston Viridis cluster of 24 Calxeda ECX-1000 (Cortex-A9) nodes against an Intel Xeon E5-2650L system (March 2013).
      http://www.anandtech.com/show/6757/calxedas-arm-server-tested
  14. Calxeda Provides Better Web Throughput
      Boston Viridis outperforms Xeon E5-2650L by 30% with more than 15 users.
      Test is phpBB running on Apache2 with variable numbers of users (concurrency) generating traffic.
  15. Calxeda Provides Lower Response Times
      Boston Viridis outperforms Xeon E5-2650L by 60% with more than 15 users.
      Test is phpBB running on Apache2 with variable numbers of users (concurrency) generating traffic.
  16. Calxeda Provides Highest Performance/Watt
      Boston Viridis provides 80% more throughput per Watt than Xeon E5.
      •  10-36% less raw power
      Test is phpBB running on Apache2 with variable numbers of users (concurrency) generating traffic.
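Throughput per watt combines the two effects reported on these benchmark slides: more throughput and less power. The sketch below shows the relationship under one loud assumption: that the 30% throughput advantage and the 10-36% power saving can be combined at the same load points, which the slides do not state. The function name is hypothetical.

```python
def relative_perf_per_watt(throughput_gain: float, power_saving: float) -> float:
    """Ratio of throughput-per-watt versus a baseline.

    throughput_gain: fractional extra throughput (0.30 means 30% more).
    power_saving: fractional power reduction (0.10 means 10% less power).
    """
    if not 0 <= power_saving < 1:
        raise ValueError("power_saving must be in [0, 1)")
    return (1 + throughput_gain) / (1 - power_saving)


# Assumption: pair the 30% throughput gain with each end of the 10-36% range.
low = relative_perf_per_watt(0.30, 0.10)   # roughly 1.44x
high = relative_perf_per_watt(0.30, 0.36)  # roughly 2.03x
print(f"{low:.2f}x to {high:.2f}x")
```

Under that assumption the quoted 80% improvement (1.8x) sits inside the computed range, which is at least consistent with the slide; it does not show how the reviewer derived the figure.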
  17. Online Review: Calxeda’s ARM Server Tested
      Reviewer’s Key Takeaways:
      –  For scale-out workloads, Calxeda’s ARM-based scale-out hardware architecture is very promising.
      –  Microbenchmarks show Calxeda ECX-1000 ~10% behind Intel Atom N2800 @1.86 GHz
      –  “Real World” application benchmarking shows 70%+ higher performance-per-watt than Intel Xeon E5 at mid to high user load
      –  “Calxeda really did it: each server needs about 8.3W (200W/24), measured at the wall…about 6W (at 1.4GHz) per server node…”
      –  “So on the one hand, no, the current Calxeda servers are no Intel Xeon killers (yet). However, we feel that Calxeda's ECX-1000 server node is revolutionary technology.”
  18. ARM® Cortex-A15
      •  Based on ARMv7-A architecture
         –  Ensures software application compatibility with other Cortex-A processors
      •  LPAE support up to 1TB physical memory
      •  Full hardware virtualization support
      •  From ARM: delivers 2X performance over Cortex-A9 processor with similar power
      •  big.LITTLE configuration support for mobile devices
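The 1TB figure for LPAE follows from address width: LPAE extends ARMv7-A physical addresses from 32 to 40 bits. A short sketch of the arithmetic (the helper name is illustrative):

```python
GIB = 1 << 30  # bytes in one gibibyte
TIB = 1 << 40  # bytes in one tebibyte


def addressable_bytes(address_bits: int) -> int:
    """Bytes reachable with a byte-addressed physical address of this width."""
    return 1 << address_bits


print(addressable_bytes(32) // GIB)  # 32-bit physical addresses: 4 (GiB)
print(addressable_bytes(40) // TIB)  # 40-bit physical addresses with LPAE: 1 (TiB)
```

Each process still sees a 32-bit virtual address space; the wider physical range mainly helps the kernel and hypervisor place many such processes or guests in memory.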
  19. Datacenter Focused Roadmap
      [Roadmap chart, timeline 2013 / 2014 / 2015]
      •  Highbank: ECX-1000 (4 Core, ARM® Cortex A9): Power Efficient Solution for Storage and Web Hosting
      •  Midway: ECX-2000 (4 Core, ARM® Cortex A15): Performance/$ for Cloud and Analytics
      •  Sarita (ARM® Cortex A57): Compatible 64-bit On-Ramp for Early Access and Ecosystem Enablement
      •  Lago (ARM® Cortex A57): Flagship 64-bit Product for a Broader Application Set; 3rd Generation Calxeda Fabric and I/O
      •  “Triple Play”: 3 Generations of Pin-Compatible SOCs
      [Source] Calxeda public SOC roadmap (June 2013)
  20. “Midway”: Calxeda ECX-2000
      Compared to Calxeda’s Cortex-A9 SOC (ECX-1000), the “Midway” SOC delivers:
      –  1.5X more single-thread performance
      –  2X more floating point performance
      –  3X STREAM (memory b/w) performance
      –  4X+ more physical memory support (16GB+)
      –  Same performance-per-Watt
      Plan to update the AnandTech benchmark report
  21. But, ARM doesn’t make/sell SOCs?
  22. ARM® Business Model
      •  ARM does not make or sell SOCs.
      •  Instead, ARM licenses IP and technology to partners (like Calxeda) who design and build System-on-Chips (SOCs) for various industries and markets.
      •  Calxeda is focused exclusively on bringing ARM-based technology to the datacenter.
         –  Calxeda provides its own IP (e.g. Fabric) as additional value for servers.
  23. EnergyCore® architecture at a glance
      A complete building block for hyper-efficient computing
      •  EnergyCore Management Engine: Advanced system, power and fabric management for energy-proportional computing
      •  I/O Controllers: Standard drivers, standard interfaces. No surprises.
      •  Processor Complex: Multi-core ARM® processors integrated with high bandwidth memory controllers
      •  EnergyCore Fabric Switch: Integrated high-performance fabric provides inter-node connectivity with industry standard networking
  24. EnergyCore® Fabric (F1/F2)
      Integrated 80Gb (8x10Gb cross-bar) Fabric Switch:
      •  Up to 5 external links:
         –  Dynamic bandwidth: 1Gb to 10Gb per link
         –  < 200 ns latency, node to node
      •  3 internal links (to the SOC):
         –  2x 10Gb Ethernet ports to the OS
         –  1x 10Gb Ethernet port to Mgmt
         –  Transparent to OS and software
      •  Topology agnostic
      →  Eliminates Top-of-Rack-Switch ports & cabling
      →  Enables extreme density; lowers cost and power
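The 80Gb cross-bar figure on the fabric slide can be read as a simple port budget. The sketch below assumes that the eight 10Gb cross-bar ports are exactly the five external and three internal links listed on the slide; the slide implies this but does not say so outright, and the constant names are illustrative.

```python
LINK_GBPS = 10        # peak rate per link, from the slide
EXTERNAL_LINKS = 5    # node-to-node fabric links
INTERNAL_LINKS = 3    # 2 Ethernet ports to the OS + 1 to management

# Assumption: every cross-bar port is one of the links listed above.
ports = EXTERNAL_LINKS + INTERNAL_LINKS
aggregate_gbps = ports * LINK_GBPS

print(ports, aggregate_gbps)  # 8 80
```

Because external links can drop from 10Gb to 1Gb dynamically, the aggregate is a peak figure rather than a sustained rate.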
  25. So, what can we use this for?
  26. Target Workloads
      •  Data-Intensive Applications:
         –  Storage (scale-out, distributed storage)
            •  e.g. Ceph, Gluster, etc.
         –  Analytics (NoSQL, MapReduce, distributed databases)
            •  e.g. Hadoop, Cassandra, etc.
      •  Distributed, Stateless Applications
         –  Web Front End
         –  Caching Servers
         –  Content Distribution Networks (CDN)
  27. Use-Case: Storage via Ceph
      •  Official Ceph “Dumpling”+ release now supports Calxeda-based platforms
      •  Initial benchmarks complete (with x86 comparison)
         –  Even without optimizations, performance is promising
      •  Identified optimization areas (under investigation):
         –  Potentially use NEON instructions for CRC32
         –  Implement zero-copy on OSDs
         –  Transition reads/writes to bufferlists
         –  Optimize the client side too (librados/librbd)
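To show why CRC32 appears in the optimization list above, here is a bit-at-a-time reference implementation of the checksum loop: eight shift-and-XOR steps per input byte, which is the kind of hot loop that SIMD or table-driven code replaces. This is a sketch, not Ceph's code. The slide does not say which CRC-32 polynomial is involved; the zlib polynomial is used here only because Python's standard library can verify the result.

```python
import zlib


def crc32_bitwise(data: bytes, poly: int = 0xEDB88320) -> int:
    """Reference CRC-32 (reflected form, zlib polynomial), one bit per step."""
    crc = 0xFFFFFFFF  # standard initial value
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right; fold in the polynomial when the dropped bit was 1.
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
    return crc ^ 0xFFFFFFFF  # standard final inversion


payload = b"scale-out storage"
print(hex(crc32_bitwise(payload)), crc32_bitwise(payload) == zlib.crc32(payload))
```

In a storage daemon this loop runs over every block written or read, so its per-byte cost shows up directly in throughput on low-power cores.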
  28. Use-Case: Storage via Ceph
      With the same number of HDDs, the Calxeda-based system delivers 50% more performance than traditional x86 servers.
  29. The AAEON CRS-200S-2R Advantage
      An ARM-based, lower cost, higher performance server platform for scale-out storage
      Calxeda’s ARM-based SOCs:
      •  Energy Efficient
      •  More cores per HDD
      •  Lower system power
      •  High Bandwidth Fabric
      •  Multi-10Gb links for data-intensive apps
      Compared to traditional x86-based, 2U rack mount servers, the AAEON CRS-200S-2R server platform is:
      ✓  35% Lower TCO*
      ✓  66% Less Rack Space
      ✓  50% Higher performance
  30. Summary
      •  Even 64-bit ARM processors are not ideal for every single workload.
      •  However, scale-out, data-intensive workloads can leverage ARM’s energy-efficiency to provide a significantly better TCO.
      •  For the server market (especially with scale-out apps), replacing the CPU core is not enough.
         –  Look for SOCs that optimize “between the nodes” in a cluster (e.g. fabric interconnects will help dramatically)
      •  Interested in joining the “ARM revolution”?
         –  Contact us! John Mao, john.mao@calxeda.com
  31. Thank You!