HBaseCon 2013: Scalable Network Designs for Apache HBase
Presented by: Benoit Sigoure, Arista Networks

Transcript:

1. Scalable Network Designs for Apache HBase. Benoît "tsuna" Sigoure, Member of the Yak Shaving Staff, tsuna@aristanetworks.com, @tsunanet. Agenda: Scalable Network Design; Some Measurements; How the Network Can Help.
2. A Brief History of Network Software. Timeline from 1970 to 2010: custom monolithic embedded OS; modified BSD, QNX, or Linux kernel base.
3. A Brief History of Network Software. Same timeline, adding: "The switch is just a Linux box."
4. A Brief History of Network Software. Same timeline, adding: EOS = Linux.
5. A Brief History of Network Software. Same timeline, adding: custom ASIC.
6. Inside a Modern Switch: Arista 7050Q-64.
7. Inside a Modern Switch: 2 PSUs, 4 fans (Arista 7050Q-64).
8. Inside a Modern Switch: 2 PSUs, 4 fans, DDR3, x86 CPU, USB flash, SATA SSD (Arista 7050Q-64).
9. Inside a Modern Switch: 2 PSUs, 4 fans, DDR3, x86 CPU, network chip(s), USB flash, SATA SSD (Arista 7050Q-64).
10. Inside a Modern Switch: 2 PSUs, 4 fans, DDR3, x86 CPU, network chip(s), USB flash, SATA SSD (Arista 7050Q-64).
11. It's All About The Software: Command API, Zero Touch Replacement, VmTracer, CloudVision, Email, Wireshark, Fast Failover, DANZ, Fabric Monitoring.
12. It's All About The Software: same list, plus OpenTSDB.
13. It's All About The Software: same list, plus #!/bin/bash.
14. It's All About The Software: same list, plus $ sudo yum install <pkg>.
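Because the switch runs Linux and exposes the Command API (eAPI) over HTTP, ordinary shell scripts can drive it. A minimal sketch, assuming a switch reachable as tor-1.example.com with eAPI enabled and placeholder admin/admin credentials (the hostname, credentials, and queried command are illustrative, not from the talk):

    #!/bin/bash
    # Ask an Arista switch a question through eAPI (JSON-RPC over HTTPS).
    # Hostname and credentials are placeholders; "show version" is just a demo command.
    SWITCH="tor-1.example.com"
    curl -sk -u admin:admin "https://${SWITCH}/command-api" \
      -H 'Content-Type: application/json' \
      -d '{"jsonrpc": "2.0", "method": "runCmds",
           "params": {"version": 1, "cmds": ["show version"], "format": "json"},
           "id": "hbasecon-demo"}'

The same endpoint accepts configuration commands, which is what makes "the switch is just a Linux box" practical for automation.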
15. Typical Design: Keep It Simple. Start small: a single rack, a single switch, servers 1..N in Rack-1. Scale: 64 nodes max on a 10Gbps, line-rate network. Switch pictured: 7050T-64, with 48 RJ45 ports, 4 QSFP+ ports, console port, management port, USB port, and air vents.
16. Typical Design: Keep It Simple. Start small: 2 racks, 2 switches, no core, 40Gbps uplinks between the rack switches. Scale: 96 nodes max with 4 uplinks. Oversubscription ratio 1:3.
17. Typical Design: Keep It Simple. 3 racks, 3 switches, no core. Scale: 144 nodes max with 4 uplinks. Oversubscription ratio 1:6.
18. 3 racks, 4 switches, 4x40Gbps uplinks. Scale: 144 nodes max with 4 uplinks per rack. Oversubscription ratio 1:3.
19. 4 racks, 5 switches, 4x40Gbps uplinks. Scale: 192 nodes max. Oversubscription ratio 1:3.
20. 5 to 7 racks, 7 to 9 switches, 2x40Gbps uplinks. Scale: 240 to 336 nodes. Oversubscription ratio 1:3.
21. 56 racks, 8 core switches, 8x10Gbps uplinks. Scale: 3136 nodes. Oversubscription ratio 1:3.
22. 1152 racks, 16 core modular switches, 16x10Gbps uplinks. Scale: 55,296 nodes. Oversubscription ratio still 1:3.
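The oversubscription ratios quoted above are just server-facing bandwidth divided by uplink bandwidth. A back-of-the-envelope sketch of that arithmetic, using the single-rack figures from these slides plus the 1:4 and 1:5.33 cases from the deep-buffer test later in the deck:

    #!/bin/bash
    # Oversubscription ratio = downlink (server-facing) Gbps / uplink Gbps.
    ratio() {  # ratio <downlink_gbps> <uplink_gbps>
      echo "scale=2; $1 / $2" | bc
    }
    echo "48 x 10G down, 4 x 40G up -> 1:$(ratio $((48*10)) $((4*40)))"   # 1:3.00
    echo "16 x 10G down, 4 x 10G up -> 1:$(ratio $((16*10)) $((4*10)))"   # 1:4.00
    echo "16 x 10G down, 3 x 10G up -> 1:$(ratio $((16*10)) $((3*10)))"   # 1:5.33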
23. Layer 2 vs Layer 3. Layer 2: "data link layer", an Ethernet "bridge". Layer 3: "network layer", an IP "router". Diagram: with layer 2, redundant uplinks from the rack switches are blocked by STP; with layer 3, each rack switch carries a default route (0.0.0.0/0) via core1 and via core2, so both uplinks are usable.
24. Layer 2 vs Layer 3. Same definitions; the layer 2 side now uses "vendor magic" to keep both uplinks active, compared against the same layer 3 design.
25. Layer 2 vs Layer 3. Repeat of slide 24: layer 2 with vendor magic versus layer 3.
26. Layer 2. Pros: plug'n'play. Cons: everybody needs to be aware of everybody, so watch your MAC and ARP table sizes; no visibility of individual hops; vendor magic or spanning tree; if you break STP and cause a loop, game over; broadcast packets hit everybody.
27. Layer 3. Pros: horizontally scalable; deterministic IP assignment scheme; 100% based on open standards and protocols; can debug with ping / traceroute; degrades better under fault or misconfiguration. Cons: you need to think up your IP assignment scheme ahead of time (one possible scheme is sketched below).
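The deck never spells out an addressing scheme, but the failover example later uses addresses such as 10.4.5.2 with a gateway of 10.4.5.254, which hints at encoding the rack into the third octet. A minimal sketch of such a convention; the pod/rack/host split is an assumption, not the speaker's prescription:

    #!/bin/bash
    # Deterministic rack/host -> IP mapping: 10.<pod>.<rack>.<host>,
    # with .254 reserved for the rack switch's gateway address.
    pod=4; rack=5; host=2
    echo "server:  10.${pod}.${rack}.${host}"      # 10.4.5.2
    echo "gateway: 10.${pod}.${rack}.254"          # 10.4.5.254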
28. 1G vs 10G. 1Gbps is today what 100Mbps was yesterday; new-generation servers come with 10G LAN on board; at 1Gbps the network is one of the first bottlenecks; a single HBase client can very easily push over 4Gbps to HBase. Chart: TeraGen duration in seconds, 1Gbps vs 10Gbps, a 3.5x difference.
29. Jumbo Frames. (Photo: "Elephant Framed!" by Dhinakaran Gajavarathan.)
30. Jumbo Frames. Chart: millions of packets per TeraGen, MTU=1500 vs MTU=9212: a 3x reduction.
31. Jumbo Frames. Charts, MTU=1500 vs MTU=9212: context switches (millions), interrupts (millions), and soft-IRQ CPU time (thousands of jiffies), with reductions of 11%, 8%, and 15% respectively.
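Getting the MTU=9212 numbers above requires every hop, hosts included, to agree on a large MTU. A minimal sketch of the host side on Linux; the interface name and the common 9000-byte host MTU are assumptions (the 9212 on the slides is the switch-side setting):

    #!/bin/bash
    # Enable jumbo frames on a Linux host and verify the setting (run as root).
    IFACE="eth0"                       # placeholder interface name
    ip link set dev "${IFACE}" mtu 9000
    ip link show dev "${IFACE}" | grep -o 'mtu [0-9]*'
    # Confirm large packets pass end to end without fragmentation:
    # ping -M do -s 8972 <peer>        # 8972 = 9000 - 20 (IP) - 8 (ICMP) bytes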
32. Deep Buffers. (Photo: "«Hardware» buffers" by Roger Ferrer Ibáñez.)
33. Deep Buffers. Chart: packet buffer per 10G port, in MB: 1.5 (industry average), 48 (Arista 7500), and 128 (Arista 7500E).
34. Deep Buffers. Chart: packets dropped per TeraGen for three setups: 1MB of buffer at 1:4 oversubscription, 1MB at 1:5.33, and 48MB at 1:5.33. Test topology: two groups of 16 hosts at 10G each, with 4x10G uplinks (1:4) or 3x10G uplinks (1:5.33).
35. Deep Buffers. Same chart, with the result highlighted: the 48MB-buffer switch at 1:5.33 drops zero packets.
36. Deep Buffers. Repeat of slide 35.
37. Machine Crashes.
38. Machine Crashes.
39. Machine Crashes.
40. Machine Crashes.
41. Machine Crashes.
42. How Can a Modern Network Help? Detect machine failures; redirect traffic to the switch; have the switch's kernel send TCP resets; this immediately kills all existing and new flows. Diagram: hosts 10.4.5.1 to 10.4.5.3 and 10.4.6.1 to 10.4.6.3, with switch gateway addresses 10.4.5.254 and 10.4.6.254.
43. How Can a Modern Network Help? Same bullets; EOS = Linux, and the diagram now shows the switch answering for the failed host 10.4.5.2.
44. Fast Server Failover. The switch learns and tracks the IP ↔ port mapping; port down → take over the IP and MAC addresses; this kicks in as soon as hardware notifies software of the port going down, within a few milliseconds; port back up → resume normal operation. Ideal to prevent RegionServers stalling on the WAL. (Same diagram; EOS = Linux.)
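EOS implements this as a switch feature; the sketch below only illustrates the underlying idea on a plain Linux box: claim the failed server's address and answer its TCP traffic with resets so clients fail fast instead of waiting out timeouts. The interface name is a placeholder, and the address comes from the slide's example:

    #!/bin/bash
    # Rough illustration of the fast-failover idea, not the EOS implementation (run as root).
    FAILED_IP="10.4.5.2"               # address of the crashed server (from the slide)
    IFACE="vlan405"                    # placeholder for the rack-facing interface
    # Claim the address so traffic for the dead host reaches this box...
    ip addr add "${FAILED_IP}/32" dev "${IFACE}"
    # ...and reset every TCP connection aimed at it.
    iptables -A INPUT -d "${FAILED_IP}" -p tcp -j REJECT --reject-with tcp-reset
    # When the server comes back, undo both steps:
    # ip addr del "${FAILED_IP}/32" dev "${IFACE}"
    # iptables -D INPUT -d "${FAILED_IP}" -p tcp -j REJECT --reject-with tcp-reset

A real takeover also has to assume (or answer ARP for) the failed host's MAC address, as the slide notes; that part is omitted here.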
45. Conclusion. Start thinking of the network as yet another service that runs on Linux boxes; build layer 3 networks based on standards; use jumbo frames; deep buffers prevent packet loss on oversubscribed uplinks; the network sits in the middle of everything, and it can help. Expect to see more in that area.
46. Thank You.
47. We're hiring in SF, Santa Clara, Vancouver, and Bangalore. Benoît "tsuna" Sigoure, Member of the Yak Shaving Staff, tsuna@aristanetworks.com, @tsunanet.
