Vtastic: Innovations In Distributed Systems Testing
Jack Wadden, Sr. Engineering Manager
Akamai Technologies, Inc.
©2015 AKAMAI | FASTER FORWARD™
AKAMAI CDN OVERVIEW
• We Make the Internet Fast, Reliable and Secure
• Globally-Distributed Network of Servers
• Caching Content Close to End Users
• Scalable Live Media Streaming
• Protocol Optimizations
• DNS-Based Load Balancing System
• Chooses the Best Server to Handle Your Requests
MASSIVE SCALE
• 15-30% of All Internet Traffic
• 3+ Trillion Hits/day (2 x 10^12)
• 30+ Tbps
• 215,000+ Servers
• Located in 120+ Countries
• 1000+ Software Components
• 100+ Server Roles
SYSTEM TESTING AT AKAMAI
TESTNETS: AKAMAI’S SYSTEM TEST ENVIRONMENT
HOWEVER, AT AKAMAI, TESTNETS ARE A SCARCE RESOURCE
THEY ARE EXPENSIVE TO BUILD
AND REQUIRE A HUGE TEAM TO MAINTAIN
SHARING LEADS TO DISRUPTIONS
SOMETIMES THE FIT IS POOR
CONFLICTING USES NEED TO BE COORDINATED
AND RESULT IN INEVITABLE DELAYS
FEATURES OF A BETTER TESTNET
• Low barrier to access
• Eliminate coordination
• No-block debugging
• Automation
• Portable, restorable configuration
• Efficient maintenance
• Permit destructive testing
• Optimal platform utilization
THE VISION
CONTINUOUS, AUTOMATED, END-TO-END TESTING FOR ALL ENGINEERS ON EVERY COMPONENT ACROSS AKAMAI
TESTNET CLONING
[Diagram: VTASTIC Resource Tracker, OpenNebula, Master, Storage]
TESTNET CLONING
[Diagram: Test Harness, VTASTIC Resource Tracker, OpenNebula, Master, Storage, Testnet Clones]
VTASTIC MASTER TESTNET
• Supported by SME teams
• Running Production Versions
• Vtastic Team Coordinates Changes
• Custom Clones can be Saved, Shared
[Diagram: Master, Master Candidate, Clone, Snapshot, Old Master]
CLONES USE PRIVATE IP SPACE
• 100.80.0.8 (MDT)
• 100.80.0.15 (KDC)
• 100.80.0.21 (UMP)
• GWSH, SOCKS
• 172.26.238.16 (NAT Exit)
• 100.80.0.1 (NAT Gateway)
• IP (Anything)
• VLAN #83
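A minimal sketch of how the addresses on this slide can be checked against well-known reserved ranges. The addresses and role labels are copied from the slide. The observation that the 100.80.0.x addresses fall inside 100.64.0.0/10 (the shared address space of RFC 6598) and that the NAT exit falls inside 172.16.0.0/12 (RFC 1918) is an addition here, not something the slide states.

```python
import ipaddress

# Addresses and role labels as they appear on the slide.
CLONE_NODES = {
    "MDT": "100.80.0.8",
    "KDC": "100.80.0.15",
    "UMP": "100.80.0.21",
    "NAT Gateway": "100.80.0.1",
}
NAT_EXIT = "172.26.238.16"

# Reserved ranges (assumption: these are the ranges the deck means by "private").
SHARED_SPACE = ipaddress.ip_network("100.64.0.0/10")  # RFC 6598
RFC1918_172 = ipaddress.ip_network("172.16.0.0/12")   # RFC 1918


def in_network(addr, net):
    """Return True if the dotted-quad address lies inside the network."""
    return ipaddress.ip_address(addr) in net


# Map each role to whether its address is inside the shared address space.
inside_shared = {role: in_network(ip, SHARED_SPACE) for role, ip in CLONE_NODES.items()}
exit_is_rfc1918 = in_network(NAT_EXIT, RFC1918_172)
```

Because none of these addresses are routable on the public Internet, traffic into a clone has to pass through the NAT gateway, which is presumably why the deck pairs this slide with the tunneling tools that follow.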
NAT TUNNELING TOOLS
• vpoint: Testnet-Attached bash Shell
  • LD_PRELOAD for Transparent SOCKS Tunneling (dante-client)
  • Proprietary SSH-proxy client
• chrome-vpoint, firefox-vpoint
  • Dedicated browser session with SOCKS configuration
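To make "SOCKS tunneling" concrete, here is a rough sketch of the bytes a client sends to a SOCKS proxy when asking it to open a TCP connection. It assumes SOCKS version 5 (RFC 1928) with no authentication; the slides do not say which SOCKS version vpoint uses. The target address is the MDT node from the earlier slide, and port 22 is purely illustrative.

```python
import socket
import struct

# Client greeting: version 5, one auth method offered, method 0 (no authentication).
SOCKS5_GREETING = b"\x05\x01\x00"


def socks5_connect_request(ipv4, port):
    """Build a SOCKS5 CONNECT request for an IPv4 target (RFC 1928, section 4).

    Layout: VER=5, CMD=1 (CONNECT), RSV=0, ATYP=1 (IPv4),
    4-byte address, 2-byte port in network byte order.
    """
    return struct.pack("!BBBB4sH", 5, 1, 0, 1, socket.inet_aton(ipv4), port)


# Hypothetical target: the MDT address from the slide, SSH port assumed.
request = socks5_connect_request("100.80.0.8", 22)
```

A preloaded library such as the dante client can issue this exchange on behalf of an unmodified program by intercepting its connect calls, which is what makes the tunneling transparent to tools run inside the shell.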
DESIGN APPROACH
• Centrally-Managed Infrastructure
  • Resources Granted to Users/Groups
• Distributed Storage & Compute Platform
  • Commodity Hardware
• Open Source Technology
  • Virtualization: Qemu/KVM
  • Storage: GlusterFS
  • Orchestration: OpenNebula!!
  • Vtastic VRT: Python, Django, Apache
SPECS, SCALE
• 55 VM Hosts
  • 32 Logical Cores (32 threads)
  • 128 GB RAM (moving to 256)
  • 2 x 10 Gbps Ethernet
  • Average 45 VMs per Host
• 50-60 Testnets
  • 30-200 Nodes per Testnet
  • 2200+ Total VMs
• 50 Storage Nodes
  • 8 Cores
  • 32 GB RAM
  • 10 Gbps Ethernet
  • 6 x 384 GB SSD = 2.1 TB
  • Total Usable Space = 42 TB
• Master Testnet
  • 203 Nodes
  • 434 Cores
  • 918 GB RAM
  • ~2 TB (After virt-sparsify)
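The figures on this slide can be cross-checked with some quick arithmetic. The sketch below uses only numbers from the slide; the variable names are made up for illustration, and the slide itself does not explain the gap between raw and usable storage, so no cause is assumed here.

```python
# Figures copied from the slide.
VM_HOSTS = 55
AVG_VMS_PER_HOST = 45
STORAGE_NODES = 50
SSDS_PER_NODE = 6
SSD_GB = 384
NODE_TB_ON_SLIDE = 2.1      # per-node capacity as stated on the slide
USABLE_TB = 42
MASTER_NODES, MASTER_CORES, MASTER_RAM_GB = 203, 434, 918

# Hosts times average density, to compare with the "2200+ Total VMs" bullet.
estimated_vms = VM_HOSTS * AVG_VMS_PER_HOST          # 2475

# Raw SSD capacity per storage node in decimal gigabytes.
raw_node_gb = SSDS_PER_NODE * SSD_GB                 # 2304

# The same capacity in binary terabytes, which is close to the slide's 2.1 figure.
raw_node_tib = raw_node_gb * 10**9 / 2**40

# Share of the stated raw capacity that ends up usable.
usable_fraction = USABLE_TB / (STORAGE_NODES * NODE_TB_ON_SLIDE)

# Average footprint of one node in the master testnet.
cores_per_master_node = MASTER_CORES / MASTER_NODES
ram_gb_per_master_node = MASTER_RAM_GB / MASTER_NODES
```

The estimate from hosts and density lands a little above the stated VM total, and the usable space works out to roughly two fifths of the stated raw capacity.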
1.0: GLUSTER & FUSE
• Backing Files and Scratch Images on Remote Storage
• Qemu Uses POSIX Path (/glusterclient/foo)
• Problems:
  • Memory Leaks, Hangs in GlusterFS FUSE Mount
  • Occasional Loss of VMs
  • Performance Concerns
1.1: GLUSTER DIRECT
• Qemu uses libgfapi (gluster://SERVER:PORT/foo)
• Backing Files and Scratch Images on Remote Storage
• FUSE Mount Used for Image Management
• Problems:
  • Frequent, Catastrophic Loss of VMs
  • Occasional FUSE Mount Problems (Image Management)
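The difference between the 1.0 and 1.1 designs comes down to how Qemu is told where an image lives: a POSIX path on a FUSE mount, or a gluster:// URI handled by libgfapi. The sketch below builds and recognizes both forms exactly as the slides write them. The host name and port are hypothetical, and the helper names are invented for illustration. Note that QEMU's documented URI form also carries a volume name before the image path, which the slide's shorthand omits.

```python
from urllib.parse import urlsplit


def fuse_path(mount_point, image):
    """Image location as a plain POSIX path on a FUSE mount (1.0 style)."""
    return f"{mount_point.rstrip('/')}/{image}"


def gluster_uri(server, port, image):
    """Image location in the slide's shorthand gluster://SERVER:PORT/foo (1.1 style)."""
    return f"gluster://{server}:{port}/{image}"


def access_method(location):
    """Classify a location as libgfapi (URI) or posix (path)."""
    parts = urlsplit(location)
    if parts.scheme == "gluster":
        return ("libgfapi", parts.hostname, parts.port, parts.path.lstrip("/"))
    return ("posix", None, None, location)


# Hypothetical server name and port, used only for illustration.
direct = gluster_uri("storage01.example", 24007, "foo")
mounted = fuse_path("/glusterclient/", "foo")
```

With the URI form the hypervisor process talks to the storage servers itself, while the path form routes every read and write through the FUSE mount on the host.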
1.2: FUSE + LOCAL SCRATCH
• Qemu Uses POSIX Path (/glusterclient/foo) for Backing Image
• FUSE Mount Used for Image Management
• Scratch Images Stored on Local Disk
• Problems:
  • Increased Snapshot Time
  • No Live Migration
  • Occasional FUSE Mount Problems (Image Management)
  • Lack of Trust (VM Loss Experienced before Re-creating Gluster Volume)
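One way to picture the 1.2 layout is a copy-on-write scratch image on local disk whose backing file sits on the Gluster FUSE mount. The sketch below only assembles the `qemu-img create` argument list for such an overlay and does not run it. It assumes the images are qcow2, which the slides do not state, and the local scratch path is hypothetical.

```python
def scratch_overlay_cmd(backing_image, scratch_image, fmt="qcow2"):
    """Return the argv for creating a copy-on-write overlay with qemu-img.

    -f sets the format of the new image, -b names the backing file,
    and -F declares the backing file's format.
    """
    return [
        "qemu-img", "create",
        "-f", fmt,
        "-b", backing_image,
        "-F", fmt,
        scratch_image,
    ]


# Backing path taken from the slide; the local scratch path is an assumption.
cmd = scratch_overlay_cmd("/glusterclient/foo", "/var/tmp/scratch/clone1.qcow2")
```

Since the scratch image exists only on one host's disk, another host cannot pick up the running VM's writes, which is consistent with the "No Live Migration" problem listed above.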
IN DEVELOPMENT: CEPH
• Static and Scratch Images on Remote Storage
• Live Migration Possible
• Ability to Flatten Penultimate Node
• Holy Grail, or New Devil?
• Challenges:
  • Learning Curve
  • Ceph Stability?
  • Need Support for Trees of RBD Clones
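The idea of a tree of clones, and of flattening a node in the middle of a chain, can be illustrated with a toy in-memory model. This is not Ceph's API: the class, its fields, and the block contents are all invented for illustration, and reading "flatten penultimate node" as detaching the second-to-last image from its parent is an interpretation of the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Image:
    """Toy copy-on-write image: stores only the blocks written in this layer."""
    name: str
    parent: Optional["Image"] = None
    blocks: Dict[int, str] = field(default_factory=dict)

    def read(self, block):
        """Walk up the parent chain until some layer holds the block."""
        node = self
        while node is not None:
            if block in node.blocks:
                return node.blocks[block]
            node = node.parent
        return None

    def depth(self):
        """Number of ancestors this image still depends on."""
        count, node = 0, self.parent
        while node is not None:
            count, node = count + 1, node.parent
        return count

    def flatten(self):
        """Copy all inherited blocks into this layer and drop the parent link."""
        chain, node = [], self
        while node is not None:
            chain.append(node)
            node = node.parent
        merged = {}
        for layer in reversed(chain):  # oldest first so newer writes win
            merged.update(layer.blocks)
        self.blocks, self.parent = merged, None


# Hypothetical chain: master -> candidate -> clone.
master = Image("master", blocks={0: "m0", 1: "m1"})
candidate = Image("candidate", parent=master, blocks={1: "c1"})
clone = Image("clone", parent=candidate, blocks={2: "x2"})

depth_before = clone.depth()
candidate.flatten()           # the second-to-last image no longer needs master
depth_after = clone.depth()
```

After the flatten, the clone reads the same data as before but depends on one image fewer, so the old master could be retired without breaking it.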
FUTURE POSSIBILITIES
• Incorporating Physical Hardware (Load/Performance Testing)
• Realistic Network Conditions (Latency, Loss)
• Subnetting / Internetworking
• Multi-Site w/Smart Replication
VTASTIC.AKAMAI.COM

OpenNebulaConf 2017 US: Vtastic: Akamai innovations for distributed system testing, by Jack Wadden, Akamai
