Making clouds go faster, for fun and profit!

Everyone loves it when things are fast, and that statement holds true whether you're visiting http://www.livingsocial.com or whether you're hitting the OpenStack Nova API and requesting, "Please show me all the instances which I've got running". Nobody ever writes in asking for support and saying, "All of my API calls are completing far too quickly. Slow it down!".

Optimizing the performance of software is arguably a never-ending crusade. At some point you'll get things fast enough that you can say, "Any effort invested beyond this point is not adding value for the business", but then along comes new code which adds a zillion awesome features and also regresses performance back to a level where it needs another tune-up.

In the process of transforming our infrastructure and preparing our new OpenStack IaaS to host all our applications, we've been looking for performance wins across the whole stack. We've got some aggressive targets to meet. We've investigated many hardware options and chosen an optimal solution, we've instrumented some of the OpenStack APIs, and we've benchmarked them to produce interesting results. Whilst we're not done yet, we do have a "Half-Time Match Report".

Join me as I walk through our learnings so far and propose follow-on areas for investigation and optimization.



Usage Rights

© All Rights Reserved


    Presentation Transcript

    • This slide intentionally left blank. Wednesday, 17 October 12
    • MAKING CLOUDS GO FASTER FOR FUN AND PROFIT
    • Speakers Who crafted this talk?
    • Alex Howells, @nixgeek, Technical Operations, LivingSocial, alex.howells@livingsocial.com, http://github.com/agh
    • Paul Thomas, @ftergl0w, Technical Operations, LivingSocial, paul.thomas@livingsocial.com, http://github.com/AfterGlow
    • Bedtime Reading You can get a copy of these slides after the talk - https://speakerdeck.com/u/nixgeek
    • Problem?
    • Performance It doesn’t need to be rocket science. It does matter though! I promise I’m not trolling you.
    • In a parallel universe... “Oh man, that was too fast! It’s so much better now it’s slow!!” -- Average User
    • YEAH RIGHT I wish I had users who were that easy to please! But since we live in the real world...
    • In our universe... “Why is that dude smiling?! This is too slow! Why can’t it be faster?” -- Average Users
    • THINGS ARE IMPROVING Cactus => Diablo => Essex => Folsom But things can improve faster with focus!
    • Mostly reliable, but can be a bit slow! Today
    • Faster. More scalable. A real driving experience. The Future?
    • What’s the big deal? Why should I listen to you?
    • WE’RE A LOT LIKE YOU! Developers. Operators. Engineers. Users. We see potential. We see opportunities.
    • Airspace LivingSocial PaaS
      We care about speed because ...
      * Scaling services up/down needs to happen fast!
      * Needing to maintain huge pools of “slack capacity” to account for sudden spikes in traffic sucks.
      * Upgrading applications should be fast.
      What does fast mean to us? One example? New instances online in under 10 seconds.
    • Performance Matters
      What could your business do if instances came online in under 5 seconds vs. 50 seconds?
      > Makes integration tests leveraging the Cloud complete much faster.
      > Seasonal spikes? React to them faster - happier customers spend more money.
      > Engineers who don’t grumble that “getting servers is a pain in the ass”.
      > Deploy new applications and services more quickly and easily.
      Along with many other things ...
    • What do we do?
    • Think Positive Because solutions are better than problems!
    • Two-Pronged Approach Hardware & Software “A Love Story”
    • Warning! Picking the right hardware is quite hard. It’s often individual to your users’ needs. What works for us may not rock your world.
    • Hardware
    • Our Servers
      Supermicro 1027R-WRFT+
      2x Intel Xeon E5-2670 (8C/16T 2.60GHz)
      16 x 8GB 1600MHz ECC Memory
      LSI 9266-8i (1-LD RAID-10)
      8 x Intel 520-series 240GB SSD
      Dual-Port Intel X540 10GBASE-T
    • Benefits
      * ‘Just right’ balance of CPU/RAM for us.
      * Exceptional ephemeral I/O performance
        > Not using eMLC - trade off?
        > We can think about SQL on IaaS
      * A surplus of network bandwidth
      Servers are not a bottleneck!
    • Our Network
      Top of Rack - Arista Networks 7050T, 48-port 10GBASE-T Switch + 4-port 40GbE (uplinks)
      Zone Spine - Arista Networks 7050Q, 16-port 40GbE Switch
    • Benefits
      * A network which runs Linux!
      * Ability to automate it via ZTP and Chef
      * Non-blocking communication in a rack.
      * Provision 160Gbps to spine via four cables.
      * Under 2:1 contention for comms in/out of rack.
      * Less need to think about QoS!
      Network is not a bottleneck!
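
The uplink and contention figures on the slide above can be checked with a little arithmetic. The sketch below is illustrative only: the port speeds and counts come from the "Our Network" slide, but the number of server-facing ports actually in use per rack is not stated in the talk, so `ports_in_use` is an assumption and the function names are hypothetical.

```python
# Figures taken from the slides: 10GBASE-T server ports, four 40GbE uplinks.
PORT_SPEED_GBPS = 10
UPLINK_SPEED_GBPS = 40
UPLINK_COUNT = 4


def uplink_capacity_gbps(uplinks=UPLINK_COUNT, speed=UPLINK_SPEED_GBPS):
    """Total bandwidth from one top-of-rack switch to the spine."""
    return uplinks * speed


def contention_ratio(ports_in_use):
    """Server-facing bandwidth divided by uplink bandwidth.

    ports_in_use is an assumption; the slides do not say how many of
    the 48 ports are populated in each rack.
    """
    return ports_in_use * PORT_SPEED_GBPS / uplink_capacity_gbps()


def max_ports_under(ratio_limit):
    """Largest number of ports that keeps contention strictly below ratio_limit."""
    ports = 0
    while contention_ratio(ports + 1) < ratio_limit:
        ports += 1
    return ports
```

Four 40GbE uplinks give the 160Gbps quoted on the slide. If all 48 ports were driven at line rate the ratio would be 3:1, so the "under 2:1" figure presumably reflects fewer than 32 ports in use per rack.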
    • Software
    • Production
      Ubuntu 12.04 LTS (‘Precise Pangolin’)
      Hypervisor -- KVM
      CloudScaling OCS 1.3 .. based off OpenStack Essex ..
      Moving to OCS 2.0 in near future... .. that one is OpenStack Folsom ..
    • Ubuntu 12.04 LTS (‘Precise Pangolin’)
      Hypervisor -- KVM
      Useful for development and testing .. we’re running OpenStack Folsom now ..
      Most of the data shown later was grabbed with help from DevStack running on similar hardware to our production environment.
    • WHAT NOW? We’ve picked the hardware stack. It’s awesome. We’ve got our software installed. It’s looking great.
    • Monitoring Support calls are imprecise. We need data!
    • Old School
      * Is my service (API) responding on TCP/8774?
      * Am I able to make a GET and fetch instance info?
      * Is my server running all the processes it should?
      * Are there any errors on my network ports?
      If any of this looks broken, send me alerts saying so!
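
As an illustration of the first "old school" check, here is a minimal sketch of a TCP reachability probe using only the Python standard library. It is not the monitoring the speakers used; the function name `tcp_port_open` and the timeout value are assumptions made for the example.

```python
import socket


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    This only proves that something is listening; it says nothing about
    whether the API behind the port answers correctly or quickly.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

A call such as `tcp_port_open("nova-api.example.com", 8774)` (hostname hypothetical) would cover the first bullet. A check like this is binary, which is why the slides go on to argue for measuring what users actually experience.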
    • New Thinking “End-User Experience Monitoring”
      * “How long did my website take to show?”
      * Individual performance of each click or API call
      * Inspection of latency within the application
      If lots of users’ interactions are slow, then I want you to alert me. If it’s just an outlier - log it and shut up.
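
The alerting rule on the slide above ("alert if lots of interactions are slow, just log an outlier") could be sketched roughly as follows. The thresholds are assumptions: 5 seconds is borrowed from the ">5 second outliers" mentioned later in the talk, and the 10% fraction is invented for illustration. `evaluate_window` is a hypothetical name, not part of any tool shown in the slides.

```python
import logging

logger = logging.getLogger("latency")


def evaluate_window(latencies_s, slow_threshold_s=5.0, alert_fraction=0.10):
    """Classify one window of request latencies (in seconds).

    Returns "alert" when at least alert_fraction of requests were slow,
    "log" when only a few outliers were slow, and "ok" otherwise.
    Both thresholds are illustrative assumptions.
    """
    if not latencies_s:
        return "ok"
    slow = [value for value in latencies_s if value > slow_threshold_s]
    if not slow:
        return "ok"
    if len(slow) / len(latencies_s) >= alert_fraction:
        return "alert"
    for value in slow:
        # An isolated slow request: record it, but do not page anyone.
        logger.info("slow request outlier: %.2fs", value)
    return "log"
```

With these defaults, one 7-second request among a hundred fast ones is logged quietly, while twenty slow requests in a hundred raise an alert.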
    • DEMO TIME! Because pretty pictures are awesome. We’ll call the slowest transactions our “Disaster Porn”.
    • Boundary “AppViz”
      * Port-to-port throughput/latency
      * How much SQL traffic are you doing?
      Updates in real-time. Look backwards in time. Powered by IPFIX (RFC 5101)
    • Tracelytics Latency Trends
      * Over the last 60 minutes
      * Over the last 24 hours
      * Over the last 7 days
      Lots more cool stuff to help ... We’ll blitz through a few more things next ... Top Tip: This is bad news.
    • Tracelytics Patches If you want to try out OpenStack APM - https://github.com/Afterglow/tracelytics-openstack Any questions? Just open an issue!
    • Glance
    • Keystone
    • Nova
    • Nova
    • Nova
    • Nova
    • “Call to Arms”
      > Performance regression tests as an OpenStack CI gate?
      > More people talking about “How I fixed those >5 second outliers!”
      > Better ‘shared knowledge’ about what settings to tweak for added oomph
      > Architectural analysis asking about “big picture” (big impact) changes
      Reminder about those patches - https://github.com/Afterglow/tracelytics-openstack
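
One way the first "Call to Arms" item might look in practice is a gate that compares a high percentile of measured latency against a stored baseline. This is only a sketch of the idea, not an existing OpenStack CI job: the function names, the choice of the 95th percentile, and the 10% tolerance are all assumptions.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def gate_passes(baseline_s, candidate_s, pct=95, tolerance=0.10):
    """Return True if the candidate build is not meaningfully slower.

    baseline_s and candidate_s are lists of latencies in seconds for the
    same API call. The gate fails when the candidate's percentile exceeds
    the baseline's by more than the tolerance.
    """
    limit = percentile(baseline_s, pct) * (1 + tolerance)
    return percentile(candidate_s, pct) <= limit
```

Gating on a high percentile instead of the mean is a deliberate choice here: it is the slow tail, like the >5 second outliers mentioned on the slide, that a mean tends to hide.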
    • Credits Because these folks are awesome N.B. Not intended as an exhaustive list of all the awesome people in the world/room!
    • Credits http://www.livingsocial.com
    • Credits http://www.cloudscaling.com
    • Credits http://www.aristanetworks.com
    • Credits http://www.tracelytics.com
    • We’re done talking, thanks for listening! Any questions?
    • Interested? E-mail Ken - ken.persel@livingsocial.com Or just find me! Reminder that these slides are over at - https://speakerdeck.com/u/nixgeek