OpenStack Summit Tokyo 2015 - Building a private cloud to efficiently handle 40 billion requests per day

What do you do when your usual setup or turnkey solution isn’t suited for your workload?

Most of the documentation and user feedback you can find about OpenStack is written for the use case of running a public-facing cloud serving several external customers. When you want to host a single tenant with a single application, the problem is completely different: you don't want publicly exposed APIs, you want to ensure optimal resource allocation to maximize your application's performance, and you want to leverage the fact that you own the infrastructure layer to optimize your instance placement strategy, getting the best latency and avoiding SPOFs through affinity (or anti-affinity) rules.

This talk focuses on what we learned during a two-year journey: from getting OpenStack up and running reliably, to investigating performance bottlenecks, to maximizing the performance of our private cloud.

1. Building a Private Cloud to Efficiently Handle 40 Billion Requests / Day
   October 28th, 2015
   Pierre Gohon | Sr. Site Reliability Engineer | pierre.gohon@tubemogul.com
   Pierre Grandin | Sr. Site Reliability Engineer | pierre.grandin@tubemogul.com
2. Who are we? TubeMogul (Nasdaq: TUBE)
   ● Enterprise software company for digital branding
   ● Over 27 Billion Ads served in 2014
   ● Over 40 Billion Ad Auctions per day in Q3 2015
   ● Bids processed in less than 50 ms
   ● Bids served in less than 80 ms (inc. network round trip)
   ● 5 PB of monthly video traffic served
   ● 1.6 EB of data stored
3. Who are we? Operations Engineering
   ● Ensure the smooth day-to-day operation of the platform infrastructure
   ● Provide a cost-effective and cutting-edge infrastructure
   ● Provide support to dev teams
   ● Team composed of SREs, SEs and DBAs (US and UA)
   ● Managing over 2,500 servers (virtual and physical)
4. Our Infrastructure
   Multiple locations with a mix of Public Cloud and On Premises
5. Before OpenStack: we’re already very “hybrid”… We’re not adding complexity!
   ● 6 AWS Regions (us-east*2, us-west*2, europe, apac)
   ● Physical servers in Michigan / Arizona (Web/Databases)
   ● DNS served by third party (UltraDNS + Dynect)
   ● External monitoring using Catchpoint
   ● CDNs to deliver content
   ● External security audits
6. Why?
   ● Own your infrastructure stack
   ● Physical proximity matters (reduced/controlled latency)
   ● Better infrastructure planning
   ● Technological transparency
   ● … $$ !
7. Project timeline
8. Where do we stand?
9. OpenStack challenges - Operational aspect
   ● DIY?
     ○ Small OPS team
       ■ 12 members in two timezones
       ■ Only 3 dedicated to OpenStack
     ○ New challenges
       ■ Internal training
       ■ Little external support (really?) vs AWS
       ■ Manage data centers (Servers, Network, …)
10. OpenStack challenges - Application migration aspect
   ● Are applications AWS-dependent?
     ○ Internal ops tools
     ○ Developers’ applications
     ○ AWS S3, DynamoDB, SNS, SQS, SES, SWF
   ● Convert developers to the project: we need their support
   ● OpenStack release cycle (when shall we update to the latest version?)
   ● Which OpenStack components do we really need?
   ● How far do we go (S3 replacement? Network control? Hardware control?)
11. How? Networking - External connectivity
   ● Managing our own ASN / IPs (v4/v6)
   ● Choose “best for needs” transit providers (tier 1)
   ● Better control of routes to/from our endpoints
   ● Allow dedicated AWS connections / others
   ● Allow direct peerings to ad networks
   ● Want to be accountable for networking issues
   ● Cost control
12. How? Networking - Hybrid physical / virtualized
   ● Applications are already designed for redundancy/cloud
   ● Circumvent virtualized networking limitations
   ● Fine-tune baremetal nodes for HAProxy (see the tuning sketch below)
   ● Future equipment is “cloud ready” (Nexus 5K for top-of-rack switches)
     ○ automatic switch configuration
     ○ Cisco software evolutions?
   ● 1G for admin, X*10G for public?
   ● Leverage multicast?
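   As an illustration of that kind of bare-metal tuning (not TubeMogul's actual settings; the values below are assumptions), a load balancer node is typically tuned through kernel sysctls:
   # illustrative sysctls for a bare-metal HAProxy node (values are assumptions)
   sysctl -w net.core.somaxconn=65535                    # deeper listen backlog for bursts of new connections
   sysctl -w net.ipv4.ip_local_port_range="1024 65000"   # more ephemeral ports for outbound connections to backends
   sysctl -w net.ipv4.tcp_tw_reuse=1                     # reuse TIME_WAIT sockets for new outbound connections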
13. How? Networking - Hybrid physical / virtualized
   [Diagram: network node, compute node and load balancer attached to the public and private networks, using VLANs; see the example below]
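   For reference, a VLAN-backed provider network of that era could be declared through the Neutron CLI roughly as follows; the network name, physical network label, segmentation ID and CIDR are assumptions for illustration:
   # illustrative only: VLAN provider network plus subnet (names and IDs are assumptions)
   neutron net-create public-vlan --provider:network_type vlan \
       --provider:physical_network physnet1 --provider:segmentation_id 100
   neutron subnet-create public-vlan 203.0.113.0/24 --name public-vlan-subnet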
14. How? Networking - RTT
   ● Latency from our DC to AWS is 6ms average in US-WEST
   rtb-bidder01(rtb):~$ mtr -r -c 50 gw01.us-west-1a.public
   HOST: rtb-bidder01                  Loss%   Snt   Last   Avg  Best  Wrst StDev
     1.|-- 10.0.4.1                     0.0%    50    0.2   0.2   0.1   0.3   0.0
     2.|-- XXX.XXX.XXX.XXX              0.0%    50    0.2   0.3   0.2   2.6   0.3
     3.|-- ae-43.r02.snjsca04.us.bb.    0.0%    50    1.4   1.5   1.2   2.3   0.2
     4.|-- ae-4.r06.plalca01.us.bb.g    0.0%    50    2.0   2.1   1.8   3.4   0.3
     5.|-- ae-1.amazon.plalca01.us.b    0.0%    50   39.2   3.5   1.5  39.2   5.6
     6.|-- 205.251.229.40               0.0%    50    3.5   2.8   2.2   4.9   0.6
     7.|-- 205.251.230.120              0.0%    50    2.1   2.3   2.0   8.5   0.9
     8.|-- ???                         100.0    50    0.0   0.0   0.0   0.0   0.0
     9.|-- ???                         100.0    50    0.0   0.0   0.0   0.0   0.0
    10.|-- ???                         100.0    50    0.0   0.0   0.0   0.0   0.0
    11.|-- 216.182.237.133              0.0%    50    4.0   6.0   2.7  20.2   5.2
15. How? Keep it simple
   ● If you are not building a multi-thousand-hypervisor cloud, you don’t need it to be complex
   ● Simplifies day-to-day operations
   ● Home-made Puppet catalog
     ○ because: fewer lines of code
     ○ because of the learning curve
     ○ because we need to tweak settings (ulimit? see the example below)
   ● No need for Horizon
   ● No need for shared storage
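   As one example of the small tweaks a home-made catalog keeps easy (the service name and limit value are assumptions, not the actual configuration), raising open-file limits can be as simple as dropping a file into limits.d:
   # illustrative only: raise the open-file limit for an OpenStack service
   echo 'nova  soft  nofile  65536' >> /etc/security/limits.d/90-nova.conf
   echo 'nova  hard  nofile  65536' >> /etc/security/limits.d/90-nova.conf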
16. How? Leverage your knowledge of your infrastructure
   ● Affinity / anti-affinity rules
     ○ Enforce resiliency using anti-affinity rules
     ○ Improve performance using affinity rules (see the sketch below)
   {"profile": "OpenStack", "cluster": "rtb-hbase", "hostname": "rtb-hbase-region01", "nagios_host": "mgmt01"}
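   A minimal sketch of how anti-affinity can be enforced with Nova server groups (the group name, image and instance name are assumptions for illustration):
   # illustrative only: keep HBase region servers on distinct hypervisors
   nova server-group-create rtb-hbase-group anti-affinity
   nova boot --flavor rtb.collector --image ubuntu-14.04 \
       --hint group=<server-group-uuid> rtb-hbase-region01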
17. How? Treat your infrastructure as any other engineering project
18. How? Continuous Delivery
   Infrastructure As Code
   ● Follow standard development lifecycle
   ● Repeatable and consistent server provisioning
   Continuous Delivery
   ● Iterate quickly
   ● Automated code review to improve code quality
   Reliability
   ● Improve Production Stability
   ● Enforce Better Security Practices
19. Puppet
   ● We already have a lot of automation:
     ○ ~10,000 Puppet deployments last year
     ○ Over 8,500 production deployments via Jenkins last year
   ● On the infrastructure:
     ○ masterless mode for the deployment
     ○ master mode once the node is up and running
   ● On the VMs:
     ○ Puppet run is triggered by cloud-init, directly at boot (see the sketch below)
     ○ from boot to production-ready: <5 minutes
   See also: http://www.slideshare.net/NicolasBrousse/puppet-camp-paris-2015
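   A rough sketch of the two Puppet modes described above (module paths and the master's hostname are assumptions, not the actual layout):
   # masterless apply while the node is being provisioned
   puppet apply --modulepath=/etc/puppet/modules /etc/puppet/manifests/site.pp
   # regular agent run against the master once the node is up and running
   puppet agent --test --server puppet.mgmt.example.com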
20. Infrastructure As Code - Code Review
21. Infrastructure As Code - Gerrit Integration
   Gerrit, an industry standard: OpenStack, Eclipse, Google, Chromium, WikiMedia, LibreOffice, Spotify, GlusterFS, etc.
   ● Fine-grained permission rules
   ● Plugged into LDAP
   ● Code review per commit (example below)
   ● Stream events
   ● Integrated with Jenkins, Jira and Hipchat
   ● Managing about 600 Git repositories
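   For reference, a typical per-commit submission flow with Gerrit, assuming the standard git-review helper (branch name and commit message are made up for illustration):
   # illustrative only: push a change for per-commit review on Gerrit
   git checkout -b fix-haproxy-timeouts
   git commit -a -m "Increase HAProxy client timeout"
   git review    # pushes the commit to refs/for/<branch> for review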
22. Infrastructure As Code - Gerrit in Action
   Automatic verify: -1 if the commit doesn’t pass Jenkins code validation
23. Infrastructure As Code - The Workflow
   [Diagram: changes flow from the Lab / QA cluster to the Prod cluster]
24. Infrastructure As Code - Continuous Delivery with Jenkins
25. Infrastructure As Code - Team Awareness
26. Infrastructure As Code - Safe upgrade paths
   Easy as 1-2-3:
   1. Test your upgrades using Jenkins
   2. Deploy the upgrade by pressing a single button*
   3. Enjoy the rest of your day
   * https://github.com/pgrandin/lcam
   fig. 1: N. Brousse, Sr. Director of Operations Engineering, switching our production workload to OpenStack
27. Get ready for production: Monitor everything
28. Monitor as much as you can?
   ● Existing monitoring (Nagios, Graphite) still in use
   ● Specific checks for OpenStack (see the sketch below)
     ○ check component APIs: performance / availability / operability
     ○ check resources: ports, failed instances
   ● Monitoring capacity metrics for all hardware
   ● SNMP traps for network equipment
   ● Monitoring is just an extension of our existing monitoring in AWS
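   A minimal sketch of what an API availability/latency check can look like, probing the Keystone endpoint with curl (the hostname and port layout are assumptions):
   # illustrative only: report HTTP status and response time for the Keystone API
   curl -s -o /dev/null -w "keystone: %{http_code} in %{time_total}s\n" \
       http://controller.mgmt.example.com:5000/v2.0/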
29. Monitoring auto-discovery
   ● New OpenStack nodes are automatically monitored
     ○ automatically / upon request
     ○ Nagios detects new hosts (API query, see below)
     ○ Nagios applies component-related checks by role
     ○ graphing is also automatically updated
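   A rough sketch of the kind of API queries such discovery can rely on (a generator script is assumed to turn the output into Nagios/Graphite configuration):
   # illustrative only: enumerate hosts so monitoring configuration can be regenerated
   nova hypervisor-list          # newly added compute nodes
   nova list --all-tenants       # instances across all tenants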
30. Centralized monitoring
31. Monitoring is graphing
32. A look in the rearview mirror
33. Benefits - Transparency / visibility
   Discover new odd/unexpected traffic/activity patterns
34. Benefits - Tailored Instances
   Before: need an m3.xlarge plus 2 GB of RAM? On AWS that means jumping to an m3.2xlarge!
   After: create a custom flavor that fits the workload
   # nova flavor-create rtb.collector rtb.collector 17408 8 2
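   For readability: the positional arguments of nova flavor-create are name, flavor ID, RAM in MB, root disk in GB and vCPU count, so the flavor above provides 17,408 MB of RAM, an 8 GB disk and 2 vCPUs, sized for the workload instead of the next AWS instance tier.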
35. Benefits - Operational Transparency
   The same internal tooling drives both clouds:
   OpenStack: # cerveza -m noc -- --zone tm-sjc-1a --start demo01
   AWS:       # cerveza -m noc -- --zone us-east-1a --start demo01
36. Benefits - Efficiency
   [Before / After comparison charts]
37. Benefits - Efficiency
   1+ million rx packets/s on only 2 HAProxy load balancers, full SSL
38. What does not fit?
   ● Downscaling does not really make sense for us: CPUs are online and paid for, we should use them
   ● Upscaling has its limits: AWS refreshes instance types every year…
   ● Sometimes a small added feature can have a huge load impact
   ● It makes sense to keep the elastic workloads (machine learning, …) in AWS
39. What we’ve learnt
   ● We can be “double hybrids” (AWS + OpenStack + HAProxy bare metal)
   ● A dev environment is needed for OpenStack (new versions / breaking things)
   ● Storage is still a big issue due to our volume (1.6 EB)
   ● Some stuff may stay “forever” on AWS?
   ● More dev/ops communication
   ● OpenStack is flexible
   ● No need for HA everywhere
   ● Spikes can be offloaded to AWS (cloud bursting)
40. Still a lot left to do
   Technical aspect
   ● Need to migrate other AWS Regions
   ● Gain more experience
   ● Version upgrades
   ● Continue to adapt our tooling
   ● Add more alarms for capacity issues
   ● Different Regions, different issues?
   Human aspect
   ● Dev teams still think in the AWS world (and sometimes OPS too…)
41. Aftermath
   - Ad serving in production since 2015-05
   - Bidding traffic in production since 2015-09
   - 100% uptime since pre-production (2015-03)
   Cost of operation for our current production workload:
   - Reduced by a factor of two, including OpEx cost!
42. Questions?
43. Pierre Gohon (@pierregohon)
    Pierre Grandin (@p_grandin)