Dark launching with Consul at Hootsuite - Bill Monkman


Dark Launching (A.K.A. Feature Flagging) is a technique and mindset that has truly shaped the way we write, test, and deploy code at Hootsuite. It gives our team real-time, fine-grained control over our production systems, which helps prevent issues from reaching users and builds developer confidence in a culture of pushing code many times per day.
In this presentation I will go over how the system helps us in the context of both microservices and monoliths, and how we made use of Consul, HashiCorp's HA service discovery / KV store, to make it more resilient and performant at scale.

Published in: Technology

  1. Dark Launching with Consul
     Bill Monkman, Senior Specialist Engineer, @bmonkman
  2. Hootsuite
     • The most widely used platform for managing social media
     • Integrates with Twitter, Facebook, Instagram, LinkedIn, G+, etc.
     • Started 7 years ago, now over 10 million users
     • Used by over 800 of the Fortune 1000
  3. Hootsuite
     • Everything in Amazon AWS (low thousands of servers)
     • Primary languages PHP, Scala, Python, Go
     • 10+ releases to production per day
     • 20+ Microservices
     • Started using Consul in late 2014
     • Also using Vagrant, Packer, Terraform, Vault
  4. Consul at Hootsuite
     • Deployed in all datacenters (AWS Regions) in staging and prod
     • Clusters of 3-5 servers (Multi-AZ)
     • Consul agent installed on almost every server
     • First use: Dark Launching
  5. Dark Launching
     • AKA Feature Flagging, Feature Toggle, etc.
     • Allow dynamic control of your systems in real time
     • Used extensively at Facebook, Etsy, Flickr, others
     • Integrated with all the languages we use, both front and back-end
     • Very powerful tool for continuous delivery
     • Key to engineers at HS pushing code quickly and confidently
     • Allowed even other departments to control the system (Support, Marketing)
  6. Dark Launching
  7. Dark Launching
     Various restriction types:
     • boolean
     • percentage_static
     • percentage_random
     • user_list
     • organization_list
     • plan_code
     • language
     • webserver
     • etc.
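The restriction types on slide 7 can be read as a small dispatch over a flag's configuration. The sketch below is an illustration in Python, not Hootsuite's implementation. The slides do not define the semantics of each type, so it assumes that `percentage_static` gives a user a stable answer (by hashing the flag and user ID) while `percentage_random` rolls the dice on every check. The function names and the shape of `config` are hypothetical, loosely following the `value` / `restriction` fields shown on slide 23.

```python
import hashlib
import random


def stable_bucket(flag, user_id):
    """Map a (flag, user) pair to a bucket in 0..99 that never changes."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag, config, user_id=None, org_id=None, rng=random.random):
    """Evaluate one flag. Semantics per restriction type are assumptions."""
    restriction = config["restriction"]
    value = config["value"]
    if restriction == "boolean":
        return bool(value)
    if restriction == "percentage_static":
        # Assumed: the same user always lands on the same side of the cut.
        return user_id is not None and stable_bucket(flag, user_id) < value
    if restriction == "percentage_random":
        # Assumed: each check is an independent draw.
        return rng() * 100 < value
    if restriction == "user_list":
        return user_id in value
    if restriction == "organization_list":
        return org_id in value
    # Unknown restriction types fail closed.
    return False
```

Failing closed on an unknown restriction type is a design choice of this sketch: a flag the code cannot interpret behaves as if it were off.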
  8. Use Cases: Typical
     Push new code then:
     ● Dark launch to yourself or your team to test
     ● Launch to the whole Hootsuite organization
     ● 10% of all users
     ● Watch graphs
     ● 50%
     ● 100%
     ● Simple means of rollback if necessary
  9. Use Cases: Migration
     ● Controlled migration to new services
     ● Phased rollouts
     ● Allowing beta group of users to try new features ahead of full release
  10. Use Cases: Load Testing
     ● When creating a new feature or service, send partial traffic to it, slowly ramp up
     ● Shadow reads/writes
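Slide 10 mentions shadow reads without elaborating. One common reading is that the existing code path keeps serving the response while a flag decides whether the new service also receives the same call, with its result compared and then discarded. The Python sketch below illustrates that reading only. `shadow_read`, `on_mismatch`, and the callables passed in are hypothetical names, not anything from the talk.

```python
def shadow_read(primary, shadow, shadow_enabled, on_mismatch, *args):
    """Serve from `primary`; optionally exercise `shadow` with the same input.

    The caller always gets the primary result, so a slow or failing shadow
    path cannot change behaviour for users.
    """
    result = primary(*args)
    if shadow_enabled:
        try:
            candidate = shadow(*args)
            if candidate != result:
                on_mismatch(args, result, candidate)
        except Exception as exc:
            # A shadow failure is reported, never raised to the caller.
            on_mismatch(args, result, exc)
    return result
```

In this sketch the shadow call runs inline, so it adds latency to the request. A real setup would more likely run it asynchronously.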
  11. Use Cases: Security / Protection
     ● “Kill Twitter streams” flag
     ● Attack mitigation
  12. Use Cases: A/B Testing
     ● Test a feature on half the user base to gauge impact/adoption
     ● Try to limit it to simple tests. Anything more complex needs a real A/B framework
  13. Dark Launching at Hootsuite
     Wrap code in a dark launch block
     Newly added flags will be automatically registered in the KV store the first time the code executes (with some stampede protection)
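To make slide 13 concrete, here is a minimal Python sketch of a flag check that registers unknown flags on first use. The slides do not say how the stampede protection works. This sketch assumes a simple per-process throttle, so that a burst of requests hitting a brand-new flag triggers one registration attempt rather than thousands. `DarkLaunch`, `register`, and `retry_after` are hypothetical names, and only boolean flags are handled to keep it short.

```python
import time


class DarkLaunch:
    """Flag lookup that registers missing flags, at most once per interval."""

    def __init__(self, flags, register, retry_after=30.0, clock=time.monotonic):
        self._flags = flags              # mapping: flag code -> config dict
        self._register = register        # callback that adds a flag to the KV store
        self._retry_after = retry_after  # seconds between registration attempts
        self._clock = clock
        self._last_attempt = {}

    def enabled(self, code):
        config = self._flags.get(code)
        if config is None:
            # Unknown flag: register it, and treat it as off in the meantime.
            self._register_once(code)
            return False
        return bool(config["value"])

    def _register_once(self, code):
        now = self._clock()
        last = self._last_attempt.get(code)
        if last is not None and now - last < self._retry_after:
            return  # a recent attempt is still pending; do not pile on
        self._last_attempt[code] = now
        self._register(code)
```

Application code would then wrap the new path in `if dl.enabled("SOME_FLAG"):` and keep the old path in the `else` branch, so an unregistered or disabled flag always falls back to existing behaviour.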
  14. Dark Launching at Hootsuite
     Managed via a web interface (screenshot)
  15. Dark Launching at Hootsuite
     Managed via a web interface (screenshot)
  16. Dark Launching at Hootsuite
     • Has become core to our continuous delivery workflow
     • Changed the way we use source control
     • Branching in production
     • Comes with some associated costs: cleanup and complexity
  17. Dark Launching at Hootsuite
     Initial implementation (diagram: web servers running PHP-FPM, Memcached, MySQL)
  18. Problems with the old way
     ● As Dark Launching became important to our process, usage skyrocketed
     ● Initial implementation with MySQL and Memcached ran into various issues
       ○ Hot cache keys
       ○ Too tied in to our core dashboard
       ○ Not suitable for a distributed system (move to microservices)
     ● Outages!
  19. Enter Consul
     ● Fans of HashiCorp products already
     ● Saw potential for a “push” based solution to dark launch management
     ● Wanted to explore it for other uses; this was a useful test ground
     ● Evaluated a few tools, and though Consul was fairly bleeding-edge, we liked its feature set and direction and had faith in the team behind it
     ● Based on well-known algorithms/protocols (Raft and SWIM)
     ● Started experimenting with a small-scale deployment
  20. Implementation
     Base data stored in Consul KV store (with metadata in MongoDB)
  21. Implementation
     Watch added using Ansible, baked into image
  22. Implementation (PHP)
     ● Handler that receives all KV data for a project
     ● Writes out a PHP syntax config file with all data as an array
     ● Hits webserver on localhost to clear APC cache (in-memory cache)
     ● PHP code then checks cache, reloads from file if missing and does a KV lookup on the array of dark launch data
     ● If the checked flag does not exist in the data, communicate with the local Consul agent to add it
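The handler on slide 22 can be sketched in a few lines. A Consul `keyprefix` watch passes its handler a JSON array on stdin, where each entry has a `Key` and a base64-encoded `Value` (or `null` when the prefix holds no keys). The Python sketch below decodes that payload and writes the result to disk in one atomic step. It differs from the talk in two labelled ways: it writes JSON rather than a PHP-syntax file, and it assumes each value is itself a JSON document. The prefix, file path, and function names are hypothetical, and the APC cache-clearing step is left out.

```python
import base64
import json
import os
import tempfile


def decode_entries(entries, prefix):
    """Turn a keyprefix watch payload into {flag_code: config}."""
    flags = {}
    for entry in entries or []:  # Consul sends null when no keys match
        raw = entry.get("Value")
        if raw is None:
            continue  # e.g. a bare "folder" key with no value
        key = entry["Key"]
        code = key[len(prefix):] if key.startswith(prefix) else key
        # Assumption: each KV value is a JSON-encoded flag config.
        flags[code] = json.loads(base64.b64decode(raw))
    return flags


def write_atomically(path, flags):
    """Write to a temp file, then rename, so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(flags, handle, indent=2, sort_keys=True)
    os.chmod(tmp, 0o644)  # mkstemp creates 0600; let the app user read it
    os.replace(tmp, path)


def handle_watch(stream, path, prefix):
    """Entry point: `stream` is the handler's stdin."""
    flags = decode_entries(json.load(stream), prefix)
    write_atomically(path, flags)
    return flags
```

A handler script would call `handle_watch(sys.stdin, ...)` with its own path and prefix. The rename matters because application processes may read the file at any moment, including mid-write.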
  23. Implementation (PHP)
      <?php
      $dlCodes = array (
        'ACCOUNT_CURRENCY_TOGGLE' => array (
          'value' => 0,
          'restriction' => 'boolean',
          'isAvailableToJs' => 0,
          'createdDate' => '2015-09-28 00:12:34',
        ),
        ...
      );
  24. Modifying a flag
     (diagram: Web Servers running PHP-FPM, each with a Consul Agent and DL Config file; Consul Servers; steps 1-5)
  25. Creating a flag
     (diagram: Web Servers running PHP-FPM, each with a Consul Agent and DL Config file; Consul Servers; steps 1-4)
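For the "creating a flag" flow, the application talks to the Consul agent on its own host. The talk does not show the request, so the Python sketch below is an assumption about how it could look. It uses Consul's KV endpoint with `cas=0`, a check-and-set that only writes when the key does not exist yet, so concurrent first-time checks from many processes cannot overwrite a flag someone has already configured. The key prefix, default config, and function name are hypothetical, and the sketch only builds the request without sending it.

```python
import json
import urllib.parse
import urllib.request


def build_create_request(code, default, agent="http://127.0.0.1:8500",
                         prefix="darklaunch/"):
    """Build a PUT that creates the flag only if it is not already present."""
    key = urllib.parse.quote(prefix + code)  # "/" is left unescaped
    url = f"{agent}/v1/kv/{key}?cas=0"       # cas=0: write only if key is absent
    body = json.dumps(default).encode()
    return urllib.request.Request(url, data=body, method="PUT")
```

Sending it with `urllib.request.urlopen(request)` returns a body of `true` or `false` depending on whether the write happened. A new flag would normally be created in the "off" state so that registration alone never changes behaviour.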
  26. Implementation (Scala)
     ● Handler that receives all KV data for a project
     ● Writes out a Typesafe HOCON syntax config file with all data as a list
     ● Uses inotify to watch for changes to the file
     ● Scala code asks the actor for data for a specific dark launch code
     ● Uses an Akka Agent (a construct which just manages state)
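The Scala design on slide 26 separates two concerns: noticing that the file changed, and holding the current flag state for readers. The Python sketch below shows the same split under simplifying assumptions. It compares the file's modification time instead of using inotify, reads JSON instead of HOCON, and uses a lock-protected object in place of an Akka Agent. `FlagStore` and its methods are hypothetical names.

```python
import json
import os
import threading


class FlagStore:
    """Holds the latest flag data and reloads it when the file changes."""

    def __init__(self, path):
        self._path = path
        self._lock = threading.Lock()
        self._mtime = None
        self._flags = {}

    def refresh(self):
        """Reload if the file's mtime changed. Returns True on reload."""
        try:
            mtime = os.stat(self._path).st_mtime_ns
        except FileNotFoundError:
            return False  # keep serving the last known state
        if mtime == self._mtime:
            return False
        with open(self._path) as handle:
            flags = json.load(handle)
        with self._lock:
            # Swap the whole mapping at once so readers never see a mix.
            self._flags, self._mtime = flags, mtime
        return True

    def get(self, code):
        with self._lock:
            return self._flags.get(code)
```

A background thread or timer would call `refresh()` periodically. Keeping the last known state when the file is missing mirrors the point made later in the talk that flags stay readable regardless of the state of the data store.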
  27. Implementation (Containers)
     ● We use Mesos / Marathon to schedule long-running services written in Scala and Go
     ● Similar to previous implementations
     ● Consul runs on the Mesos slave host, writes all service dark launch data to disk
     ● Shared between all containers on the host
  28. Problems
     ● Multi-DC setup was hampered until Consul 0.5.1 due to lack of distinct LAN/WAN advertise addresses
     ● Atomicity: convergence is slower than an atomic Memcached change, though it’s not a problem for our usage of dark launching (typical convergence is within 1 second)
  29. Convergence
     1 second
  30. Lessons Learned
     ● Enable ACLs early, plan your usage of ACLs
     ● Put enough thought into your KV store structure
     ● You may need to bribe your security team to convince them that having bidirectional communication between all nodes on specific ports is okay
     ● It’s important to understand Consul’s outage recovery process and document what to do in the unlikely event that all servers fail
     ● Key prefix type events will be delivered even to nodes that were down at the time of the event
  31. Conclusions
     ● Consul worked well for us right from the start (~0.4.0)
     ● Making an existing, valuable system better was a great way to introduce it to the company, making its adoption much smoother
     ● Using it for many other projects now
       ○ Nginx LB configuration based on auto-scaling web servers
       ○ Service discovery for seeding Akka Cluster
       ○ Distributed locking for various purposes
       ○ Microservice discovery and routing system (Skyline)
     ● Seamless upgrade process
  32. Conclusions
     ● Increased stability and decreased load on Memcached / MySQL
     ● Since data is now pushed rather than pulled, the system can still read dark launch data independently of the state of the data store
     ● Now usable in all DCs, projects and environments
     ● Shared state allows us to coordinate changes between microservices
  33. Thank you!
     Bill Monkman, bill@hootsuite.com, @bmonkman
     http://code.hootsuite.com
