What they don’t tell you about µ-services…
QCon NY – June 2016
Daniel Rolnick
Chief Technology Officer
Daniel Rolnick
Chief Technology Officer
daniel.rolnick@yodle.com
Story Time
September 2014
Story Time
June 2016
Story Time
Something’s gotta give
▶ Changing environments cause stress
▶ Existing processes need to be revisited
▶ New processes need to be created
▶ New technology needs to be integrated
▶ Businesses are built on trade-offs
Evolution Requires Adaptation
Expected developmental needs
▶ Platform as a Service
▶ Service Discovery
▶ Testing
▶ Containerization
▶ Monitoring
Eyes Wide Open
Unexpected implications of micro-services
▶ Impact on data access
▶ Build and Deploy Tooling
▶ Source Repository Complexity
▶ Cross-application monitoring
Expect the Unexpected
Bring on the complexity
Story Time
[Chart: Yodle Service Count, 0–250]
Data access patterns
Independent Data Domains
▶ Isolated data ownership per micro-service
▶ Options: Physical Databases, Schemas, Polyglot
▶ Ideal state for new things, but what about the old stuff?
▶ Can’t get there in one move
Microservices Macroproblems
Baby Steps to Freedom
▶ Central data stores are leaky abstractions
▶ Enforce data ownership through access patterns
▶ Façade for decoupling (see the sketch below)
▶ Multi-step process
Microservices Macroproblems
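A minimal sketch of the façade step, assuming a Java service; the ReviewStore name, table, and columns are hypothetical. Callers depend only on the interface, so the backing store can move from the central database to a service-owned one without touching them:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import javax.sql.DataSource;

// Hypothetical façade: callers depend on this interface, never on the central database.
public interface ReviewStore {
    List<String> reviewsForBusiness(long businessId);
}

// Step 1: the façade is still backed by the shared central database.
class CentralDbReviewStore implements ReviewStore {
    private final DataSource central;

    CentralDbReviewStore(DataSource central) {
        this.central = central;
    }

    @Override
    public List<String> reviewsForBusiness(long businessId) {
        List<String> reviews = new ArrayList<>();
        String sql = "SELECT review_text FROM reviews WHERE business_id = ?";
        try (Connection conn = central.getConnection();
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, businessId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    reviews.add(rs.getString("review_text"));
                }
            }
        } catch (SQLException e) {
            throw new RuntimeException("review lookup failed", e);
        }
        return reviews;
    }
}

// Step 2 (later): swap in an implementation that calls the owning micro-service's
// HTTP API instead. Callers never change, which is what makes the migration multi-step.
```

Once every consumer goes through the façade, the "multi-step process" is just swapping the implementation behind it.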
Shared Containers Simplify Things
Microservices Macroproblems
▶ Services in the same container reuse connections
▶ Connection pooling goes away
▶ Base connection count starts adding up
▶ You could always go to a minimum idle of zero (sketch below)
▶ What could go wrong?
Microservices Macroproblems
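One reading of "minimum idle of zero": every service keeps its own small pool but holds no idle connections, so the base connection count across hundreds of services stays bounded. A sketch assuming HikariCP; the pool library, JDBC URL, account, and limits are illustrative choices, not what the talk prescribes:

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public final class LeanPool {
    public static HikariDataSource create() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql://central-db.example.internal:5432/app"); // placeholder
        config.setUsername("svc_reviews");                 // placeholder per-service account
        config.setPassword(System.getenv("DB_PASSWORD"));
        config.setMaximumPoolSize(5);   // hard cap per service instance
        config.setMinimumIdle(0);       // hold no idle connections between bursts
        config.setIdleTimeout(30_000);  // release idle connections after 30 seconds
        return new HikariDataSource(config);
    }
}
```

The "What could go wrong?" is the catch: cold connects add latency and reconnect storms under load, which is part of why pooling later moved outside the container.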
[Chart: Yodle Service Count, 0–250]
External Connection Pooling
▶ Connection pooling outside of the container
▶ Add visibility while you’re at it
▶ Better logging, cleaner visualizations
Microservices Macroproblems
Microservices Macroproblems
Tooling for empowerment
▶ Server spin-up
▶ Schema and Account creation
▶ Ensure your configurations are externalized
Microservices Macroproblems
Platform as a Service
Static Configurations
▶ Every application deployed to a fixed set of hosts on a set of known ports
▶ Monitoring was done only at a coarse, synthetic, system-wide level
▶ Only complete outages were easily detectable
▶ Manual restarts required
▶ PS-Watcher and Docker restart help but are not sufficient
▶ This was not going to scale
A Place for Everything and Everything…
Keeping services alive by hand is problematic
▶ Researched available PaaS platforms in late 2014
• Mesos / Marathon
• CoreOS
▶ What about:
• Kubernetes
• Swarm
• AWS Elastic Container Service
This Ain’t Gonna Scale
Mesos and Marathon
▶ Deploy applications to Marathon (example below)
▶ Marathon decides what host and port to run applications on
▶ Health checks are built in to ensure application up-time
▶ Mesos ensures the applications run and are contained
Platform as a Service
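Concretely, deploying to Marathon means POSTing an app definition, health check included, to its /v2/apps REST endpoint and letting Marathon pick the host and port. A sketch with placeholder hostnames, image name, and health-check path:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public final class MarathonDeploy {
    public static void main(String[] args) throws Exception {
        // Placeholder app definition: Marathon assigns the host and the host port.
        String appJson = """
            {
              "id": "/reviews-service",
              "instances": 2,
              "cpus": 0.5,
              "mem": 512,
              "container": {
                "type": "DOCKER",
                "docker": {
                  "image": "registry.example.internal/reviews-service:1.0.0",
                  "network": "BRIDGE",
                  "portMappings": [{"containerPort": 8080, "hostPort": 0}]
                }
              },
              "healthChecks": [{
                "protocol": "HTTP",
                "path": "/healthcheck",
                "gracePeriodSeconds": 30,
                "intervalSeconds": 10,
                "maxConsecutiveFailures": 3
              }]
            }
            """;

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://marathon.example.internal:8080/v2/apps"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(appJson))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

The same definition, PUT back later with a new image tag, triggers a redeploy; the health check is what lets Marathon keep the application up without manual restarts.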
Platform as a Service
[Chart: Yodle Service Count, 0–250]
Pace of Innovation Increases
Service Discovery
Aware Apps vs. Smart Pipes
▶ Service discovery can be baked into your application
Dynamic Topologies Require Service Discovery
Aware Apps vs. Smart Pipes
▶ Plumbing can take care of it for you
▶ Smart Pipes allow:
• An easier path to a polyglot ecosystem
• Decoupling applications from service discovery
▶ We chose the latter but we had to iterate a few times to get there
Dynamic Topologies Require Service Discovery
Curator already in place
▶ Already used ZooKeeper/Curator for our Thrift-based macro-services
▶ Made our micro-services self-register and discover each other via Curator (sketch below)
▶ You can’t solve everything at once
▶ Not our desired end state
Use What You Know
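A sketch of what self-registration and lookup look like with Curator's discovery recipe (curator-x-discovery); the ensemble address, base path, and service names are placeholders:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.curator.x.discovery.ServiceDiscovery;
import org.apache.curator.x.discovery.ServiceDiscoveryBuilder;
import org.apache.curator.x.discovery.ServiceInstance;

public final class CuratorRegistration {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
            "zookeeper.example.internal:2181",            // placeholder ensemble
            new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Register this instance under a well-known base path.
        ServiceInstance<Void> me = ServiceInstance.<Void>builder()
            .name("reviews-service")                      // placeholder service name
            .port(8080)
            .build();

        ServiceDiscovery<Void> discovery = ServiceDiscoveryBuilder.builder(Void.class)
            .client(client)
            .basePath("/services")
            .thisInstance(me)
            .build();
        discovery.start();

        // Discover another service's instances the same way.
        for (ServiceInstance<Void> instance : discovery.queryForInstances("billing-service")) {
            System.out.println(instance.getAddress() + ":" + instance.getPort());
        }
    }
}
```

The downside called out above: every application now carries a ZooKeeper client and knows the registration conventions, which is exactly the coupling the later iterations remove.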
Hipache by dotCloud
▶ URLs looked like https://svcb.services.prod.yodle.com
▶ Utilized dedicated routing servers
Service Discovery V2
Hipache by dotCloud
▶ Pros: Decoupled service discovery from applications
▶ Cons: Services had to be environment aware
Service Discovery V2
PaaS’s built-in routing layer
▶ Marathon has a built-in routing layer using HAProxy
▶ Simple command to generate an HAProxy config (sketch below)
▶ A basic listener (Qubit Bamboo) keeps the HAProxy files up-to-date
▶ Hipache could have worked
Service Discovery V3
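The "simple command" boils down to turning Marathon's list of running tasks into HAProxy backend entries and reloading HAProxy when they change, which is what a listener like Qubit Bamboo automates. A toy sketch of just the rendering step, with made-up task data standing in for Marathon's /v2/tasks response:

```java
import java.util.List;

public final class HaproxyRenderer {
    // Minimal stand-in for what Marathon reports about a running task.
    record Task(String host, int port) {}

    // Render one HAProxy backend block for a service's current tasks.
    static String renderBackend(String serviceName, List<Task> tasks) {
        StringBuilder cfg = new StringBuilder();
        cfg.append("backend ").append(serviceName).append("\n");
        cfg.append("  balance roundrobin\n");
        int i = 0;
        for (Task task : tasks) {
            cfg.append("  server ").append(serviceName).append("-").append(i++)
               .append(" ").append(task.host()).append(":").append(task.port())
               .append(" check\n");
        }
        return cfg.toString();
    }

    public static void main(String[] args) {
        // Placeholder tasks; in practice these come from Marathon's API.
        System.out.print(renderBackend("reviews-service",
            List.of(new Task("10.0.1.17", 31045), new Task("10.0.1.22", 31210))));
    }
}
```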
Discovery was simpler
▶ Service discovery is now fully externalized
▶ Iterate on routing and discovery independently
▶ Created tech debt for the applications
Service Discovery V3 Continued
Service Discovery V4
[Chart: Yodle Service Count, 0–250]
Scale Problems
Many to Many Problems
▶ As the number of slave nodes in our PaaS grew, so did our problems
▶ Health checks from every host to every container
▶ Ensuring the HAProxy file was up-to-date on all hosts was annoying
▶ Centralized onto a small cluster of routing boxes
Service Discovery V4
Testing
Regressions give comfort
▶ Monolithic releases are understandable
▶ We tested everything
▶ Everything works
Continuous Integration
Release code as it is written
Continuous Delivery Pipeline
Develop → Commit to Branch → Continuous Integration → Merge → Continuous Delivery
Regressions take time
▶ Empower continuous delivery
▶ Broke apart our monolithic regression suite
▶ Same methodology for macro and micro-services
Continuous Integration
Enter the Canary
▶ Landscape is in flux
▶ If we test only a subset of things, how can we be sure everything works?
▶ The Canary ensures:
• Dependencies are met
• Existing contracts are satisfied
• Production load can be handled
Continuous Delivery Pipeline
Continuous Delivery Pipeline
▶ Special canary routing in our service discovery layer
▶ Test anywhere in the service mesh
▶ Discoverable tests using a /tests endpoint (sketch below)
▶ Monitor canary health in New Relic
▶ Promote to Canary Partial
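A sketch of what a discoverable /tests endpoint could look like as a JAX-RS resource; the resource name, test names, and checks are hypothetical, not Yodle's actual implementation. The pipeline GETs /tests to learn what the service exposes, then POSTs to run each one against the canary:

```java
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;
import javax.ws.rs.GET;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

@Path("/tests")
@Produces(MediaType.APPLICATION_JSON)
public class CanaryTestsResource {

    // Hypothetical registry of in-service smoke tests, keyed by name.
    private final Map<String, Supplier<Boolean>> tests = Map.of(
        "database-connectivity", () -> true,            // placeholder checks
        "downstream-billing-contract", () -> true);

    // Lets the deployment pipeline discover which tests this service exposes.
    @GET
    public List<String> listTests() {
        return List.copyOf(tests.keySet());
    }

    // Runs one named test; the pipeline treats a non-2xx response as a failed canary.
    @POST
    @Path("/{name}")
    public Response runTest(@PathParam("name") String name) {
        Supplier<Boolean> test = tests.get(name);
        if (test == null) {
            return Response.status(Response.Status.NOT_FOUND).build();
        }
        return test.get()
            ? Response.ok(Map.of("test", name, "passed", true)).build()
            : Response.status(Response.Status.INTERNAL_SERVER_ERROR)
                      .entity(Map.of("test", name, "passed", false)).build();
    }
}
```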
Continuous Delivery Pipeline
▶ Receive partial production load
▶ Monitor canary health in New Relic
▶ Validate response codes
▶ Measure throughput
▶ Promote to general availability (promotion-gate sketch below)
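The promotion decision itself can be a simple gate over the canary's partial-load metrics. A sketch with made-up thresholds; in the talk the underlying numbers came from New Relic:

```java
public final class CanaryPromotionGate {
    // Placeholder thresholds; real values depend on the service.
    private static final double MAX_ERROR_RATE = 0.01;    // 1% non-2xx responses
    private static final double MIN_THROUGHPUT_RPM = 50;  // must actually receive traffic

    public static boolean shouldPromote(long totalRequests, long errorResponses,
                                        double requestsPerMinute) {
        if (requestsPerMinute < MIN_THROUGHPUT_RPM) {
            return false; // not enough partial production load to judge
        }
        double errorRate = totalRequests == 0 ? 1.0 : (double) errorResponses / totalRequests;
        return errorRate <= MAX_ERROR_RATE;
    }

    public static void main(String[] args) {
        // Example: 12,000 requests, 30 errors, 400 rpm over the observation window.
        System.out.println(shouldPromote(12_000, 30, 400) ? "promote" : "roll back");
    }
}
```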
Sentinel
▶ [Screenshots of Sentinel]
Continuous Delivery Pipeline
Containers
Standardization is required
▶ Polyglot environments buck standardization
▶ Micro-service environments increase complexity
▶ Operational complexity can grow unbounded
▶ Developers own the runtime
▶ Common runtime from an operator’s standpoint
▶ Tooling provides consistent deployments
Containers Bring Simplicity
Hierarchical Container Images
▶ How do you roll out environmental changes when you have 200 different container builds?
Containers Bring Simplicity
Containers make a mess
▶ Docker host machines were littered
▶ Docker registry is littered with old images
▶ Developed a tagging process
Containers Bring Simplicity
Monitoring
Legacy Monitoring not cutting it
▶ Designed for testing and monitoring infrastructure
▶ Needed application performance management
▶ Wanted something that would scale with us with little effort
Increased Complexity Increased Requirements
Graphite and Grafana
▶ Dropwizard Metrics to report data (sketch below)
▶ Teams built custom dashboards
▶ Too much manual effort
▶ No alerting
Increased Complexity Increased Requirements
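A sketch of that Graphite setup with Dropwizard Metrics' GraphiteReporter; the Graphite host, prefix, and metric names are placeholders:

```java
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;
import com.codahale.metrics.graphite.Graphite;
import com.codahale.metrics.graphite.GraphiteReporter;
import java.net.InetSocketAddress;
import java.util.concurrent.TimeUnit;

public final class MetricsSetup {
    public static MetricRegistry start() {
        MetricRegistry registry = new MetricRegistry();

        // Report everything to Graphite once a minute, namespaced per service.
        Graphite graphite = new Graphite(
            new InetSocketAddress("graphite.example.internal", 2003)); // placeholder host
        GraphiteReporter reporter = GraphiteReporter.forRegistry(registry)
            .prefixedWith("services.reviews-service")                  // placeholder prefix
            .convertRatesTo(TimeUnit.SECONDS)
            .convertDurationsTo(TimeUnit.MILLISECONDS)
            .build(graphite);
        reporter.start(1, TimeUnit.MINUTES);
        return registry;
    }

    public static void main(String[] args) {
        MetricRegistry registry = start();
        Meter requests = registry.meter("requests");
        Timer latency = registry.timer("request-latency");
        try (Timer.Context ignored = latency.time()) {
            requests.mark(); // instrument real request handling in practice
        }
    }
}
```

This gets the data out, but as the slide notes, dashboards and alerting were still per-team manual work.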
Enter the Hackathon
▶ New Relic Monitoring For Microservices
▶ Simple – just add an agent
▶ Detailed per application dashboards out of the box
▶ Single score to focus attention (Useful for initial canary implementation)
▶ Basic alerting
Increased Complexity Increased Requirements
100 Apps in 100 Days
▶ Made use of our base containers
▶ Rolled out monitoring to every application in the fleet
▶ Suddenly we had visibility everywhere.
▶ Some Limitations
• No good Docker support (this has since improved)
• Service graphs aren't dynamically generated
Increased Complexity Increased Requirements
Finding root causes
▶ Hundreds of Dashboards
▶ Hundreds of Individual Service Nodes
▶ Finding root causes in complex service graphs is difficult
▶ Anomalies from individual service nodes are difficult to detect
▶ Still looking for a good solution
Increased Complexity Increased Requirements
Source Repository Complexity
Source Code Management
▶ An organizational scheme helps you reason about all the repositories
▶ Hound to help with code searching
▶ Repo tool to help keep up-to-date
▶ Upgrading libraries is a challenge
Source Repository Complexity
Dependency Management
▶ [Image of Vantage]
Source Repository Complexity
Build tooling
▶ Many build systems don’t directly allow scripting
▶ Bamboo definitely doesn’t
▶ Build tooling iterations are painful
▶ Managing Bamboo build and deploy plans at scale is hard
Source Repository Complexity
Existing Build Tooling
Build and Deploy Tooling
[Chart: Yodle Service Count, 0–250]
Directly to Marathon configurations in Bamboo
Build and Deploy Tooling
[Chart: Yodle Service Count, 0–250]
LaunchPad as an Abstraction Layer
Build and Deploy Tooling
[Chart: Yodle Service Count, 0–250]
Build and Deploy Tooling
Sentinel For Human Service Discovery
Source Repository Complexity
Conclusion
Even if you aren’t on the bleeding edge …
▶ Every environment is different
▶ Legacy Applications present unique challenges
▶ Different business requirements
▶ Different trade-offs
Plan for Challenges
Every Hurdle Was Worth It
Improved Agility
[Chart: Monthly Deployments, 0–1,400]
We make it easy to grow and manage profitable customer relationships
It’s success simplified!
