Microservices: State of the Union
Adrian Cockcroft @adrianco
Technology Fellow - Battery Ventures
June 2016
Watch the video with slide synchronization on InfoQ.com!
https://www.infoq.com/presentations/microservices-review
Presented at QCon New York
www.qconnewyork.com
What does @adrianco do?
• Technology Due Diligence on Deals
• Presentations at Conferences
• Presentations at Companies
• Technical Advice for Portfolio Companies
• Program Committee for Conferences
• Networking with Interesting People
• Tinkering with Technologies
• Maintain Relationship with Cloud Vendors
Previously: Netflix, eBay, Sun Microsystems, CCL, TCU London BSc Applied Physics
Developer responsibilities:
Faster, cheaper, safer
What Happened?
Rate of change increased
Cost and size and risk of change reduced
Disruptor: Continuous Delivery with Containerized Microservices
Microservices
A Microservice Definition
Loosely coupled service oriented architecture with bounded contexts
If every service has to be updated at the same time it’s not loosely coupled
If you have to know too much about surrounding services you don’t have a bounded context. See the Domain Driven Design book by Eric Evans.
Speeding Up The Platform
AWS Lambda is leading exploration of serverless architectures in 2016
Datacenter Snowflakes
• Deploy in months
• Live for years
Virtualized and Cloud
• Deploy in minutes
• Live for weeks
Container Deployments
• Deploy in seconds
• Live for minutes/hours
Lambda Deployments
• Deploy in milliseconds
• Live for seconds
State of the Art in Web Scale Microservice Architectures
http://www.infoq.com/presentations/Twitter-Timeline-Scalability
http://www.infoq.com/presentations/twitter-soa
http://www.infoq.com/presentations/Zipkin
http://www.infoq.com/presentations/scale-gilt
Go-Kit https://www.youtube.com/watch?v=aL6sd4d4hxk
http://www.infoq.com/presentations/circuit-breaking-distributed-systems
https://speakerdeck.com/mattheath/scaling-micro-services-in-go-highload-plus-plus-2014
AWS Re:Invent : Asgard to Zuul https://www.youtube.com/watch?v=p7ysHhs5hl0
Resiliency at Massive Scale https://www.youtube.com/watch?v=ZfYJHtVL1_w
Microservice Architecture https://www.youtube.com/watch?v=CriDUYtfrjs
New projects for 2015 and Docker Packaging https://www.youtube.com/watch?v=hi7BDAtjfKY
Spinnaker deployment pipeline https://www.youtube.com/watch?v=dwdVwE52KkU
http://www.infoq.com/presentations/spring-cloud-2015
Microservice Architectures
Configuration Tooling Discovery Routing Observability
Development: Languages and Container
Operational: Orchestration and Deployment Infrastructure
Datastores
Policy: Architectural and Security Compliance
Next Generation Applications
Fill in the gaps, rapidly evolving ecosystem choices
Configuration: Archaius, LaunchDarkly, Habitat
Tooling: Lambda, Docker, Spinnaker
Discovery: Etcd, Eureka, Consul
Routing: Compose, Linkerd, Weave
Observability: Zipkin, Prometheus, Hystrix
Development: components interfaces languages e.g. Docker Hub, Artifactory, Datawire Quark, Go, Rust
Operational: Mesos, Kubernetes, Swarm, Nomad for private clouds. ECS, Mesos, GKS for public
Datastores: Orchestrated, Distributed Ephemeral e.g. Cassandra, or DBaaS e.g. DynamoDB
Policy: Security compliance e.g. Docker Content Trust. Architecture compliance e.g. Cloud Foundry
In Search of Segmentation
Layers, from Ops to Dev
Datacenter stack: Datacenters, AD/LDAP Roles, VLAN Networks, Hypervisor, IPtables, Docker Links
AWS stack: AWS Accounts, IAM Roles, VPC, Security Groups, Calico Policy, Docker Net/Weave
Hierarchical Segmentation
An AWS oriented example…
[Diagram: containers A to F and the links between them, grouped into a Homepage Team Security Group and a Reports Team Security Group]
VPC Z - Manage a small number of network spaces
AWS Account - Manage across multiple accounts
What’s Often Missing?
Failure injection testing
Versioning, routing
Binary protocols and interfaces
Timeouts and retries
Denormalized data models
Monitoring, tracing
Simplicity through symmetry
Failure Injection Testing
Netflix Chaos Monkey, Simian Army, FIT and Gremlin
http://techblog.netflix.com/2011/07/netflix-simian-army.html
http://techblog.netflix.com/2014/10/fit-failure-injection-testing.html
http://techblog.netflix.com/2016/01/automated-failure-testing.html
https://www.infoq.com/presentations/failure-test-research-netflix
Trust with Verification
• Chaos Monkey - enforcing stateless business logic
• Chaos Gorilla - enforcing zone isolation/replication
• Chaos Kong - enforcing region isolation/replication
• Security Monkey - watching for insecure configuration settings
• FIT & Gremlin - inject errors to enforce robust dependencies
• See over 100 NetflixOSS projects at netflix.github.com
• Get “Technical Indigestion” reading techblog.netflix.com
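As a rough illustration of "inject errors to enforce robust dependencies", the sketch below wraps a dependency call so that a configurable fraction of calls fail before reaching it. This is not how Netflix FIT or Gremlin are implemented; `inject_failures`, `InjectedFailure` and `call_with_fallback` are hypothetical names invented for this example.

```python
import random
from functools import wraps


class InjectedFailure(Exception):
    """Raised instead of calling the real dependency (hypothetical name)."""


def inject_failures(rate, rng=None):
    """Decorator that fails roughly `rate` of calls before they reach the dependency."""
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # rng.random() is in [0, 1), so rate=1.0 always fails and rate=0.0 never does.
            if rng.random() < rate:
                raise InjectedFailure(func.__name__)
            return func(*args, **kwargs)
        return wrapper
    return decorator


def call_with_fallback(func, fallback):
    """A robust caller degrades to a fallback value instead of propagating the error."""
    try:
        return func()
    except InjectedFailure:
        return fallback
```

A test suite could run the same caller with the rate set to 1.0 and check that it still returns a usable (for example cached) answer.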
Benefits of version aware routing
Immediately and safely introduce a new version
Canary test in production
Use DIY feature flags or .
Route clients to a version so they can’t get disrupted
Change client or dependencies but not both at once
Eventually remove old versions
Incremental or infrequent “break the build” garbage collection
Versioning, Routing
Version numbering: Interface.Feature.Bugfix
V1.2.3 to V1.2.4 - Canary test then remove old version
V1.2.x to V1.3.x - Canary test then remove or keep both
Route V1.3.x clients to new version to get new feature
Remove V1.2.x only after V1.3.x is found to work for V1.2.x clients
V1.x.x to V2.x.x - Route clients to specific versions
Remove old server version when all old clients are gone
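The Interface.Feature.Bugfix rules above can be sketched as a small routing function. The policy encoded here is one reading of the slide, not a definitive implementation: a client stays on its own Interface.Feature line (taking the newest bugfix) while that line is still deployed, and moves to a newer feature version of the same interface only once its own line has been removed. `parse_version` and `route` are hypothetical names.

```python
from typing import List, Optional, Tuple

Version = Tuple[int, int, int]  # (interface, feature, bugfix)


def parse_version(text: str) -> Version:
    """Parse 'V1.2.3' or '1.2.3' into an (interface, feature, bugfix) tuple."""
    interface, feature, bugfix = text.lstrip("Vv").split(".")
    return int(interface), int(feature), int(bugfix)


def route(client_version: str, deployed: List[str]) -> Optional[str]:
    """Pick the deployed server version a client should be routed to, or None."""
    want = parse_version(client_version)
    # Never cross an interface boundary: V1 clients are not sent to V2 servers.
    same_interface = [v for v in deployed if parse_version(v)[0] == want[0]]
    # Prefer the client's own feature line, newest bugfix first.
    same_feature = [v for v in same_interface if parse_version(v)[1] == want[1]]
    if same_feature:
        return max(same_feature, key=parse_version)
    # Assumed fallback: the old line was removed, so use a newer feature version.
    newer = [v for v in same_interface if parse_version(v)[1] > want[1]]
    return max(newer, key=parse_version, default=None)
```

With V1.2.3, V1.2.4, V1.3.0 and V2.0.1 deployed, a V1.2.3 client is routed to V1.2.4; after the V1.2.x servers are removed it is routed to V1.3.0.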
Protocols
Measure serialization, transmission, deserialization costs
Sending a megabyte of XML between microservices will
make you sad, but not as sad as 10yrs ago with SOAP
Use Thrift, Protobuf/gRPC, Avro, SBE internally
Use JSON for external/public interfaces
https://github.com/real-logic/simple-binary-encoding
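To make "measure serialization, transmission, deserialization costs" concrete, here is a minimal measurement sketch. It compares JSON with a fixed binary layout built on Python's `struct` module, which only stands in for a binary protocol; it is not Thrift, Protobuf, Avro or SBE. The record shape (user id, timestamp, score) is a made-up assumption.

```python
import json
import struct
import time

# Hypothetical record: (user_id, timestamp_ms, score)
RECORD_LAYOUT = struct.Struct("<QQd")  # two unsigned 64-bit ints and a double


def encode_json(records):
    return json.dumps(
        [{"user_id": u, "timestamp_ms": t, "score": s} for u, t, s in records]
    ).encode("utf-8")


def decode_json(payload):
    return [(r["user_id"], r["timestamp_ms"], r["score"]) for r in json.loads(payload)]


def encode_binary(records):
    return b"".join(RECORD_LAYOUT.pack(u, t, s) for u, t, s in records)


def decode_binary(payload):
    return [tuple(fields) for fields in RECORD_LAYOUT.iter_unpack(payload)]


def measure(encode, decode, records):
    """Return (payload_bytes, encode_seconds, decode_seconds) for one round trip."""
    start = time.perf_counter()
    payload = encode(records)
    encoded = time.perf_counter()
    decoded = decode(payload)
    finished = time.perf_counter()
    assert decoded == records, "round trip must preserve the data"
    return len(payload), encoded - start, finished - encoded
```

Payload size approximates transmission cost; the two timings cover each end of the call. Real numbers depend heavily on the data and the library, so measure with representative payloads.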
Interfaces
When you build a service, build a “driver” client for it
Reference implementation error handling and serialization
Release automation stress test using client
Validate that service interface is usable!
Minimize additional dependencies
Swagger - OpenAPI Specification
Datawire Quark adds behaviors to API spec
Interface Version Pinning
Change one thing at a time!
Pin the version of everything else
Incremental build/test/deploy pipeline
Deploy existing app code with new platform
Deploy existing app code with new dependencies
Deploy new app code with pinned platform/dependencies
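"Change one thing at a time" can be enforced mechanically in a pipeline. The sketch below assumes each deployment is described by a manifest mapping component names to pinned versions; that manifest format and the function names are hypothetical.

```python
def changed_components(deployed, candidate):
    """Names whose pinned version differs between two deployment manifests."""
    names = set(deployed) | set(candidate)
    return sorted(n for n in names if deployed.get(n) != candidate.get(n))


def check_single_change(deployed, candidate):
    """Reject a rollout that changes more than one pinned thing at a time."""
    changed = changed_components(deployed, candidate)
    if len(changed) > 1:
        raise ValueError(f"change one thing at a time, got: {', '.join(changed)}")
    return changed
```

A rollout that moves only the platform version passes; one that moves the platform and the app code together is rejected, so a failure can be attributed to a single change.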
Interfaces between teams
[Diagram: three components, each running on its own Platform layer]
Client Code: Minimal Object Model, Cache Driver, Service Driver
Service Code: Full Object Model, Service Handler
Cache Code: Common Object Model, Cache Handler
Annotations: Decoupled object models; Versioned dependency interfaces; Versioned platform interface; Versioned routing
Timeouts and Retries
Connection timeout vs. request timeout confusion
Usually setup incorrectly, global defaults
Systems collapse with “retry storms”
Timeouts too long, too many retries
Services doing work that can never be used
Connections and Requests
TCP makes a connection, HTTP makes a request
HTTP hopefully reuses connections for several requests
Both have different timeout and retry needs!
TCP timeout is purely a property of one network latency hop
HTTP timeout depends on the service and its dependencies
connection path
request path
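The distinction between the two timeouts can be shown with the standard `socket` module. This is only a sketch: a raw TCP exchange stands in for an HTTP request, and the function name and parameters are hypothetical.

```python
import socket


def call_with_split_timeouts(host, port, request, connect_timeout, request_timeout):
    """Send one request over TCP with separate connect and response timeouts."""
    # Connection timeout: bounded by one network hop, so it can be short.
    sock = socket.create_connection((host, port), timeout=connect_timeout)
    try:
        # Request timeout: covers the service and everything it calls downstream.
        sock.settimeout(request_timeout)
        sock.sendall(request)
        return sock.recv(65536)
    finally:
        sock.close()
```

A single global default for both values hides the fact that the first depends on the network and the second on the service being called.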
Timeouts and Retries
Bad config: Every service defaults to 2 second timeout, two retries
Healthy: Edge Service -> Good Service -> Good Service
Failure: Edge Service not responding -> Overloaded service not responding -> Failed Service
If anything breaks, everything upstream stops responding
Retries add unproductive work
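The arithmetic behind this bad configuration is worth spelling out. The sketch below assumes retries happen back to back with no backoff; the function names are hypothetical.

```python
def worst_case_calls(retries_per_layer, layers):
    """Calls that can reach the last service when every calling layer retries."""
    attempts = 1 + retries_per_layer
    return attempts ** layers


def time_to_give_up(timeout_s, retries):
    """Seconds a caller spends before reporting failure, ignoring any backoff."""
    return timeout_s * (1 + retries)
```

With two calling layers (edge and middle) and two retries each, one user request can turn into 9 calls on the failed service. The middle service needs 6 seconds to give up, but the edge abandons each attempt after 2 seconds, so most of that work is never consumed.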
Timeouts and Retries
Bad config: Every service defaults to 2 second timeout, two retries
Edge service responds slowly -> Overloaded service -> Partially failed service
First request from Edge timed out so it ignores the successful response and keeps retrying. Middle service load increases as it’s doing work that isn’t being consumed
Timeout and Retry Fixes
Cascading timeout budget
Static settings that decrease from the edge
or dynamic budget passed with request
How often do retries actually succeed?
Don’t ask the same instance the same thing
Only retry on a different connection
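A "dynamic budget passed with request" could look like the sketch below: the edge creates a deadline, and each service asks it how long the next downstream call may take. The `Deadline` class and its method names are hypothetical, and how the budget travels between services (for example in a request header) is left out.

```python
import time


class Deadline:
    """A timeout budget created at the edge and passed along with the request."""

    def __init__(self, budget_s, clock=time.monotonic):
        self._clock = clock  # injectable so the behaviour can be tested
        self._expires_at = clock() + budget_s

    def remaining(self):
        return max(0.0, self._expires_at - self._clock())

    def timeout_for_call(self, reserve_s=0.0):
        """Timeout for the next downstream call, keeping `reserve_s` to build a reply."""
        remaining = self.remaining() - reserve_s
        if remaining <= 0:
            raise TimeoutError("budget exhausted, fail fast instead of calling downstream")
        return remaining
```

Because every hop subtracts the time already spent, timeouts shrink automatically from the edge inward, and a service with no budget left fails fast rather than starting work nobody will wait for.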
Timeouts and Retries
Budgeted timeout, one retry
Edge Service (3s timeout) -> Good Service (1s timeout, one retry) -> Failed Service
Fast fail response after 2s
Upstream timeout must always be longer than the total downstream delay (timeout * attempts, including retries)
No unproductive work while fast failing
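The rule for static settings can be checked before deployment. In this sketch each hop is a hypothetical `(name, timeout_s, retries)` tuple describing how a caller treats the service below it; the check flags any caller whose timeout is not longer than the worst case delay of the next hop.

```python
def downstream_worst_case(timeout_s, retries):
    """Longest a caller can wait on one dependency: every attempt times out."""
    return timeout_s * (1 + retries)


def validate_chain(hops):
    """hops: list of (name, timeout_s, retries) ordered from the edge inward.

    Returns the names of callers whose timeout is not longer than the worst
    case delay of the hop directly below them.
    """
    problems = []
    for (name, timeout_s, _), (_, next_timeout, next_retries) in zip(hops, hops[1:]):
        if timeout_s <= downstream_worst_case(next_timeout, next_retries):
            problems.append(name)
    return problems
```

For the slide's numbers, an edge timeout of 3s over a middle service using 1s with one retry (2s worst case) passes. The earlier "2 second timeout, two retries everywhere" configuration fails, because 2s is shorter than the 6s the middle service may spend.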
Timeouts and Retries
Budgeted timeout, failover retry
Edge Service (3s timeout) -> Good Service (1s timeout) -> Failed Service, then failover retry to a Good Service instance
Successful response delayed 1s
For replicated services with multiple instances never retry against a failed instance
No extra retries or unproductive work
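A failover retry can be sketched as a loop over distinct instances. The instance list, the `send` callable and the choice of exceptions that count as failure are assumptions made for this example.

```python
def call_with_failover(instances, send, max_attempts=2):
    """Try each attempt on a different instance; never re-ask one that failed.

    `instances` is a list of instance identifiers and `send(instance)` performs
    the request, raising ConnectionError or TimeoutError on failure.
    """
    last_error = None
    for instance in instances[:max_attempts]:
        try:
            return send(instance)
        except (ConnectionError, TimeoutError) as error:
            last_error = error  # move on to the next instance, do not retry this one
    raise last_error or RuntimeError("no instances available")
```

With `max_attempts=2` this matches "one retry": the second attempt goes to a different instance, so the caller sees one timeout of delay and the failed instance receives no extra load.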
Manage Inconsistency
ACM Paper: "The Network is Reliable"
Distributed systems are inconsistent by nature
Clients are inconsistent with servers
Most caches are inconsistent
Versions are inconsistent
Get over it and deal with it
See http://queue.acm.org/detail.cfm?id=2655736
Denormalized Data Models
Any non-trivial organization has many databases
Cross references exist, inconsistencies exist
Microservices work best with individual simple stores
Scale, operate, mutate, fail them independently
NoSQL allows flexible schema/object versions
Denormalized Data Models
Build custom cross-datasource check/repair processes
Ensure all cross references are up to date
Read these Pat Helland papers
Immutability Changes Everything
http://highscalability.com/blog/2015/1/26/paper-immutability-changes-everything-by-pat-helland.html
Memories, Guesses and Apologies
https://blogs.msdn.microsoft.com/pathelland/2007/05/15/memories-guesses-and-apologies/
Standing on the Distributed Shoulders of Giants
http://queue.acm.org/detail.cfm?id=2953944
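A cross-datasource check/repair process can start very small. The sketch below uses two in-memory dictionaries in place of two independent stores; the order and customer shapes, and the repair policy of creating a placeholder record, are assumptions for illustration only.

```python
def find_broken_references(orders, customers):
    """Order ids whose customer_id has no matching record in the customer store."""
    return sorted(
        order_id for order_id, order in orders.items()
        if order["customer_id"] not in customers
    )


def repair(orders, customers, placeholder):
    """Assumed repair policy: create a placeholder customer for each dangling id."""
    broken = find_broken_references(orders, customers)
    for order_id in broken:
        customers.setdefault(orders[order_id]["customer_id"], dict(placeholder))
    return broken
```

Run periodically, the check reports drift between stores and the repair step applies whatever policy the business chooses; deleting the orphan or flagging it for review are equally valid alternatives.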
Cloud Native Monitoring and Microservices
Low Latency SaaS Based Monitors
https://www.datadoghq.com/ http://www.instana.com/ www.bigpanda.io www.vividcortex.com signalfx.com wavefront.com sysdig.com
See www.battery.com for a list of portfolio investments
A Tragic Quadrant
[Chart. Axis: Ability to scale (100s, 1,000s, 10,000s, 100,000s). Axis: Ability to handle rapidly changing microservices (Datacenter, Cloud, Containers, Lambda). Quadrants: In-house tools at web scale companies; Most current monitoring & APM tools; Next generation APM; Next generation Monitoring]
YMMV: Opinionated approximate positioning only
Interesting architectures have a lot of microservices!
Flow visualization is a big challenge.
See http://www.slideshare.net/LappleApple/gilt-from-monolith-ruby-app-to-micro-service-scala-service-architecture
Simulated Microservices
Model and visualize microservices
Simulate interesting architectures
Generate large scale configurations
Eventually stress test real tools
Code: github.com/adrianco/spigo
Simulate Protocol Interactions in Go
Visualize with D3
See for yourself: http://simianviz.surge.sh
Follow @simianviz for updates
ELB: Load Balancer
Zuul: API Proxy
Karyon: Business Logic
Staash: Data Access Layer
Priam: Cassandra Datastore
Denominator: DNS Endpoint
Three Availability Zones
Simplicity through symmetry
Symmetry
Invariants
Stable assertions
No special cases
Single purpose components
Serverless
Serverless Architectures
AWS Lambda getting some early wins
Google Cloud Functions, Azure Functions alpha launched
IBM OpenWhisk - open sourced
Startup activity: iron.io, serverless.com, apex.run toolkit
Serverless Architecture
API Gateway
Kinesis, S3, DynamoDB
AWS Lambda Reference Arch: http://www.allthingsdistributed.com/2016/05/aws-lambda-serverless-reference-architectures.html
Serverless Programming Model
Event driven functions
Role based permissions
Whitelisted API based security
Good for simple single threaded code
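An event driven function in this model can be as small as the sketch below. AWS Lambda's Python runtime calls a function with `(event, context)` arguments; the event field `name` and the response shape used here are assumptions for illustration, not a required format.

```python
import json


def handler(event, context):
    """Event driven function: runs once per event and keeps no state between calls."""
    # `name` is a hypothetical field; real events depend on the trigger.
    name = event.get("name", "world")
    return {"statusCode": 200, "body": json.dumps({"message": f"hello {name}"})}
```

Permissions are not visible in the code: what the function may call is decided by the role attached to it, which is what keeps the function itself simple.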
Serverless Cost Efficiencies
100% useful work, no agents, overheads
100% utilization, no charge between requests
No need to size capacity for peak traffic
Anecdotal costs ~1% of conventional system
Ideal for low traffic, Corp IT, spiky workloads
Serverless Work in Progress
Tooling for ease of use
Multi-region HA/DR patterns
Debugging and testing frameworks
Monitoring, end to end tracing
DIY Serverless Operating Challenges
Startup latency
Execution overhead
Charging model
Capacity planning
Learn More…
“We see the world as increasingly more complex and chaotic
because we use inadequate concepts to explain it. When we
understand something, we no longer see it as chaotic or complex.”
Jamshid Gharajedaghi - 2011
Systems Thinking: Managing Chaos and Complexity: A Platform for Designing Business Architecture
Q&A
Adrian Cockcroft @adrianco
http://slideshare.com/adriancockcroft
Technology Fellow - Battery Ventures
See www.battery.com for a list of portfolio investments
Security
Visit http://www.battery.com/our-companies/ for a full list of all portfolio companies in which all Battery Funds have invested.
Palo Alto Networks
Enterprise IT Operations & Management
Big Data
Compute
Networking
Storage

Microservices: State of the Union

  • 1.
    Microservices: State ofthe Union Adrian Cockcroft @adrianco Technology Fellow - Battery Ventures June 2016
  • 2.
    InfoQ.com: News &Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ microservices-review
  • 3.
    Presented at QConNew York www.qconnewyork.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 4.
    What does @adriancodo? @adrianco Technology Due Diligence on Deals Presentations at Conferences Presentations at Companies Technical Advice for Portfolio Companies Program Committee for Conferences Networking with Interesting PeopleTinkering with Technologies Maintain Relationship with Cloud Vendors Previously: Netflix, eBay, Sun Microsystems, CCL, TCU London BSc Applied Physics
  • 5.
  • 6.
    What Happened? Rate ofchange increased Cost and size and risk of change reduced
  • 7.
  • 8.
  • 9.
    A Microservice Definition Looselycoupled service oriented architecture with bounded contexts
  • 10.
    A Microservice Definition Looselycoupled service oriented architecture with bounded contexts If every service has to be updated at the same time it’s not loosely coupled
  • 11.
    A Microservice Definition Looselycoupled service oriented architecture with bounded contexts If every service has to be updated at the same time it’s not loosely coupled If you have to know too much about surrounding services you don’t have a bounded context. See the Domain Driven Design book by Eric Evans.
  • 12.
    Speeding Up ThePlatform Datacenter Snowflakes • Deploy in months • Live for years
  • 13.
    Speeding Up ThePlatform Datacenter Snowflakes • Deploy in months • Live for years Virtualized and Cloud • Deploy in minutes • Live for weeks
  • 14.
    Speeding Up ThePlatform Datacenter Snowflakes • Deploy in months • Live for years Virtualized and Cloud • Deploy in minutes • Live for weeks Container Deployments • Deploy in seconds • Live for minutes/hours
  • 15.
    Speeding Up ThePlatform Datacenter Snowflakes • Deploy in months • Live for years Virtualized and Cloud • Deploy in minutes • Live for weeks Container Deployments • Deploy in seconds • Live for minutes/hours Lambda Deployments • Deploy in milliseconds • Live for seconds
  • 16.
    Speeding Up ThePlatform AWS Lambda is leading exploration of serverless architectures in 2016 Datacenter Snowflakes • Deploy in months • Live for years Virtualized and Cloud • Deploy in minutes • Live for weeks Container Deployments • Deploy in seconds • Live for minutes/hours Lambda Deployments • Deploy in milliseconds • Live for seconds
  • 17.
    http://www.infoq.com/presentations/Twitter-Timeline-Scalability http://www.infoq.com/presentations/twitter-soa http://www.infoq.com/presentations/Zipkin http://www.infoq.com/presentations/scale-gilt Go-Kit https://www.youtube.com/watch?v=aL6sd4d4hxk http://www.infoq.com/presentations/circuit-breaking-distributed-systems https://speakerdeck.com/mattheath/scaling-micro-services-in-go-highload-plus-plus-2014 State ofthe Art in Web Scale Microservice Architectures AWS Re:Invent : Asgard to Zuul https://www.youtube.com/watch?v=p7ysHhs5hl0 Resiliency at Massive Scale https://www.youtube.com/watch?v=ZfYJHtVL1_w Microservice Architecture https://www.youtube.com/watch?v=CriDUYtfrjs New projects for 2015 and Docker Packaging https://www.youtube.com/watch?v=hi7BDAtjfKY Spinnaker deployment pipeline https://www.youtube.com/watch?v=dwdVwE52KkU http://www.infoq.com/presentations/spring-cloud-2015
  • 18.
    Microservice Architectures ConfigurationTooling DiscoveryRouting Observability Development: Languages and Container Operational: Orchestration and Deployment Infrastructure Datastores Policy: Architectural and Security Compliance
  • 19.
    Next Generation Applications Fillin the gaps, rapidly evolving ecosystem choices Archaius LaunchDarkly Habitat Configuration Lambda Docker Spinnaker Tooling Etcd Eureka Consul Discovery Compose Linkerd Weave Routing Zipkin Prometheus Hystrix Observability Development: components interfaces languages e.g. Docker Hub, Artifactory, Datawire Quark, Go, Rust Operational: Mesos, Kubernetes, Swarm, Nomad for private clouds. ECS, Mesos, GKS for public Datastores: Orchestrated, Distributed Ephemeral e.g. Cassandra, or DBaaS e.g. DynamoDB Policy: Security compliance e.g. Docker Content Trust. Architecture compliance e.g. Cloud Foundry
  • 20.
    @adrianco In Search ofSegmentation Ops Dev Datacenters AD/LDAP Roles VLAN Networks Hypervisor IPtables Docker Links AWS Accounts IAM Roles VPC Security Groups Calico Policy Docker Net/Weave
  • 21.
    @adrianco Hierarchical Segmentation B CA BC E FD E F Homepage Team Security Group Reports Team Security Group VPC Z - Manage a small number of network spaces D An AWS oriented example… AWS Account - Manage across multiple accounts containers and links
  • 22.
    @adrianco What’s Often Missing? Failureinjection testing Versioning, routing Binary protocols and interfaces Timeouts and retries Denormalized data models Monitoring, tracing Simplicity through symmetry
  • 23.
    @adrianco Failure Injection Testing NetflixChaos Monkey, Simian Army, FIT and Gremlin http://techblog.netflix.com/2011/07/netflix-simian-army.html http://techblog.netflix.com/2014/10/fit-failure-injection-testing.html http://techblog.netflix.com/2016/01/automated-failure-testing.html https://www.infoq.com/presentations/failure-test-research-netflix
  • 24.
    ! Chaos Monkey- enforcing stateless business logic ! Chaos Gorilla - enforcing zone isolation/replication ! Chaos Kong - enforcing region isolation/replication ! Security Monkey - watching for insecure configuration settings ! FIT & Gremlin - inject errors to enforce robust dependencies ! See over 100 NetflixOSS projects at netflix.github.com ! Get “Technical Indigestion” reading techblog.netflix.com Trust with Verification
  • 25.
    @adrianco Benefits of versionaware routing Immediately and safely introduce a new version Canary test in production Use DIY feature flags or . Route clients to a version so they can’t get disrupted Change client or dependencies but not both at once Eventually remove old versions Incremental or infrequent “break the build” garbage collection
  • 26.
    @adrianco Versioning, Routing Version numbering:Interface.Feature.Bugfix V1.2.3 to V1.2.4 - Canary test then remove old version V1.2.x to V1.3.x - Canary test then remove or keep both Route V1.3.x clients to new version to get new feature Remove V1.2.x only after V1.3.x is found to work for V1.2.x clients V1.x.x to V2.x.x - Route clients to specific versions Remove old server version when all old clients are gone
  • 27.
    @adrianco Protocols Measure serialization, transmission,deserialization costs Sending a megabyte of XML between microservices will make you sad, but not as sad as 10yrs ago with SOAP Use Thrift, Protobuf/gRPC, Avro, SBE internally Use JSON for external/public interfaces https://github.com/real-logic/simple-binary-encoding
  • 28.
    @adrianco Interfaces When you builda service, build a “driver” client for it Reference implementation error handling and serialization Release automation stress test using client Validate that service interface is usable! Minimize additional dependencies Swagger - OpenAPI Specification Datawire Quark adds behaviors to API spec
  • 29.
    @adrianco Interface Version Pinning Changeone thing at a time! Pin the version of everything else Incremental build/test/deploy pipeline Deploy existing app code with new platform Deploy existing app code with new dependencies Deploy new app code with pinned platform/dependencies
  • 30.
  • 31.
  • 32.
  • 33.
    @adrianco Interfaces between teams Service Code Client Code Minimal ObjectModel Full Object Model Cache Code Common Object Model Decoupled object models
  • 34.
    @adrianco Interfaces between teams Service Code Client Code Minimal ObjectModel Service Driver Service Handler Full Object Model Cache Code Common Object Model Decoupled object models
  • 35.
    @adrianco Interfaces between teams Service Code Client Code Minimal ObjectModel Cache Driver Service Driver Service Handler Full Object Model Cache Code Cache Handler Common Object Model Decoupled object models
  • 36.
    @adrianco Interfaces between teams Service Code Client Code Minimal ObjectModel Cache Driver Service Driver Platform Platform Service Handler Full Object Model Cache Code Platform Cache Handler Common Object Model Decoupled object models
  • 37.
    @adrianco Interfaces between teams Service Code Client Code Minimal ObjectModel Cache Driver Service Driver Platform Platform Service Handler Full Object Model Cache Code Platform Cache Handler Common Object Model Versioned dependency interfacesDecoupled object models
  • 38.
    @adrianco Interfaces between teams Service Code Client Code Minimal ObjectModel Cache Driver Service Driver Platform Platform Service Handler Full Object Model Cache Code Platform Cache Handler Common Object Model Versioned dependency interfaces Versioned platform interface Decoupled object models
  • 39.
    @adrianco Interfaces between teams Service Code Client Code Minimal ObjectModel Cache Driver Service Driver Platform Platform Service Handler Full Object Model Cache Code Platform Cache Handler Common Object Model Versioned dependency interfaces Versioned platform interface Decoupled object models Versioned routing
  • 40.
    @adrianco Timeouts and Retries Connectiontimeout vs. request timeout confusion Usually setup incorrectly, global defaults Systems collapse with “retry storms” Timeouts too long, too many retries Services doing work that can never be used
  • 41.
    @adrianco Connections and Requests TCPmakes a connection, HTTP makes a request HTTP hopefully reuses connections for several requests Both have different timeout and retry needs! TCP timeout is purely a property of one network latency hop HTTP timeout depends on the service and its dependencies connection path request path
  • 42.
    @adrianco Timeouts and Retries Edge Service Good Service Good Service Badconfig: Every service defaults to 2 second timeout, two retries Edge Service not responding Overloaded service not responding Failed Service If anything breaks, everything upstream stops responding Retries add unproductive work
  • 43.
    @adrianco Timeouts and Retries Edge Service Good Service Good Service Badconfig: Every service defaults to 2 second timeout, two retries Edge Service not responding Overloaded service not responding Failed Service If anything breaks, everything upstream stops responding Retries add unproductive work
  • 44.
    @adrianco Timeouts and Retries Edge Service Good Service Good Service Badconfig: Every service defaults to 2 second timeout, two retries Edge Service not responding Overloaded service not responding Failed Service If anything breaks, everything upstream stops responding Retries add unproductive work
  • 45.
    @adrianco Timeouts and Retries Badconfig: Every service defaults to 2 second timeout, two retries Edge service responds slowly Overloaded service Partially failed service
  • 46.
    @adrianco Timeouts and Retries Badconfig: Every service defaults to 2 second timeout, two retries Edge service responds slowly Overloaded service Partially failed service First request from Edge timed out so it ignores the successful response and keeps retrying. Middle service load increases as it’s doing work that isn’t being consumed
  • 47.
    @adrianco Timeouts and Retries Badconfig: Every service defaults to 2 second timeout, two retries Edge service responds slowly Overloaded service Partially failed service First request from Edge timed out so it ignores the successful response and keeps retrying. Middle service load increases as it’s doing work that isn’t being consumed
  • 48.
    @adrianco Timeout and RetryFixes Cascading timeout budget Static settings that decrease from the edge or dynamic budget passed with request How often do retries actually succeed? Don’t ask the same instance the same thing Only retry on a different connection
  • 49.
  • 50.
    @adrianco Timeouts and Retries Edge Service Good Service Budgetedtimeout, one retry Failed Service 3s 1s 1s Fast fail response after 2s Upstream timeout must always be longer than total downstream timeout * retries delay No unproductive work while fast failing
  • 51.
    @adrianco Timeouts and Retries Edge Service Good Service Budgetedtimeout, failover retry Failed Service For replicated services with multiple instances never retry against a failed instance No extra retries or unproductive work Good Service
  • 52.
    @adrianco Timeouts and Retries Edge Service Good Service Budgetedtimeout, failover retry Failed Service 3s 1s For replicated services with multiple instances never retry against a failed instance No extra retries or unproductive work Good Service Successful response delayed 1s
  • 53.
    @adrianco Manage Inconsistency ACM Paper:"The Network is Reliable" Distributed systems are inconsistent by nature Clients are inconsistent with servers Most caches are inconsistent Versions are inconsistent Get over it and Deal with it See http://queue.acm.org/detail.cfm?id=2655736
  • 54.
    @adrianco Denormalized Data Models Any non-trivial organization has many databases Cross references exist, inconsistencies exist Microservices work best with individual simple stores Scale, operate, mutate, fail them independently NoSQL allows flexible schema/object versions
  • 55.
    @adrianco Denormalized Data Models Build custom cross-datasource check/repair processes Ensure all cross references are up to date Read these Pat Helland papers Immutability Changes Everything http://highscalability.com/blog/2015/1/26/paper-immutability-changes-everything-by-pat-helland.html Memories, Guesses and Apologies https://blogs.msdn.microsoft.com/pathelland/2007/05/15/memories-guesses-and-apologies/ Standing on the Distributed Shoulders of Giants http://queue.acm.org/detail.cfm?id=2953944
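As a rough illustration of a cross-datasource check/repair process, the sketch below scans one store for references that no longer resolve in another and flags them for repair. The two in-memory dictionaries stand in for separate service-owned stores; the `orders` and `customers` names, the `customer_id` field, and the `needs_review` flag are all hypothetical.

```python
def find_dangling(orders, customers):
    """Return ids of orders whose customer reference is missing from the customer store."""
    return sorted(
        order_id
        for order_id, order in orders.items()
        if order["customer_id"] not in customers
    )


def flag_for_repair(orders, dangling):
    """Mark inconsistent records instead of deleting them, so they can be fixed later."""
    for order_id in dangling:
        orders[order_id]["needs_review"] = True


# Two independent stores that have drifted apart.
customers = {"c1": {"name": "Ada"}}
orders = {
    "o1": {"customer_id": "c1"},
    "o2": {"customer_id": "c9"},   # reference to a customer that no longer exists
}
dangling = find_dangling(orders, customers)
flag_for_repair(orders, dangling)
print(dangling)   # ['o2']
```

A real process would run periodically against the actual datastores and decide per reference whether to repair, re-fetch, or escalate.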
  • 56.
  • 57.
    Low Latency SaaS Based Monitors https://www.datadoghq.com/ http://www.instana.com/ www.bigpanda.io www.vividcortex.com signalfx.com wavefront.com sysdig.com See www.battery.com for a list of portfolio investments
  • 58.
    A Tragic Quadrant Ability to scale Ability to handle rapidly changing microservices In-house tools at web scale companies Most current monitoring & APM tools Next generation APM Next generation Monitoring Datacenter Cloud Containers 100s 1,000s 10,000s 100,000s Lambda
  • 59.
    A Tragic Quadrant Ability to scale Ability to handle rapidly changing microservices In-house tools at web scale companies Most current monitoring & APM tools Next generation APM Next generation Monitoring Datacenter Cloud Containers 100s 1,000s 10,000s 100,000s Lambda YMMV: Opinionated approximate positioning only
  • 60.
    Interesting architectures have a lot of microservices! Flow visualization is a big challenge. See http://www.slideshare.net/LappleApple/gilt-from-monolith-ruby-app-to-micro-service-scala-service-architecture
  • 61.
    Simulated Microservices Model and visualize microservices Simulate interesting architectures Generate large scale configurations Eventually stress test real tools Code: github.com/adrianco/spigo Simulate Protocol Interactions in Go Visualize with D3 See for yourself: http://simianviz.surge.sh Follow @simianviz for updates ELB Load Balancer Zuul API Proxy Karyon Business Logic Staash Data Access Layer Priam Cassandra Datastore Three Availability Zones Denominator DNS Endpoint
  • 62.
    @adrianco Simplicity through symmetry Symmetry Invariants Stable assertions No special cases Single purpose components
  • 63.
  • 64.
    Serverless Architectures AWS Lambda getting some early wins Google Cloud Functions, Azure Functions alpha launched IBM OpenWhisk - open sourced Startup activity: iron.io, serverless.com, apex.run toolkit
  • 65.
  • 66.
  • 67.
  • 68.
    AWS Lambda Reference Arch http://www.allthingsdistributed.com/2016/05/aws-lambda-serverless-reference-architectures.html
  • 69.
    Serverless Programming Model Event driven functions Role based permissions Whitelisted API based security Good for simple single threaded code
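An event driven function in this model is typically a single entry point that receives an event and returns a result. The sketch below uses the `handler(event, context)` shape of an AWS Lambda Python function and assumes an S3-notification-style event with a `Records` list; the return value and the sample event are illustrative, not from the talk.

```python
def handler(event, context):
    """Simple single threaded function: list the object keys named in the event."""
    records = event.get("Records", [])
    keys = [record["s3"]["object"]["key"] for record in records]
    return {"processed": len(keys), "keys": keys}


# Local invocation with a hand-built event; no AWS account or SDK is involved.
sample_event = {"Records": [{"s3": {"object": {"key": "uploads/report.csv"}}}]}
result = handler(sample_event, None)
print(result)
```

Permissions and which APIs the function may call are configured outside the code, which is where the role based and whitelisted security on the slide would apply.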
  • 70.
    Serverless Cost Efficiencies 100% useful work, no agents, overheads 100% utilization, no charge between requests No need to size capacity for peak traffic Anecdotal costs ~1% of conventional system Ideal for low traffic, Corp IT, spiky workloads
  • 71.
    Serverless Work in Progress Tooling for ease of use Multi-region HA/DR patterns Debugging and testing frameworks Monitoring, end to end tracing
  • 72.
    DIY Serverless Operating Challenges Startup latency Execution overhead Charging model Capacity planning
  • 73.
  • 74.
    @adrianco “We see the world as increasingly more complex and chaotic because we use inadequate concepts to explain it. When we understand something, we no longer see it as chaotic or complex.” Jamshid Gharajedaghi - 2011 Systems Thinking: Managing Chaos and Complexity: A Platform for Designing Business Architecture
  • 75.
    Q&A Adrian Cockcroft @adrianco http://slideshare.com/adriancockcroft Technology Fellow - Battery Ventures See www.battery.com for a list of portfolio investments
  • 76.
    Security Visit http://www.battery.com/our-companies/ for a full list of all portfolio companies in which all Battery Funds have invested. Palo Alto Networks Enterprise IT Operations & Management Big Data Compute Networking Storage