
5 must have patterns for your microservice

"Netflix is actually a log generating application that just happens to stream movies"

Building a service or Microservice is easy in itself. Scaling it in the cloud is not that hard either, but operating, maintaining and iterating on a large-scale production service is not just about linearisation. As Cockcroft points out, telemetry and monitoring are the most important aspects of building Microservices.
We discuss 5 patterns that any serious Microservice should have:
- Canary (an endpoint reporting health of underlying dependencies)
- IO monitor (measuring all calls from Microservice to external dependencies)
- A circuit breaker
- An ActivityId-Propagator
- An exception and short timeout retry policy

Published in: Software

  1. 1. >>> 5 must-have Patterns for your web-scale Microservices @aliostad Ali Kheyrollahi, ASOS
  2. 2. @aliostad > stackoverflow > £1.5 bln global fashion destination > 35% every year
  3. 3. @aliostad /// ASOS in numbers (2016) > Turnover → £1.5 bln > Active Customers → 12 M > New Products/wk → 4 k > Unique Visits/mo → 123 M > Page Views/day → 95 M > Platform Teams → 40
  4. 4. @aliostad /// Microservices Architecture
  5. 5. @aliostad /// why microservices > Scaling people not the solution > Decentralising decision centres => Agility > Frequent deployment => Agility > Reduced complexity of each ms (Divide/Conquer) => Agility > Overall solution complex but ...
  6. 6. @aliostad /// anecdote Often you can measure your success in implementing Microservice Architecture not by the number of services you build, but by the number you decommission.
  7. 7. @aliostad /// microservices vs soa
        Aspect                            SOA                        Microservices
        Main Goal                         Architectural Decoupling   Agility
        Audience                          Mainly Architecture        Business (Everyone)
        Set out to solve                  Architectural Coupling     Scaling People, Frequent Deployment
        Organisational Structure Impact   Minimal                    Huge
        Service Cardinality               Usually up to a dozen      >40 (Commonly >100)
        When to do                        Always                     teams > ~5**
        ** Debatable. There are articles and discussions on this very topic
  8. 8. @aliostad /// microservice challenges > Very difficult to build a complete mental picture of solution > When things go wrong, need to know where before why > Potentially increased latency > Performance outliers intractable to solve > A complete mind-shift requiring a new operating model
  9. 9. @aliostad /// probability distribution [Figure: probability distribution of response time; axes: Response Time, Probability]
  10. 10. @aliostad /// performance outliers Microservice A: 99th Percentile = 500ms; Microservice B: 99th Percentile = 500ms
        Total time   A     B     Total
        <1s          99%   99%   98.01%
        >500ms       1%    99%   0.99%
        >500ms       99%   1%    0.99%
        >1s          1%    1%    0.01%
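The figures on this slide can be reproduced with a few lines of arithmetic. The sketch below assumes that A and B respond independently and that each exceeds 500 ms exactly 1% of the time; the function name and structure are illustrative only.

```python
from itertools import product

# Assumption: each service is "slow" (over 500 ms) 1% of the time, independently.
P_SLOW = 0.01


def outcome_probabilities(p_slow_a, p_slow_b):
    """Probability of each (A slow?, B slow?) combination for a call chain A -> B."""
    result = {}
    for a_slow, b_slow in product((False, True), repeat=2):
        p_a = p_slow_a if a_slow else 1 - p_slow_a
        p_b = p_slow_b if b_slow else 1 - p_slow_b
        result[(a_slow, b_slow)] = p_a * p_b
    return result


probs = outcome_probabilities(P_SLOW, P_SLOW)
# Chance that a request meets at least one slow hop.
p_at_least_one_slow = 1 - probs[(False, False)]
```

Under these assumptions, about 2% of requests hit at least one slow hop even though each service individually meets its 99th percentile target, and the share grows with every extra hop in the chain.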
  11. 11. @aliostad /// ActivityId Propagator
  12. 12. @aliostad /// ActivityId > Every customer request matters > Every request is unique > Every request creates a chain (or tree) of calls/events > Activities are correlated > You need an ActivityId (or CorrelationId) to link calls/events
  13. 13. @aliostad /// ActivityId [Diagram: the incoming Id enters the Microservice, is kept in Thread Local Storage, and is passed on to other APIs and attached to events]
  14. 14. @aliostad /// ActivityId - HTTP
        Request:
            GET /api/v2/foo HTTP/1.1
            host: foo.com
            activity-id: 96c5a1f106ce468ebcca8303ed7464bd
        Response:
            200 OK
            activity-id: 96c5a1f106ce468ebcca8303ed7464bd
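A minimal sketch of an ActivityId propagator, to make the flow concrete. It uses Python's `contextvars` in place of thread-local storage; the `activity-id` header name comes from the slide, while `begin_request` and `outgoing_headers` are hypothetical helper names, not part of any library.

```python
import uuid
from contextvars import ContextVar

ACTIVITY_HEADER = "activity-id"  # header name as shown on the slide

# Holds the id for the request currently being handled (hypothetical storage).
_activity_id = ContextVar("activity_id", default="")


def begin_request(headers):
    """Adopt the caller's activity id, or mint one if this is the first hop."""
    lowered = {key.lower(): value for key, value in headers.items()}
    activity_id = lowered.get(ACTIVITY_HEADER) or uuid.uuid4().hex
    _activity_id.set(activity_id)
    return activity_id


def outgoing_headers(extra=None):
    """Headers for a downstream call (or an event), carrying the current id."""
    headers = dict(extra or {})
    headers[ACTIVITY_HEADER] = _activity_id.get()
    return headers
```

The same id would also be written to every log line and echoed in the response, so that all calls and events of one customer request can be joined later.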
  15. 15. @aliostad /// Retry and Timeout Policy
  16. 16. @aliostad /// Failure [Diagram: Microservice A (1% chance of failure) calls Microservice B (1% chance of failure); failed call (X) => Wait (back-off) => failed call (X) => Wait (back-off longer)]
  17. 17. @aliostad /// Preemptive Timeout [Diagram: Microservice A calls Microservice B; Short timeout (X) => retry => Short timeout (X) => retry]
  18. 18. @aliostad /// Timeout [Diagram: calls between A, B and C] A > B > C; A > B + C
  19. 19. @aliostad /// Choosing a timeout? > Static => Based on Server SLO > Dynamic => 95th percentile
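The retry and timeout slides can be sketched as follows. This is an illustration under stated assumptions, not the speaker's implementation: `operation` is any callable that accepts a timeout in seconds and enforces it itself (for example by passing it to an HTTP client), the back-off doubles on each attempt, and the dynamic timeout uses a nearest-rank 95th percentile over recently observed latencies. All names and default values are hypothetical.

```python
import time


def percentile_timeout(latencies, percentile=95):
    """Nearest-rank percentile of observed latencies, used as a dynamic timeout."""
    ordered = sorted(latencies)
    rank = max(1, (percentile * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def call_with_retry(operation, timeout_s=0.2, max_attempts=3, base_backoff_s=0.05,
                    retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call operation(timeout_s); on a retryable error wait, then try again."""
    for attempt in range(max_attempts):
        try:
            return operation(timeout_s)
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_backoff_s * (2 ** attempt))  # back off longer each time
```

With a 1% failure rate per call and independent failures, two attempts in a row both fail only 0.01% of the time, which is why a short timeout plus a retry is often cheaper than waiting out a slow call.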
  20. 20. @aliostad /// IO Monitor
  21. 21. @aliostad /// Blame Game “If there is a single place where you can play blame game, instead of collective responsibility, it is in Microservices troubleshooting”
  22. 22. @aliostad /// Did you say IO?? [Diagram: Microservice calling DB, API and Cache] Measure... every time your code goes out of your process
  23. 23. @aliostad /// Recording Methods > Explicitly by calling record() > Asking the library to record a closure > Aspect-oriented
        Java (spf4j):
            private static final MeasurementRecorder recorder =
                RecorderFactory.createScalableCountingRecorder(forWhat, unitOfMeasurement, sampleTimeMillis);
            …
            recorder.record(measurement);
        .NET (PerfIt):
            var ins = new SimpleInstrumentor(new InstrumentationInfo()
            {
                Counters = CounterTypes.StandardCounters,
                Description = "test",
                InstanceName = "Test instance",
                CategoryName = TestCategory
            });
            ins.Instrument(() => Thread.Sleep(100), "test...");
        Java and .NET:
            @PerformanceMonitor(warnThresholdMillis=1, errorThresholdMillis=100, recorderSource = RecorderSourceInstance.Rs5m.class)
            [PerfItFilter("PerfItTests", InstanceName = "Test")]
            public string Get() { return Guid.NewGuid().ToString(); }
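The three recording styles listed on this slide (explicit `record()`, recording a closure, and a declarative wrapper) can be illustrated side by side. The `Recorder` class below is a hypothetical in-memory stand-in written for this sketch; it is not the spf4j or PerfIt API, and a real recorder would publish its samples rather than keep them in a list.

```python
import time
from functools import wraps


class Recorder:
    """Hypothetical recorder that keeps (name, elapsed_seconds) samples in memory."""

    def __init__(self, clock=time.perf_counter):
        self.samples = []
        self._clock = clock

    def record(self, name, elapsed_s):
        # Style 1: the caller measures and records explicitly.
        self.samples.append((name, elapsed_s))

    def instrument(self, func, name):
        # Style 2: the recorder times a closure on the caller's behalf.
        start = self._clock()
        try:
            return func()
        finally:
            self.record(name, self._clock() - start)

    def monitored(self, name):
        # Style 3: declarative, comparable to an attribute or annotation.
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                return self.instrument(lambda: func(*args, **kwargs), name)
            return wrapper
        return decorator
```

The declarative form keeps measurement out of the business code, which makes it easier to apply to every call that leaves the process.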
  24. 24. @aliostad /// Publishing Methods > Local file (various to logstash) > TCP and HTTP (many, to zipkin, influxdb) > UDP (statsd, collectd to graphite, logstash) > Raising Kernel-level event (Windows ETW) > Local communication (statsd)
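As one example of the UDP option, the sketch below sends a timing sample in the statsd line format (`name:value|ms`) to a local agent. The metric name is made up, and the host and port defaults (8125 is the conventional statsd port) are assumptions about the deployment.

```python
import socket


def format_timer(name, elapsed_ms):
    """Build a statsd timer line: <name>:<value>|ms."""
    return f"{name}:{elapsed_ms:g}|ms".encode("ascii")


def publish_timer(name, elapsed_ms, host="127.0.0.1", port=8125):
    """Fire-and-forget: UDP gives no delivery guarantee, so the caller never waits."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_timer(name, elapsed_ms), (host, port))
```

Because the send does not wait for an acknowledgement, publishing adds very little latency to the call being measured, at the cost of occasionally losing a sample.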
  25. 25. @aliostad /// Circuit- Breaker
  26. 26. @aliostad /// tri-state > Closed: traffic can flow normally > Open: traffic does not flow > Half-open: circuit breaker tests the waters again [Diagram: states Closed, Open, Half-open; transitions labelled Test, Failure, Wait timeout]
  27. 27. @aliostad /// Netflix Hystrix > RequestVolumeThreshold > ErrorThresholdPercentage > SleepWindowInMilliseconds > TimeInMilliseconds > NumBuckets
  28. 28. @aliostad /// Fallback > Custom: e.g. serve content from a local cache (status 206) > Silent: return null/no-data/empty (status 200/204) > Fail-fast: Customer experience is important (status 5xx)
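A minimal tri-state circuit breaker with a fallback, as a sketch of the idea rather than of Hystrix itself. It deliberately simplifies: it trips on a count of consecutive failures instead of an error percentage over a rolling window, and the class name, parameters and defaults are all hypothetical.

```python
import time


class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half-open"

    def __init__(self, failure_threshold=3, sleep_window_s=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.sleep_window_s = sleep_window_s
        self._clock = clock
        self.state = self.CLOSED
        self._failures = 0
        self._opened_at = 0.0

    def call(self, operation, fallback):
        if self.state == self.OPEN:
            if self._clock() - self._opened_at < self.sleep_window_s:
                return fallback()        # open: do not touch the dependency
            self.state = self.HALF_OPEN  # wait elapsed: let one test call through
        try:
            result = operation()
        except Exception:
            self._failures += 1
            tripped = self._failures >= self.failure_threshold
            if self.state == self.HALF_OPEN or tripped:
                self.state = self.OPEN
                self._opened_at = self._clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.state = self.CLOSED
        self._failures = 0
        return result
```

The `fallback` callable is where the choice from the slide is made: serve cached content, return empty data, or raise to fail fast.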
  29. 29. @aliostad /// Canary and Health Endpoint
  30. 30. @aliostad /// Health Endpoints > Ping: returns a success code when invoked > Canary: returns a connectivity status and latency on the service and dependencies “… none of them invoke any application code”
  31. 31. @aliostad /// Ping
        Request:
            GET /api/health HTTP/1.1
            host: foo.com
        Response: 200 OK
        Response: 500 Server Error
  32. 32. @aliostad /// Canary
        Request:
            GET /api/canary HTTP/1.1
            host: foo.com
        Response:
            200 OK
            { [Nested Structure] }
  33. 33. @aliostad /// ChirpResult { "serviceName": "foo", "latency": "00:00:00.0542172", "statusCode": 200, "isCritical": true }
  34. 34. @aliostad /// ChirpResult
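  35. 35. @aliostad /// ChirpResult - critical failure [Diagram: API with dependencies NC, NC, C (NC = non-critical, C = critical); dependency statuses 200, 200, 500 => API canary returns 500]
  36. 36. @aliostad /// ChirpResult - non-critical failure [Diagram: API with dependencies NC, NC, C; dependency statuses 500, 200, 200 => API canary returns 200]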
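The two failure slides suggest a simple aggregation rule: the canary fails only when a critical dependency fails. The sketch below encodes that reading; the rule itself is inferred from the diagrams, the field names mirror the ChirpResult JSON from the earlier slide, and the dependency names and the choice to treat any status of 400 or above as a failure are assumptions.

```python
from dataclasses import dataclass


@dataclass
class ChirpResult:
    service_name: str
    latency_s: float
    status_code: int
    is_critical: bool


def overall_status(results):
    """Return 500 if any critical dependency failed, otherwise 200."""
    critical_failure = any(
        r.is_critical and r.status_code >= 400 for r in results
    )
    return 500 if critical_failure else 200
```

A failing non-critical dependency still shows up in the nested result, so it can raise an alert without taking the whole service out of rotation.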
  37. 37. @aliostad /// AOP / Declarative (C#)
        [AzureStorageCanary("Foo-AzureStorage-BarDatabaseServer", "config-key-for-cn")]
        [SqlCanary("SQL-BazActiveDatabase", null, typeof(SqlConnectionFactory))]
        [CanaryEndpointCanary("Dependency-Api", "config-key-for-endpoint")]
        public class CanaryController : CanaryBaseController
        {
            … // some boilerplate code
        }
  38. 38. @aliostad /// Deep vs Shallow [Diagram: chained APIs, “Deep” vs “Shallow” checks] /api/canary?deep=false
  39. 39. @aliostad /// Wrap-up > If you have more than ~5 teams, consider Microservices > Logging/Monitoring/Alerting: single most important asset > Use ActivityId Propagator to correlate (consider zipkin) > Cloud is a jungle™. Without retry/timeout you won’t survive > Monitor and measure all calls to external services (blame game) > Protect your systems with circuit-breakers (and isolation) > Canary helps you detect connectivity from customer view
  40. 40. @aliostad > Thomas Wood: Daisy Picture > Thomas Au: Thermometer Picture > Torbakhopper: Cables Picture > Dam Picture - Japan > Hsiung: Lights Picture > Health Endpoint in API Design
