Microservices Application Tracing Standards and Simulators - Adrians at OSCON

Joint presentation with Adrian Cole - OSCON - Austin May 18th 2016

Published in: Software

  1. Microservices Application Tracing Standards and Simulators. From Zipkin to Greater Tracing: involving a wider group of people in distributed tracing. @adrianfcole @adrianco #oscon
  2. Introduction (agenda: introduction, opening Zipkin, beyond Zipkin, simulation). @adrianco @adrianfcole
  3. @adrianfcole: Spring Cloud at Pivotal; focused on distributed tracing; helped open Zipkin.
  4. Opening Zipkin (agenda: introduction, opening Zipkin, beyond Zipkin, simulation). @adrianfcole
  5. Distributed tracing commoditizes knowledge. Distributed tracing systems collect end-to-end latency graphs (traces) in near real time. You can compare traces to understand why certain requests take longer than others.
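As a rough illustration of "comparing traces", the sketch below matches spans from a fast and a slow request and reports which operation slowed down the most. The `Span` and `Trace` types and the `SlowestDelta` function are hypothetical names invented for this example; they are not Zipkin's data model or API.

```go
package main

import (
	"fmt"
	"time"
)

// Span is a hypothetical, simplified record of one timed operation.
type Span struct {
	Service  string
	Name     string
	Duration time.Duration
}

// Trace is the set of spans recorded for a single request.
type Trace []Span

// SlowestDelta matches spans of a and b on service and name, and returns
// the span from b whose duration grew the most relative to a.
func SlowestDelta(a, b Trace) (Span, time.Duration) {
	base := make(map[string]time.Duration)
	for _, s := range a {
		base[s.Service+"/"+s.Name] = s.Duration
	}
	var worst Span
	var worstDelta time.Duration
	for _, s := range b {
		if d, ok := base[s.Service+"/"+s.Name]; ok {
			if delta := s.Duration - d; delta > worstDelta {
				worst, worstDelta = s, delta
			}
		}
	}
	return worst, worstDelta
}

func main() {
	fast := Trace{{"web", "GET /", 40 * time.Millisecond}, {"db", "query", 10 * time.Millisecond}}
	slow := Trace{{"web", "GET /", 45 * time.Millisecond}, {"db", "query", 90 * time.Millisecond}}
	span, delta := SlowestDelta(fast, slow)
	fmt.Printf("%s/%s slowed by %v\n", span.Service, span.Name, delta) // db/query slowed by 80ms
}
```

A real tracing system also records timestamps and parent/child links, so the comparison can account for where in the call graph the extra time was spent.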
  6. Zipkin is like Chrome DevTools' network panel, except you see your whole architecture. http://zipkin.io/
  7. It started with community focus.
  8. BigBrotherBird is silently born:
         commit 92c941890c2009a401b777093342dc4f28955640
         Author: Johan Oskarsson <johan@oskarsson.nu>
         Date: Tue Nov 15 10:09:47 2011 -0800
         [split] Enable B3 tracing for TFE. Filter out finagle-http headers from incoming requests
  9. Zipkin is less silently born:
         commit 2b7acead044e71c744f39804abe564383eb5f846
         Author: Johan Oskarsson <johan@oskarsson.nu>
         Date: Wed Jun 6 11:28:34 2012 -0700
         Initial commit
  10. Zipkin says “we are a community”.
  11. (open)zipkin left the nest.
  12. So what happened? Zipkin development at Twitter was in short bursts, centered on other work. Many experienced Zipkin engineers don’t work at Twitter (or in the Bay Area). Platform diversity is a reality for many. Having the same goals was our opportunity.
  13. How’s OpenZipkin doing now? Zipkin's now releasable (maybe too releasable). We’re working on understandability and usability. We’re making the community easier to find. We hit bumps, and sometimes reverse changes.
  14. Beyond Zipkin (agenda: introduction, opening Zipkin, beyond Zipkin, simulation). @adrianfcole
  15. The “greater” tracing. Many groups are solving similar problems. Some focus on stacks, others on instrumentation. By collaborating more, we can make tracing greater.
  16. Distributed Tracing Workgroup (distributed-tracing Google group). Instrumentation portability. Interop through shared trace pipelines. Practical matters, like categorization and tactical designs. Moving R&D to implementation. Simulation and system testing.
  17. OpenTracing is an effort to clean up and de-risk distributed tracing instrumentation. OpenTracing interfaces decouple instrumentation from vendor-specific dependencies and terminology. This allows applications to switch products with less effort. OpenTracing: Go, Python, Java, JavaScript, ... http://opentracing.io/
  18. How does it work? A single configuration change to bind a Tracer implementation in main() or similar:
         import opentracing "github.com/opentracing/opentracing-go"
         import "github.com/tracer_x/tracerimpl"

         func main() {
             // Bind tracerimpl to the opentracing system
             opentracing.InitGlobalTracer(
                 tracerimpl.New(kTracerImplAccessToken))
             ... normal main() stuff ...
         }
     Clean, vendor-neutral instrumentation code that naturally tells the story of a distributed operation:
         import opentracing "github.com/opentracing/opentracing-go"

         func AddContact(c *Contact) {
             sp := opentracing.StartSpan("AddContact")
             defer sp.Finish()
             sp.LogEventWithPayload("Added contact: ", *c)
             subRoutine(sp, ...)
             ...
         }

         func subRoutine(parentSpan opentracing.Span, ...) {
             ...
             sp := opentracing.StartChildSpan(parentSpan, "subRoutine")
             defer sp.Finish()
             sp.Info("deferred work to subroutine")
             ...
         }
     Thanks, @el_bhs for the slide!
  19. Pivot Tracing is applied research from Brown University (the one that brought us X-Trace). Pivot Tracing allows you to dynamically query systems at runtime, grouping on “Baggage”, which propagates across service boundaries. pivottracing.io
  20. How does it work? Start writing queries, including the fancy happened-before join:
         From incr In DataNodeMetrics.incrBytesRead
         Join cl In First(ClientProtocols) On cl -> incr
         GroupBy cl.procName
         Select cl.procName, SUM(incr.delta)
     Services need to be in Java and be able to talk to a provided PubSub broker.
         // Add a library
         <dependency>
           <groupId>edu.brown.cs.systems</groupId>
           <artifactId>pivottracing-agent</artifactId>
           <version>4.0</version>
         </dependency>

         // Initialize it on bootstrap
         PivotTracing.initialize();
     @brownsys_jmace made this!
  21. Simulation (agenda: introduction, opening Zipkin, beyond Zipkin, simulation). @adrianfcole
  22. What does @adrianco do? Technology due diligence on deals; presentations at conferences; presentations at companies; technical advice for portfolio companies; program committee for conferences; networking with interesting people; tinkering with technologies; maintain relationship with cloud vendors. Previously: Netflix, eBay, Sun Microsystems, CCL, TCU London BSc Applied Physics.
  23. Testing Flow Monitors. Monitoring tools often “explode on impact” with real world use cases at scale. Interestingly large complex environments are expensive to create or hard to get access to. Free, open source tools don’t have a budget… @adrianco
  24. OSS Microservice Simulator. Model and visualize microservices. Simulate interesting architectures. Generate large scale configurations. Stress test real tools like Zipkin. Code: github.com/adrianco/spigo (Simulate Protocol Interactions in Go). Visualize with D3, Neo4j or Guesstimate. See for yourself: http://simianviz.surge.sh Follow @simianviz for updates. [Diagram labels: ELB Load Balancer; Zuul API Proxy; Karyon Business Logic; Staash Data Access Layer; Priam Cassandra Datastore; Three Availability Zones; Denominator DNS Endpoint.]
  25. POST Spigo flows to Zipkin
         # collect flows, duration 2 seconds, architecture lamp
         $ ./spigo -c -d 2 -a lamp
         --snip--
         # clean out zipkin database and post newly created data
         $ misc/zipkin.sh lamp
  26. Spigo Nanoservice Structure. Skeleton code for sideways replicating a Put message:
         func Start(listener chan gotocol.Message) {
             ...
             for {
                 select {
                 case msg := <-listener:
                     flow.Instrument(msg, name, hist)
                     switch msg.Imposition {
                     case gotocol.Hello: // get named by parent
                         ...
                     case gotocol.NameDrop: // someone new to talk to
                         ...
                     case gotocol.Put: // upstream request handler
                         ...
                         outmsg := gotocol.Message{gotocol.Replicate, listener, time.Now(), msg.Ctx.NewParent(), msg.Intention}
                         flow.AnnotateSend(outmsg, name)
                         outmsg.GoSend(replicas)
                     }
                 case <-eurekaTicker.C: // poll the service registry
                     ...
                 }
             }
         }
     Slide callouts: instrument incoming requests (flow.Instrument); update trace context (msg.Ctx.NewParent()); instrument outgoing requests (flow.AnnotateSend).
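The skeleton above elides most of its body, so here is a small self-contained sketch of the same pattern: one goroutine per simulated service, a `for`/`select` loop over a listener channel and a ticker, and a Put that is replicated sideways. The `Message` and `Imposition` types below are simplified stand-ins invented for this example; the real `gotocol.Message` carries more fields (response channel, timestamp, trace context), and the `done` channel is only there to make the demo deterministic.

```go
package main

import (
	"fmt"
	"time"
)

// Imposition is a hypothetical stand-in for the message kinds on the slide.
type Imposition int

const (
	Put Imposition = iota
	Replicate
	Goodbye
)

// Message is a simplified, hypothetical message type.
type Message struct {
	Imposition Imposition
	Intention  string // payload, e.g. the key being written
}

// Start runs one simulated service until it receives Goodbye, then reports
// how many values it stored on the done channel.
func Start(listener <-chan Message, replicas []chan<- Message, done chan<- int) {
	stored := 0
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case msg := <-listener:
			switch msg.Imposition {
			case Put: // upstream request: store, then replicate sideways
				stored++
				for _, r := range replicas {
					r <- Message{Replicate, msg.Intention}
				}
			case Replicate: // sideways copy from a peer
				stored++
			case Goodbye:
				done <- stored
				return
			}
		case <-ticker.C:
			// placeholder for periodic work, such as polling a service registry
		}
	}
}

// Run wires a primary to one replica, sends two Puts, and returns what each stored.
func Run() (primary, replica int) {
	a, b := make(chan Message), make(chan Message)
	doneA, doneB := make(chan int), make(chan int)
	go Start(a, []chan<- Message{b}, doneA)
	go Start(b, nil, doneB)
	a <- Message{Put, "key1"}
	a <- Message{Put, "key2"}
	a <- Message{Goodbye, ""}
	primary = <-doneA
	b <- Message{Goodbye, ""}
	replica = <-doneB
	return primary, replica
}

func main() {
	p, r := Run()
	fmt.Println("primary stored:", p, "replica stored:", r) // primary stored: 2 replica stored: 2
}
```

Because the channels are unbuffered, the primary has finished handing each Replicate to its peer before it reads its next message, which is why the replica's count is already final when it receives Goodbye.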
  27. Flow Trace Records. Context: t = transaction, p = parent, s = span.
         us-east-1.zoneC.riak2   t98p895s896  Put
         us-east-1.zoneA.riak3   t98p896s908  Replicate
         us-east-1.zoneB.riak4   t98p896s909  Replicate
         us-west-2.zoneA.riak9   t98p896s910  Replicate
         us-west-2.zoneB.riak10  t98p910s912  Replicate
         us-west-2.zoneC.riak8   t98p910s913  Replicate
     [Diagram: staash (us-east-1 zoneC) sends a Put (s896) to riak2; riak2 replicates to riak3 (s908), riak4 (s909) and riak9 (s910); riak9 replicates to riak10 (s912) and riak8 (s913).]
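The context strings in these records pack three IDs into one token. As a sketch, assuming the `t<transaction>p<parent>s<span>` layout shown on the slide, the code below parses the six contexts and groups spans under their parent, which recovers the replication fan-out. `Context`, `ParseContext` and `ChildrenBySpan` are names made up for this example and are not taken from the Spigo source.

```go
package main

import "fmt"

// Context holds the three IDs encoded in a string such as "t98p895s896".
type Context struct {
	Trace, Parent, Span int
}

// ParseContext decodes a t<transaction>p<parent>s<span> string.
func ParseContext(s string) (Context, error) {
	var c Context
	if _, err := fmt.Sscanf(s, "t%dp%ds%d", &c.Trace, &c.Parent, &c.Span); err != nil {
		return Context{}, fmt.Errorf("bad context %q: %w", s, err)
	}
	return c, nil
}

// ChildrenBySpan maps each parent span ID to the span IDs it caused.
func ChildrenBySpan(contexts []string) (map[int][]int, error) {
	children := make(map[int][]int)
	for _, s := range contexts {
		c, err := ParseContext(s)
		if err != nil {
			return nil, err
		}
		children[c.Parent] = append(children[c.Parent], c.Span)
	}
	return children, nil
}

func main() {
	records := []string{
		"t98p895s896", "t98p896s908", "t98p896s909",
		"t98p896s910", "t98p910s912", "t98p910s913",
	}
	kids, err := ChildrenBySpan(records)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("children of s896:", kids[896]) // children of s896: [908 909 910]
	fmt.Println("children of s910:", kids[910]) // children of s910: [912 913]
}
```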
  28. Zipkin Trace Dependencies
  29. Zipkin Trace Dependencies
  30. Trace for one Spigo Flow
  31. Definition of an architecture
         {
           "arch": "lamp",
           "description": "Simple LAMP stack",
           "version": "arch-0.0",
           "victim": "webserver",
           "services": [
             { "name": "rds-mysql", "package": "store", "count": 2, "regions": 1, "dependencies": [] },
             { "name": "memcache", "package": "store", "count": 1, "regions": 1, "dependencies": [] },
             { "name": "webserver", "package": "monolith", "count": 18, "regions": 1, "dependencies": ["memcache", "rds-mysql"] },
             { "name": "webserver-elb", "package": "elb", "count": 0, "regions": 1, "dependencies": ["webserver"] },
             { "name": "www", "package": "denominator", "count": 0, "regions": 0, "dependencies": ["webserver-elb"] }
           ]
         }
     Slide callouts: the header includes the chaos monkey victim; "name" is the new tier name; "package" is the tier package; "count" is the node count; "regions": 0 means non-regional; "dependencies" is the list of tier dependencies.
     See for yourself: http://simianviz.surge.sh/lamp
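As a sketch of how a tool might consume this file, the Go structs below mirror the JSON fields shown on the slide, and a small check reports dependencies that name a tier the file never declares. The struct names and the `Load` and `MissingDependencies` helpers are assumptions made for this example; Spigo's own loader may be organized differently.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Service mirrors one entry of the "services" array on the slide.
type Service struct {
	Name         string   `json:"name"`
	Package      string   `json:"package"`
	Count        int      `json:"count"`
	Regions      int      `json:"regions"`
	Dependencies []string `json:"dependencies"`
}

// Architecture mirrors the top-level object on the slide.
type Architecture struct {
	Arch        string    `json:"arch"`
	Description string    `json:"description"`
	Version     string    `json:"version"`
	Victim      string    `json:"victim"`
	Services    []Service `json:"services"`
}

// lampJSON is the LAMP definition copied from the slide.
const lampJSON = `{
  "arch": "lamp", "description": "Simple LAMP stack",
  "version": "arch-0.0", "victim": "webserver",
  "services": [
    {"name": "rds-mysql", "package": "store", "count": 2, "regions": 1, "dependencies": []},
    {"name": "memcache", "package": "store", "count": 1, "regions": 1, "dependencies": []},
    {"name": "webserver", "package": "monolith", "count": 18, "regions": 1, "dependencies": ["memcache", "rds-mysql"]},
    {"name": "webserver-elb", "package": "elb", "count": 0, "regions": 1, "dependencies": ["webserver"]},
    {"name": "www", "package": "denominator", "count": 0, "regions": 0, "dependencies": ["webserver-elb"]}
  ]
}`

// Load parses an architecture definition.
func Load(data string) (Architecture, error) {
	var a Architecture
	err := json.Unmarshal([]byte(data), &a)
	return a, err
}

// MissingDependencies lists "tier -> dependency" pairs whose dependency
// is not declared as a tier in the same file.
func MissingDependencies(a Architecture) []string {
	declared := make(map[string]bool)
	for _, s := range a.Services {
		declared[s.Name] = true
	}
	var missing []string
	for _, s := range a.Services {
		for _, d := range s.Dependencies {
			if !declared[d] {
				missing = append(missing, s.Name+" -> "+d)
			}
		}
	}
	return missing
}

func main() {
	a, err := Load(lampJSON)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, s := range a.Services {
		fmt.Printf("%-14s package=%-12s count=%d deps=%v\n", s.Name, s.Package, s.Count, s.Dependencies)
	}
	fmt.Println("undeclared dependencies:", len(MissingDependencies(a)))
}
```

A strict check like this is only a starting point: the Riak architecture later in the deck lists `eureka` as a dependency without declaring it as a tier, so a real validator would presumably need to allow for built-in services.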
  32. Migrating to Microservices. See for yourself: http://simianviz.surge.sh/migration [Screenshot labels: Endpoint, ELB, PHP, MySQL, MySQL. Callouts: next step; controls node placement distance; select models.]
  33. Running Spigo
         $ ./spigo -a lamp -d 2 -j -c
         2016/05/16 18:46:37 Loading architecture from json_arch/lamp_arch.json
         2016/05/16 18:46:37 lamp.edda: starting
         2016/05/16 18:46:37 HTTP metrics now available at localhost:8123/debug/vars
         2016/05/16 18:46:37 Architecture: lamp Simple LAMP stack
         2016/05/16 18:46:37 architecture: scaling to 100%
         2016/05/16 18:46:37 Starting: {rds-mysql store 1 2 []}
         2016/05/16 18:46:37 lamp.us-east-1.zoneB..eureka01...eureka.eureka: starting
         2016/05/16 18:46:37 lamp.us-east-1.zoneC..eureka02...eureka.eureka: starting
         2016/05/16 18:46:37 lamp.us-east-1.zoneA..eureka00...eureka.eureka: starting
         2016/05/16 18:46:37 Starting: {memcache store 1 1 []}
         2016/05/16 18:46:37 Starting: {webserver monolith 1 18 [memcache rds-mysql]}
         2016/05/16 18:46:37 Starting: {webserver-elb elb 1 0 [webserver]}
         2016/05/16 18:46:37 Starting: {www denominator 0 0 [webserver-elb]}
         2016/05/16 18:46:37 lamp.*.*..www00...www.denominator activity rate 10ms
         2016/05/16 18:46:38 chaosmonkey delete: lamp.us-east-1.zoneA..webserver09...webserver.monolith
         2016/05/16 18:46:39 asgard: Shutdown
         2016/05/16 18:46:39 Saving 30 histograms for Guesstimate
         2016/05/16 18:46:39 lamp.us-east-1.zoneA..eureka00...eureka.eureka: closing
         2016/05/16 18:46:39 lamp.us-east-1.zoneC..eureka02...eureka.eureka: closing
         2016/05/16 18:46:39 lamp.us-east-1.zoneB..eureka01...eureka.eureka: closing
         2016/05/16 18:46:39 spigo: complete
         2016/05/16 18:46:39 lamp.edda: closing
         2016/05/16 18:46:39 Flushing flows to json_metrics/lamp_flow.json
     Options:
         -a  architecture lamp
         -d  run for 2 seconds
         -j  graph json/lamp.json
         -c  flows json_metrics/lamp_flow.json
  34. Riak IoT Architecture
         {
           "arch": "riak",
           "description": "Riak IoT ingestion example for the RICON 2015 presentation",
           "version": "arch-0.0",
           "victim": "",
           "services": [
             { "name": "riakTS", "package": "riak", "count": 6, "regions": 1, "dependencies": ["riakTS", "eureka"]},
             { "name": "ingester", "package": "staash", "count": 6, "regions": 1, "dependencies": ["riakTS"]},
             { "name": "ingestMQ", "package": "karyon", "count": 3, "regions": 1, "dependencies": ["ingester"]},
             { "name": "riakKV", "package": "riak", "count": 3, "regions": 1, "dependencies": ["riakKV"]},
             { "name": "enricher", "package": "staash", "count": 6, "regions": 1, "dependencies": ["riakKV", "ingestMQ"]},
             { "name": "enrichMQ", "package": "karyon", "count": 3, "regions": 1, "dependencies": ["enricher"]},
             { "name": "analytics", "package": "karyon", "count": 6, "regions": 1, "dependencies": ["ingester"]},
             { "name": "analytics-elb", "package": "elb", "count": 0, "regions": 1, "dependencies": ["analytics"]},
             { "name": "analytics-api", "package": "denominator", "count": 0, "regions": 0, "dependencies": ["analytics-elb"]},
             { "name": "normalization", "package": "karyon", "count": 6, "regions": 1, "dependencies": ["enrichMQ"]},
             { "name": "iot-elb", "package": "elb", "count": 0, "regions": 1, "dependencies": ["normalization"]},
             { "name": "iot-api", "package": "denominator", "count": 0, "regions": 0, "dependencies": ["iot-elb"]},
             { "name": "stream", "package": "karyon", "count": 6, "regions": 1, "dependencies": ["ingestMQ"]},
             { "name": "stream-elb", "package": "elb", "count": 0, "regions": 1, "dependencies": ["stream"]},
             { "name": "stream-api", "package": "denominator", "count": 0, "regions": 0, "dependencies": ["stream-elb"]}
           ]
         }
     Slide callouts: "name" is the new tier name; "package" is the tier package; "count" is the node count; "dependencies" is the list of tier dependencies; "regions": 0 means non-regional.
  35. Single Region Riak IoT. See for yourself: http://simianviz.surge.sh/riak
  36. Single Region Riak IoT. Labels: IoT Ingestion Endpoint, Stream Endpoint, Analytics Endpoint. See for yourself: http://simianviz.surge.sh/riak
  37. Single Region Riak IoT. Labels: IoT Ingestion Endpoint, Stream Endpoint, Analytics Endpoint; Load Balancer (x3). See for yourself: http://simianviz.surge.sh/riak
  38. Single Region Riak IoT. Labels: IoT Ingestion Endpoint, Stream Endpoint, Analytics Endpoint; Load Balancer (x3); Normalization Services, Stream Service, Analytics Service. See for yourself: http://simianviz.surge.sh/riak
  39. Single Region Riak IoT. Labels: IoT Ingestion Endpoint, Stream Endpoint, Analytics Endpoint; Load Balancer (x3); Normalization Services, Stream Service, Analytics Service; Enrich Message Queue, Riak KV, Enricher Services. See for yourself: http://simianviz.surge.sh/riak
  40. Single Region Riak IoT. Labels: IoT Ingestion Endpoint, Stream Endpoint, Analytics Endpoint; Load Balancer (x3); Normalization Services, Stream Service, Analytics Service; Enrich Message Queue, Riak KV, Enricher Services; Ingest Message Queue. See for yourself: http://simianviz.surge.sh/riak
  41. Single Region Riak IoT. Labels: IoT Ingestion Endpoint, Stream Endpoint, Analytics Endpoint; Load Balancer (x3); Normalization Services, Stream Service, Analytics Service; Enrich Message Queue, Riak KV, Enricher Services; Ingest Message Queue; Riak TS, Ingester Service. See for yourself: http://simianviz.surge.sh/riak
  42. Two Region Riak IoT. Labels: IoT Ingestion Endpoint, Stream Endpoint, Analytics Endpoint; East Region Ingestion, West Region Ingestion; Multi Region TS Analytics. See for yourself: http://simianviz.surge.sh/riak
  43. Spigo with Neo4j
         $ ./spigo -a netflix -d 2 -n -c -kv chat:200ms
         2016/05/18 12:07:08 Graph will be written to Neo4j via NEO4JURL=localhost:7474
         2016/05/18 12:07:08 Loading architecture from json_arch/netflix_arch.json
         2016/05/18 12:07:08 HTTP metrics now available at localhost:8123/debug/vars
         2016/05/18 12:07:08 netflix.edda: starting
         2016/05/18 12:07:08 Architecture: netflix A simplified Netflix service. See http://netflix.github.io/ to decode the package names
         2016/05/18 12:07:08 architecture: scaling to 100%
         2016/05/18 12:07:08 Starting: {cassSubscriber priamCassandra 1 6 [cassSubscriber eureka]}
         2016/05/18 12:07:08 netflix.us-east-1.zoneA..eureka00...eureka.eureka: starting
         2016/05/18 12:07:08 netflix.us-east-1.zoneB..eureka01...eureka.eureka: starting
         2016/05/18 12:07:08 netflix.us-east-1.zoneC..eureka02...eureka.eureka: starting
         2016/05/18 12:07:08 Starting: {evcacheSubscriber store 1 3 []}
         2016/05/18 12:07:08 Starting: {subscriber staash 1 3 [cassSubscriber evcacheSubscriber]}
         2016/05/18 12:07:08 Starting: {cassPersonalization priamCassandra 1 6 [cassPersonalization eureka]}
         2016/05/18 12:07:08 Starting: {personalizationData staash 1 3 [cassPersonalization]}
         2016/05/18 12:07:08 Starting: {cassHistory priamCassandra 1 6 [cassHistory eureka]}
         2016/05/18 12:07:08 Starting: {historyData staash 1 3 [cassHistory]}
         2016/05/18 12:07:08 Starting: {contentMetadataS3 store 1 1 []}
         2016/05/18 12:07:08 Starting: {personalize karyon 1 9 [contentMetadataS3 subscriber historyData personalizationData]}
         2016/05/18 12:07:08 Starting: {login karyon 1 6 [subscriber]}
         2016/05/18 12:07:08 Starting: {home karyon 1 9 [contentMetadataS3 subscriber personalize]}
         2016/05/18 12:07:08 Starting: {play karyon 1 9 [contentMetadataS3 historyData subscriber]}
         2016/05/18 12:07:08 Starting: {loginpage karyon 1 6 [login]}
         2016/05/18 12:07:08 Starting: {homepage karyon 1 9 [home]}
         2016/05/18 12:07:08 Starting: {playpage karyon 1 9 [play]}
         2016/05/18 12:07:08 Starting: {wwwproxy zuul 1 3 [loginpage homepage playpage]}
         2016/05/18 12:07:08 Starting: {apiproxy zuul 1 3 [login home play]}
         2016/05/18 12:07:08 Starting: {www-elb elb 1 0 [wwwproxy]}
         2016/05/18 12:07:08 Starting: {api-elb elb 1 0 [apiproxy]}
         2016/05/18 12:07:08 Starting: {www denominator 0 0 [www-elb]}
         2016/05/18 12:07:08 Starting: {api denominator 0 0 [api-elb]}
         2016/05/18 12:07:08 netflix.*.*..api00...api.denominator activity rate 200ms
         2016/05/18 12:07:09 chaosmonkey delete: netflix.us-east-1.zoneA..homepage03...homepage.karyon
         2016/05/18 12:07:10 asgard: Shutdown
         2016/05/18 12:07:10 Saving 108 histograms for Guesstimate
         2016/05/18 12:07:10 Saving 108 histograms for Guesstimate
         2016/05/18 12:07:10 netflix.us-east-1.zoneC..eureka02...eureka.eureka: closing
         2016/05/18 12:07:10 netflix.us-east-1.zoneA..eureka00...eureka.eureka: closing
         2016/05/18 12:07:10 netflix.us-east-1.zoneB..eureka01...eureka.eureka: closing
         2016/05/18 12:07:10 spigo: complete
         2016/05/18 12:07:11 netflix.edda: closing
     Options:
         -a  architecture netflix
         -d  run for 2 seconds
         -n  graph and flows written to Neo4j
         -c  flows json_metrics/netflix_flow.json
         -kv chat:200ms  start flows at 5/sec (one every 200ms)
  44. Neo4j Visualization
  45. Neo4j Trace Flow Queries
  46. Conclusion. Monitoring tools can be stressed at scale by simulating their inputs with metrics, dependency graphs and flows. Spigo can be extended to produce output in any format at very large scale from a laptop. @adrianco
  47. Ask Adrians. @adrianco @adrianfcole. Links: distributed-tracing Google group, opentracing.io, zipkin.io, github.com/adrianco/spigo, pivottracing.io. Twitter: @opentracing @simianviz @zipkinproject
