AWS Loft Talk: Behind the Scenes with SignalFx


Slides from SignalFx CTO Phillip Liu's presentation at the AWS Loft in SF after DockerCon: Behind the Scenes with SignalFx.

Phil discussed how SignalFx deploys, runs, and operates a completely Dockerized microservices architecture for a production SaaS application dealing with large volumes of high resolution customer data.

Published in: Technology

  1. 1. SignalFx
  2. 2. SignalFx Behind the Scenes with SignalFx Phillip Liu @SignalFx -
  3. 3. Agenda • Background • Overview of Key SignalFx Services • SignalFx infrastructure and operations • Analytics approach to monitoring • Code push side effects, an example • Summary
  4. 4. SignalFx Background
  5. 5. About Me [2013 - ] SignalFx - Founder, CTO, Software Engineer Microservices; Monitoring using Analytics [2008 - 2012] Facebook - Software Engineer, Software Architect Hyperscale SOA; Monitoring using Nagios, Ganglia, and in-house Analytics [2004 - 2008] Opsware - Chief Architect, Software Engineer Monolithic Architecture; Monitoring using Ganglia, Nagios, Splunk [2000 - 2004] Loudcloud - Software Engineer LAMP, Application Server; Monitoring using SNMP, Ganglia, NetCool [1998 - 2000] Marimba - Software Engineer Client / Server; Monitoring using SNMP, FreshWater Software [ … ]
  6. 6. About SignalFx
  7. 7. SignalFx Overview of SignalFx Services
  8. 8. A Microservices Definition Loosely coupled service oriented architecture with bounded context. Adrian Cockcroft
  9. 9. Overview of Key SignalFx Services
  10. 10. Microservice Complexity More than 15 internal services. Services span hundreds of instances across multiple AZs. Have dependencies on tens of external services.
  11. 11. SignalFx SignalFx Infrastructure
  12. 12. Amazon EC2
  13. 13. SignalFx Operations at SignalFx
  14. 14. Shared Responsibility • Engineering is organized around services they provide • No dedicated operations team • Each service team is responsible for building and operating their services • Infrastructure team provides IaaS - DNS, LB, Mail, Server, and Network configuration and provisioning • Ingest team provides Ingest API, Quantization, and TSDB services
  15. 15. Continuous Build and Deployment • Services are built and tested on each commit • Each service deploys at its own cadence • Nearly all deployments are non-disruptive • Push to lab, test; push prod canary, test; rest of prod • Services are engineered to be resilient to partial cluster availability • Each service is engineered to support +1/-1 upgrades
  16. 16. On-call Rotation • All devs are on a weekly on-call rotation (a couple of times a year) • On-call works on operational tools • On-call rotates from lab -> production • On-call is the incident manager • Owns driving both blackout and brownout incidents to resolution
  17. 17. Operations Tools sfhost - CLI for VM configuration and provisioning sfc - console to access management data for all services signalscope - deep transactions tracing maestro - Docker orchestrator jenkins - continuous build and deployment
  18. 18. Monitoring • We use SignalFx to monitor SignalFx • Engineers instrument their code as part of dev process • Each service provides at least one dashboard • CollectD for OS and Docker metrics on all VMs • Yammer metrics for all Java app servers • Custom logger to count exception types
  19. 19. Monitoring - API Service Dashboard
  20. 20. SignalFx Analytics Approach to Monitoring
  21. 21. Monitoring Challenges • High iteration rate leads to shortened test cycles • Integration test combinations are intractable • Catch problems during rolling deployments • Identify upstream/downstream side effects • e.g. backpressure • Identify brownouts before the customer • etc.
  22. 22. Analytics Approach to Monitoring Measure
  23. 23. Analytics Approach to Monitoring Analyze
  24. 24. Analytics Approach to Monitoring Detect
  25. 25. SignalFx Examples
  26. 26. Code Push Side Effects - Time Series Router
  27. 27. Code Push Side Effects Pushed a canary instance, and the Metadata API dashboard showed a healthy tier.
  28. 28. Code Push Side Effects However, upstream UI dashboard showed unusual # of timeouts.
  29. 29. Code Push Side Effects In search of root cause. Always safe to start by looking at exception counts. Can’t derive much from all the noise.
  30. 30. Code Push Side Effects Sum the # of exceptions to create a single signal.
  31. 31. Code Push Side Effects Compare sum with time-shifted sum from a day ago.
  32. 32. Code Push Side Effects Look at an outlier host - an Analytics service host.
  33. 33. Code Push Side Effects enum constant MURMUR128_MITZ_64 does not exist in class at ~[na: 1.7.0_79] at ~[na:1.7.0_79] at 1990) ~[na:1.7.0_79] … Looking at the Analytics service's logs revealed the source of the problem.
  34. 34. Code Push Side Effects • Analytics across multiple microservices reduced the time to identify the problem. From push to resolution was ~15min • Service instrumentation helped narrow down the root cause • Discovery allowed us to create a detector using analytics to notify us of similar problems in the future
  35. 35. Other Examples • A customer started dropping data because they reverted to an unsupported API • Compare TSDB write throughput of two different write strategies • Create per-service capacity reports • Identify memory usage patterns across our Analytics service • Create a detector for every previously uncaught error condition - postmortem output
  36. 36. SignalFx Summary
  37. 37. Summary • Microservice architecture is inherently complex • Measure all the things • Use data analytics techniques to • Identify problems • Chase down root cause • Use intelligent detectors to catch recurrence
  38. 38. SignalFx Questions
  39. 39. SignalFx Thank You! Phillip Liu WE’RE HIRING
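
The analysis steps in slides 30 through 32 and the detector mentioned in slide 34 (sum the exception counts into one signal, compare that sum with the same signal a day earlier, then look for an outlier host) can be sketched roughly as follows. This is a minimal illustration in plain Python, not SignalFx's analytics language or API. The data shape, the function names, the host names, the thresholds, and the use of a 4-sample shift as a stand-in for "one day" are all assumptions made for the example.

```python
import statistics
from typing import Dict, List

# Hypothetical data shape: exception counts per host, one sample per interval.
HostSeries = Dict[str, List[int]]


def sum_signal(per_host: HostSeries) -> List[int]:
    """Collapse many noisy per-host series into a single summed signal."""
    return [sum(samples) for samples in zip(*per_host.values())]


def time_shift_ratio(signal: List[int], shift: int) -> List[float]:
    """Ratio of each sample to the sample `shift` intervals earlier.

    A baseline of 0 is treated as 1 to avoid dividing by zero.
    """
    return [
        signal[i] / max(signal[i - shift], 1)
        for i in range(shift, len(signal))
    ]


def detector_fires(ratios: List[float], threshold: float = 2.0) -> bool:
    """Assumed detector rule: fire if any ratio exceeds the threshold."""
    return any(r > threshold for r in ratios)


def outlier_hosts(per_host: HostSeries, window: int, factor: float = 3.0) -> List[str]:
    """Hosts whose recent total exceeds `factor` times the median host total."""
    totals = {host: sum(s[-window:]) for host, s in per_host.items()}
    median_total = max(statistics.median(totals.values()), 1)
    return sorted(h for h, t in totals.items() if t > factor * median_total)


# Made-up sample data: the first 4 samples play the role of "yesterday".
per_host = {
    "ui-1": [2, 2, 2, 2, 2, 2, 2, 2],
    "api-1": [1, 1, 1, 1, 1, 1, 1, 1],
    "analytics-1": [1, 1, 1, 1, 1, 1, 20, 30],
}

total = sum_signal(per_host)
ratios = time_shift_ratio(total, shift=4)
if detector_fires(ratios):
    print("exception volume is well above the shifted baseline")
    print("outlier hosts:", outlier_hosts(per_host, window=4))
```

Summing first and comparing against a time-shifted copy keeps the rule independent of absolute exception volume, which is one plausible reason the talk compares against the previous day rather than against a fixed threshold.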