Slides from SignalFx CTO Phillip Liu's presentation at the AWS Loft in SF after DockerCon: Behind the Scenes with SignalFx.
Phil discussed how SignalFx deploys, runs, and operates a completely Dockerized microservices architecture for a production SaaS application that handles large volumes of high-resolution customer data.
3. Agenda
• Background
• Overview of Key SignalFx Services
• SignalFx infrastructure and operations
• Analytics approach to monitoring
• Code push side effects, an example
• Summary
10. Microservice Complexity
More than 15 internal services.
Services span hundreds of instances across multiple AZs.
Dependencies on tens of external services.
14. Shared Responsibility
• Engineering teams are organized around the services they provide
• No dedicated operations team
• Each service team is responsible for building and operating its services
• The Infrastructure team provides IaaS: DNS, LB, mail, server, and network configuration and provisioning
• The Ingest team provides the Ingest API, Quantization, and TSDB services
15. Continuous Build and Deployment
• Services are built and tested on each commit
• Each service deploys at its own cadence
• Nearly all deployments are non-disruptive
• Push to lab, test; push to a production canary, test; then the rest of prod (see the rolling-deploy sketch after this list)
• Services are engineered to be resilient to partial cluster availability
• Each service is engineered to support +1/-1 upgrades
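As a rough illustration of that gated rollout, here is a minimal Java sketch of a health-checked rolling deploy. The deployToHost helper, the /healthz endpoint, and the host list are hypothetical stand-ins, not SignalFx's actual tooling (their real pipeline uses maestro and jenkins, per the tools slide below).

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.List;

public class RollingDeploy {
    private static final HttpClient HTTP = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5)).build();

    public static void main(String[] args) throws Exception {
        // Hypothetical ordering: lab first, then one production canary,
        // then the remaining production hosts.
        List<String> hosts = List.of("lab-1", "prod-canary-1", "prod-2", "prod-3");
        for (String host : hosts) {
            deployToHost(host); // hypothetical helper that ships the new build
            if (!isHealthy(host)) {
                System.err.println("Aborting rollout: " + host + " failed health check");
                return; // stop before touching the rest of the fleet
            }
        }
    }

    static boolean isHealthy(String host) throws Exception {
        // Poll a (hypothetical) health endpoint before moving on.
        HttpRequest req = HttpRequest.newBuilder(
                URI.create("http://" + host + ":8080/healthz")).GET().build();
        HttpResponse<String> resp = HTTP.send(req, HttpResponse.BodyHandlers.ofString());
        return resp.statusCode() == 200;
    }

    static void deployToHost(String host) {
        // Placeholder: in practice this would call the orchestrator.
    }
}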
16. On-call Rotation
• All devs are on a weekly on-call rotation (a couple of times a year each)
• On-call works on operational tools
• On-call rotates from lab -> production
• On-call is the incident manager
• Owns driving both blackout and brownout incidents to resolution
17. Operations Tools
sfhost - CLI for VM configuration and provisioning
sfc - console to access management data for all services
signalscope - deep transaction tracing
maestro - Docker orchestrator
jenkins - continuous build and deployment
18. Monitoring
• We use SignalFx to monitor SignalFx
• Engineers instrument their code as part of dev process
• Each service provides at least one dashboard
• CollectD for OS and Docker metrics on all VMs
• Yammer metrics for all Java app servers
• Custom logger to count exception types (see the instrumentation sketch after this list)
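To make the last two bullets concrete, here is a minimal sketch using the Dropwizard Metrics library (the successor to the Yammer metrics the deck names). The class, metric names, and console reporter are illustrative assumptions, not SignalFx's actual configuration.

import com.codahale.metrics.ConsoleReporter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;
import java.util.concurrent.TimeUnit;

public class InstrumentedHandler {
    private static final MetricRegistry REGISTRY = new MetricRegistry();
    private static final Timer REQUESTS =
            REGISTRY.timer(MetricRegistry.name(InstrumentedHandler.class, "requests"));

    public void handle(Runnable work) {
        try (Timer.Context ignored = REQUESTS.time()) {
            work.run();
        } catch (RuntimeException e) {
            // Exception-type counter, as on the slide: one counter per
            // exception class so spikes are attributable to a cause.
            REGISTRY.counter(MetricRegistry.name("exceptions",
                    e.getClass().getSimpleName())).inc();
            throw e;
        }
    }

    public static void main(String[] args) {
        // Illustrative reporter; in production the metrics would be
        // shipped to SignalFx rather than printed to the console.
        ConsoleReporter.forRegistry(REGISTRY).build()
                .start(10, TimeUnit.SECONDS);
        new InstrumentedHandler().handle(() -> System.out.println("work"));
    }
}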
21. Monitoring Challenges
• High iteration rate leads to shortened test cycles
• Integration test combinations are intractable
• Catch problems during rolling deployments
• Identify upstream/downstream side effects
• e.g. backpressure (see the gauge sketch after this list)
• Identify brownouts before the customer does
• etc.
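One common way to surface backpressure between services is to export queue depth as a gauge: a consumer that cannot keep up shows a steadily growing queue. This is a generic sketch with the Dropwizard Metrics API; the queue and metric names are assumptions, not SignalFx's actual signals.

import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class BackpressureSignal {
    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();
        // Hypothetical ingest queue sitting between an upstream producer
        // and a downstream consumer.
        BlockingQueue<byte[]> ingestQueue = new LinkedBlockingQueue<>();

        // Exporting the depth lets a detector alert when the consumer
        // falls behind (depth trending up instead of hovering near zero).
        registry.register(MetricRegistry.name("ingest", "queue", "depth"),
                (Gauge<Integer>) ingestQueue::size);
    }
}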
27. Code Push Side Effects
Pushed a canary instance; the Metadata API dashboard shows a healthy tier.
28. Code Push Side Effects
However, the upstream UI dashboard showed an unusual number of timeouts.
29. Code Push Side Effects
In search of the root cause.
Always safe to start by looking at exception counts.
Can't derive much from all the noise.
30. Code Push Side Effects
Sum the # of exceptions to create a single signal.
31. Code Push Side Effects
Compare sum with time-shifted sum from a day ago.
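The two steps above, collapsing per-host signals into one sum and comparing it against the same signal shifted back a day, are what SignalFx's analytics perform server-side. As a conceptual illustration only, here is a minimal Java sketch of the same math; the data shapes and the 2x threshold are assumptions.

import java.util.Map;

public class TimeShiftCompare {
    /** Collapse per-host exception counts into a single signal. */
    static double sum(Map<String, Double> perHostCounts) {
        return perHostCounts.values().stream().mapToDouble(Double::doubleValue).sum();
    }

    /** Flag the current sum if it is well above the same signal a day ago. */
    static boolean anomalous(double sumNow, double sumDayAgo) {
        double threshold = 2.0; // illustrative: 2x yesterday's level
        return sumNow > threshold * Math.max(sumDayAgo, 1.0);
    }

    public static void main(String[] args) {
        Map<String, Double> now = Map.of("host-a", 40.0, "host-b", 95.0);
        double dayAgo = 30.0; // same signal, time-shifted 24h
        System.out.println(anomalous(sum(now), dayAgo)); // true: clear regression
    }
}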
32. Code Push Side Effects
Look at an outlier host - an Analytics service host.
33. Code Push Side Effects
java.io.InvalidObjectException: enum constant MURMUR128_MITZ_64 does not exist in class com.google.common.hash.BloomFilterStrategies
    at java.io.ObjectInputStream.readEnum(ObjectInputStream.java:1743) ~[na:1.7.0_79]
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347) ~[na:1.7.0_79]
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) ~[na:1.7.0_79]
    …
Looking at the Analytics service's logs revealed the source of the problem: the host was deserializing a Guava BloomFilter whose serialized form referenced an enum constant (MURMUR128_MITZ_64) that this host's older Guava version did not have - a library version skew introduced by the code push.
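For context on the failure mode: Java's enum deserialization resolves constants by name, so an object serialized by a newer library version can be unreadable on an older classpath. Below is a minimal sketch of guarding such a service boundary, tied back to the exception-type counter idea from the monitoring slide; the stream source and counter wiring are hypothetical.

import com.codahale.metrics.MetricRegistry;
import java.io.InputStream;
import java.io.InvalidObjectException;
import java.io.ObjectInputStream;

public class GuardedDeserializer {
    private static final MetricRegistry REGISTRY = new MetricRegistry();

    static Object readGuarded(InputStream raw) throws Exception {
        try (ObjectInputStream in = new ObjectInputStream(raw)) {
            return in.readObject();
        } catch (InvalidObjectException e) {
            // Typical symptom of version skew: the sender's enum constant
            // does not exist on this host's (older) classpath.
            REGISTRY.counter(MetricRegistry.name("exceptions",
                    e.getClass().getSimpleName())).inc();
            throw e;
        }
    }
}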
34. Code Push Side Effects
• Analytics across multiple microservices reduced the time to identify the problem: from push to resolution was ~15 min
• Service instrumentation helped narrow down the root cause
• The discovery allowed us to create a detector that uses analytics to notify us of similar problems in the future
35. Other Examples
• A customer started dropping data because they reverted to an unsupported API
• Compare TSDB write throughput of two different write strategies
• Create per-service capacity reports
• Identify memory usage patterns across our Analytics service
• Create a detector for every previously uncaught error condition - postmortem output
37. Summary
• Microservice architecture is inherently complex
• Measure all the things
• Use data analytics techniques to
• Identify problems
• Chase down root cause
• Use intelligent detectors to catch recurrence