AWS Loft Talk: Behind the Scenes with SignalFx

SignalFx
Behind the Scenes with SignalFx
Phillip Liu
phillip@signalfx.com
@SignalFx - signalfx.com

Agenda
• Background
• Overview of Key SignalFx Services
• SignalFx infrastructure and operations
• Analytics approach to monitoring
• Code push side effects, an example
• Summary

About Me
[2013 - ] SignalFx - Founder, CTO, Software Engineer
Microservices; Monitoring using Analytics
[2008 - 2012] Facebook - Software Engineer, Software Architect
Hyperscale SOA; Monitoring using Nagios, Ganglia, and in-house
Analytics
[2004 - 2008] Opsware - Chief Architect, Software Engineer
Monolithic Architecture; Monitoring using Ganglia, Nagios, Splunk
[2000 - 2004] Loudcloud - Software Engineer
LAMP, Application Server; Monitoring using SNMP, Ganglia, NetCool
[1998 - 2000] Marimba - Software Engineer
Client / Server; Monitoring using SNMP, FreshWater Software
[ … ]

Overview of SignalFx Services

A Microservices Definition
Loosely coupled service
oriented architecture with
bounded context.
Adrian Cockcroft

Overview of Key SignalFx Services

Microservice Complexity
More than 15 internal services.
Services span hundreds of instances across multiple AZs.
Services depend on tens of external services.

SignalFx Infrastructure

Operations at SignalFx

Shared Responsibility
• Engineering is organized around the services each team provides

• No dedicated operations team

• Each service team is responsible for building and operating
their services

• Infrastructure team provides IaaS - DNS, LB, Mail, Server,
and Network configuration and provisioning

• Ingest team provides Ingest API, Quantization, and TSDB
services

Continuous Build and Deployment
• Services are built and tested on each commit

• Each service deploys at its own cadence

• Nearly all deployments are non-disruptive

• Push to lab, test; push to a production canary, test; push to the rest of prod

• Services are engineered to be resilient to partial cluster
availability

• Each service is engineered to support +1/-1 upgrades

On-call Rotation
• All developers are on a weekly on-call rotation (a couple of times a year)

• On-call works on operational tools

• On-call rotates from lab -> production

• On-call is the incident manager

• On-call owns driving both blackout and brownout incidents to
resolution

Operations Tools
sfhost - CLI for VM configuration and provisioning

sfc - console to access management data for all services

signalscope - deep transaction tracing

maestro - Docker orchestrator

jenkins - continuous build and deployment

Monitoring
• We use SignalFx to monitor SignalFx

• Engineers instrument their code as part of the development process

• Each service provides at least one dashboard

• CollectD for OS and Docker metrics on all VMs

• Yammer metrics for all Java app servers

• Custom logger to count exception types
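
The slide does not show how the custom logger works. As a minimal sketch of the idea, assuming a Python service rather than the Java app servers mentioned above, a logging handler can count exceptions by type so the counts can be reported as metrics. The class name ExceptionCountingHandler and the logger name demo_service are hypothetical.

```python
import logging
from collections import Counter


class ExceptionCountingHandler(logging.Handler):
    """Counts logged exceptions by type name (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def emit(self, record):
        # exc_info is populated when a record is logged with exc_info=True
        # or through logger.exception(...).
        if record.exc_info and record.exc_info[0] is not None:
            self.counts[record.exc_info[0].__name__] += 1


handler = ExceptionCountingHandler()
logger = logging.getLogger("demo_service")
logger.addHandler(handler)
logger.propagate = False  # keep the demo from printing tracebacks

for value in ["1", "x", "y"]:
    try:
        int(value)
    except ValueError:
        logger.exception("bad input %r", value)

try:
    {}["missing"]
except KeyError:
    logger.exception("lookup failed")
```

After this runs, handler.counts holds one counter per exception type, which a metrics reporter could publish as one time series per type.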

Monitoring - API Service Dashboard

Analytics Approach to Monitoring

Monitoring Challenges
• High iteration rate leads to shortened test cycles
• Integration test combinations are intractable
• Catch problems during rolling deployments
• Identify upstream/downstream side effects
• e.g. backpressure
• Identify brownouts before the customer
• etc.

Measure

Analyze

Detect

Code Push Side Effects - Time Series Router

Code Push Side Effects
Pushed a canary instance; the Metadata API
dashboard showed a healthy tier.

However, the upstream UI dashboard
showed an unusual number of timeouts.

In search of the root cause.
It is always safe to start by looking at exception counts.
It is hard to derive much from all the noise.

Sum the # of exceptions to create a single signal.

Compare sum with time-shifted sum from a day ago.
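
The two steps above (sum, then compare against a time-shifted copy) can be sketched in plain Python. This is an illustration only, not SignalFx's analytics language; it assumes exception counts arrive as (timestamp, host, count) samples, and the function names are hypothetical.

```python
from collections import defaultdict

DAY = 24 * 60 * 60  # seconds in one day


def sum_across_hosts(samples):
    """Collapse (timestamp, host, count) samples into {timestamp: total}."""
    totals = defaultdict(int)
    for ts, _host, count in samples:
        totals[ts] += count
    return dict(totals)


def compare_with_day_ago(totals, shift=DAY):
    """Pair each total with the total from `shift` seconds earlier.

    Returns {timestamp: (now, day_ago)} for timestamps that have both.
    """
    return {
        ts: (value, totals[ts - shift])
        for ts, value in totals.items()
        if ts - shift in totals
    }


samples = [
    (0, "host-a", 2), (0, "host-b", 3),
    (DAY, "host-a", 2), (DAY, "host-b", 40),
]
totals = sum_across_hosts(samples)
comparison = compare_with_day_ago(totals)
```

Summing first removes the per-host noise, and the day-ago pairing gives a baseline that already accounts for daily traffic patterns.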

Look at an outlier host - an Analytics
service host.
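
The slide does not say how the outlier host was picked out. One simple way to do it, sketched here as an assumption, is to flag hosts whose exception count is far above the fleet median. The host names, the counts, and the factor of 5 are made up for illustration.

```python
from statistics import median


def outlier_hosts(counts_by_host, factor=5.0):
    """Return hosts whose count exceeds `factor` times the fleet median."""
    baseline = median(counts_by_host.values())
    # Floor the baseline at 1 so an all-quiet fleet does not flag everyone.
    threshold = factor * max(baseline, 1)
    return sorted(h for h, c in counts_by_host.items() if c > threshold)


counts = {"metadata-1": 3, "metadata-2": 4, "ui-1": 5, "analytics-7": 120}
suspects = outlier_hosts(counts)
```

The median is used instead of the mean so that the outlier itself does not drag the baseline upward.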

java.io.InvalidObjectException: enum constant MURMUR128_MITZ_64 does not exist in class com.google.common.hash.BloomFilterStrategies
    at java.io.ObjectInputStream.readEnum(ObjectInputStream.java:1743) ~[na:1.7.0_79]
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347) ~[na:1.7.0_79]
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) ~[na:1.7.0_79]
    …
Looking at the Analytics service's logs revealed
the source of the problem.

• Analytics across multiple microservices reduced the time
to identify the problem. Time from push to resolution was
~15 min

• Service instrumentation helped narrow down the root
cause

• The discovery allowed us to create a detector that uses
analytics to notify us of similar problems in the future
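
The detector's actual rule is not given in the talk. A minimal sketch of one plausible rule: fire when the summed exception count stays above a multiple of the day-ago value for several consecutive points. The ratio, the point count, and the function name are assumptions chosen for illustration.

```python
def detector_fires(current, day_ago, ratio=2.0, min_points=3):
    """True if current > ratio * day_ago for min_points consecutive points."""
    streak = 0
    for now, before in zip(current, day_ago):
        # Floor the baseline at 1 so a zero baseline cannot trigger on noise.
        if now > ratio * max(before, 1):
            streak += 1
            if streak >= min_points:
                return True
        else:
            streak = 0
    return False


sustained = detector_fires([5, 6, 30, 40, 50], [5, 5, 6, 5, 5])
spiky = detector_fires([5, 30, 6, 40, 5], [5, 5, 5, 5, 5])
```

Requiring consecutive points trades a little detection latency for fewer alerts on single-point spikes.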

Other Examples
• A customer started dropping data because they
reverted to an unsupported API

• Compare TSDB write throughput of two different write
strategies

• Create per-service capacity reports

• Identify memory usage patterns across our Analytics
service

• Create a detector for every previously uncaught error
condition, as a postmortem output

Summary
• Microservice architecture is inherently complex

• Measure all the things

• Use data analytics techniques to

• Identify problems

• Chase down root cause

• Use intelligent detectors to catch recurrence

SignalFx
Thank You!
Phillip Liu
phillip@signalfx.com
WE’RE HIRING
http://signalfx.com/careers.html
