SlideShare a Scribd company logo
1 of 39
Download to read offline
SignalFx
SignalFx
Behind the Scenes with SignalFx
Phillip Liu
phillip@signalfx.com
@SignalFx - signalfx.com
Agenda
• Background
• Overview of Key SignalFx Services
• SignalFx infrastructure and operations
• Analytics approach to monitoring
• Code push side effects, an example
• Summary
SignalFx
Background
About Me
[2013 - ] SignalFx - Founder, CTO, Software Engineer
Microservices; Monitoring using Analytics
[2008 - 2012] Facebook - Software Engineer, Software Architect
Hyperscale SOA; Monitoring using Nagios, Ganglia, and in-house
Analytics
[2004 - 2008] Opsware - Chief Architect, Software Engineer
Monolithic Architecture; Monitoring using Ganglia, Nagios, Splunk
[2000 - 2004] Loudcloud - Software Engineer
LAMP, Application Server; Monitoring using SNMP, Ganglia, NetCool
[1998 - 2000] Marimba - Software Engineer
Client / Server; Monitoring using SNMP, FreshWater Software
[ … ]
About SignalFx
SignalFx
Overview of SignalFx Services
A Microservices Definition
Loosely coupled service
oriented architecture with
bounded context.
Adrian Cockcroft
Overview of Key SignalFx Services
Microservice Complexity
More than 15 internal
services.
Services span hundreds of
instances across multiple
AZs.
Have dependencies on
tens of external services.
SignalFx
SignalFx Infrastructure
Amazon EC2
SignalFx
Operations at SignalFx
Shared Responsibility
• Engineering is organized around services they provide

• No dedicated operations team

• Each service team is responsible for building and operating
their services

• Infrastructure team provides IaaS - DNS, LB, Mail, Server,
and Network configuration and provisioning

• Ingest team provides Ingest API, Quantization, and TSDB
services
Continuous Build and Deployment
• Services are built and tested on each commit

• Each service deploy at their cadence

• Nearly all deployments are non-disruptive

• Push to lab, test; push product canary, test; rest of prod

• Service engineered to be resilient to partial cluster
availability

• Each service is engineered to support +1/-1 upgrades
On-call Rotation
• All dev on weekly on-call rotation (couple of times a year)

• On-call works on operational tools

• On-call rotates from lab -> production

• On-call is the incident manager

• Owns driving both black out and brown out incidents to
resolution
Operations Tools
sfhost - CLI for VM configuration and provisioning

sfc - console to access management data for all services

signalscope - deep transactions tracing

maestro - Docker orchestrator

jenkins - continuous build and deployment
Monitoring
• We use SignalFx to monitor SignalFx

• Engineers instrument their code as part of dev process

• Each service provides at least one dashboard

• CollectD for OS and Docker metrics on all VMs

• Yammer metrics for all Java app servers

• Custom logger to count exception types
Monitoring - API Service Dashboard
SignalFx
Analytics Approach to Monitoring
Monitoring Challenges
• High iteration rate leads to shortened test cycles
• Integration test combinations are intractable
• Catch problems during rolling deployments
• Identify upstream/downstream side effects
• e.g. backpressure
• Identify brownouts before the customer
• etc.
Analytics Approach to Monitoring
Measure
Analytics Approach to Monitoring
Analyze
Analytics Approach to Monitoring
Detect
SignalFx
Examples
Code Push Side Effects - Time Series Router
Code Push Side Effects
Push canary instance and Metadata API
dashboard shows healthy tier.
Code Push Side Effects
However, upstream UI dashboard
showed unusual # of timeouts.
Code Push Side Effects
In search of root cause.
Always safe to start by looking at exception counts.
Can’t derive much from all the noise.
Code Push Side Effects
Sum the # of exceptions to create a single signal.
Code Push Side Effects
Compare sum with time-shifted sum from a day ago.
Code Push Side Effects
Look at an outlier host - an Analytics
service host.
Code Push Side Effects
java.io.InvalidObjectException: enum constant MURMUR128_MITZ_64 does
not exist in class com.google.common.hash.BloomFilterStrategies
at java.io.ObjectInputStream.readEnum(ObjectInputStream.java:1743) ~[na:
1.7.0_79]
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347)
~[na:1.7.0_79]
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:
1990) ~[na:1.7.0_79]
…
Looking at Analytic’s logs revealed
source of the problem.
Code Push Side Effects
• Analytics across multiple microservices reduced time
to identify problem. From push to resolution was
~15min

• Service instrumentation helped narrowed down root
cause

• Discovery allowed us to create a detector using
analytics to notify similar problems in the future
Other Examples
• A customer started dropping data because they
reverted to an unsupported API

• Compare TSDB write throughput of two different write
strategies

• Create per-service capacity reports

• Identify memory usage patterns across our Analytics
service

• Create a detector for every previously uncaught error
conditions - postmortem output
SignalFx
Summary
Summary
• Microservice architecture is inherently complex

• Measure all the things

• Use data analytics techniques to

• Identify problems

• Chase down root cause

• Use intelligent detectors to catch recurrence
SignalFx
Questions
SignalFx
Thank You!
Phillip Liu
phillip@signalfx.com
WE’RE HIRING
http://signalfx.com/careers.html

More Related Content

What's hot

Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...Nagios
 
Open-source vs. public cloud in the Big Data landscape. Friends or Foes?
Open-source vs. public cloud in the Big Data landscape. Friends or Foes?Open-source vs. public cloud in the Big Data landscape. Friends or Foes?
Open-source vs. public cloud in the Big Data landscape. Friends or Foes?GetInData
 
Flink Forward SF 2017: Scott Kidder - Building a Real-Time Anomaly-Detection ...
Flink Forward SF 2017: Scott Kidder - Building a Real-Time Anomaly-Detection ...Flink Forward SF 2017: Scott Kidder - Building a Real-Time Anomaly-Detection ...
Flink Forward SF 2017: Scott Kidder - Building a Real-Time Anomaly-Detection ...Flink Forward
 
OSMC 2021 | Use OpenSource monitoring for an Enterprise Grade Platform
OSMC 2021 | Use OpenSource monitoring for an Enterprise Grade PlatformOSMC 2021 | Use OpenSource monitoring for an Enterprise Grade Platform
OSMC 2021 | Use OpenSource monitoring for an Enterprise Grade PlatformNETWAYS
 
Spark Compute as a Service at Paypal with Prabhu Kasinathan
Spark Compute as a Service at Paypal with Prabhu KasinathanSpark Compute as a Service at Paypal with Prabhu Kasinathan
Spark Compute as a Service at Paypal with Prabhu KasinathanDatabricks
 
OSMC 2021 | Monitoring Open Source Hardware
OSMC 2021 | Monitoring Open Source HardwareOSMC 2021 | Monitoring Open Source Hardware
OSMC 2021 | Monitoring Open Source HardwareNETWAYS
 
Flink Forward San Francisco 2018: Andrew Gao & Jeff Sharpe - "Finding Bad Ac...
Flink Forward San Francisco 2018: Andrew Gao &  Jeff Sharpe - "Finding Bad Ac...Flink Forward San Francisco 2018: Andrew Gao &  Jeff Sharpe - "Finding Bad Ac...
Flink Forward San Francisco 2018: Andrew Gao & Jeff Sharpe - "Finding Bad Ac...Flink Forward
 
Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015
Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015
Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015DevOpsDays Tel Aviv
 
Live Demo Jam Expands: The Leading-Edge Streaming Data Platform with NiFi, Ka...
Live Demo Jam Expands: The Leading-Edge Streaming Data Platform with NiFi, Ka...Live Demo Jam Expands: The Leading-Edge Streaming Data Platform with NiFi, Ka...
Live Demo Jam Expands: The Leading-Edge Streaming Data Platform with NiFi, Ka...Timothy Spann
 
OSMC 2021 | Monitoring Open Infrastructure Logs – With Real Life Examples
OSMC 2021 | Monitoring Open Infrastructure Logs – With Real Life ExamplesOSMC 2021 | Monitoring Open Infrastructure Logs – With Real Life Examples
OSMC 2021 | Monitoring Open Infrastructure Logs – With Real Life ExamplesNETWAYS
 
Sumit Goel - Monitoring Cloud Applications Using Zabbix | ZabConf2016
Sumit Goel - Monitoring Cloud Applications Using Zabbix | ZabConf2016Sumit Goel - Monitoring Cloud Applications Using Zabbix | ZabConf2016
Sumit Goel - Monitoring Cloud Applications Using Zabbix | ZabConf2016Zabbix
 
Virtual Flink Forward 2020: Data driven matchmaking streaming at Hyperconnect...
Virtual Flink Forward 2020: Data driven matchmaking streaming at Hyperconnect...Virtual Flink Forward 2020: Data driven matchmaking streaming at Hyperconnect...
Virtual Flink Forward 2020: Data driven matchmaking streaming at Hyperconnect...Flink Forward
 
stackconf 2020 | Ignite talk: Opensource in Advanced Research Computing, How ...
stackconf 2020 | Ignite talk: Opensource in Advanced Research Computing, How ...stackconf 2020 | Ignite talk: Opensource in Advanced Research Computing, How ...
stackconf 2020 | Ignite talk: Opensource in Advanced Research Computing, How ...NETWAYS
 
Nagios Conference 2014 - Mike Weber - Nagios Rapid Deployment Options
Nagios Conference 2014 - Mike Weber - Nagios Rapid Deployment OptionsNagios Conference 2014 - Mike Weber - Nagios Rapid Deployment Options
Nagios Conference 2014 - Mike Weber - Nagios Rapid Deployment OptionsNagios
 
From MapReduce to Apache Spark
From MapReduce to Apache SparkFrom MapReduce to Apache Spark
From MapReduce to Apache SparkJen Aman
 
Security From The Big Data and Analytics Perspective
Security From The Big Data and Analytics PerspectiveSecurity From The Big Data and Analytics Perspective
Security From The Big Data and Analytics PerspectiveAll Things Open
 
Extending Sysdig with Chisel
Extending Sysdig with ChiselExtending Sysdig with Chisel
Extending Sysdig with ChiselSysdig
 

What's hot (20)

Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...
Marcelo Perazolo, Lead Software Architect, IBM Corporation - Monitoring a Pow...
 
Open-source vs. public cloud in the Big Data landscape. Friends or Foes?
Open-source vs. public cloud in the Big Data landscape. Friends or Foes?Open-source vs. public cloud in the Big Data landscape. Friends or Foes?
Open-source vs. public cloud in the Big Data landscape. Friends or Foes?
 
Flink Forward SF 2017: Scott Kidder - Building a Real-Time Anomaly-Detection ...
Flink Forward SF 2017: Scott Kidder - Building a Real-Time Anomaly-Detection ...Flink Forward SF 2017: Scott Kidder - Building a Real-Time Anomaly-Detection ...
Flink Forward SF 2017: Scott Kidder - Building a Real-Time Anomaly-Detection ...
 
Hystrix
HystrixHystrix
Hystrix
 
OSMC 2021 | Use OpenSource monitoring for an Enterprise Grade Platform
OSMC 2021 | Use OpenSource monitoring for an Enterprise Grade PlatformOSMC 2021 | Use OpenSource monitoring for an Enterprise Grade Platform
OSMC 2021 | Use OpenSource monitoring for an Enterprise Grade Platform
 
Datadog- Monitoring In Motion
Datadog- Monitoring In Motion Datadog- Monitoring In Motion
Datadog- Monitoring In Motion
 
Spark Compute as a Service at Paypal with Prabhu Kasinathan
Spark Compute as a Service at Paypal with Prabhu KasinathanSpark Compute as a Service at Paypal with Prabhu Kasinathan
Spark Compute as a Service at Paypal with Prabhu Kasinathan
 
OSMC 2021 | Monitoring Open Source Hardware
OSMC 2021 | Monitoring Open Source HardwareOSMC 2021 | Monitoring Open Source Hardware
OSMC 2021 | Monitoring Open Source Hardware
 
Flink Forward San Francisco 2018: Andrew Gao & Jeff Sharpe - "Finding Bad Ac...
Flink Forward San Francisco 2018: Andrew Gao &  Jeff Sharpe - "Finding Bad Ac...Flink Forward San Francisco 2018: Andrew Gao &  Jeff Sharpe - "Finding Bad Ac...
Flink Forward San Francisco 2018: Andrew Gao & Jeff Sharpe - "Finding Bad Ac...
 
Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015
Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015
Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015
 
Live Demo Jam Expands: The Leading-Edge Streaming Data Platform with NiFi, Ka...
Live Demo Jam Expands: The Leading-Edge Streaming Data Platform with NiFi, Ka...Live Demo Jam Expands: The Leading-Edge Streaming Data Platform with NiFi, Ka...
Live Demo Jam Expands: The Leading-Edge Streaming Data Platform with NiFi, Ka...
 
OSMC 2021 | Monitoring Open Infrastructure Logs – With Real Life Examples
OSMC 2021 | Monitoring Open Infrastructure Logs – With Real Life ExamplesOSMC 2021 | Monitoring Open Infrastructure Logs – With Real Life Examples
OSMC 2021 | Monitoring Open Infrastructure Logs – With Real Life Examples
 
Sumit Goel - Monitoring Cloud Applications Using Zabbix | ZabConf2016
Sumit Goel - Monitoring Cloud Applications Using Zabbix | ZabConf2016Sumit Goel - Monitoring Cloud Applications Using Zabbix | ZabConf2016
Sumit Goel - Monitoring Cloud Applications Using Zabbix | ZabConf2016
 
Virtual Flink Forward 2020: Data driven matchmaking streaming at Hyperconnect...
Virtual Flink Forward 2020: Data driven matchmaking streaming at Hyperconnect...Virtual Flink Forward 2020: Data driven matchmaking streaming at Hyperconnect...
Virtual Flink Forward 2020: Data driven matchmaking streaming at Hyperconnect...
 
stackconf 2020 | Ignite talk: Opensource in Advanced Research Computing, How ...
stackconf 2020 | Ignite talk: Opensource in Advanced Research Computing, How ...stackconf 2020 | Ignite talk: Opensource in Advanced Research Computing, How ...
stackconf 2020 | Ignite talk: Opensource in Advanced Research Computing, How ...
 
Fully automated kubernetes deployment and management
Fully automated kubernetes deployment and managementFully automated kubernetes deployment and management
Fully automated kubernetes deployment and management
 
Nagios Conference 2014 - Mike Weber - Nagios Rapid Deployment Options
Nagios Conference 2014 - Mike Weber - Nagios Rapid Deployment OptionsNagios Conference 2014 - Mike Weber - Nagios Rapid Deployment Options
Nagios Conference 2014 - Mike Weber - Nagios Rapid Deployment Options
 
From MapReduce to Apache Spark
From MapReduce to Apache SparkFrom MapReduce to Apache Spark
From MapReduce to Apache Spark
 
Security From The Big Data and Analytics Perspective
Security From The Big Data and Analytics PerspectiveSecurity From The Big Data and Analytics Perspective
Security From The Big Data and Analytics Perspective
 
Extending Sysdig with Chisel
Extending Sysdig with ChiselExtending Sysdig with Chisel
Extending Sysdig with Chisel
 

Viewers also liked

SignalFx Kafka Consumer Optimization
SignalFx Kafka Consumer OptimizationSignalFx Kafka Consumer Optimization
SignalFx Kafka Consumer OptimizationSignalFx
 
SignalFx: Making Cassandra Perform as a Time Series Database
SignalFx: Making Cassandra Perform as a Time Series DatabaseSignalFx: Making Cassandra Perform as a Time Series Database
SignalFx: Making Cassandra Perform as a Time Series DatabaseDataStax Academy
 
Making Cassandra Perform as a Time Series Database - Cassandra Summit 15
Making Cassandra Perform as a Time Series Database - Cassandra Summit 15Making Cassandra Perform as a Time Series Database - Cassandra Summit 15
Making Cassandra Perform as a Time Series Database - Cassandra Summit 15SignalFx
 
Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...SignalFx
 
Storing time series data with Apache Cassandra
Storing time series data with Apache CassandraStoring time series data with Apache Cassandra
Storing time series data with Apache CassandraPatrick McFadin
 
Microservices and Devs in Charge: Why Monitoring is an Analytics Problem
Microservices and Devs in Charge: Why Monitoring is an Analytics ProblemMicroservices and Devs in Charge: Why Monitoring is an Analytics Problem
Microservices and Devs in Charge: Why Monitoring is an Analytics ProblemSignalFx
 
Operationalizing Docker at Scale: Lessons from Running Microservices in Produ...
Operationalizing Docker at Scale: Lessons from Running Microservices in Produ...Operationalizing Docker at Scale: Lessons from Running Microservices in Produ...
Operationalizing Docker at Scale: Lessons from Running Microservices in Produ...SignalFx
 
Putting Kafka Into Overdrive
Putting Kafka Into OverdrivePutting Kafka Into Overdrive
Putting Kafka Into OverdriveTodd Palino
 
Tuning Kafka for Fun and Profit
Tuning Kafka for Fun and ProfitTuning Kafka for Fun and Profit
Tuning Kafka for Fun and ProfitTodd Palino
 
Go debugging and troubleshooting tips - from real life lessons at SignalFx
Go debugging and troubleshooting tips - from real life lessons at SignalFxGo debugging and troubleshooting tips - from real life lessons at SignalFx
Go debugging and troubleshooting tips - from real life lessons at SignalFxSignalFx
 
Kafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier ArchitecturesKafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier ArchitecturesTodd Palino
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaJiangjie Qin
 
Cassandra Compression and Performance Evaluation
Cassandra Compression and Performance EvaluationCassandra Compression and Performance Evaluation
Cassandra Compression and Performance EvaluationSchubert Zhang
 

Viewers also liked (15)

SignalFx Kafka Consumer Optimization
SignalFx Kafka Consumer OptimizationSignalFx Kafka Consumer Optimization
SignalFx Kafka Consumer Optimization
 
SignalFx: Making Cassandra Perform as a Time Series Database
SignalFx: Making Cassandra Perform as a Time Series DatabaseSignalFx: Making Cassandra Perform as a Time Series Database
SignalFx: Making Cassandra Perform as a Time Series Database
 
Making Cassandra Perform as a Time Series Database - Cassandra Summit 15
Making Cassandra Perform as a Time Series Database - Cassandra Summit 15Making Cassandra Perform as a Time Series Database - Cassandra Summit 15
Making Cassandra Perform as a Time Series Database - Cassandra Summit 15
 
Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...
 
Storing time series data with Apache Cassandra
Storing time series data with Apache CassandraStoring time series data with Apache Cassandra
Storing time series data with Apache Cassandra
 
Docker {at,with} SignalFx
Docker {at,with} SignalFxDocker {at,with} SignalFx
Docker {at,with} SignalFx
 
Microservices and Devs in Charge: Why Monitoring is an Analytics Problem
Microservices and Devs in Charge: Why Monitoring is an Analytics ProblemMicroservices and Devs in Charge: Why Monitoring is an Analytics Problem
Microservices and Devs in Charge: Why Monitoring is an Analytics Problem
 
Operationalizing Docker at Scale: Lessons from Running Microservices in Produ...
Operationalizing Docker at Scale: Lessons from Running Microservices in Produ...Operationalizing Docker at Scale: Lessons from Running Microservices in Produ...
Operationalizing Docker at Scale: Lessons from Running Microservices in Produ...
 
Putting Kafka Into Overdrive
Putting Kafka Into OverdrivePutting Kafka Into Overdrive
Putting Kafka Into Overdrive
 
Tuning Kafka for Fun and Profit
Tuning Kafka for Fun and ProfitTuning Kafka for Fun and Profit
Tuning Kafka for Fun and Profit
 
Go debugging and troubleshooting tips - from real life lessons at SignalFx
Go debugging and troubleshooting tips - from real life lessons at SignalFxGo debugging and troubleshooting tips - from real life lessons at SignalFx
Go debugging and troubleshooting tips - from real life lessons at SignalFx
 
Kafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier ArchitecturesKafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier Architectures
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache Kafka
 
Cassandra Compression and Performance Evaluation
Cassandra Compression and Performance EvaluationCassandra Compression and Performance Evaluation
Cassandra Compression and Performance Evaluation
 

Similar to AWS Loft Talk: Behind the Scenes with SignalFx

Why monitoring is an analytics problem
Why monitoring is an analytics problemWhy monitoring is an analytics problem
Why monitoring is an analytics problemPhillip Liu
 
How to go from waterfall app dev to secure agile development in 2 weeks
How to go from waterfall app dev to secure agile development in 2 weeks How to go from waterfall app dev to secure agile development in 2 weeks
How to go from waterfall app dev to secure agile development in 2 weeks Ulf Mattsson
 
Cerberus : Framework for Manual and Automated Testing (Web Application)
Cerberus : Framework for Manual and Automated Testing (Web Application)Cerberus : Framework for Manual and Automated Testing (Web Application)
Cerberus : Framework for Manual and Automated Testing (Web Application)CIVEL Benoit
 
Cerberus_Presentation1
Cerberus_Presentation1Cerberus_Presentation1
Cerberus_Presentation1CIVEL Benoit
 
Intro to GitOps with Weave GitOps, Flagger and Linkerd
Intro to GitOps with Weave GitOps, Flagger and LinkerdIntro to GitOps with Weave GitOps, Flagger and Linkerd
Intro to GitOps with Weave GitOps, Flagger and LinkerdWeaveworks
 
Cncf checkov and bridgecrew
Cncf checkov and bridgecrewCncf checkov and bridgecrew
Cncf checkov and bridgecrewLibbySchulze
 
apidays LIVE Paris - Serverless security: how to protect what you don't see? ...
apidays LIVE Paris - Serverless security: how to protect what you don't see? ...apidays LIVE Paris - Serverless security: how to protect what you don't see? ...
apidays LIVE Paris - Serverless security: how to protect what you don't see? ...apidays
 
Netflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open SourceNetflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open Sourceaspyker
 
Pragmatic Pipeline Security
Pragmatic Pipeline SecurityPragmatic Pipeline Security
Pragmatic Pipeline SecurityJames Wickett
 
CENTRAL MANAGEMENT OF NETWORK AND CALL SERVICES
CENTRAL MANAGEMENT OF NETWORK AND CALL SERVICESCENTRAL MANAGEMENT OF NETWORK AND CALL SERVICES
CENTRAL MANAGEMENT OF NETWORK AND CALL SERVICESNazmul Hossain Rakib
 
The differing ways to monitor and instrument
The differing ways to monitor and instrumentThe differing ways to monitor and instrument
The differing ways to monitor and instrumentJonah Kowall
 
Serverless security - how to protect what you don't see?
Serverless security - how to protect what you don't see?Serverless security - how to protect what you don't see?
Serverless security - how to protect what you don't see?Sqreen
 
Leveraging Analytics for DevOps
Leveraging Analytics for DevOpsLeveraging Analytics for DevOps
Leveraging Analytics for DevOpsMichael Floyd
 
AWS live hack: Atlassian + Snyk OSS on AWS
AWS live hack: Atlassian + Snyk OSS on AWSAWS live hack: Atlassian + Snyk OSS on AWS
AWS live hack: Atlassian + Snyk OSS on AWSEric Smalling
 
Bangalore OpenMSA DevDay - September 19, 2018
Bangalore OpenMSA DevDay - September 19, 2018Bangalore OpenMSA DevDay - September 19, 2018
Bangalore OpenMSA DevDay - September 19, 2018UBiqube
 
Making the Shift from DevOps to Practical DevSecOps | Sumo Logic Webinar
Making the Shift from DevOps to Practical DevSecOps | Sumo Logic WebinarMaking the Shift from DevOps to Practical DevSecOps | Sumo Logic Webinar
Making the Shift from DevOps to Practical DevSecOps | Sumo Logic WebinarSumo Logic
 
How to Manage the Risk of your Polyglot Environments
How to Manage the Risk of your Polyglot EnvironmentsHow to Manage the Risk of your Polyglot Environments
How to Manage the Risk of your Polyglot EnvironmentsDevOps.com
 
Splunk: Forward me the REST of those shells
Splunk: Forward me the REST of those shellsSplunk: Forward me the REST of those shells
Splunk: Forward me the REST of those shellsAnthony D Hendricks
 
SDLC & DevOps Transformation with Agile
SDLC & DevOps Transformation with AgileSDLC & DevOps Transformation with Agile
SDLC & DevOps Transformation with AgileAbdel Moneim Emad
 
Making Security Agile
Making Security AgileMaking Security Agile
Making Security AgileOleg Gryb
 

Similar to AWS Loft Talk: Behind the Scenes with SignalFx (20)

Why monitoring is an analytics problem
Why monitoring is an analytics problemWhy monitoring is an analytics problem
Why monitoring is an analytics problem
 
How to go from waterfall app dev to secure agile development in 2 weeks
How to go from waterfall app dev to secure agile development in 2 weeks How to go from waterfall app dev to secure agile development in 2 weeks
How to go from waterfall app dev to secure agile development in 2 weeks
 
Cerberus : Framework for Manual and Automated Testing (Web Application)
Cerberus : Framework for Manual and Automated Testing (Web Application)Cerberus : Framework for Manual and Automated Testing (Web Application)
Cerberus : Framework for Manual and Automated Testing (Web Application)
 
Cerberus_Presentation1
Cerberus_Presentation1Cerberus_Presentation1
Cerberus_Presentation1
 
Intro to GitOps with Weave GitOps, Flagger and Linkerd
Intro to GitOps with Weave GitOps, Flagger and LinkerdIntro to GitOps with Weave GitOps, Flagger and Linkerd
Intro to GitOps with Weave GitOps, Flagger and Linkerd
 
Cncf checkov and bridgecrew
Cncf checkov and bridgecrewCncf checkov and bridgecrew
Cncf checkov and bridgecrew
 
apidays LIVE Paris - Serverless security: how to protect what you don't see? ...
apidays LIVE Paris - Serverless security: how to protect what you don't see? ...apidays LIVE Paris - Serverless security: how to protect what you don't see? ...
apidays LIVE Paris - Serverless security: how to protect what you don't see? ...
 
Netflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open SourceNetflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open Source
 
Pragmatic Pipeline Security
Pragmatic Pipeline SecurityPragmatic Pipeline Security
Pragmatic Pipeline Security
 
CENTRAL MANAGEMENT OF NETWORK AND CALL SERVICES
CENTRAL MANAGEMENT OF NETWORK AND CALL SERVICESCENTRAL MANAGEMENT OF NETWORK AND CALL SERVICES
CENTRAL MANAGEMENT OF NETWORK AND CALL SERVICES
 
The differing ways to monitor and instrument
The differing ways to monitor and instrumentThe differing ways to monitor and instrument
The differing ways to monitor and instrument
 
Serverless security - how to protect what you don't see?
Serverless security - how to protect what you don't see?Serverless security - how to protect what you don't see?
Serverless security - how to protect what you don't see?
 
Leveraging Analytics for DevOps
Leveraging Analytics for DevOpsLeveraging Analytics for DevOps
Leveraging Analytics for DevOps
 
AWS live hack: Atlassian + Snyk OSS on AWS
AWS live hack: Atlassian + Snyk OSS on AWSAWS live hack: Atlassian + Snyk OSS on AWS
AWS live hack: Atlassian + Snyk OSS on AWS
 
Bangalore OpenMSA DevDay - September 19, 2018
Bangalore OpenMSA DevDay - September 19, 2018Bangalore OpenMSA DevDay - September 19, 2018
Bangalore OpenMSA DevDay - September 19, 2018
 
Making the Shift from DevOps to Practical DevSecOps | Sumo Logic Webinar
Making the Shift from DevOps to Practical DevSecOps | Sumo Logic WebinarMaking the Shift from DevOps to Practical DevSecOps | Sumo Logic Webinar
Making the Shift from DevOps to Practical DevSecOps | Sumo Logic Webinar
 
How to Manage the Risk of your Polyglot Environments
How to Manage the Risk of your Polyglot EnvironmentsHow to Manage the Risk of your Polyglot Environments
How to Manage the Risk of your Polyglot Environments
 
Splunk: Forward me the REST of those shells
Splunk: Forward me the REST of those shellsSplunk: Forward me the REST of those shells
Splunk: Forward me the REST of those shells
 
SDLC & DevOps Transformation with Agile
SDLC & DevOps Transformation with AgileSDLC & DevOps Transformation with Agile
SDLC & DevOps Transformation with Agile
 
Making Security Agile
Making Security AgileMaking Security Agile
Making Security Agile
 

Recently uploaded

OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureEric D. Schabell
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesDavid Newbury
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1DianaGray10
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPathCommunity
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Brian Pichman
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UbiTrack UK
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Will Schroeder
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxMatsuo Lab
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdfPedro Manuel
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostMatt Ray
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Adtran
 

Recently uploaded (20)

OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond Ontologies
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation Developers
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 
20150722 - AGV
20150722 - AGV20150722 - AGV
20150722 - AGV
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdf
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 

AWS Loft Talk: Behind the Scenes with SignalFx

  • 2. SignalFx Behind the Scenes with SignalFx Phillip Liu phillip@signalfx.com @SignalFx - signalfx.com
  • 3. Agenda • Background • Overview of Key SignalFx Services • SignalFx infrastructure and operations • Analytics approach to monitoring • Code push side effects, an example • Summary
  • 5. About Me [2013 - ] SignalFx - Founder, CTO, Software Engineer Microservices; Monitoring using Analytics [2008 - 2012] Facebook - Software Engineer, Software Architect Hyperscale SOA; Monitoring using Nagios, Ganglia, and in-house Analytics [2004 - 2008] Opsware - Chief Architect, Software Engineer Monolithic Architecture; Monitoring using Ganglia, Nagios, Splunk [2000 - 2004] Loudcloud - Software Engineer LAMP, Application Server; Monitoring using SNMP, Ganglia, NetCool [1998 - 2000] Marimba - Software Engineer Client / Server; Monitoring using SNMP, FreshWater Software [ … ]
  • 8. A Microservices Definition Loosely coupled service oriented architecture with bounded context. Adrian Cockcroft
  • 9. Overview of Key SignalFx Services
  • 10. Microservice Complexity More than 15 internal services. Services span hundreds of instances across multiple AZs. Have dependencies on tens of external services.
  • 14. Shared Responsibility • Engineering is organized around services they provide • No dedicated operations team • Each service team is responsible for building and operating their services • Infrastructure team provides IaaS - DNS, LB, Mail, Server, and Network configuration and provisioning • Ingest team provides Ingest API, Quantization, and TSDB services
  • 15. Continuous Build and Deployment • Services are built and tested on each commit • Each service deploy at their cadence • Nearly all deployments are non-disruptive • Push to lab, test; push product canary, test; rest of prod • Service engineered to be resilient to partial cluster availability • Each service is engineered to support +1/-1 upgrades
  • 16. On-call Rotation • All dev on weekly on-call rotation (couple of times a year) • On-call works on operational tools • On-call rotates from lab -> production • On-call is the incident manager • Owns driving both black out and brown out incidents to resolution
  • 17. Operations Tools sfhost - CLI for VM configuration and provisioning sfc - console to access management data for all services signalscope - deep transactions tracing maestro - Docker orchestrator jenkins - continuous build and deployment
  • 18. Monitoring • We use SignalFx to monitor SignalFx • Engineers instrument their code as part of dev process • Each service provides at least one dashboard • CollectD for OS and Docker metrics on all VMs • Yammer metrics for all Java app servers • Custom logger to count exception types
  • 19. Monitoring - API Service Dashboard
  • 21. Monitoring Challenges • High iteration rate leads to shortened test cycles • Integration test combinations are intractable • Catch problems during rolling deployments • Identify upstream/downstream side effects • e.g. backpressure • Identify brownouts before the customer • etc.
  • 22. Analytics Approach to Monitoring Measure
  • 23. Analytics Approach to Monitoring Analyze
  • 24. Analytics Approach to Monitoring Detect
  • 26. Code Push Side Effects - Time Series Router
  • 27. Code Push Side Effects Push canary instance and Metadata API dashboard shows healthy tier.
  • 28. Code Push Side Effects However, upstream UI dashboard showed unusual # of timeouts.
  • 29. Code Push Side Effects In search of root cause. Always safe to start by looking at exception counts. Can’t derive much from all the noise.
  • 30. Code Push Side Effects Sum the # of exceptions to create a single signal.
  • 31. Code Push Side Effects Compare sum with time-shifted sum from a day ago.
  • 32. Code Push Side Effects Look at an outlier host - an Analytics service host.
  • 33. Code Push Side Effects java.io.InvalidObjectException: enum constant MURMUR128_MITZ_64 does not exist in class com.google.common.hash.BloomFilterStrategies at java.io.ObjectInputStream.readEnum(ObjectInputStream.java:1743) ~[na: 1.7.0_79] at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347) ~[na:1.7.0_79] at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java: 1990) ~[na:1.7.0_79] … Looking at Analytic’s logs revealed source of the problem.
  • 34. Code Push Side Effects • Analytics across multiple microservices reduced time to identify problem. From push to resolution was ~15min • Service instrumentation helped narrowed down root cause • Discovery allowed us to create a detector using analytics to notify similar problems in the future
  • 35. Other Examples • A customer started dropping data because they reverted to an unsupported API • Compare TSDB write throughput of two different write strategies • Create per-service capacity reports • Identify memory usage patterns across our Analytics service • Create a detector for every previously uncaught error conditions - postmortem output
  • 37. Summary • Microservice architecture is inherently complex • Measure all the things • Use data analytics techniques to • Identify problems • Chase down root cause • Use intelligent detectors to catch recurrence
  • 39. SignalFx Thank You! Phillip Liu phillip@signalfx.com WE’RE HIRING http://signalfx.com/careers.html