SlideShare a Scribd company logo
SignalFx
Microservices and Devs in Charge:
Why Monitoring is an Analytics Problem
SignalFx
Microservices and Devs in Charge:
Why Monitoring is an Analytics Problem
Phillip Liu
phillip@signalfx.com
@SignalFx - signalfx.com
Agenda
• My background
• Microservices, a review
• Analytics approach to monitoring
• Code push side effects, an example
• Summary
SignalFx
My Background
Experience
[2013 - ] SignalFx - Founder, CTO, Software Engineer
Microservices; Monitoring using Analytics
[2008 - 2012] Facebook - Software Engineer, Software Architect
Hyperscale SOA; Monitoring using Nagios, Ganglia, and in-house
Analytics
[2004 - 2008] Opsware - Chief Architect, Software Engineer
Monolithic Architecture; Monitoring using Ganglia, Nagios, Splunk
[2000 - 2004] Loudcloud - Software Engineer
LAMP, Application Server; Monitoring using SNMP, Ganglia, NetCool
[1998 - 2000] Marimba - Software Engineer
Client / Server; Monitoring using SNMP, FreshWater Software
[ … ]
SignalFx
Microservices, a Review
A Microservices Definition
Loosely coupled service
oriented architecture with
bounded context.
Adrian Cockcroft
SignalFx’s Microservices
More than 15 internal services.
Spanning hundreds of
instances.
Across 3 AZs.
Have dependencies on
tens of external services.
Monitoring Challenges
• High iteration rate leads to shortened test
cycles
• Integration test combinations are intractable
• Catch problems during rolling deployments
• Identify upstream/downstream side effects
• e.g. backpressure
• Identify brownouts before the customer
• etc.
SignalFx
Analytics Approach to Monitoring
Measure
Store
Analyze
Detect
SignalFx
Examples
Monitoring at SignalFx
•We use SignalFx to monitor SignalFx
•CollectD for OS and Docker metrics on all VMs
•Yammer metrics for all Java app servers
•Custom logger to count exception types
•All metrics are sent to an analytics service
•Each service deploy a their cadence
•Push lab, then canary in prod, then rest of tier
Code Push Side Effects
Code Push Side Effects
Push canary instance and Metadata API
dashboard shows healthy tier.
Code Push Side Effects
However, upstream UI dashboard
showed unusual # of timeouts.
Code Push Side Effects
In search of root cause.
Always safe to start by looking at exception counts.
Can’t derive much from all the noise.
Code Push Side Effects
Sum the # of exceptions to create a single signal.
Code Push Side Effects
Compare sum with time-shifted sum from a day ago.
Code Push Side Effects
Look at an outlier host - an Analytics
service host.
Code Push Side Effects
java.io.InvalidObjectException: enum constant MURMUR128_MITZ_64 does
not exist in class com.google.common.hash.BloomFilterStrategies
at java.io.ObjectInputStream.readEnum(ObjectInputStream.java:1743) ~[na:
1.7.0_79]
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347)
~[na:1.7.0_79]
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:
1990) ~[na:1.7.0_79]
…
Looking at Analytic’s logs revealed
source of the problem.
Code Push Side Effects
• Analytics across multiple microservices reduced
time to identify problem. From push to resolution
was ~15min
• Service instrumentation helped narrowed down
root cause
• Discovery allowed us to create a detector using
analytics to notify similar problems in the future
Other Examples
• A customer started dropping data because they
reverted to an unsupported API
• Compare tsdb write throughput of two different
write strategies
• Create per-service capacity reports
• Identify memory usage patterns across our
Analytics service
• Create a detector for every previously uncaught
error conditions - postmortem output
SignalFx
Summary
• Measure and Store as much metrics and events as
possible
• Use data analytics techniques to
• Identify problems
• Chase down root cause
• Create analytics based detectors to notify you of
recurrence
SignalFx
Thank You!
Phillip Liu
phillip@signalfx.com
WE’RE HIRING
jobs@signalfx.com
@SignalFx - signalfx.com

More Related Content

What's hot

Opentelemetry - From frontend to backend
Opentelemetry - From frontend to backendOpentelemetry - From frontend to backend
Opentelemetry - From frontend to backend
Sebastian Poxhofer
 
Open-source vs. public cloud in the Big Data landscape. Friends or Foes?
Open-source vs. public cloud in the Big Data landscape. Friends or Foes?Open-source vs. public cloud in the Big Data landscape. Friends or Foes?
Open-source vs. public cloud in the Big Data landscape. Friends or Foes?
GetInData
 
Extending the Enterprise with MEF
Extending the Enterprise with MEFExtending the Enterprise with MEF
Extending the Enterprise with MEF
Brian Ritchie
 
Serverless security - how to protect what you don't see?
Serverless security - how to protect what you don't see?Serverless security - how to protect what you don't see?
Serverless security - how to protect what you don't see?
Sqreen
 
Getting Started: Intro to Telegraf - July 2021
Getting Started: Intro to Telegraf - July 2021Getting Started: Intro to Telegraf - July 2021
Getting Started: Intro to Telegraf - July 2021
InfluxData
 
Hystrix make your app bullet proof
Hystrix   make your app bullet proofHystrix   make your app bullet proof
Hystrix make your app bullet proof
Marcin Kłopotek
 
Deployment Automation & Self-Healing with Dynatrace & Ansible
Deployment Automation & Self-Healing with Dynatrace & AnsibleDeployment Automation & Self-Healing with Dynatrace & Ansible
Deployment Automation & Self-Healing with Dynatrace & Ansible
Jürgen Etzlstorfer
 
From 70 Networking Tasks to a Single Click by WWT: Building an F5 Solution wi...
From 70 Networking Tasks to a Single Click by WWT: Building an F5 Solution wi...From 70 Networking Tasks to a Single Click by WWT: Building an F5 Solution wi...
From 70 Networking Tasks to a Single Click by WWT: Building an F5 Solution wi...
Joel W. King
 
FOSDEM 2021 - Infrastructure as Code Drift & Driftctl
FOSDEM 2021 - Infrastructure as Code Drift & DriftctlFOSDEM 2021 - Infrastructure as Code Drift & Driftctl
FOSDEM 2021 - Infrastructure as Code Drift & Driftctl
Stephane Jourdan
 
Eric Loyd - Fractal Nagios
Eric Loyd - Fractal NagiosEric Loyd - Fractal Nagios
Eric Loyd - Fractal Nagios
Nagios
 
ADDO Open Source Observability Tools
ADDO Open Source Observability Tools ADDO Open Source Observability Tools
ADDO Open Source Observability Tools
Mickey Boxell
 
Modern vSphere Monitoring and Dashboard using InfluxDB, Telegraf and Grafana
Modern vSphere Monitoring and Dashboard using InfluxDB, Telegraf and GrafanaModern vSphere Monitoring and Dashboard using InfluxDB, Telegraf and Grafana
Modern vSphere Monitoring and Dashboard using InfluxDB, Telegraf and Grafana
InfluxData
 
OSMC 2021 | Current State of Icinga
OSMC 2021 | Current State of IcingaOSMC 2021 | Current State of Icinga
OSMC 2021 | Current State of Icinga
NETWAYS
 
Policy as code what helm developers need to know about security
Policy as code  what helm developers need to know about securityPolicy as code  what helm developers need to know about security
Policy as code what helm developers need to know about security
LibbySchulze
 
Getting Started with Runtime Security on Azure Kubernetes Service (AKS)
Getting Started with Runtime Security on Azure Kubernetes Service (AKS)Getting Started with Runtime Security on Azure Kubernetes Service (AKS)
Getting Started with Runtime Security on Azure Kubernetes Service (AKS)
DevOps.com
 
Kubernetes meetup geneva june 2021
Kubernetes meetup geneva   june 2021Kubernetes meetup geneva   june 2021
Kubernetes meetup geneva june 2021
SebastienSEYMARC
 
Running Kafka and Spark on Raspberry PI with Azure and some .net magic
Running Kafka and Spark on Raspberry PI with Azure and some .net magicRunning Kafka and Spark on Raspberry PI with Azure and some .net magic
Running Kafka and Spark on Raspberry PI with Azure and some .net magic
Marco Parenzan
 
StackStorm Product Highlights - DevOps Enterprise 2014 After-Party Ignite Talk
StackStorm Product Highlights - DevOps Enterprise 2014 After-Party Ignite TalkStackStorm Product Highlights - DevOps Enterprise 2014 After-Party Ignite Talk
StackStorm Product Highlights - DevOps Enterprise 2014 After-Party Ignite Talk
StackStorm
 
Wipro Customer Presentation
Wipro Customer PresentationWipro Customer Presentation
Wipro Customer Presentation
Splunk
 
A Network Engineer's Approach to Automation
A Network Engineer's Approach to AutomationA Network Engineer's Approach to Automation
A Network Engineer's Approach to Automation
Jeremy Schulman
 

What's hot (20)

Opentelemetry - From frontend to backend
Opentelemetry - From frontend to backendOpentelemetry - From frontend to backend
Opentelemetry - From frontend to backend
 
Open-source vs. public cloud in the Big Data landscape. Friends or Foes?
Open-source vs. public cloud in the Big Data landscape. Friends or Foes?Open-source vs. public cloud in the Big Data landscape. Friends or Foes?
Open-source vs. public cloud in the Big Data landscape. Friends or Foes?
 
Extending the Enterprise with MEF
Extending the Enterprise with MEFExtending the Enterprise with MEF
Extending the Enterprise with MEF
 
Serverless security - how to protect what you don't see?
Serverless security - how to protect what you don't see?Serverless security - how to protect what you don't see?
Serverless security - how to protect what you don't see?
 
Getting Started: Intro to Telegraf - July 2021
Getting Started: Intro to Telegraf - July 2021Getting Started: Intro to Telegraf - July 2021
Getting Started: Intro to Telegraf - July 2021
 
Hystrix make your app bullet proof
Hystrix   make your app bullet proofHystrix   make your app bullet proof
Hystrix make your app bullet proof
 
Deployment Automation & Self-Healing with Dynatrace & Ansible
Deployment Automation & Self-Healing with Dynatrace & AnsibleDeployment Automation & Self-Healing with Dynatrace & Ansible
Deployment Automation & Self-Healing with Dynatrace & Ansible
 
From 70 Networking Tasks to a Single Click by WWT: Building an F5 Solution wi...
From 70 Networking Tasks to a Single Click by WWT: Building an F5 Solution wi...From 70 Networking Tasks to a Single Click by WWT: Building an F5 Solution wi...
From 70 Networking Tasks to a Single Click by WWT: Building an F5 Solution wi...
 
FOSDEM 2021 - Infrastructure as Code Drift & Driftctl
FOSDEM 2021 - Infrastructure as Code Drift & DriftctlFOSDEM 2021 - Infrastructure as Code Drift & Driftctl
FOSDEM 2021 - Infrastructure as Code Drift & Driftctl
 
Eric Loyd - Fractal Nagios
Eric Loyd - Fractal NagiosEric Loyd - Fractal Nagios
Eric Loyd - Fractal Nagios
 
ADDO Open Source Observability Tools
ADDO Open Source Observability Tools ADDO Open Source Observability Tools
ADDO Open Source Observability Tools
 
Modern vSphere Monitoring and Dashboard using InfluxDB, Telegraf and Grafana
Modern vSphere Monitoring and Dashboard using InfluxDB, Telegraf and GrafanaModern vSphere Monitoring and Dashboard using InfluxDB, Telegraf and Grafana
Modern vSphere Monitoring and Dashboard using InfluxDB, Telegraf and Grafana
 
OSMC 2021 | Current State of Icinga
OSMC 2021 | Current State of IcingaOSMC 2021 | Current State of Icinga
OSMC 2021 | Current State of Icinga
 
Policy as code what helm developers need to know about security
Policy as code  what helm developers need to know about securityPolicy as code  what helm developers need to know about security
Policy as code what helm developers need to know about security
 
Getting Started with Runtime Security on Azure Kubernetes Service (AKS)
Getting Started with Runtime Security on Azure Kubernetes Service (AKS)Getting Started with Runtime Security on Azure Kubernetes Service (AKS)
Getting Started with Runtime Security on Azure Kubernetes Service (AKS)
 
Kubernetes meetup geneva june 2021
Kubernetes meetup geneva   june 2021Kubernetes meetup geneva   june 2021
Kubernetes meetup geneva june 2021
 
Running Kafka and Spark on Raspberry PI with Azure and some .net magic
Running Kafka and Spark on Raspberry PI with Azure and some .net magicRunning Kafka and Spark on Raspberry PI with Azure and some .net magic
Running Kafka and Spark on Raspberry PI with Azure and some .net magic
 
StackStorm Product Highlights - DevOps Enterprise 2014 After-Party Ignite Talk
StackStorm Product Highlights - DevOps Enterprise 2014 After-Party Ignite TalkStackStorm Product Highlights - DevOps Enterprise 2014 After-Party Ignite Talk
StackStorm Product Highlights - DevOps Enterprise 2014 After-Party Ignite Talk
 
Wipro Customer Presentation
Wipro Customer PresentationWipro Customer Presentation
Wipro Customer Presentation
 
A Network Engineer's Approach to Automation
A Network Engineer's Approach to AutomationA Network Engineer's Approach to Automation
A Network Engineer's Approach to Automation
 

Viewers also liked

Fault, Errors, and Promise Theory
Fault, Errors, and Promise TheoryFault, Errors, and Promise Theory
Fault, Errors, and Promise Theory
Mark Burgess
 
Docker at and with SignalFx
Docker at and with SignalFxDocker at and with SignalFx
Docker at and with SignalFx
SignalFx
 
Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...
SignalFx
 
Making Cassandra Perform as a Time Series Database - Cassandra Summit 15
Making Cassandra Perform as a Time Series Database - Cassandra Summit 15Making Cassandra Perform as a Time Series Database - Cassandra Summit 15
Making Cassandra Perform as a Time Series Database - Cassandra Summit 15
SignalFx
 
SignalFx Elasticsearch Metrics Monitoring and Alerting
SignalFx Elasticsearch Metrics Monitoring and AlertingSignalFx Elasticsearch Metrics Monitoring and Alerting
SignalFx Elasticsearch Metrics Monitoring and Alerting
SignalFx
 
AWS Loft Talk: Behind the Scenes with SignalFx
AWS Loft Talk: Behind the Scenes with SignalFxAWS Loft Talk: Behind the Scenes with SignalFx
AWS Loft Talk: Behind the Scenes with SignalFx
SignalFx
 
SignalFx Kafka Consumer Optimization
SignalFx Kafka Consumer OptimizationSignalFx Kafka Consumer Optimization
SignalFx Kafka Consumer Optimization
SignalFx
 
Go debugging and troubleshooting tips - from real life lessons at SignalFx
Go debugging and troubleshooting tips - from real life lessons at SignalFxGo debugging and troubleshooting tips - from real life lessons at SignalFx
Go debugging and troubleshooting tips - from real life lessons at SignalFx
SignalFx
 

Viewers also liked (8)

Fault, Errors, and Promise Theory
Fault, Errors, and Promise TheoryFault, Errors, and Promise Theory
Fault, Errors, and Promise Theory
 
Docker at and with SignalFx
Docker at and with SignalFxDocker at and with SignalFx
Docker at and with SignalFx
 
Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...
 
Making Cassandra Perform as a Time Series Database - Cassandra Summit 15
Making Cassandra Perform as a Time Series Database - Cassandra Summit 15Making Cassandra Perform as a Time Series Database - Cassandra Summit 15
Making Cassandra Perform as a Time Series Database - Cassandra Summit 15
 
SignalFx Elasticsearch Metrics Monitoring and Alerting
SignalFx Elasticsearch Metrics Monitoring and AlertingSignalFx Elasticsearch Metrics Monitoring and Alerting
SignalFx Elasticsearch Metrics Monitoring and Alerting
 
AWS Loft Talk: Behind the Scenes with SignalFx
AWS Loft Talk: Behind the Scenes with SignalFxAWS Loft Talk: Behind the Scenes with SignalFx
AWS Loft Talk: Behind the Scenes with SignalFx
 
SignalFx Kafka Consumer Optimization
SignalFx Kafka Consumer OptimizationSignalFx Kafka Consumer Optimization
SignalFx Kafka Consumer Optimization
 
Go debugging and troubleshooting tips - from real life lessons at SignalFx
Go debugging and troubleshooting tips - from real life lessons at SignalFxGo debugging and troubleshooting tips - from real life lessons at SignalFx
Go debugging and troubleshooting tips - from real life lessons at SignalFx
 

Similar to Microservices and Devs in Charge: Why Monitoring is an Analytics Problem

Scaling security in a cloud environment v0.5 (Sep 2017)
Scaling security in a cloud environment  v0.5 (Sep 2017)Scaling security in a cloud environment  v0.5 (Sep 2017)
Scaling security in a cloud environment v0.5 (Sep 2017)
Dinis Cruz
 
MuleSoft Surat Virtual Meetup#4 - Anypoint Monitoring and MuleSoft dataloader.io
MuleSoft Surat Virtual Meetup#4 - Anypoint Monitoring and MuleSoft dataloader.ioMuleSoft Surat Virtual Meetup#4 - Anypoint Monitoring and MuleSoft dataloader.io
MuleSoft Surat Virtual Meetup#4 - Anypoint Monitoring and MuleSoft dataloader.io
Jitendra Bafna
 
Making Security Agile
Making Security AgileMaking Security Agile
Making Security Agile
Oleg Gryb
 
Netflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open SourceNetflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open Source
aspyker
 
MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...
MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...
MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...
MongoDB
 
apidays LIVE Paris - Serverless security: how to protect what you don't see? ...
apidays LIVE Paris - Serverless security: how to protect what you don't see? ...apidays LIVE Paris - Serverless security: how to protect what you don't see? ...
apidays LIVE Paris - Serverless security: how to protect what you don't see? ...
apidays
 
Top 10 Software to Detect & Prevent Security Vulnerabilities from BlackHat US...
Top 10 Software to Detect & Prevent Security Vulnerabilities from BlackHat US...Top 10 Software to Detect & Prevent Security Vulnerabilities from BlackHat US...
Top 10 Software to Detect & Prevent Security Vulnerabilities from BlackHat US...
Mobodexter
 
Dev{sec}ops
Dev{sec}opsDev{sec}ops
Dev{sec}ops
Steven Carlson
 
Pragmatic Pipeline Security
Pragmatic Pipeline SecurityPragmatic Pipeline Security
Pragmatic Pipeline Security
James Wickett
 
AWS live hack: Atlassian + Snyk OSS on AWS
AWS live hack: Atlassian + Snyk OSS on AWSAWS live hack: Atlassian + Snyk OSS on AWS
AWS live hack: Atlassian + Snyk OSS on AWS
Eric Smalling
 
Log Analytics for Distributed Microservices
Log Analytics for Distributed MicroservicesLog Analytics for Distributed Microservices
Log Analytics for Distributed Microservices
Kai Wähner
 
Bangalore OpenMSA DevDay - September 19, 2018
Bangalore OpenMSA DevDay - September 19, 2018Bangalore OpenMSA DevDay - September 19, 2018
Bangalore OpenMSA DevDay - September 19, 2018
UBiqube
 
Cncf checkov and bridgecrew
Cncf checkov and bridgecrewCncf checkov and bridgecrew
Cncf checkov and bridgecrew
LibbySchulze
 
Platform governance, gestire un ecosistema di microservizi a livello enterprise
Platform governance, gestire un ecosistema di microservizi a livello enterprisePlatform governance, gestire un ecosistema di microservizi a livello enterprise
Platform governance, gestire un ecosistema di microservizi a livello enterprise
Giulio Roggero
 
Code Coverage and Test Suite Effectiveness: Empirical Study with Real Bugs in...
Code Coverage and Test Suite Effectiveness: Empirical Study with Real Bugs in...Code Coverage and Test Suite Effectiveness: Empirical Study with Real Bugs in...
Code Coverage and Test Suite Effectiveness: Empirical Study with Real Bugs in...
Pavneet Singh Kochhar
 
Empowering developers and operators through Gitlab and HashiCorp
Empowering developers and operators through Gitlab and HashiCorpEmpowering developers and operators through Gitlab and HashiCorp
Empowering developers and operators through Gitlab and HashiCorp
Mitchell Pronschinske
 
OORPT Dynamic Analysis
OORPT Dynamic AnalysisOORPT Dynamic Analysis
OORPT Dynamic Analysislienhard
 
BrownResearch_CV
BrownResearch_CVBrownResearch_CV
BrownResearch_CVAbby Brown
 
Software Analytics: Data Analytics for Software Engineering and Security
Software Analytics: Data Analytics for Software Engineering and SecuritySoftware Analytics: Data Analytics for Software Engineering and Security
Software Analytics: Data Analytics for Software Engineering and Security
Tao Xie
 
Zero-bug Software, Mathematically Guaranteed
Zero-bug Software, Mathematically GuaranteedZero-bug Software, Mathematically Guaranteed
Zero-bug Software, Mathematically Guaranteed
Ashley Zupkus
 

Similar to Microservices and Devs in Charge: Why Monitoring is an Analytics Problem (20)

Scaling security in a cloud environment v0.5 (Sep 2017)
Scaling security in a cloud environment  v0.5 (Sep 2017)Scaling security in a cloud environment  v0.5 (Sep 2017)
Scaling security in a cloud environment v0.5 (Sep 2017)
 
MuleSoft Surat Virtual Meetup#4 - Anypoint Monitoring and MuleSoft dataloader.io
MuleSoft Surat Virtual Meetup#4 - Anypoint Monitoring and MuleSoft dataloader.ioMuleSoft Surat Virtual Meetup#4 - Anypoint Monitoring and MuleSoft dataloader.io
MuleSoft Surat Virtual Meetup#4 - Anypoint Monitoring and MuleSoft dataloader.io
 
Making Security Agile
Making Security AgileMaking Security Agile
Making Security Agile
 
Netflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open SourceNetflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open Source
 
MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...
MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...
MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...
 
apidays LIVE Paris - Serverless security: how to protect what you don't see? ...
apidays LIVE Paris - Serverless security: how to protect what you don't see? ...apidays LIVE Paris - Serverless security: how to protect what you don't see? ...
apidays LIVE Paris - Serverless security: how to protect what you don't see? ...
 
Top 10 Software to Detect & Prevent Security Vulnerabilities from BlackHat US...
Top 10 Software to Detect & Prevent Security Vulnerabilities from BlackHat US...Top 10 Software to Detect & Prevent Security Vulnerabilities from BlackHat US...
Top 10 Software to Detect & Prevent Security Vulnerabilities from BlackHat US...
 
Dev{sec}ops
Dev{sec}opsDev{sec}ops
Dev{sec}ops
 
Pragmatic Pipeline Security
Pragmatic Pipeline SecurityPragmatic Pipeline Security
Pragmatic Pipeline Security
 
AWS live hack: Atlassian + Snyk OSS on AWS
AWS live hack: Atlassian + Snyk OSS on AWSAWS live hack: Atlassian + Snyk OSS on AWS
AWS live hack: Atlassian + Snyk OSS on AWS
 
Log Analytics for Distributed Microservices
Log Analytics for Distributed MicroservicesLog Analytics for Distributed Microservices
Log Analytics for Distributed Microservices
 
Bangalore OpenMSA DevDay - September 19, 2018
Bangalore OpenMSA DevDay - September 19, 2018Bangalore OpenMSA DevDay - September 19, 2018
Bangalore OpenMSA DevDay - September 19, 2018
 
Cncf checkov and bridgecrew
Cncf checkov and bridgecrewCncf checkov and bridgecrew
Cncf checkov and bridgecrew
 
Platform governance, gestire un ecosistema di microservizi a livello enterprise
Platform governance, gestire un ecosistema di microservizi a livello enterprisePlatform governance, gestire un ecosistema di microservizi a livello enterprise
Platform governance, gestire un ecosistema di microservizi a livello enterprise
 
Code Coverage and Test Suite Effectiveness: Empirical Study with Real Bugs in...
Code Coverage and Test Suite Effectiveness: Empirical Study with Real Bugs in...Code Coverage and Test Suite Effectiveness: Empirical Study with Real Bugs in...
Code Coverage and Test Suite Effectiveness: Empirical Study with Real Bugs in...
 
Empowering developers and operators through Gitlab and HashiCorp
Empowering developers and operators through Gitlab and HashiCorpEmpowering developers and operators through Gitlab and HashiCorp
Empowering developers and operators through Gitlab and HashiCorp
 
OORPT Dynamic Analysis
OORPT Dynamic AnalysisOORPT Dynamic Analysis
OORPT Dynamic Analysis
 
BrownResearch_CV
BrownResearch_CVBrownResearch_CV
BrownResearch_CV
 
Software Analytics: Data Analytics for Software Engineering and Security
Software Analytics: Data Analytics for Software Engineering and SecuritySoftware Analytics: Data Analytics for Software Engineering and Security
Software Analytics: Data Analytics for Software Engineering and Security
 
Zero-bug Software, Mathematically Guaranteed
Zero-bug Software, Mathematically GuaranteedZero-bug Software, Mathematically Guaranteed
Zero-bug Software, Mathematically Guaranteed
 

Recently uploaded

Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 

Recently uploaded (20)

Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 

Microservices and Devs in Charge: Why Monitoring is an Analytics Problem

  • 1. SignalFx Microservices and Devs in Charge: Why Monitoring is an Analytics Problem
  • 2. SignalFx Microservices and Devs in Charge: Why Monitoring is an Analytics Problem Phillip Liu phillip@signalfx.com @SignalFx - signalfx.com
  • 3. Agenda • My background • Microservices, a review • Analytics approach to monitoring • Code push side effects, an example • Summary
  • 5. Experience [2013 - ] SignalFx - Founder, CTO, Software Engineer Microservices; Monitoring using Analytics [2008 - 2012] Facebook - Software Engineer, Software Architect Hyperscale SOA; Monitoring using Nagios, Ganglia, and in-house Analytics [2004 - 2008] Opsware - Chief Architect, Software Engineer Monolithic Architecture; Monitoring using Ganglia, Nagios, Splunk [2000 - 2004] Loudcloud - Software Engineer LAMP, Application Server; Monitoring using SNMP, Ganglia, NetCool [1998 - 2000] Marimba - Software Engineer Client / Server; Monitoring using SNMP, FreshWater Software [ … ]
  • 7. A Microservices Definition Loosely coupled service oriented architecture with bounded context. Adrian Cockcroft
  • 8. SignalFx’s Microservices More than 15 internal services. Spanning hundreds of instances. Across 3 AZs. Have dependencies on tens of external services.
  • 9. Monitoring Challenges • High iteration rate leads to shortened test cycles • Integration test combinations are intractable • Catch problems during rolling deployments • Identify upstream/downstream side effects • e.g. backpressure • Identify brownouts before the customer • etc.
  • 12. Store
  • 16. Monitoring at SignalFx •We use SignalFx to monitor SignalFx •CollectD for OS and Docker metrics on all VMs •Yammer metrics for all Java app servers •Custom logger to count exception types •All metrics are sent to an analytics service •Each service deploy a their cadence •Push lab, then canary in prod, then rest of tier
  • 17. Code Push Side Effects
  • 18. Code Push Side Effects Push canary instance and Metadata API dashboard shows healthy tier.
  • 19. Code Push Side Effects However, upstream UI dashboard showed unusual # of timeouts.
  • 20. Code Push Side Effects In search of root cause. Always safe to start by looking at exception counts. Can’t derive much from all the noise.
  • 21. Code Push Side Effects Sum the # of exceptions to create a single signal.
  • 22. Code Push Side Effects Compare sum with time-shifted sum from a day ago.
  • 23. Code Push Side Effects Look at an outlier host - an Analytics service host.
  • 24. Code Push Side Effects java.io.InvalidObjectException: enum constant MURMUR128_MITZ_64 does not exist in class com.google.common.hash.BloomFilterStrategies at java.io.ObjectInputStream.readEnum(ObjectInputStream.java:1743) ~[na: 1.7.0_79] at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347) ~[na:1.7.0_79] at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java: 1990) ~[na:1.7.0_79] … Looking at Analytic’s logs revealed source of the problem.
  • 25. Code Push Side Effects • Analytics across multiple microservices reduced time to identify problem. From push to resolution was ~15min • Service instrumentation helped narrowed down root cause • Discovery allowed us to create a detector using analytics to notify similar problems in the future
  • 26. Other Examples • A customer started dropping data because they reverted to an unsupported API • Compare tsdb write throughput of two different write strategies • Create per-service capacity reports • Identify memory usage patterns across our Analytics service • Create a detector for every previously uncaught error conditions - postmortem output
  • 28. • Measure and Store as much metrics and events as possible • Use data analytics techniques to • Identify problems • Chase down root cause • Create analytics based detectors to notify you of recurrence
  • 29. SignalFx Thank You! Phillip Liu phillip@signalfx.com WE’RE HIRING jobs@signalfx.com @SignalFx - signalfx.com