
Monitoring "unknown unknowns" - Guy Fighel - DevOpsDays Tel Aviv 2017


DevOpsDays Tel Aviv 2017

Published in: Technology

  1. Monitoring “unknown unknowns” With Machine Intelligence
  2. @guyfig, On-Call Engineer by Nature: "If a tree falls in a forest and no one is around to hear it, does it make a sound?"
  3. Observability is a superset of monitoring and instrumentation: making systems debuggable and understandable (@mipsytipsy). Do you really know what to observe? Instrumentation is mostly developer-driven. What is the output? A dashboard? An exploration tool?
  4. Observability in Control Theory: one can determine the behavior of the entire system from the system's outputs.
  5. Unknown Unknowns - Rumsfeld Quadrant
  6. The Observability Quadrant (based on the Johari window): static thresholds, defined alerts, static runbooks, anomaly detection, predictions, external knowledge, knowledge, recommendations, auto collaboration, inference, auto correlations, semantic analysis, decision making
  7. Human-Driven Detection: set thresholds to find patterns; simulate based on what is known; use percentiles and basic stats.
  8. Will That Help in “Fire-Fighting” Mode?
  9. Find The Problem: Thresholds? Baseline? Anomaly? Scale matters. Stationary noise matters. Use autocorrelation.
  10. Preprocessing Data: sklearn.preprocessing
  11. Independent component analysis (ICA) separates a multivariate signal into additive subcomponents that are maximally independent: from sklearn.decomposition import FastICA, PCA
  12. Find The Problem [chart: CPU at 90%, time in minutes; annotation: EC2 instance changed from t2.small to m3.xl]. Events and context matter. Anomaly?
  13. Use Enrichment
  14. What Can Machines Do? Process different types of data, transform it fast, and handle huge amounts in real time. Automate and adapt anomaly detection. Apply semantic text similarity to find patterns (information retrieval). Apply auto correlation models. Evolve and adapt over time based on human interaction.
  15. The Goal - Centralization: observability for systems with imperfect outputs; event enrichment, symptom detection, and inference; automatic outlier detection; automatic correlation; getting closer to the control theory mathematical definition.
  16. Pick The Right Tools: https://github.com/turi-code/SFrame
  17. Get a Common Schema: Define the model and use a single schema (Apache Avro). Events are agnostic: they can represent logs, stack traces, metrics, user actions, HTTP events, etc. Every event should have a set of common fields as well as optional key/value attributes.
  18. Best Practices: Deterministic models (fuzzy logic, rules) are better to start with. Choose your logic and start running it across your data (schema). Apply similarity checks to strings first (TF-IDF, BM25, fuzzy matching, other classifiers). Look into correlations, starting with the simple, obvious ones, before building classifiers (unsupervised/semi-supervised learning is much more relevant overall). Build your prediction models on time series data first (statistics has solid models). Time and context are dimensions you will be able to start addressing.
  19. Use It In Production: your team == your users; ask for feedback; re-calculate relevancy; apply recommendations based on your own team's knowledge.
  20. Thank You. github.com/signifai @guyfig
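
Slide 9 suggests using autocorrelation when scale and stationary noise make thresholds unreliable. The following is a minimal sketch of that idea, not the speaker's implementation: the `autocorrelation` helper, the synthetic series, and the 60-sample period are all assumptions chosen for illustration.

```python
import numpy as np


def autocorrelation(series, lag):
    """Sample autocorrelation of a 1-D series at a positive integer lag."""
    if lag <= 0:
        raise ValueError("lag must be a positive integer")
    x = np.asarray(series, dtype=float)
    x = x - x.mean()                      # remove the mean so scale offsets do not dominate
    denom = np.dot(x, x)
    if denom == 0 or lag >= len(x):
        return 0.0                        # constant or too-short series: no usable signal
    return float(np.dot(x[:-lag], x[lag:]) / denom)


# Hypothetical data: a metric with a 60-sample cycle, and pure stationary noise.
rng = np.random.default_rng(0)
t = np.arange(600)
periodic = np.sin(2 * np.pi * t / 60) + 0.1 * rng.standard_normal(t.size)
noise = rng.standard_normal(t.size)

periodic_score = autocorrelation(periodic, 60)  # close to 1: strong repeating structure
noise_score = autocorrelation(noise, 60)        # close to 0: nothing repeats at this lag
```

A high score at a known cycle length means a deviation from that cycle is more informative than a static threshold crossing; a score near zero suggests the series is mostly noise at that lag.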
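
Slide 11 names `FastICA` from `sklearn.decomposition`. As a hedged illustration of what "separating a multivariate signal into maximally independent subcomponents" can look like, the sketch below mixes two synthetic sources and tries to unmix them. The signals, the mixing matrix, and the noise level are made up for this example and do not come from the talk.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two hypothetical independent sources: a sine wave and a square wave.
rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
sources = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]
sources = sources + 0.05 * rng.standard_normal(sources.shape)

# Each observed metric is an (assumed) linear mix of both sources.
mixing = np.array([[1.0, 0.5],
                   [0.5, 1.0]])
observed = sources @ mixing.T

# Estimate two independent components from the mixed observations.
ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(observed)

# ICA recovers sources only up to order, sign, and scale, so compare by
# absolute correlation between each recovered component and each true source.
corr = np.abs(np.corrcoef(recovered.T, sources.T)[:2, 2:])
best_match = corr.max(axis=0)
```

Because the order and sign of the components are arbitrary, any downstream logic should match components by correlation or by inspection rather than by index.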
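
Slide 18 recommends applying similarity checks to strings first, naming TF-IDF among the options. The sketch below shows one way to do that with scikit-learn, assuming event messages are available as plain strings; the three sample messages are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical event messages; real ones would come from your event schema.
events = [
    "disk usage above 90 percent on db-01",
    "disk usage above 95 percent on db-02",
    "user login failed for account admin",
]

# Turn each message into a TF-IDF vector, then compare all pairs.
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(events)
similarity = cosine_similarity(matrix)

# similarity[i, j] is in [0, 1]; higher means the two messages share more
# distinctive terms. The two disk alerts score high against each other,
# while the login failure shares no terms with them.
disk_pair = similarity[0, 1]
unrelated_pair = similarity[0, 2]
```

Pairs above a chosen cutoff can be grouped as probable duplicates or the same symptom; the cutoff itself is something to tune against feedback from the team, as slide 19 suggests.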
