Insights and Graph-based ML Anomaly Detection for eBay Edge Services
Presenters: Anoop Koloth, Hanzhang Wang
Team: Anirudh Muralidhar, Kalieswaran Rayar, Phuong Nguyen, Saravana Chilla, Venkatesh Palani
EnvoyCon, Nov 18, 2019
Agenda
eBay Envoy Ecosystem for Edge
Data, Traffic Insights and Anomaly Detection
Grano - Graph-based Anomaly Detection
© 2018 eBay. All rights reserved.
Traffic Engineering
LATENCY
AVAILABILITY
Envoy at eBay
L7 - High Level
Request
Services
Latency
Zoom Images
Client Error
Client
Interactions
Security
Experimentation
12+ billion records/day
120+ dimensions
60+ metrics
Anomaly Detection
Ramp Up Decision
Traffic
Insights
Bot/Attack
Remediation
Data Set
route_name, rlogid, isp, actual_pop_dist, actual, parent, tags,
upstream-service-time, time_to_first_resp_byte, response_flag
Raw Record
Observability
Anycast
First AD Solution
Xpack (Inverse Deduction), Anodot, In-House - PyAnom
Cross data sets / metrics / models
Single variant
Our Journey
“Leverage ML and graph-based solutions to reduce Mean Time to Detect (MTTD) for production incidents in distributed systems.”
MTTD: the time between an incident and somebody knowing about it.
● “Somebody” is an automated response system or a person.
● “Knowing about”: an alert that gets lost in a flood of alerts doesn't count, even if the monitoring system knew.
● “It” is the incident and its root cause(s).
Grano Vision
GRANO is an end-to-end graph-based anomaly detection and root cause analysis system for distributed cloud-native systems.
Anomaly Detection
ML time series anomaly detection on metrics
Adaptive Anomaly Graph
Adaptive analysis of all health signals on a real-time property graph of connected metrics/events (or projected onto a system topology) to identify the root cause
Applications
Alerting, interactive Graph-UI, and integration with monitoring systems
Knowledge Graph (or Heterogeneous Graph)
Knowledge Graph:
1. Represents a knowledge domain.
2. Connects things of different types in a systematic way.
3. Encodes knowledge arranged in a network of nodes and edges rather than tables of rows and columns.
Why graph here?
1. Native method to understand the “Unknown”
2. White box and visualization
3. The platform and support are ready (e.g. data, graph DB, computation power)
4. “Get ready for AI” or “translator of AI” analysis (Pattern)
5. Recommendation/Refactoring
Motivating Example
Data
Preprocess
xPack
ML jobs
Service
Adaptive
GraphGen
Grano Quick
Walkthrough
{
  "Components": {
    "POP": ["LHR"],
    "Site": ["EU", "US"],
    "Flow": ["AddtoCart", "Checkout"]
  },
  "Links": {
    "Agent": ["POP", "Flow", "Site"]
  },
  "Alerts": {
    "XPack": {
      "Latency_90": {
        "LHR": [0, 0, 0, 0.4],
        "LHR_AddtoCart_EU": [0.1, 0.3, 0.5, 0.9]
      }
    }
  }
}
General Anomaly Graph Schema
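As a rough illustration of how a schema document like this might be consumed, the sketch below parses a syntactically valid copy of it and expands it into typed nodes and Metric_of edges. The function name `build_graph` and the rule that an underscore-joined alert key (such as LHR_AddtoCart_EU) names the components it belongs to are assumptions made for this example, not a description of Grano's actual implementation.

```python
import json

# A syntactically valid copy of the schema from the slide.
SCHEMA = json.loads("""
{
  "Components": {"POP": ["LHR"], "Site": ["EU", "US"],
                 "Flow": ["AddtoCart", "Checkout"]},
  "Links": {"Agent": ["POP", "Flow", "Site"]},
  "Alerts": {"XPack": {"Latency_90": {
      "LHR": [0, 0, 0, 0.4],
      "LHR_AddtoCart_EU": [0.1, 0.3, 0.5, 0.9]}}}
}
""")


def build_graph(schema):
    """Return (nodes, edges) derived from a schema dict.

    Assumption (hypothetical): a composite alert key is an underscore-joined
    list of component values, and each part is linked by a Metric_of edge.
    """
    nodes = {}  # node name -> node type
    for component_type, values in schema["Components"].items():
        for value in values:
            nodes[value] = component_type

    edges = set()  # (sub-metric, "Metric_of", component)
    for _source, metrics in schema["Alerts"].items():
        for _metric, series_by_key in metrics.items():
            for key in series_by_key:
                parts = key.split("_")
                if len(parts) > 1:
                    nodes[key] = "Sub-Metrics"
                    for part in parts:
                        if part in nodes:
                            edges.add((key, "Metric_of", part))
    return nodes, edges


nodes, edges = build_graph(SCHEMA)
```

Keeping nodes as a name-to-type mapping and edges as triples makes the result easy to load into whichever graph database is in use.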
PyAnom
○ Additive decomposition forecasting models
○ Multivariate clustering models
○ Statistical models
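The slides do not show PyAnom's internals, so the following is only a minimal sketch of the general idea behind an additive decomposition detector: model each point as level + seasonal + residual and flag points whose residual is unusually large. The function name `additive_anomalies`, the constant-level simplification (no trend term), and the 3-sigma threshold are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def additive_anomalies(series, period, threshold=3.0):
    """Return indices whose residual exceeds threshold * residual spread.

    Simplified additive model (assumption): value = level + seasonal + residual,
    with a constant level and one seasonal offset per phase of the period.
    """
    level = mean(series)
    # Seasonal offset for each phase: mean of that phase minus the overall level.
    seasonal = [mean(series[p::period]) - level for p in range(period)]
    residuals = [v - level - seasonal[i % period] for i, v in enumerate(series)]
    spread = pstdev(residuals)
    if spread == 0:
        return []  # perfectly periodic data, nothing to flag
    return [i for i, r in enumerate(residuals) if abs(r) > threshold * spread]


# Hypothetical latency series with a period of 4 and one injected spike.
latency = [10, 20, 30, 20] * 6
latency[13] += 50
spikes = additive_anomalies(latency, period=4)
```

A production model would also estimate a trend and use a more robust spread estimate, since a large spike inflates the standard deviation it is compared against.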
GraphDB
#Type    #Example
Pop      LHR, DUS, SJC, ...
Flow     Checkout, Buy, Sell, ...
Site     EU, US, ...
...      ...
Adaptive GraphGen
#Type          #Example
Pop            SJC
Flow           Checkout
Site           US
Sub-Metrics    US_Checkout_SJC
Edges: Metric_of (Sub-Metrics to Pop, Flow, and Site)
Adaptive GraphGen
#Type          #Example
Pop            SJC
Flow           Checkout
Site           US
Sub-Metrics    US_Checkout_SJC, US_AddCart_SJC
Edges: Metric_of (Sub-Metrics to Pop, Flow, and Site); Dependency (Pearson correlation) between sub-metrics
Adaptive GraphGen
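The Dependency edge on this slide is labeled with Pearson correlation. A minimal sketch of how such edges could be derived from metric time series follows; the function names, the 0.8 cutoff, and the sample values are assumptions for illustration, and the slides do not say how Grano thresholds or windows the correlation.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / (norm_x * norm_y)


def dependency_edges(metrics, threshold=0.8):
    """Link every pair of sub-metrics whose |correlation| meets the threshold."""
    names = sorted(metrics)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(metrics[a], metrics[b])
            if abs(r) >= threshold:
                edges.append((a, "Dependency", b, round(r, 3)))
    return edges


# Hypothetical latency samples (ms) for three sub-metrics.
samples = {
    "US_Checkout_SJC": [100, 110, 150, 400, 390],
    "US_AddCart_SJC": [80, 85, 120, 350, 330],
    "EU_Checkout_LHR": [90, 95, 88, 92, 91],
}
edges = dependency_edges(samples)
```

With these sample values only the two SJC sub-metrics move together closely enough to be linked.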
#Type          #Example
Pop            SJC
Flow           Checkout
Site           US
Sub-Metrics    US_Checkout_SJC, US_AddCart_SJC
Event/Alert    Duration_P50 {Criticality, Confidence, Value, Expected range ...}
Edges: Metric_of, Dependency
Alert examples: {Medium, 40, 104ms, 60-80ms…}, {High, 90, 2000 ms, 105-300ms…}, {High, 90, 1700 ms, 80-200ms…}
Adaptive GraphGen
Data
Preprocess
xPack
ML jobs
Service
GraphDB
ML Models
Grano
Algorithm
Adaptive
GraphGen
Grano Quick
Walkthrough
Assign edge score based on:
1. Alarm Severity (Metrics Dynamics)
2. Alarm Connectivities (at the Dimension)
Grano Algorithms
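The slide names two inputs to the edge score, alarm severity and alarm connectivity at the dimension, but gives no formula. The sketch below is one hypothetical way to combine them: severity is a criticality weight scaled by confidence, and connectivity is the share of sub-metrics attached to a dimension node that are currently alerting. The weights, the product form, and all names are assumptions.

```python
# Assumed weights per criticality level (not from the slides).
CRITICALITY_WEIGHT = {"Low": 0.25, "Medium": 0.5, "High": 1.0}


def alarm_severity(alert):
    """Severity in [0, 1]: criticality weight scaled by confidence (0-100)."""
    return CRITICALITY_WEIGHT[alert["criticality"]] * alert["confidence"] / 100.0


def edge_scores(metric_of, alerts):
    """Score each (sub-metric, dimension) edge for alerting sub-metrics.

    metric_of: sub-metric name -> list of dimension nodes it belongs to
    alerts:    sub-metric name -> {"criticality": str, "confidence": number}
    """
    attached = {}  # dimension node -> sub-metrics attached to it
    for sub, dims in metric_of.items():
        for dim in dims:
            attached.setdefault(dim, []).append(sub)

    scores = {}
    for sub, dims in metric_of.items():
        if sub not in alerts:
            continue
        severity = alarm_severity(alerts[sub])
        for dim in dims:
            alerting = sum(1 for s in attached[dim] if s in alerts)
            connectivity = alerting / len(attached[dim])
            scores[(sub, dim)] = severity * connectivity
    return scores


metric_of = {
    "US_Checkout_SJC": ["US", "Checkout", "SJC"],
    "US_AddCart_SJC": ["US", "AddCart", "SJC"],
    "EU_Checkout_LHR": ["EU", "Checkout", "LHR"],
}
alerts = {
    "US_Checkout_SJC": {"criticality": "High", "confidence": 90},
    "US_AddCart_SJC": {"criticality": "High", "confidence": 90},
}
scores = edge_scores(metric_of, alerts)
```

In this example the edge to SJC scores higher than the edge to Checkout, because every sub-metric attached to SJC is alerting while only half of those attached to Checkout are.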
Other Alerts
Alarm Criticality (e.g. Latency vs CPU)
Edge Scores
{Criticality, Confidence, Value, Expected range ...}
Grano Algorithms
Propagations
Propagation through the topology from all nodes to their connected ones.
Grano Algorithms
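The propagation step is described only as flowing "from all nodes to their connected ones". One simple way to picture it is a damped diffusion over an undirected graph, sketched below. The damping factor, the number of rounds, the equal split among neighbours, and the function name `propagate` are assumptions for illustration rather than Grano's published algorithm.

```python
def propagate(edges, initial, damping=0.5, rounds=2):
    """Spread anomaly scores over an undirected graph.

    edges:   iterable of (node_a, node_b) pairs
    initial: node -> starting anomaly score (missing nodes start at 0)
    Each round, every node sends damping * score, split equally among its
    neighbours, and then adds what it received to its own initial score.
    """
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    scores = {n: initial.get(n, 0.0) for n in neighbours}
    for _ in range(rounds):
        incoming = {n: 0.0 for n in neighbours}
        for node, nbrs in neighbours.items():
            share = damping * scores[node] / len(nbrs)
            for nb in nbrs:
                incoming[nb] += share
        scores = {n: initial.get(n, 0.0) + incoming[n] for n in neighbours}
    return scores


# Hypothetical topology: two alerting sub-metrics that share SJC and US.
topology = [
    ("US_Checkout_SJC", "SJC"), ("US_Checkout_SJC", "US"),
    ("US_Checkout_SJC", "Checkout"),
    ("US_AddCart_SJC", "SJC"), ("US_AddCart_SJC", "US"),
    ("US_AddCart_SJC", "AddCart"),
]
ranked = propagate(topology, {"US_Checkout_SJC": 0.9, "US_AddCart_SJC": 0.9})
```

Dimension nodes shared by several alerting sub-metrics (here SJC and US) accumulate more score than nodes touched by only one, which is the intuition behind using the topology to narrow down a root cause.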
Data
Preprocess
xPack
ML jobs
Service
GraphDB
Knowledge
APIs
UI
Slack/Email
Alerts
Grano Quick
Walkthrough
ML Models
Grano
Algorithm
Adaptive
GraphGen
Grano UI
Envoy gives us better observability, scale, TLS termination, TCP congestion control (BBR), and routing.
Today, we are able to serve experiences (closer and faster) to users with higher ATB, powered by our detection.
We should leverage ML, graphs, and knowledge engineering rather than relying on one method/metric (a single point of failure).
Graphs are powerful for understanding, connecting, and making sense of complex heterogeneous data (e.g., anomalies, metrics, and system events).
Takeaway
Thank You
L4 Architecture
● IPVS based dataplane
● custom consistent hashing ko
● k8s driven control plane
● k8s native deployment
● horizontally scalable
● DSR
● BGP
Anomaly Detection
Bucket Span + Data History + Exponential Smoothing
m1
--------- d1 ( primary )
--------- d2 ( secondary / influencers )
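To make the "Bucket Span + Data History + Exponential Smoothing" combination concrete, here is a minimal sketch under stated assumptions: raw records are averaged into fixed-width time buckets, an exponentially smoothed level acts as the learned history, and a bucket is flagged when it deviates from that level by more than a relative tolerance. The function names, the smoothing factor, and the 50% tolerance are hypothetical, and this is not the model used by the xPack ML jobs.

```python
def bucketize(records, span):
    """Average (timestamp_seconds, value) records into buckets of `span` seconds."""
    sums = {}
    for ts, value in records:
        key = ts // span
        total, count = sums.get(key, (0.0, 0))
        sums[key] = (total + value, count + 1)
    return [sums[k][0] / sums[k][1] for k in sorted(sums)]


def smoothing_anomalies(values, alpha=0.3, tolerance=0.5):
    """Return bucket indices deviating from the smoothed level by > tolerance.

    The level is updated only with buckets considered normal, so a spike does
    not pull the baseline towards itself.
    """
    level = values[0]
    flagged = []
    for i, value in enumerate(values[1:], start=1):
        if abs(value - level) > tolerance * abs(level):
            flagged.append(i)
        else:
            level = alpha * value + (1 - alpha) * level
    return flagged


# Hypothetical metric m1 for one value of dimension d1, sampled irregularly.
records = [(0, 100), (60, 102), (300, 98), (360, 101),
           (600, 300), (660, 310), (900, 99)]
buckets = bucketize(records, span=300)   # 5-minute bucket span
anomalous = smoothing_anomalies(buckets)
```

A shorter bucket span reacts faster but is noisier, while a longer one smooths out short spikes; running the same check per value of a secondary dimension is one way to surface influencers.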

Insights and Graph-based ML Anomaly Detection for eBay Edge Services

  • 1. Insights and Graph-based ML Anomaly Detection for eBay Edge Services. Presenters: Anoop Koloth, Hanzhang Wang. Team: Anirudh Muralidhar, Kalieswaran Rayar, Phuong Nguyen, Saravana Chilla, Venkatesh Palani. EnvoyCon, Nov 18, 2019
  • 2. Agenda: eBay Envoy Ecosystem for Edge; Data, Traffic Insights and Anomaly Detection; Grano - Graph-based Anomaly Detection. © 2018 eBay. All rights reserved.
  • 3. Traffic Engineering: LATENCY, AVAILABILITY
  • 5. L7 - High Level
  • 6. Data Set: Request, Services, Latency, Zoom Images, Client Error, Client Interactions, Security, Experimentation. 12 billion+ records/day, 120+ dimensions, 60+ metrics. Uses: Anomaly Detection, Ramp Up Decision, Traffic Insights, Bot/Attack Remediation.
  • 7. Raw Record: route_name, rlogid, isp, actual_pop_dist, actual, parent, tags, upstream-service-time, time_to_first_resp_byte, response_flag
  • 10. First AD Solution
  • 11. Our Journey: Xpack (Inverse Deduction), Anadot, In-House - PyAnom. Cross data sets / metrics / models. Single variant.
  • 12. Grano Vision: “Leverage ML and graph-based solutions to reduce Mean Time to Detect (MTTD) production incidents in distributed systems.” MTTD is the time between an incident and somebody's knowing about it. ● “Somebody” is an automated response system or a person. ● “Knowing about”: it doesn't count if an alert gets lost in a flood of alerts, even if the monitoring system knew. ● “It” is the incident and the root cause(s).
  • 13. GRANO is an end-to-end graph-based anomaly detection and root cause analysis system for distributed cloud-native systems. Anomaly Detection: ML time series anomaly detection on metrics. Adaptive Anomaly Graph: adaptive analysis of all health signals on a real-time property graph of connected metrics/events (or projected onto a system topology) to identify the root cause. Applications: alerting, interactive Graph-UI, and integration with monitoring systems.
  • 14. Knowledge Graph (or Heterogeneous Graph). Knowledge Graph: 1. Represents a knowledge domain. 2. Connects things of different types in a systematic way. 3. Encodes knowledge arranged in a network of nodes and edges rather than tables of rows and columns. Why graph here? 1. Native method to understand the “Unknown”. 2. White box and visualization. 3. The platform and support are ready (e.g. data, graph DB, computation power). 4. “Get ready for AI” or “translator of AI” analysis (pattern). 5. Recommendation/refactoring.
  • 15. Motivating Example
  • 16. Grano Quick Walkthrough: Data Preprocess, xPack ML jobs, Service, Adaptive GraphGen, GraphDB. PyAnom: ○ Additive decomposition forecasting models ○ Multivariate clustering models ○ Statistical-based models. General Anomaly Graph Schema: { "Components": { "POP": ["LHR"], "Site": ["EU","US"], "Flow": ["AddtoCart", "Checkout"] }, "Links": { "Agent": ["POP","Flow","Site"] }, "Alerts": { "XPack": { "Latency_90": { "LHR": [0,0,0,0.4], "LHR_AddtoCart_EU": [0.1,0.3,0.5,0.9] } } } }
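
A minimal Python sketch of how a document in the General Anomaly Graph Schema might be turned into graph nodes and edges. The JSON is the slide's example with the missing commas added and "Compoents" read as "Components". The function name `build_graph`, the use of a `Metric_of` edge label for alert series, and the rule of splitting series keys on underscores are assumptions made for illustration, not Grano's actual implementation. The "Links" section is left uninterpreted, and the numeric arrays are treated as opaque score series.

```python
import json

# Slide example, repaired so that it parses as JSON.
SCHEMA = json.loads("""
{
  "Components": {
    "POP": ["LHR"],
    "Site": ["EU", "US"],
    "Flow": ["AddtoCart", "Checkout"]
  },
  "Links": {"Agent": ["POP", "Flow", "Site"]},
  "Alerts": {
    "XPack": {
      "Latency_90": {
        "LHR": [0, 0, 0, 0.4],
        "LHR_AddtoCart_EU": [0.1, 0.3, 0.5, 0.9]
      }
    }
  }
}
""")


def build_graph(schema):
    """Return (nodes, edges) for a property graph described by the schema."""
    nodes = {}  # node id -> properties
    edges = []  # (source id, edge label, target id)

    # One node per component value, typed by its dimension (POP, Site, Flow).
    for dimension, values in schema["Components"].items():
        for value in values:
            nodes[value] = {"type": dimension}

    # One node per alert series, linked to every component named in its key.
    for source, metrics in schema["Alerts"].items():
        for metric, series_by_key in metrics.items():
            for key, scores in series_by_key.items():
                alert_id = f"{source}:{metric}:{key}"
                nodes[alert_id] = {"type": "Alert", "scores": scores}
                for part in key.split("_"):
                    if part in nodes:
                        edges.append((alert_id, "Metric_of", part))
    return nodes, edges


nodes, edges = build_graph(SCHEMA)
for edge in edges:
    print(edge)
```

Keeping nodes and edges as plain dictionaries and tuples makes the result easy to load into whichever graph database is in use.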
  • 17. Adaptive GraphGen (diagram): node types (#Type) Pop, Flow, Site; example nodes (#Example) Pop: LHR, DUS, SJC, ...; Flow: Checkout, Buy, Sell, ...; Site: EU, US, ...
  • 18. Adaptive GraphGen (diagram): a Sub-Metrics node type linked by Metric_of edges to the Pop, Flow and Site types; example: US_Checkout_SJC is Metric_of SJC, Checkout and US.
  • 19. Adaptive GraphGen (diagram): as on slide 18, plus Dependency edges (Pearson correlation) between sub-metrics; example: a Dependency edge between US_Checkout_SJC and US_AddCart_SJC.
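
The Dependency edges on slide 19 are labelled as Pearson correlation. A small sketch of how such edges could be derived from sub-metric time series follows. The sample series, the 0.8 threshold, and the function names are hypothetical; the slides do not say how Grano thresholds or windows the correlation.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def dependency_edges(series, threshold=0.8):
    """Link every pair of sub-metrics whose |correlation| reaches the threshold."""
    edges = []
    for a, b in combinations(sorted(series), 2):
        r = pearson(series[a], series[b])
        if abs(r) >= threshold:
            edges.append((a, "Dependency", b, round(r, 3)))
    return edges


# Hypothetical latency samples (ms) for three sub-metrics.
series = {
    "US_Checkout_SJC": [100, 110, 105, 400, 420],
    "US_AddCart_SJC": [80, 85, 82, 300, 310],
    "EU_Checkout_LHR": [90, 70, 95, 75, 88],
}
dep_edges = dependency_edges(series)
print(dep_edges)
```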
  • 20. Adaptive GraphGen (diagram): as on slide 19, plus Event/Alert nodes attached to sub-metrics, each carrying {Criticality, Confidence, Value, Expected range ...}; example alerts shown (metric Duration_P50): {Medium, 40, 104ms, 60-80ms…}, {High, 90, 2000 ms, 105-300ms…}, {High, 90, 1700 ms, 80-200ms…}.
  • 21. Grano Quick Walkthrough: Data Preprocess, xPack ML jobs, Service, Adaptive GraphGen, GraphDB, ML Models, Grano Algorithm
  • 22. Grano Algorithms (diagram as on slide 20). Assign edge score based on: 1. Alarm severity (metrics dynamics) 2. Alarm connectivity (at the dimension)
  • 23. Grano Algorithms (diagram as on slide 20, plus other alerts). Alarm criticality (e.g. latency vs CPU). Edge scores.
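
The slides name the inputs to the edge score (alarm severity, alarm connectivity, alarm criticality) but give no formula. The sketch below is one possible way to combine them, using the alert properties {Criticality, Confidence, Value, Expected range} from the diagrams. The criticality weights, the `alpha` mixing factor, and the definition of connectivity as the share of neighbouring sub-metrics that are also alarming are all assumptions, not Grano's scoring rule.

```python
from dataclasses import dataclass

# Hypothetical weights per criticality level.
CRITICALITY_WEIGHT = {"Low": 0.25, "Medium": 0.5, "High": 1.0}


@dataclass
class Alert:
    criticality: str      # "Low", "Medium" or "High"
    confidence: float     # 0-100
    value: float          # observed value, e.g. latency in ms
    expected_low: float   # lower bound of the expected range
    expected_high: float  # upper bound of the expected range


def severity(alert):
    """Deviation outside the expected range, scaled by criticality and confidence."""
    width = alert.expected_high - alert.expected_low
    if alert.value > alert.expected_high:
        excess = alert.value - alert.expected_high
    elif alert.value < alert.expected_low:
        excess = alert.expected_low - alert.value
    else:
        excess = 0.0
    # Cap the relative deviation at 1 so a single huge spike cannot dominate.
    deviation = min(excess / width, 1.0) if width > 0 else 0.0
    weight = CRITICALITY_WEIGHT[alert.criticality]
    return weight * (alert.confidence / 100.0) * deviation


def edge_score(alert, alarming_neighbours, total_neighbours, alpha=0.7):
    """Blend alert severity with how many neighbouring sub-metrics also alarm."""
    connectivity = alarming_neighbours / total_neighbours if total_neighbours else 0.0
    return alpha * severity(alert) + (1 - alpha) * connectivity


# Alert tuples taken from the slide diagrams.
medium = Alert("Medium", 40, 104, 60, 80)
high = Alert("High", 90, 2000, 105, 300)
print(severity(medium), severity(high), edge_score(high, 2, 3))
```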
  • 24. Grano Algorithms (diagram as on slide 20, with propagations). Propagation through the topology from all nodes to their connected ones.
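
Propagation from every node to its connected nodes can be sketched as a damped score-spreading pass over an undirected graph. This is an illustrative stand-in, not the Grano algorithm: the damping factor, the number of rounds, the equal split among neighbours, and the initial scores are all assumed values.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency lists from (node, node) pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def propagate(adjacency, initial, damping=0.5, rounds=1):
    """Each round, every node sends damping * score, split evenly, to its neighbours."""
    scores = {node: initial.get(node, 0.0) for node in adjacency}
    for _ in range(rounds):
        incoming = {node: 0.0 for node in scores}
        for node, neighbours in adjacency.items():
            if not neighbours:
                continue
            share = damping * scores[node] / len(neighbours)
            for neighbour in neighbours:
                incoming[neighbour] += share
        # A node keeps its own alert score and adds what it received.
        scores = {node: initial.get(node, 0.0) + incoming[node] for node in scores}
    return scores


# Metric_of links from the slide example; the alert scores are hypothetical.
links = [
    ("US_Checkout_SJC", "SJC"), ("US_Checkout_SJC", "Checkout"), ("US_Checkout_SJC", "US"),
    ("US_AddCart_SJC", "SJC"), ("US_AddCart_SJC", "AddCart"), ("US_AddCart_SJC", "US"),
]
initial = {"US_Checkout_SJC": 0.9, "US_AddCart_SJC": 0.8}
scores = propagate(build_adjacency(links), initial)
for node, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{node}: {score:.3f}")
```

In this toy graph the shared dimensions SJC and US collect score from both alarming sub-metrics and therefore rank above Checkout and AddCart, which is the kind of signal a root-cause ranking would use.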
  • 25. Grano Quick Walkthrough: Data Preprocess, xPack ML jobs, Service, Adaptive GraphGen, GraphDB, ML Models, Grano Algorithm, Knowledge APIs, UI, Slack/Email Alerts
  • 26. Grano UI
  • 27. Takeaway: Envoy enables better observability, scale, TLS termination, TCP congestion control (BBR), and routing. Today, we are able to serve experiences (closer and faster) to users with higher ATB, powered by our detection. We should leverage ML, graph, and knowledge engineering rather than relying on one method/metric (a single point of failure). Graphs are powerful for understanding, connecting, and making sense of complex heterogeneous data (e.g., anomalies, metrics, and system events).
  • 29. L4 Architecture ● IPVS based dataplane ● custom consistent hashing ko ● k8s driven control plane ● k8s native deployment ● horizontally scalable ● DSR ● BGP
  • 30. Anomaly Detection: Bucket Span + Data History + Exponential Smoothing. m1 - d1 (primary) - d2 (secondary / influencers)
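
The three ingredients on slide 30 (bucket span, data history, exponential smoothing) can be illustrated with a small self-contained detector for a single metric. This is a sketch of the general technique only; it does not reproduce the model used by the X-Pack ML jobs. The smoothing factor, the relative tolerance, the warm-up length, and the sample records are assumptions, and influencer dimensions (d2) are not modelled.

```python
def bucketize(records, bucket_span):
    """Average (timestamp_seconds, value) records into fixed-width buckets."""
    sums = {}
    for timestamp, value in records:
        bucket = int(timestamp // bucket_span)
        total, count = sums.get(bucket, (0.0, 0))
        sums[bucket] = (total + value, count + 1)
    return [sums[b][0] / sums[b][1] for b in sorted(sums)]


def detect(values, alpha=0.3, tolerance=0.5, history=3):
    """Return indices of buckets that deviate from the smoothed level.

    The first `history` buckets only train the baseline. Anomalous buckets
    are not folded into the level, so a spike does not shift the baseline.
    """
    if not values:
        return []
    anomalies = []
    level = values[0]
    for i, value in enumerate(values[1:], start=1):
        deviates = level != 0 and abs(value - level) / abs(level) > tolerance
        if i >= history and deviates:
            anomalies.append(i)
        else:
            level = alpha * value + (1 - alpha) * level  # exponential smoothing
    return anomalies


# Hypothetical latency records (seconds since start, ms) with a 60 s bucket span.
records = [(0, 100), (30, 100), (60, 102), (120, 98), (180, 101), (240, 300), (300, 99)]
buckets = bucketize(records, 60)
print(buckets, detect(buckets))
```

In practice one such series would be kept per value of the primary dimension (d1), so that a spike in one POP or flow is not averaged away by the others.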