SlideShare a Scribd company logo
1 of 26
Predicting Real-time Service-level Metrics
from Device Statistics
Stockholm, Oct 21, 2015
Rerngvit Yanggratoke(1)
(1) KTH Royal Institute of Technology, Sweden
(2) Ericsson Research, Sweden
(3) Swedish Institute of Computer Science (SICS), Sweden
1
Machine Learning meetup at Spotify
In collaboration with
Jawwad Ahmed(2), John Ardelius(3), Christofer Flinta(2),
Andreas Johnsson(2), Zuoying Jiang(1), Daniel Gillblad(3), Rolf Stadler(1)
About Me
• Currently, PhD student at KTH
• “Real-time analytics for network management”
– Apply machine learning on system-generated data to
better manage networked services and systems
• Favorite data science tools
– R for data analysis and visualization
– SparkR for large-scale batch processing
• Free time
– Playing foosball and badminton
– Developing and playing mobile games
– Contributing to Opensource projects: Flink and SparkR2
Overview
3
Hardware
Operating System
Services
Device statistics X
Real-time service metrics Y
- CPU load, memory load, #network
active sockets, #context switching,
#processes, etc..
- We read raw data from /proc
provided by Linux kernel
- Video frame rate, web response
time, read latency, map/reduce job
completion time, ….
Predict
….
Outline
• Problem / motivation
• Device statistics X
• Service-level metrics Y
• Testbed setup
• X-Y traces
• Data cleaning and preprocessing
• Evaluation results
• Real-time analytics engine
• Demo
• Recap
4
Problem : M: X➝ Ŷ predicts Y in real-time
Problem / motivation
Motivation :
• Building block for real-time service assurance for a service
operator or infrastructure provider
- CPU load, memory load, #network
active sockets, #context switching,
#processes, etc..
- Video frame rate, audio buffer rate,
RTP packet rate
- We select Video streaming (VLC) as an
example service
5
X: device statisticsY: service-level metrics
Service (S): Cluster
Network
Service (S): Client
[Supervised-learning regression problem]
Device statistics Xproc and Xsar
• Linux kernel statistics Xproc
– Features extracted from /proc directory
– CPU core jiffies, current memory usage, virtual memory statistics,
#processes, #blocked processes, …
– About 4,000 metrics
• System Activity Report (SAR) Xsar
– SAR computes metrics from /proc over time interval
– CPU core utilization, memory and swap space utilization, disk I/O statistics, …
– About 840 metrics
• Xproc contains many OS counters, while Xsar does not
• For model predictions, include numerical features
6
Service-level metrics Y
• Video streaming service based on VLC media player
• Measured metrics
– Video frame rate (frames/sec)
– Audio buffer rate (buffers/sec)
– RTP packet rate (packets/sec)
– …
• We instrumented the VLC software to capture
underlying events to compute the metrics.
7
X4 Sensor
Server machine
Server machine
X3 Sensor
HTTP Load
balancer
Video
streaming
(HTTP)
Transcoding
instance
X6 Sensor
Server machine
Transcoding
instance
Video
Streaming
(HTTP)
X5 Sensor
Server machine
Transcoding
instance
Network
File System
X1 Sensor
Server machine
File system
access
Network
File System
X2 Sensor
Server machine
Web Server
Web Server
Web Server
Server-side
VLC Client
VLC
Sensor
(Y1)
X4 Sensor
Server machine
Client machine
Load generator
VLC Client
VoD Load
generator
Server machine
X3 Sensor
HTTP Load
balancer
Video
streaming
(HTTP)
Transcoding
instance
X6 Sensor
Server machine
Transcoding
instance
Video
Streaming
(HTTP)
X5 Sensor
Server machine
Transcoding
instance
Network
File System
X1 Sensor
Server machine
File system
access
Network
File System
X2 Sensor
Server machine
Web Server
Web Server
Web Server
Server-side Client-side
X-Y traces for evaluation
We collect the following traces
– Periodic-load trace, flashcrowd-load trace, constant-load
trace, poisson-load trace, linearly-increasing-load trace
10
We published the traces use in our works
http://mldata.org/repository/data/viewslug/realm-im2015-vod-traces/
http://mldata.org/repository/data/viewslug/realm-cnsm2015-vod-traces/
Data cleaning and preprocessing
• Cleaning
– Removal of non-numeric features
– Removal of highly correlated features
– Removal of constant features
– …
• Preprocessing options
– Principal component analysis
• Dimensionality reduction
– Incorporate historical data (X = [Xt, Xt-1, …, Xt-k])
– *Automatic feature selection
– ….
11
Evaluation
Model training and evaluation
X-Y trace
Train set
Test set
70%
30%
Model
training
Linear regression
Regression tree
Random forest
Lasso regression
….
Learning
model
Learning
modelLearning
model
ei = | yi - ˆyi |
NMAE =
1
ym
eii=1
m
å
Prediction accuracy
ˆyi = M(xi )
yi
Measure
Training Time
12
Normalized Mean
Absolute Error
13
Video frame rate
• Y – bimodal distribution
• Both methods provide
similar prediction errors
Method NMAE (%)
Linear regression 12
Regression tree 11
Audio buffer rate
14
• Y – bimodal distribution
• Regression tree outperforms
least-square linear regression
Method NMAE (%)
Linear regression 41
Regression tree 19
RTP packet rate
15
• Y – wider spread distribution
• Least-square linear regression
outperforms regression tree
Method NMAE (%)
Linear regression 15
Regression tree 19
Evaluation results – periodic-load trace
Device
statistics Regression method
NMAE (%)
Video Audio RTP
X_proc
Linear regression 26 59 39
Lasso regression 23 63 35
Regression tree 23 61 36
Random forest 22 60 34
X_sar
Linear regression 12 41 15
Lasso regression 16 51 17
Regression tree 11 19 19
Random forest
6 0.94 15 16
Lessons learned
• Appropriate models depend on the service metrics
• It is feasible to accurately predict client-side metrics
based on low-level device statistics
– NMAE below 15% across service-level metrics and traces
• Preprocessing of X is critical
– X_sar outperforms X_proc
– Significant improvement of prediction accuracy
• There is a trade-off between computational
resources vs. prediction accuracy
– Random forest vs. linear regression
17
Automatic feature selection
• Exhaustive search is infeasible
– Requires O(2p) training executions (p = ~5000)
• Forward stepwise feature selection
– Heuristic method O(p2) training executions
– Start with an empty set
– Incrementally grows the feature set with one
feature at a time that minimizes a specific
evaluation metric
• Selected metric is the cross-validation error
18
Random forest on different feature sets
19
Trace
Feature set
Video Audio
NMAE (%) Training
(secs)
NMAE(%) Training
(secs)
Periodic-load
Full 12 > 50000 32 > 70000
Automatically
optimized
feature set
6 862 22 1600
Flash-load
Full 8 > 55000 21 > 75000
Automatically
optimized
feature set
4 778 15 1750
* Learning with the optimized feature set significantly improve
prediction accuracy and reduce training time
Real-time Analytics Engine
20
Hardware
Linux OS
Service (S): Server
Xsar = ~5000 metrics
Network
preprocessing
Service (S): Client
Y1 = video frame rate
Y2 = audio buffer rate
Y3 = network read rate
Computing
Prediction
Models
ˆY1 = M1 X( )
ˆY2 = M2 X( )
ˆY3 = M3 X( )
Predict
Predict
Predict
Historical Y
Recorded Demo
22
23
Recap
• Design and develop experimental testbeds
• Run experiments with various load patterns
• Collected and published traces
• Machine learning model assessment
– Batch learning on traces
– Online learning on traces
– Real-time learning on live statistics
• Design and develop real-time analytics engine
• Implement a prototype for real-time learning on
live statistics
24
Key-value store (Voldemort)
25
Questions?
26

More Related Content

What's hot

The data streaming processing paradigm and its use in modern fog architectures
The data streaming processing paradigm and its use in modern fog architecturesThe data streaming processing paradigm and its use in modern fog architectures
The data streaming processing paradigm and its use in modern fog architecturesVincenzo Gulisano
 
Artificial intelligence and data stream mining
Artificial intelligence and data stream miningArtificial intelligence and data stream mining
Artificial intelligence and data stream miningAlbert Bifet
 
Data Streaming in Big Data Analysis
Data Streaming in Big Data AnalysisData Streaming in Big Data Analysis
Data Streaming in Big Data AnalysisVincenzo Gulisano
 
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...Ian Foster
 
Streaming Algorithms
Streaming AlgorithmsStreaming Algorithms
Streaming AlgorithmsJoe Kelley
 
5.1 mining data streams
5.1 mining data streams5.1 mining data streams
5.1 mining data streamsKrish_ver2
 
defense_slides_pengdu
defense_slides_pengdudefense_slides_pengdu
defense_slides_pengduPeng Du
 
Dictionary Learning for Massive Matrix Factorization
Dictionary Learning for Massive Matrix FactorizationDictionary Learning for Massive Matrix Factorization
Dictionary Learning for Massive Matrix Factorizationrecsysfr
 
Big Graph Analytics Systems (Sigmod16 Tutorial)
Big Graph Analytics Systems (Sigmod16 Tutorial)Big Graph Analytics Systems (Sigmod16 Tutorial)
Big Graph Analytics Systems (Sigmod16 Tutorial)Yuanyuan Tian
 
Smart Scalable Feature Reduction with Random Forests with Erik Erlandson
Smart Scalable Feature Reduction with Random Forests with Erik ErlandsonSmart Scalable Feature Reduction with Random Forests with Erik Erlandson
Smart Scalable Feature Reduction with Random Forests with Erik ErlandsonDatabricks
 
FFWD - Fast Forward With Degradation
FFWD - Fast Forward With DegradationFFWD - Fast Forward With Degradation
FFWD - Fast Forward With DegradationRolando Brondolin
 
Ikuro Sato's slide presented at ICONIP2017
Ikuro Sato's slide presented at ICONIP2017Ikuro Sato's slide presented at ICONIP2017
Ikuro Sato's slide presented at ICONIP2017Ikuro Sato
 
Real-Time Big Data Stream Analytics
Real-Time Big Data Stream AnalyticsReal-Time Big Data Stream Analytics
Real-Time Big Data Stream AnalyticsAlbert Bifet
 
Real-Time Top-R Topic Detection on Twitter with Topic Hijack Filtering
Real-Time Top-R Topic Detection on Twitter with Topic Hijack FilteringReal-Time Top-R Topic Detection on Twitter with Topic Hijack Filtering
Real-Time Top-R Topic Detection on Twitter with Topic Hijack FilteringKohei Hayashi
 
Tianqi Chen, PhD Student, University of Washington, at MLconf Seattle 2017
Tianqi Chen, PhD Student, University of Washington, at MLconf Seattle 2017Tianqi Chen, PhD Student, University of Washington, at MLconf Seattle 2017
Tianqi Chen, PhD Student, University of Washington, at MLconf Seattle 2017MLconf
 
Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013MLconf
 
Introduction to Deep Learning
Introduction to Deep LearningIntroduction to Deep Learning
Introduction to Deep LearningMehrnaz Faraz
 
A Scalable Dataflow Implementation of Curran's Approximation Algorithm
A Scalable Dataflow Implementation of Curran's Approximation AlgorithmA Scalable Dataflow Implementation of Curran's Approximation Algorithm
A Scalable Dataflow Implementation of Curran's Approximation AlgorithmNECST Lab @ Politecnico di Milano
 
強化学習の分散アーキテクチャ変遷
強化学習の分散アーキテクチャ変遷強化学習の分散アーキテクチャ変遷
強化学習の分散アーキテクチャ変遷Eiji Sekiya
 

What's hot (19)

The data streaming processing paradigm and its use in modern fog architectures
The data streaming processing paradigm and its use in modern fog architecturesThe data streaming processing paradigm and its use in modern fog architectures
The data streaming processing paradigm and its use in modern fog architectures
 
Artificial intelligence and data stream mining
Artificial intelligence and data stream miningArtificial intelligence and data stream mining
Artificial intelligence and data stream mining
 
Data Streaming in Big Data Analysis
Data Streaming in Big Data AnalysisData Streaming in Big Data Analysis
Data Streaming in Big Data Analysis
 
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
 
Streaming Algorithms
Streaming AlgorithmsStreaming Algorithms
Streaming Algorithms
 
5.1 mining data streams
5.1 mining data streams5.1 mining data streams
5.1 mining data streams
 
defense_slides_pengdu
defense_slides_pengdudefense_slides_pengdu
defense_slides_pengdu
 
Dictionary Learning for Massive Matrix Factorization
Dictionary Learning for Massive Matrix FactorizationDictionary Learning for Massive Matrix Factorization
Dictionary Learning for Massive Matrix Factorization
 
Big Graph Analytics Systems (Sigmod16 Tutorial)
Big Graph Analytics Systems (Sigmod16 Tutorial)Big Graph Analytics Systems (Sigmod16 Tutorial)
Big Graph Analytics Systems (Sigmod16 Tutorial)
 
Smart Scalable Feature Reduction with Random Forests with Erik Erlandson
Smart Scalable Feature Reduction with Random Forests with Erik ErlandsonSmart Scalable Feature Reduction with Random Forests with Erik Erlandson
Smart Scalable Feature Reduction with Random Forests with Erik Erlandson
 
FFWD - Fast Forward With Degradation
FFWD - Fast Forward With DegradationFFWD - Fast Forward With Degradation
FFWD - Fast Forward With Degradation
 
Ikuro Sato's slide presented at ICONIP2017
Ikuro Sato's slide presented at ICONIP2017Ikuro Sato's slide presented at ICONIP2017
Ikuro Sato's slide presented at ICONIP2017
 
Real-Time Big Data Stream Analytics
Real-Time Big Data Stream AnalyticsReal-Time Big Data Stream Analytics
Real-Time Big Data Stream Analytics
 
Real-Time Top-R Topic Detection on Twitter with Topic Hijack Filtering
Real-Time Top-R Topic Detection on Twitter with Topic Hijack FilteringReal-Time Top-R Topic Detection on Twitter with Topic Hijack Filtering
Real-Time Top-R Topic Detection on Twitter with Topic Hijack Filtering
 
Tianqi Chen, PhD Student, University of Washington, at MLconf Seattle 2017
Tianqi Chen, PhD Student, University of Washington, at MLconf Seattle 2017Tianqi Chen, PhD Student, University of Washington, at MLconf Seattle 2017
Tianqi Chen, PhD Student, University of Washington, at MLconf Seattle 2017
 
Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013
 
Introduction to Deep Learning
Introduction to Deep LearningIntroduction to Deep Learning
Introduction to Deep Learning
 
A Scalable Dataflow Implementation of Curran's Approximation Algorithm
A Scalable Dataflow Implementation of Curran's Approximation AlgorithmA Scalable Dataflow Implementation of Curran's Approximation Algorithm
A Scalable Dataflow Implementation of Curran's Approximation Algorithm
 
強化学習の分散アーキテクチャ変遷
強化学習の分散アーキテクチャ変遷強化学習の分散アーキテクチャ変遷
強化学習の分散アーキテクチャ変遷
 

Similar to 20151021_DataScienceMeetup_revised

Intelligent Monitoring
Intelligent MonitoringIntelligent Monitoring
Intelligent MonitoringIntelie
 
陸永祥/全球網路攝影機帶來的機會與挑戰
陸永祥/全球網路攝影機帶來的機會與挑戰陸永祥/全球網路攝影機帶來的機會與挑戰
陸永祥/全球網路攝影機帶來的機會與挑戰台灣資料科學年會
 
Performance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsPerformance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsDatabricks
 
Apache Spark for Cyber Security in an Enterprise Company
Apache Spark for Cyber Security in an Enterprise CompanyApache Spark for Cyber Security in an Enterprise Company
Apache Spark for Cyber Security in an Enterprise CompanyDatabricks
 
Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014Raja Chiky
 
"Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ...
"Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ..."Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ...
"Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ...Edge AI and Vision Alliance
 
Best practices and lessons learnt from Running Apache NiFi at Renault
Best practices and lessons learnt from Running Apache NiFi at RenaultBest practices and lessons learnt from Running Apache NiFi at Renault
Best practices and lessons learnt from Running Apache NiFi at RenaultDataWorks Summit
 
Optimizing the performance of Chamilo LMS
Optimizing the performance of Chamilo LMSOptimizing the performance of Chamilo LMS
Optimizing the performance of Chamilo LMSChamilo Association
 
Optimizing the performance of your LMS
Optimizing the performance of your LMSOptimizing the performance of your LMS
Optimizing the performance of your LMSPatrick Roth
 
Real Time Human Posture Detection with Multiple Depth Sensors
Real Time Human Posture Detection with Multiple Depth SensorsReal Time Human Posture Detection with Multiple Depth Sensors
Real Time Human Posture Detection with Multiple Depth SensorsWassim Filali
 
Design and Implementation of A Data Stream Management System
Design and Implementation of A Data Stream Management SystemDesign and Implementation of A Data Stream Management System
Design and Implementation of A Data Stream Management SystemErdi Olmezogullari
 
Remote Log Analytics Using DDS, ELK, and RxJS
Remote Log Analytics Using DDS, ELK, and RxJSRemote Log Analytics Using DDS, ELK, and RxJS
Remote Log Analytics Using DDS, ELK, and RxJSSumant Tambe
 
[DSC Europe 23] Pramod Immaneni - Real-time analytics at IoT scale
[DSC Europe 23] Pramod Immaneni - Real-time analytics at IoT scale[DSC Europe 23] Pramod Immaneni - Real-time analytics at IoT scale
[DSC Europe 23] Pramod Immaneni - Real-time analytics at IoT scaleDataScienceConferenc1
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...Ian Foster
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Databricks
 
Tutorial-on-DNN-09A-Co-design-Sparsity.pdf
Tutorial-on-DNN-09A-Co-design-Sparsity.pdfTutorial-on-DNN-09A-Co-design-Sparsity.pdf
Tutorial-on-DNN-09A-Co-design-Sparsity.pdfDuy-Hieu Bui
 

Similar to 20151021_DataScienceMeetup_revised (20)

Intelligent Monitoring
Intelligent MonitoringIntelligent Monitoring
Intelligent Monitoring
 
陸永祥/全球網路攝影機帶來的機會與挑戰
陸永祥/全球網路攝影機帶來的機會與挑戰陸永祥/全球網路攝影機帶來的機會與挑戰
陸永祥/全球網路攝影機帶來的機會與挑戰
 
Performance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsPerformance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark Metrics
 
Apache Spark for Cyber Security in an Enterprise Company
Apache Spark for Cyber Security in an Enterprise CompanyApache Spark for Cyber Security in an Enterprise Company
Apache Spark for Cyber Security in an Enterprise Company
 
Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014
 
"Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ...
"Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ..."Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ...
"Energy-efficient Hardware for Embedded Vision and Deep Convolutional Neural ...
 
Best practices and lessons learnt from Running Apache NiFi at Renault
Best practices and lessons learnt from Running Apache NiFi at RenaultBest practices and lessons learnt from Running Apache NiFi at Renault
Best practices and lessons learnt from Running Apache NiFi at Renault
 
Optimizing the performance of Chamilo LMS
Optimizing the performance of Chamilo LMSOptimizing the performance of Chamilo LMS
Optimizing the performance of Chamilo LMS
 
Optimizing the performance of your LMS
Optimizing the performance of your LMSOptimizing the performance of your LMS
Optimizing the performance of your LMS
 
Real Time Human Posture Detection with Multiple Depth Sensors
Real Time Human Posture Detection with Multiple Depth SensorsReal Time Human Posture Detection with Multiple Depth Sensors
Real Time Human Posture Detection with Multiple Depth Sensors
 
Design and Implementation of A Data Stream Management System
Design and Implementation of A Data Stream Management SystemDesign and Implementation of A Data Stream Management System
Design and Implementation of A Data Stream Management System
 
LEGaTO: Use cases
LEGaTO: Use casesLEGaTO: Use cases
LEGaTO: Use cases
 
Remote Log Analytics Using DDS, ELK, and RxJS
Remote Log Analytics Using DDS, ELK, and RxJSRemote Log Analytics Using DDS, ELK, and RxJS
Remote Log Analytics Using DDS, ELK, and RxJS
 
Shikha fdp 62_14july2017
Shikha fdp 62_14july2017Shikha fdp 62_14july2017
Shikha fdp 62_14july2017
 
[DSC Europe 23] Pramod Immaneni - Real-time analytics at IoT scale
[DSC Europe 23] Pramod Immaneni - Real-time analytics at IoT scale[DSC Europe 23] Pramod Immaneni - Real-time analytics at IoT scale
[DSC Europe 23] Pramod Immaneni - Real-time analytics at IoT scale
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
 
rerngvit_phd_seminar
rerngvit_phd_seminarrerngvit_phd_seminar
rerngvit_phd_seminar
 
Tutorial-on-DNN-09A-Co-design-Sparsity.pdf
Tutorial-on-DNN-09A-Co-design-Sparsity.pdfTutorial-on-DNN-09A-Co-design-Sparsity.pdf
Tutorial-on-DNN-09A-Co-design-Sparsity.pdf
 
H2020-AHTOOLS Use Case 3 Functional Design
H2020-AHTOOLS Use Case 3 Functional DesignH2020-AHTOOLS Use Case 3 Functional Design
H2020-AHTOOLS Use Case 3 Functional Design
 

20151021_DataScienceMeetup_revised

  • 1. Predicting Real-time Service-level Metrics from Device Statistics Stockholm, Oct 21, 2015 Rerngvit Yanggratoke(1) (1) KTH Royal Institute of Technology, Sweden (2) Ericsson Research, Sweden (3) Swedish Institute of Computer Science (SICS), Sweden 1 Machine Learning meetup at Spotify In collaboration with Jawwad Ahmed(2), John Ardelius(3), Christofer Flinta(2), Andreas Johnsson(2), Zuoying Jiang(1), Daniel Gillblad(3), Rolf Stadler(1)
  • 2. About Me • Currently, PhD student at KTH • “Real-time analytics for network management” – Apply machine learning on system-generated data to better manage networked services and systems • Favorite data science tools – R for data analysis and visualization – SparkR for large-scale batch processing • Free time – Playing foosball and badminton – Developing and playing mobile games – Contributing to Opensource projects: Flink and SparkR2
  • 3. Overview 3 Hardware Operating System Services Device statistics X Real-time service metrics Y - CPU load, memory load, #network active sockets, #context switching, #processes, etc.. - We read raw data from /proc provided by Linux kernel - Video frame rate, web response time, read latency, map/reduce job completion time, …. Predict ….
  • 4. Outline • Problem / motivation • Device statistics X • Service-level metrics Y • Testbed setup • X-Y traces • Data cleaning and preprocessing • Evaluation results • Real-time analytics engine • Demo • Recap 4
  • 5. Problem : M: X➝ Ŷ predicts Y in real-time Problem / motivation Motivation : • Building block for real-time service assurance for a service operator or infrastructure provider - CPU load, memory load, #network active sockets, #context switching, #processes, etc.. - Video frame rate, audio buffer rate, RTP packet rate - We select Video streaming (VLC) as an example service 5 X: device statisticsY: service-level metrics Service (S): Cluster Network Service (S): Client [Supervised-learning regression problem]
  • 6. Device statistics Xproc and Xsar • Linux kernel statistics Xproc – Features extracted from /proc directory – CPU core jiffies, current memory usage, virtual memory statistics, #processes, #blocked processes, … – About 4,000 metrics • System Activity Report (SAR) Xsar – SAR computes metrics from /proc over time interval – CPU core utilization, memory and swap space utilization, disk I/O statistics, … – About 840 metrics • Xproc contains many OS counters, while Xsar does not • For model predictions, include numerical features 6
  • 7. Service-level metrics Y • Video streaming service based on VLC media player • Measured metrics – Video frame rate (frames/sec) – Audio buffer rate (buffers/sec) – RTP packet rate (packets/sec) – … • We instrumented the VLC software to capture underlying events to compute the metrics. 7
  • 8. X4 Sensor Server machine Server machine X3 Sensor HTTP Load balancer Video streaming (HTTP) Transcoding instance X6 Sensor Server machine Transcoding instance Video Streaming (HTTP) X5 Sensor Server machine Transcoding instance Network File System X1 Sensor Server machine File system access Network File System X2 Sensor Server machine Web Server Web Server Web Server Server-side
  • 9. VLC Client VLC Sensor (Y1) X4 Sensor Server machine Client machine Load generator VLC Client VoD Load generator Server machine X3 Sensor HTTP Load balancer Video streaming (HTTP) Transcoding instance X6 Sensor Server machine Transcoding instance Video Streaming (HTTP) X5 Sensor Server machine Transcoding instance Network File System X1 Sensor Server machine File system access Network File System X2 Sensor Server machine Web Server Web Server Web Server Server-side Client-side
  • 10. X-Y traces for evaluation We collect the following traces – Periodic-load trace, flashcrowd-load trace, constant-load trace, poisson-load trace, linearly-increasing-load trace 10 We published the traces use in our works http://mldata.org/repository/data/viewslug/realm-im2015-vod-traces/ http://mldata.org/repository/data/viewslug/realm-cnsm2015-vod-traces/
  • 11. Data cleaning and preprocessing • Cleaning – Removal of non-numeric features – Removal of highly correlated features – Removal of constant features – … • Preprocessing options – Principal component analysis • Dimensionality reduction – Incorporate historical data (X = [Xt, Xt-1, …, Xt-k]) – *Automatic feature selection – …. 11
  • 12. Evaluation Model training and evaluation X-Y trace Train set Test set 70% 30% Model training Linear regression Regression tree Random forest Lasso regression …. Learning model Learning modelLearning model ei = | yi - ˆyi | NMAE = 1 ym eii=1 m å Prediction accuracy ˆyi = M(xi ) yi Measure Training Time 12 Normalized Mean Absolute Error
  • 13. 13 Video frame rate • Y – bimodal distribution • Both methods provide similar prediction errors Method NMAE (%) Linear regression 12 Regression tree 11
  • 14. Audio buffer rate 14 • Y – bimodal distribution • Regression tree outperforms least-square linear regression Method NMAE (%) Linear regression 41 Regression tree 19
  • 15. RTP packet rate 15 • Y – wider spread distribution • Least-square linear regression outperforms regression tree Method NMAE (%) Linear regression 15 Regression tree 19
  • 16. Evaluation results – periodic-load trace Device statistics Regression method NMAE (%) Video Audio RTP X_proc Linear regression 26 59 39 Lasso regression 23 63 35 Regression tree 23 61 36 Random forest 22 60 34 X_sar Linear regression 12 41 15 Lasso regression 16 51 17 Regression tree 11 19 19 Random forest 6 0.94 15 16
  • 17. Lessons learned • Appropriate models depend on the service metrics • It is feasible to accurately predict client-side metrics based on low-level device statistics – NMAE below 15% across service-level metrics and traces • Preprocessing of X is critical – X_sar outperforms X_proc – Significant improvement of prediction accuracy • There is a trade-off between computational resources vs. prediction accuracy – Random forest vs. linear regression 17
  • 18. Automatic feature selection • Exhaustive search is infeasible – Requires O(2p) training executions (p = ~5000) • Forward stepwise feature selection – Heuristic method O(p2) training executions – Start with an empty set – Incrementally grows the feature set with one feature at a time that minimizes a specific evaluation metric • Selected metric is the cross-validation error 18
  • 19. Random forest on different feature sets 19 Trace Feature set Video Audio NMAE (%) Training (secs) NMAE(%) Training (secs) Periodic-load Full 12 > 50000 32 > 70000 Automatically optimized feature set 6 862 22 1600 Flash-load Full 8 > 55000 21 > 75000 Automatically optimized feature set 4 778 15 1750 * Learning with the optimized feature set significantly improve prediction accuracy and reduce training time
  • 21. Hardware Linux OS Service (S): Server Xsar = ~5000 metrics Network preprocessing Service (S): Client Y1 = video frame rate Y2 = audio buffer rate Y3 = network read rate Computing Prediction Models ˆY1 = M1 X( ) ˆY2 = M2 X( ) ˆY3 = M3 X( ) Predict Predict Predict Historical Y
  • 23. 23
  • 24. Recap • Design and develop experimental testbeds • Run experiments with various load patterns • Collected and published traces • Machine learning model assessment – Batch learning on traces – Online learning on traces – Real-time learning on live statistics • Design and develop real-time analytics engine • Implement a prototype for real-time learning on live statistics 24

Editor's Notes

  1. X is available to the cloud operator, while Y is in general not
  2. We also implements the system that applies this concept. Essentially, we take a collect the data from variety of machines in real-time