Data-driven Performance Prediction and
Resource Allocation for Cloud Services
Rerngvit Yanggratoke
Doctoral Thesis
May 3, 2016
Advisor: Prof. Rolf Stadler
Opponent:
Prof. Filip De Turck, Ghent University, Ghent, Belgium
Grading Committee:
Prof. Raouf Boutaba, University of Waterloo, Canada
Dr. Giovanni Pacifici, IBM Research, USA
Prof. Lena Wosinska, KTH Royal Institute of Technology, Sweden
Cloud services – search, tax filing, video-on-demand, and social-network services
Introduction
2
Performance of such services is important
[Figure: clients reach a backend system in a data center over the Internet]
This thesis focuses on performance management of
backend systems in a data center
Three fundamental problems for performance management of backend systems in a data center
1. Resource allocation for a large-scale cloud environment
2. Performance modeling of a distributed key-value store
3. Real-time prediction of service metrics
Problem and Approach
3
Data-driven approach – estimate model parameters from measurements
1. Resource allocation for a large-scale cloud environment
2. Performance modeling of a distributed key-value store
3. Real-time prediction of service metrics
4. Contributions and open questions
Outline
4
Motivation – Large-scale Cloud
5
“Big gets Bigger: the Rise of the Mega Data Centre” – DimensionData Research, 2014
• Apple's data center in North Carolina, USA (about 500’000 ft²), expected to host 150’000+ machines
• Amazon data center in Virginia, USA (about 300’000 machines)
• Microsoft's mega data center in Dublin, Ireland (300’000 ft²)
• Facebook planned a massive data center in Iowa, USA (1.4M ft²)
• eBay's data center in Utah, USA (at least 240’000 ft²)
Resource Allocation
6
For each application, select a machine to run it on such that the allocation satisfies:
• Resource demands of all applications in the cloud
• Management objectives of the cloud provider
Resource allocation system computes the solution
Resource allocation system that supports
• Joint allocation of compute and network resources
• Generic and extensible for management objectives
• Scalable operation (> 100’000 machines)
• Dynamic adaptation to changes in load patterns
Requirement and Approach
7
Approach
• Formulate the problem as an optimization problem
• Use distributed protocols - gossip-based algorithms
• Assume a full-bisection-bandwidth network
The objective function expresses
a management objective
The Objective Function
8
Balance load objective Energy efficiency objective
Pn = energy of machine n
Fairness objective Service differentiation objective
θ ∈ {ω, γ, λ} = {CPU, Memory, Network}
= set of machines, = set of applications
The objective function expresses
a management objective
The Objective Function
9
Balance load objective Energy efficiency objective
Pn(t) = energy of machine n at time t
Fairness objective Service differentiation objective
θ ∈ {ω, γ, λ} = {CPU, Memory, Network}
= set of machines, = set of applications
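The objective-function formulas on these two slides were rendered as images and are not preserved in this text. As a hedged illustration only (not necessarily the exact formulations used in the thesis), two such objectives could be written as follows; the names f_bal, f_energy and the notation θ_n(t) for the relative demand of resource type θ on machine n are introduced here for illustration:

```latex
% Illustrative sketch only; the thesis' exact objective functions were shown as figures.
% Balanced load: minimize the largest relative resource demand over all machines and resource types.
f_{\mathrm{bal}}(t) \;=\; \max_{n \in N}\; \max_{\theta \in \{\omega,\gamma,\lambda\}} \theta_n(t)

% Energy efficiency: minimize the aggregate power drawn by the machines,
% where P_n(t) is the power of machine n at time t.
f_{\mathrm{energy}}(t) \;=\; \sum_{n \in N} P_n(t)
```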
Optimization Problem
10
Minimize the objective function subject to (1) capacity constraints and (2) demand constraints
This problem is NP-hard
• We apply a heuristic solution
• How to distribute this computation?
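The “Minimize” and “Subject to” expressions, together with constraints (1) and (2), were figures on the original slide. A hedged sketch of the general shape of such a placement problem, using notation introduced here for illustration (x_{a,n} = 1 if application a runs on machine n, d_a^θ its demand for resource θ, c_n^θ the capacity of machine n), might read:

```latex
% Illustrative reconstruction only; the thesis' precise formulation may differ
% (e.g., it may allow an application's demand to be split across machines).
\begin{aligned}
\min_{x}\quad  & f(x) && \text{(management objective)}\\
\text{s.t.}\quad
  & \sum_{a \in A} x_{a,n}\, d_a^{\theta} \;\le\; c_n^{\theta}
      && \forall\, n \in N,\ \theta \in \{\omega,\gamma,\lambda\}
      \qquad \text{(1) capacity constraints}\\
  & \sum_{n \in N} x_{a,n} \;=\; 1
      && \forall\, a \in A
      \qquad \text{(2) demand constraints}\\
  & x_{a,n} \in \{0,1\}
\end{aligned}
```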
Gossip Protocol
11
• A round-based protocol that relies on pairwise interactions to accomplish a global objective
• Each node executes the same code
• The size of the exchanged message is limited
Balanced load objective
“During a gossip interaction, move an application that minimizes the objective function”
A generic and scalable gossip protocol for resource allocation
Generic Resource Management Protocol (GRMP)
12
The objective function is minimized locally during each pairwise interaction; the protocol implements an iterative descent method.
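To make the iterative-descent idea concrete, below is a minimal Python sketch of gossip rounds under a balanced-load objective. It illustrates the general mechanism only and is not the GRMP pseudocode from the thesis; the Node class, the CPU-only load model, and the greedy move rule are simplifying assumptions introduced here.

```python
import random

class Node:
    """A machine holding applications; load() is its relative CPU demand."""
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.apps = {}  # application name -> CPU demand

    def load(self):
        return sum(self.apps.values()) / self.capacity

def gossip_interaction(a, b):
    """One pairwise interaction: greedily move applications between a and b
    as long as a move lowers the local objective max(load(a), load(b))."""
    while True:
        src, dst = (a, b) if a.load() >= b.load() else (b, a)
        best_app, best_obj = None, max(a.load(), b.load())
        for app, demand in src.apps.items():
            # Local objective if this application were moved from src to dst.
            obj = max((sum(src.apps.values()) - demand) / src.capacity,
                      (sum(dst.apps.values()) + demand) / dst.capacity)
            if obj < best_obj:
                best_app, best_obj = app, obj
        if best_app is None:
            return
        dst.apps[best_app] = src.apps.pop(best_app)

def gossip_round(nodes):
    """Each node interacts with one peer chosen uniformly at random."""
    for node in nodes:
        peer = random.choice([n for n in nodes if n is not node])
        gossip_interaction(node, peer)

if __name__ == "__main__":
    random.seed(0)
    nodes = [Node(f"n{i}", capacity=100.0) for i in range(10)]
    for j in range(50):  # place 50 applications with random CPU demands
        random.choice(nodes).apps[f"app{j}"] = random.uniform(1.0, 20.0)
    for _ in range(20):  # run a few gossip rounds
        gossip_round(nodes)
    print("max load:", round(max(n.load() for n in nodes), 3))
    print("min load:", round(min(n.load() for n in nodes), 3))
```

Swapping out the code that chooses which application to move is, roughly, how such a protocol can stay generic across management objectives.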
Evaluation Results from Simulation
13
[Figures: two surface plots over CPU load factor (0.15–0.95) and network load factor (0.3–1.5), comparing Protocol vs. Ideal: unbalance (0–1) for the balanced load objective and relative power consumption (0–1) for the energy efficiency objective]
Scalability
• System size up to 100’000 machines
• Evaluation metrics do not change with system size
R. Yanggratoke, F. Wuhib and R. Stadler, “Gossip-
based resource allocation for green computing in large
clouds,” In Proc. 7th International Conference on
Network and Service Management (CNSM), Paris,
France, October 24-28, 2011
F. Wuhib, R. Yanggratoke, and R. Stadler, “Allocating
Compute and Network Resources under Management
Objectives in Large-Scale Clouds,” Journal of Network
and Systems Management (JNSM), Vol. 23, No. 1, pp.
111-136, January 2015
Publications
14
1. Resource allocation for a large-scale cloud environment
2. Performance modeling of a distributed key-value store
3. Real-time prediction of service metrics
4. Contributions and open questions
Outline
15
Low latency is key to the Spotify service
Spotify Backend for Music Streaming
16
A distributed key-value store
Problem and Approach
17
Development of performance models for a Spotify backend site
1. Predicting response time distribution
2. Estimating capacity under different object allocation policies
• Random policy
• Popularity-aware policy
Approach
• Simplified architecture
• Probabilistic and stochastic modeling techniques
Simplified Architecture for a Spotify backend site
18
• Model only Production Storage
• AP selects a storage server uniformly at random
• Ignore network and access-point processing delays
• Consider steady-state conditions and Poisson arrivals
Model for Response Time Distribution
19
Model for a single storage server: the probability that a request to a server is served within latency t
Model for a cluster of storage servers: the probability that a request to the cluster is served within latency t
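The model equations on this slide were figures. As a hedged sketch of the structure only (the thesis' server model is more detailed than a plain queue), assume each storage server i can be approximated by an M/M/1 queue with arrival rate λ_i and service rate μ_i, and that the access point routes a request to server i with probability p_i (uniform random selection gives p_i = 1/S for S servers):

```latex
% Single storage server (M/M/1 approximation, illustration only):
\Pr[T_i \le t] \;=\; 1 - e^{-(\mu_i - \lambda_i)\, t}, \qquad \lambda_i < \mu_i

% Cluster of S storage servers: a request hits server i with probability p_i,
% so the cluster latency distribution is the corresponding mixture:
\Pr[T \le t] \;=\; \sum_{i=1}^{S} p_i \, \Pr[T_i \le t]
```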
Model Predictions vs. Measurements from Spotify Backend
20
[Figures: measured vs. predicted response-time distributions for a single Spotify server and for a Spotify cluster]
Model for Estimating Capacity under Different Object Allocation Policies
21
Capacity
Ω : Max request rate to a server so that a QoS is satisfied
Ωc: Max request rate to the cluster so that the request rate to each server is at most Ω
Capacity of a cluster under the popularity-aware policy
Capacity of a cluster under the random policy
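The capacity formulas, and the assumption they rely on, were figures on this slide. The sketch below only illustrates the qualitative difference between the two policies: under a popularity-aware policy the load can be spread evenly over the S servers, while under the random policy the busiest server limits the cluster. The Zipf popularity exponent, the 4 servers, and Ω = 50 requests/s are arbitrary illustrative assumptions, not the thesis' testbed configuration.

```python
import random

def cluster_capacity_popularity_aware(num_servers, omega):
    """Illustration: if load can be spread evenly over the servers,
    the cluster capacity is simply S * omega."""
    return num_servers * omega

def cluster_capacity_random(num_objects, num_servers, omega,
                            zipf_s=1.0, trials=100):
    """Monte Carlo estimate of cluster capacity under the random policy:
    the largest total request rate R such that the most-loaded server stays
    at or below omega. Assumes Zipf(zipf_s) object popularity and uniformly
    random object placement (illustrative assumptions)."""
    weights = [1.0 / (k ** zipf_s) for k in range(1, num_objects + 1)]
    total = sum(weights)
    popularity = [w / total for w in weights]   # request fraction per object
    estimates = []
    for _ in range(trials):
        share = [0.0] * num_servers             # request fraction per server
        for p in popularity:
            share[random.randrange(num_servers)] += p
        # A cluster rate R is feasible iff max(share) * R <= omega.
        estimates.append(omega / max(share))
    return sum(estimates) / len(estimates)

if __name__ == "__main__":
    random.seed(1)
    servers, omega = 4, 50.0                    # arbitrary illustrative values
    print("popularity-aware:", cluster_capacity_popularity_aware(servers, omega))
    for n_obj in (1250, 1750, 2500, 5000, 10000):
        print(f"random, {n_obj} objects:",
              round(cluster_capacity_random(n_obj, servers, omega), 1))
```

Qualitatively, the random-policy estimate approaches S·Ω as the number of objects grows, which matches the trend in the measurement table on the next slide.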
Model Predictions vs. Measurements from Testbed
22

Number of     Random policy                          Popularity-aware policy
objects       Measurement   Model     Error (%)      Measurement   Model   Error (%)
1250          166.50        176.21    5.83           190.10        200     5.21
1750          180.30        180.92    0.35           191.00        200     4.71
2500          189.70        182.30    3.90           190.80        200     4.82
5000          188.40        186.92    0.78           191.00        200     4.71
10000         188.60        190.39    0.95           192.40        200     3.95
Cluster Capacity for the Random Policy vs.
Popularity-Aware policy
23
The Spotify backend does not need the popularity-aware policy
R. Yanggratoke, G. Kreitz, M. Goldmann and R. Stadler, “Predicting response times for the Spotify backend,” In Proc. 8th International Conference on Network and Service Management (CNSM), Las Vegas, NV, USA, October 22-26, 2012
Best Paper Award
R. Yanggratoke, G. Kreitz, M. Goldmann, R. Stadler
and V. Fodor, “On the performance of the Spotify
backend,” accepted to Journal of Network and
Systems Management (JNSM), Vol. 23, No. 1, pp.
111-136, January 2015
Publications
24
1. Resource allocation for a large-scale cloud environment
2. Performance modeling of a distributed key-value store
3. Real-time prediction of service metrics
4. Contributions and open questions
Outline
25
Real-time Prediction Problem
26
Problem: predict a service metric Y in real time from device statistics X
• X: CPU load, memory load, #active network sockets, #processes, etc.
• Y: video frame rate and audio buffer rate for video-on-demand (VoD); response time for a key-value store (KV)
Motivation: key building block for a real-time service assurance system
Service-agnostic Approach
27
Existing work
• Apply analytical models of the service
• Statistical learning on engineered, service-specific features
Design goal ➔ service-agnostic prediction
Our approach
• Take “all” available statistics (> 4000 features)
• Learn using low-level (OS-level) metrics
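As an illustration of what low-level (OS-level) metrics can look like, the snippet below samples a handful of device statistics with the psutil library. It is a toy stand-in for the thesis' feature extraction, which collected thousands of kernel-level statistics per sample rather than the few shown here.

```python
import time
import psutil  # third-party library: pip install psutil

def collect_device_statistics():
    """Collect a small feature vector X of OS-level metrics.
    A toy stand-in for the thesis' feature extraction (> 4000 features)."""
    return {
        "cpu_percent": psutil.cpu_percent(interval=None),
        "mem_percent": psutil.virtual_memory().percent,
        # May require elevated privileges on some operating systems.
        "num_sockets": len(psutil.net_connections(kind="inet")),
        "num_processes": len(psutil.pids()),
        "ctx_switches": psutil.cpu_stats().ctx_switches,
        "disk_read_bytes": psutil.disk_io_counters().read_bytes,
    }

if __name__ == "__main__":
    # Sample the device statistics once per second, as a monitoring agent would.
    for _ in range(3):
        print(collect_device_statistics())
        time.sleep(1)
```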
Testbed – Video Streaming
28
Dell PowerEdge R715 2U rack servers, 64 GB RAM, two 12-core AMD
Opteron processors, a 500 GB hard disk, and 1 Gb network controller
Testbed – Key-value Store
29
Prediction Methods
30
Prediction methods range from batch learning and online learning on recorded traces to real-time learning on live statistics, with increasing difficulty and realism.
Evaluation metric: normalized mean absolute error (NMAE)
NMAE = (1/ȳ) · (1/m) · Σᵢ |yᵢ − ŷᵢ|
where yᵢ are the measured values of the service metric, ŷᵢ the predictions, and ȳ the mean of the measurements
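As a minimal, self-contained illustration of the batch-learning setup and the NMAE metric, the sketch below trains a random-forest regressor (one plausible choice, not necessarily the learning method used in the thesis) on a synthetic trace of (device statistics X, service metric Y) samples and reports NMAE on held-out data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def nmae(y_true, y_pred):
    """Normalized mean absolute error: mean(|y - y_hat|) / mean(y)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(np.abs(y_true - y_pred)) / np.mean(y_true)

if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Synthetic stand-in for a recorded trace: 2000 samples of 500 device
    # statistics (the thesis' traces had > 4000 features) and one service
    # metric, e.g., video frame rate in frames/s.
    X = rng.normal(size=(2000, 500))
    y = 25.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=2000)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    model = RandomForestRegressor(n_estimators=50, max_features="sqrt",
                                  random_state=0, n_jobs=-1)
    model.fit(X_train, y_train)

    print("NMAE on held-out samples:",
          round(nmae(y_test, model.predict(X_test)), 3))
```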
Real-time Analytics Engine
31
Real-time Analytics Demonstrator
32
Evaluation for Real-time Learning
33
NMAE for real-time learning:

Load pattern                                  Video frame rate (VoD)   Audio buffer rate (VoD)   Response time (KV)
Periodic-load                                 3.6%                     14%                       7%
Flashcrowd-load                               5.6%                     11%                       6%
Periodic-load (VoD) + Flashcrowd-load (KV)    8%                       29%                       11%

With virtualization: negligible changes
End-to-end with network path metrics: +10%
R. Yanggratoke, J. Ahmed, J. Ardelius, C. Flinta, A. Johnsson, D. Gillblad, R. Stadler, “Predicting real-time service-level metrics from device statistics,” IM 2015
R. Yanggratoke, J. Ahmed, J. Ardelius, C. Flinta, A. Johnsson, D.
Gillblad, R. Stadler, “A platform for predicting real-time service-
level metrics from device statistics,” IM 2015, demo session
R. Yanggratoke, J. Ahmed, J. Ardelius, C. Flinta, A. Johnsson, D.
Gillblad, R. Stadler, “Predicting service metrics for cluster-based
services using real-time analytics,” CNSM 2015
R. Yanggratoke, J. Ahmed, J. Ardelius, C. Flinta, A. Johnsson, D.
Gillblad, R. Stadler, “A service-agnostic method for predicting
service metrics in real-time,” submitted to JNSM
J. Ahmed, A. Johnsson, R. Yanggratoke, J. Ardelius, C. Flinta, R.
Stadler, “Predicting SLA Conformance for Cluster-Based
Services Using Distributed Analytics,” NOMS 2016
Publications
34
1. Resource allocation for a large-scale cloud environment
2. Performance modeling of a distributed key-value store
3. Real-time prediction of service metrics
4. Contributions and open questions
Outline
35
We designed, developed, and evaluated a generic protocol for resource allocation that
• supports joint allocation of compute and network resources
• enables scalable operation (> 100’000 machines)
• supports dynamic adaptation to changes in load patterns
We designed, developed, and evaluated performance models for the response time distribution and capacity of a distributed key-value store that are
• simple yet accurate for Spotify’s operational range
• obtainable and efficient
We designed, developed, and evaluated a solution for predicting service metrics in real time that is
• service-agnostic
• accurate and efficient
Key Contributions of the Thesis
36
Resource allocation for a large-scale cloud environment
• Centralized vs. decentralized resource allocation
• Multiple data centers and telecom clouds
Performance modeling of a distributed key-value store
• Black-box models for performance predictions
• Online performance management using analytical models
Analytics-based prediction of service metrics
• Prediction in large systems
• Analytics-based performance management
• Forecasting of service metrics
• Prediction of end-to-end service metrics
Open Questions for Future Research
37