Autonomic SLA-driven
Provisioning for Cloud
Applications

 Nicolas Bonvin, Thanasis Papaioannou, Karl Aberer

 CCGRID 2011, May 23-26, 2011, Newport Beach, CA, USA

 nicolas.bonvin@epfl.ch
 LSIR - EPFL
Cloud Apps – Issue #1 : Placement

    ●    A distributed, component-based application running on an elastic
         infrastructure




    [Figure: components C1, C2, C3 and C4]




2   EPFL – LSIR - Nicolas Bonvin
Cloud Apps – Issue #1 : Placement

    ●    A distributed, component-based application running on an elastic
         infrastructure




    [Figure: components C1, C2, C3 and C4 running on VM1, VM2 and VM3]

Cloud Apps – Issue #1 : Placement

    ●    A distributed, component-based application running on an elastic
         infrastructure
    ●    Performance of C1, C2 and C3 is probably lower than that of C4
    ●    No information about other VMs colocated on the same server!



    [Figure: components C1, C2, C3 and C4 running on VM1, VM2 and VM3, which are hosted on Server 1 and Server 2]

Cloud Apps – Issue #1 : Placement

    ●    A distributed, component-based application running on an elastic
         infrastructure
    ●    Performance of C1, C2 and C3 is probably lower than that of C4
    ●    No information about other VMs colocated on the same server!



    [Figure: components C1, C2, C3 and C4 running on VM1, VM2 and VM3, which are hosted on Server 1 and Server 2]

                                          No control over placement

Cloud Apps – Issue #2 : Instability

    ●    Load-balanced traffic to 4 identical components on 4 identical VMs

    [Figure: one C1 replica on each of VM1, VM2, VM3 and VM4, labeled 100 ms, 100 ms, 100 ms and 100 ms respectively]

Cloud Apps – Issue #2 : Instability

    ●    Load-balanced traffic to 4 identical components on 4 identical VMs
                     –    VM performance can vary by up to a factor of 4! [Dej2009]
                                   ●   Physical server, hypervisor, storage, ...

    [Figure: one C1 replica on each of VM1, VM2, VM3 and VM4, labeled 100 ms, 140 ms, 100 ms and 100 ms respectively]

Cloud Apps – Issue #2 : Instability

    ●    Load-balanced traffic to 4 identical components on 4 identical VMs
                     –    VM performance can vary by up to a factor of 4! [Dej2009]
                                   ●   Physical server, hypervisor, storage, ...
                                   ●   Component overloaded

    [Figure: one C1 replica on each of VM1, VM2, VM3 and VM4, labeled 130 ms, 140 ms, 100 ms and 100 ms respectively]

Cloud Apps – Issue #2 : Instability

    ●    Load-balanced traffic to 4 identical components on 4 identical VMs
                     –    VM performance can vary by up to a factor of 4! [Dej2009]
                                   ●   Physical server, hypervisor, storage, ...
                                   ●   Component overloaded
                                   ●   Component bug, crash, deadlock, ...

    [Figure: one C1 replica on each of VM1, VM2, VM3 and VM4, labeled 130 ms, 140 ms, 100 ms and infinity respectively]

Cloud Apps – Issue #2 : Instability

     ●    Load-balanced traffic to 4 identical components on 4 identical VMs
                      –    VM performance can vary by up to a factor of 4! [Dej2009]
                                    ●   Physical server, hypervisor, storage, ...
                                    ●   Component overloaded
                                    ●   Component bug, crash, deadlock, ...
                                    ●   Failure of C1 on VM4 -> load is rebalanced

     [Figure: one C1 replica on each of VM1, VM2, VM3 and VM4, labeled 140 ms, 150 ms, 130 ms and infinity respectively]

Cloud Apps – Issue #2 : Instability

     ●    Load-balanced traffic to 4 identical components on 4 identical VMs
                      –    VM performance can vary by up to a factor of 4! [Dej2009]
                                    ●   Physical server, hypervisor, storage, ...
                                    ●   Component overloaded
                                    ●   Component bug, crash, deadlock, ...
                                    ●   Failure of C1 on VM4 -> load is rebalanced

     [Figure: one C1 replica on each of VM1, VM2, VM3 and VM4, labeled 140 ms, 150 ms, 130 ms and infinity respectively]

                                          Application should react early!

Cloud Apps – Overview

     ●    Build for failures
                      –   Do not trust the underlying infrastructure
                      –   Do not trust your components either !
     ●    Components should adapt to the changing conditions
                      –   Quickly
                      –   Automatically
                      –   e.g. by replacing a wonky VM with a new one




Scarce:
a framework to build scalable cloud applications
Architecture Overview

     ●    An agent on each server / VM
                      –    Starts/stops/monitors the components
                      –    Makes decisions on behalf of the components
     ●    An agent communicates with other agents
                      –    Routing table
                      –    Status of the server (resource usage)

     [Figure: a server hosting components A, B and E together with its agent; agents communicate by gossiping and broadcast]

An economic approach

     ●    Time is split into epochs (no synchronization between servers)
     ●    Servers charge a virtual rent for hosting a component according to
                      –   Current resource usage (I/O, CPU, ...) of the server
                      –   Technical factors (HW, connectivity, ...)
                      –   Non-technical factors (country stability, ...)




An economic approach

     ●    Time is split into epochs (no synchronization between servers)
     ●    Servers charge a virtual rent for hosting a component according to
                      –   Current resource usage (I/O, CPU, ...) of the server
                      –   Technical factors (HW, connectivity, ...)
                      –   Non-technical factors (country stability, ...)


     ●    Components
                      –   Pay virtual rent at each epoch
                      –   Gain virtual money by processing requests
                      –   Make decisions based on balance (= gain – rent)
                                    ●   Replicate, migrate, suicide, stay

     ●    Virtual rents are updated by gossiping (no centralized board)
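
The per-epoch decision of a component can be sketched as follows. This is a minimal illustration, not the framework's actual rule: the function name `decide`, the two thresholds and the migration condition are assumptions; the slides only state that the decision is based on balance = gain – rent.

```python
def decide(gain, rent, cheaper_rent=None,
           replicate_above=5.0, suicide_below=-5.0):
    """Return the action a component takes at the end of one epoch.

    The thresholds and the migration rule are hypothetical; the slides
    only state that the decision depends on balance = gain - rent.
    """
    balance = gain - rent
    if balance >= replicate_above:
        return "replicate"   # earning well above its rent: add a replica
    if balance <= suicide_below:
        return "suicide"     # far from covering its rent: remove itself
    if balance < 0 and cheaper_rent is not None and cheaper_rent < rent:
        return "migrate"     # slightly negative and a cheaper server is known
    return "stay"


if __name__ == "__main__":
    print(decide(gain=20.0, rent=10.0))
    print(decide(gain=8.0, rent=10.0, cheaper_rent=6.0))
```

Because each component evaluates this locally with the rents it learned through gossiping, no central coordinator is needed.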

Economic model (i)




     ●    The rent of a server is different for each component !


Economic model (ii)

     [Figure: C1 (CPU: 30%, I/O: 5%) must choose between VM1 (CPU: 70%, I/O: 20%) and VM2 (CPU: 25%, I/O: 65%)]


     ●    VM1 and VM2 have an « identical » average resource usage : 45%
     ●    Server rent = server's resource usage weighted by the component's weights
                      –   Rent for C1 @ VM1 > rent for C1 @ VM2


                                         Multiplexing of server resources
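
The numbers on this slide can be checked with a small sketch. It assumes that the component's weights are simply its own CPU and I/O usage and that the rent is the weighted average of the server's usage; the exact rent formula is not reproduced in these slides, so `weighted_usage` is a hypothetical stand-in.

```python
def weighted_usage(server_usage, component_weights):
    """Weighted average of the server's usage, using the component's weights.

    Assumption: normalizing by the sum of the weights; the real rent
    formula may include further factors.
    """
    total = sum(component_weights.values())
    return sum(component_weights[r] * server_usage[r]
               for r in component_weights) / total


# Values taken from the slide (in percent).
vm1 = {"cpu": 70.0, "io": 20.0}
vm2 = {"cpu": 25.0, "io": 65.0}
c1 = {"cpu": 30.0, "io": 5.0}   # C1 is mostly CPU-bound

rent_vm1 = weighted_usage(vm1, c1)
rent_vm2 = weighted_usage(vm2, c1)

if __name__ == "__main__":
    print(round(rent_vm1, 1), round(rent_vm2, 1))
```

Both VMs have the same unweighted average usage (45%), but the CPU-heavy C1 sees VM1 as roughly twice as expensive as VM2, so it would prefer VM2.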

Economic model (iii)

     ●    Choosing a candidate server j during replication/migration of a
          component i
                      –   net benefit maximization




     ●    2 optimization goals :
                      –   high-availability by geographical diversity of replicas
                      –   low latency by grouping related components
     ●    gj : weight related to the proximity of the server location to the
          geographical distribution of the client requests to the component
     ●    Si is the set of servers hosting a replica of component i


SLA Performance Guarantees (i)

     ●    Each component has its own SLA constraints
     ●    SLA derived directly from entry components


     [Figure: component graph with entry component C1 (SLA: 500 ms) and components C2, C3, C4 and C5]




     ●    Resp. Time = Service Time + max (Resp. Time of Dependencies)
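
The response-time rule above is recursive, which a short sketch makes explicit. The dependency edges and the service times below are hypothetical values chosen for illustration; only the formula itself comes from the slide.

```python
def response_time(component, service_time, dependencies):
    """Resp. time = own service time + slowest dependency's resp. time."""
    children = dependencies.get(component, [])
    slowest = max(
        (response_time(c, service_time, dependencies) for c in children),
        default=0.0,  # a leaf component has no dependencies to wait for
    )
    return service_time[component] + slowest


# Hypothetical dependency graph and service times (ms), not from the slides.
dependencies = {"C1": ["C2", "C3"], "C2": ["C4"], "C3": ["C5"]}
service_time = {"C1": 50.0, "C2": 100.0, "C3": 80.0, "C4": 120.0, "C5": 200.0}

if __name__ == "__main__":
    total = response_time("C1", service_time, dependencies)
    print(total, "ms, SLA met:", total <= 500.0)
```

With these made-up numbers the slower branch (C3 and C5) determines the response time of C1, which is why each component needs its own constraint derived from the SLA of the entry component.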




SLA Performance Guarantees (ii)

     ●    SLA propagation from parents to children
     ●    Parent j sends its performance constraints (e.g. response time upper
          bound) to its dependencies D(j) :




     ●    Child i computes its own performance constraints :




     ●         : group of constraints sent by the replicas of the parent g




SLA Performance Guarantees (iii)

     ●    SLA propagation from parents to children




Automatic Provisioning

     ●    Usage of allocated resources is maximized :
                      –   autonomic migration / replication / suicide of components
                      –   not enough to ensure end-to-end response time


     ●    Cloud resources managed by framework via cloud API

     ●    Each individual component has to satisfy its own SLA
                      –   SLA easily met -> decrease resources (scale down)
                      –   SLA not met -> increase resources (scale up, scale out)
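
The scale-up / scale-down rule can be sketched as a per-component check. The function name and the `easy_margin` threshold (what counts as "easily met") are assumptions; the slides do not give a concrete value.

```python
def provisioning_action(observed_ms, sla_ms, easy_margin=0.5):
    """Map a component's observed response time to a resource decision.

    easy_margin is a hypothetical knob: the SLA counts as "easily met"
    when the observed time is below easy_margin * sla_ms.
    """
    if observed_ms > sla_ms:
        return "increase"   # SLA not met: scale up or scale out
    if observed_ms < easy_margin * sla_ms:
        return "decrease"   # SLA easily met: scale down
    return "keep"


if __name__ == "__main__":
    for observed in (600.0, 400.0, 100.0):
        print(observed, provisioning_action(observed, sla_ms=500.0))
```

The middle "keep" band avoids oscillating between scaling up and scaling down when a component runs close to its bound.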




Adaptivity to slow servers

     ●    Each component keeps statistics about its children
                      –   e.g. 95th-percentile response time
     ●    A routing coefficient is computed for each child at each epoch
                      –   Send more requests to better-performing children
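
One possible way to compute such routing coefficients is to make them inversely proportional to each child's 95th-percentile response time. This weighting is an assumption for illustration; the slides do not state the actual formula, and the names below are hypothetical.

```python
import random


def routing_coefficients(p95_ms):
    """Normalize 1 / p95 so that the coefficients sum to 1.

    Assumes every p95 value is positive; an unresponsive child can be
    given float("inf"), which yields a coefficient of 0.
    """
    inverse = {child: 1.0 / t for child, t in p95_ms.items()}
    total = sum(inverse.values())
    return {child: v / total for child, v in inverse.items()}


def pick_child(coefficients):
    """Choose the child for the next request at random, weighted by coefficient."""
    children = list(coefficients)
    weights = [coefficients[c] for c in children]
    return random.choices(children, weights=weights, k=1)[0]


if __name__ == "__main__":
    coeffs = routing_coefficients({"vm1": 100.0, "vm2": 300.0})
    print(coeffs, pick_child(coeffs))
```

Recomputing the coefficients at each epoch lets a parent shift traffic away from a slow replica without any global coordination.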




Evaluation
Evaluation: Setup

     ●    5 components, mostly CPU-intensive (wc >> wm, wn, wd)

     [Figure: component graph with entry component C1 (SLA: 500 ms) and components C2, C3, C4 and C5]

     ●    8 eight-core servers (Intel Core i7 920, 2.67 GHz, 8 GB, Linux 2.6.32-trunk-amd64)
     ●    d = 0, C = 110, k = 10000, xs* = 25%

Adaptation to Varying Load (i)

     ●    5 rps to 60 rps at minute 8, step 5 rps/min
     ●    Static setup : 2 servers with 2 cores




Adaptation to Varying Load (ii)

     ●    5 rps to 60 rps at minute 8, step 5 rps/min
     ●    Static setup : 2 servers with 2 cores




Adaptation to Slow Server

     ●    Max 2 cores/server, 25 rps
     ●    At minute 4, a server gets slower (200 ms delay)




Scalability

     ●    Add 5 rps per minute until 150 rps
     ●    Max 6 cores/server




Conclusion
Conclusion

     ●    Framework for building cloud applications
     ●    Elasticity : add/remove resources
     ●    High Availability : software, hardware, network failures
     ●    Scalability : growing load, peaks, scaling down, ...
                      –   Quick replication of busy components
     ●    Load Balancing : load has to be shared by all available servers
                      –   Replication of busy components
                      –   Migration of less busy components
                      –   Reach equilibrium when load is stable
     ●    SLA performance guarantees
                      –   Automatic provisioning
     ●    No synchronization, fully decentralized



Thank you !
