An economic approach for scalable and highly-available distributed applications
1. An economic approach for scalable and highly-available distributed applications
Nicolas Bonvin, Thanasis Papaioannou, Karl Aberer
CLOUD 2010, July 5-10 2010, Miami, Florida, USA
nicolas.bonvin@epfl.ch
LSIR - EPFL
2. Introduction
● A distributed application = many (remote) components
● A component is
– A piece of software
– Loosely coupled
– Self-Contained
● e.g. a SOA-based application
[Figure: example application made of five components, A to E]
2 EPFL – LSIR - Nicolas Bonvin
3. Placement: first problem
● Where should the components be placed to maximize the application performance?
[Figure: components A to E waiting to be placed on servers 1 to 4]
4. Placement: first problem
● Where should the components be placed to maximize the application performance?
– Random placement?
[Figure: components A to E placed at random on servers 1 to 4]
Bad resource utilization!
5. Placement: first problem
● Where should the components be placed to maximize the application performance?
– « Clever » random placement?
[Figure: components A to E spread over servers 1 to 4]
D and E should probably be hosted on the same server!
Not always optimal!
6. Even more components !
● High Availability: software, hardware, network failures
● Scalability: growing load, peaks, scaling down, ...
Replication!
[Figure: the application with replicated components: 3 replicas of A, 2 each of B, C, D and E]
7. Placement: second problem
● Where should the components be placed to maximize the application availability?
[Figure: replicated components A to E to be placed across Rack 1 to Rack 4 in Datacenter 1 and Datacenter 2]
8. Multi Objective Optimization Problem
● Maximize the geographical distance of replicas
– Greater availability
● Minimize the geographical distance between related
components
– Lower latency
● Balance the load (disk I/O, network I/O, CPU) between the
servers
– Better application performance
NP-Complete
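As a rough illustration of why exhaustive search is not an option, the sketch below counts raw placements only. It assumes every component instance can go on any server and ignores all constraints; `placement_count` is a hypothetical helper, not something from the slides.

```python
def placement_count(num_servers: int, num_instances: int) -> int:
    """Number of ways to assign each instance to one server (no constraints)."""
    return num_servers ** num_instances


# 5 components on 4 servers, as in the placement slides.
print(placement_count(4, 5))
# 11 replicas (3 of A, 2 each of B, C, D, E) on the same 4 servers.
print(placement_count(4, 11))
```

Each additional replica multiplies the search space by the number of servers, before any of the three objectives is even evaluated.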
10. Architecture overview
● An agent on each server
– Starts/stops/monitors the components
– Makes decisions on behalf of the components
● An agent communicates with other agents
– Routing table
– Status of the server (resource usage)
[Figure: servers, each running an agent and hosting components; the agents exchange information by gossiping and broadcast]
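A minimal sketch of how agents could spread server status by gossiping, under stated assumptions: each agent keeps a versioned view of every server's virtual rent, and a push-pull exchange keeps the freshest entry on both sides. The `Agent` class, its fields, and the versioning scheme are hypothetical; the slides do not describe the actual protocol.

```python
class Agent:
    """One agent per server; holds a local view of all known servers."""

    def __init__(self, server: str, rent: float):
        self.server = server
        # Local view: server name -> (version, rent). Higher version is fresher.
        self.view = {server: (0, rent)}

    def update_rent(self, rent: float) -> None:
        """Publish a new rent for this agent's own server."""
        version, _ = self.view[self.server]
        self.view[self.server] = (version + 1, rent)

    def gossip_with(self, peer: "Agent") -> None:
        """Push-pull exchange: both sides keep the freshest entry per server."""
        missing = (-1, 0.0)  # sentinel that loses against any real entry
        for server in set(self.view) | set(peer.view):
            freshest = max(self.view.get(server, missing),
                           peer.view.get(server, missing))
            self.view[server] = freshest
            peer.view[server] = freshest


a, b, c = Agent("s1", 1.0), Agent("s2", 2.0), Agent("s3", 3.0)
a.gossip_with(b)
b.gossip_with(c)
c.gossip_with(a)
print(a.view == b.view == c.view)
```

No agent needs a central board: after a few pairwise exchanges every agent holds the same view.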
11. An economic approach
● Time is split into epochs (no synchronization between servers)
● Servers charge a virtual rent for hosting a component according to
– Current resource usage (I/O, CPU, ...) of the server
– Technical factors (HW, connectivity, ...)
– Non-technical factors (country stability, ...)
● Components
– Pay virtual rent at each epoch
– Gain virtual money by processing requests
– Make decisions based on balance (= gain – rent)
● Replicate, migrate, suicide, stay
● Virtual rents are updated by gossiping (no centralized board)
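The per-epoch accounting can be sketched as follows. Only balance = gain – rent comes from the slide. The rent formula (a base price scaled by CPU and I/O usage) and the fixed gain per request are assumptions made for illustration, as are all names.

```python
from dataclasses import dataclass, field


def virtual_rent(base_price: float, cpu_usage: float, io_usage: float) -> float:
    """Hypothetical rent: a base price that grows with the server's load."""
    return base_price * (1.0 + cpu_usage + io_usage)


@dataclass
class Component:
    name: str
    balances: list = field(default_factory=list)  # one balance per epoch

    def close_epoch(self, requests: int, gain_per_request: float,
                    rent: float) -> float:
        """Record and return this epoch's balance (= gain - rent)."""
        balance = requests * gain_per_request - rent
        self.balances.append(balance)
        return balance


rent = virtual_rent(base_price=10.0, cpu_usage=0.5, io_usage=0.25)
comp = Component("A")
print(comp.close_epoch(requests=100, gain_per_request=0.25, rent=rent))
```

A busy server charges more, so a component with few requests ends the epoch with a negative balance there.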
12. Economic model
● Replication of a component
– If the minimum availability is not reached
– If b' > 0 for the last n epochs
● Migration/Suicide of a component
– If balance c < 0 for the last n epochs
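The rules above can be read as a small decision function. This is a sketch with two assumptions. First, both conditions are treated as tests on the same per-epoch balance, although the slide uses two symbols. Second, the slide does not say how to choose between migration and suicide, so the sketch lets a replica suicide only if the remaining replicas still meet the minimum availability. All names are hypothetical.

```python
def decide(balances, n, availability, min_availability,
           availability_without_me):
    """Return 'replicate', 'migrate', 'suicide' or 'stay' for one component."""
    # Rule 1: replicate whenever the minimum availability is not reached.
    if availability < min_availability:
        return "replicate"
    recent = balances[-n:]
    if len(recent) < n:
        return "stay"  # not enough history yet
    # Rule 2: replicate after n consecutive profitable epochs.
    if all(b > 0 for b in recent):
        return "replicate"
    # Rule 3: leave the server after n consecutive losing epochs.
    if all(b < 0 for b in recent):
        # Assumption: suicide only if availability survives without this replica.
        if availability_without_me >= min_availability:
            return "suicide"
        return "migrate"
    return "stay"


print(decide([2.0, 3.0, 1.5], n=3, availability=0.9,
             min_availability=0.8, availability_without_me=0.85))
```

Requiring n consecutive epochs damps oscillation: a single bad epoch does not trigger a migration.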
13. Availability (i)
● Increase availability by increasing geographical diversity
● Handled by replication
– Granularity: rack, room, datacenter, country, ...
– Label: NA-US-NY1-C01-R12-S02
● Each component must satisfy a minimum availability
● S_i is the set of servers hosting a replica of component i
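A server label such as NA-US-NY1-C01-R12-S02 can be split into its location levels. The sketch assumes the six fields stand for continent, country, datacenter, room, rack and server; the slide only lists the granularities, so these field names are a guess. `shared_levels` is a hypothetical helper that counts how many leading levels two servers have in common.

```python
# Assumed meaning of the six label fields, from coarsest to finest.
LEVELS = ("continent", "country", "datacenter", "room", "rack", "server")


def parse_label(label: str) -> dict:
    """Split a label like 'NA-US-NY1-C01-R12-S02' into named levels."""
    parts = label.split("-")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} fields, got {len(parts)}: {label!r}")
    return dict(zip(LEVELS, parts))


def shared_levels(label_a: str, label_b: str) -> int:
    """Count the leading location levels two servers have in common."""
    count = 0
    for a, b in zip(label_a.split("-"), label_b.split("-")):
        if a != b:
            break
        count += 1
    return count


print(parse_label("NA-US-NY1-C01-R12-S02")["rack"])
print(shared_levels("NA-US-NY1-C01-R12-S02", "NA-US-NY1-C02-R03-S01"))
```

Two replicas that share many leading levels are likely to fail together, which is why geographical diversity raises availability.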
14. Availability (ii)
● Similarity: computes the distance between 2 servers
● Diversity:
● Choosing a candidate server j
● g_j: weight related to the proximity of the server location to the
geographical distribution of the client requests to the component
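The formulas for similarity, diversity and candidate selection did not survive in this transcript, so the following is only a stand-in sketch and not the authors' definitions. It assumes diversity is the number of location levels two labels do not share, and it scores a candidate server j as g_j times its total diversity to the current replicas, minus its virtual rent. All names are hypothetical.

```python
def diversity(label_a: str, label_b: str) -> int:
    """Assumed diversity: location levels NOT shared by two server labels."""
    fields_a, fields_b = label_a.split("-"), label_b.split("-")
    shared = 0
    for a, b in zip(fields_a, fields_b):
        if a != b:
            break
        shared += 1
    return len(fields_a) - shared


def choose_candidate(replica_labels, candidates):
    """Pick the best candidate; candidates are (label, g_j, rent) tuples."""
    def score(candidate):
        label, g, rent = candidate
        # Assumed net benefit: weighted diversity gain minus the rent.
        return g * sum(diversity(label, r) for r in replica_labels) - rent
    return max(candidates, key=score)[0]


replicas = ["NA-US-NY1-C01-R12-S02"]
candidates = [
    ("NA-US-NY1-C01-R12-S03", 1.0, 1.0),  # same rack, cheap
    ("EU-CH-GV1-C01-R01-S01", 1.0, 2.0),  # other continent, pricier
]
print(choose_candidate(replicas, candidates))
```

With equal weights the distant server wins despite its higher rent. A low g_j, meaning few client requests come from that region, can reverse the choice.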
15. Summary
● High Availability: software, hardware, network failures
– Geographically aware placement (netbenef maximization)
– Minimum availability level per component
● Scalability: growing load, peaks, scaling down, ...
– Quick replication of busy components
● Load Balancing: load has to be shared by all available servers
– Replication of busy components
– Migration of less busy components
– Reach equilibrium when load is stable
● No synchronization, fully decentralized