Cloud probing is, in a way, the inverse of the Virtual Network Embedding (VNE) problem. VNE optimizes the deterministic mapping of multiple virtual graphs onto a shared physical topology. However, cloud platforms today rarely offer such a raw level of control to users; instead, they offer their own tools, which optimize for certain QoS metrics but not for topology. This paper presents a new framework in which VM populations optimize their own topology by probing the cloud platform they run on and triggering migrations of resources. The core advantage is that applications can set and implement their own topological and performance targets. This paper presents simple but practical use cases, while the topic itself opens onto a new generation of cloud-specific end-to-end measurement methods and tools.
Cloud Probing
1.
2.
Cloud Platforms (taxonomy)
• Cloud Platforms (Amazon)
◦ raw access at VM level
◦ client decides when and what to migrate
• App Platforms (Heroku)
◦ container level
◦ Heroku packs containers into VMs
◦ user has limited access to migrations
• DIY Platforms (Docker)
◦ container level
◦ manual install at each VM, then automation
◦ Docker is like a GitHub for OS images
M.Zhanikeev -- maratishe@gmail.com -- Cloud Probing -- http://bit.do/150115icm -- 2/24
3.
Cloud Populations
[Figure: cloud populations; labels: Cloud/DC, APP, VM, Container]
• population = service (Heroku, Docker, video streaming 01)
• app can be VM or container
• users can be included as e2e QoS 04
01 myself+0 "Multi-Source Stream Aggregation in the Cloud" Book on Advanced Content Delivery, Wiley (2014)
04 myself+0 "A holistic community-based architecture for measuring E2E QoS at DCs" IJCSE (2014)
4.
Related Topics
• active probing 03
◦ available bandwidth, bulk transfer, etc.
• delay space and network coordination 07
• Virtual Network Embedding (VNE) 09
• migration cost and energy-efficient clouds
◦ migration schedules and greyboxes 05
• fog computing -- clouds at network edge 08
• BigData Networking -- circuits-over-packets in particular 02
03 1+myself "Active Network Measurement: Theory, Methods, and Tools" ITU (2009)
07 myself+1 "Application of Graph Theory to Clustering in Delay Space" APSITT (2010)
09 J.Lu+1 "Efficient Mapping of Virtual Networks onto a Shared Substrate" Washington Univ. (2006)
05 myself+0 "Optimizing Virtual Machine Migration for Energy-Efficient Clouds" IEICEJ (2014)
08 myself+0 "A Cloud Visitation Platform for Federated Services at Network Edge" 10th CISSE (2014)
02 myself+0 "Circuit Emulation for Big Data Transfers in Clouds" Book on Networking for Big Data, CRC (2015)
5.
Cloud Probing is Reversed VNE
• VNE: optimize mapping of many virtual graphs onto one physical topology
◦ problem: feasibility low, complexity very high
◦ cloud providers are unlikely to implement it in the near future
• Cloud Probing: optimize your own population
◦ basically a distributed version of client-side VNE
◦ no support needed from cloud providers -- can be used today!
7.
Experiment on Amazon (AWS) Cloud
• PlanetLab (legacy) → Amazon Cloud
• 15 VMs across 8 AWS regions
• 5 VMs migrate to random locations every hour
◦ roughly equal distribution is enforced
• each hour: continuous probing between random pairs of VMs
◦ rx/tx direction is emulated as HTTP GET or POST
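The hourly schedule listed above can be sketched roughly as follows. This is only an illustration under stated assumptions: the region names are borrowed from the stress-ring slides, the function names are hypothetical, "roughly equal distribution" is interpreted as "move each VM to a random least-populated region", and continuous probing is reduced to a fixed number of probes per hour.

```python
import random
from collections import Counter

# Region names as they appear on the stress-ring slides.
REGIONS = ["california", "ireland", "oregon", "saopaulo",
           "singapore", "sydney", "tokyo", "virginia"]

def initial_placement(n_vms=15):
    """Spread VMs round-robin over the regions (assumed starting point)."""
    return {f"vm{i}": REGIONS[i % len(REGIONS)] for i in range(n_vms)}

def migrate_some(placement, rng, n_moves=5):
    """Move n_moves random VMs, each to a random least-populated region."""
    placement = dict(placement)
    for vm in rng.sample(sorted(placement), n_moves):
        del placement[vm]  # the VM leaves its current region
        load = Counter(placement.values())
        lowest = min(load[r] for r in REGIONS)
        placement[vm] = rng.choice([r for r in REGIONS if load[r] == lowest])
    return placement

def probe_plan(placement, rng, n_probes=10):
    """Random ordered VM pairs; direction is emulated as HTTP GET or POST."""
    plan = []
    for _ in range(n_probes):
        src, dst = rng.sample(sorted(placement), 2)
        plan.append((src, dst, rng.choice(["GET", "POST"])))
    return plan

rng = random.Random(1)
placement = initial_placement()
for hour in range(3):
    placement = migrate_some(placement, rng)
    plan = probe_plan(placement, rng)
    print(hour, sorted(Counter(placement.values()).items()), plan[0])
```

With 15 VMs over 8 regions, this rule keeps every region at one or two VMs after each migration round.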
10.
Groping by Probing
• probing: migrate and see what happens
• groping: no way to know in advance whether a migration results in better or worse performance
• ... in advanced designs, can use history to assign probabilities → Markov modeling
[Figure: state diagram with states IDLE, BETTER and WORSE; transitions labeled Migrate and Revert]
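A minimal sketch of the migrate-then-revert loop described above, assuming a `measure(location)` callback that returns a cost where lower is better. The callback, the function names and the delay numbers are all hypothetical. The history here is a plain success count per target, which is a simpler stand-in for the Markov modeling mentioned on the slide, not an implementation of it.

```python
import random

def probe_step(current, candidates, measure, rng):
    """Migrate to a random other location; keep it only if it measures better."""
    target = rng.choice([c for c in candidates if c != current])
    if measure(target) < measure(current):
        return target, target, "BETTER"   # stay in the new location
    return current, target, "WORSE"       # Revert to the old location

def grope(start, candidates, measure, steps, seed=0):
    """Repeat probe_step and count how often each target turned out better."""
    rng = random.Random(seed)
    history = {c: {"better": 0, "tried": 0} for c in candidates}
    location = start
    for _ in range(steps):
        location, target, outcome = probe_step(location, candidates, measure, rng)
        history[target]["tried"] += 1
        history[target]["better"] += outcome == "BETTER"
    return location, history

# Made-up delays in ms, for illustration only (not measured data).
DELAY_MS = {"tokyo": 30.0, "oregon": 20.0, "ireland": 50.0}
final, history = grope("ireland", list(DELAY_MS), DELAY_MS.get, steps=20)
print(final)
for target, h in history.items():
    if h["tried"]:
        print(target, h["better"] / h["tried"])  # empirical chance of BETTER
```

Real measurements are noisy, so a practical version would compare averages over several probes rather than single values.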
11.
The Low-Start Model
[Figure: performance vs cost curve; labels: Stop, New state]
• the low-start concept
• each new improvement comes at higher cost
• stop or new state? ... in practice new state is, by nature, more likely
13.
Stress Ring (1) Copy AMI
[Figure: stress ring over california, ireland, oregon, saopaulo, singapore, sydney, virginia; key (copyami), sizes (1 10), parties (ab)]
• stress ring: pressure implodes the balloon
• key: which metric becomes stress
• sizes: kbytes ... small = delay, large = throughput
• parties: AA = intra-DC, AB = inter-DC
• Copy AMI is the AWS action for copying VM images across DCs
• ... Brazil is very far from Tokyo
14.
Stress Ring (2) Intra-DC Delay and Bulk
[Figure, panel 1: stress ring over california, ireland, oregon, saopaulo, singapore, sydney, tokyo, virginia; key (probe), sizes (1 10), parties (aa)]
[Figure, panel 2: stress ring over the same regions; key (probe), sizes (2000 5000), parties (aa)]
15.
Stress Ring (3) Inter-DC Bulk
[Figure: stress ring over california, ireland, oregon, saopaulo, singapore, sydney, tokyo, virginia; key (probe), sizes (2000 5000), parties (ab)]
• 2-ring version
• outside ring: same as before
• inside ring: the main contributor to stress
• reading: California's throughput is not bad but variance is high and mostly caused by Oregon
17.
Stress: Graph vs Ring
[Figure: example graph drawing; labels include using, making, researching, bigdata, gui, api, pmstacks, vms, vmapps, distapps, scrum, ticketdev, mongodb, eclipse, optimization, migration, visualization, apps, tools, trac]
Cloud Populations ... are mostly rings, almost never graphs
• related topic: graph drawing 10
• rings are easier to draw and understand
• rings are better for management decisions -- which DCs cause the most stress?
• Facebook social graph vs Google circles
10 T.Kamada+1 "An algorithm for drawing general undirected graphs" Information Processing Letters (1989)
18.
Stress Optimization
• $v$ (will call it key later) -- an arbitrary performance metric
• DCs/regions are $a$ and $b$, i.e. performance is $v_{ab}$
• same-node (intra-DC) $v_{aa}$ (always $a$) and directional $v_{ab} \neq v_{ba}$
• (even a ring is a) graph $G(N, M)$ of $n$ nodes and $m$ links
• collect a set of values $\{v_{ab}\}$ for pairwise $a, b \in G$
• then stress is an aggregate of probing data:
$$S_a = f(v_{aa}, v_{ab}, v_{ac}, \ldots, v_{ax}), \quad \text{where } \{a, b, c, \ldots, x\} \in G, \tag{1}$$
• ... $f()$ is an arbitrary aggregator function (sum, average, etc.).
• stress optimization is then:
$$\text{minimize} \sum S_x, \quad x \in G \tag{2}$$
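As a small worked example of Eqs. (1) and (2), the sketch below computes per-node stress and the total objective from a dictionary of probed values. The function names and the numbers are assumptions made for illustration; only nodes that appear as a probe source get a stress value.

```python
from statistics import mean

def stress(values, node, f=sum):
    """Eq. (1): S_a = f(v_aa, v_ab, ...), aggregating values probed from node."""
    return f([v for (a, _b), v in values.items() if a == node])

def total_stress(values, f=sum):
    """Objective of Eq. (2): the sum of S_x over all probing nodes x."""
    nodes = {a for a, _b in values}
    return sum(stress(values, x, f) for x in nodes)

# Made-up values v_ab keyed by (a, b); direction matters, so v_ab != v_ba.
values = {
    ("tokyo", "tokyo"): 1.0,
    ("tokyo", "oregon"): 9.0,
    ("oregon", "tokyo"): 11.0,
    ("oregon", "oregon"): 1.0,
}

print(stress(values, "tokyo"))         # aggregator f = sum
print(stress(values, "oregon", mean))  # aggregator f = average
print(total_stress(values))
```

An optimizer would then compare `total_stress` across candidate placements and prefer the placement with the lowest total.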
19.
Analysis: Models
1. Pooler Model (1 ring)
◦ BigData aggregation
◦ 3 VMs, 1 VM collects and stores data from other 2 VMs
2. Syncer Model (2 rings)
◦ 1st ring: same as Pooler Model, only all-to-all throughput
◦ 2nd ring: e2e delay between users and 3 VMs -- with time zones, etc.
• ... both are trace-based simulations -- the AWS experiment is the trace
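To make the Pooler Model concrete, here is a hedged sketch that picks which of the 3 VMs should act as the pooler. It assumes that both transfers run in parallel, so completion time is set by the slowest one, and that throughput between DCs is looked up from a trace. The throughput numbers below are invented for illustration and are not taken from the AWS experiment; the function names are hypothetical.

```python
def completion_time(pooler, vms, throughput, kbytes):
    """Seconds until the pooler has pulled kbytes from each of the other VMs."""
    return max(kbytes / throughput[(src, pooler)] for src in vms if src != pooler)

def pick_pooler(vms, throughput, kbytes):
    """Choose the VM location that minimizes the pooling completion time."""
    return min(vms, key=lambda p: completion_time(p, vms, throughput, kbytes))

# Invented throughput in kbytes/s, keyed by (source DC, destination DC).
throughput = {
    ("oregon", "tokyo"): 500.0, ("saopaulo", "tokyo"): 100.0,
    ("tokyo", "oregon"): 400.0, ("saopaulo", "oregon"): 250.0,
    ("tokyo", "saopaulo"): 100.0, ("oregon", "saopaulo"): 200.0,
}
vms = ["tokyo", "oregon", "saopaulo"]

best = pick_pooler(vms, throughput, kbytes=5000)
print(best, completion_time(best, vms, throughput, kbytes=5000))
```

In a trace-based simulation the `throughput` table would be replaced by values replayed from the recorded probes for the current hour.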
20.
Analysis: Migration
[Figure: two panels, Pooler model and Syncer model]
• Pooler Model is more stable -- certain combinations of DCs are better
• Syncer Model -- less stable because of 2 rings and daytime fluctuations
21.
Analysis: Overall
[Figure: two panels comparing "Do nothing" and "Optimize". Pooler Model: completion time (s) vs ordered list of values. Syncer Model: average delay (log of ms) vs ordered list of values.]
• stress optimization results in better performance in more than 80% of cases
23.
Implementation: the TopoAPI
[Figure: TopoAPI sequence diagram. Labels: API Service, Contract (key), Population, TopoAPI, Service, Stats. Messages: New session → ID; ADD(a, b, value) → OK; OPTIMIZE(model) → Graph, Migrations, …; Read, Result, Solve]
• an independent service -- Heroku-based API
• fully abstract (a, b, value) performance tuple
• sessions are up to client
• generic: stress ring is only one model, others are possible
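The session flow above (new session, ADD, OPTIMIZE) can be mimicked with a small in-memory stand-in. This is not the actual TopoAPI: the class name, method names, the `"ring"` model label and the return format are all hypothetical, and the "optimization" is reduced to reporting per-node stress and the most stressed node.

```python
import uuid
from collections import defaultdict

class TopoSession:
    """In-memory stand-in for one session (hypothetical, not the real service)."""

    def __init__(self):
        self.id = uuid.uuid4().hex          # "New session" -> ID
        self.values = defaultdict(list)     # (a, b) -> list of samples

    def add(self, a, b, value):
        """ADD(a, b, value): store one abstract performance tuple."""
        self.values[(a, b)].append(value)
        return "OK"

    def optimize(self, model="ring"):
        """OPTIMIZE(model): only a stress-ring summary is sketched here."""
        if model != "ring":
            raise ValueError(f"unknown model: {model}")
        stress = defaultdict(float)
        for (a, _b), samples in self.values.items():
            stress[a] += sum(samples) / len(samples)  # average per pair
        worst = max(stress, key=stress.get) if stress else None
        return {"stress": dict(stress), "worst": worst}

session = TopoSession()
session.add("tokyo", "oregon", 9.0)
session.add("tokyo", "oregon", 11.0)
session.add("oregon", "tokyo", 4.0)
print(session.optimize())
```

Keeping the tuple abstract means the same calls work whether `value` is a delay, a throughput, or an AMI copy time.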
24.
That’s all, thank you ...