The document discusses network-aware data management for large-scale distributed applications. It provides an outline for a presentation on this topic, covering the performance of VSAN and VVOL storage in virtualized environments, the PetaShare distributed storage system and the Stork data scheduler, data streaming in high-bandwidth networks, and related topics such as network reservations and scheduling. The presenter's background and experience in data transfer scheduling, distributed storage, and high-performance computing networks are also briefly summarized.
Network-aware Data Management for High Throughput Flows, Akamai, Cambridge, ... (balmanme)
The document discusses Mehmet Balman's work on network-aware data management for large-scale distributed applications. It provides background on Balman, including his employment at VMware and affiliations. The presentation outline discusses VSAN and VVOL storage performance in virtualized environments, data streaming in high-bandwidth networks, the Climate100 100Gbps networking demo, and other topics related to network-aware data management.
Linac Coherent Light Source (LCLS) Data Transfer Requirements (inside-BigData.com)
In this deck from the Stanford HPC Conference, Les Cottrell from the SLAC National Accelerator Laboratory at Stanford University presents: Linac Coherent Light Source (LCLS) Data Transfer Requirements.
"Funded by the U.S. Department of Energy (DOE) the LCLS is the world’s first hard X-ray free-electron laser. Its strobe-like pulses are just a few millionths of a billionth of a second long, and a billion times brighter than previous X-ray sources. Scientists use LCLS to take crisp pictures of atomic motions, watch chemical reactions unfold, probe the properties of materials and explore fundamental processes in living things.
Its performance to date, over the first few years of operation, has already provided a breathtaking array of world-leading results, published in the most prestigious academic journals and has inspired other XFEL facilities to be commissioned around the world.
LCLS-II will build from the success of LCLS to ensure that the U.S. maintains a world-leading capability for advanced research in chemistry, materials, biology and energy. It is planned to see first light in 2020.
LCLS-II will provide a major jump in capability – moving from 120 pulses per second to 1 million pulses per second. This will enable researchers to perform experiments in a wide range of fields that are now impossible. The unique capabilities of LCLS-II will yield a host of discoveries to advance technology, new energy solutions and our quality of life.
Analysis of the data will require transporting huge amounts of data from SLAC to supercomputers at other sites to provide near real-time analysis results and feedback to the experiments.
The talk will introduce LCLS and LCLS-II with a short video, discuss its data reduction, collection, data transfer needs and current progress in meeting these needs."
Watch the video: https://youtu.be/LkwwGh7YdPI
Learn more: https://www6.slac.stanford.edu/
and
http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Streaming exa-scale data over 100Gbps networks (balmanme)
This document discusses streaming exascale data over 100Gbps networks. It summarizes a demonstration at SC11 where climate simulation data was transferred from NERSC to ANL and ORNL at 83Gbps using a memory-mapped zero-copy network channel called MemzNet. The demonstration showed that efficient transfer of large datasets containing many small files is possible over high-bandwidth networks through parallel streams, decoupling of I/O and network operations, and dynamic data channel management. High performance was achieved by keeping the data channel full through concurrent transfers and by leveraging high-speed networking testbeds such as ANI.
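As an illustration of the decoupling idea, here is a minimal sketch (hypothetical structure, not the actual MemzNet code): reader threads aggregate many small files into fixed-size blocks on a bounded queue, while sender threads drain the blocks over parallel streams, which is what keeps the data channel full.

```python
# Minimal sketch of decoupling I/O from network transfer with a shared
# block queue, in the spirit of MemzNet's block-based data channel.
# All names here are illustrative, not the actual MemzNet API.
import queue
import threading

BLOCK_SIZE = 4 * 1024 * 1024          # aggregate small files into 4 MiB blocks
blocks = queue.Queue(maxsize=64)      # bounded buffer keeps memory use fixed

def reader(paths):
    """I/O side: pack file contents into fixed-size blocks."""
    buf = bytearray()
    for path in paths:
        with open(path, "rb") as f:
            buf += f.read()
        while len(buf) >= BLOCK_SIZE:
            blocks.put(bytes(buf[:BLOCK_SIZE]))
            del buf[:BLOCK_SIZE]
    if buf:
        blocks.put(bytes(buf))
    blocks.put(None)                  # sentinel: no more data

def sender(sock):
    """Network side: drain blocks over one of several parallel streams."""
    while True:
        block = blocks.get()
        if block is None:
            blocks.put(None)          # let sibling senders terminate too
            return
        sock.sendall(block)
```

Because the queue is bounded, fast disks cannot outrun the network (and vice versa), and adding sender threads is how the transfer scales to more parallel streams.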
Open Programmable Architecture for Java-enabled Network Devices (Tal Lavian Ph.D.)
Programmable Network Devices.
Openly Programmable devices enable new types of intelligence on the network.
Changing the Rules of the Game.
The Web Changed Everything
- Browsers: introducing the JVM to browsers allowed dynamic loading of Java Applets to end stations
- Routers: introducing the JVM to routers allows dynamic loading of Java Oplets to routers
A Platform for Data Intensive Services Enabled by Next Generation Dynamic Opt... (Tal Lavian Ph.D.)
A new architecture is proposed for data-intensive services enabled by next-generation dynamic optical networks. It:
Encapsulates “optical network resources” into a service framework to support dynamically provisioned and advanced data-intensive transport services
Provides a generalized framework for high performance applications over next generation networks, not necessarily optical end-to-end
Supports both on-demand and scheduled data retrieval
Supports a meshed wavelength switched network capable of establishing an end-to-end lightpath in seconds
Supports bulk data-transfer facilities using lambda-switched networks
Supports out-of-band tools for adaptive placement of data replicas
Offers network resources as Grid services for Grid computing
The document summarizes the architecture and configuration of a large-scale data warehouse implemented at Yahoo using Oracle RAC on IBM x3850 servers. Key aspects included 16-node Oracle RAC with InfiniBand networking, EMC storage, large memory and CPU configurations to support multi-terabyte datasets and high query concurrency. Comprehensive testing was performed to validate performance and scalability requirements.
The document discusses two proprietary technologies from TIMMES, Inc. called ODEN and MAGNUS that improve data transmission efficiency over networks. ODEN utilizes compression and TCP acceleration to improve bandwidth efficiency. MAGNUS is an evolution of ODEN that requires software on only one side and uses TCP acceleration to reduce latency effects. Third party tests showed the technologies increased throughput by 200-2000% over FTP and HTTP for various file types including video and imagery files.
The document discusses how application architects traditionally focused on solving IO bottlenecks in servers by offloading processing to intelligent network interface cards. With modern distributed applications spanning thousands of servers, application architects now must consider network topology, segmentation, and control plane protocols to optimize latency and bandwidth. The rise of virtualization and cloud computing has changed traffic patterns in datacenters from north-south traffic to dominant east-west traffic between servers. This requires new datacenter fabric designs beyond the traditional three-tiered topology.
Improving Passive Packet Capture: Beyond Device Polling (Hargyo T. Nugroho)
The document discusses improving passive packet capture performance beyond device polling. It proposes a "Socket Ring" approach using PF_RING to create a ring buffer on the network interface card driver. This allows captured packets to bypass the kernel and be directly accessed by userspace applications via memory mapping, improving performance over traditional approaches. Experimental results found the PF_RING approach captured packets much faster than Linux's standard approach, especially for medium and large packets, though some packets were still lost. The approach requires a real-time kernel patch and performance is ultimately limited by network drivers and how the kernel fetches packets.
This technical whitepaper compares Aspera FASP, a high-speed transport protocol, to alternative TCP-based and UDP-based file transfer technologies. It finds that while TCP and high-speed TCP variants can improve throughput over standard TCP in low-loss networks, their performance degrades significantly in wide-area networks with higher latency and packet loss. UDP-based solutions also struggle to achieve high throughput and efficiency across different network conditions due to poor congestion control. In contrast, Aspera FASP is able to achieve maximum throughput that is independent of network characteristics like latency and packet loss, making it optimal for reliable, high-speed transfer of large files over IP networks.
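The latency and loss sensitivity described above follows from the well-known Mathis et al. (1997) bound on standard TCP throughput. The short calculation below uses illustrative numbers, not figures from the whitepaper, to show why loss that is harmless on a LAN cripples a long-haul link.

```python
# Rough upper bound on standard TCP throughput (Mathis et al., 1997):
#   throughput <= MSS / (RTT * sqrt(p))
# where MSS is the segment size, RTT the round-trip time, p the loss rate.
import math

def tcp_throughput_bps(mss_bytes, rtt_s, loss_rate):
    return (mss_bytes * 8) / (rtt_s * math.sqrt(loss_rate))

for rtt_ms, p in [(1, 1e-4), (100, 1e-4), (100, 1e-2)]:
    bps = tcp_throughput_bps(1460, rtt_ms / 1000, p)
    print(f"RTT={rtt_ms:3d} ms, loss={p:.4f}: ~{bps / 1e6:8.1f} Mbit/s")
# On the 1 ms path with 0.01% loss the bound exceeds 1 Gbit/s; stretching
# RTT to 100 ms drops it to ~12 Mbit/s, and 1% loss leaves ~1 Mbit/s.
```

This RTT and sqrt(loss) dependence is exactly what rate-based protocols such as FASP are designed to sidestep.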
DWDM-RAM: Enabling Grid Services with Dynamic Optical Networks (Tal Lavian Ph.D.)
Packet-switching technology: a great solution for small-burst communication such as email and telnet.
Data-intensive grid applications: involve moving massive amounts of data and require high, sustained bandwidth.
DWDM: basically circuit switching; enables QoS at the physical layer; provides high and sustained bandwidth.
DWDM based on dynamic wavelength switching: enables dedicated optical paths to be allocated dynamically.
A Whole Lot of Ports: Juniper Networks QFabric System Assessment (Juniper Networks)
Juniper Networks commissioned Network Test to assess the performance, interoperability, and usability of its QFabric System, a converged switch fabric for cloud and large data center applications tested with 1,536 10-Gbit/s Ethernet ports.
Even at this unprecedented scale – by far the largest ever in a public switch test – this project loaded the QFabric System to only one-quarter of its maximum capacity of 6,144 10-Gbit/s Ethernet ports.
Using industry-standard RFC benchmarks representing the most rigorous possible test cases, engineers stress-tested QFabric System performance in terms of unicast and multicast throughput and latency with separate events for Layer 2 and Layer 3 traffic. Engineers also assessed interoperability, a key consideration when adding QFabric technology incrementally into existing data center networks, and evaluated device management.
Improving Efficiency of Machine Learning Algorithms using HPCC Systems (HPCC Systems)
1) The document discusses improving the efficiency of machine learning algorithms using the HPCC Systems platform through parallelization.
2) It describes the HPCC Systems architecture and its advantages for distributed machine learning.
3) A parallel DBSCAN algorithm is implemented on the HPCC platform which shows improved performance over the serial algorithm, with execution times decreasing as more nodes are used.
This document summarizes the DevoFlow paper, which proposes techniques to scale flow management for high-performance networks. It finds that per-flow management in OpenFlow introduces high overheads. DevoFlow aims to balance network control, statistics collection, and switch overhead by devolving most flow control to switches while maintaining partial visibility of significant flows. Simulation results show DevoFlow can reduce flow scheduling overheads compared to per-flow control, while still achieving high performance.
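As a rough illustration of devolved flow control, the sketch below (names and threshold are illustrative, not from the paper) lets the switch count bytes per flow and escalate a flow to the controller only once it crosses a significance threshold, so the controller sees elephant flows without per-flow overhead for everything else.

```python
# Sketch of DevoFlow-style devolved flow handling: the switch forwards
# most flows on its own and reports only flows that exceed a byte
# threshold to the controller. Threshold and structures are illustrative.
from collections import defaultdict

ELEPHANT_BYTES = 128 * 1024   # e.g. 128 KB transferred marks a significant flow

flow_bytes = defaultdict(int)
reported = set()

def on_packet(flow_id, length, controller):
    """Per-packet path in the switch: count bytes, escalate big flows once."""
    flow_bytes[flow_id] += length
    if flow_id not in reported and flow_bytes[flow_id] >= ELEPHANT_BYTES:
        reported.add(flow_id)
        controller.reschedule(flow_id, flow_bytes[flow_id])

class Controller:
    def reschedule(self, flow_id, nbytes):
        print(f"controller: rerouting elephant flow {flow_id} ({nbytes} bytes)")

ctrl = Controller()
for _ in range(2000):
    on_packet("10.0.0.1->10.0.0.2:443", 100, ctrl)   # 200 KB total: escalated once
```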
Grid optical network service architecture for data intensive applications (Tal Lavian Ph.D.)
Integrated SW system provides the “glue”: the dynamic optical network as a fundamental Grid service in data-intensive Grid applications, to be scheduled, managed, and coordinated to support collaborative operations.
From super-computer to super-network:
In the past, computer processors were the fastest part and peripherals were the bottleneck.
In the future, optical networks will be the fastest part; computers, processors, storage, visualization, and instrumentation become the slower “peripherals”.
eScience cyber-infrastructure focuses on computation, storage, data, analysis, and workflow.
The network is vital for better eScience.
Virtualization in 4-4 1-4 Data Center Network (Ankita Mahajan)
4-4 1-4 delivers strong performance guarantees in a traditional (non-virtualized) setting, due to location-based static IP address allocation to all network elements.
The document provides an overview of several data center network architectures: Monsoon, VL2, SEATTLE, PortLand, and TRILL. Monsoon proposes a large layer 2 domain with a Clos topology and uses MAC-in-MAC encapsulation and load balancing to improve scalability. VL2 also uses a Clos topology with flat addressing, load balancing, and an end host directory for address resolution. SEATTLE employs flat addressing, automated host discovery, and hash-based address resolution. PortLand uses a tree topology with encoded switch positions and a fabric manager for address mapping. TRILL standardizes encapsulation and IS-IS routing between routing bridges.
VL2: A scalable and flexible Data Center Network (Ankita Mahajan)
This data center network architecture introduces a virtual layer 2.5 in the protocol stack of hosts and uses a directory service to achieve efficient forwarding. It uses separate location and identifier IP addresses.
The document discusses cloud computing and coordination of cloud applications using ZooKeeper. It provides an overview of challenges for cloud computing, architectural styles like client-server and REST, and workflows involving coordination of multiple activities. It then describes ZooKeeper as a distributed coordination service whose consensus core is a Paxos-like atomic broadcast protocol (Zab). ZooKeeper provides reliable coordination through a replicated database, atomic broadcasts, and guarantees such as sequential consistency.
The document discusses Juniper's WANDL and NorthStar solutions for network operators. It provides an overview of the key capabilities of each solution, including:
- WANDL's IP/MPLS View allows operators to design, plan, monitor and optimize multi-vendor Layer 3 networks. It provides network modeling, traffic analysis and automated provisioning capabilities.
- NorthStar combines WANDL's path computation with Juniper's dynamic IP control plane to enable stateful traffic engineering. It provides optimized routing using a centralized path computation approach.
- Both solutions help operators improve network performance, redundancy and efficiency through capabilities like failure simulation, capacity planning, high availability assessment and traffic engineering.
Efficient node bootstrapping for decentralised shared-nothing Key-Value Stores (Han Li)
This slide deck was presented at ACM/IFIP/USENIX Middleware 2013 for the paper "Efficient node bootstrapping for decentralised shared-nothing Key-Value Stores". The abstract of the paper is shown below.
Abstract. Distributed key-value stores (KVSs) have become an important component for data management in cloud applications. Since resources can be provisioned on demand in the cloud, there is a need for efficient node bootstrapping and decommissioning, i.e. incorporating or eliminating the provisioned resources as members of the KVS. This requires that data be handed over and load be shifted across the nodes quickly. However, the data partitioning schemes in current shared-nothing KVSs are not efficient for quick bootstrapping. In this paper, we design a middleware layer that provides a decentralised scheme of auto-sharding with two-phase bootstrapping. We experimentally demonstrate that our scheme reduces bootstrap time and improves load balancing, thereby increasing the scalability of the KVS.
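For background on the handover problem the paper addresses, the sketch below shows, for a plain consistent-hash ring (deliberately not the paper's two-phase auto-sharding scheme), exactly which keys must be handed over when a new node bootstraps.

```python
# Background sketch: in a consistent-hash ring, bootstrapping a node means
# handing over exactly the keys that now map to it. This illustrates the
# data handover the paper optimizes; it is not the paper's scheme.
import bisect
import hashlib

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.points = sorted((h(n), n) for n in nodes)

    def owner(self, key):
        positions = [p for p, _ in self.points]
        i = bisect.bisect(positions, h(key)) % len(self.points)
        return self.points[i][1]

    def bootstrap(self, node, keys):
        """Add a node; return the keys that must be handed over to it."""
        old_owner = {k: self.owner(k) for k in keys}
        bisect.insort(self.points, (h(node), node))
        return [k for k in keys if self.owner(k) == node != old_owner[k]]

ring = Ring(["n1", "n2", "n3"])
keys = [f"key{i}" for i in range(20)]
print("moved to n4:", ring.bootstrap("n4", keys))
```

Only the keys falling in the new node's arc of the ring move; making that handover fast and balanced is precisely what the paper's two-phase bootstrapping targets.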
Scheduling Algorithms in LTE and Future Cellular Networks (INDIAN NAVY)
1) The document discusses scheduling algorithms in LTE and future cellular networks. It provides an overview of key concepts like OFDMA, MIMO, small cells, and the essential elements of LTE including resource blocks and transport channels.
2) It describes important scheduling algorithms used in LTE, such as proportional fair, round robin, best CQI, and QoS-aware algorithms, and explains their objectives and benefits; a minimal proportional-fair sketch follows this summary.
3) Future cellular networks will require capabilities like very high data rates, low latency, and support for applications involving AI, M2M communication, and cloud computing. 5G networks will need to meet requirements like low power consumption and worldwide connectivity.
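As a minimal sketch of the proportional-fair rule referenced above, the code below serves, in each transmission interval, the user with the highest ratio of instantaneous rate to exponentially averaged throughput; the rate model and filter constant are illustrative, not LTE-specified values.

```python
# Minimal proportional-fair (PF) scheduler sketch: each TTI, serve the user
# maximizing instantaneous_rate / average_throughput, then update averages
# with an exponential filter.
import random

N_USERS, TTIS, TC = 4, 1000, 100.0   # users, intervals, averaging window
avg = [1e-6] * N_USERS               # avoid division by zero at start
served = [0] * N_USERS

for _ in range(TTIS):
    # Per-user instantaneous rate, e.g. derived from the reported CQI.
    rate = [random.uniform(1.0, 10.0 * (u + 1)) for u in range(N_USERS)]
    u_star = max(range(N_USERS), key=lambda u: rate[u] / avg[u])
    served[u_star] += 1
    for u in range(N_USERS):
        r = rate[u] if u == u_star else 0.0
        avg[u] = (1 - 1 / TC) * avg[u] + (1 / TC) * r

print("TTIs per user:", served)   # PF balances shares despite unequal channels
```

Dividing by the running average is what distinguishes PF from best-CQI: users with poor channels still get served when their instantaneous rate is high relative to what they usually receive.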
The document describes a proposed unified algorithm for load balancing (LB) and handover optimization (HOO) in Long-Term Evolution (LTE) networks. The algorithm uses a Fuzzy System (FS) tuned by the Q-Learning reinforcement learning algorithm to modify handover parameters at the cell adjacency level. This aims to improve key performance indicators related to both LB and HOO. Simulation results show the proposed joint algorithm provides better performance than independent LB and HOO entities operating simultaneously. The algorithm reduces complexity for the self-organizing network coordination entity by handling LB and HOO jointly rather than as separate functions.
Network Processing on an SPE Core in Cell Broadband Engine (Slide_N)
This document discusses implementing network processing on a Synergistic Processing Element (SPE) core in a Cell Broadband Engine. The key points are:
1) A network interface driver and small protocol stack were implemented on a single SPE to avoid bottlenecks from using the general purpose PowerPC core for network processing.
2) Network processing was able to achieve near wire-speed performance of 8.5 Gbps for TCP and almost wire-speed for UDP, requiring no assistance from the PowerPC core during data transfer.
3) Dedicating an SPE core for network processing can help resolve performance issues from high-speed network interfaces by offloading the processing costs from the general purpose core.
We live in an era where the atomic building elements of silicon computers, e.g. transistors and wires, are no longer visible with traditional optical microscopes and their sizes are measured in just tens of angstroms. In addition, power dissipation per unit volume is bounded by the laws of physics, which has resulted, among other things, in stagnating processor clock frequencies. Adding more and more processor cores that perform simpler and simpler tasks, in an attempt to efficiently fill the available on-chip area, seems to be the current trend taken by the industry.
Analyzing Data Movements and Identifying Techniques for Next-generation Networks (balmanme)
Jan 28th, 2013 - 10:00 am
UC Davis
Title: Analyzing Data Movements and Identifying Techniques for Next-generation Networks
Abstract: The large bandwidth provided by today's networks requires careful evaluation in order to eliminate system overheads and to bring the anticipated high performance to the application layer. As part of the Advanced Networking Initiative (ANI) project, we have conducted a large number of experiments in the initial evaluation of the 100Gbps network prototype.
We needed intense fine-tuning, both in the network and application layers, to take advantage of the higher network capacity. Instead of explicit improvements in every application as we keep changing the underlying link technology, we require novel data movement mechanisms and abstraction layers for end-to-end processing of data. Based on our experience with the 100Gbps network, we have developed an experimental prototype called MemzNet: Memory-mapped Zero-copy Network Channel. MemzNet defines new data access methods in which applications map memory blocks for remote data, in contrast to send/receive semantics. In one of the early demonstrations of 100Gbps network applications, we used the initial implementation of MemzNet, which takes the approach of aggregating files into blocks and providing dynamic data channel management. We observed that MemzNet showed better results, in terms of performance and efficiency, than the current state-of-the-art file-centric data transfer tools for the transfer of climate datasets with many small files. In this talk, I will mainly describe our experience in the 100Gbps tests and present results from the 100Gbps demonstration. I will briefly explain the ANI testbed environment and highlight future research plans.
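A minimal sketch of the block-mapped access semantics described in the abstract, with all names hypothetical: the transfer layer fills blocks of a memory-mapped region as data arrives, and the application reads remote data by block and offset instead of issuing receive calls.

```python
# Sketch of block-mapped access semantics: the application maps a memory
# region and reads remote data by (block_id, offset) instead of issuing
# send/receive calls. Purely illustrative; not the MemzNet implementation.
import mmap

BLOCK = 1 << 20                       # 1 MiB blocks

class MappedChannel:
    def __init__(self, nblocks):
        self.mem = mmap.mmap(-1, nblocks * BLOCK)   # anonymous memory map
        self.ready = set()                          # blocks filled so far

    def fill(self, block_id, payload):
        """Called by the transfer layer as network data arrives."""
        self.mem.seek(block_id * BLOCK)
        self.mem.write(payload.ljust(BLOCK, b"\0"))
        self.ready.add(block_id)

    def read(self, block_id, offset, size):
        """Application-side access: read mapped remote data in place."""
        assert block_id in self.ready, "block not yet transferred"
        start = block_id * BLOCK + offset
        return self.mem[start:start + size]

chan = MappedChannel(nblocks=4)
chan.fill(0, b"climate-dataset-bytes")
print(chan.read(0, 0, 7))             # b'climate'
```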
Bio: Mehmet Balman is a researcher working as a computer engineer in the Computational Research Division at Lawrence Berkeley National Laboratory. His recent work particularly deals with efficient data transfer mechanisms, high-performance network protocols, bandwidth reservation, network virtualization, scheduling, and resource management for large-scale applications. He received his doctoral degree in computer science from Louisiana State University (LSU) in 2010. He gained several years of industrial experience as a system administrator and R&D specialist at various software companies before joining LSU. He also worked as a summer intern at Los Alamos National Laboratory.
This document discusses streaming exascale data over 100Gbps networks for climate science applications. It summarizes that:
1) Data volume is increasing exponentially for climate applications, posing challenges for data management.
2) Streaming climate simulation data, which consists of small and large irregularly sized files, efficiently over high-bandwidth networks could benefit climate science.
3) A framework called MemzNet was developed to efficiently move climate files over 100Gbps networks by decoupling I/O and networking operations and dynamically managing data transfer. MemzNet was able to saturate a 100Gbps testbed network.
A Platform for Data Intensive Services Enabled by Next Generation Dynamic Opt... (Tal Lavian Ph.D.)
A new architecture is proposed for data-intensive services enabled by next-generation dynamic optical networks. It:
Offers a Lambda scheduling service over Lambda Grids
Supports both on-demand and scheduled data retrieval
Supports bulk data-transfer facilities using lambda-switched networks
Provides a generalized framework for high performance applications over next generation networks, not necessarily optical end-to-end
Supports out-of-band tools for adaptive placement of data replicas
An Architecture for Data Intensive Service Enabled by Next Generation Optical... (Tal Lavian Ph.D.)
DWDM-RAM - An architecture for data intensive Grids enabled by next generation dynamic optical networks, incorporating new methods for lightpath provisioning.
DWDM-RAM: an architecture designed to meet the networking challenges of extremely large-scale Grid applications. Traditional network infrastructure cannot meet these demands, especially the requirements of intensive data flows.
DWDM-RAM Components Include:
Data management services
Intelligent middleware
Dynamic lightpath provisioning
State-of-the-art photonic technologies
Wide-area photonic testbed implementation
Enhancing Performance with Globus and the Science DMZ (Globus)
ESnet has led the way in helping national facilities—and many other institutions in the research community—configure Science DMZs and troubleshoot network issues to maximize data transfer performance. In this talk we will present a summary of approaches and tips for getting the most out of your network infrastructure using Globus Connect Server.
This document discusses data placement scheduling between distributed repositories. It introduces Stork, a batch scheduler for data placement activities that supports plug-in data transfer modules and scheduling of data movement jobs. The document discusses techniques used by Stork such as throttling concurrent transfers, fault tolerance, job aggregation, and adaptive tuning of data transfer protocols. It also covers topics like network reservation, failure awareness, and directions for future work including priority-based scheduling and advance resource reservation.
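Two of the techniques named above, throttling concurrent transfers and aggregating small data placement jobs, can be sketched as follows; the job format, limits, and transfer call are illustrative, not Stork's actual interfaces.

```python
# Sketch of two Stork-style ideas: throttling the number of concurrent
# transfers and aggregating many small placement jobs into one transfer.
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT = 4          # throttle: at most 4 transfers in flight
AGGREGATE_BYTES = 64 << 20  # pack small jobs until ~64 MiB per transfer

def aggregate(jobs):
    """Group (src, size) placement jobs into batches of bounded size."""
    batch, total = [], 0
    for job in sorted(jobs, key=lambda j: j[1]):
        batch.append(job)
        total += job[1]
        if total >= AGGREGATE_BYTES:
            yield batch
            batch, total = [], 0
    if batch:
        yield batch

def transfer(batch):
    print(f"transferring {len(batch)} files, {sum(s for _, s in batch)} bytes")

jobs = [(f"file{i}", (i % 50 + 1) << 16) for i in range(500)]
with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
    pool.map(transfer, aggregate(jobs))
```

Aggregation amortizes per-transfer protocol overhead across many small files, while the worker pool caps concurrency so the endpoints and the network are not overloaded.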
The Challenges of SDN/OpenFlow in an Operational and Large-scale Network (Open Networking Summits)
Jun Bi
Professor & Director
Tsinghua University
Outline
• Intra-AS (campus level) IPv6 source address validation using OpenFlow (with extension)
– Good for introducing new IP services to network
• Planning next step if we run SDN as a common infrastructure for new services and architectures
– Some personal viewpoints and thoughts on design challenges
– Forwarding abstraction for Post-IP architectures
– Control abstraction for scalable NOS and programmable/manageable virtualization platform
– Inter-AS policies negotiation abstraction
ONS2015: http://bit.ly/ons2015sd
ONS Inspire! Webinars: http://bit.ly/oiw-sd
Watch the talk (video) on ONS Content Archives: http://bit.ly/ons-archives-sd
High performance network of Cloud Native Taiwan User Group (HungWei Chiu)
The document discusses high performance networking and summarizes a presentation about improving network performance. It describes drawbacks of the current Linux network stack, including kernel overhead and data copying. It then discusses approaches like DPDK and RDMA that can help improve performance by reducing overhead and enabling zero-copy data transfers. A case study is presented on using RDMA to improve TensorFlow performance by eliminating unnecessary data copies between devices.
This document discusses optimizing Linux AMIs for performance at Netflix. It begins by providing background on Netflix and explaining why tuning the AMI is important given Netflix runs tens of thousands of instances globally with varying workloads. It then outlines some of the key tools and techniques used to bake performance optimizations into the base AMI, including kernel tuning to improve efficiency and identify ideal instance types. Specific examples of CFS scheduler, page cache, block layer, memory allocation, and network stack tuning are also covered. The document concludes by discussing future tuning plans and an appendix on profiling tools like perf and SystemTap.
2009-01-28 DOI NBC Red Hat on System z Performance Considerations (Shawn Wells)
Presented with the U.S. Department of the Interior, National Business Center (DOI NBC), which offered a for-fee Linux on System z service to the U.S. Government. This presentation steps through performance management considerations, including: FCP/SCSI single path vs. multipath LVM; filesystem striping; Crypto Express2 Accelerator (CEX2A) SSL handshakes; cryptographic performance (WebSEAL SSL access); and CMM1 & CMMA.
- James Blessing is the Deputy Director of Network Architecture at Future Services. He discussed Ciena's MCP network management software, the need for automation of network provisioning through APIs, and the JiscMail NETWORK-AUTOMATION mailing list as a resource.
- The document then covered topics like Netpath services, layer 2 and 3 VPNs, network function virtualization, IPv6 adoption, the Janet end-to-end performance initiative, science DMZ principles, network performance monitoring with perfSONAR, and working with the GÉANT project.
Stephan Ewen - Experiences running Flink at Very Large Scale (Ververica)
This talk shares experiences from deploying and tuning Flink stream processing applications at very large scale. We share lessons learned from users, contributors, and our own experiments about running demanding streaming jobs at scale. The talk will explain what aspects currently render a job as particularly demanding, show how to configure and tune a large-scale Flink job, and outline what the Flink community is working on to make the out-of-the-box experience as smooth as possible. We will, for example, dive into: analyzing and tuning checkpointing; selecting and configuring state backends; understanding common bottlenecks; and understanding and configuring network parameters.
DNMTT - Synchrophasor Data Delivery Efficiency GEP Testing Results at Peak RC (Grid Protection Alliance)
GEP was tested against IEEE C37.118 for wide-area distribution of phasor data. Results showed that GEP had much less data loss than C37.118 over the same network conditions. GEP also required 60-70% less bandwidth for large and medium data flows compared to C37.118. There was no significant impact on servers between the two protocols. In conclusion, GEP represents an improved target for high-volume synchrophasor data distribution due to its robust and scalable pub/sub design.
Kickstart your Kafka with Faker Data | Francesco Tisiot, Aiven.io (HostedbyConfluent)
We all love to play with the shiny toys, but an event stream with no events is a sorry sight. In this session you’ll see how to create your own streaming dataset for Apache Kafka using Python and the Faker library. You’ll learn how to create a random data producer and define the structure and rate of its message delivery. Randomly-generated data is often hilarious in its own right, and it adds just the right amount of fun to any Kafka and its integrations!
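A minimal sketch of the kind of producer the session describes, using the Faker library with the kafka-python client; the broker address, topic name, message schema, and delivery rate are assumptions for illustration.

```python
# Minimal fake-event producer in the spirit of the talk, using Faker with
# the kafka-python client. Broker, topic, and schema are illustrative.
import json
import time

from faker import Faker
from kafka import KafkaProducer

fake = Faker()
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    event = {                      # structure of each randomly generated message
        "name": fake.name(),
        "address": fake.address(),
        "ts": time.time(),
    }
    producer.send("fake-events", value=event)
    time.sleep(0.5)                # delivery rate: two events per second
```

Adjusting the sleep interval and the Faker fields is how you control the rate and the structure of the stream, which is the point of the exercise.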
A Platform for Large-Scale Grid Data Service on Dynamic High-Performance Netw... (Tal Lavian Ph.D.)
Dynamic High-Performance Networks :
Support data-intensive Grid applications
Gives adequate and uncontested bandwidth to an application’s burst
Employs circuit-switching of large flows of data to avoid the overheads of breaking flows into small packets and the delays of routing
Is capable of automatic end-to-end path provisioning
Is capable of automatic wavelength switching
Provides a set of protocols for managing dynamically provisioned wavelengths
DWDM-RAM :
Encapsulates “optical network resources” into a service framework to support dynamically provisioned and advanced data-intensive transport services
Offers network resources as Grid services for Grid computing
Allows cooperation of distributed resources
Provides a generalized framework for high performance applications over next generation networks, not necessarily optical end-to-end
Yields good overall utilization of network resources
Simulating the behavior of satellite Internet links to small islands (APNIC)
This document summarizes a talk about simulating satellite internet links to small islands using a hardware-based simulation. The simulation aims to demonstrate how coding and performance enhancing proxies impact link utilization and packet loss. It consists of configuring the simulated satellite link parameters, running background traffic from servers to clients to generate demand, capturing traffic on both ends, and measuring the impact of coding and proxies on large file transfers and ping times. Preliminary results show that medium earth orbit links have higher goodput than geostationary links under high load, and that performance enhancing proxies help large file transfers without significantly impacting overall throughput. Future work will explore forward error correction coding and balancing redundancy with spare capacity.
DPDK Summit 2015 - Aspera - Charles Shiflett (Jim St. Leger)
DPDK Summit 2015 in San Francisco.
Presentation by Charles Shiflett, Aspera.
For additional details and the video recording please visit www.dpdksummit.com.
This tutorial gives a brief and interesting introduction to modern stream computing technologies. Participants can learn the essential concepts and methodologies for designing and building an advanced stream processing system. The tutorial unveils the key fundamentals behind various kinds of design choices. Some forecasts of technology developments in this domain are also introduced in the last section of the tutorial.
Impact of Grid Computing on Network Operators and HW Vendors (Tal Lavian Ph.D.)
The “Network” is a prime resource for large-scale distributed systems.
An integrated SW system provides the “glue”:
Dynamic optical network as a fundamental Grid service in data-intensive Grid applications, to be scheduled, managed, and coordinated to support collaborative operations.
The document discusses a 100Gbps testbed network built by the Energy Sciences Network (ESnet) to support the transfer of massive datasets used in scientific research. The network connects three supercomputing centers and has been 80% utilized since launching in January. A project used the network to demonstrate a 35 terabyte transfer between two sites that took 30 minutes, compared to an estimated 5 hours on a 10Gbps network. The 100Gbps network provides scientists with critical infrastructure to enable research as datasets continue growing rapidly in size.
A 100 gigabit highway for science: researchers take a 'test drive' on ani tes... (balmanme)
The document discusses the development of the Advanced Networking Initiative (ANI), a 100 Gbps national prototype network and testbed established by the Department of Energy's Energy Sciences Network (ESnet) to support scientific research. Researchers from various fields have used the ANI testbed to test networking technologies and data transfer tools for moving extremely large datasets, such as climate simulation data and radio astronomy data. The testbed has helped researchers optimize their software and protocols for high-speed data transfer over long-distance 100 Gbps networks.
This document summarizes a presentation on Stork 1.0 and beyond for large-scale collaborative science. Stork is a framework for scheduling data placement jobs. It uses modular transfer modules to support different protocols and services. It also includes features like error detection and classification, dynamic tuning of transfer parameters, job aggregation for improved performance, and has been used for data migration in projects like PetaShare. Future work may include improving performance and fault tolerance through distributed scheduling agents.
Available technologies: algorithm for flexible bandwidth reservations for dat... (balmanme)
Scientists at Berkeley Lab developed a flexible reservation algorithm that finds communication paths in time-dependent networks with bandwidth constraints. The algorithm offers reservation options that meet the user's specified requirements for start time, transit time, and bandwidth. It was tested in network simulations and can produce reservation options in under a second for networks with 1000 nodes. The algorithm provides more flexibility than existing reservation systems and allows users to optimize their choices for large-scale data transfers.
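A simplified illustration of the idea, not the Berkeley Lab algorithm itself: given per-slot spare bandwidth on a path, enumerate (start time, duration, bandwidth) options that can carry the requested volume, so a user can pick earliest start, shortest transit time, or lowest reserved rate.

```python
# Simplified illustration of flexible bandwidth reservation: given per-slot
# available bandwidth on a path, enumerate (start, duration, bandwidth)
# options that move `volume`. Values and granularity are illustrative.
def reservation_options(avail_gbps, volume_gbits, max_slots):
    """avail_gbps[t] = spare bandwidth in time slot t (1 slot = 1 s here)."""
    options = []
    for start in range(len(avail_gbps)):
        for dur in range(1, min(max_slots, len(avail_gbps) - start) + 1):
            window = avail_gbps[start:start + dur]
            bw = min(window)                 # sustainable rate over the window
            if bw * dur >= volume_gbits:     # enough capacity to finish?
                options.append((start, dur, volume_gbits / dur))
                break                        # longer windows only need lower rates
    return options

# 10 one-second slots with varying spare capacity, request of 40 Gbit:
spare = [10, 10, 40, 40, 20, 5, 30, 30, 30, 10]
for start, dur, rate in reservation_options(spare, 40, max_slots=4):
    print(f"start t={start}, {dur} slot(s), reserve {rate:.1f} Gbps")
```

Running this prints several feasible reservations, e.g. an immediate 4-slot option at 10 Gbps versus a 1-slot option at 40 Gbps starting at t=2, which is the earliest-completion versus lowest-rate tradeoff the flexible approach exposes.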
Berkeley lab team develops flexible reservation algorithm for advance network... (balmanme)
Researchers at Berkeley Lab developed a new flexible reservation algorithm to help scientists transfer large datasets over networks more efficiently. The algorithm allows users to inquire about bandwidth availability and receive alternative reservation options when initial requests fail. It presents a variety of possible reservation options to choose from based on factors like earliest completion time or highest bandwidth. This flexible approach is being integrated into ESnet's reservation system, OSCARS, to better support the large-scale data needs of scientific research.
This document discusses dynamic adaptation techniques for optimizing data transfer performance over networks. It describes how the number of concurrent data transfer streams can be adjusted dynamically according to changing network conditions, without relying on historical measurements or external profiling. The proposed approach gradually increases the level of parallelism during a transfer to find a near-optimal number of streams based on instant throughput measurements, allowing it to adapt to varying environments and network utilization over time.
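The adaptation loop can be sketched as follows, with measure_throughput standing in for the instant throughput measurements the approach relies on; the stopping rule and the toy throughput model are illustrative.

```python
# Sketch of the dynamic adaptation idea: grow the number of parallel
# streams during the transfer and keep the level where measured throughput
# stops improving, using only live measurements (no historical profiling).
def tune_streams(measure_throughput, max_streams=16, min_gain=1.05):
    """Return a near-optimal stream count from instant measurements."""
    streams, best = 1, measure_throughput(1)
    while streams < max_streams:
        t = measure_throughput(streams + 1)
        if t < best * min_gain:       # <5% gain: more streams stopped paying off
            break
        streams, best = streams + 1, t
    return streams, best

# Illustrative model: throughput saturates around 8 streams.
model = lambda n: 1000 * min(n, 8) / (1 + 0.02 * n)
n, thr = tune_streams(model)
print(f"settled on {n} streams at ~{thr:.0f} Mbit/s")
```

Because the probe runs during the transfer itself, the same loop can be re-entered periodically to track changing network utilization.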
This document discusses data migration in distributed repositories for collaborative science. It describes STORK, a scheduler for data placement activities that dynamically adapts data transfers. STORK aggregates data placement jobs and processes them as a single transfer job to improve performance. It also dynamically sets the number of parallel streams for transfers based on network characteristics. The document presents experiments on the Louisiana Optical Network Initiative demonstrating how STORK optimizes parameters like aggregation count and parallel jobs to reduce total transfer time.
1) Scientific research increasingly relies on large-scale data transfers between collaborating institutions over high-speed networks.
2) ESNet provides high-bandwidth connectivity between DOE sites but needs to efficiently allocate guaranteed bandwidth for transfers.
3) The presentation proposes enhancements to ESNet's OSCARS reservation system to suggest optimal reservations that meet researchers' bandwidth and timing requirements for data transfers.
Berkeley Lab - Computing Sciences Seminar - Reminder
TOMORROW, June 24, 2:00pm - 3:00pm, Bldg. 50F, Room 1647
Date: Wednesday, June 24, 2009
Time: 2:00pm - 3:00pm
Location: Bldg. 50F, Room 1647
Speaker: Mehmet Balman, Department of Computer Science, Louisiana State University
Title: Data Migration between Distributed Repositories for Collaborative Research
Abstract:
Scientific applications, especially in areas such as physics, biology, and astronomy, have become more complex and compute intensive. Often, such applications require geographically distributed resources to satisfy their immense computational requirements. Consequently, these applications also have increasing distributed data-intensive requirements, dealing with petabytes of data. The distributed nature of the resources has made data movement the major bottleneck for end-to-end application performance. Our approach is to use a dynamic network layer where the data placement middleware adapts to changing conditions in the environment. Furthermore, heterogeneous resources and different data access and security protocols are some of the challenges the data placement middleware needs to deal with. Complex middleware is required to orchestrate the use of these storage and network resources between collaborating parties, and to manage the end-to-end distribution of data.
We present a data placement scheduler, for mitigating the data
bottleneck in collaborative peta-scale applications. In this talk,
we will give details on recent research in data scheduling, some use
cases for transferring very large data sets into distributed
repositories, and experiments of effective data movement over 1Gpbs
and 10Gbps networks. We will also describe advanced features
including aggregation of data placement jobs with small data files,
dynamic tuning of data transfer operations to minimize the effect of
network latency, error detection and classification, and restarting
transfer operations after transfer interruptions.
Host of Seminar: Arie Shoshani
------------------------------------------------------------------------
For additional information, such as site access or directions to the conference room, please contact CSSeminars-Help@hpcrd.lbl.gov.
Web Contact: CSSeminars-Help@hpcrd.lbl.gov
_______________________________________________
CSSeminars mailing list
CSSeminars@hpcrdm.lbl.gov
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/csseminars
Balman dissertation Copyright @ 2010 Mehmet Balmanbalmanme
This document discusses scheduling data transfer operations with advance reservation and provisioning. It proposes dividing time into windows within which network bandwidth availability is stable. When a data transfer request is received, the scheduler checks all possible time windows to see whether the request fits within the bandwidth constraints. If no window is available, it tries shifting existing transfers to earlier windows when they have a lower "desire" value, computed from the number of occupied time slots and the order of the window. This allows requests to be scheduled in advance while minimizing disruption to existing transfers.
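A toy version of that window-fitting logic is sketched below. The capacity model and the desire formula (occupied slots weighted by window order) are illustrative assumptions, not the scheduler's exact definitions.

```python
def desire(job, window_index):
    # Assumed desire metric: more occupied time slots and a later
    # window position mean the job "wants" its current window more.
    return job["slots"] * (window_index + 1)

def schedule(windows, new_job):
    """windows: list of {"capacity": float, "jobs": [...]}; a job is
    {"bw": float, "slots": int}. Place new_job in the first window
    with room; otherwise try shifting a lower-desire job earlier."""
    def free(w):
        return w["capacity"] - sum(j["bw"] for j in w["jobs"])

    for i, w in enumerate(windows):
        if free(w) >= new_job["bw"]:
            w["jobs"].append(new_job)
            return i

    for i, w in enumerate(windows):
        movable = [j for j in w["jobs"] if desire(j, i) < desire(new_job, i)]
        for job in sorted(movable, key=lambda j: desire(j, i)):
            for k in range(i):                      # earlier windows only
                if free(windows[k]) >= job["bw"]:
                    w["jobs"].remove(job)
                    windows[k]["jobs"].append(job)
                    break
            if free(w) >= new_job["bw"]:
                w["jobs"].append(new_job)
                return i
    return None
```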
From: "Rachel Lance" <rlance@lbl.gov>
Subject: [CSSeminars] REMINDER: Berkeley Lab - Computing Sciences Seminar - Monday, 8/17/2009, 2:00pm TODAY
Date: Mon, August 17, 2009 1:36 pm
To: CSSeminars@hpcrd.lbl.gov
Berkeley Lab - Computing Sciences Seminar - Reminder
TODAY, August 17, 2:00pm - 3:00pm, Bldg. 50F, Room 1647
Berkeley Lab - Computing Sciences Seminar
Date: Monday, August 17, 2009
Time: 2:00pm - 3:00pm
Location: Bldg. 50F, Room 1647
Speaker: Mehmet Balman, Department of Computer Science, Louisiana State University
Title: Advance Network Reservation and Provisioning for Science
Abstract:
Scientific applications already generate many terabytes and even petabytes of data from supercomputer runs and large-scale experiments. The need for transferring data chunks of ever-increasing sizes through the network shows no sign of abating. Hence, we need high-bandwidth, high-speed networks such as DOE's ESnet (Energy Sciences Network) that manage the available bandwidth effectively. OSCARS (ESnet On-demand Secure Circuits and Advance Reservation System) serves as the network provisioning agent on ESnet. Currently, using OSCARS, a user can specify a desired reservation of bandwidth x MB/sec for a duration of y hours starting at time t. OSCARS checks network availability and capacity for the specified window of time, and allocates it for that user if it is available. Otherwise, it reports to the user that it is unable to make the allocation. Accordingly, it falls upon the user to search for a time frame with the required bandwidth by trial and error, without knowledge of the network's available capacity at any given instant.
We report a novel algorithm where the user specifies the total volume that needs to be transferred, a maximum bandwidth that he/she can use, and a desired time window within which the transfer should be done. The proposed algorithm can find alternate allocation possibilities, including the earliest time for completion or the shortest transfer duration, leaving the choice to the user. The proposed algorithm is quite practical when applied to large networks with thousands of routers and links. We have implemented our algorithm for testing and incorporation into a future version of OSCARS. We will finish the talk with a short demonstration.
Host of Seminar: Arie Shoshani
-----------
1) The Earth System Grid (ESG) supports climate research by providing access to petabytes of climate simulation data distributed across multiple locations worldwide. 2) As climate datasets continue increasing in size, from gigabytes to petabytes, efficient bulk data transfer techniques are needed to replicate and distribute the data. 3) The Bulk Data Mover (BDM) was developed to improve data transfer performance. It uses techniques like parallel TCP streams, adaptive tuning of transfer parameters, and dynamic load balancing.
The document summarizes the agenda for the NDM 2012 workshop on Network-aware Data Management to be held on November 11th, 2012 in Salt Lake City. The workshop will include keynote speeches, invited talks, paper presentations, and a panel discussion on new directions in networking and data management. Topics will include data-intensive applications, transport of big data over dedicated networks, and using networking techniques for data management. The workshop aims to foster collaboration between the network and data management communities.
The document proposes a flexible reservation algorithm to improve advance network reservation systems. It allows clients to specify a maximum bandwidth, data size, earliest start time, and latest end time. The system then finds the reservation that meets these constraints with either the earliest completion time or the shortest transfer duration. Time-dependent graphs are used to model bandwidth availability over time, and Kruskal's and Dijkstra's algorithms are modified to find the maximum-bandwidth path while respecting constraints such as earliest completion.
The document discusses analyzing climate data over fast networks and parallel mesh refinement. It describes two climate analysis applications that are either computationally or data intensive. It then discusses accessing netcdf climate data files from remote repositories over networks, distributing the input files across processes, and using batch processing or clouds to retrieve the remote data. It also describes adaptive mesh refinement used to process large climate data in parallel by distributing the mesh and synchronizing propagation paths between processes.
The document summarizes a workshop on network-aware data management held alongside the SC'11 conference. The workshop addressed challenges in managing large amounts of data across high-bandwidth networks and how to simplify data access and movement. It included keynote and paper presentations on these topics, and a panel discussion on data management challenges for exascale computing and terabit networks. The best paper award was given to a presentation on a fat-tree routing algorithm to alleviate congestion in InfiniBand networks.
Climate100 aims to scale climate applications to utilize 100Gbps network bandwidth. Climate datasets consist of many small files that add up to very large volumes; recent accomplishments nevertheless demonstrate moving terabytes of such data between laboratories in under an hour over 100Gbps links, averaging 83Gbps. The project addresses increasing data sizes and efficient use of network infrastructure with limited resources.
The document discusses using RDMA over Converged Ethernet (RoCE) for high performance data movement between KISTI and LBL. It describes how RDMA allows direct data placement through one-sided operations like RDMA write and read, avoiding CPU overhead. It also discusses challenges in using RDMA over wide area networks and for bulk data transfers, and experiments using GridFTP and a prototype FTP-like transfer application over RDMA.
Network-aware Data Management for Large Scale Distributed Applications, IBM Research-Almaden, San Jose, CA – June 24, 2015
1. Network-aware Data Management for Large-scale Distributed Applications
June 24, 2015
Mehmet Balman
http://balman.info
Senior Performance Engineer at VMware Inc.
Guest/Affiliate at Berkeley Lab
2. About me:
• 2013: Performance, Central Engineering, VMware, Palo Alto, CA
• 2009: Computational Research Division (CRD) at Lawrence Berkeley National Laboratory (LBNL)
• 2005: Center for Computation & Technology (CCT), Baton Rouge, LA
• Computer Science, Louisiana State University (2010, 2008)
• Bogazici University, Istanbul, Turkey (2006, 2000)
• Data Transfer Scheduling with Advance Reservation and Provisioning, Ph.D.
• Failure-Awareness and Dynamic Adaptation in Data Scheduling, M.S.
• Parallel Tetrahedral Mesh Refinement, M.S.
3. Why Network-aware?
Networking is one of the major components in many of the solutions today:
• Distributed data and compute resources
• Collaboration: data to be shared between remote sites
• Data centers are complex network infrastructures
• What further steps are necessary to take full advantage of future networking infrastructure?
• How are we going to deal with performance problems?
• How can we enhance data management services and make them network-aware?
New collaborations between data management and networking communities.
4. Two major players:
• Abstraction and Programmability
• Rapid Development, Intelligent services
• Orchestrating compute, storage, and network resources together
• Integration and deployment of complex workflows
• Virtualization (+containers)
• Distributed storage (storage wars)
• Open Source (if you can't fix it, you don't own it)
• Performance Gap:
  • Limitation in current system software vs. foreseen speed: hardware is fast, software is slow
  • Latency vs. throughput mismatch will lead to new innovations
5. Outline
• VSAN + VVOL Storage Performance in Virtualized Environments
• PetaShare Distributed Storage + Stork Data Scheduler: Adaptive Tuning + Advanced Buffers
• Data Streaming in High-bandwidth Networks
  • Climate100: Advanced Networking Initiative (ANI) and 100Gbps Demo
  • MemzNet: Memory-Mapped Network Zero-copy Channels
  • Core Affinity and End System Tuning in High-Throughput Flows
• Network Reservation and Online Scheduling (QoS)
  • FlexRes: A Flexible Network Reservation Algorithm
  • SchedSim: Online Scheduling with Advance Provisioning
7. VSAN performance work in a nutshell
(Observer image: blog.vmware.com)
• Every write operation needs to go over the network (and the network is not free)
• Each layer (cache, disk, object management, etc.) needs resources (CPU, memory)
• Resource limitations vs. latency effect
• Needs to support thousands of VMs
Placement of objects:
• Which host?
• Which disk/SSD in the host?
What if there are failures and migrations, and if we need to rebalance?
8. VVOL: virtual volumes
(VVOL image: blog.vmware.com)
Offloading control operations to the storage array:
• powerOn
• powerOff
• delete
• clone
9. VVOL performance work
• Effect of the latency in the control path
• linked clones vs. VVOL clones
[Diagram: vSphere host and storage array with a VASA provider (VP); the data path and the control path are separate]
• Optimize service latencies
• Batching (disklib)
• Use concurrent operations
10. PetaShare + Stork Data Scheduler
Aggregation in the data path: an advance buffer cache in the Petafs and Petashell clients aggregates I/O requests to minimize the number of network messages.
11. Adaptive Tuning + Advanced Buffer
• Adaptive tuning for bulk transfers
• Buffer cache for remote I/O
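To make the aggregation idea from the two slides above concrete, here is a minimal sketch of a write-back buffer that coalesces small adjacent writes into one network message; the send callback and the 1MB limit are illustrative assumptions, not the Petafs/Petashell implementation.

```python
class AggregatingWriter:
    """Coalesce small sequential writes into one network message.
    `send` is a hypothetical callable that ships (offset, bytes) to
    the remote storage server in a single message."""
    def __init__(self, send, limit=1 << 20):
        self.send, self.limit = send, limit
        self.start, self.buf = None, bytearray()

    def write(self, offset, data):
        contiguous = (self.start is not None and
                      offset == self.start + len(self.buf))
        if not contiguous or len(self.buf) + len(data) > self.limit:
            self.flush()                      # ship what we have so far
        if self.start is None:
            self.start = offset
        self.buf += data

    def flush(self):
        if self.buf:
            self.send(self.start, bytes(self.buf))  # one message, many writes
        self.start, self.buf = None, bytearray()

# Example: 1000 sequential 1KB writes become a single 1MB-ish message.
sent = []
w = AggregatingWriter(lambda off, b: sent.append((off, len(b))))
for i in range(1000):
    w.write(i * 1024, b"x" * 1024)
w.flush()
print(sent)   # [(0, 1024000)]
```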
12. Outline
• VSAN + VVOL Storage Performance in Virtualized Environments
• PetaShare Distributed Storage + Stork Data Scheduler: Adaptive Tuning + Advanced Buffers
• Data Streaming in High-bandwidth Networks
  • Climate100: Advanced Networking Initiative (ANI) and 100Gbps Demo
  • MemzNet: Memory-Mapped Network Zero-copy Channels
  • Core Affinity and End System Tuning in High-Throughput Flows
• Network Reservation and Online Scheduling (QoS)
  • FlexRes: A Flexible Network Reservation Algorithm
  • SchedSim: Online Scheduling with Advance Provisioning
13. 100Gbps networking has finally arrived! (Applications' perspective)
Increasing the bandwidth is not sufficient by itself; we need careful evaluation of high-bandwidth networks from the applications' perspective.
1Gbps to 10Gbps transition (10 years ago): applications did not run 10 times faster just because there was more bandwidth available.
14. ANI 100Gbps Demo
• 100Gbps demo by ESnet and Internet2
• Application design issues and host tuning strategies to scale to 100Gbps rates
• Visualization of remotely located data (cosmology)
• Data movement of large datasets with many files (climate analysis)
15. Earth System Grid Federation (ESGF)
• Over 2,700 sites
• 25,000 users
• IPCC Fifth Assessment Report (AR5): 2PB
• IPCC Fourth Assessment Report (AR4): 35TB
• Remote data analysis
• Bulk data movement
17. The lots-of-small-files problem! File-centric tools?
[Diagram: with file-centric tools (FTP, RPC), each file costs a request/response round trip: request a file, send file; request data, send data]
• Keep the network pipe full
• We want out-of-order and asynchronous send/receive
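The contrast on the slide above, one outstanding request per file versus a pipe kept full of asynchronous requests, can be modeled with a toy asyncio script; the 50ms latency and block counts are made up for illustration.

```python
import asyncio, time

LATENCY = 0.05  # pretend 50 ms round trip per request

async def fetch(block_id):
    await asyncio.sleep(LATENCY)          # simulated network round trip
    return block_id, b"data"

async def file_centric(blocks):
    # One outstanding request at a time: latency is paid once per block.
    for b in blocks:
        await fetch(b)

async def pipelined(blocks, depth=16):
    # Keep `depth` asynchronous requests in flight; completions may
    # arrive out of order, which is fine when bookkeeping travels
    # with each block.
    pending = set()
    for b in blocks:
        pending.add(asyncio.ensure_future(fetch(b)))
        if len(pending) >= depth:
            _, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED)
    if pending:
        await asyncio.gather(*pending)

for runner in (file_centric, pipelined):
    t0 = time.perf_counter()
    asyncio.run(runner(list(range(64))))
    print(runner.__name__, f"{time.perf_counter() - t0:.2f}s")
```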
18. Many Concurrent Streams
[Figure: (a) total throughput vs. the number of concurrent memory-to-memory transfers; (b) interface traffic, packets per second (blue) and bytes per second, over a single NIC with different numbers of concurrent transfers.] Three hosts, each with 4 available NICs, and a total of 10 10Gbps NIC pairs were used to saturate the 100Gbps pipe in the ANI Testbed. 10 data movement jobs, each corresponding to a NIC pair, started simultaneously at source and destination. Each peak represents a different test: 1, 2, 4, 8, 16, 32, and 64 concurrent streams per job were initiated for 5-minute intervals (e.g., when the concurrency level is 4, there are 40 streams in total).
19. Effects of many concurrent streams
ANI Testbed, 100Gbps (10x10 NICs, three hosts): interrupts/CPU vs. the number of concurrent transfers [1, 2, 4, 8, 16, 32, 64 concurrent jobs, 5-minute intervals]; the TCP buffer size is 50M.
20. Analysis of Core Affinities (NUMA effect)
Nathan Hanford et al., NDM'13
[Figure: Sandy Bridge architecture and placement of the receive process]
21. Analysis of Core Affinities (NUMA effect)
Nathan Hanford et al., NDM'14
25. Advantages
• Decoupling I/O and network operations:
  • front-end (I/O processing)
  • back-end (networking layer)
• Not limited by the characteristics of the file sizes
  • on-the-fly tar approach, bundling and sending many files together
• Dynamic data channel management: can increase/decrease the parallelism level both in the network communication and in I/O read/write operations, without closing and reopening the data channel connection (as is done in regular FTP variants).
MemzNet is not file-centric. Bookkeeping information is embedded inside each block.
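Because bookkeeping is embedded in each block, receivers can place blocks in whatever order they arrive. Below is a rough sketch of such a self-describing block; the header layout (file id, offset, length) is a guess for illustration, not MemzNet's actual wire format.

```python
import struct

# Hypothetical block header: (file_id, offset, length), little-endian.
HEADER = struct.Struct("<QQI")

def pack_block(file_id, offset, payload):
    """Front-end side: tag each data block with its own bookkeeping."""
    return HEADER.pack(file_id, offset, len(payload)) + payload

def unpack_block(block):
    """Back-end side: blocks are self-describing, so any worker can
    place them, in any arrival order."""
    file_id, offset, length = HEADER.unpack_from(block)
    return file_id, offset, block[HEADER.size:HEADER.size + length]

# Blocks arriving out of order still land in the right place.
blocks = [pack_block(7, 4096, b"B" * 4096), pack_block(7, 0, b"A" * 4096)]
image = bytearray(8192)
for blk in blocks:
    _, off, data = unpack_block(blk)
    image[off:off + len(data)] = data
assert bytes(image) == b"A" * 4096 + b"B" * 4096
```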
27. 100Gbps Demo
• CMIP3 data (35TB) from the GPFS filesystem at NERSC
• Block size 4MB
• Each block's data section was aligned according to the system page size.
• 1GB cache both at the client and at the server
• At NERSC, 8 front-end threads on each host for reading data files in parallel.
• At ANL/ORNL, 4 front-end threads for processing received data blocks.
• 4 parallel TCP streams (four back-end threads) were used for each host-to-host connection.
28. MemzNet’s
Performance
TCP
buffer
size
is
set
to
50MB
MemzNetGridFTP
100Gbps demo
ANI Testbed
28
29. Challenge?
• High bandwidth brings new challenges!
  • We need a substantial amount of processing power and the involvement of multiple cores to fill a 40Gbps or 100Gbps network
  • Fine-tuning, both in the network and application layers, to take advantage of the higher network capacity
• Incremental improvement in current tools?
  • We cannot expect every application to tune and improve every time we change the link technology or speed.
30. MemzNet
• MemzNet: Memory-mapped Network Channel
• High-performance data movement
MemzNet is an initial effort to put a new layer between the application and the transport layer. The main goal is to define a network channel so applications can directly use it without the burden of managing/tuning the network communication.
Tech report: LBNL-6177E
31. MemzNet = New Execution Model
• Luigi Rizzo's netmap proposes a new API to send/receive data over the network
• RDMA programming model: MemzNet as a memory-management component
• IX: Data Plane OS (Adam Belay et al. @ Stanford; similar to MemzNet's model)
• mTCP (event based / replaces send/receive at user level)
• Tanenbaum et al.: minimizing context switches, proposing to use MONITOR/MWAIT for synchronization
32. Outline
• VSAN + VVOL Storage Performance in Virtualized Environments
• PetaShare Distributed Storage + Stork Data Scheduler: Adaptive Tuning + Advanced Buffers
• Data Streaming in High-bandwidth Networks
  • Climate100: Advanced Networking Initiative (ANI) and 100Gbps Demo
  • MemzNet: Memory-Mapped Network Zero-copy Channels
  • Core Affinity and End System Tuning in High-Throughput Flows
• Network Reservation and Online Scheduling (QoS)
  • FlexRes: A Flexible Network Reservation Algorithm
  • SchedSim: Online Scheduling with Advance Provisioning
33. Problem Domain: ESnet's OSCARS
[Map: ESnet topology with hubs (Seattle, Sunnyvale, Sacramento, Boise, Denver, Albuquerque, El Paso, Houston, Kansas City, Chicago, Nashville, Atlanta, Washington DC, New York, Boston), DOE sites (PNNL, SLAC, LBNL, AMES, FNAL, ANL, ORNL, JLAB, PPPL, BNL), and peerings with US R&E networks (DREN/Internet2/NLR/NISN/NASA/USDOI), Canada (CANARIE), Europe (GÉANT/NORDUNET), France (OpenTransit), CERN (USLHCNet, LHCONE), Asia-Pacific (ASGC/KAREN/KREONET2/NUS-GP/ODN/REANNZ/SINET/TRANSPAC/TWAREN/BNP/HEPNET), Australia (AARnet), Latin America (AMPATH/CLARA/CUDI), and Russia and China (GLORIAD)]
• Connecting experimental facilities and supercomputing centers
• On-demand Secure Circuits and Advance Reservation System
• Guaranteed bandwidth between collaborating institutions by delivering network-as-a-service
• Co-allocation of storage and network resources (SRM: Storage Resource Manager); end-to-end reservation: storage + network
OSCARS provides yes/no answers to a reservation request for (bandwidth, start_time, end_time).
34. Reservation Request
• Between edge routers
Need to ensure availability of the requested bandwidth from source to destination for the requested time interval:
• R = {n_source, n_destination, M_bandwidth, t_start, t_end}
  • source/destination end-points
  • requested bandwidth
  • start/end times
Committed reservations between t_start and t_end are examined. The shortest path from source to destination is calculated based on the engineering metric on each link, and a bandwidth-guaranteed path is set up to commit and eventually complete the reservation request for the given time period.
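The admission check described on the slide above can be sketched as follows. This is a schematic reading of the slide, not OSCARS code; the reservation record, capacity table, and overlap test are modeled minimally, and the example numbers mirror the topology used on the next slides.

```python
from dataclasses import dataclass

@dataclass
class Reservation:
    path: list        # e.g. ["A", "B", "D"]
    bandwidth: float  # Mbps
    t_start: float
    t_end: float

def links_of(path):
    # A path is a node sequence; its links are unordered node pairs.
    return {frozenset(edge) for edge in zip(path, path[1:])}

def available(capacity, committed, path, bw, t_start, t_end):
    """True if bw can be guaranteed on every link of `path` during
    [t_start, t_end), given committed reservations on the network."""
    for link in links_of(path):
        used = sum(r.bandwidth for r in committed
                   if r.t_start < t_end and t_start < r.t_end  # time overlap
                   and link in links_of(r.path))
        if capacity[link] - used < bw:
            return False
    return True

# Reservation 1 saturates A-B during (t1, t3), so during (t1, t2)
# 500Mbps still fits via A-C-D but 600Mbps does not.
capacity = {frozenset(e): c for e, c in [
    (("A", "B"), 900), (("B", "D"), 1000), (("A", "C"), 500),
    (("C", "D"), 800), (("B", "C"), 300)]}
committed = [Reservation(["A", "B", "D"], 900, 1, 3)]
print(available(capacity, committed, ["A", "C", "D"], 500, 1, 2))  # True
print(available(capacity, committed, ["A", "C", "D"], 600, 1, 2))  # False
```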
35. Reservation
• Components (graph):
  • node (router), port, link (connecting two ports)
  • engineering metric (~latency)
  • maximum bandwidth (capacity)
• Reservation: source, destination, path, time
  • (time t1, t3) A -> B -> D (900Mbps)
  • (time t2, t3) A -> C -> D (400Mbps)
  • (time t4, t5) A -> B -> D (800Mbps)
[Figure: example topology with link capacities A-B 900Mbps, B-D 1000Mbps, A-C 500Mbps, C-D 800Mbps, B-C 300Mbps; the three reservations shown on a t1..t5 timeline]
36. Example
(time t1, t2): A to D (600Mbps)? NO. A to D (500Mbps)? YES.
Active reservations:
• reservation 1: (time t1, t3) A -> B -> D (900Mbps)
• reservation 2: (time t2, t3) A -> C -> D (400Mbps)
• reservation 3: (time t4, t5) A -> B -> D (800Mbps)
[Figure: each link labeled available/reserved (capacity) for the window (t1, t2): A-B 0Mbps/900Mbps (900Mbps), B-D 100Mbps/900Mbps (1000Mbps), C-D 800Mbps/0Mbps (800Mbps), A-C 500Mbps/0Mbps (500Mbps), B-C 300Mbps/0Mbps (300Mbps)]
37. Example
(time t1, t3): A to D (500Mbps)? NO. A to C (500Mbps)? NO (not max-flow!).
Active reservations:
• reservation 1: (time t1, t3) A -> B -> D (900Mbps)
• reservation 2: (time t2, t3) A -> C -> D (400Mbps)
• reservation 3: (time t4, t5) A -> B -> D (800Mbps)
[Figure: each link labeled available/reserved (capacity) for the window (t1, t3): A-B 0Mbps/900Mbps (900Mbps), B-D 100Mbps/900Mbps (1000Mbps), C-D 400Mbps/400Mbps (800Mbps), A-C 100Mbps/400Mbps (500Mbps), B-C 300Mbps/0Mbps (300Mbps)]
38. Alternative Approach: Flexible Reservations
• If the requested bandwidth cannot be guaranteed:
  • trial and error until an available reservation is found
  • the client is not given other possible options
• How can we enhance the OSCARS reservation system? Be flexible:
  • submit constraints, and the system suggests possible reservation options satisfying the given requirements
R_s' = {n_source, n_destination, M_MAXbandwidth, D_dataSize, t_EarliestStart, t_LatestEnd}
The reservation engine finds the reservation R = {n_source, n_destination, M_bandwidth, t_start, t_end} for the earliest completion or for the shortest duration, where M_bandwidth <= M_MAXbandwidth and t_EarliestStart <= t_start < t_end <= t_LatestEnd.
39. Bandwidth Allocation (time-dependent)
Modified Dijkstra's algorithm (maximum available bandwidth):
• bottleneck constraint (not additive)
• (a QoS constraint is additive in shortest-path problems, etc.)
Finds the maximum bandwidth available for allocation from a source node to a destination node.
[Figure: time-dependent availability over time steps t1..t6]
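A compact sketch of that bottleneck variant: Dijkstra's additive relaxation becomes min(bandwidth so far, link capacity), maximized instead of minimized. The adjacency map below encodes the (t1, t3) snapshot from the earlier example; it is an illustration, not the production algorithm.

```python
import heapq

def widest_path(graph, src, dst):
    """Max-bandwidth (bottleneck) path: like Dijkstra, but a path's
    'length' is the minimum available bandwidth along it, and we
    maximize that minimum.  graph: {node: {neighbor: bandwidth}}."""
    best = {src: float("inf")}
    heap = [(-best[src], src)]           # max-heap via negation
    while heap:
        bw, u = heapq.heappop(heap)
        bw = -bw
        if u == dst:
            return bw
        if bw < best.get(u, 0):
            continue                     # stale heap entry
        for v, cap in graph[u].items():
            cand = min(bw, cap)          # bottleneck, not a sum
            if cand > best.get(v, 0):
                best[v] = cand
                heapq.heappush(heap, (-cand, v))
    return 0.0

# Available bandwidth during the window (t1, t3) from the example:
g = {"A": {"B": 0, "C": 100}, "B": {"A": 0, "D": 100, "C": 300},
     "C": {"A": 100, "D": 400, "B": 300}, "D": {"B": 100, "C": 400}}
print(widest_path(g, "A", "D"))   # 100, so a 500Mbps request is refused
```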
40. Analogous Example
• A vehicle travelling from city A to city B
• There are multiple cities between A and B connected with separate highways.
• Each highway has a specific speed limit (maximum bandwidth)
• But we need to reduce our speed if there is a high traffic load on the road
• We know the load on each highway for every time period (active reservations)
• The first question is which path the vehicle should follow in order to reach city B from city A as early as possible (earliest completion)
• Or, we can delay our journey and start later if the total travel time would be reduced. The second question is to find the route, along with the starting time, for the shortest travel duration (shortest duration)
Advance bandwidth reservation: we have to set the speed limit before starting and cannot change it during the journey.
41. Time steps
• Time steps between t1 and t13
[Figure: reservations 1, 2, and 3 on a timeline t1..t13; their start/end points (t1, t4, t6, t7, t9, t12, t13) partition the search interval into time steps ts1, ts2, ts3, ts4, ...]
Max (2r+1) time steps, where r is the number of reservations.
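Deriving the time steps from committed reservations is mechanical, as in this sketch (integer labels stand in for the abstract times t1, t4, and so on):

```python
def time_steps(reservations, search_start, search_end):
    """Split [search_start, search_end] at every reservation boundary
    inside it; consecutive boundaries delimit time steps, within which
    link availability is constant. At most 2r+1 steps for r reservations."""
    points = {search_start, search_end}
    for t_start, t_end in reservations:
        for t in (t_start, t_end):
            if search_start < t < search_end:
                points.add(t)
    edges = sorted(points)
    return list(zip(edges, edges[1:]))

# Reservations (t1,t6), (t4,t7), (t9,t12) searched over [t1, t13]:
print(time_steps([(1, 6), (4, 7), (9, 12)], 1, 13))
# [(1, 4), (4, 6), (6, 7), (7, 9), (9, 12), (12, 13)]
```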
42. Static Graphs
[Figure: one static snapshot graph per time step, labeled with the available bandwidth on each link:
G(ts1), t1-t4 (Res 1 active): A-B 0Mbps, B-D 100Mbps, C-D 800Mbps, A-C 500Mbps, B-C 300Mbps
G(ts2), t4-t6 (Res 1 and 2 active): A-B 0Mbps, B-D 100Mbps, C-D 400Mbps, A-C 100Mbps, B-C 300Mbps
G(ts3), t6-t7 (Res 2 active): A-B 900Mbps, B-D 1000Mbps, C-D 400Mbps, A-C 100Mbps, B-C 300Mbps
G(ts4), t7-t9 (no reservation active): A-B 900Mbps, B-D 1000Mbps, C-D 800Mbps, A-C 500Mbps, B-C 300Mbps]
43. Time Windows
A time window combines consecutive time steps. Under the bottleneck constraint, the window's graph is the element-wise minimum of its step graphs, e.g. G(tw) = G(ts1) × G(ts2) for tw = ts1 + ts2.
[Figure: combined graphs for two windows:
t1-t6 (Res 1, 2): A-B 0Mbps, B-D 100Mbps, C-D 400Mbps, A-C 100Mbps, B-C 300Mbps
t6-t9 (Res 2): A-B 900Mbps, B-D 1000Mbps, C-D 400Mbps, A-C 100Mbps, B-C 300Mbps]
Max (s × (s + 1))/2 time windows, where s is the number of time steps.
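Composing consecutive step graphs into a window graph is an element-wise minimum over link availabilities; a small sketch using the snapshots from the previous slide:

```python
def combine(g1, g2):
    """Bottleneck composition: bandwidth available over a time window
    is the minimum of its availability in the constituent time steps."""
    return {u: {v: min(bw, g2[u][v]) for v, bw in nbrs.items()}
            for u, nbrs in g1.items()}

# Step snapshots (available bandwidth per link, directed for brevity):
g_ts1 = {"A": {"B": 0, "C": 500}, "B": {"D": 100, "C": 300},
         "C": {"D": 800}}                     # only reservation 1 active
g_ts2 = {"A": {"B": 0, "C": 100}, "B": {"D": 100, "C": 300},
         "C": {"D": 400}}                     # reservations 1 and 2 active
g_window = combine(g_ts1, g_ts2)              # window t1-t6
print(g_window["A"]["C"], g_window["C"]["D"]) # 100 400
```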
44. Time Window List (special data structures)
Time windows list: [now, infinite]
New reservation: reservation 1, start t1, end t10 ->
[now, t1] [t1, t10: Res 1] [t10, infinite]
New reservation: reservation 2, start t12, end t20 ->
[now, t1] [t1, t10: Res 1] [t10, t12] [t12, t20: Res 2] [t20, infinite]
Careful software design makes the implementation fast and efficient.
45. Performance
max-bandwidth path ~ O(n^2), where n is the number of nodes in the topology graph.
In the worst case, we may need to search all time windows: (s × (s + 1))/2, where s is the number of time steps. If there are r committed reservations in the search period, there can be a maximum of 2r + 1 different time steps in the worst case.
Overall, the worst-case complexity is bounded by O(r^2 n^2).
Note: r is relatively very small compared to the number of nodes n.
46. Example
Reservation 1: (time t1, t6) A -> B -> D (900Mbps)
Reservation 2: (time t4, t7) A -> C -> D (400Mbps)
Reservation 3: (time t9, t12) A -> B -> D (700Mbps)
[Figure: topology with capacities A-B 900Mbps, B-D 1000Mbps, A-C 500Mbps, C-D 800Mbps, B-C 300Mbps; the three reservations on a t1..t13 timeline]
Request: from A to D (earliest completion); max bandwidth = 200Mbps; volume = 200Mbps × 4 time slots; earliest start = t1, latest finish = t13.
47. Search Order - Time Windows
[Figure: time steps t1-t4, t4-t6, t6-t7, t7-t9, t9-t12, t12-t13 with the active reservations per step; candidate windows examined: t1-t4, t4-t6, t1-t6, t6-t7, t4-t7, t1-t7, t7-t9, t6-t9, t4-t9, t1-t9, ...]
Max bandwidth from A to D per examined window:
1. 900Mbps (3)
2. 100Mbps (2)
3. 100Mbps (5)
4. 900Mbps (1)
5. 100Mbps (3)
6. 100Mbps (6)
7. 900Mbps (2)
8. 900Mbps (3)
9. 100Mbps (5)
10. 100Mbps (8)
Resulting reservation: (A to D) (100Mbps), start = t1, end = t9.
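Putting the pieces together, an earliest-completion search scans candidate windows in order and stops at the first one the request fits into. A schematic sketch, assuming a widest(graph) helper like the Dijkstra variant sketched earlier:

```python
def earliest_completion(windows, volume, max_bw, widest):
    """windows: (t_start, t_end, graph) tuples, ordered so that earlier
    completion times are examined first. Returns the chosen reservation
    or None if nothing fits before the latest end time."""
    for t_start, t_end, graph in windows:
        bw = min(max_bw, widest(graph))   # path bottleneck caps the rate
        if bw <= 0:
            continue
        finish = t_start + volume / bw    # does it complete inside the window?
        if finish <= t_end:
            return {"bandwidth": bw, "t_start": t_start, "t_end": finish}
    return None
```

The shortest-duration variant is the same scan with a different ranking of the candidate windows, preferring the one that minimizes finish minus start.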
48. Search Order - Time Windows: shortest duration?
[Figure: remaining windows t9-t12, t12-t13, t9-t13 with reservation 3 active]
Max bandwidth from A to D per examined window:
1. 200Mbps (3)
2. 900Mbps (1)
3. 200Mbps (4)
Reservation: (A to D) (200Mbps), start = t9, end = t13.
• From A to D: max bandwidth = 200Mbps, volume = 175Mbps × 4 time slots, earliest start = t1, latest finish = t13
  • earliest completion: (A to D) (100Mbps), start = t1, end = t8
  • shortest duration: (A to D) (200Mbps), start = t9, end = t12.5
49. Source > Network > Destination
[Figure: the same topology with end hosts n1 and n2 attached at the source and destination edges]
Now we have multiple requests.
50. With start/end times
• Each transfer request has start and end times
• n transfer requests are given (each request has a specific amount of profit)
• The objective is to maximize the profit
• If the profit is the same for each request, then the objective is to maximize the number of jobs in a given time period
• Unsplittable Flow Problem: given an undirected graph, route demand from source(s) to destination(s) and maximize/minimize the total profit/cost
The online scheduling method here is inspired by the Gale-Shapley algorithm (also known as the stable marriage problem).
51. Methodology
• Displace other jobs to open space for the new request (we can shift at most n jobs)
• Never accept a job if it causes other committed jobs to break their criteria
• Planning ahead (gives an opportunity for co-allocation)
• Gives a polynomial approximation algorithm
• The preference converts the UFP problem into a Dijkstra path search
• Utilizes time windows/time steps for ranking (better than earliest-deadline-first)
  • earliest completion + shortest duration
  • minimize concurrency
• Even random ranking would work (a relaxation in an NP-hard problem)
53. Recall Time Windows
[Figure: repeats the search-order example from slide 47: candidate windows t1-t4 through t1-t9, the max bandwidth from A to D per window, and the resulting reservation (A to D) (100Mbps), start = t1, end = t9]
54. Test
In real life, the number of nodes and the number of reservations in a given search interval are limited.
See the AINA'13 paper for results and a comparison with different preference metrics.
55. Autonomic Provisioning System
• Generate constraints automatically (without user input):
  • volume (elephant flow?)
  • true deadline, if applicable
  • end-host resource availability
  • burst rate (fixed bandwidth, variable bandwidth)
• Update constraints according to feedback and monitoring
• Minimize operational cost
• An alternative to manual traffic engineering
What is the incentive to make correct reservations?
56. [Figure: wide-area SDN scenario connecting Experimental facility A, Data Center 1, Data Center 2, and Data node B (web access)]
• (1) Experimental facility A generates 30T of data every day, and it needs to be stored in data center 2 before the next run, since local disk space is limited.
• (2) There is a reservation made between data centers 1 and 2. It is used to replicate data files, 1P in total size, when new data is available in data center 2.
• (3) New results are published at data node B; we expect high traffic to download new simulation files for the next couple of months.
Wide-area SDN
57. Example
• The experimental facility periodically transfers data (i.e., every night)
• Data replication happens occasionally, and it will take a week to move 1P of data. It could get delayed a couple of hours with no harm
• Wide-area download traffic will increase gradually; most of the traffic will be during the day
• We can dynamically increase the preference for download traffic in the mornings, give high priority to transferring data from the facility at night, and use the rest of the bandwidth for data replication (and allocate some bandwidth to confirm that it would finish within a week, as usual)
58. Virtual Circuit Reservation Engine
Autonomic provisioning system + monitoring.
Reservation Engine:
- Select the optimal path/time/bandwidth
- Maximize the number of admitted requests
- Increase overall system utilization and network efficiency
- Dynamically update the selected routing path for network efficiency
- Modify existing reservations dynamically to open space/time for new requests
59. THANK YOU
Any questions/comments?
Mehmet Balman
mbalman@lbl.gov
http://balman.info