An introduction and evaluations of a wide area distributed storage system


2001.9.11
September 11 attacks

2003.8.14
Northeast blackout of 2003

2011.3.11
The aftermath of the 2011 Tohoku earthquake and tsunami

[Map: Eurasian Plate, North American Plate, Pacific Ocean Plate, Philippine Sea Plate, the epicenter of 3.11, and the Nankai (South Sea) Trough, marked "[NEXT]"]

National Institute of
Informatics

Trans-Japan
Inter-Cloud
Testbed

Kitami Institute
of Technology
University of the
Ryukyus
SINET
the longest
path

Cybermedia Center
Osaka University
Kitami Institute
of Technology
University of the
Ryukyus
XenServer
6.0.2
CloudStack
4.0.0
XenServer
6.0.2
CloudStack
4.0.0

Storage XenMotion
Live Migration
without shared storage
> XenServer 6.1

WIDE cloud
different translate

[Figure: throughput in Kbytes/sec (0 to 120000) by file size in 2^n KBytes and record size in 2^n Kbytes]
High
Random R/W
Performance

POSIX file system
interface protocol
NFS, CIFS, iSCSI

RICC: Regional InterCloud Committee

Distcloud: widely distributed virtualization infrastructure

Confidential
Global VM migration is also available when VM host machines share the "storage space".
Real-time availability makes this possible; the actual data copy follows.
(VM operators need a virtually common Ethernet segment and a fat pipe for memory copy)
TOYAMA site
OSAKA site
TOKYO site
before Migration
Copy to DR-sites
Copy to DR-sites
live migration of VM
between distributed areas
With its real-time and active-active features, the system appears to be just simple "shared storage".
Live migration is also possible between DR sites
(it requires a common subnet and a fat pipe for memory copy, of course)
after Migration
Copy to DR-sites

Front-end servers aggregate client requests (READ / WRITE) so that
many back-end servers can handle user data in a parallel and distributed manner.
Both performance and storage space are scalable, depending on the number of servers.
front-end
(access server)
Access Gateway
(via NFS, CIFS or similar)
clients
back-end
(core server)
WRITE req.
write
blocks
read blocks
READ req.
scalable performance &
scalable storage size
by parallel & distributing
processing technology

File
block block block
block block block
block block block
Meta
Data
consistent
hash
backend
(core servers)
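
The block placement drawn above (a file split into blocks, each block mapped to a back-end core server by a consistent hash) can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the ring with virtual nodes, the SHA-256 hash, the block size, and every name (`HashRing`, `split_into_blocks`, `core-1` and so on) are assumptions introduced here.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the ring (a 64-bit integer)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """A consistent-hash ring with virtual nodes (hypothetical, for illustration)."""

    def __init__(self, servers, vnodes=64):
        # Each server appears at `vnodes` points so that load spreads evenly.
        self._points = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._points]

    def lookup(self, block_id: str, replicas: int = 1):
        """Return up to `replicas` distinct servers, walking clockwise from the block's point."""
        start = bisect.bisect(self._keys, _hash(block_id))
        chosen = []
        for offset in range(len(self._points)):
            server = self._points[(start + offset) % len(self._points)][1]
            if server not in chosen:
                chosen.append(server)
                if len(chosen) == replicas:
                    break
        return chosen


def split_into_blocks(data: bytes, block_size: int):
    """Cut a file's contents into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


# Usage: place the blocks of one file on four assumed core servers.
ring = HashRing(["core-1", "core-2", "core-3", "core-4"])
blocks = split_into_blocks(b"x" * 10_000, block_size=4096)
placement = {
    f"file-42/block-{n}": ring.lookup(f"file-42/block-{n}", replicas=2)
    for n in range(len(blocks))
}
```

Because placement depends only on the block ID and the server list, any front-end can locate a block without consulting a central table, and adding a server moves only the blocks whose ring positions fall next to the new server's points.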

1. Assign a new unique ID to every updated block (to ensure consistency).
2. Replicate within the local site (for a quick ACK) and update the metadata.
3. Replicate across the globally distributed environment (for the actual data copies).
back-end
(multi-sites)
a file, consisting of many blocks
multiplicity across multiple locations
makes each piece of user data
redundant locally at first,
with 3 distributed copies in the end.
(2) create 2 local copies
of each piece of user data,
write the metadata,
and return ACK
(1)
(1')
(3-a)
(3-a)
(3-a) make a copy
in different location
right after ACK.
(3-b) remove one
of the 2 local blocks
in the future.
(3-b)
(1) assign a new unique ID
for any updated block, so that,
ID ensures the consistency
Most important !
the key for "distributed replication"
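
The write steps (1), (2), (3-a) and (3-b) above can be sketched as a small in-memory model. It is a simplified illustration under stated assumptions, not the product's code: the `Site` and `Storage` classes, the use of `uuid4` for block IDs, the two remote sites, and the queue of pending replications are all hypothetical. Step (3-b) is also run immediately after (3-a) here, whereas the slide only says it happens at some later time.

```python
import uuid
from collections import deque


class Site:
    """One site with several core servers, each modelled as a dict of block ID -> bytes."""

    def __init__(self, name, n_servers=2):
        self.name = name
        self.servers = [dict() for _ in range(n_servers)]

    def store(self, block_id, data, copies):
        # Place the block on the first `copies` servers of this site.
        for server in self.servers[:copies]:
            server[block_id] = data

    def copies_of(self, block_id):
        return sum(block_id in server for server in self.servers)

    def drop_one(self, block_id):
        # Remove a single copy, keeping the one on the first server.
        for server in reversed(self.servers):
            if block_id in server:
                del server[block_id]
                return


class Storage:
    def __init__(self, local, remotes):
        self.local = local
        self.remotes = remotes
        self.metadata = {}      # (path, block index) -> current block ID
        self.pending = deque()  # block IDs still waiting for remote copies

    def write_block(self, path, index, data):
        block_id = uuid.uuid4().hex                 # (1) new unique ID per update
        self.local.store(block_id, data, copies=2)  # (2) two local copies
        self.metadata[(path, index)] = block_id     # (2) update metadata
        self.pending.append(block_id)
        return "ACK"                                # (2) ACK before any remote copy

    def replicate_pending(self):
        while self.pending:
            block_id = self.pending.popleft()
            data = self.local.servers[0][block_id]
            for remote in self.remotes:
                remote.store(block_id, data, copies=1)  # (3-a) copy to other sites
            self.local.drop_one(block_id)               # (3-b) drop one local copy


# Usage: one local site and two assumed remote sites.
local, remote_a, remote_b = Site("local"), Site("remote-a"), Site("remote-b")
storage = Storage(local, [remote_a, remote_b])
ack = storage.write_block("/vm/disk.img", 0, b"version 1")
first_id = storage.metadata[("/vm/disk.img", 0)]
```

Since an update never overwrites a block in place but always receives a fresh ID, a replica either holds a given ID with its final contents or does not hold it at all, which is why the slide calls the ID the key to distributed replication. Reclaiming superseded blocks is not modelled in this sketch.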

redundancy
= 3
r = 2
ACK
r = 1
r = 0
write

redundancy
= 3
ACK
r = 2
e = 0
r = 1
e = 0
r = 0
e = 1
r = -1
e = 2
external

10Gbps
VMs
core
servers
access server
(nfsd)
VM images
VM
image
chunks
virtualization
host

316 km
440 km
690 km
Hiroshima
Univ.
Kanazawa
Univ.

Hiroshima Univ. Kanazawa Univ.
NII
VMM: virtual machine monitor
CS: core servers
HS: hint servers
AS: access servers
AS AS
VMM VMM
CS CS CS CS CS CS HS HS
CS CS CS HS
L3VPN
L3VPN
L2VPN
L2VPN
L2VPN
L2VPN
L3VPN
EXAGE-LAN
EXAGE-LAN
admin
LAN
admin
LAN
MIGRATION-LAN
EXAGE-LAN
MIGRATION-LAN

iozone -aceI
a: full automatic mode
c: Include close() in the timing calculations
e: Include flush (fsync,fflush) in the timing calculations
I: Use DIRECT_IO if possible for all file operations.

[Figure: write throughput in Kbytes/sec (0 to 120000) by file size (64 to 6.71089e+07) and record size in 2^n Kbytes (4 to 16384)]

[Figure: throughput in MB/sec (0 to 120) by file size [Bytes] (64KB to 16GB) and record size [KB] (4 to 16384); panels: write, read, re-read, re-write, random read, backwards read, record rewrite, strided read, random write, fwrite, frewrite]

[Figure: throughput (MB/sec, 0 to 120) by file size (10MB, 100MB, 1GB, 10GB); legend: write, rewrite, read, reread, random read, random write, bkwd read, record rewrite, stride read, fwrite, fread]
Conventional method: Exage/Storage
Wide-area support: Exage/Storage

SINET4 Hiroshima University EXAGE L3VPN
SINET4 Kanazawa University EXAGE L3VPN

core
servers
KVM host
access
server
distcloud
NFS server
access
server
Kanazawa
Univ.
Hiroshima
Univ.

proposed
method
(read)
NFS
(read)
decline of
throughput
by latency
start
live migration

proposed method, shared NFS
Read (before migration) Read (after migration)
Write (before migration) Write (after migration)
Throughput(MB/sec)

SC2013
2013/11/17∼22
@Colorado Convention Center

Ikuo Nakagawa
INTEC Inc. / Osaka University

Kohei Ichikawa
Nara Institute of Science and Technology

We have been developing a widely distributed cluster storage system and
evaluating the storage along with various applications. The main advantage of
our storage is its very fast random I/O performance, even though it provides a
POSIX-compatible file system interface on top of distributed cluster storage.

Shinji Shimojo
Director of JGN-X, NICT

24,000 km
RTT=244ms
1Gbps
loop
back
real stage
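
As a rough worked example of what these link figures imply, the bandwidth-delay product for 1 Gbps and RTT = 244 ms can be computed as below. The function names are introduced here for illustration, and the 64 KiB window is an assumed value (a classic TCP window without window scaling), not a setting reported on the slides.

```python
def bandwidth_delay_product_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bytes that must be in flight to keep a link of this speed and RTT full."""
    return bandwidth_bps * rtt_s / 8


def window_limited_throughput_bps(window_bytes: float, rtt_s: float) -> float:
    """Upper bound on throughput when at most one window is sent per round trip."""
    return window_bytes * 8 / rtt_s


# Link figures from the slide: 1 Gbps, RTT = 244 ms.
bdp = bandwidth_delay_product_bytes(1e9, 0.244)           # about 30.5 million bytes
# Assumed 64 KiB window: about 2.1 Mbit/s at this RTT.
limit = window_limited_throughput_bps(64 * 1024, 0.244)
```

Roughly 30.5 MB has to be outstanding to fill such a path, so any protocol that waits for a round trip after a small amount of data is limited by latency rather than by the 1 Gbps line rate.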

Blocks (chunks)
are located
on the nearest

consistent
hash
Meta data
is not suitable
for wide area

time required for migration
type of line     load condition    required time (sec)
domestic         no load           17.9
international    no load           201.6
international    read load         175.4
international    write load        400.6

IO performance: average throughput (MB/s) of dd
type of line     access pattern    average throughput (MB/s)
domestic         read              64.6
domestic         write             58.7
international    read              25.4
international    write             20.9

Live migration
demo on an
international
line

Evaluations of
distcloud on
an international
line

Disaster
Recovery
demonstration
of DC down

U.S. region
will be built
soon

SC14
2014/11/16∼21
@Ernest N. Morial Convention Center

behavior data
from
mobile devices

data from
non-electrified
areas

mobile
devices
sensor
devices
personal data
aggregation service
high
latency power
consumption

mobile
devices
sensor
devices
low
latency
wide-area distributed
platform
regional
exchange
regional
exchange
personal data
aggregation service

the Internet
distcloud storage
region A region B region C
live migration
optimize routes while
preserving the independence of
each region
users from the Internet
can access the VM
after live migration

Layer | method | outline | features (○ = advantage)
L3 | routing | update routing table on each migration | ○ routing per region; cannot route per VM; routing operation cost
   | routing + L2 extension | VPLS, IEEE802.1ad PB (Q-in-Q), IEEE802.1ah (Mac-in-Mac) | ○ stability, operation cost; poor scalability
   | L2 over L3 | VXLAN, OTV, NVGRE | ○ stability; overhead of tunneling; IP multicast
   | SDN | OpenFlow | ○ programmable operation; cost of equipment
   | ID/locator separation | LISP | ○ scalability, routing per VM; cost, immediacy
   | IP mobility | MAT, NEMO, MIP (Kagemusha) | ○ scalability; load of router
L4 | mSCTP | SCTP multipath | ○ independent from L2/L3; limited to SCTP
L7 | DNS + reverseNAT | Dynamic DNS | ○ independent from L2/L3; altering IP addr.; closing connection

https://www.flickr.com/photos/idvsolutions/7439877658/sizes/o/in/photostream/
