Pisa Block Distribution and Replication Framework
Fabrizio Manfredi Furuholmen
Federico Mosca
Beolink.org
Buzzwords 2014
Agenda
• Introduction
• Overview
• Problem
• Common Pattern
• Implementation
• Data Placement
• Data Consistency
• Cluster Coordination
• Data Transmission
Block Storage Devices
Pisa is a simple block data distribution and replication framework that runs on a wide range of nodes.
[Diagram: data is split into blocks, each identified by a key (hash); blocks are transferred to new nodes as they join.]
Build a solution
What is it?
RestFS is a highly scalable, highly available network object storage.
Five pillars
Objects
• Separation between data and metadata
• Each element is marked with a revision
• Each element is marked with a hash

Cache
• Client side
• Callback/notify
• Persistent

Transmission
• Parallel operations
• HTTP-like protocol
• Compression
• Transfer by difference

Distribution
• Resource discovery by DNS
• Data spread across a multi-node cluster
• Decentralized
• Independent clusters
• Data replication

Security
• Secure connections
• Client-side encryption
• Extended ACLs
• Delegation/federation
• Admin delegation
RestFS Key Words
• Cell: a collection of servers
• Bucket: a virtual container, hosted by one or more servers
• Object: an entity (file, dir, …) contained in a Bucket
Object
An Object is split into data and metadata:
• Data: a sequence of segments (Block 1, Block 2, …, Block n), each carrying its own hash and serial.
• Metadata: attributes set by the user, properties, ACLs, and extended properties.
Main Goal …
Storage as Lego bricks
The infrastructure has to be inexpensive, with high scalability and reliability.
Problems
Main Problem
CAP theorem
According to Brewer's CAP theorem, it is impossible for a distributed computer system to simultaneously provide all three of Consistency, Availability, and Partition tolerance.
You can't have all three at the same time and still get acceptable latency.
CAP
ACID (RDBMS)
• Atomic: everything in a transaction succeeds or the entire transaction is rolled back.
• Consistent: a transaction cannot leave the database in an inconsistent state.
• Isolated: transactions cannot interfere with each other.
• Durable: completed transactions persist, even when servers restart.
Strong consistency for transactions is the highest priority; pessimistic; complex mechanisms.

BASE (NoSQL)
• Basic Availability
• Soft-state
• Eventual consistency
Availability and scaling are the highest priorities; weak consistency; optimistic; best effort; simple and FAST.
First of all …
“Think like a child…”
Second …
“There is always a failure waiting around the corner”
— Werner Vogels
Data Distribution & Replication
• Data Placement
• Data Consistency
• Cluster Coordination
• Data Transmission
Data Placement
Better distribution = partitioning
Parallel operation = parallel streams / multiple cores
Data Distribution: DHT
• Distributed Hash Table
• Blocks are distributed across partitions
• Partitions are identified by a hash prefix
• Partitions are hosted on servers

A key hash such as 0000010000 is split so that its prefix gives the partition id; a partition-to-node table (e.g. partition 1 → node 2) then maps the partition to the node holding the objects.
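As a toy end-to-end lookup (an illustrative sketch only; the table contents and names below are made up, not the project's code): hash the key, take the top bits as the partition id, then consult the partition-to-node table.

from hashlib import sha1

part_exp = 3                            # 2**3 = 8 partitions (toy size)
part_shift = 32 - part_exp
part2node = [2, 0, 1, 2, 0, 1, 2, 0]    # hypothetical partition -> node id table

def lookup(key: bytes) -> int:
    """Return the node id hosting the partition that owns this key."""
    prefix = int.from_bytes(sha1(key).digest()[:4], 'big')  # top 32 bits of the hash
    return part2node[prefix >> part_shift]                  # prefix selects the partition

print(lookup(b"block-0001"))            # e.g. node 1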
Data Distribution
Zero-Hop Hash (Consistent Hashing)
- Partition location with 0 hops
- 1% capacity added ⇒ ~1% of data moved

Node attributes: zone, weight.
Partition array (fixed size): position = key prefix, value = node id.
Shuffle: avoid sequential allocation.

from array import array
from hashlib import sha1 as sha
from random import shuffle

part_count = 2 ** part_exp
part_key_shift = 32 - part_exp
part_list = array('H', (i % node_count for i in range(part_count)))  # prefix -> node id (illustrative fill)
shuffle(part_list)                                    # avoid sequential allocation
key = int.from_bytes(sha(data).digest()[:4], 'big')   # first 32 bits of the hash
part_id = key >> part_key_shift                       # partition id = key prefix

Example node definition:
ip = 10.1.0.1
zone = 1
weight = 3.0
class = 1
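To see why the "1% capacity added, ~1% moved" claim holds, here is a toy rebalance (a hypothetical helper, not from the deck): handing a share of the partition array to a new node moves only those partitions; every other entry stays where it is.

from random import sample, seed

def rebalance(part_list, new_node, share):
    """Reassign `share` randomly chosen partitions to the new node; the rest never move."""
    moved = sample(range(len(part_list)), share)
    for p in moved:
        part_list[p] = new_node
    return moved

seed(0)
parts = [i % 4 for i in range(1000)]                 # 1000 partitions on 4 nodes
moved = rebalance(parts, new_node=4, share=200)      # new node takes a 1/5 share
print(len(moved) / len(parts))                       # 0.2: only its share of the data moves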
Data Placement
Replication: vnode-based vs. client-based
Data Distribution
Proximity-based
http://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/
def get_part_nodes(self, part_id, master_node, replicas):
    """Master first, then replicas placed on distinct nodes along the partition list."""
    node_ids = [master_node]
    zones = [self.nodes[node_ids[0]]]    # master's zone (zone check not shown in this fragment)
    for replica in range(1, replicas):
        # advance through the partition list until we reach a node not already chosen
        while self.part_list[part_id] in node_ids:
            part_id += 1
            if part_id >= len(self.part_list):
                part_id = 0              # wrap around the ring
        node_ids.append(self.part_list[part_id])
    return [self.nodes[n] for n in node_ids]
Part  Serv
1     xxxx
2     yyyy
3     zzzz
…

Partition 1 will also be placed on nodes 2 and 3; the master node is always the first entry in the list.
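A hypothetical invocation, assuming a ring object that exposes the `nodes` and `part_list` structures used in the fragment above:

replica_nodes = ring.get_part_nodes(part_id=1, master_node=ring.part_list[1], replicas=3)
# -> [master, second, third]: three distinct nodes; the master is always first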
Data Consistency
http://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/
To avoid a full ACID implementation while still guaranteeing consistency, some solutions leave ownership of the algorithm to the client.
Data Consistency
http://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/
Tunable trade-offs for distribution and replication (N, R, W).
The read operation is validated with a hash check.
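For context, this is the standard quorum argument (not spelled out in the deck): with N replicas, choosing a write quorum W and a read quorum R such that R + W > N guarantees that every read overlaps at least one fully written replica; the hash check then identifies the valid copy.

# quorum overlap: R + W > N means read and write sets always intersect
N, W, R = 3, 2, 2
assert R + W > N        # e.g. 2 + 2 > 3: a read always sees at least one fresh copy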
Cluster Coordination
• Cluster communication
• Table distribution (routing table)
• Failure detection
• Node join / leave
Cluster Coordination
Epidemic (Gossip)
Epidemic: anybody can infect anyone else with equal probability; convergence takes O(log n) rounds.
http://www.cis.cornell.edu/IAI/events/Gossip_Tutorial.pdf
Periodic anti-entropy exchanges among nodes ensure that they eventually converge, even if updates are lost. Arbitrary pairs of replicas periodically establish contact and resolve all differences between their databases. Hashes reduce the volume of data exchanged in the common case.
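A minimal sketch of hash-guarded anti-entropy between replicas (a hypothetical data model; a digest comparison stands in for the real exchange): a full reconciliation runs only when the summaries differ, and versions decide the winner.

import hashlib, json, random

def digest(db):
    """Cheap summary of a replica's state; equal digests mean nothing to exchange."""
    return hashlib.sha1(json.dumps(db, sort_keys=True).encode()).hexdigest()

def anti_entropy(a, b):
    """Resolve all differences between two replicas (highest version wins)."""
    if digest(a) == digest(b):
        return                                   # common case: hashes match
    for key in set(a) | set(b):
        va, vb = a.get(key, (0, None)), b.get(key, (0, None))
        a[key] = b[key] = max(va, vb)            # entries are (version, value) pairs

# each round a random pair gossips; all replicas eventually converge
replicas = [{"x": (1, "old")}, {"x": (2, "new")}, {}]
for _ in range(10):
    p, q = random.sample(replicas, 2)
    anti_entropy(p, q)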
Cluster Coordination
Table items (routing table):
• Node table list
• Partition-to-node list

Bootstrap:
• DNS name or IP at startup
• DNS lookup (SRV)
• Multicast

Transfer type:
• Complete transfer
• Resync by diff (Merkle tree), as sketched below
• Notification for a single change (join node, leave node, partition owner)

The routing state consists of three tables: partition → server, node id → objects, and segment range → hash (e.g. segments 1-100 → one hash, 101-200 → the next).
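As an illustration of "resync by diff" (a flat range-hash comparison standing in for the full Merkle-tree walk; all names are made up): hashes of segment ranges are compared first, and only the ranges whose hashes disagree are transferred.

import hashlib

def range_hash(segments, lo, hi):
    """Hash one range of segments, mirroring the segment-range table above."""
    h = hashlib.sha1()
    for seg in segments[lo:hi]:
        h.update(seg)
    return h.digest()

def changed_ranges(local, remote, step=100):
    """Return the segment ranges whose hashes differ between two replicas."""
    diffs = []
    for lo in range(0, max(len(local), len(remote)), step):
        if range_hash(local, lo, lo + step) != range_hash(remote, lo, lo + step):
            diffs.append((lo, lo + step))
    return diffs

# only the ranges returned by changed_ranges() need to be shipped over the wire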
Cluster Coordination
Node join sequence (node X, new node Z, node Y, a client):
1. The new node Z bootstraps and contacts an existing node X.
2. X notifies the cluster of the new node; Z claims partition x, and the partition table is updated (partition x: served by Z).
3. The table change is propagated via gossip; node Y accepts it.
4. A client requesting partition x is returned the new owner and re-issues the request to Z, which returns the data.
5. If the data is not yet present on the new node, the new node acts as a proxy (lazy transfer).
Transport Protocol
ZeroMQ and MessagePack (RPC)
• Cluster communications
• Client data transfer
• Partition replication/relocation
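A minimal sketch of this transport stack (illustrative only; the method name, payload shape, and port are made up): MessagePack-encoded requests over a ZeroMQ REQ/REP pair, with both ends in one process for brevity.

import zmq
import msgpack

ctx = zmq.Context()

rep = ctx.socket(zmq.REP)                 # server end
rep.bind("tcp://127.0.0.1:5555")
req = ctx.socket(zmq.REQ)                 # client end
req.connect("tcp://127.0.0.1:5555")

req.send(msgpack.packb(["get_block", {"key": "0000010000"}]))   # RPC request

method, args = msgpack.unpackb(rep.recv(), raw=False)           # server dispatches...
rep.send(msgpack.packb({"status": "ok", "method": method}))     # ...and replies

print(msgpack.unpackb(req.recv(), raw=False))                   # {'status': 'ok', ...}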
Status
Eeeemmm… not really perfect …
Next
http://www.cs.rutgers.edu/~pxk/417/notes/23-lookup.html
• New data distribution model: Chord / cluster node, space-based / multi-dimensional
• Vector clocks
• Rebalance, partition handover (weight change)
• Locking
• WAN replication (async)
• Config replication (pub/sub, events)
• Server priority
…
Thank you
http://restfs.beolink.org
manfred.furuholmen@gmail.com
fege85@gmail.com


Editor's Notes

  • #3 The session is divided in three main parts: a short introduction to the ecosystem, the presentation of the goals and architecture of RestFS, and a small demo of some basic capabilities of RestFS.
  • #7 We divide the design into 5 areas.
  • #8 Now I will describe RestFS in a bit more detail: functionality and architecture.
  • #14 The session is divided in three main parts: a short introduction to the ecosystem, the presentation of the goals and architecture of RestFS, and a small demo of some basic capabilities of RestFS.
  • #16 The principle used for every design decision, and to identify the right or best solution, is based on the idea behind Hadoop and GFS, and it is:
  • #17 The principle used for every design decision, and to identify the right or best solution, is based on the idea behind Hadoop and GFS, and it is: