Pisa Block Distribution and Replication Framework
Fabrizio Manfredi Furuholmen
Federico Mosca
Beolink.org
Buzzwords 2014
Agenda
• Introduction
• Overview
• Problem
• Common Pattern
• Implementation
• Data Placement
• Data Consistency
• Cluster Coordination
• Data Transmission
Block Storage Devices
Pisa is a simple block data distribution and replication framework that runs on a wide range of nodes.
[Diagram: data is split into blocks, each identified by a key (hash); blocks are transferred to new nodes as they join.]
Build a solution
What is it?
RestFS is a highly scalable, highly available network object storage.
Five pillars
Objects
• Separation between data and metadata
• Each element is marked with a revision
• Each element is marked with a hash

Cache
• Client side
• Callback/notify
• Persistent

Transmission
• Parallel operations
• HTTP-like protocol
• Compression
• Transfer by difference

Distribution
• Resource discovery by DNS
• Data spread across a multi-node cluster
• Decentralized
• Independent clusters
• Data replication

Security
• Secure connections
• Client-side encryption
• Extended ACLs
• Delegation/federation
• Admin delegation
RestFS Key Words
• Cell: a collection of servers
• Bucket: a virtual container, hosted by one or more servers
• Object: an entity (file, dir, …) contained in a Bucket
Object
An Object is split into data and metadata:
• Data: a sequence of segments (Block 1, Block 2, …, Block n), each carrying its own hash and serial.
• Metadata: attributes set by the user, properties, ACLs, and extended properties.
Main Goal …
Storage as Lego bricks
The infrastructure has to be inexpensive, with high scalability and reliability.
Problems
Main Problem
CAP theorem
According to Brewer's CAP theorem, it is impossible for a distributed computer system to simultaneously provide all three of Consistency, Availability, and Partition tolerance.
You can't have all three at the same time and still get acceptable latency.
CAP
ACID (RDBMS)
• Atomic: everything in a transaction succeeds or the entire transaction is rolled back.
• Consistent: a transaction cannot leave the database in an inconsistent state.
• Isolated: transactions cannot interfere with each other.
• Durable: completed transactions persist, even when servers restart.
Strong consistency for transactions is the highest priority; pessimistic; complex mechanisms.

BASE (NoSQL)
• Basic Availability
• Soft-state
• Eventual consistency
Availability and scaling are the highest priorities; weak consistency; optimistic; best effort; simple and FAST.
First of all …
“Think like a child…”
Second …
“There is always a failure waiting around the corner”
— Werner Vogels
Data Distribution & Replication
• Data Placement
• Data Consistency
• Cluster Coordination
• Data Transmission
Data Placement
Better distribution = partitioning
Parallel operation = parallel streams / multiple cores
Data Distribution: DHT
• Distributed Hash Table
• Blocks are distributed across partitions
• Partitions are identified by a hash prefix
• Partitions are hosted on servers

A key hash such as 0000010000 is split so that its prefix gives the partition id; a partition-to-node table (e.g. partition 1 → node 2) then maps the partition to the node holding the objects.
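As a toy end-to-end lookup (an illustrative sketch only; the table contents and names below are made up, not the project's code): hash the key, take the top bits as the partition id, then consult the partition-to-node table.

from hashlib import sha1

part_exp = 3                            # 2**3 = 8 partitions (toy size)
part_shift = 32 - part_exp
part2node = [2, 0, 1, 2, 0, 1, 2, 0]    # hypothetical partition -> node id table

def lookup(key: bytes) -> int:
    """Return the node id hosting the partition that owns this key."""
    prefix = int.from_bytes(sha1(key).digest()[:4], 'big')  # top 32 bits of the hash
    return part2node[prefix >> part_shift]                  # prefix selects the partition

print(lookup(b"block-0001"))            # e.g. node 1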
Data Distribution
Zero-Hop Hash (Consistent Hashing)
- Partition location with 0 hops
- 1% capacity added ⇒ ~1% of data moved

Node attributes: zone, weight.
Partition array (fixed size): position = key prefix, value = node id.
Shuffle: avoid sequential allocation.

from array import array
from hashlib import sha1 as sha
from random import shuffle

part_count = 2 ** part_exp
part_key_shift = 32 - part_exp
part_list = array('H', (i % node_count for i in range(part_count)))  # prefix -> node id (illustrative fill)
shuffle(part_list)                                    # avoid sequential allocation
key = int.from_bytes(sha(data).digest()[:4], 'big')   # first 32 bits of the hash
part_id = key >> part_key_shift                       # partition id = key prefix

Example node definition:
ip = 10.1.0.1
zone = 1
weight = 3.0
class = 1
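To see why the "1% capacity added, ~1% moved" claim holds, here is a toy rebalance (a hypothetical helper, not from the deck): handing a share of the partition array to a new node moves only those partitions; every other entry stays where it is.

from random import sample, seed

def rebalance(part_list, new_node, share):
    """Reassign `share` randomly chosen partitions to the new node; the rest never move."""
    moved = sample(range(len(part_list)), share)
    for p in moved:
        part_list[p] = new_node
    return moved

seed(0)
parts = [i % 4 for i in range(1000)]                 # 1000 partitions on 4 nodes
moved = rebalance(parts, new_node=4, share=200)      # new node takes a 1/5 share
print(len(moved) / len(parts))                       # 0.2: only its share of the data moves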
Data Placement
Replication: vnode-based vs. client-based
Data Distribution
Proximity-based
http://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/
def get_part_nodes(self, part_id, master_node, replicas):
    """Master first, then replicas placed on distinct nodes along the partition list."""
    node_ids = [master_node]
    zones = [self.nodes[node_ids[0]]]    # master's zone (zone check not shown in this fragment)
    for replica in range(1, replicas):
        # advance through the partition list until we reach a node not already chosen
        while self.part_list[part_id] in node_ids:
            part_id += 1
            if part_id >= len(self.part_list):
                part_id = 0              # wrap around the ring
        node_ids.append(self.part_list[part_id])
    return [self.nodes[n] for n in node_ids]
Part  Serv
1     xxxx
2     yyyy
3     zzzz
…

Partition 1 will also be placed on nodes 2 and 3; the master node is always the first entry in the list.
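A hypothetical invocation, assuming a ring object that exposes the `nodes` and `part_list` structures used in the fragment above:

replica_nodes = ring.get_part_nodes(part_id=1, master_node=ring.part_list[1], replicas=3)
# -> [master, second, third]: three distinct nodes; the master is always first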
Data Consistency
http://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/
To avoid a full ACID implementation while still guaranteeing consistency, some solutions leave ownership of the algorithm to the client.
Data Consistency
http://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/
Tunable trade-offs for distribution and replication (N, R, W).
The read operation is validated with a hash check.
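For context, this is the standard quorum argument (not spelled out in the deck): with N replicas, choosing a write quorum W and a read quorum R such that R + W > N guarantees that every read overlaps at least one fully written replica; the hash check then identifies the valid copy.

# quorum overlap: R + W > N means read and write sets always intersect
N, W, R = 3, 2, 2
assert R + W > N        # e.g. 2 + 2 > 3: a read always sees at least one fresh copy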
Cluster Coordination
• Cluster communication
• Table distribution (routing table)
• Failure detection
• Node join / leave
Cluster Coordination
Epidemic (Gossip)
Epidemic: anybody can infect anyone else with equal probability; convergence takes O(log n) rounds.
http://www.cis.cornell.edu/IAI/events/Gossip_Tutorial.pdf
Periodic anti-entropy exchanges among nodes ensure that they eventually converge, even if updates are lost. Arbitrary pairs of replicas periodically establish contact and resolve all differences between their databases. Hashes reduce the volume of data exchanged in the common case.
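A minimal sketch of hash-guarded anti-entropy between replicas (a hypothetical data model; a digest comparison stands in for the real exchange): a full reconciliation runs only when the summaries differ, and versions decide the winner.

import hashlib, json, random

def digest(db):
    """Cheap summary of a replica's state; equal digests mean nothing to exchange."""
    return hashlib.sha1(json.dumps(db, sort_keys=True).encode()).hexdigest()

def anti_entropy(a, b):
    """Resolve all differences between two replicas (highest version wins)."""
    if digest(a) == digest(b):
        return                                   # common case: hashes match
    for key in set(a) | set(b):
        va, vb = a.get(key, (0, None)), b.get(key, (0, None))
        a[key] = b[key] = max(va, vb)            # entries are (version, value) pairs

# each round a random pair gossips; all replicas eventually converge
replicas = [{"x": (1, "old")}, {"x": (2, "new")}, {}]
for _ in range(10):
    p, q = random.sample(replicas, 2)
    anti_entropy(p, q)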
Cluster Coordination
Table items (routing table):
• Node table list
• Partition-to-node list

Bootstrap:
• DNS name or IP at startup
• DNS lookup (SRV)
• Multicast

Transfer type:
• Complete transfer
• Resync by diff (Merkle tree), as sketched below
• Notification for a single change (join node, leave node, partition owner)

The routing state consists of three tables: partition → server, node id → objects, and segment range → hash (e.g. segments 1-100 → one hash, 101-200 → the next).
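As an illustration of "resync by diff" (a flat range-hash comparison standing in for the full Merkle-tree walk; all names are made up): hashes of segment ranges are compared first, and only the ranges whose hashes disagree are transferred.

import hashlib

def range_hash(segments, lo, hi):
    """Hash one range of segments, mirroring the segment-range table above."""
    h = hashlib.sha1()
    for seg in segments[lo:hi]:
        h.update(seg)
    return h.digest()

def changed_ranges(local, remote, step=100):
    """Return the segment ranges whose hashes differ between two replicas."""
    diffs = []
    for lo in range(0, max(len(local), len(remote)), step):
        if range_hash(local, lo, lo + step) != range_hash(remote, lo, lo + step):
            diffs.append((lo, lo + step))
    return diffs

# only the ranges returned by changed_ranges() need to be shipped over the wire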
Cluster Coordination
Node join sequence (node X, new node Z, node Y, a client):
1. The new node Z bootstraps and contacts an existing node X.
2. X notifies the cluster of the new node; Z claims partition x, and the partition table is updated (partition x: served by Z).
3. The table change is propagated via gossip; node Y accepts it.
4. A client requesting partition x is returned the new owner and re-issues the request to Z, which returns the data.
5. If the data is not yet present on the new node, the new node acts as a proxy (lazy transfer).
Transport Protocol
ZeroMQ and MessagePack (RPC)
• Cluster communications
• Client data transfer
• Partition replication/relocation
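A minimal sketch of this transport stack (illustrative only; the method name, payload shape, and port are made up): MessagePack-encoded requests over a ZeroMQ REQ/REP pair, with both ends in one process for brevity.

import zmq
import msgpack

ctx = zmq.Context()

rep = ctx.socket(zmq.REP)                 # server end
rep.bind("tcp://127.0.0.1:5555")
req = ctx.socket(zmq.REQ)                 # client end
req.connect("tcp://127.0.0.1:5555")

req.send(msgpack.packb(["get_block", {"key": "0000010000"}]))   # RPC request

method, args = msgpack.unpackb(rep.recv(), raw=False)           # server dispatches...
rep.send(msgpack.packb({"status": "ok", "method": method}))     # ...and replies

print(msgpack.unpackb(req.recv(), raw=False))                   # {'status': 'ok', ...}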
Status
Eeeemmm… not really perfect …
Next
http://www.cs.rutgers.edu/~pxk/417/notes/23-lookup.html
• New data distribution model: Chord / cluster node, space-based / multi-dimensional
• Vector clocks
• Rebalance, partition handover (weight change)
• Locking
• WAN replication (async)
• Config replication (pub/sub, events)
• Server priority
…
Thank you
http://restfs.beolink.org
manfred.furuholmen@gmail.com
fege85@gmail.com


Editor's Notes

  • #3 The session is divided in three main parts: a short introduction to the ecosystem, the presentation of the goals and architecture of RestFS, and a small demo of some basic capabilities of RestFS.
  • #7 We divide the design into 5 areas.
  • #8 Now I will describe RestFS in a bit more detail: functionality and architecture.
  • #14 The session is divided in three main parts: a short introduction to the ecosystem, the presentation of the goals and architecture of RestFS, and a small demo of some basic capabilities of RestFS.
  • #16 The principle used for every design decision, and to identify the right or best solution, is based on the idea behind Hadoop and GFS, and it is:
  • #17 The principle used for every design decision, and to identify the right or best solution, is based on the idea behind Hadoop and GFS, and it is: