Pisa is a decentralized block storage distribution and replication framework with the specific goal of simplifying the development of storage back-end services in a distributed environment. The main characteristics of the project are message security, a self-organizing cluster, and simple setup. Pisa is a subproject of the RestFS project, and the talk explains the experience we acquired while developing this subcomponent and the decisions taken in the design of the framework.
3. Beolink.org
Block Storage Devices
Pisa is a simple block data distribution and replication framework across a wide range of nodes.
[Figure: a block is stored as a Key ([Hash]) and its Data; blocks are transferred from a node to new nodes.]
5. What is it?
RestFS is a highly scalable, highly available network object storage.
6. Five pylons
Objects
• Separation between data and metadata
• Each element is marked with a revision
• Each element is marked with a hash
Cache
• Client side
• Callback/Notify
• Persistent
Transmission
• Parallel operation
• HTTP-like protocol
• Compression
• Transfer by difference
Distribution
• Resource discovery by DNS
• Data spread across a multi-node cluster
• Decentralized
• Independent clusters
• Data replication
Security
• Secure connection
• Client-side encryption
• Extended ACL
• Delegation/Federation
• Admin Delegation
13. CAP theorem
According to Brewer’s CAP theorem, it is impossible for any distributed
computer system to simultaneously provide all three of Consistency,
Availability and Partition Tolerance.
You can’t have all three at the same time and still get acceptable latency.
14. CAP
ACID (RDBMS)
Atomic: Everything in a transaction succeeds or the entire transaction is rolled back.
Consistent: A transaction cannot leave the database in an inconsistent state.
Isolated: Transactions cannot interfere with each other.
Durable: Completed transactions persist, even when servers restart etc.
- Strong consistency for transactions is the highest priority
- Pessimistic
- Complex mechanisms
BASE (NoSQL)
Basic Availability
Soft-state
Eventual consistency
- Availability and scaling are the highest priorities
- Weak consistency
- Optimistic
- Best effort
- Simple and fast
19. Data Distribution: DHT
Distributed Hash Table
Blocks are distributed in partitions.
Partitions are identified by a hash prefix.
Partitions are hosted on servers.
[Figure: the prefix of the key (hash) 0000010000 is the partition id. Two tables are shown:]
Part id   Node id
1         2
2         …

Node id   Node
1         obj
2         obj
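The lookup on this slide can be sketched as follows. This is a minimal illustration, not Pisa's actual code: SHA-1, the 4-bit prefix, and the names `PART_BITS`, `PART_TO_NODE`, `block_key`, `partition_id` and `node_for` are all assumptions made for the example.

```python
import hashlib

PART_BITS = 4  # assumption: 2**4 = 16 partitions, chosen only for illustration

# Hypothetical partition table: index = partition id, value = node id.
PART_TO_NODE = [pid % 3 for pid in range(2 ** PART_BITS)]


def block_key(data: bytes) -> bytes:
    """The key of a block is the hash of its content (SHA-1 is an assumption)."""
    return hashlib.sha1(data).digest()


def partition_id(key: bytes, bits: int = PART_BITS) -> int:
    """Use the first `bits` bits of the key as the partition id."""
    prefix = int.from_bytes(key[:4], "big")  # first 32 bits of the hash
    return prefix >> (32 - bits)


def node_for(data: bytes) -> int:
    """Resolve the node that hosts the partition of this block."""
    return PART_TO_NODE[partition_id(block_key(data))]
```

Because the table is indexed directly by the prefix, finding the node for a block needs no intermediate hop.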
20. Data Distribution
Zero Hop Hash (Consistent Hash)
- Partition location with 0 hops
- Adding 1% capacity moves about 1% of the data
Node
• Zone
• Weight
Partition, array list (FIXED):
• Position = key prefix
• Value = node id
Shuffle
• Avoid sequential allocation
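from array import array
from hashlib import sha1
from random import shuffle
from struct import unpack_from
part_list = array('H')                # fixed list: position = key prefix, value = node id
part_count = 2 ** part_exp            # number of partitions
part_key_shift = 32 - part_exp        # keep only the top part_exp bits of the 32-bit prefix
part_id = unpack_from('>I', sha1(data).digest())[0] >> part_key_shift
shuffle(part_list)                    # avoid sequential allocation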
Example node:
ip = 10.1.0.1
zone = 1
weight = 3.0
class = 1
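One way the fixed partition list could be filled is sketched below: each node receives a share of partitions proportional to its weight, and the list is then shuffled so that no node owns a contiguous run of prefixes. The function `build_part_list`, its parameters, and the fixed seed are hypothetical. Zones are ignored here for brevity, so this is not a description of how Pisa actually places partitions.

```python
import random
from array import array
from bisect import bisect_right
from itertools import accumulate


def build_part_list(weights: dict, part_exp: int, seed: int = 0) -> array:
    """Return a fixed-size array: position = partition id, value = node id.

    `weights` maps a node id to its weight (hypothetical input format).
    """
    part_count = 2 ** part_exp
    node_ids = list(weights)
    cumulative = list(accumulate(weights[n] for n in node_ids))
    total = cumulative[-1]

    # Map the midpoint of every partition onto the cumulative weight line,
    # so each node receives a share proportional to its weight.
    part_list = array(
        "H",
        (
            node_ids[bisect_right(cumulative, (p + 0.5) * total / part_count)]
            for p in range(part_count)
        ),
    )

    # Shuffle to avoid sequential allocation of neighbouring partitions.
    random.Random(seed).shuffle(part_list)
    return part_list
```

With weights 3.0 and 1.0 and `part_exp = 4`, the first node ends up with 12 of the 16 partitions and the second with 4.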
26. Cluster Coordination
Epidemic (Gossip)
Epidemic: any node can infect any other node with equal probability.
An update reaches all n nodes in O(log n) rounds.
http://www.cis.cornell.edu/IAI/events/Gossip_Tutorial.pdf
Periodic anti-entropy exchanges among
nodes ensure that they eventually
converge, even if updates are lost.
Arbitrary pairs of replicas periodically
establish contact and resolve all
differences between their databases.
Hashes reduce the volume of data exchanged in the common case.
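The anti-entropy exchange described above can be simulated in a few lines. This is a sketch under stated assumptions: each replica is a plain dict mapping a key to a `(version, value)` tuple, the highest version wins, and the names `digest`, `anti_entropy` and `gossip_until_converged` are invented for the example.

```python
import hashlib
import random


def digest(state: dict) -> str:
    """Hash of the whole state; equal digests mean nothing must be exchanged."""
    return hashlib.sha1(repr(sorted(state.items())).encode()).hexdigest()


def anti_entropy(a: dict, b: dict) -> None:
    """Resolve all differences between two replicas; the highest version wins."""
    if digest(a) == digest(b):
        return  # common case: only the hashes were compared
    for key in set(a) | set(b):
        newest = max(
            a.get(key, (0, None)),
            b.get(key, (0, None)),
            key=lambda item: item[0],
        )
        a[key] = b[key] = newest


def gossip_until_converged(replicas: list, rng: random.Random,
                           max_rounds: int = 100) -> int:
    """Run rounds in which every replica contacts one random peer.

    Returns the number of rounds executed before all digests matched.
    """
    for round_no in range(1, max_rounds + 1):
        for i, replica in enumerate(replicas):
            peer = rng.choice([r for j, r in enumerate(replicas) if j != i])
            anti_entropy(replica, peer)
        if len({digest(r) for r in replicas}) == 1:
            return round_no
    return max_rounds
```

Seeding one replica with a single update and running the loop shows the update spreading to every node after a small number of rounds.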
27. Cluster Coordination
Table Items (Routing Table)
• Node table list
• Partition-to-node list
Bootstrap
• DNS name or IP at startup
• DNS Lookup (SRV)
• Multicast
Transfer Type
• Complete transfer
• Resync by diff (Merkle tree)
• Notification for a single change
• Join Node
• Leave Node
• Partition owner
[Figure: three tables are shown:]
Part   Serv
1      xxxx
2      …
3
4
5
…

Node ID   Object
1         xxxx
2         …
3
4
5
…

Segment   Hash
1-100     xxxx
101-200   …
…
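The segment hashes in the last table suggest how a resync by diff might work: two nodes compare a root hash first and, only if it differs, look for the segments that changed. The sketch below is a simplified illustration. It builds a binary Merkle root but then compares the leaf hashes directly instead of descending the tree, and all function names and the table format (partition id mapped to a server) are assumptions.

```python
import hashlib


def segment_hashes(table: dict, segment_size: int, part_count: int) -> list:
    """Hash the routing table in fixed segments of partition ids (leaf level)."""
    leaves = []
    for start in range(0, part_count, segment_size):
        end = min(start + segment_size, part_count)
        rows = [(p, table.get(p)) for p in range(start, end)]
        leaves.append(hashlib.sha1(repr(rows).encode()).hexdigest())
    return leaves


def root_hash(leaves: list) -> str:
    """Combine leaf hashes pairwise up to a single root (binary Merkle tree)."""
    level = leaves
    while len(level) > 1:
        level = [
            hashlib.sha1("".join(level[i:i + 2]).encode()).hexdigest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


def segments_to_resync(local: dict, remote: dict,
                       segment_size: int, part_count: int) -> list:
    """Return the indexes of segments whose content differs."""
    mine = segment_hashes(local, segment_size, part_count)
    theirs = segment_hashes(remote, segment_size, part_count)
    if root_hash(mine) == root_hash(theirs):
        return []  # tables are identical, nothing to transfer
    return [i for i, (m, t) in enumerate(zip(mine, theirs)) if m != t]
```

Only the rows of the returned segments need to be transferred, instead of the complete table.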
28. Cluster Coordination
[Figure: sequence diagram between Node X, Node Y, New Node Z and a Client. Labels: Bootstrap; Notify of new node; Partition claim x; Accept; Table change notification via gossip; Request part x / Return new owner; Request part x / Return data. Table excerpt: Part X, Serv Z.]
If the data is not yet present on the new node, the new node acts as a proxy (lazy transfer).
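The lazy transfer can be illustrated with a small in-memory model. The `Node` class, its attributes, and the use of a direct object reference in place of a network call are assumptions made for the sketch, not part of Pisa.

```python
class Node:
    """Hypothetical in-memory node; `blocks` maps a block key to its data."""

    def __init__(self, name: str):
        self.name = name
        self.blocks = {}
        self.previous_owner = None  # set when this node claims a partition

    def get(self, key: str) -> bytes:
        """Serve a block, fetching it from the previous owner on first access."""
        if key in self.blocks:
            return self.blocks[key]
        if self.previous_owner is None:
            raise KeyError(key)
        # Lazy transfer: act as a proxy and keep a local copy for next time.
        data = self.previous_owner.get(key)
        self.blocks[key] = data
        return data
```

A new node Z that has claimed a partition from X can answer a client immediately: the first request for a block is proxied to X, and later requests are served from Z's own copy.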
The session is divided into three main parts: a short introduction to the ecosystem, a presentation of the goals and architecture of RestFS, and a short demo of some basic capabilities of RestFS.
We divide the design into 5 areas.
Now I will describe RestFS, its functionality and architecture, in a little more detail.
The principle used for all design decisions, and for identifying the right or best solution, is based on the idea behind Hadoop or GFS, and it is