Beolink.org
Pisa: Block Distribution and Replication Framework
Fabrizio Manfredi Furuholmen
Federico Mosca
Buzzwords 2014
Pisa is a decentralized block storage distribution and replication framework with the specific goal of simplifying the development of storage back-end services in a distributed environment. The main characteristics of the project are message security, self-organizing clusters, and simple setup. Pisa is a subproject of the RestFS project, and the talk explains the experience we acquired while developing this subcomponent and the decisions taken in the design of the framework.

Speaker notes:
  • The session is divided into three main parts: a short introduction to the ecosystem, the goals and architecture of RestFS, and a small demo of some basic capabilities of RestFS.
  • We divide the design into five areas.
  • Now I will describe RestFS in a little more detail: functionality and architecture.
  • The principle used for every design decision, and to identify the right or best solution, is based on the idea behind Hadoop and GFS.
  • Pisa

    1. Beolink.org. Pisa: Block Distribution and Replication Framework. Fabrizio Manfredi Furuholmen, Federico Mosca.
    2. Buzzwords 2014. Agenda: Introduction, Overview, Problem, Common Pattern, Implementation, Data Placement, Data Consistency, Cluster Coordination, Data Transmission.
    3. Block Storage Devices. Pisa is a simple block-data distribution and replication framework over a wide range of nodes. (Diagram: a data block, keyed by the hash of its data, is transferred from existing nodes to new nodes joining the cluster.)
    4. Build a solution.
    5. What is it? RestFS is a highly scalable, highly available network object storage.
    6. Five pylons.
       Objects: separation between data and metadata; each element is marked with a revision; each element is marked with a hash.
       Cache: client side; callback/notify; persistent.
       Transmission: parallel operation; HTTP-like protocol; compression; transfer by difference.
       Distribution: resource discovery by DNS; data spread on a multi-node cluster; decentralized; independent clusters; data replication.
       Security: secure connection; client-side encryption; extended ACLs; delegation/federation; admin delegation.
    7. RestFS key words. Cell: a collection of servers. Bucket: a virtual container, hosted by one or more servers. Object: an entity (file, dir, …) contained in a Bucket.
    8. Object. (Diagram: an Object is split into Data and Metadata. Data is a list of Segments, Block 1 … Block n, each carrying a hash and a serial. Metadata holds Attributes set by the user, Properties, ACL, and Extended Properties.)
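The object layout above (data split into blocks, each marked with a hash and a serial) can be sketched as follows; the block size, the names, and the choice of SHA-1 are illustrative assumptions, not the RestFS on-disk format:

```python
import hashlib

BLOCK_SIZE = 4  # tiny block size for illustration; a real store would use far larger blocks

def split_object(data: bytes, block_size: int = BLOCK_SIZE):
    """Split an object's data into segments: (serial, block, hash) triples,
    mirroring the slide's Block 1..n, each marked with a hash and a serial."""
    segments = []
    for serial, offset in enumerate(range(0, len(data), block_size)):
        block = data[offset:offset + block_size]
        segments.append((serial, block, hashlib.sha1(block).hexdigest()))
    return segments

segments = split_object(b"hello world!")  # -> 3 segments: b"hell", b"o wo", b"rld!"
```

Per-block hashes are what later slides rely on for transfer-by-difference and read verification.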
    9. Main goal: storage as a Lego brick. The infrastructure has to be inexpensive, with high scalability and reliability.
    10. Problems.
    11. Main Problem. (Image: a "VS" comparison.)
    12. Main Problem. (Image only.)
    13. CAP theorem. According to Brewer's CAP theorem, it is impossible for any distributed computer system to simultaneously provide all three of Consistency, Availability, and Partition tolerance. You cannot have all three at the same time and still get acceptable latency.
    14. CAP: ACID vs BASE.
       ACID. Atomic: everything in a transaction succeeds or the entire transaction is rolled back. Consistent: a transaction cannot leave the database in an inconsistent state. Isolated: transactions cannot interfere with each other. Durable: completed transactions persist, even when servers restart.
       RDBMS: strong consistency for transactions is the highest priority; pessimistic; complex mechanisms.
       NoSQL (BASE: Basic Availability, Soft state, Eventual consistency): availability and scaling are the highest priorities; weak consistency; optimistic; best effort; simple and fast.
    15. First of all: "Think as a child…"
    16. Second: "There is always a failure waiting around the corner." (Werner Vogels)
    17. Data Distribution and Replication: Data Placement, Data Consistency, Cluster Coordination, Data Transmission.
    18. Data Placement. Better distribution = partitioning. Parallel operation = parallel streams / multiple cores.
    19. Data Distribution: DHT (Distributed Hash Table). Blocks are distributed in partitions; partitions are identified by a hash prefix; partitions are hosted on servers. (Diagram: the prefix of the key hash selects the partition id; one table maps partition ids to node ids, another lists the objects held by each node.)
    20. Data Distribution: Zero-Hop Hash (consistent hashing). Partition location takes 0 hops; adding 1% capacity moves only 1% of the data. Each node carries a zone, a weight, and a class (e.g. ip = 10.1.0.1, zone = 1, weight = 3.0, class = 1). Partitions form a fixed-size array list: the position is the key prefix, the value is the node id; the list is shuffled to avoid sequential allocation. Sketch from the slide:
           part_list = array('H')
           part_key_shift = 32 - part_exp
           part_count = 2 ** part_exp
           partition = unpack('>I', sha(data).digest()[:4])[0] >> part_key_shift
           shuffle(part_list)
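A runnable version of the slide's partition-ring sketch might look like this; the partition count, node ids, and the use of SHA-1 are assumptions for illustration, not Pisa's actual parameters:

```python
import hashlib
from array import array
from random import shuffle
from struct import unpack

PART_EXP = 4                 # 2**4 = 16 partitions; real rings use a much larger exponent
PART_COUNT = 2 ** PART_EXP
PART_SHIFT = 32 - PART_EXP   # keep only the top PART_EXP bits of a 32-bit hash prefix

def build_part_list(node_ids):
    """Fixed-size partition array: index = key prefix, value = node id.
    Shuffled so partitions are not allocated to nodes sequentially."""
    part_list = array('H', (node_ids[i % len(node_ids)] for i in range(PART_COUNT)))
    shuffle(part_list)
    return part_list

def partition_of(key: bytes) -> int:
    """Partition id = top bits of the key's hash: a zero-hop lookup."""
    prefix = unpack('>I', hashlib.sha1(key).digest()[:4])[0]
    return prefix >> PART_SHIFT

part_list = build_part_list([0, 1, 2])
owner = part_list[partition_of(b"some-block-key")]
```

Because the partition array is fixed, growing the cluster reassigns partitions rather than rehashing keys, which is what keeps data movement proportional to the capacity added.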
    21. Data placement: vnode-based vs client-based replication.
    22. Data Distribution: proximity-based replication (http://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/). Partition one is also placed on nodes 2 and 3; the master node is always the first. (Table: partition → server, e.g. 1 → xxxx, 2 → yyyyy, 3 → zzzzz.) Sketch from the slide:
           node_ids = [master_node]
           zones = [self.nodes[node_ids[0]]]
           for replica in xrange(1, replicas):
               while self.part_list[part_id] in node_ids:
                   part_id += 1
                   if part_id >= len(self.part_list):
                       part_id = 0
               node_ids.append(self.part_list[part_id])
           return [self.nodes[n] for n in node_ids]
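The flattened snippet on the slide can be reconstructed as roughly the following self-contained sketch; the names and the wrap-around handling are assumptions, but the invariant matches the slide: the master node comes first and the replicas land on distinct nodes:

```python
def replicas_for(part_id, part_list, nodes, n_replicas=3):
    """Return n_replicas distinct nodes for a partition: the master node first,
    then walk forward through the partition list, skipping nodes already chosen.
    Assumes part_list contains at least n_replicas distinct node ids."""
    assert len(set(part_list)) >= n_replicas
    node_ids = [part_list[part_id]]              # the master node is always the first
    for _ in range(1, n_replicas):
        part_id = (part_id + 1) % len(part_list)
        while part_list[part_id] in node_ids:    # skip partitions owned by chosen nodes
            part_id = (part_id + 1) % len(part_list)
        node_ids.append(part_list[part_id])
    return [nodes[n] for n in node_ids]

nodes = {0: "xxxx", 1: "yyyyy", 2: "zzzzz"}
part_list = [0, 1, 0, 2, 1, 2]
replicas = replicas_for(0, part_list, nodes)     # ['xxxx', 'yyyyy', 'zzzzz']
```

Walking the partition list in order keeps replicas near the master's position while the `in node_ids` check prevents two copies from landing on the same node.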
    23. Data Consistency (http://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/). To avoid a full ACID implementation while still guaranteeing consistency, some solutions leave ownership of the consistency algorithm to the client.
    24. Data Consistency (same reference). Tunable trade-offs for distribution and replication (N, R, W: replica count, read quorum, write quorum). The read operation is implemented with a hash check.
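One way to picture the (N, R, W) read path with a hash check; the reply format and quorum values here are assumptions for illustration, not the Pisa wire protocol:

```python
import hashlib

N, R, W = 3, 2, 2  # example quorum settings; R + W > N makes reads overlap the latest write

def quorum_read(replies, r=R):
    """Accept a value once r replicas have returned it with a matching hash.
    Each reply is a (value, digest) pair; replies failing the hash check are ignored."""
    counts = {}
    for value, digest in replies:
        if hashlib.sha1(value).hexdigest() != digest:
            continue                     # corrupted reply: hash check failed
        counts[value] = counts.get(value, 0) + 1
        if counts[value] >= r:
            return value
    return None                          # read quorum not reached

good = (b"block-v2", hashlib.sha1(b"block-v2").hexdigest())
bad = (b"block-v2", "0" * 40)            # wrong digest, must be discarded
result = quorum_read([good, bad, good])  # -> b"block-v2"
```

Tuning R and W trades latency against consistency: R = 1 gives fast reads, while R + W > N guarantees every read quorum intersects the last write quorum.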
    25. Cluster Coordination covers: cluster communication, table distribution (the routing table), failure detection, and nodes joining/leaving the cluster.
    26. Cluster Coordination: epidemic (gossip) communication; anybody can infect anyone else with equal probability, and information spreads in O(log n) rounds (http://www.cis.cornell.edu/IAI/events/Gossip_Tutorial.pdf). Periodic anti-entropy exchanges among nodes ensure that they eventually converge, even if updates are lost: arbitrary pairs of replicas periodically establish contact and resolve all differences between their databases. Hashes reduce the volume of data exchanged in the common case.
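The O(log n) spread claim can be illustrated with a tiny push-gossip simulation; this is an illustrative model, not Pisa's actual protocol:

```python
import random

def gossip_rounds(n_nodes, fanout=1, seed=0):
    """Simulate push gossip: each round, every informed node pushes the update
    to `fanout` peers picked uniformly at random. Returns rounds to full spread."""
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < n_nodes:
        for _ in range(len(informed) * fanout):
            informed.add(rng.randrange(n_nodes))
        rounds += 1
    return rounds

rounds = gossip_rounds(1024)  # grows roughly like log2(1024) = 10, not like 1024
```

Since each round at most doubles the informed set, full spread over n nodes needs at least log2(n) rounds, and random collisions only add a small tail on top of that.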
    27. Cluster Coordination: the routing table holds the node list and the partition-to-node list. Bootstrap: a DNS name or IP at startup, a DNS SRV lookup, or multicast. Transfer types: complete transfer; resync by diff (Merkle tree); notification of a single change (join node, leave node, partition owner). (Tables: partition → server, node id → objects, segment hash ranges.)
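Slide 27 mentions resync by diff via a Merkle tree; a flat, one-level version of that idea, comparing per-range summary hashes before transferring anything, could look like this (the store layout and range scheme are assumptions):

```python
import hashlib

def segment_hashes(store, ranges):
    """Summarize a store (block id -> bytes) as one hash per block-id range,
    so two nodes can compare cheap summaries instead of shipping all data."""
    summary = {}
    for lo, hi in ranges:
        h = hashlib.sha1()
        for block_id in range(lo, hi):
            h.update(store.get(block_id, b""))
        summary[(lo, hi)] = h.hexdigest()
    return summary

def ranges_to_resync(local, remote):
    """Only ranges whose summary hashes differ need an actual transfer."""
    return [r for r in local if local[r] != remote.get(r)]

ranges = [(0, 100), (100, 200)]
a = segment_hashes({5: b"x"}, ranges)
b = segment_hashes({5: b"x", 150: b"y"}, ranges)
diff = ranges_to_resync(a, b)  # -> [(100, 200)]
```

A real Merkle tree nests such summaries in levels, so two nodes can drill down only into the subtrees whose hashes disagree.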
    28. Cluster Coordination: join flow. A new node Z bootstraps; node X is notified of the new node; Z claims partition x; the table change is propagated via gossip notification. Node Y then receives a client request for partition x and returns the new owner; the client requests partition x from Z, which returns the data. If the data is not yet present on the new node, the new node acts as a proxy (lazy transfer).
    29. Transport Protocol: ZeroMQ and MessagePack (RPC), used for cluster communications, client data transfer, and partition replication/relocation.
    30. Status: Eeeemmm… not really perfect…
    31. Next (http://www.cs.rutgers.edu/~pxk/417/notes/23-lookup.html): a new data distribution model (Chord, space-based/multi-dimensional, Chord/cluster node); vector clocks; rebalance and partition handover on weight change; locking; asynchronous WAN replication; config replication (pub/sub, event); server priority; …
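Vector clocks, listed here as future work, typically keep a per-node counter with element-wise merge and a partial order; a minimal sketch of that standard technique (not Pisa code, names are illustrative):

```python
def vc_merge(a, b):
    """Merge two vector clocks: element-wise max of per-node counters."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}

def vc_descends(a, b):
    """True if clock a has seen every event in b (a >= b component-wise)."""
    return all(a.get(k, 0) >= v for k, v in b.items())

def vc_concurrent(a, b):
    """Neither clock descends from the other: a conflict the caller must resolve."""
    return not vc_descends(a, b) and not vc_descends(b, a)

x = {"nodeA": 2, "nodeB": 1}
y = {"nodeA": 1, "nodeB": 3}
merged = vc_merge(x, y)  # {"nodeA": 2, "nodeB": 3}
```

Detecting concurrent updates this way is what would let replicas distinguish a stale copy (safe to overwrite) from a true conflict (needs resolution).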
    32. Thank you. http://restfs.beolink.org, manfred.furuholmen@gmail.com, fege85@gmail.com
