About Basho: Basho makes and distributes Riak CS. Built on Riak, Basho's open-source, scalable datastore used by thousands in production, Riak CS is made for companies that need large-file storage that can't go down.
About the speaker: Andy Gross, Basho's Chief Architect, will take you on a tour of Riak CS, explain how and why Basho built it, and describe the architecture that underpins it. He'll also highlight various use cases featuring Fortune 500 companies that rely on Riak CS.
Building Reliable Cloud Storage with Riak and CloudStack - Andy Gross, Chief Architect (Basho)
1. Riak and Riak CS
Andy Gross <@argv0>
Chief Architect, Basho Technologies
Silicon Valley Cloud Computing Group
April 2, 2013
2. Basho
120+ employees, offices in SF, MA, London, Japan
Founded in 2008, open sourced Riak in 2009
Sponsors of the Riak open source database (Apache 2)
Sell Enterprise features (multi-DC replication), support, training.
Riak CS (S3-compat storage) released in March 2012
3. Now Open Source (Apache 2)
Cloud storage software backed by Riak
S3 API
Formerly closed-source
Per-tenant reporting
Pluggable authentication
Detailed stats
DTrace support
5. what is a cloud service?
operationally simple
horizontally scalable
globally distributed
highly available
no SPOFs
fault tolerant
6. you can’t outsource these properties
operationally simple
horizontally scalable
globally distributed
highly available
no SPOFs
fault tolerant
14. Key-Value store (plus extras)
Distributed, horizontally scalable
Eventually consistent
Fault-tolerant
Highly-available
Inspired by Amazon’s Dynamo
16. Distributed & Horizontally Scalable
Default configuration is in a cluster
Load and data are spread evenly via consistent hashing
Scalable: Add more nodes to get more X
17. Fault-Tolerant
Symmetry: All nodes participate equally
Decentralized: no central control, no SPOF
All data is replicated 3x by default
Cluster transparently survives...
node failure
network partitions
Built on Erlang/OTP (designed for FT)
18. Highly-Available
Any node can serve client requests
Fallbacks (sloppy quorums) are used when nodes are down
Always accepts write requests
Accepts read requests as long as R of N nodes are alive
Per-request quorums
19. Inspired by Amazon’s Dynamo
Masterless, peer-coordinated replication
Consistent hashing
Eventually consistent
Quorum reads and writes
Anti-entropy: read repair, hinted handoff
20. [Diagram: five Riak nodes, each paired with a Riak CS instance that exposes an S3 API and a Reporting API]
Large Object
1. user uploads an object
2. Riak CS breaks object into 1 MB chunks
3. Riak CS streams chunks to Riak nodes
4. Riak replicates and stores chunks
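The chunking step above can be sketched in a few lines. This is only an illustration, not Riak CS code: the function names and the "blocks/<id>/<seq>" key layout are assumptions made for the example (the slides only show that block keys start with "blocks/").

```python
from typing import Iterator, List, Tuple

BLOCK_SIZE = 1024 * 1024  # 1 MB, the chunk size named on the slide


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> Iterator[Tuple[int, bytes]]:
    """Yield (sequence_number, chunk) pairs that cover `data` in order."""
    for seq, offset in enumerate(range(0, len(data), block_size)):
        yield seq, data[offset:offset + block_size]


def block_keys(object_id: str, data: bytes, block_size: int = BLOCK_SIZE) -> List[Tuple[str, bytes]]:
    """Pair each chunk with a key. The key layout is hypothetical."""
    return [
        (f"blocks/{object_id}/{seq}", chunk)
        for seq, chunk in split_into_blocks(data, block_size)
    ]
```

Each (key, chunk) pair can then be written to the key-value store independently, which is what lets the chunks be streamed and replicated separately.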
23. Consistent Hashing
Invented by Danny Lewin and others @ MIT/Akamai
Minimizes remapping of keys when number of hash slots changes
Originally applied to CDNs, used in Dynamo for replica placement
Enables incremental scalability, even spread
Minimizes hot spots
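A minimal sketch of the idea, assuming a generic hash ring with virtual nodes rather than Riak's actual ring implementation. The class name, the node names, and the choice of 64 points per node are all illustrative assumptions. Keys are always hashed the same way (SHA-1 here); only ownership of ring positions changes when a node joins.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string onto a large, fixed-size key space (160-bit SHA-1)."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest(), "big")


class HashRing:
    def __init__(self, nodes, vnodes_per_node=64):
        self._points = []  # sorted list of (ring_position, node_name)
        self._vnodes = vnodes_per_node
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        """Place several points for the node on the ring."""
        for i in range(self._vnodes):
            bisect.insort(self._points, (_hash(f"{node}#{i}"), node))

    def preference_list(self, key, n=3):
        """Walk clockwise from the key's position and collect n distinct nodes."""
        start = bisect.bisect_right(self._points, (_hash(key), ""))
        owners = []
        for i in range(len(self._points)):
            node = self._points[(start + i) % len(self._points)][1]
            if node not in owners:
                owners.append(node)
                if len(owners) == n:
                    break
        return owners
```

In this sketch, adding a fifth node to a four-node ring moves only the keys that the new node takes over; every other key keeps its previous owner.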
25. Vector Clocks
Introduced by Mattern et al, in 1988
Extends Lamport’s timestamps (1978)
Each value in Dynamo tagged with vector clock
Allows detection of stale values, logical siblings
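How a vector clock separates stale values from siblings can be shown with a small sketch. It represents a clock as a plain dict from actor name to counter; the function names and actor names are assumptions for the example, not Riak's API.

```python
from typing import Dict

VClock = Dict[str, int]


def increment(clock: VClock, actor: str) -> VClock:
    """Return a copy of the clock with the actor's counter advanced by one."""
    updated = dict(clock)
    updated[actor] = updated.get(actor, 0) + 1
    return updated


def descends(a: VClock, b: VClock) -> bool:
    """True if clock `a` has seen every event recorded in clock `b`."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())


def compare(a: VClock, b: VClock) -> str:
    """Classify `a` relative to `b`."""
    a_covers_b, b_covers_a = descends(a, b), descends(b, a)
    if a_covers_b and b_covers_a:
        return "equal"
    if a_covers_b:
        return "newer"
    if b_covers_a:
        return "stale"
    return "siblings"  # concurrent updates: neither clock descends from the other


def merge(a: VClock, b: VClock) -> VClock:
    """Clock for a value that resolves both inputs (element-wise maximum)."""
    return {actor: max(a.get(actor, 0), b.get(actor, 0)) for actor in a.keys() | b.keys()}
```

Two writes that both start from the same clock but are coordinated by different actors compare as "siblings", which is the case an application or the store has to resolve.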
26. Read Repair
Update stale versions opportunistically on reads (instead of writes)
Pushes system toward consistency, after returning value to client
Reflects focus on a cheap, always-available write path
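A rough sketch of read repair follows. To stay short it uses a single integer version per value instead of a vector clock, and it repairs synchronously, whereas the slide describes repair happening after the value has been returned to the client. The Replica class and function name are assumptions for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Versioned:
    version: int  # stand-in for a vector clock
    value: bytes


class Replica:
    def __init__(self, name: str):
        self.name = name
        self.store: Dict[str, Versioned] = {}

    def get(self, key: str) -> Optional[Versioned]:
        return self.store.get(key)

    def put(self, key: str, item: Versioned) -> None:
        self.store[key] = item


def read_with_repair(key: str, replicas: List[Replica]) -> Optional[Versioned]:
    """Read from every replica, pick the newest value, and fix stale copies."""
    replies = [(replica, replica.get(key)) for replica in replicas]
    found = [item for _, item in replies if item is not None]
    if not found:
        return None
    newest = max(found, key=lambda item: item.version)
    # The answer is already decided here; a real system would reply to the
    # client first and push these repairs in the background.
    for replica, seen in replies:
        if seen is None or seen.version < newest.version:
            replica.put(key, newest)
    return newest
```

The write path stays cheap because nothing here runs on writes: stale replicas are only corrected when someone reads the key.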
27. Hinted Handoff
Any node can accept writes for other nodes if they’re down
All messages include a destination
Data accepted by node other than destination is handed off when node recovers
As long as a single node is alive the cluster can accept a write
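The handoff flow can be sketched with two in-memory nodes. This is a toy model under stated assumptions: the Node class, the write and handoff functions, and the single-fallback setup are invented for illustration and ignore replication and networking.

```python
from typing import Dict


class Node:
    def __init__(self, name: str):
        self.name = name
        self.up = True
        self.data: Dict[str, bytes] = {}              # values this node owns
        self.hints: Dict[str, Dict[str, bytes]] = {}  # destination name -> values held for it


def write(key: str, value: bytes, owner: Node, fallback: Node) -> str:
    """Store on the owner if it is up; otherwise the fallback keeps a hinted copy."""
    if owner.up:
        owner.data[key] = value
        return owner.name
    # The hint records the intended destination alongside the data.
    fallback.hints.setdefault(owner.name, {})[key] = value
    return fallback.name


def handoff(fallback: Node, recovered: Node) -> int:
    """Return hinted data to the recovered node; report how many keys moved."""
    pending = fallback.hints.pop(recovered.name, {})
    recovered.data.update(pending)
    return len(pending)
```

The write succeeds whether or not the owner is reachable, which is the property the slide emphasizes.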
28. Anti-Entropy
Replicas maintain a Merkle Tree of keys and their versions/hashes
Trees periodically exchanged with peer vnodes
Merkle tree enables cheap comparison
Only values with different hashes are exchanged
Pushes system toward consistency
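To show why the comparison is cheap, here is a sketch of a two-level hash tree, the simplest form of a Merkle tree: one root hash over a fixed number of leaf hashes. The segment count, the function names, and the {key: version_hash} replica layout are assumptions for the example, not Riak's on-disk format.

```python
import hashlib
from typing import Dict, List, Set, Tuple

SEGMENTS = 8  # number of leaf buckets; an arbitrary choice for the sketch


def _h(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()


def segment_of(key: str) -> int:
    """Assign a key to a leaf bucket by hashing the key itself."""
    return _h(key.encode("utf-8"))[0] % SEGMENTS


def build_tree(store: Dict[str, str]) -> Tuple[bytes, List[bytes]]:
    """Return (root_hash, leaf_hashes) for a {key: version_hash} replica."""
    buckets: List[List[bytes]] = [[] for _ in range(SEGMENTS)]
    for key in sorted(store):
        buckets[segment_of(key)].append(f"{key}={store[key]}".encode("utf-8"))
    leaves = [_h(b"\n".join(bucket)) for bucket in buckets]
    return _h(b"".join(leaves)), leaves


def keys_to_exchange(local: Dict[str, str], remote: Dict[str, str]) -> Set[str]:
    """Find the keys whose versions differ, looking only inside differing leaves."""
    local_root, local_leaves = build_tree(local)
    remote_root, remote_leaves = build_tree(remote)
    if local_root == remote_root:
        return set()  # a single hash comparison shows the replicas agree
    differing = {i for i in range(SEGMENTS) if local_leaves[i] != remote_leaves[i]}
    candidates = {k for k in local.keys() | remote.keys() if segment_of(k) in differing}
    return {k for k in candidates if local.get(k) != remote.get(k)}
```

When replicas are in sync, only the root hashes need to be compared; when they diverge, only the keys in the differing leaves are inspected.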
29. Gossip Protocol
Decentralized approach to managing global state
Trades off atomicity of state changes for a decentralized approach
Volume of gossip can overwhelm networks without care
30. Hinted Handoff
• Node fails
• Requests go to fallback
• Node comes back
• “Handoff” - data returns to recovered node
• Normal operations resume
[Diagram: hash(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”) mapped onto the ring, with X marks on the ring]
31. Anatomy of a Request
[Diagram: the client sends get(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”) to the Riak cluster, where a Get Handler (FSM) starts on the coordinating node. hash(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”) == 10, 11, 12, so the get is sent to partitions 10, 11, and 12 of the ring (partitions 6 through 16 shown). R=2; the replies carry versions v1 and v2.]
34. riak is a solid foundation for building cloud services
35. Coming Soon:
Riak CS 1.4 (Q2)
Swift API
Keystone Integration
S3 Features
COPY Object
Object Versioning
Riak CS 1.5 (Q3)
Server side encryption
36. Coming Later (2014)
Erasure coding
Reduced redundancy storage
Native indexing/search
37. RICON East - May 13-14, NYC
A distributed systems conference for developers
Speakers from Comcast, State Farm, UC Berkeley, Harvard, and many more
Use discount code SVCloud20 for 20% off tickets
http://ricon.io/east.html
X = throughput, compute power for MapReduce, storage, lower latency
Consistent hashing means: 1) large, fixed-size key-space 2) no rehashing of keys - always hash the same way
1) Client requests a key
2) Get handler starts up to service the request
3) Hashes key to its owner partitions (N=3)
4) Sends similar “get” request to those partitions
5) Waits for R replies that concur (R=2)
6) Resolves the object, replies to client
7) Third reply may come back at any time, but FSM replies as soon as quorum is satisfied/violated
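Steps 5 through 7 can be sketched as a small function that consumes replica replies in arrival order and stops as soon as the quorum is satisfied or can no longer be reached. This is a simplification under stated assumptions: the function name is hypothetical, replies are compared by raw value rather than by vector clock, and None stands for a replica that failed to answer.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def quorum_get(replies: Iterable[Optional[bytes]], n: int = 3, r: int = 2) -> Tuple[str, Optional[bytes]]:
    """Return ("ok", value) once r replies concur, or ("error", None) if they cannot."""
    agree: Counter = Counter()
    seen = 0
    for reply in replies:
        seen += 1
        if reply is not None:
            agree[reply] += 1
            if agree[reply] >= r:
                return "ok", reply  # quorum satisfied; later replies are not awaited
        best = max(agree.values(), default=0)
        if best + (n - seen) < r:
            return "error", None  # quorum violated: r matching replies are impossible
    return "error", None
```

With N=3 and R=2, two matching replies are enough to answer the client, so the third reply is never waited on.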