Riak: A friendly key/value store for the web (presentation transcript)
riak
A friendly key/value store for the web.
DevNation Portland 2010
A primer by Bruce Williams
My name is
Bruce Williams.
I’m addicted to the bleeding edge.
2001 - Present Day
wayyy before it was
a viable job choice.
But I use other
languages, too.
especially from other paradigms.
Photo by oddsteph - http://flic.kr/p/6vWPBU
Let’s assume Java is one of the baseball bats.
Choose the Right Weapon
Based in the D.C. area.
(but I’m not.)
You may find the following
conspicuously missing in
this talk:
Sorry!
I will not be
presenting a paper on
Dynamo, the CAP
theorem, vector
clocks, merkle trees,
etc. These are explained
elsewhere by my
algorithmic betters.
I will not be dwelling
on performance or
redundancy.
Expect some vague
statements like “very
fast” and “very robust.”
I will not try to
convince you that
“NoSQL” is the
messiah.
It’s an alternative that
makes sense in some
situations.
I will not be conducting
a large-scale
comparison of
competing technologies.
but I’d love to hear
about what you use, and
why.
What is Riak?
NoSQL
and of the Dynamo
persuasion.
Open Source
& a commercial
“EnterpriseDS”
version with some
proprietary pieces
Key/Value Store
With some metadata.
Schema-less
Great for sparse data,
but requires more
discipline.
Datatype Agnostic
Content-Type is King.
Language Agnostic
REST & PBC
Erlang, Javascript, Java,
PHP, Python, Ruby, ...
Distributed
It’s [mostly] Erlang, what
did you expect?
Masterless
All nodes are equal
Scalable
or “easy to scale.”
Eventually
Consistent
and CAP tunable.
Uses Map/Reduce
and “Link.”
Getting
Up & Running
http://riak.basho.com
hg & git
A Quick Local Cluster
$ ./riak1/bin/riak start
$ ./riak2/bin/riak start
$ ./riak3/bin/riak start
Start three
“nodes”
$ ./riak2/bin/riak-admin join riak1@127.0.0.1
$ ./riak3/bin/riak-admin join riak1@127.0.0.1
Join them into
a cluster
Your Data
Object
Content Type
Body
+ Links
The thing you’re storing.
Key
can be user-defined or automatically generated
pic1
The identifier for the object.
Bucket
“pic1” is unique within “images”
pic1
pic2 pic3
images
The type or category of object.
Addressability
<images/pic1>
images
pic1
Refer to objects by bucket and key.
Example
require 'riak'
client = Riak::Client.new
client.bucket('images').new('pic1').tap do |pic1|
  pic1.content_type = 'image/jpeg'
  pic1.data = File.read('/path/to/jpg')
  pic1.store
end
$ gem install riak-client
Example
client.bucket('people').new('bruce').tap do |bruce|
  bruce.data = {
    name: 'Bruce Williams',
    email: 'bruce@codefluency.com'
  }
  bruce.store
end
puts client['people']['bruce'].data['name']
“application/json” is the
default for riak-client
Links
stored here
images people
pic1 bruce
can also be
“tagged”
Connect objects
Example
client['people']['bruce'].tap do |bruce|
  bruce.links << client['images']['pic1'].to_link('avatar')
  bruce.store
end
client['people']['bruce'].walk(:tag => 'avatar')
Hooks
pre-commit
reject or transform an object to be committed
post-commit
notify external services, build your own indexes
Where does it go?
The Ring
A 160-bit integer space
The Ring
broken into equal sized partitions.
The Ring
It looks kinda like this
(it’s just more functional)
Photo by marchdoe - http://flickr.com/photos/marchdoe/457741149
The Ring
Each partition is managed
by a vnode (virtual node),
The Ring
Each vnode runs on
a [physical] node.
The Ring
1 2
3 4
Each node owns an equal share of
vnodes (& partitions)
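A rough sketch of the ring idea in plain Ruby, not Riak's actual code. It assumes SHA-1 over a "bucket/key" string as the hash input, 64 partitions, and four made-up node names; all three are illustration-only assumptions.

```ruby
require 'digest/sha1'

RING_SIZE = 2**160                    # the 160-bit integer space
NUM_PARTITIONS = 64                   # assumed partition count for this sketch
NODES = %w[riak1 riak2 riak3 riak4]   # hypothetical node names

# Hash a bucket/key pair to a point on the ring (assumed input format).
def ring_position(bucket, key)
  Digest::SHA1.hexdigest("#{bucket}/#{key}").to_i(16)
end

# Index of the equal-sized partition that contains that point.
def partition_for(bucket, key)
  ring_position(bucket, key) / (RING_SIZE / NUM_PARTITIONS)
end

# Deal partitions (and their vnodes) out round-robin,
# so each node owns an equal share.
def node_for(partition)
  NODES[partition % NODES.size]
end

partition = partition_for('images', 'pic1')
puts "images/pic1 -> partition #{partition} on #{node_for(partition)}"
```

The same bucket and key always land on the same partition, which is why no master is needed to find an object.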
Replication
3 is the default
n_val = 3
Objects are written to multiple
partitions.
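One simplified way to picture n_val: start at the partition the key hashes to and take the next partitions around the ring. The method name `preference_list` and the partition count of 64 are assumptions for this sketch, not Riak's API.

```ruby
# Partitions that would hold the replicas of one object, assuming
# replicas go to consecutive partitions and the ring wraps around.
def preference_list(first_partition, n_val: 3, partitions: 64)
  (0...n_val).map { |i| (first_partition + i) % partitions }
end

p preference_list(10)   # => [10, 11, 12]
p preference_list(63)   # => [63, 0, 1]
```

Because neighboring partitions are owned by different nodes, the three copies end up on different machines when the cluster is large enough.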
Availability
When node “2” fails, the others pick up the slack.
Uses Hinted Handoff to deal with node failures.
Persistence
dets ets fs
gb_trees innostore
bitcask multi +
Supports pluggable backends
CAP Tuning
GET
r
how many replicas need to agree (default: 2)
PUT
r
how many replicas need to agree when retrieving an
existing object before the write (default: 2)
w
how many replicas to write to before returning a
successful response (default: 2).
dw
how many replicas to commit to durable storage
before returning a successful response (default: 0)
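A toy model of the r and w knobs, using in-memory stand-ins instead of real replicas. `Replica`, `quorum_get`, and `quorum_put` are hypothetical names for this sketch; real Riak also uses vector clocks to reconcile versions, which is left out here.

```ruby
# Hypothetical stand-in for one replica of a single object.
Replica = Struct.new(:name, :value, :up)

# Succeeds once at least r reachable replicas agree on a value.
def quorum_get(replicas, r: 2)
  answers = replicas.select(&:up).map(&:value)
  counts = answers.group_by { |v| v }.map { |v, same| [v, same.size] }
  value, votes = counts.max_by { |_, count| count }
  raise "quorum not met (r=#{r})" if votes.nil? || votes < r
  value
end

# Succeeds once at least w reachable replicas accept the write.
def quorum_put(replicas, new_value, w: 2)
  acks = replicas.select(&:up).each { |rep| rep.value = new_value }.size
  raise "quorum not met (w=#{w})" if acks < w
  acks
end

replicas = [
  Replica.new('a', 'v1', true),
  Replica.new('b', 'v1', true),
  Replica.new('c', 'v0', false)   # this replica is down
]
puts quorum_get(replicas, r: 2)   # two live replicas agree, so this succeeds
```

With one replica down, r: 2 still succeeds while r: 3 raises. That is the trade-off being tuned: lower values favor availability, higher values favor consistency.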
(Map|Link)*Reduce
Map
obj -> your function -> [result, ...]
Map functions take one piece of data
as input, and produce zero or more
results as output.
Data-locality is important in Riak.
Map phases are run where the data is
stored.
You can have multiple map phases.
The input to a map definition is a
series of [bucket, key] names.
unlike CouchDB
Link
obj -> link walk, using a pattern -> [linked_obj, ...]
A special kind of map phase; links
matching a pattern are “walked” to
find objects to be output.
Reduce
[obj, ...] -> your function -> [result]
Reduce functions combine the output
of many "map" step evaluations into
one result.
The reduce phase occurs on the
“coordinating node.”
Reduces may be run multiple times
as more input comes in (e.g., re-reduce)
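The shape of map and reduce can be tried out in plain Ruby without a cluster. This sketch uses a hypothetical in-memory `STORE` with made-up ages, and deliberately reduces in batches to show why a reduce function must accept its own output.

```ruby
# Hypothetical stored objects, keyed by [bucket, key]; the ages are made up.
STORE = {
  %w[people a] => { 'age' => 30 },
  %w[people b] => { 'age' => 12 },
  %w[people c] => { 'age' => 8 },
  %w[people d] => {}                # no age recorded
}.freeze

# Map: one object in, zero or more results out.
map_age = ->(obj) { obj.key?('age') ? [obj['age']] : [] }

# Reduce: a list in, a list out, so it can be re-run on its own output.
reduce_sum = ->(values) { [values.sum] }

def map_reduce(store, inputs, map, reduce)
  mapped = inputs.flat_map { |bucket_key| map.call(store.fetch(bucket_key)) }
  # Reduce in batches of two, then re-reduce the partial sums.
  partials = mapped.each_slice(2).flat_map { |batch| reduce.call(batch) }
  reduce.call(partials)
end

p map_reduce(STORE, STORE.keys, map_age, reduce_sum)   # => [50]
```

The inputs are [bucket, key] pairs, the map step drops the object without an age, and the final answer is the same however the batches are split.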
Example
bruce = client['people']['bruce']
melissa = client['people']['melissa']
# let's assume these have ages
addy = client['addresses'].new('123fake')
addy.data = {
  street: '123 Fake St',
  city: 'Portland', state: 'OR', zip: '97214'
}
addy.links << bruce.to_link('resident')
addy.links << melissa.to_link('resident')
addy.store
Example
Riak::MapReduce.new(client).add(addy).
  link(tag: 'resident').
  map("function (v) { return [Riak.mapValuesJson(v)[0]['age'] || 0] }").
  reduce(function: 'Riak.reduceSum', keep: true).
  run
We should get an array with one value
Hurdles
No range queries.
Sorry, Cassandra fans
Things like time
series data require
creative approaches.
like bucket and key naming, etc
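One such approach, sketched in plain Ruby: put the time slot in the key, so the keys for a time range can be computed instead of queried. The "sensor:hour" key format and both method names are assumptions for illustration.

```ruby
# Hypothetical naming scheme: one key per sensor per hour (UTC).
def series_key(sensor, time)
  "#{sensor}:#{time.getutc.strftime('%Y-%m-%dT%H')}"
end

# Every key covering [from, to], computed rather than listed.
def keys_between(sensor, from, to)
  keys = []
  t = Time.at(from.to_i - from.to_i % 3600).utc   # round down to the hour
  while t <= to
    keys << series_key(sensor, t)
    t += 3600
  end
  keys
end

from = Time.utc(2010, 6, 1, 22, 15)
to   = Time.utc(2010, 6, 2, 0, 5)
p keys_between('sensor1', from, to)
# => ["sensor1:2010-06-01T22", "sensor1:2010-06-01T23", "sensor1:2010-06-02T00"]
```

Each computed key can then be fetched directly by bucket and key, which stands in for the missing range query.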
Don’t list keys.
ever, if you can avoid it.
Processing an entire
bucket is more expensive
than you might think.
because it lists keys
Watch your encoding.
MapReduce Javascript
phases need your data
to be in valid Unicode.
you’ll get a “bad encoding” error
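A possible guard on the client side, sketched in plain Ruby: check whether the bytes are valid UTF-8 before storing, and Base64-encode them if not. The `json_safe` name and the envelope layout are assumptions, not part of riak-client.

```ruby
require 'json'

# Wraps a payload in a JSON document that is always valid Unicode:
# valid UTF-8 text is kept as-is, anything else is Base64-encoded.
def json_safe(raw)
  text = raw.dup.force_encoding(Encoding::UTF_8)
  if text.valid_encoding?
    JSON.generate('encoding' => 'utf-8', 'body' => text)
  else
    # pack('m0') is Base64 without line breaks
    JSON.generate('encoding' => 'base64', 'body' => [raw].pack('m0'))
  end
end

puts json_safe('hello')
puts json_safe("\xFF\xD8\xFF".b)   # first bytes of a JPEG, not valid UTF-8
```

A map function would then need to check the "encoding" field before using the body.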