Abstract:
Cassandra is a new kind of database: it is more than a single-machine system. It naturally runs in a High-Availability configuration. All nodes in the system are symmetric; there is no single point of failure. As you add machines, failure becomes routine, and Cassandra is built to tolerate that with no interruptions.
Cassandra is linearly scalable with good performance characteristics for very small and very large data stores. Unlike earlier efforts, Cassandra is more than just a key-value store; it is a structured data store which can facilitate complex use cases and queries. Cassandra allows for random access to your data organized into rows and columns.
Cassandra is different, and exciting. This presentation will discuss the pros and cons of using Cassandra, and why it has seen such amazing adoption in the past year.
Bio:
Ben Coverston is Director of Operations at DataStax (formerly known as Riptano), a provider of software, support, services, training, resources and help for Cassandra. He has been involved in enterprise software his entire career. Working in the airline industry, he helped to build some of the highest-volume online booking sites in the world. He saw firsthand the consequences of trying to solve real-world scalability problems at the limit of what traditional relational databases are capable of.
- Understanding Time Series
- What's the Fundamental Problem?
- Prometheus Solution (v1.x)
- New Design of Prometheus (v2.x)
- Data Compression Algorithm
Storing time series data with Apache Cassandra - Patrick McFadin
If you are looking to collect and store time series data, it's probably not going to be small. Don't get caught without a plan! Apache Cassandra has proven itself as a solid choice, and now you can learn how to do it. We'll look at possible data models and the choices you have to be successful. Then, let's open the hood and learn about how data is stored in Apache Cassandra. You don't need to be an expert in distributed systems to make this work, and I'll show you how. I'll give you real-world examples and work through the steps. Give me an hour and I will upgrade your time series game.
A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise - Patrick McFadin
Wait! Back away from the Cassandra 2ndary index. It's OK for some use cases, but it's not an easy button. "But I need to search through a bunch of columns to look for the data and I want to do some regression analysis… and I can't model that in C*, even after watching all of Patrick McFadin's videos. What do I do?" The answer, dear developer, is in DSE Search and Analytics. With its easy Solr API and Spark integration, you can search and analyze data stored in your Cassandra database to your heart's content. Take our hand. We will show you how.
Further discussion on Data Modeling with Apache Cassandra. Overview of formal data modeling techniques as well as practical ones. Real-world use cases and associated data models.
Lucene 4.0 is on its way to deliver a tremendous amount of new features and improvements. Besides Real-Time Search and Flexible Indexing, DocValues, a.k.a. Column Stride Fields, is one of the "next generation" features.
C* Summit 2013: Cassandra at Instagram by Rick Branson - DataStax Academy
Speaker: Rick Branson, Infrastructure Engineer at Instagram
Cassandra is a critical part of Instagram's large scale site infrastructure that supports more than 100 million active users. This talk is a practical deep dive into data models, systems architecture, and challenges encountered during the implementation process.
Beyond the Query: A Cassandra + Solr + Spark Love Triangle Using Datastax Ent... - DataStax Academy
Wait! Back away from the Cassandra 2ndary index. It's OK for some use cases, but it's not an easy button. "But I need to search through a bunch of columns to look for the data and I want to do some regression analysis… and I can't model that in C*, even after watching all of Patrick McFadin's videos. What do I do?" The answer, dear developer, is in DSE Search and Analytics. With its easy Solr API and Spark integration, you can search and analyze data stored in your Cassandra database to your heart's content. Take our hand. We will show you how.
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ... - StampedeCon
Learn how to model beyond traditional direct access in Apache Cassandra. Utilizing the DataStax platform to harness the power of Spark and Solr to perform search, analytics, and complex operations in place on your Cassandra data!
Managing large volumes of data isn't trivial and needs a plan. Fast Data is how we describe the nature of data in a heavily consumer-driven world. Fast in. Fast out. Is your data infrastructure ready? You will learn some important reference architectures for large-scale data problems. Three main areas are covered:
Organize - Manage the incoming data stream and ensure it is processed correctly and on time. No data left behind.
Process - Analyze volumes of data you receive in near real-time or in a batch. Be ready for fast serving in your application.
Store - Reliably store data in the data models to support your application. Never accept downtime or slow response times.
Cassandra Data Modeling - Practical Considerations @ Netflix - nkorla1share
The Cassandra community has consistently requested that we cover C* schema design concepts. This presentation goes in depth on the following topics:
- Schema design
- Best Practices
- Capacity Planning
- Real World Examples
Cassandra Community Webinar: Back to Basics with CQL3 - DataStax
Cassandra is a distributed, massively scalable, fault tolerant, columnar data store, and if you need the ability to make fast writes, the only thing faster than Cassandra is /dev/null! In this fast-paced presentation, we'll briefly describe big data, and the area of big data that Cassandra is designed to fill. We will cover Cassandra's unique, every-node-the-same architecture. We will reveal Cassandra's internal data structure and explain just why Cassandra is so darned fast. Finally, we'll wrap up with a discussion of data modeling using the new standard protocol: CQL (Cassandra Query Language).
A lot has changed since I gave one of these talks and man, has it been good. 2.0 brought us a lot of new CQL features and now with 2.1 we get even more! Let me show you some real life data models and those new features taking developer productivity to an all new high. User Defined Types, New Counters, Paging, Static Columns. Exciting new ways of making your app truly killer!
Functional data models are great, but how can you squeeze out more performance and make them awesome? Let's talk through some example models, go through the tuning steps and understand the tradeoffs. Many times, just a simple understanding of the underlying internals can make all the difference. I've helped some of the biggest companies in the world do this and I can help you. Do you feel the need for Cassandra 2.0 speed?
Presentation by Marco Slaviero at BlackHat USA in 2010.
This presentation is about mining information from memcached. The presentation begins with a brief introduction to memcached. go-derper.rb, a tool developed by the presenter for hacking memcached servers, is introduced and a few memcached mining examples are given. The presentation ends with a brief discussion of serialized objects exposed in the cache.
Cassandra Summit 2014: Cassandra at Instagram 2014 - DataStax Academy
Presenter: Rick Branson, Infrastructure Engineer at Instagram
As Instagram has scaled to over 200 million users, so has our use of Cassandra. We've built new features and rebuilt old on Cassandra, and it's become an extremely mission-critical foundation of our production infrastructure. Rick will deliver a refresh of our use cases and go deep on the technical challenges we faced during our expansion.
Cassandra Community Webinar | In Case of Emergency Break Glass - DataStax
The design of Apache Cassandra allows applications to provide constant uptime. Peer-to-Peer technology ensures there are no single points of failure, and the Consistency guarantees allow applications to function correctly while some nodes are down. There is also a wealth of information provided by the JMX API and the system log. All of this means that when things go wrong you have the time, information and platform to resolve them without downtime. This presentation will cover some of the common, and not so common, performance issues, failures and management tasks observed in running clusters. Aaron will discuss how to gather information and how to act on it. Operators, Developers and Managers will all benefit from this exposition of Cassandra in the wild.
Advanced Percona XtraDB Cluster in a nutshell... la suite (PLSC 2016) - Frederic Descamps
This is a tutorial I gave with my colleague Kenny Gryp at Percona Live 2016 in Santa Clara.
Percona XtraDB Cluster is a high availability and high scalability solution for MySQL clustering. Percona XtraDB Cluster integrates Percona Server with the Galera synchronous replication library in a single product package, which enables you to create a cost-effective MySQL cluster.
For three years at Percona Live, we've introduced people to this technology... but what's next? This tutorial continues your education, and targets users that already have experience with Percona XtraDB Cluster and want to go further.
This tutorial will cover the following topics:
- Bootstrapping in detail
- Certification errors: understanding and preventing them
- Replication failures: how to deal with them
- Secrets of Galera Cache
- Mastering flow control
- Understanding and verifying replication throughput
- How to use WAN replication
- Implications of consistent reads
- Backups
- Load balancers and proxy protocol
Renegotiating the boundary between database latency and consistency - ScyllaDB
With the increasing complexity of modern distributed systems, concerns around latency, availability, and consistency have become almost 'universal'. In response, a new generation of distributed databases is taking over: databases capable of harnessing the power and capabilities of the multi-cloud ecosystem. This new generation of distributed databases is challenging many of the traditional tradeoffs between relational and non-relational models.
This webinar will explore the technologies and trends behind this new generation of distributed databases, then take a technical deep dive into one example: the open source non-relational database ScyllaDB. ScyllaDB was built specifically for extreme low latencies, but has recently increased consistency by implementing the Raft consensus protocol. Engineers will share how they are implementing a low-latency architecture, and how strongly consistent topology and schema changes enable highly reliable and safe systems, without sacrificing low-latency characteristics.
Slides for the talk "Cassandra and Spark: Love at First Sight" given at Texas Linux Fest 2015. Gives an introduction to both Cassandra and Spark and how they work together.
Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv... - MongoDB
This will cover what to consider for high write throughput performance from hardware configuration through to the use of replica sets, multi-data centre deployments, monitoring and sharding to ensure your database is fast and stays online.
MongoDB: Optimising for Performance, Scale & Analytics - Server Density
MongoDB is easy to download and run locally but requires some thought and further understanding when deploying to production. At scale, schema design, indexes and query patterns really matter. So does data structure on disk, sharding, replication and data centre awareness. This talk will examine these factors in the context of analytics, and more generally, to help you optimise MongoDB for any scale.
Presented at MongoDB Days London 2013 by David Mytton.
Distributed Database Consistency: Architectural Considerations and Tradeoffs - ScyllaDB
With the increasing complexity of modern distributed systems, concerns around latency, availability, and consistency have come to the forefront. In response, a new generation of distributed databases is taking over: databases capable of harnessing the power and capabilities of the multi-cloud ecosystem. This new generation of distributed databases is challenging many of the traditional tradeoffs between relational and non-relational models.
This webinar will explore the technologies and trends behind this new generation of distributed databases, then take a technical deep dive into one example: ScyllaDB. ScyllaDB was built specifically for extreme low latencies, but has recently increased consistency by implementing the Raft consensus protocol. Engineers will share how they are implementing a low-latency architecture, and how strongly consistent topology and schema changes enable highly reliable and safe systems, without sacrificing low-latency characteristics.
Presentation at the December meetup of the Silicon Valley Cassandra users group. Summarizes how the NASA supercomputer center at Ames is currently using a Cassandra cluster.
These are the slides from my talk at Hulu in March 2015 discussing Apache Spark & Cassandra. I cover the evolution of data from a single machine to RDBMS (MySQL is the primary example) to big data systems.
On the Spark side, I covered batch jobs, streaming, Apache Kafka, an introduction to machine learning, clustering, logistic regression and recommendations systems (collaborative filtering).
The talk was recorded and is available on youtube: https://www.youtube.com/watch?v=_gFgU3phogQ
Persistent Data Structures - partial::Conf - Ivan Vergiliev
The slides from my talk on Persistent Data Structures at http://partialconf.com/ . The "Implementation" part assumes a bit of prior knowledge on how persistent data structures work, but the rest should be generally accessible.
ScyllaDB V Developer Deep Dive Series: Resiliency and Strong Consistency via ... - ScyllaDB
ScyllaDB’s implementation of the Raft consensus protocol translates to strong, immediately consistent schema updates, topology changes, tables and indexes, and more. This eliminates schema and data conflicts, enables rapid and safe increases in cluster capacity, and provides a leap forward in manageability. Join this webinar to learn how the Raft consensus algorithm has been implemented, what you can do with it today, and what radical new capabilities it will enable in the days ahead.
1. Ben Coverston, Director of Operations, ben.coverston@datastax.com
Hosted By: Matthew O'Keefe, MorningStar
2. History
• Open Sourced by FB in July 2008
• Apache Incubator March 2009
• Graduated March 2010
• Riptano Founded April 2010
• First Summit August 2010
• Riptano Changed to DataStax January 2011
3. You Changed Your Name? Why!?
• Suits
– Marketing
– Relevancy
– Riptano too "Skateboard"
• The Real Reason?
– "The X makes it sound cool." – Bender Bending Rodriguez, Futurama
4. Strengths
• Scalable
• Reliable
– Replication that works
– Multi-DC Support
– No Single Point of Failure
• Analytics in the same system as OLTP (with "integrated" Hadoop support)
5. Weaknesses
• No ACID Transactions
• Limited Support for (OLTP) ad-hoc queries
• ..but you lost that when you started to shard your relational system.
6. A Short History of Big Data (Or Why Cassandra)
• Relational databases scale poorly
• B-trees are slow
– ..and require read before write.
– ..hope your dataset fits in memory
14. What do we end up with? ("The eBay Architecture," Randy Shoup and Dan Pritchett)
15.
16. BASE
• BASE is diametrically opposed to ACID. Where ACID is pessimistic and forces consistency at the end of every operation, BASE is optimistic and accepts that the database consistency will be in a state of flux. Although this sounds impossible to cope with, in reality it is quite manageable and leads to levels of scalability that cannot be obtained with ACID.
– Dan Pritchett – NoSQL Pioneer, eBay Engineer
http://queue.acm.org/detail.cfm?id=1394128
17. Myth
• Lack of ACID means that I have to give up transactional guarantees and consistency.
• Paraphrasing: At Netflix we tend to be optimistic. When things don't quite work out we try again. – Siddharth Anand
• Achievable
18. Cassandra In Production
• Netflix: Streaming Bookmarks
• Digital Reasoning: NLP & Entity Analytics
• OpenX: largest publisher-side ad network
• Cloudkick: performance data & aggregation
• SimpleGeo: location-as-API
• Ooyala: video analytics and business intelligence
• ngmoco: massively multiplayer online game worlds
• Kosmix: social media aggregation
• Reddit: vote tracking system
• Twitter: Rainbird, geo data, analytics
• … lots more
19. Who is investing in Cassandra?
• DataStax
• Twitter:
– We're investing in Cassandra every day. It'll be with us for a long time and our usage of it will only grow.
• Rackspace
• > 100 different individuals have submitted patches to C*
• You?
20. Durability
• Write to Commit Log
– fsync is cheap (append only)
– Latency is only subject to rotational latency
• Separate partition (no seeking)
• SSD won't hurt, but it may not help either.
• Write to memtable
• Flush memtable to SSTable
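The write path on this slide can be sketched as a toy simulation. This is purely illustrative Python (the class name and memtable limit are invented, not Cassandra code): append to the commit log first, update the in-memory memtable, and flush full memtables to immutable, sorted SSTables.

```python
# Toy sketch of the write path (illustration only, not the real
# implementation): commit log first, then memtable, then flush.

class ToyStore:
    def __init__(self, memtable_limit=3):
        self.commit_log = []      # append-only: durable before acking a write
        self.memtable = {}        # in-memory, mutable
        self.sstables = []        # immutable, sorted by key once written
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # a real system fsyncs here
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # SSTables are written once, in key order, never updated in place
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # the newest SSTable holds the most recent flushed value
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None
```

The append-only commit log is why fsync stays cheap: writes never seek, they only extend the file.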
29. Replication
• Simple Replication Strategy
• Network Topology Strategy
– How many replicas in each datacenter for each keyspace?
– Generalization of Rack Aware Strategy
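A rough sketch of how the simple (rack-unaware) strategy places replicas. The helper names here are invented for illustration, and Cassandra's real partitioners and strategy classes differ in detail: hash the key onto a token ring and take the next `rf` nodes clockwise.

```python
# Toy replica placement for a simple replication strategy (illustration
# only). Each node owns one token on a ring; a key hashes to a token and
# its replicas are the next rf nodes clockwise (assumes rf <= node count).
import hashlib

RING_SIZE = 2 ** 32

def key_token(key):
    # stable hash of the key onto the ring
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % RING_SIZE

def replicas(key, node_tokens, rf):
    """node_tokens maps node name -> token; returns the rf replica nodes."""
    ring = sorted(node_tokens.items(), key=lambda kv: kv[1])
    t = key_token(key)
    # first node whose token is >= the key's token, wrapping around to 0
    start = next((i for i, (_, tok) in enumerate(ring) if tok >= t), 0)
    return [ring[(start + i) % len(ring)][0] for i in range(rf)]
```

Network Topology Strategy generalizes this by running a walk like the one above per datacenter, so each DC gets its own configured replica count.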
33. Reliability
• No Single Points of Failure
• Multiple Datacenters
• Monitorable
– JMX (or whatever plugs into it – lots of counters)
– Cacti
– Munin
– Nagios
34. Expectation of Failure
• C* is designed to fail
• No "Clean Shutdown"
• kill -9, it's ok.
56. I can has smarter clients?
• Don't use thrift directly
• Higher level clients have a lot of features you want
– Knowledge about data types
– Connection pooling
– Automatic retries
– Logging
58. Raw thrift API: Inserting

import time
from cassandra.ttypes import (Column, ColumnOrSuperColumn,
                              Mutation, ConsistencyLevel)

data = {'id': useruuid, ...}
columns = [Column(k, v, time.time())
           for (k, v) in data.items()]
mutations = [Mutation(ColumnOrSuperColumn(column=c))
             for c in columns]
rows = {useruuid: {'User': mutations}}
client.batch_mutate('Twissandra', rows,
                    ConsistencyLevel.ONE)
61. Language support
• Python
– pycassa
– telephus
• Ruby
– Speed is a negative
• Java
– Hector
• PHP (soon with less suckage!)
62. Done yet?
• Still doing 1+N queries per page
• Solution: Supercolumns
• Err.. Well maybe…
63. Supercolumns: limitations
• Requires reading an entire SC (not the entire row) from disk even if you just want one subcolumn
• No Secondary Indexes
• It's just an extra map layer.
• Probably best to avoid them if you can.
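The "extra map layer" point is easy to see in plain Python: a supercolumn family is essentially one more level of nesting, and reading any subcolumn means materializing its whole supercolumn (the row key and names below are made up for illustration).

```python
# A supercolumn family is one extra map layer:
# row key -> supercolumn name -> subcolumn name -> value.
row = {
    'user123': {                    # row key
        'followers': {              # supercolumn
            'alice': '2010-04-01',  # subcolumn -> value
            'bob': '2010-05-12',
        },
    },
}

# The limitation: fetching even one subcolumn deserializes the whole
# supercolumn, e.g. all of 'followers' just to read 'alice'.
followers = row['user123']['followers']
alice = followers['alice']
```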
64. UUIDs
• Column names should be uuids, not longs, to avoid collisions
• Version 1 UUIDs can be sorted by time ("TimeUUID")
• Any UUID can be sorted by its raw bytes ("LexicalUUID")
– Usually Version 4
– Slightly less overhead
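The two orderings can be demonstrated with Python's standard uuid module (illustrative only):

```python
# Version 1 UUIDs embed a 60-bit, 100-nanosecond timestamp, so they can be
# ordered by creation time ("TimeUUID"); any UUID can also be ordered by
# its raw bytes ("LexicalUUID").
import uuid

ids = [uuid.uuid1() for _ in range(5)]         # version 1: time-based

by_time = sorted(ids, key=lambda u: u.time)    # u.time: embedded timestamp
by_bytes = sorted(ids, key=lambda u: u.bytes)  # byte order, unrelated to time
```

Byte order is what a comparator sees for version 4 UUIDs, which is why they sort stably but carry no useful time ordering.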
66. Lucandra
• What documents contain term X?
• … and term Y?
• … or start with Z?
67. FAQ: counting
• UUIDs + batch process
• Mutex (contrib/mutex or "cages")
• Use redis or mysql or memcached
• column-per-app-server
• counter API (after .7 is out)
68. Tips
• Insert instead of check-then-insert
• Use client-side clock to your advantage
• use TTL
• Wider rows (but not too wide)
• Start with queries, work backwards
• Avoid storing extra "timestamp" columns
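Two of these tips, insert-instead-of-check-then-insert and using the client-side clock, can be sketched with a toy last-write-wins store. The names are invented for illustration and the TTL handling is heavily simplified:

```python
# Toy last-write-wins column store (illustration only): writes carry a
# client-supplied timestamp and the newest timestamp wins, so a plain
# insert replaces check-then-insert; a TTL expires columns automatically.
import time

class ToyColumnFamily:
    def __init__(self):
        self.cols = {}  # name -> (value, write_timestamp, expires_at or None)

    def insert(self, name, value, timestamp=None, ttl=None):
        ts = timestamp if timestamp is not None else time.time()
        current = self.cols.get(name)
        if current is None or ts >= current[1]:   # last write wins
            expires = (time.time() + ttl) if ttl is not None else None
            self.cols[name] = (value, ts, expires)

    def get(self, name):
        entry = self.cols.get(name)
        if entry is None:
            return None
        value, _, expires = entry
        if expires is not None and time.time() >= expires:
            del self.cols[name]   # expired: behaves as if never written
            return None
        return value
```

Because the highest client timestamp wins regardless of arrival order, there is no need to read before writing, which is exactly why insert beats check-then-insert.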