Modern Apache Cassandra
Jonathan Ellis
Project Chair, Apache Cassandra
CTO, DataStax
©2013 DataStax Confidential. Do not d...
Five years of Cassandra

0.1
Jul-08

...

0.3
Jul-09

0.6
May-10

0.7
Feb-11

1.0
Dec-11

DSE

1.2
Oct-12

2.0
Jul-13
Application/Use Case
• Social Signals: like/want/own
features for eBay product and item
pages
• Hunch taste graph for eBay...
Time series data
Multi-datacenter support
Distributed counters
Hadoop support
Application/Use Case
• Adobe AudienceManager: web
analytics, content management,
and online advertising
Why Cassandra?
• L...
Bootstrapping
Tuneable consistency
•(We’ll come back to this)
Application/Use Case
• Logging
• Notifications
Why Cassandra?
• Efficient writes
• Durable
• Scalable
• High availability
...
Durable + efficient writes
write(k1, c1:v1)
Memory: Memtable
Hard drive: Commit log

write(k1, c1:v)
Memory (Memtable): k1 c1:v
Hard drive (Commit log): k1 c1:v

write(k1, c2:v)
Memory (Memtable): k1 c1:v c2:v
Hard drive (Commit log): k1 c1:v; k1 c2:v

write(k2, c1:v c2:v)
Memory (Memtable): k1 c1:v c2:v; k2 c1:v c2:v
Hard drive (Commit log): k1 c1:v; k1 c2:v; k2 c1:v c2:v

write(k1, c1:v c3:v)
Memory (Memtable): k1 c1:v c2:v c3:v; k2 c1:v c2:v
Hard drive (Commit log): k1 c1:v; k1 c2:v; k2 c1:v c2:v; k1 c1:v c3:v

flush, index / BF, cleanup
Hard drive (SSTable): k1 c1:v c2:v c3:v; k2 c1:v c2:v
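The write sequence on these slides can be sketched as a toy model. This is only an illustration under stated assumptions: `ToyStore` and its method names are hypothetical, Python lists and dicts stand in for the on-disk commit log and SSTables, and "cleanup" is modelled as simply discarding the flushed log entries.

```python
class ToyStore:
    """Toy model of the write path: commit log append, memtable update, flush."""

    def __init__(self):
        self.commit_log = []   # stand-in for the append-only log on the hard drive
        self.memtable = {}     # in memory: row key -> {column: value}
        self.sstables = []     # stand-in for immutable SSTables on the hard drive

    def write(self, key, columns):
        # Append to the commit log so the write is durable.
        self.commit_log.append((key, dict(columns)))
        # Merge the new columns into the in-memory row.
        self.memtable.setdefault(key, {}).update(columns)

    def flush(self):
        # Write the memtable out as one sorted, immutable SSTable.
        sstable = tuple(
            (key, tuple(sorted(cols.items())))
            for key, cols in sorted(self.memtable.items())
        )
        self.sstables.append(sstable)
        # Cleanup: flushed data no longer needs the memtable or its log entries.
        self.memtable = {}
        self.commit_log = []


# The same four writes as on the slides.
store = ToyStore()
store.write("k1", {"c1": "v"})
store.write("k1", {"c2": "v"})
store.write("k2", {"c1": "v", "c2": "v"})
store.write("k1", {"c1": "v", "c3": "v"})
```

After the four writes the commit log holds four entries while the memtable holds two merged rows; calling `store.flush()` moves those two rows into a single SSTable.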
High availability
•99.9999% availability on Cassandra
•(We’ll come back to this, too)
Core values
•Massive scalability
•High performance
•Ease of use
•Reliability/Availability

VLDB benchmark (RWS)
[Chart: throughput (ops/sec, axis 0 to 80000) vs. number of nodes for Cassandra, MySQL, HBase, and Redis]
Endpoint benchmark (RW)
[Chart: throughput (ops/sec, axis 0 to 35000) vs. number of nodes for Cassandra, HBase, and MongoDB]
Ease of use
CREATE TABLE users (
id uuid PRIMARY KEY,
name text,
state text,
birth_date int
);
CREATE INDEX ON
users(state...
Classic partitioning (SPOF)
partition 1 partition 2 partition 3 partition 4

router
client
(Not a theoretical problem)

https://speakerdeck.com/mitsuhiko/a-year-of-mongodb
http://aphyr.com/posts/288-the-network-is...
Fully distributed, no SPOF
Client

p3

p6

p1

p1

p1
Partitioning
Primary key determines placement*

jim

age: 36

car: camaro gender: M

carol

age: 37

car: subaru gender: F...
PK        Murmur Hash
jim       5e02739678...
carol     a9a0198010...
johnny    f4eb27cea7...
suzy      78b421309e...


Murmur* hash
operat...
The “token ring”

Node A

Node B

Node D

Node C
Node   Start              End
A      0xc000000000..1    0x0000000000..0
B      0x0000000000..1    0x4000000000..0
C      0x4000000000..1    0x8000000000..0
D      0x8000000000..1    0xc000000000..0
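The range table above can be turned into a small lookup sketch. The assumptions are labeled in the comments: tokens are cut down to the ten hex digits shown on the slides (real tokens are wider), each node is identified by the end of its range, and `RING`, `owner`, and `KEY_TOKENS` are hypothetical names used only for illustration.

```python
from bisect import bisect_left

# Each node owns the range that ends at its token (assumption: 10-hex-digit tokens).
RING = [
    (0x0000000000, "A"),
    (0x4000000000, "B"),
    (0x8000000000, "C"),
    (0xC000000000, "D"),
]
TOKENS = [token for token, _ in RING]


def owner(token):
    """Return the node whose range (start, end] contains the token."""
    # First node token that is >= the key token; wrap around past the last node.
    index = bisect_left(TOKENS, token) % len(RING)
    return RING[index][1]


# Hash prefixes copied from the slides, treated as tokens for this sketch.
KEY_TOKENS = {
    "jim": 0x5E02739678,
    "carol": 0xA9A0198010,
    "johnny": 0xF4EB27CEA7,
    "suzy": 0x78B421309E,
}
placement = {key: owner(token) for key, token in KEY_TOKENS.items()}
```

With these ranges, `placement` maps jim and suzy to node C, carol to node D, and johnny (whose token lies past the last boundary) wraps around to node A.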
Replication

Node A

Node D

carol

a9a0198010...

Node B

Node C
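One common way to place the extra copies is to walk clockwise around the ring from the node that owns the key. The sketch below assumes that simple clockwise placement and a replication factor of 3; neither is stated on the slide, and `RING` and `replicas` are hypothetical names.

```python
from bisect import bisect_left

# Same four-node ring as in the token table (assumption: 10-hex-digit tokens).
RING = [
    (0x0000000000, "A"),
    (0x4000000000, "B"),
    (0x8000000000, "C"),
    (0xC000000000, "D"),
]


def replicas(token, rf=3):
    """Return the owner of the token plus the next rf - 1 nodes clockwise."""
    tokens = [t for t, _ in RING]
    first = bisect_left(tokens, token) % len(RING)
    return [RING[(first + step) % len(RING)][1] for step in range(rf)]


carol_replicas = replicas(0xA9A0198010)
```

For carol's token the owner is node D, so under these assumptions the three copies land on D, A, and B.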
Virtual nodes

Node A

Node B

C’’

B

A’’

C

D’
Node D

Node C

Without vnodes

B’

C’

A
A’

D

With vnodes
A closer look at reads
90%
busy

Client

Coordinator

30%
busy

40%
busy
Rapid read protection
90%
busy

Client

Coordinator

30%
busy

40%
busy
Rapid read protection
90%
busy

Client

X

Coordinator

30%
busy

40%
busy
Rapid Read Protection

NONE
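The idea behind rapid read protection, as the diagrams suggest, is that the coordinator does not wait for a full timeout when the replica it asked goes quiet: after a short delay it sends the same read to another replica. The following is a deterministic toy simulation of that behaviour, not Cassandra's implementation; `coordinate_read`, the latency figures, and the retry threshold are all hypothetical.

```python
def coordinate_read(replicas, retry_after_ms, timeout_ms):
    """Simulate a read with speculative retry.

    replicas: ordered list of (name, latency_ms); latency None means the
    replica never answers. Returns (name, completion_time_ms) of the first answer.
    """
    sent_at = 0.0
    best = None
    for name, latency in replicas:
        if latency is not None:
            done = sent_at + latency
            if best is None or done < best[1]:
                best = (name, done)
        # Stop if an answer arrives before the next speculative request would go out.
        if best is not None and best[1] <= sent_at + retry_after_ms:
            break
        # Otherwise send the same read to the next replica after the retry delay.
        sent_at += retry_after_ms
    if best is None or best[1] > timeout_ms:
        raise TimeoutError("no replica answered in time")
    return best


# The first replica fails silently; the retry reaches a healthy replica.
result = coordinate_read(
    [("replica_1", None), ("replica_2", 5.0)],
    retry_after_ms=10.0,
    timeout_ms=100.0,
)
```

Here the read completes at 15 ms (10 ms of waiting plus 5 ms on the second replica) instead of failing at the 100 ms timeout.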
Consistency levels
90%
busy

Client

Coordinator

30%
busy

40%
busy
Consistency levels
•ONE
•QUORUM
•LOCAL_QUORUM
•LOCAL_ONE
•TWO
•ALL
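As a rough guide to what each level in the list asks for, the sketch below computes how many replica acknowledgements the coordinator waits for. It assumes a quorum means a strict majority of the relevant replicas; `required_acks` and the two-datacenter layout are hypothetical and only illustrate the arithmetic.

```python
def required_acks(level, rf_by_dc, local_dc):
    """Number of replica responses needed for a given consistency level.

    rf_by_dc: replication factor per datacenter, e.g. {"dc1": 3, "dc2": 3}.
    local_dc: the datacenter of the coordinator.
    """
    total_rf = sum(rf_by_dc.values())
    local_rf = rf_by_dc[local_dc]
    needed = {
        "ONE": 1,
        "LOCAL_ONE": 1,                     # one replica in the local datacenter
        "TWO": 2,
        "QUORUM": total_rf // 2 + 1,        # majority of all replicas
        "LOCAL_QUORUM": local_rf // 2 + 1,  # majority of local replicas only
        "ALL": total_rf,
    }
    return needed[level]


layout = {"dc1": 3, "dc2": 3}
acks = {
    level: required_acks(level, layout, "dc1")
    for level in ["ONE", "QUORUM", "LOCAL_QUORUM", "LOCAL_ONE", "TWO", "ALL"]
}
```

With three replicas in each of two datacenters, QUORUM needs 4 of 6 responses while LOCAL_QUORUM needs only 2 of the 3 local ones. Reading and writing at QUORUM overlap in at least one replica because 4 + 4 > 6.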
Race condition
SELECT name
FROM users
WHERE username = 'pmcfadin';

#CASSANDRAEU
Race condition

#CASSANDRAEU

SELECT name
FROM users
WHERE username = 'pmcfadin';
(0 rows)

SELECT name
FROM users
WHERE u...
Lightweight transactions
INSERT INTO users
(username, name, email,
password, created_date)
VALUES ('pmcfadin',
'Patrick Mc...
Paxos
•All operations are quorum-based
•Each replica sends information about unfinished
operations to the leader during pr...
Details
•4 round trips vs 1 for normal updates
•Paxos state is durable
•Immediate consistency with no leader election or f...
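The difference between the race-condition slides and the `IF NOT EXISTS` slides can be shown with a single-process stand-in. This sketch models only the visible semantics (a plain insert overwrites, a conditional insert reports whether it was applied); it contains no Paxos and no replicas, and `ToyUsers` and its methods are hypothetical names.

```python
class ToyUsers:
    """In-memory stand-in for the users table, keyed by username."""

    def __init__(self):
        self.rows = {}

    def insert(self, username, row):
        # Plain insert: silently overwrites, so the later write wins.
        self.rows[username] = row

    def insert_if_not_exists(self, username, row):
        # Conditional insert: returns (applied, existing_row).
        if username in self.rows:
            return False, self.rows[username]
        self.rows[username] = row
        return True, None


# Race: both clients saw 0 rows, both insert, and the second one wins.
racy = ToyUsers()
racy.insert("pmcfadin", {"password": "ba27e03fd9..."})
racy.insert("pmcfadin", {"password": "ea24e13ad9..."})

# Conditional inserts: the second attempt is rejected and sees the existing row.
safe = ToyUsers()
first = safe.insert_if_not_exists("pmcfadin", {"password": "ba27e03fd9..."})
second = safe.insert_if_not_exists("pmcfadin", {"password": "ea24e13ad9..."})
```

In the real system the check and the write must be agreed on by a quorum of replicas, which is where the extra round trips listed under Details come from.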
Use with caution
•Great for 1% of your application
•Eventual consistency is your friend
• http://www.slideshare.net/planet...
Cassandra 2.1
User defined types
CREATE TYPE address (
street text,
city text,
zip_code int,
phones set<text>
)
CREATE TABLE users (
id u...
Collection indexing
CREATE TABLE songs (
id uuid PRIMARY KEY,
artist text,
album text,
title text,
data blob,
tags set<tex...
More-efficient repair
2.1 roadmap
•Efficient handling of cold data
•Counters 2.0
•Only repair new-since-last-repair data
•January/February 2014
Questions?
Jonathan Ellis "Apache Cassandra 2.0 and 2.1". Talk at Cassandra conf 2013
Transcript of "Jonathan Ellis "Apache Cassandra 2.0 and 2.1". Talk at Cassandra conf 2013"

  1. 1. Modern Apache Cassandra Jonathan Ellis Project Chair, Apache Cassandra CTO, DataStax ©2013 DataStax Confidential. Do not distribute without consent. 1
  2. 2. Five years of Cassandra 0.1 Jul-08 ... 0.3 Jul-09 0.6 May-10 0.7 Feb-11 1.0 Dec-11 DSE 1.2 Oct-12 2.0 Jul-13
  3. 3. Application/Use Case • Social Signals: like/want/own features for eBay product and item pages • Hunch taste graph for eBay users and items • Many time series use cases Why Cassandra? • Multi-datacenter • Scalable • Write performance • Distributed counters • Hadoop support ACE
  4. 4. Time series data
  5. 5. Multi-datacenter support
  6. 6. Distributed counters
  7. 7. Hadoop support
  8. 8. Application/Use Case • Adobe AudienceManager: web analytics, content management, and online advertising Why Cassandra? • Low-latency • Scalable • Multi-datacenter • Tuneable consistency ACE
  9. 9. Bootstrapping
  10. 10. Bootstrapping
  11. 11. s d Bootstrapping s d s d s d
  12. 12. s d Bootstrapping s d s d s d
  13. 13. Bootstrapping
  14. 14. Tuneable consistency •(We’ll come back to this)
  15. 15. Application/Use Case • Logging • Notifications Why Cassandra? • Efficient writes • Durable • Scalable • High availability ACE
  16. 16. Durable + efficient writes write( k1 ,c1:v1 ) Memory Memtable Commit log Hard drive
  17. 17. write(k1 ,c1:v Memory k1 c1:v Memtable k1 c1:v Commit log Hard drive
  18. 18. write(k1 ,c2:v k1 c1:v c2:v Memory k1 c1:v k1 c2:v Hard drive
  19. 19. write(k2 ,c1:v c2:v ) k1 c1:v c2:v Memory k2 c1:v c2:v k1 c1:v k1 c2:v k2 c1:v c2:v Hard drive
  20. 20. write(k1 ,c1:v c3:v ) k1 c1:v c2:v c3:v Memory k2 c1:v c2:v k1 c1:v k1 c2:v k2 c1:v c2:v k1 c1:v c3:v Hard drive
  21. 21. Memory flush index / BF cleanup k1 c1:v c2:v c3:v k2 c1:v c2:v SSTable Hard drive
  22. 22. High availability •99.9999% availability on Cassandra •(We’ll come back to this, too)
  23. 23. Core values •Massive scalability •High performance •Ease of use •Reliability/Availability Cassandra MySQL HBase Redis
  24. 24. VLDB benchmark (RWS) [Chart: throughput (ops/sec, axis 0 to 80000) vs. number of nodes (0 to 12) for Cassandra, MySQL, HBase, and Redis]
  25. 25. Endpoint benchmark (RW) [Chart: throughput (ops/sec, axis 0 to 35000) vs. number of nodes (1 to 32) for Cassandra, HBase, and MongoDB]
  26. 26. Ease of use CREATE TABLE users ( id uuid PRIMARY KEY, name text, state text, birth_date int ); CREATE INDEX ON users(state); SELECT * FROM users WHERE state='Texas' AND birth_date > 1950;
  27. 27. Classic partitioning (SPOF) partition 1 partition 2 partition 3 partition 4 router client
  28. 28. (Not a theoretical problem) https://speakerdeck.com/mitsuhiko/a-year-of-mongodb http://aphyr.com/posts/288-the-network-is-reliable
  29. 29. Fully distributed, no SPOF Client p3 p6 p1 p1 p1
  30. 30. Partitioning Primary key determines placement* jim age: 36 car: camaro gender: M carol age: 37 car: subaru gender: F johnny age:12 gender: M suzy age:10 gender: F
  31. 31. PK Murmur Hash jim 5e02739678... carol a9a0198010... johnny f4eb27cea7... suzy 78b421309e... Murmur* hash operation yields a 64-bit number for keys of any size.
  32. 32. The “token ring” Node A Node B Node D Node C
  33. 33. Start A B C D End 0xc000000000.. 1 0x0000000000.. 1 0x4000000000.. 1 0x8000000000.. 1 0x0000000000.. 0 0x4000000000.. 0 0x8000000000.. 0 0xc000000000.. 0 jim 5e02739678... carol a9a0198010... johnny f4eb27cea7... suzy 78b421309e...
  34. 34. Start A B C D End 0xc000000000.. 1 0x0000000000.. 1 0x4000000000.. 1 0x8000000000.. 1 0x0000000000.. 0 0x4000000000.. 0 0x8000000000.. 0 0xc000000000.. 0 jim 5e02739678... carol a9a0198010... johnny f4eb27cea7... suzy 78b421309e...
  35. 35. Start A B C D End 0xc000000000.. 1 0x0000000000.. 1 0x4000000000.. 1 0x8000000000.. 1 0x0000000000.. 0 0x4000000000.. 0 0x8000000000.. 0 0xc000000000.. 0 jim 5e02739678... carol a9a0198010... johnny f4eb27cea7... suzy 78b421309e...
  36. 36. Start A B C D End 0xc000000000.. 1 0x0000000000.. 1 0x4000000000.. 1 0x8000000000.. 1 0x0000000000.. 0 0x4000000000.. 0 0x8000000000.. 0 0xc000000000.. 0 jim 5e02739678... carol a9a0198010... johnny f4eb27cea7... suzy 78b421309e...
  37. 37. Start A B C D End 0xc000000000.. 1 0x0000000000.. 1 0x4000000000.. 1 0x8000000000.. 1 0x0000000000.. 0 0x4000000000.. 0 0x8000000000.. 0 0xc000000000.. 0 jim 5e02739678... carol a9a0198010... johnny f4eb27cea7... suzy 78b421309e...
  38. 38. Replication Node A Node D carol a9a0198010... Node B Node C
  39. 39. Node A Node D carol a9a0198010... Node B Node C
  40. 40. Node A Node D carol a9a0198010... Node B Node C
  41. 41. Virtual nodes Node A Node B C’’ B A’’ C D’ Node D Node C Without vnodes B’ C’ A A’ D With vnodes
  42. 42. A closer look at reads 90% busy Client Coordinator 30% busy 40% busy
  43. 43. A closer look at reads 90% busy Client Coordinator 30% busy 40% busy
  44. 44. A closer look at reads 90% busy Client Coordinator 30% busy 40% busy
  45. 45. A closer look at reads 90% busy Client Coordinator 30% busy 40% busy
  46. 46. A closer look at reads 90% busy Client Coordinator 30% busy 40% busy
  47. 47. Rapid read protection 90% busy Client Coordinator 30% busy 40% busy
  48. 48. Rapid read protection 90% busy Client Coordinator 30% busy 40% busy
  49. 49. Rapid read protection 90% busy Client Coordinator 30% busy 40% busy
  50. 50. Rapid read protection 90% busy Client X Coordinator 30% busy 40% busy
  51. 51. Rapid read protection 90% busy Client X Coordinator 30% busy 40% busy
  52. 52. Rapid read protection 90% busy Client X Coordinator 30% busy 40% busy
  53. 53. Rapid read protection 90% busy Client X Coordinator 30% busy 40% busy
  54. 54. Rapid Read Protection NONE
  55. 55. Consistency levels 90% busy Client Coordinator 30% busy 40% busy
  56. 56. Consistency levels 90% busy Client Coordinator 30% busy 40% busy
  57. 57. Consistency levels 90% busy Client Coordinator 30% busy 40% busy
  58. 58. Consistency levels 90% busy Client Coordinator 30% busy 40% busy
  59. 59. Consistency levels 90% busy Client Coordinator 30% busy 40% busy
  60. 60. Consistency levels •ONE •QUORUM •LOCAL_QUORUM •LOCAL_ONE •TWO •ALL
  61. 61. Race condition SELECT name FROM users WHERE username = 'pmcfadin'; #CASSANDRAEU
  62. 62. Race condition #CASSANDRAEU SELECT name FROM users WHERE username = 'pmcfadin'; (0 rows) SELECT name FROM users WHERE username = 'pmcfadin';
  63. 63. Race condition #CASSANDRAEU SELECT name FROM users WHERE username = 'pmcfadin'; (0 rows) SELECT name FROM users WHERE username = 'pmcfadin'; INSERT INTO users (username, name, email, password, created_date) VALUES ('pmcfadin', 'Patrick McFadin', ['patrick@datastax.com'], 'ba27e03fd9...', '2011-06-20 13:50:00'); (0 rows)
  64. 64. Race condition #CASSANDRAEU SELECT name FROM users WHERE username = 'pmcfadin'; (0 rows) SELECT name FROM users WHERE username = 'pmcfadin'; INSERT INTO users (username, name, email, password, created_date) VALUES ('pmcfadin', 'Patrick McFadin', ['patrick@datastax.com'], 'ba27e03fd9...', '2011-06-20 13:50:00'); (0 rows) INSERT INTO users (username, name, email, password, created_date) VALUES ('pmcfadin', 'Patrick McFadin', ['patrick@datastax.com'], 'ea24e13ad9...', '2011-06-20 13:50:01');
  65. 65. Race condition #CASSANDRAEU SELECT name FROM users WHERE username = 'pmcfadin'; (0 rows) SELECT name FROM users WHERE username = 'pmcfadin'; INSERT INTO users (username, name, email, password, created_date) VALUES ('pmcfadin', 'Patrick McFadin', ['patrick@datastax.com'], 'ba27e03fd9...', '2011-06-20 13:50:00'); (0 rows) This one wins INSERT INTO users (username, name, email, password, created_date) VALUES ('pmcfadin', 'Patrick McFadin', ['patrick@datastax.com'], 'ea24e13ad9...', '2011-06-20 13:50:01');
  66. 66. Lightweight transactions INSERT INTO users (username, name, email, password, created_date) VALUES ('pmcfadin', 'Patrick McFadin', ['patrick@datastax.com'], 'ba27e03fd9...', '2011-06-20 13:50:00') IF NOT EXISTS; #CASSANDRAEU
  67. 67. Lightweight transactions INSERT INTO users (username, name, email, password, created_date) VALUES ('pmcfadin', 'Patrick McFadin', ['patrick@datastax.com'], 'ba27e03fd9...', '2011-06-20 13:50:00') IF NOT EXISTS; [applied] ----------True #CASSANDRAEU INSERT INTO users (username, name, email, password, created_date) VALUES ('pmcfadin', 'Patrick McFadin', ['patrick@datastax.com'], 'ea24e13ad9...', '2011-06-20 13:50:01') IF NOT EXISTS;
  68. 68. Lightweight transactions INSERT INTO users (username, name, email, password, created_date) VALUES ('pmcfadin', 'Patrick McFadin', ['patrick@datastax.com'], 'ba27e03fd9...', '2011-06-20 13:50:00') IF NOT EXISTS; [applied] ----------True #CASSANDRAEU INSERT INTO users (username, name, email, password, created_date) VALUES ('pmcfadin', 'Patrick McFadin', ['patrick@datastax.com'], 'ea24e13ad9...', '2011-06-20 13:50:01') IF NOT EXISTS; [applied] | username | created_date | name -----------+----------+----------------+---------------False | pmcfadin | 2011-06-20 ... | Patrick McFadin
  69. 69. Paxos •All operations are quorum-based •Each replica sends information about unfinished operations to the leader during prepare •Paxos made Simple
  70. 70. Details •4 round trips vs 1 for normal updates •Paxos state is durable •Immediate consistency with no leader election or failover •ConsistencyLevel.SERIAL •http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0
  71. 71. Use with caution •Great for 1% of your application •Eventual consistency is your friend • http://www.slideshare.net/planetcassandra/c-summit-2013-eventual-consistency-hopeful-consistency-by-christos-kalantzis
  72. 72. Cassandra 2.1
  73. 73. User defined types CREATE TYPE address ( street text, city text, zip_code int, phones set<text> ) CREATE TABLE users ( id uuid PRIMARY KEY, name text, addresses map<text, address> ) SELECT id, name, addresses.city, addresses.phones FROM users; id | name | addresses.city | addresses.phones --------------------+----------------+-------------------------63bf691f | jbellis | Austin | {'512-4567', '512-9999'}
  74. 74. Collection indexing CREATE TABLE songs ( id uuid PRIMARY KEY, artist text, album text, title text, data blob, tags set<text> ); CREATE INDEX song_tags_idx ON songs(tags); SELECT * FROM songs WHERE 'blues' IN tags; id | album | artist | tags | title ----------+---------------+-------------------+-----------------------+-----------------5027b27e | Country Blues | Lightnin' Hopkins | {'acoustic', 'blues'} | Worrying My Mind
  75. 75. More-efficient repair
  76. 76. More-efficient repair
  77. 77. More-efficient repair
  78. 78. More-efficient repair
  79. 79. More-efficient repair
  80. 80. More-efficient repair
  81. 81. More-efficient repair
  82. 82. More-efficient repair
  83. 83. More-efficient repair
  84. 84. 2.1 roadmap •Efficient handling of cold data •Counters 2.0 •Only repair new-since-last-repair data •January/February 2014
  85. 85. Questions?