Managing Cassandra : SCALE 12x
@AlTobey
Open Source Mechanic | DataStax
Open source evangelist for Apache Cassandra
©2014 DataStax

Obsessed with infrastructure my whole life.

See distributed systems everywhere.

!1
Five years of Cassandra

(timeline: releases 0.1 (Jul-08), 0.3, 0.6, 0.7, 1.0, 1.2, DSE, and 2.0, plotted over years 0–5)
Why Cassandra?

Or any non-relational?
アルトビー
Leo

My boys

I did not have permission from my wife to use her image ;)


Rufus
Less Tolerant

For the record.
Can you spell HTTP?

More and more services online.

And they must be available
Why choose Cassandra?

100,000,000 internet users in Japan

across the world … B2B, video conferencing, shopping, entertainment

Traditional solutions…


…may not be a good fit
Client-server
All your wildest dreams will come true.

Client - server database architecture.

Obsolete.

The original “funnel-shaped” architecture
3-tier

Client - client - server database architecture.

Still suitable for small applications.

e.g. LAMP, RoR
3-tier + caching

(diagram: master and slave databases with a cache layer)

more complex

cache coherency is a hard problem

cascading failures are common

Next: Out: the funnel. In: The ring.
Webscale

outer ring: clients (cell phones, etc.)

middle ring: application servers

inside ring: Cassandra servers



Serving millions of clients with mere hundreds or thousands of nodes requires a different approach to applications!
When it Rains

scale out … but at a cost
A new plan
Dynamo Paper(2007)
• How do we build a data store that is:
• Reliable
• Performant
• “Always On”
• Nothing new and shiny


Evolutionary. Real. Computer Science
Also the basis for Riak and Voldemort
BigTable(2006)
• Richer data model
• 1 key. Lots of values
• Fast sequential access
• 38 Papers cited
Cassandra(2008)
• Distributed features of Dynamo
• Data Model and storage from
BigTable
• February 17, 2010 it graduated to
a top-level Apache project
Cassandra - More than one server
• All nodes participate in a cluster
• Shared nothing
• Add or remove as needed
• More capacity? Add a server


!18
VLDB benchmark (RWS)

(chart: throughput in ops/sec for Cassandra vs. HBase, Redis, and MySQL)

!19
Cassandra - Fully Replicated
• Client writes local
• Data syncs across WAN
• Replication per Data Center

!20
Read-Modify-Write

UPDATE Employees
SET Rank=4, Promoted='2014-01-24'
WHERE EmployeeID=1337;

Before:
EmployeeID  1337
Name        アルトビー
StartDate   2013-10-01
Rank        3
Promoted    null

After:
EmployeeID  1337
Name        アルトビー
StartDate   2013-10-01
Rank        4
Promoted    2014-01-24

This might be what it looks like from SQL / CQL, but …
Read-Modify-Write

UPDATE Employees
SET Rank=4, Promoted='2014-01-24'
WHERE EmployeeID=1337;

Before:
EmployeeID  1337
Name        アルトビー
StartDate   2013-10-01
Rank        3
Promoted    null

After:
EmployeeID  1337
Name        アルトビー
StartDate   2013-10-01
Rank        4
Promoted    2014-01-24

RDBMS

TNSTAAFL
There is no such thing as a free lunch

TNSTAAFL …

If you're lucky, the cell is in cache.

Otherwise, it's a disk access to read, another to write.
Eventual Consistency

UPDATE Employees
SET Rank=4, Promoted='2014-01-24'
WHERE EmployeeID=1337;

Before:
EmployeeID  1337
Name        アルトビー
StartDate   2013-10-01
Rank        3
Promoted    null

After:
EmployeeID  1337
Name        アルトビー
StartDate   2013-10-01
Rank        4
Promoted    2014-01-24

Explain distributed RMW

More complicated.

Will talk about how it's abstracted in CQL later.

Coordinator
Eventual Consistency

UPDATE Employees
SET Rank=4, Promoted='2014-01-24'
WHERE EmployeeID=1337;

Before:
EmployeeID  1337
Name        アルトビー
StartDate   2013-10-01
Rank        3
Promoted    null

After:
EmployeeID  1337
Name        アルトビー
StartDate   2013-10-01
Rank        4
Promoted    2014-01-24

Coordinator

read

write

Memory replication on write, depending on RF, usually RF=3.

Reads AND writes remain available through partitions.

Hinted handoff.
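How many replicas the coordinator waits for is tunable per request. A minimal cqlsh sketch (not from the slides; it reuses the Employees example above):

  -- cqlsh session setting: wait for 2 of 3 replicas at RF=3
  CONSISTENCY QUORUM;
  UPDATE Employees SET Rank=4, Promoted='2014-01-24' WHERE EmployeeID=1337;

  -- weakest/fastest: any single replica is enough
  CONSISTENCY ONE;
  SELECT Rank, Promoted FROM Employees WHERE EmployeeID=1337;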
Avoiding read-modify-write
cassandra13_drinks column family

memTable:
  Albert     Tuesday: 6    Wednesday: 0
  Evan       Tuesday: 0    Wednesday: 0
  Frank      Tuesday: 3    Wednesday: 3
  Kelvin     Tuesday: 0    Wednesday: 0
  Krzysztof  Tuesday: 0    Wednesday: 0
  Phillip    Tuesday: 12   Wednesday: 0

⁍ CF to track how much I expect my team at Ooyala to drink
⁍ Row keys are names
⁍ Column keys are days
⁍ Values are a count of drinks
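A minimal CQL3 sketch of this pattern (not from the slides; the deck models it as a Thrift-style column family, so the table and column names here are assumptions):

  CREATE TABLE cassandra13_drinks (
    name   varchar,
    day    varchar,
    drinks int,
    PRIMARY KEY (name, day)
  );

  -- blind overwrite: no read needed, the newest write simply wins
  INSERT INTO cassandra13_drinks (name, day, drinks) VALUES ('Albert', 'Tuesday', 6);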

!25
Avoiding read-modify-write
cassandra13_drinks column family

memTable:
  Al         Tuesday: 2    Wednesday: 0
  Phillip    Tuesday: 0    Wednesday: 1

ssTable:
  Albert     Tuesday: 6    Wednesday: 0
  Evan       Tuesday: 0    Wednesday: 0
  Frank      Tuesday: 3    Wednesday: 3
  Kelvin     Tuesday: 0    Wednesday: 0
  Krzysztof  Tuesday: 0    Wednesday: 0
  Phillip    Tuesday: 12   Wednesday: 0

⁍ Next day, after a flush
⁍ I'm speaking so I decided to drink less
⁍ Phillip informs me that he has quit drinking

!26
Avoiding read-modify-write
cassandra13_drinks column family

memTable:
  Albert     Tuesday: 22   Wednesday: 0

ssTable:
  Albert     Tuesday: 2    Wednesday: 0
  Phillip    Tuesday: 0    Wednesday: 1

ssTable:
  Albert     Tuesday: 6    Wednesday: 0
  Evan       Tuesday: 0    Wednesday: 0
  Frank      Tuesday: 3    Wednesday: 3
  Kelvin     Tuesday: 0    Wednesday: 0
  Krzysztof  Tuesday: 0    Wednesday: 0
  Phillip    Tuesday: 12   Wednesday: 0

⁍ I'm drinking with all you people so I decide to add 20
⁍ read 2, add 20, write 22

!27
Avoiding read-modify-write
cassandra13_drinks column family

ssTable:
  Albert     Tuesday: 22   Wednesday: 0
  Evan       Tuesday: 0    Wednesday: 0
  Frank      Tuesday: 3    Wednesday: 3
  Kelvin     Tuesday: 0    Wednesday: 0
  Krzysztof  Tuesday: 0    Wednesday: 0
  Phillip    Tuesday: 0    Wednesday: 1

⁍ After compaction & conflict resolution
⁍ Overwriting the same value is just fine! Works really well for some patterns such as time-series data
⁍ Separate read/write streams handy for debugging, but not a big deal

!28
Overwriting

CREATE TABLE host_lookup (
  name varchar,
  id   uuid,
  PRIMARY KEY(name)
);

INSERT INTO host_lookup (name, id) VALUES
  ('www.tobert.org', 463b03ec-fcc1-4428-bac8-80ccee1c2f77);

INSERT INTO host_lookup (name, id) VALUES
  ('tobert.org', 463b03ec-fcc1-4428-bac8-80ccee1c2f77);

INSERT INTO host_lookup (name, id) VALUES
  ('www.tobert.org', 463b03ec-fcc1-4428-bac8-80ccee1c2f77);

SELECT id FROM host_lookup WHERE name='tobert.org';

Beware of expensive compaction

Best for: small indexes, lookup tables

Compaction handles RMW at the storage level in the background.

Under heavy writes, clock synchronization is very important to avoid timestamp collisions. In practice, this isn't a problem very often, and even when it goes wrong, not much harm is done.
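Because the last write wins by timestamp, clients can also supply write timestamps explicitly when clock skew matters. A minimal sketch (not from the slides), reusing the host_lookup table above:

  -- client-supplied write timestamp, in microseconds since the epoch
  INSERT INTO host_lookup (name, id) VALUES
    ('tobert.org', 463b03ec-fcc1-4428-bac8-80ccee1c2f77)
  USING TIMESTAMP 1390608000000000;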
Key/Value

CREATE TABLE keyval (
  key   VARCHAR,
  value blob,
  PRIMARY KEY(key)
);

INSERT INTO keyval (key, value) VALUES (?, ?);

SELECT value FROM keyval WHERE key=?;

e.g. memcached

Don't do this.

But it works when you really need it.
Journaling / Logging / Time-series

CREATE TABLE tsdb (
  time_bucket timestamp,
  time        timestamp,
  value       blob,
  PRIMARY KEY(time_bucket, time)
);

INSERT INTO tsdb (time_bucket, time, value) VALUES (
  '2014-10-24',            -- 1-day bucket (UTC)
  '2014-10-24T12:12:12Z',  -- ALWAYS USE UTC
  '{"foo": "bar"}'
);

Oversimplified, use normalization over blobs whenever possible.

ALWAYS USE UTC :)
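Bucketing by day keeps partitions bounded while still allowing range scans within a bucket. A minimal sketch of the read side (not from the slides), assuming the tsdb table above:

  -- fetch one day's worth of points, newest first
  SELECT time, value FROM tsdb
  WHERE time_bucket = '2014-10-24'
    AND time >= '2014-10-24T00:00:00Z'
    AND time <  '2014-10-25T00:00:00Z'
  ORDER BY time DESC;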
Journaling / Logging / Time-series

2014-01-24   2014-01-24T12:12:12Z   2014-01-24T21:21:21Z
             {"key": "value"}       {"key": "value"}
2014-01-25   2014-01-25T13:13:13Z
             {"key": "value"}

{ "2014-01-24" => {
    "2014-01-24T12:12:12Z" => {
      '{"foo": "bar"}'
    }
  }
}

Oversimplified, use normalization over blobs whenever possible.

ALWAYS USE UTC :)
Cassandra Collections

CREATE TABLE posts (
  id      uuid,
  body    varchar,
  created timestamp,
  authors set<varchar>,
  tags    set<varchar>,
  PRIMARY KEY(id)
);

INSERT INTO posts (id, body, created, authors, tags) VALUES (
  ea4aba7d-9344-4d08-8ca5-873aa1214068,
  'アルトビーの犬はばかね',
  dateof(now()),
  {'アルトビー', 'ィオートビー'},
  {'dog', 'silly', '犬', 'ばか'}
);

quick story about 犬ばかね ("silly dog")

sets & maps are CRDTs, safe to modify
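Sets can be added to or removed from in place, with no read first. A minimal sketch (not from the slides), reusing the posts table above:

  -- add and remove individual elements; replicas merge these safely
  UPDATE posts SET tags = tags + {'cassandra'}
  WHERE id = ea4aba7d-9344-4d08-8ca5-873aa1214068;

  UPDATE posts SET tags = tags - {'silly'}
  WHERE id = ea4aba7d-9344-4d08-8ca5-873aa1214068;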
Cassandra Collections

CREATE TABLE metrics (
  bucket timestamp,
  time   timestamp,
  value  blob,
  labels map<varchar,varchar>,
  PRIMARY KEY(bucket)
);

sets & maps are CRDTs, safe to modify
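Maps get the same in-place treatment; individual entries can be set without touching the rest. A minimal sketch (not from the slides), assuming the metrics table above; the label keys are made up:

  -- set or overwrite single map entries without a read
  UPDATE metrics SET labels['host'] = 'node1', labels['dc'] = 'us-west'
  WHERE bucket = '2014-02-21';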
Lightweight Transactions
• Cassandra 2.0 and on support LWT based on PAXOS
• PAXOS is a distributed consensus protocol
• Given a constraint, Cassandra ensures correct ordering
Lightweight Transactions

UPDATE users
  SET username='tobert'
  WHERE id=68021e8a-9eb0-436c-8cdd-aac629788383
  IF username='renice';

INSERT INTO users (id, username)
  VALUES (68021e8a-9eb0-436c-8cdd-aac629788383, 'renice')
  IF NOT EXISTS;

Client error on conflict.
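The conflict shows up in the statement's result: conditional writes return an [applied] flag (plus the current values when the condition fails) that the client has to check. A minimal sketch (not from the slides):

  -- hypothetical re-run after 'renice' is already taken:
  INSERT INTO users (id, username)
    VALUES (68021e8a-9eb0-436c-8cdd-aac629788383, 'renice')
    IF NOT EXISTS;
  -- result row: [applied] = False, along with the existing username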
Brendan Gregg’s Tools Chart
dstat -lrvn 10
iostat -x 1
htop
DataStax OpsCenter
Config Changes: Apache 1.0 ➜ DSE 3.0
⁍ Schema: compaction_strategy = LCS

⁍ Schema: bloom_filter_fp_chance = 0.1

⁍ Schema: sstable_size_in_mb = 256

⁍ Schema: compression_options = Snappy

⁍ YAML: compaction_throughput_mb_per_sec: 0
!43

⁍ LCS is a huge improvement in operations life (no more major compactions)
⁍ Bloom filters were tipping over a 24GiB heap
⁍ With lots of data per node, sstable sizes in LCS must be MUCH bigger
⁍ > 100,000 open files slows everything down, especially startup
⁍ 256MB vs. 5MB is a 50x reduction in file count
⁍ default has been fixed as of C* 2.0
⁍ Compaction can't keep up: even huge rates don't work, must be disabled
⁍ try to adjust heap, etc. so you're flushing at nearly full memtables to reduce compaction needs
⁍ backreference RMW?
⁍ might be fixed in >= 1.2
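Roughly how the schema-side settings above look when applied with CQL3 (a hedged sketch, not from the deck; the keyspace and table names are placeholders):

  ALTER TABLE mykeyspace.mytable
    WITH compaction = { 'class': 'LeveledCompactionStrategy',
                        'sstable_size_in_mb': 256 }
    AND compression = { 'sstable_compression': 'SnappyCompressor' }
    AND bloom_filter_fp_chance = 0.1;

The YAML setting (compaction_throughput_mb_per_sec: 0) lives in cassandra.yaml, not in the schema.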
nodetool ring

Address      DC         Rack   Status  State   Load       Owns    Token
10.10.10.10  Analytics  rack1  Up      Normal  47.73 MB   1.72%   101204669472175663702469172037896580098
10.10.10.10  Analytics  rack1  Up      Normal  63.94 MB   0.86%   102671403812352122596707855690619718940
10.10.10.10  Analytics  rack1  Up      Normal  85.73 MB   0.86%   104138138152528581490946539343342857782
10.10.10.10  Analytics  rack1  Up      Normal  47.87 MB   0.86%   105604872492705040385185222996065996624
10.10.10.10  Analytics  rack1  Up      Normal  39.73 MB   0.86%   107071606832881499279423906648789135466
10.10.10.10  Analytics  rack1  Up      Normal  40.74 MB   1.75%   110042394566257506011458285920000334950
10.10.10.10  Analytics  rack1  Up      Normal  40.08 MB   2.20%   113781420866907675791616368030579466301
10.10.10.10  Analytics  rack1  Up      Normal  56.19 MB   3.45%   119650151395618797017962053073524524487
10.10.10.10  Analytics  rack1  Up      Normal  214.88 MB  11.62%  139424886777089715561324792149872061049
10.10.10.10  Analytics  rack1  Up      Normal  214.29 MB  2.45%   143588210871399618110700028431440799305
10.10.10.10  Analytics  rack1  Up      Normal  158.49 MB  1.76%   146577368624928021690175250344904436129
10.10.10.10  Analytics  rack1  Up      Normal  40.3 MB    0.92%   148140168357822348318107048925037023042

!44

⁍ hotspots
nodetool cfstats

Keyspace: gostress
        Read Count: 0
        Read Latency: NaN ms.
        Write Count: 0
        Write Latency: NaN ms.
        Pending Tasks: 0
                Column Family: stressful
                SSTable count: 1                      <- controllable by sstable_size_in_mb
                Space used (live): 32981239
                Space used (total): 32981239
                Number of Keys (estimate): 128
                Memtable Columns Count: 0
                Memtable Data Size: 0
                Memtable Switch Count: 0
                Read Count: 0
                Read Latency: NaN ms.
                Write Count: 0
                Write Latency: NaN ms.
                Pending Tasks: 0
                Bloom Filter False Positives: 0
                Bloom Filter False Ratio: 0.00000
                Bloom Filter Space Used: 336          <- could be using a lot of heap
                Compacted row minimum size: 7007507
                Compacted row maximum size: 8409007
                Compacted row mean size: 8409007

⁍ bloom filters
⁍ sstable_size_in_mb

!45
nodetool proxyhistograms

Offset    Read Latency    Write Latency    Range Latency
35               0               20                0
42               0               61                0
50               0               82                0
60               0              440                0
72               0             3416                0
86               0            17910                0
103              0            48675                0
124              1            97423                0
149              0           153109                0
179              2           186205                0
215              5           139022                0
258            134            44058                0
310           2656            60660                0
372          34698           742684                0
446         469515          7359351                0
535        3920391         31030588                0
642        9852708         33070248                0
770        4487796          9719615                0
924         651959           984889                0

⁍ units are microseconds
⁍ can give you a good idea of how much latency coordinator hops are costing you

!46
nodetool compactionstats

al@node ~ $ nodetool compactionstats
pending tasks: 3
   compaction type   keyspace    column family      bytes compacted    bytes total   progress
        Compaction     hastur    gauge_archive           9819749801    16922291634     58.03%
        Compaction     hastur    counter_archive        12141850720    16147440484     75.19%
        Compaction     hastur    mark_archive             647389841     1475432590     43.88%
Active compaction remaining time : n/a

al@node ~ $ nodetool compactionstats
pending tasks: 3
   compaction type   keyspace    column family      bytes compacted    bytes total   progress
        Compaction     hastur    gauge_archive          10239806890    16922291634     60.51%
        Compaction     hastur    counter_archive        12544404397    16147440484     77.69%
        Compaction     hastur    mark_archive            1107897093     1475432590     75.09%
Active compaction remaining time : n/a

!47
Stress Testing Tools
⁍ cassandra-stress

⁍ YCSB

⁍ Production

⁍ Terasort (DSE)

⁍ Homegrown
!48

⁍ we mostly focus on cassandra-stress for burn-in of new clusters
⁍ can quickly figure out the right setting for -Xmn
⁍ Terasort is interesting for comparing DSE to Cloudera/Hortonworks/etc. (it’s fast!)
⁍ Consider writing custom benchmarks for your application patterns
⁍ sometimes it’s faster to write one than figure out how to make a generic tool do what you want
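A hedged example of the kind of cassandra-stress burn-in run described above (pre-2.1 option syntax; the host, key count, and thread count are placeholders):

  # write 10M keys against a new cluster, then read them back
  cassandra-stress -d 10.10.10.10 -n 10000000 -t 100 -o insert
  cassandra-stress -d 10.10.10.10 -n 10000000 -t 100 -o read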
/etc/sysctl.conf
kernel.pid_max = 999999
fs.file-max = 1048576
vm.max_map_count = 1048576
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 65536 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
vm.swappiness = 1
!49

⁍ pid_max doesn’t fix anything, I just like it and have never had a problem with it
⁍ These are my starting point settings for nearly every system/application.
⁍ Generally safe for production.
⁍ vm.dirty*ratio can go big for fake fast writes, generally safe for Cassandra, but beware you’re more likely to see FS/file
corruption on power loss
⁍ but you will get latency spikes if you hit dirty_ratio (percentage of RAM), so don’t tune too low
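These settings can be loaded into the running kernel without a reboot; a minimal sketch:

  # apply /etc/sysctl.conf to the live system
  sudo sysctl -p /etc/sysctl.conf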
/etc/rc.local

ra=$((2**14))  # 16k readahead target, in bytes
ss=$(blockdev --getss /dev/sda)
blockdev --setra $(($ra / $ss)) /dev/sda

echo 128 > /sys/block/sda/queue/nr_requests
echo deadline > /sys/block/sda/queue/scheduler
echo 16384 > /sys/block/md7/md/stripe_cache_size

!50

⁍ Lower readahead is better for latency on seeky workloads
⁍ More readahead will artificially increase your IOPS by reading a bunch of stuff you might not need!
⁍ nr_requests = number of IO structs the kernel will keep in flight, don’t go crazy
⁍ Deadline is best for raw throughput
⁍ CFQ supports cgroup priorities and is occasionally better for latency on SATA drives
⁍ Default stripe cache is 128. The increase seems to help MD RAID5 a lot.
⁍ Don’t forget to set readahead separately for MD RAID devices
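A minimal sketch of that last point (reuses the ra/ss variables and the md7 device from the script above):

  # MD arrays have their own readahead setting; the sda value doesn't carry over
  blockdev --setra $(($ra / $ss)) /dev/md7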
Ending discussion notes
• 2 socket, ECC memory
• 16GiB minimum, prefer 32-64GiB; over 128GiB and Linux will need serious tuning

• SSD where possible, Samsung 840 Pro is a good choice, any Intel is fine
• NO SAN/NAS, 20ms latency tops
• if you MUST (and please, don't) dedicate spindles to C* nodes, use a separate network

• Avoid disk configurations targeted at Hadoop, disks are too slow
• http://www.datastax.com/documentation/cassandra/2.0/pdf/cassandra20.pdf
• read the sections on Repair, Tombstones & Snapshots