Managing Cassandra : SCALE 12x
@AlTobey
Open Source Mechanic | DataStax
Open source evangelist for Apache Cassandra
©2014 DataStax

Obsessed with infrastructure my whole life.

See distributed systems everywhere.

!1
Five years of Cassandra

(timeline: releases 0.1 (Jul-08), 0.3, 0.6, 0.7, 1.0, 1.2, DSE, and 2.0, plotted over years 0–5)
Why Cassandra?

Or any non-relational?
アルトビー
Leo

My boys

I did not have permission from my wife to use her image ;)


Rufus
Less Tolerant

For the record.
Can you spell HTTP?

More and more services online.

And they must be available
Why choose Cassandra?

100,000,000 internet users in Japan

across the world … B2B, video conferencing, shopping, entertainment

Traditional solutions…


…may not be a good fit
Client-server
All your wildest dreams will come true.

Client - server database architecture.

Obsolete.

The original “funnel-shaped” architecture
3-tier

Client - client - server database architecture.

Still suitable for small applications.

e.g. LAMP, RoR
3-tier + caching

(diagram: master and slave databases with a cache layer)

more complex

cache coherency is a hard problem

cascading failures are common

Next: Out: the funnel. In: The ring.
Webscale

outer ring: clients (cell phones, etc.)

middle ring: application servers

inside ring: Cassandra servers



Serving millions of clients with mere hundreds or thousands of nodes requires a different approach to applications!
When it Rains

scale out … but at a cost
A new plan
Dynamo Paper(2007)
• How do we build a data store that is:
• Reliable
• Performant
• “Always On”
• Nothing new and shiny


Evolutionary. Real. Computer Science
Also the basis for Riak and Voldemort
BigTable(2006)
• Richer data model
• 1 key. Lots of values
• Fast sequential access
• 38 Papers cited
Cassandra(2008)
• Distributed features of Dynamo
• Data Model and storage from
BigTable
• February 17, 2010 it graduated to
a top-level Apache project
Cassandra - More than one server
• All nodes participate in a cluster
• Shared nothing
• Add or remove as needed
• More capacity? Add a server


!18
VLDB benchmark (RWS)

(chart: throughput in ops/sec for Cassandra vs. HBase, Redis, and MySQL)

!19
Cassandra - Fully Replicated
• Client writes local
• Data syncs across WAN
• Replication per Data Center

!20
Read-Modify-Write

UPDATE Employees
SET Rank=4, Promoted='2014-01-24'
WHERE EmployeeID=1337;

Before:
EmployeeID  1337
Name        アルトビー
StartDate   2013-10-01
Rank        3
Promoted    null

After:
EmployeeID  1337
Name        アルトビー
StartDate   2013-10-01
Rank        4
Promoted    2014-01-24

This might be what it looks like from SQL / CQL, but …
Read-Modify-Write

UPDATE Employees
SET Rank=4, Promoted='2014-01-24'
WHERE EmployeeID=1337;

Before:
EmployeeID  1337
Name        アルトビー
StartDate   2013-10-01
Rank        3
Promoted    null

After:
EmployeeID  1337
Name        アルトビー
StartDate   2013-10-01
Rank        4
Promoted    2014-01-24

RDBMS

TNSTAAFL
There is no such thing as a free lunch

TNSTAAFL …

If you're lucky, the cell is in cache.

Otherwise, it's a disk access to read, another to write.
Eventual Consistency

UPDATE Employees
SET Rank=4, Promoted='2014-01-24'
WHERE EmployeeID=1337;

Before:
EmployeeID  1337
Name        アルトビー
StartDate   2013-10-01
Rank        3
Promoted    null

After:
EmployeeID  1337
Name        アルトビー
StartDate   2013-10-01
Rank        4
Promoted    2014-01-24

Explain distributed RMW

More complicated.

Will talk about how it's abstracted in CQL later.

Coordinator
Eventual Consistency

UPDATE Employees
SET Rank=4, Promoted='2014-01-24'
WHERE EmployeeID=1337;

Before:
EmployeeID  1337
Name        アルトビー
StartDate   2013-10-01
Rank        3
Promoted    null

After:
EmployeeID  1337
Name        アルトビー
StartDate   2013-10-01
Rank        4
Promoted    2014-01-24

Coordinator

read

write

Memory replication on write, depending on RF, usually RF=3.

Reads AND writes remain available through partitions.

Hinted handoff.
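How many replicas the coordinator waits for is tunable per request. A minimal cqlsh sketch (not from the slides; it reuses the Employees example above):

  -- cqlsh session setting: wait for 2 of 3 replicas at RF=3
  CONSISTENCY QUORUM;
  UPDATE Employees SET Rank=4, Promoted='2014-01-24' WHERE EmployeeID=1337;

  -- weakest/fastest: any single replica is enough
  CONSISTENCY ONE;
  SELECT Rank, Promoted FROM Employees WHERE EmployeeID=1337;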
Avoiding read-modify-write
cassandra13_drinks column family

memTable:
  Albert     Tuesday: 6    Wednesday: 0
  Evan       Tuesday: 0    Wednesday: 0
  Frank      Tuesday: 3    Wednesday: 3
  Kelvin     Tuesday: 0    Wednesday: 0
  Krzysztof  Tuesday: 0    Wednesday: 0
  Phillip    Tuesday: 12   Wednesday: 0

⁍ CF to track how much I expect my team at Ooyala to drink
⁍ Row keys are names
⁍ Column keys are days
⁍ Values are a count of drinks
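A minimal CQL3 sketch of this pattern (not from the slides; the deck models it as a Thrift-style column family, so the table and column names here are assumptions):

  CREATE TABLE cassandra13_drinks (
    name   varchar,
    day    varchar,
    drinks int,
    PRIMARY KEY (name, day)
  );

  -- blind overwrite: no read needed, the newest write simply wins
  INSERT INTO cassandra13_drinks (name, day, drinks) VALUES ('Albert', 'Tuesday', 6);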

!25
Avoiding read-modify-write
cassandra13_drinks column family

memTable:
  Al         Tuesday: 2    Wednesday: 0
  Phillip    Tuesday: 0    Wednesday: 1

ssTable:
  Albert     Tuesday: 6    Wednesday: 0
  Evan       Tuesday: 0    Wednesday: 0
  Frank      Tuesday: 3    Wednesday: 3
  Kelvin     Tuesday: 0    Wednesday: 0
  Krzysztof  Tuesday: 0    Wednesday: 0
  Phillip    Tuesday: 12   Wednesday: 0

⁍ Next day, after a flush
⁍ I'm speaking so I decided to drink less
⁍ Phillip informs me that he has quit drinking

!26
Avoiding read-modify-write
cassandra13_drinks column family

memTable:
  Albert     Tuesday: 22   Wednesday: 0

ssTable:
  Albert     Tuesday: 2    Wednesday: 0
  Phillip    Tuesday: 0    Wednesday: 1

ssTable:
  Albert     Tuesday: 6    Wednesday: 0
  Evan       Tuesday: 0    Wednesday: 0
  Frank      Tuesday: 3    Wednesday: 3
  Kelvin     Tuesday: 0    Wednesday: 0
  Krzysztof  Tuesday: 0    Wednesday: 0
  Phillip    Tuesday: 12   Wednesday: 0

⁍ I'm drinking with all you people so I decide to add 20
⁍ read 2, add 20, write 22

!27
Avoiding read-modify-write
cassandra13_drinks column family

ssTable:
  Albert     Tuesday: 22   Wednesday: 0
  Evan       Tuesday: 0    Wednesday: 0
  Frank      Tuesday: 3    Wednesday: 3
  Kelvin     Tuesday: 0    Wednesday: 0
  Krzysztof  Tuesday: 0    Wednesday: 0
  Phillip    Tuesday: 0    Wednesday: 1

⁍ After compaction & conflict resolution
⁍ Overwriting the same value is just fine! Works really well for some patterns such as time-series data
⁍ Separate read/write streams handy for debugging, but not a big deal

!28
Overwriting

CREATE TABLE host_lookup (
  name varchar,
  id   uuid,
  PRIMARY KEY(name)
);

INSERT INTO host_lookup (name, id) VALUES
  ('www.tobert.org', 463b03ec-fcc1-4428-bac8-80ccee1c2f77);

INSERT INTO host_lookup (name, id) VALUES
  ('tobert.org', 463b03ec-fcc1-4428-bac8-80ccee1c2f77);

INSERT INTO host_lookup (name, id) VALUES
  ('www.tobert.org', 463b03ec-fcc1-4428-bac8-80ccee1c2f77);

SELECT id FROM host_lookup WHERE name='tobert.org';

Beware of expensive compaction

Best for: small indexes, lookup tables

Compaction handles RMW at the storage level in the background.

Under heavy writes, clock synchronization is very important to avoid timestamp collisions. In practice, this isn't a problem very often, and even when it goes wrong, not much harm is done.
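Because the last write wins by timestamp, clients can also supply write timestamps explicitly when clock skew matters. A minimal sketch (not from the slides), reusing the host_lookup table above:

  -- client-supplied write timestamp, in microseconds since the epoch
  INSERT INTO host_lookup (name, id) VALUES
    ('tobert.org', 463b03ec-fcc1-4428-bac8-80ccee1c2f77)
  USING TIMESTAMP 1390608000000000;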
Key/Value

CREATE TABLE keyval (
  key   VARCHAR,
  value blob,
  PRIMARY KEY(key)
);

INSERT INTO keyval (key, value) VALUES (?, ?);

SELECT value FROM keyval WHERE key=?;

e.g. memcached

Don't do this.

But it works when you really need it.
Journaling / Logging / Time-series

CREATE TABLE tsdb (
  time_bucket timestamp,
  time        timestamp,
  value       blob,
  PRIMARY KEY(time_bucket, time)
);

INSERT INTO tsdb (time_bucket, time, value) VALUES (
  '2014-10-24',            -- 1-day bucket (UTC)
  '2014-10-24T12:12:12Z',  -- ALWAYS USE UTC
  '{"foo": "bar"}'
);

Oversimplified, use normalization over blobs whenever possible.

ALWAYS USE UTC :)
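Bucketing by day keeps partitions bounded while still allowing range scans within a bucket. A minimal sketch of the read side (not from the slides), assuming the tsdb table above:

  -- fetch one day's worth of points, newest first
  SELECT time, value FROM tsdb
  WHERE time_bucket = '2014-10-24'
    AND time >= '2014-10-24T00:00:00Z'
    AND time <  '2014-10-25T00:00:00Z'
  ORDER BY time DESC;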
Journaling / Logging / Time-series

2014-01-24   2014-01-24T12:12:12Z   2014-01-24T21:21:21Z
             {"key": "value"}       {"key": "value"}
2014-01-25   2014-01-25T13:13:13Z
             {"key": "value"}

{ "2014-01-24" => {
    "2014-01-24T12:12:12Z" => {
      '{"foo": "bar"}'
    }
  }
}

Oversimplified, use normalization over blobs whenever possible.

ALWAYS USE UTC :)
Cassandra Collections

CREATE TABLE posts (
  id      uuid,
  body    varchar,
  created timestamp,
  authors set<varchar>,
  tags    set<varchar>,
  PRIMARY KEY(id)
);

INSERT INTO posts (id, body, created, authors, tags) VALUES (
  ea4aba7d-9344-4d08-8ca5-873aa1214068,
  'アルトビーの犬はばかね',
  dateof(now()),
  {'アルトビー', 'ィオートビー'},
  {'dog', 'silly', '犬', 'ばか'}
);

quick story about 犬ばかね ("silly dog")

sets & maps are CRDTs, safe to modify
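Sets can be added to or removed from in place, with no read first. A minimal sketch (not from the slides), reusing the posts table above:

  -- add and remove individual elements; replicas merge these safely
  UPDATE posts SET tags = tags + {'cassandra'}
  WHERE id = ea4aba7d-9344-4d08-8ca5-873aa1214068;

  UPDATE posts SET tags = tags - {'silly'}
  WHERE id = ea4aba7d-9344-4d08-8ca5-873aa1214068;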
Cassandra Collections

CREATE TABLE metrics (
  bucket timestamp,
  time   timestamp,
  value  blob,
  labels map<varchar,varchar>,
  PRIMARY KEY(bucket)
);

sets & maps are CRDTs, safe to modify
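Maps get the same in-place treatment; individual entries can be set without touching the rest. A minimal sketch (not from the slides), assuming the metrics table above; the label keys are made up:

  -- set or overwrite single map entries without a read
  UPDATE metrics SET labels['host'] = 'node1', labels['dc'] = 'us-west'
  WHERE bucket = '2014-02-21';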
Lightweight Transactions
• Cassandra 2.0 and on support LWT based on PAXOS
• PAXOS is a distributed consensus protocol
• Given a constraint, Cassandra ensures correct ordering
Lightweight Transactions

UPDATE users
  SET username='tobert'
  WHERE id=68021e8a-9eb0-436c-8cdd-aac629788383
  IF username='renice';

INSERT INTO users (id, username)
  VALUES (68021e8a-9eb0-436c-8cdd-aac629788383, 'renice')
  IF NOT EXISTS;

Client error on conflict.
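The conflict shows up in the statement's result: conditional writes return an [applied] flag (plus the current values when the condition fails) that the client has to check. A minimal sketch (not from the slides):

  -- hypothetical re-run after 'renice' is already taken:
  INSERT INTO users (id, username)
    VALUES (68021e8a-9eb0-436c-8cdd-aac629788383, 'renice')
    IF NOT EXISTS;
  -- result row: [applied] = False, along with the existing username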
Brendan Gregg’s Tools Chart
dstat -lrvn 10
iostat -x 1
htop
DataStax OpsCenter
Config Changes: Apache 1.0 ➜ DSE 3.0
⁍ Schema: compaction_strategy = LCS

⁍ Schema: bloom_filter_fp_chance = 0.1

⁍ Schema: sstable_size_in_mb = 256

⁍ Schema: compression_options = Snappy

⁍ YAML: compaction_throughput_mb_per_sec: 0
!43

⁍ LCS is a huge improvement in operations life (no more major compactions)
⁍ Bloom filters were tipping over a 24GiB heap
⁍ With lots of data per node, sstable sizes in LCS must be MUCH bigger
⁍ > 100,000 open files slows everything down, especially startup
⁍ 256MB vs. 5MB is a 50x reduction in file count
⁍ default has been fixed as of C* 2.0
⁍ Compaction can't keep up: even huge rates don't work, must be disabled
⁍ try to adjust heap, etc. so you're flushing at nearly full memtables to reduce compaction needs
⁍ backreference RMW?
⁍ might be fixed in >= 1.2
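Roughly how the schema-side settings above look when applied with CQL3 (a hedged sketch, not from the deck; the keyspace and table names are placeholders):

  ALTER TABLE mykeyspace.mytable
    WITH compaction = { 'class': 'LeveledCompactionStrategy',
                        'sstable_size_in_mb': 256 }
    AND compression = { 'sstable_compression': 'SnappyCompressor' }
    AND bloom_filter_fp_chance = 0.1;

The YAML setting (compaction_throughput_mb_per_sec: 0) lives in cassandra.yaml, not in the schema.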
nodetool ring

Address      DC         Rack   Status  State   Load       Owns    Token
10.10.10.10  Analytics  rack1  Up      Normal  47.73 MB   1.72%   101204669472175663702469172037896580098
10.10.10.10  Analytics  rack1  Up      Normal  63.94 MB   0.86%   102671403812352122596707855690619718940
10.10.10.10  Analytics  rack1  Up      Normal  85.73 MB   0.86%   104138138152528581490946539343342857782
10.10.10.10  Analytics  rack1  Up      Normal  47.87 MB   0.86%   105604872492705040385185222996065996624
10.10.10.10  Analytics  rack1  Up      Normal  39.73 MB   0.86%   107071606832881499279423906648789135466
10.10.10.10  Analytics  rack1  Up      Normal  40.74 MB   1.75%   110042394566257506011458285920000334950
10.10.10.10  Analytics  rack1  Up      Normal  40.08 MB   2.20%   113781420866907675791616368030579466301
10.10.10.10  Analytics  rack1  Up      Normal  56.19 MB   3.45%   119650151395618797017962053073524524487
10.10.10.10  Analytics  rack1  Up      Normal  214.88 MB  11.62%  139424886777089715561324792149872061049
10.10.10.10  Analytics  rack1  Up      Normal  214.29 MB  2.45%   143588210871399618110700028431440799305
10.10.10.10  Analytics  rack1  Up      Normal  158.49 MB  1.76%   146577368624928021690175250344904436129
10.10.10.10  Analytics  rack1  Up      Normal  40.3 MB    0.92%   148140168357822348318107048925037023042

!44

⁍ hotspots
nodetool cfstats

Keyspace: gostress
        Read Count: 0
        Read Latency: NaN ms.
        Write Count: 0
        Write Latency: NaN ms.
        Pending Tasks: 0
                Column Family: stressful
                SSTable count: 1                      <- controllable by sstable_size_in_mb
                Space used (live): 32981239
                Space used (total): 32981239
                Number of Keys (estimate): 128
                Memtable Columns Count: 0
                Memtable Data Size: 0
                Memtable Switch Count: 0
                Read Count: 0
                Read Latency: NaN ms.
                Write Count: 0
                Write Latency: NaN ms.
                Pending Tasks: 0
                Bloom Filter False Positives: 0
                Bloom Filter False Ratio: 0.00000
                Bloom Filter Space Used: 336          <- could be using a lot of heap
                Compacted row minimum size: 7007507
                Compacted row maximum size: 8409007
                Compacted row mean size: 8409007

⁍ bloom filters
⁍ sstable_size_in_mb

!45
nodetool proxyhistograms

Offset    Read Latency    Write Latency    Range Latency
35               0               20                0
42               0               61                0
50               0               82                0
60               0              440                0
72               0             3416                0
86               0            17910                0
103              0            48675                0
124              1            97423                0
149              0           153109                0
179              2           186205                0
215              5           139022                0
258            134            44058                0
310           2656            60660                0
372          34698           742684                0
446         469515          7359351                0
535        3920391         31030588                0
642        9852708         33070248                0
770        4487796          9719615                0
924         651959           984889                0

⁍ units are microseconds
⁍ can give you a good idea of how much latency coordinator hops are costing you

!46
nodetool compactionstats

al@node ~ $ nodetool compactionstats
pending tasks: 3
   compaction type   keyspace    column family      bytes compacted    bytes total   progress
        Compaction     hastur    gauge_archive           9819749801    16922291634     58.03%
        Compaction     hastur    counter_archive        12141850720    16147440484     75.19%
        Compaction     hastur    mark_archive             647389841     1475432590     43.88%
Active compaction remaining time : n/a

al@node ~ $ nodetool compactionstats
pending tasks: 3
   compaction type   keyspace    column family      bytes compacted    bytes total   progress
        Compaction     hastur    gauge_archive          10239806890    16922291634     60.51%
        Compaction     hastur    counter_archive        12544404397    16147440484     77.69%
        Compaction     hastur    mark_archive            1107897093     1475432590     75.09%
Active compaction remaining time : n/a

!47
Stress Testing Tools
⁍ cassandra-stress

⁍ YCSB

⁍ Production

⁍ Terasort (DSE)

⁍ Homegrown
!48

⁍ we mostly focus on cassandra-stress for burn-in of new clusters
⁍ can quickly figure out the right setting for -Xmn
⁍ Terasort is interesting for comparing DSE to Cloudera/Hortonworks/etc. (it’s fast!)
⁍ Consider writing custom benchmarks for your application patterns
⁍ sometimes it’s faster to write one than figure out how to make a generic tool do what you want
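A hedged example of the kind of cassandra-stress burn-in run described above (pre-2.1 option syntax; the host, key count, and thread count are placeholders):

  # write 10M keys against a new cluster, then read them back
  cassandra-stress -d 10.10.10.10 -n 10000000 -t 100 -o insert
  cassandra-stress -d 10.10.10.10 -n 10000000 -t 100 -o read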
/etc/sysctl.conf
kernel.pid_max = 999999
fs.file-max = 1048576
vm.max_map_count = 1048576
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 65536 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
vm.swappiness = 1
!49

⁍ pid_max doesn’t fix anything, I just like it and have never had a problem with it
⁍ These are my starting point settings for nearly every system/application.
⁍ Generally safe for production.
⁍ vm.dirty*ratio can go big for fake fast writes, generally safe for Cassandra, but beware you’re more likely to see FS/file
corruption on power loss
⁍ but you will get latency spikes if you hit dirty_ratio (percentage of RAM), so don’t tune too low
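These settings can be loaded into the running kernel without a reboot; a minimal sketch:

  # apply /etc/sysctl.conf to the live system
  sudo sysctl -p /etc/sysctl.conf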
/etc/rc.local

ra=$((2**14))  # 16k readahead target, in bytes
ss=$(blockdev --getss /dev/sda)
blockdev --setra $(($ra / $ss)) /dev/sda

echo 128 > /sys/block/sda/queue/nr_requests
echo deadline > /sys/block/sda/queue/scheduler
echo 16384 > /sys/block/md7/md/stripe_cache_size

!50

⁍ Lower readahead is better for latency on seeky workloads
⁍ More readahead will artificially increase your IOPS by reading a bunch of stuff you might not need!
⁍ nr_requests = number of IO structs the kernel will keep in flight, don’t go crazy
⁍ Deadline is best for raw throughput
⁍ CFQ supports cgroup priorities and is occasionally better for latency on SATA drives
⁍ Default stripe cache is 128. The increase seems to help MD RAID5 a lot.
⁍ Don’t forget to set readahead separately for MD RAID devices
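A minimal sketch of that last point (reuses the ra/ss variables and the md7 device from the script above):

  # MD arrays have their own readahead setting; the sda value doesn't carry over
  blockdev --setra $(($ra / $ss)) /dev/md7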
Ending discussion notes
• 2 socket, ECC memory
• 16GiB minimum, prefer 32-64GiB; over 128GiB and Linux will need serious tuning

• SSD where possible, Samsung 840 Pro is a good choice, any Intel is fine
• NO SAN/NAS, 20ms latency tops
• if you MUST (and please, don't) dedicate spindles to C* nodes, use a separate network

• Avoid disk configurations targeted at Hadoop, disks are too slow
• http://www.datastax.com/documentation/cassandra/2.0/pdf/cassandra20.pdf
• read the sections on Repair, Tombstones & Snapshots