Scaling Cassandra for Big Data


  1. Scaling Cassandra for Big Data
     Al Tobey
     Tech Lead, Data and Compute Services, Ooyala, Inc.
     al@ooyala.com / @AlTobey
  2. Why Cassandra?
     ● Highly available
     ● Commodity hardware
     ● Horizontal scale
     ● Columnar data model
     ● Open source
  3. What does Ooyala use it for?
     ● Fast access to data generated by Map/Reduce
     ● High-availability key/value out of Storm
     ● Cross-device resume (playhead tracking)
     ● ML predictions
     ● Time-series data, raw events & application metrics
  4. The Beginning
     ● Our data is doubling every year
     ● Cluster size: 18 nodes
     ● Biggest CF: 2TB
     ● Repairs becoming a problem
     ● Expired tombstones
  5. First Migration
     ● Upgrade from C* 0.6 to 0.8
     ● Remove expired tombstones
     ● Scrub data and rebuild indexes
     ● Lots of Linux performance tuning
     ● Map/Reduce
  6. Second Migration
     ● Upgrade to Cassandra 1.0
     ● Remove expired tombstones
     ● Update schema
     ● More Linux performance tuning
     ● Map/Reduce - this time using DSE Hadoop
  7. Tuning Highlights
     ● Bloom filter false-positive chance (schema)
     ● Index density (schema)
     ● LeveledCompaction sstable size (schema)
     ● XFS filesystem bugs (Linux)
       ○ Stick with ext4 if you like to sleep.
     ● NO SWAP!
  8. More Information
     ● cassandra-users mailing list
     ● irc.freenode.net #cassandra / #cassandra-ops
     ● http://www.datastax.com/docs/1.1/index
     ● @AlTobey / al@ooyala.com
     ● Contact me about open positions at Ooyala.
  9. Rejected Slides follow:
     An old version of this deck was a lot more technical. I've added them
     back for online posting since people have asked about the specifics.
  10. Linux: General Observations
      ● Use a modern kernel, 2.6.32 is ancient
        ○ running 3.4.11 on new production hardware
        ○ default Ubuntu Lucid / Oneiric kernels in EC2
      ● I have yet to use XFS bug-free
        ○ 2.6.38 has an especially fun bug
        ○ allocsize=64m allocates 64m always & forever
        ○ echo 1 > /proc/sys/vm/drop_caches
      ● Put commit log on a different filesystem
      ● btrfs works fine in production
      ● Block alignment is hard
        ○ use GPT disk labels and it's generally not an issue
        ○ or just skip disk labels and RAID whole disks
  11. Linux: almost a server OS
      /etc/security/limits.conf:
      * - memlock  unlimited
      * - nofile   1048576
      * - fsize    unlimited
      * - nproc    999999
  12. Linux: I love bufferbloat
      /etc/sysctl.conf:
      kernel.sysrq = 1
      kernel.panic = 300
      fs.file-max = 1048576
      kernel.pid_max = 999999
      vm.max_map_count = 1048576
      net.core.rmem_max = 16777216
      net.core.wmem_max = 16777216
      net.ipv4.tcp_rmem = 4096 65536 16777216
      net.ipv4.tcp_wmem = 4096 65536 16777216
  13. Ubuntu: FFFFFFFUUUUUUUUUUU
      /etc/fstab:
      /dev/md4 /commit ext4 nobootwait,barrier=1,journal_ioprio=0,rw 0 0
      /dev/md7 /srv    xfs  nobootwait,rw 0 0
      ● Force barriers for the journal
      ● noatime & relatime aren't necessary anymore
        ○ since ~2.6.31
      ● nobootwait is an upstart option
        ○ set this or upstart will troll you at 4am
        ○ mountall hangs on boot for any error without this
        ○ use on both hardware and EC2 unless you love using OOB consoles
      ● As noted, XFS is buggy, so consider ext4.
  14. Linux: Final Adjustments
      /etc/rc.local (or whatever you prefer)
      ● CFQ disk scheduler
        ○ deadline is still faster, but no cgroup support
        ○ noop is a popular choice in EC2, SSD, and HW RAID
      ● Tune readahead
        ○ don't go crazy, 64k is a decent choice
        ○ big RA will inflate your bandwidth numbers, but really large
          values will waste IO on unused data
      ● If running MD RAID5/6
        ○ echo 16384 > /sys/block/$md/md/stripe_cache_size
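The adjustments on this slide can be collected into an /etc/rc.local fragment; a minimal sketch, assuming /dev/sda is a data disk and /dev/md0 is the MD array (device names are placeholders, adjust for your hardware):

```shell
#!/bin/sh
# Select the CFQ elevator on the data disk (deadline or noop are the
# alternatives discussed above).
echo cfq > /sys/block/sda/queue/scheduler

# Modest readahead: blockdev --setra takes 512-byte sectors,
# so 128 sectors = 64 KiB.
blockdev --setra 128 /dev/sda

# For MD RAID5/6, enlarge the stripe cache as suggested above.
echo 16384 > /sys/block/md0/md/stripe_cache_size
```

These writes require root and are not persistent across reboots, which is why the slide puts them in rc.local.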
  15. JVM: ALL THE MEMORY
      ● Use Oracle JVM 1.6 for Cassandra
        ○ OpenJDK works, still not recommended
        ○ use fpm to create packages if you don't have them
      ● Default Cassandra GC settings are OK
        ○ -XX:+UseNUMA
          ■ works fine in production
          ■ Apache scripts will use numactl if installed
            ● DSE does not! (yet)
        ○ Bigger data will need bigger heaps.
          ■ 12G seems to work OK
          ■ 24G works, but approaching limits of JVM
          ■ too little free memory causes excessive memtable flushing
            (more on this later)
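The heap sizes above are set in conf/cassandra-env.sh; a hedged sketch for a node where a 12G heap proved adequate (the sizes are examples from the slide, not universal recommendations):

```shell
# conf/cassandra-env.sh -- override the auto-computed heap sizing.
MAX_HEAP_SIZE="12G"     # 12G worked OK; 24G approaches JVM limits
HEAP_NEWSIZE="800M"     # young generation; example value, tune per workload

# NUMA-aware allocation, reported above to work fine in production.
JVM_OPTS="$JVM_OPTS -XX:+UseNUMA"
```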
  16. Cassandra.(?:ya|f)ml
      ● index_interval: 512
        ○ save some memory on indexes
      ● compaction_throughput_mb_per_sec: 0
        ○ this can hurt your read latency, but in my experience leveled
          compaction falls behind under very high insert loads without
          this; use a bigger heap to compensate?
      ● rpc_server_type: hsha
        ○ if you have lots & lots of connections, e.g. from Hadoop,
          saves memory
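Collected into a cassandra.yaml fragment, the three overrides above look like this (values straight from the slide; 0 disables compaction throttling entirely):

```yaml
index_interval: 512                   # sparser index samples, less heap
compaction_throughput_mb_per_sec: 0   # unthrottled; may hurt read latency
rpc_server_type: hsha                 # half-sync/half-async Thrift server
```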
  17. Cassandra: Schema Tuning
      ● Enable compression
        ○ compression_options = {sstable_compression:
          org.apache.cassandra.io.compress.SnappyCompressor};
      ● Examine bloom filter false-positives
        ○ nodetool -h localhost cfstats | grep Bloom
        ○ bloom_filter_fp_chance = 0.1 # diminishing returns
      ● Reduce sstable count
        ○ memory pressure caused frequent memtable flushes
        ○ compaction throttling made it worse
        ○ compaction_strategy_options = {sstable_size_in_mb: 256}
      ● Give yourself time to repair
        ○ gc_grace = 5184000 # 60 days
        ○ shoot for (node_count * 86400 * 3) to be safe
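The gc_grace rule of thumb above is easy to sanity-check with a quick calculation (node_count=20 is an assumed example; the rule budgets 3 days of repair time per node):

```shell
# Hypothetical helper: derive gc_grace_seconds from cluster size
# using the slide's rule of thumb, node_count * 86400 * 3.
node_count=20
gc_grace=$((node_count * 86400 * 3))   # 3 days of seconds per node
echo "$gc_grace"
```

For a 20-node cluster this yields 5184000 seconds, exactly the 60-day value on the slide.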
  18. Future
      ● Upgrade all clusters to DSE 2.2
      ● Chef cookbook (likely open)
      ● Mixing CQL3 and Thrift API access
        ○ all lower case CF names
        ○ WITH COMPACT STORAGE
      ● Cassandra 1.2
        ○ native protocol
        ○ JBOD support
        ○ vnodes
        ○ compound row key support in CQL3
  19. MOAR
      ● Freenode IRC is a great resource
        ○ #cassandra, #cassandra-ops
      ● cassandra-users mailing list
      ● DataStax Enterprise
        ○ the Hadoop integration works and is useful
        ○ still playing with Solr
        ○ OpsCenter is really nice
      ● Me:
        ○ @AlTobey on Twitter
        ○ tobert on irc.freenode.net
        ○ https://gist.github.com/tobert
  20. More Information (again)
      ● cassandra-users mailing list
      ● irc.freenode.net #cassandra / #cassandra-ops
      ● http://www.datastax.com/docs/1.1/index
      ● @AlTobey / al@ooyala.com
      ● Contact me about open positions at Ooyala.
