Scaling Cassandra for Big Data


  1. Scaling Cassandra for Big Data
     Al Tobey
     Tech Lead, Data and Compute Services, Ooyala, Inc.
     al@ooyala.com / @AlTobey
  2. Why Cassandra?
     ● Highly available
     ● Commodity hardware
     ● Horizontal scale
     ● Columnar data model
     ● Open source
  3. What does Ooyala use it for?
     ● Fast access to data generated by Map/Reduce
     ● High-availability key/value out of Storm
     ● Cross-device resume (playhead tracking)
     ● ML predictions
     ● Time-series data, raw events & application metrics
  4. The Beginning
     ● Our data is doubling every year
     ● Cluster size: 18 nodes
     ● Biggest CF: 2TB
     ● Repairs becoming a problem
     ● Expired tombstones
  5. First Migration
     ● Upgrade from C* 0.6 to 0.8
     ● Remove expired tombstones
     ● Scrub data and rebuild indexes
     ● Lots of Linux performance tuning
     ● Map/Reduce
  6. Second Migration
     ● Upgrade to Cassandra 1.0
     ● Remove expired tombstones
     ● Update schema
     ● More Linux performance tuning
     ● Map/Reduce - this time using DSE Hadoop
  7. Tuning Highlights
     ● Bloom filter false-positive chance (schema)
     ● Index density (schema)
     ● LeveledCompaction sstable size (schema)
     ● XFS filesystem bugs (Linux)
       ○ Stick with ext4 if you like to sleep.
     ● NO SWAP!
  8. More Information
     ● cassandra-users mailing list
     ● irc.freenode.net #cassandra / #cassandra-ops
     ● http://www.datastax.com/docs/1.1/index
     ● @AlTobey / al@ooyala.com
     ● Contact me about open positions at Ooyala.
  9. Rejected Slides follow:
     An old version of this deck was a lot more technical. I've added them
     back for online posting since people have asked about the specifics.
  10. Linux: General Observations
      ● Use a modern kernel, 2.6.32 is ancient
        ○ running 3.4.11 on new production hardware
        ○ default Ubuntu Lucid / Oneiric kernels in EC2
      ● I have yet to use XFS bug-free
        ○ 2.6.38 has an especially fun bug
        ○ allocsize=64m allocates 64m always & forever
        ○ echo 1 > /proc/sys/vm/drop_caches
      ● Put commit log on a different filesystem
      ● btrfs works fine in production
      ● Block alignment is hard
        ○ use GPT disk labels and it's generally not an issue
        ○ or just skip disk labels and RAID whole disks
  11. Linux: almost a server OS
      /etc/security/limits.conf:
      * - memlock  unlimited
      * - nofile   1048576
      * - fsize    unlimited
      * - nproc    999999
  12. Linux: I love bufferbloat
      /etc/sysctl.conf:
      kernel.sysrq = 1
      kernel.panic = 300
      fs.file-max = 1048576
      kernel.pid_max = 999999
      vm.max_map_count = 1048576
      net.core.rmem_max = 16777216
      net.core.wmem_max = 16777216
      net.ipv4.tcp_rmem = 4096 65536 16777216
      net.ipv4.tcp_wmem = 4096 65536 16777216
  13. Ubuntu: FFFFFFFUUUUUUUUUUU
      /etc/fstab:
      /dev/md4 /commit ext4 nobootwait,barrier=1,journal_ioprio=0,rw 0 0
      /dev/md7 /srv    xfs  nobootwait,rw 0 0
      ● Force barriers for the journal
      ● noatime & relatime aren't necessary anymore
        ○ since ~2.6.31
      ● nobootwait is an upstart option
        ○ set this or upstart will troll you at 4am
        ○ mountall hangs on boot for any error without this
        ○ use on both hardware and EC2 unless you love using OOB consoles
      ● As noted, XFS is buggy, so consider ext4.
  14. Linux: Final Adjustments
      /etc/rc.local (or whatever you prefer)
      ● CFQ disk scheduler
        ○ deadline is still faster, but no cgroup support
        ○ noop is a popular choice in EC2, SSD, and HW RAID
      ● Tune readahead
        ○ don't go crazy, 64k is a decent choice
        ○ big RA will inflate your bandwidth numbers, but really large
          values will waste IO on unused data
      ● If running MD RAID5/6
        ○ echo 16384 > /sys/block/$md/md/stripe_cache_size
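The adjustments on this slide can be collected into an /etc/rc.local fragment; a minimal sketch, assuming /dev/sda is a data disk and /dev/md0 is the MD array (device names are placeholders, adjust for your hardware):

```shell
#!/bin/sh
# Select the CFQ elevator on the data disk (deadline or noop are the
# alternatives discussed above).
echo cfq > /sys/block/sda/queue/scheduler

# Modest readahead: blockdev --setra takes 512-byte sectors,
# so 128 sectors = 64 KiB.
blockdev --setra 128 /dev/sda

# For MD RAID5/6, enlarge the stripe cache as suggested above.
echo 16384 > /sys/block/md0/md/stripe_cache_size
```

These writes require root and are not persistent across reboots, which is why the slide puts them in rc.local.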
  15. JVM: ALL THE MEMORY
      ● Use Oracle JVM 1.6 for Cassandra
        ○ OpenJDK works, still not recommended
        ○ use fpm to create packages if you don't have them
      ● Default Cassandra GC settings are OK
        ○ -XX:+UseNUMA
          ■ works fine in production
          ■ Apache scripts will use numactl if installed
            ● DSE does not! (yet)
        ○ Bigger data will need bigger heaps.
          ■ 12G seems to work OK
          ■ 24G works, but approaching limits of JVM
          ■ too little free memory causes excessive memtable flushing
            (more on this later)
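The heap sizes above are set in conf/cassandra-env.sh; a hedged sketch for a node where a 12G heap proved adequate (the sizes are examples from the slide, not universal recommendations):

```shell
# conf/cassandra-env.sh -- override the auto-computed heap sizing.
MAX_HEAP_SIZE="12G"     # 12G worked OK; 24G approaches JVM limits
HEAP_NEWSIZE="800M"     # young generation; example value, tune per workload

# NUMA-aware allocation, reported above to work fine in production.
JVM_OPTS="$JVM_OPTS -XX:+UseNUMA"
```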
  16. Cassandra.(?:ya|f)ml
      ● index_interval: 512
        ○ save some memory on indexes
      ● compaction_throughput_mb_per_sec: 0
        ○ this can hurt your read latency, but in my experience leveled
          compaction falls behind under very high insert loads without
          this; use a bigger heap to compensate?
      ● rpc_server_type: hsha
        ○ if you have lots & lots of connections, e.g. from Hadoop,
          saves memory
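Collected into a cassandra.yaml fragment, the three overrides above look like this (values straight from the slide; 0 disables compaction throttling entirely):

```yaml
index_interval: 512                   # sparser index samples, less heap
compaction_throughput_mb_per_sec: 0   # unthrottled; may hurt read latency
rpc_server_type: hsha                 # half-sync/half-async Thrift server
```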
  17. Cassandra: Schema Tuning
      ● Enable compression
        ○ compression_options = {sstable_compression:
          org.apache.cassandra.io.compress.SnappyCompressor};
      ● Examine bloom filter false-positives
        ○ nodetool -h localhost cfstats | grep Bloom
        ○ bloom_filter_fp_chance = 0.1 # diminishing returns
      ● Reduce sstable count
        ○ memory pressure caused frequent memtable flushes
        ○ compaction throttling made it worse
        ○ compaction_strategy_options = {sstable_size_in_mb: 256}
      ● Give yourself time to repair
        ○ gc_grace = 5184000 # 60 days
        ○ shoot for (node_count * 86400 * 3) to be safe
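The gc_grace rule of thumb above is easy to sanity-check with a quick calculation (node_count=20 is an assumed example; the rule budgets 3 days of repair time per node):

```shell
# Hypothetical helper: derive gc_grace_seconds from cluster size
# using the slide's rule of thumb, node_count * 86400 * 3.
node_count=20
gc_grace=$((node_count * 86400 * 3))   # 3 days of seconds per node
echo "$gc_grace"
```

For a 20-node cluster this yields 5184000 seconds, exactly the 60-day value on the slide.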
  18. Future
      ● Upgrade all clusters to DSE 2.2
      ● Chef cookbook (likely open)
      ● Mixing CQL3 and Thrift API access
        ○ all lower case CF names
        ○ WITH COMPACT STORAGE
      ● Cassandra 1.2
        ○ native protocol
        ○ JBOD support
        ○ vnodes
        ○ compound row key support in CQL3
  19. MOAR
      ● Freenode IRC is a great resource
        ○ #cassandra, #cassandra-ops
      ● cassandra-users mailing list
      ● DataStax Enterprise
        ○ the Hadoop integration works and is useful
        ○ still playing with Solr
        ○ OpsCenter is really nice
      ● Me:
        ○ @AlTobey on Twitter
        ○ tobert on irc.freenode.net
        ○ https://gist.github.com/tobert
  20. More Information (again)
      ● cassandra-users mailing list
      ● irc.freenode.net #cassandra / #cassandra-ops
      ● http://www.datastax.com/docs/1.1/index
      ● @AlTobey / al@ooyala.com
      ● Contact me about open positions at Ooyala.
