Managing Cassandra at Scale by Al Tobey

Al Tobey presents on Managing Cassandra at Scale.

  1. Managing Cassandra: SCALE 12x. @AlTobey, Open Source Mechanic | DataStax, open source evangelist for Apache Cassandra. Obsessed with infrastructure my whole life; I see distributed systems everywhere.
  2. Five years of Cassandra: from 0.1 in July 2008 through 0.3, 0.6, 0.7, 1.0, 1.2, DSE, and 2.0 over five years (timeline chart).
  3. Why Cassandra? Or any non-relational?
  4. Al Tobey (アルトビー), with Leo and Rufus, my boys. I did not have permission from my wife to use her image ;)
  5. Less tolerant. For the record: can you spell HTTP? More and more services are online, and they must be available.
  6. Why choose Cassandra? 100,000,000 internet users in Japan, and across the world: B2B, video conferencing, shopping, entertainment.
  7. Traditional solutions… may not be a good fit.
  8. Client-server: all your wildest dreams will come true. The client-server database architecture is obsolete; it is the original “funnel-shaped” architecture.
  9. 3-tier: client - application server - database architecture. Still suitable for small applications, e.g. LAMP, RoR.
  10. 3-tier + caching (master, slaves, cache): more complex, cache coherency is a hard problem, and cascading failures are common. Next: out with the funnel, in with the ring.
  11. Webscale. Outer ring: clients (cell phones, etc.); middle ring: application servers; inner ring: Cassandra servers. Serving millions of clients with mere hundreds or thousands of nodes requires a different approach to applications!
  12. When it rains: scale out… but at a cost.
  13. A new plan
  14. Dynamo paper (2007): how do we build a data store that is reliable, performant, and “always on”? Nothing new and shiny: evolutionary, real computer science. Also the basis for Riak and Voldemort.
  15. BigTable (2006): richer data model; one key, lots of values; fast sequential access; 38 papers cited.
  16. Cassandra (2008): the distributed features of Dynamo with the data model and storage from BigTable. On February 17, 2010 it graduated to a top-level Apache project.
  17. Cassandra - more than one server: all nodes participate in a cluster; shared nothing; add or remove nodes as needed. More capacity? Add a server.
  18. VLDB benchmark (RWS): throughput (ops/sec) chart comparing Cassandra, HBase, Redis, and MySQL.
  19. Cassandra - fully replicated: clients write locally; data syncs across the WAN; replication is configured per data center.
  20. Read-modify-write: UPDATE Employees SET Rank=4, Promoted='2014-01-24' WHERE EmployeeID=1337; the row (EmployeeID 1337, Name アルトビー, StartDate 2013-10-01, Rank 3, Promoted null) becomes (EmployeeID 1337, Name アルトビー, StartDate 2013-10-01, Rank 4, Promoted 2014-01-24). This might be what it looks like from SQL / CQL, but…
  21. Read-modify-write in an RDBMS: the same UPDATE Employees SET Rank=4, Promoted='2014-01-24' WHERE EmployeeID=1337; takes the row from (Rank 3, Promoted null) to (Rank 4, Promoted 2014-01-24). TNSTAAFL: there is no such thing as a free lunch. If you're lucky, the cell is in cache; otherwise it's one disk access to read and another to write.
  22. Eventual consistency: the same UPDATE Employees SET Rank=4, Promoted='2014-01-24' WHERE EmployeeID=1337; now goes through a coordinator node, which forwards it to the replicas holding the row (Rank 3, Promoted null before; Rank 4, Promoted 2014-01-24 after). Distributed RMW is more complicated; how it's abstracted in CQL comes later.
  23. Eventual consistency: the coordinator handles the read and write paths; replication happens on write, depending on RF, usually RF=3. Reads AND writes remain available through partitions. Hinted handoff covers replicas that are temporarily down.
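      Consistency per request is tunable; below is a minimal cqlsh sketch (not from the deck), assuming the Employees table from the slides above exists in the current keyspace with RF=3:

      -- cqlsh: make this session wait for a majority of replicas (with RF=3, that's 2)
      CONSISTENCY QUORUM

      -- the same update as above; the coordinator now blocks until two replicas acknowledge
      UPDATE Employees SET Rank=4, Promoted='2014-01-24' WHERE EmployeeID=1337;

      -- a subsequent quorum read is guaranteed to overlap with the quorum write
      SELECT Rank, Promoted FROM Employees WHERE EmployeeID=1337;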
  24. Avoiding read-modify-write: the cassandra13_drinks column family, currently in the memtable: Albert {Tuesday: 6, Wednesday: 0}, Evan {Tuesday: 0, Wednesday: 0}, Frank {Tuesday: 3, Wednesday: 3}, Kelvin {Tuesday: 0, Wednesday: 0}, Krzysztof {Tuesday: 0, Wednesday: 0}, Phillip {Tuesday: 12, Wednesday: 0}. A CF to track how much I expect my team at Ooyala to drink: row keys are names, column keys are days, values are a count of drinks.
  25. Avoiding read-modify-write: the next day, after a flush. Memtable: Albert {Tuesday: 2, Wednesday: 0}, Phillip {Tuesday: 0, Wednesday: 1}; sstable: Albert {Tuesday: 6, Wednesday: 0}, Evan {Tuesday: 0, Wednesday: 0}, Frank {Tuesday: 3, Wednesday: 3}, Kelvin {Tuesday: 0, Wednesday: 0}, Krzysztof {Tuesday: 0, Wednesday: 0}, Phillip {Tuesday: 12, Wednesday: 0}. I'm speaking so I decided to drink less; Phillip informs me that he has quit drinking.
  26. Avoiding read-modify-write: I'm drinking with all you people, so I decide to add 20: read 2, add 20, write 22. Memtable: Albert {Tuesday: 22, Wednesday: 0}; sstables: Albert {Tuesday: 2, Wednesday: 0}, Phillip {Tuesday: 0, Wednesday: 1}; Albert {Tuesday: 6, Wednesday: 0}, Evan {Tuesday: 0, Wednesday: 0}, Frank {Tuesday: 3, Wednesday: 3}, Kelvin {Tuesday: 0, Wednesday: 0}, Krzysztof {Tuesday: 0, Wednesday: 0}, Phillip {Tuesday: 12, Wednesday: 0}.
  27. Avoiding read-modify-write: after compaction and conflict resolution, a single sstable remains: Albert {Tuesday: 22, Wednesday: 0}, Evan {Tuesday: 0, Wednesday: 0}, Frank {Tuesday: 3, Wednesday: 3}, Kelvin {Tuesday: 0, Wednesday: 0}, Krzysztof {Tuesday: 0, Wednesday: 0}, Phillip {Tuesday: 0, Wednesday: 1}. Overwriting the same value is just fine! This works really well for some patterns such as time-series data. Separate read/write streams are handy for debugging, but not a big deal.
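      A minimal CQL sketch of this pattern; the table definition is an assumption (the deck shows the column family only as a diagram), with names as the partition key and days as the clustering key:

      CREATE TABLE cassandra13_drinks (
        name   varchar,   -- row key: person
        day    varchar,   -- column key: day of the week
        drinks int,       -- value: count of drinks
        PRIMARY KEY (name, day)
      );

      -- blind overwrites: the write path never reads the old cell; cell timestamps
      -- resolve conflicts at read and compaction time, last write wins
      UPDATE cassandra13_drinks SET drinks = 22 WHERE name = 'Albert'  AND day = 'Tuesday';
      UPDATE cassandra13_drinks SET drinks = 0  WHERE name = 'Phillip' AND day = 'Tuesday';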
  28. Overwriting:
      CREATE TABLE host_lookup ( name varchar, id uuid, PRIMARY KEY(name) );
      INSERT INTO host_lookup (name,id) VALUES ('www.tobert.org', 463b03ec-fcc1-4428-bac8-80ccee1c2f77);
      INSERT INTO host_lookup (name,id) VALUES ('tobert.org', 463b03ec-fcc1-4428-bac8-80ccee1c2f77);
      INSERT INTO host_lookup (name,id) VALUES ('www.tobert.org', 463b03ec-fcc1-4428-bac8-80ccee1c2f77);
      SELECT id FROM host_lookup WHERE name='tobert.org';
      Beware of expensive compaction. Best for: small indexes, lookup tables. Compaction handles RMW at the storage level in the background. Under heavy writes, clock synchronization is very important to avoid timestamp collisions; in practice this isn't a problem very often, and even when it goes wrong, not much harm is done.
  29. Key/Value:
      CREATE TABLE keyval ( key varchar, value blob, PRIMARY KEY(key) );
      INSERT INTO keyval (key,value) VALUES (?, ?);
      SELECT value FROM keyval WHERE key=?;
      e.g. memcached. Don't do this, but it works when you really need it.
  30. Journaling / Logging / Time-series:
      CREATE TABLE tsdb (
        time_bucket timestamp,
        time timestamp,
        value blob,
        PRIMARY KEY(time_bucket, time)
      );
      INSERT INTO tsdb (time_bucket, time, value) VALUES (
        '2014-10-24',            -- 1-day bucket (UTC)
        '2014-10-24T12:12:12Z',  -- ALWAYS USE UTC
        '{"foo": "bar"}'
      );
      Oversimplified; use normalization over blobs whenever possible. ALWAYS USE UTC :)
  31. Journaling / Logging / Time-series, as stored:
      2014-01-24: 2014-01-24T12:12:12Z => {"key": "value"}, 2014-01-24T21:21:21Z => {"key": "value"}
      2014-01-25: 2014-01-25T13:13:13Z => {"key": "value"}
      i.e. { "2014-01-24" => { "2014-01-24T12:12:12Z" => '{"foo": "bar"}' } }
      Oversimplified; use normalization over blobs whenever possible. ALWAYS USE UTC :)
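      A hedged sketch of how the tsdb table above is typically read back, one bucket (partition) at a time; the bucket and range values here are illustrative:

      -- fetch one day's worth of points from a single partition, newest first
      SELECT time, value
        FROM tsdb
       WHERE time_bucket = '2014-10-24'
         AND time >= '2014-10-24T00:00:00Z'
         AND time <  '2014-10-25T00:00:00Z'
       ORDER BY time DESC;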
  32. Cassandra Collections:
      CREATE TABLE posts (
        id uuid,
        body varchar,
        created timestamp,
        authors set<varchar>,
        tags set<varchar>,
        PRIMARY KEY(id)
      );
      INSERT INTO posts (id,body,created,authors,tags) VALUES (
        ea4aba7d-9344-4d08-8ca5-873aa1214068,
        'アルトビーの犬はばかね',   -- "Al Tobey's dog is silly"
        dateof(now()),
        {'アルトビー', 'ィオートビー'},
        {'dog', 'silly', '犬', 'ばか'}
      );
      Quick story about 犬ばかね (the silly dog). Sets & maps are CRDTs, safe to modify.
  33. Cassandra Collections:
      CREATE TABLE metrics (
        bucket timestamp,
        time timestamp,
        value blob,
        labels map<varchar,varchar>,
        PRIMARY KEY(bucket)
      );
      Sets & maps are CRDTs, safe to modify.
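      Because sets and maps can be modified without reading first, here is a small sketch of in-place collection updates against the posts table defined above; the id is the one from the earlier INSERT:

      -- add and remove set elements; no read-modify-write round trip on the client
      UPDATE posts SET tags = tags + {'cassandra'}
       WHERE id = ea4aba7d-9344-4d08-8ca5-873aa1214068;

      UPDATE posts SET tags = tags - {'silly'}
       WHERE id = ea4aba7d-9344-4d08-8ca5-873aa1214068;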
  34. Lightweight Transactions: Cassandra 2.0 and on supports LWT based on Paxos, a distributed consensus protocol. Given a constraint, Cassandra ensures correct ordering.
  35. Lightweight Transactions:
      UPDATE users SET username='tobert'
        WHERE id=68021e8a-9eb0-436c-8cdd-aac629788383
        IF username='renice';
      INSERT INTO users (id, username)
        VALUES (68021e8a-9eb0-436c-8cdd-aac629788383, 'renice')
        IF NOT EXISTS;
      Client error on conflict.
  36. Brendan Gregg's Tools Chart
  37. dstat -lrvn 10
  38. iostat -x 1
  39. htop
  40. DataStax OpsCenter
  41. Config Changes: Apache 1.0 ➜ DSE 3.0. Schema: compaction_strategy = LCS, bloom_filter_fp_chance = 0.1, sstable_size_in_mb = 256, compression_options = Snappy; YAML: compaction_throughput_mb_per_sec: 0. ⁍ LCS is a huge improvement in operations life (no more major compactions) ⁍ bloom filters were tipping over a 24GiB heap ⁍ with lots of data per node, sstable sizes in LCS must be MUCH bigger: more than 100,000 open files slows everything down, especially startup, and 256MB vs. 5MB is a 50x reduction in file count (the default has been fixed as of C* 2.0) ⁍ compaction can't keep up: even huge rates don't work, it must be disabled; try to adjust heap etc. so you're flushing nearly full memtables to reduce compaction needs (back-reference the RMW discussion; might be fixed in >= 1.2).
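      A hedged CQL3 sketch of applying the schema-side settings above (keyspace and table names are placeholders; the slide lists the older Thrift-era option names, and the compaction_throughput line still belongs in cassandra.yaml):

      ALTER TABLE my_keyspace.my_table
        WITH compaction  = { 'class': 'LeveledCompactionStrategy',
                             'sstable_size_in_mb': 256 }
         AND bloom_filter_fp_chance = 0.1
         AND compression = { 'sstable_compression': 'SnappyCompressor' };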
  42. nodetool ring (note the hotspots):
      10.10.10.10  Analytics  rack1  Up  Normal   47.73 MB   1.72%   101204669472175663702469172037896580098
      10.10.10.10  Analytics  rack1  Up  Normal   63.94 MB   0.86%   102671403812352122596707855690619718940
      10.10.10.10  Analytics  rack1  Up  Normal   85.73 MB   0.86%   104138138152528581490946539343342857782
      10.10.10.10  Analytics  rack1  Up  Normal   47.87 MB   0.86%   105604872492705040385185222996065996624
      10.10.10.10  Analytics  rack1  Up  Normal   39.73 MB   0.86%   107071606832881499279423906648789135466
      10.10.10.10  Analytics  rack1  Up  Normal   40.74 MB   1.75%   110042394566257506011458285920000334950
      10.10.10.10  Analytics  rack1  Up  Normal   40.08 MB   2.20%   113781420866907675791616368030579466301
      10.10.10.10  Analytics  rack1  Up  Normal   56.19 MB   3.45%   119650151395618797017962053073524524487
      10.10.10.10  Analytics  rack1  Up  Normal  214.88 MB  11.62%   139424886777089715561324792149872061049
      10.10.10.10  Analytics  rack1  Up  Normal  214.29 MB   2.45%   143588210871399618110700028431440799305
      10.10.10.10  Analytics  rack1  Up  Normal  158.49 MB   1.76%   146577368624928021690175250344904436129
      10.10.10.10  Analytics  rack1  Up  Normal   40.3 MB    0.92%   148140168357822348318107048925037023042
  43. nodetool cfstats (watch bloom filter memory and sstable_size_in_mb):
      Keyspace: gostress
        Read Count: 0
        Read Latency: NaN ms.
        Write Count: 0
        Write Latency: NaN ms.
        Pending Tasks: 0
        Column Family: stressful
          SSTable count: 1                    <- controllable by sstable_size_in_mb
          Space used (live): 32981239
          Space used (total): 32981239
          Number of Keys (estimate): 128
          Memtable Columns Count: 0
          Memtable Data Size: 0
          Memtable Switch Count: 0
          Read Count: 0
          Read Latency: NaN ms.
          Write Count: 0
          Write Latency: NaN ms.
          Pending Tasks: 0
          Bloom Filter False Positives: 0
          Bloom Filter False Ratio: 0.00000
          Bloom Filter Space Used: 336        <- could be using a lot of heap
          Compacted row minimum size: 7007507
          Compacted row maximum size: 8409007
          Compacted row mean size: 8409007
  44. nodetool proxyhistograms (units are microseconds; gives a good idea of how much latency coordinator hops are costing you):
      Offset   Read Latency   Write Latency   Range Latency
      35             0              20             0
      42             0              61             0
      50             0              82             0
      60             0             440             0
      72             0            3416             0
      86             0           17910             0
      103            0           48675             0
      124            1           97423             0
      149            0          153109             0
      179            2          186205             0
      215            5          139022             0
      258          134           44058             0
      310         2656           60660             0
      372        34698          742684             0
      446       469515         7359351             0
      535      3920391        31030588             0
      642      9852708        33070248             0
      770      4487796         9719615             0
      924       651959          984889             0
  45. nodetool compactionstats:
      al@node ~ $ nodetool compactionstats
      pending tasks: 3
      compaction type   keyspace   column family     bytes compacted   bytes total    progress
      Compaction        hastur     gauge_archive     9819749801        16922291634    58.03%
      Compaction        hastur     counter_archive   12141850720       16147440484    75.19%
      Compaction        hastur     mark_archive      647389841         1475432590     43.88%
      Active compaction remaining time : n/a

      al@node ~ $ nodetool compactionstats
      pending tasks: 3
      compaction type   keyspace   column family     bytes compacted   bytes total    progress
      Compaction        hastur     gauge_archive     10239806890       16922291634    60.51%
      Compaction        hastur     counter_archive   12544404397       16147440484    77.69%
      Compaction        hastur     mark_archive      1107897093        1475432590     75.09%
      Active compaction remaining time : n/a
  46. Stress Testing Tools: cassandra-stress, YCSB, production traffic, Terasort (DSE), homegrown tools. ⁍ We mostly focus on cassandra-stress for burn-in of new clusters; it can quickly figure out the right setting for -Xmn ⁍ Terasort is interesting for comparing DSE to Cloudera/Hortonworks/etc. (it's fast!) ⁍ consider writing custom benchmarks for your application patterns; sometimes it's faster to write one than to figure out how to make a generic tool do what you want.
  47. /etc/sysctl.conf:
      kernel.pid_max = 999999
      fs.file-max = 1048576
      vm.max_map_count = 1048576
      net.core.rmem_max = 16777216
      net.core.wmem_max = 16777216
      net.ipv4.tcp_rmem = 4096 65536 16777216
      net.ipv4.tcp_wmem = 4096 65536 16777216
      vm.swappiness = 1
      ⁍ pid_max doesn't fix anything, I just like it and have never had a problem with it ⁍ these are my starting-point settings for nearly every system/application, generally safe for production ⁍ vm.dirty*ratio can go big for fake fast writes, generally safe for Cassandra, but beware: you're more likely to see FS/file corruption on power loss, and you will get latency spikes if you hit dirty_ratio (a percentage of RAM), so don't tune it too low.
  48. /etc/rc.local:
      ra=$((2**14))  # 16k
      ss=$(blockdev --getss /dev/sda)
      blockdev --setra $(($ra / $ss)) /dev/sda
      echo 128 > /sys/block/sda/queue/nr_requests
      echo deadline > /sys/block/sda/queue/scheduler
      echo 16384 > /sys/block/md7/md/stripe_cache_size
      ⁍ Lower readahead is better for latency on seeky workloads; more readahead will artificially increase your IOPS by reading a bunch of stuff you might not need ⁍ nr_requests is the number of IO structs the kernel will keep in flight, don't go crazy ⁍ deadline is best for raw throughput; CFQ supports cgroup priorities and is occasionally better for latency on SATA drives ⁍ the default stripe cache is 128, and the increase seems to help MD RAID5 a lot ⁍ don't forget to set readahead separately for MD RAID devices.
  49. Ending discussion notes: 2-socket servers with ECC memory; 16GiB RAM minimum, prefer 32-64GiB; over 128GiB and Linux will need serious tuning. SSD where possible; the Samsung 840 Pro is a good choice, and any Intel is fine. NO SAN/NAS (20ms latency tops); if you MUST (and please, don't), dedicate spindles to C* nodes and use a separate network. Avoid disk configurations targeted at Hadoop; those disks are too slow. See http://www.datastax.com/documentation/cassandra/2.0/pdf/cassandra20.pdf and read the sections on Repair, Tombstones & Snapshots.
