Cassandra TK 2014 - Large Nodes


A discussion of running Cassandra with a large data load per node.


  1. CASSANDRA TK 2014. LARGE NODES WITH CASSANDRA. Aaron Morton @aaronmorton, Co-Founder & Principal Consultant, www.thelastpickle.com. Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License.
  2. About The Last Pickle. Work with clients to deliver and improve Apache Cassandra based solutions. Apache Cassandra Committer, DataStax MVP, Hector Maintainer, Apache Usergrid Committer. Based in New Zealand & USA.
  3. Large Node? “Avoid storing more than 500GB per node.” (Originally said about EC2 nodes.)
  4. Large Node? “You may have issues if you have over 1 billion keys per node.”
  5. Before version 1.2, large nodes had operational and performance concerns.
  6. After version 1.2, large nodes have fewer operational and performance concerns.
  7. Issues Pre 1.2. Work Arounds Pre 1.2. Improvements 1.2 to 2.1.
  8. Memory Management. Some in-memory structures grow with the number of rows and the size of the data.
  9. Bloom Filter. Stores a bitset used to determine, with a certain probability, whether a key exists in an SSTable. Size depends on the number of rows and bloom_filter_fp_chance.
  10. Bloom Filter. Allocates pages of 4096 longs in a long[][] array.
  11. Bloom Filter Size. Chart: Bloom filter size in MB (0 to 1,200) against millions of rows (1 to 1,000), for bloom_filter_fp_chance of 0.01 and 0.10.
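As a rough cross-check on the chart above (not from the deck), the standard Bloom filter sizing formula m = -n · ln(p) / (ln 2)² bits produces curves of the same shape. A minimal sketch; it ignores Cassandra's rounding up to pages of 4096 longs, so real figures are slightly higher.

```python
import math

def bloom_filter_mb(rows, fp_chance):
    """Approximate Bloom filter size using m = -n * ln(p) / (ln 2)^2 bits.
    Ignores Cassandra's rounding up to pages of 4096 longs, so the real
    figures are slightly higher."""
    bits = -rows * math.log(fp_chance) / (math.log(2) ** 2)
    return bits / 8 / 1024 / 1024

for rows in (1e6, 10e6, 100e6, 1e9):
    print(f"{rows/1e6:>6.0f}M rows: "
          f"fp 0.01 -> {bloom_filter_mb(rows, 0.01):7.1f} MB, "
          f"fp 0.10 -> {bloom_filter_mb(rows, 0.10):7.1f} MB")
```

At 1 billion rows this gives roughly 1.1 GB at an fp_chance of 0.01 versus about 0.6 GB at 0.10, in line with the chart.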
  12. Compression Metadata. Stores a long offset into the compressed Data.db file for each chunk_length_kb (default 64) of uncompressed data. Size depends on the uncompressed data size.
  13. Compression Metadata. Allocates pages of 4096 longs in a long[][] array.
  14. Compression Metadata Size. Chart: compression metadata size in MB (0 to 1,400) against uncompressed data size in GB (1 to 10,000), using SnappyCompressor.
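Since the metadata is one 8-byte offset per chunk_length_kb of uncompressed data, the chart can be approximated directly. A minimal sketch, again ignoring the rounding up to pages of 4096 longs:

```python
def compression_metadata_mb(uncompressed_gb, chunk_length_kb=64):
    """One 8-byte offset per chunk_length_kb of uncompressed data.
    Ignores the rounding up to pages of 4096 longs."""
    chunks = uncompressed_gb * 1024 * 1024 / chunk_length_kb
    return chunks * 8 / 1024 / 1024

for gb in (1, 10, 100, 1000, 10000):
    print(f"{gb:>6} GB uncompressed -> ~{compression_metadata_mb(gb):8.1f} MB of metadata")
```

10 TB of uncompressed data at the default 64KB chunks works out to roughly 1.25 GB of offsets, which matches the top of the chart.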
  15. Index Samples. Stores an offset into the -Index.db file for every index_interval (default 128) keys. Size depends on the number of rows and the size of the keys.
  16. Index Samples. Allocates a long[] for offsets and a byte[][] for row keys. (Version 1.2 uses on-heap structures.)
  17. Index Samples Total Size. Chart: index sample total size in MB (0 to 300) against millions of rows (1 to 1,000), split into position offsets and keys (25 bytes long).
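A back-of-the-envelope version of the chart, assuming an 8-byte position offset plus the raw key bytes per sample and the 25-byte keys used above; JVM object overhead is not counted, so real on-heap usage is higher.

```python
def index_sample_mb(rows, key_bytes=25, index_interval=128):
    """Approximate index sample size: one sample per index_interval rows,
    each holding an 8-byte position offset plus the key bytes.
    JVM object overhead is not counted."""
    samples = rows / index_interval
    return samples * (8 + key_bytes) / 1024 / 1024

for rows in (1e6, 10e6, 100e6, 1e9):
    print(f"{rows/1e6:>6.0f}M rows -> ~{index_sample_mb(rows):6.1f} MB of index samples")
```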
  18. Memory Management. Larger heaps (above 8GB) take longer to GC. A large working set results in frequent, prolonged GC.
  19. Bootstrap. The joining node requests data from one replica of each token range it will own. Sending is throttled by stream_throughput_outbound_megabits_per_sec (default 200 megabits, roughly 25 MB/s).
  20. Bootstrap. With RF 3, only three nodes will send data to a bootstrapping node. Maximum send rate is 75 MB/s (3 x 25 MB/s).
  21. Moving Nodes. Copy data from the existing node to the new node. At 50 MB/s, transferring 100GB takes about 33 minutes.
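The arithmetic behind slides 20 and 21 is plain throughput math. A small sketch; the 500 GB figure is an illustrative data size, the rates are the slide defaults:

```python
def transfer_minutes(data_gb, senders=3, per_sender_mb_s=25):
    """Time to stream data_gb (decimal GB, as in the slides) when
    `senders` nodes each send at per_sender_mb_s MB/s."""
    rate_mb_s = senders * per_sender_mb_s
    return data_gb * 1000 / rate_mb_s / 60

# Slide 20: bootstrapping 500 GB with RF 3 at the 25 MB/s default throttle.
print(f"{transfer_minutes(500):.0f} minutes")                                  # ~111
# Slide 21: moving 100 GB node-to-node at 50 MB/s with a single sender.
print(f"{transfer_minutes(100, senders=1, per_sender_mb_s=50):.0f} minutes")   # ~33
```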
  22. Disk Management. Need a multi-TB volume, or use multiple volumes.
  23. Disk Management with RAID-0. A single disk failure results in total node failure.
  24. Disk Management with RAID-10. Requires double the raw capacity.
  25. Disk Management with Multiple Volumes. Specified via data_file_directories. Write load is not distributed. A single failure will shut down the node.
  26. Repair. Compare data between nodes and exchange differences.
  27. Comparing Data for Repair. Calculate a Merkle Tree hash by reading all rows in a Table (Validation Compaction). Single comparator, throttled by compaction_throughput_mb_per_sec (default 16).
  28. Comparing Data for Repair. Time taken grows as the size of the data per node grows.
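To make that concrete, the Merkle tree build is bounded by the compaction throttle, so a lower bound on validation time per node is easy to estimate. A sketch assuming the whole table is read at the default 16 MB/s; real runs also depend on disk speed and concurrent load:

```python
def validation_hours(data_gb, compaction_mb_per_sec=16):
    """Lower bound on time to build Merkle trees if the read is limited
    by compaction_throughput_mb_per_sec (default 16 MB/s)."""
    return data_gb * 1024 / compaction_mb_per_sec / 3600

for gb in (100, 500, 1000):
    print(f"{gb:>5} GB per node -> at least {validation_hours(gb):4.1f} hours of validation compaction")
```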
  29. Exchanging Data for Repair. Ranges of rows with differences are streamed. Sending is throttled by stream_throughput_outbound_megabits_per_sec (default 200 megabits, roughly 25 MB/s).
  30. Compaction. Requires free space to write new SSTables.
  31. SizeTieredCompactionStrategy. Groups SSTables by size and assumes no reduction in size. In theory requires 50% free space; in practice it can work beyond 50%, though this is not recommended.
  32. LeveledCompactionStrategy. Groups SSTables by “level” and groups row fragments per level. Requires approximately 25% free space.
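Applying the free-space rules of thumb from slides 31 and 32 (roughly 50% headroom for size-tiered, roughly 25% for leveled) gives a quick capacity check; the percentages come from the slides, not from measurement:

```python
def usable_gb(raw_gb, strategy="SizeTieredCompactionStrategy"):
    """Usable data per node, given the headroom rules of thumb in the
    slides: keep ~50% free for size-tiered, ~25% free for leveled."""
    headroom = 0.50 if strategy == "SizeTieredCompactionStrategy" else 0.25
    return raw_gb * (1 - headroom)

for raw in (1000, 2000, 4000):
    print(f"{raw} GB raw -> STCS ~{usable_gb(raw):.0f} GB, "
          f"LCS ~{usable_gb(raw, 'LeveledCompactionStrategy'):.0f} GB usable")
```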
  33. Issues Pre 1.2. Work Arounds Pre 1.2. Improvements 1.2 to 2.1.
  34. Memory Management Work Arounds. Reduce Bloom Filter size by increasing bloom_filter_fp_chance from 0.01 to 0.1. May increase read latency. (See the sketch after slide 37.)
  35. Memory Management Work Arounds. Reduce Compression Metadata size by increasing chunk_length_kb. May increase read latency.
  36. Memory Management Work Arounds. Reduce Index Samples size by increasing index_interval to 512. May increase read latency.
  37. Memory Management Work Arounds. When necessary, use a 12GB MAX_HEAP_SIZE. Keep HEAP_NEWSIZE “reasonable”, e.g. less than 1200MB.
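The table-level changes from slides 34 and 35 can be applied with CQL, for example via the DataStax Python driver. A minimal sketch; the contact point, the keyspace and table names ("my_ks"/"my_table") and the 256KB chunk length are placeholders, not recommendations. The index_interval change from slide 36 lives in cassandra.yaml in version 1.2, and the heap settings from slide 37 in cassandra-env.sh, so neither is shown here.

```python
# A sketch using the DataStax Python driver (pip install cassandra-driver).
# The contact point, keyspace and table names are placeholders.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# Slide 34: trade a little read latency for a much smaller Bloom filter.
session.execute(
    "ALTER TABLE my_ks.my_table WITH bloom_filter_fp_chance = 0.1")

# Slide 35: larger chunks mean fewer offsets held in compression metadata
# (256KB here is an example value, four times the 64KB default).
session.execute(
    "ALTER TABLE my_ks.my_table WITH compression = "
    "{'sstable_compression': 'SnappyCompressor', 'chunk_length_kb': 256}")

cluster.shutdown()
```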
  38. Bootstrap Work Arounds. Increase streaming throughput via nodetool setstreamthroughput whenever possible.
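A minimal sketch of raising the throttle on the sending nodes for the duration of a bootstrap, assuming nodetool can reach them over JMX; the hostnames and the 400 megabit value are placeholders:

```python
# Raise the stream throttle on each node that will send data to the
# bootstrapping node, then restore the default afterwards. The hostnames
# and the 400 megabit value are placeholders; nodetool needs JMX access.
import subprocess

sending_nodes = ["node1.example.com", "node2.example.com", "node3.example.com"]

for host in sending_nodes:
    subprocess.check_call(["nodetool", "-h", host, "setstreamthroughput", "400"])

# ... once the bootstrap has completed, put the default back:
for host in sending_nodes:
    subprocess.check_call(["nodetool", "-h", host, "setstreamthroughput", "200"])
```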
  39. Moving Node Work Arounds. Copy a nodetool snapshot while the original node is operational. Copy only a delta once the original node is stopped.
  40. Disk Management Work Arounds. Use RAID-0 and over-provision nodes, anticipating failure. Or use RAID-10 and accept the additional costs.
  41. Repair Work Arounds. Only use repair if data is deleted; otherwise rely on Consistency Level for distribution. Run frequent, small repairs using token ranges.
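One way to run frequent, small repairs using token ranges is to split a node's primary range into slices and repair them one at a time. The sketch below shells out to nodetool repair with -st/-et, which requires a release with sub-range repair support (check your version); the keyspace, the range boundaries and the slice count are placeholders:

```python
# Repair a node's primary range in slices rather than in one long run.
# The keyspace, range boundaries (taken from `nodetool ring`) and slice
# count are placeholders; -st/-et needs a release with sub-range repair.
import subprocess

KEYSPACE = "my_ks"
RANGE_START = -9223372036854775808   # example Murmur3 tokens
RANGE_END = -4611686018427387904
SLICES = 16

step = (RANGE_END - RANGE_START) // SLICES
for i in range(SLICES):
    st = RANGE_START + i * step
    et = RANGE_END if i == SLICES - 1 else st + step
    subprocess.check_call(
        ["nodetool", "repair", "-st", str(st), "-et", str(et), KEYSPACE])
```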
  42. Compaction Work Arounds. Over-provision disk capacity when using SizeTieredCompactionStrategy. Reduce min_compaction_threshold (default 4) and max_compaction_threshold (default 32) to reduce the number of SSTables per compaction.
  43. Compaction Work Arounds. Use LeveledCompactionStrategy where appropriate.
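A sketch of applying slides 42 and 43 with the same driver pattern as above; the table name, the threshold values and sstable_size_in_mb are examples, and the two ALTER statements are alternatives for a given table, not a sequence. Note that switching to leveled compaction rewrites the table's existing SSTables, so it is not a casual change.

```python
# A sketch, assuming the same placeholder cluster/keyspace/table as above;
# pick one of the two alternatives below for a given table.
from cassandra.cluster import Cluster

USE_LEVELED = False  # flip to True to switch the table to leveled compaction

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

if not USE_LEVELED:
    # Slide 42: keep size-tiered, but cap how many SSTables a single
    # compaction will pull in (defaults are min 4 / max 32).
    session.execute(
        "ALTER TABLE my_ks.my_table WITH compaction = "
        "{'class': 'SizeTieredCompactionStrategy', "
        "'min_threshold': 4, 'max_threshold': 16}")
else:
    # Slide 43: switch to leveled compaction where the workload suits it.
    # This triggers a rewrite of the table's existing SSTables.
    session.execute(
        "ALTER TABLE my_ks.my_table WITH compaction = "
        "{'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160}")

cluster.shutdown()
```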
  44. Issues Pre 1.2. Work Arounds Pre 1.2. Improvements 1.2 to 2.1.
  45. Memory Management Improvements. Version 1.2 moved Bloom Filters and Compression Metadata off the JVM heap into native memory. Version 2.0 moved Index Samples off the JVM heap.
  46. Bootstrap Improvements. Virtual Nodes increase the number of token ranges per node from 1 to 256. A bootstrapping node can request data from 256 different nodes.
  47. Disk Layout Improvements. “JBOD” support distributes concurrent writes across multiple data_file_directories.
  48. Disk Layout Improvements. disk_failure_policy adds support for handling disk failure: ignore, stop, or best_effort.
  49. Repair Improvements. “Avoid repairing already-repaired data by default” (CASSANDRA-5351). Scheduled for 2.1.
  50. Compaction Improvements. “Avoid allocating overly large bloom filters” (CASSANDRA-5906). Included in 2.1.
  51. Thanks.
  52. Aaron Morton, @aaronmorton, www.thelastpickle.com. Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License.
