Cassandra TK 2014 - Large Nodes
A discussion of running Cassandra with a large data load per node.

    Cassandra TK 2014 - Large Nodes: Presentation Transcript

    • CASSANDRA TK 2014 LARGE NODES WITH CASSANDRA Aaron Morton @aaronmorton ! Co-Founder & Principal Consultant www.thelastpickle.com Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License
    • About The Last Pickle. Work with clients to deliver and improve Apache Cassandra based solutions. Apache Cassandra Committer, DataStax MVP, Hector Maintainer, Apache Usergrid Committer. Based in New Zealand & USA.
    • Large Node? ! “Avoid storing more than 500GB per node” ! (Originally said about EC2 nodes.)
    • Large Node? ! “You may have issues if you have over 1 Billion keys per node.”
    • Before version 1.2 large nodes had operational and performance concerns.
    • After version 1.2 large nodes have fewer operational and performance concerns.
    • Issues Pre 1.2. Work Arounds Pre 1.2. Improvements 1.2 to 2.1 !
    • Memory Management. Some in memory structures grow with number of rows and size of data.
    • Bloom Filter Stores a bitset used to determine whether a key may exist in an SSTable, with a configurable false positive probability. ! Size depends on the number of rows and bloom_filter_fp_chance.
    • Bloom Filter Allocates pages of 4096 longs in a long[][] array.
    • Bloom Filter Size. Chart: Bloom Filter size in MB (0 to 1,200) versus millions of rows (1 to 1,000), for bloom_filter_fp_chance of 0.01 and 0.10.
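      The chart's sizes can be roughly reproduced with the classical Bloom Filter formula m = -n·ln(p)/(ln 2)² bits, rounded up to the pages of 4096 longs described above; the Python sketch below is an approximation (not Cassandra's exact allocation code) and also quantifies the later work around of raising bloom_filter_fp_chance from 0.01 to 0.10.

        import math

        def bloom_filter_mb(rows, fp_chance):
            # Classical sizing: m = -n * ln(p) / (ln 2)^2 bits, then round up
            # to whole pages of 4096 longs (32KB each), as described above.
            bits = -rows * math.log(fp_chance) / (math.log(2) ** 2)
            page_bits = 4096 * 64
            pages = math.ceil(bits / page_bits)
            return pages * page_bits / 8 / 1024 ** 2   # bytes -> MB

        for rows in (10**6, 10**8, 10**9):
            print(rows, round(bloom_filter_mb(rows, 0.01)), round(bloom_filter_mb(rows, 0.10)))
        # At a billion rows this gives roughly 1,140MB at 0.01 versus roughly 570MB
        # at 0.10, in line with the chart.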
    • Compression Metadata Stores a long offset into the compressed Data.db file for each chunk_length_kb (default 64KB) of uncompressed data. ! Size depends on the uncompressed data size.
    • Compression Metadata Allocates pages of 4096 longs in a long[][] array.
    • Compression Metadata Size. Chart: compression metadata size in MB (0 to 1,400) versus uncompressed data size in GB (1 to 10,000), using the Snappy compressor.
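      As a rough approximation of the chart, the metadata is one 8-byte offset per chunk_length_kb of uncompressed data, allocated in pages of 4096 longs; the sketch below is an estimate (not Cassandra's allocation code) and also shows the effect of the later work around of raising chunk_length_kb.

        import math

        def compression_metadata_mb(uncompressed_gb, chunk_length_kb=64):
            # One 8-byte long offset per chunk of uncompressed data,
            # rounded up to pages of 4096 longs as described above.
            chunks = math.ceil(uncompressed_gb * 1024 ** 2 / chunk_length_kb)
            pages = math.ceil(chunks / 4096)
            return pages * 4096 * 8 / 1024 ** 2        # bytes -> MB

        print(compression_metadata_mb(1000))           # ~125MB for 1TB at 64KB chunks
        print(compression_metadata_mb(10000))          # ~1,250MB for 10TB
        print(compression_metadata_mb(10000, 256))     # ~313MB with chunk_length_kb=256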
    • Index Samples Stores an offset into the -Index.db file for every index_interval (default 128) keys. ! Size depends on the number of rows and the size of the keys. !
    • Index Samples Allocates a long[] for offsets and a byte[][] for row keys. ! (Version 1.2 uses on-heap structures.)
    • Index Samples Total Size. Chart: index sample total size in MB (0 to 300) versus millions of rows (1 to 1,000), split into position offsets and keys (25 bytes long).
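      A rough estimate of the chart's totals, assuming an 8-byte position, a 25-byte key, and a small assumed per-entry overhead for every index_interval-th row; it also quantifies the later work around of raising index_interval from 128 to 512.

        def index_samples_mb(rows, index_interval=128, key_bytes=25):
            # One sampled entry per index_interval rows: an 8-byte position
            # plus the key bytes plus an assumed 8 bytes of object overhead.
            samples = rows // index_interval
            per_sample = 8 + key_bytes + 8
            return samples * per_sample / 1024 ** 2    # bytes -> MB

        print(round(index_samples_mb(10**9)))          # ~305MB at the default interval of 128
        print(round(index_samples_mb(10**9, 512)))     # ~76MB with index_interval raised to 512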
    • Memory Management. Larger heaps (above 8GB) take longer to GC. ! A large working set results in frequent, prolonged GC.
    • Bootstrap. The joining node requests data from one replica of each token range it will own. ! Sending is throttled by stream_throughput_outbound_megabits_per_sec (default 200 megabits, i.e. 25 MB/sec).
    • Bootstrap. With RF 3, only three nodes will send data to a bootstrapping node. ! Maximum send rate is 75 MB/sec (3 × 25 MB/sec).
    • Moving Nodes. Copy data from existing node to new node. ! At 50 MB/s transferring 100GB takes 33 minutes.
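      The arithmetic behind these figures, using decimal units (1GB = 1,000MB): with RF 3 and the default 25MB/sec per sender the bootstrap ceiling is 75MB/sec, and copying 100GB at 50MB/sec takes about 33 minutes.

        def transfer_minutes(gb, mb_per_sec):
            # Time to move a given amount of data at a fixed rate (decimal units).
            return gb * 1000 / mb_per_sec / 60

        bootstrap_rate = 3 * 25                              # 75 MB/sec ceiling with RF 3
        print(round(transfer_minutes(1000, bootstrap_rate))) # ~222 minutes to bootstrap 1TB
        print(round(transfer_minutes(100, 50)))              # ~33 minutes to copy 100GB at 50MB/sec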
    • Disk Management. Need a multi TB volume or use multiple volumes.
    • Disk Management with RAID-0. Single disk failure results in total node failure.
    • Disk Management with RAID-10. Requires double the raw capacity.
    • Disk Management with Multiple Volumes. Specified via data_file_directories. ! Write load is not distributed. ! A single disk failure will shut down the node.
    • Repair. Compare data between nodes and exchange differences. !
    • Comparing Data for Repair. Calculate Merkle Tree hash by reading all rows in a Table. (Validation Compaction) ! Single comparator, throttled by compaction_throughput_mb_per_sec (default 16).
    • Comparing Data for Repair. Time taken grows as the size of the data per node grows.
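      A rough lower bound on validation time, assuming the validation compaction reads the whole table at the default compaction_throughput_mb_per_sec throttle of 16MB/sec (real throughput varies with hardware and contention):

        def validation_hours(data_gb, compaction_throughput_mb_per_sec=16):
            # Hours to read data_gb at the compaction throttle rate (decimal units).
            return data_gb * 1000 / compaction_throughput_mb_per_sec / 3600

        print(round(validation_hours(500), 1))    # ~8.7 hours to validate 500GB
        print(round(validation_hours(2000), 1))   # ~34.7 hours to validate 2TB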
    • Exchanging Data for Repair. Ranges of rows with differences are streamed. ! Sending is throttled by stream_throughput_outbound_megabits_per_sec (default 200 megabits, i.e. 25 MB/sec).
    • Compaction. Requires free space to write new SSTables.
    • SizeTieredCompactionStrategy. Groups SSTables by size, assumes no reduction in size. ! In theory it requires 50% free space; in practice it can work with the disk more than 50% full, though this is not recommended.
    • LeveledCompactionStrategy. Groups SSTables by “level” and groups row fragments per level. ! Requires approximately 25% free space.
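      Applying the free-space rules of thumb above to a hypothetical 2TB volume (these percentages are the deck's guidance, not hard limits):

        def usable_capacity_gb(disk_gb, strategy="SizeTiered"):
            # Keep ~50% of the volume free for SizeTieredCompactionStrategy,
            # ~25% free for LeveledCompactionStrategy, per the slides above.
            free_fraction = 0.50 if strategy == "SizeTiered" else 0.25
            return disk_gb * (1 - free_fraction)

        print(usable_capacity_gb(2000, "SizeTiered"))   # ~1,000GB of data on a 2TB volume
        print(usable_capacity_gb(2000, "Leveled"))      # ~1,500GB of data on a 2TB volume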
    • Issues Pre 1.2. Work Arounds Pre 1.2. Improvements 1.2 to 2.1 !
    • Memory Management Work Arounds. Reduce Bloom Filter size by increasing bloom_filter_fp_chance from 0.01 to 0.1. ! May increase read latency.
    • Memory Management Work Arounds. Reduce Compression Metadata size by increasing chunk_length_kb. ! May increase read latency.
    • Memory Management Work Arounds. Reduce Index Samples size by increasing index_interval to 512. ! May increase read latency.
    • Memory Management Work Arounds. When necessary use a 12GB MAX_HEAP_SIZE. ! Keep HEAP_NEWSIZE “reasonable” e.g. less than 1200MB.
    • Bootstrap Work Arounds. Increase streaming throughput via nodetool setstreamthroughput whenever possible.
    • Moving Node Work Arounds. Copy a nodetool snapshot while the original node is operational. ! Copy only a delta when the original node is stopped.
    • Disk Management Work Arounds. Use RAID-0 and over provision nodes anticipating failure. ! Use RAID-10 and accept additional costs.
    • Repair Work Arounds. Only run repair if data is deleted; otherwise rely on Consistency Level for distribution. ! Run frequent, small repairs using token ranges.
    • Compaction Work Arounds. Over provision disk capacity when using SizeTieredCompactionStrategy. ! Reduce min_compaction_threshold (default 4) and max_compaction_threshold (default 32) to reduce the number of SSTables per compaction.
    • Compaction Work Arounds. Use LeveledCompactionStrategy where appropriate.
    • Issues Pre 1.2. Work Arounds Pre 1.2. Improvements 1.2 to 2.1
    • Memory Management Improvements. Version 1.2 moved Bloom Filters and Compression Metadata off the JVM Heap to Native Memory. ! Version 2.0 moved Index Samples off the JVM Heap.
    • Bootstrap Improvements. Virtual Nodes increase the number of Token Ranges per node from 1 to 256. ! A bootstrapping node can request data from 256 different nodes.
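      A simple upper-bound sketch of why this helps: with the default 25MB/sec per sending node, the aggregate inbound rate scales with the number of distinct senders (the 20-sender figure is illustrative; the joining node's own disk and network still cap the real rate):

        def bootstrap_ceiling_mb_per_sec(sending_nodes, per_node_throttle_mb=25):
            # Aggregate inbound streaming ceiling when many nodes send concurrently.
            return sending_nodes * per_node_throttle_mb

        print(bootstrap_ceiling_mb_per_sec(3))     # 75 MB/sec without vnodes (RF 3)
        print(bootstrap_ceiling_mb_per_sec(20))    # 500 MB/sec with ranges spread over 20 senders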
    • Disk Layout Improvements. “JBOD” support distributes concurrent writes to multiple data_file_directories.
    • Disk Layout Improvements. disk_failure_policy adds support for handling disk failure. ! Valid values are ignore, stop, and best_effort.
    • Repair Improvements. “Avoid repairing already-repaired data by default” CASSANDRA-5351 ! Scheduled for 2.1
    • Compaction Improvements. “Avoid allocating overly large bloom filters” CASSANDRA-5906 ! Included in 2.1
    • Thanks. !
    • Aaron Morton @aaronmorton www.thelastpickle.com Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License