About "Apache Cassandra"

This presentation explains Apache Cassandra's concepts and architecture.

My friends and colleagues said,
"This presentation should be released publicly to help the many people working in IT,"
so I am uploading this file for everyone who loves technology for the people.

This presentation was used to train KT employees last year.

Transcript

  • 1. APACHE CASSANDRA Scalability, Performance and Fault Tolerance in Distributed Databases  Jihyun An (jihyun.an@kt.com), 18 June 2013
  • 2. TABLE OF CONTENTS  Preface  Basic Concepts  P2P Architecture  Primitive Data Model & Architecture  Basic Operations  Fault Management  Consistency  Performance  Problem handling
  • 3. TABLE OF CONTENTS (NEXT TIME)  Maintaining  Cluster Management  Node Management  Problem Handling  Tuning  Playing (for development, from the client's perspective)  Designing  Client  Thrift  Native  CQL  3rd party  Hector  OCM  Extension  Baas.io  Hadoop
  • 4. PREFACE
  • 5. OUR WORLD  Traditional DBMSs are still very valuable  Storage (+memory) and computational resources are cheaper than before  But we face a new landscape  Big data  (Near) real time  Complex and varied requirements  Recommendation  Finding FOAF  …  Event-driven triggering  User sessions  …
  • 6. OUR WORLD (CONT)  Complex applications combine different types of problems  Different languages -> more productive  e.g., functional languages, languages optimized for multiprocessing  Polyglot persistence layer  Performance vs durability?  Reliability?  …
  • 7. TRADITIONAL DBMS  Relational model  Well-defined schema  Access via selection/projection  Derived from joining/grouping/aggregating (counting ...)  Small (refined) data  …  But  Painful data model changes  Hard to scale out  Ineffective at handling large volumes of data  Does not take the hardware into account  …
  • 8. TRADITIONAL DBMS (CONT)  Many constraints to uphold ACID  PK/FK checking  Domain type checking  ... checking, checking  Lots of IO / processing  OODBMS, ORDBMS  Good, but even more checking / processing  Do not play well with disk IO
  • 9. NOSQL  Key-value stores  Column: Cassandra, HBase, Bigtable …  Others: Redis, Dynamo, Voldemort, Hazelcast …  Document oriented  MongoDB, CouchDB …  Graph stores  Neo4j, OrientDB, BigOWL, FlockDB …
  • 10. NOSQL (CONT) Benefits  Higher performance  Higher scalability  Flexible data model  More effective for some cases  Less administrative overhead Drawbacks  Limited transactions  Relaxed consistency  Unconstrained data  Limited ad-hoc query capabilities  Limited administrative tooling
  • 11. CAP  Brewer's theorem: we can pick two of Consistency, Availability, Partition tolerance.  [Diagram, CAP triangle: AP: Amazon Dynamo derivatives (Cassandra, Voldemort, CouchDB, Riak); CP: Bigtable and derivatives (MongoDB, HBase, Hypertable, Redis); CA: relational (MySQL, MSSQL, Postgres) and Neo4j]
  • 12. Dynamo (architecture) + BigTable (data model) = Cassandra (Apache)  Cassandra is a free, open-source, highly scalable, distributed database system for managing large amounts of data  Written in Java, runs on the JVM  References: BigTable (http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/bigtable-osdi06.pdf), Dynamo (http://web.archive.org/web/20120129154946/http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf)
  • 13. DESIGN GOALS  Simple key/value (column) store  Limited to storage  Supports nothing beyond the basic operations (CRUD, range access): no aggregating, grouping, …  But extendable  Hadoop (MR, HDFS, Pig, Hive ...)  ESP  Distributed processing interfaces (e.g., BSP, MR)  Baas.io  …
  • 14. DESIGN GOALS (CONT)  High availability  Decentralized  Any node can serve a request  Replication and access to replicas  Multi-DC support  Eventual consistency  Less write complexity  Audit and repair on read  Tunable -> trade-offs between consistency, durability and latency
  • 15. DESIGN GOALS (CONT)  Incremental scalability  Equal members  Linear scalability  Unlimited space  Write/read throughput increases linearly as nodes (members) are added  Low total cost  Minimal administrative work  Automatic partitioning  Flush / compaction  Data balancing / moving  Virtual nodes (since v1.2)  Mid-range nodes give good performance  Working together, they provide high performance and huge capacity
  • 16. FOUNDER & HISTORY  Founders  Avinash Lakshman (one of the authors of Amazon's Dynamo)  Prashant Malik (Facebook engineer)  Developers  About 50  History  Open-sourced by Facebook in July 2008  Became an Apache Incubator project in March 2009  Graduated to a top-level project in Feb 2010  0.6 released (added support for integrated caching and Apache Hadoop MapReduce) in Apr 2010  0.7 released (added secondary indexes and online schema changes) in Jan 2011  0.8 released (added the Cassandra Query Language (CQL), self-tuning memtables, and support for zero-downtime upgrades) in Jun 2011  1.0 released (added integrated compression, leveled compaction, and improved read performance) in Oct 2011  1.1 released (added self-tuning caches, row-level isolation, and support for mixed SSD/spinning-disk deployments) in Apr 2012  1.2 released (added clustering across virtual nodes, inter-node communication, atomic batches, and request tracing) in Jan 2013
  • 17. PROMINENT USERS
    User        | Cluster size | Node count | Usage               | Now
    Facebook    | >200         | ?          | Inbox search        | Abandoned, moved to HBase
    Cisco WebEx | ?            | ?          | User feed, activity | OK
    Netflix     | ?            | ?          | Backend             | OK
    Formspring  | ? (26 million accounts, 10 M responses per day) | ? | Social-graph data | OK
    Also: Urban Airship, Rackspace, OpenX, Twitter (preparing to move)
  • 18. BASIC CONCEPTS
  • 19. P2P ARCHITECTURE  All nodes are equal  No single point of failure / decentralized  Compare with  MongoDB  Broker structures (CUBRID …)  Master/slave  …
  • 20. P2P ARCHITECTURE  Demonstrated linear scalability.  References: http://dev.kthcorp.com/2011/12/07/cassandra-on-aws-100-million-writ/
  • 21. PRIMITIVE DATA MODEL & ARCHITECTURE
  • 22. COLUMN  The basic, primitive type (the smallest increment of data)  A tuple containing a name, a value and a timestamp  The timestamp is important  Provided by the client  Determines the most recent value  On collision, the DBMS chooses the latest one  [Diagram: Name | Value | Timestamp]
  • 23. COLUMN (CONT)  Types  Standard: a column with a name (UUID or UTF8 …)  Composite: a column with a composite name (UUID+UTF8 …)  Expiring: TTL-marked  Counter: has only a name and a value; the timestamp is managed by the server  Super: used to manage wide rows; inferior to composite columns (DO NOT USE, all sub-columns are serialized)  [Diagram: column layouts for counter, standard and super columns]
  • 24. COLUMN (CONT)  Types (CQL3-based)  Standard: has one primary key.  Composite: has more than one primary key column; recommended for managing wide rows.  Expiring: gets deleted during compaction.  Counter: counts occurrences of an event.  Super: used to manage wide rows; inferior to composite columns (DO NOT USE, all sub-columns are serialized)  DDL: CREATE TABLE test ( user_id varchar, article_id uuid, content varchar, PRIMARY KEY (user_id, article_id) );  [Diagram, logical view: row Smith holds (uuid1, content=Blah1..) and (uuid2, content=Blah2..); physical view: row key Smith holds cells {uuid1,content}=Blah1… and {uuid2,content}=Blah2…, each with a timestamp]  SELECT user_id, article_id FROM test ORDER BY article_id DESC LIMIT 1;
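    Cleaned up, the slide's schema and query look like this (a sketch in CQL3; ORDER BY requires the partition key to be restricted, which the slide elides, so the WHERE clause here is an added assumption):

        -- Wide-row table: user_id is the partition key, article_id the
        -- clustering column, so each user's articles sort by article_id.
        CREATE TABLE test (
            user_id    varchar,
            article_id uuid,
            content    varchar,
            PRIMARY KEY (user_id, article_id)
        );

        -- Clustering columns are stored sorted, so the latest article for
        -- one user is read without scanning the whole row.
        SELECT user_id, article_id, content
        FROM test
        WHERE user_id = 'Smith'
        ORDER BY article_id DESC
        LIMIT 1;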
  • 25. ROWS  A row contains a row key and a set of columns  A row key must be unique (usually a UUID)  Supports up to 2 billion columns per (physical) row  Columns are sorted by name (the column name is indexed)  Primitive  Secondary index  Direct column access  [Diagram: row key pointing to a series of (Name | Value | Timestamp) columns]
  • 26. COLUMN FAMILY  Container for columns and rows  No fixed schema  Each row is uniquely identified by its row key  Each row can have a different set of columns  Rows are sorted by row key  Comparator / validator  Static/dynamic CFs  If the column type is super column, the CF is called a "Super Column Family"  Like a "table" in the relational world  [Diagram: two row keys, each with its own set of columns]
  • 27. DISTRIBUTION  [Diagram: six rows to be mapped onto servers 1-4. How to map?]
  • 28. TOKEN RING  A node is an instance (typically one per server)  Used to map each row to a node  Token range: 0 to 2^127 - 1  Associated with a row key  Node  Assigned a unique token (e.g., token 5 to node 5)  A node's range runs from the previous node's token to its own  token 4 < node 5's range <= token 5  [Diagram: ring of nodes 1-8 with their tokens]
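    Formally (a restatement of the slide's rule, not notation from the deck): if node $i$ holds token $t_i$ and its predecessor on the ring holds $t_{i-1}$, then

        $$\mathrm{range}(i) = (\,t_{i-1},\; t_i\,], \qquad t_i \in [0,\, 2^{127}-1],$$

    with wrap-around for the node holding the smallest token.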
  • 29. PARTITIONING  [Diagram: the row key is hashed onto the ring by a random partitioner (MD5, Murmur3; the default), or placed in key order by the order-preserving / byte-ordered partitioners]
  • 30. REPLICATION  Whichever node serves a client's read/write is called the coordinator node  The locator determines where replicas are placed  Replicas are used for  Consistency checks  Repair  Ensuring W + R > N for consistency  Local cache (row cache)  [Diagram: replication factor 4, so N-1 additional copies are placed; the simple locator treats ring order as proximity: the coordinator locates the node holding the original, then the following nodes clockwise]
  • 31. REPLICATION (CONT)  Multi-DC support  Allows specifying how many replicas go in each DC  Within a DC, replicas are placed on different racks  Relies on the snitch to place replicas  Strategies (informed by the snitch)  Simple (single DC)  RackInferringSnitch  PropertyFileSnitch  EC2Snitch  EC2MultiRegionSnitch  [Diagram: DC1, DC2 and the entire cluster]
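    Replica placement is chosen per keyspace. A sketch in CQL3 (Cassandra 1.2-era syntax; the keyspace and DC names are illustrative):

        -- Single DC: SimpleStrategy walks the ring clockwise from the
        -- original replica, treating ring order as proximity.
        CREATE KEYSPACE demo_single
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

        -- Multi-DC: NetworkTopologyStrategy takes a replica count per data
        -- center and relies on the snitch to spread them across racks.
        CREATE KEYSPACE demo_multi
        WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 2};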
  • 32. ADD / REMOVE NODE  Data transfer between nodes is called "streaming"  If node 5 is added, nodes 3, 4 and 1 (assuming RF is 2) take part in streaming  If node 2 is removed, node 3 (the next-higher token, which holds node 2's replicas) serves its range instead  [Diagram: ring before and after adding node 5 / removing node 2]
  • 33. VIRTUAL NODES  Supported since v1.2  Live migration support?  Shuffle utility  One node holds many tokens  => one node owns many ranges  [Diagram: cluster of nodes 1 and 2, four tokens per node]
  • 34. VIRTUAL NODES (CONT)  Less administrative work  Saves cost  When adding/removing a node  Many nodes cooperate  No need to pick tokens by hand  Shuffle to rebalance  Shorter change time  Smart balancing  No rebalancing needed (as long as the number of tokens is sufficiently high)  [Diagram: adding node 3 to a cluster of nodes 1 and 2, four tokens per node]
  • 35. KEYSPACE  A namespace for column families  Authorization  CFs? yes  Replication  Key-oriented schema (see below; row key -> column family -> column):
    {
      "row_key1": {
        "Users": {
          "emailAddress": {"name": "emailAddress", "value": "foo@bar.com"},
          "webSite":      {"name": "webSite",      "value": "http://bar.com"}
        },
        "Stats": {
          "visits": {"name": "visits", "value": "243"}
        }
      },
      "row_key2": {
        "Users": {
          "emailAddress": {"name": "emailAddress", "value": "user2@bar.com"},
          "twitter":      {"name": "twitter",      "value": "user2"}
        }
      }
    }
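    In CQL3 terms the JSON above corresponds roughly to two static column families. An approximation (not from the deck; quoted identifiers preserve the slide's camelCase names):

        -- Each row may populate a different subset of these columns.
        CREATE TABLE "Users" (
            key            varchar PRIMARY KEY,
            "emailAddress" varchar,
            "webSite"      varchar,
            twitter        varchar
        );

        CREATE TABLE "Stats" (
            key    varchar PRIMARY KEY,
            visits varchar
        );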
  • 36. CLUSTER  The total amount of data managed by the cluster is represented as a ring  A cluster of nodes  Has one or more keyspaces  Partitioning strategy defined  Authentication
  • 37. GOSSIP  The gossip protocol is used for cluster membership  Failure detection at the service level (alive or not)  Responsibility  Every node in the system knows every other node's status  Implemented as  Sync -> Ack -> Ack2  Information: status, load, bootstrapping  Basic statuses are Alive/Dead/Join  Runs every second  Status disseminates in O(log N) rounds (N is the number of nodes)  Seed nodes  PHI (accrual failure detection) judges dead or alive over a time window (threshold 5 -> detection in 15~16 s)  Data structure  HeartBeat < ApplicationStatus < EndpointStatus < EndpointStatusMap  [Diagram: nodes N1-N6 gossiping]
  • 38. BASIC OPERATIONS
  • 39. WRITE / UPDATE  CommitLog  Abstracted mmapped type  File and memory kept in sync -> your safety net on system failure  Java NIO  Uses the C-heap (= native heap)  Log data (a write followed by a delete still exists in the log)  Rolling segment structure  Memtable  In-memory buffer and workspace  Sorted by row key  On reaching a threshold or periodically, written to disk as a persistent table structure (SSTable)
  • 40. WRITE / UPDATE (LOCAL LEVEL)  [Diagram: 1. append to the commit log; 2. write/update the memtable; 3. flush to disk as an SSTable. Commit log entries: write "1" fullname=smith; write "2" fullname=mike; delete "1"; write "3" fullname=osang. The memtable then holds keys 1-3 with their fullname columns]
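    The commit-log entries in the diagram map to simple statements like these (a sketch; the users table is an assumed illustration, not from the deck):

        -- Illustrative table mirroring the slide's key -> fullname rows.
        CREATE TABLE users (key varchar PRIMARY KEY, fullname varchar);

        INSERT INTO users (key, fullname) VALUES ('1', 'smith');  -- 1. logged, 2. memtable
        INSERT INTO users (key, fullname) VALUES ('2', 'mike');
        DELETE FROM users WHERE key = '1';  -- a delete is also just a logged write (tombstone)
        INSERT INTO users (key, fullname) VALUES ('3', 'osang');
        -- 3. once the memtable hits its threshold (or on schedule), it is
        -- flushed to an immutable SSTable on disk.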
  • 41. SSTABLE  SSTable means Sorted String Table  Well suited to a log-structured DB  Stores large numbers of key-value pairs  Immutable  Created by a "flush"  Merged by (major/minor) compaction  May hold several versions of a column (different timestamps)  The most recent one is chosen
  • 42. READ (LOCAL LEVEL)  [Diagram: a read consults the memtable and each SSTable, checking the Bloom filter (BF) and index (IDX) before touching SSTable data]
  • 43. READ (CLUSTER LEVEL, + READ REPAIR)  [Diagram: 1. the coordinator's locator routes the read; 2. data/digests are transferred from the original and replica nodes (per the consistency level); 3. digests are compared, the right (most recent) value is chosen if they differ, and wrong replicas are recovered]
  • 44. DELETE  Adds a tombstone (a special type of column)  Garbage-collected during compaction  GC grace seconds: 864000 (default, 10 days)  Issue  If a failed node recovers after GCGraceSeconds, the deleted data can be resurrected
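    The grace period is a per-table setting; a sketch in CQL3 (the value is the default the slide quotes):

        -- Tombstones younger than 864000 s (10 days) survive compaction,
        -- giving failed nodes time to come back and learn about the delete.
        ALTER TABLE test WITH gc_grace_seconds = 864000;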
  • 45. FAULT MANAGEMENT
  • 46. DETECTION  Dynamic threshold for marking nodes  An accrual detection mechanism calculates a per-node threshold  Automatically takes into account network conditions, workload, and anything else that might affect the perceived heartbeat rate  From 3rd-party clients  Hector  Failover
  • 47. HINTED-HANDOFF  The coordinator stores a hint if a node is down or fails to acknowledge a write  A hint consists of the target replica and the mutation (column object) to be replayed  Uses the Java heap (may move off-heap next)  Only saved for a limited time (default, 1 hour) after a replica fails  When the failed node comes back, the missed writes are streamed to it
  • 48. REPAIR  Three complementary mechanisms  CommitLog replay (by the administrator)  Read repair (real time)  Anti-entropy repair (by the administrator)
  • 49. READ REPAIR  Background work  Configured per CF  If replicas are inconsistent, the most recently written value is chosen and stale replicas are replaced with it.
  • 50. ANTI-ENTROPY REPAIR  Ensures all data on a replica is made consistent  Uses Merkle trees  A tree of data-block hashes  Used to verify inconsistency  The repairing node requests Merkle hashes (per piece of a CF) from replicas and compares them, streaming from a replica where they differ (a form of read repair)  [Diagram: Merkle tree of hashes over CF blocks 1-3]
  • 51. CONSISTENCY
  • 52. BASIC  Full ACID compliance in a distributed system is a bad idea (network, …)  Single-row updates are atomic (including internal indexes); everything else is not  Relaxing consistency does not equal data corruption  Tunable consistency  Speed vs precision  Every read and write operation decides (client side) how consistent the requested data should be
  • 53. CONDITION  Consistency is ensured if  (W + R) > N  W is the number of nodes that acknowledged the write  R is the number of nodes read  N is the replication factor
  • 54. CONDITION (CONT)  Worked example: N is 3; the operations are 1. write 3, 2. write 5, 3. write 1.  With W = 1, the replicas may hold (3, 5, 1) in the worst case: only one copy has the latest value.  With W = 2, at least two replicas hold the latest value: (1, 5, 1), (3, 1, 1) or (1, 1, 1).  Reading with R = 1 can return a stale 3 or 5; with R = 2 or R = 3 at least one replica read holds the latest value.  (W + R) > N ensures that at least one latest value is among those read; this is eventual consistency.
  • 55. READ CONSISTENCY LEVELS  One  Two  Three  Quorum  Local Quorum  Each Quorum  All  Specifies how many replicas must respond before a result is returned to the client  Quorum: (replication factor / 2) + 1, rounded down to a whole number  Local Quorum / Each Quorum are used in multi-DC setups  (Once satisfied, the result is returned right away)
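    In symbols (a restatement of the slide's formula):

        $$\mathrm{QUORUM} = \left\lfloor \frac{\mathrm{RF}}{2} \right\rfloor + 1, \qquad \text{e.g. } \mathrm{RF} = 3 \Rightarrow \mathrm{QUORUM} = 2, \quad W + R = 2 + 2 = 4 > 3 = N.$$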
  • 56. WRITE CONSISTENCY LEVELS  ANY  One  Two  Three  Quorum  Local Quorum  Each Quorum  All  Specifies how many replicas must succeed before an acknowledgement is returned to the client  Quorum: (replication factor / 2) + 1, rounded down to a whole number  Local Quorum / Each Quorum are used in multi-DC setups  ANY also counts a hinted handoff as success  (Once satisfied, the acknowledgement is returned right away)
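    A sketch of quorum reads and writes from cqlsh (the CONSISTENCY session command is from later cqlsh versions; in the 1.2-era API the level was set per request by the client driver instead; the users table is the illustrative one from slide 40):

        -- With RF = 3, QUORUM writes (2 nodes) plus QUORUM reads (2 nodes)
        -- satisfy W + R > N, so a read always sees the latest write.
        CONSISTENCY QUORUM;
        INSERT INTO users (key, fullname) VALUES ('4', 'jane');
        SELECT fullname FROM users WHERE key = '4';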
  • 57. PERFORMANCE
  • 58. CACHE  The key/row caches can persist their data to files  Key cache  For frequently accessed keys  Holds the locations of keys (pointing to columns)  In memory, on the JVM heap  Row cache  Optional  Holds all the columns of a row  In memory, off-heap (since v1.1) or on the JVM heap  If you have huge rows, this can cause an OOME (OutOfMemoryError)
  • 59. CACHE  Mmapped disk access  On a 64-bit JVM, used for data and index summaries (default)  Provides a virtual mmapped space in memory for SSTables  On the C-heap (native heap)  GC effectively turns it into a cache  Frequently accessed data lives long; otherwise GC purges it  If the data exists in memory, it is returned from there (= a cache)  (Problem) the C-heap is only collected when full  (Problem) it handles open SSTables, meaning Cassandra can allocate up to the entire size of the open SSTables; otherwise a native OOME  If you want an efficient key/row/mmapped-access cache, add sufficient nodes to the cluster
  • 60. BLOOM FILTERS  Each SSTable has one  Used to check whether a requested row key exists in the SSTable before doing any (disk) seeks  Per row key, several hashes are generated and their buckets marked  To look up a key, check each of its hash buckets; if any is empty, the key does not exist  False positives are possible, but false negatives are not  [Diagram: keys 1 and 2 hashed by hashes A, B, C into marked buckets; distinct keys may share hashes]
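    For reference (standard Bloom filter math, not from the deck): with m buckets, k hash functions and n keys inserted, the false positive rate is approximately

        $$p \approx \left(1 - e^{-kn/m}\right)^{k}.$$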
  • 61. INDEX  Primary index  One per CF  The index of the CF's row keys  Efficient access via the index summary (1 row key out of every 128 is sampled)  In memory, on the JVM heap (to move off-heap next)  [Diagram: read path: Bloom filter -> key cache -> index summary -> primary index -> offset into the SSTable]
  • 62. INDEX (CONT)  Secondary index  For column value(s)  Supports composite types  A hidden CF  Implemented as an index CF named after the indexed column; its values point back to rows  Write/update/delete operations on it are atomic  Values shared by many rows index well  Conversely, nearly unique values index poorly (-> use a dynamic CF for such indexing)
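    Declared in CQL3 like this (a sketch against the test table from slide 24; whether content is a good candidate depends on its cardinality, per the bullet above):

        -- Secondary index on a column value; works best when many rows
        -- share the same value.
        CREATE INDEX ON test (content);
        SELECT * FROM test WHERE content = 'Blah1..';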
  • 63. COMPACTION  Combines data from SSTables  Merges row fragments  Rebuilds primary and secondary indexes  Removes expired columns marked with a tombstone  Deletes the old SSTables on completion  "Minor" compactions merge SSTables of similar size; "major" compactions merge all SSTables in a given CF  Size-tiered compaction  Leveled compaction  Since v1.0  Based on LevelDB  Can temporarily use up to twice the space, with spikes in disk IO
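    The strategy is a per-table setting; a sketch in CQL3 (map syntax from Cassandra 1.2):

        -- Leveled compaction bounds how many SSTables a read must touch,
        -- at the cost of more IO while compacting.
        ALTER TABLE test
        WITH compaction = {'class': 'LeveledCompactionStrategy'};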
  • 64. ARCHITECTURE  Writes: no race conditions, not bound by disk IO  Reads: slower than writes, but still fast (DHT, caches …)  Load balancing  Virtual nodes  Replication  Multi-DC
  • 65. BENCHMARK  By YCSB (Yahoo Cloud Serving Benchmark)  [Figures: Workload A, update heavy: (a) read operations, (b) update operations; Workload B, read heavy: (a) read operations, (b) update operations. Throughput in this (and all figures) represents total operations per second, including reads and writes.]  Reference: http://68.180.206.246/files/ycsb.pdf
  • 66. BENCHMARK (CONT)  By YCSB (Yahoo Cloud Serving Benchmark)  [Figures: Workload E, short scans; read performance as cluster size increases.]  Reference: http://68.180.206.246/files/ycsb.pdf
  • 67. BENCHMARK (CONT)  By YCSB (Yahoo Cloud Serving Benchmark)  [Figure: elastic speedup, a time series showing the impact of adding servers online.]  Reference: http://68.180.206.246/files/ycsb.pdf
  • 68. BENCHMARK (CONT)  By NoSQLBenchmarking.com  Reference: http://www.nosqlbenchmarking.com/2011/02/new-results-for-cassandra-0-7-2/
  • 69. BENCHMARK (CONT) By Cubrid References : http://www.cubrid.org/blog/dev-platform/nosql-benchmarking/
  • 70. BENCHMARK (CONT)  By VLDB  [Figures: read latency; write latency; throughput (95% read, 5% write).]  Reference: http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf
  • 71. BENCHMARK (LAST)  By VLDB  [Figures: throughput (50% read, 50% write); throughput (100% write).]  Reference: http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf
  • 72. PROBLEM HANDLING
  • 73. RESOURCE  Memory  Off-heap & heap  OOME problems  CPU  GC  Hashing  Compression / compaction  Network handling  Context switching  Lazy problems  IO  The bottleneck for everything
  • 74. MEMORY  Heap (GC managed)  Permanent (-XX:PermSize, -XX:MaxPermSize)  JVM heap (-Xmx, -Xms, -Xmn)  C-heap (= native heap)  OS shared  Thread stacks (-Xss)  Objects accessed via JNI  Off-heap  OS shared  GC managed by Cassandra
  • 75. MEMORY (CONT)  Heap  Permanent  JVM heap  Memtable  KeyCache  IndexSummary (moves off-heap in the next release)  Buffers  Transport  Socket  Disk  C-heap  Thread stacks  File memory map (virtual space)  Data / index buffers (default)  CommitLog (v1.2)  Off-heap (OS shared)  RowCache  BloomFilter  Index -> CompressionMetaData -> ChunkOffset
  • 76. MEMORY (CONT)  Memtable  Managed  Total size (default 1/3 of the JVM heap; if reached, the largest memtable per CF is flushed)  Emergency: heap usage above a fraction of the max after a full GC (CMS) -> flush the largest memtable (each time) -> prevents full GC / OOME  KeyCache  Managed  Total size (100M or 5% of the max)  Emergency: heap usage above a fraction of the max after a full GC (CMS) -> reduce the max cache size -> prevents full GC / OOME  RowCache/CommitLog  Managed  Total size (disabled by default) -> prevents OOME
  • 77. MEMORY (CONT)  Thread stacks  Not managed  But -Xss is set to 180k (default)  Check the serving type of the Thrift (transport-level RPC) server: sync, hsha, async (has bugs)  Set min/max threads for connections (default unlimited) (v1.2)
  • 78. MEMORY (CONT)  Transport buffers  Thrift  Supports many languages, cross-language  Provides server/client interfaces and serialization  An Apache project, created by Facebook  Framed buffers (default max 16M, variable size)  4k, 16k, 32k, … 16M  Determined by the client  Per connection  Adjust the max frame buffer size (client, server)  Set min/max threads for connections (default unlimited) (v1.2)  [Diagram: client <-> Thrift <-> data service]
  • 79. MEMORY (LAST)  C-heap/off-heap  OS shared -> other applications can cause problems  File memory map (virtual space)  Collected on full GC  0 <= total size <= the size of the open SSTables  What if it cannot allocate? -> native OOME  But  Generally only a limited part of each SSTable is accessed  GC makes space  Worst case (if an OOME occurs)?  yaml -> disk_access_mode: standard (restart required)  Add sufficient nodes  yaml -> disk_access_mode: auto after joining (v1.2)
  • 80. CPU  GC  CMS  Marking phase: low thread priority -> but a high usage rate (not a problem by itself)  CMSInitiatingOccupancyFraction is 75 (default)  UseCMSInitiatingOccupancyOnly  Full GC  Frequency is what matters -> may indicate a problem (e.g., Thrift transport buffers)  Add nodes, or analyze memory usage and adjust the configuration  Minor GC  It's OK  Compaction  Running it slowly is fine  So lower its priority with "-XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Dcassandra.compaction.priority=1"  Sustained high CPU load? -> that is when you need to add nodes
  • 81. SWAPPING  Swapping causes big problems for real-time applications  IO blocks -> threads block -> gossip/compaction/flush … are delayed -> other problems follow  Disable swapping, or keep it to a minimum  Disable the swap partition  Or enable JNA + kernel configuration  JNA: mlockall (keeps heap memory in physical memory)  Kernel  vm.swappiness=0 (but under memory pressure swapping is still possible)  vm.overcommit_memory=1  Or vm.overcommit_memory=2 (managed overcommit)  vm.overcommit_ratio=? (e.g., 0.75)  Max memory = swap partition size + ratio * physical memory size  E.g.: 8G = 2G + 0.75 * 8G
  • 82. MONITORING  System monitoring  CPU / memory / disk  Nagios, Ganglia, Cacti, Zabbix  Network monitoring  Per client  NfSen (network flow monitoring, see: http://nfsen.sourceforge.net/#mozTocId376385)  Cluster monitoring / maintenance  OpsCenter
  • 83. CHECK THREAD  Use the "top" command  Press "H" to show individual threads  Press "P" to sort by CPU usage  Pick the PID of the heaviest thread  Convert the PID to hex (http://www.binaryhexconverter.com/decimal-to-hex-converter), e.g. 313C  Run "jstack <parent PID> > filename.log" to save the Java stack to a file  Search for the hex PID in the file
  • 84. CHECK HEAP  Use a dump file from "jmap" or from an OOME  Use "jhat" or another tool to analyze it  Check [B (byte[] instances)  and the objects referencing them
  • 85. For development and maintenance: sorry, I had just two days to write this presentation. Next time I will write about and present these topics. See you next time!
  • 86. Questions, or talk about anything Cassandra-related
  • 87. Thank you  If you have any problem or question for me, please contact me by email: jihyun.an@kt.com