About "Apache Cassandra"
 

About "Apache Cassandra"

on

  • 397 views

This presentation explain about "Apache Cassandra's concepts and architecture". ...

This presentation explains Apache Cassandra's concepts and architecture.

My friends and colleagues said, "This presentation should be released publicly to help the many people who work in IT," so I am uploading it for everyone who loves technology for the people.

This presentation was used to train KT employees last year.

    About "Apache Cassandra" About "Apache Cassandra" Presentation Transcript

    • APACHE CASSANDRA Scalability, Performance and Fault Tolerance in Distributed databases Jihyun.An (jihyun.an@kt.com) 18, June 2013
    • TABLE OF CONTENTS  Preface  Basic Concepts  P2P Architecture  Primitive Data Model & Architecture  Basic Operations  Fault Management  Consistency  Performance  Problem handling
    • TABLE OF CONTENTS (NEXT TIME)  Maintaining  Cluster Management  Node Management  Problem Handling  Tuning  Playing (for Development, Client stance)  Designing  Client  Thrift  Native  CQL  3rd party  Hector  OCM  Extension  Baas.io  Hadoop
    • PREFACE
    • OUR WORLD  Traditional DBMS is still very valuable  Storage (+memory) and computational resources are cheaper than before  But we face a new landscape  Big data  (Near) real time  Complex and varied requirements  Recommendation  Find FOAF  …  Event-driven triggering  User sessions  …
    • OUR WORLD (CONT)  Complex applications combine different types of problems  Different languages -> more productivity  e.g. functional languages, languages optimized for multiprocessing  Polyglot persistence layer  Performance vs durability?  Reliability?  …
    • TRADITIONAL DBMS  Relational model  Well-defined schema  Access with selection/projection  Derived data from joining/grouping/aggregating (counting, …)  Small, refined data  …  But  Painful data model changes  Hard to scale out  Ineffective at handling large volumes of data  Does not take the hardware into account  …
    • TRADITIONAL DBMS (CONT)  Has many constraints for ACID  PK/FK checking  Domain type checking  … checking, checking  Lots of IO / processing  OODBMS, ORDBMS  Good, but even more checking / processing  Do not play well with disk IO
    • NOSQL  Key-value store  Column: Cassandra, HBase, Bigtable …  Others: Redis, Dynamo, Voldemort, Hazelcast …  Document oriented  MongoDB, CouchDB …  Graph store  Neo4j, OrientDB, BigOWL, FlockDB …
    • NOSQL (CONT) Benefits  Higher performance  Higher scalability  Flexible data model  More effective for some cases  Less administrative overhead Drawbacks  Limited transactions  Relaxed consistency  Unconstrained data  Limited ad-hoc query capabilities  Limited administrative tooling
    • CAP  Brewer's theorem: we can pick only two of Consistency, Availability and Partition tolerance  (Slide shows the CAP triangle: Amazon Dynamo derivatives (Cassandra, Voldemort, CouchDB, Riak) on the Availability + Partition tolerance side; Neo4j, Bigtable and Bigtable derivatives (MongoDB, HBase, Hypertable, Redis) on the Consistency + Partition tolerance side; relational databases (MySQL, MSSQL, Postgres) on the Consistency + Availability side)
    • Dynamo (architecture) + BigTable (data model) -> Cassandra (Apache)  Cassandra is a free, open-source, highly scalable, distributed database system for managing large amounts of data  Written in Java  Runs on the JVM  References: BigTable (http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/bigtable-osdi06.pdf) Dynamo (http://web.archive.org/web/20120129154946/http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf)
    • DESIGN GOALS  Simple key/value (column) store  Limited to storage  No support for anything else (aggregating, grouping, …), only basic operations (CRUD, range access)  But extendable  Hadoop (MR, HDFS, Pig, Hive, …)  ESP  Distributed processing interfaces (e.g. BSP, MR)  Baas.io  …
    • DESIGN GOALS (CONT)  High availability  Decentralized  Every node can serve requests  Replication and replica access  Multi-DC support  Eventual consistency  Less write complexity  Audit and repair on read  Tunable -> trade-offs between consistency, durability and latency
    • DESIGN GOALS (CONT)  Incremental scalability  Equal members  Linear scalability  Unlimited space  Write / read throughput increases linearly as nodes (members) are added  Low total cost  Minimize administrative work  Automatic partitioning  Flush / compaction  Data balancing / moving  Virtual nodes (since v1.2)  Mid-powered nodes give good performance  Working together, they provide high performance and huge space
    • FOUNDER & HISTORY  Founder  Avinash Lakshman (one of the authors of Amazon's Dynamo)  Prashant Malik ( Facebook Engineer )  Developer  About 50  History  Open sourced by Facebook in July 2008  Became an Apache Incubator project in March 2009  Graduated to a top-level project in Feb 2010  0.6 released (added support for integrated caching, and Apache Hadoop MapReduce) in Apr 2010  0.7 released (added secondary indexes and online schema change) in Jan 2011  0.8 released (added the Cassandra Query Language (CQL), self-tuning memtables, and support for zero-downtime upgrades) in Jun 2011  1.0 released (added integrated compression, leveled compaction, and improved read performance) in Oct 2011  1.1 released (added self-tuning caches, row-level isolation, and support for mixed ssd/spinning disk deployments) in Apr 2012  1.2 released (added clustering across virtual nodes, inter-node communication, atomic batches, and request tracing) in Jan 2013
    • PROMINENT USERS (user / cluster size / node count / usage / now)  Facebook: >200 / ? / inbox search / abandoned, moved to HBase  Cisco WebEx: ? / ? / user feed, activity / OK  Netflix: ? / ? / backend / OK  Formspring: ? (26 million accounts with 10 M responses per day) / ? / social-graph data / OK  Also: Urban Airship, Rackspace, OpenX, Twitter (preparing to move)
    • BASIC CONCEPTS
    • P2P ARCHITECTURE  All nodes are equal  No single point of failure / decentralized  Compare with  MongoDB  Broker structures (CUBRID, …)  Master / slave  …
    • P2P ARCHITECTURE  Delivers linear scalability  References: http://dev.kthcorp.com/2011/12/07/cassandra-on-aws-100-million-writ/
    • PRIMITIVE DATA MODEL & ARCHITECTURE
    • COLUMN  Basic and primitive type (the smallest increment of data)  A tuple containing a name, a value and a timestamp  The timestamp is important  Provided by the client  Determines the most recent version  On a collision, the DBMS chooses the latest one  (Diagram: name | value | timestamp)
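    A minimal sketch of the last-write-wins idea described above (illustrative only, not Cassandra's internal Column class; names are made up): when two versions of the same column meet, the one with the higher client-supplied timestamp is kept.

        // Illustrative sketch only -- not Cassandra's actual classes.
        final class Column {
            final String name;
            final byte[] value;
            final long timestamp;   // supplied by the client (conventionally microseconds since epoch)

            Column(String name, byte[] value, long timestamp) {
                this.name = name;
                this.value = value;
                this.timestamp = timestamp;
            }

            // Last write wins: on a collision, keep the column with the newest timestamp.
            static Column reconcile(Column a, Column b) {
                return a.timestamp >= b.timestamp ? a : b;
            }
        }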
    • COLUMN (CONT)  Types  Standard: a column has a name (UUID or UTF8, …)  Composite: a column has a composite name (UUID+UTF8, …)  Expiring: TTL-marked  Counter: has only a name and a value; the timestamp is managed by the server  Super: used to manage wide rows, inferior to using composite columns (DO NOT USE, all sub-columns are serialized)
    • COLUMN (CONT)  Types (CQL3 based)  Standard: Has one primary key.  Composite: Has more than one primary key, recommended for managing wide rows.  Expiring: Gets deleted during compaction.  Counter: Counts occurrences of an event.  Super: Used to manage wide rows, inferior to using composite columns (DO NOT USE, All sub-columns serialized) DDL : CREATE TABLE test ( user_id varchar, article_id uuid, content varchar, PRIMARY KEY (user_id, article_id) ); user_id article_id content Smith <uuid1> Blah1.. Smith <uuid2> Blah2.. {uuid1,content} Blah1… Timestamp {uuid2,content} Blah2… Timestamp Smith <Logical> <Physical> SELECT user_id,article_id from test order by article_id DESC LIMIT 1;
    • ROWS  A row contains a row key and a set of columns  A row key must be unique (usually a UUID)  Supports up to 2 billion columns per (physical) row  Columns are sorted by their name (the column name is indexed)  Primitive  Secondary index  Direct column access
    • COLUMN FAMILY  Container for columns and rows  No fixed schema  Each row is uniquely identified by its row key  Each row can have a different set of columns  Rows are sorted by row key  Comparator / validator  Static / dynamic CF  If the column type is super column, the CF is called a "Super Column Family"  Like a "table" in the relational world
    • DISTRIBUTION  (Diagram: rows spread across Server 1 … Server 4)  How do rows map to servers?
    • TOKEN RING  A node is an instance (typically one per server)  Used to map each row to a node  Range from 0 to 2^127 - 1  Associated with a row key  Node  Assigned a unique token (e.g. token 5 to node 5)  A node's range runs from the previous node's token to its own token  token 4 < node 5's range <= token 5
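    A rough sketch of the range rule above (hypothetical helper, small integer tokens for illustration): a row token belongs to the first node whose token is greater than or equal to it, wrapping around the ring.

        import java.util.Map;
        import java.util.TreeMap;

        // Simplified token ring: each node owns the range (previous token, its own token].
        final class TokenRing {
            private final TreeMap<Long, String> ring = new TreeMap<>(); // token -> node name

            void addNode(long token, String node) { ring.put(token, node); }

            // Owner of a row token: first node token >= rowToken, wrapping to the lowest token.
            String ownerOf(long rowToken) {
                Map.Entry<Long, String> e = ring.ceilingEntry(rowToken);
                return (e != null ? e : ring.firstEntry()).getValue();
            }
        }
        // Example: addNode(40, "node4"); addNode(50, "node5"); ownerOf(45) returns "node5".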
    • PARTITIONING  Random partitioners (MD5, Murmur3): default  Order Preserving Partitioner / Byte Ordered Partitioner  (Diagram: row keys hashed onto the ring vs. placed in key order)
    • REPLICATION  Any node serving a client's read/write is called the coordinator node  The locator determines where replicas are placed  Replicas are used for  Consistency checks  Repair  Ensure W + R > N for consistency  Local cache (row cache)  (Diagram: replication factor 4, so N-1 additional copies are made; the simple locator treats ring order as proximity, placing the first replica after the original and walking the ring)
    • REPLICATION (CONT)  Multi-DC support  Allows specifying how many replicas go to each DC  Within a DC, replicas are placed on different racks  Relies on a snitch to place replicas  Strategies (provided by the snitch)  Simple (single DC)  RackInferringSnitch  PropertyFileSnitch  EC2Snitch  EC2MultiRegionSnitch
    • ADD / REMOVE NODE  Data transfer between nodes is called "streaming"  If node 5 is added, nodes 3, 4 and 1 (assuming RF is 2) are involved in streaming  If node 2 is removed, node 3 (which holds the next-higher token and its replicas) serves instead
    • VIRTUAL NODES  Supported since v1.2  Real-time migration support?  Shuffle utility  One node has many tokens  => one node owns many ranges  (Diagram: two-node cluster with 4 tokens per node)
    • VIRTUAL NODES (CONT)  Less administrative work  Saves cost  When adding/removing a node  Many nodes cooperate  No need to determine tokens  Shuffle to re-balance  Less time spent on changes  Smart balancing  No need to balance manually (as long as the number of tokens is sufficiently high)  (Diagram: adding node 3 to a cluster with 4 tokens per node)
    • KEYSPACE  A namespace for column families  Authorization  Contains CFs  Replication settings  Key-oriented schema (see right):
        {
          "row_key1": {
            "Users": {
              "emailAddress": {"name": "emailAddress", "value": "foo@bar.com"},
              "webSite": {"name": "webSite", "value": "http://bar.com"}
            },
            "Stats": {
              "visits": {"name": "visits", "value": "243"}
            }
          },
          "row_key2": {
            "Users": {
              "emailAddress": {"name": "emailAddress", "value": "user2@bar.com"},
              "twitter": {"name": "twitter", "value": "user2"}
            }
          }
        }
      (Nesting: row key -> column family -> column)
    • CLUSTER  Total amount of data managed by the cluster is represented as a ring  Cluster of nodes  Has multiple(or single) Keyspace  Partitioning Strategy defined  Authentication
    • GOSSIP  The gossip protocol is used for cluster membership  Failure detection at the service level (alive or not)  Responsibility  Every node in the system knows every other node's status  Implemented as  Sync -> Ack -> Ack2  Information: status, load, bootstrapping  Basic statuses are Alive/Dead/Join  Runs every second  Status is disseminated in O(log N) (N is the number of nodes)  Seeds  PHI is used to judge dead or alive within a time window (5 -> detection in 15~16 s)  Data structure  HeartBeat < Application Status < Endpoint Status < Endpoint StatusMap
    • BASIC OPERATIONS
    • WRITE / UPDATE  CommitLog  Abstracted mmapped type  File and memory sync -> on system failure this is your guardian angel ^^  Java NIO  C-heap used (= native heap)  Log data (a write followed by a delete still exists in the log)  Segment rolling structure  Memtable  In-memory buffer and workspace  Sorted by row key  When a threshold or a periodic point is reached, it is written to disk as a persistent table structure (SSTable)
    • WRITE / UPDATE (LOCAL LEVEL)  1. Write to the commit log (e.g. Write "1":{"name":"fullname","value":"smith"}, Write "2":{"name":"fullname","value":"mike"}, Delete "1", Write "3":{"name":"fullname","value":"osang"}, …)  2. Write/update the memtable (sorted key -> column entries)  3. Flush to disk (SSTable)
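    A toy sketch of this three-step local write path (class names and the flush threshold are illustrative, not Cassandra's real internals): append to the commit log, update a sorted in-memory memtable, and flush to an immutable SSTable when a threshold is reached.

        import java.util.ArrayList;
        import java.util.List;
        import java.util.TreeMap;

        // Toy local write path: commit log -> memtable -> flush to SSTable.
        final class ToyStore {
            private final List<String> commitLog = new ArrayList<>();               // 1. durable, append-only
            private final TreeMap<String, String> memtable = new TreeMap<>();       // 2. sorted by row key
            private final List<TreeMap<String, String>> sstables = new ArrayList<>(); // 3. immutable flushes
            private static final int FLUSH_THRESHOLD = 1024;                        // illustrative only

            void write(String rowKey, String value) {
                commitLog.add(rowKey + "=" + value);       // 1. append to the commit log first
                memtable.put(rowKey, value);               // 2. then update the memtable in place
                if (memtable.size() >= FLUSH_THRESHOLD) {
                    sstables.add(new TreeMap<>(memtable)); // 3. flush: a new sorted, immutable table
                    memtable.clear();
                    commitLog.clear();                     // flushed commit-log segments can be recycled
                }
            }
        }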
    • SSTABLE  SSTable means Sorted String Table  Best for a log-structured DB  Stores large numbers of key-value pairs  Immutable  Created by a "flush"  Merged by (major/minor) compaction  One or more columns may exist in different versions (timestamps)  The most recent one is chosen
    • READ (LOCAL LEVEL)  (Diagram: a read consults the memtable and each SSTable, going through the SSTable's Bloom filter (BF) and index (IDX))
    • READ (CLUSTER LEVEL, +READ REPAIR)  1. The coordinator's locator selects replica nodes  2. Data is transferred from the original/replica nodes (according to the consistency level)  3. Digests are compared; if they differ, the right (most recent) value is chosen and out-of-date replicas are recovered
    • DELETE  Adds a tombstone (a special type of column)  Garbage-collected during compaction  GC grace seconds: 864000 (default, 10 days)  Issue  If a failed node recovers after GCGraceSeconds, the deleted data can be resurrected
    • FAULT MANAGEMENT
    • DETECTION  Dynamic threshold for marking nodes  The accrual detection mechanism calculates a per-node threshold  Automatically takes into account network conditions, workload and other factors that might affect the perceived heartbeat rate  From 3rd-party clients  Hector  Failover
    • HINTED-HANDOFF  The coordinator stores a hint if a node is down or fails to acknowledge the write  A hint consists of the target replica and the mutation (column object) to be replayed  Uses the Java heap (may move off-heap next)  Hints are only saved for a limited time (default 1 hour) after a replica fails  When the failed node comes back, the missed writes are streamed to it
    • REPAIR  Three methods are supported  CommitLog replaying (by the administrator)  Read repair (real time)  Anti-entropy repair (by the administrator)
    • READ REPAIR  Background work  Configured per CF  If replicas are inconsistent, the most recently written value is chosen and the stale copies are replaced
    • ANTI-ENTROPY REPAIR  Ensures all data on a replica is made consistent  A Merkle tree is used  A tree of data-block hashes  Used to verify inconsistencies  The repairing node requests Merkle hashes (per piece of a CF) from the replicas and compares them; inconsistent ranges are streamed from a replica, as in read repair
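    A simplified sketch of the comparison step (leaf hashes only; a real Merkle tree also compares inner nodes top-down to prune matching subtrees; names here are made up): hash fixed ranges of a column family on each replica, compare, and stream only the ranges whose hashes differ.

        import java.nio.charset.StandardCharsets;
        import java.security.MessageDigest;
        import java.util.ArrayList;
        import java.util.Arrays;
        import java.util.List;

        // Compare per-range hashes from two replicas; return the ranges that need repair.
        final class RangeDigests {
            static byte[] hashRange(List<String> rowsInRange) throws Exception {
                MessageDigest md = MessageDigest.getInstance("SHA-256");
                for (String row : rowsInRange) md.update(row.getBytes(StandardCharsets.UTF_8));
                return md.digest();
            }

            static List<Integer> mismatchedRanges(List<byte[]> local, List<byte[]> remote) {
                List<Integer> toRepair = new ArrayList<>();
                for (int i = 0; i < local.size(); i++) {
                    if (!Arrays.equals(local.get(i), remote.get(i))) toRepair.add(i); // stream only these
                }
                return toRepair;
            }
        }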
    • CONSISTENCY
    • BASIC  Full ACID compliance in a distributed system is a bad idea (network, …)  Single-row updates are atomic (including internal indexes), everything else is not  Relaxing consistency does not equal data corruption  Tunable consistency  Speed vs precision  Every read and write operation decides (from the client) how consistent the requested data should be
    • CONDITION  Consistency is ensured if  (W + R) > N  W is the number of nodes that acknowledged the write  R is the number of nodes read  N is the replication factor
    • CONDITION (CONT)  (Diagram: N = 3 with successive writes of the values 3, 5 and 1; with W = 1 some replicas may still hold stale values, with W = 2 at least two replicas hold the latest value; reads at R = 1, 2 or 3 then see different mixes)  (W + R) > N ensures that at least one of the replicas read holds the latest value  This is eventual consistency
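    To make the overlap concrete (an illustrative restatement of the slide's figure): with N = 3 replicas, writing at W = 2 and reading at R = 2 gives W + R = 4 > 3, so the set of replicas written and the set of replicas read must share at least one node, and that node holds the latest value. With W = 1 and R = 1, W + R = 2 <= 3, so a read may hit only stale replicas until repair catches up.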
    • READ CONSISTENCY LEVELS  One  Two  Three  Quorum  Local Quorum  Each Quorum  All  Specifies how many replicas must respond before a result is returned to the client  Quorum: (replication factor / 2) + 1, rounded down to a whole number  Local Quorum / Each Quorum are used with multi-DC deployments  If the level is satisfied, the result is returned right away
    • WRITE CONSISTENCY LEVELS  ANY  One  Two  Three  Quorum  Local Quorum  Each Quorum  All  Specifies how many replicas must succeed before an acknowledgement is returned to the client  Quorum: (replication factor / 2) + 1, rounded down to a whole number  Local Quorum / Each Quorum are used with multi-DC deployments  The ANY level also counts a hinted handoff as a success  If the level is satisfied, the acknowledgement is returned right away
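    Worked quorum arithmetic (using the formula on the slides; integer division rounds down): with a replication factor of 3, QUORUM = 3/2 + 1 = 2; with a replication factor of 5, QUORUM = 5/2 + 1 = 3. Writing and reading at QUORUM therefore always satisfies W + R > N (2 + 2 > 3, 3 + 3 > 5).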
    • PERFORMANCE
    • CACHE  The key/row cache can save its data to files  Key cache  For frequently accessed keys  Holds the locations of keys (pointing to the columns)  In memory, on the JVM heap  Row cache  Optional  Holds all columns of the row  In memory, off-heap (since v1.1) or on the JVM heap  If you have huge columns, this will cause an OOME (OutOfMemoryError)
    • CACHE  Mmapped disk access  On a 64-bit JVM, used for data and the index summary (default)  Provides virtual mmapped space in memory for SSTables  On the C-heap (native heap)  GC effectively turns this into a cache  Frequently accessed data lives a long time, otherwise the GC purges it  If the data exists in memory, it is returned (= cache)  (Problem) the C-heap is collected only when it is full  (Problem) it handles open SSTables, which means Cassandra may allocate up to the entire size of the open SSTables, otherwise a native OOME  If you want an efficient key/row/mmapped-access cache, add sufficient nodes to the cluster
    • BLOOM FILTERS  Each SSTable has one  Used to check whether a requested row key exists in the SSTable before doing any (disk) seeks  For each row key, several hashes are generated and the corresponding buckets are marked  Each bucket for the key's hashes is checked; if any is empty, the key does not exist  False positives are possible, but false negatives are not
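    A minimal Bloom filter sketch along the lines described above (the hash derivation and sizes are arbitrary choices for illustration, not Cassandra's): membership tests can report false positives but never false negatives, so a negative answer safely skips the disk seek.

        import java.util.BitSet;

        // Minimal Bloom filter: k hash functions mark k bits per key; a key "might" be present
        // only if all k of its bits are set. False positives possible, false negatives not.
        final class ToyBloomFilter {
            private final BitSet bits;
            private final int size;
            private final int hashCount;

            ToyBloomFilter(int size, int hashCount) {
                this.bits = new BitSet(size);
                this.size = size;
                this.hashCount = hashCount;
            }

            // Derive k cheap hashes from two base hashes (double hashing).
            private int bucket(String key, int i) {
                int h1 = key.hashCode();
                int h2 = Integer.reverse(h1) ^ 0x5bd1e995;
                return Math.floorMod(h1 + i * h2, size);
            }

            void add(String rowKey) {
                for (int i = 0; i < hashCount; i++) bits.set(bucket(rowKey, i));
            }

            boolean mightContain(String rowKey) {
                for (int i = 0; i < hashCount; i++) {
                    if (!bits.get(bucket(rowKey, i))) return false; // definitely absent: skip the disk seek
                }
                return true; // possibly present: worth reading the SSTable
            }
        }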
    • INDEX  Primary index  Per CF  The index of the CF's row keys  Efficient access via the index summary (1 row key out of every 128 is sampled)  In memory, on the JVM heap (moving off-heap in a future release)  (Diagram read path: Bloom filter -> key cache -> index summary -> primary index -> offset -> SSTable)
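    A rough sketch of the sampling idea (the 1-in-128 ratio comes from the slide; the class and method names are made up): the in-memory summary keeps every 128th row key with its index offset, so a lookup binary-searches the summary and then scans at most one 128-entry stretch of the on-disk primary index.

        import java.util.ArrayList;
        import java.util.List;

        // In-memory index summary: one sampled entry per 128 primary-index entries.
        final class IndexSummary {
            record Sample(String rowKey, long indexOffset) {}
            private final List<Sample> samples = new ArrayList<>();
            static final int SAMPLING_INTERVAL = 128;

            // Called while writing the primary index, once every SAMPLING_INTERVAL entries.
            void addSample(String rowKey, long indexOffset) { samples.add(new Sample(rowKey, indexOffset)); }

            // Returns the primary-index offset to start scanning from for the given key.
            long floorOffset(String rowKey) {
                if (samples.isEmpty()) return 0;
                int lo = 0, hi = samples.size() - 1, best = 0;
                while (lo <= hi) {
                    int mid = (lo + hi) >>> 1;
                    if (samples.get(mid).rowKey().compareTo(rowKey) <= 0) { best = mid; lo = mid + 1; }
                    else hi = mid - 1;
                }
                return samples.get(best).indexOffset(); // scan at most SAMPLING_INTERVAL index entries from here
            }
        }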
    • INDEX (CONT)  Secondary index  For column values  Supports composite types  Hidden CF  Implemented as a hidden CF whose row key is the indexed value and whose columns point back to the base rows  Write/update/delete operations are atomic  Values shared by many rows are a good fit  Conversely, unique values are a poor fit for indexing (-> use a dynamic CF for indexing instead)
    • COMPACTION  Combines data from SSTables  Merges row fragments  Rebuilds primary and secondary indexes  Removes expired columns marked with a tombstone  Deletes the old SSTables when complete  "Minor" compactions merge only SSTables of similar size, "major" compactions merge all SSTables in a given CF  Size-tiered compaction  Leveled compaction  Since v1.0  Based on LevelDB  Temporarily uses up to twice the space and spikes disk IO
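    A simplified sketch of what a compaction merge does for a single row (illustrative only; the cell representation and names are made up): overlapping fragments from several SSTables are merged, the newest timestamp wins per column, and tombstones older than gc_grace_seconds are dropped.

        import java.util.HashMap;
        import java.util.List;
        import java.util.Map;

        // Merge row fragments from several SSTables: newest timestamp wins, expired tombstones dropped.
        final class CompactionMerge {
            record Cell(byte[] value, long timestampMicros, boolean tombstone) {}

            static Map<String, Cell> merge(List<Map<String, Cell>> fragments, long gcBeforeMicros) {
                Map<String, Cell> merged = new HashMap<>();
                for (Map<String, Cell> fragment : fragments) {
                    for (Map.Entry<String, Cell> e : fragment.entrySet()) {
                        Cell existing = merged.get(e.getKey());
                        if (existing == null || e.getValue().timestampMicros() > existing.timestampMicros()) {
                            merged.put(e.getKey(), e.getValue()); // last write wins per column
                        }
                    }
                }
                // Drop tombstones older than gc_grace_seconds; keeping them longer prevents resurrection.
                merged.values().removeIf(c -> c.tombstone() && c.timestampMicros() < gcBeforeMicros);
                return merged;
            }
        }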
    • ARCHITECTURE  Write: no race conditions, not bound by disk IO  Read: slower than writes, but still fast (DHT, caches, …)  Load balancing  Virtual nodes  Replication  Multi-DC
    • BENCHMARK  By YCSB (Yahoo Cloud Serving Benchmark)  Workload A (update heavy): (a) read operations, (b) update operations. Throughput in this (and all figures) represents total operations per second, including reads and writes.  Workload B (read heavy): (a) read operations, (b) update operations.  References: http://68.180.206.246/files/ycsb.pdf
    • BENCHMARK (CONT)  By YCSB (Yahoo Cloud Serving Benchmark)  Workload E (short scans).  Read performance as cluster size increases.  References: http://68.180.206.246/files/ycsb.pdf
    • BENCHMARK (CONT)  By YCSB (Yahoo Cloud Serving Benchmark)  Elastic speedup: time series showing the impact of adding servers online.  References: http://68.180.206.246/files/ycsb.pdf
    • BENCHMARK (CONT) By NoSQLBenchmarking.com References : http://www.nosqlbenchmarking.com/2011/02/new-results-for-cassandra-0-7-2//
    • BENCHMARK (CONT) By Cubrid References : http://www.cubrid.org/blog/dev-platform/nosql-benchmarking/
    • BENCHMARK (CONT)  By VLDB  Read latency  Write latency  Throughput (95% read, 5% write)  References: http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf/
    • BENCHMARK (LAST) By VLDB References : http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf/ Throughput (50% read, 50% write) Throughput (100% write)
    • PROBLEM HANDLING
    • RESOURCE  Memory  Off-heap & Heap  OOME Problem  CPU  GC  Hashing  Compression / Compaction  Network Handling  Context Switching  Lazy Problem  IO  Bottleneck for everything
    • MEMORY  Heap (GC management)  Permanent (-XX:PermSize, -XX:MaxPermSize)  JVM Heap (-Xmx, -Xms, -Xmn)  C-Heap (=Native Heap)  OS Shared  Thread Stack (-Xss)  Objects that access with JNI  Off-Heap  OS Shared  GC managed by Cassandra
    • MEMORY (CONT)  Heap  Permanent  JVM Heap  Memtable  KeyCache  IndexSummary(move to Off-heap on next release)  Buffer  Transport  Socket  Disk  C-Heap  Thread Stack  File Memory Map (Virtual space)  Data / Index buffer (default)  CommitLog v1.2  Off-Heap (OS shared)  RowCache  BloomFilter  Index->CompressionMetaData- >ChuckOffset
    • MEMORY (CONT)  Memtable  Managed  total size (default 1/3 JVM heap, flush largest memtable for CF if reached)  Emergency, heap usage above the fraction of the max after full GC(CMS) -> flush largest memtable (each time) -> prevent full GC / OOME  KeyCache  Managed  total size (100M or 5% of the max)  Emergency, heap usage above the fraction of the max after full GC(CMS) -> reduce max cache size -> prevent full GC / OOME  RowCache/CommitLog  Managed  total size (default disabled) -> prevent OOME
    • MEMORY (CONT)  Thread Stack  Not managed  But XSS set as 180k (default)  Check thrift (transport level, RPC server)’s server serving type (sync, hsha, async(has bugs))  Set min/max threads for connection (default unlimited) v1.2
    • MEMORY (CONT)  Transport buffer  Thrift  Support many languages and crossing  Provide server/client interface, serializing  Apache project, created by Facebook  Framed buffer (default max 16M, variable size)  4k, 16k, 32k, … 16M  Determine by client  Per connection  Adjust max frame buffer size (client, server)  Set min/max threads for connection (default unlimited) v1.2 Data Service Client Data Service Thrift
    • MEMORY (LAST)  C-heap / off-heap  OS shared -> other applications can cause problems  File memory map (virtual space)  Collected only on full GC  0 <= total size <= the size of the opened SSTables  If it cannot be allocated -> native OOME  But  Generally only a limited part of each SSTable is accessed  GC makes space  Worst case (if an OOME occurs)?  yaml -> disk_access_mode: standard (restart required)  Add sufficient nodes  yaml -> disk_access_mode: auto after joining (v1.2)
    • CPU  GC  CMS  Marking phase: low thread priority but a high usage rate (it's not a problem)  CMSInitiatingOccupancyFraction is 75 (default)  UseCMSInitiatingOccupancyOnly  Full GC  Frequency is what matters -> may indicate a problem (e.g. the thrift transport buffer)  Add nodes, or analyze memory usage and adjust the configuration  Minor GC  It's OK  Compaction  It is fine if it runs slowly  So lower its priority with "-XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Dcassandra.compaction.priority=1"  Sustained high CPU load -> that is when you need to add nodes
    • SWAPPING  Swapping causes big problems for real-time applications  IO blocks -> threads block -> gossip/compaction/flush … are delayed -> causing other problems  Disable swapping or keep it to a minimum  Disable the swap partition  Or enable JNA + kernel configuration  JNA: mlockall (keeps heap memory in physical memory)  Kernel  vm.swappiness=0 (but under memory pressure swapping is still possible)  vm.overcommit_memory=1  Or vm.overcommit_memory=2 (overcommit managed)  vm.overcommit_ratio=? (e.g. 0.75)  Max memory = swap partition size + ratio * physical memory size  e.g.: 8G = 2G + 0.75 * 8G
    • MONITORING  System monitoring  CPU / memory / disk  Nagios, Ganglia, Cacti, Zabbix  Network monitoring  Per client  NfSen (network flow monitoring, see: http://nfsen.sourceforge.net/#mozTocId376385)  Cluster monitoring / maintenance  OpsCenter
    • CHECK THREAD  Use the "top" command  Press "H" to show individual threads  Press "P" to sort by CPU usage  Pick the PID of the thread with the heaviest CPU usage  Convert the PID to hex (http://www.binaryhexconverter.com/decimal-to-hex-converter)  Run "jstack <parent PID> > filename.log" to save the Java stack to a file  Search for the hex PID (e.g. 313C)
    • CHECK HEAP  Use a dump file produced by "jmap" or on an OOME  Use "jhat" or another tool to analyze it  Check [B (byte arrays)  and the objects that reference them
    • For development and maintenance  Sorry, I had just two days to write this presentation. Next time I will write about these topics and talk to you. See you next time
    • Questions, or talk about anything related to Cassandra
    • Thank you  If you have any problems or questions, please contact me by email: jihyun.an@kt.com