APACHE CASSANDRA
Scalability, Performance and Fault Tolerance
in Distributed Databases
Jihyun.An (jihyun.an@kt.com)
18 June 2013
TABLE OF CONTENTS
 Preface
 Basic Concepts
 P2P Architecture
 Primitive Data Model & Architecture
 Basic Operations
 Fault Management
 Consistency
 Performance
 Problem handling
TABLE OF CONTENTS (NEXT TIME)
 Maintaining
 Cluster Management
 Node Management
 Problem Handling
 Tuning
 Playing (for Development, Client stance)
 Designing
 Client
 Thrift
 Native
 CQL
 3rd party
 Hector
 OCM
 Extension
 Baas.io
 Hadoop
PREFACE
OUR WORLD
 Traditional DBMSs are still very valuable
 Storage (+memory) and computational resources are cheaper than before
 But we face a new situation
 Big data
 (Near) real time
 Complex and varied requirements
 Recommendation
 Find FOAF (friend of a friend)
 …
 Event-driven triggering
 User sessions
 …
OUR WORLD (CONT)
 Complex applications combine different types of problems
 Different languages -> more productive
 e.g. functional languages, languages optimized for multiprocessing
 Polyglot persistence layer
 Performance vs durability?
 Reliability?
 …
TRADITIONAL DBMS
 Relational model
 Well-defined schema
 Access with selection/projection
 Derived data from joining/grouping/aggregating (counting, …)
 Small, refined data sets
 …
 But
 Painful data model changes
 Hard to scale out
 Ineffective at handling large volumes of data
 Not designed around the underlying hardware
 …
TRADITIONAL DBMS (CONT)
 Many constraints to provide ACID
 PK/FK checking
 Domain type checking
 … checking, checking
 Lots of IO / processing
 OODBMS, ORDBMS
 Good, but even more checking / processing
 Do not play well with disk IO
NOSQL
 Key-value store
 Column-oriented: Cassandra, HBase, Bigtable …
 Others: Redis, Dynamo, Voldemort, Hazelcast …
 Document-oriented
 MongoDB, CouchDB …
 Graph store
 Neo4j, OrientDB, BigOWL, FlockDB …
NOSQL (CONT)
Benefits
 Higher performance
 Higher scalability
 Flexible data model
 More effective for some cases
 Less administrative overhead
Drawbacks
 Limited Transactions
 Relaxed Consistency
 Unconstrained data
 Limited ad-hoc query capabilities
 Limited administrative aid tools
CAP
Brewer’s theorem: we can pick at most two of
 Consistency
 Availability
 Partition tolerance
 AP: Amazon Dynamo derivatives (Cassandra, Voldemort, CouchDB, Riak)
 CP: Neo4j, Bigtable and its derivatives (MongoDB, HBase, Hypertable, Redis)
 CA: Relational (MySQL, MSSQL, Postgres)
Cassandra = Dynamo (architecture) + BigTable (data model)
(Apache) Cassandra is a free, open-source, highly scalable,
distributed database system for managing large amounts of data
Written in Java
Runs on the JVM
References :
BigTable (http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/bigtable-osdi06.pdf)
Dynamo (http://web.archive.org/web/20120129154946/http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf)
DESIGN GOALS
 A simple key/value (column) store
 Focused on storage only
 No aggregating, grouping, etc.; only basic operations
(CRUD, range access)
 But extensible
 Hadoop (MR, HDFS, Pig, Hive, …)
 ESP
 Distributed processing interfaces (e.g. BSP, MR)
 Baas.io
 …
DESIGN GOALS (CONT)
 High availability
 Decentralized
 Every node can serve client requests
 Replication & replica access
 Multi-DC support
 Eventual consistency
 Less write complexity
 Audit and repair on read
 Tunable -> trade-offs between consistency, durability and latency
DESIGN GOALS (CONT)
 Incremental scalability
 All members are equal
 Linear scalability
 Unlimited space
 Write/read throughput increases linearly as nodes (members) are added
 Low total cost
 Minimal administrative work
 Automatic partitioning
 Flush / compaction
 Data balancing / moving
 Virtual nodes (since v1.2)
 Mid-range commodity nodes deliver good performance
 Nodes working together deliver high performance and huge capacity
FOUNDER & HISTORY
 Founder
 Avinash Lakshman (one of the authors of Amazon's Dynamo)
 Prashant Malik (Facebook engineer)
 Developers
 About 50
 History
 Open sourced by Facebook in July 2008
 Became an Apache Incubator project in March 2009
 Graduated to a top-level project in Feb 2010
 0.6 released (added support for integrated caching, and Apache Hadoop MapReduce) in Apr 2010
 0.7 released (added secondary indexes and online schema change) in Jan 2011
 0.8 released (added the Cassandra Query Language (CQL), self-tuning memtables, and support for zero-downtime upgrades) in Jun 2011
 1.0 released (added integrated compression, leveled compaction, and improved read performance) in Oct 2011
 1.1 released (added self-tuning caches, row-level isolation, and support for mixed ssd/spinning disk deployments) in Apr 2012
 1.2 released (added clustering across virtual nodes, inter-node communication, atomic batches, and request tracing) in Jan 2013
PROMINENT USERS
User | Cluster size | Node count | Usage | Now
Facebook | >200 | ? | Inbox search | Abandoned, moved to HBase
Cisco WebEx | ? | ? | User feed, activity | OK
Netflix | ? | ? | Backend | OK
Formspring | ? (26 million accounts, 10M responses per day) | ? | Social-graph data | OK
Also: Urban Airship, Rackspace, OpenX, Twitter (preparing to move)
BASIC CONCEPTS
P2P ARCHITECTURE
 All nodes are the same (all equal peers)
 No single point of failure / decentralized
 Compare with
 MongoDB
 Broker structures (CUBRID, …)
 Master / slave
 …
P2P ARCHITECTURE
 Delivers linear scalability
References :
http://dev.kthcorp.com/2011/12/07/cassandra-on-aws-100-million-writ/
PRIMITIVE DATA MODEL & ARCHITECTURE
COLUMN
 The basic, primitive type (the smallest increment of data)
 A tuple containing a name, a value and a timestamp
 The timestamp is important
 Provided by the client
 It determines the most recent version
 On a collision, the DBMS chooses the latest one
COLUMN (CONT)
 Types
 Standard: a column with a simple name (UUID or UTF8, …)
 Composite: a column with a composite name (UUID+UTF8, …)
 Expiring: marked with a TTL
 Counter: has only a name and a value; the timestamp is managed by the server
 Super: used to manage wide rows; inferior to composite columns (DO NOT USE: all sub-columns are serialized together)
COLUMN (CONT)
 Types (CQL3 based)
 Standard: has one primary key.
 Composite: has more than one primary key column; recommended for managing wide rows.
 Expiring: gets deleted during compaction (TTL).
 Counter: counts occurrences of an event.
 Super: used to manage wide rows; inferior to composite columns (DO NOT USE: all sub-columns are serialized together)
DDL:
CREATE TABLE test (
  user_id varchar,
  article_id uuid,
  content varchar,
  PRIMARY KEY (user_id, article_id)
);

<Logical>
user_id | article_id | content
Smith   | <uuid1>    | Blah1..
Smith   | <uuid2>    | Blah2..

<Physical> (one wide row with key "Smith")
Smith -> {uuid1,content} = Blah1… @timestamp | {uuid2,content} = Blah2… @timestamp

SELECT user_id, article_id FROM test ORDER BY article_id DESC LIMIT 1;
ROWS
 A row consists of a representative key and a set of columns
 The row key must be unique (often a UUID)
 Up to 2 billion columns per (physical) row are supported
 Columns are sorted by their name (the column name is indexed)
 Primitive
 Secondary index
 Direct column access
COLUMN FAMILY
 A container for rows and their columns
 No fixed schema
 Each row is uniquely identified by its row key
 Each row can have a different set of columns
 Rows are sorted by row key
 Comparator / validator
 Static / dynamic CFs
 If the column type is super column, the CF is called a “Super Column Family”
 Like a “table” in the relational world
DISTRIBUTION
[Diagram: rows spread across Server 1 to Server 4; how do we map each row to a server?]
TOKEN RING
 A node is an instance (typically one per server)
 Tokens are used to map each row to a node
 The token range is 0 to 2^127 - 1
 A token is associated with each row key
 Node
 Assigned a unique token (e.g. token 5 to node 5)
 A node's range runs from the previous node's token (exclusive) to its own token (inclusive)
 token 4 < node 5's range <= token 5 (a mapping sketch follows below)
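A minimal Java sketch of this key-to-node mapping (the names here are illustrative, not Cassandra internals): hash the row key to a token, then take the first node token at or above it, wrapping around the ring.

  import java.math.BigInteger;
  import java.security.MessageDigest;
  import java.util.Map;
  import java.util.TreeMap;

  class Ring {
      // token -> node; a node owns the range (previous token, its own token]
      final TreeMap<BigInteger, String> tokenToNode = new TreeMap<>();

      void addNode(BigInteger token, String node) {
          tokenToNode.put(token, node);
      }

      String ownerOf(String rowKey) throws Exception {
          // MD5 of the row key, roughly as the old RandomPartitioner did
          byte[] digest = MessageDigest.getInstance("MD5").digest(rowKey.getBytes("UTF-8"));
          BigInteger token = new BigInteger(1, digest);
          // first node token >= the key's token; wrap to the lowest token past the end
          Map.Entry<BigInteger, String> e = tokenToNode.ceilingEntry(token);
          return (e != null ? e : tokenToNode.firstEntry()).getValue();
      }
  }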
PARTITIONING
 The partitioner maps each row key to a token
 Random partitioners (MD5, Murmur3): the default; hash the row key for even distribution
 Order Preserving Partitioner / Byte Ordered Partitioner: keep row keys in sort order
REPLICATION
 Whichever node serves a client's read/write is called the coordinator node
 The locator determines where the replicas are placed
 Replicas are used for
 Consistency checks
 Repair
 Ensure W + R > N for consistency
 Local cache (row cache)
[Diagram: ring of 8 nodes with replication factor 4. The locator (simple strategy) locates the first replica at the key's node ("here is original"), then N-1 more copies are replicated; the simple locator treats ring order as proximity. A placement sketch follows below.]
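Extending the Ring sketch above, a hedged sketch of simple replica placement: walk clockwise from the key's token and collect the next RF distinct nodes.

  import java.math.BigInteger;
  import java.util.ArrayList;
  import java.util.List;
  import java.util.TreeMap;

  class SimpleLocator {
      final TreeMap<BigInteger, String> tokenToNode; // shared with the Ring sketch

      SimpleLocator(TreeMap<BigInteger, String> ring) { this.tokenToNode = ring; }

      // walk clockwise from the key's token, collecting RF distinct nodes
      List<String> replicasFor(BigInteger keyToken, int rf) {
          List<String> replicas = new ArrayList<>();
          List<String> clockwise = new ArrayList<>(tokenToNode.tailMap(keyToken).values());
          clockwise.addAll(tokenToNode.headMap(keyToken).values()); // wrap around
          for (String node : clockwise) {
              if (!replicas.contains(node)) replicas.add(node);
              if (replicas.size() == rf) break;
          }
          return replicas;
      }
  }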
REPLICATION (CONT)
 Multi-DC support
 Allows specifying how many replicas go to each DC
 Within a DC, replicas are placed on different racks
 Relies on a snitch to place replicas
 Strategies (informed by the snitch)
 Simple (single DC)
 RackInferringSnitch
 PropertyFileSnitch
 EC2Snitch
 EC2MultiRegionSnitch
ADD / REMOVE NODE
 Data transfer between nodes is called "streaming"
 If node 5 is added, nodes 3, 4 and 1 (assuming RF is 2) take part in streaming
 If node 2 is removed, node 3 (the holder of the next-higher token and of node 2's replicas) serves in its place
VIRTUAL NODES
 Supported since v1.2
 Live migration support? Use the shuffle utility
 One node holds many tokens
 => one node owns many ranges
[Diagram: a 2-node cluster with 4 tokens per node; each node's ranges interleave around the ring]
VIRTUAL NODES (CONT)
 Less administrative work
 Saves cost
 When adding/removing a node
 Many nodes cooperate
 No need to pick tokens manually
 Shuffle to re-balance
 Less time spent on changes
 Smart balancing
 No manual balancing needed
 (the number of tokens per node should be sufficiently high)
 A sketch of vnode assignment follows below
[Diagram: adding node 3 to a 2-node cluster with 4 tokens each; the new node takes ranges evenly from both]
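The vnode idea in terms of the Ring sketch above: the same lookup logic, but each physical node claims many tokens (an assumed helper, not Cassandra code).

  import java.math.BigInteger;
  import java.util.Random;

  class VnodeRing extends Ring {
      // each physical node claims numTokens random positions on the ring, so
      // its ranges interleave with everyone else's and rebalance automatically
      void addNodeWithVnodes(String node, int numTokens, Random rnd) {
          for (int i = 0; i < numTokens; i++) {
              tokenToNode.put(new BigInteger(127, rnd), node); // uniform in [0, 2^127)
          }
      }
  }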
KEYSPACE
 A namespace for column families
 Unit of authorization
 Contains column families
 Replication is configured per keyspace
 Key-oriented schema (example below)

{
  "row_key1": {
    "Users": {
      "emailAddress": {"name": "emailAddress", "value": "foo@bar.com"},
      "webSite":      {"name": "webSite",      "value": "http://bar.com"}
    },
    "Stats": {
      "visits": {"name": "visits", "value": "243"}
    }
  },
  "row_key2": {
    "Users": {
      "emailAddress": {"name": "emailAddress", "value": "user2@bar.com"},
      "twitter":      {"name": "twitter",      "value": "user2"}
    }
  }
}
(row key -> column family -> column)
CLUSTER
 The total amount of data managed by the cluster is represented as a ring
 A cluster is a set of nodes
 Holds one or more keyspaces
 Defines the partitioning strategy
 Handles authentication
GOSSIP
 The gossip protocol is used for cluster membership
 Failure detection at the service level (alive or not)
 Every node in the system learns every other node's status
 Implemented as Sync -> Ack -> Ack2 exchanges
 Information carried: status, load, bootstrapping
 Basic statuses are Alive / Dead / Join
 Runs every second
 Status disseminates in O(log N) rounds (N is the number of nodes)
 Seed nodes bootstrap membership
 PHI (accrual detection) decides dead or alive within a time window
(threshold 5 -> detection in roughly 15~16 s; see the sketch below)
 Data structure: HeartBeat < ApplicationState < EndpointState < EndpointStateMap
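A minimal sketch of an accrual (PHI) failure detector, assuming heartbeat intervals are roughly exponentially distributed; the real detector keeps a window of observed intervals, but the principle is the same.

  class AccrualFailureDetector {
      double meanIntervalMs = 1000;  // gossip runs every second
      long lastHeartbeatMs = -1;

      void heartbeat(long nowMs) {
          if (lastHeartbeatMs >= 0) {  // exponentially weighted moving average
              meanIntervalMs = 0.9 * meanIntervalMs + 0.1 * (nowMs - lastHeartbeatMs);
          }
          lastHeartbeatMs = nowMs;
      }

      // phi = -log10 P(silence this long) under an exponential model
      double phi(long nowMs) {
          double silence = nowMs - lastHeartbeatMs;
          return (silence / meanIntervalMs) * Math.log10(Math.E);
      }

      boolean convict(long nowMs, double threshold) {  // e.g. threshold = 5
          return phi(nowMs) > threshold;
      }
  }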
BASIC OPERATIONS
WRITE / UPDATE
 CommitLog
 An abstraction over mmapped files
 File and memory kept in sync -> on system failure, this is your savior ^^
 Java NIO
 Uses the C-heap (= native heap)
 Append-only log data (a write then a delete? the write entry still exists)
 Rolling segment structure
 Memtable
 In-memory buffer and workspace
 Sorted by row key
 When a size threshold or time period is reached, written to disk as a persistent table structure (SSTable)
WRITE / UPDATE (LOCAL LEVEL)
1. Write to the CommitLog
2. Write/update the Memtable
3. Flush the Memtable to disk as an SSTable

CommitLog (example entries):
  Write : "1":{"name":"fullname","value":"smith"}
  Write : "2":{"name":"fullname","value":"mike"}
  Delete: "1"
  Write : "3":{"name":"fullname","value":"osang"}

Memtable (sorted by row key):
  Key | Name     | Value
  1   | fullname | smith
  2   | fullname | mike
  3   | fullname | osang

On disk: SSTable, SSTable, SSTable … (a code sketch of these steps follows below)
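A hypothetical, stripped-down version of this write path (real Cassandra classes look nothing like this): log for durability first, then the sorted memtable, then a flush to an immutable SSTable file.

  import java.io.*;
  import java.util.TreeMap;

  class LocalWritePath {
      static final int FLUSH_THRESHOLD = 1000;                 // rows held in memory
      final DataOutputStream commitLog;                        // append-only segment
      final TreeMap<String, String> memtable = new TreeMap<>(); // sorted by row key
      int generation = 0;

      LocalWritePath(File logFile) throws IOException {
          commitLog = new DataOutputStream(new FileOutputStream(logFile, true));
      }

      void write(String rowKey, String column) throws IOException {
          commitLog.writeUTF(rowKey + "=" + column);           // 1. log for durability
          commitLog.flush();
          memtable.put(rowKey, column);                        // 2. sorted, in memory
          if (memtable.size() >= FLUSH_THRESHOLD) flush();     // 3. spill to disk
      }

      void flush() throws IOException {                        // immutable, sorted file
          File sstable = new File("sstable-" + generation++ + ".db");
          try (PrintWriter out = new PrintWriter(new FileWriter(sstable))) {
              memtable.forEach((k, v) -> out.println(k + "\t" + v));
          }
          memtable.clear();  // the matching commit log segment could now be recycled
      }
  }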
SSTABLE
 SSTable = Sorted String Table
 Well suited to a log-structured DB
 Stores large numbers of key-value pairs
 Immutable
 Created by a "flush"
 Merged by (major/minor) compaction
 Several SSTables may hold different versions of the same column (distinguished by timestamp)
 The most recent one is chosen
READ (LOCAL LEVEL)
A read consults the Memtable and each SSTable; per SSTable the bloom filter (BF) is checked first, then the index (IDX) locates the row on disk. Example result:
  Key | Name     | Value
  2   | fullname | mike
  3   | fullname | osang
(A merge sketch follows below.)
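A hedged sketch of the local read: memtable first, then every SSTable whose bloom filter might contain the key; the version with the newest timestamp wins.

  import java.util.List;
  import java.util.Map;

  class LocalReadPath {
      static class Versioned {
          final String value; final long timestamp;
          Versioned(String value, long timestamp) { this.value = value; this.timestamp = timestamp; }
      }
      interface SSTableReader {
          boolean mightContain(String rowKey);   // bloom filter check (BF)
          Versioned get(String rowKey);          // index lookup (IDX) + disk read
      }

      static Versioned read(String key, Map<String, Versioned> memtable, List<SSTableReader> sstables) {
          Versioned newest = memtable.get(key);
          for (SSTableReader t : sstables) {
              if (!t.mightContain(key)) continue;            // skip the disk seek entirely
              Versioned v = t.get(key);
              if (v != null && (newest == null || v.timestamp > newest.timestamp)) {
                  newest = v;                                // most recent timestamp wins
              }
          }
          return newest;
      }
  }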
READ (CLUSTER LEVEL, +READ REPAIR)
1. The coordinator's locator finds the replicas; data is transferred from the original/replica nodes (according to the consistency level)
2. Digests from the replicas are compared
3. If the digests differ, the most recent value is chosen as the right one and the out-of-date replica is recovered (read repair)
DELETE
 Adds a tombstone (a special kind of column)
 Garbage-collected during compaction
 GC grace seconds: 864000 (default, 10 days)
 Issue
 If a failed node recovers after GCGraceSeconds, the deleted data can be resurrected (see the sketch below)
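A tiny sketch of why GCGraceSeconds matters: a tombstone is only safe to purge once every replica has had a chance to see it (illustrative names, not Cassandra code).

  class TombstoneColumn {
      static final long GC_GRACE_SECONDS = 864000;       // default: 10 days
      final long deletionTimeSec;                        // when the delete was issued

      TombstoneColumn(long deletionTimeSec) { this.deletionTimeSec = deletionTimeSec; }

      // compaction may purge the tombstone only after the grace period; a node
      // that was down longer than this can resurrect the deleted data later
      boolean purgeable(long nowSec) {
          return nowSec - deletionTimeSec > GC_GRACE_SECONDS;
      }
  }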
FAULT MANAGEMENT
DETECTION
 Dynamic thresholds for marking nodes
 An accrual detection mechanism calculates a per-node threshold
 Automatically takes into account network conditions, workload, and other factors that might affect the perceived heartbeat rate
 From 3rd-party clients
 Hector
 Failover
HINTED-HANDOFF
 The coordinator stores a hint if the target node is down or fails to acknowledge the write
 A hint consists of the target replica and the mutation (column object) to be replayed
 Uses the Java heap (may move off-heap in a later release)
 Hints are only saved within a limited window (default, 1 hour) after a replica fails
 When the failed node comes back to life, the missed writes are streamed to it (see the sketch below)
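A hypothetical sketch of the coordinator side of hinted handoff (names are invented): hints are recorded only while the replica is inside its handoff window, and replayed once gossip marks it alive.

  import java.util.*;

  class HintStore {
      static final long WINDOW_MS = 3_600_000L;          // default window: 1 hour
      static class Hint {
          final String mutation;                          // the column write to replay
          Hint(String mutation) { this.mutation = mutation; }
      }
      final Map<String, List<Hint>> hintsByReplica = new HashMap<>();

      // record a hint only while the replica is inside its handoff window
      void store(String replica, String mutation, long nowMs, long downSinceMs) {
          if (nowMs - downSinceMs > WINDOW_MS) return;   // too late; rely on repair
          hintsByReplica.computeIfAbsent(replica, r -> new ArrayList<>()).add(new Hint(mutation));
      }

      // when gossip marks the replica alive again, replay its missed writes
      List<String> replay(String replica) {
          List<Hint> hints = hintsByReplica.remove(replica);
          List<String> mutations = new ArrayList<>();
          if (hints != null) for (Hint h : hints) mutations.add(h.mutation);
          return mutations;
      }
  }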
REPAIR
 Three repair mechanisms are supported
 CommitLog replaying (by the administrator)
 Read repair (real time)
 Anti-entropy repair (by the administrator)
READ REPAIR
 Runs as background work
 Configured per CF
 If replica values are inconsistent, the most recently written value is chosen and the stale ones are replaced
ANTI-ENTROPY REPAIR
 Ensures all data on a replica is made consistent
 A Merkle tree is used
 A tree of hashes over data blocks
 Used to verify inconsistencies
 The repairing node requests Merkle hashes (for a piece of a CF) from the replicas and compares them; if they are inconsistent, data is streamed from a replica / read repair is done (see the sketch below)
[Diagram: Merkle tree over CF blocks 1, 2, 3, …: leaf hashes combined pairwise up to a single root hash]
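A hedged sketch of the Merkle comparison: hash fixed blocks of a CF, fold pairs of hashes upward to a root, and when the roots differ compare the leaves to find the blocks that need streaming.

  import java.security.MessageDigest;
  import java.util.*;

  class MerkleSketch {
      static byte[] md5(byte[]... parts) throws Exception {
          MessageDigest md = MessageDigest.getInstance("MD5");
          for (byte[] p : parts) md.update(p);
          return md.digest();
      }

      // fold one level: parent = md5(left child || right child)
      static List<byte[]> parentLevel(List<byte[]> level) throws Exception {
          List<byte[]> up = new ArrayList<>();
          for (int i = 0; i < level.size(); i += 2) {
              byte[] left = level.get(i);
              byte[] right = level.get(Math.min(i + 1, level.size() - 1));
              up.add(md5(left, right));
          }
          return up;
      }

      static byte[] root(List<byte[]> leaves) throws Exception {
          List<byte[]> level = leaves;
          while (level.size() > 1) level = parentLevel(level);
          return level.get(0);
      }

      // if the roots differ, these leaf indexes are the blocks needing streaming
      static List<Integer> differingBlocks(List<byte[]> mine, List<byte[]> theirs) {
          List<Integer> diffs = new ArrayList<>();
          for (int i = 0; i < mine.size(); i++)
              if (!Arrays.equals(mine.get(i), theirs.get(i))) diffs.add(i);
          return diffs;
      }
  }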
CONSISTENCY
BASIC
 Full ACID compliance in a distributed system is a bad idea (network partitions, …)
 Single-row updates are atomic (including internal indexes); everything else is not
 Relaxing consistency does not equal data corruption
 Tunable consistency
 Speed vs precision
 Every read and write operation decides (on the client side) how consistent the requested data should be
CONDITION
 Consistency is ensured if
 (W + R) > N
 W is the number of nodes that acknowledged the write
 R is the number of nodes read
 N is the replication factor
CONDITION (CONT)
N is 3. Operations, in order: 1. write 3, 2. write 5, 3. write 1 (so 1 is the latest value).
 Worst case after writing: with W = 1 the replicas may hold (1, 5, 1); with W = 2, (3, 1, 1) or (1, 1, 1)
 Reading: with R = 1 the result may be 3, 5 or 1; with R = 2 (and W = 2) at least one replica read holds 1; with R = 3 the latest value is always seen
 (W + R) > N ensures that at least one latest value can be selected (shown as code below)
 This is eventual consistency
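The rule from this slide as code, plain arithmetic only:

  class ConsistencyCondition {
      // the write set (W nodes) and read set (R nodes) must overlap in at least
      // one replica, which is exactly W + R > N
      static boolean guaranteesLatest(int w, int r, int n) {
          return w + r > n;
      }
      // e.g. N = 3 after writes 3, 5, 1:
      //   W=1, R=1 -> false (a read may return a stale 3 or 5)
      //   W=2, R=2 -> true; W=1, R=3 -> true; W=3, R=1 -> true
  }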
READ CONSISTENCY LEVELS
 One
 Two
 Three
 Quorum
 Local Quorum
 Each Quorum
 All
These specify how many replicas must respond before a result is returned to the client.
Quorum = (replication factor / 2) + 1, rounded down to a whole number.
Local Quorum / Each Quorum are used in multi-DC deployments.
(As soon as the level is satisfied, the result is returned right away.)
WRITE CONSISTENCY LEVELS
 ANY
 One
 Two
 Three
 Quorum
 Local Quorum
 Each Quorum
 All
These specify how many replicas must succeed before an acknowledgement is returned to the client.
Quorum = (replication factor / 2) + 1, rounded down to a whole number (see the sketch below).
Local Quorum / Each Quorum are used in multi-DC deployments.
The ANY level also counts a hinted handoff as success.
(As soon as the level is satisfied, the acknowledgement is returned right away.)
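The quorum arithmetic from these two slides (Java integer division already rounds down):

  class QuorumMath {
      // QUORUM = (replication factor / 2) + 1, rounded down
      static int quorum(int replicationFactor) {
          return replicationFactor / 2 + 1;
      }
      // RF=3 -> 2, RF=4 -> 3, RF=5 -> 3
      // QUORUM writes + QUORUM reads always satisfy W + R > N
  }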
PERFORMANCE
CACHE
 Key/row caches can persist their data to files
 Key cache
 For frequently accessed keys
 Holds the locations of keys (pointing to their columns)
 In memory, on the JVM heap
 Row cache
 Optional
 Holds the entire set of columns of a row
 In memory, off-heap (since v1.1) or on the JVM heap
 If you have huge rows, this will cause an OOME (OutOfMemoryError)
CACHE
 Mmapped disk access
 On a 64-bit JVM, used for data and the index summary (default)
 Provides virtual mmapped space in memory for SSTables
 Lives on the C-heap (native heap)
 Effectively acts as a cache
 Frequently accessed data stays resident for a long time; otherwise it gets purged
 If the data is already in memory, it is returned from there (= cache)
 (Problem) the C-heap is reclaimed only when it is full
 (Problem) it must handle all open SSTables, meaning Cassandra can allocate up to the entire size of the open SSTables; otherwise, a native OOME
 If you want efficient key/row/mmapped-access caches, add sufficient nodes to the cluster
BLOOM FILTERS
 Each SSTable has one
 Used to check whether a requested row key exists in the SSTable before doing any (disk) seeks
 Per row key, several hashes are generated and the corresponding buckets are marked
 On lookup, each bucket for the key's hashes is checked; if any is empty, the key does not exist
 False positives are possible, but false negatives are not (see the sketch below)
[Diagram: Key 1 and Key 2 produce the same hashes A, B, C, so all their buckets are set to 1: a false positive]
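A minimal bloom filter sketch (not Cassandra's implementation): k hash functions set k bits per key, and a key "might exist" only if all k of its bits are set.

  import java.util.BitSet;

  class BloomFilterSketch {
      final BitSet bits;
      final int size, k;

      BloomFilterSketch(int size, int k) {
          this.bits = new BitSet(size); this.size = size; this.k = k;
      }

      // derive the i-th bucket from two base hashes (double hashing)
      int bucket(String key, int i) {
          int h1 = key.hashCode();
          int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;
          return Math.floorMod(h1 + i * h2, size);
      }

      void add(String key) {
          for (int i = 0; i < k; i++) bits.set(bucket(key, i));
      }

      boolean mightContain(String key) {
          for (int i = 0; i < k; i++)
              if (!bits.get(bucket(key, i))) return false; // definitely absent
          return true;                                     // present, or a false positive
      }
  }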
INDEX
 Primary index
 Per CF
 The index of the CF's row keys
 Efficient access via the index summary (1 row key out of every 128 is sampled)
 In memory, on the JVM heap (moves off-heap in a later release)
[Read path: bloom filter -> key cache -> index summary -> primary index -> offset into the SSTable]
INDEX (CONT)
 Secondary index
 Indexes column values
 Supports composite types
 Implemented as a hidden CF
 The indexed value serves as the index row's key; matching rows appear as its columns
 Write/update/delete operations on it are atomic
 A value shared by many rows indexes well
 Conversely, unique values index poorly (-> use a dynamic CF for such indexing; see the sketch below)
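A hypothetical in-memory picture of that hidden index CF: the indexed value acts as the index row key and the matching data rows as its columns; it also shows why unique values make poor index rows (one column per row key).

  import java.util.*;

  class HiddenIndexCF {
      // indexed column value -> row keys that carry it ("index row" -> "columns")
      final Map<String, Set<String>> index = new HashMap<>();

      void onWrite(String rowKey, String oldValue, String newValue) {
          if (oldValue != null) {                       // atomically drop the old entry
              Set<String> rows = index.get(oldValue);
              if (rows != null) rows.remove(rowKey);
          }
          index.computeIfAbsent(newValue, v -> new HashSet<>()).add(rowKey);
      }

      Set<String> query(String value) {                 // all rows with this value
          return index.getOrDefault(value, Collections.emptySet());
      }
  }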
COMPACTION
 Combines data from SSTables
 Merges row fragments
 Rebuilds primary and secondary indexes
 Removes expired columns marked with tombstones
 Deletes the old SSTables on completion
 "Minor" compactions merge only SSTables of similar size; "major" compactions merge all SSTables of a given CF
 Size-tiered compaction
 Leveled compaction
 Since v1.0
 Based on LevelDB
 Compaction can temporarily use up to twice the space and cause spikes in disk IO
ARCHITECTURE
 Write: no race conditions, and not bound by disk seeks
 Read: slower than write, but still fast (DHT, caches, …)
 Load balancing
 Virtual nodes
 Replication
 Multi-DC
BENCHMARK
References :
YCSB paper: http://68.180.206.246/files/ycsb.pdf
Workload A (update heavy): (a) read operations, (b) update operations. Throughput in this (and all figures) represents total operations per second, including reads and writes.
Workload B (read heavy): (a) read operations, (b) update operations.
By YCSB (Yahoo! Cloud Serving Benchmark)
BENCHMARK (CONT)
References :
YCSB paper: http://68.180.206.246/files/ycsb.pdf
Workload E (short scans).
By YCSB (Yahoo! Cloud Serving Benchmark)
Read performance as cluster size increases.
BENCHMARK (CONT)
Elastic speedup:
Time series showing
impact of adding
servers online.
By YCSB (Yahoo Cloud Serving Benchmark)
References :
YCSB paper: http://68.180.206.246/files/ycsb.pdf
BENCHMARK (CONT) By NoSQLBenchmarking.com
References :
http://www.nosqlbenchmarking.com/2011/02/new-results-for-cassandra-0-7-2/
BENCHMARK (CONT) By Cubrid
References :
http://www.cubrid.org/blog/dev-platform/nosql-benchmarking/
BENCHMARK (CONT) By VLDB
References :
http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf
Read latency / Write latency / Throughput (95% read, 5% write)
BENCHMARK (LAST) By VLDB
References :
http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf
Throughput (50% read, 50% write) / Throughput (100% write)
PROBLEM HANDLING
RESOURCE
 Memory
 Off-heap & heap
 OOME problems
 CPU
 GC
 Hashing
 Compression / compaction
 Network handling
 Context switching
 Lazy (slow-response) problems
 IO
 The bottleneck for everything
MEMORY
 Heap (GC-managed)
 Permanent generation (-XX:PermSize, -XX:MaxPermSize)
 JVM heap (-Xmx, -Xms, -Xmn)
 C-heap (= native heap)
 Shared with the OS
 Thread stacks (-Xss)
 Objects accessed via JNI
 Off-heap
 Shared with the OS
 "GC" is managed by Cassandra itself
MEMORY (CONT)
 Heap
 Permanent generation
 JVM heap
 Memtables
 KeyCache
 IndexSummary (moves off-heap in the next release)
 Buffers: transport, socket, disk
 C-heap
 Thread stacks
 File memory map (virtual space)
 Data / index buffers (default)
 CommitLog (v1.2)
 Off-heap (OS shared)
 RowCache
 BloomFilter (v1.2)
 Index -> CompressionMetaData -> ChunkOffset
MEMORY (CONT)
 Memtable
 Managed
 Total size (default 1/3 of the JVM heap; the largest memtable per CF is flushed when reached)
 Emergency: if heap usage stays above the configured fraction of the max after a full GC (CMS), the largest memtable is flushed (each time) -> prevents full GC / OOME
 KeyCache
 Managed
 Total size (100M or 5% of the max heap)
 Emergency: if heap usage stays above the configured fraction of the max after a full GC (CMS), the max cache size is reduced -> prevents full GC / OOME
 RowCache / CommitLog
 Managed
 Total size (disabled by default) -> prevents OOME
MEMORY (CONT)
 Thread stacks
 Not managed
 But -Xss is set to 180k by default
 Check the Thrift (transport-level RPC server) serving mode: sync, hsha, or async (has bugs)
 Set min/max threads for connections (default unlimited) (v1.2)
MEMORY (CONT)
 Transport buffers
 Thrift
 Supports many languages, cross-language
 Provides server/client interfaces and serialization
 An Apache project, created by Facebook
 Framed buffers (default max 16M, variable size)
 4k, 16k, 32k, … 16M
 Size is determined by the client
 Allocated per connection
 Adjust the max frame buffer size (client and server)
 Set min/max threads for connections (default unlimited) (v1.2)
MEMORY (LAST)
 C-heap / off-heap
 Shared with the OS -> other applications can cause problems here
 File memory map (virtual space)
 Reclaimed only on full GC
 0 <= total size <= the size of the opened SSTables
 If it cannot be allocated -> native OOME
 But
 Normally only a limited portion of each SSTable is accessed
 GC frees the space
 Worst case (if a native OOME occurs)?
 yaml -> disk_access_mode: standard (restart required)
 Add sufficient nodes, then yaml -> disk_access_mode: auto after joining (v1.2)
CPU
 GC
 CMS
 Marking phase: low thread priority but a high usage rate (this is not a problem)
 CMSInitiatingOccupancyFraction is 75 (default)
 UseCMSInitiatingOccupancyOnly
 Full GC
 Frequency is what matters -> frequent full GCs may indicate a problem (e.g. Thrift transport buffers)
 Add nodes, or analyze memory usage and adjust the configuration accordingly
 Minor GC
 It's OK
 Compaction
 Running slowly is fine
 So lower its priority with "-XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Dcassandra.compaction.priority=1"
 Sustained high CPU load? That is when you need to add nodes
SWAPPING
 Swapping is a big problem for a real-time application
 IO blocks -> threads block -> gossip/compaction/flush … get delayed -> other problems follow
 Disable swapping, or keep it to a minimum
 Disable the swap partition
 Or enable JNA + kernel configuration
 JNA: mlockall (keeps heap memory pinned in physical memory)
 Kernel
 vm.swappiness=0 (under memory distress, swapping may still occur)
 vm.overcommit_memory=1
 Or vm.overcommit_memory=2 (managed overcommit)
 vm.overcommit_ratio=? (e.g. 0.75)
 Max memory = swap partition size + ratio * physical memory size
 e.g. 8G = 2G + 0.75 * 8G
MONITORING
 System monitoring
 CPU / memory / disk
 Nagios, Ganglia, Cacti, Zabbix
 Network monitoring
 Per client
 NfSen (network flow monitoring, see: http://nfsen.sourceforge.net/#mozTocId376385)
 Cluster monitoring / maintenance
 OpsCenter
CHECK THREAD
 Run the "top" command
 Press "H" to show individual threads
 Press "P" to sort by CPU usage
 Pick the PID of the heaviest thread
 Convert that PID to hex (e.g. 12604 -> 313C; http://www.binaryhexconverter.com/decimal-to-hex-converter)
 Run "jstack <parent PID> > filename.log" to save the Java stack to a file
 Search the file for the hex PID
CHECK HEAP
 Use a heap dump file from "jmap", or the one produced on OOME
 Analyze it with "jhat" or another tool
 Check [B (byte arrays)
 and the objects referencing them
(Development and maintenance topics to follow)
Sorry, I had just two days to write this presentation.
Next time I will write about and present those topics.
See you next time.
Questions, or talk about anything Cassandra
Thank you
If you have any problem or question for me, please contact me by email.
jihyun.an@kt.com

About "Apache Cassandra"

  • 1.
    APACHE CASSANDRA Scalability, Performanceand Fault Tolerance in Distributed databases Jihyun.An (jihyun.an@kt.com) 18, June 2013
  • 2.
    TABLE OF CONTENTS Preface  Basic Concepts  P2P Architecture  Primitive Data Model & Architecture  Basic Operations  Fault Management  Consistency  Performance  Problem handling
  • 3.
    TABLE OF CONTENTS(NEXT TIME)  Maintaining  Cluster Management  Node Management  Problem Handling  Tuning  Playing (for Development, Client stance)  Designing  Client  Thrift  Native  CQL  3rd party  Hector  OCM  Extension  Baas.io  Hadoop
  • 4.
  • 5.
    OUR WORLD  TraditionalDBMS is very valuable  Storage(+Memory) and Computational Resources cost is cheap (than before)  But we meet new section  Big data  (near) Real time  Complex and various requirement  Recommendation  Find FOAF  …  Event Driven Trigging  User Session  …
  • 6.
    OUR WORLD (CONT) Complex applications combine difference types of problems  Different language -> more productive  ex: Functional language, Multiprocessing optimized language  Polyglot persistent layer  Performance vs Durability?  Reliability?  …
  • 7.
    TRADITIONAL DBMS  RelationalModel  Well-defined Schema  Access with Selection/Projection  Derived from Joining/Grouping/Aggregating(Counting..)  Small data (from refined)  …  But  Painful data model changes  Hard to scale out  Ineffective in handling large volumes of data  Not considered with hardware  …
  • 8.
    TRADITIONAL DBMS (CONT) Has many constraints for ACID  PK/FK & checking  Domain Type checking  .. checking checking  Lots of IO / Processing  OODBMS, ORDBMS  Good but .. more more checking / processing  Not well with Disk IO
  • 9.
    NOSQL  Key-value store Column : Cassandra, Hbase, Bigtable …  Others : Redis, Dynamo, Voldemort, Hazelcast …  Document oriented  MongoDB, CouchDB …  Graph store  Neo4j, Orient DB, BigOWL, FlockDB ..
  • 10.
    NOSQL (CONT) Benefits  Higherperformance  Higher scalability  Flexible Datamodel  More effective for some case  Less administrative overhead Drawbacks  Limited Transactions  Relaxed Consistency  Unconstrained data  Limited ad-hoc query capabilities  Limited administrative aid tools
  • 11.
    CAP Brewer’s theorem We canpick two of Consistency Availability Partition tolerance A C P Amazon Dynamo derivatives Cassandra, Voldemort, CouchDB , Riak Neo4j, Bigtable Bigtable derivatives : MongoDB, Hbase Hypertable, Redis Relational: MySQL, MSSQL, Postgres
  • 12.
    Dynamo (Architecture) BigTable (Data model) Cassandra (Apache) Cassandrais a free, open-source, high scalable, distributed database system for managing large amounts of data Written in JAVA Running on JVM References : BigTable (http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/bigtable-osdi06.pdf) Dynamo (http://web.archive.org/web/20120129154946/http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf)
  • 13.
    DESIGN GOALS  SimpleKey/Value(Column) store  limited on storage  No support anything (aggregating, grouping …) but basic operation (CRUD, Range access)  But extendable  Hadoop (MR, HDFS, Pig, Hive ..)  ESP  Distributed Processing Interface (ex: BSP, MR)  Baas.io  …
  • 14.
    DESIGN GOALS (CONT) High Availability  Decentralized  Everyone can accessor  Replication & Their access  Multi DC support  Eventual consistency  Less write complexity  Audit and repair when read  Possible tuning -> Trade offs between consistency, durability and latency
  • 15.
    DESIGN GOALS (CONT) Incremental scalability  Equal Member  Linear Scalability  Unlimited space  Write / Read throughput increase linearly by add node(member)  Low total cost  Minimize administrative work  Automatic partitioning  Flush / compaction  Data balancing / moving  Virtual nodes (since v1.2)  Middle powered nodes make good performance  Collaborating work will make powerful performance and huge space
  • 16.
    FOUNDER & HISTORY Founder  Avinash Lakshman (one of the authors of Amazon's Dynamo)  Prashant Malik ( Facebook Engineer )  Developer  About 50  History  Open sourced by Facebook in July 2008  Became an Apache Incubator project in March 2009  Graduated to a top-level project in Feb 2010  0.6 released (added support for integrated caching, and Apache Hadoop MapReduce) in Apr 2010  0.7 released (added secondary indexes and online schema change) in Jan 2011  0.8 released (added the Cassandra Query Language (CQL), self-tuning memtables, and support for zero-downtime upgrades) in Jun 2011  1.0 released (added integrated compression, leveled compaction, and improved read performance) in Oct 2011  1.1 released (added self-tuning caches, row-level isolation, and support for mixed ssd/spinning disk deployments) in Apr 2012  1.2 released (added clustering across virtual nodes, inter-node communication, atomic batches, and request tracing) in Jan 2013
  • 17.
    PROMINENT USERS User Clustersize Node count Usage Now Facebook >200 ? Inbox search Abandoned, Moved to HBase Cisco WebEx ? ? User feed, activity OK Netflix ? ? Backend OK Formspring ? (26 million account with 10 m responsed per day) ? Social-graph data OK Urban airship, Rackspace, Open X, Twitter (preparing move to)
  • 18.
  • 19.
    P2P ARCHITECTURE  Allnodes are same (has equality)  No single point of failure / Decentralized  Compare with  mongoDB  broker structure (cubrid …)  Master / slave  …
  • 20.
    P2P ARCHITECTURE  Drivenlinear scalability References : http://dev.kthcorp.com/2011/12/07/cassandra-on-aws-100-million-writ/
  • 21.
    PRIMITIVE DATA MODEL& ARCHITECTURE
  • 22.
    COLUMN  Basic andprimitive type (the smallest increment of data)  A tuple containing a name, a value and a timestamp  Timestamp is important  Provided by client  Determine the most recent one  If meet the collision, DBMS chose the latest one Name Value Timestamp
  • 23.
    COLUMN (CONT)  Types Standard: A column has a name (UUID or UTF8 …)  Composite: A column has composite name (UUID+UTF8 …)  Expiring: TTL marked  Counter: Only has name and value, timestamp managed by server  Super: Used to manage wide rows, inferior to using composite columns (DO NOT USE, All sub-columns serialized) Counter Name Value Name Name Value Timestamp Name Value Timestamp
  • 24.
    COLUMN (CONT)  Types(CQL3 based)  Standard: Has one primary key.  Composite: Has more than one primary key, recommended for managing wide rows.  Expiring: Gets deleted during compaction.  Counter: Counts occurrences of an event.  Super: Used to manage wide rows, inferior to using composite columns (DO NOT USE, All sub-columns serialized) DDL : CREATE TABLE test ( user_id varchar, article_id uuid, content varchar, PRIMARY KEY (user_id, article_id) ); user_id article_id content Smith <uuid1> Blah1.. Smith <uuid2> Blah2.. {uuid1,content} Blah1… Timestamp {uuid2,content} Blah2… Timestamp Smith <Logical> <Physical> SELECT user_id,article_id from test order by article_id DESC LIMIT 1;
  • 25.
    ROWS  A rowcontaining a represent key and a set of columns  A row key must be unique (usually UUID)  Supports up to 2 billion columns per (physical) row.  Columns are sorted by their name (Column’s Name indexed)  Primitive  Secondary Index  Direct Column Access Name Value Timestamp Name Value Timestamp Name Value Timestamp Row Key
  • 26.
    COLUMN FAMILY  Containerfor columns and rows  No fixed schema  Each row is uniquely identified by its row key  Each row can have a different set of columns  Rows are sorted by row key  Comparator / Validator  Static/Dynamic CF  If columns type is super column, CF called “Super Column Familty”  Like “Table” in Relational world Name Value Timestamp Name Value Timestamp Name Value Timestamp Row Key Name Value Timestamp Row Key
  • 27.
  • 28.
    TOKEN RING  Nodeis a instance (typically same as a server)  Used to map between each row and node  Range from 0 to 2127-1  Associated with a row key  Node  Assigned a unique token (ex: token 5 to Node 5)  Range is from previous node token to their token  token 4 < Node 5’range <= token 5 Node 1 Node 2 Node 3 Node 4Node 5 Node 6 Node 7 Node 8 Token 5 Token 4
  • 29.
  • 30.
    REPLICATION  Any nodehas read/write role is called coordinator node (by client)  Locator determine where located the replica  Replica is used at  Consistency check  Repair  Ensure W + R > N for consistency  Local Cache (Row cache) Node 1 Node 2 Node 3 Node 4Node 5 Node 6 Node 7 Node 8 Replica Factor is 4 (N-1 will be replicated) Simple Locator treat strategy order as proximity Locator (Simple) Coordinator node Locating first one 1 2 Here is original
  • 31.
    REPLICATION (CONT)  MultiDC support  Allow to Specify how many replcas in each DC  Within DC replicas are placed on different racks  Relies on snitch to place replicas  Strategy (provided from Snitch)  Simple (Single DC)  RackInferringSnitch  PropertyFileSnitch  EC2Snitch  EC2MultiRegionSnitch DC1 DC2 Entire
  • 32.
    ADD / REMOVENODE  Data transfer between nodes called “Streaming”  If add node 5, node 3 and node 4, 1 (suppose RF is 2) involved in streaming  If remove node 2 node 3(got higher token and their replica container) serve instead Node 1 Node 2 Node 3 Node 4 Node 1 Node 2 Node 3 Node 4 Node 5 Node 1 Node 3 Node 4
  • 33.
    VIRTUAL NODES  Supportsince v1.2  Real time migration support?  Shuffle utility  One node has many tokens  => one node has many ranges Node 1 Node 2 Number of token is 4 Cluster Node 2 Node 1
  • 34.
    VIRTUAL NODES (CONT) Less administrative works  Save cost  When Add/Remove node  many node co-works  No need to determine the token  Shuffle to re-balance  Less changing time  Smart balancing  No need to balance (Sufficiently number of token should be higher) Number of token is 4 Node 2 Node 1 Cluster Node 2 Node 1 Node 3 Add node 3
  • 35.
    KEYSPACE  A namespacefor column families  Authorization  CF? yeah  Replication  Key oriented schema (see right) { "row_key1": { "Users":{ "emailAddress":{"name":"emailAddress","value":"foo@bar.co m" }, "webSite":{"name":"webSite", "value":http://bar.com} }, "Stats":{ "visits":{"name":"visits", "value":"243"} } }, "row_key2": { "Users":{ "emailAddress":{"name":"emailAddress", "value":"user2@bar.com"}, "twitter":{"name":"twitter", "value":"user2"} } } } Row Key Column Family Column
  • 36.
    CLUSTER  Total amountof data managed by the cluster is represented as a ring  Cluster of nodes  Has multiple(or single) Keyspace  Partitioning Strategy defined  Authentication
  • 37.
    GOSSIP  Gossip protocolis used for cluster membership.  Failure detection on service level (Alive or Not)  Responsible  Every node in the system knows every other node’s status  Implemented as  Sync -> Ack -> Ack2  Information : status, load, bootstraping  Basic status is Alive/Dead/Join  Runs every second  Status disseminated in O(logN) (N is the number of nodes)  Seed  PHI is used for auditing dead or alive in time window (5 -> detecting in 15~16 s)  Data structure  HeartBeat<Application Status<Endpoint Status<Endpoint StatusMap N1 N2 N3 N4 N6 N5
  • 38.
  • 39.
    WRITE / UPDATE CommitLog  Abstracted Mmaped Type  File & Memory Sync -> On system failure? This is angel for U ^^.  Java NIO  C-Heap used (=Native Heap)  Log Data (Write->Delete? But exists)  Segment Rolling structure  Memtable  In memory buffer and workspace  Sorted order by row key  If reach threshold or period point, written to disk to a persistent table structure(SSTable)
  • 40.
    WRITE / UPDATE(LOCAL LEVEL) Write CommitLog Write : “1”:{“name”:”fullname”,”value”:”smith”} Write : “2”:{“name”:”fullname”,”value”:”mike”} Delete : “1” Write : “3”:{“name”:”fullname”,”value”:”osang”} … Key Name Value 1 fullname smith 2 fullname mike 3 fullname Osang … … … Memtable SSTable SSTable SSTable 1 Write to commitLog 2 Write/Update to Memtable 3Write to Disk (flush)
  • 41.
    SSTABLE  SSTable isSorted String Table  Best for log structured DB  Store large numbers of key-value pairs  Immutable  Create with “Flush”  Merges by (major/minor) compaction  Has one or more column has different version (timestamp)  Choose recent one
  • 42.
    READ (LOCAL LEVEL) KeyName Value 2 fullname mike 3 fullname Osang … … … SSTable BF IDX SSTable BF IDX Read Memtable
  • 43.
    READ (CLUSTER LEVEL,+READ REPAIR) Replica (Original, Right) Replica (Right) Replica (Wrong) Digest Comparing Choose the right one if digests differ (the most recent) Recover Read Operation Coordinator Locator 1 Transferred from original/replica node (with consistency level) 2 3
  • 44.
    DELETE  Add tomstone(this is some type of column)  Garbage collected when compacting  GC grace seconds : 864000 (default 10 days)  Issue  If the fault node recover after GCGraceSeconds, the deleted data can be resurrected
  • 45.
  • 46.
    DETECTION  Dynamic thresholdfor marking nodes  Accrual Detection Mechanism calculates a per-node threshold  Automatic take into account Network condition, workload and other conditions might affect perceived heartbeat rate.  From 3rd party client  Hector  Failover
  • 47.
    HINTED-HANDOFF  The coordinatorwill store a hint for if the node down or failed to acknowledge the write  Hint consists of the target replica and the mutation(column object) to be replayed  Use java heap (might next to be off-heap)  Only saved within limited time (default, 1 hour) after a replica fails  When failed node is alive again, it will begin streaming the miss writes
  • 48.
    REPAIR  Support trianglemethod  CommitLog Replaying (by administrator)  Read Repair (realtime)  Anti-entropy Repair (by administrator)
  • 49.
    READ REPAIR  Backgroundwork  Configured per CF  Choose most recently written value if they are inconsistent, and replace it.
  • 50.
    ANTI-ENTROPY REPAIR  Ensureall data on a replica is made consistent  Merkle tree used  Tree of data block’s hashes  Verify inconsistent  Repair node request merkle hash (piece of CF) to replicas and comparing, streaming from a replica if inconsistent, do Read-repair Block 1 Block 2 Block 3 … CF hash hash hash hash hash hash hash
  • 51.
  • 52.
    BASIC  Full ACIDcompliance in distributed system is a bad idea. (network, … )  Single row updates are atomic (include internal indexes), everything else is not  Relaxing consistency does not equal data corruption  Tunable Consistency  Speed vs precision  Any read and write operation decides how consistent the requested data should be (from client)
  • 53.
    CONDITION  Consistency ensureif  (W + R) > N  W is nodes written (succeed)  R is nodes read  N is replica factor
  • 54.
    CONDITION (CONT) N is3 Operations 1. Write 3 2. Write 5 3. Write 1 3 5 1 Worst case W is 1 1 5 1W is 2 3 1 1or W is 2 1 1 1 R is 1 Possible case 3 5 1or or R is 21 1 R is 3 Written Read (W+R)>N ensure that at lease one latest value can be selected This is eventual consistency
  • 55.
    READ CONSISTENCY LEVELS One  Two  Three  Quorum  Local Quorum  Each Quorum  All Specify how many replicas must response before a result is return to the client Quorum : (Replication Factor / 2) + 1 Local Quorum / Each Quorum is used at Multi- DC Round down to a whole number processing (If satisfied, return right away)
  • 56.
    WRITE CONSISTENCY LEVELS ANY  One  Two  Three  Quorum  Local Quorum  Each Quorum  All Specify how many replicas must succeed before returning acknowledge to client Quorum : (Replication Factor / 2) + 1 Local Quorum / Each Quorum is used at Multi- DC ANY level contain hinted-handoff condition Round down to a whole number processing (If satisfied, return right away)
  • 57.
  • 58.
    CACHE  Key/Row Cachecan save their data to files  Key Cache  Accessed Frequently  Hold the location of keys (indicating to columns)  In memory, on JVM heap  Row Cache  Optional  Hold entire columns of the row  In memory, on Off-heap (since v1.1) or JVM heap  If you have huge column, this will make OOME (Out Of Memory Event)
  • 59.
    CACHE  Mmaped DiskAccess  On 64bit JVM, used for data and index summary (default)  Provide virtual mmaped space in Memory for SSTable  On C-Heap(native heap)  GC make this as cache  Data accessed frequently live long period, otherwise GC will purge that  If the data exists in memory, return it (=cache)  (Problem) GC C-Heap when its full only  (Problem) handle open SSTable, this mean Cassandra can allocate the entire size of open SSTables, otherwise native OOME  If you wanna have efficient Key/Row/Mmaped Access cache, add sufficient nodes to cluster
  • 60.
    BLOOM FILTERS  EachSSTable has this  Used to check if a requested row key exists in the SSTable before doing any seeks (disk)  Per row key, generate several hashes and mark the buckets for the key  Check each bucket for the key’s hashes, if any is empty the key does not exists  False positive are possible, but false negative are not Key 1 Key 2 Key 2 Hash A Hash B Hash C 1 1 1 Same hashes Only has
  • 61.
    INDEX  Primary Index Per CF  The index of CF’s row key  Efficient access with Index summary (1 row key out of every 128 is sampled)  In memory, on JVM heap (next move to Off-heap) Read BF KeyCache SSTable Index Summary Primary Index Offset Calculator
  • 62.
    INDEX (CONT)  SecondaryIndex  For Column’s value(s)  Support composite type  Hidden CF  Implemented by CF’name index  Value is the CF’name  Write/Update/Delete operation is atomic  Share value for many rows is good for  On the contrary unique value for indexing is poor (-> use Dynamic CF for indexing)
  • 63.
    COMPACTION  Combines datafrom SSTables  Merge row fragments  Rebuild primary and secondary indexes  Remove expired columns marked with tomestone  Delete old SSTable if complete  “Minor” only compactions merge SSTables of similar size, “Major” compactions merge all SSTables in a given CF  Size-tiered compaction  Leveled compaction  Since v1.0  Based on LevelDB  Temporary use maximum twice space and spike in disk IO.
  • 64.
    ARCHITECTURE  Write :no race conditions, not handled by disk IO  Read : Slow than write, but fast (DHT, cache …)  Load balancing  Virtual Nodes  Replication  Multi-DC
  • 65.
    BENCHMARK References : http://www.google.co.kr/url?sa=t&rct=j&q=&esrc=s&frm=1&source=web&cd=1&cad=rja&sqi=2&ved=0CCsQFjAA&url=http%3A%2F%2F68.18 0.206.246%2Ffiles%2Fycsb.pdf&ei=O_nAUYqlPI2okQWO-ICwCA&usg=AFQjCNGySLHho0zZ- eMsJIm4VjsoNEOyKw&sig2=6p45QMDvTN963EqbM8YpDg/ Workload A—updateheavy: (a) read operations, (b) update operations. Throughput in this (and all figures) represents total operations per second, including reads and writes. Workload B—read heavy: (a) read operations, (b) update operations By YCSB (Yahoo Cloud Serving Benchmark)
  • 66.
  • 67.
    BENCHMARK (CONT) Elastic speedup: Timeseries showing impact of adding servers online. By YCSB (Yahoo Cloud Serving Benchmark) References : http://www.google.co.kr/url?sa=t&rct=j&q=&esrc=s&frm=1&source=web&cd=1&cad=rja&sqi=2&ved=0CCsQFjAA&url=http%3A%2F%2F68.18 0.206.246%2Ffiles%2Fycsb.pdf&ei=O_nAUYqlPI2okQWO-ICwCA&usg=AFQjCNGySLHho0zZ- eMsJIm4VjsoNEOyKw&sig2=6p45QMDvTN963EqbM8YpDg/
  • 68.
    BENCHMARK (CONT) ByNoSQLBenchmarking.com References : http://www.nosqlbenchmarking.com/2011/02/new-results-for-cassandra-0-7-2//
  • 69.
    BENCHMARK (CONT) ByCubrid References : http://www.cubrid.org/blog/dev-platform/nosql-benchmarking/
  • 70.
    BENCHMARK (CONT) ByVLDB References : http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf/ Read latency Write latencyThroughput (95% read, 5% write)
  • 71.
    BENCHMARK (LAST) ByVLDB References : http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf/ Throughput (50% read, 50% write) Throughput (100% write)
  • 72.
  • 73.
    RESOURCE  Memory  Off-heap& Heap  OOME Problem  CPU  GC  Hashing  Compression / Compaction  Network Handling  Context Switching  Lazy Problem  IO  Bottleneck for everything
  • 74.
    MEMORY  Heap (GCmanagement)  Permanent (-XX:PermSize, -XX:MaxPermSize)  JVM Heap (-Xmx, -Xms, -Xmn)  C-Heap (=Native Heap)  OS Shared  Thread Stack (-Xss)  Objects that access with JNI  Off-Heap  OS Shared  GC managed by Cassandra
  • 75.
    MEMORY (CONT)  Heap Permanent  JVM Heap  Memtable  KeyCache  IndexSummary(move to Off-heap on next release)  Buffer  Transport  Socket  Disk  C-Heap  Thread Stack  File Memory Map (Virtual space)  Data / Index buffer (default)  CommitLog v1.2  Off-Heap (OS shared)  RowCache  BloomFilter  Index->CompressionMetaData- >ChuckOffset
  • 76.
    MEMORY (CONT)  Memtable Managed  total size (default 1/3 JVM heap, flush largest memtable for CF if reached)  Emergency, heap usage above the fraction of the max after full GC(CMS) -> flush largest memtable (each time) -> prevent full GC / OOME  KeyCache  Managed  total size (100M or 5% of the max)  Emergency, heap usage above the fraction of the max after full GC(CMS) -> reduce max cache size -> prevent full GC / OOME  RowCache/CommitLog  Managed  total size (default disabled) -> prevent OOME
  • 77.
    MEMORY (CONT)  ThreadStack  Not managed  But XSS set as 180k (default)  Check thrift (transport level, RPC server)’s server serving type (sync, hsha, async(has bugs))  Set min/max threads for connection (default unlimited) v1.2
  • 78.
    MEMORY (CONT)  Transportbuffer  Thrift  Support many languages and crossing  Provide server/client interface, serializing  Apache project, created by Facebook  Framed buffer (default max 16M, variable size)  4k, 16k, 32k, … 16M  Determine by client  Per connection  Adjust max frame buffer size (client, server)  Set min/max threads for connection (default unlimited) v1.2 Data Service Client Data Service Thrift
  • 79.
    MEMORY (LAST)  C-Heap/Off-Heap OS Shared -> Other application possible to make some problem  File Memory Map (Virtual space)  GC when Full GC  0 <= total size <= the size of opened SSTables  If cannot allocate? -> Native OOME  But  Generally access limited space of SSTable  GC make space  Worst case? (If OOME occur)  yaml->disk_access_mode : standard (restart required)  Add sufficient nodes  Yaml->disk_access_mode : auto After joining v1.2
  • 80.
    CPU  GC  CMS Marking phase : low thread priority -> but high usage rate (it’s not a problem)  CMSInitiatingOccupancyFraction is 75 (default)  UseCMSInitiatingOccupancyOnly  Full GC  Frequency is important -> may has a problem (eg: thrift transport buffer)  Add nodes or analyze memory usage to adjust configuration for  Minor GC  It’s OK  Compaction  If do slow, okay  So priority down with “-XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Dcassandra.compaction.priority=1”  High CPU Load -> sustaining? -> When U need to add nodes
  • 81.
    SWAPPING  Swapping makebig problem for real-time application  IO block -> Thread block -> Gossip/Compaction/Flush … delaying -> make other problem  Disable or Set minimum Swapping  Disable Swap partition  Or Enable JNA + Kernel Configuration  JNA : Mlockall (keep heap memory in physical memory)  Kernel  vm.swappiness=0 (but distress -> possible to swapping)  vm.overcommit_memory=1  Or vm.overcommit_memory=2 (overcommit managed)  vm.overcommit_ratio=? (eg 0.75)  Max memory = swap partition size + ratio*physical memory size  Eg: 8G = 2G + 0.75*8G
  • 82.
    MORNITERING  System Monitoring CPU / Memory / Disk  Nagios, Ganglia, Cacti, Zabbix  Network Monitoring  Per Client  NfSen (network flow monitoring, see: http://nfsen.sourceforge.net/#mozTocId376385)  Cluster Monitoring / Maintaining  OpsCenter
  • 83.
    CHECK THREAD  “top”command  “H” key command to spread per thread  “P” key command to sort by CPU usage rate  Choose heavy rate thread’s PID  PID convert to in Hex (http://www.binaryhexconverter.com/decimal-to-hex-converter)  “jstack <Parent PID> > filename.log” command to save java stack to file  Search PID in Hex 313C
  • 84.
    CHECK HEAP  Usedump file that from “jmap” or OOME  Use “jhat” or another tool to analyze  Check [B  and their reference object
  • 85.
    For development, maintaining Sorry.. Ihave just two days to write this presentation. Next time I will write and speak to U. See U next time
  • 86.
    Question or Talkabout anything with Cassandra
  • 87.
    Thank you If youhave any problem or question for me, please contact my email. jihyun.an@kt.com