GETTING STARTED WITH CASSANDRA 2.1
BY VISWANATH JAYACHANDRAN
AGENDA
PART 1
• BIG DATA
• INTRODUCTION TO CASSANDRA
• INTERNAL ARCHITECTURE
• WRITE PATH
• COMPACTION
• READ PATH
PART 2
• INSTALLATION
• CASSANDRA TOOLS
• CCM
• OPS CENTRE
• DEV CENTRE
• NODE TOOL AND CASSANDRA STRESS
• DATA MODEL
• CASSANDRA QUERY LANGUAGE (CQL)
RESOURCES
• PLANET CASSANDRA
• HTTP://PLANETCASSANDRA.ORG/
• DATASTAX CASSANDRA DOCUMENTATION
• HTTP://WWW.DATASTAX.COM/DOCS
• APACHE CASSANDRA PROJECT
• HTTP://CASSANDRA.APACHE.ORG/
BIG DATA
BIG DATA
EVERYBODY IS DOING IT, BUT NOT MANY KNOW WHAT IT IS.
INFOGRAPHIC
CHARACTERISTICS REQUIRED FOR BIG DATA SYSTEMS
• MULTI-REGION AVAILABILITY
• VERY FAST AND RELIABLE RESPONSE
• NO SINGLE POINT OF FAILURE
RELATIONAL MODEL
• NORMALIZED TABLE SCHEMA
• CROSS TABLE JOINS
• ACID COMPLIANCE
• BIG DATA TABLE JOINS – BILLIONS OF ROWS, OR MORE – REQUIRE MASSIVE OVERHEAD
• SHARDING TABLES ACROSS SYSTEMS IS COMPLEX AND FRAGILE
BIG DATA
PRIORITIES OF MODERN APPLICATIONS
1. THE NEED FOR SPEED AND AVAILABILITY OUTRANKS "ALWAYS ON" CONSISTENCY
2. COMMODITY SERVER RACKS INSTEAD OF MASSIVE HIGH-END SYSTEMS
3. REAL-WORLD NEED FOR TRANSACTIONAL GUARANTEES IS LIMITED
STRATEGIES FOR MODERN APPLICATIONS
1. RELAX CONSISTENCY AND SCHEMA REQUIREMENTS
2. DISTRIBUTE DATA ACROSS NODES
3. OPTIMIZE DATA TO SUIT ACTUAL NEEDS
CAP THEOREM
• IN DISTRIBUTED SYSTEMS, CONSISTENCY, AVAILABILITY, AND PARTITION TOLERANCE EXIST IN A MUTUALLY DEPENDENT RELATIONSHIP.
• PICK ANY TWO.
SOFTWARE. HARDWARE. COMPLETE.
• VERTICAL SCALING HAS ITS LIMITS!
WHAT IS CASSANDRA?
• A DISTRIBUTED DATABASE FOR MANAGING LARGE AMOUNTS OF STRUCTURED DATA ACROSS MANY COMMODITY SERVERS.
• NEAR-LINEAR HORIZONTAL SCALING ACROSS COMMODITY SERVERS
• NO SINGLE POINT OF FAILURE: CASSANDRA HAS A MASTERLESS “RING” DESIGN WHERE ALL NODES PLAY AN IDENTICAL ROLE; THERE IS NO CONCEPT OF A MASTER NODE.
FAULT TOLERANT
[Ring diagram: an 8-node cluster with Replication Factor = 3. If a node fails or goes down temporarily, the client can still retrieve the data from the other 2 replica nodes.]
LINEARLY SCALABLE
• SIMPLY ADD NODES TO DOUBLE, QUADRUPLE PERFORMANCE AND CAPACITY
[Diagram: adding nodes scales throughput near-linearly – e.g. from 100 000 transactions/sec to 200 000 transactions/sec with double the nodes, and to 400 000 transactions/sec with quadruple the nodes.]
MULTI DATA CENTRE SUPPORT
• DATA CENTERS ARE ACTIVE–ACTIVE
• WRITE TO EITHER DATA CENTRE
[Diagram: a client can write to either the North American data center or the European data center; the two rings replicate to each other via asynchronous replication.]
• BENEFITS
• DATA-LOCALITY
• DISASTER RECOVERY
DISASTER PREVENTION
AVAILABILITY AND RESILIENCY AS A SERVICE
A SET OF TOOLS (SCHEDULED AGENTS) THAT DELIBERATELY SHUTS DOWN SERVICES, SLOWS DOWN PERFORMANCE, AND CHECKS CONFORMITY
• CHAOS MONKEY RANDOMLY BRINGS DOWN A NODE.
• GORILLA MONKEY SIMULATES THE OUTAGE OF AN ENTIRE AVAILABILITY ZONE.
• KONG MONKEY SIMULATES THE OUTAGE OF AN ENTIRE REGION.
MULTI DATA CENTRE FOR WORKLOAD SEGREGATION
• COPY OF PRODUCTION DATA FOR TESTING, BENCHMARKING AND RUNNING ANALYTICS
[Diagram: a client writes to the production / live data center cluster; an analytics data center running Spark / Hadoop holds a copy of the data, kept up to date by asynchronous replication.]
ORACLE VS CASSANDRA DEPLOYMENTS
CLOUDERA: HADOOP'S ANTI FRAUD REFERENCE
ARCHITECTURE
REAL TIME FRAUD DETECTION WITH DSE
HISTORY
• CLUSTER LAYER
• AMAZON DYNAMO PAPER
• MASTERLESS ARCHITECTURE
• DATA-STORE LAYER
• GOOGLE BIGTABLE PAPER
• COLUMNS / COLUMN FAMILIES
• OPEN SOURCED SINCE 2008
INTERNALS OF CASSANDRA
INTERNAL ARCHITECTURE
[Diagram: a Cassandra cluster with Data centre 1 (Nodes 1–4) and Data centre 2 (Nodes 5–8).]
• NODE – ONE CASSANDRA INSTANCE
• RACK – A LOGICAL SET OF NODES
• DATA CENTRE – A LOGICAL SET OF RACKS
• CLUSTER – THE FULL SET OF NODES WHICH MAP TO A SINGLE COMPLETE TOKEN RING
DATA DISTRIBUTION
• DATA IS STORED ON NODES IN PARTITIONS; A PARTITION IS ANALOGOUS TO A ROW IN AN RDBMS TABLE.
• A PARTITION'S KEY IS PASSED TO A CONSISTENT HASHING ALGORITHM TO GENERATE A TOKEN.
• A TOKEN IS AN INTEGER WHOSE VALUE IS BETWEEN -2^63 AND 2^63 - 1.
• THE TOKEN IS USED TO IDENTIFY THE LOCATION OF A PARTITION WITHIN A CLUSTER.
• IN OTHER WORDS, TOKEN = HASH(PARTITION KEY)
[Diagram: the MurmurHash3 function maps partition keys such as 12345 and test@example.com to integer tokens.]
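A quick way to see this from cqlsh (a minimal sketch; it assumes the users table defined later in the Data Model section) is the built-in token() function, which returns the token the partitioner computes for a partition key:
SELECT USER_ID, TOKEN(USER_ID) FROM USERS;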
CONSISTENT HASHING AND PARTITIONER
• CONSISTENT HASHING ALLOWS DISTRIBUTING DATA ACROSS A CLUSTER WHILE MINIMIZING REORGANIZATION WHEN NODES ARE ADDED OR REMOVED.
• EACH NODE IN THE CLUSTER IS RESPONSIBLE FOR A RANGE OF DATA BASED ON THE HASH VALUE.
Example (partition key = name):
name | age | car | gender
jim | 36 | camaro | M
carol | 37 | bmw | F
johnny | 12 | | M
suzy | 10 | | F
Partition key | Murmur3 hash value
jim | -2245462676723223822
carol | 7723358927203680754
johnny | -6723372854036780875
suzy | 1168604627387940318
[Diagram: a 4-node ring; the token space from -2^63 to 2^63 - 1 is split into four contiguous ranges, one per node, and each row is placed on the node whose range contains its hash – e.g. suzy (token 1168604627387940318) lands on the node owning 0 to 4611686018427387903.]
V-NODES
• VNODES ALLOW EACH NODE TO OWN A LARGE NUMBER OF SMALL PARTITION RANGES DISTRIBUTED THROUGHOUT THE CLUSTER.
• VNODES ALSO USE CONSISTENT HASHING TO DISTRIBUTE DATA, BUT USING THEM DOES NOT REQUIRE TOKEN GENERATION AND ASSIGNMENT.
CONSISTENCY LEVELS
• APPLY BOTH TO READS & WRITES AND ARE TUNABLE AT RUNTIME
1. ONE: FAST; MAY NOT READ THE LATEST WRITTEN VALUE
2. QUORUM: STRICT MAJORITY W.R.T. THE REPLICATION FACTOR; A GOOD BALANCE
3. ALL: PARANOID; SLOW, NO HIGH AVAILABILITY
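In cqlsh, for example, the consistency level can be tuned per session at runtime (a minimal sketch; the query targets the users table defined later in this deck):
CONSISTENCY QUORUM;
SELECT * FROM USERS WHERE USER_ID = 'FRODO';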
WRITE PATH
HOW DATA IS WRITTEN INTO THE STORAGE ENGINE
LOG STRUCTURED STORAGE ENGINE
• IN CASSANDRA, DATA IS SEQUENTIALLY APPENDED, NOT PLACED IN PRE-SET LOCATIONS
KEY COMPONENTS OF THE WRITE PATH
• TO HANDLE WRITE REQUESTS, EACH NODE IMPLEMENTS 4 KEY COMPONENTS
1. MEMTABLES – IN-MEMORY TABLES CORRESPONDING TO CQL TABLES, WITH INDEXES
2. COMMIT LOG – APPEND-ONLY LOG, REPLAYED TO RESTORE A DOWNED NODE'S MEMTABLES
3. SSTABLES – MEMTABLE SNAPSHOTS PERIODICALLY FLUSHED TO DISK, CLEARING HEAP
4. COMPACTION – PERIODIC PROCESS TO MERGE AND STREAMLINE SSTABLES
[Diagram – write path on a single node: the coordinator forwards the write to a node. The node appends the write to the commit log on the file system and to the memtable corresponding to the CQL table in node memory (e.g. partition key 1 → firstName:Jim, lastName:Gordon, age:42; partition key 2 → firstName:Alfred, lastName:Pennyworth, age:62; partition key 3 → firstName:Bruce, lastName:Wayne, age:30). The memtable's current state is periodically flushed to immutable sorted string tables (SSTables) on disk, which are in turn periodically compacted.]
COMMITLOG
• CONFIGURED IN CONF/CASSANDRA.YAML
• WHEN THE COMMIT LOG SIZE REACHES ITS THRESHOLD, THE MEMTABLE IS FLUSHED TO DISK.
• COMMITLOG_TOTAL_SPACE_IN_MB – TOTAL SPACE TO BE USED FOR ALL COMMIT LOGS
• COMMITLOG_SEGMENT_SIZE_IN_MB – MAX SIZE OF AN INDIVIDUAL COMMIT LOG SEGMENT
• FLUSHED COMMIT LOG SEGMENTS ARE RECYCLED FOR REUSE RATHER THAN BEING WIPED AND RECREATED.
• FOR EFFICIENCY, ENSURE THAT DATA DIRECTORIES AND COMMIT LOGS ARE ON DIFFERENT DRIVES TO MINIMISE WRITE-HEAD CONTENTION.
• COMMIT LOG ENTRIES ACCRUE IN MEMORY, AND ARE SYNCHRONISED TO DISK IN EITHER A BATCH OR PERIODIC MANNER.
• BATCH – WRITES ARE NOT ACKNOWLEDGED UNTIL THE LOG SYNCS TO DISK. DEFAULT WINDOW IS 50 MS.
• PERIODIC – WRITES ARE ACKNOWLEDGED IMMEDIATELY, WHILE THE SYNC HAPPENS PERIODICALLY. DEFAULT SYNC CYCLE IS 10 SECONDS.
• SEE COMMITLOG_SYNC
COMPACTION
• A CRITICAL, PERIODIC SSTABLE MAINTENANCE PROCESS THAT
1. MERGES THE MOST RECENT PARTITION KEYS AND COLUMNS
2. EVICTS DELETED AND TTL-EXPIRED PARTITION COLUMNS
3. CREATES A NEW SSTABLE
4. REBUILDS THE PARTITION INDEX AND PARTITION SUMMARY
5. DELETES THE OLD SSTABLES
• WHY IS IT NECESSARY?
• SSTABLES ARE IMMUTABLE, SO UPDATES TEND TO FRAGMENT DATA OVER TIME
• DELETES ARE WRITES (TOMBSTONES) AND MUST EVENTUALLY BE PURGED
READ PATH
UNDERSTAND HOW DATA IS READ FROM THE STORAGE ENGINE
READ PATH FLOW AMONG NODES
[Diagram – read with a row cache hit: the coordinator sends Read <pk7> to a node. The row cache in node memory already holds pk7, so the read is served straight from the cache and returns pk7 → First: Elizabeth, Last: Blue, Level: 42 without touching the memtable or SSTables.]
[Diagram – row cache miss, key cache hit: the coordinator sends Read <pk7>. The row cache misses, so the node reads pk7 from the memtable (Level:42, timestamp 1114) and consults each SSTable's bloom filter; where a filter reports a possible match, the key cache supplies pk7's offset within that SSTable. The values found in the memtable and SSTables (e.g. First:Elizabeth at timestamp 994, Last:Blue at timestamp 541, Level:63 at timestamps 541 and 994) are merged, and for each column the value with the latest timestamp wins: pk7 → First:Elizabeth, Last:Blue, Level:42.]
[Diagram – row cache and key cache miss: as above, but the key cache also misses, so for each candidate SSTable the node consults the in-memory partition summary and then the partition index to find pk7's offset on disk, reads the columns, and merges them by timestamp to return pk7 → First:Elizabeth, Last:Blue, Level:42.]
TOMBSTONES
• DELETED COLUMNS ARE NOT IMMEDIATELY REMOVED, JUST MARKED FOR DELETION.
• WHY? IMMEDIATE REMOVAL WOULD REQUIRE A TIME-WASTING SEEK.
• WHEN A CQL QUERY DELETES A PARTITION COLUMN, OR ITS TTL IS FOUND TO BE EXPIRED DURING A READ, THE FOLLOWING HAPPENS:
1. A TOMBSTONE (DELETION MARKER) IS APPLIED TO THIS COLUMN IN ITS MEMTABLE
2. SUBSEQUENT QUERIES TREAT THIS COLUMN AS DELETED
3. AT THE NEXT MEMTABLE FLUSH, THE TOMBSTONE PASSES TO THE NEW SSTABLE
4. AT EACH COMPACTION, TOMBSTONED COLUMNS OLDER THAN GC_GRACE_SECONDS ARE EVICTED FROM THE NEWLY COMPACTED SSTABLES
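Per the editor's notes, the retention window is the gc_grace_seconds table property (default 864000 seconds, i.e. 10 days). A minimal sketch of setting it explicitly, using the hypothetical albums table from the notes:
ALTER TABLE ALBUMS WITH GC_GRACE_SECONDS = 864000;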
ZOMBIE COLUMNS
• IF A NODE FAILS BEFORE A REPLICATED TOMBSTONE ARRIVES, AND IS THEN RESTORED MORE THAN GC_GRACE_SECONDS LATER, THE OTHERWISE-DELETED COLUMN WILL REAPPEAR, AS ALL OTHER NODES WILL HAVE EVICTED THE TOMBSTONE.
THE CURE
• USE NODETOOL REPAIR WHEN RESTORING FAILED NODES, TO ENSURE ALL OF THEIR PARTITIONS ARE CONSISTENT, INCLUDING ANY PENDING DELETIONS.
CASSANDRA QUERY LANGUAGE (CQL)
PROVIDES A FAMILIAR, ROW-COLUMN, SQL-LIKE APPROACH
PROVIDES CLEAR SCHEMA DEFINITIONS IN A FLEXIBLE (NOSQL) SCHEMA CONTEXT
LOGICAL CONTAINERS
• CLUSTER – CONTAINS ALL NODES, EVEN ACROSS A WAN
• KEYSPACE – CONTAINS ALL TABLES; SPECIFIES REPLICATION
• TABLE (COLUMN FAMILY) – CONTAINS ROWS
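For example (a minimal sketch; the keyspace name and replication settings are illustrative), the keyspace is where replication is specified:
CREATE KEYSPACE DEMO
WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 3};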
INSTALLATION
DOWNLOAD ALL NECESSARY SOFTWARE FROM
HTTP://DOWNLOADS.DATASTAX.COM/COMMUNITY/
CASSANDRA CLUSTER MANAGER (CCM)
A LIBRARY (OR META-TOOL) TO CREATE, LAUNCH AND
REMOVE AN APACHE CASSANDRA CLUSTER ON LOCALHOST.
FOR DETAILS, SEE HTTPS://GITHUB.COM/PCMANUS/CCM
CCM – CREATE A TEST CLUSTER
• ccm create test -v 2.1.8 -n 2 -s -d
• CREATE A CLUSTER
• NAMED 'TEST'
• USING CASSANDRA VERSION 2.1.8
• WITH 2 NODES
• START IT RIGHT AWAY
• DEBUG OUTPUT FOR START-UP PROCESS.
• CCM WILL INSTALL AND COMPILE THAT VERSION OF CASSANDRA IF IT'S NOT ALREADY AVAILABLE.
• ONCE STARTED, CCM WILL USE THIS CLUSTER AS THE DEFAULT ONE.
• EACH NODE IN A CCM CLUSTER CAN USE A DIFFERENT CASSANDRA VERSION.
CCM – SHOW COMMAND
CCM – EXECUTE AN EXTERNAL FILE
CCM – DESCRIBE
CCM – IMPORT DATA
CCM – TRUNCATE TABLE
CASSANDRA CLUSTER MANAGER (CCM) – ADD A NODE
OPS CENTER
A WEB-BASED VISUAL MANAGEMENT AND MONITORING SOLUTION
NODE TOOL
COMMAND LINE CLUSTER MANAGEMENT UTILITY THAT
CONNECTS TO A SPECIFIC NODE VIA JMX
CASSANDRA STRESS
A LOAD TESTING UTILITY THAT PERFORMS INSERTS AND
READS TO A TEST KEYSPACE, IN AN EFFORT TO MEASURE
PERFORMANCE AND BENCHMARK.
DEV CENTER
A VISUAL SCHEMA AND QUERY TOOL THAT ALLOWS
DEVELOPERS TO CREATE AND RUN CQL QUERIES AND
COMMANDS.
DATA MODEL
DIVERGENCE OF CASSANDRA FROM THE RELATIONAL WORLD
• IN A RELATIONAL DATABASE, ONE CAN SEARCH ON ANY OF THE COLUMNS THAT BELONG TO THE TABLE, BUT NOT IN CASSANDRA.
• CASSANDRA STORES THE DATA ON DISK DIFFERENTLY FROM THE WAY CQL PRESENTS IT.
• CQL PROVIDES A TWO-DIMENSIONAL VIEW OF POTENTIALLY MULTIDIMENSIONAL DATA.
• SIMPLY PUT, CASSANDRA PHYSICALLY STORES DATA AS A MAP OF MAPS.
PHYSICAL STORAGE
[Diagram: each partition key maps to a partition of key–value pairs, e.g. partition key 1 → {title: Interstellar, runtime: 169, year: 2014}, 2 → {title: Minions, runtime: 91, year: 2015}, 3 → {title: Thor, runtime: 115, year: 2011}.]
COLUMN FAMILY
• TABLE IS A SET OF PARTITIONS
• PARTITION MAY BE A SINGLE OR MULTIPLE ROWS
• PARTITION KEY UNIQUELY IDENTIFIES A PARTITION AND MAY BE SIMPLE OR COMPOSITE
• COLUMN UNIQUELY IDENTIFIES A CELL IN A PARTITION, AND MAY BE REGULAR OR CLUSTERING
• PRIMARY KEY IS COMPRISED OF THE PARTITION KEY PLUS CLUSTERING COLUMNS, IF ANY, AND UNIQUELY IDENTIFIES A ROW IN BOTH ITS PARTITION AND TABLE
COLUMN FAMILY
• SET OF ROWS WITH A SIMILAR STRUCTURE.
• SORTED COLUMNS
• MULTI-DIMENSIONAL DATA
• SIZE OF A COLUMN FAMILY IS ONLY LIMITED BY THE SIZE OF THE CLUSTER
• ROWS ARE DISTRIBUTED AMONG THE NODES IN A CLUSTER
• DATA FROM ONE ROW MUST FIT ON ONE NODE
• DATA FROM ANY GIVEN ROW NEVER SPANS MULTIPLE NODES
• MAXIMUM COLUMNS PER ROW IS 2 BILLION IN THEORY, BUT IN PRACTICE UP TO 100 THOUSAND
• MAXIMUM DATA SIZE PER COLUMN VALUE IS 2 GB IN THEORY, BUT IN PRACTICE UP TO 100 MB
CLI
UPSERTS
• CASSANDRA DOES NOT PERFORM A READ OPERATION BEFORE A WRITE.
• IT'S AN OPTIMISATION BY DESIGN: WITH A MASSIVE AMOUNT OF DATA RESIDING IN THE DATA STORE, A READ OPERATION PERFORMED BEFORE EVERY WRITE WOULD NOT SCALE.
• CASSANDRA DOES NOT MAKE ANY DISTINCTION BETWEEN AN INSERT AND AN UPDATE; HENCE THE TERM UPSERT.
• YOU'LL BE UPSET IF YOU DO AN UPSERT ;)
UPSERTS
UPSERTS: CASE 1
• NO PRIMARY KEY VIOLATION EXCEPTION
• CASSANDRA SIMPLY FINDS THE CORRESPONDING PARTITION, PERFORMS AN INSERT OPERATION AND RETURNS.
• HOWEVER, DEVELOPERS CAN STILL EXPLICITLY ASK CASSANDRA TO PERFORM A READ PRIOR TO A WRITE OPERATION.
UPSERTS: CASE 2
• AN UPDATE TO A NON-EXISTING RECORD PERFORMS AN INSERT, USING THE WHERE CLAUSE TO IDENTIFY THE ROW.
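A minimal sketch against the users table defined later in this deck (values illustrative): if no row with user_id 'SAM' exists, this UPDATE creates it.
UPDATE USERS SET FIRST_NAME = 'SAMWISE', LAST_NAME = 'GAMGEE' WHERE USER_ID = 'SAM';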
LIGHTWEIGHT TRANSACTIONS OR COMPARE AND SET (CAS)
• A NEW CLAUSE, IF NOT EXISTS, FOR INSERTS
• THE INSERT OPERATION EXECUTES ONLY IF A ROW WITH THE SAME PRIMARY KEY DOES NOT EXIST
• USES A CONSENSUS ALGORITHM CALLED PAXOS TO ENSURE INSERTS ARE DONE SERIALLY
• MULTIPLE MESSAGES ARE PASSED BETWEEN THE COORDINATOR AND REPLICAS, WITH A LARGE PERFORMANCE PENALTY
[applied] column returns true if the row does not exist and the insert executes
[applied] column is false if the row exists and the existing row will be returned
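A minimal sketch of a lightweight-transaction insert (again using the users table defined later in this deck):
INSERT INTO USERS (USER_ID, FIRST_NAME, LAST_NAME)
VALUES ('FRODO', 'FRODO', 'BAGGINS')
IF NOT EXISTS;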
LIGHTWEIGHT TRANSACTIONS OR COMPARE AND SET (CAS)
• UPDATE USES AN IF CLAUSE TO VERIFY THE VALUE OF COLUMN(S) BEFORE EXECUTION
[applied] column returns true if the condition(s) match and the update is written
[applied] column is false if the condition(s) do not match and the current row will be returned
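A minimal sketch of a conditional update on the users table (the condition value is illustrative):
UPDATE USERS SET LAST_NAME = 'GAMGEE'
WHERE USER_ID = 'FRODO'
IF LAST_NAME = 'BAGGINS';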
TTL OPTION
• TIME-TO-LIVE (TTL) DEFINES EXPIRING COLUMNS THAT ARE EVENTUALLY DELETED.
• TTL IS SPECIFIED IN SECONDS.
• BENEFITS:
• HELPS KEEP THE SIZE OF A TABLE AND ITS PARTITIONS MANAGEABLE
• RESTRICTS THE DATA VIEW TO THE MOST RECENT DATA
Store a row for 86400 seconds (see the sketch below).
Re-inserting the same row before it expires will overwrite the TTL.
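The screenshot this caption refers to is not included; a minimal sketch of the same idea against the users table (row values illustrative):
INSERT INTO USERS (USER_ID, FIRST_NAME, LAST_NAME)
VALUES ('PIPPIN', 'PEREGRIN', 'TOOK')
USING TTL 86400;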
CLUSTERING COLUMNS
• CLUSTERING COLUMNS GROUP A TABLE'S ROWS INTO MULTI-ROW PARTITIONS AND DETERMINE THEIR ORDER WITHIN EACH PARTITION.
• CLUSTERING COLUMNS COME AFTER THE PARTITION KEY, WITHIN THE PRIMARY KEY CLAUSE.
• A DOUBLE SET OF PARENTHESES AROUND THE PARTITION KEY SEPARATES IT FROM THE CLUSTERING COLUMNS (SEE THE SKETCH BELOW).
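The annotated CREATE TABLE screenshot is not included; a minimal sketch matching the videos example on the next slide (table and column names inferred from that example; year is the partition key, name the clustering column):
CREATE TABLE VIDEOS (
  YEAR INT,
  NAME TEXT,
  ID INT,
  RUNTIME INT,
  PRIMARY KEY ((YEAR), NAME)
);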
CLUSTERING COLUMNS
year | name | id | runtime
2014 | Interstellar | 1 | 169
2015 | Minions | 2 | 91
2011 | Thor | 3 | 115
2015 | Home | 4 | 94
[Diagram – physical layout, partitioned by year and clustered by name:
2015 → Home:id = 4, Home:runtime = 94, Minions:id = 2, Minions:runtime = 91
2014 → Interstellar:id = 1, Interstellar:runtime = 169
2011 → Thor:id = 3, Thor:runtime = 115]
'Home' comes before 'Minions' as names are arranged in ascending order
QUERYING CLUSTERING COLUMNS
• CLUSTERING COLUMN VALUES ARE STORED IN SORTED ORDER, WITH ASCENDING BEING THE DEFAULT. HOWEVER, A CLUSTERING COLUMN'S ORDERING CAN BE CHANGED.
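A minimal sketch of overriding the default order on the hypothetical videos table sketched earlier:
CREATE TABLE VIDEOS (
  YEAR INT,
  NAME TEXT,
  ID INT,
  RUNTIME INT,
  PRIMARY KEY ((YEAR), NAME)
) WITH CLUSTERING ORDER BY (NAME DESC);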
QUERYING CLUSTERING COLUMNS
• LOCATING A PARTICULAR ROW WITHIN A CLUSTERED PARTITION REQUIRES ONLY A SIMPLE BINARY SEARCH, WHICH RUNS IN LOGARITHMIC TIME AND IS HENCE CONSIDERABLY FAST.
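The query screenshot is not included; a minimal sketch against the hypothetical videos table, restricting the partition key and then the clustering column:
SELECT * FROM VIDEOS WHERE YEAR = 2015 AND NAME = 'Home';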
RANGE QUERY ON CLUSTERING COLUMNS
• RANGE QUERIES CAN ALSO BE PERFORMED ON CLUSTERING COLUMNS.
• HOWEVER, A RANGE SEARCH CAN BE PERFORMED ONLY ON CLUSTERING COLUMNS, NOT ON ANY OTHER COLUMNS.
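A minimal sketch of a range query on the clustering column of the hypothetical videos table:
SELECT * FROM VIDEOS WHERE YEAR = 2015 AND NAME >= 'H' AND NAME < 'N';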
STATIC COLUMN
• STATIC COLUMN VALUES ARE SHARED BY ALL ROWS IN A MULTI-ROW PARTITION
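A minimal sketch (table and column names are illustrative): studio_country is stored once per partition and shared by every row in it.
CREATE TABLE MOVIES_BY_STUDIO (
  STUDIO TEXT,
  TITLE TEXT,
  RUNTIME INT,
  STUDIO_COUNTRY TEXT STATIC,
  PRIMARY KEY ((STUDIO), TITLE)
);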
DATA TYPES
UUID AND TIMEUUID
• UNIVERSALLY UNIQUE IDENTIFIERS USED TO ASSIGN CONFLICT-FREE (UNIQUE) IDENTIFIERS TO DATA OBJECTS.
• FORMAT: HEX{8}-HEX{4}-HEX{4}-HEX{4}-HEX{12}
• UUID:
• VERSION 4 UUIDS, SEPARATED BY DASHES
• TIMEUUID:
• VERSION 1 UUIDS
• EMBEDS A TIME VALUE WITHIN A UUID – GENERATED USING TIME (60 BITS), A CLOCK SEQUENCE NUMBER (14 BITS), AND A MAC ADDRESS (48 BITS)
• THE CQL FUNCTION NOW() GENERATES A NEW TIMEUUID
• THE CQL FUNCTION DATEOF() EXTRACTS THE EMBEDDED TIMESTAMP FROM A TIMEUUID
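A minimal sketch (table and column names are illustrative) using now() to generate a timeuuid and dateOf() to read the embedded timestamp back:
CREATE TABLE EVENTS (
  USER_ID TEXT,
  EVENT_ID TIMEUUID,
  PAYLOAD TEXT,
  PRIMARY KEY ((USER_ID), EVENT_ID)
);
INSERT INTO EVENTS (USER_ID, EVENT_ID, PAYLOAD) VALUES ('FRODO', NOW(), 'LOGIN');
SELECT EVENT_ID, DATEOF(EVENT_ID) FROM EVENTS WHERE USER_ID = 'FRODO';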
COUNTER
• A DATA TYPE FOR A DISTRIBUTED COUNTER, USED TO TRACK A COUNT.
• IT ALLOWS RACE-FREE INCREMENTS WITH LOCAL LATENCY ACROSS MULTIPLE DATACENTERS SIMULTANEOUSLY.
• LIMITATIONS (SEE THE SKETCH AFTER THIS LIST)
1. INITIALISED TO ZERO AND CAN ONLY BE INCREMENTED OR DECREMENTED
2. CANNOT BE PART OF A PRIMARY KEY
3. IF A TABLE HAS A COUNTER COLUMN, ALL NON-COUNTER COLUMNS MUST BE PART OF THE PRIMARY KEY
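A minimal sketch of a counter table and an increment (names are illustrative); note that the only non-counter column, page, is the primary key:
CREATE TABLE PAGE_VIEWS (
  PAGE TEXT PRIMARY KEY,
  VIEWS COUNTER
);
UPDATE PAGE_VIEWS SET VIEWS = VIEWS + 1 WHERE PAGE = '/home';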
COUNTER
• CASSANDRA READS THE CURRENT VALUE FOR EVERY COUNTER UPDATE AND APPLIES THE DELTA.
• PERFORMANCE
1. A READ IS AS EFFICIENT AS FOR NON-COUNTER COLUMNS
2. AN UPDATE IS FAST, BUT SLIGHTLY SLOWER THAN AN UPDATE OF NON-COUNTER COLUMNS
• ACCURACY
• IF A COUNTER UPDATE TIMES OUT, A CLIENT APPLICATION CANNOT SIMPLY RETRY THE “FAILED” COUNTER UPDATE, AS THE TIMED-OUT UPDATE MAY HAVE BEEN PERSISTED
• A COUNTER UPDATE IS NOT AN IDEMPOTENT OPERATION
COLLECTION COLUMNS
• COLLECTION COLUMNS ARE MULTI-VALUED COLUMNS, RETRIEVED IN THEIR ENTIRETY.
• SUPPORTED COLLECTIONS
• SET – TYPED COLLECTION OF UNIQUE VALUES; ORDERED BY VALUE, NO DUPLICATES
• LIST – TYPED COLLECTION OF NON-UNIQUE VALUES; ORDERED BY POSITION, DUPLICATES ALLOWED
• MAP – TYPED COLLECTION OF KEY-VALUE PAIRS; ORDERED BY KEY, UNIQUE KEYS BUT NOT VALUES
• SIZE LIMITS
• MAXIMUM NUMBER OF ELEMENTS IN A COLLECTION: 64 000
• MAXIMUM SIZE OF EACH COLLECTION ELEMENT: 64 KB
• USAGE LIMITS
• CANNOT BE PART OF A PRIMARY KEY, I.E. PARTITION KEY OR CLUSTERING COLUMN
• CANNOT NEST INSIDE ANOTHER COLLECTION
SET MANIPULATION
• DEFINE A USERS TABLE TO ACCOMMODATE MULTIPLE EMAIL ADDRESSES
CREATE TABLE USERS (
USER_ID TEXT PRIMARY KEY,
FIRST_NAME TEXT,
LAST_NAME TEXT,
EMAILS SET<TEXT>
);
• INSERT DATA INTO THE SET, ENCLOSING VALUES IN CURLY BRACKETS
INSERT INTO USERS (USER_ID, FIRST_NAME, LAST_NAME, EMAILS)
VALUES('FRODO', 'BILBO', 'BAGGINS', {'FRODO@BAGGINS.NAME',
'BILBO.BAGGINS@ABOUT.ME'});
• ADD AN ELEMENT TO A SET USING THE UPDATE COMMAND AND THE ADDITION (+)
OPERATOR
UPDATE USERS
SET EMAILS = EMAILS + {'BILBO.BAGGINS@THEHOBBIT.ORG'} WHERE USER_ID = 'FRODO';
• REMOVE ALL ELEMENTS FROM A SET BY USING THE UPDATE OR DELETE
STATEMENT.
LIST MANIPULATION
• ADD A LIST DECLARATION TO A TABLE BY ADDING A COLUMN AND MANIPULATE IT
ALTER TABLE USERS ADD TOP_PLACES LIST<TEXT>;
UPDATE USERS SET TOP_PLACES = [ 'RIVENDELL', 'MORDOR' ] WHERE USER_ID = 'FRODO';
UPDATE USERS SET TOP_PLACES[2] = 'RIDDERMARK' WHERE USER_ID = 'FRODO';
MAP MANIPULATION
• ADD A TODO LIST TO EVERY USER PROFILE IN AN EXISTING USERS TABLE
ALTER TABLE USERS ADD TODO MAP<TIMESTAMP, TEXT>;
UPDATE USERS
SET TODO = { '2012-9-24' : 'ENTER MORDOR', '2014-10-2 12:00' : 'THROW RING INTO MOUNT DOOM' }
WHERE USER_ID = 'FRODO';
UPDATE USERS
SET TODO = TODO + { '2013-10-1 18:00': 'CHECK INTO INN OF PRANCING PONY'}
WHERE USER_ID='FRODO';
• COMPUTE THE TTL TO USE TO EXPIRE TO-DO LIST ELEMENTS ON THE DAY OF THE
TIMESTAMP, AND SET THE ELEMENTS TO EXPIRE.
UPDATE USERS USING TTL 86400
SET TODO['2012-10-1'] = 'FIND WATER' WHERE USER_ID = 'FRODO';
SECONDARY INDEX
• A SECONDARY INDEX CAN INDEX ADDITIONAL COLUMNS TO ENABLE SEARCHING BY THOSE COLUMNS
• ONE COLUMN PER INDEX
• LIMITATIONS: IT CANNOT BE CREATED FOR
1. COUNTER COLUMNS
2. STATIC COLUMNS
• CREATE A SECONDARY INDEX / DROP A SECONDARY INDEX (SEE THE SKETCH BELOW)
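The create/drop screenshots are not included; a minimal sketch on the users table (index name illustrative):
CREATE INDEX USERS_LAST_NAME_IDX ON USERS (LAST_NAME);
SELECT * FROM USERS WHERE LAST_NAME = 'BAGGINS';
DROP INDEX USERS_LAST_NAME_IDX;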
SECONDARY INDEX: WHEN AND WHEN NOT TO USE IT
• SECONDARY INDEXES ARE FOR SEARCHING CONVENIENCE, SO USE THEM ONLY
1. ON COLUMNS WITH LOW CARDINALITY
2. ON COLUMNS THAT MAY CONTAIN A RELATIVELY SMALL SET OF DISTINCT VALUES, LIKE GENRE OF MUSIC
3. WITH SMALLER DATASETS OR WHEN PROTOTYPING
• DO NOT USE THEM
1. ON HIGH-CARDINALITY COLUMNS
2. ON COUNTER COLUMN TABLES
3. ON FREQUENTLY UPDATED OR DELETED COLUMNS
4. TO LOOK FOR A ROW IN A LARGE PARTITION UNLESS NARROWLY QUERIED
USER-DEFINED TYPE
• GROUPS RELATED FIELDS OF INFORMATION
• REPRESENTS RELATED DATA IN A SINGLE TABLE, INSTEAD OF MULTIPLE, SEPARATE
TABLES
• TABLE COLUMNS CAN BE USER-DEFINED TYPES
• A USER-DEFINED TYPE CAN BE USED AS A DATA TYPE FOR A COLLECTION
• REQUIRES THE USE OF THE FROZEN KEYWORD
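A minimal sketch (type, table, and field names are illustrative) of defining a user-defined type and using it, frozen, as a column type on the users table:
CREATE TYPE ADDRESS (
  STREET TEXT,
  CITY TEXT,
  ZIP TEXT
);
ALTER TABLE USERS ADD HOME_ADDRESS FROZEN<ADDRESS>;
UPDATE USERS SET HOME_ADDRESS = { STREET: 'BAGSHOT ROW', CITY: 'HOBBITON', ZIP: '00001' }
WHERE USER_ID = 'FRODO';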
TUPLE
• FIXED-LENGTH SETS OF TYPED POSITIONAL FIELDS
• ALTERNATIVE TO CREATING A USER-DEFINED TYPE THAT’S USEFUL WHEN
PROTOTYPING
• ACCOMMODATES UP TO 32768 FIELDS, BUT GENERALLY ONLY USE A FEW
• TUPLES CAN BE NESTED IN OTHER TUPLES
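A minimal sketch (column name and values are illustrative) of a tuple column holding a latitude, longitude and label; tuples are implicitly frozen:
ALTER TABLE USERS ADD CURRENT_LOCATION TUPLE<DOUBLE, DOUBLE, TEXT>;
UPDATE USERS SET CURRENT_LOCATION = (51.5, -0.12, 'LONDON') WHERE USER_ID = 'FRODO';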
BATCH STATEMENT
• COMBINES MULTIPLE INSERT, UPDATE, AND DELETE STATEMENTS INTO A SINGLE LOGICAL OPERATION
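A minimal sketch of a logged batch against the users table (values illustrative):
BEGIN BATCH
  INSERT INTO USERS (USER_ID, FIRST_NAME, LAST_NAME) VALUES ('MERRY', 'MERIADOC', 'BRANDYBUCK');
  UPDATE USERS SET LAST_NAME = 'GAMGEE' WHERE USER_ID = 'SAM';
  DELETE FROM USERS WHERE USER_ID = 'PIPPIN';
APPLY BATCH;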
BATCH STATEMENT
• ATOMIC OPERATION
• IF ANY STATEMENT IN THE BATCH SUCCEEDS, ALL WILL
• NO BATCH ISOLATION
• OTHER “TRANSACTIONS” CAN READ AND WRITE DATA BEING AFFECTED BY A PARTIALLY EXECUTED BATCH
• ORDER
• OPERATIONS WITHIN A BATCH WILL BE EXECUTED IN ANY ORDER SEEN FIT BY THE EXECUTION ENGINE
BATCH STATEMENT
UNLOGGED BATCH
• DOES NOT WRITE TO THE BATCH LOG
• SAVES TIME, BUT IS NO LONGER ATOMIC
• ALLOWS OPERATIONS ON COUNTER COLUMNS (SEE THE SKETCH BELOW)
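A minimal sketch, reusing the hypothetical page_views table from the earlier counter sketch; in CQL, batched counter updates use the COUNTER BATCH form, which is unlogged by nature:
BEGIN COUNTER BATCH
  UPDATE PAGE_VIEWS SET VIEWS = VIEWS + 1 WHERE PAGE = '/home';
  UPDATE PAGE_VIEWS SET VIEWS = VIEWS + 1 WHERE PAGE = '/about';
APPLY BATCH;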
LIGHTWEIGHT TRANSACTIONS IN A BATCH
• A BATCH WILL EXECUTE ONLY IF THE CONDITIONS FOR ALL LIGHTWEIGHT TRANSACTIONS ARE MET
• ALL OPERATIONS IN THE BATCH WILL EXECUTE SERIALLY, WITH INCREASED PERFORMANCE OVERHEAD
STEPS TO BUILD A DATA MODEL
Application workflow → Conceptual data model → (map conceptual to logical) → Logical data model → (physical optimisation) → Physical data model
Editor's Notes
  • #19 In the case of a datacenter outage, applications can carry on a retry policy which flips over to the other datacenter which also has a copy of the data. Operational simplicity: 1 node = 1 process + 2 configuration file
  • #22 Real-time streaming/analytics/aggregation.
  • #25 DataStax recently purchased Aurelius, the company behind Titan, the distributed graph database, so in the next major release of Cassandra a Graph component shall also be integrated. Titan: the native graph supports Gremlin as the graph query language, Frames OGM, the Rexster graph server, and the Blueprints standard graph API.
  • #26 Data-store layer Google Big Table paper Columns/columns family
  • #28 Nodes join a cluster based on the configuration of their own conf/cassandra.yaml file Key settings include cluster_name – shared name to logically distinguish a set of nodes. seeds – IP addresses of initial nodes for a new node to contact and discover the cluster topology. Best practice to use the two seeds per data center. listen_address – IP address through which this particular node communicates Strictly technically speaking, a seed isn’t absolutely necessary for facilitating a node to join an existing cluster. Just as a client can communicate with any node belonging to a cluster to fetch data, a new node can communicate with a member node acting as a coordinator to discover a cluster’s topology. However for operational convenience and to ensure rogue nodes don’t join an existing cluster, seed nodes are in place to mentor and orient newly members willing to join a cluster.
  • #29 Check token-generator tool in apache-cassandra\tools\bin Cassandra offers the following partitioners: Murmur3Partitioner (default): uniformly distributes data across the cluster based on MurmurHash hash values. It provides faster hashing and improved performance than other 2 partitioners. RandomPartitioner: uniformly distributes data across the cluster based on MD5 hash values. The default partitioner prior to Cassandra 1.2. ByteOrderedPartitioner: keeps an ordered distribution of data lexically by key bytes The partitioner is configured in the cassandra.yaml file
  • #38 Explains how the data is managed down to the file system, once a write request (insert, update or delete) is received by the co-ordinator from a client. Explains how cassandra injests the stream of incoming data so quickly.
  • #39 Data is always sequentailly appended both in memory and in the file system. Appends are continually merged an compacted at appropriate instances.
  • #40 All the above mentioned components involved belong to individual nodes i.e. a Cassandra instance. When any node receive any write request 1. The record appends to the CommitLog, and 2. The record appends to the Memtable for this record's target CQL table 3. Periodically, Memtables flush to SSTables, clearing JVM heap and CommitLog 4. Periodically, Compaction process runs to merge and streamline SSTables More detailed introduction of the key components. CommitLog is an append only log file that resides in the file system of the node. It exists to replay and restore the state of MemTables , in case for any reason the nodes down and has to be restarted. As the JVM heap memory fills for a memtable, it’s data is written out to the file system as an immutable file. Immutable is a key concept here as the data is flushed rapidly to disk and the heap is cleared. Compaction is the process by which the SSTables are periodically merged, compacted and streamlined.
  • #41 When a cassandra client passes a write request to any node acting as a coordinator, the coordinator passes it on to a node or nodes. The write request handling node immediately does 2 things. It appends a record to the commit log and It appends a record to the Memtable of the target CQL table. When MemTables are periodically flushed to SSTables on the disk, The JVM heap for the MemTable is cleared CommitLog is marked as finished; i.e. the commit log is no longer necessary as the data is on the file system. Inserts, updates and deletes (which are just writes) to the same partition key can end up in various SSTables. It’s because values do mutate over time or just cease to exist. To continually improve the efficiency of a read operation, SSTables are periodically compacted into often larger SSTables. All the appended values for a single parition key are merged into one record in a single table. Prior generation SSTables are cleared once compaction completes. Memtables are in-memory representations of each CQL table in each key space. Each Memtable accrues writes and provides reads for data not yet flushed to file system as SSTables. A read operation checks the Memtable first. A Memtable flushes the oldest commit log segments to a new corresponding SSTable on disk when Memtable total space in mb reaches its threshold in JVM heap Commit log total space in mb reaches its threshold nodetool flush command is issued which force flushes designated Memtables.
  • #43 How does compaction affect reads and disk space? During compaction, disk I/O and utilisation increases. Also, off-cache read performance may be impacted. However, after compaction read performance increases as less SSTables are read for off-cache reads. Furthermore, disk utilization drops as old SSTables are deleted. Available compaction strategies are Size-Tiered (default) – compaction triggered as number of SSTables reach a threshold Leveled – uniform-size SSTables organized and compacted by successive levels Date-Tiered – data written within a certain time window is saved together
  • #44 Read path is very different from a write path as it focuses on improving performance.
  • #45 Cassandra returns the most recent record among the nodes read for a given request. Consistency Level sets how many nodes will be read and it may vary by request. If a node is slow responding to a request, the coordinator forwards it to another holding a replica of the requested partition.
  • #46 Generally speaking Row cache resides in memory but off java’s heap Mem table resides on heap But these are configurable in Cassandra 2.1 Row cache is optional because for read performance, it’s best to rely on operating system’s file system cache which also resides in RAM; just like the row cache. It’s best to avoid double caching. In cassandra and also In the OS level Row cache the entire contents of the row in memory so it’s best used when you’ve a small subset of data to keep hot where you need most or all of columns. Row caching is enabled in CQL with the caching and rows_per_partition properties ALL – cache all rows for a partition key n – cache the first n rows for a partition key NONE – (default) disable row caching for this table Example: CREATE TABLE player ( first text PRIMARY KEY, last text, level text ) WITH caching = {'keys': 'ALL', 'rows_per_partition': '1'}; Row cache size and save period are set globally for all tables on a node in cassandra.yaml
  • #47 Bloom Filters report if a partition key may be in its corresponding SSTable. Bloom filter is a space efficient probabilistic data structure that’s used to test whether an element is a member of the set. False positives are possible but false negatives are not. Cassandra uses bloom filters to save IO and boost performance when performing a key look-up. Each SSTable has a bloom filter associated with it that Cassandra checks before doing any disk seeks. Generally speaking, larger the tables, higher the possibility of having a false positive. Setting that controls the percentage of false positive results from bloom filter is “bloom_filter_fp_chance” whose value is from 0.0 to 1.0 where 0.0 means no false positives but involves greatest memory use. The value 1.0 disables bloom filtering for that table. Example: ALTER TABLE albums WITH bloom_filter_fp_chance = 0.1; Key Caches – maps recently read partition keys to specific SSTable offsets. With the help of key cache, the values are therefore retrieved directly by performing read from a specific offset within a SSTable file. A key cache is first keyed by the SSTable file so the bloom filter adds value by indicating whether one should seek within a particular location within the key-cache. For column family read level optimisations, increasing the capacity pf this cache can have an immediate impact as soon as the cache warms up. Key cache is enabled by default at a level of 200 000 keys. Key caching is enabled in CQL with the caching and keys properties ALL – (default) enable key caching for this table NONE – disable key caching for this table Example: CREATE TABLE player ( first text PRIMARY KEY, last text, level text ) WITH caching = {'keys': 'ALL', 'rows_per_partition': '1'}; Merge – Unless served from the row cache, a read uses a partition key to locate, merge, and return values from a MemTable and any related SSTable storing values for that key. Question: Why not just read from the MemTable alone if it contains the latest data? Well, it’s because technically there’s no definitive guarantee that, the most recent version of any given data is only in memory; which is why Cassandra performs a disk read from SSTables.
  • #48 Partition Summaries – Sampling from partition index stored in memory. It’s a subset of partition index where in 1 partition key out of every 128 is sampled by default. It can be seen as high level partial index into an exhaustive disk index. Partition Indexes – Sorted partition keys mapped to their SSTable offsets. In other words, it’s a list of primary keys and the start position of data.
  • #49 gc_grace_seconds – table property defining how long tombstones will be retained before eviction in the next compaction (default: 864000, 10 days)
  • #54 Manually starting and stopping a Cassandra instance Starting cassandra from command line C:\Program Files\DataStax Community\apache-cassandra\bin>cassandra -f -p c:\temp\cassandra-pid.txt On a Mac, go to the bin directory of cassandra installation directory and type either ./cassandra -p ../../cassandra.pid (run cassandra in background and export it's process number in the specified file) ./cassandra -f (run cassandra in foreground) Stopping cassandra Press Ctrl+C to stop the server if running in foreground else kill <process-id-of cassandra> (Fetch process id from cassandra.pid or )
  • #59 In case you incur a port conflict issue in Mac OS X, do the following. List all the processes running on your Mac with the corresponding port numbers being used: sudo lsof -iTCP -sTCP:LISTEN -n -P. Unfortunately, C* only shows up as java in the task list; you probably have a JDK installed, so use the 'jps' command to find the 'CassandraDaemon'. Create a loopback interface alias to resolve port conflicts under ~/scripts/loop_alias.sh. Contents of loop_alias.sh shall be: #!/bin/bash sudo ifconfig lo0 alias 127.0.0.2 up sudo ifconfig lo0 alias 127.0.0.3 up. Run it as follows: 'bash ~/scripts/loop_alias.sh'
  • #62 There are no table joins in Cassandra. Cassandra is intentionally designed that way because the philosophy of Cassandra is that table joins aren’t supposed to exist. For performance, Cassandra focuses on query driven design i.e. create the data structure the application needs instead of creating a database before hand and forcing application developers to confirm to it.
  • #63 COPY command copies data into the table, in whatever sequence columns are present in the external file.
  • #66 Using OpsCenter Agent The DataStax Agent must be connected and running on each node in order for OpsCenter functionality to work properly. Download the agent tarball and manually install the agents. For each node, an instance of agent is used. Configure /agents/agent1/datastax-agent-5.2.0/conf/address.yaml as shown below stomp_interface: "127.0.0.1" agent_rpc_interface: 127.0.0.1 jmx_host: 127.0.0.1 jmx_port: 7100 if the file doesn't exist, create it. Repeat the procedure for other instances of agents connecting to other nodes in your cluster. See http://www.datastax.com/dev/blog/running-opscenter-with-a-local-development-cluster Using OPS Center Go to Ops centre installation directory and start it in foreground as shown below cd tools/cassandra-tools/opscenter-5.2.0/ bin/opscenter -f As there's no command to stop the ops centre, execute the following commands to kill the opscenter process. ps -ef | grep opscenter sudo kill pid Go to http://localhost:8888/opscenter/index.html If you’re using OpsCenter for the first time, select the option "Manage existing cluster" Enter the following host ip addresses 127.0.0.1 (ip address of node 1 of my local cluster) 127.0.0.2 (ip address of node 2 of my local cluster) 127.0.0.3 (ip address of node 3 of my local cluster) Since, all nodes can’t use the same JMX port, we now need to remove the JMX port option from the configuration OpsCenter stored. Instead we will tell each agent individually which port to use. See the agent configuraion section for details. Remove the default JMX port specified in opscenter-5.2.0/conf/clusters/<cluster-name>.conf
  • #70 It can be invoked directly via CCM. - Status: Displays cluster information summary - Info: Displays settings and data for a specific node - Ring: Displays summary state for nodes in target node's token ring
  • #71 Many node tool commands display similar information; just arranged differently focusing on specific details so that it's most suitable for a specific context.
  • #77 Dev Centre does NOT support CQL shell commands like SOURCE, COPY, DESCRIBE, SHOW etc.
  • #80 Cassandra physically stores data in partitions on disk but CQL returns them in a tabular format; that’s familiar to users from relational database world. Latency is minimised by neatly avoiding joins and execution plans.
  • #81 In Cassandra, the primary key clause actually determines the partitioning criteria. Partition: smallest unit of atomic storage. In other words, Cassandra does not split a partition across the nodes; it replicates and distributes partitions across nodes. Cassandra applies a hash function on the partition key, and the hashed result is used to determine the node where the partition is stored. Using a WHERE clause on any field apart from the partition key would require searching all partitions on all nodes, so Cassandra says no. Cassandra will NOT sacrifice the constant-time look-up of the hash table. Cassandra supports composite partition keys.
  • #94 When specified, each record is uniquely identified by its clustering column, not its partition key. The key in each individual cell is changed: the value of the clustering column is prefixed to the keys of records in individual cells, in an effort to uniquely identify them. Records with the same partition key and clustering column name will indeed cause a collision and an upsert; they must be handled properly. "Retrieve all videos for the year n" will neatly suit this data model, as partitions per year are stored as a unit. Skinny rows: each partition key has only one table row in it. Multi-row partition: multiple rows of a table are held within one partition. For someone from the relational database world, it helps to think of clustering columns as a "group by" grouping.
  • #95 Range queries can also be performed on clustering columns. However, a range search can be performed only on clustering columns and not on any others, because data is stored on disk in sorted order only for clustering columns; for other columns the database would have to seek around, which proves to be an expensive operation.
  • #97 Why? Data is stored on disk in sorted manner for clustering columns. for other columns, the database has to seek around which proves to be an expensive operation.
  • #108 By default, tables are indexed on columns in a primary key. Search on a partition key is very efficient Search on a partition key and clustering columns is very efficient Search on other columns is not supported Internally, a Cassandra index is a data partition. Cassandra uses the index to filter the data and pull out the records in question. Each node indexes its own data.
  • #109  As with relational databases, keeping indexes up to date is not free, so unnecessary indexes should be avoided. When a column is updated, the index is updated as well. If the old column value was still in the memtable, which typically occurs when updating a small set of rows repeatedly, Cassandra removes the corresponding obsolete index entry; otherwise, the old entry remains to be purged by compaction. If a read sees a stale index entry before compaction purges it, the reader thread invalidates it.
  • #110 Cassandra serializes a frozen value having multiple components into a single value