BASIC STUFF YOU NEED TO KNOW ABOUT
CASSANDRA
AUG. 2018
YU-CHANG HO (ANDY)
FORMER RESEARCH ASSISTANT, ACADEMIA SINICA
A GREEK STORY
➡A prophetess in ancient Greek mythology
➡Second-most beautiful woman in the world
➡Gift of prophecy from Apollo
➡Figure of tragedy
‣ Ref. https://www.wikiwand.com/en/Cassandra
APACHE CASSANDRA
WHAT IS APACHE CASSANDRA (C*)?
▸ Originated at Facebook Inc.
▸ Combines concepts from Google Bigtable & Amazon Dynamo.
▸ Data Modeling: Bigtable
▸ System Architecture: Dynamo
▸ A highly scalable distributed database system.
▸ Written in Java (The JVM Tuning Hell!!).
Ref. https://www.wikiwand.com/en/Apache_Cassandra
Ref. https://www.wikiwand.com/en/Dynamo_(storage_system)
Ref. https://www.wikiwand.com/en/Bigtable
WHAT IS APACHE CASSANDRA (C*)?- CONT.
▸ It is a popular database system! (Ranked in 2018)
1 Oracle
2 MySQL
3 Microsoft SQL Server
4 PostgreSQL
5 MongoDB
6 DB2
7 Redis
8 Elasticsearch
9 Microsoft Access
10 Cassandra
Ref. https://db-engines.com/en/ranking
WHAT IS APACHE CASSANDRA (C*)?- CONT.
▸ There is no master/slave relationship among C* nodes!
▸ Every node can serve both reads and writes.
▸ In our scenario, we treat the GCP node as the “Master” that
controls data insertion.
▸ Version we currently run: 3.11.2.
THE CAP THEOREM
▸ Eric Brewer, UC Berkeley
▸ C: Consistency
▸ A: Availability
▸ P: Partition-tolerance
▸ A distributed system cannot guarantee all 3 at the same time:
during a network partition it must give up either C or A.
Ref. https://www.wikiwand.com/en/CAP_theorem
THE CAP THEOREM OF CASSANDRA
▸ C: The consistency of data → Eventual consistency
▸ A: The availability of service → Always available
▸ P: Keeps working when the network between nodes is cut →
Partition tolerant
▸ We can still approach all three by tuning the consistency
level of reads and writes!
▸ Provides high availability and a tunable level of consistency.
Ref. https://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed
TERMINOLOGY
TERMINOLOGY YOU NEED TO UNDERSTAND
▸ Data Center: A group of nodes
▸ Rack: A group of nodes within a data center
[Figure: a data center containing Rack 1 and Rack 2; each rack holds nodes]
TERMINOLOGY YOU NEED TO UNDERSTAND- CONT.
▸ Ring: The logical representation of the cluster of nodes.
[Figure: a ring of four nodes, numbered 1 to 4]
TERMINOLOGY YOU NEED TO UNDERSTAND- CONT.
▸ Keyspace: The top-level container for tables, comparable to a
database in MySQL. Replication is configured per keyspace.
▸ Table: Just a table (formerly called a column family), don’t
be confused! :-)
▸ MemTable: C* first stores new data in memory. After a
certain amount of data is reached, it is flushed to disk (SSTable).
▸ Commit_log: Besides storing new data in memory, C* first
appends it to a log on disk, so that the data can be restored
if a failure happens before the flush.
▸ SSTable: The immutable, compressed data files stored on disk.
TERMINOLOGY YOU NEED TO UNDERSTAND- CONT.
▸ Replica: A copy of the data.
▸ Replication Factor (RF): The number of replicas you wish to
maintain in a given data center.
▸ Partitioner: Determines how data is distributed across the
nodes in the cluster. (Each row gets a token created by hashing
its partition key.)
▸ Coordinator: The role a node takes on when it receives a
query. It locates the nodes that hold the data and forwards the
request to them. On each of those nodes, the MemTable and the
SSTables are checked.
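The partitioner idea above can be sketched in a few lines of Python. This is only an illustration: the node names, the token values, the tiny 0..255 token space, and the use of MD5 as the hash are all assumptions made for the example (Cassandra's default partitioner uses Murmur3 and a much larger token range).

```python
import bisect
import hashlib

# Hypothetical ring: each node owns the token range that ends at its token.
RING = [(0, "node1"), (64, "node2"), (128, "node3"), (192, "node4")]
TOKENS = [token for token, _ in RING]


def token_for(partition_key: str) -> int:
    """Hash a partition key to a token in 0..255 (MD5 is a stand-in hash)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return digest[0]


def owner_of(token: int) -> str:
    """Return the first node whose token is >= the given token, wrapping around."""
    index = bisect.bisect_left(TOKENS, token) % len(RING)
    return RING[index][1]


print(owner_of(token_for("sensor-42")))
```

Any node can run this lookup, which is why any node can act as coordinator for a query.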
TERMINOLOGY YOU NEED TO UNDERSTAND- CONT.
▸ Gossip Protocol: The protocol C* nodes use to discover
information about the other nodes.
▸ Seed Node: A node that other nodes contact first to learn the
cluster topology.
▸ Now, we have GCP (TW), UCSD (US), NTU (JP) seed nodes.
▸ Snitch: The component a C* node uses to map IPs to racks and
data centers (the topology).
▸ A snitch is useful when performing a read.
▸ It builds the topology and helps decide which node to query.
TERMINOLOGY YOU NEED TO UNDERSTAND- CONT.
▸ Consistency Level (CL): How many replicas must respond before
a read or write counts as successful. It is set per query.
ANY           Lowest level (writes only). Even if all the replica
              nodes are down, the write can still succeed.
ONE           At least one replica node should succeed.
QUORUM        (RF / 2) + 1 replica nodes should succeed
              (RF / 2 rounded down).
ALL           Highest level, every replica node should succeed.
LOCAL_ONE     For multiple data centers. One replica node in the
              local data center should succeed.
LOCAL_QUORUM  For multiple data centers. QUORUM within the local
              data center.
Ref. https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlConfigConsistency.html
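The QUORUM arithmetic can be checked with a small Python sketch. The function names are made up for this example; the second function encodes the usual rule of thumb that a read sees the latest write when the number of replicas read (R) plus the number written (W) exceeds RF.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas a QUORUM read or write must reach."""
    return replication_factor // 2 + 1


def reads_see_latest_write(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set overlaps every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


for rf in (1, 2, 3, 5):
    print(f"RF={rf}: QUORUM needs {quorum(rf)} replica(s)")
```

With RF = 3, reading and writing at QUORUM (2 + 2 > 3) overlaps, while reading and writing at ONE (1 + 1) does not.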
TERMINOLOGY YOU NEED TO UNDERSTAND- CONT.
▸ Compaction: Merges SSTables into new ones, dropping deleted
and overwritten data along the way.
▸ When performing a repair, an SSTable rebuild, or a cleanup, you
might see C* running compactions in order to make the data
consistent.
▸ Tombstone: Data is not deleted in place. A delete is written
like an insertion (a marker saying the data is deleted).
▸ gc_grace_seconds: The period C* keeps a tombstone so that all
the nodes can receive it. (Default: 864000 seconds, i.e. 10 days)
CONFIGURATION
CASSANDRA CONFIGURATION
cluster_name <cluster name>
listen_interface <ethernet interface name>
listen_address <the IP address on the main interface>
authenticator PasswordAuthenticator
authorizer CassandraAuthorizer
endpoint_snitch GossipingPropertyFileSnitch
seeds <the seed server address>
broadcast_address <External IP address>
permissions_validity_in_ms 20000
concurrent_reads 16 * number of disks used by data_file_directories
concurrent_writes 8 * number of cores
concurrent_counter_writes 16 * number of disks used by data_file_directories
streaming_keep_alive_period_in_secs 3600 (1hr)
read_request_timeout_in_ms 10000
CASSANDRA CONFIGURATION- CONT.
listen_interface <ethernet interface name>
listen_address <the IP address on the main interface>
broadcast_address <External IP address>
▸ Most of our machines are VMs, which may sit in a local
DHCP environment. The main interface may then listen on a
local IP, say 192.168.xxx.xxx.
▸ In this case, you need to set broadcast_address (the external
IP) so that the other nodes can find the node you are going
to add.
CASSANDRA CONFIGURATION- CONT.
▸ authenticator / authorizer: A pair of settings for Cassandra
account management.
▸ (PasswordAuthenticator/CassandraAuthorizer) is a fixed pair,
don’t change them.
▸ endpoint_snitch: Which kind of snitch you would like to use.
▸ GossipingPropertyFileSnitch: You need to edit
cassandra-rackdc.properties to use this snitch.
authenticator PasswordAuthenticator
authorizer CassandraAuthorizer
endpoint_snitch GossipingPropertyFileSnitch
CASSANDRA CONFIGURATION- CONT.
▸ permissions_validity_in_ms: How long the auth. info
(permissions) is cached.
permissions_validity_in_ms 20000
concurrent_reads 16 * number of disks used by data_file_directories
concurrent_writes 8 * number of cores
concurrent_counter_writes 16 * number of disks used by data_file_directories
▸ concurrent_*: Depend on the hardware resources.
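As a rough aid, the rules of thumb for the concurrent_* settings can be turned into a tiny Python helper. The function name and the example machine (2 data disks) are assumptions for illustration; the multipliers are the ones listed on this slide.

```python
import os


def suggested_concurrency(data_disks: int, cpu_cores: int) -> dict:
    """Apply the multipliers from the configuration slide."""
    return {
        "concurrent_reads": 16 * data_disks,
        "concurrent_writes": 8 * cpu_cores,
        "concurrent_counter_writes": 16 * data_disks,
    }


# Hypothetical machine: 2 data disks; the core count is read from the OS.
cores = os.cpu_count() or 1
for name, value in suggested_concurrency(data_disks=2, cpu_cores=cores).items():
    print(f"{name}: {value}")
```

Treat the result as a starting point to copy into cassandra.yaml, not as a tuned value.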
CASSANDRA CONFIGURATION- CONT.
▸ Some of the machines have higher network latency; these
settings try to keep Cassandra from timing out.
streaming_keep_alive_period_in_secs 3600 (1hr)
read_request_timeout_in_ms 10000
CASSANDRA CONFIGURATION- CONT.
▸ There are still a lot of settings to learn and discover!
▸ Lots of comments are available in cassandra.yaml. Check
them out if you have time.
QUERY
HOW IS DATA WRITTEN?
1. Write the data to the MemTable (memory) & log it in the
commit_log (disk)
‣ Durable writes: Failure tolerance!
2. Flush the data from the MemTable
‣ commitlog_total_space_in_mb: Threshold to flush
3. Store the data on disk in an SSTable
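The three steps above can be mimicked with a toy Python class. It is a sketch, not Cassandra's implementation: the class and attribute names are invented, everything lives in memory, and the flush is triggered by a simple entry count rather than by commit log size.

```python
class ToyNode:
    """Toy write path: commit log + MemTable, flushed to SSTables."""

    def __init__(self, flush_threshold: int = 3):
        self.commit_log = []   # append-only log (kept on disk in Cassandra)
        self.memtable = {}     # recent writes held in memory
        self.sstables = []     # flushed tables, never modified afterwards
        self.flush_threshold = flush_threshold  # assumed: number of entries

    def write(self, key, value):
        self.commit_log.append((key, value))  # step 1: log for durability
        self.memtable[key] = value            # step 1: write to memory
        if len(self.memtable) >= self.flush_threshold:
            self.flush()                      # step 2: flush when full

    def flush(self):
        # Step 3: persist a sorted snapshot; the log entries it covers
        # are no longer needed for recovery.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable.clear()
        self.commit_log.clear()


node = ToyNode(flush_threshold=2)
node.write("b", 2)
node.write("a", 1)
print(node.sstables)
```

If the process died before a flush, replaying commit_log would rebuild the MemTable, which is the point of durable writes.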
HOW IS DATA WRITTEN?- CONT.
‣ The commit_log is replayed on restart. This is why a reboot
of Cassandra sometimes takes longer and sometimes shorter.
It depends on how much data has to be replayed.
[Figure: a write request goes to the MemTable in memory and to the commit_log on hard disk; a flush moves the MemTable data into an SSTable (committed data) on hard disk]
Ref. https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlHowDataWritten.html
HOW IS DATA WRITTEN?- CONT.
‣ Note that it is recommended to keep the commit_log and the
SSTables on different disks.
‣ If possible, attach at least 3 hard disk drives to your machine. (SSDs
are more than welcome!)
Ref. https://wiki.apache.org/cassandra/PerformanceTuning
HOW IS DATA READ?
▸ The coordinator finds which node(s) to ask for the required
data.
▸ On the responsible node:
▸ Try to find the data in the MemTable first.
▸ Then look for the data in the compressed SSTable files.
▸ Combine the results (from MemTable & SSTables) and
return them to the coordinator.
Ref. https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlAboutReads.html
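The "combine the results" step on a replica can be sketched as follows. This is a simplified illustration with invented names: each store is a plain dict mapping a key to a (timestamp, value) pair, and the copy with the highest write timestamp wins.

```python
def read(key, memtable, sstables):
    """Return the newest value for key across the MemTable and all SSTables.

    Each store maps key -> (timestamp, value); the highest timestamp wins.
    """
    candidates = []
    if key in memtable:
        candidates.append(memtable[key])
    for sstable in sstables:
        if key in sstable:
            candidates.append(sstable[key])
    if not candidates:
        return None
    return max(candidates, key=lambda cell: cell[0])[1]


memtable = {"k": (5, "new")}
sstables = [{"k": (1, "old")}, {"j": (2, "x")}]
print(read("k", memtable, sstables))
```

The more SSTables hold a copy of the same key, the more places a read has to look, which is one reason compaction matters.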
HOW IS DATA DELETED?
▸ Keep in mind that it is a large-scale distributed system.
A careless deletion could harm consistency.
▸ Deletion as insertion: Tombstone.
▸ gc_grace_seconds: Protects against party-rock zombies
(deleted data coming back to life)!!
▸ Compaction actually “clears” the data.
▸ You may assign a TTL to a data row!
Ref. https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlAboutDeletes.html
Everyday I’m shuffling!
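Deletion as insertion can be illustrated with a short Python sketch. The names are invented for the example and a table is just a dict of key to (timestamp, value); the grace period uses the 10-day default (864000 seconds).

```python
GC_GRACE_SECONDS = 864000  # 10 days
TOMBSTONE = object()       # marker stored in place of a deleted value


def delete(table: dict, key, now: int) -> None:
    """Deletion as insertion: write a tombstone with the deletion time."""
    table[key] = (now, TOMBSTONE)


def compact(table: dict, now: int) -> dict:
    """Keep live cells and tombstones that are still inside the grace period."""
    kept = {}
    for key, (timestamp, value) in table.items():
        expired = value is TOMBSTONE and now - timestamp >= GC_GRACE_SECONDS
        if not expired:
            kept[key] = (timestamp, value)
    return kept


table = {"a": (0, "x"), "b": (0, "y")}
delete(table, "a", now=100)
print(sorted(compact(table, now=100 + GC_GRACE_SECONDS)))
```

If a replica missed the tombstone and it were purged too early, that replica's old copy would look like valid data again: the zombie.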
HI, STILL ON
BOARD?🤯
Just KEEP MOVING!
- Lara Croft, Tomb Raider 2013
CONCEPTS & MONITORING
REPLICA, REPLICATION FACTOR (RF)
▸ How is the placement of replicas determined?
▸ SimpleStrategy & NetworkTopologyStrategy
▸ SimpleStrategy: Places the first replica on a node
determined by the partitioner. Additional replicas are
placed on the next nodes clockwise in the ring without
considering topology.
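SimpleStrategy placement is easy to express in Python. The function name and the six-node ring below are made up for illustration; the ring is simply a list of nodes in token order.

```python
def simple_strategy(ring, first_index, replication_factor):
    """Pick replicas clockwise from the node chosen by the partitioner.

    ring is the list of nodes in token order; topology is ignored.
    """
    count = min(replication_factor, len(ring))
    return [ring[(first_index + step) % len(ring)] for step in range(count)]


# Hypothetical six-node ring spanning three sites.
ring = ["TW-1", "TW-2", "US-1", "TW-3", "JP-1", "TW-4"]
print(simple_strategy(ring, first_index=4, replication_factor=3))
```

Because the walk ignores where nodes are located, replicas of one row can land on any mix of sites.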
REPLICA, REPLICATION FACTOR (RF)- CONT.
▸ NetworkTopologyStrategy: Requires an RF to be set for
each data center.
▸ NetworkTopologyStrategy: Places replicas in the same
data center by walking the ring clockwise until reaching
the first node in another rack.
ALTER KEYSPACE <keyspace> WITH REPLICATION
= {'class': 'NetworkTopologyStrategy',
'DC1': <num>, 'DC2': <num>}
AND durable_writes = true;
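A simplified Python sketch of rack-aware placement follows. It is an approximation for illustration only, not Cassandra's actual algorithm: the function name, the node tuples, and the fallback rule (reuse a rack only when no unused rack is left) are assumptions.

```python
def network_topology_strategy(ring, first_index, rf_per_dc):
    """Choose replicas per data center, preferring racks not used yet.

    ring: list of (node, dc, rack) tuples in token order.
    rf_per_dc: for example {"DC1": 2, "DC2": 1}.
    """
    ordered = ring[first_index:] + ring[:first_index]  # clockwise walk
    replicas = []
    for dc, rf in rf_per_dc.items():
        chosen, skipped, racks = [], [], set()
        for node, node_dc, rack in ordered:
            if node_dc != dc:
                continue
            if rack in racks:
                skipped.append(node)  # same rack: keep only as a fallback
            else:
                chosen.append(node)
                racks.add(rack)
        # Fall back to same-rack nodes only when distinct racks run out.
        replicas.extend((chosen + skipped)[:rf])
    return replicas


ring = [("a", "DC1", "r1"), ("b", "DC1", "r1"),
        ("c", "DC1", "r2"), ("d", "DC2", "r1")]
print(network_topology_strategy(ring, 0, {"DC1": 2, "DC2": 1}))
```

In the example, node b is passed over for DC1 because it shares rack r1 with node a, so the second DC1 replica goes to node c.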
NETWORKTOPOLOGYSTRATEGY VS. SIMPLESTRATEGY
▸ Multiple DCs using SimpleStrategy:
[Figure: a single ring of six nodes, numbered 1 to 6, spread over TW (four nodes), US (one node), and JP (one node)]
NETWORKTOPOLOGYSTRATEGY VS. SIMPLESTRATEGY- CONT.
▸ Why is this horrible?
[Figure: the same six-node ring across TW, US, and JP, with “A Data Query” travelling between sites and a link marked “Network bottleneck”]
NETWORKTOPOLOGYSTRATEGY VS. SIMPLESTRATEGY- CONT.
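▸ Multiple DCs using NetworkTopologyStrategy:
[Figure: TW, US, and JP drawn as separate data centers (one four-node ring, numbered 1 to 4, and two single-node data centers), labeled with per-data-center replication factors RF = 1, RF = 2, RF = 1]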
REPLICATION UNDER NETWORKTOPOLOGYSTRATEGY
[Figure: a dataset split into ranges 1 to 6 and placed on a six-node ring whose nodes sit in Rack 1, Rack 2, and Rack 3; the labels “Replica 1” and “Replica 2” mark the two copies of each range]
REPLICATION- CONT.
▸ It’s all about fault tolerance (availability).
▸ Enables the system to continue working even though
some nodes are not available.
▸ Fault tolerance at the level of data center and rack.
▸ Do not let RF > {NUM. OF NODES IN A DC}!!!
▸ Always remember to increase the RF of the system_auth
keyspace before you add a new node!!!
Ref. https://docs.datastax.com/en/cassandra/3.0/cassandra/architecture/archDataDistributeReplication.html
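The "RF must not exceed the number of nodes in a DC" rule is simple to check before changing a keyspace. A minimal Python sketch, with a made-up function name and made-up node counts:

```python
def overcommitted_dcs(rf_per_dc: dict, nodes_per_dc: dict) -> list:
    """Return the data centers whose RF exceeds their number of nodes."""
    return [dc for dc, rf in rf_per_dc.items()
            if rf > nodes_per_dc.get(dc, 0)]


# Hypothetical cluster: four nodes in TW, one each in US and JP.
nodes = {"TW": 4, "US": 1, "JP": 1}
print(overcommitted_dcs({"TW": 2, "US": 1, "JP": 2}, nodes))
```

An empty list means every data center has enough nodes to hold the requested number of replicas.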
HINTED-HANDOFF
▸ The process that helps a dead node recover the data it missed.
▸ The other nodes keep the data for the dead node for a certain
period of time. When the node comes back online,
they stream the data to the revived node.
▸ Default: 3 hours (max_hint_window_in_ms). Therefore, we should
deal with the dead node and bring it back within this period.
NODETOOL
▸ A tool for monitoring and controlling C*.
▸ To control C*, you should be familiar with this guy.
▸ Refer to: https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsNodetool.html
CQLSH
▸ CQL: Cassandra Query Language, which looks like traditional
SQL.
▸ A command-line shell to interact with C*.
▸ Looks like this: [screenshot not included]
CQLSH- CONT.
▸ You may alter the settings of an existing keyspace or table using
CQLSH. For example, change the RF of a keyspace.
▸ Of course, CQLSH can also be used to create/delete/modify/query
keyspaces and tables.
▸ Refer to: https://paper.dropbox.com/doc/Cassandra-Management-Operations--AIIgTHW33s5ArnWYx18kxfU3Ag-AvuMYLwTQhgWUKc6h1sUd#:uid=865346154186617362484552&h2=The-cqlsh-Command
THE SYSTEM STATUS CHECK
▸ This command returns the status of every existing node.
▸ Status interpretation:
▸ UN (Up/Normal): The node is working properly
▸ DN (Down/Normal): The node is offline
▸ UL (Up/Leaving): The node is leaving the cluster (node
deletion)
$ nodetool status
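The two-letter codes can be decoded mechanically: the first letter is the status (Up/Down) and the second is the state (Normal/Leaving/Joining/Moving). The helper below is a small Python sketch with an invented function name; it does not call nodetool itself.

```python
STATUS = {"U": "Up", "D": "Down"}
STATE = {"N": "Normal", "L": "Leaving", "J": "Joining", "M": "Moving"}


def decode_status(code: str) -> str:
    """Turn a two-letter code such as 'UN' into 'Up/Normal'."""
    if len(code) != 2 or code[0] not in STATUS or code[1] not in STATE:
        raise ValueError(f"unknown status code: {code!r}")
    return f"{STATUS[code[0]]}/{STATE[code[1]]}"


for code in ("UN", "DN", "UL"):
    print(code, "=", decode_status(code))
```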
THE SYSTEM STATUS CHECK- CONT.
▸ This command also tells you the data portion of each DC,
the disk usage, and the UUID of a node.
$ nodetool status
[Screenshot of the output, annotated with “Disk Usage” and “Data Portion”]
THE SYSTEM STATUS CHECK- CONT.
▸ This command shows the listening ports of the machine.
▸ It’s a quick way to check if C* is still online.
▸ Cassandra port usage:
$ netstat -lnt
7000 Gossiping port (unencrypted)
9042 CQLSH/client API communication port
7199 JMX monitoring port
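The same check can be scripted. The Python sketch below simply tries a TCP connection to each of the three ports listed on this slide; the function name is invented and 127.0.0.1 is an assumed host (use the node's listen address if C* is not bound to localhost).

```python
import socket

CASSANDRA_PORTS = {7000: "gossip", 9042: "CQL clients", 7199: "JMX"}


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for port, purpose in CASSANDRA_PORTS.items():
        state = "open" if is_listening("127.0.0.1", port) else "closed"
        print(f"{port} ({purpose}): {state}")
```

An open port only shows that something is listening; nodetool status is still needed to know whether the node is healthy.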
THE SYSTEM STATUS CHECK- CONT.
▸ This command shows the status of the C* process. It should
always be “active (running)”.
▸ If you see “active (exited)”, then C* is already dead
because of some error. Check the log for further
information.
$ service cassandra status
IT’S REALLY A
LONG STORY…
CASSANDRA OPERATIONS
(HUMAN INVOLVED 😈)
REPAIR
▸ The process to maintain data consistency across the
cluster.
▸ This is the operation that will make you burn the midnight
oil……
$ nodetool repair [option]
Ref. https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsRepair.html
REPAIR- CONT.
▸ Full Repair & Incremental Repair
▸ This should be done periodically!
▸ As recommended by the official C* documentation:
▸ Incremental repair every 1 - 3 days (within the GC grace
period)
▸ Full repair every 1 - 3 weeks
Ref. http://cassandra.apache.org/doc/latest/operating/repair.html
REPAIR- CONT.
▸ How to monitor the repair progress?? Good question!
▸ The log files
▸ Useful commands:
$ nodetool netstats
# print the status of streaming
$ nodetool compactionstats
# print the status of compaction
$ nodetool tpstats
# show the thread pool statistics
SSTABLE CORRUPTION
▸ If a repair fails or the data sync does not go well, this
will happen……
▸ For example, when you see this after a repair is done:
▸ Prepare a cup of coffee, you might need it……. 😨
[2017-05-16 00:26:40,555] Repair session dbbf6510-39ef-11e7-8027-d710f406f829 for range (-4631786651008530880,-4578496872070625882] failed with error [repair #dbbf6510-39ef-11e7-8027-d710f406f829 on watchtower_keyspace/release_stages, (-4631786651008530880,-4578496872070625882]] Validation failed in /xxx.xxx.xxx.xxx (progress: 0%)
SSTABLE CORRUPTION- CONT.
▸ All you need to do is run the following on the node with IP
xxx.xxx.xxx.xxx:
▸ As with repair, use the same set of nodetool commands to see if C*
is still working.
▸ If everything goes well, try the repair again and hope nothing bad
happens again.
$ nodetool scrub
Ref. https://support.datastax.com/hc/en-us/articles/205256895--Validation-failed-when-running-a-nodetool-repair
RUNNING OUT OF DISK SPACE! DO YOU PERFORM DELETION?
▸ Remember that, for now, the Master node has only 100GB of
disk space. The data grows by approximately 1.xGB each
month.
▸ Frequently check the following:
$ nodetool status
# check the data portion and disk usage
$ df -h
# check the real hard disk space usage
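The remaining runway is plain arithmetic and can be estimated with a few lines of Python. The function name is invented, the growth rate has to be supplied by you (the slide only gives "1.x" GB per month), and "/" is an assumed mount point for the data volume.

```python
import shutil


def months_until_full(total_gb: float, used_gb: float,
                      growth_gb_per_month: float) -> float:
    """Months left before the disk fills at a constant growth rate."""
    if growth_gb_per_month <= 0:
        raise ValueError("growth must be positive")
    return max(total_gb - used_gb, 0) / growth_gb_per_month


# Real usage of the current volume; "/" is an assumed mount point.
usage = shutil.disk_usage("/")
print(f"used {usage.used / 1e9:.1f} GB of {usage.total / 1e9:.1f} GB")
```

For instance, a 100 GB disk with 40 GB used and a hypothetical growth of 1.5 GB per month gives months_until_full(100, 40, 1.5), i.e. 40 months, before counting snapshots and compaction overhead.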
RUNNING OUT OF DISK SPACE! DO YOU PERFORM DELETION?- CONT.
▸ If C* eats up too much space, you could reclaim some by
issuing a repair:
$ nodetool repair [option]
# repair the data; this will trigger the data compaction
▸ Or you could try to clear the data snapshots:
$ nodetool clearsnapshot
Ref. https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsAboutSnapshots.html
NEW NODE COMING IN, GREAT!
▸ Make sure the RF of system_auth is increased first.
▸ Check network connectivity and performance.
▸ Refer to here: https://paper.dropbox.com/doc/Cassandra-Management-Operations--AIIgTHW33s5ArnWYx18kxfU3Ag-AvuMYLwTQhgWUKc6h1sUd#:uid=308409713240027648094943&h2=Add-a-New-Node
▸ Bootstrapping a new node might fail, so check the log files
frequently!
NEW NODE COMING IN, GREAT!- CONT.
▸ How can I tell that the bootstrap failed?
▸ Log files (of course!)
▸ nodetool status shows a highly imbalanced data portion.
▸ It might be a network throughput issue; try to fix it and
resume the bootstrap:
$ nodetool bootstrap resume
LESS LIKELY BUT COULD HAPPEN: NODE DELETION
▸ You might need to delete a node because of some issue.
▸ Refer to: https://paper.dropbox.com/doc/Cassandra-Management-Operations--AIIgTHW33s5ArnWYx18kxfU3Ag-AvuMYLwTQhgWUKc6h1sUd#:uid=454006913486500030503564&h2=Delete/Remove-a-Node
▸ If everything goes fine, reduce the RF of system_auth so
that it is not larger than the total num. of nodes.
CASSANDRA OPERATIONS
▸ There are too many things to cover them all
in this talk.
▸ Please check the doc frequently for further information:
▸ https://paper.dropbox.com/doc/Cassandra-Management-Operations--AIIgTHW33s5ArnWYx18kxfU3Ag-AvuMYLwTQhgWUKc6h1sUd
SYSTEM_AUTH & CURRENT CASSANDRA USER ACCOUNT
▸ I keep talking about the system_auth keyspace, what is it
anyway?
▸ system_auth: The keyspace that keeps the account info. of
Cassandra.
▸ If the data in system_auth is inconsistent,
authentication might fail on some nodes. You will see
authentication failures for a certain period of time.
▸ Data loss!!!
APACHE CASSANDRA
SYSTEM_AUTH & CURRENT CASSANDRA USER ACCOUNT- CONT.
▸ Increasing the RF of system_auth before adding a new node
is just the theoretical approach……
▸ Current user accounts in Cassandra:
cassandra   Default superuser, now treated as a backup superuser.
            Has the same password as the iisnrl account.
iisnrl      The main superuser.
kairosdb    The user for master KairosDB insertion. Non-superuser.
lassgroup   The user for participating parties to archive data.
            Non-superuser.
THANK YOU!
ALL YOU NEED TO KNOW ABOUT
CASSANDRA

 
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Tatiana Kojar
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
HarisZaheer8
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
Postman
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Chart Kalyan
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 

Recently uploaded (20)

Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframeDigital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
 
A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
 
dbms calicut university B. sc Cs 4th sem.pdf
dbms  calicut university B. sc Cs 4th sem.pdfdbms  calicut university B. sc Cs 4th sem.pdf
dbms calicut university B. sc Cs 4th sem.pdf
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 

Basic stuff You Need to Know about Cassandra

  • 1. BASIC STUFF YOU NEED TO KNOW ABOUT CASSANDRA AUG. 2018 YU-CHANG HO (ANDY) FORMER RESEARCH ASSISTANT, ACADEMIA SINICA
  • 2. A GREEK STORY ➡An Ancient Greek Prophet ➡Second-most beautiful woman in the world ➡Gift of Prophecy from Apollo ➡Figure of Tragedy ‣ Ref. https://www.wikiwand.com/en/Cassandra
  • 3. APACHE CASSANDRA WHAT IS APACHE CASSANDRA (C*)? ▸ Originated at Facebook Inc. ▸ Combines the concepts of Google BigTable & Amazon Dynamo. ▸ Data Modeling: BigTable ▸ System Architecture: Dynamo ▸ A distributed database system with high scalability. ▸ Written in Java (The JVM Tuning Hell!!). Ref. https://www.wikiwand.com/en/Apache_Cassandra Ref. https://www.wikiwand.com/en/Dynamo_(storage_system) Ref. https://www.wikiwand.com/en/Bigtable
  • 4. APACHE CASSANDRA WHAT IS APACHE CASSANDRA (C*)?- CONT. ▸ It is a popular database system! (Ranked in 2018) 1. Oracle 2. MySQL 3. Microsoft SQL Server 4. PostgreSQL 5. MongoDB 6. DB2 7. Redis 8. Elasticsearch 9. Microsoft Access 10. Cassandra Ref. https://db-engines.com/en/ranking
  • 5. APACHE CASSANDRA WHAT IS APACHE CASSANDRA (C*)?- CONT. ▸ There is no master/slave relationship among C* nodes! ▸ Every node can serve reads and writes. ▸ In our scenario, we treat the GCP node as the “Master” that controls data insertion. ▸ Version currently in use: 3.11.2.
  • 6. APACHE CASSANDRA THE CAP THEOREM ▸ Eric Brewer, UC Berkeley ▸ C: Consistency ▸ A: Availability ▸ P: Partition-tolerance ▸ The three parts of CAP cannot all be satisfied at the same time. Ref. https://www.wikiwand.com/en/CAP_theorem
  • 7. APACHE CASSANDRA THE CAP THEOREM OF CASSANDRA ▸ C: The consistency of data → Eventual consistency ▸ A: The availability of service → Always available ▸ P: Ability to distribute the load effectively → High scalability ▸ We can still try to satisfy all three parts by tuning the consistency level for reads and writes. ▸ Provides high availability and some level of consistency. Ref. https://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed
  • 9. APACHE CASSANDRA TERMINOLOGY YOU NEED TO UNDERSTAND ▸ Data Center: A group of nodes ▸ Rack: Also a group of nodes [Figure: a data center containing Rack 1 and Rack 2, each holding nodes]
  • 10. APACHE CASSANDRA TERMINOLOGY YOU NEED TO UNDERSTAND- CONT. ▸ Ring: The logical representation of the cluster of nodes. [Figure: a ring of four nodes, numbered 1 to 4]
  • 11. APACHE CASSANDRA TERMINOLOGY YOU NEED TO UNDERSTAND- CONT. ▸ Keyspace: The top-level container for tables, comparable to a database in MySQL. ▸ Table: Just a table (formerly called a column family in Cassandra), don’t be confused! :-) ▸ MemTable: Cassandra first stores writes in memory. After a certain amount of data is reached, it flushes them to disk (SSTable). ▸ Commit_log: Besides storing new data in memory, C* also first writes it to a log on disk, so the data can be restored if something bad happens. ▸ SSTable: The compressed data files stored on disk.
  • 12. APACHE CASSANDRA TERMINOLOGY YOU NEED TO UNDERSTAND- CONT. ▸ Replica: A copy of the data. ▸ Replication Factor (RF): The number of replicas you wish to maintain in a given data center. ▸ Partitioner: Determines how data is distributed across the nodes in the cluster. (The token is created by hashing.) ▸ Coordinator: The role a node takes on when it receives a query. It locates the data among the nodes, and on each of those nodes both the MemTable and the SSTables are checked.
  • 13. APACHE CASSANDRA TERMINOLOGY YOU NEED TO UNDERSTAND- CONT. ▸ Gossip Protocol: The protocol a C* node uses to discover information about other nodes. ▸ Seed Node: A node that mainly keeps the topology information. ▸ Now, we have GCP (TW), UCSD (US), NTU (JP) seed nodes. ▸ Snitch: The mechanism a C* node uses to map IPs to racks and data centers (the topology). ▸ A snitch is useful when performing a read. ▸ It builds the topology and helps decide which node to query.
  • 14. APACHE CASSANDRA TERMINOLOGY YOU NEED TO UNDERSTAND- CONT. ▸ Consistency Level (CL): The level of consistency that a query is required to achieve.
      ANY: Lowest level (writes only). Even if all the replica nodes are down, the write can still succeed.
      ONE: At least one replica node must succeed.
      QUORUM: (RF / 2) + 1 replica nodes must succeed (integer division).
      ALL: Highest level. Every replica node must succeed.
      LOCAL_ONE: For multiple data centers. One replica node in the local data center must succeed.
      LOCAL_QUORUM: For multiple data centers. See QUORUM, counted within the local data center.
      Ref. https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlConfigConsistency.html
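The QUORUM arithmetic above can be sketched in a few lines of Python. This is only an illustration, not Cassandra code: the function names `quorum` and `reads_see_latest_write` are hypothetical, RF / 2 is assumed to be integer division, and the overlap rule (read replicas + write replicas > RF) is the usual rule of thumb for getting strongly consistent reads.

```python
def quorum(rf: int) -> int:
    """Number of replicas that must respond for QUORUM: floor(RF / 2) + 1."""
    if rf < 1:
        raise ValueError("replication factor must be at least 1")
    return rf // 2 + 1


def reads_see_latest_write(rf: int, write_replicas: int, read_replicas: int) -> bool:
    """True when every read set overlaps every write set (R + W > RF)."""
    return read_replicas + write_replicas > rf


if __name__ == "__main__":
    for rf in (1, 2, 3, 5):
        q = quorum(rf)
        print(f"RF={rf}: QUORUM={q}, "
              f"QUORUM reads overlap QUORUM writes: {reads_see_latest_write(rf, q, q)}")
```

With RF = 3, QUORUM reads and writes each touch 2 replicas and always overlap, while ONE/ONE does not, which is why ONE only gives eventual consistency.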
  • 15. APACHE CASSANDRA TERMINOLOGY YOU NEED TO UNDERSTAND- CONT. ▸ Compaction: Merges SSTables. It purges the deleted data and writes the remaining data into new SSTables. ▸ When performing a repair, an SSTable rebuild, or a cleanup, you might see C* running compaction in order to make the data consistent. ▸ Tombstone: Deletion does not work the usual way. A delete is performed as an insertion that marks the data as deleted. ▸ gc_grace_seconds: The period during which C* keeps a tombstone so that all the nodes can receive it. (Default: 10 days)
  • 17. APACHE CASSANDRA CASSANDRA CONFIGURATION
      cluster_name: <cluster name>
      listen_interface: <ethernet interface name>
      listen_address: <the IP address on the main interface>
      authenticator: PasswordAuthenticator
      authorizer: CassandraAuthorizer
      endpoint_snitch: GossipingPropertyFileSnitch
      seeds: <the seed server address>
      broadcast_address: <External IP address>
      permissions_validity_in_ms: 20000
      concurrent_reads: 16 * number of disks used by data_file_directories
      concurrent_writes: 8 * number of cores
      concurrent_counter_writes: 16 * number of disks used by data_file_directories
      streaming_keep_alive_period_in_secs: 3600 (1 hr)
      read_request_timeout_in_ms: 10000
  • 18. APACHE CASSANDRA CASSANDRA CONFIGURATION- CONT. listen_interface: <ethernet interface name>; listen_address: <the IP address on the main interface>; broadcast_address: <External IP address> ▸ Most of our machines are VMs, which might sit in a local DHCP environment. The main interface might listen on a local IP, say 192.168.xxx.xxx. ▸ In this case, you need to set the broadcast_address so that the other nodes are able to find the node you are going to add.
  • 19. APACHE CASSANDRA CASSANDRA CONFIGURATION- CONT. authenticator: PasswordAuthenticator; authorizer: CassandraAuthorizer; endpoint_snitch: GossipingPropertyFileSnitch ▸ authenticator / authorizer: A pair of settings for Cassandra account management. ▸ (PasswordAuthenticator/CassandraAuthorizer) is a fixed pair, don’t change them. ▸ endpoint_snitch: What kind of snitch you would like to use. ▸ GossipingPropertyFileSnitch: You need to modify cassandra-rackdc.properties to use this snitch.
  • 20. APACHE CASSANDRA CASSANDRA CONFIGURATION- CONT. permissions_validity_in_ms: 20000; concurrent_reads: 16 * number of disks used by data_file_directories; concurrent_writes: 8 * number of cores; concurrent_counter_writes: 16 * number of disks used by data_file_directories ▸ permissions_validity_in_ms: How long the auth. info is cached. ▸ concurrent_*: Hardware resource dependent.
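The concurrent_* sizing rules on this slide are simple multiplications, so they can be written down as a small helper. This is a sketch that only restates the slide's formulas; the function name `suggest_concurrency` and its parameters are hypothetical, and the results are starting points rather than tuned values.

```python
def suggest_concurrency(num_data_disks: int, num_cores: int) -> dict:
    """Apply the slide's rules of thumb for the concurrent_* settings."""
    if num_data_disks < 1 or num_cores < 1:
        raise ValueError("need at least one data disk and one core")
    return {
        # 16 * number of disks used by data_file_directories
        "concurrent_reads": 16 * num_data_disks,
        # 8 * number of cores
        "concurrent_writes": 8 * num_cores,
        # 16 * number of disks used by data_file_directories
        "concurrent_counter_writes": 16 * num_data_disks,
    }


if __name__ == "__main__":
    # Hypothetical machine: 2 data disks and 4 cores.
    for name, value in suggest_concurrency(2, 4).items():
        print(f"{name}: {value}")
```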
  • 21. APACHE CASSANDRA CASSANDRA CONFIGURATION- CONT. ▸ Some of the machines have higher network latency. These settings try to prevent Cassandra from timing out. streaming_keep_alive_period_in_secs: 3600 (1 hr); read_request_timeout_in_ms: 10000
  • 22. APACHE CASSANDRA CASSANDRA CONFIGURATION- CONT. ▸ There are still a lot of settings to learn and discover! ▸ Lots of comments are available in cassandra.yaml. Check them out if you have time.
  • 24. APACHE CASSANDRA HOW IS DATA WRITTEN? 1. Write data to MemTable (memory) & log data in commit_log (disk) ‣ Durable writes: Failure tolerance! 2. Flush data from MemTable ‣ commitlog_total_space_in_mb: Threshold to flush 3. Store data on disk in SSTable
  • 25. APACHE CASSANDRA HOW IS DATA WRITTEN?- CONT. ‣ The commit_log is replayed on restart. This is why a reboot of Cassandra sometimes takes longer and sometimes shorter: it depends on how much data has to be replayed. [Figure: a write request goes to the MemTable (memory) and the commit_log (hard disk); the MemTable is flushed to an SSTable (committed data) on hard disk] Ref. https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlHowDataWritten.html
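The write path and the commit_log replay can be mimicked with a toy in-memory model. This is a simplified sketch, not how Cassandra is implemented: the class `ToyNode` is hypothetical, the flush threshold is a number of entries instead of commitlog_total_space_in_mb, and the "disk" is just a Python list.

```python
class ToyNode:
    """Toy model of one node's write path: commit log, MemTable, SSTables."""

    def __init__(self, flush_threshold: int = 3):
        self.flush_threshold = flush_threshold  # stand-in for a size threshold
        self.commit_log = []   # stands in for the on-disk commit log
        self.memtable = {}     # in memory, lost on a crash
        self.sstables = []     # immutable "files", oldest first

    def write(self, key, value):
        self.commit_log.append((key, value))  # log first, for durability
        self.memtable[key] = value            # then update the MemTable
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Persist the MemTable as a new SSTable and drop the replayable log."""
        if not self.memtable:
            return
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log = []  # flushed data no longer needs to be replayed

    def crash(self):
        """Simulate a process crash: everything in memory is gone."""
        self.memtable = {}

    def restart(self):
        """Rebuild the MemTable by replaying the commit log."""
        for key, value in self.commit_log:
            self.memtable[key] = value
```

The longer the commit log is at the time of the crash, the more entries `restart` has to replay, which mirrors the varying reboot time mentioned above.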
  • 26. APACHE CASSANDRA HOW IS DATA WRITTEN?- CONT. ‣ Note that it is recommended to keep the commit_log and the SSTables on different disks. ‣ If possible, attach at least 3 hard disk drives to your machine. (SSDs are more than welcome!) [Figure: a write request goes to the MemTable (memory) and the commit_log (hard disk); the MemTable is flushed to an SSTable (committed data) on hard disk] Ref. https://wiki.apache.org/cassandra/PerformanceTuning
  • 27. APACHE CASSANDRA HOW IS DATA READ? ▸ The coordinator finds which node(s) to ask for the required data. ▸ On the responsible node: ▸ Try to find the data in the MemTable first. ▸ Then look for the data in the compressed SSTable files. ▸ Combine the results (from MemTable & SSTable) and return them to the coordinator. Ref. https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlAboutReads.html
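The "combine the results" step on the responsible node can be illustrated with a small merge function. This is a sketch under simplifying assumptions: every cell carries a write timestamp and the newest one wins, and the structures and the name `read_key` are hypothetical. Real reads involve more machinery that the slides do not cover.

```python
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, str]  # (write timestamp, value)


def read_key(key: str,
             memtable: Dict[str, Cell],
             sstables: List[Dict[str, Cell]]) -> Optional[str]:
    """Collect every version of `key` and return the most recent value."""
    candidates = []
    if key in memtable:              # check the MemTable first
        candidates.append(memtable[key])
    for sstable in sstables:         # then every SSTable that holds the key
        if key in sstable:
            candidates.append(sstable[key])
    if not candidates:
        return None
    # The cell with the highest timestamp is the current value.
    return max(candidates, key=lambda cell: cell[0])[1]
```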
  • 28. APACHE CASSANDRA HOW IS DATA DELETED? ▸ Keep in mind that it is a large-scale distributed system. A deletion can easily harm consistency. ▸ Deletion as Insertion: Tombstone. ▸ gc_grace_seconds: Prevents party-rock zombies (deleted data coming back to life)!! ▸ Compaction “clears” the data. ▸ You may assign a TTL to a data row! Ref. https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlAboutDeletes.html Everyday I’m shuffling!
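Deletion as insertion can be sketched as follows. This is a toy model, not Cassandra's implementation: the table is a plain dict of key to (timestamp in seconds, value), `TOMBSTONE` is a hypothetical marker object, and `compact` only shows the rule that a tombstone is purged once it is older than gc_grace_seconds (10 days by default, per the slides).

```python
GC_GRACE_SECONDS = 10 * 24 * 60 * 60  # 10 days, the default from the slides
TOMBSTONE = object()                  # hypothetical marker for "deleted"


def delete(table: dict, key: str, now: int) -> None:
    """A delete is an insert of a tombstone; the old value is overwritten."""
    table[key] = (now, TOMBSTONE)


def compact(table: dict, now: int, gc_grace_seconds: int = GC_GRACE_SECONDS) -> dict:
    """Return a new table without tombstones older than gc_grace_seconds."""
    kept = {}
    for key, (timestamp, value) in table.items():
        if value is TOMBSTONE and now - timestamp >= gc_grace_seconds:
            continue  # every node had time to learn about the delete
        kept[key] = (timestamp, value)
    return kept
```

Keeping the tombstone around for the grace period is what lets a node that missed the delete learn about it instead of resurrecting the old value.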
  • 29. HI, STILL ON BOARD?🤯 Just KEEP MOVING! — Lara Croft, Tomb Raider 2013
  • 31. APACHE CASSANDRA REPLICA, REPLICATION FACTOR (RF) ▸ How to determine the placement of replicas? ▸ SimpleStrategy & NetworkTopologyStrategy ▸ SimpleStrategy: Places the first replica on a node determined by the partitioner. Additional replicas are placed on the next nodes clockwise in the ring without considering topology.
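The SimpleStrategy rule above (first replica chosen by the partitioner, the rest on the next nodes clockwise) can be sketched like this. The sketch makes several assumptions: each node owns exactly one token, the ring is a sorted list of (token, node) pairs, and MD5 is used as a stand-in hash in the hypothetical `token_for` helper rather than Cassandra's actual partitioner.

```python
import bisect
import hashlib


def token_for(key: str, ring_size: int = 2**32) -> int:
    """Hash a partition key to a position on the ring (stand-in hash)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % ring_size


def simple_strategy_replicas(ring, key_token: int, rf: int):
    """ring: list of (token, node) sorted by token, one token per node."""
    if rf > len(ring):
        raise ValueError("RF cannot exceed the number of nodes")
    tokens = [token for token, _ in ring]
    # First node whose token is >= the key's token, wrapping around the ring.
    start = bisect.bisect_left(tokens, key_token) % len(ring)
    # Remaining replicas: the next nodes clockwise, ignoring racks and DCs.
    return [ring[(start + i) % len(ring)][1] for i in range(rf)]
```

For example, on a four-node ring with tokens 0, 100, 200, 300, a key with token 150 and RF = 2 lands on the nodes owning 200 and 300.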
  • 32. APACHE CASSANDRA REPLICA, REPLICATION FACTOR (RF)- CONT. ▸ NetworkTopologyStrategy: Requires setting the RF for each data center. ▸ NetworkTopologyStrategy: Places replicas in the same datacenter by walking the ring clockwise until reaching the first node in another rack. ALTER KEYSPACE <keyspace> WITH REPLICATION = {'class': 'NetworkTopologyStrategy', 'DC1': <num>, 'DC2': <num>} AND DURABLE_WRITES = true;
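The rack-aware walk described for NetworkTopologyStrategy can be approximated for a single data center as below. This is a deliberately simplified sketch, not the real placement algorithm: the function name `nts_replicas_for_dc` is hypothetical, the input is the list of (node, rack) pairs of one data center in ring order, and the fallback to already-seen racks when there are fewer racks than replicas is an assumption of this sketch.

```python
def nts_replicas_for_dc(ring, start_index: int, rf: int):
    """Pick `rf` replicas in one DC, preferring nodes on distinct racks."""
    if rf > len(ring):
        raise ValueError("RF cannot exceed the number of nodes in the DC")
    chosen, skipped, seen_racks = [], [], set()
    for i in range(len(ring)):
        node, rack = ring[(start_index + i) % len(ring)]  # walk clockwise
        if rack not in seen_racks:
            chosen.append(node)      # first node met on a new rack
            seen_racks.add(rack)
        else:
            skipped.append(node)     # same rack as an earlier replica
        if len(chosen) == rf:
            return chosen
    # Fewer racks than replicas: fill up with the skipped nodes in ring order.
    return chosen + skipped[: rf - len(chosen)]
```

Spreading replicas over racks this way is what gives rack-level fault tolerance: losing one rack still leaves a replica on another.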
  • 33. APACHE CASSANDRA NETWORKTOPOLOGYSTRATEGY VS. SIMPLESTRATEGY ▸ Multiple DC using SimpleStrategy: [Figure: a single ring of six nodes, numbered 1 to 6, spread over the TW, US, and JP data centers]
  • 34. APACHE CASSANDRA NETWORKTOPOLOGYSTRATEGY VS. SIMPLESTRATEGY- CONT. ▸ Why is this horrible? [Figure: the same six-node ring across TW, US, and JP; a data query has to cross data centers, which creates a network bottleneck]
  • 35. APACHE CASSANDRA NETWORKTOPOLOGYSTRATEGY VS. SIMPLESTRATEGY- CONT. ▸ Multiple DC using NetworkTopologyStrategy: [Figure: one ring per data center (TW, US, JP), each with its own replication factor]
  • 36. APACHE CASSANDRA REPLICATION UNDER NETWORKTOPOLOGYSTRATEGY [Figure: a dataset of partitions 1 to 6 distributed over Rack 1, Rack 2, and Rack 3, with Replica 1 and Replica 2 of each partition placed on different racks]
  • 37. APACHE CASSANDRA REPLICATION- CONT. ▸ It’s all about fault-tolerance (Availability). ▸ Enables the system to continue working even when some nodes are not available. ▸ Fault-tolerance at the level of data center and rack. ▸ Do not let RF > {NUM. OF NODES IN A DC}!!! ▸ Always remember to increase the RF of the system_auth keyspace before you add a new node!!! Ref. https://docs.datastax.com/en/cassandra/3.0/cassandra/architecture/archDataDistributeReplication.html
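The rule "do not let RF exceed the number of nodes in a DC" is easy to check before altering a keyspace. The helper below is a hypothetical pre-flight check, not part of any Cassandra tool: it takes the per-DC replication settings and the per-DC node counts as plain dicts.

```python
def find_rf_problems(replication: dict, nodes_per_dc: dict) -> list:
    """Return one message per data center whose RF exceeds its node count."""
    problems = []
    for dc, rf in replication.items():
        nodes = nodes_per_dc.get(dc, 0)  # an unknown DC counts as zero nodes
        if rf > nodes:
            problems.append(f"{dc}: RF {rf} exceeds {nodes} node(s)")
    return problems


if __name__ == "__main__":
    # Hypothetical layout: three data centers named after the slides' regions.
    print(find_rf_problems({"TW": 2, "US": 1, "JP": 2},
                           {"TW": 3, "US": 1, "JP": 1}))
```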
  • 38. APACHE CASSANDRA HINTED-HANDOFF ▸ The process that helps a dead node recover the data it missed. ▸ The other nodes keep the data for the dead node for a certain period of time. When the node comes back online, they stream the data to the revived node. ▸ Default: 3 hours (max_hint_window_in_ms in cassandra.yaml). Therefore, we should deal with the dead node and bring it back within this period.
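Hinted handoff can be pictured as a small per-node mailbox with an expiry window. The class below is a toy sketch, not Cassandra's implementation: `HintStore` and its methods are hypothetical, times are plain seconds, and the window length is a constructor parameter rather than a real configuration value.

```python
class HintStore:
    """Toy model of the hints one node keeps for its unavailable peers."""

    def __init__(self, hint_window_seconds: int):
        self.hint_window_seconds = hint_window_seconds
        self.hints = {}  # target node -> list of (timestamp, key, value)

    def store(self, target: str, key: str, value: str,
              now: int, down_since: int) -> bool:
        """Keep a hint only while the target has been down less than the window."""
        if now - down_since > self.hint_window_seconds:
            return False  # down too long: no more hints, a repair is needed
        self.hints.setdefault(target, []).append((now, key, value))
        return True

    def replay(self, target: str) -> list:
        """Hand over and forget every hint once the target is back online."""
        return self.hints.pop(target, [])
```

Once the window has passed, writes for the dead node are no longer remembered, which is why the node should be brought back (or repaired) in time.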
  • 39. APACHE CASSANDRA NODETOOL ▸ A monitoring/controlling tool of C*. ▸ To control C*, you should be familiar with this guy. ▸ Refer to: https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsNodetool.html
  • 40. APACHE CASSANDRA CQLSH ▸ CQL: Cassandra Query Language, which looks like traditional SQL. ▸ CQLSH: A command-line shell to interact with C*. ▸ It looks like this: [Screenshot of a cqlsh session]
  • 41. APACHE CASSANDRA CQLSH- CONT. ▸ You may alter the settings of an existing keyspace or table using CQLSH. For example, change the RF of a keyspace. ▸ Of course, CQLSH can also be used to create/delete/modify/query keyspaces and tables. ▸ Refer to: https://paper.dropbox.com/doc/Cassandra-Management-Operations--AIIgTHW33s5ArnWYx18kxfU3Ag-AvuMYLwTQhgWUKc6h1sUd#:uid=865346154186617362484552&h2=The-cqlsh-Command
  • 42. APACHE CASSANDRA THE SYSTEM STATUS CHECK $ nodetool status ▸ This command returns the status of all existing nodes. ▸ Status interpretation: ▸ UN (Up/Normal): The node is working properly ▸ DN (Down/Normal): The node is offline ▸ UL (Up/Leaving): The node is leaving the cluster (node deletion)
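The two-letter status codes can be decoded with a short script, for example when the output of `nodetool status` is collected by a monitoring job. This is a sketch: the function name `parse_nodetool_status` is hypothetical, and `SAMPLE_OUTPUT` is made-up text that only imitates the general layout of the command's output (addresses, loads, and host IDs are invented).

```python
import re

STATUS = {"U": "Up", "D": "Down"}
STATE = {"N": "Normal", "L": "Leaving", "J": "Joining", "M": "Moving"}
# A node line starts with the two-letter code followed by the address.
NODE_LINE = re.compile(r"^([UD])([NLJM])\s+(\S+)")

# Made-up sample, for illustration only.
SAMPLE_OUTPUT = """\
Datacenter: TW
==============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load     Tokens  Owns   Host ID    Rack
UN  10.0.0.1   1.2 GiB  256     50.0%  aaaa-1111  rack1
DN  10.0.0.2   1.1 GiB  256     50.0%  bbbb-2222  rack1
"""


def parse_nodetool_status(output: str):
    """Return (address, status, state) for every node line in the output."""
    nodes = []
    for line in output.splitlines():
        match = NODE_LINE.match(line.strip())
        if match:
            status, state, address = match.groups()
            nodes.append((address, STATUS[status], STATE[state]))
    return nodes


if __name__ == "__main__":
    for address, status, state in parse_nodetool_status(SAMPLE_OUTPUT):
        print(f"{address}: {status}/{state}")
```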
  • 43. APACHE CASSANDRA THE SYSTEM STATUS CHECK- CONT. $ nodetool status ▸ This command also tells you the data portion of each DC, the disk usage, and the UUID of a node. [Screenshot: nodetool status output with the disk usage and data portion columns highlighted]
  • 44. APACHE CASSANDRA THE SYSTEM STATUS CHECK- CONT. $ netstat -lnt ▸ This command shows the listening ports of the machine. ▸ It’s a quick way to check if C* is still online. ▸ Cassandra port usage: 7000: gossip port (unencrypted); 9042: CQLSH/client API communication port; 7199: JMX monitoring port
  • 45. APACHE CASSANDRA THE SYSTEM STATUS CHECK- CONT. $ service cassandra status ▸ This command shows the status of the C* process. It should always be “active (running)”. ▸ If the status is “active (exited)”, then C* has already died because of some error. Check the log for further information.
  • 47. CASSANDRA OPERATIONS (HUMAN INVOLVED 😈) APACHE CASSANDRA
  • 48. APACHE CASSANDRA REPAIR ▸ The process to maintain data consistency across the cluster. ▸ This is the operation that will make you burn the midnight oil…… $ nodetool repair [option] Ref. https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsRepair.html
  • 49. APACHE CASSANDRA REPAIR- CONT. ▸ Full Repair & Incremental Repair ▸ This should be done periodically! ▸ As recommended by the official C* documentation: ▸ Incremental repair every 1 - 3 days (within GC grace period) ▸ Full repair every 1 - 3 weeks Ref. http://cassandra.apache.org/doc/latest/operating/repair.html
  • 50. APACHE CASSANDRA REPAIR- CONT. ▸ How to monitor the repair progress?? Good question! ▸ The log files ▸ Useful commands: $ nodetool netstats # print the status of streaming $ nodetool compactionstats # print the status of compaction $ nodetool tpstats # show the thread pool running processes
  • 51. APACHE CASSANDRA SSTABLE CORRUPTION ▸ If a repair fails or the data sync is not performed well, this will happen…… ▸ For example, when you see this after the repair is done: [2017-05-16 00:26:40,555] Repair session dbbf6510-39ef-11e7-8027-d710f406f829 for range (-4631786651008530880,-4578496872070625882] failed with error [repair #dbbf6510-39ef-11e7-8027-d710f406f829 on watchtower_keyspace/release_stages, (-4631786651008530880,-4578496872070625882]] Validation failed in /xxx.xxx.xxx.xxx (progress: 0%) ▸ Prepare a cup of coffee, you might need it……. 😨
  • 52. APACHE CASSANDRA SSTABLE CORRUPTION- CONT. ▸ All you need to do is run the following on the node with IP xxx.xxx.xxx.xxx: $ nodetool scrub ▸ As with repair, use the same set of nodetool commands to check that C* is still working. ▸ If everything goes well, try the repair again and hope nothing bad happens again. [2017-05-16 00:26:40,555] Repair session dbbf6510-39ef-11e7-8027-d710f406f829 for range (-4631786651008530880,-4578496872070625882] failed with error [repair #dbbf6510-39ef-11e7-8027-d710f406f829 on watchtower_keyspace/release_stages, (-4631786651008530880,-4578496872070625882]] Validation failed in /xxx.xxx.xxx.xxx (progress: 0%) Ref. https://support.datastax.com/hc/en-us/articles/205256895--Validation-failed-when-running-a-nodetool-repair
  • 53. APACHE CASSANDRA RUNNING OUT OF DISK SPACE! DO YOU PERFORM DELETION? ▸ Remember that, for now, the Master node has only 100GB of disk space. The data grows by approximately 1.xGB each month. ▸ Frequently check the following: $ nodetool status # check the data portion and disk usage $ df -h # check the real hard disk space usage
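A quick back-of-the-envelope projection tells you how long the disk will last. The function below is a hypothetical helper, and the numbers in the example are illustrative: the 100 GB capacity comes from the slide, while the current usage and the 1.5 GB monthly growth are assumed values standing in for the unspecified "1.x".

```python
import math


def months_until_full(capacity_gb: float, used_gb: float,
                      growth_gb_per_month: float) -> int:
    """Whole months of growth that still fit on the disk."""
    if growth_gb_per_month <= 0:
        raise ValueError("growth must be positive")
    free_gb = capacity_gb - used_gb
    if free_gb <= 0:
        return 0
    return math.floor(free_gb / growth_gb_per_month)


if __name__ == "__main__":
    # Assumed: 40 GB already used, growing 1.5 GB per month on a 100 GB disk.
    print(months_until_full(100, 40, 1.5), "months left")
```

The projection ignores temporary space used by compaction and snapshots, so the real margin is smaller than the figure suggests.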
  • 54. APACHE CASSANDRA RUNNING OUT OF DISK SPACE! DO YOU PERFORM DELETION?- CONT. ▸ If C* eats up too much space, you can reclaim it by issuing a repair: $ nodetool repair [option] # repair the data and this will trigger the data compaction ▸ Or you can try to clear the data snapshots: $ nodetool clearsnapshot Ref. https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsAboutSnapshots.html
  • 55. APACHE CASSANDRA NEW NODE COMING IN, GREAT! ▸ Make sure the RF of system_auth is increased first. ▸ Perform network connectivity and performance checks. ▸ Refer to here: https://paper.dropbox.com/doc/Cassandra-Management-Operations--AIIgTHW33s5ArnWYx18kxfU3Ag-AvuMYLwTQhgWUKc6h1sUd#:uid=308409713240027648094943&h2=Add-a-New-Node ▸ The bootstrap of a new node might fail, so check the log files frequently!
  • 56. APACHE CASSANDRA NEW NODE COMING IN, GREAT!- CONT. ▸ How do I know the bootstrap failed? ▸ Log files (of course!) ▸ nodetool status shows a highly imbalanced data portion. ▸ It might be a network throughput issue. Try to fix it and resume the bootstrap: $ nodetool bootstrap resume
  • 57. APACHE CASSANDRA LESS POSSIBLE BUT COULD HAPPEN, NODE DELETION ▸ You might want to delete a node because of some issue that comes up. ▸ Refer to: https://paper.dropbox.com/doc/Cassandra-Management-Operations--AIIgTHW33s5ArnWYx18kxfU3Ag-AvuMYLwTQhgWUKc6h1sUd#:uid=454006913486500030503564&h2=Delete/Remove-a-Node ▸ If everything goes fine, reduce the RF of system_auth so that it is not larger than the total number of nodes.
  • 58. APACHE CASSANDRA CASSANDRA OPERATIONS ▸ There are too many things to discuss, and it is hard to cover them all in this talk. ▸ Please check the doc frequently for further information: ▸ https://paper.dropbox.com/doc/Cassandra-Management-Operations--AIIgTHW33s5ArnWYx18kxfU3Ag-AvuMYLwTQhgWUKc6h1sUd
  • 59. APACHE CASSANDRA SYSTEM_AUTH & CURRENT CASSANDRA USER ACCOUNT ▸ I keep talking about the system_auth keyspace, what is it anyway? ▸ system_auth: The keyspace that keeps the account info. of Cassandra. ▸ If the data in system_auth is inconsistent, authentication might fail on a certain node. You will see authentication failures for a certain period of time. ▸ Data loss!!!
  • 60. APACHE CASSANDRA SYSTEM_AUTH & CURRENT CASSANDRA USER ACCOUNT- CONT. ▸ Increasing the RF of system_auth before adding a new node is just a theoretical approach…… ▸ Current user accounts in Cassandra:
      cassandra: Default superuser, now treated as a backup superuser. Has the same password as the iisnrl account.
      iisnrl: The main superuser.
      kairosdb: The user for master KairosDB insertion. Non-superuser.
      lassgroup: The user for participating parties to archive data. Non-superuser.
  • 61. THANK YOU! ALL YOU NEED TO KNOW ABOUT CASSANDRA