7. Masterless Architecture
– Peer-to-peer Database
– Automatic Data Distribution
– Built-in Replication
partition | id | name
usa_1994 | ae12f… | The Shawshank Redemption
usa_1994 | bb2ac… | Forrest Gump
skorea_2002 | 31dcc… | The Way Home
vn_2015 | 890af… | Jackpot
Cassandra cluster:
Node 1: skorea_2002 | 31… | The Way …
Node 2: usa_1994 | ae… | The Sha..
        usa_1994 | bb… | Forrest..
Node 3: vn_2015 | 89… | Jackpot
CREATE TABLE movie (
partition text,
id uuid,
name text,
PRIMARY KEY (partition, id)
)
8. Masterless Architecture
– Peer-to-peer Database
– Automatic Data Distribution
– Built-in Replication
partition | id | name
(A) usa_1994 | ae12f… | The Shawshank Redemption
(A) usa_1994 | bb2ac… | Forrest Gump
(B) skorea_2002 | 31dcc… | The Way Home
(C) vn_2015 | 890af… | Jackpot
Cassandra cluster (X’ marks a replica of partition X):
Node 1: (B) (A’) (C’)
Node 2: (A) (B’) (C’)
Node 3: (C) (A’) (B’)
CREATE TABLE movie (
partition text,
id uuid,
name text,
PRIMARY KEY (partition, id)
)
14. Cautions
Cassandra is from the NoSQL family, so:
– Different data modeling (try searching for “nosql data modeling”)
– No atomicity across multiple tables (or “documents”): no transactions
– Not good at complex queries (GROUP BY, HAVING, LIKE, >=, etc.)
15. Cautions
CREATE TABLE movie (
partition text, id uuid,
name text, released_year int, country text, length int,
PRIMARY KEY (partition, id)
)
Limited query capability in Cassandra
SELECT * FROM movie WHERE
(1) released_year < 2000
(2) name LIKE 'Star Trek'
(3) name IN ('Tangled', 'Frozen')
(4) name = 'Inside Out' AND released_year = 2015
(5) length = null
(6) length = 120 OR released_year = 2015
UPDATE movie SET length = 95 WHERE partition = 'vn' AND id = 10
INSERT INTO movie (partition, id, length) VALUES ('usa', 15, 120)
18. Take Away
Masterless Architecture
– Peer-to-peer Database
– Automatic Data Distribution
– Built-in Replication
Cautions
– Cassandra is from NoSQL family
– Limited query capability
Open source, backed by a big sponsor, suitable for commercial use
Can be hosted on any platform
A company provides many resources around Cassandra (documentation, tools, etc.), both free and commercial
Note: the products using Cassandra listed on the slide are mostly about user-generated content
Three characteristics make Cassandra a preferable choice for Big Data
Eliminates the single point of failure (SPOF) and gains high availability
An app can connect to another node if the current node is shut down
Processing capability is spread across the nodes
Partitioner: a function for deriving a token representing a row from its partition key (typically by hashing). A partitioner determines how data is distributed across the nodes in the cluster (including replicas)
Cassandra offers the following partitioners:
Murmur3Partitioner (default since Cassandra 1.2): uniformly distributes data across the cluster based on MurmurHash hash values. MurmurHash is faster to compute than MD5, and a cryptographic hash is not needed for distribution purposes.
RandomPartitioner: uniformly distributes data across the cluster based on MD5 hash values.
ByteOrderedPartitioner: keeps an ordered distribution of data, sorted lexically by key bytes.
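The partitioner idea above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: MD5 stands in for the hash (as RandomPartitioner does, since MurmurHash3 is not in the Python standard library), the token space is simplified to 128 bits, each node gets one evenly spaced token, and the node names and the helpers token_for, build_ring and owner_of are hypothetical. It is not how Cassandra is implemented internally.

```python
import hashlib
from bisect import bisect_left

# Simplified token space for this sketch: the full 128-bit MD5 output.
RING_SIZE = 2 ** 128


def token_for(partition_key):
    """Derive a token from the partition key by hashing (MD5 as a stand-in)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def build_ring(nodes):
    """Assign each node one evenly spaced token (an assumption of this sketch)."""
    step = RING_SIZE // len(nodes)
    return [((i + 1) * step - 1, node) for i, node in enumerate(nodes)]


def owner_of(partition_key, ring):
    """Return the node whose token is the first one >= the key's token.

    If the key's token is larger than every node token, wrap to the first node.
    """
    tokens = [token for token, _ in ring]
    index = bisect_left(tokens, token_for(partition_key)) % len(ring)
    return ring[index][1]


if __name__ == "__main__":
    ring = build_ring(["node1", "node2", "node3"])  # hypothetical node names
    for key in ["usa_1994", "skorea_2002", "vn_2015"]:
        print(key, "->", owner_of(key, ring))
```

Because the token depends only on the partition key, both usa_1994 rows from the slide land on the same node, which is why a query that names the partition can be routed to the node that holds it.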
Replication factor: the total number of replicas across the cluster. A replication factor of 1 means that there is only one copy of each row, on one node
Replication strategy: a replication strategy determines which nodes to place replicas on
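To make the two terms concrete, here is a small Python sketch of one simple placement rule: the first replica goes to the node that owns the partition, and the remaining replicas go to the next nodes clockwise on the ring. This mirrors how a simple, single-datacenter strategy behaves; the function name replicas_for and the node names are hypothetical, and rack- or datacenter-aware strategies are deliberately left out.

```python
def replicas_for(owner_index, ring_nodes, replication_factor):
    """Pick replica nodes by walking the ring clockwise from the owner.

    owner_index        -- position in ring_nodes of the node that owns the partition
    ring_nodes         -- node names in ring (token) order
    replication_factor -- total number of copies, including the owner's copy
    """
    if not 1 <= replication_factor <= len(ring_nodes):
        raise ValueError("replication factor must be between 1 and the node count")
    return [
        ring_nodes[(owner_index + offset) % len(ring_nodes)]
        for offset in range(replication_factor)
    ]


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3"]  # hypothetical node names
    # Replication factor 1: only the owner holds the row.
    print(replicas_for(1, nodes, 1))
    # Replication factor 3 on 3 nodes: every node holds a copy, as on slide 8.
    print(replicas_for(1, nodes, 3))
```

With a replication factor of 3 on a three-node cluster, every node ends up holding either the primary copy or a replica of each partition, which matches the (A), (A’) layout on slide 8.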
Summary:
Peer-to-peer → High Availability
Data Distribution → Scalability
Replication → Fault Tolerance
Prepare
Start the guest OS: vagrant up
Launch 5 terminals and SSH into the guest OS in each of them. From now on, the commands below are executed in these remote SSH terminals
Launch the Cassandra seed node. Once it has fully started, launch the remaining Cassandra nodes
Explain the setup.cql file
Set up the schema and seed data: $ ~/apps/cassandra/bin/cqlsh -f ~/provision/setup.cql
High Availability
Insert a new row via node2 (127.0.0.2): $ ~/apps/cassandra/bin/cqlsh -e "USE techcon2015; INSERT INTO participant(department, employee_id, name) VALUES ('Delivery', 100, 'Nguyen Van An');" 127.0.0.2
See the new row from the seed node (127.0.0.4): $ ~/apps/cassandra/bin/cqlsh -e "USE techcon2015; SELECT * FROM participant;" 127.0.0.4
The data is available from any node
Fault Tolerance
Show the node status: $ ~/apps/cassandra/bin/nodetool status (explain the status and state columns)
Shut down node2 (127.0.0.2)
Show the node status again and point out the Down status
Show that the data is still available: $ ~/apps/cassandra/bin/cqlsh -e "USE techcon2015; SELECT * FROM participant;" 127.0.0.x (127.0.0.x is the IP of a node that is still alive)
Insert a new row via the seed node (127.0.0.4): $ ~/apps/cassandra/bin/cqlsh -e "USE techcon2015; INSERT INTO participant(department, employee_id, name) VALUES ('Delivery', 200, 'Nguyen Van Chau');" 127.0.0.4
See the new row from another live node: $ ~/apps/cassandra/bin/cqlsh -e "USE techcon2015; SELECT * FROM participant;" 127.0.0.x (127.0.0.x is the IP of a node that is still alive)
No downtime when a node is disconnected
Scalability
Show the node status: $ ~/apps/cassandra/bin/nodetool status
Remove node 127.0.0.2: $ ~/apps/cassandra/bin/nodetool -h 127.0.0.1 -p 7182 decommission (7182 is node2's JMX port)
Show the node status again (point out that the node has been removed): $ ~/apps/cassandra/bin/nodetool status
Shut down node 127.0.0.2
Restart node 127.0.0.2 so that it joins the cluster again
The cluster can be shrunk (or grown) with little effort
Search for the term “nosql data modeling” for more explanation
No transactions (commit, rollback)
(from experience with Cassandra and MongoDB)