Cassandra

by developer for developers
Lucian Neghina
Big Data & Cloud Computing

Skills
Elastic Scalability
High Availability
Tunable Consistency
High Performance

Distributed Architecture
Peer to Peer
Node & Ring
[Diagram: client connected to a coordinator node in the ring]

Data Distribution & Replication
[Diagram: client, coordinator node, and a ring with Replication Factor: 3]
Partitioner: partitions data across the cluster
Replication Factor (RF): how many nodes each piece of data is replicated to
- RF <= number of nodes
Replication Strategy: determines which nodes hold the replicas
- SimpleStrategy
- NetworkTopologyStrategy
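The placement of replicas on the ring can be sketched in a few lines of Python. This is a minimal illustration of SimpleStrategy-style placement, not Cassandra's implementation: the `Ring` class and `token` helper are hypothetical names, MD5 stands in for the real partitioner hash, and node tokens are derived from node names only to keep the example deterministic.

```python
import hashlib
from bisect import bisect_left


def token(key: str) -> int:
    """Hash a key to a position on the ring (MD5 is a stand-in hash, an assumption)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class Ring:
    """Toy token ring: the first replica is the node owning the key's token,
    the remaining replicas are the next nodes clockwise."""

    def __init__(self, nodes, replication_factor):
        if replication_factor > len(nodes):
            raise ValueError("RF must be <= number of nodes")
        self.rf = replication_factor
        # Sort nodes by token so walking the list is walking the ring.
        self._ring = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, partition_key: str):
        # First node whose token is >= the key's token, wrapping around the ring.
        start = bisect_left(self._tokens, token(partition_key)) % len(self._ring)
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(self.rf)]


ring = Ring(["node1", "node2", "node3", "node4", "node5"], replication_factor=3)
owners = ring.replicas("TX")
```

With five nodes and RF = 3, `owners` holds three distinct node names, and the same partition key always maps to the same three nodes.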

Tunable Consistency
ALL: highest consistency and lowest availability
EACH_QUORUM: maintains consistency at the same level in each data center
QUORUM: strong consistency if you can tolerate some level of failure
LOCAL_QUORUM: used in multiple data center clusters; maintains consistency locally
ONE, TWO, THREE: requires a response from the one, two, or three replicas closest to the coordinator
ANY: provides low latency and a guarantee that a write never fails
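As a rough illustration of the levels above, the sketch below maps a consistency level to the number of replicas that must respond for a given replication factor. The function name `required_replicas` is hypothetical, and the data-center-aware levels (EACH_QUORUM, LOCAL_QUORUM) and ANY are left out because they depend on per-data-center settings and hinted writes.

```python
def required_replicas(level: str, replication_factor: int) -> int:
    """Return how many replicas must acknowledge a request (single data center only)."""
    fixed = {"ONE": 1, "TWO": 2, "THREE": 3}
    if level in fixed:
        needed = fixed[level]
    elif level == "QUORUM":
        # A quorum is a strict majority of the replicas.
        needed = replication_factor // 2 + 1
    elif level == "ALL":
        needed = replication_factor
    else:
        raise ValueError(f"level not covered by this sketch: {level}")
    if needed > replication_factor:
        raise ValueError(f"{level} needs {needed} replicas, but RF is {replication_factor}")
    return needed


quorum_rf3 = required_replicas("QUORUM", 3)
quorum_rf5 = required_replicas("QUORUM", 5)
all_rf3 = required_replicas("ALL", 3)
```

With RF = 3 a quorum is 2 replicas, so one replica can be down without failing QUORUM requests.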

Consistency
Strong Consistency
CL_W + CL_R > Replication Factor
- reads always reflect the most recent write
Eventual Consistency
CL_W + CL_R <= Replication Factor
Example (Replication Factor = 3): if fast writes are required but strong
consistency is still desired, the write consistency level can be lowered to 1,
but reads then have to verify a matching value on all 3 replicas (1 + 3 > 3).
Writes will be fast, but reads will be slower.
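The rule above is simple arithmetic and can be checked directly. A minimal sketch, where `is_strongly_consistent` is a hypothetical helper and the arguments are the number of replicas that must acknowledge a write (CL_W) and a read (CL_R):

```python
def is_strongly_consistent(cl_w: int, cl_r: int, replication_factor: int) -> bool:
    """True when the write and read replica sets must overlap in at least one replica."""
    return cl_w + cl_r > replication_factor


# Replication Factor = 3
fast_writes = is_strongly_consistent(cl_w=1, cl_r=3, replication_factor=3)   # write ONE, read ALL
quorum_both = is_strongly_consistent(cl_w=2, cl_r=2, replication_factor=3)   # QUORUM / QUORUM
one_one = is_strongly_consistent(cl_w=1, cl_r=1, replication_factor=3)       # ONE / ONE
```

Here `fast_writes` and `quorum_both` are `True`, while `one_one` is `False`, which corresponds to eventual consistency.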

Schema
Data Type
Data structures:
- Keyspace: Create / Alter / Drop / Use / Describe
- Table (Column Family): Create / Alter / Drop / Truncate / Describe
- Column: Name, Value, Timestamp, TTL (optional)
Operations: insert, update, delete
Hierarchy: Keyspace > Table > Partition > Row
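A column as listed above (name, value, timestamp, optional TTL) can be modelled as a small record. The sketch below is an assumption-laden illustration: `Cell`, `is_live`, and `reconcile` are hypothetical names, timestamps are taken to be microseconds and TTLs seconds, and conflict resolution is simplified to "the newest timestamp wins".

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Cell:
    name: str
    value: Any
    timestamp: int               # write time in microseconds (assumed unit)
    ttl: Optional[int] = None    # time to live in seconds, None means no expiry

    def is_live(self, now_micros: int) -> bool:
        """A cell without a TTL never expires; otherwise it lives for ttl seconds."""
        if self.ttl is None:
            return True
        return now_micros < self.timestamp + self.ttl * 1_000_000


def reconcile(a: Cell, b: Cell) -> Cell:
    """Pick the version with the newer timestamp (ties simplified to keep `a`)."""
    return a if a.timestamp >= b.timestamp else b


old = Cell("name", "John", timestamp=1_000_000)
new = Cell("name", "Johnny", timestamp=2_000_000, ttl=60)
winner = reconcile(old, new)
```

In this example `winner` is the newer cell, and it stops being live 60 seconds after its write timestamp.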

Partition Key
Partition Key = State
[Diagram: each partition is placed on a node of the ring]

Partition TX
State  UserId  Name
TX     1       John
TX     4       Maria
TX     5       Daniel

Partition NY
State  UserId  Name
NY     3       David
NY     6       Roxana
NY     7       Robert

Clustering Key
Data ordering inside a partition

Primary key((State),User_id)
State  UserId  Name
TX     1       John
TX     2       Igor
TX     3       Maria
TX     4       Daniel
TX     5       Jonathan
TX     6       Ruby

Primary key((State),User_id)
State  UserId  Name      City
TX     1       John      Houston
TX     2       Igor      Dallas
TX     3       Maria     Austin
TX     4       Daniel    Austin
TX     5       Jonathan  Dallas
TX     6       Ruby      Austin

Primary key((State),City,Name,User_id)
State  City     Name      UserId
TX     Austin   Daniel    4
TX     Austin   Maria     3
TX     Austin   Ruby      6
TX     Dallas   Igor      2
TX     Dallas   Jonathan  5
TX     Houston  John      1
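The effect of the clustering key on row order can be reproduced with an ordinary sort. This sketch uses the TX rows from the slide and sorts them by the clustering columns (City, Name, User_id); the `User` tuple and `clustering_key` function are hypothetical names used only for illustration.

```python
from collections import namedtuple

User = namedtuple("User", ["state", "user_id", "name", "city"])

# Rows of the TX partition, in insertion (UserId) order.
rows = [
    User("TX", 1, "John", "Houston"),
    User("TX", 2, "Igor", "Dallas"),
    User("TX", 3, "Maria", "Austin"),
    User("TX", 4, "Daniel", "Austin"),
    User("TX", 5, "Jonathan", "Dallas"),
    User("TX", 6, "Ruby", "Austin"),
]


def clustering_key(row: User):
    """Order inside the partition for Primary key((State),City,Name,User_id)."""
    return (row.city, row.name, row.user_id)


ordered = sorted(rows, key=clustering_key)
names_in_order = [r.name for r in ordered]
# names_in_order == ["Daniel", "Maria", "Ruby", "Igor", "Jonathan", "John"]
```

The resulting order matches the third table: rows are grouped by city first, then sorted by name, then by user id.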

Primary key
Primary key((State),City,Name,User_id)
- Partition key: (State)
- Clustering key: City, Name, User_id
Simple primary key
Primary key(Column) - partition key
Compound primary key
Primary key(Column1, Column2) - (partition key, clustering key)
Composite partition key
Primary key((Column1, Column2), Column3) - ((partition key), clustering key)
Data is split into partitions, each identified by its partition key.
The clustering key orders the data inside a partition; rows are stored sorted on disk.
Primary key = partition key column(s) + clustering key column(s).
Clustering columns support both equality (=) and range (>, <) queries.
The clustering key is optional: a table with only a partition key has single-row partitions.
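To make the three primary key forms concrete, the sketch below splits a primary key definition into its partition key and clustering key columns. `split_primary_key` is a hypothetical helper written for this illustration; it handles only the simple forms shown on the slide and is not a CQL parser.

```python
def split_primary_key(spec: str):
    """Split '(a, b)' or '((a, b), c)' into (partition_columns, clustering_columns)."""
    spec = spec.strip()
    if not (spec.startswith("(") and spec.endswith(")")):
        raise ValueError("expected a parenthesised column list")
    inner = spec[1:-1].strip()
    if inner.startswith("("):
        # Inner parentheses mark the partition key; the rest is the clustering key.
        close = inner.index(")")
        partition = [c.strip() for c in inner[1:close].split(",")]
        clustering = [c.strip() for c in inner[close + 1:].split(",") if c.strip()]
    else:
        # Without inner parentheses, the first column is the partition key.
        columns = [c.strip() for c in inner.split(",")]
        partition, clustering = columns[:1], columns[1:]
    return partition, clustering


simple = split_primary_key("(Column)")
compound = split_primary_key("(Column1, Column2)")
composite = split_primary_key("((Column1, Column2), Column3)")
users = split_primary_key("((State),City,Name,User_id)")
```

For the users example this yields `["State"]` as the partition key and `["City", "Name", "User_id"]` as the clustering key.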

Indexes
Queries by partition key
- Queries by partition key are local, all good
Queries by other columns
- Allow Filtering: scans all the data
- Secondary Indexes (local indexing): index the data held by a given node
- Materialized Views (distributed indexing): server-side data denormalization
- Design tables per query: client-side data denormalization

Materialized Views (3.0 and later)

Fact-Based Model
● Alice is a user
● Alice is 28 y.o.
● Alice wears a wristband
● A wristband is a sensor
● A wristband records a heart rate
● A heart rate is a measurement

Workflow & Queries
Q1: Find a user with a known username
Q2: Find followers of a user
Q3: Find sensors owned by a user
Q4: Find measurements for a sensor in a date range
Q5: Find daily summary of hourly aggregates
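A query such as Q4 maps naturally onto a table partitioned by sensor and clustered by date. The sketch below imitates that layout in memory, under assumptions of my own: `MeasurementsBySensor` is a hypothetical name, the sensor id plays the role of the partition key, and the measurement day plays the role of the clustering key, so a range query becomes a binary search inside one partition.

```python
from bisect import bisect_left, bisect_right, insort
from collections import defaultdict
from datetime import date


class MeasurementsBySensor:
    """Toy 'table per query': partition key = sensor_id, clustering key = day."""

    def __init__(self):
        # sensor_id -> list of (day, value) kept sorted by day
        self._partitions = defaultdict(list)

    def insert(self, sensor_id: str, day: date, value: float) -> None:
        insort(self._partitions[sensor_id], (day, value))

    def select_range(self, sensor_id: str, start: date, end: date):
        """Return rows with start <= day <= end from a single partition."""
        rows = self._partitions.get(sensor_id, [])
        lo = bisect_left(rows, (start,))
        hi = bisect_right(rows, (end, float("inf")))
        return rows[lo:hi]


table = MeasurementsBySensor()
table.insert("wristband-1", date(2024, 1, 3), 72)
table.insert("wristband-1", date(2024, 1, 1), 65)
table.insert("wristband-1", date(2024, 1, 2), 70)
table.insert("wristband-2", date(2024, 1, 2), 90)

result = table.select_range("wristband-1", date(2024, 1, 2), date(2024, 1, 3))
```

Only the `wristband-1` partition is touched, and because its rows are already sorted by day the range is returned in order without scanning other sensors.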

Limitations
No arbitrary WHERE clauses
- in CQL your predicate can only contain columns specified in your primary key
No JOIN construct
- there is no way to join data across tables (column families)
No native GROUP BY
- you cannot group identical data
No arbitrary ORDER BY clauses
- ORDER BY can only be applied to a clustering column
No column filtering
- table columns cannot be filtered on without creating an index

Operations
Data definition:
● Keyspace: Create, Use, Alter, Drop
● Table: Create, Alter, Drop, Truncate
● Index: Create, Drop
● Type: Create, Alter, Drop
● Trigger, Function, Aggregate: Create, Drop
Data manipulation:
● Insert - can update an existing row
● Update - can create a new row
● Delete
● Batch
● Select
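The notes on Insert and Update describe upsert behaviour: both operations write the given columns for a primary key, whether or not the row already exists. A minimal in-memory sketch, where `UpsertTable` and its methods are hypothetical names and a plain dictionary stands in for storage:

```python
class UpsertTable:
    """Toy table where insert and update share the same write path."""

    def __init__(self):
        self.rows = {}  # primary key -> {column name: value}

    def _upsert(self, key, columns):
        # Create the row if it is missing, then overwrite only the given columns.
        self.rows.setdefault(key, {}).update(columns)

    def insert(self, key, **columns):
        self._upsert(key, columns)

    def update(self, key, **columns):
        self._upsert(key, columns)

    def delete(self, key):
        self.rows.pop(key, None)


users = UpsertTable()
users.update(("TX", 1), name="John")       # update creates a new row
users.insert(("TX", 1), city="Houston")    # insert updates the existing row
row = users.rows[("TX", 1)]
# row == {"name": "John", "city": "Houston"}
```

Neither call fails or duplicates the row; the second write simply adds the `city` column to the row created by the first.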

Cassandra Cluster
https://github.com/eSolutionsGrup/cassandra-spark-cluster
$ cd cassandra-spark-cluster
$ docker-compose up -d
