by developer for developers
Lucian Neghina
Big Data & Cloud Computing
Skills
Elastic Scalability
High Availability
Tunable Consistency
High Performance
Cassandra Concepts
Distributed Architecture
Peer to Peer
Node & Ring
[Figure: ring of nodes; labels: Client, coordinator]
Data Distribution & Replication
[Figure: ring of nodes; labels: Client, coordinator, Replication Factor: 3]
Partitioner: determines how data is partitioned across the cluster
Replication Factor (RF): the number of nodes each piece of data is replicated to
- RF <= number of nodes
Replication Strategy: determines which nodes hold the replicas
- SimpleStrategy
- NetworkTopologyStrategy
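The placement idea above can be sketched in a few lines of Python. This is a toy model, not Cassandra's implementation: the node names, the token values, the 100-slot ring and the MD5-based `token` function are all assumptions made for illustration. Replicas are picked the way SimpleStrategy is usually described: the node that owns the token, then the next nodes clockwise on the ring.

```python
import bisect
import hashlib


def token(key: str, ring_size: int = 100) -> int:
    """Toy partitioner: hash a partition key onto a ring of `ring_size` tokens."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % ring_size


def replicas(key_token: int, node_tokens: dict, rf: int) -> list:
    """Return the node owning `key_token` plus the next rf - 1 nodes clockwise."""
    if rf > len(node_tokens):
        raise ValueError("RF must be <= number of nodes")
    ring = sorted(node_tokens.items(), key=lambda item: item[1])
    tokens = [node_token for _, node_token in ring]
    # A node owns every token up to and including its own; wrap around at the end.
    start = bisect.bisect_left(tokens, key_token) % len(ring)
    return [ring[(start + i) % len(ring)][0] for i in range(rf)]


# Hypothetical four-node ring.
nodes = {"node-a": 24, "node-b": 58, "node-c": 83, "node-d": 99}
print(replicas(30, nodes, rf=3))            # token 30 falls in node-b's range
print(replicas(token("TX"), nodes, rf=3))   # placement for partition key 'TX'
```

With RF = 3 every partition lives on three consecutive nodes of the ring, which is why RF cannot exceed the number of nodes.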
Tunable Consistency
ALL: highest consistency and lowest availability
EACH_QUORUM: a quorum of replicas must respond in each data center, keeping consistency at the same level in every data center
QUORUM: strong consistency if you can tolerate some level of failure
LOCAL_QUORUM: used in multi-data-center clusters to maintain consistency within the local data center
ONE, TWO, THREE: one, two, or three of the replicas closest to the coordinator must respond
ANY (writes only): lowest latency; the write succeeds once any node has accepted it, including as a hinted handoff when no replica is reachable
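To make the levels concrete, here is a small Python sketch of how many replicas must respond for a given level in a single data center. The function names are hypothetical and the multi-data-center levels (EACH_QUORUM, LOCAL_QUORUM) and ANY are deliberately left out; the quorum size is a strict majority of the replicas, floor(RF / 2) + 1.

```python
def quorum(replication_factor: int) -> int:
    """A quorum is a strict majority of the replicas."""
    return replication_factor // 2 + 1


def required_replicas(level: str, rf: int) -> int:
    """Replicas that must respond for `level` (single data center only)."""
    fixed = {"ONE": 1, "TWO": 2, "THREE": 3}
    if level in fixed:
        return fixed[level]
    if level == "QUORUM":
        return quorum(rf)
    if level == "ALL":
        return rf
    raise ValueError(f"level not covered by this sketch: {level}")


for level in ("ONE", "QUORUM", "ALL"):
    needed = required_replicas(level, rf=3)
    print(level, "needs", needed, "replica(s); tolerates", 3 - needed, "down")
```

The trade-off from the slide shows up directly: ALL tolerates no unavailable replica, while QUORUM with RF = 3 still works with one replica down.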
Consistency
Strong Consistency
CL_W + CL_R > Replication Factor
reads always reflect the most recent write
Eventual Consistency
CL_W + CL_R <= Replication Factor
If fast writes are required but strong consistency is still desired,
the write consistency level can be lowered to ONE. With RF = 3,
reads must then use ALL and verify a matching value on all 3 replicas.
Writes will be fast, but reads will be slower.
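The rule CL_W + CL_R > Replication Factor can be checked mechanically. The sketch below (the function name is an assumption) treats consistency levels as replica counts and tests a few combinations for RF = 3, including the fast-write case described above.

```python
def is_strongly_consistent(cl_write: int, cl_read: int, rf: int) -> bool:
    """Read and write replica sets must overlap in at least one replica."""
    return cl_write + cl_read > rf


RF = 3
for cl_w, cl_r in [(1, 3), (2, 2), (3, 1), (1, 1), (1, 2)]:
    kind = "strong" if is_strongly_consistent(cl_w, cl_r, RF) else "eventual"
    print(f"CL_W={cl_w} CL_R={cl_r} RF={RF}: {kind}")
```

Writing at ONE and reading at ALL (1 + 3 > 3) is strong, as is QUORUM on both sides (2 + 2 > 3); ONE on both sides (1 + 1 <= 3) is only eventually consistent.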
Data Modeling
Schema
Data Type
Data structures:
- Keyspace
Create / Alter / Drop / Use / Describe
- Table / Column Families
Create / Alter / Drop / Truncate / Describe
- Columns
Name, Value, Timestamp, TTL (optional)
Operations: insert, update, delete
Keyspace > Table > Partition > Row
Data Structures
Partition Key
Partition Key = State
[Figure: ring diagram with node labels 24, 58, 83]

State  UserId  Name
TX     1       John
TX     4       Maria
TX     5       Daniel

State  UserId  Name
NY     3       David
NY     6       Roxana
NY     7       Robert
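As a rough illustration of what "Partition Key = State" means, the following Python sketch groups the example rows by their partition key. The in-memory dictionary stands in for the cluster and is an assumption of the sketch; in a real cluster each partition would be placed on nodes by the partitioner.

```python
from collections import defaultdict

# Rows from the example, in arbitrary arrival order: (State, UserId, Name).
rows = [
    ("TX", 1, "John"),
    ("NY", 3, "David"),
    ("TX", 4, "Maria"),
    ("TX", 5, "Daniel"),
    ("NY", 6, "Roxana"),
    ("NY", 7, "Robert"),
]


def partition_rows(rows):
    """Group rows by partition key (State); each group is one partition."""
    partitions = defaultdict(list)
    for state, user_id, name in rows:
        partitions[state].append((user_id, name))
    return dict(partitions)


partitions = partition_rows(rows)
for state, members in partitions.items():
    print(state, members)
```

All rows sharing a State end up in the same partition, so a query for one state touches a single partition.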
Clustering Key
Data ordering inside a partition

Primary key((State),User_id)
State  UserId  Name
TX     1       John
TX     2       Igor
TX     3       Maria
TX     4       Daniel
TX     5       Jonathan
TX     6       Ruby

Primary key((State),User_id)
State  UserId  Name      City
TX     1       John      Houston
TX     2       Igor      Dallas
TX     3       Maria     Austin
TX     4       Daniel    Austin
TX     5       Jonathan  Dallas
TX     6       Ruby      Austin

Primary key((State),City,Name,User_id)
State  City     Name      UserId
TX     Austin   Daniel    4
TX     Austin   Maria     3
TX     Austin   Ruby      6
TX     Dallas   Igor      2
TX     Dallas   Jonathan  5
TX     Houston  John      1
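The effect of the clustering key on row order can be reproduced with a plain sort. This Python sketch is only an analogy (the `cluster` helper and the dictionary layout are assumptions): it orders the rows of the TX partition by the clustering columns of each primary key shown above.

```python
# Rows of the TX partition, in arbitrary order.
users = [
    {"state": "TX", "user_id": 3, "name": "Maria", "city": "Austin"},
    {"state": "TX", "user_id": 1, "name": "John", "city": "Houston"},
    {"state": "TX", "user_id": 6, "name": "Ruby", "city": "Austin"},
    {"state": "TX", "user_id": 2, "name": "Igor", "city": "Dallas"},
    {"state": "TX", "user_id": 5, "name": "Jonathan", "city": "Dallas"},
    {"state": "TX", "user_id": 4, "name": "Daniel", "city": "Austin"},
]


def cluster(rows, clustering_columns):
    """Order the rows of one partition by its clustering columns, left to right."""
    return sorted(rows, key=lambda row: tuple(row[c] for c in clustering_columns))


# Primary key((State), User_id): ordered by user id.
by_user = cluster(users, ["user_id"])
# Primary key((State), City, Name, User_id): ordered by city, then name, then id.
by_city = cluster(users, ["city", "name", "user_id"])

print([row["user_id"] for row in by_user])
print([(row["city"], row["name"]) for row in by_city])
```

The second ordering matches the third table: Austin rows first (Daniel, Maria, Ruby), then Dallas, then Houston.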
Primary key
Primary key((State),City,Name,User_id)
- Partition key: State
- Clustering key: City, Name, User_id
Simple Primary key
Primary key(Column) - partition key
Compound Primary key
Primary key(Column1, Column2) - (partition key, clustering key)
Compound Partitioning key
Primary key((Column1, Column2), Column3) - ((partition key), clustering key)
Data is split into partitions, identified by the partition key.
Clustering key = orders data inside a partition; rows are stored sorted on disk.
Partition key(s) + clustering key(s) = primary key
Clustering columns support equality (=) and range (>, <) queries.
The clustering key is optional: a table with only a partition key has single-row partitions.
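Because rows inside a partition are kept sorted by the clustering key, a range predicate becomes a slice of contiguous rows. The Python sketch below illustrates this with binary search over one partition; the `range_query` helper is hypothetical and the partition is modeled as a sorted list.

```python
import bisect

# One partition (State = 'TX'), kept sorted by the clustering key user_id.
partition = [
    (1, "John"),
    (2, "Igor"),
    (3, "Maria"),
    (4, "Daniel"),
    (5, "Jonathan"),
    (6, "Ruby"),
]


def range_query(partition, low, high):
    """Rows with low < user_id < high, found without scanning the partition."""
    keys = [user_id for user_id, _ in partition]
    start = bisect.bisect_right(keys, low)   # first key greater than low
    end = bisect.bisect_left(keys, high)     # first key not less than high
    return partition[start:end]


print(range_query(partition, 2, 5))
```

The same slice on a non-clustering column would need a full scan, which is why range queries are limited to clustering columns.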
Static Tables
Dynamic Tables
Indexes
Queries by:
- Partition key: queries by partition key are local, all good
- Other columns:
  - Allow Filtering: scans all the data
  - Secondary Indexes (local indexing): index the data held by a given node
  - Materialized Views (distributed indexing): server-side data denormalization
  - Design tables per query: client-side data denormalization
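"Local indexing" means each node indexes only the rows it stores, so a query on an indexed column may have to consult many nodes. The Python sketch below is a deliberately simplified model under that assumption: the `Node` class, its methods and the sample data are hypothetical, and index maintenance on updates is ignored.

```python
class Node:
    """Toy node holding its own rows and a local index on one column."""

    def __init__(self, name):
        self.name = name
        self.rows = {}    # partition key -> row
        self.index = {}   # indexed value -> partition keys stored on THIS node

    def write(self, key, row, indexed_column):
        self.rows[key] = row
        self.index.setdefault(row[indexed_column], set()).add(key)

    def lookup(self, value):
        return [self.rows[key] for key in sorted(self.index.get(value, ()))]


def query_by_index(nodes, value):
    """The coordinator asks every node, since each index covers local data only."""
    results = []
    for node in nodes:
        results.extend(node.lookup(value))
    return results


node_a, node_b = Node("node-a"), Node("node-b")
node_a.write(1, {"user_id": 1, "city": "Houston"}, "city")
node_a.write(3, {"user_id": 3, "city": "Austin"}, "city")
node_b.write(4, {"user_id": 4, "city": "Austin"}, "city")

print(query_by_index([node_a, node_b], "Austin"))
```

A query by partition key would go to a single replica set; here the Austin rows come from both nodes.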
Secondary Indexes
Materialized Views (3.0 and later)
Conceptual Data Model
Fact-Based Model
● Alice is a user
● Alice is 28 y.o.
● Alice wears a wristband
● A wristband is a sensor
● A wristband records a heart rate
● A heart rate is a measurement
Entity-Relationship Model
Workflow & Queries
Q1: Find a user with a known username
Q2: Find followers of a user
Q3: Find sensors owned by a user
Q4: Find measurements for a sensor in a date range
Q5: Find daily summary of hourly aggregates
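Designing one table per query means the application writes the same facts into several query-specific tables. The following Python sketch models Q1 and Q3 with two dictionaries; the table names `users_by_username` and `sensors_by_user`, the helper functions and the sensor id are assumptions for illustration only.

```python
# One "table" per query, each keyed by what that query knows up front.
users_by_username = {}   # Q1: partition key = username
sensors_by_user = {}     # Q3: partition key = username, clustering key = sensor_id


def add_user(username, age):
    users_by_username[username] = {"username": username, "age": age}


def add_sensor(username, sensor_id, kind):
    # Client-side denormalization: the owner is repeated in the second table.
    sensors_by_user.setdefault(username, {})[sensor_id] = {
        "sensor_id": sensor_id,
        "kind": kind,
    }


add_user("alice", 28)
add_sensor("alice", "s-1", "wristband")

print(users_by_username["alice"])                 # Q1
print(sorted(sensors_by_user["alice"].values(),   # Q3
             key=lambda sensor: sensor["sensor_id"]))
```

Each query is answered from a single partition of its own table, at the cost of extra writes.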
Logical Data Model
CQL
Cassandra Query Language
Limitations
No arbitrary WHERE clauses
in CQL your predicate can only contain columns specified in your primary key
No JOIN construct
there is no way to join data across column families (tables)
No native GROUP BY
you cannot group identical data
No arbitrary ORDER BY clauses
ORDER BY can only be applied to clustering columns
No column filtering
table columns cannot be filtered on without creating an index
Operations
Data definition:
● Keyspace
Create, Use, Alter, Drop
● Table
Create, Alter, Drop, Truncate
● Index
Create, Drop
● Type
Create, Alter, Drop
● Trigger, Function, Aggregate
Create, Drop
Data manipulation:
● Insert - can update existing row
● Update - can create new row
● Delete
● Batch
● Select
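"Insert can update an existing row" and "Update can create a new row" both follow from writes being upserts resolved by timestamp. The Python sketch below is a simplified model of that behavior, not Cassandra's storage engine: the `write` function, the in-memory `table` and the tie-breaking rule are assumptions.

```python
table = {}   # primary key -> {column: (value, write_timestamp)}


def write(primary_key, columns, timestamp):
    """Upsert: create the row if missing; per column, the newest timestamp wins."""
    row = table.setdefault(primary_key, {})
    for column, value in columns.items():
        current = row.get(column)
        # Simplification: on equal timestamps the later call wins.
        if current is None or timestamp >= current[1]:
            row[column] = (value, timestamp)


# In this model INSERT and UPDATE are the same operation.
insert = write
update = write

insert("alice", {"age": 28}, timestamp=1)   # creates the row
update("bob", {"age": 30}, timestamp=2)     # UPDATE creates a new row
insert("alice", {"age": 29}, timestamp=3)   # INSERT overwrites the existing row
update("alice", {"age": 1}, timestamp=2)    # older write, ignored

print(table)
```

Neither operation checks whether the row exists first, which keeps writes fast but means an accidental INSERT silently overwrites data.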
Cassandra Cluster
https://github.com/eSolutionsGrup/cassandra-spark-cluster
$ git clone https://github.com/eSolutionsGrup/cassandra-spark-cluster
$ cd cassandra-spark-cluster
$ docker-compose up -d
