OVERVIEW AND REAL WORLD 
APPLICATIONS 
Cassandra 
Jersey Shore Tech Meetup 
Nov 13, 2014
You Are Not Here… 
*** http://njhalloffame.org/ 
2
Agenda 
3 
 Some Basic Concepts/Overview 
 New Developments In Cassandra 
 Basic Data Modeling Concepts 
 Materialized Views 
 Secondary Indexes 
 Counters 
 Time Series Data 
 Expiring Data
Cassandra High Level 
4 
Cassandra's architecture is based on the combination 
of two technologies: 
 Google BigTable – Data Model 
 Amazon Dynamo – Distributed Architecture 
BTW – these mean the same thing -> 
Cassandra = C*
Architecture Basics & Terminology 
5 
 Nodes are single instances of C* 
 Cluster is a group of nodes 
 Data is organized by keys (tokens) which are 
distributed across the cluster 
 Replication Factor (rf) determines how many copies 
of each key's data are kept 
 Data Center Aware – works well in multi-DC/EC2 
etc. 
 Consistency Level – powerful feature to tune 
consistency vs. speed vs. availability.
C* Ring 
6
More Architecture 
7 
 Information on who has what data and who is 
available is transferred using gossip. 
 No single point of failure (SPOF); every node can 
service requests. 
 Handles Replication and Downed Nodes (within 
reason)
CAP Theorem 
8 
 Distributed Systems Law: 
 Consistency 
 Availability 
 Partition Tolerance 
(you can only really have two in a distributed system) 
 Cassandra is AP with Eventual Consistency
Consistency 
9 
 Cassandra uses the concept of Tunable Consistency, 
which makes it very powerful and flexible for system 
needs.
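As a rough sketch of what tuning looks like in practice, cqlsh lets you set the consistency level for the statements that follow. The `users` table and the email value here are assumptions for illustration. With rf = 3, QUORUM means 2 replicas, so a QUORUM write followed by a QUORUM read overlaps on at least one replica (2 + 2 > 3).

```cql
-- CONSISTENCY is a cqlsh session command, not a CQL statement.
-- Assumes a keyspace with replication factor 3 is already in use
-- and that a users table keyed by emailaddress exists.

-- Favor consistency: 2 of 3 replicas must answer.
CONSISTENCY QUORUM;
SELECT firstname, lastname FROM users
 WHERE emailaddress = 'tpeters@example.com';

-- Favor speed and availability: any single replica may answer,
-- so a just-written value may not be visible yet.
CONSISTENCY ONE;
SELECT firstname, lastname FROM users
 WHERE emailaddress = 'tpeters@example.com';
```

Client drivers generally expose the same setting per statement or per session.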
C* Persistence Model 
10
Read Path 
11
Write Path 
12
Data Model Architecture 
13 
 Keyspace – container of column families (tables). 
Defines RF among others. 
 Table – column family. Contains definition of 
schema. 
 Row – a “record” identified by a key 
 Column - a key and a value
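A minimal sketch of how these pieces nest, assuming a hypothetical keyspace named `demo` and a single-datacenter test cluster. The datacenter names in the commented alternative are placeholders and must match whatever your cluster reports.

```cql
-- Keyspace: the container, and where the replication factor lives.
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- Multi-datacenter alternative (dc1 and dc2 are hypothetical names):
-- CREATE KEYSPACE IF NOT EXISTS demo
--   WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 2};

USE demo;

-- Table (column family): holds the schema definition.
CREATE TABLE IF NOT EXISTS people (
    person_id uuid PRIMARY KEY,   -- the key that identifies a row
    firstname text,               -- each column is a name and a value
    lastname  text
);
```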
14
Deletions 
15 
 Distributed systems present a unique problem for 
deletes. If data were physically removed while a node 
was down, that node would miss the delete and 
re-create the record when it came back online. So… 
 Tombstone – the data is replaced with a special 
marker called a tombstone, which replicates like any 
other write and so works within the distributed 
architecture
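A small sketch of a delete from the client's point of view, assuming a hypothetical `users` table keyed by `emailaddress`. The delete is just another write; the tombstone is kept for the table's `gc_grace_seconds` (864000 seconds, or 10 days, by default) so that nodes that were down have time to learn about it.

```cql
-- Writes a tombstone rather than removing data in place.
DELETE FROM users WHERE emailaddress = 'jsmith@example.com';

-- Tombstones become eligible for removal at compaction only after
-- gc_grace_seconds. The value shown is the default, set explicitly
-- here for illustration.
ALTER TABLE users WITH gc_grace_seconds = 864000;
```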
Keys 
16 
 Primary Key 
 Partition Key – identifies a row 
 Cluster Key – sorting within a row 
 Using CQL these are defined together as a compound 
(composite) key 
 Compound keys are how you implement “wide 
rows”, the COOL FEATURE!
Single Primary Key 
17 
create table users ( 
user_id UUID PRIMARY KEY, 
firstname text, 
lastname text, 
emailaddress text 
); 
** Cassandra Data Types 
http://www.datastax.com/documentation/cql/3.0/cql/cql_reference/cql_data_types_c.html
Compound Key 
18 
create table users ( 
emailaddress text, 
department text, 
firstname text, 
lastname text, 
PRIMARY KEY (emailaddress, department) 
); 
 Partition Key plus Cluster Key 
 emailaddress is partition key 
 department is cluster key
Compound Key 
19 
create table users ( 
emailaddress text, 
department text, 
country text, 
firstname text, 
lastname text, 
PRIMARY KEY ((emailaddress, department), country) 
); 
 Partition Key plus Cluster Key 
 emailaddress & department together form the partition key 
 country is cluster key
New Rules 
20 
 Writes Are Cheap 
 Denormalize All You Need 
 Model Your Queries, Not Data (understand access 
patterns) 
 Application Worries About Joins
What’s New In 2.0 
21 
Conditional DDL 
IF EXISTS or IF NOT EXISTS 
Drop Column Support 
ALTER TABLE users DROP lastname;
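A short sketch of the conditional forms. The table name `scratch_users` is hypothetical. Each statement is a no-op instead of an error when the condition does not hold, which makes schema scripts safe to re-run.

```cql
-- Creates the table only if it is not already there.
CREATE TABLE IF NOT EXISTS scratch_users (
    emailaddress text PRIMARY KEY,
    firstname    text,
    lastname     text
);

-- Drop a column that is no longer needed.
ALTER TABLE scratch_users DROP lastname;

-- Drops the table only if it exists.
DROP TABLE IF EXISTS scratch_users;
```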
More New Stuff 
22 
 Triggers 
CREATE TRIGGER myTrigger 
ON myTable 
USING 'com.thejavaexperts.cassandra.updateevt'; 
 Lightweight Transactions (CAS) 
UPDATE users 
SET firstname = 'tim' 
WHERE emailaddress = 'tpeters@example.com' 
IF firstname = 'tom'; 
** Not like an ACID Transaction!!
CAS & Transactions 
23 
 CAS – compare-and-set. In a single atomic 
operation, it compares the value of a column in the 
database and applies a modification depending on 
the result of the comparison. 
 Consider the performance hit. CAS is (was) considered 
an anti-pattern.
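Besides the conditional UPDATE shown earlier, the other common CAS form is insert-if-absent. A sketch, assuming a hypothetical `accounts` table. Both statements return a result row with an `[applied]` column telling you whether the condition held.

```cql
CREATE TABLE IF NOT EXISTS accounts (
    emailaddress text PRIMARY KEY,
    firstname    text
);

-- Only succeeds for the first writer to claim this address.
INSERT INTO accounts (emailaddress, firstname)
VALUES ('tpeters@example.com', 'tom')
IF NOT EXISTS;

-- Only applied if the current value is still 'tom'.
UPDATE accounts
   SET firstname = 'tim'
 WHERE emailaddress = 'tpeters@example.com'
    IF firstname = 'tom';
```

Each conditional statement costs extra round trips between replicas compared with a plain write, so reserve it for cases that truly need it.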
Data Modeling… The Basics 
24 
 With CQL, Cassandra is now very familiar to RDBMS/SQL 
users. 
 CQL very nicely hides the underlying data storage model. 
 You still have all the power of Cassandra; it is all in the 
key definition. 
RDBMS = model data 
Cassandra = model access (queries)
Side-Note On Querying 
25 
 Create table with compound key 
 Select using ALLOW FILTERING 
 Counts 
 Select using IN or =
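The steps above might look roughly like this. The table `users_by_dept` and its values are assumptions for illustration, and exact filtering rules vary between Cassandra versions.

```cql
CREATE TABLE IF NOT EXISTS users_by_dept (
    department   text,
    emailaddress text,
    firstname    text,
    lastname     text,
    PRIMARY KEY (department, emailaddress)
);

-- Select using = on the partition key: reads a single partition.
SELECT * FROM users_by_dept WHERE department = 'engineering';

-- Select using IN on the partition key: one lookup per listed value.
SELECT * FROM users_by_dept WHERE department IN ('engineering', 'sales');

-- Counts: cheap within a partition, expensive across the whole table.
SELECT COUNT(*) FROM users_by_dept WHERE department = 'engineering';

-- No partition key in the WHERE clause: Cassandra has to scan and
-- filter, and asks you to acknowledge that with ALLOW FILTERING.
SELECT * FROM users_by_dept
 WHERE emailaddress = 'tpeters@example.com'
 ALLOW FILTERING;
```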
Batch Operations 
26 
 Saves Network Roundtrips 
 Can contain INSERT, UPDATE, DELETE 
 Atomic by default (all or nothing) 
 Can use timestamp for specific ordering
Batch Operation Example 
27 
BEGIN BATCH 
INSERT INTO users (emailaddress, firstname, lastname, country) values 
('brian.enochson@gmail.com', 'brian', 'enochson', 'USA'); 
INSERT INTO users (emailaddress, firstname, lastname, country) values 
('tpeters@example.com', 'tom', 'peters', 'DE'); 
INSERT INTO users (emailaddress, firstname, lastname, country) values 
('jsmith@example.com', 'jim', 'smith', 'USA'); 
INSERT INTO users (emailaddress, firstname, lastname, country) values 
('arogers@example.com', 'alan', 'rogers', 'USA'); 
DELETE FROM users WHERE emailaddress = 'jsmith@example.com'; 
APPLY BATCH; 
 select in cqlsh 
 List in cassandra-cli with timestamp
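To illustrate the timestamp point, here is a sketch using a hypothetical `batch_users` table. Statements in a batch share one timestamp by default, so explicit timestamps are a way to state which write should win. The values are arbitrary microseconds-since-epoch numbers chosen for illustration.

```cql
CREATE TABLE IF NOT EXISTS batch_users (
    emailaddress text PRIMARY KEY,
    firstname    text,
    lastname     text,
    country      text
);

BEGIN BATCH
  INSERT INTO batch_users (emailaddress, firstname, lastname, country)
  VALUES ('jsmith@example.com', 'jim', 'smith', 'USA')
  USING TIMESTAMP 1415900000000000;

  -- Later timestamp, so the delete wins over the insert above.
  DELETE FROM batch_users USING TIMESTAMP 1415900000000001
   WHERE emailaddress = 'jsmith@example.com';
APPLY BATCH;

-- WRITETIME shows the timestamp stored with a column.
SELECT emailaddress, WRITETIME(firstname) FROM batch_users;
```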
More Data Modeling… 
28 
 No Joins 
 No Foreign Keys 
 No Third (or any other) Normal Form Concerns 
 Redundant Data Encouraged. Apps maintain 
consistency.
Secondary Indexes 
29 
 Let you define indexes so you can query by columns 
other than the partition key. 
 Each node has a local index for its data. 
 They have uses, but shouldn’t be used all the time 
without consideration. 
 We will look at alternatives.
Secondary Index Example 
30 
 Create a table 
 Try to select with column not in PK 
 Add Secondary Index 
 Try select again. (maybe need to reinsert)
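A sketch of that sequence, with a hypothetical `employees` table and index name:

```cql
CREATE TABLE IF NOT EXISTS employees (
    emailaddress text PRIMARY KEY,
    firstname    text,
    lastname     text,
    country      text
);

INSERT INTO employees (emailaddress, firstname, lastname, country)
VALUES ('tpeters@example.com', 'tom', 'peters', 'DE');

-- Before the index exists, this is rejected because country
-- is not part of the primary key:
-- SELECT * FROM employees WHERE country = 'DE';

CREATE INDEX IF NOT EXISTS employees_country_idx ON employees (country);

-- With the index in place the same query is accepted.
SELECT * FROM employees WHERE country = 'DE';
```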
When to use? 
31 
 Low Cardinality – small number of unique values 
 High Cardinality – high number of distinct values 
 Secondary Indexes are good for Low Cardinality. So 
country codes, department codes etc. Not email 
addresses.
Materialized View 
32 
 If you want the query fully distributed, you can use what 
is called the Materialized View pattern. 
 Remember redundant data is fine. 
 Model the queries
Materialized View Example 
33 
 Show normal table with compound key and querying 
limitations 
 Create Materialized View Table With Different 
Compound Key, support alternate access. 
 Selects use partition key. 
 Secondary indexes are local, not distributed 
 ALLOW FILTERING works, but can cause performance issues
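A sketch of the pattern, assuming two hypothetical tables that hold the same user data under different keys. The application writes to both, here in one batch, and each query reads from the table keyed for it.

```cql
-- Table keyed for lookups by email.
CREATE TABLE IF NOT EXISTS users_by_email (
    emailaddress text PRIMARY KEY,
    firstname    text,
    lastname     text,
    country      text
);

-- "Materialized view" table: same data, keyed for lookups by country.
CREATE TABLE IF NOT EXISTS users_by_country (
    country      text,
    emailaddress text,
    firstname    text,
    lastname     text,
    PRIMARY KEY (country, emailaddress)
);

-- The application keeps both copies in step.
BEGIN BATCH
  INSERT INTO users_by_email (emailaddress, firstname, lastname, country)
  VALUES ('tpeters@example.com', 'tom', 'peters', 'DE');
  INSERT INTO users_by_country (country, emailaddress, firstname, lastname)
  VALUES ('DE', 'tpeters@example.com', 'tom', 'peters');
APPLY BATCH;

-- Both selects use a partition key; no index or filtering needed.
SELECT * FROM users_by_email   WHERE emailaddress = 'tpeters@example.com';
SELECT * FROM users_by_country WHERE country = 'DE';
```

Keying by country keeps the example short; in a real model, check that the chosen partition key does not produce a few very large partitions.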
Counters 
34 
 Updated in 2.1 and now work in a more distributed 
and accurate manner. 
 Table organization, example 
 How to update, view etc.
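A minimal counter sketch with a hypothetical `page_views` table. In a counter table every column outside the primary key must be a counter, and values are changed with UPDATE rather than INSERT.

```cql
CREATE TABLE IF NOT EXISTS page_views (
    page  text PRIMARY KEY,
    views counter
);

-- Increment (creates the row on first use).
UPDATE page_views SET views = views + 1 WHERE page = '/home';

-- Decrement works the same way.
UPDATE page_views SET views = views - 1 WHERE page = '/home';

-- View the current value.
SELECT page, views FROM page_views WHERE page = '/home';
```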
Time Series Example…. 
35 
 Time series table model. 
 Need to consider interval for event frequency and 
wide row size. 
 Make the thing being tracked plus the unit of interval 
(for example, the day) the partition key.
Time Series Data 
36 
 Because writes are fast, Cassandra is well suited 
for storing time series data. 
 The Cassandra wide row is a perfect fit for modeling 
time series / time based events. 
 Let’s look at an example….
Event Data 
37 
 Notice primary key and cluster key. 
 Insert some data 
 View in CQL, then in CLI as wide row
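One way the event table could look, as a sketch. The table `sensor_events`, its columns, and the one-partition-per-sensor-per-day bucketing are assumptions for illustration; pick the interval from your own event frequency so wide rows stay a manageable size.

```cql
CREATE TABLE IF NOT EXISTS sensor_events (
    sensor_id  text,
    day        text,        -- interval bucket, e.g. '2014-11-13'
    event_time timestamp,
    reading    double,
    PRIMARY KEY ((sensor_id, day), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

INSERT INTO sensor_events (sensor_id, day, event_time, reading)
VALUES ('s-1', '2014-11-13', '2014-11-13 19:00:00', 21.5);
INSERT INTO sensor_events (sensor_id, day, event_time, reading)
VALUES ('s-1', '2014-11-13', '2014-11-13 19:05:00', 21.7);

-- Newest events first, read from a single partition.
SELECT event_time, reading
  FROM sensor_events
 WHERE sensor_id = 's-1' AND day = '2014-11-13'
   AND event_time >= '2014-11-13 19:00:00'
 LIMIT 10;
```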
TTL – Self Expiring Data 
38 
 Another technique is data that has a defined lifespan. 
 For instance session identifiers, temporary 
passwords etc. 
 For this Cassandra provides a Time To Live (TTL) 
mechanism.
TTL Example… 
39 
 Create table 
 Insert data using TTL 
 Can update a specific column with its own TTL 
 Show using selects.
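A sketch of those steps with a hypothetical `sessions` table. TTL values are in seconds, and the UUID is an arbitrary example value.

```cql
CREATE TABLE IF NOT EXISTS sessions (
    session_id    uuid PRIMARY KEY,
    username      text,
    temp_password text
);

-- Whole insert expires after one hour.
INSERT INTO sessions (session_id, username, temp_password)
VALUES (123e4567-e89b-12d3-a456-426655440000, 'tpeters', 'changeme')
USING TTL 3600;

-- Give a single column its own, shorter lifespan.
UPDATE sessions USING TTL 300
   SET temp_password = 'changeme2'
 WHERE session_id = 123e4567-e89b-12d3-a456-426655440000;

-- TTL() reports the seconds remaining for a column.
SELECT username, TTL(username), TTL(temp_password)
  FROM sessions
 WHERE session_id = 123e4567-e89b-12d3-a456-426655440000;
```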
Questions 
40 
 http://www.thejavaexperts.net/ 
 Email: brian.enochson@gmail.com 
 Twitter: @benochso 
 G+: https://plus.google.com/+BrianEnochson
