SlideShare a Scribd company logo
1 of 27
Download to read offline
Getting to know
by Michelle Darling
mdarlingcmt@gmail.com
August 2013
Agenda:
● What is Cassandra?
● Installation, CQL3
● Data Modelling
● Summary
Only 15 min to cover these, so
please hold questions til the
end, or email me :-) and I’ll
summarize Q&A for everyone.
Unfortunately, no time for:
● DB Admin
○ Detailed Architecture
○ Partitioning /
Consistent Hashing
○ Consistency Tuning
○ Data Distribution &
Replication
○ System Tables
● App Development
○ Using Python, Ruby etc
to access Cassandra
○ Using Hadoop to
stream data into
Cassandra
What is Cassandra?
“Fortuneteller of Doom”
from Greek Mythology. Tried to
warn others about future disasters,
but no one listened. Unfortunately,
she was 100% accurate.
NoSQL Distributed DB
● Consistency - A__ID
● Availability - High
● Point of Failure - none
● Good for Event
Tracking & Analysis
○ Time series data
○ Sensor device data
○ Social media analytics
○ Risk Analysis
○ Failure Prediction
Rackspace: “Which servers
are under heavy load
and are about to crash?”
The Evolution of Cassandra
2008: Open-Source Release / 2013: Enterprise & Community Editions
Data Model
● Wide rows, sparse arrays
● High performance through very
fast write throughput.
Infrastructure
● Peer-Peer Gossip
● Key-Value Pairs
● Tunable Consistency
2006
2005
● Originally for Inbox Search
● But now used for Instagram
Other NoSQL vs. Cassandra
NoSQL Taxonomy:
● Key-Value Pairs
○ Dynamo, Riak, Redis
● Column-Based
○ BigTable, HBase,
Cassandra
● Document-Based
○ MongoDB, Couchbase
● Graph
○ Neo4J
Big Data Capable
C* Differentiators:
● Production-proven at
Netflix, eBay, Twitter,
20 of Fortune 100
● “Clear Winner” in
Scalability,
Performance,
Availability
-- DataStax
Architecture
● Cluster (ring)
● Nodes (circles)
● Peer-to-Peer Model
● Gossip Protocol
Partitioner:
Consistent Hashing
Netflix
Streaming Video
● Personalized
Recommendations per
family member
● Built on Amazon Web
Services (AWS) +
Cassandra
Cloud installation using
● Amazon Web Services (AWS)
● Elastic Compute Cloud (EC2)
○ Free for the 1st year! Then pay only for what you use.
○ Sign up for AWS EC2 account: Big Data University Video 4:34 minutes,
● Amazon Machine Image (AMI)
○ Preconfigured installation template
○ Choose: “DataStax AMI for Cassandra
Community Edition”
○ Follow these *very good* step-by-step
instructions from DataStax.
○ AMIs also available for CouchBase, MongoDB
(make sure you pick the free tier community versions to avoid
monthly charge$$!!!).
AWS EC2 Dashboard
DataStax AMI Setup
DataStax AMI Setup
--clustername Michelle
--totalnodes 1
--version community
“Roll your Own” Installation
DataStax Community Edition
● Install instructions
For Linux, Windows,
MacOS:
http://www.datastax.com/2012/01/getting-
started-with-cassandra
● Video: “Set up a 4-
node Cassandra
cluster in under 2
minutes”
http://www.screenr.com/5G6
Invoke CQLSH, CREATE KEYSPACE
./bin/cqlsh
cqlsh> CREATE KEYSPACE big_data
… with strategy_class = ‘org.apache.cassandra.
locator.SimpleStrategy’
… with strategy_options:replication_factor=‘1’;
cqlsh> use big_data;
cqlsh:big_data>
Tip: Skip Thrift -- use CQL3
Thrift RPC
// Your Column
Column col = new Column(ByteBuffer.wrap("name".
getBytes()));
col.setValue(ByteBuffer.wrap("value".getBytes()));
col.setTimestamp(System.currentTimeMillis());
// Don't ask
ColumnOrSuperColumn cosc = new ColumnOrSuperColumn();
cosc.setColumn(col);
// Prepare to be amazed
Mutation mutation = new Mutation();
mutation.setColumnOrSuperColumn(cosc);
List<Mutation> mutations = new ArrayList<Mutation>();
mutations.add(mutation);
Map mutations_map = new HashMap<ByteBuffer, Map<String,
List<Mutation>>>();
Map cf_map = new HashMap<String, List<Mutation>>();
cf_map.set("Standard1", mutations);
mutations_map.put(ByteBuffer.wrap("key".getBytes()),
cf_map);
cassandra.batch_mutate(mutations_map,
consistency_level);
CQL3
- Uses cqlsh
- “SQL-like” language
- Runs on top of Thrift RPC
- Much more user-friendly.
Thrift code on left
equals this in CQL3:
INSERT INTO (id, name)
VALUES ('key',
'value');
CREATE TABLE
cqlsh:big_data> create table user_tags (
… user_id varchar,
… tag varchar,
… value counter,
… primary key (user_id, tag)
…):
● TABLE user_tags: “How many times has a user
mentioned a hashtag?”
● COUNTER datatype - Computes & stores counter value
at the time data is written. This optimizes query
performance.
UPDATE TABLE
SELECT FROM TABLE
cqlsh:big_data> UPDATE user_tags SET
value=value+1 WHERE user_id = ‘paul’ AND tag =
‘cassandra’
cqlsh:big_data> SELECT * FROM user_tags
user_id | tag | value
--------+-----------+----------
paul | cassandra | 1
DATA MODELING
A Major Paradigm Shift!
RDBMS Cassandra
Structured Data, Fixed Schema Unstructured Data, Flexible Schema
“Array of Arrays”
2D: ROW x COLUMN
“Nested Key-Value Pairs”
3D: ROW Key x COLUMN key x COLUMN values
DATABASE KEYSPACE
TABLE TABLE a.k.a COLUMN FAMILY
ROW ROW a.k.a PARTITION. Unit of replication.
COLUMN COLUMN [Name, Value, Timestamp]. a.k.a CLUSTER. Unit
of storage. Up to 2 billion columns per row.
FOREIGN KEYS, JOINS,
ACID Consistency
Referential Integrity not enforced, so A_CID.
BUT relationships represented using COLLECTIONS.
Cassandra
3D+: Nested Objects
RDBMS
2D: Rows
x columns
Example:
“Twissandra” Web App
Twitter-Inspired
sample application
written in Python +
Cassandra.
● Play with the app:
twissandra.com
● Examine & learn
from the code on
GitHub.
Features/Queries:
● Sign In, Sign Up
● Post Tweet
● Userline (User’s tweets)
● Timeline (All tweets)
● Following (Users being
followed by user)
● Followers (Users
following this user)
Twissandra.com vs Twitter.com
Twissandra - RDBMS Version
Entities
● USER, TWEET
● FOLLOWER, FOLLOWING
● FRIENDS
Relationships:
● USER has many TWEETs.
● USER is a FOLLOWER of many
USERs.
● Many USERs are FOLLOWING
USER.
Twissandra - Cassandra Version
Tip: Model tables to mirror queries.
TABLES or CFs
● TWEET
● USER, USERNAME
● FOLLOWERS, FOLLOWING
● USERLINE, TIMELINE
Notes:
● Extra tables mirror queries.
● Denormalized tables are
“pre-formed”for faster
performance.
TABLE
Tip: Remember,
Skip Thrift -- use CQL3
What does C* data look like?
TABLE Userline
“List all of user’s Tweets”
*************
Row Key: user_id
Columns
● Column Key: tweet_id
● “at” Timestamp
● TTL (Time to Live) -
seconds til expiration
date.
*************
Cassandra Data Model = LEGOs?
FlexibleSchema
Summary:
● Go straight from SQL
to CQL3; skip Thrift, Column
Families, SuperColumns, etc
● Denormalize tables to
mirror important queries.
Roughly 1 table per impt query.
● Choose wisely:
○ Partition Keys
○ Cluster Keys
○ Indexes
○ TTL
○ Counters
○ Collections
See DataStax Music Service
Example
● Consider hybrid
approach:
○ 20% - RDBMS for highly
structured, OLTP, ACID
requirements.
○ 80% - Scale Cassandra to
handle the rest of data.
Remember:
● Cheap: storage,
servers, OpenSource
software.
● Precious: User AND
Developer Happiness.
Resources
C* Summit 2013:
● Slides
● Cassandra at eBay Scale (slides)
● Data Modelers Still Have Jobs -
Adjusting For the NoSQL
Environment (Slides)
● Real-time Analytics using
Cassandra, Spark and Shark
slides
● Cassandra By Example: Data
Modelling with CQL3 Slides
● DATASTAX C*OLLEGE CREDIT:
DATA MODELLING FOR APACHE
CASSANDRA slides
I wish I found these 1st:
● How do I Cassandra?
slides
● Mobile version of
DataStax web docs
(link)

More Related Content

What's hot

ETL With Cassandra Streaming Bulk Loading
ETL With Cassandra Streaming Bulk LoadingETL With Cassandra Streaming Bulk Loading
ETL With Cassandra Streaming Bulk Loading
alex_araujo
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
Chandler Huang
 

What's hot (20)

Cassandra
Cassandra Cassandra
Cassandra
 
NoSQL Database- cassandra column Base DB
NoSQL Database- cassandra column Base DBNoSQL Database- cassandra column Base DB
NoSQL Database- cassandra column Base DB
 
Intro to Cassandra
Intro to CassandraIntro to Cassandra
Intro to Cassandra
 
Introduction to cassandra
Introduction to cassandraIntroduction to cassandra
Introduction to cassandra
 
Appache Cassandra
Appache Cassandra  Appache Cassandra
Appache Cassandra
 
Cassandra Database
Cassandra DatabaseCassandra Database
Cassandra Database
 
An Overview of Apache Cassandra
An Overview of Apache CassandraAn Overview of Apache Cassandra
An Overview of Apache Cassandra
 
Scylla Summit 2022: Scylla 5.0 New Features, Part 1
Scylla Summit 2022: Scylla 5.0 New Features, Part 1Scylla Summit 2022: Scylla 5.0 New Features, Part 1
Scylla Summit 2022: Scylla 5.0 New Features, Part 1
 
Cassandra 101
Cassandra 101Cassandra 101
Cassandra 101
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
ETL With Cassandra Streaming Bulk Loading
ETL With Cassandra Streaming Bulk LoadingETL With Cassandra Streaming Bulk Loading
ETL With Cassandra Streaming Bulk Loading
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Cassandra an overview
Cassandra an overviewCassandra an overview
Cassandra an overview
 
Bulk Loading Data into Cassandra
Bulk Loading Data into CassandraBulk Loading Data into Cassandra
Bulk Loading Data into Cassandra
 
Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase
 
Introduction to Cassandra Architecture
Introduction to Cassandra ArchitectureIntroduction to Cassandra Architecture
Introduction to Cassandra Architecture
 
Cassandra internals
Cassandra internalsCassandra internals
Cassandra internals
 
Intel - optimizing ceph performance by leveraging intel® optane™ and 3 d nand...
Intel - optimizing ceph performance by leveraging intel® optane™ and 3 d nand...Intel - optimizing ceph performance by leveraging intel® optane™ and 3 d nand...
Intel - optimizing ceph performance by leveraging intel® optane™ and 3 d nand...
 
Cassandra
CassandraCassandra
Cassandra
 
Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su...
Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su...Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su...
Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su...
 

Similar to Cassandra NoSQL Tutorial

Similar to Cassandra NoSQL Tutorial (20)

Multi-cluster k8ssandra
Multi-cluster k8ssandraMulti-cluster k8ssandra
Multi-cluster k8ssandra
 
Cassandra Talk: Austin JUG
Cassandra Talk: Austin JUGCassandra Talk: Austin JUG
Cassandra Talk: Austin JUG
 
Introduction to cloud and openstack
Introduction to cloud and openstackIntroduction to cloud and openstack
Introduction to cloud and openstack
 
GumGum: Multi-Region Cassandra in AWS
GumGum: Multi-Region Cassandra in AWSGumGum: Multi-Region Cassandra in AWS
GumGum: Multi-Region Cassandra in AWS
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
 
Avoiding Pitfalls for Cassandra.pdf
Avoiding Pitfalls for Cassandra.pdfAvoiding Pitfalls for Cassandra.pdf
Avoiding Pitfalls for Cassandra.pdf
 
On Rails with Apache Cassandra
On Rails with Apache CassandraOn Rails with Apache Cassandra
On Rails with Apache Cassandra
 
Running Cassandra in AWS
Running Cassandra in AWSRunning Cassandra in AWS
Running Cassandra in AWS
 
Apache Cassandra Lunch #64: Cassandra for .NET Developers
Apache Cassandra Lunch #64: Cassandra for .NET DevelopersApache Cassandra Lunch #64: Cassandra for .NET Developers
Apache Cassandra Lunch #64: Cassandra for .NET Developers
 
Five Lessons in Distributed Databases
Five Lessons  in Distributed DatabasesFive Lessons  in Distributed Databases
Five Lessons in Distributed Databases
 
Apache Cassandra introduction
Apache Cassandra introductionApache Cassandra introduction
Apache Cassandra introduction
 
Lambda at Weather Scale - Cassandra Summit 2015
Lambda at Weather Scale - Cassandra Summit 2015Lambda at Weather Scale - Cassandra Summit 2015
Lambda at Weather Scale - Cassandra Summit 2015
 
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at NightHow Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
 
Cassandra To Infinity And Beyond
Cassandra To Infinity And BeyondCassandra To Infinity And Beyond
Cassandra To Infinity And Beyond
 
A Microservices approach with Cassandra and Quarkus | DevNation Tech Talk
A Microservices approach with Cassandra and Quarkus | DevNation Tech TalkA Microservices approach with Cassandra and Quarkus | DevNation Tech Talk
A Microservices approach with Cassandra and Quarkus | DevNation Tech Talk
 
Cassandra REST API with Pagination TEAM 15
Cassandra REST API with Pagination TEAM 15Cassandra REST API with Pagination TEAM 15
Cassandra REST API with Pagination TEAM 15
 
Global Cluster Topologies in MongoDB Atlas
Global Cluster Topologies in MongoDB AtlasGlobal Cluster Topologies in MongoDB Atlas
Global Cluster Topologies in MongoDB Atlas
 
Cassandra - A Basic Introduction Guide
Cassandra - A Basic Introduction GuideCassandra - A Basic Introduction Guide
Cassandra - A Basic Introduction Guide
 
Cassandra's Odyssey @ Netflix
Cassandra's Odyssey @ NetflixCassandra's Odyssey @ Netflix
Cassandra's Odyssey @ Netflix
 
DataStax TechDay - Munich 2014
DataStax TechDay - Munich 2014DataStax TechDay - Munich 2014
DataStax TechDay - Munich 2014
 

More from Michelle Darling

More from Michelle Darling (8)

Family pics2august014
Family pics2august014Family pics2august014
Family pics2august014
 
Final pink panthers_03_31
Final pink panthers_03_31Final pink panthers_03_31
Final pink panthers_03_31
 
Final pink panthers_03_30
Final pink panthers_03_30Final pink panthers_03_30
Final pink panthers_03_30
 
Php summary
Php summaryPhp summary
Php summary
 
Rsplit apply combine
Rsplit apply combineRsplit apply combine
Rsplit apply combine
 
College day pressie
College day pressieCollege day pressie
College day pressie
 
R learning by examples
R learning by examplesR learning by examples
R learning by examples
 
V3 gamingcasestudy
V3 gamingcasestudyV3 gamingcasestudy
V3 gamingcasestudy
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Recently uploaded (20)

EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 

Cassandra NoSQL Tutorial

  • 1. Getting to know by Michelle Darling mdarlingcmt@gmail.com August 2013
  • 2. Agenda: ● What is Cassandra? ● Installation, CQL3 ● Data Modelling ● Summary Only 15 min to cover these, so please hold questions til the end, or email me :-) and I’ll summarize Q&A for everyone. Unfortunately, no time for: ● DB Admin ○ Detailed Architecture ○ Partitioning / Consistent Hashing ○ Consistency Tuning ○ Data Distribution & Replication ○ System Tables ● App Development ○ Using Python, Ruby etc to access Cassandra ○ Using Hadoop to stream data into Cassandra
  • 3. What is Cassandra? “Fortuneteller of Doom” from Greek Mythology. Tried to warn others about future disasters, but no one listened. Unfortunately, she was 100% accurate. NoSQL Distributed DB ● Consistency - A__ID ● Availability - High ● Point of Failure - none ● Good for Event Tracking & Analysis ○ Time series data ○ Sensor device data ○ Social media analytics ○ Risk Analysis ○ Failure Prediction Rackspace: “Which servers are under heavy load and are about to crash?”
  • 4. The Evolution of Cassandra 2008: Open-Source Release / 2013: Enterprise & Community Editions Data Model ● Wide rows, sparse arrays ● High performance through very fast write throughput. Infrastructure ● Peer-Peer Gossip ● Key-Value Pairs ● Tunable Consistency 2006 2005 ● Originally for Inbox Search ● But now used for Instagram
  • 5. Other NoSQL vs. Cassandra NoSQL Taxonomy: ● Key-Value Pairs ○ Dynamo, Riak, Redis ● Column-Based ○ BigTable, HBase, Cassandra ● Document-Based ○ MongoDB, Couchbase ● Graph ○ Neo4J Big Data Capable C* Differentiators: ● Production-proven at Netflix, eBay, Twitter, 20 of Fortune 100 ● “Clear Winner” in Scalability, Performance, Availability -- DataStax
  • 6. Architecture ● Cluster (ring) ● Nodes (circles) ● Peer-to-Peer Model ● Gossip Protocol Partitioner: Consistent Hashing
  • 7. Netflix Streaming Video ● Personalized Recommendations per family member ● Built on Amazon Web Services (AWS) + Cassandra
  • 8. Cloud installation using ● Amazon Web Services (AWS) ● Elastic Compute Cloud (EC2) ○ Free for the 1st year! Then pay only for what you use. ○ Sign up for AWS EC2 account: Big Data University Video 4:34 minutes, ● Amazon Machine Image (AMI) ○ Preconfigured installation template ○ Choose: “DataStax AMI for Cassandra Community Edition” ○ Follow these *very good* step-by-step instructions from DataStax. ○ AMIs also available for CouchBase, MongoDB (make sure you pick the free tier community versions to avoid monthly charge$$!!!).
  • 11. DataStax AMI Setup --clustername Michelle --totalnodes 1 --version community
  • 12. “Roll your Own” Installation DataStax Community Edition ● Install instructions For Linux, Windows, MacOS: http://www.datastax.com/2012/01/getting- started-with-cassandra ● Video: “Set up a 4- node Cassandra cluster in under 2 minutes” http://www.screenr.com/5G6
  • 13. Invoke CQLSH, CREATE KEYSPACE ./bin/cqlsh cqlsh> CREATE KEYSPACE big_data … with strategy_class = ‘org.apache.cassandra. locator.SimpleStrategy’ … with strategy_options:replication_factor=‘1’; cqlsh> use big_data; cqlsh:big_data>
  • 14. Tip: Skip Thrift -- use CQL3 Thrift RPC // Your Column Column col = new Column(ByteBuffer.wrap("name". getBytes())); col.setValue(ByteBuffer.wrap("value".getBytes())); col.setTimestamp(System.currentTimeMillis()); // Don't ask ColumnOrSuperColumn cosc = new ColumnOrSuperColumn(); cosc.setColumn(col); // Prepare to be amazed Mutation mutation = new Mutation(); mutation.setColumnOrSuperColumn(cosc); List<Mutation> mutations = new ArrayList<Mutation>(); mutations.add(mutation); Map mutations_map = new HashMap<ByteBuffer, Map<String, List<Mutation>>>(); Map cf_map = new HashMap<String, List<Mutation>>(); cf_map.set("Standard1", mutations); mutations_map.put(ByteBuffer.wrap("key".getBytes()), cf_map); cassandra.batch_mutate(mutations_map, consistency_level); CQL3 - Uses cqlsh - “SQL-like” language - Runs on top of Thrift RPC - Much more user-friendly. Thrift code on left equals this in CQL3: INSERT INTO (id, name) VALUES ('key', 'value');
  • 15. CREATE TABLE cqlsh:big_data> create table user_tags ( … user_id varchar, … tag varchar, … value counter, … primary key (user_id, tag) …): ● TABLE user_tags: “How many times has a user mentioned a hashtag?” ● COUNTER datatype - Computes & stores counter value at the time data is written. This optimizes query performance.
  • 16. UPDATE TABLE SELECT FROM TABLE cqlsh:big_data> UPDATE user_tags SET value=value+1 WHERE user_id = ‘paul’ AND tag = ‘cassandra’ cqlsh:big_data> SELECT * FROM user_tags user_id | tag | value --------+-----------+---------- paul | cassandra | 1
  • 17. DATA MODELING A Major Paradigm Shift! RDBMS Cassandra Structured Data, Fixed Schema Unstructured Data, Flexible Schema “Array of Arrays” 2D: ROW x COLUMN “Nested Key-Value Pairs” 3D: ROW Key x COLUMN key x COLUMN values DATABASE KEYSPACE TABLE TABLE a.k.a COLUMN FAMILY ROW ROW a.k.a PARTITION. Unit of replication. COLUMN COLUMN [Name, Value, Timestamp]. a.k.a CLUSTER. Unit of storage. Up to 2 billion columns per row. FOREIGN KEYS, JOINS, ACID Consistency Referential Integrity not enforced, so A_CID. BUT relationships represented using COLLECTIONS.
  • 19. Example: “Twissandra” Web App Twitter-Inspired sample application written in Python + Cassandra. ● Play with the app: twissandra.com ● Examine & learn from the code on GitHub. Features/Queries: ● Sign In, Sign Up ● Post Tweet ● Userline (User’s tweets) ● Timeline (All tweets) ● Following (Users being followed by user) ● Followers (Users following this user)
  • 21. Twissandra - RDBMS Version Entities ● USER, TWEET ● FOLLOWER, FOLLOWING ● FRIENDS Relationships: ● USER has many TWEETs. ● USER is a FOLLOWER of many USERs. ● Many USERs are FOLLOWING USER.
  • 22. Twissandra - Cassandra Version Tip: Model tables to mirror queries. TABLES or CFs ● TWEET ● USER, USERNAME ● FOLLOWERS, FOLLOWING ● USERLINE, TIMELINE Notes: ● Extra tables mirror queries. ● Denormalized tables are “pre-formed”for faster performance.
  • 24. What does C* data look like? TABLE Userline “List all of user’s Tweets” ************* Row Key: user_id Columns ● Column Key: tweet_id ● “at” Timestamp ● TTL (Time to Live) - seconds til expiration date. *************
  • 25. Cassandra Data Model = LEGOs? FlexibleSchema
  • 26. Summary: ● Go straight from SQL to CQL3; skip Thrift, Column Families, SuperColumns, etc ● Denormalize tables to mirror important queries. Roughly 1 table per impt query. ● Choose wisely: ○ Partition Keys ○ Cluster Keys ○ Indexes ○ TTL ○ Counters ○ Collections See DataStax Music Service Example ● Consider hybrid approach: ○ 20% - RDBMS for highly structured, OLTP, ACID requirements. ○ 80% - Scale Cassandra to handle the rest of data. Remember: ● Cheap: storage, servers, OpenSource software. ● Precious: User AND Developer Happiness.
  • 27. Resources C* Summit 2013: ● Slides ● Cassandra at eBay Scale (slides) ● Data Modelers Still Have Jobs - Adjusting For the NoSQL Environment (Slides) ● Real-time Analytics using Cassandra, Spark and Shark slides ● Cassandra By Example: Data Modelling with CQL3 Slides ● DATASTAX C*OLLEGE CREDIT: DATA MODELLING FOR APACHE CASSANDRA slides I wish I found these 1st: ● How do I Cassandra? slides ● Mobile version of DataStax web docs (link)