Big Data-Driven Applications
with Cassandra and Spark
Artem Chebotko, Ph.D.
Solution Architect
1 Modern Big Data and Cloud Applications
2 Cassandra and Spark Highlights
3 Architecture Overview
4 Languages and APIs
5 Live Demo
2
Modern Application Requirements
• Numerous Endpoints
• Geographically Distributed
• Continuously Available
• Instantaneously Responsive
• Immediately Decisive
• Predictably Scalable
3
Applications by Response Time and Workload
4
Operational (OLTP) ←→ Analytical (OLAP)
Applications by Response Time and Workload
5
Real-time transactions
Operational (OLTP) ←→ Analytical (OLAP)
• Web and IoT apps
• Financial transactions
Applications by Response Time and Workload
6
Real-time transactions
Real-time analytics
Operational (OLTP) ←→ Analytical (OLAP)
• Web and IoT apps
• Financial transactions
• Recommendations
Applications by Response Time and Workload
7
Real-time transactions
Real-time analytics
Streaming analytics
Operational (OLTP) ←→ Analytical (OLAP)
• Web and IoT apps
• Financial transactions
• Recommendations
• Fraud prevention
Applications by Response Time and Workload
8
Real-time transactions
Real-time analytics
Streaming analytics
Batch analytics
Operational (OLTP) ←→ Analytical (OLAP)
• Web and IoT apps
• Financial transactions
• Recommendations
• Fraud prevention
• Predictive models
• Fraud detection
Roles of Cassandra and Spark
9
Real-time transactions
Real-time analytics
Streaming analytics
Batch analytics
Operational (OLTP) ←→ Analytical (OLAP)
• Web and IoT apps
• Financial transactions
• Recommendations
• Fraud prevention
• Predictive models
• Fraud detection
Numerous Endpoints
Geographically Distributed
Continuously Available
Instantaneously Responsive
Immediately Decisive
Predictably Scalable
1 Modern Big Data and Cloud Applications
2 Cassandra and Spark Highlights
3 Architecture Overview
4 Languages and APIs
5 Live Demo
10
Cassandra – Operational Database
• Millions of concurrent users
• Millisecond response time
• Linear scalability
• Always on
11
Cassandra – Operational Database
12
Spark – Analytics Platform
• Real-time, streaming and batch analytics
• Up to 100x faster than Hadoop MapReduce for in-memory workloads
• Scalability, fault-resilience
• Versatile and rich API
13
Spark – Analytics Platform
14
[Diagram: Spark libraries (SQL, Streaming, MLlib, GraphX) running on a Cluster Manager: Standalone, YARN, or Mesos]
Spark-Cassandra Connector
Open-Source Package for Spark
• Routine Spark-Cassandra interactions
– Read from and write into Cassandra
• Profound optimizations
– Predicate pushdown
– Data locality
– Cassandra-optimized joins
– Cassandra-aware partitioning
– Shuffle-free grouping
15
1 Modern Big Data and Cloud Applications
2 Cassandra and Spark Highlights
3 Architecture Overview
4 Languages and APIs
5 Live Demo
16
C*: Distributed, Shared Nothing, Peer-to-Peer
17
[Diagram: C* clients, each with a driver, send transactions to any node of a peer-to-peer C* ring; there is no master and no shared state, and the token ring spans -2^63 to +2^63-1]
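The ring placement the diagram depicts can be sketched in plain Scala. This is an illustration only: the node list and the stand-in token function are hypothetical, and Cassandra actually hashes the partition key with Murmur3 into the range [-2^63, 2^63 - 1].

```scala
object TokenRing {
  // Hypothetical 3-node ring: each entry is the highest token that node owns.
  val nodes: Vector[(Long, String)] = Vector(
    (Long.MinValue / 3, "node-A"),
    (-(Long.MinValue / 3), "node-B"),
    (Long.MaxValue, "node-C")
  )

  // Stand-in token function, NOT Murmur3: just spreads hashCode over Long.
  def token(partitionKey: String): Long =
    partitionKey.hashCode.toLong * 2654435761L

  // Route a partition key to the node whose token range covers its token.
  def nodeFor(partitionKey: String): String =
    nodes.find { case (upper, _) => token(partitionKey) <= upper }.get._2
}
```

Because routing depends only on the key's token, any node can act as a client's entry point and still forward the request to the right owner.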
C*: Partitioning and Replication
18
[Diagram: write request at RF=3, CL=QUORUM; the coordinator's partitioner hashes the partition key to locate the partition of the table, forwards the write to replicas 1-3, and returns an acknowledgment to the client once a quorum of replicas responds]
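The consistency-level arithmetic behind CL=QUORUM can be sketched in a few lines of plain Scala (an illustration of the rule, not driver or Connector API):

```scala
// A QUORUM write at replication factor RF needs floor(RF/2) + 1 acknowledgments.
def quorum(replicationFactor: Int): Int = replicationFactor / 2 + 1

// A request at a given consistency level succeeds once enough replicas ack.
def writeSucceeds(acks: Int, required: Int): Boolean = acks >= required
```

With RF=3 a quorum is 2, so a quorum write still succeeds with one replica down; at CL=ONE a single replica suffices.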
C*: Partitioning and Replication
19
[Diagram: read request at RF=3, CL=ONE; the coordinator's partitioner locates the partition by its key, reads from one replica, and returns the result to the client]
Spark: Master-Worker, Failover Masters
20
[Diagram: each Spark Client runs a Driver with a SparkContext; Drivers register with the Master, which manages Workers; each Worker hosts Executors that run the application's tasks]
Spark: Computation Scheduling
21
[Diagram: the Driver's SparkContext turns each Job into a DAG of Stages (Job 0: Stages 0-2; Job 1: Stages 3-5); each Stage is a set of parallel tasks, which are scheduled onto Executors that run tasks and keep a cache]
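The stage-cutting rule the diagram illustrates (stages are split at wide, i.e. shuffle, dependencies, while narrow dependencies stay inside a stage) can be modeled in a toy form; `Dep`, `Narrow`, and `Wide` are hypothetical names for this sketch, not Spark's API:

```scala
// Simplified lineage model: each step in a linear lineage is either a
// narrow dependency (map, filter) or a wide one (groupByKey, join).
sealed trait Dep
case object Narrow extends Dep
case object Wide extends Dep

// One stage to start with, plus one more stage after every wide dependency.
def stageCount(lineage: Seq[Dep]): Int = 1 + lineage.count(_ == Wide)
```

A pipeline of only narrow transformations runs as a single stage; every shuffle boundary adds one.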
Spark-Cassandra Connector
22
[Diagram: Spark Workers colocated with C* nodes; each Worker's Executors talk to the local C* node through the Spark-Cassandra Connector, with the Master coordinating the Workers]
Spark-Cassandra Connector
23
[Diagram: per-node view of the colocated deployment; a cluster node hosts a Spark node (Master JVM and Worker JVM plus Executor JVMs, with Connector.jar on the application classpath) alongside a C* node (the C* JVM)]
Multi-DC Deployment and Workload Separation
24
[Diagram: two data centers linked by C* data replication; the Operations DC serves real-time transactions to C* clients, while the Analytics DC colocates C* nodes with a Spark cluster (Master, Workers, Executors) and its own clients for interactive and batch analytics]
1 Modern Big Data and Cloud Applications
2 Cassandra and Spark Highlights
3 Architecture Overview
4 Languages and APIs
5 Live Demo
25
Getting Started with Cassandra and Spark Applications
• Data Model and Cassandra Query Language
• Core Spark and Spark-Cassandra Connector
26
Keyspace and Replication
27
CREATE KEYSPACE iot
WITH replication = {'class': 'NetworkTopologyStrategy',
'DC-Kyiv-Operations' : 3,
'DC-Houston-Analytics': 2};
USE iot;
Table with Single-Row Partitions
28
users
username | age | address
---------+-----+----------------
Alice    | 28  | Santa Clara, CA
Alex     | 37  | Austin, TX

CREATE TABLE users (
  username TEXT,
  age INT,
  address TEXT,
  PRIMARY KEY (username)
);

SELECT * FROM users WHERE username = ?;
Table with Single-Row Partitions
29
sensors
id | type       | settings                   | owner
---+------------+----------------------------+------
1  | phone      | {gps ⇒ on, pedometer ⇒ on} | Alice
2  | wristband  | {heart rate ⇒ on, …}       | Alice
3  | thermostat | {temp ⇒ 75, …}             | Alice
4  | security   | {…}                        | Alex
5  | phone      | {…}                        | Alex

CREATE TABLE sensors (
  id INT,
  type TEXT,
  settings MAP<TEXT,TEXT>,
  owner TEXT,
  PRIMARY KEY (id)
);

SELECT * FROM sensors WHERE id = ?;
Table with Multi-Row Partitions
30
sensors_by_user
username | id | type       | settings             | age | address
---------+----+------------+----------------------+-----+----------------
Alice    | 1  | phone      | {gps ⇒ on, …}        | 28  | Santa Clara, CA
Alice    | 2  | wristband  | {heart rate ⇒ on, …} | 28  | Santa Clara, CA
Alice    | 3  | thermostat | {temp ⇒ 75, …}       | 28  | Santa Clara, CA
Alex     | 4  | security   | …                    | 37  | Austin, TX
Alex     | 5  | phone      | …                    | 37  | Austin, TX

(rows within each partition are clustered by id ASC)
Table with Multi-Row Partitions
CREATE TABLE sensors_by_user (
username TEXT, age INT STATIC, address TEXT STATIC,
id INT, type TEXT, settings MAP<TEXT,TEXT>,
PRIMARY KEY(username, id)
) WITH CLUSTERING ORDER BY (id ASC);
SELECT * FROM sensors_by_user WHERE username = ?;
SELECT * FROM sensors_by_user WHERE username = ? AND id = ?;
SELECT * FROM sensors_by_user WHERE username = ? AND id > ?
ORDER BY id DESC;
31
Retrieving Data from C*
• SparkContext, RDD, Connector
32
val rdd = sc.cassandraTable("iot","sensors_by_user")
.select("username","id","type")
Predicate Pushdown
33
sc.cassandraTable("iot","sensors_by_user")
.select("username","id","type")
.filter(row => row.getString("username") == "Alice")
• Suboptimal code
Predicate Pushdown
34
sc.cassandraTable("iot","sensors_by_user")
.select("username","id","type")
.filter(row => row.getString("username") == "Alice")
.where("username = 'Alice'")
• Predicate pushed down to C*
Data Locality
35
Input settings that control reads from C* (defaults in parentheses):
• spark.cassandra.input.split.size_in_mb (64)
• spark.cassandra.input.consistency.level (LOCAL_ONE)
• spark.cassandra.input.fetch.size_in_rows (1000)
[Diagram: a Spark Executor reads from the C* process on the same node, so Spark partitions are built from token ranges local to each node]
• Standard Spark join = shuffle + shuffle
Cassandra-Optimized Joins
36
val s = sc.cassandraTable("iot","sensors")
.keyBy(row => row.getString("owner"))
val u = sc.cassandraTable("iot","users")
.keyBy(row => row.getString("username"))
s.join(u)
Cassandra-Optimized Joins
37
• Shuffle
[Diagram: each map task splits its input partition (Partitions 1-3) into per-key buckets A-D in memory, spilling to disk during shuffle write; on shuffle read, each reduce task (one per bucket: Partitions A-D) fetches its bucket from every map task over the network and then runs the aggregation]
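The bucket exchange the diagram describes can be sketched in plain Scala (illustrative helper names, not Spark's implementation):

```scala
// Shuffle write: a map task buckets its partition's rows by key hash,
// producing one bucket per reduce task.
def shuffleWrite[K, V](mapPartition: Seq[(K, V)], numReducers: Int): Map[Int, Seq[(K, V)]] =
  mapPartition.groupBy { case (key, _) => math.floorMod(key.hashCode, numReducers) }

// Shuffle read: a reduce task pulls its bucket from every map task's output;
// only then does it hold all rows sharing a key and can aggregate them.
def shuffleRead[K, V](mapOutputs: Seq[Map[Int, Seq[(K, V)]]], reducer: Int): Seq[(K, V)] =
  mapOutputs.flatMap(_.getOrElse(reducer, Seq.empty))
```

Every row crosses the map-to-reduce boundary, which is why avoiding the shuffle (as the Connector joins below aim to) matters.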
• Connector join = no shuffle + no data locality
Cassandra-Optimized Joins
38
sc.cassandraTable("iot","sensors")
.select("id","type","owner".as("username"))
.joinWithCassandraTable("iot","users")
.on(SomeColumns("username"))
Cassandra-Optimized Joins
39
[Diagram: per-partition join results; sensor rows (id, type) keyed by owner, aliased to username, are matched against the users rows for Alice (28, …) and Alex (37, …)]
• Connector join + Cassandra-aware partitioning = shuffle + data locality
Cassandra-Aware Partitioning
40
sc.cassandraTable("iot","sensors")
.select("id","type","owner".as("username"))
.repartitionByCassandraReplica("iot","users")
.joinWithCassandraTable("iot","users")
.on(SomeColumns("username"))
Cassandra-Aware Partitioning
41
[Diagram: after repartitionByCassandraReplica, Alice's sensor rows (ids 1-3) land on the replica that owns Alice's users partition and Alex's rows (ids 4-5) on Alex's replica, so joinWithCassandraTable reads locally]
• Suboptimal code
Shuffle-Free Grouping
42
sc.cassandraTable("iot","sensors_by_user")
.select("username","id","type")
.as((u:String,i:Int,t:String)=>(u,(i,t)))
.groupByKey
• Shuffling eliminated at no extra cost
Shuffle-Free Grouping
43
sc.cassandraTable("iot","sensors_by_user")
.select("username","id","type")
.as((u:String,i:Int,t:String)=>(u,(i,t)))
.spanByKey
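A minimal plain-Scala sketch of the span-based grouping idea: Cassandra returns rows ordered by partition key, so rows with the same key arrive as one consecutive run and can be grouped in a single pass with no shuffle. `spanBy` here is an illustrative stand-in for the Connector's `spanByKey`:

```scala
// Group consecutive runs of equal keys in an already-ordered sequence.
def spanBy[K, V](rows: Seq[(K, V)]): Seq[(K, Seq[V])] =
  rows.foldLeft(Vector.empty[(K, Vector[V])]) { (acc, row) =>
    val (key, value) = row
    acc.lastOption match {
      // Same key as the current run: extend it in place.
      case Some((k, vs)) if k == key => acc.init :+ (k -> (vs :+ value))
      // New key: the previous run is complete; start a new one.
      case _                         => acc :+ (key -> Vector(value))
    }
  }
```

Unlike a hash-based groupByKey, this never moves rows between partitions; it only relies on the input ordering that Cassandra already provides.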
Saving Data to C*
44
rdd.saveToCassandra("iot","users",
SomeColumns("username", "age"))
Output settings (defaults in parentheses):
• spark.cassandra.output.consistency.level (LOCAL_QUORUM)
• spark.cassandra.output.batch.grouping.key (Partition)
• spark.cassandra.output.batch.size.bytes (1024)
• spark.cassandra.output.batch.grouping.buffer.size (1000)
• spark.cassandra.output.concurrent.writes (5)
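The batching policy these settings describe (rows buffered, grouped by partition key, and flushed in bounded batches) can be sketched in plain Scala. `batchByPartition` is an illustrative helper, not Connector API, and the real limit is in bytes rather than rows:

```scala
// Group buffered rows by partition key, then cut each group into batches
// of at most batchSize rows, so every batch targets a single partition.
def batchByPartition[K, V](rows: Seq[(K, V)], batchSize: Int): Seq[Seq[(K, V)]] =
  rows.groupBy(_._1).values.flatMap(_.grouped(batchSize)).toSeq
```

Single-partition batches are what make Cassandra-side batching cheap: each batch is applied on one replica set instead of fanning out across the cluster.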
1 Modern Big Data and Cloud Applications
2 Cassandra and Spark Highlights
3 Architecture Overview
4 Languages and APIs
5 Live Demo
45
Artem Chebotko
achebotko@datastax.com
www.linkedin.com/in/artemchebotko
46


Editor's Notes

  • #12 Apple-scale transactions per second; 1000-node clusters; multiple data centers
  • #13 Personalization, product catalogs, fraud detection, Internet of Things, messaging
  • #14 In-memory
  • #18 This is a quick review of what you already know about Cassandra: peer-to-peer architecture; failure tolerance and availability; the Cassandra token ring; data structures (tables); data distribution (partition key); data replication (replication factor); data consistency (consistency level)
  • #21 Master-worker architecture: the Master (aka cluster manager) manages Workers and their resources; Workers instantiate Executors and give them resources (cores and memory); the Driver schedules computation directly with Executors. Failure tolerance: if a Worker or Executor fails, its computation is picked up by the remaining Workers and Executors; if the Master fails, a new Master is elected (DSE feature), running applications are not affected, and new applications are only affected temporarily until the new Master is started
  • #36 spark.cassandra.input.split.size_in_mb (64); spark.cassandra.input.consistency.level (LOCAL_ONE); spark.cassandra.input.fetch.size_in_rows (1000)
  • #38 Shuffling costs: disk IO, network traffic, partitioning, external sorting, serialization, deserialization, compression