8. Cassandra Schema
Base Table
  User Id (Partitioning Key) | Date | Total Steps
Materialized View (ordered by Total Steps)
  Date (Partitioning Key) | User Id (Clustering Key) | Total Steps (Clustering Key)
In each table, the key columns together form the Primary Key.
9. Challenges and Learnings
Spark
To avoid a read from Cassandra, I used Spark in-memory
computation on the DStream with updateStateByKey(updateFunc).
Spark workers ran out of memory when scaled up.
Cassandra
Inserting data into two different tables, a base table and a
sorted data table, led to consistency issues.
10. Anurag Tiwari
• Staff Design Engineer
• Silicon Program Manager
• CM Program Manager
• Member of Technical Staff
• Ph.D. Computer Science and Engineering
12. Challenges and Learnings
To avoid a read from Cassandra, I used Spark in-memory
computation on the DStream with updateStateByKey(updateFunc).
[Figure: a DStream of RDD batches (5000 records) is combined with the Previous State (5M records); updateFunc is called on 5M records.]
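The cost shown on this slide can be illustrated with a small plain-Python model. This is only a sketch of the per-batch semantics of updateStateByKey, not Spark code: update_state_by_key, update_func, and the scaled-down record counts are all assumptions made for illustration. The point is that the update function runs once for every key held in the state, not only for the keys that arrived in the current batch.

```python
# Toy model of updateStateByKey semantics; this is not Spark itself.
from collections import defaultdict


def update_func(new_values, running_total):
    """Add this batch's step counts to the running total for one user."""
    return sum(new_values) + (running_total or 0)


def update_state_by_key(batch, state, func):
    """Apply func to every key found in the state or in the batch."""
    grouped = defaultdict(list)
    for key, value in batch:
        grouped[key].append(value)

    calls = 0
    new_state = {}
    for key in state.keys() | grouped.keys():
        new_state[key] = func(grouped.get(key, []), state.get(key))
        calls += 1
    return new_state, calls


# Scaled-down stand-ins for the 5M-record state and the 5000-record batch.
state = {user: 100 for user in range(50_000)}
batch = [(user, 10) for user in range(50)]

state, calls = update_state_by_key(batch, state, update_func)
print(calls)           # how many times update_func ran for this one batch
print(state[0])        # a user who appeared in the batch
print(state[49_999])   # a user who did not appear in the batch
```

In this model a batch of 50 records still triggers 50,000 calls, because the whole state is revisited on every batch. The model leaves out details of the real operator, such as removing a key when the update function returns nothing.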
13. Cassandra Schema
CREATE TABLE rank_steps.walkers_steps2 (
    user int,
    arrival_time text,
    num_steps int,
    PRIMARY KEY (user, arrival_time)
) WITH CLUSTERING ORDER BY (arrival_time ASC);

CREATE MATERIALIZED VIEW rank_steps.top_walkers8 AS
    SELECT arrival_time, num_steps, user
    FROM rank_steps.walkers_steps2
    WHERE user IS NOT NULL AND num_steps IS NOT NULL
        AND arrival_time IS NOT NULL
    PRIMARY KEY (arrival_time, num_steps, user)
    WITH CLUSTERING ORDER BY (num_steps DESC, user ASC);
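To make the effect of the view's clustering order concrete, here is a minimal plain-Python sketch of what reading one partition of the view amounts to. It does not talk to Cassandra; the helper top_walkers and the sample rows are hypothetical, chosen only to show the ordering (num_steps descending, then user ascending) within a single arrival_time partition.

```python
# Illustrative rows shaped as (user, arrival_time, num_steps); values are made up.
base_rows = [
    (1, "2016-01-01", 4000),
    (2, "2016-01-01", 9000),
    (3, "2016-01-01", 9000),
    (1, "2016-01-02", 7000),
]


def top_walkers(rows, arrival_time, limit):
    """Return (num_steps, user) pairs for one partition in the view's order."""
    partition = [(steps, user) for user, t, steps in rows if t == arrival_time]
    # num_steps DESC, then user ASC, mirroring the view's CLUSTERING ORDER BY.
    partition.sort(key=lambda row: (-row[0], row[1]))
    return partition[:limit]


top = top_walkers(base_rows, "2016-01-01", 2)
print(top)  # [(9000, 2), (9000, 3)]
```

Because the view stores rows already sorted this way, a top-N read for a given date is a prefix of the partition rather than a sort at query time.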
14. Materialized Views (MV) in Cassandra 3.0
Eliminate the need for developers to denormalize data by hand:
no need to create multiple tables for different queries.
Can be queried like any Cassandra table.
Persistent view, NOT an SQL view.
Updates propagate automatically from the base table to the MV,
ensuring eventual consistency.
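The propagation described above can be pictured with a toy in-memory model. This is a sketch under stated assumptions, not how Cassandra is implemented: the class ToyTable and its methods are hypothetical, and the view is updated synchronously here, whereas the slide describes eventual consistency. It shows only the bookkeeping that a single write implies: the stale view row is dropped and the current one is added.

```python
class ToyTable:
    """In-memory stand-in for a base table with one materialized view."""

    def __init__(self):
        self.base = {}  # (user, arrival_time) -> num_steps
        self.view = {}  # arrival_time -> set of (num_steps, user)

    def upsert(self, user, arrival_time, num_steps):
        """Write to the base table and keep the view in step with it."""
        key = (user, arrival_time)
        partition = self.view.setdefault(arrival_time, set())
        old = self.base.get(key)
        if old is not None:
            partition.discard((old, user))  # drop the stale view row
        self.base[key] = num_steps
        partition.add((num_steps, user))    # add the current view row

    def top(self, arrival_time, limit):
        """Read one view partition: num_steps descending, then user ascending."""
        rows = self.view.get(arrival_time, set())
        return sorted(rows, key=lambda row: (-row[0], row[1]))[:limit]


table = ToyTable()
table.upsert(1, "2016-01-01", 4000)
table.upsert(2, "2016-01-01", 6000)
table.upsert(1, "2016-01-01", 8000)  # user 1 overtakes user 2
ranking = table.top("2016-01-01", 10)
print(ranking)  # [(8000, 1), (6000, 2)]
```

The application issues a single write per update; keeping the second, sorted copy current is the table's job, which is the responsibility the MV takes over from the developer.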