BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENEVA
HAMBURG COPENHAGEN LAUSANNE MUNICH STUTTGART VIENNA ZURICH
Apache Cassandra for Timeseries- and
Graph-Data
Guido Schmutz
Guido Schmutz
• Working for Trivadis for more than 18 years
• Oracle ACE Director for Fusion Middleware and SOA
• Co-Author of different books
• Consultant, Trainer, Software Architect for Java, Oracle, SOA and
Big Data / Fast Data
• Member of Trivadis Architecture Board
• Technology Manager @ Trivadis
• More than 25 years of software development
experience
• Contact: guido.schmutz@trivadis.com
• Blog: http://guidoschmutz.wordpress.com
• Twitter: gschmutz
2
Agenda
1. Customer Use Case and Architecture
2. Cassandra Data Modeling
3. Cassandra for Timeseries Data
4. Cassandra for Graph Data
5. Summary
3
Customer Use Case and
Architecture
4
Research Project @ Armasuisse W&T
W&T flagship project, standing
for innovation & tech transfer
Building capabilities in the
areas of:
• Social Media Intelligence
(SOCMINT)
• Big Data Technologies &
Architectures
Invest into new, innovative and not
widely-proven technology
• Batch analysis
• Real-time analysis
• NoSQL databases
• Text analysis (NLP)
• …
3 Phases: June 2013 – June 2015
5
SOCMINT Demonstrator – Time Dimension
Major data model: Time
series (TS)
TS reflect user behaviors
over time
Activities correlate with
events
Anomaly detection
Event detection &
prediction
6
SOCMINT Demonstrator – Social Dimension
User-user networks (social graphs):
Twitter follower, retweet and
mention graphs
Who is central in a social
network?
Who has retweeted a given
tweet to whom?
7
SOCMINT Demonstrator - “Lambda Architecture” for Big Data

[Diagram: data sources (Social, RDBMS, Sensor, ERP, Logfiles, Mobile, Machine) feed a data-collection channel. An (analytical) batch data-processing layer (raw data reservoir, batch compute, batch result store) and an (analytical) real-time data-processing layer (messaging, stream/event processing, real-time result store) both publish computed information into result stores behind a query engine; the data-access layer serves reports, services, analytic tools and alerting tools. Legend: data in motion vs. data at rest.]
8
SOCMINT Demonstrator – Frameworks & Components in Use

[Diagram: the same Lambda Architecture picture as on the previous slide, annotated with the concrete frameworks and components used in each layer.]
10
SOCMINT Demonstrator – Cassandra Cluster
6-node cluster based on DataStax Enterprise
(DSE)
Installed in a virtualized environment, but we
control the placement on disk
We only keep 3 days of data
• use Cassandra's TTL feature to
automatically erase old data
Cassandra supports both Timeseries and
Connected-Data (Graph)
[Diagram: ring of six Cassandra nodes, Node 1 – Node 6]
11
Cassandra Data Modeling
12
Cassandra Data Modelling
13
• Don’t think relational
• Denormalize, Denormalize, Denormalize ….
• Rows are gigantic and sorted = one row is stored on one node
• Know your application/use cases => from query to model
• Index is not an afterthought, anymore => “index” upfront
• Control physical storage structure
Static Column Family – “Skinny Row”
14
rowkey
CREATE TABLE skinny (rowkey text,
c1 text,
c2 text,
c3 text,
PRIMARY KEY (rowkey));
Grows up to billions of rows
rowkey-1 c1 c2 c3
value-c1 value-c2 value-c3
rowkey-2 c1 c3
value-c1 value-c3
rowkey-3 c1 c2 c3
value-c1 value-c2 value-c3
c1 c2 c3
Partition	Key
Dynamic Column Family – “Wide Row”
15
rowkey
Billions of rows
rowkey-1 ckey-1:c1 ckey-1:c2
value-c1 value-c2
rowkey-2
rowkey-3
CREATE TABLE wide (rowkey text,
ckey text,
c1 text,
c2 text,
PRIMARY KEY (rowkey, ckey))
WITH CLUSTERING ORDER BY (ckey ASC);
ckey-2:c1 ckey-2:c2
value-c1 value-c2
ckey-3:c1 ckey-3:c2
value-c1 value-c2
ckey-1:c1 ckey-1:c2
value-c1 value-c2
ckey-2:c1 ckey-2:c2
value-c1 value-c2
ckey-1:c1 ckey-1:c2
value-c1 value-c2
ckey-2:c1 ckey-2:c2
value-c1 value-c2
ckey-3:c1 ckey-3:c2
value-c1 value-c2
Columns 1 … 2 billion
Partition	Key Clustering Key
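To make the storage layout above concrete, here is a minimal in-memory sketch (a teaching aid, not Cassandra's actual implementation) of how a wide-row partition keeps its cells ordered by the clustering key, mirroring `PRIMARY KEY (rowkey, ckey) WITH CLUSTERING ORDER BY (ckey ASC)`:

```python
from collections import defaultdict

class WideRowSketch:
    """Hypothetical model of a wide row: one partition per rowkey,
    cells returned in clustering-key order."""

    def __init__(self):
        self.partitions = defaultdict(dict)  # rowkey -> {ckey: columns}

    def insert(self, rowkey, ckey, **columns):
        self.partitions[rowkey][ckey] = columns

    def select(self, rowkey):
        # Cells come back sorted by clustering key, as Cassandra stores them.
        part = self.partitions[rowkey]
        return [(ckey, part[ckey]) for ckey in sorted(part)]

table = WideRowSketch()
table.insert("rowkey-1", "ckey-2", c1="value-c1", c2="value-c2")
table.insert("rowkey-1", "ckey-1", c1="value-c1", c2="value-c2")
print(table.select("rowkey-1"))  # ckey-1 cells first, then ckey-2
```

The point of the sketch: ordering within a partition is free at read time because it is maintained at write time, which is exactly what the timeline tables on the following slides exploit.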
Cassandra for Timeseries Data
16
Know your application => From query to model
17
Show Timeline of
Tweets
Show Timeseries on
different levels of
aggregation
(resolution)
• Seconds
• Minutes
• Hours
Show Timeline: Provide Raw Data (Tweets)
18
CREATE TABLE tweet (tweet_id bigint,
username text,
message text,
hashtags list<text>,
latitude double,
longitude double,
…
PRIMARY KEY(tweet_id));
• Skinny Row Table
• Holds the sensor raw data =>
Tweets
• Similar to a relational table
• Primary Key is the partition key
10000121 username message hashtags latitude longitude
gschmutz Getting	ready	for.. [cassandra,	nosql] 0 0
20121223 username message hashtags latitude longitude
DataStax The Speed	Factor	.. [BigData] 0 0
tweet_id
Partition	Key Clustering Key
Show Timeline: Provide Raw Data (Tweets)
19
INSERT INTO tweet (tweet_id, username, message, hashtags, latitude,
longitude) VALUES (10000121, 'gschmutz', 'Getting ready for my talk about
using Cassandra for Timeseries and Graph Data', ['cassandra', 'nosql'],
0,0);
SELECT tweet_id, username, hashtags, message FROM tweet
WHERE tweet_id = 10000121;

tweet_id | username | hashtags               | message
---------+----------+------------------------+----------------------------
10000121 | gschmutz | ['cassandra', 'nosql'] | Getting ready for ...
Partition	Key Clustering Key
Show Timeline: Provide Sequence of Events
20
CREATE TABLE tweet_timeline (
sensor_id text,
bucket_id text,
time_id timestamp,
tweet_id bigint,
PRIMARY KEY((sensor_id, bucket_id), time_id))
WITH CLUSTERING ORDER BY (time_id DESC);
Wide Row Table
bucket-id creates buckets
for columns
• SECOND-2015-10-14
ABC-001:SECOND-2015-10-14 10:00:02:tweet-id
10000121	
DEF-931:SECOND-2015-10-14 10:09:02:tweet-id
1003121343
09:12:09:tweet-id
1002111343
09:10:02:tweet-id
1001121343
Partition	Key Clustering Key
Show Timeline: Provide Sequence of Events
21
INSERT INTO tweet_timeline (sensor_id, bucket_id, time_id, tweet_id)
VALUES ('ABC-001', 'SECOND-2015-10-14', '2015-10-14 10:50:00', 10000121);

SELECT * FROM tweet_timeline
WHERE sensor_id = 'ABC-001' AND bucket_id = 'SECOND-2015-10-14'
AND time_id <= '2015-10-14 12:00:00';
sensor_id | bucket_id | time_id | tweet_id
----------+-------------------+--------------------------+----------
ABC-001 | SECOND-2015-10-14 | 2015-10-14 11:53:00+0000 | 10020334
ABC-001 | SECOND-2015-10-14 | 2015-10-14 10:52:00+0000 | 10000334
ABC-001 | SECOND-2015-10-14 | 2015-10-14 10:51:00+0000 | 10000127
ABC-001 | SECOND-2015-10-14 | 2015-10-14 10:50:00+0000 | 10000121
Sorted	by	time_id
Partition	Key Clustering Key
Show Timeseries: Provide list of counts
22
CREATE TABLE tweet_count (
sensor_id text,
bucket_id text,
key text,
time_id timestamp,
count counter,
PRIMARY KEY((sensor_id, bucket_id), key, time_id))
WITH CLUSTERING ORDER BY (key ASC, time_id DESC);
Wide Row Table
bucket-id creates buckets
for columns
• SECOND-2015-10-14
• HOUR-2015-10
• DAY-2015-10
ABC-001:HOUR-2015-10 ALL:10:00:count
1’550
ABC-001:DAY-2015-10 ALL:14-OCT:count
105’999
ALL:13-OCT:count
120’344
nosql:14-OCT:count
2’532
ALL:09:00:count
2’299
nosql:08:00:count
25
30d	*	24h	*	n	keys	=	n	*	720	cols
Partition	Key Clustering Key
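The bucket_id values above (SECOND-2015-10-14, HOUR-2015-10, DAY-2015-10) can be derived from an event timestamp. A minimal sketch, assuming the naming convention shown on the slide (the helper name and formats are illustrative, not from any library):

```python
from datetime import datetime

def bucket_ids(ts: datetime) -> dict:
    """Derive the assumed bucket_id strings for each resolution."""
    return {
        "second": ts.strftime("SECOND-%Y-%m-%d"),  # one partition per day of per-second data
        "hour":   ts.strftime("HOUR-%Y-%m"),       # one partition per month of hourly counts
        "day":    ts.strftime("DAY-%Y-%m"),        # one partition per month of daily counts
    }

print(bucket_ids(datetime(2015, 10, 14, 10, 0)))
# {'second': 'SECOND-2015-10-14', 'hour': 'HOUR-2015-10', 'day': 'DAY-2015-10'}
```

Bucketing by calendar period keeps partitions bounded (the 30d * 24h * n keys = n * 720 columns estimate above) instead of growing without limit.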
Show Timeseries: Provide list of counts
23
UPDATE tweet_count SET count = count + 1
WHERE sensor_id = 'ABC-001' AND bucket_id = 'HOUR-2015-10'
AND key = 'ALL' AND time_id = '2015-10-14 10:00:00';

SELECT * FROM tweet_count
WHERE sensor_id = 'ABC-001' AND bucket_id = 'HOUR-2015-10'
AND key = 'ALL' AND time_id >= '2015-10-14 08:00:00';
sensor_id | bucket_id | key | time_id | count
----------+--------------+-----+--------------------------+-------
ABC-001 | HOUR-2015-10 | ALL | 2015-10-14 12:00:00+0000 | 100230
ABC-001 | HOUR-2015-10 | ALL | 2015-10-14 11:00:00+0000 | 102230
ABC-001 | HOUR-2015-10 | ALL | 2015-10-14 10:00:00+0000 | 105430
ABC-001 | HOUR-2015-10 | ALL | 2015-10-14 09:00:00+0000 | 203240
ABC-001 | HOUR-2015-10 | ALL | 2015-10-14 08:00:00+0000 | 132230
Partition	Key Clustering Key
Processing Pipeline
Kafka provides reliable and efficient queuing
Storm processes the stream (rollups, counts)
Cassandra stores the results at the same speed
Queuing → Processing → Storing
24
Twitter
Sensor 1
Twitter
Sensor 2
Twitter
Sensor 3
Visualization
Application
Visualization
Application
Processing Pipeline – Stream-Processing with Apache Storm
25
Pre-processes data before storing it in different Cassandra tables
Implemented in Java
Using DataStax Java driver for writing to Cassandra (similar to JDBC)
Kafka
Sentence
Splitter
Kafka
Spout
Word
Counter
Sentence
Splitter
Word
Counter
Who	 will	 win:	 Barca,	Real,	
Juve or	Bayern?	…	
bit.ly/1yRsPmE	 #fcb
#barca
…	#barca
…	#fcb real	=	1
juve =	1	
barca =	2
bayern =	1
INCR
barca
INCR
real
INCR
juve
INCR
barca
INCR
bayern
real
juve
barca
barca
bayern
fcb
fcb =	1	
INCR
fcb
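The word-count flow sketched above (split tweets into tokens, increment one Cassandra counter per tag) can be illustrated outside of Storm. A minimal sketch with an illustrative regex and made-up sample tweets:

```python
import re
from collections import Counter

def count_hashtags(tweets):
    """Split each tweet and count hashtag occurrences; each increment
    corresponds to one counter UPDATE (the INCR arrows in the diagram)."""
    counts = Counter()
    for tweet in tweets:
        for tag in re.findall(r"#(\w+)", tweet.lower()):
            counts[tag] += 1
    return counts

counts = count_hashtags([
    "Who will win: Barca, Real, Juve or Bayern? #fcb #barca",
    "... #barca",
    "... #fcb",
])
print(counts["barca"], counts["fcb"])  # 2 2
```

In the real topology the splitter and counter run as separate bolts so the counting stage can be parallelized independently of parsing.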
Cassandra for Graph-Data
26
Using Cassandra for Social Dimension
27
Introduction to the Graph Model – Property Graph
Node / Vertex
• Represent Entities
• Can contain properties
(key-value pairs)
Relationship / Edge
• Lines between nodes
• may be directed or
undirected
Properties
• Values about a node or
relationship
• Allow adding semantics to
relationships
User	1 Tweet
author
follow retweet
User	2
Id:	16134540
name:	cloudera
location:	Palo	Alto
Id:	18898576
name:	gschmutz
location:	Berne
Id:	18898576
text:	Join	BigData..
time:	June	11	2015
time:	June	11	2015
key:	 value
28
Titan:DB – Graph Database
Optimized to work against billions of nodes and edges
• Theoretical limit of 2^60 edges and half as many (2^59) vertices
Works with several different distributed databases
• Apache Cassandra, Apache HBase, Oracle BerkeleyDB and Amazon DynamoDB
Supports many concurrent users doing complex graph traversals
simultaneously
Native integration with TinkerPop stack
Created by Thinkaurelius (http://thinkaurelius.com/), now part of DataStax
29
Titan:DB Architecture
30
Titan:DB – Schema and Data Modeling
A Titan graph has a schema comprising its edge labels, property keys, and
vertex labels
The schema can be defined explicitly or implicitly
The schema can evolve over time without interrupting normal operations
mgmt = graph.openManagement()
person = mgmt.makeVertexLabel('person').make()
birthDate = mgmt.makePropertyKey('birthDate')
.dataType(Long.class)
.cardinality(Cardinality.SINGLE).make()
name = mgmt.makePropertyKey('name').dataType(String.class)
.cardinality(Cardinality.SET).make()
SOCMINT Data Model
User Post
Term
author(time,targetId)
follow
useHashtag
retweetOf(time,targetId)
mention(time)	
mentionOf(time)
User:
#userId =>	userId (as	String)
name	=>	screenName
language	=>	lang
profileImageUrlHttps
location	=>	location
time	=>	createdAt
pageRank
lastUpdateTime
useUrl
Place:
#placeId=>	id	(as	String)
street	=>	street
name	=>	fullName
country	=>	country
type	=>	placeType
url =>	placeUrl
lastUpdateTime
retweet(time)
reply(time)	
replyTo(time)	
Place
placed(time)
Term:
#value=>	hashtag or	url value
type	=>	“hashtag”	or	“url”
lastUpdateTime
reply(time)
Post:
#postId =>	tweetId(as	String)
time	=>	createAt
targetIds=>	targetIds
language	=>	lang
coordinate	=>	latitude	+	longitude
lastUpdateTime
32
TinkerPop 3 Stack
• TinkerPop is a framework composed of
various interoperable components
• Vendor independent (similar to JDBC for
RDBMS)
• Core API defines Graph, Vertex, Edge, …
• Gremlin traversal language is a vendor-independent
way to query (traverse) a graph
• Gremlin server can be leveraged to allow over
the wire communication with a TinkerPop
enabled graph system
http://tinkerpop.incubator.apache.org/
33
Gremlin – a graph query language
Imperative graph traversal language
• Sequence of “steps” of the computation
Must understand structure of graph
peter
paul
roger
ken
eva
bob
marc
g.V(1).out('follow').out('follow').count()
g.V(1).repeat(out('follow')).times(2).count()
follow
follow
follow
follow
follow
follow
follow
or
34
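What those two equivalent traversals compute can be sketched in plain code. A minimal stand-in for the follow graph on the slide (the adjacency data is illustrative), counting everything reachable over exactly two follow hops; note that Gremlin counts traversers, not distinct vertices, so a vertex reached twice counts twice:

```python
# Illustrative adjacency list for a small 'follow' graph.
follow = {
    "peter": ["paul", "roger"],
    "paul":  ["ken"],
    "roger": ["ken", "eva"],
    "ken":   [],
    "eva":   [],
}

def two_hop_count(start):
    """Mirror g.V(start).repeat(out('follow')).times(2).count()."""
    traversers = [start]
    for _ in range(2):  # two out('follow') steps
        traversers = [w for v in traversers for w in follow[v]]
    return len(traversers)

print(two_hop_count("peter"))  # 3: ken via paul, ken and eva via roger
```

This is also why you "must understand the structure of the graph": the traversal itself encodes which edge labels to follow and in which direction.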
Summary
35
Summary
36
Cassandra is an always-on
database
Ability to collect and analyze
massive volumes of data in
sequence at extremely high
velocity
Forget (some of) your existing
database modeling skills
Cassandra is an excellent fit for
time series data
Cassandra is no longer “just a” column
family database => Multi-Model
Database
• DSE Search
• JSON support
• DSE Graph
• DSE Timeseries
• Spark Support
Summary - Know your domain
[Diagram: data stores arranged by connectedness of data, from low to high:
Key-Value Stores, Wide-Column Stores, Document Data Stores, Relational
Databases, Graph Databases]
38