BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENEVA
HAMBURG COPENHAGEN LAUSANNE MUNICH STUTTGART VIENNA ZURICH
Apache Cassandra for Timeseries- and
Graph-Data
Guido Schmutz
Guido Schmutz
• Working for Trivadis for more than 18 years
• Oracle ACE Director for Fusion Middleware and SOA
• Co-Author of different books
• Consultant, Trainer, Software Architect for Java, Oracle, SOA and
Big Data / Fast Data
• Member of Trivadis Architecture Board
• Technology Manager @ Trivadis
• More than 25 years of software development
experience
• Contact: guido.schmutz@trivadis.com
• Blog: http://guidoschmutz.wordpress.com
• Twitter: gschmutz
2
Agenda
1. Customer Use Case and Architecture
2. Cassandra Data Modeling
3. Cassandra for Timeseries Data
4. Cassandra for Graph Data
5. Summary
3
Customer Use Case and
Architecture
4
Research Project @ Armasuisse W&T
W&T flagship project, standing
for innovation & tech transfer
Building capabilities in the
areas of:
• Social Media Intelligence
(SOCMINT)
• Big Data Technologies &
Architectures
Invest into new, innovative and not
widely-proven technology
• Batch analysis
• Real-time analysis
• NoSQL databases
• Text analysis (NLP)
• …
3 Phases: June 2013 – June 2015
5
SOCMINT Demonstrator – Time Dimension
Major data model: Time
series (TS)
TS reflect user behaviors
over time
Activities correlate with
events
Anomaly detection
Event detection &
prediction
6
SOCMINT Demonstrator – Social Dimension
User-user networks (social graphs):
Twitter follower, retweet and
mention graphs
Who is central in a social
network?
Who has retweeted a given
tweet to whom?
7
SOCMINT Demonstrator - “Lambda Architecture” for Big Data

[Diagram: data sources (Social, RDBMS, Sensor, ERP, Logfiles, Mobile, Machine) feed a data-collection channel. An (analytical) batch data-processing layer (raw data reservoir, batch compute, batch result store) and an (analytical) real-time data-processing layer (messaging, stream/event processing, real-time result store) both publish computed information into result stores behind a query engine; the data-access layer serves reports, services, analytic tools and alerting tools. Legend: data in motion vs. data at rest.]
8
SOCMINT Demonstrator – Frameworks & Components in Use

[Diagram: the same Lambda Architecture picture as on the previous slide, annotated with the concrete frameworks and components used in each layer.]
10
SOCMINT Demonstrator – Cassandra Cluster
6-node cluster based on DataStax Enterprise
(DSE)
Installed in a virtualized environment, but we
control the placement on disk
We only keep 3 days of data
• use Cassandra's TTL feature to
automatically erase old data
Cassandra supports both Timeseries and
Connected-Data (Graph)
[Diagram: ring of six Cassandra nodes, Node 1 – Node 6]
11
Cassandra Data Modeling
12
Cassandra Data Modelling
13
• Don’t think relational
• Denormalize, Denormalize, Denormalize ….
• Rows are gigantic and sorted = one row is stored on one node
• Know your application/use cases => from query to model
• Index is not an afterthought, anymore => “index” upfront
• Control physical storage structure
Static Column Family – “Skinny Row”
14
rowkey
CREATE TABLE skinny (rowkey text,
c1 text,
c2 text,
c3 text,
PRIMARY KEY (rowkey));
Grows up to billions of rows
rowkey-1 c1 c2 c3
value-c1 value-c2 value-c3
rowkey-2 c1 c3
value-c1 value-c3
rowkey-3 c1 c2 c3
value-c1 value-c2 value-c3
c1 c2 c3
Partition	Key
Dynamic Column Family – “Wide Row”
15
rowkey
Billions of rows
rowkey-1 ckey-1:c1 ckey-1:c2
value-c1 value-c2
rowkey-2
rowkey-3
CREATE TABLE wide (rowkey text,
ckey text,
c1 text,
c2 text,
PRIMARY KEY (rowkey, ckey))
WITH CLUSTERING ORDER BY (ckey ASC);
ckey-2:c1 ckey-2:c2
value-c1 value-c2
ckey-3:c1 ckey-3:c2
value-c1 value-c2
ckey-1:c1 ckey-1:c2
value-c1 value-c2
ckey-2:c1 ckey-2:c2
value-c1 value-c2
ckey-1:c1 ckey-1:c2
value-c1 value-c2
ckey-2:c1 ckey-2:c2
value-c1 value-c2
ckey-3:c1 ckey-3:c2
value-c1 value-c2
Columns 1 … 2 billion
Partition	Key Clustering Key
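To make the storage layout above concrete, here is a minimal in-memory sketch (a teaching aid, not Cassandra's actual implementation) of how a wide-row partition keeps its cells ordered by the clustering key, mirroring `PRIMARY KEY (rowkey, ckey) WITH CLUSTERING ORDER BY (ckey ASC)`:

```python
from collections import defaultdict

class WideRowSketch:
    """Hypothetical model of a wide row: one partition per rowkey,
    cells returned in clustering-key order."""

    def __init__(self):
        self.partitions = defaultdict(dict)  # rowkey -> {ckey: columns}

    def insert(self, rowkey, ckey, **columns):
        self.partitions[rowkey][ckey] = columns

    def select(self, rowkey):
        # Cells come back sorted by clustering key, as Cassandra stores them.
        part = self.partitions[rowkey]
        return [(ckey, part[ckey]) for ckey in sorted(part)]

table = WideRowSketch()
table.insert("rowkey-1", "ckey-2", c1="value-c1", c2="value-c2")
table.insert("rowkey-1", "ckey-1", c1="value-c1", c2="value-c2")
print(table.select("rowkey-1"))  # ckey-1 cells first, then ckey-2
```

The point of the sketch: ordering within a partition is free at read time because it is maintained at write time, which is exactly what the timeline tables on the following slides exploit.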
Cassandra for Timeseries Data
16
Know your application => From query to model
17
Show Timeline of
Tweets
Show Timeseries on
different levels of
aggregation
(resolution)
• Seconds
• Minutes
• Hours
Show Timeline: Provide Raw Data (Tweets)
18
CREATE TABLE tweet (tweet_id bigint,
username text,
message text,
hashtags list<text>,
latitude double,
longitude double,
…
PRIMARY KEY(tweet_id));
• Skinny Row Table
• Holds the sensor raw data =>
Tweets
• Similar to a relational table
• Primary Key is the partition key
10000121 username message hashtags latitude longitude
gschmutz Getting	ready	for.. [cassandra,	nosql] 0 0
20121223 username message hashtags latitude longitude
DataStax The Speed	Factor	.. [BigData] 0 0
tweet_id
Partition	Key Clustering Key
Show Timeline: Provide Raw Data (Tweets)
19
INSERT INTO tweet (tweet_id, username, message, hashtags, latitude,
longitude) VALUES (10000121, 'gschmutz', 'Getting ready for my talk about
using Cassandra for Timeseries and Graph Data', ['cassandra', 'nosql'],
0,0);
SELECT tweet_id, username, hashtags, message FROM tweet
WHERE tweet_id = 10000121;

tweet_id | username | hashtags               | message
---------+----------+------------------------+----------------------------
10000121 | gschmutz | ['cassandra', 'nosql'] | Getting ready for ...
Partition	Key Clustering Key
Show Timeline: Provide Sequence of Events
20
CREATE TABLE tweet_timeline (
sensor_id text,
bucket_id text,
time_id timestamp,
tweet_id bigint,
PRIMARY KEY((sensor_id, bucket_id), time_id))
WITH CLUSTERING ORDER BY (time_id DESC);
Wide Row Table
bucket-id creates buckets
for columns
• SECOND-2015-10-14
ABC-001:SECOND-2015-10-14 10:00:02:tweet-id
10000121	
DEF-931:SECOND-2015-10-14 10:09:02:tweet-id
1003121343
09:12:09:tweet-id
1002111343
09:10:02:tweet-id
1001121343
Partition	Key Clustering Key
Show Timeline: Provide Sequence of Events
21
INSERT INTO tweet_timeline (sensor_id, bucket_id, time_id, tweet_id)
VALUES ('ABC-001', 'SECOND-2015-10-14', '2015-10-14 10:50:00', 10000121);

SELECT * FROM tweet_timeline
WHERE sensor_id = 'ABC-001' AND bucket_id = 'SECOND-2015-10-14'
AND time_id <= '2015-10-14 12:00:00';
sensor_id | bucket_id | time_id | tweet_id
----------+-------------------+--------------------------+----------
ABC-001 | SECOND-2015-10-14 | 2015-10-14 11:53:00+0000 | 10020334
ABC-001 | SECOND-2015-10-14 | 2015-10-14 10:52:00+0000 | 10000334
ABC-001 | SECOND-2015-10-14 | 2015-10-14 10:51:00+0000 | 10000127
ABC-001 | SECOND-2015-10-14 | 2015-10-14 10:50:00+0000 | 10000121
Sorted	by	time_id
Partition	Key Clustering Key
Show Timeseries: Provide list of counts
22
CREATE TABLE tweet_count (
sensor_id text,
bucket_id text,
key text,
time_id timestamp,
count counter,
PRIMARY KEY((sensor_id, bucket_id), key, time_id))
WITH CLUSTERING ORDER BY (key ASC, time_id DESC);
Wide Row Table
bucket-id creates buckets
for columns
• SECOND-2015-10-14
• HOUR-2015-10
• DAY-2015-10
ABC-001:HOUR-2015-10 ALL:10:00:count
1’550
ABC-001:DAY-2015-10 ALL:14-OCT:count
105’999
ALL:13-OCT:count
120’344
nosql:14-OCT:count
2’532
ALL:09:00:count
2’299
nosql:08:00:count
25
30d	*	24h	*	n	keys	=	n	*	720	cols
Partition	Key Clustering Key
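The bucket_id values above (SECOND-2015-10-14, HOUR-2015-10, DAY-2015-10) can be derived from an event timestamp. A minimal sketch, assuming the naming convention shown on the slide (the helper name and formats are illustrative, not from any library):

```python
from datetime import datetime

def bucket_ids(ts: datetime) -> dict:
    """Derive the assumed bucket_id strings for each resolution."""
    return {
        "second": ts.strftime("SECOND-%Y-%m-%d"),  # one partition per day of per-second data
        "hour":   ts.strftime("HOUR-%Y-%m"),       # one partition per month of hourly counts
        "day":    ts.strftime("DAY-%Y-%m"),        # one partition per month of daily counts
    }

print(bucket_ids(datetime(2015, 10, 14, 10, 0)))
# {'second': 'SECOND-2015-10-14', 'hour': 'HOUR-2015-10', 'day': 'DAY-2015-10'}
```

Bucketing by calendar period keeps partitions bounded (the 30d * 24h * n keys = n * 720 columns estimate above) instead of growing without limit.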
Show Timeseries: Provide list of counts
23
UPDATE tweet_count SET count = count + 1
WHERE sensor_id = 'ABC-001' AND bucket_id = 'HOUR-2015-10'
AND key = 'ALL' AND time_id = '2015-10-14 10:00:00';

SELECT * FROM tweet_count
WHERE sensor_id = 'ABC-001' AND bucket_id = 'HOUR-2015-10'
AND key = 'ALL' AND time_id >= '2015-10-14 08:00:00';
sensor_id | bucket_id | key | time_id | count
----------+--------------+-----+--------------------------+-------
ABC-001 | HOUR-2015-10 | ALL | 2015-10-14 12:00:00+0000 | 100230
ABC-001 | HOUR-2015-10 | ALL | 2015-10-14 11:00:00+0000 | 102230
ABC-001 | HOUR-2015-10 | ALL | 2015-10-14 10:00:00+0000 | 105430
ABC-001 | HOUR-2015-10 | ALL | 2015-10-14 09:00:00+0000 | 203240
ABC-001 | HOUR-2015-10 | ALL | 2015-10-14 08:00:00+0000 | 132230
Partition	Key Clustering Key
Processing Pipeline
Kafka provides reliable and efficient queuing
Storm processes the stream (rollups, counts)
Cassandra stores the results at the same speed
Queuing → Processing → Storing
24
Twitter
Sensor 1
Twitter
Sensor 2
Twitter
Sensor 3
Visualization
Application
Visualization
Application
Processing Pipeline – Stream-Processing with Apache Storm
25
Pre-processes data before storing it in different Cassandra tables
Implemented in Java
Using DataStax Java driver for writing to Cassandra (similar to JDBC)
Kafka
Sentence
Splitter
Kafka
Spout
Word
Counter
Sentence
Splitter
Word
Counter
Who	 will	 win:	 Barca,	Real,	
Juve or	Bayern?	…	
bit.ly/1yRsPmE	 #fcb
#barca
…	#barca
…	#fcb real	=	1
juve =	1	
barca =	2
bayern =	1
INCR
barca
INCR
real
INCR
juve
INCR
barca
INCR
bayern
real
juve
barca
barca
bayern
fcb
fcb =	1	
INCR
fcb
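The word-count flow sketched above (split tweets into tokens, increment one Cassandra counter per tag) can be illustrated outside of Storm. A minimal sketch with an illustrative regex and made-up sample tweets:

```python
import re
from collections import Counter

def count_hashtags(tweets):
    """Split each tweet and count hashtag occurrences; each increment
    corresponds to one counter UPDATE (the INCR arrows in the diagram)."""
    counts = Counter()
    for tweet in tweets:
        for tag in re.findall(r"#(\w+)", tweet.lower()):
            counts[tag] += 1
    return counts

counts = count_hashtags([
    "Who will win: Barca, Real, Juve or Bayern? #fcb #barca",
    "... #barca",
    "... #fcb",
])
print(counts["barca"], counts["fcb"])  # 2 2
```

In the real topology the splitter and counter run as separate bolts so the counting stage can be parallelized independently of parsing.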
Cassandra for Graph-Data
26
Using Cassandra for Social Dimension
27
Introduction to the Graph Model – Property Graph
Node / Vertex
• Represent Entities
• Can contain properties
(key-value pairs)
Relationship / Edge
• Lines between nodes
• may be directed or
undirected
Properties
• Values about a node or
relationship
• Allow adding semantics to
relationships
User	1 Tweet
author
follow retweet
User	2
Id:	16134540
name:	cloudera
location:	Palo	Alto
Id:	18898576
name:	gschmutz
location:	Berne
Id:	18898576
text:	Join	BigData..
time:	June	11	2015
time:	June	11	2015
key:	 value
28
Titan:DB – Graph Database
Optimized to work against billions of nodes and edges
• Theoretical limit of 2^60 edges and half as many (2^59) vertices
Works with several different distributed databases
• Apache Cassandra, Apache HBase, Oracle BerkeleyDB and Amazon DynamoDB
Supports many concurrent users doing complex graph traversals
simultaneously
Native integration with TinkerPop stack
Created by Thinkaurelius (http://thinkaurelius.com/), now part of DataStax
29
Titan:DB Architecture
30
Titan:DB – Schema and Data Modeling
A Titan graph has a schema comprising its edge labels, property keys, and
vertex labels
The schema can be defined explicitly or implicitly
The schema can evolve over time without interrupting normal operations
mgmt = graph.openManagement()
person = mgmt.makeVertexLabel('person').make()
birthDate = mgmt.makePropertyKey('birthDate')
.dataType(Long.class)
.cardinality(Cardinality.SINGLE).make()
name = mgmt.makePropertyKey('name').dataType(String.class)
.cardinality(Cardinality.SET).make()
SOCMINT Data Model
User Post
Term
author(time,targetId)
follow
useHashtag
retweetOf(time,targetId)
mention(time)	
mentionOf(time)
User:
#userId =>	userId (as	String)
name	=>	screenName
language	=>	lang
profileImageUrlHttps
location	=>	location
time	=>	createdAt
pageRank
lastUpdateTime
useUrl
Place:
#placeId=>	id	(as	String)
street	=>	street
name	=>	fullName
country	=>	country
type	=>	placeType
url =>	placeUrl
lastUpdateTime
retweet(time)
reply(time)	
replyTo(time)	
Place
placed(time)
Term:
#value=>	hashtag or	url value
type	=>	“hashtag”	or	“url”
lastUpdateTime
reply(time)
Post:
#postId =>	tweetId(as	String)
time	=>	createAt
targetIds=>	targetIds
language	=>	lang
coordinate	=>	latitude	+	longitude
lastUpdateTime
32
TinkerPop 3 Stack
• TinkerPop is a framework composed of
various interoperable components
• Vendor independent (similar to JDBC for
RDBMS)
• Core API defines Graph, Vertex, Edge, …
• Gremlin traversal language is a vendor-independent
way to query (traverse) a graph
• Gremlin server can be leveraged to allow over
the wire communication with a TinkerPop
enabled graph system
http://tinkerpop.incubator.apache.org/
33
Gremlin – a graph query language
Imperative graph traversal language
• Sequence of “steps” of the computation
Must understand structure of graph
peter
paul
roger
ken
eva
bob
marc
g.V(1).out('follow').out('follow').count()
g.V(1).repeat(out('follow')).times(2).count()
follow
follow
follow
follow
follow
follow
follow
or
34
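What those two equivalent traversals compute can be sketched in plain code. A minimal stand-in for the follow graph on the slide (the adjacency data is illustrative), counting everything reachable over exactly two follow hops; note that Gremlin counts traversers, not distinct vertices, so a vertex reached twice counts twice:

```python
# Illustrative adjacency list for a small 'follow' graph.
follow = {
    "peter": ["paul", "roger"],
    "paul":  ["ken"],
    "roger": ["ken", "eva"],
    "ken":   [],
    "eva":   [],
}

def two_hop_count(start):
    """Mirror g.V(start).repeat(out('follow')).times(2).count()."""
    traversers = [start]
    for _ in range(2):  # two out('follow') steps
        traversers = [w for v in traversers for w in follow[v]]
    return len(traversers)

print(two_hop_count("peter"))  # 3: ken via paul, ken and eva via roger
```

This is also why you "must understand the structure of the graph": the traversal itself encodes which edge labels to follow and in which direction.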
Summary
35
Summary
36
Cassandra is an always-on
database
Ability to collect and analyze
massive volumes of data in
sequence at extremely high
velocity
Forget (some of) your existing
database modeling skills
Cassandra is an excellent fit for
time series data
Cassandra is no longer “just a” column
family database => Multi-Model
Database
• DSE Search
• JSON support
• DSE Graph
• DSE Timeseries
• Spark Support
Summary - Know your domain
[Diagram: data stores arranged by connectedness of data, from low to high:
Key-Value Stores, Wide-Column Stores, Document Data Stores, Relational
Databases, Graph Databases]
38