
Apache Cassandra for Timeseries- and Graph-Data


Apache Cassandra has proven to be one of the best solutions for storing and retrieving data at high velocity and high volume.
In the first part of the talk we will discuss how the storage model of Cassandra is ideal for time series use cases, which are often of high velocity and high volume. Time series data is everywhere today: Internet of Things, sensor data, transactional data, social media streams. We go over examples of how to best build data models.
We will also cover pairing Apache Spark with Apache Cassandra to create a real time data analytics platform.

The second part of the talk will present Titan:db, an open source distributed graph database built on top of Cassandra that can power real-time applications with thousands of concurrent users over graphs with billions of edges. It exposes a property graph data model directly atop Cassandra, which makes storing and querying relationship data fast, easy, and scalable to huge graphs. This talk demonstrates how Titan's features enable complex, multi-relational databases in Cassandra and discusses how Titan:db has been used in a customer case to store social network data.


Apache Cassandra for Timeseries- and Graph-Data

  1. 1. BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENEVA HAMBURG COPENHAGEN LAUSANNE MUNICH STUTTGART VIENNA ZURICH Apache Cassandra for Timeseries- and Graph-Data Guido Schmutz
  2. 2. Guido Schmutz • Working for Trivadis for more than 18 years • Oracle ACE Director for Fusion Middleware and SOA • Co-author of several books • Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data • Member of Trivadis Architecture Board • Technology Manager @ Trivadis • More than 25 years of software development experience • Contact: guido.schmutz@trivadis.com • Blog: http://guidoschmutz.wordpress.com • Twitter: gschmutz 2
  3. 3. Agenda 1. Customer Use Case and Architecture 2. Cassandra Data Modeling 3. Cassandra for Timeseries Data 4. Cassandra for Graph Data 5. Summary 3
  4. 4. Customer Use Case and Architecture 4
  5. 5. Research Project @ Armasuisse W+T W+T flagship project, standing for innovation & technology transfer Building capabilities in the areas of: • Social Media Intelligence (SOCMINT) • Big Data Technologies & Architectures Investing in new, innovative and not yet widely proven technologies • Batch analysis • Real-time analysis • NoSQL databases • Text analysis (NLP) • … 3 Phases: June 2013 – June 2015 5
  6. 6. SOCMINT Demonstrator – Time Dimension Major data model: Time series (TS) TS reflect user behaviors over time Activities correlate with events Anomaly detection Event detection & prediction 6
  7. 7. SOCMINT Demonstrator – Social Dimension User-user networks (social graphs); Twitter: follower, retweet and mention graphs Who is central in a social network? Who has retweeted a given tweet to whom? 7
  8. 8. SOCMINT Demonstrator - “Lambda Architecture” for Big Data [Architecture diagram: Data Sources (Social, RDBMS, Sensor, ERP, Logfiles, Mobile, Machine) feed a Data Collection / Channel layer; an (Analytical) Batch Data Processing path (batch compute over the Raw Data Reservoir, Batch Result Store) and an (Analytical) Real-Time Data Processing path (Messaging, Stream/Event Processing, Real-Time Result Store) both deliver Computed Information into Result Stores behind a Query Engine, accessed via Reports, Services, Analytic Tools and Alerting Tools; data in motion vs. data at rest] 8
  9. 9. SOCMINT Demonstrator – Frameworks & Components in Use [Same Lambda Architecture diagram, annotated with the concrete frameworks and components used for each building block] 10
  10. 10. SOCMINT Demonstrator – Cassandra Cluster 6-node cluster based on DataStax Enterprise Edition (DSE) Installed in a virtualized environment, but we control the placement on disk We only keep 3 days of data • use the TTL of Cassandra to automatically erase old data Cassandra supports both Timeseries and Connected-Data (Graph) Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 11
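Since old data is expired via TTL rather than explicit deletes, a concrete write might look as follows. This is a minimal sketch using the DataStax Java driver (2.x era API); the contact point, the keyspace name and the use of the tweet table from the later slides are assumptions for illustration only.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class TtlWriteExample {
    public static void main(String[] args) {
        // Contact point and keyspace are assumed, not taken from the slides
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("socmint");

        // Keep the row for 3 days only (3 * 24 * 3600 = 259200 seconds);
        // Cassandra then erases the data automatically, no delete job needed
        session.execute(
            "INSERT INTO tweet (tweet_id, username, message) " +
            "VALUES (10000121, 'gschmutz', 'Getting ready for my talk ...') " +
            "USING TTL 259200");

        cluster.close();
    }
}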
  11. 11. Cassandra Data Modeling 12
  12. 12. Cassandra Data Modelling 13 • Don’t think relational • Denormalize, Denormalize, Denormalize …. • Rows are gigantic and sorted = one row is stored on one node • Know your application/use cases => from query to model • Index is not an afterthought, anymore => “index” upfront • Control physical storage structure
  13. 13. Static Column Family – “Skinny Row” 14 rowkey CREATE TABLE skinny (rowkey text, c1 text, c2 text, c3 text, PRIMARY KEY (rowkey)); Grows up to billions of rows rowkey-1 c1 c2 c3 value-c1 value-c2 value-c3 rowkey-2 c1 c3 value-c1 value-c3 rowkey-3 c1 c2 c3 value-c1 value-c2 value-c3 c1 c2 c3 Partition Key
  14. 14. Dynamic Column Family – “Wide Row” 15 rowkey Billions of rows rowkey-1 ckey-1:c1 ckey-1:c2 value-c1 value-c2 rowkey-2 rowkey-3 CREATE TABLE wide (rowkey text, ckey text, c1 text, c2 text, PRIMARY KEY (rowkey, ckey)) WITH CLUSTERING ORDER BY (ckey ASC); ckey-2:c1 ckey-2:c2 value-c1 value-c2 ckey-3:c1 ckey-3:c2 value-c1 value-c2 ckey-1:c1 ckey-1:c2 value-c1 value-c2 ckey-2:c1 ckey-2:c2 value-c1 value-c2 ckey-1:c1 ckey-1:c2 value-c1 value-c2 ckey-2:c1 ckey-2:c2 value-c1 value-c2 ckey-3:c1 ckey-3:c2 value-c1 value-c2 1 2 Billion Partition Key Clustering Key
  15. 15. Cassandra for Timeseries Data 16
  16. 16. Know your application => From query to model 17 Show Timeline of Tweets Show Timeseries on different levels of aggregation (resolution) • Seconds • Minute • Hours
  17. 17. Show Timeline: Provide Raw Data (Tweets) 18 CREATE TABLE tweet (tweet_id bigint, username text, message text, hashtags list<text>, latitude double, longitude double, … PRIMARY KEY(tweet_id)); • Skinny Row Table • Holds the sensor raw data => Tweets • Similar to a relational table • Primary Key is the partition key 10000121 username message hashtags latitude longitude gschmutz Getting ready for.. [cassandra, nosql] 0 0 20121223 username message hashtags latitude longitude DataStax The Speed Factor .. [BigData 0 0 tweet_id Partition Key Clustering Key
  18. 18. Show Timeline: Provide Raw Data (Tweets) 19 INSERT INTO tweet (tweet_id, username, message, hashtags, latitude, longitude) VALUES (10000121, 'gschmutz', 'Getting ready for my talk about using Cassandra for Timeseries and Graph Data', ['cassandra', 'nosql'], 0,0); SELECT tweet_id, username, hashtags, message FROM tweet; tweet_id | username | hashtags | message ---------+----------+------------------------+---------------------------- 10000121 | gschmutz | ['cassandra', 'nosql'] | Getting ready for ... 20121223 | DataStax | ['BigData'] | The Speed Factor ... Partition Key Clustering Key
  19. 19. Show Timeline: Provide Sequence of Events 20 CREATE TABLE tweet_timeline ( sensor_id text, bucket_id text, time_id timestamp, tweet_id bigint, PRIMARY KEY((sensor_id, bucket_id), time_id)) WITH CLUSTERING ORDER BY (time_id DESC); Wide Row Table bucket-id creates buckets for columns • SECOND-2015-10-14 ABC-001:SECOND-2015-10-14 10:00:02:tweet-id 10000121 DEF-931:SECOND-2015-10-14 10:09:02:tweet-id 1003121343 09:12:09:tweet-id 1002111343 09:10:02:tweet-id 1001121343 Partition Key Clustering Key
  20. 20. Show Timeline: Provide Sequence of Events 21 INSERT INTO tweet_timeline (sensor_id, bucket_id, time_id, tweet_id) VALUES ('ABC-001', 'SECOND-2015-10-14', '2015-10-14 10:50:00', 10000121 ); SELECT * FROM tweet_timeline WHERE sensor_id = 'ABC-001' AND bucket_id = 'SECOND-2015-10-14' AND time_id <= '2015-10-14 12:00:00'; sensor_id | bucket_id | time_id | tweet_id ----------+-------------------+--------------------------+---------- ABC-001 | SECOND-2015-10-14 | 2015-10-14 11:53:00+0000 | 10020334 ABC-001 | SECOND-2015-10-14 | 2015-10-14 10:52:00+0000 | 10000334 ABC-001 | SECOND-2015-10-14 | 2015-10-14 10:51:00+0000 | 10000127 ABC-001 | SECOND-2015-10-14 | 2015-10-14 10:50:00+0000 | 10000121 Sorted by time_id Partition Key Clustering Key
  21. 21. Show Timeseries: Provide list of counts 22 CREATE TABLE tweet_count ( sensor_id text, bucket_id text, key text, time_id timestamp, count counter, PRIMARY KEY((sensor_id, bucket_id), key, time_id)) WITH CLUSTERING ORDER BY (key ASC, time_id DESC); Wide Row Table bucket-id creates buckets for columns • SECOND-2015-10-14 • HOUR-2015-10 • DAY-2015-10 ABC-001:HOUR-2015-10 ALL:10:00:count 1’550 ABC-001:DAY-2015-10 ALL:14-OCT:count 105’999 ALL:13-OCT:count 120’344 nosql:14-OCT:count 2’532 ALL:09:00:count 2’299 nosql:08:00:count 25 30d * 24h * n keys = n * 720 cols Partition Key Clustering Key
  22. 22. Show Timeseries: Provide list of counts 23 UPDATE tweet_count SET count = count + 1 WHERE sensor_id = 'ABC-001' AND bucket_id = 'HOUR-2015-10' AND key = 'ALL' AND time_id = '2015-10-14 10:00:00'; SELECT * from tweet_count WHERE sensor_id = 'ABC-001' AND bucket_id = 'HOUR-2015-10' AND key = 'ALL' AND time_id >= '2015-10-14 08:00:00'; sensor_id | bucket_id | key | time_id | count ----------+--------------+-----+--------------------------+------- ABC-001 | HOUR-2015-10 | ALL | 2015-10-14 12:00:00+0000 | 100230 ABC-001 | HOUR-2015-10 | ALL | 2015-10-14 11:00:00+0000 | 102230 ABC-001 | HOUR-2015-10 | ALL | 2015-10-14 10:00:00+0000 | 105430 ABC-001 | HOUR-2015-10 | ALL | 2015-10-14 09:00:00+0000 | 203240 ABC-001 | HOUR-2015-10 | ALL | 2015-10-14 08:00:00+0000 | 132230 Partition Key Clustering Key
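To make the bucketing scheme more tangible, the sketch below derives the HOUR-level bucket_id and time_id from a tweet timestamp and applies the counter update through the DataStax Java driver. The naming pattern (HOUR-yyyy-MM, time_id truncated to the full hour) is inferred from the values shown on the slides; driver version, contact point and keyspace are assumptions.

import java.util.Date;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class TweetCountUpdater {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("socmint");

        PreparedStatement incr = session.prepare(
            "UPDATE tweet_count SET count = count + 1 " +
            "WHERE sensor_id = ? AND bucket_id = ? AND key = ? AND time_id = ?");

        // Timestamp of the incoming tweet (UTC)
        ZonedDateTime ts = ZonedDateTime.of(2015, 10, 14, 10, 50, 0, 0, ZoneOffset.UTC);

        // Bucket naming as shown on the slide: one HOUR bucket per month, e.g. HOUR-2015-10
        String hourBucket = "HOUR-" + ts.format(DateTimeFormatter.ofPattern("yyyy-MM"));
        // time_id truncated to the resolution of the bucket (here: the full hour)
        Date hourTimeId = Date.from(ts.truncatedTo(ChronoUnit.HOURS).toInstant());

        // Increment the overall counter ("ALL") and one counter per hashtag
        session.execute(incr.bind("ABC-001", hourBucket, "ALL", hourTimeId));
        session.execute(incr.bind("ABC-001", hourBucket, "nosql", hourTimeId));

        cluster.close();
    }
}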
  23. 23. Processing Pipeline Kafka provides reliable and efficient queuing Storm processes (rollups, counts) Cassandra stores at the same speed Queuing → Processing → Storing 24 Twitter Sensor 1 Twitter Sensor 2 Twitter Sensor 3 Visualization Application Visualization Application
  24. 24. Processing Pipeline – Stream-Processing with Apache Storm 25 Pre-processes data before storing it in different Cassandra tables Implemented in Java Using the DataStax Java driver for writing to Cassandra (similar to JDBC) Kafka Sentence Splitter Kafka Spout Word Counter Sentence Splitter Word Counter Who will win: Barca, Real, Juve or Bayern? … bit.ly/1yRsPmE #fcb #barca … #barca … #fcb real = 1 juve = 1 barca = 2 bayern = 1 INCR barca INCR real INCR juve INCR barca INCR bayern real juve barca barca bayern fcb fcb = 1 INCR fcb
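As a rough sketch of how such a counting bolt could write to Cassandra, the class below combines a Storm bolt with the DataStax Java driver (Storm 0.9.x backtype.storm packages assumed; class name, tuple field names and keyspace are illustrative, not taken from the actual project).

import java.util.Map;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class HashtagCounterBolt extends BaseBasicBolt {

    private transient Session session;
    private transient PreparedStatement incr;

    @Override
    public void prepare(Map stormConf, TopologyContext context) {
        // One driver session per bolt instance; contact point and keyspace are assumptions
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        session = cluster.connect("socmint");
        incr = session.prepare(
            "UPDATE tweet_count SET count = count + 1 " +
            "WHERE sensor_id = ? AND bucket_id = ? AND key = ? AND time_id = ?");
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        // Fields emitted by the upstream splitter/word-counter bolt (names are illustrative)
        String sensorId = tuple.getStringByField("sensor_id");
        String bucketId = tuple.getStringByField("bucket_id");
        String hashtag = tuple.getStringByField("hashtag");
        java.util.Date timeId = (java.util.Date) tuple.getValueByField("time_id");

        // "INCR barca", "INCR fcb", ... from the slide: one counter update per hashtag occurrence
        session.execute(incr.bind(sensorId, bucketId, hashtag, timeId));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Terminal bolt: nothing is emitted downstream
    }
}

Opening the session once in prepare() keeps one connection per bolt instance instead of one per tuple.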
  25. 25. Cassandra for Graph-Data 26
  26. 26. Using Cassandra for Social Dimension 27
  27. 27. Introduction to the Graph Model – Property Graph Node / Vertex • Represent Entities • Can contain properties (key-value pairs) Relationship / Edge • Lines between nodes • may be directed or undirected Properties • Values about a node or relationship • Allow adding semantics to relationships User 1 Tweet author follow retweet User 2 Id: 16134540 name: cloudera location: Palo Alto Id: 18898576 name: gschmutz location: Berne Id: 18898576 text: Join BigData.. time: June 11 2015 time: June 11 2015 key: value 28
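The property graph from the slide can be sketched with the TinkerPop 3 structure API; an in-memory TinkerGraph is used here so the example is self-contained, but Titan exposes the same API on top of Cassandra. Edge directions and some property values are assumptions where the slide leaves them open.

import org.apache.tinkerpop.gremlin.structure.T;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;

public class PropertyGraphExample {
    public static void main(String[] args) throws Exception {
        TinkerGraph graph = TinkerGraph.open();

        // Vertices represent entities and carry key-value properties
        Vertex user1 = graph.addVertex(T.label, "user",
                "name", "gschmutz", "location", "Berne");
        Vertex user2 = graph.addVertex(T.label, "user",
                "name", "cloudera", "location", "Palo Alto");
        Vertex tweet = graph.addVertex(T.label, "tweet",
                "text", "Join BigData..");

        // Edges are directed and can carry properties that add semantics
        user1.addEdge("author", tweet, "time", "June 11 2015");
        user1.addEdge("follow", user2);
        user2.addEdge("retweet", tweet, "time", "June 11 2015");

        System.out.println("vertices: " + graph.traversal().V().count().next());
        System.out.println("edges: " + graph.traversal().E().count().next());
        graph.close();
    }
}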
  28. 28. Titan:DB – Graph Database Optimized to work against billions of nodes and edges • Theoretical limit of 2^60 edges and half as many nodes Works with several different distributed databases • Apache Cassandra, Apache HBase, Oracle BerkeleyDB and Amazon DynamoDB Supports many concurrent users doing complex graph traversals simultaneously Native integration with the TinkerPop stack Created by Thinkaurelius (http://thinkaurelius.com/), now part of DataStax 29
  29. 29. Titan:DB Architecture 30
  30. 30. Titan:DB – Schema and Data Modeling A Titan graph has a schema comprised of edge labels, property keys, and vertex labels The schema can be defined either explicitly or implicitly The schema can evolve over time without interrupting normal operations mgmt = graph.openManagement() person = mgmt.makeVertexLabel('person').make() birthDate = mgmt.makePropertyKey('birthDate') .dataType(Long.class) .cardinality(Cardinality.SINGLE).make() name = mgmt.makePropertyKey('name').dataType(String.class) .cardinality(Cardinality.SET).make()
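Since the data-modeling part stresses defining the "index" upfront, here is a hedged sketch of declaring a vertex label, a property key and a composite index through Titan's Java management API (Titan 1.0 style API assumed; the configuration path and the index name are illustrative).

import com.thinkaurelius.titan.core.PropertyKey;
import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;
import com.thinkaurelius.titan.core.schema.TitanManagement;
import org.apache.tinkerpop.gremlin.structure.Vertex;

public class TitanSchemaExample {
    public static void main(String[] args) throws Exception {
        // Configuration file pointing Titan at the Cassandra cluster (path is an assumption)
        TitanGraph graph = TitanFactory.open("conf/titan-cassandra.properties");

        TitanManagement mgmt = graph.openManagement();
        mgmt.makeVertexLabel("user").make();
        PropertyKey name = mgmt.makePropertyKey("name").dataType(String.class).make();

        // Composite index so lookups such as g.V().has("name", ...) do not scan the whole graph
        mgmt.buildIndex("userByName", Vertex.class).addKey(name).buildCompositeIndex();
        mgmt.commit();

        graph.close();
    }
}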
  31. 31. SOCMINT Data Model User Post Term author(time,targetId) follow useHashtag retweetOf(time,targetId) mention(time) mentionOf(time) User: #userId => userId (as String) name => screenName language => lang profileImageUrlHttps location => location time => createdAt pageRank lastUpdateTime useUrl Place: #placeId => id (as String) street => street name => fullName country => country type => placeType url => placeUrl lastUpdateTime retweet(time) reply(time) replyTo(time) Place placed(time) Term: #value => hashtag or url value type => "hashtag" or "url" lastUpdateTime reply(time) Post: #postId => tweetId (as String) time => createdAt targetIds => targetIds language => lang coordinate => latitude + longitude lastUpdateTime 32
  32. 32. TinkerPop 3 Stack • TinkerPop is a framework composed of various interoperable components • Vendor independent (similar to JDBC for RDBMS) • Core API defines Graph, Vertex, Edge, … • Gremlin traversal language is a vendor-independent way to query (traverse) a graph • Gremlin server can be leveraged to allow over-the-wire communication with a TinkerPop enabled graph system http://tinkerpop.incubator.apache.org/ 33
  33. 33. Gremlin – a graph query language Imperative graph traversal language • Sequence of “steps” of the computation Must understand structure of graph peter paul roger ken eva bob marc g.V(1).out("follow").out("follow").count() g.V(1).repeat(out("follow")).times(2).count() follow follow follow follow follow follow follow or 34
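The same two-hop "follow" traversal can be issued from Java through the TinkerPop traversal API. A minimal sketch, assuming a Titan graph opened from a titan-cassandra.properties file and a start vertex looked up by its name property instead of a hard-coded vertex id:

import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import static org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__.out;

public class FollowerTraversal {
    public static void main(String[] args) throws Exception {
        TitanGraph graph = TitanFactory.open("conf/titan-cassandra.properties");
        GraphTraversalSource g = graph.traversal();

        // Equivalent of g.V(1).repeat(out("follow")).times(2).count() from the slide,
        // starting from the vertex with name "peter"
        long count = g.V().has("name", "peter")
                .repeat(out("follow")).times(2)
                .count().next();

        System.out.println("reached via two 'follow' hops: " + count);
        graph.close();
    }
}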
  34. 34. Summary 35
  35. 35. Summary 36 Cassandra is an always-on database Ability to collect and analyze massive volumes of data in sequence at extremely high velocity Forget (some of) your existing database modeling skills Cassandra is an excellent fit for time series data Cassandra is no longer “just a” column family database => Multi-Model Database • DSE Search • JSON support • DSE Graph • DSE Timeseries • Spark Support
  36. 36. Summary - Know your domain [Diagram: Key-Value Stores, Wide-Column Stores, Document Data Stores, Relational Databases and Graph Databases positioned along an axis of connectedness of data, from low to high, with Graph Databases at the high end]
