
Real Time Analytics with Apache Cassandra - Cassandra Day Munich

Time series data is everywhere: IoT, sensor data, financial transactions. The industry has moved to databases like Cassandra to handle the high velocity and high volume of data that is now commonplace. In this talk I will present how we have used Cassandra to store time series data. I will highlight both the Cassandra data model and the architecture we put in place for collecting and ingesting data into Cassandra, using Apache Kafka and Apache Storm.

  1. BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENEVA HAMBURG COPENHAGEN LAUSANNE MUNICH STUTTGART VIENNA ZURICH. Real-Time Analytics with Apache Cassandra. Cassandra Day Munich, 9.2.2016. Guido Schmutz
  2. Guido Schmutz. Working for Trivadis for more than 19 years. Oracle ACE Director for Fusion Middleware and SOA. Co-author of several books. Consultant, trainer, and software architect for Java, Oracle, SOA and Big Data / Fast Data. Member of the Trivadis Architecture Board. Technology Manager @ Trivadis. More than 25 years of software development experience. Contact: guido.schmutz@trivadis.com. Blog: http://guidoschmutz.wordpress.com. Slideshare: http://de.slideshare.net/gschmutz. Twitter: gschmutz
  3. Our company. © Trivadis – The Company, 2/11/16. Trivadis is a market leader in IT consulting, system integration, solution engineering and the provision of IT services, focusing on Oracle and Open Source technologies in Switzerland, Germany, Austria and Denmark. We offer our services in the following strategic business fields: [list shown on slide]. Trivadis Services takes over the operation of your IT systems.
  4. With over 600 specialists and IT experts in your region: 14 Trivadis branches and more than 600 employees, 200 Service Level Agreements, over 4,000 training participants, a research and development budget of CHF 5.0 million, financially self-supporting and sustainably profitable, experience from more than 1,900 projects per year at over 800 customers.
  5. Agenda: 1. Customer Use Case and Architecture, 2. Cassandra Data Modeling, 3. Cassandra for Timeseries Data, 4. Titan:db for Graph Data
  6. Customer Use Case and Architecture
  7. Data Science Lab @ Armasuisse W&T. W&T flagship project, standing for innovation & tech transfer. Building capabilities in the areas of: Social Media Intelligence (SOCMINT); Big Data technologies & architectures. Investing in new, innovative and not widely proven technology: batch / real-time analysis, NoSQL databases, text analysis (NLP), graph data, … 3 phases: June 2013 – June 2015
  8. SOCMINT Demonstrator – Time Dimension. Major data model: time series (TS). TS reflect user behavior over time; activities correlate with events. Anomaly detection; event detection & prediction.
  9. SOCMINT Demonstrator – Social Dimension. User-user networks (social graphs); Twitter: follower, retweet and mention graphs. Who is central in a social network? Who has retweeted a given tweet to whom?
  10. SOCMINT Demonstrator – "Lambda Architecture" for Big Data. [Architecture diagram: data sources (social, RDBMS, sensor, ERP, logfiles, mobile, machine) feed a channel into two layers: an (analytical) batch data processing layer (batch compute → batch result store over the raw data reservoir) and an (analytical) real-time data processing layer (messaging, stream/event processing → real-time result store); a query engine over the result stores serves reports, services, analytic tools and alerting tools. Legend: data in motion vs. data at rest.]
  11. SOCMINT Demonstrator – Frameworks & Components in Use. [The same Lambda architecture diagram, annotated with the frameworks and components used at each stage.]
  12. Streaming Analytics Processing Pipeline. Queuing → Processing → Storing: Kafka provides reliable and efficient queuing, Storm processes the data (rollups, counts), Cassandra stores the results at the same speed. Three Twitter sensors feed the pipeline; visualization applications consume the results.
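The queuing → processing → storing flow can be sketched without a cluster. The snippet below is a hypothetical, minimal Python stand-in for the Storm rollup step (Kafka and Cassandra replaced by plain in-memory collections), showing the kind of per-bucket counting the processing stage performs before writing to the result store:

```python
from collections import Counter
from datetime import datetime

def rollup(tweets):
    """Count tweets per (sensor, hour bucket, hour), as the Storm rollup step would."""
    counts = Counter()
    for sensor_id, ts in tweets:
        t = datetime.fromisoformat(ts)
        bucket = t.strftime("HOUR-%Y-%m")        # bucket_id, as used in the tables later on
        hour = t.strftime("%Y-%m-%d %H:00:00")   # time_id truncated to the hour
        counts[(sensor_id, bucket, hour)] += 1
    return counts

# Simulated stream of (sensor_id, timestamp) events from the Twitter sensors:
stream = [("ABC-001", "2015-10-14 10:12:01"),
          ("ABC-001", "2015-10-14 10:47:33"),
          ("ABC-001", "2015-10-14 11:03:10")]
print(rollup(stream))
```

In the real pipeline each increment would be an `UPDATE … SET count = count + 1` against a Cassandra counter table instead of an in-memory `Counter`.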
  13. Cassandra Data Modeling
  14. Cassandra Data Modelling. • Don't think relational! • Denormalize, denormalize, denormalize… • Rows are gigantic and sorted → one row is stored on one node • Know your application / use cases → from query to model • Index is not an afterthought anymore → "index" upfront • Control the physical storage structure
  15. "Static" Tables – "Skinny Row". CREATE TABLE skinny (rowkey text, c1 text, c2 text, c3 text, PRIMARY KEY (rowkey)); Grows up to billions of rows. rowkey is the partition key; every row has the same fixed set of columns c1, c2, c3 (e.g. rowkey-1 → value-c1, value-c2, value-c3; rowkey-2 → value-c1, value-c3; rowkey-3 → value-c1, value-c2, value-c3).
  16. "Dynamic" Tables – "Wide Row". CREATE TABLE wide (rowkey text, ckey text, c1 text, c2 text, PRIMARY KEY (rowkey, ckey)) WITH CLUSTERING ORDER BY (ckey ASC); Billions of rows. rowkey is the partition key, ckey the clustering key; within one row the cells are stored sorted by ckey (ckey-1:c1, ckey-1:c2, ckey-2:c1, ckey-2:c2, …), and a single row can hold up to 2 billion columns.
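One way to internalize the skinny/wide distinction: the partition key decides which node a row lives on, while the clustering key sorts the cells inside that row, making range scans over it cheap. A rough in-memory analogue (plain Python, purely illustrative, not how Cassandra is implemented):

```python
from bisect import insort

class WideRow:
    """Toy model of a wide row: one partition, cells kept sorted by clustering key."""
    def __init__(self):
        self.ckeys = []     # clustering keys, kept sorted (CLUSTERING ORDER BY ckey ASC)
        self.cells = {}     # ckey -> {column: value}

    def insert(self, ckey, **columns):
        if ckey not in self.cells:
            insort(self.ckeys, ckey)            # maintain sorted order on insert
        self.cells.setdefault(ckey, {}).update(columns)

    def slice(self, start, end):
        """Range scan over the clustering key: cheap because cells are stored sorted."""
        return [(k, self.cells[k]) for k in self.ckeys if start <= k <= end]

table = {}   # rowkey -> WideRow; in Cassandra the rowkey picks the node
table.setdefault("rowkey-1", WideRow()).insert("ckey-1", c1="value-c1", c2="value-c2")
table["rowkey-1"].insert("ckey-2", c1="value-c1")
print(table["rowkey-1"].slice("ckey-1", "ckey-2"))
```

This is why "from query to model" matters: a query can only slice efficiently along the clustering key order you chose at table-design time.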
  17. Cassandra for Timeseries Data
  18. Know your application → from query to model. Show a timeline of tweets; show timeseries on different levels of aggregation (resolution): seconds, minutes, hours.
  19. Show Timeline: Provide Raw Data (Tweets). CREATE TABLE tweet (tweet_id bigint, username text, message text, hashtags list<text>, latitude double, longitude double, … PRIMARY KEY(tweet_id)); Skinny row table; holds the sensor raw data (tweets); similar to a relational table; the primary key is the partition key (e.g. tweet_id 10000121 → gschmutz, "Getting ready for ..", [cassandra, nosql]; tweet_id 20121223 → DataStax, "The Speed Factor ..", [BigData]).
  20. Show Timeline: Provide Raw Data (Tweets). INSERT INTO tweet (tweet_id, username, message, hashtags, latitude, longitude) VALUES (10000121, 'gschmutz', 'Getting ready for my talk about using Cassandra for Timeseries and Graph Data', ['cassandra', 'nosql'], 0, 0); SELECT tweet_id, username, hashtags, message FROM tweet WHERE tweet_id = 10000121; → 10000121 | gschmutz | ['cassandra', 'nosql'] | Getting ready for ...
  21. Show Timeline: Provide Sequence of Events. CREATE TABLE tweet_timeline (sensor_id text, bucket_id text, time_id timestamp, tweet_id bigint, PRIMARY KEY((sensor_id, bucket_id), time_id)) WITH CLUSTERING ORDER BY (time_id DESC); Wide row table; (sensor_id, bucket_id) is the partition key, time_id the clustering key. bucket_id creates buckets for the columns, e.g. SECOND-2015-10-14: partition ABC-001:SECOND-2015-10-14 holds the cell 10:00:02 → tweet_id 10000121; partition DEF-931:SECOND-2015-10-14 holds 10:09:02 → 1003121343, 09:12:09 → 1002111343, 09:10:02 → 1001121343.
  22. Show Timeline: Provide Sequence of Events. INSERT INTO tweet_timeline (sensor_id, bucket_id, time_id, tweet_id) VALUES ('ABC-001', 'SECOND-2015-10-14', '2015-10-14 10:50:00', 10000121); SELECT * FROM tweet_timeline WHERE sensor_id = 'ABC-001' AND bucket_id = 'SECOND-2015-10-14' AND time_id <= '2015-10-14 12:00:00'; → rows sorted by time_id (DESC): 2015-10-14 11:53:00 → 10020334, 10:52:00 → 10000334, 10:51:00 → 10000127, 10:50:00 → 10000121
  23. Show Timeseries: Provide List of Metrics. CREATE TABLE tweet_count (sensor_id text, bucket_id text, key text, time_id timestamp, count counter, PRIMARY KEY((sensor_id, bucket_id), key, time_id)) WITH CLUSTERING ORDER BY (key ASC, time_id DESC); Wide row table; bucket_id creates buckets for the columns: SECOND-2015-10-14, HOUR-2015-10, DAY-2015-10. Example cells: partition ABC-001:HOUR-2015-10 holds ALL:10:00 → 1'550, ALL:09:00 → 2'299, nosql:08:00 → 25; partition ABC-001:DAY-2015-10 holds ALL:14-OCT → 105'999, ALL:13-OCT → 120'344, nosql:14-OCT → 2'532. Sizing: 30d * 24h * n keys = n * 720 columns.
  24. Show Timeseries: Provide List of Metrics. UPDATE tweet_count SET count = count + 1 WHERE sensor_id = 'ABC-001' AND bucket_id = 'HOUR-2015-10' AND key = 'ALL' AND time_id = '2015-10-14 10:00:00'; SELECT * FROM tweet_count WHERE sensor_id = 'ABC-001' AND bucket_id = 'HOUR-2015-10' AND key = 'ALL' AND time_id >= '2015-10-14 08:00:00'; → hourly counts sorted by time_id (DESC): 12:00 → 100230, 11:00 → 102230, 10:00 → 105430, 09:00 → 203240, 08:00 → 132230
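The bucket_id values used across these tables (SECOND-2015-10-14, HOUR-2015-10, DAY-2015-10) are just timestamps truncated to different resolutions. A small helper like the following (a hypothetical sketch, not from the talk) keeps them consistent between the ingest path and the query path:

```python
from datetime import datetime

# Bucket granularities and their strftime patterns, matching the slides' examples:
# per-second data is bucketed by day, hourly and daily rollups by month.
BUCKET_FORMATS = {
    "SECOND": "SECOND-%Y-%m-%d",
    "HOUR":   "HOUR-%Y-%m",
    "DAY":    "DAY-%Y-%m",
}

def bucket_id(resolution, ts):
    """Build the bucket_id part of the partition key for a given resolution."""
    return ts.strftime(BUCKET_FORMATS[resolution])

t = datetime(2015, 10, 14, 10, 50, 0)
print(bucket_id("SECOND", t))  # SECOND-2015-10-14
print(bucket_id("HOUR", t))    # HOUR-2015-10
```

Bucketing by day/month like this bounds how wide a single partition can grow (the n * 720 columns estimate above), which keeps partitions well below Cassandra's per-row cell limit.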
  25. Titan:db & Cassandra for Graph Data
  26. Introduction to the Graph Model – Property Graph. Vertex (node): represents an entity; always has an ID; can contain properties (key-value pairs). Edge (relationship): a line between nodes; may be directed or undirected; has an ID and properties. Properties: values about a node or relationship; allow adding semantics to relationships. [Example diagram: User 2 (id 18898576, name gschmutz, location Berne) follows (since May 2012) User 1 (id 16134540, name cloudera, location Palo Alto); author edges link users to Tweet 1 ("Join BigData..", June 11 2015) and Tweet 2 (id 18898999, "CDH5 has been..", July 11 2015); a retweet edge links the tweets.]
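The property-graph model maps naturally onto two small collections. The sketch below is an illustrative Python rendering (IDs and property values taken from the slide's example; the `neighbors` helper is hypothetical), showing vertices and edges as records with properties:

```python
# Vertices: entities, each with an ID and key-value properties.
vertices = {
    16134540: {"label": "User",  "name": "cloudera", "location": "Palo Alto"},
    18898576: {"label": "User",  "name": "gschmutz", "location": "Berne"},
    18898999: {"label": "Tweet", "text": "CDH5 has been..", "time": "July 11 2015"},
}

# Edges: directed lines between vertices, also carrying properties
# (the "since" property adds semantics to the follow relationship).
edges = [
    {"label": "follow", "out": 18898576, "in": 16134540, "since": "May 2012"},
    {"label": "author", "out": 16134540, "in": 18898999},
]

def neighbors(vid, label):
    """All vertices reachable from vid over outgoing edges with the given label."""
    return [vertices[e["in"]] for e in edges if e["out"] == vid and e["label"] == label]

print(neighbors(18898576, "follow")[0]["name"])  # cloudera
```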
  27. Titan:db Architecture. http://thinkaurelius.github.io/titan/
  28. TinkerPop 3 Stack. TinkerPop is a framework composed of various interoperable components; vendor independent (similar to JDBC for RDBMS). The Core API defines Graph, Vertex, Edge, … The Gremlin traversal language is a vendor-independent way to query (traverse) a graph. Gremlin Server can be leveraged to allow over-the-wire communication with a TinkerPop-enabled graph system. http://tinkerpop.incubator.apache.org/
  29. Gremlin Graph Traversal Engine. Language / system agnostic: many graph languages for many graph systems. Provided traversal engine: SPARQL or any other graph query language can run on the Gremlin Traversal Machine. Native distributed execution: a Gremlin traversal over an OLAP graph processor (Hadoop / Spark).
  30. Gremlin in Action – Creating the Graph. [Live-coding example shown on slide.]
  31. Gremlin in Action – Graph Traversal. [Live-coding example shown on slide.]
  32. Gremlin in Action – Graph Traversal (II). [Live-coding example shown on slide.]
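The "Gremlin in Action" slides are screenshots that did not survive extraction. As a rough stand-in, the sketch below mimics in plain Python the fluent style a Gremlin traversal uses (e.g. steps like `V()`, `has()`, `out()`, `values()`); the class, graph data and chaining are all hypothetical and only hint at how such a traversal walks a graph:

```python
class Traversal:
    """Minimal fluent traversal, loosely mimicking Gremlin's V().has().out().values()."""
    def __init__(self, vertices, edges):
        self.vertices, self.edges = vertices, edges
        self.current = list(vertices)   # start from all vertex IDs, like g.V()

    def has(self, key, value):
        """Keep only vertices whose property `key` equals `value`."""
        self.current = [v for v in self.current if self.vertices[v].get(key) == value]
        return self

    def out(self, label):
        """Step to the vertices reached over outgoing edges with the given label."""
        self.current = [e["in"] for e in self.edges
                        if e["out"] in self.current and e["label"] == label]
        return self

    def values(self, key):
        """Terminate the traversal, extracting a property from the current vertices."""
        return [self.vertices[v][key] for v in self.current]

vertices = {1: {"name": "gschmutz"}, 2: {"name": "cloudera"}}
edges = [{"label": "follow", "out": 1, "in": 2}]
print(Traversal(vertices, edges).has("name", "gschmutz").out("follow").values("name"))
```

In Titan:db the same traversal would run through the real Gremlin engine against data stored in Cassandra, rather than over in-memory dicts.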
  33. Summary – Know Your Domain. [Chart positioning data stores along an axis of connectedness of data, from low to high: key-value stores, document data stores, wide-column stores, relational databases, graph databases; graph databases handle the most highly connected data.]
  34. Guido Schmutz. Email: guido.schmutz@trivadis.com. +41 79 412 05 39
