Titan: Big Graph Data with Cassandra

70,343 views

Published on

Titan is an open source distributed graph database build on top of Cassandra that can power real-time applications with thousands of concurrent users over graphs with billions of edges. Graphs are a versatile data model for capturing and analyzing rich relational structures. Graphs are an increasingly popular way to represent data in a wide range of domains such as social networking, recommendation engines, advertisement optimization, knowledge representation, health care, education, and security.

This presentation discusses Titan's data model, query language, and novel techniques in edge compression, data layout, and vertex-centric indices which facilitate the representation and processing of Big Graph Data across a Cassandra cluster. We demonstrate Titan's performance on a large scale benchmark evaluation using Twitter data.

Presented at the Cassandra 2012 Summit.

Published in: Technology, Education
3 Comments
125 Likes
Statistics
Notes
  • Thank you for sharing this. I have a question about the use of cassandra. You mentioned that a column family has the vertex as partition key (row key) and columns of the row are the properties and edges starting from that vertex. This limits the amount of adjacent vertex for any vertex to 2M, meaning you cannot store a dense graph using this approach. do i understand it right? THANKS!!
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • More than 5000 IT Certified ( SAP,Oracle,Mainframe,Microsoft and IBM Technologies etc...)Consultants registered. Register for IT courses at http://www.todaycourses.com Most of our companies will help you in processing H1B Visa, Work Permit and Job Placements
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • Thanks for the Presentation! Does it Titan Support Node type definition and apply relationships to the result set?
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
No Downloads
Views
Total views
70,343
On SlideShare
0
From Embeds
0
Number of Embeds
8,687
Actions
Shares
0
Downloads
1,084
Comments
3
Likes
125
Embeds 0
No embeds

No notes for slide

Titan: Big Graph Data with Cassandra

  1. TITANBIG GRAPH DATA WITH CASSANDRA#TITANDB #GRAPHDB #CASSANDRA12Matthias Broecheler, CTO AURELIUSAugust VIII, MMXII THINKAURELIUS.COM
  2. AbstractTitan is an open source distributed graph database build on top ofCassandra that can power real-time applications with thousands ofconcurrent users over graphs with billions of edges. Graphs are a versatiledata model for capturing and analyzing rich relational structures. Graphsare an increasingly popular way to represent data in a wide range ofdomains such as social networking, recommendation engines,advertisement optimization, knowledge representation, health care,education, and security.This presentation discusses Titans data model, query language, and noveltechniques in edge compression, data layout, and vertex-centric indiceswhich facilitate the representation and processing of Big Graph Dataacross a Cassandra cluster. We demonstrate Titans performance on alarge scale benchmark evaluation using Twitter data.
  3. Titan Graph Database   supports real time local traversals (OLTP)   is highly scalable   in the number of concurrent users   in the size of the graph   is open source under the Apache2 license   builds on top of Apache Cassandra for distribution and replication
  4. IThe Graph Data Model AURELIUS THINKAURELIUS.COM
  5. Hercules: demigodAlcmene: humanJupiter: godSaturn: titanPluto: godNeptune: godCerberus: monster Entities
  6. Name TypeHercules demigodAlcmene humanJupiter godSaturn titanPluto godNeptune godCerberus monster Table
  7. Name: Name: Name: Name:Hercules Alcmene Jupiter SaturnType: Type: Type: Type:demigod human god titanName: Name: Name:Pluto Neptune CerberusType: Type: Type:god god monster Documents
  8. Hercules type:demigodAlcmene type:humanJupiter type:godSaturn type:titanPluto type:godNeptune type:godCerberus type:monster Key->Value
  9. name: Neptune name: Alcmene type: god type: godVertex Property name: Saturn name: Jupiter name: Hercules type: titan type: god type: demigod name: Pluto name: Cerberus type: god type: monster Graph
  10. name: Neptune name: Alcmene type: god type: godEdge brother mother name: Saturn name: Jupiter name: Hercules type: titan type: god type: demigod father father Edge battled brother time:12 Property name: Pluto name: Cerberus type: god type: monster Edge Type pet Graph
  11. IGraph = Agile Data Model
  12. IIGraph Use Cases AURELIUS THINKAURELIUS.COM
  13. Recommendations
  14. Recommendation?name: Hercules
  15. name: “Muscle building for beginners”name: Hercules type: book bought
  16. name: “Muscle building for beginners”name: Hercules type: book bought boughtname: Newton
  17. name: “Muscle building for beginners”name: Hercules type: book bought bought name: “How to deal with Father issues”name: Newton type: book bought
  18. name: “Muscle building for beginners” name: Hercules type: book boughtrecommend bought name: “How to deal with Father issues” name: Newton type: book bought Traversal
  19. name: “Dancing with the Stars” type: DVD in-Cart name: “Muscle building for beginners”name: Hercules type: book bought bought name: “How to deal with Father issues”name: Newton type: book bought viewed name: “Friends forever bracelet” type: Accessory
  20. name: “Dancing with the Stars” type: DVD in-Cart name: “Muscle building for beginners” name: Hercules type: book boughtfriends bought name: “How to deal with Father issues” name: Newton type: book bought viewed name: “Friends forever bracelet” type: Accessory
  21. name: “Dancing with the Stars” type: DVD in-Cart name: “Muscle building for beginners” name: Hercules type: book bought time:24friends bought time:22 name: “How to deal with Father issues” name: Newton type: book bought time:20 viewed name: “Friends forever bracelet” type: Accessory
  22. Recommendations Path Finding
  23. name: Neptune name: Alcmene type: god type: god X brother mother name: Saturn name: Jupiter name: Hercules type: titan type: god type: demigodX father father battled brother time:12 name: Pluto name: Cerberus type: god type: monster pet Path Finding
  24. name: Neptune name: Alcmene type: god type: god X brother mother name: Saturn name: Jupiter name: Hercules type: titan type: god type: demigodX father father battled brother time:12 name: Pluto name: Cerberus type: god type: monster pet Path Finding
  25. yahoo.com geocities.com cnn.com /johnlittlesite<html> … <html> …</html>! <html> … </html>! </html>! Credibility?
  26. url: yahoo.com url: cnn.comhtml: <html>…! html: <html>…! url: geocities.com/johnlittlesite Link html: <html>…! Graph
  27. url: yahoo.com url: cnn.comhtml: <html>…! html: <html>…! elections funny cat foreign policy url: geocities.com/johnlittlesite Link html: <html>…! Graph
  28. IIGraph = Milk Your Connections
  29. IIIThe Titan Graph Database AURELIUS THINKAURELIUS.COM
  30. Titan Features  numerous concurrent users  real-time traversals (OLTP)  high availability  dynamic scalability  built on Apache Cassandra
  31. Titan Ecosystem  Native Blueprints Graph Server Implementation Graph Algorithms  Gremlin Query Object-Graph Mapper Language Traversal Language  Rexster Server Dataflow Processing   any Titan graph can be exposed Generic Graph API as a REST endpoint
  32. Titan InternalsI.  Data ManagementII.  Edge CompressionIII. Vertex-Centric Indices
  33. IVRebuilding Twitter with Titan AURELIUS THINKAURELIUS.COM
  34. text: string name: string! time: long! follows User Tweettime: long! tweets time: long!
  35. time: long! stream text: string name: string! time: long! follows User Tweettime: long! tweets time: long!
  36. Titan Storage Model  Adjacency list in one 5 column family  Row key = vertex id  Each property and edge 5 in one column   Denormalized, i.e. stored twice  Direction and label/key as column prefix   Use slice predicate for quick retrieval
  37. Connecting Titantitan$ bin/gremlin.sh! ,,,/! (o o)!-----oOOo-(_)-oOOo-----!gremlin> conf = new BaseConfiguration();!==>org.apache.commons.configuration.BaseConfiguration@763861e6!gremlin> conf.setProperty("storage.backend","cassandra");!gremlin> conf.setProperty("storage.hostname","77.77.77.77");!gremlin> g = TitanFactory.open(conf);==>titangraph[cassandra:77.77.77.77]!gremlin>!
  38. Defining Property Keysgremlin> g.makeType().name(“time”).! ! ! dataType(Long.class).! ! ! functional().! ! ! makePropertyKey();!gremlin> g.makeType().name(“text”).dataType(String.class).! ! ! functional().makePropertyKey();!gremlin> g.makeType().name(“name”).dataType(String.class).! ! ! indexed().! ! ! unique().! ! ! functional().makePropertyKey();!
  39. Defining Property Keys Each type has a unique namegremlin> g.makeType().name(“time”).! The allowed data type ! ! dataType(Long.class).! ! ! functional().! If a key is functional, each vertex can ! ! makePropertyKey();! have at most one property for this keygremlin> g.makeType().name(“text”).dataType(String.class).! ! ! functional().makePropertyKey();!gremlin> g.makeType().name(“name”).dataType(String.class).! ! ! indexed().! ! ! unique().! ! ! functional().makePropertyKey();!
  40. Defining Property Keysgremlin> g.makeType().name(“time”).! ! ! dataType(Long.class).! ! ! functional().! ! ! makePropertyKey();!gremlin> g.makeType().name(“text”).dataType(String.class).! ! ! functional().makePropertyKey();!gremlin> g.makeType().name(“name”).dataType(String.class).! ! ! indexed().! Creates and maintains an index over property values ! ! unique().! Ensures that each property value is uniquely ! ! functional().makePropertyKey();! associated with only one vertex by acquiring a lock.
  41. Titan Indexing  Vertices can be retrieved by property key + value name : Hercules 5  Titan maintains index in a name : Jupiter 9 separate column family as graph is updated  Only need to define a property key as .index()
  42. Titan Locking  Locking ensures consistency when it is needed name : Hercules 5  Titan uses time stamped quorum reads and writes on 9 separate CFs for locking  Uses name : name : Jupiter Hercules   Property uniqueness: .unique() father   Functional edges: .functional()   Global ID management x name : father Pluto
  43. Defining Edge Labelsgremlin> g.makeType().name(“follows”).! ! ! primaryKey(time).! ! ! makeEdgeLabel();!gremlin> g.makeType().name(“tweets”).! ! ! primaryKey(time).makeEdgeLabel();!gremlin> g.makeType().name(“stream).! ! ! primaryKey(time).! ! ! unidirected().! ! ! makeEdgeLabel();!
  44. Defining Edge Labelsgremlin> g.makeType().name(“follows”).! ! ! primaryKey(time).! Sort/index key for edges of this label ! ! makeEdgeLabel();!gremlin> g.makeType().name(“tweets”).! ! ! primaryKey(time).makeEdgeLabel();!gremlin> g.makeType().name(“stream).! ! ! primaryKey(time).! ! ! unidirected().! ! ! makeEdgeLabel();!
  45. Defining Edge Labelsgremlin> g.makeType().name(“follows”).! ! ! primaryKey(time).! ! ! makeEdgeLabel();!gremlin> g.makeType().name(“tweets”).! ! ! primaryKey(time).makeEdgeLabel();!gremlin> g.makeType().name(“stream).! ! ! primaryKey(time).! ! ! unidirected().! Store edges of this label only in outgoing direction ! ! makeEdgeLabel();!
  46. Vertex-Centric Indices  Sort and index edges per vertex by primary key   Primary key can be composite  Enables efficient focused traversals   Only retrieve edges that matter  Uses slice predicate for quick, index-driven retrieval
  47. tweets tweets tweetstime: 123 time: 334 time: 624 v.query()! tweets v time: 1112 follows follows follows follows
  48. tweets tweets tweetstime: 123 time: 334 time: 624 v.query()! tweets .direction(OUT)! v time: 1112 follows follows
  49. tweets tweets tweetstime: 123 time: 334 time: 624 v.query()! tweets .direction(OUT)! v time: 1112 .labels(“tweets”)!
  50. v.query()! tweets .direction(OUT)!v time: 1112 .labels(“tweets”)! .has(“time”,T.gt,1000)!
  51. name: Hercules name: PlutoCreate Accounts gremlin> hercules = g.addVertex([name:Hercules]);! gremlin> pluto = g.addVertex([name:Pluto]);!
  52. name: Hercules name: PlutoAdd Followship follows time:2 gremlin> hercules = g.addVertex([name:Hercules]);! gremlin> pluto = g.addVertex([name:Pluto]);! gremlin> g.addEdge(hercules,pluto,"follows",[time:2]);!
  53. name: Hercules name: PlutoPublish Tweet follows time:2 tweets time:4 text: A tweet! time: 4! gremlin> hercules = g.addVertex([name:Hercules]);! gremlin> pluto = g.addVertex([name:Pluto]);! gremlin> g.addEdge(hercules,pluto,"follows",[time:2]);! gremlin> tweet = g.addVertex([text:A tweet!,time:4])! gremlin> g.addEdge(pluto,tweet,"tweets",[time:4]) !
  54. name: Hercules name: PlutoUpdate Streams follows time:2 stream tweets time:4 time:4 text: A tweet! time: 4! gremlin> hercules = g.addVertex([name:Hercules]);! gremlin> pluto = g.addVertex([name:Pluto]);! gremlin> g.addEdge(hercules,pluto,"follows",[time:2]);! gremlin> tweet = g.addVertex([text:A tweet!,time:4])! gremlin> g.addEdge(pluto,tweet,"tweets",[time:4]) ! gremlin> pluto.in("follows").each{g.addEdge(it,tweet,"stream",[time:4])} !
  55. name: Hercules name: PlutoRead Stream follows time:2 stream tweets time:4 time:4 text: A tweet! time: 4! gremlin> hercules = g.addVertex([name:Hercules]);! gremlin> pluto = g.addVertex([name:Pluto]);! gremlin> g.addEdge(hercules,pluto,"follows",[time:2]);! gremlin> tweet = g.addVertex([text:A tweet!,time:4])! gremlin> g.addEdge(pluto,tweet,"tweets",[time:4]) ! gremlin> pluto.in("follows").each{g.addEdge(it,tweet,"stream",[time:4])} ! gremlin> hercules.outE(stream)[0..9].inV.map! Sorted by time because its ‘stream’s primary key
  56. Followship name: Hercules name: PlutoRecommendation follows time:2 follows follows time:9 name: Neptune follows = g.V(name,’Hercules’).out(follows).toList()! follows20 = follows[(0..19).collect{random.nextInt(follows.size)}]! m = [:]! follows20.each ! { it.outE(follows’[0..29].inV.except(follows).groupCount(m).iterate() }! m.sort{a,b -> b.value <=> a.value}[0..4]!
  57. IVTitan Performance Evaluation onTwitter-like Benchmark AURELIUS THINKAURELIUS.COM
  58. Twitter Benchmark  1.47 billion followship edges and 41.7 million users   Loaded into Titan using BatchGraph   Twitter in 2009, crawled by Kwak et. al  4 Transaction Types   Create Account (1%)   Publish tweet (15%)   Read stream (76%)   Recommendation (8%) Kwak, H., Lee, C., Park, H., Moon, S., “What is   Follow recommended user (30%) Twitter, a Social Network or a News Media?,” World Wide Web Conference, 2010.
  59. Benchmark Setup  6 cc1.4xl Cassandra nodes   in one placement group   Cassandra 1.10  40 m1.small worker machines   repeatedly running transactions   simulating servers handling user requests  EC2 cost: $11/hour
  60. Benchmark ResultsTransaction Type Number of tx Mean tx time Std of tx timeCreate account 379,019 115.15 ms 5.88 msPublish tweet 7,580,995 18.45 ms 6.34 msRead stream 37,936,184 6.29 ms 1.62 msRecommendation 3,793,863 67.65 ms 13.89 ms Total 49,690,061 Runtime 2.3 hours 5,900 tx/sec
  61. Peak Load ResultsTransaction Type Number of tx Mean tx time Std of tx timeCreate account 374,860 172.74 ms 10.52 msPublish tweet 7,517,667 70.07 ms 19.43 msRead stream 37,618,648 24.40 ms 3.18 msRecommendation 3,758,266 229.83 ms 29.08 ms Total 49,269,441 Runtime 1.3 hours 10,200 tx/sec
  62. Benchmark ConclusionTitan  can  handle  10s  of  thousands  of  concurrent  users  with   short   response   5mes   even   for   complex   traversals  on   a   simulated   social   networking   applica5on   based   on  real-­‐world   network   data   with   billions   of   edges   and  millions  of  users  in  a  standard  EC2  deployment.  For  more  informa5on  on  the  benchmark:  hDp://thinkaurelius.com/2012/08/06/5tan-­‐provides-­‐real-­‐5me-­‐big-­‐graph-­‐data/  
  63. Future Titan  Titan+Cassandra embedding   sending Gremlin queries into the cluster  Graph partitioning together with ByteOrderedPartitioner   data locality = better performance  Let us know what you need!
  64. Titan goes OLAP Map/Reduce Load & Compress Analysis results back into Titan Stores a massive-scale Batch processing of large Runs global graph algorithmsproperty graph allowing real- graphs with Hadoop on large, compressed, time traversals and updates in-memory graphs
  65. IIIGraph = Scalable + Practical
  66. TITANTHINKAURELIUS.GITHUB.COM/TITAN
  67. AURELIUSTHINKAURELIUS.COM

×