
Big Graph Data


The problems we are faced with in the 21st century require efficient analysis of ever more complex systems. This presentation outlines how such problems can be better understood and effectively solved if they are modeled as graphs or networks. We present two tools to help solve such problems at scale: Titan, a real-time distributed graph database based on Apache Cassandra and HBase, and Faunus, a batch analytics framework for graphs based on Apache Hadoop. We discuss their development status as of November 2012 and illustrate an example application for the GitHub coding network.


Big Graph Data

  1. BIG GRAPH DATA Understanding a Complex World Matthias Broecheler, CTO @mbroecheler AURELIUS November XIII, MMXII THINKAURELIUS.COM
  2. I Graph Foundation AURELIUS THINKAURELIUS.COM
  3. Graph, Vertex, Property. [Figure: seven vertices with properties: Neptune (type: god), Alcmene (type: god), Saturn (type: titan), Jupiter (type: god), Hercules (type: demigod), Pluto (type: god), Cerberus (type: monster)]
  4. Graph, Edge, Edge Type, Edge Property. [Figure: the same vertices connected by labeled edges: brother (twice), mother, father (twice), pet, and battled with the edge property time: 12]
  5. Path. [Figure: the same graph, illustrating a path]
  6. Degree. [Figure: the same graph, illustrating vertex degree]
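The vocabulary of slides 3 to 6 (vertex, property, edge, path, degree) can be illustrated with a minimal Python sketch. The vertex names, types, edge labels and `time: 12` come from the slides; the edge endpoints are an assumption, because the transcript lost the figure layout, and `degree` and `shortest_path` are hypothetical helper names.

```python
from collections import deque

# Vertices keyed by name, each with a property map (values as shown on the slides).
vertices = {
    "Saturn": {"type": "titan"},
    "Jupiter": {"type": "god"},
    "Neptune": {"type": "god"},
    "Pluto": {"type": "god"},
    "Alcmene": {"type": "god"},
    "Hercules": {"type": "demigod"},
    "Cerberus": {"type": "monster"},
}

# Directed, labeled edges: (source, label, target, edge properties).
# ASSUMPTION: the endpoints are guessed; only the labels and "time: 12"
# survive in the transcript.
edges = [
    ("Jupiter", "father", "Saturn", {}),
    ("Jupiter", "brother", "Neptune", {}),
    ("Jupiter", "brother", "Pluto", {}),
    ("Hercules", "father", "Jupiter", {}),
    ("Hercules", "mother", "Alcmene", {}),
    ("Hercules", "battled", "Cerberus", {"time": 12}),
    ("Pluto", "pet", "Cerberus", {}),
]


def degree(name):
    """Number of edges touching a vertex, counting both directions."""
    return sum((src == name) + (dst == name) for src, _, dst, _ in edges)


def shortest_path(start, goal):
    """Breadth-first search along edge direction; returns vertex names or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for src, _, dst, _ in edges:
            if src == path[-1] and dst not in seen:
                seen.add(dst)
                queue.append(path + [dst])
    return None
```

Under these assumed endpoints, `shortest_path("Hercules", "Saturn")` walks two `father` edges, and `degree("Jupiter")` counts its three outgoing edges plus one incoming edge.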
  7. II Connected World AURELIUS THINKAURELIUS.COM
  8. HEALTH
  9. HEALTH
  10. HEALTH
  11. HEALTH
  12. ECONOMY
  13. ECONOMY
  14. ECONOMY
  15. ECONOMY
  16. Social Systems
  17. Social Systems
  18. Social Systems
  19. Social Systems
  20. III Titan Graph Database AURELIUS THINKAURELIUS.COM
  21. Titan Features: numerous concurrent users; many short transactions (read/write); real-time traversals (OLTP); high availability; dynamic scalability; variable consistency model (ACID or eventual consistency). Real-time Big Graph Data.
  22. Storage Backends [Figure: backends compared by Partitionability, Consistency, Availability]
  23. Titan Features: I. Data Management; II. Vertex-Centric Indices
  24. Titan Features: III. Graph Partitioning; IV. Edge Compression
  25. Titan Ecosystem: native Blueprints implementation; Gremlin query language; Rexster server (any Titan graph can be exposed as a REST endpoint). [Figure labels: Graph Server, Graph Algorithms, Object-Graph Mapper, Traversal Language, Dataflow Processing, Generic Graph API]
  26. IV GitHub Network AURELIUS THINKAURELIUS.COM
  27. Setup
      $ ./titan-0.1.0/bin/gremlin.sh
               \,,,/
               (o o)
      -----oOOo-(_)-oOOo-----
      gremlin> g = TitanFactory.open('/tmp/titan')
      ==>titangraph[local:/tmp/titan]
  28. Titan Storage Model: adjacency list in one column family; row key = vertex id; each property and edge in one column; denormalized, i.e. stored twice; direction and label/key as column prefix; use slice predicate for quick retrieval.
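The storage model on slide 28 can be sketched in Python as a toy wide-column table. This is an illustration under stated assumptions, not Titan's actual on-disk layout: column names are modeled as tuples, and `Row`, `table`, `set_property`, `add_edge` and `neighbors` are hypothetical names.

```python
from bisect import bisect_left


class Row:
    """One row per vertex; columns stay sorted by name, as in a wide-column store."""

    def __init__(self):
        self.names = []   # sorted column names (tuples)
        self.values = {}  # column name -> value

    def put(self, name, value):
        if name not in self.values:
            self.names.insert(bisect_left(self.names, name), name)
        self.values[name] = value

    def slice(self, prefix):
        """Return (name, value) pairs whose column name starts with prefix."""
        start = bisect_left(self.names, prefix)
        out = []
        for name in self.names[start:]:
            if name[:len(prefix)] != prefix:
                break  # sorted order: once the prefix stops matching, we are done
            out.append((name, self.values[name]))
        return out


table = {}  # hypothetical store: row key (vertex id) -> Row


def row(vertex_id):
    return table.setdefault(vertex_id, Row())


def set_property(vertex_id, key, value):
    # One column per property, prefixed by its key.
    row(vertex_id).put(("PROP", key), value)


def add_edge(src, label, dst):
    # Denormalized: the edge is written once in each endpoint's row.
    row(src).put(("OUT", label, dst), None)
    row(dst).put(("IN", label, src), None)


def neighbors(vertex_id, direction, label):
    # Direction and label form the column prefix, so one slice finds the edges.
    return [name[2] for name, _ in row(vertex_id).slice((direction, label))]
```

Because direction and label lead the column name, `neighbors(1, "OUT", "pushed")` touches only the matching contiguous run of columns rather than scanning the whole row.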
  29. [Figure: GitHub schema. Vertex types: USER, COMMENT, PAGE, ISSUE, COMMIT, REPOSITORY. Edge labels: created, edited, opened, pushed, on, to, in]
  30. Defining Property Keys
      gremlin> g.makeType().name('username').
                 dataType(String.class).
                 functional().
                 indexed().unique().
                 makePropertyKey()
      gremlin> time = g.makeType().name('time').
                 dataType(Long.class).
                 functional().makePropertyKey()
  31. Defining Edge Labels
      gremlin> g.makeType().name('on').
                 makeEdgeLabel()
      gremlin> g.makeType().name('pushed').
                 primaryKey(time).
                 makeEdgeLabel()
      gremlin> g.makeType().name('in').
                 unidirected().
                 makeEdgeLabel()
  32. Create & Retrieve
      gremlin> v = g.addVertex([username: 'okram'])
      ==>v[4]
      gremlin> v.map
      ==>{username=okram}
      gremlin> g.V('username','okram')
      ==>v[4]
  33. Titan Locking: locking ensures consistency when it is needed. Titan uses time-stamped consistent reads and writes on separate CFs (column families) for locking. Uses: property uniqueness (.unique()); functional edges (.functional()); global ID management. [Figure: vertices name: Hercules (5), name: Jupiter (9) and name: Pluto, with father edges, one of them marked x]
  34. Titan Indexing: vertices can be retrieved by property key + value; Titan maintains the index in a separate column family as the graph is updated; a property key only needs to be defined as .indexed(). [Figure: name: Hercules maps to 5, name: Jupiter maps to 9]
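The idea on slide 34, an index kept in a separate structure and updated whenever the graph changes, can be sketched in Python. This is a simplified in-memory illustration, not Titan's implementation; `IndexedGraph` and its methods are hypothetical names.

```python
class IndexedGraph:
    """Toy graph that maintains a (key, value) -> vertex ids index on every write."""

    def __init__(self):
        self.props = {}            # vertex id -> {key: value}
        self.index = {}            # (key, value) -> set of vertex ids, kept separately
        self.indexed_keys = set()  # keys declared as indexed
        self.next_id = 1

    def make_indexed_key(self, key):
        self.indexed_keys.add(key)

    def add_vertex(self, **props):
        vid = self.next_id
        self.next_id += 1
        self.props[vid] = {}
        for key, value in props.items():
            self.set_property(vid, key, value)
        return vid

    def set_property(self, vid, key, value):
        old = self.props[vid].get(key)
        if key in self.indexed_keys:
            # Remove the stale index entry, then add the new one.
            if old is not None:
                self.index.get((key, old), set()).discard(vid)
            self.index.setdefault((key, value), set()).add(vid)
        self.props[vid][key] = value

    def lookup(self, key, value):
        """Retrieve vertex ids by property key + value without scanning vertices."""
        return sorted(self.index.get((key, value), ()))
```

A lookup such as `lookup("username", "okram")` then plays the role of `g.V('username','okram')` on slide 32: it reads the index rather than iterating over all vertices.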
  35. Basic Queries
      gremlin> v.out('pushed')
      gremlin> v.out('pushed').out('to').name
      gremlin> v.out('pushed').out('to').dedup.name
      gremlin> v.out('pushed').out('to').dedup.
                 name.sort{it}
      gremlin> v.outE('pushed').has('time',T.gt,1000).inV
  36. Basic Queries (the same five queries as slide 35, with the callout "Query Optimization")
  37. Vertex-Centric Indices: sort and index edges per vertex by primary key; primary key can be composite; enables efficient focused traversals; only retrieve edges that matter; uses push-down predicates for quick, index-driven retrieval.
  38. v.query() [Figure: vertex v with edges battled (time: 1), battled (time: 3), battled (time: 5), battled (time: 9), mother, father, fought, fought]
  39. v.query().direction(OUT) [Figure: remaining edges battled (time: 1, 3, 5, 9), mother, father]
  40. v.query().direction(OUT).labels('battled') [Figure: remaining edges battled (time: 1, 3, 5, 9)]
  41. v.query().direction(OUT).labels('battled').has('time',T.lt,5) [Figure: remaining edges battled (time: 1, 3)]
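Slides 37 to 41 narrow a vertex's edges by direction, label and a range on the primary key. A minimal Python sketch of that idea follows, assuming edges are bucketed per (direction, label) and kept sorted by `time`; `Vertex`, `add_edge` and `query` are hypothetical names, and Titan's real index lives in the storage backend rather than in memory.

```python
from bisect import bisect_left, insort


class Vertex:
    """Toy vertex whose incident edges are sorted per (direction, label) by time."""

    def __init__(self):
        # (direction, label) -> list of (time, other vertex name), sorted by time
        self.edges = {}

    def add_edge(self, direction, label, time, other):
        insort(self.edges.setdefault((direction, label), []), (time, other))

    def query(self, direction, label, time_lt):
        """Edges with the given direction and label whose time is below time_lt."""
        bucket = self.edges.get((direction, label), [])
        # Binary search for the cut-off instead of filtering every edge.
        cut = bisect_left(bucket, (time_lt,))
        return bucket[:cut]
```

With edges at times 1, 3, 5 and 9, `query("OUT", "battled", 5)` returns only the first two, mirroring slide 41: the predicate is pushed down into the sorted structure, so edges that do not matter are never retrieved.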
  42. Recommendation Engine
      gremlin> v.out('pushed').out('to')[0..9].
                 in('to').in('pushed')[0..500].
                 except([v]).name.
                 groupCount.cap.next().sort{-it.value}[0..4]
  43. Recommendation Engine: the query from slide 42 with v = g.V('username','okram'):
      ==>lvca=175
      ==>spmallette=56
      ==>sgomezvillamor=36
      ==>mbroecheler=33
      ==>joshsh=20
  44. Recommendation Engine: the query from slide 42 with v = g.V('username','torvalds'):
      ==>iksaif=90
      ==>rjwysocki=22
      ==>kernel-digger=20
      ==>giuseppecalderaro=16
      ==>groeck=15
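The recommendation traversal on slides 42 to 44 counts paths from a user through pushed commits to repositories and back out to other users. A rough Python equivalent on a flat list of commits is sketched below; it ignores the range limits (`[0..9]`, `[0..500]`) of the Gremlin query, and the `commits` format and the `recommend` function are assumptions made for illustration.

```python
from collections import Counter


def recommend(commits, user, top_n=5):
    """Rank other users by the number of user -> commit -> repo <- commit <- other paths.

    commits: list of (username, repository) pairs, one per pushed commit.
    """
    # v.out('pushed').out('to'): one repository entry per commit, duplicates kept
    my_repos = [repo for who, repo in commits if who == user]
    counts = Counter()
    for repo in my_repos:
        # .in('to').in('pushed'): every commit pushed to the same repository
        for who, other_repo in commits:
            if other_repo == repo and who != user:  # .except([v])
                counts[who] += 1                    # groupCount
    # sort{-it.value}[0..4]: highest counts first
    return counts.most_common(top_n)
```

As in the Gremlin version, a collaborator scores higher both when they push often to a shared repository and when the queried user does, since every commit pair contributes one path.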
  45. Titan Embedding: Rexster RexPro (lightweight Gremlin server, based on Grizzly); Titan Gremlin engine; embedded storage backend (in-JVM method calls).
  46. Graph Partitioning. Goal: vertex co-location. Titan maintains multiple ID pools; ordered partitioner in the storage backend; dynamically determines the optimal partition and allocates IDs from the corresponding ID pool.
  47. What’s coming: full-text indexing (external index system integration); bulk loading (integration with storage backend utilities and Hadoop ingestion); 240 billion edge benchmark (performance analysis and improvements across the entire stack).
  48. V Faunus Graph Analytics AURELIUS THINKAURELIUS.COM
  49. Faunus Features: Hadoop-based graph computing framework; graph analytics; breadth-first traversals; global graph computations. Batch Big Graph Data.
  50. Faunus Architecture: g._()
  51. Faunus Work Flow [Figure: g.V.out.out.count(), with HDFS paths hdfs://user/ubuntu/, output/job-0/, output/job-1/, output/job-2/ { graph*, sideeffect* }]. Compressed HDFS graphs: stored in sequence files; variable-length encoding; prefix compression.
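Slide 51 mentions variable-length encoding for the compressed graphs. The slides do not describe the byte format, so the following Python sketch uses a common technique as a stand-in: base-128 varints, combined here with delta coding of sorted ids (my substitution, not necessarily what Faunus does). All function names are hypothetical.

```python
def encode_varint(n):
    """Encode a non-negative int in 7-bit groups, low group first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_sorted_ids(ids):
    """Store each id as the varint-encoded gap to the previous id."""
    out = bytearray()
    prev = 0
    for value in sorted(ids):
        out += encode_varint(value - prev)
        prev = value
    return bytes(out)


def decode_sorted_ids(data):
    """Inverse of encode_sorted_ids."""
    ids = []
    prev = 0
    current = 0
    shift = 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += current
            ids.append(prev)
            current = 0
            shift = 0
    return ids
```

Small numbers take one byte, so an adjacency list of nearby ids shrinks considerably: three ids around one million fit in five bytes here instead of three fixed-width 8-byte longs.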
  52. Faunus Setup
      $ bin/gremlin.sh
               \,,,/
               (o o)
      -----oOOo-(_)-oOOo-----
      gremlin> g = FaunusFactory.open('bin/titan-hbase.properties')
      ==>faunusgraph[titanhbaseinputformat]
      gremlin> g.getProperties()
      ==>faunus.graph.input.format=com.thinkaurelius.faunus.formats.titan.hbase.TitanHBaseInputFormat
      ==>faunus.graph.output.format=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
      ==>faunus.sideeffect.output.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
      ==>faunus.output.location=dbpedia
      ==>faunus.output.location.overwrite=true
      gremlin> g._()
      12/11/09 15:17:45 INFO mapreduce.FaunusCompiler: Compiled to 1 MapReduce job(s)
      12/11/09 15:17:45 INFO mapreduce.FaunusCompiler: Executing job 1 out of 1: MapSequence[com.thinkaurelius.faunus.mapreduce.transform.IdentityMap.Map]
      12/11/09 15:17:50 INFO mapred.JobClient: Running job: job_201211081058_0003
  53. Graph Analytics
      gremlin> g.E.has('label','followed').keep.
                 V.sideEffect('{it.degree = it.outE.count()}').
                 degree.groupCount
      gremlin> g.E.has('label','pushed').keep.
                 V.sideEffect('{it.degree = it.outE.count()}').
                 degree.groupCount
  54. Follow Degree Distribution
  55. Follow Degree Distribution: $P(k) \sim k^{-\gamma}$, $\gamma = 2.2$
  56. Pushed Degree Distribution
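The degree distributions on slides 53 to 56 come from grouping vertices by out-degree. A small Python sketch of the same computation on an edge list follows, together with a crude log-log least-squares estimate of the exponent in P(k) ~ k^(-γ). The fitting method is an assumption; the slides do not say how γ = 2.2 was obtained, and both function names are hypothetical.

```python
import math
from collections import Counter


def out_degree_distribution(edges):
    """edges: iterable of (source, target). Returns {out-degree: number of vertices}.

    Vertices without outgoing edges are not counted in this sketch.
    """
    degree = Counter(src for src, _ in edges)
    return Counter(degree.values())


def estimate_gamma(distribution):
    """Negative slope of log(count) against log(degree), by least squares.

    Needs at least two distinct degrees; a rough estimate only.
    """
    xs = [math.log(k) for k in distribution]
    ys = [math.log(count) for count in distribution.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope
```

A straight line on a log-log plot of the distribution indicates a power law; the slope of that line is what `estimate_gamma` approximates.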
  57. Global Recommendations
      gremlin> g.E.has('label','pushed','to').keep.
                 V.out('pushed').out('to').
                 in('to').in('pushed').
                 sideEffect('{it.score = it.pathCounter}').
                 score.order(F.decr,'name')
      # Top 5:
      Jippi              60892182927
      garbear            30095282886
      FakeHeal           30038040349
      brianchandotcom    24684133382
      nyarla             15230275746
  58. What’s coming: Faunus 0.1; bulk loading (loading graphs into Titan; loading derivations into Titan); extending Gremlin support (currently only a subset of Gremlin is implemented); operational tools.
  59. I Graph = Relationship Centric
  60. II Graph = Agile Data Model
  61. III Graph = Algebraic Data Model
  62. Aurelius Graph Cluster (Apache 2). Three components (logos not captured in the transcript): one stores a massive-scale property graph allowing real-time traversals and updates; one does batch processing of large graphs with Hadoop; one runs global graph algorithms on large, compressed, in-memory graphs. [Figure arrows: Map/Reduce Load & Compress; Analysis results back into Titan]
  63. The Graph Landscape [Figure: Speed of Traversal/Process plotted against Size of Graph; illustration only, not to scale]
  64. TINKERPOP.COM
  65. Thanks! Vadas Gintautas (@vadasg), Marko Rodriguez (@twarko), Stephen Mallette (@spmallette), Daniel LaRocque. AURELIUS THINKAURELIUS.COM
  66. AURELIUS THINKAURELIUS.COM
  67. XVX Benchmark Results AURELIUS THINKAURELIUS.COM
  68. XVX - I Titan Performance Evaluation on Twitter-like Benchmark AURELIUS THINKAURELIUS.COM
  69. Twitter Benchmark: 1.47 billion followship edges and 41.7 million users; loaded into Titan using BatchGraph; Twitter in 2009, crawled by Kwak et al. Four transaction types: create account (1%); publish tweet (15%); read stream (76%); recommendation (8%), with follow recommended user (30%). Reference: Kwak, H., Lee, C., Park, H., Moon, S., “What is Twitter, a Social Network or a News Media?,” World Wide Web Conference, 2010.
  70. Benchmark Setup: 6 cc1.4xl Cassandra nodes (in one placement group; Cassandra 1.10); 40 m1.small worker machines (repeatedly running transactions; simulating servers handling user requests); EC2 cost: $11/hour.
  71. Benchmark Results
      Transaction Type   Number of tx   Mean tx time   Std of tx time
      Create account          379,019      115.15 ms          5.88 ms
      Publish tweet         7,580,995       18.45 ms          6.34 ms
      Read stream          37,936,184        6.29 ms          1.62 ms
      Recommendation        3,793,863       67.65 ms         13.89 ms
      Total                49,690,061
      Runtime: 2.3 hours (5,900 tx/sec)
  72. Peak Load Results
      Transaction Type   Number of tx   Mean tx time   Std of tx time
      Create account          374,860      172.74 ms         10.52 ms
      Publish tweet         7,517,667       70.07 ms         19.43 ms
      Read stream          37,618,648       24.40 ms          3.18 ms
      Recommendation        3,758,266      229.83 ms         29.08 ms
      Total                49,269,441
      Runtime: 1.3 hours (10,200 tx/sec)
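The totals and throughput figures on slides 71 and 72 can be cross-checked with a few lines of Python. The numbers are copied from the two tables; the `runs` structure and helper functions are hypothetical. Because the runtimes are rounded to one decimal, the computed throughput only approximates the reported 5,900 and 10,200 tx/sec.

```python
# Transaction counts and runtimes as given on slides 71 and 72.
runs = {
    "normal": {
        "hours": 2.3,
        "tx": {"create account": 379_019, "publish tweet": 7_580_995,
               "read stream": 37_936_184, "recommendation": 3_793_863},
    },
    "peak": {
        "hours": 1.3,
        "tx": {"create account": 374_860, "publish tweet": 7_517_667,
               "read stream": 37_618_648, "recommendation": 3_758_266},
    },
}


def total_tx(run):
    return sum(run["tx"].values())


def throughput(run):
    """Transactions per second, from the rounded runtime in hours."""
    return total_tx(run) / (run["hours"] * 3600)


def mix(run):
    """Share of each transaction type as a percentage of all transactions."""
    total = total_tx(run)
    return {name: 100 * count / total for name, count in run["tx"].items()}


for name, run in runs.items():
    print(name, total_tx(run), round(throughput(run)), mix(run))
```

The per-type shares line up with the transaction mix on slide 69 (roughly 1%, 15%, 76% and 8%), and the row sums match the totals printed in both tables.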
  73. Benchmark Conclusion: Titan can handle 10s of thousands of concurrent users with short response times even for complex traversals on a simulated social networking application based on real-world network data with billions of edges and millions of users in a standard EC2 deployment. For more information on the benchmark: http://thinkaurelius.com/2012/08/06/titan-provides-real-time-big-graph-data/