• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
Big Graph Data
 

Big Graph Data

on

  • 5,740 views

The problems we are faced with in the 21st century require efficient analysis of ever more complex systems. This presentation outlines how such problems can be better understood and effectively solved ...

The problems we are faced with in the 21st century require efficient analysis of ever more complex systems. This presentation outlines how such problems can be better understood and effectively solved if they are modeled as graphs or networks. We present two tools for to help solve such problems at scale: Titan, which is a real-time distributed graph database based on Apache Cassandra and Hbase and Faunus, which is a batch analytics framework for graphs based on Apache Hadoop. We discuss their current development status as of November 2012 and illustrate an example application for the GitHub coding network.

Statistics

Views

Total Views
5,740
Views on SlideShare
5,241
Embed Views
499

Actions

Likes
20
Downloads
268
Comments
1

9 Embeds 499

http://www.scoop.it 422
http://skilldev.cloudapp.net 32
http://formacionmestreacasa.gva.es 21
https://twitter.com 14
http://circleoflightspiritualcentre.wordpress.com 3
http://www.skill-player.com 3
http://lectrio.com 2
http://twitter.com 1
https://si0.twimg.com 1
More...

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel

11 of 1 previous next

  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
  • These presentations look great - are there videos for them anywhere?
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    Big Graph Data Big Graph Data Presentation Transcript

    • BIG GRAPH DATAUnderstanding a Complex WorldMatthias Broecheler, CTO@mbroecheler AURELIUSNovember XIII, MMXII THINKAURELIUS.COM
    • IGraph Foundation AURELIUS THINKAURELIUS.COM
    • name: Neptune name: Alcmene type: god type: godVertex Property name: Saturn name: Jupiter name: Hercules type: titan type: god type: demigod name: Pluto name: Cerberus type: god type: monster Graph
    • name: Neptune name: Alcmene type: god type: godEdge brother mother name: Saturn name: Jupiter name: Hercules type: titan type: god type: demigod father father Edge battled brother Property time:12 name: Pluto name: Cerberus type: god type: monster Edge Type pet Graph
    • name: Neptune name: Alcmene type: god type: god brother mothername: Saturn name: Jupiter name: Herculestype: titan type: god type: demigod father father battled brother time:12 name: Pluto name: Cerberus type: god type: monster pet Path
    • name: Neptune name: Alcmene type: god type: god brother mothername: Saturn name: Jupiter name: Herculestype: titan type: god type: demigod father father battled brother time:12 name: Pluto name: Cerberus type: god type: monster pet Degree
    • IConnected World AURELIUS THINKAURELIUS.COM
    • HEALTH
    • HEALTH
    • HEALTH
    • HEALTH
    • ECONOMY
    • ECONOMY
    • ECONOMY
    • ECONOMY
    • SocialSystems
    • SocialSystems
    • SocialSystems
    • SocialSystems
    • IIITitan Graph Database AURELIUS THINKAURELIUS.COM
    • Titan Features  Numerous Concurrent Users  Many Short Transactions   read/write  Real-time Traversals (OLTP)  High Availability  Dynamic Scalability  Variable Consistency Model   ACID or eventual consistency  Real-time Big Graph Data
    • Storage Backends PartitionabilityConsistency Availability
    • Titan FeaturesI.  Data ManagementII.  Vertex-Centric Indices
    • Titan FeaturesIII.  Graph PartitioningIV.  Edge Compression
    • Titan Ecosystem  Native Blueprints Graph Server Implementation Graph  Gremlin Query Algorithms Language Object-Graph Mapper  Rexster Server Traversal Language   any Titan graph can be exposed as a REST endpoint Dataflow Processing Generic Graph API
    • IVGithub Network AURELIUS THINKAURELIUS.COM
    • Setup$ ./titan-0.1.0/bin/gremlin.sh! ! ! !,,,/! (o o)!-----oOOo-(_)-oOOo-----!gremlin> g = TitanFactory.open(/tmp/titan)!==>titangraph[local:/tmp/titan]!
    • Titan Storage Model  Adjacency list in one 5 column family  Row key = vertex id  Each property and edge in one column 5   Denormalized, i.e. stored twice  Direction and label/key as column prefix   Use slice predicate for quick retrieval
    • created USER edited opened pushedCOMMENT PAGE on ISSUE COMMIT on on to in REPOSITORY
    • Defining Property Keysgremlin> g.makeType().name(‘username’).! ! ! ! dataType(String.class).! ! ! ! functional().! ! ! ! indexed().unique().! ! ! ! makePropertyKey()!gremlin> g.makeType().name(‘time’).! ! ! ! dataType(Long.class).! ! ! ! functional().makePropertyKey()!
    • Defining Edge Labelsgremlin> g.makeType().name(‘on’).! ! ! ! makeEdgeLabel()!gremlin> g.makeType().name(‘pushed’).! ! ! ! primaryKey(time).! ! ! ! makeEdgeLabel()!gremlin> g.makeType().name(‘in’).! ! ! ! unidirected().! ! ! ! makeEdgeLabel()!
    • Create & Retrievegremlin> v = g.addVertex([username: ‘okram’])!==>v[4]!gremlin> v.map!==>{username=okram}!gremlin> g.V(username,okram)!==>v[4]!
    • Titan Locking  Locking ensures consistency when it is needed name : Hercules 5  Titan uses time stamped consistent reads and writes 9 on separate CFs for locking  Uses name :   Property uniqueness: .unique() name : Hercules Jupiter   Functional edges: .functional() father   Global ID management x name : father Pluto
    • Titan Indexing  Vertices can be retrieved by property key + value name : Hercules 5  Titan maintains index in a separate column family as name : Jupiter 9 graph is updated  Only need to define a property key as .index()
    • Basic Queriesgremlin> v.out(‘pushed’)!gremlin> v.out(‘pushed’).out(‘to’).name!gremlin> v.out(‘pushed’).out(‘to’).dedup.name!gremlin> v.out(‘pushed’).out(‘to’).dedup.! ! ! ! name.sort{it}!gremlin> v.outE(‘pushed’).has(‘time’,T.gt,1000).inV!
    • Basic Queriesgremlin> v.out(‘pushed’)!gremlin> v.out(‘pushed’).out(‘to’).name!gremlin> v.out(‘pushed’).out(‘to’).dedup.name!gremlin> v.out(‘pushed’).out(‘to’).dedup.! ! ! ! name.sort{it}!gremlin> v.outE(‘pushed’).has(‘time’,T.gt,1000).inV! Query Optimization
    • Vertex-Centric Indices  Sort and index edges per vertex by primary key   Primary key can be composite  Enables efficient focused traversals   Only retrieve edges that matter  Uses push down predicates for quick, index-driven retrieval
    • battled battled battled time: 1 time: 3 time: 5 mother battled v v.query()! time: 9 father fought fought
    • battled battled battled time: 1 time: 3 time: 5 mother battled v v.query()! time: 9 .direction(OUT)! father
    • battled battled battled time: 1 time: 3 time: 5 battled v v.query()! time: 9 .direction(OUT)! .labels(‘battled’)!
    • battled battled time: 1 time: 3 v v.query()! .direction(OUT)! .labels(‘battled’)! .has(‘time,T.lt,5)!
    • Recommendation Enginegremlin> v.out(pushed).out(to)[0..9].! ! ! ! in(to).in(pushed)[0..500].! ! ! ! except([v]).name.! ! ! ! groupCount.cap.next().sort{-it.value}[0..4]!
    • Recommendation Enginegremlin> v.out(pushed).out(to)[0..9].! ! ! ! in(to).in(pushed)[0..500].! ! ! ! except([v]).name.! ! ! ! groupCount.cap.next().sort{-it.value}[0..4]!v = g.V(‘username’,’okram’):!==>lvca=175!==>spmallette=56!==>sgomezvillamor=36!==>mbroecheler=33!==>joshsh=20!
    • Recommendation Enginegremlin> v.out(pushed).out(to)[0..9].! ! ! ! in(to).in(pushed)[0..500].! ! ! ! except([v]).name.! ! ! ! groupCount.cap.next().sort{-it.value}[0..4]!v = g.V(‘username’,’torvalds’):!==>iksaif=90!==>rjwysocki=22!==>kernel-digger=20!==>giuseppecalderaro=16!==>groeck=15!
    • Titan Embedding  Rexster RexPro   lightweight Gremlin Server   based on Grizzly  Titan Gremlin Engine  Embedded Storage Backend   in-JVM method calls
    • Graph PartitioningGoal: Vertex Co-location  Titan maintains multiple ID Pools  Ordered Partitioner in Storage Backend  Dynamically determines optimal partition and allocates corresponding ID Pool IDs
    • What’s coming  Full-text indexing   external index system integration  Bulk Loading   integration with storage backend utilities and Hadoop ingestion  240 Billion Edge Benchmark   performance analysis and improvements across the entire stack
    • VFaunus Graph Analytics AURELIUS THINKAURELIUS.COM
    • Faunus Features  Hadoop-based Graph Computing Framework  Graph Analytics  Breadth-first Traversals  Global Graph Computations  Batch Big Graph Data
    • Faunus Architecture g._()!
    • Faunus Work Flowg.V.out .out .count() hdfs://user/ubuntu/ output/job-0/ output/job-1/ graph* output/job-2/ { sideeffect*Compressed HDFS Graphs  stored in sequence files  variable length encoding  prefix compression
    • Faunus Setup$ bin/gremlin.sh ! ,,,/! (o o)!-----oOOo-(_)-oOOo-----!gremlin> g = FaunusFactory.open(bin/titan-hbase.properties)!==>faunusgraph[titanhbaseinputformat]!gremlin> g.getProperties()!==>faunus.graph.input.format=com.thinkaurelius.faunus.formats.titan.hbase.TitanHBaseInputFormat==>faunus.graph.output.format=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat!==>faunus.sideeffect.output.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat!==>faunus.output.location=dbpedia!==>faunus.output.location.overwrite=true!gremlin> g._() !12/11/09 15:17:45 INFO mapreduce.FaunusCompiler: Compiled to 1 MapReduce job(s)!12/11/09 15:17:45 INFO mapreduce.FaunusCompiler: Executing job 1 out of 1:MapSequence[com.thinkaurelius.faunus.mapreduce.transform.IdentityMap.Map]!12/11/09 15:17:50 INFO mapred.JobClient: Running job: job_201211081058_0003!
    • Graph Analyticsgremlin> g.E.has(label,’followed).keep.! ! ! !V.sideEffect({it.degree = it.outE.count()}).! ! ! !degree.groupCount!gremlin> g.E.has(label,pushed).keep.! ! ! !V.sideEffect({it.degree = it.outE.count()}).! ! ! !degree.groupCount!
    • Follow Degree Distribution
    • Follow Degree Distribution P(k) ~ k-γ γ = 2.2
    • Pushed Degree Distribution
    • Global Recommendationsgremlin> g.E.has(label,pushed,to).keep.! ! ! !V.out(pushed).out(to).! ! ! !in(to).in(pushed).! ! ! !sideEffect({it.score =it.pathCounter}).! ! ! !score.order(F.decr,name)!# Top 5:!Jippi ! ! ! !60892182927!garbear ! ! !30095282886!FakeHeal ! ! !30038040349!brianchandotcom !24684133382!nyarla ! ! !15230275746!
    • What’s coming  Faunus 0.1  Bulk Loading   loaded graph into Titan   loading derivations into Titan  Extending Gremlin Support   currently only a subset is of Gremlin implemented  Operational Tools
    • IGraph = Relationship Centric
    • IIGraph = Agile Data Model
    • IIIGraph = Algebraic Data Model
    • Aurelius Graph Cluster Apache 2 Map/Reduce Load & Compress Analysis results back into Titan Stores a massive-scale Batch processing of large Runs global graph algorithmsproperty graph allowing real- graphs with Hadoop on large, compressed, time traversals and updates in-memory graphs
    • Speed of Traversal/Process The Graph LandscapeIllustration only, not to scale Size of Graph
    • TINKERPOP.COM
    • Thanks! Vadas Gintautas Marko Rodriguez @vadasg @twarko Stephen Mallette Daniel LaRocque @spmallette AURELIUS THINKAURELIUS.COM
    • AURELIUSTHINKAURELIUS.COM
    • XVXBenchmark Results AURELIUS THINKAURELIUS.COM
    • XVX - ITitan Performance Evaluation onTwitter-like Benchmark AURELIUS THINKAURELIUS.COM
    • Twitter Benchmark  1.47 billion followship edges and 41.7 million users   Loaded into Titan using BatchGraph   Twitter in 2009, crawled by Kwak et. al  4 Transaction Types   Create Account (1%)   Publish tweet (15%)   Read stream (76%)   Recommendation (8%)   Follow recommended user (30%) Kwak, H., Lee, C., Park, H., Moon, S., “What is Twitter, a Social Network or a News Media?,” World Wide Web Conference, 2010.
    • Benchmark Setup  6 cc1.4xl Cassandra nodes   in one placement group   Cassandra 1.10  40 m1.small worker machines   repeatedly running transactions   simulating servers handling user requests  EC2 cost: $11/hour
    • Benchmark ResultsTransaction Type Number of tx Mean tx time Std of tx timeCreate account 379,019 115.15 ms 5.88 msPublish tweet 7,580,995 18.45 ms 6.34 msRead stream 37,936,184 6.29 ms 1.62 msRecommendation 3,793,863 67.65 ms 13.89 ms Total 49,690,061 Runtime 2.3 hours 5,900 tx/sec
    • Peak Load ResultsTransaction Type Number of tx Mean tx time Std of tx timeCreate account 374,860 172.74 ms 10.52 msPublish tweet 7,517,667 70.07 ms 19.43 msRead stream 37,618,648 24.40 ms 3.18 msRecommendation 3,758,266 229.83 ms 29.08 ms Total 49,269,441 Runtime 1.3 hours 10,200 tx/sec
    • Benchmark ConclusionTitan   can   handle   10s   of   thousands   of   concurrent  users   with   short   response   5mes   even   for   complex  traversals   on   a   simulated   social   networking  applica5on  based  on  real-­‐world  network  data  with  billions  of  edges  and  millions  of  users  in  a  standard  EC2  deployment.  For  more  informa5on  on  the  benchmark:  hDp://thinkaurelius.com/2012/08/06/5tan-­‐provides-­‐real-­‐5me-­‐big-­‐graph-­‐data/