WOOster: A Map-Reduce based Platform for Graph Mining


Large-scale graphs containing on the order of a billion vertices
are becoming increasingly common in various applications.
With graphs of this size, an efficient querying
infrastructure becomes crucial. In this paper, we
propose WOOster, a hosted querying infrastructure
designed specifically for large graphs. We make two
key contributions: (a) the design of the WOOster framework;
(b) scalable map-reduce algorithms for two popular
graph queries: sub-graph match and reachability.
Our experiments show that the proposed map-reduce
algorithms scale well with large synthetic datasets.




    Presentation Transcript

    • WOOster: A Map-Reduce based Platform for Graph Mining. Aravindan Raghuveer, Yahoo! Inc, Bangalore.
    • Introduction
      • “If you squint the right way, graphs are everywhere” [1]
      • At Yahoo!: the WOO Graph, all knowledge assimilated from the web: http://iswc2011.semanticweb.org/fileadmin/iswc/Papers/Industry/WOO_ISWC.pptx
      • [1] http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html
      • Yahoo! Confidential
    • The What and Why?
      • What?
        • Family of graph query algorithms.
        • Framework: for graph storage and for invoking the query algorithms.
        • Hosted solution on Hadoop.
      • Why?
        • Family of graph query algorithms: present-day algorithms do not scale to graphs with a billion vertices and edges.
        • Framework: optimizes the storage layout to suit graph query algorithms and improves the throughput of the queries.
    • Outline of the talk
      • MapReduce 101
      • Graph Mining Approaches
      • Brief overview of WOOster architecture
      • Graph query algorithms in WOOster:
        • Sub-Graph Matching
        • Reachability Query
      • Experiments
      • Conclusion
    • MapReduce 101
      • Switch to slides from "Cloud Computing with MapReduce and Hadoop": www.cs.berkeley.edu/~matei/talks/2009/parlab_bootcamp_clouds.ppt
    • MapReduce Programming Model
      • Data type: key-value records
      • Map function: (K_in, V_in) → list(K_inter, V_inter)
      • Reduce function: (K_inter, list(V_inter)) → list(K_out, V_out)
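    • Example: Word Count
        # output() stands for the framework's emit call.
        def mapper(line):
            # Emit a count of 1 for every word in the input line.
            for word in line.split():
                output(word, 1)

        def reducer(key, values):
            # Sum all counts emitted for the same word.
            output(key, sum(values))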
    • Word Count Execution (diagram: Input → Map → Shuffle & Sort → Reduce → Output)
      • Input splits: "the quick brown fox", "the fox ate the mouse", "how now brown cow"
      • Map output: (the, 1), (quick, 1), (brown, 1), (fox, 1); (the, 1), (fox, 1), (ate, 1), (the, 1), (mouse, 1); (how, 1), (now, 1), (brown, 1), (cow, 1)
      • Reduce output: (brown, 2), (fox, 2), (how, 1), (now, 1), (the, 3); (ate, 1), (cow, 1), (mouse, 1), (quick, 1)
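The word-count flow can be reproduced on a single machine to see what each phase contributes. The following is only a minimal sketch: `run_word_count` is a hypothetical driver, not a Hadoop API, and an in-memory dictionary stands in for the framework's shuffle and sort.

```python
from collections import defaultdict

def mapper(line):
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.split():
        yield word, 1

def reducer(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)

def run_word_count(lines):
    """Hypothetical local driver that mimics map, shuffle & sort, reduce."""
    # Shuffle & sort: group the intermediate values by key.
    groups = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            groups[word].append(count)
    # Reduce: one call per distinct key, in sorted key order.
    return dict(reducer(word, counts) for word, counts in sorted(groups.items()))

counts = run_word_count([
    "the quick brown fox",
    "the fox ate the mouse",
    "how now brown cow",
])
print(counts)
```

On a real cluster the grouping step is done by the framework across machines; only `mapper` and `reducer` are user code.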
    • Graph Mining Approaches: Two Schools
      • School-1: Invent a new platform
        • Map-reduce is not best suited for graph mining.
        • BSP, PRAM models: circa 1980s.
        • Pregel from Google [1]; HaLoop.
      • School-2: Ride on Map-Reduce
        • MR has wide adoption, open source tools, industry support.
        • Invest in one more computing infrastructure
        • Apache Giraph: http://incubator.apache.org/giraph/ (BSP on Hadoop)
        • Efforts in open source / academia on the same lines:
          • Pegasus, CMU [2]
          • Graph Mining in Apache Mahout [3]
          • Raytheon's Graph Mining [4]
      • [1] SIGMOD 2010, http://dl.acm.org/citation.cfm?id=1807184
      • [2] http://www.cs.cmu.edu/~pegasus/
      • [3] http://www.robust-project.eu/news/robust-project-pushes-large-scale-graph-mining-with-hadoop-apache
      • [4] http://www.cloudera.com/blog/2010/03/how-raytheon-researchers-are-using-hadoop-to-build-a-scalable-distributed-triple-store/
    • WOOster Architecture
      • Workflow:
        • User submits a query.
        • Planner periodically scans for newly arrived queries.
        • Planner creates an M-R plan that re-uses computation and IO across queries (batching).
        • Executor executes the M-R plan.
        • Result is notified to the user.
      • Components in the diagram: WOOster Web UI & WebService APIs; Planner; Graph Indices; Jobs D/B; Executor; WOO Graph; Grid (labelled "Hosted Solution").
    • The Sub-Graph Match Query
      • Find all instances in graph G of query Q.
      • Vertices have attributes (e.g. age: 31).
      • Vertices and edges have constraints (e.g. age < 40).
      • Edges have relationship labels.
      • Notation (diagram legend): query vertex; graph vertex; a matched graph vertex.
      • Why Sub-Graph Match (Exact Graph Isomorphism)?
        • A popular and expressive graph query, useful for mining patterns.
        • To our knowledge, no large-scale algorithm exists that operates on a billion-vertex graph.
    • Overview of the Solution
      • Step-0. Data Layout on HDFS
      • Step-1. Query Graph Partitioning
      • Step-2. Edge Selection
      • Step-3. Query Partition Matching
      • Step-4. Query Partition Merging
    • Data Layout on HDFS
      • How to store a large-scale graph? An adjacency-list-like solution.
      • Each row/line has information about a vertex:
        • Vertex attributes
        • Vertex neighbors and the labels associated with each edge
      • Implications:
        • Enables early pruning of non-matching edges and vertices.
        • Each vertex has information about itself and its immediate neighbors only.
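To make the adjacency-list layout concrete, here is a small sketch of what one row might look like and how a mapper could prune on it. The row format (tab-separated vertex id, JSON attributes, JSON list of neighbor/label pairs) and the function name `parse_vertex_row` are assumptions made for illustration; the slides do not specify the actual on-disk encoding.

```python
import json

def parse_vertex_row(line):
    """Parse one assumed row: vertex id, attributes, labelled neighbor edges."""
    vertex_id, attrs_json, edges_json = line.rstrip("\n").split("\t")
    return vertex_id, json.loads(attrs_json), json.loads(edges_json)

# Hypothetical row for vertex g1 with one attribute and two labelled edges.
row = 'g1\t{"age": 31}\t[["g2", "friend"], ["g3", "colleague"]]'
vertex_id, attrs, edges = parse_vertex_row(row)

# Early pruning: the vertex constraint (age < 40) and the edge-label
# constraint can both be checked from this single row.
vertex_matches = "age" in attrs and attrs["age"] < 40
friend_edges = [(vertex_id, dst) for dst, label in edges if label == "friend"]
print(vertex_matches, friend_edges)
```

Because a row carries only the vertex and its immediate neighbors, anything that needs two-hop information has to be assembled in a later map-reduce step.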
    • Step-1: Query Graph Partitioning
      • Why? Parallelized solving of independent sub-problems.
      • How? Find the minimum number of partitions such that the diameter of each partition is 2.
      • Intuition:
        • In a spanning tree of diameter 2, there is one vertex that is connected to all other vertices: the pivot vertex (marked "Pivot Vertices" in the diagram).
        • This property is used in Steps 2 and 3.
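One way to picture this step is as splitting the query's edges into stars, each with a pivot at its center. The sketch below uses a simple greedy heuristic (repeatedly pick the vertex touching the most uncovered edges). This is an assumption for illustration only: it is not necessarily the partitioning algorithm used in WOOster, and it does not guarantee the minimum number of partitions.

```python
from collections import defaultdict

def partition_query(edges):
    """Greedily split query edges into star partitions: {pivot: [edges]}."""
    uncovered = {frozenset(e) for e in edges}
    partitions = {}
    while uncovered:
        # Count how many uncovered edges each vertex touches.
        degree = defaultdict(int)
        for edge in uncovered:
            for v in edge:
                degree[v] += 1
        # Pivot = vertex with the most uncovered edges (ties broken by name).
        pivot = max(sorted(degree), key=lambda v: degree[v])
        star = sorted(tuple(sorted(e)) for e in uncovered if pivot in e)
        partitions[pivot] = star
        uncovered -= {frozenset(e) for e in star}
    return partitions

# Hypothetical query graph with two natural pivots, q1 and q4.
query_edges = [("q1", "q2"), ("q1", "q3"), ("q1", "q4"), ("q4", "q5"), ("q4", "q6")]
parts = partition_query(query_edges)
print(parts)
```

Every partition produced this way is a star, so any two of its vertices are at most two hops apart through the pivot.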
    • Step-2: Edge Selection
      • What: Select the subset of edges from G that match at least one edge in Q.
      • How (diagram: graph vertices g1 to g4 flowing through Map Logic and Reduce Logic):
        • 1. g1: current vertex in the mapper.
        • 2a. For every neighbor of q1, there exists a corresponding neighbor for g1.
        • 2b. Mapper emits all edges if the vertex and edge constraints are met.
        • 3. g1-g2 emitted: g1 mapped to a query vertex.
        • 4. g1-g2 emitted from g2's mapper.
        • 5. Reducer emits an edge if a pair is found.
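The edge-selection idea can be sketched as a single map-reduce pass in which an edge survives only if both of its endpoints emit it. Everything below is an illustrative assumption: the toy `GRAPH`, the query encoding, and the simplification that edges are undirected and matched by label alone. The paper's algorithm may differ in its details.

```python
from collections import defaultdict

# Hypothetical toy data: vertex -> (attributes, [(neighbor, edge label)]).
GRAPH = {
    "g1": ({"age": 31}, [("g2", "friend"), ("g3", "friend")]),
    "g2": ({"age": 25}, [("g1", "friend")]),
    "g3": ({"age": 52}, [("g1", "friend")]),
}
# Hypothetical query: q1 -friend- q2, both endpoints constrained to age < 40.
QUERY_VERTICES = {"q1": lambda a: a["age"] < 40, "q2": lambda a: a["age"] < 40}
QUERY_EDGES = {("q1", "q2"): "friend", ("q2", "q1"): "friend"}

def mapper(g, attrs, neighbors):
    """Emit the edges of g that a query vertex q matched by g could use."""
    for q, vertex_ok in QUERY_VERTICES.items():
        if not vertex_ok(attrs):
            continue
        wanted = {lbl for (src, _), lbl in QUERY_EDGES.items() if src == q}
        have = {lbl for _, lbl in neighbors}
        # Every query neighbor of q needs a counterpart among g's neighbors.
        if not wanted <= have:
            continue
        for n, lbl in neighbors:
            if lbl in wanted:
                yield tuple(sorted((g, n))), (g, q)

def reducer(edge, values):
    """Keep an edge only if both of its endpoints emitted it."""
    if {g for g, _ in values} == set(edge):
        yield edge, sorted(values)

# Local stand-in for the framework's shuffle.
shuffled = defaultdict(list)
for g, (attrs, neighbors) in GRAPH.items():
    for key, value in mapper(g, attrs, neighbors):
        shuffled[key].append(value)
selected = dict(kv for key, vals in sorted(shuffled.items()) for kv in reducer(key, vals))
print(selected)
```

Here g1-g3 is dropped because g3 fails the age constraint and never emits the edge, so the reducer sees only one half of the pair.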
    • Step-3: Query Partition Matching
      • Edge Selection:
        • Associates a graph vertex to the possible query vertices it could map to.
        • Associates the graph vertex to its “pivot” graph vertex.
      • A pivot graph vertex is a graph vertex that is mapped to a pivot query vertex (g1 in this example).
      • Diagram (Edge Selection output flowing through Map Logic and Reduce Logic, over g1 to g4):
        • 1. Mapper emits the pivot graph vertex as key and the edge as value.
        • 2. Reducer receives all edges with the same pivot graph vertex.
        • 3. Reducer forms the partition.
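Keying by the pivot graph vertex means each reducer call sees exactly the edges that could form one partition instance. A minimal sketch follows, assuming Edge Selection has already tagged each selected edge with its pivot. The input records and the `leaves_needed` check are hypothetical simplifications, not the paper's full matching logic.

```python
from collections import defaultdict

# Hypothetical Edge Selection output: (pivot graph vertex, selected edge).
selected_edges = [
    ("g1", ("g1", "g2")),
    ("g1", ("g1", "g3")),
    ("g1", ("g1", "g4")),
    ("g7", ("g7", "g8")),
]

def mapper(pivot, edge):
    """Emit the pivot graph vertex as key and the edge as value."""
    yield pivot, edge

def reducer(pivot, edges, leaves_needed):
    """Form a candidate partition if the pivot has enough selected edges."""
    if len(edges) >= leaves_needed:
        yield pivot, sorted(edges)

# Local stand-in for the framework's shuffle.
shuffled = defaultdict(list)
for pivot, edge in selected_edges:
    for key, value in mapper(pivot, edge):
        shuffled[key].append(value)

# Assume the query partition is a star with three leaves around its pivot.
partitions = dict(
    kv
    for pivot, edges in shuffled.items()
    for kv in reducer(pivot, edges, leaves_needed=3)
)
print(partitions)
```

The pivot g7 has only one selected edge, so it cannot host a three-leaf partition and is discarded.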
    • Step-4: Query Partition Merging
      • Merges partitions one after another to form a query match.
      • More details in the paper.
      • Take-away from Steps 1-4 (also for any scalable Map-Reduce program):
        • The mapper/reducer keys are chosen such that the number of keys is proportional to the number of matches of query Q in the graph.
        • Hence the algorithm scales well for large graphs and complex queries.
    • Results (chart: time in seconds against number of reducers, 100 to 250, for Edge Selection, Query Partition Matching, and Query Partition Merging)
      • Graph of 10 million vertices and 50 million edges.
      • Complex query of 24 vertices.
      • Note that the edge selection time reduces with an increasing number of reducers.
    • In the paper…
      • Detailed map-reduce algorithms for sub-graph match and reachability
      • Theoretical analysis for scalability
      • Construction of the synthetic dataset
      • Methodology and more experiments
      • Reachability query: examples, map-reduce algorithm
      • Related work
    • Future Work
      • Indexing structure for graphs suited for M-R jobs.
      • Compare with a Giraph-based approach.
      • Better batching strategies.
      • The right interface for custom graph algorithms to be plugged in while WOOster provides automatic batching.
      • More graph mining algorithms implemented.
    • Questions / Comments