Introduces graphs and explores what happens behind some applications that rely on graph processing (PageRank, Maps, Facebook, etc.). Also introduces, at a high level, the different frameworks and software behind graph processing.
Agenda
 - Introduction to graphs
 - Representing graphs
 - Different types of graphs
 - Algorithms in graphs
 - What constitutes a graph application
 - Graph databases (examples and how they work)
 - Graph computing engines (examples and how they work)
 - Questions & Answers
How is a graph represented?
[Figure: six numbered vertices (1 to 6) connected by edges, with a vertex and an edge labelled.]
A graph is a collection of vertices connected to each other by edges, with both vertices and edges having properties. A vertex can be a person, place, account or any item that needs to be tracked.
A social graph
[Figure: vertices Arun, Deepak, Bob, Sheetal, Tom and Prajval, connected by edges labelled Friend, Relative or Colleague.]
 - Vertex properties (example): Name: Arun, Age: 25, Sex: M
 - Edge properties (example): Relation: Colleague
Questions a social graph can answer:
 - Who are Arun's friends?
 - Whom should I be friends with?
 - Whom to recommend as a friend for Sheetal?
Facebook Recruiting Competition
Want an interview @ Facebook?
The challenge is to recommend missing links in a social network. Participants will be presented with an external anonymized, directed social graph (no, not Facebook, keep guessing) from which some edges have been deleted, and asked to make ranked predictions for each user in the test set of which other users they would want to follow.
What is Kaggle?
Kaggle is an innovative solution for statistical/analytics outsourcing. We are the leading platform for predictive modeling competitions. Companies, governments and researchers present datasets and problems, and the world's best data scientists then compete to produce the best solutions. At the end of a competition, the competition host pays prize money in exchange for the intellectual property behind the winning model.
http://www.kaggle.com/c/FacebookRecruiting
A spatial graph
[Figure: vertices New Delhi, Lucknow, Kolkata, Mumbai, Bangalore and Chennai, connected by edges labelled with distances between 250 km and 850 km.]
 - Vertex properties (example): Name: Bangalore, Population: 25,00,000, Area: 35,000 sq km
 - Edge properties (example): Distance: 700 km
Questions a spatial graph can answer:
 - What is the distance between Bangalore and Calcutta?
 - I would like to cover all the places; which is the shortest path?
How to represent a Graph for computing?
[Figure: the six-vertex directed graph, each vertex annotated with its list of neighbours.]
.... as an adjacency list for a sparse graph
   1 -> 2, 4, 5
   2 -> 3
   3 -> 5
   4 -> 3, 6
   5 ->
   6 -> 5
.... as an adjacency matrix for a dense graph
       1  2  3  4  5  6
   1   0  1  0  1  1  0
   2   0  0  1  0  0  0
   3   0  0  0  0  1  0
   4   0  0  1  0  0  1
   5   0  0  0  0  0  0
   6   0  0  0  0  1  0
A graph with few edges is sparse; one with many edges is dense.
The web, with billions of pages, cannot practically be represented as an adjacency matrix: the matrix needs one cell per pair of pages, and almost all of those cells would be 0.
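The two representations can be sketched in Java as below. This is a minimal illustration rather than how any particular framework stores graphs; the class name `GraphRepresentations` and its method names are made up for this example, and the edge set is the one given by the adjacency list on the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, for illustration only.
class GraphRepresentations {
    // Directed edges of the six-vertex example, as {from, to} pairs.
    static final int[][] EDGES = {
        {1, 2}, {1, 4}, {1, 5}, {2, 3}, {3, 5}, {4, 3}, {4, 6}, {6, 5}
    };

    // Adjacency list: memory grows with the number of edges (good for sparse graphs).
    static Map<Integer, List<Integer>> adjacencyList(int vertexCount, int[][] edges) {
        Map<Integer, List<Integer>> adj = new TreeMap<>();
        for (int v = 1; v <= vertexCount; v++) {
            adj.put(v, new ArrayList<>());
        }
        for (int[] e : edges) {
            adj.get(e[0]).add(e[1]);
        }
        return adj;
    }

    // Adjacency matrix: always vertexCount * vertexCount cells (good for dense graphs).
    static int[][] adjacencyMatrix(int vertexCount, int[][] edges) {
        int[][] m = new int[vertexCount][vertexCount];
        for (int[] e : edges) {
            m[e[0] - 1][e[1] - 1] = 1; // vertices are numbered from 1, arrays from 0
        }
        return m;
    }

    public static void main(String[] args) {
        System.out.println(adjacencyList(6, EDGES));
        for (int[] row : adjacencyMatrix(6, EDGES)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

The list stores 8 entries for this graph, while the matrix stores 36 cells regardless of how many edges exist, which is why the matrix form stops being practical for very large sparse graphs.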
Different Graphs
 - Social graph (Facebook, LinkedIn, etc.)
 - Spatial graph (Google Maps, MapQuest, FedEx, etc.)
 - Web graph (PageRank, recommendations, etc.)
 - Computer network graph (optimal network layout, etc.)
 - Financial graph (fraud detection, currency flow, etc.)
 - Data representations (lists, etc.)
 - Chemistry (to represent genomes/molecules)
 - And others
Some of the Graph Algorithms
 - Shortest path (finding the shortest path from A to B)
 - Minimum spanning tree (the cheapest way to connect objects so that every object is reachable from every other; can be used for internet links, cable wiring, etc.)
 - Graph center (placing a warehouse or hospital in a city so that all locations can be reached easily)
 - Bipartite matching (matching on a dating site, jobs to employees, and others)
 - Finding planar graphs (as in the case of circuit design)
http://www.graph-magics.com/practic_use.php
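As an illustration of the first algorithm in the list, here is a minimal sketch of shortest path by breadth-first search. It assumes an unweighted directed graph (the six-vertex example with edges 1 -> 2, 4, 5; 2 -> 3; 3 -> 5; 4 -> 3, 6; 6 -> 5), so "shortest" means fewest hops. For weighted edges such as road distances, an algorithm like Dijkstra's would be needed instead. The class and method names are made up for this example.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

// Hypothetical class, for illustration only.
class BfsShortestPath {
    // The six-vertex example graph as an adjacency list.
    static Map<Integer, List<Integer>> exampleGraph() {
        Map<Integer, List<Integer>> adj = new HashMap<>();
        adj.put(1, Arrays.asList(2, 4, 5));
        adj.put(2, Arrays.asList(3));
        adj.put(3, Arrays.asList(5));
        adj.put(4, Arrays.asList(3, 6));
        adj.put(5, Collections.<Integer>emptyList());
        adj.put(6, Arrays.asList(5));
        return adj;
    }

    // Returns the vertices on a fewest-hops path from source to target,
    // or an empty list when target cannot be reached.
    static List<Integer> shortestPath(Map<Integer, List<Integer>> adj, int source, int target) {
        Map<Integer, Integer> parent = new HashMap<>(); // also serves as the visited set
        Deque<Integer> queue = new ArrayDeque<>();
        parent.put(source, source);
        queue.add(source);
        while (!queue.isEmpty()) {
            int u = queue.poll();
            if (u == target) {
                break;
            }
            for (int v : adj.getOrDefault(u, Collections.<Integer>emptyList())) {
                if (!parent.containsKey(v)) {
                    parent.put(v, u);
                    queue.add(v);
                }
            }
        }
        if (!parent.containsKey(target)) {
            return Collections.emptyList();
        }
        // Walk the parent links back from target to source.
        LinkedList<Integer> path = new LinkedList<>();
        for (int v = target; v != source; v = parent.get(v)) {
            path.addFirst(v);
        }
        path.addFirst(source);
        return path;
    }

    public static void main(String[] args) {
        System.out.println(shortestPath(exampleGraph(), 1, 6));
    }
}
```

Breadth-first search visits vertices in order of hop count, so the first time the target is reached the recorded parent chain is already a shortest path.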
How to store a Graph?
Option 1: In a flat file, as
   1- 4,5,6
   4- 2,5,6
where vertex 1 is connected to vertices 4, 5 and 6, and so on. Simple and easy to maintain, but not efficient.
Option 2: In a relational database, using referencing tables or join tables.
Option 3: Using a specialized database designed only for graphs.
Comparing Graph with Relational DB
Which one would you prefer for storing graph data?
In a DB of 1,000,000 users, finding friends-of-friends for 1,000 users at various depths:
   Depth   Execution time, MySQL (seconds)   Execution time, Neo4j (seconds)
   2       0.016                             0.010
   3       30.267                            0.168
   4       1,543.505                         1.359
   5       Not finished in 1 hour            2.132
http://www.neotechnology.com/2012/06/how-much-faster-is-a-graph-database-really/
So, what is a Graph DB?
A graph database is any storage system that provides `index free adjacency`.
[Figure: the six-vertex graph, each vertex annotated with its list of neighbours.]
Every element (node or edge) has a direct pointer to its adjacent elements.
No index lookup: we can determine which vertex is adjacent to which other vertex without looking up an index tree.
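The idea of index-free adjacency can be sketched in plain Java: each vertex object holds direct references to its neighbours, so a traversal only follows pointers and never consults a global index or join table. This is an in-memory illustration under that assumption, not how Neo4j or any other product lays out its storage; the `Vertex` class and `withinHops` method are hypothetical names. The example graph is the six-vertex one (1 -> 2, 4, 5; 2 -> 3; 3 -> 5; 4 -> 3, 6; 6 -> 5).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical class, for illustration only.
class IndexFreeAdjacency {
    static final class Vertex {
        final int id;
        final List<Vertex> out = new ArrayList<>(); // direct references to neighbours

        Vertex(int id) {
            this.id = id;
        }

        void connectTo(Vertex other) {
            out.add(other);
        }
    }

    // Builds the example graph; index 0 of the returned array is unused.
    static Vertex[] exampleGraph() {
        Vertex[] v = new Vertex[7];
        for (int i = 1; i <= 6; i++) {
            v[i] = new Vertex(i);
        }
        int[][] edges = {{1, 2}, {1, 4}, {1, 5}, {2, 3}, {3, 5}, {4, 3}, {4, 6}, {6, 5}};
        for (int[] e : edges) {
            v[e[0]].connectTo(v[e[1]]);
        }
        return v;
    }

    // Ids of all vertices reachable from start in at most maxHops hops
    // (depth 2 corresponds to "friends and friends-of-friends").
    static Set<Integer> withinHops(Vertex start, int maxHops) {
        Set<Integer> seen = new TreeSet<>();
        seen.add(start.id);
        List<Vertex> frontier = Collections.singletonList(start);
        for (int hop = 0; hop < maxHops; hop++) {
            List<Vertex> next = new ArrayList<>();
            for (Vertex v : frontier) {
                for (Vertex n : v.out) { // pointer chase, no index lookup
                    if (seen.add(n.id)) {
                        next.add(n);
                    }
                }
            }
            frontier = next;
        }
        seen.remove(start.id);
        return seen;
    }

    public static void main(String[] args) {
        Vertex[] g = exampleGraph();
        System.out.println(withinHops(g[1], 2));
    }
}
```

The cost of each hop depends only on the number of neighbours actually visited, not on the total size of the graph, which is the property the friends-of-friends comparison against a relational database relies on.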
So, what is a Graph DB? (contd.)
A graph DB is the option when persisting graphs.
So, what is a Graph DB? (contd.)
Part of the NoSQL family.
[Figure: NoSQL database types plotted by data size against data complexity.]
 - Key-value stores like Amazon Dynamo
 - Columnar databases like Cassandra, HBase
 - Document databases like MongoDB, CouchDB
 - Graph databases like Neo4j
Graph DB Bindings (~JDBC API)
   // connect to the database
   // begin transaction
   Node firstNode;
   Node secondNode;
   Relationship relationship;

   firstNode = graphDb.createNode();
   firstNode.setProperty( "message", "Hello, " );
   secondNode = graphDb.createNode();
   secondNode.setProperty( "message", "World!" );

   relationship = firstNode.createRelationshipTo( secondNode, RelTypes.KNOWS );
   relationship.setProperty( "message", "brave Neo4j " );
   // end the transaction
   // close the connection to the database
http://docs.neo4j.org/chunked/milestone/tutorials-java-embedded-hello-world.html
Different Graph Databases
 - FlockDB from Twitter
 - AllegroGraph
 - GraphBase
 - InfiniteGraph from Objectivity
http://en.wikipedia.org/wiki/Graph_database
What is a Graph Computing Engine?
[Figure: Input Location -> InputFormat -> Graph Computing Engine (running Algorithms) -> OutputFormat -> Output Location.]
Graph engines come with some built-in graph processing algorithms, but also provide an easy-to-use API to build new algorithms and extend the framework.
http://incubator.apache.org/giraph/apidocs/index.html
http://incubator.apache.org/hama/docs/r0.3.0/api/index.html
Different Graph Computing Engines
Memory-based graphs (graph size < local machine RAM), like
 - jung.sourceforge.net
 - igraph.sourceforge.net
 - networkx.lanl.gov
Disk-based graphs (graph size < local hard disk size), like
 - Neo4j
 - Infinite Graph (objectivity.com)
 - sparsity-technologies.com/dex
Cluster-based graphs (depends on the cluster specs), like
 - Apache Hama
 - Apache Giraph
 - GoldenORB
The cluster-based engines are based on the BSP (Bulk Synchronous Parallel) model, in the spirit of Google Pregel.
Bulk Synchronous Parallel
Some quick facts
• An alternate computing model to MapReduce (not all problems can be solved efficiently with MapReduce). Also, any MR algorithm can be simulated on BSP and vice versa.
• Developed by Leslie Valiant during the 1980s.
• Was resurrected by Google in the Pregel paper (extensively used for PageRank).
• Good for
 - Processing big data with complicated relationships, e.g., graphs and networks
 - Iterative and recursive scientific computations
 - Complex Event Processing (CEP)
http://googleresearch.blogspot.in/2009/06/large-scale-graph-computing-at-google.html
http://arxiv.org/abs/1203.2081 (comparing MR vs BSP)
What is Bulk Synchronous Parallel?
[Figure: a computation drawn as a sequence of supersteps: Super Step 1, Super Step 2, Super Step 3.]
http://en.wikipedia.org/wiki/Bulk_synchronous_parallel/
http://blog.octo.com/en/introduction-to-large-scale-graph-processing/
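In the BSP model each superstep consists of local computation, an exchange of messages, and a barrier after which the messages become visible. The sketch below simulates this in a single Java process with the classic "propagate the maximum value" vertex program described in the Pregel paper. It only illustrates the model and does not use the Hama or Giraph API; the class name `BspMaxValue` and its methods are made up for this example, and the graph is the six-vertex one used earlier, with each vertex starting with its own id as its value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process simulation of BSP supersteps, for illustration only.
class BspMaxValue {
    static Map<Integer, List<Integer>> exampleGraph() {
        Map<Integer, List<Integer>> adj = new HashMap<>();
        adj.put(1, Arrays.asList(2, 4, 5));
        adj.put(2, Arrays.asList(3));
        adj.put(3, Arrays.asList(5));
        adj.put(4, Arrays.asList(3, 6));
        adj.put(5, Collections.<Integer>emptyList());
        adj.put(6, Arrays.asList(5));
        return adj;
    }

    // Each vertex ends up with the largest initial value among itself and
    // all vertices that can reach it along directed edges.
    static Map<Integer, Integer> run(Map<Integer, List<Integer>> adj, Map<Integer, Integer> initial) {
        Map<Integer, Integer> value = new TreeMap<>(initial);
        Map<Integer, List<Integer>> inbox = new HashMap<>();
        int superstep = 0;
        while (true) {
            Map<Integer, List<Integer>> outbox = new HashMap<>();
            for (int v : new ArrayList<>(value.keySet())) {
                // 1. Local computation: fold in the messages from the previous superstep.
                boolean changed = (superstep == 0); // everyone announces itself once
                for (int msg : inbox.getOrDefault(v, Collections.<Integer>emptyList())) {
                    if (msg > value.get(v)) {
                        value.put(v, msg);
                        changed = true;
                    }
                }
                // 2. Communication: a vertex whose value changed notifies its neighbours;
                //    an unchanged vertex stays silent (it "votes to halt").
                if (changed) {
                    for (int n : adj.getOrDefault(v, Collections.<Integer>emptyList())) {
                        outbox.computeIfAbsent(n, k -> new ArrayList<>()).add(value.get(v));
                    }
                }
            }
            // 3. Barrier: messages are delivered only once every vertex has finished.
            if (outbox.isEmpty()) {
                break; // no messages in flight, so the computation is over
            }
            inbox = outbox;
            superstep++;
        }
        return value;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> initial = new HashMap<>();
        for (int v = 1; v <= 6; v++) {
            initial.put(v, v);
        }
        System.out.println(run(exampleGraph(), initial));
    }
}
```

In a real engine the vertices are partitioned across machines and the barrier is a cluster-wide synchronization point; the loop structure, however, stays the same.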
Hama vs Giraph
[Figure: Giraph and Hama are both derived from Google Pregel **. Giraph runs BSP on top of MapReduce, Hama runs BSP directly, and both sit on HDFS.]
** http://googleresearch.blogspot.in/2009/06/large-scale-graph-computing-at-google.html
Hama vs Giraph (contd.)
   Hama                                              Giraph
   Pure BSP engine.                                  Uses BSP, but the BSP API is not exposed.
   Matrix, graph, network and other processing.      Just for graph processing.
   Jobs are run as BSP jobs on HDFS.                 Jobs are run as MapReduce jobs on Hadoop.
Both are derived from the paper `Pregel: A System for Large-Scale Graph Processing` ** published by Google. Both have recently been promoted from the Incubator to Apache Top Level Projects.
Both have a few graph algorithms implemented and also provide a very easy API for implementing new graph algorithms.
** http://googleresearch.blogspot.in/2009/06/large-scale-graph-computing-at-google.html
PageRank in Hama
The PageRank algorithm assigns a numerical weight to each element of a hyperlinked set of documents.
   bin/hama jar ../hama-0.4.0-examples.jar pagerank <input path> <output path> [damping factor] [epsilon error] [tasks]
   Input                   Output
   Site1\tSite2\tSite3     Site1 0.5
   Site2\tSite3            Site2 1.3
   Site3                   Site3 1.2
http://wiki.apache.org/hama/PageRank
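To make the damping factor and epsilon parameters concrete, here is a minimal single-machine sketch of PageRank by repeated iteration. It is not Hama's implementation and is not expected to reproduce the output values shown on the slide. It assumes that ranks sum to 1, that every link target also appears as a key in the input map, and that pages without outgoing links spread their rank evenly over all pages. The class name `PageRankSketch` and the `maxIterations` parameter are made up for this example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class, for illustration only.
class PageRankSketch {
    static Map<String, Double> pageRank(Map<String, List<String>> links,
                                        double damping, double epsilon, int maxIterations) {
        int n = links.size();
        Map<String, Double> rank = new TreeMap<>();
        for (String page : links.keySet()) {
            rank.put(page, 1.0 / n); // start from a uniform distribution
        }
        for (int iter = 0; iter < maxIterations; iter++) {
            // Rank held by pages with no outgoing links is shared by everyone.
            double danglingMass = 0.0;
            for (Map.Entry<String, List<String>> e : links.entrySet()) {
                if (e.getValue().isEmpty()) {
                    danglingMass += rank.get(e.getKey());
                }
            }
            double base = (1.0 - damping) / n + damping * danglingMass / n;
            Map<String, Double> next = new TreeMap<>();
            for (String page : links.keySet()) {
                next.put(page, base);
            }
            // Every page passes an equal share of its damped rank to each page it links to.
            for (Map.Entry<String, List<String>> e : links.entrySet()) {
                List<String> targets = e.getValue();
                if (targets.isEmpty()) {
                    continue;
                }
                double share = damping * rank.get(e.getKey()) / targets.size();
                for (String target : targets) {
                    next.merge(target, share, Double::sum);
                }
            }
            // Stop once the total change between two iterations drops below epsilon.
            double delta = 0.0;
            for (String page : links.keySet()) {
                delta += Math.abs(next.get(page) - rank.get(page));
            }
            rank = next;
            if (delta < epsilon) {
                break;
            }
        }
        return rank;
    }

    // The three-site input from the slide: Site1 links to Site2 and Site3, Site2 links to Site3.
    static Map<String, List<String>> exampleLinks() {
        Map<String, List<String>> links = new LinkedHashMap<>();
        links.put("Site1", Arrays.asList("Site2", "Site3"));
        links.put("Site2", Arrays.asList("Site3"));
        links.put("Site3", Collections.<String>emptyList());
        return links;
    }

    public static void main(String[] args) {
        System.out.println(pageRank(exampleLinks(), 0.85, 1e-6, 100));
    }
}
```

A smaller epsilon means more iterations and a more precise result; in a BSP engine each iteration of the outer loop would map naturally onto one superstep.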
What's next?
Deep dive into
 - Both graph databases and frameworks, with a demo.
 - The Bulk Synchronous Parallel processing model.
Hadoop, Hive, Pig and others are too crowded. Graph frameworks and databases are emerging and offer an easier entry point for contributing to Apache projects.
A suggestion: subscribe to the Apache mailing lists for these projects, get familiar with them, and start contributing.