Large Scale Social Networks Analysis – LS SNA
Rui Sarmento, João Gama, Tiago Cunha, Albert Bifet
LIAAD/INESC TEC
FEP - University of Porto
April 13, 2013

Outline
1. Motivation
2. Software Tools
 – State of the art – Recent Evolution
 – PEGASUS
 – Graphlab
 – Snap (Stanford Network Analysis Platform)
 – Other Tools
3. Case Study
 – Network of companies and financial organizations
 – Some Numbers
 – Algorithms and Tools Used
 – Processing Time
4. Summary & Conclusions

1. Motivation
Generic Problem: The huge amounts of data now available are difficult to analyze with regular hardware and/or software.
Example Facts: “We have produced more data in the last two years than in all of prior history so we are witnessing a Big Bang of Data” – Tim McGuire, McKinsey

1. Motivation
Solution: Emerging technologies, such as modern parallel computing models, multicore computers, and computer clusters, can be very useful for analyzing massive network data.

1. Motivation
Particular Case Study: CrunchBase database (accessed May 2012)
• Network A of companies and financial organizations/funds, e.g.: Y X » Company Y has a connection to investment fund X
• Network B of persons and companies, e.g.: A Y » Person A has a connection to company Y

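A minimal sketch of how such an edge list could be held in memory, assuming one "Y X" pair per line as in the examples above. The entity names (`CompanyY`, `FundX`, ...) and the helper functions are hypothetical and are not taken from CrunchBase or from the tools discussed later.

```python
from collections import defaultdict


def load_edges(lines):
    """Parse whitespace-separated 'source target' pairs, skipping malformed lines."""
    edges = []
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            edges.append((parts[0], parts[1]))
    return edges


def build_adjacency(edges):
    """Build undirected adjacency sets: each edge is stored in both directions."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


# Hypothetical sample of Network A (company, investment fund).
sample = ["CompanyY FundX", "CompanyZ FundX", "CompanyY FundW"]
network_a = build_adjacency(load_edges(sample))

# Companies connected to FundX.
print(sorted(network_a["FundX"]))
```

Adjacency sets make neighbor lookups cheap, which is convenient for small samples; at the scale of the full networks this representation is the kind of thing the tools in the next section replace.
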
1. Motivation
What can we do?
- We want to analyze the behavior of entities in terms of their relationships or other influences.
- We want to determine characteristics of the network from both the ego-centered point of view and that of the network as a whole.
What is the problem?
- It takes too much time (many hours or even days) with standard software such as Gephi or R, even on a good PC.

2. Software Tools
• State of the art – Recent Evolution
2001 – Boost Graph Library (C++)
2005 – Parallel BGL (C++), Hadoop (Java)
2007 – Development of Graphlab starts
2008 – SNAP Small-world Network Analysis and Partitioning (C, OpenMP)
...
2013 – Several graph frameworks using Hadoop and/or HDFS

2. Software Tools
• PEGASUS
 – Computation framework written in Java
 – An open-source graph-mining system with massive scalability
 – Depends on Hadoop
 – Graph-oriented tool

2. Software Tools
• Graphlab API
 – Computation framework written in C++
 – Computation in GraphLab is applied to dependent records which are stored as vertices in a large distributed data-graph
 – Computation in GraphLab is expressed as vertex-programs which are executed in parallel on each vertex and can interact with neighboring vertices
 – GraphLab programs interact by directly reading the state of neighboring vertices and by modifying the state of adjacent edges
 – HDFS integration: access your data directly from HDFS

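The vertex-program idea can be illustrated with a small, sequential Python sketch. This is not the GraphLab API (GraphLab is a C++ framework that runs vertex-programs in parallel); the function names, the damping value of 0.85, and the toy graph are assumptions chosen for illustration. The sketch uses PageRank as the example: each vertex computes its new state only from the state of its in-neighbors.

```python
def pagerank_vertex_program(vertex, ranks, in_neighbors, out_degree, damping=0.85):
    """New rank of one vertex, computed only from the state of its in-neighbors."""
    gathered = sum(ranks[u] / out_degree[u] for u in in_neighbors[vertex])
    return (1.0 - damping) + damping * gathered


def run(edges, iterations=20):
    """Apply the vertex program to every vertex for a fixed number of sweeps."""
    vertices = sorted({v for edge in edges for v in edge})
    in_neighbors = {v: [] for v in vertices}
    out_degree = {v: 0 for v in vertices}
    for src, dst in edges:
        in_neighbors[dst].append(src)
        out_degree[src] += 1

    ranks = {v: 1.0 for v in vertices}
    for _ in range(iterations):
        # Synchronous sweep: every vertex reads the ranks from the previous sweep.
        ranks = {
            v: pagerank_vertex_program(v, ranks, in_neighbors, out_degree)
            for v in vertices
        }
    return ranks


# Hypothetical toy graph with directed edges (source, target).
ranks = run([("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")])
print(ranks)
```

Because each call touches only one vertex and its neighbors, the per-vertex updates inside a sweep are independent of one another, which is what makes this style of computation suitable for parallel execution.
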
2. Software Tools
• Snap (Stanford Network Analysis Platform)
 – Not parallel, however…
 – The SNAP library is written in C++ and optimized for maximum performance and compact graph representation
 – It easily scales to massive networks with hundreds of millions of nodes and billions of edges
 – …although some algorithms in Snap might be slow due to their complexity

2. Software Tools
• Other Tools (Summary)
 – Several more tools are available:
  • Giraph – graph-oriented
  • RHadoop (package for R and Hadoop) – generic tool
 => Both of these tools depend on Hadoop, which seems to be increasingly widely adopted

2. Software Tools
Algorithms available from the software install (graph analysis):
• Pegasus: Degree, PageRank, Random Walk with Restart (RWR), Radius, Connected Components
• Graphlab: approximate diameter, kcore, pagerank, connected component, simple coloring, directed triangle count, format convert, sssp, undirected triangle count
• Snap: Cascades, Centrality, Cliques, Community, Concomp, Forestfire, Graphgen, Graphhash, Kcores, Kronem, Krongen, Kronfit, Maggen, Magfit, Motifs, Ncpplot, Netevol, Netinf, Netstat, Mkdatasets, infopath

3. Case Study
=> Some Numbers
• Network of companies and financial organizations/funds
 1. Number of firms: 88,269
 2. Number of investment funds: 7,697
• Network of persons and companies
 1. Number of persons: 118,394

3. Case Study
=> Algorithms and Tools Used
 – Node Degree with PEGASUS
 – Friends of Friends with Hadoop Map-Reduce
 – Centrality Measures with Snap (Stanford Network Analysis Platform)
 – Triangle Counting with Graphlab

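To make three of the analyses above concrete, here is a single-machine Python sketch of node degree, friends of friends (written as separate map and reduce steps to mirror the Map-Reduce formulation), and triangle counting. It does not reproduce the PEGASUS, Hadoop, or Graphlab implementations; the function names and the toy graph are assumptions for illustration, and centrality measures are left out.

```python
from collections import defaultdict
from itertools import combinations


def adjacency(edges):
    """Undirected adjacency sets built from (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def node_degree(adj):
    """Degree of each node: the number of distinct neighbors."""
    return {v: len(neighbors) for v, neighbors in adj.items()}


def fof_map(adj):
    """Map step: each node emits every pair of its neighbors, keyed by the pair."""
    for v, neighbors in adj.items():
        for a, b in combinations(sorted(neighbors), 2):
            yield (a, b), v


def fof_reduce(pairs, adj):
    """Reduce step: collect common neighbors per pair, keeping unlinked pairs only."""
    grouped = defaultdict(set)
    for pair, common in pairs:
        grouped[pair].add(common)
    return {
        pair: commons
        for pair, commons in grouped.items()
        if pair[1] not in adj[pair[0]]
    }


def count_triangles(adj):
    """Count each triangle exactly once by requiring u < v < w."""
    total = 0
    for u in adj:
        for v in adj[u]:
            if u < v:
                total += sum(1 for w in adj[u] & adj[v] if v < w)
    return total


# Hypothetical toy graph: one triangle (a, b, c) plus a pendant node d.
adj = adjacency([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
degrees = node_degree(adj)
friends_of_friends = fof_reduce(fof_map(adj), adj)
triangles = count_triangles(adj)
print(degrees, friends_of_friends, triangles)
```

In the toy graph, `d` is a friend of a friend of both `a` and `b` through `c`, and the only triangle is `a`, `b`, `c`. On a cluster, the map and reduce steps would run on separate workers with a shuffle in between; here they are chained directly.
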
3. Case Study
=> Processing Time

4. Summary & Conclusions
 – This paper summarizes which tools to consider when studying large graphs.
 – We are witnessing a proliferation of software tools aimed at the analysis of large-scale graphs.
 – Networks that were once impractical to analyze can now be handled with the right tools.

References
• APACHE. 2012. Apache Giraph [Online]. The Apache Software Foundation. Available: http://incubator.apache.org/giraph/.
• GRAPHLAB. Graphlab The Abstraction [Online]. Available: http://graphlab.org/home/abstraction/ [Accessed 2012].
• GRAPHLAB. 2012. Graph Analytics Toolkit [Online]. Available: http://graphlab.org/toolkits/graph-analytics/ [Accessed 2012].
• HOLMES, A. 2012. Hadoop In Practice, Manning.
• LESKOVEC, J. Stanford Network Analysis Platform [Online]. Available: http://snap.stanford.edu/snap/ [Accessed 12-2012].
• MAZZA, G. 2012. FrontPage - Hadoop Wiki [Online]. Available: http://wiki.apache.org/lucene-hadoop/ [Accessed 11-2012].
• MCGUIRE, T. Big Data Better Decisions [Online]. Available: http://www.slideshare.net/McK_CMSOForum/big-data-and-advanced-analytics [Accessed 05-03-2013].
• OWENS, J. R. 2013. Hadoop Real-World Solutions Cookbook. PACKT Publishing.
• THANEDAR, V. 2012. API Documentation [Online]. Available: http://developer.crunchbase.com/docs [Accessed 04-2012].
• UNIVERSITY, C. M. 2012. Project Pegasus [Online]. Available: http://www.cs.cmu.edu/~pegasus/ [Accessed 2012].
• WASHINGTON, U. O. What is Hadoop? [Online]. Available: http://escience.washington.edu/get-help-now/what-hadoop [Accessed 05-03-2013].