Large Scale Social Networks
     Analysis – LS SNA
   Rui Sarmento           João Gama
           Tiago Cunha           Albert Bifet

            LIAAD/INESC TEC
         FEP - University of Porto

              April 13, 2013
Outline – LS SNA                                            2/19
1.   Motivation
2.   Software Tools
     –   State of the art – Recent Evolution
     –   PEGASUS
     –   Graphlab
     –   Snap (Stanford Network Analysis Platform)
     –   Other Tools
3.   Case Study
     –   Network of companies and financial organizations
     –   Some Numbers
     –   Algorithms and Tools Used
     –   Processing Time
4.   Summary & Conclusions
1. Motivation – LS SNA                3/19
Generic Problem:
  Nowadays, the huge amounts of data
  available pose problems for analysis with
  regular hardware and/or software.
Example Facts:
  “We have produced more data in the last two
  years than in all of prior history so we are
  witnessing a Big Bang of Data” – Tim
  McGuire, McKinsey
1. Motivation – LS SNA                4/19
Solution:
  Emerging technologies, like modern models
  for parallel computing, multicore computers
  or even clusters of computers, can be very
  useful for analyzing massive network data.
1. Motivation – LS SNA                                          5/19
Particular Case Study:
  CrunchBase database (accessed May 2012)
• Network A of companies and financial organizations/funds, e.g.:

                         Y                            X


           »   Company Y has a connection to investment fund X
• Network B of persons and companies e.g.:

                         A                            Y

           »   Person A has a connection to company Y
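
A minimal sketch of how the two networks could be held in memory, assuming plain Python and an undirected edge-list input. The entity names (`company_Y`, `fund_X`, `person_A`, and so on) and the helper `build_adjacency` are hypothetical and do not come from CrunchBase.

```python
from collections import defaultdict

# Hypothetical sample edges; real data would come from the CrunchBase database.
network_a = [("company_Y", "fund_X"), ("company_Z", "fund_X")]      # company - investment fund
network_b = [("person_A", "company_Y"), ("person_B", "company_Y")]  # person - company


def build_adjacency(edges):
    """Build an undirected adjacency map (node -> set of neighbours) from an edge list."""
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)
    return dict(adjacency)


adj_a = build_adjacency(network_a)
adj_b = build_adjacency(network_b)
```

With this representation, the neighbours of any firm, fund or person are a single dictionary lookup, which is the starting point for the measures used in the case study.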
1. Motivation – LS SNA                     6/19
What can we do?
- We want to analyze the behavior of entities in terms of
   relationships or other influences.
- We want to determine characteristics of the network
   both from the ego-centered (single-node) point of view
   and for the network as a whole.
What is the problem?
- It takes too much time (many hours or even days)
   to do this with conventional software such as Gephi or R,
   even on a good PC.
2. Software Tools – LS SNA 7/19
• State of the art – Recent Evolution
2001 – Boost Graph Library (C++)
2005 – Parallel BGL (C++), Hadoop (Java)
2007 – Development of GraphLab starts
2008 – SNAP, Small-world Network Analysis and
  Partitioning (C, OpenMP)
  .
  .
2013 – Several Graph Frameworks using Hadoop
  and/or HDFS
2. Software Tools – LS SNA 8/19
• PEGASUS
  – Computation framework written in Java
  – An open-source graph-mining system with
    massive scalability
  – Depends on Hadoop
  – Graph-oriented tool
2. Software Tools – LS SNA 9/19
• GraphLab API
  – Computation framework written in C++
  – Computation in GraphLab is applied to dependent
    records which are stored as vertices in a large
    distributed data-graph
  – Computation in GraphLab is expressed as vertex-
    programs which are executed in parallel on each
    vertex and can interact with neighboring vertices.
  – GraphLab programs interact by directly reading the
    state of neighboring vertices and by modifying the
    state of adjacent edges.
  – HDFS Integration: Access your data directly from HDFS
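
The vertex-program idea above can be sketched in plain Python. This is not the GraphLab API: the function names, the damping value, the fixed iteration count and the sample graph are all assumptions made for illustration, and the update shown is a simple unnormalized PageRank step.

```python
def pagerank_vertex_program(vertex, ranks, in_neighbors, out_degree, damping=0.85):
    """Compute the new rank of one vertex by reading the state of its neighbours."""
    incoming = sum(ranks[u] / out_degree[u] for u in in_neighbors[vertex])
    return (1.0 - damping) + damping * incoming


def run_vertex_programs(edges, iterations=20):
    """Apply the vertex program to every vertex for a fixed number of rounds.

    The loop is sequential here; a framework such as GraphLab would run the
    per-vertex updates in parallel.
    """
    vertices = sorted({v for edge in edges for v in edge})
    in_neighbors = {v: [] for v in vertices}
    out_degree = {v: 0 for v in vertices}
    for source, target in edges:
        in_neighbors[target].append(source)
        out_degree[source] += 1

    ranks = {v: 1.0 for v in vertices}
    for _ in range(iterations):
        # Every vertex reads the previous round's ranks of its in-neighbours.
        ranks = {
            v: pagerank_vertex_program(v, ranks, in_neighbors, out_degree)
            for v in vertices
        }
    return ranks


# Hypothetical three-node directed cycle: a -> b -> c -> a
ranks = run_vertex_programs([("a", "b"), ("b", "c"), ("c", "a")])
```

Each call to `pagerank_vertex_program` touches only one vertex and its neighbours, which is what makes this style of computation easy to distribute.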
2. Software Tools – LS SNA 10/19
• Snap (Stanford Network Analysis Platform)
  – Not parallel; however…
  – SNAP library is written in C++ and optimized for
    maximum performance and compact graph
    representation
  – It easily scales to massive networks with hundreds
    of millions of nodes, and billions of edges
  – …although some algorithms in Snap might be slow
    due to complexity
2. Software Tools – LS SNA 11/19
• Other Tools (Summary)
  – Several more tools are available:
     • Giraph – graph oriented
     • RHadoop (package for R and Hadoop) – generic tool


  => These tools depend on Hadoop, which
    seems to be more and more commonly adopted
2. Software Tools – LS SNA 12/19
Algorithms available from software install (graph analysis)

Pegasus                         GraphLab                   Snap
------------------------------  -------------------------  ----------
Degree                          approximate diameter       Cascades
PageRank                        kcore                      Centrality
Random Walk with Restart (RWR)  pagerank                   Cliques
Radius                          connected component        Community
Connected Components            simple coloring            Concomp
                                directed triangle count    Forestfire
                                format convert             Graphgen
                                sssp                       Graphhash
                                undirected triangle count  Kcores
                                                           Kronem
                                                           Krongen
                                                           Kronfit
                                                           Maggen
                                                           Magfit
                                                           Motifs
                                                           Ncpplot
                                                           Netevol
                                                           Netinf
                                                           Netstat
                                                           Mkdatasets
                                                           infopath
3. Case Study – LS SNA                           13/19
   => Some Numbers
• Network of companies and financial organizations/funds
     1. Number of firms: 88,269
     2. Number of investment funds: 7,697
• Network of persons and companies
     1. Number of persons: 118,394
3. Case Study – LS SNA                        14/19
 => Algorithms and Tools Used
     – Node degree with PEGASUS
     – Friends of friends with Hadoop MapReduce
     – Centrality measures with Snap (Stanford Network
        Analysis Platform)
     – Triangle counting with GraphLab
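
To make three of these measures concrete (node degree, friends of friends, triangle counting), here is a small single-machine sketch in plain Python. It does not use PEGASUS, Hadoop or GraphLab; the function names and the four-node sample graph are hypothetical, and the friends-of-friends step only imitates the map and reduce phases in memory.

```python
from collections import defaultdict
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency map: node -> set of neighbours."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


def node_degrees(adjacency):
    """Degree of a node is the number of its neighbours."""
    return {node: len(neighbors) for node, neighbors in adjacency.items()}


def friends_of_friends(adjacency):
    """Count common neighbours for every pair of nodes that is not directly connected."""
    # "Map" phase: each node emits every pair of its own neighbours.
    emitted = []
    for neighbors in adjacency.values():
        for a, b in combinations(sorted(neighbors), 2):
            emitted.append(((a, b), 1))
    # "Reduce" phase: sum the counts per pair.
    counts = defaultdict(int)
    for pair, one in emitted:
        counts[pair] += one
    # Keep only pairs that do not already share an edge.
    return {pair: n for pair, n in counts.items() if pair[1] not in adjacency[pair[0]]}


def count_triangles(adjacency):
    """Count each undirected triangle once by requiring u < v < w (nodes must be orderable)."""
    total = 0
    for u in adjacency:
        for v in adjacency[u]:
            if u < v:
                total += sum(1 for w in adjacency[u] & adjacency[v] if v < w)
    return total


# Hypothetical sample graph: one triangle (a, b, c) plus a pendant node d.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
adjacency = build_adjacency(edges)
degrees = node_degrees(adjacency)
fof = friends_of_friends(adjacency)
triangles = count_triangles(adjacency)
```

On the sample graph, `c` has the highest degree, `d` is a friend of a friend of both `a` and `b` through `c`, and the graph contains a single triangle.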
3. Case Study – LS SNA   15/19
 => Processing Time
4. Summary & Conclusions – LS SNA                 16/19
• Summary & Conclusions
  – This paper summarizes which tools to look for when
    studying big graphs.
  – We are witnessing a proliferation of software
    tools aimed at the analysis of large-scale graphs.
  – Networks that were once a problem to handle
    can now be analyzed with the right tools.
References I – LS SNA                                    17/19
• APACHE. 2012. Apache Giraph [Online]. The Apache Software Foundation.
  Available: http://incubator.apache.org/giraph/.
• GRAPHLAB. Graphlab The Abstraction [Online]. Available:
  http://graphlab.org/home/abstraction/ [Accessed 2012].
• GRAPHLAB. 2012. Graph Analytics Toolkit [Online]. Available:
  http://graphlab.org/toolkits/graph-analytics/ [Accessed 2012].
• HOLMES, A. 2012. Hadoop In Practice, Manning.
• LESKOVEC, J. Stanford Network Analysis Platform [Online]. Available:
  http://snap.stanford.edu/snap/ [Accessed 12-2012].
• MAZZA, G. 2012. FrontPage - Hadoop Wiki [Online]. Available:
  http://wiki.apache.org/lucene-hadoop/ [Accessed 11-2012].
• THANEDAR, V. 2012. API Documentation [Online]. Available:
  http://developer.crunchbase.com/docs [Accessed 04-2012].
References II – LS SNA                                 18/19
• UNIVERSITY, C. M. 2012. Project Pegasus [Online]. Available:
  http://www.cs.cmu.edu/~pegasus/ [Accessed 2012].
• WASHINGTON, U. O. What is Hadoop? [Online]. Available:
  http://escience.washington.edu/get-help-now/what-hadoop [Accessed
  05-03-2013].
• OWENS, J. R. 2013. Hadoop Real-World Solutions Cookbook. Packt
  Publishing.
• MCGUIRE, T. Big Data Better Decisions [Online]. Available:
  http://www.slideshare.net/McK_CMSOForum/big-data-and-advanced-analytics
  [Accessed 05-03-2013].
END – LS SNA          19/19



         Thank You!
         Questions?

 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 

Large scale social networks analysis joclad 2013

  • 1. Large Scale Social Networks Analysis – LS SNA Rui Sarmento João Gama Tiago Cunha Albert Bifet LIAAD/INESC TEC FEP - University of Porto April 13, 2013
  • 2. Outline – LS SNA 2/19 1. Motivation 2. Software Tools – State of the art – Recent Evolution – PEGASUS – Graphlab – Snap (Stanford Network Analysis Platform) – Other Tools 3. Case Study – Network of companies and financial organizations – Some Numbers – Algorithms and Used tools – Processing Time 4. Summary & Conclusions
  • 3. 1.Motivation – LS SNA 3/19 Generic Problem: Nowadays, the huge amounts of data available pose problems for analysis with regular hardware and/or software. Example Facts: “We have produced more data in the last two years than in all of prior history so we are witnessing a Big Bang of Data” – Tim McGuire, McKinsey
  • 4. 1.Motivation – LS SNA 4/19 Solution: Emerging technologies, like modern models for parallel computing, multicore computers or even clusters of computers, can be very useful for analyzing massive network data.
  • 5. 1.Motivation – LS SNA 5/19 Particular case study: CrunchBase database (accessed May 2012) • Network A of companies and financial organizations/funds, e.g.: Y – X » Company Y has a connection to investment fund X • Network B of persons and companies, e.g.: A – Y » Person A has a connection to company Y
  • 6. 1.Motivation – LS SNA 6/19 What can we do? - We want to analyze entities' behavior in terms of relationships or other influences. - We want to determine characteristics of the network both from the self-centered point of view of individual nodes and for the network as a whole. What is the problem? - It takes too much time (many hours or even days) to do this with ordinary software such as Gephi or R, even on a good PC.
  • 7. 2. Software Tools – LS SNA 7/19 • State of the art – Recent Evolution: 2001 – Boost Graph Library (C++); 2005 – Parallel BGL (C++), Hadoop (Java); 2007 – Development of Graphlab starts; 2008 – SNAP Small-world Network Analysis and Partitioning (C, OpenMP); ...; 2013 – Several graph frameworks using Hadoop and/or HDFS
  • 8. 2. Software Tools – LS SNA 8/19 • PEGASUS – Computation framework written in Java – An open-source graph-mining system with massive scalability – Depends on Hadoop – Graph-oriented tool
  • 9. 2. Software Tools – LS SNA 9/19 • Graphlab API – Computation framework written in C++ – Computation in GraphLab is applied to dependent records which are stored as vertices in a large distributed data-graph – Computation in GraphLab is expressed as vertex-programs which are executed in parallel on each vertex and can interact with neighboring vertices. – GraphLab programs interact by directly reading the state of neighboring vertices and by modifying the state of adjacent edges. – HDFS Integration: Access your data directly from HDFS
  • 10. 2. Software Tools – LS SNA 10/19 • Snap (Stanford Network Analysis Platform) – Not Parallel however… – SNAP library is written in C++ and optimized for maximum performance and compact graph representation – It easily scales to massive networks with hundreds of millions of nodes, and billions of edges – …although some algorithms in Snap might be slow due to complexity
  • 11. 2. Software Tools – LS SNA 11/19 • Other Tools (Summary) – Several more tools are available: • Giraph – graph oriented • RHadoop (package for R and Hadoop) – generic tool => All of the previous tools depend on Hadoop, which seems to be more and more commonly adopted
  • 12. 2. Software Tools – LS SNA 12/19 Algorithms available from software install (graph analysis), by software:
      – Pegasus: Degree; PageRank; Random Walk with Restart (RWR); Radius; Connected Components
      – Graphlab: approximate diameter; kcore; pagerank; connected component; simple coloring; directed triangle count; format convert; sssp; undirected triangle count
      – Snap: Cascades; Centrality; Cliques; Community; Concomp; Forestfire; Graphgen; Graphhash; Kcores; Kronem; Krongen; Kronfit; Maggen; Magfit; Motifs; Ncpplot; Netevol; Netinf; Netstat; Mkdatasets; infopath
  • 13. 3. Case Study – LS SNA 13/19 => Some Numbers • Network of companies and financial organizations/funds 1. Number of firms: 88,269 2. Number of investment funds: 7,697 • Network of persons and companies 1. Number of persons: 118,394
  • 14. 3. Case Study – LS SNA 14/19 => Algorithms and Used tools – Node Degree with PEGASUS – Friends of Friends with Hadoop Map-Reduce – Centrality Measures with Snap (Stanford Network Analysis Platform) – Triangle Counting with Graphlab
  • 15. 3. Case Study – LS SNA 15/19 => Processing Time
  • 16. 4. Summary & Conclusions LS SNA 16/19 • Summary & Conclusions – This paper summarizes which tools to look for when studying big graphs. – We are witnessing a large proliferation of software tools aimed at the analysis of large-scale graphs. – Dealing with these networks, once a problem, becomes feasible with the right tools.
  • 17. References I – LS SNA 17/19 • APACHE. 2012. Apache Giraph [Online]. The Apache Software Foundation. Available: http://incubator.apache.org/giraph/. • GRAPHLAB. Graphlab The Abstraction [Online]. Available: http://graphlab.org/home/abstraction/ [Accessed 2012]. • GRAPHLAB. 2012. Graph Analytics Toolkit [Online]. Available: http://graphlab.org/toolkits/graph-analytics/ [Accessed 2012]. • HOLMES, A. 2012. Hadoop In Practice, Manning. • LESKOVEC, J. Stanford Network Analysis Platform [Online]. Available: http://snap.stanford.edu/snap/ [Accessed 12-2012]. • MAZZA, G. 2012. FrontPage - Hadoop Wiki [Online]. Available: http://wiki.apache.org/lucene-hadoop/ [Accessed 11-2012]. • THANEDAR, V. 2012. API Documentation [Online]. Available: http://developer.crunchbase.com/docs [Accessed 04-2012].
  • 18. References II – LS SNA 18/19 • UNIVERSITY, C. M. 2012. Project Pegasus [Online]. Available: http://www.cs.cmu.edu/~pegasus/ [Accessed 2012]. • WASHINGTON, U. O. What is Hadoop? [Online]. Available: http://escience.washington.edu/get-help-now/what-hadoop [Accessed 05-03-2013]. • OWENS, J. R. 2013. Hadoop Real-World Solutions Cookbook. PACKT Publishing. • McGuire, T. Big Data Better Decisions [Online]. Available: http://www.slideshare.net/McK_CMSOForum/big-data-and-advanced-analytics [Accessed 05-03-2013].
  • 19. END – LS SNA 19/19 Thank You! Questions?
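
Three of the computations named in the case study (node degree, friends of friends, and triangle counting) can be illustrated on a tiny graph. The sketch below is plain single-machine Python, not the PEGASUS, Hadoop, or Graphlab code used in the talk. The edge list and all function names are hypothetical and chosen only for illustration. The friends-of-friends step imitates the map/reduce idea (each node emits pairs of its neighbours, then the pairs are grouped and counted), which is an assumption about how such a job is commonly structured rather than the authors' implementation.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy undirected edge list; not taken from the CrunchBase data.
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]


def build_adjacency(edges):
    """Build an undirected adjacency map: node -> set of neighbours."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def node_degrees(adj):
    """Degree of a node = number of distinct neighbours."""
    return {node: len(nbrs) for node, nbrs in adj.items()}


def friends_of_friends(adj):
    """Count common neighbours for every pair of nodes that is not directly linked.

    "Map" step: each node emits every pair of its neighbours, because the
    two members of the pair share that node as a common friend.
    "Reduce" step: emitted pairs are grouped and counted, and pairs that
    are already connected are dropped.
    """
    common = defaultdict(int)
    for nbrs in adj.values():
        for u, v in combinations(sorted(nbrs), 2):
            common[(u, v)] += 1
    return {pair: n for pair, n in common.items() if pair[1] not in adj[pair[0]]}


def count_triangles(adj):
    """Count triangles: each one is seen once from each of its three corners."""
    seen = 0
    for nbrs in adj.values():
        for v, w in combinations(sorted(nbrs), 2):
            if w in adj[v]:
                seen += 1
    return seen // 3


if __name__ == "__main__":
    adjacency = build_adjacency(EDGES)
    print(node_degrees(adjacency))
    print(friends_of_friends(adjacency))
    print(count_triangles(adjacency))
```

On networks of the size reported in the case study, in-memory loops like these are the kind of approach the slides describe as too slow, which is the motivation for the distributed tools.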