Massive Graph Mining 
Apache Spark’s GraphX and Data Mining
Who we are 
Andy 
@Noootsab 
@NextLab_be 
@Wajug co-driver 
@Devoxx4Kids organizer 
Maths & CS 
Data lover: geo, open, massive 
Fool 
Rand 
@randhindi 
@snips 
Entrepreneur 
PhD bioinformatics, etc.. 
Love data & ML
Graph 101 
A graph is a mathematical representation of 
linked data. 
It’s defined in terms of its vertices and edges, 
G(V,E). 
A vertex is an entity that can carry a bag of 
data (generally small). 
An edge connects two vertices, and can also carry a 
bag of data.
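As a minimal sketch of G(V,E) with a bag of data on each vertex and edge, the structure could look like this in plain Python. The class and method names (Graph, add_vertex, add_edge) are hypothetical and are not part of GraphX.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    # vertex id -> bag of data carried by the vertex
    vertices: dict = field(default_factory=dict)
    # (source id, destination id) -> bag of data carried by the edge
    edges: dict = field(default_factory=dict)

    def add_vertex(self, vid, **data):
        self.vertices[vid] = data

    def add_edge(self, src, dst, **data):
        # An edge may only connect vertices that belong to V.
        if src not in self.vertices or dst not in self.vertices:
            raise KeyError("both endpoints must be vertices of the graph")
        self.edges[(src, dst)] = data


g = Graph()
g.add_vertex(1, name="A")
g.add_vertex(2, name="B")
g.add_edge(1, 2, weight=0.5)
```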
Graph 101 
A graph represents data in a way that is less 
convenient for classical processing frameworks, 
because the burden is not put on the 
observations themselves (rows) but on their 
linkage, and specifically its density. 
Thus, the problem is often translated into a 
self-join.
Graph 101 
A graph G(V,E) has a reverse representation, 
its dual (also known as the line graph of G). 
The dual is simply the graph G’(V’,E’), where 
● a vertex is an edge in G, and 
● an edge is a vertex in G that is shared by at 
least two edges of G.
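The construction can be sketched in a few lines of plain Python, assuming an undirected graph given as a list of (u, v) pairs and assuming two edges of G become linked in G’ whenever they share a vertex. The function name line_graph is hypothetical.

```python
from itertools import combinations


def line_graph(edges):
    """Return (vertices', edges') of the dual of an undirected graph.

    Each vertex of the dual is the index of an edge in `edges`; two dual
    vertices are connected when the original edges share an endpoint.
    """
    incident = {}  # original vertex -> indices of the edges touching it
    for i, (u, v) in enumerate(edges):
        incident.setdefault(u, []).append(i)
        if v != u:
            incident.setdefault(v, []).append(i)

    dual_vertices = list(range(len(edges)))
    dual_edges = set()
    for idxs in incident.values():
        # Every pair of edges meeting at this vertex is linked in the dual.
        for a, b in combinations(idxs, 2):
            dual_edges.add((a, b))
    return dual_vertices, sorted(dual_edges)


# A small star: three streets meeting at intersection "b".
streets = [("a", "b"), ("b", "c"), ("b", "d")]
dual_v, dual_e = line_graph(streets)
# dual_e == [(0, 1), (0, 2), (1, 2)]
```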
Graph 101 
The classical way to store or share the 
connectivity of a graph is using its tabular 
version, that is, its Adjacency Matrix. 
ref: http://en.wikipedia.org/wiki/Adjacency_matrix
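For illustration, an adjacency matrix can be built from an edge list as below. This is a sketch that assumes vertices are numbered 0..n-1; adjacency_matrix is a hypothetical helper name.

```python
def adjacency_matrix(n, edges, directed=False):
    """Build an n x n 0/1 matrix where a[i][j] == 1 means i is linked to j."""
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = 1
        if not directed:
            a[j][i] = 1  # undirected graphs give a symmetric matrix
    return a


# Path graph 0 - 1 - 2
m = adjacency_matrix(3, [(0, 1), (1, 2)])
# m == [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
```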
GraphX (Apache Spark) 
Spark 101
GraphX (Apache Spark) 
Offers a graph API on top of Spark, 
enabling cross-world manipulations: the same data 
can be viewed as a graph or as plain RDDs.
GraphX (Apache Spark) 
How it differs from other classical systems...
GraphX (Apache Spark) 
Plenty of operators on both RDDs, but
GraphX (Apache Spark) 
1. Sends messages to neighbors 
2. Returns an RDD of aggregated messages
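The two steps above can be mimicked in plain Python to show the idea. This is only a sketch of the pattern, not the GraphX API: aggregate_messages, send and merge are hypothetical names, and a dict stands in for the resulting RDD.

```python
from operator import add


def aggregate_messages(edges, vertex_attr, send, merge):
    """Step 1: each edge sends messages; step 2: merge them per target vertex."""
    inbox = {}
    for src, dst in edges:
        # `send` sees both endpoints and returns (target, message) pairs.
        for target, msg in send(src, dst, vertex_attr[src], vertex_attr[dst]):
            inbox[target] = merge(inbox[target], msg) if target in inbox else msg
    return inbox  # vertex id -> aggregated message


# Example: in-degree, by sending 1 to the destination of every edge.
edges = [(1, 2), (3, 2), (2, 3)]
attrs = {1: None, 2: None, 3: None}
in_degree = aggregate_messages(edges, attrs, lambda s, d, sa, da: [(d, 1)], add)
# in_degree == {2: 2, 3: 1}
```

Vertices that receive no message (vertex 1 here) are simply absent from the result.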
GraphX (Apache Spark) 
Offers higher-level operators and algorithms, like
GraphX (Apache Spark) 
This one rules them all (and more) 
More later...
PageRank and Pregel 
Everybody knows PageRank, right? 
If not: it’s our oil, our friend, our preferred black 
box… 
It’s why Google Search works so well!
PageRank and Pregel 
Essentially, PageRank is all about the importance 
of a node in a graph → link analysis. 
The bottom line is: 
● In-links are votes 
● In-links from important nodes count for 
more → recursion
PageRank and Pregel 
https://d396qusza40orc.cloudfront.net/mmds/lecture_slides/week1_pagerank_the_flow_formulation.pdf
PageRank and Pregel 
TL;DR 
The importance of a node is the probability that 
a random (drunk) walker lands on that node. 
So, it depends on: 
1. the probability that he lands on one of its 
neighbors 
2. the probability that he crosses a link from 
that neighbor to it 
3. an arbitrary probability of teleportation
PageRank and Pregel 
Solution: Power Method/Iteration (recursive) 
r_new = A × r_old 
Matrix algebra is a pain in a distributed 
environment… 
But wait, the process is rather graph-oriented!
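A small single-machine sketch of the power iteration, written edge by edge rather than as a matrix product, may make the "graph-oriented" remark concrete. It assumes every node has at least one out-link (no dead ends) and uses a teleportation parameter beta = 0.85, which is a common choice and not a value taken from the slides. The function name is hypothetical.

```python
def pagerank_power_iteration(out_links, beta=0.85, tol=1e-12, max_iter=200):
    """out_links: node -> list of nodes it links to (each list non-empty)."""
    nodes = sorted(out_links)
    n = len(nodes)
    r = {v: 1.0 / n for v in nodes}  # start from the uniform distribution
    for _ in range(max_iter):
        # Teleportation: with probability 1 - beta the walker jumps anywhere.
        r_new = {v: (1.0 - beta) / n for v in nodes}
        # r_new = A x r_old, computed by pushing rank along each out-link.
        for u in nodes:
            share = beta * r[u] / len(out_links[u])
            for v in out_links[u]:
                r_new[v] += share
        delta = sum(abs(r_new[v] - r[v]) for v in nodes)
        r = r_new
        if delta < tol:
            break
    return r


# A 3-node cycle: by symmetry every node ends up with rank 1/3.
cycle = pagerank_power_iteration({"a": ["b"], "b": ["c"], "c": ["a"]})

# "c" has no in-links, so its rank is only the teleportation share.
ranks = pagerank_power_iteration({"a": ["b"], "b": ["a"], "c": ["a"]})
```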
PageRank and Pregel 
Pregel (Google again) 
Based on BSP, Bulk Synchronous Parallel 
BSP works in a message-passing style
PageRank and Pregel 
During Superstep i, a vertex can: 
● use messages received from Superstep i-1 
● execute a function 
● send messages 
● vote to halt
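The superstep loop can be sketched on a single machine as follows. This is a simplified illustration under stated assumptions, not the Pregel or GraphX implementation: pregel, vprog, send_msg and merge_msg are hypothetical names, only vertices that just received a message send new ones, and a vertex "votes to halt" by sending nothing.

```python
import math


def pregel(vertices, out_edges, initial_msg, vprog, send_msg, merge_msg,
           max_supersteps=20):
    # Superstep 0: every vertex runs its function on the initial message.
    state = {v: vprog(v, attr, initial_msg) for v, attr in vertices.items()}
    active = set(state)
    for _ in range(max_supersteps):
        inbox = {}
        # Active vertices send messages along their out-edges.
        for src in active:
            for dst in out_edges.get(src, []):
                for target, msg in send_msg(src, state[src], dst, state[dst]):
                    inbox[target] = (merge_msg(inbox[target], msg)
                                     if target in inbox else msg)
        if not inbox:
            break  # nobody sent anything: all vertices voted to halt
        # Next superstep: vertices consume the messages sent in the previous one.
        for v, msg in inbox.items():
            state[v] = vprog(v, state[v], msg)
        active = set(inbox)
    return state


# Example: hop count from source "a" (unweighted shortest paths).
hops = pregel(
    vertices={"a": 0, "b": math.inf, "c": math.inf, "d": math.inf},
    out_edges={"a": ["b", "c"], "b": ["d"], "c": ["d"]},
    initial_msg=math.inf,
    vprog=lambda v, attr, msg: min(attr, msg),
    send_msg=lambda s, sa, d, da: [(d, sa + 1)] if sa + 1 < da else [],
    merge_msg=min,
)
# hops == {"a": 0, "b": 1, "c": 1, "d": 2}
```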
PageRank and Pregel 
In GraphX, as usual with Spark, it’s simple: 
mapReduceTriplets
PageRank and Pregel 
PageRank with Pregel:
PageRank and Pregel 
Applying on our USA.csv file:
OpenStreetMap 
Founded by Steve Coast (UK, 2004) 
Aims to take geodata out of governments’ hands 
and give it to the crowd 
Actually, the crowd has to create it...
OSM 
So it’s a Graph! 
Node = Vertex 
single point in space defined by its latitude, longitude and node id 
Way = Edge 
A way can have between 2 and 2,000 nodes
OSM 
The network is overly complex for what we need, 
thus: 
● reducing cyclic ways such as roundabouts to a 
single one 
● transforming the nodes into sections, i.e. 
pieces of streets between 2 intersections
OSM 
Hence, OSM ~ G(Node, Way) 
Even if the match is not exact, we can still 
work with it as a graph. 
In our case, we don’t need the connectivity of 
intersections, but the connectivity of 
sections. 
This is given by G’ (the dual of G)
Dataset 
● 80 cities 
● 3M edges in total 
● smallest city 200 edges (Tempe) 
● largest city 200,000 edges (Los Angeles)
Comparing Cities 
● Hypothesis: cities with similar connectivity 
have similar PageRank distributions 
[Figure panels: NYC, Chicago]
Fort Worth = Philadelphia? 
Looks the same!
Smells like Spurious Correlation
Normalizing PageRank distributions 
● Problem: PageRank is correlated with the 
size of the city 
● size of city = number of sections (edges) in 
the graph 
● Normalized PageRank = PageRank / 
size_of_city 
● Now we can compare cities of different 
sizes!
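The normalization above amounts to a single division per section. A minimal sketch, assuming the PageRank values of a city are held in a list and the size is its number of sections (normalize_pagerank is a hypothetical name):

```python
def normalize_pagerank(ranks, n_sections):
    """Divide each PageRank value by the city size (number of sections)."""
    if n_sections <= 0:
        raise ValueError("a city must have at least one section")
    return [r / n_sections for r in ranks]


normalized = normalize_pagerank([2.0, 4.0], 4)
# normalized == [0.5, 1.0]
```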
Fort Worth != Philadelphia! 
Totally different!
Fort Worth before and after 
Note that the range of PageRank is preserved
Distance between PageRank Distributions 
● How to compare PageRank distributions? 
● It’s not always a normal distribution! 
● Can use the Kullback-Leibler divergence 
from information theory 
● The Kullback–Leibler divergence of Q from 
P, denoted D_KL(P||Q), is a measure of the 
information lost when Q is used to 
approximate P
KL Divergence 
● Easy to compute 
● Units are nats (or bits if using log2 
instead of ln)
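For discrete distributions, D_KL(P||Q) is the sum over i of P(i) · ln(P(i) / Q(i)). The sketch below assumes the two PageRank distributions have already been binned into histograms over the same bins and normalized to sum to 1; how the bins are chosen is not stated in the slides. The function name kl_divergence is hypothetical.

```python
import math


def kl_divergence(p, q, base=math.e):
    """D_KL(P||Q) for two discrete distributions over the same bins.

    base=math.e gives nats, base=2 gives bits.
    """
    if len(p) != len(q):
        raise ValueError("P and Q must be defined over the same bins")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # by convention 0 * log(0 / q) contributes 0
        if qi == 0:
            return math.inf  # P puts mass where Q has none
        total += pi * math.log(pi / qi, base)
    return total


same = kl_divergence([0.5, 0.5], [0.5, 0.5])              # 0.0
one_bit = kl_divergence([1.0, 0.0], [0.5, 0.5], base=2)   # 1.0 bit
```

Note that the divergence is not symmetric: D_KL(P||Q) and D_KL(Q||P) generally differ, so the order of the two cities matters.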
Very different cities: Dallas & Seattle 
● KL divergence = 18.407 
● Dallas is irregular, Seattle is a perfect grid
Very similar cities: Atlanta & Boston 
● KL divergence = 0.36 
● Both are very irregular
Next steps 
● Using multiple street topology indicators to 
measure the risk of car accidents
Q.E.D 
Thanks for keeping up! 
Question => 
Future[(Option[Response], Future[Question])]

My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 

Recently uploaded (20)

Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 

Machine Learning and GraphX

  • 1. Massive Graph Mining Apache Spark’s GraphX and Data Mining
  • 2. Who we are Andy @Noootsab @NextLab_be @Wajug co-driver @Devoxx4Kids organizer Maths & CS Data lover: geo, open, massive Fool Rand @randhindi @snips Entrepreneur PhD bioinformatics, etc. Love data & ML
  • 3. Graph 101 A graph is a mathematical representation of linked data. It’s defined in terms of its Vertices and Edges, G(V,E). A vertex is an entity that can carry a bag of data (generally small). An edge connects vertices, and can also carry a bag of data.
  • 4. Graph 101 A graph represents data in a way that is less convenient for classical processing frameworks, because the burden is put not on the observations themselves (rows) but on their linkage, and specifically its density. Thus, the problem is often translated into a self-join.
  • 5. Graph 101 A Graph, G(V,E), has a reverse representation, its Dual. A Dual is nothing other than the graph, G’(V’, E’), where ● a vertex is an edge in G, and ● an edge is a vertex in G that has at least one edge.
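A minimal sketch of the dual construction in plain Python, assuming that the "dual" meant here is what graph theory usually calls the line graph: every edge of G becomes a vertex of G’, and two of those vertices are linked when the original edges share a vertex of G. The function name `dual_graph` and the edge-list input are hypothetical, chosen for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def dual_graph(edges):
    """Return the edges of G' as pairs of indices into `edges`.

    Assumption: two edges of G are linked in G' when they share a vertex.
    """
    incident = defaultdict(list)  # vertex of G -> indices of edges touching it
    for idx, (u, v) in enumerate(edges):
        incident[u].append(idx)
        if v != u:  # do not register a self-loop twice
            incident[v].append(idx)

    dual_edges = set()
    for edge_indices in incident.values():
        # Every pair of edges meeting at this vertex is connected in G'.
        for a, b in combinations(sorted(edge_indices), 2):
            dual_edges.add((a, b))
    return sorted(dual_edges)


# A path a - b - c - d: edge 0 meets edge 1 at b, edge 1 meets edge 2 at c.
edges = [("a", "b"), ("b", "c"), ("c", "d")]
dual = dual_graph(edges)
```

A vertex of G touched by a single edge contributes nothing to G’, which matches the condition that only vertices with edges to connect matter.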
  • 6. Graph 101 The classical way to store or share the connectivity of a graph is using its tabular version, that is, its Adjacency Matrix. ref: http://en.wikipedia.org/wiki/Adjacency_matrix
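As an illustration of the adjacency matrix mentioned above, here is a small sketch in plain Python. The helper name `adjacency_matrix` and the `directed` flag are assumptions made for this example, not something taken from the slides.

```python
def adjacency_matrix(vertices, edges, directed=False):
    """Build a 0/1 adjacency matrix as a list of rows.

    Row i, column j is 1 when there is an edge from vertices[i] to vertices[j].
    """
    index = {v: i for i, v in enumerate(vertices)}
    n = len(vertices)
    matrix = [[0] * n for _ in range(n)]
    for u, v in edges:
        matrix[index[u]][index[v]] = 1
        if not directed:
            # An undirected edge is stored in both directions.
            matrix[index[v]][index[u]] = 1
    return matrix


vertices = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c")]
matrix = adjacency_matrix(vertices, edges)
```

The matrix needs one cell per pair of vertices, so for sparse graphs such as road networks an edge list is usually far more compact.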
  • 8. GraphX (Apache Spark) Offers a Graph API on top of Spark. Enabling cross-world manipulations
  • 9. GraphX (Apache Spark) How it differs from other classical systems...
  • 12. GraphX (Apache Spark) Plenty of operators on both RDDs, but
  • 13. GraphX (Apache Spark) Plenty of operators on both RDDs, but
  • 14. GraphX (Apache Spark) 1. Sends messages to neighbors 2. Returns an RDD of aggregated messages
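The two steps above (send messages, then aggregate them per vertex) can be sketched without Spark. The code below only mimics the shape of such an operator in plain Python; it is not the GraphX API, and the names `aggregate_messages`, `send` and `merge` are hypothetical.

```python
def aggregate_messages(edges, vertex_attrs, send, merge):
    """Run `send` on every edge, then combine messages per target vertex.

    edges: iterable of (src, dst, edge_attr)
    send(src, src_attr, dst, dst_attr, edge_attr) -> iterable of (vertex_id, message)
    merge(m1, m2) -> combined message
    Returns a dict {vertex_id: aggregated message} for vertices that got mail.
    """
    inbox = {}
    for src, dst, edge_attr in edges:
        messages = send(src, vertex_attrs[src], dst, vertex_attrs[dst], edge_attr)
        for target, message in messages:
            if target in inbox:
                inbox[target] = merge(inbox[target], message)
            else:
                inbox[target] = message
    return inbox


vertex_attrs = {1: "a", 2: "b", 3: "c"}
edges = [(1, 2, None), (1, 3, None), (2, 3, None)]

# In-degree: each edge sends the message 1 to its destination, messages are summed.
in_degree = aggregate_messages(
    edges,
    vertex_attrs,
    send=lambda src, src_attr, dst, dst_attr, edge_attr: [(dst, 1)],
    merge=lambda a, b: a + b,
)
```

Vertices that receive no message (vertex 1 here) are simply absent from the result, which is why the output is a collection of aggregated messages rather than a full vertex set.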
  • 15. GraphX (Apache Spark) Offers higher level operators and algo, like
  • 16. GraphX (Apache Spark) This one rules them all (and more) More later...
  • 17. PageRank and Pregel Everybody knows PageRank, right? If not: it’s our oil, our friend, our preferred black box… It’s why Google Search works so well!
  • 18. PageRank and Pregel Essentially, PageRank is all about the importance of a node in a Graph → Link Analysis. The bottom line is: ● In-Links are votes ● In-Links from important nodes are more important → recursion
  • 19. PageRank and Pregel https://d396qusza40orc.cloudfront.net/mmds/lecture_slides/week1_pagerank_the_flow_formulation.pdf
  • 20. PageRank and Pregel TL;DR The importance of a node is the probability that a random (drunk) walker lands on that node. So, it depends on: 1. the probability that he lands on one of its neighbors 2. the probability that he crosses a link from that neighbor to it 3. an arbitrary probability of teleportation
  • 21. PageRank and Pregel Solution: Power Method/Iteration (recursive) r_new = A × r_old Matrix algebra is a pain in a distributed environment… But wait, the process is rather graph oriented!
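The power iteration can be written directly on the link structure instead of on a matrix. This is a minimal single-machine sketch, not the implementation shown in the talk; the damping factor of 0.85, the fixed iteration count and the uniform handling of dangling nodes are assumptions made for the example.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power iteration for PageRank on a dict {node: [nodes it links to]}.

    Every node must appear as a key, even if it has no outgoing link.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Teleportation: every node gets a fixed share of the total rank.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v, targets in out_links.items():
            if targets:
                # A node splits its rank evenly among its out-links (votes).
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a dangling node spreads its rank uniformly.
                share = damping * rank[v] / n
                for t in nodes:
                    new_rank[t] += share
        rank = new_rank
    return rank


out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(out_links)
```

Each pass over the links plays the role of one multiplication r_new = A × r_old, which is why the computation maps naturally onto per-edge messages.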
  • 22. PageRank and Pregel Pregel (Google again) Based on BSP, Bulk Synchronous Parallel. BSP works in a message-passing style.
  • 23. PageRank and Pregel During Superstep i, a vertex can: ● use messages received from Superstep i-1 ● execute a function ● send messages ● vote to halt
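To make the superstep loop concrete, here is a toy single-process sketch of the Pregel model in Python, using the classic "propagate the maximum value" example rather than PageRank. The function `pregel_max` and its parameters are hypothetical; a real system runs each superstep in parallel across machines.

```python
def pregel_max(values, neighbors, max_supersteps=30):
    """Propagate the largest vertex value through the graph, BSP style.

    values: {vertex: initial value}; neighbors: {vertex: [adjacent vertices]}.
    """
    values = dict(values)

    # Superstep 0: every vertex sends its own value to its neighbors.
    inbox = {v: [] for v in values}
    for v, value in values.items():
        for nb in neighbors[v]:
            inbox[nb].append(value)

    supersteps = 1
    while any(inbox.values()) and supersteps < max_supersteps:
        outbox = {v: [] for v in values}
        for v, messages in inbox.items():
            if not messages:
                continue  # no mail: the vertex stays halted
            best = max(messages)  # use messages from the previous superstep
            if best > values[v]:
                values[v] = best  # execute the vertex function
                for nb in neighbors[v]:
                    outbox[nb].append(best)  # send messages
            # Otherwise the vertex votes to halt by sending nothing.
        inbox = outbox  # synchronization barrier between supersteps
        supersteps += 1
    return values


neighbors = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
result = pregel_max({1: 3, 2: 6, 3: 2, 4: 1}, neighbors)
```

The loop stops when no message is in flight, which corresponds to every vertex having voted to halt.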
  • 25. PageRank and Pregel In GraphX, as usual with Spark, it’s simple: mapReduceTriplets
  • 26. PageRank and Pregel PageRank with Pregel:
  • 27. PageRank and Pregel Applying on our USA.csv file:
  • 28. OpenStreetMap Founded by Steve Coast (UK, 2004). Aims to take geodata out of governments’ hands and give it to the crowd. Actually, the crowd has to create it...
  • 29. OSM
  • 30. OSM
  • 31. OSM So it’s a Graph! Node = Vertex: a single point in space defined by its latitude, longitude and node id. Way = Edge: a way can have between 2 and 2,000 nodes.
  • 32. OSM The network is over-complex for what we need, thus: ● reducing cyclic ways like roundabouts to a single one ● transforming the nodes into sections, i.e. pieces of streets between 2 intersections
  • 33. OSM Hence, OSM ~ G(Node, Way). If it’s not exactly that, we can still manipulate them. In our case, we don’t need the connectivity of an intersection, but the connectivity of a section. This is given by G’ (the dual of G).
  • 34. Dataset ● 80 cities ● 3M edges in total ● smallest city 200 edges (Tempe) ● largest city 200,000 edges (Los Angeles)
  • 35. Comparing Cities ● Hypothesis: Cities with similar connectivity have similar PageRank distributions (figure panels: NYC, Chicago)
  • 36. Fort Worth = Philadelphia? Looks the same!
  • 37. Smells like Spurious Correlation
  • 38. Normalizing PageRank distributions ● Problem: PageRank is correlated with the size of the city ● size of city = number of sections (edges) in the graph ● Normalized PageRank = PageRank / size_of_city ● Now we can compare cities of different sizes!
  • 39. Fort Worth != Philadelphia! Totally different!
  • 40. Fort Worth before and after Note that range of PageRank is preserved
  • 41. Distance between PageRank Distributions ● How to compare PageRank distributions? ● It’s not always a normal distribution! ● Can use the Kullback-Leibler divergence from information theory ● the Kullback–Leibler divergence of Q from P, denoted D_KL(P||Q), is a measure of the information lost when Q is used to approximate P
  • 42. KL Divergence ● Easy to compute ● Unit is nats (or bits when using log2 instead of ln)
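For two discrete distributions over the same bins, the divergence is D_KL(P||Q) = Σ p_i · ln(p_i / q_i). The sketch below computes it in nats with the standard library only. How the talk binned the PageRank values is not stated, so the inputs here are made-up toy histograms, and returning infinity when Q has an empty bin where P does not is a choice made for this example.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) in nats for two discrete distributions on the same bins."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of bins")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # by convention, 0 * ln(0 / q) contributes 0
        if q_i == 0.0:
            return math.inf  # Q gives zero mass to a bin that P uses
        total += p_i * math.log(p_i / q_i)
    return total


# Toy histograms (already normalized to sum to 1), not data from the talk.
p = [0.5, 0.5]
q = [0.25, 0.75]
d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
```

The two directions generally give different values, so the divergence is not a true distance; replacing `math.log` with `math.log2` gives the result in bits.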
  • 43. Very different cities: Dallas & Seattle ● KL divergence = 18.407 ● Dallas is irregular, Seattle is a perfect grid
  • 44. Very similar cities: Atlanta & Boston ● KL divergence = 0.36 ● Both are very irregular
  • 45. Next steps ● Using multiple street topology indicators to measure the risk of car accidents
  • 46. Q.E.D Thanks for keeping up! Question => Future[(Option[Response], Future[Question])]