SlideShare a Scribd company logo
1 of 46
Download to read offline
Massive Graph Mining 
Apache Spark’s GraphX and Data Mining
Who we are 
Andy 
@Noootsab 
@NextLab_be 
@Wajug co-driver 
@Devoxx4Kids organizer 
Maths & CS 
Data lover: geo, open, massive 
Fool 
Rand 
@randhindi 
@snips 
Entrepreneur 
PhD bioinformatics, etc.. 
Love data & ML
Graph 101 
A graph is a mathematical representation of 
linked data. 
It’s defined in term of its Vertices and Edges, 
G(V,E). 
A vertex is an entity that can bring a bag of 
data (generally small) 
An edge connects vertices, and can also own a 
bag of data.
Graph 101 
A Graph represent data in a less convenient 
way for classical processing framework. 
Because the burden is not put on the 
observations themselves (row) but on their 
linkage, and specifically density. 
Thus, the problem is often translated as a self-join 
one.
Graph 101 
A Graph, G(V,E) has a reverse representation, 
its Dual. 
A Dual is nothing other than the graph, G’(V’, 
E’), where 
● a vertex is an edge in G, and 
● an edge is a vertex in G, which has at least 
one edge.
Graph 101 
The classical way to store or share the 
connectivity of a graph is using its tabular 
version, that is, its Adjacency Matrix. 
ref: http://en.wikipedia.org/wiki/Adjacency_matrix
GraphX (Apache Spark) 
Spark 101
GraphX (Apache Spark) 
Offers a Graph API on top of Spark. 
Enabling cross-world manipulations
GraphX (Apache Spark) 
How it differs from other classical systems...
GraphX (Apache Spark)
GraphX (Apache Spark)
GraphX (Apache Spark) 
Plenty of operators on both RDDs, but
GraphX (Apache Spark) 
Plenty of operators on both RDDs, but
GraphX (Apache Spark) 
1. Sends messages to neighbors 
2. Returns an RDD of aggregated messages
GraphX (Apache Spark) 
Offers higher level operators and algo, like
GraphX (Apache Spark) 
This one rules them all (and more) 
More later...
PageRank and Pregel 
Everybody know PageRank, right? 
If not: it’s our oil, our friend, our preferred black 
box… 
It’s why Google Search works so fine!
PageRank and Pregel 
Essentially, PageRank is all about importance 
of a node in a Graph → Link Analysis. 
The bottom line is: 
● In-Links are votes 
● In-Links from important node are more 
important →recursion
PageRank and Pregel 
https://d396qusza40orc.cloudfront.net/mmds/lecture_slides/week1_pagerank_the_flow_formulation.pdf
PageRank and Pregel 
TL;DR 
The importance of a node is the probability that 
a random (drunk) walker fall on a given node. 
So, it depends on: 
1. the probability that he lands into one of its 
neighbor 
2. the probability that he crosses a link from 
the neighbor to it 
3. an arbitrary probability of teleportation
PageRank and Pregel 
Solution: Power Method/Iteration (recursive) 
r_new = A x r_old 
Matrix algebra is a pain in distributed 
environment… 
But wait, the process is rather graph oriented!
PageRank and Pregel 
Pregel (google again) 
Based on BSP, Bulk Sync Parallel 
BSP works like message passing style
PageRank and Pregel 
During Superstep i, a vertex can: 
● use messages received from Superstep i-1 
● execute a function 
● send messages 
● vote to halt
PageRank and Pregel
PageRank and Pregel 
In GraphX, as usual with Spark, it’s simple: 
mapReduceTriplet
PageRank and Pregel 
PageRank with Pregel:
PageRank and Pregel 
Applying on our USA.csv file:
OpenStreetMap 
Founded by Steve Coast (UK, 2004) 
Aims to take Geodata off the govs hands to 
give them to the crowd 
Actually, the crowd has to create them...
OSM
OSM
OSM 
So it’s a Graph! 
Node = Vertex 
single point in space defined by its latitude, longitude and node id 
Way = Edge 
A way can have between 2 and 2,000 nodes
OSM 
The network is over-complex for what we need, 
thus: 
● reducing cycling ways like roundabouts to a 
single one 
● transforming the nodes into sections, i.e. 
pieces of streets between 2 intersections
OSM 
Hence, OSM ~ G(Node, Way) 
If it’s not exactly we can still manipulate them 
In our case, we don’t need the connectivity of 
an intersection, but the connectivity of a 
section. 
This is given by G’ (dual of G)
Dataset 
● 80 cities 
● 3M edges in total 
● smallest city 200 edges (Tempe) 
● largest city 200,000 edges (Los Angeles)
Comparing Cities 
● Hypothesis: Cities with similar connectivity 
have similar PageRank distribution 
NYC Chicago
Fort Worth = Philadelphia? 
Looks the same!
Smells like Spurious Correlation
Normalizing PageRank distributions 
● Problem: PageRank is correlated with the 
size of the city 
● size of city = number of sections (edges) in 
the graph 
● Normalized PageRank = PageRank / 
size_of_city 
● Now we can compare cities of different 
sizes!
Fort Worth != Philadelphia! 
Totally different!
Fort Worth before and after 
Note that range of PageRank is preserved
Distance between PG Distributions 
● How to compare PageRank distributions? 
● It’s not always a normal distribution! 
● Can use the Kullback-Leibler divergence 
from information theory 
● the Kullback–Leibler divergence of Q from 
P, denoted DKL(P||Q), is a measure of the 
information lost when Q is used to 
approximate P
KL Divergence 
● Easy to compute 
● Units is nats (can be bits if using log2 
instead of ln)
Very different cities: Dallas & Seattle 
● KL divergence = 18.407 
● Dallas is irregular, Seattle is a perfect grid
Very similar cities: Atlanta & Boston 
● KL divergence = 0.36 
● Both are very irregular
Next steps 
● Using multiple street topology indicators to 
measure the risk of car accident
Q.E.D 
Thanks for keeping up! 
Question => 
Future[(Option[Response], Future[Question])]

More Related Content

What's hot

Apache Spark GraphX highlights.
Apache Spark GraphX highlights. Apache Spark GraphX highlights.
Apache Spark GraphX highlights. Doug Needham
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...MLconf
 
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use CaseApache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use CaseMo Patel
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustSpark Summit
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataPaco Nathan
 
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)Spark Summit
 
An excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXAn excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXKrishna Sankar
 
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkLarge-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkDB Tsai
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Spark Summit
 
Extending Spark Graph for the Enterprise with Morpheus and Neo4j
Extending Spark Graph for the Enterprise with Morpheus and Neo4jExtending Spark Graph for the Enterprise with Morpheus and Neo4j
Extending Spark Graph for the Enterprise with Morpheus and Neo4jDatabricks
 
Gephi, Graphx, and Giraph
Gephi, Graphx, and GiraphGephi, Graphx, and Giraph
Gephi, Graphx, and GiraphDoug Needham
 
GraphX is the blue ocean for scala engineers @ Scala Matsuri 2014
GraphX is the blue ocean for scala engineers @ Scala Matsuri 2014GraphX is the blue ocean for scala engineers @ Scala Matsuri 2014
GraphX is the blue ocean for scala engineers @ Scala Matsuri 2014鉄平 土佐
 
Signals from outer space
Signals from outer spaceSignals from outer space
Signals from outer spaceGraphAware
 
Spark Meetup @ Netflix, 05/19/2015
Spark Meetup @ Netflix, 05/19/2015Spark Meetup @ Netflix, 05/19/2015
Spark Meetup @ Netflix, 05/19/2015Yves Raimond
 
Magellan-Spark as a Geospatial Analytics Engine by Ram Sriharsha
Magellan-Spark as a Geospatial Analytics Engine by Ram SriharshaMagellan-Spark as a Geospatial Analytics Engine by Ram Sriharsha
Magellan-Spark as a Geospatial Analytics Engine by Ram SriharshaSpark Summit
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkCloudera, Inc.
 
Congressional PageRank: Graph Analytics of US Congress With Neo4j
Congressional PageRank: Graph Analytics of US Congress With Neo4jCongressional PageRank: Graph Analytics of US Congress With Neo4j
Congressional PageRank: Graph Analytics of US Congress With Neo4jWilliam Lyon
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengDatabricks
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiDatabricks
 

What's hot (20)

Apache Spark GraphX highlights.
Apache Spark GraphX highlights. Apache Spark GraphX highlights.
Apache Spark GraphX highlights.
 
Spark graphx
Spark graphxSpark graphx
Spark graphx
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
 
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use CaseApache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big Data
 
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
 
An excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphXAn excursion into Graph Analytics with Apache Spark GraphX
An excursion into Graph Analytics with Apache Spark GraphX
 
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkLarge-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache Spark
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
 
Extending Spark Graph for the Enterprise with Morpheus and Neo4j
Extending Spark Graph for the Enterprise with Morpheus and Neo4jExtending Spark Graph for the Enterprise with Morpheus and Neo4j
Extending Spark Graph for the Enterprise with Morpheus and Neo4j
 
Gephi, Graphx, and Giraph
Gephi, Graphx, and GiraphGephi, Graphx, and Giraph
Gephi, Graphx, and Giraph
 
GraphX is the blue ocean for scala engineers @ Scala Matsuri 2014
GraphX is the blue ocean for scala engineers @ Scala Matsuri 2014GraphX is the blue ocean for scala engineers @ Scala Matsuri 2014
GraphX is the blue ocean for scala engineers @ Scala Matsuri 2014
 
Signals from outer space
Signals from outer spaceSignals from outer space
Signals from outer space
 
Spark Meetup @ Netflix, 05/19/2015
Spark Meetup @ Netflix, 05/19/2015Spark Meetup @ Netflix, 05/19/2015
Spark Meetup @ Netflix, 05/19/2015
 
Magellan-Spark as a Geospatial Analytics Engine by Ram Sriharsha
Magellan-Spark as a Geospatial Analytics Engine by Ram SriharshaMagellan-Spark as a Geospatial Analytics Engine by Ram Sriharsha
Magellan-Spark as a Geospatial Analytics Engine by Ram Sriharsha
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
 
Congressional PageRank: Graph Analytics of US Congress With Neo4j
Congressional PageRank: Graph Analytics of US Congress With Neo4jCongressional PageRank: Graph Analytics of US Congress With Neo4j
Congressional PageRank: Graph Analytics of US Congress With Neo4j
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
 

Similar to Machine Learning and GraphX

Pagerank (from Google)
Pagerank (from Google)Pagerank (from Google)
Pagerank (from Google)Sri Prasanna
 
Lec5 pagerank
Lec5 pagerankLec5 pagerank
Lec5 pagerankCarlos
 
Lec5 Pagerank
Lec5 PagerankLec5 Pagerank
Lec5 Pagerankmobius.cn
 
Streaming Python on Hadoop
Streaming Python on HadoopStreaming Python on Hadoop
Streaming Python on HadoopVivian S. Zhang
 
The magic of (data parallel) distributed systems and where it all breaks - Re...
The magic of (data parallel) distributed systems and where it all breaks - Re...The magic of (data parallel) distributed systems and where it all breaks - Re...
The magic of (data parallel) distributed systems and where it all breaks - Re...Holden Karau
 
Benchmarking tool for graph algorithms
Benchmarking tool for graph algorithmsBenchmarking tool for graph algorithms
Benchmarking tool for graph algorithmsYash Khandelwal
 
Benchmarking Tool for Graph Algorithms
Benchmarking Tool for Graph AlgorithmsBenchmarking Tool for Graph Algorithms
Benchmarking Tool for Graph AlgorithmsYash Khandelwal
 
MapReduceAlgorithms.ppt
MapReduceAlgorithms.pptMapReduceAlgorithms.ppt
MapReduceAlgorithms.pptCheeWeiTan10
 
Benchmarking tool for graph algorithms
Benchmarking tool for graph algorithmsBenchmarking tool for graph algorithms
Benchmarking tool for graph algorithmsYash Khandelwal
 
Data Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZoneData Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZoneDoug Needham
 
Graph analysis over relational database
Graph analysis over relational databaseGraph analysis over relational database
Graph analysis over relational databaseGraphRM
 
Big Data processing with Apache Spark
Big Data processing with Apache SparkBig Data processing with Apache Spark
Big Data processing with Apache SparkLucian Neghina
 
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015Codemotion
 
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...Databricks
 
Large Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache GiraphLarge Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraphsscdotopen
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraSomnath Mazumdar
 
Graph Analysis over Relational Database. Roberto Franchini - Arcade Analytics
Graph Analysis over Relational Database. Roberto Franchini - Arcade AnalyticsGraph Analysis over Relational Database. Roberto Franchini - Arcade Analytics
Graph Analysis over Relational Database. Roberto Franchini - Arcade AnalyticsData Driven Innovation
 

Similar to Machine Learning and GraphX (20)

Pagerank (from Google)
Pagerank (from Google)Pagerank (from Google)
Pagerank (from Google)
 
Lec5 Pagerank
Lec5 PagerankLec5 Pagerank
Lec5 Pagerank
 
Lec5 pagerank
Lec5 pagerankLec5 pagerank
Lec5 pagerank
 
Lec5 Pagerank
Lec5 PagerankLec5 Pagerank
Lec5 Pagerank
 
Streaming Python on Hadoop
Streaming Python on HadoopStreaming Python on Hadoop
Streaming Python on Hadoop
 
The magic of (data parallel) distributed systems and where it all breaks - Re...
The magic of (data parallel) distributed systems and where it all breaks - Re...The magic of (data parallel) distributed systems and where it all breaks - Re...
The magic of (data parallel) distributed systems and where it all breaks - Re...
 
Benchmarking tool for graph algorithms
Benchmarking tool for graph algorithmsBenchmarking tool for graph algorithms
Benchmarking tool for graph algorithms
 
Benchmarking Tool for Graph Algorithms
Benchmarking Tool for Graph AlgorithmsBenchmarking Tool for Graph Algorithms
Benchmarking Tool for Graph Algorithms
 
MapReduceAlgorithms.ppt
MapReduceAlgorithms.pptMapReduceAlgorithms.ppt
MapReduceAlgorithms.ppt
 
Benchmarking tool for graph algorithms
Benchmarking tool for graph algorithmsBenchmarking tool for graph algorithms
Benchmarking tool for graph algorithms
 
Data Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZoneData Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZone
 
Graph analysis over relational database
Graph analysis over relational databaseGraph analysis over relational database
Graph analysis over relational database
 
Big Data processing with Apache Spark
Big Data processing with Apache SparkBig Data processing with Apache Spark
Big Data processing with Apache Spark
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
MapReduce Algorithm Design
MapReduce Algorithm DesignMapReduce Algorithm Design
MapReduce Algorithm Design
 
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
 
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...
 
Large Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache GiraphLarge Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraph
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
 
Graph Analysis over Relational Database. Roberto Franchini - Arcade Analytics
Graph Analysis over Relational Database. Roberto Franchini - Arcade AnalyticsGraph Analysis over Relational Database. Roberto Franchini - Arcade Analytics
Graph Analysis over Relational Database. Roberto Franchini - Arcade Analytics
 

More from Andy Petrella

Data Observability Best Pracices
Data Observability Best PracicesData Observability Best Pracices
Data Observability Best PracicesAndy Petrella
 
How to Build a Global Data Mapping
How to Build a Global Data MappingHow to Build a Global Data Mapping
How to Build a Global Data MappingAndy Petrella
 
Interactive notebooks
Interactive notebooksInteractive notebooks
Interactive notebooksAndy Petrella
 
Governance compliance
Governance   complianceGovernance   compliance
Governance complianceAndy Petrella
 
Data science governance and GDPR
Data science governance and GDPRData science governance and GDPR
Data science governance and GDPRAndy Petrella
 
Data science governance : what and how
Data science governance : what and howData science governance : what and how
Data science governance : what and howAndy Petrella
 
Scala: the unpredicted lingua franca for data science
Scala: the unpredicted lingua franca  for data scienceScala: the unpredicted lingua franca  for data science
Scala: the unpredicted lingua franca for data scienceAndy Petrella
 
Agile data science with scala
Agile data science with scalaAgile data science with scala
Agile data science with scalaAndy Petrella
 
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...Andy Petrella
 
What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.Andy Petrella
 
Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)Andy Petrella
 
Distributed machine learning 101 using apache spark from a browser devoxx.b...
Distributed machine learning 101 using apache spark from a browser   devoxx.b...Distributed machine learning 101 using apache spark from a browser   devoxx.b...
Distributed machine learning 101 using apache spark from a browser devoxx.b...Andy Petrella
 
Spark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleSpark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleAndy Petrella
 
Leveraging mesos as the ultimate distributed data science platform
Leveraging mesos as the ultimate distributed data science platformLeveraging mesos as the ultimate distributed data science platform
Leveraging mesos as the ultimate distributed data science platformAndy Petrella
 
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Andy Petrella
 
Spark meetup london share and analyse genomic data at scale with spark, adam...
Spark meetup london  share and analyse genomic data at scale with spark, adam...Spark meetup london  share and analyse genomic data at scale with spark, adam...
Spark meetup london share and analyse genomic data at scale with spark, adam...Andy Petrella
 
Distributed machine learning 101 using apache spark from the browser
Distributed machine learning 101 using apache spark from the browserDistributed machine learning 101 using apache spark from the browser
Distributed machine learning 101 using apache spark from the browserAndy Petrella
 
Liège créative: Open Science
Liège créative: Open ScienceLiège créative: Open Science
Liège créative: Open ScienceAndy Petrella
 
BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at Scale
BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at ScaleBioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale
BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at ScaleAndy Petrella
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkAndy Petrella
 

More from Andy Petrella (20)

Data Observability Best Pracices
Data Observability Best PracicesData Observability Best Pracices
Data Observability Best Pracices
 
How to Build a Global Data Mapping
How to Build a Global Data MappingHow to Build a Global Data Mapping
How to Build a Global Data Mapping
 
Interactive notebooks
Interactive notebooksInteractive notebooks
Interactive notebooks
 
Governance compliance
Governance   complianceGovernance   compliance
Governance compliance
 
Data science governance and GDPR
Data science governance and GDPRData science governance and GDPR
Data science governance and GDPR
 
Data science governance : what and how
Data science governance : what and howData science governance : what and how
Data science governance : what and how
 
Scala: the unpredicted lingua franca for data science
Scala: the unpredicted lingua franca  for data scienceScala: the unpredicted lingua franca  for data science
Scala: the unpredicted lingua franca for data science
 
Agile data science with scala
Agile data science with scalaAgile data science with scala
Agile data science with scala
 
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...
 
What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.
 
Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)
 
Distributed machine learning 101 using apache spark from a browser devoxx.b...
Distributed machine learning 101 using apache spark from a browser   devoxx.b...Distributed machine learning 101 using apache spark from a browser   devoxx.b...
Distributed machine learning 101 using apache spark from a browser devoxx.b...
 
Spark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleSpark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scale
 
Leveraging mesos as the ultimate distributed data science platform
Leveraging mesos as the ultimate distributed data science platformLeveraging mesos as the ultimate distributed data science platform
Leveraging mesos as the ultimate distributed data science platform
 
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
 
Spark meetup london share and analyse genomic data at scale with spark, adam...
Spark meetup london  share and analyse genomic data at scale with spark, adam...Spark meetup london  share and analyse genomic data at scale with spark, adam...
Spark meetup london share and analyse genomic data at scale with spark, adam...
 
Distributed machine learning 101 using apache spark from the browser
Distributed machine learning 101 using apache spark from the browserDistributed machine learning 101 using apache spark from the browser
Distributed machine learning 101 using apache spark from the browser
 
Liège créative: Open Science
Liège créative: Open ScienceLiège créative: Open Science
Liège créative: Open Science
 
BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at Scale
BioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at ScaleBioBankCloud: Machine Learning on Genomics + GA4GH  @ Med at Scale
BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at Scale
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache Spark
 

Recently uploaded

How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 

Recently uploaded (20)

How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 

Machine Learning and GraphX

  • 1. Massive Graph Mining Apache Spark’s GraphX and Data Mining
  • 2. Who we are Andy @Noootsab @NextLab_be @Wajug co-driver @Devoxx4Kids organizer Maths & CS Data lover: geo, open, massive Fool Rand @randhindi @snips Entrepreneur PhD bioinformatics, etc.. Love data & ML
  • 3. Graph 101 A graph is a mathematical representation of linked data. It’s defined in term of its Vertices and Edges, G(V,E). A vertex is an entity that can bring a bag of data (generally small) An edge connects vertices, and can also own a bag of data.
  • 4. Graph 101 A Graph represent data in a less convenient way for classical processing framework. Because the burden is not put on the observations themselves (row) but on their linkage, and specifically density. Thus, the problem is often translated as a self-join one.
  • 5. Graph 101 A Graph, G(V,E) has a reverse representation, its Dual. A Dual is nothing other than the graph, G’(V’, E’), where ● a vertex is an edge in G, and ● an edge is a vertex in G, which has at least one edge.
  • 6. Graph 101 The classical way to store or share the connectivity of a graph is using its tabular version, that is, its Adjacency Matrix. ref: http://en.wikipedia.org/wiki/Adjacency_matrix
  • 8. GraphX (Apache Spark) Offers a Graph API on top of Spark. Enabling cross-world manipulations
  • 9. GraphX (Apache Spark) How it differs from other classical systems...
  • 12. GraphX (Apache Spark) Plenty of operators on both RDDs, but
  • 13. GraphX (Apache Spark) Plenty of operators on both RDDs, but
  • 14. GraphX (Apache Spark) 1. Sends messages to neighbors 2. Returns an RDD of aggregated messages
  • 15. GraphX (Apache Spark) Offers higher level operators and algo, like
  • 16. GraphX (Apache Spark) This one rules them all (and more) More later...
  • 17. PageRank and Pregel Everybody know PageRank, right? If not: it’s our oil, our friend, our preferred black box… It’s why Google Search works so fine!
  • 18. PageRank and Pregel Essentially, PageRank is all about importance of a node in a Graph → Link Analysis. The bottom line is: ● In-Links are votes ● In-Links from important node are more important →recursion
  • 19. PageRank and Pregel https://d396qusza40orc.cloudfront.net/mmds/lecture_slides/week1_pagerank_the_flow_formulation.pdf
  • 20. PageRank and Pregel TL;DR The importance of a node is the probability that a random (drunk) walker fall on a given node. So, it depends on: 1. the probability that he lands into one of its neighbor 2. the probability that he crosses a link from the neighbor to it 3. an arbitrary probability of teleportation
  • 21. PageRank and Pregel Solution: Power Method/Iteration (recursive) r_new = A x r_old Matrix algebra is a pain in distributed environment… But wait, the process is rather graph oriented!
  • 22. PageRank and Pregel Pregel (google again) Based on BSP, Bulk Sync Parallel BSP works like message passing style
  • 23. PageRank and Pregel During Superstep i, a vertex can: ● use messages received from Superstep i-1 ● execute a function ● send messages ● vote to halt
  • 25. PageRank and Pregel In GraphX, as usual with Spark, it’s simple: mapReduceTriplet
  • 26. PageRank and Pregel PageRank with Pregel:
  • 27. PageRank and Pregel Applying on our USA.csv file:
  • 28. OpenStreetMap Founded by Steve Coast (UK, 2004) Aims to take Geodata off the govs hands to give them to the crowd Actually, the crowd has to create them...
  • 29. OSM
  • 30. OSM
  • 31. OSM So it’s a Graph! Node = Vertex single point in space defined by its latitude, longitude and node id Way = Edge A way can have between 2 and 2,000 nodes
  • 32. OSM The network is over-complex for what we need, thus: ● reducing cycling ways like roundabouts to a single one ● transforming the nodes into sections, i.e. pieces of streets between 2 intersections
  • 33. OSM Hence, OSM ~ G(Node, Way) If it’s not exactly we can still manipulate them In our case, we don’t need the connectivity of an intersection, but the connectivity of a section. This is given by G’ (dual of G)
  • 34. Dataset ● 80 cities ● 3M edges in total ● smallest city 200 edges (Tempe) ● largest city 200,000 edges (Los Angeles)
  • 35. Comparing Cities ● Hypothesis: Cities with similar connectivity have similar PageRank distribution NYC Chicago
  • 36. Fort Worth = Philadelphia? Looks the same!
  • 37. Smells like Spurious Correlation
  • 38. Normalizing PageRank distributions ● Problem: PageRank is correlated with the size of the city ● size of city = number of sections (edges) in the graph ● Normalized PageRank = PageRank / size_of_city ● Now we can compare cities of different sizes!
  • 39. Fort Worth != Philadelphia! Totally different!
  • 40. Fort Worth before and after Note that range of PageRank is preserved
  • 41. Distance between PG Distributions ● How to compare PageRank distributions? ● It’s not always a normal distribution! ● Can use the Kullback-Leibler divergence from information theory ● the Kullback–Leibler divergence of Q from P, denoted DKL(P||Q), is a measure of the information lost when Q is used to approximate P
  • 42. KL Divergence ● Easy to compute ● Units is nats (can be bits if using log2 instead of ln)
  • 43. Very different cities: Dallas & Seattle ● KL divergence = 18.407 ● Dallas is irregular, Seattle is a perfect grid
  • 44. Very similar cities: Atlanta & Boston ● KL divergence = 0.36 ● Both are very irregular
  • 45. Next steps ● Using multiple street topology indicators to measure the risk of car accident
  • 46. Q.E.D Thanks for keeping up! Question => Future[(Option[Response], Future[Question])]