SlideShare a Scribd company logo
Dynamic Community Detection for Large-scale e-Commerce data
with Spark Streaming and GraphX
Ming Huang
Meng Zhang, Bin Wei
GuangYuan Huang, Jinkui Shi
Community Detection
Scenarios
•  VIP Customer
•  Reputation Escalator
•  Fraud Seller
•  ………
Algorithms
•  LPA
•  GN
•  Fast Unfolding
•  …….
How to make it Dynamic?
Static Communities Streaming Data
Make sophisticated, real-time decisions
Definition & Solution
Dynamic Community Detection
1.  Decide New Node’s community
2.  Update Graph Physical Topology
3.  Effect communities and modularity
Spark Streaming + GraphX à Streaming Graph
REAL-TIME
Streaming Graph
Edges
DStream
Graph
DStream
merge merge merge
Stock Graph
… … …
Models and Algorithms
Quick Overview of
Fast Unfolding
Modularity:
!
Q=
1
2m
Aij
*
ki
kj
2m
⎡
⎣
⎢
⎢
⎤
⎦
⎥
⎥i,j
∑ δ ci
,cj( )
!
Q = Qi
i
c
∑ =
in∑
2m
)
tot∑
2m
⎛
⎝⎜
⎞
⎠⎟
2
⎡
⎣
⎢
⎢
⎤
⎦
⎥
⎥i
c
∑
Incremental Algorithms
JV(Streaming with RDD ) UMG(Streaming with Graph)
"   Union & Modularity Greedy"   Join & Vote
JV
A B C
C1 C2 C2
D D D
A B C
D D D
C1 C2 C2
D
C2
join
Vote
incEdgeRDD stockCommunityRDD
D
C2
UMG 1 - Union
A
B
C1
C2
C3
C
(C1 or C2) ?
   newGraph = stockGraph.union(incGraph)"
A
B
C
D
UMG 2 - findBestCommunity
A
B
C
D
gain1=G(node(d), community(1))
gain2=G(node(d) , community(2))
C3
incVertexWithNeighbors = newGraph.mapReduceTriplets[Array[VertexData]]
(collectNeighborFunc, _ ++ _,"# # # # #Some((incGraph.vertices, EdgeDirection.Either)))
idCommunity = incVertexWithNeighbors.map {"
case (vid, neighbors) => (vid, findBestCommunity(neighbors))"
}.cache()"
!
Ci
=Cmax
j
G(nodei
,Cj
)
!
ΔQ=
in∑ +ki,in
2m
+
tot+ki∑
2m
⎛
⎝
⎜
⎞
⎠
⎟
2
⎡
⎣
⎢
⎢
⎤
⎦
⎥
⎥
+
in∑
2m
+
tot∑
2m
⎛
⎝
⎜
⎞
⎠
⎟
2
+
ki
2m
⎛
⎝
⎜
⎞
⎠
⎟
2
⎡
⎣
⎢
⎢
⎤
⎦
⎥
⎥
C2
C1
UMG 3 - updateCommunities
A
D
B
C
newCommunityRdd = idCommunity.updateCommunities(commuitiyRdd)"
"
newModularity = newCommunityRdd.map(community=>community.modularity).reduce(_+_)"
C1
C2
!
Q = Qi
i
c
∑ =
in∑
2m
)
tot∑
2m
⎛
⎝⎜
⎞
⎠⎟
2
⎡
⎣
⎢
⎢
⎤
⎦
⎥
⎥i
c
∑
(Q1, Q2)
edgeStreamRDD.foreachRDD { "
  incEdgeRdd => { "
   val incGraph  = buildIncGraph(incEdgeRdd) "
   (communityInfoRDD, modularity) = streamingFU.trainOn(incGraph)"
outputToHBase(communityInfoRDD)"
outputToHBase(modularity)"
edgeRdd "
  }"
} "
Flow Example Code
ssc.start()"
ssc.awaitTermination()"
val conf = new SparkConf().setMaster(……).setAppName(……)"
val ssc = new StreamingContext(conf, Seconds(60))"
"
"
val totalGraph = initGraph(totalEdgesRdd) "
Val streamingFU = new StreamingFU().setTotalGraph(totalGraph)"
"
val onlineDataFlow = getDataFlow(ssc.sparkContext)"
val edgeStreamRDD  = ssc.queueStream(onlineDataFlow, true) "
"
Experiment Results
Autonomous Systems Graphs
Stanford Large Network Dataset Collection(as-733)
https://snap.stanford.edu/data/
Modularity Trend – AS
Online Trading Graph
Buyer Seller
C-C
Modularity Trend – OT
Streaming Graph à Better Result
Key Points
"   Operator
"   Merge Small graph into Large graph
"   Model
"   Local changes
"   Index or summary
"   Algorithm
"   Delicate formula
"   Commutative law & Associative law
"   Parallelly & Incrementally
Complex GraphX
Operators
Graph Union Operator
GRAPH(H)GRAPH(G)
∪ =	

E
F
G
H
B
C
D E
F
A
B
C
D
E
F
A
H
G
GRAPH(G U H)
Graph Union Operator
https://issues.apache.org/jira/browse/SPARK-7894"
"
[GraphX] Complex Operators between Graphs: Union
https://github.com/apache/spark/pull/6685"
"
   newGraph = stockGraph.union(incGraph)"
Complex GraphX Operators
"   Union of Graphs ( G ∪ H )
"   Intersection of Graphs ( G ∩ H)
"   Graph Join
"   Difference of Graphs(G – H)
"   Graph Complement
"   Line Graph ( L(G) )
Issues:"
Complex Operators between Graphs
https://issues.apache.org/jira/browse/SPARK-7893"
Streaming Optimization
Monitoring and Correction
Ω
Data Loading Modularity Threshold CheckingStreaming-FU
FastUnfolding
[Hourly Monitoring]
[Streaming]
[Daily Running]
FastUnfolding
communityID	
 communityInfo	

community1	
 (in1,tot1,degree1,modularity1)	

……	
 ……	

mTime mValue
timestamp1 totalModularity1
…… ……
modularityTablecommRDDTable
Streaming Resource Allocation
•  Driver-Memory: 20G
•  Executors: 100
•  Core: 2
•  Executor-Memory: 20G
Not Enough for Peak Period!
Streaming Buffer
Kafka
Stream
Hdfs
Stream
Join
StreamingFUModel
Streaming-
FU
Streaming-
Buffer
TT
Receiver
Split
HDFS
Modularity Correction Buffer
Resource Peak Buffer
Kafka
Buffer
Writer
Conclusion
"   Streaming Graph
"   Complex Operators will help
"   Daily Rebuild & Threshold Check
"   Costs more memory and time
"   Open Question
checkpoint with Streaming or Graph?
Acknowledgements
1.  Limits of community detection
" http://www.slideshare.net/vtraag/comm-detect
2.  Community Detection
" http://www.traag.net/projects/community-detection/
3.  Social Network Analysis
" http://lorenzopaoliani.info/topics/
4.  Community detection in complex networks using Extremal Optimization
" http://arxiv.org/pdf/cond-mat/0501368.pdf
"   Q & A
Agenda
"   Dynamic Community Detection
"   Streaming Graph
"   Models and Algorithms
"   Complex GraphX Operators
"   Streaming Optimization
"   Conclusion
Static vs. Dynamic
Static Model Dynamic Model

More Related Content

What's hot

B trees in Data Structure
B trees in Data StructureB trees in Data Structure
B trees in Data Structure
Anuj Modi
 
2 d geometric transformations
2 d geometric transformations2 d geometric transformations
2 d geometric transformations
Mohd Arif
 

What's hot (20)

2-3 Tree
2-3 Tree2-3 Tree
2-3 Tree
 
B trees in Data Structure
B trees in Data StructureB trees in Data Structure
B trees in Data Structure
 
Rehashing
RehashingRehashing
Rehashing
 
Dda algorithm
Dda algorithmDda algorithm
Dda algorithm
 
Abstract data types
Abstract data typesAbstract data types
Abstract data types
 
queue & its applications
queue & its applicationsqueue & its applications
queue & its applications
 
knapsack problem
knapsack problemknapsack problem
knapsack problem
 
Knapsack problem using fixed tuple
Knapsack problem using fixed tupleKnapsack problem using fixed tuple
Knapsack problem using fixed tuple
 
Lecture _Line Scan Conversion.ppt
Lecture _Line Scan Conversion.pptLecture _Line Scan Conversion.ppt
Lecture _Line Scan Conversion.ppt
 
Graphs in datastructures
Graphs in datastructuresGraphs in datastructures
Graphs in datastructures
 
fractals
fractalsfractals
fractals
 
Linkedlists
LinkedlistsLinkedlists
Linkedlists
 
Stack and Queue
Stack and Queue Stack and Queue
Stack and Queue
 
Comuter graphics ellipse drawing algorithm
Comuter graphics ellipse drawing algorithmComuter graphics ellipse drawing algorithm
Comuter graphics ellipse drawing algorithm
 
2 d geometric transformations
2 d geometric transformations2 d geometric transformations
2 d geometric transformations
 
Leftist heap
Leftist heapLeftist heap
Leftist heap
 
Revised Data Structure- STACK in Python XII CS.pdf
Revised Data Structure- STACK in Python XII CS.pdfRevised Data Structure- STACK in Python XII CS.pdf
Revised Data Structure- STACK in Python XII CS.pdf
 
Queue ppt
Queue pptQueue ppt
Queue ppt
 
Prim's algorithm
Prim's algorithmPrim's algorithm
Prim's algorithm
 
Dijkstra's algorithm presentation
Dijkstra's algorithm presentationDijkstra's algorithm presentation
Dijkstra's algorithm presentation
 

Viewers also liked

ODSC_Cherven_20160518
ODSC_Cherven_20160518ODSC_Cherven_20160518
ODSC_Cherven_20160518
Ken Cherven
 
Reactive Streams, linking Reactive Application to Spark Streaming by Luc Bour...
Reactive Streams, linking Reactive Application to Spark Streaming by Luc Bour...Reactive Streams, linking Reactive Application to Spark Streaming by Luc Bour...
Reactive Streams, linking Reactive Application to Spark Streaming by Luc Bour...
Spark Summit
 
Monitoring Electronic Trading Environments using Spark by Fergal Toomey and P...
Monitoring Electronic Trading Environments using Spark by Fergal Toomey and P...Monitoring Electronic Trading Environments using Spark by Fergal Toomey and P...
Monitoring Electronic Trading Environments using Spark by Fergal Toomey and P...
Spark Summit
 
Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...
Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...
Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...
Spark Summit
 

Viewers also liked (20)

Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
 
Limits of community detection
Limits of community detectionLimits of community detection
Limits of community detection
 
ODSC_Cherven_20160518
ODSC_Cherven_20160518ODSC_Cherven_20160518
ODSC_Cherven_20160518
 
Emr hive barcamp 2012
Emr hive   barcamp 2012Emr hive   barcamp 2012
Emr hive barcamp 2012
 
Visualizing Networks
Visualizing NetworksVisualizing Networks
Visualizing Networks
 
Xgboost
XgboostXgboost
Xgboost
 
Diagnosing Open-Source Community Health with Spark-(William Benton, Red Hat)
Diagnosing Open-Source Community Health with Spark-(William Benton, Red Hat)Diagnosing Open-Source Community Health with Spark-(William Benton, Red Hat)
Diagnosing Open-Source Community Health with Spark-(William Benton, Red Hat)
 
Spark Summit EU talk by Brij Bhushan Ravat
Spark Summit EU talk by Brij Bhushan RavatSpark Summit EU talk by Brij Bhushan Ravat
Spark Summit EU talk by Brij Bhushan Ravat
 
Estudio sobre Spark, Storm, Kafka y Hive
Estudio sobre Spark, Storm, Kafka y HiveEstudio sobre Spark, Storm, Kafka y Hive
Estudio sobre Spark, Storm, Kafka y Hive
 
Reactive Streams, linking Reactive Application to Spark Streaming by Luc Bour...
Reactive Streams, linking Reactive Application to Spark Streaming by Luc Bour...Reactive Streams, linking Reactive Application to Spark Streaming by Luc Bour...
Reactive Streams, linking Reactive Application to Spark Streaming by Luc Bour...
 
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
 
Spark Summit Keynote with Ken Tsai
Spark Summit Keynote with Ken TsaiSpark Summit Keynote with Ken Tsai
Spark Summit Keynote with Ken Tsai
 
Spark Summit EU talk by Stephan Kessler
Spark Summit EU talk by Stephan KesslerSpark Summit EU talk by Stephan Kessler
Spark Summit EU talk by Stephan Kessler
 
GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™
 
Monitoring Electronic Trading Environments using Spark by Fergal Toomey and P...
Monitoring Electronic Trading Environments using Spark by Fergal Toomey and P...Monitoring Electronic Trading Environments using Spark by Fergal Toomey and P...
Monitoring Electronic Trading Environments using Spark by Fergal Toomey and P...
 
Huohua: A Distributed Time Series Analysis Framework For Spark
Huohua: A Distributed Time Series Analysis Framework For SparkHuohua: A Distributed Time Series Analysis Framework For Spark
Huohua: A Distributed Time Series Analysis Framework For Spark
 
Community Detection in Social Media
Community Detection in Social MediaCommunity Detection in Social Media
Community Detection in Social Media
 
Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...
Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...
Online Predictive Modeling of Fraud Schemes from Mulitple Live Streams by Cla...
 
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
 
Credit Fraud Prevention with Spark and Graph Analysis
Credit Fraud Prevention with Spark and Graph AnalysisCredit Fraud Prevention with Spark and Graph Analysis
Credit Fraud Prevention with Spark and Graph Analysis
 

Similar to Dynamic Community Detection for Large-scale e-Commerce data with Spark Streaming and GraphX-(Ming Huang, Taobao)

RAPIDS cuGraph – Accelerating all your Graph needs
RAPIDS cuGraph – Accelerating all your Graph needsRAPIDS cuGraph – Accelerating all your Graph needs
RAPIDS cuGraph – Accelerating all your Graph needs
Connected Data World
 

Similar to Dynamic Community Detection for Large-scale e-Commerce data with Spark Streaming and GraphX-(Ming Huang, Taobao) (20)

RAPIDS cuGraph – Accelerating all your Graph needs
RAPIDS cuGraph – Accelerating all your Graph needsRAPIDS cuGraph – Accelerating all your Graph needs
RAPIDS cuGraph – Accelerating all your Graph needs
 
Scaling graph investigations with Math, GPUs, & Experts
Scaling graph investigations with Math, GPUs, & ExpertsScaling graph investigations with Math, GPUs, & Experts
Scaling graph investigations with Math, GPUs, & Experts
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
 
Using Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech IndustryUsing Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech Industry
 
Using Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech IndustryUsing Graph Analysis and Fraud Detection in the Fintech Industry
Using Graph Analysis and Fraud Detection in the Fintech Industry
 
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph AnalysisICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
 
Big Stream Processing Systems, Big Graphs
Big Stream Processing Systems, Big GraphsBig Stream Processing Systems, Big Graphs
Big Stream Processing Systems, Big Graphs
 
GraphQL & DGraph with Go
GraphQL & DGraph with GoGraphQL & DGraph with Go
GraphQL & DGraph with Go
 
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...
Hardware Accelerated Machine Learning Solution for Detecting Fraud and Money ...
 
Large-scale Recommendation Systems on Just a PC
Large-scale Recommendation Systems on Just a PCLarge-scale Recommendation Systems on Just a PC
Large-scale Recommendation Systems on Just a PC
 
Graph protocol for accessing information about blockchains and d apps
Graph protocol for accessing information about blockchains and d appsGraph protocol for accessing information about blockchains and d apps
Graph protocol for accessing information about blockchains and d apps
 
Introduction to MapReduce & hadoop
Introduction to MapReduce & hadoopIntroduction to MapReduce & hadoop
Introduction to MapReduce & hadoop
 
Graphs in data structures are non-linear data structures made up of a finite ...
Graphs in data structures are non-linear data structures made up of a finite ...Graphs in data structures are non-linear data structures made up of a finite ...
Graphs in data structures are non-linear data structures made up of a finite ...
 
100X Investigations - Graphistry / Microsoft BlueHat
100X Investigations - Graphistry / Microsoft BlueHat100X Investigations - Graphistry / Microsoft BlueHat
100X Investigations - Graphistry / Microsoft BlueHat
 
Venkatesh Ramanathan, Data Scientist, PayPal at MLconf ATL 2017
Venkatesh Ramanathan, Data Scientist, PayPal at MLconf ATL 2017Venkatesh Ramanathan, Data Scientist, PayPal at MLconf ATL 2017
Venkatesh Ramanathan, Data Scientist, PayPal at MLconf ATL 2017
 
GraphGen: Conducting Graph Analytics over Relational Databases
GraphGen: Conducting Graph Analytics over Relational DatabasesGraphGen: Conducting Graph Analytics over Relational Databases
GraphGen: Conducting Graph Analytics over Relational Databases
 
GraphGen: Conducting Graph Analytics over Relational Databases
GraphGen: Conducting Graph Analytics over Relational DatabasesGraphGen: Conducting Graph Analytics over Relational Databases
GraphGen: Conducting Graph Analytics over Relational Databases
 
Big Graph : Tools, Techniques, Issues, Challenges and Future Directions
Big Graph : Tools, Techniques, Issues, Challenges and Future Directions Big Graph : Tools, Techniques, Issues, Challenges and Future Directions
Big Graph : Tools, Techniques, Issues, Challenges and Future Directions
 
BIG GRAPH: TOOLS, TECHNIQUES, ISSUES, CHALLENGES AND FUTURE DIRECTIONS
BIG GRAPH: TOOLS, TECHNIQUES, ISSUES, CHALLENGES AND FUTURE DIRECTIONSBIG GRAPH: TOOLS, TECHNIQUES, ISSUES, CHALLENGES AND FUTURE DIRECTIONS
BIG GRAPH: TOOLS, TECHNIQUES, ISSUES, CHALLENGES AND FUTURE DIRECTIONS
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity Clusters
 

More from Spark Summit

Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Recently uploaded

一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
StarCompliance.io
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Domenico Conte
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 

Recently uploaded (20)

Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
Professional Data Engineer Certification Exam Guide  _  Learn  _  Google Clou...
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 

Dynamic Community Detection for Large-scale e-Commerce data with Spark Streaming and GraphX-(Ming Huang, Taobao)

  • 1. Dynamic Community Detection for Large-scale e-Commerce data with Spark Streaming and GraphX Ming Huang Meng Zhang, Bin Wei GuangYuan Huang, Jinkui Shi
  • 2. Community Detection Scenarios •  VIP Customer •  Reputation Escalator •  Fraud Seller •  ……… Algorithms •  LPA •  GN •  Fast Unfolding •  …….
  • 3. How to make it Dynamic? Static Communities Streaming Data Make sophisticated, real-time decisions
  • 4. Definition & Solution Dynamic Community Detection 1.  Decide New Node’s community 2.  Update Graph Physical Topology 3.  Effect communities and modularity Spark Streaming + GraphX à Streaming Graph REAL-TIME
  • 7. Quick Overview of Fast Unfolding Modularity: ! Q= 1 2m Aij * ki kj 2m ⎡ ⎣ ⎢ ⎢ ⎤ ⎦ ⎥ ⎥i,j ∑ δ ci ,cj( ) ! Q = Qi i c ∑ = in∑ 2m ) tot∑ 2m ⎛ ⎝⎜ ⎞ ⎠⎟ 2 ⎡ ⎣ ⎢ ⎢ ⎤ ⎦ ⎥ ⎥i c ∑
  • 8. Incremental Algorithms JV(Streaming with RDD ) UMG(Streaming with Graph) "   Union & Modularity Greedy"   Join & Vote
  • 9. JV A B C C1 C2 C2 D D D A B C D D D C1 C2 C2 D C2 join Vote incEdgeRDD stockCommunityRDD D C2
  • 10. UMG 1 - Union A B C1 C2 C3 C (C1 or C2) ?    newGraph = stockGraph.union(incGraph)" A B C D
  • 11. UMG 2 - findBestCommunity A B C D gain1=G(node(d), community(1)) gain2=G(node(d) , community(2)) C3 incVertexWithNeighbors = newGraph.mapReduceTriplets[Array[VertexData]] (collectNeighborFunc, _ ++ _,"# # # # #Some((incGraph.vertices, EdgeDirection.Either))) idCommunity = incVertexWithNeighbors.map {" case (vid, neighbors) => (vid, findBestCommunity(neighbors))" }.cache()" ! Ci =Cmax j G(nodei ,Cj ) ! ΔQ= in∑ +ki,in 2m + tot+ki∑ 2m ⎛ ⎝ ⎜ ⎞ ⎠ ⎟ 2 ⎡ ⎣ ⎢ ⎢ ⎤ ⎦ ⎥ ⎥ + in∑ 2m + tot∑ 2m ⎛ ⎝ ⎜ ⎞ ⎠ ⎟ 2 + ki 2m ⎛ ⎝ ⎜ ⎞ ⎠ ⎟ 2 ⎡ ⎣ ⎢ ⎢ ⎤ ⎦ ⎥ ⎥ C2 C1
  • 12. UMG 3 - updateCommunities A D B C newCommunityRdd = idCommunity.updateCommunities(commuitiyRdd)" " newModularity = newCommunityRdd.map(community=>community.modularity).reduce(_+_)" C1 C2 ! Q = Qi i c ∑ = in∑ 2m ) tot∑ 2m ⎛ ⎝⎜ ⎞ ⎠⎟ 2 ⎡ ⎣ ⎢ ⎢ ⎤ ⎦ ⎥ ⎥i c ∑ (Q1, Q2)
  • 13. edgeStreamRDD.foreachRDD { "   incEdgeRdd => { "    val incGraph  = buildIncGraph(incEdgeRdd) "    (communityInfoRDD, modularity) = streamingFU.trainOn(incGraph)" outputToHBase(communityInfoRDD)" outputToHBase(modularity)" edgeRdd "   }" } " Flow Example Code ssc.start()" ssc.awaitTermination()" val conf = new SparkConf().setMaster(……).setAppName(……)" val ssc = new StreamingContext(conf, Seconds(60))" " " val totalGraph = initGraph(totalEdgesRdd) " Val streamingFU = new StreamingFU().setTotalGraph(totalGraph)" " val onlineDataFlow = getDataFlow(ssc.sparkContext)" val edgeStreamRDD  = ssc.queueStream(onlineDataFlow, true) " "
  • 15. Autonomous Systems Graphs Stanford Large Network Dataset Collection(as-733) https://snap.stanford.edu/data/
  • 18. Modularity Trend – OT Streaming Graph à Better Result
  • 19. Key Points "   Operator "   Merge Small graph into Large graph "   Model "   Local changes "   Index or summary "   Algorithm "   Delicate formula "   Commutative law & Associative law "   Parallelly & Incrementally
  • 21. Graph Union Operator GRAPH(H)GRAPH(G) ∪ =  E F G H B C D E F A B C D E F A H G GRAPH(G U H) Graph Union Operator https://issues.apache.org/jira/browse/SPARK-7894" " [GraphX] Complex Operators between Graphs: Union https://github.com/apache/spark/pull/6685" "    newGraph = stockGraph.union(incGraph)"
  • 22. Complex GraphX Operators "   Union of Graphs ( G ∪ H ) "   Intersection of Graphs ( G ∩ H) "   Graph Join "   Difference of Graphs(G – H) "   Graph Complement "   Line Graph ( L(G) ) Issues:" Complex Operators between Graphs https://issues.apache.org/jira/browse/SPARK-7893"
  • 24. Monitoring and Correction Ω Data Loading Modularity Threshold CheckingStreaming-FU FastUnfolding [Hourly Monitoring] [Streaming] [Daily Running] FastUnfolding communityID  communityInfo  community1  (in1,tot1,degree1,modularity1)  ……  ……  mTime mValue timestamp1 totalModularity1 …… …… modularityTablecommRDDTable
  • 25. Streaming Resource Allocation •  Driver-Memory: 20G •  Executors: 100 •  Core: 2 •  Executor-Memory: 20G Not Enough for Peak Period!
  • 27. Conclusion "   Streaming Graph "   Complex Operators will help "   Daily Rebuild & Threshold Check "   Costs more memory and time "   Open Question checkpoint with Streaming or Graph?
  • 28. Acknowledgements 1.  Limits of community detection " http://www.slideshare.net/vtraag/comm-detect 2.  Community Detection " http://www.traag.net/projects/community-detection/ 3.  Social Network Analysis " http://lorenzopaoliani.info/topics/ 4.  Community detection in complex networks using Extremal Optimization " http://arxiv.org/pdf/cond-mat/0501368.pdf
  • 29. "   Q & A
  • 30. Agenda "   Dynamic Community Detection "   Streaming Graph "   Models and Algorithms "   Complex GraphX Operators "   Streaming Optimization "   Conclusion
  • 31. Static vs. Dynamic Static Model Dynamic Model