Dynamic Community Detection for Large-scale e-Commerce data
with Spark Streaming and GraphX
Ming Huang
Meng Zhang, Bin Wei
GuangYuan Huang, Jinkui Shi
Community Detection
Scenarios
•  VIP Customer
•  Reputation Escalator
•  Fraud Seller
•  ………
Algorithms
•  LPA
•  GN
•  Fast Unfolding
•  …….
How to make it Dynamic?
Static Communities Streaming Data
Make sophisticated, real-time decisions
Definition & Solution
Dynamic Community Detection
1.  Decide New Node’s community
2.  Update Graph Physical Topology
3.  Effect communities and modularity
Spark Streaming + GraphX à Streaming Graph
REAL-TIME
Streaming Graph
Edges
DStream
Graph
DStream
merge merge merge
Stock Graph
… … …
Models and Algorithms
Quick Overview of
Fast Unfolding
Modularity:
!
Q=
1
2m
Aij
*
ki
kj
2m
⎡
⎣
⎢
⎢
⎤
⎦
⎥
⎥i,j
∑ δ ci
,cj( )
!
Q = Qi
i
c
∑ =
in∑
2m
)
tot∑
2m
⎛
⎝⎜
⎞
⎠⎟
2
⎡
⎣
⎢
⎢
⎤
⎦
⎥
⎥i
c
∑
Incremental Algorithms
JV(Streaming with RDD ) UMG(Streaming with Graph)
"   Union & Modularity Greedy"   Join & Vote
JV
A B C
C1 C2 C2
D D D
A B C
D D D
C1 C2 C2
D
C2
join
Vote
incEdgeRDD stockCommunityRDD
D
C2
UMG 1 - Union
A
B
C1
C2
C3
C
(C1 or C2) ?
   newGraph = stockGraph.union(incGraph)"
A
B
C
D
UMG 2 - findBestCommunity
A
B
C
D
gain1=G(node(d), community(1))
gain2=G(node(d) , community(2))
C3
incVertexWithNeighbors = newGraph.mapReduceTriplets[Array[VertexData]]
(collectNeighborFunc, _ ++ _,"# # # # #Some((incGraph.vertices, EdgeDirection.Either)))
idCommunity = incVertexWithNeighbors.map {"
case (vid, neighbors) => (vid, findBestCommunity(neighbors))"
}.cache()"
!
Ci
=Cmax
j
G(nodei
,Cj
)
!
ΔQ=
in∑ +ki,in
2m
+
tot+ki∑
2m
⎛
⎝
⎜
⎞
⎠
⎟
2
⎡
⎣
⎢
⎢
⎤
⎦
⎥
⎥
+
in∑
2m
+
tot∑
2m
⎛
⎝
⎜
⎞
⎠
⎟
2
+
ki
2m
⎛
⎝
⎜
⎞
⎠
⎟
2
⎡
⎣
⎢
⎢
⎤
⎦
⎥
⎥
C2
C1
UMG 3 - updateCommunities
A
D
B
C
newCommunityRdd = idCommunity.updateCommunities(commuitiyRdd)"
"
newModularity = newCommunityRdd.map(community=>community.modularity).reduce(_+_)"
C1
C2
!
Q = Qi
i
c
∑ =
in∑
2m
)
tot∑
2m
⎛
⎝⎜
⎞
⎠⎟
2
⎡
⎣
⎢
⎢
⎤
⎦
⎥
⎥i
c
∑
(Q1, Q2)
edgeStreamRDD.foreachRDD { "
  incEdgeRdd => { "
   val incGraph  = buildIncGraph(incEdgeRdd) "
   (communityInfoRDD, modularity) = streamingFU.trainOn(incGraph)"
outputToHBase(communityInfoRDD)"
outputToHBase(modularity)"
edgeRdd "
  }"
} "
Flow Example Code
ssc.start()"
ssc.awaitTermination()"
val conf = new SparkConf().setMaster(……).setAppName(……)"
val ssc = new StreamingContext(conf, Seconds(60))"
"
"
val totalGraph = initGraph(totalEdgesRdd) "
Val streamingFU = new StreamingFU().setTotalGraph(totalGraph)"
"
val onlineDataFlow = getDataFlow(ssc.sparkContext)"
val edgeStreamRDD  = ssc.queueStream(onlineDataFlow, true) "
"
Experiment Results
Autonomous Systems Graphs
Stanford Large Network Dataset Collection(as-733)
https://snap.stanford.edu/data/
Modularity Trend – AS
Online Trading Graph
Buyer Seller
C-C
Modularity Trend – OT
Streaming Graph à Better Result
Key Points
"   Operator
"   Merge Small graph into Large graph
"   Model
"   Local changes
"   Index or summary
"   Algorithm
"   Delicate formula
"   Commutative law & Associative law
"   Parallelly & Incrementally
Complex GraphX
Operators
Graph Union Operator
GRAPH(H)GRAPH(G)
∪ =	

E
F
G
H
B
C
D E
F
A
B
C
D
E
F
A
H
G
GRAPH(G U H)
Graph Union Operator
https://issues.apache.org/jira/browse/SPARK-7894"
"
[GraphX] Complex Operators between Graphs: Union
https://github.com/apache/spark/pull/6685"
"
   newGraph = stockGraph.union(incGraph)"
Complex GraphX Operators
"   Union of Graphs ( G ∪ H )
"   Intersection of Graphs ( G ∩ H)
"   Graph Join
"   Difference of Graphs(G – H)
"   Graph Complement
"   Line Graph ( L(G) )
Issues:"
Complex Operators between Graphs
https://issues.apache.org/jira/browse/SPARK-7893"
Streaming Optimization
Monitoring and Correction
Ω
Data Loading Modularity Threshold CheckingStreaming-FU
FastUnfolding
[Hourly Monitoring]
[Streaming]
[Daily Running]
FastUnfolding
communityID	
 communityInfo	

community1	
 (in1,tot1,degree1,modularity1)	

……	
 ……	

mTime mValue
timestamp1 totalModularity1
…… ……
modularityTablecommRDDTable
Streaming Resource Allocation
•  Driver-Memory: 20G
•  Executors: 100
•  Core: 2
•  Executor-Memory: 20G
Not Enough for Peak Period!
Streaming Buffer
Kafka
Stream
Hdfs
Stream
Join
StreamingFUModel
Streaming-
FU
Streaming-
Buffer
TT
Receiver
Split
HDFS
Modularity Correction Buffer
Resource Peak Buffer
Kafka
Buffer
Writer
Conclusion
"   Streaming Graph
"   Complex Operators will help
"   Daily Rebuild & Threshold Check
"   Costs more memory and time
"   Open Question
checkpoint with Streaming or Graph?
Acknowledgements
1.  Limits of community detection
" http://www.slideshare.net/vtraag/comm-detect
2.  Community Detection
" http://www.traag.net/projects/community-detection/
3.  Social Network Analysis
" http://lorenzopaoliani.info/topics/
4.  Community detection in complex networks using Extremal Optimization
" http://arxiv.org/pdf/cond-mat/0501368.pdf
"   Q & A
Agenda
"   Dynamic Community Detection
"   Streaming Graph
"   Models and Algorithms
"   Complex GraphX Operators
"   Streaming Optimization
"   Conclusion
Static vs. Dynamic
Static Model Dynamic Model

Dynamic Community Detection for Large-scale e-Commerce data with Spark Streaming and GraphX-(Ming Huang, Taobao)

  • 1.
    Dynamic Community Detectionfor Large-scale e-Commerce data with Spark Streaming and GraphX Ming Huang Meng Zhang, Bin Wei GuangYuan Huang, Jinkui Shi
  • 2.
    Community Detection Scenarios •  VIPCustomer •  Reputation Escalator •  Fraud Seller •  ……… Algorithms •  LPA •  GN •  Fast Unfolding •  …….
  • 3.
    How to makeit Dynamic? Static Communities Streaming Data Make sophisticated, real-time decisions
  • 4.
    Definition & Solution DynamicCommunity Detection 1.  Decide New Node’s community 2.  Update Graph Physical Topology 3.  Effect communities and modularity Spark Streaming + GraphX à Streaming Graph REAL-TIME
  • 5.
  • 6.
  • 7.
    Quick Overview of FastUnfolding Modularity: ! Q= 1 2m Aij * ki kj 2m ⎡ ⎣ ⎢ ⎢ ⎤ ⎦ ⎥ ⎥i,j ∑ δ ci ,cj( ) ! Q = Qi i c ∑ = in∑ 2m ) tot∑ 2m ⎛ ⎝⎜ ⎞ ⎠⎟ 2 ⎡ ⎣ ⎢ ⎢ ⎤ ⎦ ⎥ ⎥i c ∑
  • 8.
    Incremental Algorithms JV(Streaming withRDD ) UMG(Streaming with Graph) "   Union & Modularity Greedy"   Join & Vote
  • 9.
    JV A B C C1C2 C2 D D D A B C D D D C1 C2 C2 D C2 join Vote incEdgeRDD stockCommunityRDD D C2
  • 10.
    UMG 1 -Union A B C1 C2 C3 C (C1 or C2) ?    newGraph = stockGraph.union(incGraph)" A B C D
  • 11.
    UMG 2 -findBestCommunity A B C D gain1=G(node(d), community(1)) gain2=G(node(d) , community(2)) C3 incVertexWithNeighbors = newGraph.mapReduceTriplets[Array[VertexData]] (collectNeighborFunc, _ ++ _,"# # # # #Some((incGraph.vertices, EdgeDirection.Either))) idCommunity = incVertexWithNeighbors.map {" case (vid, neighbors) => (vid, findBestCommunity(neighbors))" }.cache()" ! Ci =Cmax j G(nodei ,Cj ) ! ΔQ= in∑ +ki,in 2m + tot+ki∑ 2m ⎛ ⎝ ⎜ ⎞ ⎠ ⎟ 2 ⎡ ⎣ ⎢ ⎢ ⎤ ⎦ ⎥ ⎥ + in∑ 2m + tot∑ 2m ⎛ ⎝ ⎜ ⎞ ⎠ ⎟ 2 + ki 2m ⎛ ⎝ ⎜ ⎞ ⎠ ⎟ 2 ⎡ ⎣ ⎢ ⎢ ⎤ ⎦ ⎥ ⎥ C2 C1
  • 12.
    UMG 3 -updateCommunities A D B C newCommunityRdd = idCommunity.updateCommunities(commuitiyRdd)" " newModularity = newCommunityRdd.map(community=>community.modularity).reduce(_+_)" C1 C2 ! Q = Qi i c ∑ = in∑ 2m ) tot∑ 2m ⎛ ⎝⎜ ⎞ ⎠⎟ 2 ⎡ ⎣ ⎢ ⎢ ⎤ ⎦ ⎥ ⎥i c ∑ (Q1, Q2)
  • 13.
    edgeStreamRDD.foreachRDD { "   incEdgeRdd => {"    val incGraph  = buildIncGraph(incEdgeRdd) "    (communityInfoRDD, modularity) = streamingFU.trainOn(incGraph)" outputToHBase(communityInfoRDD)" outputToHBase(modularity)" edgeRdd "   }" } " Flow Example Code ssc.start()" ssc.awaitTermination()" val conf = new SparkConf().setMaster(……).setAppName(……)" val ssc = new StreamingContext(conf, Seconds(60))" " " val totalGraph = initGraph(totalEdgesRdd) " Val streamingFU = new StreamingFU().setTotalGraph(totalGraph)" " val onlineDataFlow = getDataFlow(ssc.sparkContext)" val edgeStreamRDD  = ssc.queueStream(onlineDataFlow, true) " "
  • 14.
  • 15.
    Autonomous Systems Graphs StanfordLarge Network Dataset Collection(as-733) https://snap.stanford.edu/data/
  • 16.
  • 17.
  • 18.
    Modularity Trend –OT Streaming Graph à Better Result
  • 19.
    Key Points "  Operator "   Merge Small graph into Large graph "   Model "   Local changes "   Index or summary "   Algorithm "   Delicate formula "   Commutative law & Associative law "   Parallelly & Incrementally
  • 20.
  • 21.
    Graph Union Operator GRAPH(H)GRAPH(G) ∪=  E F G H B C D E F A B C D E F A H G GRAPH(G U H) Graph Union Operator https://issues.apache.org/jira/browse/SPARK-7894" " [GraphX] Complex Operators between Graphs: Union https://github.com/apache/spark/pull/6685" "    newGraph = stockGraph.union(incGraph)"
  • 22.
    Complex GraphX Operators "  Union of Graphs ( G ∪ H ) "   Intersection of Graphs ( G ∩ H) "   Graph Join "   Difference of Graphs(G – H) "   Graph Complement "   Line Graph ( L(G) ) Issues:" Complex Operators between Graphs https://issues.apache.org/jira/browse/SPARK-7893"
  • 23.
  • 24.
    Monitoring and Correction Ω DataLoading Modularity Threshold CheckingStreaming-FU FastUnfolding [Hourly Monitoring] [Streaming] [Daily Running] FastUnfolding communityID  communityInfo  community1  (in1,tot1,degree1,modularity1)  ……  ……  mTime mValue timestamp1 totalModularity1 …… …… modularityTablecommRDDTable
  • 25.
    Streaming Resource Allocation • Driver-Memory: 20G •  Executors: 100 •  Core: 2 •  Executor-Memory: 20G Not Enough for Peak Period!
  • 26.
  • 27.
    Conclusion "   StreamingGraph "   Complex Operators will help "   Daily Rebuild & Threshold Check "   Costs more memory and time "   Open Question checkpoint with Streaming or Graph?
  • 28.
    Acknowledgements 1.  Limits ofcommunity detection " http://www.slideshare.net/vtraag/comm-detect 2.  Community Detection " http://www.traag.net/projects/community-detection/ 3.  Social Network Analysis " http://lorenzopaoliani.info/topics/ 4.  Community detection in complex networks using Extremal Optimization " http://arxiv.org/pdf/cond-mat/0501368.pdf
  • 29.
  • 30.
    Agenda "   DynamicCommunity Detection "   Streaming Graph "   Models and Algorithms "   Complex GraphX Operators "   Streaming Optimization "   Conclusion
  • 31.
    Static vs. Dynamic StaticModel Dynamic Model