Large Graph Processing
Jeffrey Xu Yu (于旭)
Department of Systems Engineering and
Engineering Management
The Chinese University of Hong Kong
yu@se.cuhk.edu.hk, http://www.se.cuhk.edu.hk/~yu
2
Social Networks
3
Social Networks
4
Facebook Social Network
• In 2011: 721 million users and 69 billion friendship links. The degree of separation is 4. ("Four Degrees of Separation" by Backstrom, Boldi, Rosa, Ugander, and Vigna, 2012)
5
The Scale/Growth of Social Networks
• Facebook statistics
  ◦ 829 million daily active users on average in June 2014
  ◦ 1.32 billion monthly active users as of June 30, 2014
  ◦ 81.7% of daily active users are outside the U.S. and Canada
  ◦ 22% increase in Facebook users from 2012 to 2013
• Facebook activities (every 20 minutes on Facebook)
  ◦ 1 million links shared
  ◦ 2 million friends requested
  ◦ 3 million messages sent
http://newsroom.fb.com/company-info/
http://www.statisticbrain.com/facebook-statistics/
6
The Scale/Growth of Social Networks
• Twitter statistics
  ◦ 271 million monthly active users in 2014
  ◦ 135,000 new users signing up every day
  ◦ 78% of Twitter active users are on mobile
  ◦ 77% of accounts are outside the U.S.
• Twitter activities
  ◦ 500 million Tweets are sent per day
  ◦ 9,100 Tweets are sent per second
https://about.twitter.com/company
http://www.statisticbrain.com/twitter-statistics/
7
Location Based Social Networks
8
Financial Networks
• "We borrow £1.7 trillion, but we're lending £1.8 trillion. Confused? Yes, inter-nation finance is complicated..."
9
US Social Commerce -- Statistics and Trends
10
Activities on Social Networks
• When all functions are integrated …
11
Graph Mining/Querying/Searching
• We have been working on many graph problems.
  ◦ Keyword search in databases
  ◦ Reachability query over large graphs
  ◦ Shortest path query over large graphs
  ◦ Large graph pattern matching
  ◦ Graph clustering
  ◦ Graph processing on Cloud
  ◦ …
12
Part I: Social Networks
13
Some Topics
• Ranking over trust networks
• Influence on social networks
• Influenceability estimation in social networks
• Random-walk domination
• Diversified ranking
• Top-k structural diversity search
14
Ranking over Trust Networks
15
Reputation-based Ranking
• Real rating systems (users and objects)
  ◦ Online shopping websites (Amazon) www.amazon.com
  ◦ Online product review websites (Epinions) www.epinions.com
  ◦ Paper review system (Microsoft CMT)
  ◦ Movie rating (IMDB)
  ◦ Video rating (Youtube)
16
The Bipartite Rating Network
• Two entities: users and objects
• Users can give ratings to objects
• If we take the average as the ranking score of an object, o1 and o3 are the top.
• If we consider the user's reputation, e.g., u4, …
[Figure: bipartite rating network; labels: Objects, Users, Ratings]
17
Reputation-based Ranking
• Two fundamental problems
  ◦ How to rank objects using the ratings?
  ◦ How to evaluate users' rating reputation?
• Algorithmic challenges
  ◦ Robustness
    ▪ Robust to spamming users
  ◦ Scalability
    ▪ Scalable to large networks
  ◦ Convergence
    ▪ Convergent to a unique and fixed ranking vector
18
Signed/Unsigned Trust Networks
• Signed Trust Social Networks (users): A user can express trust/distrust in others through a positive/negative trust score.
  ◦ Epinions (www.epinions.com)
  ◦ Slashdot (www.slashdot.org)
• Unsigned Trust Social Networks (users): A user can only express trust.
  ◦ Advogato (www.advogato.org)
  ◦ Kaitiaki (www.kaitiaki.org.nz)
• Unsigned Rating Networks (users and objects)
  ◦ Question-Answer systems
  ◦ Movie-rating systems (IMDB)
  ◦ Video rating systems in Youtube
19
The Trustworthiness of a User
• The final trustworthiness of a user is determined by how users trust each other in a global context, and is measured by bias.
• The bias of a user reflects the extent to which his/her opinions differ from those of others.
• If a user has zero bias, then his/her opinions are 100% unbiased and 100% taken.
• Such a user has high trustworthiness.
• The trustworthiness (the trust score) of a user is 1 − his/her bias score.
20
An Existing Approach
• MB [Mishra and Bhattacharya, WWW'11]
  ◦ The trustworthiness of a user computed by MB cannot be trusted, because MB measures the bias of a user by the relative differences between the user and others.
  ◦ If a user gives all his/her friends a much higher trust score than the average of others, and gives all his/her foes a much lower trust score than the average of others, such differences cancel out. This user then has zero bias and is treated as 100% trusted.
21
An Example
• Node 5 gives a trust score $W_{51} = 0.1$ to node 1. Node 2 and node 3 give a high trust score $W_{21} = W_{31} = 0.8$ to node 1.
• Node 5 is different from the others (biased): 0.1 − 0.8.
22
MB Approach
• The bias of a node $i$ is $b_i$.
• The prestige score of node $i$ is $r_i$.
• The iterative system is
[Equation: the MB iterative system for $b_i$ and $r_i$, shown as an image on the slide]
23
An Example
• Consider 5→1, 2→1, 3→1.
  ◦ The trust score difference is 0.1 − 0.8 = −0.7.
• Consider 2→3, 4→3, 5→3.
  ◦ The trust score difference is 0.9 − 0.2 = 0.7.
• The two differences cancel out, so node 5 has zero bias.
• The bias scores by MB.
24
Our Approach
• To address this, consider a contraction mapping.
• Given a metric space $X$ with a distance function $d(\cdot, \cdot)$.
• A mapping $T$ from $X$ to $X$ is a contraction mapping if there exists a constant $c$ with $0 \le c < 1$ such that $d(T(x), T(y)) \le c \cdot d(x, y)$ for all $x, y \in X$.
• Such a $T$ has a unique fixed point (when $X$ is complete, by the Banach fixed-point theorem).
25
Our Approach
• We use two vectors, $b$ and $r$, for bias and prestige.
• $b_j = (f(r))_j$ denotes the bias of node $j$, where $r$ is the prestige vector of the nodes and $f(r)$ is a vector-valued contractive function. $(f(r))_j$ denotes the $j$-th element of the vector $f(r)$.
• Let $0 \le f(r) \le e$, where $e = [1, 1, \ldots, 1]^T$.
• For any $x, y \in R^n$, the function $f: R^n \to R^n$ is a vector-valued contractive function if the following condition holds:
  $|f(x) - f(y)| \le \lambda \, \|x - y\|_\infty \, e$
  where $\lambda \in [0, 1)$, $\|\cdot\|_\infty$ denotes the infinity norm, and the inequality holds element-wise.
26
The Framework
• Use a vector-valued contractive function, which is a generalization of the contraction mapping in fixed point theory.
• MB is a special case in our framework.
• The iterative system converges to a unique fixed point (the prestige and bias vectors) at an exponential rate of convergence.
• We can handle both unsigned and signed trust social networks.
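The convergence claim can be illustrated with a toy contraction. This is only a sketch: the map `T(x) = 0.5x + 1` on the real line and the helper name `fixed_point` are illustrative assumptions, not the bias/prestige system of the framework. Because `T` is a contraction with c = 0.5, the iteration reaches the same fixed point from any start, and the error shrinks geometrically.

```python
def fixed_point(T, x0, tol=1e-12, max_iter=1000):
    """Iterate x <- T(x) until successive iterates differ by at most tol."""
    x = x0
    for _ in range(max_iter):
        nxt = T(x)
        if abs(nxt - x) <= tol:
            return nxt
        x = nxt
    return x


# Hypothetical contraction with constant c = 0.5; its unique fixed point is 2.
def T(x):
    return 0.5 * x + 1.0


from_zero = fixed_point(T, 0.0)
from_far = fixed_point(T, 100.0)
```

Both starting points end at the same value, which is the uniqueness property the framework relies on.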
27
Influence on Social Networks
28
Diffusion in Networks
• We care about the decisions made by friends and colleagues.
• Why imitate the behavior of others?
  ◦ Informational effects: the choices made by others can provide indirect information about what they know.
  ◦ Direct-benefit effects: there are direct payoffs from copying the decisions of others.
• Diffusion: how new behaviors, practices, opinions, conventions, and technologies spread through a social network.
29
A Real World Example
• Hotmail's viral climb to the top spot (1990s): 8 million users in 18 months!
• Far more effective than conventional advertising by rivals, and far cheaper too!
30
Stochastic Diffusion Model
• Consider a directed graph $G = (V, E)$.
• The diffusion of information (or influence) proceeds in discrete time steps, with time $t = 0, 1, \ldots$. Each node $v$ has two possible states, inactive and active.
• Let $S_t \subseteq V$ be the set of active nodes at time $t$ (the active set at time $t$). $S_0$ is the seed set (the seeds of influence diffusion).
• A stochastic diffusion model (with discrete time steps) for a social graph $G$ specifies the randomized process of generating the active sets $S_t$ for all $t \ge 1$ given the initial $S_0$.
• A progressive model is a model in which $S_{t-1} \subseteq S_t$ for all $t \ge 1$.
31
Influence Spread
• Let $\Phi(S_0)$ be the final active set (the eventually stable active set), where $S_0$ is the initial seed set.
• $\Phi(S_0)$ is a random set determined by the stochastic process of the diffusion model.
• The goal is to maximize the expected size of the final active set.
• Let $\mathbb{E}(X)$ denote the expected value of a random variable $X$.
• The influence spread of seed set $S_0$ is defined as $\sigma(S_0) = \mathbb{E}(|\Phi(S_0)|)$. Here the expectation is taken over all random events leading to $\Phi(S_0)$.
32
Independent Cascade Model (IC)
• IC takes $G = (V, E)$, the influence probabilities $p$ on all edges, and the initial seed set $S_0$ as input, and generates the active sets $S_t$ for all $t \ge 1$.
  ◦ At every time step $t \ge 1$, first set $S_t = S_{t-1}$.
  ◦ Next, for every inactive node $v \notin S_{t-1}$, each node $u \in N_{in}(v) \cap (S_{t-1} \setminus S_{t-2})$ executes an activation attempt with success probability $p(u, v)$ (with $S_{-1} = \emptyset$). If successful, $v$ is added into $S_t$, and we say that $u$ activates $v$ at time $t$. If multiple nodes activate $v$ at time $t$, the end effect is the same.
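The IC process above can be sketched as a short simulation. This is a minimal illustration, assuming the graph is given as a dictionary of out-edges with probabilities; the names `simulate_ic` and `out_edges` are hypothetical. The `frontier` set plays the role of $S_{t-1} \setminus S_{t-2}$: only nodes activated in the previous step get to make activation attempts.

```python
import random


def simulate_ic(out_edges, seeds, rng):
    """Run one IC diffusion. out_edges maps u -> list of (v, p_uv) pairs."""
    active = set(seeds)
    frontier = set(seeds)  # nodes activated in the previous step
    while frontier:
        newly = set()
        for u in frontier:
            for v, p in out_edges.get(u, []):
                # Each newly active u gets one attempt on each inactive out-neighbor.
                if v not in active and v not in newly and rng.random() < p:
                    newly.add(v)
        active |= newly
        frontier = newly
    return active


def estimate_spread(out_edges, seeds, runs, rng):
    """Monte-Carlo estimate of the influence spread (mean final active-set size)."""
    return sum(len(simulate_ic(out_edges, seeds, rng)) for _ in range(runs)) / runs
```

Averaging the final active-set size over many runs, as `estimate_spread` does, approximates $\sigma(S_0)$.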
33
An Example
34
Another Example
35
Influenceability Estimation in Social Networks
• Applications
  ◦ Influence maximization for viral marketing
  ◦ Influential nodes discovery
  ◦ Online advertisement
• The fundamental issue
  ◦ How to evaluate the influenceability of a given node in a social network?
36
Reconsider IC Model
• The independent cascade model.
  ◦ Each node has an independent probability to influence its neighbors.
  ◦ It can be modeled by a probabilistic graph, called an influence network, $G = (V, E, P)$.
  ◦ A possible graph $G_P = (V_P, E_P)$ has probability
    $\Pr[G_P] = \prod_{e \in E_P} p_e \prod_{e \in E \setminus E_P} (1 - p_e)$
• There are $2^{|E|}$ possible graphs ($\Omega$).
37
An Example
38
The Problem
• Independent cascade model.
  ◦ Given a possible graph $G_P = (V_P, E_P)$,
  ◦ $\Pr[G_P] = \prod_{e \in E_P} p_e \prod_{e \in E \setminus E_P} (1 - p_e)$
• Given a graph $G = (V, E, P)$ and a node $s$, estimate the expected number of nodes that are reachable from $s$.
  ◦ $F_s(G) = \sum_{G_P \in \Omega} \Pr[G_P] \, f_s(G_P)$, where $f_s(G_P)$ is the number of nodes that are reachable from the seed node $s$ in $G_P$.
39
Reduce the Variance
• The accuracy of an approximate algorithm is measured by the mean squared error $\mathbb{E}[(\hat{F}_s(G) - F_s(G))^2]$.
• By the variance-bias decomposition,
  $\mathbb{E}[(\hat{F}_s(G) - F_s(G))^2] = \mathrm{Var}(\hat{F}_s(G)) + \left(\mathbb{E}[\hat{F}_s(G) - F_s(G)]\right)^2$
  ◦ Make the estimator unbiased → the second term vanishes.
  ◦ Make the variance as small as possible.
40
Naïve Monte-Carlo (NMC)
• Sample $N$ possible graphs $G_1, G_2, \ldots, G_N$.
• For each sampled possible graph $G_i$, compute the number of nodes that are reachable from $s$.
• NMC estimator: the average number of reachable nodes over the $N$ possible graphs, $\hat{F}_{NMC} = \frac{1}{N} \sum_{i=1}^{N} f_s(G_i)$.
• $\hat{F}_{NMC}$ is an unbiased estimator of $F_s(G)$ since $\mathbb{E}[\hat{F}_{NMC}] = F_s(G)$.
• NMC is the only existing algorithm used in the influence maximization literature.
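The NMC estimator can be sketched in a few lines. This is an illustration under stated assumptions: nodes are integers `0..n-1`, the probabilistic graph is a list of `(u, v, p)` triples, and the function names are hypothetical. Following the range $[0, n-1]$ used for $f_s$ in this talk, the seed itself is not counted as reachable.

```python
import random
from collections import deque


def reachable_count(n, edges, s):
    """f_s: number of nodes other than s reachable from s in a deterministic graph."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
    seen = {s}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) - 1


def nmc_estimate(n, prob_edges, s, num_samples, rng):
    """Average f_s over num_samples independently sampled possible graphs."""
    total = 0
    for _ in range(num_samples):
        # Sample one possible graph: keep each edge independently with probability p.
        sampled = [(u, v) for u, v, p in prob_edges if rng.random() < p]
        total += reachable_count(n, sampled, s)
    return total / num_samples
```

Each sample costs one pass over the edges plus one BFS, so the estimator is linear in the graph size per sample.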
41
Naïve Monte-Carlo (NMC)
• NMC estimator: the average number of reachable nodes over the $N$ possible graphs, $\hat{F}_{NMC} = \frac{1}{N} \sum_{i=1}^{N} f_s(G_i)$.
• $\hat{F}_{NMC}$ is an unbiased estimator of $F_s(G)$ since $\mathbb{E}[\hat{F}_{NMC}] = F_s(G)$.
• The variance of NMC is
  $\mathrm{Var}(\hat{F}_{NMC}) = \frac{\mathbb{E}[f_s(G_P)^2] - (\mathbb{E}[f_s(G_P)])^2}{N} = \frac{\sum_{G_P \in \Omega} \Pr[G_P] \, f_s(G_P)^2 - F_s(G)^2}{N}$
• Computing the variance is extremely expensive, because it needs to enumerate all the possible graphs.
42
Naïve Monte-Carlo (NMC)
• In practice, one resorts to an unbiased estimator of $\mathrm{Var}(\hat{F}_{NMC})$.
• The variance of NMC is estimated by
  $\widehat{\mathrm{Var}}(\hat{F}_{NMC}) = \frac{\sum_{i=1}^{N} (f_s(G_i) - \hat{F}_{NMC})^2}{N - 1}$
• But $\mathrm{Var}(\hat{F}_{NMC})$ may be very large, because the values $f_s(G_i)$ fall into the interval $[0, n - 1]$.
• The variance can be up to $O(n^2)$.
43
Stratified Sampling
• Stratification divides a set of data items into subsets before sampling.
• A stratum is a subset.
• The strata should be mutually exclusive and should together include all data items in the set.
• Stratified sampling can be used to reduce variance.
44
A Recursive Estimator [Jin et al., VLDB'11]
• Randomly select 1 edge to partition the probability space (the set of all possible graphs) into 2 strata (2 subsets).
  ◦ The possible graphs in the first subset include the selected edge.
  ◦ The possible graphs in the second subset do not include the selected edge.
• Sample possible graphs in each stratum $i$ with a sample size $N_i$ proportional to the probability of that stratum.
• Recursively apply the same idea in each stratum.
45
A Recursive Estimator [Jin et al., VLDB'11]
• Advantages:
  ◦ An unbiased estimator with a smaller variance.
• Limitations:
  ◦ Only one edge is selected for stratification, which is not enough to significantly reduce the variance.
  ◦ Edges are selected randomly, which may result in a large variance.
46
More Effective Estimators
• Four stratified sampling (SS) estimators
  ◦ Type-I basic SS estimator (BSS-I)
  ◦ Type-I recursive SS estimator (RSS-I)
  ◦ Type-II basic SS estimator (BSS-II)
  ◦ Type-II recursive SS estimator (RSS-II)
• All are unbiased, and their variances are significantly smaller than the variance of NMC.
• The time and space complexity of all four are the same as those of NMC.
47
Type-I Basic Estimator (BSS-I)
• Select $r$ edges to partition the probability space (all the possible graphs) into $2^r$ strata.
• Each stratum corresponds to a probability subspace (a set of possible graphs).
• Let $\pi_i = \Pr[G_P \in \Omega_i]$.
• How to select the $r$ edges: BFS or random.
48
Type-I BSS-I Estimator
[Figure: BSS-I splits a sample of size $N$ across $2^r$ strata, with $N_1 = N\pi_1$ samples in the first stratum]
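The idea of stratifying on $r$ chosen edges with proportional allocation can be sketched as follows. This is an illustration in the spirit of BSS-I, not the authors' exact estimator: the function names, the `(u, v, p)` edge format, and the rule of giving every stratum at least one sample are assumptions. Each of the $2^r$ strata fixes the status of the chosen edges, gets about $N\pi_i$ samples, and contributes $\pi_i$ times its sample mean.

```python
import itertools
import random
from collections import deque


def reachable_count(n, edges, s):
    """Number of nodes other than s reachable from s in a deterministic graph."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
    seen = {s}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) - 1


def stratified_estimate(n, prob_edges, s, chosen, num_samples, rng):
    """chosen: indices (into prob_edges) of the r edges used for stratification."""
    chosen_set = set(chosen)
    rest = [e for i, e in enumerate(prob_edges) if i not in chosen_set]
    estimate = 0.0
    for status in itertools.product((0, 1), repeat=len(chosen)):
        pi = 1.0      # probability of this stratum
        fixed = []    # chosen edges that are present in this stratum
        for idx, present in zip(chosen, status):
            u, v, p = prob_edges[idx]
            pi *= p if present else 1.0 - p
            if present:
                fixed.append((u, v))
        if pi == 0.0:
            continue
        n_i = max(1, round(num_samples * pi))  # proportional allocation
        total = 0
        for _ in range(n_i):
            sampled = fixed + [(u, v) for u, v, p in rest if rng.random() < p]
            total += reachable_count(n, sampled, s)
        estimate += pi * total / n_i
    return estimate
```

When all edges are chosen for stratification, the strata are the individual possible graphs and the estimate is exact, which gives a convenient sanity check on small inputs.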
49
Type-I Recursive Estimator (RSS-I)
• Recursively apply BSS-I within each stratum until the sample size reaches a given threshold.
• RSS-I is unbiased, and its variance is smaller than that of BSS-I.
• Time and space complexity are the same as NMC.
[Figure: RSS-I applies BSS-I recursively; sample size $N$ at the top, $N_1 = N\pi_1$ in the first stratum]
50
Type-II Basic Estimator (BSS-II)
• Select $r$ edges to partition the probability space (all the possible graphs) into $r + 1$ strata.
• Similarly, each stratum corresponds to a probability subspace (a set of possible graphs).
• How to select the $r$ edges: BFS or random.
51
Type-II Estimators
[Figure: BSS-II and RSS-II; a sample of size $N$ is split across $r + 1$ strata, with $N_1 = N\pi_1$ in the first stratum]
52
Random-walk Domination
53
Social Browsing
• Social browsing: a process in which users in a social network find information along their social ties.
  ◦ Examples: the photo-sharing site Flickr, online advertisements
• Two issues:
  ◦ Problem-I: How to place items on $k$ users in a social network so that the other users can easily discover them by social browsing?
    ▪ Minimize the expected number of hops for every node to hit the target set.
  ◦ Problem-II: How to place items on $k$ users so that as many users as possible can discover them by social browsing?
    ▪ Maximize the expected number of nodes that hit the target set.
54
The Random Walk
• The two problems are random walk problems.
• $L$-length random walk model, where the path length of random walks is bounded by a nonnegative number $L$.
  ◦ A random walk in general can be considered as $L = \infty$.
• Let $Z_u^t$ be the position of an $L$-length random walk, starting from node $u$, at discrete time $t$.
• Let $T_{uv}^L$ be a random walk variable.
  ◦ $T_{uv}^L \triangleq \min\{\min\{t : Z_u^t = v, t \ge 0\}, L\}$
• The hitting time $h_{uv}^L$ can be defined as the expectation of $T_{uv}^L$.
  ◦ $h_{uv}^L = \mathbb{E}[T_{uv}^L]$
55
The Hitting Time
• Sarkar and Moore (UAI'07) define the hitting time of the $L$-length random walk in a recursive manner:
  $h_{uv}^L = \begin{cases} 0, & u = v \\ 1 + \sum_{w \in V} p_{uw} \, h_{wv}^{L-1}, & u \ne v \end{cases}$
• Our hitting time can be computed by this recursive procedure.
  ◦ Let $d_u$ be the degree of node $u$ and $N(u)$ be the set of neighbor nodes of $u$.
    ▪ $p_{uw} = 1/d_u$ is the transition probability for $w \in N(u)$, and $p_{uw} = 0$ otherwise.
56
The Random-Walk Domination
• Consider a set of nodes $S$. If an $L$-length random walk from $u$ reaches $S$, we say that $S$ dominates $u$ by an $L$-length random walk.
• Generalized hitting time over a set of nodes $S$: the hitting time $h_{uS}^L$ can be defined as the expectation of a random walk variable $T_{uS}^L$.
  ◦ $T_{uS}^L \triangleq \min\{\min\{t : Z_u^t \in S, t \ge 0\}, L\}$
  ◦ $h_{uS}^L = \mathbb{E}[T_{uS}^L]$
• It can be computed recursively.
  ◦ $h_{uS}^L = \begin{cases} 0, & u \in S \\ 1 + \sum_{w \in V} p_{uw} \, h_{wS}^{L-1}, & u \notin S \end{cases}$
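The recursion for the generalized hitting time can be evaluated bottom-up over the walk length. The sketch below assumes an undirected graph given as an adjacency dictionary in which every node has at least one neighbor, uniform transition probabilities $1/d_u$, and the base case $h^0 = 0$ (which follows from truncating at $L$); the function name is hypothetical. Passing a single-node target set gives the pairwise hitting time $h_{uv}^L$.

```python
def hitting_times(adj, targets, L):
    """Return h[u] = L-length hitting time from u to the target set.

    adj maps each node to its list of neighbors (assumed non-empty).
    """
    h = {u: 0.0 for u in adj}  # h^0 = 0 for every node
    for _ in range(L):
        # One more allowed step: apply the recursion to the previous table.
        h = {
            u: 0.0 if u in targets
            else 1.0 + sum(h[w] for w in adj[u]) / len(adj[u])
            for u in adj
        }
    return h
```

On the path 0 - 1 - 2 with target set {0} and L = 2, node 1 reaches the target in one step half of the time and is truncated at 2 otherwise, which the table reproduces.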
57
Problem-I
• How to place items on $k$ users in a social network so that the other users can easily discover them by social browsing?
• Minimize the total expected number of hops for every node to hit the target set.
[Equations: the Problem-I objective, given in two alternative forms as images on the slide]
58
Problem-II
• How to place items on $k$ users so that as many users as possible can discover them by social browsing? Maximize the expected number of nodes that hit the target set.
• Let $X_{uS}^L$ be an indicator random variable such that $X_{uS}^L = 1$ if $u$ hits any node in $S$ by an $L$-length random walk, and $X_{uS}^L = 0$ otherwise.
• Let $p_{uS}^L$ be the probability that an $L$-length random walk starting from $u$ hits a node in $S$.
• Then $\mathbb{E}[X_{uS}^L] = p_{uS}^L$.
• $p_{uS}^L = \begin{cases} 1, & u \in S \\ \sum_{w \in V} p_{uw} \, p_{wS}^{L-1}, & u \notin S \end{cases}$
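The hit probability follows the same bottom-up pattern as the hitting time. This sketch assumes an adjacency dictionary with no isolated nodes, uniform transitions, and the base case $p^0_{uS} = 1$ for $u \in S$ and 0 otherwise; the function names are hypothetical. Summing the probabilities over all nodes gives the expected number of nodes that hit $S$.

```python
def hit_probabilities(adj, targets, L):
    """Return p[u] = probability that an L-length walk from u hits the target set."""
    p = {u: 1.0 if u in targets else 0.0 for u in adj}  # walks of length 0
    for _ in range(L):
        p = {
            u: 1.0 if u in targets
            else sum(p[w] for w in adj[u]) / len(adj[u])
            for u in adj
        }
    return p


def expected_hitting_nodes(adj, targets, L):
    """Expected number of nodes whose L-length walk hits the target set."""
    return sum(hit_probabilities(adj, targets, L).values())
```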
59
Influence Maximization vs Problem II
• Influence maximization selects $k$ nodes to maximize the expected number of nodes that are reachable from the selected nodes.
  ◦ Independent cascade model
  ◦ The probabilities associated with the edges are independent.
  ◦ A target node can influence multiple immediate neighbors at a time.
• Problem II selects $k$ nodes to maximize the expected number of nodes that reach one of the selected nodes.
  ◦ $L$-length random walk model
60
Submodular Function Maximization
• Submodular set function maximization subject to a cardinality constraint is NP-hard.
  ◦ $\arg\max_{S \subseteq V} F(S)$ s.t. $|S| = K$
• The greedy algorithm
  ◦ It is a $(1 - 1/e)$-approximation algorithm.
  ◦ Linear time and space complexity w.r.t. the size of the graph.
• Submodularity: $F(S)$ is submodular and non-decreasing.
  ◦ Non-decreasing: $F(S) \le F(T)$ for $S \subseteq T \subseteq V$.
  ◦ Submodular: Let $g_j(S) = F(S \cup \{j\}) - F(S)$ be the marginal gain. Then $g_j(S) \ge g_j(T)$ for $j \in V \setminus T$ and $S \subseteq T \subseteq V$.
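The greedy algorithm can be sketched generically: in each of $K$ rounds, add the element with the largest marginal gain. The function name and the set-coverage objective used in the usage example are illustrative assumptions; coverage is a standard non-decreasing submodular function and stands in for the objectives of this talk. This plain version re-evaluates $F$ for every candidate and does not reflect the linear-time implementation mentioned above.

```python
def greedy_max(ground, F, k):
    """Pick k elements greedily by marginal gain F(S | {j}) - F(S)."""
    S = set()
    for _ in range(k):
        base = F(S)
        best, best_gain = None, float("-inf")
        for j in ground:
            if j in S:
                continue
            gain = F(S | {j}) - base
            if gain > best_gain:
                best, best_gain = j, gain
        if best is None:  # fewer than k elements available
            break
        S.add(best)
    return S


# Hypothetical coverage objective: F(S) = number of items covered by the chosen sets.
sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}


def coverage(S):
    return len(set().union(*(sets[j] for j in S)))


chosen = greedy_max(list(sets), coverage, 2)
```

Here the first round picks "c" (gain 4) and the second picks "a" (gain 3), covering all seven items.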
61
Submodular Function Maximization
• Submodular set function maximization subject to a cardinality constraint is NP-hard.
  ◦ $\arg\max_{S \subseteq V} F(S)$ s.t. $|S| = K$
• Both Problem-I and Problem-II use a submodular set function.
  ◦ Problem-I: $F_1(S) = nL - \sum_{u \in V \setminus S} h_{uS}^L$
  ◦ Problem-II: $F_2(S) = \sum_{w \in V} \mathbb{E}[X_{wS}^L] = \sum_{w \in V} p_{wS}^L$
62
The Algorithm
• Let $\sigma_u(S) = F(S \cup \{u\}) - F(S)$ be the marginal gain.
• It implies that dynamic programming (DP) is needed to compute the marginal gain.
63
Diversified Ranking
64
Diversified Ranking [Li et al., TKDE'13]
• Why diversified ranking?
  ◦ Diversity of information requirements
  ◦ Incomplete queries
65
Problem Statement
• The goal is to find $K$ nodes in a graph that are relevant to the query node and also dissimilar to each other.
• Main applications
  ◦ Ranking nodes in social networks, ranking papers, etc.
66
Challenges
• Diversity measures
  ◦ No widely accepted diversity measures on graphs in the literature.
• Scalability
  ◦ Most existing methods cannot scale to large graphs.
• Lack of intuitive interpretation.
67
Grasshopper/ManiRank
• The main idea
  ◦ Work in an iterative manner.
  ◦ Select a node at each iteration by a random walk.
  ◦ Set the selected node to be an absorbing node, and perform the random walk again to select the next node.
  ◦ Repeat the same process for $K$ iterations to get $K$ nodes.
• No diversity measure
  ◦ Diversity is achieved only by intuition and experiments.
• Cannot scale to large graphs (time complexity $O(Kn^2)$)
68
Grasshopper/ManiRank
• Initial random walk with no absorbing states
• Absorbing random walk after ranking the first item
69
Our Approach
• The main idea
  ◦ Relevance of the top-$K$ nodes (denoted by a set $S$) is achieved by large (Personalized) PageRank scores.
  ◦ Diversity of the top-$K$ nodes is achieved by a large expansion ratio.
• Expansion ratio of a set of nodes $S$: $\sigma(S) = |N(S)|/n$.
  ◦ A larger expansion ratio implies better diversity.
70
Submodular Function Maximization
• Submodular set function maximization subject to a cardinality constraint is NP-hard.
  ◦ $\arg\max_{S \subseteq V} F(S)$ s.t. $|S| = K$
• The greedy algorithm
  ◦ It is a $(1 - 1/e)$-approximation algorithm.
  ◦ Linear time and space complexity w.r.t. the size of the graph.
• Submodularity: $F(S)$ is submodular and non-decreasing.
  ◦ Non-decreasing: $F(S) \le F(T)$ for $S \subseteq T \subseteq V$.
  ◦ Submodular: Let $g_j(S) = F(S \cup \{j\}) - F(S)$ be the marginal gain. Then $g_j(S) \ge g_j(T)$ for $j \in V \setminus T$ and $S \subseteq T \subseteq V$.
71
Top-k Structural Diversity Search
72
Social Contagion
• Social contagion is a process of information (e.g., fads, news, opinions) diffusion in online social networks.
  ◦ In the traditional biological contagion model, the probability of being affected depends on the degree.
[Figure labels: Marketing, Opinions, Diffusion, Social Network]
73
Facebook Study [Ugander et al., PNAS'12]
• Case study: the process by which a user joins Facebook in response to an invitation email from an existing Facebook user.
• Social contagion is not like biological contagion.
74
Structural Diversity
• The structural diversity of an individual is the number of connected components in one's neighborhood.
• The problem: find the $k$ individuals with the highest structural diversity.
[Figure: connected components in the neighborhood of the "white center" node]
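The definition can be made concrete with a direct, unoptimized computation: for each node, count the connected components of the subgraph induced by its neighbors, then keep the $k$ largest. This is a baseline sketch assuming an undirected graph without self-loops stored as an adjacency dictionary; the function names are hypothetical, and it is not the search algorithm of the talk.

```python
import heapq


def structural_diversity(adj, v):
    """Number of connected components in the subgraph induced by v's neighbors."""
    nbrs = set(adj[v])
    seen = set()
    components = 0
    for start in nbrs:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                # Stay inside the neighborhood of v.
                if y in nbrs and y not in seen:
                    seen.add(y)
                    stack.append(y)
    return components


def top_k_structural_diversity(adj, k):
    """Return k nodes with the highest structural diversity."""
    return heapq.nlargest(k, adj, key=lambda v: structural_diversity(adj, v))
```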
75
Part II: I/O Efficiency
76
Big Data: The Volume
• Consider a dataset $D$ of 1 petabyte ($10^{15}$ bytes). A linear scan of $D$ takes 46 hours with the fastest solid-state drive (SSD), at a speed of 6 GB/s.
  ◦ PTIME queries do not always serve as a good yardstick for tractability ("Big Data with Preprocessing" by Fan et al., PVLDB'13).
• Consider a function $f(G)$. One possible way is to make $G$ small, giving $G'$, and find the answers from $G'$ as if they were answered by $G$: $f'(G') \approx f(G)$.
  ◦ There are many ways we can explore.
77
Big Data: The Volume
• Consider a function $f(G)$. One possible way is to make $G$ small, giving $G'$, and find the answers from $G'$ as if they were answered by $G$: $f'(G') \approx f(G)$.
• There are many ways we can explore.
• Make data simple and small
  ◦ Graph sampling, graph compression
  ◦ Graph sparsification, graph simplification
  ◦ Graph summary
  ◦ Graph clustering
  ◦ Graph views
78
More Work on Big Data
• We also believe that there are many things we need to do on Big Data.
• We are planning to explore many directions.
  ◦ Make data simple and small
    ▪ Graph sampling, graph simplification, graph summary, graph clustering, graph views.
  ◦ Explore different computing approaches
    ▪ Parallel computing, distributed computing, streaming computing, semi-external/external computing.
79
I/O Efficient Graph Computing
• I/O Efficient: Computing SCCs in Massive Graphs, by Zhiwei Zhang, Jeffrey Xu Yu, Lu Qin, Lijun Chang, and Xuemin Lin, SIGMOD'13.
• Contract & Expand: I/O Efficient SCCs Computing, by Zhiwei Zhang, Lu Qin, and Jeffrey Xu Yu.
• Divide & Conquer: I/O Efficient Depth-First Search, by Zhiwei Zhang, Jeffrey Xu Yu, Lu Qin, and Zechao Shang.
80
Reachability Query
• Two possible but infeasible solutions:
  ◦ Traverse $G(V, E)$ to answer a reachability query
    ▪ Low query performance: $O(|E|)$ query time
  ◦ Precompute and store the transitive closure
    ▪ Fast query processing
    ▪ Large storage requirement: $O(|V|^2)$
• The labeling approaches:
  ◦ Assign labels to nodes in a preprocessing step offline.
  ◦ Answer a query online using the assigned labels.
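The two baseline solutions can be contrasted in a few lines. The sketch assumes a directed graph stored as an adjacency dictionary; the function names are hypothetical. `reachable` is the online traversal (one BFS per query), while `transitive_closure` precomputes every reachable pair, trading quadratic storage for constant-time lookups.

```python
from collections import deque


def reachable(adj, s, t):
    """Online traversal: BFS from s, O(|V| + |E|) per query."""
    seen = {s}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            return True
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return False


def transitive_closure(adj):
    """Offline precomputation: the set of all reachable pairs (up to |V|^2 entries)."""
    nodes = set(adj) | {v for vs in adj.values() for v in vs}
    return {(s, t) for s in nodes for t in nodes if reachable(adj, s, t)}
```

Labeling approaches aim for the middle ground between these two extremes: a preprocessing step that stores far less than the full closure while still answering queries quickly.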
81
Make a Graph Small and Simple
• Any directed graph $G$ can be represented as a DAG (Directed Acyclic Graph), $G_D$, by taking every SCC (Strongly Connected Component) in $G$ as a node in $G_D$.
• An SCC of a directed graph $G = (V, E)$ is a maximal set of nodes $C \subseteq V$ such that for every pair of nodes $u$ and $v$ in $C$, $u$ and $v$ are reachable from each other.
[Figure: example directed graph on nodes A, B, C, D, F, G]
82
The Reachability Queries
• Reachability queries can be answered on the DAG.
[Figure: example directed graph on nodes A to I]
83
The Issue and the Challenge
• A massive directed graph $G$ needs to be converted into a DAG $G_D$ in order to process it efficiently, because
  ◦ $G$ cannot be held in main memory, and $G_D$ can be much smaller.
• Existing works assume that this conversion can be done.
• But the conversion itself needs a large main memory.
84
The Issue and the Challenge
• The dataset uk-2007
  ◦ Nodes: 105,896,555
  ◦ Edges: 3,738,733,648
  ◦ Average degree: 35
• Memory:
  ◦ 400 MB for nodes, and
  ◦ 28 GB for edges.
85
In Memory Algorithm?
• In-memory algorithm: scan $G$ twice
  ◦ DFS($G$) to obtain a decreasing order for each node of $G$,
  ◦ reverse every edge to obtain $G^T$, and
  ◦ DFS($G^T$) according to the same decreasing order to find all SCCs.
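The two-pass procedure above is the classic Kosaraju-style algorithm and can be sketched as follows. The sketch assumes integer node ids `0..n-1` and an edge list, and uses explicit stacks instead of recursion so that deep graphs do not hit Python's recursion limit; the function name is hypothetical.

```python
def kosaraju_scc(n, edges):
    """Return the SCCs of a directed graph as lists of node ids."""
    adj = [[] for _ in range(n)]
    radj = [[] for _ in range(n)]  # adjacency lists of the reversed graph G^T
    for u, v in edges:
        adj[u].append(v)
        radj[v].append(u)

    # First DFS on G: record nodes in increasing postorder (finish time).
    visited = [False] * n
    order = []
    for root in range(n):
        if visited[root]:
            continue
        visited[root] = True
        stack = [(root, 0)]
        while stack:
            u, i = stack.pop()
            if i < len(adj[u]):
                stack.append((u, i + 1))
                v = adj[u][i]
                if not visited[v]:
                    visited[v] = True
                    stack.append((v, 0))
            else:
                order.append(u)

    # Second DFS on G^T in decreasing postorder: each tree found is one SCC.
    comp = [-1] * n
    sccs = []
    for root in reversed(order):
        if comp[root] != -1:
            continue
        comp[root] = len(sccs)
        members = [root]
        stack = [root]
        while stack:
            u = stack.pop()
            for v in radj[u]:
                if comp[v] == -1:
                    comp[v] = len(sccs)
                    members.append(v)
                    stack.append(v)
        sccs.append(members)
    return sccs
```

Both passes follow edges in whatever order the DFS dictates, which is the source of the random I/Os once the graph no longer fits in memory.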
86
In Memory Algorithm?
• DFS($G$) to obtain a decreasing order for each node of $G$.
[Figure: example graph with nodes 1 to 9]
87
In Memory Algorithm?
• Reverse every edge to obtain $G^T$.
[Figure: the example graph with nodes 1 to 9 and its reverse $G^T$]
88
In Memory Algorithm?
• DFS($G^T$) according to the same decreasing order to find all SCCs. (A subtree, in black edges, forms an SCC.)
[Figure: DFS on $G^T$ over the example graph with nodes 1 to 9]
89
(Semi)-External Algorithms
• In-memory algorithm: scan $G$ twice.
• The in-memory algorithm cannot handle a large graph that cannot be held in memory.
  ◦ Why? No locality: a large number of random I/Os.
• Consider external algorithms and/or semi-external algorithms. Let $M$ be the size of main memory.
  ◦ External algorithm: $M < |G|$
  ◦ Semi-external algorithm: $k \cdot |V| \le M < |G|$
    ▪ It assumes that a tree can be held in memory.
90
Contraction Based External Algorithm (1)
• Load a subgraph and merge the SCCs in it in main memory in every iteration [Cosgaya-Lozano et al., SEA'09].
[Figure: example graph with nodes A to I; a subgraph on A, B, C, F, G loaded into main memory]
91
Contraction Based External Algorithm (2)
[Figure: example graph with nodes A to I; main memory after contraction]
92
Contraction Based External Algorithm (3)
• Cannot always find all SCCs!
• The contracted graph in main memory is a DAG and memory is full, so node "I" cannot be loaded into memory.
[Figure: example graph with nodes A to I; main memory holding the contracted DAG]
93
DFS Based Semi-External Algorithm
• Find a DFS-tree without forward-cross-edges [Sibeyn et al., SPAA'02].
• For a forward-cross-edge $(u, v)$, delete the old tree edge to $v$, and add $(u, v)$ as a new tree edge.
[Figure: DFS-tree on nodes 1 to 9; legend: tree-edge, forward-cross-edge, backward-edge, forward-edge, backward-cross-edge, deleted old tree edge, new tree edge]
94
DFS Based Approaches: Cost-1
• DFS-SCC uses sequential I/Os.
• DFS-SCC needs to traverse a graph $G$ twice using DFS to compute all SCCs.
• In each DFS, in the worst case it needs $depth(G) \cdot |E(V)|/B$ I/Os, where $B$ is the block size.
95
DFS Based Approaches: Cost-2
• Partial SCCs cannot be contracted to save space while constructing a DFS tree.
• Why?
  ◦ DFS-SCC needs to traverse a graph $G$ twice using DFS to compute all SCCs.
  ◦ DFS-SCC uses a total order of nodes (decreasing postorder) in the second DFS, which is computed in the first DFS.
  ◦ SCCs cannot be partially contracted in the first DFS.
  ◦ SCCs can be partially contracted in the second DFS, but we have to remember which nodes belong to which SCCs, using extra space. Not free!
96
DFS Based Approaches: Cost-3
• High CPU cost for reshaping a DFS-tree when it attempts to reduce the number of forward-cross-edges.
97
Our New Approach [SIGMOD'13]
• We propose a new two-phase algorithm, 2P-SCC:
  ◦ Tree-Construction and Tree-Search.
  ◦ In the Tree-Construction phase, we construct a tree-like structure.
  ◦ In the Tree-Search phase, we scan the graph only once.
• We further propose a new algorithm, 1P-SCC, which combines Tree-Construction and Tree-Search with new optimization techniques, using a tree.
  ◦ Early-Acceptance
  ◦ Early-Rejection
  ◦ Batch Edge Reduction
A joint work by Zhiwei Zhang, Jeffrey Xu Yu, Lu Qin, Lijun Chang, and Xuemin Lin
98
A New Weak Order
• The total order used in DFS-SCC is too strong, and there is no obvious relationship between the total order and the SCCs per se that could be used to reduce I/Os.
  ◦ The total order cannot help to reduce I/O costs.
• We introduce a new weak order.
• For an SCC, there must exist at least one cycle.
• While constructing a tree $T$ for $G$, a cycle will contain at least one edge $(u, v)$ that links to a higher-level node in $T$: $depth(u) > depth(v)$.
• There are two cases when $depth(u) > depth(v)$.
  ◦ A cycle: $v$ is an ancestor of $u$ in $T$.
  ◦ Not a cycle (up-edge): $v$ is not an ancestor of $u$ in $T$.
• We reduce the number of up-edges iteratively.
99
The Weak Order: drank
• Let $Rset(u, G, T)$ be the set of nodes including $u$ and the nodes that $u$ can reach by a tree $T$ of $G$.
• $depth(u, T)$: the length of the longest simple path from the root to $u$.
• $drank(u, T) = \min\{depth(v, T) \mid v \in Rset(u, G, T)\}$
  ◦ drank is used as the weak order!
• $dlink(u, T) = \arg\min_v \{depth(v, T) \mid v \in Rset(u, G, T)\}$
• Nodes do not need to have a unique order.
[Figure: example tree with nodes A to I]
➢ depth(B) = 1, drank(B) = 1, dlink(B) = B
➢ depth(E) = 3, drank(E) = 1, dlink(E) = B
100
2P-SCC
๏ฎ To reduce Cost-1, we use a BR+-tree to compute all
SCCs in the Tree-Construction phase. We compute all
SCCs by traversing ๐บ only once using the BR+-tree
constructed in the Tree-Search phase.
๏ฎ To reduce Cost-3, we have shown that we only need to
update the depth of nodes locally.
101
B
H
C D
G
E
A
IF
๏ฎ BR-Tree is a spanning
tree of G.
๏ฎ BR+-Tree is a BR-Tree
plus some additional
edges (๐‘ข, ๐‘ฃ) such that ๐‘ฃ
is an ancestor of ๐‘ข.
BR-Tree and BR+-Tree
๏ฑ In Memory: Black edges
102
B
H
C D
GE
A
IF
drank(I) = 1
drank(H) = 2
Up-edge
Tree-Construction: Up-edge
๏ฎ An edge (๐‘ข, ๐‘ฃ) is an up-edge on
the conditions:
๏ฑ ๐‘ฃ is not an ancestor of ๐‘ข in ๐‘‡
๏ฑ ๐‘‘๐‘Ÿ๐‘Ž๐‘›๐‘˜(๐‘ข, ๐‘‡) โ‰ฅ ๐‘‘๐‘Ÿ๐‘Ž๐‘›๐‘˜(๐‘ฃ, ๐‘‡)
๏ฎ Up-edges violate the existing
order
103
๏ฎ When there is an violate
up-edge, then
๏ฑ Modify T
๏ฎ Delete the old tree
edge
๏ฎ Set the up-edge as a
new tree edge
๏ฑ Graph Reconstruction
๏ฎ No I/O cost, low CPU
cost.
B
H
C D
GE
A
IF
Tree-Construction (Push-Down)
104
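The up-edge test and the push-down step from the two slides above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tree is assumed to be stored as a `parent` dictionary, the `drank` values are assumed to be given, and the names `is_ancestor`, `is_up_edge`, and `push_down` are hypothetical. The local update of depth and drank after a push-down is not shown.

```python
def is_ancestor(parent, a, b):
    """Return True if a is a proper ancestor of b in the tree stored in `parent`."""
    x = parent[b]
    while x is not None:
        if x == a:
            return True
        x = parent[x]
    return False


def is_up_edge(parent, drank, u, v):
    """(u, v) is an up-edge if v is not an ancestor of u and drank(u) >= drank(v)."""
    if u == v:
        return False  # assumption: self-loops are ignored
    return not is_ancestor(parent, v, u) and drank[u] >= drank[v]


def push_down(parent, u, v):
    """Delete the old tree edge into v and make the up-edge (u, v) the new tree edge."""
    parent[v] = u


# Hypothetical tree: A is the root, B and C are children of A, D is a child of C.
parent = {"A": None, "B": "A", "C": "A", "D": "C"}
drank = {"A": 0, "B": 1, "C": 1, "D": 2}  # assumed values, for illustration only

if is_up_edge(parent, drank, "B", "C"):
    push_down(parent, "B", "C")  # C (and its subtree) now hangs below B
```

After the push-down, the subtree rooted at C sits one level deeper, which is why the slides note that depths only need to be updated locally.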
Tree-Construction (Graph Reconstruction)
■ Tree edges and one extra edge in the BR+-Tree form a part of an SCC.
■ For an up-edge $(u, v)$, if $\mathit{dlink}(v, T)$ is an ancestor of $u$ in $T$, delete $(u, v)$ and add $(u, \mathit{dlink}(v, T))$.
■ In Tree-Search, scan the graph only once to find all SCCs, which reduces I/O costs.
[Figure: example tree on nodes A, B, D, E, F with an up-edge and the new edge that replaces it; drank(E) = 1, dlink(E) = B, drank(F) = 1]
105
Tree-Construction
■ When a BR+-tree is completely constructed, there are no up-edges.
■ There are only two kinds of edges in $G$.
  ❑ The BR+-tree edges, and
  ❑ The edges $(u, v)$ where $\mathit{drank}(u, T) < \mathit{drank}(v, T)$.
■ Such edges do not play any role in determining an SCC.
106
Tree-Search
In memory for each node $u$:
■ TreeEdge(u)
■ dlink(u)
■ drank(u)
In total: $3 \times |V|$
Search Procedure:
■ If an edge $(u, v)$ points to an ancestor, merge all nodes from $v$ to $u$ in the tree.
Only need to scan the graph once.
[Figure: example graph on nodes A to I]
107
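The search procedure above can be sketched as a single pass over the edge list that contracts a tree path whenever an edge closes it. This is only an illustration under assumptions: the tree is a `parent` dictionary, contraction is done with a small union-find (a choice made here, not stated on the slide), and `tree_search` is a hypothetical name. A single scan is only claimed to be sufficient when the tree is a completed BR+-tree.

```python
def tree_search(parent, edges):
    """Scan `edges` once; merge the tree path from v down to u when v is an ancestor of u."""
    rep = {node: node for node in parent}  # union-find representatives

    def find(x):
        while rep[x] != x:
            rep[x] = rep[rep[x]]  # path halving
            x = rep[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            continue
        # Climb from u towards the root, collecting contracted nodes on the way.
        path, x = [], ru
        while x is not None and x != rv:
            path.append(x)
            p = parent[x]
            x = find(p) if p is not None else None
        if x == rv:  # v is an ancestor of u, so the tree path plus (u, v) is a cycle
            for y in path:
                rep[y] = rv

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical tree A -> B -> C and A -> D, with the extra edges (C, A) and (B, D).
parent = {"A": None, "B": "A", "C": "B", "D": "A"}
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "D"), ("B", "D")]
components = tree_search(parent, edges)
```

Merging always goes into the ancestor's group, so each group's representative keeps a valid parent pointer and later climbs stay correct.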
From 2P-SCC To 1P-SCC
■ With 2P-SCC:
  ❑ In the Tree-Construction phase, we construct a tree by an approach similar to DFS-SCC, and
  ❑ In the Tree-Search phase, we scan the graph only once.
  ❑ The memory used for the BR+-tree is $3 \times |V|$.
■ With 1P-SCC: we combine Tree-Construction and Tree-Search with new optimization techniques to reduce Cost-2 and Cost-3:
  ❑ Early-Acceptance
  ❑ Early-Rejection
  ❑ Batch Edge Reduction
  ❑ Only need to use a BR-tree with memory of $2 \times |V|$.
108
Early-Acceptance and Early Rejection
■ Early acceptance: we contract a partial SCC into a node at an early stage while constructing a BR-tree.
■ Early rejection: we identify necessary conditions to remove nodes that will not participate in any SCC while constructing a BR-tree.
109
Early-Acceptance and Early Rejection
■ Consider an example.
■ The three nodes on the left can be contracted into a node on the right.
■ The node "a" and the subtrees, C and D, can be rejected.
110
Modify Tree + Early Acceptance
■ Memory: $2 \times |V|$
■ Reduce I/O Cost
[Figure: example graph on nodes A to K; two up-edges are handled by modifying the tree, and two contractions are labelled Early-Acceptance]
111
DFS-Based Approach vs. Our Approach
DFS-based approach:
■ I/O cost for DFS is high
  ❑ Uses a total order
■ Cannot merge SCCs when they are found early
  ❑ The total order cannot be changed. Large number of I/Os.
■ Cannot prune non-SCC nodes
  ❑ The total order cannot be changed
Our approach:
■ Smaller I/O cost
  ❑ Uses a weaker order
■ Merges SCCs as early as possible
  ❑ Merges nodes with the same order. Small size, small number of I/Os.
■ Prunes non-SCC nodes as early as possible
  ❑ The weaker order is flexible
112
Optimization: Batch Edge Reduction
■ With 1P-SCC, CPU cost is still high.
  ❑ To determine whether $(u, v)$ is a backward-edge/up-edge, it needs to check the ancestor relationship between $u$ and $v$ over a tree.
■ The tree is frequently updated.
■ The average depth of nodes in the tree becomes larger with frequent push-down operations.
113
Optimization: Batch Edge Reduction
■ When memory can hold more edges, there is no need to contract partial SCCs edge by edge.
■ Find all SCCs in main memory at the same time:
  ❑ Read as many edges as fit into memory.
  ❑ Construct a graph from the edges of the tree already constructed in memory plus the edges newly read into memory.
  ❑ Construct a DAG in memory using an existing in-memory algorithm, which finds all SCCs in memory.
  ❑ Reconstruct the BR-Tree according to the DAG.
114
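The batch step above needs some in-memory SCC algorithm; the slide does not say which one. As a sketch, the block below uses an iterative version of Kosaraju's algorithm (a substitution made here for illustration) on the union of the tree edges and a newly read batch, then contracts each SCC to obtain the DAG. The names `sccs_in_memory` and `contract` and the sample edges are hypothetical.

```python
from collections import defaultdict


def sccs_in_memory(edges):
    """Kosaraju's algorithm (iterative): map every node to a representative of its SCC."""
    fwd, rev, nodes = defaultdict(list), defaultdict(list), set()
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)
        nodes.update((u, v))

    # Pass 1: list nodes by DFS finishing time on the forward graph.
    order, seen = [], set()
    for start in sorted(nodes):
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(fwd[start]))]
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(fwd[nxt])))
                    break
            else:  # all successors handled: node is finished
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing finishing time.
    comp = {}
    for start in reversed(order):
        if start in comp:
            continue
        comp[start] = start
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in rev[node]:
                if nxt not in comp:
                    comp[nxt] = start
                    stack.append(nxt)
    return comp


def contract(edges, comp):
    """Edges of the DAG obtained by collapsing each SCC to its representative."""
    return sorted({(comp[u], comp[v]) for u, v in edges if comp[u] != comp[v]})


tree_edges = [("A", "B"), ("B", "C"), ("A", "D")]  # hypothetical tree already in memory
batch = [("C", "A"), ("D", "E")]                   # hypothetical edges newly read from disk
comp = sccs_in_memory(tree_edges + batch)
dag_edges = contract(tree_edges + batch, comp)
```

Rebuilding the BR-tree from the contracted DAG is left out; the point is only that one in-memory pass replaces many edge-by-edge contractions.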
Performance Studies
■ Implemented using Visual C++ 2005
■ Tested on a PC with an Intel Core2 Quad 2.66GHz CPU and 3.43GB memory running Windows XP
■ Disk block size: 64KB
■ Memory size: $3 \times |V(G)| \times 4\text{B} + 64\text{KB}$
115
Four Real Data Sets
Data set          |V|           |E|             Average Degree
cit-patent        3,774,768     16,518,947      4.70
go-uniprot        6,967,956     34,770,235      4.99
citeseerx         6,540,399     15,011,259      2.30
WEBSPAM-UK2007    105,896,555   3,738,733,568   35.00
116
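As a worked example, the memory-size formula from the experimental setup, 3 × |V(G)| × 4B + 64KB, can be evaluated for the node counts in the table above. The sketch assumes 64KB means 65,536 bytes; the function name `memory_budget_bytes` and the dictionary keys are hypothetical.

```python
BLOCK_BYTES = 64 * 1024  # assumption: 64KB is taken as 65,536 bytes


def memory_budget_bytes(num_nodes, words_per_node=3, word_bytes=4):
    """Evaluate 3 x |V| x 4B + 64KB for a graph with `num_nodes` nodes."""
    return words_per_node * num_nodes * word_bytes + BLOCK_BYTES


# Node counts copied from the data set table.
node_counts = {
    "cit-patent": 3_774_768,
    "go-uniprot": 6_967_956,
    "citeseerx": 6_540_399,
    "webspam-uk": 105_896_555,
}

for name, n in node_counts.items():
    print(f"{name}: {memory_budget_bytes(n) / 2**20:.1f} MiB")
```

For the largest graph this comes to 1,270,824,196 bytes, a little under 1.2 GiB.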
Synthetic Data Sets
Parameter               Range         Default
Node Size               30M - 70M     30M
Average Degree          3 - 7         5
Size of Massive SCCs    200K - 600K   400K
Size of Large SCCs      4K - 12K      8K
Size of Small SCCs      20 - 60       40
# of Massive SCCs       1             1
# of Large SCCs         30 - 70       50
# of Small SCCs         6K - 14K      10K
■ We construct a graph $G$ by (1) randomly selecting all nodes in SCCs first, (2) adding edges among the nodes in an SCC until all nodes form an SCC, and (3) randomly adding nodes/edges to the graph.
117
Performance Studies
                    1PB-SCC   1P-SCC   2P-SCC    DFS-SCC
cit-patent (s)      24s       22s      701s      840s
go-uniprot (s)      22s       21s      301s      856s
citeseerx (s)       10s       8s       517s      669s
cit-patent (I/O)    16,031    13,331   133,467   667,530
go-uniprot (I/O)    26,034    47,947   471,540   619,969
citeseerx (I/O)     15,472    13,482   104,611   392,659
118
WEBSPAM-UK2007: Vary Node Size
119
WEBSPAM-UK2007: Vary Memory
120
Synthetic Data Sets: Vary SCC Sizes
121
Synthetic Data Sets: Vary # of SCCs
122
From Semi-External to External
■ Existing semi-external solutions work under the condition that a tree can be held in main memory, $k \cdot |V| \le |M|$, and they generate a large number of I/Os.
■ We study an external algorithm by removing the condition $k \cdot |V| \le |M|$.
123
A joint work by Zhiwei Zhang, Qin Lu, and Jeffrey Yu
The New Approach: The Main Idea
■ DFS-based approaches generate random accesses.
■ The contraction-based semi-external approach reduces $|V|$ and $|E|$ together at the same time.
  ❑ Cannot find all SCCs.
■ The main idea of our external algorithm:
  ❑ Work on a small graph $G'$ of $G$ by reducing $V$, because $M$ can be small.
  ❑ Find all SCCs in $G'$.
  ❑ Add removed nodes back to find SCCs in $G$.
124
The New Approach: The Property
■ Reducing the given graph
  ❑ $G(V, E) \rightarrow G'(V', E')$, $|V'| < |V|$.
  ❑ If $u$ can reach $v$ in $G$, $u$ can also reach $v$ in $G'$.
  ❑ Maintaining this property may generate a large number of random I/O accesses.
    ■ Reason: several nodes on the path from $u$ to $v$ may have been removed from $G$ in previous iterations.
125
The New Approach: The Approach
■ We introduce a new Contraction & Expansion approach.
  ❑ Contraction phase:
    ■ Reduce nodes iteratively, $G_1, G_2, \ldots, G_l$.
      ❑ It decreases $|V(G_i)|$, but may increase $|E(G_i)|$.
  ❑ Expansion phase:
    ■ In the reverse order of the contraction phase, $G_l, G_{l-1}, \ldots, G_1$.
      ❑ Find all SCCs in $G_l$ using a semi-external algorithm.
        ▪ The semi-external algorithm deals with edges.
      ❑ Expand $G_i$ back to $G_{i-1}$.
126
The Contraction
■ In the Contraction phase, graphs $G_1, G_2, \ldots, G_l$ are generated.
■ $G_{i+1}$ is generated by removing a batch of nodes from $G_i$.
■ Contraction stops once $k \cdot |V| < |M|$, when a semi-external approach can be applied.
[Figure: contraction sequence G1, G2, G3]
127
The Expansion
■ In the Expansion phase, removed nodes are added back.
■ Nodes are added in the reverse order of their removal in the contraction phase.
[Figure: expansion sequence over G1, G2, G3]
128
The Contraction Phase
■ Compared with $G_i$, $G_{i+1}$ should have the following properties:
  ❑ Contractable:
    ■ $|V(G_{i+1})| < |V(G_i)|$
  ❑ SCC-Preservable:
    ■ $\mathit{SCC}(u, G_i) = \mathit{SCC}(v, G_i) \iff \mathit{SCC}(u, G_{i+1}) = \mathit{SCC}(v, G_{i+1})$
  ❑ Recoverable:
    ■ $v \in V_i - V_{i+1} \iff \mathit{neighbour}(v, G_i) \subseteq G_{i+1}$
129
Contract $V_{i+1}$
■ Recoverable:
  ❑ $v \in V_i - V_{i+1} \iff \mathit{neighbour}(v, G_i) \subseteq G_{i+1}$
■ $G_{i+1}$ is recoverable if and only if $V_{i+1}$ is a vertex cover of $G_i$.
■ Under this condition, we can determine which SCCs the nodes in $G_i$ belong to by scanning $G_i$ once.
■ For each edge, we select the node with a higher degree or a higher order.
130
Contract $V_{i+1}$
For each edge, we select the node with a higher degree or a higher order.
[Figure: example graph on nodes a to i]
Edge list on DISK:
ID1  ID2  Deg1  Deg2
a    b    3     3
a    d    3     4
b    c    3     2
c    d    2     4
d    e    4     4
d    g    4     4
e    b    4     3
e    g    4     4
f    g    2     4
g    h    4     2
h    i    2     2
i    f    2     2
131
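The selection rule on this slide can be sketched as one pass over the edge list that keeps, for every edge, the endpoint with the higher degree. Two assumptions are made here: degrees are recomputed from the twelve listed edges rather than read from the Deg columns, and "higher order" is interpreted as the larger node id when degrees tie. The name `select_cover` is hypothetical.

```python
from collections import Counter

# The twelve edges listed in the table above.
EDGES = [("a", "b"), ("a", "d"), ("b", "c"), ("c", "d"), ("d", "e"), ("d", "g"),
         ("e", "b"), ("e", "g"), ("f", "g"), ("g", "h"), ("h", "i"), ("i", "f")]


def select_cover(edges):
    """Pick one endpoint per edge: higher degree first, larger id on ties."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    cover = set()
    for u, v in edges:
        cover.add(max(u, v, key=lambda x: (degree[x], x)))
    return cover


kept = select_cover(EDGES)                      # candidate V_{i+1}
removed = {x for e in EDGES for x in e} - kept  # V_i - V_{i+1}
```

Because one endpoint of every edge is kept, the result is a vertex cover by construction, which is the recoverable condition stated earlier.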
Construct $E_{i+1}$
■ SCC-Preservable:
  ❑ $\mathit{SCC}(u, G_i) = \mathit{SCC}(v, G_i) \iff \mathit{SCC}(u, G_{i+1}) = \mathit{SCC}(v, G_{i+1})$
■ If $v \in V_i - V_{i+1}$, remove $(v_{in}, v)$ and $(v, v_{out})$ and add $(v_{in}, v_{out})$.
■ Although $|E|$ may become larger, $|V|$ is sure to become smaller.
■ A smaller $|V|$ implies that a semi-external approach can be applied.
132
Construct $E_{i+1}$
If $v \in V_i - V_{i+1}$, remove $(v_{in}, v)$ and $(v, v_{out})$ and add $(v_{in}, v_{out})$.
[Figure: example graph on nodes a to i]
Edge lists on DISK:
Existing Edges
ID1  ID2
d    e
d    g
e    b
e    g
New Edges
ID1  ID2
e    d
b    d
i    g
g    i
133
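The edge construction can be sketched as follows: edges between kept nodes survive, and each removed node is bypassed by connecting every in-neighbour to every out-neighbour. The sketch assumes the kept set is a vertex cover (so no edge joins two removed nodes) and uses only the twelve edges listed on the earlier slide; `contract_edges` is a hypothetical name. On these twelve edges it yields three bypass edges; the pair (e, d) in the slide's table is not produced by these edges alone.

```python
EDGES = [("a", "b"), ("a", "d"), ("b", "c"), ("c", "d"), ("d", "e"), ("d", "g"),
         ("e", "b"), ("e", "g"), ("f", "g"), ("g", "h"), ("h", "i"), ("i", "f")]
KEPT = {"b", "d", "e", "g", "i"}  # assumed V_{i+1}, a vertex cover of the edges above


def contract_edges(edges, kept):
    """Return (existing edges among kept nodes, new bypass edges around removed nodes)."""
    existing, ins, outs = set(), {}, {}
    for u, v in edges:
        if u in kept and v in kept:
            existing.add((u, v))
        elif v not in kept:              # edge into a removed node v
            ins.setdefault(v, set()).add(u)
        else:                            # edge out of a removed node u
            outs.setdefault(u, set()).add(v)
    new = set()
    for x in ins.keys() & outs.keys():   # removed nodes with both in- and out-edges
        new |= {(a, b) for a in ins[x] for b in outs[x] if a != b}
    return existing, new - existing


existing_edges, new_edges = contract_edges(EDGES, KEPT)
```

A removed node with m in-neighbours and n out-neighbours contributes up to m × n new edges, which is why |E| may grow while |V| shrinks.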
The Expansion Phase
■ $\mathit{SCC}(u, G_i) = \mathit{SCC}(v, G_i) = \mathit{SCC}(w, G_i)$ $(u, w \in V_{i+1})$ $\iff$ $u \rightarrow v$ and $v \rightarrow w$ in $G_i$
■ For any node $v \in V_i - V_{i+1}$, $\mathit{SCC}(v, G_i)$ can be computed using $\mathit{neighbour}_{in}(v, G_i)$ and $\mathit{neighbour}_{out}(v, G_i)$ only.
[Figure: three nodes a, b, c]
134
Expansion Phase
$\mathit{SCC}(u, G_i) = \mathit{SCC}(v, G_i) = \mathit{SCC}(w, G_i)$ $(u, w \in V_{i+1})$ $\iff$ $u \rightarrow v$ and $v \rightarrow w$ in $G_i$
[Figure: example graph on nodes a to i]
Edge list on DISK:
ID1  ID2
a    b
a    d
b    c
c    d
d    e
d    g
e    b
e    g
f    g
g    h
h    i
i    f
135
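The expansion rule can be sketched as one scan of the edge list: a removed node joins an SCC exactly when some in-neighbour and some out-neighbour carry that SCC's label, and otherwise forms an SCC on its own. The SCC labels for the kept nodes are assumed to be already known (for example from a semi-external algorithm run on the contracted graph); the label values, the sample data, and the name `expand` are hypothetical.

```python
from collections import defaultdict

EDGES = [("a", "b"), ("a", "d"), ("b", "c"), ("c", "d"), ("d", "e"), ("d", "g"),
         ("e", "b"), ("e", "g"), ("f", "g"), ("g", "h"), ("h", "i"), ("i", "f")]
# Assumed SCC labels of the kept nodes in the contracted graph.
SCC_KEPT = {"b": "b", "d": "b", "e": "b", "g": "g", "i": "g"}


def expand(edges, scc_kept):
    """Label every removed node using only its in- and out-neighbours."""
    in_labels, out_labels, removed = defaultdict(set), defaultdict(set), set()
    for u, v in edges:
        if u in scc_kept and v not in scc_kept:
            in_labels[v].add(scc_kept[u])    # kept in-neighbour of removed node v
            removed.add(v)
        elif u not in scc_kept and v in scc_kept:
            out_labels[u].add(scc_kept[v])   # kept out-neighbour of removed node u
            removed.add(u)
    scc = dict(scc_kept)
    for x in removed:
        common = in_labels[x] & out_labels[x]
        # x is on a cycle through an SCC only if that SCC both reaches x and is reached from x.
        scc[x] = min(common) if common else x
    return scc


scc_all = expand(EDGES, SCC_KEPT)
```

The scan relies on the vertex-cover condition: every neighbour of a removed node is a kept node, so its label is always available.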
Performance Studies
■ Implemented using Visual C++ 2005
■ Tested on a PC with an Intel Core2 Quad 2.66GHz CPU and 3.5GB memory running Windows XP
■ Disk block size: 64KB
■ Default memory size: 400M
136
Data Set
■ Real data set
Data set          |V|           |E|             Average Degree
WEBSPAM-UK2007    105,896,555   3,738,733,568   35.00
■ Synthetic data
Parameter         Range
Node Size         25M - 100M
Average Degree    2 - 6
Size of SCCs      20 - 600K
Number of SCCs    1 - 14K
137
Performance Studies
Vary Memory Size
138
DFS [SIGMOD'15]
■ Given a graph $G(V, E)$, depth-first search is to search $G$ following the depth-first order.
[Figure: example graph on nodes A to J]
A joint work by Zhiwei Zhang, Jeffrey Yu, Qin Lu, and Zechao Shang
139
The Challenge
■ We need to run DFS on a massive directed graph $G$, but $G$ may not fit entirely in main memory.
■ Our work keeps only the nodes in memory, which takes much less space.
140
The Issue and the Challenge (1)
■ Consider all edges from $u$, such as $(u, v_1), (u, v_2), \ldots, (u, v_p)$. Suppose DFS searches from $u$ to $v_1$. It is hard to estimate when it will visit $v_i$ ($2 \le i \le p$).
■ It is hard to know when C/D will be visited even though they are near A and B.
■ It is hard to design the format of the graph on disk.
[Figure: example graph on nodes A to E]
141
The Issue and the Challenge (2)
■ A small part of the graph can change the DFS a lot.
■ Even if almost the entire graph can be kept in memory, it still costs a lot to find the DFS.
■ (E, D) will change the existing DFS significantly.
■ A large number of iterations is needed even if the memory keeps a large portion of the graph.
[Figure: example graph on nodes A to G]
142
Problem Statement
■ We study a semi-external algorithm that computes a DFS-Tree, from which a DFS can be obtained.
■ The limited memory: $k|V| \le M \le |G|$
  ■ $k$ is a small constant number.
  ■ $|G| = |V| + |E|$
143
DFS-Tree & Edge Type
■ A DFS of $G$ forms a DFS-Tree.
■ A DFS procedure can be obtained from a DFS-Tree.
[Figure: example graph on nodes A to J]
144
DFS-Tree & Edge Type
■ Given a spanning tree $T$, there exist 4 types of non-tree edges.
[Figure: example graph on nodes A to J; legend: Forward Edge, Backward Edge, Forward-cross Edge, Backward-cross Edge]
145
DFS-Tree & Edge Type
■ An ordered spanning tree is a DFS-Tree if it does not have any forward-cross edges.
[Figure: example graph on nodes A to J; legend: Forward Edge, Backward Edge, Forward-cross Edge, Backward-cross Edge]
146
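The four edge types and the DFS-Tree test can be sketched with preorder and postorder numbers of the ordered spanning tree. The classification rule used here is the usual interval test and is an interpretation of the slides: an edge between unrelated nodes is taken to be forward-cross when its source comes earlier in preorder than its target. The tree is assumed to be stored as ordered child lists, node labels are assumed not to be `None`, self-loops are not considered, and all names are hypothetical.

```python
def intervals(children, root):
    """Preorder/postorder clocks from walking the ordered tree."""
    pre, post, clock = {root: 0}, {}, 1
    stack = [(root, iter(children.get(root, [])))]
    while stack:
        node, it = stack[-1]
        child = next(it, None)
        if child is None:                 # all children done: close the interval
            post[node] = clock
            stack.pop()
        else:
            pre[child] = clock
            stack.append((child, iter(children.get(child, []))))
        clock += 1
    return pre, post


def classify(u, v, pre, post):
    """Type of edge (u, v) with respect to the ordered tree."""
    if pre[u] < pre[v] and post[v] < post[u]:
        return "forward"                  # u is an ancestor of v
    if pre[v] < pre[u] and post[u] < post[v]:
        return "backward"                 # v is an ancestor of u
    return "forward-cross" if pre[u] < pre[v] else "backward-cross"


def is_dfs_tree(children, root, edges):
    """An ordered spanning tree is a DFS-Tree if no edge is forward-cross."""
    pre, post = intervals(children, root)
    return all(classify(u, v, pre, post) != "forward-cross" for u, v in edges)


# Hypothetical graph: tree edges A->B, A->C, B->D plus the non-tree edge D->C.
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("D", "C")]
bad_order = {"A": ["B", "C"], "B": ["D"]}   # D is visited before C: (D, C) is forward-cross
good_order = {"A": ["C", "B"], "B": ["D"]}  # C is visited first: (D, C) is backward-cross
```

The example shows that the same edge set can pass or fail depending only on the order of the children, which is why the tree has to be an ordered tree.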
Existing Solutions
■ Iteratively remove the forward-cross edges.
■ Procedure:
  ❑ If there exists a forward-cross edge
    ■ Construct a new $T$ by conducting DFS over the graph in memory
147
Existing Solutions
■ Construct a new $T$ by conducting DFS over the graph in memory until no forward-cross edges exist.
[Figure: example graph on nodes A to J with a forward-cross edge marked]
148
The Drawbacks
■ D-1: A total order in $V(G)$ needs to be maintained in the whole process.
■ D-2: A large number of I/Os is produced.
  ❑ Need to scan all edges in every iteration.
■ D-3: A large number of iterations is needed.
  ❑ The possibility of grouping the edges near each other in DFS is not considered.
149
Why Divide & Conquer
■ We aim at dividing the graph into several subgraphs $G_1, G_2, \ldots, G_p$ with possible overlaps among them.
■ Goal: the DFS-Tree for $G$ can be computed from the DFS-Trees for all $G_i$.
■ A Divide & Conquer approach can overcome the existing drawbacks.
150
Why Divide & Conquer
■ To address D-1
  ❑ A total order in $V(G)$ needs to be maintained in the whole process.
■ After dividing the graph $G$ into $G_0, G_1, \ldots, G_p$, we only need to maintain the total order in $V(G_i)$.
151
Why Divide & Conquer
■ To address D-2
  ❑ A large number of I/Os is produced.
  ❑ It needs to scan all edges in each iteration.
■ After dividing the graph $G$ into $G_0, G_1, \ldots, G_p$, we only need to scan the edges in $G_i$ to eliminate forward-cross edges.
152
Why Divide & Conquer
■ To address D-3
  ❑ A large number of iterations is needed.
  ❑ It cannot group together the edges that are near each other in the DFS visiting sequence.
■ After dividing the graph $G$ into $G_0, G_1, \ldots, G_p$, the DFS procedure can be applied to each $G_i$ independently.
153
Valid Division
[Figure: two panels, each showing nodes A to F divided into $G_1$ and $G_2$. The left is not a DFS-tree; the right is a DFS-tree.]
154
Invalid Division
■ An example:
[Figure: nodes A to F divided into $G_1$ and $G_2$]
No matter how the DFS-Trees for $G_1$ and $G_2$ are ordered, the merged tree cannot be a DFS-Tree for $G$.
155
How to Cut: Challenges
■ Challenge-1: it is not easy to check whether a division is valid.
  ❑ Need to make sure that the DFS-Tree for one divided subgraph will not affect the DFS-Trees of the others.
■ Challenge-2: finding a good division is non-trivial.
  ❑ The edge types between different subgraphs are complicated.
■ Challenge-3: the merge procedure needs to make sure that the result is the DFS-Tree for $G$.
156
Our New Approach
■ To address Challenge-1:
  ❑ Compute a light-weight summary graph (S-graph).
  ❑ Check whether a division is valid by searching the S-graph.
■ To address Challenge-2:
  ❑ Recursively divide & conquer.
■ To address Challenge-3:
  ❑ The DFS-Tree for $G$ is computed only from the $T_i$ and the S-graph.
157
Four Division Properties
■ Node-Coverage: $\bigcup_{1 \le i \le p} V(G_i) = V(G)$
■ Contractible: $|V(G_i)| < |V(G)|$
■ Independence: any pair of nodes in $V(T_i) \cap V(T_j)$ are consistent.
  ❑ $T_i$ and $T_j$ can be dealt with independently ($T_i$ and $T_j$ are DFS-Trees for $G_i$ and $G_j$)
■ DFS-Preservable: there exists a DFS-Tree $T$ for graph $G$ such that $V(T) = \bigcup_{1 \le i \le p} V(T_i)$ and $E(T) = \bigcup_{1 \le i \le p} E(T_i)$
  ❑ The DFS-Tree for $G$ can be computed from the $T_i$
158
DFS-Preservable Property
■ The DFS-Tree for $G$ can be computed from the $T_i$.
■ DFS*-Tree: a spanning tree with the same edge set as a DFS-Tree (without order).
■ Suppose the independence property is satisfied. Then the DFS-preservable property is satisfied if and only if the spanning tree $T$ with $V(T) = \bigcup_{1 \le i \le p} V(T_i)$ and $E(T) = \bigcup_{1 \le i \le p} E(T_i)$ is a DFS*-Tree.
159
Independence Property
■ Any pair of nodes in $V(T_i) \cap V(T_j)$ are consistent ($T_i$ and $T_j$ are DFS-Trees for $G_i$ and $G_j$).
  ❑ $T_i$ and $T_j$ can be dealt with independently.
  ❑ This may not hold: $u$ is an ancestor of $v$ in $T_i$, but a sibling of $v$ in $T_j$.
■ Theorem:
  ❑ Given a division $G_0, G_1, \ldots, G_p$ of $G$, the independence property is satisfied if and only if for any subgraphs $G_i$ and $G_j$, $E(G_i) \cap E(G_j) = \emptyset$.
160
Independence Property
[Figure: example graph on nodes A to F divided into $G_1$, $G_2$, and $G_3$]
161
DFS-Preservable Example
■ The DFS-preservable property is not satisfied.
■ The DFS-Tree for $G$ does not exist given the DFS-Tree for each subgraph.
■ Forward-cross edges always exist.
[Figure: example graph on nodes A to G]
162
Our Approach
■ Root-based division: independence is satisfied.
  ❑ Each $G_i$ has a spanning tree $T_i$.
  ❑ For a division $G_0, G_1, \ldots, G_p$, $G_0 \cap G_i = r_i$.
  ❑ $r_i$ is the root of $T_i$ and a leaf of $T_0$.
[Figure: $G_0$ with subgraphs $G_i$ and $G_j$]
163
Our Approach
■ We expand $G_0$ to capture the relationship between different $G_i$ and call it the S-graph.
■ The S-graph is used to check whether the current division is valid (i.e. the DFS-preservable property is satisfied).
[Figure: $G_0$ with subgraphs $G_i$ and $G_j$; the expanded $G_0$ is labelled S-graph]
164
S-edge
■ S-edge: given a spanning tree $T$ of $G$, $(u', v')$ is the S-edge of $(u, v)$ if
  ❑ $u'$ is an ancestor of $u$ and $v'$ is an ancestor of $v$ in $T$,
  ❑ Both $u'$ and $v'$ are children of $\mathit{LCA}(u, v)$, where $\mathit{LCA}(u, v)$ is the lowest common ancestor of $u$ and $v$ in $T$.
165
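The S-edge definition can be sketched by walking both endpoints up to the root and discarding the shared ancestors; the last node left on each path is then the child of the LCA on that side. The sketch assumes the tree is a `parent` dictionary, treats a node as an ancestor of itself when it is already a child of the LCA, and returns `None` when one endpoint is an ancestor of the other (so the edge is not a cross-edge). The names `path_to_root` and `s_edge` and the sample tree are hypothetical.

```python
def path_to_root(parent, x):
    """Return [x, parent(x), ..., root]."""
    path = [x]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def s_edge(parent, u, v):
    """Return (u', v'): the children of LCA(u, v) above u and v, or None for non-cross edges."""
    up_u, up_v = path_to_root(parent, u), path_to_root(parent, v)
    # Drop the common ancestors, starting from the root end of both paths.
    while up_u and up_v and up_u[-1] == up_v[-1]:
        up_u.pop()
        up_v.pop()
    if not up_u or not up_v:
        return None  # u is an ancestor of v, v is an ancestor of u, or u == v
    return up_u[-1], up_v[-1]


# Hypothetical tree: A is the root; B, C below A; D, E below B; F below C; G below D.
parent = {"A": None, "B": "A", "C": "A", "D": "B", "E": "B", "F": "C", "G": "D"}
```

For example, the cross-edge (G, F) has LCA A, so its S-edge is (B, C), while (G, E) has LCA B and S-edge (D, E).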
S-edge Example
[Figure: example tree on nodes A to K with $G_0$ marked; legend: Cross edge, S-edge]
166
S-graph
■ For a division $G_0, G_1, \ldots, G_p$ where $T_0$ is the DFS-Tree for $G_0$, the S-graph is constructed as follows:
  ❑ Remove all backward and forward edges w.r.t. $T_0$.
  ❑ Replace all cross-edges $(u, v)$ with their corresponding S-edge if the S-edge is between nodes in $G_0$.
  ❑ For an edge $(u, v)$, if $u \in G_i$ and $v \in G_0$, add edge $(r_i, v)$, and do the same for $v$.
167
S-graph Example
[Figure: example tree on nodes A to K with $G_0$ marked; legend: Cross edge, S-edge, link $(r_i, v)$]
168
S-graph Example
[Figure: example tree on nodes A to K with $G_0$ and the resulting S-graph marked; legend: Cross edge, S-edge, link $(r_i, v)$]
169
Division Theorem
■ Consider a division $G_0, G_1, \ldots, G_p$ and suppose $T_0$ is the DFS-Tree for $G_0$. The division is DFS-preservable if and only if the S-graph is a DAG.
170
Divide-Star Algorithm
■ Divide $G$ according to the children of the root $R$ of $G$.
■ If the corresponding S-graph is a DAG, each subgraph can be computed independently.
■ Deal with strongly connected components:
  ❑ Modify $T$: add a virtual node $R_S$ representing an SCC $S$.
  ❑ Modify the S-graph:
    ■ For any edge $(u, v)$ in the S-graph, if $u \notin S$ and $v \in S$, add edge $(u, R_S)$. Do the same for $v$.
    ■ Remove all nodes in $S$ and the corresponding edges.
  ❑ Modify the division: create a new tree $T'$ rooted at the virtual root $R_S$ and connect it to the roots in the SCC.
171
Divide-Star Algorithm
[Figure: S-graph on nodes A to K with an SCC marked; a virtual root DF is added]
172
Divide-Star Algorithm
[Figure: S-graph on nodes A to K with the virtual node DF; the S-graph is a DAG]
173
Divide-Star Algorithm
[Figure: the graph is divided into 4 parts, $G_0$, $G_1$, $G_2$, and $G_3$, using the virtual node DF]
174
Divide-TD Algorithm
■ The Divide-Star algorithm divides the graph according to the children of the root.
  ❑ The depth of $T_0$ is 1.
  ❑ The number of subgraphs after dividing is at most the number of children.
■ The Divide-TD algorithm enlarges $T_0$ and the corresponding S-graph.
  ❑ It can result in more subgraphs than Divide-Star can provide.
175
Divide-TD Algorithm
■ The Divide-TD algorithm enlarges $T_0$ to a Cut-Tree.
■ Cut-Tree: given a tree $T$ with root $t_0$, a cut-tree $T_c$ is a subtree of $T$ which satisfies two conditions.
  ❑ The root of $T_c$ is $t_0$.
  ❑ For any node $v \in T$ with child nodes $v_1, v_2, \ldots, v_k$, if $v \in T_c$, then either $v$ is a leaf node of $T_c$ or all of its child nodes $v_1, v_2, \ldots, v_k$ are in $T_c$.
■ With such conditions, for any S-edge $(u, v)$, only two situations exist.
  ❑ $u, v \in T_c$
  ❑ $u, v \notin T_c$
176
Cut-Tree Construction
■ Given a tree $T$ with root $r_0$.
■ Initially $T_c$ contains only the root $r_0$.
■ Iteratively pick a leaf node $v$ in $T_c$ and add all the child nodes of $v$ in $T$.
■ The process stops when the memory cannot hold $T_c$ after adding the next nodes.
177
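The construction above can be sketched as a loop that expands one leaf of the cut-tree at a time, always adding all of that leaf's children together. Two details are assumptions made for this sketch, since the slide leaves them open: the memory limit is modelled as a maximum node count `max_nodes`, and leaves are expanded in first-in, first-out order. The name `build_cut_tree` and the sample tree are hypothetical.

```python
from collections import deque


def build_cut_tree(children, root, max_nodes):
    """Grow a cut-tree from `root`; a node is either unexpanded or has all its children."""
    cut = [root]
    frontier = deque([root])        # leaves of the cut-tree not yet expanded
    while frontier:
        kids = children.get(frontier[0], [])
        if len(cut) + len(kids) > max_nodes:
            break                   # the next batch of children would not fit
        frontier.popleft()
        cut.extend(kids)
        frontier.extend(kids)
    return cut


# Hypothetical tree: A has children B, C; B has children D, E, F; C has child G.
children = {"A": ["B", "C"], "B": ["D", "E", "F"], "C": ["G"]}
small = build_cut_tree(children, "A", max_nodes=4)
large = build_cut_tree(children, "A", max_nodes=6)
```

Adding children only as a complete group is what keeps the two cut-tree conditions true at every step.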
Divide-TD Algorithm
[Figure: example tree on nodes A to K with the Cut-Tree $T_c$ marked]
178
Divide-TD Algorithm
[Figure: example tree on nodes A to K with the Cut-Tree $T_c$ and an SCC marked; a virtual node DF is added]
179
Divide-TD Algorithm
[Figure: the S-graph is a DAG; the graph is divided into 5 parts, $G_0$, $G_1$, $G_2$, $G_3$, and $G_4$, using the virtual node DF]
180
Merge Algorithm
■ According to the properties, if the DFS-Trees for the subgraphs are $T_0, T_1, \ldots, T_p$, there exists a DFS-Tree $T$ with $V(T) = \bigcup_{1 \le i \le p} V(T_i)$ and $E(T) = \bigcup_{1 \le i \le p} E(T_i)$.
■ We only need to organize the $T_i$ in the merged tree such that the resulting tree $T$ is a DFS-Tree.
■ Since the S-graph is a DAG in the division procedure, we can topologically sort it and organize the $T_i$ according to the topological order.
■ Remove each virtual node $v$ and add edges from the parent of $v$ to the children of $v$.
■ It can be proven that the resulting tree is a DFS-Tree.
181
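The ordering step can be sketched with a standard topological sort (Kahn's algorithm, chosen here for illustration) that also reports a cycle, which by the Division Theorem means the division is not DFS-preservable. How the topological order maps to a visiting order is an interpretation: the sketch assumes an S-edge $(a, b)$ keeps the direction of the underlying cross edge, so $b$ has to be visited before $a$ to avoid a forward-cross edge, and the visiting order is therefore the reverse of the topological order. All names and the sample S-graph are hypothetical.

```python
from collections import defaultdict, deque


def topological_order(nodes, edges):
    """Kahn's algorithm; raises ValueError if the graph has a cycle."""
    indegree = {n: 0 for n in nodes}
    successors = defaultdict(list)
    for a, b in edges:
        successors[a].append(b)
        indegree[b] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                ready.append(m)
    if len(order) != len(nodes):
        raise ValueError("S-graph has a cycle: the division is not DFS-preservable")
    return order


def visiting_order(nodes, s_edges):
    """Order in which subtrees are visited so that every S-edge points to an earlier subtree."""
    return list(reversed(topological_order(nodes, s_edges)))


# Hypothetical S-graph over three subtree roots.
roots = ["B", "C", "D"]
s_edges = [("B", "C"), ("D", "C")]
order = visiting_order(roots, s_edges)
```

With this order, both S-edges point from a later-visited subtree to an earlier one, so neither becomes a forward-cross edge in the merged tree.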
Merge Algorithm
[Figure: $G_0$ with subgraphs $G_1$, $G_2$, $G_3$ and the virtual node DF. Topologically sort $G_0$, remove the S-edges, and find the DFS-Tree.]
182
Merge Algorithm
[Figure: trees $T_0$, $T_1$, $T_2$, $T_3$ with the virtual node DF. Merge the trees according to the order.]
183
Merge Algorithm
[Figure: merged result of $T_0$, $T_1$, $T_2$, and $T_3$ on nodes A to K]
184
Performance Studies
■ Implemented using Visual C++ 2010
■ Tested on a PC with an Intel Core2 Quad 2.66GHz CPU and 4GB memory running Windows 7 Enterprise
■ Disk block size: 64KB
185
Four Real Data Sets
Data set          |V|           |E|             Average Degree
Wikilinks         25,942,246    601,038,301     23.16
Arabic-2005       22,744,080    639,999,458     28.14
Twitter-2010      41,652,230    1,468,365,182   35.25
WEBGRAPH-UK2007   105,895,908   3,738,733,568   35.00
186
Web-graph Results
■ Memory size: 2GB
■ Varying node size percentage
187
Conclusion
■ We study I/O-efficient DFS algorithms for a large graph.
■ We analyze the drawbacks of the existing semi-external DFS algorithm.
■ We discuss the challenges and four properties in order to find a divide & conquer approach.
■ Based on the properties, we design two novel graph division algorithms and a merge algorithm to reduce the cost of running DFS on the graph.
■ We have conducted extensive performance studies to confirm the efficiency of our algorithms.
188
Some Concluding Remarks
■ We also believe that there are many things we need to do on large graphs or big graphs.
■ We know what we have known on graph processing.
■ We do not know yet what we do not know on graph processing.
■ We need to explore many directions such as
  ■ parallel computing
  ■ distributed computing
  ■ streaming computing
  ■ semi-external/external computing.
189
I/O Cost Minimization
■ If there does not exist a node $u$ for $v$ such that $\mathit{SCC}(u, G_i) = \mathit{SCC}(v, G_i)$, $v$ can be removed from $G_{i+1}$.
■ For a node $v$, if $\mathit{neighbour}(v, G_i) \subseteq V_{i+1}$, $v$ can be removed from $V_{i+1}$.
■ The I/O complexity is $O(\mathit{sort}(|V_i|) + \mathit{sort}(|E_i|) + \mathit{scan}(|E_i|))$
190
Another Example
■ Keep tree structure edges in memory.
■ Only the depth of reachable nodes matters, not their exact positions.
■ Early-acceptance: merging SCCs partially whenever possible does not affect the order of others.
■ Early-rejection: prune non-SCC nodes when possible.
  ❑ Prune the node "A".
[Figure: example graph on nodes A to I; one edge is annotated "This edge makes all nodes in a partial SCC the same order."]
  ❑ In Memory: Black edges
  ❑ On Disk: Red edges
191
Optimization: Early Acceptance
■ No need to remember $\mathit{dlink}(u, T)$.
■ Merge nodes of the same order when an edge $(u, v)$ is found, where $v$ is an ancestor of $u$ in $T$.
■ Smaller graph size, smaller I/O cost
Memory: $2 \times |V|$
[Figure: example graph on nodes A to K; two up-edges are handled by modifying the tree, and two contractions are labelled Early-Acceptance]
192
Performance Studies
Vary Degree
193

More Related Content

What's hot

On building more human query answering systems
On building more human query answering systemsOn building more human query answering systems
On building more human query answering systems
INRIA-OAK
ย 
๏ฟผMarkov Chain and Classification of Difficulty Levels Enhances the Learning P...
๏ฟผMarkov Chain and Classification of Difficulty Levels Enhances the Learning P...๏ฟผMarkov Chain and Classification of Difficulty Levels Enhances the Learning P...
๏ฟผMarkov Chain and Classification of Difficulty Levels Enhances the Learning P...
Educational Technology
ย 
07 Statistical approaches to randomization
07 Statistical approaches to randomization07 Statistical approaches to randomization
07 Statistical approaches to randomization
dnac
ย 
Sarjinder singh
Sarjinder singhSarjinder singh
Sarjinder singhHina Aslam
ย 
[CS570] Machine Learning Team Project (I know what items really are)
[CS570] Machine Learning Team Project (I know what items really are)[CS570] Machine Learning Team Project (I know what items really are)
[CS570] Machine Learning Team Project (I know what items really are)
Kunwoo Park
ย 
๏ฟผHandling Missing Attributes using Matrix Factorization ๏ฟผ
๏ฟผHandling Missing Attributes using Matrix Factorization ๏ฟผ๏ฟผHandling Missing Attributes using Matrix Factorization ๏ฟผ
๏ฟผHandling Missing Attributes using Matrix Factorization ๏ฟผ
CS, NcState
ย 
Collaborative Filtering with Spark
Collaborative Filtering with SparkCollaborative Filtering with Spark
Collaborative Filtering with Spark
Chris Johnson
ย 
Wavelet, Wavelet Image Compression, STW, SPIHT, MATLAB
Wavelet, Wavelet Image Compression, STW, SPIHT, MATLABWavelet, Wavelet Image Compression, STW, SPIHT, MATLAB
Wavelet, Wavelet Image Compression, STW, SPIHT, MATLAB
Manish Tiwari
ย 
IRJET- Deep Neural Network based Mechanism to Compute Depression in Socia...
IRJET-  	  Deep Neural Network based Mechanism to Compute Depression in Socia...IRJET-  	  Deep Neural Network based Mechanism to Compute Depression in Socia...
IRJET- Deep Neural Network based Mechanism to Compute Depression in Socia...
IRJET Journal
ย 
DIE 20130724
DIE 20130724DIE 20130724
DIE 20130724
Tokyo Tech
ย 
08 Statistical Models for Nets I, cross-section
08 Statistical Models for Nets I, cross-section08 Statistical Models for Nets I, cross-section
08 Statistical Models for Nets I, cross-section
Duke Network Analysis Center
ย 
Erik Bernhardsson, CTO, Better Mortgage
Erik Bernhardsson, CTO, Better MortgageErik Bernhardsson, CTO, Better Mortgage
Erik Bernhardsson, CTO, Better Mortgage
MLconf
ย 
Ensemble Learning in Recommender Systems: Combining Multiple User Interaction...
Ensemble Learning in Recommender Systems: Combining Multiple User Interaction...Ensemble Learning in Recommender Systems: Combining Multiple User Interaction...
Ensemble Learning in Recommender Systems: Combining Multiple User Interaction...
Arthur Fortes
ย 
Churn prediction in mobile social games towards a complete assessment using ...
Churn prediction in mobile social games  towards a complete assessment using ...Churn prediction in mobile social games  towards a complete assessment using ...
Churn prediction in mobile social games towards a complete assessment using ...
Alain Saas
ย 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
DataWorks Summit/Hadoop Summit
ย 
Introduction to Linear Discriminant Analysis
Introduction to Linear Discriminant AnalysisIntroduction to Linear Discriminant Analysis
Introduction to Linear Discriminant Analysis
Jaclyn Kokx
ย 
Introduction to Topological Data Analysis
Introduction to Topological Data AnalysisIntroduction to Topological Data Analysis
Introduction to Topological Data Analysis
Mason Porter
ย 
Topological Data Analysis of Complex Spatial Systems
Topological Data Analysis of Complex Spatial SystemsTopological Data Analysis of Complex Spatial Systems
Topological Data Analysis of Complex Spatial Systems
Mason Porter
ย 
Exploring social in๏ฌ‚uence via posterior effect of word of-mouth
Exploring social in๏ฌ‚uence via posterior effect of word of-mouthExploring social in๏ฌ‚uence via posterior effect of word of-mouth
Exploring social in๏ฌ‚uence via posterior effect of word of-mouthmoresmile
ย 
22 An Introduction to Stochastic Actor-Oriented Models (SAOM or Siena)
22 An Introduction to Stochastic Actor-Oriented Models (SAOM or Siena)22 An Introduction to Stochastic Actor-Oriented Models (SAOM or Siena)
22 An Introduction to Stochastic Actor-Oriented Models (SAOM or Siena)
Duke Network Analysis Center
ย 

What's hot (20)

On building more human query answering systems
On building more human query answering systemsOn building more human query answering systems
On building more human query answering systems
ย 
๏ฟผMarkov Chain and Classification of Difficulty Levels Enhances the Learning P...
๏ฟผMarkov Chain and Classification of Difficulty Levels Enhances the Learning P...๏ฟผMarkov Chain and Classification of Difficulty Levels Enhances the Learning P...
๏ฟผMarkov Chain and Classification of Difficulty Levels Enhances the Learning P...
ย 
07 Statistical approaches to randomization
07 Statistical approaches to randomization07 Statistical approaches to randomization
07 Statistical approaches to randomization
ย 
Sarjinder singh
Sarjinder singhSarjinder singh
Sarjinder singh
ย 
[CS570] Machine Learning Team Project (I know what items really are)
[CS570] Machine Learning Team Project (I know what items really are)[CS570] Machine Learning Team Project (I know what items really are)
[CS570] Machine Learning Team Project (I know what items really are)
ย 
๏ฟผHandling Missing Attributes using Matrix Factorization ๏ฟผ
๏ฟผHandling Missing Attributes using Matrix Factorization ๏ฟผ๏ฟผHandling Missing Attributes using Matrix Factorization ๏ฟผ
๏ฟผHandling Missing Attributes using Matrix Factorization ๏ฟผ
ย 

Jeffrey Xu Yu: Large Graph Processing

  • 1. Large Graph Processing
      Jeffrey Xu Yu (于旭)
      Department of Systems Engineering and Engineering Management
      The Chinese University of Hong Kong
      yu@se.cuhk.edu.hk, http://www.se.cuhk.edu.hk/~yu
  • 5. Facebook Social Network
      - In 2011, 721 million users, 69 billion friendship links. The degree of separation is 4. ("Four Degrees of Separation" by Backstrom, Boldi, Rosa, Ugander, and Vigna, 2012)
  • 6. The Scale/Growth of Social Networks
      - Facebook statistics
        - 829 million daily active users on average in June 2014
        - 1.32 billion monthly active users as of June 30, 2014
        - 81.7% of daily active users are outside the U.S. and Canada
        - 22% increase in Facebook users from 2012 to 2013
      - Facebook activities (every 20 minutes on Facebook)
        - 1 million links shared
        - 2 million friends requested
        - 3 million messages sent
      Sources: http://newsroom.fb.com/company-info/, http://www.statisticbrain.com/facebook-statistics/
  • 7. The Scale/Growth of Social Networks
      - Twitter statistics
        - 271 million monthly active users in 2014
        - 135,000 new users signing up every day
        - 78% of Twitter active users are on mobile
        - 77% of accounts are outside the U.S.
      - Twitter activities
        - 500 million Tweets are sent per day
        - 9,100 Tweets are sent per second
      Sources: https://about.twitter.com/company, http://www.statisticbrain.com/twitter-statistics/
  • 9. Financial Networks
      - "We borrow £1.7 trillion, but we're lending £1.8 trillion. Confused? Yes, inter-nation finance is complicated..."
  • 10. US Social Commerce -- Statistics and Trends
  • 11. Activities on Social Networks
      - When all functions are integrated ...
  • 12. Graph Mining/Querying/Searching
      - We have been working on many graph problems.
        - Keyword search in databases
        - Reachability query over large graphs
        - Shortest path query over large graphs
        - Large graph pattern matching
        - Graph clustering
        - Graph processing on Cloud
        - ...
  • 13. Part I: Social Networks 13
  • 14. Some Topics ๏ฎ Ranking over trust networks ๏ฎ Influence on social networks ๏ฎ Influenceability estimation in Social Networks ๏ฎ Random-walk domination ๏ฎ Diversified ranking ๏ฎ Top-k structural diversity search 14
  • 15. Ranking over Trust Networks 15
  • 16. Reputation-based Ranking: Real rating systems (users and objects): online shopping websites (Amazon, www.amazon.com); online product review websites (Epinions, www.epinions.com); paper review systems (Microsoft CMT); movie rating (IMDB); video rating (Youtube).
  • 17. The Bipartite Rating Network: There are two entities, users and objects. Users can give ratings to objects. If we take the average as the ranking score of an object, o1 and o3 are the top. If we consider the user's reputation, e.g., u4, … [Figure: bipartite graph of objects, users, and ratings.]
  • 18. Reputation-based Ranking: Two fundamental problems: How to rank objects using the ratings? How to evaluate users' rating reputation? Algorithmic challenges: robustness (robust to spamming users); scalability (scalable to large networks); convergence (convergent to a unique and fixed ranking vector).
  • 19. Signed/Unsigned Trust Networks: Signed trust social networks (users): a user can express trust/distrust in others by a positive/negative trust score, e.g., Epinions (www.epinions.com) and Slashdot (www.slashdot.org). Unsigned trust social networks (users): a user can only express trust, e.g., Advogato (www.advogato.org) and Kaitiaki (www.kaitiaki.org.nz). Unsigned rating networks (users and objects): question-answer systems; movie-rating systems (IMDB); video rating systems in Youtube.
  • 20. The Trustworthiness of a User: The final trustworthiness of a user is determined by how users trust each other in a global context and is measured by bias. The bias of a user reflects the extent to which his/her opinions differ from others. If a user has zero bias, then his/her opinions are 100% unbiased and 100% taken; such a user has high trustworthiness. The trustworthiness (the trust score) of a user is 1 minus his/her bias score.
  • 21. An Existing Approach: MB [Mishra and Bhattacharya, WWW'11]. The trustworthiness of a user cannot be trusted, because MB treats the bias of a user by relative differences between itself and others. If a user gives all his/her friends a much higher trust score than the average of others, and gives all his/her foes a much lower trust score than the average of others, such differences cancel out. This user has zero bias and can be 100% trusted.
  • 22. An Example: Node 5 gives a trust score $W_{51} = 0.1$ to node 1. Node 2 and node 3 give a high trust score $W_{21} = W_{31} = 0.8$ to node 1. Node 5 is different from others (biased): 0.1 − 0.8.
  • 23. MB Approach: The bias of a node $i$ is $b_i$. The prestige score of node $i$ is $r_i$. The iterative system is [equation not captured in this transcript].
  • 24. An Example: Consider 5→1, 2→1, 3→1: the trust-score difference is 0.1 − 0.8 = −0.7. Consider 2→3, 4→3, 5→3: the trust-score difference is 0.9 − 0.2 = 0.7. The two differences cancel out, so node 5 has zero bias. [Figure: the bias scores by MB.]
  • 25. Our Approach: To address it, consider a contraction mapping. Given a metric space $X$ with a distance function $d()$, a mapping $T$ from $X$ to $X$ is a contraction mapping if there exists a constant $c$ with $0 \le c < 1$ such that $d(T(x), T(y)) \le c \times d(x, y)$. Then $T$ has a unique fixed point (by the Banach fixed-point theorem, provided $X$ is complete and non-empty).
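The fixed-point property above can be illustrated with a minimal sketch. The function name `fixed_point` and the toy mapping $T(x) = 0.5x + 1$ are assumptions chosen for illustration; they are not the bias/prestige system from the slides.

```python
def fixed_point(T, x0, tol=1e-12, max_iter=1000):
    """Iterate x <- T(x) until successive iterates differ by less than tol."""
    x = x0
    for _ in range(max_iter):
        nxt = T(x)
        if abs(nxt - x) < tol:
            return nxt
        x = nxt
    raise RuntimeError("did not converge")


def toy_mapping(x):
    # A contraction on the real line with constant c = 0.5 (hypothetical example).
    # Solving x = 0.5 * x + 1 gives the unique fixed point x = 2.
    return 0.5 * x + 1.0


# Two very different starting points reach the same fixed point.
from_below = fixed_point(toy_mapping, x0=-100.0)
from_above = fixed_point(toy_mapping, x0=1e6)
```

Because the distance to the fixed point shrinks by the factor $c$ in every step, the iteration converges at an exponential rate regardless of the starting point.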
  • 26. Our Approach: We use two vectors, $b$ and $r$, for bias and prestige. $b_j = (f(r))_j$ denotes the bias of node $j$, where $r$ is the prestige vector of the nodes and $f(r)$ is a vector-valued contractive function; $(f(r))_j$ denotes the $j$-th element of the vector $f(r)$. Let $0 \le f(r) \le e$, where $e = [1, 1, \ldots, 1]^T$. For any $x, y \in R^n$, the function $f: R^n \to R^n$ is a vector-valued contractive function if the following condition holds (componentwise): $|f(x) - f(y)| \le \lambda \|x - y\|_\infty \, e$, where $\lambda \in [0, 1)$ and $\|\cdot\|_\infty$ denotes the infinity norm.
  • 27. The Framework: Use a vector-valued contractive function, which is a generalization of the contraction mapping in fixed-point theory. MB is a special case in our framework. The iterative system converges to a unique fixed prestige and bias vector at an exponential rate of convergence. We can handle both unsigned and signed trust social networks.
  • 28. Influence on Social Networks
  • 29. Diffusion in Networks: We care about the decisions made by friends and colleagues. Why imitate the behavior of others? Informational effects: the choices made by others can provide indirect information about what they know. Direct-benefit effects: there are direct payoffs from copying the decisions of others. Diffusion: how new behaviors, practices, opinions, conventions, and technologies spread through a social network.
  • 30. A Real World Example: Hotmail's viral climb to the top spot (90's): 8 million users in 18 months! Far more effective than conventional advertising by rivals, and far cheaper too!
  • 31. Stochastic Diffusion Model: Consider a directed graph $G = (V, E)$. The diffusion of information (or influence) proceeds in discrete time steps, with time $t = 0, 1, \ldots$. Each node $v$ has two possible states, inactive and active. Let $S_t \subseteq V$ be the set of active nodes at time $t$ (the active set at time $t$). $S_0$ is the seed set (the seeds of influence diffusion). A stochastic diffusion model (with discrete time steps) for a social graph $G$ specifies the randomized process of generating active sets $S_t$ for all $t \ge 1$ given the initial $S_0$. A progressive model is a model in which $S_{t-1} \subseteq S_t$ for $t \ge 1$.
  • 32. Influence Spread: Let $\Phi(S_0)$ be the final active set (the eventually stable active set), where $S_0$ is the initial seed set. $\Phi(S_0)$ is a random set determined by the stochastic process of the diffusion model. The goal is to maximize the expected size of the final active set. Let $\mathbb{E}(X)$ denote the expected value of a random variable $X$. The influence spread of seed set $S_0$ is defined as $\sigma(S_0) = \mathbb{E}(|\Phi(S_0)|)$. Here the expectation is taken among all random events leading to $\Phi(S_0)$.
  • 33. Independent Cascade Model (IC): IC takes $G = (V, E)$, the influence probability $p$ on all edges, and the initial seed set $S_0$ as the input, and generates the active sets $S_t$ for all $t \ge 1$. At every time step $t \ge 1$, first set $S_t = S_{t-1}$. Next, for every inactive node $v \notin S_{t-1}$ and for each node $u \in N^{in}(v) \cap (S_{t-1} \setminus S_{t-2})$ (taking $S_{-1} = \emptyset$), $u$ executes an activation attempt with success probability $p(u, v)$. If successful, $v$ is added into $S_t$ and it is said that $u$ activates $v$ at time $t$. If multiple nodes activate $v$ at time $t$, the end effect is the same.
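The IC process described above can be sketched as a single randomized simulation. This is a minimal sketch under stated assumptions: the function name `simulate_ic`, the dictionary-based graph encoding, and the toy three-node path are hypothetical and not taken from the slides.

```python
import random


def simulate_ic(out_neighbors, prob, seeds, rng):
    """Run one Independent Cascade diffusion and return the final active set.

    out_neighbors: dict mapping node -> list of out-neighbors
    prob:          dict mapping edge (u, v) -> activation probability p(u, v)
    seeds:         iterable of seed nodes (S_0)
    rng:           a random.Random instance
    """
    active = set(seeds)
    frontier = list(active)  # nodes activated in the previous time step
    while frontier:
        newly_active = []
        for u in frontier:
            # Each newly activated u gets exactly one attempt per inactive out-neighbor.
            for v in out_neighbors.get(u, []):
                if v not in active and rng.random() < prob[(u, v)]:
                    active.add(v)
                    newly_active.append(v)
        frontier = newly_active
    return active


# Toy path a -> b -> c (assumed example graph).
toy_graph = {"a": ["b"], "b": ["c"], "c": []}
always = simulate_ic(toy_graph, {("a", "b"): 1.0, ("b", "c"): 1.0}, ["a"], random.Random(1))
never = simulate_ic(toy_graph, {("a", "b"): 0.0, ("b", "c"): 1.0}, ["a"], random.Random(1))
```

Only the frontier (nodes in $S_{t-1} \setminus S_{t-2}$) makes activation attempts, so the loop stops as soon as a step activates nobody.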
  • 36. Influenceability Estimation in Social Networks: Applications: influence maximization for viral marketing; influential nodes discovery; online advertisement. The fundamental issue: how to evaluate the influenceability of a given node in a social network?
  • 37. Reconsider IC Model: In the independent cascade model, each node has an independent probability to influence his neighbors. It can be modeled by a probabilistic graph, called an influence network, $G = (V, E, P)$. A possible graph $G_P = (V_P, E_P)$ has probability $\Pr[G_P] = \prod_{e \in E_P} p_e \prod_{e \in E \setminus E_P} (1 - p_e)$. There are $2^{|E|}$ possible graphs ($\Omega$).
  • 39. The Problem: Independent cascade model: a possible graph $G_P = (V_P, E_P)$ has probability $\Pr[G_P] = \prod_{e \in E_P} p_e \prod_{e \in E \setminus E_P} (1 - p_e)$. Given a graph $G = (V, E, P)$ and a node $s$, estimate the expected number of nodes that are reachable from $s$: $F_s(G) = \sum_{G_P \in \Omega} \Pr[G_P] \, f_s(G_P)$, where $f_s(G_P)$ is the number of nodes that are reachable from the seed node $s$ in $G_P$.
  • 40. Reduce the Variance: The accuracy of an approximate algorithm is measured by the mean squared error $\mathbb{E}[(\hat{F}_s(G) - F_s(G))^2]$. By the variance-bias decomposition, $\mathbb{E}[(\hat{F}_s(G) - F_s(G))^2] = \mathrm{Var}(\hat{F}_s(G)) + [\mathbb{E}(\hat{F}_s(G) - F_s(G))]^2$. Make the estimator unbiased, so that the second term vanishes, and make the variance as small as possible.
  • 41. Naïve Monte-Carlo (NMC): Sample $N$ possible graphs $G_1, G_2, \ldots, G_N$. For each sampled possible graph $G_i$, compute the number of nodes that are reachable from $s$. NMC estimator: the average of the number of reachable nodes over the $N$ possible graphs, $\hat{F}_{NMC} = \frac{\sum_{i=1}^{N} f_s(G_i)}{N}$. $\hat{F}_{NMC}$ is an unbiased estimator of $F_s(G)$ since $\mathbb{E}(\hat{F}_{NMC}) = F_s(G)$. NMC is the only existing algorithm used in the influence maximization literature.
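The NMC estimator above can be sketched in a few lines. This is a minimal sketch, assuming edges are given as a dictionary from `(u, v)` to $p_e$ and that $f_s$ excludes $s$ itself (consistent with the range $[0, n-1]$ stated on a later slide); the names `sample_reach` and `nmc_estimate` are hypothetical.

```python
import random
from collections import deque


def sample_reach(edges, s, rng):
    """Sample one possible graph and count the nodes reachable from s (excluding s)."""
    adj = {}
    for (u, v), p in edges.items():
        if rng.random() < p:  # keep edge e independently with probability p_e
            adj.setdefault(u, []).append(v)
    seen = {s}
    queue = deque([s])
    while queue:  # breadth-first search over the sampled graph
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) - 1


def nmc_estimate(edges, s, n_samples, seed=0):
    """Average f_s over n_samples independently sampled possible graphs."""
    rng = random.Random(seed)
    total = sum(sample_reach(edges, s, rng) for _ in range(n_samples))
    return total / n_samples


# Toy inputs (assumed): a certain path s -> a -> b, and a single coin-flip edge.
certain_path = {("s", "a"): 1.0, ("a", "b"): 1.0}
coin_flip = {("s", "a"): 0.5}
exact = nmc_estimate(certain_path, "s", n_samples=100)
approx = nmc_estimate(coin_flip, "s", n_samples=20000)
```

For the coin-flip graph the true value is $F_s(G) = 0.5$, and the estimate fluctuates around it with a standard error of roughly $0.5 / \sqrt{N}$.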
  • 42. Naïve Monte-Carlo (NMC): NMC estimator: the average of the number of reachable nodes over the $N$ possible graphs, $\hat{F}_{NMC} = \frac{\sum_{i=1}^{N} f_s(G_i)}{N}$. $\hat{F}_{NMC}$ is an unbiased estimator of $F_s(G)$ since $\mathbb{E}(\hat{F}_{NMC}) = F_s(G)$. The variance of NMC is $\mathrm{Var}(\hat{F}_{NMC}) = \frac{\mathbb{E}[f_s(G)^2] - (\mathbb{E}[f_s(G)])^2}{N} = \frac{\sum_{G_P \in \Omega} \Pr[G_P] \, f_s(G_P)^2 - F_s(G)^2}{N}$. Computing the variance is extremely expensive, because it needs to enumerate all the possible graphs.
  • 43. Naïve Monte-Carlo (NMC): In practice, it resorts to an unbiased estimator of $\mathrm{Var}(\hat{F}_{NMC})$: $\widehat{\mathrm{Var}}(\hat{F}_{NMC}) = \frac{\sum_{i=1}^{N} (f_s(G_i) - \hat{F}_{NMC})^2}{N - 1}$. But $\mathrm{Var}(\hat{F}_{NMC})$ may be very large, because each $f_s(G_i)$ falls into the interval $[0, n - 1]$. The variance can be up to $O(n^2)$.
  • 44. Stratified Sampling: Stratification divides a set of data items into subsets before sampling. A stratum is a subset. The strata should be mutually exclusive, and should include all data items in the set. Stratified sampling can be used to reduce variance.
  • 45. A Recursive Estimator [Jin et al. VLDB'11]: Randomly select 1 edge to partition the probability space (the set of all possible graphs) into 2 strata (2 subsets). The possible graphs in the first subset include the selected edge; the possible graphs in the second subset do not include the selected edge. Sample possible graphs in each stratum $i$ with a sample size $N_i$ proportional to the probability of that stratum. Recursively apply the same idea in each stratum.
  • 46. A Recursive Estimator [Jin et al. VLDB'11]: Advantages: an unbiased estimator with a smaller variance. Limitations: it selects only one edge for stratification, which is not enough to significantly reduce the variance; it selects edges randomly, which results in a possibly large variance.
  • 47. More Effective Estimators: Four stratified sampling (SS) estimators: Type-I basic SS estimator (BSS-I); Type-I recursive SS estimator (RSS-I); Type-II basic SS estimator (BSS-II); Type-II recursive SS estimator (RSS-II). All are unbiased and their variances are significantly smaller than the variance of NMC. The time and space complexity of all four are the same as NMC.
  • 48. Type-I Basic Estimator (BSS-I): Select $r$ edges to partition the probability space (all the possible graphs) into $2^r$ strata. Each stratum corresponds to a probability subspace (a set of possible graphs). Let $\pi_i = \Pr[G_P \in \Omega_i]$. How to select the $r$ edges: BFS or random.
  • 49. Type-I BSS-I Estimator. [Figure: BSS-I splits a sample of size $N$ across the $2^r$ strata; the first stratum is labelled with $N\pi_1$.]
  • 50. Type-I Recursive Estimator (RSS-I): Recursively apply BSS-I in each stratum, until the sample size reaches a given threshold. RSS-I is unbiased and its variance is smaller than that of BSS-I. Time and space complexity are the same as NMC. [Figure: RSS-I applies BSS-I recursively to a sample of size $N$; the first stratum is labelled with $N\pi_1$.]
  • 51. Type-II Basic Estimator (BSS-II): Select $r$ edges to partition the probability space (all the possible graphs) into $r + 1$ strata. Similarly, each stratum corresponds to a probability subspace (a set of possible graphs). How to select the $r$ edges: BFS or random.
  • 52. Type-II Estimators. [Figure: BSS-II and RSS-II split a sample of size $N$ across $r + 1$ strata; the first stratum is labelled with $N\pi_1$.]
  • 54. Social Browsing: Social browsing is a process in which users in a social network find information along their social ties (e.g., photo-sharing on Flickr, online advertisements). Two issues: Problem-I: How to place items on $k$ users in a social network so that the other users can easily discover them by social browsing? The objective is to minimize the expected number of hops with which every node hits the target set. Problem-II: How to place items on $k$ users so that as many users as possible can discover them by social browsing? The objective is to maximize the expected number of nodes that hit the target set.
  • 55. The Random Walk: Both problems are random-walk problems. We use the $L$-length random walk model, where the path length of random walks is bounded by a nonnegative number $L$; a random walk in general can be considered as $L = \infty$. Let $Z_u^t$ be the position of an $L$-length random walk, starting from node $u$, at discrete time $t$. Let $T_{uv}^L$ be a random walk variable: $T_{uv}^L \triangleq \min\{\min\{t : Z_u^t = v, t \ge 0\}, L\}$. The hitting time $h_{uv}^L$ can be defined as the expectation of $T_{uv}^L$: $h_{uv}^L = \mathbb{E}[T_{uv}^L]$.
  • 56. The Hitting Time: Sarkar and Moore in UAI'07 define the hitting time of the $L$-length random walk in a recursive manner: $h_{uv}^L = 0$ if $u = v$, and $h_{uv}^L = 1 + \sum_{w \in V} p_{uw} h_{wv}^{L-1}$ if $u \ne v$. Our hitting time can be computed by the recursive procedure. Let $d_u$ be the degree of node $u$ and $N(u)$ be the set of neighbor nodes of $u$. Let $p_{uw} = 1/d_u$ be the transition probability for $w \in N(u)$, and $p_{uw} = 0$ otherwise.
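The recursion above can be evaluated bottom-up over the walk length. This is a minimal sketch, assuming an undirected graph given as adjacency lists in which every node has at least one neighbor, and taking $h_{uv}^0 = 0$ as the base case (which follows from $T_{uv}^0 = \min\{\cdot, 0\} = 0$); the name `hitting_times` is hypothetical.

```python
def hitting_times(neighbors, target, L):
    """Return {u: h_{u,target}^L} using h^l = 1 + sum_w p_uw * h_w^(l-1)."""
    h = {u: 0.0 for u in neighbors}  # base case: h^0 = 0 for every start node
    for _ in range(L):
        nxt = {}
        for u, nbrs in neighbors.items():
            if u == target:
                nxt[u] = 0.0
            else:
                # p_uw = 1 / d_u for each neighbor w of u
                nxt[u] = 1.0 + sum(h[w] for w in nbrs) / len(nbrs)
        h = nxt
    return h


# Toy path graph a - b - c (assumed example), target node c, walk length L = 2.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
h2 = hitting_times(path, target="c", L=2)
```

On this path, a walk from `b` hits `c` at step 1 with probability 1/2 and is otherwise truncated at $L = 2$, which gives $h = 1.5$; a walk from `a` needs at least two steps, so it is always capped at 2.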
  • 57. The Random-Walk Domination: Consider a set of nodes $S$. If a random walk from $u$ reaches $S$ by an $L$-length random walk, we say $S$ dominates $u$ by an $L$-length random walk. Generalized hitting time over a set of nodes $S$: the hitting time $h_{uS}^L$ can be defined as the expectation of a random walk variable $T_{uS}^L$, where $T_{uS}^L \triangleq \min\{\min\{t : Z_u^t \in S, t \ge 0\}, L\}$ and $h_{uS}^L = \mathbb{E}[T_{uS}^L]$. It can be computed recursively: $h_{uS}^L = 0$ if $u \in S$, and $h_{uS}^L = 1 + \sum_{w \in V} p_{uw} h_{wS}^{L-1}$ if $u \notin S$.
  • 58. Problem-I: How to place items on $k$ users in a social network so that the other users can easily discover them by social browsing? The objective is to minimize the total expected number of hops with which every node hits the target set.
  • 59. Problem-II: How to place items on $k$ users so that as many users as possible can discover them by social browsing? The objective is to maximize the expected number of nodes that hit the target set. Let $X_{uS}^L$ be an indicator random variable such that $X_{uS}^L = 1$ if $u$ hits any node in $S$ by an $L$-length random walk, and $X_{uS}^L = 0$ otherwise. Let $p_{uS}^L$ be the probability of the event that an $L$-length random walk starting from $u$ hits a node in $S$. Then $\mathbb{E}[X_{uS}^L] = p_{uS}^L$, with $p_{uS}^L = 1$ if $u \in S$, and $p_{uS}^L = \sum_{w \in V} p_{uw} \, p_{wS}^{L-1}$ if $u \notin S$.
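The hit probability above can be computed with the same bottom-up scheme as the hitting time. This is a minimal sketch, assuming adjacency lists with no isolated nodes and the base case $p_{uS}^0 = 1$ if $u \in S$ and $0$ otherwise; the name `hit_probabilities` is hypothetical.

```python
def hit_probabilities(neighbors, S, L):
    """Return {u: p_{uS}^L}, the probability that an L-length walk from u hits S."""
    p = {u: (1.0 if u in S else 0.0) for u in neighbors}  # base case for L = 0
    for _ in range(L):
        # The comprehension reads the previous round's p before p is rebound.
        p = {
            u: 1.0 if u in S else sum(p[w] for w in nbrs) / len(nbrs)
            for u, nbrs in neighbors.items()
        }
    return p


# Toy path graph a - b - c (assumed example) with target set S = {c} and L = 2.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
p2 = hit_probabilities(path, S={"c"}, L=2)
expected_hits = sum(p2.values())  # the Problem-II objective for this S
```

Summing the probabilities over all nodes gives the expected number of nodes that hit the target set, which is the quantity Problem-II maximizes.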
  • 60. Influence Maximization vs Problem II: Influence maximization is to select $k$ nodes to maximize the expected number of nodes that are reachable from the nodes selected. It uses the independent cascade model; the probabilities associated with the edges are independent; a target node can influence multiple immediate neighbors at a time. Problem II is to select $k$ nodes to maximize the expected number of nodes that reach a node in the nodes selected. It uses the $L$-length random walk model.
  • 61. Submodular Function Maximization: Submodular set function maximization subject to a cardinality constraint is NP-hard: $\arg\max_{S \subseteq V} F(S)$ s.t. $|S| = K$. The greedy algorithm is a $(1 - 1/e)$-approximation algorithm, with linear time and space complexity w.r.t. the size of the graph. Submodularity: $F(S)$ is submodular and non-decreasing. Non-decreasing: $F(S) \le F(T)$ for $S \subseteq T \subseteq V$. Submodular: let $g_j(S) = F(S \cup \{j\}) - F(S)$ be the marginal gain; then $g_j(S) \ge g_j(T)$ for $j \in V \setminus T$ and $S \subseteq T \subseteq V$.
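The greedy selection by marginal gain can be sketched generically. This is a minimal sketch, assuming $F$ is passed as a plain Python function and using a toy set-coverage function (a standard monotone submodular example that is not from the slides); the naive re-evaluation of $F$ here does not reproduce the linear-time bound claimed above.

```python
def greedy_max(ground_set, F, K):
    """Pick K elements, each time adding the element with the largest marginal gain."""
    S = set()
    for _ in range(K):
        candidates = [j for j in ground_set if j not in S]
        best = max(candidates, key=lambda j: F(S | {j}) - F(S))
        S.add(best)
    return S


# Toy coverage instance (hypothetical data): each element covers a set of items.
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1, 7}}


def coverage(S):
    """Number of distinct items covered by the elements in S."""
    return len(set().union(*(covers[j] for j in S)))


chosen = greedy_max(list(covers), coverage, K=2)
```

Here the first pick is `c` (gain 4) and the second is `a` (gain 3), so two elements cover all seven items.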
  • 62. Submodular Function Maximization: Submodular set function maximization subject to a cardinality constraint is NP-hard: $\arg\max_{S \subseteq V} F(S)$ s.t. $|S| = K$. Both Problem I and Problem II use a submodular set function. Problem-I: $F_1(S) = nL - \sum_{u \in V \setminus S} h_{uS}^L$. Problem-II: $F_2(S) = \sum_{w \in V} \mathbb{E}[X_{wS}^L] = \sum_{w \in V} p_{wS}^L$.
  • 63. The Algorithm: Let the marginal gain be $\sigma_u(S) = F(S \cup \{u\}) - F(S)$. It implies that dynamic programming (DP) is needed to compute the marginal gain.
  • 65. Diversified Ranking [Li et al, TKDE'13]: Why diversified ranking? Information requirements are diverse, and queries are incomplete.
  • 66. Problem Statement: The goal is to find $K$ nodes in a graph that are relevant to the query node and also dissimilar to each other. Main applications: ranking nodes in social networks, ranking papers, etc.
  • 67. Challenges: Diversity measures: there is no widely accepted diversity measure on graphs in the literature. Scalability: most existing methods cannot scale to large graphs. Lack of intuitive interpretation.
  • 68. Grasshopper/ManiRank: The main idea: work in an iterative manner; select a node at one iteration by random walk; set the selected node to be an absorbing node, and perform the random walk again to select the second node; perform the same process for $K$ iterations to get $K$ nodes. No diversity measure: diversity is achieved only by intuition and experiments. Cannot scale to large graphs (time complexity $O(Kn^2)$).
  • 69. Grasshopper/ManiRank: Initial random walk with no absorbing states; absorbing random walk after ranking the first item.
  • 70. Our Approach: The main idea: relevance of the top-$K$ nodes (denoted by a set $S$) is achieved by large (Personalized) PageRank scores; diversity of the top-$K$ nodes is achieved by a large expansion ratio. The expansion ratio of a set of nodes $S$ is $\sigma(S) = |N(S)|/n$. A larger expansion ratio implies better diversity.
  • 71. Submodular Function Maximization: Submodular set function maximization subject to a cardinality constraint is NP-hard: $\arg\max_{S \subseteq V} F(S)$ s.t. $|S| = K$. The greedy algorithm is a $(1 - 1/e)$-approximation algorithm, with linear time and space complexity w.r.t. the size of the graph. Submodularity: $F(S)$ is submodular and non-decreasing. Non-decreasing: $F(S) \le F(T)$ for $S \subseteq T \subseteq V$. Submodular: let $g_j(S) = F(S \cup \{j\}) - F(S)$ be the marginal gain; then $g_j(S) \ge g_j(T)$ for $j \in V \setminus T$ and $S \subseteq T \subseteq V$.
  • 73. Social Contagion: Social contagion is a process of information (e.g., fads, news, opinions) diffusion in online social networks. In the traditional biological contagion model, the affected probability depends on degree. [Figure labels: marketing, opinions, diffusion, social network.]
  • 74. Facebook Study [Ugander et al., PNAS'12]: Case study: the process by which a user joins Facebook in response to an invitation email from an existing Facebook user. Social contagion is not like biological contagion.
  • 75. Structural Diversity: The structural diversity of an individual is the number of connected components in one's neighborhood. The problem: find $k$ individuals with the highest structural diversity. [Figure: connected components in the neighborhood of the "white center" node.]
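The definition above can be sketched directly. This is a minimal sketch, assuming an undirected graph stored as adjacency lists; it is a brute-force illustration of the definition, not the top-$k$ search algorithm referred to in the slides, and the names `structural_diversity` and `top_k` are hypothetical.

```python
def structural_diversity(adj, u):
    """Number of connected components in the subgraph induced by u's neighbors."""
    nbrs = set(adj[u])
    seen = set()
    components = 0
    for start in nbrs:
        if start in seen:
            continue
        components += 1  # start belongs to a component not visited yet
        seen.add(start)
        stack = [start]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                # Stay inside the neighborhood of u (u itself is excluded).
                if y in nbrs and y not in seen:
                    seen.add(y)
                    stack.append(y)
    return components


def top_k(adj, k):
    """Brute force: score every node and keep the k highest."""
    return sorted(adj, key=lambda u: structural_diversity(adj, u), reverse=True)[:k]


# Toy graph (assumed): node 0 has neighbors 1, 2, 3, 4, and only 1 and 2 are linked.
toy = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1], 3: [0], 4: [0]}
center_score = structural_diversity(toy, 0)
```

In the toy graph the neighborhood of node 0 splits into the components {1, 2}, {3}, and {4}, so its structural diversity is 3.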
  • 76. Part II: I/O Efficiency
  • 77. Big Data: The Volume: Consider a dataset $D$ of 1 petabyte ($10^{15}$ bytes). A linear scan of $D$ takes 46 hours with the fastest solid state drive (SSD), at a speed of 6 GB/s. PTIME queries do not always serve as a good yardstick for tractability ("Big Data with Preprocessing" by Fan et al., PVLDB'13). Consider a function $f(G)$. One possible way is to make $G$ small to be $G'$, and find the answers from $G'$ as if they were answered from $G$: $f'(G') \approx f(G)$. There are many ways we can explore.
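The 46-hour figure can be checked with a short calculation, assuming decimal units (1 GB = $10^9$ bytes) and a sustained sequential read at the quoted rate:

```latex
\frac{10^{15}\ \text{bytes}}{6 \times 10^{9}\ \text{bytes/s}}
  \approx 1.67 \times 10^{5}\ \text{s}
  = \frac{1.67 \times 10^{5}}{3600}\ \text{h}
  \approx 46.3\ \text{h}
```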
  • 78. Big Data: The Volume: Consider a function $f(G)$. One possible way is to make $G$ small to be $G'$, and find the answers from $G'$ as if they were answered from $G$: $f'(G') \approx f(G)$. There are many ways we can explore. Make data simple and small: graph sampling, graph compression; graph sparsification, graph simplification; graph summary; graph clustering; graph views.
  • 79. More Work on Big Data: We also believe that there are many things we need to do on Big Data. We are planning to explore many directions. Make data simple and small: graph sampling, graph simplification, graph summary, graph clustering, graph views. Explore different computing approaches: parallel computing, distributed computing, streaming computing, semi-external/external computing.
  • 80. I/O Efficient Graph Computing: I/O Efficient: Computing SCCs in Massive Graphs, by Zhiwei Zhang, Jeffrey Xu Yu, Lu Qin, Lijun Chang, and Xuemin Lin, SIGMOD'13. Contract & Expand: I/O Efficient SCCs Computing, by Zhiwei Zhang, Lu Qin, and Jeffrey Xu Yu. Divide & Conquer: I/O Efficient Depth-First Search, by Zhiwei Zhang, Jeffrey Xu Yu, Lu Qin, and Zechao Shang.
  • 81. Reachability Query: Two possible but infeasible solutions: (1) traverse $G(V, E)$ to answer a reachability query, which has low query performance ($O(|E|)$ query time); (2) precompute and store the transitive closure, which gives fast query processing but has a large storage requirement ($O(|V|^2)$). The labeling approaches: assign labels to nodes in an offline preprocessing step; answer a query online using the labels assigned.
  • 82. Make a Graph Small and Simple: Any directed graph $G$ can be represented as a DAG (Directed Acyclic Graph), $G_D$, by taking every SCC (Strongly Connected Component) in $G$ as a node in $G_D$. An SCC of a directed graph $G = (V, E)$ is a maximal set of nodes $C \subseteq V$ such that for every pair of nodes $u$ and $v$ in $C$, $u$ and $v$ are reachable from each other. [Figure: an example graph on nodes A, B, C, D, F, G and its DAG.]
  • 83. The Reachability Queries: Reachability queries can be answered by the DAG. [Figure: an example graph on nodes A to I and its DAG.]
  • 84. The Issue and the Challenge: It needs to convert a massive directed graph $G$ into a DAG $G_D$ in order to process it efficiently, because $G$ cannot be held in main memory, and $G_D$ can be much smaller. It is assumed in the existing works that this can be done. But it needs a large main memory to convert.
  • 85. The Issue and the Challenge: The dataset uk-2007: nodes: 105,896,555; edges: 3,738,733,648; average degree: 35. Memory: 400 MB for nodes, and 28 GB for edges.
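The memory figures are consistent with a simple back-of-the-envelope estimate. The encoding is an assumption that the slides do not state: 4 bytes per node and 8 bytes per edge (two 4-byte endpoint identifiers), with sizes reported in binary units.

```latex
\begin{aligned}
\text{nodes: } & 105{,}896{,}555 \times 4\ \text{B} = 423{,}586{,}220\ \text{B} \approx 404\ \text{MiB} \\
\text{edges: } & 3{,}738{,}733{,}648 \times 8\ \text{B} = 29{,}909{,}869{,}184\ \text{B} \approx 27.9\ \text{GiB}
\end{aligned}
```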
  • 86. In Memory Algorithm? In-memory algorithm: scan $G$ twice. (1) DFS($G$) to obtain a decreasing order for each node of $G$; (2) reverse every edge to obtain $G^T$; and (3) DFS($G^T$) according to the same decreasing order to find all SCCs.
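The three steps above can be sketched as follows. This is a minimal sketch, assuming the graph fits in memory as a dictionary of successor lists in which every node appears as a key; it reads "decreasing order" as decreasing DFS postorder (as a later slide states), and the name `two_pass_scc` is hypothetical.

```python
def two_pass_scc(adj):
    """Return the SCCs of a directed graph using two depth-first searches."""
    # Pass 1: DFS on G, recording nodes in increasing finish (post) order.
    order, visited = [], set()
    for root in adj:
        if root in visited:
            continue
        visited.add(root)
        stack = [(root, iter(adj[root]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(adj[nxt])))
                    break
            else:
                # All successors explored: node is finished.
                order.append(node)
                stack.pop()

    # Reverse every edge to obtain G^T.
    radj = {u: [] for u in adj}
    for u, vs in adj.items():
        for v in vs:
            radj[v].append(u)

    # Pass 2: DFS on G^T in decreasing postorder; each DFS tree is one SCC.
    sccs, assigned = [], set()
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        component, stack = [], [root]
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in radj[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    stack.append(nxt)
        sccs.append(frozenset(component))
    return sccs


# Toy graph (assumed): cycle 1 -> 2 -> 3 -> 1, cycle 4 <-> 5, isolated node 6.
toy = {1: [2], 2: [3], 3: [1, 4], 4: [5], 5: [4], 6: []}
components = two_pass_scc(toy)
```

Both passes follow edges in an order dictated by the graph structure rather than by storage layout, which is why this approach causes many random I/Os once the graph no longer fits in memory.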
  • 87. In Memory Algorithm? DFS($G$) to obtain a decreasing order for each node of $G$. [Figure: example graph on nodes 1 to 9.]
  • 88. In Memory Algorithm? Reverse every edge to obtain $G^T$. [Figure: the example graph on nodes 1 to 9 and its reverse.]
  • 89. In Memory Algorithm? DFS($G^T$) according to the same decreasing order to find all SCCs. (A subtree, shown in black edges, forms an SCC.) [Figure: example graph on nodes 1 to 9.]
  • 90. (Semi)-External Algorithms: In-memory algorithm: scan $G$ twice. The in-memory algorithm cannot handle a large graph that cannot be held in memory. Why? No locality: a large number of random I/Os. Consider external algorithms and/or semi-external algorithms. Let $M$ be the size of main memory. External algorithm: $M < |G|$. Semi-external algorithm: $k \cdot |V| \le M < |G|$; it assumes that a tree can be held in memory.
  • 91. Contraction Based External Algorithm (1): Load in a subgraph and merge the SCCs in it in main memory in every iteration [Cosgaya-Lozano et al. SEA'09]. [Figure: example graph on nodes A to I and the part loaded in main memory.]
  • 93. Contraction Based External Algorithm (3): Cannot always find all SCCs! The graph in main memory is a DAG and memory is full, so node "I" cannot be loaded into memory. [Figure: example graph on nodes A to I and the part loaded in main memory.]
  • 94. DFS Based Semi-External Algorithm: Find a DFS-tree without forward-cross-edges [Sibeyn et al. SPAA'02]. For a forward-cross-edge $(u, v)$, delete the tree edge to $v$, and add $(u, v)$ as a new tree edge. [Figure legend: tree-edge, forward-cross-edge, backward-edge, forward-edge, backward-cross-edge; delete old tree edge, new tree edge.]
  • 95. DFS Based Approaches: Cost-1: DFS-SCC uses sequential I/Os. DFS-SCC needs to traverse a graph $G$ twice using DFS to compute all SCCs. In each DFS, in the worst case it needs $depth(G) \cdot |E(V)|/B$ I/Os, where $B$ is the block size.
  • 96. DFS Based Approaches: Cost-2: Partial SCCs cannot be contracted to save space while constructing a DFS tree. Why? DFS-SCC needs to traverse a graph $G$ twice using DFS to compute all SCCs. DFS-SCC uses a total order of nodes (decreasing postorder) in the second DFS, which is computed in the first DFS. SCCs cannot be partially contracted in the first DFS. SCCs can be partially contracted in the second DFS, but we have to remember which nodes belong to which SCCs with extra space. Not free!
  • 97. DFS Based Approaches: Cost-3: High CPU cost for reshaping a DFS-tree, when it attempts to reduce the number of forward-cross-edges.
  • 98. Our New Approach [SIGMOD'13]: We propose a new two-phase algorithm, 2P-SCC, with Tree-Construction and Tree-Search phases. In the Tree-Construction phase, we construct a tree-like structure. In the Tree-Search phase, we scan the graph only once. We further propose a new algorithm, 1P-SCC, which combines Tree-Construction and Tree-Search with new optimization techniques, using a tree: Early-Acceptance; Early-Rejection; Batch Edge Reduction. A joint work by Zhiwei Zhang, Jeffrey Xu Yu, Lu Qin, Lijun Chang, and Xuemin Lin.
  • 99. A New Weak Order: The total order used in DFS-SCC is too strong, and there is no obvious relationship between the total order and the SCCs per se that could be used to reduce I/Os; the total order cannot help to reduce I/O costs. We introduce a new weak order. For an SCC, there must exist at least one cycle. While constructing a tree $T$ for $G$, a cycle will contain at least one edge $(u, v)$ that links to a higher-level node in $T$, i.e., $depth(u) > depth(v)$. There are two cases when $depth(u) > depth(v)$. A cycle: $v$ is an ancestor of $u$ in $T$. Not a cycle (up-edge): $v$ is not an ancestor of $u$ in $T$. We reduce the number of up-edges iteratively.
  • 100. The Weak Order: drank. Let $Rset(u, G, T)$ be the set of nodes including $u$ and the nodes that $u$ can reach by a tree $T$ of $G$. $depth(u, T)$: the length of the longest simple path from the root to $u$. $drank(u, T) = \min\{depth(v, T) \mid v \in Rset(u, G, T)\}$; drank is used as the weak order! $dlink(u, T) = \arg\min_v \{depth(v, T) \mid v \in Rset(u, G, T)\}$. Nodes do not need to have a unique order. Example: depth(B) = 1, drank(B) = 1, dlink(B) = B; depth(E) = 3, drank(E) = 1, dlink(E) = B. [Figure: example tree on nodes A to I.]
  • 101. 2P-SCC: To reduce Cost-1, we construct a BR+-tree in the Tree-Construction phase, and in the Tree-Search phase we compute all SCCs by traversing $G$ only once using that BR+-tree. To reduce Cost-3, we have shown that we only need to update the depth of nodes locally.
  • 102. BR-Tree and BR+-Tree: A BR-Tree is a spanning tree of $G$. A BR+-Tree is a BR-Tree plus some additional edges $(u, v)$ such that $v$ is an ancestor of $u$. In memory: the black edges. [Figure: example tree on nodes A to I.]
  • 103. Tree-Construction: Up-edge: An edge $(u, v)$ is an up-edge under two conditions: $v$ is not an ancestor of $u$ in $T$, and $drank(u, T) \ge drank(v, T)$. Up-edges violate the existing order. [Figure: example tree on nodes A to I with drank(I) = 1, drank(H) = 2, and an up-edge.]
  • 104. Tree-Construction (Push-Down): When an up-edge violates the order, modify $T$ (delete the old tree edge and set the up-edge as a new tree edge) and perform graph reconstruction. No I/O cost, low CPU cost. [Figure: example tree on nodes A to I.]
  • 105. Tree-Construction (Graph Reconstruction): Tree edges and one extra edge in the BR+-Tree form a part of an SCC. For an up-edge $(u, v)$, if $dlink(v, T)$ is an ancestor of $u$ in $T$, delete $(u, v)$ and add $(u, dlink(v))$. In Tree-Search, scan the graph only once to find all SCCs, which reduces I/O costs. [Figure: example on nodes A, B, D, E, F with drank(E) = 1, dlink(E) = B, drank(F) = 1, an up-edge, and a new edge.]
  • 106. Tree-Construction: When a BR+-tree is completely constructed, there are no up-edges. There are only two kinds of edges in $G$: the BR+-tree edges, and the edges $(u, v)$ where $drank(u, T) < drank(v, T)$. Such edges do not play any role in determining an SCC.
  • 107. Tree-Search: In memory, for each node $u$: TreeEdge($u$), dlink($u$), and drank($u$); in total $3 \times |V|$. Search procedure: if an edge $(u, v)$ points to an ancestor, merge all nodes from $v$ to $u$ in the tree. Only need to scan the graph once. [Figure: example tree on nodes A to I.]
  • 108. From 2P-SCC To 1P-SCC ๏ฎ With 2P-SCC: ๏ฑ In Tree-Construction phase, we construct a tree by an approach similar to DFS-SCC, and ๏ฑ In Tree-Search phase, we scan the graph only once. ๏ฑ The memory used for BR+-tree is 3 ร— |๐‘‰|. ๏ฎ With 1P-SCC: We combine Tree-Construction and Tree- Search with new optimization techniques to reduce Cost-2 and Cost-3: ๏ฑ Early-Acceptance ๏ฑ Early-Rejection ๏ฑ Batch Edge Reduction ๏ฑ Only need to use a BR-tree with memory of 2 ร— |๐‘‰|. 108
  • 109. Early-Acceptance and Early-Rejection. Early acceptance: we contract a partial SCC into a node at an early stage while constructing a BR-tree. Early rejection: we identify necessary conditions to remove nodes that will not participate in any SCC while constructing a BR-tree.
  • 110. Early-Acceptance and Early-Rejection. Consider an example. The three nodes on the left can be contracted into a node on the right. The node “a” and the subtrees C and D can be rejected.
  • 111. Modify Tree + Early Acceptance. Memory: 2 × |V|. Reduces I/O cost. [Figure: tree on nodes A to K; two up-edges are labelled “Modify Tree” and two edges are labelled “Early-Acceptance”.]
  • 112. DFS-Based vs. Our Approach. DFS-based: I/O cost is high (uses a total order); cannot merge SCCs when they are found early (the total order cannot be changed, so the number of I/Os is large); cannot prune non-SCC nodes (the total order cannot be changed). Ours: smaller I/O cost (uses a weaker order); merges SCCs as early as possible (merges nodes with the same order, giving a smaller graph and a small number of I/Os); prunes non-SCC nodes as early as possible (the weaker order is flexible).
  • 113. Optimization: Batch Edge Reduction. With 1P-SCC, CPU cost is still high. To determine whether (u, v) is a backward-edge/up-edge, it needs to check the ancestor relationship between u and v over a tree. The tree is frequently updated. The average depth of nodes in the tree becomes larger with frequent push-down operations.
  • 114. Optimization: Batch Edge Reduction. When memory can hold more edges, there is no need to contract partial SCCs edge by edge. Find all SCCs in main memory at the same time: (1) read as many edges as fit into memory; (2) construct a graph from the edges of the tree already constructed in memory plus the newly read edges; (3) construct a DAG in memory using an existing in-memory algorithm, which finds all SCCs in memory; (4) reconstruct the BR-tree according to the DAG.
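Step (3) above relies on an in-memory SCC algorithm that the slide does not name. The sketch below uses Kosaraju's two-pass algorithm as one possible choice; the function name `strongly_connected_components` and the sample edge list (the twelve-edge example graph used later in the deck) are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

def strongly_connected_components(edges):
    """Map each node to an SCC representative (Kosaraju, assumed choice)."""
    out_adj, in_adj = defaultdict(list), defaultdict(list)
    nodes = set()
    for u, v in edges:
        out_adj[u].append(v)
        in_adj[v].append(u)
        nodes.update((u, v))

    # Pass 1: iterative DFS on G, recording nodes in order of completion.
    finished, seen = [], set()
    for root in sorted(nodes):
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(out_adj[root]))]
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(out_adj[nxt])))
                    break
            else:
                # All out-edges explored: the node is finished.
                finished.append(node)
                stack.pop()

    # Pass 2: DFS on the reversed graph in decreasing finish order.
    comp = {}
    for root in reversed(finished):
        if root in comp:
            continue
        comp[root] = root
        stack = [root]
        while stack:
            node = stack.pop()
            for prev in in_adj[node]:
                if prev not in comp:
                    comp[prev] = root
                    stack.append(prev)
    return comp

# Hypothetical batch of edges held in memory.
EDGES = [("a", "b"), ("a", "d"), ("b", "c"), ("c", "d"), ("d", "e"), ("d", "g"),
         ("e", "b"), ("e", "g"), ("f", "g"), ("g", "h"), ("h", "i"), ("i", "f")]
comp = strongly_connected_components(EDGES)
```

Contracting every node to its representative in `comp` yields the DAG from which the tree would be rebuilt in step (4).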
  • 115. Performance Studies. Implemented in Visual C++ 2005. Tested on a PC with an Intel Core2 Quad 2.66GHz CPU and 3.43GB memory running Windows XP. Disk block size: 64KB. Memory size: 3 × |V(G)| × 4B + 64KB.
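As a worked instance of the memory-size formula, the sketch below evaluates 3 × |V(G)| × 4B + 64KB for the cit-patent node count reported with the real data sets. It assumes 1KB = 1024 bytes; the function and constant names are illustrative.

```python
BLOCK_BYTES = 64 * 1024  # one 64KB disk block (assumes 1KB = 1024 bytes)

def memory_budget_bytes(num_nodes, words_per_node=3, word_bytes=4):
    """Evaluate words_per_node * |V| * word_bytes + one disk block."""
    return words_per_node * num_nodes * word_bytes + BLOCK_BYTES

# cit-patent has |V| = 3,774,768 nodes.
cit_patent_bytes = memory_budget_bytes(3_774_768)
cit_patent_mib = cit_patent_bytes / (1024 * 1024)
```

Under these assumptions the budget for cit-patent is 45,362,752 bytes, a little over 43 MiB.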
  • 116. Four Real Data Sets.
      Data set         |V|           |E|             Average Degree
      cit-patent       3,774,768     16,518,947      4.70
      go-uniprot       6,967,956     34,770,235      4.99
      citeseerx        6,540,399     15,011,259      2.30
      WEBSPAM-UK2002   105,896,555   3,738,733,568   35.00
  • 117. Synthetic Data Sets.
      Parameter              Range         Default
      Node Size              30M – 70M     30M
      Average Degree         3 – 7         5
      Size of Massive SCCs   200K – 600K   400K
      Size of Large SCCs     4K – 12K      8K
      Size of Small SCCs     20 – 60       40
      # of Massive SCCs      1             1
      # of Large SCCs        30 – 70       50
      # of Small SCCs        6K – 14K      10K
    We construct a graph G by (1) randomly selecting all nodes in SCCs first, (2) adding edges among the nodes in an SCC until all nodes form an SCC, and (3) randomly adding nodes/edges to the graph.
  • 118. Performance Studies.
      Measure            1PB-SCC   1P-SCC   2P-SCC    DFS-SCC
      cit-patent (s)     24s       22s      701s      840s
      go-uniprot (s)     22s       21s      301s      856s
      citeseerx (s)      10s       8s       517s      669s
      cit-patent (I/O)   16,031    13,331   133,467   667,530
      go-uniprot (I/O)   26,034    47,947   471,540   619,969
      citeseerx (I/O)    15,472    13,482   104,611   392,659
  • 121. Synthetic Data Sets: Vary SCC Sizes
  • 122. Synthetic Data Sets: Vary # of SCCs
  • 123. From Semi-External to External. Existing semi-external solutions work under the condition that a tree can be held in main memory, k ∙ |V| ≤ |M|, and they generate a large number of I/Os. We study an external algorithm by removing the condition k ∙ |V| ≤ |M|. (Joint work by Zhiwei Zhang, Qin Lu, and Jeffrey Yu.)
  • 124. The New Approach: The Main Idea. DFS-based approaches generate random accesses. The contraction-based semi-external approach reduces |V| and |E| together at the same time, and cannot find all SCCs. The main idea of our external algorithm: work on a small graph G′ of G by reducing V, because M can be small; find all SCCs in G′; add the removed nodes back to find the SCCs in G.
  • 125. The New Approach: The Property. Reducing the given graph: G(V, E) → G′(V′, E′), with |V′| < |V|. If u can reach v in G, u can also reach v in G′. Maintaining this property may generate a large number of random I/O accesses. Reason: several nodes on the path from u to v may have been removed from G in previous iterations.
  • 126. The New Approach: The Approach. We introduce a new Contraction & Expansion approach. Contraction phase: reduce nodes iteratively, G_1, G_2, …, G_l; this decreases |V(G_i)| but may increase |E(G_i)|. Expansion phase: proceed in the reverse order of the contraction phase, G_l, G_{l−1}, …, G_1; find all SCCs in G_l using a semi-external algorithm (the semi-external algorithm deals with edges); expand G_i back to G_{i−1}.
  • 127. The Contraction. In the contraction phase, graphs G_1, G_2, …, G_l are generated. G_{i+1} is generated by removing a batch of nodes from G_i. The process stops once k ∙ |V| < |M|, at which point the semi-external approach can be applied. [Figure: G_1, G_2, G_3.]
  • 128. The Expansion. In the expansion phase, the removed nodes are added back, in the reverse order of their removal in the contraction phase. [Figure: G_1, G_2, G_3.]
  • 129. The Contraction Phase. Compared with G_i, G_{i+1} should have the following properties. Contractable: |V(G_{i+1})| < |V(G_i)|. SCC-Preservable: SCC(u, G_i) = SCC(v, G_i) ⟺ SCC(u, G_{i+1}) = SCC(v, G_{i+1}). Recoverable: v ∈ V_i − V_{i+1} ⟺ neighbour(v, G_i) ⊆ G_{i+1}.
  • 130. Contract V_{i+1}. Recoverable: v ∈ V_i − V_{i+1} ⟺ neighbour(v, G_i) ⊆ G_{i+1}. G_{i+1} is recoverable if and only if V_{i+1} is a vertex cover of G_i. Under this condition, we can determine which SCCs the nodes in G_i belong to by scanning G_i once. For each edge, we select the node with a higher degree or a higher order.
  • 131. Contract V_{i+1}. For each edge, we select the node with a higher degree or a higher order. [Figure: graph on nodes a to i.] Edge table on disk:
      ID1   ID2   Deg1   Deg2
      a     b     3      3
      a     d     3      4
      b     c     3      2
      c     d     2      4
      d     e     4      4
      d     g     4      4
      e     b     4      3
      e     g     4      4
      f     g     2      4
      g     h     4      2
      h     i     2      2
      i     f     2      2
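The selection rule can be sketched on the edge table of this slide. The slide does not spell out what "a higher order" means; the code below assumes it is a tie-breaker on the node label when the degrees are equal, and `select_cover` is a hypothetical helper name.

```python
# Edge table as listed on the slide: (ID1, ID2, Deg1, Deg2).
EDGE_TABLE = [
    ("a", "b", 3, 3), ("a", "d", 3, 4), ("b", "c", 3, 2), ("c", "d", 2, 4),
    ("d", "e", 4, 4), ("d", "g", 4, 4), ("e", "b", 4, 3), ("e", "g", 4, 4),
    ("f", "g", 2, 4), ("g", "h", 4, 2), ("h", "i", 2, 2), ("i", "f", 2, 2),
]

def select_cover(edge_table):
    """Pick one endpoint per edge; the picked nodes form a vertex cover."""
    cover = set()
    for u, v, deg_u, deg_v in edge_table:
        # Compare (degree, label): the degree decides, and the label breaks
        # ties (the tie-break is an assumption about "higher order").
        cover.add(max((deg_u, u), (deg_v, v))[1])
    return cover

cover = select_cover(EDGE_TABLE)
removed = {n for row in EDGE_TABLE for n in row[:2]} - cover
```

Because every edge contributes one endpoint, the result is a vertex cover by construction, so each removed node has all of its neighbours among the kept nodes. With the assumed tie-break, the kept set is {b, d, e, g, i} and the removed set is {a, c, f, h}.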
  • 132. Construct E_{i+1}. SCC-Preservable: SCC(u, G_i) = SCC(v, G_i) ⟺ SCC(u, G_{i+1}) = SCC(v, G_{i+1}). If v ∈ V_i − V_{i+1}, remove (v_in, v) and (v, v_out) and add (v_in, v_out). Although |E| may become larger, |V| is sure to become smaller. A smaller |V| implies that the semi-external approach can be applied.
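The edge rewriting rule can be sketched as follows, assuming the kept node set is a vertex cover so that every neighbour of a removed node is kept. The function name `contract_edges`, the choice to skip self-loops, and the kept set {b, d, e, g, i} are assumptions made for this example.

```python
def contract_edges(edges, kept):
    """Build E_{i+1}: bypass every removed node v with (v_in, v_out) edges."""
    in_nbrs, out_nbrs = {}, {}
    new_edges = set()
    for u, v in edges:                      # one scan of E(G_i)
        if u in kept and v in kept:
            new_edges.add((u, v))           # edge survives unchanged
        if v not in kept:
            in_nbrs.setdefault(v, set()).add(u)
        if u not in kept:
            out_nbrs.setdefault(u, set()).add(v)
    # Only removed nodes with both an in- and an out-neighbour add edges.
    for v in in_nbrs.keys() & out_nbrs.keys():
        for v_in in in_nbrs[v]:
            for v_out in out_nbrs[v]:
                if v_in != v_out:           # self-loops skipped (a choice here)
                    new_edges.add((v_in, v_out))
    return new_edges

EDGES = [("a", "b"), ("a", "d"), ("b", "c"), ("c", "d"), ("d", "e"), ("d", "g"),
         ("e", "b"), ("e", "g"), ("f", "g"), ("g", "h"), ("h", "i"), ("i", "f")]
next_edges = contract_edges(EDGES, kept={"b", "d", "e", "g", "i"})
```

On this input, four edges survive and three bypass edges are added: (b, d) through c, (i, g) through f, and (g, i) through h. Node a has no in-neighbour, so removing it adds nothing.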
  • 133. Construct E_{i+1}. If v ∈ V_i − V_{i+1}, remove (v_in, v) and (v, v_out) and add (v_in, v_out). Existing edges on disk (ID1, ID2): (d, e), (d, g), (e, b), (e, g). New edges (ID1, ID2): (e, d), (b, d), (i, g), (g, i). [Figure: graph on nodes a to i.]
  • 134. The Expansion Phase. SCC(u, G_i) = SCC(v, G_i) = SCC(w, G_i) (u, w ∈ V_{i+1}) ⟺ u → v & v → w in G_i. For any node v ∈ V_i − V_{i+1}, SCC(v, G_i) can be computed using neighbour_in(v, G_i) and neighbour_out(v, G_i) only. [Figure: nodes a, b, c.]
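One way to read the expansion rule: a removed node v joins an SCC exactly when some in-neighbour u and some out-neighbour w already carry the same SCC label; otherwise v forms an SCC on its own. The sketch below implements that reading. It assumes all neighbours of a removed node are kept nodes, and the names `expand_scc_labels` and `kept_labels` are illustrative.

```python
def expand_scc_labels(edges, kept_labels):
    """Extend SCC labels from the nodes of G_{i+1} to all nodes of G_i.

    kept_labels maps every node of G_{i+1} to its SCC label.
    """
    in_nbrs, out_nbrs = {}, {}
    removed = set()
    for u, v in edges:                      # one scan of E(G_i)
        if v not in kept_labels:
            in_nbrs.setdefault(v, set()).add(u)
            removed.add(v)
        if u not in kept_labels:
            out_nbrs.setdefault(u, set()).add(v)
            removed.add(u)
    labels = dict(kept_labels)
    for v in removed:
        in_ids = {kept_labels[u] for u in in_nbrs.get(v, ())}
        out_ids = {kept_labels[w] for w in out_nbrs.get(v, ())}
        shared = in_ids & out_ids
        # u -> v -> w with SCC(u) = SCC(w) puts v on a cycle through that SCC;
        # otherwise v is labelled with itself (a singleton SCC).
        labels[v] = shared.pop() if shared else v
    return labels

EDGES = [("a", "b"), ("a", "d"), ("b", "c"), ("c", "d"), ("d", "e"), ("d", "g"),
         ("e", "b"), ("e", "g"), ("f", "g"), ("g", "h"), ("h", "i"), ("i", "f")]
labels = expand_scc_labels(
    EDGES, kept_labels={"b": "b", "d": "b", "e": "b", "g": "g", "i": "g"})
```

Only the in- and out-neighbour sets of the removed nodes are consulted, which is what allows the labels to be recovered in a single scan of the edge list.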
  • 135. Expansion Phase. SCC(u, G_i) = SCC(v, G_i) = SCC(w, G_i) (u, w ∈ V_{i+1}) ⟺ u → v & v → w in G_i. Edges on disk (ID1, ID2): (a, b), (a, d), (b, c), (c, d), (d, e), (d, g), (e, b), (e, g), (f, g), (g, h), (h, i), (i, f). [Figure: graph on nodes a to i.]
  • 136. Performance Studies. Implemented in Visual C++ 2005. Tested on a PC with an Intel Core2 Quad 2.66GHz CPU and 3.5GB memory running Windows XP. Disk block size: 64KB. Default memory size: 400M.
  • 137. Data Set. Real data set: WEBSPAM-UK2007 with |V| = 105,896,555, |E| = 3,738,733,568, and average degree 35.00. Synthetic data parameters: node size 25M – 100M; average degree 2 – 6; size of SCCs 20 – 600K; number of SCCs 1 – 14K.
  • 139. DFS [SIGMOD’15]. Given a graph G(V, E), depth-first search searches G following the depth-first order. [Figure: graph on nodes A to J.] (Joint work by Zhiwei Zhang, Jeffrey Yu, Qin Lu, and Zechao Shang.)
  • 140. The Challenge. We need to run DFS on a massive directed graph G, but G may not fit entirely in main memory. Our work keeps only the nodes in memory, which takes much less space.
  • 141. The Issue and the Challenge (1). Consider all edges from u, such as (u, v_1), (u, v_2), …, (u, v_p). Suppose DFS searches from u to v_1. It is hard to estimate when it will visit v_i (2 ≤ i ≤ p). It is hard to know when C/D will be visited even though they are near A and B. It is hard to design the on-disk format of the graph. [Figure: graph on nodes A to E.]
  • 142. The Issue and the Challenge (2). A small part of the graph can change the DFS a lot. Even if almost the entire graph can be kept in memory, it still costs a lot to find the DFS. The edge (E, D) will change the existing DFS significantly. A large number of iterations is needed even when memory holds a large portion of the graph. [Figure: graph on nodes A to G.]
  • 143. Problem Statement. We study a semi-external algorithm that computes a DFS-Tree, from which a DFS can be obtained. The memory is limited: k|V| ≤ M ≤ |G|, where k is a small constant and |G| = |V| + |E|.
  • 144. DFS-Tree & Edge Type. A DFS of G forms a DFS-Tree. A DFS procedure can be obtained from a DFS-Tree. [Figure: graph on nodes A to J.]
  • 145. DFS-Tree & Edge Type. Given a spanning tree T, there exist 4 types of non-tree edges: forward edge, forward-cross edge, backward-cross edge, and backward edge. [Figure: graph on nodes A to J with one edge of each type marked.]
  • 146. DFS-Tree & Edge Type. An ordered spanning tree is a DFS-Tree if it does not have any forward-cross edges. [Figure: graph on nodes A to J with forward, forward-cross, backward-cross, and backward edges marked.]
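The four edge types can be told apart with preorder numbers on the ordered spanning tree. The slides do not define the types formally, so the sketch below assumes the usual reading: a cross edge (u, v) joins two nodes that are not ancestor-related, and it is a forward-cross edge when u comes before v in the preorder of the tree. The function name and the small tree are illustrative.

```python
def classify_edges(children, root, edges):
    """Label each edge relative to an ordered spanning tree.

    children maps a node to its ordered list of child nodes.
    """
    pre, last = {}, {}
    counter = 0
    stack = [(root, False)]
    while stack:
        node, done = stack.pop()
        if done:
            last[node] = counter - 1        # largest preorder number in subtree
            continue
        pre[node] = counter
        counter += 1
        stack.append((node, True))
        for child in reversed(children.get(node, [])):
            stack.append((child, False))    # leftmost child is visited first

    def is_ancestor(a, b):
        return pre[a] <= pre[b] <= last[a]

    tree_edges = {(p, c) for p, cs in children.items() for c in cs}
    kinds = {}
    for u, v in edges:
        if (u, v) in tree_edges:
            kinds[(u, v)] = "tree"
        elif is_ancestor(u, v):
            kinds[(u, v)] = "forward"
        elif is_ancestor(v, u):
            kinds[(u, v)] = "backward"
        elif pre[u] < pre[v]:
            kinds[(u, v)] = "forward-cross"
        else:
            kinds[(u, v)] = "backward-cross"
    return kinds

# Hypothetical ordered tree: A has children B then C; B has child D.
TREE = {"A": ["B", "C"], "B": ["D"]}
kinds = classify_edges(
    TREE, "A", [("A", "B"), ("A", "D"), ("D", "A"), ("B", "C"), ("C", "D")])
is_dfs_tree = "forward-cross" not in kinds.values()
```

Here (B, C) is a forward-cross edge, so this ordered tree is not a DFS-Tree for the example graph: a DFS that reaches B first would have continued to C before backtracking.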
  • 147. Existing Solutions. Iteratively remove the forward-cross edges. Procedure: if there exists a forward-cross edge, construct a new T by conducting DFS over the graph in memory.
  • 148. Existing Solutions. Construct a new T by conducting DFS over the graph in memory until no forward-cross edges exist. [Figure: graph on nodes A to J with a forward-cross edge marked.]
  • 149. The Drawbacks. D-1: a total order in V(G) needs to be maintained during the whole process. D-2: a large number of I/Os is produced, because all edges must be scanned in every iteration. D-3: a large number of iterations is needed, because the possibility of grouping edges that are near each other in the DFS is not considered.
  • 150. Why Divide & Conquer. We aim at dividing the graph into several subgraphs G_1, G_2, …, G_p with possible overlaps among them. Goal: the DFS-Tree for G can be computed from the DFS-Trees for all G_i. The divide & conquer approach can overcome the existing drawbacks.
  • 151. Why Divide & Conquer. To address D-1 (a total order in V(G) needs to be maintained during the whole process): after dividing the graph G into G_0, G_1, …, G_p, we only need to maintain the total order in V(G_i).
  • 152. Why Divide & Conquer. To address D-2 (a large number of I/Os is produced, because all edges must be scanned in each iteration): after dividing the graph G into G_0, G_1, …, G_p, we only need to scan the edges in G_i to eliminate forward-cross edges.
  • 153. Why Divide & Conquer. To address D-3 (a large number of iterations is needed, because edges that are near each other in the DFS visiting sequence cannot be grouped together): after dividing the graph G into G_0, G_1, …, G_p, the DFS procedure can be applied to each G_i independently.
  • 155. Invalid Division. An example: no matter how the DFS-Trees for G_1 and G_2 are ordered, the merged tree cannot be a DFS-Tree for G. [Figure: graph on nodes A to F divided into G_1 and G_2.]
  • 156. How to Cut: Challenges. Challenge-1: it is not easy to check whether a division is valid; we need to make sure that a DFS-Tree for one divided subgraph will not affect the DFS-Trees of the others. Challenge-2: finding a good division is non-trivial; the edge types between different subgraphs are complicated. Challenge-3: the merge procedure needs to make sure that the result is a DFS-Tree for G.
  • 157. Our New Approach. To address Challenge-1: compute a light-weight summary graph (S-graph), and check whether a division is valid by searching the S-graph. To address Challenge-2: recursively divide & conquer. To address Challenge-3: the DFS-Tree for G is computed only from the T_i and the S-graph.
  • 158. Four Division Properties. Node-Coverage: ⋃_{1≤i≤p} V(G_i) = V(G). Contractible: |V(G_i)| < |V(G)|. Independence: any pair of nodes in V(T_i) ∩ V(T_j) are consistent, so T_i and T_j can be dealt with independently (T_i and T_j are the DFS-Trees for G_i and G_j). DFS-Preservable: there exists a DFS-Tree T for graph G such that V(T) = ⋃_{1≤i≤p} V(T_i) and E(T) = ⋃_{1≤i≤p} E(T_i), so the DFS-Tree for G can be computed from the T_i.
  • 159. DFS-Preservable Property. The DFS-Tree for G can be computed from the T_i. DFS*-Tree: a spanning tree with the same edge set as a DFS-Tree (without order). Suppose the independence property is satisfied; then the DFS-preservable property is satisfied if and only if the spanning tree T with V(T) = ⋃_{1≤i≤p} V(T_i) and E(T) = ⋃_{1≤i≤p} E(T_i) is a DFS*-Tree.
  • 160. Independence Property. Any pair of nodes in V(T_i) ∩ V(T_j) are consistent (T_i and T_j are the DFS-Trees for G_i and G_j), so T_i and T_j can be dealt with independently. This may not hold: u is an ancestor of v in T_i, but is a sibling in T_j. Theorem: given a division G_0, G_1, …, G_p of G, the independence property is satisfied if and only if for any subgraphs G_i and G_j, E(G_i) ∩ E(G_j) = ∅.
  • 162. DFS-Preservable Example. The DFS-preservable property is not satisfied. The DFS-Tree for G does not exist given the DFS-Tree for each subgraph. Forward-cross edges always exist. [Figure: graph on nodes A to G.]
  • 163. Our Approach. Root-based division: independence is satisfied. Each G_i has a spanning tree T_i. For a division G_0, G_1, …, G_p, G_0 ∩ G_i = r_i, where r_i is the root of T_i and a leaf of T_0. [Figure: G_0 with subgraphs G_i and G_j attached below it.]
  • 164. Our Approach. We expand G_0 to capture the relationship between different G_i and call it the S-graph. The S-graph is used to check whether the current division is valid (that is, whether the DFS-preservable property is satisfied). [Figure: G_0, G_i, G_j, and the S-graph.]
  • 165. S-edge. Given a spanning tree T of G, (u′, v′) is the S-edge of (u, v) if u′ is an ancestor of u and v′ is an ancestor of v in T, and both u′ and v′ are children of LCA(u, v), where LCA(u, v) is the lowest common ancestor of u and v in T.
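A minimal sketch of computing the S-edge of (u, v) from parent pointers. It assumes "ancestor" includes the node itself, so that u′ may equal u when u is a child of the LCA, and it returns None when u and v are ancestor-related. The helper name `s_edge` and the sample tree are hypothetical.

```python
def s_edge(parent, u, v):
    """Return (u', v'), the children of LCA(u, v) above u and v, or None."""
    def path_to_root(x):
        path = [x]
        while parent.get(path[-1]) is not None:
            path.append(parent[path[-1]])
        return path                          # x, parent(x), ..., root

    up_u, up_v = path_to_root(u), path_to_root(v)
    # Strip the shared ancestors; what remains on top of each path is the
    # child of LCA(u, v) on the way down to u and to v respectively.
    while up_u and up_v and up_u[-1] == up_v[-1]:
        up_u.pop()
        up_v.pop()
    if not up_u or not up_v:
        return None                          # u and v are ancestor-related
    return up_u[-1], up_v[-1]

# Hypothetical tree: A is the root; B and C are its children;
# D is a child of B, F a child of D, and E a child of C.
PARENT = {"B": "A", "C": "A", "D": "B", "E": "C", "F": "D"}
deep_cross = s_edge(PARENT, "F", "E")
```

In the sample tree, the cross edge (F, E) has LCA A, so its S-edge is (B, C); the same S-edge results for (D, C), which is how many cross edges collapse onto few S-edges.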
  • 167. S-graph. For a division G_0, G_1, …, G_p where T_0 is the DFS-Tree for G_0, the S-graph is constructed as follows: remove all backward and forward edges w.r.t. T_0; replace each cross-edge (u, v) with its corresponding S-edge if the S-edge is between nodes in G_0; for an edge (u, v), if u ∈ G_i and v ∈ G_0, add the edge (r_i, v), and do the same for v.
  • 169. S-graph Example. [Figure: graph on nodes A to K showing G_0, a cross edge, its S-edge, a link (r_i, v), and the resulting S-graph.]
  • 170. Division Theorem. Consider a division G_0, G_1, …, G_p and suppose T_0 is the DFS-Tree for G_0. The division is DFS-preservable if and only if the S-graph is a DAG.
  • 171. Divide-Star Algorithm. Divide G according to the children of the root R of G. If the corresponding S-graph is a DAG, each subgraph can be computed independently. To deal with a strongly connected component: (1) modify T: add a virtual node R_S representing an SCC S; (2) modify the S-graph: for any edge (u, v) in the S-graph, if u ∉ S and v ∈ S, add the edge (u, R_S), and do the same for v; then remove all nodes in S and the corresponding edges; (3) modify the division: create a new tree T′ rooted at the virtual root R_S and connect it to the roots in the SCC.
  • 174. Divide-Star Algorithm. Divide the graph into 4 parts: G_0, G_1, G_2, G_3. [Figure: graph on nodes A to K with the virtual node DF; the subgraph roots shown are B, C, DF, and H.]
  • 175. Divide-TD Algorithm. The Divide-Star algorithm divides the graph according to the children of the root: the depth of T_0 is 1, and the number of subgraphs after dividing is at most the number of children. The Divide-TD algorithm enlarges T_0 and the corresponding S-graph, so it can produce more subgraphs than Divide-Star can.
  • 176. Divide-TD Algorithm. The Divide-TD algorithm enlarges T_0 to a Cut-Tree. Cut-Tree: given a tree T with root t_0, a cut-tree T_c is a subtree of T that satisfies two conditions: (1) the root of T_c is t_0; (2) for any node v ∈ T with child nodes v_1, v_2, …, v_k, if v ∈ T_c, then either v is a leaf node of T_c or T_c contains all child nodes v_1, v_2, …, v_k. With these conditions, only two situations exist for any S-edge (u, v): u, v ∈ T_c, or u, v ∉ T_c.
  • 177. Cut-Tree Construction. Given a tree T with root r_0. Initially T_c contains only the root r_0. Iteratively pick a leaf node v in T_c and add all the child nodes of v in T. The process stops when memory could no longer hold T_c after adding the next nodes.
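The construction can be sketched as a level-by-level expansion. The slide does not say which leaf is picked next or how the memory limit is measured; the code below assumes leaves are expanded in first-in, first-out order and models the memory limit as a maximum node count `max_nodes`. Both choices, and the function name, are assumptions.

```python
from collections import deque

def build_cut_tree(children, root, max_nodes):
    """Grow a cut-tree from the root, adding all children of a leaf at once."""
    cut = {root}
    frontier = deque([root])          # current leaves of the cut-tree, oldest first
    while frontier:
        v = frontier[0]
        kids = children.get(v, [])
        if len(cut) + len(kids) > max_nodes:
            break                     # the next expansion would not fit in memory
        frontier.popleft()
        cut.update(kids)              # all children are added together
        frontier.extend(kids)
    return cut

# Hypothetical tree: A -> B, C, D; B -> E, F; C -> G.
TREE = {"A": ["B", "C", "D"], "B": ["E", "F"], "C": ["G"]}
cut_six = build_cut_tree(TREE, "A", max_nodes=6)
cut_five = build_cut_tree(TREE, "A", max_nodes=5)
```

Adding the children of a leaf all at once is what keeps the second cut-tree condition intact: every node in the result is either a leaf of the cut-tree or has all of its children inside it.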
  • 179. Divide-TD Algorithm. Add a virtual node DF for the SCC. [Figure: graph on nodes A to K with the cut-tree T_c and an SCC marked.]
  • 180. Divide-TD Algorithm. The S-graph is a DAG. Divide the graph into 5 parts: G_0, G_1, G_2, G_3, G_4. [Figure: graph on nodes A to K with the virtual node DF; the subgraph roots shown are B, C, DF, H, I, and K.]
  • 181. Merge Algorithm. According to the properties, if the DFS-Trees for the subgraphs are T_0, T_1, …, T_p, there exists a DFS-Tree T with V(T) = ⋃_{1≤i≤p} V(T_i) and E(T) = ⋃_{1≤i≤p} E(T_i). We only need to organize the T_i in the merged tree such that the resulting tree T is a DFS-Tree. Since the S-graph is a DAG in the division procedure, we can topologically sort it and organize the T_i according to the topological order. Remove each virtual node v and add edges from the parent of v to the children of v. It can be proven that the resulting tree is a DFS-Tree.
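Both the validity check (the S-graph must be a DAG) and the ordering step of the merge can be served by one topological sort. The sketch below uses Kahn's algorithm, which is a standard technique chosen here for illustration and not named in the slides; the node names r1 to r4 are hypothetical subtree roots.

```python
from collections import deque

def topological_order(nodes, edges):
    """Return a topological order of the graph, or None if it has a cycle."""
    indegree = {n: 0 for n in nodes}
    out_adj = {n: [] for n in nodes}
    for u, v in edges:
        out_adj[u].append(v)
        indegree[v] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for m in out_adj[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                ready.append(m)
    # Nodes on a cycle never reach indegree 0, so the order comes up short.
    return order if len(order) == len(nodes) else None

# Hypothetical S-graph over four subtree roots.
ROOTS = ["r1", "r2", "r3", "r4"]
order = topological_order(ROOTS, [("r1", "r2"), ("r2", "r3"), ("r1", "r4")])
cyclic = topological_order(["r1", "r2"], [("r1", "r2"), ("r2", "r1")])
```

A `None` result signals that the S-graph contains a cycle, which is the case the division algorithms handle by introducing a virtual node for the SCC. How the returned order maps to the left-to-right order of the T_i in the merged tree is left to the merge algorithm itself.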
  • 185. Performance Studies. Implemented in Visual C++ 2010. Tested on a PC with an Intel Core2 Quad 2.66GHz CPU and 4GB memory running Windows 7 Enterprise. Disk block size: 64KB.
  • 186. Four Real Data Sets.
      Data set          |V|           |E|             Average Degree
      Wikilinks         25,942,246    601,038,301     23.16
      Arabic-2005       22,744,080    639,999,458     28.14
      Twitter-2010      41,652,230    1,468,365,182   35.25
      WEBGRAPH-UK2007   105,895,908   3,738,733,568   35.00
  • 187. Web-graph Results. Memory size: 2GB. Varying node size percentage.
  • 188. Conclusion. We study I/O-efficient DFS algorithms for a large graph. We analyze the drawbacks of the existing semi-external DFS algorithm. We discuss the challenges and four properties needed to find a divide & conquer approach. Based on the properties, we design two novel graph division algorithms and a merge algorithm to reduce the cost of running DFS on the graph. We have conducted extensive performance studies to confirm the efficiency of our algorithms.
  • 189. Some Concluding Remarks. We also believe that there are many things we need to do on large graphs or big graphs. We know what we have known on graph processing. We do not know yet what we do not know on graph processing. We need to explore many directions, such as parallel computing, distributed computing, streaming computing, and semi-external/external computing.
  • 190. I/O Cost Minimization. If there does not exist a node u for v such that SCC(u, G_i) = SCC(v, G_i), v can be removed from G_{i+1}. For a node v, if neighbour(v, G_i) ⊆ V_{i+1}, v can be removed from V_{i+1}. The I/O complexity is O(sort(|V_i|) + sort(|E_i|) + scan(|E_i|)).
  • 191. Another Example. This edge gives all nodes in a partial SCC the same order. Keep the tree-structure edges in memory. Only the depth of the reachable nodes matters, not their exact positions. Early-acceptance: merging SCCs partially whenever possible does not affect the order of the others. Early-rejection: prune non-SCC nodes when possible; here, prune the node “A”. In memory: black edges. On disk: red edges. [Figure: tree on nodes A to I.]
  • 192. Optimization: Early Acceptance. No need to remember dlink(u, T). Merge nodes of the same order when an edge (u, v) is found, where v is an ancestor of u in T. Smaller graph size, smaller I/O cost. Memory: 2 × |V|. [Figure: tree on nodes A to K; two up-edges are labelled “Modify Tree” and two edges are labelled “Early-Acceptance”.]