This document discusses influenceability estimation in social networks. It describes the independent cascade model of influence diffusion, where each node has an independent probability of influencing its neighbors. The problem is to estimate the expected number of nodes reachable from a given seed node. The document presents the naive Monte Carlo (NMC) approach, which samples possible graphs and averages the number of reachable nodes over the samples. While NMC provides an unbiased estimator, it has high variance. The document aims to reduce the variance to improve estimation accuracy.
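A minimal sketch of the NMC estimator under the independent cascade model (the graph, edge probabilities, and seed below are illustrative assumptions, not taken from the text): each sample keeps every edge independently with its influence probability, counts the nodes reachable from the seed by BFS, and the estimate is the average over samples.

```python
import random
from collections import deque

def sample_reachable(adj, prob, seed):
    """One Monte Carlo sample: keep each edge with its probability, BFS from seed."""
    visited = {seed}
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in visited and random.random() < prob[(u, v)]:
                visited.add(v)
                queue.append(v)
    return len(visited)

def nmc_influence(adj, prob, seed, n_samples=10000):
    """Naive Monte Carlo: unbiased average of reachable-set sizes."""
    return sum(sample_reachable(adj, prob, seed) for _ in range(n_samples)) / n_samples

# Toy chain 0 -> 1 -> 2, each edge kept with probability 0.5.
adj = {0: [1], 1: [2]}
prob = {(0, 1): 0.5, (1, 2): 0.5}
random.seed(0)
est = nmc_influence(adj, prob, 0)
print(est)  # true expectation is 1 + 0.5 + 0.25 = 1.75
```

Because each sample is an independent draw, the estimator is unbiased, but its variance shrinks only as 1/n_samples, which is exactly the weakness the document sets out to address.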
Control of Photo Sharing on Online Social Network (SAFAD ISMAIL)

A social networking service (also social networking site, SNS, or social media) is an online platform which people use to build social networks or social relations with other people who share similar personal or career interests, activities, backgrounds, or real-life connections.
Methodological study of opinion mining and sentiment analysis techniques (ijsc)

Decision making, at both the individual and the organizational level, is always accompanied by a search for others' opinions on the matter. The tremendous growth of opinion-rich resources such as reviews, forum discussions, blogs, micro-blogs, and Twitter provides a rich anthology of sentiments. This user-generated content can serve as a benefit to the market if its semantic orientations are analyzed. Opinion mining and sentiment analysis are the formalizations for studying and construing opinions and sentiments. The digital ecosystem has itself paved the way for the use of the huge volume of opinionated data recorded. This paper is an attempt to review and evaluate the various techniques used for opinion and sentiment analysis.
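The simplest family of techniques such surveys cover, lexicon-based scoring, can be sketched in a few lines (the word lists below are tiny illustrative assumptions, not a real sentiment lexicon):

```python
# Minimal lexicon-based sentiment scorer: count positive minus negative words.
POSITIVE = {"good", "great", "excellent", "love", "amazing"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "awful"}

def sentiment_score(text):
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment_score("great phone, amazing battery"))   # 2
print(sentiment_score("terrible service, bad support"))  # -2
```

Real systems replace the hand-picked sets with large lexicons or learned classifiers, and handle negation and punctuation, but the positive-minus-negative idea is the common baseline.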
On building more human query answering systems (INRIA-OAK)

The underlying principle behind every query answering system is the existence of a query describing the information of interest. When this model is applied to non-expert users, two traditional issues become highly significant.
The first is that many queries are over-specified, leading to empty answers. We propose a principled, optimization-based interactive query relaxation framework for such queries. The framework dynamically computes and suggests alternative queries with fewer conditions, to help the user arrive at a query with a non-empty answer, or at a query for which it is clear that, independently of the relaxations, the answer will always be empty.
The second issue is the user's lack of expertise in accurately describing the requirements of the elements of interest. The user may, though, know examples of elements that they would like to have in the results. We introduce a novel query paradigm in which queries are no longer specifications of what the user is searching for, but simply a sample of what the user knows to be of interest. We refer to this novel form of queries as Exemplar Queries.
Spotify uses a range of machine learning models to power its music recommendation features, including the Discover page and Radio. Due to the iterative nature of training these models, they suffer from the I/O overhead of Hadoop and are a natural fit for the Spark programming paradigm. In this talk I will present both the right way and the wrong way to implement collaborative filtering models with Spark. Additionally, I will deep-dive into how matrix factorization is implemented in the MLlib library.
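The matrix factorization at the heart of such collaborative filtering can be sketched with plain NumPy alternating least squares; the tiny rating matrix and hyperparameters below are illustrative assumptions, and MLlib's distributed ALS follows the same alternation at scale:

```python
import numpy as np

def als(R, k=2, reg=0.1, iters=20):
    """Alternating least squares on a dense rating matrix R (0 = unobserved)."""
    m, n = R.shape
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(m, k))
    V = rng.normal(scale=0.1, size=(n, k))
    mask = R > 0
    for _ in range(iters):
        for u in range(m):  # solve for each user factor with item factors fixed
            Vu = V[mask[u]]
            U[u] = np.linalg.solve(Vu.T @ Vu + reg * np.eye(k), Vu.T @ R[u, mask[u]])
        for i in range(n):  # solve for each item factor with user factors fixed
            Ui = U[mask[:, i]]
            V[i] = np.linalg.solve(Ui.T @ Ui + reg * np.eye(k), Ui.T @ R[mask[:, i], i])
    return U, V

# Toy 3-user x 3-item rating matrix; zeros mark unobserved entries.
R = np.array([[5, 3, 0], [4, 0, 1], [0, 2, 5]], dtype=float)
U, V = als(R)
pred = U @ V.T  # predicted ratings, including the unobserved cells
```

Each half-step is a closed-form ridge regression, which is why ALS parallelizes so naturally: every user (and every item) can be solved independently.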
This presentation compares two wavelet image compression techniques, STW and SPIHT. The compression is performed on black-and-white images using the MATLAB Wavelet Toolbox, and four parameters, PSNR, MSE, compression ratio (CR), and file size, are used for the comparison.
Erik Bernhardsson is the CTO at Better, a small startup in NYC working with mortgages. Before Better, he spent five years at Spotify managing teams working with machine learning and data analytics, in particular music recommendations.
Abstract Summary:
Nearest Neighbor Methods and Vector Models: Vector models are used in many different fields: natural language processing, recommender systems, computer vision, and more. They are fast and convenient and are often state of the art in terms of accuracy. One of the challenges with vector models is that as the number of dimensions increases, finding similar items gets harder. Erik developed a library called "Annoy" that uses a forest of random trees to do fast approximate nearest neighbor queries in high-dimensional spaces. We will cover some specific applications of vector models and how Annoy works.
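Conceptually, each tree in Annoy's forest partitions the space with random hyperplanes and queries only need to search a few small leaves. A minimal single-tree sketch of that idea (not Annoy's actual implementation; the data are randomly generated for illustration):

```python
import numpy as np

def build_tree(ids, X, rng, leaf_size=16):
    """Recursively split points by a random hyperplane until leaves are small."""
    if len(ids) <= leaf_size:
        return ids
    normal = rng.normal(size=X.shape[1])
    proj = X[ids] @ normal
    thresh = np.median(proj)
    left, right = ids[proj <= thresh], ids[proj > thresh]
    if len(left) == 0 or len(right) == 0:  # degenerate split: stop here
        return ids
    return (normal, thresh, build_tree(left, X, rng, leaf_size),
            build_tree(right, X, rng, leaf_size))

def query(tree, q, X, k):
    """Descend to one leaf, then rank its points by true distance."""
    while isinstance(tree, tuple):
        normal, thresh, left, right = tree
        tree = right if q @ normal > thresh else left
    leaf = tree
    d = np.linalg.norm(X[leaf] - q, axis=1)
    return [leaf[i] for i in np.argsort(d)[:k]]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
tree = build_tree(np.arange(1000), X, rng)
neighbors = query(tree, X[0], X, k=5)
```

Annoy builds many such trees and merges their candidate leaves, which is what recovers accuracy lost by any single random partition.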
Churn prediction in mobile social games towards a complete assessment using ... (Alain Saas)

Reducing user attrition, i.e. churn, is a broad challenge faced by several industries. In mobile social games, decreasing churn is decisive to increase player retention and raise revenues. Churn prediction models allow us to understand player loyalty and to anticipate when players will stop playing a game. Thanks to these predictions, several initiatives can be taken to retain those players who are most likely to churn.
Survival analysis focuses on predicting the time of occurrence of a certain event, churn in our case. Classical methods, like regressions, could be applied only once all players have left the game. The challenge arises for datasets with incomplete churning information, as most players still connect to the game. This is called a censored data problem and is in the nature of churn. Censoring is commonly dealt with using survival analysis techniques, but due to the inflexibility of the survival statistical algorithms, the accuracy achieved is often poor. In contrast, novel ensemble learning techniques, increasingly popular in a variety of scientific fields, provide high-quality prediction results.
In this work, we develop, for the first time in the social games domain, a survival ensemble model which provides a comprehensive analysis together with an accurate prediction of churn. For each player, we predict the probability of churning as a function of time, which permits us to distinguish various levels of loyalty profiles. Additionally, we assess the risk factors that explain the predicted player survival times. Our results show that churn prediction by survival ensembles significantly improves the accuracy and robustness of traditional analyses, like Cox regression.
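The censored-data setup described above can be made concrete with a Kaplan-Meier estimator, the basic survival analysis tool on which such models build (the toy play-time data are an illustrative assumption, not from the paper):

```python
def kaplan_meier(times, churned):
    """Survival curve S(t) from right-censored data.

    times[i]   = days player i was observed
    churned[i] = True if the player churned at times[i], False if still playing (censored)
    """
    events = sorted({t for t, c in zip(times, churned) if c})
    surv, points = 1.0, []
    for t in events:
        at_risk = sum(ti >= t for ti in times)          # players still observed at t
        deaths = sum(ti == t and ci for ti, ci in zip(times, churned))
        surv *= 1 - deaths / at_risk                    # product-limit update
        points.append((t, surv))
    return points

# 6 players: churn events at days 2, 5, 5; three players censored (still active).
curve = kaplan_meier([2, 3, 5, 5, 7, 8], [True, False, True, True, False, False])
print(curve)  # [(2, 5/6), (5, 5/12)]
```

Note how the censored players at days 3, 7, and 8 still contribute to the at-risk counts; simply discarding them, as a naive regression would, biases survival downward.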
There is an increasing need for large-scale recommendation systems. Typical solutions rely on periodically retrained batch algorithms, but for massive amounts of data, training a new model can take hours. This is a problem when the model needs to be more up to date: for example, when recommending TV programs while they are being transmitted, the model should take into consideration users who are watching a program at that moment.
The promise of online recommendation systems is fast adaptation to change, but online machine learning from streams is commonly believed to be more restricted, and hence less accurate, than batch-trained models. Combining batch and online learning could lead to a quickly adapting recommendation system with increased accuracy. However, designing a scalable data system that unites batch and online recommendation algorithms is a challenging task. In this talk we present our experiences in creating such a recommendation engine with Apache Flink and Apache Spark.
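The online half of such a hybrid can be sketched as per-event SGD updates to user and item factors as ratings stream in (the class name, hyperparameters, and toy stream are illustrative assumptions, not the talk's Flink/Spark code):

```python
import numpy as np

class OnlineMF:
    """Streaming matrix factorization: one SGD step per incoming (user, item, rating)."""

    def __init__(self, n_users, n_items, k=8, lr=0.05, reg=0.02, seed=0):
        rng = np.random.default_rng(seed)
        self.U = rng.normal(scale=0.1, size=(n_users, k))
        self.V = rng.normal(scale=0.1, size=(n_items, k))
        self.lr, self.reg = lr, reg

    def predict(self, u, i):
        return float(self.U[u] @ self.V[i])

    def update(self, u, i, r):
        err = r - self.predict(u, i)
        self.U[u] += self.lr * (err * self.V[i] - self.reg * self.U[u])
        self.V[i] += self.lr * (err * self.U[u] - self.reg * self.V[i])
        return err

model = OnlineMF(n_users=100, n_items=50)
stream = [(0, 3, 4.0), (1, 3, 5.0), (0, 7, 2.0)] * 300  # repeating toy event stream
for u, i, r in stream:
    model.update(u, i, r)
```

A batch job would periodically refit the factors from the full history and swap them in, while this online loop keeps them fresh between retrains; that division of labor is the essence of the combination the talk describes.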
Introduction to Topological Data Analysis (Mason Porter)
Here are slides for my 3/14/21 talk on an introduction to topological data analysis.
This is the first talk in our Short Course on topological data analysis at the 2021 American Physical Society (APS) March Meeting: https://march.aps.org/program/dsoft/gsnp-short-course-introduction-to-topological-data-analysis/
(141205) Masters_Thesis_Defense_Sundong_Kim (Sundong Kim)

Master's thesis defense presentation slides
Topic: Maximizing Influence over a Target User through Friend Recommendation
Presenter: Sundong Kim @ KAIST ISysE department
Keywords: Social network, Friend recommendation, Incremental algorithm, Maximizing influence
A new similarity measurement based on Hellinger distance for collaborating fi... (Prabhu Kumar)

This project proposes a similarity measure focused on recommendation performance under the cold-start problem (the problem that occurs when recommending newly arrived items and users, which have no history in the system) and well suited to sparse datasets.
This technique addresses the cold-start problem in recommender systems and improves the performance of the recommendations given to users.
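A Hellinger distance between two users' rating distributions (not necessarily the paper's exact formulation; the rating histograms below are illustrative) can be sketched as:

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions over the same support."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2
                         for a, b in zip(p, q))) / math.sqrt(2)

def rating_distribution(ratings, levels=5):
    """Normalize a user's ratings into a histogram over 1..levels."""
    counts = [0] * levels
    for r in ratings:
        counts[r - 1] += 1
    total = sum(counts)
    return [c / total for c in counts]

u = rating_distribution([5, 4, 5, 5])  # mostly 5-star user
v = rating_distribution([1, 2, 1, 2])  # mostly low ratings
w = rating_distribution([5, 5, 4, 4])  # similar taste to u
print(hellinger(u, v), hellinger(u, w))
```

The distance is bounded in [0, 1] and remains well defined even when a histogram bin is zero, which is one reason distribution-based measures behave better than Pearson correlation on the sparse, cold-start profiles the abstract targets.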
User Identity Linkage: Data Collection, DataSet Biases, Method, Control and A... (IIIT Hyderabad)

Online Social Networks (OSNs) are popular platforms for online users. Users typically register and maintain their accounts (user identities) across different OSNs to share a variety of content and remain connected with their friends. Consequently, linking user identities across OSN platforms, referred to as user identity linkage (UIL), becomes a critical problem. Solving this problem enables us to build a more comprehensive view of users' activities across OSNs, which is highly beneficial for targeted advertisements, recommendations, and many more applications. In the thesis, we propose approaches for analyzing data collection methods, investigating biases in identity linkage datasets, linking user identities across social networks, controllability of user identity linkage, and applying user identity linkage solutions to solve related problems.
Social Learning in Networks: Extraction of Deterministic Rules (Dmitrii Ignatov)

In this talk, we want to introduce experimental economics to the field of data mining, and vice versa. It continues related work on mining deterministic behavior rules of human subjects from data gathered in experiments. Game-theoretic predictions partially fail to work with this data. Equilibria, also known as game-theoretic predictions, succeed only with experienced subjects in specific games, conditions which are rarely given. Contemporary experimental economics offers a number of alternative models apart from game theory. In the relevant literature, these models are always biased by philosophical plausibility considerations and are claimed to fit the data. An agnostic data mining approach to the problem is introduced in this paper: the philosophical plausibility considerations follow after the correlations are found. No biases are regarded apart from determinism. The dataset of the paper "Social Learning in Networks" by Choi et al. (2012) is taken for evaluation. As a result, we come up with new findings. As future work, the design of a new infrastructure is discussed.
We study the problem of profit maximization in social networks through influence diffusion. We propose an elegant model that describes the diffusion process and distinguishes between the states of being influenced and of adopting a product. We then give efficient and effective algorithms to solve this NP-hard problem.
In this lecture, I will first cover recent advances in neural recommender systems, such as autoencoder-based and MLP-based recommender systems. Then, I will introduce recent achievements in automatic playlist continuation for music recommendation.
Min-based qualitative possibilistic networks are one of the effective tools for a compact representation of decision problems under uncertainty. The exact approaches for computing decisions based on possibilistic networks are limited by the size of the possibility distributions. Generally, these approaches are based on possibilistic propagation algorithms. An important step in the computation of the decision is the transformation of the DAG into a secondary structure known as the junction tree. This transformation is known to be costly and represents a difficult problem. We propose in this paper a new approximate approach for the computation of decisions under uncertainty within possibilistic networks. The computation of the optimal optimistic decision no longer goes through the junction tree construction step. Instead, it is performed by calculating the degree of normalization in the moral graph resulting from merging the possibilistic network codifying the knowledge of the agent with the one codifying its preferences.
R. Zafarani, M. A. Abbasi, and H. Liu, Social Media Mining: An Introduction, Cambridge University Press, 2014.
Free book and slides at http://socialmediamining.info/
Social Network Analysis: What It Is, Why We Should Care, and What We Can Lear... (Xiaohan Zeng)

The advent of social networks has completely changed our daily life. The deluge of data collected on Social Network Services (SNS) and recent developments in complex network theory have enabled many marvelous predictive analyses, which tell us many amazing stories.
Why do we often feel that "the world is so small"? Is six degrees of separation pure imagination, or is it based on mathematical insights? Why are there just a few rock stars who enjoy extreme popularity while most of us stay unknown to the world? When science meets coffee-shop knowledge, things are bound to be intriguing.
I will first briefly describe what social networks are, in the mathematical sense. Then I will introduce some ways to extract characteristics of networks, and how these analyses can explain many anecdotes in our life. Finally, I'll show an example of what we can learn from social network analysis, based on data from Groupon.
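Two of the network characteristics alluded to, degree distribution and local clustering, can be computed directly from an adjacency list; the toy graph below is an illustrative assumption:

```python
from collections import Counter

def degrees(adj):
    """Degree of every node in an undirected adjacency-set representation."""
    return {v: len(nbrs) for v, nbrs in adj.items()}

def local_clustering(adj, v):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    nbrs = list(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2 * links / (k * (k - 1))

# Toy undirected graph: triangle a-b-c plus a pendant node d attached to a.
adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
print(degrees(adj))
print(Counter(degrees(adj).values()))  # degree distribution
print(local_clustering(adj, "a"))      # 1 of a's 3 neighbor pairs is linked -> 1/3
```

High clustering with short paths is precisely the small-world signature behind the "the world is so small" intuition the talk explores.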
Machine Status Prediction for Dynamic and Heterogeneous Cloud Environment (jins0618)

The widespread utilization of cloud computing services has brought about the emergence of cloud service reliability as an important issue for both cloud providers and users. To enhance cloud service reliability and reduce the subsequent losses, the future status of virtual machines should be monitored in real time and predicted before they crash. However, most existing methods ignore the following two characteristics of the actual cloud environment, and as a result perform poorly at status prediction: 1. the cloud environment is dynamically changing; 2. the cloud environment consists of many heterogeneous physical and virtual machines. In this paper, we investigate the predictive power of data collected from the cloud environment, and propose a simple yet general machine learning model, StaP, to predict multiple machine statuses. We introduce the motivation, model development, and optimization of the proposed StaP. The experimental results validate the effectiveness of the proposed StaP.
Latent Interest and Topic Mining on User-item Bipartite Networks (jins0618)

The Latent Factor Model (LFM) is extensively used for dealing with user-item bipartite networks in service recommendation systems. To alleviate the limitations of LFM, this paper presents a novel unsupervised learning model, the Latent Interest and Topic Mining model (LITM), to automatically mine latent user interests and item topics from user-item bipartite networks. In particular, we introduce the motivation and objectives of this bipartite-network-based approach, and detail the model development and optimization process of the proposed LITM. This work not only provides an efficient method for latent user interest and item topic mining, but also highlights a new way to improve the accuracy of service recommendation. Experimental studies are performed, and the results validate the LITM's efficiency in model training and demonstrate its ability to provide better service recommendation performance on user-item bipartite networks.
Web Service QoS Prediction Approach in Mobile Internet Environments (jins0618)

Many existing Web service QoS prediction approaches are very accurate in Internet environments; however, they cannot provide accurate prediction values in Mobile Internet environments, since the QoS values of Web services there have great volatility. In this paper, we propose an accurate Web service QoS prediction approach that weakens the volatility of QoS data from Web services in Mobile Internet environments. This approach contains three processes: QoS preprocessing, user similarity computing, and QoS predicting. We have implemented our proposed approach in experiments based on real-world and synthetic datasets. The results show that our approach outperforms other approaches in Mobile Internet environments.
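The "user similarity computing" and "QoS predicting" steps can be sketched as classic user-based collaborative filtering with Pearson similarity (the QoS matrix of observed response times below is an illustrative assumption, not the paper's data or exact method):

```python
import math

def pearson(u, v, common):
    """Pearson correlation between two users over their common services."""
    mu = sum(u[s] for s in common) / len(common)
    mv = sum(v[s] for s in common) / len(common)
    num = sum((u[s] - mu) * (v[s] - mv) for s in common)
    den = (math.sqrt(sum((u[s] - mu) ** 2 for s in common))
           * math.sqrt(sum((v[s] - mv) ** 2 for s in common)))
    return num / den if den else 0.0

def predict(qos, user, service):
    """Similarity-weighted average of other users' QoS values for the service."""
    num = den = 0.0
    for other, vals in qos.items():
        if other == user or service not in vals:
            continue
        common = (set(qos[user]) & set(vals)) - {service}
        if len(common) < 2:
            continue
        w = pearson(qos[user], vals, common)
        num += w * vals[service]
        den += abs(w)
    return num / den if den else None

# Response times (seconds) observed by three users for services s1..s3.
qos = {
    "u1": {"s1": 0.2, "s2": 0.4},
    "u2": {"s1": 0.25, "s2": 0.45, "s3": 0.6},
    "u3": {"s1": 0.9, "s2": 0.3, "s3": 0.1},
}
print(predict(qos, "u1", "s3"))
```

The paper's contribution sits upstream of this step: preprocessing that dampens the volatility of mobile QoS observations before similarities are computed.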
Adjusting primitives for graph: SHORT REPORT / NOTES (Subhajit Sahu)

Graph algorithms, like PageRank, commonly operate on Compressed Sparse Row (CSR), an adjacency-list-based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... (Subhajit Sahu)

Abstract: Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Opendatabay - Open Data Marketplace.pptx (Opendatabay)

Opendatabay.com unlocks the power of data for everyone. The Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
It is the first open hub for data enthusiasts to collaborate and innovate: a platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, Opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. It leverages cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
StarCompliance is a leading firm specializing in the recovery of stolen cryptocurrency. Our comprehensive services are designed to assist individuals and organizations in navigating the complex process of fraud reporting, investigation, and fund recovery. We combine cutting-edge technology with expert legal support to provide a robust solution for victims of crypto theft.
Our Services Include:
Reporting to Tracking Authorities:
We immediately notify all relevant centralized exchanges (CEX), decentralized exchanges (DEX), and wallet providers about the stolen cryptocurrency. This ensures that the stolen assets are flagged as scam transactions, making it impossible for the thief to use them.
Assistance with Filing Police Reports:
We guide you through the process of filing a valid police report. Our support team provides detailed instructions on which police department to contact and helps you complete the necessary paperwork within the critical 72-hour window.
Launching the Refund Process:
Our team of experienced lawyers can initiate lawsuits on your behalf and represent you in various jurisdictions around the world. They work diligently to recover your stolen funds and ensure that justice is served.
At StarCompliance, we understand the urgency and stress involved in dealing with cryptocurrency theft. Our dedicated team works quickly and efficiently to provide you with the support and expertise needed to recover your assets. Trust us to be your partner in navigating the complexities of the crypto world and safeguarding your investments.
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to reduce the work per iteration, and the other is to reduce the number of iterations; these goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, i.e. those with the same in-links, helps avoid duplicate computations and thus could reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be easily calculated; this could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which could reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
1. Large Graph Processing
Jeffrey Xu Yu (ไบๆญ)
Department of Systems Engineering and
Engineering Management
The Chinese University of Hong Kong
yu@se.cuhk.edu.hk, http://www.se.cuhk.edu.hk/~yu
5. Facebook Social Network
๏ฎ In 2011, 721 million users, 69 billion friendship links. The
degree of separation is 4. (Four Degrees of Separation by
Backstrom, Boldi, Rosa, Ugander, and Vigna, 2012)
5
6. The Scale/Growth of Social Networks
๏ฎ Facebook statistics
๏ฑ 829 million daily active users on average in June 2014
๏ฑ 1.32 billion monthly active users as of June 30, 2014
๏ฑ 81.7% of daily active users are outside the U.S. and
Canada
๏ฑ 22% increase in Facebook users from 2012 to 2013
๏ฎ Facebook activities (every 20 minutes on Facebook)
๏ฑ 1 million links shared
๏ฑ 2 million friends requested
๏ฑ 3 million messages sent
http://newsroom.fb.com/company-info/
http://www.statisticbrain.com/facebook-statistics/
6
7. The Scale/Growth of Social Networks
๏ฎ Twitter statistics
๏ฑ 271 million monthly active users in 2014
๏ฑ 135,000 new users signing up every day
๏ฑ 78% of Twitter active users are on mobile
๏ฑ 77% of accounts are outside the U.S.
๏ฎ Twitter activities
๏ฑ 500 million Tweets are sent per day
๏ฑ 9,100 Tweets are sent per second
https://about.twitter.com/company
http://www.statisticbrain.com/twitter-statistics/
7
12. Graph Mining/Querying/Searching
๏ฎ We have been working on many graph problems.
๏ฑ Keyword search in databases
๏ฑ Reachability query over large graphs
๏ฑ Shortest path query over large graphs
๏ฑ Large graph pattern matching
๏ฑ Graph clustering
๏ฑ Graph processing on Cloud
๏ฑ โฆโฆ
12
16. ๏ฎ Real rating systems (users and objects)
๏ฑ Online shopping websites (Amazon) www.amazon.com
๏ฑ Online product review websites (Epinions) www.epinions.com
๏ฑ Paper review system (Microsoft CMT)
๏ฑ Movie rating (IMDB)
๏ฑ Video rating (Youtube)
Reputation-based Ranking
16
17. The Bipartite Rating Network
๏ฎ Two entities: users and objects
๏ฎ Users can give rating to objects
๏ฎ If we take the average as the ranking score of an object,
o1 and o3 are the top.
๏ฎ If we consider the userโs reputation, e.g., u4, โฆ
Objects
Users
Ratings
17
18. Reputation-based Ranking
๏ฎ Two fundamental problems
๏ฑ How to rank objects using the ratings?
๏ฑ How to evaluate usersโ rating reputation?
๏ฎ Algorithmic challenges
๏ฑ Robustness
๏ฎ Robust to the spamming users
๏ฑ Scalability
๏ฎ Scalable to large networks
๏ฑ Convergence
๏ฎ Convergent to a unique and fixed ranking vector
18
19. Signed/Unsigned Trust Networks
๏ฎ Signed Trust Social Networks (users): A user can
express trust/distrust of others by a positive/negative
trust score.
๏ฑ Epinions (www.epinions.com)
๏ฑ Slashdot (www.slashdot.org)
๏ฎ Unsigned Trust Social Networks (users): A user can only
express their trust.
๏ฑ Advogato (www.advogato.org)
๏ฑ Kaitiaki (www.kaitiaki.org.nz)
๏ฎ Unsigned Rating Networks (users and objects)
๏ฑ Question-Answer systems
๏ฑ Movie-rating systems (IMDB)
๏ฑ Video rating systems in Youtube
19
20. The Trustworthiness of a User
๏ฎ The final trustworthiness of a user is determined by how
users trust each other in a global context and is
measured by bias.
๏ฎ The bias of a user reflects the extent to which his/her
opinions differ from others'.
๏ฎ If a user has zero bias, then his/her opinions are 100%
unbiased and fully accepted.
๏ฎ Such a user has high trustworthiness.
๏ฎ The trustworthiness, the trust score, of a user is
1 โ his/her bias score.
20
21. An Existing Approach
๏ฎ MB [Mishra and Bhattacharya, WWWโ11]
๏ฑ The trustworthiness of a user cannot be trusted,
because MB measures the bias of a user by the relative
differences between his/her scores and others'.
๏ฑ If a user gives all his/her friends a much higher trust
score than the average of others, and gives all his/her
foes a much lower trust score than the average of
others, such differences cancel out. This user has
zero bias and can be 100% trusted.
21
22. An Example
๏ฎ Node 5 gives a trust score
t_51 = 0.1 to node 1. Node 2
and node 3 give a high trust
score t_21 = t_31 = 0.8 to
node 1.
๏ฎ Node 5 is different from
the others (biased): 0.1 ≠ 0.8.
22
23. MB Approach
๏ฎ The bias of a node i is b_i.
๏ฎ The prestige score of node i is r_i.
๏ฎ The iterative system is
23
24. An Example
๏ฎ Consider 5 → 1, 2 → 1, 3 → 1.
๏ฑ Trust score difference = 0.1 − 0.8 = −0.7.
๏ฎ Consider 2 → 3, 4 → 3, 5 → 3.
๏ฑ Trust score difference = 0.9 − 0.2 = 0.7.
๏ฎ Node 5 has zero bias.
๏ฎ The bias scores by MB.
24
25. Our Approach
๏ฎ To address it, consider a contraction mapping.
๏ฎ Given a metric space X with a distance function d(·, ·).
๏ฎ A mapping f from X to X is a contraction mapping if
there exists a constant c with 0 ≤ c < 1 such that
d(f(x), f(y)) ≤ c · d(x, y).
๏ฎ Such an f has a unique fixed point.
25
26. Our Approach
๏ฎ We use two vectors, b and r, for bias and prestige.
๏ฎ Let b_i = (f(r))_i denote the bias of node i, where r is
the prestige vector of the nodes, f(·) is a vector-valued
contractive function, and (f(r))_i is the i-th element of
the vector f(r).
๏ฎ Let 0 ≤ f(r) ≤ e, where e = [1, 1, …, 1]^T.
๏ฎ For any x, y ∈ R^n, the function f: R^n → R^n is a
vector-valued contractive function if the following condition
holds,
|f(x) − f(y)| ≤ c ‖x − y‖_∞ · e,
where c ∈ [0, 1) and ‖·‖_∞ denotes the infinity norm.
26
27. The Framework
๏ฎ Use a vector-valued contractive function, which is a
generalization of the contracting mapping in the fixed
point theory.
๏ฎ MB is a special case of our framework.
๏ฎ The iterative system converges to a unique fixed
prestige and bias vector at an exponential rate of
convergence.
๏ฎ We can handle both unsigned and signed trust social
networks.
27
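The exponential-rate convergence claimed above is exactly the behavior of the Banach fixed-point theorem: iterating any contraction with constant c < 1 shrinks the distance to the unique fixed point by a factor of c per step. A minimal sketch with a toy scalar contraction (illustrative only, not the paper's actual bias/prestige update):

```python
# Iterating any contraction mapping f (constant c < 1) converges geometrically
# to its unique fixed point, which is why the iterative system converges.

def iterate_to_fixed_point(f, x0, tol=1e-12, max_iter=1000):
    """Repeatedly apply f until successive iterates are within tol."""
    x = x0
    for _ in range(max_iter):
        nxt = f(x)
        if abs(nxt - x) < tol:
            return nxt
        x = nxt
    return x

# f(x) = 0.5*x + 1 is a contraction with c = 0.5; its fixed point is x = 2.
fp = iterate_to_fixed_point(lambda x: 0.5 * x + 1, 10.0)
print(round(fp, 6))  # prints 2.0
```

The error after k steps is bounded by c^k times the initial error, matching the "exponential rate of convergence" on the slide.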
29. Diffusion in Networks
๏ฎ We care about the decisions made by friends and
colleagues.
๏ฎ Why imitate the behavior of others?
๏ฑ Informational effects: the choices made by others can
provide indirect information about what they know.
๏ฑ Direct-benefit effects: there are direct payoffs from
copying the decisions of others.
๏ฎ Diffusion: how new behaviors, practices, opinions,
conventions, and technologies spread through a social
network.
29
30. A Real World Example
๏ฎ Hotmailโs viral climb to
the top spot (90โs):
8 million users in
18 months!
๏ฎ Far more effective than
conventional advertising
by rivals and far cheaper
too!
30
31. Stochastic Diffusion Model
๏ฎ Consider a directed graph G = (V, E).
๏ฎ The diffusion of information (or influence) proceeds in
discrete time steps t = 0, 1, …. Each node v
has two possible states, inactive and active.
๏ฎ Let S_t ⊆ V be the set of active nodes at time t (the active set
at time t). S_0 is the seed set (the seeds of influence
diffusion).
๏ฎ A stochastic diffusion model (with discrete time steps) for
a social graph G specifies the randomized process of
generating the active sets S_t for all t ≥ 1, given the initial S_0.
๏ฎ A progressive model is one in which S_{t−1} ⊆ S_t for all t ≥ 1:
active nodes never become inactive.
31
32. Influence Spread
๏ฎ Let Φ(S_0) be the final active set (the eventually stable
active set), where S_0 is the initial seed set.
๏ฎ Φ(S_0) is a random set determined by the stochastic
process of the diffusion model.
๏ฎ The goal is to maximize the expected size of the final active set.
๏ฎ Let E[X] denote the expected value of a random
variable X.
๏ฎ The influence spread of a seed set S_0 is defined as
σ(S_0) = E[|Φ(S_0)|], where the expectation is taken
over all random events leading to Φ(S_0).
32
33. Independent Cascade Model (IC)
๏ฎ IC takes G = (V, E), the influence probabilities p on all
edges, and the initial seed set S_0 as input, and generates
the active sets S_t for all t ≥ 1.
๏ฑ At every time step t ≥ 1, first set S_t = S_{t−1}.
๏ฑ Next, for every inactive node v ∉ S_{t−1}, each node
u ∈ N^in(v) ∩ (S_{t−1} \ S_{t−2}) executes an activation attempt
with success probability p(u, v). If successful, v is
added to S_t, and we say u activates v at time t. If
multiple nodes activate v at time t, the end effect is the
same.
33
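The activation process above is straightforward to simulate. A minimal sketch, assuming for brevity a uniform influence probability p on every edge (the model allows per-edge probabilities p(u, v)):

```python
import random

def ic_spread(edges, p, seeds, rng=random):
    """One simulation of the Independent Cascade model.

    edges: dict u -> list of out-neighbors; p: uniform influence probability
    per edge (an assumption for brevity). Returns the final active set.
    """
    active = set(seeds)
    frontier = set(seeds)          # nodes activated in the previous step
    while frontier:
        nxt = set()
        for u in frontier:
            for v in edges.get(u, []):
                # each newly active u gets exactly one attempt on v
                if v not in active and rng.random() < p:
                    nxt.add(v)
        active |= nxt
        frontier = nxt
    return active

random.seed(0)
g = {0: [1, 2], 1: [3], 2: [3], 3: []}
print(len(ic_spread(g, 0.5, {0})))
```

Averaging `len(ic_spread(...))` over many runs estimates the influence spread σ(S_0) of the previous slide.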
36. Influenceability Estimation in Social Networks
๏ฎ Applications
๏ฑ Influence maximization for viral marketing
๏ฑ Influential nodes discovery
๏ฑ Online advertisement
๏ฎ The fundamental issue
๏ฑ How to evaluate the influenceability of a given node
in a social network?
36
37. ๏ฎ The independent cascade model.
๏ฑ Each node has an independent probability of
influencing its neighbors.
๏ฑ It can be modeled by a probabilistic graph, called an
influence network, G = (V, E, p).
๏ฑ A possible graph G_i = (V_i, E_i) has probability
Pr[G_i] = ∏_{e ∈ E_i} p(e) · ∏_{e ∈ E \ E_i} (1 − p(e)).
๏ฎ There are 2^|E| possible graphs (the sample space Ω).
Reconsider IC Model
37
39. ๏ฎ Independent cascade model.
๏ฑ Given a probabilistic graph G = (V, E, p), a possible graph
G_i = (V_i, E_i) has probability
Pr[G_i] = ∏_{e ∈ E_i} p(e) · ∏_{e ∈ E \ E_i} (1 − p(e)).
๏ฎ Given a graph G = (V, E, p) and a node s, estimate the
expected number of nodes that are reachable from s.
๏ฑ F_s(G) = Σ_{G_i ∈ Ω} Pr[G_i] · f_s(G_i), where f_s(G_i) is the
number of nodes that are reachable from the seed
node s in G_i.
The Problem
39
40. Reduce the Variance
๏ฎ The accuracy of an approximate algorithm is measured
by the mean squared error E[(F̂_s(G) − F_s(G))²].
๏ฎ By the variance-bias decomposition,
E[(F̂_s(G) − F_s(G))²] = Var(F̂_s(G)) + (E[F̂_s(G)] − F_s(G))².
๏ฑ Make the estimator unbiased → the 2nd term
vanishes.
๏ฑ Make the variance as small as possible.
40
41. Naรฏve Monte-Carlo (NMC)
๏ฎ Sample n possible graphs G_1, G_2, …, G_n.
๏ฎ For each sampled possible graph G_i, compute the
number of nodes that are reachable from s.
๏ฎ NMC estimator: the average of the number of reachable
nodes over the n possible graphs,
F̂_NMC = (1/n) · Σ_{i=1}^n f_s(G_i).
๏ฎ F̂_NMC is an unbiased estimator of F_s(G) since
E[F̂_NMC] = F_s(G).
๏ฎ NMC is the only existing algorithm used in the influence
maximization literature.
41
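The NMC estimator above can be sketched directly: sample possible graphs by flipping each edge independently, then average the reachable-set sizes. The edge-probability dict and the tiny example are illustrative, not from the slides:

```python
import random

def sample_possible_graph(prob, rng):
    """Keep each edge (u, v) independently with probability prob[(u, v)]."""
    adj = {}
    for (u, v), p in prob.items():
        if rng.random() < p:
            adj.setdefault(u, []).append(v)
    return adj

def reachable_count(adj, s):
    """f_s(G_i): number of nodes reachable from s, counting s itself."""
    seen, stack = {s}, [s]
    while stack:
        u = stack.pop()
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen)

def nmc_estimate(prob, s, n, seed=0):
    """F_NMC: average of f_s over n independently sampled possible graphs."""
    rng = random.Random(seed)
    return sum(reachable_count(sample_possible_graph(prob, rng), s)
               for _ in range(n)) / n
```

Each sample is one possible graph G_i drawn with probability Pr[G_i], so the average is unbiased for F_s(G).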
42. Naรฏve Monte-Carlo (NMC)
๏ฎ NMC estimator: the average of the number of reachable
nodes over the n possible graphs,
F̂_NMC = (1/n) · Σ_{i=1}^n f_s(G_i).
๏ฎ F̂_NMC is an unbiased estimator of F_s(G)
since E[F̂_NMC] = F_s(G).
๏ฎ The variance of NMC is
Var(F̂_NMC) = (E[f_s(G)²] − (E[f_s(G)])²) / n
= (Σ_{G_i ∈ Ω} Pr[G_i] · f_s(G_i)² − F_s(G)²) / n.
๏ฎ Computing this variance exactly is extremely expensive,
because it requires enumerating all the possible graphs.
42
43. Naรฏve Monte-Carlo (NMC)
๏ฎ In practice, one resorts to an unbiased estimator of Var(F̂_NMC):
V̂ar(F̂_NMC) = Σ_{i=1}^n (f_s(G_i) − F̂_NMC)² / (n − 1).
๏ฎ But Var(F̂_NMC) may be very large, because the values
f_s(G_i) fall into the interval [0, N − 1].
๏ฎ The variance can be up to O(N²).
43
44. Stratified Sampling
๏ฎ Stratified sampling divides a set of data items into
subsets before sampling.
๏ฎ Each subset is called a stratum.
๏ฎ The strata should be mutually exclusive and should
together include all data items in the set.
๏ฎ Stratified sampling can be used to reduce variance.
44
45. A Recursive Estimator [Jin et al. VLDBโ11]
๏ฎ Randomly select 1 edge to partition the probability
space (the set of all possible graphs) into 2 strata
(2 subsets)
๏ฑ The possible graphs in the first subset include
the selected edge.
๏ฑ The possible graphs in the second subset do
not include the selected edge.
๏ฎ Sample possible graphs in each stratum i with a
sample size n_i proportional to the probability of
that stratum.
๏ฎ Recursively apply the same idea in each stratum.
45
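The single-edge partitioning just described can be sketched as follows. This only illustrates conditioning on one edge and allocating samples proportionally to the two strata; the paper's estimator applies the idea recursively within each stratum:

```python
import random

def reachable_count(adj, s):
    """Number of nodes reachable from s (including s) in a sampled graph."""
    seen, stack = {s}, [s]
    while stack:
        u = stack.pop()
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen)

def stratified_estimate(prob, s, edge, n, seed=0):
    """Stratify on a single edge: one stratum forces `edge` present, the other
    forces it absent; sample sizes are proportional to the strata
    probabilities, and stratum means are recombined with those weights.
    A sketch of the one-edge idea only, not the full recursive estimator."""
    rng = random.Random(seed)
    p_e = prob[edge]
    n_a = round(n * p_e)                 # samples where `edge` is present
    means = []
    for present, size in ((True, n_a), (False, n - n_a)):
        total = 0
        for _ in range(size):
            adj = {}
            for (u, v), p in prob.items():
                keep = present if (u, v) == edge else (rng.random() < p)
                if keep:
                    adj.setdefault(u, []).append(v)
            total += reachable_count(adj, s)
        means.append(total / size if size else 0.0)
    # weight each stratum mean by its stratum probability
    return p_e * means[0] + (1 - p_e) * means[1]
```

The estimator stays unbiased because each stratum mean is unbiased for its conditional expectation, and the weights are the exact strata probabilities.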
46. A Recursive Estimator [Jin et al. VLDBโ11]
๏ฎ Advantages:
๏ฑ unbiased estimator with a smaller variance.
๏ฎ Limitations:
๏ฑ It selects only one edge for stratification, which is not
enough to significantly reduce the variance.
๏ฑ It selects edges randomly, which can result in a
large variance.
46
47. More Effective Estimators
๏ฎ Four Stratified Sampling (SS) Estimators
๏ฑ Type-I basic SS estimator (BSS-I)
๏ฑ Type-I recursive SS estimator (RSS-I)
๏ฑ Type-II basic SS estimator (BSS-II)
๏ฑ Type-II recursive SS estimator (RSS-II)
๏ฎ All are unbiased and their variances are significantly
smaller than the variance of NMC.
๏ฎ Time and space complexity of all are the same as
NMC.
47
48. Type-I Basic Estimator (BSS-I)
๏ฎ Select r edges to partition the probability space (all the
possible graphs) into 2^r strata.
๏ฎ Each stratum corresponds to a probability subspace
(a set of possible graphs).
๏ฎ Let π_i = Pr[G_j ∈ Ω_i] denote the probability of stratum i.
๏ฎ How to select the r edges: BFS or random
48
50. Type-I Recursive Estimator (RSS-I)
๏ฎ Recursively apply BSS-I within each stratum, until the sample
size reaches a given threshold.
๏ฎ RSS-I is unbiased and its variance is smaller than that of BSS-I.
๏ฎ Time and space complexity are the same as NMC.
[Figure: recursive stratification with sample size n, from BSS-I to RSS-I]
50
51. Type-II Basic Estimator (BSS-II)
๏ฎ Select r edges to partition the probability space (all the
possible graphs) into r + 1 strata.
๏ฎ Similarly, each stratum corresponds to a probability
subspace (a set of possible graphs).
๏ฎ How to select the r edges: BFS or random
51
54. ๏ฎ Social browsing: a process by which users in a social network find
information along their social ties.
๏ฑ Photo sharing on Flickr, online advertisements
๏ฎ Two issues:
๏ฑ Problem-I: How to place items on k users in a social
network so that the other users can easily discover them by
social browsing?
๏ฎ Minimize the expected number of hops with which every
node hits the target set.
๏ฑ Problem-II: How to place items on k users so that as many
users as possible can discover them by social browsing?
๏ฎ Maximize the expected number of nodes that hit the
target set.
Social Browsing
54
55. ๏ฎ The two problems are random-walk problems.
๏ฎ L-length random walk model: the path length of a
random walk is bounded by a nonnegative number L.
๏ฑ A random walk in general can be considered as L = ∞.
๏ฎ Let X_u^t be the position, at discrete time t, of an L-length
random walk starting from node u.
๏ฎ Let Y_uv^L be a random walk variable:
๏ฑ Y_uv^L ≜ min{ min{t : X_u^t = v, t ≥ 0}, L }
๏ฎ The hitting time h_uv^L is defined as the expectation
of Y_uv^L:
๏ฑ h_uv^L = E[Y_uv^L]
The Random Walk
55
56. The Hitting Time
๏ฎ Sarkar and Moore in UAI'07 define the hitting time of the
L-length random walk in a recursive manner:
h_uv^L = 0 if u = v,
h_uv^L = 1 + Σ_{w ∈ V} p_uw · h_wv^{L−1} if u ≠ v.
๏ฎ Our hitting time can be computed by this recursive
procedure.
๏ฑ Let d_u be the degree of node u and N(u) be the set of
neighbor nodes of u.
๏ฎ p_uw = 1/d_u is the transition probability for w ∈ N(u),
and p_uw = 0 otherwise.
56
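The recursion above lends itself to dynamic programming over the remaining walk length: compute h^0 for all nodes, then h^1, and so on up to h^L. A sketch, assuming an unweighted graph so that p_uw = 1/d_u:

```python
def hitting_time(adj, u, v, L):
    """Expected truncated hitting time h_uv^L of an L-length random walk,
    via the recursion h = 0 if u == v, else 1 + sum_w p_uw * h_wv^{L-1}.
    adj: dict node -> list of neighbors (p_uw = 1/deg(u))."""
    nodes = list(adj)
    h = {w: 0.0 for w in nodes}           # base case: h_wv^0 = 0 for all w
    for _ in range(L):
        nh = {}
        for w in nodes:
            if w == v:
                nh[w] = 0.0
            else:
                # one step taken, plus the expected remainder from a
                # uniformly chosen neighbor with L-1 steps left
                nh[w] = 1.0 + sum(h[x] for x in adj[w]) / len(adj[w])
        h = nh
    return h[u]
```

Each sweep costs O(|V| + |E|), so the full table of h_·v^L costs O(L(|V| + |E|)).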
57. The Random-Walk Domination
๏ฎ Consider a set of nodes S. If a random walk from u
reaches S by an L-length random walk, we say
S dominates u by an L-length random walk.
๏ฎ Generalized hitting time over a set of nodes S: the
hitting time h_uS^L is defined as the expectation of the
random walk variable Y_uS^L.
๏ฑ Y_uS^L ≜ min{ min{t : X_u^t ∈ S, t ≥ 0}, L }
๏ฑ h_uS^L = E[Y_uS^L]
๏ฎ It can be computed recursively:
๏ฑ h_uS^L = 0 if u ∈ S,
h_uS^L = 1 + Σ_{w ∈ V} p_uw · h_wS^{L−1} if u ∉ S.
57
58. ๏ฎ How to place items on k users in a social network so that
the other users can easily discover them by social browsing?
๏ฎ Minimize the total expected number of hops with which
every node hits the target set.
Problem-I
58
59. ๏ฎ How to place items on k users so that as many users as
possible can discover them by social browsing? Maximize the
expected number of nodes that hit the target set.
๏ฎ Let Z_uS^L be an indicator random variable such that Z_uS^L = 1
if an L-length random walk from u hits any node in S, and
Z_uS^L = 0 otherwise.
๏ฎ Let p_uS^L be the probability of the event that an L-length random
walk starting from u hits a node in S.
๏ฎ Then, E[Z_uS^L] = p_uS^L.
๏ฎ p_uS^L = 1 if u ∈ S,
p_uS^L = Σ_{w ∈ V} p_uw · p_wS^{L−1} if u ∉ S.
Problem-II
59
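The recursion for p_uS^L can likewise be evaluated bottom-up over the remaining walk length, again assuming uniform transition probabilities p_uw = 1/d_u:

```python
def hit_probability(adj, S, L):
    """p_uS^L for every start node u: the probability that an L-length
    random walk from u hits the target set S, via the recursion
    p = 1 if u in S, else sum_w p_uw * p_wS^{L-1}.
    adj: dict node -> list of neighbors (p_uw = 1/deg(u))."""
    S = set(S)
    p = {u: (1.0 if u in S else 0.0) for u in adj}   # base case: L = 0
    for _ in range(L):
        np_ = {}
        for u in adj:
            if u in S:
                np_[u] = 1.0
            else:
                # average over the uniformly chosen next neighbor
                np_[u] = sum(p[w] for w in adj[u]) / len(adj[u])
        p = np_
    return p
```

Summing the returned values over all nodes gives the Problem-II objective F_2(S) for a candidate placement S.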
60. Influence Maximization vs Problem II
๏ฎ Influence maximization is to select ๐ nodes to maximize
the expected number of nodes that are reachable from
the nodes selected.
๏ฑ Independent cascade model
๏ฑ Probabilities associated with the edges are independent
๏ฑ A target node can influence multiple immediate neighbors
at a time.
๏ฎ Problem II is to select ๐ nodes to maximize the
expected number of nodes that reach a node in the
nodes selected.
๏ฑ ๐ฟ-length random walk model
60
61. ๏ฎ Submodular set function maximization subject to a
cardinality constraint is NP-hard:
max_{S ⊆ V} F(S) s.t. |S| = K.
๏ฎ The greedy algorithm:
๏ฑ There is a (1 − 1/e)-approximation algorithm.
๏ฑ Linear time and space complexity w.r.t. the size of the
graph.
๏ฎ Submodularity: F(S) is submodular and non-decreasing.
๏ฑ Non-decreasing: F(S) ≤ F(T) for S ⊆ T ⊆ V.
๏ฑ Submodular: let δ_j(S) = F(S ∪ {j}) − F(S) be the marginal gain. Then
δ_j(S) ≥ δ_j(T), for j ∈ V \ T and S ⊆ T ⊆ V.
Submodular Function Maximization
61
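The greedy algorithm above can be sketched generically: repeatedly add the element with the largest marginal gain until K elements are chosen. The coverage function in the example is a standard non-decreasing submodular function used only for illustration, not one of the paper's objectives:

```python
def greedy_max(ground_set, F, K):
    """Greedy for max_{|S|=K} F(S): add the element with the largest marginal
    gain delta_u(S) = F(S + {u}) - F(S) at each of K steps. For non-decreasing
    submodular F this achieves the (1 - 1/e) guarantee stated above."""
    S = set()
    for _ in range(K):
        best, best_gain = None, float("-inf")
        for u in ground_set - S:
            gain = F(S | {u}) - F(S)      # marginal gain of adding u
            if gain > best_gain:
                best, best_gain = u, gain
        if best is None:                   # ground set exhausted
            break
        S.add(best)
    return S

# Example: coverage F(S) = size of the union of covered items,
# a classic submodular, non-decreasing set function.
cover = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"d"}}
F = lambda S: len(set().union(*[cover[u] for u in S]))
print(sorted(greedy_max({1, 2, 3}, F, 2)))
```

This naive version evaluates F O(K·|V|) times; lazy evaluation of marginal gains (exploiting submodularity) is the usual practical speed-up.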
62. ๏ฎ Submodular set function maximization subject to a
cardinality constraint is NP-hard:
max_{S ⊆ V} F(S) s.t. |S| = K.
๏ฎ Both Problem I and Problem II use a submodular set
function.
๏ฑ Problem-I: F_1(S) = nL − Σ_{u ∈ V} h_uS^L
๏ฑ Problem-II: F_2(S) = Σ_{w ∈ V} E[Z_wS^L] = Σ_{w ∈ V} p_wS^L
Submodular Function Maximization
62
63. The Algorithm
๏ฎ Let δ_u(S) = F(S ∪ {u}) − F(S) be the marginal gain.
๏ฎ Dynamic programming (DP) is needed to
compute the marginal gain.
63
65. Diversified Ranking [Li et al, TKDEโ13]
๏ฎ Why diversified ranking?
๏ฑ Information requirements diversity
๏ฑ Incomplete queries
65
66. Problem Statement
๏ฎ The goal is to find K nodes in a graph that are relevant to
the query node and are also dissimilar to each
other.
๏ฎ Main applications
๏ฑ Ranking nodes in social network, ranking papers, etc.
66
67. Challenges
๏ฎ Diversity measures
๏ฑ No widely accepted diversity measures on graphs in the
literature.
๏ฎ Scalability
๏ฑ Most existing methods cannot scale to large
graphs.
๏ฎ Lack of intuitive interpretation.
67
68. Grasshopper/ManiRank
๏ฎ The main idea
๏ฑ Work in an iterative manner.
๏ฑ Select a node at one iteration by random walk.
๏ฑ Set the selected node to be an absorbing node, and
perform random walk again to select the second node.
๏ฑ Perform the same process ๐พ iterations to get ๐พ nodes.
๏ฎ No diversity measure
๏ฑ Achieving diversity only by intuition and experiments.
๏ฎ Cannot scale to large graphs (time complexity O(Kn²))
68
70. Our Approach
๏ฎ The main idea
๏ฑ Relevance of the top-K nodes (denoted by a set S) is achieved by
the large (Personalized) PageRank scores.
๏ฑ Diversity of the top-K nodes is achieved by large expansion ratio.
๏ฎ Expansion ratio of a set of nodes S: σ(S) = |N(S)|/n.
๏ฑ A larger expansion ratio implies better diversity
70
71. ๏ฎ Submodular set function maximization subject to a
cardinality constraint is NP-hard:
max_{S ⊆ V} F(S) s.t. |S| = K.
๏ฎ The greedy algorithm:
๏ฑ There is a (1 − 1/e)-approximation algorithm.
๏ฑ Linear time and space complexity w.r.t. the size of the
graph.
๏ฎ Submodularity: F(S) is submodular and non-decreasing.
๏ฑ Non-decreasing: F(S) ≤ F(T) for S ⊆ T ⊆ V.
๏ฑ Submodular: let δ_j(S) = F(S ∪ {j}) − F(S) be the marginal gain. Then
δ_j(S) ≥ δ_j(T), for j ∈ V \ T and S ⊆ T ⊆ V.
Submodular Function Maximization
71
73. ๏ฎ Social contagion is the process by which information (e.g. fads,
news, opinions) diffuses in online social networks.
๏ฑ In the traditional biological contagion model, the infection
probability depends on degree.
[Figure: marketing, opinion diffusion, and social networks]
Social Contagion
73
74. Facebook Study [Ugander et al., PNASโ12]
๏ฎ Case study: the process by which a user joins Facebook in
response to an invitation email from an existing
Facebook user.
๏ฎ Social contagion is not like biological contagion.
74
75. ๏ฎ The structural diversity of an
individual is the number of
connected components in
one's neighborhood.
๏ฎ The problem: find the k individuals
with the highest structural
diversity.
[Figure: connected components in the
neighborhood of the white center node]
Structural Diversity
75
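The definition above is simple to compute per node: take the node's neighborhood, drop the node itself, and count connected components in the induced subgraph. A minimal sketch, assuming an undirected adjacency dict:

```python
def structural_diversity(adj, u):
    """Number of connected components in the subgraph induced by u's
    neighborhood (u itself excluded), per the definition on slide 75.
    adj: undirected adjacency dict, node -> set of neighbors."""
    nbrs = set(adj[u])
    seen, components = set(), 0
    for s in nbrs:
        if s in seen:
            continue
        components += 1
        stack = [s]
        seen.add(s)
        while stack:
            w = stack.pop()
            for x in adj[w]:
                # stay inside the induced neighborhood subgraph
                if x in nbrs and x not in seen:
                    seen.add(x)
                    stack.append(x)
    return components
```

For the Facebook study's intuition: a node whose friends split into many separate components has high structural diversity.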
77. Big Data: The Volume
๏ฎ Consider a dataset D of 1 PetaByte (10^15 bytes).
A linear scan of D takes 46 hours with the fastest
Solid State Drives (SSDs), at a speed of 6 GB/s.
๏ฑ PTIME queries do not always serve as a good yardstick
for tractability, as argued in "Big Data with Preprocessing" by Fan
et al., PVLDB'13.
๏ฎ Consider a function f(G). One possible way is to
make G small, as G′, and find the answers from G′
as if they were computed on G: f′(G′) ≈ f(G).
๏ฑ There are many ways we can explore.
77
78. Big Data: The Volume
๏ฎ Consider a function f(G). One possible way is to make G
small, as G′, and find the answers from G′ as if they were
computed on G: f′(G′) ≈ f(G).
๏ฎ There are many ways we can explore.
๏ฎ Make data simple and small
๏ฑ Graph sampling, Graph compression
๏ฑ Graph sparsification, Graph simplification
๏ฑ Graph summary
๏ฑ Graph clustering
๏ฑ Graph views
78
79. More Work on Big Data
๏ฎ We also believe that there are many things we need to
do on Big Data.
๏ฎ We are planning to explore many directions.
๏ฑ Make data simple and small
๏ฎ Graph sampling, graph simplification, graph
summary, graph clustering, graph views.
๏ฑ Explore different computing approaches
๏ฎ Parallel computing, distributed computing,
streaming computing, semi-external/external
computing.
79
80. I/O Efficient Graph Computing
๏ฎ I/O Efficient: Computing SCCs in Massive Graphs by
Zhiwei Zhang, Jeffrey Xu Yu, Lu Qin, Lijun Chang, and
Xuemin Lin, SIGMODโ13.
๏ฎ Contract & Expand: I/O Efficient SCCs Computing by
Zhiwei Zhang, Lu Qin, and Jeffrey Xu Yu.
๏ฎ Divide & Conquer: I/O Efficient Depth-First Search,
Zhiwei Zhang, Jeffrey Xu Yu, Lu Qin, and Zechao
Shang.
80
81. Reachability Query
๏ฎ Two possible but infeasible solutions:
๏ฑ Traverse G(V, E) to answer a reachability query
๏ฎ Low query performance: O(|E|) query time
๏ฑ Precompute and store the transitive closure
๏ฎ Fast query processing
๏ฎ Large storage requirement: O(|V|²)
๏ฎ The labeling approaches:
๏ฑ Assign labels to nodes in a
preprocessing step offline.
๏ฑ Answer a query using the labels
assigned online.
81
82. [Figure: a directed graph on nodes A-G and the DAG obtained by contracting its SCCs]
Make a Graph Small and Simple
๏ฎ Any directed graph G can be represented as a DAG (Directed
Acyclic Graph) G_D by taking every SCC (Strongly
Connected Component) of G as a node in G_D.
๏ฎ An SCC of a directed graph G = (V, E) is a maximal set of
nodes C ⊆ V such that every pair of nodes u and v in C
are reachable from each other.
82
84. The Issue and the Challenge
๏ฎ A massive directed graph G
needs to be converted into a
DAG G_D in order to process
it efficiently, because
๏ฑ G cannot be held in main
memory, while G_D can be
much smaller.
๏ฎ Existing works assume this
conversion can be done.
๏ฎ But the conversion itself
needs a large main memory.
84
85. The Issue and the Challenge
๏ฎ The Dataset uk-2007
๏ฑ Nodes: 105,896,555
๏ฑ Edges: 3,738,733,648
๏ฑ Average degree: 35
๏ฎ Memory:
๏ฑ 400 MB for nodes, and
๏ฑ 28 GB for edges.
85
86. In Memory Algorithm?
๏ฎ In-memory algorithm: scan G twice
๏ฑ DFS(G) to obtain a decreasing postorder of the nodes of G,
๏ฑ reverse every edge to obtain G^R, and
๏ฑ DFS(G^R) in that decreasing order to find all
SCCs.
86
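The two-scan procedure just described is Kosaraju's algorithm. An iterative in-memory sketch (assuming every node appears as a key of the adjacency dict):

```python
def kosaraju_sccs(adj):
    """Two-scan SCC algorithm from slide 86: (1) DFS on G records nodes in
    decreasing finish order; (2) DFS on the reversed graph G^R in that order;
    each second-pass tree is exactly one SCC. Iterative to avoid recursion
    limits on deep graphs."""
    # Pass 1: DFS on G, recording decreasing finish (post) order.
    order, seen = [], set()
    for s in adj:
        if s in seen:
            continue
        seen.add(s)
        stack = [(s, iter(adj[s]))]
        while stack:
            u, it = stack[-1]
            advanced = False
            for v in it:
                if v not in seen:
                    seen.add(v)
                    stack.append((v, iter(adj[v])))
                    advanced = True
                    break
            if not advanced:          # u is finished
                order.append(u)
                stack.pop()
    # Build the reversed graph G^R.
    radj = {u: [] for u in adj}
    for u in adj:
        for v in adj[u]:
            radj[v].append(u)
    # Pass 2: DFS on G^R in decreasing finish order; each tree is an SCC.
    sccs, seen = [], set()
    for s in reversed(order):
        if s in seen:
            continue
        seen.add(s)
        comp, stack = [], [s]
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in radj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        sccs.append(comp)
    return sccs
```

Both passes are O(|V| + |E|) in memory, which is exactly why this scheme breaks down once G exceeds main memory: the accesses have no locality on disk.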
87. [Figure: an example graph on nodes 1-9]
In Memory Algorithm?
๏ฎ DFS(G) to obtain a decreasing order for each node of G
87
89. [Figure: the same example graph on nodes 1-9, with reversed edges]
In Memory Algorithm?
๏ฎ DFS(G^R) according to the same decreasing order to find all
SCCs. (A subtree, in black edges, forms an SCC.)
89
90. (Semi)-External Algorithms
๏ฎ In-memory algorithm: scan G twice.
๏ฎ The in-memory algorithm cannot handle a large graph that
cannot be held in memory.
๏ฑ Why? No locality: a large number of random I/Os.
๏ฎ Consider external algorithms and/or semi-external
algorithms. Let M be the size of main memory.
๏ฑ External algorithm: M < |G|
๏ฑ Semi-external algorithm: c·|V| ≤ M < |G|
๏ฎ The semi-external setting assumes a spanning tree can be held in memory.
90
93. [Figure: main memory state of the contraction-based external algorithm;
the in-memory part is already a DAG, memory is full, and node I cannot
be loaded into memory]
Cannot Find All SCCs Always!
Contraction Based External Algorithm (3)
93
95. DFS Based Approaches: Cost-1
๏ฎ DFS-SCC uses sequential I/Os.
๏ฎ DFS-SCC needs to traverse a graph ๐บ twice using DFS
to compute all SCCs.
๏ฎ In each DFS, in the worst case it needs on the order of
depth(G) · |E(V)|/B I/Os, where B is the block size.
95
96. DFS Based Approaches: Cost-2
๏ฎ Partial SCCs cannot be contracted to save space while
constructing a DFS tree.
๏ฎ Why?
๏ฑ DFS-SCC needs to traverse a graph ๐บ twice using DFS to
compute all SCCs.
๏ฑ DFS-SCC uses a total order of nodes (decreasing postorder)
in the second DFS, which is computed in the first DFS.
๏ฑ SCCs cannot be partially contracted in the first DFS.
๏ฑ SCCs can be partially contracted in the second DFS, but we
have to remember which nodes belong to which SCCs, at
extra space cost. Not free!
96
97. DFS Based Approaches: Cost-3
๏ฎ High CPU cost for reshaping a DFS-tree, when it attempts
to reduce the number of forward-cross-edges.
97
98. Our New Approach [SIGMODโ13]
๏ฎ We propose a new two phase algorithm, 2P-SCC:
๏ฑ Tree-Construction and Tree-Search.
๏ฑ In Tree-Construction phase, we construct a tree-like
structure.
๏ฑ In Tree-Search phase, we scan the graph only once.
๏ฎ We further propose a new algorithm, 1P-SCC, to
combine Tree-Construction and Tree-Search with new
optimization techniques, using a tree.
๏ฑ Early-Acceptance
๏ฑ Early-Rejection
๏ฑ Batch Edge Reduction
A joint work by Zhiwei Zhang, Jeffrey Xu Yu, Lu Qin, Lijun Chang, and Xuemin Lin
98
99. A New Weak Order
๏ฎ The total order used in DFS-SCC is too strong, and there
is no obvious relationship between the total order and
the SCCs per se that could be exploited to reduce I/Os.
๏ฑ The total order cannot help to reduce I/O costs.
๏ฎ We introduce a new weak order.
๏ฎ For an SCC, there must exist at least one cycle.
๏ฎ While constructing a tree T for G, a cycle will appear to
contain at least one edge (u, v) that links to a higher
level node in T: depth(u) > depth(v).
๏ฎ There are two cases when depth(u) > depth(v):
๏ฑ A cycle: v is an ancestor of u in T.
๏ฑ Not a cycle (up-edge): v is not an ancestor of u in T.
๏ฎ We reduce the number of up-edges iteratively.
99
100. ๏ฎ Let Rset(u, G, T) be the set of nodes including u and the nodes
that u can reach via a tree T of G.
๏ฎ depth(u, T): the length of the longest simple
path from the root to u.
๏ฎ drank(u, T) = min{depth(v, T) | v ∈ Rset(u, G, T)}
๏ฑ drank is used as the weak order!
๏ฎ dlink(u, T) = argmin_v {depth(v, T) | v ∈ Rset(u, G, T)}
๏ฎ Nodes do not need to have a unique order.
[Figure: example tree where depth(B) = 1, drank(B) = 1, dlink(B) = B;
depth(E) = 3, drank(E) = 1, dlink(E) = B]
The Weak Order: drank
100
101. 2P-SCC
๏ฎ To reduce Cost-1, we build a BR+-tree in the
Tree-Construction phase, and compute all SCCs by
traversing G only once in the Tree-Search phase,
using the constructed BR+-tree.
๏ฎ To reduce Cost-3, we have shown that we only need to
update the depth of nodes locally.
101
102. [Figure: an example BR-Tree/BR+-Tree on nodes A-I]
๏ฎ BR-Tree is a spanning
tree of G.
๏ฎ BR+-Tree is a BR-Tree
plus some additional
edges (๐ข, ๐ฃ) such that ๐ฃ
is an ancestor of ๐ข.
BR-Tree and BR+-Tree
๏ฑ In Memory: Black edges
102
103. [Figure: example tree with drank(I) = 1, drank(H) = 2, and an up-edge]
Tree-Construction: Up-edge
๏ฎ An edge (u, v) is an up-edge if:
๏ฑ v is not an ancestor of u in T, and
๏ฑ drank(u, T) ≥ drank(v, T).
๏ฎ Up-edges violate the existing
order.
103
104. ๏ฎ When an up-edge violates
the order, then
๏ฑ Modify T:
๏ฎ delete the old tree
edge, and
๏ฎ set the up-edge as a
new tree edge.
๏ฑ Graph Reconstruction:
๏ฎ no I/O cost, low CPU
cost.
[Figure: push-down example on nodes A-I]
Tree-Construction (Push-Down)
104
105. [Figure: graph reconstruction example with drank(E) = 1,
dlink(E) = B, drank(F) = 1, and an up-edge]
Tree-Construction (Graph Reconstruction)
๏ฎ The tree edges plus one
extra edge in a BR+-Tree
form a part of an SCC.
๏ฎ For an up-edge (u, v), if
dlink(v, T) is an ancestor
of u in T, delete (u, v) and
add (u, dlink(v)).
๏ฎ In Tree-Search, scan the
graph only once to find all
SCCs, which reduces I/O
costs.
A new edge
105
106. Tree-Construction
๏ฎ When a BR+-tree is completely constructed, there are no
up-edges.
๏ฎ There are only two kinds of edges left in G:
๏ฑ the BR+-tree edges, and
๏ฑ the edges (u, v) where drank(u, T) < drank(v, T).
๏ฎ The latter edges do not play any role in determining an
SCC.
106
107. [Figure: an example BR+-tree on nodes A-I]
In memory, for each node u:
๏ฎ TreeEdge(u)
๏ฎ dlink(u)
๏ฎ drank(u)
In total: 3 × |V|
Search procedure:
๏ฎ If an edge (u, v) points
to an ancestor, merge
all nodes from v to u in
the tree.
Only need to scan the
graph once.
Tree-Search
107
108. From 2P-SCC To 1P-SCC
๏ฎ With 2P-SCC:
๏ฑ In Tree-Construction phase, we construct a tree by an
approach similar to DFS-SCC, and
๏ฑ In Tree-Search phase, we scan the graph only once.
๏ฑ The memory used for the BR+-tree is 3 × |V|.
๏ฎ With 1P-SCC: We combine Tree-Construction and Tree-
Search with new optimization techniques to reduce Cost-2
and Cost-3:
๏ฑ Early-Acceptance
๏ฑ Early-Rejection
๏ฑ Batch Edge Reduction
๏ฑ Only need to use a BR-tree with memory of 2 × |V|.
108
109. Early-Acceptance and Early Rejection
๏ฎ Early acceptance: we contract a partial SCC into a
node in an early stage while constructing a BR-tree.
๏ฎ Early rejection: we identify necessary conditions to
remove nodes that will not participate in any SCCs
while constructing a BR-tree.
109
110. Early-Acceptance and Early Rejection
๏ฎ Consider an example.
๏ฎ The three nodes on the left can be contracted into a node on the right.
๏ฎ The node โaโ and the subtrees, C and D, can be rejected.
110
111. [Figure: example tree combining up-edge tree modifications and
early acceptances on nodes A-K]
๏ฎ Memory: 2 × |V|
๏ฎ Reduce I/O cost
Modify Tree + Early Acceptance
111
112. DFS Based vs Ours Approaches
DFS-based:
๏ฎ I/O cost for DFS is high
๏ฑ Uses a total order
๏ฎ Cannot merge SCCs
when found early
๏ฑ The total order cannot be
changed; a large number of
I/Os.
๏ฎ Cannot prune non-SCC
nodes
๏ฑ The total order cannot be
changed
Ours:
๏ฎ Smaller I/O cost
๏ฑ Uses a weaker order
๏ฎ Merges SCCs as early as
possible
๏ฑ Merges nodes with the
same order; small size,
small number of I/Os.
๏ฎ Prunes non-SCC nodes as
early as possible
๏ฑ The weaker order is flexible
112
113. Optimization: Batch Edge Reduction
๏ฎ With 1P-SCC, the CPU cost is still high.
๏ฑ To determine whether (u, v) is a backward-
edge/up-edge, it needs to check the ancestor
relationship between u and v over a tree.
๏ฎ The tree is frequently updated.
๏ฎ The average depth of nodes in the tree grows
with frequent push-down operations.
113
114. Optimization: Batch Edge Reduction
๏ฎ When memory can hold more edges, there is no need to
contract partial SCCs edge by edge.
๏ฎ Find all SCCs in the main memory at the same time
๏ฑ Read all edges that can be read into memory.
๏ฑ Construct a graph with the edges of the tree
constructed in memory already plus the edges newly
read into memory.
๏ฑ Construct a DAG in memory using the existing memory
algorithm, which finds all SCCs in memory.
๏ฑ Reconstruct the BR-Tree according to the DAG.
114
115. Performance Studies
๏ฎ Implemented using Visual C++ 2005
๏ฎ Tested on a PC with an Intel Core2 Quad 2.66 GHz CPU
and 3.43 GB memory running Windows XP
๏ฎ Disk block size: 64 KB
๏ฎ Memory size: 3 × |V(G)| × 4 B + 64 KB
115
116. Four Real Data Sets
Dataset | |V| | |E| | Average Degree
cit-patent | 3,774,768 | 16,518,947 | 4.70
go-uniprot | 6,967,956 | 34,770,235 | 4.99
citeseerx | 6,540,399 | 15,011,259 | 2.30
WEBSPAM-UK2002 | 105,896,555 | 3,738,733,568 | 35.00
116
117. Synthetic Data Sets
Parameter | Range | Default
Node Size | 30M - 70M | 30M
Average Degree | 3 - 7 | 5
Size of Massive SCCs | 200K - 600K | 400K
Size of Large SCCs | 4K - 12K | 8K
Size of Small SCCs | 20 - 60 | 40
# of Massive SCCs | 1 | 1
# of Large SCCs | 30 - 70 | 50
# of Small SCCs | 6K - 14K | 10K
๏ฎ We construct a graph G by (1) randomly selecting all
nodes in SCCs first, (2) adding edges among the
nodes in an SCC until all nodes form an SCC, and (3)
randomly adding nodes/edges to the graph.
117
123. From Semi-External to External
๏ฎ Existing semi-external solutions work under the condition
that a tree can be held in main memory, c·|V| ≤ M, and
they generate a large number of I/Os.
๏ฎ We study an external algorithm by removing the
condition c·|V| ≤ M.
123
A joint work by Zhiwei Zhang, Lu Qin, and Jeffrey Xu Yu
124. The New Approach: The Main Idea
- DFS-based approaches generate random accesses.
- The contraction-based semi-external approach reduces |V| and |E| together at the same time.
  - It cannot find all SCCs.
- The main idea of our external algorithm:
  - Work on a small graph G' of G by reducing V, because the memory M can be small.
  - Find all SCCs in G'.
  - Add the removed nodes back to find the SCCs in G.
125. The New Approach: The Property
- Reducing the given graph:
  - G(V, E) ⇒ G'(V', E'), with |V'| < |V|.
  - If u can reach v in G, u can also reach v in G'.
  - Maintaining this property may generate a large number of random I/O accesses.
- Reason: several nodes on the path from u to v may have been removed from G in previous iterations.
126. The New Approach: The Approach
- We introduce a new Contraction & Expansion approach.
  - Contraction phase:
    - Reduce nodes iteratively: G1, G2, ..., Gk.
      - This decreases |V(Gi)|, but may increase |E(Gi)|.
  - Expansion phase:
    - Process the graphs in the reverse order of the contraction phase: Gk, Gk-1, ..., G1.
      - Find all SCCs in Gk using a semi-external algorithm.
        - The semi-external algorithm deals with edges.
      - Expand Gi back to Gi-1.
127. The Contraction
- In the Contraction phase, graphs G1, G2, ..., Gk are generated.
- Gi+1 is generated by removing a batch of nodes from Gi.
- The phase stops when c × |V| < |M|, at which point the semi-external approach can be applied.

(Figure: contraction sequence G1 ⇒ G2 ⇒ G3)
128. The Expansion
- In the Expansion phase, the removed nodes are added back.
- Nodes are added in the reverse order of their removal in the contraction phase.

(Figure: expansion sequence G3 ⇒ G2 ⇒ G1)
129. The Contraction Phase
- Compared with Gi, Gi+1 should have the following properties:
  - Contractable:
    - |V(Gi+1)| < |V(Gi)|
  - SCC-Preservable:
    - SCC(u, Gi) = SCC(v, Gi) ⇔ SCC(u, Gi+1) = SCC(v, Gi+1)
  - Recoverable:
    - v ∈ Vi − Vi+1 ⇒ neighbors(v, Gi) ⊆ Gi+1
130. Contract Vi+1
- Recoverable:
  - v ∈ Vi − Vi+1 ⇒ neighbors(v, Gi) ⊆ Gi+1
- Gi+1 is recoverable if and only if Vi+1 is a vertex cover of Gi.
- Under this condition, we can determine which SCC each node of Gi belongs to by scanning Gi once.
- For each edge, we select the endpoint with the higher degree, breaking ties by the higher order.
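The per-edge selection rule can be sketched directly: taking, for every edge, the endpoint with the higher (degree, order) pair guarantees the selected set touches every edge, i.e. it is a vertex cover (function name and representation are assumptions):

```python
def select_cover(edges, degree):
    """For each edge keep the endpoint with the higher degree
    (ties broken by the higher node id); the selected nodes form
    a vertex cover, since one endpoint of every edge is chosen."""
    cover = set()
    for u, v in edges:
        # pick by (degree, id): higher degree first, then higher order
        cover.add(max(u, v, key=lambda x: (degree[x], x)))
    return cover
```

On the slide-131 example this selects {b, d, e, g, i}, matching the nodes that survive into the contracted graph.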
131. Contract Vi+1

(Figure: example graph on nodes a-i)

Edges on DISK:
ID1  ID2  Deg1  Deg2
a    b    3     3
a    d    3     4
b    c    3     2
c    d    2     4
d    e    4     4
d    g    4     4
e    b    4     3
e    g    4     4
f    g    2     4
g    h    4     2
h    i    2     2
i    f    2     2

For each edge, we select the endpoint with the higher degree, breaking ties by the higher order.
132. Construct Ei+1
- SCC-Preservable:
  - SCC(u, Gi) = SCC(v, Gi) ⇔ SCC(u, Gi+1) = SCC(v, Gi+1)
- If v ∈ Vi − Vi+1, remove (v_in, v) and (v, v_out) and add (v_in, v_out).
- Although |E| may become larger, |V| is guaranteed to become smaller.
- A smaller |V| means the semi-external approach can be applied.
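The rewiring step for a single removed node v can be sketched as follows, assuming the graph is held as a dict of successor sets (names are assumptions): removing v while adding (v_in, v_out) for every in/out pair keeps reachability through v intact.

```python
def contract_node(adj, v):
    """Remove node v from adj (dict: node -> set of successors),
    adding (v_in, v_out) for every in-neighbor v_in and out-neighbor
    v_out of v, so reachability through v is preserved.
    |E| may grow, but |V| shrinks by one."""
    outs = adj.pop(v, set())
    for u, succs in adj.items():
        if v in succs:                              # u is an in-neighbor of v
            succs.discard(v)
            succs.update(w for w in outs if w != u) # skip self-loops
    return adj
```

This is exactly why |E(Gi+1)| can exceed |E(Gi)|: a node with p in-neighbors and q out-neighbors is replaced by up to p × q new edges.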
133. Construct Ei+1

(Figure: example graph on nodes a-i)

New edges on DISK:
ID1  ID2
e    d
b    d
i    g
g    i

Existing edges:
ID1  ID2
d    e
d    g
e    b
e    g

If v ∈ Vi − Vi+1, remove (v_in, v) and (v, v_out) and add (v_in, v_out).
134. The Expansion Phase
- SCC(u, Gi) = SCC(v, Gi) = SCC(w, Gi) (u, w ∈ Vi+1) ⇔ u → v & v → w in Gi
- For any node v ∈ Vi − Vi+1, SCC(v, Gi) can be computed using neighbors_in(v, Gi) and neighbors_out(v, Gi) only.

(Figure: nodes a, b, c)
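The rule above can be sketched directly: a removed node v rejoins an SCC S of Gi exactly when some in-neighbor and some out-neighbor of v already carry the label S, since S → v → S then closes a cycle (function and variable names are assumptions):

```python
def scc_of_removed(scc_id, in_nbrs, out_nbrs):
    """scc_id: SCC label of each already-assigned node.
    Return the SCC label for a removed node v, given v's in- and
    out-neighbors, or None if v forms an SCC by itself."""
    in_labels = {scc_id[u] for u in in_nbrs if u in scc_id}
    for w in out_nbrs:
        if w in scc_id and scc_id[w] in in_labels:
            return scc_id[w]   # some u -> v and v -> w share an SCC
    return None
```

Processing removed nodes in the reverse order of their removal (as the expansion phase does) ensures each node's neighbors are already labelled when it is examined.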
135. Expansion Phase

(Figure: example graph on nodes a-i)

Edges on DISK:
ID1  ID2
a    b
a    d
b    c
c    d
d    e
d    g
e    b
e    g
f    g
g    h
h    i
i    f

SCC(u, Gi) = SCC(v, Gi) = SCC(w, Gi) (u, w ∈ Vi+1) ⇔ u → v & v → w in Gi
136. Performance Studies
- Implemented using Visual C++ 2005.
- Tested on a PC with an Intel Core2 Quad 2.66GHz CPU and 3.5GB of memory, running Windows XP.
- Disk block size: 64KB.
- Default memory size: 400M.
137. Data Set
- Real data set:

Data set        |V|          |E|            Average Degree
WEBSPAM-UK2007  105,896,555  3,738,733,568  35.00

- Synthetic data:

Parameter       Range
Node Size       25M - 100M
Average Degree  2 - 6
Size of SCCs    20 - 600K
Number of SCCs  1 - 14K
139. DFS [SIGMOD'15]
- Given a graph G(V, E), depth-first search (DFS) searches G following the depth-first order.

(Figure: example graph on nodes A-J)

A joint work by Zhiwei Zhang, Jeffrey Yu, Qin Lu, and Zechao Shang
140. The Challenge
- We need to DFS a massive directed graph G, but it is possible that G cannot be entirely held in main memory.
- Our work keeps only the nodes in memory, which is a much smaller requirement.
141. The Issue and the Challenge (1)
- Consider all edges from u: (u, v1), (u, v2), ..., (u, vk). Suppose DFS searches from u to v1. It is hard to estimate when it will visit vi (2 ≤ i ≤ k).
- It is hard to know when C/D will be visited, even though they are near A and B.
- It is hard to design the format of the graph on disk.

(Figure: example graph on nodes A-E)
142. The Issue and the Challenge (2)
- A small part of the graph can change the DFS a lot.
- Even when almost the entire graph can be kept in memory, it still costs a lot to find the DFS.
- Edge (E, D) will change the existing DFS significantly.
- A large number of iterations is needed even when memory holds a large portion of the graph.

(Figure: example graph on nodes A-G)
143. Problem Statement
- We study a semi-external algorithm that computes a DFS-Tree, from which a DFS can be obtained.
- The limited memory M: c × |V| ≤ |M| ≤ |G|
  - c is a small constant.
  - |G| = |V| + |E|
144. DFS-Tree & Edge Type
- A DFS of G forms a DFS-Tree.
- A DFS procedure can be obtained from a DFS-Tree.

(Figure: example graph on nodes A-J)
145. DFS-Tree & Edge Type
- Given a spanning tree T, there exist 4 types of non-tree edges: forward edges, forward-cross edges, backward-cross edges, and backward edges.

(Figure: example graph on nodes A-J, with the four edge types marked)
146. DFS-Tree & Edge Type
- An ordered spanning tree is a DFS-Tree if and only if it has no forward-cross edges.

(Figure: example graph on nodes A-J, with the four edge types marked)
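This characterization can be checked mechanically: compute discovery/finish times by a preorder walk of the ordered tree, then look for a non-tree edge that jumps forward to a non-descendant. A sketch, with the tree representation and function name as assumptions:

```python
def is_dfs_tree(children, root, edges):
    """children: dict node -> ordered list of children of an ordered
    spanning tree; edges: the non-tree edges of G.  The tree is a
    DFS-tree iff no edge (u, v) is a forward-cross edge, i.e. goes
    to a later-visited node that is not a descendant of u."""
    disc, fin, clock = {}, {}, 0
    stack = [(root, False)]
    while stack:                       # iterative preorder with finish times
        v, done = stack.pop()
        if done:
            fin[v] = clock
            clock += 1
            continue
        disc[v] = clock
        clock += 1
        stack.append((v, True))
        for c in reversed(children.get(v, [])):
            stack.append((c, False))

    def descendant(a, b):              # is b a descendant of a?
        return disc[a] <= disc[b] and fin[b] <= fin[a]

    for u, v in edges:
        if disc[v] > disc[u] and not descendant(u, v):
            return False               # forward-cross edge found
    return True
```

Forward edges (to a descendant) and backward/backward-cross edges (to an earlier-visited node) all pass the test; only a forward-cross edge fails it.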
147. Existing Solutions
- Iteratively remove the forward-cross edges.
- Procedure:
  - While there exists a forward-cross edge:
    - Construct a new T by conducting DFS over the graph in memory.
148. Existing Solutions
- Construct a new T by conducting DFS over the graph in memory, until no forward-cross edges exist.

(Figure: example graph on nodes A-J, with a forward-cross edge marked)
149. The Drawbacks
- D-1: A total order on V(G) needs to be maintained throughout the whole process.
- D-2: A large number of I/Os is produced.
  - All edges need to be scanned in every iteration.
- D-3: A large number of iterations is needed.
  - The possibility of grouping edges that are near each other in the DFS is not considered.
150. Why Divide & Conquer
- We aim at dividing the graph into several subgraphs G1, G2, ..., Gk, with possible overlaps among them.
- Goal: the DFS-Tree for G can be computed from the DFS-Trees for all Gi.
- The divide & conquer approach can overcome the existing drawbacks.
151. Why Divide & Conquer
- To address D-1:
  - A total order on V(G) needs to be maintained throughout the whole process.
- After dividing the graph G into G0, G1, ..., Gk, we only need to maintain a total order on each V(Gi).
152. Why Divide & Conquer
- To address D-2:
  - A large number of I/Os is produced.
  - All edges need to be scanned in each iteration.
- After dividing the graph G into G0, G1, ..., Gk, we only need to scan the edges in Gi to eliminate its forward-cross edges.
153. Why Divide & Conquer
- To address D-3:
  - A large number of iterations is needed.
  - Edges that are near each other in the DFS visiting sequence cannot be grouped together.
- After dividing the graph G into G0, G1, ..., Gk, the DFS procedure can be applied to each Gi independently.
155. Invalid Division
- An example:

(Figure: a graph on nodes A-F divided into G1 and G2)

No matter how the DFS-Trees for G1 and G2 are ordered, the merged tree cannot be a DFS-Tree for G.
156. How to Cut: Challenges
- Challenge-1: it is not easy to check whether a division is valid.
  - We need to make sure that the DFS-Tree for one divided subgraph does not affect the DFS-Trees of the others.
- Challenge-2: finding a good division is non-trivial.
  - The edge types between different subgraphs are complicated.
- Challenge-3: the merge procedure needs to make sure that the result is a DFS-Tree for G.
157. Our New Approach
- To address Challenge-1:
  - Compute a light-weight summary graph (S-graph).
  - Check whether a division is valid by searching the S-graph.
- To address Challenge-2:
  - Recursively divide & conquer.
- To address Challenge-3:
  - The DFS-Tree for G is computed only from the Ti and the S-graph.
158. Four Division Properties
- Node-Coverage: ∪_{1≤i≤k} Gi = G
- Contractible: |V(Gi)| < |V(G)|
- Independence: any pair of nodes in V(Ti) ∩ V(Tj) are consistent.
  - Ti and Tj can be dealt with independently (Ti and Tj are the DFS-Trees for Gi and Gj).
- DFS-Preservable: there exists a DFS-Tree T for graph G such that V(T) = ∪_{1≤i≤k} V(Ti) and E(T) = ∪_{1≤i≤k} E(Ti).
  - The DFS-Tree for G can be computed from the Ti.
159. DFS-Preservable Property
- The DFS-Tree for G can be computed from the Ti.
- DFS⁻-Tree: a spanning tree with the same edge set as a DFS-Tree (without order).
- Suppose the independence property is satisfied. Then the DFS-preservable property is satisfied if and only if the spanning tree T with V(T) = ∪_{1≤i≤k} V(Ti) and E(T) = ∪_{1≤i≤k} E(Ti) is a DFS⁻-Tree.
160. Independence Property
- Any pair of nodes in V(Ti) ∩ V(Tj) are consistent (Ti and Tj are the DFS-Trees for Gi and Gj).
  - Ti and Tj can be dealt with independently.
  - This may not hold: u may be an ancestor of v in Ti, but a sibling of v in Tj.
- Theorem:
  - Given a division G0, G1, ..., Gk of G, the independence property is satisfied if and only if for any subgraphs Gi and Gj, E(Gi) ∩ E(Gj) = ∅.
163. Our Approach
- Root-based division: independence is satisfied.
  - Each Gi has a spanning tree Ti.
  - For a division G0, G1, ..., Gk: G0 ∩ Gi = ri.
  - ri is the root of Ti and a leaf of T0.

(Figure: G0 on top, with subgraphs Gi and Gj hanging below)
164. Our Approach
- We expand G0 to capture the relationships between the different Gi and call the result the S-graph.
- The S-graph is used to check whether the current division is valid (i.e., the DFS-preservable property is satisfied).

(Figure: G0 with subgraphs Gi and Gj, and the corresponding S-graph)
165. S-edge
- S-edge: given a spanning tree T of G, (u', v') is the S-edge of (u, v) if
  - u' is an ancestor of u and v' is an ancestor of v in T, and
  - both u' and v' are children of LCA(u, v), where LCA(u, v) is the lowest common ancestor of u and v in T.
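With parent pointers and depths, the S-edge of (u, v) can be found by walking both endpoints up to their LCA and remembering the last step taken on each side; a minimal sketch (names are assumptions):

```python
def s_edge(parent, depth, u, v):
    """Return the S-edge (u', v') of (u, v) w.r.t. a tree given by
    parent pointers: u', v' are the children of LCA(u, v) lying on
    the paths down to u and v.  Returns None when one endpoint is
    an ancestor of the other (no such pair of children exists)."""
    pu, pv = None, None                # last node seen on each path
    while depth[u] > depth[v]:         # lift the deeper endpoint
        u, pu = parent[u], u
    while depth[v] > depth[u]:
        v, pv = parent[v], v
    while u != v:                      # climb together to the LCA
        u, pu = parent[u], u
        v, pv = parent[v], v
    if pu is None or pv is None:       # one endpoint was an ancestor
        return None
    return (pu, pv)
```

This O(depth) walk is enough for a sketch; the paper's light-weight S-graph only needs the S-edges whose endpoints fall inside G0.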
167. S-graph
- For a division G0, G1, ..., Gk where T0 is the DFS-Tree for G0, the S-graph is constructed as follows:
  - Remove all backward and forward edges w.r.t. T0.
  - Replace each cross-edge (u, v) with its corresponding S-edge if the S-edge is between nodes in G0.
  - For an edge (u, v), if u ∈ Gi and v ∈ G0, add the edge (ri, v); do the same for v.
170. Division Theorem
- Consider a division G0, G1, ..., Gk and suppose T0 is the DFS-Tree for G0. The division is DFS-preservable if and only if the S-graph is a DAG.
171. Divide-Star Algorithm
- Divide G according to the children of the root r of T.
- If the corresponding S-graph is a DAG, each subgraph can be computed independently.
- Dealing with a strongly connected component S:
  - Modify T: add a virtual node RS representing the SCC S.
  - Modify the S-graph:
    - For any edge (u, v) in the S-graph, if u ∉ S and v ∈ S, add the edge (u, RS); do the same for u ∈ S.
    - Remove all nodes in S and the corresponding edges.
  - Modify the division: create a new tree T' rooted at the virtual root RS and connect it to the roots in the SCC.
175. Divide-TD Algorithm
- The Divide-Star algorithm divides the graph according to the children of the root.
  - The depth of T0 is 1.
  - The number of subgraphs after dividing is at most the number of children.
- The Divide-TD algorithm enlarges T0 and the corresponding S-graph.
  - It can produce more subgraphs than Divide-Star can.
176. Divide-TD Algorithm
- The Divide-TD algorithm enlarges T0 to a Cut-Tree.
- Cut-Tree: given a tree T with root t0, a cut-tree Tc is a subtree of T that satisfies two conditions:
  - The root of Tc is t0.
  - For any node v ∈ T with child nodes v1, v2, ..., vk, if v ∈ Tc, then v is either a leaf node of Tc or a node in Tc together with all its child nodes v1, v2, ..., vk.
- Under these conditions, for any S-edge (u, v), only two situations exist:
  - u, v ∈ Tc
  - u, v ∉ Tc
177. Cut-Tree Construction
- Given a tree T with root r0.
- Initially, Tc contains only the root r0.
- Iteratively pick a leaf node v of Tc and add all the child nodes of v in T.
- The process stops when memory can no longer hold the tree after adding the next nodes.
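The construction loop above can be sketched as follows, with a node budget standing in for the memory limit. Note the all-or-none condition: a leaf's children are added together or not at all (the names and the simple stop-at-first-overflow policy are assumptions):

```python
from collections import deque

def build_cut_tree(children, root, budget):
    """Grow a cut-tree from the root: repeatedly take a frontier
    leaf and add ALL of its children (the all-or-none condition),
    stopping once the next addition would exceed `budget` nodes."""
    cut = {root}
    frontier = deque([root])
    while frontier:
        v = frontier.popleft()
        kids = children.get(v, [])
        if len(cut) + len(kids) > budget:   # next step would not fit
            break
        cut.update(kids)
        frontier.extend(kids)               # kids become new frontier leaves
    return cut
```

Adding children only as complete sibling groups is what guarantees every S-edge lies entirely inside or entirely outside the cut-tree.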
181. Merge Algorithm
- By the properties, given the DFS-Trees T0, T1, ..., Tk for the subgraphs, there exists a DFS-Tree T with V(T) = ∪_{1≤i≤k} V(Ti) and E(T) = ∪_{1≤i≤k} E(Ti).
- We only need to arrange the Ti in the merged tree such that the resulting tree T is a DFS-Tree.
- Since the S-graph is a DAG by the division procedure, we can topologically sort the S-graph and arrange the Ti according to the topological order.
- Remove each virtual node v and add edges from the father of v to the children of v.
- It can be proven that the resulting tree is a DFS-Tree.
185. Performance Studies
- Implemented using Visual C++ 2010.
- Tested on a PC with an Intel Core2 Quad 2.66GHz CPU and 4GB of memory, running Windows 7 Enterprise.
- Disk block size: 64KB.
186. Four Real Data Sets

Data set         |V|          |E|            Average Degree
Wikilinks        25,942,246   601,038,301    23.16
Arabic-2005      22,744,080   639,999,458    28.14
Twitter-2010     41,652,230   1,468,365,182  35.25
WEBGRAPH-UK2007  105,895,908  3,738,733,568  35.00
188. Conclusion
- We study I/O-efficient DFS algorithms for large graphs.
- We analyze the drawbacks of the existing semi-external DFS algorithm.
- We discuss the challenges, and four properties needed to obtain a divide & conquer approach.
- Based on the properties, we design two novel graph division algorithms and a merge algorithm to reduce the cost of DFS on the graph.
- We have conducted extensive performance studies to confirm the efficiency of our algorithms.
189. Some Concluding Remarks
- We believe there are still many things to do on large graphs, or big graphs.
- We know what we have learned about graph processing.
- We do not yet know what we do not know about graph processing.
- We need to explore many directions, such as:
  - parallel computing
  - distributed computing
  - streaming computing
  - semi-external/external computing
190. I/O Cost Minimization
- If there exists no node u with SCC(u, Gi) = SCC(v, Gi), then v can be removed from Gi+1.
- For a node v, if neighbors(v, Gi) ⊆ Vi+1, v can be removed from Vi+1.
- The I/O complexity is O(sort(|Vi|) + sort(|Ei|) + scan(|Ei|)).
191. Another Example

(Figure: example graph on nodes A-I; one edge gives all nodes in a partial SCC the same order. In memory: black edges; on disk: red edges.)

- Keep the tree-structure edges in memory.
- Only the depths of reachable nodes matter, not their exact positions.
- Early-acceptance: merging SCCs partially, whenever possible, does not affect the order of the other nodes.
- Early-rejection: prune non-SCC nodes when possible.
  - Prune the node "A".
192. Optimization: Early Acceptance
- No need to remember order(u, T).
- Merge nodes of the same order when an edge (u, v) is found where v is an ancestor of u in T (an up-edge: modify the tree).
- A smaller graph size means a smaller I/O cost.
- Memory: 2 × |V|

(Figure: example graph on nodes A-K, showing up-edges that modify the tree and the early-acceptance merges)