SlideShare a Scribd company logo
1 of 52
1
B.Tech Project Report
On
Link Prediction Problem for Heterogeneous Networks
By
Sai Akhil Reddy Gopidi (1210110172)
Nithin Kumar (1210110095)
Roopesh Kumar Kotte (1210110093)
Supervisor
Dr. Dolly Sharma, Shiv Nadar University (dolly.sharma@snu.edu.in)
Submitted in the partial fulfillment of requirements for
Bachelor of Technology in Computer Science and Engineering
Department of Computer Science and Engineering,
School of Engineering, Shiv Nadar University,
Gautam Buddha Nagar, U.P., India, 201314
http://www.snu.edu.in
2
Approval Sheet
This report entitled Link Prediction Problem for Heterogeneous Networks by Sai Akhil
Reddy Gopidi, Nithin Kumar & Roopesh Kumar Kotte is approved for the
degree of Bachelor of Technology in Computer Science and Engineering.
Project Advisor
Name Dolly Sharma
Signature ________________________________
Date ________________________________
3
Declaration Sheet
We declare that this written submission represents our ideas in our own words and where
others' ideas or words have been included, we have adequately cited and referenced the
original sources.
We also declare that we have adhered to all principles of academic honesty and integrity
and have not misrepresented or fabricated or falsified any idea/data/fact/source in
our submission.
We understand that any violation of the above will be cause for disciplinary action
by the Institute and can also evoke penal action from the sources that have thus not been
properly cited or from whom proper permission has not been taken when needed.
Name of the Student: Sai Akhil Reddy Gopidi
Signature ___________________________
Name of the Student: Nithin Kumar Kakkrineni
Signature ___________________________
Name of the Student: Roopesh Kumar Kotte
Signature ___________________________
Date _______________________________
4
Abstract
Interaction among members has become an important aspect in many social networks like
Facebook and YouTube. As the number of users grows the network size also increases and it
becomes difficult for users to find their friends or search and watch the videos they like. To
make the life of the user easy, social networks suggest friends and videos to the users based
on their previous searches and mutual interests. Therefore social networks have focused on
link prediction techniques that allow users to easily find their needs. Link prediction is a
critical task that not only helps increase the linkage inside the network but also improves the
user experience. In a link prediction algorithm it is required to identify the factors that
influence link creation. In this project we analyze and discuss some of these factors and also
give an approach which satisfies these factors. The approach is to estimate link relevance by
using Fuzzy Link Based Classification algorithms based on Backpropagation that gives the
probability of a link existence at some future time. We then evaluate the accuracy of the
results obtained using accuracy measurement techniques like Precision. We apply our
methods to YouTube dataset in order to evaluate the performance of our algorithms and
compare it to the performance of previously proposed algorithms.
5
Contents
6
List of Figures and Tables
S.No. Figure/Table Number Description
1. Fig 1.1 Heterogeneous graph with multiple edge types
2. Fig 1.2 Depicting problem statement using graph
3. Fig 2.1 CSV file of type 1 edge in YouTube dataset
4. Table 3.1
AUROC for YouTube, Disease and Climate network for
each edge type
5. Fig 3.1 Graph showing connections between nodes
6. Table 3.2 Distance between the nodes
7. Fig 3.2 Alternate neighbors and link possibility
8. Fig 3.3 Local random walk path
9. Table 4.1 Results obtained from the AUROC curve on DBLP dataset
10. Fig 4.1 Impact of Collaboration Frequency of different measures
11. Fig 5.1 System Architecture
12. Fig 5.2 Feed Forward Neural Network structure
13. Fig 5.3 Example of a triangular formation in the network
14. Fig 5.4 Sums of three sides of a triangle
15. Fig 5.5 Threshold Value
16. Fig 5.6 Number of expected links for bucket size of two
17. Fig 5.7 Expected links with hop counts
18. Fig 5.8 Local random walk score calculated for expected links
19. Fig 5.9 Calculated Precision
20. Fig 5.10 Precision vs Hop count graph
21. Fig 5.11 Hop count vs estimated links graph
22. Fig 5.12 Edges obtained for bucket size
23. Fig 6.1 Project Phases
24. Table 6.1 Project Schedule
7
Abbreviations and Nomenclature
AUROC- Area under the Receiver Operating Curve
CSV- Comma Separated Value
DBLP- Digital Bibliography & Library Project
JVM- Java Virtual Machine
GC- Garbage Collector
STL- Standard Template Library
JC- Jaccard’s coefficient
CN- Common Neighbor or Contact Network
A/A- Adamic Adar
LRW- Local Random Walk
CP- CANDECOMP/PARAFAC
MRLP- Multi Relational Link Prediction
MRIP- Multi Relational Influence Propagation
SBN- Shared subscriptions
SBR- Shared subscribers
VID- Shared favorite videos
FFNet- Feed Forward Neural Network
NN- Neural Network
8
9
Chapter 1
Introduction
1.1 Overview
Data mining and analyzing data is an important and upcoming field in computer science.
Huge volumes of data stored in data warehouses must be extracted for the purpose of pattern
recognition which can in turn benefit the organization. Storing and managing this data had
become a major problem in today’s world. Interaction among members of a community is or
network is of highest priority. Organizations like Facebook and YouTube which aim to
connect millions of people around the world analyze the pattern of user behavior and
recommends friends and videos accordingly. Given a user A at some point of time t, the task
at hand is to estimate all the possibilities of link formation between user A and user B by
taking into consideration all the parameters that these two users share. Link prediction makes
10
it possible to determine wheather two people can become friends or not beforehand. Many
social networks use this technique to suggest friends to the users so that they do not have to
search for all their friends.
This project includes some of the commonly the various link prediction techniques that have
been used previously as well as their drawbacks. We then propose our algorithm and
implement it on the YouTube dataset. Finally we compare the obtained results of our
algorithm with the results of the previously proposed algorithms and conclude wheather our
algorithm is efficient or not.
1.2 Problem Statement
Given a heterogeneous graph G = (V1 U V2…U Vm, E1 U E2… U En), where Vu (u € N)
represents the set of nodes of same type u (users) and Ej (j € M) represents the link of type j
between the nodes (relationship between users), our task is to predict the future possible links
between the users. Since it is not possible to compare the dataset at two different time
intervals, we use the cross-dimension validation process. We divide the complete dataset into
2 divisions-
1. Training set
2. Testing set
Fig 1.1 Heterogeneous graph with multiple edge types [1]
11
Out of the 5,574,249 edges in our dataset, we omit 10% of the total nodes, i.e., 8,36,137 and
implement our algorithm on the remaining 47,38,111 to train the dataset. After the training is
completed we test the algorithm on the entire dataset to check how accurate the results are
using precision accuracy metric since we already know which links are to be expected and
which links do not exist.
Fig 1.2 Depicting problem statement using graph
The numbers above the links indicate the intensity of the links. For example, users 2 and 6
have more commonly shared videos than users 5 and 6.
1.3 Team Members Contribution
Name: Akhil Reddy
Contribution: Akhil’s contribution includes literature survey to understand the fundamental
concepts of link prediction, problem formulation, designing the algorithm and testing the
algorithm.
Name: Nithin Kumar
Contribution: Nithin’s contribution includes assistance with literature survey, problem
formulation, designing and testing the algorithm.
Name: Roopesh Kumar
12
Contribution: Roopesh’s contribution includes data gathering, testing the algorithm, research
in finding suitable data structure to accommodate all the nodes in the dataset and poster
design.
Chapter 2
Feasibility Study and Requirements
2.1 Dataset Used
The dataset that we have selected to implement our algorithm is the YouTube dataset from
December 2008, which is a video sharing platform for millions of users. This dataset includes
information of those users who were will to share their information [2].
Number of Nodes: 15,088
13
Number of Edges: 5,574,249
Types of Edges: 5
In this case we are considering all the users as nodes and the different types of relations
between them as edges, to construct out heterogeneous graph. A graph G = (V1 U V2…U
Vm, E1 U E2… U En), where Vu (u € N) represents the set of nodes of same type u (Users)
and Ej (j € M) represents the link of type j between the nodes (relationship between users) is
called a heterogeneous graph.
There are 5 types of edges in this dataset, namely-
1. The contact network between the 15,088 users.
2. The number of shared friends between two users in the 848, 003 (excluding the
15,088) contacts- Two users are connected if they both add another user as contact.
3. The number of shared subscriptions between two users- When two users subscribe
to same person/channel.
4. The number of shared subscribers between two users- Two users are connected if
another user has subscribed both of them.
5. The number of shared favorite videos- Users sharing same videos.
The dataset is in a CSV (Comma Separated Value) format for each edge type independently.
E.g.: 7, 12, 94 – Indicates that user ids 7 and 12 have an intensity of 94 between them of a
particular edge type.
14
Fig 2.1 CSV file of type 1 edge in YouTube dataset
In the above figure the 1st and 2nd columns represent the node that have a link of type 1 which
the intensity in the 3rd column.
2.2 Scope:
Our YouTube network consists of different type of edges with an interaction value which
forms a large network. The scope of our project is to reduce the link prediction space
specifically in the YouTube network and suggest link in a probabilistic way that the link will
be useful in near future. Our experiment of link prediction is only for research purpose.
2.3 Problems Faced:
15
 Initially we considered using DBLP dataset, but due to lack of multiple edges and
insufficient data available in the dataset we were forced to work on a new dataset.
 As the dataset contains 15088 authors and 5,574,249 links in the YouTube dataset, it
is hard to accommodate all the nodes in the dynamic array in the form of a 3-D matrix
because it overflows the available memory of the ram and gives a GC overhead error.
 System requires larger RAM than the available RAM in our laptops to load dataset
into the data structure.
2.4 Software and Hardware Requirements
 Operating systems.
 NetBeans with JDK to run the Java code
 Larger ram.
2.5 Technical Feasibility
Initially the project source code was meant to be written in Java using 3D arrays to create the
adjacency matrix. As the number of nodes was very high, the Java VM could find enough
memory to store the data. Hence we shifted to C++ vectors which have the capability to
dynamically allocate memory to the node. Even the vector STL could not accommodate all
the data in a 3D vector and hence we were forced to shift back to Java and used 2D arrays to
create the adjacency matrix. This Project is technically feasible as it works perfectly fine on
the existing version of Java provided we use 2D arrays.
NOTE- To use 2D arrays for such large datasets where number of nodes is 15088, the
JVM option must be changed to –Xms2g to allocate more heap size.
2.6 Economic Feasibility
It is economically feasible as the data set is publicly available online, so it is free of cost
there is no extra costs required for the project. Research done to get into the project is also
based on the scholarly articles available online, there is no need to learn new technology
(language).
16
2.7 Schedule feasibility
It took some time to know the concepts of link prediction in heterogeneous, homogenous
networks, techniques used to predict the future link. When we discover what to do
innovatively and something new, we proceeded swiftly with our project in order to complete
it within the stipulated time frame.
2.8 Project Meetings
2.8.1 Meeting with Supervisor
We have constantly met our project supervisor on weekly bases (twice a week) to discuss
about the objectives of the project. We were instructed to conduct the literary survey in the
beginning to get an idea about the subject since everyone in the group was new to this
domain of study. Once we completed literature survey the algorithm was discussed to be
implemented on the dataset. Later accuracy metrics were discussed to check out the
performance of the algorithm and finally the results were compared with the results of other
algorithms. All the important meetings with the supervisor were conducted in person and the
minor details were either discussed over phone or e-mails.
2.8.2 Group Meetings
The group members met daily. Initially we discussed the scope and schedule of the project as
well as discussed on individual roles to be carried out. On completing the literature survey
and getting a good grasp of the subject we formulated a problem and finalized the dataset to
be worked. During the literature survey the papers were distributed among the team members
and on completion of paper the member explained the contents of that paper to other
members to save time and avoid redundancy.
2.9 Text Deliverables
Along with this report several other documents are also included in order understand the
research in a deeper sense.
Dataset- The YouTube dataset on which the research was conducted on. The dataset is in a
CSV format with each link type having its own file.
17
Source Code- A CD is given along with the report containing the code for all the algorithms
implemented.
List of all expected value- On running the LRW a list of all expected edges is obtained. A
file including all these edges is included
Deleted Files- To test the accuracy of the algorithm, 10% of links are deleted from original
file and stored in another file. Testing is done on this file.
2.10 Conclusion
We can confidently conclude after considering all the above stated points that the project is
feasible and we will complete the project in stipulated time allotted to complete the project
work.
18
Chapter 3
Commonly Used Algorithm
Many algorithms have been proposed for link prediction purpose in homogeneous networks
as well as heterogeneous networks. We cannot conclude that a specific approach is the best
way to predict links because link prediction methods are domain specific. Performance of the
algorithms is based on how well the network supports the predefined scoring methods for link
formation. For example, Facebook and Twitter being social networks yield best results for
neighborhood methods like common neighbor and Adamic/Adar for friend recommendation
links and in a climate network Jaccard’s coefficient performs well due to spatial
autocorrelation [3]. It is also possible that only one method might not give the best results for
all the different links in a single network. Hence performance of algorithm is not only
dependent on the predefined scoring measure and the type of the network, but also on the
type of links that it is being used to predict. This is clearly illustrated in the disease-gene
network, where different method works best for different link in the same network [3]. The
19
AUROC table below clearly indicates the above stated, where the bold faced indicate the best
link predictor method.
Table 3.1 AUROC for YouTube, Disease and Climate network for each edge type [3]
3.1 Commonly Used Algorithms
The following discussed methods can be used for any pair of nodes (A, B) in a network.
Fig 3.1 Graph showing connections between nodes
Score is allocated to each link based on the predefined scoring techniques used in these
algorithms and based on this score we predict wheather there is a possibility of link between
the nodes.
3.1.1 Graph Distance
20
In this method the distance between the two nodes, i.e., the source and destination nodes is
calculated and the inverse or negated length is considered. If the distance between the nodes
is less then there is higher chance that these nodes might be connected and vice versa.
Table 3.2 Distance between the nodes
Nodes Distance
(A,C) -2
(C,E) -3
(A,E) -3
As shown is Fig 3.1, in the above table the distance between nodes A and C is the least,
therefore there are higher chances of link formation between these two nodes. The negative
sign (-) is only to show that the least distance value has higher probability of link.
3.1.2 Common Neighbors
Link prediction in this method is based on the number of common neighbors that two nodes
have. If two nodes have more number of common neighbors then more is the probability of
link existence between the nodes and vice versa.
Score= Ə (A) ∩ Ə (B), which is the total number of common neighbor nodes of the two
nodes.
Where Ə (x) denotes the neighbors of a node x.
3.1.3 Jaccard’s Coefficient
Fig 3.2 Alternate neighbors and link possibility
21
Jaccard’s coefficient is the derived from common neighbors method but provides more
accurate results. For a given pair of nodes (A, B), the score assigned is the number of
common neighbor nodes of A and B divided by the total number of neighbor nodes of A and
B.
Score= Ə (A) ∩ Ə (B) ÷ Ə (A) U Ə (B)
The numerator part is similar to common neighbor’s method.
From Fig 3.3 we can see that for nodes C and D the common neighbors are A and B. The
scoring method used by common neighbors gives a high score for a link between C and D.
We are not considering any other nodes in this method. It is also clear from Fig 3.2 that node
C has many other neighbors apart from A and B whereas D has only those two neighbors.
Therefore in this case the score calculated by the common neighbor method is not accurate
and to negate these additional neighbors of the nodes we divide the common neighbors of
both nodes by the total number of nodes. This increases the accuracy of calculated score.
3.1.4 Adamic/Adar
Adamic/Adar is the advanced version of Jaccards coefficient, which weighs rarer neighbors
more heavily. To put in simple terms- for a pair of nodes (A, B), if the common neighbors of
A and B have less common neighbors then there is a higher possibility of link existence
between A and B.
From Fig 3.2, the common neighbors of A and B are C and D which in turn do not have any
common neighbors. So there is a better link formation between A and B.
Score= , where z is the common neighbor of nodes x and y.
3.1.5 Preferential Attachment
Score assigned to the pair of nodes is the product of their degrees. Higher score is assigned if
the nodes have many edges attached to them.
3.1.6 Average Commute Time
It is the average number of steps required by a random walker starting from a source node to
reach destination node. The two nodes are said to have a link if they have smaller commute
time.
22
3.2 Local Random Walk
Random walk algorithm is an advanced method which is used to predict links in a network. It
is a Markov process, that is, it is a memory less process which makes its next move based
only on its current location and does not consider the previously followed path. In a given
graph G (V,E) for a pair of nodes x,y the process can be defined using a transition probability
matrix Pxy=axy / kx
where axy=1 if x and y are connected and 0 other wise. Kx denotes the degree of node k.
Let us consider a random walker starting at node x and must travel to node y and let πxy(t) be
the probability that this walker reaches node y after t steps. πx(t) =PT πx(t−1), is the
probability score to calculate that the random walker will come to location x from the
previously positioned location.
In the below figure consider that the random walker starts at x and must reach y. From x the
walker can go to any of the 4 nodes ahead of him, i.e., nodes 1-4 and through these he can
reach node y. The probability the walker will go to node 1 from node x is-
Π1(t) =PT Π1(t−1)
Similarly for the remaining nodes. To travel from node 1 to node y, again the probability is
Πy(t) =PT Πy(t−1)
Hence the probability of going to the next state is dependent only on the current state and not
the previous states.
Fig 3.3 Local random walk path
23
Summing up and taking average of the probabilities can give the score for existence of link
between nodes x and y.
3.2.1 Random Walk with Restart
Sometimes is so happens that the random walker deviates and goes away too far from the
destination node. In the above figures, consider that the walker has taken moved from node x
to node 100, which is far away from the node y. This gives low and inaccurate score
prediction and there are chances that the walker may never reach the destination. To
overcome this problem we can use random walk with restart, where walkers are continuously
released at regular intervals from the starting point which increases the probability of the
walker to reach the destination in the best possible path.
24
Chapter 4
Related Work
In order to understand the basic crux of the link prediction problem, we team had conducted
an extensive literature survey of various papers published so that we were aware of the
previously researched problems, the approach used, the dataset they experimented on and the
results obtained. Every approach has its own advantages and drawbacks over the others.
Since this is relatively new topic and research has recently begun in this domain, very few
papers were published. The obtained papers were distributed among the team members and
on daily basis the important content of the papers were discussed with the whole team. Few
important papers that are relevant to this project are discussed below.
25
4.1 Tensor Factorization
 Paper Title: Link Prediction in Heterogeneous Networks Based on Tensor
Factorization [4].
 Authors: Piao Yong , Li Xiaodong1 and Jiang He
 Publication: The Open Cybernetics & Systemics Journal, 2014, 8, 316-321
 Problem: To predict the edges that will be added to the network during the interval
from time t to a given future time t’.
 Method: Heterogeneous networks can be organized as a third order tensor (Node!
Node! Link type) or multi-dimensional. Proposed a method based on tensor
factorization that can capture the correlation between different types of links for the
link prediction problem without loss of information. Employed
CANDECOMP/PARAFAC (CP) tensor decomposition to capture the underlying
patterns in the node-relationship-node tensor. The CP decomposition generated
feature vectors for the nodes in the graph, that are computed to get a similarity score
that combines the multiple types of the graph.
After CP decomposition, 3 factor matrices were known: node matrix A, relationship
matrix B, and node matrix C. Link prediction can be computed according to the
captured associations. Score Matrix s is defined as
In this paper, they used alternating least-squares (ALS) with weighted-!-
regularization algorithm to fit the CP decomposition.
 Results: Adamic/Adar measure and Katz measure performs well both in theoretical
and practical experiments. So here, they just compared these two measures with their
methods. Their method provided better precision than unsupervised ones on the data
sets and also provided a competitive effect to Adamic/Adar measure and both those
two methods beat Katz measure.
 Challenges: It is cost intensively to compute tensor factorization.
 Datasets: UMLS. This data set contains data from the Unified Medical Language
System semantic work. This consists of 135 entities and 54 relationships. The entities
26
are high-level concepts like 'Disease or Syndrome', 'Diagnostic Procedure', or
'Mammal'.
4.2 Multi Relational Influence Propagation
 Paper Title: Link Prediction in Heterogeneous Networks: Influence and Time
Matters[5].
 Author: Yang Yang and Nitesh V. Chawla, Department of Computer Science & Engg,
University of Notre Dame.
Yizhou Sun and Jiawei Hanı, Department of Computer Science, University of Illinois
at Urbana-Champaign.
 Problem: Given a heterogeneous network, in this case a DBLP bibliographic network,
the machine must be able to predict wheather the link is present in the network and
the possibility of link in future. DBLP dataset contains information about 3,215
authors who published a minimum of 5 papers in conferences between 1990 and
2010. The links can be of different types, for E.g. link between author and author (co-
author), author-paper (writes), paper-conference (published in).
 Method: Different unsupervised link prediction algorithms were used to test the data
set like Common Neighbour, Jaccard Coefficient, Adamic/Adar Preferential
attachment, etc. Of all the algorithms, Multi relational Influence propagation (MRIP)
which uses conditional probability which is equivalent to edge correctness yielded the
best results. For unsupervised learning data between 1990 and 2000 was chosen as
training set and data between 2001 and 2005 as training set.
 Results: As unsupervised link predictions are domain specific performance varies for
each algorithm. MRIP has better performance than others in predicting co-authorship
between authors and predicting terms shared between authors and has slightly less
performance in conference presenter’s link.
 Challenges: MRIP works well for stable networks. But as DBLP is a non-stable
network (unit root value= 0.99), since the number of links keep on increasing every
year the traditional unsupervised link prediction algorithms are not of much use.
Availability of dataset and security is a problem. Additional information is collected
through user survey which is incomplete and unreliable. Information is needed that
can expose the users subconscious behavior at a particular time.
27
 Future Work: As the network changes with time temporal feature based methods are
implemented. Bootstrapping technology is one such method. Based on the degree of a
node we rank them in descending order and analyze how new links in future are
associated with the top K% of them.
 Dataset & availability: The whole DBLP dataset is available as an XML file.
Table 4.1 Results obtained from the AUROC curve on DBLP dataset [5]
JC CN AA MRIP
Collaboration 0.590 0.597 0.596 0.769
Conference 0.702 0.698 0.689 0.691
Key Terms 0.545 0.546 0.532 0.811
4.3 Multi Relational Link Prediction
 Paper Title: Multi-Relational Link Prediction in Heterogeneous Information
Networks[3].
 Author: Darcy Davis, Ryan Lichtenwalter, Nitesh V. Chawla, Interdisciplinary Centre
for Network Science and Applications, Department of Computer Science and
Engineering,
University of Notre Dame.
 Problem: Three different domains are considered- YouTube, Disease-Gene and
Climate network datasets.
YouTube has 15,088 users as of December 2008, who are considered as nodes in this
case. The users are connected by 5 different edges- contact network (CN) of the user,
shared contact with users outside of the network (FR), shared subscriptions (SBN),
shared subscribers (SBR), and shared favorite videos (VID).
The disease-gene network consists of 703 diseases and 1,132 genes with 4 edges.
The climate network has 1,701 locations with 7 edges for different climate changes.
 Method: Unsupervised link prediction methods are implemented. Link prediction for
each edge type is evaluated individually. Link prediction performance is evaluated
separately for each edge using Area Under the Receiver Operating Curve (AUROC).
 Results: Performance of the algorithms is based on how well the network supports the
predefined link scoring assumption. Performance of local neighborhood methods is
28
predominant in social networks like YouTube. Jaccard coefficient performs well in
climate network closely located areas have similar climate. In disease-gene network
each link type was captured best by a different method. Refer to Fig 3.1 for AUROC
values of each node.
 Challenges: A node in a network can have multiple edges and each edge can increase
the likelihood of a contact. In YouTube 76% of node pairs with contact edge have
other edges which increase the number of contacts of that particular edge. Bad
performance of MRLP on other edge types indicate that MRLP doesn’t work well
when additional link types are introduced (noise).
 Future Work: High performance link prediction (HPLP) is introduced for this purpose
which uses Feature vector, homogeneous link prediction and heterogeneous link
prediction.
4.4 Graph Model TransFG
 Paper title: Inferring social ties across heterogeneous networks[6].
 Authors: Jie Tang Tsinghua University
Tiancheng Lou Tsinghua University
Jon Kleinberg Cornell university.
 Problem: Predict the type of relationship in a target network by leveraging the
supervised information (labeled relationships) from the source network.
 Method: proposed Predictive model such as transfer based factor graph model
(TransFG) for learning and predicting the type of social relationships across network.
 Results: Proposed method TransFG is more helpful when combined with social
theories (Structural balance, structural hole, social status, two step flow) in inferring
type of relationship in social network. Performance drops when any one of the social
theories is ignored. TransFG is checked against social theories in datasets such as
Epinions, Slashdot, Mobile in predicting undirected relationship and Coauthor and
Enron for predicting directed relationship.
 Challenges: As discussed in the paper, there are two types of networks source and
target networks, predictive model needs to learn both the networks, the challenge is
then how to bridge the two networks, so that we can transfer the labeled information
from source network to target network.
29
 Future work: some other social theories can be further explored and validated for
analyzing the formation of different types of social relationships
 Dataset: Epinions, Slashdot, Mobile, Coauthor, Enron are datasets and they are
publicly available.
4.5 Path Predict
 Paper Title: Co-Author Relationship Prediction in Heterogeneous Bibliographic
Networks [7].
 Authors: Yizhou Sun, Rick Barber, Manish Gupta, Charu C. Aggarwal, Jiawei Han
 Publication: Published in Int. Conf. on Advances in Social Networks Analysis and
Mining (ASONAM'11), July 2011
 Problem: Identify the kind of connections between two authors that are more helpful
to lead to future link collaborations. Basically predicting whether two authors that
have never co-authored before will co-author sometime in the future rather than
predicting how many times two authors will co-author in the future. Given a
heterogeneous network, the link prediction task is then generalized to relationship
building prediction, which is to predict whether two objects will build a relationship
following a certain target relation.
 Method: There are two stages (Training and test stage). In the training stage, we first
sample a set of author pairs that have never co-authored in T0, collect their associated
topological features in T0, and record whether a relationship is to appear between
them in the future interval T1.
 Model used: Path predict model. Defined the topological features in the DBLP
network and used meta path based Topological feature. Meta paths between two
object types can be obtained by traversing on the DBLP network schema, by using
standard traversal methods such as the BFS (breadth-first search) algorithm.
Discussed below are four measures on meta path.
1. Path count
2. Normal Path count
3. Random walk
4. Systematic Random walk.
30
Defined Co-authorship Model, used logic regression method as the prediction model.
For each training pair of authors (ai1 , ai2), let xi be the (d+ 1)- dimensional vector
including constant 1 and d topological features between them, and yi be the label of
whether they will be co-authors in the future (yi = 1 if they will be co-authors, and
otherwise 0), which follows binomial distribution with probability pi . The probability
pi=e^( xiβ)/ (e^( xiβ )+ 1) where β is the d + 1 coefficient weights associated with the
constant and each topological feature. Then used standard MLE (Maximum
Likelihood Estimation) to derive βˆ, that maximizes the likelihood of all the training
pairs.
 Results: The co-authorship for highly productive authors is easier to predict than less
productive authors. The prediction accuracy is higher when the target authors are 3-
hop co-authors, which means the collaboration between closer authors in the network
is more affected by information that is not available from network topology. In the
previous cases, we say two authors have a co-authorship if they have co-authored one
paper. Here, we study the relationships defined by different collaboration frequency.
From Fig. 4.1, we can see that, the measure symmetric random walk is more
important in deciding high frequency co-author relationships. Two authors who can
be reached with high probability mutually in the network will be more likely to build
strong collaboration relations.[7]
 Challenges: Predicting co-authors for a given author is an extremely difficult task, as
we have too many candidate target authors (3-hop candidates are used), while the
number of real new relationships are usually quite small.
 Dataset: The DBLP bibliographic network is available in the internet in an XML file.
Fig 4.1 Impact of Collaboration Frequency of different measures [7]
31
Chapter 5
Algorithm Design and Implementation
To predict the links in the dataset we have used the Fuzzy Link Based Classification
algorithm- a subpart of Neuro Fuzzy Link Based Classification algorithm, which is a
combination on the Feedforward Neural Networks (FFNet) Backpropagation techniques and
fuzzy logic. FFNet was inspired from the neural system the human body. In this chapter we
will first explain about the system design involved in setting up the network followed by
explaining the FFNet and Backpropagation algorithm, explain the reasons for using the
algorithm and then discuss about how we worked on our dataset and the steps involved in
obtaining the desired output.
5.1 System Design
From selecting the dataset to be worked on to data classification and link prediction, many
steps are involved in like clustering data, classification of data, data extraction etc. These
32
steps are performed in a proper order as mentioned in the below figure of System
architecture.
Fig 5.1 System Architecture [9]
 Initially the dataset must be selected to retrieve the data so that classification can be
performed.
 In user interface model data is retrieved from the dataset and represented in a readable
format. On this data pattern recognition and analysis can be performed.
 In clustering and classification phase dissimilar data is differentiated from similar
data. In our project this step can be omitted because the dataset is already in the form
of a CSV file, divided according to the respective link types.
 Knowledge base contains the rules for the construction of the fuzzy system. The
system checks each input and acts according to the weights and specified attributes of
the nodes.
 Finally the decision manager contains the logic to make decision based on the rules
mentioned in the knowledge base. It is the heart of fuzzy logic. It decides the output
and passes it to the response which displays the final result.
33
5.2 Backpropagation
5.2.1 Feed Forward Neural Networks
FFNet consists of a number of nodes that are linked to each other with edges of different
types (in a heterogeneous network), where each edge carries some weight. All the nodes are
connected to every other node in the layers that precede and succeed them. Input values are
given at the input layer which propagates the value to further layers. Input and output for the
input layer is the same. The final layer is the output layer that gives the predicted value. All
the layers apart from input and output layers are called hidden layers. The outputs from the
hidden layer are propagated as inputs to other layers and finally to the output layer.
5.2.1.1 Learning Phase
The number of input and output units depends on the attributes of the nodes and the number
of categories we want to classify the data. First we must create the adjacency matrix for all
the nodes and edge types to store the strength value. From this matrix, we can finalize the
number input units as the number of attributes of the node and output as the different types of
categories.
Fig 5.2 Feed Forward Neural Network structure [10]
34
The outputs of the network are compared with the actual outputs and if the output values are
large then the categorization is correct. On this comparison the weights are modified so that
when same types of inputs are presented the outcome value is higher than the current value.
5.2.2 Neuro Fuzzy Link Based Classifier
Neuro fuzzy link based classifier works based on the neuro fuzzy rules to classify the edges.
These rules are based on backpropagation technique. In our problem we first adjust the
weights by normalizing the edge strengths of all the 5 types followed by implementing
triangular fuzzy membership function that looks for all the possible triangular formations in
the network. Based on the triangle perimeter we calculate the threshold value and base our
final predictions.
5.3 Reason for Using the Algorithm
Some of the main reasons for using the algorithm are [8]-
1. Neural Networks (NN) have high tolerance of noisy data.
2. NN can be used to work on datasets on which they have not been trained.
3. The can be used when we have limited idea about the relationship between the
attributes of the nodes type of edges present in the dataset.
4. NN algorithms are inherently parallel; parallelization techniques can be used to speed
up the computation process [8].
5. NN algorithms have been successful in handwritten character recognition, medical
purposes and training a machine to pronounce English text.
5.4 Steps Involved in Proposed Algorithm
1. Initially all the values of the dataset are in a CSV file, with each type of edge having a
different file. The values are represented as shown in Fig 2.1. Before starting the
implementation, 10% of the total edges are removed from each file and store in another file
and the experiment is performed on the remaining 90% of the values. Final testing will be
performed on these eliminated 10% values to check the accuracy to the algorithm
35
2. We create an n x n 2D matrix to store the strengths of the edges between the nodes,
where n is the total number of nodes in the network which is 15088 in this case. The strengths
of the 1st edge types (contact network) are stored in the matrix. No normalization of values is
needed as all the strength values are 1 in the 1st type.
3. If no edge exists between the nodes then the link strength is 0.
4. The 2nd file is read which has strengths of the 2nd edge type (number of shared friends
between two users). The strength values are not of the same value. Normalization is
performed on each strength value and the normalized value is added to the strength values of
the 1st type.
Normalization= (xi – Minimum value) ÷ (Maximum value – Minimum value)
Where,
Xi – strength of the currently read edge
Maximum value- Strength with maximum value in the current file (of a
particular edge type).
Minimum Value-- Strength with minimum value in the current file (of a
particular edge type).
5. Step 4 is repeated for the remaining 3 files.
6. Finally the n x n 2D matrix will have the normalized strength values between the nodes
of all the 5 files.
7. Once the values are normalized the next step to find out all the triangular formations in
network.
36
Fig 5.3 Example of a triangular formation in the network
For example in Fig 5.3, nodes A, B and C for a triangle. The values above the links are
the normalized values of the link types between the nodes.
8. After all the triangles are recognized the sum of their 3 sides is calculated. Fig 5.4
shows the calculated values of all the triangles.
E.g. Sum for Fig 5.3 = 0.9+0.5+0.1=1.5
Fig 5.4 Sums of three sides of all the triangles
37
9. From the obtained sum of each triangle in the network 1000 values are selected at
random and stored in an excel file and sorted in ascending order and stored in a bucket
size of n/200 where n is the total values we select (n=1000 in this case). We take this
value as threshold value (999th value).
There are many other ways to select the threshold values but we had many triangles in the
data nearly to the size of 5.6 GB.
Fig 5.5 Threshold value
10. In the next step the open triangles are identified, i.e. the ones in which 3rd side is not
closed, consider x as normalized strength of one edge and y as the normalized strength of
another edge.
If - (x+y) > Threshold value
38
Then there is a possibility that a link exists between the remaining nodes where no
link is currently present.
For taking two values in a bucket (value=2.2674) we get 43,85,214 expected links.
Fig 5.6 Number of expected links for bucket size of two
11. The links for which sum of two sides was greater than threshold are stored in
another file.
12. LRW algorithm is implemented on these expected links. The start node and end
node are specified and the random walker moves from start to end through the
neighbors connected to node in its current state. If the distance between the start and
destination is greater than n (user defined number) hops then it is assumed that no link
exists between them. Otherwise the number of hops are calculated.
13. Based on the number of hops score is calculated for each link prediction using the
formulae
39
S x y
LRW (t) = k x / 2|E| · π x y (t) + k y / 2|E| · πyx (t)
Where
K x is the number of neighbors of node x
E is the total number of edges in the entire network
π x y= πyx is the hop distance from node x to node y (we are assuming that distance
from node x to node y is the same from node y to node x).
14. The links are arranged in descending order with respect to their scores and the top
25% of them are considered (L). We then compare these links with those that we have
deleted from the original dataset in the beginning and compare how many links that
are predicted are actually present (l).
15. Finally precision is calculated using the formulae
Precision = l / L
to find out the accuracy of the algorithm implemented. Fig 5.7 shows the
calculated precision for all the hop counts that we experimented with
5.5 Pseudo Codes
5.5.1 Adjacency Matrix
1. Read the data in file1 ;
2. Store the values as node1,node2 ,normalizedstrength;
3. Adjacency[node1-1][node2-1]= Adjacency[node1-1][node2-1]+normalizedstrength;
For creating testing and training datasets
1. Calculate the no of edges in each file.
2. int c= no of edges/10,count=0;
3. While(count!=c){
int a= (Math.Random)*15088;
int b= (Math.Random)*15088;
If there is an edge between a and b
Make adjacency [a][b]=0;
Increment count;
Write (a+1) +”t+( b+1) in a file.
40
}
4. For(int i=0;i<l5088;i++)
for(int j=0;j<15088;j++)
if( adjacency[i][j]!=0)
write( (i+1)+”t”+ (j+1)) in a file
5. End
5.5.2 Closed Triangles and Strength
1. for(int i=0;i<l5088;i++)
for(int j=0;j<15088;j++)
for(int k=0;k<15088;k++)
if( adjacency[i][j]!=0) && adjacency[j][k]!=0) && adjacency[i][k]!=0)
{
write (i+1)+”t”+ (j+1)+”t”+(k+1)+”t+” adj[i][j]+adj[j][k]+adj[i][k] in a new file
} } }
2. Now read the file and chose 1000 random triangles and save them in other file
3. The 1000 random triangles are sorted and are taken different bucket sizes
such as 2 in a bucket size
4. Store the998th triangle strength as threshold value.
5.5.3 Open Triangles
1. for ( int i=0;i<15088;i++
for(int j=0;j<15088;j++)
for(int k=0;k<15088;k++)
if there is an edge between (i,j) and (j,k) but no edge between (I,k)
if strength of (i,j)+(j,k) exceeds the threshold value
Adjacency[i][k]=1000;
2. for ( int i=0;i<15088;i++
for(int j=0;j<15088;j++)
if the strength between I and j =1000;
41
write (i+1) and (j+1) in a file
5.5.4 Applying Random walk algorithm and calculating score
1. Read the open triangle file
2. Store the node values into a and b
3. Start the random walker from node till it reaches to b or the hop count
value reach to 16.
4. If the hop count < 15 then store the value of and b.
5. Calculate the no of neighbors in a and b
6. Using the formula of local random walk predict the score of expected
edges.
7. Sort the edges using the scores
8. Select the top 25% of the edges
9. Calculate how many of these 25% edges are available in deleted dataset
10. Calculate the precision value.
5.6 Result
We conducted the experiment on the 43,85,214 expected links by taking the threshold value
as 2.2674, the 999th value of the randomly selected 1000 values. The calculated precision is
highest for hop count 50 hops which is 2.336% for predicting 61 correct links from the
expected 2611 links while hop count of 15 has least precision of 1.77% for predicting 12
correct links from the expected 675 links.
42
Fig 5.7 Expected links with hop counts
43
Fig 5.8 Local random walk score calculated for expected links
Fig 5.9 Calculated Precision
44
Fig 5.10 Precision vs Hop count graph
Fig 5.11 Hop count vs estimated links graph
5.7 Failed Approaches
In our approach to get the optimum results we have tried various combinations to get the
desired results. Some of the approaches we tried did not yield desired results and in this
section will be explaining some for these.
45
5.7.1 3D matrix
Our initial idea was to read the all the five CSV file of dataset into a 3D matrix and then do
the link prediction computations. But due to the excessive nodes (15088 x 15088 x 5) the
JVM could not find sufficient heap size to allocate memory. Therefore we decided to use a
2D array read each CSV file at a time, calculate the normalized value and then read the 2nd
file and so on.
5.7.2 Calculating Threshold value
 After finding out all the triangular node possibilities in the network, a threshold value
was to be calculated in order to find the possible missing links. Initially after
calculating the perimeter of the triangles we took the average of the perimeter and
multiplied it with 2/3 as the threshold. As a result we got a 31GB file that could
neither be opened using normal text editor nor copied.
Note- We multiplied with 2/3 because in the open triangle 2 links are already formed out of 3.
 In our 2nd attempt we randomly selected 1000 values and sorted them in ascending
order and took the values in buckets for 50. We selected the 950th value as the
threshold. As a result we got a more than 6 Crore expected links as shown in Fig 5.7.
 In the 3rd attempt we directly took the 975th and got 5 Crore expected nodes.
Fig 5.12 Edges obtained for bucket size 50
46
47
Chapter 6
Project Schedule and Conclusion
6.1 Project schedule
Our project was planned in two phases. The first phase includes literature survey to know the
complete information involved in link prediction, the various algorithms used and their draw
backs. Based on our understanding we then formulated a problem to work on and select the
dataset. In the second phase we stored the extracted values from the dataset in a 2D array
with reference to their strengths of edges, includes developing an algorithm, collecting results
and testing.
Phase1: Literature survey
Problem formulation
Dataset collection
Phase2: Developing algorithm
Implementation of the algorithm
Testing.
48
Fig 6.1 Project Phases
Table 6.1 Project Schedule
Date Activity
Meetings with
Advisor
Jan 6th
Meeting with advisor to discuss project schedule
and details
Meeting with advisor
Jan 9th -16
Research articles gathering and focusing on our
topic
Jan 17th-23rd
Literature survey of Article on link prediction in
heterogeneous networks and unsupervised link
prediction algorithms like Common neighbor,
Jaccard coefficient, Adamic/Adar
Jan 20th Meeting with
advisor
Explained in detail
about these methods
by advisor.
Jan 25th-31st
Literature Survey on
1. Paper on Multi-Relational link prediction in
Heterogeneous networks.
2. Paper on Link prediction in heterogeneous
Networks based on tensor factorization.
Jan 30th meeting with
advisor
literature survey
and problem
formulation
algorithm
implementation
link
prediction
49
Feb 2nd- 8th
Literature Survey on
1. Paper on inferring social ties across
heterogeneous networks.
2. Paper on exploiting place features in link
prediction on location based social networks.
Feb 7th meeting with
advisor
Feb 10th- 16th
1.Problem formulation and finding related datasets
2. Working on large Xml files to retrieve required
data. Due to the non-availability of different types
of edges in Co-authorship network we shifted to
YouTube Dataset
Feb 15th Meeting with
advisor
Feb 17th-24th
Worked on text files of dataset and tried to store
the values in 3d array in java. Due to heap size
problem we decided to shift to C++
Feb 22nd meeting with
advisor
Feb 26-30
Mid semester presentation week
March 1st-7th
Worked in C++ to store data in 3D vectors. March 3rd Meeting
with advisor
March 8th-15th
Faced the same memory heap problem with 3D
vectors in C++
Decided to work in 2d arrays in java with
normalized score for the strength of the edge.
March 10th suggested
to try in 2D java
matrix by storing
normalized strength
value.
March 18th-
27th
Mid-Semester break
Worked on final code for implementation.
April 1st-7th
Code implementation, collected results required at
various stages for final precision.
April 1st
Planned to work on closed triangles in the network
to find out the threshold value.
50
April 2nd
Average value turned out to be very low and this
resulted in huge file that couldn’t be opened
April 3rd
We randomized 1000 values of sum from closed
triangle list and picked values in buckets of 25, 15
and 2
April 4th We took optimal set of unlinked triangles and
performed local random walk algorithm on them
April 5th Calculated score of the predicted links and
selected the top 20% of the links
April 6th Calculated precision value for the predicted links
April 7th-15th Final Report, poster and presentation.
6.2 Future Work
The short duration of the project forced us to compromise on a couple of things that could
have yielded better precision values. Given a chance to expand this project we will
implement the following tasks that could not be included now.
6.2.1 Implement the complete Neuro Fuzzy Link BasedAlgorithm
Our initial idea was to implement the Neuro Fuzzy Link Based classifier algorithm that
incorporated FFNet and Backpropagation involving changes to edge intensity to obtain the
desired result. But due to shortage of time we decided to implement the Fuzzy Link Based
algorithm along with triangular fuzzy membership function that calculates the number of
triangles formed in the network and then calculates the threshold value based on this.
6.2.2 Local Random Walk
We implemented Local Random Walk algorithm after selecting the open triangles in the
network. Generally LRW is implemented multiple times, say 100, and the score is calculated
for those edges that occur multiple times. The run time to implement LRW is generally more
than 2 hours and due to time constraint we could implement LRW only once with a specified
hop count and calculated the score for the suggested links. This was one reason for the low
precision. Running the LRW multiple times and then calculating precision would have
yielded much better precision.
51
6.3 Conclusion
Data mining is a vast and upcoming area of research and link prediction is only a part of it.
Our area of interest lie in parallel with this field of study and some of us even opted to pursue
higher studies and specialize in Data Mining and Data Warehousing which motivated us to
take up this project. Though the results were satisfactory they were not as good as expected.
Keeping in mind the fact that this was our first experience in data mining and machine
learning field and the short duration of the project time period, we tried our best to go through
all the methods that are used in link prediction problem and implement our algorithm
successfully.
References:
[1] Heterogeneous Graph- http://www.mdpi.com/1424-8220/15/10/24735/htm
[2] YouTube Dataset download-
http://socialcomputing.asu.edu/datasets/YouTube
[3] Multi-Relational Link Prediction in Heterogeneous Information Networks
Darcy Davis, Ryan Lichtenwalter, Nitesh V. Chawla, Department of Computer Science
and Engineering
University of Notre Dame,Notre Dame, IN, 46556 US
52
https://www3.nd.edu/~dial/papers/ASONAM11b.pdf
[4] Link Prediction in Heterogeneous Networks Based on Tensor Factorization
The Open Cybernetics & Systemics Journal, 2014, 8, 316-321
http://benthamopen.com/contents/pdf/TOCSJ/TOCSJ-8-316.pdf
[5] Link Prediction in Heterogeneous Networks: Influence and Time Matters
http://hanj.cs.illinois.edu/pdf/icdm12_yyang.pdf
[6] Inferring Social Ties across Heterogenous Networks
https://www.cs.cornell.edu/home/kleinber/wsdm12-links.pdf
[7] Co-Author Relationship Prediction in Heterogeneous Bibliographic Networks
http://www.ccs.neu.edu/home/yzsun/papers/asonam11_pathpredict.pdf
[8] Data Mining Concepts and Techniques- 3rd edition
Jiawei Han, Micheline Kamber, Jian Pei.
[9] NEURO FUZZY LINK BASED CLASSIFIER FOR THE ANALYSIS OF BEHAVIOR
MODELS IN SOCIAL NETWORKS - Journal of Computer Science 10 (4): 578-584, 2014
Indira Priya Ponnuvel, Ghosh Dalim Kumar, Kannan Arputharaj and Ganapathy Sannasi
http://thescipub.com/PDF/jcssp.2014.578.584.pdf
[10] FFnet diagram
http://www.fon.hum.uva.nl/praat/manual/Feedforward_neural_networks_1__What_is
_a_feedforward_ne.html

More Related Content

What's hot

Community Detection in Social Media
Community Detection in Social MediaCommunity Detection in Social Media
Community Detection in Social MediaSymeon Papadopoulos
 
Clique-based Network Clustering
Clique-based Network ClusteringClique-based Network Clustering
Clique-based Network ClusteringGuang Ouyang
 
IMPROVED SENTIMENT ANALYSIS USING A CUSTOMIZED DISTILBERT NLP CONFIGURATION
IMPROVED SENTIMENT ANALYSIS USING A CUSTOMIZED DISTILBERT NLP CONFIGURATIONIMPROVED SENTIMENT ANALYSIS USING A CUSTOMIZED DISTILBERT NLP CONFIGURATION
IMPROVED SENTIMENT ANALYSIS USING A CUSTOMIZED DISTILBERT NLP CONFIGURATIONadeij1
 
Community detection
Community detectionCommunity detection
Community detectionScott Pauls
 
Community detection algorithms
Community detection algorithmsCommunity detection algorithms
Community detection algorithmsAlireza Andalib
 
Formulation of modularity factor for community detection applying
Formulation of modularity factor for community detection applyingFormulation of modularity factor for community detection applying
Formulation of modularity factor for community detection applyingIAEME Publication
 
IRJET- Fusion Method for Image Reranking and Similarity Finding based on Topi...
IRJET- Fusion Method for Image Reranking and Similarity Finding based on Topi...IRJET- Fusion Method for Image Reranking and Similarity Finding based on Topi...
IRJET- Fusion Method for Image Reranking and Similarity Finding based on Topi...IRJET Journal
 
Community detection in social networks[1]
Community detection in social networks[1]Community detection in social networks[1]
Community detection in social networks[1]sdnumaygmailcom
 
Group and Community Detection in Social Networks
Group and Community Detection in Social NetworksGroup and Community Detection in Social Networks
Group and Community Detection in Social NetworksKent State University
 
IRJET - Visual Question Answering – Implementation using Keras
IRJET -  	  Visual Question Answering – Implementation using KerasIRJET -  	  Visual Question Answering – Implementation using Keras
IRJET - Visual Question Answering – Implementation using KerasIRJET Journal
 
Techniques that Facebook use to Analyze and QuerySocial Graphs
Techniques that Facebook use to Analyze and QuerySocial GraphsTechniques that Facebook use to Analyze and QuerySocial Graphs
Techniques that Facebook use to Analyze and QuerySocial GraphsHaneen Droubi
 
Studying user footprints in different online social networks
Studying user footprints in different online social networksStudying user footprints in different online social networks
Studying user footprints in different online social networksIIIT Hyderabad
 
IRJET- Quantify Mutually Dependent Privacy Risks with Locality Data
IRJET- Quantify Mutually Dependent Privacy Risks with Locality DataIRJET- Quantify Mutually Dependent Privacy Risks with Locality Data
IRJET- Quantify Mutually Dependent Privacy Risks with Locality DataIRJET Journal
 
Data Mining In Social Networks Using K-Means Clustering Algorithm
Data Mining In Social Networks Using K-Means Clustering AlgorithmData Mining In Social Networks Using K-Means Clustering Algorithm
Data Mining In Social Networks Using K-Means Clustering Algorithmnishant24894
 
Graph Based User Interest Modeling in Twitter
Graph Based User Interest Modeling in TwitterGraph Based User Interest Modeling in Twitter
Graph Based User Interest Modeling in Twitterraghavr186
 
11.hybrid ga svm for efficient feature selection in e-mail classification
11.hybrid ga svm for efficient feature selection in e-mail classification11.hybrid ga svm for efficient feature selection in e-mail classification
11.hybrid ga svm for efficient feature selection in e-mail classificationAlexander Decker
 

What's hot (19)

Community Detection in Social Media
Community Detection in Social MediaCommunity Detection in Social Media
Community Detection in Social Media
 
Link prediction
Link predictionLink prediction
Link prediction
 
Clique-based Network Clustering
Clique-based Network ClusteringClique-based Network Clustering
Clique-based Network Clustering
 
IMPROVED SENTIMENT ANALYSIS USING A CUSTOMIZED DISTILBERT NLP CONFIGURATION
IMPROVED SENTIMENT ANALYSIS USING A CUSTOMIZED DISTILBERT NLP CONFIGURATIONIMPROVED SENTIMENT ANALYSIS USING A CUSTOMIZED DISTILBERT NLP CONFIGURATION
IMPROVED SENTIMENT ANALYSIS USING A CUSTOMIZED DISTILBERT NLP CONFIGURATION
 
Community detection
Community detectionCommunity detection
Community detection
 
Community detection algorithms
Community detection algorithmsCommunity detection algorithms
Community detection algorithms
 
Formulation of modularity factor for community detection applying
Formulation of modularity factor for community detection applyingFormulation of modularity factor for community detection applying
Formulation of modularity factor for community detection applying
 
IRJET- Fusion Method for Image Reranking and Similarity Finding based on Topi...
IRJET- Fusion Method for Image Reranking and Similarity Finding based on Topi...IRJET- Fusion Method for Image Reranking and Similarity Finding based on Topi...
IRJET- Fusion Method for Image Reranking and Similarity Finding based on Topi...
 
Community detection in social networks[1]
Community detection in social networks[1]Community detection in social networks[1]
Community detection in social networks[1]
 
Group and Community Detection in Social Networks
Group and Community Detection in Social NetworksGroup and Community Detection in Social Networks
Group and Community Detection in Social Networks
 
IRJET - Visual Question Answering – Implementation using Keras
IRJET -  	  Visual Question Answering – Implementation using KerasIRJET -  	  Visual Question Answering – Implementation using Keras
IRJET - Visual Question Answering – Implementation using Keras
 
Techniques that Facebook use to Analyze and QuerySocial Graphs
Techniques that Facebook use to Analyze and QuerySocial GraphsTechniques that Facebook use to Analyze and QuerySocial Graphs
Techniques that Facebook use to Analyze and QuerySocial Graphs
 
Studying user footprints in different online social networks
Studying user footprints in different online social networksStudying user footprints in different online social networks
Studying user footprints in different online social networks
 
[IJET-V2I2P5] Authors:Mr. Veer Karan Bharat1, Miss. Dethe Pratima Vilas2, Mis...
[IJET-V2I2P5] Authors:Mr. Veer Karan Bharat1, Miss. Dethe Pratima Vilas2, Mis...[IJET-V2I2P5] Authors:Mr. Veer Karan Bharat1, Miss. Dethe Pratima Vilas2, Mis...
[IJET-V2I2P5] Authors:Mr. Veer Karan Bharat1, Miss. Dethe Pratima Vilas2, Mis...
 
IRJET- Quantify Mutually Dependent Privacy Risks with Locality Data
IRJET- Quantify Mutually Dependent Privacy Risks with Locality DataIRJET- Quantify Mutually Dependent Privacy Risks with Locality Data
IRJET- Quantify Mutually Dependent Privacy Risks with Locality Data
 
Data Mining In Social Networks Using K-Means Clustering Algorithm
Data Mining In Social Networks Using K-Means Clustering AlgorithmData Mining In Social Networks Using K-Means Clustering Algorithm
Data Mining In Social Networks Using K-Means Clustering Algorithm
 
CSMR06a.ppt
CSMR06a.pptCSMR06a.ppt
CSMR06a.ppt
 
Graph Based User Interest Modeling in Twitter
Graph Based User Interest Modeling in TwitterGraph Based User Interest Modeling in Twitter
Graph Based User Interest Modeling in Twitter
 
11.hybrid ga svm for efficient feature selection in e-mail classification
11.hybrid ga svm for efficient feature selection in e-mail classification11.hybrid ga svm for efficient feature selection in e-mail classification
11.hybrid ga svm for efficient feature selection in e-mail classification
 

Similar to Final Report

Testing Vitality Ranking and Prediction in Social Networking Services With Dy...
Testing Vitality Ranking and Prediction in Social Networking Services With Dy...Testing Vitality Ranking and Prediction in Social Networking Services With Dy...
Testing Vitality Ranking and Prediction in Social Networking Services With Dy...reshma reshu
 
Ego net facebook data analysis
Ego net facebook data analysisEgo net facebook data analysis
Ego net facebook data analysisSamsil Arefin
 
Inferring Peer Centrality in Socially-Informed Peer-to-Peer Systems
Inferring Peer Centrality in Socially-Informed Peer-to-Peer SystemsInferring Peer Centrality in Socially-Informed Peer-to-Peer Systems
Inferring Peer Centrality in Socially-Informed Peer-to-Peer SystemsNicolas Kourtellis
 
Predicting_new_friendships_in_social_networks
Predicting_new_friendships_in_social_networksPredicting_new_friendships_in_social_networks
Predicting_new_friendships_in_social_networksAnvardh Nanduri
 
FRIEND RECOMMENDATION IN ONLINE SOCIAL NETWORKS USING LDA
FRIEND RECOMMENDATION IN ONLINE SOCIAL NETWORKS USING LDAFRIEND RECOMMENDATION IN ONLINE SOCIAL NETWORKS USING LDA
FRIEND RECOMMENDATION IN ONLINE SOCIAL NETWORKS USING LDAJournal For Research
 
Web Graph Clustering Using Hyperlink Structure
Web Graph Clustering Using Hyperlink StructureWeb Graph Clustering Using Hyperlink Structure
Web Graph Clustering Using Hyperlink Structureaciijournal
 
Results for Web Graph Mining Base Recommender System for Query, Image and Soc...
Results for Web Graph Mining Base Recommender System for Query, Image and Soc...Results for Web Graph Mining Base Recommender System for Query, Image and Soc...
Results for Web Graph Mining Base Recommender System for Query, Image and Soc...iosrjce
 
LINKING SOFTWARE DEVELOPMENT PHASE AND PRODUCT ATTRIBUTES WITH USER EVALUATIO...
LINKING SOFTWARE DEVELOPMENT PHASE AND PRODUCT ATTRIBUTES WITH USER EVALUATIO...LINKING SOFTWARE DEVELOPMENT PHASE AND PRODUCT ATTRIBUTES WITH USER EVALUATIO...
LINKING SOFTWARE DEVELOPMENT PHASE AND PRODUCT ATTRIBUTES WITH USER EVALUATIO...cscpconf
 
Discovering Relation Based on User-Shared Image in Social Media by Using Big ...
Discovering Relation Based on User-Shared Image in Social Media by Using Big ...Discovering Relation Based on User-Shared Image in Social Media by Using Big ...
Discovering Relation Based on User-Shared Image in Social Media by Using Big ...Radhiyatammardhiyyah -
 
Mining User Lifecycles from Online Community Platforms and their Application ...
Mining User Lifecycles from Online Community Platforms and their Application ...Mining User Lifecycles from Online Community Platforms and their Application ...
Mining User Lifecycles from Online Community Platforms and their Application ...Matthew Rowe
 
An innovative approach to solve the network design problem concerning intelli...
An innovative approach to solve the network design problem concerning intelli...An innovative approach to solve the network design problem concerning intelli...
An innovative approach to solve the network design problem concerning intelli...alienaimi
 
LCF: A Temporal Approach to Link Prediction in Dynamic Social Networks
 LCF: A Temporal Approach to Link Prediction in Dynamic Social Networks LCF: A Temporal Approach to Link Prediction in Dynamic Social Networks
LCF: A Temporal Approach to Link Prediction in Dynamic Social NetworksIJCSIS Research Publications
 
LINKING SOFTWARE DEVELOPMENT PHASE AND PRODUCT ATTRIBUTES WITH USER EVALUATIO...
LINKING SOFTWARE DEVELOPMENT PHASE AND PRODUCT ATTRIBUTES WITH USER EVALUATIO...LINKING SOFTWARE DEVELOPMENT PHASE AND PRODUCT ATTRIBUTES WITH USER EVALUATIO...
LINKING SOFTWARE DEVELOPMENT PHASE AND PRODUCT ATTRIBUTES WITH USER EVALUATIO...csandit
 
EffectiveCrowdSourcingForProductFeatureIdeation v18
EffectiveCrowdSourcingForProductFeatureIdeation v18EffectiveCrowdSourcingForProductFeatureIdeation v18
EffectiveCrowdSourcingForProductFeatureIdeation v18Karthikeyan Rajasekharan
 
DYNAMIC ASSIGNMENT OF USERS AND MANAGEMENT OF USER’S DATA IN SOCIAL NETWORK
DYNAMIC ASSIGNMENT OF USERS AND MANAGEMENT OF USER’S DATA IN SOCIAL NETWORK DYNAMIC ASSIGNMENT OF USERS AND MANAGEMENT OF USER’S DATA IN SOCIAL NETWORK
DYNAMIC ASSIGNMENT OF USERS AND MANAGEMENT OF USER’S DATA IN SOCIAL NETWORK ijiert bestjournal
 
A novel tool for stereo matching of images
A novel tool for stereo matching of imagesA novel tool for stereo matching of images
A novel tool for stereo matching of imageseSAT Publishing House
 
A novel tool for stereo matching of images
A novel tool for stereo matching of imagesA novel tool for stereo matching of images
A novel tool for stereo matching of imageseSAT Publishing House
 

Similar to Final Report (20)

Testing Vitality Ranking and Prediction in Social Networking Services With Dy...
Testing Vitality Ranking and Prediction in Social Networking Services With Dy...Testing Vitality Ranking and Prediction in Social Networking Services With Dy...
Testing Vitality Ranking and Prediction in Social Networking Services With Dy...
 
Ego net facebook data analysis
Ego net facebook data analysisEgo net facebook data analysis
Ego net facebook data analysis
 
Inferring Peer Centrality in Socially-Informed Peer-to-Peer Systems
Inferring Peer Centrality in Socially-Informed Peer-to-Peer SystemsInferring Peer Centrality in Socially-Informed Peer-to-Peer Systems
Inferring Peer Centrality in Socially-Informed Peer-to-Peer Systems
 
Predicting_new_friendships_in_social_networks
Predicting_new_friendships_in_social_networksPredicting_new_friendships_in_social_networks
Predicting_new_friendships_in_social_networks
 
FRIEND RECOMMENDATION IN ONLINE SOCIAL NETWORKS USING LDA
FRIEND RECOMMENDATION IN ONLINE SOCIAL NETWORKS USING LDAFRIEND RECOMMENDATION IN ONLINE SOCIAL NETWORKS USING LDA
FRIEND RECOMMENDATION IN ONLINE SOCIAL NETWORKS USING LDA
 
Web Graph Clustering Using Hyperlink Structure
Web Graph Clustering Using Hyperlink StructureWeb Graph Clustering Using Hyperlink Structure
Web Graph Clustering Using Hyperlink Structure
 
B017650510
B017650510B017650510
B017650510
 
Results for Web Graph Mining Base Recommender System for Query, Image and Soc...
Results for Web Graph Mining Base Recommender System for Query, Image and Soc...Results for Web Graph Mining Base Recommender System for Query, Image and Soc...
Results for Web Graph Mining Base Recommender System for Query, Image and Soc...
 
LINKING SOFTWARE DEVELOPMENT PHASE AND PRODUCT ATTRIBUTES WITH USER EVALUATIO...
LINKING SOFTWARE DEVELOPMENT PHASE AND PRODUCT ATTRIBUTES WITH USER EVALUATIO...LINKING SOFTWARE DEVELOPMENT PHASE AND PRODUCT ATTRIBUTES WITH USER EVALUATIO...
LINKING SOFTWARE DEVELOPMENT PHASE AND PRODUCT ATTRIBUTES WITH USER EVALUATIO...
 
Discovering Relation Based on User-Shared Image in Social Media by Using Big ...
Discovering Relation Based on User-Shared Image in Social Media by Using Big ...Discovering Relation Based on User-Shared Image in Social Media by Using Big ...
Discovering Relation Based on User-Shared Image in Social Media by Using Big ...
 
Mining User Lifecycles from Online Community Platforms and their Application ...
Mining User Lifecycles from Online Community Platforms and their Application ...Mining User Lifecycles from Online Community Platforms and their Application ...
Mining User Lifecycles from Online Community Platforms and their Application ...
 
An innovative approach to solve the network design problem concerning intelli...
An innovative approach to solve the network design problem concerning intelli...An innovative approach to solve the network design problem concerning intelli...
An innovative approach to solve the network design problem concerning intelli...
 
LCF: A Temporal Approach to Link Prediction in Dynamic Social Networks
 LCF: A Temporal Approach to Link Prediction in Dynamic Social Networks LCF: A Temporal Approach to Link Prediction in Dynamic Social Networks
LCF: A Temporal Approach to Link Prediction in Dynamic Social Networks
 
Ppt
PptPpt
Ppt
 
LINKING SOFTWARE DEVELOPMENT PHASE AND PRODUCT ATTRIBUTES WITH USER EVALUATIO...
LINKING SOFTWARE DEVELOPMENT PHASE AND PRODUCT ATTRIBUTES WITH USER EVALUATIO...LINKING SOFTWARE DEVELOPMENT PHASE AND PRODUCT ATTRIBUTES WITH USER EVALUATIO...
LINKING SOFTWARE DEVELOPMENT PHASE AND PRODUCT ATTRIBUTES WITH USER EVALUATIO...
 
Performance neo4j-versus (2)
Performance neo4j-versus (2)Performance neo4j-versus (2)
Performance neo4j-versus (2)
 
EffectiveCrowdSourcingForProductFeatureIdeation v18
EffectiveCrowdSourcingForProductFeatureIdeation v18EffectiveCrowdSourcingForProductFeatureIdeation v18
EffectiveCrowdSourcingForProductFeatureIdeation v18
 
DYNAMIC ASSIGNMENT OF USERS AND MANAGEMENT OF USER’S DATA IN SOCIAL NETWORK
DYNAMIC ASSIGNMENT OF USERS AND MANAGEMENT OF USER’S DATA IN SOCIAL NETWORK DYNAMIC ASSIGNMENT OF USERS AND MANAGEMENT OF USER’S DATA IN SOCIAL NETWORK
DYNAMIC ASSIGNMENT OF USERS AND MANAGEMENT OF USER’S DATA IN SOCIAL NETWORK
 
A novel tool for stereo matching of images
A novel tool for stereo matching of imagesA novel tool for stereo matching of images
A novel tool for stereo matching of images
 
A novel tool for stereo matching of images
A novel tool for stereo matching of imagesA novel tool for stereo matching of images
A novel tool for stereo matching of images
 

Final Report

  • 1. 1 B.Tech Project Report On Link Prediction Problem for Heterogeneous Networks By Sai Akhil Reddy Gopidi (1210110172) Nithin Kumar (1210110095) Roopesh Kumar Kotte (1210110093) Supervisor Dr. Dolly Sharma, Shiv Nadar University (dolly.sharma@snu.edu.in) Submitted in the partial fulfillment of requirements for Bachelor of Technology in Computer Science and Engineering Department of Computer Science and Engineering, School of Engineering, Shiv Nadar University, Gautam Buddha Nagar, U.P., India, 201314 http://www.snu.edu.in
  • 2. 2 Approval Sheet This report entitled Link Prediction Problem for Heterogeneous Networks by Sai Akhil Reddy Gopidi, Nithin Kumar & Roopesh Kumar Kotte is approved for the degree of Bachelor of Technology in Computer Science and Engineering. Project Advisor Name Dolly Sharma Signature ________________________________ Date ________________________________
  • 3. 3 Declaration Sheet We declare that this written submission represents our ideas in our own words and where others' ideas or words have been included, we have adequately cited and referenced the original sources. We also declare that we have adhered to all principles of academic honesty and integrity and have not misrepresented or fabricated or falsified any idea/data/fact/source in our submission. We understand that any violation of the above will be cause for disciplinary action by the Institute and can also evoke penal action from the sources that have thus not been properly cited or from whom proper permission has not been taken when needed. Name of the Student: Sai Akhil Reddy Gopidi Signature ___________________________ Name of the Student: Nithin Kumar Kakkrineni Signature ___________________________ Name of the Student: Roopesh Kumar Kotte Signature ___________________________ Date _______________________________
  • 4. 4 Abstract Interaction among members has become an important aspect in many social networks like Facebook and YouTube. As the number of users grows the network size also increases and it becomes difficult for users to find their friends or search and watch the videos they like. To make the life of the user easy, social networks suggest friends and videos to the users based on their previous searches and mutual interests. Therefore social networks have focused on link prediction techniques that allow users to easily find their needs. Link prediction is a critical task that not only helps increase the linkage inside the network but also improves the user experience. In a link prediction algorithm it is required to identify the factors that influence link creation. In this project we analyze and discuss some of these factors and also give an approach which satisfies these factors. The approach is to estimate link relevance by using Fuzzy Link Based Classification algorithms based on Backpropagation that gives the probability of a link existence at some future time. We then evaluate the accuracy of the results obtained using accuracy measurement techniques like Precision. We apply our methods to YouTube dataset in order to evaluate the performance of our algorithms and compare it to the performance of previously proposed algorithms.
  • 6. 6 List of Figures and Tables S.No. Figure/Table Number Description 1. Fig 1.1 Heterogeneous graph with multiple edge types 2. Fig 1.2 Depicting problem statement using graph 3. Fig 2.1 CSV file of type 1 edge in YouTube dataset 4. Table 3.1 AUROC for YouTube, Disease and Climate network for each edge type 5. Fig 3.1 Graph showing connections between nodes 6. Table 3.2 Distance between the nodes 7. Fig 3.2 Alternate neighbors and link possibility 8. Fig 3.3 Local random walk path 9. Table 4.1 Results obtained from the AUROC curve on DBLP dataset 10. Fig 4.1 Impact of Collaboration Frequency of different measures 11. Fig 5.1 System Architecture 12. Fig 5.2 Feed Forward Neural Network structure 13. Fig 5.3 Example of a triangular formation in the network 14. Fig 5.4 Sums of three sides of a triangle 15. Fig 5.5 Threshold Value 16. Fig 5.6 Number of expected links for bucket size of two 17. Fig 5.7 Expected links with hop counts 18. Fig 5.8 Local random walk score calculated for expected links 19. Fig 5.9 Calculated Precision 20. Fig 5.10 Precision vs Hop count graph 21. Fig 5.11 Hop count vs estimated links graph 22. Fig 5.12 Edges obtained for bucket size 23. Fig 6.1 Project Phases 24. Table 6.1 Project Schedule
  • 7. 7 Abbreviations and Nomenclature AUROC- Area under the Receiver Operating Curve CSV- Comma Separated Value DBLP- Digital Bibliography & Library Project JVM- Java Virtual Machine GC- Garbage Collector STL- Standard Template Library JC- Jaccard’s coefficient CN- Common Neighbor or Contact Network A/A- Adamic Adar LRW- Local Random Walk CP- CANDECOMP/PARAFAC MRLP- Multi Relational Link Prediction MRIP- Multi Relational Influence Propagation SBN- Shared subscriptions SBR- Shared subscribers VID- Shared favorite videos FFNet- Feed Forward Neural Network NN- Neural Network
  • 8. 8
  • 9. 9 Chapter 1 Introduction 1.1 Overview Data mining and analyzing data is an important and upcoming field in computer science. Huge volumes of data stored in data warehouses must be extracted for the purpose of pattern recognition which can in turn benefit the organization. Storing and managing this data had become a major problem in today’s world. Interaction among members of a community is or network is of highest priority. Organizations like Facebook and YouTube which aim to connect millions of people around the world analyze the pattern of user behavior and recommends friends and videos accordingly. Given a user A at some point of time t, the task at hand is to estimate all the possibilities of link formation between user A and user B by taking into consideration all the parameters that these two users share. Link prediction makes
  • 10. 10 it possible to determine wheather two people can become friends or not beforehand. Many social networks use this technique to suggest friends to the users so that they do not have to search for all their friends. This project includes some of the commonly the various link prediction techniques that have been used previously as well as their drawbacks. We then propose our algorithm and implement it on the YouTube dataset. Finally we compare the obtained results of our algorithm with the results of the previously proposed algorithms and conclude wheather our algorithm is efficient or not. 1.2 Problem Statement Given a heterogeneous graph G = (V1 U V2…U Vm, E1 U E2… U En), where Vu (u € N) represents the set of nodes of same type u (users) and Ej (j € M) represents the link of type j between the nodes (relationship between users), our task is to predict the future possible links between the users. Since it is not possible to compare the dataset at two different time intervals, we use the cross-dimension validation process. We divide the complete dataset into 2 divisions- 1. Training set 2. Testing set Fig 1.1 Heterogeneous graph with multiple edge types [1]
  • 11. 11 Out of the 5,574,249 edges in our dataset, we omit 10% of the total nodes, i.e., 8,36,137 and implement our algorithm on the remaining 47,38,111 to train the dataset. After the training is completed we test the algorithm on the entire dataset to check how accurate the results are using precision accuracy metric since we already know which links are to be expected and which links do not exist. Fig 1.2 Depicting problem statement using graph The numbers above the links indicate the intensity of the links. For example, users 2 and 6 have more commonly shared videos than users 5 and 6. 1.3 Team Members Contribution Name: Akhil Reddy Contribution: Akhil’s contribution includes literature survey to understand the fundamental concepts of link prediction, problem formulation, designing the algorithm and testing the algorithm. Name: Nithin Kumar Contribution: Nithin’s contribution includes assistance with literature survey, problem formulation, designing and testing the algorithm. Name: Roopesh Kumar
  • 12. 12 Contribution: Roopesh’s contribution includes data gathering, testing the algorithm, research in finding suitable data structure to accommodate all the nodes in the dataset and poster design. Chapter 2 Feasibility Study and Requirements 2.1 Dataset Used The dataset that we have selected to implement our algorithm is the YouTube dataset from December 2008, which is a video sharing platform for millions of users. This dataset includes information of those users who were will to share their information [2]. Number of Nodes: 15,088
  • 13. 13 Number of Edges: 5,574,249 Types of Edges: 5 In this case we are considering all the users as nodes and the different types of relations between them as edges, to construct out heterogeneous graph. A graph G = (V1 U V2…U Vm, E1 U E2… U En), where Vu (u € N) represents the set of nodes of same type u (Users) and Ej (j € M) represents the link of type j between the nodes (relationship between users) is called a heterogeneous graph. There are 5 types of edges in this dataset, namely- 1. The contact network between the 15,088 users. 2. The number of shared friends between two users in the 848, 003 (excluding the 15,088) contacts- Two users are connected if they both add another user as contact. 3. The number of shared subscriptions between two users- When two users subscribe to same person/channel. 4. The number of shared subscribers between two users- Two users are connected if another user has subscribed both of them. 5. The number of shared favorite videos- Users sharing same videos. The dataset is in a CSV (Comma Separated Value) format for each edge type independently. E.g.: 7, 12, 94 – Indicates that user ids 7 and 12 have an intensity of 94 between them of a particular edge type.
  • 14. 14 Fig 2.1 CSV file of type 1 edge in YouTube dataset In the above figure the 1st and 2nd columns represent the node that have a link of type 1 which the intensity in the 3rd column. 2.2 Scope: Our YouTube network consists of different type of edges with an interaction value which forms a large network. The scope of our project is to reduce the link prediction space specifically in the YouTube network and suggest link in a probabilistic way that the link will be useful in near future. Our experiment of link prediction is only for research purpose. 2.3 Problems Faced:
  • 15. 15  Initially we considered using DBLP dataset, but due to lack of multiple edges and insufficient data available in the dataset we were forced to work on a new dataset.  As the dataset contains 15088 authors and 5,574,249 links in the YouTube dataset, it is hard to accommodate all the nodes in the dynamic array in the form of a 3-D matrix because it overflows the available memory of the ram and gives a GC overhead error.  System requires larger RAM than the available RAM in our laptops to load dataset into the data structure. 2.4 Software and Hardware Requirements  Operating systems.  NetBeans with JDK to run the Java code  Larger ram. 2.5 Technical Feasibility Initially the project source code was meant to be written in Java using 3D arrays to create the adjacency matrix. As the number of nodes was very high, the Java VM could find enough memory to store the data. Hence we shifted to C++ vectors which have the capability to dynamically allocate memory to the node. Even the vector STL could not accommodate all the data in a 3D vector and hence we were forced to shift back to Java and used 2D arrays to create the adjacency matrix. This Project is technically feasible as it works perfectly fine on the existing version of Java provided we use 2D arrays. NOTE- To use 2D arrays for such large datasets where number of nodes is 15088, the JVM option must be changed to –Xms2g to allocate more heap size. 2.6 Economic Feasibility It is economically feasible as the data set is publicly available online, so it is free of cost there is no extra costs required for the project. Research done to get into the project is also based on the scholarly articles available online, there is no need to learn new technology (language).
  • 16. 16 2.7 Schedule feasibility It took some time to know the concepts of link prediction in heterogeneous, homogenous networks, techniques used to predict the future link. When we discover what to do innovatively and something new, we proceeded swiftly with our project in order to complete it within the stipulated time frame. 2.8 Project Meetings 2.8.1 Meeting with Supervisor We have constantly met our project supervisor on weekly bases (twice a week) to discuss about the objectives of the project. We were instructed to conduct the literary survey in the beginning to get an idea about the subject since everyone in the group was new to this domain of study. Once we completed literature survey the algorithm was discussed to be implemented on the dataset. Later accuracy metrics were discussed to check out the performance of the algorithm and finally the results were compared with the results of other algorithms. All the important meetings with the supervisor were conducted in person and the minor details were either discussed over phone or e-mails. 2.8.2 Group Meetings The group members met daily. Initially we discussed the scope and schedule of the project as well as discussed on individual roles to be carried out. On completing the literature survey and getting a good grasp of the subject we formulated a problem and finalized the dataset to be worked. During the literature survey the papers were distributed among the team members and on completion of paper the member explained the contents of that paper to other members to save time and avoid redundancy. 2.9 Text Deliverables Along with this report several other documents are also included in order understand the research in a deeper sense. Dataset- The YouTube dataset on which the research was conducted on. The dataset is in a CSV format with each link type having its own file.
  • 17. 17 Source Code- A CD is given along with the report containing the code for all the algorithms implemented. List of all expected value- On running the LRW a list of all expected edges is obtained. A file including all these edges is included Deleted Files- To test the accuracy of the algorithm, 10% of links are deleted from original file and stored in another file. Testing is done on this file. 2.10 Conclusion We can confidently conclude after considering all the above stated points that the project is feasible and we will complete the project in stipulated time allotted to complete the project work.
  • 18. 18 Chapter 3 Commonly Used Algorithm Many algorithms have been proposed for link prediction purpose in homogeneous networks as well as heterogeneous networks. We cannot conclude that a specific approach is the best way to predict links because link prediction methods are domain specific. Performance of the algorithms is based on how well the network supports the predefined scoring methods for link formation. For example, Facebook and Twitter being social networks yield best results for neighborhood methods like common neighbor and Adamic/Adar for friend recommendation links and in a climate network Jaccard’s coefficient performs well due to spatial autocorrelation [3]. It is also possible that only one method might not give the best results for all the different links in a single network. Hence performance of algorithm is not only dependent on the predefined scoring measure and the type of the network, but also on the type of links that it is being used to predict. This is clearly illustrated in the disease-gene network, where different method works best for different link in the same network [3]. The
  • 19. 19 AUROC table below clearly indicates the above stated, where the bold faced indicate the best link predictor method. Table 3.1 AUROC for YouTube, Disease and Climate network for each edge type [3] 3.1 Commonly Used Algorithms The following discussed methods can be used for any pair of nodes (A, B) in a network. Fig 3.1 Graph showing connections between nodes Score is allocated to each link based on the predefined scoring techniques used in these algorithms and based on this score we predict wheather there is a possibility of link between the nodes. 3.1.1 Graph Distance
  • 20. 20 In this method the distance between the two nodes, i.e., the source and destination nodes is calculated and the inverse or negated length is considered. If the distance between the nodes is less then there is higher chance that these nodes might be connected and vice versa. Table 3.2 Distance between the nodes Nodes Distance (A,C) -2 (C,E) -3 (A,E) -3 As shown is Fig 3.1, in the above table the distance between nodes A and C is the least, therefore there are higher chances of link formation between these two nodes. The negative sign (-) is only to show that the least distance value has higher probability of link. 3.1.2 Common Neighbors Link prediction in this method is based on the number of common neighbors that two nodes have. If two nodes have more number of common neighbors then more is the probability of link existence between the nodes and vice versa. Score= Ə (A) ∩ Ə (B), which is the total number of common neighbor nodes of the two nodes. Where Ə (x) denotes the neighbors of a node x. 3.1.3 Jaccard’s Coefficient Fig 3.2 Alternate neighbors and link possibility
  • 21. 21 Jaccard’s coefficient is the derived from common neighbors method but provides more accurate results. For a given pair of nodes (A, B), the score assigned is the number of common neighbor nodes of A and B divided by the total number of neighbor nodes of A and B. Score= Ə (A) ∩ Ə (B) ÷ Ə (A) U Ə (B) The numerator part is similar to common neighbor’s method. From Fig 3.3 we can see that for nodes C and D the common neighbors are A and B. The scoring method used by common neighbors gives a high score for a link between C and D. We are not considering any other nodes in this method. It is also clear from Fig 3.2 that node C has many other neighbors apart from A and B whereas D has only those two neighbors. Therefore in this case the score calculated by the common neighbor method is not accurate and to negate these additional neighbors of the nodes we divide the common neighbors of both nodes by the total number of nodes. This increases the accuracy of calculated score. 3.1.4 Adamic/Adar Adamic/Adar is the advanced version of Jaccards coefficient, which weighs rarer neighbors more heavily. To put in simple terms- for a pair of nodes (A, B), if the common neighbors of A and B have less common neighbors then there is a higher possibility of link existence between A and B. From Fig 3.2, the common neighbors of A and B are C and D which in turn do not have any common neighbors. So there is a better link formation between A and B. Score= , where z is the common neighbor of nodes x and y. 3.1.5 Preferential Attachment Score assigned to the pair of nodes is the product of their degrees. Higher score is assigned if the nodes have many edges attached to them. 3.1.6 Average Commute Time It is the average number of steps required by a random walker starting from a source node to reach destination node. The two nodes are said to have a link if they have smaller commute time.
  • 22. 22 3.2 Local Random Walk Random walk algorithm is an advanced method which is used to predict links in a network. It is a Markov process, that is, it is a memory less process which makes its next move based only on its current location and does not consider the previously followed path. In a given graph G (V,E) for a pair of nodes x,y the process can be defined using a transition probability matrix Pxy=axy / kx where axy=1 if x and y are connected and 0 other wise. Kx denotes the degree of node k. Let us consider a random walker starting at node x and must travel to node y and let πxy(t) be the probability that this walker reaches node y after t steps. πx(t) =PT πx(t−1), is the probability score to calculate that the random walker will come to location x from the previously positioned location. In the below figure consider that the random walker starts at x and must reach y. From x the walker can go to any of the 4 nodes ahead of him, i.e., nodes 1-4 and through these he can reach node y. The probability the walker will go to node 1 from node x is- Π1(t) =PT Π1(t−1) Similarly for the remaining nodes. To travel from node 1 to node y, again the probability is Πy(t) =PT Πy(t−1) Hence the probability of going to the next state is dependent only on the current state and not the previous states. Fig 3.3 Local random walk path
  • 23. 23 Summing up and taking average of the probabilities can give the score for existence of link between nodes x and y. 3.2.1 Random Walk with Restart Sometimes is so happens that the random walker deviates and goes away too far from the destination node. In the above figures, consider that the walker has taken moved from node x to node 100, which is far away from the node y. This gives low and inaccurate score prediction and there are chances that the walker may never reach the destination. To overcome this problem we can use random walk with restart, where walkers are continuously released at regular intervals from the starting point which increases the probability of the walker to reach the destination in the best possible path.
  • 24. 24 Chapter 4 Related Work In order to understand the basic crux of the link prediction problem, we team had conducted an extensive literature survey of various papers published so that we were aware of the previously researched problems, the approach used, the dataset they experimented on and the results obtained. Every approach has its own advantages and drawbacks over the others. Since this is relatively new topic and research has recently begun in this domain, very few papers were published. The obtained papers were distributed among the team members and on daily basis the important content of the papers were discussed with the whole team. Few important papers that are relevant to this project are discussed below.
  • 25. 25 4.1 Tensor Factorization  Paper Title: Link Prediction in Heterogeneous Networks Based on Tensor Factorization [4].  Authors: Piao Yong , Li Xiaodong1 and Jiang He  Publication: The Open Cybernetics & Systemics Journal, 2014, 8, 316-321  Problem: To predict the edges that will be added to the network during the interval from time t to a given future time t’.  Method: Heterogeneous networks can be organized as a third order tensor (Node! Node! Link type) or multi-dimensional. Proposed a method based on tensor factorization that can capture the correlation between different types of links for the link prediction problem without loss of information. Employed CANDECOMP/PARAFAC (CP) tensor decomposition to capture the underlying patterns in the node-relationship-node tensor. The CP decomposition generated feature vectors for the nodes in the graph, that are computed to get a similarity score that combines the multiple types of the graph. After CP decomposition, 3 factor matrices were known: node matrix A, relationship matrix B, and node matrix C. Link prediction can be computed according to the captured associations. Score Matrix s is defined as In this paper, they used alternating least-squares (ALS) with weighted-!- regularization algorithm to fit the CP decomposition.  Results: Adamic/Adar measure and Katz measure performs well both in theoretical and practical experiments. So here, they just compared these two measures with their methods. Their method provided better precision than unsupervised ones on the data sets and also provided a competitive effect to Adamic/Adar measure and both those two methods beat Katz measure.  Challenges: It is cost intensively to compute tensor factorization.  Datasets: UMLS. This data set contains data from the Unified Medical Language System semantic work. This consists of 135 entities and 54 relationships. The entities
  • 26. 26 are high-level concepts like 'Disease or Syndrome', 'Diagnostic Procedure', or 'Mammal'. 4.2 Multi Relational Influence Propagation  Paper Title: Link Prediction in Heterogeneous Networks: Influence and Time Matters[5].  Author: Yang Yang and Nitesh V. Chawla, Department of Computer Science & Engg, University of Notre Dame. Yizhou Sun and Jiawei Hanı, Department of Computer Science, University of Illinois at Urbana-Champaign.  Problem: Given a heterogeneous network, in this case a DBLP bibliographic network, the machine must be able to predict wheather the link is present in the network and the possibility of link in future. DBLP dataset contains information about 3,215 authors who published a minimum of 5 papers in conferences between 1990 and 2010. The links can be of different types, for E.g. link between author and author (co- author), author-paper (writes), paper-conference (published in).  Method: Different unsupervised link prediction algorithms were used to test the data set like Common Neighbour, Jaccard Coefficient, Adamic/Adar Preferential attachment, etc. Of all the algorithms, Multi relational Influence propagation (MRIP) which uses conditional probability which is equivalent to edge correctness yielded the best results. For unsupervised learning data between 1990 and 2000 was chosen as training set and data between 2001 and 2005 as training set.  Results: As unsupervised link predictions are domain specific performance varies for each algorithm. MRIP has better performance than others in predicting co-authorship between authors and predicting terms shared between authors and has slightly less performance in conference presenter’s link.  Challenges: MRIP works well for stable networks. But as DBLP is a non-stable network (unit root value= 0.99), since the number of links keep on increasing every year the traditional unsupervised link prediction algorithms are not of much use. Availability of dataset and security is a problem. Additional information is collected through user survey which is incomplete and unreliable. Information is needed that can expose the users subconscious behavior at a particular time.
  • 27. 27  Future Work: As the network changes with time temporal feature based methods are implemented. Bootstrapping technology is one such method. Based on the degree of a node we rank them in descending order and analyze how new links in future are associated with the top K% of them.  Dataset & availability: The whole DBLP dataset is available as an XML file. Table 4.1 Results obtained from the AUROC curve on DBLP dataset [5] JC CN AA MRIP Collaboration 0.590 0.597 0.596 0.769 Conference 0.702 0.698 0.689 0.691 Key Terms 0.545 0.546 0.532 0.811 4.3 Multi Relational Link Prediction  Paper Title: Multi-Relational Link Prediction in Heterogeneous Information Networks[3].  Author: Darcy Davis, Ryan Lichtenwalter, Nitesh V. Chawla, Interdisciplinary Centre for Network Science and Applications, Department of Computer Science and Engineering, University of Notre Dame.  Problem: Three different domains are considered- YouTube, Disease-Gene and Climate network datasets. YouTube has 15,088 users as of December 2008, who are considered as nodes in this case. The users are connected by 5 different edges- contact network (CN) of the user, shared contact with users outside of the network (FR), shared subscriptions (SBN), shared subscribers (SBR), and shared favorite videos (VID). The disease-gene network consists of 703 diseases and 1,132 genes with 4 edges. The climate network has 1,701 locations with 7 edges for different climate changes.  Method: Unsupervised link prediction methods are implemented. Link prediction for each edge type is evaluated individually. Link prediction performance is evaluated separately for each edge using Area Under the Receiver Operating Curve (AUROC).  Results: Performance of the algorithms is based on how well the network supports the predefined link scoring assumption. Performance of local neighborhood methods is
  • 28. 28 predominant in social networks like YouTube. Jaccard coefficient performs well in climate network closely located areas have similar climate. In disease-gene network each link type was captured best by a different method. Refer to Fig 3.1 for AUROC values of each node.  Challenges: A node in a network can have multiple edges and each edge can increase the likelihood of a contact. In YouTube 76% of node pairs with contact edge have other edges which increase the number of contacts of that particular edge. Bad performance of MRLP on other edge types indicate that MRLP doesn’t work well when additional link types are introduced (noise).  Future Work: High performance link prediction (HPLP) is introduced for this purpose which uses Feature vector, homogeneous link prediction and heterogeneous link prediction. 4.4 Graph Model TransFG  Paper title: Inferring social ties across heterogeneous networks[6].  Authors: Jie Tang Tsinghua University Tiancheng Lou Tsinghua University Jon Kleinberg Cornell university.  Problem: Predict the type of relationship in a target network by leveraging the supervised information (labeled relationships) from the source network.  Method: proposed Predictive model such as transfer based factor graph model (TransFG) for learning and predicting the type of social relationships across network.  Results: Proposed method TransFG is more helpful when combined with social theories (Structural balance, structural hole, social status, two step flow) in inferring type of relationship in social network. Performance drops when any one of the social theories is ignored. TransFG is checked against social theories in datasets such as Epinions, Slashdot, Mobile in predicting undirected relationship and Coauthor and Enron for predicting directed relationship.  Challenges: As discussed in the paper, there are two types of networks source and target networks, predictive model needs to learn both the networks, the challenge is then how to bridge the two networks, so that we can transfer the labeled information from source network to target network.
  • 29. 29  Future work: some other social theories can be further explored and validated for analyzing the formation of different types of social relationships  Dataset: Epinions, Slashdot, Mobile, Coauthor, Enron are datasets and they are publicly available. 4.5 Path Predict  Paper Title: Co-Author Relationship Prediction in Heterogeneous Bibliographic Networks [7].  Authors: Yizhou Sun, Rick Barber, Manish Gupta, Charu C. Aggarwal, Jiawei Han  Publication: Published in Int. Conf. on Advances in Social Networks Analysis and Mining (ASONAM'11), July 2011  Problem: Identify the kind of connections between two authors that are more helpful to lead to future link collaborations. Basically predicting whether two authors that have never co-authored before will co-author sometime in the future rather than predicting how many times two authors will co-author in the future. Given a heterogeneous network, the link prediction task is then generalized to relationship building prediction, which is to predict whether two objects will build a relationship following a certain target relation.  Method: There are two stages (Training and test stage). In the training stage, we first sample a set of author pairs that have never co-authored in T0, collect their associated topological features in T0, and record whether a relationship is to appear between them in the future interval T1.  Model used: Path predict model. Defined the topological features in the DBLP network and used meta path based Topological feature. Meta paths between two object types can be obtained by traversing on the DBLP network schema, by using standard traversal methods such as the BFS (breadth-first search) algorithm. Discussed below are four measures on meta path. 1. Path count 2. Normal Path count 3. Random walk 4. Systematic Random walk.
  • 30. 30 Defined Co-authorship Model, used logic regression method as the prediction model. For each training pair of authors (ai1 , ai2), let xi be the (d+ 1)- dimensional vector including constant 1 and d topological features between them, and yi be the label of whether they will be co-authors in the future (yi = 1 if they will be co-authors, and otherwise 0), which follows binomial distribution with probability pi . The probability pi=e^( xiβ)/ (e^( xiβ )+ 1) where β is the d + 1 coefficient weights associated with the constant and each topological feature. Then used standard MLE (Maximum Likelihood Estimation) to derive βˆ, that maximizes the likelihood of all the training pairs.  Results: The co-authorship for highly productive authors is easier to predict than less productive authors. The prediction accuracy is higher when the target authors are 3- hop co-authors, which means the collaboration between closer authors in the network is more affected by information that is not available from network topology. In the previous cases, we say two authors have a co-authorship if they have co-authored one paper. Here, we study the relationships defined by different collaboration frequency. From Fig. 4.1, we can see that, the measure symmetric random walk is more important in deciding high frequency co-author relationships. Two authors who can be reached with high probability mutually in the network will be more likely to build strong collaboration relations.[7]  Challenges: Predicting co-authors for a given author is an extremely difficult task, as we have too many candidate target authors (3-hop candidates are used), while the number of real new relationships are usually quite small.  Dataset: The DBLP bibliographic network is available in the internet in an XML file. Fig 4.1 Impact of Collaboration Frequency of different measures [7]
  • 31. 31 Chapter 5 Algorithm Design and Implementation To predict the links in the dataset we have used the Fuzzy Link Based Classification algorithm- a subpart of Neuro Fuzzy Link Based Classification algorithm, which is a combination on the Feedforward Neural Networks (FFNet) Backpropagation techniques and fuzzy logic. FFNet was inspired from the neural system the human body. In this chapter we will first explain about the system design involved in setting up the network followed by explaining the FFNet and Backpropagation algorithm, explain the reasons for using the algorithm and then discuss about how we worked on our dataset and the steps involved in obtaining the desired output. 5.1 System Design From selecting the dataset to be worked on to data classification and link prediction, many steps are involved in like clustering data, classification of data, data extraction etc. These
  • 32. 32 steps are performed in a proper order as mentioned in the below figure of System architecture. Fig 5.1 System Architecture [9]  Initially the dataset must be selected to retrieve the data so that classification can be performed.  In user interface model data is retrieved from the dataset and represented in a readable format. On this data pattern recognition and analysis can be performed.  In clustering and classification phase dissimilar data is differentiated from similar data. In our project this step can be omitted because the dataset is already in the form of a CSV file, divided according to the respective link types.  Knowledge base contains the rules for the construction of the fuzzy system. The system checks each input and acts according to the weights and specified attributes of the nodes.  Finally the decision manager contains the logic to make decision based on the rules mentioned in the knowledge base. It is the heart of fuzzy logic. It decides the output and passes it to the response which displays the final result.
  • 33. 33 5.2 Backpropagation 5.2.1 Feed Forward Neural Networks FFNet consists of a number of nodes that are linked to each other with edges of different types (in a heterogeneous network), where each edge carries some weight. All the nodes are connected to every other node in the layers that precede and succeed them. Input values are given at the input layer which propagates the value to further layers. Input and output for the input layer is the same. The final layer is the output layer that gives the predicted value. All the layers apart from input and output layers are called hidden layers. The outputs from the hidden layer are propagated as inputs to other layers and finally to the output layer. 5.2.1.1 Learning Phase The number of input and output units depends on the attributes of the nodes and the number of categories we want to classify the data. First we must create the adjacency matrix for all the nodes and edge types to store the strength value. From this matrix, we can finalize the number input units as the number of attributes of the node and output as the different types of categories. Fig 5.2 Feed Forward Neural Network structure [10]
  • 34. 34 The outputs of the network are compared with the actual outputs and if the output values are large then the categorization is correct. On this comparison the weights are modified so that when same types of inputs are presented the outcome value is higher than the current value. 5.2.2 Neuro Fuzzy Link Based Classifier Neuro fuzzy link based classifier works based on the neuro fuzzy rules to classify the edges. These rules are based on backpropagation technique. In our problem we first adjust the weights by normalizing the edge strengths of all the 5 types followed by implementing triangular fuzzy membership function that looks for all the possible triangular formations in the network. Based on the triangle perimeter we calculate the threshold value and base our final predictions. 5.3 Reason for Using the Algorithm Some of the main reasons for using the algorithm are [8]- 1. Neural Networks (NN) have high tolerance of noisy data. 2. NN can be used to work on datasets on which they have not been trained. 3. The can be used when we have limited idea about the relationship between the attributes of the nodes type of edges present in the dataset. 4. NN algorithms are inherently parallel; parallelization techniques can be used to speed up the computation process [8]. 5. NN algorithms have been successful in handwritten character recognition, medical purposes and training a machine to pronounce English text. 5.4 Steps Involved in Proposed Algorithm 1. Initially all the values of the dataset are in a CSV file, with each type of edge having a different file. The values are represented as shown in Fig 2.1. Before starting the implementation, 10% of the total edges are removed from each file and store in another file and the experiment is performed on the remaining 90% of the values. Final testing will be performed on these eliminated 10% values to check the accuracy to the algorithm
  • 35. 35 2. We create an n x n 2D matrix to store the strengths of the edges between the nodes, where n is the total number of nodes in the network which is 15088 in this case. The strengths of the 1st edge types (contact network) are stored in the matrix. No normalization of values is needed as all the strength values are 1 in the 1st type. 3. If no edge exists between the nodes then the link strength is 0. 4. The 2nd file is read which has strengths of the 2nd edge type (number of shared friends between two users). The strength values are not of the same value. Normalization is performed on each strength value and the normalized value is added to the strength values of the 1st type. Normalization= (xi – Minimum value) ÷ (Maximum value – Minimum value) Where, Xi – strength of the currently read edge Maximum value- Strength with maximum value in the current file (of a particular edge type). Minimum Value-- Strength with minimum value in the current file (of a particular edge type). 5. Step 4 is repeated for the remaining 3 files. 6. Finally the n x n 2D matrix will have the normalized strength values between the nodes of all the 5 files. 7. Once the values are normalized the next step to find out all the triangular formations in network.
  • 36. 36 Fig 5.3 Example of a triangular formation in the network For example in Fig 5.3, nodes A, B and C for a triangle. The values above the links are the normalized values of the link types between the nodes. 8. After all the triangles are recognized the sum of their 3 sides is calculated. Fig 5.4 shows the calculated values of all the triangles. E.g. Sum for Fig 5.3 = 0.9+0.5+0.1=1.5 Fig 5.4 Sums of three sides of all the triangles
  • 37. 37 9. From the obtained sum of each triangle in the network 1000 values are selected at random and stored in an excel file and sorted in ascending order and stored in a bucket size of n/200 where n is the total values we select (n=1000 in this case). We take this value as threshold value (999th value). There are many other ways to select the threshold values but we had many triangles in the data nearly to the size of 5.6 GB. Fig 5.5 Threshold value 10. In the next step the open triangles are identified, i.e. the ones in which 3rd side is not closed, consider x as normalized strength of one edge and y as the normalized strength of another edge. If - (x+y) > Threshold value
  • 38. 38 Then there is a possibility that a link exists between the remaining nodes where no link is currently present. For taking two values in a bucket (value=2.2674) we get 43,85,214 expected links. Fig 5.6 Number of expected links for bucket size of two 11. The links for which sum of two sides was greater than threshold are stored in another file. 12. LRW algorithm is implemented on these expected links. The start node and end node are specified and the random walker moves from start to end through the neighbors connected to node in its current state. If the distance between the start and destination is greater than n (user defined number) hops then it is assumed that no link exists between them. Otherwise the number of hops are calculated. 13. Based on the number of hops score is calculated for each link prediction using the formulae
  • 39. 39 S x y LRW (t) = k x / 2|E| · π x y (t) + k y / 2|E| · πyx (t) Where K x is the number of neighbors of node x E is the total number of edges in the entire network π x y= πyx is the hop distance from node x to node y (we are assuming that distance from node x to node y is the same from node y to node x). 14. The links are arranged in descending order with respect to their scores and the top 25% of them are considered (L). We then compare these links with those that we have deleted from the original dataset in the beginning and compare how many links that are predicted are actually present (l). 15. Finally precision is calculated using the formulae Precision = l / L to find out the accuracy of the algorithm implemented. Fig 5.7 shows the calculated precision for all the hop counts that we experimented with 5.5 Pseudo Codes 5.5.1 Adjacency Matrix 1. Read the data in file1 ; 2. Store the values as node1,node2 ,normalizedstrength; 3. Adjacency[node1-1][node2-1]= Adjacency[node1-1][node2-1]+normalizedstrength; For creating testing and training datasets 1. Calculate the no of edges in each file. 2. int c= no of edges/10,count=0; 3. While(count!=c){ int a= (Math.Random)*15088; int b= (Math.Random)*15088; If there is an edge between a and b Make adjacency [a][b]=0; Increment count; Write (a+1) +”t+( b+1) in a file.
  • 40. 40 } 4. For(int i=0;i<l5088;i++) for(int j=0;j<15088;j++) if( adjacency[i][j]!=0) write( (i+1)+”t”+ (j+1)) in a file 5. End 5.5.2 Closed Triangles and Strength 1. for(int i=0;i<l5088;i++) for(int j=0;j<15088;j++) for(int k=0;k<15088;k++) if( adjacency[i][j]!=0) && adjacency[j][k]!=0) && adjacency[i][k]!=0) { write (i+1)+”t”+ (j+1)+”t”+(k+1)+”t+” adj[i][j]+adj[j][k]+adj[i][k] in a new file } } } 2. Now read the file and chose 1000 random triangles and save them in other file 3. The 1000 random triangles are sorted and are taken different bucket sizes such as 2 in a bucket size 4. Store the998th triangle strength as threshold value. 5.5.3 Open Triangles 1. for ( int i=0;i<15088;i++ for(int j=0;j<15088;j++) for(int k=0;k<15088;k++) if there is an edge between (i,j) and (j,k) but no edge between (I,k) if strength of (i,j)+(j,k) exceeds the threshold value Adjacency[i][k]=1000; 2. for ( int i=0;i<15088;i++ for(int j=0;j<15088;j++) if the strength between I and j =1000;
  • 41. 41 write (i+1) and (j+1) in a file 5.5.4 Applying Random walk algorithm and calculating score 1. Read the open triangle file 2. Store the node values into a and b 3. Start the random walker from node till it reaches to b or the hop count value reach to 16. 4. If the hop count < 15 then store the value of and b. 5. Calculate the no of neighbors in a and b 6. Using the formula of local random walk predict the score of expected edges. 7. Sort the edges using the scores 8. Select the top 25% of the edges 9. Calculate how many of these 25% edges are available in deleted dataset 10. Calculate the precision value. 5.6 Result We conducted the experiment on the 43,85,214 expected links by taking the threshold value as 2.2674, the 999th value of the randomly selected 1000 values. The calculated precision is highest for hop count 50 hops which is 2.336% for predicting 61 correct links from the expected 2611 links while hop count of 15 has least precision of 1.77% for predicting 12 correct links from the expected 675 links.
  • 42. 42 Fig 5.7 Expected links with hop counts
  • 43. 43 Fig 5.8 Local random walk score calculated for expected links Fig 5.9 Calculated Precision
  • 44. 44 Fig 5.10 Precision vs Hop count graph Fig 5.11 Hop count vs estimated links graph 5.7 Failed Approaches In our approach to get the optimum results we have tried various combinations to get the desired results. Some of the approaches we tried did not yield desired results and in this section will be explaining some for these.
  • 45. 45 5.7.1 3D matrix Our initial idea was to read the all the five CSV file of dataset into a 3D matrix and then do the link prediction computations. But due to the excessive nodes (15088 x 15088 x 5) the JVM could not find sufficient heap size to allocate memory. Therefore we decided to use a 2D array read each CSV file at a time, calculate the normalized value and then read the 2nd file and so on. 5.7.2 Calculating Threshold value  After finding out all the triangular node possibilities in the network, a threshold value was to be calculated in order to find the possible missing links. Initially after calculating the perimeter of the triangles we took the average of the perimeter and multiplied it with 2/3 as the threshold. As a result we got a 31GB file that could neither be opened using normal text editor nor copied. Note- We multiplied with 2/3 because in the open triangle 2 links are already formed out of 3.  In our 2nd attempt we randomly selected 1000 values and sorted them in ascending order and took the values in buckets for 50. We selected the 950th value as the threshold. As a result we got a more than 6 Crore expected links as shown in Fig 5.7.  In the 3rd attempt we directly took the 975th and got 5 Crore expected nodes. Fig 5.12 Edges obtained for bucket size 50
  • 46. 46
  • 47. 47 Chapter 6 Project Schedule and Conclusion 6.1 Project schedule Our project was planned in two phases. The first phase includes literature survey to know the complete information involved in link prediction, the various algorithms used and their draw backs. Based on our understanding we then formulated a problem to work on and select the dataset. In the second phase we stored the extracted values from the dataset in a 2D array with reference to their strengths of edges, includes developing an algorithm, collecting results and testing. Phase1: Literature survey Problem formulation Dataset collection Phase2: Developing algorithm Implementation of the algorithm Testing.
  • 48. 48 Fig 6.1 Project Phases Table 6.1 Project Schedule Date Activity Meetings with Advisor Jan 6th Meeting with advisor to discuss project schedule and details Meeting with advisor Jan 9th -16 Research articles gathering and focusing on our topic Jan 17th-23rd Literature survey of Article on link prediction in heterogeneous networks and unsupervised link prediction algorithms like Common neighbor, Jaccard coefficient, Adamic/Adar Jan 20th Meeting with advisor Explained in detail about these methods by advisor. Jan 25th-31st Literature Survey on 1. Paper on Multi-Relational link prediction in Heterogeneous networks. 2. Paper on Link prediction in heterogeneous Networks based on tensor factorization. Jan 30th meeting with advisor literature survey and problem formulation algorithm implementation link prediction
  • 49. 49 Feb 2nd- 8th Literature Survey on 1. Paper on inferring social ties across heterogeneous networks. 2. Paper on exploiting place features in link prediction on location based social networks. Feb 7th meeting with advisor Feb 10th- 16th 1.Problem formulation and finding related datasets 2. Working on large Xml files to retrieve required data. Due to the non-availability of different types of edges in Co-authorship network we shifted to YouTube Dataset Feb 15th Meeting with advisor Feb 17th-24th Worked on text files of dataset and tried to store the values in 3d array in java. Due to heap size problem we decided to shift to C++ Feb 22nd meeting with advisor Feb 26-30 Mid semester presentation week March 1st-7th Worked in C++ to store data in 3D vectors. March 3rd Meeting with advisor March 8th-15th Faced the same memory heap problem with 3D vectors in C++ Decided to work in 2d arrays in java with normalized score for the strength of the edge. March 10th suggested to try in 2D java matrix by storing normalized strength value. March 18th- 27th Mid-Semester break Worked on final code for implementation. April 1st-7th Code implementation, collected results required at various stages for final precision. April 1st Planned to work on closed triangles in the network to find out the threshold value.
  • 50. 50 April 2nd Average value turned out to be very low and this resulted in huge file that couldn’t be opened April 3rd We randomized 1000 values of sum from closed triangle list and picked values in buckets of 25, 15 and 2 April 4th We took optimal set of unlinked triangles and performed local random walk algorithm on them April 5th Calculated score of the predicted links and selected the top 20% of the links April 6th Calculated precision value for the predicted links April 7th-15th Final Report, poster and presentation. 6.2 Future Work The short duration of the project forced us to compromise on a couple of things that could have yielded better precision values. Given a chance to expand this project we will implement the following tasks that could not be included now. 6.2.1 Implement the complete Neuro Fuzzy Link BasedAlgorithm Our initial idea was to implement the Neuro Fuzzy Link Based classifier algorithm that incorporated FFNet and Backpropagation involving changes to edge intensity to obtain the desired result. But due to shortage of time we decided to implement the Fuzzy Link Based algorithm along with triangular fuzzy membership function that calculates the number of triangles formed in the network and then calculates the threshold value based on this. 6.2.2 Local Random Walk We implemented Local Random Walk algorithm after selecting the open triangles in the network. Generally LRW is implemented multiple times, say 100, and the score is calculated for those edges that occur multiple times. The run time to implement LRW is generally more than 2 hours and due to time constraint we could implement LRW only once with a specified hop count and calculated the score for the suggested links. This was one reason for the low precision. Running the LRW multiple times and then calculating precision would have yielded much better precision.
  • 51. 51 6.3 Conclusion Data mining is a vast and upcoming area of research and link prediction is only a part of it. Our area of interest lie in parallel with this field of study and some of us even opted to pursue higher studies and specialize in Data Mining and Data Warehousing which motivated us to take up this project. Though the results were satisfactory they were not as good as expected. Keeping in mind the fact that this was our first experience in data mining and machine learning field and the short duration of the project time period, we tried our best to go through all the methods that are used in link prediction problem and implement our algorithm successfully. References: [1] Heterogeneous Graph- http://www.mdpi.com/1424-8220/15/10/24735/htm [2] YouTube Dataset download- http://socialcomputing.asu.edu/datasets/YouTube [3] Multi-Relational Link Prediction in Heterogeneous Information Networks Darcy Davis, Ryan Lichtenwalter, Nitesh V. Chawla, Department of Computer Science and Engineering University of Notre Dame,Notre Dame, IN, 46556 US
  • 52. 52 https://www3.nd.edu/~dial/papers/ASONAM11b.pdf [4] Link Prediction in Heterogeneous Networks Based on Tensor Factorization The Open Cybernetics & Systemics Journal, 2014, 8, 316-321 http://benthamopen.com/contents/pdf/TOCSJ/TOCSJ-8-316.pdf [5] Link Prediction in Heterogeneous Networks: Influence and Time Matters http://hanj.cs.illinois.edu/pdf/icdm12_yyang.pdf [6] Inferring Social Ties across Heterogenous Networks https://www.cs.cornell.edu/home/kleinber/wsdm12-links.pdf [7] Co-Author Relationship Prediction in Heterogeneous Bibliographic Networks http://www.ccs.neu.edu/home/yzsun/papers/asonam11_pathpredict.pdf [8] Data Mining Concepts and Techniques- 3rd edition Jiawei Han, Micheline Kamber, Jian Pei. [9] NEURO FUZZY LINK BASED CLASSIFIER FOR THE ANALYSIS OF BEHAVIOR MODELS IN SOCIAL NETWORKS - Journal of Computer Science 10 (4): 578-584, 2014 Indira Priya Ponnuvel, Ghosh Dalim Kumar, Kannan Arputharaj and Ganapathy Sannasi http://thescipub.com/PDF/jcssp.2014.578.584.pdf [10] FFnet diagram http://www.fon.hum.uva.nl/praat/manual/Feedforward_neural_networks_1__What_is _a_feedforward_ne.html