Final Report

1
B.Tech Project Report
On
Link Prediction Problem for Heterogeneous Networks
By
Sai Akhil Reddy Gopidi (1210110172)
Nithin Kumar (1210110095)
Roopesh Kumar Kotte (1210110093)
Supervisor
Dr. Dolly Sharma, Shiv Nadar University (dolly.sharma@snu.edu.in)
Submitted in the partial fulfillment of requirements for
Bachelor of Technology in Computer Science and Engineering
Department of Computer Science and Engineering,
School of Engineering, Shiv Nadar University,
Gautam Buddha Nagar, U.P., India, 201314
http://www.snu.edu.in

2
Approval Sheet
This report entitled Link Prediction Problem for Heterogeneous Networks by Sai Akhil
Reddy Gopidi, Nithin Kumar & Roopesh Kumar Kotte is approved for the
degree of Bachelor of Technology in Computer Science and Engineering.
Project Advisor
Name Dolly Sharma
Signature ________________________________
Date ________________________________

3
Declaration Sheet
We declare that this written submission represents our ideas in our own words and where
others' ideas or words have been included, we have adequately cited and referenced the
original sources.
We also declare that we have adhered to all principles of academic honesty and integrity
and have not misrepresented or fabricated or falsified any idea/data/fact/source in
our submission.
We understand that any violation of the above will be cause for disciplinary action
by the Institute and can also evoke penal action from the sources that have thus not been
properly cited or from whom proper permission has not been taken when needed.
Name of the Student: Sai Akhil Reddy Gopidi
Signature ___________________________
Name of the Student: Nithin Kumar Kakkrineni
Signature ___________________________
Name of the Student: Roopesh Kumar Kotte
Signature ___________________________
Date _______________________________

4
Abstract
Interaction among members has become an important aspect in many social networks like
Facebook and YouTube. As the number of users grows the network size also increases and it
becomes difficult for users to find their friends or search and watch the videos they like. To
make the life of the user easy, social networks suggest friends and videos to the users based
on their previous searches and mutual interests. Therefore social networks have focused on
link prediction techniques that allow users to easily find their needs. Link prediction is a
critical task that not only helps increase the linkage inside the network but also improves the
user experience. In a link prediction algorithm it is required to identify the factors that
influence link creation. In this project we analyze and discuss some of these factors and also
give an approach which satisfies these factors. The approach is to estimate link relevance by
using Fuzzy Link Based Classification algorithms based on Backpropagation that gives the
probability of a link existence at some future time. We then evaluate the accuracy of the
results obtained using accuracy measurement techniques like Precision. We apply our
methods to YouTube dataset in order to evaluate the performance of our algorithms and
compare it to the performance of previously proposed algorithms.

6
List of Figures and Tables
S.No. Figure/Table Number Description
1. Fig 1.1 Heterogeneous graph with multiple edge types
2. Fig 1.2 Depicting problem statement using graph
3. Fig 2.1 CSV file of type 1 edge in YouTube dataset
4. Table 3.1
AUROC for YouTube, Disease and Climate network for
each edge type
5. Fig 3.1 Graph showing connections between nodes
6. Table 3.2 Distance between the nodes
7. Fig 3.2 Alternate neighbors and link possibility
8. Fig 3.3 Local random walk path
9. Table 4.1 Results obtained from the AUROC curve on DBLP dataset
10. Fig 4.1 Impact of Collaboration Frequency of different measures
11. Fig 5.1 System Architecture
12. Fig 5.2 Feed Forward Neural Network structure
13. Fig 5.3 Example of a triangular formation in the network
14. Fig 5.4 Sums of three sides of a triangle
15. Fig 5.5 Threshold Value
16. Fig 5.6 Number of expected links for bucket size of two
17. Fig 5.7 Expected links with hop counts
18. Fig 5.8 Local random walk score calculated for expected links
19. Fig 5.9 Calculated Precision
20. Fig 5.10 Precision vs Hop count graph
21. Fig 5.11 Hop count vs estimated links graph
22. Fig 5.12 Edges obtained for bucket size
23. Fig 6.1 Project Phases
24. Table 6.1 Project Schedule

7
Abbreviations and Nomenclature
AUROC- Area under the Receiver Operating Curve
CSV- Comma Separated Value
DBLP- Digital Bibliography & Library Project
JVM- Java Virtual Machine
GC- Garbage Collector
STL- Standard Template Library
JC- Jaccard’s coefficient
CN- Common Neighbor or Contact Network
A/A- Adamic Adar
LRW- Local Random Walk
CP- CANDECOMP/PARAFAC
MRLP- Multi Relational Link Prediction
MRIP- Multi Relational Influence Propagation
SBN- Shared subscriptions
SBR- Shared subscribers
VID- Shared favorite videos
FFNet- Feed Forward Neural Network
NN- Neural Network

9
Chapter 1
Introduction
1.1 Overview
Data mining and analyzing data is an important and upcoming field in computer science.
Huge volumes of data stored in data warehouses must be extracted for the purpose of pattern
recognition which can in turn benefit the organization. Storing and managing this data had
become a major problem in today’s world. Interaction among members of a community is or
network is of highest priority. Organizations like Facebook and YouTube which aim to
connect millions of people around the world analyze the pattern of user behavior and
recommends friends and videos accordingly. Given a user A at some point of time t, the task
at hand is to estimate all the possibilities of link formation between user A and user B by
taking into consideration all the parameters that these two users share. Link prediction makes

10
it possible to determine wheather two people can become friends or not beforehand. Many
social networks use this technique to suggest friends to the users so that they do not have to
search for all their friends.
This project includes some of the commonly the various link prediction techniques that have
been used previously as well as their drawbacks. We then propose our algorithm and
implement it on the YouTube dataset. Finally we compare the obtained results of our
algorithm with the results of the previously proposed algorithms and conclude wheather our
algorithm is efficient or not.
1.2 Problem Statement
Given a heterogeneous graph G = (V1 U V2…U Vm, E1 U E2… U En), where Vu (u € N)
represents the set of nodes of same type u (users) and Ej (j € M) represents the link of type j
between the nodes (relationship between users), our task is to predict the future possible links
between the users. Since it is not possible to compare the dataset at two different time
intervals, we use the cross-dimension validation process. We divide the complete dataset into
2 divisions-
1. Training set
2. Testing set
Fig 1.1 Heterogeneous graph with multiple edge types [1]

11
Out of the 5,574,249 edges in our dataset, we omit 10% of the total nodes, i.e., 8,36,137 and
implement our algorithm on the remaining 47,38,111 to train the dataset. After the training is
completed we test the algorithm on the entire dataset to check how accurate the results are
using precision accuracy metric since we already know which links are to be expected and
which links do not exist.
Fig 1.2 Depicting problem statement using graph
The numbers above the links indicate the intensity of the links. For example, users 2 and 6
have more commonly shared videos than users 5 and 6.
1.3 Team Members Contribution
Name: Akhil Reddy
Contribution: Akhil’s contribution includes literature survey to understand the fundamental
concepts of link prediction, problem formulation, designing the algorithm and testing the
algorithm.
Name: Nithin Kumar
Contribution: Nithin’s contribution includes assistance with literature survey, problem
formulation, designing and testing the algorithm.
Name: Roopesh Kumar

12
Contribution: Roopesh’s contribution includes data gathering, testing the algorithm, research
in finding suitable data structure to accommodate all the nodes in the dataset and poster
design.
Chapter 2
Feasibility Study and Requirements
2.1 Dataset Used
The dataset that we have selected to implement our algorithm is the YouTube dataset from
December 2008, which is a video sharing platform for millions of users. This dataset includes
information of those users who were will to share their information [2].
Number of Nodes: 15,088

13
Number of Edges: 5,574,249
Types of Edges: 5
In this case we are considering all the users as nodes and the different types of relations
between them as edges, to construct out heterogeneous graph. A graph G = (V1 U V2…U
Vm, E1 U E2… U En), where Vu (u € N) represents the set of nodes of same type u (Users)
and Ej (j € M) represents the link of type j between the nodes (relationship between users) is
called a heterogeneous graph.
There are 5 types of edges in this dataset, namely-
1. The contact network between the 15,088 users.
2. The number of shared friends between two users in the 848, 003 (excluding the
15,088) contacts- Two users are connected if they both add another user as contact.
3. The number of shared subscriptions between two users- When two users subscribe
to same person/channel.
4. The number of shared subscribers between two users- Two users are connected if
another user has subscribed both of them.
5. The number of shared favorite videos- Users sharing same videos.
The dataset is in a CSV (Comma Separated Value) format for each edge type independently.
E.g.: 7, 12, 94 – Indicates that user ids 7 and 12 have an intensity of 94 between them of a
particular edge type.

14
Fig 2.1 CSV file of type 1 edge in YouTube dataset
In the above figure the 1st and 2nd columns represent the node that have a link of type 1 which
the intensity in the 3rd column.
2.2 Scope:
Our YouTube network consists of different type of edges with an interaction value which
forms a large network. The scope of our project is to reduce the link prediction space
specifically in the YouTube network and suggest link in a probabilistic way that the link will
be useful in near future. Our experiment of link prediction is only for research purpose.
2.3 Problems Faced:

15
 Initially we considered using DBLP dataset, but due to lack of multiple edges and
insufficient data available in the dataset we were forced to work on a new dataset.
 As the dataset contains 15088 authors and 5,574,249 links in the YouTube dataset, it
is hard to accommodate all the nodes in the dynamic array in the form of a 3-D matrix
because it overflows the available memory of the ram and gives a GC overhead error.
 System requires larger RAM than the available RAM in our laptops to load dataset
into the data structure.
2.4 Software and Hardware Requirements
 Operating systems.
 NetBeans with JDK to run the Java code
 Larger ram.
2.5 Technical Feasibility
Initially the project source code was meant to be written in Java using 3D arrays to create the
adjacency matrix. As the number of nodes was very high, the Java VM could find enough
memory to store the data. Hence we shifted to C++ vectors which have the capability to
dynamically allocate memory to the node. Even the vector STL could not accommodate all
the data in a 3D vector and hence we were forced to shift back to Java and used 2D arrays to
create the adjacency matrix. This Project is technically feasible as it works perfectly fine on
the existing version of Java provided we use 2D arrays.
NOTE- To use 2D arrays for such large datasets where number of nodes is 15088, the
JVM option must be changed to –Xms2g to allocate more heap size.
2.6 Economic Feasibility
It is economically feasible as the data set is publicly available online, so it is free of cost
there is no extra costs required for the project. Research done to get into the project is also
based on the scholarly articles available online, there is no need to learn new technology
(language).

16
2.7 Schedule feasibility
It took some time to know the concepts of link prediction in heterogeneous, homogenous
networks, techniques used to predict the future link. When we discover what to do
innovatively and something new, we proceeded swiftly with our project in order to complete
it within the stipulated time frame.
2.8 Project Meetings
2.8.1 Meeting with Supervisor
We have constantly met our project supervisor on weekly bases (twice a week) to discuss
about the objectives of the project. We were instructed to conduct the literary survey in the
beginning to get an idea about the subject since everyone in the group was new to this
domain of study. Once we completed literature survey the algorithm was discussed to be
implemented on the dataset. Later accuracy metrics were discussed to check out the
performance of the algorithm and finally the results were compared with the results of other
algorithms. All the important meetings with the supervisor were conducted in person and the
minor details were either discussed over phone or e-mails.
2.8.2 Group Meetings
The group members met daily. Initially we discussed the scope and schedule of the project as
well as discussed on individual roles to be carried out. On completing the literature survey
and getting a good grasp of the subject we formulated a problem and finalized the dataset to
be worked. During the literature survey the papers were distributed among the team members
and on completion of paper the member explained the contents of that paper to other
members to save time and avoid redundancy.
2.9 Text Deliverables
Along with this report several other documents are also included in order understand the
research in a deeper sense.
Dataset- The YouTube dataset on which the research was conducted on. The dataset is in a
CSV format with each link type having its own file.

17
Source Code- A CD is given along with the report containing the code for all the algorithms
implemented.
List of all expected value- On running the LRW a list of all expected edges is obtained. A
file including all these edges is included
Deleted Files- To test the accuracy of the algorithm, 10% of links are deleted from original
file and stored in another file. Testing is done on this file.
2.10 Conclusion
We can confidently conclude after considering all the above stated points that the project is
feasible and we will complete the project in stipulated time allotted to complete the project
work.

18
Chapter 3
Commonly Used Algorithm
Many algorithms have been proposed for link prediction purpose in homogeneous networks
as well as heterogeneous networks. We cannot conclude that a specific approach is the best
way to predict links because link prediction methods are domain specific. Performance of the
algorithms is based on how well the network supports the predefined scoring methods for link
formation. For example, Facebook and Twitter being social networks yield best results for
neighborhood methods like common neighbor and Adamic/Adar for friend recommendation
links and in a climate network Jaccard’s coefficient performs well due to spatial
autocorrelation [3]. It is also possible that only one method might not give the best results for
all the different links in a single network. Hence performance of algorithm is not only
dependent on the predefined scoring measure and the type of the network, but also on the
type of links that it is being used to predict. This is clearly illustrated in the disease-gene
network, where different method works best for different link in the same network [3]. The

19
AUROC table below clearly indicates the above stated, where the bold faced indicate the best
link predictor method.
Table 3.1 AUROC for YouTube, Disease and Climate network for each edge type [3]
3.1 Commonly Used Algorithms
The following discussed methods can be used for any pair of nodes (A, B) in a network.
Fig 3.1 Graph showing connections between nodes
Score is allocated to each link based on the predefined scoring techniques used in these
algorithms and based on this score we predict wheather there is a possibility of link between
the nodes.
3.1.1 Graph Distance

20
In this method the distance between the two nodes, i.e., the source and destination nodes is
calculated and the inverse or negated length is considered. If the distance between the nodes
is less then there is higher chance that these nodes might be connected and vice versa.
Table 3.2 Distance between the nodes
Nodes Distance
(A,C) -2
(C,E) -3
(A,E) -3
As shown is Fig 3.1, in the above table the distance between nodes A and C is the least,
therefore there are higher chances of link formation between these two nodes. The negative
sign (-) is only to show that the least distance value has higher probability of link.
3.1.2 Common Neighbors
Link prediction in this method is based on the number of common neighbors that two nodes
have. If two nodes have more number of common neighbors then more is the probability of
link existence between the nodes and vice versa.
Score= Ə (A) ∩ Ə (B), which is the total number of common neighbor nodes of the two
nodes.
Where Ə (x) denotes the neighbors of a node x.
3.1.3 Jaccard’s Coefficient
Fig 3.2 Alternate neighbors and link possibility

21
Jaccard’s coefficient is the derived from common neighbors method but provides more
accurate results. For a given pair of nodes (A, B), the score assigned is the number of
common neighbor nodes of A and B divided by the total number of neighbor nodes of A and
B.
Score= Ə (A) ∩ Ə (B) ÷ Ə (A) U Ə (B)
The numerator part is similar to common neighbor’s method.
From Fig 3.3 we can see that for nodes C and D the common neighbors are A and B. The
scoring method used by common neighbors gives a high score for a link between C and D.
We are not considering any other nodes in this method. It is also clear from Fig 3.2 that node
C has many other neighbors apart from A and B whereas D has only those two neighbors.
Therefore in this case the score calculated by the common neighbor method is not accurate
and to negate these additional neighbors of the nodes we divide the common neighbors of
both nodes by the total number of nodes. This increases the accuracy of calculated score.
3.1.4 Adamic/Adar
Adamic/Adar is the advanced version of Jaccards coefficient, which weighs rarer neighbors
more heavily. To put in simple terms- for a pair of nodes (A, B), if the common neighbors of
A and B have less common neighbors then there is a higher possibility of link existence
between A and B.
From Fig 3.2, the common neighbors of A and B are C and D which in turn do not have any
common neighbors. So there is a better link formation between A and B.
Score= , where z is the common neighbor of nodes x and y.
3.1.5 Preferential Attachment
Score assigned to the pair of nodes is the product of their degrees. Higher score is assigned if
the nodes have many edges attached to them.
3.1.6 Average Commute Time
It is the average number of steps required by a random walker starting from a source node to
reach destination node. The two nodes are said to have a link if they have smaller commute
time.

22
3.2 Local Random Walk
Random walk algorithm is an advanced method which is used to predict links in a network. It
is a Markov process, that is, it is a memory less process which makes its next move based
only on its current location and does not consider the previously followed path. In a given
graph G (V,E) for a pair of nodes x,y the process can be defined using a transition probability
matrix Pxy=axy / kx
where axy=1 if x and y are connected and 0 other wise. Kx denotes the degree of node k.
Let us consider a random walker starting at node x and must travel to node y and let πxy(t) be
the probability that this walker reaches node y after t steps. πx(t) =PT πx(t−1), is the
probability score to calculate that the random walker will come to location x from the
previously positioned location.
In the below figure consider that the random walker starts at x and must reach y. From x the
walker can go to any of the 4 nodes ahead of him, i.e., nodes 1-4 and through these he can
reach node y. The probability the walker will go to node 1 from node x is-
Π1(t) =PT Π1(t−1)
Similarly for the remaining nodes. To travel from node 1 to node y, again the probability is
Πy(t) =PT Πy(t−1)
Hence the probability of going to the next state is dependent only on the current state and not
the previous states.
Fig 3.3 Local random walk path

23
Summing up and taking average of the probabilities can give the score for existence of link
between nodes x and y.
3.2.1 Random Walk with Restart
Sometimes is so happens that the random walker deviates and goes away too far from the
destination node. In the above figures, consider that the walker has taken moved from node x
to node 100, which is far away from the node y. This gives low and inaccurate score
prediction and there are chances that the walker may never reach the destination. To
overcome this problem we can use random walk with restart, where walkers are continuously
released at regular intervals from the starting point which increases the probability of the
walker to reach the destination in the best possible path.

24
Chapter 4
Related Work
In order to understand the basic crux of the link prediction problem, we team had conducted
an extensive literature survey of various papers published so that we were aware of the
previously researched problems, the approach used, the dataset they experimented on and the
results obtained. Every approach has its own advantages and drawbacks over the others.
Since this is relatively new topic and research has recently begun in this domain, very few
papers were published. The obtained papers were distributed among the team members and
on daily basis the important content of the papers were discussed with the whole team. Few
important papers that are relevant to this project are discussed below.

25
4.1 Tensor Factorization
 Paper Title: Link Prediction in Heterogeneous Networks Based on Tensor
Factorization [4].
 Authors: Piao Yong , Li Xiaodong1 and Jiang He
 Publication: The Open Cybernetics & Systemics Journal, 2014, 8, 316-321
 Problem: To predict the edges that will be added to the network during the interval
from time t to a given future time t’.
 Method: Heterogeneous networks can be organized as a third order tensor (Node!
Node! Link type) or multi-dimensional. Proposed a method based on tensor
factorization that can capture the correlation between different types of links for the
link prediction problem without loss of information. Employed
CANDECOMP/PARAFAC (CP) tensor decomposition to capture the underlying
patterns in the node-relationship-node tensor. The CP decomposition generated
feature vectors for the nodes in the graph, that are computed to get a similarity score
that combines the multiple types of the graph.
After CP decomposition, 3 factor matrices were known: node matrix A, relationship
matrix B, and node matrix C. Link prediction can be computed according to the
captured associations. Score Matrix s is defined as
In this paper, they used alternating least-squares (ALS) with weighted-!-
regularization algorithm to fit the CP decomposition.
 Results: Adamic/Adar measure and Katz measure performs well both in theoretical
and practical experiments. So here, they just compared these two measures with their
methods. Their method provided better precision than unsupervised ones on the data
sets and also provided a competitive effect to Adamic/Adar measure and both those
two methods beat Katz measure.
 Challenges: It is cost intensively to compute tensor factorization.
 Datasets: UMLS. This data set contains data from the Unified Medical Language
System semantic work. This consists of 135 entities and 54 relationships. The entities

26
are high-level concepts like 'Disease or Syndrome', 'Diagnostic Procedure', or
'Mammal'.
4.2 Multi Relational Influence Propagation
 Paper Title: Link Prediction in Heterogeneous Networks: Influence and Time
Matters[5].
 Author: Yang Yang and Nitesh V. Chawla, Department of Computer Science & Engg,
University of Notre Dame.
Yizhou Sun and Jiawei Hanı, Department of Computer Science, University of Illinois
at Urbana-Champaign.
 Problem: Given a heterogeneous network, in this case a DBLP bibliographic network,
the machine must be able to predict wheather the link is present in the network and
the possibility of link in future. DBLP dataset contains information about 3,215
authors who published a minimum of 5 papers in conferences between 1990 and
2010. The links can be of different types, for E.g. link between author and author (co-
author), author-paper (writes), paper-conference (published in).
 Method: Different unsupervised link prediction algorithms were used to test the data
set like Common Neighbour, Jaccard Coefficient, Adamic/Adar Preferential
attachment, etc. Of all the algorithms, Multi relational Influence propagation (MRIP)
which uses conditional probability which is equivalent to edge correctness yielded the
best results. For unsupervised learning data between 1990 and 2000 was chosen as
training set and data between 2001 and 2005 as training set.
 Results: As unsupervised link predictions are domain specific performance varies for
each algorithm. MRIP has better performance than others in predicting co-authorship
between authors and predicting terms shared between authors and has slightly less
performance in conference presenter’s link.
 Challenges: MRIP works well for stable networks. But as DBLP is a non-stable
network (unit root value= 0.99), since the number of links keep on increasing every
year the traditional unsupervised link prediction algorithms are not of much use.
Availability of dataset and security is a problem. Additional information is collected
through user survey which is incomplete and unreliable. Information is needed that
can expose the users subconscious behavior at a particular time.

27
 Future Work: As the network changes with time temporal feature based methods are
implemented. Bootstrapping technology is one such method. Based on the degree of a
node we rank them in descending order and analyze how new links in future are
associated with the top K% of them.
 Dataset & availability: The whole DBLP dataset is available as an XML file.
Table 4.1 Results obtained from the AUROC curve on DBLP dataset [5]
JC CN AA MRIP
Collaboration 0.590 0.597 0.596 0.769
Conference 0.702 0.698 0.689 0.691
Key Terms 0.545 0.546 0.532 0.811
4.3 Multi Relational Link Prediction
 Paper Title: Multi-Relational Link Prediction in Heterogeneous Information
Networks[3].
 Author: Darcy Davis, Ryan Lichtenwalter, Nitesh V. Chawla, Interdisciplinary Centre
for Network Science and Applications, Department of Computer Science and
Engineering,
University of Notre Dame.
 Problem: Three different domains are considered- YouTube, Disease-Gene and
Climate network datasets.
YouTube has 15,088 users as of December 2008, who are considered as nodes in this
case. The users are connected by 5 different edges- contact network (CN) of the user,
shared contact with users outside of the network (FR), shared subscriptions (SBN),
shared subscribers (SBR), and shared favorite videos (VID).
The disease-gene network consists of 703 diseases and 1,132 genes with 4 edges.
The climate network has 1,701 locations with 7 edges for different climate changes.
 Method: Unsupervised link prediction methods are implemented. Link prediction for
each edge type is evaluated individually. Link prediction performance is evaluated
separately for each edge using Area Under the Receiver Operating Curve (AUROC).
 Results: Performance of the algorithms is based on how well the network supports the
predefined link scoring assumption. Performance of local neighborhood methods is

28
predominant in social networks like YouTube. Jaccard coefficient performs well in
climate network closely located areas have similar climate. In disease-gene network
each link type was captured best by a different method. Refer to Fig 3.1 for AUROC
values of each node.
 Challenges: A node in a network can have multiple edges and each edge can increase
the likelihood of a contact. In YouTube 76% of node pairs with contact edge have
other edges which increase the number of contacts of that particular edge. Bad
performance of MRLP on other edge types indicate that MRLP doesn’t work well
when additional link types are introduced (noise).
 Future Work: High performance link prediction (HPLP) is introduced for this purpose
which uses Feature vector, homogeneous link prediction and heterogeneous link
prediction.
4.4 Graph Model TransFG
 Paper title: Inferring social ties across heterogeneous networks[6].
 Authors: Jie Tang Tsinghua University
Tiancheng Lou Tsinghua University
Jon Kleinberg Cornell university.
 Problem: Predict the type of relationship in a target network by leveraging the
supervised information (labeled relationships) from the source network.
 Method: proposed Predictive model such as transfer based factor graph model
(TransFG) for learning and predicting the type of social relationships across network.
 Results: Proposed method TransFG is more helpful when combined with social
theories (Structural balance, structural hole, social status, two step flow) in inferring
type of relationship in social network. Performance drops when any one of the social
theories is ignored. TransFG is checked against social theories in datasets such as
Epinions, Slashdot, Mobile in predicting undirected relationship and Coauthor and
Enron for predicting directed relationship.
 Challenges: As discussed in the paper, there are two types of networks source and
target networks, predictive model needs to learn both the networks, the challenge is
then how to bridge the two networks, so that we can transfer the labeled information
from source network to target network.

29
 Future work: some other social theories can be further explored and validated for
analyzing the formation of different types of social relationships
 Dataset: Epinions, Slashdot, Mobile, Coauthor, Enron are datasets and they are
publicly available.
4.5 Path Predict
 Paper Title: Co-Author Relationship Prediction in Heterogeneous Bibliographic
Networks [7].
 Authors: Yizhou Sun, Rick Barber, Manish Gupta, Charu C. Aggarwal, Jiawei Han
 Publication: Published in Int. Conf. on Advances in Social Networks Analysis and
Mining (ASONAM'11), July 2011
 Problem: Identify the kind of connections between two authors that are more helpful
to lead to future link collaborations. Basically predicting whether two authors that
have never co-authored before will co-author sometime in the future rather than
predicting how many times two authors will co-author in the future. Given a
heterogeneous network, the link prediction task is then generalized to relationship
building prediction, which is to predict whether two objects will build a relationship
following a certain target relation.
 Method: There are two stages (Training and test stage). In the training stage, we first
sample a set of author pairs that have never co-authored in T0, collect their associated
topological features in T0, and record whether a relationship is to appear between
them in the future interval T1.
 Model used: Path predict model. Defined the topological features in the DBLP
network and used meta path based Topological feature. Meta paths between two
object types can be obtained by traversing on the DBLP network schema, by using
standard traversal methods such as the BFS (breadth-first search) algorithm.
Discussed below are four measures on meta path.
1. Path count
2. Normal Path count
3. Random walk
4. Systematic Random walk.

30
Defined Co-authorship Model, used logic regression method as the prediction model.
For each training pair of authors (ai1 , ai2), let xi be the (d+ 1)- dimensional vector
including constant 1 and d topological features between them, and yi be the label of
whether they will be co-authors in the future (yi = 1 if they will be co-authors, and
otherwise 0), which follows binomial distribution with probability pi . The probability
pi=e^( xiβ)/ (e^( xiβ )+ 1) where β is the d + 1 coefficient weights associated with the
constant and each topological feature. Then used standard MLE (Maximum
Likelihood Estimation) to derive βˆ, that maximizes the likelihood of all the training
pairs.
 Results: The co-authorship for highly productive authors is easier to predict than less
productive authors. The prediction accuracy is higher when the target authors are 3-
hop co-authors, which means the collaboration between closer authors in the network
is more affected by information that is not available from network topology. In the
previous cases, we say two authors have a co-authorship if they have co-authored one
paper. Here, we study the relationships defined by different collaboration frequency.
From Fig. 4.1, we can see that, the measure symmetric random walk is more
important in deciding high frequency co-author relationships. Two authors who can
be reached with high probability mutually in the network will be more likely to build
strong collaboration relations.[7]
 Challenges: Predicting co-authors for a given author is an extremely difficult task, as
we have too many candidate target authors (3-hop candidates are used), while the
number of real new relationships are usually quite small.
 Dataset: The DBLP bibliographic network is available in the internet in an XML file.
Fig 4.1 Impact of Collaboration Frequency of different measures [7]

31
Chapter 5
Algorithm Design and Implementation
To predict the links in the dataset we have used the Fuzzy Link Based Classification
algorithm- a subpart of Neuro Fuzzy Link Based Classification algorithm, which is a
combination on the Feedforward Neural Networks (FFNet) Backpropagation techniques and
fuzzy logic. FFNet was inspired from the neural system the human body. In this chapter we
will first explain about the system design involved in setting up the network followed by
explaining the FFNet and Backpropagation algorithm, explain the reasons for using the
algorithm and then discuss about how we worked on our dataset and the steps involved in
obtaining the desired output.
5.1 System Design
From selecting the dataset to be worked on to data classification and link prediction, many
steps are involved in like clustering data, classification of data, data extraction etc. These

32
steps are performed in a proper order as mentioned in the below figure of System
architecture.
Fig 5.1 System Architecture [9]
 Initially the dataset must be selected to retrieve the data so that classification can be
performed.
 In user interface model data is retrieved from the dataset and represented in a readable
format. On this data pattern recognition and analysis can be performed.
 In clustering and classification phase dissimilar data is differentiated from similar
data. In our project this step can be omitted because the dataset is already in the form
of a CSV file, divided according to the respective link types.
 Knowledge base contains the rules for the construction of the fuzzy system. The
system checks each input and acts according to the weights and specified attributes of
the nodes.
 Finally the decision manager contains the logic to make decision based on the rules
mentioned in the knowledge base. It is the heart of fuzzy logic. It decides the output
and passes it to the response which displays the final result.

33
5.2 Backpropagation
5.2.1 Feed Forward Neural Networks
FFNet consists of a number of nodes that are linked to each other with edges of different
types (in a heterogeneous network), where each edge carries some weight. All the nodes are
connected to every other node in the layers that precede and succeed them. Input values are
given at the input layer which propagates the value to further layers. Input and output for the
input layer is the same. The final layer is the output layer that gives the predicted value. All
the layers apart from input and output layers are called hidden layers. The outputs from the
hidden layer are propagated as inputs to other layers and finally to the output layer.
5.2.1.1 Learning Phase
The number of input and output units depends on the attributes of the nodes and the number
of categories we want to classify the data. First we must create the adjacency matrix for all
the nodes and edge types to store the strength value. From this matrix, we can finalize the
number input units as the number of attributes of the node and output as the different types of
categories.
Fig 5.2 Feed Forward Neural Network structure [10]

34
The outputs of the network are compared with the actual outputs and if the output values are
large then the categorization is correct. On this comparison the weights are modified so that
when same types of inputs are presented the outcome value is higher than the current value.
5.2.2 Neuro Fuzzy Link Based Classifier
Neuro fuzzy link based classifier works based on the neuro fuzzy rules to classify the edges.
These rules are based on backpropagation technique. In our problem we first adjust the
weights by normalizing the edge strengths of all the 5 types followed by implementing
triangular fuzzy membership function that looks for all the possible triangular formations in
the network. Based on the triangle perimeter we calculate the threshold value and base our
final predictions.
5.3 Reason for Using the Algorithm
Some of the main reasons for using the algorithm are [8]-
1. Neural Networks (NN) have high tolerance of noisy data.
2. NN can be used to work on datasets on which they have not been trained.
3. The can be used when we have limited idea about the relationship between the
attributes of the nodes type of edges present in the dataset.
4. NN algorithms are inherently parallel; parallelization techniques can be used to speed
up the computation process [8].
5. NN algorithms have been successful in handwritten character recognition, medical
purposes and training a machine to pronounce English text.
5.4 Steps Involved in Proposed Algorithm
1. Initially all the values of the dataset are in a CSV file, with each type of edge having a
different file. The values are represented as shown in Fig 2.1. Before starting the
implementation, 10% of the total edges are removed from each file and store in another file
and the experiment is performed on the remaining 90% of the values. Final testing will be
performed on these eliminated 10% values to check the accuracy to the algorithm

35
2. We create an n x n 2D matrix to store the strengths of the edges between the nodes,
where n is the total number of nodes in the network which is 15088 in this case. The strengths
of the 1st edge types (contact network) are stored in the matrix. No normalization of values is
needed as all the strength values are 1 in the 1st type.
3. If no edge exists between the nodes then the link strength is 0.
4. The 2nd file is read which has strengths of the 2nd edge type (number of shared friends
between two users). The strength values are not of the same value. Normalization is
performed on each strength value and the normalized value is added to the strength values of
the 1st type.
Normalization= (xi – Minimum value) ÷ (Maximum value – Minimum value)
Where,
Xi – strength of the currently read edge
Maximum value- Strength with maximum value in the current file (of a
particular edge type).
Minimum Value-- Strength with minimum value in the current file (of a
particular edge type).
5. Step 4 is repeated for the remaining 3 files.
6. Finally the n x n 2D matrix will have the normalized strength values between the nodes
of all the 5 files.
7. Once the values are normalized the next step to find out all the triangular formations in
network.

36
Fig 5.3 Example of a triangular formation in the network
For example in Fig 5.3, nodes A, B and C for a triangle. The values above the links are
the normalized values of the link types between the nodes.
8. After all the triangles are recognized the sum of their 3 sides is calculated. Fig 5.4
shows the calculated values of all the triangles.
E.g. Sum for Fig 5.3 = 0.9+0.5+0.1=1.5
Fig 5.4 Sums of three sides of all the triangles

37
9. From the obtained sum of each triangle in the network 1000 values are selected at
random and stored in an excel file and sorted in ascending order and stored in a bucket
size of n/200 where n is the total values we select (n=1000 in this case). We take this
value as threshold value (999th value).
There are many other ways to select the threshold values but we had many triangles in the
data nearly to the size of 5.6 GB.
Fig 5.5 Threshold value
10. In the next step the open triangles are identified, i.e. the ones in which 3rd side is not
closed, consider x as normalized strength of one edge and y as the normalized strength of
another edge.
If - (x+y) > Threshold value

38
Then there is a possibility that a link exists between the remaining nodes where no
link is currently present.
For taking two values in a bucket (value=2.2674) we get 43,85,214 expected links.
Fig 5.6 Number of expected links for bucket size of two
11. The links for which sum of two sides was greater than threshold are stored in
another file.
12. LRW algorithm is implemented on these expected links. The start node and end
node are specified and the random walker moves from start to end through the
neighbors connected to node in its current state. If the distance between the start and
destination is greater than n (user defined number) hops then it is assumed that no link
exists between them. Otherwise the number of hops are calculated.
13. Based on the number of hops score is calculated for each link prediction using the
formulae

39
S x y
LRW (t) = k x / 2|E| · π x y (t) + k y / 2|E| · πyx (t)
Where
K x is the number of neighbors of node x
E is the total number of edges in the entire network
π x y= πyx is the hop distance from node x to node y (we are assuming that distance
from node x to node y is the same from node y to node x).
14. The links are arranged in descending order with respect to their scores and the top
25% of them are considered (L). We then compare these links with those that we have
deleted from the original dataset in the beginning and compare how many links that
are predicted are actually present (l).
15. Finally precision is calculated using the formulae
Precision = l / L
to find out the accuracy of the algorithm implemented. Fig 5.7 shows the
calculated precision for all the hop counts that we experimented with
5.5 Pseudo Codes
5.5.1 Adjacency Matrix
1. Read the data in file1 ;
2. Store the values as node1,node2 ,normalizedstrength;
3. Adjacency[node1-1][node2-1]= Adjacency[node1-1][node2-1]+normalizedstrength;
For creating testing and training datasets
1. Calculate the no of edges in each file.
2. int c= no of edges/10,count=0;
3. While(count!=c){
int a= (Math.Random)*15088;
int b= (Math.Random)*15088;
If there is an edge between a and b
Make adjacency [a][b]=0;
Increment count;
Write (a+1) +”t+( b+1) in a file.

40
}
4. For(int i=0;i<l5088;i++)
for(int j=0;j<15088;j++)
if( adjacency[i][j]!=0)
write( (i+1)+”t”+ (j+1)) in a file
5. End
5.5.2 Closed Triangles and Strength
1. for(int i=0;i<l5088;i++)
for(int j=0;j<15088;j++)
for(int k=0;k<15088;k++)
if( adjacency[i][j]!=0) && adjacency[j][k]!=0) && adjacency[i][k]!=0)
{
write (i+1)+”t”+ (j+1)+”t”+(k+1)+”t+” adj[i][j]+adj[j][k]+adj[i][k] in a new file
} } }
2. Now read the file and chose 1000 random triangles and save them in other file
3. The 1000 random triangles are sorted and are taken different bucket sizes
such as 2 in a bucket size
4. Store the998th triangle strength as threshold value.
5.5.3 Open Triangles
1. for ( int i=0;i<15088;i++
for(int j=0;j<15088;j++)
for(int k=0;k<15088;k++)
if there is an edge between (i,j) and (j,k) but no edge between (I,k)
if strength of (i,j)+(j,k) exceeds the threshold value
Adjacency[i][k]=1000;
2. for ( int i=0;i<15088;i++
for(int j=0;j<15088;j++)
if the strength between I and j =1000;

41
write (i+1) and (j+1) in a file
5.5.4 Applying Random walk algorithm and calculating score
1. Read the open triangle file
2. Store the node values into a and b
3. Start the random walker from node till it reaches to b or the hop count
value reach to 16.
4. If the hop count < 15 then store the value of and b.
5. Calculate the no of neighbors in a and b
6. Using the formula of local random walk predict the score of expected
edges.
7. Sort the edges using the scores
8. Select the top 25% of the edges
9. Calculate how many of these 25% edges are available in deleted dataset
10. Calculate the precision value.
5.6 Result
We conducted the experiment on the 43,85,214 expected links by taking the threshold value
as 2.2674, the 999th value of the randomly selected 1000 values. The calculated precision is
highest for hop count 50 hops which is 2.336% for predicting 61 correct links from the
expected 2611 links while hop count of 15 has least precision of 1.77% for predicting 12
correct links from the expected 675 links.

42
Fig 5.7 Expected links with hop counts

43
Fig 5.8 Local random walk score calculated for expected links
Fig 5.9 Calculated Precision

44
Fig 5.10 Precision vs Hop count graph
Fig 5.11 Hop count vs estimated links graph
5.7 Failed Approaches
In our approach to get the optimum results we have tried various combinations to get the
desired results. Some of the approaches we tried did not yield desired results and in this
section will be explaining some for these.

45
5.7.1 3D matrix
Our initial idea was to read the all the five CSV file of dataset into a 3D matrix and then do
the link prediction computations. But due to the excessive nodes (15088 x 15088 x 5) the
JVM could not find sufficient heap size to allocate memory. Therefore we decided to use a
2D array read each CSV file at a time, calculate the normalized value and then read the 2nd
file and so on.
5.7.2 Calculating Threshold value
 After finding out all the triangular node possibilities in the network, a threshold value
was to be calculated in order to find the possible missing links. Initially after
calculating the perimeter of the triangles we took the average of the perimeter and
multiplied it with 2/3 as the threshold. As a result we got a 31GB file that could
neither be opened using normal text editor nor copied.
Note- We multiplied with 2/3 because in the open triangle 2 links are already formed out of 3.
 In our 2nd attempt we randomly selected 1000 values and sorted them in ascending
order and took the values in buckets for 50. We selected the 950th value as the
threshold. As a result we got a more than 6 Crore expected links as shown in Fig 5.7.
 In the 3rd attempt we directly took the 975th and got 5 Crore expected nodes.
Fig 5.12 Edges obtained for bucket size 50

47
Chapter 6
Project Schedule and Conclusion
6.1 Project schedule
Our project was planned in two phases. The first phase includes literature survey to know the
complete information involved in link prediction, the various algorithms used and their draw
backs. Based on our understanding we then formulated a problem to work on and select the
dataset. In the second phase we stored the extracted values from the dataset in a 2D array
with reference to their strengths of edges, includes developing an algorithm, collecting results
and testing.
Phase1: Literature survey
Problem formulation
Dataset collection
Phase2: Developing algorithm
Implementation of the algorithm
Testing.

48
Fig 6.1 Project Phases
Table 6.1 Project Schedule
Date Activity
Meetings with
Advisor
Jan 6th
Meeting with advisor to discuss project schedule
and details
Meeting with advisor
Jan 9th -16
Research articles gathering and focusing on our
topic
Jan 17th-23rd
Literature survey of Article on link prediction in
heterogeneous networks and unsupervised link
prediction algorithms like Common neighbor,
Jaccard coefficient, Adamic/Adar
Jan 20th Meeting with
advisor
Explained in detail
about these methods
by advisor.
Jan 25th-31st
Literature Survey on
1. Paper on Multi-Relational link prediction in
Heterogeneous networks.
2. Paper on Link prediction in heterogeneous
Networks based on tensor factorization.
Jan 30th meeting with
advisor
literature survey
and problem
formulation
algorithm
implementation
link
prediction

49
Feb 2nd- 8th
Literature Survey on
1. Paper on inferring social ties across
heterogeneous networks.
2. Paper on exploiting place features in link
prediction on location based social networks.
Feb 7th meeting with
advisor
Feb 10th- 16th
1.Problem formulation and finding related datasets
2. Working on large Xml files to retrieve required
data. Due to the non-availability of different types
of edges in Co-authorship network we shifted to
YouTube Dataset
Feb 15th Meeting with
advisor
Feb 17th-24th
Worked on text files of dataset and tried to store
the values in 3d array in java. Due to heap size
problem we decided to shift to C++
Feb 22nd meeting with
advisor
Feb 26-30
Mid semester presentation week
March 1st-7th
Worked in C++ to store data in 3D vectors. March 3rd Meeting
with advisor
March 8th-15th
Faced the same memory heap problem with 3D
vectors in C++
Decided to work in 2d arrays in java with
normalized score for the strength of the edge.
March 10th suggested
to try in 2D java
matrix by storing
normalized strength
value.
March 18th-
27th
Mid-Semester break
Worked on final code for implementation.
April 1st-7th
Code implementation, collected results required at
various stages for final precision.
April 1st
Planned to work on closed triangles in the network
to find out the threshold value.

50
April 2nd
Average value turned out to be very low and this
resulted in huge file that couldn’t be opened
April 3rd
We randomized 1000 values of sum from closed
triangle list and picked values in buckets of 25, 15
and 2
April 4th We took optimal set of unlinked triangles and
performed local random walk algorithm on them
April 5th Calculated score of the predicted links and
selected the top 20% of the links
April 6th Calculated precision value for the predicted links
April 7th-15th Final Report, poster and presentation.
6.2 Future Work
The short duration of the project forced us to compromise on a couple of things that could
have yielded better precision values. Given a chance to expand this project we will
implement the following tasks that could not be included now.
6.2.1 Implement the complete Neuro Fuzzy Link BasedAlgorithm
Our initial idea was to implement the Neuro Fuzzy Link Based classifier algorithm that
incorporated FFNet and Backpropagation involving changes to edge intensity to obtain the
desired result. But due to shortage of time we decided to implement the Fuzzy Link Based
algorithm along with triangular fuzzy membership function that calculates the number of
triangles formed in the network and then calculates the threshold value based on this.
6.2.2 Local Random Walk
We implemented Local Random Walk algorithm after selecting the open triangles in the
network. Generally LRW is implemented multiple times, say 100, and the score is calculated
for those edges that occur multiple times. The run time to implement LRW is generally more
than 2 hours and due to time constraint we could implement LRW only once with a specified
hop count and calculated the score for the suggested links. This was one reason for the low
precision. Running the LRW multiple times and then calculating precision would have
yielded much better precision.

51
6.3 Conclusion
Data mining is a vast and upcoming area of research and link prediction is only a part of it.
Our area of interest lie in parallel with this field of study and some of us even opted to pursue
higher studies and specialize in Data Mining and Data Warehousing which motivated us to
take up this project. Though the results were satisfactory they were not as good as expected.
Keeping in mind the fact that this was our first experience in data mining and machine
learning field and the short duration of the project time period, we tried our best to go through
all the methods that are used in link prediction problem and implement our algorithm
successfully.
References:
[1] Heterogeneous Graph- http://www.mdpi.com/1424-8220/15/10/24735/htm
[2] YouTube Dataset download-
http://socialcomputing.asu.edu/datasets/YouTube
[3] Multi-Relational Link Prediction in Heterogeneous Information Networks
Darcy Davis, Ryan Lichtenwalter, Nitesh V. Chawla, Department of Computer Science
and Engineering
University of Notre Dame,Notre Dame, IN, 46556 US

52
https://www3.nd.edu/~dial/papers/ASONAM11b.pdf
[4] Link Prediction in Heterogeneous Networks Based on Tensor Factorization
The Open Cybernetics & Systemics Journal, 2014, 8, 316-321
http://benthamopen.com/contents/pdf/TOCSJ/TOCSJ-8-316.pdf
[5] Link Prediction in Heterogeneous Networks: Influence and Time Matters
http://hanj.cs.illinois.edu/pdf/icdm12_yyang.pdf
[6] Inferring Social Ties across Heterogenous Networks
https://www.cs.cornell.edu/home/kleinber/wsdm12-links.pdf
[7] Co-Author Relationship Prediction in Heterogeneous Bibliographic Networks
http://www.ccs.neu.edu/home/yzsun/papers/asonam11_pathpredict.pdf
[8] Data Mining Concepts and Techniques- 3rd edition
Jiawei Han, Micheline Kamber, Jian Pei.
[9] NEURO FUZZY LINK BASED CLASSIFIER FOR THE ANALYSIS OF BEHAVIOR
MODELS IN SOCIAL NETWORKS - Journal of Computer Science 10 (4): 578-584, 2014
Indira Priya Ponnuvel, Ghosh Dalim Kumar, Kannan Arputharaj and Ganapathy Sannasi
http://thescipub.com/PDF/jcssp.2014.578.584.pdf
[10] FFnet diagram
http://www.fon.hum.uva.nl/praat/manual/Feedforward_neural_networks_1__What_is
_a_feedforward_ne.html

Final Report

Recommended

Recommended

More Related Content

What's hot

What's hot (19)

Similar to Final Report

Similar to Final Report (20)

Final Report