Distributed Link Prediction in Large Scale Graphs
using Apache Spark
Anastasios Theodosiou
Aristotle University of Thessaloniki, Thessaloniki 54621, GREECE
anastasios.theodosiou@gmail.com
Abstract. Social networks like Facebook, Instagram, Twitter, and LinkedIn have
become an integral part of our everyday life. Through these, users can share dig-
ital content (links, photos, videos), express or share their opinions, and expand
their social circle by making new friends. All these user interactions drive the evolution and growth of these networks over time. A typical application of link prediction lies in the services these networks offer their users: an essential one is suggesting new friendships based on a user's existing network and on the preferences revealed by their interactions with it. Link prediction techniques attempt to estimate the likelihood of a future connection between two nodes of a given network. Beyond social networks, link prediction has a broad scope, with applications in e-commerce, genetics, and security. Due to the massive amounts of data collected today, scalable approaches to this problem are needed. The purpose of this diploma thesis is to experiment with various machine learning techniques, both supervised and unsupervised, to predict links in a network of academic papers, using document similarity metrics based on node attributes as well as structural features of the network. The experiments and the application were implemented with Apache Spark, using the Scala programming language, to handle the large data volume.
Keywords: Link prediction, Data mining, Machine Learning, Apache Spark,
Graphs, Online Social Networks, Recommender Systems
1 The link prediction problem
To better understand what the link prediction problem is, consider a brief example. Suppose there is a network whose nodes represent individuals and whose links represent relationships or interactions between them. Given such a network, how can we predict its evolution in the future? In other words, how can we predict the creation of new edges, or the deletion of existing ones? By studying the evolution of social networks over time, we can understand how nodes interact with one another. Carrying out such a study requires many snapshots of the network structure over time, so the volume of data that we need to collect and process grows rapidly; finding scalable approaches for its parallel processing therefore becomes necessary. Other real-world examples of link prediction include suggesting friends and followers on a social networking site, recommending relevant products to customers, or proposing collaborations to professionals based on their field of study or their interests. In summary, the link prediction problem is that of predicting the probability of a future edge between two nodes.
1.1 Social networks and the difficulty of link prediction.
A (social) network can be represented by a graph G(V, E), where V is the set of its vertices and E the set of its edges. The number of possible connections in such a network is |V|·(|V|−1)/2. The network examined in this work consists of 27,770 nodes; if we wanted to compute all possible edges and suggest new ones based on some document similarity metric (e.g., Jaccard similarity), we would have to check 385,572,565 candidate edges. This number is quite large even for a relatively small network such as ours. Moreover, social networks are sparse, so picking an edge at random and testing whether it should exist is ineffective. Because the number of possible connections is so large, alternative, more efficient approaches to predicting them are needed. Major social platforms such as Facebook, Twitter, Instagram, and LinkedIn offer, as one of their primary services, proposals of new links in the form of new "social friendships." High accuracy in such predictions can help us understand which factors drive the evolution of these networks and can provide more accurate and meaningful suggestions. The social network studied in this work is a network of academic papers in which each paper cites other papers. A classic method for proposing collaboration on such a network is through the bibliography system; however, a proposal based on this system alone may not be entirely accurate. We need to extend and enrich this method with more data or new techniques, for example methods based on the content and structure of the documents, in order to achieve greater accuracy in our recommendations.
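As a quick check of the candidate-edge count above, the number of possible edges follows directly from |V|·(|V|−1)/2. A minimal sketch in plain Python (not the Spark/Scala implementation used in the thesis):

```python
def possible_edges(n: int) -> int:
    """Number of undirected node pairs in a graph with n vertices: n*(n-1)/2."""
    return n * (n - 1) // 2

# For the 27,770-node citation network studied here:
print(possible_edges(27_770))  # 385572565
```

This is why exhaustively scoring every pair does not scale, even for a modestly sized network.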
2 Graphs and social networks.
Graphs provide a natural way to deal with abstract concepts such as relationships and interactions in a network, and they offer an intuitive, visual way of thinking about these concepts. They are also a natural basis for analyzing relationships in a social context. Over time, graphs have become increasingly common in data science: graph databases are now widespread computing tools and alternatives to SQL and NoSQL databases. Concepts from graph theory are used to study and model social networks, fraud patterns, energy consumption patterns, influence in a social network, and many other application areas. Social Network Analysis (SNA) is probably the best-known application of graph theory in data science. Graphs are also used in clustering algorithms (e.g., K-Means). There are therefore many reasons to use graphs, and many fields of application. From the computer science perspective, graphs can offer computational efficiency: the big-O complexity of some algorithms is better when the data is arranged in graph form than when it is stored as tabular data.
3 Link prediction and locality sensitive hashing.
The problem of finding identical or duplicate documents based on a similarity metric seems relatively straightforward: using a hash function, the work can be completed very quickly. However, the problem becomes more complicated if we want to find similar documents that contain spelling mistakes or even different words. The brute-force technique can find such documents and predict links with high accuracy, but it does not scale. The LSH algorithm, or Locality Sensitive Hashing, is a technique that can be applied to the same problem and yields approximate results in far less time than brute force. In our problem, LSH can suggest an edge between two nodes if the similarity of the corresponding documents is above a given threshold. More generally, LSH relies on a family of hash functions, known as an LSH family, that hashes the data into buckets so that documents with high similarity land in the same bucket. The general idea is to find an algorithm that, given two document signatures, can tell us whether the corresponding nodes form a candidate pair, i.e., whether their similarity is likely above the given threshold. Combined with MinHashing, there are two necessary steps: first, we hash the columns of the signature matrix with several hash functions; then we check whether two documents were hashed into the same bucket by at least one of those functions. If so, we accept the two documents as a candidate pair. For link prediction, if the Jaccard similarity of two documents is above the given threshold, we conclude that there is a potential edge between them.
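To make the two steps concrete, here is a small, self-contained Python sketch of MinHash signatures followed by the banding step. The hash family, band count, and toy documents are illustrative assumptions, not the configuration of Spark's MinHashLSH used in the thesis:

```python
import random
import zlib
from collections import defaultdict

random.seed(42)
NUM_HASHES, BANDS = 20, 10                 # 10 bands of 2 signature rows each
PRIME = 4_294_967_311                      # a prime just above 2**32
COEFFS = [(random.randrange(1, PRIME), random.randrange(PRIME))
          for _ in range(NUM_HASHES)]

def minhash_signature(tokens):
    """One minimum per hash function over the document's token set."""
    hashed = [zlib.crc32(t.encode()) for t in tokens]
    return tuple(min((a * h + b) % PRIME for h in hashed) for a, b in COEFFS)

def candidate_pairs(docs):
    """Banding step: documents sharing a bucket in any band become a candidate pair."""
    rows = NUM_HASHES // BANDS
    sigs = {d: minhash_signature(t) for d, t in docs.items()}
    pairs = set()
    for band in range(BANDS):
        buckets = defaultdict(list)
        for d, sig in sigs.items():
            buckets[sig[band * rows:(band + 1) * rows]].append(d)
        for ids in buckets.values():
            pairs.update((a, b) for i, a in enumerate(ids) for b in ids[i + 1:])
    return pairs

docs = {
    "p1": {"link", "prediction", "social", "networks"},
    "p2": {"link", "prediction", "social", "networks"},  # duplicate of p1
    "p3": {"protein", "folding", "dynamics"},
}
# p1 and p2 have identical signatures, so they collide in every band
# and are reported as a candidate pair; p3 is very unlikely to pair.
print(candidate_pairs(docs))
```

Only candidate pairs then need their exact Jaccard similarity verified, which is what makes the approach scale compared to checking all pairs.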
4 Suggested approach and results.
The network studied in this work is composed of 27,770 nodes (papers) and 352,857 edges. Each node carries several attributes: the document id, the publication date, the title, the authors, the journal, and the abstract of the paper. A second file, the edge list containing all the edges of the network, was also used during our experiments. The aim of this work was to perform link prediction between the papers described above. An edge between two nodes exists when at least one node points to the other by referencing it through the bibliographic system. Our proposed approach does not take such citations into account, nor does it use the metrics mentioned in chapter two; instead it relies on the similarity of the records, computed from their attributes through the Jaccard similarity metric, and on other structural features of the network. Two different approaches were used: in the first, the problem was treated as binary classification, while in the second, two different unsupervised machine learning techniques were applied, namely the brute-force method and Locality Sensitive Hashing, using and configuring the MinHashLSH algorithm provided by Apache Spark.
4.1 Supervised link prediction approach.
As we have already mentioned, the problem was treated as binary classification. Therefore, different machine learning models were used, each based on a particular classifier. All models were tested on a four-core system with 8GB of RAM. In the first phase, the datasets for both nodes and edges were loaded. Once this process was completed, a join operation between the two files created our initial dataframe. This dataframe contained the ids of the two papers involved in an edge, as well as all the other attributes that characterize those nodes. After that, a tokenization procedure was performed on each column of the dataframe, converting all texts into bags of words. Next, all stop words were removed so that the Jaccard similarity would not be affected by them. At this point, the features that each classifier would consider during its training phase were calculated. They come mainly from the attributes of the nodes, but also from structural features of the network: (a) the difference in publication time between the two papers, (b) the title overlap, (c) the authors overlap, (d) the journal overlap, and (e) the abstract overlap. Furthermore, three structural features concerning the nodes were added: (f) the common neighbors of the two endpoints of each edge, (g) the sum of the triangles that each endpoint belongs to, and (h) the PageRank score of each node in the network. After that, we applied the Chi-squared test of independence, with the help of the ChiSqSelector class of Apache Spark, to determine whether there is a significant relationship between two categorical features. Based on this test and other experiments, it was decided not to use the PageRank feature, as its contribution to the final accuracy and F1 was found to be almost zero. Finally, the data were divided into two parts: 70% for the training phase and the remaining 30% for the test phase.
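The attribute overlaps and the common-neighbors feature described above can be sketched in plain Python (the thesis computed them over Spark dataframes in Scala; the function names and toy data here are illustrative):

```python
def overlap(a: set, b: set) -> float:
    """Jaccard overlap between two bags of words (title, authors, journal, abstract)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def common_neighbors(adj: dict, u: str, v: str) -> int:
    """Structural feature: number of neighbors shared by an edge's two endpoints."""
    return len(adj.get(u, set()) & adj.get(v, set()))

title_u = {"link", "prediction", "graphs"}
title_v = {"link", "prediction", "spark"}
print(overlap(title_u, title_v))        # 2 shared / 4 total = 0.5

adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
print(common_neighbors(adj, "a", "b"))  # both neighbor "c" -> 1
```

Each candidate edge is thus turned into a numeric feature vector, which is what the classifiers below are trained on.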
Naïve Bayes Classifier.
The first classifier used was Naïve Bayes. Several tests were performed to select the threshold of this algorithm; for this data set and the selected features, Naïve Bayes gave the best results with a threshold value of 0.5 (50%). Table 1 describes the results of this particular algorithm.
Table 1. Results from Naive Bayes classifier
Dataset Split Accuracy F1 Exec. Time (sec)
70/30 0.58614 0.58876 1090.06
Logistic Regression Classifier.
Since the results of Naïve Bayes were not very good, other classifiers were tested. One of these was the Logistic Regression classifier. As with the previous model, different feature sets were tried here too. Specifically, the first test contained only the features derived from the nodes' attributes, and tests were also performed for different numbers of iterations. This first test used only the following features: (a) the difference in publication time, (b) the overlap of titles (Jaccard similarity), (c) the overlap of authors, (d) the overlap of the journal, and (e) the overlap of the abstract. The results of this algorithm for these features are shown in Table 2.
Table 2. Logistic regression results with attributes based on node
Max. Iterations Accuracy F1 Exec. Time (sec)
10 0.79890 0.79812 1694.76
100 0.79890 0.79947 1654.78
1000 0.79713 0.79957 1628.16
10000 0.79723 0.79778 1807.94
Although this model achieves higher accuracy and F1 scores than the previous one, the next test showed that, with the addition of the structural features, the algorithm achieves even better results, as shown in Table 3.
Table 3. Logistic regression results with node's and structural features
Max. Iterations Accuracy F1 Exec. Time (sec)
10 0.93518 0.93559 959.72
100 0.93561 0.93600 1002.28
We note that adding the structural features significantly increased the accuracy of the algorithm and reduced its overall execution time. The next model we experimented with was the Linear SVM.
Linear SVM Classifier.
This model was tested, like the previous ones, on the same set of features. Experiments were also performed for different values of the maximum number of iterations, as well as for the RegParam (regularization) parameter. The test results are shown in Table 4.
Table 4. Linear SVM results
Max Iterations RegParam Accuracy F1 Exec. Time
10 0.1 0.85967 0.85967 934.15
100 0.1 0.88044 0.88152 1124.26
10 0.3 0.84362 0.84355 893.23
100 0.3 0.85683 0.85821 1313.11
The tests showed that the Linear SVM algorithm works best with the MaxIterations parameter at 100 and RegParam equal to 0.1. Next in the series of models was the Multilayer Perceptron classifier.
Multilayer Perceptron Classifier.
This classifier is based on neural networks. Many experiments were performed here, varying both the maximum number of iterations of the algorithm and the number of layers. Other parameters were also tested, but these two affected the result more than any other. The results are presented in Table 5.
Table 5. MLPC results based on the number of iterations and layers
Max Iterations Layers Accuracy F1 Exec. Time
100 13,10,7,2 0.87953 0.87951 1007.67
200 13,10,7,2 0.94770 0.94776 1106.78
400 13,7,4,2 0.95187 0.95205 1347.12
The best results for our data set and chosen features were obtained with the maximum number of iterations at 400 and the layers 13,7,4,2. Next in the series of models we tested was the Decision Tree.
Decision Tree Classifier.
It was observed that the parameter with the most significant effect on the classifier's behavior was the maximum depth of the tree. The results for this parameter are shown in Table 6.
Table 6. Decision tree classifier results
Max Depth Accuracy F1 Exec. Time (sec)
4 0.95116 0.95129 1302.87
8 0.95300 0.95314 1308.23
16 0.94262 0.94279 1177.16
30 0.92497 0.92494 1342.28
As the table above shows, this model achieves even higher accuracy and F1 than all previous classifiers. The best values for this model came with a maximum depth of 8. Finally, the Random Forest algorithm was used in our experiments.
Random Forest Classifier.
The sixth and last classifier we used for the link prediction problem was the Random Forest algorithm. This algorithm is an ensemble of decision trees; one of its advantages is that combining multiple trees helps avoid overfitting. We experimented with two basic parameters: the maximum depth of the trees and the total number of trees. The results of this test are described in Table 7.
Table 7. Random forest classifier results
Max Depth Num. Trees Accuracy F1 Exec. Time (sec)
4 10 0.95066 0.95077 1314.01
8 10 0.95580 0.95591 1191.91
4 100 0.95058 0.95068 1262.46
8 100 0.95527 0.95538 1230.55
The Random Forest model achieves the best accuracy and F1 of all the models mentioned above. The best values for accuracy and F1 are obtained with 10 trees and a maximum depth of 8 per tree.
Model comparison.
Summarizing the above results, the different classifiers were compared in terms of accuracy and F1. Figure 1 shows the accuracy and F1 per model.
Figure 1. Comparison of the classifiers
As far as execution time is concerned, the shortest belongs to the Logistic Regression model, with a total completion time of 1002.28 seconds, while the Random Forest model requires 189.69 seconds longer. The difference in the run times of the six classifiers is shown in Figure 2.
Figure 2. The execution time of six classifiers
4.2 Unsupervised link prediction approach.
From the perspective of unsupervised machine learning, the problem of link prediction was addressed with two different but widely used techniques. The approach we propose differs from most related work on this problem in how it is handled. Nearly the same data preprocessing as in the previous chapter was performed. The main difference is that a bag of words was created for, and associated with, each node: for each paper, each column was tokenized into words, and all the dataframe columns were then concatenated into one. In the next phase, all stop words were removed so that the Jaccard similarity would not be affected. Once our data had been prepared, we proceeded to predict new links with two different techniques. The first technique tested was brute force, and the second was the Locality Sensitive Hashing (LSH) algorithm in combination with MinHashing. It is worth noting that these experiments were carried out on a cluster of 80 cores; all tests used at most 64 cores and 32GB of RAM. Due to hardware constraints, and more specifically random access memory limits, the experiments were performed on a subset of the original data set.
Brute force prediction.
Initially, a join operation was performed on the data in order to create all the possible edges that may occur. This process is relatively slow, but it needs to take place only once, after which the result can be reused. The Jaccard similarity was then calculated for all candidate edges; the maximum Jaccard similarity found was 0.4973. After this process was completed, we set a threshold on the Jaccard similarity, so that edges with a similarity greater than or equal to it were selected. The run time of the algorithm grew geometrically as the number of nodes in the data set increased. In Table 8, we present cumulative results: the accuracy, the number of candidate pairs, the total number of checks performed, and the algorithm's execution time.
Table 8. Aggregative results of brute force algorithm execution
Nodes Checks # Candidates Accuracy Exec. Time (sec)
1000 499500 3916 0.9368 62.89
2000 1999000 14055 0.9662 161.04
5000 12497500 74302 0.9711 566.73
7000 24496500 106534 0.9789 1446.98
Figure 3 shows the change in algorithm accuracy as the number of nodes in the network
grows.
Figure 3. The increase of accuracy based on the dataset volume
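The brute-force step can be sketched in plain Python as follows; the real implementation was a Spark join over all node pairs, and the threshold and toy bags of words here are illustrative:

```python
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two bags of words."""
    return len(a & b) / len(a | b) if a | b else 0.0

def brute_force_links(bags: dict, threshold: float) -> list:
    """Check every possible pair (n*(n-1)/2 checks) and keep pairs whose
    Jaccard similarity is at or above the threshold."""
    return [(u, v, jaccard(bags[u], bags[v]))
            for u, v in combinations(bags, 2)
            if jaccard(bags[u], bags[v]) >= threshold]

bags = {
    "p1": {"link", "prediction", "spark"},
    "p2": {"link", "prediction", "scala"},
    "p3": {"protein", "folding"},
}
print(brute_force_links(bags, 0.3))  # only (p1, p2) passes: Jaccard = 2/4 = 0.5
```

The quadratic number of checks in `combinations` is exactly the cost that LSH avoids in the next section.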
This technique is generally fairly accurate, but it is extremely time-consuming and entirely dependent on system resources. For this reason, techniques that can produce results within a reasonable time are needed; MinHashing and the Locality Sensitive Hashing algorithm address exactly this problem.
MinHashLSH prediction.
The basic idea behind this algorithm is that it uses MinHashing in conjunction with LSH so that documents with a high similarity index are hashed into the same bucket, while those with a low index land in different ones. The overall workflow is the same as for the brute-force algorithm, except that here we join the data based on the Jaccard distance rather than the Jaccard index. So if we want to require 60% Jaccard similarity between documents, we should set a Jaccard distance threshold equal to 1 − 0.6 = 0.4, i.e., the two documents should be at most 40% apart. Table 9 lists some results from the various experiments performed with this method, using a Jaccard distance of 0.8, as this value provided the most accurate results.
Table 9. MinHashLSH scores relative to the number of hash tables
Hash Tables Candidates Precision Recall Accuracy F1 Time (sec)
2 986 0.7261 0.0120 0.97133 0.02370 26.19
4 1610 0.7304 0.0385 0.97147 0.03853 31.12
8 3026 0.6265 0.0607 0.97147 0.06072 70.20
16 3628 0.5975 0.0364 0.97148 0.06877 111.48
32 3824 0.5983 0.0385 0.97147 0.07246 211.95
64 3840 0.5968 0.0385 0.97149 0.07246 514.83
128 3840 0.5968 0.0385 0.97151 0.07251 1344.72
We notice that as the number of hash tables grows, the accuracy of the results increases, while the algorithm's execution time grows almost linearly with the number of hash tables. Figure 4 illustrates the change in accuracy relative to the number of hash tables.
Figure 4. The accuracy of MinHashLSH vs. hash tables
Regarding the evaluation of the unsupervised techniques, since we did not have a classifier or regressor, a function computing TP, FP, TN, and FN was implemented by comparing the results of the algorithms with the original graph of the network. The precision and recall metrics were then calculated, and from these we derived the accuracy and F1. Many more experiments and tests were carried out; they are available in the full version of the diploma thesis.
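The evaluation function described above can be sketched like this in plain Python (variable names and the toy graph are illustrative, not the thesis implementation):

```python
def evaluate(predicted: set, actual: set, all_pairs: set) -> dict:
    """Compare predicted edges against the original graph to obtain TP/FP/TN/FN,
    then derive precision, recall, accuracy, and F1 from those counts."""
    tp = len(predicted & actual)                 # predicted and truly present
    fp = len(predicted - actual)                 # predicted but absent
    fn = len(actual - predicted)                 # present but missed
    tn = len(all_pairs) - tp - fp - fn           # correctly left out
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(all_pairs)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}

all_pairs = {("a","b"), ("a","c"), ("a","d"), ("b","c"), ("b","d"), ("c","d")}
actual = {("a","b"), ("b","c"), ("c","d")}        # edges in the original graph
predicted = {("a","b"), ("b","c"), ("a","d")}     # edges the algorithm suggested
print(evaluate(predicted, actual, all_pairs))
```

Note that with a sparse graph, accuracy is dominated by true negatives, which is why precision, recall, and F1 are reported separately in Table 9.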
5 Conclusion.
The problem of link prediction has a wide range of applications in different areas. In this diploma thesis, we studied both supervised and unsupervised machine learning techniques. After many experiments and trials, and given the data at our disposal and the way we chose to address this problem, we concluded that among the supervised techniques the model based on the Random Forest classifier is the best solution. On the unsupervised machine learning side, the MinHashLSH method was chosen, as it is much faster, produces quite good results, and comes very close to the accuracy of the brute-force technique. However, it requires attention, as it generates many false positives.
6 Future work.
As future work, we will address the problem of link prediction from a different viewpoint. We will re-examine the same network, but this time with a technique based on clustering. This approach groups similar nodes into clusters, expecting nodes from the same cluster to exhibit a similar connectivity pattern. In more detail, with this method we will initially set a threshold θ and then remove all the edges of the graph whose weight is below it. Each connected component of the resulting graph will then correspond to a cluster; in general, two nodes are in the same connected component if and only if there is a path between them. On the supervised machine learning side, we will try techniques based purely on neural networks, with more complex data preprocessing, and we hope to achieve even better results in less execution time.
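The clustering idea sketched above (prune edges below θ, then treat each connected component as a cluster) can be illustrated in a few lines of Python; the threshold and toy edge weights are illustrative assumptions:

```python
def clusters(weighted_edges: list, theta: float) -> list:
    """Drop edges with weight < theta; each connected component of the
    remaining graph is a cluster (union-find over the surviving edges)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for u, v, w in weighted_edges:
        find(u); find(v)                    # register every node
        if w >= theta:
            parent[find(u)] = find(v)       # union only the strong edges
    comps = {}
    for node in parent:
        comps.setdefault(find(node), set()).add(node)
    return sorted(map(sorted, comps.values()))

edges = [("a", "b", 0.9), ("b", "c", 0.8), ("c", "d", 0.1), ("d", "e", 0.7)]
print(clusters(edges, 0.5))  # [['a', 'b', 'c'], ['d', 'e']]
```

Candidate links would then be proposed only within a cluster, which again avoids scoring all |V|·(|V|−1)/2 pairs.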

Silvia Puglisi
 
Running head DEPRESSION PREDICTION DRAFT1DEPRESSION PREDICTI.docx
Running head DEPRESSION PREDICTION DRAFT1DEPRESSION PREDICTI.docxRunning head DEPRESSION PREDICTION DRAFT1DEPRESSION PREDICTI.docx
Running head DEPRESSION PREDICTION DRAFT1DEPRESSION PREDICTI.docx
healdkathaleen
 
Cluster Based Web Search Using Support Vector Machine
Cluster Based Web Search Using Support Vector MachineCluster Based Web Search Using Support Vector Machine
Cluster Based Web Search Using Support Vector Machine
CSCJournals
 
Data Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZoneData Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZone
Doug Needham
 
OUTCOME ANALYSIS IN ACADEMIC INSTITUTIONS USING NEO4J
OUTCOME ANALYSIS IN ACADEMIC INSTITUTIONS USING NEO4JOUTCOME ANALYSIS IN ACADEMIC INSTITUTIONS USING NEO4J
OUTCOME ANALYSIS IN ACADEMIC INSTITUTIONS USING NEO4J
ijcsity
 
PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...
PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...
PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...
Journal For Research
 
Fuzzy AndANN Based Mining Approach Testing For Social Network Analysis
Fuzzy AndANN Based Mining Approach Testing For Social Network AnalysisFuzzy AndANN Based Mining Approach Testing For Social Network Analysis
Fuzzy AndANN Based Mining Approach Testing For Social Network Analysis
IJERA Editor
 
G5234552
G5234552G5234552
G5234552
IOSR-JEN
 
Graph-based Analysis and Opinion Mining in Social Network
Graph-based Analysis and Opinion Mining in Social NetworkGraph-based Analysis and Opinion Mining in Social Network
Graph-based Analysis and Opinion Mining in Social Network
Khan Mostafa
 
How To Make Linked Data More than Data
How To Make Linked Data More than DataHow To Make Linked Data More than Data
How To Make Linked Data More than Data
Artificial Intelligence Institute at UofSC
 

Similar to Distributed Link Prediction in Large Scale Graphs using Apache Spark (20)

A Survey On Link Prediction In Social Networks
A Survey On Link Prediction In Social NetworksA Survey On Link Prediction In Social Networks
A Survey On Link Prediction In Social Networks
 
Poster Abstracts
Poster AbstractsPoster Abstracts
Poster Abstracts
 
The Linked Data Advantage
The Linked Data AdvantageThe Linked Data Advantage
The Linked Data Advantage
 
Social Network Analysis Introduction including Data Structure Graph overview.
Social Network Analysis Introduction including Data Structure Graph overview. Social Network Analysis Introduction including Data Structure Graph overview.
Social Network Analysis Introduction including Data Structure Graph overview.
 
Sub-Graph Finding Information over Nebula Networks
Sub-Graph Finding Information over Nebula NetworksSub-Graph Finding Information over Nebula Networks
Sub-Graph Finding Information over Nebula Networks
 
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKINGINTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
 
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKINGINTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
 
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED  ON SEMANTIC TAG RANKINGINTELLIGENT SOCIAL NETWORKS MODEL BASED  ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
 
Organizational Overlap on Social Networks and its Applications
Organizational Overlap on Social Networks and its ApplicationsOrganizational Overlap on Social Networks and its Applications
Organizational Overlap on Social Networks and its Applications
 
Annotating Search Results from Web Databases
Annotating Search Results from Web Databases Annotating Search Results from Web Databases
Annotating Search Results from Web Databases
 
Searching for patterns in crowdsourced information
Searching for patterns in crowdsourced informationSearching for patterns in crowdsourced information
Searching for patterns in crowdsourced information
 
Running head DEPRESSION PREDICTION DRAFT1DEPRESSION PREDICTI.docx
Running head DEPRESSION PREDICTION DRAFT1DEPRESSION PREDICTI.docxRunning head DEPRESSION PREDICTION DRAFT1DEPRESSION PREDICTI.docx
Running head DEPRESSION PREDICTION DRAFT1DEPRESSION PREDICTI.docx
 
Cluster Based Web Search Using Support Vector Machine
Cluster Based Web Search Using Support Vector MachineCluster Based Web Search Using Support Vector Machine
Cluster Based Web Search Using Support Vector Machine
 
Data Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZoneData Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZone
 
OUTCOME ANALYSIS IN ACADEMIC INSTITUTIONS USING NEO4J
OUTCOME ANALYSIS IN ACADEMIC INSTITUTIONS USING NEO4JOUTCOME ANALYSIS IN ACADEMIC INSTITUTIONS USING NEO4J
OUTCOME ANALYSIS IN ACADEMIC INSTITUTIONS USING NEO4J
 
PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...
PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...
PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...
 
Fuzzy AndANN Based Mining Approach Testing For Social Network Analysis
Fuzzy AndANN Based Mining Approach Testing For Social Network AnalysisFuzzy AndANN Based Mining Approach Testing For Social Network Analysis
Fuzzy AndANN Based Mining Approach Testing For Social Network Analysis
 
G5234552
G5234552G5234552
G5234552
 
Graph-based Analysis and Opinion Mining in Social Network
Graph-based Analysis and Opinion Mining in Social NetworkGraph-based Analysis and Opinion Mining in Social Network
Graph-based Analysis and Opinion Mining in Social Network
 
How To Make Linked Data More than Data
How To Make Linked Data More than DataHow To Make Linked Data More than Data
How To Make Linked Data More than Data
 

Recently uploaded

GreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-JurisicGreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-Jurisic
Green Software Development
 
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
kalichargn70th171
 
socradar-q1-2024-aviation-industry-report.pdf
socradar-q1-2024-aviation-industry-report.pdfsocradar-q1-2024-aviation-industry-report.pdf
socradar-q1-2024-aviation-industry-report.pdf
SOCRadar
 
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit ParisNeo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j
 
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
Alina Yurenko
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
Fermin Galan
 
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, FactsALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
Green Software Development
 
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket ManagementUtilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate
 
Microservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we workMicroservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we work
Sven Peters
 
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdfAutomated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
timtebeek1
 
openEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain SecurityopenEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain Security
Shane Coughlan
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
Neo4j
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
rickgrimesss22
 
OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024
OpenMetadata
 
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
mz5nrf0n
 
Oracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptxOracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptx
Remote DBA Services
 
Fundamentals of Programming and Language Processors
Fundamentals of Programming and Language ProcessorsFundamentals of Programming and Language Processors
Fundamentals of Programming and Language Processors
Rakesh Kumar R
 
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissancesAtelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Neo4j
 
Empowering Growth with Best Software Development Company in Noida - Deuglo
Empowering Growth with Best Software  Development Company in Noida - DeugloEmpowering Growth with Best Software  Development Company in Noida - Deuglo
Empowering Growth with Best Software Development Company in Noida - Deuglo
Deuglo Infosystem Pvt Ltd
 
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
Łukasz Chruściel
 

Recently uploaded (20)

GreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-JurisicGreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-Jurisic
 
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
 
socradar-q1-2024-aviation-industry-report.pdf
socradar-q1-2024-aviation-industry-report.pdfsocradar-q1-2024-aviation-industry-report.pdf
socradar-q1-2024-aviation-industry-report.pdf
 
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit ParisNeo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
 
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
 
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, FactsALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
 
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket ManagementUtilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
 
Microservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we workMicroservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we work
 
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdfAutomated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
 
openEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain SecurityopenEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain Security
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
 
OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024
 
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
 
Oracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptxOracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptx
 
Fundamentals of Programming and Language Processors
Fundamentals of Programming and Language ProcessorsFundamentals of Programming and Language Processors
Fundamentals of Programming and Language Processors
 
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissancesAtelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissances
 
Empowering Growth with Best Software Development Company in Noida - Deuglo
Empowering Growth with Best Software  Development Company in Noida - DeugloEmpowering Growth with Best Software  Development Company in Noida - Deuglo
Empowering Growth with Best Software Development Company in Noida - Deuglo
 
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
 

Distributed Link Prediction in Large Scale Graphs using Apache Spark

  • 1. Distributed Link Prediction in Large Scale Graphs using Apache Spark Anastasios Theodosiou 1 Aristotle University of Thessaloniki, Thessaloniki 54621, GREECE anastasios.theodosiou@gmail.com Abstract. Social networks like Facebook, Instagram, Twitter, and LinkedIn have become an integral part of our everyday life. Through these, users can share dig- ital content (links, photos, videos), express or share their opinions, and expand their social circle by making new friends. All these user interactions lead to the evolution and development of these networks over time. A typical example of link prediction is some of the services offered by these networks to their users. An essential service for them is to support their users with suggestions for new friendships based on their existing network, as well as their preferences resulting from their interactions with the network. Link prediction techniques attempt to predict the possibility of a future connection between two nodes on a given net- work. Beyond social networks, link prediction has a broad scope. Some of these are, in e-commerce, genetics, and security. Due to the massive amounts of data that is collected today, the need for scalable approaches arises to this problem. The purpose of this diploma thesis is to experiment and use various techniques of machine learning, both supervised and unsupervised, to predict links to a net- work of academic papers using document similarity metrics based on the charac- teristics of the nodes but also other structural features, based on the network. Ex- perimentation and implementation of the application took place using Apache Spark to manage the large data volume using the Scala programming language. Keywords: Link prediction, Data mining, Machine Learning, Apache Spark, Graphs, Online Social Networks, Recommender Systems 1 The link prediction problem In order to better understand precisely what the link prediction problem is, a brief ex- ample will be given. 
Suppose there is a network whose nodes represent individuals, with links between them representing relationships or interactions. Given a network with these features, how can we predict its evolution in the future? Alternatively, how can we predict the creation of new edges or the deletion of existing ones? By studying the evolution of social networks over time, we can understand how one node interacts with another. To carry out this study, we need many different snapshots of the network structure over time, so the volume of data that we need to collect and process grows rapidly. Therefore, finding scalable approaches for its parallel processing becomes necessary. Some other real-world examples of link
predictions include suggesting friends and followers on a social networking site, recommending relevant products to customers, or providing suggestions to professionals for teamwork based on their field of study or their interests. We can therefore state the link prediction problem as estimating the probability of a future edge between two nodes.

1.1 Social networks and the difficulty of link prediction

A (social) network can be represented by a graph G(V, E), where V is the set of its vertices and E the set of its edges. The number of possible connections in such a network is |V| * (|V| - 1) / 2. The network we study in this work consists of 27,770 nodes. If we wanted to compute all possible edges and suggest new ones based on some document similarity metric (e.g., Jaccard similarity), we would have to check 385,572,565 candidate edges. This number is quite large even for a relatively small network such as the one in this work. However, social networks are sparse, so there is no need to pick edges at random and test for their existence in the network. Because the number of possible connections is so large, we need alternative, more efficient approaches to predicting them. Major social platforms such as Facebook, Twitter, Instagram, and LinkedIn offer, as one of their primary services, suggestions of new links in the form of a new "social friendship." High accuracy in such predictions can help us understand which factors drive the evolution of these networks and provide more accurate and meaningful suggestions. The social network studied in this work is a network of academic papers, where each paper cites other papers. A classic method for proposing collaboration on such a network is through the bibliography system. However, a new proposal based on this system alone could not be entirely accurate.
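The candidate-edge count and the Jaccard metric mentioned above can be sketched in a few lines of Scala (the language used in this work). This is an illustrative sketch, not the thesis code; representing a document as a set of tokens is an assumption for the example.

```scala
// Jaccard similarity between two documents represented as token sets:
// |A ∩ B| / |A ∪ B|; returns 0.0 for two empty sets by convention.
def jaccard(a: Set[String], b: Set[String]): Double = {
  val unionSize = (a union b).size
  if (unionSize == 0) 0.0 else (a intersect b).size.toDouble / unionSize
}

// Number of candidate undirected edges among n vertices: n(n-1)/2.
def possibleEdges(n: Long): Long = n * (n - 1) / 2
```

For the 27,770-node network, possibleEdges(27770) yields 385,572,565, the number of candidate edges quoted above.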
We need to extend and enrich this method with more data or new techniques so that we can achieve greater accuracy in our recommendations. For example, we can use methods based on the content and structure of the documents.

2 Graphs and social networks

Graphs provide a natural way to deal with abstract concepts such as relationships and interactions in a network, and they offer an intuitive, visual way of thinking about these concepts. They also form a natural basis for analyzing relationships in a social context. Over time, graphs have become increasingly common in data science. Graph databases are now established computing tools and alternatives to SQL and NoSQL databases. Concepts from graph theory are used to study and model social networks, fraud patterns, energy consumption patterns, influence within a social network, and many other areas of application. Social Network Analysis (SNA) is probably the best-known application of graph theory in data science. Graphs are also used in clustering algorithms (e.g., K-Means). There are therefore many reasons to use graphs, and many fields of application. From a computer science perspective, graphs can offer computational efficiency: the Big-O complexity of some algorithms is better for data arranged in graph form than for the same data in tabular form.
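To make the graph format concrete, the following Scala sketch (illustrative, not the thesis code) builds an adjacency-list representation from an edge list. This is the structure that makes neighborhood queries a single map lookup rather than a scan over tabular data.

```scala
// Build an undirected adjacency list from an edge list.
// Looking up a node's neighborhood then costs one map access.
def adjacency(edges: Seq[(Long, Long)]): Map[Long, Set[Long]] =
  edges
    .flatMap { case (a, b) => Seq(a -> b, b -> a) } // undirected: store both directions
    .groupBy(_._1)
    .map { case (node, pairs) => node -> pairs.map(_._2).toSet }
```

For example, adjacency(Seq((1L, 2L), (2L, 3L))) maps node 2 to the neighbor set {1, 3}.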
3 Link prediction and locality sensitive hashing

The problem of finding identical or duplicate documents based on a similarity metric seems relatively straightforward: using a hash function, the work can be completed very quickly. However, the problem becomes more complicated if we want to find similar documents that contain spelling mistakes or even different words. A brute-force technique can find such documents and predict links with higher accuracy, but it does not scale. The Locality Sensitive Hashing (LSH) algorithm, on the other hand, can be used for the same problems and yields approximate results in much less time than brute force. In our problem, LSH can suggest an edge between two nodes if the similarity of the two documents is above a given threshold. More generally, LSH relies on a family of functions, known as an LSH family, that hashes the data into buckets so that documents with high similarity are hashed into the same bucket. The general idea is to find an algorithm that, given two document signatures, can tell us whether the two nodes form a candidate pair, i.e., whether their similarity exceeds the given threshold. For the MinHashing part, there are two necessary steps: first, we hash the columns of the signature matrix with several hash functions, and then we check whether two documents are hashed into the same bucket by at least one of those functions. If so, we accept the two documents as a candidate pair. Regarding link prediction, if the Jaccard similarity of the two documents is above the given threshold, we can conclude that there is a potential edge between them.

4 Suggested approach and results

The network we studied in this work is composed of 27,770 nodes («papers») and 352,857 edges.
Each node carries a set of attributes: the document id, the publication date, the title, the authors, the journal, and the abstract of the paper. A second file, the «edge list», contained all the edges of the network used during our experiments. The aim of this work was to perform link prediction between the papers described above. An edge between two nodes exists when at least one node points to the other by referencing it through the bibliographic system. Our proposed approach does not take such references into account, nor does it use the metrics mentioned in chapter two; instead, it relies on the similarity of the records based on their attributes, through the Jaccard similarity metric, and on other structural features of the network. Two different approaches were used: in the first, the problem was treated as binary classification, while in the second, two different unsupervised machine learning techniques were applied: the brute-force method, and Locality Sensitive Hashing through the MinHashLSH algorithm provided by Apache Spark.
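As a concrete illustration of the MinHash-and-banding scheme described in Section 3, here is a minimal, self-contained sketch in plain Scala. It is a simplified stand-in for Spark's MinHashLSH, not the thesis implementation; using seeded MurmurHash3 as the hash family is an assumption for the example, and token sets are assumed non-empty.

```scala
import scala.util.hashing.MurmurHash3

// MinHash signature: for each of k seeded hash functions, keep the minimum
// hash value over the document's tokens. Assumes a non-empty token set.
def minHashSignature(tokens: Set[String], numHashes: Int): Vector[Int] =
  Vector.tabulate(numHashes) { seed =>
    tokens.iterator.map(t => MurmurHash3.stringHash(t, seed)).min
  }

// The fraction of agreeing signature positions estimates Jaccard similarity.
def estimatedJaccard(a: Vector[Int], b: Vector[Int]): Double =
  a.zip(b).count { case (x, y) => x == y }.toDouble / a.length

// LSH banding: split the signature into bands of equal size; each band
// produces a bucket key of the form (band index, band hash).
def bandKeys(sig: Vector[Int], numBands: Int): Set[(Int, Int)] = {
  val rowsPerBand = sig.length / numBands
  sig.grouped(rowsPerBand).zipWithIndex
     .map { case (band, i) => (i, band.hashCode) }
     .toSet
}

// Two documents are a candidate pair if any band lands in the same bucket.
def candidatePair(a: Vector[Int], b: Vector[Int], numBands: Int): Boolean =
  (bandKeys(a, numBands) intersect bandKeys(b, numBands)).nonEmpty
```

In the full pipeline, only candidate pairs are checked against the Jaccard threshold, which is what avoids the brute-force scan over all 385 million candidate edges.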
4.1 Supervised link prediction approach

As already mentioned, the problem was treated as binary classification. Several machine learning models were therefore used, each based on a particular classifier. All models were tested on a four-core system with 8 GB of RAM. In the first phase, the datasets for both nodes and edges were loaded. Once this process was complete, a join operation between the two files produced our initial dataframe, which contained the ids of the two papers involved in an edge, together with all the other attributes characterizing those nodes. A tokenization procedure was then performed on each column of the dataframe, converting all texts into bags of words. Next, all stop words were removed so that the Jaccard similarity would not be affected by them. At this point, the features that each classifier would consider during its training phase were computed. They come mainly from the attributes of the nodes, but also from structural features of the network: (a) the time difference in publication between the two papers, (b) the title overlap, (c) the authors overlap, (d) the journal overlap, and (e) the abstract overlap. Three more structural features concerning the nodes were added: (f) the common neighbors of the two endpoints of each edge, (g) the sum of the triangles to which each endpoint belongs, and (h) the PageRank score of each node in the network. We then applied the Chi-Squared test of independence, with the help of the ChiSqSelector class of Apache Spark, to determine whether there is a significant relationship between two categorical features. Based on this test and other experiments, we decided not to use the PageRank feature, as its contribution to the final accuracy and F1 was found to be almost zero.
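The per-edge feature vector described above can be sketched as follows. The case class and function names are hypothetical (the thesis built these features as Spark dataframe columns), but the computed quantities mirror items (a) through (f); triangle counts and PageRank, which require the full graph, are omitted from this sketch.

```scala
// Hypothetical record for a paper node; fields mirror the dataset attributes.
case class Paper(id: Long, year: Int, title: Set[String], authors: Set[String],
                 journal: Set[String], abstractText: Set[String])

// Features (a)-(e) from node attributes, plus (f) common neighbors.
def edgeFeatures(a: Paper, b: Paper,
                 neighbors: Map[Long, Set[Long]]): Vector[Double] = {
  def overlap(x: Set[String], y: Set[String]) = (x intersect y).size.toDouble
  val commonNeighbors =
    (neighbors.getOrElse(a.id, Set.empty[Long]) intersect
     neighbors.getOrElse(b.id, Set.empty[Long])).size.toDouble
  Vector(
    math.abs(a.year - b.year).toDouble,      // (a) publication time difference
    overlap(a.title, b.title),               // (b) title overlap
    overlap(a.authors, b.authors),           // (c) authors overlap
    overlap(a.journal, b.journal),           // (d) journal overlap
    overlap(a.abstractText, b.abstractText), // (e) abstract overlap
    commonNeighbors                          // (f) common neighbors
  )
}
```

Each classifier below is then trained on vectors of this shape, one per labeled edge.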
Finally, the data were divided into two parts: 70% for the training phase and the remaining 30% for the test phase.

Naïve Bayes Classifier. The first classifier used was Naïve Bayes. Several tests were performed to select the threshold of this algorithm. For this dataset and the selected features, Naïve Bayes gave the best results with a threshold of 0.5 (50%). Table 1 describes the results of this algorithm.

Table 1. Results from the Naïve Bayes classifier

  Dataset Split   Accuracy   F1        Exec. Time (sec)
  70/30           0.58614    0.58876   1090.06

Logistic Regression Classifier. Since the results of Naïve Bayes were not very good, other classifiers were tested. One of these was the Logistic Regression classifier. As with the previous model, we tested different feature sets here as well, and tests were also performed for different numbers of iterations. The first test contained only the features derived from the nodes' attributes: (a) the time difference of publishing, (b) the Jaccard overlap of titles, (c) the overlap of authors, (d) the overlap of the journal, and (e) the overlap of the abstract. The results of this algorithm for these features are shown in Table 2.

Table 2. Logistic regression results with attributes based on the nodes

  Max. Iterations   Accuracy   F1        Exec. Time (sec)
  10                0.79890    0.79812   1694.76
  100               0.79890    0.79947   1654.78
  1000              0.79713    0.79957   1628.16
  10000             0.79723    0.79778   1807.94

Although this model achieves higher accuracy and F1 than the previous one, the next test showed that adding the structural features improves the results further, as shown in Table 3.

Table 3. Logistic regression results with node and structural features

  Max. Iterations   Accuracy   F1        Exec. Time (sec)
  10                0.93518    0.93559   959.72
  100               0.93561    0.93600   1002.28

We note that adding the structural features significantly increased the accuracy of the algorithm and reduced its overall execution time. The next model we experimented with was the Linear SVM.

Linear SVM Classifier. This model was tested, like the previous models, on the same set of features. Experiments were also performed for different values of the maximum number of iterations and of the RegParam parameter. The test results are shown in Table 4.

Table 4. Linear SVM results

  Max Iterations   RegParam   Accuracy   F1        Exec. Time (sec)
  10               0.1        0.85967    0.85967   934.15
  100              0.1        0.88044    0.88152   1124.26
  10               0.3        0.84362    0.84355   893.23
  100              0.3        0.85683    0.85821   1313.11

The tests showed that the Linear SVM algorithm works best with MaxIterations set to 100 and RegParam equal to 0.1. Next in the series of models was the Multilayer Perceptron classifier.

Multilayer Perceptron Classifier. This classifier is based on neural networks. Many experiments were performed here, both on the maximum number of iterations of the algorithm and on the number of layers.
Extra parameters were tested, but these two affected the result more than any other. The results are presented in Table 5.

Table 5. MLPC results based on the number of iterations and layers

Max Iterations   Layers      Accuracy   F1        Exec. Time (sec)
100              13,10,7,2   0.87953    0.87951   1007.67
200              13,10,7,2   0.94770    0.94776   1106.78
400              13,7,4,2    0.95187    0.95205   1347.12

For this data set and the chosen features, the best results were obtained with the maximum number of iterations at 400 and the layers 13,7,4,2. The next model we tested was the Decision Tree.

Decision Tree Classifier. The most significant difference in this classifier's behavior came from the maximum depth allowed for the tree. The results for this parameter are shown in Table 6.

Table 6. Decision tree classifier results

Max Depth   Accuracy   F1        Exec. Time (sec)
4           0.95116    0.95129   1302.87
8           0.95300    0.95314   1308.23
16          0.94262    0.94279   1177.16
30          0.92497    0.92494   1342.28

As the table shows, this model achieves higher accuracy and F1 than all previous classifiers. The best value for this model came with a maximum depth of 8. Finally, the Random Forest algorithm was used in our experiments.

Random Forest Classifier. The sixth and last classifier we used to solve the link prediction problem was the Random Forest algorithm, an ensemble method built on Decision Trees. One advantage is that it combines multiple decision trees to avoid overfitting. We experimented with two basic parameters: the maximum depth of the trees and the total number of trees. The results of this test are described in Table 7.

Table 7. Random forest classifier results

Max Depth   Num. Trees   Accuracy   F1        Exec. Time (sec)
4           10           0.95066    0.95077   1314.01
8           10           0.95580    0.95591   1191.91
4           100          0.95058    0.95068   1262.46
8           100          0.95527    0.95538   1230.55

The Random Forest model achieves the highest accuracy and F1 of all the models mentioned above. The best values for accuracy and F1 are obtained with 10 trees and a maximum depth per tree of 8.

Model comparison. Summarizing the above results, the classifiers were compared in terms of accuracy and F1. Figure 1 shows the accuracy and F1 per model.

Figure 1. Comparison of the classifiers

As far as execution time is concerned, the Logistic Regression model has the shortest time, with a total completion time of 1002.28 seconds, while the Random Forest model requires 189.69 seconds longer. The difference in the run time of the six classifiers is shown in Figure 2.

Figure 2. The execution time of the six classifiers

4.2 Unsupervised link prediction approach

From the perspective of unsupervised machine learning, the problem of link prediction was addressed with two different but widely used techniques. Our approach differs from most of the related work on this problem in how we deal with
it. Nearly the same data preprocessing techniques were applied as in the previous chapter. The main difference is that a bag of words was created for, and associated with, each node. In more detail, for each paper we tokenized each column into words and then concatenated all the dataframe columns into one. In the next phase, all stop words were removed so that the Jaccard similarity would not be affected. Once the data had been prepared, we proceeded to predict new links with two different techniques. The first technique tested was brute force; the second was the Locality Sensitive Hashing (LSH) algorithm in combination with MinHashing. It is worth noting that these experiments were carried out on a cluster of 80 cores. All tests used at most 64 cores and 32 GB of RAM. Because of hardware constraints, and more specifically random access memory issues, the experiments were performed on a subset of the original data set.

Brute force prediction. Initially, a join operation was performed on the data in order to create all the possible edges that may occur. This process is relatively slow, but it only needs to take place once, and its result can then be reused as is. After that, the Jaccard similarity was calculated for all candidate edges; the maximum Jaccard similarity found was 0.4973. Once this process had completed, we set a threshold on the Jaccard similarity so that edges with a similarity greater than or equal to it were selected. The run time of the algorithm grew rapidly, roughly quadratically, as the number of nodes in the data set increased, since all node pairs are examined. Table 8 shows cumulative results of the algorithm: the accuracy, the number of candidate pairs, the total number of checks performed, and the algorithm's execution time.

Table 8. Aggregative results of the brute force algorithm execution

Nodes   Checks      Candidates   Accuracy   Exec. Time (sec)
1000    499500      3916         0.9368     62.89
2000    1999000     14055        0.9662     161.04
5000    12497500    74302        0.9711     566.73
7000    24496500    106534       0.9789     1446.98

Figure 3 shows the change in the algorithm's accuracy as the number of nodes in the network grows.

Figure 3. The increase of accuracy based on the dataset volume
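The preprocessing and brute-force steps above can be sketched in plain Python. This is an illustrative stand-in for the Spark/Scala implementation used in the thesis; the stop-word list, sample papers, and threshold are assumptions made up for the example:

```python
from itertools import combinations

STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}  # tiny illustrative list

def bag_of_words(text):
    """Tokenize a paper's concatenated columns and drop stop words."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}

def jaccard(a, b):
    """Jaccard similarity of two token sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def brute_force_links(papers, threshold):
    """Check every node pair -- n*(n-1)/2 comparisons, which is what makes
    the run time grow quadratically -- and keep pairs meeting the threshold."""
    bags = {pid: bag_of_words(text) for pid, text in papers.items()}
    edges = []
    for u, v in combinations(sorted(bags), 2):
        s = jaccard(bags[u], bags[v])
        if s >= threshold:
            edges.append((u, v, s))
    return edges

papers = {
    1: "Link prediction in large graphs",
    2: "Distributed link prediction in large scale graphs",
    3: "A survey of recommender systems",
}
print(brute_force_links(papers, 0.3))  # only the pair (1, 2) passes the threshold
```

The join that produces all candidate pairs corresponds to `combinations` here; on 1000 nodes it already yields the 499,500 checks reported in Table 8.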
This technique is generally fairly accurate, but extremely time-consuming and entirely dependent on system resources. For this reason, techniques are needed that can produce results within a reasonable time. MinHashing combined with the Locality Sensitive Hashing algorithm addresses this problem.

MinHashLSH prediction. The basic idea behind this algorithm is that it uses MinHashing in conjunction with LSH so that documents with a high similarity index are hashed into the same bucket, while those with a small index land in different ones. In general, the workflow is the same as for the brute force algorithm, except that here the data are joined on the Jaccard distance rather than the Jaccard index. So if we want to require 60% Jaccard similarity between our documents, we should set a Jaccard distance equal to 1 − Jaccard similarity, i.e., the two documents should be at most 40% apart. Table 9 lists some results from the various experiments performed with this method, using a Jaccard distance of 0.8, as this was the value that provided the most accurate results.

Table 9. MinHashLSH scores relative to the number of hash tables

Hash Tables   Candidates   Precision   Recall   Accuracy   F1        Time (sec)
2             986          0.7261      0.0120   0.97133    0.02370   26.19
4             1610         0.7304      0.0385   0.97147    0.03853   31.12
8             3026         0.6265      0.0607   0.97147    0.06072   70.20
16            3628         0.5975      0.0364   0.97148    0.06877   111.48
32            3824         0.5983      0.0385   0.97147    0.07246   211.95
64            3840         0.5968      0.0385   0.97149    0.07246   514.83
128           3840         0.5968      0.0385   0.97151    0.07251   1344.72

We notice that as the number of hash tables grows, the accuracy of the results increases, while the algorithm's execution time grows roughly linearly with the number of hash tables. Figure 4 illustrates the change in the algorithm's accuracy relative to the number of hash tables.

Figure 4. The accuracy of MinHashLSH vs. hash tables

As regards the evaluation of the unsupervised techniques, since we did not have a classifier or regressor, a function was implemented that computes TP, FP, TN, and FN by comparing the results of the algorithms with the original graph of the network.
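Both steps can be sketched in plain Python: the MinHash/LSH bucketing that generates candidate pairs, and the TP/FP/TN/FN comparison against the true graph. This is an illustrative stand-in for Spark's MinHashLSH, not the thesis implementation; the signature length, band count, and sample documents are assumptions:

```python
import random
import zlib
from collections import defaultdict
from itertools import combinations

random.seed(0)
NUM_HASHES = 16  # MinHash signature length (illustrative choice)
BANDS = 8        # LSH bands; 2 signature rows per band
PRIME = 2_147_483_647

# Random affine hash functions h(x) = (a*x + b) mod PRIME
HASH_FUNCS = [(random.randrange(1, PRIME), random.randrange(PRIME))
              for _ in range(NUM_HASHES)]

def minhash_signature(tokens):
    """For each hash function, keep the minimum hash over the token set;
    the fraction of agreeing positions estimates the Jaccard similarity."""
    ids = [zlib.crc32(t.encode()) for t in tokens]
    return tuple(min((a * x + b) % PRIME for x in ids) for a, b in HASH_FUNCS)

def lsh_candidates(bags):
    """Split each signature into bands; documents sharing any band bucket
    become a candidate pair (similar documents collide with high probability)."""
    rows = NUM_HASHES // BANDS
    buckets = defaultdict(set)
    for doc, tokens in bags.items():
        sig = minhash_signature(tokens)
        for band in range(BANDS):
            buckets[(band, sig[band * rows:(band + 1) * rows])].add(doc)
    pairs = set()
    for docs in buckets.values():
        pairs.update(combinations(sorted(docs), 2))
    return pairs

def evaluate(predicted, actual, all_pairs):
    """Compare predicted edges with the true graph to get TP/FP/TN/FN,
    then derive precision, recall, accuracy, and F1."""
    tp = len(predicted & actual)
    fp = len(predicted - actual)
    fn = len(actual - predicted)
    tn = len(all_pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(all_pairs)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, accuracy, f1
```

Candidate pairs from the buckets would then be verified with the exact Jaccard similarity (kept when the Jaccard distance 1 − s is below the chosen limit), which is how the false positives that the buckets produce are filtered.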
Then the Precision and Recall metrics were calculated, and from these we arrived at Accuracy and F1. Many more experiments and tests were carried out; they are available in the full version of the diploma thesis.

5 Conclusion

The problem of link prediction has a wide range of applications in different areas. In this diploma thesis, we studied both supervised and unsupervised machine learning techniques. After many experiments and trials, given the data at our disposal and the way we chose to address the problem, we concluded that, among the supervised techniques, the model based on the Random Forest classifier is the best solution to the problem. On the unsupervised side, the MinHashLSH method was chosen, as it is much faster, produces quite good results, and comes very close to the accuracy of the brute force technique. However, it requires care, as it generates many false positives.

6 Future work

As future work, we will address the problem of link prediction from a different viewpoint. We will re-examine the same network, this time with a technique based on clustering. This approach groups similar nodes into a "cluster", under the assumption that nodes in the same cluster exhibit a similar connectivity pattern. In more detail, with this method we will initially set a threshold θ and then remove all edges of the graph whose weight is below that threshold. Each connected component of the resulting graph will then correspond to a cluster; in general, two nodes are in the same connected component if and only if there is a path between them. On the supervised machine learning side, we will try techniques based purely on neural networks, with more complex data preprocessing, and we hope to achieve even better results in less execution time.
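The threshold-θ, connected-components procedure described above can be sketched with a union-find structure. This is an illustrative plain-Python example; the nodes, edge weights, and θ below are made up, not thesis data:

```python
def threshold_clusters(nodes, weighted_edges, theta):
    """Drop edges with weight below theta; each connected component of the
    remaining graph is one cluster (found via union-find)."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the root, halving paths as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Union the endpoints of every edge that survives the threshold.
    for u, v, w in weighted_edges:
        if w >= theta:
            parent[find(u)] = find(v)

    # Group nodes by their component root.
    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), set()).add(n)
    return list(clusters.values())

nodes = ["a", "b", "c", "d"]
edges = [("a", "b", 0.9), ("b", "c", 0.8), ("c", "d", 0.2)]
print(threshold_clusters(nodes, edges, 0.5))  # two clusters: {a, b, c} and {d}
```

The weak edge c-d (weight 0.2 < θ = 0.5) is discarded, so d falls into its own cluster; link prediction would then propose edges between nodes that share a cluster.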