Distributed Link Prediction in Large Scale Graphs
using Apache Spark
Anastasios Theodosiou
Aristotle University of Thessaloniki, Thessaloniki 54621, GREECE
anastasios.theodosiou@gmail.com
Abstract. Social networks like Facebook, Instagram, Twitter, and LinkedIn have
become an integral part of our everyday life. Through these, users can share dig-
ital content (links, photos, videos), express or share their opinions, and expand
their social circle by making new friends. All these user interactions drive the evolution and growth of these networks over time. A typical application of link prediction lies in the services these networks offer their users: an essential one is suggesting new friendships based on a user's existing network and on the preferences revealed by their interactions with it. Link prediction techniques attempt to estimate the likelihood of a future connection between two nodes of a given network. Beyond social networks, link prediction has a broad scope, with applications in e-commerce, genetics, and security. Due to the massive amounts of data collected today, scalable approaches to this problem are needed. The purpose of this diploma thesis is to experiment with various machine learning techniques, both supervised and unsupervised, to predict links in a network of academic papers, using document similarity metrics based on node attributes as well as structural features of the network. The experiments and the application were implemented with Apache Spark, using the Scala programming language, to handle the large data volume.
Keywords: Link prediction, Data mining, Machine Learning, Apache Spark,
Graphs, Online Social Networks, Recommender Systems
1 The link prediction problem
To better understand what the link prediction problem is, consider a brief example. Suppose there is a network whose nodes represent individuals and whose links represent relationships or interactions between them. Given such a network, how can we predict its evolution in the future? In other words, how can we predict the creation of new edges, or the deletion of existing ones? By studying the evolution of social networks over time, we can understand how nodes interact with one another. Carrying out such a study requires many snapshots of the network structure over time, so the volume of data that we need to collect and process grows rapidly; finding scalable approaches for its parallel processing therefore becomes necessary. Other real-world examples of link prediction include suggesting friends and followers on a social networking site, recommending relevant products to customers, or proposing collaborations to professionals based on their field of study or their interests. In summary, the link prediction problem is that of predicting the probability of a future edge between two nodes.
1.1 Social networks and the difficulty of link prediction.
A (social) network can be represented by a graph G(V, E), where V is the set of its vertices and E the set of its edges. The number of possible connections in such a network is |V|·(|V|−1)/2. The network examined in this work consists of 27,770 nodes; if we wanted to compute all possible edges and suggest new ones based on some document similarity metric (e.g., Jaccard similarity), we would have to check 385,572,565 candidate edges. This number is quite large even for a relatively small network such as ours. Moreover, social networks are sparse, so picking an edge at random and testing whether it should exist is ineffective. Because the number of possible connections is so large, alternative, more efficient approaches to predicting them are needed. Major social platforms such as Facebook, Twitter, Instagram, and LinkedIn offer, as one of their primary services, proposals of new links in the form of new "social friendships." High accuracy in such predictions can help us understand which factors drive the evolution of these networks and can provide more accurate and meaningful suggestions. The social network studied in this work is a network of academic papers in which each paper cites other papers. A classic method for proposing collaboration on such a network is through the bibliography system; however, a proposal based on this system alone may not be entirely accurate. We need to extend and enrich this method with more data or new techniques, for example methods based on the content and structure of the documents, in order to achieve greater accuracy in our recommendations.
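As a quick check of the candidate-edge count above, the number of possible edges follows directly from |V|·(|V|−1)/2. A minimal sketch in plain Python (not the Spark/Scala implementation used in the thesis):

```python
def possible_edges(n: int) -> int:
    """Number of undirected node pairs in a graph with n vertices: n*(n-1)/2."""
    return n * (n - 1) // 2

# For the 27,770-node citation network studied here:
print(possible_edges(27_770))  # 385572565
```

This is why exhaustively scoring every pair does not scale, even for a modestly sized network.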
2 Graphs and social networks.
Graphs provide a natural way to deal with abstract concepts such as relationships and interactions in a network, and they offer an intuitive, visual way of thinking about these concepts. They are also a natural basis for analyzing relationships in a social context. Over time, graphs have become increasingly common in data science: graph databases are now widespread computing tools and alternatives to SQL and NoSQL databases. Concepts from graph theory are used to study and model social networks, fraud patterns, energy consumption patterns, influence in a social network, and many other application areas. Social Network Analysis (SNA) is probably the best-known application of graph theory in data science. Graphs are also used in clustering algorithms (e.g., K-Means). There are therefore many reasons to use graphs, and many fields of application. From the computer science perspective, graphs can offer computational efficiency: the big-O complexity of some algorithms is better when the data is arranged in graph form than when it is stored as tabular data.
3 Link prediction and locality sensitive hashing.
The problem of finding identical or duplicate documents based on a similarity metric seems relatively straightforward: using a hash function, the work can be completed very quickly. However, the problem becomes more complicated if we want to find similar documents that contain spelling mistakes or even different words. The brute-force technique can find such documents and predict links with high accuracy, but it does not scale. The LSH algorithm, or Locality Sensitive Hashing, is a technique that can be applied to the same problem and yields approximate results in far less time than brute force. In our problem, LSH can suggest an edge between two nodes if the similarity of the corresponding documents is above a given threshold. More generally, LSH relies on a family of hash functions, known as an LSH family, that hashes the data into buckets so that documents with high similarity land in the same bucket. The general idea is to find an algorithm that, given two document signatures, can tell us whether the corresponding nodes form a candidate pair, i.e., whether their similarity is likely above the given threshold. Combined with MinHashing, there are two necessary steps: first, we hash the columns of the signature matrix with several hash functions; then we check whether two documents were hashed into the same bucket by at least one of those functions. If so, we accept the two documents as a candidate pair. For link prediction, if the Jaccard similarity of two documents is above the given threshold, we conclude that there is a potential edge between them.
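To make the two steps concrete, here is a small, self-contained Python sketch of MinHash signatures followed by the banding step. The hash family, band count, and toy documents are illustrative assumptions, not the configuration of Spark's MinHashLSH used in the thesis:

```python
import random
import zlib
from collections import defaultdict

random.seed(42)
NUM_HASHES, BANDS = 20, 10                 # 10 bands of 2 signature rows each
PRIME = 4_294_967_311                      # a prime just above 2**32
COEFFS = [(random.randrange(1, PRIME), random.randrange(PRIME))
          for _ in range(NUM_HASHES)]

def minhash_signature(tokens):
    """One minimum per hash function over the document's token set."""
    hashed = [zlib.crc32(t.encode()) for t in tokens]
    return tuple(min((a * h + b) % PRIME for h in hashed) for a, b in COEFFS)

def candidate_pairs(docs):
    """Banding step: documents sharing a bucket in any band become a candidate pair."""
    rows = NUM_HASHES // BANDS
    sigs = {d: minhash_signature(t) for d, t in docs.items()}
    pairs = set()
    for band in range(BANDS):
        buckets = defaultdict(list)
        for d, sig in sigs.items():
            buckets[sig[band * rows:(band + 1) * rows]].append(d)
        for ids in buckets.values():
            pairs.update((a, b) for i, a in enumerate(ids) for b in ids[i + 1:])
    return pairs

docs = {
    "p1": {"link", "prediction", "social", "networks"},
    "p2": {"link", "prediction", "social", "networks"},  # duplicate of p1
    "p3": {"protein", "folding", "dynamics"},
}
# p1 and p2 have identical signatures, so they collide in every band
# and are reported as a candidate pair; p3 is very unlikely to pair.
print(candidate_pairs(docs))
```

Only candidate pairs then need their exact Jaccard similarity verified, which is what makes the approach scale compared to checking all pairs.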
4 Suggested approach and results.
The network studied in this work is composed of 27,770 nodes (papers) and 352,857 edges. Each node carries several attributes: the document id, the publication date, the title, the authors, the journal, and the abstract of the paper. A second file, the edge list containing all the edges of the network, was also used during our experiments. The aim of this work was to perform link prediction between the papers described above. An edge between two nodes exists when at least one node points to the other by referencing it through the bibliographic system. Our proposed approach does not take such citations into account, nor does it use the metrics mentioned in chapter two; instead it relies on the similarity of the records, computed from their attributes through the Jaccard similarity metric, and on other structural features of the network. Two different approaches were used: in the first, the problem was treated as binary classification, while in the second, two different unsupervised machine learning techniques were applied, namely the brute-force method and Locality Sensitive Hashing, using and configuring the MinHashLSH algorithm provided by Apache Spark.
4.1 Supervised link prediction approach.
As we have already mentioned, the problem was treated as binary classification. Therefore, different machine learning models were used, each based on a particular classifier. All models were tested on a four-core system with 8GB of RAM. In the first phase, the datasets for both nodes and edges were loaded. Once this process was completed, a join operation between the two files created our initial dataframe. This dataframe contained the ids of the two papers involved in an edge, as well as all the other attributes that characterize those nodes. After that, a tokenization procedure was performed on each column of the dataframe, converting all texts into bags of words. Next, all stop words were removed so that the Jaccard similarity would not be affected by them. At this point, the features that each classifier would consider during its training phase were calculated. They come mainly from the attributes of the nodes, but also from structural features of the network: (a) the difference in publication time between the two papers, (b) the title overlap, (c) the authors overlap, (d) the journal overlap, and (e) the abstract overlap. Furthermore, three structural features concerning the nodes were added: (f) the common neighbors of the two endpoints of each edge, (g) the sum of the triangles that each endpoint belongs to, and (h) the PageRank score of each node in the network. After that, we applied the Chi-squared test of independence, with the help of the ChiSqSelector class of Apache Spark, to determine whether there is a significant relationship between two categorical features. Based on this test and other experiments, it was decided not to use the PageRank feature, as its contribution to the final accuracy and F1 was found to be almost zero. Finally, the data were divided into two parts: 70% for the training phase and the remaining 30% for the test phase.
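The attribute overlaps and the common-neighbors feature described above can be sketched in plain Python (the thesis computed them over Spark dataframes in Scala; the function names and toy data here are illustrative):

```python
def overlap(a: set, b: set) -> float:
    """Jaccard overlap between two bags of words (title, authors, journal, abstract)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def common_neighbors(adj: dict, u: str, v: str) -> int:
    """Structural feature: number of neighbors shared by an edge's two endpoints."""
    return len(adj.get(u, set()) & adj.get(v, set()))

title_u = {"link", "prediction", "graphs"}
title_v = {"link", "prediction", "spark"}
print(overlap(title_u, title_v))        # 2 shared / 4 total = 0.5

adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
print(common_neighbors(adj, "a", "b"))  # both neighbor "c" -> 1
```

Each candidate edge is thus turned into a numeric feature vector, which is what the classifiers below are trained on.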
Naïve Bayes Classifier.
The first classifier used was Naïve Bayes. Several tests were performed to select the threshold of this algorithm; for this data set and the selected features, Naïve Bayes gave the best results with a threshold value of 0.5 (50%). Table 1 describes the results of this particular algorithm.
Table 1. Results from Naive Bayes classifier
Dataset Split Accuracy F1 Exec. Time (sec)
70/30 0.58614 0.58876 1090.06
Logistic Regression Classifier.
Since the results of Naïve Bayes were not very good, other classifiers were tested. One of these was the Logistic Regression classifier. As with the previous model, different feature sets were tried here too. Specifically, the first test contained only the features derived from the nodes' attributes, and tests were also performed for different numbers of iterations. This first test used only the following features: (a) the difference in publication time, (b) the overlap of titles (Jaccard similarity), (c) the overlap of authors, (d) the overlap of the journal, and (e) the overlap of the abstract. The results of this algorithm for these features are shown in Table 2.
Table 2. Logistic regression results with attributes based on node
Max. Iterations Accuracy F1 Exec. Time (sec)
10 0.79890 0.79812 1694.76
100 0.79890 0.79947 1654.78
1000 0.79713 0.79957 1628.16
10000 0.79723 0.79778 1807.94
Although this model achieves higher accuracy and F1 scores than the previous one, the next test showed that, with the addition of the structural features, the algorithm achieves even better results, as shown in Table 3.
Table 3. Logistic regression results with node's and structural features
Max. Iterations Accuracy F1 Exec. Time (sec)
10 0.93518 0.93559 959.72
100 0.93561 0.93600 1002.28
We note that adding the structural features significantly increased the accuracy of the algorithm and reduced its overall execution time. The next model we experimented with was the Linear SVM.
Linear SVM Classifier.
This model was tested, like the previous ones, on the same set of features. Experiments were also performed for different values of the maximum number of iterations, as well as for the RegParam (regularization) parameter. The test results are shown in Table 4.
Table 4. Linear SVM results
Max Iterations RegParam Accuracy F1 Exec. Time
10 0.1 0.85967 0.85967 934.15
100 0.1 0.88044 0.88152 1124.26
10 0.3 0.84362 0.84355 893.23
100 0.3 0.85683 0.85821 1313.11
The tests showed that the Linear SVM algorithm works best with the MaxIterations parameter at 100 and RegParam equal to 0.1. Next in the series of models was the Multilayer Perceptron classifier.
Multilayer Perceptron Classifier.
This classifier is based on neural networks. Many experiments were performed here, varying both the maximum number of iterations of the algorithm and the number of layers. Other parameters were also tested, but these two affected the result more than any other. The results are presented in Table 5.
Table 5. MLPC results based on the number of iterations and layers
Max Iterations Layers Accuracy F1 Exec. Time
100 13,10,7,2 0.87953 0.87951 1007.67
200 13,10,7,2 0.94770 0.94776 1106.78
400 13,7,4,2 0.95187 0.95205 1347.12
The best results for our data set and chosen features were obtained with the maximum number of iterations at 400 and the layers 13,7,4,2. Next in the series of models we tested was the Decision Tree.
Decision Tree Classifier.
It was observed that the parameter with the most significant effect on the classifier's behavior was the maximum depth of the tree. The results for this parameter are shown in Table 6.
Table 6. Decision tree classifier results
Max Depth Accuracy F1 Exec. Time (sec)
4 0.95116 0.95129 1302.87
8 0.95300 0.95314 1308.23
16 0.94262 0.94279 1177.16
30 0.92497 0.92494 1342.28
As the table above shows, this model achieves even higher accuracy and F1 than all previous classifiers. The best values for this model came with a maximum depth of 8. Finally, the Random Forest algorithm was used in our experiments.
Random Forest Classifier.
The sixth and last classifier we used for the link prediction problem was the Random Forest algorithm. This algorithm is an ensemble of decision trees; one of its advantages is that combining multiple trees helps avoid overfitting. We experimented with two basic parameters: the maximum depth of the trees and the total number of trees. The results of this test are described in Table 7.
Table 7. Random forest classifier results
Max Depth Num. Trees Accuracy F1 Exec. Time (sec)
4 10 0.95066 0.95077 1314.01
8 10 0.95580 0.95591 1191.91
4 100 0.95058 0.95068 1262.46
8 100 0.95527 0.95538 1230.55
The Random Forest model achieves the best accuracy and F1 of all the models mentioned above. The best values for accuracy and F1 are obtained with 10 trees and a maximum depth of 8 per tree.
Model comparison.
Summarizing the above results, the different classifiers were compared in terms of accuracy and F1. Figure 1 shows the accuracy and F1 per model.
Figure 1. Comparison of the classifiers
As far as execution time is concerned, the shortest belongs to the Logistic Regression model, with a total completion time of 1002.28 seconds, while the Random Forest model requires 189.69 seconds longer. The difference in the run times of the six classifiers is shown in Figure 2.
Figure 2. The execution time of six classifiers
4.2 Unsupervised link prediction approach.
From the perspective of unsupervised machine learning, the problem of link prediction was addressed with two different but widely used techniques. The approach we propose differs from most related work on this problem in how it is handled. Nearly the same data preprocessing as in the previous chapter was performed. The main difference is that a bag of words was created for, and associated with, each node: for each paper, each column was tokenized into words, and all the dataframe columns were then concatenated into one. In the next phase, all stop words were removed so that the Jaccard similarity would not be affected. Once our data had been prepared, we proceeded to predict new links with two different techniques. The first technique tested was brute force, and the second was the Locality Sensitive Hashing (LSH) algorithm in combination with MinHashing. It is worth noting that these experiments were carried out on a cluster of 80 cores; all tests used at most 64 cores and 32GB of RAM. Due to hardware constraints, and more specifically random access memory limits, the experiments were performed on a subset of the original data set.
Brute force prediction.
Initially, a join operation was performed on the data in order to create all the possible edges that may occur. This process is relatively slow, but it needs to take place only once, after which the result can be reused. The Jaccard similarity was then calculated for all candidate edges; the maximum Jaccard similarity found was 0.4973. After this process was completed, we set a threshold on the Jaccard similarity, so that edges with a similarity greater than or equal to it were selected. The run time of the algorithm grew geometrically as the number of nodes in the data set increased. In Table 8, we present cumulative results: the accuracy, the number of candidate pairs, the total number of checks performed, and the algorithm's execution time.
Table 8. Aggregative results of brute force algorithm execution
Nodes Checks # Candidates Accuracy Exec. Time (sec)
1000 499500 3916 0.9368 62.89
2000 1999000 14055 0.9662 161.04
5000 12497500 74302 0.9711 566.73
7000 24496500 106534 0.9789 1446.98
Figure 3 shows the change in algorithm accuracy as the number of nodes in the network
grows.
Figure 3. The increase of accuracy based on the dataset volume
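The brute-force step can be sketched in plain Python as follows; the real implementation was a Spark join over all node pairs, and the threshold and toy bags of words here are illustrative:

```python
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two bags of words."""
    return len(a & b) / len(a | b) if a | b else 0.0

def brute_force_links(bags: dict, threshold: float) -> list:
    """Check every possible pair (n*(n-1)/2 checks) and keep pairs whose
    Jaccard similarity is at or above the threshold."""
    return [(u, v, jaccard(bags[u], bags[v]))
            for u, v in combinations(bags, 2)
            if jaccard(bags[u], bags[v]) >= threshold]

bags = {
    "p1": {"link", "prediction", "spark"},
    "p2": {"link", "prediction", "scala"},
    "p3": {"protein", "folding"},
}
print(brute_force_links(bags, 0.3))  # only (p1, p2) passes: Jaccard = 2/4 = 0.5
```

The quadratic number of checks in `combinations` is exactly the cost that LSH avoids in the next section.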
This technique is generally fairly accurate, but it is extremely time-consuming and entirely dependent on system resources. For this reason, techniques that can produce results within a reasonable time are needed; MinHashing and the Locality Sensitive Hashing algorithm address exactly this problem.
MinHashLSH prediction.
The basic idea behind this algorithm is that it uses MinHashing in conjunction with LSH so that documents with a high similarity index are hashed into the same bucket, while those with a low index land in different ones. The overall workflow is the same as for the brute-force algorithm, except that here we join the data based on the Jaccard distance rather than the Jaccard index. So if we want to require 60% Jaccard similarity between documents, we should set a Jaccard distance threshold equal to 1 − 0.6 = 0.4, i.e., the two documents should be at most 40% apart. Table 9 lists some results from the various experiments performed with this method, using a Jaccard distance of 0.8, as this value provided the most accurate results.
Table 9. MinHashLSH scores relative to the number of hash tables
Hash Tables Candidates Precision Recall Accuracy F1 Time (sec)
2 986 0.7261 0.0120 0.97133 0.02370 26.19
4 1610 0.7304 0.0385 0.97147 0.03853 31.12
8 3026 0.6265 0.0607 0.97147 0.06072 70.20
16 3628 0.5975 0.0364 0.97148 0.06877 111.48
32 3824 0.5983 0.0385 0.97147 0.07246 211.95
64 3840 0.5968 0.0385 0.97149 0.07246 514.83
128 3840 0.5968 0.0385 0.97151 0.07251 1344.72
We notice that as the number of hash tables grows, the accuracy of the results increases, while the algorithm's execution time grows almost linearly with the number of hash tables. Figure 4 illustrates the change in accuracy relative to the number of hash tables.
Figure 4. The accuracy of MinHashLSH vs. hash tables
Regarding the evaluation of the unsupervised techniques, since we did not have a classifier or regressor, a function computing TP, FP, TN, and FN was implemented by comparing the results of the algorithms with the original graph of the network. The precision and recall metrics were then calculated, and from these we derived the accuracy and F1. Many more experiments and tests were carried out; they are available in the full version of the diploma thesis.
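The evaluation function described above can be sketched like this in plain Python (variable names and the toy graph are illustrative, not the thesis implementation):

```python
def evaluate(predicted: set, actual: set, all_pairs: set) -> dict:
    """Compare predicted edges against the original graph to obtain TP/FP/TN/FN,
    then derive precision, recall, accuracy, and F1 from those counts."""
    tp = len(predicted & actual)                 # predicted and truly present
    fp = len(predicted - actual)                 # predicted but absent
    fn = len(actual - predicted)                 # present but missed
    tn = len(all_pairs) - tp - fp - fn           # correctly left out
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(all_pairs)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}

all_pairs = {("a","b"), ("a","c"), ("a","d"), ("b","c"), ("b","d"), ("c","d")}
actual = {("a","b"), ("b","c"), ("c","d")}        # edges in the original graph
predicted = {("a","b"), ("b","c"), ("a","d")}     # edges the algorithm suggested
print(evaluate(predicted, actual, all_pairs))
```

Note that with a sparse graph, accuracy is dominated by true negatives, which is why precision, recall, and F1 are reported separately in Table 9.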
5 Conclusion.
The problem of link prediction has a wide range of applications in different areas. In this diploma thesis, we studied both supervised and unsupervised machine learning techniques. After many experiments and trials, and given the data at our disposal and the way we chose to address this problem, we concluded that among the supervised techniques the model based on the Random Forest classifier is the best solution. On the unsupervised machine learning side, the MinHashLSH method was chosen, as it is much faster, produces quite good results, and comes very close to the accuracy of the brute-force technique. However, it requires attention, as it generates many false positives.
6 Future work.
As future work, we will address the problem of link prediction from a different viewpoint. We will re-examine the same network, but this time with a technique based on clustering. This approach groups similar nodes into clusters, expecting nodes from the same cluster to exhibit a similar connectivity pattern. In more detail, with this method we will initially set a threshold θ and then remove all the edges of the graph whose weight is below it. Each connected component of the resulting graph will then correspond to a cluster; in general, two nodes are in the same connected component if and only if there is a path between them. On the supervised machine learning side, we will try techniques based purely on neural networks, with more complex data preprocessing, and we hope to achieve even better results in less execution time.
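The clustering idea sketched above (prune edges below θ, then treat each connected component as a cluster) can be illustrated in a few lines of Python; the threshold and toy edge weights are illustrative assumptions:

```python
def clusters(weighted_edges: list, theta: float) -> list:
    """Drop edges with weight < theta; each connected component of the
    remaining graph is a cluster (union-find over the surviving edges)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for u, v, w in weighted_edges:
        find(u); find(v)                    # register every node
        if w >= theta:
            parent[find(u)] = find(v)       # union only the strong edges
    comps = {}
    for node in parent:
        comps.setdefault(find(node), set()).add(node)
    return sorted(map(sorted, comps.values()))

edges = [("a", "b", 0.9), ("b", "c", 0.8), ("c", "d", 0.1), ("d", "e", 0.7)]
print(clusters(edges, 0.5))  # [['a', 'b', 'c'], ['d', 'e']]
```

Candidate links would then be proposed only within a cluster, which again avoids scoring all |V|·(|V|−1)/2 pairs.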

Silvia Puglisi
 
Running head DEPRESSION PREDICTION DRAFT1DEPRESSION PREDICTI.docx
Running head DEPRESSION PREDICTION DRAFT1DEPRESSION PREDICTI.docxRunning head DEPRESSION PREDICTION DRAFT1DEPRESSION PREDICTI.docx
Running head DEPRESSION PREDICTION DRAFT1DEPRESSION PREDICTI.docx
healdkathaleen
 
Cluster Based Web Search Using Support Vector Machine
Cluster Based Web Search Using Support Vector MachineCluster Based Web Search Using Support Vector Machine
Cluster Based Web Search Using Support Vector Machine
CSCJournals
 
Data Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZoneData Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZone
Doug Needham
 
OUTCOME ANALYSIS IN ACADEMIC INSTITUTIONS USING NEO4J
OUTCOME ANALYSIS IN ACADEMIC INSTITUTIONS USING NEO4JOUTCOME ANALYSIS IN ACADEMIC INSTITUTIONS USING NEO4J
OUTCOME ANALYSIS IN ACADEMIC INSTITUTIONS USING NEO4J
ijcsity
 
PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...
PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...
PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...
Journal For Research
 
Fuzzy AndANN Based Mining Approach Testing For Social Network Analysis
Fuzzy AndANN Based Mining Approach Testing For Social Network AnalysisFuzzy AndANN Based Mining Approach Testing For Social Network Analysis
Fuzzy AndANN Based Mining Approach Testing For Social Network Analysis
IJERA Editor
 
G5234552
G5234552G5234552
G5234552
IOSR-JEN
 
Graph-based Analysis and Opinion Mining in Social Network
Graph-based Analysis and Opinion Mining in Social NetworkGraph-based Analysis and Opinion Mining in Social Network
Graph-based Analysis and Opinion Mining in Social Network
Khan Mostafa
 
How To Make Linked Data More than Data
How To Make Linked Data More than DataHow To Make Linked Data More than Data
How To Make Linked Data More than Data
Artificial Intelligence Institute at UofSC
 

Similar to Distributed Link Prediction in Large Scale Graphs using Apache Spark (20)

A Survey On Link Prediction In Social Networks
A Survey On Link Prediction In Social NetworksA Survey On Link Prediction In Social Networks
A Survey On Link Prediction In Social Networks
 
Poster Abstracts
Poster AbstractsPoster Abstracts
Poster Abstracts
 
The Linked Data Advantage
The Linked Data AdvantageThe Linked Data Advantage
The Linked Data Advantage
 
Social Network Analysis Introduction including Data Structure Graph overview.
Social Network Analysis Introduction including Data Structure Graph overview. Social Network Analysis Introduction including Data Structure Graph overview.
Social Network Analysis Introduction including Data Structure Graph overview.
 
Sub-Graph Finding Information over Nebula Networks
Sub-Graph Finding Information over Nebula NetworksSub-Graph Finding Information over Nebula Networks
Sub-Graph Finding Information over Nebula Networks
 
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKINGINTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
 
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKINGINTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
 
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED  ON SEMANTIC TAG RANKINGINTELLIGENT SOCIAL NETWORKS MODEL BASED  ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
 
Organizational Overlap on Social Networks and its Applications
Organizational Overlap on Social Networks and its ApplicationsOrganizational Overlap on Social Networks and its Applications
Organizational Overlap on Social Networks and its Applications
 
Annotating Search Results from Web Databases
Annotating Search Results from Web Databases Annotating Search Results from Web Databases
Annotating Search Results from Web Databases
 
Searching for patterns in crowdsourced information
Searching for patterns in crowdsourced informationSearching for patterns in crowdsourced information
Searching for patterns in crowdsourced information
 
Running head DEPRESSION PREDICTION DRAFT1DEPRESSION PREDICTI.docx
Running head DEPRESSION PREDICTION DRAFT1DEPRESSION PREDICTI.docxRunning head DEPRESSION PREDICTION DRAFT1DEPRESSION PREDICTI.docx
Running head DEPRESSION PREDICTION DRAFT1DEPRESSION PREDICTI.docx
 
Cluster Based Web Search Using Support Vector Machine
Cluster Based Web Search Using Support Vector MachineCluster Based Web Search Using Support Vector Machine
Cluster Based Web Search Using Support Vector Machine
 
Data Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZoneData Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZone
 
OUTCOME ANALYSIS IN ACADEMIC INSTITUTIONS USING NEO4J
OUTCOME ANALYSIS IN ACADEMIC INSTITUTIONS USING NEO4JOUTCOME ANALYSIS IN ACADEMIC INSTITUTIONS USING NEO4J
OUTCOME ANALYSIS IN ACADEMIC INSTITUTIONS USING NEO4J
 
PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...
PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...
PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...
 
Fuzzy AndANN Based Mining Approach Testing For Social Network Analysis
Fuzzy AndANN Based Mining Approach Testing For Social Network AnalysisFuzzy AndANN Based Mining Approach Testing For Social Network Analysis
Fuzzy AndANN Based Mining Approach Testing For Social Network Analysis
 
G5234552
G5234552G5234552
G5234552
 
Graph-based Analysis and Opinion Mining in Social Network
Graph-based Analysis and Opinion Mining in Social NetworkGraph-based Analysis and Opinion Mining in Social Network
Graph-based Analysis and Opinion Mining in Social Network
 
How To Make Linked Data More than Data
How To Make Linked Data More than DataHow To Make Linked Data More than Data
How To Make Linked Data More than Data
 

Recently uploaded

GreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-JurisicGreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-Jurisic
Green Software Development
 
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
kalichargn70th171
 
socradar-q1-2024-aviation-industry-report.pdf
socradar-q1-2024-aviation-industry-report.pdfsocradar-q1-2024-aviation-industry-report.pdf
socradar-q1-2024-aviation-industry-report.pdf
SOCRadar
 
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit ParisNeo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j
 
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
Alina Yurenko
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
Fermin Galan
 
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, FactsALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
Green Software Development
 
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket ManagementUtilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate
 
Microservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we workMicroservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we work
Sven Peters
 
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdfAutomated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
timtebeek1
 
openEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain SecurityopenEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain Security
Shane Coughlan
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
Neo4j
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
rickgrimesss22
 
OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024
OpenMetadata
 
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
mz5nrf0n
 
Oracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptxOracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptx
Remote DBA Services
 
Fundamentals of Programming and Language Processors
Fundamentals of Programming and Language ProcessorsFundamentals of Programming and Language Processors
Fundamentals of Programming and Language Processors
Rakesh Kumar R
 
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissancesAtelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Neo4j
 
Empowering Growth with Best Software Development Company in Noida - Deuglo
Empowering Growth with Best Software  Development Company in Noida - DeugloEmpowering Growth with Best Software  Development Company in Noida - Deuglo
Empowering Growth with Best Software Development Company in Noida - Deuglo
Deuglo Infosystem Pvt Ltd
 
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
Łukasz Chruściel
 

Recently uploaded (20)

GreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-JurisicGreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-Jurisic
 
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
 
socradar-q1-2024-aviation-industry-report.pdf
socradar-q1-2024-aviation-industry-report.pdfsocradar-q1-2024-aviation-industry-report.pdf
socradar-q1-2024-aviation-industry-report.pdf
 
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit ParisNeo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
 
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)GOING AOT WITH GRAALVM FOR  SPRING BOOT (SPRING IO)
GOING AOT WITH GRAALVM FOR SPRING BOOT (SPRING IO)
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
 
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, FactsALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
 
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket ManagementUtilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
 
Microservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we workMicroservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we work
 
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdfAutomated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
 
openEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain SecurityopenEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain Security
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
 
OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024
 
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
 
Oracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptxOracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptx
 
Fundamentals of Programming and Language Processors
Fundamentals of Programming and Language ProcessorsFundamentals of Programming and Language Processors
Fundamentals of Programming and Language Processors
 
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissancesAtelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissances
 
Empowering Growth with Best Software Development Company in Noida - Deuglo
Empowering Growth with Best Software  Development Company in Noida - DeugloEmpowering Growth with Best Software  Development Company in Noida - Deuglo
Empowering Growth with Best Software Development Company in Noida - Deuglo
 
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
 

Distributed Link Prediction in Large Scale Graphs using Apache Spark

  • 1. Distributed Link Prediction in Large Scale Graphs using Apache Spark Anastasios Theodosiou 1 Aristotle University of Thessaloniki, Thessaloniki 54621, GREECE anastasios.theodosiou@gmail.com Abstract. Social networks like Facebook, Instagram, Twitter, and LinkedIn have become an integral part of our everyday life. Through these, users can share dig- ital content (links, photos, videos), express or share their opinions, and expand their social circle by making new friends. All these user interactions lead to the evolution and development of these networks over time. A typical example of link prediction is some of the services offered by these networks to their users. An essential service for them is to support their users with suggestions for new friendships based on their existing network, as well as their preferences resulting from their interactions with the network. Link prediction techniques attempt to predict the possibility of a future connection between two nodes on a given net- work. Beyond social networks, link prediction has a broad scope. Some of these are, in e-commerce, genetics, and security. Due to the massive amounts of data that is collected today, the need for scalable approaches arises to this problem. The purpose of this diploma thesis is to experiment and use various techniques of machine learning, both supervised and unsupervised, to predict links to a net- work of academic papers using document similarity metrics based on the charac- teristics of the nodes but also other structural features, based on the network. Ex- perimentation and implementation of the application took place using Apache Spark to manage the large data volume using the Scala programming language. Keywords: Link prediction, Data mining, Machine Learning, Apache Spark, Graphs, Online Social Networks, Recommender Systems 1 The link prediction problem In order to better understand precisely what the link prediction problem is, a brief ex- ample will be given. 
Suppose there is a network whose nodes represent individuals, with links between them representing relationships or interactions. Given a network with these features, how can we predict its evolution in the future? Alternatively, how can we predict the creation of new edges or the deletion of existing ones? By studying the evolution of social networks over time, we can understand how one node interacts with another. To carry out this study, we need many different snapshots of the network structure over time, so the volume of data that we need to collect and process grows rapidly. Therefore, finding scalable approaches for its parallel processing becomes necessary. Some other real-world examples of link
predictions include suggesting friends and followers on a social networking site, recommending relevant products to customers, or providing suggestions to professionals for teamwork based on their field of study or their interests. We can therefore state the link prediction problem as estimating the probability of a future edge between two nodes.

1.1 Social networks and the difficulty of link prediction

A (social) network can be represented by a graph G(V, E), where V is the set of its vertices and E the set of its edges. The number of possible connections in such a network is |V| * (|V| - 1) / 2. The network we study in this work consists of 27,770 nodes. If we wanted to compute all possible edges and suggest new ones based on some document similarity metric (e.g., Jaccard similarity), we would have to check 385,572,565 candidate edges. This number is quite large even for a relatively small network such as the one in this work. However, social networks are sparse, so there is no need to pick edges at random and test for their existence in the network. Because the number of possible connections is so large, we need alternative, more efficient approaches to predicting them. Major social platforms such as Facebook, Twitter, Instagram, and LinkedIn offer, as one of their primary services, suggestions of new links in the form of a new "social friendship." High accuracy in such predictions can help us understand which factors drive the evolution of these networks and provide more accurate and meaningful suggestions. The social network studied in this work is a network of academic papers, where each paper cites other papers. A classic method for proposing collaboration on such a network is through the bibliography system. However, a new proposal based on this system alone could not be entirely accurate.
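The candidate-edge count and the Jaccard metric mentioned above can be sketched in a few lines of Scala (the language used in this work). This is an illustrative sketch, not the thesis code; representing a document as a set of tokens is an assumption for the example.

```scala
// Jaccard similarity between two documents represented as token sets:
// |A ∩ B| / |A ∪ B|; returns 0.0 for two empty sets by convention.
def jaccard(a: Set[String], b: Set[String]): Double = {
  val unionSize = (a union b).size
  if (unionSize == 0) 0.0 else (a intersect b).size.toDouble / unionSize
}

// Number of candidate undirected edges among n vertices: n(n-1)/2.
def possibleEdges(n: Long): Long = n * (n - 1) / 2
```

For the 27,770-node network, possibleEdges(27770) yields 385,572,565, the number of candidate edges quoted above.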
We need to extend and enrich this method with more data or new techniques so that we can achieve greater accuracy in our recommendations. For example, we can use methods based on the content and structure of the documents.

2 Graphs and social networks

Graphs provide a natural way to deal with abstract concepts such as relationships and interactions in a network, and they offer an intuitive, visual way of thinking about these concepts. They also form a natural basis for analyzing relationships in a social context. Over time, graphs have become increasingly common in data science. Graph databases are now established computing tools and alternatives to SQL and NoSQL databases. Concepts from graph theory are used to study and model social networks, fraud patterns, energy consumption patterns, influence within a social network, and many other areas of application. Social Network Analysis (SNA) is probably the best-known application of graph theory in data science. Graphs are also used in clustering algorithms (e.g., K-Means). There are therefore many reasons to use graphs, and many fields of application. From a computer science perspective, graphs can offer computational efficiency: the Big-O complexity of some algorithms is better for data arranged in graph form than for the same data in tabular form.
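To make the graph format concrete, the following Scala sketch (illustrative, not the thesis code) builds an adjacency-list representation from an edge list. This is the structure that makes neighborhood queries a single map lookup rather than a scan over tabular data.

```scala
// Build an undirected adjacency list from an edge list.
// Looking up a node's neighborhood then costs one map access.
def adjacency(edges: Seq[(Long, Long)]): Map[Long, Set[Long]] =
  edges
    .flatMap { case (a, b) => Seq(a -> b, b -> a) } // undirected: store both directions
    .groupBy(_._1)
    .map { case (node, pairs) => node -> pairs.map(_._2).toSet }
```

For example, adjacency(Seq((1L, 2L), (2L, 3L))) maps node 2 to the neighbor set {1, 3}.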
3 Link prediction and locality sensitive hashing

The problem of finding identical or duplicate documents based on a similarity metric seems relatively straightforward: using a hash function, the work can be completed very quickly. However, the problem becomes more complicated if we want to find similar documents that contain spelling mistakes or even different words. A brute-force technique can find such documents and predict links with higher accuracy, but it does not scale. The Locality Sensitive Hashing (LSH) algorithm, on the other hand, can be used for the same problems and yields approximate results in much less time than brute force. In our problem, LSH can suggest an edge between two nodes if the similarity of the two documents is above a given threshold. More generally, LSH relies on a family of functions, known as an LSH family, that hashes the data into buckets so that documents with high similarity are hashed into the same bucket. The general idea is to find an algorithm that, given two document signatures, can tell us whether the two nodes form a candidate pair, i.e., whether their similarity exceeds the given threshold. For the MinHashing part, there are two necessary steps: first, we hash the columns of the signature matrix with several hash functions, and then we check whether two documents are hashed into the same bucket by at least one of those functions. If so, we accept the two documents as a candidate pair. Regarding link prediction, if the Jaccard similarity of the two documents is above the given threshold, we can conclude that there is a potential edge between them.

4 Suggested approach and results

The network we studied in this work is composed of 27,770 nodes («papers») and 352,857 edges.
Each node carries a set of attributes: the document id, the publication date, the title, the authors, the journal, and the abstract of the paper. A second file, the «edge list», contained all the edges of the network used during our experiments. The aim of this work was to perform link prediction between the papers described above. An edge between two nodes exists when at least one node points to the other by referencing it through the bibliographic system. Our proposed approach does not take such references into account, nor does it use the metrics mentioned in chapter two; instead, it relies on the similarity of the records based on their attributes, through the Jaccard similarity metric, and on other structural features of the network. Two different approaches were used: in the first, the problem was treated as binary classification, while in the second, two different unsupervised machine learning techniques were applied: the brute-force method, and Locality Sensitive Hashing through the MinHashLSH algorithm provided by Apache Spark.
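As a concrete illustration of the MinHash-and-banding scheme described in Section 3, here is a minimal, self-contained sketch in plain Scala. It is a simplified stand-in for Spark's MinHashLSH, not the thesis implementation; using seeded MurmurHash3 as the hash family is an assumption for the example, and token sets are assumed non-empty.

```scala
import scala.util.hashing.MurmurHash3

// MinHash signature: for each of k seeded hash functions, keep the minimum
// hash value over the document's tokens. Assumes a non-empty token set.
def minHashSignature(tokens: Set[String], numHashes: Int): Vector[Int] =
  Vector.tabulate(numHashes) { seed =>
    tokens.iterator.map(t => MurmurHash3.stringHash(t, seed)).min
  }

// The fraction of agreeing signature positions estimates Jaccard similarity.
def estimatedJaccard(a: Vector[Int], b: Vector[Int]): Double =
  a.zip(b).count { case (x, y) => x == y }.toDouble / a.length

// LSH banding: split the signature into bands of equal size; each band
// produces a bucket key of the form (band index, band hash).
def bandKeys(sig: Vector[Int], numBands: Int): Set[(Int, Int)] = {
  val rowsPerBand = sig.length / numBands
  sig.grouped(rowsPerBand).zipWithIndex
     .map { case (band, i) => (i, band.hashCode) }
     .toSet
}

// Two documents are a candidate pair if any band lands in the same bucket.
def candidatePair(a: Vector[Int], b: Vector[Int], numBands: Int): Boolean =
  (bandKeys(a, numBands) intersect bandKeys(b, numBands)).nonEmpty
```

In the full pipeline, only candidate pairs are checked against the Jaccard threshold, which is what avoids the brute-force scan over all 385 million candidate edges.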
4.1 Supervised link prediction approach

As already mentioned, the problem was treated as binary classification. Several machine learning models were therefore used, each based on a particular classifier. All models were tested on a four-core system with 8 GB of RAM. In the first phase, the datasets for both nodes and edges were loaded. Once this process was complete, a join operation between the two files produced our initial dataframe, which contained the ids of the two papers involved in an edge, together with all the other attributes characterizing those nodes. A tokenization procedure was then performed on each column of the dataframe, converting all texts into bags of words. Next, all stop words were removed so that the Jaccard similarity would not be affected by them. At this point, the features that each classifier would consider during its training phase were computed. They come mainly from the attributes of the nodes, but also from structural features of the network: (a) the time difference in publication between the two papers, (b) the title overlap, (c) the authors overlap, (d) the journal overlap, and (e) the abstract overlap. Three more structural features concerning the nodes were added: (f) the common neighbors of the two endpoints of each edge, (g) the sum of the triangles to which each endpoint belongs, and (h) the PageRank score of each node in the network. We then applied the Chi-Squared test of independence, with the help of the ChiSqSelector class of Apache Spark, to determine whether there is a significant relationship between two categorical features. Based on this test and other experiments, we decided not to use the PageRank feature, as its contribution to the final accuracy and F1 was found to be almost zero.
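The per-edge feature vector described above can be sketched as follows. The case class and function names are hypothetical (the thesis built these features as Spark dataframe columns), but the computed quantities mirror items (a) through (f); triangle counts and PageRank, which require the full graph, are omitted from this sketch.

```scala
// Hypothetical record for a paper node; fields mirror the dataset attributes.
case class Paper(id: Long, year: Int, title: Set[String], authors: Set[String],
                 journal: Set[String], abstractText: Set[String])

// Features (a)-(e) from node attributes, plus (f) common neighbors.
def edgeFeatures(a: Paper, b: Paper,
                 neighbors: Map[Long, Set[Long]]): Vector[Double] = {
  def overlap(x: Set[String], y: Set[String]) = (x intersect y).size.toDouble
  val commonNeighbors =
    (neighbors.getOrElse(a.id, Set.empty[Long]) intersect
     neighbors.getOrElse(b.id, Set.empty[Long])).size.toDouble
  Vector(
    math.abs(a.year - b.year).toDouble,      // (a) publication time difference
    overlap(a.title, b.title),               // (b) title overlap
    overlap(a.authors, b.authors),           // (c) authors overlap
    overlap(a.journal, b.journal),           // (d) journal overlap
    overlap(a.abstractText, b.abstractText), // (e) abstract overlap
    commonNeighbors                          // (f) common neighbors
  )
}
```

Each classifier below is then trained on vectors of this shape, one per labeled edge.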
Finally, the data were divided into two parts: 70% for the training phase and the remaining 30% for the test phase.

Naïve Bayes Classifier. The first classifier used was Naïve Bayes. Several tests were performed to select the threshold of this algorithm. For this dataset and the selected features, Naïve Bayes gave the best results with a threshold of 0.5 (50%). Table 1 describes the results of this algorithm.

Table 1. Results from the Naïve Bayes classifier

  Dataset Split   Accuracy   F1        Exec. Time (sec)
  70/30           0.58614    0.58876   1090.06

Logistic Regression Classifier. Since the results of Naïve Bayes were not very good, other classifiers were tested. One of these was the Logistic Regression classifier. As with the previous model, we tested different feature sets here as well, and tests were also performed for different numbers of iterations. The first test contained only the features derived from the nodes' attributes: (a) the time difference of publishing, (b) the Jaccard overlap of titles, (c) the overlap of authors, (d) the overlap of the journal, and (e) the overlap of the abstract. The results of this algorithm for these features are shown in Table 2.

Table 2. Logistic regression results with attributes based on the nodes

  Max. Iterations   Accuracy   F1        Exec. Time (sec)
  10                0.79890    0.79812   1694.76
  100               0.79890    0.79947   1654.78
  1000              0.79713    0.79957   1628.16
  10000             0.79723    0.79778   1807.94

Although this model achieves higher accuracy and F1 than the previous one, the next test showed that adding the structural features improves the results further, as shown in Table 3.

Table 3. Logistic regression results with node and structural features

  Max. Iterations   Accuracy   F1        Exec. Time (sec)
  10                0.93518    0.93559   959.72
  100               0.93561    0.93600   1002.28

We note that adding the structural features significantly increased the accuracy of the algorithm and reduced its overall execution time. The next model we experimented with was the Linear SVM.

Linear SVM Classifier. This model was tested, like the previous models, on the same set of features. Experiments were also performed for different values of the maximum number of iterations and of the RegParam parameter. The test results are shown in Table 4.

Table 4. Linear SVM results

  Max Iterations   RegParam   Accuracy   F1        Exec. Time (sec)
  10               0.1        0.85967    0.85967   934.15
  100              0.1        0.88044    0.88152   1124.26
  10               0.3        0.84362    0.84355   893.23
  100              0.3        0.85683    0.85821   1313.11

The tests showed that the Linear SVM algorithm works best with MaxIterations set to 100 and RegParam equal to 0.1. Next in the series of models was the Multilayer Perceptron classifier.

Multilayer Perceptron Classifier. This classifier is based on neural networks. Many experiments were performed here, both on the maximum number of iterations of the algorithm and on the number of layers.
Extra parameters were tested, but these two affected the result more than any other. The results are presented in Table 5.

Table 5. MLPC results based on the number of iterations and layers

Max Iterations   Layers      Accuracy   F1        Exec. Time (sec)
100              13,10,7,2   0.87953    0.87951   1007.67
200              13,10,7,2   0.94770    0.94776   1106.78
400              13,7,4,2    0.95187    0.95205   1347.12

For this data set and the chosen features, the best results were obtained with the maximum number of iterations at 400 and the layers 13,7,4,2. The next model we tested was the Decision Tree.

Decision Tree Classifier. The most significant difference in this classifier's behavior came from the maximum depth allowed for the tree. The results for this parameter are shown in Table 6.

Table 6. Decision tree classifier results

Max Depth   Accuracy   F1        Exec. Time (sec)
4           0.95116    0.95129   1302.87
8           0.95300    0.95314   1308.23
16          0.94262    0.94279   1177.16
30          0.92497    0.92494   1342.28

As the table shows, this model achieves higher accuracy and F1 than all previous classifiers. The best value for this model came with a maximum depth of 8. Finally, the Random Forest algorithm was used in our experiments.

Random Forest Classifier. The sixth and last classifier we used to solve the link prediction problem was the Random Forest algorithm, an ensemble method built on Decision Trees. One advantage is that it combines multiple decision trees to avoid overfitting. We experimented with two basic parameters: the maximum depth of the trees and the total number of trees. The results of this test are described in Table 7.

Table 7. Random forest classifier results

Max Depth   Num. Trees   Accuracy   F1        Exec. Time (sec)
4           10           0.95066    0.95077   1314.01
8           10           0.95580    0.95591   1191.91
4           100          0.95058    0.95068   1262.46
8           100          0.95527    0.95538   1230.55

The Random Forest model achieves the highest accuracy and F1 of all the models mentioned above. The best values for accuracy and F1 are obtained with 10 trees and a maximum depth per tree of 8.

Model comparison. Summarizing the above results, the classifiers were compared in terms of accuracy and F1. Figure 1 shows the accuracy and F1 per model.

Figure 1. Comparison of the classifiers

As far as execution time is concerned, the Logistic Regression model has the shortest time, with a total completion time of 1002.28 seconds, while the Random Forest model requires 189.69 seconds longer. The difference in the run time of the six classifiers is shown in Figure 2.

Figure 2. The execution time of the six classifiers

4.2 Unsupervised link prediction approach

From the perspective of unsupervised machine learning, the problem of link prediction was addressed with two different but widely used techniques. Our approach differs from most of the related work on this problem in how we deal with
it. Nearly the same data preprocessing techniques were applied as in the previous chapter. The main difference is that a bag of words was created for, and associated with, each node. In more detail, for each paper we tokenized each column into words and then concatenated all the dataframe columns into one. In the next phase, all stop words were removed so that the Jaccard similarity would not be affected. Once the data had been prepared, we proceeded to predict new links with two different techniques. The first technique tested was brute force; the second was the Locality Sensitive Hashing (LSH) algorithm in combination with MinHashing. It is worth noting that these experiments were carried out on a cluster of 80 cores. All tests used at most 64 cores and 32 GB of RAM. Because of hardware constraints, and more specifically random access memory issues, the experiments were performed on a subset of the original data set.

Brute force prediction. Initially, a join operation was performed on the data in order to create all the possible edges that may occur. This process is relatively slow, but it only needs to take place once, and its result can then be reused as is. After that, the Jaccard similarity was calculated for all candidate edges; the maximum Jaccard similarity found was 0.4973. Once this process had completed, we set a threshold on the Jaccard similarity so that edges with a similarity greater than or equal to it were selected. The run time of the algorithm grew rapidly, roughly quadratically, as the number of nodes in the data set increased, since all node pairs are examined. Table 8 shows cumulative results of the algorithm: the accuracy, the number of candidate pairs, the total number of checks performed, and the algorithm's execution time.

Table 8. Aggregative results of the brute force algorithm execution

Nodes   Checks      Candidates   Accuracy   Exec. Time (sec)
1000    499500      3916         0.9368     62.89
2000    1999000     14055        0.9662     161.04
5000    12497500    74302        0.9711     566.73
7000    24496500    106534       0.9789     1446.98

Figure 3 shows the change in the algorithm's accuracy as the number of nodes in the network grows.

Figure 3. The increase of accuracy based on the dataset volume
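The preprocessing and brute-force steps above can be sketched in plain Python. This is an illustrative stand-in for the Spark/Scala implementation used in the thesis; the stop-word list, sample papers, and threshold are assumptions made up for the example:

```python
from itertools import combinations

STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}  # tiny illustrative list

def bag_of_words(text):
    """Tokenize a paper's concatenated columns and drop stop words."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}

def jaccard(a, b):
    """Jaccard similarity of two token sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def brute_force_links(papers, threshold):
    """Check every node pair -- n*(n-1)/2 comparisons, which is what makes
    the run time grow quadratically -- and keep pairs meeting the threshold."""
    bags = {pid: bag_of_words(text) for pid, text in papers.items()}
    edges = []
    for u, v in combinations(sorted(bags), 2):
        s = jaccard(bags[u], bags[v])
        if s >= threshold:
            edges.append((u, v, s))
    return edges

papers = {
    1: "Link prediction in large graphs",
    2: "Distributed link prediction in large scale graphs",
    3: "A survey of recommender systems",
}
print(brute_force_links(papers, 0.3))  # only the pair (1, 2) passes the threshold
```

The join that produces all candidate pairs corresponds to `combinations` here; on 1000 nodes it already yields the 499,500 checks reported in Table 8.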
This technique is generally fairly accurate, but extremely time-consuming and entirely dependent on system resources. For this reason, techniques are needed that can produce results within a reasonable time. MinHashing combined with the Locality Sensitive Hashing algorithm addresses this problem.

MinHashLSH prediction. The basic idea behind this algorithm is that it uses MinHashing in conjunction with LSH so that documents with a high similarity index are hashed into the same bucket, while those with a small index land in different ones. In general, the workflow is the same as for the brute force algorithm, except that here the data are joined on the Jaccard distance rather than the Jaccard index. So if we want to require 60% Jaccard similarity between our documents, we should set a Jaccard distance equal to 1 − Jaccard similarity, i.e., the two documents should be at most 40% apart. Table 9 lists some results from the various experiments performed with this method, using a Jaccard distance of 0.8, as this was the value that provided the most accurate results.

Table 9. MinHashLSH scores relative to the number of hash tables

Hash Tables   Candidates   Precision   Recall   Accuracy   F1        Time (sec)
2             986          0.7261      0.0120   0.97133    0.02370   26.19
4             1610         0.7304      0.0385   0.97147    0.03853   31.12
8             3026         0.6265      0.0607   0.97147    0.06072   70.20
16            3628         0.5975      0.0364   0.97148    0.06877   111.48
32            3824         0.5983      0.0385   0.97147    0.07246   211.95
64            3840         0.5968      0.0385   0.97149    0.07246   514.83
128           3840         0.5968      0.0385   0.97151    0.07251   1344.72

We notice that as the number of hash tables grows, the accuracy of the results increases, while the algorithm's execution time grows roughly linearly with the number of hash tables. Figure 4 illustrates the change in the algorithm's accuracy relative to the number of hash tables.

Figure 4. The accuracy of MinHashLSH vs. hash tables

As regards the evaluation of the unsupervised techniques, since we did not have a classifier or regressor, a function was implemented that computes TP, FP, TN, and FN by comparing the results of the algorithms with the original graph of the network.
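Both steps can be sketched in plain Python: the MinHash/LSH bucketing that generates candidate pairs, and the TP/FP/TN/FN comparison against the true graph. This is an illustrative stand-in for Spark's MinHashLSH, not the thesis implementation; the signature length, band count, and sample documents are assumptions:

```python
import random
import zlib
from collections import defaultdict
from itertools import combinations

random.seed(0)
NUM_HASHES = 16  # MinHash signature length (illustrative choice)
BANDS = 8        # LSH bands; 2 signature rows per band
PRIME = 2_147_483_647

# Random affine hash functions h(x) = (a*x + b) mod PRIME
HASH_FUNCS = [(random.randrange(1, PRIME), random.randrange(PRIME))
              for _ in range(NUM_HASHES)]

def minhash_signature(tokens):
    """For each hash function, keep the minimum hash over the token set;
    the fraction of agreeing positions estimates the Jaccard similarity."""
    ids = [zlib.crc32(t.encode()) for t in tokens]
    return tuple(min((a * x + b) % PRIME for x in ids) for a, b in HASH_FUNCS)

def lsh_candidates(bags):
    """Split each signature into bands; documents sharing any band bucket
    become a candidate pair (similar documents collide with high probability)."""
    rows = NUM_HASHES // BANDS
    buckets = defaultdict(set)
    for doc, tokens in bags.items():
        sig = minhash_signature(tokens)
        for band in range(BANDS):
            buckets[(band, sig[band * rows:(band + 1) * rows])].add(doc)
    pairs = set()
    for docs in buckets.values():
        pairs.update(combinations(sorted(docs), 2))
    return pairs

def evaluate(predicted, actual, all_pairs):
    """Compare predicted edges with the true graph to get TP/FP/TN/FN,
    then derive precision, recall, accuracy, and F1."""
    tp = len(predicted & actual)
    fp = len(predicted - actual)
    fn = len(actual - predicted)
    tn = len(all_pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(all_pairs)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, accuracy, f1
```

Candidate pairs from the buckets would then be verified with the exact Jaccard similarity (kept when the Jaccard distance 1 − s is below the chosen limit), which is how the false positives that the buckets produce are filtered.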
Then the Precision and Recall metrics were calculated, and from these we arrived at Accuracy and F1. Many more experiments and tests were carried out; they are available in the full version of the diploma thesis.

5 Conclusion

The problem of link prediction has a wide range of applications in different areas. In this diploma thesis, we studied both supervised and unsupervised machine learning techniques. After many experiments and trials, given the data at our disposal and the way we chose to address the problem, we concluded that, among the supervised techniques, the model based on the Random Forest classifier is the best solution to the problem. On the unsupervised side, the MinHashLSH method was chosen, as it is much faster, produces quite good results, and comes very close to the accuracy of the brute force technique. However, it requires care, as it generates many false positives.

6 Future work

As future work, we will address the problem of link prediction from a different viewpoint. We will re-examine the same network, this time with a technique based on clustering. This approach groups similar nodes into a "cluster", under the assumption that nodes in the same cluster exhibit a similar connectivity pattern. In more detail, with this method we will initially set a threshold θ and then remove all edges of the graph whose weight is below that threshold. Each connected component of the resulting graph will then correspond to a cluster; in general, two nodes are in the same connected component if and only if there is a path between them. On the supervised machine learning side, we will try techniques based purely on neural networks, with more complex data preprocessing, and we hope to achieve even better results in less execution time.
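The threshold-θ, connected-components procedure described above can be sketched with a union-find structure. This is an illustrative plain-Python example; the nodes, edge weights, and θ below are made up, not thesis data:

```python
def threshold_clusters(nodes, weighted_edges, theta):
    """Drop edges with weight below theta; each connected component of the
    remaining graph is one cluster (found via union-find)."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the root, halving paths as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Union the endpoints of every edge that survives the threshold.
    for u, v, w in weighted_edges:
        if w >= theta:
            parent[find(u)] = find(v)

    # Group nodes by their component root.
    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), set()).add(n)
    return list(clusters.values())

nodes = ["a", "b", "c", "d"]
edges = [("a", "b", 0.9), ("b", "c", 0.8), ("c", "d", 0.2)]
print(threshold_clusters(nodes, edges, 0.5))  # two clusters: {a, b, c} and {d}
```

The weak edge c-d (weight 0.2 < θ = 0.5) is discarded, so d falls into its own cluster; link prediction would then propose edges between nodes that share a cluster.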