This document summarizes an approach to distributed link prediction in large graphs using Apache Spark. It discusses using machine learning techniques such as locality sensitive hashing to predict links between nodes based on document similarity metrics and other structural features. The approach is tested on a graph of 27,770 academic papers linked by 352,857 citations. Both supervised and unsupervised machine learning methods are explored, including treating the task as a binary classification problem and using MinHash-based locality sensitive hashing (Spark's MinHashLSH) to efficiently handle the large data volumes. The results suggest this distributed approach can accurately predict new links in large graphs.
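As a rough illustration of the MinHash idea behind Spark's MinHashLSH (a plain-Python sketch with made-up hash parameters and toy word sets, not the paper's data or the Spark API):

```python
import random
import zlib

def minhash_signature(items, hash_funcs):
    """Signature = element-wise minimum of each hash function over the set."""
    return [min(h(x) for x in items) for h in hash_funcs]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

P = 2_147_483_647  # Mersenne prime for a simple universal hash family
rng = random.Random(42)
hash_funcs = [
    (lambda a, b: (lambda x: (a * zlib.crc32(x.encode()) + b) % P))(
        rng.randrange(1, P), rng.randrange(P))
    for _ in range(256)
]

# Two toy "documents" as word sets; true Jaccard is 5/7.
doc1 = {"graph", "link", "prediction", "spark", "citation", "paper"}
doc2 = {"graph", "link", "prediction", "spark", "citation", "cluster"}
sig1 = minhash_signature(doc1, hash_funcs)
sig2 = minhash_signature(doc2, hash_funcs)
est = estimated_jaccard(sig1, sig2)
print(round(est, 2))  # close to the true Jaccard 5/7 ~ 0.71
```

The point of the signature is that similarity can then be estimated from short fixed-length vectors instead of full sets, which is what makes the distributed, LSH-bucketed comparison tractable at scale.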
A basic explanation of graph mining for social network analysis (SNA), describing some metrics and benefits of SNA with a focus on the telecommunications field. A basic Spark with GraphX script for analysing the graph is also included in the slides.
Entity linking with a knowledge base: issues, techniques and solutions (CloudTechnologies)
The document discusses entity linking, which is the task of linking entity mentions in text to corresponding entries in a knowledge base. It presents challenges like name variations and ambiguity. The paper then surveys main approaches to entity linking and discusses applications like knowledge base population and question answering. It also covers evaluation of entity linking systems and directions for future work.
The document describes an entity linking system with three main modules: 1) Entity linking to map entity mentions in text to entities in a knowledge base, which is challenging due to name variations and ambiguity. 2) A knowledge base containing entities. 3) Candidate entity ranking to rank potential entities for a mention using evidence like supervised ranking methods. The system aims to address issues in entity linking like predicting unlinkable mentions.
An Efficient Modified Common Neighbor Approach for Link Prediction in Social ... (IOSR Journals)
This document discusses link prediction in social networks. It analyzes shortcomings of existing leading link prediction methods like common neighbor. It then proposes a modified common neighbor approach that takes into account both topological network structure and node similarities based on features. The approach generates a weight for each link based on the number of common features between nodes, divided by the total number of features. It then calculates a contribution score for each common neighbor by multiplying the weights of that neighbor's links to the two nodes. Experimental results on co-authorship networks show the modified common neighbor approach outperforms existing methods.
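A minimal sketch of the scoring scheme described above, assuming "total number of features" means the union of the two nodes' feature sets (the toy graph and feature sets are illustrative, not from the paper):

```python
def link_weight(features_a, features_b):
    """Weight of a link = shared features divided by all features of the pair."""
    union = features_a | features_b
    return len(features_a & features_b) / len(union) if union else 0.0

def modified_common_neighbor_score(graph, features, x, y):
    """Sum, over common neighbors z of (x, y), of weight(z, x) * weight(z, y)."""
    common = graph[x] & graph[y]
    return sum(link_weight(features[z], features[x]) *
               link_weight(features[z], features[y]) for z in common)

# Toy co-authorship graph (adjacency sets) and per-node feature sets.
graph = {"a": {"c", "d"}, "b": {"c", "d"}, "c": {"a", "b"}, "d": {"a", "b"}}
features = {"a": {"ml", "graphs"}, "b": {"ml", "nlp"},
            "c": {"ml"}, "d": {"graphs", "nlp"}}
score = modified_common_neighbor_score(graph, features, "a", "b")
print(score)
```

Compared with plain common-neighbor counting, a shared neighbor only contributes strongly here when its features genuinely overlap with both endpoints.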
Iaetsd similarity search in information networks using (Iaetsd Iaetsd)
The document proposes a novel meta-path based similarity measure called PathSim to find similar peer objects in heterogeneous information networks. PathSim captures peer similarity by measuring how strongly connected two objects are as well as how comparable their visibility is in the network. An efficient algorithm is also introduced to support online top-k queries for meta-path based similarity search using partial materialization and co-clustering based pruning. Experimental results on bibliographic networks extracted from DBLP and Flickr demonstrate the effectiveness of PathSim and efficiency of the proposed algorithms.
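The published PathSim definition for a symmetric meta-path is s(x, y) = 2·p(x, y) / (p(x, x) + p(y, y)), where p counts meta-path instances. A small sketch for an author-paper-author meta-path (the incidence matrix is a made-up toy, not DBLP data):

```python
def pathsim(M, x, y):
    """PathSim over a symmetric meta-path: 2*c(x,y) / (c(x,x) + c(y,y)),
    where c(i,j) counts meta-path instances between objects i and j.
    For author-paper-author, c is M composed with its transpose."""
    def count(i, j):
        return sum(M[i][k] * M[j][k] for k in range(len(M[0])))
    return 2 * count(x, y) / (count(x, x) + count(y, y))

# Toy author-by-paper incidence matrix (rows: authors, columns: papers).
M = [
    [1, 1, 0],  # author 0
    [1, 1, 1],  # author 1
    [0, 0, 1],  # author 2
]
print(pathsim(M, 0, 1))
```

The normalization by each object's self-path count is what makes this a "peer" measure: a hugely visible object cannot dominate just by being connected to everything.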
This document discusses how adding formal semantics to linked open data can make it more useful and powerful. It describes how existing linked data lacks formal semantics, limiting its capabilities. The document proposes two approaches: 1) Enriching linked data schemas using ontology matching techniques to capture relationships between datasets. 2) Developing a system called LOQUS that can perform federated queries across multiple linked datasets by decomposing queries and merging results. This would allow queries without needing intimate knowledge of each dataset's structure.
This document discusses methods for measuring semantic similarity between words. It begins by discussing how traditional lexical similarity measurements do not consider semantics. It then discusses several existing approaches that measure semantic similarity using web search engines and text snippets. These approaches calculate word co-occurrence statistics from page counts and analyze lexical patterns extracted from snippets. Pattern clustering is used to group semantically similar patterns. The approaches are evaluated using datasets and metrics like precision and recall. Finally, the document proposes a new method that combines page count statistics, lexical pattern extraction and clustering, and support vector machines to measure semantic similarity.
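The page-count statistics step can be illustrated with Jaccard- and PMI-style measures over hypothetical search-engine hit counts (the threshold constant and the counts below are assumptions for illustration, not figures from the document):

```python
import math

def web_jaccard(p, q, pq, c=5):
    """Jaccard over page counts: H(P and Q) / (H(P) + H(Q) - H(P and Q)).
    Returns 0 below a small co-occurrence threshold c to suppress noise."""
    return 0.0 if pq <= c else pq / (p + q - pq)

def web_pmi(p, q, pq, n, c=5):
    """Pointwise mutual information from page counts; n = total indexed pages."""
    if pq <= c:
        return 0.0
    return math.log2((pq / n) / ((p / n) * (q / n)))

# Hypothetical hit counts for two words and their conjunctive query.
p, q, pq, n = 1000, 2000, 500, 10**6
print(web_jaccard(p, q, pq))  # -> 0.2
print(round(web_pmi(p, q, pq, n), 2))
```

Measures like these supply the raw co-occurrence signal; the pattern-clustering and SVM stages then combine it with lexical evidence from snippets.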
UNIT V TEXT AND OPINION MINING
Text Mining in Social Networks - Opinion extraction - Sentiment classification and clustering - Temporal sentiment analysis - Irony detection in opinion mining - Wish analysis - Product review mining - Review classification - Tracking sentiments towards topics over time
Semantic similarity and semantic relatedness measures are particularly important in the current scenario, given the huge demand for natural language processing based applications such as chatbots and for information retrieval systems such as knowledge-base-driven FAQ systems. Current approaches generally use similarity measures that do not exploit the context-sensitive relationships between words. This leads to erroneous similarity predictions and is of little use in real-life applications. This work proposes a novel approach that gives an accurate relatedness measure for any two words in a sentence by taking their context into consideration. This context correction yields a more accurate similarity prediction, which in turn improves the accuracy of information retrieval systems.
This document discusses social data mining. It begins by defining data, information, and knowledge. It then defines data mining as extracting useful unknown information from large datasets. Social data mining is defined as systematically analyzing valuable information from social media, which is vast, noisy, distributed, unstructured, and dynamic. Common social media platforms are described. Graph mining and text mining are discussed as important techniques for social data mining. The generic social data mining process of data collection, modeling, and various mining methods is outlined. OAuth 2.0 authorization is also summarized as an important process for applications to access each other's data.
Data Mining In Social Networks Using K-Means Clustering Algorithm (nishant24894)
This topic covers the K-Means clustering algorithm, which is used to partition a data set into clusters based on similarities such as common interests, organizations, or colleges. It categorizes the data into clusters on the basis of mutual friendship.
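A plain-Python sketch of the k-means loop described above (the "users" and their two interest scores are invented for illustration; a real system would derive features from mutual friendships):

```python
def kmeans(points, k, iters=20):
    """Plain k-means: assign each point to its nearest centroid, then
    recompute each centroid as the mean of its assigned points."""
    centroids = list(points[:k])  # naive init: first k points (fine for a sketch)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        centroids = [tuple(sum(xs) / len(xs) for xs in zip(*cl))
                     if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids, clusters

# "Users" described by two interest scores each; two obvious groups.
points = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.1), (0.8, 0.9),
          (0.15, 0.15), (0.85, 0.85)]
centroids, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

Production use would add smarter initialization (e.g. k-means++) and a convergence check instead of a fixed iteration count.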
The document proposes an improved clustering algorithm for social network analysis. It combines BSP (Business System Planning) clustering with Principal Component Analysis (PCA) to group social network objects into classes based on their links and attributes. Specifically, it applies PCA before BSP clustering to reduce the dimensionality of the social network data and retain only the most important variables for clustering. This improves the BSP clustering results by focusing on the key information in the social network.
This document summarizes and compares different methods for performing keyword searches in relational databases. It discusses candidate network-based methods, Steiner-tree based algorithms, and backward expanding keyword search approaches. It also evaluates methods that aim to improve search efficiency and accuracy, such as integrating multiple related tuple units and developing structure-aware indexes. The overall goal is to find an effective and efficient approach to keyword search over relational database structures.
An updated look at social network extraction system a personal data analysis ... (eSAT Publishing House)
This document summarizes a study on analyzing personal social network data over time. The study extracted data from Facebook, calculated social network analysis metrics like degree distribution and betweenness centrality, and analyzed how the network changed dynamically over time. Key findings included identifying influential and non-influential users, detecting communities that formed within the network, and identifying the celebrity or most influential user within one person's local network. Analyzing how social networks and interactions change dynamically provides insights useful for applications like marketing and recommendations.
The growing number of datasets published on the Web as linked data brings opportunities for high data availability, but as the data grows, the challenges of querying it grow as well. Searching linked data with structured query languages is difficult, so keyword query search is used instead. In this paper, we propose different approaches for keyword query routing through which the efficiency of keyword search can be greatly improved. By routing keywords only to the relevant data sources, the processing cost of keyword search queries can be greatly reduced. We contrast and compare four models: keyword level, element level, set level, and query expansion using semantic and linguistic analysis. These models are used for keyword query routing in keyword search.
1) The document discusses a review of semantic approaches for nearest neighbor search. It describes using an ontology to add a semantic layer to an information retrieval system to relate concepts using query words.
2) A technique called spatial inverted index is proposed to locate multidimensional information and handle nearest neighbor queries by finding the hospitals closest to a given address.
3) Several semantic approaches are described including using clustering measures, specificity measures, link analysis, and relation-based page ranking to improve search and interpret hidden concepts behind keywords.
This document discusses predicting new friendships in social networks using temporal information. It describes research on predicting new links in social networks over time using supervised learning models trained on temporal features from past network interactions. The researchers used anonymized Facebook data over 28 months to train decision tree and neural network classifiers to predict new relationships, finding models using temporal information performed better than those without it.
Are Positive or Negative Tweets More "Retweetable" in Brazilian Politics? (Molly Gibbons, she/her)
This document summarizes an analysis of tweets containing the terms "Brazil" and "Michel Temer" to understand the political and economic scenario in Brazil. RapidMiner was used to collect tweets over 17 days and the Rosette Text Toolkit categorized tweets and analyzed sentiment. For "Michel Temer", there was a weak to moderate negative correlation between sentiment and retweets, and 75% of tweets were negative. For tweets about "Brazil" categorized as law/politics, 62% were negative and the most mentioned entities were the Senate, President, and Supreme Court. The analysis demonstrates how RapidMiner and Rosette can be used together to understand sentiment in social media posts about political topics.
Link Prediction in (Partially) Aligned Heterogeneous Social Networks (Sina Sajadmanesh)
This document discusses link prediction in homogeneous and heterogeneous social networks. It begins by introducing the problem of link prediction and its applications. It then discusses various unsupervised and supervised methods for link prediction in homogeneous networks. Next, it covers relationship prediction and collective link prediction in heterogeneous networks. It also discusses link prediction in aligned heterogeneous networks using link transfer and anchor link inference. Finally, it outlines future work on this topic.
This document discusses keyword query routing to identify relevant data sources for keyword searches over multiple structured and linked data sources. It proposes using a multilevel inter-relationship graph and scoring mechanism to compute relevance and generate routing plans that route keywords only to pertinent sources. This improves keyword search performance without compromising result quality. An algorithm is developed based on modeling the search space and developing a summary model to incorporate relevance at different levels and dimensions. Experiments showed the summary model preserves relevant information compactly.
This document presents a general framework for building classifiers and clustering models using hidden topics to deal with short and sparse text data. It analyzes hidden topics from a large universal dataset using LDA. These topics are then used to enrich both the training data and new short text data by combining them with the topic distributions. This helps reduce data sparseness and improves classification and clustering accuracy for short texts like web snippets. The framework is also applied to contextual advertising by matching web pages and ads based on their hidden topic similarity.
A Survey On Link Prediction In Social Networks (April Smith)
This document summarizes a survey on link prediction in social networks. It discusses how link prediction can be framed as a binary classification problem to predict whether a link will exist between two nodes in the future. It presents a framework for link prediction that uses topological features of the network, applies proximity measures between nodes, and uses supervised learning algorithms. The document discusses several papers on link prediction approaches, including initial clustering of related nodes before prediction and using a generalized clustering coefficient feature without explicit clustering.
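The proximity measures that such a framework typically feeds to a supervised learner can be sketched as follows (the toy graph is illustrative; common neighbors, Jaccard, and Adamic-Adar are standard choices, not necessarily the exact features of every surveyed paper):

```python
import math

def common_neighbors(g, x, y):
    """Number of neighbors shared by x and y."""
    return len(g[x] & g[y])

def jaccard(g, x, y):
    """Shared neighbors normalized by all neighbors of the pair."""
    union = g[x] | g[y]
    return len(g[x] & g[y]) / len(union) if union else 0.0

def adamic_adar(g, x, y):
    """Common neighbors weighted inversely by the log of their degree,
    so rare shared neighbors count for more."""
    return sum(1 / math.log(len(g[z]))
               for z in g[x] & g[y] if len(g[z]) > 1)

# Toy undirected graph as adjacency sets.
g = {"a": {"c", "d"}, "b": {"c", "d"},
     "c": {"a", "b", "d"}, "d": {"a", "b", "c"}}
features = [common_neighbors(g, "a", "b"),
            jaccard(g, "a", "b"),
            adamic_adar(g, "a", "b")]
print(features)
```

Each candidate node pair yields one such feature vector, labeled by whether a link later appeared, which is exactly the binary-classification framing the survey describes.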
This document summarizes research posters being presented at a computer science and electrical engineering department research review. It describes 8 posters presented by BS, MS, and PhD students. The posters cover topics such as identifying political affiliations in blogs, statistically weighted visualization hierarchies, voter verifiable optical-scan voting, predictive caching in mobile networks, generating statistical volume models, predicting appropriate semantic web terms, approximating online social network community structure, and utilizing semantic policies for managing BGP route dissemination.
Social Network Analysis Introduction including Data Structure Graph overview (Doug Needham)
Social Network Analysis Introduction including Data Structure Graph overview. Given in Cincinnati August 18th 2015 as part of the DataSeed Meetup group.
Sub-Graph Finding Information over Nebula Networks (ijceronline)
Social and information networks have been extensively studied over the years. This paper studies a new query type: subgraph search over heterogeneous networks. Given an uncertain network of N objects, the problem of discovering the top-k subgraphs of entities with rare and surprising associations returns the k objects whose subgraphs best match the query. Evaluating such a query involves computing all matching subgraphs that satisfy the "Nebula computing requests" and ranking the results by the rarity and interestingness of the associations among the entities in each subgraph. In evaluating top-k selection queries, the information nebula is computed using a global structural context similarity, and the similarity measure is independent of connection subgraphs. Previous work on the matching problem can be harnessed for a naive match-then-rank approach, but in large graphs a subgraph query may have an enormous number of matches. The paper identifies several important properties of top-k selection queries and proposes novel top-k mechanisms that exploit indexes to answer interesting subgraph queries efficiently.
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING (dannyijwest)
Social networks have become one of the most popular platforms for users to communicate and share their interests without being in the same geographical location. The great and rapid growth of social media sites such as Facebook, LinkedIn, and Twitter produces a huge amount of user-generated content. Improving information quality and integrity has therefore become a great challenge for all social media sites, which must allow users to get the desired content or be linked to the best relation using improved search and linking techniques. Introducing semantics to social networks widens the representation of those networks. In this paper, a new model of social networks based on semantic tag ranking is introduced, built on the concept of multi-agent systems. In the proposed model, the representation of social links is extended by the semantic relationships found in the vocabularies known as tags in most social networks. The proposed social media engine is based on enhanced Latent Dirichlet Allocation (E-LDA) as a semantic indexing algorithm, combined with Tag Rank as the social network ranking algorithm. The E-LDA phase improves on LDA by optimizing its parameters, and a filter is introduced to enhance the final indexing output. In the ranking phase, applying Tag Rank to the indexing output improved the ranking results. Simulation results of the proposed model show improvements in both indexing and ranking output.
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING (IJwest)
The document presents a new model for intelligent social networks based on semantic tag ranking. It uses a multi-agent system approach with agents performing indexing and ranking. For indexing, it uses an enhanced Latent Dirichlet Allocation (E-LDA) model that optimizes LDA parameters. Tags above a threshold from E-LDA output are ranked using Tag Rank. Simulation results showed improvements in indexing and ranking over conventional methods. The model introduces semantics to social networks to improve search and link recommendation.
Organizational Overlap on Social Networks and its Applications (Sam Shah)
[This work was presented at WWW 2013.]
Online social networks have become important for networking, communication, sharing, and discovery. A considerable challenge these networks face is the fact that an online social network is partially observed because two individuals might know each other, but may not have established a connection on the site. Therefore, link prediction and recommendations are important tasks for any online social network. In this paper, we address the problem of computing edge affinity between two users on a social network, based on the users belonging to organizations such as companies, schools, and online groups. We present experimental insights from social network data on organizational overlap, a novel mathematical model to compute the probability of connection between two people based on organizational overlap, and experimental validation of this model based on real social network data. We also present novel ways in which the organization overlap model can be applied to link prediction and community detection, which in itself could be useful for recommending entities to follow and generating personalized news feed.
Searching for patterns in crowdsourced information (Silvia Puglisi)
This document introduces crowdsourcing and discusses discovering patterns in crowdsourced data. It discusses defining the context of volunteered information on the internet in order to understand relationships between data. A network model is proposed where different types of context define nodes and relationships between context determine edges. Properties of small world networks are discussed including how they could be used to model relationships between crowdsourced data and evaluate data quality. Finally, applications to search ranking, privacy and security are briefly mentioned.
Running head: DEPRESSION PREDICTION DRAFT (.docx) (healdkathaleen)
This paper explores using machine learning and natural language processing techniques to analyze social media posts and other online behaviors to detect levels of depression in individuals. Key approaches discussed include using k-means clustering and neural networks on sources like reviews, posts, and articles. Link mining and weighted network modeling are also used to understand relationships between online content and detect patterns associated with depression. The goal is to help identify individuals who may be depressed so counselors can better assist them.
Cluster Based Web Search Using Support Vector Machine (CSCJournals)
Nowadays, searches for the web pages of a person with a given name constitute a notable fraction of queries to Web search engines. This method exploits a variety of semantic information extracted from web pages. The rapid growth of the Internet has made the Web a popular place for collecting information, and Internet users today access billions of web pages online using search engines. Information on the Web comes from many sources, including websites of companies, organizations, and communities, personal homepages, etc. Effective representation of Web search results remains an open problem in the Information Retrieval community. For ambiguous queries, a traditional approach is to organize search results into groups (clusters), one for each meaning of the query. These groups are usually constructed according to the topical similarity of the retrieved documents, but it is possible for documents to be totally dissimilar and still correspond to the same meaning of the query. To overcome this problem, we exploit the fact that relevant Web pages are often located close to each other in the Web graph of hyperlinks. The paper presents a graphical approach for entity resolution and complements the traditional methodology with the analysis of the entity-relationship (ER) graph constructed for the dataset being analyzed. It also demonstrates a technique that measures the degree of interconnectedness between various pairs of nodes in the graph, which can significantly improve the quality of entity resolution. Support vector machines (SVMs), a set of related supervised learning methods used for classification, distribute the load of user queries from the server machine to different client machines so that the system remains stable; web pages are clustered based on their capacities, while the whole database is stored on the server machine. Keywords: SVM, cluster, ER.
Presentation given at DMZ about Data Structure Graphs.
Also known as Applying Social Network Analysis Techniques to Data Modeling and Data Architecture
OUTCOME ANALYSIS IN ACADEMIC INSTITUTIONS USING NEO4J (ijcsity)
Databases are an integral part of a computing system, and users heavily rely on the services they provide. When interacting with a computing system, we expect that data will be stored for future use, that the data can be looked up quickly, and that we can perform complex queries against the data stored in the database. Many different emerging database types are available, such as relational databases, object databases, key-value databases, graph databases, and RDF databases. Each type of database provides unique qualities that have applications in certain domains. Our work aims to investigate and compare the performance and scalability of relational databases and graph databases in handling multilevel queries, such as finding the impact of a particular subject on the working area of graduated students. MySQL was chosen as the relational database and Neo4j as the graph database.
PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE... (Journal For Research)
The document discusses performance evaluation of social network analysis algorithms using Apache Spark. It analyzes the performance of algorithms like PageRank, connected components, triangle counting and K-means clustering on different social network datasets. The results show that GraphX PageRank performs faster than the naive implementation in Spark. Connected components execution time grows super linearly initially and then fluctuates. Triangle counting time grows linearly with size. K-means clustering is tested using both naive and MLlib implementations in Spark.
Fuzzy And ANN Based Mining Approach Testing For Social Network Analysis (IJERA Editor)
Fast and appropriate Social Network Analysis (SNA) tools and techniques are required to collect and classify opinion scores on social network sites, as grouping based on wrong opinions may create problems for a society or country. Social Network Analysis (SNA) is a popular means for researchers, as the number of users and groups on social sites increases day by day, and a large group may influence others. In this paper, we recommend a hybrid model of opinion recommendation systems, for a single user and for a collective community respectively, based on social liking and influence network theory. By collecting data on users' social networks and preferences (likes), we designed an improved hybrid prototype to imitate social influence through liking and sharing information among groups. The significance of this paper is to analyze the suitability of ANN and fuzzy set methods, in a hybrid manner, for social web site classification. First, we intend to use Artificial Neural Network (ANN) techniques in social media data classification, using some contemporary methods different from the conventional methods of statistics and data analysis; next, we propagate the fuzzy approach as a way to overcome the uncertainty that is always present in social media analysis. We give a brief overview of the main ideas and recent results of social network analysis, and we point to relationships between the two approaches of social network analysis and classification. This research suggests a hybrid classification model built on fuzzy sets and artificial neural networks (HFANN). Information Gain and three popular social sites are used to collect information depicting features that are then used to train and test the proposed methods. This novel approach combines the advantages of ANN and fuzzy sets in classification accuracy, utilizing social data and the knowledge base available in the hate lexicons.
The document describes evaluating dynamic linking through the query process using the Licas test platform. It summarizes recent test results on the query test process that supports previous findings and shows the effectiveness of the linking mechanism in new test scenarios. The linking mechanism can dynamically link information sources based on query values. A query system generates random networks and queries to test how well the system optimizes query performance by reducing the search process while maintaining answer quality. Test results show the linking mechanism remains effective.
Graph-based Analysis and Opinion Mining in Social Network (Khan Mostafa)
This is the final report for Networks & Data Mining Techniques project focusing on mining social network to estimate public opinion about entities and associated keywords. This project mines Twitter for recent feeds and analyzes them to estimate sentiment score, discussed entity and describing keywords in each tweet. This data is then exploited to elicit overall sentiment associated with each entity. Entities and keywords extracted is also used to form an entity-keyword bigraph. This graph is further used to detect entity communities and keywords found within those communities. Presented implementation works in linear time.
Talk at Semantic Technology Conference, 23 June 2010, San Francisco.
The LOD cloud has a potential for applicability in many AI-related tasks, such as open domain question answering, knowledge discovery, and the Semantic Web. An important prerequisite before the LOD cloud can enable these goals is allowing its users (and applications) to effectively pose queries to and retrieve answers from it. However, this prerequisite is still an open problem for the LOD cloud and has restricted it to “merely more data.” To transform the LOD cloud from "merely more data" to "semantically linked data” there are plenty of open issues which should be addressed. We believe this transformation of the LOD cloud can be performed by addressing the shortcomings identified by us: lack of conceptual description of datasets, lack of expressivity, and difficulties with respect to querying.
Distributed Link Prediction in Large Scale Graphs using Apache Spark
Anastasios Theodosiou
Aristotle University of Thessaloniki, Thessaloniki 54621, GREECE
anastasios.theodosiou@gmail.com
Abstract. Social networks like Facebook, Instagram, Twitter, and LinkedIn have become an integral part of our everyday life. Through these, users can share digital content (links, photos, videos), express or share their opinions, and expand their social circle by making new friends. All these user interactions lead to the evolution and development of these networks over time. A typical example of link prediction is found in some of the services offered by these networks to their users. An essential service is to support users with suggestions for new friendships based on their existing network, as well as on the preferences resulting from their interactions with the network. Link prediction techniques attempt to predict the possibility of a future connection between two nodes in a given network. Beyond social networks, link prediction has a broad scope, including e-commerce, genetics, and security. Due to the massive amounts of data that are collected today, the need arises for scalable approaches to this problem. The purpose of this diploma thesis is to experiment with various techniques of machine learning, both supervised and unsupervised, to predict links in a network of academic papers, using document similarity metrics based on the characteristics of the nodes as well as other structural features of the network. Experimentation and implementation of the application took place using Apache Spark to manage the large data volume, with the Scala programming language.
Keywords: Link prediction, Data mining, Machine Learning, Apache Spark,
Graphs, Online Social Networks, Recommender Systems
1 The link prediction problem
In order to better understand precisely what the link prediction problem is, a brief example will be given. Suppose there is a network whose nodes represent individuals and whose links represent relationships or interactions between individuals. Given a network with these features, how can we predict its evolution in the future? Alternatively, how can we predict the creation of new edges or the deletion of existing links? By studying the evolution of social networks over time, we can understand how a node interacts with another node. In order to carry out this study, we need many different snapshots of the network structure over time, so the volume of data which we need to collect and process grows rapidly. Therefore, finding scalable approaches for parallel processing becomes necessary. Some other real-world examples of link prediction are suggesting friends and followers on a social networking site, indicating relevant products to customers, or providing suggestions to professionals for teamwork based on their field of study or their interests. Therefore, we can conclude that the link prediction problem is that of predicting the probability of a future edge between two nodes.
1.1 Social networks and the difficulty of link prediction.
A (social) network can be represented by a graph G(V, E), where V is the set of its vertices and E the set of its edges. The number of possible connections in such a network is equal to [V * (V - 1)] / 2. The network we are looking at in this work consists of 27,770 nodes. If we wanted to compute all possible edges and suggest new ones based on some document similarity metric (e.g., Jaccard similarity), we would have to check 385,572,565 edges. This number is quite large even on a relatively small network such as the one in this work. However, social networks are sparse, so there is no need to choose an edge at random and try to predict its existence in the network. Because the number of possible connections is so large, there is a need to find alternative and more efficient approaches to predicting them. Major social platforms such as Facebook, Twitter, Instagram, LinkedIn, and others have, as one of their primary services, the proposal of new links in the form of a new "social friendship." High accuracy in such predictions can help us understand which factors lead to the evolution of these networks and provide more accurate and meaningful suggestions. The social network which we study in this work is a network of academic papers, where each paper cites some other papers. A classic method for proposing collaboration on such a network is through the bibliography system. However, a new proposal based on this system alone might not be entirely accurate. We need to extend and enrich this method with more data or new techniques so that we can achieve greater accuracy in our recommendations. For example, we can use methods based on the content and structure of the documents.
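As a quick check of these quantities, here is a minimal sketch in plain Python (the thesis itself used Scala on Spark); the token sets below are hypothetical titles invented for illustration:

```python
def possible_edges(v: int) -> int:
    """Number of possible undirected edges in a graph with v vertices: v*(v-1)/2."""
    return v * (v - 1) // 2

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A intersection B| / |A union B|."""
    return len(a & b) / len(a | b) if a | b else 0.0

# The citation network studied in this work has 27,770 nodes.
print(possible_edges(27_770))  # 385572565 candidate edges

# Hypothetical bags of words from two paper titles.
t1 = {"link", "prediction", "large", "graphs"}
t2 = {"link", "prediction", "social", "networks"}
print(round(jaccard(t1, t2), 3))  # 2 shared / 6 total = 0.333
```

Even at roughly 385 million candidate pairs, a brute-force pairwise comparison is impractical, which motivates the approximate techniques discussed below.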
2 Graphs and social networks.
Graphs provide a better way to deal with abstract concepts such as relationships and interactions in a network. They also offer an intuitive, visual way of thinking about these concepts, and they are a natural basis for analyzing relationships in a social context. Over time, graphs have been used increasingly in data science. Graph databases have become common computing tools and alternatives to SQL and NoSQL databases. Concepts of graph theory are used for the study and modeling of social networks, fraud patterns, energy consumption patterns, influence on a social network, and many other application areas. Social Network Analysis (SNA) is probably the most well-known application of graph theory in data science. Graphs are also used in clustering algorithms (e.g., K-Means). Therefore, there are many reasons for the use of graphs and many fields of application. From the computer science perspective, graphs offer computational efficiency: the "Big O" complexity of some algorithms is better for data arranged in graph format than for tabular data.
3 Link prediction and locality sensitive hashing.
The problem of finding identical or duplicate documents based on a similarity metric seems relatively straightforward. Using a hash function, the work can be completed very quickly, and the algorithm is fast. However, the problem becomes more complicated if we want to find similar documents with spelling mistakes or even with different words. The brute-force technique can be used to find such documents and to predict links with higher accuracy, but it is not a scalable technique. On the other hand, the LSH algorithm, or Locality Sensitive Hashing, is a technique that can be used for the same problems, yielding approximate results in much less time than the brute-force technique. In our problem, LSH can suggest an edge between two nodes if the similarity of the two documents is above a given threshold. More generally, LSH belongs to a family of functions, known as the LSH family, which hashes the data into buckets so that documents with high similarity are hashed into the same bucket. The general idea of LSH is to find an algorithm such that, given two document signatures, it can tell us whether these nodes form a candidate pair, i.e., whether their similarity is higher than the given threshold. As for the MinHashing part, there are two necessary steps. First, we hash the columns of the signature matrix with several hash functions, and then we check whether two documents are hashed into the same bucket for at least one of the different functions. In this case, we can accept the two documents as a candidate pair. Regarding the link prediction problem, if the Jaccard similarity of the two documents is above the given threshold, we can conclude that there is a potential edge between them.
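The two steps above can be sketched in a few lines. The following is a self-contained illustration of MinHash signatures plus banding in plain Python; the actual implementation used Spark's MinHashLSH in Scala, and the hash family, band/row counts, and documents here are illustrative assumptions, not the thesis configuration:

```python
import random
import zlib

def minhash_signature(tokens, hash_fns):
    """MinHash: for each hash function, keep the minimum hash value over the token set."""
    return [min(h(t) for t in tokens) for h in hash_fns]

def lsh_candidates(signatures, bands, rows):
    """Split each signature into bands; documents sharing any band bucket become candidates."""
    buckets = {}
    for doc, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, set()).add(doc)
    pairs = set()
    for docs in buckets.values():
        for d1 in docs:
            for d2 in docs:
                if d1 < d2:
                    pairs.add((d1, d2))
    return pairs

random.seed(42)
P = 2_147_483_647  # a large prime for the hash family h(x) = (a*x + b) mod P
hash_fns = [
    (lambda a, b: (lambda t: (a * zlib.crc32(t.encode()) + b) % P))(
        random.randrange(1, P), random.randrange(P))
    for _ in range(12)
]

docs = {
    "p1": {"link", "prediction", "spark", "graphs"},
    "p2": {"link", "prediction", "spark", "graphs"},      # identical: always a candidate
    "p3": {"entity", "resolution", "knowledge", "base"},  # dissimilar
}
sigs = {d: minhash_signature(t, hash_fns) for d, t in docs.items()}
pairs = lsh_candidates(sigs, bands=4, rows=3)
print(pairs)  # ("p1", "p2") is guaranteed: identical sets give identical signatures
```

With b bands of r rows, two documents of Jaccard similarity s become candidates with probability 1 - (1 - s^r)^b, which is what makes the threshold tunable.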
4 Suggested approach and results.
The network which we studied in this work is composed of 27,770 nodes (papers) and 352,857 edges. Each node has some attributes which represent it: the document id, the publication date, the title, the authors, the journal(s), and the abstract of the paper. Furthermore, a second file, the "edge list," containing all the edges of the network, was used during our experiments. This work aimed to perform link prediction between the papers above. An edge between two nodes exists when at least one node points to the other by referencing it through the bibliographic system. Our proposed approach does not take such references into account, nor does it use the metrics mentioned in chapter two; instead, it relies on the similarity of the records based on their characteristics, through the Jaccard similarity metric, and on other structural features of the network. Two different approaches were used: in the first, the problem was treated as binary classification, while in the second, two different unsupervised machine learning techniques were used, namely the brute-force method and Locality Sensitive Hashing, with the use and configuration of the MinHashLSH algorithm provided by Apache Spark.
4.1 Supervised link prediction approach.
As we have already mentioned, the problem was treated as binary classification. Therefore, different machine learning models were used, each based on a particular classifier. All models were tested on a four-core system with 8 GB of RAM. In the first phase, the datasets were loaded for both nodes and edges. Once this process was completed, a join operation between the two files created our initial dataframe. This dataframe eventually contained the ids of the two papers involved in an edge, as well as all the other attributes that characterize the specified nodes. After that, a tokenization procedure was performed on each column of the dataframe, so all texts were converted into bags of words. Next, all stop words were removed so that the Jaccard similarity would not be affected by them. At this point, the features that each classifier would consider for its training phase were calculated. They come mainly from the attributes of the nodes but also from the structural features of the network. These were: (a) the time difference in publication between the two papers, (b) the title overlap, (c) the authors overlap, (d) the journal overlap, and (e) the abstract overlap. Furthermore, three more structural features concerning the nodes were added: (a) the common neighbors of the nodes of each edge, (b) the sum of the total triangles belonging to each node of the edge, and (c) the PageRank score for each of the nodes in the network. After that, we applied the Chi-Squared test of independence, with the help of the ChiSqSelector class of Apache Spark, to determine whether there is a significant relationship between two categorical features. From this test and other experiments, it was decided not to use the PageRank feature, as its contribution to the final accuracy and F1 was found to be almost zero. Finally, the data were divided into two parts: 70% for the training phase and the remaining 30% for the test phase.
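The feature computations described above can be made concrete with a small sketch. This is plain Python with hypothetical field names and toy values (the thesis computed these columns on Spark dataframes in Scala); it shows the overlap features and one structural feature, common neighbors:

```python
def overlap(a: set, b: set) -> int:
    """Number of shared tokens after tokenization and stop-word removal."""
    return len(a & b)

def common_neighbors(adj: dict, u: str, v: str) -> int:
    """Structural feature: neighbors shared by the two endpoints of a candidate edge."""
    return len(adj.get(u, set()) & adj.get(v, set()))

def edge_features(p1: dict, p2: dict, adj: dict) -> list:
    """Feature vector for a candidate edge (field names are illustrative assumptions)."""
    return [
        abs(p1["year"] - p2["year"]),                # (a) publication time difference
        overlap(p1["title"], p2["title"]),           # (b) title overlap
        overlap(p1["authors"], p2["authors"]),       # (c) authors overlap
        overlap(p1["journal"], p2["journal"]),       # (d) journal overlap
        overlap(p1["abstract"], p2["abstract"]),     # (e) abstract overlap
        common_neighbors(adj, p1["id"], p2["id"]),   # structural: common neighbors
    ]

# Toy papers and a toy citation adjacency, invented for illustration.
paper_a = {"id": "a", "year": 2003, "title": {"link", "prediction"},
           "authors": {"smith"}, "journal": {"kdd"},
           "abstract": {"graphs", "spark", "prediction"}}
paper_b = {"id": "b", "year": 2005, "title": {"link", "mining"},
           "authors": {"smith", "jones"}, "journal": {"kdd"},
           "abstract": {"graphs", "networks"}}
adj = {"a": {"c", "d"}, "b": {"c", "e"}}
print(edge_features(paper_a, paper_b, adj))  # [2, 1, 1, 1, 1, 1]
```

Each candidate edge thus becomes a fixed-length numeric vector, which is exactly the input shape the classifiers below expect.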
Naïve Bayes Classifier.
The first classifier used was Naïve Bayes. Several tests were performed to select the threshold of this algorithm. For this data set and the selected features, Naïve Bayes gave the best results when the threshold value was 0.5 (50%). Table 1 describes the results of this particular algorithm.
Table 1. Results from Naive Bayes classifier
Dataset Split Accuracy F1 Exec. Time (sec)
70/30 0.58614 0.58876 1090.06
Logistic Regression Classifier.
Since the results of Naïve Bayes were not so good, other classifiers were tested. One of these was the Logistic Regression classifier. As with the previous models, here too we tested different feature sets. Specifically, the first test performed here contained only the features derived from the nodes' attributes. Tests were also performed for different numbers of iterations. The first test concerned only the following features: (a) the time difference of publishing, (b) the overlap of titles, (c) the overlap of authors, (d) the overlap of the journal, and (e) the overlap of the abstract. The results of this algorithm for these features are shown in Table 2.
Table 2. Logistic regression results with attributes based on node
Max. Iterations Accuracy F1 Exec. Time (sec)
10 0.79890 0.79812 1694.76
100 0.79890 0.79947 1654.78
1000 0.79713 0.79957 1628.16
10000 0.79723 0.79778 1807.94
Although this model achieves higher accuracy and F1 scores than the previous model, the next test showed that, with the addition of structural features, the algorithm achieves even better results, as shown in Table 3.
Table 3. Logistic regression results with node's and structural features
Max. Iterations Accuracy F1 Exec. Time (sec)
10 0.93518 0.93559 959.72
100 0.93561 0.93600 1002.28
We note that adding structural features significantly increased the accuracy of the algorithm and reduced its overall execution time. The next model with which we experimented was the Linear SVM.
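The logistic model evaluated above can be illustrated with a from-scratch sketch of the underlying optimization. This is plain Python with batch gradient descent and toy data loosely modeled on the overlap features (the thesis used Spark MLlib's LogisticRegression; the feature values and labels here are invented for illustration):

```python
import math

def train_logreg(X, y, iters=3000, lr=0.5):
    """Batch gradient descent on the logistic loss; w holds weights plus a trailing bias."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(iters):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + w[-1]
            p = 1.0 / (1.0 + math.exp(-z))  # sigmoid
            err = p - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        for j in range(len(w)):
            w[j] -= lr * grad[j] / len(X)
    return w

def predict(w, xi):
    """Classify as 1 when the linear score crosses the decision boundary."""
    z = sum(wj * xj for wj, xj in zip(w, xi)) + w[-1]
    return 1 if z >= 0.0 else 0

# Toy candidate edges: [title_overlap, common_neighbors]; label 1 = a real citation edge.
X = [[3, 2], [2, 3], [3, 3], [0, 0], [1, 0], [0, 1]]
y = [1, 1, 1, 0, 0, 0]
w = train_logreg(X, y)
print([predict(w, xi) for xi in X])  # linearly separable toy data: recovers all labels
```

Spark distributes exactly this kind of gradient computation across partitions, which is why the number of iterations appears as a tuning parameter in the tables above.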
Linear SVM Classifier.
This model was tested, like the previous models, on the same set of features. Experiments were also performed for different values of the maximum number of iterations as well as for the RegParam regularization parameter. The test results are shown in Table 4.
Table 4. Linear SVM results
Max Iterations RegParam Accuracy F1 Exec. Time
10 0.1 0.85967 0.85967 934.15
100 0.1 0.88044 0.88152 1124.26
10 0.3 0.84362 0.84355 893.23
100 0.3 0.85683 0.85821 1313.11
From the tests carried out, it turned out that the Linear SVM algorithm works best with MaxIterations set to 100 and RegParam equal to 0.1. Next in the series of models was the Multilayer Perceptron classifier.
Multilayer Perceptron Classifier.
This classifier is based on neural networks. Many experiments were performed here, varying both the maximum number of iterations of the algorithm and the number of layers. Additional parameters were tested, but these two affected the result more than any other parameter. The results are presented in Table 5.
Table 5. MLPC results based on the number of iterations and layers
Max Iterations Layers Accuracy F1 Exec. Time
100 13,10,7,2 0.87953 0.87951 1007.67
200 13,10,7,2 0.94770 0.94776 1106.78
400 13,7,4,2 0.95187 0.95205 1347.12
The best results, for our data set and the features we chose, were obtained with the maximum number of iterations at 400 and the layer sizes 13,7,4,2. Next in the series of models we tested was the Decision Tree.
Decision Tree Classifier.
It was observed that the parameter affecting the classifier's behavior most significantly was the maximum depth of the tree. The results for this parameter are shown in Table 6.
Table 6. Decision tree classifier results
Max Depth Accuracy F1 Exec. Time (sec)
4 0.95116 0.95129 1302.87
8 0.95300 0.95314 1308.23
16 0.94262 0.94279 1177.16
30 0.92497 0.92494 1342.28
As the table shows, this model achieves higher accuracy and F1 than all previous classifiers. The best values for this model came with a maximum depth of 8. Finally, the Random Forest algorithm was used in our experiments.
Random Forest Classifier.
The sixth and last classifier we used for the link prediction problem was the Random Forest algorithm. Random Forest is an ensemble of decision trees; one of its advantages is that combining multiple trees helps to avoid overfitting. We experimented with two basic parameters of the algorithm: the first was the maximum depth of the trees, and the second was the total number of trees. The results of this test are described in Table 7.
Table 7. Random forest classifier results
Max Depth Num. Trees Accuracy F1 Exec. Time (sec)
4 10 0.95066 0.95077 1314.01
8 10 0.95580 0.95591 1191.91
4 100 0.95058 0.95068 1262.46
8 100 0.95527 0.95538 1230.55
The Random Forest model achieves the highest accuracy and F1 of all the models mentioned above. The best values for accuracy and F1 are obtained with 10 trees and a maximum depth of 8 per tree.
Model comparison.
Summarizing the above results from the different classifiers, a comparison was made between them in terms of accuracy and F1. Figure 1 shows the accuracy and F1 achieved per model.
Figure 1. Comparison of the classifiers
As far as execution time is concerned, the shortest is achieved by the Logistic Regression model, with a total completion time of 1002.28 seconds, while the Random Forest model requires 189.69 seconds longer. The differences in the run times of the six classifiers are shown in Figure 2.
Figure 2. The execution time of six classifiers
4.2 Unsupervised link prediction approach.
From the perspective of unsupervised machine learning, the problem of link prediction was addressed with two different but widely used techniques. The approach we propose differs from most of the related work on this problem in how to deal with
it. Nearly the same data preprocessing techniques were performed as in the previous chapter. The main difference is that for each node, a bag of words was created and associated with it. In more detail, for each paper, each column was tokenized into words, and then all the dataframe columns were concatenated into one. In the next phase, all stop words were removed so that the Jaccard similarity would not be affected. Once the data had been prepared, we proceeded to predict new links with two different techniques. The first technique tested was brute force, and the second was the Locality Sensitive Hashing (LSH) algorithm in combination with MinHashing. It is worth noting that these experiments were carried out on a cluster of 80 cores. All tests were run with a maximum of 64 cores and 32 GB of RAM. As there were hardware constraints, and more specifically random access memory limitations, the experiments were performed on a subset of the original data set.
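The preprocessing steps above (tokenize each column, merge all columns into one bag of words, drop stop words) can be sketched in plain Python; the column names and the tiny stop-word list below are illustrative assumptions, not the actual ones used in the experiments.

```python
import re

# Illustrative stop-word list; the actual experiments used a full list.
STOP_WORDS = {"a", "an", "the", "of", "on", "in", "for", "and", "to", "with"}

def bag_of_words(paper: dict) -> set:
    """Tokenize every textual column of a paper, concatenate the tokens
    from all columns, and drop stop words, yielding one bag of words
    per node."""
    tokens = []
    for column in ("title", "authors", "journal", "abstract"):
        tokens.extend(re.findall(r"[a-z0-9]+", paper.get(column, "").lower()))
    return {t for t in tokens if t not in STOP_WORDS}

paper = {"title": "Link Prediction in Large Graphs",
         "abstract": "A study of link prediction with Spark."}
words = bag_of_words(paper)
# -> {'link', 'prediction', 'large', 'graphs', 'study', 'spark'}
```

In the actual pipeline the same effect is obtained with Spark's tokenizer and stop-word removal transformers so that the work is distributed.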
Brute force prediction.
Initially, a join operation was performed on the data in order to create all the possible edges that may occur. This process is relatively slow, but it only needs to run once, and its output can then be reused as is. After that, the Jaccard similarity was calculated for all candidate edges; the maximum Jaccard similarity found was 0.4973. After this process was completed, we set a threshold on the Jaccard similarity so that edges with a similarity greater than or equal to it were selected. The run time of the algorithm increased geometrically as the number of nodes in the data set increased. Table 8 shows cumulative results of the algorithm: the accuracy, the number of candidate pairs, the total number of checks performed, and the algorithm's execution time.
Table 8. Aggregative results of brute force algorithm execution
Nodes Checks # Candidates Accuracy Exec. Time (sec)
1000 499500 3916 0.9368 62.89
2000 1999000 14055 0.9662 161.04
5000 12497500 74302 0.9711 566.73
7000 24496500 106534 0.9789 1446.98
Figure 3 shows the change in algorithm accuracy as the number of nodes in the network
grows.
Figure 3. The increase of accuracy based on the dataset volume
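A minimal single-machine sketch of the brute-force idea, assuming each node already has a bag of words (in the experiments the pairing was done with a Spark join):

```python
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two bags of words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def brute_force_candidates(bags: dict, threshold: float):
    """Check every possible pair of nodes and keep those whose
    Jaccard similarity reaches the threshold."""
    checks = 0
    candidates = []
    for u, v in combinations(sorted(bags), 2):
        checks += 1
        if jaccard(bags[u], bags[v]) >= threshold:
            candidates.append((u, v))
    return candidates, checks

bags = {1: {"link", "prediction", "graphs"},
        2: {"link", "prediction", "spark"},
        3: {"entity", "linking"}}
cands, checks = brute_force_candidates(bags, 0.4)
# 3 nodes -> 3 checks; only the pair (1, 2) reaches similarity 2/4 = 0.5
```

The n(n-1)/2 pairwise checks are exactly why the run time grows geometrically with the number of nodes, as Table 8 shows.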
This technique is generally fairly accurate, but extremely time-consuming and utterly dependent on system resources. For this reason, new techniques are needed that can produce results within a reasonable time. MinHashing and the Locality Sensitive Hashing (LSH) algorithm address exactly this problem.
MinHashLSH prediction.
The basic idea behind this algorithm is that it uses MinHashing in conjunction with LSH so that documents with a high similarity index are hashed into the same bucket, while those with a low index land in different ones. In general, the entire workflow is the same as for the brute force algorithm, except that here we join the data based on the Jaccard distance rather than the Jaccard index. So if we want to require 60% Jaccard similarity for our documents, we should set a Jaccard distance threshold equal to 1 minus the Jaccard similarity, i.e., the two documents should be at most 40% apart. Table 9 lists some results from the various experiments performed with this method, with a Jaccard distance of 0.8, as this value provided the most accurate results.
Table 9. MinHashLSH scores relative to the number of hash tables
Hash Tables Candidates Precision Recall Accuracy F1 Time (sec)
2 986 0.7261 0.0120 0.97133 0.02370 26.19
4 1610 0.7304 0.0385 0.97147 0.03853 31.12
8 3026 0.6265 0.0607 0.97147 0.06072 70.20
16 3628 0.5975 0.0364 0.97148 0.06877 111.48
32 3824 0.5983 0.0385 0.97147 0.07246 211.95
64 3840 0.5968 0.0385 0.97149 0.07246 514.83
128 3840 0.5968 0.0385 0.97151 0.07251 1344.72
We notice that as the number of hash tables grows, the accuracy of the results increases. At the same time, the algorithm's execution time grows almost linearly. Figure 4 illustrates the change in accuracy relative to the number of hash tables.
Figure 4. The accuracy of MinHashLSH vs. hash tables
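To make the bucketing idea concrete, here is a small pure-Python sketch of MinHash signatures with LSH banding; the hash construction and the band/row sizes are illustrative, and Spark's MinHashLSH uses its own hash families internally.

```python
import hashlib

def minhash_signature(words: set, num_hashes: int = 8) -> tuple:
    """One MinHash value per hash function: the minimum hash of any
    word in the set. Similar sets agree on many positions."""
    sig = []
    for i in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{i}:{w}".encode()).hexdigest(), 16)
            for w in words))
    return tuple(sig)

def lsh_buckets(bags: dict, bands: int = 4, rows: int = 2):
    """Split each signature into bands; documents sharing any band's
    values land in the same bucket and become candidate pairs."""
    buckets = {}
    for node, words in bags.items():
        sig = minhash_signature(words, bands * rows)
        for b in range(bands):
            key = (b, sig[b * rows:(b + 1) * rows])
            buckets.setdefault(key, set()).add(node)
    return buckets

bags = {1: {"link", "prediction", "graphs"},
        2: {"link", "prediction", "graphs"},
        3: {"entity", "knowledge", "base"}}
buckets = lsh_buckets(bags)
together = any({1, 2} <= nodes for nodes in buckets.values())
# identical bags yield identical signatures, so 1 and 2 share a bucket
```

Only pairs that collide in at least one bucket need a full Jaccard check, which is why the candidate count in Table 9 is so much smaller than the brute-force check count in Table 8.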
To evaluate the unsupervised techniques, since we did not have a classifier or regressor, a function computing TP, FP, TN and FN was implemented, comparing the algorithms' results with the original graph of the network. Then the Precision and Recall metrics were calculated, and from these, Accuracy and F1 were derived. Many more experiments and tests were carried out; they are available in the full version of the diploma thesis.
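A minimal sketch of such an evaluation function, assuming the predicted and actual edges are given as sets of node pairs (the names and toy data are illustrative):

```python
def confusion_counts(predicted: set, actual: set, all_pairs: set):
    """Compare predicted edges with the original graph's edges over
    the universe of checked pairs, yielding TP, FP, TN, FN."""
    tp = len(predicted & actual)
    fp = len(predicted - actual)
    fn = len(actual - predicted)
    tn = len(all_pairs) - tp - fp - fn
    return tp, fp, tn, fn

def metrics(tp, fp, tn, fn):
    """Derive Precision, Recall, Accuracy and F1 from the counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, accuracy, f1

all_pairs = {(1, 2), (1, 3), (2, 3), (1, 4)}
actual = {(1, 2), (2, 3)}          # edges in the original graph
predicted = {(1, 2), (1, 3)}       # edges the algorithm proposed
p, r, acc, f1 = metrics(*confusion_counts(predicted, actual, all_pairs))
# tp=1, fp=1, fn=1, tn=1 -> precision 0.5, recall 0.5, accuracy 0.5, f1 0.5
```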
5 Conclusion.
The problem of link prediction has a wide range of applications in different areas. In this diploma thesis, we studied both supervised and unsupervised machine learning techniques. After many experiments and trials, we concluded that, given the data at our disposal and the way we chose to address the problem, the model based on the Random Forest classifier is the best of the supervised solutions. On the unsupervised side, the MinHashLSH method was chosen, as it is much faster than brute force while producing results that come very close in quality. However, it requires much attention, as it generates many false positives.
6 Future work.
As future work, we will address the link prediction problem from a different viewpoint. We will re-examine the same network, this time with a technique based on clustering. This approach groups similar nodes into a cluster, with the aim that nodes from the same cluster exhibit a similar connectivity pattern. In more detail, with this method we will initially set a threshold θ and then remove all the edges of the graph whose weight is below this limit. Each connected component of the resulting graph will then correspond to a cluster; in general, two nodes are in the same connected component if and only if there is a path between them. On the supervised machine learning side, we will try techniques based purely on neural networks, with more complex data preprocessing, and we hope to achieve even better results in less execution time.
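The clustering idea above (drop edges below θ, then take connected components as clusters) can be illustrated with a small union-find sketch; the weights and threshold below are made up for the example.

```python
def threshold_clusters(nodes, weighted_edges, theta):
    """Remove edges with weight < theta, then return the connected
    components of the remaining graph as clusters (union-find)."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Union only the endpoints of edges that survive the threshold.
    for u, v, w in weighted_edges:
        if w >= theta:
            parent[find(u)] = find(v)

    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), set()).add(n)
    return list(clusters.values())

edges = [(1, 2, 0.9), (2, 3, 0.8), (3, 4, 0.1), (4, 5, 0.7)]
clusters = threshold_clusters([1, 2, 3, 4, 5], edges, theta=0.5)
# the weak edge (3, 4) is dropped, leaving clusters {1, 2, 3} and {4, 5}
```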