Assignment 1: Application
Survey on Data Mining and
Data Warehousing
CSCI 4144, Winter 2016
ID: B00707506, Student Name: Patrick Walter
1/15/2016
Link-prediction in Social Networks: A Survey
Introduction
A social network consists of two main components, a set of social actors and a set of
connections. In many cases the social actors represent people, while the connections represent
any form of social interaction, collaboration or influence. It follows that a social network can be
easily represented by a graph with the actors being nodes, and the connections being edges.
The popularity of online social networks has exploded over the past decade. Social networks
have expanded from narrow contexts, such as researchers who have collaborated with each
other or employees at a company who have worked together, to networks which can
connect anyone in the world.
Given that social networks are often based on people, they are often highly dynamic
with actors constantly making new interactions and connections with each other. In many
applications it is beneficial to be able to make predictions about these future connections. The
link-prediction problem was defined by Jon Kleinberg and David Liben-Nowell as follows:
“Given a snapshot of a social network at time t, we seek to accurately predict the edges that
will be added to the network during the interval from time t to a given future time t’”
(Liben-Nowell & Kleinberg, 2007). Using link-prediction, a system can model the evolution of the
network based on features that are intrinsic to the network. An example of the link-prediction
problem is seen in social networks such as Facebook and other web-based social networks.
Facebook has systems that suggest that users connect with other users whom they may
know, or with companies they may like. These suggestions may create a more engaging
experience for users when they can easily make connections with their friends. Link prediction
can also be used by companies to suggest employees who should work together
on new projects. Thus many companies have a vested interest in developing effective
link-prediction systems.
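As a minimal sketch of the problem statement above, the snapshot at time t can be stored as a map from each actor to a set of neighbours, every unconnected pair can be scored, and the top-ranked pairs can be compared against the edges that actually appear by time t’. The common-neighbours score, the function names, and the toy edge lists are assumptions chosen for illustration; they are not taken from any of the systems surveyed here.

```python
from itertools import combinations

def build_graph(edges):
    """Store an undirected social network as a dict of neighbour sets."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph

def predict_links(graph, k):
    """Rank unconnected pairs by number of common neighbours; return the top k."""
    scores = []
    for u, v in combinations(sorted(graph), 2):
        if v not in graph[u]:
            common = len(graph[u] & graph[v])
            if common > 0:
                scores.append((common, (u, v)))
    # Highest score first; ties are broken alphabetically for determinism.
    scores.sort(key=lambda item: (-item[0], item[1]))
    return [pair for _, pair in scores[:k]]

# Hypothetical snapshot at time t, and the edges added by time t'.
edges_t = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"),
           ("bob", "dan"), ("cat", "dan"), ("dan", "eve")]
new_edges = {("ann", "dan"), ("cat", "eve")}

predicted = predict_links(build_graph(edges_t), k=2)
hits = [pair for pair in predicted if pair in new_edges]
print(predicted, "correct:", len(hits))
```

Scoring only pairs with at least one common neighbour restricts the candidates to the 2-hop neighbourhood, which is the friends-of-friends restriction discussed in the next section.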
Using Location-based Data to Make Better Predictions
Many link-prediction systems rely heavily on making predictions based on 2-hop
neighbours, or friends-of-friends. This is because most social networks contain
millions of nodes, and the likelihood of two nodes making a connection declines exponentially
with each hop. Social networks that deploy location-based information such as check-ins can
give a way to make predictions that do not occur between neighbouring nodes. By exploiting
the location data of nodes, link-predictions can be made for nodes sharing one or more of these
locations. These nodes may not be within the 2-hop neighbourhood of each other and
therefore the link between them could not be made by a friends-of-friends system. The new
link made by these place-friends can be predicted by using the check-in information of the two
nodes. The problem, as defined by a group of researchers from the University of Cambridge, is:
“how do we design a link prediction system which exploits data about user check-ins” (Scellato,
Noulas, & Mascolo, 2011).
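The idea of place-friends can be sketched as a candidate-generation step: keep the pairs that share at least one check-in location but are neither friends nor friends-of-friends. The data layout (dicts of sets) and all names below are hypothetical and only illustrate the idea; they are not the authors' implementation.

```python
from itertools import combinations

def within_two_hops(graph, u, v):
    """True if u and v are friends or share at least one friend."""
    return v in graph[u] or bool(graph[u] & graph[v])

def place_friend_candidates(graph, checkins):
    """Pairs sharing a check-in place but outside each other's 2-hop neighbourhood."""
    candidates = {}
    for u, v in combinations(sorted(graph), 2):
        shared = checkins.get(u, set()) & checkins.get(v, set())
        if shared and not within_two_hops(graph, u, v):
            candidates[(u, v)] = shared
    return candidates

# Hypothetical friendship graph and check-in history.
graph = {"ann": {"bob"}, "bob": {"ann", "cat"}, "cat": {"bob"},
         "dan": {"eve"}, "eve": {"dan"}}
checkins = {"ann": {"cafe", "gym"}, "cat": {"gym"},
            "dan": {"cafe"}, "eve": {"library"}}

candidates = place_friend_candidates(graph, checkins)
print(candidates)
```

Here ann and cat share a gym but are already friends-of-friends through bob, so only the pair ann and dan is reported as a place-friend candidate.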
Solution Technology
The solution that Scellato, Noulas, & Mascolo used came in the form of supervised
learning. For each pair of users the link prediction is based on a set of features that describe the
pair. These features are based on both common social links and common and overlapping
location data. To create the training data, simple labelling is applied. For each snapshot, the
features of every pair of users not yet connected are computed, then in the next snapshot the pairs that
become connected are labelled positive and the others are labelled negative. Using this
training data, classifiers are trained to construct models which can classify test data. Because the
class distribution is heavily skewed, using a supervised method allows
for effective discovery of inter-class boundaries and therefore better classification (2011).
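The labelling procedure described above can be sketched with two snapshots of the friendship graph. The two features used here (number of common friends and number of common check-in places) are simplified stand-ins for the social and place features, and every name and value is an assumption made for illustration.

```python
from itertools import combinations

def pair_features(graph, checkins, u, v):
    """Two illustrative features: common friends and common check-in places."""
    common_friends = len(graph.get(u, set()) & graph.get(v, set()))
    common_places = len(checkins.get(u, set()) & checkins.get(v, set()))
    return [common_friends, common_places]

def label_pairs(graph_t, graph_next, checkins):
    """Build (pair, features, label) rows for pairs unconnected in the first snapshot."""
    users = sorted(set(graph_t) | set(checkins))
    rows = []
    for u, v in combinations(users, 2):
        if v in graph_t.get(u, set()):
            continue  # already linked, so there is nothing to predict
        label = 1 if v in graph_next.get(u, set()) else 0
        rows.append(((u, v), pair_features(graph_t, checkins, u, v), label))
    return rows

# Hypothetical consecutive snapshots: ann and cat connect in the second one.
graph_t = {"ann": {"bob"}, "bob": {"ann", "cat"}, "cat": {"bob"}, "dan": set()}
graph_next = {"ann": {"bob", "cat"}, "bob": {"ann", "cat"},
              "cat": {"bob", "ann"}, "dan": set()}
checkins = {"ann": {"cafe"}, "cat": {"cafe"}, "dan": {"cafe"}}

rows = label_pairs(graph_t, graph_next, checkins)
positives = sum(label for _, _, label in rows)
print(rows, positives, len(rows) - positives)
```

Even in this toy example the negative pairs outnumber the positive ones three to one, which mirrors the skewed class distribution mentioned above.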
Evaluation
Using multiple supervised learning implementations, Scellato, Noulas, & Mascolo were
able to empirically show that using place-data increased the performance of a link-prediction
system. Random forests and model trees with linear regression gave the best performance in
their research. It was noted that link prediction was more accurate when predicting links
that would be made by place-friends, since the models were able to exploit location-based user activity
(2011).
Allowing for Positive and Negative Links in Link-prediction Networks
In the real world, not all connections between actors in a social network are positive.
Some online social networks have implemented this concept by having actors able to create
connections that can be either positive or negative, for example “friend” or “foe”. A group of
researchers from Stanford and Cornell University “study online social networks in which
relationships can be either positive (indicating relations such as friendship) or negative
(indicating relations such as opposition or antagonism)” (Leskovec, Huttenlocher, &
Kleinberg, 2010). In their research, Leskovec, Huttenlocher, & Kleinberg discuss how the
sign of a given link interacts with other links in the same neighbourhood or other links
throughout the entire network. In terms of the link-prediction problem, they ask what predictions
can be made about the configurations of link signs in a real social network (2010). They define
the edge sign prediction problem as follows: “given a social network with signs on all its edges,
but the sign on the edge from node u to node v, denoted s(u, v), has been “hidden.” How
reliably can we infer this sign s(u, v) using the information provided by the rest of the
network?” (Leskovec, Huttenlocher, & Kleinberg, 2010).
Solution Technology
To solve the edge sign prediction problem, Leskovec, Huttenlocher and Kleinberg
implemented a solution using a logistic regression classifier, a form of supervised learning. Since
most networks exhibited a skewed distribution of positive and negative signed links, the group
used two approaches. One approach used a full dataset in which only about one fifth of the
connections were negative, and the other used a balanced dataset with an equal distribution of
signs. In order to use this machine-learning approach, features must be defined that describe
pairs of actors with a hidden link. Two sets of features are used. One set, called the degree
features, is based on the signed degrees of the two nodes (2010). The
other, called the triad features, is based on the joint relationships the two nodes have with
other nodes in their neighbourhood, similar to the friends-of-friends features used in Scellato,
Noulas, and Mascolo’s research.
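The two feature sets can be sketched for a directed, signed graph stored as a dict from (source, target) to +1 or -1. The particular seven degree counts and the enumeration of sixteen triad types by the direction and sign of the two edges through a common neighbour w are assumptions that match the feature counts reported in the evaluation; they should be checked against the paper before being relied upon.

```python
from itertools import product

def neighbours(edges, x):
    """Nodes linked to x in either direction."""
    return {b for a, b in edges if a == x} | {a for a, b in edges if b == x}

def degree_features(signs, u, v):
    """Seven counts built from the signed degrees of u and v (edge u->v hidden)."""
    edges = {e: s for e, s in signs.items() if e != (u, v)}
    out_pos = sum(1 for (a, _), s in edges.items() if a == u and s > 0)
    out_neg = sum(1 for (a, _), s in edges.items() if a == u and s < 0)
    in_pos = sum(1 for (_, b), s in edges.items() if b == v and s > 0)
    in_neg = sum(1 for (_, b), s in edges.items() if b == v and s < 0)
    common = len(neighbours(edges, u) & neighbours(edges, v))
    return [in_pos, in_neg, out_pos, out_neg, common,
            out_pos + out_neg, in_pos + in_neg]

def triad_features(signs, u, v):
    """Count 16 triad types: direction and sign of the u-w edge and of the w-v edge."""
    edges = {e: s for e, s in signs.items() if e != (u, v)}
    kinds = list(product(("fwd", "back"), (1, -1), ("fwd", "back"), (1, -1)))
    counts = dict.fromkeys(kinds, 0)
    for w in neighbours(edges, u) & neighbours(edges, v):
        uw = [("fwd", edges[(u, w)])] if (u, w) in edges else []
        uw += [("back", edges[(w, u)])] if (w, u) in edges else []
        wv = [("fwd", edges[(w, v)])] if (w, v) in edges else []
        wv += [("back", edges[(v, w)])] if (v, w) in edges else []
        for (d1, s1), (d2, s2) in product(uw, wv):
            counts[(d1, s1, d2, s2)] += 1
    return [counts[k] for k in kinds]

# Hypothetical signed network; the sign of ("u", "v") is the one to be predicted.
signs = {("u", "w"): 1, ("w", "v"): -1, ("x", "u"): -1,
         ("v", "x"): 1, ("u", "y"): -1, ("u", "v"): 1}

deg = degree_features(signs, "u", "v")
tri = triad_features(signs, "u", "v")
print(deg, tri)
```

Concatenating the two lists gives a 23-element vector per hidden link, which could then be passed to a logistic regression classifier.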
Evaluation
In total there are 23 features used to describe each hidden link, 7 degree features and
16 triad features. Leskovec, Huttenlocher, & Kleinberg evaluated the solution on the
basis of each set of features by representing each set by a vector. What stood out the most in
the evaluation was that predictions based on their models significantly outperformed a
previous study which used propagation to go beyond the 2-hop neighbourhood on the same
dataset. This means that sign prediction can be understood based solely on the signs of other
links in the same one-step neighbourhood. In general, using the full dataset yielded much higher
accuracy, with about a 15% improvement over random guessing (2010).
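When reading accuracies on the full and balanced datasets, it helps to keep the trivial baselines in mind. The sketch below assumes the balanced dataset is built by undersampling the majority sign, and uses a made-up edge list with one fifth negative links; neither detail is confirmed by the source.

```python
import random

def balance(examples, seed=0):
    """Undersample the majority sign so both signs are equally frequent."""
    pos = [e for e in examples if e[1] > 0]
    neg = [e for e in examples if e[1] < 0]
    n = min(len(pos), len(neg))
    rng = random.Random(seed)
    return rng.sample(pos, n) + rng.sample(neg, n)

def majority_baseline(examples):
    """Accuracy of always predicting the most common sign."""
    pos = sum(1 for _, sign in examples if sign > 0)
    return max(pos, len(examples) - pos) / len(examples)

# Hypothetical edge list: 80 positive and 20 negative links.
full = ([((i, i + 1), 1) for i in range(80)]
        + [((i, i + 1), -1) for i in range(80, 100)])
balanced = balance(full)

print(majority_baseline(full), majority_baseline(balanced))
```

On data like this, always guessing "positive" is already right four times out of five on the full dataset but only half the time on the balanced one, so accuracies on the two datasets are not directly comparable.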
Using Continuous-valued Links in Link-prediction Networks
In the previously mentioned case of link-prediction using location-based information,
the researchers treated links as binary relations, and in the edge sign prediction problem the
links were evaluated as being ternary relations. Researchers at Purdue University believe that
“in online social networks the low cost of link formation can lead to networks with
heterogeneous relationship strengths (e.g., acquaintances and best friends mixed together).”
(Xiang, Neville, & Rogati, 2010). Xiang, Neville, & Rogati developed a model to predict and
estimate the strength of links in a social network based on their interaction activity and
similarity. This challenge extends from the link-prediction problem as the group believes that
treating links as binary relations will increase the amount of noise learned by a prediction
model by treating strong and weak links equally. In most online social networks, creating links
comes at such a low cost that many links may be much less significant than others. Including
these insignificant links in the learned model can greatly degrade the performance of the
system (2010).
Solution Technology
In order to build their model, Xiang, Neville, & Rogati implemented an
unsupervised method to infer the strength of links in a network. These strength values are
continuous to represent a range of weak to strong relationships (2010). More specifically the
researchers “formulate a latent variable model to infer (hidden) relationship strengths and
develop a coordinate ascent optimization procedure for inference.” (Xiang, Neville, & Rogati,
2010). A Gaussian distribution was used to model the conditional probability of each link's
strength given the similarity of the actors involved. The latent variable model is estimated by
maximum likelihood, and a gradient-based method is used to optimize
the parameters of the model (Xiang, Neville, & Rogati, 2010).
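The Gaussian part of such a model can be sketched as follows, assuming a strength z is normally distributed around a weighted sum of a similarity vector s. This is a deliberately simplified illustration: the strengths are treated as known here, whereas in the latent variable model they are hidden and would be updated in alternation with the weights. The weights w, the variance, the learning rate, and the data are all made up.

```python
import math

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def gaussian_log_density(z, mean, variance):
    """Log of the normal density N(z; mean, variance)."""
    return -0.5 * math.log(2 * math.pi * variance) - (z - mean) ** 2 / (2 * variance)

def log_likelihood(w, pairs, variance):
    """Sum of log densities of strengths z given similarity vectors s."""
    return sum(gaussian_log_density(z, dot(w, s), variance) for s, z in pairs)

def gradient_step(w, pairs, variance, rate):
    """One gradient-ascent update of the weights on the log-likelihood."""
    grad = [0.0] * len(w)
    for s, z in pairs:
        residual = (z - dot(w, s)) / variance
        for i, s_i in enumerate(s):
            grad[i] += residual * s_i
    return [w_i + rate * g_i for w_i, g_i in zip(w, grad)]

# Hypothetical (similarity vector, strength) pairs; the first entry is a bias term.
pairs = [([1.0, 0.0], 0.2), ([1.0, 1.0], 0.9), ([1.0, 0.5], 0.6)]
w = [0.0, 0.0]
before = log_likelihood(w, pairs, variance=1.0)
for _ in range(200):
    w = gradient_step(w, pairs, variance=1.0, rate=0.1)
after = log_likelihood(w, pairs, variance=1.0)
print(w, before, after)
```

Each step moves the weights in the direction that increases the likelihood, so the fitted mean tracks the strengths more closely as the similarity grows.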
Evaluation
Evaluation was done based on two measures, the autocorrelation improvement and the
classification improvement. In terms of autocorrelation, “the relationship-strength network has
significantly higher autocorrelation than the friendship graph in all cases” (Xiang, Neville, &
Rogati, 2010). Using a Gaussian random field semi-supervised classification algorithm and
comparing with other work, the group reports that their model “results in the highest classification
performance for all tasks, suggesting that [their] approach to summarizing the rich profile and
interaction information in online social networks leads to a single meaningful relationship graph
which can improve subsequent knowledge discovery and prediction tasks.” (Xiang, Neville, &
Rogati, 2010).
Drivers and Enablers of Data Mining and Data Warehousing
There are many factors that create a demand for data mining and data warehousing
technologies. Many companies, organizations, and institutions have an interest in extracting
information and knowledge from their stored and incoming data. Some groups seek to use their
data to create monetary value while others seek to understand how to serve their customers or
employees better. With today’s widespread use of technology and the World Wide Web, society
is creating new data at alarming rates. To handle this endless stream of data, many
companies turn to data mining and warehousing technologies. Companies can use data
mining to make better business decisions, better target their customers, and find new ways to
market their products and services. The amount of data created and stored far exceeds the
capabilities of traditional data analysis tools and creates a demand for data mining.
The decreasing cost of computational power and storage is facilitating the widespread
use of data mining and data warehousing in the business world. Globalization is also driving
these technologies as the world becomes more interconnected in online communities. The
increasing availability of data collection devices such as smart phones is also contributing to the
use of data mining. Datasets are also increasingly being made openly available to the public by
many governments and organizations. The abundance of data, the low cost of computational
power, and the use of open and free software create an environment that fosters data mining.
References
Leskovec, J., Huttenlocher, D., & Kleinberg, J. (2010). Predicting Positive and Negative Links in Online
Social Networks. Proceeding WWW '10 Proceedings of the 19th international conference on World wide
web (pp. 641-650). New York, NY, USA: ACM.
Liben-Nowell, D., & Kleinberg, J. (2007). The Link-Prediction Problem for Social Networks. Journal of the
American Society for Information Science and Technology, 58 (7), 1019-1031.
Scellato, S., Noulas, A., & Mascolo, C. (2011). Exploiting Place Features in Link Prediction on
Location-based Social Networks. Proceeding KDD '11 Proceedings of the 17th ACM SIGKDD international
conference on Knowledge discovery and data mining (pp. 1046-1054). New York, NY: ACM.
Xiang, R., Neville, J., & Rogati, M. (2010). Modeling Relationship Strength in Online Social Networks.
Proceeding WWW '10 Proceedings of the 19th international conference on World wide web (pp. 981-990).
New York, NY, USA: ACM.
Questions
a) Why are DM and DW technologies becoming important tools for today's business world?
With the growth of data being collected by businesses, data warehousing technologies are
becoming more important. Companies need data warehousing technologies to easily access
aggregate information from their data. Businesses also seek to integrate data from multiple
different database systems with different designs and schemas. Data warehousing technology
allows a company to store its data based on groupings. With all this data, companies need
to make sense of it all. Data mining technologies allow businesses to turn the information
stored in their data warehouses into knowledge. Data mining aids businesses in
making decisions and sheds light on interesting correlations that would otherwise be unknown.
In today’s online world, data is what drives businesses and data mining is the methodology of
producing knowledge from vast amounts of data.
b) What are the main differences between data mining, traditional statistics data analysis,
and information retrieval?
Data mining is the process of extracting knowledge from large amounts of data which
involves several steps that turn raw data into knowledge that is easily understood by
humans. Traditional statistical data analysis cannot handle large amounts of data.
Information retrieval, in terms of database systems, only involves accessing and retrieving
data, creating aggregate values, or performing deductive queries.
c) How is a data warehouse model different from a relational database model? Why is DW
technology more advanced in supporting business management?
A relational database is simply a collection of tables. Each table has columns and rows and
each cell can be accessed independently or an aggregate query may be applied to a subset
of cells. In order to access any data from a relational database queries must be made in a
relational query language. This is much different than a data warehouse which is a
repository of information from many sources stored under a unified schema. Data in a data
warehouse is stored in a way that it can provide information in a historical perspective and
in a summarized manner. Data warehouses are multidimensional and each cell contains
some aggregate measure. These properties make a data warehouse better suited to supporting business
management. For example, a manager can easily access the aggregate sales of a particular
product by region, or year, or region and year, or any other combination of attributes.
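The manager's example can be sketched as a small data cube in which the aggregate for every combination of attributes is stored in advance. The sales rows, the attribute names, and the build_cube helper are hypothetical and only illustrate the multidimensional idea; real warehouses store and index cubes far more efficiently.

```python
from collections import defaultdict
from itertools import combinations

def build_cube(rows, dimensions, measure):
    """Pre-aggregate the measure for every subset of the dimensions."""
    cube = defaultdict(float)
    for row in rows:
        for size in range(len(dimensions) + 1):
            for dims in combinations(dimensions, size):
                key = (dims, tuple(row[d] for d in dims))
                cube[key] += row[measure]
    return cube

# Hypothetical sales facts.
sales = [
    {"product": "tea", "region": "east", "year": 2014, "amount": 10.0},
    {"product": "tea", "region": "east", "year": 2015, "amount": 15.0},
    {"product": "tea", "region": "west", "year": 2015, "amount": 5.0},
    {"product": "coffee", "region": "east", "year": 2015, "amount": 20.0},
]
cube = build_cube(sales, ("product", "region", "year"), "amount")

tea_east = cube[(("product", "region"), ("tea", "east"))]
tea_2015 = cube[(("product", "year"), ("tea", 2015))]
total = cube[((), ())]
print(tea_east, tea_2015, total)
```

Each lookup is a single dictionary access, whichever combination of product, region, and year the manager asks for.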
d) What is the main difference between using OLAP on a DW and using SQL on a traditional
database for supporting business decision making?
Using on-line analytical processing operations allows data to be presented at different
layers of abstraction to accommodate different viewpoints. This is useful in a business
environment as different departments may want to see the company’s data in different
ways. Using OLAP is much faster than SQL aggregate queries as the aggregates are
precomputed and don’t need computationally expensive operations such as joins.
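The contrast can be sketched with Python's built-in sqlite3 module. The first query joins and aggregates at query time, while the second reads a summary table that stands in for precomputed OLAP aggregates. The table names, columns, and values are made up for illustration, and a plain summary table is only a rough approximation of how an OLAP server stores its aggregates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stores (store_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE sales (store_id INTEGER, year INTEGER, amount REAL);
    INSERT INTO stores VALUES (1, 'east'), (2, 'west');
    INSERT INTO sales VALUES (1, 2014, 10.0), (1, 2015, 15.0), (2, 2015, 5.0);
    -- Summary table playing the role of precomputed aggregates.
    CREATE TABLE sales_by_region_year AS
        SELECT st.region AS region, s.year AS year, SUM(s.amount) AS total
        FROM sales s JOIN stores st ON st.store_id = s.store_id
        GROUP BY st.region, s.year;
""")

# Relational route: join and aggregate when the question is asked.
on_the_fly = conn.execute(
    "SELECT SUM(s.amount) FROM sales s "
    "JOIN stores st ON st.store_id = s.store_id "
    "WHERE st.region = ? AND s.year = ?", ("east", 2015)).fetchone()[0]

# Warehouse route: look up the stored aggregate, with no join at query time.
precomputed = conn.execute(
    "SELECT total FROM sales_by_region_year WHERE region = ? AND year = ?",
    ("east", 2015)).fetchone()[0]

print(on_the_fly, precomputed)
```

Both routes return the same figure; the difference is that the join and aggregation cost is paid once when the summary is built rather than on every query.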