Mining Heterogeneous
Information Networks
JIAWEI HAN
COMPUTER SCIENCE
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
JULY 20, 2015
Outline
 Motivation: Why Mining Information Networks?
 Part I: Clustering, Ranking and Classification
 Clustering and Ranking in Information Networks
 Classification of Information Networks
 Part II: Meta-Path Based Exploration of Information Networks
 Similarity Search in Information Networks
 Relationship Prediction in Information Networks
 Recommendation with Heterogeneous Information Networks
 User-Guided Meta-Path based clustering in heterogeneous networks
 ClusCite: Citation recommendation in heterogeneous networks
4
Ubiquitous Information Networks
 Graphs and substructures: Chemical compounds, visual objects, circuits, XML
 Biological networks
 Bibliographic networks: DBLP, ArXiv, PubMed, …
 Social networks: Facebook >100 million active users
 World Wide Web (WWW): > 3 billion nodes, > 50 billion arcs
 Cyber-physical networks
[Figure: example networks — yeast protein interaction network, World-Wide Web, co-author network, social network sites]
5
What Are Heterogeneous Information Networks?
 Information network: A network where each node represents an entity (e.g., actor in a
social network) and each link (e.g., tie) a relationship between entities
 Each node/link may have attributes, labels, and weights
 Link may carry rich semantic information
 Homogeneous vs. heterogeneous networks
 Homogeneous networks
 Single object type and single link type
 Single-relation social networks (e.g., friendship networks)
 WWW: A collection of linked Web pages
 Heterogeneous, multi-typed networks
 Multiple object and link types
 Medical network: Patients, doctors, diseases, contacts, treatments
 Bibliographic network: Publications, authors, venues (e.g., DBLP > 2 million papers)
6
Homogeneous vs. Heterogeneous Networks
Conference-Author Network
SIGMOD
SDM
ICDM
KDD
EDBT
VLDB
ICML
AAAI
Tom
Jim
Lucy
Mike
Jack
Tracy
Cindy
Bob
Mary
Alice
Co-author Network
7
Mining Heterogeneous Information Networks
 Homogeneous networks can often be derived from their original
heterogeneous networks
 Ex. Coauthor networks can be derived from author-paper-conference
networks by projection on authors
 Paper citation networks can be derived from a complete bibliographic
network with papers and citations projected
 Heterogeneous networks carry richer information than their corresponding
projected homogeneous networks
 Typed heterogeneous network vs. non-typed heterogeneous network (i.e.,
not distinguishing different types of nodes)
 Typed nodes and links imply more structures, leading to richer discovery
 Mining semi-structured heterogeneous information networks
 Clustering, ranking, classification, prediction, similarity search, etc.
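The projection step above can be sketched in a few lines of Python (a minimal sketch with hypothetical author-paper links; the weight of a co-author edge counts shared papers):

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical author-paper links from a heterogeneous bibliographic network.
author_paper = [
    ("Tom", "p1"), ("Jim", "p1"),
    ("Tom", "p2"), ("Lucy", "p2"), ("Jim", "p2"),
    ("Mike", "p3"),
]

# Group authors by paper, then project onto authors:
# two authors gain one unit of edge weight per paper they co-wrote.
paper_authors = defaultdict(set)
for author, paper in author_paper:
    paper_authors[paper].add(author)

coauthor_weight = defaultdict(int)
for authors in paper_authors.values():
    for a, b in combinations(sorted(authors), 2):
        coauthor_weight[(a, b)] += 1

print(dict(coauthor_weight))
```

Note that the projection discards the papers themselves, which is exactly the information loss the bullet points above warn about.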
8
Examples of Heterogeneous Information Networks
 Bibliographic information networks: DBLP, ArXiv, PubMed
 Entity types: paper (P), venue (V), author (A), and term (T)
 Relation types: authors write papers, venues publish papers, papers contain terms
 Twitter information network, and other social media networks
 Object types: user, tweet, hashtag, and term
 Relation/link types: users follow users, users post tweets, tweets reply tweets,
tweets use terms, tweets contain hashtags
 Flickr information network
 Object types: image, user, tag, group, and comment
 Relation types: users upload images, images contain tags, images belong to groups,
users post comments, and comments comment on images
 Healthcare information network
 Object types: doctor, patient, disease, treatment, and device
 Relation types: treatments used-for diseases, patients have diseases, patients visit
doctors
9
Structures Facilitate Mining Heterogeneous Networks
 Network construction: generates structured networks from unstructured text data
 Each node: an entity; each link: a relationship between entities
 Each node/link may have attributes, labels, and weights
 Heterogeneous, multi-typed networks: e.g., Medical network: Patients, doctors,
diseases, contacts, treatments
[Figure: example network schemas — the DBLP bibliographic network (venue, paper, author), the IMDB movie network (actor, movie, director, studio), and the Facebook network]
10
What Can be Mined from Heterogeneous Networks?
 A homogeneous network can be derived from its “parent” heterogeneous network
 Ex. Coauthor networks from the original author-paper-conference networks
 Heterogeneous networks carry richer info. than the projected homogeneous networks
 Typed nodes & links imply more structures, leading to richer discovery
 Ex.: DBLP: A Computer Science bibliographic database (network)
Knowledge hidden in the DBLP network → Mining function
 Who are the leading researchers on Web search? → Ranking
 Who are the peer researchers of Jure Leskovec? → Similarity search
 Whom will Christos Faloutsos collaborate with? → Relationship prediction
 Which relationships are most influential for an author to decide her topics? → Relation strength learning
 How did the field of Data Mining emerge, and how is it evolving? → Network evolution
 Which authors are rather different from their peers in IR? → Outlier/anomaly detection
11
Principles of Mining Heterogeneous Information Net
 Information propagation across heterogeneous nodes & links
 How to compute ranking scores, similarity scores, and clusters, and how to make
good use of class labels, across heterogeneous nodes and links
 Objects in the networks are interdependent and knowledge can only be mined
using the holistic information in a network
 Search and mining by exploring network meta structures
 Heter. info networks: semi-structured and typed
 Network schema: a meta structure, guidance of search and mining
 Meta-path based similarity search and mining
 User-guided exploration of information networks
 Automatically select the right relation (or meta-path) combinations with
appropriate weights for a particular search or mining task
 User-guided or feedback-based network exploration is one such strategy
12
13
Outline
 Motivation: Why Mining Information Networks?
 Part I: Clustering, Ranking and Classification
 Clustering and Ranking in Information Networks
 Classification of Information Networks
 Part II: Meta-Path Based Exploration of Information Networks
 Similarity Search in Information Networks
 Relationship Prediction in Information Networks
 Recommendation with Heterogeneous Information Networks
 User-Guided Meta-Path based clustering in heterogeneous networks
 ClusCite: Citation recommendation in heterogeneous networks
14
Ranking-Based Clustering in Heterogeneous Networks
 Clustering and ranking: Two critical functions in data mining
 Clustering without ranking? Think of the dark time of Web search before Google's PageRank
 Ranking will make more sense within a particular cluster
 Einstein in physics vs. Turing in computer science
 Why not integrate ranking with clustering & classification?
 High-ranked objects should be more important in a cluster than low-ranked ones
 Why give every object the same weight in the same cluster?
 But how do we obtain these weights?
 Integrate ranking with clustering/classification in the same process
 Ranking, as the feature, is conditional (i.e., relative) to a specific cluster
 Ranking and clustering may mutually enhance each other
 Ranking-based clustering: RankClus [EDBT’09], NetClus [KDD’09]
15
A Bi-Typed Network Model and Simple Ranking
 A bi-typed network model
 Let X represent the venue type
 Y: the author type
 The DBLP network can be represented as matrix W
 Our task: Rank-based clustering of heterogeneous network W
 Simple Ranking
 Proportional to # of publications of an author and a venue
 Considers only immediate neighborhood in the network
But what about an author publishing many papers only in very weak venues?
A bi-typed network is represented by the block matrix
W = ( W_XX  W_XY )
    ( W_YX  W_YY )
16
The RankClus Methodology
 Ranking as the feature of the cluster
 Ranking is conditional on a specific cluster
 E.g., VLDB’s rank in Theory vs. its rank in the DB area
 The distributions of ranking scores over objects are different in each cluster
 Clustering and ranking are mutually enhanced
 Better clustering: Rank distributions for clusters are more distinguishing from
each other
 Better ranking: Better metric for objects is learned from the ranking
 Not every object should be treated equally in clustering!
 Y. Sun, et al., “RankClus: Integrating Clustering with Ranking for Heterogeneous
Information Network Analysis”, EDBT'09
17
Simple Ranking vs. Authority Ranking
 Simple Ranking
 Proportional to # of publications of an author / a conference
 Considers only immediate neighborhood in the network
 Authority Ranking:
 More sophisticated “rank rules” are needed
 Propagate the ranking scores in the network over different types
What about an author publishing 100 papers in very weak conferences?
18
Authority Ranking
 Methodology: Propagate the ranking scores in the network over different types
 Rule 1: Highly ranked authors publish many papers in highly ranked venues
 Rule 2: Highly ranked venues attract many papers from many highly ranked authors
 Rule 3: The rank of an author is enhanced if he or she co-authors with many highly
ranked authors
 Other ranking functions are quite possible (e.g., using domain knowledge)
 Ex. Journals may weigh more than conferences in science
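Rules 1 and 2 can be sketched as alternating score propagation on a toy author-venue count matrix (names and counts are hypothetical; this follows the spirit of the propagation rules, not the exact RankClus equations):

```python
# Toy count matrix W[a][v]: number of papers author a published in venue v.
authors = ["Tom", "Jim", "Lucy"]
venues = ["SIGMOD", "KDD"]
W = [
    [3, 0],  # Tom
    [1, 2],  # Jim
    [0, 1],  # Lucy
]

def normalize(xs):
    s = sum(xs)
    return [x / s for x in xs]

# Start from uniform scores and alternate the two propagation rules:
# venues are ranked by the scores of the authors publishing in them (Rule 2),
# authors by the scores of the venues they publish in (Rule 1).
a_score = normalize([1.0] * len(authors))
v_score = normalize([1.0] * len(venues))
for _ in range(50):
    v_score = normalize([sum(W[i][j] * a_score[i] for i in range(len(authors)))
                         for j in range(len(venues))])
    a_score = normalize([sum(W[i][j] * v_score[j] for j in range(len(venues)))
                         for i in range(len(authors))])

print(dict(zip(authors, a_score)), dict(zip(venues, v_score)))
```

Unlike simple ranking, an author publishing only in weak venues stays low even with many papers, because scores flow through venue authority.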
19
Alternative Ranking Functions
 A ranking function is not only related to the link property, but also depends on
domain knowledge
 Ex: Journals may weigh more than conferences in science
 Ranking functions can be defined on multi-typed networks
 Ex: PopRank takes into account the impact both from the same type of objects
and from the other types of objects, with different impact factors for different
types
 Use expert knowledge, for example,
 TrustRank semi-automatically separates reputable, good objects from spam ones
 Personalized PageRank uses expert ranking as query and generates rank
distributions w.r.t. such knowledge
 A research problem that needs systematic study
20
The EM (Expectation Maximization) Algorithm
 The k-means algorithm has two steps at each iteration (in the E-M framework):
 Expectation Step (E-step): Given the current cluster centers, each object is
assigned to the cluster whose center is closest to the object: An object is expected
to belong to the closest cluster
 Maximization Step (M-step): Given the cluster assignment, for each cluster, the
algorithm adjusts the center so that the sum of distances from the objects assigned
to this cluster to the new center is minimized
 The (EM) algorithm: A framework to approach maximum likelihood or maximum a
posteriori estimates of parameters in statistical models.
 E-step assigns objects to clusters according to the current probabilistic clustering
or parameters of probabilistic clusters
 M-step finds the new clustering or parameters that minimize the sum of squared
error (SSE) or maximize the expected likelihood
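The E-step/M-step alternation can be illustrated with a tiny 1-D k-means run (toy data, plain Python):

```python
# A minimal 1-D k-means sketch showing the E-step / M-step alternation
# described above (hypothetical toy data).
points = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]
centers = [0.0, 10.0]  # initial guesses

for _ in range(10):
    # E-step: assign each point to its closest center
    clusters = [[], []]
    for p in points:
        k = min(range(len(centers)), key=lambda k: abs(p - centers[k]))
        clusters[k].append(p)
    # M-step: move each center to the mean of its assigned points,
    # which minimizes the within-cluster sum of squared distances
    centers = [sum(c) / len(c) if c else centers[k]
               for k, c in enumerate(clusters)]

print(centers)
```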
21
From Conditional Rank Distribution to E-M Framework
 Given a bi-typed bibliographic network, how can we use the conditional rank scores
to further improve the clustering results?
 Conditional rank distribution as cluster feature
 For each cluster Ck, the conditional rank scores, rX|CK and rY|CK, can be viewed as
conditional rank distributions of X and Y, which are the features for cluster Ck
 Cluster membership as object feature
 From p(k|oi) ∝ p(oi|k)p(k): the higher an object's conditional rank in a cluster
(p(oi|k)), the higher the possibility that the object belongs to that cluster (p(k|oi))
 Highly ranked attribute object has more impact on determining the cluster
membership of a target object
 Parameter estimation using the Expectation-Maximization algorithm
 E-step: Calculate the distribution p(z = k|yj, xi, Θ) based on the current value of Θ
 M-Step: Update Θ according to the current distribution
22
RankClus: Integrating Clustering with Ranking
 An EM-styled algorithm
 Initialization
 Randomly partition
 Repeat
 Ranking: rank objects in each sub-network induced from each cluster
 Generating a new measure space: estimate mixture-model coefficients for each target object
 Adjusting clusters
 Until change < threshold
[Figure: RankClus alternates objects ranking, sub-network ranking, and clustering on the conference-author network, separating venues such as SIGMOD/VLDB/EDBT, KDD/ICDM/SDM, and AAAI/ICML into clusters]
RankClus [EDBT’09]: Ranking and clustering mutually
enhancing each other in an E-M framework
An E-M framework for iterative enhancement
23
Step-by-Step Running of RankClus
Clustering and ranking of two fields, DB/DM vs. HW/CA (hardware/computer architecture):
 Initially, ranking distributions are mixed together; the two clusters of objects are mixed but somehow preserve similarity
 After early iterations: improved a little
 Then: the two clusters are almost well separated; improved significantly
 Finally: stable; the two clusters are well separated
24
Clustering Performance (NMI)
 Compare the clustering accuracy: For N objects, K clusters, and two clustering
results, let n(i, j) be the number of objects with cluster label i in the 1st clustering
result (say, generated by the new algorithm) and cluster label j in the 2nd clustering
result (say, the ground truth)
 Normalized Mutual Information (NMI):
 Joint distribution: p(i, j) = n(i, j)/N
 Row distribution: p1(i) = Σ_j p(i, j); column distribution: p2(j) = Σ_i p(i, j)
 NMI = Σ_i Σ_j p(i, j) log[ p(i, j) / (p1(i) p2(j)) ] / sqrt( [Σ_i p1(i) log(1/p1(i))] · [Σ_j p2(j) log(1/p2(j))] )
D1: med. separated & med. density
D2: med. separated & low density
D3: med. separated & high density
D4: highly separated & med. density
D5: poorly separated & med. density
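A small sketch of the NMI computation (hypothetical label lists; identical partitions up to renaming should give NMI = 1):

```python
import math
from collections import Counter

labels_a = [0, 0, 1, 1, 1, 0]   # e.g., result of the new algorithm
labels_b = [1, 1, 0, 0, 0, 1]   # e.g., ground truth (same partition, renamed)

N = len(labels_a)
joint = Counter(zip(labels_a, labels_b))   # n(i, j)
pa = Counter(labels_a)                     # row marginal counts
pb = Counter(labels_b)                     # column marginal counts

# Mutual information and the two entropies, then normalize.
mi = sum((n / N) * math.log((n / N) / ((pa[i] / N) * (pb[j] / N)))
         for (i, j), n in joint.items())
ha = -sum((n / N) * math.log(n / N) for n in pa.values())
hb = -sum((n / N) * math.log(n / N) for n in pb.values())
nmi = mi / math.sqrt(ha * hb)
print(nmi)
```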
25
RankClus: Clustering & Ranking CS Venues in DBLP
Top-10 conferences in 5 clusters using RankClus in DBLP (when k = 15)
26
Time Complexity: Linear to # of Links
 At each iteration, |E|: edges in network, m: # of target objects, K: # of clusters
 Ranking for sparse network
 ~O(|E|)
 Mixture model estimation
 ~O(K|E|+mK)
 Cluster adjustment
 ~O(mK^2)
 In all, linear to |E|
 ~O(K|E|)
 Note: SimRank will be at least quadratic at each iteration since it evaluates distance
between every pair in the network
Comparison of algorithms on execution time
27
NetClus: Ranking-Based Clustering with Star Network Schema
 Beyond bi-typed network: Capture more semantics with multiple types
 Split a network into multi-subnetworks, each a (multi-typed) net-cluster [KDD’09]
[Figure: DBLP star network schema — paper (P) at the center, linked to author (A) by "write", venue (V) by "publish", and term (T) by "contain"; NetClus splits the Computer Science network into net-clusters such as Database, Hardware, and Theory]
DBLP network: Using terms, venues, and authors
to jointly distinguish a sub-field, e.g., database
28
The NetClus Algorithm
 Generate initial partitions for target objects and induce initial net-clusters
from the original network
 Repeat // An E-M Framework
 Build ranking-based probabilistic generative model for each net-cluster
 Calculate the posterior probabilities for each target object
 Adjust their cluster assignment according to the new measure defined by
the posterior probabilities to each cluster
 Until the clusters do not change significantly
 Calculate the posterior probabilities for each attribute object in each net-cluster
29
NetClus on the DBLP Network
 NetClus initialization: G = (V, E, W), with weight w_{xi,xj} linking xi and xj
 V = A ∪ C ∪ D ∪ T, where D (paper), A (author), C (venue), T (term)
 Simple ranking: rank each object by its number of immediate links (e.g., publication counts)
 Authority ranking for type Y based on type X: scores propagate through the center type Z (paper)
 For DBLP: venue scores and author scores mutually reinforce each other through the papers they share
30
Multi-Typed Networks Lead to Better Results
 The network cluster for database area: Conferences, Authors, and Terms
 NetClus leads to better clustering and ranking than RankClus
 NetClus vs. RankClus: 16% higher accuracy on conference clustering
31
NetClus: Distinguishing Conferences
Posterior probabilities of each venue belonging to the discovered net-clusters (one column per cluster):
 AAAI 0.0022667 0.00899168 0.934024 0.0300042 0.0247133
 CIKM 0.150053 0.310172 0.00723807 0.444524 0.0880127
 CVPR 0.000163812 0.00763072 0.931496 0.0281342 0.032575
 ECIR 3.47023e-05 0.00712695 0.00657402 0.978391 0.00787288
 ECML 0.00077477 0.110922 0.814362 0.0579426 0.015999
 EDBT 0.573362 0.316033 0.00101442 0.0245591 0.0850319
 ICDE 0.529522 0.376542 0.00239152 0.0151113 0.0764334
 ICDM 0.000455028 0.778452 0.0566457 0.113184 0.0512633
 ICML 0.000309624 0.050078 0.878757 0.0622335 0.00862134
 IJCAI 0.00329816 0.0046758 0.94288 0.0303745 0.0187718
 KDD 0.00574223 0.797633 0.0617351 0.067681 0.0672086
 PAKDD 0.00111246 0.813473 0.0403105 0.0574755 0.0876289
 PKDD 5.39434e-05 0.760374 0.119608 0.052926 0.0670379
 PODS 0.78935 0.113751 0.013939 0.00277417 0.0801858
 SDM 0.000172953 0.841087 0.058316 0.0527081 0.0477156
 SIGIR 0.00600399 0.00280013 0.00275237 0.977783 0.0106604
 SIGMOD 0.689348 0.223122 0.0017703 0.00825455 0.0775055
 VLDB 0.701899 0.207428 0.00100012 0.0116966 0.0779764
 WSDM 0.00751654 0.269259 0.0260291 0.683646 0.0135497
 WWW 0.0771186 0.270635 0.029307 0.451857 0.171082
32
NetClus: Experiment on DBLP: Database System Cluster
 NetClus generates high quality
clusters for all of the three
participating types in the
DBLP network
 Quality can be easily judged
by our commonsense
knowledge
 Highly-ranked objects: Objects
centered in the cluster
Top-ranked terms:
database 0.0995511
databases 0.0708818
system 0.0678563
data 0.0214893
query 0.0133316
systems 0.0110413
queries 0.0090603
management 0.00850744
object 0.00837766
relational 0.0081175
processing 0.00745875
based 0.00736599
distributed 0.0068367
xml 0.00664958
oriented 0.00589557
design 0.00527672
web 0.00509167
information 0.0050518
model 0.00499396
efficient 0.00465707
Top-ranked authors:
Surajit Chaudhuri 0.00678065
Michael Stonebraker 0.00616469
Michael J. Carey 0.00545769
C. Mohan 0.00528346
David J. DeWitt 0.00491615
Hector Garcia-Molina 0.00453497
H. V. Jagadish 0.00434289
David B. Lomet 0.00397865
Raghu Ramakrishnan 0.0039278
Philip A. Bernstein 0.00376314
Joseph M. Hellerstein 0.00372064
Jeffrey F. Naughton 0.00363698
Yannis E. Ioannidis 0.00359853
Jennifer Widom 0.00351929
Per-Ake Larson 0.00334911
Rakesh Agrawal 0.00328274
Dan Suciu 0.00309047
Michael J. Franklin 0.00304099
Umeshwar Dayal 0.00290143
Abraham Silberschatz 0.00278185
Top-ranked venues:
VLDB 0.318495
SIGMOD Conf. 0.313903
ICDE 0.188746
PODS 0.107943
EDBT 0.0436849
33
Rank-Based Clustering: Works in Multiple Domains
RankCompete: Organize your photo album automatically! Uses multi-typed image features to build heterogeneous networks.
MedRank: Rank treatments for AIDS from MEDLINE. Explores multiple types in a star-schema network.
34
iNextCube: Information Network-Enhanced Text Cube (VLDB'09 Demo)
[Figure: architecture of iNextCube — dimension hierarchies (net-cluster hierarchy) generated by NetClus; author/venue/term ranking for each research area, where research areas can be at different levels: All → DB and IS, Theory, Architecture, … → DB, DM, IR → XML, Distributed DB, …]
Demo: iNextCube.cs.uiuc.edu
35
Impact of RankClus Methodology
 RankCompete [Cao et al., WWW’10]
 Extend to the domain of web images
 RankClus in Medical Literature [Li et al., Working paper]
 Ranking treatments for diseases
 RankClass [Ji et al., KDD’11]
 Integrate classification with ranking
 Trustworthy Analysis [Gupta et al., WWW’11] [Khac Le et al., IPSN’11]
 Integrate clustering with trustworthiness score
 Topic Modeling in Heterogeneous Networks [Deng et al., KDD’11]
 Propagate topic information among different types of objects
 …
36
37
Outline
 Motivation: Why Mining Information Networks?
 Part I: Clustering, Ranking and Classification
 Clustering and Ranking in Information Networks
 Classification of Information Networks
 Part II: Meta-Path Based Exploration of Information Networks
 Similarity Search in Information Networks
 Relationship Prediction in Information Networks
 Recommendation with Heterogeneous Information Networks
 User-Guided Meta-Path based clustering in heterogeneous networks
 ClusCite: Citation recommendation in heterogeneous networks
38
Classification: Knowledge Propagation Across
Heterogeneous Typed Networks
 RankClass [Ji et al., KDD'11]: Ranking-based classification
 Highly ranked objects play a bigger role in classification
 Class membership and ranking are statistical distributions
 Let ranking and classification mutually enhance each other!
 Output: Classification results + ranking list of objects within each class
Classification: Labeled knowledge propagates through multi-typed
objects across heterogeneous networks [KDD’11]
39
GNetMine: Methodology
 Classification of networked data can be essentially viewed as a process of
knowledge propagation, where information is propagated from labeled objects
to unlabeled ones through links until a stationary state is achieved
 A novel graph-based regularization framework to address the classification
problem on heterogeneous information networks
 Respect the link type differences by preserving consistency over each relation
graph corresponding to each type of links separately
 Mathematical intuition: Consistency assumption
 The confidence (f) of two objects (x_ip and x_jq) belonging to class k should be
similar if x_ip ↔ x_jq (R_ij,pq > 0)
 f should be similar to the given ground truth on the labeled data
40
GNetMine: Graph-Based Regularization
 Minimize the objective function (for each class k):

J(f_1^(k), ..., f_m^(k)) = Σ_{i,j=1..m} λ_ij Σ_{p=1..n_i} Σ_{q=1..n_j} R_ij,pq ( f_ip^(k)/√D_ij,pp − f_jq^(k)/√D_ji,qq )²
                           + Σ_{i=1..m} α_i (f_i^(k) − y_i^(k))ᵀ (f_i^(k) − y_i^(k))

 Smoothness constraints: objects linked together should share similar estimations of confidence of belonging to class k
 Normalization term applied to each type of link separately: reduces the impact of the popularity of objects
 Confidence estimation on labeled data and their pre-given labels should be similar
 User preference (λ_ij, α_i): how much do you value this relationship / ground truth?
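To make the smoothness and fit terms concrete, here is a toy evaluation of this kind of objective for a single class k on a hypothetical 2-type network (one relation graph, made-up confidence scores):

```python
import math

# R[p][q]: links between 3 objects of type 1 (say, papers) and
# 2 objects of type 2 (say, venues) — all values hypothetical.
R = [[1, 0],
     [1, 0],
     [0, 1]]
f1 = [0.9, 0.8, 0.1]   # current confidences of type-1 objects for class k
f2 = [0.85, 0.1]       # current confidences of type-2 objects for class k
y1 = [1.0, 0.0, 0.0]   # only the first paper is labeled with class k
lam, alpha = 1.0, 0.5  # user preference weights

# Degrees used for the per-link-type normalization.
D12 = [sum(row) for row in R]                              # type-1 degrees
D21 = [sum(R[p][q] for p in range(3)) for q in range(2)]   # type-2 degrees

# Smoothness term: linked objects should have similar normalized confidence.
smooth = sum(R[p][q] * (f1[p] / math.sqrt(D12[p]) -
                        f2[q] / math.sqrt(D21[q])) ** 2
             for p in range(3) for q in range(2))
# Fit term: confidence on labeled objects should match the given labels.
fit = sum((f1[p] - y1[p]) ** 2 for p in range(3))
J = lam * smooth + alpha * fit
print(J)
```

Minimizing J over f trades off smoothness along each typed relation graph against agreement with the labels, which is exactly the consistency assumption stated above.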
41
RankClass: Ranking-Based Classification
 Classification in heterogeneous networks
 Knowledge propagation: Class label knowledge propagated across multi-typed
objects through a heterogeneous network
 GNetMine [Ji et al., PKDD’10]: Objects are treated equally
 RankClass [Ji et al., KDD’11]: Ranking-based classification
 Highly ranked objects play a bigger role in classification
 An object can only be ranked high in some focused classes
 Class membership and ranking are stat. distributions
 Let ranking and classification mutually enhance each other!
 Output: Classification results + ranking list of objects within each class
42
From RankClus to GNetMine & RankClass
 RankClus [EDBT’09]: Clustering and ranking working together
 No training, no available class labels, no expert knowledge
 GNetMine [PKDD’10]: Incorp. label information in networks
 Classification in heterog. networks, but objects treated equally
 RankClass [KDD’11]: Integration of ranking and classification in heterogeneous
network analysis
 Ranking: informative understanding & summary of each class
 Class membership is critical information when ranking objects
 Let ranking and classification mutually enhance each other!
 Output: Classification results + ranking list of objects within each class
43
Why Rank-Based Classification?
 Different objects within one class have different importance/visibility!
 The ranking of objects within one class serves as an informative understanding and
summary of the class
44
Motivation
 Why not do ranking after classification, or vice versa?
 Because they mutually enhance each other, not unidirectional.
 RankClass: iteratively let ranking and classification mutually enhance each other
Ranking Classification
45
RankClass: Ranking-Based Classification
Class: a group of multi-typed nodes sharing the same topic + within-class ranking
node
class
The RankClass
Framework
46
Comparing Classification Accuracy on the DBLP Data
 Class: Four research areas:
DB, DM, AI, IR
 Four types of objects
 Paper(14376), Conf.(20),
Author(14475),
Term(8920)
 Three types of relations
 Paper-conf., paper-author, paper-term
 Algorithms for comparison
 LLGC [Zhou et al., NIPS'03]
 wvRN [Macskassy et al., JMLR'07]
 nLB [Lu et al., ICML'03; Macskassy et al., JMLR'07]
47
Object Ranking in Each Class: Experiment
 DBLP: 4-fields data set (DB, DM, AI, IR) forming a heterogeneous information network
 Rank objects within each class (with extremely limited label information)
 Obtain high classification accuracy and excellent ranking within each class
Top-5 ranked conferences:
 Database: VLDB, SIGMOD, ICDE, PODS, EDBT
 Data Mining: KDD, SDM, ICDM, PKDD, PAKDD
 AI: IJCAI, AAAI, ICML, CVPR, ECML
 IR: SIGIR, ECIR, CIKM, WWW, WSDM
Top-5 ranked terms:
 Database: data, database, query, system, xml
 Data Mining: mining, data, clustering, classification, frequent
 AI: learning, knowledge, reasoning, logic, cognition
 IR: retrieval, information, web, search, text
48
49
Outline
 Motivation: Why Mining Information Networks?
 Part I: Clustering, Ranking and Classification
 Clustering and Ranking in Information Networks
 Classification of Information Networks
 Part II: Meta-Path Based Exploration of Information Networks
 Similarity Search in Information Networks
 Relationship Prediction in Information Networks
 Recommendation with Heterogeneous Information Networks
 User-Guided Meta-Path based clustering in heterogeneous networks
 ClusCite: Citation recommendation in heterogeneous networks
50
Similarity Search in Heterogeneous Networks
 Similarity measure/search is the base for cluster analysis
 Who are the most similar to Christos Faloutsos based on the DBLP network?
 Meta-Path: Meta-level description of a path between two objects
 A path on network schema
 Denote an existing or concatenated relation between two object types
 Different meta-paths tell different semantics
Meta-path A-P-A (co-authorship): Christos' students or close collaborators
Meta-path A-P-V-P-A: authors working in similar fields with similar reputation
51
Existing Popular Similarity Measures for Networks
 Random walk (RW):
 The probability of random walk starting at x and ending at y, with
meta-path P
 Used in Personalized PageRank (P-Pagerank) (Jeh and Widom 2003)
 Favors highly visible objects (i.e., objects with large degrees)
 Pairwise random walk (PRW):
 The probability of pairwise random walk starting at (x, y) and ending
at a common object (say z), following a meta-path (P1, P2)
 Used in SimRank (Jeh and Widom 2002)
 Favors pure objects (i.e., objects with highly skewed distribution in
their in-links or out-links)
Random walk: s(x, y) = Σ_{p ∈ P} prob(p)
Pairwise random walk: s(x, y) = Σ_{(p1, p2) ∈ (P1, P2^{-1})} prob(p1) · prob(p2), where p1 starts at x and p2 starts at y, meeting at a common object z
Note: P-PageRank and SimRank do not distinguish object type and relationship type
52
Which Similarity Measure Is Better for Finding Peers?
 Meta-path: APCPA
 Mike publishes similar # of papers as Bob
and Mary
 Other measures find Mike is closer to Jim
Comparison of multiple measures, a toy example — who is more similar to Mike?

Author \ Venue   SIGMOD  VLDB  ICDM  KDD
Mike             2       1     0     0
Jim              50      20    0     0
Mary             2       0     1     0
Bob              2       1     0     0
Ann              0       0     1     1

Measure \ Author   Jim     Mary    Bob     Ann
P-PageRank         0.376   0.013   0.016   0.005
SimRank            0.716   0.572   0.713   0.184
Random Walk        0.8983  0.0238  0.0390  0
Pairwise R.W.      0.5714  0.4440  0.5556  0
PathSim (APCPA)    0.083   0.8     1       0

 PathSim favors peers: objects with strong connectivity and similar visibility under a given meta-path
 Mike publishes a similar number of papers as Bob and Mary; the other measures find Mike closer to the highly visible Jim
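The PathSim row of the toy example can be reproduced directly: under the meta-path APCPA, the path count between two authors is M = W·Wᵀ for the author-by-venue count table W, and PathSim(x, y) = 2·M[x][y] / (M[x][x] + M[y][y]):

```python
# Author-by-venue paper counts from the toy table above
# (columns: SIGMOD, VLDB, ICDM, KDD).
authors = ["Mike", "Jim", "Mary", "Bob", "Ann"]
W = [
    [2, 1, 0, 0],    # Mike
    [50, 20, 0, 0],  # Jim
    [2, 0, 1, 0],    # Mary
    [2, 1, 0, 0],    # Bob
    [0, 0, 1, 1],    # Ann
]

def path_count(x, y):
    # Number of APCPA path instances between authors x and y.
    return sum(W[x][c] * W[y][c] for c in range(4))

def pathsim(x, y):
    return 2 * path_count(x, y) / (path_count(x, x) + path_count(y, y))

mike = authors.index("Mike")
for other in ("Jim", "Mary", "Bob", "Ann"):
    print(other, round(pathsim(mike, authors.index(other)), 3))
```

Bob, a true peer of Mike, scores 1, while the far more visible Jim scores only about 0.083 — matching the table.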
53
Example with DBLP: Find Academic Peers by PathSim
 Anhai Doan
 CS, Wisconsin
 Database area
 PhD: 2002
 Jun Yang
 CS, Duke
 Database area
 PhD: 2001
 Amol Deshpande
 CS, Maryland
 Database area
 PhD: 2004
 Jignesh Patel
 CS, Wisconsin
 Database area
 PhD: 1998
Meta-Path: Author-Paper-Venue-Paper-Author
54
Some Meta-Path Is “Better” Than Others
 Which pictures are most similar to this one?
[Network schema: image, tag, user, group]
Meta-path Image-Tag-Image: evaluates similarity between images according to their linked tags
Meta-path Image-Tag-Image-Group-Image-Tag-Image: evaluates similarity between images according to tags and groups
55
Comparing Similarity Measures in DBLP Data
Favor highly
visible objects
Are these tiny forums
most similar to SIGMOD?
Which venues are most
similar to DASFAA?
Which venues are most
similar to SIGMOD?
56
Long Meta-Path May Not Carry the Right Semantics
 Repeat the meta-path 2, 4, and infinite times for conference similarity query
57
Co-Clustering-Based Pruning Algorithm
 Meta-Path based similarity computation can be costly
 The overall cost can be reduced by storing commuting matrices
for short path schemas and computing top-k queries online
 Framework
 Generate co-clusters for materialized commuting
matrices for feature objects and target objects
 Derive upper bound for similarity between object and
target cluster and between object and object
 Safely prune target clusters and objects if the upper bound
similarity is lower than current threshold
 Dynamically update top-k threshold
58
Meta-Path: A Key Concept for Heterogeneous Networks
 Meta-path based mining
 PathPredict [Sun et al., ASONAM’11]
 Co-authorship prediction using meta-path based similarity
 PathPredict_when [Sun et al., WSDM’12]
 When a relationship will happen
 Citation prediction [Yu et al., SDM’12]
 Meta-path + topic
 Meta-path learning
 User Guided Meta-Path Selection [Sun et al., KDD’12]
 Meta-path selection + clustering
59
60
Outline
 Motivation: Why Mining Information Networks?
 Part I: Clustering, Ranking and Classification
 Clustering and Ranking in Information Networks
 Classification of Information Networks
 Part II: Meta-Path Based Exploration of Information Networks
 Similarity Search in Information Networks
 Relationship Prediction in Information Networks
 Recommendation with Heterogeneous Information Networks
 User-Guided Meta-Path based clustering in heterogeneous networks
 ClusCite: Citation recommendation in heterogeneous networks
61
Relationship Prediction vs. Link Prediction
 Link prediction in homogeneous networks [Liben-Nowell and Kleinberg, 2003,
Hasan et al., 2006]
 E.g., friendship prediction
 Relationship prediction in heterogeneous networks
 Different types of relationships need different prediction models
 Different connection paths need to be treated separately!
 Meta-path-based approach to define topological features
62
Why Prediction Using Heterogeneous Info Networks?
 Rich semantics contained in structured, text-rich heterogeneous networks
 Homogeneous networks, such as coauthor networks, miss too much critically
important information
 Schema-guided relationship prediction
 Semantic relationships among similar typed links share similar semantics and are
comparable and inferable
 Relationships across different typed links are not directly comparable but their
collective behavior will help predict particular relationships
 Meta-paths: encoding topological features
 E.g., citation relations between authors: A —write→ P —cite→ P ←write— A
 Y. Sun, R. Barber, M. Gupta, C. Aggarwal and J. Han, "Co-Author Relationship
Prediction in Heterogeneous Bibliographic Networks", ASONAM'11
63
PathPredict: Meta-Path Based Relationship Prediction
 Who will be your new coauthors in the next 5 years?
 Meta path-guided prediction of links and relationships
 Philosophy: Meta path relationships among similar typed links share
similar semantics and are comparable and inferable
 Co-author prediction (A—P—A) [Sun et al., ASONAM’11]
 Use topological features encoded by meta paths, e.g., citation
relations between authors (A—P→P—A)
[Figure: meta-paths between authors of length ≤ 4, each with its semantic meaning]
64
Logistic Regression-Based Prediction Model
 Training and test pair: <xi, yi> = <history feature list, future relationship label>
 Logistic Regression Model
 Model the probability for each relationship as p(y_i = 1 | x_i) = exp(βᵀx_i) / (1 + exp(βᵀx_i))
 β is the coefficient vector over the features (including a constant intercept)
 MLE estimation
 Maximize the likelihood of observing all the relationships in the training data
Pair          A-P-A-P-A  A-P-V-P-A  A-P-T-P-A  A-P→P-A  Label (A-P-A)
<Mike, Ann>   4          5          100        3        Yes = 1
<Mike, Jim>   0          1          20         2        No = 0
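A minimal sketch of fitting this model to the two toy pairs above by stochastic gradient ascent on the log-likelihood (pure Python; the log-scaling of the counts is an added conditioning step, not part of the original model):

```python
import math

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clip to avoid exp overflow
    return 1.0 / (1.0 + math.exp(-z))

# Feature vectors: a constant intercept followed by the four meta-path
# counts from the toy table, log-scaled as log(1 + count).
raw = [
    ([4, 5, 100, 3], 1),  # <Mike, Ann>: became co-authors
    ([0, 1, 20, 2], 0),   # <Mike, Jim>: did not
]
X = [[1.0] + [math.log1p(c) for c in counts] for counts, _ in raw]
y = [label for _, label in raw]

# Gradient ascent: the gradient of the log-likelihood w.r.t. beta
# for one example is (y - p) * x.
beta = [0.0] * len(X[0])
lr = 0.1
for _ in range(20000):
    for xi, yi in zip(X, y):
        p = sigmoid(sum(b * v for b, v in zip(beta, xi)))
        beta = [b + lr * (yi - p) * v for b, v in zip(beta, xi)]

probs = [sigmoid(sum(b * v for b, v in zip(beta, xi))) for xi in X]
print([round(p, 3) for p in probs])
```

With two separable training pairs the fitted model pushes the positive pair's probability toward 1 and the negative pair's toward 0.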
65
Selection among Competitive Measures
We study four measures that defines a relationship R encoded by a meta path
 Path Count: Number of path instances between authors following R
 Normalized Path Count: Normalize path count following R by the “degree”
of authors
 Random Walk: Consider one way random walk following R
 Symmetric Random Walk: Consider random walk in both directions
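The four measures can be sketched from a meta-path's commuting (path-count) matrix. The tiny author-paper network is made up, and the normalizations follow common forms of these measures; the paper's exact definitions may differ in detail:

```python
import numpy as np

# Author-paper incidence; the commuting matrix C of the meta-path A-P-A
# holds path-instance counts between authors: C = AP @ AP.T
AP = np.array([[1, 1, 0],    # author 0 wrote papers 0, 1
               [0, 1, 1],    # author 1 wrote papers 1, 2
               [0, 0, 1]])   # author 2 wrote paper 2
C = AP @ AP.T

def path_count(C, a, b):
    # Number of path instances between a and b following R.
    return C[a, b]

def normalized_path_count(C, a, b):
    # Normalize the path count by the "degree" of both endpoints.
    return 2.0 * C[a, b] / (C[a].sum() + C[:, b].sum())

def random_walk(C, a, b):
    # One-way random-walk probability of reaching b from a along R.
    return C[a, b] / C[a].sum()

def symmetric_random_walk(C, a, b):
    # Random walk considered in both directions.
    return random_walk(C, a, b) + random_walk(C, b, a)
```

For example, authors 0 and 1 share one paper, so their path count is 1 while their one-way random-walk score is 1/3.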
66
Performance Comparison: Homogeneous vs.
Heterogeneous Topological Features
 Homogeneous features
 Only consider co-author sub-network (common neighbor; rooted PageRank)
 Mix all types together (homogeneous path count)
 Heterogeneous feature
 Heterogeneous path count
Notation — HP2hop: highly productive source authors reaching target
authors within 2 hops
67
Comparison among Different Topological Features
 Hybrid heterogeneous topological features perform the best
Notations
(1) the path count (PC)
(2) the normalized path count (NPC)
(3) the random walk (RW)
(4) the symmetric random walk (SRW)
PC1: homogeneous common neighbor
PCSum: homogeneous path count
68
The Power of PathPredict: Experiment on DBLP
 Explain the prediction power of
each meta-path
 Wald Test for logistic
regression
 Higher prediction accuracy than
using projected homogeneous
network
 11% higher in prediction
accuracy
Co-author prediction for Jian Pei: Only 42 among 4809
candidates are true first-time co-authors!
(Feature collected in [1996, 2002]; Test period in [2003,2009])
69
Evaluation of the prediction
power of different meta-paths
Prediction of new coauthors
of Jian Pei in [2003-2009]
Do social relations play a more important role?
70
When Will It Happen?
 From “whether” to “when”
 “Whether”: Will Jim rent the movie “Avatar” in Netflix?
 “When”: When will Jim rent the movie “Avatar”?
 What is the probability Jim will rent “Avatar” within 2 months? 𝑃(𝑌 ≤ 2)
 By when will Jim rent “Avatar” with 90% probability? 𝑡: 𝑃(𝑌 ≤ 𝑡) = 0.9
 What is the expected time it will take for Jim to rent “Avatar”? 𝐸(𝑌)
May provide useful
information to supply
chain management
Output: P(X=1)=?
Output: A distribution of time!
71
Generalized Linear Model
under Weibull Distribution Assumption
The Relationship Building Time Prediction Model
 Solution
 Directly model relationship building time: P(Y=t)
 Geometric distribution, Exponential distribution, Weibull distribution
 Use generalized linear model
 Deal with censoring (relationship builds beyond the observed time interval)
Training Framework
T: Right
Censoring
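The three "when" questions map onto distribution-level quantities. Below is a minimal sketch under a Weibull assumption with illustrative shape/scale parameters; the actual WSDM'12 model fits a generalized linear model over meta-path features and handles censoring, which this omits:

```python
import math

# Weibull with shape k and scale lam (values are illustrative):
#   P(Y <= t) = 1 - exp(-(t/lam)^k),   E[Y] = lam * Gamma(1 + 1/k)
def weibull_cdf(t, k, lam):
    return 1.0 - math.exp(-((t / lam) ** k))

def weibull_quantile(p, k, lam):
    # Smallest t with P(Y <= t) = p (inverse CDF).
    return lam * (-math.log(1.0 - p)) ** (1.0 / k)

def weibull_mean(k, lam):
    return lam * math.gamma(1.0 + 1.0 / k)

# "Probability Jim rents the movie within 2 months?"
p2 = weibull_cdf(2.0, k=1.5, lam=4.0)
# "By when will he rent it with 90% probability?"
t90 = weibull_quantile(0.9, k=1.5, lam=4.0)
```

With shape k = 1 the Weibull reduces to the exponential distribution, whose mean is simply the scale parameter.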
72
Author Citation Time Prediction in DBLP
 Top-4 meta-paths for author citation time prediction
 Predict when Philip S. Yu will cite a new author
Social relations are less important in author citation
prediction than in co-author prediction
Under Weibull distribution assumption
Study the same topic
Co-cited by the same paper
Follow co-authors’ citation
Follow the citations of authors
who study the same topic
73
74
Outline
 Motivation: Why Mining Information Networks?
 Part I: Clustering, Ranking and Classification
 Clustering and Ranking in Information Networks
 Classification of Information Networks
 Part II: Meta-Path Based Exploration of Information Networks
 Similarity Search in Information Networks
 Relationship Prediction in Information Networks
 Recommendation with Heterogeneous Information Networks
 User-Guided Meta-Path based clustering in heterogeneous networks
 ClusCite: Citation recommendation in heterogeneous networks
75
Enhancing the Power of Recommender Systems by Heterog.
Info. Network Analysis
 Heterogeneous relationships complement each other
 Users and items with limited feedback can be connected to the network by
different types of paths
 Connect new users or items in the information network
 Different users may require different models: Relationship heterogeneity makes
personalized recommendation models easier to define
Avatar TitanicAliens Revolutionary
Road
James
Cameron
Kate
Winslet
Leonardo
DiCaprio
Zoe
Saldana
Adventure
Romance
Collaborative filtering methods suffer from the data sparsity issue: a small set
of users & items have a large number of ratings, while most users & items have
only a few ratings (a long-tail distribution of # of ratings over users/items)
Personalized recommendation with heterog. networks [WSDM’14]
76
Relationship Heterogeneity Based Personalized
Recommendation Models
Different users may have different behaviors or preferences
Aliens
James Cameron fan
80s Sci-fi fan
Sigourney Weaver fan
Different users may be
interested in the same
movie for different reasons
 Two levels of personalization
 Data level
 Most recommendation methods use one
model for all users and rely on personal
feedback to achieve personalization
 Model level
 With different entity relationships, we can
learn personalized models for different
users to further distinguish their differences
77
Preference Propagation-Based Latent Features
(Example network — users: Alice, Bob, Charlie; movies: Titanic, Revolutionary
Road, Skyfall, King Kong; actors: Kate Winslet, Naomi Watts, Ralph Fiennes;
director: Sam Mendes; genre: drama; tag: Oscar Nomination)
1. Generate L different meta-paths (path types) connecting users and items
2. Propagate user implicit feedback along each meta-path
3. Calculate latent features for users and items for each meta-path with an
NMF-related method
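The propagation step can be sketched as meta-path matrix products; the two-user, three-movie network and the row normalization are illustrative choices (the full pipeline then factorizes each diffused matrix with an NMF-related method to obtain per-meta-path latent features):

```python
import numpy as np

# Made-up data: 2 users x 3 movies implicit feedback, and a movie-actor
# matrix linking the movies to 2 actors.
R  = np.array([[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0]])
IA = np.array([[1.0, 0.0],     # movie 0 features actor 0
               [1.0, 1.0],     # movie 1 features both actors
               [0.0, 1.0]])    # movie 2 features actor 1

# Propagate feedback along the meta-path  user - movie - actor - movie:
# a user's weight on a movie grows with actors shared with movies she liked.
R_diffused = R @ IA @ IA.T

# Row-normalize so each user's propagated preferences sum to 1.
R_diffused = R_diffused / R_diffused.sum(axis=1, keepdims=True)
```

User 0 liked only movie 0, yet after propagation she also has weight on movie 1 (shared actor 0) but none on movie 2.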
78
Recommendation Models
Observation 1: Different meta-paths may have different importance
 Global recommendation model (1): ranking score
   r(u_i, e_j) = Σ_{q=1..L} θ_q · U_q(i) · V_q(j)^T
 where U_q(i), V_q(j) are the q-th meta-path latent features for user i and
 item j, and θ_q is the weight of the q-th meta-path
Observation 2: Different users may require different models
 Personalized recommendation model (2):
   r(u_i, e_j) = Σ_{k=1..c} sim(C_k, u_i) · Σ_{q=1..L} θ_q^(k) · U_q(i) · V_q(j)^T
 where sim(C_k, u_i) is the user-cluster similarity, over c total soft user clusters
79
Parameter Estimation
• Bayesian personalized ranking (Rendle et al., UAI’09)
• Objective function (3):
    min_Θ − Σ ln σ( r(u_i, e_a) − r(u_i, e_b) )
  summed over each correctly ranked item pair, i.e., 𝑢𝑖 gave feedback to 𝑒𝑎 but
  not to 𝑒𝑏; σ is the sigmoid function
Learning the personalized recommendation model:
1. Soft-cluster users with NMF + k-means
2. For each user cluster, learn one model with Eq. (3)
3. Generate a personalized model for each user on the fly with Eq. (2)
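The BPR objective for learning the meta-path weights can be sketched as follows. The latent factors are random stand-ins for the NMF features, the feedback pairs are made up, and only the global weights θ are learned here (the personalized variant learns one θ per user cluster):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: L meta-paths, each with precomputed user/item latent factors.
L, n_users, n_items, d = 2, 4, 5, 3
U = rng.random((L, n_users, d))
V = rng.random((L, n_items, d))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score(theta, u, i):
    # Global model: weighted sum of per-meta-path user-item dot products.
    return sum(theta[q] * (U[q, u] @ V[q, i]) for q in range(L))

# Pairs (user, item with feedback e_a, item without feedback e_b).
pairs = [(0, 1, 2), (1, 0, 3), (2, 4, 1)]

def bpr_loss(theta):
    # Negated BPR objective: -sum ln sigmoid(r(u, e_a) - r(u, e_b)).
    return -sum(np.log(sigmoid(score(theta, u, a) - score(theta, u, b)))
                for u, a, b in pairs)

theta = np.ones(L)
initial_loss = bpr_loss(theta)
for _ in range(200):                       # stochastic gradient steps on theta
    for u, a, b in pairs:
        g = 1.0 - sigmoid(score(theta, u, a) - score(theta, u, b))
        for q in range(L):
            theta[q] += 0.05 * g * (U[q, u] @ (V[q, a] - V[q, b]))
trained_loss = bpr_loss(theta)
```

Because the score is linear in θ, the BPR log-likelihood is concave in θ, so small-step gradient updates reliably improve the objective.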
80
Experiments: Heterogeneous Network Modeling
Enhances the Quality of Recommendation
 Datasets
 Comparison methods
 Popularity: recommend the most popular items to users
 Co-click: conditional probabilities between items
 NMF: non-negative matrix factorization on user feedback
 Hybrid-SVM: use Rank-SVM with plain features (utilize both user feedback and
information network)
HeteRec personalized
recommendation
(HeteRec-p) leads to
the best
recommendation
81
82
Outline
 Motivation: Why Mining Information Networks?
 Part I: Clustering, Ranking and Classification
 Clustering and Ranking in Information Networks
 Classification of Information Networks
 Part II: Meta-Path Based Exploration of Information Networks
 Similarity Search in Information Networks
 Relationship Prediction in Information Networks
 Recommendation with Heterogeneous Information Networks
 User-Guided Meta-Path based clustering in heterogeneous networks
 ClusCite: Citation recommendation in heterogeneous networks
83
Why User Guidance in Clustering?
 Different users may like to get different clusters for different clustering goals
 Ex. Clustering authors based on their connections in the network
{1,2,3,4}
{5,6,7,8} {1,3,5,7}
{2,4,6,8}
{1,3}
{2,4}
{5,7}
{6,8}
Which meta-
path(s) to choose?
84
User Guidance Determines Clustering Results
 Different user preferences (e.g., by
seeding desired clusters) lead to the
choice of different meta-paths
{1}
{5} {1,2,3,4}
{5,6,7,8}+
{1}
{2}
{5}
{6}
{1,3}
{2,4}
{5,7}
{6,8}
+
Seeds Meta-path(s) Clustering
Seeds Meta-path(s) Clustering
 Problem: User-guided clustering
with meta-path selection
 Input:
 The target type for clustering T
 # of clusters k
 Seeds in some clusters: L1, …, Lk
 Candidate meta-paths: P1, …, PM
 Output:
 Weight of each meta-path: w1, …,
wm
 Clustering results that are
consistent with the user guidance
85
PathSelClus: A Probabilistic Modeling Approach
 Part 1: Modeling the Relationship Generation
 A good clustering result should lead to high likelihood in observing existing
relationships
 Higher quality relations should count more in the total likelihood
 Part 2: Modeling the Guidance from Users
 The more consistent with the guidance, the higher probability of the clustering
result
 Part 3: Modeling the Quality Weights for Meta-Paths
 The more consistent with the clustering result, the higher quality weight
86
Part 1: Modeling the Relationship Generation

86
87
Part 2: Modeling the Guidance from Users

87
88
Part 3: Modeling the Quality Weights for Meta-Paths

88
Dirichlet Distribution
89
The Learning Algorithm

89
90
Effectiveness of Meta-Path Selection
 Experiments on Yelp data
 Object Types: Users, Businesses, Reviews, Terms
 Relation Types: UR, RU, BR, RB, TR, RT
 Task: Candidate meta-paths: B-R-U-R-B, B-R-T-R-B
 Target objects: restaurants
 # of clusters: 6
 Output:
 PathSelClus vs. others
 High accuracy
 Restaurant vs. shopping
 For restaurants, meta-path B-R-U-R-B weighs only 0.1716
 For clustering shopping, B-R-U-R-B weighs 0.5864
Users try different
kinds of food
91
92
Outline
 Motivation: Why Mining Information Networks?
 Part I: Clustering, Ranking and Classification
 Clustering and Ranking in Information Networks
 Classification of Information Networks
 Part II: Meta-Path Based Exploration of Information Networks
 Similarity Search in Information Networks
 Relationship Prediction in Information Networks
 Recommendation with Heterogeneous Information Networks
 User-Guided Meta-Path based clustering in heterogeneous networks
 ClusCite: Citation recommendation in heterogeneous networks
93
The Citation Recommendation Problem
Given a newly written manuscript (title, abstract and/or content) and its attributes
(authors, target venues), suggest a small number of papers as high-quality references.
94
Why ClusCite for Citation Recommendation?
 Xiang Ren, Jialu Liu, Xiao Yu, Urvashi Khandelwal, Quanquan Gu, Lidan Wang,
Jiawei Han, “ClusCite: Effective Citation Recommendation by Information
Network-Based Clustering”, KDD’14
 Research papers need to cite relevant and important previous work
 Help readers understand its background, context and innovation
 Already large, rapidly growing body of scientific literature
 automatic recommendations of high quality citations
 Traditional literature search systems
 infeasible to cast users’ rich information needs
into queries consisting of just a few keywords
95
Consider Distinct Citation Behavioral Patterns
Basic idea: Paper’s citations can be organized into different groups, each
having its own behavioral pattern to identify references of interest
 A principled way to capture paper’s citation behaviors
 More accurate approach: paper-specific recommendation models
Most existing studies: assume all papers adopt same criterion and follow
same behavioral pattern in citing other papers:
 Context-based [He et al., WWW’10; Huang et al., CIKM’12]
 Topical similarity-based [Nallapati et al., KDD’08; Tang et al., PAKDD’09]
 Structural similarity-based [Liben-Nowell et al., CIKM’03; Strohman et al.,
SIGIR’07]
 Hybrid methods [Bethard et al., CIKM’10; Yu et al., SDM’12]
96
Observation: Distinct Citation Behavioral Patterns
Each group follows distinct
behavioral patterns and
adopts different criteria
in deciding the relevance and
authority of a candidate
paper
97
Heterogeneous Bibliographic Network
A unified data model for bibliographic dataset (papers and their attributes)
• Captures paper-paper relevance of different semantics
• Enables authority propagation between different types of objects
KDD
literature search
WSDM
Network Schema
98
Citation Recommendation in Heterogeneous Networks
Given a heterogeneous bibliographic network and the terms, authors and target
venues of a query manuscript q, we aim to build a recommendation model
specifically for q, and recommend a small subset of papers as high-quality
references for q
Heterogeneous bibliographic network
Query (to the system)
99
The ClusCite Framework
We explore the principle that citations tend to be softly clustered into different
interest groups, based on the heterogeneous network structures
Paper-specific recommendation:
by integrating learned models of
query’s related interest groups
Phase I: Joint Learning (offline)
Phase II: Recommendation (online)
For different interest groups, learn
distinct models on finding relevant
papers and judging authority of papers
Derive group membership for
query manuscript
100
ClusCite: An Overview of the Proposed Model
How likely a query manuscript q will cite a candidate paper p (suppose K interest groups):
  r(q, p) = Σ_{k=1..K} θ_q^(k) · s(q, p | k)
where θ_q^(k) is the query’s group membership in group k, and s(q, p | k) is the
relative citation score (how likely q will cite p) within each group, combining
paper relative relevance (query-candidate paper) and paper relative authority
(candidate paper)
It is desirable to suggest papers that have high relative citation scores across
multiple related interest groups of the query manuscript
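A sketch of the scoring combination; the per-group relevance (which the full model expands as learned weights over meta-path features) and per-group authority are collapsed here into precomputed, purely illustrative scores:

```python
import numpy as np

K = 3                                   # number of interest groups (illustrative)
theta_q = np.array([0.6, 0.3, 0.1])     # query's soft group membership
# Hypothetical per-group scores for one candidate paper p:
rel  = np.array([0.8, 0.2, 0.1])        # relative relevance of p to q, per group
auth = np.array([0.5, 0.9, 0.0])        # relative authority of p, per group

# Membership-weighted sum over groups of (relevance + authority),
# following the slide's overview formula.
citation_score = float(theta_q @ (rel + auth))
```

Here the candidate scores well because it is both relevant and authoritative in the query's dominant group.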
101
Proposed Model I: Group Membership
 Learn each query’s group membership: scalability & generalizability
 Leverage the group memberships of related attribute objects to approximate
query’s group membership
θ_q^(k) is approximated from the group memberships of the query’s attribute
objects: X ranges over the attribute types (authors/venues/terms), N_X(q) is the
query’s related (linked) objects of type X, and θ_x^(k) is the attribute object’s
group membership (to learn)
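The aggregation can be sketched directly; all memberships are made up, and the final renormalization is added for readability rather than being part of the slide's formula:

```python
import numpy as np

K = 3
# Hypothetical learned group memberships for the query's attribute objects:
theta_authors = np.array([[0.7, 0.2, 0.1],     # author 1
                          [0.5, 0.4, 0.1]])    # author 2
theta_venue   = np.array([[0.1, 0.8, 0.1]])    # target venue
theta_terms   = np.array([[0.3, 0.3, 0.4],     # term 1
                          [0.2, 0.2, 0.6]])    # term 2

# Average within each attribute type, then sum over types, as in the
# weighted-aggregation formula; renormalize to read as a distribution.
theta_q = sum(t.mean(axis=0) for t in (theta_authors, theta_venue, theta_terms))
theta_q = theta_q / theta_q.sum()
```

In this toy example the venue's strong membership in group 2 dominates, so the query leans toward that group.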
102
Proposed Model II: Paper Relevance
meta path-based relevance score (l-th feature)
Network schema
103
Proposed Model II: Paper Relevance
Relevance features play
different roles in
different interest groups
weights on different meta path-based features
104
Proposed Model III: Object Relative Authority
Paper relative authority: A paper may have quite different visibility/authority
among different groups, even if it is overall highly cited
Relative authority propagation over the network
KDDWSDM
Relative authority in Group A
Relative authority in Group B
105
Proposed Model III: Object Relative Authority
Paper relative authority: A paper may have quite different visibility/authority
among different groups, even if it is overall highly cited
Rather than deriving relative authority separately, we propose to estimate them
jointly using graph regularization, which preserves consistency over each
subgraph. By doing so, paper relative authority serves as a feature for learning
interest groups, and better estimated groups can in turn help derive relative
authority more accurately (Fig. 3). We adopt a semi-supervised learning
framework [8] that leads to iterative update rules acting as authority
propagation between different types of objects:
  F_P = G_P(F_P, F_A, F_V; λ_A, λ_V),  F_A = G_A(F_P),  F_V = G_V(F_P).   (3)
We denote the relative authority score matrices for paper, author and venue
objects by F_P ∈ R^{K×n}, F_A ∈ R^{K×|A|} and F_V ∈ R^{K×|V|}. Generally, in an
interest group, the relative importance of one type of object can be a
combination of the relative importance of the other types of objects [23].
106
Model Learning: A Joint Optimization Problem
A joint optimization problem:
Integrating the loss in Eq. (5) with the graph regularization in Eq. (6), we
formulate a joint optimization problem following the graph-regularized
semi-supervised learning framework [9]:
  min_{P, W, F_P, F_A, F_V}  (1/2) L + R + (c_p/2) ‖P‖_F² + (c_w/2) ‖W‖_F²,
  s.t. P ≥ 0; W ≥ 0.   (7)
To ensure stability of the obtained solution, Tikhonov regularizers are imposed
on the variables P and W [4], with c_p, c_w > 0 controlling the strength of
regularization. In addition, non-negativity constraints make sure the learned
group membership indicators and feature weights provide the intended semantic
meaning.
Directly solving Eq. (7) is not easy because the objective function is
non-convex. We develop an alternating minimization algorithm, ClusCite, which
alternately optimizes the problem with respect to each variable.
The weighted model prediction error is defined as follows:
  L = Σ_{i,j=1..n} M_ij ( Y_ij − Σ_{k=1..K} Σ_{l=1..L} θ_{p_i}^(k) w_k^(l) S_jl^(i)
        − Σ_{k=1..K} θ_{p_i}^(k) F_{P,kj} )²
    = Σ_{i=1..n} ‖ M_i ⊙ ( Y_i − R_i P ( W S^(i)T + F_P ) ) ‖₂².   (5)
We define the weight indicator matrix M ∈ R^{n×n} for the citation matrix, where
M_ij takes value 1 if the citation relationship between p_i and p_j is observed
and 0 in other cases. By doing so, the model can focus on positive examples and
get rid of noise in the 0 values. One can also impose a degree of confidence by
setting different weights on different observed citations, but we leave this to
future study. For ease of optimization, the loss is rewritten in the matrix form
above, where the matrix P ∈ R^{(|T|+|A|+|V|)×K} collects the group membership
indicators of all attribute objects.
Specifically, we approximate the query’s group membership θ_q^(k) by a weighted
aggregation of the group memberships of its attribute objects:
  θ_q^(k) = Σ_{X ∈ {A,V,T}} Σ_{x ∈ N_X(q)} θ_x^(k) / |N_X(q)|.   (4)
We use N_X(q) to denote the type-X neighbors (attribute objects) of query q; how
likely a type-X object belongs to the k-th interest group is represented by
θ_x^(k). Paper-specific citation recommendation can then be efficiently computed
for each query manuscript q by applying Eq. (1) with the definitions in
Eqs. (2), (3) and (4).
There are three sets of parameters in the model: group memberships for attribute
objects, feature weights for interest groups, and object relative authority
within each interest group. A straightforward way would be to first hard-cluster
attribute objects based on the network and then derive feature weights and
relative authority for each cluster. Such a solution encounters several
problems: (1) an object may have multiple citation interests, and (2) mined
network clusters may not properly capture distinct citation behaviors.
In the matrix form of Eq. (5), P ∈ R^{(|T|+|A|+|V|)×K} is the group membership
indicator for all attribute objects, while R_i ∈ R^{n×(|T|+|A|+|V|)} is the
corresponding neighbor indicator matrix, so that R_i P aggregates the
memberships of p_i’s attribute objects as in Eq. (4). Feature weights for each
interest group are represented by the rows of the matrix W ∈ R^{K×L}, i.e.,
W_kl = w_k^(l). The Hadamard product ⊙ denotes element-wise multiplication.
As discussed in Sec. 3.3, to achieve authority learning jointly, we adopt graph
regularization to preserve consistency over the paper-author and paper-venue
subgraphs, which takes the following form:
  R = (λ_A/2) Σ_{i=1..n} Σ_{j=1..|A|} R_ij^(A) ‖ F_{P,i}/√(D_ii^(PA)) − F_{A,j}/√(D_jj^(AP)) ‖₂²
    + (λ_V/2) Σ_{i=1..n} Σ_{j=1..|V|} R_ij^(V) ‖ F_{P,i}/√(D_ii^(PV)) − F_{V,j}/√(D_jj^(VP)) ‖₂².   (6)
The intuition behind the two terms is natural: linked objects in the
heterogeneous network are more likely to share similar relative authority
scores [14]. To reduce the impact of node popularity, we normalize the authority
vectors, which suppresses popular objects and keeps them from dominating the
authority propagation. Each element of the diagonal matrix D^(PA) ∈ R^{n×n} is
the degree of paper p_i in subgraph R^(A), while each element of
D^(AP) ∈ R^{|A|×|A|} is the degree of author a_j in subgraph R^(A); the two
diagonal matrices for the paper-venue subgraph are defined similarly.
weighted model prediction error
Graph
regularization
to encode
authority
propagation
Algorithm: Alternating minimization (w.r.t. each variable)
107
Learning Algorithm
 Updating formula for alternating minimization
First, to learn group memberships for attribute objects, we take the derivative
of the objective function in Eq. (7) with respect to P while fixing the other
variables, and apply the Karush-Kuhn-Tucker complementary condition to impose
the non-negativity constraint [7]. With some simple algebraic operations, a
multiplicative update formula for P can be derived as follows:
  P_jk ← P_jk · [ Σ_{i=1..n} R_i^T Ỹ_i S^(i) W^T + L_P1^+ + L_P2^− ]_jk
               / [ L_P0 + L_P1^− + L_P2^+ + c_p P ]_jk,   (8)
where the matrices L_P0, L_P1 and L_P2 are defined as follows:
  L_P0 = Σ_{i=1..n} R_i^T R_i P W S̃^(i)T S̃^(i) W^T;   L_P1 = Σ_{i=1..n} R_i^T Ỹ_i F_P^T;
  L_P2 = Σ_{i=1..n} R_i^T R_i P F̃_P^(i) F̃_P^(i)T + Σ_{i=1..n} R_i^T R_i P W S̃^(i)T F_P^T
       + Σ_{i=1..n} R_i^T R_i P F_P S̃^(i) W^T.
In order to preserve non-negativity throughout the update, L_P1 is decomposed
into L_P1^− and L_P1^+, where A_ij^+ = (|A_ij| + A_ij)/2 and
A_ij^− = (|A_ij| − A_ij)/2. Similarly, we decompose L_P2 into L_P2^− and L_P2^+,
applying the decomposition to each of the three components of L_P2,
respectively. We denote the masked Y_i as Ỹ_i, the Hadamard product of M_i and
Y_i; similarly, S̃^(i) and F̃_P^(i) denote S^(i) and F_P row-wise masked by M_i.
Second, to learn feature weights for interest groups, the multiplicative update
formula for W can be derived following a similar derivation as that of P, taking
the form:
  W_kl ← W_kl · [ Σ_{i=1..n} P^T R_i^T Ỹ_i S^(i) + L_W^− ]_kl
              / [ Σ_{i=1..n} P^T R_i^T R_i P W S̃^(i)T S̃^(i) + L_W^+ + c_w W ]_kl,   (9)
where L_W = Σ_{i=1..n} P^T R_i^T R_i P F_P S̃^(i). To preserve non-negativity of
W, L_W is decomposed into L_W^+ and L_W^−, computed as before.
Finally, we derive the authority propagation functions in Eq. (3) by optimizing
the objective function in Eq. (7) with respect to the authority score matrices
of papers, authors and venues.
The learning algorithm essentially accomplishes two things simultaneously and
iteratively: co-clustering of attribute objects and relevance features with
respect to interest groups, and authority propagation between different objects.
During an iteration, the different learning components mutually enhance each
other (Fig. 3): feature weights and relative authority can be more accurately
derived with high-quality interest groups, while in turn they serve as good
features for learning high-quality interest groups.
Figure 3: Correlation between paper relative authority and # ground truth
citations, during different iterations (scatter plots of paper relative
authority score vs. # citations from the test set, at iterations 1, 5 and 10).
Specifically, we take the derivative of the objective function with respect to
F_P, F_A and F_V, and follow traditional semi-supervised learning frameworks [8]
to derive the update rules, which take the form:
  F_P = G_P(F_P, F_A, F_V; λ_A, λ_V) = ( λ_A F_A S_A^T + λ_V F_V S_V^T + L_FP ) / (λ_A + λ_V)   (10)
  F_A = G_A(F_P) = F_P S_A;   (11)
  F_V = G_V(F_P) = F_P S_V,   (12)
where the normalized adjacency matrices and the paper authority guidance term
are defined by S_A = (D^(PA))^{−1/2} R^(A) (D^(AP))^{−1/2} (and S_V analogously).
Algorithm 1: Model Learning by ClusCite
Input: adjacency matrices {Y, S_A, S_V}, neighbor indicator R, mask matrix M,
meta path-based features {S^(i)}, parameters {λ_A, λ_V, c_w, c_p}, number of
interest groups K
Output: group membership P, feature weights W, relative authority {F_P, F_A, F_V}
1: Initialize P, W with positive values, and {F_P, F_A, F_V} with citation
   counts from the training set
2: repeat
3:   Update group membership P by Eq. (8)
4:   Update feature weights W by Eq. (9)
5:   Compute paper relative authority F_P by Eq. (10)
6:   Compute author relative authority F_A by Eq. (11)
7:   Compute venue relative authority F_V by Eq. (12)
8: until objective in Eq. (7) converges
Updating the group membership matrix P by Eq. (8) takes
O(L|E|³/n³ + L²|E|²/n² + Kdn) time; learning the feature weights W by Eq. (9)
has the same cost; updating all three relative authority matrices takes
O(L|E|³/n³ + |E| + …) time.
Table 3: Statistics of two bibliographic networks.
                      DBLP      PubMed
# papers              137,298   100,215
# authors             135,612   212,312
# venues              2,639     2,319
# terms               29,814    37,618
# relationships       ~2.3M     ~3.6M
Paper avg citations   5.16      17.55

Table 4: Training, validation and testing paper subsets from the DBLP and
PubMed datasets.
(a) The DBLP dataset
Subsets    Train               Validation    Test
Years      T0 = [1996, 2007]   T1 = [2008]   T2 = [2009, 2011]
# papers   62.23%              12.56%        25.21%
(b) The PubMed dataset
Subsets    Train               Validation    Test
Years      T0 = [1966, 2008]   T1 = [2009]   T2 = [2010, 2013]
# papers   64.50%              7.81%         27.69%
108
Learning Algorithm
 Convergence (to local minimum):
 Guaranteed by an analysis similar to block coordinate descent
 Empirically converges in 50 iterations
 Mutually enhancing:
 The model learning does two things simultaneously and iteratively: (1) co-
clustering of attribute objects; (2) authority propagation within groups
 Authority/relevance features can be better learned with the high-quality
groups derived
 Good features help derive high-quality groups
109
Experimental Setup
 Datasets
– DBLP: 137k papers; ~2.3M relationships; Avg # citations/paper: 5.16
– PubMed: 100k papers; ~3.6M relationships; Avg # citations/paper: 17.55
 Evaluations
110
Experimental Setup
 Feature generation: meta paths
– (P – X – P)y, where X = {A, V, T} and y = {1, 2, 3}
 E.g., (P – X – P)2 = P – X – P – X – P
– (P – X – P)y extended by a citation link in either direction,
  e.g., P – X – P → P and P – X – P ← P
 Feature generation: similarity measures
– PathSim [VLDB’11]: symmetric measure
– Random walk-based measure [SDM’12]: can be asymmetric
 Combining meta paths with different measures, in total we have 24 meta path-
based relevance features
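PathSim normalizes the meta-path path count between two objects by their self-path counts; a minimal sketch on a toy author-paper network (the incidence matrix is made up):

```python
import numpy as np

# Author-paper incidence; the commuting matrix M of the symmetric meta-path
# A-P-A counts shared papers between authors.
AP = np.array([[1, 1, 0],
               [0, 1, 1],
               [0, 0, 1]])
M = AP @ AP.T

def pathsim(M, x, y):
    # PathSim [VLDB'11]: path count between x and y, normalized by the
    # self-path counts of x and y.
    return 2.0 * M[x, y] / (M[x, x] + M[y, y])
```

By construction every object has similarity 1 with itself, and highly visible objects do not dominate the ranking the way raw path counts would let them.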
111
Example Output: Relative Authority Ranking
The relative authority propagation can generate meaningful authority scores
with respect to different interest groups.
Table 6: Top-5 authority venues and authors from two example interest groups
derived by ClusCite.
Group I (database and information system)
  Rank   Venue             Author
  1      VLDB     0.0763   Hector Garcia-Molina   0.0202
  2      SIGMOD   0.0653   Christos Faloutsos     0.0187
  3      TKDE     0.0651   Elisa Bertino          0.0180
  4      CIKM     0.0590   Dan Suciu              0.0179
  5      SIGKDD   0.0488   H. V. Jagadish         0.0178
Group II (computer vision and multimedia)
  Rank   Venue             Author
  1      TPAMI    0.0733   Richard Szeliski       0.0139
  2      ACM MM   0.0533   Jitendra Malik         0.0122
  3      ICCV     0.0403   Luc Van Gool           0.0121
  4      CVPR     0.0401   Andrew Blake           0.0117
  5      ECCV     0.0393   Alex Pentland          0.0114
112
Experimental Results
 Case study on citation behavioral patterns
Each paper is assigned to the group with the highest group membership score
113
Experimental Results
 Quality change of estimated paper relative authority ranking
Correlation between paper relative authority score and # ground truth
citations, during different iterations
114
Performance Comparison: Methods Compared
 Compared methods:
1. BM25: content-based
2. PopRank [WWW’05]: heterogeneous link-based authority
3. TopicSim: topic-based similarity by LDA
4. Link-PLSA-LDA [KDD’08]: topic and link relevance
5. Meta-path based relevance:
1. L2-LR [SDM’12, WSDM’12]: logistic regression with L2 regularization
2. RankSVM [KDD’02]
6. MixSim: relevance, topic distribution, PopRank scores, using RankSVM
7. ClusCite: the proposed full-fledged model
8. ClusCite-Rel: with only relevance component
115
Performance Comparison on DBLP and PubMed
 17.68% improvement in Recall@50;
9.57% in MRR, over the best performing
compared method, on DBLP
Recommendation performance comparisons on DBLP and PubMed datasets in terms of
Precision, Recall and MRR. The number of interest groups is set to 200 (K = 200)
for ClusCite and ClusCite-Rel.

Method          DBLP                                       PubMed
                P@10    P@20    R@20    R@50    MRR        P@10    P@20    R@20    R@50    MRR
BM25            0.1260  0.0902  0.1431  0.2146  0.4107     0.1847  0.1349  0.1754  0.2470  0.4971
PopRank         0.0112  0.0098  0.0155  0.0308  0.0451     0.0438  0.0314  0.0402  0.0814  0.2012
TopicSim        0.0328  0.0273  0.0432  0.0825  0.1161     0.0761  0.0685  0.0855  0.1516  0.3254
Link-PLSA-LDA   0.1023  0.0893  0.1295  0.1823  0.3748     0.1439  0.1002  0.1589  0.2015  0.4079
L2-LR           0.2274  0.1677  0.2471  0.3547  0.4866     0.2527  0.1959  0.2504  0.3981  0.5308
RankSVM         0.2372  0.1799  0.2733  0.3621  0.4989     0.2534  0.1954  0.2499  0.3820  0.5187
MixSim          0.2261  0.1689  0.2473  0.3636  0.5002     0.2699  0.2025  0.2519  0.4021  0.5041
ClusCite-Rel    0.2402  0.1872  0.2856  0.4015  0.5156     0.2786  0.2221  0.2753  0.4305  0.5524
ClusCite        0.2429  0.1958  0.2993  0.4279  0.5481     0.3019  0.2434  0.3129  0.4587  0.5787
 20.19% improvement in Recall@20;
14.79% in MRR, over the best performing
compared method, on PubMed
116
Performance Analysis: Number of Interest Groups
117
Performance Analysis: Number of Attributes Each
Query Manuscript Has
Partition the test set
into six subsets:
Group 1 has the fewest
attribute objects on average;
Group 6 has the most
attribute objects on average
When more attributes are
available for the query
manuscript, the model can make
better recommendations
118
Performance Analysis: Model Generalization
Divide test set by year
Test the performance
change over different
query subsets
From 2010 to 2013,
MixSim drops
16.42% vs. ClusCite
drops only 7.72%
 ClusCite framework: organize paper citations into interest groups, and design
cluster-based models, yielding paper-specific recommendation
 Experimental results demonstrate significant improvement over state-of-the-art
methods
119
Conclusions
 Heterogeneous information networks are ubiquitous
 Most datasets can be “organized” or “transformed” into “structured” multi-typed
heterogeneous info. networks
 Examples: DBLP, IMDB, Flickr, Google News, Wikipedia, …
 Surprisingly rich knowledge can be mined from structured heterogeneous info.
networks
 Clustering, ranking, classification, path prediction, ……
 Knowledge is power, but that knowledge is hidden in massive yet “relatively structured”
nodes and links!
 Key issue: Construction of trusted, semi-structured heterogeneous networks from
unstructured data
 From data to knowledge: Much more to be explored but heterogeneous network
mining has shown high promise!
120
References
 M. Ji, J. Han, and M. Danilevsky, "Ranking-Based Classification of Heterogeneous Information Networks", KDD'11
 X. Ren, J. Liu, X. Yu, U. Khandelwal, Q. Gu, L. Wang, J. Han, "ClusCite: Effective Citation Recommendation by
Information Network-Based Clustering", KDD'14
 Y. Sun and J. Han, Mining Heterogeneous Information Networks: Principles and Methodologies, Morgan &
Claypool Publishers, 2012
 Y. Sun, Y. Yu, and J. Han, "Ranking-Based Clustering of Heterogeneous Information Networks with Star Network
Schema", KDD'09
 Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu, "PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous
Information Networks", VLDB'11
 Y. Sun, R. Barber, M. Gupta, C. Aggarwal, and J. Han, "Co-Author Relationship Prediction in Heterogeneous
Bibliographic Networks", ASONAM'11
 Y. Sun, J. Han, C. C. Aggarwal, and N. Chawla, "When Will It Happen? Relationship Prediction in Heterogeneous
Information Networks", WSDM'12
 Y. Sun, J. Han, P. Zhao, Z. Yin, H. Cheng, and T. Wu, "RankClus: Integrating Clustering with Ranking for Heterogeneous
Information Network Analysis", EDBT'09
 X. Yu, X. Ren, Y. Sun, Q. Gu, B. Sturt, U. Khandelwal, B. Norick, and J. Han, "Personalized Entity Recommendation: A
Heterogeneous Information Network Approach", WSDM'14
121
From Data to Networks to Knowledge: An Evolutionary Path!
 Han, Kamber and Pei, Data Mining, 3rd ed., 2011
 Yu, Han and Faloutsos (eds.), Link Mining: Models, Algorithms and Applications, 2010
 Sun and Han, Mining Heterogeneous Information Networks, 2012 (Y. Sun: SIGKDD'13 Dissertation Award)
 Wang and Han, Mining Latent Entity Structures, 2015 (C. Wang: SIGKDD'13 Dissertation Award)

More Related Content

What's hot

Contractor-Borner-SNA-SAC
Contractor-Borner-SNA-SACContractor-Borner-SNA-SAC
Contractor-Borner-SNA-SAC
webuploader
 
Lesson hypertext and intertext
Lesson hypertext and intertextLesson hypertext and intertext
Lesson hypertext and intertext
CristinaGrumal
 
DREaM Event 2: Louise Cooke
DREaM Event 2: Louise CookeDREaM Event 2: Louise Cooke
Hci
HciHci
euclid_linkedup WWW tutorial (Besnik Fetahu)
euclid_linkedup WWW tutorial (Besnik Fetahu)euclid_linkedup WWW tutorial (Besnik Fetahu)
euclid_linkedup WWW tutorial (Besnik Fetahu)
Besnik Fetahu
 
Empowering Search Through 3RDi Semantic Enrichment
Empowering Search Through 3RDi Semantic EnrichmentEmpowering Search Through 3RDi Semantic Enrichment
Empowering Search Through 3RDi Semantic Enrichment
The Digital Group
 
Knowledge Organization Systems
Knowledge Organization SystemsKnowledge Organization Systems
Knowledge Organization Systems
R A Akerkar
 
Ir 01
Ir   01Ir   01
Developing a Secured Recommender System in Social Semantic Network
Developing a Secured Recommender System in Social Semantic NetworkDeveloping a Secured Recommender System in Social Semantic Network
Developing a Secured Recommender System in Social Semantic Network
Tamer Rezk
 
A Survey on Decision Support Systems in Social Media
A Survey on Decision Support Systems in Social MediaA Survey on Decision Support Systems in Social Media
A Survey on Decision Support Systems in Social Media
Editor IJCATR
 
Fox-Keynote-Now and Now of Data Publishing-nfdp13
Fox-Keynote-Now and Now of Data Publishing-nfdp13Fox-Keynote-Now and Now of Data Publishing-nfdp13
Fox-Keynote-Now and Now of Data Publishing-nfdp13
DataDryad
 
Semantics-aware Content-based Recommender Systems
Semantics-aware Content-based Recommender SystemsSemantics-aware Content-based Recommender Systems
Semantics-aware Content-based Recommender Systems
Pasquale Lops
 
Research on ontology based information retrieval techniques
Research on ontology based information retrieval techniquesResearch on ontology based information retrieval techniques
Research on ontology based information retrieval techniques
Kausar Mukadam
 
ECCS 2010
ECCS 2010ECCS 2010
ECCS 2010
Shenghui Wang
 
Data Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysisData Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysis
DataminingTools Inc
 
Avoiding Anonymous Users in Multiple Social Media Networks (SMN)
Avoiding Anonymous Users in Multiple Social Media Networks (SMN)Avoiding Anonymous Users in Multiple Social Media Networks (SMN)
Avoiding Anonymous Users in Multiple Social Media Networks (SMN)
paperpublications3
 
Semantics 101
Semantics 101Semantics 101
Semantics 101
Kurt Cagle
 
The Semantic Web in Digital Libraries: A Literature Review
The Semantic Web in Digital Libraries: A Literature ReviewThe Semantic Web in Digital Libraries: A Literature Review
The Semantic Web in Digital Libraries: A Literature Review
sstose
 

What's hot (18)

Contractor-Borner-SNA-SAC
Contractor-Borner-SNA-SACContractor-Borner-SNA-SAC
Contractor-Borner-SNA-SAC
 
Lesson hypertext and intertext
Lesson hypertext and intertextLesson hypertext and intertext
Lesson hypertext and intertext
 
DREaM Event 2: Louise Cooke
DREaM Event 2: Louise CookeDREaM Event 2: Louise Cooke
DREaM Event 2: Louise Cooke
 
Hci
HciHci
Hci
 
euclid_linkedup WWW tutorial (Besnik Fetahu)
euclid_linkedup WWW tutorial (Besnik Fetahu)euclid_linkedup WWW tutorial (Besnik Fetahu)
euclid_linkedup WWW tutorial (Besnik Fetahu)
 
Empowering Search Through 3RDi Semantic Enrichment
Empowering Search Through 3RDi Semantic EnrichmentEmpowering Search Through 3RDi Semantic Enrichment
Empowering Search Through 3RDi Semantic Enrichment
 
Knowledge Organization Systems
Knowledge Organization SystemsKnowledge Organization Systems
Knowledge Organization Systems
 
Ir 01
Ir   01Ir   01
Ir 01
 
Developing a Secured Recommender System in Social Semantic Network
Developing a Secured Recommender System in Social Semantic NetworkDeveloping a Secured Recommender System in Social Semantic Network
Developing a Secured Recommender System in Social Semantic Network
 
A Survey on Decision Support Systems in Social Media
A Survey on Decision Support Systems in Social MediaA Survey on Decision Support Systems in Social Media
A Survey on Decision Support Systems in Social Media
 
Fox-Keynote-Now and Now of Data Publishing-nfdp13
Fox-Keynote-Now and Now of Data Publishing-nfdp13Fox-Keynote-Now and Now of Data Publishing-nfdp13
Fox-Keynote-Now and Now of Data Publishing-nfdp13
 
Semantics-aware Content-based Recommender Systems
Semantics-aware Content-based Recommender SystemsSemantics-aware Content-based Recommender Systems
Semantics-aware Content-based Recommender Systems
 
Research on ontology based information retrieval techniques
Research on ontology based information retrieval techniquesResearch on ontology based information retrieval techniques
Research on ontology based information retrieval techniques
 
ECCS 2010
ECCS 2010ECCS 2010
ECCS 2010
 
Data Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysisData Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysis
 
Avoiding Anonymous Users in Multiple Social Media Networks (SMN)
Avoiding Anonymous Users in Multiple Social Media Networks (SMN)Avoiding Anonymous Users in Multiple Social Media Networks (SMN)
Avoiding Anonymous Users in Multiple Social Media Networks (SMN)
 
Semantics 101
Semantics 101Semantics 101
Semantics 101
 
The Semantic Web in Digital Libraries: A Literature Review
The Semantic Web in Digital Libraries: A Literature ReviewThe Semantic Web in Digital Libraries: A Literature Review
The Semantic Web in Digital Libraries: A Literature Review
 

Similar to 2015 07-tuto3-mining hin

Searching for patterns in crowdsourced information
Searching for patterns in crowdsourced informationSearching for patterns in crowdsourced information
Searching for patterns in crowdsourced information
Silvia Puglisi
 
ICPSR - Complex Systems Models in the Social Sciences - Lecture 4 - Professor...
ICPSR - Complex Systems Models in the Social Sciences - Lecture 4 - Professor...ICPSR - Complex Systems Models in the Social Sciences - Lecture 4 - Professor...
ICPSR - Complex Systems Models in the Social Sciences - Lecture 4 - Professor...
Daniel Katz
 
Network analyses of SPSSI
Network analyses of SPSSINetwork analyses of SPSSI
Network analyses of SPSSI
Kevin Lanning
 
A survey of heterogeneous information network analysis
A survey of heterogeneous information network analysisA survey of heterogeneous information network analysis
A survey of heterogeneous information network analysis
SOYEON KIM
 
Journalism and the Semantic Web
Journalism and the Semantic WebJournalism and the Semantic Web
Journalism and the Semantic Web
Kurt Cagle
 
What Are Links in Linked Open Data? A Characterization and Evaluation of Link...
What Are Links in Linked Open Data? A Characterization and Evaluation of Link...What Are Links in Linked Open Data? A Characterization and Evaluation of Link...
What Are Links in Linked Open Data? A Characterization and Evaluation of Link...
Armin Haller
 
From Bibliometrics to Cybermetrics - a book chapter by Nicola de Bellis
From Bibliometrics to Cybermetrics - a book chapter by Nicola de BellisFrom Bibliometrics to Cybermetrics - a book chapter by Nicola de Bellis
From Bibliometrics to Cybermetrics - a book chapter by Nicola de Bellis
Xanat V. Meza
 
Artificial Intelligence Advances | Vol.3, Iss.2 October 2021
Artificial Intelligence Advances | Vol.3, Iss.2 October 2021Artificial Intelligence Advances | Vol.3, Iss.2 October 2021
Artificial Intelligence Advances | Vol.3, Iss.2 October 2021
Bilingual Publishing Group
 
Linked Data and Users in Library - Does the library communicate efficiently?
Linked Data and Users in Library - Does the library communicate efficiently?Linked Data and Users in Library - Does the library communicate efficiently?
Linked Data and Users in Library - Does the library communicate efficiently?
Hansung University
 
What is Linked Data, and What Does It Mean for Libraries?
What is Linked Data, and What Does It Mean for Libraries?What is Linked Data, and What Does It Mean for Libraries?
What is Linked Data, and What Does It Mean for Libraries?
Emily Nimsakont
 
Semantic Web Technologies: Changing Bibliographic Descriptions?
Semantic Web Technologies: Changing Bibliographic Descriptions?Semantic Web Technologies: Changing Bibliographic Descriptions?
Semantic Web Technologies: Changing Bibliographic Descriptions?
Stuart Weibel
 
Semantic citation
Semantic citationSemantic citation
Semantic citation
Deepak K
 
Iaetsd similarity search in information networks using
Iaetsd similarity search in information networks usingIaetsd similarity search in information networks using
Iaetsd similarity search in information networks using
Iaetsd Iaetsd
 
Entity linking with a knowledge base issues techniques and solutions
Entity linking with a knowledge base issues techniques and solutionsEntity linking with a knowledge base issues techniques and solutions
Entity linking with a knowledge base issues techniques and solutions
CloudTechnologies
 
Exploration of a Data Landscape using a Collaborative Linked Data Framework.
Exploration of a Data Landscape using a Collaborative Linked Data Framework.Exploration of a Data Landscape using a Collaborative Linked Data Framework.
Exploration of a Data Landscape using a Collaborative Linked Data Framework.
Laurent Alquier
 
Scits 2014
Scits 2014Scits 2014
Scits 2014
Kevin Lanning
 
Social Network Analysis
Social Network AnalysisSocial Network Analysis
Social Network Analysis
Fred Stutzman
 
Network Theory Final Draft
Network Theory Final DraftNetwork Theory Final Draft
Network Theory Final Draft
kmbaldau
 
DM110 - Week 10 - Semantic Web / Web 3.0
DM110 - Week 10 - Semantic Web / Web 3.0DM110 - Week 10 - Semantic Web / Web 3.0
DM110 - Week 10 - Semantic Web / Web 3.0
John Breslin
 
A theory of Metadata enriching & filtering
A theory of  Metadata enriching & filteringA theory of  Metadata enriching & filtering
A theory of Metadata enriching & filtering
Cuerpo Academico 'Estudios de la Información'
 

Similar to 2015 07-tuto3-mining hin (20)

Searching for patterns in crowdsourced information
Searching for patterns in crowdsourced informationSearching for patterns in crowdsourced information
Searching for patterns in crowdsourced information
 
ICPSR - Complex Systems Models in the Social Sciences - Lecture 4 - Professor...
ICPSR - Complex Systems Models in the Social Sciences - Lecture 4 - Professor...ICPSR - Complex Systems Models in the Social Sciences - Lecture 4 - Professor...
ICPSR - Complex Systems Models in the Social Sciences - Lecture 4 - Professor...
 
Network analyses of SPSSI
Network analyses of SPSSINetwork analyses of SPSSI
Network analyses of SPSSI
 
A survey of heterogeneous information network analysis
A survey of heterogeneous information network analysisA survey of heterogeneous information network analysis
A survey of heterogeneous information network analysis
 
Journalism and the Semantic Web
Journalism and the Semantic WebJournalism and the Semantic Web
Journalism and the Semantic Web
 
What Are Links in Linked Open Data? A Characterization and Evaluation of Link...
What Are Links in Linked Open Data? A Characterization and Evaluation of Link...What Are Links in Linked Open Data? A Characterization and Evaluation of Link...
What Are Links in Linked Open Data? A Characterization and Evaluation of Link...
 
From Bibliometrics to Cybermetrics - a book chapter by Nicola de Bellis
From Bibliometrics to Cybermetrics - a book chapter by Nicola de BellisFrom Bibliometrics to Cybermetrics - a book chapter by Nicola de Bellis
From Bibliometrics to Cybermetrics - a book chapter by Nicola de Bellis
 
Artificial Intelligence Advances | Vol.3, Iss.2 October 2021
Artificial Intelligence Advances | Vol.3, Iss.2 October 2021Artificial Intelligence Advances | Vol.3, Iss.2 October 2021
Artificial Intelligence Advances | Vol.3, Iss.2 October 2021
 
Linked Data and Users in Library - Does the library communicate efficiently?
Linked Data and Users in Library - Does the library communicate efficiently?Linked Data and Users in Library - Does the library communicate efficiently?
Linked Data and Users in Library - Does the library communicate efficiently?
 
What is Linked Data, and What Does It Mean for Libraries?
What is Linked Data, and What Does It Mean for Libraries?What is Linked Data, and What Does It Mean for Libraries?
What is Linked Data, and What Does It Mean for Libraries?
 
Semantic Web Technologies: Changing Bibliographic Descriptions?
Semantic Web Technologies: Changing Bibliographic Descriptions?Semantic Web Technologies: Changing Bibliographic Descriptions?
Semantic Web Technologies: Changing Bibliographic Descriptions?
 
Semantic citation
Semantic citationSemantic citation
Semantic citation
 
Iaetsd similarity search in information networks using
Iaetsd similarity search in information networks usingIaetsd similarity search in information networks using
Iaetsd similarity search in information networks using
 
Entity linking with a knowledge base issues techniques and solutions
Entity linking with a knowledge base issues techniques and solutionsEntity linking with a knowledge base issues techniques and solutions
Entity linking with a knowledge base issues techniques and solutions
 
Exploration of a Data Landscape using a Collaborative Linked Data Framework.
Exploration of a Data Landscape using a Collaborative Linked Data Framework.Exploration of a Data Landscape using a Collaborative Linked Data Framework.
Exploration of a Data Landscape using a Collaborative Linked Data Framework.
 
Scits 2014
Scits 2014Scits 2014
Scits 2014
 
Social Network Analysis
Social Network AnalysisSocial Network Analysis
Social Network Analysis
 
Network Theory Final Draft
Network Theory Final DraftNetwork Theory Final Draft
Network Theory Final Draft
 
DM110 - Week 10 - Semantic Web / Web 3.0
DM110 - Week 10 - Semantic Web / Web 3.0DM110 - Week 10 - Semantic Web / Web 3.0
DM110 - Week 10 - Semantic Web / Web 3.0
 
A theory of Metadata enriching & filtering
A theory of  Metadata enriching & filteringA theory of  Metadata enriching & filtering
A theory of Metadata enriching & filtering
 

More from jins0618

Machine Status Prediction for Dynamic and Heterogenous Cloud Environment
Machine Status Prediction for Dynamic and Heterogenous Cloud EnvironmentMachine Status Prediction for Dynamic and Heterogenous Cloud Environment
Machine Status Prediction for Dynamic and Heterogenous Cloud Environment
jins0618
 
Latent Interest and Topic Mining on User-item Bipartite Networks
Latent Interest and Topic Mining on User-item Bipartite NetworksLatent Interest and Topic Mining on User-item Bipartite Networks
Latent Interest and Topic Mining on User-item Bipartite Networks
jins0618
 
Web Service QoS Prediction Approach in Mobile Internet Environments
Web Service QoS Prediction Approach in Mobile Internet EnvironmentsWeb Service QoS Prediction Approach in Mobile Internet Environments
Web Service QoS Prediction Approach in Mobile Internet Environments
jins0618
 
吕潇 星环科技大数据技术探索与应用实践
吕潇 星环科技大数据技术探索与应用实践吕潇 星环科技大数据技术探索与应用实践
吕潇 星环科技大数据技术探索与应用实践
jins0618
 
李战怀 大数据环境下数据存储与管理的研究
李战怀 大数据环境下数据存储与管理的研究李战怀 大数据环境下数据存储与管理的研究
李战怀 大数据环境下数据存储与管理的研究
jins0618
 
2015 07-tuto0-courseoutline
2015 07-tuto0-courseoutline2015 07-tuto0-courseoutline
2015 07-tuto0-courseoutline
jins0618
 
Christian jensen advanced routing in spatial networks using big data
Christian jensen advanced routing in spatial networks using big dataChristian jensen advanced routing in spatial networks using big data
Christian jensen advanced routing in spatial networks using big data
jins0618
 
Jeffrey xu yu large graph processing
Jeffrey xu yu large graph processingJeffrey xu yu large graph processing
Jeffrey xu yu large graph processing
jins0618
 
Calton pu experimental methods on performance in cloud and accuracy in big da...
Calton pu experimental methods on performance in cloud and accuracy in big da...Calton pu experimental methods on performance in cloud and accuracy in big da...
Calton pu experimental methods on performance in cloud and accuracy in big da...
jins0618
 
Ling liu part 02:big graph processing
Ling liu part 02:big graph processingLing liu part 02:big graph processing
Ling liu part 02:big graph processing
jins0618
 
Ling liu part 01:big graph processing
Ling liu part 01:big graph processingLing liu part 01:big graph processing
Ling liu part 01:big graph processing
jins0618
 
Wang ke mining revenue-maximizing bundling configuration
Wang ke mining revenue-maximizing bundling configurationWang ke mining revenue-maximizing bundling configuration
Wang ke mining revenue-maximizing bundling configuration
jins0618
 
Wang ke classification by cut clearance under threshold
Wang ke classification by cut clearance under thresholdWang ke classification by cut clearance under threshold
Wang ke classification by cut clearance under threshold
jins0618
 
2015 07-tuto2-clus type
2015 07-tuto2-clus type2015 07-tuto2-clus type
2015 07-tuto2-clus type
jins0618
 
2015 07-tuto1-phrase mining
2015 07-tuto1-phrase mining2015 07-tuto1-phrase mining
2015 07-tuto1-phrase mining
jins0618
 
2015 07-tuto0-courseoutline
2015 07-tuto0-courseoutline2015 07-tuto0-courseoutline
2015 07-tuto0-courseoutline
jins0618
 
Weiyi meng web data truthfulness analysis
Weiyi meng web data truthfulness analysisWeiyi meng web data truthfulness analysis
Weiyi meng web data truthfulness analysis
jins0618
 
Ke yi small summaries for big data
Ke yi small summaries for big dataKe yi small summaries for big data
Ke yi small summaries for big data
jins0618
 
Gao cong geospatial social media data management and context-aware recommenda...
Gao cong geospatial social media data management and context-aware recommenda...Gao cong geospatial social media data management and context-aware recommenda...
Gao cong geospatial social media data management and context-aware recommenda...
jins0618
 
Chengqi zhang graph processing and mining in the era of big data
Chengqi zhang graph processing and mining in the era of big dataChengqi zhang graph processing and mining in the era of big data
Chengqi zhang graph processing and mining in the era of big data
jins0618
 

More from jins0618 (20)

Machine Status Prediction for Dynamic and Heterogenous Cloud Environment
Machine Status Prediction for Dynamic and Heterogenous Cloud EnvironmentMachine Status Prediction for Dynamic and Heterogenous Cloud Environment
Machine Status Prediction for Dynamic and Heterogenous Cloud Environment
 
Latent Interest and Topic Mining on User-item Bipartite Networks
Latent Interest and Topic Mining on User-item Bipartite NetworksLatent Interest and Topic Mining on User-item Bipartite Networks
Latent Interest and Topic Mining on User-item Bipartite Networks
 
Web Service QoS Prediction Approach in Mobile Internet Environments
Web Service QoS Prediction Approach in Mobile Internet EnvironmentsWeb Service QoS Prediction Approach in Mobile Internet Environments
Web Service QoS Prediction Approach in Mobile Internet Environments
 
吕潇 星环科技大数据技术探索与应用实践
吕潇 星环科技大数据技术探索与应用实践吕潇 星环科技大数据技术探索与应用实践
吕潇 星环科技大数据技术探索与应用实践
 
李战怀 大数据环境下数据存储与管理的研究
李战怀 大数据环境下数据存储与管理的研究李战怀 大数据环境下数据存储与管理的研究
李战怀 大数据环境下数据存储与管理的研究
 
2015 07-tuto0-courseoutline
2015 07-tuto0-courseoutline2015 07-tuto0-courseoutline
2015 07-tuto0-courseoutline
 
Christian jensen advanced routing in spatial networks using big data
Christian jensen advanced routing in spatial networks using big dataChristian jensen advanced routing in spatial networks using big data
Christian jensen advanced routing in spatial networks using big data
 
Jeffrey xu yu large graph processing
Jeffrey xu yu large graph processingJeffrey xu yu large graph processing
Jeffrey xu yu large graph processing
 
Calton pu experimental methods on performance in cloud and accuracy in big da...
Calton pu experimental methods on performance in cloud and accuracy in big da...Calton pu experimental methods on performance in cloud and accuracy in big da...
Calton pu experimental methods on performance in cloud and accuracy in big da...
 
Ling liu part 02:big graph processing
Ling liu part 02:big graph processingLing liu part 02:big graph processing
Ling liu part 02:big graph processing
 
Ling liu part 01:big graph processing
Ling liu part 01:big graph processingLing liu part 01:big graph processing
Ling liu part 01:big graph processing
 
Wang ke mining revenue-maximizing bundling configuration
Wang ke mining revenue-maximizing bundling configurationWang ke mining revenue-maximizing bundling configuration
Wang ke mining revenue-maximizing bundling configuration
 
Wang ke classification by cut clearance under threshold
Wang ke classification by cut clearance under thresholdWang ke classification by cut clearance under threshold
Wang ke classification by cut clearance under threshold
 
2015 07-tuto2-clus type
2015 07-tuto2-clus type2015 07-tuto2-clus type
2015 07-tuto2-clus type
 
2015 07-tuto1-phrase mining
2015 07-tuto1-phrase mining2015 07-tuto1-phrase mining
2015 07-tuto1-phrase mining
 
2015 07-tuto0-courseoutline
2015 07-tuto0-courseoutline2015 07-tuto0-courseoutline
2015 07-tuto0-courseoutline
 
Weiyi meng web data truthfulness analysis
Weiyi meng web data truthfulness analysisWeiyi meng web data truthfulness analysis
Weiyi meng web data truthfulness analysis
 
Ke yi small summaries for big data
Ke yi small summaries for big dataKe yi small summaries for big data
Ke yi small summaries for big data
 
Gao cong geospatial social media data management and context-aware recommenda...
Gao cong geospatial social media data management and context-aware recommenda...Gao cong geospatial social media data management and context-aware recommenda...
Gao cong geospatial social media data management and context-aware recommenda...
 
Chengqi zhang graph processing and mining in the era of big data
Chengqi zhang graph processing and mining in the era of big dataChengqi zhang graph processing and mining in the era of big data
Chengqi zhang graph processing and mining in the era of big data
 

Recently uploaded

Monthly Management report for the Month of May 2024
Monthly Management report for the Month of May 2024Monthly Management report for the Month of May 2024
Monthly Management report for the Month of May 2024
facilitymanager11
 
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
taqyea
 
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
mkkikqvo
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
AndrzejJarynowski
 
writing report business partner b1+ .pdf
writing report business partner b1+ .pdfwriting report business partner b1+ .pdf
writing report business partner b1+ .pdf
VyNguyen709676
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
bopyb
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
ElizabethGarrettChri
 
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
一比一原版南十字星大学毕业证(SCU毕业证书)学历如何办理
slg6lamcq
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
y3i0qsdzb
 
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens""Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
sameer shah
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
Social Samosa
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
AlessioFois2
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
jitskeb
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
hyfjgavov
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
Márton Kodok
 
UofT毕业证如何办理
UofT毕业证如何办理UofT毕业证如何办理
UofT毕业证如何办理
exukyp
 
一比一原版(CU毕业证)卡尔顿大学毕业证如何办理
一比一原版(CU毕业证)卡尔顿大学毕业证如何办理一比一原版(CU毕业证)卡尔顿大学毕业证如何办理
一比一原版(CU毕业证)卡尔顿大学毕业证如何办理
bmucuha
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 

Recently uploaded (20)

Monthly Management report for the Month of May 2024
Monthly Management report for the Month of May 2024Monthly Management report for the Month of May 2024
Monthly Management report for the Month of May 2024
 
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
 
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
原版一比一多伦多大学毕业证(UofT毕业证书)如何办理
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
 
writing report business partner b1+ .pdf
writing report business partner b1+ .pdfwriting report business partner b1+ .pdf
writing report business partner b1+ .pdf
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
 
Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024Open Source Contributions to Postgres: The Basics POSETTE 2024
Open Source Contributions to Postgres: The Basics POSETTE 2024
 


  • 6. Homogeneous vs. Heterogeneous Networks
    [Figure: a conference-author network (conferences SIGMOD, SDM, ICDM, KDD, EDBT, VLDB, ICML, AAAI; authors Tom, Jim, Lucy, Mike, Jack, Tracy, Cindy, Bob, Mary, Alice) shown beside the co-author network derived from it]
  • 7. Mining Heterogeneous Information Networks
     Homogeneous networks can often be derived from their original heterogeneous networks
     Ex.: Coauthor networks can be derived from author-paper-conference networks by projection on authors
     Paper citation networks can be derived from a complete bibliographic network, with papers and citations projected
     Heterogeneous networks carry richer information than their corresponding projected homogeneous networks
     Typed heterogeneous networks vs. non-typed heterogeneous networks (i.e., not distinguishing different types of nodes)
     Typed nodes and links imply more structure, leading to richer discovery
     Mining semi-structured heterogeneous information networks: clustering, ranking, classification, prediction, similarity search, etc.
  • 8. Examples of Heterogeneous Information Networks
     Bibliographic information networks: DBLP, ArXiv, PubMed
     Entity types: paper (P), venue (V), author (A), and term (T)
     Relation types: authors write papers, venues publish papers, papers contain terms
     Twitter information network, and other social media networks
     Object types: user, tweet, hashtag, and term
     Relation/link types: users follow users, users post tweets, tweets reply to tweets, tweets use terms, tweets contain hashtags
     Flickr information network
     Object types: image, user, tag, group, and comment
     Relation types: users upload images, images contain tags, images belong to groups, users post comments, and comments comment on images
     Healthcare information network
     Object types: doctor, patient, disease, treatment, and device
     Relation types: treatments are used for diseases, patients have diseases, patients visit doctors
  • 9. Structures Facilitate Mining Heterogeneous Networks
     Network construction: generates structured networks from unstructured text data
     Each node: an entity; each link: a relationship between entities
     Each node/link may have attributes, labels, and weights
     Heterogeneous, multi-typed networks: e.g., a medical network of patients, doctors, diseases, contacts, treatments
    [Figures: the DBLP bibliographic network (venue, paper, author), the IMDB movie network (actor, movie, director, studio), and the Facebook network]
  • 10. What Can Be Mined from Heterogeneous Networks?
     A homogeneous network can be derived from its “parent” heterogeneous network
     Ex.: Coauthor networks from the original author-paper-conference networks
     Heterogeneous networks carry richer info than the projected homogeneous networks
     Typed nodes & links imply more structures, leading to richer discovery
     Ex.: DBLP, a computer science bibliographic database (network)

    Knowledge hidden in the DBLP network                                          Mining function
    Who are the leading researchers on Web search?                                Ranking
    Who are the peer researchers of Jure Leskovec?                                Similarity search
    Whom will Christos Faloutsos collaborate with?                                Relationship prediction
    Which relationships are most influential for an author to decide her topics?  Relation strength learning
    How did the field of Data Mining emerge and evolve?                           Network evolution
    Which authors are rather different from their peers in IR?                    Outlier/anomaly detection
  • 11. Principles of Mining Heterogeneous Information Networks
     Information propagation across heterogeneous nodes & links
     How to compute ranking scores, similarity scores, and clusters, and how to make good use of class labels, across heterogeneous nodes and links
     Objects in the networks are interdependent, and knowledge can only be mined using the holistic information in a network
     Search and mining by exploring network meta structures
     Heterogeneous information networks are semi-structured and typed
     Network schema: a meta structure that guides search and mining
     Meta-path based similarity search and mining
     User-guided exploration of information networks
     Automatically select the right relation (or meta-path) combinations with appropriate weights for a particular search or mining task
     User-guided or feedback-based network exploration is a useful strategy
  • 14. Ranking-Based Clustering in Heterogeneous Networks
     Clustering and ranking: two critical functions in data mining
     Clustering without ranking? Think of the dark time before Google's PageRank
     Ranking makes more sense within a particular cluster
     Einstein in physics vs. Turing in computer science
     Why not integrate ranking with clustering & classification?
     High-ranked objects should be more important in a cluster than low-ranked ones
     Why treat every object with the same weight in the same cluster? But how to get their weights?
     Integrate ranking with clustering/classification in the same process
     Ranking, as the feature, is conditional (i.e., relative) to a specific cluster
     Ranking and clustering may mutually enhance each other
     Ranking-based clustering: RankClus [EDBT’09], NetClus [KDD’09]
  • 15. A Bi-Typed Network Model and Simple Ranking
     A bi-typed network model
     Let X represent the type venue and Y the type author
     The DBLP network can then be represented as the matrix
      $W = \begin{pmatrix} W_{XX} & W_{XY} \\ W_{YX} & W_{YY} \end{pmatrix}$
     Our task: rank-based clustering of the heterogeneous network W
     Simple ranking
     Proportional to the number of publications of an author and a venue
     Considers only the immediate neighborhood in the network
     But what about an author publishing many papers only in very weak venues?
    [Figure: a bi-typed venue-author network]
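As a concrete sketch of simple ranking (score proportional to publication counts), the snippet below normalizes the degrees of a toy venue-author matrix. The counts and the helper name `simple_rank` are illustrative, not from the tutorial:

```python
import numpy as np

# Toy venue-author matrix W_XY: rows = venues, columns = authors,
# W_XY[i, j] = number of papers author j published in venue i.
# All counts are hypothetical, for illustration only.
W_XY = np.array([[3.0, 0.0, 1.0],
                 [1.0, 2.0, 0.0]])

def simple_rank(W):
    """Simple ranking: score proportional to publication counts,
    i.e., the normalized degree in the bipartite network."""
    deg = W.sum(axis=0)      # total papers per column object
    return deg / deg.sum()   # normalize to a probability distribution

author_rank = simple_rank(W_XY)    # rank authors (columns)
venue_rank = simple_rank(W_XY.T)   # rank venues by transposing
```

This looks only at immediate degrees, which is exactly why the slide's caveat applies: an author with many papers in weak venues still ranks high.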
  • 16. The RankClus Methodology
     Ranking as the feature of the cluster
     Ranking is conditional on a specific cluster
     E.g., VLDB’s rank in Theory vs. its rank in the DB area
     The distributions of ranking scores over objects are different in each cluster
     Clustering and ranking are mutually enhanced
     Better clustering: rank distributions for clusters are more distinguishable from each other
     Better ranking: a better metric for objects is learned from the ranking
     Not every object should be treated equally in clustering!
     Y. Sun, et al., “RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis”, EDBT'09
  • 17. Simple Ranking vs. Authority Ranking
     Simple ranking
     Proportional to the number of publications of an author / a conference
     Considers only the immediate neighborhood in the network
     What about an author publishing 100 papers in very weak conferences?
     Authority ranking
     More sophisticated “rank rules” are needed
     Propagate the ranking scores in the network over different types
  • 18. Authority Ranking
     Methodology: propagate the ranking scores in the network over different types
     Rule 1: Highly ranked authors publish many papers in highly ranked venues
     Rule 2: Highly ranked venues attract many papers from many highly ranked authors
     Rule 3: The rank of an author is enhanced if he or she co-authors with many highly ranked authors
     Other ranking functions are quite possible (e.g., using domain knowledge)
     Ex.: Journals may weigh more than conferences in science
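Rules 1 and 2 suggest a power-iteration-style propagation between the two types (Rule 3's co-author smoothing is omitted for brevity). The matrix values and the function `authority_rank` are a hypothetical sketch, not the tutorial's exact update equations:

```python
import numpy as np

# Toy venue-author count matrix: rows = venues, columns = authors.
W = np.array([[3.0, 0.0, 1.0],
              [1.0, 2.0, 0.0]])

def authority_rank(W, iters=50):
    """Alternate Rule 1 and Rule 2 until the two rank vectors stabilize."""
    n_venues, n_authors = W.shape
    r_v = np.full(n_venues, 1.0 / n_venues)  # start from a uniform venue rank
    r_a = np.full(n_authors, 1.0 / n_authors)
    for _ in range(iters):
        r_a = W.T @ r_v          # Rule 1: authors gain rank from good venues
        r_a /= r_a.sum()
        r_v = W @ r_a            # Rule 2: venues gain rank from good authors
        r_v /= r_v.sum()
    return r_v, r_a

rank_v, rank_a = authority_rank(W)
```

Unlike simple ranking, an author's score now depends on where the papers appear, not just on how many there are.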
  • 19. Alternative Ranking Functions
     A ranking function is not only related to the link property, but also depends on domain knowledge
     Ex.: Journals may weigh more than conferences in science
     Ranking functions can be defined on multi-typed networks
     Ex.: PopRank takes into account the impact both from the same type of objects and from other types of objects, with different impact factors for different types
     Use expert knowledge, for example:
     TrustRank semi-automatically separates reputable, good objects from spam ones
     Personalized PageRank uses expert ranking as the query and generates rank distributions w.r.t. such knowledge
     A research problem that needs systematic study
  • 20. The EM (Expectation-Maximization) Algorithm
     The k-means algorithm has two steps at each iteration (in the E-M framework):
     Expectation step (E-step): given the current cluster centers, each object is assigned to the cluster whose center is closest to it; an object is expected to belong to the closest cluster
     Maximization step (M-step): given the cluster assignment, for each cluster the algorithm adjusts the center so that the sum of distances from the objects assigned to this cluster to the new center is minimized
     The EM algorithm: a framework to approach maximum likelihood or maximum a posteriori estimates of parameters in statistical models
     E-step: assigns objects to clusters according to the current probabilistic clustering or parameters of probabilistic clusters
     M-step: finds the new clustering or parameters that minimize the sum of squared error (SSE) or maximize the expected likelihood
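The two k-means steps described above can be sketched directly; this is a generic textbook implementation on toy data, not code from the tutorial:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means in the E-M framework described above."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # E-step: assign each object to the closest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: move each center to the mean of its assigned objects,
        # minimizing the within-cluster sum of squared distances
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Two clearly separated groups of 2-D points (toy data)
pts = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centers = kmeans(pts, k=2)
```

RankClus reuses this E-M skeleton, but replaces the Euclidean distance with cluster-conditional rank distributions.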
  • 21. From Conditional Rank Distribution to the E-M Framework
     Given a bi-typed bibliographic network, how can we use the conditional rank scores to further improve the clustering results?
     Conditional rank distribution as cluster feature
     For each cluster C_k, the conditional rank scores r_{X|C_k} and r_{Y|C_k} can be viewed as conditional rank distributions of X and Y, which are the features of cluster C_k
     Cluster membership as object feature
     From p(k|o_i) ∝ p(o_i|k)p(k): the higher its conditional rank in a cluster (p(o_i|k)), the higher the possibility that an object belongs to that cluster (p(k|o_i))
     A highly ranked attribute object has more impact on determining the cluster membership of a target object
     Parameter estimation using the Expectation-Maximization algorithm
     E-step: calculate the distribution p(z = k | y_j, x_i, Θ) based on the current value of Θ
     M-step: update Θ according to the current distribution
  • 22. RankClus: Integrating Clustering with Ranking
    RankClus [EDBT’09]: ranking and clustering mutually enhance each other in an E-M framework
     An EM-styled algorithm
     Initialization: randomly partition
     Repeat:
     Ranking: rank the objects in each sub-network induced from each cluster
     Generating a new measure space: estimate mixture model coefficients for each target object
     Adjusting the clusters
     Until change < threshold
    [Figure: an E-M framework for iterative enhancement, cycling between sub-network ranking and clustering over the conference-author network]
  • 23. Step-by-Step Running of RankClus
    Clustering and ranking of two fields: DB/DM vs. HW/CA (hardware/computer architecture)
     Initially: ranking distributions are mixed together; the two clusters of objects are mixed, but somewhat preserve similarity
     After an iteration: improved a little
     Then: the two clusters are almost well separated; improved significantly
     Finally: stable; well separated
  • 24. Clustering Performance (NMI)
     Compare the clustering accuracy: for N objects, K clusters, and two clustering results, let n(i, j) be the number of objects with cluster label i in the first clustering result (say, generated by the new algorithm) and cluster label j in the second clustering result (say, the ground truth)
     Normalized Mutual Information (NMI):
     Joint distribution: $p(i, j) = n(i, j)/N$
     Row distribution: $p_1(j) = \sum_{i=1}^{K} p(i, j)$; column distribution: $p_2(i) = \sum_{j=1}^{K} p(i, j)$
     $\mathrm{NMI} = \frac{\sum_{i=1}^{K}\sum_{j=1}^{K} p(i,j) \log \frac{p(i,j)}{p_1(j)\, p_2(i)}}{\sqrt{\left(\sum_{j=1}^{K} p_1(j) \log p_1(j)\right)\left(\sum_{i=1}^{K} p_2(i) \log p_2(i)\right)}}$
     Datasets: D1: medium separation & medium density; D2: medium separation & low density; D3: medium separation & high density; D4: high separation & medium density; D5: poor separation & medium density
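NMI as defined above can be computed in a few lines; this is a generic implementation (mutual information normalized by the geometric mean of the two entropies), for illustration:

```python
import numpy as np
from math import log

def nmi(labels_a, labels_b):
    """Normalized mutual information between two clusterings."""
    n = len(labels_a)
    a_vals, b_vals = sorted(set(labels_a)), sorted(set(labels_b))
    # joint distribution p(i, j) = n(i, j) / N
    p = np.zeros((len(a_vals), len(b_vals)))
    for x, y in zip(labels_a, labels_b):
        p[a_vals.index(x), b_vals.index(y)] += 1.0 / n
    p1, p2 = p.sum(axis=1), p.sum(axis=0)  # marginal distributions
    mi = sum(p[i, j] * log(p[i, j] / (p1[i] * p2[j]))
             for i in range(len(a_vals))
             for j in range(len(b_vals)) if p[i, j] > 0)
    h1 = -sum(q * log(q) for q in p1 if q > 0)  # entropy of clustering A
    h2 = -sum(q * log(q) for q in p2 if q > 0)  # entropy of clustering B
    return mi / (h1 * h2) ** 0.5 if h1 > 0 and h2 > 0 else 0.0
```

Identical clusterings (up to label renaming) give NMI = 1; independent ones give NMI ≈ 0.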
  • 25. RankClus: Clustering & Ranking CS Venues in DBLP
    [Table: top-10 conferences in 5 clusters using RankClus in DBLP (when k = 15)]
  • 26. Time Complexity: Linear in the Number of Links
    At each iteration, with |E| edges in the network, m target objects, and K clusters:
     Ranking for a sparse network: ~O(|E|)
     Mixture model estimation: ~O(K|E| + mK)
     Cluster adjustment: ~O(mK^2)
     In all, linear in |E|: ~O(K|E|)
     Note: SimRank is at least quadratic at each iteration, since it evaluates the distance between every pair in the network
    [Figure: comparison of algorithms on execution time]
  • 27. NetClus: Ranking-Based Clustering with a Star Network Schema
     Beyond bi-typed networks: capture more semantics with multiple types
     Split a network into multiple subnetworks, each a (multi-typed) net-cluster [KDD’09]
     DBLP network: use terms, venues, and authors to jointly distinguish a sub-field, e.g., database
    [Figure: a star schema of research papers (paper at the center; venue, author, and term around it, linked by publish, write, and contain relations); NetClus splits the network into net-clusters such as database, hardware, and theory]
  • 28. The NetClus Algorithm
     Generate initial partitions for target objects and induce initial net-clusters from the original network
     Repeat (an E-M framework):
     Build a ranking-based probabilistic generative model for each net-cluster
     Calculate the posterior probabilities for each target object
     Adjust cluster assignments according to the new measure defined by the posterior probabilities to each cluster
     Until the clusters do not change significantly
     Calculate the posterior probabilities for each attribute object in each net-cluster
  • 29. NetClus on the DBLP Network
     NetClus initialization: G = (V, E, W), with weight w_{x_i x_j} linking x_i and x_j
     V = A ∪ C ∪ D ∪ T, where D (paper), A (author), C (conference), and T (term)
     Simple ranking: rank proportional to the weighted degree of each object
     Authority ranking for type Y based on type X, through the center type Z: propagate ranking scores from X to Y via Z; for DBLP, the center type is paper
  • 30. Multi-Typed Networks Lead to Better Results
     The network cluster for the database area: conferences, authors, and terms
     NetClus leads to better clustering and ranking than RankClus
     NetClus vs. RankClus: 16% higher accuracy on conference clustering
  • 31. NetClus: Distinguishing Conferences
    Posterior probabilities of each venue over the five net-clusters:
    AAAI    0.0022667    0.00899168  0.934024    0.0300042   0.0247133
    CIKM    0.150053     0.310172    0.00723807  0.444524    0.0880127
    CVPR    0.000163812  0.00763072  0.931496    0.0281342   0.032575
    ECIR    3.47023e-05  0.00712695  0.00657402  0.978391    0.00787288
    ECML    0.00077477   0.110922    0.814362    0.0579426   0.015999
    EDBT    0.573362     0.316033    0.00101442  0.0245591   0.0850319
    ICDE    0.529522     0.376542    0.00239152  0.0151113   0.0764334
    ICDM    0.000455028  0.778452    0.0566457   0.113184    0.0512633
    ICML    0.000309624  0.050078    0.878757    0.0622335   0.00862134
    IJCAI   0.00329816   0.0046758   0.94288     0.0303745   0.0187718
    KDD     0.00574223   0.797633    0.0617351   0.067681    0.0672086
    PAKDD   0.00111246   0.813473    0.0403105   0.0574755   0.0876289
    PKDD    5.39434e-05  0.760374    0.119608    0.052926    0.0670379
    PODS    0.78935      0.113751    0.013939    0.00277417  0.0801858
    SDM     0.000172953  0.841087    0.058316    0.0527081   0.0477156
    SIGIR   0.00600399   0.00280013  0.00275237  0.977783    0.0106604
    SIGMOD  0.689348     0.223122    0.0017703   0.00825455  0.0775055
    VLDB    0.701899     0.207428    0.00100012  0.0116966   0.0779764
    WSDM    0.00751654   0.269259    0.0260291   0.683646    0.0135497
    WWW     0.0771186    0.270635    0.029307    0.451857    0.171082
  • 32. NetClus Experiment on DBLP: the Database System Cluster
     NetClus generates high-quality clusters for all three participating types in the DBLP network
     Quality can easily be judged by our commonsense knowledge
     Highly-ranked objects: objects centered in the cluster
    Term: database 0.0995511, databases 0.0708818, system 0.0678563, data 0.0214893, query 0.0133316, systems 0.0110413, queries 0.0090603, management 0.00850744, object 0.00837766, relational 0.0081175, processing 0.00745875, based 0.00736599, distributed 0.0068367, xml 0.00664958, oriented 0.00589557, design 0.00527672, web 0.00509167, information 0.0050518, model 0.00499396, efficient 0.00465707
    Author: Surajit Chaudhuri 0.00678065, Michael Stonebraker 0.00616469, Michael J. Carey 0.00545769, C. Mohan 0.00528346, David J. DeWitt 0.00491615, Hector Garcia-Molina 0.00453497, H. V. Jagadish 0.00434289, David B. Lomet 0.00397865, Raghu Ramakrishnan 0.0039278, Philip A. Bernstein 0.00376314, Joseph M. Hellerstein 0.00372064, Jeffrey F. Naughton 0.00363698, Yannis E. Ioannidis 0.00359853, Jennifer Widom 0.00351929, Per-Ake Larson 0.00334911, Rakesh Agrawal 0.00328274, Dan Suciu 0.00309047, Michael J. Franklin 0.00304099, Umeshwar Dayal 0.00290143, Abraham Silberschatz 0.00278185
    Venue: VLDB 0.318495, SIGMOD Conf. 0.313903, ICDE 0.188746, PODS 0.107943, EDBT 0.0436849
  • 33. Rank-Based Clustering: Works in Multiple Domains
     RankCompete: organize your photo album automatically! Uses multi-typed image features to build up heterogeneous networks
     MedRank: rank treatments for AIDS from MEDLINE by exploring multiple types in a star-schema network
  • 34. iNextCube: Information Network-Enhanced Text Cube (VLDB’09 Demo)
     Net-cluster hierarchy: dimension hierarchies generated by NetClus (e.g., All → DB and IS, Theory, Architecture, …; DB and IS → DB, DM, IR, XML, Distributed DB, …)
     Author/conference/term ranking for each research area; the research areas can be at different levels
    [Figure: architecture of iNextCube]
    Demo: iNextCube.cs.uiuc.edu
  • 35. Impact of the RankClus Methodology
     RankCompete [Cao et al., WWW’10]: extends to the domain of web images
     RankClus in medical literature [Li et al., working paper]: ranking treatments for diseases
     RankClass [Ji et al., KDD’11]: integrates classification with ranking
     Trustworthiness analysis [Gupta et al., WWW’11] [Khac Le et al., IPSN’11]: integrates clustering with trustworthiness scores
     Topic modeling in heterogeneous networks [Deng et al., KDD’11]: propagates topic information among different types of objects
     …
  • 38. Classification: Knowledge Propagation Across Heterogeneously Typed Networks
     RankClass [Ji et al., KDD’11]: ranking-based classification
     Highly ranked objects will play a larger role in classification
     Class membership and ranking are statistical distributions
     Let ranking and classification mutually enhance each other!
     Output: classification results + a ranking list of objects within each class
    Classification: labeled knowledge propagates through multi-typed objects across heterogeneous networks [KDD’11]
  • 39. GNetMine: Methodology
     Classification of networked data can essentially be viewed as a process of knowledge propagation, where information is propagated from labeled objects to unlabeled ones through links until a stationary state is achieved
     A novel graph-based regularization framework to address the classification problem on heterogeneous information networks
     Respect the link-type differences by preserving consistency over each relation graph corresponding to each type of links separately
     Mathematical intuition: the consistency assumption
     The confidence (f) of two objects x_{ip} and x_{jq} belonging to class k should be similar if x_{ip} ↔ x_{jq} (R_{ij,pq} > 0)
     f should be similar to the given ground truth on the labeled data
  • 40. GNetMine: Graph-Based Regularization
     Minimize the objective function (for each class k):
      $J(\mathbf{f}_1^{(k)}, \ldots, \mathbf{f}_m^{(k)}) = \sum_{i,j=1}^{m} \lambda_{ij} \sum_{p=1}^{n_i} \sum_{q=1}^{n_j} R_{ij,pq} \left( \frac{f_{ip}^{(k)}}{\sqrt{D_{ij,pp}}} - \frac{f_{jq}^{(k)}}{\sqrt{D_{ji,qq}}} \right)^2 + \sum_{i=1}^{m} \alpha_i \left( \mathbf{f}_i^{(k)} - \mathbf{y}_i^{(k)} \right)^T \left( \mathbf{f}_i^{(k)} - \mathbf{y}_i^{(k)} \right)$
     Smoothness constraints: objects linked together should share similar estimations of confidence of belonging to class k
     Normalization term applied to each type of link separately: reduces the impact of the popularity of objects
     Confidence estimation on labeled data and their pre-given labels should be similar
     User preference (λ_{ij}, α_i): how much do you value this relationship / the ground truth?
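The smoothness part of the objective for a single relation graph can be evaluated directly. The relation matrix and confidence vectors below are toy values, and the code is a sketch of what the term measures, not the GNetMine solver:

```python
import numpy as np

def smoothness(R, f_i, f_j):
    """Degree-normalized smoothness term for one relation graph R:
    linked objects should have similar normalized confidence scores."""
    D_ij = R.sum(axis=1)  # degrees of type-i objects under this relation
    D_ji = R.sum(axis=0)  # degrees of type-j objects
    total = 0.0
    for p in range(R.shape[0]):
        for q in range(R.shape[1]):
            if R[p, q] > 0:
                total += R[p, q] * (f_i[p] / np.sqrt(D_ij[p])
                                    - f_j[q] / np.sqrt(D_ji[q])) ** 2
    return total

# Toy relation graph and per-class confidence vectors (hypothetical values)
R = np.array([[1.0, 2.0],
              [0.0, 3.0]])
f_i = np.array([0.9, 0.1])
f_j = np.array([0.8, 0.2])
penalty = smoothness(R, f_i, f_j)
```

The penalty is zero exactly when every linked pair has equal degree-normalized confidence, which is what minimizing J pushes toward.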
  • 41. RankClass: Ranking-Based Classification
     Classification in heterogeneous networks
     Knowledge propagation: class-label knowledge propagated across multi-typed objects through a heterogeneous network
     GNetMine [Ji et al., PKDD’10]: objects are treated equally
     RankClass [Ji et al., KDD’11]: ranking-based classification
     Highly ranked objects will play a larger role in classification
     An object can only be ranked high in some focused classes
     Class membership and ranking are statistical distributions
     Let ranking and classification mutually enhance each other!
     Output: classification results + a ranking list of objects within each class
  • 42. From RankClus to GNetMine & RankClass
     RankClus [EDBT’09]: clustering and ranking working together
     No training, no available class labels, no expert knowledge
     GNetMine [PKDD’10]: incorporates label information in networks
     Classification in heterogeneous networks, but objects are treated equally
     RankClass [KDD’11]: integration of ranking and classification in heterogeneous network analysis
     Ranking: an informative understanding & summary of each class
     Class membership is critical information when ranking objects
     Let ranking and classification mutually enhance each other!
     Output: classification results + a ranking list of objects within each class
  • 43. Why Rank-Based Classification?
     Different objects within one class have different importance/visibility!
     The ranking of objects within one class serves as an informative understanding and summary of the class
  • 44. Motivation
     Why not do ranking after classification, or vice versa?
     Because they mutually enhance each other; the relationship is not unidirectional
     RankClass: iteratively let ranking and classification mutually enhance each other
  • 45. RankClass: Ranking-Based Classification
     Class: a group of multi-typed nodes sharing the same topic, plus a within-class ranking
    [Figure: the RankClass framework, with nodes colored by class]
  • 46. Comparing Classification Accuracy on the DBLP Data
     Classes: four research areas: DB, DM, AI, IR
     Four types of objects: paper (14,376), conference (20), author (14,475), term (8,920)
     Three types of relations: paper-conference, paper-author, paper-term
     Algorithms for comparison:
     LLGC [Zhou et al., NIPS’03]
     wvRN [Macskassy et al., JMLR’07]
     nLB [Lu et al., ICML’03; Macskassy et al., JMLR’07]
  • 47. Object Ranking in Each Class: Experiment
     DBLP: a 4-field data set (DB, DM, AI, IR) forming a heterogeneous information network
     Rank objects within each class (with extremely limited label information)
     Obtain high classification accuracy and excellent ranking within each class

    Top-5 ranked conferences
    Database:    VLDB, SIGMOD, ICDE, PODS, EDBT
    Data Mining: KDD, SDM, ICDM, PKDD, PAKDD
    AI:          IJCAI, AAAI, ICML, CVPR, ECML
    IR:          SIGIR, ECIR, CIKM, WWW, WSDM

    Top-5 ranked terms
    Database:    data, database, query, system, xml
    Data Mining: mining, data, clustering, classification, frequent
    AI:          learning, knowledge, reasoning, logic, cognition
    IR:          retrieval, information, web, search, text
  • 50. Similarity Search in Heterogeneous Networks
     Similarity measures/search are the basis for cluster analysis
     Who are the most similar to Christos Faloutsos based on the DBLP network?
     Meta-path: a meta-level description of a path between two objects
     A path on the network schema
     Denotes an existing or concatenated relation between two object types
     Different meta-paths convey different semantics
     Meta-path Author-Paper-Author (co-authorship, A-P-A): Christos’ students or close collaborators
     Meta-path Author-Paper-Venue-Paper-Author: authors working in similar fields with similar reputation
  • 51. Existing Popular Similarity Measures for Networks
     Random walk (RW): $s(x, y) = \sum_{p \in P} \mathrm{prob}(p)$, the probability of a random walk starting at x and ending at y, following meta-path P
     Used in Personalized PageRank (P-PageRank) (Jeh and Widom, 2003)
     Favors highly visible objects (i.e., objects with large degrees)
     Pairwise random walk (PRW): $s(x, y) = \sum_{(p_1, p_2) \in (P_1, P_2^{-1})} \mathrm{prob}(p_1)\,\mathrm{prob}(p_2)$, the probability of a pairwise random walk starting at (x, y) and ending at a common object (say z), following the meta-path (P_1, P_2)
     Used in SimRank (Jeh and Widom, 2002)
     Favors pure objects (i.e., objects with highly skewed distributions in their in-links or out-links)
     Note: P-PageRank and SimRank do not distinguish object types and relationship types
  • 52. Which Similarity Measure Is Better for Finding Peers?
    Comparison of multiple measures: a toy example. Who is more similar to Mike?
     Meta-path: APCPA
     Mike publishes a similar number of papers as Bob and Mary
     Other measures find Mike closer to Jim

    Author  SIGMOD  VLDB  ICDM  KDD
    Mike       2      1     0    0
    Jim       50     20     0    0
    Mary       2      0     1    0
    Bob        2      1     0    0
    Ann        0      0     1    1

    Measure           Jim     Mary    Bob     Ann
    P-PageRank        0.376   0.013   0.016   0.005
    SimRank           0.716   0.572   0.713   0.184
    Random Walk       0.8983  0.0238  0.0390  0
    Pairwise R.W.     0.5714  0.4440  0.5556  0
    PathSim (APCPA)   0.083   0.8     1       0

     PathSim favors peers: objects with strong connectivity and similar visibility under a given meta-path
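For the symmetric meta-path APCPA, PathSim is s(x, y) = 2 M[x, y] / (M[x, x] + M[y, y]), where M = W Wᵀ is the commuting matrix of the author-conference count matrix W. The sketch below reproduces the PathSim row of the toy table:

```python
import numpy as np

# Author-conference publication counts from the toy example
# (rows: Mike, Jim, Mary, Bob, Ann; cols: SIGMOD, VLDB, ICDM, KDD).
W = np.array([[ 2.0,  1.0, 0.0, 0.0],   # Mike
              [50.0, 20.0, 0.0, 0.0],   # Jim
              [ 2.0,  0.0, 1.0, 0.0],   # Mary
              [ 2.0,  1.0, 0.0, 0.0],   # Bob
              [ 0.0,  0.0, 1.0, 1.0]])  # Ann

M = W @ W.T  # commuting matrix: path counts along the meta-path APCPA

def pathsim(M, x, y):
    """PathSim: s(x, y) = 2 * M[x, y] / (M[x, x] + M[y, y])."""
    return 2.0 * M[x, y] / (M[x, x] + M[y, y])
```

Here pathsim(M, 0, 3) is 1.0 (Bob is Mike's perfect peer), pathsim(M, 0, 2) is 0.8, and pathsim(M, 0, 1) ≈ 0.083, matching the table: the prolific Jim is visible but not a peer.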
  • 53. Example with DBLP: Find Academic Peers by PathSim
    Meta-path: Author-Paper-Venue-Paper-Author
     Anhai Doan: CS, Wisconsin; database area; PhD 2002
     Jun Yang: CS, Duke; database area; PhD 2001
     Amol Deshpande: CS, Maryland; database area; PhD 2004
     Jignesh Patel: CS, Wisconsin; database area; PhD 1998
  • 54. Some Meta-Paths Are “Better” Than Others
     Which pictures are most similar to this one?
     Meta-path Image-Tag-Image: evaluates the similarity between images according to their linked tags
     Meta-path Image-Tag-Image-Group-Image-Tag-Image: evaluates the similarity between images according to tags and groups
    [Figure: a Flickr network of images, tags, groups, and users]
  • 55. Comparing Similarity Measures in DBLP Data
    [Figures: which venues are most similar to DASFAA, and which are most similar to SIGMOD, under different measures; measures that favor highly visible objects return big venues, while others return tiny forums; are these tiny forums really most similar to SIGMOD?]
  • 56. A Long Meta-Path May Not Carry the Right Semantics
     Repeat the meta-path 2, 4, and an infinite number of times for the conference similarity query
  • 57. Co-Clustering-Based Pruning Algorithm
     Meta-path based similarity computation can be costly
     The overall cost can be reduced by storing commuting matrices for short path schemas and computing top-k queries online
     Framework:
     Generate co-clusters for materialized commuting matrices for feature objects and target objects
     Derive upper bounds for the similarity between an object and a target cluster, and between an object and an object
     Safely prune target clusters and objects if the upper-bound similarity is lower than the current threshold
     Dynamically update the top-k threshold
  • 58. Meta-Path: A Key Concept for Heterogeneous Networks
     Meta-path based mining
     PathPredict [Sun et al., ASONAM’11]: co-authorship prediction using meta-path based similarity
     PathPredict_when [Sun et al., WSDM’12]: when a relationship will happen
     Citation prediction [Yu et al., SDM’12]: meta-path + topic
     Meta-path learning
     User-guided meta-path selection [Sun et al., KDD’12]: meta-path selection + clustering
  • 61. Relationship Prediction vs. Link Prediction
     Link prediction in homogeneous networks [Liben-Nowell and Kleinberg, 2003; Hasan et al., 2006]
     E.g., friendship prediction
     Relationship prediction in heterogeneous networks
     Different types of relationships need different prediction models
     Different connection paths need to be treated separately!
     A meta-path-based approach to define topological features
  • 62. Why Prediction Using Heterogeneous Info Networks?
     Rich semantics are contained in structured, text-rich heterogeneous networks
     Homogeneous networks, such as coauthor networks, miss too much critically important information
     Schema-guided relationship prediction
     Semantic relationships among similarly typed links share similar semantics and are comparable and inferable
     Relationships across differently typed links are not directly comparable, but their collective behavior will help predict particular relationships
     Meta-paths encode topological features
     E.g., citation relations between authors: A — P → P — A (write, cite, write)
     Y. Sun, R. Barber, M. Gupta, C. Aggarwal and J. Han, "Co-Author Relationship Prediction in Heterogeneous Bibliographic Networks", ASONAM'11
  • 63. PathPredict: Meta-Path Based Relationship Prediction
     Who will be your new coauthors in the next 5 years?
     Meta-path-guided prediction of links and relationships
     Philosophy: meta-path relationships among similarly typed links share similar semantics and are comparable and inferable
     Co-author prediction (A—P—A) [Sun et al., ASONAM’11]
     Use topological features encoded by meta-paths, e.g., citation relations between authors (A—P→P—A)
    [Table: meta-paths between authors of length ≤ 4, with the semantic meaning of each meta-path]
  • 64. 64 Logistic Regression-Based Prediction Model  Training and test pair: <xi, yi> = <history feature list, future relationship label>  Logistic regression model  Model the probability of each relationship as P(yi = 1 | xi) = e^(β·xi) / (1 + e^(β·xi))  β is the coefficient vector for the features (including a constant 1)  MLE estimation  Maximize the likelihood of observing all the relationships in the training data  Example training table (features = meta-path instance counts; label = future co-authorship A—P—A):
Pair | A—P—A—P—A | A—P—V—P—A | A—P—T—P—A | A—P→P—A | Label
<Mike, Ann> | 4 | 5 | 100 | 3 | Yes = 1
<Mike, Jim> | 0 | 1 | 20 | 2 | No = 0
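The table above can be fit in a few lines of NumPy. A minimal sketch on toy data (the pairs and counts are illustrative, not from DBLP), estimating β by gradient ascent on the log-likelihood:

```python
import numpy as np

# Toy training pairs: meta-path instance counts -> future co-author label.
# Columns (hypothetical counts): A-P-A-P-A, A-P-V-P-A, A-P-T-P-A, A-P->P-A
X = np.array([[4, 5, 100, 3],    # <Mike, Ann>  -> co-authored later (1)
              [0, 1,  20, 2],    # <Mike, Jim>  -> did not (0)
              [3, 4,  80, 5],
              [1, 0,  10, 0]], dtype=float)
y = np.array([1, 0, 1, 0])

X = np.hstack([np.ones((len(X), 1)), X / X.max(axis=0)])  # constant 1 + scaling
beta = np.zeros(X.shape[1])

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
for _ in range(2000):                      # MLE via gradient ascent
    p = sigmoid(X @ beta)
    beta += 0.5 * X.T @ (y - p) / len(y)   # gradient of the log-likelihood

p = sigmoid(X @ beta)                      # predicted co-author probabilities
```

On this toy data the positive pairs end up with high predicted probability and the negative pairs with low probability, mirroring how PathPredict scores candidate co-authors.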
  • 65. 65 Selection among Competitive Measures We study four measures that define a relationship R encoded by a meta-path  Path Count: Number of path instances between authors following R  Normalized Path Count: Normalize the path count following R by the "degree" of the authors  Random Walk: Consider a one-way random walk following R  Symmetric Random Walk: Consider random walks in both directions
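For a symmetric meta-path such as A—P—A, all four measures can be read off the commuting matrix M = W·Wᵀ of the author-paper "write" relation. A minimal sketch on a toy matrix; the exact normalizations here are one common formulation and are assumptions, not the paper's verbatim definitions:

```python
import numpy as np

# Toy author x paper "write" matrix; the A-P-A commuting matrix is W @ W.T
W = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 0, 1]], dtype=float)
M = W @ W.T                     # M[i, j] = # path instances i -> j along A-P-A

def path_count(i, j):           # PC: raw number of path instances
    return M[i, j]

def normalized_pc(i, j):        # NPC: normalize by the authors' "degrees"
    return 2 * M[i, j] / (M[i].sum() + M[j].sum())

def random_walk(i, j):          # RW: one-way walk probability i -> j
    return M[i, j] / M[i].sum()

def sym_random_walk(i, j):      # SRW: walks in both directions
    return random_walk(i, j) + random_walk(j, i)
```

With the toy matrix, `path_count(0, 1)` is 1 while `random_walk(0, 1)` is 1/3, showing how the same connection gets different scores under different measures.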
  • 66. 66 Performance Comparison: Homogeneous vs. Heterogeneous Topological Features  Homogeneous features  Only consider co-author sub-network (common neighbor; rooted PageRank)  Mix all types together (homogeneous path count)  Heterogeneous feature  Heterogeneous path count Notation: HP2hop: highly productive source authors with 2-hops reaching target authors
  • 67. 67 Comparison among Different Topological Features  Hybrid heterogeneous topological feature is the best Notations (1) the path count (PC) (2) the normalized path count (NPC) (3) the random walk (RW) (4) the symmetric random walk (SRW) PC1: homogeneous common neighbor PCSum: homogeneous path count
  • 69. 69 The Power of PathPredict: Experiment on DBLP  Explain the prediction power of each meta-path  Wald test for logistic regression  Higher prediction accuracy than using the projected homogeneous network  11% higher prediction accuracy  Co-author prediction for Jian Pei: only 42 among 4,809 candidates are true first-time co-authors! (Features collected in [1996, 2002]; test period in [2003, 2009])  Evaluation of the prediction power of different meta-paths  Prediction of new coauthors of Jian Pei in [2003, 2009]  Do social relations play a more important role?
  • 70. 70 When Will It Happen?  From "whether" to "when"  "Whether": Will Jim rent the movie "Avatar" in Netflix? Output: P(X=1)=?  "When": When will Jim rent the movie "Avatar"? Output: a distribution of time!  What is the probability Jim will rent "Avatar" within 2 months? 𝑃(𝑌 ≤ 2)  By when will Jim rent "Avatar" with 90% probability? 𝑡: 𝑃(𝑌 ≤ 𝑡) = 0.9  What is the expected time it will take for Jim to rent "Avatar"? 𝐸(𝑌)  May provide useful information to supply chain management
  • 71. 71 The Relationship Building Time Prediction Model  Solution  Directly model the relationship building time: P(Y = t)  Candidate distributions: geometric, exponential, Weibull  Use a generalized linear model under the Weibull distribution assumption  Deal with censoring (the relationship builds beyond the observed time interval; T: right censoring)  Training framework
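The three "when" queries from the previous slide are direct reads off the assumed distribution. A stdlib-only sketch for a Weibull model; the shape/scale parameters and the 12-month window are hypothetical, not fitted values from the paper:

```python
import math

# Hypothetical Weibull(shape k, scale lam) for relationship-building time Y (months)
k, lam = 1.5, 6.0

cdf = lambda t: 1 - math.exp(-((t / lam) ** k))        # P(Y <= t)
quantile = lambda p: lam * (-math.log(1 - p)) ** (1 / k)
mean = lam * math.gamma(1 + 1 / k)                     # E[Y]

p2 = cdf(2)             # "Will Jim rent Avatar within 2 months?"  P(Y <= 2)
t90 = quantile(0.9)     # "By when, with 90% probability?"  t: P(Y <= t) = 0.9

# Right censoring: a relationship not built by the window end T contributes
# the survival probability P(Y > T) to the likelihood instead of a density term.
log_lik_censored = math.log(1 - cdf(12))
```

A full GLM would make `lam` (and possibly `k`) a function of the meta-path features; the censored log-likelihood term above is what lets training use pairs whose relationship had not yet formed.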
  • 72. 72 Author Citation Time Prediction in DBLP  Top-4 meta-paths for author citation time prediction (under the Weibull distribution assumption):  Study the same topic  Co-cited by the same paper  Follow co-authors' citations  Follow the citations of authors who study the same topic  Predict when Philip S. Yu will cite a new author  Social relations are less important in author citation prediction than in co-author prediction
  • 74. 74 Outline  Motivation: Why Mining Information Networks?  Part I: Clustering, Ranking and Classification  Clustering and Ranking in Information Networks  Classification of Information Networks  Part II: Meta-Path Based Exploration of Information Networks  Similarity Search in Information Networks  Relationship Prediction in Information Networks  Recommendation with Heterogeneous Information Networks  User-Guided Meta-Path based clustering in heterogeneous networks  ClusCite: Citation recommendation in heterogeneous networks
  • 75. 75 Enhancing the Power of Recommender Systems by Heterog. Info. Network Analysis  Heterogeneous relationships complement each other  Users and items with limited feedback can be connected to the network by different types of paths  Connect new users or items in the information network  Different users may require different models: relationship heterogeneity makes personalized recommendation models easier to define  Collaborative filtering methods suffer from the data sparsity issue: a small set of users & items have a large number of ratings, while most users & items have a small number of ratings  Example network: movies (Avatar, Titanic, Aliens, Revolutionary Road) linked to people (James Cameron, Kate Winslet, Leonardo DiCaprio, Zoe Saldana) and genres (Adventure, Romance)  Personalized recommendation with heterog. networks [WSDM'14]
  • 76. 76 Relationship Heterogeneity Based Personalized Recommendation Models  Different users may have different behaviors or preferences  Different users may be interested in the same movie for different reasons (e.g., Aliens: a James Cameron fan, an '80s sci-fi fan, or a Sigourney Weaver fan)  Two levels of personalization  Data level  Most recommendation methods use one model for all users and rely on personal feedback to achieve personalization  Model level  With different entity relationships, we can learn personalized models for different users to further distinguish their differences
  • 77. 77 Preference Propagation-Based Latent Features  Generate L different meta-paths (path types) connecting users and items  Propagate user implicit feedback along each meta-path  Calculate latent features for users and items for each meta-path with an NMF-related method  (Example network: users Alice, Bob, Charlie; movies Titanic, Revolutionary Road, Skyfall, King Kong; actors Kate Winslet, Naomi Watts, Ralph Fiennes; director Sam Mendes; genre: drama; tag: Oscar Nomination)
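The two steps above (propagate, then factorize) can be sketched with matrix products and multiplicative-update NMF (Lee & Seung style, a stand-in for the "NMF-related method" the slide mentions). All matrices are toy and hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy user x movie implicit feedback, and movie x actor links (hypothetical)
R = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1]], dtype=float)
MA = np.array([[1, 0], [1, 1], [0, 1], [0, 1]], dtype=float)  # movie-actor

# Propagate feedback along the meta-path  user - movie - actor - movie
R_path = R @ MA @ MA.T          # diffused preference matrix for this path

# Factorize the diffused matrix with multiplicative-update NMF (rank 2)
U = rng.random((3, 2)); V = rng.random((4, 2))
for _ in range(500):
    U *= (R_path @ V) / (U @ V.T @ V + 1e-9)
    V *= (R_path.T @ U) / (V @ U.T @ U + 1e-9)

err = np.linalg.norm(R_path - U @ V.T)   # reconstruction residual
```

`U` and `V` are the per-meta-path latent features for users and items; repeating this for each of the L meta-paths yields the L feature sets used by the recommendation model.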
  • 78. 78 Recommendation Models  Observation 1: Different meta-paths may have different importance  Global recommendation model (Eq. 1): the ranking score combines the q-th meta-path latent features for user i and item j, weighted per meta-path  Observation 2: Different users may require different models  Personalized recommendation model (Eq. 2): mixes c soft user clusters via the L user-cluster similarities
  • 79. 79 Learning the Personalized Recommendation Model  Parameter estimation: Bayesian personalized ranking (Rendle et al., UAI'09)  Objective function (Eq. 3): minimize over Θ, with a sigmoid term for each correctly ranked item pair, i.e., 𝑢𝑖 gave feedback to 𝑒𝑎 but not 𝑒𝑏  Soft-cluster users with NMF + k-means  For each user cluster, learn one model with Eq. (3)  Generate a personalized model for each user on the fly with Eq. (2)
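A minimal BPR-style sketch: maximize Σ ln σ(s(u,a) − s(u,b)) over pairs where user u gave feedback to item a but not b. This simplified version learns only global meta-path weights θ over precomputed per-path scores (the paper learns per-cluster models); all data is random and hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-meta-path scores x[u, i, q] for L meta-paths
L_paths, n_users, n_items = 3, 4, 5
x = rng.random((n_users, n_items, L_paths))
theta = np.zeros(L_paths)                  # meta-path weights to learn

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Observed feedback: (user, item with feedback a, item without feedback b)
triples = [(0, 1, 3), (1, 0, 2), (2, 4, 1), (3, 2, 0)]
obj = lambda th: sum(np.log(sigmoid(th @ (x[u, a] - x[u, b])))
                     for u, a, b in triples)

before = obj(theta)
for _ in range(500):                       # stochastic gradient ascent on BPR
    for u, a, b in triples:
        diff = x[u, a] - x[u, b]           # feature difference of the pair
        theta += 0.1 * sigmoid(-(theta @ diff)) * diff
after = obj(theta)
```

The objective is concave in θ, so the ascent steps steadily push correctly ranked pairs apart; `after > before` on any non-degenerate data.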
  • 80. 80 Experiments: Heterogeneous Network Modeling Enhances the Quality of Recommendation  Datasets  Comparison methods  Popularity: recommend the most popular items to users  Co-click: conditional probabilities between items  NMF: non-negative matrix factorization on user feedback  Hybrid-SVM: use Rank-SVM with plain features (utilize both user feedback and information network) HeteRec personalized recommendation (HeteRec-p) leads to the best recommendation
  • 82. 82 Outline  Motivation: Why Mining Information Networks?  Part I: Clustering, Ranking and Classification  Clustering and Ranking in Information Networks  Classification of Information Networks  Part II: Meta-Path Based Exploration of Information Networks  Similarity Search in Information Networks  Relationship Prediction in Information Networks  Recommendation with Heterogeneous Information Networks  User-Guided Meta-Path based clustering in heterogeneous networks  ClusCite: Citation recommendation in heterogeneous networks
  • 83. 83 Why User Guidance in Clustering?  Different users may like to get different clusters for different clustering goals  Ex. Clustering authors based on their connections in the network: {1,2,3,4} {5,6,7,8} vs. {1,3,5,7} {2,4,6,8} vs. {1,3} {2,4} {5,7} {6,8}  Which meta-path(s) to choose?
  • 84. 84 User Guidance Determines Clustering Results  Different user preferences (e.g., seeding desired clusters) lead to the choice of different meta-paths  E.g., seeds {1}, {5} → clustering {1,2,3,4} {5,6,7,8}; seeds {1}, {2}, {5}, {6} → clustering {1,3} {2,4} {5,7} {6,8}  Problem: User-guided clustering with meta-path selection  Input:  The target type for clustering T  # of clusters k  Seeds in some clusters: L1, …, Lk  Candidate meta-paths: P1, …, PM  Output:  Weight of each meta-path: w1, …, wM  Clustering results that are consistent with the user guidance
  • 85. 85 PathSelClus: A Probabilistic Modeling Approach  Part 1: Modeling the Relationship Generation  A good clustering result should lead to high likelihood in observing existing relationships  Higher quality relations should count more in the total likelihood  Part 2: Modeling the Guidance from Users  The more consistent with the guidance, the higher probability of the clustering result  Part 3: Modeling the Quality Weights for Meta-Paths  The more consistent with the clustering result, the higher quality weight
  • 86. 86 Part 1: Modeling the Relationship Generation
  • 87. 87 Part 2: Modeling the Guidance from Users
  • 88. 88 Part 3: Modeling the Quality Weights for Meta-Paths  Dirichlet Distribution
  • 90. 90 Effectiveness of Meta-Path Selection  Experiments on Yelp data  Object types: Users, Businesses, Reviews, Terms  Relation types: UR, RU, BR, RB, TR, RT  Task:  Candidate meta-paths: B-R-U-R-B, B-R-T-R-B  Target objects: restaurants  # of clusters: 6  Output:  PathSelClus vs. others  High accuracy  Restaurant vs. shopping clustering:  For restaurants, meta-path B-R-U-R-B weighs only 0.1716 (users try different kinds of food)  For clustering shopping, B-R-U-R-B weighs 0.5864
  • 92. 92 Outline  Motivation: Why Mining Information Networks?  Part I: Clustering, Ranking and Classification  Clustering and Ranking in Information Networks  Classification of Information Networks  Part II: Meta-Path Based Exploration of Information Networks  Similarity Search in Information Networks  Relationship Prediction in Information Networks  Recommendation with Heterogeneous Information Networks  User-Guided Meta-Path based clustering in heterogeneous networks  ClusCite: Citation recommendation in heterogeneous networks
  • 93. 93 The Citation Recommendation Problem Given a newly written manuscript (title, abstract and/or content) and its attributes (authors, target venues), suggest a small number of papers as high-quality references.
  • 94. 94 Why ClusCite for Citation Recommendation?  Xiang Ren, Jialu Liu, Xiao Yu, Urvashi Khandelwal, Quanquan Gu, Lidan Wang, Jiawei Han, "ClusCite: Effective Citation Recommendation by Information Network-Based Clustering", KDD'14  Research papers need to cite relevant and important previous work  Help readers understand the paper's background, context and innovation  The already large, rapidly growing body of scientific literature calls for automatic recommendation of high-quality citations  Traditional literature search systems: infeasible to cast users' rich information needs into queries consisting of just a few keywords
  • 95. 95 Consider Distinct Citation Behavioral Patterns Basic idea: A paper's citations can be organized into different groups, each having its own behavioral pattern to identify references of interest  A principled way to capture a paper's citation behaviors  A more accurate approach: paper-specific recommendation model Most existing studies assume all papers adopt the same criterion and follow the same behavioral pattern in citing other papers:  Context-based [He et al., WWW'10; Huang et al., CIKM'12]  Topical similarity-based [Nallapati et al., KDD'08; Tang et al., PAKDD'09]  Structural similarity-based [Liben-Nowell et al., CIKM'03; Strohman et al., SIGIR'07]  Hybrid methods [Bethard et al., CIKM'10; Yu et al., SDM'12]
  • 96. 96 Observation: Distinct Citation Behavioral Patterns Each group follows a distinct behavioral pattern and adopts different criteria in deciding the relevance and authority of a candidate paper
  • 97. 97 Heterogeneous Bibliographic Network A unified data model for bibliographic datasets (papers and their attributes)  Captures paper-paper relevance of different semantics  Enables authority propagation between different types of objects  (Network schema; example venues: KDD, WSDM)
  • 98. 98 Citation Recommendation in Heterogeneous Networks Given a heterogeneous bibliographic network and the terms, authors and target venues of a query manuscript q, we aim to build a recommendation model specifically for q, and recommend a small subset of papers as high-quality references for q  Heterogeneous bibliographic network + query (to the system)
  • 99. 99 The ClusCite Framework We explore the principle that citations tend to be softly clustered into different interest groups, based on the heterogeneous network structure  Paper-specific recommendation: integrate the learned models of the query's related interest groups  Phase I: Joint learning (offline): for different interest groups, learn distinct models for finding relevant papers and judging the authority of papers  Phase II: Recommendation (online): derive the group membership of the query manuscript
  • 100. 100 ClusCite: An Overview of the Proposed Model  How likely a query manuscript q will cite a candidate paper p (suppose K interest groups): combine, per group, the query's group membership with a relative citation score (how likely q will cite p within that group), which in turn combines paper relative relevance (query vs. candidate) and paper relative authority (of the candidate)  It is desirable to suggest papers that have high relative citation scores across multiple related interest groups of the query manuscript
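The overall score described above reduces to a weighted sum over groups. A minimal sketch with hypothetical group memberships, relevance and authority scores (random placeholders, not learned values):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical: K = 3 interest groups, one query q, 4 candidate papers p
K, n_papers = 3, 4
theta_q = np.array([0.6, 0.3, 0.1])      # query's group membership
rel = rng.random((K, n_papers))          # per-group relative relevance (q vs. p)
auth = rng.random((K, n_papers))         # per-group relative authority of p

# s(q, p) = sum_k theta_q[k] * (relevance_k(q, p) + authority_k(p))
scores = theta_q @ (rel + auth)
ranking = np.argsort(-scores)            # recommend the top-ranked candidates
```

Papers scoring highly in several of the query's strong groups rise to the top, which is exactly the "high relative citation scores across multiple related interest groups" criterion.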
  • 101. 101 Proposed Model I: Group Membership  Learn each query's group membership (for scalability & generalizability)  Leverage the group memberships of the query's related (linked) attribute objects of each type (X = authors/venues/terms) to approximate the query's group membership  How likely a type-X object x belongs to the k-th interest group is represented by θx(k) (to learn)
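The aggregation in Eq. (4), θq(k) = Σ_X Σ_{x∈N_X(q)} θx(k)/|N_X(q)|, is a per-type average of neighbor memberships. A sketch with hypothetical attribute objects and memberships (the final normalization across groups is an added convenience, not part of Eq. (4)):

```python
import numpy as np

K = 3
# Hypothetical group memberships for the query's attribute objects
theta = {
    "author:J.Han": np.array([0.7, 0.2, 0.1]),
    "venue:KDD":    np.array([0.5, 0.4, 0.1]),
    "term:network": np.array([0.3, 0.3, 0.4]),
    "term:mining":  np.array([0.6, 0.2, 0.2]),
}
neighbors = {                 # N_X(q): the query's linked objects of type X
    "A": ["author:J.Han"],
    "V": ["venue:KDD"],
    "T": ["term:network", "term:mining"],
}

# theta_q[k] = sum over types X of the average membership of N_X(q)
theta_q = sum(sum(theta[x] for x in objs) / len(objs)
              for objs in neighbors.values())
theta_q /= theta_q.sum()      # normalize so group memberships sum to 1
```

Because only the attribute objects' memberships are learned offline, a brand-new manuscript gets its group membership instantly at query time from its authors, venue and terms.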
  • 102. 102 Proposed Model II: Paper Relevance  Meta path-based relevance scores (the l-th feature), computed over the network schema
  • 103. 103 Proposed Model II: Paper Relevance  Relevance features play different roles in different interest groups: each group has its own weights on the meta path-based features
  • 104. 104 Proposed Model III: Object Relative Authority  Paper relative authority: A paper may have quite different visibility/authority among different groups, even if it is overall highly cited  Relative authority propagation over the network (e.g., relative authority in Group A vs. Group B; venues KDD, WSDM)
  • 105. 105 Proposed Model III: Object Relative Authority  Rather than deriving relative authority separately, estimate it jointly using graph regularization, which preserves consistency over each subgraph  Paper relative authority serves as a feature for learning interest groups; better-estimated groups in turn help derive relative authority more accurately (Fig. 3)  The semi-supervised learning framework [8] leads to iterative update rules that act as authority propagation between object types (Eq. 3): FP = GP(FP, FA, FV; λA, λV); FA = GA(FP); FV = GV(FP)  Relative authority score matrices: FP ∈ R^(K×n) for papers, FA ∈ R^(K×|A|) for authors, FV ∈ R^(K×|V|) for venues  Within an interest group, the relative importance of one object type can be a combination of the relative importance of the other types [23]
  • 106. 106 Model Learning: A Joint Optimization Problem  A joint optimization problem (Eq. 7), following the graph-regularized semi-supervised learning framework [9]: min over P, W, FP, FA, FV of (1/2) L + R + (cp/2) ||P||F² + (cw/2) ||W||F², s.t. P ≥ 0, W ≥ 0  L: weighted model prediction error (Eq. 5): L = Σi ||Mi ⊙ (Yi − Ri P (W S(i)T + FP))||², where the mask M marks observed citations (Mij = 1), so the model focuses on positive examples and ignores noise in the 0 values  R: graph regularization encoding authority propagation (Eq. 6) over the paper-author and paper-venue subgraphs: R = (λA/2) Σi,j R(A)ij || FP,i/√D(PA)ii − FA,j/√D(AP)jj ||² + (λV/2) Σi,j R(V)ij || FP,i/√D(PV)ii − FV,j/√D(VP)jj ||²; intuition: linked objects are more likely to share similar relative authority scores [14], and degree normalization keeps popular objects from dominating the propagation  Query group membership (Eq. 4): θq(k) = Σ(X∈{A,V,T}) Σ(x∈NX(q)) θx(k) / |NX(q)|  Tikhonov regularizers (cp, cw > 0) ensure stability of the solution [4]; non-negativity constraints keep group memberships and feature weights interpretable  Algorithm: Alternating minimization (w.r.t. each variable)
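The loss term of the joint objective can be evaluated directly from its matrix form. A tiny sketch with random hypothetical matrices, computing (1/2)L plus the Tikhonov terms (the graph-regularization term R is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(4)

# Tiny hypothetical instance: n papers, d attribute objects, K groups, L features
n, d, K, L = 5, 6, 2, 3
Y = (rng.random((n, n)) < 0.3).astype(float)          # citation matrix
M = Y.copy()                                          # mask: observed citations
R = rng.random((n, d)); R /= R.sum(1, keepdims=True)  # neighbor indicator rows
S = rng.random((n, n, L))                             # meta path features S^(i)
P = rng.random((d, K))                                # attribute group membership
W = rng.random((K, L))                                # per-group feature weights
FP = rng.random((K, n))                               # paper relative authority
cp = cw = 0.1

# L = sum_i || M_i * (Y_i - R_i P (W S^(i)T + F_P)) ||^2   (Eq. 5)
loss = sum(np.sum((M[i] * (Y[i] - R[i] @ P @ (W @ S[i].T + FP))) ** 2)
           for i in range(n))
objective = (0.5 * loss + 0.5 * cp * np.linalg.norm(P) ** 2
                        + 0.5 * cw * np.linalg.norm(W) ** 2)
```

Each alternating-minimization step must not increase this quantity, which is how convergence of the ClusCite algorithm is monitored in practice.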
  • 107. 107 Learning Algorithm  Updating formulas for alternating minimization (the objective in Eq. 7 is non-convex, so ClusCite alternately optimizes w.r.t. each variable):  Group membership P: multiplicative update (Eq. 8), derived from the Karush-Kuhn-Tucker complementary condition [7] to preserve non-negativity throughout the update  Feature weights W: multiplicative update (Eq. 9), derived analogously  Relative authority: closed-form propagation (Eqs. 10-12): FP = (λA FA SAᵀ + λV FV SVᵀ + LFP) / (λA + λV); FA = FP SA; FV = FP SV, where SA = (D(PA))^(−1/2) R(A) (D(AP))^(−1/2) (and SV likewise) are degree-normalized adjacency matrices  Algorithm 1 (Model Learning by ClusCite): initialize P, W with positive values and FP, FA, FV with citation counts from the training set; repeat { update P by Eq. (8); update W by Eq. (9); compute FP, FA, FV by Eqs. (10)-(12) } until the objective in Eq. (7) converges
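The authority-propagation part of Algorithm 1 (steps 5-7) is just the three closed-form updates applied in a loop. A sketch on random hypothetical matrices; the per-iteration row normalization of FP is an added stabilizer in the spirit of the paper's popularity suppression, not its exact formula:

```python
import numpy as np

rng = np.random.default_rng(5)
K, n, nA, nV = 2, 6, 4, 3          # groups, papers, authors, venues
lam_A, lam_V = 0.4, 0.2

# Row-normalized paper-author / paper-venue adjacency (hypothetical)
SA = rng.random((n, nA)); SA /= SA.sum(1, keepdims=True)
SV = rng.random((n, nV)); SV /= SV.sum(1, keepdims=True)
FP = rng.random((K, n))            # paper relative authority per group
L_FP = rng.random((K, n))          # relevance-driven guidance term for papers

for _ in range(20):                # Algorithm 1, authority propagation steps
    FA = FP @ SA                   # Eq. (11): author authority from papers
    FV = FP @ SV                   # Eq. (12): venue authority from papers
    FP = (lam_A * FA @ SA.T + lam_V * FV @ SV.T + L_FP) / (lam_A + lam_V)
    FP /= FP.sum(1, keepdims=True) # keep popular objects from dominating
```

Each group (row of FP) ends up with its own authority distribution over papers, authors and venues, which is what makes the authority "relative" to the interest group.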
  • 108. 108 Learning Algorithm  Convergence (to a local minimum):  Guaranteed by an analysis similar to block coordinate descent  Empirically converges in 50 iterations  Mutually enhancing:  The model learning does two things simultaneously and iteratively: (1) co-clustering of attribute objects; (2) authority propagation within groups  Authority/relevance features can be better learned with high-quality groups derived  Good features help derive high-quality groups
  • 109. 109 Experimental Setup  Datasets  DBLP: 137,298 papers; 135,612 authors; 2,639 venues; 29,814 terms; ~2.3M relationships; avg citations/paper: 5.16  PubMed: 100,215 papers; 212,312 authors; 2,319 venues; 37,618 terms; ~3.6M relationships; avg citations/paper: 17.55  Training/validation/test splits  DBLP: train [1996, 2007] (62.23%); validation [2008] (12.56%); test [2009, 2011] (25.21%)  PubMed: train [1966, 2008] (64.50%); validation [2009] (7.81%); test [2010, 2013] (27.69%)
  • 110. 110 Experimental Setup  Feature generation: meta-paths  (P – X – P)^y, where X ∈ {A, V, T} and y ∈ {1, 2, 3}  E.g., (P – X – P)² = P – X – P – X – P; citation links extend these paths, e.g., P – X – P → P and P → P – X – P  Feature generation: similarity measures  PathSim [VLDB'11]: symmetric measure  Random walk-based measure [SDM'12]: can be asymmetric  Combining meta-paths with different measures, in total we have 24 meta path-based relevance features
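PathSim, one of the two measures used here, is computed from the commuting matrix of a symmetric meta-path: sim(i, j) = 2·M[i,j] / (M[i,i] + M[j,j]). A minimal sketch for P–A–P on a toy paper-author matrix:

```python
import numpy as np

# Toy paper x author matrix; the commuting matrix of P-A-P is M = W @ W.T
W = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1]], dtype=float)
M = W @ W.T

def pathsim(i, j):
    # PathSim [VLDB'11]: balances connectivity with each object's "visibility"
    return 2 * M[i, j] / (M[i, i] + M[j, j])
```

Unlike the raw path count, PathSim is 1 exactly for an object with itself and penalizes highly connected hubs, which is why the two measure families give complementary relevance features.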
  • 111. 111 Example Output: Relative Authority Ranking  The relative authority propagation generates meaningful authority scores with respect to different interest groups  Table 6: Top-5 authority venues and authors from two example interest groups derived by ClusCite
Group I (database and information system):
1. VLDB 0.0763 | Hector Garcia-Molina 0.0202
2. SIGMOD 0.0653 | Christos Faloutsos 0.0187
3. TKDE 0.0651 | Elisa Bertino 0.0180
4. CIKM 0.0590 | Dan Suciu 0.0179
5. SIGKDD 0.0488 | H. V. Jagadish 0.0178
Group II (computer vision and multimedia):
1. TPAMI 0.0733 | Richard Szeliski 0.0139
2. ACM MM 0.0533 | Jitendra Malik 0.0122
3. ICCV 0.0403 | Luc Van Gool 0.0121
4. CVPR 0.0401 | Andrew Blake 0.0117
5. ECCV 0.0393 | Alex Pentland 0.0114
  • 112. 112 Experimental Results  Case study on citation behavioral patterns Each paper is assigned to the group with highest group membership score
  • 113. 113 Experimental Results  Quality change of estimated paper relative authority ranking  Figure 3: Correlation between paper relative authority score and # ground truth citations, shown at iterations 1, 5 and 10
  • 114. 114 Performance Comparison: Methods Compared  Compared methods: 1. BM25: content-based 2. PopRank [WWW'05]: heterogeneous link-based authority 3. TopicSim: topic-based similarity by LDA 4. Link-PLSA-LDA [KDD'08]: topic and link relevance 5. Meta-path based relevance: 1. L2-LR [SDM'12, WSDM'12]: logistic regression with L2 regularization 2. RankSVM [KDD'02] 6. MixSim: relevance, topic distribution, PopRank scores, using RankSVM 7. ClusCite: the proposed full-fledged model 8. ClusCite-Rel: with only the relevance component
  • 115. 115 Performance Comparison on DBLP and PubMed  17.68% improvement in Recall@50; 9.57% in MRR, over the best performing compared method, on DBLP  20.19% improvement in Recall@20; 14.79% in MRR, over the best performing compared method, on PubMed  Recommendation performance in terms of Precision, Recall and MRR; the number of interest groups is set to 200 (K = 200) for ClusCite and ClusCite-Rel:
Method | DBLP: P@10, P@20, R@20, R@50, MRR | PubMed: P@10, P@20, R@20, R@50, MRR
BM25 | 0.1260, 0.0902, 0.1431, 0.2146, 0.4107 | 0.1847, 0.1349, 0.1754, 0.2470, 0.4971
PopRank | 0.0112, 0.0098, 0.0155, 0.0308, 0.0451 | 0.0438, 0.0314, 0.0402, 0.0814, 0.2012
TopicSim | 0.0328, 0.0273, 0.0432, 0.0825, 0.1161 | 0.0761, 0.0685, 0.0855, 0.1516, 0.3254
Link-PLSA-LDA | 0.1023, 0.0893, 0.1295, 0.1823, 0.3748 | 0.1439, 0.1002, 0.1589, 0.2015, 0.4079
L2-LR | 0.2274, 0.1677, 0.2471, 0.3547, 0.4866 | 0.2527, 0.1959, 0.2504, 0.3981, 0.5308
RankSVM | 0.2372, 0.1799, 0.2733, 0.3621, 0.4989 | 0.2534, 0.1954, 0.2499, 0.3820, 0.5187
MixSim | 0.2261, 0.1689, 0.2473, 0.3636, 0.5002 | 0.2699, 0.2025, 0.2519, 0.4021, 0.5041
ClusCite-Rel | 0.2402, 0.1872, 0.2856, 0.4015, 0.5156 | 0.2786, 0.2221, 0.2753, 0.4305, 0.5524
ClusCite | 0.2429, 0.1958, 0.2993, 0.4279, 0.5481 | 0.3019, 0.2434, 0.3129, 0.4587, 0.5787
  • 116. 116 Performance Analysis: Number of Interest Groups
  • 117. 117 Performance Analysis: Number of Attributes Each Query Manuscript Has
 Partition the test set into six subsets: Group 1 has the fewest attribute objects on average; Group 6 has the most
 When more attributes are available for the query manuscript, the model makes better recommendations
  • 118. 118 Performance Analysis: Model Generalization
 Divide the test set by year and test how performance changes across the query subsets
 From 2010 to 2013, MixSim drops 16.42%, while ClusCite drops only 7.72%
 The ClusCite framework organizes paper citations into interest groups and designs cluster-based models, yielding paper-specific recommendations
 Experimental results demonstrate significant improvement over state-of-the-art methods
  • 119. 119 Conclusions
 Heterogeneous information networks are ubiquitous
 Most datasets can be "organized" or "transformed" into "structured" multi-typed heterogeneous info. networks
 Examples: DBLP, IMDB, Flickr, Google News, Wikipedia, …
 Surprisingly rich knowledge can be mined from structured heterogeneous info. networks
 Clustering, ranking, classification, path prediction, …
 Knowledge is power, but knowledge is hidden in massive, yet "relatively structured," nodes and links!
 Key issue: construction of trusted, semi-structured heterogeneous networks from unstructured data
 From data to knowledge: much more remains to be explored, but heterogeneous network mining has shown high promise!
  • 120. 120 References
 M. Ji, J. Han, and M. Danilevsky, "Ranking-Based Classification of Heterogeneous Information Networks", KDD'11
 X. Ren, J. Liu, X. Yu, U. Khandelwal, Q. Gu, L. Wang, and J. Han, "ClusCite: Effective Citation Recommendation by Information Network-Based Clustering", KDD'14
 Y. Sun and J. Han, Mining Heterogeneous Information Networks: Principles and Methodologies, Morgan & Claypool Publishers, 2012
 Y. Sun, Y. Yu, and J. Han, "Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema", KDD'09
 Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu, "PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks", VLDB'11
 Y. Sun, R. Barber, M. Gupta, C. Aggarwal, and J. Han, "Co-Author Relationship Prediction in Heterogeneous Bibliographic Networks", ASONAM'11
 Y. Sun, J. Han, C. C. Aggarwal, and N. Chawla, "When Will It Happen? Relationship Prediction in Heterogeneous Information Networks", WSDM'12
 Y. Sun, J. Han, P. Zhao, Z. Yin, H. Cheng, and T. Wu, "RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis", EDBT'09
 X. Yu, X. Ren, Y. Sun, Q. Gu, B. Sturt, U. Khandelwal, B. Norick, and J. Han, "Personalized Entity Recommendation: A Heterogeneous Information Network Approach", WSDM'14
  • 121. 121 From Data to Networks to Knowledge: An Evolutionary Path!
 Han, Kamber, and Pei, Data Mining, 3rd ed., 2011
 Yu, Han, and Faloutsos (eds.), Link Mining: Models, Algorithms and Applications, 2010
 Sun and Han, Mining Heterogeneous Information Networks, 2012 (Y. Sun: SIGKDD'13 Dissertation Award)
 Wang and Han, Mining Latent Entity Structures, 2015 (C. Wang: SIGKDD'13 Dissertation Award)