2015_FIT_Talk.pptx

December 14-16, 2015, Serena Hotel, Islamabad
13th International Conference on Frontiers of Information Technology (FIT), 2015
Multi-View Clustering
Algorithms and Applications
Presented by
Syed Fawad Hussain, PhD
Ghulam Ishaq Khan Institute of Engineering Sciences
and Technology.
Invited Talk, FIT 2015

Outline
1. Introduction
   1. Data generation
   2. Motivation
2. Clustering and Co-Clustering
   1. Traditional Clustering
   2. Co-clustering
3. Multi-View Multi-Dimensional Clustering
   1. Multiview data
   2. Knowledge transfer between views
   3. Experimental results
4. Application Areas of Multi-View Clustering

Information Generation
• A huge amount of information is generated, most of it unstructured: documents, journals, web pages, emails...
• Information is usually generated from different sources
  – Different languages (for web pages)
  – Different feature extractors (e.g. images)
  – Different links (citation data)
  – Different sections (movie data from IMDb)
  – Etc.
1. Introduction

Views
• Data is described by a set of variables/features
  – Words describing documents
  – Keywords describing movies
  – Links describing webpages
  – Actors describing movies
  – Features describing images
  – Sound describing video clips, etc.
• What is a view?
  – A set of features/attributes/variables describing a set of objects/instances.
  – Each view is independent, and individually sufficient for learning

Clustering
• Division of data into groups of ‘similar objects’
• Classical clustering algorithms are based on “similarities” and organize data into classes such that there is
  – high intra-class similarity
  – low inter-class similarity
• Example (matrix entries are squared Euclidean distances between the points):
P1(1,2), P2(2,2), P3(4,5), P4(5,7)
      P1   P2   P3   P4
P1     0    1   18   41
P2     1    0   13   34
P3    18   13    0    5
P4    41   34    5    0
C1 = {P1, P2}
C2 = {P3, P4}
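The distance matrix in the example above can be reproduced in a few lines. This is a minimal sketch that assumes the entries are squared Euclidean distances (which matches the numbers shown). The helper `threshold_groups` and the threshold value of 10 are illustrative assumptions, not part of the talk.

```python
import numpy as np

# Points from the example above.
points = {"P1": (1, 2), "P2": (2, 2), "P3": (4, 5), "P4": (5, 7)}
names = list(points)
X = np.array([points[n] for n in names], dtype=float)

# Pairwise squared Euclidean distances: D[i, j] = ||x_i - x_j||^2.
diff = X[:, None, :] - X[None, :, :]
D = (diff ** 2).sum(axis=2)


def threshold_groups(D, names, threshold):
    """Merge points whose squared distance is below `threshold` (hypothetical helper)."""
    labels = list(range(len(names)))
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if D[i, j] < threshold:
                old, new = labels[j], labels[i]
                labels = [new if lab == old else lab for lab in labels]
    groups = {}
    for name, lab in zip(names, labels):
        groups.setdefault(lab, []).append(name)
    return list(groups.values())


# The threshold is an assumption chosen so that the two groups separate.
clusters = threshold_groups(D, names, threshold=10)
print(D.astype(int))
print(clusters)
```

Any real clustering algorithm would replace the threshold rule. It is used here only because the example is small enough to separate by eye.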

Co-Clustering
• How can we automatically find semantic relationships in the data?
• How do we calculate similarity between documents?
Basic idea:
  – Two documents are similar if they contain similar words
  – Two words are similar if they occur in similar documents
• Solution?
  – Create similarity matrices R (between documents) and C (between words)
  – Iteratively update each of R and C using the other.
Example documents:
d1: Boeing recently unveiled its new B787 aircraft dubbed the “Dreamliner”.
d2: Airbus’ latest A350, a next-generation plane, is due to fly in 2013.

Co-Clustering
[Hussain et al, 2010]
• The algorithm is as follows
  – Step 1: Given A, define R(0) = I, C(0) = I
  – Step 2: for k = 1 to t, do
  – Step 3: Output R(t) and C(t)
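The update formulas of Step 2 are not reproduced in the text above. The following is a minimal sketch of one plausible form of the iteration. It assumes that R is updated as A C Aᵀ and C as Aᵀ R A, each normalized by the product of the corresponding row (or column) sums of A, with the diagonal reset to 1. The normalization, the function name `co_similarity`, and the toy matrix are assumptions for illustration and may differ from the exact formulas in [Hussain et al, 2010].

```python
import numpy as np


def co_similarity(A, t=4):
    """Iteratively build row (R) and column (C) similarity matrices from A.

    Assumes A is non-negative and has no all-zero row or column.
    """
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    R, C = np.eye(m), np.eye(n)            # Step 1: R(0) = I, C(0) = I
    row_sums, col_sums = A.sum(axis=1), A.sum(axis=0)
    NR = 1.0 / np.outer(row_sums, row_sums)  # assumed normalization for R
    NC = 1.0 / np.outer(col_sums, col_sums)  # assumed normalization for C
    for _ in range(t):                      # Step 2: k = 1 .. t
        R_new = (A @ C @ A.T) * NR          # rows compared through similar columns
        C_new = (A.T @ R @ A) * NC          # columns compared through similar rows
        np.fill_diagonal(R_new, 1.0)
        np.fill_diagonal(C_new, 1.0)
        R, C = R_new, C_new
    return R, C                             # Step 3: output R(t), C(t)


# Toy document-by-word matrix: d1 and d3 share no word,
# but both share a word with d2.
A = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1]])
R1, _ = co_similarity(A, t=1)
R2, _ = co_similarity(A, t=2)
print(R1[0, 2], R2[0, 2])
```

In this toy case d1 and d3 have zero similarity after one iteration and a non-zero similarity after two, because the second iteration can reach d3 from d1 through words that were found similar in the first.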

Co-Clustering
• Bipartite graph G = (V1, V2, E)
  – V1 = {d1, d2, …, dm}
  – V2 = {w1, w2, …, wn}
  – E = {Aij : i ∈ V1, j ∈ V2}
• In practice, 4 iterations are enough
• Iteration 1:
  – R(1): Sim(d1,d2), Sim(d1,d3), …
  – C(1): Sim(w1,w2), Sim(w1,w3), …
• Iteration 2:
  – R(2): Sim(d1,d4) via C24 and C34 …
  – …
• Successive iterations correspond to paths of increasing length
[Figure: bipartite graph linking documents d1–d4 to words w1–w6, with edge weights Aij]

Co-Clustering
[Figure: gene clusters; three panels labelled (co-62), (co-42) and (co-63), each plotting expression level on the vertical axis against a horizontal axis running from 0 to 60]
• Colon Cancer dataset
  – 1096 genes
  – 62 tissues: Normal (42) + Tumor (20)
Source: [Hussain, 2011]

Single view vs Multiple views
• Are these “researchers” similar?
  – Are their publication texts similar?
  – Do they often cite the same (group of) authors?
  – Do they often publish in the same venue?
• Are these “movies” similar?
  – Are they described by similar text in their plot?
  – Do they have similar/same actors?
  – Are they described by similar keywords (genre)?
3. Multi-view Clustering

What are the natural groupings in this data?

Single view vs Multiple views
Movie: Titanic
• Movie by actors: Leonardo DiCaprio, Kate Winslet, …
• Movie by plot: ship, iceberg, Europe, voyage, …
• Movie by genre: romantic, tragedy, adventure, …
Source: IMDb

Multi-view data
Movies-by-Actors Matrix
Movies/actors   DiCaprio  Kate  Keanu  Jolie
Titanic         1         1     0      0
Matrix          0         0     1      0
…               …         …     …      …
Movies-by-Keywords Matrix
Movies/plot     ship  iceberg  Sci-fi  murder
Titanic         1     1        0       0
Matrix          0     0        1       1
…               …     …        …       …
Movies-by-Genre Matrix
Movies/genre    romantic  tragedy  war  Sci-fi
Titanic         1         1        0    0
Matrix          0         0        0    1
…               …         …        …    …
The rows (movies) are shared across all views!

Clustering on multiple views
[Diagram: the same set of movies is clustered separately in each view]
• Movies-by-Actors Matrix → Clustering 1 → Intermediate result
• Movies-by-Keywords Matrix → Clustering 2 → Intermediate result
• Movies-by-Genre Matrix → Clustering 3 → Intermediate result
• Intermediate results → Combined Clustering: better than each individual clustering

Multi-View Learning
• SIAM-Similar dataset: 1690 articles published in SIAM J MATRIX ANAL A, SIAM J NUMER ANAL and SIAM J SCI COMPUT.
View        Spectral
Abstract    0.2037
Title       0.2021
Keywords    0.2502
Authors     0.0017
Citation    0.0078
All views combined: Sum 0.630, LMF 0.714
[Tang et al, 2009]

Why does it work?
• The probability of disagreement is bounded by the probability of error in the individual views
• Each view must have complementary information
• A single view is quite sparse (curse of dimensionality)
• The more informative the single views, the better the results.

Multi-view co-clustering
M: a single data view
R: row-row similarity matrix
C: column-column similarity matrix
χ-Sim: co-clustering algorithm
[Hussain et al, 2015]

Experimental setup
Dataset used
Experiments:
• Single view clustering
• Single view co-clustering
• Multi-view co-clustering

Results
The two Multi-View column pairs correspond to the updates R(t+1) = RA(t) and R(t+1) = RB(t).
            Single View        Co-Clustering      Multi-View          Multi-View
                                                  R(t+1) = RA(t)      R(t+1) = RB(t)
Dataset     VA       VB        VA       VB        VA       VB         VA       VB
Cora        0.3209   0.3678    0.6004   0.3109    0.6004   0.7146     0.4453   0.3109
Citeseer    0.2503   0.3489    0.3783   0.3998    0.3783   0.5047     0.5897   0.3998
Cornell     0.3487   0.58974   0.3846   0.6051    0.3846   0.6051     0.4872   0.6051
Movies      0.2561   0.19125   0.2723   0.2253    0.2723   0.2853     0.2771   0.2253
Texas       0.3623   0.4670    0.4813   0.6791    0.4813   0.5508     0.6578   0.6791

Results
Single vs multi-view clustering (bar chart: NMI score per dataset)
Dataset     Single    Co-clustering  Multi-View  % Increase (Multi-View over Single)
Cora        0.3678    0.6004         0.7754      110.82
Citeseer    0.3489    0.3998         0.7135      104.5
Cornell     0.58974   0.6051         0.7231      22.61
Movies      0.2561    0.2723         0.363       41.74
Texas       0.467     0.6791         0.7754      66.04
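The "% Increase" figures can be checked from the chart values. The sketch below assumes the increase is measured as the relative gain of the Multi-View NMI over the Single NMI for each dataset. The dictionary names are illustrative.

```python
# NMI scores read from the chart: single-view clustering vs multi-view.
single = {"Cora": 0.3678, "Citeseer": 0.3489, "Cornell": 0.58974,
          "Movies": 0.2561, "Texas": 0.467}
multi = {"Cora": 0.7754, "Citeseer": 0.7135, "Cornell": 0.7231,
         "Movies": 0.363, "Texas": 0.7754}


def percent_increase(base, new):
    """Relative gain of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base


increase = {name: round(percent_increase(single[name], multi[name]), 2)
            for name in single}
for name, value in increase.items():
    print(f"{name:10s} {value:7.2f} %")
```

Under this reading the computed values agree with the reported row: 110.82, 104.5, 22.61, 41.74 and 66.04.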

Co-Clustering of multi-view data
[Figure: Original Matrix vs Co-Cluster, Cora Dataset]
Mideast Politics  Motorcycles  Baseball  Computer Graphics  Space
Jewish            Ride         Pitching  Graphics           NASA
Israel            Harleys      Players   Image              Flight
Arab              Camping      Season    Color              Shuttle
Palestinian       Bikers       Yankees   Display            Orbital

Success Stories
4. Application Areas of Multi-View Data
• Million-dollar prize
  – Improve the baseline movie recommendation approach of Netflix by 10% in accuracy
  – The top submissions all combine several teams and algorithms as an ensemble

Information Retrieval

IBM’s Watson
• Watson uses a variety of techniques, with deep learning as just one element in a very complicated ensemble of techniques ranging from the statistical technique of Bayesian inference to deductive reasoning.
Example clue: “Keanu Reeves had a Nokia phone, but it took a land line to slip in & out of this, the title of a 1999 sci-fi flick”
Watson: around 6 million rules, access to 10 billion web pages, massively parallel computing power (6000 computers), complex machine learning algorithms.

Self Driving Google Cars
• Have so far driven 300,000 miles without an accident
• An average American has an accident every 165,000 miles
• Use multiple sources of information:
  – Many cameras (for situational awareness)
  – Laser range finder (for other traffic)
  – GPS
  – Google Maps, radar sensor, etc.

Conclusion
• Data is growing at an enormous rate
• Capturing data is easy… using it is not!
5. Conclusion
“There are known knowns i.e. things we know that we know; then there are known unknowns i.e. things we know that we don’t know; and then we have the unknown unknowns i.e. things we do not know that we do not know.”
Donald Rumsfeld
Former US Secretary of Defense

Conclusion
• No Free Lunch theorem
  – No classifier is inherently superior to any other
  – If we make no prior assumption about the nature of the classification task, is any classification method superior overall?
  – Is any algorithm overall superior to random guessing?
  – The answer to both questions is… NO!
• The Ugly Duckling theorem
  – In the absence of assumptions there is no “best” feature representation.
• You need to try a variety of methods, and
• You need to know your data, and
• You need to experiment a bit,
and finally
You need to contact and work with a machine learning expert

References
[Xu et al, 2013] C. Xu, D. Tao and C. Xu, A survey on multi-view learning, arXiv preprint arXiv:1304.5634 (2013).
[Andrew et al, 2013] G. Andrew, R. Arora, J. Bilmes, and K. Livescu, Deep canonical correlation analysis, in Proc. of the 30th Int. Conf. on Machine Learning (ICML 2013), 2013, pp. 1247–1255.
[Tang et al, 2009] W. Tang, Z. Lu and I. Dhillon, Clustering with multiple graphs, in Ninth IEEE International Conference on Data Mining (ICDM'09), IEEE, 2009.
[Wang et al, 2015] W. Wang, R. Arora, K. Livescu, and J. Bilmes, On deep multi-view representation learning, in Proc. of the 32nd Int. Conf. on Machine Learning (ICML 2015), 2015.

References
[Hussain et al, 2010] S.F. Hussain, C. Grimal and G. Bisson, An improved co-similarity measure for document clustering, in Ninth International Conference on Machine Learning and Applications (ICMLA), IEEE, 2010.
[Hussain, 2011] S.F. Hussain, Bi-clustering gene expression data using co-similarity, in Advanced Data Mining and Applications, Springer Berlin Heidelberg, 2011, pp. 190–200.
[Hussain et al, 2015] S.F. Hussain and S. Bashir, Co-clustering of multi-view datasets, Knowledge and Information Systems (2015): 1–26.

Co-Clustering
3. Multi-View Multi-Dimensional Clustering
• Traditional clustering amounts to finding groups in data “under all features/attributes”. In co-clustering (also called bi-clustering), the pattern/behavior is usually observed under “a specified subset of attributes/conditions”
• Preferred
  – When things behave differently under different subsets, e.g. gene expression data
  – To improve clustering results
  – To minimize the effect of the “curse of dimensionality”

Direct multi-view constrained clustering
• Factorize all matrices at the same time under some constraint [Tang et al, 2009], where A(m) is a single view, P is the common factor shared between all graphs, Λ(m) captures the characteristics of each graph, and α is a weighting factor
• Deep Canonical Correlation Analysis [Andrew et al, 2013]
• Deep multi-view representation learning [Wang et al, 2015]
• Survey of Multi-View Clustering [Xu et al, 2013]
2. Techniques for knowledge transfer
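The factorization formula itself is not reproduced in the text. As a hedged sketch, one objective consistent with the description is a sum of reconstruction errors ‖A(m) − P Λ(m) Pᵀ‖² over the views plus a penalty weighted by α on the sizes of Λ(m) and P. The exact form, the factor ½, and the function name `lmf_objective` are assumptions for illustration and are not confirmed by the slide.

```python
import numpy as np


def lmf_objective(views, P, lambdas, alpha):
    """Assumed linked-factorization objective for symmetric views A(m).

    views   : list of (n x n) matrices A(m)
    P       : (n x d) factor shared by all views
    lambdas : list of (d x d) matrices, one per view
    alpha   : weight of the regularization term
    """
    # Reconstruction error of every view from the shared factor P.
    fit = sum(np.linalg.norm(A - P @ L @ P.T, "fro") ** 2
              for A, L in zip(views, lambdas))
    # Penalty on the size of the per-view and shared factors.
    reg = sum(np.linalg.norm(L, "fro") ** 2 for L in lambdas)
    reg += np.linalg.norm(P, "fro") ** 2
    return 0.5 * fit + 0.5 * alpha * reg


# Toy check: with P = I and Lambda = A the reconstruction is exact.
A = np.array([[0.0, 1.0], [1.0, 0.0]])
exact = lmf_objective([A], np.eye(2), [A], alpha=0.0)
penalized = lmf_objective([A], np.eye(2), [A], alpha=1.0)
print(exact, penalized)
```

Minimizing such an objective over P and all Λ(m) jointly is what ties the views together: P must explain every view at once, while each Λ(m) absorbs what is specific to its own view.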

Clustering on multiple views
[Diagram: the same set of movies is clustered separately in each view]
• Movies-by-Actors Matrix → Clustering 1 → Intermediate result
• Movies-by-Keywords Matrix → Clustering 2 → Intermediate result
• Movies-by-Genre Matrix → Clustering 3 → Intermediate result

Using Intermediate Integration
• Combine information between views at the intermediate step
  – Combine intermediate results (e.g. similarity matrices) from the views
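A minimal sketch of intermediate integration, assuming the intermediate result of each view is a cosine similarity matrix between rows and that views are combined by a weighted average. Both choices, the function names, and the third movie row are illustrative assumptions. The Titanic and Matrix rows follow the movie matrices shown earlier in the talk.

```python
import numpy as np


def cosine_similarity_matrix(X):
    """Row-by-row cosine similarity of one view."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid division by zero for empty rows
    U = X / norms
    return U @ U.T


def combine_views(views, weights=None):
    """Weighted average of the per-view similarity matrices."""
    sims = [cosine_similarity_matrix(V) for V in views]
    if weights is None:
        weights = [1.0 / len(sims)] * len(sims)
    return sum(w * S for w, S in zip(weights, sims))


# Rows: Titanic, Matrix, and a hypothetical third movie.
actors = [[1, 1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 0]]
genre = [[1, 1, 0, 0], [0, 0, 0, 1], [1, 0, 0, 0]]
S = combine_views([actors, genre])
print(np.round(S, 3))
```

The combined matrix S can then be passed to any similarity-based clustering algorithm in place of a single-view similarity matrix.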

Using Late Integration
• Combine information between views at the final (prediction) step
  – Given 2 views of the data, X(1) and X(2)
  – Cluster the views to generate two predictions P(1) and P(2)
  – Use P(1) as a training label for the next iteration of X(2), and vice versa
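The alternating scheme can be sketched as follows. This is a simplified illustration, not the method of any cited paper. It assumes a nearest-centroid rule as the stand-in learner, starts from a rough initial labelling of view 1 only, and alternates the roles of the two views. The function names and the toy data are hypothetical.

```python
import numpy as np


def nearest_centroid_labels(X, labels, k):
    """Fit one centroid per label on X, then relabel rows by nearest centroid."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    centroids = np.vstack([
        # Fall back to an arbitrary row if a label is unused.
        X[labels == c].mean(axis=0) if np.any(labels == c) else X[c % len(X)]
        for c in range(k)
    ])
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


def late_integration(X1, X2, init_labels, k, n_iter=5):
    """Alternate: predictions from one view act as labels for the other."""
    P1 = nearest_centroid_labels(X1, init_labels, k)  # initial clustering of view 1
    P2 = P1
    for _ in range(n_iter):
        P2 = nearest_centroid_labels(X2, P1, k)  # P(1) supervises X(2)
        P1 = nearest_centroid_labels(X1, P2, k)  # P(2) supervises X(1)
    return P1, P2


# Two toy views of the same four objects.
X1 = [[0, 0], [0, 1], [5, 5], [5, 6]]
X2 = [[1, 0], [1, 1], [9, 9], [9, 8]]
P1, P2 = late_integration(X1, X2, init_labels=[0, 1, 0, 1], k=2)
print(P1, P2)
```

On this toy data the two views settle on the same grouping even though the initial labels are poor, which is the behaviour the alternating scheme aims for.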
