Linked Open Data for ACademia (LODAC), together with the National Museum of Nature and Science, has started collecting linked data on interspecies interactions and predicting links to guide future observations. The initial data is very sparse and disconnected, which makes it difficult to predict potential missing links using collaborative filtering alone. In this paper, we introduce Link Prediction on Interspecies Interaction (LPII), which addresses this problem with a hybrid recommendation approach. Our prediction model combines three scoring functions that take into account collaborative filtering, community structure, and biological classification. We found LPII to be more accurate than other combinations of prediction models. Using statistical significance testing, we demonstrate that each scoring function is important and plays a different role depending on the conditions of the linked data. This suggests that LPII can be applied to other real-world link prediction problems.
Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach
1. Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach
Hideaki TAKEDA, Professor
Tsuyoshi HOSOYA, Mycologist
Rathachai CHAWUTHAI, rathachai.c@gmail.com
JIST 2014, Chiang Mai, Thailand, November 10th, 2014
2. LODAC: Linked Open Data for ACademia
[Figure: RDF example. The resource lodac:Salix_pierotii (label “Salix pierotii”) is linked to lodac:Salix through the property species:hasSuperTaxon.]
3. National Museum of Nature and Science
30,000 Interactions
4,000 Fungi
7,000 Hosts
4. Let’s find the Missing Links between species
LPII: Link Prediction on Interspecies Interactions
Objective: To predict missing links between fungi and hosts
8. Selected data: 903 Rust Fungi, 2,001 Hosts, 2,966 Links
[Figure: the selected fungus-host links, drawn between the Biological Classification of Fungi and the Biological Classification of Hosts.]
9. Introduction: DATA PREPARATION and LPII APPROACH
[Figure: overview of the pipeline. The Fungus-Host Interaction Dataset is transformed using a Weight Function. Three scores are computed (1: Collaborative Filtering, 2: Community Structure, 3: Biological Classification) and combined (4) to generate the RESULT: a list of Fungus-Host interactions with predictive scores. A BIOLOGIST uses this list when making observations and finding missing links.]
10. Collaborative Filtering (step 1)
Some fungi found at the same host are common neighbors.
If some close neighbors of the fungus f are found at a host h, the fungus f may be found at the host h.
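The transcript does not preserve the exact PCF formula, so the following is only a minimal sketch of the collaborative filtering idea stated above. It assumes that the closeness of two fungi is the Jaccard similarity of their host sets, and that the score is the similarity-weighted share of neighbors found at the host. Both choices, the function names, and the toy data are hypothetical and are not taken from the slides.

```python
def jaccard(a, b):
    """Jaccard similarity of two host sets (an assumed weight function)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pcf(fungus, host, hosts_of):
    """Hypothetical collaborative filtering score for the pair (fungus, host).

    Sums the similarity of every neighbor fungus that is found at `host`
    and divides by the total similarity of all neighbors.
    """
    own_hosts = hosts_of[fungus]
    found_weight = 0.0
    total_weight = 0.0
    for other, other_hosts in hosts_of.items():
        if other == fungus:
            continue
        weight = jaccard(own_hosts, other_hosts)
        if weight == 0.0:
            continue  # shares no host with `fungus`, so it is not a neighbor
        total_weight += weight
        if host in other_hosts:
            found_weight += weight
    return found_weight / total_weight if total_weight else 0.0


# Toy data (hypothetical): which hosts each fungus has been observed on.
hosts_of = {
    "f1": {"h1", "h2"},
    "f2": {"h1", "h2", "h3"},
    "f3": {"h2", "h4"},
    "f4": {"h5"},
}

score_f1_h3 = pcf("f1", "h3", hosts_of)
score_f4_h1 = pcf("f4", "h1", hosts_of)
print(round(score_f1_h3, 3), score_f4_h1)
```

In this toy data f4 shares no host with any other fungus, so its score is zero for every host. This is the sparse-data situation that the other two scoring functions are meant to address.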
19. Community Structure
[Figure: Projection of Fungi f1 to f5 (edge weights such as 0.50 and 0.33), grouped into Community #1, Community #2 and Community #3, with links to hosts h1 to h5.]
\[
\mathrm{PCS}(f,h) = \frac{\text{Number of links between the community of the fungus } f \text{ and the host } h}{\text{Number of all links given by the community of the fungus } f}
\]
\[
\mathrm{PCS}(f_3,h_1) = \frac{2}{5} = 0.40
\]
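The PCS ratio above can be sketched in a few lines. The edge list and the community assignment below are hypothetical: the actual graph on the slide is not recoverable from the transcript, so the data was chosen only so that the example reproduces PCS(f3, h1) = 2/5. The function name `pcs` is also an assumption.

```python
def pcs(fungus, host, links, community_of):
    """Share of the links of the fungus's community that go to `host`."""
    members = {f for f, c in community_of.items() if c == community_of[fungus]}
    community_links = [(f, h) for f, h in links if f in members]
    if not community_links:
        return 0.0
    links_to_host = sum(1 for _, h in community_links if h == host)
    return links_to_host / len(community_links)


# Hypothetical fungus-host links and community labels (not the slide's data).
links = [
    ("f1", "h1"), ("f2", "h1"), ("f2", "h2"),
    ("f3", "h2"), ("f3", "h3"),
    ("f4", "h4"), ("f5", "h5"),
]
community_of = {"f1": 1, "f2": 1, "f3": 1, "f4": 2, "f5": 3}

pcs_f3_h1 = pcs("f3", "h1", links, community_of)
print(pcs_f3_h1)
```

Here the community of f3 contributes five links, two of which reach h1, even though f3 itself has never been observed on h1.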
20. How to deal with many very small communities?
21. Biological Classification (step 3)
If a host h is commonly found in the biological classification of the fungus f, the fungus f may be found at the host h.
22. BIOLOGICAL CLASSIFICATION (TAXONOMY)
Classification: Example
Domain: Eukaryota
Kingdom: Fungi
Phylum: Basidiomycota
Class: Urediniomycetes
Order: Uredinales
Family: Melampsoraceae
Genus: Melampsora
Species: Melampsora yezoensis
23. Biological Classification
[Figure: fungi f1 to f5 grouped by Biological Classification into G1 and G2, with links to hosts h1 to h5.]
\[
\mathrm{PBC}(f,h) = \frac{\text{Number of links between the biological classification of the fungus } f \text{ and the host } h}{\text{Number of all links given by the biological classification of the fungus } f}
\]
\[
\mathrm{PBC}(f_4,h_2) = \frac{1}{4} = 0.25
\]
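The PBC ratio has the same shape as the community score, with the taxonomic group of the fungus taking the place of its detected community. The sketch below assumes that G1 and G2 are genus-level groups; the grouping and the edge list are hypothetical and were chosen only so that the example reproduces PBC(f4, h2) = 1/4.

```python
from collections import Counter


def pbc(fungus, host, links, group_of):
    """Share of the links of the fungus's taxonomic group that go to `host`."""
    group = group_of[fungus]
    # Count, per host, the links contributed by fungi in the same group.
    host_counts = Counter(h for f, h in links if group_of[f] == group)
    total = sum(host_counts.values())
    return host_counts[host] / total if total else 0.0


# Hypothetical grouping and links (not the slide's data).
group_of = {"f1": "G1", "f2": "G1", "f3": "G1", "f4": "G2", "f5": "G2"}
links = [
    ("f1", "h1"), ("f2", "h1"), ("f3", "h2"),
    ("f4", "h3"), ("f4", "h4"), ("f5", "h2"), ("f5", "h5"),
]

pbc_f4_h2 = pbc("f4", "h2", links, group_of)
print(pbc_f4_h2)
```

Because the grouping comes from the taxonomy rather than from the interaction graph, this score stays defined for a fungus whose community is very small.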
26. Learning and Testing
All Possible Links: Existent Links and Missing Links
Training set (2,500 links)
Test set (500 links)
Candidates (400,000 links)
[Figure: three bipartite graphs of fungi f1 to f5 and hosts h1 to h5 illustrating these sets, with example predictive scores on the links.]
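A minimal sketch of how such sets could be built from a list of observed links. The random hold-out, the fixed seed, and the function names are assumptions, and the sketch treats every unobserved fungus-host pair as a missing link; how the talk selected its 400,000 candidates is not stated in the transcript.

```python
import random
from itertools import product


def split_links(links, test_size, seed=0):
    """Randomly hold out `test_size` existent links as the test set."""
    shuffled = sorted(links)  # sort first so the split is reproducible
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]


def missing_links(fungi, hosts, existent):
    """All fungus-host pairs that are not among the existent links."""
    existent = set(existent)
    return [(f, h) for f, h in product(fungi, hosts) if (f, h) not in existent]


# Toy data (hypothetical).
fungi = ["f1", "f2", "f3", "f4", "f5"]
hosts = ["h1", "h2", "h3", "h4", "h5"]
existent = [
    ("f1", "h1"), ("f1", "h2"), ("f2", "h2"),
    ("f3", "h3"), ("f4", "h4"), ("f5", "h5"),
]

train, test = split_links(existent, test_size=2)
missing = missing_links(fungi, hosts, existent)
print(len(train), len(test), len(missing))
```

The scoring functions would then be fitted on the training links only, so that the held-out test links can be compared against the missing links.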
27. AUC: Area Under the receiver operating characteristic Curve
Predicted List #1 (sorted by predictive score), High AUC:
① PII( f1,h2 ) = 0.70
② PII( f2,h3 ) = 0.60
③ PII( f1,h3 ) = 0.50
④ PII( f4,h3 ) = 0.40
⑤ PII( f2,h2 ) = 0.30
⑥ PII( f3,h3 ) = 0.20
⑦ PII( f4,h3 ) = 0.10
Predicted List #2 (sorted by predictive score), Low AUC:
① PII( f1,h2 ) = 0.70
② PII( f2,h2 ) = 0.60
③ PII( f3,h3 ) = 0.50
④ PII( f2,h3 ) = 0.50
⑤ PII( f1,h3 ) = 0.40
⑥ PII( f4,h3 ) = 0.30
⑦ PII( f4,h3 ) = 0.10
For n comparisons,
• n' is the number of times the test links have a higher score than the missing links.
• n" is the number of times the test links have the same score as the missing links.
(On the slide, red scores mark the test links.)
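The slide defines n, n' and n" but the formula itself does not survive in the transcript. The sketch below assumes the definition commonly used in link prediction, AUC = (n' + 0.5 n") / n, and compares every test link with every missing link instead of sampling n random pairs. The scores in the example are hypothetical, because the red marking that identified the test links on the slide is lost.

```python
def auc(test_scores, missing_scores):
    """AUC from pairwise comparisons of test-link and missing-link scores.

    Assumes AUC = (n' + 0.5 * n'') / n, where n is the number of comparisons,
    n' counts comparisons won by the test link and n'' counts ties.
    """
    n = n_higher = n_same = 0
    for test_score in test_scores:
        for missing_score in missing_scores:
            n += 1
            if test_score > missing_score:
                n_higher += 1
            elif test_score == missing_score:
                n_same += 1
    if n == 0:
        raise ValueError("need at least one test link and one missing link")
    return (n_higher + 0.5 * n_same) / n


# Hypothetical scores: every test link outranks every missing link.
high = auc([0.70, 0.60], [0.50, 0.40, 0.30])
# Hypothetical scores: one tie and one win for a single test link.
tied = auc([0.50], [0.50, 0.40])
print(high, tied)
```

An AUC of 0.5 corresponds to a ranking that is no better than chance, so a predictor is useful to the extent that its AUC exceeds 0.5.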
29. Overall: DATA PREPARATION and LPII APPROACH
[Figure: detailed flowchart of the pipeline.
DATA PREPARATION: RDF data of Interspecies Interactions is obtained from the LOD Cloud by SPARQL querying and forms a Bipartite Graph. Connected fungi are selected and the data is transformed using a Weight Function into a Projection of Fungi.
LPII APPROACH: the fungi are clustered using a Community Detection Method (a third party method) and using Biological Classification. These are the inputs of the Scoring Functions PCF (Collaborative Filtering), PCS (Community Structure) and PBC (Biological Classification), which a linear operation combines: PII( f,h ) = PCF( f,h ) + PCS( f,h ) + PBC( f,h ). Missing Links are found and the predictions are ranked in decreasing order.
RESULT: Predicted Missing Links of Fungus-Host together with prediction scores. A DOMAIN EXPERT makes observations; if a link is found, the knowledgebase is updated.
Legend (NOTE): Data, Process, Third party method, Scoring Function, Input argument, Linear Operation, Decision, Dataflow.]
30. Hybrid Recommender Approach
\[
\mathrm{PII}(f,h) = \alpha\,\mathrm{PCF}(f,h) + \beta\,\mathrm{PCS}(f,h) + \gamma\,\mathrm{PBC}(f,h)
\]
γ should be very low, at about 0.1 to 0.2.
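A minimal sketch of the weighted combination and of the ranking step, assuming the three scores have already been computed for each candidate pair. Only the guidance that γ should be about 0.1 to 0.2 comes from the slide; the values of α and β, the candidate scores, and the function names are hypothetical.

```python
def pii(pcf_score, pcs_score, pbc_score, alpha, beta, gamma):
    """Weighted linear combination of the three scoring functions."""
    return alpha * pcf_score + beta * pcs_score + gamma * pbc_score


def rank_candidates(candidates, alpha=0.5, beta=0.4, gamma=0.1):
    """Rank candidate links by PII in decreasing order.

    `candidates` maps (fungus, host) to a (PCF, PCS, PBC) tuple.
    The default alpha and beta are illustrative assumptions.
    """
    scored = [
        (pair, pii(*scores, alpha=alpha, beta=beta, gamma=gamma))
        for pair, scores in candidates.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical component scores for two candidate links.
candidates = {
    ("f4", "h2"): (0.0, 0.0, 0.25),  # no neighbors: only PBC contributes
    ("f1", "h2"): (0.6, 0.4, 0.25),
}

ranking = rank_candidates(candidates)
for pair, score in ranking:
    print(pair, round(score, 3))
```

With a small γ, the biological classification score mostly matters for pairs where the other two scores are zero, such as the first candidate here.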
31. Conclusion
Informatics
• RDF Model for Interspecies Interaction
• Improves the use of collaborative filtering on a sparse dataset using
  • Community Structure
  • and Biological Classification
• It has been found that
  • In the general case, PCF + PCS is enough.
  • But when a node has only a few common neighbors and is located in a small community, PBC becomes a key player for making link predictions.
Biology
• This model supports the view that most fungi under the same genus have similar parasitic behavior.
• Some predicted links with high predictive scores, such as
  • Phragmidium mucronatum on ハマナス (hamanasu, Rosa rugosa)
  • Phragmidium fusiforme on ハマナス (hamanasu, Rosa rugosa)
  • Phragmidium potentillae on イワキンバイ (iwakinbai, a Potentilla species)
  have been found in other literature.
• The next enhancement is to analyze fungal species in terms of fungal spore types.
33. Overall: DATA PREPARATION and LPII APPROACH
[Figure: the flowchart of slide 29, annotated with notation. The Bipartite Graph is GBipt, including the existent links LExist. The projection of fungi is NFungi-Projection or GProjFungi. The Missing Links are LMiss. Clustering uses a Community Detection Method and Biological Classification. The linear operation that combines PCF( f,h ), PCS( f,h ) and PBC( f,h ) into PII( f,h ) is weighted by α, β and γ.]