Anthony Coutant, Philippe Leray, Hoel Le Capitaine
DUKe (Data, User, Knowledge) Team, LINA
26th June, 2014
Learning Probabilistic Relational Models using Non-Negative Matrix Factorization
7ème Journées Francophones sur les Réseaux Bayésiens et les Modèles Graphiques Probabilistes
Context
• Probabilistic Relational Models (PRM)
– Attribute uncertainty in relational datasets
• Relational datasets: attributes + links
• PRMs with Reference Uncertainty (PRM-RU) model link uncertainty
• Partitioning individuals is necessary in PRM-RU
Problem & Proposal
• PRM-RU partitions individuals based on attributes only
• We propose to cluster the relationship information instead
• We show that:
– Attribute partitioning does not explain all relationships
– Relational partitioning can explain attribute-oriented relationships
Flat datasets – Bayesian Networks
• Individuals are assumed i.i.d.
[Figure: Bayesian network Grade 1 (G1) → Ranking (R) ← Grade 2 (G2)]
Dataset
G1  G2  R
A   B   1st
B   A   1st
B   B   2nd
B   B   2nd
P(G1)
A     B
0.25  0.75
P(G2)
A     B
0.25  0.75
P(R | G1, G2)   A,A   A,B   B,A   B,B
1st division    0.8   0.5   0.5   0.2
2nd division    0.2   0.5   0.5   0.8
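The tables on this slide can be checked with a few lines of Python. This is a minimal sketch: the marginals P(G1) and P(G2) are estimated by counting over the four-row dataset, while the table P(R | G1, G2) is copied from the slide rather than estimated. The names `DATASET`, `P_R`, `marginal` and `joint` are illustrative and not from the talk.

```python
from collections import Counter

# Toy dataset from the slide: one (G1, G2, R) triple per individual.
DATASET = [("A", "B", "1st"), ("B", "A", "1st"), ("B", "B", "2nd"), ("B", "B", "2nd")]

# P(R | G1, G2) copied from the slide (not re-estimated from the 4 rows).
P_R = {
    ("A", "A"): {"1st": 0.8, "2nd": 0.2},
    ("A", "B"): {"1st": 0.5, "2nd": 0.5},
    ("B", "A"): {"1st": 0.5, "2nd": 0.5},
    ("B", "B"): {"1st": 0.2, "2nd": 0.8},
}

def marginal(column):
    """Maximum-likelihood estimate of P(G1) (column 0) or P(G2) (column 1)."""
    counts = Counter(row[column] for row in DATASET)
    return {value: n / len(DATASET) for value, n in counts.items()}

def joint(g1, g2, r):
    """P(G1, G2, R) = P(G1) * P(G2) * P(R | G1, G2) for the network G1 -> R <- G2."""
    return marginal(0)[g1] * marginal(1)[g2] * P_R[(g1, g2)][r]
```

Counting gives the same 0.25 / 0.75 split as the P(G1) and P(G2) tables, and `joint` applies the factorization implied by the i.i.d. assumption to one individual at a time.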
Relational datasets – Relational schema
Schema
• Student (Intelligence, Ranking)
• Registration (Grade, Satisfaction), linked to Student and to Course with cardinalities 1,n / 1
• Course (Difficulty, Evaluation)
Instance
• Student Jane Doe: Intelligence = high, Ranking = 1st division
• Registration #4563: Grade = A, Satisfaction = high
• Course Phil101: Difficulty = high, Evaluation = high
Probabilistic Relational Models (PRM)
Schema
• Student (Intelligence, Ranking), Registration (Grade, Satisfaction), Course (Difficulty, Evaluation), with 1,n / 1 links
PRM
• Dependency structure over Course.Difficulty, Course.Evaluation, Registration.Grade, Registration.Satisfaction, Student.Intelligence and Student.Ranking, with MEAN aggregators
P(R | MEAN(G))   A     B
1st division     0.8   0.2
2nd division     0.2   0.8
Instance
• Courses: Math, Phil
• Registrations: #4563, #5621, #6251
• Students: Jane Doe (JD), John Smith (JS)
• All attribute values unknown (???)
GBN (Ground Bayesian Network)
• Nodes: Math.Diff, Math.Eval, Phil.Diff, Phil.Eval, #4563.Grade, #5621.Grade, #6251.Grade, #4563.Satis, #5621.Satis, #6251.Satis, JD.Int, JS.Int, JD.Rank, JS.Rank, plus MEAN aggregation nodes
Uncertainty in Relational datasets
[Figure: two copies of the instance (Course Phil101, Registration #4563, Student Jane Doe); most attribute values are unknown (???), with Registration.Grade = A and Course.Evaluation = high observed]
• Attribute uncertainty (PRM)
• Attribute and link uncertainty (PRM extensions)
PRM with reference uncertainty
• Reference uncertainty: P(r.Course = ci, r.Student = sj | r.exists = true)
• A random variable for each individual id? Not generalizable
• Solution: partitioning
[Figure: Course (Difficulty, Evaluation), Registration, Student (Intelligence, Ranking); open questions: P(Student | Course.Difficulty)? P(Course)?]
PRM with reference uncertainty (continued)
• P(Student | ClusterStudent) follows a uniform law
[Figure: Registration holds the selector variables ClusterCourse → Course and ClusterStudent → Student; Course (Difficulty, Evaluation), Student (Intelligence, Ranking)]
P(CStudent | S.Intelligence)   low   high
C1                             0     1
C2                             1     0
P(Student | CStudent)   C1   C2
s1                      0    1
s2                      1    0
Partition function: Intelligence = high → C1, Intelligence = low → C2
Students population stats: 50% / 50%
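The partition-then-select mechanism above can be sketched in a few lines. This is only an illustration under stated assumptions: the partition function (high → C1, low → C2) follows the slide's table, the student `s3` is hypothetical and added to make the uniform law visible, and the cluster prior passed to `reference_probability` is an unconditional stand-in for the cluster selector distribution of the actual model.

```python
from collections import defaultdict

# Deterministic partition function read off the slide's table:
# Intelligence = high -> C1, Intelligence = low -> C2.
PARTITION = {"high": "C1", "low": "C2"}

def student_given_cluster(intelligence_by_student):
    """Build P(Student | ClusterStudent), uniform over the members of each cluster."""
    members = defaultdict(list)
    for student, intelligence in intelligence_by_student.items():
        members[PARTITION[intelligence]].append(student)
    return {
        cluster: {student: 1.0 / len(students) for student in students}
        for cluster, students in members.items()
    }

def reference_probability(student, cluster_prior, selector):
    """P(Student) = sum over clusters of P(Cluster) * P(Student | Cluster)."""
    return sum(
        cluster_prior[cluster] * dist.get(student, 0.0)
        for cluster, dist in selector.items()
    )

# s1 and s2 match the slide's table; s3 is a hypothetical extra student.
students = {"s1": "low", "s2": "high", "s3": "low"}
selector = student_given_cluster(students)
# Hypothetical unconditional cluster prior (assumption, not the model's CPD).
prior = {"C1": 0.5, "C2": 0.5}
```

Because the model only needs one distribution per cluster rather than one random variable per individual id, adding a student changes the uniform weights inside a cluster but not the structure.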
Attribute-oriented Partition Functions in PRM-RU
• PRM-RU: clustering from attributes
• Assumption: attributes explain the relationship
• Not generalizable: relationship information is not used for partitioning
[Figure: Course–Student links, nodes colored by attribute value]
• Do attributes explain the relationship? YES when P(Green | Red) = 1 and P(Purple | Blue) = 1
• Is that always so? NO when P(Green | Red) = 0.5 and P(Purple | Red) = 0.5
Relationship-oriented Partitioning
• Objective: find a partitioning that maximizes intra-partition edges
[Figure: Course–Student links grouped into partitions p1 and p2]
• Attribute view: P(Green | Red) = 0.5, P(Purple | Red) = 0.5
• Relational partitions: P(Student.p1 | Course.p1) = 1, P(Student.p2 | Course.p2) = 1
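The contrast between the two partitionings can be reproduced on a toy graph. The sketch below is an assumption-laden illustration: the two courses, two students and two edges are hypothetical, chosen so that the attribute clusters give the 0.5 / 0.5 table and the relational clusters give the deterministic one. `cluster_conditional` is an illustrative helper name.

```python
from collections import Counter, defaultdict

def cluster_conditional(edges, course_cluster, student_cluster):
    """Estimate P(student cluster | course cluster) by counting edges."""
    counts = defaultdict(Counter)
    for course, student in edges:
        counts[course_cluster[course]][student_cluster[student]] += 1
    return {
        cc: {sc: n / sum(row.values()) for sc, n in row.items()}
        for cc, row in counts.items()
    }

# Hypothetical toy instance: each course is linked to exactly one student.
edges = [("c1", "s1"), ("c2", "s2")]

# Attribute-based clusters: both courses share the attribute value "Red".
by_attribute = cluster_conditional(
    edges, {"c1": "Red", "c2": "Red"}, {"s1": "Green", "s2": "Purple"}
)

# Relationship-based clusters: linked individuals share a partition.
by_relation = cluster_conditional(
    edges, {"c1": "p1", "c2": "p2"}, {"s1": "p1", "s2": "p2"}
)
```

With the attribute clusters the course cluster says nothing about which student cluster is linked, whereas the relational clusters put every edge inside a partition, which is the objective stated on the slide.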
Experiments – Protocol – Dataset generation
Schema
• Entity 1 (Att 1, …, Att n) and Entity 2 (Att 1, …, Att n), linked through relationship R (cardinalities 1,n / 1 on each side)
Instance
• Two generated cases for the Entity 1 – R – Entity 2 links:
– Attribute partitioning favorable case
– Relationship partitioning favorable case
Experiments – Protocol – Learning
[Figure: Relation with selector variables CE1 → E1 and CE2 → E2; Entity 1 (Att 1 … Att n), Entity 2 (Att 1 … Att n)]
• Parameter learning on a fixed, predefined structure
• Two PRMs compared:
– One with attribute partitioning
– One with relational partitioning
Experiments – Protocol – Evaluation
• For each generated dataset D
– Split D into 10 subsets {D1, …, D10}
– Perform 10-fold cross-validation, each fold using one Di for testing and the others for training
• For the PRM with attribute partitioning: store the 10 log-likelihoods in PattsLL[i]
• For the PRM with relationship partitioning: store the 10 log-likelihoods in PrelLL[i]
– Compute the mean and standard deviation of PattsLL[i] and PrelLL[i]
– Evaluate the significance of relationship partitioning over attribute partitioning
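The protocol above can be sketched as follows. The slides do not say which significance test was used, so the paired t statistic here is an assumption chosen because both models are scored on the same folds. The helper names `k_fold_indices` and `summarize` are hypothetical, and the per-fold log-likelihoods are expected to come from the two learned PRMs.

```python
from math import sqrt
from statistics import mean, stdev

def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists; fold i tests on items i, i + k, i + 2k, ..."""
    indices = list(range(n_items))
    for i in range(k):
        test = indices[i::k]
        train = [j for j in indices if j % k != i]
        yield train, test

def summarize(patts_ll, prel_ll):
    """Mean and sd per model, plus a paired t statistic on per-fold differences.

    The paired t statistic is an assumption: the talk does not name its test.
    """
    diffs = [rel - att for att, rel in zip(patts_ll, prel_ll)]
    sd_diff = stdev(diffs)
    t_stat = mean(diffs) / (sd_diff / sqrt(len(diffs))) if sd_diff > 0 else None
    return {
        "atts": (mean(patts_ll), stdev(patts_ll)),
        "rel": (mean(prel_ll), stdev(prel_ll)),
        "t": t_stat,
    }
```

A positive `t` favors relationship partitioning; comparing it against a Student t distribution with 9 degrees of freedom would give the significance level for 10 folds.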
Experiments – Results
[Figure: two result grids, one colored cell per (n, k) with n ∈ {25, 50, 100, 200} and k ∈ {2, 4, 16}]
• Panel: Random clusters (independent from attributes)
• Panel: Attributes => Cluster (fully dependent on attributes)
• Legend:
– Green: Relational > Attribute partitioning
– Red: Attribute > Relational partitioning
– Orange: Partitionings not significantly comparable
Experiments – About the NMF choice for partitioning
• NMF
– Finds low-dimensional factor matrices whose product approximates the original matrix
– A relationship between two entities can be represented as an adjacency matrix
• Motivation for using NMF
– (Restrictively) captures latent information from both rows and columns: co-clustering
– Several extensions dedicated to more accurate co-clustering (NMTF)
– Extensions for Laplacian regularization
• Allow capturing both attribute and relationship information for clustering
– Extensions for tensor factorization
• Allow modeling n-ary relationships, n >= 2
– NMF = good starting choice for the long-term needs?
• But
– Performance problems in the experiments
– Very sensitive to initialization: crashes whenever it reaches a singular state
– Moving toward large-scale methods: graph-based relational clustering?
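The co-clustering use of NMF described on these slides can be sketched in pure Python. This is a minimal illustration, not the authors' implementation: it assumes the standard multiplicative updates for the squared Frobenius error, a hypothetical 4 × 4 block-structured adjacency matrix `ADJ`, and cluster assignment by the largest factor entry. The small `eps` in the denominators is an added guard against division by zero.

```python
import random

def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - W H."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rwh in zip(v, wh) for x, y in zip(rv, rwh))

def nmf(v, rank, n_iter=300, seed=0, eps=1e-9):
    """Factor V (n x m) into non-negative W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h

def co_clusters(w, h):
    """Assign each row and each column to the factor with the largest weight."""
    rows = [max(range(len(r)), key=r.__getitem__) for r in w]
    cols = [max(range(len(h)), key=lambda k: h[k][j]) for j in range(len(h[0]))]
    return rows, cols

# Hypothetical adjacency matrix: rows = Entity 1 individuals, columns = Entity 2.
ADJ = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
W, H = nmf(ADJ, rank=2)
row_labels, col_labels = co_clusters(W, H)
```

The result depends on the random initialization, which matches the sensitivity noted above; fixing `seed` only makes a single run reproducible.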
Conclusion
• PRM-RU defines a probabilistic structure over relational datasets
• Partitioning is needed
• PRM-RU uses attribute-oriented partitioning
• We propose to cluster the relationship information instead
• Experiments show that:
– Attribute partitioning does not explain all relationships
– Relational partitioning can explain attribute-oriented relationships
Perspectives
• Experiments on real-life datasets
• Towards large-scale partitioning methods
• PRM-RU structure learning using clustering algorithms
• What about other link uncertainty representations?
Questions?
(anthony.coutant | philippe.leray | hoel.lecapitaine)@univ-nantes.fr

High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 

Learning Probabilistic Relational Models using Non-Negative Matrix Factorization

  • 1. Learning Probabilistic Relational Models using Non-Negative Matrix Factorization. Anthony Coutant, Philippe Leray, Hoel Le Capitaine, DUKe (Data, User, Knowledge) Team, LINA. 26th June, 2014. 7ème Journées Francophones sur les Réseaux Bayésiens et les Modèles Graphiques Probabilistes
  • 2. Context • Probabilistic Relational Models (PRM) – Attribute uncertainty in relational datasets • Relational datasets: attributes + links • PRM with Reference Uncertainty (PRM-RU) models link uncertainty • Partitioning individuals is necessary in PRM-RU
  • 3. Problem & Proposal • PRM-RU partitions individuals based on attributes only • We propose to cluster the relationship information instead • We show that: – Attribute partitioning does not explain all relationships – Relational partitioning can explain attribute-oriented relationships
  • 4. Flat datasets – Bayesian Networks • Individuals are assumed i.i.d. Dataset (G1, G2, R): (A, B, 1st); (B, A, 1st); (B, B, 2nd); (B, B, 2nd). P(G1): A = 0.25, B = 0.75. P(G2): A = 0.25, B = 0.75. P(R | G1, G2) for (G1, G2) = (A,A), (A,B), (B,A), (B,B): 1st division = 0.8, 0.5, 0.5, 0.2; 2nd division = 0.2, 0.5, 0.5, 0.8. Network nodes: Grade 1, Grade 2, Ranking
  • 5. Relational datasets – Relational schema. Schema: Student (Intelligence, Ranking), Registration (Grade, Satisfaction), Course (Difficulty, Evaluation), linked with cardinalities 1,n and 1. Instance: Student Jane Doe (Intelligence: high, Ranking: 1st division); Registration #4563 (Grade: A, Satisfaction: high); Course Phil101 (Difficulty: high, Evaluation: high)
  • 6. Probabilistic Relational Models (PRM). Schema: Student (Intelligence, Ranking), Registration (Grade, Satisfaction), Course (Difficulty, Evaluation), linked with cardinalities 1,n and 1. PRM: dependency graph over Course (Difficulty, Evaluation), Registration (Grade, Satisfaction) and Student (Intelligence, Ranking), with MEAN aggregators. P(R | MEAN(G)): for MEAN(G) = A, 1st division = 0.8 and 2nd division = 0.2; for MEAN(G) = B, 1st division = 0.2 and 2nd division = 0.8. Instance: Courses Math and Phil; Registrations #4563, #5621, #6251; Students Jane Doe and John Smith; all attribute values unknown (???)
  • 7. Probabilistic Relational Models (PRM), continued. Same schema, PRM and instance as slide 6, with P(R | MEAN(G)): for MEAN(G) = A, 1st division = 0.8 and 2nd division = 0.2; for MEAN(G) = B, 1st division = 0.2 and 2nd division = 0.8. GBN (Ground Bayesian Network) nodes: Math.Diff, Math.Eval, Phil.Diff, Phil.Eval, #4563.Grade, #5621.Grade, #6251.Grade, #4563.Satis, #5621.Satis, #6251.Satis, JD.Int, JS.Int, JD.Rank, JS.Rank, plus MEAN aggregator nodes
  • 8. Uncertainty in Relational datasets. Two panels over the same instance (Student Jane Doe, Registration #4563 with Grade A, Course Phil101 with Evaluation high, all other attributes unknown): attribute uncertainty only (PRM), and attribute and link uncertainty (PRM extensions)
  • 9. PRM with reference uncertainty • Reference uncertainty: P(r.Course = ci, r.Student = sj | r.exists = true) • A random variable for each individual id? Not generalizable • Solution: partitioning. Diagram: Registration with reference slots Course and Student; Course (Difficulty, Evaluation), Student (Intelligence, Ranking); open questions: P(Student | Course.Difficulty)? P(Course)?
  • 10. PRM with reference uncertainty, continued • P(Student | ClusterStudent) follows a uniform law. Diagram: Registration with reference slots Course and Student, each with a cluster variable (ClusterCourse, ClusterStudent); Course (Difficulty, Evaluation), Student (Intelligence, Ranking). P(CStudent | S.Intelligence): for low, C1 = 0 and C2 = 1; for high, C1 = 1 and C2 = 0. P(Student | CStudent): for C1, s1 = 0 and s2 = 1; for C2, s1 = 1 and s2 = 0
  • 11. PRM with reference uncertainty, continued • P(Student | ClusterStudent) follows a uniform law. Same diagram and tables as slide 10: P(CStudent | S.Intelligence): for low, C1 = 0 and C2 = 1; for high, C1 = 1 and C2 = 0. P(Student | CStudent): for C1, s1 = 0 and s2 = 1; for C2, s1 = 1 and s2 = 0. Partition function: Students split by Intelligence (low, high) into clusters C1 and C2; population stats 50% / 50%
  • 12. Attributes-oriented Partition Functions in PRM-RU • PRM-RU: clustering from attributes • Assumption: attributes explain the relationship • Not generalizable: relationship information is not used for partitioning. Example 1 (Course/Student graph): P(Green | Red) = 1, P(Purple | Blue) = 1: YES
  • 13. Attributes-oriented Partition Functions in PRM-RU • PRM-RU: clustering from attributes • Assumption: attributes explain the relationship • Not generalizable: relationship information is not used for partitioning. Example 1 (Course/Student graph): P(Green | Red) = 1, P(Purple | Blue) = 1: YES. Example 2: P(Green | Red) = 1, P(Purple | Blue) = 1: IS THAT SO?
  • 14. Attributes-oriented Partition Functions in PRM-RU • PRM-RU: clustering from attributes • Assumption: attributes explain the relationship • Not generalizable: relationship information is not used for partitioning. Example 1 (Course/Student graph): P(Green | Red) = 1, P(Purple | Blue) = 1: YES. Example 2: P(Green | Red) = 1, P(Purple | Blue) = 1: IS THAT SO? Example 3: P(Green | Red) = 0.5, P(Purple | Red) = 0.5: NO
  • 15. Relationship-oriented Partitioning • Objective: find a partitioning that maximizes intra-partition edges. Example (Course/Student graph): attribute partitioning gives P(Green | Red) = 0.5, P(Purple | Red) = 0.5; relational partitioning into p1 and p2 gives P(Student.p1 | Course.p1) = 1, P(Student.p2 | Course.p2) = 1
  • 16. Experiments – Protocol – Dataset generation. Schema: Entity 1 (Att 1 … Att n) and Entity 2 (Att 1 … Att n), linked by relation R with cardinalities 1,n and 1. Instance: bipartite graph between Entity 1 and Entity 2 through R
  • 17. Experiments – Protocol – Dataset generation. Schema: Entity 1 (Att 1 … Att n) and Entity 2 (Att 1 … Att n), linked by relation R with cardinalities 1,n and 1. Two kinds of instance (bipartite graphs between Entity 1 and Entity 2 through R): a case favorable to attribute partitioning and a case favorable to relationship partitioning
  • 18. Experiments – Protocol – Learning. Structure: Relation with reference slots E1 and E2 and cluster variables CE1 and CE2; Entity 1 and Entity 2, each with Att 1 … Att n • Parameter learning on a fixed structure • 2 PRMs compared: – Either with attribute partitioning – Or with relational partitioning
  • 19. Experiments – Protocol – Evaluation • For each generated dataset D – Split D into 10 subsets {D1, …, D10} – Perform 10-fold CV, each fold using one Di for testing and the others for training • Do it for the PRM with attribute partitioning: store the 10 resulting log-likelihoods PattsLL[i] • Do it for the PRM with relationship partitioning: store the 10 resulting log-likelihoods PrelLL[i] – Evaluate the mean and sd of PattsLL[i] and PrelLL[i] – Evaluate the significance of relationship partitioning over attribute partitioning
  • 20. Experiments – Results. Two grids over k in {2, 4, 16} and n in {25, 50, 100, 200}: random clusters (independent from attributes), and Attributes => Cluster (fully dependent on attributes). Legend: Green: Relational > Attributes partitioning; Red: Attributes > Relational partitioning; Orange: partitionings not significantly comparable
  • 21. Experiments – About the NMF choice for partitioning • NMF – Finds low-dimensional factor matrices whose product approximates the original matrix – A relationship between two entities is an adjacency matrix • Motivation for NMF usage – (Restrictively) captures latent information from both rows and columns: co-clustering – Several extensions dedicated to more accurate co-clustering (NMTF) – Extensions for Laplacian regularization • Allow capturing both attribute and relationship information for clustering – Extensions for tensor factorization • Allow modeling n-ary relationships, n >= 2 – NMF = good starting choice for the long-term needs?
  • 22. Experiments – About the NMF choice for partitioning • But – Performance problems in the experiments – Very sensitive to initialization: crashes whenever it reaches a singular state – Moving toward large-scale methods: graph-based relational clustering?
  • 23. Conclusion • PRM-RU defines a probability structure in relational datasets • Need for partitioning • PRM-RU uses attribute-oriented partitioning • We propose to cluster the relationship information instead • Experiments show that: – Attribute partitioning does not explain all relationships – Relational partitioning can explain attribute-oriented relationships
  • 24. Perspectives • Experiments on real-life datasets • Towards large-scale partitioning methods • PRM-RU structure learning using clustering algorithms • What about other link uncertainty representations?
  • 25. Anthony Coutant, Philippe Leray, Hoel Le Capitaine DUKe (Data, User, Knowledge) Team, LINA Questions? 7ème Journées Francophones sur les Réseaux Bayésiens et les Modèles Graphiques Probabilistes (anthony.coutant | philippe.leray | hoel.lecapitaine) @univ-nantes.fr
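
The two-step reference selection of slides 9 to 11 (first pick a cluster, then pick an individual uniformly inside it) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `endpoint_distribution`, the four students, their cluster assignment and the cluster probabilities are all made up for the example.

```python
from collections import defaultdict


def endpoint_distribution(cluster_of, cluster_probs):
    """Return P(individual) = P(cluster) * 1 / |cluster| for each individual.

    cluster_of    -- maps each individual to its cluster (the partition function)
    cluster_probs -- P(cluster | parents) for one fixed parent configuration
    """
    members = defaultdict(list)
    for individual, cluster in cluster_of.items():
        members[cluster].append(individual)

    dist = {}
    for cluster, p in cluster_probs.items():
        for individual in members[cluster]:
            # Uniform law inside the cluster, as stated on slide 10.
            dist[individual] = p / len(members[cluster])
    return dist


# Hypothetical partition of four students into two clusters.
cluster_of = {"s1": "C1", "s2": "C1", "s3": "C2", "s4": "C2"}
# Hypothetical P(ClusterStudent | Course.Difficulty = high).
cluster_probs = {"C1": 0.8, "C2": 0.2}

dist = endpoint_distribution(cluster_of, cluster_probs)
print(dist)
```

Only the cluster-level distribution has to be learned; its size depends on the number of clusters rather than on the number of individuals, which is what makes the model transferable from one dataset to another.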

Editor's Notes

  1. Many relational datasets involve individuals of different types and various relationships between them. They call for specific learning algorithms, because they relax the i.i.d. assumption that is usually made. PRMs are an extension of Bayesian networks to relational data. Classical PRMs handle uncertainty at the level of each individual's attributes and assume that the relationships are known. However, a relational dataset has both a descriptive side (the attributes of the entities) and a topological side (the way individuals are connected). It is important to handle topological uncertainty in addition to attribute uncertainty. Several PRM extensions exist for this: PRM-RU, PRM-EU … PRM-RU is interesting because it clearly exposes the need for a partitioning in order to abstract away from a particular dataset. Here we look at the choice of partitioning and its impact on the learning of these PRM-RUs.
  2. Defining and learning a PRM-RU requires partitioning the individuals involved in a relationship. However, the only partitioning technique defined in the literature relies solely on the attributes of the individuals. Consequence: the actual topology of the relationship is not taken into account when modeling link uncertainty between individuals! We propose to partition individuals topologically, in place of the historical approach. We show experimentally that: 1) an attribute-oriented partitioning does not explain all relationships; 2) a relational partitioning can explain relationships that are independent of the attributes as well as relationships explained by the attributes.
  3. Many machine learning algorithms assume that the data are i.i.d. (independent and identically distributed). Such datasets can be represented as individuals × attributes tables. Here, for example, I consider a dataset in which each row holds the academic results of one student. A Bayesian network over this dataset can express probabilistic dependencies and independencies between the attributes of a single student. I can, for instance, express that a student's ranking depends on their grades. However, I cannot express a dependency between a student's ranking and the ranking of their peers.
  4. Relational datasets are more complex. A dataset no longer just follows a tabular definition of a set of attributes; it obeys a relational schema. The relational schema describes the entity types present in the data, the attributes describing those entities, and the types of relationships possible between them. Extending the previous dataset, we can for example define a relationship of type "registration" between student individuals and course individuals, with each entity and relationship described by various attributes. An instance of this schema is then a dataset that conforms to it. An example is given at the bottom.
  5. A PRM is a directed probabilistic graphical model defined over a relational schema. It represents a template for a probability distribution over all possible instances of that schema.
  6. Given a PRM and an instance conforming to the same relational schema, one can build an unrolled, flattened Bayesian network called the "Ground Bayesian Network", which supports inference on the attribute values of the individuals in the dataset at hand.
  7. PRMs handle only uncertainty on the attributes of the individuals and assume that the relationships between these individuals are known. Our goal is, for example (top), to find the likely difficulty of a course given the students registered for it and the grades they obtained in it. But it can be useful to handle the case where relationships are uncertain too, in order to do automatic link detection. For example (bottom), we might want to find the student most likely attached to a registration link for a particular course, given the characteristics of the course, of the registration, and of the candidate students. PRM extensions exist for this.
  8. Among PRM extensions, PRMs with reference uncertainty model link uncertainty with a probability distribution of the form P(endpoints | the link exists). To do so, this extension adds, for each relationship, distributions over the set of possible individuals. The question is then how to choose the probabilistic structure. Naively, one could try to learn a probability distribution over the keys of the individuals involved. However, this does not generalize. The solution proposed in PRM-RU is to partition the individuals involved in the relationship.
  9. We therefore introduce two random variables in total for each endpoint of a relationship. The idea is to choose an individual in a two-step process: 1) choose a group of individuals according to a defined probability distribution; 2) pick an individual from that group at random, following a uniform law. The choice becomes less complex because the distributions to learn on the partitioning variables have far fewer dimensions.
  10. All that remains is to actually partition the individuals. Each "cluster" variable must be associated with a partition function that groups the individuals. The "cluster" variable is then defined over the set of clusters of that function.
  11. The problem in PRM-RU comes from the way partition functions are defined: only the attributes of the individuals of the entity considered are used. The underlying assumption is very strong: the relationship is taken to be fully explained by the attributes of the linked entity types. Paradoxically, this means that learning the variables used for link detection does not directly use the links of the training dataset. This does not generalize to all cases. Take for example the bipartite graph below, an instance of the binary relationship defined earlier between students and courses. I consider here that two individuals of the same color have the same attribute configuration. In this case, if I try to learn distributions on my relationship, attributes alone give good learning, because all red courses are linked to all green students and all blue courses are linked to all purple students. The probability distributions are then informative and discriminative.
  12. With a second example, things are less clear. I will still learn the same probability distributions as before, saying that if a link is attached to a red course, then it is necessarily attached to a green student. However, this means that once the color, and hence the group, has been chosen, I can pick any green individual at random. Yet the topology of the relationship exhibits 4 distinct connected components, separating the two red and the two green individuals, which tells me that the individuals of a same-colored pair should be treated differently for this relationship.
  13. In the last case, the relationship is even more mixed and the learned distributions are no longer informative. There are interesting connected components, but they are masked by consensus probability distributions shared among the same-colored individuals.
  14. Our goal is to partition the individuals of a relationship so that the number of links inside each group is maximal. From this partitioning, we aim to learn probability distributions that reflect the topology of the relationship as faithfully as possible. Going back to the last example of the previous slide, I could obtain better parameter learning by splitting my individuals according to the components of the graph rather than according to their colors.
  15. We wanted to compare the two partitioning techniques and their impact on the learning of a PRM-RU. To do so, we built two kinds of datasets. Each of them conforms to the same simple relational schema, made of a single binary relationship between two entities. Each of them exhibits several subgroups, with strong preferences for particular subgroups of the other entity.
  16. The difference between the two datasets lies in how individuals are assigned to these subgroups. The first dataset defines a total implication between an individual's attribute values and its group membership. The second dataset defines total independence between attributes and assigned group. The first is meant to favor attribute-based partitioning, and the second is meant to favor the relationship-oriented approach.
  17. The comparison was made between two parameter-learning runs on a fixed PRM structure. Two versions of the PRM were considered: one with attribute partitioning, the other with relational partitioning.
  18. The evaluation protocol is as follows: 1) Split a dataset into 10 parts. 2) Run a 10-fold cross-validation, each time using one part as the test set. 3) Store the 10 log-likelihoods computed for the PRM with attribute partitioning. Do the same for the PRM with relational partitioning. 4) Assess the significance of the results with a z-test.
  19. The results obtained for each dataset are as follows. A green square indicates that the relational approach is significantly better than the attribute approach. A red square indicates the opposite. An orange square marks a region where the statistical test is inconclusive. For the dataset with groups independent of the attributes (left), our method is significantly better except when the ratio n / k is very low. Overfitting may explain this behavior. On the other side, when attributes explain the relationship, we cannot discern any clear pattern. We can therefore say that our method applies as well as the historical method to these relationships. Our approach thus seems more generalizable.
  20. Before concluding, I wanted to talk about the partitioning method we chose, at the time of writing this paper, for the relational approach. We chose an NMF implementation minimizing a KL divergence because this factorization technique has been shown to be equivalent to pLSA, which gives us a solid probabilistic interpretation for our problem. We originally made this choice for several reasons beyond interpretability alone. (See the slide for the rest.)
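
As an illustration of note 20, here is a minimal NumPy sketch of NMF with the standard multiplicative updates for the generalized KL divergence, applied to a relationship's adjacency matrix to obtain a co-clustering of rows and columns. It is a sketch under assumptions, not the implementation used in the paper: the function name `nmf_kl`, the toy 4 x 4 adjacency matrix, the iteration count, the `eps` guard against division by zero, and the argmax rule for reading clusters off the factors are all choices made for this example.

```python
import numpy as np


def nmf_kl(V, k, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix V (n x m) as W @ H, with W (n x k) and
    H (k x m), using multiplicative updates for the generalized KL divergence."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    # Random strictly positive initialization (results depend on the seed).
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        ratio = V / (W @ H + eps)
        H *= (W.T @ ratio) / (W.sum(axis=0)[:, None] + eps)
        ratio = V / (W @ H + eps)
        W *= (ratio @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H


# Hypothetical adjacency matrix: rows are individuals of one entity,
# columns are individuals of the other, 1 marks an existing link.
A = np.array(
    [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]],
    dtype=float,
)

W, H = nmf_kl(A, k=2)
# Assumed cluster read-out: assign each row/column to its dominant factor.
row_clusters = W.argmax(axis=1)
col_clusters = H.argmax(axis=0)
print(row_clusters, col_clusters)
```

The random initialization matters here: as slide 22 notes, the method is very sensitive to it, so different seeds can yield different partitions.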