A Bayesian Approach to Model Overlapping Objects Available as Distance Data
Sandhya Prabhakaran (1) and Julia E. Vogt (2,3)
(1) Memorial Sloan Kettering Cancer Centre, NYC
(2) University of Basel
(3) Swiss Institute of Bioinformatics
MLconf, NYC
29th March 2019
Two religions in Machine Learning
Frequentists Bayesians
(https://medium.com/datadriveninvestor/bayesian-vs-frequentist-for-dummies-58ce230c3796)
● A coin toss example: 10 heads in 10 tosses (= data given)
● Frequentists:
○ Probability is a point estimate
○ What is the relative frequency of tails? No answer
● Bayesians:
○ Probability is a distribution
○ What is the relative frequency of tails? 0.5
○ "A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule."
○ More flexible: inference, thinking, planning and reasoning (downstream analyses)
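The coin toss example can be worked through numerically. The sketch below is an illustration only: it assumes a conjugate Beta prior on the probability of tails, and the two priors shown (a uniform Beta(1, 1) and a strongly "fair coin" Beta(50, 50)) are assumptions, not values given in the talk. It shows that the Bayesian answer depends on the prior: a strong belief in a fair coin keeps the estimate close to 0.5 even after 10 heads, which is presumably the reading behind the 0.5 above.

```python
from fractions import Fraction


def posterior_mean_tails(a, b, tails, heads):
    """Posterior mean of P(tails) under a Beta(a, b) prior on P(tails).

    The Beta prior is conjugate to the coin-toss likelihood, so the
    posterior is Beta(a + tails, b + heads).
    """
    return Fraction(a + tails, a + b + tails + heads)


heads, tails = 10, 0  # the data from the slide: 10 heads in 10 tosses

# Point estimate from the data alone: the observed relative frequency of tails.
observed_frequency = Fraction(tails, heads + tails)

# Bayesian posterior means under two assumed priors (not from the talk).
uniform_prior = posterior_mean_tails(1, 1, tails, heads)       # Beta(1, 1)
fair_coin_prior = posterior_mean_tails(50, 50, tails, heads)   # Beta(50, 50)

print(observed_frequency, uniform_prior, fair_coin_prior)
# prints: 0 1/12 5/11
```

The point estimate of 0 rules tails out entirely, whereas both posteriors keep a non-zero probability for tails.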
Bayesian: Clustering
Bayesian: Clustering of vectorial objects
Clustering algorithm
Bayesian: Clustering of non-vectorial objects
(Image courtesy: shutterstock)
Mostly available as pairwise distance data
POCD: Probabilistic model for Overlap Clustering of Distance data
POCD: Overlap Clustering for distance data
● Bayesian clustering model
● Given pairwise D, we infer Z (the cluster assignment matrix)
Z
● Binary matrix
● Cluster assignment matrix
● Needs to be inferred
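As an illustration of what a binary cluster assignment matrix with overlaps looks like, here is a small hand-written example. The matrix values and the helper names (`clusters_of`, `shared_clusters`) are hypothetical and chosen only for this sketch; in POCD, Z is inferred rather than specified.

```python
# Rows are objects, columns are clusters.
# Z[i][k] == 1 means object i belongs to cluster k.
Z = [
    [1, 0],  # object 0: cluster 0 only
    [1, 1],  # object 1: clusters 0 and 1 (an overlapping object)
    [0, 1],  # object 2: cluster 1 only
    [0, 1],  # object 3: cluster 1 only
]


def clusters_of(Z, i):
    """Indices of the clusters that object i belongs to."""
    return [k for k, z in enumerate(Z[i]) if z == 1]


def shared_clusters(Z, i, j):
    """Number of clusters that objects i and j have in common."""
    return sum(a * b for a, b in zip(Z[i], Z[j]))


# Objects whose row sums exceed 1 sit in more than one cluster.
overlapping = [i for i in range(len(Z)) if sum(Z[i]) > 1]

print(overlapping)               # [1]
print(clusters_of(Z, 1))         # [0, 1]
print(shared_clusters(Z, 0, 2))  # 0
```

Because rows are not forced to sum to one, the same representation covers both exclusive and overlapping clusterings.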
● Given pairwise D, we infer Z:
p(Z|D,.) ∝ p(D|Z) p(Z)
posterior ∝ likelihood × prior
Prior over Z, p(Z): Indian Buffet Process (IBP)
● As the number of clusters k → ∞, we arrive at the IBP
● No need to fix the number of clusters
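To see why the number of clusters need not be fixed, it helps to draw a binary matrix from the IBP. The sketch below uses the standard sequential "restaurant" construction: object i joins an existing cluster k with probability m_k / i (m_k being the number of earlier objects in it) and then opens Poisson(alpha / i) new clusters. The function names, the concentration value alpha = 2.0 and the seed are assumptions made for this illustration, not settings from the talk.

```python
import math
import random


def sample_poisson(rate, rng):
    """Knuth's method for a Poisson draw; adequate for the small rates used here."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_ibp(num_objects, alpha, rng):
    """Draw a binary matrix Z from the IBP sequential construction."""
    cluster_sizes = []  # cluster_sizes[k] = objects assigned to cluster k so far
    rows = []
    for i in range(1, num_objects + 1):
        # Join each existing cluster k with probability m_k / i.
        row = [1 if rng.random() < m / i else 0 for m in cluster_sizes]
        for k, z in enumerate(row):
            cluster_sizes[k] += z
        # Open a Poisson(alpha / i) number of new clusters.
        new_clusters = sample_poisson(alpha / i, rng)
        row.extend([1] * new_clusters)
        cluster_sizes.extend([1] * new_clusters)
        rows.append(row)
    # Pad earlier rows with zeros so that Z is rectangular.
    num_clusters = len(cluster_sizes)
    return [row + [0] * (num_clusters - len(row)) for row in rows]


Z = sample_ibp(num_objects=10, alpha=2.0, rng=random.Random(0))
for row in Z:
    print(row)
```

The number of columns differs from draw to draw, so the number of clusters becomes part of what is inferred.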
Invariant likelihood, p(D|Z): generalised Wishart
● Translation and rotation invariant
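The invariance property can be checked directly on pairwise distances: translating or rotating a set of points leaves their distance matrix unchanged, so a likelihood defined on D alone cannot depend on such transformations. The sketch below only demonstrates this property of D for three made-up 2-D points; it does not implement the generalised Wishart likelihood.

```python
import math


def squared_distance_matrix(points):
    """Pairwise squared Euclidean distances between all points."""
    return [
        [sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
        for p in points
    ]


def rotate_and_translate(points, angle, shift):
    """Rotate 2-D points by `angle` radians, then translate them by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1]) for x, y in points]


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]  # hypothetical data
moved = rotate_and_translate(points, angle=0.7, shift=(3.0, -1.5))

D_original = squared_distance_matrix(points)
D_moved = squared_distance_matrix(moved)

# Largest entry-wise difference between the two distance matrices.
max_diff = max(
    abs(a - b)
    for row_a, row_b in zip(D_original, D_moved)
    for a, b in zip(row_a, row_b)
)
print(max_diff < 1e-9)  # True: D is unchanged up to floating-point error
```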
Inference using Metropolis-Hastings
● MCMC algorithm
● Used in models deploying the IBP
● Asymptotically exact approximations of the posterior
● We need to infer Z and the number of clusters
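A minimal sketch of a Metropolis-Hastings sampler over a binary matrix is given below. It is a simplification under stated assumptions: the number of columns is fixed (the POCD sampler also has to infer the number of clusters), the proposal flips a single entry of Z, and `toy_log_posterior` is a hypothetical stand-in for log p(D|Z) + log p(Z) that simply penalises disagreement with a made-up target matrix.

```python
import math
import random


def metropolis_hastings_bits(log_post, Z, num_steps, rng):
    """Single-bit-flip Metropolis-Hastings over a binary matrix Z.

    The flip proposal is symmetric, so a move from Z to Z' is accepted
    with probability min(1, p(Z') / p(Z)).
    """
    Z = [row[:] for row in Z]  # work on a copy
    current = log_post(Z)
    for _ in range(num_steps):
        i = rng.randrange(len(Z))
        k = rng.randrange(len(Z[0]))
        Z[i][k] ^= 1  # propose: flip one entry
        proposed = log_post(Z)
        # 1.0 - rng.random() lies in (0, 1], which keeps the log finite.
        if math.log(1.0 - rng.random()) < proposed - current:
            current = proposed  # accept the flip
        else:
            Z[i][k] ^= 1  # reject: undo the flip
    return Z


# Hypothetical target, used only to build a sharply peaked toy posterior.
target = [[1, 0], [1, 1], [0, 1]]


def toy_log_posterior(Z):
    """Unnormalised toy log posterior: -20 per entry that differs from target."""
    mismatches = sum(
        z != t for row, t_row in zip(Z, target) for z, t in zip(row, t_row)
    )
    return -20.0 * mismatches


start = [[0, 0], [0, 0], [0, 0]]
sample = metropolis_hastings_bits(toy_log_posterior, start, 500, random.Random(1))
print(sample)
```

Only the ratio of unnormalised posteriors enters the acceptance step, which is why knowing the posterior up to proportionality is enough.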
Clustering protein contact maps from HIV Protease inhibitors (PIs)
● Of the 26 FDA-approved anti-HIV drugs:
○ 10 are PIs
● The PIs exhibit similar behaviour
○ Similar chemical structure
● Not readily available
(https://www.sciencedirect.com/science/article/pii/S0165614711001398)
● Necessary to identify alternative PIs for therapy
○ What are the structural dissimilarities amongst PIs?
● Use Protein Contact Maps of each PI
○ Distances between all amino acid (AA) residue pairs for a protein
○ Row-wise vectorise the contact map
○ Compute the Normalised Information Distance
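The last two steps can be sketched as follows. The Normalised Information Distance is defined through Kolmogorov complexity and is not computable, so this sketch swaps in the Normalised Compression Distance (NCD) with zlib as the compressor, a common practical approximation. The talk does not say which approximation was used, and the two 32 x 32 contact maps below are synthetic stand-ins, not real PI data.

```python
import random
import zlib


def vectorise(contact_map):
    """Row-wise flatten a binary contact map into a byte string."""
    return bytes(value for row in contact_map for value in row)


def compressed_size(data):
    """Length in bytes of the zlib-compressed data."""
    return len(zlib.compress(data, 9))


def ncd(x, y):
    """Normalised Compression Distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


n = 32
# Synthetic map A: residues in contact when close along the chain.
map_a = [[1 if abs(i - j) <= 2 else 0 for j in range(n)] for i in range(n)]
# Synthetic map B: unstructured random contacts, for contrast.
rng = random.Random(0)
map_b = [[rng.randint(0, 1) for _ in range(n)] for _ in range(n)]

vec_a, vec_b = vectorise(map_a), vectorise(map_b)
d_same = ncd(vec_a, vec_a)
d_diff = ncd(vec_a, vec_b)
print(d_same < d_diff)  # the identical pair scores as more similar
```

Evaluating `ncd` for every pair of vectorised maps would give a pairwise distance matrix D of the kind the model takes as input.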
Contact Maps of the Protease Inhibitors
Reading material
● A tutorial on Bayesian nonparametric models:
http://gershmanlab.webfactional.com/pubs/GershmanBlei12.pdf
● Leo Breiman: ‘Statistical Modeling: The Two Cultures’:
https://projecteuclid.org/download/pdf_1/euclid.ss/1009213726
● An abstract of this work, presented as a Spotlight at the Bayesian Nonparametrics Workshop at NeurIPS 2018:
https://drive.google.com/file/d/1ExVpeUomv8Z4mPMu5as_CbmrHjVY0IDV/view
● Tutorials on the latest deep learning papers: https://www.depthfirstlearning.com/ (@DepthFirstLearn)
@sandhya212
Thank you
