Species sampling models
julyan.arbel@carloalberto.org www.julyanarbel.com
Bocconi University, Milan, Italy & Collegio Carlo Alberto, Turin
Statalks Seminar @ Collegio Carlo Alberto
February 12, 2016
Discovery probabilities
Discovery problem: motivating example
What is the probability of observing a new species?
Good and Turing worked on this problem at Bletchley Park to crack German
ciphers for the Enigma machine during World War II.
They proposed the estimator
$$\frac{\text{number of species observed once}}{\text{total number of observations}}$$
Discovery problem
• Population of individuals $(X_i)_{i\ge 1}$ belonging to an (ideally) infinite
number of species $(\theta_i)_{i\ge 1}$, with respective unknown proportions $(p_i)_{i\ge 1}$
• Given $(X_1, \dots, X_n)$, make inference on the probability that the $(n+1)$-th
observation coincides with a species whose frequency in the sample is $l$, for
$l = 0, 1, \dots, n$. This probability is termed the $l$-discovery, that is
$$D_n(l) = \sum_{i\ge 1} p_i \, \mathbb{1}_{\{l\}}(\tilde n_i)$$
where $\tilde n_i$ is the frequency of the species of type $\theta_i$ in the sample
• $D_n(0)$ denotes the proportion of yet unobserved species, or the probability
of discovering a new species, or the missing mass
• Applications arising from ecology, biology, design of experiments,
bioinformatics, genetics, linguistics, economics, network modeling,
chemistry, ...
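To make the definition concrete, here is a minimal sketch that evaluates $D_n(l)$ when the proportions are known, which is never the case in practice. The five-species population, the sample size and the function name `discovery_probability` are illustrative assumptions and do not come from the talk.

```python
import random
from collections import Counter

def discovery_probability(p, sample, l):
    """D_n(l): total mass of the species seen exactly l times in the sample.

    p      -- dict mapping species label -> true proportion (assumed known here)
    sample -- list of observed species labels X_1, ..., X_n
    """
    counts = Counter(sample)  # frequency of each species; 0 for unseen species
    return sum(p_i for species, p_i in p.items() if counts[species] == l)

# Hypothetical population with five species (finite only for illustration).
p = {"a": 0.5, "b": 0.2, "c": 0.15, "d": 0.1, "e": 0.05}
rng = random.Random(0)
sample = rng.choices(list(p), weights=list(p.values()), k=10)

# Every species has exactly one frequency in {0, ..., n},
# so the l-discoveries sum to 1 over l.
total = sum(discovery_probability(p, sample, l) for l in range(len(sample) + 1))
print(discovery_probability(p, sample, 0), total)
```

The value for `l = 0` is the missing mass: the total proportion of the species that the sample has not yet revealed.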
BNP model
• The BNP approach for estimating $D_n(l)$ is based on the randomization of
the unknown species proportions $p_i$. See Lijoi, Mena and Prünster (2007).
Let $P = \sum_{i\ge 1} p_i \delta_{\theta_i}$ denote a discrete random probability measure.
Let $\mathbf{X}_n = (X_1, \dots, X_n)$ be a sample from a population with composition $P$,
namely
$$X_i \mid P \overset{\text{iid}}{\sim} P, \qquad P \sim Q,$$
with $Q$, the law of $P$, playing the role of the nonparametric prior
• Due to the discreteness of $P$, the sample $\mathbf{X}_n$ exhibits ties with
positive probability. In other terms, $\mathbf{X}_n$ features $k_n \le n$ distinct observations
$X_1^*, \dots, X_{k_n}^*$ with corresponding frequencies $(n_1, \dots, n_{k_n})$
• The information provided by $(n_1, \dots, n_{k_n})$ can be coded by
$\mathbf{m}_n = (m_1, \dots, m_n)$, where $m_i$ is the number of species in the sample $\mathbf{X}_n$
having frequency $i$.
Under this alternative codification one obtains $\sum_{1\le i\le n} m_i = k_n$ and
$\sum_{1\le i\le n} i\, m_i = n$.
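The two codifications of the sample can be sketched in a few lines. The toy sample and the function name `frequency_counts` below are assumptions made for illustration; the two assertions are the identities stated above.

```python
from collections import Counter

def frequency_counts(sample):
    """Return the species frequencies n_j and the frequencies of frequencies m_i."""
    species_freq = Counter(sample)       # n_j for each distinct value X*_j
    m = Counter(species_freq.values())   # m_i = number of species seen i times
    return species_freq, m

sample = ["a", "b", "a", "c", "a", "b", "d"]   # hypothetical sample, n = 7
species_freq, m = frequency_counts(sample)
n, k_n = len(sample), len(species_freq)

assert sum(m.values()) == k_n                      # sum_i m_i = k_n
assert sum(i * m_i for i, m_i in m.items()) == n   # sum_i i * m_i = n
print(dict(m))
```

Only the nonzero $m_i$ are stored, which is convenient because most frequencies between 1 and $n$ are typically unobserved.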
Good–Turing estimators of discovery
Recall that Good and Turing estimate the probability of observing a new species as
$$\frac{\text{number of species observed once}}{\text{total number of observations}},$$
i.e.
$$\check D_n(0) = \frac{m_1}{n}.$$
This is generalized to any frequency $l \le n$ by
$$\check D_n(l) = \frac{(l+1)\, m_{l+1}}{n},$$
see Good (1953).
BNP counterparts of these estimators?
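As a minimal sketch of the Good–Turing formula, the snippet below evaluates $(l+1)\,m_{l+1}/n$ on hypothetical frequency counts; the function name `good_turing` and the toy values of `m` are assumptions for illustration only.

```python
def good_turing(m, n, l):
    """Good-Turing estimate of D_n(l): (l + 1) * m_{l+1} / n.

    m -- dict mapping frequency i -> m_i (missing keys mean m_i = 0)
    n -- sample size, n = sum_i i * m_i
    """
    return (l + 1) * m.get(l + 1, 0) / n

# Hypothetical frequency counts: 2 singletons, 1 doubleton, 1 tripleton.
m = {1: 2, 2: 1, 3: 1}
n = sum(i * m_i for i, m_i in m.items())   # n = 7

estimates = [good_turing(m, n, l) for l in range(n + 1)]
print(estimates[0])     # m_1 / n
print(estimates[3])     # 0.0 because m_4 = 0, although one species has frequency 3
print(sum(estimates))   # the estimates sum to 1, up to floating-point rounding
```

The estimate for frequency $l$ depends on $m_{l+1}$ only, so it vanishes whenever no species has frequency $l+1$, even if species with frequency $l$ were observed.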
BNP estimators of discovery
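Gibbs-type random probability measure $P$ with index $\sigma \in (0, 1)$: it is
characterized by (it induces) a predictive distribution of the form
$$\mathbb{P}[X_{n+1} \in A \mid \mathbf{X}_n] = \frac{V_{n+1,k_n+1}}{V_{n,k_n}}\, G_0(A) + \frac{V_{n+1,k_n}}{V_{n,k_n}} \sum_{i=1}^{k_n} (n_i - \sigma)\, \delta_{X_i^*}(A).$$
The BNP estimator $\hat D_n(l)$ of $D_n(l)$ is derived from the predictive using the sets
$A_0 = \mathbb{X} \setminus \{X_1^*, \dots, X_{k_n}^*\}$ and $A_l = \{X_i^* : n_i = l\}$.
BNP versus Good–Turing estimators:
• $l = 0$: $\hat D_n(0) = \mathbb{E}[P(A_0) \mid \mathbf{X}_n] = \dfrac{V_{n+1,k_n+1}}{V_{n,k_n}}$ versus $\check D_n(0) = \dfrac{m_1}{n}$
• $l \ge 1$: $\hat D_n(l) = \mathbb{E}[P(A_l) \mid \mathbf{X}_n] = (l - \sigma)\, m_l\, \dfrac{V_{n+1,k_n}}{V_{n,k_n}}$ versus $\check D_n(l) = \dfrac{(l+1)\, m_{l+1}}{n}$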
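Both BNP estimators follow from evaluating the predictive distribution on the two sets, since $\mathbb{E}[P(A) \mid \mathbf{X}_n] = \mathbb{P}[X_{n+1} \in A \mid \mathbf{X}_n]$. The short derivation below assumes that the base measure $G_0$ is non-atomic (diffuse), which is the usual setting for Gibbs-type priors but is not stated explicitly on the slide.

```latex
% Assumption: G_0 is diffuse, so G_0(A_0) = 1 and G_0(A_l) = 0 for l >= 1.
\begin{align*}
\hat D_n(0)
  &= \frac{V_{n+1,k_n+1}}{V_{n,k_n}}\, \underbrace{G_0(A_0)}_{=\,1}
   + \frac{V_{n+1,k_n}}{V_{n,k_n}} \sum_{i=1}^{k_n} (n_i - \sigma)\,
     \underbrace{\delta_{X_i^*}(A_0)}_{=\,0}
   = \frac{V_{n+1,k_n+1}}{V_{n,k_n}}, \\[4pt]
\hat D_n(l)
  &= \frac{V_{n+1,k_n+1}}{V_{n,k_n}}\, \underbrace{G_0(A_l)}_{=\,0}
   + \frac{V_{n+1,k_n}}{V_{n,k_n}} \sum_{i=1}^{k_n} (n_i - \sigma)\,
     \mathbb{1}\{n_i = l\}
   = (l - \sigma)\, m_l\, \frac{V_{n+1,k_n}}{V_{n,k_n}},
   \qquad l = 1, \dots, n.
\end{align*}
```

The last equality uses the fact that exactly $m_l$ of the $k_n$ observed species have frequency $l$, each contributing $l - \sigma$ to the sum.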
Credible intervals for discovery
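• Special case of the Pitman–Yor process (Perman, Pitman and Yor, 1992).
For $\sigma \in (0, 1)$ and $\theta > -\sigma$,
$$V_{n,k_n} = \frac{\prod_{i=1}^{k_n-1} (\theta + i\sigma)}{(\theta + 1)_{(n-1)}},$$
where $(a)_{(m)} = a(a+1)\cdots(a+m-1)$ denotes the rising factorial.
The posterior distributions are then available in closed form as Beta distributions:
$$P(A_0) \mid \mathbf{X}_n \overset{d}{=} \mathrm{Beta}(\theta + \sigma k_n,\; n - \sigma k_n)$$
and
$$P(A_l) \mid \mathbf{X}_n \overset{d}{=} \mathrm{Beta}\big((l - \sigma)\, m_l,\; \theta + n - (l - \sigma)\, m_l\big)$$
• Similar results hold in the general Gibbs class
• Practical tool for deriving credible intervals for the BNP estimator $\hat D_n(l)$,
for any $l = 0, 1, \dots, n$. This is typically done by numerically evaluating
appropriate quantiles of the distribution of $P(A_l) \mid \mathbf{X}_n$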
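A minimal sketch of this recipe, using only the Python standard library, is given below. The function names and the input values are hypothetical, and the quantiles are approximated by Monte Carlo draws from the Beta posterior; an exact Beta quantile function (for instance `scipy.stats.beta.ppf`) could be substituted for that step.

```python
import random

def py_posterior_params(l, m_l, n, k_n, sigma, theta):
    """Beta parameters (a, b) of P(A_l) | X_n under a Pitman-Yor prior.

    For l = 0 the argument m_l is ignored.
    """
    a = theta + sigma * k_n if l == 0 else (l - sigma) * m_l
    return a, theta + n - a   # for l = 0 this gives b = n - sigma * k_n

def credible_interval(a, b, level=0.95, draws=20000, seed=0):
    """Monte Carlo approximation of an equal-tailed Beta(a, b) credible interval."""
    rng = random.Random(seed)
    sample = sorted(rng.betavariate(a, b) for _ in range(draws))
    lower = sample[round((1 - level) / 2 * draws)]
    upper = sample[round((1 + level) / 2 * draws) - 1]
    return lower, upper

# Hypothetical inputs, chosen only to exercise the functions.
n, k_n, sigma, theta = 100, 40, 0.5, 10.0
a, b = py_posterior_params(0, None, n, k_n, sigma, theta)
estimate = a / (a + b)   # posterior mean (theta + sigma * k_n) / (theta + n)
lower, upper = credible_interval(a, b)
print(estimate, (lower, upper))
```

The posterior mean `a / (a + b)` coincides with the BNP estimator, because the Pitman–Yor weights give $V_{n+1,k_n+1}/V_{n,k_n} = (\theta + \sigma k_n)/(\theta + n)$ and $V_{n+1,k_n}/V_{n,k_n} = 1/(\theta + n)$.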
Application to EST libraries
Application to genomic datasets called Expressed Sequence Tags (EST)
libraries
• The Naegleria gruberi aerobic library consists of $n = 959$ ESTs with $k_n = 473$
distinct genes and $m_{l,959} = 346, 57, 19, 12, 9, 5, 4, 2, 4, 5, 4, 1, 1, 1, 1, 1, 1$,
for $l \in \{1, 2, \dots, 12\} \cup \{16, 17, 18\} \cup \{27\} \cup \{55\}$
• The Naegleria gruberi anaerobic library consists of $n = 969$ ESTs with
$k_n = 631$ distinct genes and $m_{l,969} = 491, 72, 30, 9, 13, 5, 3, 1, 2, 0, 1, 0, 1$,
for $l \in \{1, 2, \dots, 13\}$
• Prior specification: Pitman–Yor process, with an empirical Bayes procedure
for estimating $(\sigma, \theta)$
• $\hat\sigma = 0.669$, $\hat\theta = 46.241$ for the Naegleria gruberi aerobic library
• $\hat\sigma = 0.656$, $\hat\theta = 155.408$ for the Naegleria gruberi anaerobic library
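The aerobic figures above are enough to compare the two families of point estimators. The sketch below is an illustration rather than the authors' code: it plugs the reported counts and the reported empirical Bayes estimates into the Good–Turing formula and into the mean of the Pitman–Yor Beta posterior; the helper names `good_turing` and `pitman_yor` are assumptions.

```python
# Aerobic library as reported above: frequency l -> m_l.
freqs = list(range(1, 13)) + [16, 17, 18, 27, 55]
counts = [346, 57, 19, 12, 9, 5, 4, 2, 4, 5, 4, 1, 1, 1, 1, 1, 1]
m = dict(zip(freqs, counts))

n = sum(l * m_l for l, m_l in m.items())
k_n = sum(m.values())
assert (n, k_n) == (959, 473)   # consistent with the reported n and k_n

sigma, theta = 0.669, 46.241    # empirical Bayes estimates reported above

def good_turing(l):
    """Good-Turing estimate (l + 1) * m_{l+1} / n."""
    return (l + 1) * m.get(l + 1, 0) / n

def pitman_yor(l):
    """Mean of the Beta posterior of P(A_l) | X_n under the Pitman-Yor prior."""
    a = theta + sigma * k_n if l == 0 else (l - sigma) * m.get(l, 0)
    return a / (theta + n)

for l in (0, 1, 5):
    print(l, round(good_turing(l), 4), round(pitman_yor(l), 4))
```

The same functions apply to any library once `m`, `sigma` and `theta` are replaced accordingly.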
[Figure: posterior distributions of the discovery probabilities $D_n(l)$ for $l \in \{0, 1, 5\}$, one panel per value of $l$; dashed curve for the aerobic library, solid curve for the anaerobic library.]
Conclusion
Take-home messages
We have seen that Bayesian nonparametric methods allow for
• smoothed estimation of the discovery probabilities $D_n(l)$, via estimators
that are more robust than their frequentist counterparts
• a principled treatment of uncertainty, where credible intervals are obtained
naturally from the closed-form expression of the posterior distribution
