Stochastic Context Free Grammars
Grammars
● Wiki
a grammar is a set of rewriting rules for forming strings in a formal language
● context-free:
each rule rewrites a single variable (nonterminal)
● Formal definition
a grammar is a 4-tuple (N, V, P, S)
● N set of nonterminals
● V set of terminals
● P set of rules
● S start symbol
● Example
S → aSu ∣ aS ∣ Su ∣ ε
generates {a^m u^n ∣ m, n ≥ 0}
S ⇒ aSu ⇒ aaSuu ⇒ aauu
S ⇒ aS ⇒ aaS ⇒ aaSu ⇒ aaSuu ⇒ aauu
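To see that the example grammar really produces strings of the form a^m u^n, one can enumerate its derivations by brute force. The sketch below is only an illustration: the function name `generate` and the length bound are assumptions, and the empty string stands for ε.

```python
import re
from collections import deque

# Right-hand sides of S → aSu | aS | Su | ε ("" stands for ε).
RULES = ["aSu", "aS", "Su", ""]

def generate(max_len):
    """Return all terminal strings of length <= max_len derivable from S."""
    results = set()
    seen = {"S"}
    queue = deque(["S"])
    while queue:
        form = queue.popleft()
        if "S" not in form:
            results.add(form)          # no nonterminal left: a finished string
            continue
        for rhs in RULES:
            new = form.replace("S", rhs, 1)
            # Terminals are never deleted, so overly long forms can be pruned.
            if len(new.replace("S", "")) <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return results

strings = generate(4)
# Every generated string is some a's followed by some u's.
assert all(re.fullmatch(r"a*u*", s) for s in strings)
print(sorted(strings, key=lambda s: (len(s), s)))
```

The breadth-first search keeps one copy of each sentential form, so the same string reached through different derivations is reported once.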
Stochastic CFGs
● A context free grammar (CFG) + probabilities
● Assign probabilities to generated strings
● Example
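S → aSu (0.1) ∣ aS (0.4) ∣ Su (0.4) ∣ ε (0.1)
S ⇒ aSu ⇒ aaSuu ⇒ aauu
rule probabilities 0.1, 0.1, 0.1; derivation probability 0.1 · 0.1 · 0.1 = 0.001
S ⇒ aS ⇒ aaS ⇒ aaSu ⇒ aaSuu ⇒ aauu
rule probabilities 0.4, 0.4, 0.4, 0.4, 0.1; derivation probability 0.4^4 · 0.1 = 0.00256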
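The probability of a derivation is the product of the probabilities of the rules it uses. A minimal sketch that replays the two derivations of aauu; the helper `derive` and the encoding of rules by their right-hand sides are assumptions made for illustration.

```python
from math import isclose, prod

# Rule probabilities from the slide: S → aSu (0.1) | aS (0.4) | Su (0.4) | ε (0.1)
PROB = {"aSu": 0.1, "aS": 0.4, "Su": 0.4, "": 0.1}

def derive(rules):
    """Apply the rules in order to S; return (final string, derivation probability)."""
    form = "S"
    for rhs in rules:
        form = form.replace("S", rhs, 1)
    return form, prod(PROB[rhs] for rhs in rules)

first = derive(["aSu", "aSu", ""])               # S ⇒ aSu ⇒ aaSuu ⇒ aauu
second = derive(["aS", "aS", "Su", "Su", ""])    # S ⇒ aS ⇒ aaS ⇒ aaSu ⇒ aaSuu ⇒ aauu

assert first[0] == second[0] == "aauu"
assert isclose(first[1], 0.001)
assert isclose(second[1], 0.00256)
```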
SCFGs
● Purpose:
● generate the same string using different sets of rules
● each set of rules tells a different story
● each set of rules assigns a different probability to the string
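The point that one string can be generated by several sets of rules, each with its own probability, can be checked by listing every derivation of aauu in the example grammar. This is a brute-force sketch for the toy grammar only; the generator `derivations` is a hypothetical helper, not a general parser.

```python
from math import isclose

PROB = {"aSu": 0.1, "aS": 0.4, "Su": 0.4, "": 0.1}

def derivations(s):
    """Yield (rule list, probability) for every derivation of s from S."""
    if s == "":
        yield [""], PROB[""]                     # S → ε
        return
    if len(s) >= 2 and s[0] == "a" and s[-1] == "u":
        for rules, p in derivations(s[1:-1]):    # S → aSu
            yield ["aSu"] + rules, PROB["aSu"] * p
    if s[0] == "a":
        for rules, p in derivations(s[1:]):      # S → aS
            yield ["aS"] + rules, PROB["aS"] * p
    if s[-1] == "u":
        for rules, p in derivations(s[:-1]):     # S → Su
            yield ["Su"] + rules, PROB["Su"] * p

all_derivs = list(derivations("aauu"))
best_rules, best_p = max(all_derivs, key=lambda d: d[1])
total = sum(p for _, p in all_derivs)
print(len(all_derivs), best_rules, best_p, total)
```

The largest single probability corresponds to the most probable "story" for the string, while the sum over all derivations is the total probability of generating it.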
SCFGs & RNA
● Relation to RNA and secondary structure prediction
● generates RNA sequences: strings over {A, C, G, U}
● secondary structure is given by the set of rules used
● assigns probabilities to structures
[Figure: the two derivations of aauu from the example grammar, each annotated with the secondary structure implied by the rules used]
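How a set of rules determines a structure can be sketched for the toy grammar. The code assumes that rule aSu emits a base pair while aS and Su emit unpaired bases, and it writes the structure in dot-bracket notation; both the notation and the function name `structure` are assumptions for illustration.

```python
def structure(rules):
    """Return (sequence, dot-bracket string) for a derivation of the toy grammar."""
    left_seq, left_db, right_seq, right_db = [], [], [], []
    for rhs in rules:
        if rhs == "aSu":        # a and u emitted together: a base pair
            left_seq.append("a")
            left_db.append("(")
            right_seq.append("u")
            right_db.append(")")
        elif rhs == "aS":       # unpaired a on the left
            left_seq.append("a")
            left_db.append(".")
        elif rhs == "Su":       # unpaired u on the right
            right_seq.append("u")
            right_db.append(".")
        elif rhs == "":         # ε ends the derivation
            break
        else:
            raise ValueError(f"unknown rule: {rhs!r}")
    # Symbols emitted on the right were produced outside-in, so reverse them.
    seq = "".join(left_seq) + "".join(reversed(right_seq))
    dot_bracket = "".join(left_db) + "".join(reversed(right_db))
    return seq, dot_bracket

print(structure(["aSu", "aSu", ""]))             # ('aauu', '(())')
print(structure(["aS", "aS", "Su", "Su", ""]))   # ('aauu', '....')
```

Both derivations yield the same sequence but different structures, which is why assigning probabilities to derivations amounts to assigning probabilities to structures.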
A better example
S → aS ∣ cS ∣ gS ∣ uS
  ∣ Sa ∣ Sc ∣ Sg ∣ Su
  ∣ aSu ∣ cSg ∣ gSu
  ∣ uSa ∣ gSc ∣ uSg
  ∣ SS
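The pairing rules of this grammar can be read off mechanically. The sketch below extracts every rule of the form xSy and checks that the resulting pairs are exactly the Watson-Crick pairs plus the G-U wobble pair; the variable names are illustrative.

```python
# Right-hand sides of the grammar above, in the order given on the slide.
RHS = "aS cS gS uS Sa Sc Sg Su aSu cSg gSu uSa gSc uSg SS".split()

# A rule xSy emits x and y together, i.e. it forms the base pair (x, y).
pairs = sorted(r[0] + r[2] for r in RHS if len(r) == 3 and r[1] == "S")

WATSON_CRICK = {"au", "ua", "cg", "gc"}
WOBBLE = {"gu", "ug"}
assert set(pairs) == WATSON_CRICK | WOBBLE
print(pairs)   # ['au', 'cg', 'gc', 'gu', 'ua', 'ug']
```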
Algorithms
● Determine the most probable structure for an RNA sequence
● Determine the total probability of generating a sequence
(the sum of probabilities of all ways of generating it)
● Given a data set with sequences and associated structures,
determine the rules' probabilities that maximize the total
probability of generating the right structures from the set
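The three problems above can be written compactly. The notation is an assumption introduced here and does not appear on the slides: x is a sequence, π a derivation (and hence a structure) of x, θ the rule probabilities, and (x^(d), π^(d)), d = 1..D, the training pairs.

```latex
% x: sequence, \pi: derivation of x, \theta: rule probabilities (assumed notation)
\begin{align*}
\hat{\pi} &= \arg\max_{\pi} P(x, \pi \mid \theta)
  && \text{most probable structure} \\
P(x \mid \theta) &= \sum_{\pi} P(x, \pi \mid \theta)
  && \text{total probability of the sequence} \\
\hat{\theta} &= \arg\max_{\theta} \prod_{d=1}^{D} P\bigl(x^{(d)}, \pi^{(d)} \mid \theta\bigr)
  && \text{training the rule probabilities}
\end{align*}
```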
Chomsky Normal Form
● Only rules of the form
A → BC
A → d
A → ε
● Example
S → aS ⇒ S → AS, A → a
S → Sa ⇒ S → SA, A → a
● Any CFG can be rewritten in CNF
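The rewriting step shown in the example, replacing a terminal inside a longer rule by a new nonterminal, can be sketched as follows. This covers only that one step of a CNF conversion; it assumes single-character symbols, lowercase terminals, uppercase nonterminals, and that the uppercase form of a terminal is not already in use. The function name `lift_terminals` is hypothetical.

```python
def lift_terminals(rules):
    """rules: list of (lhs, rhs) pairs, rhs a string of single-character symbols.

    In every right-hand side of length >= 2, replace each terminal x by a
    new nonterminal X and add the rule X → x.
    """
    lifted, extra = [], []
    for lhs, rhs in rules:
        if len(rhs) >= 2:
            for sym in rhs:
                if sym.islower() and (sym.upper(), sym) not in extra:
                    extra.append((sym.upper(), sym))
            # Nonterminals are already uppercase, so upper() only changes terminals.
            rhs = "".join(sym.upper() for sym in rhs)
        lifted.append((lhs, rhs))
    return lifted + extra

print(lift_terminals([("S", "aS"), ("S", "Sa")]))
# [('S', 'AS'), ('S', 'SA'), ('A', 'a')]
```

A rule such as S → aSu becomes S → ASU after this step and would still need to be split into binary rules to reach CNF.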
Cocke–Younger–Kasami
● Calculate best structure for small subsequences and work outwards to larger and larger subsequences
● Notation
● Grammar G in CNF with nonterminals V_1, ..., V_m
● V_1 is the start symbol
● t(x, y, z) is the probability of rule V_x → V_y V_z
● e(x, a) is the probability of rule V_x → a
● score[x, i, j] is the maximum probability of generating seq[i, j] from V_x
CYK
● V_x → seq[i]
score[x, i, i] = e(x, seq[i])
● V_x → V_y V_z and for some i ≤ k < j
score[x, i, j] = score[y, i, k] · score[z, k+1, j] · t(x, y, z)
[Figure: V_x at the root splits into V_y, covering seq[i, k], and V_z, covering seq[k+1, j]]
\[
score[x, i, j] =
\begin{cases}
0 & \text{if } j < i \\
e(x, seq[i]) & \text{if } i = j \\
\max\limits_{\substack{i \le k < j \\ V_x \to V_y V_z}} score[y, i, k] \cdot score[z, k+1, j] \cdot t(x, y, z) & \text{otherwise}
\end{cases}
\]
● Space? O(m · n^2)
● Time? O(m · r · n^3)
● Backtracking? O(r · n^2)
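The recurrence translates into a table-filling procedure with backpointers. The sketch below is one possible implementation under stated assumptions: indices are 0-based (V_1 is index 0), t and e are dictionaries keyed by (x, y, z) and (x, a), and the toy grammar at the bottom is hypothetical, chosen only so the result is easy to check by hand.

```python
from math import isclose

def cyk(seq, m, t, e):
    """Fill score[x][i][j], the max probability of deriving seq[i..j] from V_x."""
    n = len(seq)
    score = [[[0.0] * n for _ in range(n)] for _ in range(m)]
    back = {}                                   # (x, i, j) -> (y, z, k) of the best split
    for i, a in enumerate(seq):                 # base case: V_x → seq[i]
        for x in range(m):
            score[x][i][i] = e.get((x, a), 0.0)
    for length in range(2, n + 1):              # shorter subsequences first
        for i in range(n - length + 1):
            j = i + length - 1
            for (x, y, z), p in t.items():      # every rule V_x → V_y V_z
                for k in range(i, j):           # every split point i <= k < j
                    s = score[y][i][k] * score[z][k + 1][j] * p
                    if s > score[x][i][j]:
                        score[x][i][j] = s
                        back[x, i, j] = (y, z, k)
    return score, back

def traceback(back, x, i, j):
    """Return the best parse as nested tuples; a leaf is (x, i)."""
    if i == j:
        return (x, i)
    y, z, k = back[x, i, j]                     # KeyError if seq[i..j] is not derivable
    return (x, traceback(back, y, i, k), traceback(back, z, k + 1, j))

# Hypothetical toy grammar: V_0 = S, V_1 = A
t = {(0, 1, 0): 0.4, (0, 0, 1): 0.4}            # S → A S, S → S A
e = {(0, "a"): 0.2, (1, "a"): 1.0}              # S → a, A → a
score, back = cyk("aaa", 2, t, e)
best = score[0][0][2]
tree = traceback(back, 0, 0, 2)
assert isclose(best, 0.032)
print(best, tree)
```

The table holds one entry per nonterminal and subsequence, and each entry is computed by trying every binary rule at every split point, which is where the n^2 space and n^3 time factors come from. Ties are broken in favour of the rule and split seen first.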
SCFG design
● Dowell & Eddy (2004)
G1: S → d S d ∣ d S ∣ S d ∣ SS ∣ ε
G2: S → d S d ∣ d L ∣ R d ∣ LS
    L → d S d ∣ d L
    R → R d ∣ ε
G3: S → d S ∣ d S d S ∣ ε
G4: S → d S ∣ T ∣ ε
    T → T d ∣ d S d ∣ T d S d
G5: S → LS ∣ L
    L → d F d ∣ d
    F → d F d ∣ LS
Prediction accuracy
● Sensitivity and specificity
sensitivity = TP / (TP + FN)
specificity = TN / (TN + FP)
● Example 1
sensitivity = 4 / (4 + 2) ≈ 0.667
specificity = 4 / (4 + 2) ≈ 0.667
● Example 2
sensitivity = 5 / (5 + 2) ≈ 0.714
specificity = 2 / (2 + 3) = 0.4
● Use RNA secondary structure metrics (Moulton et al. 2000)
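The two measures are simple ratios of counts. The sketch below uses the standard definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP); reading the numbers in the examples as those counts is an assumption, since the figures they refer to are not part of this text.

```python
from math import isclose

def sensitivity(tp, fn):
    """Fraction of true items that were predicted: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of false items that were correctly rejected: TN / (TN + FP)."""
    return tn / (tn + fp)

# Counts as they appear in the examples (assumed to be TP, FN and TN, FP).
print(round(sensitivity(4, 2), 3), round(specificity(4, 2), 3))   # 0.667 0.667
print(round(sensitivity(5, 2), 3), round(specificity(2, 3), 3))   # 0.714 0.4
```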
Search for better SCFGs
● Evolutionary algorithm
● Initial population
● Mutation model
● Breeding model
● Selection
