2. Grammars
● Wiki
a grammar is a set of
rewriting rules for forming
strings in a formal language
● context-free:
rewrite single variables
● Formal definition
a grammar is a 4-tuple (N, V, P, S)
● N set of nonterminals
● V set of terminals
● P set of rules
● S start symbol
● Example
S → aSu ∣ aS ∣ Su ∣ ε
generates {aᵐuⁿ ∣ m, n ≥ 0}
S ⇒ aSu ⇒ aaSuu ⇒ aauu
S ⇒ aS ⇒ aaS ⇒ aaSu ⇒ aaSuu ⇒ aauu
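The two derivations above can be replayed mechanically. The following is a minimal sketch, assuming the grammar is stored as a 4-tuple and that each step rewrites the leftmost nonterminal; the names `Grammar` and `derive` are illustrative and not from the slides, and `""` stands for ε.

```python
from typing import NamedTuple


class Grammar(NamedTuple):
    nonterminals: set  # N
    terminals: set     # V
    rules: dict        # P: nonterminal -> list of right-hand sides
    start: str         # S


# Toy grammar from the example; "" plays the role of the empty string.
G = Grammar({"S"}, {"a", "u"}, {"S": ["aSu", "aS", "Su", ""]}, "S")


def derive(grammar, choices):
    """Rewrite the leftmost nonterminal with each chosen right-hand side in turn."""
    form = grammar.start
    steps = [form]
    for rhs in choices:
        pos = next(i for i, ch in enumerate(form) if ch in grammar.nonterminals)
        if rhs not in grammar.rules[form[pos]]:
            raise ValueError(f"no rule {form[pos]} -> {rhs!r}")
        form = form[:pos] + rhs + form[pos + 1:]
        steps.append(form)
    return steps


print(" => ".join(derive(G, ["aSu", "aSu", ""])))
print(" => ".join(derive(G, ["aS", "aS", "Su", "Su", ""])))
```

Both calls end in `aauu`, which is the point of the example: one string, two different rule sequences.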
3. Stochastic CFGs
● A context free grammar (CFG) + probabilities
● Assign probabilities to generated strings
● Example
S → aSu (0.1) ∣ aS (0.4) ∣ Su (0.4) ∣ ε (0.1)
S ⇒ aSu ⇒ aaSuu ⇒ aauu
  step probabilities 0.1 · 0.1 · 0.1 = 0.001
S ⇒ aS ⇒ aaS ⇒ aaSu ⇒ aaSuu ⇒ aauu
  step probabilities 0.4 · 0.4 · 0.4 · 0.4 · 0.1 = 0.00256
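The probability of a derivation is the product of the probabilities of the rules it uses. A minimal sketch of that arithmetic, assuming the rule probabilities of the example are stored in a dictionary keyed by right-hand side (`RULE_PROB` and `derivation_probability` are illustrative names, `""` stands for ε):

```python
import math

# Rule probabilities for S -> aSu | aS | Su | epsilon, as in the example.
RULE_PROB = {"aSu": 0.1, "aS": 0.4, "Su": 0.4, "": 0.1}


def derivation_probability(choices):
    """Multiply the probabilities of the rules used, one factor per step."""
    return math.prod(RULE_PROB[rhs] for rhs in choices)


p_paired = derivation_probability(["aSu", "aSu", ""])
p_unpaired = derivation_probability(["aS", "aS", "Su", "Su", ""])
print(f"{p_paired:.5f} {p_unpaired:.5f}")  # 0.00100 0.00256
```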
4. SCFGs
● Purpose:
● generate the same string using different sets of rules
● each set of rules tells a different story
● each set of rules assigns a different probability to the string
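To see how many different "stories" a single string can have, one can enumerate every derivation of `aauu` under the toy grammar S → aSu (0.1) ∣ aS (0.4) ∣ Su (0.4) ∣ ε (0.1). This is a brute-force sketch for illustration only; the function name `derivations` is an assumption, and `""` stands for ε.

```python
RULE_PROB = {"aSu": 0.1, "aS": 0.4, "Su": 0.4, "": 0.1}


def derivations(target):
    """Yield (rule sequence, probability) for every derivation of target from S."""
    if target == "":
        yield [""], RULE_PROB[""]
        return
    # S -> aSu: consume one a on the left and one u on the right.
    if len(target) >= 2 and target[0] == "a" and target[-1] == "u":
        for rules, p in derivations(target[1:-1]):
            yield ["aSu"] + rules, RULE_PROB["aSu"] * p
    # S -> aS: consume one a on the left.
    if target[0] == "a":
        for rules, p in derivations(target[1:]):
            yield ["aS"] + rules, RULE_PROB["aS"] * p
    # S -> Su: consume one u on the right.
    if target[-1] == "u":
        for rules, p in derivations(target[:-1]):
            yield ["Su"] + rules, RULE_PROB["Su"] * p


all_derivations = list(derivations("aauu"))
for rules, p in sorted(all_derivations, key=lambda item: -item[1]):
    print(" ".join(r or "eps" for r in rules), f"{p:.5f}")
```

Each printed line is one rule sequence together with its probability; the sequences differ in which bases they treat as paired and in the probability they assign to the same string.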
5. SCFGs & RNA
● Relation to RNA and secondary structure prediction
● generates RNA sequences – strings over {A, C, G, U}
● secondary structure is given by the set of rules used
● assigns probabilities to structures
6. SCFGs & RNA
S → aSu (0.1) ∣ aS (0.4) ∣ Su (0.4) ∣ ε (0.1)
aSu pairs a with u; aS and Su each emit one unpaired base (.)
S ⇒(0.1) aSu ⇒(0.1) aaSuu ⇒(0.1) aauu
  structure: (())
S ⇒(0.4) aS ⇒(0.4) aaS ⇒(0.4) aaSu ⇒(0.4) aaSuu ⇒(0.1) aauu
  structure: ....
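The link between rules and structure can be sketched in code. The example below assumes the usual reading in which aSu emits a base pair while aS and Su emit unpaired bases, and it writes the structure in dot-bracket notation (a convention assumed here, with `(` and `)` for paired bases and `.` for unpaired ones). The function name is illustrative.

```python
def structure_from_rules(choices):
    """Return (sequence, dot-bracket structure) for a list of right-hand sides."""
    left_seq, left_str = [], []    # symbols emitted to the left of S
    right_seq, right_str = [], []  # symbols emitted to the right of S, innermost last
    for rhs in choices:
        if rhs == "aSu":           # a pairs with u
            left_seq.append("a")
            left_str.append("(")
            right_seq.append("u")
            right_str.append(")")
        elif rhs == "aS":          # unpaired a
            left_seq.append("a")
            left_str.append(".")
        elif rhs == "Su":          # unpaired u
            right_seq.append("u")
            right_str.append(".")
        elif rhs == "":            # epsilon ends the derivation
            break
        else:
            raise ValueError(f"unknown rule S -> {rhs!r}")
    seq = "".join(left_seq) + "".join(reversed(right_seq))
    struct = "".join(left_str) + "".join(reversed(right_str))
    return seq, struct


print(structure_from_rules(["aSu", "aSu", ""]))             # ('aauu', '(())')
print(structure_from_rules(["aS", "aS", "Su", "Su", ""]))   # ('aauu', '....')
```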
7. A better example
S → aS ∣ cS ∣ gS ∣ uS
  ∣ Sa ∣ Sc ∣ Sg ∣ Su
  ∣ aSu ∣ cSg ∣ gSu
  ∣ uSa ∣ gSc ∣ uSg
  ∣ SS
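The pairing rules of this grammar allow exactly six ordered base pairs: a-u, u-a, c-g, g-c, g-u and u-g. As a small sketch, the check below tests whether a proposed structure only pairs bases that the grammar can pair. Dot-bracket notation and the names `ALLOWED_PAIRS` and `pairs_allowed` are assumptions made for illustration.

```python
# Ordered (left, right) pairs taken from the rules aSu, cSg, gSu, uSa, gSc, uSg.
ALLOWED_PAIRS = {("a", "u"), ("c", "g"), ("g", "u"),
                 ("u", "a"), ("g", "c"), ("u", "g")}


def pairs_allowed(seq, structure):
    """True if every bracket pair in the dot-bracket string joins an allowed base pair."""
    if len(seq) != len(structure):
        return False
    stack = []  # positions of currently open brackets
    for pos, mark in enumerate(structure):
        if mark == "(":
            stack.append(pos)
        elif mark == ")":
            if not stack:
                return False  # closing bracket without a partner
            left = stack.pop()
            if (seq[left], seq[pos]) not in ALLOWED_PAIRS:
                return False
    return not stack  # every opening bracket must have been closed


print(pairs_allowed("gcaagc", "((..))"))  # True
print(pairs_allowed("aa", "()"))          # False
```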
8. Algorithms
● Determine the most probable structure for an RNA sequence
● Determine the total probability of generating a sequence
(the sum of probabilities of all ways of generating it)
● Given a data set with sequences and associated structures,
determine the rules' probabilities that maximize the total
probability of generating the right structures from the set
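For the second task, the total probability can be computed without listing every derivation by summing over subsequences and caching the results. The sketch below does this for the toy grammar S → aSu (0.1) ∣ aS (0.4) ∣ Su (0.4) ∣ ε (0.1) from the earlier slides; it is a memoized recursion in the spirit of the inside algorithm, specialised to that one grammar rather than a general implementation. The names are illustrative and `""` stands for ε.

```python
from functools import lru_cache

RULE_PROB = {"aSu": 0.1, "aS": 0.4, "Su": 0.4, "": 0.1}


def total_probability(seq):
    """Sum of the probabilities of all derivations of seq from S."""

    @lru_cache(maxsize=None)
    def inside(i, j):
        # Probability that S generates seq[i:j] (half-open interval).
        if i == j:
            return RULE_PROB[""]
        p = 0.0
        if j - i >= 2 and seq[i] == "a" and seq[j - 1] == "u":
            p += RULE_PROB["aSu"] * inside(i + 1, j - 1)
        if seq[i] == "a":
            p += RULE_PROB["aS"] * inside(i + 1, j)
        if seq[j - 1] == "u":
            p += RULE_PROB["Su"] * inside(i, j - 1)
        return p

    return inside(0, len(seq))


print(f"{total_probability('aauu'):.5f}")  # 0.02596
```

The value is larger than either single derivation shown earlier (0.001 and 0.00256) because it adds up every way of generating the string.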
11. Cocke–Younger–Kasami
● Calculate best structure for small subsequences and work
outwards to larger and larger subsequences
● Notations
● Grammar G in Chomsky normal form (CNF) with nonterminals V_1, ..., V_m
● V_1 is the start symbol
● t(x, y, z) is the probability of rule V_x → V_y V_z
● e(x, a) is the probability of rule V_x → a
● score[x, i, j] is the maximum probability of generating
seq[i, j] from V_x
12. CYK
● Rule V_x → seq[i]:
score[x, i, i] = e(x, seq[i])
● Rule V_x → V_y V_z, for some i ≤ k < j:
score[x, i, j] = score[y, i, k] · score[z, k+1, j] · t(x, y, z)
Parse tree: V_x spans seq[i, j]; V_y spans seq[i, k] and V_z spans seq[k+1, j]
13. CYK
score[x, i, j] =
  0                                                        if j < i
  e(x, seq[i])                                             if i = j
  max { score[y, i, k] · score[z, k+1, j] · t(x, y, z) }   if i < j,
      taken over all i ≤ k < j and all rules V_x → V_y V_z
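The recurrence can be filled in bottom-up, from short subsequences to longer ones. The following is a minimal sketch, assuming 0-based indices (so nonterminal 0 is the start symbol and seq[0] is the first base) and dictionaries `t` and `e` keyed as in the notation above. The toy CNF grammar `T`, `E` is hypothetical and not taken from the slides.

```python
def cyk(seq, m, t, e):
    """Fill score[x][i][j]: max probability of generating seq[i..j] from nonterminal x."""
    n = len(seq)
    score = [[[0.0] * n for _ in range(n)] for _ in range(m)]
    # Base case i = j: emission rules V_x -> seq[i].
    for i in range(n):
        for x in range(m):
            score[x][i][i] = e.get((x, seq[i]), 0.0)
    # Longer spans: try every rule V_x -> V_y V_z and every split point k.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for (x, y, z), prob in t.items():
                for k in range(i, j):
                    cand = score[y][i][k] * score[z][k + 1][j] * prob
                    if cand > score[x][i][j]:
                        score[x][i][j] = cand
    return score


# Hypothetical toy grammar: 0 = S, 1 = A, 2 = U.
T = {(0, 1, 2): 0.5,    # S -> A U
     (0, 1, 0): 0.25,   # S -> A S
     (0, 0, 2): 0.25}   # S -> S U
E = {(1, "a"): 1.0,     # A -> a
     (2, "u"): 1.0}     # U -> u

score = cyk("aauu", 3, T, E)
print(score[0][0][3])  # 0.03125
```

The three nested position loops (span length, start i, split k) together with the loop over rules are where the cubic dependence on sequence length comes from.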
17. CYK
Space? O(m · n²)
Time? O(m · r · n³)
Backtracking? O(r · n²)
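Backtracking can be done without storing pointers, by re-deriving at each tree node which rule and split point produced the stored maximum. A parse tree has about n internal nodes and each one scans the rules and up to n split points, which is one plausible reading of the O(r · n²) bound. The sketch below assumes 0-based indices and a hypothetical toy CNF grammar (`T`, `E`); the function names are illustrative.

```python
def fill(seq, m, t, e):
    """CYK table: score[x][i][j] = max probability of seq[i..j] from nonterminal x."""
    n = len(seq)
    score = [[[0.0] * n for _ in range(n)] for _ in range(m)]
    for i in range(n):
        for x in range(m):
            score[x][i][i] = e.get((x, seq[i]), 0.0)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for (x, y, z), prob in t.items():
                for k in range(i, j):
                    cand = score[y][i][k] * score[z][k + 1][j] * prob
                    score[x][i][j] = max(score[x][i][j], cand)
    return score


def traceback(seq, t, score, x, i, j):
    """Rebuild one best parse of seq[i..j] from x; assumes score[x][i][j] > 0."""
    if i == j:
        return (x, seq[i])
    for (px, y, z), prob in t.items():
        if px != x:
            continue
        for k in range(i, j):
            # Same expression as in fill(), so the comparison is exact.
            if score[y][i][k] * score[z][k + 1][j] * prob == score[x][i][j]:
                return (x,
                        traceback(seq, t, score, y, i, k),
                        traceback(seq, t, score, z, k + 1, j))
    raise ValueError("no rule reproduces the stored score")


def best_parse(seq, m, t, e):
    """Return (probability, parse tree), or None if seq cannot be generated."""
    if not seq:
        return None
    score = fill(seq, m, t, e)
    best = score[0][0][len(seq) - 1]
    if best == 0.0:
        return None
    return best, traceback(seq, t, score, 0, 0, len(seq) - 1)


# Hypothetical toy grammar: 0 = S, 1 = A, 2 = U.
T = {(0, 1, 2): 0.5, (0, 1, 0): 0.25, (0, 0, 2): 0.25}
E = {(1, "a"): 1.0, (2, "u"): 1.0}

print(best_parse("aauu", 3, T, E))
```

When several parses share the maximum probability, as happens for `aauu` here, the traceback returns whichever one it meets first.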
18. SCFG design
● Dowell & Eddy (2004)
G1: S → d S d ∣ d S ∣ S d ∣ S S ∣ ε
G2: S → d S d ∣ d L ∣ R d ∣ L S
    L → d S d ∣ d L
    R → R d ∣ ε
G3: S → d S ∣ d S d S ∣ ε
G4: S → d S ∣ T ∣ ε
    T → T d ∣ d S d ∣ T d S d
G5: S → L S ∣ L
    L → d F d ∣ d
    F → d F d ∣ L S