Stochastic Definite Clause Grammars
InterLogOnt, Nov 24
Christian Theil Have
What and why?
● DCG Syntax
– Convenient
– Expressive
– Flexible
● Probabilistic model
– Polynomial parsing
– Parameter learning
– Robust
DCG Grammar rules
● Definite Clause Grammars
– Grammar formalism on top of Prolog.
– Production rules with unification variables
– Context-sensitive (in fact stronger: with arbitrary Prolog in rule bodies, DCGs are Turing-equivalent)
– Exploits unification semantics of Prolog
Simple DCG grammar and its difference list representation
sentence --> subject(N), verb(N), object.
subject(sing) --> [he].
subject(plur) --> [they].
object --> [cake].
object --> [food].
verb(sing) --> [eats].
verb(plur) --> [eat].
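As a sketch of the difference list representation, the standard DCG expansion turns each rule into a Prolog clause with two extra list arguments (variable names here are illustrative):

sentence(S0, S) :-
    subject(N, S0, S1), verb(N, S1, S2), object(S2, S).
subject(sing, [he|S], S).
subject(plur, [they|S], S).
object([cake|S], S).
object([food|S], S).
verb(sing, [eats|S], S).
verb(plur, [eat|S], S).

A query like sentence([he,eats,cake], []) then succeeds, while sentence([they,eats,cake], []) fails on the number agreement.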
Stochastic Definite Clause Grammars
● Implemented as a DCG compiler
– With some extensions to DCG syntax
● Transforms a DCG (grammar) into a stochastic
logic program implemented in PRISM.
● Probabilistic inferences and parameter learning
are then performed using PRISM
(S)DCG compilation: DCG grammar → PRISM program
● PRISM - http://sato-www.cs.titech.ac.jp/prism/
● Extends Prolog with random variables (msws in PRISM lingo)
● Performs probabilistic inferences over such programs:
– Probability calculation: probability of a derivation
– Viterbi: find the most probable derivation
– EM learning: learn parameters from a set of example goals
PRISM program example: Bernoulli trials

values(coin, [heads,tails]).   % Declare the outcomes of the switch
:- set_sw(coin, 0.6+0.4).      % P(heads)=0.6, P(tails)=0.4

ber(N,[R|Y]) :-
    N > 0,
    msw(coin,R),               % Probabilistic choice
    N1 is N - 1,
    ber(N1,Y).                 % Recursion
ber(0,[]).
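For illustration, the inferences listed above can be run on this program with PRISM's built-ins (a sketch; exact output format depends on the PRISM version):

?- prob(ber(2,[heads,tails]), P).      % probability of a derivation (0.6*0.4)
?- viterbig(ber(2,Outcome), P).        % most probable derivation
?- learn([ber(2,[heads,tails]), ber(2,[heads,heads])]).  % EM learning from example goals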
The probabilistic model

One random variable encodes the probability of expansion for rules with the same functor/arity:

s(N) ==> np(N).
s(N) ==> np(N),vp(N).

The choice is made by a selection rule; the selected rule is invoked through the sampled outcome:

s(A,B) :- msw(s,Outcome), s(Outcome, A, B).   % Selection rule
s(s1, A, B) :- np(_, A, B).
s(s2, A, B) :- np(N, A, D), vp(N, D, B).
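Behind the scenes the compiler must also declare the random variable. A minimal sketch of what this amounts to in PRISM (outcome names s1/s2 follow the slide; the actual generated code may differ):

values(s, [s1, s2]).        % one outcome per s/1 rule
:- set_sw(s, [0.5, 0.5]).   % initial expansion probabilities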
Failed derivations

Since SDCG embodies unification constraints, some derivations may fail.
We only observe the successful derivations in sample data.
If the training algorithm only considers successful derivations, it will converge to a wrong probability distribution (missing probability mass).

In PRISM this is handled using the fgEM algorithm, which is based on Cussens' Failure-Adjusted Maximization (FAM) algorithm.
A "failure program" which traces all derivations is derived using First Order Compilation (FOC), and the probabilities of failed derivations are estimated as part of the fgEM algorithm.
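A minimal sketch of how unification failure loses probability mass, assuming a uniform s switch over two rules where one branch cannot succeed:

s(A,B) :- msw(s,Outcome), s(Outcome,A,B).
s(s1,A,B) :- np(sg,A,B).   % can succeed
s(s2,A,B) :- np(pl,A,B).   % suppose np never derives plural: always fails
% Half of all sampled derivations fail, so P(success) = 0.5. Training on
% successful derivations alone renormalizes this mass away and biases the
% estimate of the s switch; fgEM estimates the failure probability instead.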
Unification failure issues

Infinite/long derivation paths:
● Impossible/difficult to derive the failure program.
● Workaround: SDCG has an option which limits the depth of derivations.
● Still: the size of the failure program is very much an issue.

FOC requirement - "universally quantified clauses":
● Not the case with difference lists: 'C'([X|Y], X, Y).
● Workaround 1:
– Trick the first order compiler by manually adding implications after the program is partly compiled.
– Works empirically, but may be dubious.
● Workaround 2:
– Append-based grammar (see the sketch below).
– Works, but has inherent inefficiencies.
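A sketch of the append-based idea (an assumption about the general encoding, not SDCG's exact generated code): each nonterminal takes the whole phrase it covers, and concatenation becomes an explicit append/3 call instead of the 'C'/3 difference list connective:

s(L) :- append(L1, L2, L), np(L1), vp(L2).
np([he]).
vp([eats]).

The inherent inefficiency: append/3 blindly enumerates all splits of the input, where difference lists thread the remaining input deterministically.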
SDCG syntax extensions

● SDCG extends the usual DCG syntax
– Compatible with DCG (a superset)
● Regular expression operators
– Convenient rule recursion
● Macros
– Allow writing rules as templates which are filled out according to certain rules
● Conditioned rules
– Convenient expression of higher-order HMMs
Regular expression operators

Regular expression operators can be associated with rule constituents:

name ==> ?(title), +(firstname), *(lastname).

?  the constituent may be repeated zero or one time
*  the constituent may be repeated zero or more times
+  the constituent may be repeated one or more times

The constituent in the original rule is replaced with a substitute which refers to intermediary rules that implement the regular expression:

regex_sub ==> [].
regex_sub ==> original_constituent.
regex_sub ==> regex_sub, regex_sub.

Limitation: Cannot be used in rules with unification variables.
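For instance, the name rule above might expand along these lines (the intermediary rule names are hypothetical; SDCG generates its own):

name ==> title_opt, firstname_plus, lastname_star.
title_opt ==> [].
title_opt ==> title.
firstname_plus ==> firstname.
firstname_plus ==> firstname, firstname_plus.
lastname_star ==> [].
lastname_star ==> lastname, lastname_star.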
Macros

Special goals prefixed with @ are treated as macros.
Grammar rules with macros are dynamically expanded.

Example:

word(he,sg,masc).
word(she,sg,fem).

number(Word,Number) :- word(Word,Number,_).
gender(Word,Gender) :- word(Word,_,Gender).

expand_mode(number(-, +)).   % determines which variables to keep
expand_mode(gender(-, +)).

word(@number(Word, N), @gender(Word,G)) ==>
    @wordlist(Word, WordList).

A meta rule is created and called to find all answers:

exp(Word, N, G, WordList) :- number(Word,N), gender(Word,G), wordlist(Word,WordList).

The resulting expanded rules:

word(sg,masc) ==> [ he ].
word(sg,fem) ==> [ she ].
Conditioned rules

A conditioned rule takes the form:

name(F1,F2,...,Fn) | V1,V2,...,Vn ==> C1,C2,...,Cn.

The | operator can be seen as a guard which ensures that the rule is only expanded if the conditions V1..Vn unify with F1..Fn.

It is possible to specify which variables must unify using a condition_mode:

n(A,B,C) | x,y ==> c1, c2.

Conditioned rules are grouped by non-terminal name and arity, and all rules in a group must have the same number of conditions.

Probabilistic semantics: a distinct probability distribution for each distinct set of conditions.
Model without conditioning:

n ==> n1.
n ==> n2.
n1 ==> ...
...

Model with conditioning:

n|a ==> n1(X).
n|a ==> n2(X).
n|b ==> n1(X).
n|b ==> n2(X).
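A sketch of the PRISM-level consequence, with illustrative switch names: each distinct condition value gets its own switch, hence its own distribution:

values(n(a), [n1, n2]).   % distribution over the n rules under condition a
values(n(b), [n1, n2]).   % a separate distribution under condition b
% P(n ==> n1(X) | a) can now differ from P(n ==> n1(X) | b); without
% conditioning there is a single switch values(n, [n1, n2]).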
Selection using unification

Example: a simple toy grammar.

start ==> s(N).
s(N) ==> np(N).
s(N) ==> np(N),vp(N).
np(N) ==> n(sg),n(N).
np(N) ==> n(N).
vp(N) ==> v(N),np(N).
vp(N) ==> v(N).

n(sg) ==> [time].
n(pl) ==> [flies].
v(sg) ==> [flies].
v(sg) ==> [crawls].
v(pl) ==> [fly].
Probability of a sentence:

| ?- prob(start([time,flies],[],Tree), P).
P = 0.083333333333333 ?

The most probable parse:

| ?- viterbig(start([time,flies],[],Tree), P).
Tree = [start,[[s(pl),[[np(pl),[[n(sg),[]],[n(pl),[]]]]]]]]
P = 0.0625 ?

Most probable parses (indeed all two):

| ?- n_viterbig(10,start([time,flies],[],Tree), P).
Tree = [start,[[s(pl),[[np(pl),[[n(sg),[]],[n(pl),[]]]]]]]]
P = 0.0625 ?;
Tree = [start,[[s(sg),[[np(sg),[[n(sg),[]]]],[vp(sg),[[v(sg),[]]]]]]]]
P = 0.020833333333333 ?;
More interesting example

A simple part-of-speech tagger: a fully connected first-order HMM.

tag_word(Previous, @tag(Current), [Current|TagsRest]) | @tag(SomeTag) ==> ...
[Figure: the tag set and word list used by the tagger]
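The rule body is cut off on the slide; a hedged guess at how such a tagger rule could be completed, using the macro and conditioning mechanisms above (this completion is an assumption, not the author's original code):

% Hypothetical completion: emit a word for the current tag, then recurse
% with Current as the new previous tag. The @tag guard gives one
% distribution per previous tag, i.e. the HMM transition probabilities.
tag_word(Previous, @tag(Current), [Current|TagsRest]) | @tag(SomeTag) ==>
    word_for_tag(Current),        % hypothetical lexical nonterminal
    tag_word(Current, _, TagsRest).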