Simple Effective Decipherment via Combinatorial Optimization

Taylor Berg-Kirkpatrick and Dan Klein
Computer Science Division
University of California at Berkeley
{tberg, klein}@cs.berkeley.edu
Abstract

We present a simple objective function that, when optimized, yields accurate solutions to both decipherment and cognate pair identification problems. The objective simultaneously scores a matching between two alphabets and a matching between two lexicons, each in a different language. We introduce a simple coordinate descent procedure that efficiently finds effective solutions to the resulting combinatorial optimization problem. Our system requires only a list of words in both languages as input, yet it competes with and surpasses several state-of-the-art systems that are both substantially more complex and make use of more information.

1 Introduction

Decipherment induces a correspondence between the words in an unknown language and the words in a known language. We focus on the setting where a close correspondence between the alphabets of the two languages exists, but is unknown. Given only two lists of words, the lexicons of both languages, we attempt to induce the correspondence between alphabets and to identify the cognate pairs present in the lexicons. The system we propose accomplishes this by defining a simple combinatorial optimization problem that is a function of both the alphabet and cognate matchings, and then induces correspondences by optimizing the objective using a block coordinate descent procedure.

There is a range of past work that has variously investigated cognate detection (Kondrak, 2001; Bouchard-Côté et al., 2007; Bouchard-Côté et al., 2009; Hall and Klein, 2010), character-level decipherment (Knight and Yamada, 1999; Knight et al., 2006; Snyder et al., 2010; Ravi and Knight, 2011), and bilingual lexicon induction (Koehn and Knight, 2002; Haghighi et al., 2008). We consider a common element, which is a model wherein there are character-level correspondences and word-level correspondences, with the word matching parameterized by the character one. This approach subsumes a range of past tasks, though of course past work has specialized in interesting ways. Past work has emphasized the modeling aspect; here we use a parametrically simplistic model and instead emphasize inference.

2 Decipherment as Two-Level Optimization

Our method represents two matchings, one at the alphabet level and one at the lexicon level. A vector of variables x specifies a matching between alphabets. For each character i in the source alphabet and each character j in the target alphabet we define an indicator variable x_ij that is on if and only if character i is mapped to character j. Similarly, a vector y represents a matching between lexicons. For word u in the source lexicon and word v in the target lexicon, the indicator variable y_uv denotes that u maps to v. Note that the matchings need not be one-to-one.

We define an objective function on the matching variables as follows. Let EditDist(u, v; x) denote the edit distance between source word u and target word v given alphabet matching x. Let the length of word u be l_u and the length of word v be l_v. This edit distance depends on x in the following way. Insertions and deletions always cost a constant ε.[1] Substitutions also cost ε unless the characters are matched in x, in which case the substitution is

[1] In practice we set ε = 1/(l_u + l_v). l_u + l_v is the maximum number of edit operations between words u and v. This normalization ensures that edit distances are between 0 and 1 for all pairs of words.
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 313–321,
Edinburgh, Scotland, UK, July 27–31, 2011. © 2011 Association for Computational Linguistics
free. Now, the objective that we will minimize can be stated simply: Σ_u Σ_v y_uv · EditDist(u, v; x), the sum of the edit distances between the matched words, where the edit distance function is parameterized by the alphabet matching.

Without restrictions on the matchings x and y this objective can always be driven to zero by either mapping all characters to all characters, or matching none of the words. It is thus necessary to restrict the matchings in some way. Let I be the size of the source alphabet and J be the size of the target alphabet. We allow the alphabet matching x to be many-to-many but require that each character participate in no more than two mappings and that the total number of mappings be max(I, J), a constraint we refer to as restricted-many-to-many. These requirements can be encoded with the following linear constraints on x:

    ∀i: Σ_j x_ij ≤ 2
    ∀j: Σ_i x_ij ≤ 2
    Σ_i Σ_j x_ij = max(I, J)

The lexicon matching y is required to be τ-one-to-one. By this we mean that y is an at-most-one-to-one matching that covers proportion τ of the smaller of the two lexicons. Let U be the size of the source lexicon and V be the size of the target lexicon. This requirement can be encoded with the following linear constraints:

    ∀u: Σ_v y_uv ≤ 1
    ∀v: Σ_u y_uv ≤ 1
    Σ_u Σ_v y_uv = τ · min(U, V)

Now we are ready to define the full optimization problem. The first formulation is called the Implicit Matching Objective since it includes an implicit minimization over edit alignments inside the computation of EditDist.

(1) Implicit Matching Objective:

    min_{x,y} Σ_u Σ_v y_uv · EditDist(u, v; x)
    s.t. x is restricted-many-to-many
         y is τ-one-to-one

In order to get a better handle on the shape of the objective, and to develop an efficient optimization procedure, we decompose each edit distance computation and re-formulate the optimization problem in Section 2.2.

2.1 Example

Figure 1 presents both an example matching problem and a diagram of the variables and objective. Here, the source lexicon consists of the English words (cat, bat, cart, rat, cab), and the source alphabet consists of the characters (a, b, c, r, t). The target alphabet is (0, 1, 2, 3). We have used digits as symbols in the target alphabet to make it clear that we treat the alphabets as disjoint. We have no prior knowledge about any correspondence between alphabets, or between lexicons.

The target lexicon consists of the words (23, 1233, 120, 323, 023). The bipartite graphs show a specific setting of the matching variables. The bold edges correspond to the x_ij and y_uv that are one. The matchings shown achieve an edit distance of zero between all matched word pairs except for the pair (cat, 23). The best edit alignment for this pair is also diagrammed. Here, 'a' is aligned to '2', 't' is aligned to '3', and 'c' is deleted and therefore aligned to the null position '#'. Only the initial deletion has a non-zero cost, since all other alignments correspond to substitutions between characters that are matched in x.

2.2 Explicit Objective

Computing EditDist(u, v; x) requires running a dynamic program because of the unknown edit alignments; here we define those alignments z explicitly, which makes EditDist(u, v; x) easy to write explicitly at the cost of more variables. However, by writing the objective in an explicit form that refers to these edit variables, we are able to describe an efficient block coordinate descent procedure that can be used for optimization.

EditDist(u, v; x) is computed by minimizing over the set of monotonic alignments between the characters of the source word u and the characters of the target word v. Let u_n be the character at the nth position of the source word u, and similarly for
[Figure 1 here: a diagram with panels "Alphabet Matching", "Lexicon Matching", "Edit Distance", and "Matching Problem", showing the bipartite matchings between the example alphabets and lexicons and the edit alignment for the pair (cat, 23).]

Figure 1: An example problem displaying source and target lexicons and alphabets, along with specific matchings. The variables involved in the optimization problem are diagrammed. x are the alphabet matching indicator variables, y are the lexicon matching indicator variables, and z are the edit alignment indicator variables. The index u refers to a word in the source lexicon, v refers to a word in the target lexicon, i refers to a character in the source alphabet, and j refers to a character in the target alphabet. n and m refer to positions in source and target words respectively. The matching objective function is also shown.
v_m. Let z_uv be the vector of alignment variables for the edit distance computation between source word u and target word v, where entry z_uv,nm indicates whether the character at position n of source word u is aligned to the character at position m of target word v. Additionally, define variables z_uv,n# and z_uv,#m denoting null alignments, which will be used to keep track of insertions and deletions.

    EditDist(u, v; x) = min_{z_uv} ε · [ Sub(z_uv, x) + Del(z_uv) + Ins(z_uv) ]
    s.t. z_uv is monotonic

We define Sub(z_uv, x) to be the number of substitutions between characters that are not matched in x, Del(z_uv) to be the number of deletions, and Ins(z_uv) to be the number of insertions:

    Sub(z_uv, x) = Σ_{n,m} (1 − x_{u_n v_m}) · z_uv,nm
    Del(z_uv) = Σ_n z_uv,n#
    Ins(z_uv) = Σ_m z_uv,#m

Notice that the variable z_uv,nm being turned on indicates a substitute operation, while a z_uv,n# or z_uv,#m being turned on indicates a delete or insert operation. These variables are diagrammed in Figure 1. The requirement that z_uv be a monotonic alignment can be expressed using linear constraints, but in our optimization procedure (described in Section 3) these constraints need not be explicitly represented.

Now we can substitute the explicit edit distance equation into the implicit matching objective (1).
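Concretely, the parameterized edit distance just defined can be computed with the standard dynamic program. The sketch below is illustrative: representing the alphabet matching x as a set of (source character, target character) pairs, and returning only the distance rather than the argmin alignment, are simplifications not taken from the paper.

```python
# Illustrative sketch of EditDist(u, v; x): insertions and deletions cost
# eps = 1/(l_u + l_v); substitutions also cost eps unless the character pair
# is matched in x, in which case they are free.

def edit_dist(u, v, x):
    eps = 1.0 / (len(u) + len(v))   # normalizes distances into [0, 1]
    n, m = len(u), len(v)
    # d[i][j] = min cost of aligning the prefix u[:i] with the prefix v[:j]
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * eps           # delete all of u[:i]
    for j in range(1, m + 1):
        d[0][j] = j * eps           # insert all of v[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if (u[i - 1], v[j - 1]) in x else eps
            d[i][j] = min(d[i - 1][j] + eps,       # deletion
                          d[i][j - 1] + eps,       # insertion
                          d[i - 1][j - 1] + sub)   # substitution
    return d[n][m]

# The pair (cat, 23) from Figure 1: with ('a', '2') and ('t', '3') matched,
# only the deletion of 'c' costs anything: 1/(3 + 2) = 0.2.
x = {('a', '2'), ('t', '3')}
print(edit_dist('cat', '23', x))  # 0.2
```

Because ε = 1/(l_u + l_v) and at most l_u + l_v operations are ever needed, every returned distance falls in [0, 1], matching footnote 1.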
Noticing that the mins and sums commute, we arrive at the explicit form of the matching optimization problem.

(2) Explicit Matching Objective:

    min_{x,y,z} Σ_{u,v} y_uv · ε · [ Sub(z_uv, x) + Del(z_uv) + Ins(z_uv) ]
    s.t. x is restricted-many-to-many
         y is τ-one-to-one
         ∀u,v: z_uv is monotonic

The implicit and explicit optimizations are the same, apart from the fact that the explicit optimization now explicitly represents the edit alignment variables z. Let the explicit matching objective (2) be denoted as J(x, y, z). The relaxation of the explicit problem with 0-1 constraints removed has integer solutions,[2] however the objective J(x, y, z) is non-convex. We thus turn to a block coordinate descent method in the next section in order to find local optima.

3 Optimization Method

We now state a block coordinate descent procedure to find local optima of J(x, y, z) under the constraints on x, y, and z. This procedure alternates between updating y and z to their exact joint optima when x is held fixed, and updating x to its exact optimum when y and z are held fixed.

The pseudocode for the procedure is given in Algorithm 1. Note that the function EditDist returns both the min edit distance e_uv and the argmin edit alignments z_uv. Also note that c_ij is as defined in Section 3.2.

Algorithm 1 Block Coordinate Descent
    Randomly initialize alphabet matching x.
    repeat
        for all u, v do
            (e_uv, z_uv) ← EditDist(u, v; x)
        end for
        y ← argmin_{y τ-one-to-one} Σ_{u,v} y_uv · e_uv        [Hungarian]
        x ← argmax_{x restr.-many-to-many} Σ_{i,j} x_ij · c_ij  [Solve LP]
    until convergence

3.1 Lexicon Matching Update

Let x, the alphabet matching variable, be fixed. We consider the problem of optimizing J(x, y, z) over the lexicon matching variable y and the edit alignments z under the constraint that y is τ-one-to-one and each z_uv is monotonic.

Notice that y simply picks out which edit distance problems affect the objective. The z_uv in each of these edit distance problems can be optimized independently. z_uv that do not have y_uv active have no effect on the objective, and z_uv with y_uv active can be optimized using the standard edit distance dynamic program. Thus, in a first step we compute the U · V edit distances e_uv and best monotonic alignment variables z_uv between all pairs of source and target words using U · V calls to the standard edit distance dynamic program. Altogether, this takes time O((Σ_u l_u) · (Σ_v l_v)).

Now, in a second step we compute the least weighted τ-one-to-one matching y under the weights e_uv. This can be accomplished in time O(max(U, V)^3) using the Hungarian algorithm (Kuhn, 1955). These two steps produce y and z that exactly achieve the optimum value of J(x, y, z) for the given value of x.

3.2 Alphabet Matching Update

Let y and z, the lexicon matching variables and the edit alignments, be fixed. Now, we find the optimal alphabet matching variables x subject to the constraint that x is restricted-many-to-many.

It makes sense that to optimize J(x, y, z) with respect to x we should prioritize mappings x_ij that would mitigate the largest substitution costs in the active edit distance problems. Indeed, with a little algebra it can be shown that solving a maximum weighted matching problem with weights c_ij that count potential substitution costs gives the correct update for x. In particular, c_ij is the total cost of substitution edits in the active edit alignment problems that would result if source character i were not mapped to target character j in the alphabet matching x. This can be written as:

    c_ij = ε · Σ_{u,v} y_uv · Σ_{n,m : u_n = i, v_m = j} z_uv,nm

If x were constrained to be one-to-one, we could again apply the Hungarian algorithm, this time to find a maximum weighted matching under the weights c_ij. Since we have instead allowed restricted-many-to-many alphabet matchings, we turn to linear programming for optimizing x. We can state the update problem as the following linear program (LP), which is guaranteed to have integer solutions:

    max_x Σ_{i,j} x_ij · c_ij
    s.t. ∀i: Σ_j x_ij ≤ 2,  ∀j: Σ_i x_ij ≤ 2
         Σ_i Σ_j x_ij = max(I, J)

In experiments we used the GNU Linear Programming Kit (GLPK) to solve the LP and update the alphabet matching x. This update yields matching variables x that achieve the optimum value of J(x, y, z) for fixed y and z.

[2] This can be shown by observing that optimizing x when y and z are held fixed yields integer solutions (shown in Section 3.2), and similarly for the optimization of y and z when x is fixed (shown in Section 3.1). Thus, every local optimum with respect to these block coordinate updates has integer solutions. The global optimum must be one of these local optima.

3.3 Random Restarts

In practice we found that the block coordinate descent procedure can get stuck at poor local optima. To find better optima, we run the coordinate descent procedure multiple times, initialized each time with a random alphabet matching. We choose the local optimum with the best objective value across all initializations. This approach yielded substantial improvements in achieved objective value.

4 Experiments

We compare our system to three different state-of-the-art systems on three different data sets. We set up experiments that allow for as direct a comparison as possible. In some cases it must be pointed out that the past systems' goals are different from our own, and we will be comparing in a different way than the respective work was intended. The three systems make use of additional, or slightly different, sources of information.

4.1 Phonetic Cognate Lexicons

The first data set we evaluate on consists of 583 triples of phonetic transcriptions of cognates in Spanish, Portuguese, and Italian. The data set was introduced by Bouchard-Côté et al. (2007). For a given pair of languages the task is to determine the mapping between lexicons that correctly maps each source word to its cognate in the target lexicon. We refer to this task and data set as ROMANCE.

Hall and Klein (2010) presented a state-of-the-art system for the task of cognate identification and evaluated on this data set. Their model explicitly represents parameters for phonetic change between languages and their parents in a phylogenetic tree. They estimate parameters and infer the pairs of cognates present in all three languages jointly, while we consider each pair of languages in turn.

Their model has similarities with our own in that it learns correspondences between the alphabets of pairs of languages. However, their correspondences are probabilistic and implicit while ours are hard and explicit. Their model also differs from our own in a key way. Notice that the phonetic alphabets for the three languages are actually the same. Since phonetic change occurs gradually across languages, a helpful prior on the correspondence is to favor the identity. Their model makes use of such a prior. Our model, on the other hand, is unaware of any prior correspondence between alphabets and does not make use of this additional information about phonetic change.

Hall and Klein (2010) also evaluate their model on lexicons that do not have a perfect cognate mapping. This scenario, where not every word in one language has a cognate in another, is more realistic. They produced a data set with this property by pruning words from the ROMANCE data set until only about 75% of the words in each source lexicon have cognates in each target lexicon. We refer to this task and data set as PARTIAL ROMANCE.

4.2 Lexicons Extracted from Corpora

Next, we evaluate our model on a noisier data set. Here the lexicons in the source and target languages are extracted from corpora by taking the top 2,000 words in each corpus. In particular, we used the English and Spanish sides of the Europarl parallel corpus (Koehn, 2005). To make this setup more realistic (though fairly comparable), we ensured that the corpora were non-parallel by using the first 50K sentences on the English side and the second 50K sentences on the Spanish side. To generate a gold cognate matching we used the intersected HMM alignment model of Liang et al. (2008) to align the full parallel corpus. From this alignment we extracted a translation lexicon by adding an entry for each word pair with the property that the English word was aligned to the Spanish word in over 10% of the alignments involving the English word. To reduce this translation lexicon down to a cognate matching we went through the translation lexicon by hand and removed any pair of words that we judged not to be cognates. The resulting gold matching contains cognate mappings in the English lexicon for 1,026 of the words in the Spanish lexicon. This means that only about 50% of the words in the English lexicon have cognates in the Spanish lexicon. We evaluate on this data set by computing precision and recall for the number of English words that are mapped to a correct cognate. We refer to this task and data set as EUROPARL.

On this data set, we compare against the state-of-the-art orthographic system presented in Haghighi et al. (2008). Haghighi et al. (2008) present several systems that are designed to extract translation lexicons from non-parallel corpora by learning a correspondence between their monolingual lexicons. Since our system specializes in matching cognates and does not take into account additional information from corpus statistics, we compare against the version of their system that only takes into account orthographic features and is thus best suited for cognate detection. Their system requires a small seed of correct cognate pairs. From this seed the system learns a projection using canonical correlation analysis (CCA) into a canonical feature space that allows feature vectors from source words and target words to be compared. Once in this canonical space, similarity metrics can be computed and words can be matched using a bipartite matching algorithm. The process is iterative, adding cognate pairs to the seed lexicon gradually and each time re-computing a refined projection. Our system makes no use of a seed lexicon whatsoever.

Both our system and the system of Haghighi et al. (2008) must solve bipartite matching problems between the two lexicons. For this data set, the lexicons are large enough that finding the exact solution can be slow. Thus, in all experiments on this data set, we instead use a greedy competitive linking algorithm that runs in time O(U^2 V^2 log(UV)).

Again, for this data set it is reasonable to expect that many characters will map to themselves in the best alphabet matching. The alphabets are not identical, but are far from disjoint. Neither our system nor that of Haghighi et al. (2008) makes use of this expectation. As far as both systems are concerned, the alphabets are disjoint.

4.3 Decipherment

Finally, we evaluate our model on a data set where a main goal is to decipher an unknown correspondence between alphabets. We attempt to learn a mapping from the alphabet of the ancient Semitic language Ugaritic to the alphabet of Hebrew, and at the same time learn a matching between Hebrew words in a Hebrew lexicon and their cognates in a Ugaritic lexicon. This task is related to the task attempted by Snyder et al. (2010). The data set consists of a Ugaritic lexicon of 2,214 words, each of which has a Hebrew cognate, the lexicon of their 2,214 Hebrew cognates, and a gold cognate dictionary for evaluation. We refer to this task and data set as UGARITIC.

The non-parametric Bayesian system of Snyder et al. (2010) assumes that the morphology of Hebrew is known, making use of an inventory of suffixes, prefixes, and stems derived from the words in the Hebrew Bible. It attempts to learn a correspondence between the morphology of Ugaritic and that of Hebrew while reconstructing cognates for Ugaritic words. This is a slightly different goal than that of our system, which learns a correspondence between lexicons. Snyder et al. (2010) run their system on a set of 7,386 Ugaritic words, the same set that we extracted our 2,214 Ugaritic words with Hebrew cognates from. We evaluate the accuracy of the lexicon matching produced by our system on these 2,214 Ugaritic words, and so do they, measuring the number of correctly reconstructed cognates.

Model                   τ      Accuracy
Hall and Klein (2010)   –      90.3
MATCHER                 1.0    90.1

Table 1: Results on the ROMANCE data set. Our system is labeled MATCHER. We compare against the phylogenetic cognate detection system of Hall and Klein (2010). We show the pairwise cognate accuracy across all pairs of languages from the following set: Spanish, Portuguese, and Italian.

Model                   τ      Precision  Recall  F1
Hall and Klein (2010)   –      66.9       82.0    73.6
MATCHER                 0.25   99.7       34.0    50.7
                        0.50   93.8       60.2    73.3
                        0.75   81.1       78.0    79.5

Table 2: Results on the PARTIAL ROMANCE data set. Our system is labeled MATCHER. We compare against the phylogenetic cognate detection system of Hall and Klein (2010). We show the pairwise cognate precision, recall, and F1 across all pairs of languages from the following set: Spanish, Portuguese, and Italian. Note that approximately 75% of the source words in each of the source lexicons have cognates in each of the target lexicons.

By restricting the source and target lexicons to sets of cognates we have made the task easier. This was necessary, however, because the Ugaritic and Hebrew corpora used by Snyder et al. (2010) are not comparable: only a small proportion of the words in the Ugaritic lexicon have cognates in the lexicon composed of the most frequent Hebrew words.
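The greedy competitive linking algorithm substituted for the exact Hungarian algorithm on the larger data sets can be sketched as follows. This is a minimal sketch: the dict-of-distances input encoding and the rounding used for the τ budget are illustrative assumptions, not details from the paper.

```python
# Sketch of greedy competitive linking for the tau-one-to-one lexicon matching:
# sort candidate pairs by edit distance, then greedily link words that are both
# still unmatched, stopping once tau of the smaller lexicon is covered.

def competitive_linking(e, tau):
    """e: dict mapping (u, v) word pairs to edit distances."""
    num_u = len({u for u, _ in e})
    num_v = len({v for _, v in e})
    k = int(round(tau * min(num_u, num_v)))     # tau-one-to-one coverage
    matched, used_u, used_v = [], set(), set()
    for (u, v), cost in sorted(e.items(), key=lambda kv: kv[1]):
        if u not in used_u and v not in used_v:
            matched.append((u, v))
            used_u.add(u)
            used_v.add(v)
            if len(matched) == k:
                break
    return matched

e = {('cat', '23'): 0.2, ('cat', '120'): 0.9,
     ('cab', '23'): 0.6, ('cab', '120'): 0.1}
print(competitive_linking(e, 1.0))  # [('cab', '120'), ('cat', '23')]
```

Each word participates in at most one link, so the result is at-most-one-to-one by construction; unlike the Hungarian algorithm, however, the greedy order gives no optimality guarantee.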
Here, the alphabets really are disjoint. The symbols in the two languages look nothing alike. There is no obvious prior expectation about how the alphabets will be matched. We evaluate against a well-established correspondence between the alphabets of Ugaritic and Hebrew. The Ugaritic alphabet contains 30 characters, the Hebrew alphabet contains 22 characters, and the gold matching contains 33 entries. We evaluate the learned alphabet matching by counting the number of recovered entries from the gold matching.

Due to the size of the source and target lexicons, we again use the greedy competitive linking algorithm in place of the exact Hungarian algorithm in experiments on this data set.

5 Results

We present results on all four data sets: ROMANCE, PARTIAL ROMANCE, EUROPARL, and UGARITIC. On the ROMANCE and PARTIAL ROMANCE data sets we compare against the numbers published by Hall and Klein (2010). We ran an implementation of the orthographic system presented by Haghighi et al. (2008) on our EUROPARL data set. We compare against the numbers published by Snyder et al. (2010) on the UGARITIC data set. We refer to our system as MATCHER in result tables and discussion.

5.1 ROMANCE

The results of running our system, MATCHER, on the ROMANCE data set are shown in Table 1. We recover 88.9% of the correct cognate mappings on the pair Spanish and Italian, 85.7% on Italian and Portuguese, and 95.6% on Spanish and Portuguese. Our average accuracy across all pairs of languages is 90.1%. The phylogenetic system of Hall and Klein (2010) achieves an average accuracy of 90.3% across all pairs of languages. Our system achieves accuracy comparable to that of the phylogenetic system, despite the fact that the phylogenetic system is substantially more complex and makes use of an informed prior on alphabet correspondences.

The alphabet matching learned by our system is interesting to analyze. For the pairing of Spanish and Portuguese it recovers phonetic correspondences that are well known. Our system learns the correct cognate pairing of Spanish /bino/ to Portuguese /vinu/. This pair exemplifies two common phonetic correspondences for Spanish and Portuguese: the Spanish /o/ often transforms to a /u/ in Portuguese, and Spanish /b/ often transforms to /v/ in Portuguese. Our system, which allows many-to-many alphabet correspondences, correctly identifies the mappings /o/ → /u/ and /b/ → /v/ as well as the identity mappings /o/ → /o/ and /b/ → /b/, which are also common.

5.2 PARTIAL ROMANCE

In Table 2 we present the results of running our system on the PARTIAL ROMANCE data set. In this data set, only approximately 75% of the source words in each of the source lexicons have cognates in each of the target lexicons. The parameter τ trades off precision and recall. We show results for three different settings of τ: 0.25, 0.5, and 0.75.

Our system achieves an average precision across language pairs of 99.7% at an average recall of 34.0%. For the pairs Italian – Portuguese and Spanish – Portuguese, our system achieves perfect precision at recalls of 32.2% and 38.1% respectively. The best average F1 achieved by our system is 79.5%, which surpasses the average F1 of 73.6 achieved by the phylogenetic system of Hall and Klein (2010).

The phylogenetic system observes the phylogenetic tree of ancestry for the three languages and explicitly models cognate evolution and survival in a 'survival' tree. One might expect the phylogenetic system to achieve better results on this data set, where part of the task is identifying which words do not have cognates. It is surprising that our model does so well given its simplicity.

Model                    Seed  τ     Precision  Recall  F1
Haghighi et al. (2008)   20    0.1   72.0       14.0    23.5
                         20    0.25  63.6       31.0    41.7
                         20    0.5   44.8       43.7    44.2
                         50    0.1   90.5       17.6    29.5
                         50    0.25  75.4       36.7    49.4
                         50    0.5   56.4       55.0    55.7
MATCHER                  0     0.1   93.5       18.2    30.5
                         0     0.25  83.2       40.5    54.5
                         0     0.5   56.5       55.1    55.8

Table 3: Results on the EUROPARL data set. Our system is labeled MATCHER. We compare against the bilingual lexicon induction system of Haghighi et al. (2008). We show the cognate precision, recall, and F1 for the pair of languages English and Spanish using lexicons extracted from corpora. Note that approximately 50% of the words in the English lexicon have cognates in the Spanish lexicon.

5.3 EUROPARL

Table 3 presents results for our system on the EUROPARL data set across three different settings of τ: 0.1, 0.25, and 0.5. We compare against the orthographic system presented by Haghighi et al. (2008), across the same three settings of τ, and with two different sizes of seed lexicon: 20 and 50. In this data set, only approximately 50% of the source words have cognates in the target lexicon.

Our system achieves a precision of 93.5% at a recall of 18.2%, and a best F1 of 55.8%. Using a seed matching of 50 word pairs, the orthographic system of Haghighi et al. (2008) achieves a best F1 of 55.7%. Using a seed matching of 20 word pairs, it achieves a best F1 of 44.2%. Our system outperforms the orthographic system even though the orthographic system makes use of important additional information: a seed matching of correct cognate pairs. The results show that as the size of this seed is decreased, the performance of the orthographic system degrades.

Model                  τ     Lexicon Acc.  Alphabet Acc.
Snyder et al. (2010)   –     60.4*         29/33*
MATCHER                1.0   90.4          28/33

Table 4: Results on the UGARITIC data set. Our system is labeled MATCHER. We compare against the decipherment system of Snyder et al. (2010). *Note that results for this system are on a somewhat different task. In particular, the MATCHER system assumes the inventories of cognates in both Hebrew and Ugaritic are known, while the system of Snyder et al. (2010) reconstructs cognates assuming only that the morphology of Hebrew is known, which is a harder task. We show cognate pair identification accuracy and alphabet matching accuracy for Ugaritic and Hebrew.

5.4 UGARITIC

In Table 4 we present results on the UGARITIC data set. We evaluate both the accuracy of the lexicon matching learned by our system and the accuracy of the alphabet matching. Our system achieves a lexicon accuracy of 90.4% while correctly identifying 28 out of the 33 gold character mappings.

We also present the results for the decipherment model of Snyder et al. (2010) in Table 4. Note that while the evaluation data sets for our two models are the same, the tasks are very different. In particular, our system assumes the inventories of cognates in both Hebrew and Ugaritic are known, while the system of Snyder et al. (2010) reconstructs cognates assuming only that the morphology of Hebrew is known, which is a harder task. Even so, the results show that our system is effective at decipherment when semantically similar lexicons are available.

6 Conclusion

We have presented a simple combinatorial model that simultaneously incorporates both a matching between alphabets and a matching between lexicons. Our system is effective at both the tasks of cognate identification and alphabet decipherment, requiring only lists of words in both languages as input.
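For concreteness, the alphabet-matching update of Section 3.2 can also be sketched in code. The paper solves this step exactly as an LP with GLPK; the greedy selection below is a simplified stand-in that respects the same constraints (each character in at most two mappings, at most max(I, J) mappings overall) but is not guaranteed to find the LP optimum.

```python
# Sketch of the alphabet-matching (x) update of Section 3.2. The weight
# c[(i, j)] totals, over the currently matched word pairs, the substitution
# cost that mapping character i to character j would make free. Mappings are
# then chosen greedily by weight under the restricted-many-to-many constraints.
from collections import Counter

def alphabet_update(matched_pairs, alignments, I, J, eps):
    c = Counter()
    for u, v in matched_pairs:
        for n, m in alignments[(u, v)]:   # aligned (source, target) positions
            c[(u[n], v[m])] += eps
    budget = max(I, J)                    # at most max(I, J) mappings in total
    deg_src, deg_tgt = Counter(), Counter()
    x = set()
    for (i, j), _ in c.most_common():
        if deg_src[i] < 2 and deg_tgt[j] < 2 and len(x) < budget:
            x.add((i, j))
            deg_src[i] += 1
            deg_tgt[j] += 1
    return x

# Toy update: one matched pair (cat, 23) whose alignment pairs up a-2 and t-3.
pairs = [('cat', '23')]
aligns = {('cat', '23'): [(1, 0), (2, 1)]}
x = alphabet_update(pairs, aligns, I=5, J=4, eps=0.2)
print(sorted(x))  # [('a', '2'), ('t', '3')]
```

In the full procedure these weights would be fed to the LP of Section 3.2 rather than to a greedy pass, so the sketch shows the weight computation faithfully while only approximating the selection step.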
References

A. Bouchard-Côté, P. Liang, T.L. Griffiths, and D. Klein. 2007. A probabilistic approach to diachronic phonology. In Proc. of EMNLP.
A. Bouchard-Côté, T.L. Griffiths, and D. Klein. 2009. Improved reconstruction of protolanguage word forms. In Proc. of NAACL.
A. Haghighi, P. Liang, T. Berg-Kirkpatrick, and D. Klein. 2008. Learning bilingual lexicons from monolingual corpora. In Proc. of ACL.
D. Hall and D. Klein. 2010. Finding cognate groups using phylogenies. In Proc. of ACL.
K. Knight and K. Yamada. 1999. A computational approach to deciphering unknown scripts. In Proc. of ACL Workshop on Unsupervised Learning in Natural Language Processing.
K. Knight, A. Nair, N. Rathod, and K. Yamada. 2006. Unsupervised analysis for decipherment problems. In Proc. of COLING/ACL.
P. Koehn and K. Knight. 2002. Learning a translation lexicon from monolingual corpora. In Proc. of ACL Workshop on Unsupervised Lexical Acquisition.
P. Koehn. 2005. Europarl: a parallel corpus for statistical machine translation. In Proc. of Machine Translation Summit.
G. Kondrak. 2001. Identifying cognates by phonetic and semantic similarity. In Proc. of NAACL.
H.W. Kuhn. 1955. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly.
P. Liang, D. Klein, and M.I. Jordan. 2008. Agreement-based learning. In Proc. of NIPS.
S. Ravi and K. Knight. 2011. Bayesian inference for Zodiac and other homophonic ciphers. In Proc. of ACL.
B. Snyder, R. Barzilay, and K. Knight. 2010. A statistical model for lost language decipherment. In Proc. of ACL.