SlideShare a Scribd company logo
1 of 9
Download to read offline
Simple Effective Decipherment via Combinatorial Optimization

                                Taylor Berg-Kirkpatrick and Dan Klein
                                       Computer Science Division
                                   University of California at Berkeley
                                     {tberg, klein}@cs.berkeley.edu


                      Abstract                               2011), and bilingual lexicon induction (Koehn and
                                                             Knight, 2002; Haghighi et al., 2008). We consider
     We present a simple objective function that             a common element, which is a model wherein there
     when optimized yields accurate solutions to             are character-level correspondences and word-level
     both decipherment and cognate pair identifica-           correspondences, with the word matching parame-
     tion problems. The objective simultaneously
                                                             terized by the character one. This approach sub-
     scores a matching between two alphabets and
     a matching between two lexicons, each in a              sumes a range of past tasks, though of course past
     different language. We introduce a simple               work has specialized in interesting ways.
     coordinate descent procedure that efficiently               Past work has emphasized the modeling as-
     finds effective solutions to the resulting com-          pect, where here we use a parametrically simplistic
     binatorial optimization problem. Our system             model, but instead emphasize inference.
     requires only a list of words in both languages
     as input, yet it competes with and surpasses            2 Decipherment as Two-Level
     several state-of-the-art systems that are both
     substantially more complex and make use of
                                                               Optimization
     more information.                                       Our method represents two matchings, one at the al-
                                                             phabet level and one at the lexicon level. A vector of
1   Introduction                                             variables x specifies a matching between alphabets.
                                                             For each character i in the source alphabet and each
Decipherment induces a correspondence between                character j in the target alphabet we define an indi-
the words in an unknown language and the words               cator variable xij that is on if and only if character i
in a known language. We focus on the setting where           is mapped to character j. Similarly, a vector y rep-
a close correspondence between the alphabets of the          resents a matching between lexicons. For word u in
two languages exists, but is unknown. Given only             the source lexicon and word v in the target lexicon,
two lists of words, the lexicons of both languages,          the indicator variable yuv denotes that u maps to v.
we attempt to induce the correspondence between              Note that the matchings need not be one-to-one.
alphabets and identify the cognates pairs present in            We define an objective function on the matching
the lexicons. The system we propose accomplishes             variables as follows. Let E DIT D IST(u, v; x) denote
this by defining a simple combinatorial optimiza-             the edit distance between source word u and target
tion problem that is a function of both the alphabet         word v given alphabet matching x. Let the length
and cognate matchings, and then induces correspon-           of word u be lu and the length of word w be lw .
dences by optimizing the objective using a block co-         This edit distance depends on x in the following
ordinate descent procedure.                                  way. Insertions and deletions always cost a constant
   There is a range of past work that has var-                .1 Substitutions also cost unless the characters
iously investigated cognate detection (Kondrak,              are matched in x, in which case the substitution is
2001; Bouchard-Cˆ t´ et al., 2007; Bouchard-Cˆ t´
                    oe                           oe     1
                                                          In practice we set = lu +lv . lu + lv is the maximum
                                                                                   1
et al., 2009; Hall and Klein, 2010), character-level number of edit operations between words u and v. This nor-
decipherment (Knight and Yamada, 1999; Knight malization insures that edit distances are between 0 and 1 for
et al., 2006; Snyder et al., 2010; Ravi and Knight, all pairs of words.
                                                  313

      Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 313–321,
            Edinburgh, Scotland, UK, July 27–31, 2011. c 2011 Association for Computational Linguistics
free. Now, the objective that we will minimize can
be stated simply: u v yuv · E DIT D IST(u, v; x),               In order to get a better handle on the shape of the
the sum of the edit distances between the matched               objective and to develop an efficient optimization
words, where the edit distance function is parame-              procedure we decompose each edit distance compu-
terized by the alphabet matching.                               tation and re-formulate the optimization problem in
   Without restrictions on the matchings x and y                Section 2.2.
this objective can always be driven to zero by either
mapping all characters to all characters, or matching
none of the words. It is thus necessary to restrict             2.1   Example
the matchings in some way. Let I be the size of                 Figure 1 presents both an example matching prob-
the source alphabet and J be the size of the target             lem and a diagram of the variables and objective.
alphabet. We allow the alphabet matching x to
be many-to-many but require that each character                 Here, the source lexicon consists of the English
participate in no more than two mappings and that               words (cat, bat, cart, rat, cab), and
the total number of mappings be max(I, J), a                    the source alphabet consists of the characters (a,
constraint we refer to as restricted-many-to-many.              b, c, r, t). The target alphabet is (0, 1,
The requirements can be encoded with the following
linear constraints on x:                                        2, 3). We have used digits as symbols in the target
                                                                alphabet to make it clear that we treat the alphabets
               ∀i           xij ≤ 2
                                                                as disjoint. We have no prior knowledge about any
                        j                                       correspondence between alphabets, or between lexi-
              ∀j            xij ≤ 2                             cons.
                        i                                          The target lexicon consists of the words (23,
                                      xij = max(I, J)           1233, 120, 323, 023). The bipartite graphs
                        i        j                              show a specific setting of the matching variables.
The lexicon matching y is required to be τ -one-to-             The bold edges correspond to the xij and yuv that
one. By this we mean that y is an at-most-one-to-one            are one. The matchings shown achieve an edit dis-
matching that covers proportion τ of the smaller of             tance of zero between all matched word pairs ex-
the two lexicons. Let U be the size of the source               cept for the pair (cat, 23). The best edit align-
lexicon and V be this size of the target lexicon.
This requirement can be encoded with the following              ment for this pair is also diagrammed. Here, ‘a’
linear constraints:                                             is aligned to ‘2’, ‘t’ is aligned to ‘3’, and ‘c’ is
                                                                deleted and therefore aligned to the null position ‘#’.
             ∀u             yuv ≤ 1                             Only the initial deletion has a non-zero cost since
                    v                                           all other alignments correspond to substitutions be-
             ∀v             yuv ≤ 1                             tween characters that are matched in x.
                    u

                                     yuv = τ min(U, V )         2.2   Explicit Objective
                    u        v
                                                                Computing E DIT D IST(u, v; x) requires running a
  Now we are ready to define the full optimization               dynamic program because of the unknown edit
problem. The first formulation is called the Implicit            alignments; here we define those alignments z ex-
Matching Objective since includes an implicit                   plicitly, which makes the E DIT D IST(u, v; x) easy to
minimization over edit alignments inside the com-               write explicitly at the cost of more variables. How-
putation of E DIT D IST.                                        ever, by writing the objective in an explicit form that
                                                                refers to these edit variables, we are able to describe
(1) Implicit Matching Objective:                                a efficient block coordinate descent procedure that
                                                                can be used for optimization.
      min                    yuv · E DIT D IST(u, v; x)            E DIT D IST(u, v; x) is computed by minimizing
       x,y
              u     v                                           over the set of monotonic alignments between the
        s.t. x is restricted-many-to-many                       characters of the source word u and the characters
                                                                of the target word v. Let un be the character at the
             y is τ -one-to-one                                 nth position of the source word u, and similarly for
                                                          314
Alphabet Matching                       Matching Problem
            a          0                                 
            b               1                     min                  yuv · EditDist(u, v; x)
                                                   x,y
                                                             u    v
            c               2
                                                   s.t. x is restricted-many-to-many
            r               3
                                                         y is τ -one-to-one
            t    xij

          Lexicon Matching                   Edit Distance
          cat               23
                                             #           #        EditDist(u, v; x) =
          bat               1233                                                                             Substitution
                                              c                                       




                                                                                                           
                                                         2
      cart                  120                                       minzuv     ·           (1 − xun vm )zuv,nm
                                              a          3             s.t.           n   m
                                                                                                                                 
          rat               323                                   zuv is monotonic
                                                                                              +         zuv,n# +       zuv,#m
                                              t
       cab                  023                zuv,nm                                               n              m




                                                                                                                   
                                                                                               
                 yuv                                                                                    Deletion       Insertion




Figure 1: An example problem displaying source and target lexicons and alphabets, along with specific matchings.
The variables involved in the optimization problem are diagrammed. x are the alphabet matching indicator variables,
y are the lexicon matching indicator variables, and z are the edit alignment indicator variables. The index u refers to
a word in the source lexicon, v refers to word in the target lexicon, i refers to a character in the source alphabet, and
j refers to a character in the target alphabet. n and m refer to positions in source and target words respectively. The
matching objective function is also shown.




vm . Let zuv be the vector of alignment variables
for the edit distance computation between source
word u and target word v, where entry zuv,nm                             S UB(zuv , x) =            (1 − xun vm )zuv,nm
indicates whether the character at position n of                                              n,m
source word u is aligned to the character at position
m of target word v. Additionally, define variables                             D EL(zuv ) =          zuv,n#
zuv,n# and zuv,#m denoting null alignments, which                                             n
will be used to keep track of insertions and deletions.                       I NS(zuv ) =          zuv,#m
                                                                                              m



   E DIT D IST(u, v; x) =                                        Notice that the variable zuv,nm being turned on in-
   min      · S UB(zuv , x) + D EL(zuv ) + I NS(zuv )            dicates the substitute operation, while a zuv,n# or
    zuv
                                                                 zuv,#m being turned on indicates an insert or delete
    s.t. zuv is monotonic                                        operation. These variables are digrammed in Fig-
                                                                 ure 1. The requirement that zuv be a monotonic
                                                                 alignment can be expressed using linear constraints,
                                                                 but in our optimization procedure (described in Sec-
We define S UB(zuv , x) to be the number of sub-                  tion 3) these constraints need not be explicitly rep-
stitutions between characters that are not matched               resented.
in x, D EL(zuv ) to be the number of deletions, and                 Now we can substitute the explicit edit distance
I NS(zuv ) to be the number of insertions.                       equation into the implicit matching objective (1).
                                                 315
Noticing that the mins and sums commute, we arrive                  Algorithm 1 Block Coordinate Descent
at the explicit form of the matching optimization                     Randomly initialize alphabet matching x.
problem.                                                              repeat
                                                                        for all u, v do
(2) Explicit Matching Objective:                                           (euv , zuv ) ← E DIT D IST(u, v; x)
                                                                        end for
min             yuv · · S UB(zuv , x) + D EL(zuv ) + I NS(zuv )         [Hungarian]
x,y,z
          u,v                                                           y ← argminy τ -one-to-one     u,v yuv euv
 s.t. x is restricted-many-to-many                                      [Solve LP]
        y is τ -one-to-one                                              x ← argmaxx restr.-many-to-many      i,j xij cij
        ∀uv zuv is monotonic
                                                                      until convergence

The implicit and explicit optimizations are the same,
apart from the fact that the explicit optimization now                 Notice that y simply picks out which edit distance
explicitly represents the edit alignment variables z.               problems affect the objective. The zuv in each of
Let the explicit matching objective (2) be denoted                  these edit distance problems can be optimized in-
as J(x, y, z). The relaxation of the explicit problem               dependently. zuv that do not have yuv active have
with 0-1 constraints removed has integer solutions,2                no effect on the objective, and zuv with yuv active
however the objective J(x, y, z) is non-convex. We                  can be optimized using the standard edit distance dy-
thus turn to a block coordinate descent method in the               namic program. Thus, in a first step we compute the
next section in order to find local optima.                          U · V edit distances euv and best monotonic align-
                                                                    ment variables zuv between all pairs of source and
3       Optimization Method                                         target words using U ·V calls to the standard edit dis-
                                                                    tance dynamic program. Altogether, this takes time
We now state a block coordinate descent procedure                   O ( u lu ) · ( v lv ) .
to find local optima of J(x, y, z) under the con-                       Now, in a second step we compute the least
straints on x, y, and z. This procedure alternates                  weighted τ -one-to-one matching y under the
between updating y and z to their exact joint optima                weights euv . This can be accomplished in time
when x is held fixed, and updating x to its exact op-                O(max(U, V )3 ) using the Hungarian algorithm
timum when y and z are held fixed.                                   (Kuhn, 1955). These two steps produce y and z that
   The psuedocode for the procedure is given in Al-                 exactly achieve the optimum value of J(x, y, z) for
gorithm 1. Note that the function E DIT D IST returns               the given value of x.
both the min edit distance euv and the argmin edit
alignments zuv . Also note that cij is as defined in                 3.2   Alphabet Matching Update
Section 3.2.
                                                                    Let y and z, the lexicon matching variables and the
3.1 Lexicon Matching Update                                         edit alignments, be fixed. Now, we find the optimal
                                                                    alphabet matching variables x subject to the con-
Let x, the alphabet matching variable, be fixed. We
                                                                    straint that x is restricted-many-to-many.
consider the problem of optimizing J(x, y, z) over
                                                                       It makes sense that to optimize J(x, y, z) with re-
the lexicon matching variable y and and the edit
                                                                    spect to x we should prioritize mappings xij that
alignments z under the constraint that y is τ -one-
                                                                    would mitigate the largest substitution costs in the
to-one and each zuv is monotonic.
                                                                    active edit distance problems. Indeed, with a little
    2
     This can be shown by observing that optimizing x when y        algebra it can be shown that solving a maximum
and z are held fixed yields integer solutions (shown in Section      weighted matching problem with weights cij that
3.2), and similarly for the optimization of y and z when x is
fixed (shown in Section 3.1). Thus, every local optimum with
                                                                    count potential substitution costs gives the correct
respect to these block coordinate updates has integer solutions.    update for x. In particular, cij is the total cost of
The global optimum must be one of these local optima.               substitution edits in the active edit alignment prob-
                                                              316
lems that would result if source character i were not             4.1   Phonetic Cognate Lexicons
mapped to target character j in the alphabet match-
                                                                  The first data set we evaluate on consists of 583
ing x. This can be written as:
                                                                  triples of phonetic transcriptions of cognates in
    cij =                                        · yuv · zuv,nm   Spanish, Portuguese, and Italian. The data set was
            u,v       n,m s.t. un =i,vm =j                        introduced by Bouchard-Cˆ t´ et al. (2007). For a
                                                                                                oe
                                                                  given pair of languages the task is to determine the
   If x were constrained to be one-to-one, we
could again apply the Hungarian algorithm, this                   mapping between lexicons that correctly maps each
time to find a maximum weighted matching under                     source word to its cognate in the target lexicon. We
the weights cij . Since we have instead allowed                   refer to this task and data set as ROMANCE.
restricted-many-to-many alphabet matchings we                        Hall and Klein (2010) presented a state-of-the-
turn to linear programming for optimizing x. We                   art system for the task of cognate identification and
can state the update problem as the following linear              evaluated on this data set. Their model explicitly
program (LP), which is guaranteed to have integer
solutions:                                                        represents parameters for phonetic change between
                                                                  languages and their parents in a phylogenetic tree.
                                                                  They estimate parameters and infer the pairs of cog-
            min         xij cij
             x                                                    nates present in all three languages jointly, while we
                  ij
                                                                  consider each pair of languages in turn.
            s.t. ∀i            xij ≤ 2, ∀j       xij ≤ 2             Their model has similarities with our own in that
                       j                     i
                                                                  it learns correspondences between the alphabets of
                                xij = max(I, J)                   pairs of languages. However, their correspondences
                  i        j
                                                                  are probabilistic and implicit while ours are hard and
In experiments we used the GNU Linear Program-                    explicit. Their model also differs from our own in
ming Toolkit (GLPK) to solve the LP and update                    a key way. Notice that the phonetic alphabets for
the alphabet matching x. This update yields match-                the three languages are actually the same. Since
ing variables x that achieve the optimum value of                 phonetic change occurs gradually across languages
J(x, y, z) for fixed y and z.                                      a helpful prior on the correspondence is to favor the
                                                                  identity. Their model makes use of such a prior.
3.3 Random Restarts                                               Our model, on the other hand, is unaware of any
In practice we found that the block coordinate de-                prior correspondence between alphabets and does
scent procedure can get stuck at poor local optima.               not make use of this additional information about
To find better optima, we run the coordinate descent               phonetic change.
procedure multiple times, initialized each time with                 Hall and Klein (2010) also evaluate their model
a random alphabet matching. We choose the local                   on lexicons that do not have a perfect cognate map-
optimum with the best objective value across all ini-             ping. This scenario, where not every word in one
tializations. This approach yielded substantial im-               language has a cognate in another, is more realistic.
provements in achieved objective value.                           They produced a data set with this property by prun-
                                                                  ing words from the ROMANCE data set until only
4   Experiments                                                   about 75% of the words in each source lexicon have
We compare our system to three different state-of-                cognates in each target lexicon. We refer to this task
the-art systems on three different data sets. We set              and data set as PARTIAL ROMANCE.
up experiments that allow for as direct a comparison
                                                                  4.2   Lexicons Extracted from Corpora
as possible. In some cases it must be pointed out
that the past system’s goals are different from our               Next, we evaluate our model on a noisier data set.
own, and we will be comparing in a different way                  Here the lexicons in source and target languages
than the respective work was intended. The three                  are extracted from corpora by taking the top 2,000
systems make use of additional, or slightly different,            words in each corpus. In particular, we used the En-
sources of information.                                           glish and Spanish sides of the Europarl parallel cor-
                                                   317
pus (Koehn, 2005). To make this set up more real-         between the two lexicons. For this data set, the lexi-
istic (though fairly comparable), we insured that the     cons are large enough that finding the exact solution
corpora were non-parallel by using the first 50K sen-      can be slow. Thus, in all experiments on this data
tences on the English side and the second 50K sen-        set, we instead use a greedy competitive linking al-
tences on the Spanish side. To generate a gold cog-       gorithm that runs in time O(U 2 V 2 log(U V )).
nate matching we used the intersected HMM align-             Again, for this dataset it is reasonable to expect
ment model of Liang et al. (2008) to align the full       that many characters will map to themselves in the
parallel corpus. From this alignment we extracted a       best alphabet matching. The alphabets are not iden-
translation lexicon by adding an entry for each word      tical, but are far from disjoint. Neither our system,
pair with the property that the English word was          nor that of Haghighi et al. (2008) make use of this
aligned to the Spanish in over 10% of the alignments      expectation. As far as both systems are concerned,
involving the English word. To reduce this transla-       the alphabets are disjoint.
tion lexicon down to a cognate matching we went
through the translation lexicon by hand and removed       4.3   Decipherment
any pair of words that we judged to not be cognates.      Finally, we evaluate our model on a data set where
The resulting gold matching contains cognate map-         a main goal is to decipher an unknown correspon-
pings in the English lexicon for 1,026 of the words       dence between alphabets. We attempt to learn a
in the Spanish lexicon. This means that only about        mapping from the alphabet of the ancient Semitic
50% of the words in English lexicon have cognates         language Ugaritic to the alphabet of Hebrew, and
in the Spanish lexicon. We evaluate on this data set      at the same time learn a matching between Hebrew
by computing precision and recall for the number of       words in a Hebrew lexicon and their cognates in a
English words that are mapped to a correct cognate.       Ugaritic lexicon. This task is related to the task at-
We refer to this task and data set as E UROPARL.          tempted by Snyder et al. (2010). The data set con-
   On this data set, we compare against the state-of-     sists of a Ugaritic lexicon of 2,214 words, each of
the-art orthographic system presented in Haghighi         which has a Hebrew cognate, the lexicon of their
et al. (2008). Haghighi et al. (2008) presents sev-       2,214 Hebrew cognates, and a gold cognate dictio-
eral systems that are designed to extract transla-        nary for evaluation. We refer to this task and data set
tion lexicons for non-parallel corpora by learning        as U GARITIC.
a correspondence between their monolingual lexi-             The non-parameteric Bayesian system of Snyder
cons. Since our system specializes in matching cog-       et al. (2010) assumes that the morphology of He-
nates and does not take into account additional infor-    brew is known, making use of an inventory of suf-
mation from corpus statistics, we compare against         fixes, prefixes, and stems derived from the words
the version of their system that only takes into ac-      in the Hebrew bible. It attempts to learn a corre-
count orthographic features and is thus is best suited    spondence between the morphology of Ugaritic and
for cognate detection. Their system requires a small      that of Hebrew while reconstructing cognates for
seed of correct cognate pairs. From this seed the sys-    Ugaritic words. This is a slightly different goal than
tem learns a projection using canonical correlation       that of our system, which learns a correspondence
analysis (CCA) into a canonical feature space that        between lexicons. Snyder et al. (2010) run their
allows feature vectors from source words and target       system on a set 7,386 Ugaritic words, the same set
words to be compared. Once in this canonical space,       that we extracted our 2,214 Ugaritic words with He-
similarity metrics can be computed and words can be       brew cognates from. We evaluate the accuracy of the
matched using a bipartite matching algorithm. The         lexicon matching produced by our system on these
process is iterative, adding cognate pairs to the seed    2,214 Ugaritic words, and so do they, measuring the
lexicon gradually and each time re-computing a re-        number of correctly reconstructed cognates.
fined projection. Our system makes no use of a seed           By restricting the source and target lexicons to
lexicon whatsoever.                                       sets of cognates we have made the task easier. This
   Both our system and the system of Haghighi et          was necessary, however, because the Ugaritic and
al. (2008) must solve bipartite matching problems         Hebrew corpora used by Snyder et al. (2010) are not
                                                    318
Model                    τ    Accuracy            Model                    τ     Precision   Recall    F1
        Hall and Klein (2010)    –      90.3              Hall and Klein (2010)    –       66.9       82.0    73.6
        M ATCHER                1.0     90.1              M ATCHER                0.25     99.7       34.0    50.7
                                                                                  0.50     93.8       60.2    73.3
                                                                                  0.75     81.1       78.0    79.5
Table 1: Results on ROMANCE data set. Our system is
labeled M ATCHER. We compare against the phylogenetic
cognate detection system of Hall and Klein (2010). We    Table 2: Results on PARTIAL ROMANCE data set. Our
show the pairwise cognate accuracy across all pairs of   system is labeled M ATCHER. We compare against the
languages from the following set: Spanish, Portuguese,   phylogenetic cognate detection system of Hall and Klein
and Italian.                                             (2010). We show the pairwise cognate precision, recall,
                                                         and F1 across all pairs of languages from the following
                                                         set: Spanish, Portuguese, and Italian. Note that approx-
comparable: only a small proportion of the words         imately 75% of the source words in each of the source
in the Ugaritic lexicon have cognates in the lexicon     lexicons have cognates in each of the target lexicons.
composed of the most frequent Hebrew words.
   Here, the alphabets really are disjoint. The sym-
                                                         Our average accuracy across all pairs of languages
bols in both languages look nothing alike. There is
                                                         is 90.1%. The phylogenetic system of Hall and
no obvious prior expectation about how the alpha-
                                                         Klein (2010) achieves an average accuracy of 90.3%
bets will be matched. We evaluate against a well-
                                                         across all pairs of languages. Our system achieves
established correspondence between the alphabets
                                                         accuracy comparable to that of the phylogenetic sys-
of Ugaritic and Hebrew. The Ugaritic alphabet con-
                                                         tem, despite the fact that the phylogenetic system is
tains 30 characters, the Hebrew alphabet contains 22
                                                         substantially more complex and makes use of an in-
characters, and the gold matching contains 33 en-
                                                         formed prior on alphabet correspondences.
tries. We evaluate the learned alphabet matching by
                                                            The alphabet matching learned by our system is
counting the number of recovered entries from the
                                                         interesting to analyze. For the pairing of Span-
gold matching.
                                                         ish and Portuguese it recovers phonetic correspon-
   Due to the size of the source and target lexicons,
                                                         dences that are well known. Our system learns the
we again use the greedy competitive linking algo-
                                                         correct cognate pairing of Spanish /bino/ to Por-
rithm in place of the exact Hungarian algorithm in
                                                         tuguese /vinu/. This pair exemplifies two com-
experiments on this data set.
                                                         mon phonetic correspondences for Spanish and Por-
                                                         tuguese: the Spanish /o/ often transforms to a /u/ in
5     Results
                                                         Portuguese, and Spanish /b/ often transforms to /v/
We present results on all four datasets ROMANCE,         in Portuguese. Our system, which allows many-to-
PARTIAL ROMANCE, E UROPARL, and U GARITIC.               many alphabet correspondences, correctly identifies
On the ROMANCE and PARTIAL ROMANCE data sets             the mappings /o/ → /u/ and /b/ → /v/ as well as the
we compare against the numbers published by Hall         identity mappings /o/ → /o/ and /b/ → /b/ which are
and Klein (2010). We ran an implementation of            also common.
the orthographic system presented by Haghighi et
al. (2008) on our E UROPARL data set. We com-            5.2   PARTIAL ROMANCE
pare against the numbers published by Snyder et al.      In Table 2 we present the results of running our sys-
(2010) on the U GARITIC data set. We refer to our        tem on the PARTIAL ROMANCE data set. In this data
system as M ATCHER in result tables and discussion.      set, only approximately 75% of the source words in
                                                         each of the source lexicons have cognates in each of
5.1    ROMANCE                                           the target lexicons. The parameter τ trades off pre-
The results of running our system, M ATCHER, on          cision and recall. We show results for three different
the ROMANCE data set are shown in Table 1. We            settings of τ : 0.25, 0.5, and 0.75.
recover 88.9% of the correct cognate mappings on            Our system achieves an average precision across
the pair Spanish and Italian, 85.7% on Italian and       language pairs of 99.7% at an average recall of
Portuguese, and 95.6% on Spanish and Portuguese.         34.0%. For the pairs Italian – Portuguese, and Span-
                                                319
Model                    Seed    τ     Precision   Recall    F1      Model                   τ    Lexicon Acc.   Alphabet Acc.
 Haghighi et al. (2008)    20    0.1      72.0       14.0    23.5     Snyder et al. (2010)    –       60.4*          29/33*
                           20    0.25     63.6       31.0    41.7     M ATCHER               1.0      90.4           28/33
                           20    0.5      44.8       43.7    44.2
                           50    0.1      90.5       17.6    29.5
                           50    0.25     75.4       36.7    49.4   Table 4: Results on U GARITIC data set. Our system is la-
                           50    0.5      56.4       55.0    55.7   beled M ATCHER. We compare against the decipherment
 M ATCHER                  0     0.1      93.5       18.2    30.5   system of Snyder et al. (2010). *Note that results for this
                           0     0.25     83.2       40.5    54.5   system are on a somewhat different task. In particular, the
                           0     0.5      56.5       55.1    55.8   M ATCHER system assumes the inventories of cognates in
                                                                    both Hebrew and Ugaritic are known, while the system
Table 3: Results on E UROPARL data set. Our system                  of Snyder et al. (2010) reconstructs cognates assuming
is labeled M ATCHER. We compare against the bilingual               only that the morphology of Hebrew is known, which is a
lexicon induction system of Haghighi et al. (2008). We              harder task. We show cognate pair identification accuracy
show the cognate precision, recall, and F1 for the pair of          and alphabet matching accuracy for Ugaritic and Hebrew.
languages English and Spanish using lexicons extracted
from corpora. Note that approximately 50% of the words
in the English lexicon have cognates in the Spanish lexi-
con.                                                                tional information: a seed matching of correct cog-
                                                                    nate pairs. The results show that as the size of
                                                                    this seed is decreased, the performance of the ortho-
ish – Portuguese, our system achieves prefect preci-
                                                                    graphic system degrades.
sion at recalls of 32.2% and 38.1% respectively. The
best average F1 achieved by our system is 79.5%,                    5.4   U GARITIC
which surpasses the average F1 of 73.6 achieved by
                                                                    In Table 4 we present results on the U GARITIC data
the phylogenetic system of Hall and Klein (2010).
                                                                    set. We evaluate both accuracy of the lexicon match-
   The phylogenetic system observes the phyloge-
                                                                    ing learned by our system, and the accuracy of the
netic tree of ancestry for the three languages and
                                                                    alphabet matching. Our system achieves a lexicon
explicitly models cognate evolution and survival in
                                                                    accuracy of 90.4% while correctly identifying 28 out
a ‘survival’ tree. One might expect the phyloge-
                                                                    the 33 gold character mappings.
netic system to achieve better results on this data set
                                                                       We also present the results for the decipherment
where part of the task is identifying which words do
                                                                    model of Snyder et al. (2010) in Table 4. Note that
not have cognates. It is surprising that our model
                                                                    while the evaluation data sets for our two models
does so well given its simplicity.
                                                                    are the same, the tasks are very different. In par-
5.3   E UROPARL                                                     ticular, our system assumes the inventories of cog-
                                                                    nates in both Hebrew and Ugaritic are known, while
Table 3 presents results for our system on the E U -
                                                                    the system of Snyder et al. (2010) reconstructs cog-
ROPARL data set across three different settings of τ :
                                                                    nates assuming only that the morphology of Hebrew
0.1, 0.25, and 0.5. We compare against the ortho-
                                                                    is known, which is a harder task. Even so, the re-
graphic system presented by Haghighi et al. (2008),
                                                                    sults show that our system is effective at decipher-
across the same three settings of τ , and with two dif-
                                                                    ment when semantically similar lexicons are avail-
ferent sizes of seed lexicon: 20 and 50. In this data
                                                                    able.
set, only approximately 50% of the source words
have cognates in the target lexicon.                                6 Conclusion
   Our system achieves a precision of 93.5% at a re-
call of 18.2%, and a best F1 of 55.0%. Using a seed                 We have presented a simple combinatorial model
matching of 50 word pairs, the orthographic sys-                    that simultaneously incorporates both a matching
tem of Haghighi et al. (2008) achieves a best F1 of                 between alphabets and a matching between lexicons.
55.7%. Using a seed matching of 20 word pairs,                      Our system is effective at both the tasks of cognate
it achieves a best F1 of 44.2%. Our system out-                     identification and alphabet decipherment, requiring
performs the orthographic system even though the                    only lists of words in both languages as input.
orthographic system makes use of important addi-
                                                    320
References
A. Bouchard-Cˆ t´ , P. Liang, T.L. Griffiths, and D. Klein.
               oe
   2007. A probabilistic approach to diachronic phonol-
   ogy. In Proc. of EMNLP.
A. Bouchard-Cˆ t´ , T.L. Griffiths, and D. Klein.
                 oe
   2009. Improved reconstruction of protolanguage word
   forms. In Proc. of NAACL.
A. Haghighi, P. Liang, T. Berg-Kirkpatrick, and D. Klein.
   2008. Learning bilingual lexicons from monolingual
   corpora. Proceedings of ACL.
D. Hall and D. Klein. 2010. Finding cognate groups
   using phylogenies. In Proc. of ACL.
K. Knight and K. Yamada. 1999. A computational ap-
   proach to deciphering unknown scripts. In Proc. of
   ACL Workshop on Unsupervised Learning in Natural
   Language Processing.
K. Knight, A. Nair, N. Rathod, and K. Yamada. 2006.
   Unsupervised analysis for decipherment problems. In
   Proc. of COLING/ACL.
P. Koehn and K. Knight. 2002. Learning a translation
   lexicon from monolingual corpora. In Proc. of ACL
   workshop on Unsupervised lexical acquisition.
P. Koehn. 2005. Europarl: A Parallel Corpus for Statis-
   tical Machine Translation. In Proc. of Machine Trans-
   lation Summit.
G. Kondrak. 2001. Identifying Cognates by Phonetic and
   Semantic Similarity. In NAACL.
H.W. Kuhn. 1955. The Hungarian method for the assign-
   ment problem. Naval research logistics quarterly.
P. Liang, D. Klein, and M.I. Jordan. 2008. Agreement-
   based learning. Proc. of NIPS.
S. Ravi and K. Knight. 2011. Bayesian inference for Zo-
   diac and other homophonic ciphers. In Proc. of ACL.
B. Snyder, R. Barzilay, and K. Knight. 2010. A statisti-
   cal model for lost language decipherment. In Proc. of
   ACL.




                                                       321

More Related Content

What's hot

Dependency Analysis of Abstract Universal Structures in Korean and English
Dependency Analysis of Abstract Universal Structures in Korean and EnglishDependency Analysis of Abstract Universal Structures in Korean and English
Dependency Analysis of Abstract Universal Structures in Korean and EnglishJinho Choi
 
Data Complexity in EL Family of Description Logics
Data Complexity in EL Family of Description LogicsData Complexity in EL Family of Description Logics
Data Complexity in EL Family of Description LogicsAdila Krisnadhi
 
Dependent Types in Natural Language Semantics
Dependent Types in Natural Language SemanticsDependent Types in Natural Language Semantics
Dependent Types in Natural Language SemanticsDaisuke BEKKI
 
visualizing and measuring the geometry of bert
visualizing and measuring the geometry of bertvisualizing and measuring the geometry of bert
visualizing and measuring the geometry of berttaeseon ryu
 
Deep Reinforcement Learning with Distributional Semantic Rewards for Abstract...
Deep Reinforcement Learning with Distributional Semantic Rewards for Abstract...Deep Reinforcement Learning with Distributional Semantic Rewards for Abstract...
Deep Reinforcement Learning with Distributional Semantic Rewards for Abstract...Deren Lei
 
Composing (Im)politeness in Dependent Type Semantics
Composing (Im)politeness in Dependent Type SemanticsComposing (Im)politeness in Dependent Type Semantics
Composing (Im)politeness in Dependent Type SemanticsDaisuke BEKKI
 
Formal language
Formal languageFormal language
Formal languageRajendran
 
Composition of soft set relations and construction of transitive
Composition of soft set relations and construction of transitiveComposition of soft set relations and construction of transitive
Composition of soft set relations and construction of transitiveAlexander Decker
 
深層意味表現学習 (Deep Semantic Representations)
深層意味表現学習 (Deep Semantic Representations)深層意味表現学習 (Deep Semantic Representations)
深層意味表現学習 (Deep Semantic Representations)Danushka Bollegala
 
ESR11 Hoang Cuong - EXPERT Summer School - Malaga 2015
ESR11 Hoang Cuong - EXPERT Summer School - Malaga 2015ESR11 Hoang Cuong - EXPERT Summer School - Malaga 2015
ESR11 Hoang Cuong - EXPERT Summer School - Malaga 2015RIILP
 
Wsd as distributed constraint optimization problem
Wsd as distributed constraint optimization problemWsd as distributed constraint optimization problem
Wsd as distributed constraint optimization problemlolokikipipi
 
Space Efficient Suffix Array Construction using Induced Sorting LMS Substrings
Space Efficient Suffix Array Construction using Induced Sorting LMS SubstringsSpace Efficient Suffix Array Construction using Induced Sorting LMS Substrings
Space Efficient Suffix Array Construction using Induced Sorting LMS Substringsijistjournal
 
Space Efficient Suffix Array Construction using Induced Sorting LMS Substrings
 Space Efficient Suffix Array Construction using Induced Sorting LMS Substrings  Space Efficient Suffix Array Construction using Induced Sorting LMS Substrings
Space Efficient Suffix Array Construction using Induced Sorting LMS Substrings ijistjournal
 
Jarrar.lecture notes.aai.2012s.descriptionlogic
Jarrar.lecture notes.aai.2012s.descriptionlogicJarrar.lecture notes.aai.2012s.descriptionlogic
Jarrar.lecture notes.aai.2012s.descriptionlogicSinaInstitute
 
Jarrar: Description Logic
Jarrar: Description LogicJarrar: Description Logic
Jarrar: Description LogicMustafa Jarrar
 
The complexity of promise problems with applications to public-key cryptography
The complexity of promise problems with applications to public-key cryptographyThe complexity of promise problems with applications to public-key cryptography
The complexity of promise problems with applications to public-key cryptographyXequeMateShannon
 

What's hot (20)

H-MLQ
H-MLQH-MLQ
H-MLQ
 
Dependency Analysis of Abstract Universal Structures in Korean and English
Dependency Analysis of Abstract Universal Structures in Korean and EnglishDependency Analysis of Abstract Universal Structures in Korean and English
Dependency Analysis of Abstract Universal Structures in Korean and English
 
AI Lesson 09
AI Lesson 09AI Lesson 09
AI Lesson 09
 
Data Complexity in EL Family of Description Logics
Data Complexity in EL Family of Description LogicsData Complexity in EL Family of Description Logics
Data Complexity in EL Family of Description Logics
 
Dependent Types in Natural Language Semantics
Dependent Types in Natural Language SemanticsDependent Types in Natural Language Semantics
Dependent Types in Natural Language Semantics
 
visualizing and measuring the geometry of bert
visualizing and measuring the geometry of bertvisualizing and measuring the geometry of bert
visualizing and measuring the geometry of bert
 
Deep Reinforcement Learning with Distributional Semantic Rewards for Abstract...
Deep Reinforcement Learning with Distributional Semantic Rewards for Abstract...Deep Reinforcement Learning with Distributional Semantic Rewards for Abstract...
Deep Reinforcement Learning with Distributional Semantic Rewards for Abstract...
 
AI Lesson 12
AI Lesson 12AI Lesson 12
AI Lesson 12
 
Composing (Im)politeness in Dependent Type Semantics
Composing (Im)politeness in Dependent Type SemanticsComposing (Im)politeness in Dependent Type Semantics
Composing (Im)politeness in Dependent Type Semantics
 
Formal language
Formal languageFormal language
Formal language
 
Lesson 28
Lesson 28Lesson 28
Lesson 28
 
Composition of soft set relations and construction of transitive
Composition of soft set relations and construction of transitiveComposition of soft set relations and construction of transitive
Composition of soft set relations and construction of transitive
 
深層意味表現学習 (Deep Semantic Representations)
深層意味表現学習 (Deep Semantic Representations)深層意味表現学習 (Deep Semantic Representations)
深層意味表現学習 (Deep Semantic Representations)
 
ESR11 Hoang Cuong - EXPERT Summer School - Malaga 2015
ESR11 Hoang Cuong - EXPERT Summer School - Malaga 2015ESR11 Hoang Cuong - EXPERT Summer School - Malaga 2015
ESR11 Hoang Cuong - EXPERT Summer School - Malaga 2015
 
Wsd as distributed constraint optimization problem
Wsd as distributed constraint optimization problemWsd as distributed constraint optimization problem
Wsd as distributed constraint optimization problem
 
Space Efficient Suffix Array Construction using Induced Sorting LMS Substrings
Space Efficient Suffix Array Construction using Induced Sorting LMS SubstringsSpace Efficient Suffix Array Construction using Induced Sorting LMS Substrings
Space Efficient Suffix Array Construction using Induced Sorting LMS Substrings
 
Space Efficient Suffix Array Construction using Induced Sorting LMS Substrings
 Space Efficient Suffix Array Construction using Induced Sorting LMS Substrings  Space Efficient Suffix Array Construction using Induced Sorting LMS Substrings
Space Efficient Suffix Array Construction using Induced Sorting LMS Substrings
 
Jarrar.lecture notes.aai.2012s.descriptionlogic
Jarrar.lecture notes.aai.2012s.descriptionlogicJarrar.lecture notes.aai.2012s.descriptionlogic
Jarrar.lecture notes.aai.2012s.descriptionlogic
 
Jarrar: Description Logic
Jarrar: Description LogicJarrar: Description Logic
Jarrar: Description Logic
 
The complexity of promise problems with applications to public-key cryptography
The complexity of promise problems with applications to public-key cryptographyThe complexity of promise problems with applications to public-key cryptography
The complexity of promise problems with applications to public-key cryptography
 

Viewers also liked

Viewers also liked (9)

Font new edit
Font new editFont new edit
Font new edit
 
Leukemia dep-force
Leukemia dep-forceLeukemia dep-force
Leukemia dep-force
 
Gis
GisGis
Gis
 
Sjo userguide th
Sjo userguide thSjo userguide th
Sjo userguide th
 
Innovation china india_sep.07_sean_dougherty
Innovation china india_sep.07_sean_doughertyInnovation china india_sep.07_sean_dougherty
Innovation china india_sep.07_sean_dougherty
 
Creating your own_color_palettes
Creating your own_color_palettesCreating your own_color_palettes
Creating your own_color_palettes
 
จรรยาบรรณของสถาปนิก
จรรยาบรรณของสถาปนิกจรรยาบรรณของสถาปนิก
จรรยาบรรณของสถาปนิก
 
Grossary
GrossaryGrossary
Grossary
 
Innovation china india_sep.07_a_panagariya
Innovation china india_sep.07_a_panagariyaInnovation china india_sep.07_a_panagariya
Innovation china india_sep.07_a_panagariya
 

Similar to Simple effective decipherment via combinatorial optimization

Language Model (D3L1 Deep Learning for Speech and Language UPC 2017)
Language Model (D3L1 Deep Learning for Speech and Language UPC 2017)Language Model (D3L1 Deep Learning for Speech and Language UPC 2017)
Language Model (D3L1 Deep Learning for Speech and Language UPC 2017)Universitat Politècnica de Catalunya
 
形式言語理論への 測度論的アプローチ
形式言語理論への 測度論的アプローチ形式言語理論への 測度論的アプローチ
形式言語理論への 測度論的アプローチRyoma Sin'ya
 
Improved Word Alignments Using the Web as a Corpus
Improved Word Alignments Using the Web as a CorpusImproved Word Alignments Using the Web as a Corpus
Improved Word Alignments Using the Web as a CorpusSvetlin Nakov
 
A Probabilistic Attack On NP-Complete Problems
A Probabilistic Attack On NP-Complete ProblemsA Probabilistic Attack On NP-Complete Problems
A Probabilistic Attack On NP-Complete ProblemsBrittany Allen
 
Harnessing Deep Neural Networks with Logic Rules
Harnessing Deep Neural Networks with Logic RulesHarnessing Deep Neural Networks with Logic Rules
Harnessing Deep Neural Networks with Logic RulesSho Takase
 
Deep learning book_chap_02
Deep learning book_chap_02Deep learning book_chap_02
Deep learning book_chap_02HyeongGooKang
 
Threshold network models
Threshold network modelsThreshold network models
Threshold network modelsNaoki Masuda
 
Course Assignment : Skip gram
Course Assignment : Skip gramCourse Assignment : Skip gram
Course Assignment : Skip gramKhalilBergaoui
 
Ldml - public
Ldml - publicLdml - public
Ldml - publicseanscon
 
Context-dependent Token-wise Variational Autoencoder for Topic Modeling
Context-dependent Token-wise Variational Autoencoder for Topic ModelingContext-dependent Token-wise Variational Autoencoder for Topic Modeling
Context-dependent Token-wise Variational Autoencoder for Topic ModelingTomonari Masada
 
Csr2011 june17 15_15_kaminski
Csr2011 june17 15_15_kaminskiCsr2011 june17 15_15_kaminski
Csr2011 june17 15_15_kaminskiCSR2011
 
transplantation-isospectral-poster
transplantation-isospectral-postertransplantation-isospectral-poster
transplantation-isospectral-posterFeynman Liang
 
High-order Finite Elements for Computational Physics
High-order Finite Elements for Computational PhysicsHigh-order Finite Elements for Computational Physics
High-order Finite Elements for Computational PhysicsRobert Rieben
 
A Systematic Approach To Probabilistic Pointer Analysis
A Systematic Approach To Probabilistic Pointer AnalysisA Systematic Approach To Probabilistic Pointer Analysis
A Systematic Approach To Probabilistic Pointer AnalysisMonica Franklin
 

Similar to Simple effective decipherment via combinatorial optimization (20)

Unequal-Cost Prefix-Free Codes
Unequal-Cost Prefix-Free CodesUnequal-Cost Prefix-Free Codes
Unequal-Cost Prefix-Free Codes
 
Unified Programming Theory
Unified Programming TheoryUnified Programming Theory
Unified Programming Theory
 
Language Model (D3L1 Deep Learning for Speech and Language UPC 2017)
Language Model (D3L1 Deep Learning for Speech and Language UPC 2017)Language Model (D3L1 Deep Learning for Speech and Language UPC 2017)
Language Model (D3L1 Deep Learning for Speech and Language UPC 2017)
 
形式言語理論への 測度論的アプローチ
形式言語理論への 測度論的アプローチ形式言語理論への 測度論的アプローチ
形式言語理論への 測度論的アプローチ
 
graph_embeddings
graph_embeddingsgraph_embeddings
graph_embeddings
 
Improved Word Alignments Using the Web as a Corpus
Improved Word Alignments Using the Web as a CorpusImproved Word Alignments Using the Web as a Corpus
Improved Word Alignments Using the Web as a Corpus
 
A Probabilistic Attack On NP-Complete Problems
A Probabilistic Attack On NP-Complete ProblemsA Probabilistic Attack On NP-Complete Problems
A Probabilistic Attack On NP-Complete Problems
 
Harnessing Deep Neural Networks with Logic Rules
Harnessing Deep Neural Networks with Logic RulesHarnessing Deep Neural Networks with Logic Rules
Harnessing Deep Neural Networks with Logic Rules
 
Science in text mining
Science in text miningScience in text mining
Science in text mining
 
Deep learning book_chap_02
Deep learning book_chap_02Deep learning book_chap_02
Deep learning book_chap_02
 
Lec1
Lec1Lec1
Lec1
 
Threshold network models
Threshold network modelsThreshold network models
Threshold network models
 
Course Assignment : Skip gram
Course Assignment : Skip gramCourse Assignment : Skip gram
Course Assignment : Skip gram
 
Ldml - public
Ldml - publicLdml - public
Ldml - public
 
Context-dependent Token-wise Variational Autoencoder for Topic Modeling
Context-dependent Token-wise Variational Autoencoder for Topic ModelingContext-dependent Token-wise Variational Autoencoder for Topic Modeling
Context-dependent Token-wise Variational Autoencoder for Topic Modeling
 
semeval2016
semeval2016semeval2016
semeval2016
 
Csr2011 june17 15_15_kaminski
Csr2011 june17 15_15_kaminskiCsr2011 june17 15_15_kaminski
Csr2011 june17 15_15_kaminski
 
transplantation-isospectral-poster
transplantation-isospectral-postertransplantation-isospectral-poster
transplantation-isospectral-poster
 
High-order Finite Elements for Computational Physics
High-order Finite Elements for Computational PhysicsHigh-order Finite Elements for Computational Physics
High-order Finite Elements for Computational Physics
 
A Systematic Approach To Probabilistic Pointer Analysis
A Systematic Approach To Probabilistic Pointer AnalysisA Systematic Approach To Probabilistic Pointer Analysis
A Systematic Approach To Probabilistic Pointer Analysis
 

More from Attaporn Ninsuwan

More from Attaporn Ninsuwan (20)

J query fundamentals
J query fundamentalsJ query fundamentals
J query fundamentals
 
Jquery enlightenment
Jquery enlightenmentJquery enlightenment
Jquery enlightenment
 
Jquery-Begining
Jquery-BeginingJquery-Begining
Jquery-Begining
 
Br ainfocom94
Br ainfocom94Br ainfocom94
Br ainfocom94
 
Chapter 12 - Computer Forensics
Chapter 12 - Computer ForensicsChapter 12 - Computer Forensics
Chapter 12 - Computer Forensics
 
Techniques for data hiding p
Techniques for data hiding pTechniques for data hiding p
Techniques for data hiding p
 
Stop badware infected_sites_report_062408
Stop badware infected_sites_report_062408Stop badware infected_sites_report_062408
Stop badware infected_sites_report_062408
 
Steganography past-present-future 552
Steganography past-present-future 552Steganography past-present-future 552
Steganography past-present-future 552
 
Ch03-Computer Security
Ch03-Computer SecurityCh03-Computer Security
Ch03-Computer Security
 
Ch02-Computer Security
Ch02-Computer SecurityCh02-Computer Security
Ch02-Computer Security
 
Ch01-Computer Security
Ch01-Computer SecurityCh01-Computer Security
Ch01-Computer Security
 
Ch8-Computer Security
Ch8-Computer SecurityCh8-Computer Security
Ch8-Computer Security
 
Ch7-Computer Security
Ch7-Computer SecurityCh7-Computer Security
Ch7-Computer Security
 
Ch6-Computer Security
Ch6-Computer SecurityCh6-Computer Security
Ch6-Computer Security
 
Ch06b-Computer Security
Ch06b-Computer SecurityCh06b-Computer Security
Ch06b-Computer Security
 
Ch5-Computer Security
Ch5-Computer SecurityCh5-Computer Security
Ch5-Computer Security
 
Ch04-Computer Security
Ch04-Computer SecurityCh04-Computer Security
Ch04-Computer Security
 
Chapter5 - The Discrete-Time Fourier Transform
Chapter5 - The Discrete-Time Fourier TransformChapter5 - The Discrete-Time Fourier Transform
Chapter5 - The Discrete-Time Fourier Transform
 
Chapter4 - The Continuous-Time Fourier Transform
Chapter4 - The Continuous-Time Fourier TransformChapter4 - The Continuous-Time Fourier Transform
Chapter4 - The Continuous-Time Fourier Transform
 
Chapter3 - Fourier Series Representation of Periodic Signals
Chapter3 - Fourier Series Representation of Periodic SignalsChapter3 - Fourier Series Representation of Periodic Signals
Chapter3 - Fourier Series Representation of Periodic Signals
 

Simple effective decipherment via combinatorial optimization

  • 1. Simple Effective Decipherment via Combinatorial Optimization Taylor Berg-Kirkpatrick and Dan Klein Computer Science Division University of California at Berkeley {tberg, klein}@cs.berkeley.edu Abstract 2011), and bilingual lexicon induction (Koehn and Knight, 2002; Haghighi et al., 2008). We consider We present a simple objective function that a common element, which is a model wherein there when optimized yields accurate solutions to are character-level correspondences and word-level both decipherment and cognate pair identifica- correspondences, with the word matching parame- tion problems. The objective simultaneously terized by the character one. This approach sub- scores a matching between two alphabets and a matching between two lexicons, each in a sumes a range of past tasks, though of course past different language. We introduce a simple work has specialized in interesting ways. coordinate descent procedure that efficiently Past work has emphasized the modeling as- finds effective solutions to the resulting com- pect, where here we use a parametrically simplistic binatorial optimization problem. Our system model, but instead emphasize inference. requires only a list of words in both languages as input, yet it competes with and surpasses 2 Decipherment as Two-Level several state-of-the-art systems that are both substantially more complex and make use of Optimization more information. Our method represents two matchings, one at the al- phabet level and one at the lexicon level. A vector of 1 Introduction variables x specifies a matching between alphabets. For each character i in the source alphabet and each Decipherment induces a correspondence between character j in the target alphabet we define an indi- the words in an unknown language and the words cator variable xij that is on if and only if character i in a known language. We focus on the setting where is mapped to character j. Similarly, a vector y rep- a close correspondence between the alphabets of the resents a matching between lexicons. For word u in two languages exists, but is unknown. Given only the source lexicon and word v in the target lexicon, two lists of words, the lexicons of both languages, the indicator variable yuv denotes that u maps to v. we attempt to induce the correspondence between Note that the matchings need not be one-to-one. alphabets and identify the cognates pairs present in We define an objective function on the matching the lexicons. The system we propose accomplishes variables as follows. Let E DIT D IST(u, v; x) denote this by defining a simple combinatorial optimiza- the edit distance between source word u and target tion problem that is a function of both the alphabet word v given alphabet matching x. Let the length and cognate matchings, and then induces correspon- of word u be lu and the length of word w be lw . dences by optimizing the objective using a block co- This edit distance depends on x in the following ordinate descent procedure. way. Insertions and deletions always cost a constant There is a range of past work that has var- .1 Substitutions also cost unless the characters iously investigated cognate detection (Kondrak, are matched in x, in which case the substitution is 2001; Bouchard-Cˆ t´ et al., 2007; Bouchard-Cˆ t´ oe oe 1 In practice we set = lu +lv . lu + lv is the maximum 1 et al., 2009; Hall and Klein, 2010), character-level number of edit operations between words u and v. This nor- decipherment (Knight and Yamada, 1999; Knight malization insures that edit distances are between 0 and 1 for et al., 2006; Snyder et al., 2010; Ravi and Knight, all pairs of words. 313 Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 313–321, Edinburgh, Scotland, UK, July 27–31, 2011. c 2011 Association for Computational Linguistics
  • 2. free. Now, the objective that we will minimize can be stated simply: u v yuv · E DIT D IST(u, v; x), In order to get a better handle on the shape of the the sum of the edit distances between the matched objective and to develop an efficient optimization words, where the edit distance function is parame- procedure we decompose each edit distance compu- terized by the alphabet matching. tation and re-formulate the optimization problem in Without restrictions on the matchings x and y Section 2.2. this objective can always be driven to zero by either mapping all characters to all characters, or matching none of the words. It is thus necessary to restrict 2.1 Example the matchings in some way. Let I be the size of Figure 1 presents both an example matching prob- the source alphabet and J be the size of the target lem and a diagram of the variables and objective. alphabet. We allow the alphabet matching x to be many-to-many but require that each character Here, the source lexicon consists of the English participate in no more than two mappings and that words (cat, bat, cart, rat, cab), and the total number of mappings be max(I, J), a the source alphabet consists of the characters (a, constraint we refer to as restricted-many-to-many. b, c, r, t). The target alphabet is (0, 1, The requirements can be encoded with the following linear constraints on x: 2, 3). We have used digits as symbols in the target alphabet to make it clear that we treat the alphabets ∀i xij ≤ 2 as disjoint. We have no prior knowledge about any j correspondence between alphabets, or between lexi- ∀j xij ≤ 2 cons. i The target lexicon consists of the words (23, xij = max(I, J) 1233, 120, 323, 023). The bipartite graphs i j show a specific setting of the matching variables. The lexicon matching y is required to be τ -one-to- The bold edges correspond to the xij and yuv that one. By this we mean that y is an at-most-one-to-one are one. The matchings shown achieve an edit dis- matching that covers proportion τ of the smaller of tance of zero between all matched word pairs ex- the two lexicons. Let U be the size of the source cept for the pair (cat, 23). The best edit align- lexicon and V be this size of the target lexicon. This requirement can be encoded with the following ment for this pair is also diagrammed. Here, ‘a’ linear constraints: is aligned to ‘2’, ‘t’ is aligned to ‘3’, and ‘c’ is deleted and therefore aligned to the null position ‘#’. ∀u yuv ≤ 1 Only the initial deletion has a non-zero cost since v all other alignments correspond to substitutions be- ∀v yuv ≤ 1 tween characters that are matched in x. u yuv = τ min(U, V ) 2.2 Explicit Objective u v Computing E DIT D IST(u, v; x) requires running a Now we are ready to define the full optimization dynamic program because of the unknown edit problem. The first formulation is called the Implicit alignments; here we define those alignments z ex- Matching Objective since includes an implicit plicitly, which makes the E DIT D IST(u, v; x) easy to minimization over edit alignments inside the com- write explicitly at the cost of more variables. How- putation of E DIT D IST. ever, by writing the objective in an explicit form that refers to these edit variables, we are able to describe (1) Implicit Matching Objective: a efficient block coordinate descent procedure that can be used for optimization. min yuv · E DIT D IST(u, v; x) E DIT D IST(u, v; x) is computed by minimizing x,y u v over the set of monotonic alignments between the s.t. x is restricted-many-to-many characters of the source word u and the characters of the target word v. Let un be the character at the y is τ -one-to-one nth position of the source word u, and similarly for 314
  • 3. Alphabet Matching Matching Problem a 0 b 1 min yuv · EditDist(u, v; x) x,y u v c 2 s.t. x is restricted-many-to-many r 3 y is τ -one-to-one t xij Lexicon Matching Edit Distance cat 23 # # EditDist(u, v; x) = bat 1233 Substitution c 2 cart 120 minzuv · (1 − xun vm )zuv,nm a 3 s.t. n m rat 323 zuv is monotonic + zuv,n# + zuv,#m t cab 023 zuv,nm n m yuv Deletion Insertion Figure 1: An example problem displaying source and target lexicons and alphabets, along with specific matchings. The variables involved in the optimization problem are diagrammed. x are the alphabet matching indicator variables, y are the lexicon matching indicator variables, and z are the edit alignment indicator variables. The index u refers to a word in the source lexicon, v refers to word in the target lexicon, i refers to a character in the source alphabet, and j refers to a character in the target alphabet. n and m refer to positions in source and target words respectively. The matching objective function is also shown. vm . Let zuv be the vector of alignment variables for the edit distance computation between source word u and target word v, where entry zuv,nm S UB(zuv , x) = (1 − xun vm )zuv,nm indicates whether the character at position n of n,m source word u is aligned to the character at position m of target word v. Additionally, define variables D EL(zuv ) = zuv,n# zuv,n# and zuv,#m denoting null alignments, which n will be used to keep track of insertions and deletions. I NS(zuv ) = zuv,#m m E DIT D IST(u, v; x) = Notice that the variable zuv,nm being turned on in- min · S UB(zuv , x) + D EL(zuv ) + I NS(zuv ) dicates the substitute operation, while a zuv,n# or zuv zuv,#m being turned on indicates an insert or delete s.t. zuv is monotonic operation. These variables are digrammed in Fig- ure 1. The requirement that zuv be a monotonic alignment can be expressed using linear constraints, but in our optimization procedure (described in Sec- We define S UB(zuv , x) to be the number of sub- tion 3) these constraints need not be explicitly rep- stitutions between characters that are not matched resented. in x, D EL(zuv ) to be the number of deletions, and Now we can substitute the explicit edit distance I NS(zuv ) to be the number of insertions. equation into the implicit matching objective (1). 315
  • 4. Noticing that the mins and sums commute, we arrive Algorithm 1 Block Coordinate Descent at the explicit form of the matching optimization Randomly initialize alphabet matching x. problem. repeat for all u, v do (2) Explicit Matching Objective: (euv , zuv ) ← E DIT D IST(u, v; x) end for min yuv · · S UB(zuv , x) + D EL(zuv ) + I NS(zuv ) [Hungarian] x,y,z u,v y ← argminy τ -one-to-one u,v yuv euv s.t. x is restricted-many-to-many [Solve LP] y is τ -one-to-one x ← argmaxx restr.-many-to-many i,j xij cij ∀uv zuv is monotonic until convergence The implicit and explicit optimizations are the same, apart from the fact that the explicit optimization now Notice that y simply picks out which edit distance explicitly represents the edit alignment variables z. problems affect the objective. The zuv in each of Let the explicit matching objective (2) be denoted these edit distance problems can be optimized in- as J(x, y, z). The relaxation of the explicit problem dependently. zuv that do not have yuv active have with 0-1 constraints removed has integer solutions,2 no effect on the objective, and zuv with yuv active however the objective J(x, y, z) is non-convex. We can be optimized using the standard edit distance dy- thus turn to a block coordinate descent method in the namic program. Thus, in a first step we compute the next section in order to find local optima. U · V edit distances euv and best monotonic align- ment variables zuv between all pairs of source and 3 Optimization Method target words using U ·V calls to the standard edit dis- tance dynamic program. Altogether, this takes time We now state a block coordinate descent procedure O ( u lu ) · ( v lv ) . to find local optima of J(x, y, z) under the con- Now, in a second step we compute the least straints on x, y, and z. This procedure alternates weighted τ -one-to-one matching y under the between updating y and z to their exact joint optima weights euv . This can be accomplished in time when x is held fixed, and updating x to its exact op- O(max(U, V )3 ) using the Hungarian algorithm timum when y and z are held fixed. (Kuhn, 1955). These two steps produce y and z that The psuedocode for the procedure is given in Al- exactly achieve the optimum value of J(x, y, z) for gorithm 1. Note that the function E DIT D IST returns the given value of x. both the min edit distance euv and the argmin edit alignments zuv . Also note that cij is as defined in 3.2 Alphabet Matching Update Section 3.2. Let y and z, the lexicon matching variables and the 3.1 Lexicon Matching Update edit alignments, be fixed. Now, we find the optimal alphabet matching variables x subject to the con- Let x, the alphabet matching variable, be fixed. We straint that x is restricted-many-to-many. consider the problem of optimizing J(x, y, z) over It makes sense that to optimize J(x, y, z) with re- the lexicon matching variable y and and the edit spect to x we should prioritize mappings xij that alignments z under the constraint that y is τ -one- would mitigate the largest substitution costs in the to-one and each zuv is monotonic. active edit distance problems. Indeed, with a little 2 This can be shown by observing that optimizing x when y algebra it can be shown that solving a maximum and z are held fixed yields integer solutions (shown in Section weighted matching problem with weights cij that 3.2), and similarly for the optimization of y and z when x is fixed (shown in Section 3.1). Thus, every local optimum with count potential substitution costs gives the correct respect to these block coordinate updates has integer solutions. update for x. In particular, cij is the total cost of The global optimum must be one of these local optima. substitution edits in the active edit alignment prob- 316
  • 5. lems that would result if source character i were not 4.1 Phonetic Cognate Lexicons mapped to target character j in the alphabet match- The first data set we evaluate on consists of 583 ing x. This can be written as: triples of phonetic transcriptions of cognates in cij = · yuv · zuv,nm Spanish, Portuguese, and Italian. The data set was u,v n,m s.t. un =i,vm =j introduced by Bouchard-Cˆ t´ et al. (2007). For a oe given pair of languages the task is to determine the If x were constrained to be one-to-one, we could again apply the Hungarian algorithm, this mapping between lexicons that correctly maps each time to find a maximum weighted matching under source word to its cognate in the target lexicon. We the weights cij . Since we have instead allowed refer to this task and data set as ROMANCE. restricted-many-to-many alphabet matchings we Hall and Klein (2010) presented a state-of-the- turn to linear programming for optimizing x. We art system for the task of cognate identification and can state the update problem as the following linear evaluated on this data set. Their model explicitly program (LP), which is guaranteed to have integer solutions: represents parameters for phonetic change between languages and their parents in a phylogenetic tree. They estimate parameters and infer the pairs of cog- min xij cij x nates present in all three languages jointly, while we ij consider each pair of languages in turn. s.t. ∀i xij ≤ 2, ∀j xij ≤ 2 Their model has similarities with our own in that j i it learns correspondences between the alphabets of xij = max(I, J) pairs of languages. However, their correspondences i j are probabilistic and implicit while ours are hard and In experiments we used the GNU Linear Program- explicit. Their model also differs from our own in ming Toolkit (GLPK) to solve the LP and update a key way. Notice that the phonetic alphabets for the alphabet matching x. This update yields match- the three languages are actually the same. Since ing variables x that achieve the optimum value of phonetic change occurs gradually across languages J(x, y, z) for fixed y and z. a helpful prior on the correspondence is to favor the identity. Their model makes use of such a prior. 3.3 Random Restarts Our model, on the other hand, is unaware of any In practice we found that the block coordinate de- prior correspondence between alphabets and does scent procedure can get stuck at poor local optima. not make use of this additional information about To find better optima, we run the coordinate descent phonetic change. procedure multiple times, initialized each time with Hall and Klein (2010) also evaluate their model a random alphabet matching. We choose the local on lexicons that do not have a perfect cognate map- optimum with the best objective value across all ini- ping. This scenario, where not every word in one tializations. This approach yielded substantial im- language has a cognate in another, is more realistic. provements in achieved objective value. They produced a data set with this property by prun- ing words from the ROMANCE data set until only 4 Experiments about 75% of the words in each source lexicon have We compare our system to three different state-of- cognates in each target lexicon. We refer to this task the-art systems on three different data sets. We set and data set as PARTIAL ROMANCE. up experiments that allow for as direct a comparison 4.2 Lexicons Extracted from Corpora as possible. In some cases it must be pointed out that the past system’s goals are different from our Next, we evaluate our model on a noisier data set. own, and we will be comparing in a different way Here the lexicons in source and target languages than the respective work was intended. The three are extracted from corpora by taking the top 2,000 systems make use of additional, or slightly different, words in each corpus. In particular, we used the En- sources of information. glish and Spanish sides of the Europarl parallel cor- 317
  • 6. pus (Koehn, 2005). To make this set up more real- between the two lexicons. For this data set, the lexi- istic (though fairly comparable), we insured that the cons are large enough that finding the exact solution corpora were non-parallel by using the first 50K sen- can be slow. Thus, in all experiments on this data tences on the English side and the second 50K sen- set, we instead use a greedy competitive linking al- tences on the Spanish side. To generate a gold cog- gorithm that runs in time O(U 2 V 2 log(U V )). nate matching we used the intersected HMM align- Again, for this dataset it is reasonable to expect ment model of Liang et al. (2008) to align the full that many characters will map to themselves in the parallel corpus. From this alignment we extracted a best alphabet matching. The alphabets are not iden- translation lexicon by adding an entry for each word tical, but are far from disjoint. Neither our system, pair with the property that the English word was nor that of Haghighi et al. (2008) make use of this aligned to the Spanish in over 10% of the alignments expectation. As far as both systems are concerned, involving the English word. To reduce this transla- the alphabets are disjoint. tion lexicon down to a cognate matching we went through the translation lexicon by hand and removed 4.3 Decipherment any pair of words that we judged to not be cognates. Finally, we evaluate our model on a data set where The resulting gold matching contains cognate map- a main goal is to decipher an unknown correspon- pings in the English lexicon for 1,026 of the words dence between alphabets. We attempt to learn a in the Spanish lexicon. This means that only about mapping from the alphabet of the ancient Semitic 50% of the words in English lexicon have cognates language Ugaritic to the alphabet of Hebrew, and in the Spanish lexicon. We evaluate on this data set at the same time learn a matching between Hebrew by computing precision and recall for the number of words in a Hebrew lexicon and their cognates in a English words that are mapped to a correct cognate. Ugaritic lexicon. This task is related to the task at- We refer to this task and data set as E UROPARL. tempted by Snyder et al. (2010). The data set con- On this data set, we compare against the state-of- sists of a Ugaritic lexicon of 2,214 words, each of the-art orthographic system presented in Haghighi which has a Hebrew cognate, the lexicon of their et al. (2008). Haghighi et al. (2008) presents sev- 2,214 Hebrew cognates, and a gold cognate dictio- eral systems that are designed to extract transla- nary for evaluation. We refer to this task and data set tion lexicons for non-parallel corpora by learning as U GARITIC. a correspondence between their monolingual lexi- The non-parameteric Bayesian system of Snyder cons. Since our system specializes in matching cog- et al. (2010) assumes that the morphology of He- nates and does not take into account additional infor- brew is known, making use of an inventory of suf- mation from corpus statistics, we compare against fixes, prefixes, and stems derived from the words the version of their system that only takes into ac- in the Hebrew bible. It attempts to learn a corre- count orthographic features and is thus is best suited spondence between the morphology of Ugaritic and for cognate detection. Their system requires a small that of Hebrew while reconstructing cognates for seed of correct cognate pairs. From this seed the sys- Ugaritic words. This is a slightly different goal than tem learns a projection using canonical correlation that of our system, which learns a correspondence analysis (CCA) into a canonical feature space that between lexicons. Snyder et al. (2010) run their allows feature vectors from source words and target system on a set 7,386 Ugaritic words, the same set words to be compared. Once in this canonical space, that we extracted our 2,214 Ugaritic words with He- similarity metrics can be computed and words can be brew cognates from. We evaluate the accuracy of the matched using a bipartite matching algorithm. The lexicon matching produced by our system on these process is iterative, adding cognate pairs to the seed 2,214 Ugaritic words, and so do they, measuring the lexicon gradually and each time re-computing a re- number of correctly reconstructed cognates. fined projection. Our system makes no use of a seed By restricting the source and target lexicons to lexicon whatsoever. sets of cognates we have made the task easier. This Both our system and the system of Haghighi et was necessary, however, because the Ugaritic and al. (2008) must solve bipartite matching problems Hebrew corpora used by Snyder et al. (2010) are not 318
  • 7. Model τ Accuracy Model τ Precision Recall F1 Hall and Klein (2010) – 90.3 Hall and Klein (2010) – 66.9 82.0 73.6 M ATCHER 1.0 90.1 M ATCHER 0.25 99.7 34.0 50.7 0.50 93.8 60.2 73.3 0.75 81.1 78.0 79.5 Table 1: Results on ROMANCE data set. Our system is labeled M ATCHER. We compare against the phylogenetic cognate detection system of Hall and Klein (2010). We Table 2: Results on PARTIAL ROMANCE data set. Our show the pairwise cognate accuracy across all pairs of system is labeled M ATCHER. We compare against the languages from the following set: Spanish, Portuguese, phylogenetic cognate detection system of Hall and Klein and Italian. (2010). We show the pairwise cognate precision, recall, and F1 across all pairs of languages from the following set: Spanish, Portuguese, and Italian. Note that approx- comparable: only a small proportion of the words imately 75% of the source words in each of the source in the Ugaritic lexicon have cognates in the lexicon lexicons have cognates in each of the target lexicons. composed of the most frequent Hebrew words. Here, the alphabets really are disjoint. The sym- Our average accuracy across all pairs of languages bols in both languages look nothing alike. There is is 90.1%. The phylogenetic system of Hall and no obvious prior expectation about how the alpha- Klein (2010) achieves an average accuracy of 90.3% bets will be matched. We evaluate against a well- across all pairs of languages. Our system achieves established correspondence between the alphabets accuracy comparable to that of the phylogenetic sys- of Ugaritic and Hebrew. The Ugaritic alphabet con- tem, despite the fact that the phylogenetic system is tains 30 characters, the Hebrew alphabet contains 22 substantially more complex and makes use of an in- characters, and the gold matching contains 33 en- formed prior on alphabet correspondences. tries. We evaluate the learned alphabet matching by The alphabet matching learned by our system is counting the number of recovered entries from the interesting to analyze. For the pairing of Span- gold matching. ish and Portuguese it recovers phonetic correspon- Due to the size of the source and target lexicons, dences that are well known. Our system learns the we again use the greedy competitive linking algo- correct cognate pairing of Spanish /bino/ to Por- rithm in place of the exact Hungarian algorithm in tuguese /vinu/. This pair exemplifies two com- experiments on this data set. mon phonetic correspondences for Spanish and Por- tuguese: the Spanish /o/ often transforms to a /u/ in 5 Results Portuguese, and Spanish /b/ often transforms to /v/ We present results on all four datasets ROMANCE, in Portuguese. Our system, which allows many-to- PARTIAL ROMANCE, E UROPARL, and U GARITIC. many alphabet correspondences, correctly identifies On the ROMANCE and PARTIAL ROMANCE data sets the mappings /o/ → /u/ and /b/ → /v/ as well as the we compare against the numbers published by Hall identity mappings /o/ → /o/ and /b/ → /b/ which are and Klein (2010). We ran an implementation of also common. the orthographic system presented by Haghighi et al. (2008) on our E UROPARL data set. We com- 5.2 PARTIAL ROMANCE pare against the numbers published by Snyder et al. In Table 2 we present the results of running our sys- (2010) on the U GARITIC data set. We refer to our tem on the PARTIAL ROMANCE data set. In this data system as M ATCHER in result tables and discussion. set, only approximately 75% of the source words in each of the source lexicons have cognates in each of 5.1 ROMANCE the target lexicons. The parameter τ trades off pre- The results of running our system, M ATCHER, on cision and recall. We show results for three different the ROMANCE data set are shown in Table 1. We settings of τ : 0.25, 0.5, and 0.75. recover 88.9% of the correct cognate mappings on Our system achieves an average precision across the pair Spanish and Italian, 85.7% on Italian and language pairs of 99.7% at an average recall of Portuguese, and 95.6% on Spanish and Portuguese. 34.0%. For the pairs Italian – Portuguese, and Span- 319
  • 8. Model Seed τ Precision Recall F1 Model τ Lexicon Acc. Alphabet Acc. Haghighi et al. (2008) 20 0.1 72.0 14.0 23.5 Snyder et al. (2010) – 60.4* 29/33* 20 0.25 63.6 31.0 41.7 M ATCHER 1.0 90.4 28/33 20 0.5 44.8 43.7 44.2 50 0.1 90.5 17.6 29.5 50 0.25 75.4 36.7 49.4 Table 4: Results on U GARITIC data set. Our system is la- 50 0.5 56.4 55.0 55.7 beled M ATCHER. We compare against the decipherment M ATCHER 0 0.1 93.5 18.2 30.5 system of Snyder et al. (2010). *Note that results for this 0 0.25 83.2 40.5 54.5 system are on a somewhat different task. In particular, the 0 0.5 56.5 55.1 55.8 M ATCHER system assumes the inventories of cognates in both Hebrew and Ugaritic are known, while the system Table 3: Results on E UROPARL data set. Our system of Snyder et al. (2010) reconstructs cognates assuming is labeled M ATCHER. We compare against the bilingual only that the morphology of Hebrew is known, which is a lexicon induction system of Haghighi et al. (2008). We harder task. We show cognate pair identification accuracy show the cognate precision, recall, and F1 for the pair of and alphabet matching accuracy for Ugaritic and Hebrew. languages English and Spanish using lexicons extracted from corpora. Note that approximately 50% of the words in the English lexicon have cognates in the Spanish lexi- con. tional information: a seed matching of correct cog- nate pairs. The results show that as the size of this seed is decreased, the performance of the ortho- ish – Portuguese, our system achieves prefect preci- graphic system degrades. sion at recalls of 32.2% and 38.1% respectively. The best average F1 achieved by our system is 79.5%, 5.4 U GARITIC which surpasses the average F1 of 73.6 achieved by In Table 4 we present results on the U GARITIC data the phylogenetic system of Hall and Klein (2010). set. We evaluate both accuracy of the lexicon match- The phylogenetic system observes the phyloge- ing learned by our system, and the accuracy of the netic tree of ancestry for the three languages and alphabet matching. Our system achieves a lexicon explicitly models cognate evolution and survival in accuracy of 90.4% while correctly identifying 28 out a ‘survival’ tree. One might expect the phyloge- the 33 gold character mappings. netic system to achieve better results on this data set We also present the results for the decipherment where part of the task is identifying which words do model of Snyder et al. (2010) in Table 4. Note that not have cognates. It is surprising that our model while the evaluation data sets for our two models does so well given its simplicity. are the same, the tasks are very different. In par- 5.3 E UROPARL ticular, our system assumes the inventories of cog- nates in both Hebrew and Ugaritic are known, while Table 3 presents results for our system on the E U - the system of Snyder et al. (2010) reconstructs cog- ROPARL data set across three different settings of τ : nates assuming only that the morphology of Hebrew 0.1, 0.25, and 0.5. We compare against the ortho- is known, which is a harder task. Even so, the re- graphic system presented by Haghighi et al. (2008), sults show that our system is effective at decipher- across the same three settings of τ , and with two dif- ment when semantically similar lexicons are avail- ferent sizes of seed lexicon: 20 and 50. In this data able. set, only approximately 50% of the source words have cognates in the target lexicon. 6 Conclusion Our system achieves a precision of 93.5% at a re- call of 18.2%, and a best F1 of 55.0%. Using a seed We have presented a simple combinatorial model matching of 50 word pairs, the orthographic sys- that simultaneously incorporates both a matching tem of Haghighi et al. (2008) achieves a best F1 of between alphabets and a matching between lexicons. 55.7%. Using a seed matching of 20 word pairs, Our system is effective at both the tasks of cognate it achieves a best F1 of 44.2%. Our system out- identification and alphabet decipherment, requiring performs the orthographic system even though the only lists of words in both languages as input. orthographic system makes use of important addi- 320
  • 9. References A. Bouchard-Cˆ t´ , P. Liang, T.L. Griffiths, and D. Klein. oe 2007. A probabilistic approach to diachronic phonol- ogy. In Proc. of EMNLP. A. Bouchard-Cˆ t´ , T.L. Griffiths, and D. Klein. oe 2009. Improved reconstruction of protolanguage word forms. In Proc. of NAACL. A. Haghighi, P. Liang, T. Berg-Kirkpatrick, and D. Klein. 2008. Learning bilingual lexicons from monolingual corpora. Proceedings of ACL. D. Hall and D. Klein. 2010. Finding cognate groups using phylogenies. In Proc. of ACL. K. Knight and K. Yamada. 1999. A computational ap- proach to deciphering unknown scripts. In Proc. of ACL Workshop on Unsupervised Learning in Natural Language Processing. K. Knight, A. Nair, N. Rathod, and K. Yamada. 2006. Unsupervised analysis for decipherment problems. In Proc. of COLING/ACL. P. Koehn and K. Knight. 2002. Learning a translation lexicon from monolingual corpora. In Proc. of ACL workshop on Unsupervised lexical acquisition. P. Koehn. 2005. Europarl: A Parallel Corpus for Statis- tical Machine Translation. In Proc. of Machine Trans- lation Summit. G. Kondrak. 2001. Identifying Cognates by Phonetic and Semantic Similarity. In NAACL. H.W. Kuhn. 1955. The Hungarian method for the assign- ment problem. Naval research logistics quarterly. P. Liang, D. Klein, and M.I. Jordan. 2008. Agreement- based learning. Proc. of NIPS. S. Ravi and K. Knight. 2011. Bayesian inference for Zo- diac and other homophonic ciphers. In Proc. of ACL. B. Snyder, R. Barzilay, and K. Knight. 2010. A statisti- cal model for lost language decipherment. In Proc. of ACL. 321