Simple Effective Decipherment via Combinatorial Optimization

Taylor Berg-Kirkpatrick and Dan Klein
Computer Science Division
University of California at Berkeley
{tberg, klein}@cs.berkeley.edu

Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 313-321, Edinburgh, Scotland, UK, July 27-31, 2011. (c) 2011 Association for Computational Linguistics

Abstract

We present a simple objective function that when optimized yields accurate solutions to both decipherment and cognate pair identification problems. The objective simultaneously scores a matching between two alphabets and a matching between two lexicons, each in a different language. We introduce a simple coordinate descent procedure that efficiently finds effective solutions to the resulting combinatorial optimization problem. Our system requires only a list of words in both languages as input, yet it competes with and surpasses several state-of-the-art systems that are both substantially more complex and make use of more information.

1 Introduction

Decipherment induces a correspondence between the words in an unknown language and the words in a known language. We focus on the setting where a close correspondence between the alphabets of the two languages exists, but is unknown. Given only two lists of words, the lexicons of both languages, we attempt to induce the correspondence between alphabets and identify the cognate pairs present in the lexicons. The system we propose accomplishes this by defining a simple combinatorial optimization problem that is a function of both the alphabet and cognate matchings, and then induces correspondences by optimizing the objective using a block coordinate descent procedure.

There is a range of past work that has variously investigated cognate detection (Kondrak, 2001; Bouchard-Côté et al., 2007; Bouchard-Côté et al., 2009; Hall and Klein, 2010), character-level decipherment (Knight and Yamada, 1999; Knight et al., 2006; Snyder et al., 2010; Ravi and Knight, 2011), and bilingual lexicon induction (Koehn and Knight, 2002; Haghighi et al., 2008). We consider a common element, which is a model wherein there are character-level correspondences and word-level correspondences, with the word matching parameterized by the character one. This approach subsumes a range of past tasks, though of course past work has specialized in interesting ways. Past work has emphasized the modeling aspect, whereas here we use a parametrically simplistic model and instead emphasize inference.

2 Decipherment as Two-Level Optimization

Our method represents two matchings, one at the alphabet level and one at the lexicon level. A vector of variables x specifies a matching between alphabets. For each character i in the source alphabet and each character j in the target alphabet we define an indicator variable x_ij that is on if and only if character i is mapped to character j. Similarly, a vector y represents a matching between lexicons. For word u in the source lexicon and word v in the target lexicon, the indicator variable y_uv denotes that u maps to v. Note that the matchings need not be one-to-one.

We define an objective function on the matching variables as follows. Let EDITDIST(u, v; x) denote the edit distance between source word u and target word v given alphabet matching x. Let the length of word u be l_u and the length of word v be l_v. This edit distance depends on x in the following way. Insertions and deletions always cost a constant ε.[1] Substitutions also cost ε unless the characters are matched in x, in which case the substitution is free.

[1] In practice we set ε = 1/(l_u + l_v). l_u + l_v is the maximum number of edit operations between words u and v. This normalization ensures that edit distances are between 0 and 1 for all pairs of words.
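To make the cost structure concrete, here is a minimal sketch of the parameterized edit distance as a standard dynamic program. This is our illustration, not the authors' code; the function name edit_dist and the representation of x as a set of character pairs are assumptions made for clarity.

```python
# A sketch of EDITDIST(u, v; x): insertions and deletions cost eps,
# substitutions cost eps unless the character pair is matched in x,
# in which case they are free. eps = 1 / (l_u + l_v), per footnote 1.

def edit_dist(u, v, x):
    """u, v: words (strings); x: set of (source_char, target_char) pairs."""
    eps = 1.0 / (len(u) + len(v))
    # dp[n][m] = minimum cost of aligning u[:n] with v[:m]
    dp = [[0.0] * (len(v) + 1) for _ in range(len(u) + 1)]
    for n in range(1, len(u) + 1):
        dp[n][0] = n * eps                          # deletions only
    for m in range(1, len(v) + 1):
        dp[0][m] = m * eps                          # insertions only
    for n in range(1, len(u) + 1):
        for m in range(1, len(v) + 1):
            sub = 0.0 if (u[n - 1], v[m - 1]) in x else eps
            dp[n][m] = min(dp[n - 1][m - 1] + sub,  # substitution
                           dp[n - 1][m] + eps,      # deletion
                           dp[n][m - 1] + eps)      # insertion
    return dp[len(u)][len(v)]

# With 'a' matched to '2' and 't' to '3' (as in the Section 2.1 example),
# (cat, 23) costs exactly one deletion: 1 / (3 + 2) = 0.2.
print(edit_dist('cat', '23', {('a', '2'), ('t', '3')}))
```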
Now, the objective that we will minimize can be stated simply: the sum of the edit distances between the matched words, where the edit distance function is parameterized by the alphabet matching:

$$\sum_u \sum_v y_{uv} \cdot \text{EDITDIST}(u, v; x)$$

Without restrictions on the matchings x and y this objective can always be driven to zero by either mapping all characters to all characters, or matching none of the words. It is thus necessary to restrict the matchings in some way. Let I be the size of the source alphabet and J be the size of the target alphabet. We allow the alphabet matching x to be many-to-many but require that each character participate in no more than two mappings and that the total number of mappings be max(I, J), a constraint we refer to as restricted-many-to-many. The requirements can be encoded with the following linear constraints on x:

$$\forall i \quad \sum_j x_{ij} \le 2$$
$$\forall j \quad \sum_i x_{ij} \le 2$$
$$\sum_i \sum_j x_{ij} = \max(I, J)$$

The lexicon matching y is required to be τ-one-to-one. By this we mean that y is an at-most-one-to-one matching that covers proportion τ of the smaller of the two lexicons. Let U be the size of the source lexicon and V be the size of the target lexicon. This requirement can be encoded with the following linear constraints:

$$\forall u \quad \sum_v y_{uv} \le 1$$
$$\forall v \quad \sum_u y_{uv} \le 1$$
$$\sum_u \sum_v y_{uv} = \tau \min(U, V)$$

Now we are ready to define the full optimization problem. The first formulation is called the Implicit Matching Objective since it includes an implicit minimization over edit alignments inside the computation of EDITDIST.

(1) Implicit Matching Objective:

$$\min_{x, y} \sum_u \sum_v y_{uv} \cdot \text{EDITDIST}(u, v; x)$$
s.t. x is restricted-many-to-many
     y is τ-one-to-one

In order to get a better handle on the shape of the objective and to develop an efficient optimization procedure we decompose each edit distance computation and re-formulate the optimization problem in Section 2.2.

2.1 Example

Figure 1 presents both an example matching problem and a diagram of the variables and objective. Here, the source lexicon consists of the English words (cat, bat, cart, rat, cab), and the source alphabet consists of the characters (a, b, c, r, t). The target alphabet is (0, 1, 2, 3). We have used digits as symbols in the target alphabet to make it clear that we treat the alphabets as disjoint. We have no prior knowledge about any correspondence between alphabets, or between lexicons. The target lexicon consists of the words (23, 1233, 120, 323, 023). The bipartite graphs show a specific setting of the matching variables. The bold edges correspond to the x_ij and y_uv that are one. The matchings shown achieve an edit distance of zero between all matched word pairs except for the pair (cat, 23). The best edit alignment for this pair is also diagrammed. Here, 'a' is aligned to '2', 't' is aligned to '3', and 'c' is deleted and therefore aligned to the null position '#'. Only the initial deletion has a non-zero cost since all other alignments correspond to substitutions between characters that are matched in x.
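This toy problem can be written down directly. The sketch below is our illustration (the particular alphabet matching chosen is hypothetical): it evaluates the implicit objective for candidate matchings and checks the restricted-many-to-many constraint, reusing edit_dist from the earlier sketch.

```python
from collections import Counter

source_lex = ['cat', 'bat', 'cart', 'rat', 'cab']   # Section 2.1 example
target_lex = ['23', '1233', '120', '323', '023']

def implicit_objective(y, x):
    """y: set of (source_word, target_word) pairs; x: alphabet matching."""
    return sum(edit_dist(u, v, x) for (u, v) in y)

def is_restricted_many_to_many(x, I, J):
    """Each character in at most two mappings, max(I, J) mappings total."""
    src_deg = Counter(i for i, _ in x)
    tgt_deg = Counter(j for _, j in x)
    return (all(d <= 2 for d in src_deg.values()) and
            all(d <= 2 for d in tgt_deg.values()) and
            len(x) == max(I, J))

# A hypothetical alphabet matching with max(5, 4) = 5 total mappings:
x = {('a', '2'), ('t', '3'), ('b', '1'), ('c', '1'), ('r', '0')}
print(is_restricted_many_to_many(x, I=5, J=4))        # True
print(implicit_objective({('cat', '23'), ('rat', '323')}, x))
```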
[Figure 1: An example problem displaying source and target lexicons and alphabets, along with specific matchings. The variables involved in the optimization problem are diagrammed: x are the alphabet matching indicator variables, y are the lexicon matching indicator variables, and z are the edit alignment indicator variables. The index u refers to a word in the source lexicon, v refers to a word in the target lexicon, i refers to a character in the source alphabet, and j refers to a character in the target alphabet. n and m refer to positions in source and target words respectively. The matching objective function is also shown.]

2.2 Explicit Objective

Computing EDITDIST(u, v; x) requires running a dynamic program because of the unknown edit alignments; here we define those alignments z explicitly, which makes EDITDIST(u, v; x) easy to write explicitly at the cost of more variables. However, by writing the objective in an explicit form that refers to these edit variables, we are able to describe an efficient block coordinate descent procedure that can be used for optimization.

EDITDIST(u, v; x) is computed by minimizing over the set of monotonic alignments between the characters of the source word u and the characters of the target word v. Let u_n be the character at the nth position of the source word u, and similarly for v_m. Let z_uv be the vector of alignment variables for the edit distance computation between source word u and target word v, where entry z_uv,nm indicates whether the character at position n of source word u is aligned to the character at position m of target word v. Additionally, define variables z_uv,n# and z_uv,#m denoting null alignments, which will be used to keep track of insertions and deletions.

We define SUB(z_uv, x) to be the number of substitutions between characters that are not matched in x, DEL(z_uv) to be the number of deletions, and INS(z_uv) to be the number of insertions:

$$\text{SUB}(z_{uv}, x) = \sum_{n,m} (1 - x_{u_n v_m}) \, z_{uv,nm}$$
$$\text{DEL}(z_{uv}) = \sum_n z_{uv,n\#}$$
$$\text{INS}(z_{uv}) = \sum_m z_{uv,\#m}$$

$$\text{EDITDIST}(u, v; x) = \min_{z_{uv}} \; \epsilon \cdot \big[ \text{SUB}(z_{uv}, x) + \text{DEL}(z_{uv}) + \text{INS}(z_{uv}) \big]$$
s.t. z_uv is monotonic

Notice that the variable z_uv,nm being turned on indicates the substitute operation, while a z_uv,n# or z_uv,#m being turned on indicates an insert or delete operation. These variables are diagrammed in Figure 1. The requirement that z_uv be a monotonic alignment can be expressed using linear constraints, but in our optimization procedure (described in Section 3) these constraints need not be explicitly represented.
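As a sanity check on the explicit form, the following sketch (ours; the list-of-pairs encoding of z is an assumption) scores a fixed alignment by counting unmatched substitutions, deletions, and insertions exactly as in the equations above.

```python
# z is encoded as a list of aligned position pairs (n, m), with None as
# the null position '#': (n, None) is a deletion, (None, m) an insertion.

def explicit_edit_cost(u, v, z, x):
    eps = 1.0 / (len(u) + len(v))
    sub = sum(1 for (n, m) in z                  # SUB: unmatched substitutions
              if n is not None and m is not None and (u[n], v[m]) not in x)
    dels = sum(1 for (n, m) in z if m is None)   # DEL
    ins = sum(1 for (n, m) in z if n is None)    # INS
    return eps * (sub + dels + ins)

# The Figure 1 alignment for (cat, 23): 'c' deleted, 'a'-'2', 't'-'3'.
z = [(0, None), (1, 0), (2, 1)]
print(explicit_edit_cost('cat', '23', z, {('a', '2'), ('t', '3')}))  # 0.2
```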
Now we can substitute the explicit edit distance equation into the implicit matching objective (1). Noticing that the mins and sums commute, we arrive at the explicit form of the matching optimization problem.

(2) Explicit Matching Objective:

$$\min_{x, y, z} \sum_{u,v} y_{uv} \cdot \epsilon \cdot \big[ \text{SUB}(z_{uv}, x) + \text{DEL}(z_{uv}) + \text{INS}(z_{uv}) \big]$$
s.t. x is restricted-many-to-many
     y is τ-one-to-one
     ∀u,v: z_uv is monotonic

The implicit and explicit optimizations are the same, apart from the fact that the explicit optimization now explicitly represents the edit alignment variables z. Let the explicit matching objective (2) be denoted as J(x, y, z). The relaxation of the explicit problem with 0-1 constraints removed has integer solutions;[2] however, the objective J(x, y, z) is non-convex. We thus turn to a block coordinate descent method in the next section in order to find local optima.

[2] This can be shown by observing that optimizing x when y and z are held fixed yields integer solutions (shown in Section 3.2), and similarly for the optimization of y and z when x is fixed (shown in Section 3.1). Thus, every local optimum with respect to these block coordinate updates has integer solutions. The global optimum must be one of these local optima.

3 Optimization Method

We now state a block coordinate descent procedure to find local optima of J(x, y, z) under the constraints on x, y, and z. This procedure alternates between updating y and z to their exact joint optima when x is held fixed, and updating x to its exact optimum when y and z are held fixed.

The pseudocode for the procedure is given in Algorithm 1. Note that the function EDITDIST returns both the min edit distance e_uv and the argmin edit alignments z_uv. Also note that c_ij is as defined in Section 3.2.

Algorithm 1 Block Coordinate Descent
  Randomly initialize alphabet matching x.
  repeat
    for all u, v do
      (e_uv, z_uv) ← EDITDIST(u, v; x)
    end for
    y ← argmin_{y τ-one-to-one} Σ_{u,v} y_uv · e_uv         [Hungarian]
    x ← argmax_{x restr.-many-to-many} Σ_{i,j} x_ij · c_ij   [Solve LP]
  until convergence

3.1 Lexicon Matching Update

Let x, the alphabet matching variable, be fixed. We consider the problem of optimizing J(x, y, z) over the lexicon matching variable y and the edit alignments z under the constraint that y is τ-one-to-one and each z_uv is monotonic.

Notice that y simply picks out which edit distance problems affect the objective. The z_uv in each of these edit distance problems can be optimized independently: z_uv that do not have y_uv active have no effect on the objective, and z_uv with y_uv active can be optimized using the standard edit distance dynamic program. Thus, in a first step we compute the U · V edit distances e_uv and best monotonic alignment variables z_uv between all pairs of source and target words using U · V calls to the standard edit distance dynamic program. Altogether, this takes time O((Σ_u l_u) · (Σ_v l_v)).

Now, in a second step we compute the least weighted τ-one-to-one matching y under the weights e_uv. This can be accomplished in time O(max(U, V)^3) using the Hungarian algorithm (Kuhn, 1955). These two steps produce y and z that exactly achieve the optimum value of J(x, y, z) for the given value of x.
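A sketch of this lexicon matching update is below (our illustration, not the authors' code), reusing edit_dist from the earlier sketch. scipy's linear_sum_assignment stands in for the Hungarian algorithm; keeping only the cheapest τ fraction of the assigned pairs is a simplification of the exact τ-one-to-one matching.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def lexicon_update(source_lex, target_lex, x, tau=1.0):
    # First step: all-pairs edit distances under the current alphabet matching.
    e = np.array([[edit_dist(u, v, x) for v in target_lex]
                  for u in source_lex])
    # Second step: minimum-weight bipartite matching (Kuhn, 1955).
    rows, cols = linear_sum_assignment(e)
    pairs = sorted(zip(rows, cols), key=lambda rc: e[rc])
    k = int(tau * min(len(source_lex), len(target_lex)))
    return [(source_lex[r], target_lex[c]) for r, c in pairs[:k]]
```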
3.2 Alphabet Matching Update

Let y and z, the lexicon matching variables and the edit alignments, be fixed. Now, we find the optimal alphabet matching variables x subject to the constraint that x is restricted-many-to-many.

It makes sense that to optimize J(x, y, z) with respect to x we should prioritize mappings x_ij that would mitigate the largest substitution costs in the active edit distance problems. Indeed, with a little algebra it can be shown that solving a maximum weighted matching problem with weights c_ij that count potential substitution costs gives the correct update for x. In particular, c_ij is the total cost of substitution edits in the active edit alignment problems that would result if source character i were not mapped to target character j in the alphabet matching x. This can be written as:

$$c_{ij} = \epsilon \cdot \sum_{u,v} y_{uv} \cdot \sum_{n,m \,:\, u_n = i,\, v_m = j} z_{uv,nm}$$

If x were constrained to be one-to-one, we could again apply the Hungarian algorithm, this time to find a maximum weighted matching under the weights c_ij. Since we have instead allowed restricted-many-to-many alphabet matchings we turn to linear programming for optimizing x. We can state the update problem as the following linear program (LP), which is guaranteed to have integer solutions:

$$\max_x \sum_{i,j} x_{ij} c_{ij}$$
s.t. $$\forall i \; \sum_j x_{ij} \le 2, \qquad \forall j \; \sum_i x_{ij} \le 2, \qquad \sum_i \sum_j x_{ij} = \max(I, J)$$

In experiments we used the GNU Linear Programming Kit (GLPK) to solve the LP and update the alphabet matching x. This update yields matching variables x that achieve the optimum value of J(x, y, z) for fixed y and z.
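The update can be sketched with an off-the-shelf LP solver; here we use scipy's linprog rather than GLPK, which the paper used. This is a sketch under the assumption that the weights c are given as an I-by-J array; the 0.5 threshold relies on the paper's observation that the LP's optima are integral.

```python
import numpy as np
from scipy.optimize import linprog

def alphabet_update(c):
    """c: I x J array of weights c_ij; returns the new alphabet matching."""
    I, J = c.shape
    A_ub, b_ub = [], []
    for i in range(I):                 # each source character: at most 2 edges
        row = np.zeros(I * J); row[i * J:(i + 1) * J] = 1
        A_ub.append(row); b_ub.append(2)
    for j in range(J):                 # each target character: at most 2 edges
        row = np.zeros(I * J); row[j::J] = 1
        A_ub.append(row); b_ub.append(2)
    A_eq = np.ones((1, I * J))         # exactly max(I, J) mappings in total
    b_eq = [max(I, J)]
    # Maximize sum_ij c_ij x_ij, i.e. minimize its negation.
    res = linprog(-c.flatten(), A_ub=np.array(A_ub), b_ub=b_ub,
                  A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
    return {(i, j) for i in range(I) for j in range(J)
            if res.x[i * J + j] > 0.5}
```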
3.3 Random Restarts

In practice we found that the block coordinate descent procedure can get stuck at poor local optima. To find better optima, we run the coordinate descent procedure multiple times, initialized each time with a random alphabet matching. We choose the local optimum with the best objective value across all initializations. This approach yielded substantial improvements in achieved objective value.

4 Experiments

We compare our system to three different state-of-the-art systems on three different data sets. We set up experiments that allow for as direct a comparison as possible. In some cases it must be pointed out that the past systems' goals are different from our own, and we will be comparing in a different way than the respective work was intended. The three systems make use of additional, or slightly different, sources of information.

4.1 Phonetic Cognate Lexicons

The first data set we evaluate on consists of 583 triples of phonetic transcriptions of cognates in Spanish, Portuguese, and Italian. The data set was introduced by Bouchard-Côté et al. (2007). For a given pair of languages the task is to determine the mapping between lexicons that correctly maps each source word to its cognate in the target lexicon. We refer to this task and data set as ROMANCE.

Hall and Klein (2010) presented a state-of-the-art system for the task of cognate identification and evaluated on this data set. Their model explicitly represents parameters for phonetic change between languages and their parents in a phylogenetic tree. They estimate parameters and infer the pairs of cognates present in all three languages jointly, while we consider each pair of languages in turn.

Their model has similarities with our own in that it learns correspondences between the alphabets of pairs of languages. However, their correspondences are probabilistic and implicit while ours are hard and explicit. Their model also differs from our own in a key way. Notice that the phonetic alphabets for the three languages are actually the same. Since phonetic change occurs gradually across languages, a helpful prior on the correspondence is to favor the identity. Their model makes use of such a prior. Our model, on the other hand, is unaware of any prior correspondence between alphabets and does not make use of this additional information about phonetic change.

Hall and Klein (2010) also evaluate their model on lexicons that do not have a perfect cognate mapping. This scenario, where not every word in one language has a cognate in another, is more realistic. They produced a data set with this property by pruning words from the ROMANCE data set until only about 75% of the words in each source lexicon have cognates in each target lexicon. We refer to this task and data set as PARTIAL ROMANCE.
4.2 Lexicons Extracted from Corpora

Next, we evaluate our model on a noisier data set. Here the lexicons in source and target languages are extracted from corpora by taking the top 2,000 words in each corpus. In particular, we used the English and Spanish sides of the Europarl parallel corpus (Koehn, 2005). To make this setup more realistic (though fairly comparable), we ensured that the corpora were non-parallel by using the first 50K sentences on the English side and the second 50K sentences on the Spanish side. To generate a gold cognate matching we used the intersected HMM alignment model of Liang et al. (2008) to align the full parallel corpus. From this alignment we extracted a translation lexicon by adding an entry for each word pair with the property that the English word was aligned to the Spanish word in over 10% of the alignments involving the English word. To reduce this translation lexicon down to a cognate matching we went through the translation lexicon by hand and removed any pair of words that we judged not to be cognates. The resulting gold matching contains cognate mappings in the English lexicon for 1,026 of the words in the Spanish lexicon. This means that only about 50% of the words in the English lexicon have cognates in the Spanish lexicon. We evaluate on this data set by computing precision and recall for the number of English words that are mapped to a correct cognate. We refer to this task and data set as EUROPARL.

On this data set, we compare against the state-of-the-art orthographic system presented in Haghighi et al. (2008). Haghighi et al. (2008) presents several systems that are designed to extract translation lexicons for non-parallel corpora by learning a correspondence between their monolingual lexicons. Since our system specializes in matching cognates and does not take into account additional information from corpus statistics, we compare against the version of their system that only takes into account orthographic features and is thus best suited for cognate detection. Their system requires a small seed of correct cognate pairs. From this seed the system learns a projection using canonical correlation analysis (CCA) into a canonical feature space that allows feature vectors from source words and target words to be compared. Once in this canonical space, similarity metrics can be computed and words can be matched using a bipartite matching algorithm. The process is iterative, adding cognate pairs to the seed lexicon gradually and each time re-computing a refined projection. Our system makes no use of a seed lexicon whatsoever.

Both our system and the system of Haghighi et al. (2008) must solve bipartite matching problems between the two lexicons. For this data set, the lexicons are large enough that finding the exact solution can be slow. Thus, in all experiments on this data set, we instead use a greedy competitive linking algorithm that runs in time O(U^2 V^2 log(UV)); a sketch follows at the end of this section.

Again, for this data set it is reasonable to expect that many characters will map to themselves in the best alphabet matching. The alphabets are not identical, but are far from disjoint. Neither our system nor that of Haghighi et al. (2008) makes use of this expectation. As far as both systems are concerned, the alphabets are disjoint.

4.3 Decipherment

Finally, we evaluate our model on a data set where a main goal is to decipher an unknown correspondence between alphabets. We attempt to learn a mapping from the alphabet of the ancient Semitic language Ugaritic to the alphabet of Hebrew, and at the same time learn a matching between Hebrew words in a Hebrew lexicon and their cognates in a Ugaritic lexicon. This task is related to the task attempted by Snyder et al. (2010). The data set consists of a Ugaritic lexicon of 2,214 words, each of which has a Hebrew cognate, the lexicon of their 2,214 Hebrew cognates, and a gold cognate dictionary for evaluation. We refer to this task and data set as UGARITIC.

The non-parametric Bayesian system of Snyder et al. (2010) assumes that the morphology of Hebrew is known, making use of an inventory of suffixes, prefixes, and stems derived from the words in the Hebrew bible. It attempts to learn a correspondence between the morphology of Ugaritic and that of Hebrew while reconstructing cognates for Ugaritic words. This is a slightly different goal than that of our system, which learns a correspondence between lexicons. Snyder et al. (2010) run their system on a set of 7,386 Ugaritic words, the same set that we extracted our 2,214 Ugaritic words with Hebrew cognates from. We evaluate the accuracy of the lexicon matching produced by our system on these 2,214 Ugaritic words, and so do they, measuring the number of correctly reconstructed cognates.

By restricting the source and target lexicons to sets of cognates we have made the task easier. This was necessary, however, because the Ugaritic and Hebrew corpora used by Snyder et al. (2010) are not comparable: only a small proportion of the words in the Ugaritic lexicon have cognates in the lexicon composed of the most frequent Hebrew words.

Here, the alphabets really are disjoint. The symbols in the two languages look nothing alike. There is no obvious prior expectation about how the alphabets will be matched. We evaluate against a well-established correspondence between the alphabets of Ugaritic and Hebrew. The Ugaritic alphabet contains 30 characters, the Hebrew alphabet contains 22 characters, and the gold matching contains 33 entries. We evaluate the learned alphabet matching by counting the number of recovered entries from the gold matching.

Due to the size of the source and target lexicons, we again use the greedy competitive linking algorithm in place of the exact Hungarian algorithm in experiments on this data set.
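The greedy competitive linking algorithm used in place of the Hungarian algorithm can be sketched as follows. This is our reading of competitive linking, not the authors' code: scan candidate pairs in order of increasing cost and accept a pair whenever neither word is already matched.

```python
def competitive_linking(cost, k):
    """cost: dict mapping (source_word, target_word) pairs to edit distance;
    k: number of pairs to keep (tau times the smaller lexicon size)."""
    matched_src, matched_tgt, pairs = set(), set(), []
    for (u, v) in sorted(cost, key=cost.get):    # cheapest pairs first
        if u not in matched_src and v not in matched_tgt:
            matched_src.add(u); matched_tgt.add(v)
            pairs.append((u, v))
            if len(pairs) == k:
                break
    return pairs
```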
5 Results

We present results on all four data sets: ROMANCE, PARTIAL ROMANCE, EUROPARL, and UGARITIC. On the ROMANCE and PARTIAL ROMANCE data sets we compare against the numbers published by Hall and Klein (2010). We ran an implementation of the orthographic system presented by Haghighi et al. (2008) on our EUROPARL data set. We compare against the numbers published by Snyder et al. (2010) on the UGARITIC data set. We refer to our system as MATCHER in result tables and discussion.

5.1 ROMANCE

The results of running our system, MATCHER, on the ROMANCE data set are shown in Table 1. We recover 88.9% of the correct cognate mappings on the pair Spanish and Italian, 85.7% on Italian and Portuguese, and 95.6% on Spanish and Portuguese.

  Model                  τ     Accuracy
  Hall and Klein (2010)  -     90.3
  MATCHER                1.0   90.1

Table 1: Results on the ROMANCE data set. Our system is labeled MATCHER. We compare against the phylogenetic cognate detection system of Hall and Klein (2010). We show the pairwise cognate accuracy across all pairs of languages from the following set: Spanish, Portuguese, and Italian.

Our average accuracy across all pairs of languages is 90.1%. The phylogenetic system of Hall and Klein (2010) achieves an average accuracy of 90.3% across all pairs of languages. Our system achieves accuracy comparable to that of the phylogenetic system, despite the fact that the phylogenetic system is substantially more complex and makes use of an informed prior on alphabet correspondences.

The alphabet matching learned by our system is interesting to analyze. For the pairing of Spanish and Portuguese it recovers phonetic correspondences that are well known. Our system learns the correct cognate pairing of Spanish /bino/ to Portuguese /vinu/. This pair exemplifies two common phonetic correspondences for Spanish and Portuguese: the Spanish /o/ often transforms to a /u/ in Portuguese, and Spanish /b/ often transforms to /v/ in Portuguese. Our system, which allows many-to-many alphabet correspondences, correctly identifies the mappings /o/ → /u/ and /b/ → /v/ as well as the identity mappings /o/ → /o/ and /b/ → /b/, which are also common.

5.2 PARTIAL ROMANCE

In Table 2 we present the results of running our system on the PARTIAL ROMANCE data set. In this data set, only approximately 75% of the source words in each of the source lexicons have cognates in each of the target lexicons. The parameter τ trades off precision and recall. We show results for three different settings of τ: 0.25, 0.5, and 0.75.

  Model                  τ     Precision  Recall  F1
  Hall and Klein (2010)  -     66.9       82.0    73.6
  MATCHER                0.25  99.7       34.0    50.7
                         0.50  93.8       60.2    73.3
                         0.75  81.1       78.0    79.5

Table 2: Results on the PARTIAL ROMANCE data set. Our system is labeled MATCHER. We compare against the phylogenetic cognate detection system of Hall and Klein (2010). We show the pairwise cognate precision, recall, and F1 across all pairs of languages from the following set: Spanish, Portuguese, and Italian. Note that approximately 75% of the source words in each of the source lexicons have cognates in each of the target lexicons.

Our system achieves an average precision across language pairs of 99.7% at an average recall of 34.0%. For the pairs Italian-Portuguese and Spanish-Portuguese, our system achieves perfect precision at recalls of 32.2% and 38.1% respectively. The best average F1 achieved by our system is 79.5%, which surpasses the average F1 of 73.6 achieved by the phylogenetic system of Hall and Klein (2010).

The phylogenetic system observes the phylogenetic tree of ancestry for the three languages and explicitly models cognate evolution and survival in a 'survival' tree. One might expect the phylogenetic system to achieve better results on this data set, where part of the task is identifying which words do not have cognates. It is surprising that our model does so well given its simplicity.
5.3 EUROPARL

Table 3 presents results for our system on the EUROPARL data set across three different settings of τ: 0.1, 0.25, and 0.5. We compare against the orthographic system presented by Haghighi et al. (2008), across the same three settings of τ, and with two different sizes of seed lexicon: 20 and 50. In this data set, only approximately 50% of the source words have cognates in the target lexicon.

  Model                   Seed  τ     Precision  Recall  F1
  Haghighi et al. (2008)  20    0.1   72.0       14.0    23.5
                          20    0.25  63.6       31.0    41.7
                          20    0.5   44.8       43.7    44.2
                          50    0.1   90.5       17.6    29.5
                          50    0.25  75.4       36.7    49.4
                          50    0.5   56.4       55.0    55.7
  MATCHER                 0     0.1   93.5       18.2    30.5
                          0     0.25  83.2       40.5    54.5
                          0     0.5   56.5       55.1    55.8

Table 3: Results on the EUROPARL data set. Our system is labeled MATCHER. We compare against the bilingual lexicon induction system of Haghighi et al. (2008). We show the cognate precision, recall, and F1 for the pair of languages English and Spanish using lexicons extracted from corpora. Note that approximately 50% of the words in the English lexicon have cognates in the Spanish lexicon.

Our system achieves a precision of 93.5% at a recall of 18.2%, and a best F1 of 55.8%. Using a seed matching of 50 word pairs, the orthographic system of Haghighi et al. (2008) achieves a best F1 of 55.7%. Using a seed matching of 20 word pairs, it achieves a best F1 of 44.2%. Our system outperforms the orthographic system even though the orthographic system makes use of important additional information: a seed matching of correct cognate pairs. The results show that as the size of this seed is decreased, the performance of the orthographic system degrades.

5.4 UGARITIC

In Table 4 we present results on the UGARITIC data set. We evaluate both the accuracy of the lexicon matching learned by our system and the accuracy of the alphabet matching. Our system achieves a lexicon accuracy of 90.4% while correctly identifying 28 out of the 33 gold character mappings.

  Model                 τ     Lexicon Acc.  Alphabet Acc.
  Snyder et al. (2010)  -     60.4*         29/33*
  MATCHER               1.0   90.4          28/33

Table 4: Results on the UGARITIC data set. Our system is labeled MATCHER. We compare against the decipherment system of Snyder et al. (2010). We show cognate pair identification accuracy and alphabet matching accuracy for Ugaritic and Hebrew. *Note that results for this system are on a somewhat different task. In particular, the MATCHER system assumes the inventories of cognates in both Hebrew and Ugaritic are known, while the system of Snyder et al. (2010) reconstructs cognates assuming only that the morphology of Hebrew is known, which is a harder task.

We also present the results for the decipherment model of Snyder et al. (2010) in Table 4. Note that while the evaluation data sets for our two models are the same, the tasks are very different. In particular, our system assumes the inventories of cognates in both Hebrew and Ugaritic are known, while the system of Snyder et al. (2010) reconstructs cognates assuming only that the morphology of Hebrew is known, which is a harder task. Even so, the results show that our system is effective at decipherment when semantically similar lexicons are available.

6 Conclusion

We have presented a simple combinatorial model that simultaneously incorporates both a matching between alphabets and a matching between lexicons. Our system is effective at both the tasks of cognate identification and alphabet decipherment, requiring only lists of words in both languages as input.
References

A. Bouchard-Côté, P. Liang, T. L. Griffiths, and D. Klein. 2007. A probabilistic approach to diachronic phonology. In Proc. of EMNLP.
A. Bouchard-Côté, T. L. Griffiths, and D. Klein. 2009. Improved reconstruction of protolanguage word forms. In Proc. of NAACL.
A. Haghighi, P. Liang, T. Berg-Kirkpatrick, and D. Klein. 2008. Learning bilingual lexicons from monolingual corpora. In Proc. of ACL.
D. Hall and D. Klein. 2010. Finding cognate groups using phylogenies. In Proc. of ACL.
K. Knight and K. Yamada. 1999. A computational approach to deciphering unknown scripts. In Proc. of ACL Workshop on Unsupervised Learning in Natural Language Processing.
K. Knight, A. Nair, N. Rathod, and K. Yamada. 2006. Unsupervised analysis for decipherment problems. In Proc. of COLING/ACL.
P. Koehn and K. Knight. 2002. Learning a translation lexicon from monolingual corpora. In Proc. of ACL Workshop on Unsupervised Lexical Acquisition.
P. Koehn. 2005. Europarl: a parallel corpus for statistical machine translation. In Proc. of Machine Translation Summit.
G. Kondrak. 2001. Identifying cognates by phonetic and semantic similarity. In Proc. of NAACL.
H. W. Kuhn. 1955. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly.
P. Liang, D. Klein, and M. I. Jordan. 2008. Agreement-based learning. In Proc. of NIPS.
S. Ravi and K. Knight. 2011. Bayesian inference for Zodiac and other homophonic ciphers. In Proc. of ACL.
B. Snyder, R. Barzilay, and K. Knight. 2010. A statistical model for lost language decipherment. In Proc. of ACL.
