On the Limitations of Unsupervised Bilingual Dictionary Induction

Talk given at ACL 2018 about our paper On the Limitations of Unsupervised Bilingual Dictionary Induction (http://aclweb.org/anthology/P18-1072).

  1. On the Limitations of Unsupervised Bilingual Dictionary Induction
     Sebastian Ruder, Ivan Vulić, Anders Søgaard
  2. Background: Unsupervised MT
     ‣ Recently: unsupervised neural machine translation (Artetxe et al., ICLR 2018; Lample et al., ICLR 2018)
     ‣ Key component: initialization via unsupervised cross-lingual alignment of word embedding spaces
  3. Background: Cross-lingual word embeddings
     ‣ Cross-lingual word embeddings enable cross-lingual transfer
     ‣ Most common approach: project one word embedding space into another by learning a transformation matrix W between source embeddings x_i and their translations y_i (Mikolov et al., 2013), minimising Σ_{i=1}^{n} ∥W x_i − y_i∥²; a closed-form sketch follows after this slide
     ‣ More recently: use an adversarial setup to learn an unsupervised mapping
     ‣ Assumption: word embedding spaces are approximately isomorphic, i.e. they have the same number of vertices, connected the same way
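
Since the objective above is ordinary least squares, W has a closed-form solution. A minimal numpy sketch, assuming X and Y are row-aligned matrices of seed-pair embeddings (all names here are illustrative, not from the paper):

```python
import numpy as np

def learn_mapping(X, Y):
    """Learn W minimising sum_i ||W x_i - y_i||^2.

    X: (n, d) embeddings of source seed words.
    Y: (n, d) embeddings of their translations.
    Returns W (d, d) such that X @ W.T approximates Y.
    """
    # Least squares: solve X @ W^T = Y for W^T.
    W_T, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W_T.T

# Usage: project the whole source vocabulary into the target space.
# W = learn_mapping(X_seed, Y_seed)
# projected = X_all @ W.T
```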
  4. How similar are embeddings across languages?
     ‣ Nearest neighbour (NN) graphs of the top 10 most frequent words in English and German are not isomorphic
     ‣ NN graphs of the top 10 most frequent English words and their translations into German are not isomorphic either
     [Figure: NN graphs of the English and German words]
  5. How similar are embeddings across languages?
     ‣ NN graphs of the top 10 most frequent English nouns and their translations are likewise not isomorphic
     ‣ Word embeddings are not approximately isomorphic across languages
     [Figure: NN graphs of the English and German nouns]
  6. How do we quantify similarity?
     ‣ Need a metric to measure how similar two NN graphs G1 and G2 of different languages are
     ‣ Propose eigenvector similarity:
       ‣ A1, A2: adjacency matrices of G1, G2
       ‣ D1, D2: degree matrices of G1, G2
       ‣ L1 = D1 − A1, L2 = D2 − A2: Laplacians of G1, G2
       ‣ λ1, λ2: eigenvalues (spectra) of L1, L2
     ‣ Metric: Δ = Σ_{i=1}^{k} (λ1i − λ2i)², where k = min_j { Σ_{i=1}^{k} λji / Σ_{i=1}^{n} λji > 0.9 }
  7. How do we quantify similarity?
     ‣ Metric: Δ = Σ_{i=1}^{k} (λ1i − λ2i)², where k = min_j { Σ_{i=1}^{k} λji / Σ_{i=1}^{n} λji > 0.9 }
     ‣ Δ quantifies how close two NN graphs are to being isospectral, i.e. having the same spectrum (the same sets of eigenvalues)
     ‣ Isomorphic → isospectral, but isospectral ↛ isomorphic
     ‣ Δ : (G1, G2) → [0, ∞)
     ‣ Δ = 0: G1, G2 are isospectral (very similar)
     ‣ Δ → ∞: G1, G2 become less and less similar
     A numpy sketch of Δ follows after this slide.
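
The metric is easy to compute directly from the embeddings. A sketch with numpy, assuming each input is the embedding matrix of one language's word list; unit-weight NN edges and descending eigenvalue order are my assumptions, since the slides leave them implicit:

```python
import numpy as np

def nn_adjacency(emb):
    """Adjacency matrix of the (symmetrised) nearest-neighbour graph."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)           # exclude self-similarity
    nn = sim.argmax(axis=1)                  # each word's nearest neighbour
    n = len(emb)
    A = np.zeros((n, n))
    A[np.arange(n), nn] = 1.0
    A[nn, np.arange(n)] = 1.0                # make the graph undirected
    return A

def spectrum(A):
    """Eigenvalues of the Laplacian L = D - A, sorted descending (assumed)."""
    L = np.diag(A.sum(axis=1)) - A
    return np.sort(np.linalg.eigvalsh(L))[::-1]

def eigenvector_similarity(emb1, emb2):
    """Delta = sum_{i<=k} (lambda_1i - lambda_2i)^2, with k the smallest
    prefix holding > 90% of the spectrum's mass in either graph."""
    s1, s2 = spectrum(nn_adjacency(emb1)), spectrum(nn_adjacency(emb2))
    def k_of(s):
        frac = np.cumsum(s) / s.sum()
        return int(np.searchsorted(frac, 0.9, side='right')) + 1
    k = min(k_of(s1), k_of(s2))
    return float(np.sum((s1[:k] - s2[:k]) ** 2))
```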
  8. Unsupervised cross-lingual learning assumptions
     ‣ Besides isomorphism, several other implicit assumptions
     ‣ May or may not scale to low-resource languages
     ‣ Languages: dependent-marking, fusional and isolating (Conneau et al., 2018) vs. agglutinative, many cases (this work)
     ‣ Corpora: comparable (Wikipedia) vs. different domains
     ‣ Algorithms/hyperparameters: same vs. different
  9. Conneau et al. (2018)
     1. Monolingual word embeddings: learn monolingual vector spaces X and Y.
     2. Adversarial mapping: learn a translation matrix W; train a discriminator to distinguish samples from WX and Y. A simplified sketch follows after this slide.
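
Step 2 above is a standard adversarial game. A heavily simplified PyTorch sketch of the idea; Conneau et al. additionally use label smoothing and keep W near-orthogonal, which this sketch only notes in a comment:

```python
import torch
import torch.nn as nn

d = 300                                    # embedding dimensionality
W = nn.Linear(d, d, bias=False)            # the translation matrix W
D = nn.Sequential(                         # discriminator: WX or Y?
    nn.Linear(d, 2048), nn.LeakyReLU(0.2),
    nn.Linear(2048, 1), nn.Sigmoid(),
)
opt_W = torch.optim.SGD(W.parameters(), lr=0.1)
opt_D = torch.optim.SGD(D.parameters(), lr=0.1)
bce = nn.BCELoss()

def train_step(x, y):                      # batches of source/target vectors
    # 1) Discriminator: label projected source (WX) as 0, target (Y) as 1.
    opt_D.zero_grad()
    pred = torch.cat([D(W(x).detach()), D(y)])
    gold = torch.cat([torch.zeros(len(x), 1), torch.ones(len(y), 1)])
    bce(pred, gold).backward()
    opt_D.step()
    # 2) Mapping: update W so that WX fools the discriminator.
    opt_W.zero_grad()
    bce(D(W(x)), torch.ones(len(x), 1)).backward()
    opt_W.step()
    # (Conneau et al. also re-orthogonalise: W <- (1+b)W - b(WW^T)W.)
```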
  10. Conneau et al. (2018)
     3. Refinement (Procrustes analysis): build a bilingual dictionary of frequent words using W, then learn a new W from these frequent word pairs.
     4. Cross-domain similarity local scaling (CSLS): use a similarity measure that increases the similarity of isolated word vectors and decreases the similarity of vectors in dense areas. Both steps are sketched after this slide.
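
Both steps have compact closed forms. A numpy sketch, assuming X and Y are row-aligned embeddings of the induced frequent-word pairs and k is the CSLS neighbourhood size (variable names are mine):

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal W minimising ||X @ W.T - Y||_F (closed form via SVD)."""
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt

def csls(WX, Y, k=10):
    """CSLS(x, y) = 2 cos(Wx, y) - r_T(Wx) - r_S(y), where r_* is the mean
    cosine similarity to the k nearest neighbours in the other space."""
    WX = WX / np.linalg.norm(WX, axis=1, keepdims=True)
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    sim = WX @ Y.T                                    # all cosine similarities
    r_T = np.sort(sim, axis=1)[:, -k:].mean(axis=1)   # per source word
    r_S = np.sort(sim, axis=0)[-k:, :].mean(axis=0)   # per target word
    return 2 * sim - r_T[:, None] - r_S[None, :]
```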
  11. A simple weakly supervised method
     ‣ Extract identically spelled words in both languages (sketched below)
     ‣ Use these as bilingual seed words
     ‣ Run the refinement step of Conneau et al. (2018)
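
The seed extraction itself is a one-liner over the two vocabularies; a sketch (the function name is mine):

```python
def identical_string_seeds(src_vocab, tgt_vocab):
    """Identically spelled words (names, numbers, loanwords) become the
    weakly supervised seed dictionary for the Procrustes refinement."""
    return [(w, w) for w in sorted(set(src_vocab) & set(tgt_vocab))]
```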
  12. Experiments: Bilingual dictionary induction
     ‣ Given a list of source language words, find the closest target language word in the cross-lingual embedding space
     ‣ Compare against a gold standard dictionary
     ‣ Metric: precision at 1 (P@1), sketched after this slide
     ‣ Use fastText monolingual embeddings
     ‣ Languages (English to): French, German, Chinese, Russian, Spanish in Conneau et al. (2018); Estonian (ET), Finnish (FI), Greek (EL), Hungarian (HU), Polish (PL), Turkish (TR) in this work
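
P@1 is nearest-neighbour retrieval accuracy against the gold dictionary. A sketch reusing the csls helper from the Conneau et al. sketch above; the gold format (source index to set of valid target indices) is an assumption:

```python
def precision_at_1(WX, Y, gold):
    """Fraction of source words whose top-ranked target is in the gold set."""
    scores = csls(WX, Y)                 # or plain cosine: WX @ Y.T
    best = scores.argmax(axis=1)         # closest target word per source word
    hits = sum(best[s] in targets for s, targets in gold.items())
    return hits / len(gold)
```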
  13. Impact of language similarity
     ‣ Unsupervised approaches are challenged by languages that are not isolating and not dependent-marking
     ‣ Naive supervision leads to competitive performance on similar language pairs and better results for dissimilar pairs
     [Chart: P@1 on EN-ES, EN-ET, EN-FI, EN-EL, EN-HU, EN-PL, EN-TR and ET-FI, unsupervised (adversarial) vs. weakly supervised (identical strings)]
  14. Impact of language similarity
     ‣ Eigenvector similarity strongly correlates with BDI performance (ρ ∼ 0.89)
     [Chart: eigenvector similarity vs. BDI performance]
  15. Impact of domain differences (English-Spanish)
     ‣ Source and target embeddings induced on 3 corpora: EuroParl (EP), Wikipedia (Wiki), Medical (EMEA)
     ‣ Unsupervised approaches break down when domains are dissimilar
     [Chart: P@1 and domain similarity for all nine EP/Wiki/EMEA source-target combinations, unsupervised vs. weakly supervised]
  16. Impact of domain differences (English-Finnish)
     ‣ Domain differences may exacerbate the difficulties of generalising across dissimilar languages
     [Chart: P@1 and domain similarity for all nine EP/Wiki/EMEA source-target combinations, unsupervised vs. weakly supervised]
  17. Impact of domain differences (English-Hungarian)
     ‣ Weak supervision helps to bridge domain differences, but performance still deteriorates
     [Chart: P@1 and domain similarity for all nine EP/Wiki/EMEA source-target combinations, unsupervised vs. weakly supervised]
  18. Impact of hyper-parameters
     ‣ Setting: English embeddings trained with skipgram, win=2, ngrams=3-6
     ‣ Vary the hyper-parameters of the Spanish embeddings
     [Chart: English-Spanish P@1 with Spanish embeddings trained with the same settings, with win=10, with ngrams=2-7, and with win=10 plus ngrams=2-7, for skipgram and cbow]
  19. Impact of hyper-parameters
     ‣ Different algorithms induce embedding spaces with wildly different structures
     [Chart: same English-Spanish comparison as the previous slide, contrasting skipgram and cbow]
  20. Impact of dimensionality
     ‣ 40-dimensional embeddings: worse performance overall, but better performance for dissimilar language pairs (Estonian, Finnish, Greek)
     ‣ Monolingual word embeddings may overfit to rare peculiarities of languages
     [Chart: P@1 on EN-ES, EN-ET, EN-FI, EN-EL, EN-HU, EN-PL and EN-TR with 300-dimensional vs. 40-dimensional embeddings]
  21. Impact of evaluation procedure
     ‣ Part-of-speech: performance on verbs is lowest across the board
     ‣ Frequency: sensitive to frequency for Hungarian, but less so for Spanish
     ‣ Homographs: lower precision due to loan words and proper names; weak supervision gets high precision on these for free

  22. Takeaways
     ‣ Word embedding spaces are not approximately isomorphic across languages
     ‣ We can use eigenvector similarity to characterise the relatedness of two monolingual vector spaces
     ‣ Eigenvector similarity strongly correlates with unsupervised bilingual dictionary induction performance
     ‣ Limitations of unsupervised bilingual dictionary induction:
       ‣ morphologically rich languages
       ‣ corpora from different domains
       ‣ different word embedding algorithms
