This document presents a self-learning method for learning bilingual word embeddings from a small bilingual dictionary: starting from an initial seed dictionary, it maps monolingual word embeddings of two languages into a shared space and iteratively refines both the mapping and the dictionary. The method is evaluated on word translation induction between English-Italian, English-German, and English-Finnish using seed dictionaries of various sizes. Results are competitive with previous work even with seed dictionaries of only 25 word pairs or a list of numerals, demonstrating that bilingual embeddings can be learned with almost no bilingual data.
Mikel Artetxe - 2017 - Learning bilingual word embeddings with (almost) no bilingual data
1. Learning bilingual word embeddings with (almost) no bilingual data
Mikel Artetxe, Gorka Labaka, Eneko Agirre
IXA NLP group – University of the Basque Country (UPV/EHU)
8–18. Who cares?
Word embeddings are useful for:
- inherently crosslingual tasks
- crosslingual transfer learning
Both need a bilingual signal for training.
Previous work:
- parallel corpora
- comparable corpora
- (big) dictionaries
This talk:
- 25 word dictionary
- numerals (1, 2, 3…)
51–53. Bilingual embedding mappings
Proposed self-learning method: starting from monolingual embeddings and a seed dictionary, learn a mapping, use it to induce a new dictionary, and iterate.
- based on the mapping method of Artetxe et al. (2016)
- formalization and implementation details in the paper
Too good to be true?
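The loop on this slide is compact enough to sketch. Below is a minimal sketch in NumPy, assuming `X` and `Z` are length-normalized monolingual embedding matrices and `seed` is a list of (source index, target index) pairs; the function names and the fixed iteration count are illustrative (the paper iterates until convergence), but the mapping step is the orthogonal Procrustes solution of Artetxe et al. (2016).

```python
import numpy as np

def learn_mapping(X, Z, pairs):
    # Mapping step (Artetxe et al., 2016): the orthogonal W that best aligns
    # the dictionary pairs, obtained in closed form from an SVD.
    src = np.array([s for s, _ in pairs])
    trg = np.array([t for _, t in pairs])
    U, _, Vt = np.linalg.svd(X[src].T @ Z[trg])
    return U @ Vt

def induce_dictionary(X, Z, W):
    # Dictionary induction step: pair each mapped source word with its
    # nearest target word (dot product = cosine on normalized rows).
    sims = (X @ W) @ Z.T
    return list(enumerate(sims.argmax(axis=1)))

def self_learn(X, Z, seed, iterations=10):
    pairs = seed
    for _ in range(iterations):
        W = learn_mapping(X, Z, pairs)       # 1) map with current dictionary
        pairs = induce_dictionary(X, Z, W)   # 2) re-induce the dictionary
    return W
```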
58–72. Experiments: word translation induction
• Dataset by Dinu et al. (2015) extended to German and Finnish
  ⇒ Monolingual embeddings (CBOW + negative sampling)
  ⇒ Seed dictionary: 5,000 word pairs / 25 word pairs / numerals
  ⇒ Test dictionary: 1,500 word pairs

                         English-Italian         English-German          English-Finnish
                         5,000   25      num.    5,000   25      num.    5,000   25      num.
Mikolov et al. (2013a)   34.93%  0.00%   0.00%   35.00%  0.00%   0.07%   25.91%  0.00%   0.00%
Xing et al. (2015)       36.87%  0.00%   0.13%   41.27%  0.07%   0.53%   28.23%  0.07%   0.56%
Zhang et al. (2016)      36.73%  0.07%   0.27%   40.80%  0.13%   0.87%   28.16%  0.14%   0.42%
Artetxe et al. (2016)    39.27%  0.07%   0.40%   41.87%  0.13%   0.73%   30.62%  0.21%   0.77%
Our method               39.67%  37.27%  39.40%  40.87%  39.60%  40.27%  28.72%  28.16%  26.47%
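For context, the accuracies above come from simple nearest-neighbor retrieval. A minimal sketch of that evaluation, under the same assumptions as the earlier snippet and simplified to a single gold translation per source word (real test dictionaries may list several):

```python
def translation_accuracy(X, Z, W, test_pairs):
    # Word translation induction: translate each test source word to its
    # nearest target neighbor and score precision@1 against the gold pairs.
    sims = (X @ W) @ Z.T
    hits = sum(sims[s].argmax() == t for s, t in test_pairs)
    return hits / len(test_pairs)
```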
73–82. Experiments: crosslingual word similarity
• Dataset by Dinu et al. (2015) extended to German and Finnish
  ⇒ Monolingual embeddings (CBOW + negative sampling)
  ⇒ Seed dictionary: 5,000 word pairs / 25 word pairs / numerals

                                     EN-IT           EN-DE
                         Bi. data    WS      RG      WS
Luong et al. (2015)      Europarl    33.1%   33.5%   35.6%
Mikolov et al. (2013a)   5k dict     62.7%   64.3%   52.8%
Xing et al. (2015)       5k dict     61.4%   70.0%   59.5%
Zhang et al. (2016)      5k dict     61.6%   70.4%   59.6%
Artetxe et al. (2016)    5k dict     61.7%   71.6%   59.7%
Our method               5k dict     62.4%   74.2%   61.6%
                         25 dict     62.6%   74.9%   61.2%
                         num.        62.8%   73.9%   60.4%
84–94. Why does it work?
Recall the loop: monolingual embeddings + dictionary → mapping → induced dictionary → new mapping → …
- the seed dictionary is small, but contains no errors
- each induced dictionary is large, but contains errors
So is each new mapping better or worse than the last? And the next dictionary even better, or even worse?
96–118. Why does it work?
The self-learning loop implicitly optimizes a single objective over the source embeddings $X$, the orthogonal mapping $W$, and the target embeddings $Z$:

$$W^* = \underset{W}{\arg\max} \sum_i \max_j \,(X_{i*} W) \cdot Z_{j*} \quad \text{s.t.} \quad W W^T = W^T W = I$$

Implicit objective: independent from the seed dictionary!
So why do we need a seed dictionary at all? To avoid poor local optima.
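To make the independence claim concrete, here is a minimal sketch, under the same assumptions as the earlier snippets, of evaluating this implicit objective for a given mapping; the seed dictionary appears nowhere in it and only steers the optimization away from poor local optima.

```python
def implicit_objective(X, Z, W):
    # sum_i max_j (X_{i*} W) . Z_{j*} for row-normalized X and Z:
    # each mapped source word's similarity to its best target match.
    sims = (X @ W) @ Z.T
    return sims.max(axis=1).sum()
```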
121–124. Conclusions
• Simple self-learning method to train bilingual embedding mappings
• High-quality results with almost no supervision (25 words, numerals)
• Implicit optimization objective independent from the seed dictionary
• Seed dictionary necessary to avoid poor local optima
• Future work: fully unsupervised training