1. MultiSeg: Parallel Data and Subword Information for Learning Bilingual Embeddings in Low Resource Scenarios
Efsun Sarioglu Kayi*, Vishal Anand*, Smaranda Muresan
2. Representing Subwords in Cross-lingual Space
● BiVec/BiSkip: Generate cross-lingual embeddings by training four word2vec models at once that learn from inter- and intra-language contexts in sentence-aligned parallel corpora (see the pair-generation sketch after this list)
● fastText: Train monolingual embeddings using subwords, i.e., 3–6 character n-grams (see the n-gram sketch after this list)
● MultiSeg: Train a BiSkip-like model using various subword representations
  ○ MultiSegCN: Character n-grams
  ○ Morphemes obtained by unsupervised morphological segmentation
    ■ MultiSegM: Three segments: prefix + stem + suffix
    ■ MultiSegMall: Stem + affixes
  ○ MultiSegBPE: Byte Pair Encoding (BPE; see the merge-learning sketch after this list)
  ○ MultiSegAll: Character n-grams, morphological segments, and BPE
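Below is a minimal sketch of fastText-style character n-gram extraction, the subword type behind MultiSegCN. The function name and the boundary-marker convention are illustrative assumptions, not code from the paper.

```python
# Hedged sketch: fastText-style character n-gram extraction
# (assumed helper, not the authors' code).
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word, fastText-style.

    The word is wrapped in the boundary markers '<' and '>' so that
    prefix and suffix n-grams are distinct from word-internal ones.
    """
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(wrapped) - n + 1)]

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>', '<whe', 'wher', ...]
```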
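The next toy sketch learns BPE merges in the style of Sennrich et al., the segmentation behind MultiSegBPE. The `learn_bpe` helper and the `</w>` end-of-word marker are assumptions for illustration; the paper's actual BPE preprocessing may differ.

```python
# Hedged sketch: learning BPE merges on a tiny symbol vocabulary.
from collections import Counter
import re

def learn_bpe(word_freqs, num_merges):
    """word_freqs maps space-separated symbol sequences to counts,
    e.g. {'l o w </w>': 5}. Returns the list of learned merges."""
    vocab = dict(word_freqs)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[a, b] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Merge the most frequent pair wherever it occurs as whole symbols.
        pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(best)) + r'(?!\S)')
        vocab = {pattern.sub(''.join(best), w): f for w, f in vocab.items()}
    return merges

corpus = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6}
print(learn_bpe(corpus, 4))  # e.g. [('w', 'e'), ('l', 'o'), ...]
```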
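Finally, a rough sketch of how a BiSkip-style model can derive its four kinds of skip-gram training pairs (source→source, target→target, source→target, target→source) from one aligned sentence pair. The monotonic position heuristic used for cross-lingual alignment, and all names, are assumptions for illustration rather than the authors' implementation; in MultiSeg, each center word's vector would additionally be composed from its subword vectors.

```python
# Hedged sketch: generating BiSkip-style skip-gram pairs from a
# sentence-aligned pair (alignment approximated by relative position).
def skipgram_pairs(src_sent, tgt_sent, window=2):
    """Yield (center, context) pairs in all four BiSkip directions."""
    # Intra-language pairs within each sentence.
    for sent in (src_sent, tgt_sent):
        for i, center in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    yield center, sent[j]
    # Inter-language pairs around a monotonically aligned position.
    for a, b in ((src_sent, tgt_sent), (tgt_sent, src_sent)):
        for i, center in enumerate(a):
            k = round(i * (len(b) - 1) / max(1, len(a) - 1))  # assumed alignment
            for j in range(max(0, k - window), min(len(b), k + window + 1)):
                yield center, b[j]

src = "the cat sat".split()
tgt = "le chat était assis".split()
for center, context in skipgram_pairs(src, tgt, window=1):
    print(center, "->", context)
```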
11. Conclusion
● MultiSeg: Learns subword representations during the training of cross-lingual embeddings
● Evaluation
  ○ Syntax
    ■ Analogy-reasoning results show that using subwords helps capture syntactic characteristics
  ○ Semantics
    ■ Intrinsically, word similarity and word translation scores demonstrate performance superior to existing methods (see the retrieval sketch below)
  ○ Qualitative
    ■ Better-quality cross-lingual embeddings, particularly for morphological variants in both languages
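As one concrete illustration of the word translation evaluation mentioned above, the sketch below computes precision@1 by cosine nearest-neighbor retrieval in the shared cross-lingual space. The toy embeddings and dictionary are placeholders, not the paper's data.

```python
# Hedged sketch: word-translation precision@1 via cosine nearest
# neighbors (toy data; not the paper's evaluation code).
import numpy as np

def translation_p_at_1(src_emb, tgt_emb, tgt_words, gold):
    """src_emb: {word: vector}; tgt_emb: (V, d) matrix whose rows align
    with tgt_words; gold: {src_word: gold_translation}."""
    T = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    hits = 0
    for word, translation in gold.items():
        v = src_emb[word]
        v = v / np.linalg.norm(v)
        nearest = tgt_words[int(np.argmax(T @ v))]  # cosine nearest neighbor
        hits += (nearest == translation)
    return hits / len(gold)

rng = np.random.default_rng(0)
tgt_words = ["chat", "chien", "maison"]
tgt_emb = rng.normal(size=(3, 4))
src_emb = {"cat": tgt_emb[0] + 0.01 * rng.normal(size=4)}  # near "chat"
print(translation_p_at_1(src_emb, tgt_emb, tgt_words, {"cat": "chat"}))  # 1.0
```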