MultiSeg: Parallel Data and Subword Information for Learning Bilingual Embeddings in Low Resource Scenarios
In collaboration with Vishal Anand and Smaranda Muresan
Work done at Columbia University
1
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced Languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), 2020
Representing Subwords in Cross-lingual Space
● fastText: monolingual word embeddings that take subword information into account by treating words as bags of character n-grams
● Bilingual SkipGram (BiSkip)
○ Trains 4 SkipGram models jointly between two languages l1 and l2 based on sentence alignments:
2
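The bag-of-character-n-grams idea can be sketched as follows. This is a minimal illustration, not the fastText implementation itself; it assumes fastText's defaults of n-grams of length 3 to 6 with `<` and `>` boundary markers:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Collect character n-grams fastText-style: the word is wrapped in
    boundary markers so prefix and suffix n-grams are distinguishable."""
    token = f"<{word}>"
    grams = [token[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(token) - n + 1)]
    return grams + [token]  # fastText also keeps the whole word as one unit

ngrams = char_ngrams("doing")  # contains "<do", "ing", "ng>", and "<doing>"
```

Because "doing" and "doings" share most of their n-grams, their vectors (each a sum over n-gram vectors) end up close, which is what makes this representation attractive for morphologically rich languages.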
MultiSeg: Cross-lingual Embeddings Learned with Subword Information
● Train a BiSkip-like model using various subword representations1
● MultiSegCN: Character n-grams
● Morphemes obtained by unsupervised morphological segmentation
○ MultiSegM: Three segments: prefix + stem + suffix
○ MultiSegMall: Stem + affixes
● MultiSegBPE: Byte Pair Encoding (BPE)
● MultiSegAll: Char n-grams, morphological segments, BPE
3
1: Code available at https://github.com/vishalanand/MultiSeg
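How the BPE segments arise can be sketched with the standard greedy merge-learning algorithm. This is an illustration of the technique, not the segmentation code from the MultiSeg repository:

```python
from collections import Counter

def learn_bpe(vocab, num_merges):
    """Greedy BPE merge learning: repeatedly merge the most frequent
    adjacent symbol pair. `vocab` maps a space-separated symbol
    sequence (one word) to its corpus frequency."""
    vocab = dict(vocab)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq, freq in vocab.items():
            symbols = seq.split()
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # most frequent adjacent pair
        merges.append(best)
        bigram = " ".join(best)
        vocab = {seq.replace(bigram, "".join(best)): freq
                 for seq, freq in vocab.items()}
    return merges, vocab

toy = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
merges, segmented = learn_bpe(toy, 2)  # first merges: ("e","s"), then ("es","t")
```

With enough merges, frequent stems and suffixes become single symbols, giving the subword units the MultiSegBPE variant trains on.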
Example Alignment for MorphAll
4
Alignment Algorithm
● Symmetric alignments from parallel corpora are obtained with the fast_align tool
● Both word-level and stem-level alignments are considered, and the best one is chosen
5
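The symmetrization idea behind these alignments can be sketched as follows. This is a simplified illustration, not fast_align itself: the two one-directional alignments are combined, e.g. by intersection for precision or union for recall:

```python
def symmetrize(forward, reverse):
    """Combine source->target and target->source alignments.
    Each alignment is a set of index pairs; `reverse` holds
    (tgt, src) pairs and is flipped into source->target order."""
    flipped = {(s, t) for (t, s) in reverse}
    return {
        "intersection": forward & flipped,  # links both directions agree on
        "union": forward | flipped,         # links proposed by either direction
    }

fwd = {(0, 0), (1, 2), (2, 1)}   # source->target links
rev = {(0, 0), (1, 2), (3, 3)}   # target->source links
sym = symmetrize(fwd, rev)
```

In practice fast_align's companion tools offer heuristics between these two extremes (e.g. grow-diag-final-and), trading precision for recall.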
MultiSeg
6
Evaluation
● Intrinsic
○ Word Translation Task, a.k.a. bilingual dictionary induction, with Wiktionary dictionaries
● Extrinsic
○ Monolingual
■ Word Similarity Task: WordSim353, Stanford's Contextual Word Similarities (SCWS),
Rare Words (RW)
■ Analogy Reasoning Task: Semantic and Syntactic categories
○ Cross-lingual
■ Document Classification (CLDC): English-German only
● Qualitative
○ Word translation task
○ t-SNE visualization
7
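The word translation task reduces to a nearest-neighbour search in the shared cross-lingual space. A toy sketch with made-up 2-d vectors ("mbwa" and "paka" are Swahili for "dog" and "cat"):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def translate(word, src_vecs, tgt_vecs):
    """Predict the translation of `word` as the target-language word
    whose embedding is nearest by cosine in the cross-lingual space."""
    query = src_vecs[word]
    return max(tgt_vecs, key=lambda w: cosine(query, tgt_vecs[w]))

src = {"dog": [1.0, 0.1]}
tgt = {"mbwa": [0.9, 0.2], "paka": [0.1, 1.0]}
prediction = translate("dog", src, tgt)  # "mbwa"
```

The intrinsic scores then measure how often the gold dictionary entry appears among the top-k retrieved neighbours.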
Dataset for Low Resource Languages
● Three morphologically rich low resource languages:
○ Swahili (SW), Tagalog (TL), Somali (SO)
○ IARPA Machine Translation for English Retrieval of Information in Any Language
(MATERIAL) project’s parallel corpora
● German, a high resource morphologically rich language
○ EuroParl (1,908,920 parallel sentences) subsampled to 100K to simulate a low resource scenario
8
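Subsampling to simulate a low-resource setting amounts to drawing sentence pairs as units so both sides stay aligned. A minimal sketch (the exact sampling procedure used is not specified here):

```python
import random

def subsample_parallel(pairs, k, seed=0):
    """Downsample a parallel corpus to k sentence pairs; sampling whole
    (source, target) pairs keeps the two sides of the corpus aligned."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    return rng.sample(pairs, k)

corpus = [(f"src {i}", f"tgt {i}") for i in range(1000)]
small = subsample_parallel(corpus, 100)
```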
Word Translation Scores and Coverage
9
Qualitative Analysis
10
t-SNE Visualization of English-Swahili Vectors
● done − nagawa, did − ginawa, doing − ginagawa
11
t-SNE Visualization for English-Tagalog Vectors
12
t-SNE Visualization for English-Somali Vectors
● qaranimo is close to togetherness, while the same concept (nationhood) is also shown in a coarser fashion in 5c
13
German-English Monolingual and Cross-lingual
Results
14
Monolingual English Evaluation of Low Resource Languages
15
Swahili Analogy Reasoning Task:
Semantic & Syntactic Categories
16
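Analogy reasoning is commonly scored with the 3CosAdd rule; a minimal sketch under that assumption, with toy vectors and the query words excluded from the candidate set:

```python
import math

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def analogy(a, a_star, b, vecs):
    """3CosAdd: 'a is to a_star as b is to ?'; the answer is the word
    nearest to vec(b) - vec(a) + vec(a_star), excluding the queries."""
    target = [vb - va + vs for va, vs, vb in
              zip(vecs[a], vecs[a_star], vecs[b])]
    candidates = {w: v for w, v in vecs.items() if w not in {a, a_star, b}}
    return max(candidates, key=lambda w: cos(target, candidates[w]))

vecs = {"man": [1.0, 0.0], "woman": [0.0, 1.0],
        "king": [1.0, 1.0], "queen": [0.1, 1.1], "apple": [0.9, 0.2]}
answer = analogy("man", "woman", "king", vecs)  # "queen"
```

Syntactic categories use the same rule over inflectional pairs (e.g. do : doing :: go : ?), which is where subword information is expected to help most.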
Recap: MultiSeg
● MultiSeg: learning subword representations while training cross-lingual embeddings
● Evaluation
○ Syntax
■ Analogy reasoning results show that using subwords helps capture syntactic characteristics
○ Semantics
■ Word similarity results and, intrinsically, word translation scores demonstrate superior performance over existing methods
○ Qualitative
■ Better-quality cross-lingual embeddings, particularly for morphological variants in both languages
17
