1. MultiSeg: Parallel Data and Subword Information for Learning Bilingual Embeddings in Low Resource Scenarios
Efsun Sarioglu Kayi*, Vishal Anand*, Smaranda Muresan
2. Representing Subwords in Cross-lingual Space
● BiVec/BiSkip: Generate cross-lingual embeddings by training four word2vec models at once that learn from inter- and intra-language contexts in sentence-aligned parallel corpora (see the pair-generation sketch after this list)
● fastText: Train monolingual embeddings using subwords, i.e., 3–6 character n-grams (see the n-gram sketch after this list)
● MultiSeg: Train a BiSkip-like model using various subword representations
  ○ MultiSegCN: Character n-grams
  ○ Morphemes obtained by unsupervised morphological segmentation
    ■ MultiSegM: Three segments: prefix + stem + suffix
    ■ MultiSegMall: Stem + affixes
  ○ MultiSegBPE: Byte Pair Encoding (BPE; see the merge-learning sketch after this list)
  ○ MultiSegAll: Character n-grams, morphological segments, and BPE
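Below is a minimal sketch of fastText-style character n-gram extraction, the subword type behind MultiSegCN. The function name and the boundary-marker convention are illustrative assumptions, not code from the paper.

```python
# Hedged sketch: fastText-style character n-gram extraction
# (assumed helper, not the authors' code).
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word, fastText-style.

    The word is wrapped in the boundary markers '<' and '>' so that
    prefix and suffix n-grams are distinct from word-internal ones.
    """
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(wrapped) - n + 1)]

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>', '<whe', 'wher', ...]
```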
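The next toy sketch learns BPE merges in the style of Sennrich et al., the segmentation behind MultiSegBPE. The `learn_bpe` helper and the `</w>` end-of-word marker are assumptions for illustration; the paper's actual BPE preprocessing may differ.

```python
# Hedged sketch: learning BPE merges on a tiny symbol vocabulary.
from collections import Counter
import re

def learn_bpe(word_freqs, num_merges):
    """word_freqs maps space-separated symbol sequences to counts,
    e.g. {'l o w </w>': 5}. Returns the list of learned merges."""
    vocab = dict(word_freqs)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[a, b] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Merge the most frequent pair wherever it occurs as whole symbols.
        pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(best)) + r'(?!\S)')
        vocab = {pattern.sub(''.join(best), w): f for w, f in vocab.items()}
    return merges

corpus = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6}
print(learn_bpe(corpus, 4))  # e.g. [('w', 'e'), ('l', 'o'), ...]
```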
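Finally, a rough sketch of how a BiSkip-style model can derive its four kinds of skip-gram training pairs (source→source, target→target, source→target, target→source) from one aligned sentence pair. The monotonic position heuristic used for cross-lingual alignment, and all names, are assumptions for illustration rather than the authors' implementation; in MultiSeg, each center word's vector would additionally be composed from its subword vectors.

```python
# Hedged sketch: generating BiSkip-style skip-gram pairs from a
# sentence-aligned pair (alignment approximated by relative position).
def skipgram_pairs(src_sent, tgt_sent, window=2):
    """Yield (center, context) pairs in all four BiSkip directions."""
    # Intra-language pairs within each sentence.
    for sent in (src_sent, tgt_sent):
        for i, center in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    yield center, sent[j]
    # Inter-language pairs around a monotonically aligned position.
    for a, b in ((src_sent, tgt_sent), (tgt_sent, src_sent)):
        for i, center in enumerate(a):
            k = round(i * (len(b) - 1) / max(1, len(a) - 1))  # assumed alignment
            for j in range(max(0, k - window), min(len(b), k + window + 1)):
                yield center, b[j]

src = "the cat sat".split()
tgt = "le chat était assis".split()
for center, context in skipgram_pairs(src, tgt, window=1):
    print(center, "->", context)
```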
11. Conclusion
● MultiSeg: Learns subword representations during the training of cross-lingual embeddings
● Evaluation
  ○ Syntax
    ■ Analogy-reasoning results show that using subwords helps capture syntactic characteristics
  ○ Semantics
    ■ Intrinsically, word similarity and word translation scores demonstrate performance superior to existing methods (see the retrieval sketch below)
  ○ Qualitative
    ■ Better-quality cross-lingual embeddings, particularly for morphological variants in both languages
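As one concrete illustration of the word translation evaluation mentioned above, the sketch below computes precision@1 by cosine nearest-neighbor retrieval in the shared cross-lingual space. The toy embeddings and dictionary are placeholders, not the paper's data.

```python
# Hedged sketch: word-translation precision@1 via cosine nearest
# neighbors (toy data; not the paper's evaluation code).
import numpy as np

def translation_p_at_1(src_emb, tgt_emb, tgt_words, gold):
    """src_emb: {word: vector}; tgt_emb: (V, d) matrix whose rows align
    with tgt_words; gold: {src_word: gold_translation}."""
    T = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    hits = 0
    for word, translation in gold.items():
        v = src_emb[word]
        v = v / np.linalg.norm(v)
        nearest = tgt_words[int(np.argmax(T @ v))]  # cosine nearest neighbor
        hits += (nearest == translation)
    return hits / len(gold)

rng = np.random.default_rng(0)
tgt_words = ["chat", "chien", "maison"]
tgt_emb = rng.normal(size=(3, 4))
src_emb = {"cat": tgt_emb[0] + 0.01 * rng.normal(size=4)}  # near "chat"
print(translation_p_at_1(src_emb, tgt_emb, tgt_words, {"cat": "chat"}))  # 1.0
```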