Enriching Word Vectors with
Subword Information
Piotr Bojanowski, Edouard Grave, Armand Joulin,
Tomas Mikolov
Presented by Jaeyeun Yoon
On 2017-07-20
IDS Lab.
Word Embedding
 Word Embedding
• Word embedding is the collective name for a set of language modeling
and feature learning techniques in which words or phrases from the
vocabulary are mapped to vectors of real numbers in a low-dimensional
space.
• Ex) Bag-of-words model, Latent Semantic Analysis (LSA), word2vec, GloVe, …
 Distributional Hypothesis
• “A word is characterized by the company it keeps”
[Firth, J.R. (1957)]
• Motivated by the idea that closely co-occurring words define the
semantics of a word
• Most word embedding models are based on the distributional hypothesis.
“The cat sat on the mat”
W("cat") = (0.2,−0.4,0.7,...)
W("mat") = (0.0,0.6,−0.1,...)
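The distributional idea above can be sketched with cosine similarity over toy vectors; the values below (including a hypothetical "dog" vector) are made up for illustration, in the spirit of the W("cat") example:

```python
import math

# Toy 3-dimensional embeddings (hypothetical values; real vectors
# typically have 100-300 dimensions).
W = {
    "cat": [0.2, -0.4, 0.7],
    "mat": [0.0, 0.6, -0.1],
    "dog": [0.3, -0.3, 0.6],  # hypothetical: assumed to co-occur like "cat"
}

def cosine(u, v):
    """Cosine similarity, the standard way to compare embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Words that keep similar company end up with nearby vectors:
assert cosine(W["cat"], W["dog"]) > cosine(W["cat"], W["mat"])
```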
Characteristics of word embedding
• Once trained, they can be reused in various tasks
• They capture relative semantics of words
• They can be used to combat small-dataset problems
Skip-Gram Model (general model)
• Mikolov, Tomas, et al. "Efficient estimation of
word representations in vector space." arXiv
preprint arXiv:1301.3781 (2013)
• Classifier for predicting words within a
certain range of the input word
• e.g.) … “It is time to finish”
• The objective of the skip-gram model is
to maximize the following log-
likelihood:
∑_{t=1}^{T} ∑_{c∈C_t} log p(w_c | w_t)
C_t : the set of indices of words surrounding word w_t
p(w_c | w_t) : the probability of observing a context word w_c given w_t
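The double sum in the objective can be sketched directly in Python. The probability model p here is a toy uniform placeholder, not the trained softmax / negative-sampling model the paper actually uses:

```python
import math

def skipgram_log_likelihood(tokens, p, window=2):
    """sum_{t=1..T} sum_{c in C_t} log p(w_c | w_t), where C_t is the
    set of context positions within `window` of position t."""
    total = 0.0
    T = len(tokens)
    for t in range(T):
        for c in range(max(0, t - window), min(T, t + window + 1)):
            if c != t:  # skip the center word itself
                total += math.log(p(tokens[c], tokens[t]))
    return total

# Toy uniform model over a 5-word vocabulary: p(w_c | w_t) = 1/5.
tokens = "it is time to finish".split()
ll = skipgram_log_likelihood(tokens, lambda wc, wt: 0.2)
# 14 (center, context) pairs in total, each contributing log(0.2).
```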
Limitation of Skip-Gram
 It is difficult for traditional word2vec to learn good
representations of rare words.
 The NLP task may contain words that were not
present in the word2vec training corpus
- This limitation is more pronounced for morphologically rich
languages
EX) In French or Spanish, most verbs have more than forty different
inflected forms, while the Finnish language has fifteen cases for
nouns
-> It is possible to improve vector representations for
morphologically rich languages by using character-
level information.
Example
 German verb: ‘sein’ (English verb: ‘be’)
Character n-gram based model
• The basic skip-gram model described above ignores the internal
structure of the word.
• In contrast, the character n-gram based model incorporates information
about this structure in terms of character n-gram embeddings.
• This paper proposes that each word 𝑤 be represented as a bag of
character n-grams.
EX) For the word where with n = 3, it will be represented by the
character n-grams:
<wh / whe / her / ere / re>
and the special sequence
<where>
* Note: the word <her> is different from the
tri-gram her appearing in the word where.
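The bag-of-n-grams construction above is straightforward to sketch (boundary symbols < and > as in the paper; the hashing/bucketing details of the real implementation are omitted):

```python
def char_ngrams(word, nmin=3, nmax=6):
    """Bag of character n-grams of `word`, with boundary symbols,
    plus the special sequence for the whole word."""
    w = "<" + word + ">"
    grams = {w[i:i + n]
             for n in range(nmin, nmax + 1)
             for i in range(len(w) - n + 1)}
    grams.add(w)  # the special sequence, e.g. <where>
    return grams

# With n = 3 only, "where" yields exactly the slide's example:
assert char_ngrams("where", 3, 3) == {"<wh", "whe", "her", "ere", "re>", "<where>"}
# The word "her" as <her> is distinct from the tri-gram her inside "where":
assert "<her>" in char_ngrams("her", 3, 3)
```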
Character n-gram based model (cont’d)
• We represent a word by the sum of the vector representations of
its n-grams. We thus obtain the scoring function:
s(w, c) = ∑_{g∈G_w} z_g^T v_c
w : the given word
G_w : the set of n-grams appearing in word w
z_g : the vector representation of n-gram g
v_c : the word vector of the context word c
• We extract all the n-grams with 3 ≤ n ≤ 6.
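A minimal numeric sketch of this scoring function, using tiny hypothetical 2-dimensional vectors (real embeddings have hundreds of dimensions):

```python
# Hypothetical n-gram vectors z_g for the n-grams of "where" (n = 3).
z = {
    "<wh": [0.1, 0.2], "whe": [0.0, 0.3], "her": [0.2, -0.1],
    "ere": [0.1, 0.1], "re>": [-0.1, 0.2], "<where>": [0.3, 0.0],
}
v_c = [0.5, 0.4]  # hypothetical vector of the context word c

def score(ngrams, z, v_c):
    """s(w, c) = sum over g in G_w of z_g^T v_c."""
    return sum(sum(a * b for a, b in zip(z[g], v_c)) for g in ngrams)

s = score(z.keys(), z, v_c)
# Equivalently: (sum of all z_g) dotted with v_c.
```

Because the sum distributes over the dot product, the word's own vector is simply the sum of its n-gram vectors, which is what allows out-of-vocabulary words to get representations at all.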
Computing word vector representation
Experiments Settings
 Target Languages
• German / English / French / Spanish /
Arabic / Romanian / Russian / Czech
 Tasks
1. Human similarity judgement
2. Word analogy tasks
3. Comparison with morphological representations
4. Effect of the size of the training data
5. Effect of the size of n-grams
1. Human similarity judgement
 Correlation between human judgement and similarity scores on word
similarity datasets.
sg : Skip-Gram
cbow : continuous bag of words
sisg- : Subword Information Skip-Gram
(Treat unseen words as a null vector)
sisg : Subword Information Skip-Gram
(Treat unseen words by summing the n-gram vectors)
RW : Rare Words dataset
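The legend's distinction between sisg- and sisg on unseen words can be sketched as follows; the trained tri-gram vectors below are hypothetical toy values:

```python
def ngrams(word, n=3):
    """Character n-grams of a word, with boundary symbols."""
    w = "<" + word + ">"
    return [w[i:i + n] for i in range(len(w) - n + 1)]

def oov_vector(word, ngram_vecs, dim=2, mode="sisg"):
    """Vector for an out-of-vocabulary word.
    sisg-: return a null vector.
    sisg : sum the vectors of those n-grams seen during training."""
    if mode == "sisg-":
        return [0.0] * dim
    vec = [0.0] * dim
    for g in ngrams(word):
        if g in ngram_vecs:
            vec = [a + b for a, b in zip(vec, ngram_vecs[g])]
    return vec

# Hypothetical trained tri-gram vectors covering parts of "unseen":
ngram_vecs = {"<un": [0.1, 0.0], "see": [0.0, 0.2], "en>": [0.3, 0.2]}
vec = oov_vector("unseen", ngram_vecs)              # nonzero, built from subwords
nul = oov_vector("unseen", ngram_vecs, mode="sisg-")  # null vector
```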
2. Word analogy tasks
 Accuracy of our model and baselines on word analogy tasks for Czech,
German, English and Italian
* Morphological information is observed to
significantly improve performance on the
syntactic tasks; our approach outperforms
the baselines. In contrast, it does not help for
semantic questions, and even degrades
performance for German and Italian.
3. Comparison with morphological representations
 Spearman’s rank correlation coefficient between human judgement and
model scores for different methods using morphology to learn word
representations.
4. Effect of the size of the training data
 Influence of size of the training data on performance
(Data : full Wikipedia dump / Task : 1. similarity task)
 The sisg model is more robust to the size of the training data.
 The performance of the baseline cbow model keeps improving as more data becomes
available, whereas the sisg model seems to saturate quickly: adding more data does not
always lead to improved results.
 Even with a very small dataset, sisg is observed to perform better than the baselines.
5. Effect of the size of n-grams
 The choice of the n range is observed to be language- and task-
dependent.
 Results are always improved by taking n ≥ 3 rather than n ≥ 2, which
shows that character 2-grams are not informative for that task.
(Table: performance for each combination of the minimum and maximum values of n)
Conclusion
 The general Skip-Gram model has some limitations (e.g. OOV words).
 These can be overcome by using subword information
(character n-grams).
 The model is simple; because of this simplicity, it trains fast
and does not require any preprocessing or supervision.
 It works better for certain languages (e.g. German).

More Related Content

What's hot

Natural Language Processing: L02 words
Natural Language Processing: L02 wordsNatural Language Processing: L02 words
Natural Language Processing: L02 words
ananth
 
Pre trained language model
Pre trained language modelPre trained language model
Pre trained language model
JiWenKim
 
NLP_KASHK:Context-Free Grammar for English
NLP_KASHK:Context-Free Grammar for EnglishNLP_KASHK:Context-Free Grammar for English
NLP_KASHK:Context-Free Grammar for English
Hemantha Kulathilake
 
Fine tune and deploy Hugging Face NLP models
Fine tune and deploy Hugging Face NLP modelsFine tune and deploy Hugging Face NLP models
Fine tune and deploy Hugging Face NLP models
OVHcloud
 
NLP_KASHK:Smoothing N-gram Models
NLP_KASHK:Smoothing N-gram ModelsNLP_KASHK:Smoothing N-gram Models
NLP_KASHK:Smoothing N-gram Models
Hemantha Kulathilake
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
Yuriy Guts
 
Using Text Embeddings for Information Retrieval
Using Text Embeddings for Information RetrievalUsing Text Embeddings for Information Retrieval
Using Text Embeddings for Information Retrieval
Bhaskar Mitra
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
Toine Bogers
 
Language Model.pptx
Language Model.pptxLanguage Model.pptx
Language Model.pptx
Firas Obeid
 
A Review of Deep Contextualized Word Representations (Peters+, 2018)
A Review of Deep Contextualized Word Representations (Peters+, 2018)A Review of Deep Contextualized Word Representations (Peters+, 2018)
A Review of Deep Contextualized Word Representations (Peters+, 2018)
Shuntaro Yada
 
What is word2vec?
What is word2vec?What is word2vec?
What is word2vec?
Traian Rebedea
 
Boolean,vector space retrieval Models
Boolean,vector space retrieval Models Boolean,vector space retrieval Models
Boolean,vector space retrieval Models
Primya Tamil
 
NLP_KASHK:Minimum Edit Distance
NLP_KASHK:Minimum Edit DistanceNLP_KASHK:Minimum Edit Distance
NLP_KASHK:Minimum Edit Distance
Hemantha Kulathilake
 
BERT introduction
BERT introductionBERT introduction
BERT introduction
Hanwha System / ICT
 
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...
Databricks
 
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language Processing
Devashish Shanker
 
bag-of-words models
bag-of-words models bag-of-words models
bag-of-words models
Xiaotao Zou
 
The vector space model
The vector space modelThe vector space model
The vector space model
pkgosh
 
Conditional Random Fields - Vidya Venkiteswaran
Conditional Random Fields - Vidya VenkiteswaranConditional Random Fields - Vidya Venkiteswaran
Conditional Random Fields - Vidya Venkiteswaran
WithTheBest
 
Braden Hancock "Programmatically creating and managing training data with Sno...
Braden Hancock "Programmatically creating and managing training data with Sno...Braden Hancock "Programmatically creating and managing training data with Sno...
Braden Hancock "Programmatically creating and managing training data with Sno...
Fwdays
 

What's hot (20)

Natural Language Processing: L02 words
Natural Language Processing: L02 wordsNatural Language Processing: L02 words
Natural Language Processing: L02 words
 
Pre trained language model
Pre trained language modelPre trained language model
Pre trained language model
 
NLP_KASHK:Context-Free Grammar for English
NLP_KASHK:Context-Free Grammar for EnglishNLP_KASHK:Context-Free Grammar for English
NLP_KASHK:Context-Free Grammar for English
 
Fine tune and deploy Hugging Face NLP models
Fine tune and deploy Hugging Face NLP modelsFine tune and deploy Hugging Face NLP models
Fine tune and deploy Hugging Face NLP models
 
NLP_KASHK:Smoothing N-gram Models
NLP_KASHK:Smoothing N-gram ModelsNLP_KASHK:Smoothing N-gram Models
NLP_KASHK:Smoothing N-gram Models
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
Using Text Embeddings for Information Retrieval
Using Text Embeddings for Information RetrievalUsing Text Embeddings for Information Retrieval
Using Text Embeddings for Information Retrieval
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Language Model.pptx
Language Model.pptxLanguage Model.pptx
Language Model.pptx
 
A Review of Deep Contextualized Word Representations (Peters+, 2018)
A Review of Deep Contextualized Word Representations (Peters+, 2018)A Review of Deep Contextualized Word Representations (Peters+, 2018)
A Review of Deep Contextualized Word Representations (Peters+, 2018)
 
What is word2vec?
What is word2vec?What is word2vec?
What is word2vec?
 
Boolean,vector space retrieval Models
Boolean,vector space retrieval Models Boolean,vector space retrieval Models
Boolean,vector space retrieval Models
 
NLP_KASHK:Minimum Edit Distance
NLP_KASHK:Minimum Edit DistanceNLP_KASHK:Minimum Edit Distance
NLP_KASHK:Minimum Edit Distance
 
BERT introduction
BERT introductionBERT introduction
BERT introduction
 
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...
 
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language Processing
 
bag-of-words models
bag-of-words models bag-of-words models
bag-of-words models
 
The vector space model
The vector space modelThe vector space model
The vector space model
 
Conditional Random Fields - Vidya Venkiteswaran
Conditional Random Fields - Vidya VenkiteswaranConditional Random Fields - Vidya Venkiteswaran
Conditional Random Fields - Vidya Venkiteswaran
 
Braden Hancock "Programmatically creating and managing training data with Sno...
Braden Hancock "Programmatically creating and managing training data with Sno...Braden Hancock "Programmatically creating and managing training data with Sno...
Braden Hancock "Programmatically creating and managing training data with Sno...
 

Similar to Fasttext 20170720 yjy

Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali TextChunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
kevig
 
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali TextChunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
kevig
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
kevig
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
kevig
 
Improvement wsd dictionary using annotated corpus and testing it with simplif...
Improvement wsd dictionary using annotated corpus and testing it with simplif...Improvement wsd dictionary using annotated corpus and testing it with simplif...
Improvement wsd dictionary using annotated corpus and testing it with simplif...
csandit
 
SNLI_presentation_2
SNLI_presentation_2SNLI_presentation_2
SNLI_presentation_2
Viral Gupta
 
Challenges in transfer learning in nlp
Challenges in transfer learning in nlpChallenges in transfer learning in nlp
Challenges in transfer learning in nlp
LaraOlmosCamarena
 
1066_multitask_prompted_training_en.pdf
1066_multitask_prompted_training_en.pdf1066_multitask_prompted_training_en.pdf
1066_multitask_prompted_training_en.pdf
ssusere320ca
 
New word analogy corpus
New word analogy corpusNew word analogy corpus
New word analogy corpus
Lukáš Svoboda
 
Enriching Word Vectors with Subword Information
Enriching Word Vectors with Subword InformationEnriching Word Vectors with Subword Information
Enriching Word Vectors with Subword Information
Seonghyun Kim
 
Natural language processing (Python)
Natural language processing (Python)Natural language processing (Python)
Natural language processing (Python)
Sumit Raj
 
Neural word embedding and language modelling
Neural word embedding and language modellingNeural word embedding and language modelling
Neural word embedding and language modelling
Riddhi Jain
 
Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing
Mustafa Jarrar
 
Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le...
Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le...Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le...
Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le...
ijtsrd
 
Fasttext(Enriching Word Vectors with Subword Information) 논문 리뷰
Fasttext(Enriching Word Vectors with Subword Information) 논문 리뷰Fasttext(Enriching Word Vectors with Subword Information) 논문 리뷰
Fasttext(Enriching Word Vectors with Subword Information) 논문 리뷰
ssuserc35c0e
 
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
cscpconf
 
MACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSISMACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSIS
Massimo Schenone
 
Segmenting dna sequence into words
Segmenting dna sequence into wordsSegmenting dna sequence into words
Segmenting dna sequence into words
Liang Wang
 
Word2vec on the italian language: first experiments
Word2vec on the italian language: first experimentsWord2vec on the italian language: first experiments
Word2vec on the italian language: first experiments
Vincenzo Lomonaco
 
Cognitive plausibility in learning algorithms
Cognitive plausibility in learning algorithmsCognitive plausibility in learning algorithms
Cognitive plausibility in learning algorithms
André Karpištšenko
 

Similar to Fasttext 20170720 yjy (20)

Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali TextChunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
 
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali TextChunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
 
Improvement wsd dictionary using annotated corpus and testing it with simplif...
Improvement wsd dictionary using annotated corpus and testing it with simplif...Improvement wsd dictionary using annotated corpus and testing it with simplif...
Improvement wsd dictionary using annotated corpus and testing it with simplif...
 
SNLI_presentation_2
SNLI_presentation_2SNLI_presentation_2
SNLI_presentation_2
 
Challenges in transfer learning in nlp
Challenges in transfer learning in nlpChallenges in transfer learning in nlp
Challenges in transfer learning in nlp
 
1066_multitask_prompted_training_en.pdf
1066_multitask_prompted_training_en.pdf1066_multitask_prompted_training_en.pdf
1066_multitask_prompted_training_en.pdf
 
New word analogy corpus
New word analogy corpusNew word analogy corpus
New word analogy corpus
 
Enriching Word Vectors with Subword Information
Enriching Word Vectors with Subword InformationEnriching Word Vectors with Subword Information
Enriching Word Vectors with Subword Information
 
Natural language processing (Python)
Natural language processing (Python)Natural language processing (Python)
Natural language processing (Python)
 
Neural word embedding and language modelling
Neural word embedding and language modellingNeural word embedding and language modelling
Neural word embedding and language modelling
 
Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing
 
Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le...
Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le...Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le...
Suggestion Generation for Specific Erroneous Part in a Sentence using Deep Le...
 
Fasttext(Enriching Word Vectors with Subword Information) 논문 리뷰
Fasttext(Enriching Word Vectors with Subword Information) 논문 리뷰Fasttext(Enriching Word Vectors with Subword Information) 논문 리뷰
Fasttext(Enriching Word Vectors with Subword Information) 논문 리뷰
 
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
 
MACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSISMACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSIS
 
Segmenting dna sequence into words
Segmenting dna sequence into wordsSegmenting dna sequence into words
Segmenting dna sequence into words
 
Word2vec on the italian language: first experiments
Word2vec on the italian language: first experimentsWord2vec on the italian language: first experiments
Word2vec on the italian language: first experiments
 
Cognitive plausibility in learning algorithms
Cognitive plausibility in learning algorithmsCognitive plausibility in learning algorithms
Cognitive plausibility in learning algorithms
 

Recently uploaded

Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
tolgahangng
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
Mariano Tinti
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 

Recently uploaded (20)

Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 

Fasttext 20170720 yjy

  • 1. Enriching Word Vectors with Subword Information Piotr Bonjanowski, Edouard Grave, Armand Joulin, Tomas Mikolov Presented by Jaeyeun Yoon On 2017-07-20
  • 2. IDS Lab. Word Embedding  Word Embedding • Word embedding is the collective name for a set of language modeling and feature learning techniques where words or phrases from the vocabulary are mapped to vectors of real numbers in a dimensional space. • Ex) Bag-of-words model, Latent semantic analysis(LSA), word2vec, GloVe …  Distributional Hypothesis • “A word is characterized by the company it keeps” [Firth, J.R. (1957)] • Motivated by the idea that closely co-occuring words define the semantics of the word • Most of word embedding models are based on distributional hypothesis. ”The cat sat on the mat” W("cat") = (0.2,−0.4,0.7,...) W("mat") = (0.0,0.6,−0.1,...)
  • 3. IDS Lab. Characteristics of word embedding • Once trained they can be used in various tasks • Capture relative semantics of words • Can be used to combat small dataset problems
  • 4. IDS Lab. Skip-Gram Model (general model) • Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013) • A classifier that predicts the words within a certain range around an input word • e.g.) … “It is time to finish” • The objective of the skip-gram model is to maximize the following log-likelihood: $\sum_{t=1}^{T} \sum_{c \in C_t} \log p(w_c \mid w_t)$, where $C_t$ is the set of indices of words surrounding word $w_t$, and $p(w_c \mid w_t)$ is the probability of observing a context word $w_c$ given $w_t$.
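As a sketch of how the context sets $C_t$ in the objective above are formed, the following hypothetical helper (the function name and window size are illustrative assumptions, not from the paper) enumerates the (target, context) pairs whose log-probabilities the objective sums:

```python
def skipgram_pairs(tokens, window=2):
    """Yield (target, context) pairs for every position t and every
    context index c within `window` of t, i.e. c in C_t."""
    for t, w_t in enumerate(tokens):
        for c in range(max(0, t - window), min(len(tokens), t + window + 1)):
            if c != t:
                yield w_t, tokens[c]

pairs = list(skipgram_pairs("it is time to finish".split(), window=2))
```

Each pair contributes one $\log p(w_c \mid w_t)$ term to the log-likelihood being maximized.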
  • 5. IDS Lab. Limitation of Skip-Gram 5  It is difficult to learn good representations of rare words with traditional word2vec.  There can be words in an NLP task that were not present in the word2vec training corpus - This limitation is more pronounced for morphologically rich languages EX) In French or Spanish, most verbs have more than forty different inflected forms, while the Finnish language has fifteen cases for nouns -> It is possible to improve vector representations for morphologically rich languages by using character-level information.
  • 6. IDS Lab. Example 6  German verb : ‘sein’ (English verb : ‘be’)
  • 7. IDS Lab. Limitation of Skip-Gram 7  It is difficult to learn good representations of rare words with traditional word2vec.  There can be words in an NLP task that were not present in the word2vec training corpus - This limitation is more pronounced for morphologically rich languages EX) In French or Spanish, most verbs have more than forty different inflected forms, while the Finnish language has fifteen cases for nouns -> It is possible to improve vector representations for morphologically rich languages by using character-level information.
  • 8. IDS Lab. • The basic skip-gram model described above ignores the internal structure of the word. • However, a character n-gram based model incorporates information about that structure through character n-gram embeddings. • This paper proposes that each word 𝑤 be represented as a bag of character n-grams. EX) For where with n = 3, it is represented by the character n-grams <wh / whe / her / ere / re> and the special sequence <where> Character n-gram based model 8 Note that the word <her> is different from the tri-gram her taken from the word where
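A minimal sketch of this bag-of-n-grams decomposition (the helper name is an assumption for illustration; the `<` and `>` boundary markers follow the paper):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of <word> for n_min <= n <= n_max,
    plus the whole special sequence <word> itself."""
    w = "<" + word + ">"
    grams = {w[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(w) - n + 1)}
    grams.add(w)  # the special sequence, e.g. <where>
    return grams

tri = char_ngrams("where", n_min=3, n_max=3)
```

Because the boundary markers are part of the strings, the word `<her>` and the internal tri-gram `her` of `where` are distinct keys, exactly as the slide notes.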
  • 9. IDS Lab. • We represent a word by the sum of the vector representations of its n-grams, obtaining the scoring function: $s(w, c) = \sum_{g \in \mathcal{G}_w} z_g^{\top} v_c$, where $w$ is the given word, $\mathcal{G}_w$ is the set of n-grams appearing in word $w$, $z_g$ is the vector representation of n-gram $g$, and $v_c$ is the word vector of the context word $c$. • We extract all the n-grams for $3 \le n \le 6$. Character n-gram based model (cont’d) 9
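Under these definitions the score is just a sum of dot products between the n-gram vectors $z_g$ and the context vector $v_c$. A toy sketch with random vectors (all names and values are illustrative, not trained embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
grams = ["<wh", "whe", "her", "ere", "re>", "<where>"]  # G_w for "where", n = 3
z = {g: rng.normal(size=dim) for g in grams}            # n-gram vectors z_g
v_c = rng.normal(size=dim)                              # context word vector

def score(word_grams, v_c, z):
    """s(w, c) = sum over g in G_w of z_g . v_c"""
    return float(sum(z[g] @ v_c for g in word_grams))

s = score(grams, v_c, z)
```

Equivalently, one can first sum the n-gram vectors into a single word vector and take one dot product, which is how the word representation itself is computed.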
  • 10. IDS Lab. Computing word vector representation 10
  • 11. IDS Lab. Experimental Settings  Target Languages • German / English / French / Spanish / Arabic / Romanian / Russian / Czech  Evaluation tasks 1. Human similarity judgement 2. Word analogy tasks 3. Comparison with morphological representations 4. Effect of the size of the training data 5. Effect of the size of n-grams 11
  • 12. IDS Lab. 1. Human similarity judgement  Correlation between human judgement and similarity scores on word similarity datasets. 12 sg : Skip-Gram cbow : continuous bag of words sisg- : Subword Information Skip-Gram (Treat unseen words as a null vector) sisg : Subword Information Skip-Gram (Treat unseen words by summing the n-gram vectors) RW : Rare Words dataset
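The difference between sisg- and sisg for out-of-vocabulary words can be sketched as follows (the tiny n-gram table and its values are made up for illustration):

```python
import numpy as np

dim = 4
rng = np.random.default_rng(1)
# Hypothetical n-gram vectors learned during training (illustrative values)
z = {g: rng.normal(size=dim) for g in ["<un", "uns", "nse", "see", "een", "en>"]}

def oov_vector(word, z, n=3, dim=dim, null_for_unseen=False):
    """sisg-: return the null (zero) vector for an unseen word.
    sisg : sum the vectors of the word's known character n-grams."""
    if null_for_unseen:
        return np.zeros(dim)
    w = "<" + word + ">"
    return sum((z[w[i:i + n]] for i in range(len(w) - n + 1) if w[i:i + n] in z),
               np.zeros(dim))

v_sisg = oov_vector("unseen", z)                          # built from n-grams
v_null = oov_vector("unseen", z, null_for_unseen=True)    # sisg- behaviour
```

This is why sisg does better on rare-word datasets such as RW: even a word never seen at training time gets a non-trivial vector from its n-grams.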
  • 13. IDS Lab. 2. Word analogy tasks  Accuracy of our model and baselines on word analogy tasks for Czech, German, English and Italian 13 * It is observed that morphological information significantly improves the syntactic tasks; our approach outperforms the baselines. In contrast, it does not help for semantic questions, and even degrades the performance for German and Italian.
  • 14. IDS Lab. 3. Comparison with morphological representations  Spearman’s rank correlation coefficient between human judgement and model scores for different methods using morphology to learn word representations. 14
  • 15. IDS Lab. 4. Effect of the size of the training data  Influence of the size of the training data on performance (Data: full Wikipedia dump / Task: 1. similarity task)  The sisg model is more robust to the size of the training data.  The performance of the baseline cbow model keeps improving as more data becomes available, whereas the sisg model quickly saturates, and adding more data does not always improve the results.  Even with very small datasets, sisg is observed to perform better than the baseline. 15
  • 16. IDS Lab. 5. Effect of the size of n-grams  The choice of the n-gram size boundaries is observed to be language- and task-dependent.  Results are always improved by taking 𝑛 ≥ 3 rather than 𝑛 ≥ 2, which shows that character 2-grams are not informative for that task 16 (Table axes: minimum value of n × maximum value of n)
  • 17. IDS Lab. Conclusion  The general Skip-Gram model has some limitations (Ex: OOV words).  These can be overcome by using subword information (character n-grams).  The model is simple; thanks to this simplicity, it trains fast and requires no preprocessing or supervision.  It works better for certain languages. (Ex: German) 17

Editor's Notes

  1. Introduction: word embedding means turning a word into a vector. Previously: bag-of-words models, fixed-length dimensions. Motivation: among the various methods, defining a word through its surrounding words, e.g., word2vec, GloVe. State of the art.