Introduction to Word Embeddings in Deep Learning
Sharefah Al-Ghamdi (Sharefah@ksu.edu.sa)
Outline
 Introduction.
 What are word embeddings?
 Why do we need word embeddings?
 Different types of word embeddings.
 Word embedding tools.
 Word2vec.
 Word embedding tutorial.
Introduction
Machine Translation
Search Engines
Spam Filtering
…
Introduction
We need a representation for words that captures their meanings, their semantic relationships, and the different types of contexts in which they are used. This is what word embeddings provide.
What are Word Embeddings?
 A set of language modeling and feature learning techniques in natural language processing (NLP) in which words are represented as vectors of real numbers.
 Words that have the same meaning have similar word embedding representations.
What are Word Embeddings?
Why Do We Need Word Embeddings?
 The need for unsupervised learning.
 To solve various NLP applications, not only one task.
 Many machine learning algorithms (including deep nets) require their input to be vectors of continuous values; they just won't work on strings of plain text. (cat, cats, dog ...)
 Vector representation has two important and advantageous properties:
 Dimensionality Reduction.
 Contextual Similarity.
Example of a Simple Method
 sentence = "Word Embeddings are Word converted into numbers"
 dictionary = ['Word', 'Embeddings', 'are', 'Converted', 'into', 'numbers']
 The vector representation of "numbers" as a one-hot encoded vector according to the above dictionary is:
"numbers"  [0, 0, 0, 0, 0, 1]
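A minimal sketch of this one-hot scheme in Python (the helper name one_hot is our own; matching is made case-insensitive so 'Converted'/'converted' agree):

dictionary = ['Word', 'Embeddings', 'are', 'Converted', 'into', 'numbers']

def one_hot(word, vocab):
    # Build a zero vector and set the position of `word` in `vocab` to 1.
    vec = [0] * len(vocab)
    vec[[w.lower() for w in vocab].index(word.lower())] = 1
    return vec

print(one_hot('numbers', dictionary))  # [0, 0, 0, 0, 0, 1]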
Different Types of Word Embeddings
1. Frequency-based Embedding:
 Count Vector
 TF-IDF Vector
 Co-Occurrence Vector
2. Prediction-based Embedding:
 CBOW (Continuous Bag of Words)
 Skip-Gram model
Different Types of Word Embeddings: Count Vector
(The original slide illustrates the count vector with a figure: each document is represented by the count, or presence, of each dictionary word. A hedged sketch follows.)
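As a sketch of the count-vector idea (the two toy documents are our own, and scikit-learn's CountVectorizer is assumed to be available; the slide itself only names the technique):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat on the mat", "the dog sat on the log"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # the learned dictionary
print(counts.toarray())                    # one row of term counts per document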
Different Types of Word Embeddings: TF-IDF Vector
• Takes into account the entire corpus.
• Penalizes common words by assigning them lower weights.
• TF: (number of times term t appears in a document) / (number of terms in the document).
• IDF: log(N/n), where N is the number of documents and n is the number of documents in which term t has appeared.
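A small Python sketch of these two formulas (the toy documents and helper names are our own, not from the slides; the log base is unspecified on the slide, so the natural log is used here):

import math

docs = [["this", "is", "about", "word2vec"],
        ["this", "is", "about", "nlp"]]

def tf(term, doc):
    # TF: (times term t appears in the document) / (number of terms in the document)
    return doc.count(term) / len(doc)

def idf(term, docs):
    # IDF: log(N / n), N = number of documents, n = documents containing term t
    n = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n)

print(tf("word2vec", docs[0]) * idf("word2vec", docs))  # rare term: positive weight
print(tf("this", docs[0]) * idf("this", docs))          # common term: weight 0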
Different Types of Word Embeddings: Co-Occurrence Vector
• The idea is that similar words tend to occur together and will have similar contexts.
• Co-occurrence is the number of times two words have appeared together within a Context Window.
• A context window is specified by a number (its size) and a direction.
Co-Occurrence Vector Example
Corpus (three short Arabic sentences): مدينة الرياض عاصمة السعودية . مدينة الرياض متطورة . مدينة الرياض مزدحمة
(English gloss: "The city of Riyadh is the capital of Saudi Arabia. The city of Riyadh is developed. The city of Riyadh is crowded.")
With a context window of 2 (around) for the word 'عاصمة', let us calculate a co-occurrence matrix. The vocabulary is مدينة, الرياض, عاصمة, السعودية, متطورة, مزدحمة, and every cell of the 6 x 6 matrix starts out unknown (?).
Co-Occurrence Vector Example
Sliding the window of 2 across the corpus, we tally the word pairs that fall inside each window: مدينة is paired in turn with الرياض, عاصمة, السعودية, متطورة, and مزدحمة. الرياض falls within two words of مدينة once at its first occurrence, once at its second, and twice at its third, so the co-occurrence of the two words (مدينة, الرياض) is 4.
Co-Occurrence Vector Example
The co-occurrence matrix is not the word vector representation that is generally used. Instead, it is decomposed using techniques like PCA, SVD, etc. into factors, and a combination of these factors forms the word vector representation. A sketch of both steps follows.
The slide fills in the مدينة column of the matrix (co-occurrence counts with مدينة; the remaining cells are filled the same way):
مدينة 0, الرياض 4, عاصمة 2, السعودية 1, متطورة 2, مزدحمة 1
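A hedged numpy sketch of both steps, building the window-2 co-occurrence matrix from the Riyadh corpus above and then decomposing it with SVD (sentence boundaries are ignored, which matches the count of 4 on the slide; the helper names are our own):

import numpy as np

tokens = "مدينة الرياض عاصمة السعودية مدينة الرياض متطورة مدينة الرياض مزدحمة".split()
vocab = sorted(set(tokens))
idx = {w: i for i, w in enumerate(vocab)}

window = 2
M = np.zeros((len(vocab), len(vocab)))
for i, w in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            M[idx[w], idx[tokens[j]]] += 1  # count every pair inside the window

print(M[idx['مدينة'], idx['الرياض']])  # 4.0, matching the slide

U, S, Vt = np.linalg.svd(M)   # decompose the matrix into factors
k = 2
vectors = U[:, :k] * S[:k]    # a combination of the first k factors forms the word vectors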
Different Types of Word Embeddings: Prediction-Based Embedding
• CBOW and Skip-Gram were introduced in word2vec.
• Both are shallow neural networks (NN) with three layers.
• The networks map word(s) to a target variable that is also a word (or words).
• Both techniques learn weights that act as the word vector representations.
Different Types of Word Embeddings: CBOW
• The CBOW neural network predicts the probability of a word given a context.
CBOW
(The original slide shows the CBOW network architecture as a figure.)
Input one-hot vectors over a vocabulary of size V = 10:
the word 'Sample': 0 0 0 1 0 0 0 0 0 0
the word 'Corpus': 0 0 0 0 1 0 0 0 0 0
N is the number of neurons in the hidden layer, which equals the number of dimensions we choose to represent our words.
The weights between the hidden layer and the output layer are taken as the word vector representation of the word.
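To make the mechanics concrete, here is a minimal numpy sketch of a single CBOW forward pass under these dimensions (the toy corpus is the sentence used on the skip-gram slide; the random weights and helper name are our own assumptions, and training is omitted):

import numpy as np

corpus = "Hey this is sample corpus using only one context word".lower().split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, N = len(vocab), 3                        # V = 10 words, N = 3 hidden neurons

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, N))   # input -> hidden weights
W_out = rng.normal(scale=0.1, size=(N, V))  # hidden -> output weights

def cbow_forward(context_words, target_word):
    # Hidden layer: average the W_in rows of the context words (no nonlinearity).
    h = W_in[[idx[w] for w in context_words]].mean(axis=0)
    # Output layer: softmax over the vocabulary.
    scores = h @ W_out
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs[idx[target_word]]

# Probability of 'is' given two of its neighbours (untrained, so near 1/V).
print(cbow_forward(['this', 'sample'], 'is'))
# After training, the learned weights serve as the word vectors.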
CBOW (continued)
(Figure: CBOW with multiple context words; the hidden activation is the average of the corresponding rows of the input-hidden weight matrix.)
Different Types of Word Embeddings: Skip-Gram
• The aim of skip-gram is to predict the context given a word.
Skip-Gram
Example sentence: "Hey this is sample corpus using only one context word"
In this example C = 2. A sketch of the resulting training pairs follows.
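A sketch of how skip-gram training pairs are generated from this sentence with C = 2 (the helper name is our own):

# Generate (center word, context word) training pairs for skip-gram.
def skipgram_pairs(tokens, C=2):
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - C), min(len(tokens), i + C + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "Hey this is sample corpus using only one context word".split()
for center, context in skipgram_pairs(sentence, C=2)[:4]:
    print(center, "->", context)
# Hey -> this
# Hey -> is
# this -> Hey
# this -> is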
CBOW and Skip-Gram
 In simpler words, CBOW tends to find the probability of a word occurring in a neighbourhood (context), so it generalises over all the different contexts in which a word can be used.
 Skip-gram, by contrast, tends to learn the different contexts separately, so it needs enough data for each context. Hence skip-gram requires more data to train, but, given enough data, it contains more knowledge about the context.
Word Embedding Tools
The most famous algorithms used to build word embeddings:
 Word2vec. Techniques: CBOW and Skip-Gram. Created by a team of researchers led by Tomas Mikolov at Google (Mikolov et al., 2013).
 GloVe. Technique: co-occurrence statistics. Developed as an open-source project at Stanford University (Pennington et al., 2014).
 fastText. Technique: BOW + subword information; based on the skip-gram model, each word is represented as a bag of character n-grams. Created by Facebook's AI Research (FAIR) lab (Bojanowski et al., 2016).
Compare GloVe and word2vec
Word Embedding Tools
• Gensim is a robust open-source vector space modeling and topic modeling toolkit implemented in Python. Gensim includes implementations of the word2vec and doc2vec algorithms, among others.
• Eclipse Deeplearning4j is a deep learning programming library written for Java. Deeplearning4j includes implementations of word2vec, doc2vec, GloVe, etc.
Word2Vec Tutorial
Word2Vec Tutorial
 The Gensim library provides an easy way to use Word2Vec in Python.
 We use the Classical Arabic Corpus built by Maha Alrabiah.
 You will apply the following:
 Train your own word2vec model on an Arabic corpus.
 Save your model.
 Print word similarity scores.
 Get the word vector of a word.
 Pick the odd word out.
 Load a pre-trained model.
Word2Vec Tutorial
 Use the following Python code for each step:
1. Import the needed modules:

import gensim
from gensim.models import word2vec
import logging

2. Set up logging:

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
Word2Vec Tutorial
3. Load your corpus:

sentences = word2vec.Text8Corpus('…/your_data')

4. Train the model:

model = word2vec.Word2Vec(sentences, size=300, sg=0)

size: dimensionality of the feature vectors (renamed vector_size in Gensim 4.0 and later).
sg: defines the training algorithm. If 1, skip-gram is used; otherwise (0, the default), CBOW is employed.
window: the maximum distance between the current and predicted word within a sentence.
min_count: ignores all words with total frequency lower than this.
…
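As an illustrative variant (our own line, not from the slides), the optional parameters described above can be passed in the same call; window=5 and min_count=5 happen to be Gensim's defaults:

model = word2vec.Word2Vec(sentences, size=300, sg=0, window=5, min_count=5)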
Word2Vec Tutorial
 Use the following Python code for each step:
5. Save the trained model:

model.save('…/workshop.model')
model.wv.save_word2vec_format('…/workshop.bin', binary=True)

6. Print word similarity scores:

most_similar = model.wv.most_similar('السماء')
for term, score in most_similar:
    print(term, score)

model.wv.similarity('word1', 'word2')
Word2Vec Tutorial
 Use the following Python code for each step:
7. Get the word vector of a word (accessing vectors via model.wv works across Gensim versions):

print(model.wv['محمد'])

8. Pick the odd word out:

print(model.wv.doesnt_match("السماوات السبع الجبال الأرض مكه".split()))

9. Load a pre-trained model:

model = gensim.models.KeyedVectors.load_word2vec_format('…/workshop.bin', binary=True)
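As a quick check (an illustrative call of our own), the re-loaded KeyedVectors object supports the same queries as model.wv above:

print(model.most_similar('السماء', topn=5))  # nearest neighbours from the loaded vectors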
References
 https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-
word2veec/
 https://rare-technologies.com/deep-learning-with-word2vec-and-gensim/
 http://ahogrammer.com/2017/01/20/the-list-of-pretrained-word-embeddings/
 https://machinelearningmastery.com/what-are-word-embeddings/
 https://flovv.github.io/figures/post29/embedding.png
 https://www.quora.com/What-is-word-embedding-in-deep-learning
References
 http://techblog.gumgum.com/articles/deep-learning-for-natural-language-
processing-part-1-word-embeddings
 https://www.inverse.com/article/31075-facebook-machine-learning-language-
fasttext
 https://en.wikipedia.org/wiki/FastText
 https://medium.com/deeper-learning/glossary-of-deep-learning-word-
embedding-f90c3cec34ca
 https://www.quora.com/Whats-the-best-word2vec-implementation-for-
generating-Word-Vectors-Word-Embeddings-of-a-2Gig-corpus-with-2-billion-
words
References
 https://radimrehurek.com/gensim/models/word2vec.html
 https://hackernoon.com/word2vec-part-1-fe2ec6514d70
 http://dsnotes.com/post/glove-enwiki/
 https://mahaalrabiah.wordpress.com/
Editor's Notes
  1. We use natural language applications, or benefit from them, every day. Natural Language Processing (NLP) helps machines "read" text by simulating the human ability to understand language. The challenge with machine translation technologies is not in translating words, but in understanding the meaning of sentences to provide a true translation.
  2. It is this approach to representing words and documents that may be considered one of the key breakthroughs of deep learning on challenging natural language processing problems.
  3. It is considered one of the key breakthroughs of deep learning on challenging NLP problems such as part-of-speech tagging, information retrieval, and question answering. One of the benefits of using dense and low-dimensional vectors is computational: the majority of neural network toolkits do not play well with very high-dimensional, sparse vectors. The main benefit of the dense representations is generalization power: if we believe some features may provide similar clues, it is worthwhile to provide a representation that is able to capture these similarities. Dimensionality reduction gives a more efficient representation; contextual similarity gives a more expressive representation. (Page 92, Neural Network Methods in Natural Language Processing, 2017.)
  4. In the first type, the statistics of word co-occurrence with neighboring words are computed and then mapped down to a vector for each word. Predictive models, in contrast, take 'raw' text as input and learn a word by predicting its surrounding context (the skip-gram model) or predicting a word given its surrounding context (the Continuous Bag Of Words model, CBOW) using gradient descent with randomly initialized vectors.
  5. There may be quite a few variations: the way the dictionary is prepared (e.g., the top 10,000 words) and the way the count is taken for each word (frequency or presence).
  6. This method computes the statistics of word co-occurrence with neighboring words, and then maps these statistics down to a vector for each word.
  7. The input layer and the target are both one-hot encoded, of size [1 x V].
  8. The input can be thought of as multiple one-hot encoded vectors. The calculation of the hidden activation changes: instead of just copying the corresponding row of the input-hidden weight matrix to the hidden layer, an average is taken over all the corresponding rows of the matrix.
  9. It just flips CBOW's architecture on its head.
  10. A sum is taken over all the error vectors to obtain a final error vector.
  11. CBOW takes the average of the contexts of a word.