1. A Document Descriptor using Covariance of Word Vectors
Presented by: DORRA EL MEKKI
Marwan Torki, A Document Descriptor using Covariance of Word Vectors, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Short Papers), pages 527–532, Melbourne, Australia, July 15–20, 2018.
4. State-of-the-art methods
Bag-of-Words (BOW)
The bag-of-words model is a simplifying representation used in natural language processing. In this model, a text is
represented as the bag of its words, disregarding grammar and even word order but keeping multiplicity.
Latent Semantic Indexing (LSI)
Latent semantic indexing, sometimes referred to as latent semantic analysis, is a mathematical method that finds the hidden relationships between words in order to improve information understanding.
Deep learning methods
The introduction of neural language models using deep learning made it possible to learn word vector representations.
5. The added value
Instead of the interrelationship of words in the text, DoCoV captures the interrelationship between the dimensions of the word embedding, via the covariance matrix elements.
10. The IMDB movie review dataset
25% labelled training instances, 25% labelled test instances, 50% unlabelled training instances.
11. Objectives of the experiment
Objective 1
The DoCoV descriptor can be used with different alternatives for word representations.
Objective 2
Pre-trained models give the best results. This alleviates the need to compute a problem-specific word embedding.
12. Before Training
Step 1
Using the training and unlabelled subsets of the IMDB dataset to obtain different embeddings by setting the number of dimensions to 100, 200 and 300.
Step 2
Using pre-trained GloVe models trained on
wikipedia2014 and Gigaword5.
Step 3
Using the pre-trained word2vec model trained on Google News. We call it Gnews.
13. [Table 1: error-rate performance when changing word-vector dimensionality]
14. Observations
Observation 1: The DoCoV consistently outperforms the Mean vector for different dimensionalities of the word embedding.
Observation 2: The best-performing feature concatenation is DoCoV+BOW. This shows that the concatenation is in fact benefiting from both representations.
Observation 3: In general, the best results are achieved using the available 300-dimensional Gnews word embedding.
16. Conclusion
Generic: which makes it useful for different supervised and unsupervised tasks.
Fixed-length property: which makes it useful for different learning algorithms.
Better performance: against other state-of-the-art methods.
Minimal training: we do not require an encoder-decoder model or gradient descent iterations to be computed.
In computing, a data descriptor is a structure containing information that describes data.
In probability theory and statistics, covariance is a measure of the joint variability of two random variables. When the greater values of one variable mainly correspond to the greater values of the other, the covariance is positive; in the opposite case, when the greater values of one variable mainly correspond to the lesser values of the other (i.e., the variables tend to show opposite behavior), the covariance is negative.
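As a minimal illustration (mine, not the paper's), the sign of the covariance reflects this joint variability:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2 * x         # moves with x
y_down = 12 - 2 * x  # moves against x

print(np.cov(x, y_up)[0, 1])    # positive covariance
print(np.cov(x, y_down)[0, 1])  # negative covariance
```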
Word vectors, or word embeddings: they simply convert words into vectors.
For a word embedding to be good, we require that the vectors carry some meaning.
So if you put "hamburger" and "cheeseburger" into my model, I want those vectors to be close to each other, because they are very related words.
We also want the differences between vectors to carry some meaning.
For example:
king − man + woman ≈ queen
That is, starting from "king", the arithmetic should land near the vector of a related word, here "queen".
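As a toy illustration of this arithmetic (made-up 3-dimensional vectors, not a real embedding, which would have hundreds of dimensions):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

# king - man + woman should be closest to queen.
target = vecs["king"] - vecs["man"] + vecs["woman"]
best = max((w for w in vecs if w != "king"), key=lambda w: cosine(vecs[w], target))
print(best)  # queen
```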
In this paper, we will discuss how using the covariance of word vectors may be useful; to that end, I am going to pursue the following plan.
Retrieving documents that are similar to a query using vectors has a long history,
so the added value of this paper is not about representing words as vectors but about using the covariance of word vectors.
To better understand the topic, let's see how some earlier methods modeled documents and queries using vector spaces.
1/The bag-of-words model is a simplifying representation used in natural language processing. In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.
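A minimal bag-of-words sketch (the scikit-learn tooling is my choice for illustration, not the paper's):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the movie was great", "the movie was not great at all"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(bow.toarray())  # one word-count vector per document; order is lost
```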
2/Latent semantic indexing, sometimes referred to as latent semantic analysis, is a mathematical method. It finds the hidden (latent) relationships between words (semantics) in order to improve information understanding (indexing).
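A minimal LSI sketch, assuming the usual formulation as a truncated SVD over the bag-of-words matrix (again, scikit-learn is used only for illustration):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

docs = ["cats purr", "kittens purr softly", "dogs bark", "puppies bark loudly"]
bow = CountVectorizer().fit_transform(docs)

# Truncated SVD of the term-document counts: documents about the same
# topic end up close together in the low-dimensional latent space.
lsi = TruncatedSVD(n_components=2, random_state=0)
print(lsi.fit_transform(bow))  # one 2-d latent vector per document
```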
3/The introduction of neural language models using deep learning made it possible to learn word vector representations (word embeddings for simplicity).
Instead of studying the interrelationship of words in the text, DoCoV obtains a fixed-length representation of the paragraph which captures the interrelationship between the dimensions of the word embedding via the covariance matrix elements.
We present our DoCoV descriptor.
First, we define a document observation matrix.
Second, we show how to extract our DoCoV descriptor.
Given a d-dimensional word embedding model and an n-term document, we can define a document observation matrix $O \in \mathbb{R}^{n \times d}$. In the matrix O, a row represents a term in the document and the columns hold the d-dimensional word embedding of that term.
Assume that we have observed n terms of a d-dimensional random variable, giving a data matrix $O \in \mathbb{R}^{n \times d}$:
The rows $x_i = (x_{i1}, x_{i2}, \cdots, x_{id})^T \in \mathbb{R}^d$ denote the i-th observation of a d-dimensional random variable $X \in \mathbb{R}^d$.
The "sample mean vector" of the n observations is the vector $\bar{x}$ of the means $\bar{x}_j$ of the d variables: $\bar{x} = (\bar{x}_1, \bar{x}_2, \cdots, \bar{x}_d)^T \in \mathbb{R}^d$, with $\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}$.
Given an observation matrix O for a document, we compute the covariance matrix entry for every pair of dimensions (j, k). The matrix $C \in \mathbb{R}^{d \times d}$ is symmetric and is defined entrywise as $C_{jk} = \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)$.
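Putting these definitions together, here is a minimal NumPy sketch of computing the DoCoV descriptor for one document. The `embed` lookup (word to d-dimensional vector) is assumed, and keeping the upper triangle of the symmetric matrix as the fixed-length vector is the natural reading of the construction:

```python
import numpy as np

def docov(tokens, embed):
    """Minimal DoCoV sketch; `embed` maps a word to a d-dimensional vector."""
    # Document observation matrix O (n x d): one row per in-vocabulary term.
    O = np.array([embed[w] for w in tokens if w in embed])
    n, d = O.shape
    # Sample mean vector of the n observations.
    x_bar = O.mean(axis=0)
    # Covariance matrix C (d x d) from the centered observations.
    centered = O - x_bar
    C = centered.T @ centered / n
    # C is symmetric, so its d*(d+1)/2 distinct entries (the upper
    # triangle) give a fixed-length descriptor of the document.
    return C[np.triu_indices(d)]
```

For a 300-dimensional embedding this yields a 300·301/2 = 45150-dimensional descriptor, independent of the document length.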
Now we can move to the experimental evaluation. In this part, we show an extensive comparative evaluation of unsupervised paragraph representation approaches.
We evaluate classification performance on the IMDB movie review dataset, using the error rate as the evaluation measure.
The dataset consists of 100K IMDB movie reviews and each review has several sentences. The 100K reviews are divided into three datasets: 25% labelled training instances, 25% labelled test instances and 50% unlabelled training instances. Each review has one label representing the sentiment of it: Positive or Negative. These labels are balanced in both the training and the test set.
The objective is to show that the DoCoV descriptor can be used with different alternatives for word representations.
Also, the experiment shows that pre-trained models give the best results, namely the word2vec model built on Google News. This alleviates the need to compute a problem-specific word embedding.
In some cases there is no available data to construct the word embedding. To illustrate that, we tried different alternatives for word representation, as sketched after the steps below:
We used the training and unlabelled subsets of the IMDB dataset to obtain different embeddings by setting the number of dimensions to 100, 200 and 300.
We used pre-trained GloVe models trained on wikipedia2014 and Gigaword5.
We used the pre-trained word2vec model trained on Google News; we call it Gnews. This model provides word vectors of 300 dimensions for each word.
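A hedged gensim sketch of the three alternatives; the gensim-data model names (`glove-wiki-gigaword-300`, `word2vec-google-news-300`) are stand-ins for the exact files the authors used:

```python
import gensim.downloader
from gensim.models import Word2Vec

# Step 1: train word2vec on the IMDB training + unlabelled reviews.
# `imdb_sentences` (an iterable of tokenized reviews) is assumed here.
# own_wv = Word2Vec(imdb_sentences, vector_size=100).wv  # also 200, 300

# Step 2: pre-trained GloVe (Wikipedia 2014 + Gigaword 5), 300-d.
glove = gensim.downloader.load("glove-wiki-gigaword-300")

# Step 3: pre-trained word2vec on Google News ("Gnews"), 300-d.
gnews = gensim.downloader.load("word2vec-google-news-300")
```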
Word error rate (WER) is a common metric of the performance of a speech recognition or machine translation system.
WER = (S + D + I) / N
where
S is the number of substitutions,
D is the number of deletions,
I is the number of insertions,
C is the number of correct words,
N is the number of words in the reference (N=S+D+C)
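A direct transcription of the formula into code:

```python
def word_error_rate(S, D, I, C):
    """WER from substitutions S, deletions D, insertions I, correct words C."""
    N = S + D + C          # words in the reference
    return (S + D + I) / N

print(word_error_rate(S=2, D=1, I=1, C=7))  # (2 + 1 + 1) / 10 = 0.4
```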
Error-rate performance when changing word-vector dimensionality.
Table 1 shows the results when using DoCoV computed at different dimensions of word embedding for classification. The table also compares the classification performance when using DoCoV to the performance when using the Mean of the word embeddings as a baseline. We also show the effect of fusing DoCoV with other feature sets. We mainly experiment with the following feature sets, alone and concatenated: DoCoV, Mean, and bag-of-words (BOW).
From the results we can observe the following:
We observe that the DoCoV consistently outperforms the Mean vector for different dimensionalities of the word embedding, regardless of the embedding source.
The best-performing feature concatenation is DoCoV+BOW. This shows that the concatenation is in fact benefiting from both representations (see the sketch below).
In general, the best results are achieved using the available 300-dimensional Gnews word embedding. In the subsequent experiments we will use that embedding, so that we do not need to build a different word embedding for every task at hand.
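A hedged sketch of the DoCoV+BOW experiment: `docov` is the function sketched earlier, `gnews` the Gnews embedding, `train_texts`/`train_labels`/`test_texts`/`test_labels` are the assumed IMDB splits, and the linear SVM is my choice of classifier, not necessarily the authors' setup:

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

vectorizer = CountVectorizer()

def features(texts, fit=False):
    # BOW block (sparse) concatenated with the dense DoCoV block.
    bow = vectorizer.fit_transform(texts) if fit else vectorizer.transform(texts)
    dcv = csr_matrix(np.array([docov(t.split(), gnews) for t in texts]))
    return hstack([bow, dcv])

clf = LinearSVC().fit(features(train_texts, fit=True), train_labels)
error_rate = 1.0 - clf.score(features(test_texts), test_labels)
print(error_rate)
```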
We presented a novel descriptor to represent text at any level, such as sentences, paragraphs or documents.
Our representation is generic which makes it useful for different supervised and unsupervised tasks.
It has fixed-length property which makes it useful for different learning algorithms.
Also, our descriptor requires minimal training: we do not require an encoder-decoder model or gradient descent iterations to be computed.
Empirically we showed the effectiveness of the descriptor in different tasks. We showed better performance against other state-of-the-art methods in both supervised and unsupervised settings.