15. Dataset Description
• Corpus of Contemporary American English
http://corpus.byu.edu/coca/
• 1 million most frequent 5-grams in the whole corpus
• No stemming or lemmatization applied
• Vocabulary of approximately 25,000 words
16. Example of Dataset
• Preprocessed by me

W0        W1        W2         W3        W4
Both      men       and        women     reported
i         wanted    something  that      was
the       hospital  when       he        was
to        have      a          baby      that
policies  of        the        clinton   administration
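The 5-gram rows above can be produced by sliding a five-token window over a tokenized corpus and keeping the most frequent windows. A minimal sketch (the sample sentence and counting step are illustrative, not the actual COCA pipeline):

```python
from collections import Counter

def extract_5grams(tokens):
    """Slide a contiguous window of five tokens over the token list."""
    return [tuple(tokens[i:i + 5]) for i in range(len(tokens) - 4)]

tokens = "both men and women reported both men and women agreed".split()
counts = Counter(extract_5grams(tokens))   # keep the most frequent windows
```

In practice the dataset was downloaded pre-built from the COCA n-gram lists, so this step only illustrates where such rows come from.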
17. Model architecture
• Goal: similar words should have similar vector representations.
• Input: N-gram word list (the context words)
• Output: a list of probabilities P(word t = word i) over every word i in the vocabulary
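The architecture above can be sketched as a feed-forward neural language model (a Bengio-style NNLM, which these slides appear to follow): look up a feature vector for each context word, concatenate, pass through a tanh hidden layer, and softmax over the vocabulary. All dimensions and the random weights below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

V, d, h, n = 25_000, 50, 100, 4   # vocab size, embedding dim, hidden dim, context size
C = rng.normal(0, 0.1, (V, d))    # shared feature-vector table (one row per word)
H = rng.normal(0, 0.1, (n * d, h))
U = rng.normal(0, 0.1, (h, V))

def forward(context_ids):
    """Return P(word t = word i | context) for every word i in the vocabulary."""
    x = C[context_ids].reshape(-1)        # concatenate the n context vectors
    a = np.tanh(x @ H)                    # hidden layer
    logits = a @ U
    e = np.exp(logits - logits.max())     # numerically stable softmax
    return e / e.sum()

p = forward([10, 42, 7, 99])              # probabilities for all V candidate words
```

Training would adjust C, H, and U by gradient descent on the log-likelihood; only the forward pass is sketched here.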
25. Is this vector representation actually a ‘vector representation’?
• Do similar vectors have similar meanings (syntactically and semantically)?
26. Results
• Find similar vectors using the trained feature vectors Ci
• k-NN with the Euclidean metric
word   1st     2nd       3rd      4th    5th
Look   Looks   Looking   Stared   Peek   glance
Run    Ran     Running   term     Pass   Runs
Talk   Talked  Talking   Story    Bones  Truth
Know   Guess   Thinking  Knowing  Knows  sure
Boy    Girl    Woman     Man      Africa Doctor
Year   Week    Weeks     Days     Decade Month
Times  Moment  Day       Nights   Night  Pause
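The neighbor tables are produced by k-NN over the trained feature vectors: for a query word, rank every other word by Euclidean distance between their vectors. A minimal sketch (the toy vocabulary and 2-D vectors are illustrative):

```python
import numpy as np

def nearest_neighbors(C, vocab, query, k=5):
    """k words whose feature vectors are closest (Euclidean) to the query's."""
    q = C[vocab.index(query)]
    dists = np.linalg.norm(C - q, axis=1)
    order = np.argsort(dists)
    # order[0] is the query word itself (distance 0), so skip it
    return [vocab[i] for i in order[1:k + 1]]

# toy example: 'looks' deliberately placed near 'look'
vocab = ["look", "looks", "run"]
C = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
```

With the real 25,000-word table, each row of the results table is `nearest_neighbors(C, vocab, word, k=5)`.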
27. Results
• Cases that did not work well…?
word    1st           2nd           3rd           4th         5th
The     Our           United        Your          White       Main
Japan   Russia        Slavery       Terrorism     Britain     Sector
Indian  Competitive   Humanitarian  Regulatory    Canadian    Investigative
New     His           Our           Its           My          your
Your    Our           My            His           White       Their
Gay     Missile       Reproductive  Governmental  Preventive  Same-sex
A       Presidential  San           Foreign       The         domestic
28. Discussion
• Good syntactic similarity for most words.
• Good semantic (meaning) similarity for nouns and verbs.
• Poor semantic similarity for other words (adjectives, function words, etc.).
• I think this is mainly because I skipped:
• stop-word removal (dropping function words such as ‘a’, ‘the’, …)
• lemmatization (mapping inflected forms like ‘did’, ‘do’, and ‘done’ to a single base form ‘do’)
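Both of the skipped steps fit in a one-pass token filter. A minimal sketch, with a tiny hand-written stop-word set and lemma map standing in for a real lexicon (e.g. NLTK's):

```python
STOPWORDS = {"a", "an", "the", "of", "to"}           # small illustrative list
LEMMA = {"did": "do", "done": "do", "does": "do",    # hand-written lemma map
         "ran": "run", "running": "run"}

def preprocess(tokens):
    """Drop stop words, then map inflected forms to their base form."""
    return [LEMMA.get(t, t) for t in tokens if t not in STOPWORDS]

preprocess("the boy did run to the hospital".split())
```

Applying this before building the 5-gram dataset would merge inflected forms into one vector and keep function words like ‘the’ from dominating the neighbor lists.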
29. Go further (planned for the next presentation)
• Use Skip-gram or CBOW
• Toward a better word-to-vector representation
• Better efficiency
• Larger corpus size
• Visualization of the word models
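Skip-gram trains on (center, context) word pairs rather than fixed n-gram windows predicting only the next word. Generating those training pairs can be sketched as below (the sample sentence and window size are illustrative; CBOW simply flips the direction, predicting the center word from its context):

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: each word predicts its neighbors within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

skipgram_pairs("both men and women".split(), window=1)
```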