15. Dataset Description
• Corpus of Contemporary American English
http://corpus.byu.edu/coca/
• 1 million most frequent 5-grams in the whole corpus
• No stemming or lemmatization applied
• Vocabulary of approximately 25,000 words
16. Example of Dataset
• Preprocessed by me

W0        W1        W2         W3        W4
Both      men       and        women     reported
i         wanted    something  that      was
the       hospital  when       he        was
to        have      a          baby      that
policies  of        the        clinton   administration
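The 5-gram rows above can be produced by sliding a five-token window over a tokenized corpus and keeping the most frequent windows. A minimal sketch (the sample sentence and counting step are illustrative, not the actual COCA pipeline):

```python
from collections import Counter

def extract_5grams(tokens):
    """Slide a contiguous window of five tokens over the token list."""
    return [tuple(tokens[i:i + 5]) for i in range(len(tokens) - 4)]

tokens = "both men and women reported both men and women agreed".split()
counts = Counter(extract_5grams(tokens))   # keep the most frequent windows
```

In practice the dataset was downloaded pre-built from the COCA n-gram lists, so this step only illustrates where such rows come from.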
17. Model architecture
• Goal: similar words should have similar vector representations.
• Input: N-gram word list (the context words)
• Output: a list of probabilities P(word t = word i) over every word i in the vocabulary
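The architecture above can be sketched as a feed-forward neural language model (a Bengio-style NNLM, which these slides appear to follow): look up a feature vector for each context word, concatenate, pass through a tanh hidden layer, and softmax over the vocabulary. All dimensions and the random weights below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

V, d, h, n = 25_000, 50, 100, 4   # vocab size, embedding dim, hidden dim, context size
C = rng.normal(0, 0.1, (V, d))    # shared feature-vector table (one row per word)
H = rng.normal(0, 0.1, (n * d, h))
U = rng.normal(0, 0.1, (h, V))

def forward(context_ids):
    """Return P(word t = word i | context) for every word i in the vocabulary."""
    x = C[context_ids].reshape(-1)        # concatenate the n context vectors
    a = np.tanh(x @ H)                    # hidden layer
    logits = a @ U
    e = np.exp(logits - logits.max())     # numerically stable softmax
    return e / e.sum()

p = forward([10, 42, 7, 99])              # probabilities for all V candidate words
```

Training would adjust C, H, and U by gradient descent on the log-likelihood; only the forward pass is sketched here.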
25. Is this vector representation actually a ‘vector representation’?
• Do similar vectors have similar meanings (syntactically and semantically)?
26. Results
• Find similar vectors using the trained feature vectors Ci
• k-NN with the Euclidean metric
word   1st     2nd       3rd      4th    5th
Look   Looks   Looking   Stared   Peek   glance
Run    Ran     Running   term     Pass   Runs
Talk   Talked  Talking   Story    Bones  Truth
Know   Guess   Thinking  Knowing  Knows  sure
Boy    Girl    Woman     Man      Africa Doctor
Year   Week    Weeks     Days     Decade Month
Times  Moment  Day       Nights   Night  Pause
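The neighbor tables are produced by k-NN over the trained feature vectors: for a query word, rank every other word by Euclidean distance between their vectors. A minimal sketch (the toy vocabulary and 2-D vectors are illustrative):

```python
import numpy as np

def nearest_neighbors(C, vocab, query, k=5):
    """k words whose feature vectors are closest (Euclidean) to the query's."""
    q = C[vocab.index(query)]
    dists = np.linalg.norm(C - q, axis=1)
    order = np.argsort(dists)
    # order[0] is the query word itself (distance 0), so skip it
    return [vocab[i] for i in order[1:k + 1]]

# toy example: 'looks' deliberately placed near 'look'
vocab = ["look", "looks", "run"]
C = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
```

With the real 25,000-word table, each row of the results table is `nearest_neighbors(C, vocab, word, k=5)`.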
27. Results
• Cases that did not work well…?
word    1st           2nd           3rd           4th         5th
The     Our           United        Your          White       Main
Japan   Russia        Slavery       Terrorism     Britain     Sector
Indian  Competitive   Humanitarian  Regulatory    Canadian    Investigative
New     His           Our           Its           My          your
Your    Our           My            His           White       Their
Gay     Missile       Reproductive  Governmental  Preventive  Same-sex
A       Presidential  San           Foreign       The         domestic
28. Discussion
• Good syntactic similarity for most words.
• Good semantic (meaning) similarity for nouns and verbs.
• Poor semantic similarity for other words (adjectives, function words, etc.).
• I think this is mainly because I skipped:
• stop-word removal (dropping function words such as ‘a’, ‘the’, …)
• lemmatization (mapping inflected forms like ‘did’, ‘do’, and ‘done’ to a single base form ‘do’)
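Both of the skipped steps fit in a one-pass token filter. A minimal sketch, with a tiny hand-written stop-word set and lemma map standing in for a real lexicon (e.g. NLTK's):

```python
STOPWORDS = {"a", "an", "the", "of", "to"}           # small illustrative list
LEMMA = {"did": "do", "done": "do", "does": "do",    # hand-written lemma map
         "ran": "run", "running": "run"}

def preprocess(tokens):
    """Drop stop words, then map inflected forms to their base form."""
    return [LEMMA.get(t, t) for t in tokens if t not in STOPWORDS]

preprocess("the boy did run to the hospital".split())
```

Applying this before building the 5-gram dataset would merge inflected forms into one vector and keep function words like ‘the’ from dominating the neighbor lists.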
29. Go further (planned for the next presentation)
• Use Skip-gram or CBOW
• Toward a better word-to-vector representation
• Better efficiency
• Larger corpus size
• Visualization of the word models
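Skip-gram trains on (center, context) word pairs rather than fixed n-gram windows predicting only the next word. Generating those training pairs can be sketched as below (the sample sentence and window size are illustrative; CBOW simply flips the direction, predicting the center word from its context):

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: each word predicts its neighbors within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

skipgram_pairs("both men and women".split(), window=1)
```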