TWO HOBBY PROJECTS WITH THE PACKAGE TEXT2VEC
https://www.linkedin.com/in/longhowlam
https://longhowlam.wordpress.com
@longhowlam
Longhow Lam -- Freelance Data Scientist
AGENDA
➢ TEXT2VEC INTRODUCTION
➢ JAAP.NL HOUSE PRICE PREDICTION
➢ SOAP ANALYTICS: THE BOLD
Just visit the website of this R package
www.text2vec.org
WWW.JAAP.NL HOUSE ANALYTICS
https://github.com/longhowlam/jaap
https://www.linkedin.com/pulse/huis-te-koop-zet-beleggingsobject-je-huisomschrijving-longhow-lam/
HOUSE PRICES: NOW CREATE A MODEL ON HOUSE DESCRIPTIONS / TEXTS
PREDICT HOUSE PRICE WITH LASSO REGRESSION OR XGBOOST
TERM DOCUMENT MATRIX
Super sparse: 65,000 rows, ~50,000 columns

house | price (€) | kitchen | big_garden | garage | ...(many more terms)... | swimming_pool
house 1 | 235,000 | 1 | 0 | 1 | ... | 0
house 2 | 450,000 | 0 | 1 | 0 | ... | 0
house 3 | 376,000 | 1 | 0 | 0 | ... | 0
... | ... | ... | ... | ... | ... | ...
house 65,000 | 621,000 | 1 | 1 | ... | ... | 1

data.frame jaap with 65,000 rows, a column huisbeschrijvingen (house descriptions) and a column prijs (price)
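As a rough illustration of the matrix above (in Python, not the R / text2vec code used in the talk), a binary term document matrix can be stored sparsely by keeping, per document, only the column indices of the terms that occur. The three descriptions and the helper names `tokenize`, `build_dtm` and `to_dense` are made up for this sketch.

```python
import re

def tokenize(text):
    """Lowercase a description and split it into word tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())

def build_dtm(docs):
    """Return (vocab, rows).

    vocab maps each term to a column index; rows[i] is the sorted list of
    column indices whose value is 1 for document i (all others are 0).
    """
    vocab = {}
    rows = []
    for doc in docs:
        cols = set()
        for term in tokenize(doc):
            if term not in vocab:
                vocab[term] = len(vocab)  # next free column
            cols.add(vocab[term])
        rows.append(sorted(cols))
    return vocab, rows

def to_dense(row, n_cols):
    """Expand one sparse row into a full 0/1 list (for display only)."""
    return [1 if j in row else 0 for j in range(n_cols)]

# Hypothetical stand-ins for the huisbeschrijvingen column.
descriptions = [
    "Kitchen and garage",
    "Big garden",
    "Kitchen",
]

vocab, rows = build_dtm(descriptions)
dense = [to_dense(r, len(vocab)) for r in rows]
for line in dense:
    print(line)
```

With 65,000 rows and roughly 50,000 columns, only the sparse form is practical; the dense expansion is shown here just to make the 0/1 layout visible.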
PREDICT HOUSE PRICE WITH LASSO REGRESSION OR XGBOOST
TERM DOCUMENT MATRIX
Too many columns for an ordinary linear regression, so regularization is needed,
for example “lasso” regression
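For reference, lasso regression can be written as the penalized least-squares problem below. This is the standard textbook form and is not taken from the slides: y_i stands for the price of house i, x_i for its row of the term document matrix, n for the number of houses, p for the number of terms, and λ ≥ 0 for the penalty strength, which is usually chosen by cross-validation.

```latex
\hat{\beta}_0, \hat{\beta} \;=\; \arg\min_{\beta_0,\,\beta}\;
\frac{1}{2n} \sum_{i=1}^{n} \bigl( y_i - \beta_0 - x_i^{\top}\beta \bigr)^2
\;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

The absolute-value penalty drives many coefficients exactly to zero, so only a subset of the terms ends up in the fitted model.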
LASSO REGRESSION NEGATIVE AND POSITIVE COEFFICIENTS
R² = 0.66
Intercept: € 238,260

negative term | coefficient | positive term | coefficient
parkkosten | -€ 39,644 | familiehuis | € 60,168
recreatiebungalow | -€ 32,614 | vrijstaande_villa | € 48,180
bungalowpark | -€ 31,801 | belegging | € 45,814
limburgse | -€ 23,483 | beleggingsobject | € 42,543
2_kamer | -€ 23,034 | entree_vestibule | € 41,674
plinten | -€ 22,510 | rijksmonument | € 39,379
overdekt_zwembad | -€ 21,971 | recreatief | € 39,142
2_kamerappartement | -€ 20,625 | verhuurd | € 36,171
aannemer | -€ 20,314 | detaillering | € 35,000
recreatiewoning | -€ 19,748 | visgraat | € 33,589
proeven | -€ 19,631 | eigen_badkamer | € 33,454
betaalbaar | -€ 19,621 | woningen_1 | € 33,321
starterswoning | -€ 19,502 | toiletten | € 32,836
volwassen | -€ 19,476 | rietgedekte | € 32,096
kunststofkozijnen | -€ 18,775 | representatieve | € 31,904
helder | -€ 18,594 | alarm | € 31,841
verbeterd | -€ 18,488 | toplocatie | € 31,821
eigen_gebruik | -€ 18,430 | gezinshuis | € 31,297
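To show how such a coefficient table is read, here is a small Python sketch (not the talk's R code). It assumes the terms enter the model as 0/1 indicators, as in the term document matrix slide, so a prediction is the intercept plus the coefficients of the terms that occur in a description. Only four of the coefficients listed above are copied in, and every other term is treated as zero here, which the fitted model would not do; `predict_price` is a name invented for this example.

```python
# Intercept and a handful of coefficients copied from the slide (in euros).
INTERCEPT = 238_260

COEFFICIENTS = {
    "rijksmonument": 39_379,
    "toplocatie": 31_821,
    "starterswoning": -19_502,
    "betaalbaar": -19_621,
}

def predict_price(terms):
    """Intercept plus the coefficient of each distinct known term present.

    Assumption: terms are 0/1 indicators, so a repeated term counts once,
    and terms missing from COEFFICIENTS contribute nothing in this sketch.
    """
    present = set(terms)
    return INTERCEPT + sum(COEFFICIENTS.get(term, 0) for term in present)

monument = predict_price(["rijksmonument", "toplocatie"])
starter = predict_price(["starterswoning", "betaalbaar"])
print(monument, starter)
```

Under these assumptions a description mentioning both rijksmonument and toplocatie is priced above the intercept, and one mentioning starterswoning and betaalbaar below it.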
SOAP ANALYTICS
WORD EMBEDDINGS IN BOLD & BEAUTIFUL RECAPS
Term Document Matrix: each document / recap is a vector of numbers
Word embedding: each word is a vector of numbers
A word embedding has to be trained from a collection of documents / recaps
Amsterdam = (0.83, 0.89, 0.34, … , 0.63, 0.19)
Steffy = (0.33, 0.19, 0.79, … , 0.13, 0.01)
Germany = (0.72, 0.65, 0.43, … , 0.36, 0.57)
Laugh = (0.85, 0.77, 0.24, … , 0.88, 0.29)
…
…
https://github.com/longhowlam/TBATB
WORD EMBEDDINGS LINGUISTIC REGULARITIES
[Figure: words in a 250-dimensional space, with the labels "Closest words" and "Word relations". Words shown: president, trump, car, media, press, house, man, woman, king, queen.]
vector("man") − vector("woman") is roughly vector("king") − vector("queen")
Example sentences: "Trump speaks with the press" and "The president talks to the media"
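The regularity above can be checked numerically by comparing the two difference vectors with cosine similarity. The sketch below is Python with hand-made 3-dimensional toy vectors (the talk uses trained vectors in 250 dimensions); the numbers are invented purely so that the arithmetic is easy to follow.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def subtract(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]

# Toy vectors (assumption): axis 0 ~ "royalty", axis 1 ~ "male", axis 2 ~ "female".
vectors = {
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
}

diff_man_woman = subtract(vectors["man"], vectors["woman"])
diff_king_queen = subtract(vectors["king"], vectors["queen"])

# A value close to 1 means the two differences point in the same direction.
similarity = cosine(diff_man_woman, diff_king_queen)
print(round(similarity, 6))
```

With trained embeddings the two differences are only approximately parallel, which is why the slide says "is roughly".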
WORD EMBEDDINGS BOLD & BEAUTIFUL RECAPS
➢ 4,000 daily recaps of The Bold and the Beautiful (TBATB) over the last 15 years
➢ Around 10,000 unique words in these recaps
➢ I am generating word vectors of dimension 250
First a simple word cloud to get a general idea of term importance
WORD EMBEDDINGS BOLD & BEAUTIFUL RECAPS
Stanford’s GloVe: Global Vectors for Word Representation
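GloVe is fitted on word-word co-occurrence counts rather than directly on the running text. As a minimal Python sketch of that first step (not the text2vec implementation), the function below counts co-occurrences inside a symmetric window and lets a pair that is d words apart contribute 1/d, following the weighting described in the GloVe paper. The example sentence, the window size of 2 and the function name `cooccurrence` are assumptions made for illustration.

```python
from collections import defaultdict

def cooccurrence(tokens, window=2):
    """Symmetric, distance-weighted co-occurrence counts.

    Returns a dict mapping (word, context_word) to the summed weight 1/d
    over every occurrence of the two words at most `window` positions apart.
    """
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        # Look only to the right and mirror the count, so each pair of
        # positions is visited exactly once.
        for d in range(1, window + 1):
            j = i + d
            if j >= len(tokens):
                break
            counts[(word, tokens[j])] += 1.0 / d
            counts[(tokens[j], word)] += 1.0 / d
    return counts

# Hypothetical recap fragment, already lowercased and tokenized.
tokens = "steffy told liam that steffy trusts liam".split()
counts = cooccurrence(tokens, window=2)
print(counts[("steffy", "liam")])
```

The word vectors are then obtained by fitting a model to the logarithm of these counts; that optimization step is left out of this sketch.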
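WORD EMBEDDINGS LINGUISTIC REGULARITIES
Words closest to "steffy":
rank | word | closest word | similarity
1 | steffy | steffy | 1.00
2 | steffy | liam | 0.82
3 | steffy | hope | 0.79
4 | steffy | said | 0.78
5 | steffy | wyatt | 0.76
6 | steffy | bill | 0.69
7 | steffy | asked | 0.68
8 | steffy | quinn | 0.67
9 | steffy | agreed | 0.65
10 | steffy | rick | 0.65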
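A ranking like the one above is typically produced by comparing one word vector with all the others and sorting by similarity. The slide does not state which measure was used; the Python sketch below assumes cosine similarity, and the 2-dimensional vectors and the function name `closest_words` are invented for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def closest_words(word, vectors, k=3):
    """Return the k (word, similarity) pairs most similar to `word`.

    The query word itself is included, so it ranks first with similarity 1.
    """
    target = vectors[word]
    scored = [(other, cosine(target, vec)) for other, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors (assumption); trained vectors would have 250 dimensions.
vectors = {
    "steffy": [1.0, 0.0],
    "liam":   [0.8, 0.6],
    "hope":   [0.6, 0.8],
    "garage": [0.0, 1.0],
}

top = closest_words("steffy", vectors, k=3)
for name, score in top:
    print(name, round(score, 2))
```

Scanning every word is fine for a vocabulary of around 10,000 words; larger vocabularies usually call for a matrix product or an approximate nearest-neighbour index.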
WORD EMBEDDINGS BOLD & BEAUTIFUL EXAMPLE
Word vectors for: Steffy − Liam
death furious lastly excused frustration onset
0.223 0.2006 0.1963 0.1958 0.1950 0.1937
WORD EMBEDDINGS BOLD & BEAUTIFUL EXAMPLE
Word vectors for: Bill − anger
liam katie wyatt steffy quinn said
0.5550 0.4845 0.4829 0.4645 0.4491 0.4201
Thanks for your attention. QUESTIONS?
https://www.linkedin.com/in/longhowlam
https://longhowlam.wordpress.com/
@longhowlam

text2vec SatRDay Amsterdam
