Title: NLP: From news recommendation to word prediction
Speaker: Mihalis Papakonstantinou (https://linkedin.com/in/mihalispapak/)
Date: Thursday, December 12, 2019
Event: https://meetup.com/Athens-Big-Data/events/266762191/
2. Who we are
Vaios Sintsirmas: Technical Manager @AthensVoice
vaiossynt@yahoo.gr
Mihalis Papakonstantinou: Data Engineer @Agroknow
mihalispapak@gmail.com
3. TABLE OF CONTENTS
01 AthensVoice - A Recommendation Engine
02 FOODAKAI - Data-powered Food Recalls
03 deliverynews.gr - A Text-data Project
04 NBG’s Word Embedding Challenge
8. AthensVoice
People read articles that are interesting to them.
So if we present them with more related articles, we can generate even more traffic.
And all that just by taking advantage of the traffic the website already generates!
9. What makes an article interesting? (part#1)
“It is interesting to me”
I usually read these articles.
These articles are from a specific category.
Ok, we can present you with more from the same category.
“It is interesting to people that have the same interests as me”
I have a specific behaviour.
But this matches the behaviour of others.
Ok, we’ll present you with the articles they are reading!
10. What makes an article interesting? (part#2)
“The article categories I usually read have a high correlation with articles from another category”
I usually read these categories.
Oh wait, these categories have high similarity with those! (based on user data)
We’ll present you with articles from these categories!
“The articles I usually read are heavily correlated with some other articles”
I usually read these articles.
Oh wait, the ones you are reading are related to these! (based on text similarity)
We’ll present you with these, no problem!
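To make the "categories correlated in user data" idea concrete, here is a minimal sketch (not the production code; the reading log, the use of pandas, and the category names are all illustrative assumptions):

```python
import pandas as pd

# Hypothetical reading log: one row per (user, article category) view.
reads = pd.DataFrame({
    "user":     ["u1", "u1", "u2", "u2", "u3", "u3"],
    "category": ["recipes", "restaurants", "recipes", "restaurants",
                 "politics", "politics"],
})

# User x category matrix of read counts.
counts = pd.crosstab(reads["user"], reads["category"])

# Pearson correlation between category columns: categories that the
# same users tend to read together score high.
category_corr = counts.corr()

# Categories most correlated with "recipes", excluding itself.
print(category_corr["recipes"].drop("recipes").sort_values(ascending=False))
```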
11. What makes an article interesting? (part#3)
“Usually at this time of day, people are reading articles from these categories”
It is noon and everyone is getting hungry!
So everyone is reading about recipes & restaurants.
And we know about it!
We’ll show you some of them!
“These are trending now!”
The social media APIs are out there.
We can get frequently posted hashtags.
Can we get related articles? (based on text similarity)
Great, we’ll present you with these!
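The time-of-day rule can be as simple as a per-hour category ranking. A toy sketch (the view log below is made up):

```python
import pandas as pd

# Made-up view log with the hour each article view happened.
reads = pd.DataFrame({
    "hour":     [12, 12, 12, 21, 21],
    "category": ["recipes", "restaurants", "recipes", "politics", "cinema"],
})

def top_categories(hour, k=2):
    """Categories most read during the given hour of day."""
    in_hour = reads[reads["hour"] == hour]
    return in_hour["category"].value_counts().head(k).index.tolist()

print(top_categories(12))  # ['recipes', 'restaurants']
```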
22. We have to start from somewhere!
Can we identify the product/ingredient involved?
What about the hazard?
Let’s employ some vocabulary-specific text mining!
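A minimal sketch of what vocabulary-specific matching can look like, assuming curated product and hazard term lists (the toy vocabularies below are placeholders):

```python
import re

# Toy vocabularies; in practice these come from curated food-domain term lists.
PRODUCTS = ["peanut butter", "milk powder", "frozen spinach"]
HAZARDS  = ["salmonella", "listeria", "undeclared peanut"]

def find_terms(text, vocabulary):
    """Return every vocabulary term that appears in the text."""
    text = text.lower()
    return [term for term in vocabulary
            if re.search(r"\b" + re.escape(term) + r"\b", text)]

recall = "Company X recalls frozen spinach due to possible Listeria contamination."
print(find_terms(recall, PRODUCTS))  # ['frozen spinach']
print(find_terms(recall, HAZARDS))   # ['listeria']
```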
24. Company Identification
Now we need to identify the company involved in the recall.
Let’s employ some NER!
But how can we also get the relationship between companies (e.g. one is a subsidiary of another)?
This sounds like a graph problem!
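A sketch of both ideas, assuming spaCy for the NER and networkx for the company graph (the talk doesn’t name the libraries; the pretrained model needs `python -m spacy download en_core_web_sm`):

```python
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")  # pretrained NER model (assumption)

doc = nlp("Acme Foods Inc. recalled peanut butter produced by its "
          "subsidiary Acme Nut Processing LLC.")
# Pull out organisation mentions.
companies = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
print(companies)

# Company relationships live naturally in a directed graph.
g = nx.DiGraph()
g.add_edge("Acme Nut Processing LLC", "Acme Foods Inc.",
           relation="subsidiary_of")
print(list(g.edges(data=True)))
```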
26. Brand Identification
Ok, so far so good, but what was the product brand behind the recall?
And more importantly, how can we identify it?
There are open datasets out there! Let’s take advantage of them!
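For instance, recall text can be fuzzy-matched against a brand list extracted from such a dataset. A sketch using only the standard library (the brand list is a placeholder, and the source dataset is an assumption, e.g. something like Open Food Facts):

```python
from difflib import get_close_matches

# Placeholder brand list; in practice, extracted from an open dataset.
KNOWN_BRANDS = ["Acme Farms", "Hellenic Dairy", "Golden Harvest"]

def match_brand(mention, brands, cutoff=0.8):
    """Fuzzy-match a (possibly misspelled) brand mention to a known brand."""
    hits = get_close_matches(mention, brands, n=1, cutoff=cutoff)
    return hits[0] if hits else None

print(match_brand("Helenic Dairy", KNOWN_BRANDS))  # 'Hellenic Dairy'
```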
28. Misc Info Identification
If we also had the date, LOT number & brand size, that would be great!
Ok, some data-source-specific parsers should be employed!
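For example, a parser for one feed might boil down to a handful of regexes (the patterns below are illustrative, tuned to a made-up format):

```python
import re

DATE_RE = re.compile(r"\b(\d{1,2}/\d{1,2}/\d{4})\b")
LOT_RE  = re.compile(r"\bLOT\s*(?:#|No\.?)?\s*([A-Z0-9-]+)\b", re.IGNORECASE)
SIZE_RE = re.compile(r"\b(\d+(?:\.\d+)?\s*(?:g|kg|ml|l|oz))\b", re.IGNORECASE)

text = "Recalled on 12/03/2019, LOT #A42-7, sold in 500 g jars."
print(DATE_RE.search(text).group(1))  # 12/03/2019
print(LOT_RE.search(text).group(1))   # A42-7
print(SIZE_RE.search(text).group(1))  # 500 g
```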
35. Enough with food recalls!
They are scary!
(and we are having pizza soon!)
Let’s switch to text aggregation!
36. deliverynews.gr - The Idea
And now it’s time for a side project
We know how to handle text data
Collect from a variety of sources
Employ ML/DL techniques to identify important terms
37. deliverynews.gr - The Idea
Can we put it into practice to collect news articles published (currently) within Greece?
And identify stuff?
And present them?
And all of this fully automated, without manual labor!
45. Article Deduplication
Text similarity metrics are here!
But then again, some intuitive rules can be applied!
(within a period of time & talking about roughly the same thing)
Cosine Similarity on Title: 82%
Cosine Similarity on Body: 75%
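A minimal sketch of that rule with scikit-learn (the library choice is an assumption; in practice one would also check that the publication timestamps fall within a short window, per the rule above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

TITLE_THRESHOLD, BODY_THRESHOLD = 0.82, 0.75  # thresholds from the slide

def is_duplicate(a, b):
    """a, b: dicts with 'title' and 'body' strings."""
    for field, threshold in (("title", TITLE_THRESHOLD),
                             ("body", BODY_THRESHOLD)):
        tfidf = TfidfVectorizer().fit_transform([a[field], b[field]])
        if cosine_similarity(tfidf[0], tfidf[1])[0, 0] < threshold:
            return False
    return True
```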
51. The More Data the Better
The RecSys got better the longer it was deployed
FOODAKAI’s internal workflows got better with more recalls
deliverynews.gr is getting bigger and better with more data sources
58. NBG Race - Challenge#1
Let’s create a word embedding
That can predict the next word
Coming from a specific pool of words
From a 10-word sentence
Given a dataset of 70K Greek law texts
66. Initial Preprocessing
We need to convert our text to a format models can understand
Stemming/Lemmatization seems interesting!
(Greek) Stopword removal seems valid.
Let’s begin!
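A minimal sketch of that step, assuming spaCy’s Greek model (`python -m spacy download el_core_news_sm`); the talk doesn’t say which stemmer/lemmatizer was actually used:

```python
import spacy

nlp = spacy.load("el_core_news_sm")  # Greek pipeline (assumption)

def preprocess(text):
    """Lemmatize and drop Greek stopwords, punctuation and numbers."""
    doc = nlp(text)
    return [tok.lemma_.lower() for tok in doc
            if tok.is_alpha and not tok.is_stop]

print(preprocess("Οι διατάξεις του παρόντος νόμου εφαρμόζονται άμεσα."))
```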
77. Preprocessing
Ok, we need to create a word embedding that covers specific words
And we need to predict the 11th word in a sentence
Let’s preprocess our dataset accordingly, filtering out words that are not in our target pool
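A sketch of the filter, assuming the challenge provides the target pool (the pool and sentences below are placeholders):

```python
# Placeholder target pool of (lemmatized) words the embedding must cover.
TARGET_POOL = {"άρθρο", "νόμος", "διάταξη", "υπουργός"}

def restrict_to_pool(tokens, pool=TARGET_POOL):
    """Drop tokens that the embedding does not need to cover."""
    return [tok for tok in tokens if tok in pool]

sentences = [["άρθρο", "πρώτο", "νόμος"], ["διάταξη", "ισχύω", "άμεσα"]]
print([restrict_to_pool(s) for s in sentences])
# [['άρθρο', 'νόμος'], ['διάταξη']]
```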
92. Glove - Params
We need to set the following:
window,
learning_rate,
emb_dimensions,
epochs
93. Glove - Attempt#1
Let’s keep learning_rate, emb_dim and epochs fixed and play around with the window parameter.
We need to predict the 11th word, so let’s go with window: 11
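A minimal sketch of this attempt, assuming the glove-python package (the talk doesn’t name the implementation; there, `no_components` plays the role of `emb_dimensions`, and the hyperparameter values below are placeholders):

```python
from glove import Corpus, Glove

# sentences: token lists produced by the preprocessing steps above.
sentences = [["άρθρο", "πρώτο", "νόμος"], ["διάταξη", "ισχύω", "άμεσα"]]

# Build the co-occurrence statistics with the window under test.
corpus = Corpus()
corpus.fit(sentences, window=11)

# Keep the other hyperparameters fixed while varying the window.
glove = Glove(no_components=100, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=30, no_threads=4)
glove.add_dictionary(corpus.dictionary)

print(glove.most_similar("νόμος", number=3))
```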