Title: NLP: From news recommendation to word prediction
Speaker: Mihalis Papakonstantinou (https://linkedin.com/in/mihalispapak/)
Date: Thursday, December 12, 2019
Event: https://meetup.com/Athens-Big-Data/events/266762191/
2. Who we are
Vaios Sintsirmas: Technical Manager @AthensVoice
vaiossynt@yahoo.gr
Mihalis Papakonstantinou: Data Engineer @Agroknow
mihalispapak@gmail.com
3. TABLE OF CONTENTS
01 AthensVoice - A Recommendation Engine
02 FOODAKAI - Data-powered Food Recalls
03 deliverynews.gr - A Text-data Project
04 NBG’s Word Embedding Challenge
8. AthensVoice
People read articles that are interesting to them.
So if we present them with more related articles, we can generate even more traffic.
And all that just by taking advantage of the traffic the website already generates!
9. What makes an article interesting? (part#1)
“It is interesting to me”
I usually read these articles.
These articles are from a specific category.
Ok, we can present you with more from the same category.
“It is interesting to people that have the same interests as me”
I have a specific behaviour.
But this matches the behaviour of others.
Ok, we’ll present you with the articles they are reading!
10. What makes an article interesting? (part#2)
“The article categories I usually read have a high correlation with articles from another category”
I usually read these categories.
Oh wait, these categories have high similarity with those! (based on user data)
We’ll present you with articles from these categories!
“The articles I usually read are heavily correlated with some other articles”
I usually read these articles.
Oh wait, the ones you are reading are related to these! (based on text similarity)
We’ll present you with these, no problem!
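To make the "categories correlated in user data" idea concrete, here is a minimal sketch (not the production code; the reading log, the use of pandas, and the category names are all illustrative assumptions):

```python
import pandas as pd

# Hypothetical reading log: one row per (user, article category) view.
reads = pd.DataFrame({
    "user":     ["u1", "u1", "u2", "u2", "u3", "u3"],
    "category": ["recipes", "restaurants", "recipes", "restaurants",
                 "politics", "politics"],
})

# User x category matrix of read counts.
counts = pd.crosstab(reads["user"], reads["category"])

# Pearson correlation between category columns: categories that the
# same users tend to read together score high.
category_corr = counts.corr()

# Categories most correlated with "recipes", excluding itself.
print(category_corr["recipes"].drop("recipes").sort_values(ascending=False))
```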
11. What makes an article interesting? (part#3)
“Usually at this time of day, people are reading articles from these categories”
It is noon and everyone is getting hungry!
So everyone is reading about recipes & restaurants.
And we know about it!
We’ll show you some of them!
“These are trending now!”
The social media APIs are out there.
We can get frequently posted hashtags.
Can we get related articles? (based on text similarity)
Great, we’ll present you with these!
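The time-of-day rule can be as simple as a per-hour category ranking. A toy sketch (the view log below is made up):

```python
import pandas as pd

# Made-up view log with the hour each article view happened.
reads = pd.DataFrame({
    "hour":     [12, 12, 12, 21, 21],
    "category": ["recipes", "restaurants", "recipes", "politics", "cinema"],
})

def top_categories(hour, k=2):
    """Categories most read during the given hour of day."""
    in_hour = reads[reads["hour"] == hour]
    return in_hour["category"].value_counts().head(k).index.tolist()

print(top_categories(12))  # ['recipes', 'restaurants']
```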
22. We have to start from somewhere!
Can we identify the product/ingredient involved?
What about the hazard?
Let’s employ some vocabulary-specific text mining!
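A minimal sketch of what vocabulary-specific matching can look like, assuming curated product and hazard term lists (the toy vocabularies below are placeholders):

```python
import re

# Toy vocabularies; in practice these come from curated food-domain term lists.
PRODUCTS = ["peanut butter", "milk powder", "frozen spinach"]
HAZARDS  = ["salmonella", "listeria", "undeclared peanut"]

def find_terms(text, vocabulary):
    """Return every vocabulary term that appears in the text."""
    text = text.lower()
    return [term for term in vocabulary
            if re.search(r"\b" + re.escape(term) + r"\b", text)]

recall = "Company X recalls frozen spinach due to possible Listeria contamination."
print(find_terms(recall, PRODUCTS))  # ['frozen spinach']
print(find_terms(recall, HAZARDS))   # ['listeria']
```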
24. Company Identification
Now we need to identify the company involved in the recall.
Let’s employ some NER!
But how can we also get the relationship between companies (e.g. one is a subsidiary of another)?
This sounds like a graph problem!
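A sketch of both ideas, assuming spaCy for the NER and networkx for the company graph (the talk doesn’t name the libraries; the pretrained model needs `python -m spacy download en_core_web_sm`):

```python
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")  # pretrained NER model (assumption)

doc = nlp("Acme Foods Inc. recalled peanut butter produced by its "
          "subsidiary Acme Nut Processing LLC.")
# Pull out organisation mentions.
companies = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
print(companies)

# Company relationships live naturally in a directed graph.
g = nx.DiGraph()
g.add_edge("Acme Nut Processing LLC", "Acme Foods Inc.",
           relation="subsidiary_of")
print(list(g.edges(data=True)))
```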
26. Brand Identification
Ok, so far so good, but what was the product brand behind the recall?
And more importantly, how can we identify it?
There are open datasets out there! Let’s take advantage of them!
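For instance, recall text can be fuzzy-matched against a brand list extracted from such a dataset. A sketch using only the standard library (the brand list is a placeholder, and the source dataset is an assumption, e.g. something like Open Food Facts):

```python
from difflib import get_close_matches

# Placeholder brand list; in practice, extracted from an open dataset.
KNOWN_BRANDS = ["Acme Farms", "Hellenic Dairy", "Golden Harvest"]

def match_brand(mention, brands, cutoff=0.8):
    """Fuzzy-match a (possibly misspelled) brand mention to a known brand."""
    hits = get_close_matches(mention, brands, n=1, cutoff=cutoff)
    return hits[0] if hits else None

print(match_brand("Helenic Dairy", KNOWN_BRANDS))  # 'Hellenic Dairy'
```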
28. Misc Info Identification
If we also had the date, LOT number & brand size, that would be great!
Ok, some data-source-specific parsers should be employed!
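For example, a parser for one feed might boil down to a handful of regexes (the patterns below are illustrative, tuned to a made-up format):

```python
import re

DATE_RE = re.compile(r"\b(\d{1,2}/\d{1,2}/\d{4})\b")
LOT_RE  = re.compile(r"\bLOT\s*(?:#|No\.?)?\s*([A-Z0-9-]+)\b", re.IGNORECASE)
SIZE_RE = re.compile(r"\b(\d+(?:\.\d+)?\s*(?:g|kg|ml|l|oz))\b", re.IGNORECASE)

text = "Recalled on 12/03/2019, LOT #A42-7, sold in 500 g jars."
print(DATE_RE.search(text).group(1))  # 12/03/2019
print(LOT_RE.search(text).group(1))   # A42-7
print(SIZE_RE.search(text).group(1))  # 500 g
```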
35. Enough with food recalls!
They are scary!
(and we are having pizza soon!)
Let’s switch to text aggregation!
36. deliverynews.gr - The Idea
And now it’s time for a side project
We know how to handle text data
Collect from a variety of sources
Employ ML/DL techniques to identify important terms
37. deliverynews.gr - The Idea
Can we put it into practice to collect news articles published (currently) within Greece?
And identify stuff?
And present them?
And all of this fully automated, without manual labor!
45. Article Deduplication
Text similarity metrics are here!
But then again, some intuitive rules can be applied!
(within a period of time & talking about roughly the same thing)
Cosine Similarity on Title: 82%
Cosine Similarity on Body: 75%
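A minimal sketch of that rule with scikit-learn (the library choice is an assumption; in practice one would also check that the publication timestamps fall within a short window, per the rule above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

TITLE_THRESHOLD, BODY_THRESHOLD = 0.82, 0.75  # thresholds from the slide

def is_duplicate(a, b):
    """a, b: dicts with 'title' and 'body' strings."""
    for field, threshold in (("title", TITLE_THRESHOLD),
                             ("body", BODY_THRESHOLD)):
        tfidf = TfidfVectorizer().fit_transform([a[field], b[field]])
        if cosine_similarity(tfidf[0], tfidf[1])[0, 0] < threshold:
            return False
    return True
```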
51. The More Data the Better
The RecSys got better the longer it was deployed
FOODAKAI’s internal workflows got better with more recalls
deliverynews.gr is getting bigger and better with more data sources
58. NBG Race - Challenge#1
Let’s create a word embedding
That can predict the next word
Coming from a specific pool of words
From a 10-word sentence
Given a dataset of 70K Greek law texts
66. Initial Preprocessing
We need to convert our text to a format models can understand
Stemming/Lemmatization seems interesting!
(Greek) Stopword removal seems valid.
Let’s begin!
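A minimal sketch of that step, assuming spaCy’s Greek model (`python -m spacy download el_core_news_sm`); the talk doesn’t say which stemmer/lemmatizer was actually used:

```python
import spacy

nlp = spacy.load("el_core_news_sm")  # Greek pipeline (assumption)

def preprocess(text):
    """Lemmatize and drop Greek stopwords, punctuation and numbers."""
    doc = nlp(text)
    return [tok.lemma_.lower() for tok in doc
            if tok.is_alpha and not tok.is_stop]

print(preprocess("Οι διατάξεις του παρόντος νόμου εφαρμόζονται άμεσα."))
```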
77. Preprocessing
Ok, we need to create a word embedding that covers specific words
And we need to predict the 11th word in a sentence
Let’s preprocess our dataset accordingly, filtering out words that are not in our target pool
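A sketch of the filter, assuming the challenge provides the target pool (the pool and sentences below are placeholders):

```python
# Placeholder target pool of (lemmatized) words the embedding must cover.
TARGET_POOL = {"άρθρο", "νόμος", "διάταξη", "υπουργός"}

def restrict_to_pool(tokens, pool=TARGET_POOL):
    """Drop tokens that the embedding does not need to cover."""
    return [tok for tok in tokens if tok in pool]

sentences = [["άρθρο", "πρώτο", "νόμος"], ["διάταξη", "ισχύω", "άμεσα"]]
print([restrict_to_pool(s) for s in sentences])
# [['άρθρο', 'νόμος'], ['διάταξη']]
```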
92. Glove - Params
We need to set the following:
window,
learning_rate,
emb_dimensions,
epochs
93. Glove - Attempt#1
Let’s keep learning_rate, emb_dim and epochs fixed and play around with the window parameter.
We need to predict the 11th word, so let’s go with window: 11
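A minimal sketch of this attempt, assuming the glove-python package (the talk doesn’t name the implementation; there, `no_components` plays the role of `emb_dimensions`, and the hyperparameter values below are placeholders):

```python
from glove import Corpus, Glove

# sentences: token lists produced by the preprocessing steps above.
sentences = [["άρθρο", "πρώτο", "νόμος"], ["διάταξη", "ισχύω", "άμεσα"]]

# Build the co-occurrence statistics with the window under test.
corpus = Corpus()
corpus.fit(sentences, window=11)

# Keep the other hyperparameters fixed while varying the window.
glove = Glove(no_components=100, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=30, no_threads=4)
glove.add_dictionary(corpus.dictionary)

print(glove.most_similar("νόμος", number=3))
```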