No more bad news!
News recommendation with ML and NLP.
Samia Khalid and Simon Lia-Jonassen
NTNU Cogito
March 7th, 2019
Contents
00 Introduction
01 Recommender architecture
02 Natural language processing
03 Recommendation model training
04 Demo and further work
Introduction to News Recommendation
Understand the Content of news I read.
Learn my Interests over time.
Recommend news that interests me.
News recommender in a nutshell
https://github.com/s-j/goodnews
Implements three parts:
• Frontend and backend controllers.
• Feed provider and logging.
• NLP, ML and exploration workflow.
Content and feedback signals
Natural Language Processing
Natural Language Processing and Exploration
1. Text Processing
2. Clustering
3. Topic Extraction
1. Text Processing using spaCy
spaCy is a leading open-source library for advanced NLP.
1. Text Processing using spaCy
a. Tokenization
b. Part-of-Speech Tagging
c. Lemmatization
d. Stop words
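A minimal sketch of these four steps with spaCy (assuming the small English model en_core_web_sm has been downloaded):

```python
# Minimal sketch of tokenization, POS tagging, lemmatization and stop-word
# flags with spaCy (assumes `python -m spacy download en_core_web_sm` was run).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Who is the AI research director?")

for token in doc:
    # token.text   -> the token itself (a. tokenization)
    # token.pos_   -> its part-of-speech tag (b.)
    # token.lemma_ -> its base form (c.)
    # token.is_stop -> whether it is a stop word (d.)
    print(token.text, token.pos_, token.lemma_, token.is_stop)
```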
1. Text Processing using spaCy
Dependency Parsing
1. Recognizes a sentence and assigns a syntactic structure to it.
• “Who is the AI research director?”
2. spaCy provides a built-in visualizer (displaCy), sketched below.
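A sketch of the dependency parse and the built-in displaCy visualizer, using the example sentence from the slide (displacy.serve starts a local web server; use displacy.render inside a notebook):

```python
# Sketch: dependency parsing and visualization with spaCy's built-in displaCy.
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Who is the AI research director?")

# Print each token with its dependency label and syntactic head.
for token in doc:
    print(f"{token.text:10s} {token.dep_:10s} <- {token.head.text}")

# Serves an interactive parse tree at http://localhost:5000
# (use displacy.render(doc, style="dep") in a notebook instead).
displacy.serve(doc, style="dep")
```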
1. Text Processing using spaCy
Entity Recognition
1. Locates and classifies named entities in text into pre-defined categories.
2. Can help to answer questions like:
• “Which people, companies and products is the user interested in?”
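A small sketch of entity extraction; the example sentence and the entities it yields are illustrative:

```python
# Sketch: named-entity recognition with spaCy (labels come from the pretrained model).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion, says Tim Cook.")

for ent in doc.ents:
    print(ent.text, ent.label_)
# Expected along the lines of: Apple ORG, U.K. GPE, $1 billion MONEY, Tim Cook PERSON
```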
1. Text Processing using spaCy
Distribution of POS Tags
1. Part-of-Speech Tagging:
• assigns a part of speech, such as noun, verb or adjective, to each token.
2. spaCy uses a statistical model to predict which tag or label most likely applies in the given context.
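One way to compute the POS-tag distribution over a corpus; the `articles` list here is a hypothetical stand-in for the crawled feed:

```python
# Sketch: counting POS tags over many articles, using nlp.pipe for batching.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")
articles = ["Google releases a new phone.", "Stocks fall as markets react."]

pos_counts = Counter()
for doc in nlp.pipe(articles):
    pos_counts.update(token.pos_ for token in doc)

print(pos_counts.most_common())  # e.g. [('NOUN', ...), ('VERB', ...), ...]
```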
1. Text Processing using spaCy
Word Probabilities: finding the most improbable words (noisy data)
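A hedged sketch of this check: token.prob is a smoothed log-probability that is populated in the larger models such as en_core_web_lg in spaCy 2.x (newer spaCy versions may need the extra lookups data installed):

```python
# Sketch: surface the least probable tokens as likely noise.
import spacy

nlp = spacy.load("en_core_web_lg")
doc = nlp("Breaking: GlxqZt stock plummets after earnings miss.")

# Sort tokens by log-probability; the rarest (most improbable) bubble up first.
rarest = sorted(doc, key=lambda t: t.prob)[:5]
print([(t.text, round(t.prob, 2)) for t in rarest])
```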
1. Text Processing using spaCy
Analyzing top unigrams in clicked articles vs all articles (considering only PROPN and NOUNS)
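A sketch of that comparison; `clicked` and `all_articles` are hypothetical lists of article texts:

```python
# Sketch: top unigrams (PROPN and NOUN only) in clicked vs all articles.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def top_unigrams(texts, n=10):
    counts = Counter()
    for doc in nlp.pipe(texts):
        counts.update(t.lemma_.lower() for t in doc
                      if t.pos_ in ("PROPN", "NOUN") and not t.is_stop)
    return counts.most_common(n)

all_articles = ["Google unveils a new AI chip.", "Oil prices rise in Asia."]
clicked = ["Google unveils a new AI chip."]
print(top_unigrams(clicked))
print(top_unigrams(all_articles))
```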
2. K-Means Clustering
1. Word Vectors as input:
• 300-dimensional vectors to represent words in numerical form.
2. K-Means needs the number of clusters as a parameter:
• Try out different values until satisfied.
• Can use silhouette score and distortion as metrics.
3. PCA for visualizing the results in 2-D (see the sketch below).
2. K-Means Clustering
Note: clusters for the full set of articles.
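A hedged sketch of this clustering workflow with scikit-learn; `doc_vectors` stands in for the real per-article vectors (e.g. averaged spaCy word vectors):

```python
# Sketch: K-Means over article vectors, scored with silhouette and distortion,
# then projected to 2-D with PCA for plotting.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

doc_vectors = np.random.rand(200, 300)  # stand-in for real 300-d article vectors

for k in (3, 5, 8):
    kmeans = KMeans(n_clusters=k, random_state=0).fit(doc_vectors)
    score = silhouette_score(doc_vectors, kmeans.labels_)
    print(k, round(score, 3), round(kmeans.inertia_, 1))  # inertia_ = distortion

points_2d = PCA(n_components=2).fit_transform(doc_vectors)  # for a 2-D scatter plot
```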
3. Topic Modeling: LDA
1. LDA considers two things:
• Each document in a corpus is a weighted combination of several topics, e.g.,
  doc1 -> 0.1 * finance + 0.2 * science + 0.5 * technology, …
• Each topic has its own collection of representative keywords, e.g.,
  technology -> ['computer', 'microsoft', 'google', ...]
3. Topic Modeling: LDA
2. The two probability distributions that the algorithm tries to approximate, starting from a random initialization until convergence:
• For a given document, what is the distribution of topics that describe it?
• For a given topic, what is the distribution of its words, i.e., how important (probable) is each word in defining the topic's nature?
3. Topic Modeling: LDA
Interactive topic visualization with pyLDAvis.
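A hedged sketch with gensim and pyLDAvis (in pyLDAvis 3.x the module is pyLDAvis.gensim_models); `tokenized_docs` is a hypothetical list of token lists from the preprocessing above:

```python
# Sketch: LDA topic model with gensim, visualized interactively with pyLDAvis.
from gensim import corpora, models
import pyLDAvis
import pyLDAvis.gensim  # pyLDAvis.gensim_models in newer pyLDAvis releases

tokenized_docs = [["google", "phone", "launch"], ["stocks", "market", "fall"]]

dictionary = corpora.Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
print(lda.print_topics())

vis = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(vis, "lda_topics.html")  # open in a browser to explore topics
```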
Recommendation Model Training
Model training
Preprocessing
1. Join request and feedback logs.
• Alternative: use a third-party dataset.
2. Use #clicks > 0 as a positive label.
• Alternative 1: use #clicks / #views.
• Alternative 2: use click order.
• Alternative 3: get explicit feedback.
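A minimal pandas sketch of this step; the file and column names (request_log.json, feedback_log.json, article_id) are hypothetical, not the repository's actual schema:

```python
# Sketch: join request and feedback logs, derive a binary click label.
import pandas as pd

requests = pd.read_json("request_log.json", lines=True)   # one row per shown article
feedback = pd.read_json("feedback_log.json", lines=True)  # one row per click event

clicks = feedback.groupby("article_id").size().reset_index(name="clicks")
data = requests.merge(clicks, on="article_id", how="left").fillna({"clicks": 0})

data["label"] = (data["clicks"] > 0).astype(int)  # positive label: clicked at least once
```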
Model training
NLP features and Train/Test split
1. Use title and description to get:
• A bag of named entities, such as person or org (using spaCy).
• A bag of key terms from the semantic network (using Textacy).
• A normalized sum over key-term embedding vectors found in the GoogleNews word2vec dataset.
2. Hold out 20% of items for testing.
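A hedged sketch of the three feature views plus the 20% hold-out. The textacy key-term API differs across versions, so a simple POS-based stand-in is used here; the GoogleNews vectors load via gensim, and `data` is the joined log frame from the previous sketch:

```python
# Sketch: three NLP feature views per article plus a 20% test hold-out.
import numpy as np
import spacy
from gensim.models import KeyedVectors
from sklearn.model_selection import train_test_split

nlp = spacy.load("en_core_web_sm")
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def nlp_features(text):
    doc = nlp(text)
    entities = [e.text.lower() for e in doc.ents if e.label_ in ("PERSON", "ORG")]
    key_terms = [t.lemma_.lower() for t in doc if t.pos_ in ("PROPN", "NOUN")]
    vectors = [w2v[t] for t in key_terms if t in w2v]
    emb = np.sum(vectors, axis=0) if vectors else np.zeros(300)
    norm = np.linalg.norm(emb)
    return entities, key_terms, emb / norm if norm else emb

train, test = train_test_split(data, test_size=0.2, random_state=42)
```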
Model training
Pipeline based on entities
1. One-hot-encode entities to get a sparse vector.
2. Compensate popularity skew using Inverse Document Frequency (IDF).
3. Train a classifier using Gradient Boosting Decision Trees (GBDT).
Note that we have a small, very skewed and noisy dataset, so we are not expecting good classification performance.
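A hedged scikit-learn sketch of this pipeline; `entities_text` is a hypothetical column holding each article's entities joined into one string:

```python
# Sketch: one-hot bag of entities with IDF re-weighting, then a GBDT classifier.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier

entity_pipeline = Pipeline([
    # binary=True gives one-hot counts; IDF then downweights very popular entities.
    ("tfidf", TfidfVectorizer(binary=True, norm=None)),
    ("gbdt", GradientBoostingClassifier(n_estimators=100)),
])

entity_pipeline.fit(train["entities_text"], train["label"])
print(entity_pipeline.score(test["entities_text"], test["label"]))
```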
Model training
Pipeline based on semantic key terms
1. Hash-merge features into 100 buckets.
2. Train a GBDT classifier.
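One possible reading of "hash-merge", sketched with scikit-learn's HashingVectorizer; `key_terms_text` is a hypothetical column of space-joined key terms:

```python
# Sketch: hash key terms into 100 feature buckets, then train a GBDT classifier.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.ensemble import GradientBoostingClassifier

keyterm_pipeline = Pipeline([
    ("hash", HashingVectorizer(n_features=100, alternate_sign=False)),
    ("gbdt", GradientBoostingClassifier(n_estimators=100)),
])

keyterm_pipeline.fit(train["key_terms_text"], train["label"])
```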
Model training
Pipeline based on embedding vectors
Just use logistic regression right away.
• This gives us a more relaxed prediction with a much higher number of true positives, but also more false positives.
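A sketch of the embedding pipeline, assuming each article's normalized 300-d vector is stored in a hypothetical `embedding` column:

```python
# Sketch: logistic regression on dense embedding features.
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.vstack(train["embedding"].to_numpy())
X_test = np.vstack(test["embedding"].to_numpy())

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train["label"])
print(clf.score(X_test, test["label"]))
```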
Model training
Stacking and beyond
It is possible to combine features and models...
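One simple way to combine them, sketched here as stacking the three pipelines' click probabilities into a small meta-model (column and variable names follow the earlier hypothetical sketches):

```python
# Sketch: stack the per-pipeline click probabilities and fit a meta-classifier.
# In practice, use out-of-fold predictions here to avoid leaking training labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

stacked = np.column_stack([
    entity_pipeline.predict_proba(train["entities_text"])[:, 1],
    keyterm_pipeline.predict_proba(train["key_terms_text"])[:, 1],
    clf.predict_proba(X_train)[:, 1],
])
meta = LogisticRegression().fit(stacked, train["label"])
```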
Model application
Using a trained model
1. Get NLP features for a ranking candidate.
• Equivalent to the preprocessing step in training.
2. Get "click probability" from the loaded pipeline and use this value for ranking.
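A sketch of scoring and ranking fresh candidates; `candidate_entities` reuses the same entity representation assumed during training, and the titles are illustrative:

```python
# Sketch: rank candidate articles by the pipeline's predicted click probability.
candidate_entities = ["nvidia gpu", "council budget"]  # preprocessed like the training data
titles = ["New GPU announced by Nvidia.", "Local council debates budget."]

scores = entity_pipeline.predict_proba(candidate_entities)[:, 1]
for prob, title in sorted(zip(scores, titles), reverse=True):
    print(f"{prob:.3f}  {title}")
```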
Demo time!
Further work
• More data and NLP/ML advancements
• Personalized recommendation
• Incremental and online learning
• Social signals and behaviors
© Copyright Microsoft Corporation. All rights reserved.
Thank you!

Editor's Notes

  • #7 10 min
  • #10 Can skip this slide
  • #12 Analyzes the grammatical structure of a sentence, establishing relationships between "head" words and the words which modify those heads. Dependency parsers can read various forms of plain text input and can output various analysis formats, including part-of-speech tagged text, phrase structure trees, and a grammatical relations (typed dependency) format. Dependency parsing can be used to solve various complex NLP problems like Named Entity Recognition, Relation Extraction, and translation.
  • #13 Locates and classifies named entities in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
  • #15 To discard noisy data
  • #16 Say something about the «chars» outlier – shows we have data to clean
  • #19 To describe and summarize the documents in a corpus
  • #20 To describe and summarize the documents in a corpus
  • #21 30 min
  • #22 30 min
  • #30 45 min
  • #32 1 hour