No more bad news!
News recommendation with ML and NLP.
Samia Khalid and Simon Lia-Jonassen
NTNU Cogito
March 7th, 2019
Contents
00 Introduction
01 Recommender architecture
02 Natural language processing
03 Recommendation model training
04 Demo and further work
Understand Content of news I read.
Learn my Interests over time.
Recommend news that interest me.
Introduction to News Recommendation
Implements three parts:
• Frontend and backend controllers.
• Feed provider and logging.
• NLP, ML and exploration workflow.
News recommender in a nutshell
https://github.com/s-j/goodnews
Content and feedback signals
Natural Language Processing
1. Text Processing
2. Clustering
3. Topic Extraction
Natural Language Processing and Exploration
1. Text Processing using spaCy
Leading open-source library for advanced NLP
a. Tokenization
b. Part Of Speech Tagging
c. Lemmatization
d. Stop words
1. Text Processing using spaCy
1. Recognizes a sentence and
assigns a syntactic structure to
it
• “Who is the AI research director?”
2. spaCy provides a built-in
visualizer
1. Text Processing using spaCy
Dependency Parsing
1. Locate and classify named entities in text into pre-defined categories
2. Can help to answer questions like:
• “Which people, companies and products is the user interested in?”
1. Text Processing using spaCy
Entity Recognition
1. Part-of-speech Tagging:
• assigns parts of speech to each token
such as noun, verb, adjective, etc.
2. spaCy uses a statistical model to
make a prediction of which tag or
label most likely applies in the
given context
1. Text Processing using spaCy
Distribution of POS Tags
1. Text Processing using spaCy
Word Probabilities: finding the most improbable words (noisy data)
1. Text Processing using spaCy
Analyzing top unigrams in clicked articles vs. all articles (considering only PROPN and NOUN tokens)
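The unigram comparison above can be sketched with the standard library alone. This is a minimal illustration, not the talk's notebook code: the `(token, tag)` pairs are hypothetical toy data standing in for tagger output, and `top_unigrams` is a name made up for this sketch.

```python
from collections import Counter

# Hypothetical toy data: (token, POS tag) pairs per article, in the shape a
# POS tagger could produce. These articles are NOT from the talk's dataset.
all_articles = [
    [("Microsoft", "PROPN"), ("buys", "VERB"), ("startup", "NOUN")],
    [("election", "NOUN"), ("results", "NOUN"), ("announced", "VERB")],
    [("Microsoft", "PROPN"), ("launches", "VERB"), ("product", "NOUN")],
]
# Assumption for the sketch: the first and third articles were clicked.
clicked_articles = [all_articles[0], all_articles[2]]

KEPT_TAGS = {"PROPN", "NOUN"}


def top_unigrams(articles, n=3):
    """Count lower-cased tokens whose POS tag is PROPN or NOUN."""
    counts = Counter(
        token.lower()
        for article in articles
        for token, tag in article
        if tag in KEPT_TAGS
    )
    return counts.most_common(n)


top_all = top_unigrams(all_articles)
top_clicked = top_unigrams(clicked_articles)
```

Comparing `top_clicked` with `top_all` shows which nouns and proper nouns are over-represented among clicked articles.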
1. Word Vectors as input:
• 300 dimensional vectors to represent words in
numerical form
2. K-Means needs the number of clusters as a
parameter:
• Try out different values until satisfied
• Can use silhouette score and distortion as metrics
3. PCA for visualizing the results in 2-D
2. K-Means Clustering
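A minimal pure-Python sketch of K-Means and the distortion metric, assuming toy 2-D points instead of 300-dimensional word vectors. The evenly spaced initialization is an assumption made here for reproducibility; library implementations typically use randomized schemes. The function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm; returns (labels, centroids, distortion)."""
    n = len(points)
    # Assumption: deterministic, evenly spaced initial centroids.
    centroids = [points[i * n // k] for i in range(k)]
    labels = [0] * n
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    # Distortion: sum of squared distances to the assigned centroid.
    distortion = sum(
        squared_distance(p, centroids[label]) for p, label in zip(points, labels)
    )
    return labels, centroids, distortion


# Hypothetical toy data: two well-separated groups of points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels_k1, _, distortion_k1 = kmeans(points, k=1)
labels_k2, _, distortion_k2 = kmeans(points, k=2)
```

Trying several values of `k` and watching where the distortion stops dropping sharply is one way to "try out different values until satisfied".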
2. K-Means Clustering
Note
Clusters for the full set of articles
1. LDA considers two things:
• Each document in a corpus is a weighted combination of several topics, e.g.,
doc1 -> 0.1 * finance + 0.2 * science + 0.5 * technology, …
• Each topic has its collection of representative keywords, e.g.,
technology -> ['computer', 'microsoft', 'google', ...]
3. Topic Modeling: LDA
2. The two probability distributions that the algorithm tries to approximate,
starting from a random initialization until convergence:
• For a given document, what is the distribution of topics that describe it?
• For a given topic, what is the distribution of its words, i.e., how important (probable) is
each word in defining the topic?
3. Topic Modeling: LDA
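The two distributions can be made concrete with a small worked example. This sketch does not run LDA inference; it only shows how a document-topic distribution and topic-word distributions combine into a word probability for a document, p(word | doc) = sum over topics of p(topic | doc) * p(word | topic). All numbers and names are hypothetical.

```python
# Hypothetical document-topic distribution (sums to 1).
doc_topics = {"finance": 0.2, "science": 0.3, "technology": 0.5}

# Hypothetical topic-word distributions (each sums to 1).
topic_words = {
    "finance": {"bank": 0.6, "stock": 0.4},
    "science": {"research": 0.7, "computer": 0.3},
    "technology": {"computer": 0.5, "microsoft": 0.3, "google": 0.2},
}


def word_probability(word, doc_topics, topic_words):
    """p(word | doc) as a topic-weighted mixture of p(word | topic)."""
    return sum(
        weight * topic_words[topic].get(word, 0.0)
        for topic, weight in doc_topics.items()
    )


p_computer = word_probability("computer", doc_topics, topic_words)

# The mixture is itself a probability distribution over the vocabulary.
vocabulary = {w for words in topic_words.values() for w in words}
total = sum(word_probability(w, doc_topics, topic_words) for w in vocabulary)
```

Here "computer" gets probability mass from both the science and the technology topic, which is why a word can be representative of more than one topic.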
3. Topic Modeling: LDA
Interactive Topics
Visualization with
pyLDAvis
Recommendation Model Training
1. Join requests and feedback logs.
• Alternative: use a third-party dataset.
2. Use #clicks > 0 as a positive label.
• Alternative 1: use #clicks / #views
• Alternative 2: use click order
• Alternative 3: get explicit feedback
Model training
Preprocessing
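A minimal sketch of the join and labeling step, assuming hypothetical log records. The field names (`request_id`, `item_id`, `event`) and the per-item aggregation are assumptions for illustration; the actual log schema is not shown in the slides.

```python
from collections import defaultdict

# Hypothetical request log: which items were shown in which request.
requests = [
    {"request_id": "r1", "item_id": "a"},
    {"request_id": "r1", "item_id": "b"},
    {"request_id": "r2", "item_id": "a"},
]
# Hypothetical feedback log: user events on shown items.
feedback = [
    {"request_id": "r1", "item_id": "a", "event": "click"},
    {"request_id": "r2", "item_id": "a", "event": "click"},
]

# Join on (request_id, item_id) so clicks only count for items that were shown.
shown = {(row["request_id"], row["item_id"]) for row in requests}

views = defaultdict(int)
clicks = defaultdict(int)
for row in requests:
    views[row["item_id"]] += 1
for row in feedback:
    key = (row["request_id"], row["item_id"])
    if row["event"] == "click" and key in shown:
        clicks[row["item_id"]] += 1

# Positive label when an item has at least one click.
labels = {item: int(clicks[item] > 0) for item in views}
# Alternative 1 from the slide: click-through ratio (#clicks / #views).
ctr = {item: clicks[item] / views[item] for item in views}
```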
1. Use title and description to get:
• A bag of named entities such as person or org
(using spaCy).
• A bag of key terms from the semantic network
(using Textacy).
• A normalized sum over key term embedding
vectors found in GoogleNews word2vec dataset.
2. Hold out 20% of items for testing.
Model training
NLP features and Train/Test split
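A minimal sketch of the embedding feature and the hold-out split. It assumes that "normalized sum" means the L2-normalized sum of the key term vectors (a mean would be another reading), and it uses hypothetical 3-dimensional vectors in place of the 300-dimensional GoogleNews word2vec vectors. The function names are made up for this sketch.

```python
import math
import random

# Hypothetical 3-dimensional embeddings standing in for word2vec vectors.
embeddings = {
    "microsoft": [1.0, 0.0, 0.0],
    "cloud": [0.0, 2.0, 0.0],
    "election": [0.0, 0.0, 3.0],
}


def embed(key_terms, table):
    """L2-normalized sum of the vectors of key terms found in the table."""
    found = [table[term] for term in key_terms if term in table]
    if not found:
        return None  # none of the key terms has an embedding
    summed = [sum(dim) for dim in zip(*found)]
    norm = math.sqrt(sum(x * x for x in summed))
    return [x / norm for x in summed] if norm > 0 else summed


def train_test_split(items, test_fraction=0.2, seed=42):
    """Shuffle with a fixed seed and hold out a fraction for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


vector = embed(["microsoft", "cloud", "unknown-term"], embeddings)
train_items, test_items = train_test_split(range(10))
```

Terms missing from the embedding table are skipped, so an article with no known key terms yields no vector and has to be handled separately.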
1. One-hot-encode entities to get a sparse vector.
2. Compensate popularity skew using
Inverse Document Frequency (IDF).
3. Train a classifier using Gradient Boosting
Decision Trees (GBDT).
Note that we have a small and very skewed,
noisy dataset so we are not expecting good
classification performance.
Model training
Pipeline based on entities
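The first two steps of this pipeline can be sketched as follows. The entity sets are hypothetical, the vectors are dense lists for readability (a real pipeline would keep them sparse), and the smoothed IDF formula ln((1 + N) / (1 + df)) + 1 is one common variant chosen here as an assumption. The GBDT step is omitted.

```python
import math

# Hypothetical bags of named entities, one set per article.
docs = [
    {"Microsoft", "Satya Nadella"},
    {"Microsoft", "NTNU"},
    {"Microsoft"},
]

vocabulary = sorted(set().union(*docs))
n_docs = len(docs)

# Document frequency: in how many articles does each entity occur?
doc_freq = {entity: sum(entity in doc for doc in docs) for entity in vocabulary}

# Smoothed IDF: frequent (popular) entities get a lower weight.
idf = {
    entity: math.log((1 + n_docs) / (1 + doc_freq[entity])) + 1
    for entity in vocabulary
}


def encode(entities):
    """One-hot encode a bag of entities, scaling each hit by its IDF."""
    return [idf[entity] if entity in entities else 0.0 for entity in vocabulary]


encoded = encode(docs[1])  # vector for the article mentioning Microsoft and NTNU
```

An entity that appears in every article, like "Microsoft" here, ends up with the smallest weight, which is the popularity compensation the slide refers to.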
1. Hash-merge features into 100 buckets.
2. Train a GBDT classifier.
Model training
Pipeline based on semantic key terms
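A minimal sketch of hash-merging key terms into 100 buckets (the hashing trick). It assumes an MD5-based hash so that bucket assignments are stable across runs, since Python's built-in `hash` for strings is salted per process. The function names are hypothetical, and the GBDT step is omitted.

```python
import hashlib

N_BUCKETS = 100


def bucket(term, n_buckets=N_BUCKETS):
    """Map a term to a stable bucket index in [0, n_buckets)."""
    digest = hashlib.md5(term.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def hash_features(key_terms, n_buckets=N_BUCKETS):
    """Fixed-length count vector; terms sharing a bucket are merged."""
    vector = [0] * n_buckets
    for term in key_terms:
        vector[bucket(term, n_buckets)] += 1
    return vector


# Hypothetical key terms for one article.
features = hash_features(["machine learning", "news", "recommendation"])
```

The vector length stays fixed no matter how many distinct key terms appear in the data, at the cost of occasional collisions between unrelated terms.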
Just use logistic regression right away.
• This gives us a more relaxed prediction with a
much higher number of true positives but
also false positives.
Model training
Pipeline based on embedding vectors
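A minimal pure-Python sketch of logistic regression trained with stochastic gradient descent on dense vectors. The slides do not show which implementation was used; the 2-dimensional toy vectors, learning rate, and epoch count below are assumptions for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Predicted click probability for one feature vector."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def train_logistic_regression(X, y, learning_rate=0.5, epochs=500):
    """Stochastic gradient descent on the log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(X, y):
            error = predict_proba(features, weights, bias) - label
            weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


# Hypothetical toy embeddings: clicked articles lean toward the first dimension.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
y = [1, 1, 0, 0]

weights, bias = train_logistic_regression(X, y)
p_clicked_like = predict_proba([1.0, 0.0], weights, bias)
p_ignored_like = predict_proba([0.0, 1.0], weights, bias)
```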
Model training
Stacking and beyond
It is possible to combine features and models...
1. Get NLP features for a ranking candidate.
• Equivalent to the preprocessing step in training.
2. Get "click probability" from the loaded pipeline
and use this value for ranking.
Model application
Using a trained model
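The two application steps can be sketched as a small ranking function. `featurize` and `click_probability` below are hypothetical stand-ins for the NLP preprocessing and the loaded pipeline; they only illustrate the flow and are not the project's code.

```python
import math

# Hypothetical stand-in for the user's interests.
INTERESTS = {"microsoft", "ai"}


def featurize(article):
    """Stand-in for the NLP step: count title tokens matching the interests."""
    tokens = article["title"].lower().split()
    return sum(token in INTERESTS for token in tokens)


def click_probability(n_matches):
    """Stand-in for the trained pipeline: a fixed logistic curve."""
    return 1.0 / (1.0 + math.exp(-(n_matches - 1)))


def rank_candidates(candidates, featurize, click_probability):
    """Score every candidate and sort by descending click probability."""
    scored = [(click_probability(featurize(c)), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


candidates = [
    {"title": "Local weather update"},
    {"title": "Microsoft AI research"},
    {"title": "AI in healthcare"},
]
ranked = rank_candidates(candidates, featurize, click_probability)
ranked_titles = [article["title"] for _, article in ranked]
```

Whatever feature extraction was used at training time has to be reused unchanged here, otherwise the scores are not comparable.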
Demo time!
• More data and NLP/ML advancements
• Personalized recommendation
• Incremental and online learning
• Social signals and behaviors
Further work
© Copyright Microsoft Corporation. All rights reserved.
Thank you!

HCL Notes and Domino License Cost Reduction in the World of DLAU
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!Finale of the Year: Apply for Next One!
Finale of the Year: Apply for Next One!
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 

No more bad news!

  • 1. No more bad news! News recommendation with ML and NLP. Samia Khalid and Simon Lia-Jonassen NTNU Cogito March 7th, 2019
  • 2. Contents 00 Introduction 01 Recommender architecture 02 Natural language processing 03 Recommendation model training 04 Demo and further work
  • 3. Understand Content of news I read. Learn my Interests over time. Recommend news that interests me. Introduction to News Recommendation
  • 4. Implements three parts: • Frontend and backend controllers. • Feed provider and logging. • NLP, ML and exploration workflow. News recommender in a nutshell https://github.com/s-j/goodnews
  • 8. 1. Text Processing 2. Clustering 3. Topic Extraction Natural Language Processing and Exploration
  • 9. 1. Text Processing using spaCy Leading open-source library for advanced NLP
  • 10. a. Tokenization b. Part Of Speech Tagging c. Lemmatization d. Stop words 1. Text Processing using spaCy
  • 11. 1. Recognizes a sentence and assigns a syntactic structure to it • “Who is the AI research director?” 2. spaCy provides a built-in visualizer 1. Text Processing using spaCy Dependency Parsing
  • 12. 1. Locate and classify named entities in text into pre-defined categories 2. Can help to answer questions like: • “Which people, companies and products is the user interested in?” 1. Text Processing using spaCy Entity Recognition
  • 13. 1. Part-of-speech Tagging: • assigns parts of speech to each token such as noun, verb, adjective, etc. 2. spaCy uses a statistical model to make a prediction of which tag or label most likely applies in the given context 1. Text Processing using spaCy Distribution of POS Tags
  • 14. 1. Text Processing using spaCy Word Probabilities: finding the most improbable words (noisy data)
  • 15. 1. Text Processing using spaCy Analyzing top unigrams in clicked articles vs all articles (considering only PROPN and NOUN tags)
  • 16. 1. Word Vectors as input: • 300-dimensional vectors to represent words in numerical form 2. K-Means needs the number of clusters as a parameter: • Try out different values until satisfied • Can use silhouette score and distortion as metrics 3. PCA for visualizing the results in 2-D 2. K-Means Clustering
  • 17. 2. K-Means Clustering Note Clusters for the full set of articles
  • 18. 1. LDA considers two things: • Each document in a corpus is a weighted combination of several topics, e.g., doc1 -> 0.1 * finance + 0.2 * science + 0.5 * technology, … • Each topic has its collection of representative keywords, e.g., technology -> ['computer', 'microsoft', 'google', ...] 3. Topic Modeling: LDA
  • 19. 2. The two probability distributions that the algorithm tries to approximate, starting from a random initialization until convergence: • For a given document, what is the distribution of topics that describe it? • For a given topic, what is the distribution of its words, i.e., the importance (probability) of each word in defining the topic? 3. Topic Modeling: LDA
  • 20. 3. Topic Modeling: LDA Interactive Topics Visualization with pyLDAvis
  • 22. 1. Join requests and feedback logs. • Alternative: use a third-party dataset. 2. Use #clicks > 0 as a positive label. • Alternative 1: use #clicks / #views • Alternative 2: use click order • Alternative 3: get explicit feedback Model training Preprocessing
  • 23. 1. Use title and description to get: • A bag of named entities such as person or org (using spaCy). • A bag of key terms from the semantic network (using Textacy). • A normalized sum over key term embedding vectors found in GoogleNews word2vec dataset. 2. Hold out 20% of items for testing. Model training NLP features and Train/Test split
  • 24. 1. One-hot-encode entities to get a sparse vector. 2. Compensate popularity skew using Inverse Document Frequency (IDF). 3. Train a classifier using Gradient Boosting Decision Trees (GBDT). Note that we have a small and very skewed, noisy dataset so we are not expecting good classification performance. Model training Pipeline based on entities
  • 25. 1. Hash-merge features into 100 buckets. 2. Train a GBDT classifier. Model training Pipeline based on semantic key terms
  • 26. Just use logistic regression right away. • This gives us a more relaxed prediction with a much higher number of true positives but also false positives. Model training Pipeline based on embedding vectors
  • 27. Model training Stacking and beyond It is possible to combine features and models...
  • 28. 1. Get NLP features for a ranking candidate. • Equivalent to the preprocessing step in training. 2. Get "click probability" from the loaded pipeline and use this value for ranking. Model application Using a trained model
  • 30. • More data and NLP/ML advancements • Personalized recommendation • Incremental and online learning • Social signals and behaviors Further work
  • 31. © Copyright Microsoft Corporation. All rights reserved. Thank you!
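
The preprocessing steps on slides 22 and 23 (join requests with feedback logs, use #clicks > 0 as the positive label, hold out 20% of items) can be sketched as below. This is a minimal illustration, not the code from the goodnews repository: the log schema (`item_ids`, `item_id`, `type`) and the helper names `build_labeled_items` and `train_test_split` are assumptions made for the example.

```python
import random
from collections import Counter


def build_labeled_items(requests, feedback):
    """Join served items with click events; label 1 if #clicks > 0, else 0."""
    # Count clicks per item from the (hypothetical) feedback log.
    clicks = Counter(e["item_id"] for e in feedback if e["type"] == "click")
    labels = {}
    for request in requests:
        for item_id in request["item_ids"]:
            labels[item_id] = 1 if clicks[item_id] > 0 else 0
    return sorted(labels.items())


def train_test_split(items, test_fraction=0.2, seed=42):
    """Shuffle deterministically and hold out a fraction of items for testing."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Toy logs, invented for illustration only.
requests = [
    {"request_id": "r1", "item_ids": ["a", "b", "c"]},
    {"request_id": "r2", "item_ids": ["c", "d", "e"]},
]
feedback = [
    {"request_id": "r1", "item_id": "a", "type": "click"},
    {"request_id": "r2", "item_id": "c", "type": "click"},
    {"request_id": "r2", "item_id": "c", "type": "click"},
]

labeled = build_labeled_items(requests, feedback)
train, test = train_test_split(labeled)
print(labeled)
print(len(train), len(test))
```

The alternatives listed on slide 22 (#clicks / #views, click order, explicit feedback) would only change how the label is computed inside `build_labeled_items`.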

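The two feature steps on slides 24 and 25 (IDF weighting of one-hot entity vectors, and hash-merging key terms into 100 buckets) can be illustrated with the standard library alone. This is a sketch under assumptions: the smoothed IDF formula and the CRC32 hash are choices made for this example, and the slides do not say which variants the authors used. The function names are hypothetical.

```python
import math
import zlib
from collections import Counter

N_BUCKETS = 100  # bucket count taken from slide 25


def idf_weights(docs):
    """Smoothed IDF per entity: log((1 + N) / (1 + df)) + 1 (an assumed variant)."""
    n_docs = len(docs)
    doc_freq = Counter(entity for doc in docs for entity in set(doc))
    return {e: math.log((1 + n_docs) / (1 + df)) + 1.0 for e, df in doc_freq.items()}


def entity_vector(doc, weights):
    """Sparse one-hot vector (entity -> weight), scaled by IDF to offset popularity skew."""
    return {e: weights[e] for e in set(doc) if e in weights}


def hash_features(terms, n_buckets=N_BUCKETS):
    """Merge an open-ended vocabulary of key terms into a fixed-size count vector."""
    vector = [0.0] * n_buckets
    for term in terms:
        # CRC32 is deterministic across runs, unlike Python's built-in hash().
        vector[zlib.crc32(term.encode("utf-8")) % n_buckets] += 1.0
    return vector


# Toy bags of entities, invented for illustration only.
docs = [["Microsoft", "Oslo"], ["Microsoft", "NTNU"], ["Microsoft"]]
weights = idf_weights(docs)
sparse = entity_vector(docs[0], weights)
hashed = hash_features(["computer", "google", "computer"])
print(sparse)
print(len(hashed), sum(hashed))
```

An entity that appears in every document ("Microsoft" here) gets the lowest weight, which is the popularity compensation slide 24 refers to. Either vector could then be passed to a classifier such as the GBDT mentioned on the slides.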
Editor's Notes

  1. 10 min
  2. Can skip this slide
  3. Analyzes the grammatical structure of a sentence, establishing relationships between "head" words and words which modify those heads. Dependency parsers can read various forms of plain text input and can output various analysis formats, including part-of-speech tagged text, phrase structure trees, and a grammatical relations (typed dependency) format. Dependency parsing can be used to solve various complex NLP problems such as named entity recognition, relation extraction, and translation.
  4. Locates and classifies named entities in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
  5. To discard noisy data
  6. Say something about the <chars> outlier: it shows we have data to clean
  7. To describe and summarize the documents in a corpus
  8. To describe and summarize the documents in a corpus
  9. 30 min
  10. 30 min
  11. 45 min
  12. 1 hour