Presented at Dato Conf, SF
Personalization @ StumbleUpon
Recommending Semantic Nearest Neighbors
using Storm and Dato
OVERVIEW	
  
StumbleUpon – Choose Topics, Discover Content
Bookmark, Organize and Share
Recommendations – Matching User With Content
1. Understand User
2. Understand Content
3. Recommend
4. Get Feedback
Labels in the slide graphic: TELEVISION, MUSIC, TRENDING, FRIENDS, LIKEMINDED USERS, EXPERTS, ANIMALS, DOGS, PHOTOGRAPHY, MOVIES, ARTS, HUMOR
Architecture Overview
[Diagram] Components: Ingestion Queue, Discovery Queue, Content Analysis, MySQL, Recommendation Engine, Cold Start Model, HBase, ES, New Content, Event Processors, Rec Models, Event Queue
Stages: 1. INGESTION, 2. CHECK QUALITY, 3. OFFLINE COMPUTATIONS, 4. RECS, 5. ONLINE COMPUTATION
CONTEXTUAL RECS
Problem
•  Problem:
–  Recommend items based on the topics discovered in the page a user is currently on
•  Strategies:
–  Find semantically similar items
–  Find items that dig further into a specific topic
–  Find items that dig further into a broader topic
–  Others…
Constraints/SLAs
•  Very quick “Ingestion to Recommendation” turnaround time (x10 seconds)
–  Adopt stream processing with at-least-once processing guarantees
–  Build idempotent subsystems
–  Capitalize on non-linearity wherever possible
•  Low-latency retrieval of recs (x10 ms)
–  Pre-compute recs
–  Retrieve recs in Θ(1) time
•  Horizontally scalable design
–  Utilize distributed processing systems/data stores
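With at-least-once delivery the same message can arrive more than once, which is why the slide pairs it with idempotent subsystems. The sketch below illustrates that pairing, plus constant-time retrieval of pre-computed recs. It is only an illustration: the `RecStore` class, its in-memory dictionary, and the example URLs are hypothetical stand-ins, not the stores used in the talk.

```python
import hashlib


class RecStore:
    """In-memory stand-in for a key-value store of pre-computed recs (hypothetical)."""

    def __init__(self):
        self._recs = {}   # url key -> list of recommended URLs
        self.writes = 0   # counts how often stored state actually changed

    @staticmethod
    def url_key(url):
        # Deterministic key: the same URL always maps to the same row.
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def put(self, url, recs):
        """Idempotent upsert: replaying the same message leaves state unchanged."""
        key = self.url_key(url)
        new_value = list(recs)
        if self._recs.get(key) != new_value:
            self._recs[key] = new_value
            self.writes += 1

    def get(self, url):
        """Average-case constant-time lookup of pre-computed recs."""
        return self._recs.get(self.url_key(url), [])


store = RecStore()
message = ("http://example.com/a", ["http://example.com/b", "http://example.com/c"])

# At-least-once delivery may hand the same message to the store several times.
for _ in range(3):
    store.put(*message)

print(store.writes)                       # state changed only once
print(store.get("http://example.com/a"))  # pre-computed recs, no work at read time
```

Because the write is keyed by a deterministic function of the URL and overwrites rather than appends, a replayed message has no additional effect, and reads never trigger computation.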
Approach Overview
•  (Offline) Utilize a high quality dataset to build a topic model
•  (Online) For each URL ingested,
–  Extract text features that summarize the documents
•  Use pre-built topic models for
–  Filtering noisy keywords
–  Finding general topics
–  Finding specific topics
–  Computing topic hashes
•  Compute similarity/relevance
•  Store for quick retrieval
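The last steps of the approach (topic hashes, similarity, retrieval) can be sketched as follows. The slides do not define the hash or the similarity measure, so both are assumptions here: `topic_hash` takes the ids of the k most probable topics, and similarity is cosine similarity between per-document topic distributions. The document ids and distributions are made-up toy data.

```python
import math


def cosine(p, q):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def topic_hash(dist, k=2):
    """Hypothetical hash: ids of the k most probable topics, sorted."""
    top = sorted(range(len(dist)), key=lambda t: dist[t], reverse=True)[:k]
    return tuple(sorted(top))


def similar_items(query_id, docs, k=2):
    """Rank documents that share the query's topic hash by cosine similarity."""
    query = docs[query_id]
    bucket = topic_hash(query, k)
    candidates = [d for d in docs
                  if d != query_id and topic_hash(docs[d], k) == bucket]
    return sorted(candidates, key=lambda d: cosine(query, docs[d]), reverse=True)


# Toy topic distributions over four topics (illustrative values only).
docs = {
    "dogs-1": [0.70, 0.20, 0.05, 0.05],
    "dogs-2": [0.60, 0.30, 0.05, 0.05],
    "music-1": [0.05, 0.05, 0.60, 0.30],
}

print(similar_items("dogs-1", docs))
```

The hash acts as a cheap candidate filter, so the more expensive similarity score is only computed for documents that already share dominant topics with the query.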
Feature Extraction
[Diagram] Steps: Wikipedia Annotation [2], Detect Language, Parse, Noun Chunking [1], Cleanup, Remove Boilerplate, Coalesce Tags, Compute Tag Score
[1] Manning, Christopher D., et al. “The Stanford CoreNLP natural language processing toolkit.” Proceedings of the 52nd Annual Meeting of the ACL: System Demonstrations. 2014.
[2] Milne, David, and Ian H. Witten. “An open-source toolkit for mining Wikipedia.” Artificial Intelligence 194 (2013): 222-239.
Topic Modeling
[3] Blei, David M., Andrew Y. Ng, and Michael I. Jordan. “Latent Dirichlet allocation.” Journal of Machine Learning Research 3 (2003): 993-1022.
Image courtesy: http://parkcu.com/blog/latent-dirichlet-allocation/
LDA with Topic Associations
•  Similar to constrained clustering, LDA can be run with topic associations [4].
•  Perform hierarchical/agglomerative clustering on SU’s taxonomy to obtain K=75 clusters of topic sets.
•  Use the topic sets as possible labels for the latent topic z
•  The words themselves are not learnt for the specific topic they have been mapped to.
[4] Andrzejewski, David, and Xiaojin Zhu. “Latent Dirichlet Allocation with topic-in-set knowledge.” Proceedings of the NAACL HLT 2009 Workshop on Semi-Supervised Learning for Natural Language Processing. Association for Computational Linguistics, 2009.
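To make the idea of topic associations concrete, here is a minimal collapsed Gibbs sampler in which associated words may only be assigned topics from a permitted set. It is a sketch under stated assumptions, not the implementation used in the talk: constraints are attached to word types (a simplification of per-token constraints), they are hard rather than soft, and the function name, hyperparameters, and toy corpus are all hypothetical.

```python
import random


def constrained_lda(docs, allowed, num_topics, iters=200,
                    alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA with topic-in-set constraints.

    docs:    list of documents, each a list of word strings
    allowed: dict word -> set of permitted topic ids (absent = any topic)
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    all_topics = list(range(num_topics))

    doc_topic = [[0] * num_topics for _ in docs]   # tokens per (doc, topic)
    topic_word = [dict() for _ in all_topics]      # tokens per (topic, word)
    topic_total = [0] * num_topics                 # tokens per topic
    z = []                                         # topic of every token

    # Initialise each token with a random topic from its permitted set.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.choice(sorted(allowed.get(w, all_topics)))
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] = topic_word[k].get(w, 0) + 1
            topic_total[k] += 1
        z.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # Usual collapsed Gibbs weights, restricted to permitted topics.
                candidates = sorted(allowed.get(w, all_topics))
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t].get(w, 0) + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in candidates
                ]
                k = rng.choices(candidates, weights=weights)[0]

                z[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] = topic_word[k].get(w, 0) + 1
                topic_total[k] += 1

    return z, topic_word


docs = [
    ["dog", "puppy", "leash", "dog"],
    ["guitar", "song", "album", "guitar"],
    ["puppy", "dog", "song"],
]
allowed = {"dog": {0}, "guitar": {1}}   # hypothetical topic associations
z, topic_word = constrained_lda(docs, allowed, num_topics=2)
```

Words with an association never leave their permitted topics, so they anchor those topics, while unconstrained words such as "puppy" or "song" are free to settle wherever their co-occurrences pull them.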
Example
Topic Associations (Pre LDA)
Top Words in a Topic (Post LDA)
Using the Topic Models
•  Choose relevant topics
–  Rank/Threshold by to get 
•  Filtering noisy tags
–  Rank/Threshold by 
•  Getting specific words
–  Rank/Threshold by 
•  Getting general words
–  Rank/Threshold by 
GraphLab Create I
Image Courtesy: https://dato.com/products/create/technology.html
GraphLab Create II
•  Allows fast prototyping on a single machine
–  Python interface to a C++ backend
–  Scalable data structures (tabular and graphs) made available
–  Out-of-core implementation of standard ML algorithms
–  Makes basic data engineering and visualization tasks easy
•  Easy to deploy microservices (predictive services) around models built using GraphLab Create/pandas/scikit-learn.
–  RESTful API hosted over a Tornado server
–  Distributed cache
–  Amazon CloudWatch for monitoring (for AWS deploys)
•  (Con) Debugging the service can be difficult
Storm Basics [5]
•  Distributed Realtime Computation System
–  Fault Tolerant, Scalable and Guaranteed Processing
–  Master --> ZooKeeper --> Worker Nodes
•  Workers
–  Spout: stream source
–  Bolt: computation unit
•  Data Flow
–  Stream: unbounded sequence of tuples
–  Topology: a network of spouts and bolts
[5] http://www.slideshare.net/ptgoetz/cassandra-and-storm-at-health-market-sceince
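The spout/bolt vocabulary can be illustrated with plain Python generators. This toy does not use Storm's API and has no distribution, acknowledgement, or fault tolerance; it only shows how tuples flow from a source through chained computation units. The function names, the word-count "tagging", and the sample pages are all hypothetical.

```python
from collections import Counter


def url_spout(urls):
    """Spout: a stream source that emits one tuple per URL."""
    for url in urls:
        yield (url,)


def fetch_bolt(stream, pages):
    """Bolt: turns (url,) tuples into (url, text) tuples."""
    for (url,) in stream:
        yield (url, pages.get(url, ""))


def tag_bolt(stream, top_n=2):
    """Bolt: turns (url, text) tuples into (url, tags) tuples."""
    for url, text in stream:
        counts = Counter(text.lower().split())
        yield (url, [word for word, _ in counts.most_common(top_n)])


# Stand-in for fetched page text (illustrative data only).
pages = {
    "http://example.com/dogs": "dogs dogs puppy dogs puppy leash",
    "http://example.com/music": "guitar song guitar guitar song album",
}

# "Topology": the spout feeds the first bolt, which feeds the second.
topology = tag_bolt(fetch_bolt(url_spout(pages), pages))
results = dict(topology)
print(results)
```

Each stage consumes and emits tuples lazily, which mirrors the idea of an unbounded stream: nothing in the chain needs to know how many URLs will eventually arrive.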
Architecture
[Diagram] SIMILAR ITEM TOPOLOGY
Labels: URLs, Kafka Broker, Webpage Surveyor Service, TMS*, Models, Fetch Page HTML, To S3, HTML to Text, Text to (Tags, Concepts), Merge, Build Topic Model, Get Similar Items
1. Topic Model Query
2.a. Load ES
2.b. Get Similar Items
3. Load Similar Items for quick lookup
*TMS: Topic Model Service
Some Numbers I
•  Number of Storm Workers: 3
•  Number of ES Nodes: 3
•  Training:
–  Document Size: 2M
–  Vocabulary: 400K
–  Time: ~8s/iteration (16 cores)
•  Predictive service performance:
–  Peak requests handled: 200/min
–  Avg response time: 110 ms
•  URL turnaround time: 10s
•  Number of URLs ingested: 70/min
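As a rough back-of-the-envelope reading of these figures, Little's law (average items in a system = arrival rate × time in system) gives an estimate of how much work is in flight at once. Pairing the peak request rate with the average response time is an assumption made here for illustration, so the results should be read as ballpark values only.

```python
# Figures quoted on the slide.
peak_requests_per_min = 200   # predictive service, peak load
avg_response_s = 0.110        # predictive service, average response time
urls_ingested_per_min = 70    # ingestion rate
url_turnaround_s = 10         # ingestion-to-recommendation time


def in_flight(rate_per_min, seconds_in_system):
    """Little's law: average items in system = arrival rate x time in system."""
    return (rate_per_min / 60.0) * seconds_in_system


requests_in_flight = in_flight(peak_requests_per_min, avg_response_s)
urls_in_flight = in_flight(urls_ingested_per_min, url_turnaround_s)

print(f"requests in flight: {requests_in_flight:.2f}")
print(f"URLs in flight:     {urls_in_flight:.2f}")
```

Under these assumptions the predictive service averages about 0.37 concurrent requests even at peak, while roughly 11.67 URLs are somewhere in the pipeline at any moment.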
Some Numbers II
THANKS. QUESTIONS?
