Presented at Dato Conf, SF
Personalization @ StumbleUpon
Recommending Semantic Nearest Neighbors using Storm and Dato
OVERVIEW
StumbleUpon – Choose Topics, Discover Content
Bookmark, Organize and Share
Recommendations – Matching User With Content
1. Understand User
2. Understand Content
3. Recommend
4. Get Feedback
[Figure labels: TELEVISION, MUSIC, TRENDING, FRIENDS, LIKE-MINDED USERS, EXPERTS, ANIMALS, DOGS, PHOTOGRAPHY, MOVIES, ARTS, HUMOR]
Architecture Overview
[Diagram] Components: Ingestion Queue, Discovery Queue, Event Queue, New Content, Content Analysis, Cold Start Model, Event Processors, Recommendation Engine, Rec Models, MySQL, HBase, ES
Stages: 1. Ingestion, 2. Check Quality, 3. Offline Computations, 4. Recs, 5. Online Computation
CONTEXTUAL RECS
Problem
•  Problem:
–  Recommend items based on the topics discovered in the page the user is currently on
•  Strategies:
–  Find semantically similar items
–  Find items that dig further into a specific topic
–  Find items that dig further into a broader topic
–  Others…
Constraints/SLAs
•  Very quick “Ingestion to Recommendation” turnaround time (on the order of 10 seconds)
–  Adopt stream processing with at-least-once processing guarantees
–  Build idempotent subsystems
–  Capitalize on non-linearity wherever possible
•  Low-latency retrieval of recs (on the order of 10 ms)
–  Pre-compute recs
–  Retrieve recs in Θ(1) time
•  Horizontally scalable design
–  Utilize distributed processing systems/data stores
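The pairing of at-least-once delivery with idempotent subsystems and pre-computed recs can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the `RecStore` class, the per-URL version number, and the in-memory dictionaries stand in for whatever data store and message format the real system used.

```python
from typing import Dict, List


class RecStore:
    """Hypothetical in-memory stand-in for a key-value store of precomputed recs."""

    def __init__(self) -> None:
        self._recs: Dict[str, List[str]] = {}
        self._versions: Dict[str, int] = {}

    def upsert(self, url: str, version: int, recs: List[str]) -> bool:
        """Store recs for url unless an equal or newer version is already stored.

        Replaying the same message leaves the store unchanged, so the write
        is idempotent and safe when a message is delivered more than once.
        """
        if self._versions.get(url, -1) >= version:
            return False
        self._versions[url] = version
        self._recs[url] = list(recs)
        return True

    def get(self, url: str) -> List[str]:
        """Hash lookup of precomputed recs (constant time on average)."""
        return self._recs.get(url, [])


store = RecStore()
message = ("http://example.com/a", 1,
           ["http://example.com/b", "http://example.com/c"])
first = store.upsert(*message)    # first delivery is applied
replay = store.upsert(*message)   # duplicate delivery is ignored
```

Because the serving path is a single keyed read, the expensive similarity work stays on the ingestion side of the system.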
Approach Overview
•  (Offline) Utilize a high-quality dataset to build a topic model
•  (Online) For each URL ingested,
–  Extract text features that summarize the document
•  Use pre-built topic models for
–  Filtering noisy keywords
–  Finding general topics
–  Finding specific topics
–  Computing topic hashes
•  Compute similarity/relevance
•  Store for quick retrieval
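One way to picture the "topic hash" and "compute similarity" steps is the sketch below. The slides do not define either, so this is an assumption: the hash is taken to be the sorted ids of a document's top-k topics, used as a bucket key, and similarity is cosine similarity between topic-weight vectors. The function names and the toy corpus are hypothetical.

```python
import math
from typing import Dict, List, Tuple

TopicVector = Dict[int, float]  # topic id -> weight


def topic_hash(vec: TopicVector, k: int = 2) -> Tuple[int, ...]:
    """Bucket key: ids of the k heaviest topics, sorted (assumed definition)."""
    top = sorted(vec, key=lambda t: (-vec[t], t))[:k]
    return tuple(sorted(top))


def cosine(a: TopicVector, b: TopicVector) -> float:
    """Cosine similarity between two sparse topic vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similar_items(query: TopicVector, corpus: Dict[str, TopicVector],
                  k: int = 2) -> List[str]:
    """Rank items that share the query's topic hash by cosine similarity."""
    bucket = topic_hash(query, k)
    scored = [(cosine(query, vec), url) for url, vec in corpus.items()
              if topic_hash(vec, k) == bucket]
    return [url for _, url in sorted(scored, key=lambda p: (-p[0], p[1]))]


corpus = {
    "dogs-1": {3: 0.7, 5: 0.2, 9: 0.1},
    "dogs-2": {3: 0.5, 5: 0.4, 1: 0.1},
    "film-1": {7: 0.8, 2: 0.2},
}
query = {3: 0.6, 5: 0.3, 2: 0.1}
ranked = similar_items(query, corpus)
```

Restricting the comparison to one bucket trades some recall for a much smaller candidate set, which matters when every ingested URL triggers a similarity query.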
Feature Extraction
[Diagram: pipeline stages Remove Boilerplate, Detect Language, Parse, Noun Chunking [1], Wikipedia Annotation [2], Cleanup, Coalesce Tags, Compute Tag Score]
[1] Manning, Christopher D., et al. "The Stanford CoreNLP natural language processing toolkit." Proceedings of the 52nd Annual Meeting of the ACL: System Demonstrations. 2014.
[2] Milne, David, and Ian H. Witten. "An open-source toolkit for mining Wikipedia." Artificial Intelligence 194 (2013): 222-239.
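A toy version of a few of these stages is sketched below to make the data flow concrete. It is not the pipeline from the talk: regular expressions stand in for the boilerplate remover and for noun chunking, a crude plural strip stands in for tag coalescing, and the tag score is assumed to be relative frequency. All function names and the stopword list are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, List

STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def remove_boilerplate(html: str) -> str:
    """Drop script/style/nav/footer blocks, then strip the remaining tags."""
    html = re.sub(r"(?is)<(script|style|nav|footer)\b.*?</\1>", " ", html)
    return re.sub(r"(?s)<[^>]+>", " ", html)


def candidate_tags(text: str) -> List[str]:
    """Stand-in for noun chunking: every alphabetic token of 2+ characters."""
    return re.findall(r"[A-Za-z][A-Za-z\-]+", text)


def coalesce(tags: List[str]) -> List[str]:
    """Merge surface variants: lowercase, drop stopwords, strip a plural 's'."""
    merged = []
    for tag in tags:
        tag = tag.lower()
        if tag in STOPWORDS:
            continue
        if tag.endswith("s") and len(tag) > 3:
            tag = tag[:-1]
        merged.append(tag)
    return merged


def tag_scores(tags: List[str]) -> Dict[str, float]:
    """Assumed scoring: each tag's share of all tag occurrences."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


html = ("<html><nav>Home</nav><p>Dogs love parks. A dog is loyal.</p>"
        "<script>x()</script></html>")
scores = tag_scores(coalesce(candidate_tags(remove_boilerplate(html))))
```

In this example "Dogs" and "dog" collapse into one tag, and the navigation and script content never reach the scorer.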
Topic Modeling
[3] Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet allocation." Journal of Machine Learning Research 3 (2003): 993-1022.
Image courtesy: http://parkcu.com/blog/latent-dirichlet-allocation/
LDA with Topic Associations
•  Similar to constrained clustering, LDA can be run with topic associations [4].
•  Perform hierarchical/agglomerative clustering on StumbleUpon's taxonomy to obtain K=75 clusters of topic sets.
•  Use the topic sets as possible labels for the latent topic z.
•  The words themselves are not learnt for the specific topic they have been mapped to.
[4] Andrzejewski, David, and Xiaojin Zhu. "Latent Dirichlet Allocation with topic-in-set knowledge." Proceedings of the NAACL HLT 2009 Workshop on Semi-Supervised Learning for Natural Language Processing. Association for Computational Linguistics, 2009.
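The idea of restricting the latent topic z to a set of allowed topics can be sketched with a small collapsed Gibbs sampler. This is an illustrative toy under stated assumptions, not the model trained for the talk: the corpus, the `allowed` mapping, the hyperparameters `alpha` and `beta`, and the hard (rather than soft) constraint are all choices made for the example.

```python
import random
from collections import defaultdict
from typing import Dict, List, Set


def topic_in_set_lda(docs: List[List[str]], allowed: Dict[str, Set[int]],
                     num_topics: int, iters: int = 50, alpha: float = 0.1,
                     beta: float = 0.01, seed: int = 0):
    """Collapsed Gibbs sampling where some words may only take listed topics."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    all_topics = list(range(num_topics))

    def candidates(word: str) -> List[int]:
        # Constrained words sample only from their (non-empty) topic set.
        return sorted(allowed[word]) if word in allowed else all_topics

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [defaultdict(int) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    z: List[List[int]] = []

    def bump(d: int, w: str, t: int, delta: int) -> None:
        doc_topic[d][t] += delta
        topic_word[t][w] += delta
        topic_total[t] += delta

    # Random initialisation that already respects the constraints.
    for d, doc in enumerate(docs):
        z.append([rng.choice(candidates(w)) for w in doc])
        for w, t in zip(doc, z[d]):
            bump(d, w, t, 1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                bump(d, w, z[d][i], -1)   # remove the token's current assignment
                cand = candidates(w)
                weights = [(doc_topic[d][k] + alpha)
                           * (topic_word[k][w] + beta)
                           / (topic_total[k] + vocab_size * beta)
                           for k in cand]
                z[d][i] = rng.choices(cand, weights=weights)[0]
                bump(d, w, z[d][i], 1)
    return z, topic_word


docs = [["dog", "puppy", "leash", "dog"],
        ["film", "actor", "film", "scene"],
        ["dog", "film", "puppy", "actor"]]
allowed = {"dog": {0}, "film": {1}}   # hypothetical word-to-topic-set associations
z, topic_word = topic_in_set_lda(docs, allowed, num_topics=2)
```

Words without an entry in `allowed` are free to take any topic, so the associations act as anchors while the remaining words are assigned by the usual LDA conditional.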
Example
[Figure: Topic Associations (Pre LDA) and Top Words in a Topic (Post LDA)]
Using the Topic Models
•  Choose relevant topics
–  Rank/Threshold by … to get …
•  Filtering noisy tags
–  Rank/Threshold by …
•  Getting specific words
–  Rank/Threshold by …
•  Getting general words
–  Rank/Threshold by …
GraphLab Create I
Image courtesy: https://dato.com/products/create/technology.html
GraphLab Create II
•  Allows fast prototyping on a single machine
–  Python interface to a C++ backend
–  Scalable data structures (tabular and graph) made available
–  Out-of-core implementations of standard ML algorithms
–  Makes basic data engineering and visualization tasks easy
•  Easy to deploy microservices (predictive services) around models built using GraphLab Create/pandas/scikit-learn
–  RESTful API hosted over a Tornado server
–  Distributed cache
–  Amazon CloudWatch for monitoring (for AWS deploys)
•  (Con) Debugging the service can be difficult
Storm Basics [5]
•  Distributed real-time computation system
–  Fault tolerant, scalable, with guaranteed processing
–  Master --> ZooKeeper --> Worker Nodes
•  Workers
–  Spout: stream source
–  Bolt: computation unit
•  Data Flow
–  Stream: unbounded sequence of tuples
–  Topology: a network of spouts and bolts
[5] http://www.slideshare.net/ptgoetz/cassandra-and-storm-at-health-market-sceince
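The spout/bolt/topology vocabulary can be mimicked in a few lines of plain Python. The sketch below is only an analogy built on generators: it does not use the Storm API, runs in a single process, and has no acking, parallelism, or fault tolerance. The function names and the stubbed page fetch are hypothetical.

```python
import re
from typing import Callable, Iterable, Iterator, List, Tuple


def url_spout(urls: Iterable[str]) -> Iterator[Tuple]:
    """Spout: a stream source that emits one tuple per URL."""
    for url in urls:
        yield (url,)


def fetch_bolt(stream: Iterable[Tuple]) -> Iterator[Tuple]:
    """Bolt: attaches page HTML (stubbed here instead of a real HTTP fetch)."""
    for (url,) in stream:
        yield (url, f"<html><p>page for {url}</p></html>")


def text_bolt(stream: Iterable[Tuple]) -> Iterator[Tuple]:
    """Bolt: converts HTML to plain text by stripping tags."""
    for url, html in stream:
        yield (url, re.sub(r"<[^>]+>", " ", html).strip())


def run_topology(spout: Iterable[Tuple],
                 bolts: List[Callable[[Iterable[Tuple]], Iterator[Tuple]]]
                 ) -> List[Tuple]:
    """Topology: chain the spout through each bolt and drain the stream."""
    stream = spout
    for bolt in bolts:
        stream = bolt(stream)
    return list(stream)


results = run_topology(url_spout(["a.example", "b.example"]),
                       [fetch_bolt, text_bolt])
```

In Storm the same wiring is declared once as a topology and the framework distributes spout and bolt instances across worker nodes.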
Architecture
[Diagram] Components: URLs, Kafka Broker, Webpage Surveyor Service, SIMILAR ITEM TOPOLOGY, TMS*, Models
Processing steps: Fetch Page HTML; HTML to Text; Text to (Tags, Concepts); Merge; To S3; Build Topic Model
Numbered flow: 1. Topic Model Query; 2.a. Load ES; 2.b. Get Similar Items; 3. Load Similar Items for quick lookup
Output: Get Similar Items
*TMS: Topic Model Service
Some Numbers I
•  Number of Storm workers: 3
•  Number of ES nodes: 3
•  Training:
–  Document Size: 2M
–  Vocabulary: 400K
–  Time: ~8 s/iteration (16 cores)
•  Predictive service performance:
–  Peak requests handled: 200/min
–  Avg response time: 110 ms
•  URL turnaround time: 10 s
•  Number of URLs ingested: 70/min
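As a rough reading of the predictive-service figures, Little's law (L = λW) gives the average number of requests in flight. The calculation below assumes the peak rate and the average response time hold at the same moment, which the slide does not state.

```python
peak_requests_per_min = 200   # from the slide
avg_response_s = 0.110        # 110 ms, from the slide

arrival_rate = peak_requests_per_min / 60    # requests per second
in_flight = arrival_rate * avg_response_s    # Little's law: L = lambda * W

print(f"{arrival_rate:.2f} req/s, {in_flight:.2f} requests in flight on average")
```

Under that assumption fewer than one request is in flight on average at peak (about 0.37).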
Some Numbers II
THANKS. QUESTIONS?
