In this talk, we present how SU has used Dato's GraphLab Create along with Apache Storm to build a minimum viable online pipeline for computing item similarity over item attributes – a key component of contextual recommendations.
5. Recommendations – Matching User With Content
1. Understand User
2. Understand Content
3. Recommend
4. Get Feedback
[Interest word cloud: TELEVISION, MUSIC, TRENDING, FRIENDS, LIKEMINDED USERS, EXPERTS, ANIMALS, DOGS, PHOTOGRAPHY, MOVIES, ARTS, HUMOR]
8. • Problem:
– Recommend Items based on the topics discovered in the current
page a user is on
• Strategies:
– Find semantically similar items
– Find items that dig further into a specific topic
– Find items that dig further into a broader topic
– Others…
Problem
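The "semantically similar items" strategy above can be sketched as nearest-neighbour ranking over per-item topic distributions. The vectors, item names, and the choice of cosine similarity below are illustrative assumptions, not SU's actual scoring function:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two topic-probability vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def most_similar(page_topics, candidates, k=3):
    """Rank candidate items by topic similarity to the current page."""
    ranked = sorted(candidates.items(),
                    key=lambda kv: cosine(page_topics, kv[1]),
                    reverse=True)
    return [item for item, _ in ranked[:k]]
```

For a page whose topic mix is mostly photography, a photography-heavy candidate outranks an unrelated one.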
9. • Very quick “Ingestion to Recommendation” turnaround time (tens of seconds)
– Adopt stream processing with at-least-once processing guarantees
– Build idempotent subsystems
– Capitalize on non-linearity wherever possible
• Low-latency retrieval of recs (tens of ms)
– Pre-compute recs
– Retrieve recs in O(1) time
• Horizontally scalable design
– Utilize distributed processing systems/data stores
Constraints/SLAs
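The idempotence and O(1)-retrieval constraints above can be sketched together: keying writes by URL makes replays under at-least-once delivery harmless, and pre-computed recs reduce serving to a single lookup. The class below is a toy stand-in for the real distributed store, not SU's implementation:

```python
class RecStore:
    """Toy stand-in for a distributed key-value store of pre-computed recs.

    Writes are keyed by URL, so replaying the same message under
    at-least-once delivery just overwrites the same entry (idempotent),
    and reads are one O(1) dictionary lookup.
    """

    def __init__(self):
        self._recs = {}

    def put(self, url, recs):
        self._recs[url] = list(recs)   # last write wins; safe to replay

    def get(self, url):
        return self._recs.get(url, [])

store = RecStore()
# At-least-once delivery may hand us the same message twice:
for _ in range(2):
    store.put("http://example.com/page", ["item-1", "item-2"])
```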
10. • (Offline) Utilize a high quality dataset to build a topic model
• (Online) For each URL ingested,
– Extract text features that summarize the document
• Use pre-built topic models for
– Filtering noisy keywords
– Finding general topics
– Finding specific topics
– Computing topic hashes
• Compute similarity/relevance
• Store for quick retrieval
Approach Overview
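The per-URL online path can be sketched as a chain of stages mirroring the bullets above. Every callable here is a hypothetical stand-in for the corresponding real service:

```python
def online_pipeline(url, fetch, extract_text, topic_model, similar_items, store):
    """Per-URL online path sketched from the overview above.

    All five callables are illustrative stand-ins: in the real system these
    are separate services wired together in a Storm topology.
    """
    html = fetch(url)             # fetch the page
    text = extract_text(html)     # text features summarizing the document
    topics = topic_model(text)    # filter noise, find general/specific topics
    recs = similar_items(topics)  # compute similarity/relevance
    store(url, recs)              # store for quick retrieval
    return recs

recs = online_pipeline(
    "http://example.com/dogs",
    fetch=lambda u: "<p>dogs are great</p>",
    extract_text=lambda h: "dogs are great",
    topic_model=lambda t: {"animals": 0.9},
    similar_items=lambda topics: ["dog-article"],
    store=lambda u, r: None,
)
```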
12. Topic Modeling
3Blei, David M., Andrew Y. Ng, and Michael I. Jordan. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3 (2003): 993–1022.
image courtesy: http://parkcu.com/blog/latent-dirichlet-allocation/
13. • Similar to constrained clustering, LDA can be run with topic associations4.
• Perform hierarchical/agglomerative clustering on SU’s taxonomy to obtain K=75 clusters of topic sets.
• Use the topic sets as possible labels for the latent topic z.
• The words themselves are not learned exclusively for the specific topic they have been mapped to.
LDA with Topic Associations
4Andrzejewski, David, and Xiaojin Zhu. “Latent Dirichlet Allocation with topic-in-set knowledge.” Proceedings of the NAACL HLT 2009 Workshop on Semi-Supervised Learning for
Natural Language Processing. Association for Computational Linguistics, 2009.
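Under the topic-in-set scheme cited above, the Gibbs sampler draws each word's latent topic z only from that word's associated topic set, rather than from all K topics. A toy sketch of one such constrained draw; the count structures, hyperparameters, and topic names are illustrative, not SU's actual model:

```python
import random

def sample_z(word, doc_topic_counts, topic_word_counts, allowed,
             alpha=0.1, beta=0.01, vocab_size=1000, rng=random):
    """One constrained collapsed-Gibbs draw for a word's latent topic z.

    'allowed' maps words to their topic-in-set labels; words without an
    entry may take any topic. Weights follow the usual collapsed-Gibbs
    proportionality (n_dk + alpha) * (n_kw + beta) / (n_k + beta * V).
    """
    topics = allowed.get(word)
    if not topics:                       # unconstrained word: any topic
        topics = list(topic_word_counts)
    weights = []
    for k in topics:
        n_k = sum(topic_word_counts[k].values())
        w = ((doc_topic_counts.get(k, 0) + alpha)
             * (topic_word_counts[k].get(word, 0) + beta)
             / (n_k + beta * vocab_size))
        weights.append(w)
    r = rng.random() * sum(weights)      # sample proportionally to weights
    for k, w in zip(topics, weights):
        r -= w
        if r <= 0:
            return k
    return topics[-1]
```

A word whose association set contains a single topic is always assigned that topic, which is how the taxonomy knowledge constrains the model.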
15. • Choose relevant topics
– Rank/threshold by
• Filter noisy tags
– Rank/threshold by
• Get specific words
– Rank/threshold by
• Get general words
– Rank/threshold by
Using the Topic Models
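Each of the four steps above is the same rank/threshold pattern applied to a different model-derived score (the specific formulas were figures in the original slides and are not reproduced here). A generic sketch of that shared step, with made-up scores:

```python
def rank_and_threshold(scores, min_score=0.0, top_n=None):
    """Generic rank/threshold step.

    'scores' maps a candidate (a topic or a word) to whatever relevance
    quantity the relevant formula computes; candidates below min_score are
    dropped and the rest are returned best-first, optionally capped at top_n.
    """
    kept = [(c, s) for c, s in scores.items() if s >= min_score]
    kept.sort(key=lambda cs: cs[1], reverse=True)
    if top_n is not None:
        kept = kept[:top_n]
    return [c for c, _ in kept]
```

For example, choosing relevant topics would pass topic scores through this with a probability floor; filtering noisy tags would do the same with word-level scores.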
17. • Allows fast prototyping on a single machine
– Python interface to a C++ backend
– Scalable data structures (tabular and graph) made available
– Out-of-core implementations of standard ML algorithms
– Makes basic data engineering and visualization tasks easy
• Easy to deploy microservices (predictive services) around models built using GraphLab Create/pandas/scikit-learn
– RESTful API hosted on a Tornado server
– Distributed cache
– Amazon CloudWatch for monitoring (for AWS deploys)
• (Con) Debugging the service can be difficult
GraphLab Create II
18. • Distributed realtime computation system
– Fault-tolerant, scalable, and guaranteed processing
– Master --> Zookeeper --> Worker Nodes
• Workers
– Spouts: stream sources
– Bolts: computation units
• Data Flow
– Streams: unbounded sequences of tuples
– Topologies: networks of spouts and bolts
Storm Basics5
5http://www.slideshare.net/ptgoetz/cassandra-and-storm-at-health-market-sceince
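The spout/bolt/topology vocabulary can be illustrated with a toy in-process simulation. Real Storm runs spouts and bolts as distributed tasks with acking and replay; none of that is modelled in this sketch:

```python
class Spout:
    """Stream source: emits a sequence of tuples (unbounded in real Storm)."""
    def __init__(self, items):
        self.items = items

    def stream(self):
        for item in self.items:
            yield item

class Bolt:
    """Computation unit: consumes a stream, emits a transformed stream."""
    def __init__(self, fn):
        self.fn = fn

    def process(self, stream):
        for tup in stream:
            yield self.fn(tup)

def run_topology(spout, bolts):
    """A topology is a network of spouts and bolts; this toy version wires
    them into a simple chain and drains the resulting stream."""
    stream = spout.stream()
    for bolt in bolts:
        stream = bolt.process(stream)
    return list(stream)

urls = Spout(["http://a", "http://b"])
lengths = run_topology(urls, [Bolt(str.upper), Bolt(len)])
```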
19. Architecture
[Diagram: SIMILAR ITEM TOPOLOGY]
– URLs arrive via a Kafka broker
– Webpage Surveyor Service: Fetch Page, HTML to S3
– HTML to Text --> Text to (Tags, Concepts) --> Merge (using TMS* models)
– 1. Topic Model Query
– 2.a. Load ES; 2.b. Get Similar Items
– 3. Load Similar Items for quick lookup
– Build Topic Model (offline) feeds the TMS
*TMS – Topic Model Service
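Steps 2.a/2.b of the topology (load into Elasticsearch, get similar items) amount to querying an inverted index keyed by topics. A toy in-memory stand-in, with made-up item names, sketches the idea; the real system uses ES scoring rather than this shared-topic count:

```python
from collections import defaultdict

class TopicIndex:
    """Toy inverted index standing in for the ES step of the topology:
    items are indexed under their topics (2.a), and similar items are
    retrieved by counting shared topics with the query page (2.b)."""

    def __init__(self):
        self.index = defaultdict(set)

    def load(self, item, topics):
        for t in topics:
            self.index[t].add(item)

    def similar(self, topics, exclude=None):
        hits = defaultdict(int)
        for t in topics:
            for item in self.index.get(t, ()):
                if item != exclude:
                    hits[item] += 1
        return sorted(hits, key=hits.get, reverse=True)

es = TopicIndex()
es.load("dog-walks", ["dogs", "animals"])
es.load("cat-naps", ["animals"])
```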
20. • Number of Storm Workers: 3
• Number of ES Nodes: 3
• Training:
– Document Size: 2M
– Vocabulary: 400K
– Time: ~8s/iteration (16 cores)
• Predictive service performance:
– Peak requests handled: 200/min
– Avg response time: 110 ms
• URL turnaround time: 10s
• Number of URLs ingested: 70/min
Some Numbers I