DataEngConf SF16 - Methods for Content Relevance at LinkedIn

©2015 LinkedIn Corporation. All Rights Reserved.
Methods for approaching content
relevance at LinkedIn
Ajit Singh
DataEngConf, April 7th 2016

Who am I?
•  Build models and the infrastructure that serves those models in production.
•  Background in machine learning
•  Ph.D. Machine Learning, CMU
•  Post-doc, University of Washington.

Who Do I Work With ?
Engineering team with expertise in
machine learning and systems.

Why attend this talk ?
Feed
Slideshare
Rich media: slides and video, with transcripts
Pulse
Content discovery.
Groups
Conversation Threads

Article Recommendation

Common Pattern for Recommendations
Keep it Simple
1.  Feature Extraction
–  What do I know about this content ?
2.  Indexing & Search
–  How do I store and retrieve documents and their features?
–  What candidates should I consider for this request ?
3.  Recommendation
–  What should I show you ?
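
The three steps above can be sketched end to end in a few lines. This is a toy illustration only, not LinkedIn's implementation: the `Document`, `Index`, `extract_features`, and `recommend` names are hypothetical, and the "features" are just a token set and a length.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    text: str
    features: dict = field(default_factory=dict)


def extract_features(doc):
    """Step 1: what do I know about this content? (toy features only)"""
    tokens = doc.text.lower().split()
    doc.features = {"tokens": set(tokens), "length": len(tokens)}
    return doc


class Index:
    """Step 2: store documents and retrieve candidates by token."""

    def __init__(self):
        self.docs = {}
        self.postings = {}  # token -> set of doc ids

    def add(self, doc):
        self.docs[doc.doc_id] = doc
        for token in doc.features["tokens"]:
            self.postings.setdefault(token, set()).add(doc.doc_id)

    def candidates(self, token):
        return [self.docs[d] for d in sorted(self.postings.get(token, ()))]


def recommend(candidates, score, k):
    """Step 3: rank the candidates and keep the top k."""
    return sorted(candidates, key=score, reverse=True)[:k]


index = Index()
for doc_id, text in [("a", "spark streaming at scale"),
                     ("b", "hadoop and spark"),
                     ("c", "barn raising")]:
    index.add(extract_features(Document(doc_id, text)))

# Hypothetical scoring function: prefer longer documents.
top = recommend(index.candidates("spark"),
                score=lambda d: d.features["length"], k=1)
```

Keeping the three stages behind separate functions means each one can later move to a different stack (offline, nearline, online) without changing the others.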

Engineering Stacks
Offline (minutes – days)
•  Hadoop
•  Spark
•  “Big Data” tools
Nearline (seconds – hours)
•  Samza
•  Spark Streaming, Storm
Online (milliseconds)
•  Services
•  DB, Distributed Key-Value Stores

Kinds of Model Features
§  Member Features
–  What we know about the viewer or guest.
–  E.g., Industry, Skills, Languages Spoken
§  Document Features
–  What we know about a candidate item for recommendation.
–  E.g., For articles: vector-space, topics, social gestures.
§  Engagement
–  Aggregations of tracking data
–  E.g., Per-item click-through rates, Dwell time.
–  Think OLAP cubes.
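
As a rough sketch of what "aggregations of tracking data" can look like, the snippet below rolls raw events up into per-item click-through rate and mean dwell time. The event layout `(item_id, event_type, dwell_seconds)` and the item names are assumptions made for illustration; a real pipeline would run this as a Hadoop or Spark job over tracking logs.

```python
from collections import defaultdict

# Hypothetical tracking events: (item_id, event_type, dwell_seconds)
events = [
    ("article-1", "impression", 0.0),
    ("article-1", "impression", 0.0),
    ("article-1", "click", 30.0),
    ("article-2", "impression", 0.0),
    ("article-2", "impression", 0.0),
    ("article-2", "impression", 0.0),
    ("article-2", "impression", 0.0),
    ("article-2", "click", 10.0),
]


def aggregate(events):
    """Group events by item and derive engagement features."""
    counts = defaultdict(lambda: {"impressions": 0, "clicks": 0, "dwell": 0.0})
    for item_id, event_type, dwell in events:
        row = counts[item_id]
        if event_type == "impression":
            row["impressions"] += 1
        elif event_type == "click":
            row["clicks"] += 1
            row["dwell"] += dwell

    features = {}
    for item_id, row in counts.items():
        features[item_id] = {
            # Guard against division by zero for unseen or unclicked items.
            "ctr": row["clicks"] / row["impressions"] if row["impressions"] else 0.0,
            "mean_dwell": row["dwell"] / row["clicks"] if row["clicks"] else 0.0,
        }
    return features


engagement = aggregate(events)
```

Adding more grouping keys (member industry, day, device) to the dictionary key is what turns this into the OLAP-cube view mentioned above.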

Document Features

NLP
§  Featurizers which process text one document at a time.
–  Foundational:
§  Tokenization, Lemmatization, Stop-word removal => Vector-space models
§  Text near-deduplication (w-shingling, SpotSigs)
§  Language detection
–  Classifiers:
§  Explicitly categorize the document into one or more label sets.
§  Fast Prototyping
–  Build a library first
–  Deploy the library to Hadoop first: e.g., via Pig, Scalding.
–  Don’t build a near-real-time system until you have validated your features.
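
In the "build a library first" spirit, here is a minimal sketch of two of the foundational featurizers named above: tokenization with stop-word removal, and near-duplicate detection via w-shingling with Jaccard similarity. The stop-word list, the window size `w=3`, and the `0.5` threshold are illustrative assumptions, and lemmatization and SpotSigs are not covered.

```python
import re

# Illustrative stop-word list only; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "to", "is"}


def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOP_WORDS]


def shingles(tokens, w=3):
    """Return the set of contiguous w-token windows (w-shingles)."""
    if len(tokens) < w:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(text_a, text_b, w=3, threshold=0.5):
    """True when the shingle sets overlap at or above the threshold."""
    sim = jaccard(shingles(tokenize(text_a), w), shingles(tokenize(text_b), w))
    return sim >= threshold


text1 = "the quick brown fox jumps over the lazy dog"
text2 = "The quick brown fox jumps over a lazy dog!"
text3 = "barn raisers teach us about teamwork"
```

Because each function processes one document at a time and has no external dependencies, the same code can be called from a batch job first and from a nearline system later.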

Near-Real Time Feature Generation
§  Before starting, ask yourself – do I need near-real time features?

[Figure: pipeline diagram spanning the Offline, Nearline, and Online tiers, with steps numbered 1 to 6. Components: Article DB, Article Stream, featurizers (Language, Topics, Entity Extractors, Text Hashes) producing Language, Topic, Entity, and Hash Features, Other Features from an Offline Producer, and the Search Index.]

Pre-computed Recommendations
When you can stick to the offline world.
§  In many cases the recommendation problem is constrained: e.g.,
–  You know which users are likely to visit in a given time period.
–  Documents being considered will not change quickly.
§  Give me the top-k documents by highest normalized click-through rate.
–  Recommendations are pushed (e-mail, push notifications)
§  Just pre-compute recommendations for likely-to-visit users
–  Obvious parallelization via Hadoop.
–  Agile development.
–  Send recommendations to a distributed key-value store to serve.
§  Presupposes having good tracking & data pipelines (data lake).
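
A minimal sketch of the pre-computation described above, under stated assumptions: the talk does not define "normalized" click-through rate, so the code assumes it means smoothing raw CTR toward a prior, and a plain dictionary stands in for the distributed key-value store. The names `smoothed_ctr`, `precompute`, and `candidates_for` are hypothetical.

```python
import heapq


def smoothed_ctr(clicks, impressions, prior_ctr=0.01, prior_weight=100):
    """Shrink raw CTR toward a prior so low-traffic items do not dominate.

    Assumption: this is one possible reading of "normalized" CTR.
    """
    return (clicks + prior_ctr * prior_weight) / (impressions + prior_weight)


def precompute(likely_visitors, candidates_for, stats, k=2):
    """Return {member_id: [doc_id, ...]} holding the top-k docs per member."""
    store = {}  # stand-in for a distributed key-value store
    for member_id in likely_visitors:
        scored = [(smoothed_ctr(*stats[doc_id]), doc_id)
                  for doc_id in candidates_for(member_id)]
        store[member_id] = [doc_id for _, doc_id in heapq.nlargest(k, scored)]
    return store


# (clicks, impressions) per document; d2 has a high raw CTR on tiny traffic.
stats = {"d1": (50, 1000), "d2": (1, 2), "d3": (5, 1000)}
store = precompute(["m1"], lambda member_id: ["d1", "d2", "d3"], stats, k=2)
```

The outer loop over members is independent per member, which is what makes the Hadoop parallelization obvious: each member (or shard of members) can be scored by a separate task.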

Data Science / Engineering Contracts
§  How do I get my data into the near real-time flow ?
§  How do I deploy a model for feature X ?
§  What if my model has a large number of parameters ?
–  Language models are notoriously huge.
–  Memory is often a tighter constraint than CPU.
§  What if I want to A/B test different versions of a feature ?
§  What happens if a feature source fails ?

Search Indices
§  We use an in-house search system called Galene.
§  Store features as searchable facets within the index.

Why ?
§  Flexible candidate selection:
–  Give me all the English-language documents ingested in the last four days that also mention the term “Lectra”.
–  Give me all promoted articles tagged with “Grace Hopper 2016”
§  Consolidate state management across search & recommendations.
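
The two example queries above can be expressed as conjunctions of facet filters. The sketch below uses an in-memory list, not Galene (whose API is not described in this talk); `IndexedDoc`, `select`, and the predicate helpers are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class IndexedDoc:
    doc_id: str
    language: str
    ingested_at: datetime
    terms: frozenset
    tags: frozenset = frozenset()
    promoted: bool = False


def select(docs, *predicates):
    """Return the documents that satisfy every facet predicate."""
    return [d for d in docs if all(p(d) for p in predicates)]


def language_is(lang):
    return lambda d: d.language == lang


def ingested_within(days, now):
    return lambda d: now - d.ingested_at <= timedelta(days=days)


def mentions(term):
    return lambda d: term.lower() in d.terms


def tagged(tag):
    return lambda d: tag in d.tags


now = datetime(2016, 4, 7)
docs = [
    IndexedDoc("a", "en", now - timedelta(days=1), frozenset({"lectra", "fashion"})),
    IndexedDoc("b", "en", now - timedelta(days=9), frozenset({"lectra"})),
    IndexedDoc("c", "fr", now - timedelta(days=1), frozenset({"lectra"})),
    IndexedDoc("d", "en", now - timedelta(days=2), frozenset({"computing"}),
               tags=frozenset({"Grace Hopper 2016"}), promoted=True),
]

# English-language documents from the last four days that mention "Lectra".
recent_english_lectra = select(docs, language_is("en"),
                               ingested_within(4, now), mentions("Lectra"))
# Promoted articles tagged with "Grace Hopper 2016".
promoted_hopper = select(docs, lambda d: d.promoted, tagged("Grace Hopper 2016"))
```

A search index evaluates the same kind of conjunction against inverted lists instead of scanning every document, which is why storing features as facets makes candidate selection cheap.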

Search Verticals
[Figure: architecture diagram. Components: Search Federator, Recommender, …, Pre-computed Recommendations, and the Search Verticals (Articles, Slides, Groups, Courses). Edge labels: search query, formulated query, lookup.]

Continuum between Search & Recommendations
[Figure: a spectrum running between Search and Recommendation, labeled with Navigational, Guided & Faceted Search, Broad / Exploratory, and Empty Search.]

Homepage Module

Epsilon-Greedy Explore/Exploit
#1  Target’s Red Ink Runs out in Canada.  0.3241%
#2  Why we need “Economy Wide” Airline Seats.  0.5923%
#3  How much work is too much work?  0.4864%
#4  We can learn from barn raisers.  0.0231%

Key Ideas
§  Online Algorithm
–  The model continuously updates with feedback from its decisions.
–  Infrastructure components:
§  Near real-time counting (OLAP).
§  Efficient scoring of k candidates per request.
–  Allows for warm/cold-start models (cf. Thompson sampling)
§  What algorithms do well vs. what humans do well.
–  Candidate selection by human editors.
–  Ranking via algorithms.

Algorithm Aversion
“Although people may be willing to trust an
algorithm in the absence of experience with it,
seeing it perform—and almost inevitably err—
will cause them to abandon it in favor of a human
judge. This may occur even when people see the
algorithm outperform the human.”
Dietvorst et al.
J. Exp. Psychol. Gen., 2014

Engineering & Data Science
§  The core challenge is that data-driven products are complex
–  Tracking
–  Data Warehousing
–  Offline Infrastructure (Hadoop & Spark)
–  Modelers
–  Nearline & Online Infrastructure
§  Craftsmanship
–  Clear contracts are critical
§  Are you providing recommendations ?
§  Are you providing data sets ?
§  Are you providing modeling infrastructure to other teams ?

Align on Metrics
Know why something matters.
§  True North Metrics:
–  Track the health of a product.
–  Usually affected by many aspects of the product; not just relevance.
–  May not be measurable over short time spans.
§  e.g., Revenue in a subscription funnel.
§  Signpost Metrics:
–  Leading indicators of health for a true north.
–  Measurable, often via an A/B test.
§  Relevance Metrics:
–  You have an optimization problem; this is what it optimizes.
–  Rare that you can directly optimize your true north metric.
