How Lucene Powers the LinkedIn Segmentation and Targeting Platform

HOW LUCENE POWERS LINKEDIN SEGMENTATION & TARGETING PLATFORM
Hien Luu & Raj Rangaswamy

About Us

Hien
Luu

Rajasekaran

Rangaswamy

Agenda
• 
• 
• 
• 

Little bit about LinkedIn
Segmentation & Targeting Platform Overview
How Lucene powers Segmentation & Targeting Platform
Q&A

Our Mission
Connect the world’s professionals to make them
more productive and successful.

Our Vision
Create economic opportunity for every
professional in the world.

Members First!

The world’s largest professional network
Over 65% of members are now international

>30M

>90%

Fortune
100
Companies

use
LinkedIn
Talent
Soln
to
hire

>3M

Company
Pages

19

Languages

>5.7B

Professional
searches
in
2012

Other Company Facts
• 
• 

Headquartered
in
Mountain
View,
Calif.,
with
oﬃces
around
the
world!
LinkedIn
has
~4200
full-‐3me
employees
located
around
the
world

Segmenta3on
&
Targe3ng
PlaRorm
Overview

2. Attributes Added to Table

1. Create attributes
§ 
§ 
§ 
§ 
§ 

Name
Email
State
Occupation
Etc.

Name

Email

State

OccupaEon

John
Smith

jsmith@blah.com

California

Engineer

Jane
Smith

smithj@mail.com

Nevada

HR
Manager

Jane
Doe

jdoe@email.com

California

Engineer

3. Create Target Segment:
California, Engineer
Name

Email

State

OccupaEon

John
Smith

jsmith@blah.com

California

Engineer

Jane
Doe

jdoe@email.com

California

Engineer

4. Export List & Send Vendor

…


•  Business definition
–  Business would like to launch new campaigns often
–  Business would like to specify targeting criteria using
arbitrary set of attributes
–  Attributes need to be computed to fulfill the targeting
criteria
–  The attribute data resides on Hadoop or TD
–  Business is most comfortable with SQL-like language


A[ribute

Computa3on

Engine

A[ribute

Serving

Engine

Attribute consolidation

Self-service

A[ribute

Computa3on

Engine

Support
various
data
sources

Attribute availability


PB

Attribute computation
~238M

TB

TB
~440

Build segments

Self-service

A[ribute

Serving

Engine

A[ribute
predicate
expression

Build lists

count

$

1234

filter

Σ

sum

complex
expressions

Attribute Serving Engine
~238M

~440


Who are the job seekers?
Who are the LinkedIn Talent Solution prospects
in Europe?
Who are north American recruiters that
don’t work for a competitor?

How
Lucene
powers
Segmenta3on
&
Targe3ng
PlaRorm


•  Architecture
–  Indexer Architecture
–  Serving Architecture

• 
• 
• 
• 
• 

Load Balanced Model
Next Steps - Distributed Model
DocValues
Lessons Learnt
Why not use an existing solution?

Architecture

Attribute
Serving
Engine

Attribute
Computation
Engine

Data
Storage
Layer

Attribute
Indexing

Attribute
Creation
Engine

Attribute
Serving
Engine

Attribute
Materialization
Engine

Attribute
Metastore

Mapper

Architecture
mysql
attribute
store

K=>
AvroKey<GenericRecord>

V=>
AvroValue<NullWritable>

Attribute
Definitions

HDFS

shard 1

Avro data in
HDFS

Hadoop
Indexer MR
shard 2

Index Merger

shard n

Web Servers

Reducer

K=>
NullWritable

V=>
LuceneDocumentWrapper

LuceneOutputFormat

RecordWriter

LuceneDocumentWrapper

Document

Index

Architecture
JSON
Predicate

Expression

JSON
Lucene

Query
Parser

Inverted

Index

Inverted

Index

Segment
&

List

Inverted

Index


•  Architecture

•  Load Balanced Model
• 
• 
• 
• 

DocValues
Lessons Learnt

Serving – Load Balanced Model
HTTP Request

Load Balancer

Web Server 1

Shard 1

Web Server 2

Shard 2

Shared Drive

Web Server n

Shard n

Serving – Load Balanced Model

But
Wait…..

•  Is
load
balancing
alone
good
enough?

•  What
about
distribu3on
and
failover?

Next Steps – Distributed Model

•  A
generic
cluster
management
framework

•  Manage
par33oned
and
replicated
resources
in
distributed
systems

•  Built
on
top
of
Zookeeper
that
hides
the
complexity
of
ZK
primi3ves

•  Provides
distributed
features
such
as
leader
elec3on,
two-‐phase

commit
etc.
via
a
model
of
state
machine

hLp://helix.incubator.apache.org/


HTTP Request

Load Balancer

Scatter Gather

Web Server 1

Web Server 2

Web Server 3

Shard
1

active

Shard
2

active

Shard
3

active

Shard
2

standby

Shard
3

standby

Shard
1

standby


HTTP Request

Load Balancer

Scatter Gather

Web Server 1

Web Server 2

Web Server 3

Shard
1

active

Shard
2

active

Shard
3

failure

Shard
2

standby

Shard
3

active

Shard
1

failure

•  Architecture

• 
• 
• 
• 
• 

Load Balanced Model
DocValues
Lessons Learnt

DocValues – Use Case

•  Once segments are built, users want to forecast, see a
target revenue projection for the campaigns that they
want to run.
•  Campaigns can be run on various Revenue Models
•  This involves adding per member Propensity Scores and
Dollar Amounts

DocValues – Why not Stored Fields?

Why
not
use
Stored
Fields?

Document ID

•  Stored
ﬁelds
have
one
indirec3on
per

document
resul3ng
in
two
disk
seeks

.fdx

fetch filepointer to field data

.fdt

scan by id until field is found

per
document

•  Performance
cost
quickly
adds
up
when

fetching
millions
of
documents

DocValues – Why not Stored Fields?

•  Why not use Field Cache?
–  Is memory resident
–  Works fine when there is enough memory
–  But keeping millions of un-inverted values in memory is
impossible
–  Additional cost to parse values (from String and to String)

DocValues
•  Dense column based storage
–  (1 Value per Document and 1 Column per field and segment)
•  Accepts primitives
•  No conversion from/to String needed
•  Loads 80x-100x faster than building a FieldCache
•  All the work is done during Indexing
•  DocValue fields can be indexed and stored too

Lessons Learnt

Indexing
•  Reuse index writers, field and document instances
•  Create many partitions and merge them in a different
process
•  Rebuild (bootstrap) entire index if possible
•  Use partial updates with caution
•  Analyze the index

Lessons Learnt

Serving
•  Reuse a single instance of IndexSearcher
•  Limit usage of stored fields and term vectors
•  Plan for load balancing and failover
•  Cache term frequencies
•  Use different machines for serving and indexing

Why not use existing solutions?
•  Doesn’t
allow
dynamic
schema

•  Diﬃcult
to
bootstrap
indexes
built
in
Hadoop

•  Indexing
elevates
query
latency

• 
• 
• 
• 

Doesn’t
allow
dynamic
schema

Diﬃcult
to
bootstrap
indexes
built
in
Hadoop

Larger
memory
overhead

Compara3vely
slow

Ques3ons?

More
info:
data.linkedin.com

How Lucene Powers the LinkedIn Segmentation and Targeting Platform

Recommended

Recommended

More Related Content

Similar to How Lucene Powers the LinkedIn Segmentation and Targeting Platform

Similar to How Lucene Powers the LinkedIn Segmentation and Targeting Platform (20)

Recently uploaded

Recently uploaded (20)

How Lucene Powers the LinkedIn Segmentation and Targeting Platform