HOW LUCENE POWERS LINKEDIN SEGMENTATION & TARGETING PLATFORM
Hien Luu & Raj Rangaswamy
Published in: Technology

Transcript of "How Lucene Powers the LinkedIn Segmentation and Targeting Platform"

  1. HOW LUCENE POWERS LINKEDIN SEGMENTATION & TARGETING PLATFORM
     Hien Luu & Raj Rangaswamy
  2. About Us
     Hien Luu, Rajasekaran Rangaswamy
  3. Agenda
     •  Little bit about LinkedIn
     •  Segmentation & Targeting Platform Overview
     •  How Lucene powers Segmentation & Targeting Platform
     •  Q&A
  4. Our Mission
     Connect the world’s professionals to make them more productive and successful.
     Our Vision
     Create economic opportunity for every professional in the world.
     Members First!
  5. The world’s largest professional network
     Over 65% of members are now international
     •  >30M
     •  >90% of Fortune 100 companies use LinkedIn Talent Solutions to hire
     •  >3M Company Pages
     •  19 languages
     •  >5.7B professional searches in 2012
  6. Other Company Facts
     •  Headquartered in Mountain View, Calif., with offices around the world
     •  LinkedIn has ~4200 full-time employees located around the world
  7. Segmentation & Targeting Platform Overview
  8. Segmentation & Targeting Platform Overview
  9. Segmentation & Targeting Platform Overview
  10. Segmentation & Targeting Platform Overview
      1. Create attributes: Name, Email, State, Occupation, etc.
      2. Attributes added to table:
           Name        | Email            | State      | Occupation
           John Smith  | jsmith@blah.com  | California | Engineer
           Jane Smith  | smithj@mail.com  | Nevada     | HR Manager
           Jane Doe    | jdoe@email.com   | California | Engineer
      3. Create target segment: California, Engineer
           Name        | Email            | State      | Occupation
           John Smith  | jsmith@blah.com  | California | Engineer
           Jane Doe    | jdoe@email.com   | California | Engineer
      4. Export list & send to vendor …
  11. Segmentation & Targeting Platform Overview
      •  Business definition
         –  Business would like to launch new campaigns often
         –  Business would like to specify targeting criteria using an arbitrary set of attributes
         –  Attributes need to be computed to fulfill the targeting criteria
         –  The attribute data resides on Hadoop or TD
         –  Business is most comfortable with a SQL-like language
  12. Segmentation & Targeting Platform Overview
      Attribute Computation Engine; Attribute Serving Engine
  13. Segmentation & Targeting Platform Overview
      Attribute Computation Engine: attribute consolidation; self-service; support various data sources; attribute availability
  14. Segmentation & Targeting Platform Overview
      Attribute computation (diagram labels): PB; TB; TB; ~238M; ~440
  15. Segmentation & Targeting Platform Overview
      Attribute Serving Engine: build segments; self-service; attribute predicate expression; build lists
  16. Segmentation & Targeting Platform Overview
      Attribute Serving Engine (diagram labels): count; $; 1234; filter; Σ; sum; complex expressions; ~238M; ~440
  17. Segmentation & Targeting Platform Overview
      •  Who are the job seekers?
      •  Who are the LinkedIn Talent Solution prospects in Europe?
      •  Who are North American recruiters that don’t work for a competitor?
  18. Segmentation & Targeting Platform Overview
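
The four-step flow on slide 10 (create attributes, add them to a table, create a target segment, export a list) can be sketched in a few lines of Python. This is only an illustration built on the slide's sample rows; the `Member` class and the `build_segment` and `export_list` helpers are hypothetical names, not part of the platform described in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    # Attributes from step 1 of slide 10.
    name: str
    email: str
    state: str
    occupation: str


# Step 2: the attribute table, using the sample rows shown on the slide.
MEMBERS = [
    Member("John Smith", "jsmith@blah.com", "California", "Engineer"),
    Member("Jane Smith", "smithj@mail.com", "Nevada", "HR Manager"),
    Member("Jane Doe", "jdoe@email.com", "California", "Engineer"),
]


def build_segment(members, **criteria):
    """Step 3: keep members whose attributes equal every given criterion."""
    return [
        m for m in members
        if all(getattr(m, attr) == value for attr, value in criteria.items())
    ]


def export_list(segment):
    """Step 4: export the contact list (here, just the e-mail addresses)."""
    return [m.email for m in segment]


segment = build_segment(MEMBERS, state="California", occupation="Engineer")
emails = export_list(segment)
print(emails)
```

A linear scan like this is fine for three rows; the rest of the talk is about doing the same filtering over a far larger member set, which is where an inverted index comes in.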
  19. How Lucene powers Segmentation & Targeting Platform
  20. How Lucene powers Segmentation & Targeting Platform
      •  Architecture
         –  Indexer Architecture
         –  Serving Architecture
      •  Load Balanced Model
      •  Next Steps - Distributed Model
      •  DocValues
      •  Lessons Learnt
      •  Why not use an existing solution?
  21. Architecture
      Diagram labels: Attribute Serving Engine; Attribute Computation Engine; Data Storage Layer; Attribute Indexing; Attribute Creation Engine; Attribute Serving Engine; Attribute Materialization Engine; Attribute Metastore
  22. Architecture
      •  Attribute definitions (mysql attribute store)
      •  Avro data in HDFS
      •  Hadoop Indexer MR
         –  Mapper: K => AvroKey<GenericRecord>, V => AvroValue<NullWritable>
         –  Reducer: K => NullWritable, V => LuceneDocumentWrapper
         –  LuceneOutputFormat, RecordWriter: LuceneDocumentWrapper, Document, Index
      •  HDFS: shard 1, shard 2, … shard n
      •  Index Merger
      •  Web Servers
  23. Architecture
      JSON Predicate Expression → JSON Lucene Query Parser → Inverted Index (one per shard) → Segment & List
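
The serving path on slide 23 (a JSON predicate expression, parsed into a query, evaluated against inverted indexes to produce a segment) can be illustrated with a small model. The sketch below does not use Lucene, and the JSON shape (`op`, `field`, `value`, `args`) is an assumption made for illustration only; the slides do not show the platform's actual predicate format. The sample question is the one from slide 17: North American recruiters who don't work for a competitor.

```python
import json

# Hypothetical member attributes keyed by document id.
DOCS = {
    1: {"region": "North America", "role": "recruiter", "employer": "competitor"},
    2: {"region": "North America", "role": "recruiter", "employer": "other"},
    3: {"region": "Europe", "role": "recruiter", "employer": "other"},
    4: {"region": "North America", "role": "engineer", "employer": "other"},
}

# Assumed JSON predicate format (not taken from the talk).
PREDICATE_JSON = """
{"op": "and", "args": [
  {"op": "term", "field": "region", "value": "North America"},
  {"op": "term", "field": "role", "value": "recruiter"},
  {"op": "not", "args": [
    {"op": "term", "field": "employer", "value": "competitor"}
  ]}
]}
"""


def build_inverted_index(docs):
    """Map each (field, value) term to the set of doc ids containing it."""
    index = {}
    for doc_id, attrs in docs.items():
        for field, value in attrs.items():
            index.setdefault((field, value), set()).add(doc_id)
    return index


def evaluate(node, index, all_ids):
    """Recursively evaluate a predicate tree into a set of matching doc ids."""
    op = node["op"]
    if op == "term":
        return set(index.get((node["field"], node["value"]), set()))
    children = [evaluate(child, index, all_ids) for child in node["args"]]
    if op == "and":
        return set.intersection(*children)
    if op == "or":
        return set.union(*children)
    if op == "not":
        return all_ids - children[0]
    raise ValueError(f"unknown operator: {op}")


index = build_inverted_index(DOCS)
matches = evaluate(json.loads(PREDICATE_JSON), index, set(DOCS))
print(sorted(matches), len(matches))
```

Each `term` node is a single postings lookup, and the boolean operators are set operations over postings, which is why counts and filters over a large member set stay cheap with an inverted index.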
  24. How Lucene powers Segmentation & Targeting Platform
      Architecture (Indexer Architecture, Serving Architecture); Load Balanced Model; Next Steps - Distributed Model; DocValues; Lessons Learnt; Why not use an existing solution?
  25. Serving – Load Balanced Model
      HTTP Request → Load Balancer → Web Server 1 (Shard 1), Web Server 2 (Shard 2), … Web Server n (Shard n); Shared Drive
  26. Serving – Load Balanced Model
      But wait…
      •  Is load balancing alone good enough?
      •  What about distribution and failover?
  27. How Lucene powers Segmentation & Targeting Platform
      Architecture (Indexer Architecture, Serving Architecture); Load Balanced Model; Next Steps - Distributed Model; DocValues; Lessons Learnt; Why not use an existing solution?
  28. Next Steps – Distributed Model
      •  A generic cluster management framework
      •  Manage partitioned and replicated resources in distributed systems
      •  Built on top of Zookeeper that hides the complexity of ZK primitives
      •  Provides distributed features such as leader election, two-phase commit etc. via a model of state machine
      http://helix.incubator.apache.org/
  29. Next Steps – Distributed Model
      HTTP Request → Load Balancer → Scatter Gather
      •  Web Server 1: Shard 1 active, Shard 2 standby
      •  Web Server 2: Shard 2 active, Shard 3 standby
      •  Web Server 3: Shard 3 active, Shard 1 standby
  30. Next Steps – Distributed Model
      HTTP Request → Load Balancer → Scatter Gather
      •  Web Server 1: Shard 1 active, Shard 2 standby
      •  Web Server 2: Shard 2 active, Shard 3 active
      •  Web Server 3: Shard 3 failure, Shard 1 failure
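
The active/standby layout on slides 29 and 30 can be modelled with a small routing table. The sketch below is a plain-Python illustration, not the Helix API: the `Cluster` class, the `scatter_gather` function, and the sample shard contents are all hypothetical. It shows a scatter-gather count that keeps returning the same answer after Web Server 3 fails, because Shard 3's standby on Web Server 2 takes over.

```python
class Cluster:
    """Tracks which servers hold each shard, in preference order."""

    def __init__(self, placement):
        # placement: shard -> [active server, standby server]
        self.placement = {shard: list(servers) for shard, servers in placement.items()}
        self.down = set()

    def fail(self, server):
        """Mark a server as failed; its shards fall back to their standbys."""
        self.down.add(server)

    def active_server(self, shard):
        """Return the first live replica holder for the shard."""
        for server in self.placement[shard]:
            if server not in self.down:
                return server
        raise RuntimeError(f"no live replica for {shard}")


def scatter_gather(cluster, shard_data, predicate):
    """Send the query to one live replica per shard and sum the partial counts."""
    total = 0
    routed = {}
    for shard in sorted(cluster.placement):
        routed[shard] = cluster.active_server(shard)
        # Replicas hold identical copies, so the data is keyed by shard here.
        total += sum(1 for doc in shard_data[shard] if predicate(doc))
    return total, routed


# Placement from slide 29.
cluster = Cluster({
    "shard1": ["ws1", "ws3"],
    "shard2": ["ws2", "ws1"],
    "shard3": ["ws3", "ws2"],
})
shard_data = {
    "shard1": [{"state": "California"}, {"state": "Nevada"}],
    "shard2": [{"state": "California"}],
    "shard3": [{"state": "California"}, {"state": "California"}],
}


def is_californian(doc):
    return doc["state"] == "California"


before = scatter_gather(cluster, shard_data, is_californian)
cluster.fail("ws3")  # the failure shown on slide 30
after = scatter_gather(cluster, shard_data, is_californian)
print(before, after)
```

In the talk this bookkeeping is delegated to the cluster management framework on slide 28; the sketch only shows why one standby per shard is enough to survive a single server failure.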
  31. Architecture (Indexer Architecture, Serving Architecture); Load Balanced Model; Next Steps - Distributed Model; DocValues; Lessons Learnt; Why not use an existing solution?
  32. DocValues – Use Case
      •  Once segments are built, users want to forecast and see a target revenue projection for the campaigns that they want to run
      •  Campaigns can be run on various revenue models
      •  This involves adding per-member propensity scores and dollar amounts
  33. DocValues – Why not Stored Fields?
      •  Stored fields have one indirection per document, resulting in two disk seeks per document
         –  Document ID → .fdx: fetch file pointer to field data
         –  .fdt: scan by id until field is found
      •  Performance cost quickly adds up when fetching millions of documents
  34. DocValues – Why not Stored Fields?
      •  Why not use Field Cache?
         –  Is memory resident
         –  Works fine when there is enough memory
         –  But keeping millions of un-inverted values in memory is impossible
         –  Additional cost to parse values (from String and to String)
  35. DocValues
      •  Dense column-based storage
         –  (1 value per document and 1 column per field and segment)
      •  Accepts primitives
      •  No conversion from/to String needed
      •  Loads 80x-100x faster than building a FieldCache
      •  All the work is done during indexing
      •  DocValue fields can be indexed and stored too
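
To make the "dense column-based storage" idea on slide 35 concrete, here is a minimal sketch of a per-field column addressed by document id, used to compute the revenue projection described on slide 32. It does not use Lucene's DocValues API; Python's `array` module stands in for the typed column, and the `ColumnStore` class, the field names, and the sample numbers are assumptions made for illustration.

```python
from array import array


class ColumnStore:
    """One dense, typed column per field, indexed directly by document id."""

    def __init__(self, num_docs):
        self.num_docs = num_docs
        self.columns = {}

    def add_column(self, field, values):
        """Store one primitive value per document; no string parsing involved."""
        if len(values) != self.num_docs:
            raise ValueError("a dense column needs exactly one value per document")
        self.columns[field] = array("d", values)


def projected_revenue(store, doc_ids):
    """Sum propensity * dollar amount over the documents in a segment."""
    propensity = store.columns["propensity"]
    dollars = store.columns["dollars"]
    return sum(propensity[doc_id] * dollars[doc_id] for doc_id in doc_ids)


store = ColumnStore(num_docs=4)
store.add_column("propensity", [0.5, 0.25, 0.0, 1.0])
store.add_column("dollars", [100.0, 200.0, 50.0, 10.0])

segment_doc_ids = [0, 1, 3]  # doc ids matched by some segment query
revenue = projected_revenue(store, segment_doc_ids)
print(revenue)
```

Each lookup is a direct offset into a column rather than the per-document pointer chase that slide 33 describes for stored fields, which is the property that matters when a segment contains millions of documents.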
  36. Architecture (Indexer Architecture, Serving Architecture); Load Balanced Model; Next Steps - Distributed Model; DocValues; Lessons Learnt; Why not use an existing solution?
  37. Lessons Learnt: Indexing
      •  Reuse index writers, field and document instances
      •  Create many partitions and merge them in a different process
      •  Rebuild (bootstrap) the entire index if possible
      •  Use partial updates with caution
      •  Analyze the index
  38. Lessons Learnt: Serving
      •  Reuse a single instance of IndexSearcher
      •  Limit usage of stored fields and term vectors
      •  Plan for load balancing and failover
      •  Cache term frequencies
      •  Use different machines for serving and indexing
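
The indexing lesson "create many partitions and merge them in a different process" (and the shard and Index Merger boxes on slide 22) can be sketched as two separate steps. This is a conceptual model in plain Python, not the Hadoop job or Lucene's merge machinery; partitioning by `member_id % num_partitions` and the function names are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical input records: (member id, attributes).
RECORDS = [
    (101, {"state": "California", "occupation": "Engineer"}),
    (102, {"state": "Nevada", "occupation": "HR Manager"}),
    (103, {"state": "California", "occupation": "Engineer"}),
]


def partition_of(member_id, num_partitions):
    """Assumed partitioning scheme: modulo on the member id."""
    return member_id % num_partitions


def build_partial_indexes(records, num_partitions):
    """Build one small inverted index per partition, independently of the others."""
    partials = [defaultdict(set) for _ in range(num_partitions)]
    for member_id, attrs in records:
        partial = partials[partition_of(member_id, num_partitions)]
        for field, value in attrs.items():
            partial[(field, value)].add(member_id)
    return partials


def merge_indexes(partials):
    """Separate merge step: union the postings of every partial index."""
    merged = defaultdict(set)
    for partial in partials:
        for term, postings in partial.items():
            merged[term] |= postings
    return dict(merged)


partials = build_partial_indexes(RECORDS, num_partitions=3)
merged = merge_indexes(partials)
print(sorted(merged[("state", "California")]))
```

Because each partial index is built without looking at the others, the build step parallelises trivially, and the merge can run later on different hardware without slowing the builders.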
  39. Architecture (Indexer Architecture, Serving Architecture); Load Balanced Model; Next Steps - Distributed Model; DocValues; Lessons Learnt; Why not use an existing solution?
  40. Why not use existing solutions?
      Existing solution 1:
      •  Doesn’t allow dynamic schema
      •  Difficult to bootstrap indexes built in Hadoop
      •  Indexing elevates query latency
      Existing solution 2:
      •  Doesn’t allow dynamic schema
      •  Difficult to bootstrap indexes built in Hadoop
      •  Larger memory overhead
      •  Comparatively slow
  41. Questions?
      More info: data.linkedin.com