How Lucene Powers LinkedIn Segmentation & Targeting Platform
Lucene/SOLR Revolution EU, November 2013
Hien Luu, Raj Rangaswamy

Presented by Hien Luu, Technical Lead, LinkedIn
Rajasekaran Rangaswamy, LinkedIn

For internet companies, marketing campaigns play an important role in acquiring new customers, retaining and engaging existing customers, and promoting new products. The LinkedIn segmentation and targeting platform helps marketing teams quickly create member segments based on member attributes, using nested predicate expressions that range from simple to complex. Once segments are created, the qualifying members are targeted with marketing campaigns.

Lucene is a key piece of technology in this platform. This session will cover how we leverage Hadoop to efficiently build Lucene indexes for a large and growing member attribute data set of 225 million members, and how Lucene is used to create segments based on complex nested predicate expressions. This presentation will also share some of the lessons we learned and challenges we encountered from using Lucene to search over large data sets.


How Lucene Powers the LinkedIn Segmentation and Targeting Platform

  1. How Lucene Powers LinkedIn Segmentation & Targeting Platform. Lucene/SOLR Revolution EU, November 2013. Hien Luu, Raj Rangaswamy. ©2013 LinkedIn Corporation. All Rights Reserved.
  2. About Us: Hien Luu, Rajasekaran Rangaswamy
  3. Agenda
     §  Little bit about LinkedIn
     §  Segmentation & Targeting Platform Overview
     §  How Lucene powers Segmentation & Targeting Platform
     §  Q&A
  4. Our Mission: Connect the world’s professionals to make them more productive and successful. Our Vision: Create economic opportunity for every professional in the world. Members First!
  5. The world’s largest professional network
     •  Over 65% of members are now international
     •  >30M
     •  >90% of Fortune 100 companies use LinkedIn Talent Solutions to hire
     •  >3M Company Pages
     •  19 languages
     •  >5.7B professional searches in 2012
  6. Other Company Facts
     •  Headquartered in Mountain View, Calif., with offices around the world
     •  LinkedIn has ~4200 full-time employees located around the world
     Source: http://press.linkedin.com/about
  7. Segmentation & Targeting
  8. Segmentation & Targeting
  9. Segmentation & Targeting: Bhaskar Ghosh; Attribute types
  10. Segmentation & Targeting
      1. Create attributes: Name, Email, State, Occupation, etc.
      2. Attributes added to table:
         Name        Email             State       Occupation
         John Smith  jsmith@blah.com   California  Engineer
         Jane Smith  smithj@mail.com   Nevada      HR Manager
         Jane Doe    jdoe@email.com    California  Engineer
         …
      3. Create target segment (California, Engineer):
         Name        Email             State       Occupation
         John Smith  jsmith@blah.com   California  Engineer
         Jane Doe    jdoe@email.com    California  Engineer
      4. Export list & send to vendor
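Steps 2 and 3 on this slide amount to a conjunctive filter over an attribute table. A minimal Python sketch using the three sample rows from the slide; the `build_segment` helper and the dict-based row layout are illustrative assumptions, not the platform's actual API:

```python
# Sample rows copied from the slide's attribute table.
MEMBERS = [
    {"name": "John Smith", "email": "jsmith@blah.com", "state": "California", "occupation": "Engineer"},
    {"name": "Jane Smith", "email": "smithj@mail.com", "state": "Nevada", "occupation": "HR Manager"},
    {"name": "Jane Doe", "email": "jdoe@email.com", "state": "California", "occupation": "Engineer"},
]


def build_segment(rows, **criteria):
    """Return the rows whose attributes equal every given criterion (logical AND)."""
    return [row for row in rows if all(row.get(k) == v for k, v in criteria.items())]


# Hypothetical helper call: target segment "California, Engineer".
segment = build_segment(MEMBERS, state="California", occupation="Engineer")
names = [row["name"] for row in segment]
```

With the slide's data, `names` holds John Smith and Jane Doe, which matches the target segment shown in step 3.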
  11. Segmentation & Targeting
      §  Business definition
         –  Business would like to launch new campaigns often
         –  Business would like to specify targeting criteria using an arbitrary set of attributes
         –  Attributes need to be computed to fulfill the targeting criteria
         –  The attribute data resides on Hadoop or TD
         –  Business is most comfortable with a SQL-like language
  12. Segmentation & Targeting (diagram labels): Attribute Computation Engine; Attribute Serving Engine
  13. Segmentation & Targeting (diagram labels): Attribute Computation Engine; attribute consolidation; self-service; support various data sources; attribute availability
  14. Segmentation & Targeting (diagram labels): Attribute computation; PB; TB; TB; ~238M; ~440
  15. Segmentation & Targeting (diagram labels): Attribute Serving Engine; build segments; self-service; attribute predicate expression; build lists
  16. Segmentation & Targeting (diagram labels): Serving Engine; count; filter; complex sum expressions; LinkedIn Member Attribute table (~238M, ~440)
  17. LinkedIn Segmentation & Targeting Platform
      •  Who are the job seekers?
      •  Who are the LinkedIn Talent Solution prospects in Europe?
      •  Who are North American recruiters that don’t work for a competitor?
  18. LinkedIn Segmentation & Targeting Platform: Complex tree-like attribute predicate expressions
  19. Agenda
      §  Architecture
         –  Indexer Architecture
         –  Serving Architecture
      §  Load Balanced Model
      §  Next Steps - Distributed Model
      §  DocValues
      §  Lessons Learnt
      §  Why not use an existing solution?
  20. Architecture (diagram labels): Attribute Serving Engine; Attribute Computation Engine; Data Storage Layer; Attribute Indexing; Attribute Creation Engine; Attribute Serving Engine; Attribute Materialization Engine; Attribute Metastore
  21. Indexer (diagram labels, in slide order): Mapper; mysql attribute store; Avro data in HDFS; Attribute Definitions; HDFS; Hadoop Indexer MR; shard 1; shard 2; Index Merger; shard n; K => AvroKey<GenericRecord>, V => AvroValue<NullWritable>; Reducer; K => NullWritable, V => LuceneDocumentWrapper; LuceneOutputFormat; RecordWriter; LuceneDocumentWrapper; Document; Web Servers; Index
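The indexer diagram describes a MapReduce job that turns member records into per-shard indexes. The Python sketch below imitates that shape in memory. It is not Hadoop or Lucene code: the modulo partitioning rule, the record layout, and the function names are all assumptions, since the slide does not say how records are assigned to shards.

```python
from collections import defaultdict


def map_phase(records, num_shards):
    """Assign each member record to a shard (assumed rule: member_id modulo shard count)."""
    shards = defaultdict(list)
    for record in records:
        shards[record["member_id"] % num_shards].append(record)
    return shards


def reduce_phase(shard_records):
    """Build one shard's inverted index: (attribute, value) -> sorted list of member ids."""
    index = defaultdict(list)
    for record in sorted(shard_records, key=lambda r: r["member_id"]):
        for attr, value in record.items():
            if attr != "member_id":
                index[(attr, value)].append(record["member_id"])
    return dict(index)


# Made-up sample records for illustration.
records = [
    {"member_id": 1, "state": "California"},
    {"member_id": 2, "state": "Nevada"},
    {"member_id": 3, "state": "California"},
    {"member_id": 4, "state": "California"},
]
shard_indexes = {sid: reduce_phase(recs) for sid, recs in map_phase(records, 2).items()}
```

Each entry of `shard_indexes` stands in for one of the "shard 1 … shard n" outputs on the slide; in the real pipeline those would be Lucene index directories written to HDFS and later combined by the Index Merger.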
  22. Serving (diagram labels): JSON Predicate Expression; JSON Lucene Query Parser; Inverted Index (shown three times); Segment & List
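The serving path takes a JSON predicate expression, parses it into a query, and runs it against an inverted index. A toy Python sketch of that idea follows. The JSON schema (`op`, `attr`, `value`, `children`, `child`) is hypothetical, and plain Python sets stand in for Lucene posting lists; the real parser produces Lucene queries instead.

```python
import json

# Toy inverted index: (attribute, value) -> set of member ids (made-up data).
INDEX = {
    ("state", "California"): {1, 3, 4},
    ("state", "Nevada"): {2},
    ("occupation", "Engineer"): {1, 2, 3},
    ("occupation", "Recruiter"): {4},
}
ALL_MEMBERS = {1, 2, 3, 4}


def evaluate(node):
    """Recursively evaluate a predicate tree into the set of matching member ids."""
    op = node["op"]
    if op == "term":
        return set(INDEX.get((node["attr"], node["value"]), set()))
    if op == "not":
        return ALL_MEMBERS - evaluate(node["child"])
    child_sets = [evaluate(child) for child in node["children"]]
    if op == "and":
        return set.intersection(*child_sets)
    if op == "or":
        return set.union(*child_sets)
    raise ValueError(f"unknown operator: {op}")


# Hypothetical expression: members in California who are not recruiters.
expression = json.loads("""
{"op": "and", "children": [
  {"op": "term", "attr": "state", "value": "California"},
  {"op": "not", "child": {"op": "term", "attr": "occupation", "value": "Recruiter"}}
]}
""")
matches = evaluate(expression)
```

Because every node returns a set, arbitrarily deep AND/OR/NOT trees compose naturally, which is the property the "complex tree-like attribute predicate expressions" slide relies on.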
  23. Agenda: Architecture (Indexer Architecture, Serving Architecture); Load Balanced Model; Next Steps - Distributed Model; DocValues; Lessons Learnt; Why not use an existing solution?
  24. Serving – Load Balanced Model (diagram): HTTP Request; Load Balancer; Web Server 1 (Shard 1); Web Server 2 (Shard 2); Web Server n (Shard n); Shared Drive
  25. Serving – Load Balanced Model: But wait…
      •  Is load balancing alone good enough?
      •  What about distribution and failover?
  26. Agenda: Architecture (Indexer Architecture, Serving Architecture); Load Balanced Model; Next Steps - Distributed Model; DocValues; Lessons Learnt; Why not use an existing solution?
  27. Next Steps - Distributed Model
      •  A generic cluster management framework
      •  Used to manage partitioned and replicated resources in distributed systems
      •  Built on top of ZooKeeper; hides the complexity of ZK primitives
      •  Provides distributed features such as leader election, two-phase commit, etc. via a state machine model
      http://helix.incubator.apache.org/
  28. Next Steps - Distributed Model (diagram): HTTP Request; Load Balancer; Scatter Gather
      Web Server 1      Web Server 2      Web Server 3
      Shard 1 active    Shard 2 active    Shard 3 active
      Shard 2 standby   Shard 3 standby   Shard 1 standby
  29. Next Steps - Distributed Model (diagram): HTTP Request; Load Balancer; Scatter Gather
      Web Server 1      Web Server 2      Web Server 3
      Shard 1 active    Shard 2 active    Shard 3 failure
      Shard 2 standby   Shard 3 active    Shard 1 failure
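The two diagrams above show scatter-gather over replicated shards, with a standby replica promoted when a server fails. A small Python sketch of that routing decision follows. The replica placement is read off the slides; the per-shard counts and the function names are made up, and this is not the Helix API, which manages these transitions through its state machine.

```python
# Replica placement as drawn on the slides: one active and one standby host per shard.
PLACEMENT = {
    "shard1": {"active": "web1", "standby": "web3"},
    "shard2": {"active": "web2", "standby": "web1"},
    "shard3": {"active": "web3", "standby": "web2"},
}

# Partial match counts a live replica of each shard would return (made-up numbers).
SHARD_COUNTS = {"shard1": 10, "shard2": 20, "shard3": 30}


def route(placement, failed_hosts):
    """Pick a serving host per shard, promoting the standby when the active host is down."""
    routing = {}
    for shard, replicas in placement.items():
        for role in ("active", "standby"):
            if replicas[role] not in failed_hosts:
                routing[shard] = replicas[role]
                break
        else:
            raise RuntimeError(f"no live replica for {shard}")
    return routing


def scatter_gather(placement, failed_hosts=frozenset()):
    """Fan the query out to one live replica per shard and sum the partial counts."""
    routing = route(placement, failed_hosts)
    total = sum(SHARD_COUNTS[shard] for shard in routing)
    return routing, total


# Reproduce the failure slide: Web Server 3 is down.
routing, total = scatter_gather(PLACEMENT, failed_hosts={"web3"})
```

With Web Server 3 down, shard 3 is served by its standby on Web Server 2, so the gathered total still covers all three shards.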
  30. Agenda: Architecture (Indexer Architecture, Serving Architecture); Load Balanced Model; Next Steps - Distributed Model; DocValues; Lessons Learnt; Why not use an existing solution?
  31. DocValues – Use Case
      •  Once segments are built, users want to forecast, that is, see a target revenue projection for the campaigns they want to run
      •  Campaigns can be run on various Revenue Models
      •  This involves adding per-member Propensity Scores and Dollar Amounts
  32. DocValues – Why not Stored Fields?
      •  Stored fields have one indirection per document, resulting in two disk seeks per document
         –  .fdx: fetch file pointer to field data
         –  .fdt: scan by id until field is found
      •  Performance cost quickly adds up when fetching millions of documents
  33. DocValues – Why not Field Cache?
      •  Is memory resident
      •  Works fine when there is enough memory
      •  But keeping millions of un-inverted values in memory is impossible
      •  Additional cost to parse values (from String and to String)
  34. DocValues
      •  Dense column-based storage (1 value per document and 1 column per field and segment)
      •  Accepts primitives
      •  No conversion from/to String needed
      •  Loads 80x-100x faster than building a FieldCache
      •  All the work is done during indexing
      •  DocValue fields can be indexed and stored too
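To illustrate why a dense per-field column suits the forecasting use case, the Python sketch below keeps one numeric array per field, indexed by document id, and aggregates over a segment's matching documents. It only imitates the column layout and does not use Lucene. The field names, the sample values, and the formula (propensity multiplied by dollar amount) are assumptions; the slides only say that per-member propensity scores and dollar amounts are added up.

```python
from array import array

# One dense column per field, indexed by document id (made-up values).
propensity = array("d", [0.5, 0.25, 1.0, 0.0])
dollars = array("d", [100.0, 200.0, 50.0, 400.0])


def forecast(doc_ids):
    """Assumed revenue model: sum of propensity * dollar amount over the matching documents."""
    return sum(propensity[d] * dollars[d] for d in doc_ids)


# Document ids that matched a segment query (hypothetical).
segment_doc_ids = [0, 1, 2]
expected_revenue = forecast(segment_doc_ids)
```

Each lookup is a direct array access of a primitive value, with no per-document indirection and no String parsing, which is the contrast the stored-fields and FieldCache slides draw.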
  35. Agenda: Architecture (Indexer Architecture, Serving Architecture); Load Balanced Model; Next Steps - Distributed Model; DocValues; Lessons Learnt; Why not use an existing solution?
  36. Lessons Learnt
      Indexing
      •  Reuse index writers, field and document instances
      •  Create many partitions and merge them in a different process
      •  Rebuild (bootstrap) entire index if possible
      •  Use partial updates with caution
      •  Analyze the index
      Serving
      •  Reuse a single instance of IndexSearcher
      •  Limit usage of stored fields and term vectors
      •  Plan for load balancing and failover
      •  Cache term frequencies
      •  Use different machines for serving and indexing
  37. Agenda: Architecture (Indexer Architecture, Serving Architecture); Load Balanced Model; Next Steps - Distributed Model; DocValues; Lessons Learnt; Why not use an existing solution?
  38. Why not use an existing solution?
      First group of bullets on the slide:
      •  Doesn’t allow dynamic schema
      •  Difficult to bootstrap indexes built in Hadoop
      •  Indexing elevates query latency
      Second group of bullets on the slide:
      •  Doesn’t allow dynamic schema
      •  Difficult to bootstrap indexes built in Hadoop
      •  Larger memory overhead
      •  Comparatively slow
  39. Questions? More info: data.linkedin.com