How Lucene Powers the LinkedIn Segmentation and Targeting Platform
 

    Presentation Transcript

    • HOW LUCENE POWERS LINKEDIN SEGMENTATION & TARGETING PLATFORM Hien Luu & Raj Rangaswamy
    • About Us: Hien Luu, Rajasekaran Rangaswamy
    • Agenda
        • Little bit about LinkedIn
        • Segmentation & Targeting Platform Overview
        • How Lucene powers Segmentation & Targeting Platform
        • Q&A
    • Our Mission: Connect the world’s professionals to make them more productive and successful. Our Vision: Create economic opportunity for every professional in the world. Members First!
    • The world’s largest professional network
        • Over 65% of members are now international
        • >30M
        • >90% of Fortune 100 Companies use LinkedIn Talent Soln to hire
        • >3M Company Pages
        • 19 Languages
        • >5.7B Professional searches in 2012
    • Other Company Facts
        • Headquartered in Mountain View, Calif., with offices around the world!
        • LinkedIn has ~4200 full-time employees located around the world
    • Segmentation & Targeting Platform Overview
    • Segmentation & Targeting Platform Overview
        1. Create attributes: Name, Email, State, Occupation, etc.
        2. Attributes Added to Table
             Name        | Email            | State      | Occupation
             John Smith  | jsmith@blah.com  | California | Engineer
             Jane Smith  | smithj@mail.com  | Nevada     | HR Manager
             Jane Doe    | jdoe@email.com   | California | Engineer
        3. Create Target Segment: California, Engineer
             Name        | Email            | State      | Occupation
             John Smith  | jsmith@blah.com  | California | Engineer
             Jane Doe    | jdoe@email.com   | California | Engineer
        4. Export List & Send Vendor …
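The four steps on this slide can be sketched in a few lines. This is only an illustration of the idea, not the platform's implementation: the `members` list reuses the slide's example rows, while `build_segment` and `export_list` are hypothetical names introduced here.

```python
# Hypothetical in-memory attribute table holding the slide's example rows.
members = [
    {"name": "John Smith", "email": "jsmith@blah.com",
     "state": "California", "occupation": "Engineer"},
    {"name": "Jane Smith", "email": "smithj@mail.com",
     "state": "Nevada", "occupation": "HR Manager"},
    {"name": "Jane Doe", "email": "jdoe@email.com",
     "state": "California", "occupation": "Engineer"},
]


def build_segment(rows, **criteria):
    """Return the rows whose attributes equal every given criterion."""
    return [r for r in rows
            if all(r.get(attr) == value for attr, value in criteria.items())]


# Step 3: create the target segment "California, Engineer".
segment = build_segment(members, state="California", occupation="Engineer")

# Step 4: export a list (here, just the email addresses).
export_list = [r["email"] for r in segment]
```

The rest of the deck is about doing this same filtering at the scale of hundreds of millions of members, which is where an inverted index comes in.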
    • Segmentation & Targeting Platform Overview
        • Business definition
            – Business would like to launch new campaigns often
            – Business would like to specify targeting criteria using an arbitrary set of attributes
            – Attributes need to be computed to fulfill the targeting criteria
            – The attribute data resides on Hadoop or TD
            – Business is most comfortable with a SQL-like language
    • Segmentation & Targeting Platform Overview: Attribute Computation Engine, Attribute Serving Engine
    • Segmentation & Targeting Platform Overview: Attribute Computation Engine
        • Attribute consolidation
        • Self-service
        • Support various data sources
        • Attribute availability
    • Segmentation & Targeting Platform Overview PB Attribute computation ~238M TB TB ~440
    • Segmentation & Targeting Platform Overview: Attribute Serving Engine
        • Build segments
        • Self-service
        • Attribute predicate expression
        • Build lists
    • Segmentation & Targeting Platform Overview: Attribute Serving Engine (~238M, ~440) supporting count, filter, sum, and complex expressions
    • Segmentation & Targeting Platform Overview
        • Who are the job seekers?
        • Who are the LinkedIn Talent Solution prospects in Europe?
        • Who are North American recruiters that don’t work for a competitor?
    • How Lucene powers Segmentation & Targeting Platform
    • How Lucene powers Segmentation & Targeting Platform
        • Architecture
            – Indexer Architecture
            – Serving Architecture
        • Load Balanced Model
        • Next Steps - Distributed Model
        • DocValues
        • Lessons Learnt
        • Why not use an existing solution?
    • Architecture (diagram): Attribute Serving Engine; Attribute Computation Engine; Data Storage Layer; Attribute Indexing; Attribute Creation Engine; Attribute Materialization Engine; Attribute Metastore
    • Architecture (indexer diagram)
        • Inputs: Attribute Definitions (mysql attribute store); Avro data in HDFS
        • Hadoop Indexer MR
            – Mapper: K => AvroKey<GenericRecord>, V => AvroValue<NullWritable>
            – Reducer: K => NullWritable, V => LuceneDocumentWrapper
            – LuceneOutputFormat, RecordWriter: LuceneDocumentWrapper, Document, Index
        • HDFS: shard 1, shard 2, ..., shard n
        • Index Merger
        • Web Servers
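As a rough illustration of the indexer pipeline in this diagram, the sketch below partitions records across shards, builds a small inverted index per shard, and then merges partial indexes. It is a plain-Python stand-in, not Hadoop or Lucene code. The modulo partitioning rule, the `RECORDS` sample, and the function names are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical input records (stand-ins for Avro records read from HDFS).
RECORDS = [
    {"id": 1, "state": "California", "occupation": "Engineer"},
    {"id": 2, "state": "Nevada", "occupation": "HR Manager"},
    {"id": 3, "state": "California", "occupation": "Engineer"},
]


def shard_of(member_id, num_shards):
    """Assumed partitioning rule: member id modulo the shard count."""
    return member_id % num_shards


def build_shard_indexes(records, num_shards):
    """Build one inverted index per shard: (field, value) -> set of ids."""
    shards = [defaultdict(set) for _ in range(num_shards)]
    for rec in records:
        index = shards[shard_of(rec["id"], num_shards)]
        for field, value in rec.items():
            if field != "id":
                index[(field, value)].add(rec["id"])
    return shards


def merge_indexes(indexes):
    """Merge partial indexes by taking the union of their posting sets."""
    merged = defaultdict(set)
    for index in indexes:
        for term, postings in index.items():
            merged[term] |= postings
    return merged


shards = build_shard_indexes(RECORDS, num_shards=2)
merged = merge_indexes(shards)
```

Building many small partitions in parallel and merging them in a separate step matches the advice given later under Lessons Learnt.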
    • Architecture (serving diagram): JSON Predicate Expression; JSON Lucene Query Parser; Inverted Index (three shown); Segment & List
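The serving path turns a JSON predicate expression into a query over inverted indexes. The slides do not show the JSON format, so the `term` / `and` / `or` / `not` shape below is purely hypothetical, and the evaluator works on a toy in-memory index rather than Lucene. It is meant only to show how such an expression can resolve to a set of matching documents.

```python
import json

# Toy inverted index: (field, value) -> set of document ids (assumed data).
INDEX = {
    ("state", "California"): {1, 3},
    ("state", "Nevada"): {2},
    ("occupation", "Engineer"): {1, 3},
    ("occupation", "HR Manager"): {2},
}
ALL_DOCS = {1, 2, 3}


def evaluate(node):
    """Recursively evaluate a single-key predicate node to a set of ids."""
    (op, arg), = node.items()
    if op == "term":
        (field, value), = arg.items()
        return set(INDEX.get((field, value), set()))
    if op == "and":
        result = set(ALL_DOCS)
        for child in arg:
            result &= evaluate(child)
        return result
    if op == "or":
        result = set()
        for child in arg:
            result |= evaluate(child)
        return result
    if op == "not":
        return ALL_DOCS - evaluate(arg)
    raise ValueError(f"unknown operator: {op}")


# Hypothetical expression: Californians who are not HR managers.
expression = json.loads(
    '{"and": [{"term": {"state": "California"}},'
    ' {"not": {"term": {"occupation": "HR Manager"}}}]}'
)
matching = evaluate(expression)
segment_size = len(matching)
```

In Lucene itself the same structure would typically map onto boolean queries over term queries; that mapping is not shown in the slides and is mentioned here only as the likely analogue.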
    • Serving – Load Balanced Model (diagram)
        • HTTP Request
        • Load Balancer
        • Web Server 1 (Shard 1), Web Server 2 (Shard 2), ..., Web Server n (Shard n)
        • Shared Drive
    • Serving – Load Balanced Model: But Wait…..
        • Is load balancing alone good enough?
        • What about distribution and failover?
    • Next Steps – Distributed Model (Apache Helix)
        • A generic cluster management framework
        • Manages partitioned and replicated resources in distributed systems
        • Built on top of Zookeeper; hides the complexity of ZK primitives
        • Provides distributed features such as leader election, two-phase commit etc. via a state machine model
        • http://helix.incubator.apache.org/
    • Next Steps – Distributed Model (diagram)
        • HTTP Request, Load Balancer, Scatter Gather
        • Web Server 1: Shard 1 active, Shard 2 standby
        • Web Server 2: Shard 2 active, Shard 3 standby
        • Web Server 3: Shard 3 active, Shard 1 standby
    • Next Steps – Distributed Model (diagram)
        • HTTP Request, Load Balancer, Scatter Gather
        • Web Server 1: Shard 1 active, Shard 2 standby
        • Web Server 2: Shard 2 active, Shard 3 active
        • Web Server 3: Shard 3 failure, Shard 1 failure
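The two diagrams above can be read as a scatter-gather query that still returns complete results when one web server fails. The sketch below simulates that behaviour in plain Python under simplifying assumptions. Failover here is a client-side retry against the next replica, whereas a cluster manager such as Helix would drive the standby-to-active transition itself. The shard contents, server names, and function names are hypothetical.

```python
# Replica placement follows the slide: each shard lists its active server
# first and its standby second. Document ids are made-up sample data.
SHARDS = {
    "shard1": {"docs": [1, 4, 7], "replicas": ["web1", "web3"]},
    "shard2": {"docs": [2, 5, 8], "replicas": ["web2", "web1"]},
    "shard3": {"docs": [3, 6, 9], "replicas": ["web3", "web2"]},
}


def query_replica(server, docs, predicate, down):
    """Count matching docs on one replica; fail if its server is down."""
    if server in down:
        raise ConnectionError(f"{server} is unavailable")
    return sum(1 for d in docs if predicate(d))


def scatter_gather(shards, predicate, down=frozenset()):
    """Query every shard once, falling back to the next replica on failure."""
    total, served_by = 0, {}
    for name, shard in shards.items():
        for server in shard["replicas"]:
            try:
                total += query_replica(server, shard["docs"], predicate, down)
            except ConnectionError:
                continue
            served_by[name] = server
            break
        else:
            raise RuntimeError(f"no live replica for {name}")
    return total, served_by


def is_odd(doc_id):
    return doc_id % 2 == 1


healthy = scatter_gather(SHARDS, is_odd)
degraded = scatter_gather(SHARDS, is_odd, down={"web3"})
```

With `web3` down, shard 3 is served by `web2`, mirroring the second diagram, and the total count is unchanged.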
    • DocValues – Use Case
        • Once segments are built, users want to forecast: to see a target revenue projection for the campaigns that they want to run
        • Campaigns can be run on various Revenue Models
        • This involves adding per-member Propensity Scores and Dollar Amounts
    • DocValues – Why not Stored Fields?
        • Stored fields have one indirection per document, resulting in two disk seeks per document
            – Document ID → .fdx: fetch file pointer to field data
            – .fdt: scan by id until field is found
        • Performance cost quickly adds up when fetching millions of documents
    • DocValues – Why not Stored Fields?
        • Why not use Field Cache?
            – Is memory resident
            – Works fine when there is enough memory
            – But keeping millions of un-inverted values in memory is impossible
            – Additional cost to parse values (from String and to String)
    • DocValues
        • Dense column based storage
            – (1 Value per Document and 1 Column per field and segment)
        • Accepts primitives
        • No conversion from/to String needed
        • Loads 80x-100x faster than building a FieldCache
        • All the work is done during Indexing
        • DocValue fields can be indexed and stored too
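To make the column-based layout concrete, the sketch below keeps one dense array per field, indexed by document id, and computes a revenue projection for a segment by reading primitives directly from those columns. It uses Python's `array` module as a stand-in and does not call Lucene. The field names, the sample values, and the "score times amount" formula are assumptions for illustration only.

```python
from array import array

# One dense column per field: position i holds the value for document i.
# Values are primitives (doubles), so no String parsing is needed on read.
propensity_score = array("d", [0.25, 0.5, 0.75, 1.0])
dollar_amount = array("d", [100.0, 200.0, 300.0, 400.0])


def projected_revenue(doc_ids):
    """Sum score * amount over a segment, one direct lookup per column."""
    return sum(propensity_score[d] * dollar_amount[d] for d in doc_ids)


# Hypothetical segment produced by an earlier query: documents 0 and 2.
segment_doc_ids = [0, 2]
projection = projected_revenue(segment_doc_ids)
```

Each lookup is a direct offset into a column, in contrast to the per-document indirection described for stored fields.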
    • Lessons Learnt – Indexing
        • Reuse index writers, field and document instances
        • Create many partitions and merge them in a different process
        • Rebuild (bootstrap) entire index if possible
        • Use partial updates with caution
        • Analyze the index
    • Lessons Learnt – Serving
        • Reuse a single instance of IndexSearcher
        • Limit usage of stored fields and term vectors
        • Plan for load balancing and failover
        • Cache term frequencies
        • Use different machines for serving and indexing
    • Why not use existing solutions?
        • Doesn’t allow dynamic schema
        • Difficult to bootstrap indexes built in Hadoop
        • Indexing elevates query latency

        • Doesn’t allow dynamic schema
        • Difficult to bootstrap indexes built in Hadoop
        • Larger memory overhead
        • Comparatively slow
    • Questions? More info: data.linkedin.com