How Lucene Powers the LinkedIn Segmentation and Targeting Platform

Presented by Hien Luu, Technical Lead, LinkedIn
Rajasekaran Rangaswamy, LinkedIn

For internet companies, marketing campaigns play an important role in acquiring new customers, retaining and engaging existing customers, and promoting new products. The LinkedIn segmentation and targeting platform helps marketing teams quickly and easily create member segments based on member attributes, using nested predicate expressions that range from simple to complex. Once segments are created, the qualifying members are targeted with marketing campaigns.

Lucene is a key piece of technology in this platform. This session will cover how we leverage Hadoop to efficiently build Lucene indexes for a large and growing member attribute data set of 225 million members, and how Lucene is used to create segments based on complex nested predicate expressions. This presentation will also share some of the lessons we learned and challenges we encountered from using Lucene to search over large data sets.

Presentation Transcript

  • How Lucene Powers LinkedIn Segmentation & Targeting Platform. Lucene/SOLR Revolution EU, November 2013. Hien Luu, Raj Rangaswamy. ©2013 LinkedIn Corporation. All Rights Reserved.
  • About Us: Hien Luu, Rajasekaran Rangaswamy
  • Agenda
    – A little bit about LinkedIn
    – Segmentation & Targeting Platform Overview
    – How Lucene powers Segmentation & Targeting Platform
    – Q&A
  • Our Mission: Connect the world’s professionals to make them more productive and successful. Our Vision: Create economic opportunity for every professional in the world. Members First!
  • The world’s largest professional network
    – Over 65% of members are now international
    – >30M
    – >90% of Fortune 100 companies use LinkedIn Talent Solutions to hire
    – >3M Company Pages
    – 19 languages
    – >5.7B professional searches in 2012
  • Other Company Facts
    – Headquartered in Mountain View, Calif., with offices around the world
    – LinkedIn has ~4200 full-time employees located around the world
    – Source: http://press.linkedin.com/about
  • Segmentation & Targeting
  • Segmentation & Targeting
  • Segmentation & Targeting Bhaskar Ghosh Attribute types
  • Segmentation & Targeting
    1. Create attributes: Name, Email, State, Occupation, etc.
    2. Attributes added to table:
       Name        | Email            | State       | Occupation
       John Smith  | jsmith@blah.com  | California  | Engineer
       Jane Smith  | smithj@mail.com  | Nevada      | HR Manager
       Jane Doe    | jdoe@email.com   | California  | Engineer
       …
    3. Create target segment (California, Engineer):
       Name        | Email            | State       | Occupation
       John Smith  | jsmith@blah.com  | California  | Engineer
       Jane Doe    | jdoe@email.com   | California  | Engineer
    4. Export list and send to vendor
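The four steps on this slide can be sketched in a few lines of Python. This is only an illustration of the workflow using the slide's example rows; the `build_segment` helper and the dictionary layout are assumptions made for the example, not part of the platform.

```python
# Hypothetical in-memory attribute table, mirroring the rows on the slide.
members = [
    {"name": "John Smith", "email": "jsmith@blah.com", "state": "California", "occupation": "Engineer"},
    {"name": "Jane Smith", "email": "smithj@mail.com", "state": "Nevada", "occupation": "HR Manager"},
    {"name": "Jane Doe", "email": "jdoe@email.com", "state": "California", "occupation": "Engineer"},
]


def build_segment(rows, **criteria):
    """Return the rows whose attributes equal every given criterion."""
    return [row for row in rows if all(row.get(k) == v for k, v in criteria.items())]


# Step 3: create the target segment (California, Engineer).
segment = build_segment(members, state="California", occupation="Engineer")

# Step 4: export the list that would be sent to the vendor.
export_list = [(row["name"], row["email"]) for row in segment]
```

A linear scan like this is fine for three rows; the rest of the talk is about doing the same selection over a very large member attribute table with an inverted index.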
  • Segmentation & Targeting: business definition
    – The business would like to launch new campaigns often
    – The business would like to specify targeting criteria using an arbitrary set of attributes
    – Attributes need to be computed to fulfill the targeting criteria
    – The attribute data resides on Hadoop or TD
    – The business is most comfortable with a SQL-like language
  • Segmentation & Targeting: Attribute Computation Engine, Attribute Serving Engine
  • Segmentation & Targeting: Attribute Computation Engine
    – Self-service
    – Attribute consolidation
    – Support various data sources
    – Attribute availability
  • Segmentation & Targeting: attribute computation (diagram labels: PB, TB, TB; ~238M; ~440)
  • Segmentation & Targeting: Attribute Serving Engine
    – Self-service
    – Attribute predicate expression
    – Build segments
    – Build lists
  • Segmentation & Targeting: the Serving Engine supports count, filter, sum, and complex expressions over the LinkedIn Member Attribute table (~238M, ~440)
  • LinkedIn Segmentation & Targeting Platform
    – Who are the job seekers?
    – Who are the LinkedIn Talent Solution prospects in Europe?
    – Who are the North American recruiters that don’t work for a competitor?
  • LinkedIn Segmentation & Targeting Platform: complex tree-like attribute predicate expressions
  • Agenda
    – Architecture
        – Indexer Architecture
        – Serving Architecture
    – Load Balanced Model
    – Next Steps - Distributed Model
    – DocValues
    – Lessons Learnt
    – Why not use an existing solution?
  • Architecture (diagram labels): Attribute Serving Engine, Attribute Computation Engine, Attribute Creation Engine, Attribute Materialization Engine, Attribute Indexing, Attribute Metastore, Data Storage Layer
  • Indexer
    – Inputs: attribute definitions (mysql attribute store), Avro data in HDFS
    – Hadoop Indexer MR
        – Mapper: K => AvroKey<GenericRecord>, V => AvroValue<NullWritable>
        – Reducer: K => NullWritable, V => LuceneDocumentWrapper
        – LuceneOutputFormat, RecordWriter: LuceneDocumentWrapper, Document
    – Output in HDFS: shard 1, shard 2, … shard n
    – Index Merger, Index, Web Servers
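The indexer flow on this slide (partition the records, build one index per shard, then merge) can be approximated in plain Python. This is a minimal sketch under stated assumptions, not the Hadoop job itself: the CRC32 partitioning, the shard count, the record layout, and the function names are all hypothetical, and a dictionary of posting lists stands in for a Lucene index.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 3  # assumed shard count, for illustration only


def shard_for(member_id, num_shards=NUM_SHARDS):
    """Stable hash partitioning, so a member always lands in the same shard."""
    return zlib.crc32(str(member_id).encode("utf-8")) % num_shards


def map_phase(records, num_shards=NUM_SHARDS):
    """Group records by shard (stand-in for the mapper plus the shuffle)."""
    by_shard = defaultdict(list)
    for record in records:
        by_shard[shard_for(record["member_id"], num_shards)].append(record)
    return by_shard


def reduce_phase(records):
    """Build one shard's inverted index: (attribute, value) -> sorted member ids."""
    postings = defaultdict(set)
    for record in records:
        for attribute, value in record.items():
            if attribute != "member_id":
                postings[(attribute, value)].add(record["member_id"])
    return {term: sorted(ids) for term, ids in postings.items()}


def merge_indexes(indexes):
    """Merge per-shard indexes into one (stand-in for the index merger step)."""
    merged = defaultdict(set)
    for index in indexes:
        for term, ids in index.items():
            merged[term].update(ids)
    return {term: sorted(ids) for term, ids in merged.items()}


records = [
    {"member_id": 1, "state": "California", "occupation": "Engineer"},
    {"member_id": 2, "state": "Nevada", "occupation": "HR Manager"},
    {"member_id": 3, "state": "California", "occupation": "Engineer"},
]
shards = {shard: reduce_phase(recs) for shard, recs in map_phase(records).items()}
merged = merge_indexes(shards.values())
```

Building the partitions independently and merging them in a separate step is what lets the expensive work run in parallel, away from the servers that answer queries.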
  • Serving: JSON predicate expression → JSON Lucene Query Parser → inverted indexes → segment & list
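To make the serving path concrete, here is a small sketch of how a nested JSON predicate expression could be evaluated against an inverted index. The JSON schema (`op`, `operands`, `operand`, `attribute`, `value`), the sample index, and the `evaluate` function are assumptions made for this example; the talk does not show the platform's actual expression format or parser.

```python
import json

# Hypothetical inverted index: (attribute, value) -> set of member ids.
index = {
    ("region", "North America"): {1, 2, 3, 4},
    ("region", "Europe"): {5, 6},
    ("occupation", "Recruiter"): {2, 3, 5},
    ("employer", "Competitor"): {3},
}
ALL_MEMBERS = set().union(*index.values())


def evaluate(node):
    """Recursively evaluate a predicate tree into a set of member ids."""
    op = node["op"]
    if op == "term":
        return set(index.get((node["attribute"], node["value"]), set()))
    if op == "not":
        return ALL_MEMBERS - evaluate(node["operand"])
    operands = [evaluate(child) for child in node["operands"]]
    if op == "and":
        return set.intersection(*operands)
    if op == "or":
        return set.union(*operands)
    raise ValueError("unknown operator: " + op)


# "North American recruiters that don't work for a competitor"
expression = json.loads("""
{
  "op": "and",
  "operands": [
    {"op": "term", "attribute": "region", "value": "North America"},
    {"op": "term", "attribute": "occupation", "value": "Recruiter"},
    {"op": "not", "operand": {"op": "term", "attribute": "employer", "value": "Competitor"}}
  ]
}
""")
segment = sorted(evaluate(expression))
```

Each leaf resolves to a posting list and each inner node is a set operation, which is the same shape as a tree of Boolean clauses in a search engine query.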
  • Serving – Load Balanced Model (diagram labels): HTTP Request; Load Balancer; Web Server 1 / Shard 1, Web Server 2 / Shard 2, … Web Server n / Shard n; Shared Drive
  • Serving – Load Balanced Model: But wait…
    – Is load balancing alone good enough?
    – What about distribution and failover?
  • Next Steps - Distributed Model: Apache Helix (http://helix.incubator.apache.org/)
    – A generic cluster management framework
    – Used to manage partitioned and replicated resources in distributed systems
    – Built on top of ZooKeeper, hiding the complexity of ZK primitives
    – Provides distributed features such as leader election, two-phase commit, etc. via a state machine model
  • Next Steps - Distributed Model: HTTP request → load balancer → scatter-gather
    – Web Server 1: Shard 1 active, Shard 2 standby
    – Web Server 2: Shard 2 active, Shard 3 standby
    – Web Server 3: Shard 3 active, Shard 1 standby
  • Next Steps - Distributed Model (Web Server 3 fails): HTTP request → load balancer → scatter-gather
    – Web Server 1: Shard 1 active, Shard 2 standby
    – Web Server 2: Shard 2 active, Shard 3 active (promoted from standby)
    – Web Server 3: Shard 3 failure, Shard 1 failure
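The failover shown on the two distributed-model slides can be sketched as a small state transition plus a scatter-gather loop. This is a toy model, not Helix: the `placement` table mirrors the slide, while `fail_server`, `scatter_gather`, the `search_shard` callback, and the per-shard results are hypothetical names and data introduced for the example.

```python
# Placement from the slide: each server hosts one active shard and the
# standby replica of another shard.
placement = {
    "shard1": {"active": "web1", "standby": "web3"},
    "shard2": {"active": "web2", "standby": "web1"},
    "shard3": {"active": "web3", "standby": "web2"},
}


def fail_server(placement, server):
    """Return a new placement with every replica on the failed server removed."""
    updated = {}
    for shard, replicas in placement.items():
        active, standby = replicas["active"], replicas["standby"]
        if active == server:
            active, standby = standby, None  # promote the standby replica
        elif standby == server:
            standby = None  # the shard keeps serving, but without a standby
        updated[shard] = {"active": active, "standby": standby}
    return updated


def scatter_gather(placement, search_shard):
    """Query the active replica of every shard and combine the results."""
    hits = []
    for shard in sorted(placement):
        server = placement[shard]["active"]
        if server is None:
            raise RuntimeError("no live replica for " + shard)
        hits.extend(search_shard(server, shard))
    return sorted(hits)


# Hypothetical per-shard results and a fake search call that records routing.
matches = {"shard1": [1, 4], "shard2": [2, 5], "shard3": [3]}
routed = []


def search_shard(server, shard):
    routed.append((server, shard))
    return matches[shard]


before = scatter_gather(placement, search_shard)
after_failure = fail_server(placement, "web3")
after = scatter_gather(after_failure, search_shard)
```

The query result is the same before and after the failure; only the routing changes, with shard 3 now served by the second web server.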
  • DocValues – Use Case
    – Once segments are built, users want to forecast, that is, to see a target revenue projection for the campaigns they want to run
    – Campaigns can be run on various revenue models
    – This involves adding per-member propensity scores and dollar amounts
  • DocValues – Why not Stored Fields?
    – Stored fields have one indirection per document, resulting in two disk seeks per document:
        – .fdx: use the document ID to fetch the file pointer to the field data
        – .fdt: scan by id until the field is found
    – The performance cost quickly adds up when fetching millions of documents
  • DocValues – Why not Field Cache?
    – Is memory resident
    – Works fine when there is enough memory
    – But keeping millions of un-inverted values in memory is impossible
    – Additional cost to parse values (from String and to String)
  • DocValues
    – Dense column-based storage (1 value per document and 1 column per field and segment)
    – Accepts primitives
    – No conversion from/to String needed
    – Loads 80x-100x faster than building a FieldCache
    – All the work is done during indexing
    – DocValue fields can be indexed and stored too
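The difference between row-oriented stored fields and column-oriented DocValues can be illustrated without Lucene. In this sketch, which is only an analogy, a list of dictionaries plays the role of stored fields and a typed `array` per field plays the role of a DocValues column; the field names, sample values, and the `column_sum` helper are assumptions made for the example.

```python
from array import array

# Row-oriented layout (like stored fields): one record per document, so
# reading a single field still means visiting the whole record.
stored_fields = [
    {"name": "John Smith", "propensity": 0.5, "dollars": 100.0},
    {"name": "Jane Smith", "propensity": 0.25, "dollars": 200.0},
    {"name": "Jane Doe", "propensity": 0.75, "dollars": 40.0},
]

# Column-oriented layout (like DocValues): one dense array of primitives per
# field, addressed by document id and built once at indexing time.
propensity_column = array("d", (doc["propensity"] for doc in stored_fields))
dollars_column = array("d", (doc["dollars"] for doc in stored_fields))


def column_sum(column, doc_ids):
    """Add up one field over the documents in a segment, with no string parsing."""
    return sum(column[doc_id] for doc_id in doc_ids)


# Document ids that matched a segment query (hypothetical).
segment_doc_ids = [0, 2]
total_propensity = column_sum(propensity_column, segment_doc_ids)
total_dollars = column_sum(dollars_column, segment_doc_ids)
```

Because each column holds primitives in document order, summing a field over millions of matching documents is a sequence of array lookups rather than one record fetch per document.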
  • Lessons Learnt
    – Indexing
        – Reuse index writers, field and document instances
        – Create many partitions and merge them in a different process
        – Rebuild (bootstrap) the entire index if possible
        – Use partial updates with caution
        – Analyze the index
    – Serving
        – Reuse a single instance of IndexSearcher
        – Limit usage of stored fields and term vectors
        – Plan for load balancing and failover
        – Cache term frequencies
        – Use different machines for serving and indexing
  • Why not use an existing solution?
    – Doesn’t allow dynamic schema
    – Difficult to bootstrap indexes built in Hadoop
    – Indexing elevates query latency
    – Doesn’t allow dynamic schema
    – Difficult to bootstrap indexes built in Hadoop
    – Larger memory overhead
    – Comparatively slow
  • Questions? More info: data.linkedin.com