We’re making great strides toward our mission:
- LinkedIn has over 225 million members, and we’re now adding more than two members per second. This is the fastest rate of absolute member growth in the company’s history.
- Sixty-four percent of LinkedIn members are currently located outside of the United States.
- LinkedIn counts executives from all 2012 Fortune 500 companies as members; its corporate talent solutions are used by 88 of the Fortune 100 companies.
- More than 2.9 million companies have LinkedIn Company Pages.
- LinkedIn members did over 5.7 billion professionally oriented searches on the platform in 2012.
[See http://press.linkedin.com/about for a complete list of LinkedIn facts and stats]
Email campaign & ad targeting use cases:
- Acquire new paid customers
- Retain and engage existing customers
- Promote new products
- Training and other important announcements
* Talk about the speed of changing segmentation and targeting criteria
Attribute types:
- Professional identity
- Social data
- Behavioral data
Given the business problem that Sid outlined, the solution we came up with has two parts:
- The first part computes attributes based on the attribute definitions.
- The second part serves the attribute values to define segments, effectively performing user segmentation.
The attribute computation engine needs to support these four high-level requirements:
- Self-service: there needs to be an easy way for someone on the business team to express the computational logic that computes a set of attributes for the needs of their marketing campaigns. The engine takes care of the complexity of executing that logic: when and how to run it, and where to store the computation result.
- Support various data sources: data lives in multiple places, Teradata and Hadoop, and we need to support both. Fortunately SQL and HiveQL are very similar.
- Attribute consolidation: once all the attributes are computed, they need to be consolidated into a single dataset to make it easy for everyone to consume and analyze.
- Attribute availability: register the dataset with Hive and copy the data onto the Teradata system for business folks to consume.
At a high level, the attribute computation engine needs to be able to compute attributes that come from different data sets, and some of these data sets are huge. This presents all kinds of interesting challenges, as you can imagine. The output of the computation engine is one big table: ~225M rows, one for each member, and ~240 columns, one for each attribute.
- Behavioral data: site engagement, online transactions, searches, comments, discussions, …
- Social data: connections, follows, endorsements
- Demographic data (from the member profile): location, gender, title, function, seniority, education
Self-service way to manage attributes:
- A web application that a member of the marketing operations or business analyst team can use to express the computation logic in the form of a SQL select statement. We call that an attribute definition.
- The SQL statement is either a Teradata SQL statement or a Hive QL statement.
- The web application validates the SQL statements, and we also extract the attribute names and their types, which are useful for various purposes.
- The metadata about the attribute definitions and attributes is captured in a MySQL database. For Hive QL queries we support Hive hints as well as general tuning parameters like split size.
- Once an attribute definition passes the validation step, it goes through an approval process, which is designed to make sure there are no duplicate attributes and that the query is properly tuned.
- One of the benefits of this attribute portal is the centralization of attribute definitions, which makes it easy to discover attributes, the logic behind them, and their data sources.
- When someone starts working on a marketing campaign, they first identify the targeting criteria based on the goals of the campaign; from that set of targeting criteria, they identify the member attributes they need.
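The name-and-type extraction step can be sketched in miniature. This is only an illustration with hypothetical names (`extract_attributes` is not the platform's code), and a production validator would go through the Teradata or Hive query parsers rather than a regex:

```python
import re

def extract_attributes(select_sql):
    # Pull the output column names out of a simple SELECT ... FROM statement.
    # A regex is enough for this illustration; real validation would use
    # the Teradata / Hive parsers.
    m = re.search(r"select\s+(.*?)\s+from\s", select_sql, re.IGNORECASE | re.DOTALL)
    if m is None:
        raise ValueError("not a SELECT statement")
    names = []
    for col in (c.strip() for c in m.group(1).split(",")):
        alias = re.search(r"\s+as\s+(\w+)$", col, re.IGNORECASE)
        # prefer an explicit alias ("expr AS name"), else the bare column name
        names.append(alias.group(1) if alias else col.split(".")[-1])
    return names

print(extract_attributes(
    "SELECT member_id, count(*) AS num_connections "
    "FROM connections GROUP BY member_id"))  # ['member_id', 'num_connections']
```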
Attribute computing workhorse:
- These executors are scheduled to run on a regular basis. They contact the attribute definition metadata repository to retrieve the attribute definitions to execute, then execute the queries in parallel using APIs.
- TD executor: executes queries via JDBC and stores results in temporary tables. We use an in-house M/R library called LASSEN, which leverages the MapReduce framework to quickly and efficiently download the data to HDFS.
- Hive executor: programmatically executes the Hive queries. One of the classes in Hive is not thread safe, so we can't execute Hive QL in parallel using multiple threads; we use a multiple-Hive-executor approach instead.
- Pig executor: executes Pig script files and has the ability to rerun only the failed scripts.
Interesting runtime details:
- We have all kinds of queries, simple ones and complex ones that may take hours to complete. We don't want a single query that takes 5 or 6 hours to delay the attribute computing phase for all the other queries, so the system has a built-in mechanism to kill a long-running query that exceeds a certain amount of time.
- What about failed queries? Even though we validate them at attribute definition submission time, some will still fail at runtime for various reasons. The system is built to be resilient against these failures: only the attributes of the failed queries will be unavailable.
- The system collects accounting information about each query, so we know how many queries completed successfully, how many failed, and how long each one took.
- The output of each attribute definition is stored in a separate folder, so if we have 50 attribute definitions, the results of those queries are scattered across 50 places on Hadoop.
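The runtime behavior described above (parallel execution, resilience to failed queries, per-query accounting) can be sketched like this. All names are hypothetical, and the real executors submit queries via JDBC and the Hive/Pig APIs rather than calling a Python function:

```python
from concurrent.futures import ThreadPoolExecutor

def run_definition(defn):
    # Stand-in for executing one attribute definition via the TD/Hive/Pig APIs.
    if defn.get("broken"):
        raise RuntimeError("query failed at runtime")
    return "/path/" + defn["name"]          # each definition writes its own folder

def execute_all(definitions):
    ok, failed = [], []
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(run_definition, d): d["name"] for d in definitions}
        for fut, name in futures.items():   # dicts preserve submission order
            try:
                fut.result(timeout=600)     # the real system also kills over-long queries
                ok.append(name)
            except Exception:
                failed.append(name)         # only this definition's attributes are lost
    return ok, failed

ok, failed = execute_all([
    {"name": "num_connections"},
    {"name": "job_seeker_score", "broken": True},
    {"name": "country"},
])
print(ok, failed)  # ['num_connections', 'country'] ['job_seeker_score']
```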
Once the executors have finished executing and materializing the attributes:
- The job of the stitcher is to combine all these attributes into a single data set, which I call the LinkedIn big table.
- It is a MapReduce job, and it acts as a gateway that performs validations: for example, a member id must not be less than 0, and certain values can't exceed a certain length.
- The output of the stitcher is a single data set in Avro format that contains one record for every single LinkedIn member. This output is also registered in Hive for data scientists to consume.
- To make the LinkedIn big table available to business analysts for further analysis and insight generation, the same data set is copied onto Teradata via the Data Loader component.
- The whole process of executing the attribute definitions, stitching the attributes into a single dataset, and loading the data onto Teradata takes about 5 to 6 hours.
- Not all attributes need to be refreshed daily, so we have the concepts of partial refresh and full refresh. A partial refresh executes only the subset of attribute definitions that are needed, and takes much less time: 2-3 hours vs. 5-6 hours.
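A toy version of the stitcher's core combine step, with hypothetical names and in-memory dicts standing in for the scattered per-definition folders (the real stitcher is a MapReduce job over Avro data):

```python
def stitch(datasets, attribute_names):
    # Combine the per-definition outputs (each keyed by member id) into one
    # wide record per member, dropping invalid ids on the way through.
    big_table = {}
    for dataset in datasets:
        for member_id, attrs in dataset.items():
            if member_id < 0:
                continue  # the real stitcher also checks value lengths etc.
            big_table.setdefault(member_id, {}).update(attrs)
    # one record per member; attributes a member is missing come out as None
    return {m: {a: row.get(a) for a in attribute_names}
            for m, row in big_table.items()}

table = stitch(
    [{1: {"country": "US"}, 2: {"country": "FR"}, -5: {"country": "??"}},
     {1: {"num_connections": 120}}],
    ["country", "num_connections"],
)
print(table[1])  # {'country': 'US', 'num_connections': 120}
```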
The LinkedIn big table (200GB) is used for multiple purposes:
- Propensity models: ranking models where each member is assigned a score indicating how likely they are to belong to a certain class of members or to take an action, e.g. being a job seeker, or upgrading to a paid subscription.
- Business analysts and data scientists use it for their own analysis.
- It is the most sought-after data: a very rich data set that contains all kinds of interesting attributes about our members, all in a single place. Because the heavy lifting has already been done and the data is available in one place, others don't have to hunt down the underlying data sets themselves.
Self-service: a web application for business analysts and the marketing team, including people who are not familiar with SQL; the UI supports drag and drop.
- Attribute predicate expression: a boolean expression that evaluates to true or false by comparing an attribute value to an expected value. For example: whether the value of the country attribute is United States, or whether a member has more than 30 connections.
- In order to build segments, we need a way to express attribute predicates (e.g. country in Canada or the United States), save the expression, and evaluate it at a later point.
- Build segments: combine various attribute predicates into a segment.
- Build lists: combine segments together to target a certain set of the member population for a marketing campaign.
Based on the requirements from the previous slide, the serving engine needs to support the following operations:
- Count: how many members meet certain criteria.
- Filter: return the members that meet certain criteria.
- Sum: each member is assigned a lifetime value for a particular product, so we need the ability to compute the total dollar amount of a segment based on which members meet the defined criteria.
- Complex nested expressions with support for conjunction (and) and disjunction (or).
The core problem the serving engine needs to solve is supporting arbitrary predicate expressions against any of the attributes and returning the result in a reasonable amount of time. We think of this as an information retrieval problem, and we found Lucene to be very good at this kind of problem, so we leverage it here.
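The four operations can be illustrated over a tiny in-memory table. This sketch uses hypothetical field names and plain Python filtering, whereas the real engine answers these queries from Lucene indexes:

```python
members = [
    {"id": 1, "country": "FR", "num_connections": 45, "lifetime_value": 20.0},
    {"id": 2, "country": "BE", "num_connections": 10, "lifetime_value": 0.0},
    {"id": 3, "country": "FR", "num_connections": 80, "lifetime_value": 35.0},
]

def filter_members(members, predicate):
    return [m for m in members if predicate(m)]

def count(members, predicate):
    return len(filter_members(members, predicate))

def total_value(members, predicate):
    # "sum": add up the per-member lifetime value across the matching segment
    return sum(m["lifetime_value"] for m in filter_members(members, predicate))

# a complex nested expression: country = "FR" AND num_connections > 30
pred = lambda m: m["country"] == "FR" and m["num_connections"] > 30
print(count(members, pred), total_value(members, pred))  # 2 55.0
```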
The indexer is a MapReduce application:
- Consumes data in Avro format and creates Lucene indexes.
- Uses a custom Writable to wrap a Lucene document; each Lucene document contains all 240+ attributes for one member.
- Uses a custom OutputFormat to build each Lucene index segment, stores it on the local disk of the reduce task, and copies it onto HDFS at the end of the task.
- LinkedIn big table: 200GB; index: 175GB.
* Mention the number of map and reduce tasks
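To show what the indexer conceptually produces, here is a toy inverted index over attribute=value terms with hypothetical data; the real job builds Lucene index segments inside the reduce tasks rather than a Python dict:

```python
from collections import defaultdict

def build_inverted_index(members):
    # Map each (attribute, value) term to the set of member ids that carry it,
    # mimicking what a Lucene inverted index provides for predicate lookups.
    index = defaultdict(set)
    for m in members:
        for attr, value in m.items():
            if attr != "id":
                index[(attr, value)].add(m["id"])
    return index

idx = build_inverted_index([
    {"id": 1, "country": "FR", "job_seeker": True},
    {"id": 2, "country": "BE", "job_seeker": True},
    {"id": 3, "country": "FR", "job_seeker": False},
])
print(sorted(idx[("country", "FR")]))  # [1, 3]
```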
- The job seekers question requires only one attribute: job seeker status.
- The Talent Solution prospects question requires two attributes: whether a member is a Talent Solution prospect, and the country where they work.
- The recruiters question needs three attributes: whether a member is a recruiter, the country the member works in, and whether the company they work for is considered a competitor of LinkedIn.
JSON predicate expression: we use JSON to define the format of the predicate expression. JSON is well suited for this purpose: it supports nested data structures, is fairly flexible, and is easy to parse.
- Supports different data types; for each data type, certain operators are supported.
- A JSON predicate expression consists of an attribute name, a data type, an operator, and one or more values.
- The JSON predicate expression is the contract between the browser and the server.
- We store the predicate expression in MySQL and evaluate it at run time.
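A minimal sketch of what such an expression and its evaluation might look like. The exact field names and operator spellings of the platform's wire format are not shown in the talk, so everything below is illustrative:

```python
import json

# Hypothetical operator table; the talk only says each data type supports
# certain operators, not how they are spelled in the real wire format.
OPS = {
    "eq": lambda actual, expected: actual == expected[0],
    "in": lambda actual, expected: actual in expected,
    "gt": lambda actual, expected: actual > expected[0],
}

def evaluate(expr, member):
    # Nested expressions combine children with conjunction or disjunction;
    # a leaf names an attribute, a data type, an operator, and value(s).
    if "and" in expr:
        return all(evaluate(child, member) for child in expr["and"])
    if "or" in expr:
        return any(evaluate(child, member) for child in expr["or"])
    return OPS[expr["op"]](member[expr["attribute"]], expr["values"])

expr = json.loads("""
{"and": [
  {"attribute": "country",         "type": "string", "op": "in", "values": ["CA", "US"]},
  {"attribute": "num_connections", "type": "int",    "op": "gt", "values": [30]}
]}""")
print(evaluate(expr, {"country": "US", "num_connections": 45}))  # True
```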
Web application:
- Has a UI for defining segments and lists.
- Segment builder: drag arbitrary attributes and build predicate expressions.
- With a click of a button, the marketing team can get a sense of how many members meet the criteria defined in the segment. This gives them a chance to change the criteria to increase or decrease the count.
- Segments are meant as building blocks.
Segments are building blocks, and certain segments are reusable. Each marketing campaign is represented by a list, which is a collection of segments; each segment can be one of two types:
- Inclusions: include members that meet the defined criteria of each of the selected segments (net count and raw count).
- Exclusions: exclude those members.
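The inclusion/exclusion semantics reduce to set algebra over member ids. A sketch with hypothetical segments:

```python
def build_list(inclusions, exclusions):
    # Union the member ids of the inclusion segments, then subtract the
    # member ids of the exclusion segments to get the campaign's net list.
    included = set().union(*inclusions) if inclusions else set()
    excluded = set().union(*exclusions) if exclusions else set()
    return included - excluded

# hypothetical segments, each already resolved to a set of member ids
recruiters = {1, 2, 3, 4}
in_north_america = {2, 3, 4, 5}
works_for_competitor = {4}

target = build_list([recruiters & in_north_america], [works_for_competitor])
print(sorted(target))  # [2, 3]
```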
One of the things we are working on is improving the turnaround time for attributes: from the time an attribute is defined to the time it is available for building segments.
* Give a shout out to the engineering team that worked on this platform
1. LinkedIn Segmentation & Targeting Platform: A Big Data Application
Hadoop Summit, June 2013
Hien Luu, Sid Anand
©2013 LinkedIn Corporation. All Rights Reserved.
2. About Us
Hien Luu, Sid Anand
3. Our mission
Connect the world’s professionals to make them more productive and successful
4. Over 200M members and counting
[Chart: LinkedIn members in millions, 2004-2012, growing from 2 to 200+]
The world’s largest professional network
Growing at more than 2 members/sec
Source: http://press.linkedin.com/about
5. The world’s largest professional network
• >88% of Fortune 100 companies use LinkedIn Talent Solutions to hire
• >2.9M Company Pages
• >5.7B professional searches in 2012
• 19 languages
• >30M students and NCGs, the fastest growing demographic
Over 64% of members are now international
Source: http://press.linkedin.com/about
6. Other Company Facts
• Headquartered in Mountain View, Calif., with offices around the world
• As of June 1, 2013, LinkedIn has ~3,700 full-time employees located around the world
Source: http://press.linkedin.com/about
7. Agenda Company Overview• Big Data @ LinkedIn• The Segmentation & Targeting Problem• Solution : LinkedIn Segmentation & Targeting Platform• Q & A
8. Big Data @ LinkedIn
9. LinkedIn: Big Data Story
Our Big Data Story depends on Infrastructure!
• On-line Data Infrastructure
• Near-line Data Infrastructure
• Off-line Data Infrastructure
[Diagram: web serving writes updates to Oracle or Espresso (on-line); data streams flow near-line; Teradata sits off-line]
10. Big Data Story: On-line Data
On-line Data Infrastructure
• Supports typical OLTP requirements
  - Highly concurrent R/W access
  - Transactional guarantees
  - Back-up & recovery
• Supports a central LinkedIn Data Principle: “All data everywhere”
  - All OLTP databases need to provide a time-line consistent change stream
  - For this, we developed and open-sourced Databus!
[Diagram: web serving writes updates to an Oracle or Espresso store]
11. Big Data Story: On-line Data
A user updates the company, title, & school on his profile. He also accepts a connection.
The write is made to an Oracle or Espresso master and Databus replicates it:
• The profile change is applied to the Standardization service (e.g. the many forms of IBM are canonicalized for search-friendliness)
• … and to the Search Index (recruiters can find you immediately by new keywords)
• The connection change is applied to the Graph Index service (the user can now start receiving feed updates from his new connections)
[Diagram: data change events flow from Oracle/Espresso to the Standardization, Search Index, Graph Index, and Read Replica services]
12. Big Data Story: On-line Data
Databus streams also update Hadoop!
[Diagram: data change events flow from Oracle/Espresso to Standardization, Search Index, Graph Index, Read Replicas, and Hadoop]
13. Big Data Story: Near-line & Off-line Data
2 main sources of data @ LinkedIn:
• User-provided data, e.g. Member Profile data (employment, education history, endorsements)
• Tracking data via web site instrumentation, e.g. pages viewed, email opened/sent, social gestures: posts/likes/shares
[Diagram: web servers and Oracle/Espresso (via Databus) feed Teradata]
14. The Segmentation & Targeting Problem
15. Segmentation & Targeting
16. Segmentation & Targeting
Attribute types
Bhaskar Ghosh
17. Segmentation & Targeting
Step 1: Take some information about users

Member ID | Join Date  | Country | Responded to Promotion X1
1         | 01/01/2013 | FR      | F
2         | 01/02/2013 | BE      | F
3         | 01/03/2013 | FR      | F
4         | 02/01/2013 | FR      | T

Step 2: Provide some targeting criteria for a new promotion
Pick members where
• Join Date between ("01/01/2013", "01/31/2013") and
• Country = "FR" and
• Responded to Promotion X1 = "F"
=> Members 1 & 3
Step 3: Target them for a different email campaign (promotion_X2)
18. Segmentation & Targeting
(Same example as the previous slide, annotated: the member information table is the set of Attributes, the targeting criteria form the Segment Definition, and the resulting members 1 & 3 are the Segment.)
19. Segmentation & Targeting
Problem Definition
• The business wants to launch new campaigns often
• The business wants to specify targeting criteria (segment definitions) using an arbitrary set of attributes
• The attributes often need to be computed to fulfill the targeting criteria
• This data resides on Hadoop or TD
• The business is most comfortable with SQL-like languages
20. Segmentation & Targeting Solution
21. Segmentation & Targeting
[Diagram: the platform's two parts, the Attribute Computation Engine and the Attribute Serving Engine]
22. Segmentation & Targeting
Attribute Computation Engine requirements: self-service, support various data sources, attribute consolidation, attribute availability
23. Segmentation & Targeting
Attribute computation
[Diagram: PB- and TB-scale source data sets feed a table of ~225M rows and ~240 attribute columns]
24. LinkedIn Segmentation & Targeting Platform
[Diagram: the Attribute Portal web application stores attribute & definition metadata]
25. LinkedIn Segmentation & Targeting Platform
[Diagram: the TD, Hive, and Pig executors retrieve attribute & definition metadata over REST]
26. LinkedIn Segmentation & Targeting Platform
Attribute consolidation & availability
[Diagram: the M/R Stitcher combines /path/dataset1 … /path/dataset4 into /path/lnkd_big_table, which the Data Loader copies onward]
27. LinkedIn Segmentation & Targeting Platform
LinkedIn big table, the most sought-after data
[Diagram: the LinkedIn big table feeds segmentation, propensity models, and ad hoc analysis]
28. Segmentation & Targeting
Attribute Serving Engine requirements: self-service, attribute predicate expressions, build segments, build lists
29. Segmentation & Targeting
Serving Engine operations over the LinkedIn big table (~225M rows, ~240 attributes):
1. count
2. filter
3. sum
4. complex expressions
30. LinkedIn Segmentation & Targeting Platform
[Diagram: the M/R Indexer reads the LinkedIn big table and the attribute & definition metadata, and builds inverted indexes]
31. LinkedIn Segmentation & Targeting Platform
Who are the North American recruiters that don’t work for a competitor?
Who are the LinkedIn Talent Solution prospects in Europe?
Who are the job seekers?
32. LinkedIn Segmentation & Targeting Platform
[Diagram: a segment & list request becomes a JSON predicate expression, is translated by the JSON-to-Lucene query parser, and runs against the inverted indexes]
33. LinkedIn Segmentation & Targeting Platform
Complex tree-like attribute predicate expressions
34. LinkedIn Segmentation & Targeting Platform
A marketing campaign is represented by a list
35. Conclusion
Move at business speed and scale at LinkedIn scale
Segmentation & Targeting Platform:
– Self-service
– Multiple data sources & massive data volume
– Support complex expression evaluation in seconds
– Attribute availability at business speed
36. Engineering Team
Jessica Ho, Swetha Karthik, Raj Rangaswamy, Tony Tong, Ajinkya Harkare, Hien Luu, Sid Anand
37. Questions?
More info: data.linkedin.com