A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

This talk was given by Bhaskar Ghosh (Senior Director of Engineering, LinkedIn Data Infrastructure) at the October 2012 Yale Symposium on Big Data, held in honor of Martin Schultz.

Published in: Technology

  1. A Small Overview of Big Data Products, Analytics and Infrastructure at LinkedIn
     Bhaskar Ghosh, Senior Director of Engineering, Data Infrastructure
     Big Data Science: A Symposium in Honor of Martin Schultz, Yale University, 26 Oct 2012
     LinkedIn Confidential ©2013 All Rights Reserved
  2. Outline
     1. Martin and Me
     2. Company and Mission
     3. Products and Science
     4. Data Infrastructure
     5. P, S, DI: People You May Know
     6. LinkedIn + Yale
     7. Conclusion
  3. Martin and Me
     • Thank you Martin! Best mentor. Versatility, big-picture thinking and leadership.
     • Yale CS Ph.D. 1995 (Parallel Algorithms)
     • 12y @ Informix & Oracle building parallel database systems
     • 4y @ Yahoo! building Ads systems & leading the Display Ads Exchange organization
     • 2y+ @ LinkedIn building & leading the Data Infrastructure Engineering Organization
  4. The World’s Largest Professional Network
     Connecting Talent  Opportunity. At scale…
     • 175M+ Members Worldwide
     • 2 new Members Per Second
     • 100M+ Monthly Unique Visitors
     • 2M+ Company Pages
  5. …and a bunch of Data-Driven Products
     Screenshot labels: Pandora Search for People; Groups browse maps; Events You May Be Interested In
  6. The LinkedIn Mission
     Connect the world’s professionals to make them more productive and successful
  7. LinkedIn Product Philosophy
     Goals
     • Provide a uniquely personalized experience to members (professionals)
     • Build an ecosystem to balance the interests of members and partners (companies)
     Approach
     • Launch Often and Early
     • Data-Driven Experiment and Test
     • Fail Fast
     • Prepare for Virality and Scale
  8. Two Product Families
     For Members (Professionals)
     • People You May Know
     • Who’s Viewed My Profile
     • Jobs You May Be Interested In
     • News/Sharing
     • Today
     • Search
     • Subscriptions
     For Partners (Companies)
     • Hire
     • Market
     • Sell
     Supporting layers: Science and Analytics; Data Infrastructure
     Data: Profiles, Actions, Connections, Content
  9. The Big-Data Feedback Loop
     Diagram nodes: Member, Product, Data, Science, Infrastructure
     Diagram labels: Engagement ↑, Refinement ↑, Value ↑, Virality ↑, Insights ↑, Signals ↑, Scale ↑, Analytics ↑
  10. Member-Facing Products: Diversity at Scale
     Columns: Product Family, Products, Science, Data Infra
     • Identity and Engagement
       Products: 1. Profile and Connections 2. Activity Streams 3. Messages (email) 4. Endorsements & Skills
       Science: Blending and ranking of heterogeneous content (e.g. Network Updates, Group Discussions, Job Postings)
     • Search and Analysis
       Products: 1. People Search 2. Group Search 3. Who Viewed My Profile
     • Recommendations
       Products: 1. People You May Know 2. Jobs You May Be Interested In 3. Events You May Be Interested In
       Science: Entity disambiguation and matching
     • Monetization
       Products: 1. Subscription Packages 2. Sponsored Content
       Science: Response Prediction, Inventory Forecasting
  11. Recommendations… Are Effective
     • Find data that is useful for Members
     • Guiding Principle
       • Provide Relevant Content
       • Establish Social Connections
       • In Appropriate Context
     … And Drive
     • > 50% of connections
     • > 50% of job applications
     • > 50% of group joins
  12. LinkedIn Recommendation Engine
     • Recommendation Entities: People, Jobs, Groups, Ads, Companies, Searches, …
     • Products: Jobs You May Be Interested In, Referral Center, People Browse Map, Similar Profiles, Similar Groups, Jobs Browse Map, TalentMatch, Similar Jobs, News, Groups Browse Map, GYML, Events, … and more
     • A/B API
     • Recommendation Types: Behavior Analysis, Collaborative Filtering, Popularity, User Feedback
     • Shared, Dynamic, Unified Core Service: (R-T) Feature Extraction, Entity Resolution & Enrichment; (R-T) matching computations; Offline data munging (hadoop)
  13. Member-Facing Products: Diversity at Scale
     Columns: Product Family, Products, Science, Data Infra
     • Identity and Engagement
       Products: 1. Profile and Connections 2. Activity Streams 3. Messages (email) 4. Endorsements & Skills
       Science: Blending and ranking of heterogeneous content (e.g. Network Updates, Group Discussions, Job Postings)
     • Search and Analysis
       Products: 1. People Search 2. Group Search 3. Who Viewed My Profile
     • Recommendations
       Products: 1. People You May Know 2. Jobs You May Be Interested In 3. Events You May Be Interested In
       Science: Entity disambiguation and matching
     • Monetization
       Products: 1. Subscription Packages 2. Sponsored Content
       Science: Response prediction
     Data Infra column (top to bottom): Scale; Full text and secondary ind; Real-time; Faceted search; Near RT index freshness; Drill-down exploration; Graph analysis; Content serving; Real-time tuning
  14. LinkedIn Data Infrastructure: Three-Phase Abstraction
     Diagram: Users, Application, Online Data Infra, Near-Line Infra, Offline Data Infra
     Columns: Infrastructure, Latency & Freshness Requirements, Products
     • Online: Activity that should be reflected immediately
       Products: Member Profiles, Company Profiles, Connections, Messages, Endorsements, Skills
     • Near-Line: Activity that should be reflected soon
       Products: Activity Streams, Profile Standardization, News, Recommendations, Search, Messages
     • Offline: Activity that can be reflected later
       Products: People You May Know, Connection Strength, News, Recommendations, Next best idea…
  15. LinkedIn Data Infrastructure: Sample Stack
     • Infra challenges in 3-phase ecosystem are diverse, complex and specific
     • Some off-the-shelf. Significant investment in home-grown, deep and interesting platforms
  16. LinkedIn Data Infrastructure: Data Stores
     References
     • ICDE 2012 (Data Infra Overview)
     • FAST 2012 (Voldemort for Serving)
     Systems: Voldemort
     Capabilities
     • Transactions
     • Rich structures (e.g. indexes)
     • Change capture capability
     • Key value / document storage
  17. LinkedIn Data Infrastructure: Specialized Indexes
     Systems and Capabilities
     • Zoie, Bobo, Sensei: Search platform
     • GraphDB: Distributed graph engine
  18. LinkedIn Data Infrastructure: Pipelines
     References
     • ACM SOCC 2012: “Databus”
     • IEEE Data Eng. Bulletin 2012: “Kafka”
     Capabilities
     • Messaging for site events, monitoring
     • High throughput
     • Change data capture stream
     • Reliable, consistent, low latency pipe
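
The pipeline capabilities on this slide (a stream of site events or change records that several downstream systems read independently) can be illustrated with an append-only log and per-consumer offsets. This is a minimal in-memory sketch, not the Kafka or Databus API; `EventLog`, `Consumer`, and the event fields are hypothetical names chosen for illustration.

```python
class EventLog:
    """Append-only sequence of events; each event is addressed by its offset."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1          # offset of the event just written

    def read(self, offset, max_events=100):
        return self._events[offset:offset + max_events]


class Consumer:
    """Each consumer tracks its own position, so readers never block each other."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        batch = self.log.read(self.offset)
        self.offset += len(batch)             # advance only past what was read
        return batch


log = EventLog()
log.append({"type": "profile_update", "member": 1})   # hypothetical event shape
search_indexer = Consumer(log)
offline_loader = Consumer(log)
print(search_indexer.poll(), offline_loader.poll())
```

Because the log is never modified in place, a slow consumer simply lags behind and can catch up later by reading from its stored offset.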
  19. LinkedIn Data Infrastructure: Off-line Analysis
     Capabilities
     • ML, Ranking, Relevance
     • Insights and Analytics
     • ETL, Metadata and Pipes
     • Business Source of Truth
  20. LinkedIn Data Infrastructure: Cluster Management
     Reference
     • ACM SOCC 2012: Untangling Cluster Management with Helix
     Capabilities
     • Generic framework for building distributed systems
     • Cluster Management Primitives
  21. HELIX: Generalizing Cluster Management
     Diagram
     • STATE MACHINE: states S, O, M; transitions t1, t2, t3, t4
     • CONSTRAINTS: COUNT=2, COUNT=1, t1 ≤ 5
     • OBJECTIVE: minimize(max_{nj∈N} S(nj)), minimize(max_{nj∈N} M(nj))
     Summary
     • Declare distributed system behavior via {S, C, O}
     • Enforce Partition constraints
     • Fault detection and tolerance (e.g. promote S to M)
     • Elasticity (e.g. Re-balance; Minimize migrations)
     • Used in Espresso, Search, Databus
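
To make the constraint and objective idea on this slide concrete, here is a small greedy placement sketch. It is not Helix's algorithm or API. It assumes that S and M stand for slave and master, that COUNT=1 applies to masters and COUNT=2 to slaves, and that the objective is to keep the busiest node's replica count low; the partition and node names are hypothetical.

```python
def assign_replicas(partitions, nodes, slaves_per_partition=2):
    """Greedy placement: one MASTER and `slaves_per_partition` SLAVEs per partition."""
    if len(nodes) < slaves_per_partition + 1:
        raise ValueError("need a distinct node for every replica of a partition")

    master_load = {node: 0 for node in nodes}
    slave_load = {node: 0 for node in nodes}
    assignment = {}
    for partition in partitions:
        # Constraint: exactly one master. Objective: pick the node with fewest masters.
        master = min(nodes, key=lambda n: master_load[n])
        master_load[master] += 1
        # Constraint: fixed number of slaves, never on the master's node.
        candidates = sorted((n for n in nodes if n != master),
                            key=lambda n: slave_load[n])
        slaves = candidates[:slaves_per_partition]
        for node in slaves:
            slave_load[node] += 1
        assignment[partition] = {"MASTER": master, "SLAVE": slaves}
    return assignment


def promote_on_failure(assignment, failed_node):
    """Fault-tolerance sketch: promote a surviving slave where the master was lost."""
    for replicas in assignment.values():
        replicas["SLAVE"] = [n for n in replicas["SLAVE"] if n != failed_node]
        if replicas["MASTER"] == failed_node:
            replicas["MASTER"] = replicas["SLAVE"].pop(0) if replicas["SLAVE"] else None
    return assignment


plan = assign_replicas(["p1", "p2", "p3", "p4"], ["n1", "n2", "n3"])
print(promote_on_failure(plan, "n1"))
```

A greedy pass like this does not guarantee a globally optimal placement, and it ignores the transition limit (t1 ≤ 5) and the goal of minimizing migrations during re-balancing.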
  22. LinkedIn Data Infrastructure: A few take-aways
     1. Infrastructure decisions matter and are hard to transform in a hyper-growth environment.
     2. Balance open-source products with home-grown platforms (**)
     3. Operability, Capacity Planning and On-line Multi-tenancy are hard
     4. Data Movement: Pipes and Feedback Loops are critical (**)
     5. Data Model and Integration e2e are key (*)
     6. Few vs Many: Balance over-specialized (agile) efforts vs generic (leverage-able) platforms (*)
     7. Off-line Multi-Platform story is evolving.
  23. Science and Infrastructure: Giving Back
     Research Publications
     • ACM SOCC 2012
     • ACM RecSys 2012
     • SIGIR 2012
     • CIKM 2012
     • VLDB 2012
     • ICDE 2012
     • FAST 2012
     • NetDB 2011
     • …
     Open Source Projects
     • Apache Helix (new)
     • ParSeq (new)
     • DataFu (new)
     • Apache Kafka
     • Sensei
     • Azkaban
     • Voldemort
  24. A Recommendation Product: People You May Know (PYMK)
  25. Probability that you may know someone else?
     Diagram: Alice, Bob and Carol, with one edge marked “??”
     Known as “triangle closing”
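
Triangle closing can be sketched as counting, for every pair of members who are not yet connected, how many connections they share. The function below is an illustrative in-memory version under that reading, not LinkedIn's implementation; `common_connection_counts` and the sample edge list are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def common_connection_counts(edges):
    """Count shared connections for every pair of members not yet connected."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    counts = defaultdict(int)
    # Each member "in the middle" contributes one open triangle
    # for every pair of their own connections.
    for middle, friends in neighbors.items():
        for u, v in combinations(sorted(friends), 2):
            if v not in neighbors[u]:         # skip pairs already connected
                counts[(u, v)] += 1
    return dict(counts)


edges = [("Alice", "Bob"), ("Alice", "Carol")]
print(common_connection_counts(edges))        # {('Bob', 'Carol'): 1}
```

Pairs with higher counts are stronger candidates; once Bob and Carol connect, the triangle is closed and the pair drops out of the result.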
  26. PYMK: Science, Members and Connections
     1) Feature selection is key
        • Common Connections
        • Geo
        • Company
        • Age
     2) ML and data model
        • Traditional ML (e.g. matrix factorization) on O(n^2) of 175M tends not to scale easily
     3) Interplay: Data Model + ML + Parallel Computation model
     4) Adding edges: Why do it?
        • Creates positive-feedback social loops for members
        • More useful content and activity available to members
        • Denser graph improves signal strength in science-driven products
     Diagram: The Feedback Loop, with nodes Member, Product, Data, Science and labels Value ↑, Virality ↑, Insights ↑, Signals ↑
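
The O(n^2) remark can be made concrete with a quick count of unordered member pairs at the 175M scale quoted in the deck. This is plain arithmetic for illustration; `candidate_pairs` is a hypothetical helper name.

```python
def candidate_pairs(n_members):
    """Number of unordered member pairs: n * (n - 1) / 2."""
    return n_members * (n_members - 1) // 2


pairs = candidate_pairs(175_000_000)
print(f"{pairs:.3e} candidate pairs")   # about 1.5e16
```

At roughly 1.5 × 10^16 pairs, scoring every pair is impractical, which is why candidate generation (for example, restricting to pairs with at least one common connection) matters.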
  27. PYMK: Off-line Model Build
     • Use generic off-line Infra (Hadoop and Pig) to build recommendations off-line.
     • Very complex workflow due to extraction and selection of a large number of features. Built Azkaban for Hadoop.
     • Small input and final look-up structure but large intermediate data (100s of TB) due to MR. Problem (who you do not know) itself has an inherent blow-up.
     • Special optimizations (e.g. Bloom Join to remove connected)
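
The "Bloom Join to remove connected" optimization can be illustrated with a small Bloom filter: existing connections are added to a compact bit array, and candidate pairs that test positive are dropped. This single-process sketch is an assumption about the general technique, not the Hadoop job itself; the class, the helper names, and the filter sizes are hypothetical.

```python
import hashlib


class BloomFilter:
    """Compact set membership: no false negatives, a small rate of false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def pair_key(a, b):
    """Order-independent key so (a, b) and (b, a) map to the same entry."""
    return "|".join(sorted((a, b)))


def drop_connected(candidates, connections):
    bloom = BloomFilter()
    for a, b in connections:
        bloom.add(pair_key(a, b))
    # A false positive only discards a valid candidate; it never
    # lets an existing connection through.
    return [(a, b) for a, b in candidates if pair_key(a, b) not in bloom]


print(drop_connected([("Bob", "Alice"), ("Bob", "Carol")], [("Alice", "Bob")]))
```

The trade-off suits recommendations: occasionally losing a valid suggestion is cheaper than shuffling the full connection set through a join.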
  28. PYMK: Off-line to Near-Line Serving
     • Build serving structure on Hadoop. Scan versus Index compactness tradeoff.
     • Voldemort: Partitioned k-v; Load-balancing; Pluggable storage layer; Failover.
     • Bulk load for efficiency. Fast Rollback for safety. Atomic swap.
     • Serving: Per-partition index in memory. PYMK blobs on disk.
     • Retrieval ~msec. Decoration in App FE is more expensive.
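
The bulk load, atomic swap, and fast rollback pattern can be sketched as a read-only store that keeps the previous data version around. This is an in-memory illustration of the pattern only, not Voldemort's API; `SwappableStore` and its methods are hypothetical.

```python
import threading


class SwappableStore:
    """Read-only key-value store whose whole data set is replaced in one step."""

    def __init__(self, initial):
        self._lock = threading.Lock()
        self._current = dict(initial)
        self._previous = None

    def get(self, key, default=None):
        # Readers always see one complete version, never a half-loaded one.
        return self._current.get(key, default)

    def swap(self, new_data):
        new_version = dict(new_data)          # build fully before publishing
        with self._lock:
            self._previous, self._current = self._current, new_version

    def rollback(self):
        with self._lock:
            if self._previous is None:
                raise RuntimeError("no previous version to roll back to")
            self._current, self._previous = self._previous, None


store = SwappableStore({"alice": ["bob"]})
store.swap({"alice": ["carol", "dave"]})      # publish a freshly built data set
store.rollback()                              # bad build: return to the old one
print(store.get("alice"))                     # ['bob']
```

Keeping the old version intact until the new one is proven is what makes rollback fast: it is a reference swap, not a reload.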
  29. PYMK: Science and Feedback Loop
     • Response vs Latency: Fast refresh helps user experience (e.g. showing connections of very recent connections). “Social” phenomenon.
     • Very agile feature: Lots of on-line A/B testing and tweaking of features
     • Huge Impact: > 50% of accepted invites are created by PYMK
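
One common way to run on-line A/B tests is to assign each member to a variant deterministically by hashing the member ID together with the experiment name. The deck does not describe how LinkedIn assigns variants, so this is an assumed, generic sketch; `assign_variant` and the experiment name are hypothetical.

```python
import hashlib


def assign_variant(member_id, experiment, variants=("control", "treatment")):
    """Stable assignment: the same member always lands in the same variant."""
    digest = hashlib.sha256(f"{experiment}:{member_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


print(assign_variant(42, "pymk-ranking-v2"))
```

Including the experiment name in the hash keeps assignments independent across experiments, so a member in the treatment group of one test is not systematically in the treatment group of another.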
  30. PYMK: Tying It All Together
     • Diagram layers: PYMK Application (User Interactions); Near-Line Serving; Offline Model
     • P(B knows C) ∝ large number of features: Common connections, Organizational Overlap, Age Distance
     • Example graph: Alice, Bob, Carol, Dave, Eve
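
A feature-based estimate of P(B knows C) can be illustrated with a logistic model over the features named on the slide. The deck does not say which model is used, so the logistic form, the weights, the bias, and the feature names below are all hypothetical values chosen for illustration.

```python
import math

# Hypothetical weights: positive values raise the probability, negative lower it.
WEIGHTS = {
    "common_connections": 0.8,
    "organizational_overlap": 1.2,
    "age_distance": -0.1,
}
BIAS = -3.0


def connection_probability(features):
    """Logistic score: squashes a weighted feature sum into the range (0, 1)."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


bob_carol = {"common_connections": 5, "organizational_overlap": 1, "age_distance": 2}
print(round(connection_probability(bob_carol), 2))   # 0.88
```

In practice the weights would be learned from accepted and ignored suggestions, which is where the feedback loop described in the deck comes in.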
  31. LinkedIn + Yale
     Students
     • What is my career path?
     • How can I prepare?
     • How do I get my first internship and first job?
     University
     • Where did my students go after they left the university?
     • How is my school seeding the various industries with the best talent?
     • How does my school compare with other institutions?
     Wins
     • Students: Transformation of Careers
     • Yale: Get a data-driven view; Uncover opportunities based on data and insights
  32. Thank you colleagues for the beautiful slides!
     • Amy Tang, Sr. Program Manager
     • Anmol Bhasin, Sr. Engineering Manager
     • Daniel Tunkelang, Principal Data Scientist
     • David Henke, SVP Operations
     • Kapil Surlaker, Principal Engineer
     • Sam Shah, Principal Engineer
     • Shirshanka Das, Principal Engineer
  33. Summary
     1. E2E: The Big-Data feedback loop of social-network product design is cool
     2. Infrastructure
        1. Data Infrastructure needs continuous innovation and iteration to keep pace for scale and cost.
        2. Fast moving, Big, Clean Data + Agile Metadata = Goodness
        3. Data-driven products need agile feedback infrastructure and measurement methodology.
     3. Methodology
        1. Data-Driven experimentation enables insights and agile products
        2. Recommendation-driven products have big impact.
     Read more @ data.linkedin.com
  34. Help us. Come Have Fun with Us!
     1. Science and Data Mining: Recommendation and Optimization Problems
     2. Next-generation ad-hoc and OLAP query processing on Hadoop
     3. Graph Computations: Off-line mining and On-line integration loops
     4. nRT Data Streams in Near-line infrastructure
     5. And much more…
     Info: data.linkedin.com
  35. In Closing
     bghosh@linkedin.com
     Thank You!
