Path to 400M Members: LinkedIn’s Data Powered Journey

  1. Xin Fu, Carl Steinbach. Hadoop Summit Tokyo, October 26, 2016.
     Path to 400M* Members: LinkedIn’s Data Powered Journey
     * As of Q2 2016, LinkedIn had 450M members worldwide
  2. Years shown: 2004, 2011, 2012, 2009, 2012, 2015
  3. Real Time Visualization of New Sign-ups
  4. What Does “Data-Driven” Mean at LinkedIn?
  5. What Does “Data-Driven” Mean at LinkedIn?
  6. Monitoring & Learning
  7. What is This Phase Comprised of?
     ● Dashboards
     ● Reports
     ● Trend explanation
       ○ Short term fluctuation: investigation
       ○ Long term trend: strategic analysis
  8. Past Challenges
     Reliability
     ● Easily broken without operational support, huge time spent in maintenance
     Diverse technology
     ● Self maintained pipelines
     ● Various UIs with different visualization capabilities
     ● Redundant computation
  9. Standardized Reporting Tool
     ● Reduces dependency on 3rd party BI tools
     ● Closer integration with LinkedIn’s ecosystem of experimentation and anomaly detection solutions
  10. Towards Real Time Monitoring
      Sign-up: Country, Platform, Language, Browser, Signup Type, OS
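Slide 10 lists the dimensions along which sign-ups are monitored. As a rough illustration of what such a roll-up could look like, the sketch below counts sign-ups per time window and dimension value. The event layout, the field names, and the one-minute window are assumptions made for this example; the slides do not describe LinkedIn's actual monitoring pipeline.

```python
from collections import Counter
from typing import Any, Iterable, Mapping

# Dimension names follow slide 10; the lowercase field names are hypothetical.
DIMENSIONS = ("country", "platform", "language", "browser", "signup_type", "os")


def count_signups(events: Iterable[Mapping[str, Any]], window_ms: int = 60_000) -> Counter:
    """Count sign-ups per (window start, dimension, value)."""
    counts: Counter = Counter()
    for event in events:
        # Round the event timestamp down to the start of its window.
        window_start = event["time"] - event["time"] % window_ms
        for dim in DIMENSIONS:
            # Missing dimensions are grouped under "unknown" rather than dropped.
            counts[(window_start, dim, event.get(dim, "unknown"))] += 1
    return counts


# Hypothetical sample events (timestamps in milliseconds).
sample_events = [
    {"time": 1_000, "country": "jp", "platform": "mobile"},
    {"time": 2_000, "country": "jp", "platform": "desktop"},
    {"time": 61_000, "country": "us", "platform": "mobile"},
]
signup_counts = count_signups(sample_events)
```

Keying the counter by window start keeps each window independent, so a dashboard could compare the latest window against earlier ones for the same dimension value.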
  11. Experimentation & Analysis
  12. What is This Phase Comprised of?
      ● Experiment design
      ● Experiment analysis to inform ramp decisions
      ● Learning from multiple experiments to identify what works and what doesn’t work
  13. Past Challenges
      Experiment design
      ● Interaction between experiments
      Experiment analysis and ramp decision
      ● Manual analysis, extended time-to-decision
      ● Ramp decisions based on localized metrics
      ● Reruns needed sometimes due to undetected errors in setup
      Worst of all, some ramps happened without A/B testing
      ● e.g. infrastructural changes
  14. Experimentation Platform @ LinkedIn
      ● Company-wide platform for A/B testing, ramping, and advanced targeting needs
      ● Automated reporting and analysis capabilities
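The slides do not say which statistical tests the experimentation platform runs. As one common way to automate "experiment analysis to inform ramp decisions", the sketch below applies a two-proportion z-test to conversion counts from a control and a treatment group. The function name and the sample numbers are hypothetical and are not taken from LinkedIn's platform.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple:
    """Return (z, two-sided p-value) for the difference in conversion rates B - A."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("pooled rate is 0 or 1; the test is undefined")
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: control converts 100 of 1000, treatment 150 of 1000.
z_score, p_value = two_proportion_z_test(100, 1000, 150, 1000)
```

A single test like this only covers one metric; the "localized metrics" problem on slide 13 is a reminder that a ramp decision would normally look at several metrics together.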
  15. Tiering of Metrics
      Metrics at different tiers have:
      ● Different review processes
      ● Different levels of visibility in dashboards and experiment scorecards
      ● Different computation priorities and SLAs in data pipelines
      ● Different life cycles
  16. Backend Infrastructure for Tracking & Instrumentation
  17. Tracking Data Records User Activity
      InvitationClickEvent()
      Scale fact: ~1000 tracking event types, ~20TB per day, hundreds of metrics & data products
  18. Tracking Data Lifecycle and Teams
      Product teams: PMs, Developers, TestEng
      Infra teams: Hadoop, Kafka, DWH, ...
      Data teams: Analytics, Relevance Engineers, ...
  19. Example: How Do We Track a Profile View?
      PageViewEvent Record 1:
          {
            "header" : {
              "memberId" : 12345,
              "time" : 1454745292951,
              "appName" : { "string" : "LinkedIn" },
              "pageKey" : "profile_page"
            },
            "trackingInfo" : { "vieweeID" : "23456", ... }
          }
          -- Load all page view events, then keep only profile page views.
          pageViews = LOAD '/data/tracking/PageViewEvent';
          profileViews = FILTER pageViews BY header.pageKey == 'profile_page';
  20. Example: How Do We Track a Profile View?
      PageViewEvent Record 1:
          {
            "header" : {
              "memberId" : 12345,
              "time" : 1454745292951,
              "appName" : { "string" : "LinkedIn" },
              "pageKey" : "new_profile_page"
            },
            "trackingInfo" : { "vieweeID" : "23456", ... }
          }
          -- The new page key forces every consumer to extend its filter.
          pageViews = LOAD '/data/tracking/PageViewEvent';
          profileViews = FILTER pageViews BY header.pageKey == 'profile_page'
                                          OR header.pageKey == 'new_profile_page';
  21. At Some Point It Becomes Unmaintainable ...
  22. How Do We Handle Old and New?
      Producers, Consumers
  23. DALI: A Data Access Layer for LinkedIn
      Abstract away underlying physical details to allow users to focus solely on the logical concerns
      Logical Tables + Views
      Logical FileSystem
      We had been working on something that could help...
  24. DALI: Implementation Details in Context
      Components shown: Data Catalog + Discovery (DALI); DaliFileSystem Client; Data Source (HDFS); Data Sink (HDFS); Processing Engine (MapReduce, Spark, Presto); DALI Datasets (Tables + Views); Query Layers (Hive, Pig, Spark); View Defs + UDFs (Artifactory, Git); Dataflow APIs (MR, Spark, Scalding); DALI CLI
  25. Solving with DALI Views
      Producers, Consumers
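The slides do not show how a DALI view is defined, so the following is only a Python sketch of the general idea and not DALI's API: the producer owns one "view" that knows which physical page keys count as a profile view, and consumers read flat logical rows instead of filtering on `header.pageKey` themselves. The names `PROFILE_PAGE_KEYS` and `profile_view`, the output field names, and the `feed_page` key are hypothetical; the record layout follows the PageViewEvent example on slides 19 and 20.

```python
from typing import Any, Dict, Iterable, Iterator

# Producer-owned knowledge: every physical page key that means "profile view".
# Adding a future key here would not require any consumer change.
PROFILE_PAGE_KEYS = {"profile_page", "new_profile_page"}


def profile_view(page_views: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Hypothetical producer-side view: yield one logical row per profile view."""
    for record in page_views:
        header = record["header"]
        if header["pageKey"] in PROFILE_PAGE_KEYS:
            yield {
                "viewerId": header["memberId"],
                "vieweeId": record["trackingInfo"]["vieweeID"],
                "time": header["time"],
            }


# Hypothetical raw events: an old-style key, a new-style key, and an unrelated page.
raw_events = [
    {"header": {"memberId": 12345, "time": 1454745292951, "pageKey": "profile_page"},
     "trackingInfo": {"vieweeID": "23456"}},
    {"header": {"memberId": 12346, "time": 1454745292999, "pageKey": "new_profile_page"},
     "trackingInfo": {"vieweeID": "23456"}},
    {"header": {"memberId": 12347, "time": 1454745293100, "pageKey": "feed_page"},
     "trackingInfo": {"vieweeID": "0"}},
]

# Consumer code depends only on the logical rows, not on page keys.
profile_rows = list(profile_view(raw_events))
```

The design point is that the page-key list lives in exactly one place, which is the maintainability problem slides 19 to 21 describe.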
  26. State of the World Today with Dali
      ~100 producer views
      ~200 consumer views
      ~80 unique tracking event data sources
      What’s next?
      ● Views on streaming data
      ● Selective materialization and caching
      ● Open source
  27. At the Core of “Data-Driven” is ...
  28. Used to be Tug of War Between Speed and Quality
  29. Before We Learned that Technology Could Break the Dichotomy Between Speed and Quality
  30. Cultural Aspects: Partnership
      Data Scientists and Engineers
  31. Interesting Challenges
      - Metric trade-offs, e.g. engagement vs. monetization
      - Real-time everything?
      - A/B test in a social network
      - Human judge for personalized search
      - Value of an action
  32. It Took a Village
      Thanks to all the Data Scientists, Engineers and Product partners at LinkedIn for being part of this great journey!
      https://engineering.linkedin.com/data
