Building Data Driven Products at LinkedIn

Talk slides from Big-Data-Cloud meetup on Building Data Driven Products at LinkedIn.

Published in: Technology, Business

Building Data Driven Products at LinkedIn

  1. Building Data Products using Hadoop at LinkedIn. Mitul Tiwari, Search, Network, and Analytics (SNA), LinkedIn
  2. Who am I?
  3. What do I mean by Data Products?
  4. People You May Know
  5. Profile Stats: WVMP
  6. Viewers of this profile also ...
  7. Skills
  8. InMaps
  9. Data Products: Key Ideas. Recommendations: People You May Know, Viewers of this profile ...; Analytics and Insight: Profile Stats: Who Viewed My Profile, Skills; Visualization: InMaps
  10. Data Products: Challenges. LinkedIn: 2nd largest social network; 120 million members on LinkedIn; billions of connections; billions of pageviews; terabytes of data to process
  11. Outline: What do I mean by Data Products?; Systems and Tools we use; Let’s build “People You May Know”; Managing workflow; Serving data in production; Data Quality; Performance
  12. Systems and Tools: Kafka (LinkedIn); Hadoop (Apache); Azkaban (LinkedIn); Voldemort (LinkedIn)
  13. Systems and Tools. Kafka: publish-subscribe messaging system; transfers data from production to HDFS. Hadoop. Azkaban. Voldemort.
  14. Systems and Tools. Kafka. Hadoop: Java MapReduce and Pig; processes data. Azkaban. Voldemort.
  15. Systems and Tools. Kafka. Hadoop. Azkaban: Hadoop workflow management tool; manages hundreds of Hadoop jobs. Voldemort.
  16. Systems and Tools. Kafka. Hadoop. Azkaban. Voldemort: key-value store; stores the output of Hadoop jobs and serves it in production.
  18. People You May Know: How do people know each other? (Diagram: Alice, Bob, Carol)
  19. People You May Know: How do people know each other? (Diagram: Alice, Bob, Carol)
  20. People You May Know: How do people know each other? (Diagram: Alice, Bob, Carol) Triangle closing
  21. People You May Know: How do people know each other? (Diagram: Alice, Bob, Carol) Triangle closing: Prob(Bob knows Carol) ~ the # of common connections
  22. Triangle Closing in Pig
      -- connections in (source_id, dest_id) format in both directions
      connections = LOAD 'connections' USING PigStorage();
      -- collect each member's connections into one bag
      group_conn = GROUP connections BY source_id;
      -- generatePair is not a Pig builtin; it is presumably a user-defined function that emits pairs of a member's connections
      pairs = FOREACH group_conn GENERATE generatePair(connections.dest_id) as (id1, id2);
      -- count how many members produced each pair
      common_conn = GROUP pairs BY (id1, id2);
      common_conn = FOREACH common_conn GENERATE flatten(group) as (source_id, dest_id), COUNT(pairs) as common_connections;
      STORE common_conn INTO 'common_conn' USING PigStorage();
  23. Pig Overview. Load: load data, specify format; Store: store data, specify format; Foreach, Generate: projections, similar to select; Group by: group by column(s); Join, Filter, Limit, Order, ...; User Defined Functions (UDFs)
  29. Triangle Closing Example (Alice, Bob, Carol)
      Statement: connections = LOAD 'connections' USING PigStorage();
      1. (A,B),(B,A),(A,C),(C,A)
      2. (A,{B,C}),(B,{A}),(C,{A})
      3. (A,{B,C}),(A,{C,B})
      4. (B,C,1),(C,B,1)
  30. Triangle Closing Example (Alice, Bob, Carol)
      Statement: group_conn = GROUP connections BY source_id;
      1. (A,B),(B,A),(A,C),(C,A)
      2. (A,{B,C}),(B,{A}),(C,{A})
      3. (A,{B,C}),(A,{C,B})
      4. (B,C,1),(C,B,1)
  31. Triangle Closing Example (Alice, Bob, Carol)
      Statement: pairs = FOREACH group_conn GENERATE generatePair(connections.dest_id) as (id1, id2);
      1. (A,B),(B,A),(A,C),(C,A)
      2. (A,{B,C}),(B,{A}),(C,{A})
      3. (A,{B,C}),(A,{C,B})
      4. (B,C,1),(C,B,1)
  32. Triangle Closing Example (Alice, Bob, Carol)
      Statements: common_conn = GROUP pairs BY (id1, id2); common_conn = FOREACH common_conn GENERATE flatten(group) as (source_id, dest_id), COUNT(pairs) as common_connections;
      1. (A,B),(B,A),(A,C),(C,A)
      2. (A,{B,C}),(B,{A}),(C,{A})
      3. (A,{B,C}),(A,{C,B})
      4. (B,C,1),(C,B,1)
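The triangle-closing steps can also be traced outside Pig. The following is a minimal Python sketch, not the code used at LinkedIn: the function name `common_connection_counts` is hypothetical, and the behavior of `generatePair` is assumed to be "emit every ordered pair of a member's connections", which is what the worked example's final output suggests.

```python
from collections import Counter, defaultdict
from itertools import permutations


def common_connection_counts(edges):
    """Count shared connections for every ordered pair of members.

    `edges` holds (source_id, dest_id) tuples in both directions,
    mirroring the input format described in the Pig script.
    """
    # Group destination ids by source id (the GROUP ... BY source_id step).
    neighbors = defaultdict(set)
    for source_id, dest_id in edges:
        neighbors[source_id].add(dest_id)

    # Assumed generatePair behavior: every ordered pair of one member's
    # connections shares that member as a common connection.
    counts = Counter()
    for dests in neighbors.values():
        for id1, id2 in permutations(sorted(dests), 2):
            counts[(id1, id2)] += 1
    return counts


# Alice (A) is connected to Bob (B) and Carol (C).
edges = [("A", "B"), ("B", "A"), ("A", "C"), ("C", "A")]
counts = common_connection_counts(edges)
```

For this input the only pairs produced are (B, C) and (C, B), each with one common connection, which matches step 4 of the example.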
  33. Our Workflow: triangle-closing
  34. Our Workflow: triangle-closing -> top-n
  35. Our Workflow: triangle-closing -> top-n -> push-to-prod
  37. Our Workflow: triangle-closing -> top-n -> push-to-prod
  38. Our Workflow: triangle-closing -> remove connections -> top-n -> push-to-prod
  39. Our Workflow: triangle-closing -> remove connections -> top-n -> push-to-qa, push-to-prod
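The "remove connections" and "top-n" steps of this workflow can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the production jobs: the function name, the shape of `common_counts` (a dict keyed by `(source_id, dest_id)`), and ranking purely by common-connection count are all hypothetical.

```python
import heapq
from collections import defaultdict


def top_n_recommendations(common_counts, connections, n):
    """Drop pairs that are already connected, then keep the n best per member.

    `common_counts` maps (source_id, dest_id) to a common-connection count.
    `connections` is a set of existing (source_id, dest_id) edges.
    """
    candidates = defaultdict(list)
    for (source_id, dest_id), count in common_counts.items():
        # "remove connections": never recommend an existing connection.
        if (source_id, dest_id) in connections:
            continue
        candidates[source_id].append((count, dest_id))

    # "top-n": keep the n candidates with the most common connections.
    return {
        source_id: [dest_id for _, dest_id in heapq.nlargest(n, scored)]
        for source_id, scored in candidates.items()
    }


common_counts = {
    ("B", "A"): 5,  # already connected, so it is filtered out
    ("B", "C"): 3,
    ("B", "D"): 1,
    ("B", "E"): 2,
    ("C", "B"): 3,
}
connections = {("A", "B"), ("B", "A")}
recommendations = top_n_recommendations(common_counts, connections, n=2)
```

Here member B keeps C and E, the two unconnected candidates with the highest counts, and member C keeps B.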
  40. PYMK Workflow
  41. Workflow Requirements: Dependency management; Regular scheduling; Monitoring; Diverse jobs: Java, Pig, Clojure; Configuration/Parameters; Resource control/locking; Restart/Stop/Retry; Visualization; History; Logs
  42. Workflow Requirements: Dependency management; Regular scheduling; Monitoring; Diverse jobs: Java, Pig, Clojure; Configuration/Parameters; Resource control/locking; Restart/Stop/Retry; Visualization; History; Logs. Tool: Azkaban
  43. Sample Azkaban Job Spec
      type=pig
      pig.script=top-n.pig
      dependencies=remove-connections
      top.n.size=100
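To show what the `dependencies` property implies for execution order, here is a small Python sketch that parses `key=value` job specs and orders jobs so each one runs after the jobs it depends on. It is not how Azkaban is implemented. Only the `top-n` spec comes from the slide; the other specs, the helper names `parse_job` and `run_order`, and the comma-separated form of `dependencies` are assumptions made for illustration.

```python
from graphlib import TopologicalSorter


def parse_job(text):
    """Parse the key=value lines of a job spec into a dict."""
    props = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def run_order(jobs):
    """Return job names ordered so each job follows its dependencies."""
    graph = {}
    for name, props in jobs.items():
        deps = props.get("dependencies", "")
        # Assumption: several dependencies would be comma-separated.
        graph[name] = {d.strip() for d in deps.split(",") if d.strip()}
    return list(TopologicalSorter(graph).static_order())


jobs = {
    # Hypothetical specs for the other workflow steps.
    "triangle-closing": parse_job("type=pig"),
    "remove-connections": parse_job("type=pig\ndependencies=triangle-closing"),
    # The spec shown on the slide.
    "top-n": parse_job(
        "type=pig\n"
        "pig.script=top-n.pig\n"
        "dependencies=remove-connections\n"
        "top.n.size=100"
    ),
    "push-to-prod": parse_job("dependencies=top-n"),
}
order = run_order(jobs)
```

Because the four jobs form a single chain, the resulting order is triangle-closing, remove-connections, top-n, push-to-prod.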
  44. Azkaban Workflow
  45. Azkaban Workflow
  46. Azkaban Workflow
  47. Our Workflow: triangle-closing -> remove connections -> top-n -> push-to-prod
  48. Our Workflow: triangle-closing -> remove connections -> top-n -> push-to-prod
  50. Production Storage Requirements: Large amount of data/Scalable; Quick lookup/low latency; Versioning and Rollback; Fault tolerance; Offline index building
  51. Voldemort Storage: Large amount of data/Scalable; Quick lookup/low latency; Versioning and Rollback; Fault tolerance through replication; Read only; Offline index building
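The "Versioning and Rollback" and "Read only" requirements can be illustrated with a toy in-memory store. This is only a conceptual sketch: the class `VersionedReadOnlyStore` and its methods are hypothetical and are not Voldemort's API, and a real store would hold index files built offline on Hadoop rather than Python dicts.

```python
class VersionedReadOnlyStore:
    """Toy stand-in for a read-only store that keeps whole data set versions."""

    def __init__(self):
        self._versions = []  # complete data sets, oldest first
        self._current = -1   # index of the version being served

    def swap(self, dataset):
        """Install a data set that was built offline and start serving it."""
        # Discard any versions that were rolled back before this push.
        del self._versions[self._current + 1:]
        self._versions.append(dict(dataset))
        self._current = len(self._versions) - 1

    def rollback(self):
        """Serve the previous version again, for example after a bad push."""
        if self._current <= 0:
            raise RuntimeError("no earlier version to roll back to")
        self._current -= 1

    def get(self, key, default=None):
        """Look up a key in the version currently being served."""
        if self._current < 0:
            return default
        return self._versions[self._current].get(key, default)


store = VersionedReadOnlyStore()
store.swap({"B": ["C", "E"]})  # first push
store.swap({"B": ["D"]})       # second push replaces the whole data set
latest = store.get("B")
store.rollback()               # go back to the first push
restored = store.get("B")
```

Reads never modify the data; the only writes are whole-version swaps, which is what makes rolling back a matter of pointing at the previous version.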
  52. Data Cycle
  53. Voldemort RO Store
  54. Our Workflow: triangle-closing -> remove connections -> top-n -> push-to-prod
  56. Data Quality: Verification; QA store with viewer; Explain; Versioning/Rollback; Unit tests
  58. Performance
  59. Performance. Symmetry: if Bob knows Carol, then Carol knows Bob
  60. Performance. Symmetry: if Bob knows Carol, then Carol knows Bob. Limit: ignore members with > k connections
  61. Performance. Symmetry: if Bob knows Carol, then Carol knows Bob. Limit: ignore members with > k connections. Sampling: sample k connections
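The three performance ideas can be combined in one pair-counting routine. The sketch below is an assumption-laden illustration rather than the talk's implementation: the function name `pruned_common_counts`, the `sample` flag, and the fixed random seed are all hypothetical. Symmetry is exploited by emitting each unordered pair once (with `id1 < id2`), which halves the number of pairs to count.

```python
import random
from collections import Counter, defaultdict
from itertools import combinations


def pruned_common_counts(edges, k, sample=False, seed=0):
    """Count common connections per unordered pair, bounding work per member.

    Members with more than k connections are skipped ("limit") or, when
    `sample` is true, reduced to k randomly chosen connections ("sampling").
    """
    rng = random.Random(seed)
    neighbors = defaultdict(set)
    for source_id, dest_id in edges:
        neighbors[source_id].add(dest_id)

    counts = Counter()
    for dests in neighbors.values():
        members = sorted(dests)
        if len(members) > k:
            if not sample:
                continue  # limit: ignore this member entirely
            members = sorted(rng.sample(members, k))  # sampling
        # Symmetry: combinations yields each pair once, with id1 < id2.
        for id1, id2 in combinations(members, 2):
            counts[(id1, id2)] += 1
    return counts


# A is connected to B, C, and D (edges stored in both directions).
edges = [(a, b) for a, b in [("A", "B"), ("A", "C"), ("A", "D")]]
edges += [(b, a) for a, b in edges]

full = pruned_common_counts(edges, k=3)
limited = pruned_common_counts(edges, k=2)
sampled = pruned_common_counts(edges, k=2, sample=True)
```

With `k=3` all three pairs among B, C, and D are counted; with `k=2` the limit drops A's contribution entirely, while sampling keeps exactly one of the three pairs.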
  62. Things Covered: What do I mean by Data Products?; Systems and Tools we use; Let’s build “People You May Know”; Managing workflow; Serving data in production; Data Quality; Performance
  63. SNA Team. Thanks to SNA Team at LinkedIn. http://sna-projects.com We are hiring!
  64. Questions?
