
Building Data Products using Hadoop at LinkedIn - Mitul Tiwari


Hadoop and other big data tools, such as Voldemort, Azkaban, and Kafka, drive many data driven products at LinkedIn, such as “People You May Know” and various recommendation products such as “Jobs You May Be Interested In”. Each of these products can be viewed as a large-scale social recommendation problem that analyzes billions of possible options and suggests appropriate recommendations.

Since these products analyze billions of edges and terabytes of data daily, they can be built only on a large-scale distributed compute infrastructure. The Kafka publish-subscribe messaging system is used to get the data into the Hadoop file system. Hadoop MapReduce is used as the basic building block to analyze billions of potential options and predict recommendations. Over a hundred MapReduce tasks are combined into a workflow using Azkaban, a Hadoop workflow management tool. The output of the Hadoop jobs is finally stored in the Voldemort key-value store to serve the data efficiently at run time.

During this talk, the audience will get a basic understanding of the link prediction problem behind the “People You May Know” feature, which is a large-scale social recommendation problem. An overview of the solution to this problem using Hadoop MapReduce, the Azkaban workflow management tool, and the Voldemort key-value store will be presented. I will also describe how to efficiently compute the number of common connections (triangle closing) using Hadoop MapReduce, which is one of the many signals in link prediction.

Overall, people interested in building interesting applications using Hadoop MapReduce will benefit greatly from this talk.

Notes
  • Hi, I am Mitul Tiwari. Today I am going to talk about building data driven products using Hadoop at LinkedIn.
  • I am part of the Search, Network, and Analytics team at LinkedIn, and I work on data driven products such as People You May Know.
  • Let me illustrate through a few examples of data products at LinkedIn.
  • LinkedIn is the second largest social network for professionals with more than 100 million members. PYMK is a large scale recommendation system that helps you connect with others. Basically, PYMK is a link prediction problem, where we analyze billions of edges to recommend possible connections to you. A big big-data problem!
  • Another example of a data product at LinkedIn is “Profile Stats”, or “Who Viewed My Profile”. Profile Stats provides analytics about your profile on LinkedIn: who viewed your profile, the top search queries leading to your profile, the number of profile views per day/week, the location of the visitors, etc. With billions of pageviews per month, Profile Stats is another big data problem.
  • Another example of a data product at LinkedIn is “Viewers of this profile also viewed these profiles”, a collaborative-filtering way of suggesting profiles.
  • Topic pages for skills.
  • Visualize your connections: cluster your connections based on the connection density among them.
  • The key ideas behind these data products are Recommendations, Analytics, Insight, and Visualization.
  • Some challenges behind building data driven products at LinkedIn: a naive implementation of PYMK may result in generating 120M x 120M pairs, which is 14,400 trillion pairs. So you have to be smart about it. So which data product would you like me to build during this talk?
  • Here is a Pig script to do triangle closing, that is, to find the number of common connections between any pair of members.
  • So how many of you are familiar with Pig? Let me refresh some Pig constructs for those who are not very familiar with Pig.
  • First, we load the connections data, which is in bidirectional pair format, representing each direction of an edge by a pair of member IDs.
  • Then we group connection pairs by source_id to aggregate all connections for each member.
  • From the aggregated connections we generate pairs of members (id1, id2) which are friends-of-friends through a common source_id.
  • Now we group by (id1, id2) to aggregate all common connections, and count to find the number of common connections.
  • Finally, we store common connections data in HDFS.
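    Putting these steps together, the full triangle-closing script from the deck (slide 22 in the transcript below), with a comment on each step; generatePair here is a user-defined function that emits friend-of-friend pairs from a member's connection list:
    -- connections in (source_id, dest_id) format in both directions
    connections = LOAD 'connections' USING PigStorage();
    -- aggregate all connections of each member
    group_conn = GROUP connections BY source_id;
    -- generate (id1, id2) pairs that are friends-of-friends through this source_id
    pairs = FOREACH group_conn GENERATE generatePair(connections.dest_id) as (id1, id2);
    -- count the common connections for each (id1, id2) pair
    common_conn = GROUP pairs BY (id1, id2);
    common_conn = FOREACH common_conn GENERATE flatten(group) as (source_id, dest_id), COUNT(pairs) as common_connections;
    -- store the result in HDFS
    STORE common_conn INTO 'common_conn' USING PigStorage();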
  • Let me illustrate triangle closing through our running example. First, we load each directed edge, represented as a pair of member IDs.
  • Then we group connection pairs by source_id to aggregate all connections for each member.
  • From the aggregated connections we generate pairs of members (id1, id2) which are friends-of-friends through a common source_id.
  • Finally we group by (id1, id2) to aggregate all common connections, and count to find the number of common connections.
  • After we are done with triangle closing we can list each member's friends-of-friends, ordered by the number of common connections.
  • Since there might be too many people who are your friends-of-friends, you might want to select the top n from that list. For example, there are more than a hundred and fifty thousand people who are my friends-of-friends.
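    A minimal top-n sketch in Pig (the aliases and the limit of 100 are illustrative, not the production values): group the triangle-closing output by member, order each member's candidates by common connections, and keep the top 100.
    grouped = GROUP common_conn BY source_id;
    top_n = FOREACH grouped {
        ordered = ORDER common_conn BY common_connections DESC;
        limited = LIMIT ordered 100;
        GENERATE FLATTEN(limited);
    };
    STORE top_n INTO 'top_n' USING PigStorage();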
  • Next, you need to push this data out to production. So that's a simple workflow.
  • I just described how you can build People You May Know by doing triangle closing and finding friends-of-friends and the number of common connections between them. As you just saw, there might be multiple jobs that depend on each other and have to run in that order. So how do we manage our workflow?
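    Azkaban captures these dependencies declaratively: each job gets a small spec file naming its type and the jobs it depends on. The sample spec from slide 43 below (for the top-n job) looks like this:
    type=pig
    pig.script=top-n.pig
    dependencies=remove-connections
    top.n.size=100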
  • We need to ensure that we are showing good-quality data to our members. First, we verify that data transfer between HDFS and the production system is done properly. Second, we push data to a QA store with a viewer to check for any blatant mistakes. Third, we can explain any PYMK recommendation: how and why that recommendation is appearing. Fourth, we have ways to roll back in case something goes wrong. And finally, we have unit tests in place to check that things are processed as we expect.
  • First, we can improve performance by 50% by utilizing symmetry in our triangle closing: if Bob is a friend-of-friend of Carol, then Carol is a friend-of-friend of Bob. Second, there are supernodes in our social graph. For example, Barack Obama has more than 10,000 connections on LinkedIn; if we generate f-o-f pairs from his connections, there will be more than 100 million f-o-f pairs. Third, we can sample a certain number of connections to decrease the number of f-o-f pairs, and randomize so that we generate different pairs every day.
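    A minimal sketch in Pig of the supernode limit (the alias names and the 10,000 threshold are illustrative): filter out members with too many connections before generating friend-of-friend pairs; the sampling variant would instead keep a random subset of each such member's connections.
    group_conn = GROUP connections BY source_id;
    -- skip supernodes: members with more than 10,000 connections
    capped = FILTER group_conn BY COUNT(connections) <= 10000;
    pairs = FOREACH capped GENERATE generatePair(connections.dest_id) as (id1, id2);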
  • OSCON Data award for open source contributions.

Transcript

  • 1. Building Data Products using Hadoop at LinkedIn
    • Mitul Tiwari
    • Search, Network, and Analytics (SNA)
    • LinkedIn
  • 2. Who am I?
  • 3.
    • What do I mean by Data Products?
  • 4. People You May Know
  • 5. Profile Stats: WVMP
  • 6. Viewers of this profile also ...
  • 7. Skills
  • 8. InMaps
  • 9. Data Products: Key Ideas
    • Recommendations
      • People You May Know, Viewers of this profile ...
    • Analytics and Insight
      • Profile Stats: Who Viewed My Profile, Skills
    • Visualization
      • InMaps
  • 10. Data Products: Challenges
    • LinkedIn: 2nd largest social network
    • 120 million members on LinkedIn
    • Billions of connections
    • Billions of pageviews
    • Terabytes of data to process
  • 11. Outline
    • What do I mean by Data Products?
    • Systems and Tools we use
    • Let’s build “People You May Know”
    • Managing workflow
    • Serving data in production
    • Data Quality
    • Performance
  • 12. Systems and Tools
    • Kafka (LinkedIn)
    • Hadoop (Apache)
    • Azkaban (LinkedIn)
    • Voldemort (LinkedIn)
  • 13. Systems and Tools
    • Kafka
      • publish-subscribe messaging system
      • transfer data from production to HDFS
    • Hadoop
    • Azkaban
    • Voldemort
  • 14. Systems and Tools
    • Kafka
    • Hadoop
      • Java MapReduce and Pig
      • process data
    • Azkaban
    • Voldemort
  • 15. Systems and Tools
    • Kafka
    • Hadoop
    • Azkaban
      • Hadoop workflow management tool
      • to manage hundreds of Hadoop jobs
    • Voldemort
  • 16. Systems and Tools
    • Kafka
    • Hadoop
    • Azkaban
    • Voldemort
      • Key-value store
      • store output of Hadoop jobs and serve in production
  • 17. Outline
    • What do I mean by Data Products?
    • Systems and Tools we use
    • Let’s build “People You May Know”
    • Managing workflow
    • Serving data in production
    • Data Quality
    • Performance
  • 18. People You May Know Alice Bob Carol How do people know each other?
  • 19. People You May Know Alice Bob Carol How do people know each other?
  • 20. People You May Know Alice Bob Carol Triangle closing How do people know each other?
  • 21. People You May Know Alice Bob Carol Triangle closing Prob(Bob knows Carol) ~ the # of common connections How do people know each other?
  • 22. Triangle Closing in Pig
    -- connections in (source_id, dest_id) format in both directions
    connections = LOAD 'connections' USING PigStorage();
    group_conn = GROUP connections BY source_id;
    pairs = FOREACH group_conn GENERATE generatePair(connections.dest_id) as (id1, id2);
    common_conn = GROUP pairs BY (id1, id2);
    common_conn = FOREACH common_conn GENERATE flatten(group) as (source_id, dest_id), COUNT(pairs) as common_connections;
    STORE common_conn INTO 'common_conn' USING PigStorage();
  • 23. Pig Overview
    • Load: load data, specify format
    • Store: store data, specify format
    • Foreach, Generate: Projections, similar to select
    • Group by: group by column(s)
    • Join, Filter, Limit, Order, ...
    • User Defined Functions (UDFs)
  • 24.-28. Triangle Closing in Pig (the same script as slide 22, repeated as the talk steps through it statement by statement)
  • 29. Triangle Closing Example Alice Bob Carol
    • (A,B),(B,A),(A,C),(C,A)
    • (A,{B,C}),(B,{A}),(C,{A})
    • (A,{B,C}),(A,{C,B})
    • (B,C,1), (C,B,1)
    connections = LOAD 'connections' USING PigStorage();
  • 30. Triangle Closing Example Alice Bob Carol
    • (A,B),(B,A),(A,C),(C,A)
    • (A,{B,C}),(B,{A}),(C,{A})
    • (A,{B,C}),(A,{C,B})
    • (B,C,1), (C,B,1)
    group_conn = GROUP connections BY source_id;
  • 31. Triangle Closing Example Alice Bob Carol
    • (A,B),(B,A),(A,C),(C,A)
    • (A,{B,C}),(B,{A}),(C,{A})
    • (A,{B,C}),(A,{C,B})
    • (B,C,1), (C,B,1)
    pairs = FOREACH group_conn GENERATE generatePair(connections.dest_id) as (id1, id2);
  • 32. Triangle Closing Example Alice Bob Carol
    • (A,B),(B,A),(A,C),(C,A)
    • (A,{B,C}),(B,{A}),(C,{A})
    • (A,{B,C}),(A,{C,B})
    • (B,C,1), (C,B,1)
    common_conn = GROUP pairs BY (id1, id2);
    common_conn = FOREACH common_conn GENERATE flatten(group) as (source_id, dest_id), COUNT(pairs) as common_connections;
  • 33. Our Workflow triangle-closing
  • 34. Our Workflow triangle-closing top-n
  • 35. Our Workflow triangle-closing top-n push-to-prod
  • 36. Outline
    • What do I mean by Data Products?
    • Systems and Tools we use
    • Let’s build “People You May Know”
    • Managing workflow
    • Serving data in production
    • Data Quality
    • Performance
  • 37. Our Workflow triangle-closing top-n push-to-prod
  • 38. Our Workflow triangle-closing top-n push-to-prod remove connections
  • 39. Our Workflow triangle-closing top-n push-to-prod remove connections push-to-qa
  • 40. PYMK Workflow
  • 41. Workflow Requirements
    • Dependency management
    • Regular Scheduling
    • Monitoring
    • Diverse jobs: Java, Pig, Clojure
    • Configuration/Parameters
    • Resource control/locking
    • Restart/Stop/Retry
    • Visualization
    • History
    • Logs
  • 42. Workflow Requirements
    • Dependency management
    • Regular Scheduling
    • Monitoring
    • Diverse jobs: Java, Pig, Clojure
    • Configuration/Parameters
    • Resource control/locking
    • Restart/Stop/Retry
    • Visualization
    • History
    • Logs
    Azkaban
  • 43. Sample Azkaban Job Spec
    • type=pig
    • pig.script=top-n.pig
    • dependencies=remove-connections
    • top.n.size=100
  • 44. Azkaban Workflow
  • 45. Azkaban Workflow
  • 46. Azkaban Workflow
  • 47. Our Workflow triangle-closing top-n push-to-prod remove connections
  • 48. Our Workflow triangle-closing top-n push-to-prod remove connections
  • 49. Outline
    • What do I mean by Data Products?
    • Systems and Tools we use
    • Let’s build “People You May Know”
    • Managing workflow
    • Serving data in production
    • Data Quality
    • Performance
  • 50. Production Storage
    • Requirements
      • Large amount of data/Scalable
      • Quick lookup/low latency
      • Versioning and Rollback
      • Fault tolerance
      • Offline index building
  • 51. Voldemort Storage
      • Large amount of data/Scalable
      • Quick lookup/low latency
      • Versioning and Rollback
      • Fault tolerance through replication
      • Read only
      • Offline index building
  • 52. Data Cycle
  • 53. Voldemort RO Store
  • 54. Our Workflow triangle-closing top-n push-to-prod remove connections
  • 55. Outline
    • What do I mean by Data Products?
    • Systems and Tools we use
    • Let’s build “People You May Know”
    • Managing workflow
    • Serving data in production
    • Data Quality
    • Performance
  • 56. Data Quality
    • Verification
    • QA store with viewer
    • Explain
    • Versioning/Rollback
    • Unit tests
  • 57. Outline
    • What do I mean by Data Products?
    • Systems and Tools we use
    • Let’s build “People You May Know”
    • Managing workflow
    • Serving data in production
    • Data Quality
    • Performance
  • 58. Performance
    • Symmetry
      • Bob knows Carol then Carol knows Bob
    • Limit
      • Ignore members with > k connections
    • Sampling
      • Sample k-connections
  • 59. Things Covered
    • What do I mean by Data Products?
    • Systems and Tools we use
    • Let’s build “People You May Know”
    • Managing workflow
    • Serving data in production
    • Data Quality
    • Performance
  • 60. SNA Team
    • Thanks to SNA Team at LinkedIn
    • http://sna-projects.com
    • We are hiring!
  • 61. Questions?
  • 62. Outline
    • What do I mean by Data Products?
    • Systems and Tools we use
    • Let’s build “People You May Know”
    • Managing workflow
    • Serving data in production
    • Data Quality
    • Performance
  • 63. Outline
    • What do I mean by Data Products?
    • Systems and Tools we use
    • Let’s build “People You May Know”
    • Managing workflow
    • Serving data in production
    • Data Quality
    • Performance