
How to win friends and influence people (with Hadoop)

Sam Shah and I gave this talk at Strata+Hadoop World 2012

Published in: Technology

  1. How to Win Friends and Influence People (with Hadoop). Strata Conference New York. Sam Shah and Joseph Adler. October 25, 2012. ©2012 LinkedIn Corporation. All Rights Reserved.
  2. Sam Shah, Principal Engineer and Engineering Manager (www.linkedin.com/in/shahsam); Joseph Adler, Senior Data Scientist (www.linkedin.com/in/josephadler)
  3. LinkedIn is the leading professional network site: 175M+ LinkedIn members, 640M+ worldwide professionals, 3,300M+ worldwide workforce
  4. Data rich: 175M+ members, 175M member profiles
  5. LinkedIn: 9.3B page views per quarter, 130M unique visitors
  6. We have a lot of data.
  7. We have a lot of data. We want to leverage this data to build products.
  8. We have a lot of data. We want to leverage this data to build products. How do you make it easy to build products from data?
  9. Products we have built on Hadoop
  10. Building products from data. Examples of products built with data: Year in Review email, Network Updates, Skills and Endorsements, People You May Know, and more…
  11. Year in Review: one of the most successful email messages ever. 20% response rate, 5 clicks per responder.
  12. Network updates
  13. People you may know
  14. Skills and Endorsements
  15. Building products from data. Hadoop is awesome for building products with data:
      – Lots of cheap storage
      – Vast computational resources
      – Lots of tools for processing data and learning from data
      – Shared infrastructure
      – Shared support services
      – Runs on commodity hardware (or AWS)
  16. Leverage. The marginal cost of building new products is low:
      – People You May Know (2 people)
      – Skills and Endorsements (2 people)
      – Year in Review (1 person, 1 month)
      – Network Updates Stream (1 person, 3 months)
      Hadoop can empower small teams to build things.
  18. Turning data into products: how we build products
  19. Year in Review. Steps to make the email:
      – Collect job changers
      – Figure out who is connected to them
      – Rank job changes
  20. Example: Year in Review

      memberPosition = LOAD $latest_positions USING BinaryJSON;

      memberWithPositionsChangedLastYear = FOREACH (
          FILTER memberPosition BY ((start_date >= $start_date_low ) AND
              (start_date <= $start_date_high))
      ) GENERATE member_id, start_date, end_date;

      allConnections = LOAD $latest_bidirectional_connections USING BinaryJSON;

      allConnectionsWithChange_nondistinct = FOREACH (
          JOIN memberWithPositionsChangedLastYear BY member_id,
              allConnections BY dest) GENERATE allConnections::source AS source, allConnections::dest AS dest;

      allConnectionsWithChange = DISTINCT allConnectionsWithChange_nondistinct;

      memberinfowpics = LOAD $latest_memberinfowpics USING BinaryJSON;

      pictures = FOREACH ( FILTER memberinfowpics BY
          ((cropped_picture_id is not null) AND
          ( (member_picture_privacy == N) OR
            (member_picture_privacy == E)))) GENERATE member_id, cropped_picture_id, first_name as dest_first_name, last_name as dest_last_name;

      resultPic = JOIN allConnectionsWithChange BY dest, pictures BY member_id;

      connectionsWithChangeWithPic = FOREACH resultPic GENERATE
          allConnectionsWithChange::source AS source_id,
          allConnectionsWithChange::dest AS member_id,
          pictures::cropped_picture_id AS pic_id,
          pictures::dest_first_name AS dest_first_name,
          pictures::dest_last_name AS dest_last_name;

      joinResult = JOIN connectionsWithChangeWithPic BY source_id,
          memberinfowpics BY member_id;

      withName = FOREACH joinResult GENERATE
          connectionsWithChangeWithPic::source_id AS source_id,
          connectionsWithChangeWithPic::member_id AS member_id,
          connectionsWithChangeWithPic::dest_first_name as first_name,
          connectionsWithChangeWithPic::dest_last_name as last_name,
          connectionsWithChangeWithPic::pic_id AS pic_id,
          memberinfowpics::first_name AS firstName,
          memberinfowpics::last_name AS lastName,
          memberinfowpics::gmt_offset as gmt_offset,
          memberinfowpics::email_locale as email_locale,
          memberinfowpics::email_address as email_address;

      resultGroup0 = GROUP withName BY (source_id, firstName, lastName, email_address, email_locale, gmt_offset);

      -- get the count of results per recipient
      resultGroupCount = FOREACH resultGroup0 GENERATE group ,
          withName as toomany, COUNT_STAR(withName) as num_results;

      resultGroupPre = filter resultGroupCount by num_results > 2;
      resultGroup = FOREACH resultGroupPre {
          withName = LIMIT toomany 64;
          GENERATE group, withName, num_results;
      }

      x_in_review_pre_out = FOREACH resultGroup GENERATE
          FLATTEN(group) as (source_id, firstName, lastName,
              email_address, email_locale, gmt_offset),
          withName.(member_id, pic_id, first_name, last_name) as
              jobChanger, 2011 as changeYear:chararray,
          num_results as num_results;

      x_in_review = FOREACH x_in_review_pre_out GENERATE
          source_id as recipientID, gmt_offset as gmtOffset,
          firstName as first_name, lastName as last_name, email_address,
          email_locale,
          TOTUPLE( changeYear, source_id,firstName, lastName, num_results,jobChanger) as body;

      rmf $xir;
      STORE x_in_review INTO $xir USING BinaryJSON(recipientID);
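The three steps from slide 19, which the Pig script on slide 20 implements against cluster datasets, can be mirrored in memory on toy data. This is only a rough sketch: the input records, the `year_in_review` function, and its parameters are hypothetical stand-ins, and sorting by member ID stands in for whatever ranking the real email used. Note that, like the script, it counts all results before capping the list that is shown.

```python
from collections import defaultdict

# Hypothetical toy inputs standing in for the datasets the Pig script loads.
positions = [
    {"member_id": 2, "start_date": 20110301},
    {"member_id": 3, "start_date": 20110915},
    {"member_id": 4, "start_date": 20090101},  # outside the date window
]
connections = [  # (source, dest) pairs, stored in both directions
    (1, 2), (2, 1), (1, 3), (3, 1), (1, 4), (4, 1), (2, 3), (3, 2),
]
has_picture = {2, 3, 4}  # members with a visible profile picture


def year_in_review(positions, connections, has_picture,
                   low, high, min_results=2, max_shown=64):
    """Return {recipient_id: {"num_results": n, "job_changers": [...]}}."""
    # Step 1: collect members whose position started inside the window.
    changers = {p["member_id"] for p in positions
                if low <= p["start_date"] <= high}
    # Step 2: find who is connected to each job changer (a set deduplicates).
    by_recipient = defaultdict(set)
    for source, dest in connections:
        if dest in changers and dest in has_picture:
            by_recipient[source].add(dest)
    # Step 3: keep recipients with enough results and cap the list length.
    result = {}
    for recipient, found in by_recipient.items():
        if len(found) > min_results:
            result[recipient] = {
                "num_results": len(found),          # counted before the cap
                "job_changers": sorted(found)[:max_shown],
            }
    return result


# The toy graph is tiny, so the threshold is lowered from the script's 2 to 1.
emails = year_in_review(positions, connections, has_picture,
                        low=20110101, high=20111231, min_results=1)
```

Only member 1 is connected to more than one job changer here, so only member 1 would receive an email.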
  21. Example: Year in Review. Each message requires a lot of data:
      – Header information (10 fields)
      – 4 fields per person, 64 people
      – That’s over 250 data fields for the final message
      How do we turn this raw data into web content or email messages?

      {body={num_results=80, lastName=Adler, changeYear=2011, firstName=Joseph, jobChanger=[
        {last_name=OConnor, first_name=Br?on, member_id=12562482, pic_id=/p/3/000/086/1bd/10ee035.jpg},
        {last_name=Sundaram, first_name=Vivek, member_id=6590171, pic_id=/p/3/000/0ae/354/36eb54c.jpg},
        {last_name=Crane, first_name=Patrick, member_id=8628324, pic_id=/p/1/000/09c/064/10191de.jpg},
        {last_name=McLennan, first_name=Dan, member_id=10551114, pic_id=/p/2/000/09d/12f/147def1.jpg},
        {last_name=Shaughnessy, first_name=Helen, member_id=2211035, pic_id=/p/3/000/06d/2ba/06a113c.jpg},
        {last_name=Chen, first_name=Richard, member_id=12800647, pic_id=/p/2/000/007/1ad/0fb84f9.jpg},
        {last_name=Barba, first_name=Troy, member_id=27577, pic_id=/p/2/000/0a2/3e9/3a83a33.jpg},
        {last_name=Reed, first_name=Harper, member_id=1865420, pic_id=/p/1/000/001/17b/396a2c3.jpg},
        {last_name=Goldstein, first_name=Peter, member_id=205610, pic_id=/p/2/000/01c/2e6/042999f.jpg},
        {last_name=Koren, first_name=Yuval, member_id=2289577, pic_id=/p/1/000/02b/3d3/1fc3627.jpg},
        {last_name=Kiang, first_name=Andy, member_id=8347, pic_id=/p/1/000/063/115/1256f61.jpg},
        {last_name=Greenfield, first_name=Nick, member_id=82814545, pic_id=/p/1/000/068/39f/2080b8f.jpg},
        {last_name=Murarka, first_name=Bubba, member_id=174233, pic_id=/p/3/000/011/2c8/33837b8.jpg},
        {last_name=Kutter, first_name=Norbert, member_id=310933, pic_id=/p/3/000/005/0e2/02775a0.jpg},
        {last_name=Ehrenberg, first_name=Roger, member_id=1662181, pic_id=/p/3/000/038/066/3572baf.jpg},
        {last_name=Coderre, CISSP, first_name=Rob, member_id=68521, pic_id=/p/1/000/088/0d5/2438981.jpg},
        {last_name=Stephens, first_name=Bradford, member_id=10900447, pic_id=/p/1/000/0ad/0dc/15f9df5.jpg},
        {last_name=Shiau, first_name=Peter, member_id=300654, pic_id=/p/2/000/056/2a6/18938e3.jpg},
        {last_name=Rajan, first_name=Arvind, member_id=1260, pic_id=/p/3/000/019/3f7/1e6e0f2.jpg},
        {last_name=Bellister, first_name=Jesse, member_id=25234604, pic_id=/p/3/000/00a/17d/1e2136b.jpg},
        {last_name=Mohan, first_name=Viraj, member_id=56817108, pic_id=/p/3/000/0cd/0a4/097527a.jpg},
        {last_name=Ragade, first_name=Dhananjay, member_id=325284, pic_id=/p/3/000/000/035/0504fe7.jpg},
        {last_name=Richards, first_name=Jeff, member_id=16762, pic_id=/p/2/000/039/14e/081d1c7.jpg},
        {last_name=Wittenauer, first_name=Allen, member_id=3328775, pic_id=/p/3/000/08d/2a3/307b112.jpg},
        {last_name=Porzak, first_name=Jim, member_id=1708710, pic_id=/p/2/000/00d/109/0e4aa34.jpg},
        {last_name=Ruma, first_name=Laurel, member_id=3429732, pic_id=/p/1/000/01e/277/2bb115b.jpg},
        {last_name=Higgins, first_name=Josh, member_id=1458792, pic_id=/p/1/000/0c9/38b/1a24457.jpg},
        {last_name=Benedict, first_name=Harvey, member_id=641340, pic_id=/p/3/000/0c6/1eb/2eb7119.jpg},
        {last_name=Lazarus, first_name=Brett, member_id=49965786, pic_id=/p/2/000/03b/04e/318d080.jpg},
        {last_name=Zhang, first_name=Simon, member_id=16323996, pic_id=/p/3/000/03f/0fe/35d4ded.jpg},
        {last_name=Aspen, first_name=Matt, member_id=25240804, pic_id=/p/3/000/09b/371/22ec974.jpg},
        {last_name=Herz, first_name=Erik, member_id=147604, pic_id=/p/3/000/086/014/0fab4d6.jpg},
        {last_name=Sanders, first_name=Geoffrey, member_id=340570, pic_id=/p/1/000/0d1/2d1/37a76e6.jpg},
        {last_name=Wright, first_name=Caleb, member_id=12798700, pic_id=/p/2/000/08c/337/2cc951a.jpg},
        {last_name=Parab, first_name=Guru, member_id=8915230, pic_id=/p/1/000/08a/257/051926a.jpg},
        {last_name=Grossman, first_name=Nick, member_id=12159520, pic_id=/p/2/000/005/2f3/1955f31.jpg},
        {last_name=Skomoroch, first_name=Peter, member_id=11642980, pic_id=/p/2/000/0b4/12d/31eadbe.jpg},
        {last_name=Singh, first_name=Deepak, member_id=1246166, pic_id=/p/1/000/042/3f5/369f807.jpg},
        {last_name=Noakes, first_name=Geoffrey, member_id=3518726, pic_id=/p/3/000/005/3d7/3f67632.jpg},
        {last_name=Scudiere, first_name=Robert, member_id=3965286, pic_id=/p/2/000/090/210/009a099.jpg},
        {last_name=Skyler, first_name=David, member_id=15377099, pic_id=/p/3/000/005/1bf/080b255.jpg},
        {last_name=Sharma, first_name=Manu, member_id=19295378, pic_id=/p/3/000/0d4/11e/2176c30.jpg},
        {last_name=Huang, first_name=Erica, member_id=1808438, pic_id=/p/1/000/001/3a5/02ddd24.jpg},
        {last_name=Ballotta, first_name=Pete, member_id=2011178, pic_id=/p/2/000/0b6/08f/3a92357.jpg},
        {last_name=Kast, first_name=Anton, member_id=1092686, pic_id=/p/1/000/054/0e2/1a8efb2.jpg},
        {last_name=Redfern, first_name=Joff, member_id=2849241, pic_id=/p/3/000/03d/28d/19f5688.jpg},
        {last_name=Smith, first_name=Aaron, member_id=83470876, pic_id=/p/2/000/08c/27c/3cfe37a.jpg},
        {last_name=Yadav, first_name=Rishi, member_id=2097381, pic_id=/p/2/000/0c8/08d/3ab9006.jpg},
        {last_name=Repass, first_name=Mike, member_id=8633208, pic_id=/p/2/000/071/195/0bfc573.jpg},
        {last_name=Dalvi, first_name=Anand, member_id=8388, pic_id=/p/1/000/003/3cd/3127384.jpg},
        {last_name=Croll, first_name=Alistair, member_id=511218, pic_id=/p/2/000/029/0e5/1ebc076.jpg},
        {last_name=Tolman, first_name=Sarah, member_id=86040596, pic_id=/p/2/000/06f/1c9/1a7870e.jpg},
        {last_name=Suvarna, first_name=Sandeep, member_id=10558779, pic_id=/p/1/000/05b/2c7/0ec214a.jpg},
        {last_name=Elliott-McCrea, first_name=Kellan, member_id=163959, pic_id=/p/1/000/06b/2e8/2dbd3ae.jpg},
        {last_name=Jatkar, first_name=Tarang, member_id=17763609, pic_id=/p/1/000/012/010/2e8ee7f.jpg},
        {last_name=Brown, first_name=David, member_id=420737, pic_id=/p/3/000/002/140/0b2dbcc.jpg},
        {last_name=Patel, first_name=Jay, member_id=1179857, pic_id=/p/2/000/07c/0b2/0365e91.jpg},
        {last_name=Field, first_name=Dylan, member_id=13066037, pic_id=/p/2/000/0a5/3e2/1fb7f06.jpg},
        {last_name=Patel, first_name=Sumeet, member_id=23402387, pic_id=/p/2/000/0bf/3ca/2ca5f1f.jpg},
        {last_name=Ting, first_name=Moses, member_id=15624915, pic_id=/p/2/000/0ac/117/29e329a.jpg},
        {last_name=Hinnach, first_name=Yassine, member_id=1731285, pic_id=/p/3/000/000/035/330cce0.jpg},
        {last_name=Das, first_name=Anshu, member_id=38878221, pic_id=/p/3/000/0b2/1ac/15902f4.jpg},
        {last_name=Mendelson, first_name=Jordan, member_id=8598415, pic_id=/p/3/000/032/22a/1d2eaa6.jpg},
        {last_name=Besbeas, first_name=Nick, member_id=12510505, pic_id=/p/3/000/093/167/34f5b6b.jpg}],
        source_id=256842}, first_name=Joseph, email_locale=en_US, last_name=Adler, gmtOffset=-8, recipientID=256842, email_address=jadler@linkedin.com}
  22. People you may know
  23. People you may know: Alice, Bob, Carol
  24. People you may know: Alice, Bob, Carol. More than 80% of connections come from triangle closing.
  25. People you may know (diagram labels): Age, Organizational Overlap, Distance; Alice, Bob, Dave, Carol, Eve; Ranked Matches; User Interactions; Results
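Triangle closing, as named on slide 24, suggests people who share a connection with you: if Alice knows Bob and Bob knows Carol, Carol is a candidate for Alice. A minimal sketch follows, assuming an in-memory adjacency map. The graph edges and the `triangle_closing_candidates` function are invented for illustration (only the names Alice through Eve come from the slides), and counting shared connections is just one possible ranking signal; the production system combines more features than this.

```python
from collections import Counter


def triangle_closing_candidates(graph, member):
    """Rank non-connections of `member` by number of shared connections."""
    direct = graph.get(member, set())
    counts = Counter()
    for friend in direct:
        for candidate in graph.get(friend, set()):
            # Skip the member and people they are already connected to.
            if candidate != member and candidate not in direct:
                counts[candidate] += 1
    # Most shared connections first; ties broken by name for determinism.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical undirected graph using the names from the slides.
graph = {
    "Alice": {"Bob", "Dave"},
    "Bob": {"Alice", "Carol", "Eve"},
    "Carol": {"Bob", "Dave"},
    "Dave": {"Alice", "Carol"},
    "Eve": {"Bob"},
}
suggestions = triangle_closing_candidates(graph, "Alice")
```

Carol closes two triangles for Alice (through Bob and through Dave), so she ranks ahead of Eve, who closes one.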
  26. Skills and Endorsements
  27. Tagging Skills
  28. (no text)
  29. Skills and Endorsements. A combination of:
      – Propensity to know member
      – Propensity for member to have skill
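The slide does not say how the two propensities are combined. One simple possibility, shown here purely as an assumption, is to multiply them and rank (member, skill) pairs by the product. The probabilities, names, and the `endorsement_suggestions` function are all hypothetical.

```python
def endorsement_suggestions(p_know, p_skill, top_n=3):
    """Score (member, skill) pairs as P(know member) * P(member has skill).

    Multiplying the two propensities is an illustrative assumption,
    not the combination described in the talk.
    """
    scored = []
    for member, know in p_know.items():
        for skill, has in p_skill.get(member, {}).items():
            scored.append((round(know * has, 4), member, skill))
    # Highest score first; ties broken by member then skill.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return scored[:top_n]


# Hypothetical propensities for the viewing member's connections.
p_know = {"Bob": 0.9, "Carol": 0.5}
p_skill = {"Bob": {"Hadoop": 0.8, "Pig": 0.4}, "Carol": {"Hadoop": 0.9}}
top = endorsement_suggestions(p_know, p_skill, top_n=2)
```

With these made-up numbers, a well-known connection with a likely skill (Bob, Hadoop) outranks a less familiar connection with a slightly more likely skill (Carol, Hadoop).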
  30. Productionalization. Take something that runs once…
      – … and run it multiple times
      – … and serve it reliably at scale
      – … and iterate quickly
  31. Data Lifecycle. Moving around data is the key problem:
      1. Ingress: moving raw data from online systems to offline systems
      2. Workflow management: managing offline processes
      3. Egress: moving results from offline systems to online systems
  32. Ingress. Apache Kafka: low latency publish/subscribe message bus
      – Common data format (Avro)
      – Changelog is the abstraction for integration
      – Schema evolution: programmatic compatibility model, explicit schema reviews, “O(1)” ETL
      K. Goodhope, J. Koshy, J. Kreps, N. Narkhede, R. Park, J. Rao, V.Y. Ye: Building LinkedIn’s Real-time Activity Data Pipeline. In IEEE Data Engineering Bulletin, Vol. 35, No. 2, June 2012.
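A "programmatic compatibility model" can be pictured as an automated check that a new schema can still read data written with the old one. The sketch below is a deliberately simplified assumption of such a check, loosely modeled on Avro's rule that a field present in the reader schema but absent from the writer schema needs a default. It ignores Avro's type promotions, aliases, and nested types, and the `PageView` schema and `can_read` function are hypothetical.

```python
def can_read(reader_schema, writer_schema):
    """Return (ok, problems) for reading writer_schema data with reader_schema."""
    writer_fields = {f["name"]: f for f in writer_schema["fields"]}
    problems = []
    for field in reader_schema["fields"]:
        written = writer_fields.get(field["name"])
        if written is None:
            # A field the old data lacks must come with a default value.
            if "default" not in field:
                problems.append(f"new field '{field['name']}' has no default")
        elif written["type"] != field["type"]:
            # Simplification: any type change is treated as incompatible.
            problems.append(f"field '{field['name']}' changed type")
    return (not problems, problems)


# Hypothetical event schemas, written as plain dicts in Avro's JSON shape.
v1 = {"name": "PageView", "fields": [
    {"name": "member_id", "type": "long"},
    {"name": "url", "type": "string"},
]}
v2_ok = {"name": "PageView", "fields": v1["fields"] + [
    {"name": "referrer", "type": "string", "default": ""},
]}
v2_bad = {"name": "PageView", "fields": v1["fields"] + [
    {"name": "referrer", "type": "string"},
]}

ok_result = can_read(v2_ok, v1)
bad_result = can_read(v2_bad, v1)
```

A check of this kind could run whenever a schema change is proposed, so that incompatible changes are caught before any data is published.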
  33. Workflows: Job A, Job B, Job C
  34. Workflows: Job A, Job B, Job C, Push to Production
  35. Workflows: Job A, Job B, Job X, Job C, Push to Production
  36. Workflows: Job A, Job B, Job X, Job C, Push to Production, Push to QA
  37. Real workflows are complicated
  38. Workflow Management: Azkaban
      – Dependency management
      – Diverse job types (Pig, Hive, Java, ...)
      – Scheduling
      – Monitoring
      – Configuration
      – Retry/restart on failure
      – Resource locking
      – Log collection
      – Historical information
  39. Workflow Management: Azkaban
  40. Workflow Management: Azkaban
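Two of the features listed for Azkaban, dependency management and retry on failure, can be illustrated with a small in-process scheduler. This is a toy sketch and not Azkaban's API or job format: the `flow` dependency map is a guess at how the jobs on the workflow slides might relate, and `run_flow` is a hypothetical helper built on Python's standard `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependencies: each job maps to the jobs it must wait for.
flow = {
    "job_a": set(),
    "job_b": {"job_a"},
    "job_x": {"job_a"},
    "job_c": {"job_b", "job_x"},
    "push_to_production": {"job_c"},
}


def run_flow(flow, run_job, max_attempts=2):
    """Run jobs in dependency order, retrying a failed job up to max_attempts."""
    completed = []
    for job in TopologicalSorter(flow).static_order():
        for attempt in range(1, max_attempts + 1):
            try:
                run_job(job)
                completed.append(job)
                break
            except RuntimeError:
                # Give up and propagate the error after the last attempt.
                if attempt == max_attempts:
                    raise
    return completed


log = []
order = run_flow(flow, log.append)  # log.append stands in for real job logic
```

Every job appears in `order` only after all the jobs it depends on, so `push_to_production` always runs last in this flow.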
  41. Egress: Voldemort. Distributed key/value store. Easy to integrate into workflows:
      – Off the shelf jobs to copy Voldemort Stores
      – One line command in Pig
      Other slide labels: Cost of data load; Data stored per node?; Response time; Fail-over; How to transfer; Versioning & rollback
      R. Sumbaly, J. Kreps, L. Gao, A. Feinberg, C. Soman, & S. Shah. Serving Large-Scale Batch Computed Data With Project Voldemort. In FAST 2012.
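"Versioning & rollback" can be pictured as keeping each batch-computed dataset as a complete, immutable snapshot and switching which snapshot serves reads. The class below is an in-memory toy that illustrates the idea only; it is not Voldemort's API, and `VersionedStore` and its methods are hypothetical names.

```python
class VersionedStore:
    """Toy read-only store that serves one of several complete snapshots."""

    def __init__(self):
        self._versions = []  # complete snapshots, oldest first
        self._active = -1    # index of the snapshot serving reads

    def swap(self, snapshot):
        """Load a freshly built snapshot and start serving it."""
        # Discard snapshots newer than the active one (left by a rollback).
        del self._versions[self._active + 1:]
        self._versions.append(dict(snapshot))
        self._active = len(self._versions) - 1

    def rollback(self):
        """Serve the previous snapshot again."""
        if self._active <= 0:
            raise RuntimeError("no earlier version to roll back to")
        self._active -= 1

    def get(self, key, default=None):
        if self._active < 0:
            return default
        return self._versions[self._active].get(key, default)


store = VersionedStore()
store.swap({"alice": ["carol"]})          # output of the first batch run
store.swap({"alice": ["carol", "eve"]})   # output of the next batch run
after_swap = store.get("alice")
store.rollback()                          # the new data turned out to be bad
after_rollback = store.get("alice")
```

Because readers only ever see a complete snapshot, a bad batch run can be undone by a rollback instead of by repairing individual records.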
  42. Recap: why we use Hadoop
      – Simple programmatic model
      – Rich developer ecosystem (languages: Pig, Hive, Crunch, Cascading, …; libraries: Mahout, DataFu, ElephantBird, …)
      – Horizontal scalability, fault tolerance, multi-tenancy (reliably process multiple TB of data)
      – Don’t need hardcore distributed systems engineers
  43. Recap: how we use Hadoop. Open source projects started at LinkedIn:
      – Getting data in: Kafka
      – Building and running job flows: Azkaban
      – Getting data out: Voldemort
      This empowers data scientists and engineers to focus on new product ideas, not infrastructure.
  44. Learning more: data.linkedin.com
