How to win friends and influence people (with Hadoop)
Sam Shah and I gave this talk at Strata+Hadoop World 2012

  • Today, Sam and I are going to talk about how we use Hadoop to build products with data. Sam and I are both engineers at LinkedIn. My title is trendier than Sam’s, but don’t hold that against me. Or him. We both know how to build products with data. Both of us have talked about a lot of the products in this presentation before, but we haven’t focused as much on infrastructure.
  • We’d like to start by telling you a little bit about LinkedIn (and LinkedIn’s data). LinkedIn is the leading web site for professional networking. We currently have over 175 million members, and we’re still growing. That means that our data is growing too.
  • Each member has a profile. We know a lot about our members (start scrolling animation)… We know their current position, past positions, schools they attended, skills they have, skills that other people have endorsed them for, people and companies they follow, companies they work at. We think this data is very interesting. We can use this data to help members connect to each other, and make them more productive. That’s actually LinkedIn’s mission statement… I can’t believe I recited our mission statement in a public presentation. Anyway, let’s take a look at how we use this data.
  • When a user logs into LinkedIn, they see a page like this. Almost every part of our home page has been touched by data science. The home page is purely driven by data: news articles, news stream, PYMK, display ad, JYMBII, WVMP, GYML, etc. And by the way, we also learn what our members like and don’t like. We have over 130 million visitors to our site every quarter, and deliver over 9.3 billion web pages. (That’s even more data.)
  • So, here’s the point of today’s talk. At LinkedIn, we have a lot of data.
  • We store our data in Hadoop, and we want to build products using that data on Hadoop.
  • So here’s the big challenge: how do we make it easy for our engineers, product managers, data scientists, analysts, web devs, receptionists, whatever, to build products from our data? That’s what we’re going to talk about today. We’ll tell you about some of the products that we’ve built from data, how we built these products, and why we built infrastructure to support these products.
  • Let’s start by telling you a little more about some of the products that we have built with Hadoop. Then we’ll tell you more about two of those products and the challenges that we faced productionalizing them.
  • More examples: Groups you might like, Network updates digest email, “People who viewed this profile also viewed”, etc.
  • Let’s start with a project that I worked on at LinkedIn that I think illustrates the power of building products with data. (Ask audience: “Who got this email?”) We sent this to every LinkedIn member who had a lot of job changes in their network. [now read the numbers] Later in this presentation, we’ll tell you how we built this email from our data. I’ll even show you the code.
  • Here is another example of a product that I’ve worked on. In the network stream on our home page, we’ve started sharing trends and patterns in data. We also tell you things that you might not know about your network. For example, it turns out that 21 of my former coworkers are now working at Google.
  • One of the most famous examples of data products at LinkedIn is People You May Know. PYMK was invented at LinkedIn. The idea of PYMK was to help you discover current coworkers, former coworkers, and friends on LinkedIn to help make your experience better. (This is not an actual screen shot from my account; I’m already connected to Sam.) We used Hadoop to build and scale PYMK. (We’ll also tell you more about how we built PYMK later in this presentation.)
  • Has anyone in the room seen a screen like this on LinkedIn? Has anyone endorsed someone else? Has anyone found it hard to stop endorsing people? We also used Hadoop to build our suggested endorsements.
  • We love using Hadoop for building data products. There are so many things that are great about Hadoop. (Our user quotas are in TB.) Hundreds of nodes. Great tools for working with data like Pig, Hive, and Crunch. Shared infrastructure. Hundreds of employees have accounts on Hadoop and run jobs (engineers, data scientists, product managers, even designers and finance people).
  • One of the greatest advantages of Hadoop is that it empowers small teams to build great things. Here are a few examples. Most of the items on this list are big, important features: lots of page views, lots of new connections, lots of great content. The marginal cost of building more products is low.
  • Let’s talk a little more about the year in review email. This is actually a pretty straightforward message in theory. Here’s how we do it. (Read slide) There isn’t any machine learning, or fancy algorithms. It’s just grouping and ranking. And in practice, it’s not that hard.
  • This is the code to compose this message. It’s about 60 lines of code, and most of that code involves renaming things. This is why we love Hadoop: we can do something simple without much code… Great! We’re done. We write this code and the message is done.
  • Well, not so fast… here’s the challenge. We know how to do the computation to make this message. But every message requires a lot of data: we potentially look at hundreds of MB of data before generating every message, and in the end the messages are up to 1 MB in size. How do we get all the raw data that we need to make this message? How do we keep it up to date? How do we run this job frequently so the results stay current? How do we get these results out of Hadoop, turn them into email messages, and send them out? Let’s consider another problem.
  • One of the most famous examples of data products is People You May Know. PYMK was invented at LinkedIn. The idea of PYMK was to help you discover current coworkers, former coworkers, and friends on LinkedIn to help make your experience better. (This is not an actual screen shot from my account; I’m already connected to Sam.) We used Hadoop to build and scale PYMK.
  • PYMK started simpler, grew more complicated. Complicated workflow, required tools and infrastructure to do this, so we needed it in place.
  • Throw over the wall from data science to productionization. No one dedicated to productionization. Provided “as a service” to do so.
  • Don’t want to beg for data. Others: Scribe, Flume
  • - Others: Oozie
  • - Others: Hbase, Cassandra, Kafka

How to win friends and influence people (with Hadoop) Presentation Transcript

  • 1. How to Win Friends and Influence People (with Hadoop). Strata Conference New York. Sam Shah and Joseph Adler. October 25, 2012. ©2012 LinkedIn Corporation. All Rights Reserved.
  • 2. Sam Shah, Principal Engineer and Engineering Manager (www.linkedin.com/in/shahsam); Joseph Adler, Senior Data Scientist (www.linkedin.com/in/josephadler)
  • 3. LinkedIn is the leading professional network site: 175M+ LinkedIn Members, 640M+ Worldwide Professionals, 3,300M+ Worldwide Workforce
  • 4. Data rich: 175M+ Members, 175M Member Profiles
  • 5. LinkedIn: 9.3B Page Views per Quarter, 130M Unique Visitors
  • 6. We have a lot of data.
  • 7. We have a lot of data. We want to leverage this data to build products.
  • 8. We have a lot of data. We want to leverage this data to build products. How do you make it easy to build products from data?
  • 9. Products we have built on Hadoop
  • 10. Building products from data. Examples of products built with data: Year in Review Email, Network Updates, Skills and Endorsements, People You May Know, and more…
  • 11. Year in Review: One of the most successful email messages ever. 20% Response Rate, 5 Clicks per responder
  • 12. Network updates
  • 13. People you may know
  • 14. Skills and Endorsements
  • 15. Building products from data. Hadoop is awesome for building products with data: Lots of cheap storage; Vast computational resources; Lots of tools for processing data, learning from data; Shared infrastructure; Shared support services; Runs on commodity hardware (or AWS)
  • 16. Leverage. The marginal cost of building new products is low: People You May Know (2 people), Skills and Endorsements (2 people), Year in Review (1 person, 1 month), Network Updates Stream (1 person, 3 months). Hadoop can empower small teams to build things.
  • 17. Leverage. The marginal cost of building new products is low: People You May Know (2 people), Skills and Endorsements (2 people), Year in Review (1 person, 1 month), Network Updates Stream (1 person, 3 months). Hadoop can empower small teams to build things.
  • 18. Turning data into products: How we build products
  • 19. Year in Review. Steps to make the email: – Collect job changers – Figure out who is connected to them – Rank job changes
  • 20. Example: Year in Review
    memberPosition = LOAD $latest_positions USING BinaryJSON;

    memberWithPositionsChangedLastYear = FOREACH (
        FILTER memberPosition BY ((start_date >= $start_date_low) AND
                                  (start_date <= $start_date_high))
    ) GENERATE member_id, start_date, end_date;

    allConnections = LOAD $latest_bidirectional_connections USING BinaryJSON;

    allConnectionsWithChange_nondistinct = FOREACH (
        JOIN memberWithPositionsChangedLastYear BY member_id,
             allConnections BY dest
    ) GENERATE allConnections::source AS source, allConnections::dest AS dest;

    allConnectionsWithChange = DISTINCT allConnectionsWithChange_nondistinct;

    memberinfowpics = LOAD $latest_memberinfowpics USING BinaryJSON;

    pictures = FOREACH (
        FILTER memberinfowpics BY
            ((cropped_picture_id is not null) AND
             ((member_picture_privacy == N) OR
              (member_picture_privacy == E)))
    ) GENERATE member_id, cropped_picture_id,
               first_name as dest_first_name, last_name as dest_last_name;

    resultPic = JOIN allConnectionsWithChange BY dest, pictures BY member_id;

    connectionsWithChangeWithPic = FOREACH resultPic GENERATE
        allConnectionsWithChange::source AS source_id,
        allConnectionsWithChange::dest AS member_id,
        pictures::cropped_picture_id AS pic_id,
        pictures::dest_first_name AS dest_first_name,
        pictures::dest_last_name AS dest_last_name;

    joinResult = JOIN connectionsWithChangeWithPic BY source_id,
                      memberinfowpics BY member_id;

    withName = FOREACH joinResult GENERATE
        connectionsWithChangeWithPic::source_id AS source_id,
        connectionsWithChangeWithPic::member_id AS member_id,
        connectionsWithChangeWithPic::dest_first_name as first_name,
        connectionsWithChangeWithPic::dest_last_name as last_name,
        connectionsWithChangeWithPic::pic_id AS pic_id,
        memberinfowpics::first_name AS firstName,
        memberinfowpics::last_name AS lastName,
        memberinfowpics::gmt_offset as gmt_offset,
        memberinfowpics::email_locale as email_locale,
        memberinfowpics::email_address as email_address;

    resultGroup0 = GROUP withName BY (source_id, firstName, lastName,
                                      email_address, email_locale, gmt_offset);

    -- get the count of results per recipient
    resultGroupCount = FOREACH resultGroup0 GENERATE group,
        withName as toomany, COUNT_STAR(withName) as num_results;

    resultGroupPre = filter resultGroupCount by num_results > 2;

    resultGroup = FOREACH resultGroupPre {
        withName = LIMIT toomany 64;
        GENERATE group, withName, num_results;
    }

    x_in_review_pre_out = FOREACH resultGroup GENERATE
        FLATTEN(group) as (source_id, firstName, lastName,
                           email_address, email_locale, gmt_offset),
        withName.(member_id, pic_id, first_name, last_name) as jobChanger,
        2011 as changeYear:chararray,
        num_results as num_results;

    x_in_review = FOREACH x_in_review_pre_out GENERATE
        source_id as recipientID, gmt_offset as gmtOffset,
        firstName as first_name, lastName as last_name, email_address,
        email_locale,
        TOTUPLE(changeYear, source_id, firstName, lastName,
                num_results, jobChanger) as body;

    rmf $xir;
    STORE x_in_review INTO $xir USING BinaryJSON(recipientID);
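For readers who do not work in Pig, the three steps on slide 19 (collect job changers, find who is connected to them, group and cap the results) can be sketched in plain Python. This is only an illustration under stated assumptions, not the production job: the function name `year_in_review`, the tuple-based inputs, and the integer date encoding are all hypothetical, and the alphabetical `sorted` call is a placeholder for whatever ranking is actually applied. The thresholds mirror the script above (more than 2 results, at most 64 people listed).

```python
from collections import defaultdict

def year_in_review(positions, connections, start_low, start_high,
                   min_results=3, max_listed=64):
    """Group job changers by the connections who should hear about them.

    positions:   iterable of (member_id, start_date) pairs (hypothetical layout)
    connections: iterable of (source, dest) pairs, one per direction
    """
    # Step 1: collect members who started a position inside the window.
    changers = {m for m, start in positions if start_low <= start <= start_high}

    # Step 2: for each edge pointing at a job changer, credit the source member.
    by_recipient = defaultdict(set)
    for source, dest in connections:
        if dest in changers:
            by_recipient[source].add(dest)

    # Step 3: keep recipients with enough results and cap the list length.
    result = {}
    for recipient, found in by_recipient.items():
        if len(found) >= min_results:
            result[recipient] = {
                "num_results": len(found),
                # Placeholder ordering; a real ranking would go here.
                "jobChanger": sorted(found)[:max_listed],
            }
    return result
```

Using a set per recipient plays the role of the `DISTINCT` step, so duplicate edges do not inflate the count.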
  • 21. Example: Year in Review. Each message requires a lot of data: – Header information (10 fields) – 4 fields per person, 64 people – That’s over 250 data fields for the final message. How do we turn this raw data into web content or email messages?
    {body={num_results=80, lastName=Adler, changeYear=2011, firstName=Joseph, jobChanger=[{last_name=OConnor, first_name=Br?on, member_id=12562482, pic_id=/p/3/000/086/1bd/10ee035.jpg}, {last_name=Sundaram, first_name=Vivek, member_id=6590171, pic_id=/p/3/000/0ae/354/36eb54c.jpg}, {last_name=Crane, first_name=Patrick, member_id=8628324, pic_id=/p/1/000/09c/064/10191de.jpg}, {last_name=McLennan, first_name=Dan, member_id=10551114, pic_id=/p/2/000/09d/12f/147def1.jpg}, {last_name=Shaughnessy, first_name=Helen, member_id=2211035, pic_id=/p/3/000/06d/2ba/06a113c.jpg}, {last_name=Chen, first_name=Richard, member_id=12800647, pic_id=/p/2/000/007/1ad/0fb84f9.jpg}, {last_name=Barba, first_name=Troy, member_id=27577, pic_id=/p/2/000/0a2/3e9/3a83a33.jpg}, {last_name=Reed, first_name=Harper, member_id=1865420, pic_id=/p/1/000/001/17b/396a2c3.jpg}, {last_name=Goldstein, first_name=Peter, member_id=205610, pic_id=/p/2/000/01c/2e6/042999f.jpg}, {last_name=Koren, first_name=Yuval, member_id=2289577, pic_id=/p/1/000/02b/3d3/1fc3627.jpg}, {last_name=Kiang, first_name=Andy, member_id=8347, pic_id=/p/1/000/063/115/1256f61.jpg}, {last_name=Greenfield, first_name=Nick, member_id=82814545, pic_id=/p/1/000/068/39f/2080b8f.jpg}, {last_name=Murarka, first_name=Bubba, member_id=174233, pic_id=/p/3/000/011/2c8/33837b8.jpg}, {last_name=Kutter, first_name=Norbert, member_id=310933, pic_id=/p/3/000/005/0e2/02775a0.jpg}, {last_name=Ehrenberg, first_name=Roger, member_id=1662181, pic_id=/p/3/000/038/066/3572baf.jpg}, {last_name=Coderre, CISSP, first_name=Rob, member_id=68521, pic_id=/p/1/000/088/0d5/2438981.jpg}, {last_name=Stephens, first_name=Bradford, member_id=10900447, pic_id=/p/1/000/0ad/0dc/15f9df5.jpg}, {last_name=Shiau, first_name=Peter, member_id=300654, pic_id=/p/2/000/056/2a6/18938e3.jpg}, {last_name=Rajan, first_name=Arvind, member_id=1260, pic_id=/p/3/000/019/3f7/1e6e0f2.jpg}, {last_name=Bellister, first_name=Jesse, member_id=25234604, pic_id=/p/3/000/00a/17d/1e2136b.jpg}, {last_name=Mohan, first_name=Viraj, member_id=56817108, pic_id=/p/3/000/0cd/0a4/097527a.jpg}, {last_name=Ragade, first_name=Dhananjay, member_id=325284, pic_id=/p/3/000/000/035/0504fe7.jpg}, {last_name=Richards, first_name=Jeff, member_id=16762, pic_id=/p/2/000/039/14e/081d1c7.jpg}, {last_name=Wittenauer, first_name=Allen, member_id=3328775, pic_id=/p/3/000/08d/2a3/307b112.jpg}, {last_name=Porzak, first_name=Jim, member_id=1708710, pic_id=/p/2/000/00d/109/0e4aa34.jpg}, {last_name=Ruma, first_name=Laurel, member_id=3429732, pic_id=/p/1/000/01e/277/2bb115b.jpg}, {last_name=Higgins, first_name=Josh, member_id=1458792, pic_id=/p/1/000/0c9/38b/1a24457.jpg}, {last_name=Benedict, first_name=Harvey, member_id=641340, pic_id=/p/3/000/0c6/1eb/2eb7119.jpg}, {last_name=Lazarus, first_name=Brett, member_id=49965786, pic_id=/p/2/000/03b/04e/318d080.jpg}, {last_name=Zhang, first_name=Simon, member_id=16323996, pic_id=/p/3/000/03f/0fe/35d4ded.jpg}, {last_name=Aspen, first_name=Matt, member_id=25240804, pic_id=/p/3/000/09b/371/22ec974.jpg}, {last_name=Herz, first_name=Erik, member_id=147604, pic_id=/p/3/000/086/014/0fab4d6.jpg}, {last_name=Sanders, first_name=Geoffrey, member_id=340570, pic_id=/p/1/000/0d1/2d1/37a76e6.jpg}, {last_name=Wright, first_name=Caleb, member_id=12798700, pic_id=/p/2/000/08c/337/2cc951a.jpg}, {last_name=Parab, first_name=Guru, member_id=8915230, pic_id=/p/1/000/08a/257/051926a.jpg}, {last_name=Grossman, first_name=Nick, member_id=12159520, pic_id=/p/2/000/005/2f3/1955f31.jpg}, {last_name=Skomoroch, first_name=Peter, member_id=11642980, pic_id=/p/2/000/0b4/12d/31eadbe.jpg}, {last_name=Singh, first_name=Deepak, member_id=1246166, pic_id=/p/1/000/042/3f5/369f807.jpg}, {last_name=Noakes, first_name=Geoffrey, member_id=3518726, pic_id=/p/3/000/005/3d7/3f67632.jpg}, {last_name=Scudiere, first_name=Robert, member_id=3965286, pic_id=/p/2/000/090/210/009a099.jpg}, {last_name=Skyler, first_name=David, member_id=15377099, pic_id=/p/3/000/005/1bf/080b255.jpg}, {last_name=Sharma, first_name=Manu, member_id=19295378, pic_id=/p/3/000/0d4/11e/2176c30.jpg}, {last_name=Huang, first_name=Erica, member_id=1808438, pic_id=/p/1/000/001/3a5/02ddd24.jpg}, {last_name=Ballotta, first_name=Pete, member_id=2011178, pic_id=/p/2/000/0b6/08f/3a92357.jpg}, {last_name=Kast, first_name=Anton, member_id=1092686, pic_id=/p/1/000/054/0e2/1a8efb2.jpg}, {last_name=Redfern, first_name=Joff, member_id=2849241, pic_id=/p/3/000/03d/28d/19f5688.jpg}, {last_name=Smith, first_name=Aaron, member_id=83470876, pic_id=/p/2/000/08c/27c/3cfe37a.jpg}, {last_name=Yadav, first_name=Rishi, member_id=2097381, pic_id=/p/2/000/0c8/08d/3ab9006.jpg}, {last_name=Repass, first_name=Mike, member_id=8633208, pic_id=/p/2/000/071/195/0bfc573.jpg}, {last_name=Dalvi, first_name=Anand, member_id=8388, pic_id=/p/1/000/003/3cd/3127384.jpg}, {last_name=Croll, first_name=Alistair, member_id=511218, pic_id=/p/2/000/029/0e5/1ebc076.jpg}, {last_name=Tolman, first_name=Sarah, member_id=86040596, pic_id=/p/2/000/06f/1c9/1a7870e.jpg}, {last_name=Suvarna, first_name=Sandeep, member_id=10558779, pic_id=/p/1/000/05b/2c7/0ec214a.jpg}, {last_name=Elliott-McCrea, first_name=Kellan, member_id=163959, pic_id=/p/1/000/06b/2e8/2dbd3ae.jpg}, {last_name=Jatkar, first_name=Tarang, member_id=17763609, pic_id=/p/1/000/012/010/2e8ee7f.jpg}, {last_name=Brown, first_name=David, member_id=420737, pic_id=/p/3/000/002/140/0b2dbcc.jpg}, {last_name=Patel, first_name=Jay, member_id=1179857, pic_id=/p/2/000/07c/0b2/0365e91.jpg}, {last_name=Field, first_name=Dylan, member_id=13066037, pic_id=/p/2/000/0a5/3e2/1fb7f06.jpg}, {last_name=Patel, first_name=Sumeet, member_id=23402387, pic_id=/p/2/000/0bf/3ca/2ca5f1f.jpg}, {last_name=Ting, first_name=Moses, member_id=15624915, pic_id=/p/2/000/0ac/117/29e329a.jpg}, {last_name=Hinnach, first_name=Yassine, member_id=1731285, pic_id=/p/3/000/000/035/330cce0.jpg}, {last_name=Das, first_name=Anshu, member_id=38878221, pic_id=/p/3/000/0b2/1ac/15902f4.jpg}, {last_name=Mendelson, first_name=Jordan, member_id=8598415, pic_id=/p/3/000/032/22a/1d2eaa6.jpg}, {last_name=Besbeas, first_name=Nick, member_id=12510505, pic_id=/p/3/000/093/167/34f5b6b.jpg}], source_id=256842}, first_name=Joseph, email_locale=en_US, last_name=Adler, gmtOffset=-8, recipientID=256842, email_address=jadler@linkedin.com}
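The "over 250 data fields" figure on this slide follows directly from the counts it lists. A minimal sketch of the arithmetic, with constant and function names that are purely illustrative:

```python
HEADER_FIELDS = 10      # per-message header fields, as stated on the slide
FIELDS_PER_PERSON = 4   # member_id, pic_id, first_name, last_name
MAX_PEOPLE = 64         # cap on job changers listed in one message

def fields_per_message(people=MAX_PEOPLE):
    """Total data fields in one message that lists `people` job changers."""
    return HEADER_FIELDS + FIELDS_PER_PERSON * people
```

With the full 64 people this gives 10 + 4 × 64 = 266 fields, consistent with the slide's "over 250".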
  • 22. People you may know
  • 23. People you may know: Alice, Bob, Carol
  • 24. People you may know: Alice, Bob, Carol. > 80% of connections from triangle closing
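Triangle closing means that if Alice is connected to Bob and Bob is connected to Carol, then Carol is a candidate suggestion for Alice. A minimal in-memory sketch of that idea follows; it assumes an undirected edge list that fits in memory, and ranking by the number of shared connections is an assumption made here for illustration, not a description of the PYMK model. The function name is hypothetical.

```python
from collections import Counter, defaultdict

def triangle_closing_candidates(edges):
    """Suggest friend-of-friend pairs, ranked by number of shared connections.

    edges: iterable of undirected (a, b) connection pairs.
    Returns {member: [(candidate, shared_count), ...]}, highest count first.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    suggestions = {}
    for member, direct in neighbors.items():
        shared = Counter()
        for friend in direct:
            for candidate in neighbors[friend]:
                # Skip the member themselves and people they already know.
                if candidate != member and candidate not in direct:
                    shared[candidate] += 1
        # Sort by shared-connection count (descending), then by name for stability.
        suggestions[member] = sorted(shared.items(), key=lambda kv: (-kv[1], kv[0]))
    return suggestions
```

For the slide's example, the edges Alice–Bob and Bob–Carol produce Carol as a suggestion for Alice and Alice as a suggestion for Carol, while Bob gets no suggestions.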
  • 25. People you may know (diagram labels: Age, Organizational Overlap, Distance; Alice, Bob, Dave, Carol, Eve; Ranked Matches; User Interactions; Results)
  • 26. Skills and Endorsements
  • 27. Tagging Skills
  • 29. Skills and Endorsements: A combination of – Propensity to know member – Propensity for member to have skill
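The slide says only that suggested endorsements combine two propensities; it does not say how. One simple way to combine them, assumed here purely for illustration, is to treat both as probabilities and multiply them. The function name and input layout are hypothetical.

```python
def rank_endorsement_suggestions(p_know, p_skill, top_n=3):
    """Rank (member, skill) pairs by an assumed combined score.

    p_know:  {member: estimated probability that the viewer knows the member}
    p_skill: {(member, skill): estimated probability that the member has the skill}
    """
    scored = [
        (p_know.get(member, 0.0) * p, member, skill)
        for (member, skill), p in p_skill.items()
    ]
    # Highest combined score first; ties broken by member, then skill.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(member, skill) for _, member, skill in scored[:top_n]]
```

A product favors pairs where both signals are strong: a well-known colleague with a doubtful skill, or a near-stranger with an obvious one, both rank lower than someone you know well with a skill they clearly have.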
  • 30. Productionalization: Take something that runs once… … and run it multiple times … and serve it reliably at scale … and iterate quickly
  • 31. Data Lifecycle: Moving around data is the key problem. 1. Ingress: Moving raw data from online systems to offline systems. 2. Workflow management: Managing offline processes. 3. Egress: Moving results from offline systems to online systems.
  • 32. Ingress. Apache Kafka: Low latency publish/subscribe message bus – Common data format (Avro) – Changelog is the abstraction for integration – Schema evolution (Programmatic compatibility model; Explicit schema reviews; “O(1)” ETL). K. Goodhope, J. Koshy, J. Kreps, N. Narkhede, R. Park, J. Rao, V.Y. Ye: Building LinkedIn’s Real-time Activity Data Pipeline. In IEEE Data Engineering Bulletin. Vol 35, No. 2, June 2012.
  • 33. Workflows: Job A, Job B, Job C
  • 34. Workflows: Job A, Job B, Job C, Push to Production
  • 35. Workflows: Job A, Job B, Job X, Job C, Push to Production
  • 36. Workflows: Job A, Job B, Job X, Job C, Push to Production, Push to QA
  • 37. Real workflows are complicated
  • 38. Workflow Management: Azkaban. Dependency management; Diverse job types (Pig, Hive, Java, …); Scheduling; Monitoring; Configuration; Retry/restart on failure; Resource locking; Log collection; Historical information
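At its core, dependency management means running each job only after everything it depends on has finished, which is a topological ordering of the job graph. The sketch below uses Python's standard `graphlib` module and is not Azkaban's implementation or job format. The job names come from the workflow slides, but the edges in `EXAMPLE_FLOW` are an assumption, since the transcript does not preserve the arrows in the diagrams.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependencies: job -> set of jobs that must finish first.
EXAMPLE_FLOW = {
    "Job B": {"Job A"},
    "Job X": {"Job A"},
    "Job C": {"Job B", "Job X"},
    "Push to Production": {"Job C"},
    "Push to QA": {"Job C"},
}

def execution_order(dependencies):
    """Return one valid run order in which every job follows its prerequisites."""
    return list(TopologicalSorter(dependencies).static_order())
```

If the graph contains a cycle, `static_order()` raises `graphlib.CycleError`, which is the kind of mistake a workflow manager needs to catch before anything is scheduled.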
  • 39. Workflow Management: Azkaban
  • 40. Workflow Management: Azkaban
  • 41. Egress: Voldemort. Distributed key/value store. Easy to integrate into workflows – Off the shelf jobs to copy Voldemort Stores – One line command in Pig. Cost of data load; Data stored per node?; Response time; Fail-over; How to transfer; Versioning & rollback. R. Sumbaly, J. Kreps, L. Gao, A. Feinberg, C. Soman, & S. Shah. Serving Large-Scale Batch Computed Data With Project Voldemort. In FAST 2012.
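"Versioning & rollback" can be illustrated with a toy read-only store that keeps each bulk load as a complete, numbered dataset: a new load becomes live in one step, and a bad load is undone by falling back to the previous one. This is a single-process sketch with a hypothetical class name and API; it is not Voldemort's interface and ignores distribution, fail-over, and data transfer entirely.

```python
class VersionedStore:
    """Read-only key/value store that keeps whole datasets as numbered versions."""

    def __init__(self, keep=2):
        self._versions = []   # list of (version_number, dict), oldest first
        self._next = 1
        self._keep = keep     # how many versions to retain (assumed >= 1)

    def swap(self, dataset):
        """Install a freshly built dataset as the live version."""
        self._versions.append((self._next, dict(dataset)))
        self._next += 1
        del self._versions[:-self._keep]   # drop versions beyond the retention limit
        return self.current_version

    def rollback(self):
        """Discard the live version and fall back to the previous one."""
        if len(self._versions) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._versions.pop()
        return self.current_version

    @property
    def current_version(self):
        return self._versions[-1][0] if self._versions else None

    def get(self, key, default=None):
        """Look up a key in the live version only."""
        if not self._versions:
            return default
        return self._versions[-1][1].get(key, default)
```

Because reads only ever see the last entry in the list, a batch job can build the next dataset offline and publish it with a single `swap`, and operators can undo a bad push with a single `rollback`.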
  • 42. Recap: Why we use Hadoop. Simple programmatic model. Rich developer ecosystem – Languages: Pig, Hive, Crunch, Cascading, … – Libraries: Mahout, DataFu, ElephantBird, … Horizontal scalability, fault tolerance, multi-tenancy – Reliably process multiple TB of data. Don’t need hardcore distributed systems engineers.
  • 43. Recap: How we use Hadoop. Open source projects started at LinkedIn: Getting data in: Kafka; Building and running job flows: Azkaban; Getting data out: Voldemort. This empowers data scientists and engineers to focus on new product ideas, not infrastructure.
  • 44. Learning More: data.linkedin.com