How to win friends and influence people (with Hadoop)
 


Sam Shah and I gave this talk at Strata+Hadoop World 2012


  • Today, Sam and I are going to talk about how we use Hadoop to build products with data. Sam and I are both engineers at LinkedIn. My title is trendier than Sam’s, but don’t hold that against me. Or him. We both know how to build products with data. Both of us have talked about a lot of the products in this presentation before, but we haven’t focused as much on infrastructure.
  • We’d like to start by telling you a little bit about LinkedIn (and LinkedIn’s data). LinkedIn is the leading web site for professional networking. We currently have over 175 million members, and we’re still growing. That means that our data is growing too.
  • Each member has a profile. We know a lot about our members (start scrolling animation)… We know their current position, past positions, schools they attended, skills they have, skills that other people have endorsed them for, people and companies they follow, and companies they work at. We think this data is very interesting. We can use this data to help members connect to each other, and make them more productive. That’s actually LinkedIn’s mission statement… I can’t believe I recited our mission statement in a public presentation. Anyway, let’s take a look at how we use this data.
  • When a user logs into LinkedIn, they see a page like this. Almost every part of our home page has been touched by data science. The home page is purely driven by data: news articles, news stream, PYMK, display ad, JYMBII, WVMP, GYML, etc. And by the way, we also learn what our members like and don’t like. We have over 130 million visitors to our site every quarter, and deliver over 9.3 billion web pages. (That’s even more data.)
  • So, here’s the point of today’s talk. At LinkedIn, we have a lot of data.
  • We store our data in Hadoop, and we want to build products using that data on Hadoop.
  • So here’s the big challenge: how do we make it easy for our engineers, product managers, data scientists, analysts, web devs, receptionists, whatever, to build products from our data? That’s what we’re going to talk about today. We’ll tell you about some of the products that we’ve built from data, how we built these products, and why we built infrastructure to support these products.
  • Let’s start by telling you a little more about some of the products that we have built with Hadoop, then we’ll tell you more about two of those products and the challenges that we faced productionalizing them.
  • More examples: Groups you might like; Network updates digest email; “People who viewed this profile also viewed”; etc.
  • Let’s start with a project that I worked on at LinkedIn that I think illustrates the power of building products with data. [Ask audience: “Who got this email?”] We sent this to every LinkedIn member who had a lot of job changes in their network. [Now read the numbers.] Later in this presentation, we’ll tell you how we built this email from our data. I’ll even show you the code.
  • Here is another example of a product that I’ve worked on. In the network stream on our home page, we’ve started sharing trends and patterns in data. We also tell you things that you might not know about your network. For example, it turns out that 21 of my former coworkers are now working at Google.
  • One of the most famous examples of data products at LinkedIn is People You May Know. PYMK was invented at LinkedIn. The idea of PYMK was to help you discover current coworkers, former coworkers, and friends on LinkedIn to help make your experience better. (This is not an actual screen shot from my account; I’m already connected to Sam.) We used Hadoop to build and scale PYMK. (We’ll also tell you more about how we built PYMK later in this presentation.)
  • Has anyone in the room seen a screen like this on LinkedIn? Has anyone endorsed someone else? Has anyone found it hard to stop endorsing people? We also used Hadoop to build our suggested endorsements.
  • We love using Hadoop for building data products. There are so many things that are great about Hadoop. (Our user quotas are in TB.) Hundreds of nodes. Great tools for working with data, like Pig, Hive, and Crunch. Shared infrastructure: hundreds of employees have accounts on Hadoop and run jobs (engineers, data scientists, product managers, even designers and finance people).
  • One of the greatest advantages of Hadoop is that it empowers small teams to build great things. Here are a few examples. Most of the items on this slide are big, important features: lots of page views, lots of new connections, lots of great content. The marginal cost of building more products is low.
  • Let’s talk a little more about the year in review email. This is actually a pretty straightforward message in theory. Here’s how we do it. (Read slide.) There isn’t any machine learning, or fancy algorithms. It’s just grouping and ranking. And in practice, it’s not that hard.
  • This is the code to compose this message. It’s about 60 lines of code, and most of that code involves renaming things. This is why we love Hadoop: we can do something simple without much code… Great! We’re done. We write this code and the message is done.
  • Well, not so fast… here’s the challenge. We know how to do the computation to make this message. But every message requires a lot of data: we potentially look at hundreds of MB of data before generating every message, and in the end the messages are up to 1 MB in size. How do we get all the raw data that we need to make this message? How do we keep it up to date? How do we run this job frequently so the results stay current? How do we get these results out of Hadoop, turn them into email messages, and send them out? Let’s consider another problem.
  • One of the most famous examples of data products is People You May Know. PYMK was invented at LinkedIn. The idea of PYMK was to help you discover current coworkers, former coworkers, and friends on LinkedIn to help make your experience better. (This is not an actual screen shot from my account; I’m already connected to Sam.) We used Hadoop to build and scale PYMK.
  • - PYMK started simpler, grew more complicated
    - Complicated workflow, required tools and infrastructure to do this --> we needed it in place.
  • Throw over the wall from data science to productionization. No one dedicated to productionization. Provided “as a service” to do so.
  • - Don’t want to beg for data
    - Others: Scribe, Flume
  • - Others: Oozie
  • - Others: HBase, Cassandra, Kafka
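
The notes on the year in review email describe the computation as "just grouping and ranking", with no machine learning. The sketch below illustrates that idea in plain Python and is not the code from the talk, which is not reproduced in these notes. Everything in it is an assumption made for illustration: the function name `build_digests`, the input shapes, the `min_changes` threshold standing in for "a lot of job changes", and the choice to rank by number of shared connections.

```python
from collections import defaultdict


def build_digests(connections, job_changes, min_changes=2, top_n=2):
    """Group job changes by recipient, then rank them within each group.

    connections: dict mapping member -> set of connected members (hypothetical shape)
    job_changes: list of (member, new_company) pairs (hypothetical shape)
    """
    new_company = dict(job_changes)

    # Grouping: for each member, collect the connections who changed jobs.
    grouped = defaultdict(list)
    for member, friends in connections.items():
        for friend in friends:
            if friend in new_company:
                grouped[member].append(friend)

    digests = {}
    for member, changers in grouped.items():
        # Only members with enough job changes in their network get a message.
        if len(changers) < min_changes:
            continue

        # Ranking (assumed signal): more shared connections first, then by name.
        def shared(friend, member=member):
            return len(connections[member] & connections.get(friend, set()))

        ranked = sorted(changers, key=lambda f: (-shared(f), f))
        digests[member] = [(f, new_company[f]) for f in ranked[:top_n]]
    return digests


# Made-up sample data, for illustration only.
sample_connections = {
    "ann": {"bob", "cat", "dan"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob"},
    "dan": {"ann"},
}
sample_changes = [("bob", "Acme"), ("cat", "Globex"), ("dan", "Initech")]

if __name__ == "__main__":
    print(build_digests(sample_connections, sample_changes))
```

On a real cluster the same two steps would run as a distributed group-by followed by a per-group sort. The questions raised in the notes (getting the raw data in, keeping it current, and getting the results back out) are outside the scope of this sketch.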