How to Win Friends and Influence People (with Hadoop)

How to Win Friends and Influence People (with
Hadoop)
Strata Conference New York
Sam Shah and Joseph Adler
October 25 2012
©2012 LinkedIn Corporation. All Rights Reserved.

Sam Shah
Principal Engineer and Engineering Manager
www.linkedin.com/in/shahsam
Joseph Adler
Senior Data Scientist
www.linkedin.com/in/josephadler

STRATA NY 2012
LinkedIn is the leading professional network site
Worldwide Workforce
3,300M+
Worldwide
Professionals
640M+
LinkedIn Members
175M+
©2012 LinkedIn Corporation. All Rights Reserved. 3

STRATA NY 2012
Data rich
175+MMembers 175M Member
Profiles

STRATA NY 2012
LinkedIn
9.3B Page Views
per Quarter 130M Unique Visitors
5

We have a lot of data.

We want to leverage this data to build products.

We want to leverage this data to build products.
How do you make it easy to build products from data?

STRATA NY 2012
Products we have built on Hadoop

STRATA NY 2012
Building products from data
Examples of products built with data
 Year in Review Email
 Network Updates
 Skills and Endorsements
 People You May Know
 and more…

STRATA NY 2012
Year in Review
One of the most
successful email
messages ever.
20% Response
Rate 5 Clicks per
responder

STRATA NY 2012
Network updates

STRATA NY 2012
People you may know

STRATA NY 2012
Skills and Endorsements

STRATA NY 2012
Building products from data
Hadoop is awesome for building product with data
 Lots of cheap storage
 Vast computational resources
 Lots of tools for processing data, learning from data
 Shared infrastructure
 Shared support services
 Runs on commodity hardware (or AWS)

STRATA NY 2012
Leverage
The marginal cost of building new products is low
 People You May Know (2 people)
 Skills and Endorsements (2 people)
 Year in Review (1 person, 1 month)
 Network Updates Stream (1 person, 3 months)
Hadoop can empower small teams to build things

STRATA NY 2012
How we build products
Turning data into products

STRATA NY 2012
Year in Review
 Steps to make the email
– Collect job changers
– Figure out who is connected
to them
– Rank job changes

STRATA NY 2012
Example: Year in Review
memberPosition = LOAD '$latest_positions' USING BinaryJSON;
memberWithPositionsChangedLastYear = FOREACH (
FILTER memberPosition BY ((start_date >= $start_date_low ) AND
(start_date <= $start_date_high))
) GENERATE member_id, start_date, end_date;
allConnections = LOAD '$latest_bidirectional_connections' USING BinaryJSON;
allConnectionsWithChange_nondistinct = FOREACH (
JOIN memberWithPositionsChangedLastYear BY member_id,
allConnections BY dest
) GENERATE allConnections::source AS source,
allConnections::dest AS dest;
allConnectionsWithChange = DISTINCT
allConnectionsWithChange_nondistinct;
memberinfowpics = LOAD '$latest_memberinfowpics' USING
BinaryJSON;
pictures = FOREACH ( FILTER memberinfowpics BY
((cropped_picture_id is not null) AND
( (member_picture_privacy == 'N') OR
(member_picture_privacy == 'E')))
) GENERATE member_id, cropped_picture_id, first_name as
dest_first_name, last_name as dest_last_name;
resultPic = JOIN allConnectionsWithChange BY dest, pictures
BY member_id;
connectionsWithChangeWithPic = FOREACH resultPic GENERATE
allConnectionsWithChange::source AS source_id,
allConnectionsWithChange::dest AS member_id,
pictures::cropped_picture_id AS pic_id,
pictures::dest_first_name AS dest_first_name,
pictures::dest_last_name AS dest_last_name;
joinResult = JOIN connectionsWithChangeWithPic BY source_id,
memberinfowpics BY member_id;
withName = FOREACH joinResult GENERATE
connectionsWithChangeWithPic::source_id AS source_id,
connectionsWithChangeWithPic::member_id AS member_id,
connectionsWithChangeWithPic::dest_first_name as first_name,
connectionsWithChangeWithPic::dest_last_name as last_name,
connectionsWithChangeWithPic::pic_id AS pic_id,
memberinfowpics::first_name AS firstName,
memberinfowpics::last_name AS lastName,
memberinfowpics::gmt_offset as gmt_offset,
memberinfowpics::email_locale as email_locale,
memberinfowpics::email_address as email_address;
resultGroup0 = GROUP withName BY (source_id, firstName,
lastName, email_address, email_locale, gmt_offset);
-- get the count of results per recipient
resultGroupCount = FOREACH resultGroup0 GENERATE group ,
withName as toomany, COUNT_STAR(withName) as num_results;
resultGroupPre = filter resultGroupCount by num_results > 2;
resultGroup = FOREACH resultGroupPre {
withName = LIMIT toomany 64;
GENERATE group, withName, num_results;
}
x_in_review_pre_out = FOREACH resultGroup GENERATE
FLATTEN(group) as (source_id, firstName, lastName,
email_address, email_locale, gmt_offset),
withName.(member_id, pic_id, first_name, last_name) as
jobChanger, '2011' as changeYear:chararray,
num_results as num_results;
x_in_review = FOREACH x_in_review_pre_out GENERATE
source_id as recipientID, gmt_offset as gmtOffset,
firstName as first_name, lastName as last_name, email_address,
email_locale,
TOTUPLE( changeYear, source_id,firstName, lastName,
num_results,jobChanger) as body;
rmf $xir;
STORE x_in_review INTO '$xir' USING BinaryJSON('recipientID');

STRATA NY 2012
Example: Year in Review
{body={num_results=80, lastName=Adler, changeYear=2011, firstName=Joseph, jobChanger=[{last_name=O'Connor, first_n
ame=Br?on, member_id=12562482, pic_id=/p/3/000/086/1bd/10ee035.jpg}, {last_name=Sundaram, first_name=Vivek, member
_id=6590171, pic_id=/p/3/000/0ae/354/36eb54c.jpg}, {last_name=Crane, first_name=Patrick, member_id=8628324, pic_id
=/p/1/000/09c/064/10191de.jpg}, {last_name=McLennan, first_name=Dan, member_id=10551114, pic_id=/p/2/000/09d/12f/1
47def1.jpg}, {last_name=Shaughnessy, first_name=Helen, member_id=2211035, pic_id=/p/3/000/06d/2ba/06a113c.jpg}, {l
ast_name=Chen, first_name=Richard, member_id=12800647, pic_id=/p/2/000/007/1ad/0fb84f9.jpg}, {last_name=Barba, fir
st_name=Troy, member_id=27577, pic_id=/p/2/000/0a2/3e9/3a83a33.jpg}, {last_name=Reed, first_name=Harper, member_id
=1865420, pic_id=/p/1/000/001/17b/396a2c3.jpg}, {last_name=Goldstein, first_name=Peter, member_id=205610, pic_id=/
p/2/000/01c/2e6/042999f.jpg}, {last_name=Koren, first_name=Yuval, member_id=2289577, pic_id=/p/1/000/02b/3d3/1fc36
27.jpg}, {last_name=Kiang, first_name=Andy, member_id=8347, pic_id=/p/1/000/063/115/1256f61.jpg}, {last_name=Green
field, first_name=Nick, member_id=82814545, pic_id=/p/1/000/068/39f/2080b8f.jpg}, {last_name=Murarka, first_name=B
ubba, member_id=174233, pic_id=/p/3/000/011/2c8/33837b8.jpg}, {last_name=Kutter, first_name=Norbert, member_id=310
933, pic_id=/p/3/000/005/0e2/02775a0.jpg}, {last_name=Ehrenberg, first_name=Roger, member_id=1662181, pic_id=/p/3/
000/038/066/3572baf.jpg}, {last_name=Coderre, CISSP, first_name=Rob, member_id=68521, pic_id=/p/1/000/088/0d5/2438
981.jpg}, {last_name=Stephens, first_name=Bradford, member_id=10900447, pic_id=/p/1/000/0ad/0dc/15f9df5.jpg}, {las
t_name=Shiau, first_name=Peter, member_id=300654, pic_id=/p/2/000/056/2a6/18938e3.jpg}, {last_name=Rajan, first_na
me=Arvind, member_id=1260, pic_id=/p/3/000/019/3f7/1e6e0f2.jpg}, {last_name=Bellister, first_name=Jesse, member_id
=25234604, pic_id=/p/3/000/00a/17d/1e2136b.jpg}, {last_name=Mohan, first_name=Viraj, member_id=56817108, pic_id=/p
/3/000/0cd/0a4/097527a.jpg}, {last_name=Ragade, first_name=Dhananjay, member_id=325284, pic_id=/p/3/000/000/035/05
04fe7.jpg}, {last_name=Richards, first_name=Jeff, member_id=16762, pic_id=/p/2/000/039/14e/081d1c7.jpg}, {last_nam
e=Wittenauer, first_name=Allen, member_id=3328775, pic_id=/p/3/000/08d/2a3/307b112.jpg}, {last_name=Porzak, first_
name=Jim, member_id=1708710, pic_id=/p/2/000/00d/109/0e4aa34.jpg}, {last_name=Ruma, first_name=Laurel, member_id=3
429732, pic_id=/p/1/000/01e/277/2bb115b.jpg}, {last_name=Higgins, first_name=Josh, member_id=1458792, pic_id=/p/1/
000/0c9/38b/1a24457.jpg}, {last_name=Benedict, first_name=Harvey, member_id=641340, pic_id=/p/3/000/0c6/1eb/2eb711
9.jpg}, {last_name=Lazarus, first_name=Brett, member_id=49965786, pic_id=/p/2/000/03b/04e/318d080.jpg}, {last_name
=Zhang, first_name=Simon, member_id=16323996, pic_id=/p/3/000/03f/0fe/35d4ded.jpg}, {last_name=Aspen, first_name=M
att, member_id=25240804, pic_id=/p/3/000/09b/371/22ec974.jpg}, {last_name=Herz, first_name=Erik, member_id=147604,
pic_id=/p/3/000/086/014/0fab4d6.jpg}, {last_name=Sanders, first_name=Geoffrey, member_id=340570, pic_id=/p/1/000/0
d1/2d1/37a76e6.jpg}, {last_name=Wright, first_name=Caleb, member_id=12798700, pic_id=/p/2/000/08c/337/2cc951a.jpg}
, {last_name=Parab, first_name=Guru, member_id=8915230, pic_id=/p/1/000/08a/257/051926a.jpg}, {last_name=Grossman,
first_name=Nick, member_id=12159520, pic_id=/p/2/000/005/2f3/1955f31.jpg}, {last_name=Skomoroch, first_name=Peter,
member_id=11642980, pic_id=/p/2/000/0b4/12d/31eadbe.jpg}, {last_name=Singh, first_name=Deepak, member_id=1246166,
pic_id=/p/1/000/042/3f5/369f807.jpg}, {last_name=Noakes, first_name=Geoffrey, member_id=3518726, pic_id=/p/3/000/0
05/3d7/3f67632.jpg}, {last_name=Scudiere, first_name=Robert, member_id=3965286, pic_id=/p/2/000/090/210/009a099.jp
g}, {last_name=Skyler, first_name=David, member_id=15377099, pic_id=/p/3/000/005/1bf/080b255.jpg}, {last_name=Shar
ma, first_name=Manu, member_id=19295378, pic_id=/p/3/000/0d4/11e/2176c30.jpg}, {last_name=Huang, first_name=Erica,
member_id=1808438, pic_id=/p/1/000/001/3a5/02ddd24.jpg}, {last_name=Ballotta, first_name=Pete, member_id=2011178,
pic_id=/p/2/000/0b6/08f/3a92357.jpg}, {last_name=Kast, first_name=Anton, member_id=1092686, pic_id=/p/1/000/054/0e
2/1a8efb2.jpg}, {last_name=Redfern, first_name=Joff, member_id=2849241, pic_id=/p/3/000/03d/28d/19f5688.jpg}, {las
t_name=Smith, first_name=Aaron, member_id=83470876, pic_id=/p/2/000/08c/27c/3cfe37a.jpg}, {last_name=Yadav, first_
name=Rishi, member_id=2097381, pic_id=/p/2/000/0c8/08d/3ab9006.jpg}, {last_name=Repass, first_name=Mike, member_id
=8633208, pic_id=/p/2/000/071/195/0bfc573.jpg}, {last_name=Dalvi, first_name=Anand, member_id=8388, pic_id=/p/1/00
0/003/3cd/3127384.jpg}, {last_name=Croll, first_name=Alistair, member_id=511218, pic_id=/p/2/000/029/0e5/1ebc076.j
pg}, {last_name=Tolman, first_name=Sarah, member_id=86040596, pic_id=/p/2/000/06f/1c9/1a7870e.jpg}, {last_name=Suv
arna, first_name=Sandeep, member_id=10558779, pic_id=/p/1/000/05b/2c7/0ec214a.jpg}, {last_name=Elliott-
McCrea, first_name=Kellan, member_id=163959, pic_id=/p/1/000/06b/2e8/2dbd3ae.jpg}, {last_name=Jatkar, first_name=T
arang, member_id=17763609, pic_id=/p/1/000/012/010/2e8ee7f.jpg}, {last_name=Brown, first_name=David, member_id=420
737, pic_id=/p/3/000/002/140/0b2dbcc.jpg}, {last_name=Patel, first_name=Jay, member_id=1179857, pic_id=/p/2/000/07
c/0b2/0365e91.jpg}, {last_name=Field, first_name=Dylan, member_id=13066037, pic_id=/p/2/000/0a5/3e2/1fb7f06.jpg},
{last_name=Patel, first_name=Sumeet, member_id=23402387, pic_id=/p/2/000/0bf/3ca/2ca5f1f.jpg}, {last_name=Ting, fi
rst_name=Moses, member_id=15624915, pic_id=/p/2/000/0ac/117/29e329a.jpg}, {last_name=Hinnach, first_name=Yassine,
member_id=1731285, pic_id=/p/3/000/000/035/330cce0.jpg}, {last_name=Das, first_name=Anshu, member_id=38878221, pic
_id=/p/3/000/0b2/1ac/15902f4.jpg}, {last_name=Mendelson, first_name=Jordan, member_id=8598415, pic_id=/p/3/000/032
/22a/1d2eaa6.jpg}, {last_name=Besbeas, first_name=Nick, member_id=12510505, pic_id=/p/3/000/093/167/34f5b6b.jpg}],
source_id=256842}, first_name=Joseph, email_locale=en_US, last_name=Adler, gmtOffset=-
8, recipientID=256842, email_address=jadler@linkedin.com}
Each message requires a lot of
data:
– Header information (10 fields)
– 4 fields per person, 64 people
– That’s over 250 data fields for
the final message
How do we turn this raw data in to
web content or email messages?

STRATA NY 2012
People you may know

STRATA NY 2012
Tagging Skills

STRATA NY 2012
Skills and Endorsements
A combination of
– Propensity to know member
– Propensity for member to have skill

STRATA NY 2012
Productionalization
Take something that runs once…
… and run it multiple times
… and serve it reliably at scale
… and iterate quickly

STRATA NY 2012
Data Lifecycle
 Moving around data is the key problem
1. Ingress
Moving raw data from online systems to offline systems
2. Workflow management
Managing offline processes
3. Egress
Moving results from offline systems to online systems

STRATA NY 2012
Ingress
 Apache Kafka: Low latency publish/subscribe message bus
– Common data format (Avro)
– Changelog is the abstraction for integration
– Schema evolution
 Programmatic compatibility model
 Explicit schema reviews
 “O(1)” ETL
 K. Goodhope, J. Koshy, J. Kreps, N. Narkhede, R. Park, J. Rao, V.Y. Ye: Building
LinkedIn’s Real-time Activity Data Pipeline. In IEEE Data Engineering Bulletin. Vol
35, No. 2, June 2012.

STRATA NY 2012
Workflows
Job A
Job B
Job C

STRATA NY 2012
Workflows
Job A
Job B
Job C
Push to Production

STRATA NY 2012
Workflows
Job A
Job B
Job C
Push to Production
Job X

STRATA NY 2012
Workflows
Job A
Job B
Job C
Push to Production
Job X
Push to QA

STRATA NY 2012
Real workflows are complicated

STRATA NY 2012
Workflow Management: Azkaban
 Dependency management
 Diverse job types (Pig, Hive, Java, . . . )
 Scheduling
 Monitoring
 Configuration
 Retry/restart on failure
 Resource locking
 Log collection
 Historical information

STRATA NY 2012

STRATA NY 2012
Egress: Voldemort
 Distributed key/value store
 Easy to integrate into workflows
– Off the shelf jobs to copy Voldemort Stores
– One line command in Pig
 Cost of data load
 Data stored per node? Response time
 Fail-over
 How to transfer
 Versioning & rollback
 R. Sumbaly, J. Kreps, L. Gao, A. Feinberg, C. Soman, & S. Shah. Serving Large-
Scale Batch Computed Data With Project Voldemort. In FAST 2012.

STRATA NY 2012
Recap
Why we use Hadoop
 Simple programmatic model
 Rich developer ecosystem
– Languages: Pig, Hive, Crunch, Cascading, …
– Libraries: Mahout, DataFu, ElephantBird, …
 Horizontal scalability, fault tolerance, multi-tenancy
– Reliably process multiple TB of data
 Don’t need hardcore distributed systems engineers

STRATA NY 2012
Recap
How we use Hadoop
Open source projects started at LinkedIn:
 Getting data in: Kafka
 Building and running job flows: Azkaban
 Getting data out: Voldemort
This empowers data scientists and engineers to focus on new product
ideas, not infrastructure

STRATA NY 2012
data.linkedin.com
Learning More

How to Win Friends and Influence People (with Hadoop)

Recommended

Recommended

More Related Content

Viewers also liked

Viewers also liked (16)

Similar to How to Win Friends and Influence People (with Hadoop)

Similar to How to Win Friends and Influence People (with Hadoop) (20)

Recently uploaded

Recently uploaded (20)

How to Win Friends and Influence People (with Hadoop)