SlideShare a Scribd company logo
1 of 41
How to Win Friends and Influence People (with
Hadoop)
Strata Conference New York
Sam Shah and Joseph Adler
October 25 2012
©2012 LinkedIn Corporation. All Rights Reserved.
Sam Shah
Principal Engineer and Engineering Manager
www.linkedin.com/in/shahsam
©2012 LinkedIn Corporation. All Rights Reserved.
Joseph Adler
Senior Data Scientist
www.linkedin.com/in/josephadler
STRATA NY 2012
LinkedIn is the leading professional network site
Worldwide Workforce
3,300M+
Worldwide
Professionals
640M+
LinkedIn Members
175M+
©2012 LinkedIn Corporation. All Rights Reserved. 3
STRATA NY 2012
Data rich
©2012 LinkedIn Corporation. All Rights Reserved. 4
175+MMembers 175M Member
Profiles
STRATA NY 2012
LinkedIn
©2012 LinkedIn Corporation. All Rights Reserved.
9.3B Page Views
per Quarter 130M Unique Visitors
5
We have a lot of data.
©2012 LinkedIn Corporation. All Rights Reserved.
We have a lot of data.
We want to leverage this data to build products.
©2012 LinkedIn Corporation. All Rights Reserved.
We have a lot of data.
We want to leverage this data to build products.
How do you make it easy to build products from data?
©2012 LinkedIn Corporation. All Rights Reserved.
STRATA NY 2012
Products we have built on Hadoop
©2012 LinkedIn Corporation. All Rights Reserved. 9
STRATA NY 2012
Building products from data
Examples of products built with data
 Year in Review Email
 Network Updates
 Skills and Endorsements
 People You May Know
 and more…
©2012 LinkedIn Corporation. All Rights Reserved.
STRATA NY 2012
Year in Review
One of the most
successful email
messages ever.
20% Response
Rate 5 Clicks per
responder
STRATA NY 2012
Network updates
©2012 LinkedIn Corporation. All Rights Reserved. 12
STRATA NY 2012
People you may know
©2012 LinkedIn Corporation. All Rights Reserved. 13
STRATA NY 2012
Skills and Endorsements
©2012 LinkedIn Corporation. All Rights Reserved.
STRATA NY 2012
Building products from data
Hadoop is awesome for building product with data
 Lots of cheap storage
 Vast computational resources
 Lots of tools for processing data, learning from data
 Shared infrastructure
 Shared support services
 Runs on commodity hardware (or AWS)
©2012 LinkedIn Corporation. All Rights Reserved.
STRATA NY 2012
Leverage
The marginal cost of building new products is low
 People You May Know (2 people)
 Skills and Endorsements (2 people)
 Year in Review (1 person, 1 month)
 Network Updates Stream (1 person, 3 months)
Hadoop can empower small teams to build things
©2012 LinkedIn Corporation. All Rights Reserved.
STRATA NY 2012
Leverage
The marginal cost of building new products is low
 People You May Know (2 people)
 Skills and Endorsements (2 people)
 Year in Review (1 person, 1 month)
 Network Updates Stream (1 person, 3 months)
Hadoop can empower small teams to build things
©2012 LinkedIn Corporation. All Rights Reserved.
STRATA NY 2012
How we build products
Turning data into products
©2012 LinkedIn Corporation. All Rights Reserved. 18
STRATA NY 2012
Year in Review
 Steps to make the email
– Collect job changers
– Figure out who is connected
to them
– Rank job changes
STRATA NY 2012
Example: Year in Review
memberPosition = LOAD '$latest_positions' USING BinaryJSON;
memberWithPositionsChangedLastYear = FOREACH (
FILTER memberPosition BY ((start_date >= $start_date_low ) AND
(start_date <= $start_date_high))
) GENERATE member_id, start_date, end_date;
allConnections = LOAD '$latest_bidirectional_connections' USING BinaryJSON;
allConnectionsWithChange_nondistinct = FOREACH (
JOIN memberWithPositionsChangedLastYear BY member_id,
allConnections BY dest
) GENERATE allConnections::source AS source,
allConnections::dest AS dest;
allConnectionsWithChange = DISTINCT
allConnectionsWithChange_nondistinct;
memberinfowpics = LOAD '$latest_memberinfowpics' USING
BinaryJSON;
pictures = FOREACH ( FILTER memberinfowpics BY
((cropped_picture_id is not null) AND
( (member_picture_privacy == 'N') OR
(member_picture_privacy == 'E')))
) GENERATE member_id, cropped_picture_id, first_name as
dest_first_name, last_name as dest_last_name;
resultPic = JOIN allConnectionsWithChange BY dest, pictures
BY member_id;
connectionsWithChangeWithPic = FOREACH resultPic GENERATE
allConnectionsWithChange::source AS source_id,
allConnectionsWithChange::dest AS member_id,
pictures::cropped_picture_id AS pic_id,
pictures::dest_first_name AS dest_first_name,
pictures::dest_last_name AS dest_last_name;
joinResult = JOIN connectionsWithChangeWithPic BY source_id,
memberinfowpics BY member_id;
withName = FOREACH joinResult GENERATE
connectionsWithChangeWithPic::source_id AS source_id,
connectionsWithChangeWithPic::member_id AS member_id,
connectionsWithChangeWithPic::dest_first_name as first_name,
connectionsWithChangeWithPic::dest_last_name as last_name,
connectionsWithChangeWithPic::pic_id AS pic_id,
memberinfowpics::first_name AS firstName,
memberinfowpics::last_name AS lastName,
memberinfowpics::gmt_offset as gmt_offset,
memberinfowpics::email_locale as email_locale,
memberinfowpics::email_address as email_address;
resultGroup0 = GROUP withName BY (source_id, firstName,
lastName, email_address, email_locale, gmt_offset);
-- get the count of results per recipient
resultGroupCount = FOREACH resultGroup0 GENERATE group ,
withName as toomany, COUNT_STAR(withName) as num_results;
resultGroupPre = filter resultGroupCount by num_results > 2;
resultGroup = FOREACH resultGroupPre {
withName = LIMIT toomany 64;
GENERATE group, withName, num_results;
}
x_in_review_pre_out = FOREACH resultGroup GENERATE
FLATTEN(group) as (source_id, firstName, lastName,
email_address, email_locale, gmt_offset),
withName.(member_id, pic_id, first_name, last_name) as
jobChanger, '2011' as changeYear:chararray,
num_results as num_results;
x_in_review = FOREACH x_in_review_pre_out GENERATE
source_id as recipientID, gmt_offset as gmtOffset,
firstName as first_name, lastName as last_name, email_address,
email_locale,
TOTUPLE( changeYear, source_id,firstName, lastName,
num_results,jobChanger) as body;
rmf $xir;
STORE x_in_review INTO '$xir' USING BinaryJSON('recipientID');
STRATA NY 2012
Example: Year in Review
{body={num_results=80, lastName=Adler, changeYear=2011, firstName=Joseph, jobChanger=[{last_name=O'Connor, first_n
ame=Br?on, member_id=12562482, pic_id=/p/3/000/086/1bd/10ee035.jpg}, {last_name=Sundaram, first_name=Vivek, member
_id=6590171, pic_id=/p/3/000/0ae/354/36eb54c.jpg}, {last_name=Crane, first_name=Patrick, member_id=8628324, pic_id
=/p/1/000/09c/064/10191de.jpg}, {last_name=McLennan, first_name=Dan, member_id=10551114, pic_id=/p/2/000/09d/12f/1
47def1.jpg}, {last_name=Shaughnessy, first_name=Helen, member_id=2211035, pic_id=/p/3/000/06d/2ba/06a113c.jpg}, {l
ast_name=Chen, first_name=Richard, member_id=12800647, pic_id=/p/2/000/007/1ad/0fb84f9.jpg}, {last_name=Barba, fir
st_name=Troy, member_id=27577, pic_id=/p/2/000/0a2/3e9/3a83a33.jpg}, {last_name=Reed, first_name=Harper, member_id
=1865420, pic_id=/p/1/000/001/17b/396a2c3.jpg}, {last_name=Goldstein, first_name=Peter, member_id=205610, pic_id=/
p/2/000/01c/2e6/042999f.jpg}, {last_name=Koren, first_name=Yuval, member_id=2289577, pic_id=/p/1/000/02b/3d3/1fc36
27.jpg}, {last_name=Kiang, first_name=Andy, member_id=8347, pic_id=/p/1/000/063/115/1256f61.jpg}, {last_name=Green
field, first_name=Nick, member_id=82814545, pic_id=/p/1/000/068/39f/2080b8f.jpg}, {last_name=Murarka, first_name=B
ubba, member_id=174233, pic_id=/p/3/000/011/2c8/33837b8.jpg}, {last_name=Kutter, first_name=Norbert, member_id=310
933, pic_id=/p/3/000/005/0e2/02775a0.jpg}, {last_name=Ehrenberg, first_name=Roger, member_id=1662181, pic_id=/p/3/
000/038/066/3572baf.jpg}, {last_name=Coderre, CISSP, first_name=Rob, member_id=68521, pic_id=/p/1/000/088/0d5/2438
981.jpg}, {last_name=Stephens, first_name=Bradford, member_id=10900447, pic_id=/p/1/000/0ad/0dc/15f9df5.jpg}, {las
t_name=Shiau, first_name=Peter, member_id=300654, pic_id=/p/2/000/056/2a6/18938e3.jpg}, {last_name=Rajan, first_na
me=Arvind, member_id=1260, pic_id=/p/3/000/019/3f7/1e6e0f2.jpg}, {last_name=Bellister, first_name=Jesse, member_id
=25234604, pic_id=/p/3/000/00a/17d/1e2136b.jpg}, {last_name=Mohan, first_name=Viraj, member_id=56817108, pic_id=/p
/3/000/0cd/0a4/097527a.jpg}, {last_name=Ragade, first_name=Dhananjay, member_id=325284, pic_id=/p/3/000/000/035/05
04fe7.jpg}, {last_name=Richards, first_name=Jeff, member_id=16762, pic_id=/p/2/000/039/14e/081d1c7.jpg}, {last_nam
e=Wittenauer, first_name=Allen, member_id=3328775, pic_id=/p/3/000/08d/2a3/307b112.jpg}, {last_name=Porzak, first_
name=Jim, member_id=1708710, pic_id=/p/2/000/00d/109/0e4aa34.jpg}, {last_name=Ruma, first_name=Laurel, member_id=3
429732, pic_id=/p/1/000/01e/277/2bb115b.jpg}, {last_name=Higgins, first_name=Josh, member_id=1458792, pic_id=/p/1/
000/0c9/38b/1a24457.jpg}, {last_name=Benedict, first_name=Harvey, member_id=641340, pic_id=/p/3/000/0c6/1eb/2eb711
9.jpg}, {last_name=Lazarus, first_name=Brett, member_id=49965786, pic_id=/p/2/000/03b/04e/318d080.jpg}, {last_name
=Zhang, first_name=Simon, member_id=16323996, pic_id=/p/3/000/03f/0fe/35d4ded.jpg}, {last_name=Aspen, first_name=M
att, member_id=25240804, pic_id=/p/3/000/09b/371/22ec974.jpg}, {last_name=Herz, first_name=Erik, member_id=147604,
pic_id=/p/3/000/086/014/0fab4d6.jpg}, {last_name=Sanders, first_name=Geoffrey, member_id=340570, pic_id=/p/1/000/0
d1/2d1/37a76e6.jpg}, {last_name=Wright, first_name=Caleb, member_id=12798700, pic_id=/p/2/000/08c/337/2cc951a.jpg}
, {last_name=Parab, first_name=Guru, member_id=8915230, pic_id=/p/1/000/08a/257/051926a.jpg}, {last_name=Grossman,
first_name=Nick, member_id=12159520, pic_id=/p/2/000/005/2f3/1955f31.jpg}, {last_name=Skomoroch, first_name=Peter,
member_id=11642980, pic_id=/p/2/000/0b4/12d/31eadbe.jpg}, {last_name=Singh, first_name=Deepak, member_id=1246166,
pic_id=/p/1/000/042/3f5/369f807.jpg}, {last_name=Noakes, first_name=Geoffrey, member_id=3518726, pic_id=/p/3/000/0
05/3d7/3f67632.jpg}, {last_name=Scudiere, first_name=Robert, member_id=3965286, pic_id=/p/2/000/090/210/009a099.jp
g}, {last_name=Skyler, first_name=David, member_id=15377099, pic_id=/p/3/000/005/1bf/080b255.jpg}, {last_name=Shar
ma, first_name=Manu, member_id=19295378, pic_id=/p/3/000/0d4/11e/2176c30.jpg}, {last_name=Huang, first_name=Erica,
member_id=1808438, pic_id=/p/1/000/001/3a5/02ddd24.jpg}, {last_name=Ballotta, first_name=Pete, member_id=2011178,
pic_id=/p/2/000/0b6/08f/3a92357.jpg}, {last_name=Kast, first_name=Anton, member_id=1092686, pic_id=/p/1/000/054/0e
2/1a8efb2.jpg}, {last_name=Redfern, first_name=Joff, member_id=2849241, pic_id=/p/3/000/03d/28d/19f5688.jpg}, {las
t_name=Smith, first_name=Aaron, member_id=83470876, pic_id=/p/2/000/08c/27c/3cfe37a.jpg}, {last_name=Yadav, first_
name=Rishi, member_id=2097381, pic_id=/p/2/000/0c8/08d/3ab9006.jpg}, {last_name=Repass, first_name=Mike, member_id
=8633208, pic_id=/p/2/000/071/195/0bfc573.jpg}, {last_name=Dalvi, first_name=Anand, member_id=8388, pic_id=/p/1/00
0/003/3cd/3127384.jpg}, {last_name=Croll, first_name=Alistair, member_id=511218, pic_id=/p/2/000/029/0e5/1ebc076.j
pg}, {last_name=Tolman, first_name=Sarah, member_id=86040596, pic_id=/p/2/000/06f/1c9/1a7870e.jpg}, {last_name=Suv
arna, first_name=Sandeep, member_id=10558779, pic_id=/p/1/000/05b/2c7/0ec214a.jpg}, {last_name=Elliott-
McCrea, first_name=Kellan, member_id=163959, pic_id=/p/1/000/06b/2e8/2dbd3ae.jpg}, {last_name=Jatkar, first_name=T
arang, member_id=17763609, pic_id=/p/1/000/012/010/2e8ee7f.jpg}, {last_name=Brown, first_name=David, member_id=420
737, pic_id=/p/3/000/002/140/0b2dbcc.jpg}, {last_name=Patel, first_name=Jay, member_id=1179857, pic_id=/p/2/000/07
c/0b2/0365e91.jpg}, {last_name=Field, first_name=Dylan, member_id=13066037, pic_id=/p/2/000/0a5/3e2/1fb7f06.jpg},
{last_name=Patel, first_name=Sumeet, member_id=23402387, pic_id=/p/2/000/0bf/3ca/2ca5f1f.jpg}, {last_name=Ting, fi
rst_name=Moses, member_id=15624915, pic_id=/p/2/000/0ac/117/29e329a.jpg}, {last_name=Hinnach, first_name=Yassine,
member_id=1731285, pic_id=/p/3/000/000/035/330cce0.jpg}, {last_name=Das, first_name=Anshu, member_id=38878221, pic
_id=/p/3/000/0b2/1ac/15902f4.jpg}, {last_name=Mendelson, first_name=Jordan, member_id=8598415, pic_id=/p/3/000/032
/22a/1d2eaa6.jpg}, {last_name=Besbeas, first_name=Nick, member_id=12510505, pic_id=/p/3/000/093/167/34f5b6b.jpg}],
source_id=256842}, first_name=Joseph, email_locale=en_US, last_name=Adler, gmtOffset=-
8, recipientID=256842, email_address=jadler@linkedin.com}
Each message requires a lot of
data:
– Header information (10 fields)
– 4 fields per person, 64 people
– That’s over 250 data fields for
the final message
How do we turn this raw data in to
web content or email messages?
STRATA NY 2012
People you may know
©2012 LinkedIn Corporation. All Rights Reserved. 22
STRATA NY 2012
Skills and Endorsements
©2012 LinkedIn Corporation. All Rights Reserved.
STRATA NY 2012
Tagging Skills
©2012 LinkedIn Corporation. All Rights Reserved. 24
STRATA NY 2012©2012 LinkedIn Corporation. All Rights Reserved. 25
STRATA NY 2012
Skills and Endorsements
©2012 LinkedIn Corporation. All Rights Reserved.
A combination of
– Propensity to know member
– Propensity for member to have skill
STRATA NY 2012
Productionalization
Take something that runs once…
… and run it multiple times
… and serve it reliably at scale
… and iterate quickly
©2012 LinkedIn Corporation. All Rights Reserved. 28
STRATA NY 2012
Data Lifecycle
 Moving around data is the key problem
1. Ingress
Moving raw data from online systems to offline systems
2. Workflow management
Managing offline processes
3. Egress
Moving results from offline systems to online systems
©2012 LinkedIn Corporation. All Rights Reserved. 29
STRATA NY 2012
Ingress
 Apache Kafka: Low latency publish/subscribe message bus
– Common data format (Avro)
– Changelog is the abstraction for integration
– Schema evolution
 Programmatic compatibility model
 Explicit schema reviews
 “O(1)” ETL
 K. Goodhope, J. Koshy, J. Kreps, N. Narkhede, R. Park, J. Rao, V.Y. Ye: Building
LinkedIn’s Real-time Activity Data Pipeline. In IEEE Data Engineering Bulletin. Vol
35, No. 2, June 2012.
©2012 LinkedIn Corporation. All Rights Reserved. 30
STRATA NY 2012
Workflows
©2012 LinkedIn Corporation. All Rights Reserved. 31
Job A
Job B
Job C
STRATA NY 2012
Workflows
©2012 LinkedIn Corporation. All Rights Reserved. 32
Job A
Job B
Job C
Push to Production
STRATA NY 2012
Workflows
©2012 LinkedIn Corporation. All Rights Reserved. 33
Job A
Job B
Job C
Push to Production
Job X
STRATA NY 2012
Workflows
©2012 LinkedIn Corporation. All Rights Reserved. 34
Job A
Job B
Job C
Push to Production
Job X
Push to QA
STRATA NY 2012
Real workflows are complicated
©2012 LinkedIn Corporation. All Rights Reserved. 35
STRATA NY 2012
Workflow Management: Azkaban
©2012 LinkedIn Corporation. All Rights Reserved. 36
 Dependency management
 Diverse job types (Pig, Hive, Java, . . . )
 Scheduling
 Monitoring
 Configuration
 Retry/restart on failure
 Resource locking
 Log collection
 Historical information
STRATA NY 2012
Workflow Management: Azkaban
©2012 LinkedIn Corporation. All Rights Reserved. 37
STRATA NY 2012
Workflow Management: Azkaban
©2012 LinkedIn Corporation. All Rights Reserved. 38
STRATA NY 2012
Egress: Voldemort
 Distributed key/value store
 Easy to integrate into workflows
– Off the shelf jobs to copy Voldemort Stores
– One line command in Pig
 Cost of data load
 Data stored per node? Response time
 Fail-over
 How to transfer
 Versioning & rollback
 R. Sumbaly, J. Kreps, L. Gao, A. Feinberg, C. Soman, & S. Shah. Serving Large-
Scale Batch Computed Data With Project Voldemort. In FAST 2012.
©2012 LinkedIn Corporation. All Rights Reserved. 39
STRATA NY 2012
Recap
Why we use Hadoop
 Simple programmatic model
 Rich developer ecosystem
– Languages: Pig, Hive, Crunch, Cascading, …
– Libraries: Mahout, DataFu, ElephantBird, …
 Horizontal scalability, fault tolerance, multi-tenancy
– Reliably process multiple TB of data
 Don’t need hardcore distributed systems engineers
©2012 LinkedIn Corporation. All Rights Reserved. 40
STRATA NY 2012
Recap
How we use Hadoop
Open source projects started at LinkedIn:
 Getting data in: Kafka
 Building and running job flows: Azkaban
 Getting data out: Voldemort
This empowers data scientists and engineers to focus on new product
ideas, not infrastructure
©2012 LinkedIn Corporation. All Rights Reserved. 41
STRATA NY 2012
data.linkedin.com
Learning More
©2012 LinkedIn Corporation. All Rights Reserved. 42

More Related Content

Viewers also liked

Pholia magra auxilia natural
Pholia magra auxilia naturalPholia magra auxilia natural
Pholia magra auxilia naturalFolia
 
Lexis Diligence overview
Lexis Diligence overview Lexis Diligence overview
Lexis Diligence overview LexisNexis
 
Status of United States Equal Employment Legislation │ State Net®
Status of United States Equal Employment Legislation │ State Net® Status of United States Equal Employment Legislation │ State Net®
Status of United States Equal Employment Legislation │ State Net® LexisNexis
 
Gaming & Internet Gambling Regulations per Jurisdiction
Gaming & Internet Gambling Regulations per Jurisdiction   Gaming & Internet Gambling Regulations per Jurisdiction
Gaming & Internet Gambling Regulations per Jurisdiction LexisNexis
 
Two Roads Diverged
Two Roads DivergedTwo Roads Diverged
Two Roads DivergedRanjeet Tate
 
SAP Webinar – Monetizing M2M
SAP Webinar – Monetizing M2MSAP Webinar – Monetizing M2M
SAP Webinar – Monetizing M2MComputaris
 
Led zasloni
Led zasloniLed zasloni
Led zasloniopenerp1
 
Z uśmiechem 2 -Matematyka1
Z uśmiechem 2 -Matematyka1Z uśmiechem 2 -Matematyka1
Z uśmiechem 2 -Matematyka1fundacjawartozyc
 
2012 conference presentation digital presence
2012 conference presentation digital presence2012 conference presentation digital presence
2012 conference presentation digital presenceSchoolTechPolicies.com
 
Zauroczeni pięknem 3 -Paradise
Zauroczeni pięknem 3 -ParadiseZauroczeni pięknem 3 -Paradise
Zauroczeni pięknem 3 -Paradisefundacjawartozyc
 
Partnership Agreements for the Agriculture Community
Partnership Agreements for the Agriculture CommunityPartnership Agreements for the Agriculture Community
Partnership Agreements for the Agriculture CommunityCari Rincker
 
Ticaret dersi-fiyati
Ticaret dersi-fiyatiTicaret dersi-fiyati
Ticaret dersi-fiyatizeynep_zyn84
 

Viewers also liked (16)

Pholia magra auxilia natural
Pholia magra auxilia naturalPholia magra auxilia natural
Pholia magra auxilia natural
 
Lexis Diligence overview
Lexis Diligence overview Lexis Diligence overview
Lexis Diligence overview
 
Status of United States Equal Employment Legislation │ State Net®
Status of United States Equal Employment Legislation │ State Net® Status of United States Equal Employment Legislation │ State Net®
Status of United States Equal Employment Legislation │ State Net®
 
Gaming & Internet Gambling Regulations per Jurisdiction
Gaming & Internet Gambling Regulations per Jurisdiction   Gaming & Internet Gambling Regulations per Jurisdiction
Gaming & Internet Gambling Regulations per Jurisdiction
 
2012 smarter habits-smarter clicking
2012 smarter habits-smarter clicking2012 smarter habits-smarter clicking
2012 smarter habits-smarter clicking
 
Two Roads Diverged
Two Roads DivergedTwo Roads Diverged
Two Roads Diverged
 
SAP Webinar – Monetizing M2M
SAP Webinar – Monetizing M2MSAP Webinar – Monetizing M2M
SAP Webinar – Monetizing M2M
 
Led zasloni
Led zasloniLed zasloni
Led zasloni
 
Z uśmiechem 2 -Matematyka1
Z uśmiechem 2 -Matematyka1Z uśmiechem 2 -Matematyka1
Z uśmiechem 2 -Matematyka1
 
Stress e Management
Stress e ManagementStress e Management
Stress e Management
 
Tecniche di miglioramento
Tecniche di miglioramentoTecniche di miglioramento
Tecniche di miglioramento
 
2012 conference presentation digital presence
2012 conference presentation digital presence2012 conference presentation digital presence
2012 conference presentation digital presence
 
Zauroczeni pięknem 3 -Paradise
Zauroczeni pięknem 3 -ParadiseZauroczeni pięknem 3 -Paradise
Zauroczeni pięknem 3 -Paradise
 
Negoziazione
NegoziazioneNegoziazione
Negoziazione
 
Partnership Agreements for the Agriculture Community
Partnership Agreements for the Agriculture CommunityPartnership Agreements for the Agriculture Community
Partnership Agreements for the Agriculture Community
 
Ticaret dersi-fiyati
Ticaret dersi-fiyatiTicaret dersi-fiyati
Ticaret dersi-fiyati
 

Similar to How to Win Friends and Influence People (with Hadoop)

How to win friends and influence people (with Hadoop)
How to win friends and influence people (with Hadoop)How to win friends and influence people (with Hadoop)
How to win friends and influence people (with Hadoop)Joseph Adler
 
Consuming Data From Many Platforms: The Benefits of OData - St. Louis Day of ...
Consuming Data From Many Platforms: The Benefits of OData - St. Louis Day of ...Consuming Data From Many Platforms: The Benefits of OData - St. Louis Day of ...
Consuming Data From Many Platforms: The Benefits of OData - St. Louis Day of ...Eric D. Boyd
 
Miniproject on Employee Management using Perl/Database.
Miniproject on Employee Management using Perl/Database.Miniproject on Employee Management using Perl/Database.
Miniproject on Employee Management using Perl/Database.Sanchit Raut
 
The "Big Data" Ecosystem at LinkedIn
The "Big Data" Ecosystem at LinkedInThe "Big Data" Ecosystem at LinkedIn
The "Big Data" Ecosystem at LinkedInSam Shah
 
AngularJS: What's the Big Deal?
AngularJS: What's the Big Deal?AngularJS: What's the Big Deal?
AngularJS: What's the Big Deal?Jim Duffy
 
Contacto server API in PHP
Contacto server API in PHPContacto server API in PHP
Contacto server API in PHPHem Shrestha
 
Findability Day 2014 Neo4j how graph data boost your insights
Findability Day 2014 Neo4j how graph data boost your insightsFindability Day 2014 Neo4j how graph data boost your insights
Findability Day 2014 Neo4j how graph data boost your insightsFindwise
 
When Relational Isn't Enough: Neo4j at Squidoo
When Relational Isn't Enough: Neo4j at SquidooWhen Relational Isn't Enough: Neo4j at Squidoo
When Relational Isn't Enough: Neo4j at SquidooGil Hildebrand
 
Building Applications with DynamoDB
Building Applications with DynamoDBBuilding Applications with DynamoDB
Building Applications with DynamoDBAmazon Web Services
 
Creating Operational Redundancy for Effective Web Data Mining
Creating Operational Redundancy for Effective Web Data MiningCreating Operational Redundancy for Effective Web Data Mining
Creating Operational Redundancy for Effective Web Data MiningJonathan LeBlanc
 
Cena-DTA PHP Conference 2011 Slides
Cena-DTA PHP Conference 2011 SlidesCena-DTA PHP Conference 2011 Slides
Cena-DTA PHP Conference 2011 SlidesAsao Kamei
 
Polymer 3.0 by Michele Gallotti
Polymer 3.0 by Michele GallottiPolymer 3.0 by Michele Gallotti
Polymer 3.0 by Michele GallottiThinkOpen
 
Freeing Yourself from an RDBMS Architecture
Freeing Yourself from an RDBMS ArchitectureFreeing Yourself from an RDBMS Architecture
Freeing Yourself from an RDBMS ArchitectureDavid Hoerster
 
RESTFUL SERVICES MADE EASY: THE EVE REST API FRAMEWORK - Nicola Iarocci - Co...
RESTFUL SERVICES MADE EASY: THE EVE REST API FRAMEWORK -  Nicola Iarocci - Co...RESTFUL SERVICES MADE EASY: THE EVE REST API FRAMEWORK -  Nicola Iarocci - Co...
RESTFUL SERVICES MADE EASY: THE EVE REST API FRAMEWORK - Nicola Iarocci - Co...Codemotion
 
GraphConnect Europe 2016 - Opening Keynote, Emil Eifrem
GraphConnect Europe 2016 - Opening Keynote, Emil EifremGraphConnect Europe 2016 - Opening Keynote, Emil Eifrem
GraphConnect Europe 2016 - Opening Keynote, Emil EifremNeo4j
 
Building LinkedIn's Learning Platform with MongoDB
Building LinkedIn's Learning Platform with MongoDBBuilding LinkedIn's Learning Platform with MongoDB
Building LinkedIn's Learning Platform with MongoDBMongoDB
 
What's new in Vaadin 8, how do you upgrade, and what's next?
What's new in Vaadin 8, how do you upgrade, and what's next?What's new in Vaadin 8, how do you upgrade, and what's next?
What's new in Vaadin 8, how do you upgrade, and what's next?Marcus Hellberg
 

Similar to How to Win Friends and Influence People (with Hadoop) (20)

How to win friends and influence people (with Hadoop)
How to win friends and influence people (with Hadoop)How to win friends and influence people (with Hadoop)
How to win friends and influence people (with Hadoop)
 
Consuming Data From Many Platforms: The Benefits of OData - St. Louis Day of ...
Consuming Data From Many Platforms: The Benefits of OData - St. Louis Day of ...Consuming Data From Many Platforms: The Benefits of OData - St. Louis Day of ...
Consuming Data From Many Platforms: The Benefits of OData - St. Louis Day of ...
 
Miniproject on Employee Management using Perl/Database.
Miniproject on Employee Management using Perl/Database.Miniproject on Employee Management using Perl/Database.
Miniproject on Employee Management using Perl/Database.
 
The "Big Data" Ecosystem at LinkedIn
The "Big Data" Ecosystem at LinkedInThe "Big Data" Ecosystem at LinkedIn
The "Big Data" Ecosystem at LinkedIn
 
AngularJS: What's the Big Deal?
AngularJS: What's the Big Deal?AngularJS: What's the Big Deal?
AngularJS: What's the Big Deal?
 
Contacto server API in PHP
Contacto server API in PHPContacto server API in PHP
Contacto server API in PHP
 
My Portfolio
My PortfolioMy Portfolio
My Portfolio
 
My Portfolio
My PortfolioMy Portfolio
My Portfolio
 
Sql Server 2000
Sql Server 2000Sql Server 2000
Sql Server 2000
 
Findability Day 2014 Neo4j how graph data boost your insights
Findability Day 2014 Neo4j how graph data boost your insightsFindability Day 2014 Neo4j how graph data boost your insights
Findability Day 2014 Neo4j how graph data boost your insights
 
When Relational Isn't Enough: Neo4j at Squidoo
When Relational Isn't Enough: Neo4j at SquidooWhen Relational Isn't Enough: Neo4j at Squidoo
When Relational Isn't Enough: Neo4j at Squidoo
 
Building Applications with DynamoDB
Building Applications with DynamoDBBuilding Applications with DynamoDB
Building Applications with DynamoDB
 
Creating Operational Redundancy for Effective Web Data Mining
Creating Operational Redundancy for Effective Web Data MiningCreating Operational Redundancy for Effective Web Data Mining
Creating Operational Redundancy for Effective Web Data Mining
 
Cena-DTA PHP Conference 2011 Slides
Cena-DTA PHP Conference 2011 SlidesCena-DTA PHP Conference 2011 Slides
Cena-DTA PHP Conference 2011 Slides
 
Polymer 3.0 by Michele Gallotti
Polymer 3.0 by Michele GallottiPolymer 3.0 by Michele Gallotti
Polymer 3.0 by Michele Gallotti
 
Freeing Yourself from an RDBMS Architecture
Freeing Yourself from an RDBMS ArchitectureFreeing Yourself from an RDBMS Architecture
Freeing Yourself from an RDBMS Architecture
 
RESTFUL SERVICES MADE EASY: THE EVE REST API FRAMEWORK - Nicola Iarocci - Co...
RESTFUL SERVICES MADE EASY: THE EVE REST API FRAMEWORK -  Nicola Iarocci - Co...RESTFUL SERVICES MADE EASY: THE EVE REST API FRAMEWORK -  Nicola Iarocci - Co...
RESTFUL SERVICES MADE EASY: THE EVE REST API FRAMEWORK - Nicola Iarocci - Co...
 
GraphConnect Europe 2016 - Opening Keynote, Emil Eifrem
GraphConnect Europe 2016 - Opening Keynote, Emil EifremGraphConnect Europe 2016 - Opening Keynote, Emil Eifrem
GraphConnect Europe 2016 - Opening Keynote, Emil Eifrem
 
Building LinkedIn's Learning Platform with MongoDB
Building LinkedIn's Learning Platform with MongoDBBuilding LinkedIn's Learning Platform with MongoDB
Building LinkedIn's Learning Platform with MongoDB
 
What's new in Vaadin 8, how do you upgrade, and what's next?
What's new in Vaadin 8, how do you upgrade, and what's next?What's new in Vaadin 8, how do you upgrade, and what's next?
What's new in Vaadin 8, how do you upgrade, and what's next?
 

Recently uploaded

Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 

Recently uploaded (20)

Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 

How to Win Friends and Influence People (with Hadoop)

  • 1. How to Win Friends and Influence People (with Hadoop) Strata Conference New York Sam Shah and Joseph Adler October 25 2012 ©2012 LinkedIn Corporation. All Rights Reserved.
  • 2. Sam Shah Principal Engineer and Engineering Manager www.linkedin.com/in/shahsam ©2012 LinkedIn Corporation. All Rights Reserved. Joseph Adler Senior Data Scientist www.linkedin.com/in/josephadler
  • 3. STRATA NY 2012 LinkedIn is the leading professional network site Worldwide Workforce 3,300M+ Worldwide Professionals 640M+ LinkedIn Members 175M+ ©2012 LinkedIn Corporation. All Rights Reserved. 3
  • 4. STRATA NY 2012 Data rich ©2012 LinkedIn Corporation. All Rights Reserved. 4 175+MMembers 175M Member Profiles
  • 5. STRATA NY 2012 LinkedIn ©2012 LinkedIn Corporation. All Rights Reserved. 9.3B Page Views per Quarter 130M Unique Visitors 5
  • 6. We have a lot of data. ©2012 LinkedIn Corporation. All Rights Reserved.
  • 7. We have a lot of data. We want to leverage this data to build products. ©2012 LinkedIn Corporation. All Rights Reserved.
  • 8. We have a lot of data. We want to leverage this data to build products. How do you make it easy to build products from data? ©2012 LinkedIn Corporation. All Rights Reserved.
  • 9. STRATA NY 2012 Products we have built on Hadoop ©2012 LinkedIn Corporation. All Rights Reserved. 9
  • 10. STRATA NY 2012 Building products from data Examples of products built with data  Year in Review Email  Network Updates  Skills and Endorsements  People You May Know  and more… ©2012 LinkedIn Corporation. All Rights Reserved.
  • 11. STRATA NY 2012 Year in Review One of the most successful email messages ever. 20% Response Rate 5 Clicks per responder
  • 12. STRATA NY 2012 Network updates ©2012 LinkedIn Corporation. All Rights Reserved. 12
  • 13. STRATA NY 2012 People you may know ©2012 LinkedIn Corporation. All Rights Reserved. 13
  • 14. STRATA NY 2012 Skills and Endorsements ©2012 LinkedIn Corporation. All Rights Reserved.
  • 15. STRATA NY 2012 Building products from data Hadoop is awesome for building product with data  Lots of cheap storage  Vast computational resources  Lots of tools for processing data, learning from data  Shared infrastructure  Shared support services  Runs on commodity hardware (or AWS) ©2012 LinkedIn Corporation. All Rights Reserved.
  • 16. STRATA NY 2012 Leverage The marginal cost of building new products is low  People You May Know (2 people)  Skills and Endorsements (2 people)  Year in Review (1 person, 1 month)  Network Updates Stream (1 person, 3 months) Hadoop can empower small teams to build things ©2012 LinkedIn Corporation. All Rights Reserved.
  • 17. STRATA NY 2012 Leverage The marginal cost of building new products is low  People You May Know (2 people)  Skills and Endorsements (2 people)  Year in Review (1 person, 1 month)  Network Updates Stream (1 person, 3 months) Hadoop can empower small teams to build things ©2012 LinkedIn Corporation. All Rights Reserved.
  • 18. STRATA NY 2012 How we build products Turning data into products ©2012 LinkedIn Corporation. All Rights Reserved. 18
  • 19. STRATA NY 2012 Year in Review  Steps to make the email – Collect job changers – Figure out who is connected to them – Rank job changes
  • 20. STRATA NY 2012 Example: Year in Review memberPosition = LOAD '$latest_positions' USING BinaryJSON; memberWithPositionsChangedLastYear = FOREACH ( FILTER memberPosition BY ((start_date >= $start_date_low ) AND (start_date <= $start_date_high)) ) GENERATE member_id, start_date, end_date; allConnections = LOAD '$latest_bidirectional_connections' USING BinaryJSON; allConnectionsWithChange_nondistinct = FOREACH ( JOIN memberWithPositionsChangedLastYear BY member_id, allConnections BY dest ) GENERATE allConnections::source AS source, allConnections::dest AS dest; allConnectionsWithChange = DISTINCT allConnectionsWithChange_nondistinct; memberinfowpics = LOAD '$latest_memberinfowpics' USING BinaryJSON; pictures = FOREACH ( FILTER memberinfowpics BY ((cropped_picture_id is not null) AND ( (member_picture_privacy == 'N') OR (member_picture_privacy == 'E'))) ) GENERATE member_id, cropped_picture_id, first_name as dest_first_name, last_name as dest_last_name; resultPic = JOIN allConnectionsWithChange BY dest, pictures BY member_id; connectionsWithChangeWithPic = FOREACH resultPic GENERATE allConnectionsWithChange::source AS source_id, allConnectionsWithChange::dest AS member_id, pictures::cropped_picture_id AS pic_id, pictures::dest_first_name AS dest_first_name, pictures::dest_last_name AS dest_last_name; joinResult = JOIN connectionsWithChangeWithPic BY source_id, memberinfowpics BY member_id; withName = FOREACH joinResult GENERATE connectionsWithChangeWithPic::source_id AS source_id, connectionsWithChangeWithPic::member_id AS member_id, connectionsWithChangeWithPic::dest_first_name as first_name, connectionsWithChangeWithPic::dest_last_name as last_name, connectionsWithChangeWithPic::pic_id AS pic_id, memberinfowpics::first_name AS firstName, memberinfowpics::last_name AS lastName, memberinfowpics::gmt_offset as gmt_offset, memberinfowpics::email_locale as email_locale, memberinfowpics::email_address as email_address; resultGroup0 = GROUP withName BY (source_id, firstName, lastName, email_address, email_locale, gmt_offset); -- get the count of results per recipient resultGroupCount = FOREACH resultGroup0 GENERATE group , withName as toomany, COUNT_STAR(withName) as num_results; resultGroupPre = filter resultGroupCount by num_results > 2; resultGroup = FOREACH resultGroupPre { withName = LIMIT toomany 64; GENERATE group, withName, num_results; } x_in_review_pre_out = FOREACH resultGroup GENERATE FLATTEN(group) as (source_id, firstName, lastName, email_address, email_locale, gmt_offset), withName.(member_id, pic_id, first_name, last_name) as jobChanger, '2011' as changeYear:chararray, num_results as num_results; x_in_review = FOREACH x_in_review_pre_out GENERATE source_id as recipientID, gmt_offset as gmtOffset, firstName as first_name, lastName as last_name, email_address, email_locale, TOTUPLE( changeYear, source_id,firstName, lastName, num_results,jobChanger) as body; rmf $xir; STORE x_in_review INTO '$xir' USING BinaryJSON('recipientID');
  • 21. STRATA NY 2012 Example: Year in Review {body={num_results=80, lastName=Adler, changeYear=2011, firstName=Joseph, jobChanger=[{last_name=O'Connor, first_n ame=Br?on, member_id=12562482, pic_id=/p/3/000/086/1bd/10ee035.jpg}, {last_name=Sundaram, first_name=Vivek, member _id=6590171, pic_id=/p/3/000/0ae/354/36eb54c.jpg}, {last_name=Crane, first_name=Patrick, member_id=8628324, pic_id =/p/1/000/09c/064/10191de.jpg}, {last_name=McLennan, first_name=Dan, member_id=10551114, pic_id=/p/2/000/09d/12f/1 47def1.jpg}, {last_name=Shaughnessy, first_name=Helen, member_id=2211035, pic_id=/p/3/000/06d/2ba/06a113c.jpg}, {l ast_name=Chen, first_name=Richard, member_id=12800647, pic_id=/p/2/000/007/1ad/0fb84f9.jpg}, {last_name=Barba, fir st_name=Troy, member_id=27577, pic_id=/p/2/000/0a2/3e9/3a83a33.jpg}, {last_name=Reed, first_name=Harper, member_id =1865420, pic_id=/p/1/000/001/17b/396a2c3.jpg}, {last_name=Goldstein, first_name=Peter, member_id=205610, pic_id=/ p/2/000/01c/2e6/042999f.jpg}, {last_name=Koren, first_name=Yuval, member_id=2289577, pic_id=/p/1/000/02b/3d3/1fc36 27.jpg}, {last_name=Kiang, first_name=Andy, member_id=8347, pic_id=/p/1/000/063/115/1256f61.jpg}, {last_name=Green field, first_name=Nick, member_id=82814545, pic_id=/p/1/000/068/39f/2080b8f.jpg}, {last_name=Murarka, first_name=B ubba, member_id=174233, pic_id=/p/3/000/011/2c8/33837b8.jpg}, {last_name=Kutter, first_name=Norbert, member_id=310 933, pic_id=/p/3/000/005/0e2/02775a0.jpg}, {last_name=Ehrenberg, first_name=Roger, member_id=1662181, pic_id=/p/3/ 000/038/066/3572baf.jpg}, {last_name=Coderre, CISSP, first_name=Rob, member_id=68521, pic_id=/p/1/000/088/0d5/2438 981.jpg}, {last_name=Stephens, first_name=Bradford, member_id=10900447, pic_id=/p/1/000/0ad/0dc/15f9df5.jpg}, {las t_name=Shiau, first_name=Peter, member_id=300654, pic_id=/p/2/000/056/2a6/18938e3.jpg}, {last_name=Rajan, first_na me=Arvind, member_id=1260, pic_id=/p/3/000/019/3f7/1e6e0f2.jpg}, {last_name=Bellister, first_name=Jesse, member_id =25234604, pic_id=/p/3/000/00a/17d/1e2136b.jpg}, {last_name=Mohan, first_name=Viraj, member_id=56817108, pic_id=/p /3/000/0cd/0a4/097527a.jpg}, {last_name=Ragade, first_name=Dhananjay, member_id=325284, pic_id=/p/3/000/000/035/05 04fe7.jpg}, {last_name=Richards, first_name=Jeff, member_id=16762, pic_id=/p/2/000/039/14e/081d1c7.jpg}, {last_nam e=Wittenauer, first_name=Allen, member_id=3328775, pic_id=/p/3/000/08d/2a3/307b112.jpg}, {last_name=Porzak, first_ name=Jim, member_id=1708710, pic_id=/p/2/000/00d/109/0e4aa34.jpg}, {last_name=Ruma, first_name=Laurel, member_id=3 429732, pic_id=/p/1/000/01e/277/2bb115b.jpg}, {last_name=Higgins, first_name=Josh, member_id=1458792, pic_id=/p/1/ 000/0c9/38b/1a24457.jpg}, {last_name=Benedict, first_name=Harvey, member_id=641340, pic_id=/p/3/000/0c6/1eb/2eb711 9.jpg}, {last_name=Lazarus, first_name=Brett, member_id=49965786, pic_id=/p/2/000/03b/04e/318d080.jpg}, {last_name =Zhang, first_name=Simon, member_id=16323996, pic_id=/p/3/000/03f/0fe/35d4ded.jpg}, {last_name=Aspen, first_name=M att, member_id=25240804, pic_id=/p/3/000/09b/371/22ec974.jpg}, {last_name=Herz, first_name=Erik, member_id=147604, pic_id=/p/3/000/086/014/0fab4d6.jpg}, {last_name=Sanders, first_name=Geoffrey, member_id=340570, pic_id=/p/1/000/0 d1/2d1/37a76e6.jpg}, {last_name=Wright, first_name=Caleb, member_id=12798700, pic_id=/p/2/000/08c/337/2cc951a.jpg} , {last_name=Parab, first_name=Guru, member_id=8915230, pic_id=/p/1/000/08a/257/051926a.jpg}, {last_name=Grossman, first_name=Nick, member_id=12159520, pic_id=/p/2/000/005/2f3/1955f31.jpg}, {last_name=Skomoroch, first_name=Peter, member_id=11642980, pic_id=/p/2/000/0b4/12d/31eadbe.jpg}, {last_name=Singh, first_name=Deepak, member_id=1246166, pic_id=/p/1/000/042/3f5/369f807.jpg}, {last_name=Noakes, first_name=Geoffrey, member_id=3518726, pic_id=/p/3/000/0 05/3d7/3f67632.jpg}, {last_name=Scudiere, first_name=Robert, member_id=3965286, pic_id=/p/2/000/090/210/009a099.jp g}, {last_name=Skyler, first_name=David, member_id=15377099, pic_id=/p/3/000/005/1bf/080b255.jpg}, {last_name=Shar ma, first_name=Manu, member_id=19295378, pic_id=/p/3/000/0d4/11e/2176c30.jpg}, {last_name=Huang, first_name=Erica, member_id=1808438, pic_id=/p/1/000/001/3a5/02ddd24.jpg}, {last_name=Ballotta, first_name=Pete, member_id=2011178, pic_id=/p/2/000/0b6/08f/3a92357.jpg}, {last_name=Kast, first_name=Anton, member_id=1092686, pic_id=/p/1/000/054/0e 2/1a8efb2.jpg}, {last_name=Redfern, first_name=Joff, member_id=2849241, pic_id=/p/3/000/03d/28d/19f5688.jpg}, {las t_name=Smith, first_name=Aaron, member_id=83470876, pic_id=/p/2/000/08c/27c/3cfe37a.jpg}, {last_name=Yadav, first_ name=Rishi, member_id=2097381, pic_id=/p/2/000/0c8/08d/3ab9006.jpg}, {last_name=Repass, first_name=Mike, member_id =8633208, pic_id=/p/2/000/071/195/0bfc573.jpg}, {last_name=Dalvi, first_name=Anand, member_id=8388, pic_id=/p/1/00 0/003/3cd/3127384.jpg}, {last_name=Croll, first_name=Alistair, member_id=511218, pic_id=/p/2/000/029/0e5/1ebc076.j pg}, {last_name=Tolman, first_name=Sarah, member_id=86040596, pic_id=/p/2/000/06f/1c9/1a7870e.jpg}, {last_name=Suv arna, first_name=Sandeep, member_id=10558779, pic_id=/p/1/000/05b/2c7/0ec214a.jpg}, {last_name=Elliott- McCrea, first_name=Kellan, member_id=163959, pic_id=/p/1/000/06b/2e8/2dbd3ae.jpg}, {last_name=Jatkar, first_name=T arang, member_id=17763609, pic_id=/p/1/000/012/010/2e8ee7f.jpg}, {last_name=Brown, first_name=David, member_id=420 737, pic_id=/p/3/000/002/140/0b2dbcc.jpg}, {last_name=Patel, first_name=Jay, member_id=1179857, pic_id=/p/2/000/07 c/0b2/0365e91.jpg}, {last_name=Field, first_name=Dylan, member_id=13066037, pic_id=/p/2/000/0a5/3e2/1fb7f06.jpg}, {last_name=Patel, first_name=Sumeet, member_id=23402387, pic_id=/p/2/000/0bf/3ca/2ca5f1f.jpg}, {last_name=Ting, fi rst_name=Moses, member_id=15624915, pic_id=/p/2/000/0ac/117/29e329a.jpg}, {last_name=Hinnach, first_name=Yassine, member_id=1731285, pic_id=/p/3/000/000/035/330cce0.jpg}, {last_name=Das, first_name=Anshu, member_id=38878221, pic _id=/p/3/000/0b2/1ac/15902f4.jpg}, {last_name=Mendelson, first_name=Jordan, member_id=8598415, pic_id=/p/3/000/032 /22a/1d2eaa6.jpg}, {last_name=Besbeas, first_name=Nick, member_id=12510505, pic_id=/p/3/000/093/167/34f5b6b.jpg}], source_id=256842}, first_name=Joseph, email_locale=en_US, last_name=Adler, gmtOffset=- 8, recipientID=256842, email_address=jadler@linkedin.com} Each message requires a lot of data: – Header information (10 fields) – 4 fields per person, 64 people – That’s over 250 data fields for the final message How do we turn this raw data in to web content or email messages?
  • 22. STRATA NY 2012 People you may know ©2012 LinkedIn Corporation. All Rights Reserved. 22
  • 23. STRATA NY 2012 Skills and Endorsements ©2012 LinkedIn Corporation. All Rights Reserved.
  • 24. STRATA NY 2012 Tagging Skills ©2012 LinkedIn Corporation. All Rights Reserved. 24
  • 25. STRATA NY 2012©2012 LinkedIn Corporation. All Rights Reserved. 25
  • 26. STRATA NY 2012 Skills and Endorsements ©2012 LinkedIn Corporation. All Rights Reserved. A combination of – Propensity to know member – Propensity for member to have skill
  • 27. STRATA NY 2012 Productionalization Take something that runs once… … and run it multiple times … and serve it reliably at scale … and iterate quickly ©2012 LinkedIn Corporation. All Rights Reserved. 28
  • 28. STRATA NY 2012 Data Lifecycle  Moving around data is the key problem 1. Ingress Moving raw data from online systems to offline systems 2. Workflow management Managing offline processes 3. Egress Moving results from offline systems to online systems ©2012 LinkedIn Corporation. All Rights Reserved. 29
  • 29. STRATA NY 2012 Ingress  Apache Kafka: Low latency publish/subscribe message bus – Common data format (Avro) – Changelog is the abstraction for integration – Schema evolution  Programmatic compatibility model  Explicit schema reviews  “O(1)” ETL  K. Goodhope, J. Koshy, J. Kreps, N. Narkhede, R. Park, J. Rao, V.Y. Ye: Building LinkedIn’s Real-time Activity Data Pipeline. In IEEE Data Engineering Bulletin. Vol 35, No. 2, June 2012. ©2012 LinkedIn Corporation. All Rights Reserved. 30
  • 30. STRATA NY 2012 Workflows ©2012 LinkedIn Corporation. All Rights Reserved. 31 Job A Job B Job C
  • 31. STRATA NY 2012 Workflows ©2012 LinkedIn Corporation. All Rights Reserved. 32 Job A Job B Job C Push to Production
  • 32. STRATA NY 2012 Workflows ©2012 LinkedIn Corporation. All Rights Reserved. 33 Job A Job B Job C Push to Production Job X
  • 33. STRATA NY 2012 Workflows ©2012 LinkedIn Corporation. All Rights Reserved. 34 Job A Job B Job C Push to Production Job X Push to QA
  • 34. STRATA NY 2012 Real workflows are complicated ©2012 LinkedIn Corporation. All Rights Reserved. 35
  • 35. STRATA NY 2012 Workflow Management: Azkaban ©2012 LinkedIn Corporation. All Rights Reserved. 36  Dependency management  Diverse job types (Pig, Hive, Java, . . . )  Scheduling  Monitoring  Configuration  Retry/restart on failure  Resource locking  Log collection  Historical information
  • 36. STRATA NY 2012 Workflow Management: Azkaban ©2012 LinkedIn Corporation. All Rights Reserved. 37
  • 37. STRATA NY 2012 Workflow Management: Azkaban ©2012 LinkedIn Corporation. All Rights Reserved. 38
  • 38. STRATA NY 2012 Egress: Voldemort  Distributed key/value store  Easy to integrate into workflows – Off the shelf jobs to copy Voldemort Stores – One line command in Pig  Cost of data load  Data stored per node? Response time  Fail-over  How to transfer  Versioning & rollback  R. Sumbaly, J. Kreps, L. Gao, A. Feinberg, C. Soman, & S. Shah. Serving Large- Scale Batch Computed Data With Project Voldemort. In FAST 2012. ©2012 LinkedIn Corporation. All Rights Reserved. 39
  • 39. STRATA NY 2012 Recap Why we use Hadoop  Simple programmatic model  Rich developer ecosystem – Languages: Pig, Hive, Crunch, Cascading, … – Libraries: Mahout, DataFu, ElephantBird, …  Horizontal scalability, fault tolerance, multi-tenancy – Reliably process multiple TB of data  Don’t need hardcore distributed systems engineers ©2012 LinkedIn Corporation. All Rights Reserved. 40
  • 40. STRATA NY 2012 Recap How we use Hadoop Open source projects started at LinkedIn:  Getting data in: Kafka  Building and running job flows: Azkaban  Getting data out: Voldemort This empowers data scientists and engineers to focus on new product ideas, not infrastructure ©2012 LinkedIn Corporation. All Rights Reserved. 41
  • 41. STRATA NY 2012 data.linkedin.com Learning More ©2012 LinkedIn Corporation. All Rights Reserved. 42