A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

A Small Overview of Big Data Products,
Analytics and Infrastructure at Linkedin

Bhaskar Ghosh Big Data Science
A Symposium in Honor of Martin Schultz
Senior Director of Engineering
Yale University
Data Infrastructure 26 Oct 2012

LinkedIn Confidential ©2013 All Rights Reserved

Outline

1. Martin and Me
2. Company and Mission
3. Products and Science
4. Data Infrastructure
5. P, S, DI: People You May Know
6. Linkedin + Yale
7. Conclusion

LinkedIn Confidential ©2013 All Rights Reserved 2

Martin and Me

Thank you Martin! Best mentor.
Versatility, big-picture thinking and leadership.
Yale CS Ph.D. 1995 (Parallel Algorithms)

12y @ Informix & Oracle building parallel
database systems

4y @ Yahoo! building Ads systems & leading
the Display Ads Exchange organization

2y+ @ LinkedIn building & leading the
Data Infrastructure Engineering Organization


The World’s Largest Professional Network
Connecting Talent  Opportunity. At scale…

175M+ 2 new 100M+ 2M+
Members Worldwide Members Per Second Monthly Unique Visitors Company Pages


..and a bunch of Data-Driven Products

Pandora Search for People

Groups browse maps
Events You
May Be
Interested In


The LinkedIn Mission.
Connect the world’s professionals to make them
more productive and successful

Linkedin Product Philosophy

Goals
 Provide a uniquely personalized experience to
members (professionals)
 Build an ecosystem to balance the interests of
members and partners (companies)

Approach
 Launch Often and Early
 Data-Driven Experiment and Test
 Fail Fast
 Prepare for Virality and Scale


Two Product Families

For Members For Partners
 People You May Know
Hire
Professionals
 Who’s Viewed My Profile Companies
 Jobs You May Be
Interested In Market
 News/Sharing
 Today Sell
 Search
 Subscriptions

Science and Analytics

Data Infrastructure
Profiles Actions
Data
Connections Content


The Big-Data Feedback Loop

Engagement ↑ Refinement ↑
Value ↑
Member Product

Virality ↑ Insights ↑

Signals ↑
Data Science

Scale ↑ Analytics ↑
Infrastructure


Member-Facing Products: Diversity at Scale
Product Family Products Science Data Infra

1. Profile and Connections Blending and ranking of
2. Activity Streams heterogeneous content
Identity and
3. Messages (email) (e.g. Network Updates,
Engagement Group Discussions, Job
4. Endorsements & Skills Postings)

1. People Search
Search and
2. Group Search
Analysis
3. Who Viewed My Profile

1. People You May Know
2. Jobs You May Be Entity
Recommendations Interested In disambiguation and
3. Events You May Be matching
Interested In

1. Subscription Packages Response Prediction
Monetization
2. Sponsored Content Inventory Forecasting


Recommendations…Are Effective .. And Drive

> 50% of connections • Find data that is useful for Members
• Guiding Principle
• Provide Relevant Content
• Establish Social Connections
• In Appropriate Context

> 50% of job applications > 50% of group joins


LinkedIn Recommendation Engine

Recom- People Jobs Groups
mendation
Entities … Ads
Companies
Searches

be interested in
Referral Center
People Browse

Similar Profiles

Similar Groups
Jobs You May

Jobs Browse

Browse Map
TalentMatch

Similar Jobs
News

Groups
GYML
Events
Map

Map
… and more
Products

A/B
API
Recom-
Behavior Collaborative
mendation Popularity User Feedback
Types
Analysis Filtering

Shared, (R-T) Feature Extraction, Entity (R-T) matching computations
Dynamic, Resolution & Enrichment
Unified Offline data munging (hadoop)
Core
Service

Member-Facing Products: Diversity at Scale
Product Family Products Science Data Infra

1. Profile and Connections Blending and ranking of • Scale
2. Activity Streams heterogeneous content
Identity and • Full text and
3. Messages (email) (e.g. Network Updates,
Engagement secondary ind
Group Discussions, Job
4. Endorsements & Skills Postings) • Real-time

• Faceted search
1. People Search • Near RT index
Search and
2. Group Search freshness
Analysis
3. Who Viewed My Profile • Drill-down
exploration

1. People You May Know
2. Jobs You May Be Entity • Graph analysis
Recommendations Interested In disambiguation and • Content serving
3. Events You May Be matching • Real-time tuning
Interested In

1. Subscription Packages
Monetization Response prediction
2. Sponsored Content


LinkedIn Data Infrastructure: Three-Phase Abstraction

Near-Line
Infra

Application Offline
Data Infra

Users Online Data
Infra

Infrastructure Latency & Freshness Requirements Products
• Member Profiles • Messages
Online Activity that should be reflected immediately • Company Profiles • Endorsements
• Connections • Skills
• Activity Streams • Recommendations
Near-Line Activity that should be reflected soon • Profile Standardization • Search
• News • Messages
• People You May Know • Recommendations
Offline Activity that can be reflected later • Connection Strength • Next best idea…
• News

LinkedIn Data Infrastructure: Sample Stack

Infra challenges in 3-phase Some off-the-shelf.
ecosystem are diverse, Significant investment in
complex and specific home-grown, deep and
interesting platforms

15

LinkedIn Data Infrastructure: Data Stores

Near-Line
Infra

Application Offline
Data Infra

Users Online Data
Infra

 ICDE 2012 (Data Infra Overview)  FAST 2012 (Voldemort for Serving)

Systems Capabilities

 Transactions
 Rich structures (e.g. indexes)
 Change capture capability
Voldemort  Key value / document storage


LinkedIn Data Infrastructure: Specialized Indexes

Near-Line
Infra

Application Offline
Data Infra

Users Online Data
Infra


Zoie Bobo Sensei  Search platform

GraphDB  Distributed graph engine


LinkedIn Data Infrastructure: Pipelines

Near-Line
Infra

Application Offline
Data Infra

Users Online Data
Infra

 ACM SOCC 2012: “Databus”  IEEE Data Eng. Bulletin 2012: “Kafka”

 Messaging for site events, monitoring
 High throughput

 Change data capture stream
 Reliable, consistent, low latency pipe

LinkedIn Data Infrastructure: Off-line Analysis

Near-Line
Infra

Application Offline
Data Infra

Users Online Data
Infra


 ML, Ranking, Relevance
 Insights and Analytics
 ETL, Metadata and Pipes
 Business Source of Truth

LinkedIn Data Infrastructure: Cluster Management

Near-Line
Infra

Application Offline
Data Infra

Users Online Data
Infra

 ACM SOCC 2012: Untangling Cluster Management with Helix


 Generic framework for building
distributed systems
 Cluster Management Primitives


HELIX: Generalizing Cluster Management

COUNT=2
t1≤ 5 STATE MACHINE
S
t1 t2

t3 t4
O M Helix
COUNT=1

minimize(maxnj∈N S(nj) )
CONSTRAINTS OBJECTIVE
minimize(maxnj∈N M(nj) )

 Declare distributed system behavior via {S, C, O}
 Enforce Partition constraints
 Fault detection and tolerance (e.g. promote S to M)
 Elasticity (e.g. Re-balance; Minimize migrations)
 Used in Espresso, Search, Databus


LinkedIn Data Infrastructure: A few take-aways

1. Infrastructure decisions matter and are hard to
transform in a hyper-growth environment.
2. Balance open-source products with home-
grown platforms (**)
3. Operability, Capacity Planning and On-line
Multi-tenancy are hard
4. Data Movement: Pipes and Feedback Loops
are critical (**)
5. Data Model and Integration e2e are key (*)
6. Few vs Many: Balance over-specialized (agile)
vs generic efforts (leverage-able) platforms (*)
7. Off-line Multi-Platform story is evolving.


Science and Infrastructure: Giving Back

Research Publications Open Source Projects
 ACM SOCC 2012  Apache Helix new
 ACM RecSys 2012
 ParSeq new
 SIGIR 2012
 CIKM 2012  DataFu new
 VLDB 2012  Apache Kafka
 ICDE 2012
 FAST 2012  Sensei
 NetDB 2011  Azkaban
 …
 Voldemort


A Recommendation Product:

People You May Know (PYMK)


Probability that you may know someone else?

Alice

??

Bob Carol

Known as “triangle closing”


PYMK: Science, Members and Connections
1) Feature selection is key The Feedback Loop
 Common Connections Value ↑
Member Product
 Geo
Virality ↑ Insights ↑
 Company Signals ↑
 Age Data Science

2) ML and data model
• Traditional ML (e.g. matrix factorization) on O(n^2) of 175M
tend to not scale easily
3) Interplay: Data Model + ML + Parallel Computation model
4) Adding edges: Why do it?
• Creates positive-feedback social loops for members
• More useful content and activity available to members
• Denser graph improves signal strength in science-driven
products


PYMK: Off-line Model Build

Near-Line
Infra

Application Offline
Data Infra

Users Online Data
Infra

 Use generic off-line Infra (Hadoop and Pig) to build recommendations off-line.
 Very complex workflow due to extraction and selection of large num of features.
Built Azkaban for Hadoop.
 Small Input and final look-up structure but large intermediate data (100’s of TB)
due to MR. Problem (who you do not know) itself has an inherent blow-up.
 Special optimizations (e.g. Bloom Join to remove connected)


PYMK: Off-line to Near-Line Serving

Near-Line
Infra

Application Offline
Data Infra

Users Online Data
Infra

 Build serving structure on Hadoop. Scan versus Index compactness tradeoff.
 Voldemort: Partitioned k-v; Load-balancing; Pluggable storage layer; Failover.
 Bulk load for efficiency. Fast Rollback for safety. Atomic swap.
 Serving: Per-partition index in memory. PYMK blobs on disk.
 Retrieval ~msec. Decoration in App FE is more expensive.


PYMK: Science and Feedback Loop

Near-Line
Infra

Application Offline
Data Infra

Users Online Data
Infra

 Response vs Latency: Fast refresh helps user experience. (e.g. showing
connections of very recent connections). “Social” phenomenon.
 Very agile feature: Lots of on-line A/B testing and tweaking of features
 Huge Impact: > 50% of accepted invites are created by PYMK


PYMK: Tying It All Together
PYMK
Application
User Interactions

Near-Line
Serving

Near-Line

Offline
P (B knows C) α large number of features

Common
Alice connections

Organizational
Overlap
Offline
Bob Carol
Model
Age

Distance

Dave Eve

LinkedIn + Yale

Students

 What is my career path?  Where did my students go Students:
 How can I prepare? after they left the  Transformation of
 How do I get my first university? Careers
internship and first job?  How is my school seeding Yale:
the various industries with
 Get a data-driven view
the best talent?
 Uncover opportunities
 How does my school
compare with other
institutions

Wins based on data and insights


Thank you colleagues for the beautiful slides!

Amy Tang Anmol Bhasin Daniel Tunkelang David Henke
Sr. Program Manager Sr. Engineering Manager Principal Data Scientist SVP Operations

Kapil Surlaker Sam Shah Shirshanka Das
Principal Engineer Principal Engineer Principal Engineer


Summary

1. E2E: The Big-Data feedback loop of social-network product design is cool
2. Infrastructure
1. Data Infrastructure needs continuous innovation and iteration to keep
pace for scale and cost.
2. Fast moving, Big, Clean Data + Agile Metadata = Goodness
3. Data-driven products need agile feedback infrastructure and
measurement methodology.
3. Methodology
1. Data-Driven experimentation enables insights and agile products
2. Recommendation-driven products have big impact.

Read more @ data.linkedin.com


Help us. Come Have Fun with Us!

1. Science and Data Mining: Recommendation and Optimization Problems
2. Next-generation ad-hoc and OLAP query processing on Hadoop
3. Graph Computations: Off-line mining and On-line integration loops
4. nRT Data Streams in Near-line infrastructure
5. And much more…

Info: data.linkedin.com


In Closing

bghosh@linkedin.com

Thank You!

A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn

Similar to A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn (20)

More from Amy W. Tang

More from Amy W. Tang (12)

Recently uploaded

Recently uploaded (20)

A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn