Big Data EcoSystem @ LinkedIn
October 20, 2012
LinkedIn Confidential ©2013 All Rights Reserved
LinkedIn Confidential ©2013 All Rights Reserved
Sunil Shirguppi
Head of Data Services- International
LinkedIn Corporation
http://www.linkedin.com/in/sunilshirguppi
Outline
LinkedIn Overview
Data Science
Big Data Eco-System
Learnings
LinkedIn Confidential ©2013 All Rights Reserved 3
Our Mission
Connect the world’s professionals
to make them more productive and successful
LinkedIn Confidential ©2013 All Rights Reserved 4
We are the professional profile of record
Googled yourself lately?
Don’t feel bad, we all do it.
Executives from all
Companies are
LinkedIn members
The LinkedIn Opportunity
LinkedIn Confidential ©2013 All Rights Reserved 7
Fundamentally transforming the way the world worksFundamentally transforming the way the world works
Connect talent with opportunity at massive scale
+
The World’s Largest Professional Network
LinkedIn Confidential ©2013 All Rights Reserved 8
*as of Nov 4, 2011
**as of June 30, 2011
2
4
8
17
32
55
90
2004 2005 2006 2007 2008 2009 2010
LinkedIn Members (Millions)
175M+*
82%
Fortune 100 Companies
use LinkedIn to hire
Company Pages
>2M
**
New Members joining
~2/sec
Professional
searches in 2011
~4.2B
Multiple revenue channels
 Premium Subscriptions
 Self Serve Ads
 Hiring Solutions
 Marketing Solutions
Let’s talk Data…
Business is recognizing the importance of analytics
Data Scientist = Curiosity + Intuition + Data
gathering + Standardization + Statistics + Modeling
+ Visualization + Communication
What makes a Data Scientist?
Big Data at LinkedIn
LinkedIn Confidential ©2013 All Rights Reserved 13
* Chart from Philip Russom- Research Director: TDWI
What do we do with Data?
 Data Standardization
 Build innovative data products to help professionals
 Draw insights
 Drive the business
Before we can do that...
There are a few challenges that we have to overcome
• Scale
• Standardization
• Infrastructure
Few Data-Driven Products
LinkedIn Confidential ©2013 All Rights Reserved 15
Pandora Search for People
Events You
May Be
Interested In
Groups browse maps
How do we do it?
LinkedIn Sample Data Stack
Crowdsourcing
Big Data at LinkedIn
LinkedIn Confidential ©2013 All Rights Reserved 19
Users
Online Data
Store
Near-Line
Data Store
Application Offline Data
Store
Web
Logs
High-level data environment
Challenges so complex which
off-the-shelf or a few
technologies can’t address
Built our own combination of
toolsets/ technologies to
meet specific requirements
LinkedIn Data Stack – Online
LinkedIn Confidential ©2013 All Rights Reserved 20
Users
Online Data
Store
Near-Line
Data Store
Application Offline Data
Store
Web
Logs
Systems Capabilities
• Rich structures (e.g. indexes)
• Change capture capability
LinkedIn Data Stack – Nearline
LinkedIn Confidential ©2013 All Rights Reserved 21
Users
Online Data
Store
Near-Line
Data Store
Application Offline Data
Store
Web
Logs
Systems Capabilities
• Key value accessVoldemort
• Search platform
• Distributed Graph engine
Zoie Bobo Sensei
D-Graph
LinkedIn Data Stack – Pipeline
LinkedIn Confidential ©2013 All Rights Reserved 22
Users
Online Data
Store
Near-Line
Data Store
Application Offline Data
Store
Web
Logs
Systems Capabilities
• Messaging for site events,
monitoring
• Change data capture streams
LinkedIn Data Stack – Offline
LinkedIn Confidential ©2013 All Rights Reserved 23
Users
Online Data
Store
Near-Line
Data Store
Application Offline Data
Store
Web
Logs
Systems Capabilities
• Machine learning, ranking,
relevance
• Warehouse and analytics
LinkedIn with Hadoop, Aster, and Teradata
LinkedIn Confidential ©2013 All Rights Reserved 24
Integrated Data
Warehouse
• Exec Dashboards
• Adhoc/OLAP
• Complex SQL
• SQL
Data transformation
& batch processing
• Image processing
• Search indexes
• Graph (PYMK)
• MapReduce
Analytic Platform for data
discovery
• nPath Pattern/Path
• Clickstream analysis
• A/B site testing
• Data Sciences discovery
• SQL-MapReduce
Aster/Teradata
Bi-Directional Connector
Aster/Teradata
Hadoop Connectors
Batch data transformations for
engineering groups using HDFS +
MapReduce
Batch data transformations for
engineering groups using HDFS +
MapReduce
Interactive MapReduce
analytics for the enterprise using
MapReduce Analytics &
SQL-MapReduce
Interactive MapReduce
analytics for the enterprise using
MapReduce Analytics &
SQL-MapReduce
Integration with structured data,
operational intelligence, scalable
distribution of analytics
Integration with structured data,
operational intelligence, scalable
distribution of analytics
It’s a global economy
Country connectedness on LinkedIn
Data deep dives
Job migration after financial collapse
How Often do people change jobs?
Visualization is important
If your name is Chip, you are likely in sales!
31
Industry Growth
Buzzwords
What next?
• Self service analytics
• Metadata framework
• Integrate reporting solutions
• Go Mobile!
• Scalability and Data Quality
Challenges
• Data volumes and availability
– Billion+ rows every day
– Users in Global locations need data
• Multiple platforms
– Agile development
– Data Integration
 Data Quality
– User input data
– Data standardization
Key Learnings
 Self Service
– Making data accessible to key stakeholders in a timely
manner creates tremendous value.
– Viz is more important than we think
• Measuring your future investments
– Performance is not the only measure
– Company fundamentals matter
• As an Data team, be in control of your destiny
– Identify what to measure and lead by metrics
– Become the Think-tank
Web 3.0 – It’s all about data!!
LinkedIn Confidential ©2013 All Rights Reserved 36
ULTIMATELY…
It is all about the people!
LinkedIn Confidential ©2013 All Rights Reserved 39
Thank You!

Big Data Ecosystem @ LinkedIn

  • 1.
    Big Data EcoSystem@ LinkedIn October 20, 2012 LinkedIn Confidential ©2013 All Rights Reserved
  • 2.
    LinkedIn Confidential ©2013All Rights Reserved Sunil Shirguppi Head of Data Services- International LinkedIn Corporation http://www.linkedin.com/in/sunilshirguppi
  • 3.
    Outline LinkedIn Overview Data Science BigData Eco-System Learnings LinkedIn Confidential ©2013 All Rights Reserved 3
  • 4.
    Our Mission Connect theworld’s professionals to make them more productive and successful LinkedIn Confidential ©2013 All Rights Reserved 4
  • 5.
    We are theprofessional profile of record Googled yourself lately? Don’t feel bad, we all do it.
  • 6.
    Executives from all Companiesare LinkedIn members
  • 7.
    The LinkedIn Opportunity LinkedInConfidential ©2013 All Rights Reserved 7 Fundamentally transforming the way the world worksFundamentally transforming the way the world works Connect talent with opportunity at massive scale +
  • 8.
    The World’s LargestProfessional Network LinkedIn Confidential ©2013 All Rights Reserved 8 *as of Nov 4, 2011 **as of June 30, 2011 2 4 8 17 32 55 90 2004 2005 2006 2007 2008 2009 2010 LinkedIn Members (Millions) 175M+* 82% Fortune 100 Companies use LinkedIn to hire Company Pages >2M ** New Members joining ~2/sec Professional searches in 2011 ~4.2B
  • 9.
    Multiple revenue channels Premium Subscriptions  Self Serve Ads  Hiring Solutions  Marketing Solutions
  • 10.
  • 11.
    Business is recognizingthe importance of analytics
  • 12.
    Data Scientist =Curiosity + Intuition + Data gathering + Standardization + Statistics + Modeling + Visualization + Communication What makes a Data Scientist?
  • 13.
    Big Data atLinkedIn LinkedIn Confidential ©2013 All Rights Reserved 13 * Chart from Philip Russom- Research Director: TDWI
  • 14.
    What do wedo with Data?  Data Standardization  Build innovative data products to help professionals  Draw insights  Drive the business Before we can do that... There are a few challenges that we have to overcome • Scale • Standardization • Infrastructure
  • 15.
    Few Data-Driven Products LinkedInConfidential ©2013 All Rights Reserved 15 Pandora Search for People Events You May Be Interested In Groups browse maps
  • 16.
    How do wedo it?
  • 18.
    LinkedIn Sample DataStack Crowdsourcing
  • 19.
    Big Data atLinkedIn LinkedIn Confidential ©2013 All Rights Reserved 19 Users Online Data Store Near-Line Data Store Application Offline Data Store Web Logs High-level data environment Challenges so complex which off-the-shelf or a few technologies can’t address Built our own combination of toolsets/ technologies to meet specific requirements
  • 20.
    LinkedIn Data Stack– Online LinkedIn Confidential ©2013 All Rights Reserved 20 Users Online Data Store Near-Line Data Store Application Offline Data Store Web Logs Systems Capabilities • Rich structures (e.g. indexes) • Change capture capability
  • 21.
    LinkedIn Data Stack– Nearline LinkedIn Confidential ©2013 All Rights Reserved 21 Users Online Data Store Near-Line Data Store Application Offline Data Store Web Logs Systems Capabilities • Key value accessVoldemort • Search platform • Distributed Graph engine Zoie Bobo Sensei D-Graph
  • 22.
    LinkedIn Data Stack– Pipeline LinkedIn Confidential ©2013 All Rights Reserved 22 Users Online Data Store Near-Line Data Store Application Offline Data Store Web Logs Systems Capabilities • Messaging for site events, monitoring • Change data capture streams
  • 23.
    LinkedIn Data Stack– Offline LinkedIn Confidential ©2013 All Rights Reserved 23 Users Online Data Store Near-Line Data Store Application Offline Data Store Web Logs Systems Capabilities • Machine learning, ranking, relevance • Warehouse and analytics
  • 24.
    LinkedIn with Hadoop,Aster, and Teradata LinkedIn Confidential ©2013 All Rights Reserved 24 Integrated Data Warehouse • Exec Dashboards • Adhoc/OLAP • Complex SQL • SQL Data transformation & batch processing • Image processing • Search indexes • Graph (PYMK) • MapReduce Analytic Platform for data discovery • nPath Pattern/Path • Clickstream analysis • A/B site testing • Data Sciences discovery • SQL-MapReduce Aster/Teradata Bi-Directional Connector Aster/Teradata Hadoop Connectors Batch data transformations for engineering groups using HDFS + MapReduce Batch data transformations for engineering groups using HDFS + MapReduce Interactive MapReduce analytics for the enterprise using MapReduce Analytics & SQL-MapReduce Interactive MapReduce analytics for the enterprise using MapReduce Analytics & SQL-MapReduce Integration with structured data, operational intelligence, scalable distribution of analytics Integration with structured data, operational intelligence, scalable distribution of analytics
  • 26.
    It’s a globaleconomy Country connectedness on LinkedIn
  • 27.
    Data deep dives Jobmigration after financial collapse
  • 28.
    How Often dopeople change jobs?
  • 29.
  • 30.
    If your nameis Chip, you are likely in sales!
  • 31.
  • 32.
  • 33.
    What next? • Selfservice analytics • Metadata framework • Integrate reporting solutions • Go Mobile! • Scalability and Data Quality
  • 34.
    Challenges • Data volumesand availability – Billion+ rows every day – Users in Global locations need data • Multiple platforms – Agile development – Data Integration  Data Quality – User input data – Data standardization
  • 35.
    Key Learnings  SelfService – Making data accessible to key stakeholders in a timely manner creates tremendous value. – Viz is more important than we think • Measuring your future investments – Performance is not the only measure – Company fundamentals matter • As an Data team, be in control of your destiny – Identify what to measure and lead by metrics – Become the Think-tank
  • 36.
    Web 3.0 –It’s all about data!! LinkedIn Confidential ©2013 All Rights Reserved 36
  • 37.
  • 38.
    It is allabout the people!
  • 39.
    LinkedIn Confidential ©2013All Rights Reserved 39 Thank You!