Reading the Tea Leaves:
Big Data at LinkedIn

Alexis Baird
Product Manager
LinkedIn
Recruiting Solutions
What is LinkedIn?

§  LinkedIn’s mission: “Connect the world’s professionals to
    make them more productive and successful”
§  The site officially launched on May 5, 2003
§  Now has >187 million members worldwide
§  LinkedIn has >3,000 employees in offices all around the
    world
§  Headquartered in Mountain View, CA
§  Three different lines of revenue:
   –  Subscriptions
   –  Talent Solutions
   –  Marketing Solutions


Who am I?




The Age of Big Data




Big Data at LinkedIn

§  187+ million members from >200 countries
§  Each month, 52 million members come to the site,
    generating ~2 billion page views by:
  –  Performing searches
  –  Connecting with other members
  –  Editing their profile
  –  Sharing, commenting on, or liking news articles
  –  Participating in group discussions
  –  And much more…



Big Data Challenges

§  Storage and processing constraints




§  Noisy signal
   –  Variation
   –  People are not always rational or consistent




Data Messiness
§  Job titles:
    §  “programmer”
    §  “software developer”
    §  “engineer”
    §  “coding ninja”
§  Companies:
    §  “Microsoft”
    §  “MSFT”
    §  “Bing”
    §  “Microsoft/Bing”
    §  “Microsoft-Mountain View”
§  Schools:
    §  “Connecticut College”
    §  “Conn College”
    §  “Conn”
    §  “CC”
    §  “Conn College (NOT Uconn)”
Data Standardization
§  Take an input (usually a user-entered string) and turn it
    into a meaningful abstract id

      “Microsoft”
      “MSFT”
      “Bing”                       =>  Company_id = 1035
      “Microsoft/Bing”                 (“Microsoft Corporation”)
      “Microsoft-Mountain View”
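As a rough illustration of the string-to-id idea above, here is a minimal Python sketch. The alias table, the function names, and the normalization rules are assumptions made for this example; only the pairing of id 1035 with “Microsoft Corporation” comes from the slide, and a production system would not rely on a hand-written dictionary.

```python
import re

# Hypothetical alias table. Only the id 1035 / "Microsoft Corporation"
# pairing is taken from the slide; everything else is illustrative.
COMPANY_NAMES = {1035: "Microsoft Corporation"}
ALIASES = {
    "microsoft": 1035,
    "msft": 1035,
    "bing": 1035,
    "microsoft/bing": 1035,
    "microsoft-mountain view": 1035,
}


def normalize(raw: str) -> str:
    """Lowercase, trim, and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", raw.strip().lower())


def standardize_company(raw: str):
    """Return the company id for a user-entered string, or None if unknown."""
    return ALIASES.get(normalize(raw))


if __name__ == "__main__":
    for entry in ["Microsoft", "  MSFT ", "Microsoft-Mountain  View", "Contoso"]:
        company_id = standardize_company(entry)
        print(repr(entry), "=>", company_id, COMPANY_NAMES.get(company_id))
```

Unknown strings return None rather than a guess, which leaves room for a fallback step such as fuzzy matching or manual review.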
Why is this important?




Search




Structured data > Unstructured data




                       P(“linkedin” = company_id 1337) = .87
                       P(“ceo” = title_id 238) = .92
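One way to read the probabilities above: each search term is tagged with the structured entity it most likely refers to, and confident tags become structured filters. The Python sketch below assumes a hypothetical lookup table (TAG_MODEL), a hypothetical function name, and an arbitrary threshold; the two table entries simply mirror the slide's example.

```python
# Hypothetical tagger output: term -> (entity type, id, probability).
# The two entries mirror the slide's example; a real system would learn them.
TAG_MODEL = {
    "linkedin": ("company_id", 1337, 0.87),
    "ceo": ("title_id", 238, 0.92),
}


def tag_query(query: str, threshold: float = 0.5):
    """Split a query into structured filters and leftover free-text keywords."""
    filters, keywords = [], []
    for term in query.lower().split():
        tag = TAG_MODEL.get(term)
        if tag is not None and tag[2] >= threshold:
            filters.append(tag[:2])  # keep (entity type, id), drop the probability
        else:
            keywords.append(term)    # unknown or low-confidence term stays as text
    return filters, keywords


if __name__ == "__main__":
    print(tag_query("LinkedIn CEO"))
    print(tag_query("linkedin ceo", threshold=0.9))
```

Raising the threshold trades recall for precision: with a threshold of 0.9, “linkedin” falls back to a plain keyword while “ceo” is still treated as a title.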




Recommendations




Recommendation products at LinkedIn
§  Similar Profiles
§  Connections
§  Network updates
§  Events You May Be Interested In
§  News
LinkedIn’s recommender ecosystem
Recommendations drive:
§  > 50% of connections
§  > 50% of job applications
§  > 50% of group joins
Jobs You Might Be Interested In




How LinkedIn matches people to jobs
[Diagram: job-matching pipeline]
§  Job fields: title, industry, geo, description, company, functional area, …
§  Candidate fields (from the filtered user base):
   –  General: expertise, specialties, education, headline, geo, experience
   –  Current Position: title, summary, tenure length, industry, functional area, …
   –  Derived
§  Corpus Stats: transition probabilities, connectivity, yrs of experience to reach title,
    education needed for this title, …
§  Matching:
   –  Binary: exact matches (geo, industry, …)
   –  Soft: transition probabilities, similarity, …
   –  Text
§  Example feature values:
   –  Similarity (candidate expertise, job description): 0.56
   –  Similarity (candidate specialties, job description): 0.2
   –  Transition probability (candidate industry, job industry): 0.43
   –  Title Similarity: 0.8
   –  Similarity (headline, title): 0.7
   –  …
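The matching flow on this slide can be sketched in Python as a binary filter followed by soft text-similarity features. This is only an illustration under stated assumptions: bag-of-words cosine similarity stands in for whatever similarity measure is actually used, the field names are hypothetical, and the unweighted mean used to combine features is an assumption, since the slide does not say how features are combined.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts for a free-text field."""
    return Counter(re.findall(r"[a-z0-9+#]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def passes_binary_filter(candidate, job, fields=("geo", "industry")):
    """Binary matching: every listed field must match exactly."""
    return all(candidate.get(f) == job.get(f) for f in fields)


def soft_features(candidate, job):
    """Soft matching: text similarities between candidate and job fields."""
    description = bag_of_words(job["description"])
    return {
        "expertise_vs_description": cosine(bag_of_words(candidate["expertise"]), description),
        "specialties_vs_description": cosine(bag_of_words(candidate["specialties"]), description),
        "headline_vs_title": cosine(bag_of_words(candidate["headline"]), bag_of_words(job["title"])),
    }


def match_score(candidate, job):
    """Assumed combination rule: 0.0 if filtered out, else the mean of the soft features."""
    if not passes_binary_filter(candidate, job):
        return 0.0
    features = soft_features(candidate, job)
    return sum(features.values()) / len(features)


if __name__ == "__main__":
    job = {"title": "software engineer", "description": "build search systems in java",
           "geo": "sf", "industry": "internet"}
    candidate = {"headline": "software engineer", "expertise": "java search",
                 "specialties": "cooking", "geo": "sf", "industry": "internet"}
    print(soft_features(candidate, job))
    print(match_score(candidate, job))
```

Corpus-level features from the slide, such as transition probabilities between industries, would need aggregate statistics and are left out of this sketch.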
Data Standardization: Occupations

§  How do we know a “senior software developer” and a
    “software developer” are the same occupation?
   –  Strip a special set of words known to indicate seniority
§  How do we know a “software developer” and a “software
    engineer” are the same occupation?
   –  Term similarity
§  How do we know a “programmer” and a “software
    developer” are the same occupation but a “programmer”
    and a “program director” are not?
   –  Need something more complicated
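The first two techniques above can be sketched in a few lines of Python. The seniority word list and the choice of Jaccard word overlap as the “term similarity” are assumptions made for illustration, not the method actually used.

```python
# Assumed list of seniority markers; the real list is not given on the slide.
SENIORITY_WORDS = {"senior", "sr", "junior", "jr", "lead", "principal", "staff"}


def strip_seniority(title):
    """Lowercase a title and drop words that only indicate seniority."""
    tokens = title.lower().replace(".", " ").split()
    return " ".join(t for t in tokens if t not in SENIORITY_WORDS)


def term_similarity(a, b):
    """Jaccard overlap of the word sets of two titles, after stripping seniority."""
    words_a = set(strip_seniority(a).split())
    words_b = set(strip_seniority(b).split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


if __name__ == "__main__":
    print(strip_seniority("Senior Software Developer"))
    print(term_similarity("software developer", "software engineer"))
    print(term_similarity("programmer", "software developer"))
    print(term_similarity("programmer", "program director"))
```

Word overlap links “software developer” to “software engineer”, but it scores “programmer” against “software developer” and against “program director” identically (no shared words), so it cannot tell the third case apart. That is why something more complicated is needed.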




Data Standardization: Occupations

1.  Rule-based string cleanup:
   –  ~2 million different titles => 24,000 different “cleaned” titles
   –  E.g., “Sr software dev” => “senior software developer”
2.  Create “virtual profiles” for each title using various
    extracted and normalized profile features (e.g., skills,
    degree, field of study, summary, job description, honors)
3.  Cluster similar titles
4.  Remove uninformative titles that are spread across too many
    different topics
5.  Apply hand QA to tune and name the clusters
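Steps 1 to 3 above can be sketched in Python as follows. This is a simplified illustration under several assumptions: the abbreviation table is invented, virtual profiles are built from skills only (the slide lists more features), and a greedy single-pass clustering with a cosine threshold stands in for whatever clustering algorithm was actually used. Steps 4 and 5 are not covered.

```python
import math
import re
from collections import Counter, defaultdict

# Assumed abbreviation rules for step 1; the real rule set is not given.
ABBREVIATIONS = {"sr": "senior", "jr": "junior", "dev": "developer", "eng": "engineer"}


def clean_title(raw):
    """Step 1: lowercase, drop punctuation, expand known abbreviations."""
    tokens = re.findall(r"[a-z]+", raw.lower())
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)


def virtual_profiles(members):
    """Step 2: pool the skills of every member holding a given cleaned title."""
    profiles = defaultdict(Counter)
    for raw_title, skills in members:
        profiles[clean_title(raw_title)].update(skills)
    return profiles


def cosine(a, b):
    """Cosine similarity between two skill-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_titles(profiles, threshold=0.5):
    """Step 3: a title joins the first cluster whose seed profile is similar
    enough; otherwise it starts a new cluster."""
    clusters = []  # list of (seed profile, [titles])
    for title, profile in profiles.items():
        for seed, titles in clusters:
            if cosine(seed, profile) >= threshold:
                titles.append(title)
                break
        else:
            clusters.append((profile, [title]))
    return [titles for _, titles in clusters]


if __name__ == "__main__":
    members = [  # hypothetical (raw title, skills) pairs
        ("Sr software dev", ["java", "python"]),
        ("Programmer", ["python", "java", "sql"]),
        ("Program Director", ["budgeting", "fundraising"]),
    ]
    print(cluster_titles(virtual_profiles(members)))
```

Because titles are compared through the profiles of the people who hold them rather than through the title strings, “programmer” lands with “senior software developer” while “program director” stays separate.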


Lessons learned

§  Know your machine learning!
§  Know your success metric!
§  Need to allow for ambiguity within a given title
       §  “Head of production”
       §  DDS
§  Some titles are not standardizable:




Takeaways

§  The more information you give, the better your
    standardization will be
§  Why do you want LinkedIn to do a good job standardizing the
    data on your profile?
   –  Better recommendations:
       §    News
       §    Jobs
       §    Groups
       §    Connections
       §    Etc.
   –  Recruiters can find you more easily
   –  Potential connections can find you



Thank You!

We’re Hiring!

§  175M+
§  2/sec
§  62% non U.S.
§  25th most visited website worldwide (Comscore 6-12)
§  >2M company pages
§  85% of Fortune 500 companies use LinkedIn to hire

[Chart: LinkedIn Members (Millions), 2004 to 2011; values shown: 2, 4, 8, 17, 32, 55, 90]

Learn more at http://data.linkedin.com/

 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 

Big Data and Data Standardization at LinkedIn

  • 1. Reading the Tea Leaves: Big Data at LinkedIn
      Alexis Baird, Product Manager, LinkedIn Recruiting Solutions
  • 2. What is LinkedIn?
      §  LinkedIn’s mission: “Connect the world’s professionals to make them more productive and successful”
      §  The site officially launched on May 5, 2003
      §  Now has >187 million members worldwide
      §  LinkedIn has >3,000 employees in offices all around the world
      §  Headquartered in Mountain View, CA
      §  Three different lines of revenue:
         –  Subscriptions
         –  Talent Solutions
         –  Marketing Solutions
  • 4. The Age of Big Data
  • 5. Big Data at LinkedIn
      §  187+ million members from >200 countries
      §  Each month, 52 million members come to the site, generating ~2 billion page views:
         –  Performing searches
         –  Connecting with other members
         –  Editing their profile
         –  Sharing, commenting on, or liking news articles
         –  Participating in group discussions
         –  And much more…
  • 6. Big Data Challenges
      §  Storage and processing constraints
      §  Noisy signal
         –  Variation
         –  People are not always rational or consistent
  • 7. Data Messiness
      §  Job titles:
         §  “programmer”
         §  “software developer”
         §  “engineer”
         §  “coding ninja”
      §  Schools:
         §  “Connecticut College”
         §  “Conn College”
         §  “Conn”
         §  “CC”
         §  “Conn College (NOT Uconn)”
      §  Companies:
         §  “Microsoft”
         §  “MSFT”
         §  “Bing”
         §  “Microsoft/Bing”
         §  “Microsoft-Mountain View”
  • 8. Data Standardization
      §  Take an input (usually a user-entered string) and turn it into a meaningful abstract id
      “Microsoft”, “MSFT”, “Bing”, “Microsoft/Bing”, “Microsoft-Mountain View” => Company_id = 1035 (“Microsoft Corporation”)
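The mapping on slide 8 can be illustrated with a small lookup sketch. This is not LinkedIn's implementation: the alias table, the `normalize` helper, and the function name `standardize_company` are hypothetical, and only the id 1035 and the name “Microsoft Corporation” come from the slide.

```python
import re

# Hypothetical alias table. Only the id 1035 ("Microsoft Corporation")
# is taken from the slide; the table layout is an assumption.
COMPANY_ALIASES = {
    "microsoft": 1035,
    "msft": 1035,
    "bing": 1035,
    "microsoft/bing": 1035,
    "microsoft-mountain view": 1035,
}
CANONICAL_NAMES = {1035: "Microsoft Corporation"}


def normalize(raw: str) -> str:
    """Lowercase, trim, and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", raw.strip().lower())


def standardize_company(raw: str):
    """Return the company id for a user-entered string, or None if unknown."""
    return COMPANY_ALIASES.get(normalize(raw))


company_id = standardize_company("  MSFT ")
print(company_id, CANONICAL_NAMES.get(company_id))
```

An exact-match table like this only covers variants someone has already listed, which is why the later slides move on to similarity and clustering.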
  • 9. Why is this important?
  • 10. Search
  • 11. Structured data > Unstructured data
      P(“linkedin” = company_id 1337) = .87
      P(“ceo” = title_id 238) = .92
  • 13. Recommendation products at LinkedIn
      Similar Profiles; Connections; Network updates; Events You May Be Interested In; News
  • 14. LinkedIn’s recommender ecosystem
      Recommendations drive:
         –  >50% of connections
         –  >50% of job applications
         –  >50% of group joins
  • 15. Jobs You Might Be Interested In
  • 16. How LinkedIn matches people to jobs
      [Diagram: job fields (title, industry, geo, description, company, functional area, …) are compared with candidate fields (General: expertise, specialties, education, headline, geo, experience, …; Current Position: title, summary, tenure length, industry, functional area, …). Corpus stats over the user base: transition probabilities, connectivity, yrs of experience to reach title, education needed for this title, … Matching is binary (exact matches: geo, industry, …) or soft (transition probabilities, similarity, …). Example scores shown: Similarity (candidate expertise, job description) 0.56; Similarity (candidate specialties, job description) 0.2; Transition probability (candidate industry, job industry) 0.43; Title Similarity 0.8; Similarity (headline, title) 0.7.]
  • 22. Data Standardization: Occupations
      §  How do we know a “senior software developer” and a “software developer” are the same occupation?
         –  Strip a special set of words known to indicate seniority
      §  How do we know a “software developer” and a “software engineer” are the same occupation?
         –  Term similarity
      §  How do we know a “programmer” and a “software developer” are the same occupation but a “programmer” and a “program director” are not?
         –  Need something more complicated
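The first two techniques on this slide can be sketched in a few lines. The seniority word list is hypothetical, and token-level Jaccard overlap is one possible stand-in for "term similarity"; the slides do not say which measure LinkedIn uses.

```python
import re

# Hypothetical list of seniority markers; the slides do not give the real set.
SENIORITY_WORDS = {"senior", "sr", "junior", "jr", "lead", "principal"}


def strip_seniority(title: str) -> str:
    """Lowercase the title and drop tokens that only signal seniority."""
    tokens = re.findall(r"[a-z]+", title.lower())
    return " ".join(t for t in tokens if t not in SENIORITY_WORDS)


def term_similarity(a: str, b: str) -> float:
    """Jaccard overlap of the word sets of two titles (an assumed measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


print(strip_seniority("Senior Software Developer"))
print(term_similarity("software developer", "software engineer"))
print(term_similarity("programmer", "software developer"))
```

Under this measure “software developer” and “software engineer” share one of three distinct words, while “programmer” shares no words with either “software developer” or “program director”. Word overlap alone therefore cannot tell those last two pairs apart, which is the slide's reason for needing something more complicated.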
  • 23. Data Standardization: Occupations
      1.  Rule-based string clean up:
         –  ~2 million different titles => 24,000 different “cleaned” titles
         –  E.g. “Sr software dev” => “senior software developer”
      2.  Create “virtual profiles” for each title using various extracted and normalized profile features (e.g. skills, degree, field of study, summary, job description, honors)
      3.  Cluster similar titles
      4.  Get rid of uninformative titles spread across too many different topics
      5.  Apply hand QA to tune and name the clusters
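Steps 1 to 3 of this pipeline can be sketched on toy data as follows. Everything here is an assumption made for illustration: the abbreviation table, the bag-of-features "virtual profile", the cosine measure, the greedy clustering rule, the 0.5 threshold, and the sample members. Steps 4 and 5 (dropping uninformative titles and hand QA) are not shown.

```python
import math
from collections import Counter

# Hypothetical abbreviation rules for step 1.
ABBREVIATIONS = {"sr": "senior", "jr": "junior", "dev": "developer", "eng": "engineer"}


def clean_title(raw):
    """Step 1: rule-based clean up (lowercase, expand known abbreviations)."""
    tokens = raw.lower().replace(".", " ").split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)


def build_virtual_profiles(members):
    """Step 2: pool the features of every member holding a cleaned title."""
    profiles = {}
    for raw_title, features in members:
        profiles.setdefault(clean_title(raw_title), Counter()).update(features)
    return profiles


def cosine(a, b):
    """Cosine similarity between two feature-count profiles."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_titles(profiles, threshold=0.5):
    """Step 3: greedy clustering; a title joins the first cluster whose
    seed title has a similar enough profile, else it starts a new cluster."""
    clusters = []
    for title in sorted(profiles):
        for cluster in clusters:
            if cosine(profiles[title], profiles[cluster[0]]) >= threshold:
                cluster.append(title)
                break
        else:
            clusters.append([title])
    return clusters


# Toy members: (raw title, profile features). Entirely made up.
members = [
    ("Programmer", ["python", "java", "algorithms"]),
    ("Sr software dev", ["java", "python", "git"]),
    ("Software Developer", ["java", "algorithms", "git"]),
    ("Program Director", ["budgeting", "fundraising", "staff management"]),
]
clusters = cluster_titles(build_virtual_profiles(members))
print(clusters)
```

On this toy data “programmer” lands in the same cluster as the software developer titles because their profiles share skills, while “program director” stays separate even though its title string looks similar.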
  • 25. Lessons learned
      §  Know your machine learning!
      §  Know your success metric!
      §  Need to allow for ambiguity within a given title
         §  “Head of production”
         §  DDS
      §  Some titles are not standardizable:
  • 26. Takeaways
      §  The more information you give, the better your standardization will be
      §  Why do you want LinkedIn to do a good job standardizing the data on your profile?
         –  Better recommendations:
            §  News
            §  Jobs
            §  Groups
            §  Connections
            §  Etc.
         –  Recruiters can find you more easily
         –  Potential connections can find you
  • 27. Thank You! We’re Hiring!
      [Infographic: LinkedIn Members (Millions), 2004–2011; 175M+; 2/sec; 62% non-U.S.; 25th most visited website worldwide (Comscore 6-12); >2M company pages; 85% of Fortune 500 companies use LinkedIn to hire]
      Learn more at http://data.linkedin.com/