Data Science for Higher Ed
Gloria Lau
Manager, Data Science @ LinkedIn
LinkedIn data.
For students*.

*prospective students, current students and recent graduates
WHY?
We have career outcome data to
derive better insights about higher
education
Common questions from user studies
Prospective students:
I want to be a pediatrician. Where should I go to school?
I don’t know what I want but I am an A student. So?
Current students:
Show me the internship / job opportunities.
Should I double / change major?
Recent graduates:
Show me the job opportunities.
Should I consider further education?
The Answer for the type A’s
Show me the career outcome data per school / field of study / degree
The Answer for the exploratory kind
Show me the career outcome data in a form that allows for
serendipitous discoveries
→ build me some data products to help me draw insights
from aggregate data
→ build me some data products that are delightful
OK! Let’s start building some data
products for students!
type A’s and non type A’s, we have answers for you
Invest in Plumbing
Before your faucets
Data Science for Higher Ed
A case study
From plumbing to fixture.
From standardization to delightful data products.
Standardization
• Standardization is about understanding our data, and
building the foundational layer that maps <school_name> to
<school_id> so that we can build data products on top
• Entity resolution
• Recognizable entities
• Typeahead
Entity Resolution
• User types in University of California, Berkeley → easy
• User types in UCB → hard / ambiguous
Entity Resolution
• Name feature: fuzzy match, edit distance, prefix match, etc.
• Profile feature: email, groups, etc.
• Network feature: connections, invitations, etc.
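
The name features listed above can be illustrated with a minimal sketch. It uses only the Python standard library; the function names and the returned feature keys are illustrative assumptions, not the features actually used at LinkedIn. Profile and network features are left out because the slides do not describe how they are computed.

```python
from difflib import SequenceMatcher


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def name_features(user_input: str, canonical: str) -> dict:
    """Name-only signals comparing a typed string with a canonical school name."""
    u = user_input.casefold().strip()
    c = canonical.casefold().strip()
    return {
        "edit_distance": edit_distance(u, c),
        "fuzzy_ratio": SequenceMatcher(None, u, c).ratio(),
        "prefix_match": c.startswith(u),
    }


CANONICAL = "University of California, Berkeley"
easy = name_features("University of California Berkeley", CANONICAL)
hard = name_features("UCB", CANONICAL)
```

Under these assumptions the near-exact spelling scores well on every name feature, while "UCB" scores poorly on all of them, which is why an alias or the profile and network features are needed to resolve it.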
Recognizable entities
• User types in University of California, Berkeley → easy
• User types in UCB → hard / ambiguous / alias not
understood
• User types in 東京大学 → harder / canonical name not
understood
Recognizable entities
• You don’t know what you don’t know
• Your standardization is only as good as your recognized
dataset
• LinkedIn data is very global
Recognizable entities
• IPEDS for US school data
• Crowdsourcing for non-US school + government data
  • internal and external with schema spec’ed out
• Alias – bootstrap from member data
Typeahead
• Plug the hole from the front(-end) as soon as you can
• Invest in a good typeahead early on so that you don’t even
need to standardize
• Helps standardization rate tremendously
• Make sure you have aliases and localized strings in your
typeahead
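
A minimal sketch of a typeahead that carries aliases and localized strings, as the last bullet recommends. The vocabulary, the school ids and the `suggest` function are hypothetical; a production typeahead would also rank results, which this sketch does not attempt.

```python
from bisect import bisect_left

# Hypothetical vocabulary: (display string, school_id). Aliases and localized
# names point at the same id as the canonical name.
VOCABULARY = [
    ("University of California, Berkeley", 1),
    ("UC Berkeley", 1),
    ("University of Tokyo", 2),
    ("東京大学", 2),
]

# Sort by case-folded key so that entries sharing a prefix are contiguous.
INDEX = sorted((name.casefold(), name, school_id) for name, school_id in VOCABULARY)
KEYS = [key for key, _, _ in INDEX]


def suggest(prefix: str, limit: int = 5) -> list:
    """Return up to `limit` (display string, school_id) pairs matching the prefix."""
    p = prefix.casefold()
    start = bisect_left(KEYS, p)  # first key that is >= the prefix
    results = []
    for key, name, school_id in INDEX[start:]:
        if not key.startswith(p) or len(results) == limit:
            break
        results.append((name, school_id))
    return results
```

Because every suggestion already carries a school id, whatever the member picks is standardized at entry time. For example, `suggest("uc")` and `suggest("東京")` both resolve to an id without any downstream entity resolution.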
Plumbing? checked
Onto building delightful* data products

*The level of delightfulness is directly correlated to
how good your standardization layer is.
Similar Schools
Serendipitous discoveries. Sideways browse.
Based on career outcome data + some more.
Similar Schools
Similar schools
• Aggregate profile per school based on alumni data
  • Industry, job title, job function, company, skills, etc.
• Feature engineering and balancing
• Dot-product of 2 aggregate profiles = school similarity
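
The aggregate-profile and dot-product steps above can be sketched as follows. The alumni records, field names and the L2 normalization are assumptions made for illustration; the slides do not say how features were engineered or balanced.

```python
from collections import Counter
from math import sqrt


def aggregate_profile(alumni: list) -> dict:
    """Count standardized attribute values across alumni, then L2-normalize."""
    counts = Counter()
    for person in alumni:
        for field, value in person.items():
            counts[(field, value)] += 1
    norm = sqrt(sum(v * v for v in counts.values()))
    return {k: v / norm for k, v in counts.items()} if norm else {}


def similarity(p: dict, q: dict) -> float:
    """Dot product of two sparse profiles (cosine similarity once normalized)."""
    if len(p) > len(q):
        p, q = q, p  # iterate over the smaller profile
    return sum(w * q.get(k, 0.0) for k, w in p.items())


# Hypothetical alumni records, already standardized.
PROFILE_A = aggregate_profile([
    {"industry": "software", "title": "engineer"},
    {"industry": "software", "title": "product manager"},
])
PROFILE_B = aggregate_profile([
    {"industry": "software", "title": "engineer"},
    {"industry": "finance", "title": "analyst"},
])
PROFILE_C = aggregate_profile([
    {"industry": "healthcare", "title": "nurse"},
])
```

With these toy profiles, school A is more similar to B than to C, because A and B share the software industry and the engineer title while A and C share nothing.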
Similar schools – issues

• Observation #1: similarity identified between tiny
specialized schools and big research institutions
• Observation #2: similarity identified between non-US
specialized schools and big US research institutions
What’s wrong?
Degree bucketization
Similar schools - issues
• Observation: no data
  • New community colleges and non-US
schools have very sparse data
• Solution: attribute-based similarity
  • From IPEDS and crowdsourced data

Kyoritsu Women's University
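
One way to picture attribute-based similarity for schools with sparse alumni data is a simple matching score over school-level attributes. The attribute names and values below are hypothetical stand-ins for what IPEDS or crowdsourced records might provide, and the scoring rule is an assumption, not the method used in the talk.

```python
def attribute_similarity(a: dict, b: dict) -> float:
    """Fraction of shared attribute keys on which two schools agree."""
    shared = [k for k in a if k in b]
    if not shared:
        return 0.0
    return sum(a[k] == b[k] for k in shared) / len(shared)


# Hypothetical school attribute records.
new_college = {"control": "public", "level": "2-year", "country": "US", "size": "small"}
established = {"control": "public", "level": "2-year", "country": "US", "size": "medium"}
research = {"control": "private", "level": "4-year", "country": "US", "size": "large"}
```

Here the new college matches the established two-year college on three of four attributes and the research university on only one, so it can be given sensible neighbors before it has any alumni data.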
Notable Alumni
Aspirations. Connecting the dots.
Notable Alumni
• Who’s notable?
  • Wikipedia match
    • School standardization
    • Name mapping
  • Success stories
Who’s notable – Wikipedia stories

…
Wikipedia stories
• Lightweight school standardization
• Name mapping
  • ✓ Name feature ✕ profile feature ✕ network feature
  • Even when you are notable, your name isn’t unique
• Crowdsourcing for evaluation
  • Profile from LinkedIn vs profile from Wikipedia
Crowdsourcing for evaluation
Are we done? Do we have notable
alumni for all schools?
Same issue as with similar schools – data sparseness
Who’s notable - Success stories
• Many schools don’t have a notable alumni section in Wikipedia
• Success stories based on LinkedIn data
• Features of success
  • CXOs at Fortune companies
  • Generalizes to high seniority at top companies
• But what does it mean to be
  • Senior
  • A top company
  • An alum
• They all depend on…
Standardization
• Degree standardization - alumni
• Company standardization
  • IBM vs international brotherhood of magicians
• Title & seniority standardization
  • founder of the gloria lau franchise vs founder of LinkedIn
  • VP in financial sector vs VP in software engineering industry
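
To make the dependency on standardization concrete, here is a minimal sketch of a "high seniority at a top company" check. Every id, bucket and threshold is hypothetical, including the assumption that a VP title in finance should need a higher bar than a VP title in software. The point is only that the check operates on standardized ids and buckets, never on raw strings.

```python
from dataclasses import dataclass

# Hypothetical standardized vocabularies; real ones would come from the
# company and title standardization layers.
TOP_COMPANY_IDS = {1001, 1002}
SENIORITY_RANK = {"entry": 0, "manager": 1, "director": 2, "vp": 3, "cxo": 4}

# Assumed for illustration: the minimum rank that counts as "senior" can
# differ by industry, so "vp" is not treated the same everywhere.
MIN_RANK_BY_INDUSTRY = {"finance": 4}
DEFAULT_MIN_RANK = 3


@dataclass
class Position:
    company_id: int   # standardized company id, not the raw company string
    seniority: str    # standardized seniority bucket
    industry: str     # standardized industry


def is_success_story(pos: Position) -> bool:
    """True when the position is senior enough at a company on the top list."""
    if pos.company_id not in TOP_COMPANY_IDS:
        return False
    rank = SENIORITY_RANK.get(pos.seniority)
    if rank is None:  # unrecognized seniority bucket
        return False
    return rank >= MIN_RANK_BY_INDUSTRY.get(pos.industry, DEFAULT_MIN_RANK)
```

If "IBM" were resolved to the wrong company id, or a self-styled founder were mapped to the "cxo" bucket, this check would silently produce the wrong success stories, which is the failure mode the examples above describe.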
Evaluation – I know it when I see it
INSIGHTS:
unique & standardized data to describe schools.
similar schools.
notable alumni.

to drive STUDENT DECISIONS

QCon SF 2013


Editor's Notes

  • #5 Students and recent graduates are the fastest growing segment at LinkedIn
  • #8 Princeton’s data
  • #18 Invest in getting a good dataset globally. Members grow into new markets.
  • #19 Go after government datasets if you can
  • #20 Be smart – building out your vocabulary could have unexpected effects on your typeahead performance
  • #33 Former NFL players turn realtors
  • #36 Yoga class vs MBA at Stanford – degree standardization