Data Science for Hire Ed

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1dGIrxX.

Gloria Lau describes some of the products built for the higher education sector, the data standardization process, determining school similarity and identifying notable alumni. Filmed at qconsf.com.

Gloria Lau leads the core data products team at LinkedIn. Her team focuses on understanding and engaging members to construct the best professional identity on the web, including education and occupation, and builds interesting data products on top of said data. Previously, she was a research scientist at FindLaw, a Thomson Reuters business. She has an MS and PhD from Stanford, and a BS from UCLA.




Data Science for Hire Ed Presentation Transcript

  • 1. Data Science for Higher Ed Gloria Lau Manager, Data Science @ LinkedIn
  • 2. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/data-analysis-hiring InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month
  • 3. Presented at QCon San Francisco www.qconsf.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 4. LinkedIn data. For students*. *prospective students, current students and recent graduates
  • 5. WHY? We have career outcome data to derive better insights about higher education
  • 6. Common questions from user studies Prospective students: I want to be a pediatrician. Where should I go to school? I don’t know what I want but I am an A student. So? Current students: Show me the internship / job opportunities. Should I double / change major? Recent graduates: Show me the job opportunities. Should I consider further education?
  • 7. The Answer for the type A’s Show me the career outcome data per school / field of study / degree
  • 8. The Answer for the exploratory kind Show me the career outcome data in a form that allows for serendipitous discoveries → build me some data products to help me draw insights from aggregate data → build me some data products that are delightful
  • 9. OK! Let’s start building some data products for students! type A’s and non type A’s, we have answers for you
  • 10. Invest in Plumbing
  • 11. Before your faucets
  • 12. Data Science for Higher Ed A case study From plumbing to fixture. From standardization to delightful data products.
  • 13. Standardization • Standardization is about understanding our data, and building the foundational layer that maps <school_name> to <school_id> so that we can build data products on top • Entity resolution • Recognizable entities • Typeahead
  • 14. Entity Resolution • User types in University of California, Berkeley → easy • User types in UCB → hard / ambiguous
  • 15. Entity Resolution • Name feature: fuzzy match, edit distance, prefix match, etc • Profile feature: email, groups, etc • Network feature: connections, invitations, etc
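The slide names the kinds of name features but not how they are computed or combined. As a minimal sketch, assuming a plain Levenshtein distance and a case-insensitive prefix test (the function names and the returned feature keys are illustrative, not LinkedIn's), the name features for one typed string against one canonical school name could look like this:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def name_features(typed: str, canonical: str) -> dict:
    """Hypothetical name features for one (typed, canonical) pair."""
    t, c = typed.strip().lower(), canonical.strip().lower()
    dist = edit_distance(t, c)
    longest = max(len(t), len(c), 1)
    return {
        "exact": t == c,
        "prefix": len(t) > 0 and c.startswith(t),
        "edit_distance": dist,
        "similarity": 1.0 - dist / longest,  # 1.0 means identical strings
    }


canonical = "University of California, Berkeley"
typo = name_features("University of California, Berkely", canonical)
alias = name_features("UCB", canonical)
```

Here the misspelling scores well on the name features, while "UCB" scores poorly on all of them, which is why the slide also lists profile and network features for the ambiguous cases.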
  • 16. Recognizable entities • User types in University of California, Berkeley → easy • User types in UCB → hard / ambiguous / alias not understood • User types in 東京大学 → harder / canonical name not understood
  • 17. Recognizable entities • You don’t know what you don’t know • Your standardization is only as good as your recognized dataset • LinkedIn data is very global
  • 18. Recognizable entities • IPEDS for US school data • Crowdsourcing for non-US school + government data (internal and external, with schema spec’ed out) • Alias – bootstrap from member data
  • 19. Typeahead • Plug the hole from the front(-end) as soon as you can • Invest in a good typeahead early on so that you don’t even need to standardize • Helps standardization rate tremendously • Make sure you have aliases and localized strings in your typeahead
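To make the last point concrete, here is a small sketch of a typeahead that indexes aliases and localized strings alongside the canonical name, so that every suggestion already carries a school id. The class, the integer ids, and the choice of a simple prefix dictionary are assumptions for illustration; the talk does not describe the actual implementation.

```python
from collections import defaultdict


class Typeahead:
    """Toy prefix index: every prefix of every known string maps to school ids."""

    def __init__(self):
        self._index = defaultdict(set)  # prefix -> set of school ids
        self._names = {}                # school id -> canonical name

    def add(self, school_id, canonical, aliases=()):
        self._names[school_id] = canonical
        for text in (canonical, *aliases):
            key = text.casefold()
            for end in range(1, len(key) + 1):
                self._index[key[:end]].add(school_id)

    def suggest(self, typed, limit=5):
        ids = sorted(self._index.get(typed.casefold(), ()))
        return [(i, self._names[i]) for i in ids[:limit]]


ta = Typeahead()
# The ids 1 and 2 are made up for this example.
ta.add(1, "University of California, Berkeley", aliases=["UCB", "UC Berkeley"])
ta.add(2, "University of Tokyo", aliases=["東京大学", "UTokyo"])

by_alias = ta.suggest("uc")        # matched through the aliases
by_localized = ta.suggest("東京")   # matched through the localized string
```

Because the user picks a suggestion that is already tied to an id, the free-text entity resolution step is skipped for that entry.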
  • 20. Plumbing? Checked. On to building delightful* data products. *The level of delightfulness is directly correlated to how good your standardization layer is.
  • 21. Similar Schools Serendipitous discoveries. Sideways browse. Based on career outcome data + some more.
  • 22. Similar Schools
  • 23. Similar schools • Aggregate profile per school based on alumni data • Industry, job title, job function, company, skills, etc • Feature engineering and balancing • Dot-product of 2 aggregate profiles = school similarity
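The steps on this slide can be sketched as follows. The feature strings and the L2 normalization are assumptions: the slide mentions "feature engineering and balancing" without detail, so normalization is used here only as one plausible way to keep large schools from dominating the dot product.

```python
import math
from collections import Counter


def aggregate_profile(alumni):
    """Sum feature counts over alumni; each alum is a list of 'type:value' strings."""
    profile = Counter()
    for features in alumni:
        profile.update(features)
    return profile


def normalize(profile):
    """Scale a profile to unit length so school size does not dominate."""
    norm = math.sqrt(sum(v * v for v in profile.values()))
    return {k: v / norm for k, v in profile.items()} if norm else {}


def similarity(p, q):
    """Dot product of two normalized aggregate profiles."""
    a, b = normalize(p), normalize(q)
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller profile
    return sum(w * b.get(k, 0.0) for k, w in a.items())


# Made-up alumni features for three hypothetical schools.
school_a = aggregate_profile([["industry:software", "title:engineer"],
                              ["industry:software", "title:manager"]])
school_b = aggregate_profile([["industry:software", "title:engineer"]])
school_c = aggregate_profile([["industry:healthcare", "title:nurse"]])

sim_ab = similarity(school_a, school_b)
sim_ac = similarity(school_a, school_c)
```

With these toy inputs, the two software-heavy schools come out similar and the healthcare school shares nothing with them.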
  • 24. Similar schools – issues • Observation #1: similarity identified between tiny specialized schools and big research institutions • Observation #2: similarity identified between non-US specialized schools and big US research institutions
  • 25. What’s wrong? Degree bucketization
  • 26. Similar schools - issues Kyoritsu Women's University • Observation: no data • New community colleges and non-US schools have very sparse data • Solution: attribute-based similarity • From IPEDS and crowdsourced data
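When a school has no alumni data, similarity has to fall back on attributes of the school itself. A minimal sketch, assuming each school is described by a small dictionary of categorical attributes (the attribute names below are illustrative and are not actual IPEDS field names), is to score the fraction of shared attributes on which two schools agree:

```python
def attribute_similarity(a: dict, b: dict) -> float:
    """Fraction of attributes known for both schools on which they agree."""
    shared = [k for k in a
              if k in b and a[k] is not None and b[k] is not None]
    if not shared:
        return 0.0
    return sum(a[k] == b[k] for k in shared) / len(shared)


# Hypothetical schools with made-up attribute values.
school_x = {"control": "private", "level": "4-year", "country": "JP"}
school_y = {"control": "private", "level": "4-year", "country": "US"}
school_z = {"control": "public", "level": "2-year"}

score_xy = attribute_similarity(school_x, school_y)
score_xz = attribute_similarity(school_x, school_z)
```

This needs no career outcome data at all, so it works for new or sparsely represented schools, at the cost of being much coarser than the alumni-based profile.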
  • 27. Notable Alumni Aspirations. Connecting the dots.
  • 28. Notable Alumni • Who’s notable? • Wikipedia match: school standardization, name mapping • Success stories
  • 29. Who’s notable – Wikipedia stories …
  • 30. Wikipedia stories • Lightweight school standardization: name feature ✓, profile feature ✕, network feature ✕ • Name mapping: even when you are notable, your name isn’t unique • Crowdsourcing for evaluation: profile from LinkedIn vs profile from Wikipedia
  • 31. Crowdsourcing for evaluation
  • 32. Are we done? Do we have notable alumni for all schools? Same issue as with similar schools: data sparseness
  • 33. Who’s notable - Success stories • Many schools don’t have a notable alumni section in Wikipedia • Success stories based on LinkedIn data • Features of success: CXO’s at Fortune companies; generalizes to high seniority at top companies • But what does it mean to be senior, a top company, an alum? • They all depend on…
  • 34. Standardization • Degree standardization - alumni • Company standardization: IBM vs international brotherhood of magicians • Title & seniority standardization: founder of the gloria lau franchise vs founder of LinkedIn; VP in financial sector vs VP in software engineering industry
  • 35. Evaluation – I know it when I see it
  • 36. INSIGHTS: unique & standardized data to describe schools. similar schools. notable alumni. to drive STUDENT DECISIONS