Lies, damned lies and the data scientist  2011 strata summit

When it comes to big data insights, how do you know you’re asking the right questions? Hiring data scientists is a good start – we’re seeing their growth both on LinkedIn and at LinkedIn. But even data scientists are not immune from the myriad of hidden pitfalls that keep your key insights out of sight.

Drawing from a deceptively simple exercise that I’ve used to haze dozens of data scientists on their first day, I will discuss the good, the bad and the ugly lessons we’ve learned about asking the right questions, denominators and being a data skeptic.

  • Lies, Damned Lies and the Data Scientist. By: Monica Rogati – data scientist at LinkedIn. Data lies – but it lies because we let it. So let’s not let it. Let’s ask the right questions.
  • I’m going to talk about how to ask the right question by showing you a deceptively simple exercise that LinkedIn data scientists go through. The question is: what are the hottest industries this year, according to the LinkedIn data? There’s one small detail I’m not specifying – the definition of “hot”. That definition plays a major part in asking the right questions.
  • So let’s take a look at the data. On LinkedIn, we have over 120M people, their industry, and the year they joined.
  • … so the first attempt at defining “hot” might be – let’s look at the YOY growth of an industry & look at the top 3. That idea is not so hot – at best, it’s only an indicator of LinkedIn’s penetration in an industry; at worst, it’s actually a contrarian indicator because it shows people who might want to transition OUT of that industry.
  • The next piece of data we can look at is the individual positions people list on their profiles – they have a start date and an industry, so you can see what industry people are flowing into in a given year. Much better.
  • You run the numbers… Wait a second!! Is consulting really the hottest industry? Hmm.. I think the data is trying to lie to us. We need to take into account churn & promotions – and we do that by looking at the NET inflow of people into an industry: people coming IN minus people going OUT.
  • There, that should be much better. The next external factor that might come into play is seasonality. If we’re doing this analysis in the summer, it looks like there are a lot fewer teachers and accountants, and a lot more summer interns, compared to last year! So ideally, we want to compare the same time period in both years to take out seasonal effects.
  • OK … done, let’s take another look. Are the Mining & Metals and Dairy industries really the hottest industries this year? Or are they just very small industries on LinkedIn, where it’s much easier to grow off of a small base? You can get around this by making separate categories for industries of different sizes, ignoring industries below a certain size, or otherwise accounting for that effect.
  • Now, we’re done: got seasonality, thresholding, net inflow – this has to be the right question. Well, almost. We assumed the data is clean. And it’s not.
  • For example, there are a lot of fake accounts that we’ve immediately closed, but they’re still there in the database. If you don’t check for that flag, you have this army of Darth Vaders boosting up the defense and space industry.
  • Including the tail of a distribution might not make sense – do we want people who have 200 positions listed on their profile? They might throw off your data.
  • We need to put the data under a microscope and understand what each flag, category and date means. OK, now we’ve accounted for external factors and taken out the noise. Are we ready to see some industry growth charts?!
  • Hm, ok, we plot the YOY growth and we get something that looks like this: a spaghetti chart that mostly shows industries moving in unison – an effect of the broader economic conditions (see that dip in 2001 and 2009). If we want to actually focus on differences between industries instead of what they have in common, we need to scale or normalize those numbers – for example, by dividing the net # of people coming into an industry by the TOTAL number of people who started jobs that year. This also has the nice property that it accounts for website growth.
  • OK, this MUST be it, right? The data stopped lying and we can actually see some real trends. Wild swings around 2000 for Internet and telecommunications, and there’s definitely something going on w/ real estate there. It still looks like spaghetti, it’s hard to understand and explain, and it’s not exactly telling a story. To tell the story, we need to make some hard decisions and pick only a couple of those lines, clean things up, and let that story shine.
  • Nice! I’ve picked 3 industries – when the line is above zero, that industry is growing; below 0, it’s shrinking. So the Internet is taking off in ’94, booming in ’99, then there’s a huge dip in 2001. Real Estate is growing steadily, it’s picking up in 2002, and it’s sinking in 2008 – and so are financial services. This is all coming from aggregating data on people’s public LinkedIn profiles! This is the kind of story that gets people excited about the insights in the LinkedIn data – but it wouldn’t have been possible if we hadn’t asked the right questions.
  • So let’s have some fun with the method I’ve just described – let’s take a look at the growth of analytics and data science jobs over the past few years. Whoa! That rapid growth in the past 3 years is even more impressive when we realize that this is all properly normalized, not just the count of people with those titles on LinkedIn.
  • So next time you look at your data, don’t let it lie to you – account for external factors, take out the noise, and ask the right questions.
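
To make the exercise concrete, here is a minimal Python sketch of the question the notes above end up asking. It is an illustration under stated assumptions, not LinkedIn's actual method or schema: the `Position` record, the `is_fake` flag, the `max_positions` and `min_size` cutoffs, and the toy data are all hypothetical. "Growth" is taken here as the year-over-year difference in normalized net inflow, which is one possible reading of the slides.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Tuple

YearMonth = Tuple[int, int]  # (year, month)


@dataclass(frozen=True)
class Position:
    """One job listed on a profile (hypothetical schema)."""
    member_id: int
    industry: str
    start: YearMonth
    end: Optional[YearMonth] = None  # None means the job is current
    is_fake: bool = False            # hypothetical "closed fake account" flag


def clean(positions, max_positions=50):
    """Dirty data: drop flagged accounts and the extreme tail of job counts."""
    real = [p for p in positions if not p.is_fake]
    per_member = Counter(p.member_id for p in real)
    return [p for p in real if per_member[p.member_id] <= max_positions]


def normalized_net_inflow(positions, year, through_month=12, min_size=1):
    """Net job starters per industry in Jan..through_month of `year`,
    divided by ALL job starts in that same window."""
    def in_window(d):
        return d is not None and d[0] == year and d[1] <= through_month

    starts = Counter(p.industry for p in positions if in_window(p.start))
    ends = Counter(p.industry for p in positions if in_window(p.end))
    members = {}
    for p in positions:
        members.setdefault(p.industry, set()).add(p.member_id)

    total_starts = sum(starts.values())
    if total_starts == 0:
        return {}
    # Thresholding: skip industries with fewer than min_size distinct members.
    return {
        ind: (starts[ind] - ends[ind]) / total_starts
        for ind, ids in members.items()
        if len(ids) >= min_size
    }


def hotness(positions, year, through_month=12, min_size=1, max_positions=50):
    """Same part of the year, this year vs. last year (assumed: a difference)."""
    data = clean(positions, max_positions)
    now = normalized_net_inflow(data, year, through_month, min_size)
    before = normalized_net_inflow(data, year - 1, through_month, min_size)
    return {ind: now[ind] - before.get(ind, 0.0) for ind in now}


# Made-up toy data, only to exercise the functions.
toy = [
    Position(1, "Internet", (2010, 3)),
    Position(2, "Internet", (2011, 2)),
    Position(3, "Internet", (2011, 4)),
    Position(4, "Real Estate", (2009, 1), end=(2011, 3)),
    Position(4, "Internet", (2011, 3)),
    Position(5, "Real Estate", (2010, 5)),
    Position(6, "Defense & Space", (2011, 1), is_fake=True),
    Position(7, "Real Estate", (2011, 9)),  # outside a Jan-Jun window
]
scores = hotness(toy, 2011, through_month=6)
for industry, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(industry, round(score, 3))
```

Counting a job end as an outflow from its industry means a promotion within the same industry nets out to zero, which is the point of looking at net inflow. Dividing by all job starts in the window removes the overall site growth and the shared economic swings, so what is left is the difference between industries.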

Presentation Transcript

  • @mrogati
  • hottest industries
    The Mission:
  • +
    date
    joined
    LinkedIn
    The data
  • hottest industries
    Hotness (X) =
    Year-over-year growth of
    people in industry X
    on LinkedIn
    The Question
  • The data
  • hottest industries
    Hotness (X) =
    Year-over-year growth of
    job starters
    in industry X
    on LinkedIn
    The Question
  • Externa-lies
  • hottest industries
    Hotness (X) =
    part year-over-part year growth of
    net job starters
    in a big enough industry X
    on LinkedIn
    The Question
  • Dirty data, dirty lies
  • # profiles
    # jobs on LinkedIn profile *
    Dirty data, dirty lies
    * hypothetical data
  • Check
    flags,
    categories,
    dates,

    Dirty data, dirty lies
  • Norma-lies
  • Hotness (X) =
    part year-over-part year growth of
    normalized net job starters,
    minus noise,
    in a big enough industry X
    on LinkedIn
    hottest industries
    The Question
  • Norma-lies
  • Internet
    Real Estate
    Financial Services
    Truth by omission
  • … and the data scientist
  • @mrogati