Editor's Notes
Feeding the Vizard: Finding stories in the data. Monica Rogati, Data Scientist at LinkedIn
There are plenty of stories in the LinkedIn data: there are 120 million profiles with career histories that, in aggregate, can tell you a lot about the world & how the labor market evolves.
What were the fastest growing titles in a given year? We can see the tech boom in 1999 and the bubble bursting as people go back to grad school. Today, we see the rise of social media and, of course, data scientists.
…and, finally, we can see fads in job titles going from “gurus” to “ninjas” to “rock stars”. So HOW do we find stories like these in the data?
You have to ask the right questions. I’m going to talk about how to ask the right question by showing you a deceptively simple exercise that LinkedIn data scientists go through. The question is: what are the hottest industries this year, according to the LinkedIn data? There’s one small detail I’m not specifying: the definition of “hot”. That definition plays a major part in asking the right questions.
So let’s take a look at the data. On LinkedIn, we have over 120M people, their industry, and the year they joined.
… so the first attempt at defining “hot” might be: let’s look at the YOY growth of an industry & look at the top 3. That idea is not so hot. At best, it’s only an indicator of LinkedIn’s penetration in an industry; at worst, it’s actually a contrarian indicator, because it shows people who might want to transition OUT of that industry.
The next piece of data we can look at is the individual positions people list on their profiles – they have a start date and an industry, so you can see what industry people are flowing into in a given year. Much better.
You run the numbers… Wait a second!! Is consulting really the hottest industry? Hmm.. I think the data is trying to lie to us. We need to take into account churn & promotions – and we do that by looking at the NET inflow of people into an industry: people coming IN minus people coming out.
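Here is a minimal sketch of that net-inflow idea. It assumes each job change has already been reduced to a (from_industry, to_industry) pair for the year in question; the function name and the data layout are made up for illustration and are not LinkedIn's actual pipeline.

```python
from collections import Counter

def net_inflow(moves):
    """Net inflow per industry: people coming in minus people going out.

    moves: iterable of (from_industry, to_industry) pairs for one year.
    from_industry is None for a first job (hypothetical convention).
    """
    net = Counter()
    for src, dst in moves:
        if src == dst:
            # A promotion or a job change inside the same industry
            # adds one and removes one, so it nets to zero.
            continue
        if dst is not None:
            net[dst] += 1
        if src is not None:
            net[src] -= 1
    return dict(net)

# Made-up example data
moves = [
    ("Consulting", "Consulting"),   # churn inside consulting
    ("Consulting", "Internet"),
    (None, "Internet"),             # first job
    ("Internet", "Consulting"),
    ("Consulting", "Real Estate"),
]
result = net_inflow(moves)
```

With raw inflow, consulting would get credit for every internal move; with net inflow, those moves cancel out.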
There, that should be much better. The next external factor that might come into play is seasonality. If we’re doing this analysis in the summer, it looks like there are a lot fewer teachers and accountants, and a lot more summer interns, compared to last year! So ideally, we want to compare the same time period in both years to take out seasonal effects.
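One way to sketch the same-period comparison, assuming we have monthly job-start counts for a single industry keyed by (year, month). The function name and data layout are assumptions for illustration.

```python
def same_period_growth(starts, year, months):
    """Growth of job starts vs. the SAME months of the previous year.

    starts: dict mapping (year, month) -> number of job starts.
    Returns None when there is no previous-year base to compare against.
    """
    current = sum(starts.get((year, m), 0) for m in months)
    previous = sum(starts.get((year - 1, m), 0) for m in months)
    if previous == 0:
        return None
    return (current - previous) / previous

# Made-up counts: January is a hiring spike, summer is quiet.
starts = {
    (2010, 6): 100, (2010, 7): 100,
    (2011, 1): 500,
    (2011, 6): 120, (2011, 7): 130,
}
summer = same_period_growth(starts, 2011, months=[6, 7])
```

Comparing June and July against last year's June and July keeps the January spike out of the picture.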
OK … done, let’s take another look. Are Mining & Metals and Dairy really the hottest industries this year? Or are they just very small industries on LinkedIn, and it’s much easier to grow off of a small base? You can get around this by making separate categories for industries of different sizes, ignoring industries below a certain size, or otherwise accounting for that effect.
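As a rough sketch of the "ignore industries below a certain size" option: the threshold of 1000, the function name, and the counts are all assumptions picked for illustration.

```python
def hottest(prev_counts, curr_counts, min_base=1000, top=3):
    """Rank industries by YOY growth, skipping those with a tiny base.

    prev_counts / curr_counts: dict industry -> headcount in each year.
    min_base: hypothetical cutoff; industries smaller than this last
    year are ignored because small bases grow too easily.
    """
    growth = {}
    for industry, prev in prev_counts.items():
        if prev == 0 or prev < min_base:
            continue
        curr = curr_counts.get(industry, 0)
        growth[industry] = (curr - prev) / prev
    return sorted(growth, key=growth.get, reverse=True)[:top]

# Made-up numbers
prev = {"Dairy": 10, "Internet": 10000, "Real Estate": 5000}
curr = {"Dairy": 40, "Internet": 12000, "Real Estate": 5500}

with_cutoff = hottest(prev, curr)                 # Dairy is ignored
without_cutoff = hottest(prev, curr, min_base=1)  # Dairy "wins" at +300%
```

A hard cutoff is the bluntest of the three options; bucketing industries by size is gentler but needs more charts.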
For example, there are a lot of fake accounts that we’ve immediately closed, but they’re still there in the database. If you don’t check for that flag, you have this army of Darth Vaders boosting up the defense and space industry.
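A tiny sketch of checking that flag before counting. The field names "is_closed" and "industry" are hypothetical; the real schema is not given here.

```python
from collections import Counter

def industry_counts(profiles):
    """Count profiles per industry, skipping closed (e.g. fake) accounts.

    profiles: iterable of dicts with hypothetical keys
    "industry" and "is_closed".
    """
    counts = Counter()
    for profile in profiles:
        if profile.get("is_closed"):
            continue  # closed accounts are still in the database
        counts[profile["industry"]] += 1
    return counts

# Made-up records
profiles = [
    {"name": "Darth Vader", "industry": "Defense & Space", "is_closed": True},
    {"name": "Darth Vader", "industry": "Defense & Space", "is_closed": True},
    {"name": "A. Engineer", "industry": "Defense & Space", "is_closed": False},
    {"name": "B. Teacher", "industry": "Education"},
]
counts = industry_counts(profiles)
```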
Hm, ok, we plot the YOY growth and we get something that looks like this: a spaghetti chart that mostly shows industries moving in unison, an effect of the broader economic conditions (see that dip in 2001 and 2009). If we want to actually focus on differences between industries instead of what they have in common, we need to scale or normalize those numbers. For example, we can divide the net number of people coming into an industry by the TOTAL number of people who started jobs that year. This also has the nice property that it accounts for website growth.
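A minimal sketch of that normalization, assuming we already have net inflow per industry per year and the total number of job starts per year. The names and numbers are illustrative only.

```python
def normalized_net_inflow(net_by_year, total_starts_by_year):
    """Divide each industry's net inflow by all job starts that year.

    net_by_year: {year: {industry: net inflow}}
    total_starts_by_year: {year: total number of job starts}
    Years with no recorded job starts are skipped.
    """
    normalized = {}
    for year, net in net_by_year.items():
        total = total_starts_by_year.get(year, 0)
        if total == 0:
            continue
        normalized[year] = {ind: n / total for ind, n in net.items()}
    return normalized

# Made-up data: the site is 10x bigger in 2010, but the mix is unchanged.
net = {
    2000: {"Internet": 50, "Real Estate": -10},
    2010: {"Internet": 500, "Real Estate": -100},
}
totals = {2000: 1000, 2010: 10000}
shares = normalized_net_inflow(net, totals)
```

In the raw numbers, 2010 looks ten times hotter than 2000. After dividing by total job starts, the two years come out identical, which is the point: site growth drops out and only the differences between industries remain.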
OK, this MUST be it, right? The data stopped lying and we can actually see some real trends. Wild swings around 2000 for Internet and telecommunications, and there’s definitely something going on w/ real estate there. It still looks like spaghetti, it’s hard to understand and explain, and it’s not exactly telling a story. To tell the story, we need to make some hard decisions and pick only a couple of those lines, clean things up, and let that story shine.
Nice! I’ve picked 3 industries – when the line is above zero, that industry is growing; below 0, it’s shrinking. So the Internet is taking off in 94, booming in 99, then there’s a huge dip in 2001. Real Estate is growing steadily, it’s picking up in 2002, and it’s sinking in 2008 – and so are financial services. This is all coming from aggregating data on people’s public LinkedIn profiles! This is the kind of story that gets people excited about the insights in the LinkedIn data – but it wouldn’t have been possible, if we didn’t ask the right questions.
We have to aggregate across interesting dimensions, normalize the raw data, and dig. Let’s see a few examples of what this means.
Looking at job promotions by month and country reveals interesting patterns.
Normalizing raw data is essential -- there is always a denominator. Look for what is over-indexed, not just popular. To see the trend HERE, we use percentages.
Finally, ask why. For the promotions data, we took a look at when people were born. And here is our explanation for the trend: Millennials don’t care what month it is, they want their promotions.
Sometimes you have to rephrase the question you’re being asked. This infographic is 100% correct, but Atlanta and Chicago don’t strike me as wedding destinations. When people ask for “most popular”, give them “over-represented”: normalize (in this case, normalize by city size and total incoming flights).
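One common way to put a number on “over-represented” is to divide an item's share within the group by its share in the overall population. The sketch below assumes that definition; the city counts are invented for illustration and are not taken from the infographic.

```python
def over_representation(group_counts, population_counts):
    """Ratio of share-in-group to share-in-population for each key.

    > 1 means over-represented, < 1 means under-represented.
    group_counts: e.g. wedding trips per destination city.
    population_counts: e.g. ALL incoming trips per destination city.
    """
    group_total = sum(group_counts.values())
    pop_total = sum(population_counts.values())
    if group_total == 0 or pop_total == 0:
        return {}
    index = {}
    for key, pop in population_counts.items():
        if pop == 0:
            continue
        group_share = group_counts.get(key, 0) / group_total
        pop_share = pop / pop_total
        index[key] = group_share / pop_share
    return index

# Invented counts: the big hubs get lots of wedding trips in absolute
# terms, simply because they get lots of trips of every kind.
weddings = {"Atlanta": 300, "Chicago": 300, "Las Vegas": 400}
all_trips = {"Atlanta": 4000, "Chicago": 4000, "Las Vegas": 2000}
index = over_representation(weddings, all_trips)
```

Sorting by raw counts answers “most popular”; sorting by this ratio answers “over-represented”, which is usually the question people meant to ask.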
Here are two memorable images to explain, and help you remember, the concept of over-represented and under-represented. Babies are UNDER-represented in prison: there are fewer of them there relative to the overall population. What are a few animals over-represented on geeky t-shirts? Wolves, elephants and pigs.
Let’s take a look at what majors are over-represented among entrepreneurs, according to the LinkedIn data.
What are other over-represented schools among entrepreneurs, other than the business schools (in blue)? There are a few technical schools (in orange), and a few general schools (in green), some of which skew technical.
Which companies are over-represented in founders’ histories?
Let’s take another example. What first names are over-represented among CEOs? You might look at the list and say – well, those are just popular baby names 50 years ago. But that’s not the whole story – you have to dig.
Look at the length of the names – now that’s an interesting story! There’s Chip, Todd and Trey - the quintessential sales guys. CEOs are more diverse – but they still want to be your friend -- so they use nicknames.
The story itself needs to be interesting, to get people talking about it; it has to be accessible, so it’s easy for a lot of people to understand; and it has to be relatable – Howard Stern covered it because Howard was one of the names.
So obviously there are a lot of interesting stories in the data. We all love data & want more of it – I mean, that IS my license plate. But raw data isn’t enough – you have to DIG to find the real story.