Jobs & Skills
Team Grant
MMA 865
What if you wanted to…
Plan your career
→ How your key skills are trending
Develop labour policy
→ Skill deficits by region or by industry
Train job-ready graduates
→ Add skills to programs and syllabi
(Syllabi is an awesome word)
2014 Aug 16 Team Grant for Queen's School of Business 2
Source
Extract
Store
Distill
Analyze
Answer Questions
from linkedin import linkedin
import json
import time
import pymongo

# "..." marks arguments elided on the slide
authentication = linkedin.LinkedInDeveloperAuthentication(...)
application = linkedin.LinkedInApplication(authentication)

client = pymongo.MongoClient()
db = client.jobengine

# Highest LinkedIn job id already stored
max_id = db.posting.find({'source': 'linkedin'}).sort(
    'raw_data.id', -1).limit(1)[0]['raw_data']['id']

while True:
    list_of_jobs = application.search_job(
        selectors=[{'jobs': ['id', 'posting-date', ...]}],
        params={'count': 100, 'sort': 'DD', ...})
    # Walk the results in reverse order, skipping postings already stored
    for job in reversed(list_of_jobs):
        if job['id'] <= max_id:
            continue
        max_id = job['id']
        location = job['locationDescription']
        raw_date = job['postingDate']
        posteddate = time.strftime("%d/%m/%Y", ...)
        skills = job['skillsAndExperience']
        db.posting.insert_one({"posted_date": posteddate,
                               "skills": skills, "city": location,
                               "source": 'linkedin', "raw_data": job})
    # Poll again in five minutes
    time.sleep(300)
from careerbuilder import CareerBuilder
import json
import time
import pymongo

cb = CareerBuilder(DEV_KEY)  # DEV_KEY is defined elsewhere
search = cb.job_search(HostSite='CA', PostedWithin='1')
list_of_jobs = search['ResponseJobSearch']['Results']['JobSearchResult']

client = pymongo.MongoClient()
db = client.jobengine

for job in list_of_jobs:
    location = job['Location']
    # PostedDate arrives as M/D/YYYY; store it as DD/MM/YYYY like the other sources
    posteddate = time.strftime("%d/%m/%Y",
                               time.strptime(job['PostedDate'], "%m/%d/%Y"))
    skills = job['Skills']['Skill']
    db.posting.insert_one({"posted_date": posteddate, "skills": skills,
                           "city": location, "source": 'careerbuilder',
                           "raw_data": job})
LinkedIn to CareerBuilder, Indeed
from indeed import IndeedClient
import json
import pymongo
import time

client = IndeedClient('123456')
params = {
    'l': "Anywhere",
    'co': "ca",
    'userip': "1.2.3.4",
    'useragent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2)"
}
search_response = client.search(**params)
list_of_jobs = search_response['results']

# Separate name for the MongoDB client so it does not shadow the Indeed client
mongo_client = pymongo.MongoClient()
db = mongo_client.jobengine

for job in list_of_jobs:
    location = job['city']
    posteddate = time.strftime("%d/%m/%Y",
                               time.strptime(job['date'], "%a, %d %b %Y %H:%M:%S GMT"))
    # No skills value is read from the Indeed result, so the field is stored empty
    db.posting.insert_one({"posted_date": posteddate, "skills": "",
                           "city": location, "source": 'indeed',
                           "raw_data": job})
Results from Canada
→ 60k results per week
→ 300 MB per week
→ 3+ data structures
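The weekly figures above imply a rough per-posting size and a yearly storage need. The arithmetic below is only a back-of-the-envelope sketch: it treats the rounded slide numbers as exact and assumes 1 MB = 10**6 bytes.

```python
# Rounded figures quoted on the slide
postings_per_week = 60000
bytes_per_week = 300 * 10**6  # assumption: 1 MB = 10**6 bytes

# Average size of one stored posting, in bytes
avg_posting_bytes = bytes_per_week / postings_per_week

# Raw volume accumulated over 52 weeks, in GB (10**9 bytes)
yearly_gb = bytes_per_week * 52 / 10**9

print(avg_posting_bytes)  # 5000.0
print(yearly_gb)          # 15.6
```

At about 5 KB per posting, a year of Canadian postings stays in the low tens of gigabytes.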
Sample result
"formattedRelativeTime": "5 days ago",
"city": "Lillooet",
"date": "Thu, 24 Jul 2014 20:21:52 GMT",
"formattedLocationFull": "Lillooet, BC",
"url": "http://ca.indeed.com/viewjob?jk=7779e5fbf4d0613f&qd=cvKKr6L_4R6jh64NGGBfipMcUh0i4g5C-X18qE0gAzC3Ws-qTrT0d3CswmkqzrsGxdgmiLA9Fpf3adh66N9NEAN9-HvuJGR2pUApIXI2XAs&indpubnum=1243433210984925&atk=18u2anmkg0mqi68p",
"jobtitle": "Executive Assistant",
"company": "Xaxli'p",
"onmousedown": "indeed_clk(this, '834');",
"snippet": "The Executive Assistant is responsible for providing administrative and secretarial services and support to the Chief and Council and the Band Administrator... ",
"source": "WorkBC",
"state": "BC",
"sponsored": false,
"country": "CA",
"formattedLocation": "Lillooet, BC",
"jobkey": "7779e5fbf4d0613f",
"expired": false,
"indeedApply": false
[Diagram: the pipeline stages Source, Extract, Store, Distill, Analyze, with the tools used along the way: API, Import IO, MongoDB, Python, Hadoop, SAS. Data moves from unstructured to structured.]
Storage & structure
“Postings” collection
→ Store documents from different sources, with different structures
Wrapper structure allows uniform retrieval
→ Posted date
→ Skills
→ Source
→ Raw data
→ Location
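The wrapper idea can be sketched without a database. The field names below follow the loader scripts in this deck (posted_date, skills, city, source, raw_data). The helper names wrap_posting and skills_by_source and the sample skill values are hypothetical, for illustration only.

```python
def wrap_posting(source, posted_date, skills, city, raw_data):
    """Wrap one source-specific record in the shared envelope."""
    return {
        "posted_date": posted_date,  # DD/MM/YYYY string
        "skills": skills,            # list, string, or "" when the source has none
        "city": city,
        "source": source,
        "raw_data": raw_data,        # the API record, stored untouched
    }


def skills_by_source(postings, source):
    """Uniform retrieval: the same field names work for every source."""
    return [p["skills"] for p in postings if p["source"] == source]


# Two raw records with different structures (trimmed from the sample results)
indeed_raw = {"city": "Lillooet", "jobtitle": "Executive Assistant"}
cb_raw = {"Location": "Toronto-M5J 2T3", "Company": "Robert Half Technology"}

postings = [
    wrap_posting("indeed", "24/07/2014", "", indeed_raw["city"], indeed_raw),
    # The skill values here are made up for the example
    wrap_posting("careerbuilder", "29/07/2014", ["sql", "html"],
                 cb_raw["Location"], cb_raw),
]

print(skills_by_source(postings, "careerbuilder"))  # [['sql', 'html']]
```

Keeping raw_data alongside the wrapper fields means a field that was not extracted at load time can still be recovered later.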
Challenge & Solution
Identifying new information
Differing data formats
Duplicates between sources
Differing skill set data structures
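Two of these challenges, differing date formats and differing skill structures, lend themselves to small normalizing helpers. This is a sketch under assumptions, not the team's actual solution: the function names are hypothetical, and the date formats are the ones used by the loader scripts in this deck. Date parsing with %a and %b assumes an English locale.

```python
import time


def normalize_date(raw, source_format):
    """Parse a source-specific date string and return it as DD/MM/YYYY."""
    return time.strftime("%d/%m/%Y", time.strptime(raw, source_format))


def normalize_skills(skills):
    """Return skills as a list of lowercase strings, whatever shape arrived."""
    if not skills:
        return []
    items = skills if isinstance(skills, list) else [skills]
    # Commas are stripped so each skill is safe inside a "<date>,<skill>" line
    return [s.replace(",", "").strip().lower() for s in items]


print(normalize_date("7/29/2014", "%m/%d/%Y"))  # 29/07/2014
print(normalize_date("Thu, 24 Jul 2014 20:21:52 GMT",
                     "%a, %d %b %Y %H:%M:%S GMT"))  # 24/07/2014
print(normalize_skills(["HTML", "Database"]))  # ['html', 'database']
print(normalize_skills(""))  # []
```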
import json
import pymongo

client = pymongo.MongoClient()
db = client.jobengine

# Query to get only the skills and posted_date fields
postings = db.posting.find({}, {"posted_date": 1, "skills": 1, "_id": 0})

# Iterate over each posting
for posting in postings:
    skills = posting.get('skills')
    # Continue processing only if the skills field is not empty
    if not skills:
        continue
    # A list yields one "date,skill" line per element;
    # any other value is printed as a single skill
    if not isinstance(skills, list):
        skills = [skills]
    for skill in skills:
        print("%s,%s" % (posting['posted_date'], skill.replace(',', '').lower()))
from mrjob.job import MRJob


class skillsCount(MRJob):

    def mapper(self, _, value):
        # Each input line is "<date>,<skill>"
        date, skill = value.split(",", 1)
        yield skill, 1

    def reducer(self, key, values):
        # Emit the count first so the output can be ordered with sort -n
        yield sum(values), key


if __name__ == '__main__':
    skillsCount.run()
Example: identify in-demand skills
getPostedDateSkill.py → <date>,<skill> → getSkillsCount.py → sort -n
…
4 "html"
4 "system integration"
5 "software development"
6 "database"
7 "bookkeeping"
8 "audit"
Trends
→ Run MapReduce (MR) algorithms to return skill mention frequencies by date
→ Leverage analytics to understand trends, identify seasonality and predict growth / decline
Package to help employers find untapped labour sources and governments target immigration policies
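The counting job in this deck keys on the skill alone. Getting frequencies by date means keying on the (date, skill) pair instead. The sketch below shows that idea in plain Python rather than mrjob. The function name and the sample lines are hypothetical.

```python
from collections import Counter


def mentions_by_date(lines):
    """Count each (date, skill) pair in "<date>,<skill>" lines."""
    counts = Counter()
    for line in lines:
        date, skill = line.strip().split(",", 1)
        counts[(date, skill)] += 1
    return counts


# Made-up input in the format printed by getPostedDateSkill.py
sample = [
    "24/07/2014,audit",
    "24/07/2014,audit",
    "24/07/2014,database",
    "29/07/2014,audit",
]

counts = mentions_by_date(sample)
for (date, skill), n in sorted(counts.items()):
    print(date, skill, n)
```

In a MapReduce version the mapper would yield the pair as the key, and the reducer would sum as before.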
Banks: “communication”
[Chart: Actual vs. Forecast, Jun-01 to Aug-01; vertical axis 0 to 70]
Banks: “SAS”
[Chart: Actual vs. Forecast, Jun-01 to Aug-01; vertical axis 0 to 10]
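The deck does not say how the forecast lines were produced. As a stand-in, the sketch below fits an ordinary least-squares straight line to a series of counts and extends it. This is a swapped-in, deliberately simple technique that ignores seasonality. The function names and the sample counts are hypothetical, and the code assumes Python 3 division.

```python
def fit_line(ys):
    """Least-squares line through the points (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(ys, steps):
    """Extend the fitted line `steps` periods past the end of the series."""
    slope, intercept = fit_line(ys)
    n = len(ys)
    return [intercept + slope * (n + k) for k in range(steps)]


weekly_mentions = [2, 4, 6, 8]  # made-up counts for one skill
print(forecast(weekly_mentions, 2))  # [10.0, 12.0]
```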
Clustering
→ Run algorithms to return complementary clusters of skills
→ Analyze for frequency of association to understand relative importance and trends over time
Package to help job seekers learn “next” skills and post-secondary institutions adapt programs and course syllabi
(Used twice in a single presentation!)
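The deck does not name a clustering algorithm. A common first step toward "frequency of association" is counting how often two skills appear in the same posting. The sketch below does only that pair counting. The function name and the sample postings are hypothetical.

```python
from collections import Counter
from itertools import combinations


def skill_pair_counts(postings_skills):
    """Count how many postings mention each unordered pair of skills."""
    counts = Counter()
    for skills in postings_skills:
        # De-duplicate and sort so ("a", "b") and ("b", "a") share one key
        unique = sorted(set(s.lower() for s in skills))
        for pair in combinations(unique, 2):
            counts[pair] += 1
    return counts


# Made-up skill lists, one per posting
postings_skills = [
    ["audit", "bookkeeping"],
    ["audit", "bookkeeping", "database"],
    ["database", "html"],
]

pairs = skill_pair_counts(postings_skills)
print(pairs.most_common(1))  # [(('audit', 'bookkeeping'), 2)]
```

Pairs that keep turning up together are candidates for the same cluster, and for the "next" skill suggestions the slide describes.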
Big data…
Big questions?
Syllabi (third time’s the charm)
Appendix 1: LinkedIn API
from linkedin import linkedin
import json

# Placeholders: substitute the credentials issued for your LinkedIn application
CONSUMER_KEY = 'YOUR_CONSUMER_KEY'
CONSUMER_SECRET = 'YOUR_CONSUMER_SECRET'
USER_TOKEN = 'YOUR_USER_TOKEN'
USER_SECRET = 'YOUR_USER_SECRET'
RETURN_URL = 'http://127.0.0.1'

authentication = linkedin.LinkedInDeveloperAuthentication(
    CONSUMER_KEY, CONSUMER_SECRET,
    USER_TOKEN, USER_SECRET,
    RETURN_URL, linkedin.PERMISSIONS.enums.values())
application = linkedin.LinkedInApplication(authentication)

# Fetch selected fields of the authenticated user's profile
profile = application.get_profile(selectors=['id', 'first-name', 'last-name', 'skills'])
print(json.dumps(profile, indent=3))
print("*" * 120)

# Search for two job postings with "python" in the title
jobs = application.search_job(
    selectors=[{'jobs': ['id', 'customer-job-code', 'posting-date']}],
    params={'title': 'python', 'count': 2})
print(json.dumps(jobs, indent=3))
Appendix 2: CareerBuilder API
from careerbuilder import CareerBuilder
import json
import time
import pymongo

cb = CareerBuilder(DEV_KEY)  # DEV_KEY is defined elsewhere
search = cb.job_search(HostSite='CA', PostedWithin='1')
list_of_jobs = search['ResponseJobSearch']['Results']['JobSearchResult']

client = pymongo.MongoClient()
db = client.jobengine

for job in list_of_jobs:
    location = job['Location']
    # PostedDate arrives as M/D/YYYY; store it as DD/MM/YYYY like the other sources
    posteddate = time.strftime("%d/%m/%Y",
                               time.strptime(job['PostedDate'], "%m/%d/%Y"))
    skills = job['Skills']['Skill']
    db.posting.insert_one({"posted_date": posteddate, "skills": skills,
                           "city": location, "source": 'careerbuilder',
                           "raw_data": job})
Appendix 3: CareerBuilder Result
"Company": "Robert Half Technology",
"CompanyDID": "c8432266b3wfjhdhwpx",
"CompanyDetailsURL": "http://www.careerbuilder.ca/jobs/company-name/c8432266b3wfjhdhwpx/robert-half-technology/?sc_cmp1=13_JobRes_ComDet",
"DID": "J3G6PM69F3QVJ2MY15G",
"OnetCode": "15-1099.04",
"ONetFriendlyTitle": "Web Developers",
"DescriptionTeaser": "Ref ID: 05090-9688475 Classification: Programmer/Analyst Compensation: DOE Our client is currently looking for candidate with strong understanding of...",
"Distance": null,
"EmploymentType": "Full-Time Employee",
"EducationRequired": "Not Specified",
"ExperienceRequired": "Not Specified",
"JobDetailsURL": "http://api.careerbuilder.com/v1/joblink?TrackingID=UNTRKD&HostSite=CA&DID=J3G6PM69F3QVJ2MY15G",
"JobServiceURL": "https://api.careerbuilder.com/v1/job?DID=J3G6PM69F3QVJ2MY15G&HostSite=CA&DeveloperKey=WDHT5Y26MLSBGLS2HC7G",
"Location": "Toronto-M5J 2T3",
"LocationLatitude": "43.6432",
"LocationLongitude": "-79.3806",
"PostedDate": "7/29/2014",
"PostedTime": "7/29/2014 8:16:48 PM",
"Pay": "N/A",
…
Appendix 4: Indeed API
from indeed import IndeedClient
import json
import pymongo
import time

client = IndeedClient('123456')
params = {
    'l': "Anywhere",
    'co': "ca",
    'userip': "1.2.3.4",
    'useragent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2)"
}
search_response = client.search(**params)
list_of_jobs = search_response['results']

# Separate name for the MongoDB client so it does not shadow the Indeed client
mongo_client = pymongo.MongoClient()
db = mongo_client.jobengine

for job in list_of_jobs:
    location = job['city']
    posteddate = time.strftime("%d/%m/%Y",
                               time.strptime(job['date'], "%a, %d %b %Y %H:%M:%S GMT"))
    # No skills value is read from the Indeed result, so the field is stored empty
    db.posting.insert_one({"posted_date": posteddate, "skills": "",
                           "city": location, "source": 'indeed',
                           "raw_data": job})
Appendix 5: Indeed Result
"formattedRelativeTime": "5 days ago",
"city": "Lillooet",
"date": "Thu, 24 Jul 2014 20:21:52 GMT",
"formattedLocationFull": "Lillooet, BC",
"url": "http://ca.indeed.com/viewjob?jk=7779e5fbf4d0613f&qd=cvKKr6L_4R6jh64NGGBfipMcUh0i4g5C-X18qE0gAzC3Ws-qTrT0d3CswmkqzrsGxdgmiLA9Fpf3adh66N9NEAN9-HvuJGR2pUApIXI2XAs&indpubnum=1243433210984925&atk=18u2anmkg0mqi68p",
"jobtitle": "Executive Assistant",
"company": "Xaxli'p",
"onmousedown": "indeed_clk(this, '834');",
"snippet": "The Executive Assistant is responsible for providing administrative and secretarial services and support to the Chief and Council and the Band Administrator... ",
"source": "WorkBC",
"state": "BC",
"sponsored": false,
"country": "CA",
"formattedLocation": "Lillooet, BC",
"jobkey": "7779e5fbf4d0613f",
"expired": false,
"indeedApply": false
Appendix 6: getPostedDateSkill
import json
import pymongo

client = pymongo.MongoClient()
db = client.jobengine

# Query to get only the skills and posted_date fields
postings = db.posting.find({}, {"posted_date": 1, "skills": 1, "_id": 0})

# Iterate over each posting
for posting in postings:
    skills = posting.get('skills')
    # Continue processing only if the skills field is not empty
    if not skills:
        continue
    # A list yields one "date,skill" line per element;
    # any other value is printed as a single skill
    if not isinstance(skills, list):
        skills = [skills]
    for skill in skills:
        print("%s,%s" % (posting['posted_date'], skill.replace(',', '').lower()))
Appendix 7: getSkillsCount
from mrjob.job import MRJob


class skillsCount(MRJob):

    def mapper(self, _, value):
        # Each input line is "<date>,<skill>"
        date, skill = value.split(",", 1)
        yield skill, 1

    def reducer(self, key, values):
        # Emit the count first so the output can be ordered with sort -n
        yield sum(values), key


if __name__ == '__main__':
    skillsCount.run()
Attributions
Text for Big Data graphic: http://www.bigdata-startups.com/job-descriptions/
Big Data graphic: http://www.wordle.net/