2. What if you wanted to…
Plan your career
How your key skills are trending
Develop labour policy
Skill deficits by region or by industry
Train job-ready graduates
Add skills to programs and syllabi
(Syllabi is an awesome word)
2014 Aug 16 Team Grant for Queen's School of Business 2
3.
Source
Extract
Store
Distill
Analyze
Answer Questions
4.
from linkedin import linkedin
import json
import time
import pymongo

# Authenticate against the LinkedIn API (arguments elided on the slide)
authentication = linkedin.LinkedInDeveloperAuthentication(...)
application = linkedin.LinkedInApplication(authentication)

client = pymongo.MongoClient()
db = client.jobengine

# Highest LinkedIn job id already stored; only newer postings are inserted
max_id = db.posting.find({'source': 'linkedin'}).sort('raw_data.id', -1).limit(1)[0]['raw_data']['id']

while True:
    list_of_jobs = application.search_job(
        selectors=[{'jobs': ['id', 'posting-date',...]}],
        params={'count': 100, 'sort': 'DD',...})
    # Results are sorted newest first, so walk them oldest first
    for job in reversed(list_of_jobs):
        if job['id'] <= max_id:
            continue
        max_id = job['id']
        location = job['locationDescription']
        raw_date = job['postingDate']
        posteddate = time.strftime("%d/%m/%Y",...)
        skills = job['skillsAndExperience']
        db.posting.insert_one({"posted_date": posteddate,
                               "skills": skills, "city": location,
                               "source": 'linkedin', "raw_data": job})
    # Poll again in five minutes
    time.sleep(300)
from careerbuilder import CareerBuilder
import json
import time
import pymongo

# DEV_KEY is the CareerBuilder developer key (not shown on the slide)
cb = CareerBuilder(DEV_KEY)
search = cb.job_search(HostSite='CA', PostedWithin='1')
list_of_jobs = search['ResponseJobSearch']['Results']['JobSearchResult']

client = pymongo.MongoClient()
db = client.jobengine

for job in list_of_jobs:
    location = job['Location']
    # Parse the m/d/Y source date and store it as d/m/Y, like the other sources
    posteddate = time.strftime("%d/%m/%Y",
                               time.strptime(job['PostedDate'], "%m/%d/%Y"))
    skills = job['Skills']['Skill']
    db.posting.insert_one({"posted_date": posteddate,
                           "skills": skills, "city": location,
                           "source": 'careerbuilder', "raw_data": job})
LinkedIn to CareerBuilder, Indeed
Source Extract Store Distill Analyze
from indeed import IndeedClient
import json
import pymongo
import time

# '123456' stands in for the Indeed publisher number
client = IndeedClient('123456')
params = {
    'l': "Anywhere",
    'co': "ca",
    'userip': "1.2.3.4",
    'useragent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2)"
}
search_response = client.search(**params)
list_of_jobs = search_response['results']

mongo_client = pymongo.MongoClient()
db = mongo_client.jobengine

for job in list_of_jobs:
    location = job['city']
    posteddate = time.strftime("%d/%m/%Y",
                               time.strptime(job['date'], "%a, %d %b %Y %H:%M:%S GMT"))
    # Indeed results carry no skills field, so it is stored empty
    db.posting.insert_one({"posted_date": posteddate,
                           "skills": "", "city": location,
                           "source": 'indeed', "raw_data": job})
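The LinkedIn extractor on this slide avoids re-inserting postings by remembering the highest job id it has stored. A minimal sketch of that filter on plain dictionaries follows; `select_new_postings` is a hypothetical helper, not part of the deck, and it assumes that ids grow with posting time, as the original loop implicitly does.

```python
def select_new_postings(jobs, max_id):
    """Return (new_jobs, new_max_id), keeping only jobs whose id exceeds max_id."""
    new_jobs = []
    # Walk the batch in ascending id order so max_id only ever moves forward
    for job in sorted(jobs, key=lambda j: j["id"]):
        if job["id"] > max_id:
            new_jobs.append(job)
            max_id = job["id"]
    return new_jobs, max_id


# Hypothetical batch: id 10 is already stored, 11 and 12 are new
batch = [{"id": 12}, {"id": 10}, {"id": 11}]
new_jobs, high_water = select_new_postings(batch, max_id=10)

# Running the same batch again finds nothing new
repeat_jobs, repeat_high_water = select_new_postings(batch, high_water)
```

Keeping the filter separate from the API call and the database insert makes it easy to check without network access.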
5. Results from Canada
60k results per week
300 MB per week
3+ data structures
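As a back-of-the-envelope check, the two weekly figures above imply an average posting size and a yearly volume. The sketch assumes 1 MB = 1000 KB and a 52-week year; the variable names are illustrative.

```python
RESULTS_PER_WEEK = 60000   # "60k results per week"
MB_PER_WEEK = 300          # "300 MB per week"

# Average size of one stored posting, in KB (assuming 1 MB = 1000 KB)
kb_per_result = MB_PER_WEEK * 1000.0 / RESULTS_PER_WEEK

# Storage growth over a 52-week year, in GB (assuming 1 GB = 1000 MB)
gb_per_year = MB_PER_WEEK * 52 / 1000.0
```

Under those assumptions a posting averages about 5 KB and a year of collection comes to roughly 15.6 GB.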
Sample result
"formattedRelativeTime": "5 days ago",
"city": "Lillooet",
"date": "Thu, 24 Jul 2014 20:21:52 GMT",
"formattedLocationFull": "Lillooet, BC",
"url": "http://ca.indeed.com/viewjob?jk=7779e5fbf4d0613f&qd=cvKKr6L_4R6jh64NGGBfipMcUh0i4g5C-X18qE0gAzC3Ws-qTrT0d3CswmkqzrsGxdgmiLA9Fpf3adh66N9NEAN9-HvuJGR2pUApIXI2XAs&indpubnum=1243433210984925&atk=18u2anmkg0mqi68p",
"jobtitle": "Executive Assistant",
"company": "Xaxli'p",
"onmousedown": "indeed_clk(this, '834');",
"snippet": "The Executive Assistant is responsible for providing administrative and secretarial services and support to the Chief and Council and the Band Administrator... ",
"source": "WorkBC",
"state": "BC",
"sponsored": false,
"country": "CA",
"formattedLocation": "Lillooet, BC",
"jobkey": "7779e5fbf4d0613f",
"expired": false,
"indeedApply": false
6.
Source
API
Import IO
MongoDB
Source Extract Store Distill Analyze
Python
Hadoop
SAS
Unstructured
Structured
7. Storage & structure
“Postings” collection
Store documents from different sources, with different structures
Wrapper structure allows uniform retrieval
Posted date
Skills
Source
Raw data
Location
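A minimal sketch of the wrapper idea, assuming the field names used by the extractors (`posted_date`, `skills`, `city`, `source`, `raw_data`). The helper `wrap_posting` is hypothetical; the sample values come from the Indeed sample result shown in this deck.

```python
def wrap_posting(source, raw_data, posted_date, city, skills=""):
    """Build the uniform wrapper document around a source-specific result."""
    return {
        "posted_date": posted_date,  # normalised date string
        "skills": skills,            # list, string, or "" when the source has none
        "city": city,
        "source": source,            # e.g. 'linkedin', 'careerbuilder', 'indeed'
        "raw_data": raw_data,        # untouched API result, kept for later use
    }


# Trimmed-down Indeed result with only the fields needed here
indeed_raw = {
    "city": "Lillooet",
    "jobtitle": "Executive Assistant",
    "date": "Thu, 24 Jul 2014 20:21:52 GMT",
}
doc = wrap_posting("indeed", indeed_raw, "24/07/2014", indeed_raw["city"])
```

Queries can then rely on the five wrapper fields regardless of source, while anything source-specific stays reachable under `raw_data`.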
8. Challenge & Solution
Identifying new information
Differing data formats
Duplicates between sources
Differing skill set data structures
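Two of these challenges, differing date formats and differing skill structures, can be sketched with small normalising helpers. The date formats below are the ones the CareerBuilder and Indeed extractors parse; the helper names and the choice to return skills as a lower-cased list are assumptions for illustration, not the deck's solution.

```python
import time

# Source-specific input formats, as parsed by the extractors in this deck
DATE_FORMATS = {
    "careerbuilder": "%m/%d/%Y",
    "indeed": "%a, %d %b %Y %H:%M:%S GMT",
}


def normalize_date(source, raw_date):
    """Parse a source-specific date string and return it as dd/mm/YYYY."""
    parsed = time.strptime(raw_date, DATE_FORMATS[source])
    return time.strftime("%d/%m/%Y", parsed)


def normalize_skills(skills):
    """Return skills as a list of lower-case strings, whatever shape came in."""
    if not skills:
        return []
    if isinstance(skills, list):
        return [s.strip().lower() for s in skills]
    return [skills.strip().lower()]


cb_date = normalize_date("careerbuilder", "7/29/2014")
indeed_date = normalize_date("indeed", "Thu, 24 Jul 2014 20:21:52 GMT")
```

Parsing of the Indeed format depends on English day and month names, so it assumes the default C or an English locale.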
9. Example: identify in-demand skills
getPostedDateSkill.py (prints one <date>,<skill> line per skill)
import json
import pymongo

client = pymongo.MongoClient()
db = client.jobengine

# Query to get only the skills and posted_date fields
postings = db.posting.find({}, {"posted_date": 1, "skills": 1, "_id": 0})

# Iterate over each posting
for posting in postings:
    # Continue processing only if the skills field is not empty
    if posting['skills'] != "":
        skills = posting['skills']
        # If the skills field is a list, print the date and each skill;
        # otherwise print the date and the content of the skills field.
        # Commas are stripped so each output line has exactly one separator.
        if isinstance(skills, list):
            for skill in skills:
                print("%s,%s" % (posting['posted_date'], skill.replace(',', '').lower()))
        else:
            print("%s,%s" % (posting['posted_date'], skills.replace(',', '').lower()))

getSkillsCount.py (counts the <date>,<skill> lines per skill)
from mrjob.job import MRJob

class skillsCount(MRJob):

    def mapper(self, _, value):
        # Each input line is "<date>,<skill>"; split on the first comma only
        date, skill = value.split(",", 1)
        yield skill, 1

    def reducer(self, key, values):
        # Emit the count first so the output can be sorted numerically
        yield sum(values), key

if __name__ == '__main__':
    skillsCount.run()

Output, sorted with sort -n
…
4 "html"
4 "system integration"
5 "software development"
6 "database"
7 "bookkeeping"
8 "audit"
10. Trends
Run MR algorithms to return skill mention frequencies by date
Leverage analytics to understand trends, identify seasonality and predict growth / decline
Package to help employers find untapped labour sources and governments target immigration policies
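Counting skill mentions by date is the same map/reduce pattern with a (date, skill) key. The sketch below runs that logic locally in plain Python rather than as an MR job, which is a stand-in and not the deck's implementation; `count_by_date` and the sample lines are made up for illustration.

```python
from collections import Counter


def count_by_date(lines):
    """Count mentions per (date, skill) from '<date>,<skill>' lines."""
    counts = Counter()
    for line in lines:
        # "Map" step: split on the first comma into date and skill
        date, skill = line.strip().split(",", 1)
        # "Reduce" step: accumulate one mention for this (date, skill) key
        counts[(date, skill)] += 1
    return counts


# Made-up input in the '<date>,<skill>' format
sample_lines = [
    "24/07/2014,audit",
    "24/07/2014,audit",
    "24/07/2014,html",
    "25/07/2014,audit",
]
counts = count_by_date(sample_lines)
```

The resulting counts form one time series per skill, which is the input a trend or seasonality analysis would need.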
11. Banks: “communication”
[Line chart: Actual and Forecast series; y-axis 0 to 70; x-axis Jun-01 to Aug-01]
12. Banks: “SAS”
[Line chart: Actual and Forecast series; y-axis 0 to 10; x-axis Jun-01 to Aug-01]
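The deck does not say how the Forecast series on these charts was produced (SAS appears among the tools). Purely as an illustration of extending a count series forward, here is a least-squares linear trend in plain Python. It is a swapped-in technique, the helper names are hypothetical, and the counts are made up.

```python
def fit_line(ys):
    """Least-squares slope and intercept for ys observed at x = 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(ys, steps):
    """Extend the fitted line for the next `steps` periods."""
    slope, intercept = fit_line(ys)
    n = len(ys)
    return [intercept + slope * (n + i) for i in range(steps)]


# Made-up weekly mention counts, not data from the deck
weekly_counts = [2, 4, 6, 8]
next_two = forecast(weekly_counts, 2)
```

A straight line ignores seasonality, so it only covers the "predict growth / decline" part of the analysis.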
13. Clustering
Run algorithms to return complementary clusters of skills
Analyze for frequency of association to understand relative importance and trends over time
Package to help job seekers learn “next” skills and post-secondary institutions adapt programs and course syllabi
(Used twice in a single presentation!)
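The clustering algorithm itself is not named on this slide. One simple way to approximate "frequency of association" is to count how often two skills appear in the same posting, sketched below. `count_skill_pairs` is a hypothetical helper and the postings are made up.

```python
from collections import Counter
from itertools import combinations


def count_skill_pairs(postings):
    """Count how often each unordered pair of skills shares a posting."""
    pairs = Counter()
    for skills in postings:
        # Lower-case and de-duplicate so each pair counts once per posting
        unique = sorted(set(s.lower() for s in skills))
        for a, b in combinations(unique, 2):
            pairs[(a, b)] += 1
    return pairs


# Made-up skill lists, one per posting
postings = [
    ["SQL", "SAS"],
    ["sql", "python", "sas"],
    ["python", "hadoop"],
]
pairs = count_skill_pairs(postings)
top_pair = pairs.most_common(1)[0]
```

Pairs with high counts are candidates for the same cluster, and for a job seeker who already has one skill of a pair, the other is a plausible "next" skill.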
14.
Big data…
Big questions?
Syllabi (third time’s the charm)
15. Appendix 1: LinkedIn API
from linkedin import linkedin
import json
import os

# Read the API credentials from the environment
CONSUMER_KEY = os.environ['CONSUMER_KEY']
CONSUMER_SECRET = os.environ['CONSUMER_SECRET']
USER_TOKEN = os.environ['USER_TOKEN']
USER_SECRET = os.environ['USER_SECRET']
RETURN_URL = 'http://127.0.0.1'

authentication = linkedin.LinkedInDeveloperAuthentication(
    CONSUMER_KEY, CONSUMER_SECRET,
    USER_TOKEN, USER_SECRET,
    RETURN_URL, linkedin.PERMISSIONS.enums.values())
application = linkedin.LinkedInApplication(authentication)

# Fetch the authenticated user's profile, including listed skills
profile = application.get_profile(selectors=['id', 'first-name', 'last-name', 'skills'])
print(json.dumps(profile, indent=3))
print("*" * 120)

# Search for two job postings with 'python' in the title
jobs = application.search_job(
    selectors=[{'jobs': ['id', 'customer-job-code', 'posting-date']}],
    params={'title': 'python', 'count': 2})
print(json.dumps(jobs, indent=3))
16. Appendix 2: CareerBuilder API
from careerbuilder import CareerBuilder
import json
import time
import pymongo

# DEV_KEY is the CareerBuilder developer key (not shown on the slide)
cb = CareerBuilder(DEV_KEY)
search = cb.job_search(HostSite='CA', PostedWithin='1')
list_of_jobs = search['ResponseJobSearch']['Results']['JobSearchResult']

client = pymongo.MongoClient()
db = client.jobengine

for job in list_of_jobs:
    location = job['Location']
    # Parse the m/d/Y source date and store it as d/m/Y, like the other sources
    posteddate = time.strftime("%d/%m/%Y",
                               time.strptime(job['PostedDate'], "%m/%d/%Y"))
    skills = job['Skills']['Skill']
    db.posting.insert_one({"posted_date": posteddate, "skills": skills,
                           "city": location, "source": 'careerbuilder',
                           "raw_data": job})
17. Appendix 3: CareerBuilder Result
"Company": "Robert Half Technology",
"CompanyDID": "c8432266b3wfjhdhwpx",
"CompanyDetailsURL": "http://www.careerbuilder.ca/jobs/company-name/c8432266b3wfjhdhwpx/robert-half-technology/?sc_cmp1=13_JobRes_ComDet",
"DID": "J3G6PM69F3QVJ2MY15G",
"OnetCode": "15-1099.04",
"ONetFriendlyTitle": "Web Developers",
"DescriptionTeaser": "Ref ID: 05090-9688475 Classification: Programmer/Analyst Compensation: DOE Our client is currently looking for candidate with strong understanding of...",
"Distance": null,
"EmploymentType": "Full-Time Employee",
"EducationRequired": "Not Specified",
"ExperienceRequired": "Not Specified",
"JobDetailsURL": "http://api.careerbuilder.com/v1/joblink?TrackingID=UNTRKD&HostSite=CA&DID=J3G6PM69F3QVJ2MY15G",
"JobServiceURL": "https://api.careerbuilder.com/v1/job?DID=J3G6PM69F3QVJ2MY15G&HostSite=CA&DeveloperKey=WDHT5Y26MLSBGLS2HC7G",
"Location": "Toronto-M5J 2T3",
"LocationLatitude": "43.6432",
"LocationLongitude": "-79.3806",
"PostedDate": "7/29/2014",
"PostedTime": "7/29/2014 8:16:48 PM",
"Pay": "N/A",
…
18. Appendix 4: Indeed API
from indeed import IndeedClient
import json
import pymongo
import time

# '123456' stands in for the Indeed publisher number
client = IndeedClient('123456')
params = {
    'l': "Anywhere",
    'co': "ca",
    'userip': "1.2.3.4",
    'useragent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2)"
}
search_response = client.search(**params)
list_of_jobs = search_response['results']

mongo_client = pymongo.MongoClient()
db = mongo_client.jobengine

for job in list_of_jobs:
    location = job['city']
    posteddate = time.strftime("%d/%m/%Y",
                               time.strptime(job['date'], "%a, %d %b %Y %H:%M:%S GMT"))
    # Indeed results carry no skills field, so it is stored empty
    db.posting.insert_one({"posted_date": posteddate, "skills": "",
                           "city": location, "source": 'indeed',
                           "raw_data": job})
19. Appendix 5: Indeed Result
"formattedRelativeTime": "5 days ago",
"city": "Lillooet",
"date": "Thu, 24 Jul 2014 20:21:52 GMT",
"formattedLocationFull": "Lillooet, BC",
"url": "http://ca.indeed.com/viewjob?jk=7779e5fbf4d0613f&qd=cvKKr6L_4R6jh64NGGBfipMcUh0i4g5C-X18qE0gAzC3Ws-qTrT0d3CswmkqzrsGxdgmiLA9Fpf3adh66N9NEAN9-HvuJGR2pUApIXI2XAs&indpubnum=1243433210984925&atk=18u2anmkg0mqi68p",
"jobtitle": "Executive Assistant",
"company": "Xaxli'p",
"onmousedown": "indeed_clk(this, '834');",
"snippet": "The Executive Assistant is responsible for providing administrative and secretarial services and support to the Chief and Council and the Band Administrator... ",
"source": "WorkBC",
"state": "BC",
"sponsored": false,
"country": "CA",
"formattedLocation": "Lillooet, BC",
"jobkey": "7779e5fbf4d0613f",
"expired": false,
"indeedApply": false
20. Appendix 6: getPostedDateSkill
import json
import pymongo

client = pymongo.MongoClient()
db = client.jobengine

# Query to get only the skills and posted_date fields
postings = db.posting.find({}, {"posted_date": 1, "skills": 1, "_id": 0})

# Iterate over each posting
for posting in postings:
    # Continue processing only if the skills field is not empty
    if posting['skills'] != "":
        skills = posting['skills']
        # If the skills field is a list, print the date and each skill;
        # otherwise print the date and the content of the skills field.
        # Commas are stripped so each output line has exactly one separator.
        if isinstance(skills, list):
            for skill in skills:
                print("%s,%s" % (posting['posted_date'], skill.replace(',', '').lower()))
        else:
            print("%s,%s" % (posting['posted_date'], skills.replace(',', '').lower()))
21. Appendix 7: getSkillsCount
from mrjob.job import MRJob

class skillsCount(MRJob):

    def mapper(self, _, value):
        # Each input line is "<date>,<skill>"; split on the first comma only
        date, skill = value.split(",", 1)
        yield skill, 1

    def reducer(self, key, values):
        # Emit the count first so the output can be sorted numerically
        yield sum(values), key

if __name__ == '__main__':
    skillsCount.run()
22. Attributions
Text for Big Data graphic: http://www.bigdata-startups.com/job-descriptions/
Big Data graphic: http://www.wordle.net/