Final Presentation V1.8

Jobs & Skills
Team Grant
MMA 865

What if you wanted to…
Plan your career
➢ How your key skills are trending
Develop labour policy
➢ Skill deficits by region or by industry
Train job-ready graduates
➢ Add skills to programs and syllabi
(Syllabi is an awesome word)
2014 Aug 16 Team Grant for Queen's School of Business

Source
Extract
Store
Distill
Analyze
Answer Questions

# LinkedIn: poll for new postings and store them
from linkedin import linkedin
import json
import time
import pymongo

authentication = linkedin.LinkedInDeveloperAuthentication(...)
application = linkedin.LinkedInApplication(authentication)

client = pymongo.MongoClient()
db = client.jobengine

# Highest LinkedIn job id already stored; anything above it is new
max_id = db.posting.find({'source': 'linkedin'}).sort('raw_data.id', -1).limit(1)[0]['raw_data']['id']

while True:
    list_of_jobs = application.search_job(
        selectors=[{'jobs': ['id', 'posting-date',...]}],
        params={'count': 100, 'sort': 'DD',...})
    # Walk the results in reverse so max_id only ever increases
    for job in reversed(list_of_jobs):
        if job['id'] <= max_id:
            continue  # already stored
        max_id = job['id']
        location = job['locationDescription']
        raw_date = job['postingDate']
        posteddate = time.strftime("%d/%m/%Y",...)
        skills = job['skillsAndExperience']
        db.posting.insert({"posted_date": posteddate, "skills": skills, "city": location,
                           "source": 'linkedin', "raw_data": job})
    time.sleep(300)  # wait five minutes before polling again
# CareerBuilder: fetch postings from the last day and store them
from careerbuilder import CareerBuilder
import json
import time
import pymongo

db = pymongo.MongoClient().jobengine

cb = CareerBuilder(DEV_KEY)
search = cb.job_search(HostSite='CA', PostedWithin='1')
list_of_jobs = search['ResponseJobSearch']['Results']['JobSearchResult']
for job in list_of_jobs:
    location = job['Location']
    # Parse the M/D/YYYY posting date and re-emit it zero-padded
    posteddate = time.strftime("%m/%d/%Y", time.strptime(job['PostedDate'], "%m/%d/%Y"))
    skills = job['Skills']['Skill']
    db.posting.insert({"posted_date": posteddate, "skills": skills, "city": location,
                       "source": 'careerbuilder', "raw_data": job})
LinkedIn, CareerBuilder, Indeed
# Indeed: search Canadian postings and store them
from indeed import IndeedClient
import json
import pymongo
import time

db = pymongo.MongoClient().jobengine

client = IndeedClient('123456')
params = {
    'l': "Anywhere",
    'co': "ca",
    'userip': "1.2.3.4",
    'useragent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2)"
}
search_response = client.search(**params)
list_of_jobs = search_response['results']
for job in list_of_jobs:
    location = job['city']
    # Indeed dates look like "Thu, 24 Jul 2014 20:21:52 GMT"
    posteddate = time.strftime("%d/%m/%Y", time.strptime(job['date'], "%a, %d %b %Y %H:%M:%S GMT"))
    # No skills value is taken from the Indeed result, so an empty string is stored
    db.posting.insert({"posted_date": posteddate, "skills": "", "city": location,
                       "source": 'indeed', "raw_data": job})
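
Each extractor above parses a different date layout, and the CareerBuilder snippet writes MM/DD/YYYY while the other two write DD/MM/YYYY. A minimal sketch of one shared helper, assuming the intent is a single stored format; the names normalize_posted_date, SOURCE_DATE_FORMATS and TARGET_FORMAT are illustrative, not from the slides:

```python
import time

# Output layout used by the LinkedIn and Indeed snippets in the slides.
TARGET_FORMAT = "%d/%m/%Y"

# Input layouts copied from the snippets; the lookup table itself is an assumption.
SOURCE_DATE_FORMATS = {
    "careerbuilder": "%m/%d/%Y",
    "indeed": "%a, %d %b %Y %H:%M:%S GMT",
}


def normalize_posted_date(source, raw_date):
    """Parse a source-specific date string and return it as DD/MM/YYYY."""
    parsed = time.strptime(raw_date, SOURCE_DATE_FORMATS[source])
    return time.strftime(TARGET_FORMAT, parsed)


# Dates taken from the sample results in the appendices.
print(normalize_posted_date("indeed", "Thu, 24 Jul 2014 20:21:52 GMT"))  # 24/07/2014
print(normalize_posted_date("careerbuilder", "7/29/2014"))               # 29/07/2014
```

The LinkedIn date handling is elided in the slides, so it is left out of the table here.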

Results from Canada
▪ 60k results per week
▪ 300 MB per week
▪ 3+ data structures
Sample result
"formattedRelativeTime": "5 days ago",
"city": "Lillooet",
"date": "Thu, 24 Jul 2014 20:21:52 GMT",
"formattedLocationFull": "Lillooet, BC",
"url": "http://ca.indeed.com/viewjob?jk=7779e5fbf4d0613f&qd=cvKKr6L_4R6jh64NGGBfipMcUh0i4g5C-X18qE0gAzC3Ws-qTrT0d3CswmkqzrsGxdgmiLA9Fpf3adh66N9NEAN9-HvuJGR2pUApIXI2XAs&indpubnum=1243433210984925&atk=18u2anmkg0mqi68p",
"jobtitle": "Executive Assistant",
"company": "Xaxli'p",
"onmousedown": "indeed_clk(this, '834');",
"snippet": "The Executive Assistant is responsible for providing administrative and secretarial services and support to the Chief and Council and the Band Administrator... ",
"source": "WorkBC",
"state": "BC",
"sponsored": false,
"country": "CA",
"formattedLocation": "Lillooet, BC",
"jobkey": "7779e5fbf4d0613f",
"expired": false,
"indeedApply": false
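
The weekly volume figures on this slide imply a rough size per posting and a yearly storage need. A quick back-of-envelope check, assuming decimal units (1 MB = 1,000,000 bytes) and a constant weekly rate, neither of which the slides state:

```python
# Figures from the slide; decimal megabytes are an assumption.
postings_per_week = 60000
bytes_per_week = 300 * 1000 * 1000

avg_bytes_per_posting = bytes_per_week / float(postings_per_week)
gb_per_year = bytes_per_week * 52 / 1e9  # assumes the weekly rate holds all year

print(avg_bytes_per_posting)  # 5000.0 bytes, about 5 KB per posting
print(gb_per_year)            # 15.6 GB per year
```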

Source
API
Import IO
MongoDB
Python
Hadoop
SAS
Unstructured
Structured

Storage & structure
“Postings” collection
▪ Store documents from different sources, with different structures
Wrapper structure allows uniform retrieval
▪ Posted date
▪ Skills
▪ Source
▪ Raw data
▪ Location
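
The wrapper fields listed here match the keys passed to db.posting.insert in the extraction snippets. A minimal sketch of building that wrapper in one place; the helper name wrap_posting is hypothetical, and the sample record is trimmed from the Indeed result shown in the deck:

```python
def wrap_posting(source, posted_date, skills, city, raw_data):
    """Build the uniform wrapper document around a source-specific record."""
    return {
        "posted_date": posted_date,
        "skills": skills,
        "city": city,
        "source": source,
        "raw_data": raw_data,  # the original record, kept unchanged
    }


# Trimmed Indeed record for illustration.
indeed_job = {"city": "Lillooet", "jobtitle": "Executive Assistant"}
doc = wrap_posting("indeed", "24/07/2014", "", indeed_job["city"], indeed_job)
print(sorted(doc))  # ['city', 'posted_date', 'raw_data', 'skills', 'source']
```

Queries can then rely on the five wrapper keys regardless of which site the raw record came from.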

Challenge & Solution
Identifying new information
Differing data formats
Duplicates between sources
Differing skill set data structures
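
For the last challenge, the skills field can be an empty string, a single string, or a list, depending on the source. A minimal sketch of one way to flatten these, mirroring the cleaning that getPostedDateSkill applies (commas removed, lowercased); the helper name normalize_skills is hypothetical:

```python
def normalize_skills(skills):
    """Return skills as a list of lowercase strings with commas removed."""
    if not skills:
        return []  # empty string or missing value
    if not isinstance(skills, list):
        skills = [skills]  # a single string becomes a one-item list
    return [s.replace(",", "").lower() for s in skills]


print(normalize_skills(""))                 # []
print(normalize_skills("SAS, SQL"))         # ['sas sql']
print(normalize_skills(["Audit", "HTML"]))  # ['audit', 'html']
```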

# getPostedDateSkill.py
import json
import pymongo

db = pymongo.MongoClient().jobengine

# Query to get only the skills and posted_date fields
postings = db.posting.find({}, {"posted_date": 1, "skills": 1, "_id": 0})

# Iterate over each posting
for posting in postings:
    # Continue processing only if the skills field is not empty
    if posting['skills'] != "":
        skills = posting['skills']
        # If the skills field is a list, print the date and each skill;
        # otherwise print the date and the content of the skills field
        if isinstance(skills, list):
            for skill in skills:
                print("%s,%s" % (posting['posted_date'], skill.replace(',', '').lower()))
        else:
            print("%s,%s" % (posting['posted_date'], skills.replace(',', '').lower()))
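# getSkillsCount.py
from mrjob.job import MRJob

class skillsCount(MRJob):
    def mapper(self, _, value):
        # Each input line is "<date>,<skill>"
        date, skill = value.split(",", 1)
        yield skill, 1

    def reducer(self, key, values):
        # Emit the count first so the output can be sorted numerically
        yield sum(values), key

if __name__ == '__main__':
    skillsCount.run()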
Example: identify in-demand skills
getPostedDateSkill.py → <date>,<skill> → getSkillsCount.py → sort -n
…
4 "html"
4 "system integration"
5 "software development"
6 "database"
7 "bookkeeping"
8 "audit"
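
The same count can be checked locally without Hadoop. A minimal sketch, assuming input lines in the "<date>,<skill>" form that getPostedDateSkill.py prints; count_skills is a hypothetical helper and the sample lines are made up for illustration:

```python
from collections import Counter


def count_skills(lines):
    """Count skill mentions in '<date>,<skill>' lines."""
    counts = Counter()
    for line in lines:
        _date, skill = line.strip().split(",", 1)
        counts[skill] += 1
    # Ascending by count, like piping the job output through sort -n.
    return sorted((n, skill) for skill, n in counts.items())


sample = [  # made-up lines, not real postings
    "29/07/2014,audit",
    "29/07/2014,html",
    "30/07/2014,audit",
]
for n, skill in count_skills(sample):
    print("%d %s" % (n, skill))
```

This is useful for testing the logic on a small extract before running the mrjob version on the full collection.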

Trends
▪ Run MapReduce (MR) algorithms to return skill mention frequencies by date
▪ Leverage analytics to understand trends, identify seasonality and predict growth / decline
Package to help employers find untapped labour sources and governments target immigration policies
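
Counting mentions by date only needs the map key to change from the skill to the (date, skill) pair. A minimal sketch in plain Python that imitates the map and reduce steps, assuming "<date>,<skill>" input lines; the function names and sample lines are illustrative, not the team's job:

```python
from collections import defaultdict


def mapper(line):
    """Emit ((date, skill), 1) for one '<date>,<skill>' line."""
    date, skill = line.strip().split(",", 1)
    yield (date, skill), 1


def reducer(pairs):
    """Sum the counts for each (date, skill) key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def skill_frequency_by_date(lines):
    pairs = (pair for line in lines for pair in mapper(line))
    return reducer(pairs)


lines = [  # made-up lines, not real postings
    "29/07/2014,audit",
    "29/07/2014,audit",
    "30/07/2014,audit",
    "30/07/2014,html",
]
for key, total in sorted(skill_frequency_by_date(lines).items()):
    print(key, total)
```

The resulting per-date counts are the series that the trend charts plot.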

Banks: “communication” (chart: Actual vs. Forecast, Jun-01 to Aug-01, y-axis 0 to 70)

Banks: “SAS” (chart: Actual vs. Forecast, Jun-01 to Aug-01, y-axis 0 to 10)
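
The slides do not say which method produced the Forecast series. As a stand-in only, a straight-line least-squares trend shows the general idea of extending a mention series forward; the function names and the weekly numbers below are hypothetical:

```python
def fit_line(ys):
    """Least-squares slope and intercept for ys observed at x = 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast(ys, steps):
    """Extend the fitted line for the next `steps` periods."""
    slope, intercept = fit_line(ys)
    n = len(ys)
    return [intercept + slope * (n + i) for i in range(steps)]


weekly_mentions = [2, 4, 6, 8]  # hypothetical counts, not data from the deck
print(forecast(weekly_mentions, 2))  # [10.0, 12.0]
```

A linear fit ignores seasonality, so it would only be a first pass before the kind of analysis the Trends slide describes.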

Clustering
▪ Run algorithms to return complementary clusters of skills
▪ Analyze for frequency of association to understand relative importance and trends over time
Package to help job seekers learn “next” skills and post-secondary institutions adapt programs and course syllabi
(Used twice in a single presentation!)
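
The deck does not name a clustering algorithm. As a simple stand-in for "frequency of association", the sketch below counts how often two skills appear in the same posting; skill_pair_counts and the sample postings are hypothetical:

```python
from collections import Counter
from itertools import combinations


def skill_pair_counts(postings):
    """Count how often each unordered pair of skills shares a posting."""
    pairs = Counter()
    for skills in postings:
        # Lowercase and de-duplicate so each pair counts once per posting.
        unique = sorted(set(s.lower() for s in skills))
        pairs.update(combinations(unique, 2))
    return pairs


postings = [  # made-up skill lists, not real postings
    ["SAS", "SQL", "Audit"],
    ["SQL", "SAS"],
    ["Audit", "Bookkeeping"],
]
print(skill_pair_counts(postings).most_common(1))  # [(('sas', 'sql'), 2)]
```

Pairs with high counts suggest which skill a job seeker holding one of them might learn next.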

Big data…
Big questions?
Syllabi (third time’s the charm)

Appendix 1: LinkedIn API
from linkedin import linkedin
import json
CONSUMER_KEY='7559rpvtim1fcq'
CONSUMER_SECRET='8mpfyOlPLggQjuvp'
USER_TOKEN='570511eb-3f62-4423-b365-40d78d96a31a'
USER_SECRET='a2795c55-3094-498f-8234-a56a2fc304f0'
RETURN_URL='http://127.0.0.1'
authentication = linkedin.LinkedInDeveloperAuthentication(CONSUMER_KEY, CONSUMER_SECRET,
                                                          USER_TOKEN, USER_SECRET,
                                                          RETURN_URL, linkedin.PERMISSIONS.enums.values())
application = linkedin.LinkedInApplication(authentication)

# Fetch the authenticated user's profile, including listed skills
profile = application.get_profile(selectors=['id', 'first-name', 'last-name', 'skills'])
print(json.dumps(profile, indent=3))
print("*" * 120)

# Search for two postings with "python" in the title
jobs = application.search_job(selectors=[{'jobs': ['id', 'customer-job-code', 'posting-date']}],
                              params={'title': 'python', 'count': 2})
print(json.dumps(jobs, indent=3))

Appendix 2: CareerBuilder API
from careerbuilder import CareerBuilder
import json
import time
import pymongo

db = pymongo.MongoClient().jobengine

cb = CareerBuilder(DEV_KEY)
search = cb.job_search(HostSite='CA', PostedWithin='1')
list_of_jobs = search['ResponseJobSearch']['Results']['JobSearchResult']
for job in list_of_jobs:
    location = job['Location']
    # Parse the M/D/YYYY posting date and re-emit it zero-padded
    posteddate = time.strftime("%m/%d/%Y", time.strptime(job['PostedDate'], "%m/%d/%Y"))
    skills = job['Skills']['Skill']
    db.posting.insert({"posted_date": posteddate, "skills": skills, "city": location,
                       "source": 'careerbuilder', "raw_data": job})

Appendix 3: CareerBuilder Result
"Company": "Robert Half Technology",
"CompanyDID": "c8432266b3wfjhdhwpx",
"CompanyDetailsURL": "http://www.careerbuilder.ca/jobs/company-name/c8432266b3wfjhdhwpx/robert-half-technology/?sc_cmp1=13_JobRes_ComDet",
"DID": "J3G6PM69F3QVJ2MY15G",
"OnetCode": "15-1099.04",
"ONetFriendlyTitle": "Web Developers",
"DescriptionTeaser": "Ref ID: 05090-9688475 Classification: Programmer/Analyst Compensation: DOE Our client is currently looking for candidate with strong understanding of...",
"Distance": null,
"EmploymentType": "Full-Time Employee",
"EducationRequired": "Not Specified",
"ExperienceRequired": "Not Specified",
"JobDetailsURL": "http://api.careerbuilder.com/v1/joblink?TrackingID=UNTRKD&HostSite=CA&DID=J3G6PM69F3QVJ2MY15G",
"JobServiceURL": "https://api.careerbuilder.com/v1/job?DID=J3G6PM69F3QVJ2MY15G&HostSite=CA&DeveloperKey=WDHT5Y26MLSBGLS2HC7G",
"Location": "Toronto-M5J 2T3",
"LocationLatitude": "43.6432",
"LocationLongitude": "-79.3806",
"PostedDate": "7/29/2014",
"PostedTime": "7/29/2014 8:16:48 PM",
"Pay": "N/A",
…

Appendix 4: Indeed API
from indeed import IndeedClient
import json
import pymongo
import time

db = pymongo.MongoClient().jobengine

client = IndeedClient('123456')
params = {
    'l': "Anywhere",
    'co': "ca",
    'userip': "1.2.3.4",
    'useragent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2)"
}
search_response = client.search(**params)
list_of_jobs = search_response['results']
for job in list_of_jobs:
    location = job['city']
    # Indeed dates look like "Thu, 24 Jul 2014 20:21:52 GMT"
    posteddate = time.strftime("%d/%m/%Y", time.strptime(job['date'], "%a, %d %b %Y %H:%M:%S GMT"))
    # No skills value is taken from the Indeed result, so an empty string is stored
    db.posting.insert({"posted_date": posteddate, "skills": "", "city": location,
                       "source": 'indeed', "raw_data": job})

Appendix 5: Indeed Result
"formattedRelativeTime": "5 days ago",
"city": "Lillooet",
"date": "Thu, 24 Jul 2014 20:21:52 GMT",
"formattedLocationFull": "Lillooet, BC",
"url": "http://ca.indeed.com/viewjob?jk=7779e5fbf4d0613f&qd=cvKKr6L_4R6jh64NGGBfipMcUh0i4g5C-X18qE0gAzC3Ws-qTrT0d3CswmkqzrsGxdgmiLA9Fpf3adh66N9NEAN9-HvuJGR2pUApIXI2XAs&indpubnum=1243433210984925&atk=18u2anmkg0mqi68p",
"jobtitle": "Executive Assistant",
"company": "Xaxli'p",
"onmousedown": "indeed_clk(this, '834');",
"snippet": "The Executive Assistant is responsible for providing administrative and secretarial services and support to the Chief and Council and the Band Administrator... ",
"source": "WorkBC",
"state": "BC",
"sponsored": false,
"country": "CA",
"formattedLocation": "Lillooet, BC",
"jobkey": "7779e5fbf4d0613f",
"expired": false,
"indeedApply": false

Appendix 6: getPostedDateSkill
import json
import pymongo

db = pymongo.MongoClient().jobengine

# Query to get only the skills and posted_date fields
postings = db.posting.find({}, {"posted_date": 1, "skills": 1, "_id": 0})

# Iterate over each posting
for posting in postings:
    # Continue processing only if the skills field is not empty
    if posting['skills'] != "":
        skills = posting['skills']
        # If the skills field is a list, print the date and each skill;
        # otherwise print the date and the content of the skills field
        if isinstance(skills, list):
            for skill in skills:
                print("%s,%s" % (posting['posted_date'], skill.replace(',', '').lower()))
        else:
            print("%s,%s" % (posting['posted_date'], skills.replace(',', '').lower()))

Appendix 7: getSkillsCount
from mrjob.job import MRJob

class skillsCount(MRJob):
    def mapper(self, _, value):
        # Each input line is "<date>,<skill>"
        date, skill = value.split(",", 1)
        yield skill, 1

    def reducer(self, key, values):
        # Emit the count first so the output can be sorted numerically
        yield sum(values), key

if __name__ == '__main__':
    skillsCount.run()

Attributions
Text for Big Data graphic: http://www.bigdata-startups.com/job-descriptions/
Big Data graphic: http://www.wordle.net/
