Selenium for Jobseekers
How Indeed uses Selenium to submit job applications
Seshu Madhav Chaturvedula
Software Engineer
Indeed Inc.
Selenium ~= Test Automation
How Indeed used Selenium to help people get Jobs
more specifically, on Smart Phones
Smart Phones
Evolution becomes revolution
There is an app for everything
Searching for jobs, on Mobile
[Chart: Desktop Traffic vs. Mobile Traffic, percent of total q1 vs time, 2015-2016; x-axis Jun through May, y-axis 0 to 100]
Applying for jobs, on Mobile
51% of all applications submitted on Indeed are mobile
200K mobile applications completed each day
4X increase of mobile Indeed Applies in the last year
Employers who accept mobile applications receive twice as many quality applicants
How do people apply for jobs?
Employer website
How do people want to apply for jobs?
Employer website Employer website
Typical problems with Employer careers portals
non-mobile friendly
slow
16.06s
Indeed’s idea - MoBolt
Employer website Employer website Mobile friendly website
Mobolt
Mobile friendly website
API Scraping
Employer website
Mobolt
Mobile friendly website
Scraping
Employer website
Capabilities of Employer websites to emulate
1. Jobs
  • Job description
  • List of pages in that job's application flow
  • List of questions asked on each page
2. Authentication
  • Don’t want to add another authentication
  • Privacy concerns (PII)
3. Applications
  • Capture all the info the employer website expects
  • Privacy concerns (PII)
  • Provide a fallback mechanism if we fail
Web App (jquery-mobile)
Employer website
Web App
Selenium Code
Jobs Service
Employer website
Jobs
Job (in Jobs DB)
[
  {
    "job": "Installation Technician",
    "job_id": "http://foo.com/jobs/1234",
    "page": "Personal Information",
    "label": "First Name",
    "widget_type": "text_box",
    "locator_type": "xpath",
    "locator": "./first-name"
  },
  ...
]
The web application now understands how to represent this question in the UI.
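As a rough illustration of how a web app might turn such a descriptor into a form field, here is a minimal sketch. The function name `render_question`, the HTML shape, and every `widget_type` other than `text_box` are assumptions made for illustration; the talk does not show the actual rendering code.

```python
from html import escape

# Hypothetical mapping from widget_type to an HTML input type.
# Only "text_box" appears in the talk; the other entries are assumptions.
WIDGET_TO_INPUT_TYPE = {
    "text_box": "text",
    "password_box": "password",
    "check_box": "checkbox",
}


def render_question(question: dict) -> str:
    """Return an HTML snippet for one scraped question descriptor."""
    input_type = WIDGET_TO_INPUT_TYPE.get(question["widget_type"])
    if input_type is None:
        raise ValueError(f"unsupported widget_type: {question['widget_type']}")
    label = escape(question["label"])
    # The locator is reused as the field name so an answer can be matched
    # back to the employer-site element later (an illustrative choice).
    name = escape(question["locator"], quote=True)
    return f'<label>{label} <input type="{input_type}" name="{name}"></label>'


question = {
    "job": "Installation Technician",
    "job_id": "http://foo.com/jobs/1234",
    "page": "Personal Information",
    "label": "First Name",
    "widget_type": "text_box",
    "locator_type": "xpath",
    "locator": "./first-name",
}
print(render_question(question))
```

The point of the descriptor is that the web app never needs to know anything about the employer site itself: the label and widget type are enough to draw the field.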
Web App
Selenium Code
Jobs Service
Employer website
Jobs
Auth. Service
Web App
Selenium Code   Selenium Code
Jobs Service
Employer website
Applications   Jobs
Auth. Service
Application (in applications DB)
[
  {
    "job": "Installation Technician",
    "job_id": "http://foo.com/jobs/1234",
    "page": "Personal Information",
    "label": "First Name",
    "widget_type": "text_box",
    "locator_type": "xpath",
    "locator": "./first-name",
    "answer": "Mitchell Johnson"
  },
  ...
]
The web application appends the answer given by the job seeker.
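A minimal sketch of that step, assuming answers arrive keyed by (page, label). The function name `build_application` and the keying scheme are hypothetical; only the record fields come from the slides.

```python
import copy


def build_application(questions, answers):
    """Copy each question descriptor and attach the job seeker's answer.

    `answers` maps (page, label) to the text the job seeker entered.
    A missing answer raises KeyError so incomplete applications surface early.
    """
    application = []
    for question in questions:
        record = copy.deepcopy(question)  # leave the Jobs DB record untouched
        record["answer"] = answers[(question["page"], question["label"])]
        application.append(record)
    return application


questions = [
    {
        "job": "Installation Technician",
        "job_id": "http://foo.com/jobs/1234",
        "page": "Personal Information",
        "label": "First Name",
        "widget_type": "text_box",
        "locator_type": "xpath",
        "locator": "./first-name",
    }
]
answers = {("Personal Information", "First Name"): "Mitchell Johnson"}
application = build_application(questions, answers)
print(application[0]["answer"])
```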
Web App
Selenium Code   Selenium Code
Jobs Service   Apply Service
Employer website
Selenium Code
Applications   Jobs
Auth. Service
How does the Apply Service replay the application?
[
  {
    "job": "Installation Technician",
    "job_id": "http://foo.com/jobs/1234",
    "page": "Personal Information",
    "label": "First Name",
    "widget_type": "text_box",
    "locator_type": "xpath",
    "locator": "./first-name",
    "answer": "Mitchell Johnson"
  },
  ...
]
1. Uses Selenium to load the job URL
2. Locates WebElements (using locator_type and locator)
3. Types the answer
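The replay steps can be sketched as follows. This is not Indeed's implementation: `replay_application` is a hypothetical name, page-to-page navigation and submission are left out, and only the `xpath` locator type appears in the talk (the other keys are assumptions). The function only relies on `get`, `find_element(by, value)` and `send_keys`, which Selenium's Python WebDriver and WebElement provide, so a recording stand-in is used here to keep the example runnable without a browser.

```python
# Locator strategy strings as defined by Selenium's Python bindings
# (for example, selenium.webdriver.common.by.By.XPATH == "xpath").
LOCATOR_STRATEGIES = {
    "xpath": "xpath",
    "css": "css selector",  # assumed locator_type name
    "id": "id",             # assumed locator_type name
    "name": "name",         # assumed locator_type name
}


def replay_application(driver, application):
    """Load the job URL, locate each element, and type its answer."""
    if not application:
        return 0
    driver.get(application[0]["job_id"])  # job_id holds the job URL
    typed = 0
    for record in application:
        by = LOCATOR_STRATEGIES[record["locator_type"]]
        element = driver.find_element(by, record["locator"])
        element.send_keys(record["answer"])
        typed += 1
    return typed


class RecordingElement:
    """Stand-in for a WebElement that records what was typed."""

    def __init__(self, log, by, value):
        self._log, self._by, self._value = log, by, value

    def send_keys(self, text):
        self._log.append(("type", self._by, self._value, text))


class RecordingDriver:
    """Stand-in for a WebDriver; swap in a real Selenium driver to hit a site."""

    def __init__(self):
        self.log = []

    def get(self, url):
        self.log.append(("get", url))

    def find_element(self, by, value):
        return RecordingElement(self.log, by, value)


application = [
    {
        "job_id": "http://foo.com/jobs/1234",
        "locator_type": "xpath",
        "locator": "./first-name",
        "answer": "Mitchell Johnson",
    }
]
driver = RecordingDriver()
replay_application(driver, application)
print(driver.log)
```

Because the application record already carries the locator next to the answer, the replay code stays generic and holds no site-specific knowledge.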
If we do everything we have talked about so far perfectly
(or at least nearly perfectly!)
Non-Moboltized   Moboltized
How do we extend this for many employer sites?
Web App   Jobs Service   Apply Service   Auth. Service   Applications   Jobs
Selenium Code: S S S S S S
Employer Websites
Then, for each service in turn, the Selenium Code is replaced by:
C C C C C C
backend framework
This is how we extended Mobolt for many employer sites.
Some Challenges
1. Stay in sync with Employer Website
2. Captcha
3. Dependent Questions & Knockout Questions
4. Testing our Implementations
Current Work
  • Real-time applications
  • Reduce implementation time
  • Reducing drop-off rate, perform better than Desktop
Sasi Kumar Baratam
Software Engineer
Mobile friendly website
Jobs Service   Apply Service   Web App
Browser Service
Mobolt
Requirements
1. HTTP Browser Service
2. Returning Sessions
3. Purge Expired Sessions
4. No Single Point of Failure
5. State and Health APIs
6. Logging and real-time Monitoring
7. Resilient to Browser Memory Leaks
8. Horizontally Scalable
Native Selenium Grid
Client Code
Hub
Node   Node   Node
Native Selenium Grid - Limitations
1. Hub is the single point of failure. If it fails, we lose access to all nodes
2. Difficult to add logging and monitoring
3. Can’t restart node gracefully
Mobolt Selenium Grid
Client Code
AWS ELB
Grid Servers
M
Grid Scheduler
Grid Node 1: Maintenance scripts, Selenium Server
Grid Node n: Maintenance scripts, Selenium Server
Grid Servers
  • Web Server bundled with JDK
  • State, Node, Session, Health HTTP endpoints
  • Maintains inventory of Nodes and Sessions in MongoDB
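To make the endpoint idea concrete, here is a small sketch of the routing logic only. The grid servers in the talk run on the web server bundled with the JDK; this Python version is just an illustration of the shape. The paths (`/health`, `/state`, `/nodes`, `/sessions`), the response fields, and the in-memory dictionaries standing in for the MongoDB inventory are all assumptions.

```python
import json


def handle_request(path, nodes, sessions):
    """Return (status_code, json_body) for a GET on a grid server endpoint.

    `nodes` and `sessions` are plain dicts standing in for the inventory
    that the talk says is kept in MongoDB.
    """
    if path == "/health":
        healthy = any(node["alive"] for node in nodes.values())
        return (200 if healthy else 503), json.dumps({"healthy": healthy})
    if path == "/state":
        body = {"nodes": len(nodes), "sessions": len(sessions)}
        return 200, json.dumps(body, sort_keys=True)
    if path == "/nodes":
        return 200, json.dumps(nodes, sort_keys=True)
    if path == "/sessions":
        return 200, json.dumps(sessions, sort_keys=True)
    return 404, json.dumps({"error": "not found"})


nodes = {"node-1": {"alive": True}, "node-2": {"alive": False}}
sessions = {"session-1": {"node": "node-1"}}
print(handle_request("/state", nodes, sessions))
```

Keeping the inventory outside the server process is what lets several grid servers sit behind a load balancer without any one of them becoming a single point of failure.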
Grid Node
Selenium server
  • Selenium Server creates browser based on Grid Server request
Maintenance scripts
  • Set of shell scripts which monitor node health
  • Restart node if available memory or available sessions is too low
  • Kill unresponsive browsers
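The node maintenance checks are shell scripts in the talk; the decision logic behind them might look roughly like the sketch below. The thresholds, the `NodeStatus` type, and the way responsiveness is tracked are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Assumed thresholds; the talk only says "too low".
MIN_AVAILABLE_MEMORY_MB = 1024
MIN_AVAILABLE_SESSIONS = 1


@dataclass
class NodeStatus:
    available_memory_mb: int
    available_sessions: int


def should_restart(status: NodeStatus) -> bool:
    """Restart when available memory or available sessions is too low."""
    return (
        status.available_memory_mb < MIN_AVAILABLE_MEMORY_MB
        or status.available_sessions < MIN_AVAILABLE_SESSIONS
    )


def unresponsive_browsers(last_response_at, now, max_silence_s=120.0):
    """Return ids of browsers that have not responded within max_silence_s."""
    return sorted(
        browser_id
        for browser_id, seen_at in last_response_at.items()
        if now - seen_at > max_silence_s
    )


print(should_restart(NodeStatus(available_memory_mb=512, available_sessions=4)))
print(unresponsive_browsers({"b1": 0.0, "b2": 950.0}, now=1000.0))
```

Restarting on low memory is a blunt but effective way to stay resilient to browser memory leaks, since the leak itself never has to be diagnosed.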
Grid Scheduler
  • HeartBeat - Pings nodes to check if they are running or not
  • Purging Expired Browsers
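The two scheduler jobs can be sketched as plain functions over an inventory. The `ping` callable, the `last_used` field and the TTL value are hypothetical; in a real deployment `ping` would presumably be an HTTP call to the node and the inventory would live in the database rather than in dictionaries.

```python
def heartbeat(nodes, ping):
    """Ping every node, record whether it is alive, and return the dead ones."""
    for node in nodes.values():
        node["alive"] = bool(ping(node["url"]))
    return sorted(node_id for node_id, node in nodes.items() if not node["alive"])


def purge_expired(sessions, now, ttl_s):
    """Drop sessions idle for longer than ttl_s and return their ids."""
    expired = sorted(
        session_id
        for session_id, session in sessions.items()
        if now - session["last_used"] > ttl_s
    )
    for session_id in expired:
        del sessions[session_id]
    return expired


nodes = {
    "node-1": {"url": "http://node-1:4444"},
    "node-2": {"url": "http://node-2:4444"},
}
sessions = {"s1": {"last_used": 100.0}, "s2": {"last_used": 990.0}}

# A fake ping that reports node-2 as down.
dead = heartbeat(nodes, ping=lambda url: "node-2" not in url)
purged = purge_expired(sessions, now=1000.0, ttl_s=600.0)
print(dead, purged)
```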
Machine Configuration (AWS)
5 Grid Servers (2vCPU/3.75G memory, ec2-classic)
25 Grid Nodes (2vCPU/15.75G memory, ubuntu, xvfb-enabled)
Comparative Analysis
                Native Selenium Grid    Mobolt Selenium Grid
Architecture    Tree of Grid Nodes      Bipartite graph
[Also compared, cell values not preserved in the slide text: HeartBeat, Returning Sessions, Purge Expired Sessions, State and Health APIs, No Single point of failure, Logging and Monitoring, Resilient to memory leaks, Graceful Node restart, Horizontally Scalable]
Challenges
Rogue Sessions: Sometimes browsers don’t respond as expected
Gateway timeouts: If session creation takes more than 1 minute, then we respond with a timeout error.
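One way to enforce the one-minute limit on session creation is to run it in a worker thread and stop waiting after a deadline. This is a sketch under assumptions: the function name is hypothetical, and the HTTP status codes (200 and 504) are an illustrative choice, since the slide only says a timeout error is returned.

```python
import concurrent.futures
import time

SESSION_CREATION_TIMEOUT_S = 60  # "1 minute" from the slide


def create_session_with_timeout(create_session, timeout_s=SESSION_CREATION_TIMEOUT_S):
    """Return (status, session). Status 504 means creation took too long."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(create_session)
    try:
        return 200, future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The worker keeps running; a browser it eventually creates has no
        # owner and would have to be cleaned up later as an expired session.
        return 504, None
    finally:
        executor.shutdown(wait=False)


def slow_create():
    time.sleep(0.3)
    return "late-session"


print(create_session_with_timeout(lambda: "session-1", timeout_s=1.0))
print(create_session_with_timeout(slow_create, timeout_s=0.05))
```

The comment in the timeout branch hints at how the two challenges connect: a request that times out can still leave a browser behind, which is one source of rogue sessions.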
Grid Traffic Statistics
~ 625 max req per min
~ 89K max req per day
~ 2.0M total req in last 30 days
~ 8.7M total req in last 1 year
Future Work
Browser Pool: Maintain a pool of browsers and allocate browsers from pool
Autoscale: Autoscale grid nodes and servers
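The browser pool is listed as future work, so the following is only a sketch of what such a pool could look like, not something described in the talk. The class name, the `create_browser` factory and the blocking behaviour are assumptions.

```python
import queue


class BrowserPool:
    """Pre-create browsers and hand them out on request."""

    def __init__(self, create_browser, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(create_browser())

    def acquire(self, timeout_s=None):
        """Take an idle browser; raises queue.Empty if none frees up in time."""
        return self._idle.get(timeout=timeout_s)

    def release(self, browser):
        """Return a browser to the pool for reuse."""
        self._idle.put(browser)


counter = iter(range(1, 100))
pool = BrowserPool(create_browser=lambda: f"browser-{next(counter)}", size=2)
first = pool.acquire(timeout_s=0.1)
pool.release(first)
print(first)
```

Allocating from a warm pool would move browser start-up off the request path, which is the step that risks running into the one-minute session creation limit.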
Learn More
Engineering blog & talks - indeed.tech
Open Source - opensource.indeedeng.io
Careers - indeed.jobs
Twitter - @IndeedEng
Thank you!
seshu@indeed.com
sasibaratam@indeed.com
FAQs
How is Captcha Handled?
Employer Page: Sign Up (Username, Password, Other questions, Captcha Image, Submit)
Selenium Code
Mobolt Page: Sign Up (Username: blah, Password: blah, Captcha: V4XBG, Submit)
Selenium Code
Challenges in scraping model?
Stay in sync with Employer website
  • Jobs get added and deleted
  • Page flow for a job application
  • CSS Selectors / XPaths change
  • New questions get added OR questions get updated
Job Seekers use a password with us, then change password on the ATS
Why not curl based crawling?
Dynamic things happen on employer websites
  • Dependent questions
  • Job dependent flows
It is difficult and in some cases impossible to scrape using curl.
We want to write content to the web pages

Editor's Notes

  • #5 Generally Selenium is used more in the context of Test Automation
  • #6 In this talk, we are going to see …
  • #8 Monica will design
  • #9 At Indeed we are hiring all the time and are growing quickly. Over the past two years, we’ve gone from adding a few new developers per month to having up to 50 start on a single day. This rapid growth has added even more pressure onto the setup process.
  • #10 At Indeed we also gathered some stats about number of people applying for jobs on Mobile.
  • #45 When you do a similar thing many times, there is an opportunity to abstract out the core repetitive part. We called that the backend framework.
  • #50 Build in as much automatic recovery as possible.
  • #54 by ensuring that our job data is complete and current
  • #62 Test
  • #67 com.sun.net.httpserver.HttpServer
  • #84 If that sounds interesting to you, you should follow our engineering blog, github, and social media to learn more.
  • #86 If that sounds interesting to you, you should follow our engineering blog, github, and social media to learn more.