About me
● Currently working @Amazon
○ Disclaimer: I am not representing Amazon, and I am not talking about anything to do with my current, previous or future experience within Amazon.
● Developing and testing professionally since 2009
○ IBM, Microsoft, Dell, Netease…
● Over 20 presentations worldwide
● Author of the book “How to Test a Time Machine”
● Contact:
○ https://thetestlynx.wordpress.com
○ @thetestlynx on Twitter
Noemi Ferrera
Agenda
● What’s a crawler?
● Why and when do we need a crawler?
● Types of crawlers
● Components of a crawler
○ View/node
○ Arcs/links
○ Visited storage
○ Heat map
● Example
● What can go wrong
● What you need to succeed
All you need to know about crawlers…
What’s a crawler?
A crawler is an automated system that iterates through the parts of an
application with the objective of finding issues or exploring it.
It can be a web application, but also other types of applications.
Definition
Why and when …
● Discovery testing
● Finding particular common issues (e.g. 404 errors)
● Quick coverage
● Generally runs in production - or pre-prod (late)
…do we need a crawler?
Types of Crawlers
● UI vs API
● View First vs Arc First
● Exhaustive vs Shortcutted
● Random vs Smart
Types of crawlers
UI vs API
UI
● Uses the UI to navigate through the application
● Closer to the user’s behaviour
● Checks elements, not only links
API
● Uses the API to navigate through the application
● Faster to run
● Focuses mostly on links and API endpoints
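The core step of an API-style crawler, extracting the links on a page, can be sketched with just the standard library. This is a hypothetical minimal sketch (class name and sample HTML are my own); a real crawler would first fetch the HTML over HTTP:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects every href found in anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Sample HTML standing in for a fetched page
html = '<a href="/docs">Docs</a> <a href="/blog">Blog</a>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['/docs', '/blog']
```

A UI crawler would instead ask the browser driver for the rendered anchor elements, which also catches links created by JavaScript.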
Types of crawlers
View first vs Arc first
View First
● Focuses first on the view, then navigates
● Better when views have many checks but not much navigation
Arc First
● Focuses first on the navigation, then checks the view
● Better when views have few things to check but a long list of navigation points
Types of crawlers
Exhaustive vs Shortcutted
Exhaustive
● Aims to visit the entire application
● Better for smaller applications, or when there is a lot of time to cover it all
● Might make too many calls or take too long to find issues
Shortcutted
● Stops after a set number of visits
● Better if the application is too big and there is not enough time to cover it all
● Might finish after visiting the important parts of the application
Types of crawlers
Random vs Smart
Random
● Could be partially random
● Likely needs to be shortcutted
Smart
● Uses some logic to give priority to parts of the application
● Might finish after visiting the important parts of the application
Components of a crawler
● How do you tell when you are in a different view?
○ Websites: the URL
○ External links - avoid navigating to them
○ Games or harder apps: no obvious identifier
View/Node
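For websites, view identity usually comes down to normalizing URLs and filtering out external links. A minimal sketch with the standard library, assuming a hypothetical site under test (`BASE`) and made-up helper names:

```python
from urllib.parse import urlparse, urljoin, urldefrag

BASE = "https://www.example.com"  # hypothetical application under test

def view_id(current_url, href):
    """Resolve a (possibly relative) href against the current page and
    strip the fragment, so /blog and /blog#intro count as one view."""
    absolute = urljoin(current_url, href)
    url, _fragment = urldefrag(absolute)
    return url

def is_external(url):
    """External links get recorded but never navigated."""
    return urlparse(url).netloc not in ("", urlparse(BASE).netloc)

print(view_id("https://www.example.com/docs/", "../blog#intro"))
# https://www.example.com/blog
print(is_external("https://twitter.com/thetestlynx"))  # True
```

Games and other hard-to-identify apps need different signals (screen hashes, state snapshots), which is why this is one of the trickier components.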
Components of a crawler
● How to navigate?
○ Clicks
○ API calls
○ Swiping and other actions
○ VR apps - other interactions
● Clickable objects?
○ Websites - href
○ All DOM objects
■ Containers?
○ Moving/changing objects
○ Dynamism
○ Hidden elements?
Arcs/links
Components of a crawler
● By usage
● By issues found
● By novelty
● Others
Heat map
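A heat map can be turned into a crawl order by scoring each view on usage, issues found, and novelty, then visiting highest score first. A sketch with illustrative weights (the function name, weights, and sample data are all hypothetical):

```python
import heapq

def prioritize(views):
    """Order views so that issue-prone, high-usage, or never-visited
    views are crawled first. Weights are illustrative, not tuned."""
    heap = []
    for name, info in views.items():
        score = (3 * info["issues_found"]
                 + 2 * info["usage"]
                 + (5 if info["never_visited"] else 0))
        heapq.heappush(heap, (-score, name))  # max-heap via negation
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

views = {
    "/checkout": {"issues_found": 4, "usage": 9, "never_visited": False},
    "/about":    {"issues_found": 0, "usage": 1, "never_visited": False},
    "/new-page": {"issues_found": 0, "usage": 0, "never_visited": True},
}
print(prioritize(views))  # ['/checkout', '/new-page', '/about']
```

Combined with a shortcutted budget, this is what lets a smart crawler stop early and still have covered the parts that matter.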
Crawling with Selenium
class WebCrawlerSelenium:
    def __init__(self):
        driver = webdriver.Chrome()
        self.top_level = 10
        url = "https://www.selenium.dev"
        driver.get(url)
        view = view_class.ViewClass(url)
        self.explore(view, [], 0, driver)
        driver.close()
Example
Start the crawler
Explore the first view
Crawling with Selenium
def explore(self, view, visited, current_level, driver):
    current_level = current_level + 1
    if current_level >= self.top_level:
        sys.exit("Max visits reached")
    visited.append(view.url)
    self.check_status(view.url)
    if view.count == -1:
        view.count = 0
        self.get_all_href(view, driver)  # sets view.count
    while view.count > 0:
        self.get_next_view(view, visited, driver)
Example cont…
Explore each level
Initialize the linked views
Crawling with Selenium
def check_status(self, url):
    status_code = requests.get(url).status_code
    if status_code < 200 or status_code >= 400:
        sys.exit("Error on url " + url)
Example cont 2 …
Check status with API
Crawling with Selenium
def get_all_href(self, view, driver):
    for a_tag in driver.find_elements(By.TAG_NAME, 'a'):
        view.count = view.count + 1
        href = a_tag.get_attribute('href')
        view.actions[href] = a_tag.get_dom_attribute('href')
Example cont 3 …
Get all references for the node
Finding by a tag
Add to actions
Crawling with Selenium
def get_next_view(self, view, visited, driver):
    actions = list(view.actions)
    count = len(actions)
    sub_url = actions[count - 1]
    while sub_url in visited and count > 0:
        count = count - 1
        sub_url = actions[count - 1]
    if count == 0:
        return
    view.count = view.count - 1  # consume one pending action
    subview = view_class.ViewClass(sub_url)
    self.try_click(sub_url, driver)  # UI navigation; an API crawler would use requests.get
Example cont 4 …
Initialize the view
Get all the urls
Click next action
Explore next
Crawling with Selenium
def try_click(self, href, driver):
    xpath = '//a[@href="' + href + '"]'
    try:
        element = driver.find_element(By.XPATH, xpath)
        element.click()
    except Exception:
        print("Could not find the xpath " + xpath)
Example cont 5 …
Tries to click the element
We could add here other actions
What could go wrong
● How to identify views? (already covered)
○ External links?
○ Keeping track of visited views
○ Top level (maximum depth)
● How to identify navigation points/arcs? (already covered)
○ Partial vs full hrefs
● Forms, e.g. login
● Pop-ups
● Cookies
● Dynamic objects
● Stale links
What you need to succeed
● Know: graphs, trees, types of traversals
○ Tracking visited nodes
● App knowledge
○ Experience or a tool (heat map generator…)
● What type of issues are you looking for?
○ API? UI? When do they happen?
● Make sure you cannot cover these with other testing!
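The graph-and-traversal knowledge above boils down to a skeleton every crawler shares: a queue of views to visit and a visited set. A minimal breadth-first sketch over a hypothetical site map (the graph and function name are my own):

```python
from collections import deque

def bfs_crawl(graph, start):
    """Breadth-first traversal with a visited set -- the skeleton
    every crawler builds on, whatever the application type."""
    visited = {start}
    order = []
    queue = deque([start])
    while queue:
        view = queue.popleft()
        order.append(view)  # a real crawler would run checks here
        for neighbour in graph.get(view, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(neighbour)
    return order

# Hypothetical site map: view -> links found on that view
site = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/api", "/"],
    "/blog": ["/"],
}
print(bfs_crawl(site, "/"))  # ['/', '/docs', '/blog', '/docs/api']
```

Swapping the deque for a stack gives depth-first traversal, and swapping it for the heat-map priority queue gives a smart crawler.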
Summary
● What’s a crawler?
● Why and when do we need a crawler?
○ Discovery
○ Common issues
○ Quick coverage
● Types of crawlers
○ UI / API / mixed
○ View first / Arc first
○ Exhaustive / Shortcutted
○ Random / Smart
● Components of a crawler
○ View/node
○ Arcs/links
○ Visited storage
○ Heat map
● Example
● What can go wrong
● What you need to succeed