3. What is WebDriver?
From W3C Recommendation:
WebDriver is a remote control interface that enables introspection and control of user agents.
Provided is a set of interfaces to discover and manipulate DOM elements in web documents and to control the
behavior of a user agent. It is primarily intended to allow web authors to write tests that automate a user agent
from a separate controlling process, but may also be used in such a way as to allow in-browser scripts to control a
— possibly separate — browser.
https://www.w3.org/TR/webdriver1/
4. Can it be used for scraping?
WebDriver can:
● Open a browser window
● Open a URL
● Identify the page status (loads, timeout, redirect)
● Identify and interact with elements (click links, extract data, tell if an element is present or not)
So … sounds good?
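The capabilities above map directly onto the Selenium WebDriver API. Below is a minimal sketch, assuming Selenium 4 and a Firefox driver on PATH (both assumptions, as is the example URL); `element_present` works with any driver-like object exposing Selenium's `find_elements`:

```python
def element_present(driver, css_selector):
    # True if at least one element matches the CSS selector.
    # "css selector" is the locator strategy string Selenium's By.CSS_SELECTOR maps to.
    return len(driver.find_elements("css selector", css_selector)) > 0

def main():
    # Lazy imports: selenium is only needed for a live browser run.
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException

    driver = webdriver.Firefox()                # open a browser window
    driver.set_page_load_timeout(30)
    try:
        driver.get("https://example.org/")      # open a URL; raises on timeout
    except TimeoutException:
        print("page timed out")
    print("final URL:", driver.current_url)     # differs from the request URL on redirect
    if element_present(driver, "a"):            # interact with elements
        driver.find_element("css selector", "a").click()
    driver.quit()

# main()  # uncomment to drive a real browser
```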
5. Pros and Cons
WebDriver actually loads the entire page in a browser: it is slow, but it handles backend-dependent sites better than wget
It offers more interaction with page elements, but this means you need to know the target site well
It is a “what you see is what you get” solution, which is good and bad at the same time
6. Target is an iframe (the defaced page)
● The defaced page sits inside an iframe, but the metadata is outside the iframe
● A screenshot is needed to scan captures more quickly
● Obstacles: captchas, JavaScript, popups
● URL schema for metadata: http://zone-h.com/mirror/id/1002
● Page mirror: http://zonehmirrors.org/defaced/alldas/2000/10/12/www.jpbdbkl.gov.my/
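Because the metadata URL schema is mechanical, ID enumeration can be scripted. A small helper following the schema on this slide (the function name and constant are ours):

```python
# Schema from the slide: http://zone-h.com/mirror/id/<N>
ZONE_H_METADATA = "http://zone-h.com/mirror/id/{}"

def metadata_url(mirror_id):
    # Build the metadata page URL for a numeric mirror ID.
    return ZONE_H_METADATA.format(int(mirror_id))
```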
7. Workflow
● Open Browser (profile with Noscript enabled)
● Go to https://www.zone-h.org/mirror/id/{Number}
● If page loads: capture metadata, load iframe
○ If iframe loads: capture iframe, take full-page screenshot
● If several consecutive pages do not load:
○ Assume you are in a bad batch and increment first by 1-9, then by 10-100
● Continue until banned for the day (~800 requests) or the target is reached
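The bad-batch stepping above can be sketched as a pure ID function, decoupled from the browser. The exact thresholds below are one possible reading of "increment first by 1-9, then by 10-100" and are assumptions:

```python
def next_mirror_id(current_id, consecutive_failures):
    # No failures: step to the next mirror ID.
    if consecutive_failures == 0:
        return current_id + 1
    # A few failures in a row: probe ahead in small steps (1-9).
    if consecutive_failures < 10:
        return current_id + consecutive_failures
    # Many failures: assume a bad batch and jump in larger steps,
    # growing by 10 per extra failure, capped at 100.
    return current_id + min(10 * (consecutive_failures - 9), 100)
```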
9. Application
Go where no wget has gone before -> scrape interactive, high-value sites
Very customizable scraping solutions
More target interaction possible -> queries, searches etc.
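"More target interaction" typically means filling and submitting forms, e.g. a site's own search box. A hedged sketch (the selector default is a placeholder; `run_search` accepts any driver-like object with Selenium's `find_element`, and the returned element's `clear`/`send_keys`/`submit` are standard WebElement methods):

```python
def run_search(driver, query, box_selector="input[name=q]"):
    # Locate the search box, type the query, and submit its form.
    box = driver.find_element("css selector", box_selector)
    box.clear()
    box.send_keys(query)
    box.submit()
```

After submission, the same element-inspection techniques from the workflow slide apply to the results page.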