Scraping recalcitrant web sites with Python & Selenium

  • 1. Scraping recalcitrant web sites with Python & Selenium
       Roger Barnes, SyPy, July 2012
  • 2. Some sites suck
  • 3. Some sites suck - "for your own good"
       For security reasons, each button is an image, dynamically generated by a hash wrapped in a mess of javascript, randomly placed
  • 4. ...but they work in a web browser!
       Let's use the web browser to scrape them
  • 5. Enter Selenium
       Selenium automates browsers
       That's it
  • 6. Selenium can...
       ● navigate (windows, frames, links)
       ● find elements and parse attributes
       ● interact and trigger events (click, type, ...)
       ● capture screenshots
       ● run javascript
       ● let the browser take care of the hard stuff (cookies, javascript, sessions, profiles, DOM)
       Comes with various components and bindings ... including python
  • 7. General Recipe
       Ingredients:
       ● firefox (or chrome)
       ● firebug (or chrome dev tools)
       ● Selenium IDE
         ○ record a session, write less code
       ● python and its batteries
       ● python-selenium
       ● xvfb and pyvirtualdisplay (optional)
       ● other libraries to taste
         ○ e.g. image manipulation, database access, DOM parsing, OCR
  • 8. General Recipe
       Method:
       ● Install requirements (apt-get, pip etc)
         ○ sudo apt-get install xvfb firefox
         ○ pip install selenium pyvirtualdisplay
       ● Start up Firefox and Selenium IDE
       ● Record a "test" run through site
         ○ Add in some assertions along the way
       ● Export test as Python script
       ● Hack from there
         ○ Loops
         ○ Image/data extraction
         ○ Wrangling data into a database
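The optional xvfb and pyvirtualdisplay ingredients let the browser run without a visible screen. A minimal sketch of how that might be wired up, assuming both `selenium` and `pyvirtualdisplay` are installed; the `browser` helper and the display size are hypothetical choices for illustration, not something shown in the talk:

```python
from contextlib import contextmanager


@contextmanager
def browser(headless=True):
    """Yield a Firefox WebDriver, optionally inside a virtual X display.

    Hypothetical helper: the name and the 1280x1024 size are assumptions.
    """
    display = None
    if headless:
        # Imported lazily because xvfb/pyvirtualdisplay are optional.
        from pyvirtualdisplay import Display

        display = Display(visible=0, size=(1280, 1024))
        display.start()
    try:
        from selenium import webdriver

        driver = webdriver.Firefox()
        try:
            yield driver
        finally:
            # Always close the browser, even if the scrape raised.
            driver.quit()
    finally:
        # Stop the virtual display last, after the browser is gone.
        if display is not None:
            display.stop()


# Possible usage (needs Firefox, xvfb and a matching driver binary):
#
#     with browser() as driver:
#         driver.get("https://example.com/")
#         print(driver.title)
```

Wrapping start-up and tear-down in one context manager means a failed scrape should not leave stray Firefox or Xvfb processes behind.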
  • 9. Example from Selenium IDE
       class Ingdirect2(unittest.TestCase):
           def setUp(self):
               self.driver = webdriver.Firefox()
               self.driver.implicitly_wait(30)
               self.base_url = ""
               self.verificationErrors = []

           def test_ingdirect2(self):
               driver = self.driver
               driver.get(self.base_url + "/client/index.aspx")
               driver.switch_to_frame("body")  # Had to add this
               ...
               driver.find_element_by_id("txtCIF").clear()
               driver.find_element_by_id("txtCIF").send_keys("12345678")
               driver.find_element_by_id("objKeypad_B1").click()
               driver.find_element_by_id("objKeypad_B2").click()
               driver.find_element_by_id("objKeypad_B3").click()
               driver.find_element_by_id("objKeypad_B4").click()
               driver.find_element_by_id("btnLogin").click()
               self.assertTrue(self.is_element_present(By.ID, "ctl2_lblBalance"))

       But what about that dang keypad?
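The exported test uses the 2012-era bindings; newer Selenium 4 releases no longer ship the `find_element_by_*` helpers or `switch_to_frame`. A rough sketch of the same login steps with the current call style follows. The `login` function, its parameters, and the `"body"` frame name are assumptions for illustration; the locator strategy is passed as the plain string `"id"`, which is the value of Selenium's `By.ID` constant:

```python
def login(driver, base_url, client_number, keypad_button_ids):
    """Replay the recorded login steps against any WebDriver-like object.

    Hypothetical helper: element ids are taken from the slide, everything
    else (names, arguments) is an assumption.
    """
    driver.get(base_url + "/client/index.aspx")
    # The form lives inside a frame, so switch into it first.
    driver.switch_to.frame("body")

    # "id" is the value of selenium.webdriver.common.by.By.ID.
    client_field = driver.find_element("id", "txtCIF")
    client_field.clear()
    client_field.send_keys(client_number)

    # Click the keypad buttons in the order given by the caller.
    for button_id in keypad_button_ids:
        driver.find_element("id", button_id).click()

    driver.find_element("id", "btnLogin").click()
```

Taking the keypad ids as an argument keeps the replay logic separate from working out which button shows which digit.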
  • 10. PIL saves the day
       # Get screenshot for extraction of button images
       screenshot = driver.get_screenshot_as_base64()
       im = ...
       table = driver.find_element_by_xpath('//*[@id="objKeypad_divShowAll"]/table')
       all_buttons = table.find_elements_by_tag_name("input")

       # Determine md5sum of each button by cropping based on element positions
       for button in all_buttons:
           button_image = im.crop(getcropbox(button))
           hexid = hashlib.md5(button_image.tostring()).hexdigest()
           button_mapping[hexid] = button.get_attribute("id")

       # Now we know which button is which (based on previous lookup), enter the PIN
       for char in ...:
           driver.find_element_by_id(button_mapping[hex_mapping[char]]).click()
       driver.find_element_by_id("btnLogin").click()
       # We're in!!!11one
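The slide calls `getcropbox` and relies on `hex_mapping` without showing either. One way these pieces might fit together is sketched below using only the standard library. The function bodies are assumptions, not the speaker's code: `getcropbox` is guessed from Selenium's `location` and `size` element properties, and the button pixels are passed in as raw bytes so the lookup logic can be tried without a browser or PIL:

```python
import hashlib


def getcropbox(element):
    """Return a PIL-style (left, upper, right, lower) box for an element.

    Assumes `element` exposes Selenium's `location` and `size` dicts and
    that screenshot pixels line up 1:1 with page coordinates.
    """
    left = int(element.location["x"])
    upper = int(element.location["y"])
    right = left + int(element.size["width"])
    lower = upper + int(element.size["height"])
    return (left, upper, right, lower)


def build_button_mapping(button_pixels):
    """Map md5(pixel bytes) -> element id for the buttons on this page load.

    `button_pixels` is {element id: raw bytes of the cropped button image}.
    """
    return {
        hashlib.md5(pixels).hexdigest(): element_id
        for element_id, pixels in button_pixels.items()
    }


def buttons_for_pin(pin, button_mapping, hex_mapping):
    """Translate each PIN character into the element id to click.

    `hex_mapping` is the "previous lookup": {digit: md5 of that digit's image}.
    """
    return [button_mapping[hex_mapping[char]] for char in pin]
```

This only works if each digit's image renders to identical pixels on every page load; if the site adds noise or rescales the buttons, a fuzzier comparison or OCR would be needed instead of an exact hash.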
  • 11. But why do all this?
       It's my data! ... and I'll graph if I want to
       * Actual results may vary. Graph indicates open inodes, not high-roller gambling problem
  • 12. That's all folks