Scraping recalcitrant web sites with Python & Selenium

Transcript

  • 1. Scraping recalcitrant web sites with Python & Selenium
    Roger Barnes, SyPy, July 2012
  • 2. Some sites suck
  • 3. Some sites suck - "for your own good"
    For security reasons, each button is an image, dynamically generated by a hash wrapped in a mess of JavaScript, randomly placed
  • 4. ...but they work in a web browser!
    Let's use the web browser to scrape them
  • 5. Enter Selenium
    Selenium automates browsers. That's it.
  • 6. Selenium can...
      ● navigate (windows, frames, links)
      ● find elements and parse attributes
      ● interact and trigger events (click, type, ...)
      ● capture screenshots
      ● run JavaScript
      ● let the browser take care of the hard stuff (cookies, JavaScript, sessions, profiles, DOM)
    Comes with various components and bindings ... including Python
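As a rough illustration of the capabilities listed on this slide, here is a minimal sketch written against the current Selenium Python bindings, not the 2012 API used in the talk. The function name `page_summary` and its return shape are made up for this example. It accepts any WebDriver instance, so nothing here starts a browser by itself.

```python
def page_summary(driver, url, frame=None):
    """Load a page and collect a few facts about it.

    `driver` is assumed to be a Selenium WebDriver, for example the
    object returned by `selenium.webdriver.Firefox()`.
    """
    driver.get(url)                                   # navigate
    if frame is not None:
        driver.switch_to.frame(frame)                 # step into a frame
    # "tag name" is the value of selenium's By.TAG_NAME constant
    links = driver.find_elements("tag name", "a")     # find elements
    hrefs = [link.get_attribute("href") for link in links]  # parse attributes
    title = driver.execute_script("return document.title;")  # run JavaScript
    png = driver.get_screenshot_as_png()              # capture a screenshot
    return {"title": title, "hrefs": hrefs, "screenshot_bytes": len(png)}


# Possible usage (requires selenium and a browser driver to be installed):
#   from selenium import webdriver
#   driver = webdriver.Firefox()
#   try:
#       print(page_summary(driver, "https://example.com"))
#   finally:
#       driver.quit()
```

Interaction (clicking, typing) follows the same pattern: locate an element, then call `click()` or `send_keys(...)` on it.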
  • 7. General Recipe
    Ingredients:
      ● Firefox (or Chrome)
      ● Firebug (or Chrome dev tools)
      ● Selenium IDE
          ○ record a session, write less code
      ● Python and its batteries
      ● python-selenium
      ● xvfb and pyvirtualdisplay (optional)
      ● other libraries to taste
          ○ e.g. image manipulation, database access, DOM parsing, OCR
  • 8. General Recipe
    Method:
      ● Install requirements (apt-get, pip etc.)
          ○ sudo apt-get install xvfb firefox
          ○ pip install selenium pyvirtualdisplay
      ● Start up Firefox and Selenium IDE
      ● Record a "test" run through the site
          ○ Add in some assertions along the way
      ● Export test as Python script
      ● Hack from there
          ○ Loops
          ○ Image/data extraction
          ○ Wrangling data into a database
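The last step of the recipe, getting scraped values into a database, might look something like the sketch below. It uses only the standard library's `sqlite3`. The `balances` table, its columns, and the `save_balance` helper are all hypothetical; the talk does not say which database or schema was used.

```python
import sqlite3
from datetime import date


def save_balance(conn, account, balance, day=None):
    """Store one scraped balance per account per day.

    Re-running the scraper on the same day overwrites that day's row
    instead of adding a duplicate.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS balances ("
        "account TEXT NOT NULL, day TEXT NOT NULL, balance REAL NOT NULL, "
        "PRIMARY KEY (account, day))"
    )
    day = (day or date.today()).isoformat()
    conn.execute(
        "INSERT OR REPLACE INTO balances (account, day, balance) VALUES (?, ?, ?)",
        (account, day, balance),
    )
    conn.commit()


# Possible usage, with a file-backed database:
#   conn = sqlite3.connect("balances.db")
#   save_balance(conn, "savings", 1234.56)
```

Keying on account and day keeps the table safe to fill from a scraper that runs on a schedule and is occasionally re-run by hand.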
  • 9. Example from Selenium IDE

        class Ingdirect2(unittest.TestCase):
            def setUp(self):
                self.driver = webdriver.Firefox()
                self.driver.implicitly_wait(30)
                self.base_url = "https://www.ingdirect.com.au"
                self.verificationErrors = []

            def test_ingdirect2(self):
                driver = self.driver
                driver.get(self.base_url + "/client/index.aspx")
                driver.switch_to_frame("body")  # Had to add this
                ...
                driver.find_element_by_id("txtCIF").clear()
                driver.find_element_by_id("txtCIF").send_keys("12345678")
                driver.find_element_by_id("objKeypad_B1").click()
                driver.find_element_by_id("objKeypad_B2").click()
                driver.find_element_by_id("objKeypad_B3").click()
                driver.find_element_by_id("objKeypad_B4").click()
                driver.find_element_by_id("btnLogin").click()
                self.assertTrue(self.is_element_present(By.ID, "ctl2_lblBalance"))

    But what about that dang keypad?
  • 10. PIL saves the day

        # Get screenshot for extraction of button images
        screenshot = driver.get_screenshot_as_base64()
        im = Image.open(StringIO.StringIO(base64.decodestring(screenshot)))
        table = driver.find_element_by_xpath('//*[@id="objKeypad_divShowAll"]/table')
        all_buttons = table.find_elements_by_tag_name("input")

        # Determine md5sum of each button by cropping based on element positions
        for button in all_buttons:
            button_image = im.crop(getcropbox(button))
            hexid = hashlib.md5(button_image.tostring()).hexdigest()
            button_mapping[hexid] = button.get_attribute("id")

        # Now we know which button is which (based on previous lookup), enter the PIN
        for char in self.pin:
            driver.find_element_by_id(button_mapping[hex_mapping[char]]).click()
        driver.find_element_by_id("btnLogin").click()
        # We're in!!!11one
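The slide calls `getcropbox` without showing it. The sketch below is a guess at what such a helper could look like, based on the `location` and `size` dictionaries that Selenium exposes on a WebElement. `map_buttons` and its `crop_bytes` parameter are hypothetical names that restate the slide's loop in a form that can be tried without a browser. It assumes screenshot pixels line up one-to-one with page coordinates, which may not hold on high-DPI displays.

```python
import hashlib


def getcropbox(element):
    """Return a (left, upper, right, lower) box for a Selenium WebElement."""
    loc, size = element.location, element.size
    left, upper = int(loc["x"]), int(loc["y"])
    return (left, upper, left + int(size["width"]), upper + int(size["height"]))


def map_buttons(buttons, crop_bytes):
    """Map the md5 hex digest of each button's pixels to its element id.

    `crop_bytes(box)` must return the raw pixel bytes inside `box`.
    With a current Pillow image `im` that could be:
        lambda box: im.crop(box).tobytes()
    """
    mapping = {}
    for button in buttons:
        digest = hashlib.md5(crop_bytes(getcropbox(button))).hexdigest()
        mapping[digest] = button.get_attribute("id")
    return mapping
```

The slide's code targets Python 2 and the PIL of the time. On current versions, `Image.tostring()` has been replaced by `tobytes()`, `base64.decodestring` by `base64.b64decode`, and `StringIO.StringIO` by `io.BytesIO` for binary data.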
  • 11. But why do all this?
    It's my data! ... and I'll graph if I want to
    * Actual results may vary. Graph indicates open inodes, not high-roller gambling problem
  • 12. That's all folks
    Slides
      ● http://bit.ly/scrapium
    Code
      ● https://gist.github.com/3015852
    Me
      ● https://twitter.com/mindsocket
      ● https://github.com/mindsocket
      ● roger@mindsocket.com.au