Scraping recalcitrant web sites with Python & Selenium
Roger Barnes, SyPy, July 2012
  1. Scraping recalcitrant web sites with Python & Selenium
     Roger Barnes, SyPy, July 2012
  2. Some sites suck
  3. Some sites suck - "for your own good"
     For security reasons, each button is an image, dynamically generated by a hash wrapped in a mess of javascript, randomly placed
  4. ...but they work in a web browser!
     Let's use the web browser to scrape them
  5. Enter Selenium
     Selenium automates browsers
     That's it
  6. Selenium can...
     ● navigate (windows, frames, links)
     ● find elements and parse attributes
     ● interact and trigger events (click, type, ...)
     ● capture screenshots
     ● run javascript
     ● let the browser take care of the hard stuff (cookies, javascript, sessions, profiles, DOM)
     Comes with various components and bindings ... including python
  7. General Recipe
     Ingredients:
     ● firefox (or chrome)
     ● firebug (or chrome dev tools)
     ● Selenium IDE
        ○ record a session, write less code
     ● python and its batteries
     ● python-selenium
     ● xvfb and pyvirtualdisplay (optional)
     ● other libraries to taste
        ○ e.g. image manipulation, database access, DOM parsing, OCR
  8. General Recipe
     Method:
     ● Install requirements (apt-get, pip etc)
        ○ sudo apt-get install xvfb firefox
        ○ pip install selenium pyvirtualdisplay
     ● Start up Firefox and Selenium IDE
     ● Record a "test" run through site
        ○ Add in some assertions along the way
     ● Export test as Python script
     ● Hack from there
        ○ Loops
        ○ Image/data extraction
        ○ Wrangling data into a database
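The "hack from there" step usually starts by wrapping the exported script so it can run without a visible desktop. The following is only a rough sketch of that idea, assuming the `selenium` and `pyvirtualdisplay` packages from the install step plus a local Firefox and xvfb; the function name `fetch_title` and its parameters are hypothetical and do not come from the talk.

```python
def fetch_title(url, headless=True):
    """Open `url` in Firefox, optionally inside a virtual display, and return the page title."""
    # Imported inside the function so this sketch can be loaded without the
    # third-party packages installed; a real script would import at the top.
    from pyvirtualdisplay import Display
    from selenium import webdriver

    # xvfb provides an off-screen X display, so no monitor is needed.
    display = Display(visible=0, size=(1024, 768)) if headless else None
    if display:
        display.start()
    try:
        driver = webdriver.Firefox()
        try:
            # Same implicit wait the Selenium IDE export uses.
            driver.implicitly_wait(30)
            driver.get(url)
            return driver.title
        finally:
            # Always close the browser, even if the page load fails.
            driver.quit()
    finally:
        if display:
            display.stop()
```

The nested `try`/`finally` blocks are there so that neither a stray Firefox process nor a stray xvfb server is left behind when a scrape fails partway through.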
  9. Example from Selenium IDE
     But what about that dang keypad?

        class Ingdirect2(unittest.TestCase):
            def setUp(self):
                self.driver = webdriver.Firefox()
                self.driver.implicitly_wait(30)
                self.base_url = "https://www.ingdirect.com.au"
                self.verificationErrors = []

            def test_ingdirect2(self):
                driver = self.driver
                driver.get(self.base_url + "/client/index.aspx")
                driver.switch_to_frame("body")  # Had to add this
                ...
                driver.find_element_by_id("txtCIF").clear()
                driver.find_element_by_id("txtCIF").send_keys("12345678")
                driver.find_element_by_id("objKeypad_B1").click()
                driver.find_element_by_id("objKeypad_B2").click()
                driver.find_element_by_id("objKeypad_B3").click()
                driver.find_element_by_id("objKeypad_B4").click()
                driver.find_element_by_id("btnLogin").click()
                self.assertTrue(self.is_element_present(By.ID, "ctl2_lblBalance"))
  10. PIL saves the day

        # Get screenshot for extraction of button images
        screenshot = driver.get_screenshot_as_base64()
        im = Image.open(StringIO.StringIO(base64.decodestring(screenshot)))
        table = driver.find_element_by_xpath('//*[@id="objKeypad_divShowAll"]/table')
        all_buttons = table.find_elements_by_tag_name("input")

        # Determine md5sum of each button by cropping based on element positions
        for button in all_buttons:
            button_image = im.crop(getcropbox(button))
            hexid = hashlib.md5(button_image.tostring()).hexdigest()
            button_mapping[hexid] = button.get_attribute("id")

        # Now we know which button is which (based on previous lookup), enter the PIN
        for char in self.pin:
            driver.find_element_by_id(button_mapping[hex_mapping[char]]).click()
        driver.find_element_by_id("btnLogin").click()
        # We're in!!!11one
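The slide calls a `getcropbox` helper and uses `button_mapping` and `hex_mapping` without showing them. Below is a minimal sketch of how those pieces might fit together, not the code from the talk. It assumes each element exposes `location` and `size` dictionaries the way Selenium elements do, and it uses a nested list of pixel values in place of a PIL image so it runs without a browser. `FakeElement`, `crop_bytes`, `fingerprint`, `ids_for_pin` and the sample data are all hypothetical.

```python
import hashlib
from collections import namedtuple

# Hypothetical stand-in for a Selenium element, which exposes
# `location` ({'x', 'y'}) and `size` ({'width', 'height'}) in this shape.
FakeElement = namedtuple("FakeElement", "id location size")


def getcropbox(element):
    """Return (left, top, right, bottom), the box format PIL's crop() expects."""
    left = int(element.location["x"])
    top = int(element.location["y"])
    right = left + int(element.size["width"])
    bottom = top + int(element.size["height"])
    return (left, top, right, bottom)


def crop_bytes(pixels, box):
    """Crop a row-major grid of 0-255 ints and return the region as bytes."""
    left, top, right, bottom = box
    return bytes(v for row in pixels[top:bottom] for v in row[left:right])


def fingerprint(pixels, element):
    """md5 of the pixels under an element, used as the button's identity."""
    return hashlib.md5(crop_bytes(pixels, getcropbox(element))).hexdigest()


# Toy 2x4 "screenshot": two 2x2 buttons side by side with distinct pixels.
screenshot = [
    [0, 0, 9, 9],
    [0, 1, 9, 8],
]
buttons = [
    FakeElement("objKeypad_B1", {"x": 0, "y": 0}, {"width": 2, "height": 2}),
    FakeElement("objKeypad_B2", {"x": 2, "y": 0}, {"width": 2, "height": 2}),
]

# Assumed output of the "previous lookup": digit -> md5 of that digit's image.
hex_mapping = {
    "7": hashlib.md5(bytes([0, 0, 0, 1])).hexdigest(),
    "3": hashlib.md5(bytes([9, 9, 9, 8])).hexdigest(),
}

# Built fresh on every page load, because button positions are shuffled.
button_mapping = {fingerprint(screenshot, b): b.id for b in buttons}


def ids_for_pin(pin):
    """Translate PIN digits into the element ids to click, in order."""
    return [button_mapping[hex_mapping[char]] for char in pin]
```

Exact hashing like this only works when each digit renders pixel-for-pixel the same on every visit; if the site adds noise to the images, a fuzzy comparison or OCR would be needed instead.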
  11. But why do all this?
      It's my data!
      ... and I'll graph if I want to
      * Actual results may vary. Graph indicates open inodes, not high-roller gambling problem
  12. That's all folks
      Slides
      ● http://bit.ly/scrapium
      Code
      ● https://gist.github.com/3015852
      Me
      ● https://twitter.com/mindsocket
      ● https://github.com/mindsocket
      ● roger@mindsocket.com.au