Web Scraping


NICAR 2010 Web Scraping panel

Published in: Technology, Design

  1. NICAR 2010 Web Scraping Basics
     James Wilkerson, The Des Moines Register
     Jacob Fenton, The (Allentown) Morning Call
     - Intro – James W.
     - Basic tools – James W.
       Firefox extensions: DownloadThemAll, Outwit Hub
       Yahoo Pipes
       Openkapow
     - Perl tools – Jacob
     - Python tools – James W.
  2. Something to stare at
  3. Pre-built scraping stuff
     - Firefox extensions: DownloadThemAll, Outwit Hub
     - Yahoo! Pipes
     - Openkapow
  4. Python with BeautifulSoup
     - Easily pull in and pull apart HTML.
     - Search for page elements cleanly and easily.
     - Python is better than Perl.
     Great tutorial by Ben Welsh (palewire) at the LA Times:
  5. BeautifulSoup example
     #Bring in the modules necessary to grab & process pages
     from mechanize import Browser
     from BeautifulSoup import BeautifulSoup
     #Use mechanize to grab the page.
     mech = Browser()
     url = ""
     page1 =
     html1 =
  6. #Carve up the html
     soup1 = BeautifulSoup(html1)
     #Send page to function that will extract data from appropriate table
     extract(soup1, 2007)
  7. #Function to extract table info
     def extract(soup, year):
         records = []
         table = soup.find("table", border=1)
         #Skip the header row, then read one record per table row
         for row in table.findAll('tr')[1:]:
             col = row.findAll('td')
             rank = col[0].string
             artist = col[1].string
             album = col[2].string
             cover_link = col[3].img['src']
             records.append((str(year), rank, artist, album, cover_link))
         #Hand back every row so the caller gets the whole table
         return records
  8. #Follow the link to 2006 data and process that page
     page2 = mech.follow_link(text_regex="Next")
     html2 =
     soup2 = BeautifulSoup(html2)
     extract(soup2, 2006)
  9. RESULTS:
     2007|10|LCD Soundsystem|Sound of Silver|
     2007|9|Ulrich Schnauss|Goodbye|
     2007|8|The Clientele|God Save The Clientele|
     2007|7|The Modernist|Collectors Series Pt. 1: Popular Songs|
     2007|6|Bebel Gilberto|Momento|
     2007|5|Various Artists|Jay Deelicious: 1995-1998|
     2007|4|Lindstrom and Prins Thomas|BBC Essential Mix|
     2007|3|Go Home Productions|This Was Pop|
     2007|2|Apparat|Walls|
     2007|1|Caribou|Andorra|
     2006|10|Lily Allen|Alright, Still|
     2006|9|Nouvelle Vague|Nouvelle Vague|
     2006|8|Bookashade|Movements|
     2006|7|Charlotte Gainsbourg|5:55|
     2006|6|The Drive-By Truckers|The Blessing and the Curse|
     2006|5|Basement Jaxx|Crazy Itch Radio|
     2006|4|Love is All|Nine Times The Same Song|
     2006|3|Ewan Pearson|Sci.Fi.Hi.Fi_01|
     2006|2|Neko Case|Fox Confessor Brings The Flood|
     2006|1|Ellen Allien & Apparat|Orchestra of Bubbles|
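
The flow in the slides (grab a page, find the table, skip the header row, turn each row into a year-prefixed record, write pipe-delimited lines) can be tried without a network connection. The sketch below is not the presenters' code: it swaps mechanize and BeautifulSoup for Python 3's standard-library html.parser and csv modules so it runs with no installs. The sample markup, the cover paths, and the names SAMPLE_HTML, TableRows and to_pipe_delimited are assumptions made up for illustration; the real page's HTML is not shown in the slides.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical sample markup standing in for the scraped page (an assumption).
SAMPLE_HTML = """
<table border="1">
  <tr><th>Rank</th><th>Artist</th><th>Album</th><th>Cover</th></tr>
  <tr><td>2</td><td>Apparat</td><td>Walls</td><td><img src="covers/walls.jpg"></td></tr>
  <tr><td>1</td><td>Caribou</td><td>Andorra</td><td><img src="covers/andorra.jpg"></td></tr>
</table>
"""


class TableRows(HTMLParser):
    """Collect the <td> cells of each <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished data rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []
        elif tag == "img" and self._cell is not None:
            # For an image cell, keep the src attribute as the cell value.
            self._cell.append(dict(attrs).get("src") or "")

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            # The header row has only <th> cells, so it ends up empty: skip it.
            if self._row:
                self.rows.append(self._row)
            self._row = None


def extract(html, year):
    """Return one (year, rank, artist, album, cover_link) tuple per data row."""
    parser = TableRows()
    parser.feed(html)
    return [(str(year), *cells) for cells in parser.rows]


def to_pipe_delimited(records):
    """Render records as pipe-delimited lines, one record per line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="|", lineterminator="\n")
    writer.writerows(records)
    return buf.getvalue()


records = extract(SAMPLE_HTML, 2007)
output = to_pipe_delimited(records)
print(output, end="")
```

Returning a list of records and formatting it in a separate step keeps the parsing reusable: the same extract() call could serve a second year's page, with both result lists concatenated before writing.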