Web Scraping
 


NICAR 2010 Web Scraping panel


    Web Scraping Presentation Transcript

    • NICAR 2010 Web Scraping Basics
      James Wilkerson, The Des Moines Register
      Jacob Fenton, The (Allentown) Morning Call
      - Intro – James W.
      - Basic tools – James W.
          Firefox extensions: DownloadThemAll, Outwit Hub
          Yahoo! Pipes
          Openkapow
      - Perl tools – Jacob
      - Python tools – James W.
    • Something to stare at
    • Pre-built scraping stuff
      Firefox extensions:
        DownloadThemAll http://www.downloadthemall.net
        Outwit Hub http://www.outwit.com
      Yahoo! Pipes http://pipes.yahoo.com
      Openkapow http://www.openkapow.com/
    • Python with BeautifulSoup
      - Easily pull in and pull apart HTML.
      - Search for page elements cleanly and easily.
      - Python is better than Perl.
      Great tutorial by Ben Welsh (palewire) at the LA Times: http://www.palewire.com
    • BeautifulSoup example
      # Bring in the modules necessary to grab & process pages
      from mechanize import Browser
      from BeautifulSoup import BeautifulSoup
      # Use mechanize to grab the page.
      mech = Browser()
      url = "http://www.palewire.com/scrape/albums/2007.html"
      page1 = mech.open(url)
      html1 = page1.read()
    • # Carve up the HTML
      soup1 = BeautifulSoup(html1)
      # Send the page to the function that will extract data from the appropriate table.
      # (extract must already be defined when this call runs in a script.)
      extract(soup1, 2007)
    • # Function to extract table info
      def extract(soup, year):
          table = soup.find("table", border=1)
          # Skip the header row, then read each data row
          for row in table.findAll('tr')[1:]:
              col = row.findAll('td')
              rank = col[0].string
              artist = col[1].string
              album = col[2].string
              cover_link = col[3].img['src']
              record = (str(year), rank, artist, album, cover_link)
              # Print one pipe-delimited record per row
              print("|".join(record))
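The table walk in extract can also be tried without a network connection or third-party packages. The block below is only a rough sketch under stated assumptions: it swaps BeautifulSoup for the standard library's html.parser (Python 3), and the TableRows class and the SAMPLE markup are hypothetical stand-ins invented for illustration, not the real palewire page.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text (or <img> src) of every <tr> in the document."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []
        elif tag == "img" and self._cell is not None:
            # For image cells, keep the src attribute instead of text.
            self._cell.append(dict(attrs).get("src", ""))

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract(html, year):
    """Return one pipe-delimited record per data row, skipping the header row."""
    parser = TableRows()
    parser.feed(html)
    return ["|".join([str(year)] + row) for row in parser.rows[1:]]


# Hypothetical sample page, shaped like the table the slides describe.
SAMPLE = """
<table border="1">
  <tr><th>Rank</th><th>Artist</th><th>Album</th><th>Cover</th></tr>
  <tr><td>2</td><td>Apparat</td><td>Walls</td><td><img src="covers/walls.jpg"></td></tr>
  <tr><td>1</td><td>Caribou</td><td>Andorra</td><td><img src="covers/andorra.jpg"></td></tr>
</table>
"""

records = extract(SAMPLE, 2007)
for line in records:
    print(line)
```

Returning a list of records rather than printing inside the loop makes the function easier to test and to reuse for more than one page.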
    • # Follow the link to the 2006 data and process that page
      page2 = mech.follow_link(text_regex="Next")
      html2 = page2.read()
      soup2 = BeautifulSoup(html2)
      extract(soup2, 2006)
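To see roughly what following a link by its text involves, here is a minimal standard-library sketch (Python 3). It is not mechanize's implementation: LinkCollector, find_link, and the PAGE fragment with its "2006.html" href are hypothetical names and markup chosen for illustration. It collects each link's text and href, picks the first link whose text matches the regex, and resolves it against the current page's URL.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a> tag in the document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def find_link(html, base_url, text_regex):
    """Return the absolute URL of the first link whose text matches, else None."""
    collector = LinkCollector()
    collector.feed(html)
    for text, href in collector.links:
        if re.search(text_regex, text):
            return urljoin(base_url, href)
    return None


# Hypothetical page fragment; the real page's markup may differ.
PAGE = '<p><a href="index.html">Home</a> <a href="2006.html">Next year</a></p>'
next_url = find_link(PAGE, "http://www.palewire.com/scrape/albums/2007.html", "Next")
print(next_url)
```

Resolving the href with urljoin matters because pages often use relative links, which only make sense against the URL of the page they came from.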
    • RESULTS:
      2007|10|LCD Soundsystem|Sound of Silver|http://www.palewire.com/scrape/albums/covers/sound%20of%20silver.jpg
      2007|9|Ulrich Schnauss|Goodbye|http://www.palewire.com/scrape/albums/covers/goodbye.jpg
      2007|8|The Clientele|God Save The Clientele|http://www.palewire.com/scrape/albums/covers/god%20save%20the%20clientele.jpg
      2007|7|The Modernist|Collectors Series Pt. 1: Popular Songs|http://www.palewire.com/scrape/albums/covers/collectors%20series.jpg
      2007|6|Bebel Gilberto|Momento|http://www.palewire.com/scrape/albums/covers/memento.jpg
      2007|5|Various Artists|Jay Deelicious: 1995-1998|http://www.palewire.com/scrape/albums/covers/jaydeelicious.jpg
      2007|4|Lindstrom and Prins Thomas|BBC Essential Mix|http://www.palewire.com/scrape/albums/covers/lindstrom%20prins%20thomas.jpg
      2007|3|Go Home Productions|This Was Pop|http://www.palewire.com/scrape/albums/covers/this%20was%20pop.jpg
      2007|2|Apparat|Walls|http://www.palewire.com/scrape/albums/covers/walls.jpg
      2007|1|Caribou|Andorra|http://www.palewire.com/scrape/albums/covers/andorra.jpg
      2006|10|Lily Allen|Alright, Still|http://www.palewire.com/scrape/albums/covers/alright%20still.jpg
      2006|9|Nouvelle Vague|Nouvelle Vague|http://www.palewire.com/scrape/albums/covers/nouvelle%20vague.jpg
      2006|8|Bookashade|Movements|http://www.palewire.com/scrape/albums/covers/movements.jpg
      2006|7|Charlotte Gainsbourg|5:55|http://www.palewire.com/scrape/albums/covers/555.jpg
      2006|6|The Drive-By Truckers|The Blessing and the Curse|http://www.palewire.com/scrape/albums/covers/blessing%20and%20curse.jpg
      2006|5|Basement Jaxx|Crazy Itch Radio|http://www.palewire.com/scrape/albums/covers/crazy%20itch%20radio.jpg
      2006|4|Love is All|Nine Times The Same Song|http://www.palewire.com/scrape/albums/covers/nine%20times.jpg
      2006|3|Ewan Pearson|Sci.Fi.Hi.Fi_01|http://www.palewire.com/scrape/albums/covers/sci%20fi%20hi%20fi.jpg
      2006|2|Neko Case|Fox Confessor Brings The Flood|http://www.palewire.com/scrape/albums/covers/fox%20confessor.jpg
      2006|1|Ellen Allien & Apparat|Orchestra of Bubbles|http://www.palewire.com/scrape/albums/covers/orchestra%20of%20bubbles.jpg
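Pipe-delimited output like the results above loads straight back into Python with the standard csv module. This is a small sketch (Python 3) assuming the five fields appear in the order the extract function builds them; the FIELDS list reuses those variable names as column names, which is a choice made here for illustration.

```python
import csv
import io

# Two records copied from the results above.
RESULTS = """\
2007|1|Caribou|Andorra|http://www.palewire.com/scrape/albums/covers/andorra.jpg
2006|10|Lily Allen|Alright, Still|http://www.palewire.com/scrape/albums/covers/alright%20still.jpg
"""

# Column names, in the order the record tuple is built.
FIELDS = ["year", "rank", "artist", "album", "cover_link"]

reader = csv.DictReader(io.StringIO(RESULTS), fieldnames=FIELDS, delimiter="|")
albums = list(reader)

for album in albums:
    print(album["year"], album["rank"], album["artist"], "-", album["album"])
```

A pipe delimiter suits this data because album titles such as "Alright, Still" contain commas, which would need quoting in an ordinary comma-separated file.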