useR! 2012 Talk
 

My lightning talk from the useR! 2012 conference on scraping some baseball data.

    useR! 2012 Talk Presentation Transcript

    • Using R for Scraping Data
      Ryan Elmore, National Renewable Energy Lab
      rtelmore@gmail.com, Twitter: rtelmore
      June 13, 2012, useR! 2012
    • A Baseball Challenge
      Question: Has the minimum number of pitches per (full) inning (6 pitches) ever been attained?
      Answer: I don’t know; scrape the boxscores at baseball-reference.com.
      http://www.baseball-reference.com/boxes/COL/COL201104010.shtml
    • The Boxscore
      This column seems useful!
    • Dissecting the URL
      http://www.baseball-reference.com/boxes/COL/COL201104010.shtml
      Parts: team (COL, BOS, etc.), YearMonthDay (20110401), Game ID (0).
      Just step through all of the teams.
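The URL pattern on this slide can be assembled from its parts. The following is a minimal sketch, not the code from the talk: `build_boxscore_url` and its argument names are hypothetical, and the base URL is taken from the example link above.

```r
# Hypothetical helper: assemble one boxscore URL from team, date, and game ID.
build_boxscore_url <- function(team, year, month, day, game_id = 0,
                               base_url = "http://www.baseball-reference.com/boxes") {
  # Team code followed by zero-padded YearMonthDay, e.g. "COL20110401"
  date_part <- sprintf("%s%d%02d%02d", team, year, month, day)
  # base/TEAM/TEAMyyyymmdd<game_id>.shtml
  paste0(base_url, "/", team, "/", date_part, game_id, ".shtml")
}

# Reproduces the example URL from the slide
build_boxscore_url("COL", 2011, 4, 1)
```

Zero-padding the month and day with `%02d` matters here, since April 1 must appear as `0401` rather than `41`.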
    • How Do We Proceed?
      The most systematic way that I could find was to break it down like this:
      • 30 teams
      • 2005 - 2010
      • Every day from Apr 1 through Oct 31
      • This is a little more than 78K URLs!
      • My program took about 3 hrs 25 min.
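One way to arrive at "a little more than 78K" is sketched below. It rests on two assumptions that the slide does not state outright: every month is stepped through day 31 (so impossible dates such as April 31 are included), and two game IDs (0 and 1) are tried for each team and date.

```r
n_teams  <- 30
years    <- 2005:2010
months   <- 4:10   # April through October
days     <- 1:31   # assumption: every month is stepped through day 31
game_ids <- 0:1    # assumption: two candidate game IDs per team and date

# Total number of candidate URLs
n_urls <- n_teams * length(years) * length(months) * length(days) * length(game_ids)
n_urls  # 78120

# Rough average time per URL for a 3 h 25 min run, in seconds
secs_per_url <- (3 * 3600 + 25 * 60) / n_urls
```

Under these assumptions the run averages well under a second per URL, and many of the candidate URLs will not correspond to an actual game.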
    • R Code
      library(XML)  # provides readHTMLTable
      ## teams, years, months, days, and base.url are assumed to be defined earlier
      for (team in teams) {
        for (year in years) {
          ## Log progress once per team and season
          out.string <- paste(Sys.time(), "--", team, year, sep = " ")
          print(out.string)
          for (month in months) {
            for (day in days) {
              ## date.url is not defined on the slide; assumed here to be
              ## team + YearMonthDay, e.g. COL20110401
              date.url <- sprintf("%s%d%02d%02d", team, year, month, day)
              for (i in 0:1) {  # game ID
                full.url <- paste(paste(base.url, team, date.url, sep = "/"),
                                  i, ".shtml", sep = "")
                table.stats <- readHTMLTable(full.url)
                ## Process the list of data.frames returned by
                ## the call to readHTMLTable
              }
            }
          }
        }
      }
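Many of the generated URLs will not point at a real game (off days, or dates that do not exist), so a long-running loop like this benefits from guarding each read. The sketch below is an assumption about how that might be done, not something shown in the talk: `safe_read`, `fake_reader`, and the `Pit` column name are hypothetical, and whether `readHTMLTable` actually raises an error for a missing page depends on how the site responds.

```r
# Hypothetical wrapper: return NULL instead of stopping when a read fails.
# `reader` defaults to XML::readHTMLTable; any function taking a URL works.
safe_read <- function(url, reader = XML::readHTMLTable) {
  tryCatch(
    reader(url),
    error = function(e) {
      message("Skipping ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Stand-in reader so the example runs without network access.
# It fails for the impossible date April 31 and otherwise returns
# a list with one made-up data.frame.
fake_reader <- function(url) {
  if (grepl("20110431", url)) stop("no such page")
  list(pitching = data.frame(Pit = c(12, 9)))
}

ok  <- safe_read("boxes/COL/COL201104010.shtml", reader = fake_reader)
bad <- safe_read("boxes/COL/COL201104310.shtml", reader = fake_reader)
```

Inside the loop, a `NULL` result can simply be skipped with `if (is.null(table.stats)) next`.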
    • Tools
      • base: paste, strsplit, unlist, lapply
      • XML: readHTMLTable, htmlTreeParse, getNodeSet, xmlValue, xmlSApply
      • httr, stringr, and other Hadley things
      • useful, but not necessary: regex, xpath, XML, etc.
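As a small illustration of the base functions listed above, the sketch below splits boxscore URLs back into their parts with `strsplit`, `unlist`, and `lapply`. The helper `parse_boxscore_url` is hypothetical, and the second URL is a made-up example that only follows the same pattern.

```r
urls <- c(
  "http://www.baseball-reference.com/boxes/COL/COL201104010.shtml",
  "http://www.baseball-reference.com/boxes/BOS/BOS201104080.shtml"  # illustrative only
)

# Hypothetical helper: recover team, date, and game ID from one URL.
parse_boxscore_url <- function(url) {
  parts <- unlist(strsplit(url, "/", fixed = TRUE))
  # Last path component without its extension, e.g. "COL201104010"
  file <- sub("\\.shtml$", "", parts[length(parts)])
  list(team    = substr(file, 1, 3),
       date    = substr(file, 4, 11),
       game_id = substr(file, 12, 12))
}

# Apply the helper to every URL; the result is a list of lists
parsed <- lapply(urls, parse_boxscore_url)
```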
    • Conclusions/Discussion
      • There is a lot of data available on the web!
      • You can access this data from a browser; however, you can access A LOT more data if you let your computer do the work.
      • R and its libraries provide a great platform for scraping data and data mining.
      • Download data and see where you go.
    • Was That Minimum Attained?
      • NO! Unless there is an error in my code.
      • Did we learn something? Of course.
      • The skills are transferable to other websites with data.