useR! 2012 Talk

My lightning talk from the useR! 2012 conference on scraping some baseball data.


Transcript

  • 1. Using R for Scraping Data
       Ryan Elmore, National Renewable Energy Lab
       rtelmore@gmail.com, Twitter: rtelmore
       June 13, 2012, useR! 2012
  • 2. A Baseball Challenge
       Question: Has the minimum number of pitches per (full) inning (6 pitches) ever been attained?
       Answer: I don't know; scrape the box scores at baseball-reference.com.
       http://www.baseball-reference.com/boxes/COL/COL201104010.shtml
  • 4. The Boxscore
       This column seems useful!
  • 6. Dissecting the URL
       http://www.baseball-reference.com/boxes/COL/COL201104010.shtml
       The file name encodes the team (COL, BOS, etc.), the date (Year, Month, Day), and the game ID.
       Just step through all of the teams.
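To make the slide's URL pattern concrete, here is a small sketch that assembles a box score URL from its parts; the helper name `make_boxscore_url` is mine, not from the talk.

```r
## Hypothetical helper: assemble a baseball-reference box score URL
## from team, date, and game number (0 for a single game, 1 for the
## second game of a doubleheader).
make_boxscore_url <- function(team, year, month, day, game = 0) {
  base.url <- "http://www.baseball-reference.com/boxes"
  date.str <- sprintf("%04d%02d%02d", year, month, day)
  paste(base.url, "/", team, "/", team, date.str, game, ".shtml", sep = "")
}

make_boxscore_url("COL", 2011, 4, 1)
## [1] "http://www.baseball-reference.com/boxes/COL/COL201104010.shtml"
```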
  • 7. How Do We Proceed?
       The most systematic way that I could find was to break it down like this:
       • 30 Teams
       • 2005 - 2010
       • Every day from Apr 1 through Oct 31
       • This is a little more than 78K URLs!
       • My program took about 3 hrs 25 min.
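The "78K" figure follows from the loop structure rather than the actual calendar: the loops on the next slide try days 1 through 31 in each of the seven months, with two possible game IDs per date. A quick back-of-the-envelope check:

```r
n.teams  <- 30                 # MLB teams
n.years  <- length(2005:2010)  # 6 seasons
n.months <- length(4:10)       # April through October
n.days   <- 31                 # the loop tries days 1-31 in every month
n.games  <- 2                  # game IDs 0 and 1 (doubleheaders)

n.teams * n.years * n.months * n.days * n.games
## [1] 78120 -- "a little more than 78K"
```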
  • 9. R Code
       for (team in teams) {
         for (year in years) {
           out.string <- paste(Sys.time(), "--", team, year, sep = " ")
           print(out.string)
           for (month in months) {
             for (day in days) {
               for (i in 0:1) {
                 ## date.url (built from year, month, and day) is assumed to be
                 ## defined earlier; the slide omits its construction
                 full.url <- paste(paste(base.url, team, date.url, sep = "/"),
                                   i, ".shtml", sep = "")
                 table.stats <- readHTMLTable(full.url)
                 ## Process the list of data.frames returned by
                 ## the call to readHTMLTable
               }
             }
           }
         }
       }
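readHTMLTable returns a list of data frames, one per table in the page. A self-contained sketch using an inline snippet of HTML rather than a live URL; the column name Pit is modeled on the box score's pitch-count column, not taken verbatim from the site:

```r
library(XML)

html <- "<table>
           <tr><th>Pitcher</th><th>IP</th><th>Pit</th></tr>
           <tr><td>Jimenez</td><td>6</td><td>106</td></tr>
         </table>"

## Parse the markup as text, then pull every table into a data frame
doc  <- htmlParse(html, asText = TRUE)
tabs <- readHTMLTable(doc, stringsAsFactors = FALSE)

tabs[[1]]$Pit
```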
  • 11. Tools
       • base: paste, strsplit, unlist, lapply
       • XML: readHTMLTable, htmlTreeParse, getNodeSet, xmlValue, xmlSApply
       • httr, stringr, and other Hadley things
       • useful, but not necessary: regex, xpath, XML, etc.
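As a small example of the base tools in action, here is a sketch that splits a box score URL back into its parts with pure base R, no packages:

```r
url   <- "http://www.baseball-reference.com/boxes/COL/COL201104010.shtml"
parts <- unlist(strsplit(url, "/", fixed = TRUE))  # split on path separators
file  <- parts[length(parts)]   # "COL201104010.shtml"
team  <- substr(file, 1, 3)     # "COL"
date  <- substr(file, 4, 11)    # "20110401"
game  <- substr(file, 12, 12)   # "0"
c(team = team, date = date, game = game)
```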
  • 13. Conclusions/Discussion
       • There is a lot of data available on the web!
       • You can access this data from a browser; however, you can access A LOT more data if you let your computer do the work.
       • R and its libraries provide a great platform for scraping data and data mining.
       • Download data and see where you go.
  • 15. Was That Minimum Attained?
       • NO! Unless there is an error in my code.
       • Did we learn something? Of course.
       • The skills are transferable to other websites with data.