Using R for Scraping Data
             Ryan Elmore
    National Renewable Energy Lab

        rtelmore@gmail.com
          Twitter: rtelmore

            June 13, 2012
              useR! 2012
A Baseball Challenge

Question: Has the minimum number of pitches
per (full) inning (6 pitches) ever been
attained?
Answer: I don’t know; scrape the boxscores at
baseball-reference.com.


http://www.baseball-reference.com/boxes/COL/COL201104010.shtml
The Boxscore




This column seems useful!
Dissecting the URL
http://www.baseball-reference.com/boxes/COL/COL201104010.shtml

• COL: team code. Just step through all of the teams: COL, BOS, etc.
• 20110401: YearMonthDay
• 0: Game ID
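As a minimal sketch of how the pieces of the URL fit together, the helper below assembles one boxscore URL from a team code, a date, and a game ID. The function name `make.boxscore.url` and its arguments are illustrative assumptions and do not come from the talk.

```r
## Hypothetical helper (not from the talk): build one boxscore URL
## from the team code, the date, and the game ID.
make.boxscore.url <- function(team, year, month, day, game.id = 0,
                              base.url = "http://www.baseball-reference.com/boxes") {
  ## YearMonthDay with zero-padded month and day, e.g. "20110401"
  ymd <- sprintf("%d%02d%02d", year, month, day)
  ## The team code appears twice: as a directory and in the file name
  paste(base.url, "/", team, "/", team, ymd, game.id, ".shtml", sep = "")
}

example.url <- make.boxscore.url("COL", 2011, 4, 1)
print(example.url)
```

Called with COL and April 1, 2011, the helper reproduces the URL shown on the slide.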
How Do We Proceed?
The most systematic way that I could find
was to break it down like this:
• 30 Teams
• 2005 - 2010
• Every day from Apr 1 through Oct 31
• This is a little more than 78K URLs!
• My program took about 3 hrs 25 min.
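The URL count can be checked with a little arithmetic. The sketch below assumes that the program tries days 1 through 31 in every month from April to October, and two game IDs (0 and 1) per date, as the loop in the R code does. These ranges are inferred from the code rather than stated on the slide.

```r
n.teams  <- 30
years    <- 2005:2010
months   <- 4:10   # April through October
days     <- 1:31   # assumed: every month is given 31 days, so some dates cannot exist
game.ids <- 0:1    # the two game IDs tried by the loop

## Total number of candidate URLs
n.urls <- n.teams * length(years) * length(months) *
  length(days) * length(game.ids)
print(n.urls)

## Average time per URL for a run of 3 hrs 25 min
secs.per.url <- (3 * 60 + 25) * 60 / n.urls
print(round(secs.per.url, 2))
```

Under these assumptions the total is 78,120 URLs, which matches "a little more than 78K" and works out to roughly 0.16 seconds per URL.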
R Code
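library(XML)  # provides readHTMLTable

## Assumed setup (not shown on the original slide)
base.url <- "http://www.baseball-reference.com/boxes"
teams <- c("COL", "BOS")  # extend to all 30 team codes
years <- 2005:2010
months <- 4:10            # April through October
days <- 1:31

for (team in teams){
  for (year in years){
    out.string <- paste(Sys.time(), "--", team, year, sep = " ")
    print(out.string)
    for (month in months){
      for (day in days){
        ## Team code followed by YearMonthDay, e.g. "COL20110401"
        date.url <- sprintf("%s%d%02d%02d", team, year, month, day)
        for (i in 0:1){
          full.url <- paste(paste(base.url, team, date.url,
             sep="/"), i, ".shtml", sep="")
          ## Most URLs do not point to a game, so skip any that fail
          table.stats <- tryCatch(readHTMLTable(full.url),
                                  error = function(e) NULL)
          if (is.null(table.stats)) next
          ## Process the list of data.frames returned by
          ## the call to readHTMLTable
        }
      }
    }
  }
}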
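The "process the list of data.frames" step is left open in the loop. One possible shape for it is sketched below on a hand-built stand-in for the list that readHTMLTable returns. The table names, the column name `Pit`, and all values are invented for illustration; the real boxscore tables may be laid out differently.

```r
## Mock of what readHTMLTable might return for one page: a named list
## of data.frames whose columns are all character strings.
## Table names, column names, and values are invented for illustration.
table.stats <- list(
  batting  = data.frame(Batter = c("Batter A", "Batter B"),
                        AB = c("4", "3"), stringsAsFactors = FALSE),
  pitching = data.frame(Pitcher = c("Pitcher A", "Pitcher B"),
                        Pit = c("98", "31"), stringsAsFactors = FALSE)
)

## Keep only the tables that contain the (hypothetical) pitch-count column
has.pitches <- unlist(lapply(table.stats, function(df) "Pit" %in% names(df)))
pitch.tables <- table.stats[has.pitches]

## Convert the counts from character to numeric and total them per table
pitch.totals <- unlist(lapply(pitch.tables,
                              function(df) sum(as.numeric(df$Pit))))
print(pitch.totals)
```

Selecting tables by the presence of a column, rather than by position in the list, is one way to cope with pages that return a different number of tables.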
Tools

•   base: paste, strsplit, unlist, lapply
•   XML: readHTMLTable, htmlTreeParse,
    getNodeSet, xmlValue, xmlSApply
•   httr, stringr, and other Hadley things
•   useful, but not necessary: regex, xpath,
    XML, etc.
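To give a feel for the base tools listed above, here is a small sketch that uses strsplit, unlist, lapply, and paste to take a boxscore URL apart again. The function name `parse.boxscore.url` is hypothetical, the second URL only follows the same pattern for illustration, and the fixed field widths (3-letter team code, 8-digit date, 1-digit game ID) are an assumption based on the example URL.

```r
urls <- c("http://www.baseball-reference.com/boxes/COL/COL201104010.shtml",
          ## Illustrative only: built to follow the same pattern
          "http://www.baseball-reference.com/boxes/BOS/BOS201104080.shtml")

## Hypothetical helper: recover team, date, and game ID from one URL
parse.boxscore.url <- function(url) {
  ## Split on "/" and keep the last piece, e.g. "COL201104010.shtml"
  pieces <- unlist(strsplit(url, "/", fixed = TRUE))
  file.name <- pieces[length(pieces)]
  ## Drop the ".shtml" extension
  stem <- unlist(strsplit(file.name, ".", fixed = TRUE))[1]
  ## Assumed fixed-width fields: team (3), YearMonthDay (8), game ID (1)
  c(team = substr(stem, 1, 3),
    date = substr(stem, 4, 11),
    game.id = substr(stem, 12, 12))
}

parsed <- lapply(urls, parse.boxscore.url)
print(parsed[[1]])

## paste puts the pieces back together
stem.again <- paste(parsed[[1]], collapse = "")
```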
Conclusions/Discussion

• There is a lot of data available on the web!
• You can access this data from a browser;
  however, you can access A LOT more data
  if you let your computer do the work.
• R and its libraries provide a great platform
  for scraping data and data mining.
• Download data and see where you go.
Was That Minimum Attained?

• NO! Unless there is an error in my code.
• Did we learn something? Of course.
• The skills are transferable to other
  websites with data.
