Published in: Technology
Web data from R

  1. Web data acquisition with R. Scott Chamberlain, October 28, 2011
  2. Why would you even need to do this? Why not just get data through a browser?
  3. Some use cases
     • Reason 1: It just takes too dam* long to manually search/get data on a web interface
     • Reason 2: Workflow integration
     • Reason 3: Your work is reproducible and transparent if done from R instead of clicking buttons on the web
  4. A few general methods of getting web data through R
  5. • Read file – ideal if available
     • HTML
     • XML
     • JSON
     • APIs that serve up XML/JSON
  6. Practice…read.csv (or xls, txt, etc.)
     Get URL for file…see screenshot
       url <- ""
       mycsv <- read.csv(url)
       mycsv
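A minimal sketch of the read.csv practice above. `read.csv()` accepts a URL in place of a file path. The slide's URL is not shown, so this example writes a small made-up file to a temporary location and reads it back through a `file://` URL; the file contents and column names are assumptions for illustration only.

```r
# Hypothetical data standing in for a CSV file hosted on the web
tmp <- tempfile(fileext = ".csv")
writeLines(c("species,count", "Salmo trutta,12", "Esox lucius,5"), tmp)

# Build a file:// URL; an http:// URL is passed to read.csv() the same way
url <- paste0("file://", normalizePath(tmp, winslash = "/"))
mycsv <- read.csv(url, stringsAsFactors = FALSE)
mycsv
```

Swapping the `file://` URL for the address of a real hosted CSV file should be the only change needed.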
  7. ‘Scraping’ web data
     • Why? When there is no API
       – Can scrape either XML, HTML, or JSON
       – XML and JSON are easier formats to deal with from R
  8. Scraping E.g. 1: XML
  9. Scraping E.g. 1: XML
     The summary XML page behind the rendered page…
  10. Scraping E.g. 1: XML
      We can process the XML ourselves using a bunch of lines of code…
  11. Scraping E.g. 1: XML
      …OR just use a package someone already created: rfishbase. And you get this nice plot
  12. Practice…XML and JSON format data from the USA National Phenology Network
      install.packages(c("RCurl", "XML", "RJSONIO")) # if not installed already
      require(RCurl); require(XML); require(RJSONIO)
      # XML format
      xmlurl <- "getObservationsForSpeciesIndividualAtLocation.xml?year=2009&station_ids[0]=4881&station_ids[1]=4882&species_id=3"
      xmlout <- getURLContent(xmlurl, curl = getCurlHandle())
      xmlTreeParse(xmlout)[[1]][[1]]
      # JSON format
      jsonurl <- "getObservationsForSpeciesIndividualAtLocation.json?year=2009&station_ids[0]=4881&station_ids[1]=4882&species_id=3"
      jsonout <- getURLContent(jsonurl, curl = getCurlHandle())
      fromJSON(jsonout)
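A rough sketch of what happens after the download step in the practice above, assuming the XML and RJSONIO packages are installed. The two payloads are invented stand-ins for what `getURLContent()` would return; their element and field names (`observation`, `station_id`, `phenophase`) are assumptions and not the actual National Phenology Network schema.

```r
library(XML)
library(RJSONIO)

# Hypothetical payloads standing in for downloaded content
xmltext <- '<observations>
  <observation station_id="4881" phenophase="leaves"/>
  <observation station_id="4882" phenophase="flowers"/>
</observations>'
jsontext <- '[{"station_id": 4881, "phenophase": "leaves"},
              {"station_id": 4882, "phenophase": "flowers"}]'

# XML: parse the text, then pull one attribute from every <observation> node
xmldoc <- xmlParse(xmltext, asText = TRUE)
xml_stations <- xpathSApply(xmldoc, "//observation", xmlGetAttr, "station_id")

# JSON: fromJSON() returns one list element per record
jsonlist <- fromJSON(jsontext)
json_stations <- sapply(jsonlist, function(rec) rec[["station_id"]])

xml_stations
json_stations
```

The JSON route needs no query language: the parsed result is already an ordinary R list, which is one reason JSON tends to be convenient to handle from R.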
  13. Scraping E.g. 2: HTML
      All this code can produce something like…
  14. Scraping E.g. 2: HTML
      …this
  15. Practice…scraping HTML
      install.packages(c("XML", "RCurl")) # if not already installed
      require(XML); require(RCurl)
      # Let's look at the raw html first
      rawhtml <- getURLContent("")
      # Scrape data from the website
      rawPMI <- readHTMLTable("")
      PMI <- data.frame(rawPMI[[1]])
      names(PMI)[1] <- "Year"
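A small self-contained sketch of the same table-scraping steps, assuming the XML package is installed. Since the slide's URL is not shown, the HTML here is an invented stand-in for a fetched page, and the numbers in it are made up purely for illustration.

```r
library(XML)

# Hypothetical HTML standing in for a page fetched with getURLContent()
rawhtml <- '<html><body><table>
  <tr><th>V1</th><th>Jan</th><th>Feb</th></tr>
  <tr><td>2010</td><td>58.4</td><td>56.5</td></tr>
  <tr><td>2011</td><td>60.8</td><td>61.4</td></tr>
</table></body></html>'

# Parse the text, then extract every <table> as a data frame
doc <- htmlParse(rawhtml, asText = TRUE)
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

# Keep the first table and give its first column a meaningful name
PMI <- data.frame(tables[[1]])
names(PMI)[1] <- "Year"
PMI
```

`readHTMLTable()` returns one list element per table on the page, which is why the practice code selects `[[1]]`.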
  16. APIs (application programming interfaces)
      • Many data sources have APIs
        – largely for talking to other web interfaces
        – we can use their API from R
      • An API consists of a set of methods to search, retrieve, or submit data to a data source/repository
      • One can write R code to interface with an API
        – Keep in mind that some APIs require authentication keys
  17. API Documentation
      • API docs for the Integrated Taxonomic Information System (ITIS):
  18. Example: Simple call to API
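As a hedged illustration of what a simple API call involves: most web APIs take a base URL plus a set of name=value query parameters. The endpoint and parameter names below are hypothetical (not the ITIS API), and the sketch only builds the request URL using base R, so it runs without a network connection.

```r
# Hypothetical endpoint and parameter names, for illustration only
base_url <- "https://api.example.org/search"
params <- list(query = "Quercus alba", limit = 5)

# URL-encode each value, then join name=value pairs with "&"
encoded <- vapply(params,
                  function(v) URLencode(as.character(v), reserved = TRUE),
                  character(1))
query <- paste(names(encoded), encoded, sep = "=", collapse = "&")
request_url <- paste0(base_url, "?", query)
request_url

# With RCurl loaded, the call itself would be: getURLContent(request_url)
```

An API that requires an authentication key typically expects it as one more parameter or as a request header; the API's documentation says which.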
  19. rOpenSci suite of R packages
      • There are many packages on CRAN for specific data sources on the web
        – search on CRAN to find these
      • rOpenSci is developing a lot of packages for as many open source data sources as possible
        – Please use and give feedback…
  20. Data Literature/metadata, code at GitHub
  21. Three examples of packages that interact with an API
  22. API E.g. 1: Search literature: rplos
      You can do this using this tutorial:
  23. API E.g. 2: Get taxonomic information for your study species: taxize
      A tutorial:
  24. API E.g. 3: Get some data: dryad
      A tutorial:
  25. Calling external programs from R
  26. Why even think about doing this?
      • Again, workflow integration
      • It’s just easier to call program X from R if you are going to run many analyses with said program
  27. Eg. 1: Phylometa…using the files in the dropbox
      Also, get Phylometa here:
      • On a Mac: doesn’t work because it’s a .exe
        – But system() can often run external programs
      • On Windows:
          system(paste("new_phyloMeta_1.2b.exe", "Aerts2006JEcol_tree.txt", "Aerts2006JEcol_data.txt"), intern = TRUE)
      NOTE: intern = TRUE returns the program’s output to R as a character vector, which prints at the console
      Should give you something like this
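A portable sketch of the same idea for readers without Phylometa. It uses `system2()`, a base R alternative to `system()` that takes the command and its arguments separately. The external program here is just the `Rscript` executable that ships with R, chosen as a stand-in so the example runs on any platform.

```r
# Stand-in external program: the Rscript executable bundled with R
rscript <- file.path(R.home("bin"), "Rscript")

# stdout = TRUE captures the program's output as a character vector,
# the system2() counterpart of system(..., intern = TRUE)
out <- system2(rscript, c("-e", shQuote("cat(1 + 1)")), stdout = TRUE)
out
```

Once the output is captured in an R object, it can be parsed and fed into later analysis steps, which is the workflow-integration point made above.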
  28. Resources
      • rOpenSci (development of R packages for all open source data and literature)
      • CRAN packages (search for a data source)
      • Tutorials/websites:
        – and-rcurl
      • Non-R based, but cool: