Web data from R

Rice University
Oct. 28, 2011


  1. Web data acquisition with R Scott Chamberlain October 28, 2011
  2. Why would you even need to do this? Why not just get data through a browser?
  3. Some use cases • Reason 1: It just takes too dam* long to manually search/get data on a web interface • Reason 2: Workflow integration • Reason 3: Your work is reproducible and transparent if done from R instead of clicking buttons on the web
  4. A few general methods of getting web data through R
  5. Read file – ideal if available • HTML • XML • JSON • APIs that serve up XML/JSON
  6. Practice…read.csv (or xls, txt, etc.) Get URL for file…see screenshot url <- "" mycsv <- read.csv(url) mycsv
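A minimal runnable sketch of the pattern above. The URL is a placeholder (substitute a link to a real CSV file); the textConnection() demo just shows the same mechanics without needing a network connection:

```r
# read.csv() accepts a URL exactly as it accepts a local file path.
# The URL below is a placeholder -- substitute a link to a real CSV file.
url <- "http://example.com/mydata.csv"
# mycsv <- read.csv(url)

# The same mechanics, without a network connection, via a text connection:
mycsv <- read.csv(textConnection("species,count\noak,10\nmaple,3"))
mycsv
```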
  7. ‘Scraping’ web data • Why? When there is no API – Can either scrape XML or HTML or JSON – XML and JSON are easier formats to deal with from R
  8. Scraping E.g. 1: XML
  9. Scraping E.g. 1: XML The summary XML page behind the rendered page…
  10. Scraping E.g. 1: XML We can process the XML ourselves using a bunch of lines of code…
  11. Scraping E.g. 1: XML …OR just use a package someone already created - rfishbase And you get this nice plot
  12. Practice…XML and JSON formats data from the USA National Phenology Network install.packages(c("RCurl","XML","RJSONIO")) # if not installed already require(RCurl); require(XML); require(RJSONIO) XML Format xmlurl <- ' getObservationsForSpeciesIndividualAtLocation.xml? year=2009&station_ids[0]=4881&station_ids[1]=4882&species_id=3' xmlout <- getURLContent(xmlurl, curl = getCurlHandle()) xmlTreeParse(xmlout)[[1]][[1]] JSON Format jsonurl <- ' getObservationsForSpeciesIndividualAtLocation.json? year=2009&station_ids[0]=4881&station_ids[1]=4882&species_id=3' jsonout <- getURLContent(jsonurl, curl = getCurlHandle()) fromJSON(jsonout)
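To see what fromJSON() returns without hitting the network, here is a sketch on a literal JSON string. The field names merely mimic the shape of an NPN-style response and are illustrative, not the actual API schema:

```r
library(RJSONIO)

# fromJSON() converts a JSON string into nested R lists/vectors.
# Field names below are made up for illustration only.
json <- '{"station_id": 4881, "species_id": 3, "phenophases": [1, 0, 1]}'
parsed <- fromJSON(json)
parsed$station_id   # 4881
parsed$phenophases  # a numeric vector: 1 0 1
```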
  13. Scraping E.g. 2: HTML All this code can produce something like…
  14. Scraping E.g. 2: HTML …this
  15. Practice…scraping HTML install.packages(c("XML","RCurl")) # if not already installed require(XML); require(RCurl) # Let's look at the raw html first rawhtml <- getURLContent('') rawhtml # Scrape data from the website rawPMI <- readHTMLTable('') rawPMI PMI <- data.frame(rawPMI[[1]]) names(PMI)[1] <- 'Year'
  16. APIs (application programming interfaces) • Many data sources have APIs – largely for talking to other web interfaces – we can use their API from R • An API consists of a set of methods to search, retrieve, or submit data to a data source/repository • One can write R code to interface with an API – Keep in mind some APIs require authentication keys
  17. API Documentation • API docs for the Integrated Taxonomic Information System (ITIS):
  18. Example: Simple call to API
  19. rOpenSci suite of R packages • There are many packages on CRAN for specific data sources on the web – search on CRAN to find these • rOpenSci is developing a lot of packages for as many open source data sources as possible – Please use and give feedback…
  20. Data, literature/metadata, code at GitHub
  21. Three examples of packages that interact with an API
  22. API E.g. 1: Search literature: rplos You can do this using this tutorial:
  23. API E.g. 2: Get taxonomic information for your study species: taxize A tutorial:
  24. API E.g. 3: Get some data: dryad A tutorial:
  25. Calling external programs from R
  26. Why even think about doing this? • Again, workflow integration • It’s just easier to call X program from R if you are going to run many analyses with said program
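The mechanism R provides for this is system(), which runs a shell command; with intern = TRUE the command's output comes back as a character vector rather than printing to the console. A minimal sketch, using echo as a stand-in for a real external program:

```r
# system() runs a shell command; intern = TRUE captures its output
# as a character vector instead of printing it to the console.
out <- system("echo hello", intern = TRUE)
out  # "hello"
```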
  27. E.g. 1: Phylometa …using the files in the dropbox Also, get Phylometa here: • On a Mac: doesn’t work because Phylometa is a Windows .exe – but system() can often run external programs • On Windows: system(paste('"new_phyloMeta_1.2b.exe" Aerts2006JEcol_tree.txt Aerts2006JEcol_data.txt'), intern=T) NOTE: intern = T returns the output to the R console Should give you something like this
  28. Resources • rOpenSci (development of R packages for all open source data and literature) • CRAN packages (search for a data source) • Tutorials/websites: – and-rcurl • Non-R based, but cool: