
Web data from R

Published in: Technology

  1. Web data acquisition with R. Scott Chamberlain, October 28, 2011
  2. Why would you even need to do this? Why not just get data through a browser?
  3. Some use cases
     • Reason 1: It just takes too dam* long to manually search for and get data on a web interface
     • Reason 2: Workflow integration
     • Reason 3: Your work is reproducible and transparent if done from R instead of by clicking buttons on the web
  4. A few general methods of getting web data through R
  5. • Read file – ideal if available
     • HTML
     • XML
     • JSON
     • APIs that serve up XML/JSON
  6. Practice… read.csv (or xls, txt, etc.)
     Get the URL for the file (see screenshot), then read it directly:
       url <- "http://datadryad.org/bitstream/handle/10255/dryad.8614/ScavengingFoodWebs_2009REV.csv?sequence=1"
       mycsv <- read.csv(url)
       mycsv
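A remote file can be missing or the network can be down, so a small wrapper that fails gently may be useful. The following is a minimal sketch, not part of the original slides: `read_remote_csv` is a hypothetical helper name, and the demonstration uses made-up rows written to a temporary file so that it runs without network access.

```r
# Hypothetical helper: read a CSV from a URL or local path.
# Returns NULL (with a warning) instead of stopping if the read fails.
read_remote_csv <- function(src, ...) {
  tryCatch(
    read.csv(src, stringsAsFactors = FALSE, ...),
    error = function(e) {
      warning("Could not read ", src, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Offline demonstration with made-up data in a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("species,count", "raven,3", "coyote,5"), tmp)
demo <- read_remote_csv(tmp)
str(demo)
```

The same call should accept the Dryad URL from the slide in place of `tmp`, assuming that file is still served at that address.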
  7. ‘Scraping’ web data
     • Why? When there is no API
       – Can scrape XML, HTML, or JSON
       – XML and JSON are easier formats to deal with from R
  8. Scraping E.g. 1: XML
     http://www.fishbase.org/summary/speciessummary.php?id=2
  9. Scraping E.g. 1: XML
     The summary XML page behind the rendered page…
  10. Scraping E.g. 1: XML
      We can process the XML ourselves using a bunch of lines of code…
  11. Scraping E.g. 1: XML
      …OR just use a package someone already created: rfishbase. And you get this nice plot
  12. Practice… XML and JSON format data from the USA National Phenology Network
        install.packages(c("RCurl", "XML", "RJSONIO")) # if not installed already
        require(RCurl); require(XML); require(RJSONIO)

        # XML format (the URL is split across lines only for readability)
        xmlurl <- paste0("http://www-dev.usanpn.org/npn_portal/observations/",
                         "getObservationsForSpeciesIndividualAtLocation.xml?",
                         "year=2009&station_ids[0]=4881&station_ids[1]=4882&species_id=3")
        xmlout <- getURLContent(xmlurl, curl = getCurlHandle())
        xmlTreeParse(xmlout)[[1]][[1]]

        # JSON format
        jsonurl <- paste0("http://www-dev.usanpn.org/npn_portal/observations/",
                          "getObservationsForSpeciesIndividualAtLocation.json?",
                          "year=2009&station_ids[0]=4881&station_ids[1]=4882&species_id=3")
        jsonout <- getURLContent(jsonurl, curl = getCurlHandle())
        fromJSON(jsonout)
  13. Scraping E.g. 2: HTML
      All this code can produce something like…
  14. Scraping E.g. 2: HTML
      …this
  15. Practice… scraping HTML
        install.packages(c("XML", "RCurl")) # if not already installed
        require(XML); require(RCurl)

        # Let's look at the raw HTML first
        rawhtml <- getURLContent("http://www.ism.ws/ISMReport/content.cfm?ItemNumber=10752")
        rawhtml

        # Scrape data from the website
        rawPMI <- readHTMLTable("http://www.ism.ws/ISMReport/content.cfm?ItemNumber=10752")
        rawPMI
        PMI <- data.frame(rawPMI[[1]])
        names(PMI)[1] <- "Year"
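Scraped tables usually arrive as text, so the columns often need converting before analysis. The sketch below is an assumption about a typical follow-up step, not something shown in the slides: the data frame is a made-up stand-in for `rawPMI[[1]]`, with placeholder values and column names chosen only for illustration.

```r
# Made-up stand-in for rawPMI[[1]]; real values would come from readHTMLTable()
PMI <- data.frame(
  V1  = c("2010", "2011"),
  JAN = c("50.0", "51.5"),
  FEB = c("52.0", "53.5"),
  stringsAsFactors = FALSE
)

# Name the first column, as on the slide
names(PMI)[1] <- "Year"

# Convert every column from text to numeric;
# as.character() first guards against factor columns
PMI[] <- lapply(PMI, function(col) as.numeric(as.character(col)))
str(PMI)
```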
  16. APIs (application programming interfaces)
      • Many data sources have APIs, largely for talking to other web interfaces, and we can use those APIs from R
      • An API consists of a set of methods to search, retrieve, or submit data to a data source/repository
      • One can write R code to interface with an API
        – Keep in mind that some APIs require authentication keys
  17. API Documentation
      • API docs for the Integrated Taxonomic Information System (ITIS): http://www.itis.gov/ws_description.html
      • Example call: http://www.itis.gov/ITISWebService/services/ITISService/searchByScientificName?srchKey=Tardigrada
  18. Example: Simple call to API
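A simple API call is mostly a matter of assembling a URL from an endpoint and its query parameters. The following is a minimal sketch using the ITIS endpoint and the `srchKey` parameter from slide 17; `build_query_url` is a hypothetical helper, not a function from any package, and the actual fetch is left commented out because it needs network access and the service address may have changed.

```r
# Hypothetical helper: build a GET request URL from an endpoint
# and a named list of query parameters (values are URL-encoded).
build_query_url <- function(endpoint, params) {
  keys <- names(params)
  vals <- vapply(
    params,
    function(v) URLencode(as.character(v), reserved = TRUE),
    character(1)
  )
  paste0(endpoint, "?", paste(keys, vals, sep = "=", collapse = "&"))
}

itis_endpoint <- "http://www.itis.gov/ITISWebService/services/ITISService/searchByScientificName"
itis_url <- build_query_url(itis_endpoint, list(srchKey = "Tardigrada"))
itis_url

# Fetching and parsing would then follow the pattern of the earlier slides:
# xmlout <- RCurl::getURLContent(itis_url)
# XML::xmlTreeParse(xmlout)
```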
  19. rOpenSci suite of R packages
      • There are many packages on CRAN for specific data sources on the web – search on CRAN to find these
      • rOpenSci is developing a lot of packages for as many open source data sources as possible
        – Please use and give feedback…
  20. Data; Literature/metadata. http://ropensci.org/ (code at GitHub)
  21. Three examples of packages that interact with an API
  22. API E.g. 1: Search literature: rplos
      You can do this using this tutorial: http://ropensci.org/tutorials/rplos-tutorial/
  23. API E.g. 2: Get taxonomic information for your study species: taxize
      A tutorial: http://ropensci.org/tutorials/r-taxize-tutorial/
  24. API E.g. 3: Get some data: dryad
      A tutorial: http://ropensci.org/tutorials/dryad-tutorial/
  25. Calling external programs from R
  26. Why even think about doing this?
      • Again, workflow integration
      • It’s just easier to call program X from R if you are going to run many analyses with said program
  27. E.g. 1: Phylometa… using the files in the dropbox
      Also, get Phylometa here: http://lajeunesse.myweb.usf.edu/publications.html
      • On a Mac: doesn’t work because it’s an .exe
        – But system() often can work to run external programs
      • On Windows:
          system(paste("new_phyloMeta_1.2b.exe", "Aerts2006JEcol_tree.txt", "Aerts2006JEcol_data.txt"), intern = TRUE)
        NOTE: intern = TRUE returns the output to the R console
      Should give you something like this
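The same idea can be sketched with `system2()`, which takes the program and its arguments separately so that paths containing spaces are quoted safely. This is an illustration under assumptions, not the author's code: `run_external` is a hypothetical helper, and the demonstration calls the `Rscript` executable that ships with R so that it runs on any platform without Phylometa installed.

```r
# Hypothetical helper: run an external program and capture its
# console output (stdout and stderr) as a character vector.
run_external <- function(program, args = character()) {
  system2(program, args = shQuote(args), stdout = TRUE, stderr = TRUE)
}

# Demonstration: start a child R process and capture what it prints
rscript <- file.path(R.home("bin"), "Rscript")
out <- run_external(rscript, c("-e", "cat('hello from a child process')"))
out

# The Phylometa call from the slide would look like this on Windows,
# assuming the .exe and both input files are in the working directory:
# run_external("new_phyloMeta_1.2b.exe",
#              c("Aerts2006JEcol_tree.txt", "Aerts2006JEcol_data.txt"))
```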
  28. Resources
      • rOpenSci (development of R packages for all open source data and literature)
      • CRAN packages (search for a data source)
      • Tutorials/websites:
        – http://www.programmingr.com/content/webscraping-using-readlines-and-rcurl
      • Non-R based, but cool: http://ecologicaldata.org/
