Web data acquisition with R

Transcript

  • 1. Web data acquisition with R. Scott Chamberlain, October 28, 2011
  • 2. Why would you even need to do this? Why not just get data through a browser?
  • 3. Some use cases
    – Reason 1: It just takes too dam* long to manually search/get data on a web interface
    – Reason 2: Workflow integration
    – Reason 3: Your work is reproducible and transparent if done from R instead of clicking buttons on the web
  • 4. A few general methods of getting web data through R
  • 5.
    – Read file – ideal if available
    – HTML
    – XML
    – JSON
    – APIs that serve up XML/JSON
  • 6. Practice… read.csv (or xls, txt, etc.). Get the URL for the file (see screenshot):
    url <- "http://datadryad.org/bitstream/handle/10255/dryad.8614/ScavengingFoodWebs_2009REV.csv?sequence=1"
    mycsv <- read.csv(url)
    mycsv
  • 7. ‘Scraping’ web data
    – Why? When there is no API
    – Can either scrape XML or HTML or JSON
    – XML and JSON are easier formats to deal with from R
  • 8. Scraping E.g. 1: XML
    http://www.fishbase.org/summary/speciessummary.php?id=2
  • 9. Scraping E.g. 1: XML. The summary XML page behind the rendered page…
  • 10. Scraping E.g. 1: XML. We can process the XML ourselves using a bunch of lines of code…
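    The code on this slide exists only as a screenshot; here is a minimal sketch of what hand-parsing that FishBase XML with the XML package might look like. The "//size" XPath is a placeholder, not FishBase's real schema:
    require(XML); require(RCurl)
    # Fetch the XML behind the rendered page (URL from slide 8)
    raw <- getURLContent("http://www.fishbase.org/summary/speciessummary.php?id=2", curl = getCurlHandle())
    # Parse into an internal document, then pull out nodes with XPath
    doc <- xmlTreeParse(raw, useInternalNodes = TRUE)
    xpathSApply(doc, "//size", xmlValue)  # placeholder node name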
  • 11. Scraping E.g. 1: XML. …OR just use a package someone already created: rfishbase. And you get this nice plot.
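    For illustration (the slide shows only a screenshot and a plot): a minimal sketch assuming the current rfishbase interface, which has been rewritten since this 2011 talk, so species() here is an assumption about today's package rather than the code on the slide:
    install.packages("rfishbase")  # if not already installed
    require(rfishbase)
    # species() returns a species-level summary table in current rfishbase;
    # the species name is an arbitrary example
    dat <- species("Oreochromis niloticus")
    dat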
  • 12. Practice… XML and JSON format data from the USA National Phenology Network
    install.packages(c("RCurl", "XML", "RJSONIO"))  # if not installed already
    require(RCurl); require(XML); require(RJSONIO)
    # XML format
    xmlurl <- "http://www-dev.usanpn.org/npn_portal/observations/getObservationsForSpeciesIndividualAtLocation.xml?year=2009&station_ids[0]=4881&station_ids[1]=4882&species_id=3"
    xmlout <- getURLContent(xmlurl, curl = getCurlHandle())
    xmlTreeParse(xmlout)[[1]][[1]]
    # JSON format
    jsonurl <- "http://www-dev.usanpn.org/npn_portal/observations/getObservationsForSpeciesIndividualAtLocation.json?year=2009&station_ids[0]=4881&station_ids[1]=4882&species_id=3"
    jsonout <- getURLContent(jsonurl, curl = getCurlHandle())
    fromJSON(jsonout)
  • 13. Scraping E.g. 2: HTML. All this code can produce something like…
  • 14. Scraping E.g. 2: HTML. …this
  • 15. Practice… scraping HTML
    install.packages(c("XML", "RCurl"))  # if not already installed
    require(XML); require(RCurl)
    # Let's look at the raw HTML first
    rawhtml <- getURLContent("http://www.ism.ws/ISMReport/content.cfm?ItemNumber=10752")
    rawhtml
    # Scrape data from the website
    rawPMI <- readHTMLTable("http://www.ism.ws/ISMReport/content.cfm?ItemNumber=10752")
    rawPMI
    PMI <- data.frame(rawPMI[[1]])
    names(PMI)[1] <- "Year"
  • 16. APIs (application programming interface)
    – Many data sources have APIs – largely for talking to other web interfaces – and we can use their APIs from R
    – An API consists of a set of methods to search, retrieve, or submit data to a data source/repository
    – One can write R code to interface with an API
    – Keep in mind some APIs require authentication keys
  • 17. API Documentation. API docs for the Integrated Taxonomic Information Service (ITIS):
    http://www.itis.gov/ws_description.html
    http://www.itis.gov/ITISWebService/services/ITISService/searchByScientificName?srchKey=Tardigrada
  • 18. Example: Simple call to API (see the sketch below)
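    The call on this slide is shown only as a screenshot; a minimal sketch using the ITIS URL from the previous slide might look like this:
    require(RCurl); require(XML)
    # Call the ITIS searchByScientificName method (URL from slide 17)
    itisurl <- "http://www.itis.gov/ITISWebService/services/ITISService/searchByScientificName?srchKey=Tardigrada"
    out <- getURLContent(itisurl, curl = getCurlHandle())
    # The response is XML; parse it and inspect the tree
    xmlTreeParse(out, useInternalNodes = TRUE)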
  • 19. rOpenSci suite of R packages
    – There are many packages on CRAN for specific data sources on the web – search on CRAN to find these
    – rOpenSci is developing packages for as many open source data sources as possible – please use and give feedback…
  • 20. Data, literature/metadata: http://ropensci.org/ , code at GitHub
  • 21. Three examples of packages that interact with an API
  • 22. API E.g. 1: Search literature: rplos. You can do this using this tutorial: http://ropensci.org/tutorials/rplos-tutorial/
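    A minimal sketch of the kind of search the tutorial covers; searchplos() and its arguments follow the current rplos interface, which may differ from the 2011 version:
    require(rplos)
    # Full-text search of PLoS journals: q is the query,
    # fl picks the fields to return
    res <- searchplos(q = "phylogeny", fl = "id,title", limit = 5)
    res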
  • 23. API E.g. 2: Get taxonomic information for your study species: taxize. A tutorial: http://ropensci.org/tutorials/r-taxize-tutorial/
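    A minimal sketch of the kind of lookup the tutorial covers, assuming the current taxize interface:
    require(taxize)
    # Look up an ITIS TSN for a species, then fetch its taxonomic hierarchy
    tsn <- get_tsn("Quercus douglasii")
    classification(tsn)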
  • 24. API E.g. 3: Get some data: dryad. A tutorial: http://ropensci.org/tutorials/dryad-tutorial/
  • 25. Calling external programs from R
  • 26. Why even think about doing this?
    – Again, workflow integration
    – It’s just easier to call X program from R if you are going to run many analyses with said program
  • 27. E.g. 1: Phylometa… using the files in the dropbox. Also, get Phylometa here: http://lajeunesse.myweb.usf.edu/publications.html
    – On a Mac: doesn’t work because Phylometa is a .exe – but system() can often run external programs
    – On Windows:
    system(paste("new_phyloMeta_1.2b.exe", "Aerts2006JEcol_tree.txt", "Aerts2006JEcol_data.txt"), intern = TRUE)
    NOTE: intern = TRUE returns the output to the R console. Should give you something like this…
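    As a generic, platform-neutral illustration of the same pattern (not from the slides): any command-line program can be run this way, with intern = TRUE capturing its standard output as a character vector. On Windows, shell() may be needed for shell built-ins.
    # "echo" stands in for any external program
    out <- system("echo hello from the shell", intern = TRUE)
    out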
  • 28. Resources
    – rOpenSci (development of R packages for all open source data and literature)
    – CRAN packages (search for a data source)
    – Tutorials/websites: http://www.programmingr.com/content/webscraping-using-readlines-and-rcurl
    – Non-R based, but cool: http://ecologicaldata.org/