Web data from R

Presentation Transcript

  • Web data acquisition with R. Scott Chamberlain, October 28, 2011
  • Why would you even need to do this? Why not just get data through a browser?
  • Some use cases
    – Reason 1: It just takes too dam* long to manually search/get data on a web interface
    – Reason 2: Workflow integration
    – Reason 3: Your work is reproducible and transparent if done from R instead of clicking buttons on the web
  • A few general methods of getting web data through R
  • Read file – ideal if available
    – HTML
    – XML
    – JSON
    – APIs that serve up XML/JSON
  • Practice…read.csv (or xls, txt, etc.)
    Get the URL for the file (see screenshot):
    url <- "http://datadryad.org/bitstream/handle/10255/dryad.8614/ScavengingFoodWebs_2009REV.csv?sequence=1"
    mycsv <- read.csv(url)
    mycsv
  • ‘Scraping’ web data
    – Why? When there is no API
    – Can scrape either XML, HTML, or JSON
    – XML and JSON are easier formats to deal with from R
  • Scraping E.g. 1: XML
    http://www.fishbase.org/summary/speciessummary.php?id=2
  • Scraping E.g. 1: XMLThe summary XML page behind the rendered page…
  • Scraping E.g. 1: XMLWe can process the XML ourselves using a bunch of lines of code…
  • Scraping E.g. 1: XML
    …OR just use a package someone already created: rfishbase. And you get this nice plot.
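The rfishbase shortcut mentioned above can be sketched roughly as follows. Note the assumptions: species() is the lookup function in the current CRAN release of rfishbase, which may not match the 2011-era interface the slides used, and the species name is only an illustrative choice.

```r
# Hedged sketch: fetch a species summary via the rfishbase package.
# species() is from the current CRAN rfishbase; the 2011 version the
# slides used may have had a different interface.
install.packages("rfishbase")  # if not installed already
library(rfishbase)

dat <- species("Oreochromis niloticus")  # illustrative species name
dat
```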
  • Practice…XML and JSON format data from the USA National Phenology Network
    install.packages(c("RCurl", "XML", "RJSONIO")) # if not installed already
    require(RCurl); require(XML); require(RJSONIO)

    XML format:
    xmlurl <- "http://www-dev.usanpn.org/npn_portal/observations/getObservationsForSpeciesIndividualAtLocation.xml?year=2009&station_ids[0]=4881&station_ids[1]=4882&species_id=3"
    xmlout <- getURLContent(xmlurl, curl = getCurlHandle())
    xmlTreeParse(xmlout)[[1]][[1]]

    JSON format:
    jsonurl <- "http://www-dev.usanpn.org/npn_portal/observations/getObservationsForSpeciesIndividualAtLocation.json?year=2009&station_ids[0]=4881&station_ids[1]=4882&species_id=3"
    jsonout <- getURLContent(jsonurl, curl = getCurlHandle())
    fromJSON(jsonout)
  • Scraping E.g. 2: HTML. All this code can produce something like…this.
  • Practice…scraping HTML
    install.packages(c("XML", "RCurl")) # if not already installed
    require(XML); require(RCurl)
    # Let's look at the raw html first
    rawhtml <- getURLContent("http://www.ism.ws/ISMReport/content.cfm?ItemNumber=10752")
    rawhtml
    # Scrape data from the website
    rawPMI <- readHTMLTable("http://www.ism.ws/ISMReport/content.cfm?ItemNumber=10752")
    rawPMI
    PMI <- data.frame(rawPMI[[1]])
    names(PMI)[1] <- "Year"
  • APIs (application programming interfaces)
    – Many data sources have APIs, largely for talking to other web interfaces; we can use their API from R
    – An API consists of a set of methods to search, retrieve, or submit data to a data source/repository
    – One can write R code to interface with an API
    – Keep in mind some APIs require authentication keys
  • API Documentation
    API docs for the Integrated Taxonomic Information System (ITIS):
    http://www.itis.gov/ws_description.html
    http://www.itis.gov/ITISWebService/services/ITISService/searchByScientificName?srchKey=Tardigrada
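A minimal sketch of making that ITIS call from R, using the RCurl and XML packages introduced earlier. The URL is the one shown on the slide; the parsing step is only a starting point, since the exact element names depend on the ITIS response schema.

```r
# Hedged sketch: call the ITIS searchByScientificName method and parse
# the XML reply with the RCurl/XML tools used earlier in the slides.
require(RCurl); require(XML)

itis_url <- "http://www.itis.gov/ITISWebService/services/ITISService/searchByScientificName?srchKey=Tardigrada"
raw_xml <- getURLContent(itis_url)                    # fetch the raw XML
doc <- xmlTreeParse(raw_xml, useInternalNodes = TRUE) # parse into a tree
xmlRoot(doc)  # inspect the tree; drill in with [[ ]] as on the NPN slide
```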
  • Example: Simple call to API
  • rOpenSci suite of R packages
    – There are many packages on CRAN for specific data sources on the web; search on CRAN to find these
    – rOpenSci is developing packages for as many open source data sources as possible. Please use them and give feedback…
  • Data, literature, and metadata: http://ropensci.org/ ; code at GitHub
  • Three examples of packages that interact with an API
  • API E.g. 1: Search literature with rplos. A tutorial: http://ropensci.org/tutorials/rplos-tutorial/
  • API E.g. 2: Get taxonomic information for your study species with taxize. A tutorial: http://ropensci.org/tutorials/r-taxize-tutorial/
  • API E.g. 3: Get some data with dryad. A tutorial: http://ropensci.org/tutorials/dryad-tutorial/
  • Calling external programs from R
  • Why even think about doing this?
    – Again, workflow integration
    – It's just easier to call program X from R if you are going to run many analyses with said program
  • Eg. 1: Phylometa…using the files in the dropbox
    Also, get Phylometa here: http://lajeunesse.myweb.usf.edu/publications.html
    – On a Mac: doesn't work because Phylometa is a .exe, but system() can often run external programs
    – On Windows:
    system(paste("new_phyloMeta_1.2b.exe", "Aerts2006JEcol_tree.txt", "Aerts2006JEcol_data.txt"), intern = TRUE)
    NOTE: intern = TRUE returns the output to the R console. Should give you something like this.
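The same call can also be written with system2(), which separates the program from its arguments. This is only a sketch, assuming the executable and the two input files named on the slide sit in the working directory.

```r
# Hedged sketch: run Phylometa via system2() instead of system(paste(...)).
# Assumes new_phyloMeta_1.2b.exe and the two input files are in getwd().
out <- system2("new_phyloMeta_1.2b.exe",
               args = c("Aerts2006JEcol_tree.txt", "Aerts2006JEcol_data.txt"),
               stdout = TRUE)  # like intern = TRUE: capture output in R
out
```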
  • Resources
    – rOpenSci (development of R packages for all open source data and literature)
    – CRAN packages (search for a data source)
    – Tutorials/websites: http://www.programmingr.com/content/webscraping-using-readlines-and-rcurl
    – Non-R based, but cool: http://ecologicaldata.org/