R, HTTP, and APIs, with a preview of TopicWatchr

Strong, Homer. "R, HTTP, and APIs, with a preview of TopicWatchr." Portland R User Group, 15 November 2011.


  1. Application Programming Interfaces
     Why? I want my code to have access to your code or data...
       • from a different computer!
       • we might be using different operating systems!
       • different programming languages!
       • have different compression capabilities!
       • security!
       • etc.
     At least you don't have to install tons of code or download all of the data.
  2. The Internet Suggests a Solution
     HyperText Transfer Protocol: HTTP
       • Since the WWW has caught on, HTTP has become a dominant protocol.
       • Pretty much all computers support some kind of HTTP client.
       • Browsers are just fancy HTTP clients.
       • R can be a client too!
     Duncan Temple Lang's RCurl package offers R access to libcurl, a popular HTTP library.
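As a rough illustration of R acting as an HTTP client, the sketch below builds a GET URL in base R and wraps RCurl's getURL in a helper. The endpoint, the parameter names, and the helper names build_url and fetch are all invented for this example; they are not part of any real API.

```r
# Build a GET URL from a base address and a named list of query parameters.
# Base R only; the endpoint and parameter names are made up for illustration.
build_url <- function(base, params) {
  encoded <- vapply(params,
                    function(p) URLencode(as.character(p), reserved = TRUE),
                    character(1))
  paste0(base, "?", paste(names(params), encoded, sep = "=", collapse = "&"))
}

url <- build_url("http://example.com/api/counts",
                 list(source = "twitter_sample", gram = "Ron Paul"))
print(url)

# Fetch the response body as a string with RCurl, if it is installed.
# Wrapped in a function so nothing touches the network until it is called.
fetch <- function(u) {
  if (!requireNamespace("RCurl", quietly = TRUE)) stop("RCurl is not installed")
  RCurl::getURL(u)
}
```

Calling fetch(url) would then return the raw response text, ready to be handed to a parser.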
  3. But what data will we transfer?
     HTTP gives us a nearly universal way to pass data between machines; now we have to decide what format messages ought to have.
       • Let's choose something lightweight and human readable (so no XML :p)
       • but it should be something easily serializable,
       • and should have some structure
     JSON is the popular choice.
  4. JSON
     JSON looks like this:

         {
           "hello" : "world",
           "universe" : 42,
           "pizza" : null,
           "cookies" : ["chocolate", "molasses", "oatmeal"],
           "eggs" : {
             "over" : "easy"
           }
         }

     JSON has types, can be nested, and has analogies (e.g. dicts or hashes or maps) in most major programming languages.
     It smells like a list in R.
     The RJSONIO package, also by Duncan Temple Lang, takes R lists to and from their JSON representations.
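A small round trip shows the list analogy in practice. This is only a sketch: it assumes the RJSONIO package is installed (the conversion is skipped otherwise), and the list reuses a few of the fields from the slide.

```r
# An R list that mirrors part of the JSON object on the slide.
x <- list(
  hello    = "world",
  universe = 42,
  cookies  = c("chocolate", "molasses", "oatmeal"),
  eggs     = list(over = "easy")
)

# Convert to JSON text and back again, if RJSONIO is available.
if (requireNamespace("RJSONIO", quietly = TRUE)) {
  json <- RJSONIO::toJSON(x)    # R list -> JSON string
  cat(json, "\n")
  y <- RJSONIO::fromJSON(json)  # JSON string -> R list
  str(y)
}
```

Named list elements become JSON object keys, and an unnamed character vector becomes a JSON array.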
  5. Numerous Examples
     Computational
       • geocoding: Google, et al.
       • face recognition: face.com
       • prediction: Google
     Data
       • Federal Register
       • Bloomberg
     "Data APIs/feeds available as packages in R" was asked on stats.stackexchange.com a couple of months ago. The list of packages included:
     quantmod, tseries, fImport, WDI, RGoogleTrends, RGoogleDocs, twitteR, Zillow, RNYTimes, UScensus2000, infochimps, rdatamarket, factualR, RDSTK, RBloomberg, LIM, RTAQ, IBrokers, rnpn, RClimate
  6. API example: TopicWatch
     TopicWatch is a platform for text analytics and visualization.
     We are currently developing 3 interfaces to the API:
       • iPad app
       • web app
       • R package
     We collect streaming data from a variety of sources including Twitter, RSS feeds, government publications, and others.
  7. API Outline
     The API is still under development, and is unstable. We're always adding new features and polishing old ones.
     Just a few concrete capabilities that are already running:
       • time series of n-gram frequencies & counts
           • aggregated at several resolutions
       • n-grams ranked by frequency
           • also aggregated at several resolutions
           • can be filtered by sub-grams
       • raw documents that contain a gram
       • topics that contain a gram
       • time series counts of documents that contain co-occurring n-grams
       • ranking grams by usage change between any two times
  8. TopicWatchr
     The R package is a thin wrapper for the HTTP API. It (unsurprisingly) works by
       • sending a request to a URL
       • parsing JSON results
       • re-arranging lists into data frames
     But it has some nice functionality to make working with the API a bit smoother:
       • parses timestamps in data
       • paginates large requests automatically
       • handles authentication
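The wrapper steps listed above can be sketched in base R. This is not TopicWatchr's actual code: the record fields time and count, the helpers records_to_df and fetch_all, and the fake page function standing in for the HTTP call are all assumptions made for illustration.

```r
# A parsed JSON response might arrive as a list of records like this
# (the field names "time" and "count" are assumptions, not the real API's).
records <- list(
  list(time = "2011-11-15 08:00:00", count = 0),
  list(time = "2011-11-15 08:30:00", count = 3),
  list(time = "2011-11-15 09:00:00", count = 1)
)

# Re-arrange a list of records into a data frame and parse the timestamps.
records_to_df <- function(recs) {
  data.frame(
    times = as.POSIXct(vapply(recs, function(r) r$time, character(1)), tz = "UTC"),
    count = vapply(recs, function(r) as.numeric(r$count), numeric(1)),
    stringsAsFactors = FALSE
  )
}

# Paginate: request pages of `page_size` records until a short page comes back.
# `fetch_page(offset, limit)` stands in for the HTTP call; it returns a list of records.
fetch_all <- function(fetch_page, page_size) {
  out <- list()
  offset <- 0
  repeat {
    page <- fetch_page(offset, page_size)
    out <- c(out, page)
    if (length(page) < page_size) break
    offset <- offset + page_size
  }
  out
}

# Fake "server" that slices the records above, to exercise the pagination loop.
fake_page <- function(offset, limit) {
  idx <- seq_along(records)
  records[idx[idx > offset & idx <= offset + limit]]
}

df <- records_to_df(fetch_all(fake_page, page_size = 2))
str(df)
```

Passing the page fetcher in as a function keeps the pagination loop independent of how the request is actually made, so the same loop would work with a real HTTP call.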
  9. Example 1: Presidential Candidates
     Code to get data:

         library(TopicWatchr)
         set_credentials("PRUG", "12345")

         candidates <- c("Herman Cain", "Mitt Romney", "Rick Perry",
                         "Newt Gingrich", "Ron Paul", "Michelle Bachmann",
                         "Jon Huntsman", "Rick Santorum")

         twitter_counts <- wordCounts("twitter_sample", candidates)
         rss_counts <- wordCounts("rss-majorpapers", candidates)

     The wordCounts function constructs the proper API call, makes the call, and arranges the results into a data frame. Each data frame looks like this:

         data.frame: 5 obs. of 9 variables:
          $ times            : POSIXct, format: "2011-11-15 08:00:00" "2011-11-15 08:30:00" ...
          $ Herman Cain      : num 0 0.00148 0 0.00326 0.00274
          $ Mitt Romney      : num 0 0.00148 0 0.00326 0.00548
          $ Rick Perry       : num 0 0.00148 0 0 0
          $ Newt Gingrich    : num 0 0.00148 0 0.00326 0
          $ Ron Paul         : num 0 0 0 0 0
          $ Michelle Bachmann: num 0 0 0 0 0
          $ Jon Huntsman     : num 0 0 0 0 0
          $ Rick Santorum    : num 0 0.00148 0 0 0

     Then we combine data frames and polish with ggplot2 ...
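One plausible way to do the combining step is to stack each wide data frame into long form, one row per time and candidate, which is the shape ggplot2 expects. The talk does not show this code, so the helper to_long, the two-candidate data frames, and every number below are illustrative stand-ins rather than real API output. The plot is only built if ggplot2 is installed.

```r
# Tiny stand-ins for the data frames returned by wordCounts (values are made up).
times <- as.POSIXct(c("2011-11-15 08:00:00", "2011-11-15 08:30:00"), tz = "UTC")
twitter_counts <- data.frame(times = times,
                             "Herman Cain" = c(0, 0.00148),
                             "Ron Paul"    = c(0, 0),
                             check.names = FALSE)
rss_counts <- data.frame(times = times,
                         "Herman Cain" = c(0.001, 0.002),
                         "Ron Paul"    = c(0, 0.001),
                         check.names = FALSE)

# Stack one wide data frame into long form: one row per (time, candidate).
to_long <- function(df, source) {
  candidates <- setdiff(names(df), "times")
  do.call(rbind, lapply(candidates, function(cand) {
    data.frame(times = df$times, candidate = cand, frequency = df[[cand]],
               source = source, stringsAsFactors = FALSE)
  }))
}

combined <- rbind(to_long(twitter_counts, "twitter"),
                  to_long(rss_counts, "rss"))

# One line per candidate, one panel per source.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(combined,
                       ggplot2::aes(x = times, y = frequency, colour = candidate)) +
    ggplot2::geom_line() +
    ggplot2::facet_wrap(~ source)
}
```

Keeping the source as a column means one ggplot call can compare Twitter and RSS side by side.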
  10. Final Result
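  11. Example 2: Likely Phrase Generator

         twsrc   <- "twitter_sample"  # data source to walk through
         first   <- "statistics"      # seed word; not defined on the slide, this is one seed the talk mentions
         n_words <- 20                # phrase length; the slide's loop bound (1:i) was undefined, so this is an assumption

         # Return the second word of a two-word gram.
         lastGram <- function(g){
           strsplit(g, " ")[[1]][[2]]
         }

         # Most frequent bigram that starts with the seed word.
         vc <- topGrams(twsrc,
                        filter=first, limit=1,
                        m=1, n=2, prefix=TRUE,
                        resolution="daily")$gram

         phrase <- c()

         # Repeatedly append the second word, then look up the top bigram starting with it.
         for (i in 1:n_words){
           vc <- lastGram(vc)
           phrase <- c(phrase, vc)
           vc <- topGrams(twsrc, filter=vc, limit=1, m=1, n=2,
                          prefix=TRUE, dev_server=TRUE,
                          resolution="daily")$gram
         }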
  12. Likely phrases from earlier today:
       • Twitter: "im going back :) lt3 please follow back :) lt3 please"
       • Technology RSS feeds: "user interface displays users click scheme federal trade commission ftc antitrust complaint outside occupy wall street"
       • same source, seeded with the word "statistics": "statistics showing highlights google apps like behavioral advertising refers obliquely suggested session sounded viable business edition"
       • Politics RSS feeds: "washington university battleground poll numbers superfan badge request may become president obama administration asked whether congress approval"
       • Major papers RSS feeds: "percent stake throughout california chapter 11 years ago effectively sealed george w street movement prefers birds early"
       • Federal Register: "revision incorporates provisions related investigative actions could result based upon fresh prunes grown ornamentals ca fip"
  13. Feeling Adventurous?
     We're looking for beta testers for the R package! In Shackleton's words, what to expect:
     ...BITTER COLD, LONG MONTHS OF COMPLETE DARKNESS, CONSTANT DANGER, SAFE RETURN DOUBTFUL...
     But it can still be fun! You can talk with me about it, or get in touch later at homer@luckysort.com
  14. That's all!
     Thanks for listening. Questions?
