"R & Text Analytics" (15 January 2013)

January 2013 Portland R User Group MeetUp Presentation

R & Text Analytics

Daniel Fennelly
Portland R User Group
Portland, Oregon
15 January 2013
Following are some notes on the usage of the R package TopicWatchr.

TopicWatchr is designed to neatly access the [LuckySort API](http://luckysort.com/products/api/docs/intro).

TopicWatchr was authored by Homer Strong and is currently maintained and updated by Daniel Fennelly.
    > library(TopicWatchr)
    Loading required package: RJSONIO
    Loading required package: RCurl
    Loading required package: bitops
    Welcome to TopicWatchr! Remember to check for updates regularly.
    Found TopicWatch account file in ~/.tw
    Welcome daniel@luckysort.com

Credentials can be stored in `~/.tw`:

    daniel@luckysort.com
    hunter2

Or you can authenticate in the interactive shell...

    > clearCredentials()
    > setCredentials()
    Enter username: daniel@luckysort.com
    Enter password:
    >

Note: Be careful about the password prompt in ESS. It seems ESS hides the password in the minibuffer before displaying it in the *R* buffer.
Package Summary

1. Formulate and send API requests according to task
2. Receive and parse JSON responses
3. Page through multiple requests, offer quick visualization tools, and other utilities

Other end-user tools to access this data include the [TopicWatch](https://studio.luckysort.com/) web interface and the tw.py Python client.
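As a rough illustration of steps 1 and 2, the request below mimics the `documentcounts` call whose debug output appears on the "Terms in Time" slide later in these notes. The endpoint and query parameters are copied from that output; the authentication mechanism and the shape of the parsed JSON are assumptions, so treat this as a sketch of what the package does under the hood rather than its actual implementation.

    # Sketch only: hand-built version of the documentcounts request that
    # TopicWatchr formulates. URL and parameters come from the debug
    # output shown later; auth and response shape are assumptions.
    library(RCurl)
    library(RJSONIO)

    url <- paste0(
      "https://api.luckysort.com/v1/sources/twitter_sample/metrics/documentcounts",
      "?start=2012-12-01T12:00:00Z&end=2013-01-15T07:10:19Z",
      "&grams=obama,mayan,newtown,iphone",
      "&limit=300&resolution=86400&offset=0&freq=TRUE"
    )

    raw.json <- getURL(url, userpwd = "daniel@luckysort.com:hunter2")
    parsed   <- fromJSON(raw.json)
    str(parsed, max.level = 2)  # inspect whatever the API actually returns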
The Basics

The data we work with at LuckySort, and which we'll be talking about here, have a few specific qualities:

1. Text Sources
2. Terms
3. Time
The Basics: Text Sources

• Hourly: Twitter data, StockTwits, consumer Facebook statuses, Wordpress posts and comments...
• Daily: RSS news sources, Amazon.com product reviews, Benzinga News Updates
• Your data? (Talk with us!)

Let's fetch our personal list of sources.

    > my.sources <- getSources()
    > head(my.sources)
                                                name                      id
    1              Wordpress Intense Debate comments       wp_en_comments-id
    2                                     StockTwits             stock_twits
    3                          Benzinga News Updates benzinga_news_updates_1
    4                                      AngelList                 angelco
    5          Amazon.com Shoes best sellers reviews  amzn-bestsellers-shoes
    6 Amazon.com Home & Kitchen best sellers reviews   amzn-bestsellers-home
    > dim(my.sources)
    [1] 35  2
The Basics: Text Sources

Let's get some more specific metadata.

    > twitter.info <- getSourceInfo("twitter_sample")
    > names(twitter.info)
    [1] "metrics"           "resolutions"       "users"
    [4] "name"              "finest_resolution" "owner"
    [7] "aggregate_type"    "type"              "id"
    > twitter.info$finest_resolution
    [1] 3600
    > twitter.info$metrics
    [1] "documentcounts"

Sources have specific resolutions available to them, given in seconds. The finest resolution for Twitter is one hour. The metrics are almost always going to just be "documentcounts", although we're working on making available numeric sources like stock market or exchange rate information.
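Since `getSources` and `getSourceInfo` compose naturally, a quick way to survey the finest resolution of every source you can see is a sketch like the following (assuming `getSourceInfo` accepts each id returned by `getSources`, as it does for "twitter_sample" above):

    # Sketch: tabulate finest resolutions (in seconds) across all sources.
    # Assumes every id returned by getSources() is valid for getSourceInfo().
    my.sources <- getSources()
    ids <- as.character(my.sources$id)
    finest <- sapply(ids, function(id) getSourceInfo(id)$finest_resolution)
    data.frame(source = my.sources$name, finest.seconds = finest)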
The Basics: Terms in Time

Our most basic analysis is that of a term occurring within a streaming document source: how are <term> occurrences in <document source> changing over time from <start> to <finish>?

    > end <- Sys.time()
    > start <- ISOdate(2012, 12, 01, tz="PST")
    > start; end
    [1] "2012-12-01 12:00:00 PST"
    [1] "2013-01-14 23:10:19 PST"
    > terms <- c("obama", "mayan", "newtown", "iphone")
    > resolution <- 3600 * 24
    > recent.news <- metric.counts(terms, src="twitter_sample", start=start, end=end, resolution=resolution, freq=T, debug=T)
    get: https://api.luckysort.com/v1/sources/twitter_sample/metrics/documentcounts?start=2012-12-01T12:00:00Z&end=2013-01-15T07:10:19Z&grams=obama,mayan,newtown,iphone&limit=300&resolution=86400&offset=0&freq=TRUE
The Basics: Terms in Time

Let's plot our data and see what it looks like! The function `plotSignal` just wraps some handy ggplot2 code. For anything sophisticated you'll probably want to tailor your plotting to your own needs.

    > png("news.png", width=1280, height=720)
    > plotSignal(recent.news)
    > dev.off()

![news_plot](http://github.com/danielfennelly/prug-topicwatchr/raw/master/images/news.png)
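If you do want to tailor the plot, signals come back as a data frame with a `times` column plus one column per term (compare the `aapl.sentiment` output later in these notes). A minimal hand-rolled equivalent might look like this; the use of reshape2's `melt` is an assumption, not necessarily what `plotSignal` does internally:

    # Sketch: hand-rolled version of what plotSignal() produces, assuming
    # a data frame with a "times" column plus one numeric column per term.
    library(ggplot2)
    library(reshape2)

    long.news <- melt(recent.news, id.vars = "times",
                      variable.name = "term", value.name = "frequency")
    ggplot(long.news, aes(x = times, y = frequency, colour = term)) +
        geom_line() +
        labs(title = "Term frequencies on twitter_sample")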
The Basics: Terms in Time

Of course one's choice of resolution is going to change the look of the data. At the daily resolution there's no way to disambiguate between sustained daily usage of a term and rapid usage within a short time span. Take a look at these plots of the same terms over the same time span collected at hourly and daily resolution.
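The two plots that follow were presumably produced by calls along these lines; the `metric.counts` arguments mirror the earlier slide, and the plotting calls are illustrative rather than the exact ones used:

    # Sketch: fetch the same terms at daily and at hourly resolution.
    tech.daily  <- metric.counts(terms, src="twitter_sample", start=start,
                                 end=end, resolution=3600 * 24, freq=TRUE)
    tech.hourly <- metric.counts(terms, src="twitter_sample", start=start,
                                 end=end, resolution=3600, freq=TRUE)
    plotSignal(tech.daily)
    plotSignal(tech.hourly)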
The Basics: Terms in Time

![tech words daily resolution](http://github.com/danielfennelly/prug-topicwatchr/raw/master/images/techWordsDaily.png)
The Basics: Terms in Time

![tech words hourly resolution](http://github.com/danielfennelly/prug-topicwatchr/raw/master/images/techWordsHourly.png)
The Basics: Term Co-occurrences

Moving beyond simple word counts, we're often interested in the subset of a text source mentioning a specific term. We also might want to compact the occurrence of several related terms into a single signal. This is where *filters* and *indices* come in handy! An index like `~bullish` is just a weighted sum of terms. For example, the terms `buy`, `upgrade`, `longterm`, and `added` are all contained within the `~bullish` index. We've created several public indices like these which we feel are useful in certain applications like stock market or consumer sentiment analysis. (Of course users can also create their own indices too.)

Let's look at the behavior of the `~bullish` and `~bearish` indices on StockTwits, a Twitter-like community around the stock market. We filter on documents containing Apple's ticker symbol "$aapl" so that the only signals we're looking at are in some way related to Apple.
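To make "weighted sum of terms" concrete, here is a toy index computed client-side. The component terms come from the paragraph above, but the weights are invented for illustration; the real `~bullish` weights live server-side and aren't exposed in these notes.

    # Toy illustration of an index as a weighted sum of term signals.
    # The weights below are invented, not the real ~bullish weights.
    bullish.terms   <- c("buy", "upgrade", "longterm", "added")
    bullish.weights <- c(0.4, 0.3, 0.2, 0.1)  # hypothetical

    components <- metric.counts(bullish.terms, src="stock_twits",
                                start=start, end=end,
                                resolution=86400, freq=TRUE)
    my.bullish <- data.frame(
      times = components$times,
      index = as.numeric(as.matrix(components[, bullish.terms]) %*% bullish.weights)
    )
    head(my.bullish)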
The Basics: Term Co-occurrences

    > aapl.sentiment <- aggregateCooccurrences("stock_twits", "$aapl", c("~bullish", "~bearish"), start=start, end=end, debug=TRUE, resolution=86400)
    > head(aapl.sentiment)
                    times  ~bearish  ~bullish
    1 2012-12-01 16:00:00 0.1398305 0.1313559
    2 2012-12-02 16:00:00 0.1944719 0.1924975
    3 2012-12-03 16:00:00 0.2195296 0.2074127
    4 2012-12-04 16:00:00 0.2502294 0.1945549
    5 2012-12-05 16:00:00 0.1986820 0.1805601
    6 2012-12-06 16:00:00 0.2187758 0.1786600
    > plotSignal(aapl.sentiment)
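A natural follow-up, not shown in the deck, is to collapse the two indices into a single net-sentiment series:

    # My addition, not from the deck: bullish minus bearish as one signal.
    library(ggplot2)
    aapl.sentiment$net <- aapl.sentiment[["~bullish"]] - aapl.sentiment[["~bearish"]]
    ggplot(aapl.sentiment, aes(x = times, y = net)) +
        geom_line() +
        labs(y = "bullish minus bearish", title = "$aapl net sentiment")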
The Basics: Term Co-occurrences

![aapl sentiment](http://github.com/danielfennelly/prug-topicwatchr/raw/master/images/aapl_sentiment.png)
Prototyping Event Analysis

How do we identify transient spikes corresponding to real-world events? Suppose we want to use only these document-count time series, and that we have a sliding history window. We might start with example data of events and try the performance of a couple of different algorithms.

    source.id,datetime,gram,event,n
    twitter_sample,2012-09-12 05:45:00 -0700,apple,true,2
    twitter_sample,2012-08-24 15:45:00 -0700,patent,true,1
    twitter_sample,2012-10-29 08:00:00 -0700,#sandy,true,3
    stock_twits,2012-10-02 08:15:00 -0700,$CMG,true,1
    stock_twits,2012-09-13 09:30:00 -0700,fed,true,2
    stock_twits,2012-04-11 07:00:00 -0700,lawsuit,true,1
    ...

Let's look more specifically at the case of the term "fed" on StockTwits. From here on we're going to be looking at some code I used to prototype the alerts feature on TopicWatch. This prototyping code is not part of TopicWatchr, but is an example application of the package.
Prototyping Event Analysis

    > ev <- read.events("data/events.csv")
    > fed.freq <- get.signal("fed", "2012-09-13 09:30:00", freq=T)
    > head(fed.freq)
                    times         vals
    1 2012-09-01 11:00:00 0.0000000000
    2 2012-09-01 12:00:00 0.0000000000
    3 2012-09-01 13:00:00 0.0009699321
    4 2012-09-01 14:00:00 0.0000000000
    5 2012-09-01 15:00:00 0.0000000000
    6 2012-09-01 16:00:00 0.0000000000
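`read.events` and `get.signal` belong to the prototyping code rather than TopicWatchr, and their bodies aren't shown in the deck. Given the CSV layout on the previous slide, a minimal reconstruction of `read.events` might look like this (a guess, not the author's actual code):

    # Speculative reconstruction of read.events(), based only on the
    # CSV layout shown above; not the actual prototype code.
    read.events <- function(path) {
        ev <- read.csv(path, stringsAsFactors = FALSE)
        ev$datetime <- as.POSIXct(ev$datetime, format = "%Y-%m-%d %H:%M:%S %z")
        ev
    }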
Prototyping Event Analysis

![fed signal](http://github.com/danielfennelly/prug-topicwatchr/raw/master/images/fed_signal.png)
Prototyping Event Analysis

    > compute.thresholds <- function(x, window=96, t.func=compute.threshold) {
    +     L <- length(x) - window
    +     t <- rep(NA, length(x))
    +     for (i in 1:L) {
    +         # the threshold at point window+i is computed from the
    +         # `window` points immediately before it
    +         t[window + i] <- t.func(x[i:(i + window - 1)])
    +     }
    +     t
    + }
    > z.function <- function(theta=2.5) { function(x) { mean(x) + theta * sd(x) } }
    > max.function <- function(theta=1.0) { function(x) { max(x) * theta } }
    > cv.function <- function(theta=1.0) { function(x) { mean(x) + sd(x) * (theta + (sd(x) / mean(x))) } }
    > fed.freq$z <- compute.thresholds(fed.freq$vals, t.func=z.function())
    > fed.freq$max <- compute.thresholds(fed.freq$vals, t.func=max.function())
    > fed.freq$cv <- compute.thresholds(fed.freq$vals, t.func=cv.function())
    > long.fed <- convert.to.long(fed.freq, "times")
    > ggplot(long.fed) + geom_line(aes(x=times, y=value, col=variable)) +
    +     scale_color_manual(values=c("black", "red", "blue", "green"))
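The deck stops at plotting, but given these threshold columns, flagging candidate events is just a comparison. A sketch (my addition, not from the deck):

    # Sketch: flag hours where the signal exceeds the rolling
    # z-score threshold.
    fed.freq$alert <- !is.na(fed.freq$z) & fed.freq$vals > fed.freq$z
    subset(fed.freq, alert, select = c(times, vals, z))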
Prototyping Event Analysis

![fed signal](http://github.com/danielfennelly/prug-topicwatchr/raw/master/images/fed_signal.png)
Source Statistics

    > twitter.docs <- document.submatrix("twitter_sample", end=Sys.time(), hours=8, to.df=FALSE)
    > length(twitter.docs)
    [1] 225
    > twitter.docs[[1]]
         best reality best reality show           reality              love
                    1                 1                 1                 1
                flava        best flava           of love              show
                    1                 1                 1                 1
         reality show
                    1
    > twitter.docterm <- submatrix.to.dataframe(twitter.docs, max.n=1)
    > dim(twitter.docterm)
    [1]  225 1280
    > term.sums <- colSums(twitter.docterm)
    > mean(term.sums)
    [1] 1.283594
    > max(term.sums)
    [1] 14

Now we have some information about our sampling of Twitter documents. We have 225 documents, with 1280 unique terms. Right now the above function is simply grabbing 25 Twitter documents per hour over the past 8 hours.
Source Statistics

[Zipf's Law](http://en.wikipedia.org/wiki/Zipfs_law) is a classic finding in the field of lexical analysis.

    > term.sums <- sort(term.sums, decreasing=TRUE)
    > qplot(x=log(1:length(term.sums)), y=log(term.sums))

![twitter zipf](http://github.com/danielfennelly/prug-topicwatchr/raw/master/images/twitter_zipf.png)
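Zipf's law says term frequency falls off roughly as 1/rank^s, which shows up as a straight line of slope -s on the log-log plot above. A quick check of the exponent (my addition, not in the deck):

    # Estimate the Zipf exponent s by least squares on log-log
    # rank/frequency data; a slope near -1 is classic Zipf behavior.
    ranks <- seq_along(term.sums)
    fit <- lm(log(term.sums) ~ log(ranks))
    coef(fit)[2]  # slope estimate, i.e. -s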
Feeling Adventurous?

Last time at LuckySort HQ, we were looking for beta testers for the R package! In Shackleton's words, what to expect:

**...BITTER COLD, LONG MONTHS OF COMPLETE DARKNESS, CONSTANT DANGER, SAFE RETURN DOUBTFUL...**

This time around we're in a slightly more stable place. There's more data, more options, and more opportunities to maybe discover some cool stuff! (Expect some darkness, minimal danger, and a shrinking population of software bugs.)

prug-topicwatchr

See also these notes, used at the Portland R Users Group meeting on 15 January 2013, on GitHub. They cover basic usage of the TopicWatchr package to pull time-series text data from the LuckySort API, with some examples of prototyping event-detection heuristics with R. (https://github.com/danielfennelly/prug-topicwatchr)

Talk with me about it, or get in touch later at daniel@luckysort.com
