January 2013 Portland R User Group Meetup Presentation




R & Text Analytics

Daniel Fennelly

Portland R User Group
Portland, Oregon
15 January 2013
Following are some notes on the usage of the R package TopicWatchr.

TopicWatchr is designed to provide convenient access to the [LuckySort API](http://luckysort.com/products/api/docs/intro).

TopicWatchr was authored by Homer Strong and is currently maintained and updated by Daniel Fennelly.

    > library(TopicWatchr)
    Loading required package: RJSONIO
    Loading required package: RCurl
    Loading required package: bitops
    Welcome to TopicWatchr!
    Remember to check for updates regularly.
    Found TopicWatch account file in ~/.tw
    Welcome daniel@luckysort.com


Credentials can be stored in `~/.tw`

   daniel@luckysort.com hunter2


Or you can authenticate in the interactive shell...

   > clearCredentials()
   > setCredentials()
   Enter username: daniel@luckysort.com
   Enter password:
   >


Note: be careful with the password prompt in ESS. It seems ESS hides the
password in the minibuffer but then echoes it in the *R* buffer.
Package Summary

1. Formulate and send API requests according to task
2. Receive and parse JSON response
3. Page through multiple requests, offer quick visualization tools, other utilities

Other end-user tools for accessing this data include the
[TopicWatch](https://studio.luckysort.com/) web interface and the tw.py
Python client.
The Basics

The data we work with at LuckySort, and which we'll be talking about here,
have a few specific qualities:

1. Text Sources
2. Terms
3. Time
The Basics
Text Sources

•   Hourly: Twitter data, StockTwits, consumer Facebook statuses, WordPress
    posts and comments...
•   Daily: RSS news sources, Amazon.com product reviews, Benzinga News
    Updates
•   your data? (talk with us!)

Let's fetch our personal list of sources.


> my.sources <- getSources()
> head(my.sources)
                                               name                      id
 1                Wordpress Intense Debate comments       wp_en_comments-id
 2                                       StockTwits             stock_twits
 3                            Benzinga News Updates benzinga_news_updates_1
 4                                        AngelList                 angelco
 5            Amazon.com Shoes best sellers reviews  amzn-bestsellers-shoes
 6   Amazon.com Home & Kitchen best sellers reviews   amzn-bestsellers-home
> dim(my.sources)
[1] 35 2
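
Since `getSources` just returns a two-column data frame, looking up an id by
partial display name is plain base R (a quick sketch; `grepl` and the column
names are the only ingredients, and both appear above):

   > # rows whose display name mentions Amazon; the id column holds the API identifier
   > my.sources[grepl("Amazon", my.sources$name), "id"]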
The Basics
Text Sources

Let's get some more specific metadata.


  > twitter.info <- getSourceInfo("twitter_sample")
  > names(twitter.info)
  [1] "metrics"           "resolutions"       "users"
  [4] "name"              "finest_resolution" "owner"
  [7] "aggregate_type"    "type"              "id"
  > twitter.info$finest_resolution
  [1] 3600
  > twitter.info$metrics
  [1] "documentcounts"




Sources have specific resolutions available to them, given in seconds. The
finest resolution for Twitter is one hour. The metric will almost always just
be "documentcounts", although we're working on making numeric sources
available, like stock market or exchange rate information.
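
For example, the resolution fields are plain seconds, so converting them to
hours is simple arithmetic (using only the `twitter.info` fields shown above;
the contents of the `resolutions` vector vary by source):

   > twitter.info$finest_resolution / 3600   # 3600 seconds = 1 hour
   [1] 1
   > twitter.info$resolutions / 3600         # all available resolutions, in hours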
The Basics
Terms in Time

Our most basic analysis is of a term occurring within a streaming
document source: how are <term> occurrences in <document source>
changing over time from <start> to <finish>?


  > end <- Sys.time()
  > start <- ISOdate(2012, 12, 01, tz="PST")
  > start; end
  [1] "2012-12-01 12:00:00 PST"
  [1] "2013-01-14 23:10:19 PST"
  > terms <- c("obama", "mayan", "newtown", "iphone")
  > resolution <- 3600 * 24
  > recent.news <- metric.counts(terms, src="twitter_sample", start=start, end=end,
  resolution=resolution, freq=T, debug=T)
  get: https://api.luckysort.com/v1/sources/twitter_sample/metrics/documentcounts?
  start=2012-12-01T12:00:00Z&end=2013-01-15T07:10:19Z&grams=obama,mayan,newtown,
  iphone&limit=300&resolution=86400&offset=0&freq=TRUE
The Basics
Terms in Time

Let's plot our data and see what it looks like! The function `plotSignal` just
wraps some handy ggplot2 code. For anything sophisticated you'll probably
want to tailor your plotting to your own needs.

   > png("news.png", width=1280, height=720)
   > plotSignal(recent.news)
   > dev.off()




      ![news_plot](http://github.com/danielfennelly/prug-topicwatchr/raw/master/images/news.png)
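
If you do want to roll your own plot, a sketch like the following works under
the assumption (consistent with the co-occurrence example later) that the
result is a wide data frame with a `times` column plus one column per term;
the melt step and axis labels here are illustrative, not TopicWatchr code:

   library(ggplot2)
   library(reshape2)

   # wide (times + one column per term) to long format, then one line per term
   news.long <- melt(recent.news, id.vars="times",
                     variable.name="term", value.name="freq")
   ggplot(news.long, aes(x=times, y=freq, color=term)) +
     geom_line() +
     labs(x=NULL, y="document frequency", color=NULL)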
The Basics
Terms in Time




Of course, one's choice of resolution changes the look of the data. At
the daily resolution there's no way to distinguish between sustained daily
usage of a term and rapid usage within a short time span. Take a look at these
plots of the same terms over the same time span, collected at hourly and daily
resolution.
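
Reproducing that comparison only requires changing the `resolution` argument
in the same `metric.counts` call as before (a sketch; the terms, source, and
time window carry over from the earlier example):

   > hourly <- metric.counts(terms, src="twitter_sample", start=start, end=end,
   +                         resolution=3600, freq=T)        # one-hour buckets
   > daily  <- metric.counts(terms, src="twitter_sample", start=start, end=end,
   +                         resolution=3600 * 24, freq=T)   # one-day buckets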
The Basics
Terms in Time




![tech words daily resolution](http://github.com/danielfennelly/prug-topicwatchr/raw/master/images/techWordsDaily.png)
The Basics
Terms in Time




![tech words hourly resolution](http://github.com/danielfennelly/prug-topicwatchr/raw/master/images/techWordsHourly.png)
The Basics
Term Co-occurrences


Moving beyond simple word counts, we're often interested in the subset of a
text source mentioning a specific term. We also might want to compact the
occurrence of several related terms into a single signal. This is where *filters*
and *indices* come in handy! An index like `~bullish` is just a weighted sum
of terms. For example, the terms `buy`, `upgrade`, `longterm` and `added`
are all contained within the `~bullish` index. We've created several public
indices like these which we feel are useful in applications like stock
market or consumer sentiment analysis. (Users can also create their own
indices.)

Let's look at the behavior of the `~bullish` and `~bearish` indices on
StockTwits, a Twitter-like community around the stock market. We filter on
documents containing Apple's ticker symbol "$aapl" so that the only signals
we're looking at are in some way related to Apple.
The Basics
Term Co-occurrences




> aapl.sentiment <- aggregateCooccurrences("stock_twits", "$aapl", c("~bullish",
"~bearish"), start=start, end=end, debug=TRUE, resolution=86400)
> head(aapl.sentiment)
                times  ~bearish  ~bullish
1 2012-12-01 16:00:00 0.1398305 0.1313559
2 2012-12-02 16:00:00 0.1944719 0.1924975
3 2012-12-03 16:00:00 0.2195296 0.2074127
4 2012-12-04 16:00:00 0.2502294 0.1945549
5 2012-12-05 16:00:00 0.1986820 0.1805601
6 2012-12-06 16:00:00 0.2187758 0.1786600
> plotSignal(aapl.sentiment)
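
A natural follow-up is to collapse the two indices into a single
bullish-minus-bearish spread (a base R sketch; note the `[[ ]]` access, since
these column names start with `~`):

   > # positive values: more bullish than bearish co-occurrence that day
   > aapl.sentiment$spread <- aapl.sentiment[["~bullish"]] - aapl.sentiment[["~bearish"]]
   > plot(aapl.sentiment$times, aapl.sentiment$spread, type="l",
   +      xlab="", ylab="bullish - bearish")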
The Basics
Term Co-occurrences




![aapl sentiment](http://github.com/danielfennelly/prug-topicwatchr/raw/master/images/aapl_sentiment.png)
Prototyping Event Analysis

How do we identify transient spikes corresponding to real-world events?
Suppose we want to use only these document-count time series and that we
have a sliding history window. We might start with example data of known
events and compare the performance of a couple of different algorithms.


  source.id,datetime,gram,event,n
  twitter_sample,2012-09-12 05:45:00 -0700,apple,true,2
  twitter_sample,2012-08-24 15:45:00 -0700,patent,true,1
  twitter_sample,2012-10-29 08:00:00 -0700,#sandy,true,3
  stock_twits,2012-10-02 08:15:00 -0700,$CMG,true,1
  stock_twits,2012-09-13 09:30:00 -0700,fed,true,2
  stock_twits,2012-04-11 07:00:00 -0700,lawsuit,true,1
  ...



Let's look more specifically at the case of the term "fed" on StockTwits. From
here on we're going to be looking at some code I used to prototype the alerts
feature on TopicWatch. This prototyping code is not part of TopicWatchr, but it
is an example application of the package.
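
As noted, `read.events` and `get.signal` below are prototyping helpers rather
than TopicWatchr functions. For reference, a minimal `read.events` might look
like this; the `%z` parsing of the `-0700` offsets is an assumption about the
prototype, not its actual code:

   read.events <- function(path) {
     ev <- read.csv(path, stringsAsFactors=FALSE)
     # the datetime strings carry a UTC offset like "-0700"; %z parses it
     ev$datetime <- as.POSIXct(ev$datetime, format="%Y-%m-%d %H:%M:%S %z")
     ev
   }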
Prototyping Event Analysis

  > ev <- read.events("data/events.csv")
  > fed.freq <- get.signal("fed", "2012-09-13 09:30:00", freq=T)
  > head(fed.freq)
                   times        vals
  1 2012-09-01 11:00:00 0.0000000000
  2 2012-09-01 12:00:00 0.0000000000
  3 2012-09-01 13:00:00 0.0009699321
  4 2012-09-01 14:00:00 0.0000000000
  5 2012-09-01 15:00:00 0.0000000000
  6 2012-09-01 16:00:00 0.0000000000
Prototyping Event Analysis




      ![fed signal](http://github.com/danielfennelly/prug-topicwatchr/raw/master/images/fed_signal.png)
Prototyping Event Analysis

> compute.thresholds <- function(x, window=96, t.func=compute.threshold){
    L <- length(x) - window
    t <- rep(NA, length(x))
    for(i in 1:L){
      # threshold at position window+i comes from the preceding `window` points
      t[window + i] <- t.func(x[i:(i + window - 1)])
    }
    t
  }
> # each factory returns a threshold function of the window contents
> z.function   <- function(theta=2.5){ function(x){ mean(x) + theta * sd(x) } }  # mean plus theta standard deviations
> max.function <- function(theta=1.0){ function(x){ max(x) * theta } }           # multiple of the window maximum
> cv.function  <- function(theta=1.0){ function(x){ mean(x) + sd(x) * (theta + sd(x) / mean(x)) } }  # widens with the coefficient of variation
> fed.freq$z   <- compute.thresholds(fed.freq$vals, t.func=z.function())
> fed.freq$max <- compute.thresholds(fed.freq$vals, t.func=max.function())
> fed.freq$cv  <- compute.thresholds(fed.freq$vals, t.func=cv.function())

> long.fed <- convert.to.long(fed.freq, "times")
> ggplot(long.fed) + geom_line(aes(x=times, y=value, col=variable)) +
scale_color_manual(values=c("black", "red", "blue", "green"))
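
With those threshold columns in place, flagging candidate events is just a
comparison; a sketch using the z threshold (the NA entries cover the warm-up
window, where no threshold exists yet):

   > # time points where the observed frequency exceeds the rolling z threshold
   > alerts <- which(!is.na(fed.freq$z) & fed.freq$vals > fed.freq$z)
   > fed.freq$times[alerts]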
Prototyping Event Analysis




      ![fed signal](http://github.com/danielfennelly/prug-topicwatchr/raw/master/images/fed_signal.png)
Source Statistics

 > twitter.docs <- document.submatrix("twitter_sample", end=Sys.time(), hours=8,
 to.df=FALSE)
 > length(twitter.docs)
 [1] 225
 > twitter.docs[[1]]
    best reality best reality show           reality              love
               1                 1                 1                 1
           flava              best     flava of love              show
               1                 1                 1                 1
    reality show
               1
 > twitter.docterm <- submatrix.to.dataframe(twitter.docs, max.n=1)
 > dim(twitter.docterm)
 [1] 225 1280
 > term.sums <- colSums(twitter.docterm)
 > mean(term.sums)
 [1] 1.283594
 > max(term.sums)
 [1] 14

Now we have some information about our sample of Twitter documents: 225
documents containing 1280 unique terms. Right now the above function simply
grabs 25 Twitter documents per hour over the past 8 hours.
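
One more quick check worth making: document-term matrices like this are
typically very sparse, and that's easy to measure here (a sketch using the
225 x 1280 `twitter.docterm` frame from above):

   > # fraction of zero entries in the document-term matrix
   > mean(as.matrix(twitter.docterm) == 0)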
Source Statistics

[Zipf's Law](http://en.wikipedia.org/wiki/Zipf's_law) is a classic finding in the
field of lexical analysis.


   > term.sums <- sort(term.sums, decreasing=TRUE)
   > qplot(x=log(1:length(term.sums)), y=log(term.sums))




     ![twitter zipf](http://github.com/danielfennelly/prug-topicwatchr/raw/master/images/twitter_zipf.png)
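
If the log-log plot looks roughly linear, the Zipf exponent can be estimated
with a simple regression (a sketch; Zipf's law predicts a slope near -1,
though a small Twitter sample will deviate):

   > rank <- 1:length(term.sums)
   > fit <- lm(log(term.sums) ~ log(rank))
   > coef(fit)["log(rank)"]   # slope of the log-log fit; near -1 under Zipf's law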
Feeling Adventurous?

Last time at LuckySort HQ:


 We're looking for beta testers for the R package! In Shackleton's words, what to expect:

 **...BITTER COLD, LONG MONTHS OF COMPLETE DARKNESS, CONSTANT DANGER, SAFE RETURN DOUBTFUL...**




This time around we're in a slightly more stable place. There's more data, more options,
and more opportunities to maybe discover some cool stuff! (Expect some darkness,
minimal danger, and a shrinking population of software bugs.)


prug-topicwatchr

These notes, used at the Portland R User Group meeting on 15 January 2013, are
also available on GitHub. They cover basic usage of the TopicWatchr package to
pull time-series text data from the LuckySort API, with some examples of
prototyping event detection heuristics in R. ( https://github.com/danielfennelly/prug-topicwatchr )


Talk with me about it, or get in touch later at daniel@luckysort.com

"R & Text Analytics" (15 January 2013)

  • 1. January 2013 Portland User Group MeetUp Presentation R & Text Analytics Daniel Fennelly Portland R User Group Portland, Oregon 15 January 2013
  • 2. Following are some notes on the usage of the R package TopicWatchr. TopicWatchr is designed to neatly access the [Luckysort API](http://luckysort.com/products/api/docs/intro) TopicWatchr was authored by Homer Strong and is currently maintained and updated by Daniel Fennelly.
  • 3. > library(TopicWatchr) Loading required package: RJSONIO Loading required package: RCurl Loading required package: bitops Welcome to TopicWatchr! Remember to check for updates regularly. Found TopicWatch account file in ~/.tw Welcome daniel@luckysort.com Credentials can be stored in `~/.tw` daniel@luckysort.com hunter2 Or you can authenticate in the interactive shell... > clearCredentials() > setCredentials() Enter username: daniel@luckysort.com Enter password: > Note: Be careful about the password prompt in ESS. It seems ESS hides the password in the minibuffer before displaying it in the *R* buffer.
  • 4. Package Summary 1. Formulate and send API requests according to task 2. Receive and parse JSON response 3. Page through multiple requests, offer quick visualization tools, other utilities Other end-user tools to access this data include the [TopicWatch](https://studio.luckysort.com/) web interface and the tw.py python client.
  • 5. The Basics The data we work with at LuckySort and which we'll be talking about here have a few specific qualities: 1. Text Sources 2. Terms 3. Time
  • 6. The Basics Text Sources • Hourly: Twitter Data, StockTwits, Consumer Facebook statuses, Wordpress posts and comments... • Daily: RSS news sources, Amazon.com product reviews, Benzinga News Updates • your data? (talk with us!) Let's fetch our personal list of our sources. > my.sources <- getSources() > head(my.sources) name id 1 Wordpress Intense Debate comments wp_en_comments-id 2 StockTwits stock_twits 3 Benzinga News Updates benzinga_news_updates_1 4 AngelList angelco 5 Amazon.com Shoes best sellers reviews amzn-bestsellers-shoes 6 Amazon.com Home & Kitchen best sellers reviews amzn-bestsellers-home > dim(my.sources) [1] 35 2
  • 7. The Basics Text Sources Let's get some more specific metadata. > twitter.info <- getSourceInfo("twitter_sample") > names(twitter.info) [1] "metrics" "resolutions" "users" [4] "name" "finest_resolution" "owner" [7] "aggregate_type" "type" "id" > twitter.info$finest_resolution [1] 3600 > twitter.info$metrics [1] "documentcounts" Sources have specific resolutions available to them, given in seconds. The finest resolution for Twitter is one hour. The metrics are almost always going to just be "documentcounts", although we're working on making available numeric sources like stock market or exchange rate information.
  • 8. The Basics Terms in Time Our most basic analysis is that of the term occurring within a streaming document source. How are <term> occurrences in <document source> changing over time from <start> to <finish>. > end <- Sys.time() > start <- ISOdate(2012, 12, 01, tz="PST") > start; end [1] "2012-12-01 12:00:00 PST" [1] "2013-01-14 23:10:19 PST" > terms <- c("obama", "mayan", "newtown", "iphone") > resolution <- 3600 * 24 > recent.news <- metric.counts(terms, src="twitter_sample", start=start, end=end, resolution=resolution, freq=T, debug=T) get: https://api.luckysort.com/v1/sources/twitter_sample/metrics/documentcounts? start=2012-12-01T12:00:00Z&end=2013-01-15T07:10:19Z&grams=obama,mayan,newtown, iphone&limit=300&resolution=86400&offset=0&freq=TRUE
  • 9. The Basics Terms in Time Let's plot our data and see what it looks like! The function `plotSignal` just wraps some handy ggplot2 code. For anything sophisticated you'll probably want to tailor your plotting to your own needs. > png("news.png", width=1280, height=720) > plotSignal(recent.news) > dev.off() ![news_plot](http://github.com/danielfennelly/prug-topicwatchr/raw/master/images/news.png)
  • 10. The Basics Terms in Time Of course one's choice of resolution is going to change the look of the data. At the daily resolution there's no way to disambiguate between sustained daily usage of a term or rapid usage within a short time span. Take a look at these plots of the same terms over the same time span collected at hourly and daily resolution.
  • 11. The Basics Terms in Time ![tech words daily resolution](http://github.com/danielfennelly/prug-topicwatchr/raw/master/images/techWordsDaily.png)
  • 12. The Basics Terms in Time ![tech words hourly resolution](http://github.com/danielfennelly/prug-topicwatchr/raw/master/images/techWordsHourly.png)
  • 13. The Basics Term Co-occurrences Moving beyond simple word counts, we're often interested in the subset of a text source mentioning a specific term. We also might want to compact the occurrence of several related terms into a single signal. This is where *filters* and *indices* come in handy! An index like `~bullish` is just a weighted sum of terms. For example, the terms `buy`, `upgrade`, `longterm` and `added` are all contained within the `~bullish` index. We've created several public indices like these which we feel are useful in certain applications like stock market or consumer sentiment analysis. (Of course users can also create their own indices too.) Let's look at the behavior of the `~bullish` and `~bearish` indices on StockTwits, a twitter-like community around the stock market. We filter on documents containing Apple's ticker symbol "$aapl" so that the only signals we're looking at are in some way related to Apple.
  • 14. The Basics Term Co-occurrences > aapl.sentiment <- aggregateCooccurrences("stock_twits", "$aapl", c("~bullish", "~bearish"), start=start, end=end, debug=TRUE, resolution=86400) > head(aapl.sentiment) times ~bearish ~bullish 1 2012-12-01 16:00:00 0.1398305 0.1313559 2 2012-12-02 16:00:00 0.1944719 0.1924975 3 2012-12-03 16:00:00 0.2195296 0.2074127 4 2012-12-04 16:00:00 0.2502294 0.1945549 5 2012-12-05 16:00:00 0.1986820 0.1805601 6 2012-12-06 16:00:00 0.2187758 0.1786600 > plotSignal(aapl.sentiment)
  • 15. The Basics Term Co-occurrences ![aapl sentiment](http://github.com/danielfennelly/prug-topicwatchr/raw/master/images/aapl_sentiment.png) ![aapl sentiment](http://github.com/danielfennelly/prug-topicwatchr/raw/master/images/aapl_sentiment.png)
  • 16. Prototyping Event Analysis How do we identify transient spikes corresponding to real world events? Suppose we want to use only these document count time-series and that we have a sliding history window. We might start with example data of events and try the performance of a couple different algorithms. source.id,datetime,gram,event,n twitter_sample,2012-09-12 05:45:00 -0700,apple,true,2 twitter_sample,2012-08-24 15:45:00 -0700,patent,true,1 twitter_sample,2012-10-29 08:00:00 -0700,#sandy,true,3 stock_twits,2012-10-02 08:15:00 -0700,$CMG,true,1 stock_twits,2012-09-13 09:30:00 -0700,fed,true,2 stock_twits,2012-04-11 07:00:00 -0700,lawsuit,true,1 ... Let's look more specifically at the case of the term "fed" on Stock Twits. From here on we're going to be looking at some code I used to prototype the alerts feature on TopicWatch. This prototyping code is not part of TopicWatchr, but is an example application of the package.
  • 17. Prototyping Event Analysis > ev <- read.events("data/events.csv") > fed.freq <- get.signal("fed", "2012-09-13 09:30:00", freq=T) > head(fed.freq) times vals 1 2012-09-01 11:00:00 0.0000000000 2 2012-09-01 12:00:00 0.0000000000 3 2012-09-01 13:00:00 0.0009699321 4 2012-09-01 14:00:00 0.0000000000 5 2012-09-01 15:00:00 0.0000000000 6 2012-09-01 16:00:00 0.0000000000
  • 18. Prototyping Event Analysis ![fed signal](http://github.com/danielfennelly/prug-topicwatchr/raw/master/images/fed_signal.png)
  • 19. Prototyping Event Analysis > compute.thresholds <- function(x,window=96,t.func=compute.threshold){ L <- length(x) - window t <- rep(NA,length(x)) for(i in 1:L){ t[window+i] <- t.func(x[(i-1):(i+window-1)]) } t } > z.function <- function(theta=2.5){function(x){mean(x) + theta*sd(x)}} > max.function <- function(theta=1.0){function(x){max(x) * theta}} > cv.function <- function(theta=1.0){function(x){mean(x) + sd(x) * (theta + (sd(x) / mean(x)))}} > fed.freq$z <- compute.thresholds(fed.freq$vals, t.func=z.function()) > fed.freq$max <- compute.thresholds(fed.freq$vals, t.func=max.function()) > fed.freq$cv <- compute.thresholds(fed.freq$vals, t.func=cv.function()) > long.fed <- convert.to.long(fed.freq, "times") > ggplot(long.fed) + geom_line(aes(x=times, y=value, col=variable)) + scale_color_manual(values=c("black", "red", "blue", "green"))
  • 20. Prototyping Event Analysis ![fed signal](http://github.com/danielfennelly/prug-topicwatchr/raw/master/images/fed_signal.png)
  • 21. Source Statistics > twitter.docs <- document.submatrix("twitter_sample", end=Sys.time(), hours=8, to.df=FALSE) > length(twitter.docs) [1] 225 > twitter.docs[[1]] best reality best reality show reality love 1 1 1 1 flava best flava of love show 1 1 1 1 reality show 1 > twitter.docterm <- submatrix.to.dataframe(twitter.docs, max.n=1) > dim(twitter.docterm) [1] 225 1280 > term.sums <- colSums(twitter.docterm) > mean(term.sums) mean(term.sums) [1] 1.283594 > max(term.sums) [1] 14 Now we have some information about our sampling of twitter documents. We have 225 documents, with 1280 unique terms. Right now the above function is simply grabbing 25 twitter documents per hour over the past 8 hours.
  • 22. Source Statistics [Zipf's Law](http://en.wikipedia.org/wiki/Zipf's_law) is a classic finding in the field of lexical analysis. > term.sums <- sort(term.sums, decreasing=TRUE) > qplot(x=log(1:length(term.sums)), y=log(term.sums)) ![twitter zipf](http://github.com/danielfennelly/prug-topicwatchr/raw/master/images/twitter_zipf.png)
  • 23. Feeling Adventurous? Last time at LuckySort HQ: We're looking for beta testers for the R package! In Shackleton's words, what to expect: **...BITTER COLD, LONG MONTHS OF COMPLETE DARKNESS, CONSTANT DANGER, SAFE RETURN DOUBTFUL...** This time around we're in a slightly more stable place. There's more data, more options, and more opportunities to maybe discover some cool stuff! (Expect some darkness, minimal danger, and a shrinking population of software bugs.) prug-topicwatchr See also these notes used at the Portland R User's Group meeting on 15 January 2013 on GitHub. They cover basic usage of the TopicWatchr package to pull time series text data from the LuckySort API, with some examples of prototyping event detection heuristics with R. ( https://github.com/danielfennelly/prug-topicwatchr ) Talk with me about it, or get in touch later at daniel@luckysort.com