Text Analytics with Python and R
(w/ examples from Tobacco Control)
@BenHealey
The Process
Bright Idea → Gather → Clean → Standardise → De-dup and select → Look intensely
(Frequencies, Classification)
http://scrapy.org
Spiders → Items → Pipelines (see the spider sketch below)
- readLines, XML / RCurl / scrapeR packages
- tm package (factiva plugin), twitteR
- Beautiful Soup
- Pandas (eg, financial data)
http://blog.siliconstraits.vn/building-web-crawler-scrapy/
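A minimal sketch of the Spiders → Items → Pipelines flow, assuming a recent Scrapy release; the spider name, URL and CSS selectors are hypothetical placeholders, not from the deck:

import scrapy

class ReleaseSpider(scrapy.Spider):
    """Hypothetical spider: follow links from a listing page, yield one item per document."""
    name = "releases"
    start_urls = ["http://www.example.govt.nz/releases"]   # placeholder listing page

    def parse(self, response):
        # Each matching link becomes a request handled by parse_release()
        for href in response.css("a.release::attr(href)").getall():
            yield response.follow(href, callback=self.parse_release)

    def parse_release(self, response):
        # A plain dict works as an item; an item pipeline can clean/standardise it next
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
            "html_content": response.css("div.body").get(),
        }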
• Translating text to consistent form
– Scrapy returns unicode strings
– Māori → Maori
• SWAPSET =
[[ u"Ā", "A"], [ u"ā", "a"], [ u"ä", "a"]]
• translation_table =
dict([(ord(k), unicode(v)) for k, v in settings.SWAPSET])
• cleaned_content =
html_content.translate(translation_table)
– Or… use the Unidecode package:
• test = u'Māori' (you already have unicode)
• unidecode(test) (returns 'Maori')
• Dealing with non-Unicode
– http://nedbatchelder.com/text/unipain.html
– Some scraped HTML will be in latin1 (mismatched with UTF-8)
– Have your datastore default to UTF-8
– Learn to love whack-a-mole
• Dealing with too many spaces:
– newstring = ' '.join(mystring.split())
– Or… use re
• Don’t forget the metadata!
– Define a common data structure early if you have multiple sources (see the sketch below)
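A small hypothetical helper pulling these cleaning points together: decode with a latin1 fallback, collapse whitespace, and keep the metadata in one common structure. Field names are illustrative only, and the syntax is Python 3 (the deck's own snippets are Python 2 style):

import re

def to_text(raw_bytes):
    """Decode scraped bytes: try UTF-8 first, fall back to latin1 when the page lies."""
    try:
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return raw_bytes.decode("latin-1")

def standardise_record(raw):
    """Collapse runs of whitespace and map a scraped item onto one common structure."""
    content = re.sub(r"\s+", " ", raw.get("html_content") or "").strip()
    return {
        "source": raw.get("source", "scrapy"),   # which collector produced it
        "url": raw.get("url"),
        "collected": raw.get("collected"),       # keep the metadata!
        "standardised_content": content,
    }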
Text Standardisation
• Stopwords
– "a, about, above, across, ... yourself, yourselves, you've, z”
• Stemmers
– "some sample stemmed words"  "some sampl stem word“
• Tokenisers (eg, for bigrams)
– BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
– tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
– ‘and said’, ‘and security’
Natural Language Toolkit / tm package (NLTK sketch below)
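The same standardisation steps have NLTK counterparts; a rough sketch, assuming the punkt and stopwords data have already been fetched with nltk.download():

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.util import bigrams

words = nltk.word_tokenize("Some sample stemmed words about smoking")
stop = set(stopwords.words("english"))                     # drop English stopwords
kept = [w.lower() for w in words if w.lower() not in stop]
stems = [PorterStemmer().stem(w) for w in kept]            # Porter stemming
print(stems)                 # ['sampl', 'stem', 'word', 'smoke']
print(list(bigrams(stems)))  # adjacent pairs, e.g. ('sampl', 'stem')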
Text Standardisation
libs = c("RODBC", "RWeka", "Snowball", "wordcloud", "tm", "topicmodels")
…
cleanCorpus = function(corpus) {
  corpus.tmp = tm_map(corpus, content_transformer(tolower))           # lower-case; recent tm versions need content_transformer() for base R functions
  corpus.tmp = tm_map(corpus.tmp, removePunctuation)                  # strip punctuation
  corpus.tmp = tm_map(corpus.tmp, removeWords, stopwords("english"))  # drop English stopwords
  corpus.tmp = tm_map(corpus.tmp, stripWhitespace)                    # collapse repeated whitespace
  return(corpus.tmp)
}
posts.corpus = cleanCorpus(posts.corpus)
posts.corpus_stemmed = tm_map(posts.corpus, stemDocument)
Text Standardisation
• Using dictionaries for stem completion
politi.tdm <- TermDocumentMatrix(politi.corpus)
politi.tdm = removeSparseTerms(politi.tdm, 0.99)
politi.tdm = as.matrix(politi.tdm)
# get word counts in decreasing order, put these into a plain text doc.
word_freqs = sort(rowSums(politi.tdm), decreasing=TRUE)
length(word_freqs)
smalldict = PlainTextDocument(names(word_freqs))
politi.corpus_final = tm_map(politi.corpus_stemmed,
stemCompletion, dictionary=smalldict, type="first")
Deduplication
• Python sets
– shingles1 = set(get_shingles(record1['standardised_content']))
• Shingling and Jaccard similarity
– Text: (a, rose, is, a, rose, is, a, rose)
– 4-word shingles: {(a,rose,is,a), (rose,is,a,rose), (is,a,rose,is), (a,rose,is,a), (rose,is,a,rose)}
• As a set: {(a,rose,is,a), (rose,is,a,rose), (is,a,rose,is)} (see the sketch below for the Jaccard calculation)
– http://infolab.stanford.edu/~ullman/mmds/ch3.pdf (a free text)
– http://www.cs.utah.edu/~jeffp/teaching/cs5955/L4-Jaccard+Shingle.pdf
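The deck calls a get_shingles() helper without showing it; here is one plausible version plus the Jaccard calculation, using 4-word shingles to match the rose example above:

def get_shingles(text, k=4):
    """All k-word shingles of a text, as a set of tuples (duplicates collapse)."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets: |intersection| / |union|."""
    return len(a & b) / float(len(a | b)) if (a or b) else 0.0

s1 = get_shingles("a rose is a rose is a rose")
s2 = get_shingles("a rose is a rose")
print(jaccard(s1, s2))   # 2 shared shingles out of 3 distinct -> ~0.67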
Frequency Analysis
• Document-Term Matrix
– politi.dtm <- DocumentTermMatrix(politi.corpus_stemmed,
control = list(wordLengths=c(4,Inf)))
• Frequent and co-occurring terms
– findFreqTerms(politi.dtm, 5000)
[1] "2011" "also" "announc" "area" "around"
[6] "auckland" "better" "bill" "build" "busi"
– findAssocs(politi.dtm, "smoke", 0.5)
smoke tobacco quit smokefre smoker 2025 cigarett
1.00 0.74 0.68 0.62 0.62 0.58 0.57
Mentions of the 2025 goal (charts)
Top 100 terms: Tariana Turia (wordcloud package; a Python counterpart is sketched below)
Note: Documents from Aug 2011 – July 2012
Top 100 terms: Tony Ryall (wordcloud package)
Note: Documents from Aug 2011 – July 2012
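These clouds were built with the R wordcloud package; purely as a rough counterpart, the third-party Python wordcloud library (not used in the deck) does the same job from a single string of cleaned terms:

from wordcloud import WordCloud

# text is assumed to be one big string of stemmed, stopword-free terms (placeholder here)
text = "smokefree quit smoke tobacco 2025 auckland announc " * 50
cloud = WordCloud(max_words=100, background_color="white").generate(text)
cloud.to_file("top_100_terms.png")   # writes the cloud image to disk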
Classification
• Exploration and feature extraction
– Metadata gathered at time of collection (eg, Scrapy)
– RODBC or MySQLdb with plain ol’ SQL
– Native or package functions for length of strings, sna, etc.
• Unsupervised
– nltk.cluster
– tm, topicmodels, as.matrix(dtm) → kmeans, etc.
• Supervised
– First hurdle: a training set
– nltk.classify (see the sketch below)
– tm, e1071, others…
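A toy sketch of the nltk.classify route; the labels, example texts, and bag-of-words feature function are all hypothetical, and as noted above assembling a real training set is the hard part:

import nltk

def features(text):
    # Bag-of-words presence features
    return {"has_" + w: True for w in set(text.lower().split())}

train = [
    (features("really keen to quit smoking this week"), "quit_intent"),
    (features("lovely weather in auckland today"), "other"),
]
classifier = nltk.NaiveBayesClassifier.train(train)
print(classifier.classify(features("I plan to quit next month")))  # -> 'quit_intent' on this toy data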
         2 posts or fewer   More than 750 posts
Users    846 (41.0%)        23 (1.1%)
Posts    1,157 (1.3%)       45,499 (50.1%)
Cohort: New users (posters) in Q1 2012
• LDA (topicmodels)
– New users
– Highly active users

New users:
Topic 1   Topic 2   Topic 3   Topic 4   Topic 5
good      smoke     just      smoke     feel
day       time      day       quit      day
thank     week      get       can       dont
well      patch     realli    one       like
will      start     think     will      still

Highly active users:
Topic 1   Topic 2   Topic 3   Topic 4   Topic 5
quit      good      day       like      feel
smoke     one       well      day       thing
can       take      great     your      just
will      stay      done      now       get
luck      strong    awesom    get       time
• LDA (topicmodels; a Python counterpart is sketched after the table)
– Highly active users (HAU)
– HAU1 (F, 38, PI)
– HAU2 (F, 33, NZE)
– HAU3 (M, 48, NZE)

        Topic 1   Topic 2   Topic 3   Topic 4   Topic 5
        quit      good      day       like      feel
        smoke     one       well      day       thing
        can       take      great     your      just
        will      stay      done      now       get
        luck      strong    awesom    get       time
HAU1    18%       14%       40%       8%        20%
HAU2    31%       21%       27%       6%        16%
HAU3    16%       9%        21%       49%       5%
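The topics above came from the R topicmodels package; purely as a rough Python counterpart, gensim's LdaModel fits the same kind of model to tokenised, cleaned documents. The three tiny documents below are placeholders, not the forum data:

from gensim import corpora, models

docs = [["quit", "smoke", "day", "patch"],
        ["good", "luck", "stay", "strong"],
        ["feel", "great", "day", "done"]]          # placeholder token lists

dictionary = corpora.Dictionary(docs)              # token <-> id mapping
bow = [dictionary.doc2bow(d) for d in docs]        # bag-of-words vectors
lda = models.LdaModel(bow, num_topics=5, id2word=dictionary, passes=10, random_state=1)
for topic_id, terms in lda.print_topics(num_words=5):
    print(topic_id, terms)                         # top terms per topic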
Recap
• Your text will probably be messy
– Python, R-based tools reduce the pain
• Simple analyses can generate useful insight
• Combine with data of other types for context
– source, quantities, dates, network position, history
• May surface useful features for classification
Slides, Code: message2ben@gmail.com

Editor's Notes

  • #5 Gather stage.
  • #6 Gather stage.
  • #7 Clean stage
  • #8 Clean stage
  • #9 Clean stage
  • #10 Standardise stage
  • #11 Standardise stage
  • #12 Standardise stage. 0.99 is generous; lower would remove more terms. removeSparseTerms() drops terms which have at least a sparse percentage of empty elements (i.e., terms occurring 0 times in a document), so the resulting matrix contains only terms with a sparse factor of less than sparse. TermDocumentMatrix puts terms along the side (rows) and docs along the top (columns).
  • #13 Dedup and select stage
  • #14 Analysis stage
  • #15 Analysis stage
  • #16 Analysis stage
  • #17 Analysis stage
  • #18 Analysis stage
  • #19 Analysis stage
  • #20 Analysis stage. Dragonfly talk by Marcus Frean on Latent Dirichlet Allocation.
  • #21 Analysis stage (exploratory)
  • #22 Analysis stage (Exploratory)
  • #23 Analysis stage (Unsupervised classification)
  • #24 Analysis stage (Unsupervised classification)