Text analytics in Python and R with examples from Tobacco Control

Ben has been doing data sciencey work since 1999 for organisations in the banking, retailing, health and education industries. He is currently on contracts with Pharmac and Aspire2025 (a Tobacco Control research collaboration) where, happily, he gets to use his data-wrangling powers for good.

This presentation focuses on analysing text, with Tobacco Control as the context. Examples include monitoring mentions of NZ's smokefree goal by politicians and examining media uptake of BATNZ's Agree/Disagree PR campaign. It covers common obstacles during data extraction, cleaning and analysis, along with the key Python and R packages you can use to help clear them.

Slide notes
  • Standardise stage: in removeSparseTerms, a sparse value of 0.99 is generous; a lower value would remove more terms. The call returns a term-document matrix from which those terms are removed that have at least sparse percent empty elements (i.e. terms occurring 0 times in a document), so the result contains only terms with a sparse factor of less than sparse. A TermDocumentMatrix has terms along the rows and documents along the columns.
  • Analysis stage: see the Dragonfly talk by Marcus Frean on Latent Dirichlet Allocation.

    1. Text Analytics with Python and R (w/ examples from Tobacco Control) @BenHealey
    2. The Process: Bright Idea → Gather → Clean → Standardise → De-dup and select → Look intensely (Frequencies, Classification)
    3. http://scrapy.org: Spiders → Items → Pipelines
       R: readLines, the XML / RCurl / scrapeR packages, the tm package (Factiva plugin), twitteR
       Python: Beautiful Soup, pandas (e.g. for financial data)
       http://blog.siliconstraits.vn/building-web-crawler-scrapy/
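        A minimal Scrapy spider sketch along these lines, for recent Scrapy versions; the spider name, start URL and CSS selectors are illustrative placeholders, not from the talk:

        import scrapy

        class PressReleaseSpider(scrapy.Spider):
            # Spider: where to start and how to turn pages into items.
            name = "press_releases"
            start_urls = ["http://www.example.com/releases"]

            def parse(self, response):
                # Follow each release link found on the listing page.
                for href in response.css("a.release::attr(href)").getall():
                    yield response.follow(href, callback=self.parse_release)

            def parse_release(self, response):
                # Each yielded dict is an item that flows on to the pipelines.
                yield {
                    "url": response.url,
                    "title": response.css("h1::text").get(),
                    "html_content": response.text,  # cleaned in the next stage
                }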
    4. • Translating text to consistent form
       – Scrapy returns unicode strings
       – Māori → Maori
         SWAPSET = [[u"Ā", "A"], [u"ā", "a"], [u"ä", "a"]]
         translation_table = dict([(ord(k), unicode(v)) for k, v in settings.SWAPSET])
         cleaned_content = html_content.translate(translation_table)
       – Or, if you already have unicode:
         test = u"Māori"
         unidecode(test)  # returns 'Maori'
    5. • Dealing with non-Unicode
       – http://nedbatchelder.com/text/unipain.html
       – Some scraped HTML will be in latin-1 (mismatching UTF-8)
       – Have your datastore default to UTF-8
       – Learn to love whack-a-mole
       • Dealing with too many spaces:
       – newstring = ' '.join(mystring.split())
       – Or use re
       • Don't forget the metadata!
       – Define a common data structure early if you have multiple sources
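        A small Python 3 sketch of those clean-up steps (decode with a latin-1 fallback, then collapse runs of whitespace); the function names are just illustrative:

        import re

        def to_text(raw_bytes):
            # Prefer UTF-8; fall back to latin-1 for the mismatched pages.
            try:
                return raw_bytes.decode("utf-8")
            except UnicodeDecodeError:
                return raw_bytes.decode("latin-1")

        def squash_spaces(text):
            # ' '.join(text.split()) also works; re handles tabs and newlines explicitly.
            return re.sub(r"\s+", " ", text).strip()

        print(squash_spaces(to_text(b"too   many \t spaces")))  # -> 'too many spaces'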
    6. Text Standardisation
       • Stopwords
       – "a, about, above, across, ... yourself, yourselves, you've, z"
       • Stemmers
       – "some sample stemmed words" → "some sampl stem word"
       • Tokenisers (e.g. for bigrams)
       – BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
       – tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
       – 'and said', 'and security'
       Natural Language Toolkit / tm package
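        On the Python side, a minimal NLTK sketch of the same standardisation step (assumes the punkt and stopwords data have already been fetched with nltk.download):

        import nltk
        from nltk.corpus import stopwords
        from nltk.stem import PorterStemmer
        from nltk.util import ngrams

        text = "Some sample stemmed words about the smokefree goal"
        tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]

        # Stopword removal and stemming, as in the tm example above.
        stops = set(stopwords.words("english"))
        stemmer = PorterStemmer()
        stemmed = [stemmer.stem(t) for t in tokens if t not in stops]

        # Bigrams, the rough counterpart of the Weka NGramTokenizer call.
        print(stemmed)
        print(list(ngrams(stemmed, 2)))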
    7. Text Standardisation
       libs = c("RODBC", "RWeka", "Snowball", "wordcloud", "tm", "topicmodels")
       ...
       cleanCorpus = function(corpus) {
         corpus.tmp = tm_map(corpus, tolower)  # lower-case all text (wrap in content_transformer() on newer tm)
         corpus.tmp = tm_map(corpus.tmp, removePunctuation)
         corpus.tmp = tm_map(corpus.tmp, removeWords, stopwords("english"))
         corpus.tmp = tm_map(corpus.tmp, stripWhitespace)
         return(corpus.tmp)
       }
       posts.corpus = cleanCorpus(posts.corpus)
       posts.corpus_stemmed = tm_map(posts.corpus, stemDocument)
    8. Text Standardisation
       • Using dictionaries for stem completion
       politi.tdm <- TermDocumentMatrix(politi.corpus)
       politi.tdm = removeSparseTerms(politi.tdm, 0.99)
       politi.tdm = as.matrix(politi.tdm)
       # get word counts in decreasing order, put these into a plain text doc
       word_freqs = sort(rowSums(politi.tdm), decreasing = TRUE)
       length(word_freqs)
       smalldict = PlainTextDocument(names(word_freqs))
       politi.corpus_final = tm_map(politi.corpus_stemmed, stemCompletion,
                                    dictionary = smalldict, type = "first")
    9. Deduplication
       • Python sets
       – shingles1 = set(get_shingles(record1['standardised_content']))
       • Shingling and Jaccard similarity
       – (a, rose, is, a, rose, is, a, rose)
       – {(a,rose,is,a), (rose,is,a,rose), (is,a,rose,is), (a,rose,is,a), (rose,is,a,rose)}
         → {(a,rose,is,a), (rose,is,a,rose), (is,a,rose,is)}
       – http://infolab.stanford.edu/~ullman/mmds/ch3.pdf (a free text)
         http://www.cs.utah.edu/~jeffp/teaching/cs5955/L4-Jaccard+Shingle.pdf
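        A small sketch of the shingling and Jaccard idea; get_shingles is the helper named on the slide, implemented here (as an assumption) as word 4-shingles:

        def get_shingles(text, k=4):
            # Word k-shingles: every k-word window over the token stream.
            words = text.split()
            return [tuple(words[i:i + k]) for i in range(len(words) - k + 1)]

        def jaccard(a, b):
            # Jaccard similarity of two shingle sets: |A intersect B| / |A union B|.
            return len(a & b) / len(a | b)

        s1 = set(get_shingles("a rose is a rose is a rose"))
        s2 = set(get_shingles("a rose is a rose is a rose indeed"))
        print(s1)               # the three distinct shingles shown on the slide
        print(jaccard(s1, s2))  # 0.75 here; identical documents score 1.0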
    10. Frequency Analysis
       • Document-Term Matrix
       – politi.dtm <- DocumentTermMatrix(politi.corpus_stemmed, control = list(wordLengths = c(4, Inf)))
       • Frequent and co-occurring terms
       – findFreqTerms(politi.dtm, 5000)
         [1] "2011" "also" "announc" "area" "around"
         [6] "auckland" "better" "bill" "build" "busi"
       – findAssocs(politi.dtm, "smoke", 0.5)
         smoke tobacco quit smokefre smoker 2025 cigarett
         1.00  0.74    0.68 0.62     0.62   0.58 0.57
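        For the same kind of term counts in Python (an assumption, since the talk used R's tm), recent scikit-learn can build the document-term matrix; the toy documents below are placeholders:

        from sklearn.feature_extraction.text import CountVectorizer

        docs = ["the smokefree 2025 goal", "quit smoking by 2025", "tobacco tax announced"]

        vec = CountVectorizer()
        dtm = vec.fit_transform(docs)      # documents along rows, terms along columns
        freqs = dtm.sum(axis=0).A1         # total count of each term across all documents
        for term, n in sorted(zip(vec.get_feature_names_out(), freqs), key=lambda x: -x[1]):
            print(term, n)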
    11. Mentions of the 2025 goal
    12. Mentions of the 2025 goal
    13. Top 100 terms: Tariana Turia (Note: documents from Aug 2011 – July 2012; wordcloud package)
    14. Top 100 terms: Tony Ryall (Note: documents from Aug 2011 – July 2012)
    15. Classification
       • Exploration and feature extraction
       – Metadata gathered at time of collection (e.g. Scrapy)
       – RODBC or MySQLdb with plain ol' SQL
       – Native or package functions for length of strings, sna, etc.
       • Unsupervised
       – nltk.cluster
       – tm, topicmodels, as.matrix(dtm) → kmeans, etc.
       • Supervised
       – First hurdle: a training set
       – nltk.classify
       – tm, e1071, others…
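        For the unsupervised route, a minimal nltk.cluster sketch over a toy document-term matrix (the vectors and the choice of two clusters are illustrative, not the talk's data):

        import numpy
        from nltk.cluster import KMeansClusterer, cosine_distance

        # Toy document-term matrix: one row of term counts per document.
        vectors = [numpy.array(v) for v in ([3, 0, 1], [2, 1, 0], [0, 4, 1], [0, 3, 2])]

        clusterer = KMeansClusterer(2, distance=cosine_distance, repeats=10,
                                    avoid_empty_clusters=True)
        assignments = clusterer.cluster(vectors, assign_clusters=True)
        print(assignments)  # e.g. [0, 0, 1, 1]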
    16. [Chart: distribution of forum users by number of posts. Labels: "2 posts or fewer", "more than 750 posts"; values shown: 846; 1,157; 23; 45,499; 41.0%; 1.3%; 1.1%; 50.1%]
    17. Cohort: New users (posters) in Q1 2012
    18. • LDA (topicmodels)
       – New users:
         Topic 1  Topic 2  Topic 3  Topic 4  Topic 5
         good     smoke    just     smoke    feel
         day      time     day      quit     day
         thank    week     get      can      dont
         well     patch    realli   one      like
         will     start    think    will     still
       – Highly active users:
         Topic 1  Topic 2  Topic 3  Topic 4  Topic 5
         quit     good     day      like     feel
         smoke    one      well     day      thing
         can      take     great    your     just
         will     stay     done     now      get
         luck     strong   awesom   get      time
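        The topics above came from the R topicmodels package; as a rough Python equivalent (an assumption, using recent scikit-learn and toy posts rather than the real corpus):

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation

        docs = ["day one of my quit, feeling good today",
                "the patch is helping, still smokefree this week",
                "thank you all for the support, stay strong"]

        counts = CountVectorizer(stop_words="english")
        dtm = counts.fit_transform(docs)

        lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)
        terms = counts.get_feature_names_out()
        for i, topic in enumerate(lda.components_):
            top_terms = [terms[j] for j in topic.argsort()[-5:][::-1]]  # five strongest terms per topic
            print("Topic", i + 1, top_terms)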
    19. • LDA (topicmodels), highly active users (HAU)
       Topic terms as on the previous slide:
         Topic 1  Topic 2  Topic 3  Topic 4  Topic 5
         quit     good     day      like     feel
         smoke    one      well     day      thing
         can      take     great    your     just
         will     stay     done     now      get
         luck     strong   awesom   get      time
       Topic mix per user:
         HAU1 (F, 38, PI):   18%  14%  40%   8%  20%
         HAU2 (F, 33, NZE):  31%  21%  27%   6%  16%
         HAU3 (M, 48, NZE):  16%   9%  21%  49%   5%
    20. Recap
       • Your text will probably be messy
       – Python, R-based tools reduce the pain
       • Simple analyses can generate useful insight
       • Combine with data of other types for context
       – source, quantities, dates, network position, history
       • May surface useful features for classification
       Slides, code: message2ben@gmail.com
