Transcript

  • 1. Blog Sentiment Analysis using NOSQL techniques
    S. Kartik, Ph.D
    EMC Distinguished Engineer
    Global Field CTO, Data Computing Division, EMC
  • 2. NOSQL is Not Only SQL!
  • 3. What is Sentiment Analysis?
    • Sentiment analysis or opinion mining refers to the application of natural language processing, computational linguistics, and text analytics to identify and extract subjective information in source materials. (Source: Wikipedia)
    • For bulk text data such as blogs, sentiment analysis requires a combination of natural language processing and statistical techniques to quantify the results
      – Hence NOSQL techniques are used for this sort of analysis
  • 4. What are our customers saying about us?
    • Discern trends and categories in on-line conversations:
      – Search for relevant blogs
      – ‘Fingerprinting’ based on word frequencies
      – Similarity measure
      – Identify ‘clusters’ of documents
  • 5. Natural Language Processing
    • Non-trivial! Natural language is hard to interpret
      – Ontologies are very specific to industries and technologies (e.g. medical vs. telco, …)
      – Abbreviations and modern geek-speak are hard to interpret (LOL, RTFM, ROTFL, BSOD, …)
    • Natural language is inherently ambiguous
      – “The thieves stole the paintings. They were then sold.”
      – “The thieves stole the paintings. They were then caught.”
    • Strongly recommended: an excellent book on NLTK, Natural Language Processing with Python (O’Reilly)
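    For a flavor of what NLTK offers, tokenization and part-of-speech tagging are one-liners. This is a minimal sketch using today's NLTK API rather than the 2011-era API shown later in the deck (the download names assume a current NLTK install); note that the ambiguity in "They" is exactly what the tagger does not resolve:

        import nltk

        # One-time model downloads (names assume a current NLTK install)
        nltk.download('punkt')
        nltk.download('averaged_perceptron_tagger')

        text = "The thieves stole the paintings. They were then sold."
        tokens = nltk.word_tokenize(text)   # split into word tokens
        print(nltk.pos_tag(tokens))         # tag each token with a part of speech
        # 'They' is tagged PRP (pronoun), but nothing here says whether it
        # refers to the thieves or the paintings -- that is the hard part.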
  • 6. Tools and Techniques
    • Greenplum Map/Reduce for blog text processing in parallel
    • Python Natural Language Tool Kit (nltk) to parse blogs
    • Histograms for word frequencies using Greenplum SQL
    • Construct metrics describing document similarity using SQL
    • Use statistical analysis with clustering techniques to group blogs with similar word usage
  • 7. What are our customers saying about us? Method
    • Construct document histograms
    • Transform histograms into document “fingerprints”
    • Use clustering techniques to discover similar documents
  • 8. What are our customers saying about us? Constructing document histograms
    • Parse and extract the HTML files
    • Use natural language processing for tokenization and stemming
    • Cleanse inconsistencies
    • Transform unstructured data into structured data
  • 9. What are our customers saying about us? “Fingerprinting”
    – Term frequency of words within a document vs. the frequency with which those words occur across all documents
    – Term frequency-inverse document frequency (the tf-idf weight)
    – Easily calculated from formulas over the document histograms
    – The result is a vector in n-space
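    As a concrete reading of the tf-idf weight, here is a minimal Python sketch over plain dictionaries (my own illustration with toy data, not code from the deck; slide 23 computes the same quantity in SQL):

        import math

        # Toy document histograms: term -> count within each document
        histograms = [
            {"skin": 2, "canon": 1},
            {"skin": 1, "group": 3},
            {"skin": 1, "canon": 4},
        ]

        def tf_idf(term, histogram, all_histograms):
            """tf-idf = (count of term in this doc) * log(N / docs containing term)."""
            tf = histogram.get(term, 0)
            df = sum(1 for h in all_histograms if term in h)  # document frequency
            return tf * math.log(len(all_histograms) / df) if df else 0.0

        print(tf_idf("skin", histograms[0], histograms))   # 0.0: 'skin' is in every doc
        print(tf_idf("canon", histograms[2], histograms))  # > 0: 'canon' is distinctive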
  • 10. What are our customers saying about us? k-means clustering
    – Iterative algorithm for finding items that are similar within an n-dimensional space
    – Two fundamental steps:
      1. Measure the distance to a centroid (the center of a cluster)
      2. Move the center of a cluster to minimize the summed distance to all the members of the cluster
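    Those two steps in a compact NumPy sketch (my illustration, using Euclidean distance for brevity where slides 24-25 implement the same loop in SQL with a cosine-based distance):

        import numpy as np

        def kmeans(vectors, k, iterations=20, seed=0):
            """Assign each vector to its nearest centroid, then re-center; repeat."""
            rng = np.random.default_rng(seed)
            # Start from k randomly chosen documents as the initial centroids
            centroids = vectors[rng.choice(len(vectors), k, replace=False)].astype(float)
            for _ in range(iterations):
                # Step 1: distance from every vector to every centroid
                dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
                assignment = dists.argmin(axis=1)
                # Step 2: move each centroid to the mean of its members
                for c in range(k):
                    members = vectors[assignment == c]
                    if len(members):
                        centroids[c] = members.mean(axis=0)
            return assignment, centroids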
  • 11.–13. What are our customers saying about us? (figure-only slides; the images are not captured in this transcript)
  • 14. What are our customers saying about us?
    • innovation • leader • design • bug • installation • download • speed • graphics • improvement
  • 15. Accessing the data
    • Generate the list of blog files:

        find blogsplog/model -exec echo "$PWD/{}" \; > filelist.txt

    • Build the directory list into a set of files that we will access:

        - INPUT:
            NAME: filelist
            FILE:
              - maple:/Users/demo/blogsplog/filelist.txt
            COLUMNS:
              - path text

    • For each record in the list, open() the file and read it in its entirety:

        - MAP:
            NAME: read_data
            PARAMETERS: [path text]
            RETURNS: [id int, path text, body text]
            LANGUAGE: python
            FUNCTION: |
              (_, fname) = path.rsplit('/', 1)
              (id, _) = fname.split('.')
              body = open(path).read()
              ...

          id  |                 path                  |               body
        ------+---------------------------------------+------------------------------------
         2482 | /Users/demo/blogsplog/model/2482.html | <!DOCTYPE html PUBLIC "...
            1 | /Users/demo/blogsplog/model/1.html    | <!DOCTYPE html PUBLIC "...
           10 | /Users/demo/blogsplog/model/1000.html | <!DOCTYPE html PUBLIC "...
         2484 | /Users/demo/blogsplog/model/2484.html | <!DOCTYPE html PUBLIC "...
         ...
  • 16. How do we deal with HTML?
    • First, strip out the HTML tags in the blog
      – Use standard NLTK HTML parsers with some method overrides
    • Tokenizing is breaking up the text into distinct pieces (usually words)
    • Stemming gets rid of ending variation
      – Stemming, stemmed, stems → stem
    • Stop words are removed (and, but, if, …)
    • Get rid of non-Unicode characters
    • Lastly, discard very small words (< 4 characters)
  • 17. Parse the documents into word lists
    • Convert HTML documents into parsed, tokenized, stemmed term lists with stop-word removal
    • Use the HTMLParser library to parse the HTML documents and extract titles and body contents:

        if 'parser' not in SD:
            from HTMLParser import HTMLParser
            ...
            class MyHTMLParser(HTMLParser):
                def __init__(self):
                    HTMLParser.__init__(self)
                    ...
                def handle_data(self, data):
                    data = data.strip()
                    if self.inhead:
                        if self.tag == 'title':
                            self.title = data
                    if self.inbody:
                        ...
        parser = SD['parser']
        parser.reset()
        ...
  • 18. Parse the documents into word lists
    • Use nltk to tokenize, stem, and remove common terms:

        if 'parser' not in SD:
            from nltk import WordTokenizer, PorterStemmer, corpus
            ...
            class MyHTMLParser(HTMLParser):
                def __init__(self):
                    ...
                    self.tokenizer = WordTokenizer()
                    self.stemmer = PorterStemmer()
                    self.stopwords = dict(map(lambda x: (x, True),
                                              corpus.stopwords.words()))
                    ...
                def handle_data(self, data):
                    ...
                    if self.inbody:
                        tokens = self.tokenizer.tokenize(data)
                        stems = map(self.stemmer.stem, tokens)
                        for x in stems:
                            if len(x) < 4: continue
                            x = x.lower()
                            if x in self.stopwords: continue
                            self.doc.append(x)
                    ...
        parser = SD['parser']
        parser.reset()
        ...
  • 19. Parse the documents into word lists
    • Run the job (same code as the previous slide) and inspect the results:

        shell$ gpmapreduce -f blog-terms.yml
        mapreduce_75643_run_1
        DONE

        sql# SELECT id, title, doc FROM blog_terms LIMIT 5;

          id  |      title       |                              doc
        ------+------------------+-----------------------------------------------------------------
         2482 | noodlepie        | {noodlepi,from,gutter,grub,gourmet,tabl,noodlepi,blog,scoff,...
            1 | Bhootakannadi    | {bhootakannadi,2005,unifi,feed,gener,comment,final,integr,...
           10 | Tea Set          | {novelti,dish,goldilock,bear,bowl,lide,contain,august,...
         ...
  • 20. Create histograms of word frequencies
    • Extract a term dictionary of terms that show up in more than ten blogs:

        sql# SELECT term, sum(c) AS freq, count(*) AS num_blogs
             FROM (
               SELECT id, term, count(*) AS c
               FROM (
                 SELECT id, unnest(doc) AS term FROM blog_terms
               ) term_unnest
               GROUP BY id, term
             ) doc_terms
             WHERE term IS NOT NULL
             GROUP BY term
             HAVING count(*) > 10;

           term   | freq | num_blogs
        ----------+------+-----------
         sturdi   |   19 |        13
         canon    |   97 |        40
         group    |   48 |        17
         skin     |  510 |       152
         linger   |   19 |        17
         blunt    |   20 |        17
  • 21. Create histograms of word frequencies
    • Use the term frequencies to construct the term dictionary…

        sql# SELECT array(SELECT term FROM blog_term_freq) dictionary;

                                   dictionary
        ---------------------------------------------------------------------
         {sturdi,canon,group,skin,linger,blunt,detect,giver,annoy,telephon,...

    • …then use the term dictionary to construct feature vectors for every document, mapping document terms to the features in the dictionary:

        sql# SELECT id, gp_extract_feature_histogram(dictionary, doc)
             FROM blog_terms, blog_features;

          id  |                           term_count
        ------+----------------------------------------------------------------
         2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,2,0,4,0,1,0,1,0,1,...}
            1 | {41,1,34,1,22,1,125,1,387,...}:{0,9,0,1,0,1,0,1,0,3,0,2,...}
           10 | {3,1,4,1,30,1,18,1,13,1,4,...}:{0,2,0,6,0,12,0,3,0,1,0,1,...}
         ...
  • 22. Create histograms of word frequencies
    • Format of a sparse vector:

          id  |                      term_count
        ------+----------------------------------------------------------------------
         ...
           10 | {3,1,40,...}:{0,2,0,...}
         ...

    • Dense representation of the same vector:

          id  |                      term_count
        ------+----------------------------------------------------------------------
         ...
           10 | {0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...}
         ...

    • Against the dictionary

        {sturdi,canon,group,skin,linger,blunt,detect,giver,...}

      this represents the document {skin, skin, ...}
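    The sparse form is run-length encoded: each entry in the first array is a run length, paired with the value to repeat from the second array. A small Python decoder to make that concrete (my sketch, not part of the deck):

        def expand_sparse(runs, values):
            """Expand a run-length-encoded sparse vector into its dense form."""
            dense = []
            for run, value in zip(runs, values):
                dense.extend([value] * run)  # 'run' consecutive copies of 'value'
            return dense

        # {3,1,40}:{0,2,0} decodes to three 0s, one 2, then forty 0s
        print(expand_sparse([3, 1, 40], [0, 2, 0]))
        # the single 2 lands at index 3, the position of 'skin' in the dictionary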
  • 23. Transform the blog terms into statistically useful measures
    • Use the feature vectors to construct tf-idf (term frequency-inverse document frequency) vectors. These are a measure of the importance of terms:

        sql# SELECT id, (term_count*logidf) tfxidf
             FROM blog_histogram, (
               SELECT log(count(*)/count_vec(term_count)) logidf
               FROM blog_histogram
             ) blog_logidf;

          id  |                              tfxidf
        ------+-------------------------------------------------------------------
         2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.34311110...}
            1 | {41,1,34,1,22,1,125,1,387,...}:{0,0.771999985977529,0,1.999427...}
           10 | {3,1,4,1,30,1,18,1,13,1,4,...}:{0,2.95439664949608,0,3.2006935...}
         ...
  • 24. Create document clusters around iteratively defined centroids
    • Now that we have tf-idf vectors, we have a statistically meaningful measure, which enables all sorts of real analytics. The current example is k-means clustering, which requires two operations.
    • First, compute a distance metric between the documents and a random selection of centroids, for instance a distance derived from cosine similarity:

        sql# SELECT id, tfxidf, cid,
                    ACOS((tfxidf %*% centroid) /
                         (svec_l2norm(tfxidf) * svec_l2norm(centroid))) AS distance
             FROM blog_tfxidf, blog_centroids;

          id  |                              tfxidf                               | cid |  distance
        ------+-------------------------------------------------------------------+-----+------------
         2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...} |   1 | 1.53672977
         2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...} |   2 | 1.55720354
         2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...} |   3 | 1.55040145
  • 25. Create document clusters around iteratively defined centroids
    • Next, use an averaging metric to re-center the mean of each cluster:

        sql# SELECT cid, sum(tfxidf)/count(*) AS centroid
             FROM (
               SELECT id, tfxidf, cid,
                      row_number() OVER (PARTITION BY id ORDER BY distance, cid) rank
               FROM blog_distance
             ) blog_rank
             WHERE rank = 1
             GROUP BY cid;

         cid |                                 centroid
        -----+------------------------------------------------------------------------
           3 | {1,1,1,1,1,1,1,1,1,...}:{0.157556041103536,0.0635233900749665,0.050...}
           2 | {1,1,1,1,1,1,3,1,1,...}:{0.0671131209568817,0.332220028552986,0,0.0...}
           1 | {1,1,1,1,1,1,1,1,1,...}:{0.103874521481016,0.158213686890834,0.0540...}

    • Repeat the previous two operations until the centroids converge, and you have k-means clustering.
  • 26. Summary
    • Accessing the data (MapReduce)

          id  |                 path                  |               body
        ------+---------------------------------------+------------------------------------
         2482 | /Users/demo/blogsplog/model/2482.html | <!DOCTYPE html PUBLIC "...

    • Parse the documents into word lists (MapReduce)

          id  |      title       |                              doc
        ------+------------------+-----------------------------------------------------------------
         2482 | noodlepie        | {noodlepi,from,gutter,grub,gourmet,tabl,noodlepi,blog,scoff,...

    • Create histograms of word frequencies (SQL)

          id  |                           term_count
        ------+----------------------------------------------------------------
         2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,2,0,4,0,1,0,1,0,1,...}

    • Transform the blog terms into statistically useful measures (SQL)

          id  |                              tfxidf
        ------+-------------------------------------------------------------------
         2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.34311110...}

    • Create document clusters around iteratively defined centroids (MADlib or SQL window functions)

          id  |                              tfxidf                               | cid |  distance
        ------+-------------------------------------------------------------------+-----+------------
         2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...} |   1 | 1.53672977
  • 27. Conclusions
    • A simple exercise in natural language text processing
    • The technique uses a combination of non-SQL processing (Map/Reduce) and SQL processing (Greenplum)
    • It takes qualitative, subjective text data and converts the problem into a form amenable to statistical analysis using k-means clustering
    • Rich possibilities for business impact using these techniques are emerging across industries