Blog Sentiment Analysis using NOSQL techniques

                        S. Kartik, Ph.D
                        EMC Distinguished Engineer
                        Global Field CTO,
                        Data Computing Division, EMC


© Copyright 2011 EMC Corporation. All rights reserved.       1
NOSQL is
                  Not Only SQL!

What is Sentiment Analysis?
• Sentiment analysis or opinion mining refers to the
  application of natural language
  processing, computational linguistics, and text
  analytics to identify and extract subjective information
  in source materials.
                                                         • Source: Wikipedia

• For bulk text data such as blogs, Sentiment Analysis
  requires a combination of natural language processing
  and statistical techniques to quantify the results
          – Hence NOSQL techniques are used for this sort of analysis



What are our customers saying about us?
                                                         • Discern trends and categories
                                                           in on-line conversations?
                                                           – Search for relevant blogs
                                                           – ‘Fingerprinting’ based on word
                                                             frequencies
                                                           – Similarity Measure
                                                           – Identify ‘clusters’ of documents




Natural Language Processing
    • Non-trivial!
    • Natural language is hard to interpret
              – Ontologies are very specific to industries and
                technologies (e.g. medical vs. telco, …)
              – Abbreviations and modern geek-speak are hard to
                interpret (LOL, RTFM, ROTFL, BSOD, …)
    • Natural language is inherently ambiguous
              – “The thieves stole the paintings. They were then sold.”
              – “The thieves stole the paintings. They were then caught.”
    • Strongly recommended: the excellent book on NLTK
              – Natural Language Processing with Python (O’Reilly)


Tools and Techniques
• Greenplum Map/Reduce for blog text processing in
  parallel
• Python Natural Language Tool Kit (nltk) to parse blogs
• Histograms for word frequencies using Greenplum SQL
• Construct metrics describing document similarity
  using SQL
• Use statistical analysis with clustering techniques to
  group blogs with similar word usage




What are our customers saying about us?

                                                                         Method


                                                         • Construct document histograms


                                                         • Transform histograms into document
                                                           “fingerprints”


                                                         • Use clustering techniques to discover
                                                           similar documents.




What are our customers saying about us?
                                                          Constructing document histograms


                                                         • Parsing and extracting HTML files
                                                         • Using natural language processing
                                                           for tokenization and stemming
                                                         • Cleansing inconsistencies
                                                         • Transforming unstructured data into
                                                           structured data



What are our customers saying about us?
                                                         “Fingerprinting”
                                                         - Term frequency of words within a
                                                           document vs. frequency that those
                                                           words occur in all documents
                                                         - Term frequency-inverse document
                                                           frequency (tf-idf weight)
                                                         - Easily calculated based on formulas
                                                           over the document histograms.
                                                         - The result is a vector in n-space.
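
The fingerprint calculation above can be sketched in plain Python. This is only a minimal illustration, assuming the common tf × log(N/df) weighting; the function name `tfidf_vectors` and the toy documents are hypothetical and do not come from the deck.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (dictionary, vectors) for a list of token lists.

    Weighting assumed here: tf * log(N / df), where tf is the raw count of a
    term in a document, N is the number of documents, and df is the number of
    documents that contain the term.
    """
    histograms = [Counter(doc) for doc in docs]        # per-document histograms
    dictionary = sorted(set().union(*histograms))      # fixed term order = vector axes
    n_docs = len(docs)
    doc_freq = {t: sum(1 for h in histograms if t in h) for t in dictionary}
    vectors = [
        [h[t] * math.log(n_docs / doc_freq[t]) for t in dictionary]
        for h in histograms
    ]
    return dictionary, vectors

# Toy input: three already-stemmed documents (illustrative only)
docs = [["skin", "skin", "canon"], ["canon", "group"], ["canon", "skin"]]
dictionary, vectors = tfidf_vectors(docs)
```

Each document becomes one vector with as many components as there are dictionary terms. A term that occurs in every document (here `canon`) gets weight zero, because it does not help tell documents apart.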



What are our customers saying about us?
                                                         k-means clustering:
                                                         - Iterative algorithm for finding items
                                                           that are similar within an
                                                           n-dimensional space


                                                         - Two fundamental steps:
                                                           1. Measuring distance to a centroid (the
                                                              center of a cluster)
                                                           2. Moving the center of a cluster to
                                                              minimize the sum of distances to all the
                                                              members of the cluster.
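
The two steps above can be sketched as a short loop. This is a minimal illustration, not the Greenplum implementation shown later in the deck: it assumes Euclidean distance and small in-memory tuples, and the function name `kmeans` and the sample points are hypothetical.

```python
import math
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # start from k randomly chosen points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Step 1: measure the distance from every point to every centroid
        # and assign each point to the nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 2: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(x) / len(members) for x in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])   # keep an empty cluster in place
        if new_centroids == centroids:               # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, assignment

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, assignment = kmeans(points, k=2)
```

Moving a centroid to the mean of its members minimizes the sum of squared Euclidean distances within that cluster, which is why the averaging step is the standard choice.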


What are our customers saying about us?

           • innovation
           • leader
           • design



                                                         • bug
                                                         • installation
                                                         • download

           • speed
           • graphics
           • improvement


Accessing the data
     • find blogsplog/model -exec echo "$PWD/{}" \; > filelist.txt
     • Build the directory list into a set of files that we will access:
                 -INPUT:
                    NAME: filelist
                    FILE:
                      - maple:/Users/demo/blogsplog/filelist.txt
                    COLUMNS:
                      - path text


     • For each record in the list "open()" the file and read it in its entirety
                 -MAP:
                   NAME:      read_data
                   PARAMETERS: [path text]
                   RETURNS: [id int, path text, body text]
                   LANGUAGE: python
                   FUNCTION: |
                     (_, fname) = path.rsplit('/', 1)
                     (id, _) = fname.split('.')
                     body     = open(path).read()…

       id |                path                 |             body
      ------+---------------------------------------+------------------------------------
       2482 | /Users/demo/blogsplog/model/2482.html | <!DOCTYPE html PUBLIC "...
          1 | /Users/demo/blogsplog/model/1.html | <!DOCTYPE html PUBLIC "...
         10 | /Users/demo/blogsplog/model/1000.html | <!DOCTYPE html PUBLIC "...
       2484 | /Users/demo/blogsplog/model/2484.html | <!DOCTYPE html PUBLIC "...
      ...
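
Outside of Greenplum MapReduce, the same per-file step can be tried as a standalone function. This is a sketch under assumptions: the helper name `read_blog` is hypothetical, files are assumed to be named `<id>.html`, and the demonstration writes a throwaway file instead of reading the real blog corpus.

```python
import os
import tempfile

def read_blog(path):
    """Return (id, path, body) for a file named like '<id>.html'."""
    fname = os.path.basename(path)              # e.g. '2482.html'
    doc_id = int(fname.split(".", 1)[0])        # numeric part before the extension
    # errors="replace" keeps badly encoded blog pages from aborting the read
    with open(path, encoding="utf-8", errors="replace") as fh:
        body = fh.read()
    return doc_id, path, body

# Demonstration on a temporary file (illustrative content only)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "2482.html")
    with open(demo_path, "w", encoding="utf-8") as fh:
        fh.write("<html><body>hello</body></html>")
    record = read_blog(demo_path)
```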



How do we deal with HTML?
    • First, strip out HTML tags in the blog
              – Use standard NLTK HTML Parsers with some method
                overrides
    • Tokenizing is breaking up the text into distinct
      pieces (usually words)
    • Stemming gets rid of ending variation
              – Stemming, stemmed, stems → stem
    • Stop Words are removed (and, but, if,…)
    • Get rid of non-Unicode characters
    • Lastly, discard very small words (<4 characters)
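
The cleaning steps listed above can be sketched end to end with the Python standard library alone. This is an assumption-laden illustration, not the deck's NLTK code: `TextExtractor`, `STOP_WORDS`, `naive_stem`, and `html_to_terms` are hypothetical names, the stop-word list is a toy, and the suffix stripper is a crude stand-in for a real stemmer such as Porter's.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

STOP_WORDS = {"and", "but", "if", "the", "with", "from", "that", "this"}  # toy list

def naive_stem(word):
    """Rough suffix stripping; NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # collapse a doubled final consonant, e.g. 'stemm' -> 'stem'
            if len(stem) > 3 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

def html_to_terms(html):
    parser = TextExtractor()
    parser.feed(html)                                   # strip HTML tags
    text = " ".join(parser.chunks).lower()
    tokens = re.findall(r"[a-z0-9]+", text)             # tokenize; drops non-ASCII
    terms = []
    for tok in tokens:
        if tok in STOP_WORDS:                           # remove stop words
            continue
        stem = naive_stem(tok)                          # remove ending variation
        if len(stem) < 4:                               # discard very small words
            continue
        terms.append(stem)
    return terms

terms = html_to_terms(
    "<html><body><p>Stemming, stemmed and stems if blogging</p></body></html>"
)
```

In this sketch stop words are checked before stemming so that they match their unstemmed forms; the deck's own code applies the filters in a different order.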

Parse the documents into word lists
• Convert HTML documents into parsed, tokenized, stemmed, term lists with stop-word
  removal
• Use the HTMLParser library to parse the HTML documents and extract titles and body
  contents:
                                              # SD is PL/Python's per-function dictionary; the parser is built once and reused
                                              if 'parser' not in SD:
                                                  # Python 2 module name (html.parser in Python 3)
                                                  from HTMLParser import HTMLParser
                                                  ...
                                                  class MyHTMLParser(HTMLParser):
                                                      def __init__(self):
                                                          HTMLParser.__init__(self)
                                                          ...
                                                      def handle_data(self, data):
                                                          data = data.strip()
                                                          if self.inhead:
                                                              if self.tag == 'title':
                                                                  self.title = data
                                                          if self.inbody:
                                                              ...
                                              parser = SD['parser']
                                              parser.reset()
                                              ...




Parse the documents into word lists
• Use nltk to tokenize, stem, and remove common terms:

                                             if 'parser' not in SD:
                                                    from nltk import WordTokenizer, PorterStemmer, corpus
                                                    ...
                                                    class MyHTMLParser(HTMLParser):
                                                      def __init__(self):
                                                        ...
                                                        self.tokenizer = WordTokenizer()
                                                        self.stemmer = PorterStemmer()
                                                        self.stopwords = dict(map(lambda x: (x, True),
                                                                         corpus.stopwords.words()))
                                                     ...
                                                     def handle_data(self, data):
                                                        ...
                                                        if self.inbody:
                                                           tokens = self.tokenizer.tokenize(data)
                                                           stems = map(self.stemmer.stem, tokens)
                                                           for x in stems:
                                                            if len(x) < 4: continue
                                                            x = x.lower()
                                                            if x in self.stopwords: continue
                                                            self.doc.append(x)
                                                     ...
                                             parser = SD['parser']
                                             parser.reset()
                                             ...




Parse the documents into word lists
• Run the MapReduce job, then query the resulting term lists:

shell$ gpmapreduce -f blog-terms.yml
mapreduce_75643_run_1
DONE

sql# SELECT id, title, doc FROM blog_terms LIMIT 5;

  id  |     title     |                               doc
------+---------------+-----------------------------------------------------------------
 2482 | noodlepie     | {noodlepi,from,gutter,grub,gourmet,tabl,noodlepi,blog,scoff,...
    1 | Bhootakannadi | {bhootakannadi,2005,unifi,feed,gener,comment,final,integr,...
   10 | Tea Set       | {novelti,dish,goldilock,bear,bowl,lide,contain,august,...
...




Create histograms of word frequencies
Extract a dictionary of terms that show up in more than ten blogs

                   sql# SELECT term, sum(c) AS freq, count(*) AS num_blogs
                      FROM (
                        SELECT id, term, count(*) AS c
                        FROM (
                          SELECT id, unnest(doc) AS term
                          FROM blog_terms
                        ) term_unnest
                        GROUP BY id, term
                      ) doc_terms
                      WHERE term IS NOT NULL
                      GROUP BY term
                      HAVING count(*) > 10;

                     term | freq | num_blogs
                   ----------+------+-----------
                    sturdi | 19 |           13
                    canon | 97 |              40
                    group | 48 |             17
                    skin | 510 |            152
                    linger | 19 |           17
                    blunt | 20 |            17
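
For readers who want to check the logic of the query on a small sample, the same aggregation can be sketched in Python. The function name `term_dictionary`, the `min_blogs` parameter, and the sample data are hypothetical. The two counters mirror the query's `sum(c)` and `count(*)` columns.

```python
from collections import Counter

def term_dictionary(blog_terms, min_blogs=10):
    """blog_terms maps blog id -> list of terms.

    Returns {term: (freq, num_blogs)} for terms found in more than
    `min_blogs` blogs, matching the HAVING count(*) > 10 filter.
    """
    freq = Counter()        # total occurrences across all blogs  (sum(c))
    num_blogs = Counter()   # number of blogs containing the term (count(*))
    for terms in blog_terms.values():
        counts = Counter(terms)          # per-document histogram
        freq.update(counts)
        num_blogs.update(counts.keys())
    return {t: (freq[t], n) for t, n in num_blogs.items() if n > min_blogs}

# Toy data with a lower threshold so something survives the filter
blog_terms = {1: ["skin", "skin", "canon"], 2: ["skin"], 3: ["canon", "group"]}
result = term_dictionary(blog_terms, min_blogs=1)
```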




Create histograms of word frequencies
Use the term frequencies to construct the term dictionary…

       sql#      SELECT array(SELECT term FROM blog_term_freq) dictionary;

                                dictionary
       ---------------------------------------------------------------------
       {sturdi,canon,group,skin,linger,blunt,detect,giver,annoy,telephon,...




…then use the term dictionary to construct feature vectors for every document, mapping document
  terms to the features in the dictionary:
       sql#     SELECT id, gp_extract_feature_histogram(dictionary, doc)
               FROM blog_terms, blog_features;

        id |                     term_count
       -----+----------------------------------------------------------------
       2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,2,0,4,0,1,0,1,0,1,...}
          1 | {41,1,34,1,22,1,125,1,387,...}:{0,9,0,1,0,1,0,1,0,3,0,2,...}
         10 | {3,1,4,1,30,1,18,1,13,1,4,...}:{0,2,0,6,0,12,0,3,0,1,0,1,...}
       ...




Create histograms of word frequencies
Format of a sparse vector
     id |                    term_count
    -----+----------------------------------------------------------------------
    ...
      10 | {3,1,40,...}:{0,2,0,...}
    ...




   Dense representation of the vector
     id |                     term_count
    -----+----------------------------------------------------------------------
    ...
      10 | {0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...}
    ...
                            dictionary
    ----------------------------------------------------------------------------
          {sturdi,canon,group,skin,linger,blunt,detect,giver,...}




      Representing the document
    {skin, skin, ...}
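
The sparse format reads as run-length encoding: the first array gives how many times each value in the second array repeats. A minimal sketch of the expansion follows, treating the visible prefix `{3,1,40}:{0,2,0}` as if it were the whole vector (an assumption, since the slide truncates it). The helper names `svec_to_dense` and `dense_to_svec` are hypothetical.

```python
def svec_to_dense(run_lengths, values):
    """Expand a run-length encoded vector {run_lengths}:{values} into a list."""
    if len(run_lengths) != len(values):
        raise ValueError("run_lengths and values must have the same length")
    dense = []
    for count, value in zip(run_lengths, values):
        dense.extend([value] * count)      # repeat each value `count` times
    return dense

def dense_to_svec(dense):
    """Inverse: collapse consecutive equal values into (run_lengths, values)."""
    run_lengths, values = [], []
    for value in dense:
        if values and values[-1] == value:
            run_lengths[-1] += 1           # extend the current run
        else:
            values.append(value)           # start a new run
            run_lengths.append(1)
    return run_lengths, values

dense = svec_to_dense([3, 1, 40], [0, 2, 0])
```

Three zeros come first, so the single `2` lands at position 4 of the dense vector, which is the position of `skin` in the dictionary shown above. That is why the document is represented as two occurrences of `skin`.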




Transform the blog terms into statistically useful measures
Use the feature vectors to construct tf-idf (term frequency-inverse document frequency)
  vectors, which weight each term by its importance to a document relative to the whole collection:

       sql# SELECT id, (term_count*logidf) tfxidf
           FROM blog_histogram, (
             SELECT log(count(*)/count_vec(term_count)) logidf
             FROM blog_histogram
           ) blog_logidf;

        id |                       tfxidf
       -----+-------------------------------------------------------------------
       2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.34311110...}
          1 | {41,1,34,1,22,1,125,1,387,...}:{0,0.771999985977529,0,1.999427...}
         10 | {3,1,4,1,30,1,18,1,13,1,4,...}:{0,2.95439664949608,0,3.2006935...}
       ...




Create document clusters around iteratively defined centroids
Now that we have tf-idf vectors, we have a numeric measure of each document that
  supports statistical analysis.
The current example is k-means clustering, which requires two operations.


First, we compute a distance metric between the documents and a random selection
   of centroids, for instance the angle derived from cosine similarity:

sql# SELECT id, tfxidf, cid,
     ACOS((tfxidf %*% centroid) /
       (svec_l2norm(tfxidf) * svec_l2norm(centroid))
     ) AS distance
    FROM blog_tfxidf, blog_centroids;

  id |                       tfxidf                              | cid | distance
 -----+-------------------------------------------------------------------+-----+------------
 2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...} | 1 | 1.53672977
 2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...} | 2 | 1.55720354
 2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...} | 3 | 1.55040145
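
The distance expression in the query is the arccosine of the cosine similarity, that is, the angle between the two vectors in radians. A small dense-vector sketch of the same formula follows; the function name `angular_distance` and the sample vectors are hypothetical, and the real query operates on sparse vectors inside the database.

```python
import math

def angular_distance(a, b):
    """ACOS of the cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))     # L2 norm of a
    norm_b = math.sqrt(sum(y * y for y in b))     # L2 norm of b
    if norm_a == 0 or norm_b == 0:
        raise ValueError("angular distance is undefined for a zero vector")
    cosine = dot / (norm_a * norm_b)
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    return math.acos(max(-1.0, min(1.0, cosine)))

d_same = angular_distance([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])   # parallel vectors
d_orth = angular_distance([1.0, 0.0], [0.0, 3.0])             # no shared terms
```

Vectors pointing in the same direction give a distance near 0, and vectors with no terms in common give π/2 (about 1.5708). The values around 1.54 to 1.56 in the output above therefore indicate that the document shares little vocabulary with the initial random centroids.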




Create document clusters around iteratively defined centroids
Next, assign each document to its nearest centroid and average the members' vectors to re-center each cluster:


sql# SELECT cid, sum(tfxidf)/count(*) AS centroid
    FROM (
      SELECT id, tfxidf, cid,
       row_number() OVER (PARTITION BY id ORDER BY distance, cid) rank
      FROM blog_distance
    ) blog_rank
    WHERE rank = 1
    GROUP BY cid;

 cid |                              centroid
-----+------------------------------------------------------------------------
   3 | {1,1,1,1,1,1,1,1,1,...}:{0.157556041103536,0.0635233900749665,0.050...}
   2 | {1,1,1,1,1,1,3,1,1,...}:{0.0671131209568817,0.332220028552986,0,0.0...}
   1 | {1,1,1,1,1,1,1,1,1,...}:{0.103874521481016,0.158213686890834,0.0540...}



Repeat the previous two operations until the centroids converge, and you have k-means clustering.




Summary

• Accessing the data (MapReduce)
     id  |                 path                  |           body
   ------+---------------------------------------+---------------------------
    2482 | /Users/demo/blogsplog/model/2482.html | <!DOCTYPE html PUBLIC "...

• Parse the documents into word lists (MapReduce)
     id  |   title   |                               doc
   ------+-----------+-----------------------------------------------------------------
    2482 | noodlepie | {noodlepi,from,gutter,grub,gourmet,tabl,noodlepi,blog,scoff,...

• Create histograms of word frequencies (SQL)
     id  |                        term_count
   ------+----------------------------------------------------------
    2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,2,0,4,0,1,0,1,0,1,...}

• Transform the blog terms into statistically useful measures (SQL)
     id  |                               tfxidf
   ------+--------------------------------------------------------------------
    2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.34311110...}

• Create document clusters around iteratively defined centroids (MADlib or SQL window functions)
     id  |                              tfxidf                               | cid |  distance
   ------+-------------------------------------------------------------------+-----+------------
    2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...} |   1 | 1.53672977




Conclusions
    • Simple exercise on Natural Language Text
      Processing
    • Technique uses a combination of non-SQL
      processing (Map/Reduce) and SQL Processing
      (Greenplum)
    • Takes qualitative, subjective text data and converts
      the problem into a form amenable to statistical
      analysis using k-means clustering
    • Rich possibilities for business impact using these
      techniques emerging across industries.


© Copyright 2011 EMC Corporation. All rights reserved.       27

More Related Content

Viewers also liked

Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architectures
Daniel Marcous
 
RWDG Webinar: Achieving Data Quality Through Data Governance
RWDG Webinar: Achieving Data Quality Through Data GovernanceRWDG Webinar: Achieving Data Quality Through Data Governance
RWDG Webinar: Achieving Data Quality Through Data Governance
DATAVERSITY
 
CDO Slides: Real World Data Strategy Success Stories
CDO Slides: Real World Data Strategy Success StoriesCDO Slides: Real World Data Strategy Success Stories
CDO Slides: Real World Data Strategy Success Stories
DATAVERSITY
 
Webinar: Initiating a Customer MDM/Data Governance Program
Webinar: Initiating a Customer MDM/Data Governance ProgramWebinar: Initiating a Customer MDM/Data Governance Program
Webinar: Initiating a Customer MDM/Data Governance Program
DATAVERSITY
 
LDM Webinar: Data Modeling & Metadata Management
LDM Webinar: Data Modeling & Metadata ManagementLDM Webinar: Data Modeling & Metadata Management
LDM Webinar: Data Modeling & Metadata Management
DATAVERSITY
 
Smart Data Slides: Emerging Hardware Choices for Modern AI Data Management
Smart Data Slides: Emerging Hardware Choices for Modern AI Data ManagementSmart Data Slides: Emerging Hardware Choices for Modern AI Data Management
Smart Data Slides: Emerging Hardware Choices for Modern AI Data Management
DATAVERSITY
 
Data-Ed Slides: Best Practices in Data Stewardship (Technical)
Data-Ed Slides: Best Practices in Data Stewardship (Technical)Data-Ed Slides: Best Practices in Data Stewardship (Technical)
Data-Ed Slides: Best Practices in Data Stewardship (Technical)
DATAVERSITY
 
Slides: NoSQL Data Modeling Using JSON Documents – A Practical Approach
Slides: NoSQL Data Modeling Using JSON Documents – A Practical ApproachSlides: NoSQL Data Modeling Using JSON Documents – A Practical Approach
Slides: NoSQL Data Modeling Using JSON Documents – A Practical Approach
DATAVERSITY
 
RWDG Slides: Corporate Data Governance - The CDO is the Data Governance Chief
RWDG Slides: Corporate Data Governance - The CDO is the Data Governance ChiefRWDG Slides: Corporate Data Governance - The CDO is the Data Governance Chief
RWDG Slides: Corporate Data Governance - The CDO is the Data Governance Chief
DATAVERSITY
 

Viewers also liked (9)

Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architectures
 
RWDG Webinar: Achieving Data Quality Through Data Governance
RWDG Webinar: Achieving Data Quality Through Data GovernanceRWDG Webinar: Achieving Data Quality Through Data Governance
RWDG Webinar: Achieving Data Quality Through Data Governance
 
CDO Slides: Real World Data Strategy Success Stories
CDO Slides: Real World Data Strategy Success StoriesCDO Slides: Real World Data Strategy Success Stories
CDO Slides: Real World Data Strategy Success Stories
 
Webinar: Initiating a Customer MDM/Data Governance Program
Webinar: Initiating a Customer MDM/Data Governance ProgramWebinar: Initiating a Customer MDM/Data Governance Program
Webinar: Initiating a Customer MDM/Data Governance Program
 
LDM Webinar: Data Modeling & Metadata Management
LDM Webinar: Data Modeling & Metadata ManagementLDM Webinar: Data Modeling & Metadata Management
LDM Webinar: Data Modeling & Metadata Management
 
Smart Data Slides: Emerging Hardware Choices for Modern AI Data Management
Smart Data Slides: Emerging Hardware Choices for Modern AI Data ManagementSmart Data Slides: Emerging Hardware Choices for Modern AI Data Management
Smart Data Slides: Emerging Hardware Choices for Modern AI Data Management
 
Data-Ed Slides: Best Practices in Data Stewardship (Technical)
Data-Ed Slides: Best Practices in Data Stewardship (Technical)Data-Ed Slides: Best Practices in Data Stewardship (Technical)
Data-Ed Slides: Best Practices in Data Stewardship (Technical)
 
Slides: NoSQL Data Modeling Using JSON Documents – A Practical Approach
Slides: NoSQL Data Modeling Using JSON Documents – A Practical ApproachSlides: NoSQL Data Modeling Using JSON Documents – A Practical Approach
Slides: NoSQL Data Modeling Using JSON Documents – A Practical Approach
 
RWDG Slides: Corporate Data Governance - The CDO is the Data Governance Chief
RWDG Slides: Corporate Data Governance - The CDO is the Data Governance ChiefRWDG Slides: Corporate Data Governance - The CDO is the Data Governance Chief
RWDG Slides: Corporate Data Governance - The CDO is the Data Governance Chief
 

Similar to Wed 1430 kartik_subramanian_color

Natural language processing (Python)
Natural language processing (Python)Natural language processing (Python)
Natural language processing (Python)
Sumit Raj
 
Empower your Enterprise with language intelligence_Francisco Webber
Empower your Enterprise with language intelligence_Francisco Webber Empower your Enterprise with language intelligence_Francisco Webber
Empower your Enterprise with language intelligence_Francisco Webber
Dataconomy Media
 
"Updates on Semantic Fingerprinting", Francisco Webber, Inventor and Co-Found...
"Updates on Semantic Fingerprinting", Francisco Webber, Inventor and Co-Found..."Updates on Semantic Fingerprinting", Francisco Webber, Inventor and Co-Found...
"Updates on Semantic Fingerprinting", Francisco Webber, Inventor and Co-Found...
Dataconomy Media
 
NLP, Expert system and pattern recognition
NLP, Expert system and pattern recognitionNLP, Expert system and pattern recognition
NLP, Expert system and pattern recognition
Mohammad Ilyas Malik
 
Data Segmenting in Anzo
Data Segmenting in AnzoData Segmenting in Anzo
Data Segmenting in Anzo
LeeFeigenbaum
 
Semantic Web and Machine Learning Tutorial

Wed 1430 kartik_subramanian_color

  • 1. Blog Sentiment Analysis using NOSQL techniques
       S. Kartik, Ph.D
       EMC Distinguished Engineer
       Global Field CTO, Data Computing Division, EMC
       © Copyright 2011 EMC Corporation. All rights reserved.
  • 2. NOSQL is Not Only SQL!
  • 3. What is Sentiment Analysis?
       • Sentiment analysis or opinion mining refers to the application of natural language processing, computational linguistics, and text analytics to identify and extract subjective information in source materials. (Source: Wikipedia)
       • For bulk text data such as blogs, sentiment analysis requires a combination of natural language processing and statistical techniques to quantify the results
         – Hence NOSQL techniques are used for this sort of analysis
  • 4. What are our customers saying about us?
       • Discern trends and categories in on-line conversations?
         – Search for relevant blogs
         – ‘Fingerprinting’ based on word frequencies
         – Similarity measure
         – Identify ‘clusters’ of documents
  • 5. Natural Language Processing
       • Non-trivial!
       • Natural language is hard to interpret
         – Ontologies are very specific to industries and technologies (e.g. medical vs. telco, …)
         – Abbreviations and modern geek-speak are hard to interpret (LOL, RTFM, ROTFL, BSOD, …)
       • Natural language is inherently ambiguous
         – “The thieves stole the paintings. They were then sold.”
         – “The thieves stole the paintings. They were then caught.”
       • Strongly recommend the excellent book on NLTK
         – Natural Language Processing with Python – O’Reilly
  • 6. Tools and Techniques
       • Greenplum Map/Reduce for blog text processing in parallel
       • Python Natural Language Tool Kit (nltk) to parse blogs
       • Histograms for word frequencies using Greenplum SQL
       • Construct metrics describing document similarity using SQL
       • Use statistical analysis with clustering techniques to group blogs with similar word usage
  • 7. What are our customers saying about us?
       Method
       • Construct document histograms
       • Transform histograms into document “fingerprints”
       • Use clustering techniques to discover similar documents
  • 8. What are our customers saying about us?
       Constructing document histograms
       • Parsing and extracting text from HTML files
       • Using natural language processing for tokenization and stemming
       • Cleansing inconsistencies
       • Transforming unstructured data into structured data
  • 9. What are our customers saying about us?
       “Fingerprinting”
       - Term frequency of words within a document vs. the frequency with which those words occur in all documents
       - Term frequency-inverse document frequency (tf-idf weight)
       - Easily calculated based on formulas over the document histograms
       - The result is a vector in n-space
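The tf-idf weight named on this slide can be written out as a formula. The slides do not say which variant was used, so the following is one common form, given here as an assumption; the symbols $\mathrm{tf}$, $\mathrm{df}$ and $N$ are introduced for illustration only.

```latex
w_{t,d} = \mathrm{tf}_{t,d} \cdot \log\frac{N}{\mathrm{df}_t}
```

Here $\mathrm{tf}_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the total number of documents, and $\mathrm{df}_t$ is the number of documents that contain $t$. With a dictionary of $n$ terms, the fingerprint of document $d$ is the vector $(w_{t_1,d}, \dots, w_{t_n,d})$ in n-space.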
  • 10. What are our customers saying about us?
       k-means clustering:
       - Iterative algorithm for finding items that are similar within an n-dimensional space
       - Two fundamental steps:
         1. Measuring distance to a centroid (the center of a cluster)
         2. Moving the center of a cluster to minimize the sum distance to all the members of the cluster
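The two steps of k-means can be sketched in plain Python. This is a minimal illustration, not the implementation used in the talk: the function name `kmeans`, the random initialisation, and the fixed iteration count are all assumptions. It moves each centroid to the mean of its members, which minimises the sum of squared Euclidean distances.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Cluster equal-length numeric vectors into k groups (Lloyd's algorithm)."""
    rng = random.Random(seed)
    # Assumption: start from k randomly chosen input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Step 1: measure distances and assign each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 2: move each centroid to the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids
```

Calling `kmeans(vectors, k)` on a list of tf-idf vectors returns one cluster index per document plus the final centroids.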
  • 11. What are our customers saying about us?
  • 12. What are our customers saying about us?
  • 13. What are our customers saying about us?
  • 14. What are our customers saying about us?
       • innovation
       • leader
       • design
       • bug
       • installation
       • download
       • speed
       • graphics
       • improvement
  • 15. Accessing the data
       • List the blog files:

             find blogsplog/model -exec echo "$PWD/{}" \; > filelist.txt

       • Build the directory list into a set of files that we will access:

             - INPUT:
                 NAME: filelist
                 FILE:
                   - maple:/Users/demo/blogsplog/filelist.txt
                 COLUMNS:
                   - path text

       • For each record in the list, "open()" the file and read it in its entirety:

             - MAP:
                 NAME: read_data
                 PARAMETERS: [path text]
                 RETURNS: [id int, path text, body text]
                 LANGUAGE: python
                 FUNCTION: |
                   # the file name without its extension is the document id
                   (_, fname) = path.rsplit('/', 1)
                   (id, _) = fname.split('.')
                   body = open(path).read()
                   …

               id  | path                                  | body
             ------+---------------------------------------+------------------------------------
              2482 | /Users/demo/blogsplog/model/2482.html | <!DOCTYPE html PUBLIC "...
                 1 | /Users/demo/blogsplog/model/1.html    | <!DOCTYPE html PUBLIC "...
                10 | /Users/demo/blogsplog/model/1000.html | <!DOCTYPE html PUBLIC "...
              2484 | /Users/demo/blogsplog/model/2484.html | <!DOCTYPE html PUBLIC "...
              ...
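Outside Greenplum, the same map step can be tried as an ordinary Python 3 function. This is a sketch under assumptions: the standalone function `read_data`, the UTF-8 decoding with replacement of bad bytes, and the conversion of the id to `int` are choices made here, not details taken from the slide.

```python
import os


def read_data(path):
    """Return (id, path, body) for a file named like '<id>.html'."""
    fname = os.path.basename(path)
    doc_id, _ = fname.split(".", 1)
    # Assumption: blogs are decoded as UTF-8, with undecodable bytes replaced.
    with open(path, encoding="utf-8", errors="replace") as handle:
        body = handle.read()
    return int(doc_id), path, body
```

Mapping `read_data` over the paths in `filelist.txt` yields rows shaped like the table on the slide.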
  • 16. How do we deal with HTML?
       • First, strip out HTML tags in the blog
         – Use standard NLTK HTML parsers with some method overrides
       • Tokenizing is breaking up the text into distinct pieces (usually words)
       • Stemming gets rid of ending variation
         – Stemming, stemmed, stems → stem
       • Stop words are removed (and, but, if, …)
       • Get rid of non-Unicode characters
       • Lastly, discard very small words (<4 characters)
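The steps on this slide can be sketched end to end with the Python 3 standard library alone. Everything named here is a hypothetical stand-in: `STOP_WORDS` is a tiny illustrative set, `toy_stem` is a crude suffix stripper rather than the Porter stemmer the talk uses, and keeping only ASCII letters is an assumed way of dropping problem characters.

```python
import re
from html.parser import HTMLParser

# Hypothetical stand-in for a real stop-word list.
STOP_WORDS = {"that", "this", "with", "from", "have", "were", "they"}


class TextExtractor(HTMLParser):
    """Collect the text between tags, discarding the markup itself."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def toy_stem(word):
    """Strip one common suffix; a real pipeline would use a Porter stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Collapse a doubled final consonant ("stemm" -> "stem").
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


def html_to_terms(html):
    """Strip tags, tokenize, drop stop words, stem, and drop short terms."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    text = " ".join(extractor.chunks)
    tokens = re.findall(r"[a-z]+", text.lower())  # ASCII letters only
    terms = []
    for token in tokens:
        if token in STOP_WORDS:
            continue
        stem = toy_stem(token)
        if len(stem) < 4:  # discard very small words
            continue
        terms.append(stem)
    return terms
```

For example, `html_to_terms("<p>Stemming, stemmed, stems</p>")` reduces all three variants to the same term, which is what makes the later word counts meaningful.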
  • 17. Parse the documents into word lists
       • Convert HTML documents into parsed, tokenized, stemmed term lists with stop-word removal
       • Use the HTMLParser library to parse the HTML documents and extract titles and body contents:

             # SD is the PL/Python session dictionary; build the parser only once
             if 'parser' not in SD:
                 from HTMLParser import HTMLParser
                 ...
                 class MyHTMLParser(HTMLParser):
                     def __init__(self):
                         HTMLParser.__init__(self)
                         ...
                     def handle_data(self, data):
                         data = data.strip()
                         if self.inhead:
                             if self.tag == 'title':
                                 self.title = data
                         if self.inbody:
                             ...
             parser = SD['parser']
             parser.reset()
             ...
  • 18. Parse the documents into word lists
       • Use nltk to tokenize, stem, and remove common terms:

             if 'parser' not in SD:
                 from nltk import WordTokenizer, PorterStemmer, corpus
                 ...
                 class MyHTMLParser(HTMLParser):
                     def __init__(self):
                         ...
                         self.tokenizer = WordTokenizer()
                         self.stemmer = PorterStemmer()
                         # dict gives constant-time stop-word lookups
                         self.stopwords = dict(map(lambda x: (x, True), corpus.stopwords.words()))
                         ...
                     def handle_data(self, data):
                         ...
                         if self.inbody:
                             tokens = self.tokenizer.tokenize(data)
                             stems = map(self.stemmer.stem, tokens)
                             for x in stems:
                                 if len(x) < 4: continue            # drop very small words
                                 x = x.lower()
                                 if x in self.stopwords: continue   # drop stop words
                                 self.doc.append(x)
                         ...
             parser = SD['parser']
             parser.reset()
             ...
  • 19. Parse the documents into word lists
       • Use nltk to tokenize, stem, and remove common terms:

             shell$ gpmapreduce -f blog-terms.yml
             mapreduce_75643_run_1
             DONE

             sql# SELECT id, title, doc FROM blog_terms LIMIT 5;

               id  | title            | doc
             ------+------------------+-----------------------------------------------------------------
              2482 | noodlepie        | {noodlepi,from,gutter,grub,gourmet,tabl,noodlepi,blog,scoff,...
                 1 | Bhootakannadi    | {bhootakannadi,2005,unifi,feed,gener,comment,final,integr,...
                10 | Tea Set          | {novelti,dish,goldilock,bear,bowl,lide,contain,august,...
              ...
  • 20. Create histograms of word frequencies
       Extract a term dictionary of terms that show up in more than ten blogs:

             sql# SELECT term, sum(c) AS freq, count(*) AS num_blogs
                  FROM (
                    SELECT id, term, count(*) AS c
                    FROM (
                      SELECT id, unnest(doc) AS term FROM blog_terms
                    ) term_unnest
                    GROUP BY id, term
                  ) doc_terms
                  WHERE term IS NOT NULL
                  GROUP BY term
                  HAVING count(*) > 10;

               term   | freq | num_blogs
             ----------+------+-----------
              sturdi   |   19 |        13
              canon    |   97 |        40
              group    |   48 |        17
              skin     |  510 |       152
              linger   |   19 |        17
              blunt    |   20 |        17
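For readers who want to check the query's logic without a database, the same aggregation can be sketched in Python. The function name `term_dictionary` and the input shape (a dict from blog id to its term list) are assumptions made for this example; the threshold mirrors the `HAVING count(*) > 10` filter.

```python
from collections import Counter


def term_dictionary(docs, min_blogs=10):
    """docs maps a blog id to its list of terms.

    Returns {term: (freq, num_blogs)} for terms found in more than
    min_blogs distinct blogs.
    """
    freq = Counter()       # total occurrences across all blogs
    num_blogs = Counter()  # number of distinct blogs containing the term
    for terms in docs.values():
        per_doc = Counter(t for t in terms if t is not None)
        freq.update(per_doc)
        num_blogs.update(per_doc.keys())
    return {t: (freq[t], n) for t, n in num_blogs.items() if n > min_blogs}
```

The inner `Counter` plays the role of the `GROUP BY id, term` subquery, and the two outer counters correspond to `sum(c)` and `count(*)` in the outer query.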
  • 21. Create histograms of word frequencies
    Use the term frequencies to construct the term dictionary…

        sql# SELECT array(SELECT term FROM blog_term_freq) dictionary;

        dictionary
        ---------------------------------------------------------------------
        {sturdi,canon,group,skin,linger,blunt,detect,giver,annoy,telephon,...

    …then use the term dictionary to construct feature vectors for every document, mapping document terms to the features in the dictionary:

        sql# SELECT id, gp_extract_feature_histogram(dictionary, doc)
             FROM blog_terms, blog_features;

         id  | term_count
        -----+----------------------------------------------------------------
        2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,2,0,4,0,1,0,1,0,1,...}
           1 | {41,1,34,1,22,1,125,1,387,...}:{0,9,0,1,0,1,0,1,0,3,0,2,...}
          10 | {3,1,4,1,30,1,18,1,13,1,4,...}:{0,2,0,6,0,12,0,3,0,1,0,1,...}
         ...
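    The mapping from document terms to dictionary features can be illustrated with a small Python sketch. It is an assumption-laden stand-in, not the implementation of `gp_extract_feature_histogram`: the function name `feature_histogram` is hypothetical, and it returns a plain dense list rather than the sparse vector shown on the slide.

```python
from collections import Counter

def feature_histogram(dictionary, doc):
    """Dense count vector: entry i is how often dictionary[i] occurs in doc.

    Terms in doc that are not in the dictionary are ignored.
    """
    counts = Counter(doc)
    return [counts[term] for term in dictionary]

# Dictionary prefix taken from the slide; the document is made up.
dictionary = ["sturdi", "canon", "group", "skin", "linger", "blunt"]
doc = ["skin", "blog", "skin", "canon"]
print(feature_histogram(dictionary, doc))
```

    Every document ends up with a vector of the same length as the dictionary, which is what makes the later vector arithmetic possible.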
  • 22. Create histograms of word frequencies
    Format of a sparse vector

         id  | term_count
        -----+----------------------------------------------------------------------
         ...
          10 | {3,1,40,...}:{0,2,0,...}
         ...

    Dense representation of the vector

         id  | term_count
        -----+----------------------------------------------------------------------
         ...
          10 | {0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...}
         ...

        dictionary
        ----------------------------------------------------------------------------
        {sturdi,canon,group,skin,linger,blunt,detect,giver,...}

    Representing the document {skin, skin, ...}
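    Comparing the two representations on the slide, the sparse form reads as a run-length encoding: the first array holds run lengths and the second holds the value repeated in each run, so `{3,1,40,...}:{0,2,0,...}` expands to three zeros, one 2, forty zeros, and so on. A minimal Python sketch of that reading follows; the function names `rle_to_dense` and `dense_to_rle` are hypothetical and not part of any Greenplum API.

```python
def rle_to_dense(counts, values):
    """Expand a run-length pair (counts, values) into a dense list."""
    dense = []
    for n, v in zip(counts, values):
        dense.extend([v] * n)
    return dense

def dense_to_rle(dense):
    """Collapse a dense list into (counts, values) run-length form."""
    counts, values = [], []
    for v in dense:
        if values and values[-1] == v:
            counts[-1] += 1      # extend the current run
        else:
            values.append(v)     # start a new run
            counts.append(1)
    return counts, values

print(rle_to_dense([3, 1, 2], [0, 2, 0]))  # [0, 0, 0, 2, 0, 0]
```

    The 2 lands at position 3 (counting from zero), which is the position of "skin" in the dictionary and matches the example document {skin, skin, ...}.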
  • 23. Transform the blog terms into statistically useful measures
    Use the feature vectors to construct tf-idf (term frequency-inverse document frequency) vectors, which measure the importance of each term:

        sql# SELECT id, (term_count*logidf) tfxidf
             FROM blog_histogram,
                  ( SELECT log(count(*)/count_vec(term_count)) logidf
                    FROM blog_histogram ) blog_logidf;

         id  | tfxidf
        -----+-------------------------------------------------------------------
        2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.34311110...}
           1 | {41,1,34,1,22,1,125,1,387,...}:{0,0.771999985977529,0,1.999427...}
          10 | {3,1,4,1,30,1,18,1,13,1,4,...}:{0,2.95439664949608,0,3.2006935...}
         ...
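    The tf-idf weighting can be sketched in Python on dense vectors. Several things here are assumptions rather than facts from the slides: that `count_vec` yields, per dictionary position, the number of documents with a non-zero count; that the natural logarithm is acceptable (a different base only rescales every weight by a constant); and that a term present in no document should get weight 0. The name `tfidf` is hypothetical.

```python
import math

def tfidf(histograms):
    """Weight each term count by log(N / document frequency).

    histograms is a list of equal-length dense count vectors, one per document.
    """
    n_docs = len(histograms)
    n_terms = len(histograms[0]) if histograms else 0
    # Document frequency: in how many documents does term j occur at all?
    df = [sum(1 for h in histograms if h[j] > 0) for j in range(n_terms)]
    # Assumption: terms that occur in no document get an idf of 0.
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    return [[tf * w for tf, w in zip(h, idf)] for h in histograms]

weights = tfidf([[2, 0, 1], [1, 1, 0]])
print(weights)
```

    A term that occurs in every document gets weight 0 because log(N/N) = 0, which is the intended effect: ubiquitous terms say nothing about what distinguishes one blog from another.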
  • 24. Create document clusters around iteratively defined centroids
    Now that we have tf-idf vectors, we have a statistically useful measure of term importance, which enables real analytics. The current example is k-means clustering, which requires two operations. First, compute a distance between each document and a random selection of centroids, for instance the angle derived from cosine similarity:

        sql# SELECT id, tfxidf, cid,
                    ACOS((tfxidf %*% centroid) /
                         (svec_l2norm(tfxidf) * svec_l2norm(centroid))
                    ) AS distance
             FROM blog_tfxidf, blog_centroids;

         id  | tfxidf                                                            | cid | distance
        -----+-------------------------------------------------------------------+-----+------------
        2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...} |   1 | 1.53672977
        2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...} |   2 | 1.55720354
        2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...} |   3 | 1.55040145
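    The distance expression in the query is the arccosine of the cosine similarity, that is, the angle between the two vectors in radians. A small Python sketch on dense vectors is given below; `angular_distance` is a hypothetical name, and the sketch assumes neither vector is all zeros (otherwise the norm product is zero and the division fails).

```python
import math

def angular_distance(a, b):
    """Angle in radians between vectors a and b: acos(a.b / (|a| * |b|))."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Clamp to [-1, 1] so floating-point rounding cannot push acos out of range.
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.acos(cosine)

print(angular_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(angular_distance([1.0, 2.0], [2.0, 4.0]))  # parallel vectors
```

    Because tf-idf weights are non-negative, the angle between two documents lies between 0 and π/2. The distances around 1.54 to 1.56 on the slide are therefore close to orthogonal, which is plausible for a document compared against randomly chosen centroids.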
  • 25. Create document clusters around iteratively defined centroids
    Next, assign each document to its nearest centroid and average the documents in each cluster to re-center it:

        sql# SELECT cid, sum(tfxidf)/count(*) AS centroid
             FROM (
               SELECT id, tfxidf, cid,
                      row_number() OVER (PARTITION BY id ORDER BY distance, cid) rank
               FROM blog_distance
             ) blog_rank
             WHERE rank = 1
             GROUP BY cid;

         cid | centroid
        -----+------------------------------------------------------------------------
           3 | {1,1,1,1,1,1,1,1,1,...}:{0.157556041103536,0.0635233900749665,0.050...}
           2 | {1,1,1,1,1,1,3,1,1,...}:{0.0671131209568817,0.332220028552986,0,0.0...}
           1 | {1,1,1,1,1,1,1,1,1,...}:{0.103874521481016,0.158213686890834,0.0540...}

    Repeat the previous two operations until the centroids converge, and you have k-means clustering.
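    Putting the two operations together, the iteration can be sketched in Python as follows. This is an illustrative sketch on dense vectors, not the Greenplum or MADlib implementation. The names `kmeans` and `max_iter` are hypothetical, and keeping an empty cluster's old centroid is an assumption (the SQL above would simply return no row for such a cluster). Ties are broken by centroid index, mirroring `ORDER BY distance, cid`.

```python
import math

def angular_distance(a, b):
    """Angle in radians between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def kmeans(docs, centroids, max_iter=100):
    """Alternate assignment and re-centering until the centroids stop changing."""
    assign = []
    for _ in range(max_iter):
        # Operation 1: assign each document to its nearest centroid.
        assign = [
            min(range(len(centroids)),
                key=lambda c: (angular_distance(d, centroids[c]), c))
            for d in docs
        ]
        # Operation 2: re-center each centroid on the mean of its members.
        new_centroids = []
        for c, old in enumerate(centroids):
            members = [d for d, a in zip(docs, assign) if a == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(old)  # assumption: keep empty clusters as-is
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, assign

# Made-up two-dimensional data with two obvious groups.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
centroids, assign = kmeans(docs, [[1.0, 0.2], [0.2, 1.0]])
print(assign)
```

    In practice the starting centroids are chosen at random, so different runs can converge to different clusterings. The fixed starting points here only keep the example reproducible.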
  • 26. Summary
    • Accessing the data (MapReduce)

          id  | path                                  | body
        ------+---------------------------------------+------------------------------------
         2482 | /Users/demo/blogsplog/model/2482.html | <!DOCTYPE html PUBLIC ”...

    • Parse the documents into word lists (MapReduce)

          id  | title            | doc
        ------+------------------+-----------------------------------------------------------------
         2482 | noodlepie        | {noodlepi,from,gutter,grub,gourmet,tabl,noodlepi,blog,scoff,...

    • Create histograms of word frequencies (SQL)

         id  | term_count
        -----+----------------------------------------------------------------
        2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,2,0,4,0,1,0,1,0,1,...}

    • Transform the blog terms into statistically useful measures (SQL)

         id  | tfxidf
        -----+-------------------------------------------------------------------
        2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.34311110...}

    • Create document clusters around iteratively defined centroids (MADlib or SQL window functions)

         id  | tfxidf                                                            | cid | distance
        -----+-------------------------------------------------------------------+-----+------------
        2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...} |   1 | 1.53672977
  • 27. Conclusions
    • A simple exercise in natural language text processing
    • The technique combines non-SQL processing (Map/Reduce) with SQL processing (Greenplum)
    • It takes qualitative, subjective text data and converts the problem into a form amenable to statistical analysis using k-means clustering
    • Rich possibilities for business impact using these techniques are emerging across industries