Sentiment Mining Engine - Architecture

Detailed design of a Sentiment Mining Framework that analyzes unstructured text such as user reviews, call center transcripts and Twitter posts to gather insights into customer moods and to identify the product features that are key influencers.


Transcript

Architectural Overview of a Sentiment Mining and Feature Extraction Engine
Satyajit Gupte
gupte.satyajit@gmail.com

1.0 Introduction:
This document describes the system design of a complete framework for analyzing unstructured text. It outlines how Sentiment Polarity Classification is performed on unstructured text and how key Features are extracted. The unstructured text could be call center notes or transcripts, user-written product reviews, Twitter posts, blog posts, etc. The Sentiment Mining and Feature Extraction Engine can be applied to such text to gather insights into customer mood and to identify the product attributes/features that are key influencers.

2.0 Design Goals:
1. Classify unstructured text as having positive, negative or neutral sentiment.
2. Extract important features and calculate feature-wise sentiment.
3. Achieve a sentiment classification accuracy of over 80%.
4. Require minimal human intervention once the Classification Engine is trained.
5. Be independent of the underlying domain of the unstructured text.
6. Be robust enough to handle changing user vocabulary.

3.0 System Overview:
The high-level overview of the entire system is presented below with an example.

[Figure: Unstructured Text feeds the Sentiment Classifier, which uses the Keyword Matrix obtained after Training, and the Feature Extractor, which uses the Automatically Learned Feature Dictionary.]

Unstructured Text:
"Dear sir, Last 3 days I am very frustrated with Airtel customer care agent. They activated a vulgar caller tune on my sim and take my balance. Also the coverage is very poor. I request you to take necessary action to resolve my issue by the end of this week"

Sentiment Classifier output:
Sentiment: Negative
Polarity Score: 0.1

Feature Scorecard:
Feature          Sentiment
Customer care    negative
Caller tune      negative
Coverage         poor
A user-written complaint from an Airtel forum is the unstructured text used for illustrative purposes. The various modules and the algorithms used are detailed in the following sections.

4.0 Sentiment Polarity Classification:
To label unstructured text as positive, negative or neutral, we use a Text Classification approach. The underlying sentiment present in the text is modeled as three topic classes (positive, negative and neutral), and a training corpus is generated manually. Significant discriminating keywords that differentiate between the three classes are selected using term frequency-inverse document frequency (tf-idf) weights. These keywords, along with their calculated probability of occurrence, are then used in a customized Bayesian Classifier to label text. The task of sentiment polarity classification can be broken into two modules: Training and Scoring.

4.1 Training:
The input to the Training module is a manually labeled set of documents, known as the training corpus, and the output is a keyword matrix. The training corpus is randomly selected from the sample distribution. Tf-idf weights are used to select discriminating keywords, and the probability of occurrence of each keyword is calculated. Unigrams (single words) and Bigrams (two-word combinations) are extracted. Training must be repeated from time to time so that the Sentiment Classification Engine learns changes in vocabulary. An overview of the Training module is presented below.

[Figure: Positive sentiment corpus and Negative sentiment corpus -> tf-idf selection filter -> Extract keywords with probability -> Keyword matrix]

Sample keyword matrix:
keyword            belief
Good service       0.98
Excellent          0.95
fairly ok          0.65
Disappointing      0.2
Horrible           0.1
Wrong deduction    0.01
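The Training step can be illustrated with a minimal sketch. The document does not specify how the tf-idf filter or the belief values are computed, so the choices here are assumptions: each sentiment class is treated as one tf-idf "document", and the belief of a keyword is the Laplace-smoothed share of its occurrences that fall in the positive class. The names `SAMPLE_CORPUS`, `ngrams` and `train` are hypothetical, and only the positive and negative classes are modeled for brevity.

```python
import math
from collections import Counter

# Hypothetical, hand-labeled training corpus used only for illustration.
SAMPLE_CORPUS = [
    ("good service", "positive"),
    ("excellent service", "positive"),
    ("horrible service", "negative"),
    ("wrong deduction horrible", "negative"),
]

def ngrams(text):
    """Return the unigrams and bigrams of a lower-cased, whitespace-split text."""
    words = text.lower().split()
    return words + [" ".join(pair) for pair in zip(words, words[1:])]

def train(corpus, top_k=5):
    """Build a keyword matrix {keyword: belief} from (text, label) pairs.

    Assumption: belief is the Laplace-smoothed share of a keyword's
    occurrences that fall in the positive class.
    """
    counts = {"positive": Counter(), "negative": Counter()}
    for text, label in corpus:
        counts[label].update(ngrams(text))

    keywords = set()
    for label, tf in counts.items():
        total = sum(tf.values())
        weights = {}
        for term, freq in tf.items():
            # Assumption: each class is treated as one tf-idf "document".
            df = sum(1 for c in counts.values() if term in c)
            weights[term] = (freq / total) * math.log(len(counts) / df)
        ranked = sorted(weights, key=lambda t: (-weights[t], t))
        # A weight of 0 means the term occurs in every class, so it is dropped.
        keywords.update(t for t in ranked[:top_k] if weights[t] > 0)

    matrix = {}
    for term in keywords:
        pos, neg = counts["positive"][term], counts["negative"][term]
        matrix[term] = (pos + 1) / (pos + neg + 2)
    return matrix

if __name__ == "__main__":
    for keyword, belief in sorted(train(SAMPLE_CORPUS).items()):
        print(f"{keyword:20s} {belief:.2f}")
```

In this sketch a word such as "service", which occurs in both classes, receives a tf-idf weight of zero and never enters the matrix, which is the filtering behavior the section describes. Re-running `train` on a refreshed corpus is one way the periodic retraining could pick up new vocabulary.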
4.2 Sentiment Scoring:
The input to the Scoring module is unclassified text and the output is the associated sentiment label and score. The keyword matrix learned in the Training phase is used by the scoring engine. The features present in the unclassified text are combined in a customized Bayesian classifier to perform the sentiment labeling.

[Figure: Unclassified Text and Keyword matrix -> Customized Bayesian Classifier -> Sentiment label with score]

5.0 Feature Extraction:
The input to the Feature Extraction module is unstructured text and the output is the set of features/attributes present in the text along with the associated sentiment. Initially, a feature dictionary is generated automatically from the corpus. In English, product features are usually nouns or noun phrases, so a Part of Speech (POS) tagger is used to label the words in the text. For this purpose we use an open source POS tagger. To build the feature dictionary, we use term-frequency weights to select the most relevant features. The sentence fragment containing each feature is then analyzed for sentiment. An overview of the module is presented below.

[Figure: Unstructured Text -> Split Text into sentence fragments -> Part of speech Tagger to extract features -> Feature Sentiment Polarity Classifier -> Feature extracted with positive/negative classification]
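Building the feature dictionary might look roughly like the sketch below. The document does not name the open source POS tagger, so a tiny hand-written lexicon (`TOY_LEXICON`) stands in for it here. That lexicon, the sample corpus, the rule of joining consecutive nouns into one phrase, and the `min_count` threshold used as the term-frequency filter are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical stand-in for an open source POS tagger: a tiny hand-written lexicon.
TOY_LEXICON = {
    "signal": "n", "strength": "n", "internet": "n", "coverage": "n",
    "is": "v", "good": "a", "slow": "a", "poor": "a",
}

# Hypothetical corpus used only for illustration.
SAMPLE_TEXTS = [
    "The signal strength is good.",
    "Signal strength is poor today.",
    "The coverage is poor.",
    "Coverage is good.",
    "Internet is slow.",
]

def pos_tag(words):
    """Tag each word as 'n', 'v' or 'a', or 'other' if it is not in the lexicon."""
    return [(word, TOY_LEXICON.get(word, "other")) for word in words]

def noun_phrases(tagged):
    """Join maximal runs of consecutive nouns into candidate feature phrases."""
    phrases, run = [], []
    for word, tag in tagged:
        if tag == "n":
            run.append(word)
        else:
            if run:
                phrases.append(" ".join(run))
            run = []
    if run:
        phrases.append(" ".join(run))
    return phrases

def build_feature_dictionary(texts, min_count=2):
    """Keep noun phrases that occur at least min_count times across the texts."""
    freq = Counter()
    for text in texts:
        words = text.lower().replace(".", " ").replace(",", " ").split()
        freq.update(noun_phrases(pos_tag(words)))
    return {phrase: count for phrase, count in freq.items() if count >= min_count}
```

With `SAMPLE_TEXTS`, the phrases "signal strength" and "coverage" each occur twice and are kept, while "internet" occurs once and is filtered out. A real tagger would replace `pos_tag`, and a relative weight could replace the raw count threshold.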
An illustrative example is provided below.

Unstructured Text:
The signal strength is good but the internet speed is slow.

Sentence fragments with POS tags:
The signal strength is good
<prp>The, <n>signal, <n>strength, <v>is, <a>good
internet speed is slow
<n>internet, <v>speed, <v>is, <a>slow

POS tag keys:
<n>: noun, <a>: adjective, <v>: verb

Each fragment is then analyzed for sentiment polarity to give the following feature scorecard.

Feature            Sentiment
Signal strength    positive
internet           negative

6.0 Integrating the Sentiment Mining and Feature Extraction Engine into an existing framework:
The Sentiment Mining and Feature Extraction Engine will be implemented in the Python programming language. The Sentiment Polarity Classification module will be coded from scratch and customized to meet the design goals. The Feature Extraction module makes use of an open source POS tagger.

The Engine will run as a web service on a server and expose an interface to the user. The user makes a function call to a running instance of the Sentiment Mining Engine with the text to be analyzed as the parameter. The response of the Sentiment Engine server can be encoded in either JSON or XML. It is the responsibility of the user to parse the XML/JSON and perform further reporting actions.

[Figure: User -> request = analyze(text) -> Sentiment Mining Engine -> JSON/XML encoded response containing sentiment polarity, score and feature scorecard]
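The `analyze(text)` call and its JSON response could be sketched as follows. The document does not define the customized Bayesian classifier or the response schema, so everything below is an assumption: the beliefs in `KEYWORD_MATRIX`, the contents of `FEATURE_DICTIONARY`, the naive Bayes combination with equal priors, the 0.1 neutral margin, and the JSON field names. The web service layer is omitted, and features are located by dictionary lookup rather than by a POS tagger.

```python
import json
import re

# Hypothetical keyword matrix: keyword -> belief that the text is positive.
KEYWORD_MATRIX = {"good": 0.9, "excellent": 0.95, "slow": 0.2, "poor": 0.1}
# Hypothetical learned feature dictionary.
FEATURE_DICTIONARY = ["signal strength", "internet", "coverage"]

def score(text):
    """Combine the beliefs of all matched keywords with Bayes' rule (equal priors)."""
    lowered = text.lower()
    pos, neg = 1.0, 1.0
    for keyword, belief in KEYWORD_MATRIX.items():
        if re.search(r"\b" + re.escape(keyword) + r"\b", lowered):
            pos *= belief
            neg *= 1.0 - belief
    return pos / (pos + neg)  # 0.5 when no keyword matches

def label(polarity, margin=0.1):
    """Map a polarity score in [0, 1] to a sentiment label."""
    if polarity > 0.5 + margin:
        return "positive"
    if polarity < 0.5 - margin:
        return "negative"
    return "neutral"

def split_fragments(text):
    """Split on punctuation and on the contrastive conjunction 'but'."""
    parts = re.split(r"[.,;!?]|\bbut\b", text, flags=re.IGNORECASE)
    return [part.strip() for part in parts if part.strip()]

def analyze(text):
    """Return a JSON string with overall sentiment, score and feature scorecard."""
    scorecard = {}
    for fragment in split_fragments(text):
        for feature in FEATURE_DICTIONARY:
            if feature in fragment.lower():
                scorecard[feature] = label(score(fragment))
    polarity = score(text)
    return json.dumps({
        "sentiment": label(polarity),
        "score": round(polarity, 3),
        "features": scorecard,
    })

if __name__ == "__main__":
    print(analyze("The signal strength is good but the internet speed is slow."))
```

A caller would parse the response with `json.loads` and read the `features` mapping for reporting. Scoring each fragment separately is what lets a single sentence yield a positive entry for one feature and a negative entry for another.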