Building a Sentiment Analytics Solution powered by Machine Learning- Webinar QA


Building a Sentiment Analytics Solution Powered by Machine Learning: Impetus Webinar, a Few Q&As

Published in: Technology, Education

  1. Webinar: Building a Sentiment Analytics Solution Powered by Machine Learning
     May 11, 2012
     Question and Answer Session

     Q: The idea that human sentiments can be classified as smile, neutral, or frown might be far too simple to provide actionable intelligence. Please expand on why you have chosen to go in this direction.
     Answer: We chose n-grams because they help generate patterns, which we can label as positive, negative, or neutral. The same pattern generation with a unigram model would eventually learn what these emoticons and smileys mean as we train it. However, we are not relying only on smileys, as there are many tweets and Facebook status updates where people don't use smileys to express their sentiments.

     Q: What is the max N in N-gram here?
     Answer: We usually keep it around 1-3, depending on the use case. The smaller the "n", the more exhaustive the knowledge bank will be, which also improves performance, as it did in our solution.

     Q: What are the industries that you've seen your solution to be most useful for? Is it strictly on a retail product basis?
     Answer: Our solution can work for the diverse industries and verticals present on social portals. It is also useful for organizations that use solutions like support tickets, where quantifying quality or drawing actionable insights by reading text is a tedious task.

     Impetus Proprietary Page 1
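The n-gram pattern generation discussed on this slide, with n kept between 1 and 3, can be sketched roughly as follows. This is a minimal illustration and not the presenters' implementation: the function names `ngrams` and `extract_patterns` are hypothetical, and splitting on whitespace is an assumed, simplistic tokenizer.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_patterns(text, max_n=3):
    """Collect all n-grams for n = 1..max_n from a piece of text.

    Assumption: lowercasing and whitespace splitting stand in for a real
    tokenizer. Emoticons such as ":)" survive as ordinary tokens, so a
    trained model can learn them like any other word.
    """
    tokens = text.lower().split()
    patterns = []
    for n in range(1, max_n + 1):
        patterns.extend(ngrams(tokens, n))
    return patterns


patterns = extract_patterns("Love this phone :)", max_n=2)
```

With `max_n=2`, the four tokens above yield four unigrams and three bigrams. Raising `max_n` adds longer patterns at the cost of a larger set to store and match.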
  2. Q: How does sentiment analysis work with Hadoop and machine learning tools like Mahout?
     Answer: We use a combination of HBase and Hadoop along with an in-memory architecture to leverage the huge volume of unstructured data and provide near real-time insights. While developing our solution, we balanced incrementing counters in real time with MapReduce jobs over the same data set to ensure data accuracy. Using PHPThrift helped us integrate our solution with Hadoop and Mahout. The n-gram technique used by Mahout is different from the one we use, so the solution must be optimized in terms of knowledge bank creation and performance, depending on the use case.

     Q: How do you work with established ontologies across multiple domains? Do you offer BI integration to compare and contrast the accuracy of the entity identification, which merges semantic analytics, NLP, and ontological analysis? Is N-gram extensible enough to handle the ontology question?
     Answer: N-gram can be extended to handle the ontology question only if we classify the subject areas in the ontology with a primary knowledge bank. The algorithm learns from the initial training data set and can also classify and fetch answers based on the initial classification. If a proper use case exists, we can extend the solution to the ontology domain.

     Q: Please let us know what accuracy you achieved using this model.
     Answer: We achieved about 99% accuracy with 1% false positives for an intensely trained classifier in one of our case studies; in another, the same knowledge bank gave us 96% accuracy. It all depends on the training data set and the way we train the classifier.
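The accuracy figures quoted on this slide come without a stated measurement method. One plausible way to compute such figures, assuming a hand-labeled test set is available, is sketched below. The function name `evaluate` is hypothetical, and treating every misclassification as a "false positive" for the label that was wrongly predicted is an assumption, not the presenters' stated definition.

```python
from collections import Counter


def evaluate(predicted, actual):
    """Return (accuracy, wrong predictions counted per predicted label)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one labeled example is required")
    pairs = list(zip(predicted, actual))
    correct = sum(1 for p, a in pairs if p == a)
    # Assumed definition: a wrong prediction counts against the label
    # the classifier emitted.
    wrong_by_label = Counter(p for p, a in pairs if p != a)
    return correct / len(actual), wrong_by_label


accuracy, wrong_by_label = evaluate(
    ["positive", "negative", "neutral", "positive"],
    ["positive", "negative", "neutral", "negative"],
)
```

Here three of four predictions match, and the single error is attributed to the "positive" label.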
  3. Q: Is there any minimum percentage of positive and negative attitudes?
     Answer: We benchmark a neutral band: say, 40-50% positive is treated as neutral. Any intensity below that band is negative, and any intensity above it is positive. You can customize the band to suit the industry and the accuracy required for your use case.

     Q: Which approach is better: unigram or bigram?
     Answer: Though we have used unigrams in our solution, we believe that both approaches are equally suitable. However, when the source is a micro-blogging website such as Twitter, where we need to draw sentiments from just 140 characters, each pattern generated with unigrams can contribute towards higher accuracy than the patterns generated with bigrams.

     Q: Sentiment is the main factor here, but what happens if the same word means different things in different contexts?
     Answer: Yes! It is indeed all about sentiments, and as humans we decide what a word means according to its context. Therefore, we need to train the classifier on the same word in different contexts and make it intelligent enough to predict as per the use case.

     Q: Would your solution take into account the context of the sentence, from a phrase as simple as "not rubbish" to a more complex context? Wouldn't a semantic analysis be required?
     Answer: Yes! The pattern prediction model decides the probability of a combination of patterns being negative, positive, or neutral. This makes it fairly accurate at predicting the right sentiment in complex contexts. If any inaccuracy exists, the manual False Positive dashboard lets you retrain the classifier to correct the calculations and capture the right context.
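The neutral-band benchmarking described on this slide can be sketched as a simple threshold rule. The function name `label_from_score` is hypothetical, and representing "intensity" as a positive score between 0 and 1 is an assumption; the 0.40 and 0.50 defaults mirror the 40-50% example in the answer and are meant to be tuned per industry.

```python
def label_from_score(positive_score, neutral_low=0.40, neutral_high=0.50):
    """Map a positive-sentiment score in [0, 1] to a label.

    Scores inside [neutral_low, neutral_high] are neutral, scores below
    the band are negative, and scores above it are positive.
    """
    if not 0.0 <= positive_score <= 1.0:
        raise ValueError("positive_score must be between 0 and 1")
    if neutral_low > neutral_high:
        raise ValueError("neutral_low must not exceed neutral_high")
    if positive_score < neutral_low:
        return "negative"
    if positive_score > neutral_high:
        return "positive"
    return "neutral"


labels = [label_from_score(s) for s in (0.20, 0.45, 0.90)]
```

Widening or shifting the band is the customization knob: a stricter use case might narrow it so that fewer items fall into the neutral bucket.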
  4. Q: How many tweets per second per core?
     Answer: This is quite case specific, though we were able to manage more than 1000 tweets per second. It depends entirely on the keywords you are searching for and the solution you are leveraging in the backend.

     Q: Big data analytics usually uses distributed processing that operates in batch mode, such as Hadoop. What kind of tech stack do you use to model and process in real time?
     Answer: A combination of HBase and Hadoop along with an in-memory architecture can be used to leverage the huge volume of unstructured data and provide near real-time insights. While developing our solution, we balanced incrementing counters in real time with MapReduce jobs over the same data set to ensure data accuracy. The tweet logs are ingested into HBase at a growing rate, and we report metrics such as tweet throughput, unique user growth over time, and return tweet user activity in near real time.
     Our solution enables customers to handpick statistics on demand to gain market insights and react quickly to trends. This is made possible by having Hadoop process the HBase data on HDFS, moving the analysis from batch to near real time. This approach gets about 80% of the way to near real-time analytics; the remaining 20% takes some extra effort, where you can introduce in-memory solutions like MemBase, GigaSpaces, or Memcached. In our solution, we used Memcached.

     Q: How many n-grams need to be stored for an actual production deployment?
     Answer: You really don't need to store millions of n-grams! If your knowledge bank is smart enough to make the classifier intuitive and intelligent, that should be enough.
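The balance between real-time counters and batch jobs over the same data set, as described on this slide, can be illustrated with a small in-process sketch. Everything here is a stand-in: the class `SentimentCounters` is hypothetical, a plain `Counter` plays the role of the in-memory counters, a Python list plays the role of the stored tweet log, and the assumption that drift comes from duplicate deliveries is for illustration only.

```python
from collections import Counter


class SentimentCounters:
    """Fast live counters plus a slower exact recount over the stored log."""

    def __init__(self):
        self.live = Counter()  # stand-in for in-memory real-time counters
        self.log = []          # stand-in for the persisted tweet log

    def record(self, tweet_id, label):
        """Real-time path: append to the log and bump the counter."""
        self.log.append((tweet_id, label))
        self.live[label] += 1

    def batch_recount(self):
        """Batch path: recount from the log, one label per tweet id."""
        latest = {}
        for tweet_id, label in self.log:
            latest[tweet_id] = label  # a redelivered tweet is counted once
        return Counter(latest.values())

    def reconcile(self):
        """Replace live counts with the batch result and report the drift."""
        exact = self.batch_recount()
        labels = set(self.live) | set(exact)
        drift = {k: self.live[k] - exact[k]
                 for k in labels if self.live[k] != exact[k]}
        self.live = Counter(exact)
        return drift


counters = SentimentCounters()
counters.record(1, "positive")
counters.record(2, "negative")
counters.record(1, "positive")  # assumed duplicate delivery
drift = counters.reconcile()
```

The live counter answers dashboard queries immediately, while the periodic recount corrects any overcounting that accumulated in between.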
  5. Q: Can your solution be used to classify sentiments at the individual customer level, especially in the financial industry?
     Answer: Yes, indeed! If you are sure about the lingo used in the industry and its relevant context, you can train the classifier on it, and it will give you information at the individual customer level.

     Q: Do you create the training data set manually to start with? Wouldn't this data change based on verticals such as technology vs. commerce vs. health care? I don't suppose you would use a common classifier for all verticals.
     Answer: Yes! We do create the training data set manually, and yes, it also changes based on the vertical. We can use a common classifier for all verticals, but we have to use a new training data set for each industry to get higher accuracy for that industry's lingo. For example, the stock market industry has a different lingo from the technology business.

     Q: Why did you opt to use Bayes instead of SVM?
     Answer: The Support Vector Machine (SVM) approach itself gives you many options, such as Epsilon-SVM, Nu-SVM, LS-SVM, Bayesian SVM, and Bayesian LS-SVM. For building an approach that predicts sentiments from social media channels, we found that a Bayesian classifier in combination with an N-gram classifier gave better results along with high performance. However, SVM can be considered as an alternative to Bayes, depending on what you are trying to classify and how you would use the classified information. Write to us at for more information.
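To make the Bayesian-plus-n-gram idea on this slide concrete, here is a minimal multinomial Naive Bayes sketch over unigram features. It is a textbook illustration under stated assumptions, not the presenters' classifier: the class name `NaiveBayesSentiment`, the whitespace tokenizer, the add-one smoothing, and the four training sentences are all invented for the example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSentiment:
    """Multinomial Naive Bayes over unigram tokens with add-one smoothing."""

    def __init__(self):
        self.doc_counts = Counter()               # documents seen per label
        self.token_counts = defaultdict(Counter)  # token counts per label
        self.vocab = set()

    def train(self, text, label):
        tokens = text.lower().split()
        self.doc_counts[label] += 1
        self.token_counts[label].update(tokens)
        self.vocab.update(tokens)

    def predict(self, text):
        """Return the label with the highest log-probability, or None."""
        tokens = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        # One extra slot in the denominator is reserved for unseen tokens.
        vocab_size = len(self.vocab) + 1
        best_label, best_score = None, -math.inf
        for label in sorted(self.doc_counts):
            score = math.log(self.doc_counts[label] / total_docs)
            label_total = sum(self.token_counts[label].values())
            for tok in tokens:
                count = self.token_counts[label][tok]
                score += math.log((count + 1) / (label_total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


clf = NaiveBayesSentiment()
clf.train("great phone love it", "positive")
clf.train("love the screen", "positive")
clf.train("terrible battery hate it", "negative")
clf.train("hate the slow screen", "negative")
```

Training a separate instance per vertical, each with its own hand-labeled data, would correspond to the per-industry training sets mentioned in the answer above.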