Webinar: Building a Sentiment Analytics Solution Powered by Machine Learning
May 11, 2012
Question and Answer Session

Q: The idea that human sentiments can be classified as smile, neutral, or frown
might be far too simple to provide actionable intelligence. Please expand on
why you have chosen to go in this direction.

Answer: We chose n-grams because they help in generating patterns, which we can
label as positive, negative, or neutral. The same pattern generation with a
uni-gram model would eventually learn what these emoticons and smileys mean, as
and when we train it. However, we are not relying solely on smileys, as there
are many tweets and Facebook status updates where people don't use smileys to
express their sentiments.

Q: What is the max N in N-gram here?

Answer: We usually keep it around 1-3, depending on the use case. The smaller
the "n", the more exhaustive the knowledge bank becomes, which also improves
performance, just as we achieved in our solution.
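As a rough illustration of the pattern generation described above (this is an invented sketch, not the production code, and the tokenization is deliberately naive), word-level n-grams for n = 1 to 3 can be produced like this:

```python
def ngrams(text, max_n=3):
    """Generate word n-grams (n = 1..max_n) from a short text such as a tweet."""
    tokens = text.lower().split()
    patterns = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            patterns.append(" ".join(tokens[i:i + n]))
    return patterns

# A smiley simply survives tokenization as its own unigram pattern.
print(ngrams("great phone :)", max_n=2))
```

Each generated pattern would then be looked up in, or added to, the labeled knowledge bank.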

Q: In which industries have you seen your solution to be most useful? Is it
strictly for retail products?

Answer: Our solution can work for the diverse industries and verticals present
on social portals.

It is also useful for organizations that use systems such as support tickets,
where quantifying quality or drawing actionable insights from raw text is a
tedious task.




Impetus Proprietary                                                            Page 1
Q: How does sentiment analysis work with Hadoop and machine learning tools
like Mahout?

Answer: We use a combination of HBase and Hadoop along with an in-memory
architecture to leverage the huge volumes of unstructured data and provide near
real-time insights. While developing our solution, we balanced incrementing
counters in real time with MapReduce jobs over the same data set to ensure data
accuracy. Using PHPThrift helped us integrate our solution with Hadoop and
Mahout.

The n-gram technique used by Mahout is different from the one we use, and
therefore the solution must be optimized in terms of knowledge bank creation
and performance, depending upon the use case.

Q: How do you work with established ontologies across multiple domains? Do
you offer BI integration to compare and contrast the accuracy of entity
identification, which merges semantic analytics, NLP, and ontological
approaches? Is N-Gram extensible enough to handle the ontology question?

Answer: N-Gram can be extended to handle the ontology question only if we
classify the subject areas of the ontology with a primary knowledge bank. The
algorithm is smart enough to learn from the initial training data set and can
also classify and fetch answers based on the initial classification. If a
proper use case exists, we can extend the solution to the ontology domain.

Q: Please let us know what accuracy you achieved using this model.

Answer: We achieved about 99% accuracy with 1% false positives with an
intensively trained classifier for one of our case studies, and for another,
the same knowledge bank gave us 96% accuracy. It all depends on the training
data set and the way we train the classifier.




Q: Is there any minimum percentage of positive and negative attitudes?

Answer: We use a technique of benchmarking neutral: say, 40-50% positive
intensity is neutral. Any intensity below that is negative, and anything above
it is positive. You can customize this per industry and per the accuracy
required for your use case.
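The benchmarking scheme above can be sketched as a simple thresholding function. This is a minimal illustration with the band boundaries from the answer made into parameters; the function name and exact cut-off behavior are assumptions:

```python
def label(positive_intensity, neutral_low=0.40, neutral_high=0.50):
    """Map a positive-intensity score in [0, 1] to a sentiment label.

    Scores inside the configurable neutral band count as neutral;
    below it negative, above it positive.
    """
    if positive_intensity < neutral_low:
        return "negative"
    if positive_intensity <= neutral_high:
        return "neutral"
    return "positive"

print(label(0.30))  # negative
print(label(0.45))  # neutral
print(label(0.80))  # positive
```

Tuning `neutral_low` and `neutral_high` per industry is how the customization mentioned above would be expressed.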

Q: Which approach is better - Unigram or Bi-gram?

Answer: Though we have used unigrams in our solution, we believe that both
approaches are equally suitable. However, when the source is a micro-blogging
website such as Twitter, where we need to draw sentiments from just 140
characters, each pattern generated with a unigram can contribute towards higher
accuracy vis-à-vis the patterns generated with a bi-gram.

Q: Sentiment is the main factor here, but what happens if the same word means
something different in a different context?

Answer: Yes! It is indeed all about sentiments, and as humans we decide what a
word means based on the related context.

Therefore, we need to train the classifier on the same word in different
contexts and make it intelligent enough to predict as per the use case.

Q: Would your solution take into account the context of the sentence, from a
phrase as simple as 'not rubbish' to a more complex context? Wouldn't a
semantic analysis be required?

Answer: Yes! The pattern prediction model decides the probability of a pattern,
or combination of patterns, being negative, positive, or neutral. This makes it
quite accurate at predicting the right sentiment in complex contexts. And if
any inaccuracy exists, the manual false-positive dashboard lets users retrain
the classifier to correct the calculations and capture the right context.
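To make the 'not rubbish' case concrete, here is an invented toy sketch (the pattern table and scores are made up for illustration) showing how a longer pattern can override a shorter one, which is one way a pattern model handles simple negation without full semantic analysis:

```python
# Toy pattern knowledge bank: the bigram "not rubbish" overrides
# the unigram "rubbish" because longer matches win.
PATTERNS = {"rubbish": "negative", "not rubbish": "positive"}

def classify(text):
    """Return the label of the longest known pattern found in the text."""
    tokens = text.lower().split()
    best, best_len = "neutral", 0
    for n in (1, 2):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            if gram in PATTERNS and n > best_len:
                best, best_len = PATTERNS[gram], n
    return best

print(classify("this is rubbish"))      # negative
print(classify("this is not rubbish"))  # positive
```

A real knowledge bank would hold probabilities rather than hard labels, but the longest-match idea carries over.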




Q: How many tweets per second per core?

Answer: This is quite case specific, though we were able to manage more than
1,000 tweets per second. It ultimately depends on the keywords you are
searching for and the solution you are leveraging in the backend.

Q: Big data analytics usually uses distributed processing frameworks, such as
Hadoop, that operate in batch mode. What kind of tech stack do you use to model
and process in real time?

Answer: A combination of HBase and Hadoop along with an in-memory architecture
can be used to leverage the huge volumes of unstructured data and provide near
real-time insights. While developing our solution, we balanced incrementing
counters in real time with MapReduce jobs over the same data set to ensure data
accuracy.

The Tweet logs are ingested into HBase at a growing rate, and we report metrics
such as Tweet throughput, unique user growth over time, and return Tweet user
activity in near real-time.

Our solution enables customers to handpick statistics on demand to gain market
insights and react quickly to trends. This is all made possible by processing
the HBase data on HDFS with Hadoop, in order to move it from batch to near
real-time.

This approach brings you 80% of the way to near real-time analytics; the
remaining 20% takes some extra effort, where you can introduce in-memory
solutions like Membase, GigaSpaces, or Memcached. In fact, in our solution, we
used Memcached.
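The balance described above, live counter increments reconciled by a batch pass over the same data, can be sketched conceptually. This is not the actual HBase/Memcached code; plain Python structures stand in for the stores, and all names are invented:

```python
from collections import Counter

live = Counter()  # stands in for in-memory counters (e.g. Memcached)
log = []          # stands in for the raw tweet log (e.g. HBase)

def ingest(user):
    """Append the event to the durable log and bump the live counter."""
    log.append(user)
    live[user] += 1  # near real-time increment, readable immediately

def batch_reconcile():
    """Stand-in for the periodic MapReduce job over the same data set:
    recompute the counters from the full log to correct any drift."""
    live.clear()
    live.update(Counter(log))

ingest("alice"); ingest("bob"); ingest("alice")
print(live["alice"])  # 2, available without waiting for a batch job
batch_reconcile()
print(live["alice"])  # still 2 after the authoritative recount
```

The point of the pattern is that readers get fresh numbers from the live counters, while the batch recount guarantees accuracy over time.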

Q: How many n-grams need to be stored for an actual production deployment?

Answer: You really don't need to store millions of n-grams! If your knowledge
bank is smart enough to make the classifier intuitive and intelligent, that
should be enough.




Q: Can your solution be used to classify sentiments at the individual
customer level, especially in the financial industry?

Answer: Yes, indeed! If you are sure about the lingo used in the industry and
its relevant context, you can train the classifier on it, and it will give you
information at the individual customer level.

Q: Do you create the training data set manually to start with? Wouldn't this
data change based on the vertical, such as technology vs. commerce vs. health
care? I don't suppose you would use a common classifier for all verticals.

Answer: Yes! We do create the training data set manually, and yes, it also
changes based on the vertical.

We can use a common classifier for all verticals, but we would have to use a
new training data set for each industry to get higher accuracy on the industry
lingo. For example, the stock market industry has a different lingo vis-à-vis
the technology business.

Q: Why did you opt to use Bayes instead of SVM?

Answer: The Support Vector Machines (SVM) approach itself gives you many
options, like Epsilon-SVM, Nu-SVM, LS-SVM, Bayesian SVM, and Bayesian LS-SVM.

For building an approach that predicts sentiments from social media channels,
we found that a Bayesian classifier in combination with an n-gram classifier
gives better results along with high performance. However, SVM can be
considered as an alternative to Bayes, depending upon what you are trying to
classify and how you would use the classified information.
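To make the Bayesian-plus-n-gram combination tangible, here is a minimal multinomial Naive Bayes sketch over unigram features using only the standard library. The training samples are invented, class priors are omitted because the toy classes are balanced, and this is an illustration of the general technique rather than the solution discussed above:

```python
import math
from collections import defaultdict

def train(samples):
    """Count token frequencies per label over (text, label) pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for text, label in samples:
        for tok in text.lower().split():
            counts[label][tok] += 1
            totals[label] += 1
    vocab = {t for c in counts.values() for t in c}
    return counts, totals, vocab

def predict(text, counts, totals, vocab):
    """Return the label with the highest log-likelihood for the text."""
    scores = {}
    for label in counts:
        s = 0.0
        for tok in text.lower().split():
            # Laplace smoothing so unseen tokens don't zero out a class
            s += math.log((counts[label][tok] + 1) / (totals[label] + len(vocab)))
        scores[label] = s
    return max(scores, key=scores.get)

model = train([("love this phone", "positive"),
               ("great battery life", "positive"),
               ("hate this phone", "negative"),
               ("awful screen resolution", "negative")])
print(predict("great battery", *model))  # positive
```

Swapping the unigram tokenizer for an n-gram generator turns the same classifier into the n-gram variant mentioned in the answer.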


             Write to us at inquiry@impetus.com for more information.



