This document discusses building a sentiment analysis solution powered by machine learning. It begins with an introduction to sentiment analysis and outlines the existing landscape of solutions. It then discusses challenges like accuracy and isolating content types. The document proposes that machine learning can help address these challenges by analyzing sentiment versus subjectivity, polarity reactions, and sentiment intensity. It describes how to build such a solution using machine learning, including creating a knowledge base and leveraging machine learning algorithms. Finally, it outlines Impetus Technologies' sentiment analysis solution and the benefits it provides.
Building a Sentiment Analytics Solution Powered by Machine Learning- Impetus White Paper
Abstract
This white paper talks about the need for building a Sentiment
Analytics solution based on Machine Learning. It focuses on why
Sentiment Analysis is vital in today’s world, the existing solutions
landscape and why Machine learning is recommended to build
such a solution and gather better business insights.
In this white paper you will also learn about how to build a
Machine Learning solution, and the benefits you can accrue from
it.
Impetus Technologies, Inc.
www.impetus.com
Table of Contents
Introduction
The four stages of Sentiment Analytics
Existing landscape of Sentiment Analytics solutions
Challenges facing Sentiment Analytics
  Demystifying accuracy
  Isolating content types
  Sentiment override
Machine Learning to the rescue
  Sentiment versus subjectivity
  Analyzing polarity reactions
  Understanding sentiment intensity
Building Sentiment Analytics Solution powered by Machine Learning
  How Machine Learning works
  Building a Sentiment Analytics tool
  Creating a knowledge base
  Leveraging Machine Learning for analyzing sentiments
  The Impetus solution
  The benefits offered by the Impetus solution
Conclusion
Introduction
Sentiment Analytics or opinion mining refers to a broad area of Natural
Language Processing (NLP), computational linguistics and text mining. Generally
speaking, Sentiment Analytics determines the attitude of a speaker or a writer
with respect to a particular subject. The attitude may be based on the person’s
judgment, evaluation or the affective state—that is to say, the emotional state
of the author while writing. It can also be the intended emotional
communication—that is, the emotional impact that the author would want to
have on the reader.
Sentiment Analytics can therefore be described as a discipline that helps
organizations measure, evaluate and explain the performance of their social
media initiatives, based on the opinions or sentiments people express on social
portals such as Twitter and Facebook, among others.
The Four Stages of Sentiment Analytics
1. The first and foremost event that acts like a trigger is the launch of a
campaign on various communication channels.
2. In the second stage, the target audience reacts online. Its reactions
provide a collection of opinions and sentiments about the campaign and
the product.
3. The next step is analyzing the sentiments, where the findings are presented
to the top management, strategists, and other stakeholders, such as the
sales and marketing team. This visual and interactive information facilitates
better insights.
A Sentiment Analytics solution is used here for gauging the performance of
a campaign, product, or a brand and to check how well received it is in the
marketplace.
4. Based on these insights and analysis, comes the innovation stage, where
businesses respond to what their target audience is saying about their
offerings, and accordingly, manage their reputation.
It helps them identify how and what their existing customers are talking
about and appropriately design a Customer Relationship Management
system. It also enables them to track changes in perception over time.
Sentiment Analytics is the key to quantifying these reactions and deriving
actionable insights from them.
Our experience has been that the cost of building reliable storage using
commodity hardware may be around USD 1 per gigabyte. This is only for storage
and does not include the cost associated with managing, monitoring, and
hosting the Big Data.
Existing Landscape of Sentiment Analytics Solutions
Sentiment Analytics is conventionally handled by two popular techniques:
Natural Language Processing and Artificial Intelligence.
Natural Language Processing
In the first option, the algorithm is trained to understand the natural language
and draw inferences from it. This option is usually not very successful, as it is
hindered by factors such as internationalization and by the language used
in tweets and Facebook status updates, which differs considerably from standard
natural language.
Artificial intelligence
The second option uses information from NLP and mathematical calculations to
determine negative, positive, or neutral sentiments. The Machine Learning
solution is a part of Artificial Intelligence.
At Impetus, we believe that
the biggest advantage of
commodity hardware is that
you can build it yourself, and
that there are many avenues
open for innovation.
Commodity hardware is
readily available and people
have easy access to it.
Therefore, while using
commodity hardware, you
have the option of
customizing and optimizing it,
over and beyond the existing
offering.
Sentiment Analytics opens up a host of new opportunities and perspectives.
However, before leveraging them, organizations need to overcome certain
challenges.
Let us now check out the pluses and minuses of using Open Source and cloud
computing. Obviously, using free and Open Source software to store, manage,
and analyze Big Data is a good idea. We all know that Hadoop can be leveraged
to tackle large volumes of data, while saving significantly on costs.
Challenges Facing Sentiment Analytics
Demystifying accuracy
The first challenge is the inability of machines to gauge and measure sentiments
accurately. A well-designed algorithm can lower the False Positive rate, a term
this paper uses for misclassified text in general. In other words, a tweet that
should be classified as positive is classified as positive, and not as neutral or
negative.
In a Machine Learning-based system, the accuracy of a system is dependent on
the False Positive rate, which should ideally be lower than 4-5 percent,
depending on the training data set and the information that is given to the
algorithm to learn.
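A minimal sketch of how this rate could be measured, assuming (as the surrounding text suggests) that any tweet given the wrong label counts toward it. The function name `misclassification_rate` and the sample labels are hypothetical and serve only as an illustration.

```python
def misclassification_rate(predicted, actual):
    """Fraction of items whose predicted label differs from the reviewed label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        return 0.0
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)


# Hypothetical labels for five tweets (illustrative data only).
actual = ["positive", "neutral", "negative", "positive", "neutral"]
predicted = ["positive", "neutral", "negative", "neutral", "neutral"]

rate = misclassification_rate(predicted, actual)
print(f"misclassification rate: {rate:.0%}")
```

On a real system, the `actual` labels would come from manually reviewed data, and the result would be compared against the 4-5 percent target mentioned above.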
Isolating content types
The next problem area is inconsistency across social networks and the neutral
nature of social media. Take the instance of tweets. Typically, around 60
percent of tweets are neutral: most carry no opinion or explicit sentiment.
It is also difficult to determine the Target Overlook Actual Verbatim, which is
caused by people re-tweeting or updating the same status again and again,
thereby, diluting the sentiment’s overall strength.
Sentiment override
Another problem is the inability to allow clients to leverage contextual
knowledge in order to apply accurate sentiment scores. Most current
systems, for example, cannot detect sarcasm.
Machine Learning to the Rescue
The challenges discussed can be addressed effectively by Machine Learning. It is
a fact that computers need intelligence to cater to the needs of people.
Learning and knowledge are central to intelligence and Machine Learning takes
care of this requirement.
Machine Learning is a structure capable of acquiring and integrating knowledge
automatically. The ability of machines to learn from experience, training,
analytical observation, and other means creates an intelligent system that can
continuously self-improve and thereby, exhibit efficiency and effectiveness.
Having discussed the challenges associated with analyzing sentiments and
established Machine Learning as the favored approach, it is also important to
discuss some of the critical questions that an ideal Sentiment Analytics tool
needs to address.
These questions concern subjectivity versus sentiment, the polarity of reactions,
and the intensity of sentiment.
Sentiment versus subjectivity
As mentioned earlier, tweets can be classified as positive, negative, and neutral.
The question is how does a machine know about subjectivity and sentiment?
Subjectivity is the linguistic expression of opinions, sentiments, emotions,
evaluations, beliefs, and speculations.
The basic components of Sentiment Analytics involve a sentiment holder, who is
the person or organization that holds a specific opinion on a particular focus
area; the object, on which a sentiment is expressed and lastly, the sentiment,
which is a view, attitude, or appraisal on an object from a sentiment holder.
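The three components described above can be captured in a small record type. This is only a sketch: the class name `Opinion`, its field names, and the sample values are assumptions made for illustration, not part of the paper.

```python
from dataclasses import dataclass

VALID_SENTIMENTS = ("positive", "negative", "neutral")


@dataclass(frozen=True)
class Opinion:
    holder: str     # person or organization that holds the opinion
    target: str     # the object on which the sentiment is expressed
    sentiment: str  # the holder's view of the target

    def __post_init__(self):
        # Reject labels outside the three classes used in this paper.
        if self.sentiment not in VALID_SENTIMENTS:
            raise ValueError(f"unknown sentiment: {self.sentiment!r}")


# Hypothetical example record.
example = Opinion(holder="@some_user", target="Product X", sentiment="negative")
print(example)
```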
A statistical and probabilistic approach can be used to define what is subjective
and how to use it in Sentiment Analytics.
Analyzing polarity reactions
The second big question is how does a machine analyze the polarity of reactions
in terms of negative and positive?
Machine Learning plays a big role in deciding the polarity. Whether a sentence is
positive, negative, or neutral is judged by an algorithm, which refers to its initial
knowledge bank to decide the polarity of a new sentence or a word.
A knowledge bank is the database of some pre-trained words with information
on the patterns of the words and sentences defined as negative, neutral, and
positive.
Understanding sentiment intensity
The third critical question relates to how a machine knows about sentiment
intensity. Whether a word or a sentence is strongly positive or strongly negative
is decided by a couple of metrics.
One technique is to benchmark a neutral band: say a score of 40-50 percent
positive is treated as neutral, any intensity below that as negative, and any
intensity above it as positive.
Also, by referring to the knowledge bank, which is being continuously trained by
a Machine Learning algorithm, the prediction of intensity becomes more accurate.
The occurrence of a word or a sequence of words in a particular
polarity decides the intensity of the overall sentiment.
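The neutral benchmark can be sketched as a simple threshold function. The 40-50 percent band comes from the text; the function names, the way the score is derived from occurrence counts, and the default for unseen patterns are assumptions made for this illustration.

```python
NEUTRAL_LOW = 0.40   # lower edge of the neutral band (40 percent positive)
NEUTRAL_HIGH = 0.50  # upper edge of the neutral band (50 percent positive)


def positive_share(pos_count, neg_count):
    """Share of positive occurrences of a word or pattern, in [0, 1]."""
    total = pos_count + neg_count
    if total == 0:
        # Assumption: an unseen pattern sits in the middle of the neutral band.
        return (NEUTRAL_LOW + NEUTRAL_HIGH) / 2
    return pos_count / total


def polarity_from_score(score):
    """Map a positive share to a polarity label using the neutral band."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score < NEUTRAL_LOW:
        return "negative"
    if score > NEUTRAL_HIGH:
        return "positive"
    return "neutral"


# Hypothetical counts: a pattern seen 9 times in positive text, once in negative.
print(polarity_from_score(positive_share(9, 1)))
```

The further a score lies from the band, the stronger the sentiment, which gives a rough measure of intensity as well as polarity.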
The accuracy of the system depends on the range and the complexity of the
language. Therefore, the wider the net is thrown, and the more difficult the
language gets, the less accurate the system is likely to be.
Furthermore, it is easier to classify sentiments on the basis of positive and
negative, vis-à-vis the scenario where they are being further distributed on the
basis of exactness—excellent, incredible, good, and so on. Enhanced granularity
requires enhanced accuracy, and this in turn demands a deeper understanding
of the human language.
Building Sentiment Analytics Solution Powered by Machine Learning
How Machine Learning works
Having talked about how Machine Learning can help address the key problem
areas, it is now important to focus on how the system actually works.
To understand this at a high level, one can collect the input text in the form of
tweets using Search APIs. Alternatively, it is possible to use Facebook
Graph search results or any other REST API or XML output.
This information is then processed by a Machine Learning algorithm based on
Bayesian probability, together with text classification and pattern generation
that use the n-gram technique.
The processed information generates results based on the initial knowledge
bank, which is a trained set of n-gram pattern texts, classified as positive,
negative, and neutral. The new information is processed and labeled as
positive, negative, or neutral.
To catch False Positives, there is a manual dashboard to
review and correct the polarity. Once a False Positive is found and corrected,
the Machine Learning classifier re-runs, learns from the new training data
set, and corrects its calculations for intensity and polarity prediction.
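The review-and-retrain loop can be sketched roughly as follows. This is not the paper's classifier: `ReviewableClassifier`, its crude word-count scoring, and the sample texts are hypothetical stand-ins chosen only to show how a manual correction feeds back into the training data.

```python
from collections import Counter, defaultdict


class ReviewableClassifier:
    def __init__(self):
        self.examples = []                       # (text, label) training pairs
        self.word_counts = defaultdict(Counter)  # label -> word -> count

    def train(self, text, label):
        self.examples.append((text, label))
        self.word_counts[label].update(text.lower().split())

    def classify(self, text):
        words = text.lower().split()
        # Score each label by how often it has seen the words of this text.
        scores = {label: sum(counts[w] for w in words)
                  for label, counts in self.word_counts.items()}
        if not scores or max(scores.values()) == 0:
            return "neutral"
        return max(scores, key=scores.get)

    def correct(self, text, right_label):
        """Called from the review dashboard: add the reviewer's label and relearn."""
        self.train(text, right_label)


clf = ReviewableClassifier()
clf.train("awful service", "negative")
clf.train("love this phone", "positive")

text = "sick of this phone"
before = clf.classify(text)    # misclassified because of 'this' and 'phone'
clf.correct(text, "negative")  # reviewer fixes the label in the dashboard
after = clf.classify(text)
print(before, "->", after)
```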
Building a Sentiment Analytics tool
A Sentiment Analytics tool that leverages Machine Learning can keep time
complexity low without sacrificing performance. The first step
in building this tool is to come up with simple and effective methods.
You can employ the n-gram approach for text classification, which is used
frequently to model a phenomenon in natural languages. It is possible to
develop two simple variations of this approach, which yield high performance
ratios for filtering sentiments.
The second step is to mimic the way humans perceive sentiment. Whenever a new
text is received, the system will read the initial
parts of the text and then decide whether the incoming text is heading towards
negative or positive.
In case the sentiment is negative, it is actually not required to read the entire
text to conclude the sentiment as negative, as just a quick glance can suffice.
This human behavior is simulated by means of a heuristic, referred to as
the first 'n words' heuristic. Based on this, considering the first 'n words' of an
incoming text and discarding the rest can yield the correct class.
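A minimal sketch of the first 'n words' heuristic is shown here. The helper names, the value of `n`, and the toy word lists are assumptions for illustration; any real classifier could be passed in place of `toy_classify`.

```python
NEGATIVE_WORDS = {"terrible", "awful", "hate"}   # hypothetical word list
POSITIVE_WORDS = {"great", "love", "excellent"}  # hypothetical word list


def first_n_words(text, n=5):
    """Keep only the first n whitespace-separated words of the text."""
    return " ".join(text.split()[:n])


def toy_classify(text):
    """Stand-in classifier: looks for known sentiment words."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    if words & NEGATIVE_WORDS:
        return "negative"
    if words & POSITIVE_WORDS:
        return "positive"
    return "neutral"


def classify_with_heuristic(text, classify, n=5):
    """Classify using only the first n words, discarding the rest."""
    return classify(first_n_words(text, n))


review = ("Terrible experience, the screen cracked within a week "
          "and support never replied")
print(first_n_words(review, 3))
print(classify_with_heuristic(review, toy_classify, n=3))
```

The trade-off is speed against the risk of missing sentiment that only appears later in the text, so `n` would need to be tuned on real data.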
The process of tokenizing text into words is different for every language and
there are numerous forms and tenses used in any language. For example:
walk, walking, walked, caminar, caminó, and caminando.
All of these words carry the same meaning, but because their written
forms differ, they differ in terms of polar intensity. It is possible to address this issue
by implementing an n-gram generated pattern, which is a sub-sequence of 'n'
items from a given sequence, instead of words. The n-gram approach primarily
considers the sequence of letters or characters that make up a word, rather than
the language they belong to. Therefore, it is likely to succeed where the
NLP-based solution fails.
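To illustrate, character n-grams can be generated with a sliding window over each word. The function names and the choice of n = 3 are assumptions for this sketch; the point is that different forms of a word share most of their patterns regardless of language.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of a word, in order."""
    word = word.lower()
    if len(word) < n:
        return [word] if word else []
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def shared_ngrams(a, b, n=3):
    """Patterns that two words have in common."""
    return set(char_ngrams(a, n)) & set(char_ngrams(b, n))


print(char_ngrams("walk"))
print(sorted(shared_ngrams("walk", "walking")))
print(sorted(shared_ngrams("caminar", "caminando")))
```

No stemming rules or language-specific dictionaries are needed: the English and Spanish forms are handled by the same code.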
Creating a knowledge base
A knowledge base includes a database table with the information of n-gram
patterns, identified as positive, negative, or neutral and generated from a
training dataset.
The n-gram model can be used to store more contexts with a well-understood
space–time tradeoff, enabling small experiments to scale up very efficiently.
Therefore, the training data in a big data stack is feedback-loop enabled, which
means that newly classified n-gram patterns are labeled as positive, negative,
or neutral. This data starts influencing the polarity intensity prediction model for
the existing patterns, making it a more accurate knowledge bank.
For instance, a pattern generated for the word 'rubbish' will be treated as
negative and the algorithm will identify words like rubbished and rubbishing as
negative as well.
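A knowledge base of this kind can be sketched with an in-memory dictionary standing in for the database table. The function names, the training words, and the choice of 4-character patterns are hypothetical; the example only shows how a pattern learned from 'rubbish' carries over to related word forms.

```python
from collections import Counter, defaultdict


def char_ngrams(word, n=4):
    word = word.lower()
    if len(word) < n:
        return [word] if word else []
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def build_knowledge_base(training, n=4):
    """Map each n-gram pattern to the labels it was seen with."""
    kb = defaultdict(Counter)
    for word, label in training:
        for gram in char_ngrams(word, n):
            kb[gram][label] += 1
    return kb


def classify_word(kb, word, n=4):
    """Label a word by the votes of its known n-gram patterns."""
    votes = Counter()
    for gram in char_ngrams(word, n):
        if gram in kb:  # membership test avoids creating empty entries
            votes.update(kb[gram])
    return votes.most_common(1)[0][0] if votes else "neutral"


# Hypothetical training data.
training = [("rubbish", "negative"), ("excellent", "positive"), ("table", "neutral")]
kb = build_knowledge_base(training)

for word in ("rubbished", "rubbishing", "excellently"):
    print(word, "->", classify_word(kb, word))
```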
Leveraging Machine Learning for analyzing sentiments
A tweet or a Facebook status text is first processed by an n-gram filter for
pattern generation. If the pattern already exists, the pattern’s polarity intensity
is increased or decreased based on the text classification. Else, a new pattern is
generated and is labeled by its polarity, based on the existing training data.
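The update rule in this paragraph can be sketched as follows. The signed-integer representation of polarity intensity, the function name `update_patterns`, and the step size are assumptions for illustration.

```python
def update_patterns(patterns, grams, label, step=1):
    """Update pattern intensities for one classified text.

    patterns maps an n-gram to a signed intensity:
    above zero is positive, below zero is negative, zero is neutral.
    """
    delta = {"positive": step, "negative": -step}.get(label, 0)
    for gram in grams:
        if gram in patterns:
            patterns[gram] += delta  # existing pattern: raise or lower intensity
        else:
            patterns[gram] = delta   # new pattern: starts at the text's polarity
    return patterns


# Hypothetical starting state: 'rubb' has already been seen twice as negative.
patterns = {"rubb": -2}
update_patterns(patterns, ["rubb", "bish"], "negative")
print(patterns)
```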
The two core advantages of
the n-gram model are its
relative simplicity and the
ability to scale up, by simply
increasing the ‘n.’
This pattern is processed through a Bayesian filter that is based on the principle
that most events are dependent and the probability of an event occurring in the
future can be inferred from the previous occurrences of that event.
A probability value is then assigned to each word or token. This value is
calculated from how often that word occurs
in one category versus another. The most common application of the filter is to
identify words that appear in the negative sentiment category, versus the
positive sentiment category.
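One common way to compute such per-token probabilities is a naive Bayes classifier with add-one smoothing, sketched here. The paper does not specify its exact formula, so the class `NaiveBayesSentiment`, the smoothing scheme, and the training sentences are assumptions made for illustration.

```python
import math
from collections import Counter


class NaiveBayesSentiment:
    def __init__(self, labels=("positive", "negative")):
        self.labels = labels
        self.token_counts = {label: Counter() for label in labels}
        self.doc_counts = Counter()
        self.vocab = set()

    def train(self, tokens, label):
        self.token_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def token_probability(self, token, label):
        """P(token | label), add-one smoothed with one extra slot for unseen tokens."""
        total = sum(self.token_counts[label].values())
        return (self.token_counts[label][token] + 1) / (total + len(self.vocab) + 1)

    def classify(self, tokens):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.labels:
            # Smoothed prior, then log-probabilities to avoid underflow.
            prior = (self.doc_counts[label] + 1) / (total_docs + len(self.labels))
            score = math.log(prior)
            score += sum(math.log(self.token_probability(t, label)) for t in tokens)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


nb = NaiveBayesSentiment()
nb.train(["love", "this", "phone"], "positive")
nb.train(["hate", "this", "phone"], "negative")
nb.train(["love", "the", "camera"], "positive")

print(nb.classify(["hate", "phone"]))
print(nb.classify(["love", "camera"]))
```

The same class works unchanged if the tokens are n-gram patterns rather than whole words.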
This solution breaks down the content to improve the filter by supplementing it
not only with a database of words to categorize, but also with sets of n-grams
derived from the text. The algorithm additionally helps with the extraction and
offers a few more layers of depth for Bayesian filtering.
Based on the combination of the n-gram pattern match and the Bayes filter
classification, the tweet is labeled as positive or negative. Because this
information is in turn treated as new training data, the algorithm re-uses it to
make the knowledge bank more intelligent.
If the feedback loop of the training data sets is cut and the knowledge bank is
kept intact for a specific use case, it is possible to run new text through the
n-gram filter and the Bayes classifier concurrently to get the results.
Also, in the complete process of new pattern generation, if there is a False
Positive, this can be manually corrected in the FP-Dashboard.
The Impetus solution
Impetus has successfully adopted this approach and built a Sentiment Analytics
solution which is powered by Machine Learning and leverages Big Data
technologies.
Our solution intuitively retrieves the input text for analysis, using artificial
intuition, which is a sophisticated function of an artifice that can interpret data
in-depth and locate hidden factors.
The solution is smart enough to change its source of information, juggling
between Twitter, Facebook, an XML file, Text file, etc., and filtering out the
noise from the content which holds the sentiment.
It also offers the option of using custom REST APIs, so that cross-functional
teams can build on top of the solution. It is an intuitive solution, capable of
processing near real-time data using the Big Data stack. The solution works on
LAMP, HBase, Hadoop, and PHP Thrift and has been tested for different scenarios.
Our primary purpose of using Big Data is for analytics, and we perform
concurrent processing to enable fast results with higher accuracy. Here, the
latency is more important than batching.
Therefore, we recommend a combination of HBase and Hadoop along with an
in-memory architecture to leverage the large volume of unstructured data and
provide near real-time insights. While developing this solution, we balanced
incrementing counters in real time with MapReduce jobs over the same data set
to ensure data accuracy.
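The idea of checking live counters against a batch recount can be sketched in plain Python. This is not HBase or Hadoop code: the functions below are hypothetical stand-ins that mimic a map step and a reduce step over the same records, and the sample data is invented.

```python
from collections import Counter
from functools import reduce


def map_phase(record):
    """Emit a count of one for the record's sentiment label."""
    return Counter({record["label"]: 1})


def batch_recount(records):
    """Sum the per-record counts, as a reduce step would."""
    return reduce(lambda a, b: a + b, map(map_phase, records), Counter())


def reconcile(realtime, batch):
    """Return the adjustment needed for each label whose live count drifted."""
    labels = set(realtime) | set(batch)
    return {label: batch[label] - realtime[label]
            for label in labels if batch[label] != realtime[label]}


# Hypothetical stored records and live counters (one negative was double counted).
records = [{"label": "positive"}, {"label": "positive"},
           {"label": "negative"}, {"label": "positive"}]
realtime = Counter({"positive": 3, "negative": 2})

batch = batch_recount(records)
drift = reconcile(realtime, batch)
print(drift)
```

The live counters give low latency, while the periodic recount over the stored data acts as the source of truth.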
The benefits offered by the Impetus solution
Here are some of the main advantages that our solution offers vis-à-vis its
contemporaries.
The Impetus solution enables customers to handpick statistics on demand to gain
market insights and react quickly to trends. This is made possible by using
Hadoop to process the HBase data on HDFS, moving it from batch to near real-time.
This approach brings it 80 percent closer to near real-time analytics. The
remaining 20 percent will take extra effort in the form of In-Memory solutions
like MemBase, GigaSpaces, or Memcached.
Apart from a higher degree of accuracy, our solution also helps identify
Influencers.
Say, after a campaign or a product launch, an enterprise wants to know
who is talking about its offerings, where and how. If the sentiment is
largely negative, the enterprise needs to neutralize the mentions that
may hurt its brand the most.
With our solution, the company can drill down into negative mentions,
identify the content coming from the most influential people in the
industry, understand how far each tweet traveled, and how many
people were impacted by this content.
Including Influencer Analytics alongside sentiment measurement is
becoming a standard of the social media monitoring industry. It also
enables reputation management, thereby taking the influencer concept
a step further.
As far as sentiment algorithms are concerned, part of a successful
prioritization process is to identify the intensity of each mention. “I
really hate product X and will never buy it” is quite different from
“Product X is running a little slow today.”
This ability to cross-reference intensity, influence trajectory, velocity and
sentiment of each social media mention drives us towards a reliable priority
system.
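A priority system of this kind could be sketched as a weighted score. The paper does not give a formula, so the `Mention` fields, the 0-to-1 scales, the weights, and the sample values below are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    text: str
    sentiment: float  # -1.0 (very negative) to +1.0 (very positive)
    intensity: float  # 0.0 to 1.0
    influence: float  # 0.0 to 1.0, e.g. normalized reach of the author
    velocity: float   # 0.0 to 1.0, e.g. normalized re-shares per hour


def priority(mention, weights=(0.4, 0.4, 0.2)):
    """Score how urgently a negative mention needs a response."""
    if mention.sentiment >= 0:
        return 0.0  # only negative mentions are prioritized in this sketch
    w_int, w_inf, w_vel = weights
    blend = (w_int * mention.intensity
             + w_inf * mention.influence
             + w_vel * mention.velocity)
    return -mention.sentiment * blend


# Hypothetical scores for the two example mentions from the text.
mentions = [
    Mention("Product X is running a little slow today.", -0.3, 0.2, 0.5, 0.5),
    Mention("I really hate product X and will never buy it", -0.9, 0.9, 0.5, 0.5),
]

ranked = sorted(mentions, key=priority, reverse=True)
for m in ranked:
    print(round(priority(m), 3), m.text)
```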
Conclusion
In conclusion, a Sentiment Analytics solution is used to gauge
the performance of a campaign, product, or a brand and to check how well it
has been received in the marketplace. Sentiment Analytics is conventionally
handled by a couple of popular techniques, including NLP and Artificial
Intelligence.
The neutral nature of social media mentions, Target Overlook Actual Verbatim,
and Sentiment Override challenges can be handled by building a Machine
Learning model based on n-gram and the Bayes filter classification.
A Machine Learning system has a certain level of knowledge and is associated
with a corresponding knowledge-management organization which enables it to
interpret, analyze, and test the knowledge acquired.
Apart from a higher degree of accuracy, it also helps identify influencers.