NLP Ecosystem

A for Analytics
CONTINUING EDUCATION PROGRAMME
DMS IIT-DELHI
3/24/2013-6/23/2013
Harshad B. Madhamshettiwar
Paper submitted in the partial fulfillment of the
requirements for the Certificate of
Business Analytics and Optimization

Objective:
This paper explains why text analytics matters to any kind of business, how sentiment analysis is used to
make better decisions, and the science behind it.
Background:
Times have changed. Before the World Wide Web, an individual’s opinions were shared only with family and
friends; today people are far more active in sharing them. The Web has dramatically changed the way
people express their views and opinions, which can now influence the decisions of thousands or millions of
people, directly or indirectly.
When gut-based decisions no longer serve the business, it is time for a makeover: data-driven decision
making and fact-based goals, supported by statistical science and analytical tools.
One analytics method that helps make better decisions is text analytics. Nowadays business profits and
strategies depend on customer feedback and demand. Companies are focusing more on gathering documented
feedback from sources such as surveys, social networking sites and blogs. This generates a huge amount of
text data, including answers to questionnaires: “undoubtedly the most unstructured or semi-structured
data.”
Once this data has been churned and cleaned, it becomes a source of critical information with many
opportunities hidden in it, in the form of customer behavior and sentiment: what customers think and feel
about your services and products, how accountable and reliable they judge the company to be, and whether
they are promoting you with good or bad reviews.
Context:
In many cases, opinions are hidden in long forum posts and blogs.
It is difficult for a human reader to find relevant sources, extract related sentences with opinions, read
them, summarize them, and organize them into usable forms. Thus, automated opinion discovery and
summarization systems are needed. Sentiment analysis, also known as opinion mining, grows out of this
need. It is a challenging natural language processing or text mining problem. [1]
This paper focuses on why companies use sentiment analysis so widely and how it works.
Literature Review:
Text analytics reveals insights from electronic text materials, associates them so they go to the right
person and place, and provides intelligence to know what you need to do next – whether it is answering
complex search-and-retrieval questions, presenting relevant content to internal or external Web users, or
predicting which phrase will best affect sentiments.
Sentiment analysis automatically locates and extracts sentiment from online materials, such as social
networking sites, comments and blogs on the Internet, as well as internal electronic documents.
Text analytics brings together multiple approaches: [1]
• Text mining involves techniques from several areas, including the fields of computational linguistics
and information retrieval, to structure text into a numeric representation for use in traditional data mining
and predictive analysis.

• Natural language processing – a discipline from the field of artificial intelligence – combines computer
science and linguistics to identify meaningful concepts, attributes and opinions in the spoken or written
word.
• Best of both worlds (hybrid approach)
Data mining approach:
A data mining approach to sentiment analysis translates an unstructured text problem to one that makes
predictions on structured, quantitative data. The approach borrows several techniques from computational
linguistics and information retrieval communities to represent the text numerically, and then applies
traditional data mining techniques to this numeric representation. In the end, a target variable is identified
and a pattern is discovered from the training data for predicting sentiment polarity. This pattern can then
be used to predict new observations.
The first step in creating the numeric representation is to convert the entire training collection into a
document-by-term frequency matrix. Each document is parsed into individual terms, or term/part-of-
speech pairs. Then the set of all terms becomes the variables on the data set so that documents are now
represented as vectors of length equal to the number of distinct terms in the collection. These vectors are
very sparse, containing mostly zeroes – because any one document contains a very small percentage of
the terms in the collection. Once the documents are represented as vectors, the frequencies in each cell
can be weighted with a function that takes into account the distribution of the term across the collection
and relative to the levels of the target variable.
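The first step can be sketched in a few lines of Python. This is a minimal illustration, not the method used in [1]: the whitespace tokenizer, the three sample documents and the function names are assumptions made for the example, and the weighting shown is plain inverse document frequency, which reflects only the distribution of a term across the collection and not its relation to the target variable.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a real parser would also handle
    punctuation and part-of-speech tags)."""
    return text.lower().split()

def document_term_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] is the count of term j in document i."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix

def idf_weights(matrix):
    """One weight per term: log(N / number of documents containing the term)."""
    n_docs = len(matrix)
    weights = []
    for j in range(len(matrix[0])):
        doc_freq = sum(1 for row in matrix if row[j] > 0)
        weights.append(math.log(n_docs / doc_freq))
    return weights

def apply_weights(matrix, weights):
    """Multiply every frequency by the weight of its term."""
    return [[count * w for count, w in zip(row, weights)] for row in matrix]

# Made-up sample collection, for illustration only.
documents = [
    "great phone great battery",
    "poor battery",
    "great service",
]
vocabulary, matrix = document_term_matrix(documents)
weighted = apply_weights(matrix, idf_weights(matrix))
```

Even in this tiny collection most cells are zero, which is the sparsity described above; with thousands of documents a sparse data structure would normally replace the nested lists.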
After these document vectors are formed, a dimension reduction technique – such as the singular value
decomposition (see Taming Text with the SVD, Albright, 2004) – is typically used to represent each
document in a reduced-dimensional space of maybe 50 to 100 variables, where each variable is a linear
combination of the weighted terms that originally represented each document.
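In matrix terms, the reduction can be written as follows. The symbols A, U, Σ, V, n, m, k and d are notation introduced here for illustration and are not taken from the cited reference; documents are assumed to be the rows of A.

```latex
\begin{align*}
A &\in \mathbb{R}^{n \times m} && \text{($n$ documents, $m$ weighted terms)} \\
A &= U \Sigma V^{\top} && \text{(singular value decomposition)} \\
A &\approx A_k = U_k \Sigma_k V_k^{\top}, \quad k \ll m && \text{(keep the $k$ largest singular values)} \\
\hat{d} &= d\, V_k \in \mathbb{R}^{1 \times k} && \text{(reduced vector for a document row $d$)} \\
\hat{d}_j &= \sum_{t=1}^{m} d_t \, (V_k)_{tj} && \text{(each new variable is a linear combination of terms)}
\end{align*}
```

For the training documents themselves, the reduced vectors are the rows of A V_k, which equals U_k Σ_k.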
Finally, these reduced-dimensional vectors, together with the sentiment variable, can be supplied to a
predictive model. The model will attempt to learn from the training data by utilizing patterns in the
reduced-dimensional vector. This predictive model will then create a function that will predict the
sentiment for any document.
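As a rough sketch of this last step, the example below trains a nearest-centroid classifier on reduced-dimensional vectors. The classifier is a deliberately simple stand-in chosen for readability, not the model used in [1], and the two-dimensional training vectors and their labels are invented for the illustration.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def train(vectors, labels):
    """Learn one centroid per sentiment label from the training vectors."""
    grouped = {}
    for vector, label in zip(vectors, labels):
        grouped.setdefault(label, []).append(vector)
    return {label: centroid(members) for label, members in grouped.items()}

def predict(model, vector):
    """Return the label whose centroid is closest (Euclidean distance) to the vector."""
    return min(model, key=lambda label: math.dist(model[label], vector))

# Invented reduced-dimensional document vectors with their sentiment ratings.
train_vectors = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.2, 0.7]]
train_labels = ["positive", "positive", "negative", "negative"]

model = train(train_vectors, train_labels)
prediction = predict(model, [0.7, 0.2])  # a new, unseen document vector
```

The learned `model` plays the role of the function described above: once trained, it assigns a sentiment to any document that has been mapped into the same reduced space.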
Benefits of the data mining approach
The data mining approach is appealing because it is based on learning patterns that are useful for making
automated, efficient predictions. The algorithms are capable of discovering unimagined and complicated
patterns that would be beyond what a human could anticipate. Frequently, a data mining approach can
beat a rule-based approach in topic classification. Of course, this is dependent on having enough training
data to build the model.
Drawback of the data mining approach
The vector-based representation of a document, which is required for data mining techniques, does not
maintain information that is potentially important to sentiment classification. For example, the vector
representation does not capture when terms are close to one another in the document, if one term precedes
another or any other contextual cues. The order of terms in a phrase can significantly affect meaning.
Consider the phrases:
“… night for a great movie”
and
“… great night for a movie”
These two phrases convey two different meanings; yet in a vector representation, the phrases have an
identical representation.
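The loss of word order is easy to check. In the short sketch below, the helper name `bag_of_words` is an assumption made for the example; it counts terms the same way a term-frequency vector would.

```python
from collections import Counter

def bag_of_words(phrase):
    """Count each term, ignoring the order in which terms appear."""
    return Counter(phrase.lower().split())

first = "night for a great movie"
second = "great night for a movie"

# The term counts match even though the word order differs.
same_vector = bag_of_words(first) == bag_of_words(second)
same_order = first.split() == second.split()
```

Here `same_vector` is true while `same_order` is false, so any model that sees only the term counts cannot tell the two phrases apart.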
In addition, most predictive models provide little feedback to the user as to precisely why a particular
document was classified as having positive or negative polarity. So when you attempt to understand what
positive things people said in a particular document, you frequently have to read the entire document to
discover the answer.

As a final drawback, forming the training and validation data is an essential component of learning a
predictive model, but it can be very time-consuming and challenging. A rating needs to be provided for
every document, and if there are attributes of documents that you wish to use to measure sentiment, you
will need to provide a rating for each of these as well. Another complication is that two different
reviewers frequently assign two different sentiment ratings to the same document. This can introduce
unexpected errors in building and measuring the performance of your model.
Natural language processing approach:
Natural language processing (NLP) is a field of artificial intelligence that deals with automatically
extracting meaning from natural language text. As discussed in the introduction of this paper, it’s very
challenging to get machines to understand text at the same levels as humans. Doing this with the specific
goal of extracting sentiment is even more challenging.
Natural language processing (NLP) combines computer science and linguistics to identify meaningful
concepts and attributes in the spoken or written word. In the context of text analytics, this analysis most
often applies to electronic documents.
The rule-based NLP methods use certain entities and syntactic patterns in the text to understand its
meaning.
Figure 1 below shows the steps involved in carrying out sentiment analysis with NLP. [3][5]
[Figure 1, recovered labels. Title nodes: Text analytics; Sentiment analysis (NLP). Steps:]
• Defining problems of sentiment analysis
• Sentiment and subjectivity classification: document-level sentiment classification; sentence-level
subjectivity and sentiment classification; opinion lexicon generation
• Feature-based sentiment analysis: feature extraction; opinion orientation identification
• Opinion search and retrieval
• Opinion spam and utility of opinions: opinion spam; utility of reviews
• Sentiment analysis of comparative sentences: problem definition; identification of comparative
sentences; extraction of objects and object features in comparative sentences; identification of preferred
objects in comparative sentences
Figure 1: Sentiment analysis by NLP approach.
Benefits of the NLP approach
The major advantage of rule-based methods is the amount of control they give rule developers over how
the analysis will be performed. Developers can use their knowledge of the domain and the language
within it to develop rules that have high precision.
Unlike statistical analysis, the results of rule-based analysis are easily interpretable. This is very important
for real-life applications where the analysts need to know exactly why a document or an attribute within a
document was tagged as positive or negative. In other words, analysts need to know exactly what
sentences, keywords or context within the document triggered the positive or negative sentiment.
Figure 2 shows an example of this. [6]
Phrases are marked in the original text based on their sentiment score as Negative, Neutral or Positive.
The document sentiment is: +0.202
Summary
A beginner in analytics is like a child learning Alphabets for first time; it seems to be very complex in first go but then practice makes
man perfect….slowly... For us; analytics is same, its just waiting for us to learn more and keep learning and then it will become
a part of us…slowly child will become an expert...
Entities
No entities could be found.
Themes
Theme                 Evidence   Sentiment
learning alphabets    4          +0.20
u2026slowly child     4          +0.20
beginning child       4          +0.20
Topics
Topic        Score
Education    0.72
Figure 2: Example showing different entities that were used for rule-based analysis.
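The interpretability described above can be sketched with a small lexicon-based scorer that records which term in which sentence triggered each score. The lexicon, the negation rule and every name in the example are hypothetical; this is not the rule set of the demo in [6], and the tokenizer ignores punctuation other than full stops.

```python
# Hypothetical mini-lexicon: term -> polarity score in [-1, 1].
LEXICON = {"great": 0.6, "reliable": 0.5, "slow": -0.4, "terrible": -0.8}
# Hypothetical rule: a negation word directly before a term flips its polarity.
NEGATIONS = {"not", "never", "no"}

def score_document(text):
    """Return (document score, evidence); evidence lists every rule hit."""
    evidence = []
    for sentence in text.split("."):
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token not in LEXICON:
                continue
            score = LEXICON[token]
            negated = i > 0 and tokens[i - 1] in NEGATIONS
            if negated:
                score = -score
            evidence.append({
                "sentence": sentence.strip(),
                "term": token,
                "negated": negated,
                "score": score,
            })
    # Document score is the mean of the hit scores, or 0.0 when nothing matched.
    total = sum(hit["score"] for hit in evidence)
    document_score = total / len(evidence) if evidence else 0.0
    return document_score, evidence

document_score, evidence = score_document(
    "The support team was great. Delivery was not reliable."
)
```

Each entry in `evidence` names the sentence and term behind a score, so an analyst can see why the document was rated the way it was without rereading the whole text.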
Rule-based methods are completely unsupervised; that is, they do not require any training data. This is a
big advantage in real-life applications where training data is scarce. The non-availability of training data
is more pronounced when it comes to granular sentiment analysis (sentiment derived at the objects and
attributes level).
Another advantage of rule-based methods is that the rules can be refined over time based on the
feedback from analysts or subject-matter experts. The more time the rule developer spends on refining the
rules, the better the results. Language evolves over time and people start using newer terms to express
their sentiments. This is especially true for social media, where the language used changes all the time. In
such cases, rule-based methods give you the flexibility needed to adjust your models accordingly.
Drawback of the NLP approach
The disadvantage of rule-based methods is that they require a lot of human involvement in developing the
rules. These methods completely rely on the domain knowledge of rule developers. It might take a few
weeks to come up with a strong rule-based model for a new domain. However, once you have a strong
rule-based model for a domain, you can reuse that model with some minor modifications for different
applications within the domain.
The importance of validation data is often underestimated while developing these models. The rules being
written must be generic enough so that they are capable of handling all possible cases. Inexperienced rule
developers tend to over-fit their rules to the sample data they are working with. Such rules might not work
well when tested on different data sets. So, rule developers must make sure they validate the rules on
different data sets before considering a model ready to deploy.
Discussion:
We now know how sentiment analytics works effectively across a wide range of industries.
Text analytics can be approached from two different directions:
• Discovery-driven. When you don’t know where to start, a discovery-driven approach helps identify key
patterns and attributes in the unstructured data at hand. This exploration reveals new insights, which are
then used to define the structure, such as the categories and concepts you will use.
• Domain-driven. If there is already an understanding of the data or some domain knowledge regarding
which terms and phrases are meaningful, you can start with this knowledge and find where it exists in the
materials.
Both approaches are valid, and more importantly, they complement each other. “Discovery of concepts
can be used to define a structure or taxonomy for the data. On the other hand, content that doesn’t fit into
a predefined structure can be further explored using discovery to find previously unknown information.”
Organizations in a variety of industries – from the public and private sector, from manufacturing to
finance to health care – are using these approaches in inventive ways.
Figure 3: Industries adopting text and sentiment analytics [2]
[Figure 3, recovered labels: Text and Sentiment analysis; Government and Research; Health and Life
Sciences; Finance; Media and Publishing; Film Entertainment Industry; E-Business]
All these industries are using sentiment analytics because reviews have an economic impact.
Economic impact of Reviews [4]
As mentioned, many readers of online reviews say that these reviews significantly influence their
purchasing decisions. However, while these readers may have believed that they were “significantly
influenced”, perception and reality can differ. A key reason to understand the real economic impact of
reviews is that the results of such an analysis have important implications for how much effort companies
might or should want to expend on online reputation monitoring and management.
Given the rise of online commerce, it is not surprising that a body of work centered within the economics
and marketing literature studies the question of whether the polarity (often referred to as “valence”)
and/or volume of reviews available online have a measurable, significant influence on actual consumer
purchasing.
One way to acquire a good reputation is, of course, by receiving many positive reviews of oneself as a
merchant; another is for the products one offers to receive many positive reviews. For the purposes of our
discussion, we regard experiments wherein the buying is hypothetical as being out of scope; instead, we
focus on economic analyses of the behavior of people engaged in real shopping and spending real money.
The general form that most studies take is to use some form of hedonic regression to analyze the value
and the significance of different item features to some function, such as a measure of utility to the
customer, using previously recorded data. Specific economic functions that have been examined include
revenue (box-office take, sales rank on Amazon, etc.), revenue growth, stock trading volume, and
measures that auction-sites like eBay make available, such as bid price or probability of a bid or sale
being made.
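A generic regression of this kind can be written as follows. The notation is introduced here for illustration and is not taken from any specific study surveyed in [4]: y_i is the economic outcome recorded for item i (for example revenue or sales rank), x_ij are the item features (for example average review polarity and review volume), and the coefficients β_j carry the estimated value of each feature.

```latex
\begin{align*}
y_i &= \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i, \qquad i = 1, \dots, n \\
H_0 &: \beta_j = 0 \quad \text{(feature $j$ has no effect on the outcome)}
\end{align*}
```

A feature is reported as significant when the null hypothesis for its coefficient is rejected at the chosen significance level.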
It is important to note that some conclusions drawn from one domain often do not carry over to another;
for instance, reviews seem to be influential for big-ticket items but less so for cheaper items. But there are
also conflicting findings within the same domain. Moreover, different subsegments of the consumer
population may react differently: for example, people who are more highly motivated to purchase may
take ratings more seriously. Additionally, in some studies, positive ratings have an effect but negative
ones don’t, and in other studies the opposite effect is seen; the timing of such feedback and various
characteristics of the merchant or of the feedback itself (e.g., volume) may also be a factor.
Nonetheless, to gloss over many details for the sake of brevity: if one allows any effect — including
correlation even if said correlation is shown to be not predictive — that passes a statistical significance
test at the .05 level to be classed as “significant”, then many studies find that review polarity has a
significant economic effect.
Conclusion:
Independently, both the domain knowledge and the data mining approaches to sentiment analysis have
their strengths and weaknesses; but hopefully you will not be forced to choose between using one or the
other for your analysis. In this paper, we have shown that the two approaches complement one another.
So, while the NLP approach leverages the rule builder’s domain knowledge, text mining can also be used
by that person to improve, clarify or correct how that knowledge relates to the particular collection being
analyzed.
References:

[1] White paper: Combining Knowledge and Data Mining to Understand Sentiment – A Practical
Assessment of Approaches (www.sas.com/offices)
[2] Text Analytics 101: Improve Decision-Making by Incorporating Unstructured Data – Words and
Images – into Analytic Processes. Insights from a webinar in the SAS Applying Business Analytics Series,
originally broadcast in April 2010.
[3] Bing Liu, Sentiment Analysis and Subjectivity. Department of Computer Science, University of
Illinois at Chicago.
[4] Bo Pang and Lillian Lee, Opinion mining and sentiment analysis. Yahoo! Research, 701 First Ave.,
Sunnyvale, CA 94089, U.S.A. (bopang@yahoo-inc.com); Computer Science Department, Cornell University,
Ithaca, NY 14853, U.S.A. (llee@cs.cornell.edu)
[5] How sentiment analysis works in machines (an introduction), www.slideshare.net
[6] Web Demo Lexalytics.htm
