MODULE 4 : Text Analytics
CSC601.4 Analyze Text data and gain insights.
CONTENTS
● Text Mining
○ History of text mining
○ Roots of text mining
○ Overview of seven practices of text analytics
○ Applications and use cases for Text mining:
■ Extracting meaning from unstructured text
■ Summarizing Text.
● Text Analysis
○ Text Analysis Steps
○ A Text Analysis Example
○ Collecting Raw Text
○ Representing Text
○ Term Frequency—Inverse Document Frequency (TFIDF)
○ Categorizing Documents by Topics
○ Determining Sentiments
○ Gaining Insights
Text Mining
● Text mining is the process of evaluating large amounts of textual data to produce meaningful information and to convert unstructured text data into structured text data for further analysis and visualization.
● Text mining helps to identify previously unnoticed facts, relationships, and assertions in textual big data.
● In Python, the process of text mining commonly relies on two basic libraries: textblob and wordcloud.
Text Data
● Before doing the text mining, we need to understand the text data, for example by determining the number of words in a document.
● We first need to load data from different sources, including text files (.txt), PDFs (.pdf), CSV files (.csv), etc.
Example Data Sources and Formats for Text Analysis
Text Pre-Processing
● Text pre-processing is an important phase before applying any algorithms to text data.
● Data cleaning means removing noise such as punctuation, extra spaces, etc.
● The objective of this step is to clean the data so that independent terms can be created from the data file for further analysis.
● After the textual data has been loaded into the environment, it needs to be cleaned by adopting different measures, such as transforming the text to lowercase and removing specific characters like URLs, non-English words, punctuation, and whitespace, as in the sketch below.
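As a minimal sketch, these cleaning steps can be combined in Python (the function name and sample string are illustrative, not from the slides):

import re
import string

def clean_text(text):
    text = text.lower()                                  # transform to lowercase
    text = re.sub(r"http\S+|www\.\S+", "", text)         # remove URLs
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = re.sub(r"\s+", " ", text).strip()             # collapse extra whitespace
    return text

print(clean_text("Visit https://example.com   NOW!!!"))  # -> "visit now"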
Shallow Parsing
● Tokenization is the process of breaking a text paragraph down into smaller chunks such as words or sentences.
● A token is a single entity that serves as a building block of a sentence or paragraph.
● A sentence tokenizer breaks a text paragraph into sentences, while a word tokenizer breaks it into words (see the sketch after this list).
● The process of classifying words into their parts of speech and labelling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories.
● The collection of tags used for a particular task is known as a tagset.
● The emphasis in this section is on exploiting tags and tagging text automatically.
● A part-of-speech tagger, or POS-tagger, processes a sequence of words and attaches a part-of-speech tag to each word.
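A minimal sketch of sentence and word tokenization with NLTK (the sample text is illustrative):

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
# requires: nltk.download('punkt')
text = "Text mining converts unstructured text into structured data. It reveals hidden facts."
print(sent_tokenize(text))   # breaks the paragraph into sentences
print(word_tokenize(text))   # breaks the paragraph into word tokens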
Stop words
● Text may contain stop words such as is, am, are, this, a, an, the, etc.
● These stop words are considered noise in the text and hence should be removed.
● Before analyzing the text data, we should filter the list of tokens to remove these stop words, as in the sketch below.
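A minimal sketch of stop word removal with NLTK (the sample sentence is illustrative):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# requires: nltk.download('stopwords') and nltk.download('punkt')
tokens = word_tokenize("This is an example of a sentence with stop words")
stop_words = set(stopwords.words("english"))
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)   # ['example', 'sentence', 'stop', 'words']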
Stemming and Lemmatizing
● Stemming and lemmatization address another type of noise in the text: they reduce derivationally related forms of a word to a common root word.
● Stemming is the process of gathering words of similar origin into one word. Stemming helps us increase accuracy in our mined text by removing suffixes and reducing words to their basic forms. For example, words like detection, detected, and detecting are reduced to the common word "detect".
● Lemmatization is usually more sophisticated than stemming and also reduces words to their base form. But a lemmatizer, unlike a stemmer, works on an individual word with knowledge of the context. For example, the word "better" has "good" as its lemma, but this is missed by stemming because it requires a dictionary look-up (see the sketch below).
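A minimal sketch with NLTK's WordNet lemmatizer (the pos argument supplies the context that a stemmer lacks):

from nltk.stem import WordNetLemmatizer
# requires: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better", pos="a"))    # good  (adjective look-up)
print(lemmatizer.lemmatize("detected", pos="v"))  # detect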
Word Cloud
● For creating a visual impact, a word cloud can be built from the different words.
● The word cloud is created with the wordcloud library; in the word cloud, the size of each word depends on its frequency, as in the sketch below.
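A minimal sketch with the wordcloud library (the sample text is illustrative):

from wordcloud import WordCloud
import matplotlib.pyplot as plt

text = "data text mining text analytics data insights data"   # illustrative
wc = WordCloud(width=400, height=200, background_color="white").generate(text)
plt.imshow(wc, interpolation="bilinear")   # more frequent words appear larger
plt.axis("off")
plt.show()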
Sentiment Analysis
● Sentiment analysis is also popularly known as opinion analysis or opinion mining. The key idea is to use techniques from text analytics, NLP, machine learning, and linguistics to extract important information or data points from unstructured text.
● Sentiment analysis builds on natural language processing, the field concerned with the interaction between computers and humans in natural language. It provides a way to understand the attitudes and opinions expressed in texts.
● Sentiment polarity is typically a numeric score assigned to the positive and negative aspects of a text document based on subjective signals such as specific words and phrases expressing feelings and emotion. Neutral sentiment typically has polarity 0, since it does not express any specific sentiment; positive sentiment has polarity > 0 and negative sentiment polarity < 0, as the sketch below illustrates.
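A minimal sketch with TextBlob (the exact scores depend on TextBlob's lexicon):

from textblob import TextBlob
print(TextBlob("I love this phone, it is fantastic.").sentiment.polarity)  # > 0: positive
print(TextBlob("The battery life is terrible.").sentiment.polarity)        # < 0: negative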
Applications of Natural Language Processing
● With the advent of new technologies, there has been massive growth in the availability of text data. There are therefore many applications of natural language processing that can contribute to an organization's success in a significant way, for example understanding customer behavior through Twitter data, developing recommendation systems, and cluster analysis of customer data on the basis of reviews. This section focuses on different applications of natural language processing.
Analyzing Twitter Data
Twitter is a social networking site where people communicate in short messages called tweets. Tweeting basically means posting short messages to the people who follow you on Twitter, with the intention that the messages might be helpful to them in taking a decision.
Document Similarity
Document similarity is a powerful technique used to recommend products/services, videos, movies, etc. Examples of document similarity include e-commerce websites recommending products, Amazon Prime and Netflix recommending movies/shows, and YouTube recommending videos. Recommendations for a product/service can be made according to pre-defined criteria such as number of buyers, budget, rating, popularity, manufacturer, description, etc.
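A minimal sketch of document similarity using TFIDF vectors and cosine similarity (the product descriptions are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

descriptions = ["action movie with fast car chases",
                "thriller movie with a car chase scene",
                "romantic comedy about a wedding"]   # illustrative
X = TfidfVectorizer().fit_transform(descriptions)    # feature extraction
print(cosine_similarity(X)[0])   # doc 0 is closer to doc 1 than to doc 2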
Cluster Analysis
Cluster analysis can be done on text data after feature extraction has been performed with a vectorizer. Clusters of different movies can then be formed on the basis of the information stored in the TFIDF matrix produced during feature extraction.
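A minimal sketch clustering movie descriptions via their TFIDF matrix with k-means (the data and cluster count are illustrative choices):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

movies = ["alien space battle in deep space",
          "space war between alien ships",
          "romantic love story in paris",
          "a tragic love story"]   # illustrative
X = TfidfVectorizer().fit_transform(movies)          # feature extraction
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)   # e.g. [0 0 1 1]: similar movies share a cluster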
Text Analysis Steps
A text analysis problem usually consists of three important steps: parsing, search and retrieval, and text mining.
A text analysis problem may also involve other subtasks such as discourse analysis and segmentation.
Parsing is the process that takes unstructured text and imposes a structure for further analysis. The
unstructured text could be a plain text file, a weblog, an Extensible Markup Language (XML) file, a HyperText
Markup Language (HTML) file, or a Word document. Parsing deconstructs the provided text and renders
it in a more structured way for the subsequent steps.
Search and retrieval is the identification of the documents in a corpus that contain search items such as specific words, phrases, topics, or entities like people or organizations. These search items are generally called key terms. Search and retrieval originated in the field of library science and is now used extensively by web search engines.
Text mining uses the terms and indexes produced by the prior two steps to
discover meaningful insights pertaining to domains or problems of interest.
Part-of-Speech (POS) Tagging, Lemmatization, and Stemming
The goal of POS tagging is to build a model whose input is a sentence, such as:
he saw a fox
and whose output is a tag sequence. Each tag marks the POS of the corresponding word, such as:
PRP VBD DT NN
according to the Penn Treebank POS tags. Therefore, the four words are mapped to pronoun (personal), verb (past tense), determiner, and noun (singular), respectively.
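A minimal sketch with NLTK's default tagger, which uses the Penn Treebank tagset:

import nltk
# requires: nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
print(nltk.pos_tag(nltk.word_tokenize("he saw a fox")))
# [('he', 'PRP'), ('saw', 'VBD'), ('a', 'DT'), ('fox', 'NN')]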
Both lemmatization and stemming are techniques to reduce the number of dimensions and to reduce inflections or variant forms to the base form, in order to more accurately measure the number of times each word appears. With the use of a given dictionary, lemmatization finds the correct dictionary base form of a word.
For example, given the sentence:
obesity causes many problems
the output of lemmatization would be:
obesity cause many problem
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
Stemming
Different from lemmatization, stemming does not need a dictionary; it usually refers to a crude process of stripping affixes based on a set of heuristics, in the hope of correctly reducing inflections or variant forms.
After the process, words are stripped to become stems. A stem is not necessarily
an actual word defined in the natural language, but it is sufficient to differentiate
itself from the stems of other words. A well-known rule-based stemming algorithm is
Porter's stemming algorithm. It defines a set of production rules to iteratively
transform words into their stems. For the sentence shown previously:
obesity causes many problems
the output of Porter's stemming algorithm is:
obes caus mani problem
http://www.infogistics.com/posdemo.htm
https://www.link.cs.cmu.edu/cgi-bin/link/construct-page-4.cgi#submit
https://textanalysisonline.com/nltk-pos-tagging
import nltk
# requires: nltk.download('punkt')
a = "Sample Text"
words = nltk.tokenize.word_tokenize(a)   # tokenize the text into words
fd = nltk.FreqDist(words)                # count how often each word occurs
fd.plot()                                # plot the frequency distribution
Explanation of code:
1. Import the nltk module.
2. Write the text whose word distribution you need to find.
3. Tokenize each word in the text; the tokens serve as input to the FreqDist module of nltk.
4. Apply each word to nltk.FreqDist in the form of a list.
5. Plot the words on a graph using plot().
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

sentence = "Hello, You have to build a very good site and I love visiting your site."
words = word_tokenize(sentence)   # break the sentence into word tokens
ps = PorterStemmer()              # create a Porter stemmer object
for w in words:
    rootWord = ps.stem(w)         # reduce each token to its stem
    print(rootWord)
hello
,
you
have
to
build
a
veri
good
site
and
i
love
visit
your
site
.
● The PorterStemmer package is imported from the stem module.
● Packages for tokenization of sentences as well as words are imported.
● A sentence is written which is to be tokenized in the next step.
● Word tokenization is implemented in this step.
● An object of PorterStemmer is created.
● A loop is run, and stemming of each word is done using the object created in the previous step.
http://text-processing.com/demo/stem/
Text Analysis Example
Consider the fictitious company ACME, maker of two products: bPhone and bEbook. ACME is in
strong competition with other companies that manufacture and sell similar products. To succeed,
ACME needs to produce excellent phones and eBook readers and increase sales. One of the ways
the company does this is to monitor what is being said about ACME products in social media. In other
words, what is the buzz on its products? ACME wants to search all that is said about ACME products
in social media sites, such as Twitter and Facebook, and popular review sites, such as Amazon and
ConsumerReports. It wants to answer questions such as these.
• Are people mentioning its products?
• What is being said? Are the products seen as good or bad? If people think an ACME product is bad,
why?
For example, are they complaining about the battery life of the bPhone, or the response time
in their bEbook?
ACME wants to monitor the social media buzz using a simple process based on the text analysis steps outlined earlier.
Text Analysis Process
1. Collect raw text - This corresponds to Phase 1 and Phase 2 of the Data
Analytic Lifecycle.
2. Represent text - Convert each review into a suitable document representation
with proper indices, and build a corpus based on these indexed reviews. This
step corresponds to Phases 2 and 3 of the Data Analytic Lifecycle.
3. Compute the usefulness of each word in the reviews using methods such as TFIDF. This and the following two steps correspond to Phases 3 through 5 of the Data Analytic Lifecycle.
4. Categorize documents by topics. This can be achieved through topic models
(such as latent Dirichlet allocation).
5. Determine sentiments of the reviews. Identify whether the reviews are positive or negative.
Many product review sites provide ratings of a product with each review. If such information is
not available, techniques like sentiment analysis can be used on the textual data to infer the
underlying sentiments. People can express many emotions. To keep the process simple,
ACME considers sentiments as positive, neutral, or negative.
6. Review the results and gain greater insights - This step corresponds to Phases 5 and 6 of the
Data Analytic Lifecycle. Marketing gathers the results from the previous steps. Find out
what exactly makes people love or hate a product. Use one or more visualization techniques
to report the findings. Test the soundness of the conclusions and operationalize the findings if
applicable.
Collecting Raw Text
The Data Science team starts by actively monitoring various websites for user-generated content. The content being collected could be related articles from news portals and blogs, comments on ACME's products from online shops or review sites, or social media posts that contain the keywords bPhone or bEbook. Regardless of where the data comes from, it is likely that the team will deal with semi-structured data such as HTML web pages, Really Simple Syndication (RSS) feeds, XML, or JavaScript Object Notation (JSON) files. Enough structure needs to be imposed to find the part of the raw text that the team really cares about. In the brand management example, ACME is interested in what the reviews say about bPhone or bEbook and when the reviews are posted. Therefore, the team will actively collect such information.
Many websites and services offer public APIs for third-party developers to
access their data.
For example, the Twitter API allows developers to choose from the Streaming
API or the REST API to retrieve public Twitter posts that contain the keywords
bPhone or bEbook.
Developers can also read tweets in real time from a specific user or tweets
posted near a specific venue. The fetched tweets are in the JSON format.
Many news portals and blogs provide data feeds that are in an open standard
format, such as RSS or XML.
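As a sketch, such a data feed can be fetched and filtered with the third-party feedparser library (the feed URL is a placeholder, and the available fields depend on the feed):

import feedparser

feed = feedparser.parse("https://example.com/reviews.rss")   # placeholder URL
for entry in feed.entries:
    text = entry.get("title", "") + " " + entry.get("summary", "")
    if "bPhone" in text or "bEbook" in text:                 # keep relevant posts
        print(entry.get("published", "n/a"), entry.get("title"))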
Representing Text
In this data representation step, raw text is first transformed with text normalization techniques such as tokenization and case folding.
Then it is represented in a more structured way for analysis.
Tokenization is the task of separating (also called tokenizing) words from the body of text.
Raw text is converted into collections of tokens after the tokenization, where each token is generally a word.
A common approach is tokenizing on spaces, although this can mishandle hyphenated terms such as state-of-the-art.
Representing Text
Another text normalization technique is called case folding, which reduces all letters to
lowercase (or the opposite if applicable).
One needs to be cautious applying case folding to tasks such as information extraction,
sentiment analysis, and machine translation.
If implemented incorrectly, case folding may reduce or change the meaning of the text and
create additional noise.
For example, when General Motors becomes general and motors, the downstream analysis may very likely consider them as separate words rather than the name of a company. When the abbreviation of the World Health Organization (WHO) or the rock band The Who becomes who, they may both be interpreted as the pronoun who.
Representing Text
If case folding must be present, one way to reduce such problems is to create a
lookup table of words not to be case folded.
The team can come up with some heuristics or rules-based strategies for the
case folding.
For example, the program can be taught to ignore words that have uppercase
in the middle of a sentence.
Representing Text
After normalizing the text by tokenization and case folding, it needs to be
represented in a more structured way.
A simple yet widely used approach to represent text is called bag-of-words.
Given a document, bag-of-words represents the document as a set of terms,
ignoring information such as order, context, inferences, and discourse.
Each word is considered a term or token (which is often the smallest unit for the
analysis).
In many cases, bag-of-words additionally assumes every term in the document is
independent.
Representing Text
The document then becomes a vector with one dimension for every distinct
term in the space, and the terms are unordered.
A permutation D* of a document D contains the same words exactly the same number of times, but in a different order.
Therefore, using the bag-of-words representation, document D and its permutation D* would share the same representation.
Representing Text
Bag-of-words takes quite a naïve approach, as order plays an important role in the semantics of text.
With bag-of-words, many texts with different meanings are combined into one form.
For example, the texts
"a dog bites a man"
and
"a man bites a dog"
have very different meanings, but they would share the same representation with bag-of-words, as the sketch below illustrates.
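A minimal sketch with scikit-learn's CountVectorizer (the token pattern is widened so one-letter words such as "a" are kept):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["a dog bites a man", "a man bites a dog"]
cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b")   # keep one-letter tokens
X = cv.fit_transform(docs)
print(cv.get_feature_names_out())   # ['a' 'bites' 'dog' 'man']
print(X.toarray())                  # both rows are [2 1 1 1]: identical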
Representing Text
Using single words as identifiers with the bag-of-words representation, the term
frequency (TF) of each word can be calculated.
Term frequency represents the weight of each term in a document, and it is
proportional to the number of occurrences of the term in that document.
Representing Text
Besides extracting the terms, their morphological features may need to be
included.
The morphological features specify additional information about the terms,
which may include root words, affixes, part-of-speech tags, named entities, or
intonation (variations of spoken pitch).
The features from this step contribute to the downstream analysis in
classification or sentiment analysis.
Representing Text
The set of features that need to be extracted and stored highly depends on the
specific task to be performed.
lf the task is to label and distinguish the part of speech, for example, the features
will include all the words in the text and their corresponding part-of-speech tags.
If the task is to annotate the named entities like names and organizations, the
features highlight such information appearing in the text.
Constructing the features is no trivial task; quite often this is done entirely manually, and sometimes it requires domain expertise.
Representing Text
Sometimes creating features is a text analysis task all to itself.
One such example is topic modeling.
Topic modeling provides a way to quickly analyze large volumes of raw text and identify the
latent topics.
Topic modeling may not require the documents to be labeled or annotated.
It can discover topics directly from an analysis of the raw text. A topic consists of a cluster of
words that frequently occur together and that share the same theme.
Probabilistic topic modeling is a suite of algorithms that aim to parse large archives of documents and to discover and annotate their topics.
Representing Text
It is important not only to create a representation of a document but also to
create a representation of a corpus.
A corpus is a collection of documents.
A corpus could be so large that it includes all the documents in one or more
languages, or it could be smaller or limited to a specific domain, such as
technology, medicine, or law.
For a web search engine, the entire World Wide Web is the relevant corpus.
Most corpora are much smaller; the Brown Corpus, for example, is a one-million-word collection of samples of American English texts.
Representing Text
Many corpora focus on specific domains.
For example, the BioCreative corpora are from biology, the Switchboard corpus contains
telephone conversations, and the European Parliament Proceedings Parallel Corpus was
extracted from the proceedings of the European Parliament in 21 European languages.
Most corpora come with metadata, such as the size of the corpus and the domains from which
the text is extracted.
Some corpora (such as the Brown Corpus) include the information content of every word
appearing in the text.
Representing Text
Information content (IC) is a metric to denote the importance of a term in a corpus.
The conventional way of measuring the IC of a term is to combine the knowledge of its
hierarchical
structure from an ontology with statistics on its actual usage in text derived from a corpus.
Terms with higher IC values are considered more important than terms with lower IC values.
For example, the word necklace generally has a higher IC value than the word jewelry in an
English corpus because jewelry is more general and is likely to appear more often than
necklace.
IC can help measure the semantic similarity of terms; such measures do not require an annotated corpus, and they generally achieve strong correlations with human judgment.
Term Frequency-Inverse Document Frequency (TFIDF)
TFIDF is a measure widely used in information retrieval and text analysis.
Instead of using a traditional corpus as a knowledge base, TFIDF works directly on top of the fetched documents and treats these documents as the "corpus."
TFIDF is robust and efficient on dynamic content, because document changes
require only the update of frequency counts.
Term Frequency-Inverse Document Frequency (TFIDF)
To understand how the term frequency is computed, consider a bag-of-words vector space of 10 words:
i, love, acme, my, bebook, bphone, fantastic, slow, terrible, and terrific.
In the simplest case, the term frequency of a term t in document d is the raw count of t in d: tf(t, d) = f(t, d).
Term Frequency-Inverse Document Frequency (TFIDF)
The logarithm can be applied to word frequencies, whose distribution also contains a long tail: tf(t, d) = log(f(t, d) + 1).
Term Frequency-Inverse Document Frequency (TFIDF)
Because longer documents contain more terms, they tend to have higher term
frequency values.
They also tend to contain more distinct terms.
These factors can conspire to raise the term frequency values of longer
documents and lead to undesirable bias favoring longer documents.
To address this problem, the term frequency can be normalized. For example, the term frequency of term t in document d can be normalized by the number of terms in d: tf(t, d) = f(t, d) / |d|, where |d| denotes the number of terms in document d.
Term Frequency-Inverse Document Frequency (TFIDF)
A term frequency vector can become very high dimensional because the bag-of-words vector space can grow substantially to include all the words in English.
The high dimensionality makes it difficult to store and parse the text and contributes to performance issues related to text analysis.
Term Frequency-Inverse Document Frequency (TFIDF)
For the purpose of reducing dimensionality, not all the words from a given language
need to be included in the term frequency vector. In English, for example, it is
common to remove words such as
the, a, of, and, to, and other articles that are not likely to contribute to semantic
understanding.
These common words are called stop words.
Lists of stop words are available in various languages for automating the identification of stop words. Among them is Snowball's stop word list, which contains stop words in more than ten languages.
Term Frequency-Inverse Document Frequency (TFIDF)
Another simple yet effective way to reduce dimensionality is to store a term and
its frequency only if the term appears at least once in a document.
Any term not existing in the term frequency vector by default will have a
frequency of 0.
Therefore, the previous term frequency vector would be simplified to what is
shown in Table
Term Frequency-Inverse Document Frequency (TFIDF)
Some NLP techniques such as lemmatization and stemming can also reduce high dimensionality.
Lemmatization and stemming are two different techniques that combine various forms of a word.
With these techniques, words such as play, plays, played, and playing can be mapped to the same term.
As shown, the term frequency is based on the raw count of a term occurring in a stand-alone document.
Term frequency by itself suffers from a critical problem: it regards that stand-alone document as the entire world.
The importance of a term is solely based on its presence in this particular document.
Term Frequency-Inverse Document Frequency (TFIDF)
Stop words such as the, and, and a could be inappropriately considered the
most important because they have the highest frequencies in every document.
For example, the top three most frequent words in Shakespeare's Hamlet are all stop words (the, and, and of).
Besides stop words, words that are more general in meaning tend to appear
more often, thus having higher term frequencies.
In an article about consumer telecommunications, the word phone would be
likely to receive a high term frequency.
Term Frequency-Inverse Document Frequency (TFIDF)
As a result, important keywords such as bPhone and bEbook and their related words could appear to be less important.
Consider a search engine that responds to a search query and fetches relevant documents.
Using term frequency alone, the search engine would not properly assess how
relevant each document is in relation to the search query.
Term Frequency-Inverse Document Frequency (TFIDF)
A quick fix for the problem is to introduce an additional variable that has a
broader view of the world considering the importance of a term not only in a
single document but in a collection of documents, or in a corpus.
The additional variable should reduce the effect of the term frequency as the term appears in more documents. That is the intention of the inverse document frequency (IDF).
Term Frequency-Inverse Document Frequency (TFIDF)
The IDF inversely corresponds to the document frequency (DF), which is defined as the number of documents in the corpus that contain a term.
Let a corpus D contain N documents. The document frequency of a term t in corpus D = {d1, d2, ..., dN} is defined as df(t) = the number of documents in D in which t appears.
Term Frequency-Inverse Document Frequency (TFIDF)
The inverse document frequency of a term t is obtained by dividing N by the document frequency of the term and then taking the logarithm of that quotient: idf(t) = log(N / df(t)).
If the term is not in the corpus, this leads to a division by zero. A quick fix is to add 1 to the denominator: idf'(t) = log(N / (df(t) + 1)).
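Combining the two measures, the TFIDF of a term t in document d is the product tfidf(t, d) = tf(t, d) × idf(t). As a minimal sketch, scikit-learn's TfidfVectorizer computes a smoothed variant of this weighting (the example documents are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["i love my bphone",
        "my bphone is fantastic",
        "the bebook is terribly slow"]   # illustrative reviews
vec = TfidfVectorizer()
X = vec.fit_transform(docs)              # rows = documents, columns = terms
print(vec.get_feature_names_out())
print(X.toarray().round(2))              # rarer terms receive higher weights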
Categorizing Documents by Topics
A topic consists of a cluster of words that frequently occur together and share the same theme.
The topics of a document are not as straightforward as they might initially appear. Consider these two
reviews:
1. The bPhone5x has coverage everywhere. It's much less flaky than my old bPhone4G.
2. While I love ACME's bPhone series, I've been quite disappointed by the bEbook. The text is illegible, and it makes even my old NBook look blazingly fast.
Is the first review about the bPhone5x or the bPhone4G? Is the second review about the bPhone, the bEbook, or the NBook?
For machines, these questions can be difficult to answer.
Categorizing Documents by Topics
If a review is talking about the bPhone5x, the term bPhone5x and related terms (such as phone and ACME) are likely to appear frequently.
A document typically consists of multiple themes running through the text in different proportions: for example, 30% on a topic related to phones, 15% on a topic related to appearance, 10% on a topic related to shipping, 5% on a topic related to service, and so on.
Categorizing Documents by Topics
Document grouping can be achieved with clustering methods such as k-means clustering, or classification methods such as support vector machines, k-nearest neighbors, or naive Bayes.
However, a more feasible and prevalent approach is to use topic modeling.
Topic modeling provides tools to automatically organize, search, understand,
and summarize from vast amounts of information.
Categorizing Documents by Topics
Topic models are statistical models that examine words from a set of
documents, determine the themes over the text, and discover how the themes
are associated or change over time.
The process of topic modeling can be simplified to the following.
1. Uncover the hidden topical patterns within a corpus.
2. Annotate documents according to these topics.
3. Use annotations to organize, search, and summarize texts.
Categorizing Documents by Topics
A topic is formally defined as a distribution over a fixed vocabulary of words.
Different topics would have different distributions over the same vocabulary.
A topic can be viewed as a cluster of words with related meanings, and each word
has a corresponding weight inside this topic.
Note that a word from the vocabulary can reside in multiple topics with different
weights.
Topic models do not necessarily require prior knowledge of the texts.
The topics can emerge solely based on analyzing the text.
The simplest topic model is latent Dirichlet allocation (LDA), a generative probabilistic model of a corpus proposed by David M. Blei and two other researchers.
In generative probabilistic modeling, data is treated as the result of a generative
process that includes hidden variables.
LDA assumes that there is a fixed vocabulary of words, and the number of the
latent topics is predefined and remains constant.
LDA assumes that each latent topic follows a Dirichlet distribution over the
vocabulary, and each document is represented as a random mixture of latent
topics.
Consider a figure illustrating the intuitions behind LDA.
The left side of the figure shows four topics built from a corpus, where each topic contains a list
of the most important words from the vocabulary.
The four example topics are related to problem, policy, neural, and report.
For each document, a distribution over the topics is chosen, as shown in the histogram on the
right.
Next, a topic assignment is picked for each word in the document, and the word from the
corresponding topic (colored discs) is chosen.
In reality, only the documents (as shown in the middle of the figure) are available. The goal of
LDA is to infer the underlying topics, topic proportions, and topic assignments for every
document.
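As a minimal sketch (not the exact model from the figure), scikit-learn's LatentDirichletAllocation can discover topics in a toy corpus; the documents and topic count are illustrative assumptions:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["phone battery screen battery life",
        "great screen and phone camera",
        "shipping box arrived late",
        "late delivery and damaged shipping box"]   # illustrative reviews
cv = CountVectorizer()
X = cv.fit_transform(docs)                           # bag-of-words counts
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = cv.get_feature_names_out()
for k, weights in enumerate(lda.components_):        # per-topic word weights
    top = weights.argsort()[::-1][:3]
    print("topic", k, [terms[i] for i in top])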
Topic models can be used in document modeling, document classification, and
collaborative filtering
Topic models can not only be applied to textual data; they can also help annotate images.
Just as a document can be considered a collection of topics, images can be
considered a collection of image features.
Determining Sentiments
Sentiment analysis refers to a group of tasks that use statistics and natural
language processing to mine opinions to identify and extract subjective information
from texts.
Early work on sentiment analysis focused on detecting the polarity of product
reviews from Epinions and movie reviews from the Internet Movie Database (IMDb)
at the document level.
Later work handles sentiment analysis at the sentence level. More recently, the focus has shifted to phrase-level and short-text forms in response to the popularity of micro-blogging services such as Twitter.
Determining Sentiments
One can manually construct lists of words with positive sentiments (such as
brilliant, awesome, and spectacular) and negative sentiments (such as awful,
stupid, and hideous).
Related work has pointed out that such an approach can be expected to achieve accuracy around 60%, and it is likely to be outperformed by examination of corpus statistics.
Determining Sentiments
Classification methods such as naive Bayes, maximum entropy (MaxEnt), and
support vector machines (SVM) are often used to extract corpus statistics for
sentiment analysis.
Related research has found that these classifiers can score around 80% accuracy on sentiment analysis over unstructured data.
One or more of such classifiers can be applied to unstructured data, such
as movie reviews or even tweets.
Determining Sentiments
The movie review corpus by Pang et al. includes 2,000 movie reviews collected from an IMDb
archive of the rec.arts.movies.reviews newsgroup.
These movie reviews have been manually tagged into 1,000 positive reviews and 1,000
negative reviews.
Depending on the classifier, the data may need to be split into training and testing sets.
A rule of thumb for splitting data is to make the training set much bigger than the testing set.
For example, an 80/20 split would produce 80% of the data as the training set and 20% as the
testing set.
Determining Sentiments
One or more classifiers are trained over the training set to learn the
characteristics or patterns residing in the data.
The sentiment tags in the testing data are hidden away from the classifiers.
After the training, classifiers are tested over the testing set to infer the
sentiment tags.
Finally, the result is compared against the original sentiment tags to evaluate
the overall performance of the classifier.
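A minimal sketch following the classic NLTK approach on the Pang et al. movie review corpus (the feature construction and split sizes are illustrative choices):

import random
import nltk
from nltk.corpus import movie_reviews   # requires: nltk.download('movie_reviews')

docs = [(list(movie_reviews.words(fid)), cat)
        for cat in movie_reviews.categories()
        for fid in movie_reviews.fileids(cat)]
random.seed(0)
random.shuffle(docs)

fd = nltk.FreqDist(w.lower() for w in movie_reviews.words())
top_words = [w for w, _ in fd.most_common(2000)]   # word-presence features

def review_features(words):
    present = set(words)
    return {w: (w in present) for w in top_words}

data = [(review_features(words), cat) for words, cat in docs]
train_set, test_set = data[:1600], data[1600:]     # 80/20 split

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))  # typically around 0.7-0.8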
A confusion matrix is a specific table layout that allows visualization of the
performance of a model over the testing set.
Precision and recall are two measures commonly used to evaluate tasks
related to text analysis.
Precision and recall are defined as follows.
Precision is defined as the percentage of documents in the results that are relevant. If by entering the keyword bPhone the search engine returns 100 documents and 70 of them are relevant, the precision of the search engine result is 70/100 = 0.7.
Recall is the percentage of returned documents among all relevant documents in the corpus. If by entering the keyword bPhone the search engine returns 100 documents, only 70 of which are relevant, while failing to return 10 additional relevant documents, the recall is 70/(70+10) = 0.875.
Precision and recall are important concepts, whether the task is about
information retrieval of a search engine or text analysis over a finite corpus.
A good classifier ideally should achieve both precision and recall close to 1.0
Gaining Insights
Corresponding to the data collection phase, the Data Science team has used
bPhone as the keyword to collect more than 300 reviews from a popular
technical review website.
The 300 reviews are visualized as a word cloud after removing stop words. A
word cloud (or tag cloud) is a visual representation of textual data.
Tags are generally single words, and the importance of each word is shown
with font size or color.
IMPORTANT QUESTIONS
1. What are the main challenges of text analysis?
2. What is a corpus?
3. What are common words (such as a, and, of) called?
4. Why can't we use TF alone to measure the usefulness of the words?
5. What is a caveat of IDF? How does TFIDF address the problem?
6. Name three benefits of using TFIDF.
7. What methods can be used for sentiment analysis?
8. What is the definition of topic in topic models?
9. Explain the trade-offs for precision and recall.

Design and analysis of solar grass cutter.pdfDesign and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdf
 

MODULE 4-Text Analytics.pptx

  • 11. Word Cloud ● To create a visual impact, a word cloud can be built from the words in a text. ● The word cloud is created with the wordcloud library; in the cloud, the size of each word is proportional to its frequency.
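A minimal sketch of generating a word cloud with the wordcloud and matplotlib libraries; the sample text is an invented placeholder.

import matplotlib.pyplot as plt
from wordcloud import WordCloud

text = "text mining helps analyze text data text analytics gains insights from text"
wc = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.imshow(wc, interpolation="bilinear")  # more frequent words render larger
plt.axis("off")
plt.show()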
  • 12. Sentiment Analysis ● Sentiment analysis is also popularly known as opinion analysis or opinion mining. The key idea is to use techniques from text analytics, NLP, machine learning, and linguistics to extract important information from unstructured text. ● Sentiment analysis builds on natural language processing, the field concerned with the interaction between computers and human language, and provides a way to understand the attitudes and opinions expressed in texts. ● Sentiment polarity is typically a numeric score assigned to a text document based on subjective cues such as words and phrases expressing feelings and emotion. Neutral sentiment has polarity 0 since it does not express any specific sentiment, positive sentiment has polarity > 0, and negative sentiment has polarity < 0.
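A minimal sketch of scoring sentiment polarity with the textblob library mentioned earlier; the two reviews are invented examples.

from textblob import TextBlob

reviews = ["The bPhone is fantastic!", "The bEbook screen is terrible."]
for review in reviews:
    polarity = TextBlob(review).sentiment.polarity  # -1.0 (negative) to +1.0 (positive)
    print(review, "->", polarity)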
  • 13. Applications of Natural Language Processing ● With the advent of new technologies, there has been massive growth in the availability of text data. There are many applications of natural language processing that can contribute substantially to an organization's success, for example: understanding customer behavior through Twitter data, developing recommendation systems, and cluster analysis of customer data on the basis of reviews. This section focuses on different applications of natural language processing.
  • 14. Analyzing Twitter Data Twitter is a social networking site where people communicate in short messages called tweets. Tweeting means posting short messages to the people who follow you on Twitter, with the intention that the messages might be helpful for taking a decision.
  • 15. Document Similarity Document similarity is a powerful technique used to recommend products, services, videos, movies, etc. Examples include e-commerce websites recommending products, Amazon Prime and Netflix recommending movies and shows, and YouTube recommending videos. Recommendations for a product or service can be made according to predefined criteria such as number of buyers, budget, rating, popularity, manufacturer, and description.
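A minimal sketch of measuring document similarity with TFIDF vectors and cosine similarity, using scikit-learn (an assumption; the slides do not name a library). The product descriptions are invented.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

descriptions = [
    "a phone with a great camera and long battery life",
    "this phone has good battery life and a decent camera",
    "an ebook reader with a sharp, glare-free display",
]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(descriptions)

# Pairwise cosine similarities; the two phone descriptions score highest.
print(cosine_similarity(tfidf))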
  • 16. Cluster Analysis Cluster analysis can be done on text data after feature extraction has been performed with a vectorizer. For example, clusters of similar movies can be formed on the basis of the information stored in the TFIDF matrix produced during feature extraction, as sketched below.
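A minimal sketch of k-means clustering over a TFIDF matrix, again assuming scikit-learn; the movie descriptions and the choice of two clusters are illustrative.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

movies = [
    "action movie with car chases and explosions",
    "romantic comedy about a love story",
    "fast cars, stunts, and explosions",
    "a heartwarming love story comedy",
]
X = TfidfVectorizer(stop_words="english").fit_transform(movies)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster id assigned to each movie description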
  • 17. Text Analysis Steps A text analysis problem usually consists of three important steps: parsing, search and retrieval, and text mining. A text analysis problem may also consist of other subtasks, such as discourse and segmentation.
  • 18. Parsing is the process that takes unstructured text and imposes a structure for further analysis. The unstructured text could be a plain text file, a weblog, an Extensible Markup Language (XML) file, a HyperText Markup Language (HTML) file, or a Word document. Parsing deconstructs the provided text and renders it in a more structured way for the subsequent steps. Search and retrieval is the identification of the documents in a corpus that contain search items such as specific words, phrases, topics, or entities like people or organizations. These search items are generally called key terms. Search and retrieval originated from the field of library science and is now used extensively by web search engines.
  • 19. Text mining uses the terms and indexes produced by the prior two steps to discover meaningful insights pertaining to domains or problems of interest.
  • 20. Part-of-Speech (POS) Tagging, Lemmatization, and Stemming The goal of POS tagging is to build a model whose input is a sentence, such as: he saw a fox and whose output is a tag sequence. Each tag marks the POS for the corresponding word, such as: PRP VBD DT NN according to the Penn Treebank POS tags. Therefore, the four words are mapped to pronoun (personal), verb (past tense), determiner, and noun (singular), respectively.
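A minimal sketch of reproducing this tagging with NLTK's default POS tagger, which uses the Penn Treebank tagset; the download calls fetch the required resources once.

import nltk
nltk.download("punkt")                       # tokenizer models
nltk.download("averaged_perceptron_tagger")  # the default POS tagger

tokens = nltk.word_tokenize("he saw a fox")
print(nltk.pos_tag(tokens))
# [('he', 'PRP'), ('saw', 'VBD'), ('a', 'DT'), ('fox', 'NN')]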
  • 21. Both lemmatization and stemming are techniques to reduce the number of dimensions and reduce inflections or variant forms to the base form to more accurately measure the number of times each word appears. With the use of a given dictionary, lemmatization finds the correct dictionary base form of a word. For example, given the sentence: obesity causes many problems the output of lemmatization would be: obesity cause many problem https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
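A minimal sketch of lemmatization with NLTK's WordNetLemmatizer; the part-of-speech hints ('n' = noun, 'v' = verb) are supplied manually here, whereas a full pipeline would derive them from a POS tagger.

import nltk
nltk.download("wordnet")  # the dictionary used by the lemmatizer
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("obesity", pos="n"))   # obesity
print(lemmatizer.lemmatize("causes", pos="v"))    # cause
print(lemmatizer.lemmatize("problems", pos="n"))  # problem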
  • 22. Stemming Different from lemmatization, stemming does not need a dictionary; it usually refers to a crude process of stripping affixes based on a set of heuristics, with the hope of correctly reducing inflections and variant forms. After the process, words are stripped down to stems. A stem is not necessarily an actual word defined in the natural language, but it is sufficient to differentiate itself from the stems of other words. A well-known rule-based stemming algorithm is Porter's stemming algorithm. It defines a set of production rules to iteratively transform words into their stems. For the sentence shown previously: obesity causes many problems the output of Porter's stemming algorithm is: obes caus mani problem
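A quick check of this output with NLTK's implementation of Porter's algorithm:

from nltk.stem import PorterStemmer

ps = PorterStemmer()
print([ps.stem(w) for w in "obesity causes many problems".split()])
# ['obes', 'caus', 'mani', 'problem']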
  • 27. import nltk
nltk.download('punkt')  # tokenizer models, needed once
a = "Sample Text"
words = nltk.tokenize.word_tokenize(a)  # split the text into word tokens
fd = nltk.FreqDist(words)  # count the frequency of each token
fd.plot()  # plot the frequency distribution
  • 28. Explanation of code: 1. Import the nltk module. 2. Write the text whose word distribution you need to find. 3. Tokenize each word in the text; the tokens serve as input to the FreqDist module of nltk. 4. Apply the list of words to nltk.FreqDist. 5. Plot the words in a graph using plot().
  • 29. from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
sentence = "Hello, You have to build a very good site and I love visiting your site."
words = word_tokenize(sentence)  # tokenize the sentence into words
ps = PorterStemmer()
for w in words:
    rootWord = ps.stem(w)  # reduce each word to its stem
    print(rootWord)
  • 31. ● The package PorterStemmer is imported from the module stem. ● Packages for tokenization of sentences as well as words are imported. ● A sentence is written, which is tokenized in the next step. ● Word tokenization is implemented in this step. ● An object for PorterStemmer is created here. ● A loop is run and stemming of each word is done using the object created in line 5 of the code. http://text-processing.com/demo/stem/
  • 32. Text Analysis Example Consider the fictitious company ACME, maker of two products: bPhone and bEbook. ACME is in strong competition with other companies that manufacture and sell similar products. To succeed, ACME needs to produce excellent phones and eBook readers and increase sales. One of the ways the company does this is to monitor what is being said about ACME products in social media; in other words, what is the buzz on its products? ACME wants to search all that is said about ACME products in social media sites, such as Twitter and Facebook, and popular review sites, such as Amazon and ConsumerReports. It wants to answer questions such as these. • Are people mentioning its products? • What is being said? Are the products seen as good or bad? If people think an ACME product is bad, why? For example, are they complaining about the battery life of the bPhone, or the response time of their bEbook? ACME wants to monitor the social media buzz using a simple process based on the following steps.
  • 34. 1. Collect raw text - This corresponds to Phase 1 and Phase 2 of the Data Analytic Lifecycle. 2. Represent text - Convert each review into a suitable document representation with proper indices, and build a corpus based on these indexed reviews. This step corresponds to Phases 2 and 3 of the Data Analytic Lifecycle. 3. Compute the usefulness of each word in the reviews using methods such as TFIDF. This and the following two steps correspond to Phases 3 through 5 of the Data Analytic Lifecycle. 4. Categorize documents by topics. This can be achieved through topic models (such as latent Dirichlet allocation).
  • 35. 5. Determine sentiments of the reviews - Identify whether the reviews are positive or negative. Many product review sites provide ratings of a product with each review. If such information is not available, techniques like sentiment analysis can be used on the textual data to infer the underlying sentiments. People can express many emotions; to keep the process simple, ACME considers sentiments as positive, neutral, or negative. 6. Review the results and gain greater insights - This step corresponds to Phases 5 and 6 of the Data Analytic Lifecycle. Marketing gathers the results from the previous steps to find out what exactly makes people love or hate a product, uses one or more visualization techniques to report the findings, tests the soundness of the conclusions, and operationalizes the findings if applicable.
  • 36. Collecting Raw Text The Data Science team starts by actively monitoring various websites for user-generated content. The content being collected could be related articles from news portals and blogs, comments on ACME's products from online shops or review sites, or social media posts that contain the keywords bPhone or bEbook. Regardless of where the data comes from, it's likely that the team would deal with semi-structured data such as HTML web pages, Really Simple Syndication (RSS) feeds, XML, or JavaScript Object Notation (JSON) files. Enough structure needs to be imposed to find the part of the raw text that the team really cares about. In the brand management example, ACME is interested in what the reviews say about bPhone or bEbook and when the reviews are posted. Therefore, the team will actively collect such information.
  • 37. Many websites and services offer public APIs for third-party developers to access their data. For example, the Twitter API allows developers to choose from the Streaming API or the REST API to retrieve public Twitter posts that contain the keywords bPhone or bEbook. Developers can also read tweets in real time from a specific user or tweets posted near a specific venue. The fetched tweets are in the JSON format.
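A hedged sketch of fetching such tweets with the third-party tweepy library (an assumption; the slides do not prescribe a client). The bearer token is a placeholder, and Twitter's endpoints and access levels have changed over time.

import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")  # placeholder credential
response = client.search_recent_tweets(query="bPhone OR bEbook", max_results=10)
for tweet in response.data or []:
    print(tweet.text)  # tweets arrive as JSON and are parsed into objects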
  • 38. Many news portals and blogs provide data feeds that are in an open standard format, such as RSS or XML.
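A minimal sketch of reading such a feed with the third-party feedparser library (an assumption; the slides name no library); the feed URL is a placeholder.

import feedparser

feed = feedparser.parse("https://example.com/news/rss")  # placeholder feed URL
for entry in feed.entries[:5]:
    print(entry.title, entry.link)  # headline and link of each item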
  • 39. Representing Text In this data representation step, raw text is first transformed with text normalization techniques such as tokenization and case folding. Then it is represented in a more structured way for analysis. Tokenization is the task of separating (also called tokenizing) words from the body of text. Raw text is converted into collections of tokens after the tokenization, where each token is generally a word. A common approach is tokenizing on spaces, though hyphenated tokens such as state-of-art show that splitting on spaces alone does not suit every case.
  • 40. Representing Text Another text normalization technique is called case folding, which reduces all letters to lowercase (or the opposite if applicable). One needs to be cautious applying case folding to tasks such as information extraction, sentiment analysis, and machine translation. If implemented incorrectly, case folding may reduce or change the meaning of the text and create additional noise. For example, when General Motors becomes general and motors, the downstream analysis may very likely consider them separate words rather than the name of a company. When the abbreviation of the World Health Organization WHO or the rock band The Who becomes who, they may both be interpreted as the pronoun who.
  • 41. Representing Text If case folding must be present, one way to reduce such problems is to create a lookup table of words not to be case folded. The team can come up with heuristics or rule-based strategies for the case folding. For example, the program can be taught to ignore words that have uppercase letters in the middle of a sentence.
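A minimal sketch of case folding with a lookup table of exceptions; the exception set and tokens are invented examples.

KEEP_AS_IS = {"WHO", "ACME", "IBM"}  # illustrative lookup table of exceptions

def case_fold(tokens):
    # Lowercase every token except those in the lookup table.
    return [t if t in KEEP_AS_IS else t.lower() for t in tokens]

print(case_fold(["The", "WHO", "praised", "ACME"]))
# ['the', 'WHO', 'praised', 'ACME']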
  • 42. Representing Text After normalizing the text by tokenization and case folding, it needs to be represented in a more structured way. A simple yet widely used approach to represent text is called bag-of-words. Given a document, bag-of-words represents the document as a set of terms, ignoring information such as order, context, inferences, and discourse. Each word is considered a term or token (which is often the smallest unit for the analysis). In many cases, bag-of-words additionally assumes every term in the document is independent.
  • 43. Representing Text The document then becomes a vector with one dimension for every distinct term in the space, and the terms are unordered. A permutation D* of a document D contains the same words exactly the same number of times but in a different order. Therefore, using the bag-of-words representation, document D and its permutation D* would share the same representation.
  • 44. Representing Text Bag-of-words takes quite a naïve approach, as order plays an important role in the semantics of text. With bag-of-words, many texts with different meanings are combined into one form. For example, the texts "a dog bites a man" and "a man bites a dog" have very different meanings, but they would share the same representation with bag-of-words.
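A minimal sketch illustrating this point with scikit-learn's CountVectorizer (an assumption; any bag-of-words implementation behaves the same way): both sentences map to the same vector.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["a dog bites a man", "a man bites a dog"]
cv = CountVectorizer()  # note: the default tokenizer drops one-letter words like "a"
X = cv.fit_transform(docs)

print(cv.get_feature_names_out())  # ['bites' 'dog' 'man']
print(X.toarray())                 # identical rows: both sentences share one bag of words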
  • 45. Representing Text Using single words as identifiers with the bag-of-words representation, the term frequency (TF) of each word can be calculated. Term frequency represents the weight of each term in a document, and it is proportional to the number of occurrences of the term in that document.
  • 46. Representing Text Besides extracting the terms, their morphological features may need to be included. The morphological features specify additional information about the terms, which may include root words, affixes, part-of-speech tags, named entities, or intonation (variations of spoken pitch). The features from this step contribute to the downstream analysis in classification or sentiment analysis.
  • 47. Representing Text The set of features that need to be extracted and stored highly depends on the specific task to be performed. If the task is to label and distinguish the part of speech, for example, the features will include all the words in the text and their corresponding part-of-speech tags. If the task is to annotate the named entities like names and organizations, the features highlight such information appearing in the text. Constructing the features is no trivial task; quite often this is done entirely manually, and sometimes it requires domain expertise.
  • 48. Representing Text Sometimes creating features is a text analysis task all to itself. One such example is topic modeling. Topic modeling provides a way to quickly analyze large volumes of raw text and identify the latent topics. Topic modeling may not require the documents to be labeled or annotated. It can discover topics directly from an analysis of the raw text. A topic consists of a cluster of words that frequently occur together and that share the same theme. Probabilistic topic modeling is a suite of algorithms that aim to parse large archives of documents and discover and annotate the topics.
  • 49. Representing Text It is important not only to create a representation of a document but also to create a representation of a corpus. A corpus is a collection of documents. A corpus could be so large that it includes all the documents in one or more languages, or it could be smaller or limited to a specific domain, such as technology, medicine, or law. For a web search engine, the entire World Wide Web is the relevant corpus. Most corpora are much smaller; the Brown Corpus, for example, created at Brown University in the 1960s, was the first million-word electronic corpus of English.
  • 50. Representing Text Many corpora focus on specific domains. For example, the BioCreative corpora are from biology, the Switchboard corpus contains telephone conversations, and the European Parliament Proceedings Parallel Corpus was extracted from the proceedings of the European Parliament in 21 European languages. Most corpora come with metadata, such as the size of the corpus and the domains from which the text is extracted. Some corpora (such as the Brown Corpus) include the information content of every word appearing in the text.
  • 51. Representing Text Information content (IC) is a metric to denote the importance of a term in a corpus. The conventional way of measuring the IC of a term is to combine the knowledge of its hierarchical structure from an ontology with statistics on its actual usage in text derived from a corpus. Terms with higher IC values are considered more important than terms with lower IC values. For example, the word necklace generally has a higher IC value than the word jewelry in an English corpus because jewelry is more general and is likely to appear more often than necklace. IC can help measure the semantic similarity of terms; such measures do not require an annotated corpus, and they generally achieve strong correlations with human judgment.
  • 52. Term Frequency-Inverse Document Frequency (TFIDF) TFIDF is a measure widely used in information retrieval and text analysis. Instead of using a traditional corpus as a knowledge base, TFIDF works directly on top of the fetched documents and treats these documents as the "corpus." TFIDF is robust and efficient on dynamic content, because document changes require only the update of frequency counts.
  • 53. Term Frequency-Inverse Document Frequency (TFIDF) TFIDF scores a term t in a document d as the product of its term frequency and its inverse document frequency: TFIDF(t, d) = TF(t, d) × IDF(t), with both factors defined in the slides that follow.
  • 54. Term Frequency-Inverse Document Frequency (TFIDF) To understand how the term frequency is computed, consider a bag-of-words vector space of 10 words: i, love, acme, my, bebook, bphone, fantastic, slow, terrible, and terrific.
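A minimal sketch of computing raw term frequencies over this 10-word vocabulary; the sample review is invented.

from collections import Counter

vocab = ["i", "love", "acme", "my", "bebook", "bphone",
         "fantastic", "slow", "terrible", "terrific"]
review = "i love love my bphone".split()  # an invented sample review

counts = Counter(review)
tf = {term: counts.get(term, 0) for term in vocab}
print(tf)  # raw term frequency of each vocabulary word in the review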
  • 55. Term Frequency-Inverse Document Frequency (TFIDF) Because the distribution of word frequencies contains a long tail, the logarithm can be applied to dampen raw counts, as in the equation TF(t, d) = log(f(t, d) + 1), where f(t, d) is the raw count of term t in document d.
  • 56. Term Frequency-Inverse Document Frequency (TFIDF) Because longer documents contain more terms, they tend to have higher term frequency values. They also tend to contain more distinct terms. These factors can conspire to raise the term frequency values of longer documents and lead to undesirable bias favoring longer documents. To address this problem, the term frequency can be normalized. For example, the term frequency of term t in document d can be normalized by the number of terms n in d: TF(t, d) = f(t, d) / n.
  • 57. Term Frequency-Inverse Document Frequency (TFIDF) A term frequency vector can become very high dimensional because the bag-of-words vector space can grow substantially to include all the words in English. The high dimensionality makes it difficult to store and parse the text and contributes to performance issues related to text analysis.
  • 58. Term Frequency-Inverse Document Frequency (TFIDF) For the purpose of reducing dimensionality, not all the words from a given language need to be included in the term frequency vector. In English, for example, it is common to remove words such as the, a, of, and, to, and other articles that are not likely to contribute to semantic understanding. These common words are called stop words. Lists of stop words are available in various languages for automating the identification of stop words. Among them is Snowball's stop word list, which contains stop words in more than ten languages.
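A minimal sketch of stop word removal, here using NLTK's English stop word list rather than Snowball's; the token list is invented.

import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords

stops = set(stopwords.words("english"))
tokens = ["the", "bphone", "has", "a", "fantastic", "screen"]
print([t for t in tokens if t not in stops])  # ['bphone', 'fantastic', 'screen']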
  • 59. Term Frequency-Inverse Document Frequency (TFIDF) Another simple yet effective way to reduce dimensionality is to store a term and its frequency only if the term appears at least once in a document. Any term not existing in the term frequency vector by default has a frequency of 0. Therefore, the term frequency vector can be simplified to a sparse representation that stores only the terms with nonzero frequencies.
  • 60. Term Frequency-Inverse Document Frequency (TFIDF) Some NLP techniques such as lemmatization and stemming can also reduce high dimensionality. Lemmatization and stemming are two different techniques that combine various forms of a word. With these techniques, words such as play, plays, played, and playing can be mapped to the same term. As shown, the term frequency is based on the raw count of a term occurring in a stand-alone document. Term frequency by itself suffers from a critical problem: it regards that stand-alone document as the entire world. The importance of a term is solely based on its presence in this particular document.
  • 61. Term Frequency-Inverse Document Frequency (TFIDF) Stop words such as the, and, and a could be inappropriately considered the most important because they have the highest frequencies in every document. For example, the top three most frequent words in Shakespeare's Hamlet are all stop words (the, and, and of). Besides stop words, words that are more general in meaning tend to appear more often, thus having higher term frequencies. In an article about consumer telecommunications, the word phone would be likely to receive a high term frequency.
  • 62. Term Frequency-Inverse Document Frequency (TFIDF) As a result, important keywords such as bPhone and bEbook and their related words could appear to be less important. Consider a search engine that responds to a search query and fetches relevant documents. Using term frequency alone, the search engine would not properly assess how relevant each document is in relation to the search query.
  • 63. Term Frequency-Inverse Document Frequency (TFIDF) A quick fix for the problem is to introduce an additional variable that has a broader view of the world, considering the importance of a term not only in a single document but in a collection of documents, or in a corpus. The additional variable should reduce the effect of the term frequency as the term appears in more documents. That is the intention of the inverse document frequency (IDF).
  • 64. Term Frequency-Inverse Document Frequency (TFIDF) The IDF inversely corresponds to the document frequency (DF), which is defined to be the number of documents in the corpus that contain a term. Let a corpus D = {d_1, d_2, ..., d_N} contain N documents. The document frequency of a term t in corpus D is then DF(t) = |{d ∈ D : t ∈ d}|.
  • 65. Term Frequency-Inverse Document Frequency (TFIDF) The inverse document frequency of a term t is obtained by dividing N by the document frequency of the term and then taking the logarithm of that quotient: IDF(t) = log(N / DF(t)).
  • 66. If the term is not in the corpus, this leads to a division by zero. A quick fix is to add 1 to the denominator: IDF(t) = log(N / (DF(t) + 1)).
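A minimal sketch tying these definitions together over a toy three-document corpus: raw TF, document frequency, the smoothed IDF above, and their product.

import math
from collections import Counter

corpus = ["i love my bphone".split(),
          "my bphone is fantastic".split(),
          "the bebook is slow".split()]
N = len(corpus)

def tfidf(term, doc):
    tf = Counter(doc)[term]                   # raw term frequency in the document
    df = sum(1 for d in corpus if term in d)  # document frequency across the corpus
    idf = math.log(N / (df + 1))              # IDF with the add-1 smoothing above
    return tf * idf

print(tfidf("slow", corpus[2]))    # rare term: positive weight
print(tfidf("bphone", corpus[0]))  # term in most documents: weight of zero here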
  • 67. Categorizing Documents by Topics A topic consists of a cluster of words that frequently occur together and share the same theme. The topics of a document are not as straightforward as they might initially appear. Consider these two reviews: 1. The bPhone5x has coverage everywhere. It's much less flaky than my old bPhone4G. 2. While I love ACME's bPhone series, I've been quite disappointed by the bEbook. The text is illegible, and it makes even my old NBook look blazingly fast. Is the first review about bPhone5x or bPhone4G? Is the second review about bPhone, bEbook, or NBook? For machines, these questions can be difficult to answer.
  • 68. Categorizing Documents by Topics If a review is talking about the bPhone5x, the term bPhone5x and related terms (such as phone and ACME) are likely to appear frequently. A document typically consists of multiple themes running through the text in different proportions: for example, 30% on a topic related to phones, 15% on a topic related to appearance, 10% on a topic related to shipping, 5% on a topic related to service, and so on.
  • 69. Categorizing Documents by Topics Document grouping can be achieved with clustering methods such as k-means clustering, or classification methods such as support vector machines, k-nearest neighbors, or naive Bayes. However, a more feasible and prevalent approach is to use topic modeling. Topic modeling provides tools to automatically organize, search, understand, and summarize vast amounts of information.
  • 70. Categorizing Documents by Topics Topic models are statistical models that examine words from a set of documents, determine the themes over the text, and discover how the themes are associated or change over time. The process of topic modeling can be simplified to the following. 1. Uncover the hidden topical patterns within a corpus. 2. Annotate documents according to these topics. 3. Use annotations to organize, search, and summarize texts.
  • 71. Categorizing Documents by Topics A topic is formally defined as a distribution over a fixed vocabulary of words. Different topics would have different distributions over the same vocabulary. A topic can be viewed as a cluster of words with related meanings, and each word has a corresponding weight inside this topic. Note that a word from the vocabulary can reside in multiple topics with different weights. Topic models do not necessarily require prior knowledge of the texts. The topics can emerge solely based on analyzing the text.
  • 72. The simplest topic model is latent Dirichlet allocation (LDA), a generative probabilistic model of a corpus proposed by David M. Blei and two other researchers. In generative probabilistic modeling, data is treated as the result of a generative process that includes hidden variables. LDA assumes that there is a fixed vocabulary of words, and the number of the latent topics is predefined and remains constant. LDA assumes that each latent topic follows a Dirichlet distribution over the vocabulary, and each document is represented as a random mixture of latent topics.
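A minimal sketch of fitting LDA with scikit-learn's LatentDirichletAllocation (an assumption; Blei's original implementation is not used here). The four toy documents and the choice of two topics are illustrative.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the bphone battery lasts long",
        "great phone with a great battery and screen",
        "the bebook screen text is illegible",
        "this ebook reader is slow and hard to read"]
cv = CountVectorizer(stop_words="english")
X = cv.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = cv.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[-3:][::-1]  # indices of the three heaviest words
    print("topic", k, ":", [terms[i] for i in top])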
  • 73. The figure illustrates the intuitions behind LDA.
  • 74. The left side of the figure shows four topics built from a corpus, where each topic contains a list of the most important words from the vocabulary. The four example topics are related to problem, policy, neural, and report. For each document, a distribution over the topics is chosen, as shown in the histogram on the right. Next, a topic assignment is picked for each word in the document, and the word from the corresponding topic (colored discs) is chosen. In reality, only the documents (as shown in the middle of the figure) are available. The goal of LDA is to infer the underlying topics, topic proportions, and topic assignments for every document.
  • 75. Topic models can be used in document modeling, document classification, and collaborative filtering. Topic models not only can be applied to textual data, they can also help annotate images. Just as a document can be considered a collection of topics, images can be considered a collection of image features.
  • 76. Determining Sentiments Sentiment analysis refers to a group of tasks that use statistics and natural language processing to mine opinions and identify and extract subjective information from texts. Early work on sentiment analysis focused on detecting the polarity of product reviews from Epinions and movie reviews from the Internet Movie Database (IMDb) at the document level. Later work handled sentiment analysis at the sentence level. More recently, the focus has shifted to phrase-level and short-text forms in response to the popularity of micro-blogging services such as Twitter.
  • 77. Determining Sentiments One can manually construct lists of words with positive sentiments (such as brilliant, awesome, and spectacular) and negative sentiments (such as awful, stupid, and hideous). Related work has pointed out that such an approach can be expected to achieve accuracy around 60%, and it is likely to be outperformed by examination of corpus statistics.
  • 78. Determining Sentiments Classification methods such as naive Bayes, maximum entropy (MaxEnt), and support vector machines (SVM) are often used to extract corpus statistics for sentiment analysis. Related research has found that these classifiers can score around 80% accuracy on sentiment analysis over unstructured data. One or more of such classifiers can be applied to unstructured data, such as movie reviews or even tweets.
  • 79. Determining Sentiments The movie review corpus by Pang et al. includes 2,000 movie reviews collected from an IMDb archive of the rec.arts.movies.reviews newsgroup. These movie reviews have been manually tagged into 1,000 positive reviews and 1,000 negative reviews. Depending on the classifier, the data may need to be split into training and testing sets. A rule of thumb for splitting data is to make the training set much bigger than the testing set. For example, an 80/20 split would produce 80% of the data as the training set and 20% as the testing set.
  • 80. Determining Sentiments One or more classifiers are trained over the training set to learn the characteristics or patterns residing in the data. The sentiment tags in the testing data are hidden away from the classifiers. After the training, classifiers are tested over the testing set to infer the sentiment tags. Finally, the result is compared against the original sentiment tags to evaluate the overall performance of the classifier.
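A minimal sketch of this train-and-test workflow on the Pang et al. corpus, which ships with NLTK as movie_reviews; scikit-learn and a naive Bayes classifier are assumptions, and the exact accuracy will vary.

import nltk
nltk.download("movie_reviews")
from nltk.corpus import movie_reviews
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

docs = [movie_reviews.raw(fid) for fid in movie_reviews.fileids()]
labels = [movie_reviews.categories(fid)[0] for fid in movie_reviews.fileids()]  # 'pos' or 'neg'

X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.2, random_state=0)  # the 80/20 split described above

vectorizer = TfidfVectorizer(stop_words="english")
clf = MultinomialNB().fit(vectorizer.fit_transform(X_train), y_train)
predictions = clf.predict(vectorizer.transform(X_test))
print(accuracy_score(y_test, predictions))  # compare inferred tags to the hidden ones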
  • 81. A confusion matrix is a specific table layout that allows visualization of the performance of a model over the testing set.
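A minimal sketch of building a confusion matrix with scikit-learn; the true and predicted tags are toy values.

from sklearn.metrics import confusion_matrix

y_true = ["pos", "pos", "neg", "neg", "pos"]  # hidden sentiment tags
y_pred = ["pos", "neg", "neg", "neg", "pos"]  # tags inferred by a classifier
print(confusion_matrix(y_true, y_pred, labels=["pos", "neg"]))
# rows = actual class, columns = predicted class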
  • 82. Precision and recall are two measures commonly used to evaluate tasks related to text analysis. They are defined as Precision = TP / (TP + FP) and Recall = TP / (TP + FN), where TP, FP, and FN denote the counts of true positives, false positives, and false negatives, respectively.
  • 83. Precision is defined as the percentage of documents in the results that are relevant. If by entering the keyword bPhone, the search engine returns 100 documents and 70 of them are relevant, the precision of the search engine result is 70/100 = 0.7. Recall is the percentage of returned documents among all relevant documents in the corpus. If by entering the keyword bPhone, the search engine returns 100 documents, only 70 of which are relevant, while failing to return 10 additional relevant documents, the recall is 70/(70 + 10) = 0.875.
  • 84. Precision and recall are important concepts, whether the task is about information retrieval of a search engine or text analysis over a finite corpus. A good classifier ideally should achieve both precision and recall close to 1.0.
  • 85. Gaining Insights Corresponding to the data collection phase, the Data Science team has used bPhone as the keyword to collect more than 300 reviews from a popular technical review website. The 300 reviews are visualized as a word cloud after removing stop words. A word cloud (or tag cloud) is a visual representation of textual data. Tags are generally single words, and the importance of each word is shown with font size or color.
  • 86. IMPORTANT QUESTIONS
1. What are the main challenges of text analysis?
2. What is a corpus?
3. What are common words (such as a, and, of) called?
4. Why can't we use TF alone to measure the usefulness of the words?
5. What is a caveat of IDF? How does TFIDF address the problem?
6. Name three benefits of using the TFIDF.
7. What methods can be used for sentiment analysis?
8. What is the definition of topic in topic models?
9. Explain the trade-offs for precision and recall.