This document provides an overview of text analysis and mining. It discusses key concepts such as text pre-processing, representation, shallow parsing, stop words, stemming, and lemmatization. Specific techniques covered include tokenization, part-of-speech tagging, and the Porter stemming algorithm. Applications mentioned are sentiment analysis, document similarity, and cluster analysis. The document also provides a multi-step example of text analysis involving collecting raw text, representing text, computing TF-IDF, categorizing documents by topics, determining sentiments, and gaining insights.
1. MODULE 4 : Text Analytics
CSC601.4 Analyze Text data and gain insights.
2. CONTENTS
● Text Mining
○ History of text mining
○ Roots of text mining
○ Overview of seven practices of text analytics
○ Applications and use cases for text mining:
■ Extracting meaning from unstructured text
■ Summarizing Text.
● Text Analysis
○ Text Analysis Steps
○ A Text Analysis Example
○ Collecting Raw Text
○ Representing Text
○ Term Frequency—Inverse Document Frequency (TFIDF)
○ Categorizing Documents by Topics
○ Determining Sentiments
○ Gaining Insights
3. Text Mining
● Text mining is the process of evaluating large amounts of textual data to produce meaningful information and to convert unstructured text data into structured data for further analysis and visualization.
● Text mining helps identify otherwise unnoticed facts, relationships, and assertions in textual big data.
● In Python, a basic text mining workflow often relies on two libraries: textblob and wordcloud.
4. Text Data
● Before mining text, we need to understand the text data, for example by determining the number of words in a document.
● We first need to load data from different sources, including text files (.txt), PDFs (.pdf), CSV files (.csv), etc.
6. Text Pre-Processing
● Text Pre-Processing is an important phase before applying any
algorithms on text data.
● Data cleaning means removing noise such as punctuation and extra spaces.
● The objective of this step is to clean the data and create independent terms from the data file for further analysis.
● After the textual data has been loaded into the environment, it needs to be cleaned by adopting different measures such as transforming the text to lowercase and removing URLs, non-English words, punctuation, whitespace, etc. A minimal cleaning pipeline is sketched below.
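A minimal sketch of such a pipeline, using only the Python standard library (the sample sentence is invented for illustration):

import re
import string

def clean_text(text):
    text = text.lower()                                   # transform to lowercase
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # remove URLs
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = re.sub(r"\s+", " ", text).strip()              # collapse extra whitespace
    return text

print(clean_text("Visit https://example.com NOW!!  It's great."))
# visit now its great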
7. Shallow Parsing
● Tokenization is the process of breaking down a text paragraph into smaller chunks such as words or sentences.
● A token is a single entity that is a building block of a sentence or paragraph.
● A sentence tokenizer breaks a text paragraph into sentences, while a word tokenizer breaks a text paragraph into words.
● The process of classifying words into their parts of speech and labelling them
accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging. Parts
of speech are also known as word classes or lexical categories.
● The collection of tags used for a particular task is known as a tagset.
● The emphasis in this section is on exploiting tags, and tagging text automatically.
● A part-of-speech tagger, or POS-tagger, processes a sequence of words, and attaches
a part of speech tag to each word.
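A minimal sketch of sentence and word tokenization with NLTK (the sample paragraph is invented; part-of-speech tagging is sketched later with the Penn Treebank example):

import nltk
nltk.download("punkt", quiet=True)    # tokenizer models (one-time download)

paragraph = "Text mining is useful. It finds patterns in raw text."

print(nltk.sent_tokenize(paragraph))  # ['Text mining is useful.', 'It finds patterns in raw text.']
print(nltk.word_tokenize(paragraph))  # ['Text', 'mining', 'is', 'useful', '.', 'It', ...]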
8. Stop words
● Text may contain stop words such as is, am, are, this, a, an,
the, etc.
● These stop words are considered as noise in the text and
hence should be removed.
● Before doing analysis of text data, we should filter out the list
of tokens from these stop words.
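A minimal sketch of stop word removal using NLTK's built-in English stop word list (the sample sentence is invented):

import nltk
nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
tokens = word_tokenize("This is an example showing stop word removal.")
filtered = [t for t in tokens if t.lower() not in stop_words]  # drop noise words
print(filtered)  # ['example', 'showing', 'stop', 'word', 'removal', '.']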
9. Stemming and Lemmatizing
● Stemming and lemmatization address another type of noise in text: they reduce derivationally related forms of a word to a common root word.
● Stemming is the process of gathering words of similar origin into one word. Stemming helps increase accuracy in mined text by removing suffixes and reducing words to their basic forms. For example, words like detection, detected, and detecting are reduced to the common word "detect".
● Lemmatization is usually more sophisticated than stemming and also reduces words to their base word. But a lemmatizer, unlike a stemmer, works on an individual word with knowledge of its context. For example, the word "better" has "good" as its lemma, but stemming misses this because it requires a dictionary look-up. A comparison of the two is sketched below.
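A minimal sketch contrasting the two techniques with NLTK; the example words come from the bullets above:

import nltk
nltk.download("wordnet", quiet=True)
from nltk.stem import PorterStemmer, WordNetLemmatizer

ps = PorterStemmer()
wnl = WordNetLemmatizer()

print([ps.stem(w) for w in ["detection", "detected", "detecting"]])
# ['detect', 'detect', 'detect']

print(wnl.lemmatize("better", pos="a"))  # good   (dictionary look-up, as an adjective)
print(ps.stem("better"))                 # better (suffix rules only, no dictionary)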
11. Word Cloud
● For creating a visual impact, a word cloud is created from the different words in a text.
● The word cloud is created with the wordcloud library. In the word cloud, the size of each word reflects its frequency, as sketched below.
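A minimal sketch using the wordcloud library mentioned above (the sample text is invented; matplotlib is used for display):

from wordcloud import WordCloud
import matplotlib.pyplot as plt

text = "data science text mining text data analysis data insights"
wc = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.imshow(wc, interpolation="bilinear")  # more frequent words render larger
plt.axis("off")
plt.show()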
12. Sentiment Analysis
● Sentiment Analysis is also popularly known as opinion analysis or opinion mining.
The key idea is to use techniques from text analytics, NLP, machine learning and
linguistics to extract important information or data points from unstructured text.
● Sentiment analysis is a branch of machine learning that deals with the interaction between computers and humans using natural language. Sentiment analysis provides a way to understand the attitudes and opinions expressed in texts.
● Sentiment polarity is typically a numeric score which is assigned to both the
positive and negative aspects of a text document based on subjective parameters
like specific words and phrases expressing feelings and emotion. Neutral sentiment
typically has 0 polarity since it does not express any specific sentiment, positive
sentiment will have polarity > 0 and negative < 0.
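A minimal sketch of polarity scoring with the textblob library introduced earlier (the reviews are invented; exact scores depend on TextBlob's lexicon):

from textblob import TextBlob

reviews = ["The battery life is fantastic!",
           "The screen is terrible.",
           "The package arrived on Tuesday."]

for review in reviews:
    polarity = TextBlob(review).sentiment.polarity  # score in [-1.0, 1.0]
    label = "positive" if polarity > 0 else "negative" if polarity < 0 else "neutral"
    print(f"{polarity:+.2f} {label}: {review}")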
13. Applications of Natural Language Processing
● With the advent of new technologies, there has been a massive growth in the
availability of text data. Thus, there are different applications of natural language processing that may contribute significantly to an organization's success. Examples: understanding customer behavior through Twitter data, developing recommendation systems, and cluster analysis of customer data on the basis of reviews. This section focuses on different applications of natural language processing.
14. Analyzing Twitter Data
Twitter is a social networking site where people communicate in short messages called tweets. Tweeting basically means posting short messages to the people who follow you on Twitter, with the intention that the messages might be helpful for taking a decision.
15. Document Similarity
Document similarity is a powerful technique used to recommend products, services, videos, movies, etc. Examples of document similarity at work include e-commerce websites recommending products, Amazon Prime and Netflix recommending movies/shows, and YouTube recommending videos. Recommendations for a product or service can be made according to pre-defined criteria such as number of buyers, budget, rating, popularity, manufacturer, and description. A sketch based on TF-IDF vectors follows.
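A minimal sketch of document similarity using TF-IDF vectors and cosine similarity from scikit-learn (the toy descriptions are invented):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["action movie with car chases",
        "action film full of chases and explosions",
        "romantic comedy about a wedding"]

tfidf = TfidfVectorizer().fit_transform(docs)   # one TF-IDF vector per document
print(cosine_similarity(tfidf).round(2))        # pairwise similarity matrix
# Documents 0 and 1 score higher with each other than either does with document 2.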
16. Cluster Analysis
Cluster analysis can be done on text data after feature extraction has been performed with a vectorizer. This section performs cluster analysis on the above data and groups different movies together on the basis of the information stored in the TF-IDF matrix produced during feature extraction, as sketched below.
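A minimal sketch of k-means clustering on a TF-IDF matrix with scikit-learn (the movie descriptions are invented):

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

descriptions = ["space battle aliens spaceship",
                "alien invasion space war",
                "love story wedding romance",
                "romance heartbreak love letters"]

X = TfidfVectorizer().fit_transform(descriptions)          # feature extraction
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # e.g. [0 0 1 1]: sci-fi titles vs. romance titles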
17. Text Analysis Steps
A text analysis problem usually consists of three important steps: parsing,
search and retrieval, and text mining.
A text analysis problem may also consist of other subtasks, such as discourse analysis and segmentation.
18. Parsing is the process that takes unstructured text and imposes a structure for further analysis. The
unstructured text could be a plain text file, a weblog, an Extensible Markup Language (XML) file, a HyperText
Markup Language (HTML) file, or a Word document. Parsing deconstructs the provided text and renders
it in a more structured way for the subsequent steps.
Search and retrieval is the identification of the documents in a corpus that contain search items such
as specific words, phrases, topics, or entities like people or organizations. These search items are generally
called key terms. Search and retrieval originated in the field of library science and is now used extensively by web search engines.
19. Text mining uses the terms and indexes produced by the prior two steps to
discover meaningful insights pertaining to domains or problems of interest.
20. Part-of-Speech (POS) Tagging, Lemmatization, and
Stemming
The goal of POS tagging is to build a model whose input is a sentence, such
as:
he saw a fox
and whose output is a tag sequence. Each tag marks the POS for the
corresponding word, such as:
PRP VBD DT NN
according to the Penn Treebank POS tags. Therefore, the four words are mapped to pronoun (personal), verb (past tense), determiner, and noun (singular), respectively.
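A minimal sketch reproducing this tagging with NLTK's pretrained tagger:

import nltk
nltk.download("averaged_perceptron_tagger", quiet=True)  # tagger model (name may vary by NLTK version)

print(nltk.pos_tag("he saw a fox".split()))
# [('he', 'PRP'), ('saw', 'VBD'), ('a', 'DT'), ('fox', 'NN')]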
21. Both lemmatization and stemming are techniques to reduce the number of dimensions and reduce
inflections or variant forms to the base form to more accurately measure the number of times each
word appears. With the use of a given dictionary, lemmatization finds the correct dictionary base form
of a word.
For example, given the sentence:
obesity causes many problems
the output of lemmatization would be:
obesity cause many problem
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
22. Stemming
Different from lemmatization, stemming does not need a dictionary, and it usually
refers to a crude process of stripping affixes based on a set of heuristics with the
hope of correctly achieving the goal to reduce inflections or variant forms.
After the process, words are stripped to become stems. A stem is not necessarily
an actual word defined in the natural language, but it is sufficient to differentiate
itself from the stems of other words. A well-known rule-based stemming algorithm is
Porter's stemming algorithm. It defines a set of production rules to iteratively
transform words into their stems. For the sentence shown previously:
obesity causes many problems
the output of Porter's stemming algorithm is:
obes caus mani problem
27. # Plot the frequency distribution of words in a text
import nltk
nltk.download("punkt", quiet=True)       # tokenizer models (one-time download)

a = "Sample Text"
words = nltk.tokenize.word_tokenize(a)   # tokenize the text into words
fd = nltk.FreqDist(words)                # count the frequency of each word
fd.plot()                                # plot the distribution (requires matplotlib)
28. Explanation of code:
1. Import the nltk module.
2. Write the text whose word distribution you need to find.
3. Tokenize each word in the text; the tokens serve as input to the FreqDist module of nltk.
4. Apply each word to nltk.FreqDist in the form of a list.
5. Plot the words in the graph using plot().
29. # Stem each word of a sentence with the Porter stemmer
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

sentence = "Hello, You have to build a very good site and I love visiting your site."
words = word_tokenize(sentence)   # split the sentence into word tokens
ps = PorterStemmer()              # create a Porter stemmer object
for w in words:
    rootWord = ps.stem(w)         # reduce each word to its stem
    print(rootWord)
31. ● The PorterStemmer package is imported from the stem module.
● Packages for tokenization of sentences as well as words are imported.
● A sentence is written which is to be tokenized in the next step.
● Word tokenization is performed in this step.
● An object for PorterStemmer is created.
● A loop runs, and the stemming of each word is done using the object created earlier.
http://text-processing.com/demo/stem/
32. Text Analysis Example
Consider the fictitious company ACME, maker of two products: bPhone and bEbook. ACME is in
strong competition with other companies that manufacture and sell similar products. To succeed,
ACME needs to produce excellent phones and eBook readers and increase sales. One of the ways
the company does this is to monitor what is being said about ACME products in social media. In other
words, what is the buzz on its products? ACME wants to search all that is said about ACME products
in social media sites, such as Twitter and Facebook, and popular review sites, such as Amazon and
ConsumerReports. It wants to answer questions such as these.
• Are people mentioning its products?
• What is being said? Are the products seen as good or bad? If people think an ACME product is bad,
why?
For example, are they complaining about the battery life of the bPhone, or the response time
in their bEbook?
ACME wants to monitor the social media buzz using a simple process based on the following steps.
34. 1. Collect raw text - This corresponds to Phase 1 and Phase 2 of the Data
Analytic Lifecycle.
2. Represent text - Convert each review into a suitable document representation
with proper indices, and build a corpus based on these indexed reviews. This
step corresponds to Phases 2 and 3 of the Data Analytic Lifecycle.
3. Compute the usefulness of each word in the reviews using methods such as TFIDF. This and the following two steps correspond to Phases 3 through 5 of the Data Analytic Lifecycle.
4. Categorize documents by topics. This can be achieved through topic models
(such as latent Dirichlet allocation).
35. 5. Determine sentiments of the reviews. Identify whether the reviews are positive or negative.
Many product review sites provide ratings of a product with each review. If such information is
not available, techniques like sentiment analysis can be used on the textual data to infer the
underlying sentiments. People can express many emotions. To keep the process simple,
ACME considers sentiments as positive, neutral, or negative.
6. Review the results and gain greater insights - This step corresponds to Phases 5 and 6 of the Data Analytic Lifecycle. Marketing gathers the results from the previous steps. Find out
what exactly makes people love or hate a product. Use one or more visualization techniques
to report the findings. Test the soundness of the conclusions and operationalize the findings if
applicable.
36. Collecting Raw Text
The Data Science team starts by actively monitoring various websites for user-generated
contents. The user-generated contents being collected could be related articles from news portals and
blogs, comments on ACME's products from online shops or review sites, or social media posts that contain the keywords bPhone or bEbook. Regardless of where the data comes from, it is likely that the team will deal with semi-structured data such as HTML web pages, Really Simple Syndication (RSS) feeds, XML, or JavaScript Object Notation (JSON) files. Enough structure needs to be imposed to find the part of the raw text that the team really cares about. In the brand management example, ACME is interested in what the reviews say about the bPhone or the bEbook and when the reviews are posted. Therefore, the team will actively collect such information.
37. Many websites and services offer public APIs for third-party developers to
access their data.
For example, the Twitter API allows developers to choose from the Streaming
API or the REST API to retrieve public Twitter posts that contain the keywords
bPhone or bEbook.
Developers can also read tweets in real time from a specific user or tweets
posted near a specific venue. The fetched tweets are in the JSON format.
38. Many news portals and blogs provide data feeds that are in an open standard
format, such as RSS or XML.
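A minimal sketch of consuming such a feed with the third-party feedparser library; the feed URL is a hypothetical placeholder, not a real data source:

import feedparser

feed = feedparser.parse("https://example.com/news/rss")  # hypothetical RSS feed

for entry in feed.entries:
    # Each entry exposes semi-structured fields the team can keep or discard.
    print(entry.get("title"), entry.get("published"))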
39. Representing Text
In this data representation step, raw text is first transformed with text normalization techniques
such as tokenization and case folding.
Then it is represented in a more structured way for analysis.
Tokenization is the task of separating (also called tokenizing) words from the body of text.
Raw text is converted into collections of tokens after the tokenization, where each token is
generally a word.
A common approach is tokenizing on spaces; note that a hyphenated term such as state-of-the-art then remains a single token.
40. Representing Text
Another text normalization technique is called case folding, which reduces all letters to
lowercase (or the opposite if applicable).
One needs to be cautious applying case folding to tasks such as information extraction,
sentiment analysis, and machine translation.
If implemented incorrectly, case folding may reduce or change the meaning of the text and
create additional noise.
For example, when General Motors becomes general and motors, the downstream analysis may very likely consider them separate words rather than the name of a company. When the abbreviation of the World Health Organization (WHO) or the rock band The Who becomes who, they may both be interpreted as the pronoun who.
41. Representing Text
If case folding must be present, one way to reduce such problems is to create a
lookup table of words not to be case folded.
The team can come up with heuristics or rule-based strategies for the case folding. For example, the program can be taught to ignore words that have uppercase letters in the middle of a sentence, as sketched below.
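A minimal sketch of such a heuristic, assuming a hand-built exception list (the words and rules here are invented for illustration):

# Words never to be case folded, per the lookup-table idea above
DO_NOT_FOLD = {"WHO"}

def case_fold(tokens):
    folded = []
    for i, tok in enumerate(tokens):
        mid_sentence_upper = i > 0 and tok[0].isupper()  # uppercase mid-sentence
        if tok in DO_NOT_FOLD or mid_sentence_upper:
            folded.append(tok)        # leave as-is
        else:
            folded.append(tok.lower())
    return folded

print(case_fold(["The", "WHO", "praised", "General", "Motors"]))
# ['the', 'WHO', 'praised', 'General', 'Motors']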
42. Representing Text
After normalizing the text by tokenization and case folding, it needs to be
represented in a more structured way.
A simple yet widely used approach to represent text is called bag-of-words.
Given a document, bag-of-words represents the document as a set of terms,
ignoring information such as order, context, inferences, and discourse.
Each word is considered a term or token (which is often the smallest unit for the
analysis).
In many cases, bag-of-words additionally assumes every term in the document is
independent.
43. Representing Text
The document then becomes a vector with one dimension for every distinct
term in the space, and the terms are unordered.
A permutation D* of a document D contains exactly the same words the same number of times, but in a different order. Therefore, using the bag-of-words representation, document D and its permutation D* would share the same representation.
44. Representing Text
Bag-of-words takes quite a naïve approach, as order plays an important role in the semantics of text. With bag-of-words, many texts with different meanings are combined into one form.
For example, the texts
"a dog bites a man"
and
"a man bites a dog"
have very different meanings, but they would share the same representation with bag-of-words.
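A minimal sketch confirming this with scikit-learn's CountVectorizer (the token pattern is widened so single-letter words such as "a" are kept):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["a dog bites a man", "a man bites a dog"]
cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b")  # keep one-letter tokens
X = cv.fit_transform(docs)

print(cv.get_feature_names_out())  # ['a' 'bites' 'dog' 'man']
print(X.toarray())
# [[2 1 1 1]
#  [2 1 1 1]]   <- identical vectors despite opposite meanings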
45. Representing Text
Using single words as identifiers with the bag-of-words representation, the term
frequency (TF) of each word can be calculated.
Term frequency represents the weight of each term in a document, and it is
proportional to the number of occurrences of the term in that document.
46. Representing Text
Besides extracting the terms, their morphological features may need to be
included.
The morphological features specify additional information about the terms,
which may include root words, affixes, part-of-speech tags, named entities, or
intonation (variations of spoken pitch).
The features from this step contribute to the downstream analysis in
classification or sentiment analysis.
47. Representing Text
The set of features that need to be extracted and stored highly depends on the
specific task to be performed.
If the task is to label and distinguish parts of speech, for example, the features will include all the words in the text and their corresponding part-of-speech tags. If the task is to annotate named entities like names and organizations, the features highlight such information appearing in the text. Constructing the features is no trivial task; quite often this is done entirely manually, and sometimes it requires domain expertise.
48. Representing Text
Sometimes creating features is a text analysis task all to itself.
One such example is topic modeling.
Topic modeling provides a way to quickly analyze large volumes of raw text and identify the
latent topics.
Topic modeling may not require the documents to be labeled or annotated.
It can discover topics directly from an analysis of the raw text. A topic consists of a cluster of
words that frequently occur together and that share the same theme.
Probabilistic topic modeling is a suite of algorithms that aims to parse large archives of documents and to discover and annotate their topics.
49. Representing Text
It is important not only to create a representation of a document but also to
create a representation of a corpus.
A corpus is a collection of documents.
A corpus could be so large that it includes all the documents in one or more
languages, or it could be smaller or limited to a specific domain, such as
technology, medicine, or law.
For a web search engine, the entire World Wide Web is the relevant corpus.
Most corpora are much smaller; the Brown Corpus, for example, contains about one million words of American English.
50. Representing Text
Many corpora focus on specific domains.
For example, the BioCreative corpora are from biology, the Switchboard corpus contains
telephone conversations, and the European Parliament Proceedings Parallel Corpus was
extracted from the proceedings of the European Parliament in 21 European languages.
Most corpora come with metadata, such as the size of the corpus and the domains from which
the text is extracted.
Some corpora (such as the Brown Corpus) include the information content of every word
appearing in the text.
51. Representing Text
Information content (IC) is a metric to denote the importance of a term in a corpus.
The conventional way of measuring the IC of a term is to combine the knowledge of its hierarchical structure from an ontology with statistics on its actual usage in text derived from a corpus.
Terms with higher IC values are considered more important than terms with lower IC values.
For example, the word necklace generally has a higher IC value than the word jewelry in an
English corpus because jewelry is more general and is likely to appear more often than
necklace.
IC can help measure the semantic similarity of terms; such measures do not require an annotated corpus, and they generally achieve strong correlations with human judgment.
52. Term Frequency-Inverse Document Frequency (TFIDF)
TFIDF is a measure widely used in information retrieval and text analysis. Instead of using a traditional corpus as a knowledge base, TFIDF works directly on top of the fetched documents and treats these documents as the "corpus."
TFIDF is robust and efficient on dynamic content, because document changes
require only the update of frequency counts.
54. Term Frequency-Inverse Document Frequency (TFIDF)
To understand how the term frequency is computed, consider a bag-of-words
vector space of 10 words:
i, love, acme, my, bebook, bphone, fantastic, slow, terrible, and terrific.
55. Term Frequency-Inverse Document Frequency (TFIDF)
the logarithm can be applied to word frequencies, whose distribution also contains a long tail. A common log-scaled form is
tf(t, d) = log(1 + f(t, d))
where f(t, d) is the raw count of term t in document d.
56. Term Frequency-Inverse Document Frequency (TFIDF)
Because longer documents contain more terms, they tend to have higher term
frequency values.
They also tend to contain more distinct terms.
These factors can conspire to raise the term frequency values of longer
documents and lead to undesirable bias favoring longer documents.
To address this problem, the term frequency can be normalized. For example, the term frequency of a term t in document d can be normalized by the number of terms in d:
tf(t, d) = f(t, d) / n(d)
where n(d) is the total number of terms in document d.
57. Term Frequency-Inverse Document Frequency (TFIDF)
A term frequency vector can become very high dimensional because the bag-of-words vector space can grow substantially to include all the words in English. The high dimensionality makes it difficult to store and parse the text and contributes to performance issues in text analysis.
58. Term Frequency-Inverse Document Frequency (TFIDF)
For the purpose of reducing dimensionality, not all the words from a given language
need to be included in the term frequency vector. In English, for example, it is
common to remove words such as
the, a, of, and, to, and other articles that are not likely to contribute to semantic
understanding.
These common words are called stop words. Lists of stop words are available in various languages for automating their identification. Among them is Snowball's stop word list, which covers more than ten languages.
59. Term Frequency-Inverse Document Frequency (TFIDF)
Another simple yet effective way to reduce dimensionality is to store a term and
its frequency only if the term appears at least once in a document.
Any term not existing in the term frequency vector by default will have a
frequency of 0.
Therefore, the previous term frequency vector can be simplified to a sparse representation that stores only the non-zero entries.
60. Term Frequency-Inverse Document Frequency (TFIDF)
Some NLP techniques such as lemmatization and stemming can also reduce high dimensionality.
Lemmatization and stemming are two different techniques that combine various forms of a word.
With these techniques, words such as play, plays, played, and playing can be mapped to the same term.
As discussed, term frequency is based on the raw count of a term occurring in a standalone document. Term frequency by itself suffers from a critical problem: it regards that standalone document as the entire world.
The importance of a term is solely based on its presence in this particular document.
61. Term Frequency-Inverse Document Frequency (TFIDF)
Stop words such as the, and, and a could be inappropriately considered the
most important because they have the highest frequencies in every document.
For example, the top three most frequent words in Shakespeare's Hamlet are all stop words (the, and, and of).
Besides stop words, words that are more general in meaning tend to appear
more often, thus having higher term frequencies.
In an article about consumer telecommunications, the word phone would be
likely to receive a high term frequency.
62. Term Frequency-Inverse Document Frequency (TFIDF)
As a result, important keywords such as bPhone and bEbook and their related words could appear to be less important. Consider a search engine that responds to a search query and fetches relevant documents. Using term frequency alone, the search engine would not properly assess how relevant each document is in relation to the search query.
63. Term Frequency-Inverse Document Frequency (TFIDF)
A quick fix for the problem is to introduce an additional variable that has a
broader view of the world considering the importance of a term not only in a
single document but in a collection of documents, or in a corpus.
The additional variable should reduce the effect of the term frequency as the
term appears in more documents. That is the intention of the inverse document frequency (IDF).
64. Term Frequency-Inverse Document Frequency (TFIDF)
The IDF inversely corresponds to the document frequency (DF), which is defined to be the number of documents in the corpus that contain a term. Let a corpus D = {d1, d2, ..., dN} contain N documents. The document frequency of a term t in corpus D is then
DF(t) = the number of documents in D in which t appears.
65. Term Frequency-Inverse Document Frequency (TFIDF)
The inverse document frequency of a term t is obtained by dividing N by the document frequency of the term and then taking the logarithm of that quotient:
IDF(t) = log(N / DF(t))
66. If the term is not in the corpus, this leads to a division by zero. A quick fix is to add 1 to the denominator:
IDF'(t) = log(N / (1 + DF(t)))
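A minimal sketch computing TFIDF by hand from the normalized TF and smoothed IDF above, taking the product of the two (a common convention); the tiny corpus is invented:

import math

corpus = ["i love my bphone",
          "my bphone is fantastic",
          "terrible bebook",
          "the bebook is slow"]
docs = [d.split() for d in corpus]
N = len(docs)

def tf(term, doc):
    return doc.count(term) / len(doc)        # length-normalized term frequency

def idf(term):
    df = sum(1 for d in docs if term in d)   # document frequency DF(t)
    return math.log(N / (1 + df))            # smoothed IDF, as in the quick fix

for term in ["bphone", "terrible", "is"]:
    tfidf = [round(tf(term, d) * idf(term), 3) for d in docs]
    print(f"{term:9s} IDF={idf(term):.3f}  TFIDF per document: {tfidf}")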
67. Categorizing Documents by Topics
A topic consists of a cluster of words that frequently occur together and share the same theme.
The topics of a document are not as straightforward as they might initially appear. Consider these two
reviews:
1. The bPhone5x has coverage everywhere. It's much less flaky than my old bPhone4G.
2. While I love ACME's bPhone series, I've been quite disappointed by the bEbook. The text is illegible, and it makes even my old NBook look blazingly fast.
Is the first review about the bPhone5x or the bPhone4G? Is the second review about the bPhone, the bEbook, or the NBook?
For machines, these questions can be difficult to answer.
68. Categorizing Documents by Topics
If a review is talking about the bPhone5x, the term bPhone5x and related terms (such as phone and ACME) are likely to appear frequently. A document typically consists of multiple themes running through the text in different proportions; for example, 30% on a topic related to phones, 15% on a topic related to appearance, 10% on a topic related to shipping, 5% on a topic related to service, and so on.
69. Categorizing Documents by Topics
Document grouping can be achieved with clustering methods such as k-means clustering, or with classification methods such as support vector machines, k-nearest neighbors, or naive Bayes.
However, a more feasible and prevalent approach is to use topic modeling.
Topic modeling provides tools to automatically organize, search, understand,
and summarize from vast amounts of information.
70. Categorizing Documents by Topics
Topic models are statistical models that examine words from a set of
documents, determine the themes over the text, and discover how the themes
are associated or change over time.
The process of topic modeling can be simplified to the following.
1. Uncover the hidden topical patterns within a corpus.
2. Annotate documents according to these topics.
3. Use annotations to organize, search, and summarize texts.
71. Categorizing Documents by Topics
A topic is formally defined as a distribution over a fixed vocabulary of words.
Different topics would have different distributions over the same vocabulary.
A topic can be viewed as a cluster of words with related meanings, and each word
has a corresponding weight inside this topic.
Note that a word from the vocabulary can reside in multiple topics with different
weights.
Topic models do not necessarily require prior knowledge of the texts.
The topics can emerge solely based on analyzing the text.
72. The simplest topic model is latent Dirichlet allocation (LDA), a generative probabilistic model of a corpus proposed by David M. Blei and two other researchers.
In generative probabilistic modeling, data is treated as the result of a generative
process that includes hidden variables.
LDA assumes that there is a fixed vocabulary of words, and the number of the
latent topics is predefined and remains constant.
LDA assumes that each latent topic follows a Dirichlet distribution over the
vocabulary, and each document is represented as a random mixture of latent
topics.
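A minimal sketch of LDA with scikit-learn's LatentDirichletAllocation (the four short reviews are invented, and real corpora need far more text for stable topics):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

reviews = ["the bphone battery and screen are great",
           "bphone coverage and battery life impress me",
           "the bebook text is illegible and slow",
           "slow bebook screen makes reading hard"]

vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(reviews)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

vocab = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):       # per-topic word weights
    top = [vocab[i] for i in weights.argsort()[-4:][::-1]]
    print(f"topic {k}: {top}")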
74. The left side of the figure shows four topics built from a corpus, where each topic contains a list
of the most important words from the vocabulary.
The four example topics are related to problem, policy, neural, and report.
For each document, a distribution over the topics is chosen, as shown in the histogram on the
right.
Next, a topic assignment is picked for each word in the document, and the word from the
corresponding topic (colored discs) is chosen.
In reality, only the documents (as shown in the middle of the figure) are available. The goal of
LDA is to infer the underlying topics, topic proportions, and topic assignments for every
document.
75. Topic models can be used in document modeling, document classification, and collaborative filtering.
Topic models can be applied not only to textual data; they can also help annotate images.
Just as a document can be considered a collection of topics, images can be
considered a collection of image features.
76. Determining Sentiments
Sentiment analysis refers to a group of tasks that use statistics and natural
language processing to mine opinions to identify and extract subjective information
from texts.
Early work on sentiment analysis focused on detecting the polarity of product
reviews from Epinions and movie reviews from the Internet Movie Database (IMDb)
at the document level.
Later work handled sentiment analysis at the sentence level. More recently, the focus has shifted to phrase-level and short-text forms in response to the popularity of micro-blogging services such as Twitter.
77. Determining Sentiments
One can manually construct lists of words with positive sentiments (such as
brilliant, awesome, and spectacular) and negative sentiments (such as awful,
stupid, and hideous).
Related work has pointed out that such an approach can be expected to achieve accuracy around 60%, and it is likely to be outperformed by examination of corpus statistics.
78. Determining Sentiments
Classification methods such as naive Bayes, maximum entropy (MaxEnt), and
support vector machines (SVM) are often used to extract corpus statistics for
sentiment analysis.
Related research has found that these classifiers can score around 80% accuracy on sentiment analysis over unstructured data.
One or more of such classifiers can be applied to unstructured data, such
as movie reviews or even tweets.
79. Determining Sentiments
The movie review corpus by Pang et al. includes 2,000 movie reviews collected from an IMDb
archive of the rec.arts.movies.reviews newsgroup.
These movie reviews have been manually tagged into 1,000 positive reviews and 1,000
negative reviews.
Depending on the classifier, the data may need to be split into training and testing sets.
A rule of thumb for splitting data is to produce a training set much bigger than the testing set.
For example, an 80/20 split would produce 80% of the data as the training set and 20% as the
testing set.
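A minimal sketch of this workflow, assuming the Pang et al. corpus as packaged in NLTK (nltk.corpus.movie_reviews) and a scikit-learn naive Bayes classifier:

import nltk
nltk.download("movie_reviews", quiet=True)
from nltk.corpus import movie_reviews
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

# 2,000 reviews manually tagged pos/neg
texts = [movie_reviews.raw(fid) for fid in movie_reviews.fileids()]
labels = [fid.split("/")[0] for fid in movie_reviews.fileids()]   # 'pos' or 'neg'

# 80/20 split: the training set is much bigger than the testing set
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0)

vec = TfidfVectorizer(stop_words="english")
clf = MultinomialNB().fit(vec.fit_transform(X_train), y_train)

pred = clf.predict(vec.transform(X_test))
print(accuracy_score(y_test, pred))     # typically around 0.8, as noted above
print(confusion_matrix(y_test, pred))   # the table layout discussed below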
80. Determining Sentiments
One or more classifiers are trained over the training set to learn the
characteristics or patterns residing in the data.
The sentiment tags in the testing data are hidden away from the classifiers.
After the training, classifiers are tested over the testing set to infer the
sentiment tags.
Finally, the result is compared against the original sentiment tags to evaluate
the overall performance of the classifier.
81. A confusion matrix is a specific table layout that allows visualization of the
performance of a model over the testing set.
82. Precision and recall are two measures commonly used to evaluate tasks
related to text analysis.
Definitions of precision and recall, in terms of the true positives (TP), false positives (FP), and false negatives (FN) from the confusion matrix, are:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
83. Precision is defined as the percentage of documents in the results that are relevant. If entering the keyword bPhone makes the search engine return 100 documents, and 70 of them are relevant, the precision of the search engine result is 70/100 = 0.7.
Recall is the percentage of returned documents among all relevant documents in the corpus. If entering the keyword bPhone makes the search engine return 100 documents, only 70 of which are relevant, while it fails to return 10 additional relevant documents, the recall is 70/(70 + 10) = 0.875.
84. Precision and recall are important concepts, whether the task is about
information retrieval of a search engine or text analysis over a finite corpus.
A good classifier should ideally achieve both precision and recall close to 1.0.
85. Gaining Insights
Corresponding to the data collection phase, the Data Science team has used
bPhone as the keyword to collect more than 300 reviews from a popular
technical review website.
The 300 reviews are visualized as a word cloud after removing stop words. A
word cloud (or tag cloud) is a visual representation of textual data.
Tags are generally single words, and the importance of each word is shown
with font size or color.
86. IMPORTANT QUESTIONS
1. What are the main challenges of text analysis?
2. What is a corpus?
3. What are common words (such as a, and, of) called?
4. Why can't we use TF alone to measure the usefulness of the words?
5. What is a caveat of IDF? How does TFIDF address the problem?
6. Name three benefits of using the TFIDF.
7. What methods can be used for sentiment analysis?
8. What is the definition of topic in topic models?
9. Explain the trade-offs for precision and recall.