This document discusses using R for Twitter data analytics. It outlines the basics of Twitter data analytics using R, including collecting real-time Twitter data, text mining techniques for Twitter data, and sentiment analysis. Some key steps involved are exploring the Twitter corpus, preprocessing the text by removing stopwords and stemming words, creating a document-term matrix, and calculating TF-IDF weights. Cosine similarity is used to measure similarity between text documents. The goal is to extract useful patterns and insights from large amounts of Twitter data in real-time.
2. Outline
• Big data and Social Media data
• Social media analytics
• Why Social media Analytics
• Real-time Twitter data analytics
• Text mining for Twitter data
• Basics of Twitter data analytics using R
• Summary
3. Introduction
• An enormous amount of data is produced online, and it grows by the second.
• It is humanly difficult to keep up with the rate of data generation on Twitter and Facebook.
• Devise advanced analytical models that combine machine learning, data mining, and NLP algorithms to support informed decisions about time-sensitive events.
Big Data Analytics
• Volume: petabyte, exabyte; faster processing
• Velocity: batch, near real time, real time; improve performance
• Variety: structured, semi-structured, unstructured; increase accuracy
Social Media Sentiment Analytics
• Positive, neutral, negative
4. Social media: a new way to predict the future by understanding the present and to make informed decisions
SOCIAL MEDIA REVOLUTION
7. Social Media Analytics
• Social media has given people a new way to communicate and to share their opinions, interests, and sentiments with the world.
• A huge amount of unstructured data is generated on social media platforms such as Facebook, Twitter, and LinkedIn.
• Social media analytics deals with the development and evaluation of tools and frameworks to collect, monitor, analyze, summarize, and visualize social media data.
• It extracts useful patterns and information.
8. Why Social Media Analytics?
• Social media – an integral part of the daily routine that is changing the way people communicate across the globe
• Mass opinion matters – political parties; government policies; movies; products and services; individuals; organizations
• Trending topics can reveal people's intentions and interests and, importantly, current events
9. Applications of Social Media Analytics
• Retail companies: to harness brand awareness, improve service, shape advertising and marketing strategies, and identify influencers
• Finance: to determine market sentiment and to use news data for trading
• Government and public officials:
• Monitoring public perception of political candidates, election campaigns, and announcements
• Predicting national-level indicators such as happiness and unemployment
• Social media job loss index: econprediction.eecs.umich.edu
• An article on real world applications
• Sudden change in behavior
10. Real time analytics
• The pulse of society can be found on social media in real time.
• Analyzing social media content in real time helps social scientists predict what is coming and take quick, relevant action in time.
• What is trending on social media right now may not have been popular an hour ago.
11. Why Twitter?
• Twitter is a social microblogging platform (short text messages of 140 characters).
• 500 million tweets are generated every day (http://www.internetlivestats.com/twitter-statistics/).
• Users often discuss current affairs and share personal views on various subjects.
• Views and sentiments cover any subject, from newly launched products to favorite movies and music to political decisions.
• The Twitter audience ranges from ordinary people to celebrities.
• Tweets are public and hence accessible to researchers, unlike content on most social network sites.
• Tweets are reliably time-stamped, so they can be analyzed from a temporal perspective.
12. Facts
• The female vs. male ratio of social media users is as follows:
• Facebook – 60% female/40% male;
• Twitter – 60% female/40% male;
• Pinterest – 79% female/21% male;
• Google Plus – 29% female/71% male;
• LinkedIn – 55% female/45% male.
• YouTube has over 1 billion unique visitors per month
• 91% of mobile Internet access is for social activities, and 73% of smartphone owners access social networks through apps at least once per day.
• Every minute of every day, 684,478 pieces of content are shared on Facebook, 3,600 new photos are posted on Instagram, and 2,083 check-ins are made on Foursquare.
• LinkedIn has over 3 million company pages
• According to this study, mothers with children under the age of 5 are the most active on social media.
13. Challenges
• Tweets are highly unstructured and also non-grammatical
• Out of Vocabulary Words
• Lexical Variation
• Extensive usage of acronyms like asap, lol, afaik
14. Text Analysis
• Text analysis: extract or classify information from text such as tweets, emails, chats, documents, etc.
• Some popular examples are:
• Spam filtering: one of the best-known and most widely used text classification applications (assigning a category to a text). Spam filters learn to classify an email or message as spam depending on its content and subject.
• Sentiment analysis: another text classification application, in which an algorithm must learn to classify an opinion as positive, neutral, or negative depending on the mood expressed by the writer.
• Information extraction: learning to extract a particular piece of information or data from a text, for example addresses, entities, keywords, etc.
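To make the positive / neutral / negative labels concrete, here is a minimal sketch in base R. It is not a learned classifier of the kind the slide describes: it swaps in a simple dictionary lookup, and the word lists and the function name `classify_sentiment` are hypothetical, chosen only for illustration.

```r
# Hypothetical, deliberately tiny sentiment dictionaries (illustration only)
positive_words <- c("good", "great", "love", "happy", "excellent")
negative_words <- c("bad", "poor", "hate", "sad", "terrible")

classify_sentiment <- function(text) {
  # Lowercase the text and split on anything that is not a letter
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  # Score = number of positive words minus number of negative words
  score <- sum(tokens %in% positive_words) - sum(tokens %in% negative_words)
  if (score > 0) "positive" else if (score < 0) "negative" else "neutral"
}

classify_sentiment("I love this phone, great camera")
classify_sentiment("Terrible service, bad food")
classify_sentiment("The bus arrives at noon")
```

A dictionary lookup like this ignores negation, sarcasm, and the acronyms and spelling variants typical of tweets, which is one reason trained classifiers are usually preferred.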
15. Why is Sentiment Analysis Important?
• 93% of marketers use social media. However, only 9% of marketing companies have full-time bloggers.
• Around 46% of web users turn to social media when making a purchase.
• A government or political party may want to know whether people support its program.
• Before investing in a company, one can use public sentiment toward the company to find out where it stands.
• A company might want to find out how its products are reviewed (as on Amazon).
• Economics: predicting financial markets. Used by corporations to monitor stock markets.
• Elections:
1. Analyzing election-related chatter
2. Finding sentiment by party or person
3. Finding what people like and dislike about a party or person
4. Finding the major reasons behind success or failure
5. Finding major trends in an election
6. Analyzing the impact of non-political movements that are linked to politics (such as the Anna and Ramdev movements)
17. Data Processing Steps
• Explore corpus – Understand the types of variables, their functions, permissible values, and so on.
• Some formats, including HTML and XML, contain tags and other data structures that provide additional metadata.
• Convert text to lowercase – This avoids distinguishing between words simply by case.
• Remove numbers (if required) – Numbers may or may not be relevant to our analyses.
• Remove English stop words – Stop words are common words found in a language.
• Words like "for", "of", and "are" are common stop words.
• Remove own stop words (if required) – Instead of, or in addition to, English stop words, we can remove our own stop words.
• Strip white space – Eliminate extra white space.
• Stemming – Transforms a word to its root.
• Stemming uses an algorithm that removes common word endings for English words, such as "es", "ed", and "'s".
• For example, "computer" and "computers" both become "comput".
• Lemmatisation – Transforms a word to its dictionary base form; for example, "produce" and "produced" both become "produce".
• Sparse terms – We are often not interested in infrequent terms in our documents. Such "sparse" terms should be removed from the document term matrix.
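The cleaning steps above can be sketched in base R as follows. This is a minimal illustration, not the deck's own code: the stop word list is a short hypothetical sample, the function name `preprocess` is made up, and punctuation removal is an added assumption needed to tokenise cleanly. Stemming and lemmatisation are left out because base R has no built-in stemmer.

```r
# Short sample stop word list (an assumption; real lists are much longer)
stop_words <- c("for", "of", "are", "the", "a", "is", "and", "to", "in")

preprocess <- function(text, extra_stop_words = character(0)) {
  text <- tolower(text)                          # convert text to lowercase
  text <- gsub("[0-9]+", " ", text)              # remove numbers
  text <- gsub("[^a-z ]", " ", text)             # drop punctuation (added assumption)
  tokens <- strsplit(trimws(text), " +")[[1]]    # strip white space, then tokenise
  # Remove English stop words plus any user-supplied ones
  tokens[!tokens %in% c(stop_words, extra_stop_words)]
}

preprocess("Statistics skills are important for 2 analysts!")
preprocess("RT the Game", extra_stop_words = "rt")
```

The `extra_stop_words` argument corresponds to the "own stop words" step, for example to drop retweet markers or a search keyword that appears in every tweet.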
18. Document Term Matrix
• Document term matrix – A matrix with documents as rows and terms as columns, in which each cell holds the count of a term in a document.
• Calculate term weight – TF-IDF
• How frequently does a term appear?
TF: Term Frequency, TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
• How important is a term?
DF: Document Frequency, DF(t) = d / D, where d is the number of documents containing term t and D is the size of the document collection
• To normalize: taking the logarithm compresses the scale of values so that very large and very small quantities can be compared smoothly
• IDF: Inverse Document Frequency
IDF(t) = log10(Total number of documents / Number of documents with term t in it)
Example:
Consider a document containing 100 words in which the word CAR appears 3 times
TF(CAR) = 3 / 100 = 0.03
Now assume we have 10 million documents and the word CAR appears in one thousand of them
IDF(CAR) = log10(10,000,000 / 1,000) = 4
The TF-IDF weight is the product of these quantities: 0.03 * 4 = 0.12
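The worked example can be checked with a few lines of base R. This is a sketch: the function names `tf`, `idf`, and `tf_idf_matrix` are illustrative, and the base-10 logarithm is assumed because it is the base that yields the value 4 in the example. The small count matrix at the end is made-up data.

```r
# TF: share of a document's terms accounted for by the term of interest
tf <- function(term_count, doc_length) term_count / doc_length

# IDF with a base-10 logarithm (assumed, since it reproduces the value 4)
idf <- function(n_docs, n_docs_with_term) log10(n_docs / n_docs_with_term)

tf_car  <- tf(3, 100)        # 0.03
idf_car <- idf(1e7, 1e3)     # 4
tf_car * idf_car             # 0.12

# The same weighting applied to a whole document term matrix of counts
tf_idf_matrix <- function(dtm) {
  tf_mat  <- dtm / rowSums(dtm)                    # term frequency within each document
  idf_vec <- log10(nrow(dtm) / colSums(dtm > 0))   # one IDF value per term
  sweep(tf_mat, 2, idf_vec, `*`)                   # scale each column by its IDF
}

# Made-up counts: 2 documents (rows) by 3 terms (columns)
dtm <- matrix(c(2, 0, 1, 1, 0, 3), nrow = 2,
              dimnames = list(c("doc1", "doc2"), c("car", "road", "bike")))
tf_idf_matrix(dtm)
```

In the toy matrix, "road" occurs in both documents, so its IDF is log10(2 / 2) = 0 and its weight vanishes: a term that appears everywhere does not help tell documents apart.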
19. Similarity Distance Measure (Cosine)
• Why Cosine?
• As a general observation, cosine similarity works better than Euclidean distance for text data.
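One commonly cited reason is that cosine similarity depends only on the direction of the term vectors, not on their length, so a long and a short document about the same topic still look alike. The sketch below illustrates this with made-up term counts; the function names are illustrative.

```r
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

euclidean_distance <- function(a, b) sqrt(sum((a - b)^2))

doc        <- c(1, 2, 1, 0)   # hypothetical term counts for a short text
doc_double <- 2 * doc         # the same text repeated twice

cosine_similarity(doc, doc_double)    # 1, up to floating-point rounding
euclidean_distance(doc, doc_double)   # sqrt(6), about 2.45
```

The two documents use the same words in the same proportions, yet the Euclidean distance between them is non-zero and grows with every further repetition, while the cosine similarity stays at 1.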
20. Calculate Cosine similarity
• Example:
• Text 1: statistics skills and programming skills are equally important for analytics
• Text 2: statistics skills and domain knowledge are important for analytics
• Text 3 : I like reading books and travelling
• The document term matrix for the three texts above gives one row vector per text:
• T1 = (1,2,1,1,0,1,1,1,1,1,0,0,0,0,0,0)
• T2 = (1,1,1,0,1,1,0,1,1,1,1,0,0,0,0,0)
• T3 = (0,0,1,0,0,0,0,0,0,0,0,1,1,1,1,1)
• Degree of Similarity (T1 & T2) = (T1 %*% T2) / (sqrt(sum(T1^2)) * sqrt(sum(T2^2))) = 77%
• Degree of Similarity (T1 & T3) = (T1 %*% T3) / (sqrt(sum(T1^2)) * sqrt(sum(T3^2))) = 12%
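The two similarity figures can be reproduced in base R by building the document term matrix directly from the three texts. This is a sketch under simple assumptions: texts are lowercased and split on spaces with no stop word removal, and the column order follows first appearance, so it may differ from the vectors on the slide. Cosine similarity does not depend on column order.

```r
texts <- c(
  T1 = "statistics skills and programming skills are equally important for analytics",
  T2 = "statistics skills and domain knowledge are important for analytics",
  T3 = "I like reading books and travelling"
)

tokens <- strsplit(tolower(texts), " +")   # one token vector per text
vocab  <- unique(unlist(tokens))           # 16 distinct terms

# Count each vocabulary term per text; vapply returns terms x texts, so transpose
counts <- vapply(tokens,
                 function(tk) tabulate(match(tk, vocab), nbins = length(vocab)),
                 integer(length(vocab)))
dtm <- t(counts)
dimnames(dtm) <- list(names(texts), vocab)

# Same formula as on the slide
cosine_similarity <- function(a, b) {
  as.numeric((a %*% b) / (sqrt(sum(a^2)) * sqrt(sum(b^2))))
}

round(cosine_similarity(dtm["T1", ], dtm["T2", ]), 2)   # 0.77
round(cosine_similarity(dtm["T1", ], dtm["T3", ]), 2)   # 0.12
```

T1 and T3 share only the word "and", which is why their similarity is low but not zero; removing stop words first would bring it down to 0.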