Text Mining

Submitted by,
Gokul K
LE48MCA15
No:28
FISAT

 Defining Text Mining
 Structured vs. Unstructured Data
 Why Text Mining
 Some Text Mining Ambiguities
 Text Mining Practice Areas
 Pre-processing Techniques
 Challenges in Text Mining
 Conclusion

• The use of computational methods and techniques to
extract high-quality information from text
• The discovery by computer of new, previously unknown
information, by automatically extracting information from a
usually large amount of different unstructured textual
resources

 We have a collection of documents (mainly text or
HTML-based)
 We have a set of users
 A user wants to retrieve the documents related to
a given concept
 The user then submits a query expressed
as words or terms
 An information retrieval system returns the
documents most related to this concept
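
A minimal sketch of the retrieval step described above, assuming a smoothed TF-IDF score as the measure of how "related" a document is to the query. The function names `tokenize` and `rank` and the sample documents are illustrative and not part of the original slides.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer; good enough for this illustration.
    return text.lower().split()

def rank(query, documents):
    """Return document indices ordered from most to least related to the query."""
    doc_tokens = [tokenize(d) for d in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in doc_tokens:
        df.update(set(tokens))
    scored = []
    for idx, tokens in enumerate(doc_tokens):
        tf = Counter(tokens)
        # Smoothed TF-IDF (an assumption; many weighting variants exist).
        score = sum(
            tf[t] * math.log((1 + n_docs) / (1 + df[t]))
            for t in tokenize(query)
            if t in tf
        )
        scored.append((score, idx))
    # Highest score first; ties are broken by document order.
    return [idx for score, idx in sorted(scored, key=lambda p: (-p[0], p[1]))]

documents = [
    "the bank raised its interest rates",
    "mary walked along the bank of the river",
    "text mining extracts information from text",
]
ranking = rank("text mining", documents)
print(ranking)
```

The third document is the only one containing the query terms, so it is ranked first.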

 Unstructured text is present in various forms, and
in huge and ever increasing quantities:
1. books
2. financial and other business reports
3. various kinds of business and
administrative documents
4. news articles
 It is estimated that ~80% of all available data is
unstructured

 To enable effective and efficient use of such huge
quantities of textual content, we need
computational methods for
1. automated extraction of information from
unstructured text
2. analysis and summarization of the extracted
information
 Text mining (TM) research and practice focus on the
development, continual improvement and
application of such methods

 Language is ambiguous
 Context is needed to clarify
 The same words can have different meanings
 Bear (verb) – to support or carry
 Bear (noun) – a large animal
 Different words can mean the same (synonyms)
 Language is subtle (difficult to analyse)
 Concept / word extraction usually results in a huge number of
dimensions
 Thousands of new fields
 Each field typically has low information content (sparse)
 Misspellings, abbreviations, spelling variants
 Together, these issues render search engines, SQL queries, etc. ineffective
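
The dimensionality and sparsity points above can be made concrete with a small sketch. It builds a document-term count matrix (one field per unique word) and measures the share of zero cells. The helper names and the three sample documents are assumptions made for illustration.

```python
def document_term_matrix(documents):
    """Build a count matrix with one column per unique word."""
    vocab = sorted({w for d in documents for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = []
    for d in documents:
        row = [0] * len(vocab)
        for w in d.lower().split():
            row[index[w]] += 1
        matrix.append(row)
    return vocab, matrix

def sparsity(matrix):
    """Fraction of cells that are zero."""
    cells = sum(len(row) for row in matrix)
    zeros = sum(value == 0 for row in matrix for value in row)
    return zeros / cells

documents = [
    "the bank raised rates",
    "the river bank",
    "text mining finds patterns",
]
vocab, matrix = document_term_matrix(documents)
print(len(vocab), "fields, sparsity:", round(sparsity(matrix), 3))
```

Even with three short documents, more than half of the cells are zero; with a realistic corpus the vocabulary grows to thousands of fields and the matrix becomes far sparser.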

 Homonymy: same word, different meaning
Mary walked along the bank of the river
HarborBank is the richest bank in the city
 Synonymy: different words with the same or
similar meaning; one word can often, but not
always, be substituted for the other without
changing the meaning.
Miss Nelson became a kind of big sister to Benjamin
Miss Nelson became a kind of large sister to Benjamin
(here ‘large’ cannot replace ‘big’ without changing the meaning)

 Polysemy: same word or form, but different,
albeit related meaning
The bank raised its interest rates yesterday
The store is next to the newly constructed bank
The bank appeared first in Italy in the Renaissance
 Hyponymy: Concept hierarchy or subclass
Animal (noun) – cat, dog
Injury – broken leg, contusion

 Search and Information Retrieval – storage and
retrieval of text documents, including search
engines and keyword search
 Document Clustering – Grouping and categorizing
terms, snippets, paragraphs or documents using
clustering methods
 Document Classification – grouping and
categorizing snippets, paragraphs or documents
using data mining classification methods, based on
models trained on labelled examples
 Web Mining – data and text mining on the
internet, with a specific focus on the scale and
interconnectedness of the web

 Information Extraction – Identification and
extraction of relevant facts and relationships from
unstructured text
 Natural Language Processing – low-level language
processing and understanding tasks (e.g.
part-of-speech tagging)
 Concept extraction – Grouping of words and
phrases into semantically similar groups

 Document – a sequence of words and punctuation,
following the grammatical rules of the language.
 Term – usually a word, but can be a word-pair or
phrase
 Corpus – a collection of documents
 Lexicon – the set of all unique words in the corpus
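
The four definitions above can be illustrated in a few lines. This is a sketch only: the two-document corpus and the helper names `terms` and `word_pairs` are made up for the example.

```python
import re

# Corpus: a collection of documents.
corpus = [
    "The lead paint is unsafe.",
    "The paint is dry.",
]

def terms(document):
    """Single-word terms of a document (lower-cased, punctuation dropped)."""
    return re.findall(r"[a-z]+", document.lower())

def word_pairs(document):
    """Word-pair terms: each word together with the word that follows it."""
    words = terms(document)
    return list(zip(words, words[1:]))

# Lexicon: the set of all unique words in the corpus.
lexicon = sorted({t for document in corpus for t in terms(document)})

print(lexicon)
print(word_pairs(corpus[0]))
```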

 Text Normalization
 Parts of Speech Tagging
 Removal of stop words
 Stop words – common words that don’t add
meaningful content to the document
 Stemming
 Removing suffixes and prefixes, leaving the root or stem of
the word.
 Tokenization
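
Stop-word removal from the list above can be sketched as a simple filter. The `STOP_WORDS` set here is a tiny illustrative sample, not a standard list; real projects typically use a much longer, language-specific one.

```python
# Illustrative stop-word list (an assumption; not exhaustive).
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not stop words (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ["The", "lead", "paint", "is", "unsafe"]
content_words = remove_stop_words(tokens)
print(content_words)
```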

 Case
 Make all text lower case (if you don’t care about proper
nouns, titles, etc.)
 Clean up transcription and typing errors
 e.g. ‘do n’t’, ‘movei’
 Correct misspelled words
 Phonetically
 Use fuzzy matching algorithms such as Soundex,
Metaphone or string edit distance
 Dictionaries
 Use POS and context to make a good guess
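
As a sketch of the "string edit distance" option above, the following corrects a misspelled word by picking the closest dictionary entry under Levenshtein distance. The three-word dictionary and the `max_distance` threshold are assumptions chosen for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def correct(word, dictionary, max_distance=2):
    """Return the closest dictionary word, or the lower-cased input
    unchanged if nothing is within max_distance."""
    word = word.lower()
    best = min(dictionary, key=lambda w: (edit_distance(word, w), w))
    return best if edit_distance(word, best) <= max_distance else word

dictionary = ["movie", "mining", "text"]  # hypothetical word list
print(correct("movei", dictionary))
```

Plain Levenshtein distance counts the swapped letters in ‘movei’ as two edits. A larger dictionary will often contain several candidates at the same distance, which is where POS and context help to choose.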

 POS tagging is a process of assigning a POS or
lexical class marker to each word in a sentence
(and all sentences in a corpus).
 Input: the lead paint is unsafe
 Output: the/Det lead/N paint/N is/V
unsafe/Adj

 Tokenization is the process of breaking a stream
of text up into words, phrases, symbols, or other
meaningful elements called tokens.
 Converts streams of characters into words
 Tokens or words are separated by whitespace,
punctuation marks or line breaks.
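
A minimal sketch of such a tokenizer using a regular expression: runs of word characters become word tokens and each punctuation mark becomes its own token. The choice to keep punctuation as tokens is an assumption; many pipelines discard it.

```python
import re

# \w+ matches a run of letters, digits or underscores;
# [^\w\s] matches a single punctuation mark or symbol.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split a stream of characters into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenize("The lead paint is unsafe.")
print(tokens)
```

Whitespace and line breaks are consumed as separators. Note that this simple pattern splits contractions such as "don't" into three tokens, one of the cases a production tokenizer has to handle explicitly.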

 Normalizes / unifies variations of the same data
 ‘walking’, ‘walks’, ‘walked’ → walk
 Inflectional stemming
 Remove plurals
 Normalize verb tenses
 Remove other affixes
 Stemming to root
 Reduce word to most basic element
 More aggressive than inflectional
 ‘Apply’, ‘applications’, ‘reapplied’ → apply
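
Inflectional stemming can be sketched as plain suffix stripping. The suffix list and the minimum stem length below are assumptions for illustration; this is not the Porter algorithm or any other established stemmer.

```python
# Suffixes are tried in this order; the first match is removed.
SUFFIXES = ("ing", "ed", "es", "s")
MIN_STEM_LENGTH = 3  # avoid reducing short words such as "is" or "red"

def inflectional_stem(word):
    """Strip one common inflectional suffix, if the remaining stem is long enough."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word

stems = [inflectional_stem(w) for w in ("walking", "walks", "walked")]
print(stems)
```

Stemming to root, as in the ‘apply’ example, additionally requires prefix removal and spelling rules (for instance mapping ‘appli’ back to ‘apply’), which this sketch deliberately leaves out.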

 The foremost problem in text mining is the ambiguity
of language, i.e. the capability of being understood in
two or more possible senses. One word or phrase
may have multiple meanings, which leads to
ambiguity.
 In fields like bioinformatics, there are multiple names
for a single gene or protein, which may also lead to
ambiguity.

 Another problem with text mining arises when we
use social media data, i.e. status updates,
tweets, comments, reviews, etc. Most people use
slang such as “btw” for by the way or “ppl” for
people. These words do not exist in the
dictionary, so they affect the mining
results.
 A further problem is cleaning the
data: if we extract online texts, we also get the
reference addresses of the images linked with the
text, and those references are hard to remove.
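
Both problems above can be partly addressed with simple cleaning rules. The sketch below assumes a small hand-made slang dictionary and a regular expression that covers only `<img>` tags and links ending in common image extensions; real web text needs a broader set of rules.

```python
import re

# Hypothetical slang dictionary built from the examples above.
SLANG = {"btw": "by the way", "ppl": "people"}

# Matches HTML <img ...> tags or bare links to common image file types.
IMAGE_REF = re.compile(
    r"<img\b[^>]*>|https?://\S+\.(?:png|jpe?g|gif)\b",
    re.IGNORECASE,
)

def clean(text):
    """Remove image references, then expand known slang words."""
    text = IMAGE_REF.sub(" ", text)
    words = [SLANG.get(word.lower(), word) for word in text.split()]
    return " ".join(words)

raw = 'btw ppl love this <img src="http://x.example/a.png"> movie'
print(clean(raw))
```

Slang attached to punctuation (for example "btw,") is not expanded by this lookup; tokenizing before the lookup would handle that case.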

Text analysis is a useful technique for
deriving results from textual data. By
using text mining techniques we can extract public
reviews, classify text into predefined classes,
summarize documents and group or
cluster multiple documents.

