Submitted by,
Gokul K
LE48MCA15
No:28
FISAT
 Defining Text Mining
 Structured vs. Unstructured Data
 Why Text Mining
 Some Text Mining Ambiguities
 Text Mining Practice Areas
 Pre-processing Techniques
 Challenges in Text Mining
 Conclusion
• The use of computational methods and techniques to
extract high-quality information from text
• The discovery by computer of new, previously unknown
information, by automatically extracting it from a
usually large number of different unstructured textual
resources
 We have a collection of documents (mainly text or
HTML-based)
 We have a set of users
 A user wants to retrieve the documents related to
a given concept
 The user then submits a query expressed
through words or terms
 An information retrieval system returns the
documents most related to this concept
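The retrieval scenario above can be sketched in a few lines of Python. This is only a minimal illustration: the relevance score (a count of query-term occurrences) and the names `tokenize` and `rank_documents` are assumptions made for the example, not the method of any particular information retrieval system.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rank_documents(query, documents):
    """Return (doc_index, score) pairs sorted by descending score.

    The score is the number of occurrences of query terms in the document,
    a deliberately naive relevance measure used only for illustration.
    """
    query_terms = set(tokenize(query))
    scored = []
    for index, doc in enumerate(documents):
        counts = Counter(tokenize(doc))
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scored.append((index, score))
    # Python's sort is stable, so equally scored documents keep their order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

documents = [
    "The bank raised its interest rates yesterday",
    "Mary walked along the bank of the river",
    "Text mining extracts information from text",
]
results = rank_documents("text mining", documents)
```

Real systems typically weight terms (for example with TF-IDF) rather than counting raw matches.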
 Unstructured text is present in various forms, and
in huge and ever-increasing quantities:
1. books
2. financial and other business reports
3. various kinds of business and
administrative documents
4. news articles
 It is estimated that ~80% of all available data is
unstructured
 To enable effective and efficient use of such huge
quantities of textual content, we need
computational methods for
1. automated extraction of information from
unstructured text
2. analysis and summarization of the extracted
information
 Text mining (TM) research and practice focus on the
development, continual improvement and
application of such methods
 Language is ambiguous
 Context is needed to clarify
 The same word can have different meanings
 Bear (verb) – to support or carry
 Bear (noun) – a large animal
 Different words can mean the same thing (synonyms)
 Language is subtle (difficult to analyse)
 Concept / word extraction usually results in a huge number of
dimensions
 Thousands of new fields
 Each field typically has low information content (sparse)
 Misspellings, abbreviations, spelling variants
 These render search engines, SQL queries, etc. ineffective
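The point about many dimensions with low information content can be made concrete with a small document-term matrix. The sketch below is an illustration only; the helper names `document_term_matrix` and `sparsity` are made up for this example, and the sample sentences are taken from elsewhere in this deck.

```python
import re

def tokenize(text):
    """Lowercase the text and keep alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def document_term_matrix(documents):
    """Build one count column ("field") per unique term in the corpus."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    column = {term: i for i, term in enumerate(vocabulary)}
    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for term in tokens:
            row[column[term]] += 1
        matrix.append(row)
    return vocabulary, matrix

def sparsity(matrix):
    """Fraction of cells in the matrix that are zero."""
    cells = sum(len(row) for row in matrix)
    zeros = sum(row.count(0) for row in matrix)
    return zeros / cells if cells else 0.0

docs = [
    "The bank raised its interest rates yesterday",
    "Mary walked along the bank of the river",
    "The store is next to the newly constructed bank",
]
vocabulary, matrix = document_term_matrix(docs)
```

Even with three short sentences, more than half of the cells are zero; on a realistic corpus the vocabulary runs to thousands of columns and the matrix becomes far sparser.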
 Homonymy: same word, different meaning
Mary walked along the bank of the river
HarborBank is the richest bank in the city
 Synonymy: different words with similar or the
same meaning; one word can often be substituted for the
other without changing the meaning, though not in every context
Miss Nelson became a kind of big sister to Benjamin
Miss Nelson became a kind of large sister to Benjamin
 Polysemy: same word or form, but different,
albeit related, meanings
The bank raised its interest rates yesterday
The store is next to the newly constructed bank
The bank appeared first in Italy in the Renaissance
 Hyponymy: concept hierarchy or subclass
Animal (noun) – cat, dog
Injury – broken leg, contusion
 Search and Information Retrieval – storage and
retrieval of text documents, including search
engines and keyword search
 Document Clustering – Grouping and categorizing
terms, snippets, paragraphs or documents using
clustering methods
 Document Classification – grouping and
categorizing snippets, paragraphs or documents
using data mining classification methods, based on
models trained on labelled examples
 Web Mining – data and text mining on the
internet, with a specific focus on the scale and
interconnectedness of the web
 Information Extraction – Identification and
extraction of relevant facts and relationships from
unstructured text
 Natural Language Processing – low-level language
processing and understanding tasks (e.g. part-of-speech
tagging)
 Concept extraction – Grouping of words and
phrases into semantically similar groups
 Document – a sequence of words and punctuation,
following the grammatical rules of the language.
 Term – usually a word, but can be a word-pair or
phrase
 Corpus – a collection of documents
 Lexicon – set of all unique words in corpus
 Text Normalization
 Part-of-Speech Tagging
 Removal of stop words
 Stop words – common words that don’t add
meaningful content to the document
 Stemming
 Removing suffixes and prefixes, leaving the root or stem of
the word.
 Tokenization
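Of the techniques listed above, stop-word removal is the simplest to show. The sketch below assumes a tiny hand-picked stop list (`STOP_WORDS` is illustrative, not a standard list) and a hypothetical helper `remove_stop_words`.

```python
# Illustrative stop list; real lists contain a few hundred words.
STOP_WORDS = {"a", "an", "the", "is", "of", "to", "in", "and", "its"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop list."""
    return [token for token in tokens if token.lower() not in stop_words]

tokens = "The lead paint is unsafe".split()
content_tokens = remove_stop_words(tokens)
```

Here `content_tokens` keeps only `lead`, `paint` and `unsafe`, the words that carry the content of the sentence.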
 Case
 Make all lower case (if you don’t care about proper
nouns, titles, etc)
 Clean up transcription and typing errors
 e.g. do n’t, movei
 Correct misspelled words
 Phonetically
 Use fuzzy matching algorithms such as Soundex,
Metaphone or string edit distance
 Dictionaries
 Use POS and context to make a good guess
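As one possible reading of the "string edit distance" option above, the sketch below lowercases a word and replaces it with the closest dictionary entry by Levenshtein distance. The dictionary contents, the `max_distance` threshold and the function names are assumptions for illustration; Soundex and Metaphone are not shown.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]

def correct(word, dictionary, max_distance=2):
    """Return the closest dictionary word, or the lowercased word itself
    when nothing lies within max_distance edits."""
    word = word.lower()
    if word in dictionary:
        return word
    # sorted() makes the choice deterministic when distances tie.
    best = min(sorted(dictionary), key=lambda cand: edit_distance(word, cand))
    return best if edit_distance(word, best) <= max_distance else word

# Hypothetical mini-dictionary for the example.
DICTIONARY = {"movie", "mining", "text", "bank"}
fixed = correct("Movei", DICTIONARY)
```

With this dictionary the typo `movei` from the slide is mapped to `movie`, since swapping two adjacent letters costs two edits under plain Levenshtein distance.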
 Part-of-speech (POS) tagging is the process of assigning a POS or
lexical class marker to each word in a sentence
(and to all sentences in a corpus).
 Input: the lead paint is unsafe
 Output: the/Det lead/N paint/N is/V
unsafe/Adj
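The input and output format above can be reproduced with a toy lookup tagger. This is only a sketch: `TOY_LEXICON` is a hypothetical hand-written table covering the example sentence, and the tagger ignores context, so it cannot tell "lead" the noun from "lead" the verb the way a trained tagger would.

```python
# Hypothetical toy lexicon: each word maps to a single assumed tag.
TOY_LEXICON = {
    "the": "Det",
    "lead": "N",
    "paint": "N",
    "is": "V",
    "unsafe": "Adj",
}

def tag_sentence(sentence, lexicon=TOY_LEXICON, default_tag="N"):
    """Tag each whitespace-separated word by dictionary lookup.

    Unknown words fall back to default_tag, a common baseline choice.
    """
    return [(word, lexicon.get(word.lower(), default_tag))
            for word in sentence.split()]

def format_tagged(tagged):
    """Render (word, tag) pairs in word/Tag notation."""
    return " ".join(f"{word}/{tag}" for word, tag in tagged)

tagged = tag_sentence("the lead paint is unsafe")
output = format_tagged(tagged)
```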
 Tokenization is the process of breaking a stream
of text up into words, phrases, symbols, or other
meaningful elements called tokens.
 Converts streams of characters into words
 Tokens or words are separated by whitespace,
punctuation marks or line breaks.
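A minimal tokenizer along these lines can be written with a regular expression. The pattern below is one assumed choice among many: it keeps words (including internal straight apostrophes) and emits each punctuation mark as its own token.

```python
import re

# A word (optionally with internal apostrophes) or a single
# non-space, non-word symbol.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

def tokenize(text):
    """Split a stream of characters into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenize("Tokens are separated by whitespace, punctuation\nor line breaks.")
```

Whether punctuation is kept as tokens or discarded depends on the downstream task; dropping the second alternative of the pattern would keep only the words.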
 Stemming normalizes / unifies variations of the same word
 ‘walking’, ‘walks’, ‘walked’ → walk
 Inflectional stemming
 Remove plurals
 Normalize verb tenses
 Remove other affixes
 Stemming to root
 Reduce word to its most basic element
 More aggressive than inflectional
 ‘apply’, ‘applications’, ‘reapplied’ → apply
 The foremost problem in text mining is the ambiguity
of language, i.e. the capability of being understood in
two or more possible senses. Because one word or phrase
may have multiple meanings, its intended meaning can be
unclear.
 In fields like bioinformatics there are multiple names
for a single gene or protein, which may also lead to
ambiguity.
 Another problem arises when we
use social media data, i.e. status updates,
tweets, comments, reviews etc. Most people use
slang such as “btw” for “by the way” or “ppl” for
“people”. These words do not exist in the
dictionary, which is why they affect the mining
results.
 Another problem with text mining is cleaning the
data: if we extract online text, we also get the
reference addresses (URLs) of the images linked with the
text, and those references are hard to remove.
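Both cleaning problems can be partly handled with simple rules. The sketch below is only an illustration under stated assumptions: the slang table holds just the two examples from the text, and the two regular expressions (for `<img>` tags and for URLs ending in a few image extensions) are assumed patterns that will miss many real-world cases.

```python
import re

# The two slang examples from the text; a real table would be much larger.
SLANG = {"btw": "by the way", "ppl": "people"}

# Assumed patterns: HTML image tags and bare URLs with image extensions.
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
IMAGE_URL = re.compile(r"https?://\S+\.(?:png|jpe?g|gif)\b\S*", re.IGNORECASE)

def clean_social_text(text, slang=SLANG):
    """Remove image references, then expand known slang words."""
    text = IMG_TAG.sub(" ", text)
    text = IMAGE_URL.sub(" ", text)
    words = [slang.get(word.lower(), word) for word in text.split()]
    return " ".join(words)

raw = 'btw ppl love this <img src="http://example.com/a.png"> see http://example.com/b.jpg'
cleaned = clean_social_text(raw)
```

The slang lookup works on whitespace-separated words, so a token such as `btw,` with attached punctuation is left unchanged; tokenizing first would avoid that.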
Text analysis is a fascinating technique for
deriving useful results from textual data. By
using text mining techniques we can extract public
reviews, classify text into predefined classes,
summarize documents and group or
cluster multiple documents.
 https://en.wikipedia.org/wiki/Text_mining
 http://searchbusinessanalytics.techtarget.com/definition/text-mining
 https://www.ijircce.com/upload/2016/april/40_Text.pdf