Text Mining
Maurice Masih
13030141093
04/02/15 1
Topics of Discussion
• Introduction
• Text mining compared with other mining
• Text Mining Process
• How an Algorithm Is Derived for Text Mining
• Text Analysis for Google Sheets
• Conclusion
Introduction
• Text mining is the process of deriving high-quality information
  – Non-trivial information
  – From unstructured text
• It is also called text data mining or text analytics.
Need
Biotech Industry
80% of biological knowledge exists only in research
papers (unstructured).
If a scientist manually reads 50 research papers per week and only 10%
of the data is useful, he or she gets the value of only 5 research
papers per week.
Text Mining Comparison with…
Text Mining
• Information Retrieval
• Web Mining
• Data Mining
• Statistics
• Computational Linguistics & Natural Language Processing
Text Mining Process
1. Text
   • Document Clustering
   • Text Characteristics
2. Text Preprocessing
   • Text Cleanup
   • Tokenization
3. Text Transformation
   • Text Representation
   • Feature Selection
4. Attribute Selection
   • Reduce Dimensionality
   • Remove Irrelevant Attributes
5. Data Mining / Pattern Discovery
   • Structured Database
   • Application Dependent
   • Classic Data Mining Techniques
6. Interpretation / Evaluation
   • Terminate or Iterate
1. Text
Document Clustering
✓ Large volume of textual data.
✓ No clear picture of which documents suit the application.
✓ A common technique is k-means clustering.
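The k-means technique named above can be sketched in a few lines. This is a minimal illustration, not the presenter's implementation: it assumes each document has already been turned into a word-count vector, uses Euclidean distance, and seeds the centroids with the first k points so the toy run is reproducible. The function names (`kmeans`, `squared_distance`, `mean_vector`) and the sample vectors are hypothetical.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors: List[Sequence[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(
    points: List[Sequence[float]], k: int, max_iter: int = 100
) -> Tuple[List[int], List[List[float]]]:
    """Cluster points into k groups and return (labels, centroids).

    Assumption: the first k points seed the centroids so the toy run is
    deterministic; practical implementations usually seed at random.
    """
    centroids = [list(p) for p in points[:k]]
    labels = [-1] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_vector(members)
    return labels, centroids


# Hypothetical count vectors over the vocabulary
# ["gene", "protein", "goal", "match"].
docs = [
    [3, 2, 0, 0],
    [0, 0, 4, 1],
    [2, 3, 0, 1],
    [0, 1, 3, 3],
]
labels, centroids = kmeans(docs, k=2)
```

With these toy vectors the two biology-like documents end up in one cluster and the two sport-like documents in the other. On real text the result depends heavily on the choice of k and on how the vectors are built.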
Text Characteristics
✓ Dependency
✓ Ambiguity
✓ Noisy data
✓ Unstructured data
2. Text Preprocessing
Text Cleanup
✓ Remove ads from web pages
✓ Convert from binary formats
✓ Normalize text
✓ Deal with tables, figures and formulas
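The slide does not say which normalization is meant, so the following is only one plausible reading of the "Normalize text" step: Unicode NFKC normalization, lowercasing, and whitespace collapsing, using the Python standard library. The function name `normalize_text` and the sample string are hypothetical.

```python
import re
import unicodedata


def normalize_text(raw: str) -> str:
    """Apply Unicode NFKC normalization, lowercase, and collapse whitespace."""
    # NFKC folds compatibility characters, e.g. a no-break space becomes a
    # plain space and the "fi" ligature becomes the two letters "fi".
    text = unicodedata.normalize("NFKC", raw)
    text = text.lower()
    # Replace any run of whitespace (spaces, tabs, newlines) with one space.
    return re.sub(r"\s+", " ", text).strip()


cleaned = normalize_text("  Text\tMining\u00a0is   \ufb01ne \n")
```

Removing ads, converting binary formats, and handling tables or formulas are source-specific and are not covered by this sketch.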
Tokenization
✓ Splitting a string of characters into a set of tokens.
✓ Need to deal with issues such as apostrophes and hyphens.
✓ Need to deal with tenses, parts of speech, etc.
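As a rough sketch of the tokenization issues listed above, the regular expression below keeps words with internal apostrophes or hyphens together while dropping surrounding punctuation. It assumes plain ASCII English text. The names `TOKEN_PATTERN` and `tokenize` are hypothetical, and tenses and parts of speech are not handled here (those would need a stemmer or tagger).

```python
import re
from typing import List

# A token is a run of letters or digits, optionally joined by internal
# apostrophes or hyphens, so "don't" and "state-of-the-art" stay whole.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")


def tokenize(text: str) -> List[str]:
    """Split text into lowercase tokens, dropping surrounding punctuation."""
    return [match.group(0).lower() for match in TOKEN_PATTERN.finditer(text)]


tokens = tokenize("Don't split state-of-the-art terms like 'text-mining'!")
```

The quotes around 'text-mining' are dropped because an apostrophe only counts as part of a token when it sits between two letters or digits.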
3. Text Transformation
Text Representation
✓ A text document is represented by the words (features) it contains
and their occurrences.
Bag of Words
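A bag of words can be sketched with a plain counter: word order is discarded and only the number of occurrences of each word is kept. The example below is illustrative only; the helper names `bag_of_words` and `to_vector` and the two toy documents are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def bag_of_words(tokens: List[str]) -> Dict[str, int]:
    """Map each distinct word to the number of times it occurs."""
    return dict(Counter(tokens))


def to_vector(bag: Dict[str, int], vocabulary: List[str]) -> List[int]:
    """Lay a bag out as a count vector in a fixed vocabulary order."""
    return [bag.get(word, 0) for word in vocabulary]


# Hypothetical toy documents, already split into lowercase tokens.
doc_a = "text mining finds patterns in text".split()
doc_b = "data mining finds patterns in data".split()

bag_a = bag_of_words(doc_a)
bag_b = bag_of_words(doc_b)

# A shared, sorted vocabulary gives every document the same columns.
vocabulary = sorted(set(bag_a) | set(bag_b))
vector_a = to_vector(bag_a, vocabulary)
vector_b = to_vector(bag_b, vocabulary)
```

Stacking such vectors row by row yields the document-by-term table that the later stages work on.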
3. Text Transformation (contd.)
4. Attribute Selection
Reduction of dimensionality
✓ Learners have difficulty addressing tasks with high dimensionality.
✓ Scarcity of resources and feasibility issues also call for a further
cutback of attributes.
Irrelevant features
✓ Not all features help!
e.g., the existence of a noun in a news article is unlikely to help
classify it as “politics” or “sport”.
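One simple way to cut attributes, shown here only as an assumed example since the slides do not name a method, is to filter terms by document frequency: a term that appears in nearly every document cannot separate "politics" from "sport", and a term that appears only once is too rare to generalize. The function names and thresholds (`min_df`, `max_df_ratio`) are hypothetical.

```python
from typing import Dict, List, Set


def document_frequency(docs: List[List[str]]) -> Dict[str, int]:
    """Count in how many documents each term appears at least once."""
    df: Dict[str, int] = {}
    for tokens in docs:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    return df


def select_terms(
    docs: List[List[str]], min_df: int, max_df_ratio: float
) -> Set[str]:
    """Keep terms that are neither too rare nor present in almost every document."""
    df = document_frequency(docs)
    limit = max_df_ratio * len(docs)
    return {term for term, count in df.items() if min_df <= count <= limit}


# Hypothetical tokenized news snippets.
docs = [
    "the minister won the vote".split(),
    "the striker won the match".split(),
    "the minister lost the vote".split(),
    "the striker lost the match".split(),
]
kept = select_terms(docs, min_df=2, max_df_ratio=0.75)
```

Here "the" is dropped because it occurs in every document. Note that this filter is unsupervised: it still keeps "won" and "lost", which appear in both topics, so a label-aware score would be needed to remove those.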
5. Data Mining / Pattern Discovery
✓ The text mining process merges with the traditional data mining process.
✓ Classic data mining techniques are used on the structured database
that resulted from the previous stages.
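To illustrate what a classic technique on the structured output might look like, the sketch below runs a nearest-neighbour classifier with cosine similarity over a small table of count vectors. The choice of technique is an assumption made for illustration; the slides do not single one out. The names `cosine_similarity` and `nearest_label` and the table contents are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_label(
    query: Sequence[float], table: List[Tuple[Sequence[float], str]]
) -> str:
    """Return the label of the stored row most similar to the query."""
    best_row = max(table, key=lambda row: cosine_similarity(query, row[0]))
    return best_row[1]


# Hypothetical structured table: count vectors over the vocabulary
# ["vote", "minister", "match", "striker"], each with a known label.
table = [
    ([2, 1, 0, 0], "politics"),
    ([1, 2, 0, 0], "politics"),
    ([0, 0, 2, 1], "sport"),
    ([0, 0, 1, 3], "sport"),
]
predicted = nearest_label([0, 1, 3, 1], table)
```

Once the text has been reduced to rows like these, any standard data mining method (classification, clustering, association rules) can be applied in the same way as on numeric data.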
6. Interpretation & Evaluation
What to do next?
✓ Terminate
✓ Iterate
How an Algorithm Is Derived for Text Mining
Text Analysis for Google Sheets
• Perform sentiment analysis
• Extract mentions of entities and concepts
• Summarize long chunks of text
• Detect the language of a document
• Find the best hashtags
• Extract the full text of an article, as well as its author
name, embedded media, etc.
Conclusion
Text mining generally consists of the analysis of (multiple) text
documents by extracting key phrases, concepts, matches, etc., and
the preparation of the text processed in that manner for further
analyses with numeric data mining techniques.