
Introduction to Text Mining

Published in: Technology, Education
  1. Text mining: Introduction and data preparation
  2. Overview of Text Mining
     - What is Text Mining?
     - Text mining, "also known as intelligent text analysis, text data mining or knowledge discovery in text (KDT), refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text."
  3. Need for Text Mining
     - We can better understand the need for text mining with a practical example from the biotech industry.
     - About 80% of biological knowledge exists only in research papers (unstructured data).
     - If a scientist manually reads 50 research papers per week and only 10% of that material is useful, he or she effectively gets through only 5 useful papers per week.
  4. Need for Text Mining
     - Meanwhile, online databases such as Medline add more than 10,000 abstracts per month.
     - Text mining dramatically increases the rate at which relevant data can be gathered from such volumes, which shows why it is needed.
  5. Challenges in Text Mining
     - Information is in unstructured textual form.
     - Textual databases are very large, and almost all publications are now available in electronic form.
     - Very high number of possible "dimensions" (but sparse): all possible word and phrase types in the language!
     - Complex and subtle relationships between concepts in text.
  6. Challenges in Text Mining
     - Different wordings for the same event: "AOL merges with Time-Warner" vs. "Time-Warner is bought by AOL".
     - Word ambiguity and context sensitivity: automobile = car = vehicle = Toyota; Apple (the company) or apple (the fruit).
     - Noisy data, e.g. spelling mistakes.
  7. Text Mining Process
     - Text preprocessing: syntactic/semantic text analysis.
     - Feature generation: bag of words.
     - Feature selection: simple counting, statistics.
  8. Text Mining Process
     - Text/data mining: classification, clustering, associations.
     - Analyzing results.
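As a rough end-to-end sketch of the feature-generation step in the pipeline above, here is a minimal bag-of-words vectorizer in Python (the slides' own code is in R; the function and variable names here are illustrative, not from the slides):

```python
def bag_of_words(docs):
    """Build count-vector (bag-of-words) features for a list of documents."""
    tokenized = [doc.lower().split() for doc in docs]
    # The vocabulary is the sorted set of all distinct tokens.
    vocab = sorted({tok for doc in tokenized for tok in doc})
    index = {term: i for i, term in enumerate(vocab)}
    vectors = []
    for doc in tokenized:
        vec = [0] * len(vocab)
        for tok in doc:
            vec[index[tok]] += 1  # simple counting, as on the slide
        vectors.append(vec)
    return vocab, vectors

docs = ["the cat sat", "the dog sat"]
vocab, vectors = bag_of_words(docs)
# vocab is ['cat', 'dog', 'sat', 'the']; each vector counts those terms
```

Each document becomes a row of term counts over a shared vocabulary; classification or clustering then operates on these vectors.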
  9. Applications
     - The potential applications are countless:
       - Customer profile analysis
       - Trend analysis
       - Information filtering and routing
       - Event tracking
       - News story classification
       - Web search, etc.
  10. Tokenization
      - Convert a sentence into a sequence of tokens, i.e. words.
      - Why do we tokenize? Because we do not want to treat a sentence as a sequence of characters.
      - Tokenizing general English sentences is relatively straightforward: use spaces as the boundaries, plus some heuristics to handle exceptions.
  11. Tokenization issues
      - Separate possessive endings or abbreviated forms from the preceding words:
        Mary's -> Mary 's; Mary's -> Mary is; Mary's -> Mary has
      - Separate punctuation marks and quotes from words:
        Mary. -> Mary .
        "new" -> " new "
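The space-boundaries-plus-heuristics approach above can be sketched in Python; this is a minimal illustration, and the two regex rules are an assumption covering only the possessive and punctuation cases listed on the slides:

```python
import re

def tokenize(sentence):
    """Split a sentence into tokens, separating possessive 's and
    punctuation/quotes from the words they are attached to."""
    # Heuristic 1: split off possessive/contracted 's (Mary's -> Mary 's).
    sentence = re.sub(r"(\w)'s\b", r"\1 's", sentence)
    # Heuristic 2: separate punctuation marks and quotes from words.
    sentence = re.sub(r'([.,!?";:])', r' \1 ', sentence)
    # Finally, use spaces as the token boundaries.
    return sentence.split()

tokenize("Mary's dog is \"new\".")
# -> ['Mary', "'s", 'dog', 'is', '"', 'new', '"', '.']
```

Real tokenizers need many more such rules (abbreviations, hyphens, numbers), but the structure is the same: whitespace splitting plus exception heuristics.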
  12. Dictionary creation
      - A dictionary is used to locate occurrences of a particular term in the documents.
      - It reduces the retrieval time of an algorithm.
      - Its entries are stored as linked lists.
  13. Example
      Brutus    -> 1 2 4 11 31 45 173 174
      Caesar    -> 1 2 4 5 6 16 57 132 . . .
      Calpurnia -> 2 31 54 101
      - In this example, each of the terms Brutus, Caesar and Calpurnia points to the list of documents in which it occurs.
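A term dictionary with postings like the one above can be sketched in Python; here a dict of lists stands in for the linked lists the slides mention, and all names are illustrative:

```python
def build_index(docs):
    """Build a dictionary mapping each term to the sorted list of
    document IDs in which it occurs (its postings list)."""
    index = {}
    for doc_id, text in enumerate(docs, start=1):
        # set() so each document contributes at most one posting per term
        for term in set(text.lower().split()):
            index.setdefault(term, []).append(doc_id)
    return index

docs = ["Brutus killed Caesar", "Caesar trusted Brutus", "Calpurnia dreamed"]
index = build_index(docs)
# index["brutus"] -> [1, 2]
```

Because documents are scanned in ID order, each postings list comes out sorted, which is what makes lookups and list intersections fast.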
  14. Feature generation and selection
      - Importance of feature selection in machine learning:
      - It improves the efficiency of many machine learning algorithms.
      - It reduces overfitting: overfitting is the problem of fitting the model so closely to the training data that it fails on new data.
      - It improves the efficiency of training.
  15. Feature selection methods for classification
      - Filter method: pre-compute a score for each feature, then select features according to their scores.
      - Wrapper method: use the learner as a black box to score subsets of features.
      - Embedded method: feature selection is performed within the process of training the algorithm.
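A minimal sketch of the filter method in Python, assuming document frequency as the per-feature score (the slides do not fix a particular scoring function, so that choice is an assumption):

```python
from collections import Counter

def filter_select(doc_tokens, k):
    """Filter method: score each feature independently (here by document
    frequency) and keep the k highest-scoring features."""
    df = Counter()
    for tokens in doc_tokens:
        df.update(set(tokens))  # count each term once per document
    return [term for term, _ in df.most_common(k)]

docs = [["text", "mining", "intro"],
        ["text", "data"],
        ["mining", "data", "text"]]
filter_select(docs, 1)  # -> ['text'] (it appears in all three documents)
```

Unlike the wrapper method, no classifier is trained during selection; the score is computed once per feature, which is what makes filtering cheap.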
  16. Parsing tasks
      - Separate words from spaces and punctuation.
      - Clean up: remove redundant words and words with no content.
      - The cleaned-up list of words is referred to as tokens.
  17. Simple Algorithm for parsing
      # Initialize; Description holds the entire text
      charcount <- nchar(Description)
      # Number of records of text
      Linecount <- length(Description)
      Num <- Linecount * 6
      # Array to hold locations of spaces
      Position <- rep(0, Num)
      dim(Position) <- c(Linecount, 6)
  18. Simple Algorithm for parsing
      # Array for terms
      Terms <- rep("", Num)
      dim(Terms) <- c(Linecount, 6)
      wordcount <- rep(0, Linecount)
  19. Search for Spaces
      for (i in 1:Linecount)
      {
        n <- charcount[i]
        k <- 1
        for (j in 1:n)
        {
          Char <- substring(Description[i], j, j)
          # whitespace test (the original used the nonstandard is.all.white)
          if (Char == " ") { Position[i, k] <- j; k <- k + 1 }
        }
        wordcount[i] <- k
      }
  20. Get Words
      # Parse out terms
      for (i in 1:Linecount)
      {
        # First word
        if (Position[i, 1] == 0) Terms[i, 1] <- Description[i] else
          Terms[i, 1] <- substring(Description[i], 1, Position[i, 1] - 1)
  21. Get Words
        # Remaining words (j starts at 2; word 1 was handled above)
        for (j in 2:wordcount[i])
        {
          if (Position[i, j] > 0)
          {
            Terms[i, j] <- substring(Description[i], Position[i, j - 1] + 1, Position[i, j] - 1)
          }
        }
      }
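For comparison, the same find-the-spaces-then-substring parse from the R slides can be written compactly in Python; this is a rough equivalent for illustration, not the slides' code:

```python
def parse_terms(descriptions):
    """For each text record, locate the spaces and cut out the words
    between them, mirroring the space-position logic of the R version."""
    terms = []
    for line in descriptions:
        # Positions of all whitespace characters (the R code's Position array).
        positions = [j for j, ch in enumerate(line) if ch.isspace()]
        words, start = [], 0
        for p in positions:
            if p > start:  # skip runs of consecutive spaces
                words.append(line[start:p])
            start = p + 1
        if start < len(line):  # trailing word after the last space
            words.append(line[start:])
        terms.append(words)
    return terms

parse_terms(["text mining intro"])  # -> [['text', 'mining', 'intro']]
```

Note that this version has no fixed limit of 6 words per record: the fixed-size `Position`/`Terms` arrays in the R code are replaced by lists that grow as needed.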
  22. Conclusion
      - This presentation covered an overview of text mining, tokenization, dictionary creation, feature selection and parsing in detail.
  23. Visit more self-help tutorials
      - Pick a tutorial of your choice and browse through it at your own pace.
      - The tutorials section is free and self-guiding and does not include additional support.
      - Visit us at www.dataminingtools.net