Text mining: Introduction and data preparation
Overview of Text Mining
What is text mining?
Text mining, "also known as intelligent text analysis, text data mining or knowledge-discovery in text (KDT), refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text."
Need for Text Mining
A practical example from the biotech industry shows why text mining is needed.
- 80% of biological knowledge exists only in research papers (unstructured data).
- If a scientist manually reads 50 research papers a week and only 10% of them are useful, he or she ends up with only 5 useful papers a week.
- Online databases like Medline, however, add more than 10,000 abstracts per month, far more than anyone can read manually.
- With text mining, relevant data can be gathered dramatically faster, which shows the need for text mining.
Challenges in Text Mining
- Information is in unstructured textual form.
- Textual databases are large: almost all publications are also available in electronic form.
- The number of possible "dimensions" is very high (but sparse): all possible word and phrase types in the language.
- Relationships between concepts in text are complex and subtle. For example, "AOL merges with Time-Warner" and "Time-Warner is bought by AOL" describe the same event.
- Word ambiguity and context sensitivity:
  - automobile = car = vehicle = Toyota
  - Apple (the company) or apple (the fruit)
- Noisy data, for example spelling mistakes.
Text Mining Process
- Text preprocessing: syntactic/semantic text analysis
- Feature generation: bag of words
- Feature selection: simple counting, statistics
- Text/data mining: classification, clustering, associations
- Analyzing results
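The feature generation step can be illustrated with a small bag-of-words sketch. The two documents below are hypothetical and not taken from the slides; the sketch only assumes that tokens are separated by whitespace and that features are plain term counts.

```r
# Hypothetical toy corpus; these documents are not from the slides.
docs <- c("text mining finds patterns in text",
          "data mining finds patterns in data")

# Lowercase and split on whitespace to get the tokens of each document.
tokens <- strsplit(tolower(docs), "\\s+")

# The vocabulary is the sorted set of distinct tokens.
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary term occurs in each document.
counts <- vapply(
  tokens,
  function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
  integer(length(vocab))
)

# Rows are documents, columns are terms.
bow <- t(counts)
dimnames(bow) <- list(paste0("doc", seq_along(docs)), vocab)
print(bow)
```

Each row of `bow` is the feature vector of one document. Word order is discarded, which is what makes the representation a "bag" of words.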
Applications
The potential applications are countless, for example:
- Customer profile analysis
- Trend analysis
- Information filtering and routing
- Event tracking
- News story classification
- Web search
Tokenization
- Convert a sentence into a sequence of tokens, i.e. words.
- Why do we tokenize? Because we do not want to treat a sentence as a sequence of characters.
- Tokenizing general English sentences is relatively straightforward:
  - Use spaces as the boundaries.
  - Use some heuristics to handle exceptions.
Tokenization issues
- Separate possessive endings or abbreviated forms from the preceding word. The same surface form can stand for different things:
  Mary’s -> Mary ’s
  Mary’s -> Mary is
  Mary’s -> Mary has
- Separate punctuation marks and quotes from words:
  Mary. -> Mary .
  “new” -> “ new ”
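The two heuristics above can be sketched in a few lines of R. The function name `tokenize` and the choice of punctuation characters are assumptions made for this illustration, and only straight ASCII quotes are handled. The sketch does not decide whether 's means a possessive, "is" or "has"; it only splits the ending off.

```r
# Hypothetical helper: whitespace tokenizer with two simple heuristics.
tokenize <- function(sentence) {
  # Put spaces around punctuation marks and double quotes
  # so that they become tokens of their own.
  s <- gsub("([.,!?;:\"])", " \\1 ", sentence)
  # Split an 's ending from the preceding word.
  s <- gsub("(\\w)('s)\\b", "\\1 \\2", s, perl = TRUE)
  # Use runs of whitespace as token boundaries.
  tokens <- strsplit(trimws(s), "\\s+")[[1]]
  tokens[tokens != ""]
}

print(tokenize("Mary's book is \"new\"."))
```

Abbreviations such as "U.S." would be split incorrectly by this sketch, which is the kind of exception that needs further heuristics.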
Dictionary creation
- A dictionary is used to locate the occurrences of a particular term in the documents.
- It reduces the retrieval time of an algorithm.
- The lists of documents for each term are stored as linked lists.
Example
Brutus    --> 1 2 4 11 31 45 173 174
Caesar    --> 1 2 4 5 6 16 57 132 ...
Calpurnia --> 2 31 54 101
The example above lists, for each of the terms Brutus, Caesar and Calpurnia, the documents in which the term occurs.
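A dictionary of this kind can be sketched in R as a named list that maps each term to its document IDs. The three documents below are hypothetical and are not the collection behind the numbers on the slide. An R list stands in for the linked lists mentioned above.

```r
# Hypothetical documents keyed by document ID.
docs <- c("1" = "brutus killed caesar",
          "2" = "caesar praised calpurnia and brutus",
          "4" = "brutus spoke")

# Build (term, document ID) pairs.
tokens <- strsplit(tolower(docs), "\\s+")
pairs <- data.frame(
  term = unlist(tokens, use.names = FALSE),
  doc  = rep(as.integer(names(docs)), lengths(tokens)),
  stringsAsFactors = FALSE
)

# Group the document IDs by term; each entry is a sorted list of IDs.
index <- lapply(split(pairs$doc, pairs$term),
                function(ids) sort(unique(ids)))

# Look up a term, and find the documents containing two terms.
print(index[["brutus"]])
print(intersect(index[["brutus"]], index[["caesar"]]))
```

Looking up a term is now a single list access instead of a scan over every document, which is where the reduction in retrieval time comes from.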
Feature generation and selection
Importance of feature selection:
- Machine learning: it improves the efficiency of many machine learning algorithms.
- Overfitting: the model is fitted so closely to the training data that it performs well there but starts to fail on actual data.
- Efficiency of training: training on fewer features is faster.
Feature selection methods for classification
- Filter method: a score is computed for each feature in a pre-processing step, and features are then selected according to their scores.
- Wrapper method: the wrapper uses the learning algorithm as a black box to score subsets of features.
- Embedded method: feature selection is performed within the process of training the algorithm.
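As an illustration of the filter method only, the sketch below scores each term before any classifier is trained and keeps the top-scoring terms. The document-term matrix, the class labels and the scoring rule (difference in document frequency between two classes) are all assumptions chosen for this example; the slides do not prescribe a particular score.

```r
# Hypothetical document-term matrix (rows = documents, columns = terms).
dtm <- matrix(c(1, 0, 1, 1,
                1, 0, 1, 0,
                0, 1, 1, 1,
                0, 1, 1, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(NULL, c("goal", "vote", "the", "win")))
# Hypothetical class label of each document.
labels <- c("sport", "sport", "politics", "politics")

# Filter score: absolute difference between the two classes in the
# fraction of documents that contain the term.
present <- dtm > 0
score <- abs(
  colMeans(present[labels == "sport", , drop = FALSE]) -
  colMeans(present[labels == "politics", , drop = FALSE])
)

# Keep the k highest-scoring terms.
k <- 2
selected <- names(sort(score, decreasing = TRUE))[seq_len(k)]
print(score)
print(selected)
```

Terms that appear equally often in both classes, such as "the" here, receive a score of zero and are dropped. A wrapper method would instead train and evaluate a classifier on candidate subsets of terms.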
Parsing tasks
- Separate words from spaces and punctuation.
- Clean up:
  - Remove redundant words.
  - Remove words with no content.
- The cleaned-up list of words is referred to as tokens.
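The clean-up tasks can be sketched as follows. The token list and the stop-word list are hypothetical, and reading "redundant words" as repeated words is an assumption of this sketch.

```r
# Hypothetical tokens from one text record.
tokens <- c("The", "claim", "was", "for", "a", "car", "car",
            "accident", ",", "the", "driver")
# Hypothetical short list of words with no content (stop words).
stop_words <- c("the", "a", "was", "for", "of", "and")

clean <- tolower(tokens)
clean <- clean[grepl("[a-z0-9]", clean)]  # drop punctuation-only tokens
clean <- clean[!clean %in% stop_words]    # remove words with no content
clean <- unique(clean)                    # remove repeated words
print(clean)
```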
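Simple Algorithm for parsing
# Initialize. Description is a character vector holding the text, one record per element
charcount <- nchar(Description)
# Number of records of text
Linecount <- length(Description)
# At most 6 space positions and 6 terms are stored per record
Num <- Linecount * 6
# Array to hold the locations of spaces
Position <- rep(0, Num)
dim(Position) <- c(Linecount, 6)
# Array for terms
Terms <- rep("", Num)
dim(Terms) <- c(Linecount, 6)
wordcount <- rep(0, Linecount)
Search for Spaces
for (i in seq_len(Linecount)) {
  n <- charcount[i]
  k <- 1
  for (j in seq_len(n)) {
    Char <- substring(Description[i], j, j)
    # The slide calls is.all.white(), which base R does not provide;
    # a plain comparison with a space is used instead
    if (Char == " " && k <= 6) { Position[i, k] <- j; k <- k + 1 }
  }
  # Number of terms to extract from this record (capped at 6)
  wordcount[i] <- min(k, 6)
}
Get Words
# Parse out terms
for (i in seq_len(Linecount)) {
  # First word: the whole record if it has no space, else the text before the first space
  if (Position[i, 1] == 0) {
    Terms[i, 1] <- Description[i]
  } else {
    Terms[i, 1] <- substring(Description[i], 1, Position[i, 1] - 1)
  }
  # Remaining words: from the previous space up to the next space,
  # or up to the end of the record when no further space was stored
  for (j in seq_len(wordcount[i])[-1]) {
    last <- if (Position[i, j] > 0) Position[i, j] - 1 else charcount[i]
    Terms[i, j] <- substring(Description[i], Position[i, j - 1] + 1, last)
  }
}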
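The character-by-character search above stores at most six terms per record. As a shorter alternative, the same split on spaces can be sketched with base R's `strsplit`, which has no such limit. The records in `Description` below are hypothetical.

```r
# Hypothetical text records.
Description <- c("rear end collision", "windshield  cracked", "theft")

# Split each record on runs of whitespace; the result is a list
# with one character vector of terms per record.
Terms <- strsplit(trimws(Description), "[[:space:]]+")

# Number of terms in each record.
wordcount <- lengths(Terms)

print(Terms[[1]])
print(wordcount)
```

Because the pattern matches runs of whitespace, the double space in the second record does not produce an empty term.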
Conclusion
This presentation covered the following topics:
- Overview of text mining
- Tokenization
- Dictionary creation
- Feature selection
- Parsing
Visit more self help tutorials Pick a tutorial of your choice and browse through it at your own pace. The tutorials section is free, self-guiding and will not involve any additional support. Visit us at www.dataminingtools.net
