Textmining Introduction

Introduction to Text Mining

Transcript

  • 1. Text mining: Introduction and data preparation
  • 2. Overview of Text mining
    • What is Text Mining?
    • Text Mining, "also known as intelligent text analysis, text data mining or knowledge-discovery in text (KDT), refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text."
  • 3. Need for Text mining:
    • We can better understand the need for Text mining using a practical example.
    • Ex: The Bio Tech Industry.
    • 80% of biological knowledge is found only in research papers (unstructured data).
    • If a scientist manually reads 50 research papers per week and only 10% of them are useful, he or she ends up with only 5 useful papers per week.
  • 4. Need for Text mining
    • But online databases such as Medline add more than 10,000 abstracts per month using text mining.
    • Thus gathering relevant data becomes dramatically faster when we use text mining, which shows the need for it.
  • 5. Challenges in Text Mining
    • Information is in unstructured textual form
    • Large textual databases
      • Almost all publications are also in electronic form
    • Very high number of possible “dimensions” (but sparse)
      • All possible word and phrase types in the language
    • Complex and subtle relationships between concepts in text
  • 6. Challenges in Text Mining
    • Same relationship, different wording: “AOL merges with Time-Warner” and “Time-Warner is bought by AOL”
    • Word ambiguity and context sensitivity
      • automobile = car = vehicle = Toyota
      • Apple (the company) or apple (the fruit)
    • Noisy data
      • Example: spelling mistakes
  • 7. Text Mining Process
    • Text preprocessing
      • Syntactic/semantic text analysis
    • Feature generation
      • Bag of words
    • Feature selection
      • Simple counting
      • Statistics
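
The bag-of-words step can be sketched in a few lines of base R. This is a minimal illustration and not the presenters' code: the two toy documents and the names `docs`, `vocab` and `dtm` are assumptions made for the example. Each document becomes a row of word counts, which also shows why the feature space is high-dimensional but sparse.

```r
# Toy documents (hypothetical examples, not from the slides)
docs <- c("text mining finds patterns in text",
          "data mining finds patterns in data")

# Split each document into lower-case tokens on whitespace
tokens <- strsplit(tolower(docs), "\\s+")

# Vocabulary: every distinct word seen in any document
vocab <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per document, one column per vocabulary word
dtm <- t(vapply(tokens,
                function(tk) as.integer(table(factor(tk, levels = vocab))),
                integer(length(vocab))))
colnames(dtm) <- vocab

# Share of zero cells; on real collections this is typically close to 1
sparsity <- mean(dtm == 0)
```

Word order is discarded on purpose: only the counts per document survive, which is what "bag of words" refers to.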
  • 8. Text Mining Process
    • Text/data mining
      • Classification
      • Clustering
      • Associations
    • Analyzing results
  • 9. Applications
    • The potential applications are countless.
      • Customer profile analysis
      • Trend analysis
      • Information filtering and routing
      • Event tracking
      • News story classification
      • Web search, etc.
  • 10. Tokenization
    • Convert a sentence into a sequence of tokens, i.e., words.
    • Why do we tokenize?
    • Because we do not want to treat a sentence as a sequence of characters.
    • Tokenizing general English sentences is relatively straightforward.
    • Use spaces as the boundaries.
    • Use some heuristics to handle exceptions.
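
A minimal sketch of space-based tokenization in base R follows. The sample sentence is made up, and the single heuristic shown (stripping punctuation from the start and end of each token) is only one assumed example of the exception handling the slide mentions.

```r
# Hypothetical sample sentence
sentence <- "Text mining extracts knowledge from unstructured text."

# Use runs of whitespace as token boundaries
tokens <- strsplit(sentence, "\\s+")[[1]]

# One simple heuristic: strip punctuation attached to the start or end of a token
tokens <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", tokens)

# Drop anything that became empty after stripping
tokens <- tokens[tokens != ""]
```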
  • 11. Tokenisation issues
    • Separate possessive endings or abbreviated forms from preceding words:
      • Mary’s → Mary ’s
      • Mary’s → Mary is
      • Mary’s → Mary has
    • Separate punctuation marks and quotes from words:
      • Mary. → Mary .
      • “new” → “ new ”
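
The two separation rules above can be approximated with regular expressions. The function below is a sketch under simplifying assumptions: `tokenize` is a hypothetical helper, it handles only the ASCII apostrophe and straight double quote, and it does not try to decide whether `'s` stands for a possessive, "is" or "has", since that needs context.

```r
tokenize <- function(sentence) {
  # Put a space before a trailing 's so it becomes a separate token
  s <- gsub("(\\w)('s)\\b", "\\1 \\2", sentence, perl = TRUE)
  # Surround punctuation marks and straight double quotes with spaces
  s <- gsub("([.,!?;:\"])", " \\1 ", s, perl = TRUE)
  # Split on whitespace after trimming the padding added above
  strsplit(trimws(s), "\\s+")[[1]]
}

tokens <- tokenize("Mary's book is \"new\".")
```

Here `tokens` holds eight elements: `Mary`, `'s`, `book`, `is`, an opening quote, `new`, a closing quote and the final period.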
  • 12. Dictionary creation
    • A dictionary is used to locate occurrences of a particular term in the documents.
    • It reduces the retrieval time of an algorithm.
    • The occurrences are stored as linked lists.
  • 13. Example
    • Brutus --> 1 2 4 11 31 45 173 174
    • Caesar --> 1 2 4 5 6 16 57 132 . . .
    • Calpurnia --> 2 31 54 101
    • The example above lists, for each of the terms Brutus, Caesar and Calpurnia, the documents in which the term occurs.
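
A dictionary of this kind can be sketched in base R as a named list that maps each term to the IDs of the documents containing it. The three toy documents are invented for illustration, and integer vectors stand in for the linked lists mentioned on the previous slide, since base R has no built-in linked list type.

```r
# Toy collection (hypothetical texts; a document's ID is its position in the vector)
docs <- c("brutus killed caesar",
          "caesar and calpurnia",
          "brutus and caesar")

# Tokenize every document on whitespace and keep each term once per document
terms_per_doc <- lapply(strsplit(tolower(docs), "\\s+"), unique)

# Pair every term with the ID of the document it came from
doc_id <- rep(seq_along(terms_per_doc), lengths(terms_per_doc))
term   <- unlist(terms_per_doc)

# Dictionary: term -> increasing vector of document IDs
index <- split(doc_id, term)

# Documents containing both terms: intersect the two ID vectors
both <- intersect(index[["brutus"]], index[["caesar"]])
```

Looking up a term is then a single list access, instead of a scan over every document, which is where the saving in retrieval time comes from.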
  • 14. Feature generation and selection
    • Importance of feature selection in machine learning
      • It improves the efficiency of many machine learning algorithms.
    • Overfitting problem
      • Overfitting is the problem of training the model so closely on the training data that, when actual data is presented, it behaves well only to an extent and then starts to fail.
    • Improved efficiency of training
  • 15. Feature selection methods for classification
    • Filter Method
      • A score is computed for each feature in a pre-processing step, and features are then selected according to their scores.
    • Wrapper Method
      • The wrapper uses the learning algorithm as a black box to score subsets of features.
    • Embedded Method
      • Feature selection is performed within the process of training the algorithm.
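
As an illustration of the filter method only, the sketch below scores each term independently of any classifier and keeps the best ones. The slides do not name a scoring function, so the score used here (absolute difference of mean counts between two classes) is an assumption, as are the toy counts and the names `dtm`, `label`, `score` and `k`.

```r
# Toy document-term counts (hypothetical data): rows are documents, columns are terms
dtm <- matrix(c(3, 0, 1,
                2, 0, 1,
                0, 4, 1,
                0, 3, 1),
              nrow = 4, byrow = TRUE,
              dimnames = list(NULL, c("goal", "election", "the")))
label <- c("sport", "sport", "politics", "politics")

# Score every feature on its own, before any model is trained
score <- abs(colMeans(dtm[label == "sport", , drop = FALSE]) -
             colMeans(dtm[label == "politics", , drop = FALSE]))

# Keep the k best-scoring features
k <- 2
selected <- names(sort(score, decreasing = TRUE))[seq_len(k)]
```

The term `the` occurs equally often in both classes, so it scores zero and is dropped. A wrapper method would instead retrain a classifier on candidate subsets, which costs far more computation.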
  • 16. Parsing tasks
    • Separate words from spaces and punctuation
    • Clean up
      • Remove redundant words
      • Remove words with no content
    • The cleaned-up list of words is referred to as tokens
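
These parsing tasks can be strung together as in the sketch below. It rests on two assumptions that the slide does not spell out: "redundant words" is read as repeated words, and "words with no content" as entries in a small stop-word list. The sample text and the list itself are made up.

```r
# Hypothetical sample text
text <- "The cat and the dog and the bird."

# Separate words from spaces and punctuation
words <- strsplit(tolower(text), "[^a-z]+")[[1]]
words <- words[words != ""]

# Remove words with no content (a tiny illustrative stop-word list)
stopwords <- c("the", "and", "a", "of")
words <- words[!words %in% stopwords]

# Remove redundant (repeated) words; what remains is the token list
tokens <- unique(words)
```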
  • 17. Simple Algorithm for parsing
    • # Initialize; Description holds the entire text, one record per element
    • charcount<-nchar(Description)
    • # number of records of text
    • Linecount<-length(Description)
    • Num<-Linecount*6
    • # Array to hold location of spaces (up to 6 per record)
    • Position<-rep(0,Num)
    • dim(Position)<-c(Linecount,6)
  • 18. Simple Algorithm for parsing
    • # Array for Terms (up to 6 words per record)
    • Terms<-rep("",Num)
    • dim(Terms)<-c(Linecount,6)
    • wordcount<-rep(0,Linecount)
  • 19. Search for Spaces
    • for (i in 1:Linecount)
    • {
    • n<-charcount[i]
    • k<-1
    • for (j in seq_len(n))
    • {
    • Char<-substring(Description[i],j,j)
    • # is.all.white() is not a base R function, so compare with a space instead
    • if (Char==" ") {Position[i,k]<-j; k<-k+1}
    • wordcount[i]<-k }}
  • 20. Get Words
    • # parse out terms
    • for (i in 1:Linecount)
    • {
    • # first word
    • if (Position[i,1]==0) Terms[i,1]<-Description[i] else if (Position[i,1]>0)
    • Terms[i,1]<-substring(Description[i],1,Position[i,1]-1)
  • 21. Get Words
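    • # remaining words; assumes at most 6 words per record
    • if (wordcount[i]>1) for (j in 2:wordcount[i])
    • {
    • # a zero Position means no further space, so the last word runs to the end of the record
    • last<-if (Position[i,j]>0) Position[i,j]-1 else charcount[i]
    • Terms[i,j]<-substring(Description[i],Position[i,j-1]+1,last)
    • }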
    • }
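
For comparison, the same split into terms can be written without explicit character loops by using base R's `strsplit`. This is an alternative sketch rather than the presenters' algorithm: the three sample records are invented, and the result is padded to the widest record instead of a fixed six columns.

```r
# Hypothetical sample records standing in for Description
Description <- c("water damage to roof", "theft", "fire in kitchen")

# Split each record on single spaces
words <- strsplit(Description, " ", fixed = TRUE)
wordcount <- lengths(words)

# Pad every record with "" so the result is a rectangular array of terms
width <- max(wordcount)
Terms <- t(sapply(words, function(w) c(w, rep("", width - length(w)))))
```

`Terms[i, j]` then holds the j-th word of record i, with empty strings filling the unused cells.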
  • 22. Conclusion
    • The following topics were studied in detail in this presentation:
      • Overview of text mining
      • Tokenization
      • Dictionary creation
      • Feature selection
      • Parsing
  • 23. Visit more self help tutorials
    • Pick a tutorial of your choice and browse through it at your own pace.
    • The tutorials section is free, self-guiding and will not involve any additional support.
    • Visit us at www.dataminingtools.net