Text Mining Patrick Cash Outline


  1. 1. Text Mining Patrick Cash
  2. 2. Outline <ul><li>Introduction </li></ul><ul><li>Data Mining </li></ul><ul><li>Text Mining </li></ul><ul><ul><li>Text Mining Process </li></ul></ul><ul><li>Text Mining Applications </li></ul><ul><li>Challenges in Text Mining </li></ul><ul><li>Conclusion </li></ul>
  3. 3. Introduction <ul><li>Why Text Mining? </li></ul><ul><ul><li>Massive amount of new information being created </li></ul></ul><ul><ul><ul><li>World’s data doubles every 18 months (Jacques Vallee Ph.D) </li></ul></ul></ul><ul><ul><li>80-90% of all data is held in various unstructured formats </li></ul></ul><ul><ul><li>Useful information can be derived from this unstructured data </li></ul></ul>
  4. 4. Introduction <ul><li>Intelligence in text mining is based on NLP techniques </li></ul><ul><li>Can be used as a preprocessing technique to harvest data and get an initial understanding of the patterns that exist in the data </li></ul><ul><li>Often seen as a special case of data mining but there is an important difference </li></ul>
  5. 5. Data Mining <ul><li>Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information (or patterns) from data </li></ul><ul><li>Data Mining: a misnomer? </li></ul><ul><ul><li>Knowledge discovery, knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. </li></ul></ul>
  6. 6. Data Mining <ul><li>Descriptive: understanding underlying processes or behavior </li></ul><ul><ul><li>Patterns and trends </li></ul></ul><ul><ul><li>Clustering </li></ul></ul><ul><li>Predictive: predict an unseen or unmeasured value </li></ul><ul><ul><li>Future projections and missing values </li></ul></ul><ul><ul><li>Classification </li></ul></ul>
  7. 7. SEMMA <ul><li>Sample </li></ul><ul><ul><li>Input data source, data sampling, partitioning </li></ul></ul><ul><li>Explore </li></ul><ul><ul><li>Patterns, trends, outliers, visualization </li></ul></ul><ul><li>Modify </li></ul><ul><ul><li>Clustering, feature reduction </li></ul></ul><ul><li>Model </li></ul><ul><ul><li>Regression, tree, network </li></ul></ul><ul><li>Assess </li></ul><ul><ul><li>Report, pass to next step in analysis </li></ul></ul>
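  8. 8. Search vs. Discover <ul><li>Search (goal-oriented) </li></ul><ul><ul><li>Structured data: Data Retrieval </li></ul></ul><ul><ul><li>Unstructured data (text): Information Retrieval </li></ul></ul><ul><li>Discover (opportunistic) </li></ul><ul><ul><li>Structured data: Data Mining </li></ul></ul><ul><ul><li>Unstructured data (text): Text Mining </li></ul></ul>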
  9. 9. Text Mining <ul><li>Many different but similar definitions </li></ul><ul><li>Text Mining = Statistical NLP + Data Mining </li></ul><ul><li>Text Mining is a process that employs </li></ul><ul><ul><li>Statistical NLP: a set of algorithms for converting unstructured text into structured data objects </li></ul></ul><ul><ul><li>Data Mining: the quantitative methods that analyze these data objects to discover knowledge </li></ul></ul>
  10. 10. Text Mining <ul><li>Descriptive </li></ul><ul><ul><li>Pattern and trend analysis </li></ul></ul><ul><ul><li>Knowledge base creation </li></ul></ul><ul><ul><li>Summarization </li></ul></ul><ul><ul><li>Visualization </li></ul></ul><ul><li>Predictive </li></ul><ul><ul><li>Classification </li></ul></ul><ul><ul><li>Question answering </li></ul></ul><ul><ul><li>Pattern and trend forecasting </li></ul></ul>
  11. 11. Text Mining Techniques <ul><li>Information Retrieval </li></ul><ul><ul><li>Indexing and retrieval of textual documents </li></ul></ul><ul><li>Information Extraction </li></ul><ul><ul><li>Extraction of partial knowledge in the text </li></ul></ul><ul><li>Web Mining </li></ul><ul><ul><li>Indexing and retrieval of textual documents and extraction of partial knowledge using the web (ontology building) </li></ul></ul><ul><li>Clustering </li></ul><ul><ul><li>Generating collections of similar text documents </li></ul></ul>
  12. 12. Text Mining Process
  13. 13. Text Mining Process <ul><li>Text Preprocessing </li></ul><ul><ul><li>Syntactic/Semantic text analysis </li></ul></ul><ul><li>Features Generation </li></ul><ul><ul><li>Bag of words </li></ul></ul><ul><li>Features Selection </li></ul><ul><ul><li>Simple counting </li></ul></ul><ul><ul><li>Statistics </li></ul></ul><ul><li>Text/Data Mining </li></ul><ul><ul><li>Classification (Supervised) / Clustering (Unsupervised) </li></ul></ul><ul><li>Analyzing results </li></ul>
  14. 14. Text Mining Process <ul><li>Text preprocessing </li></ul><ul><ul><li>Part Of Speech (POS) tagging </li></ul></ul><ul><ul><ul><li>Find the corresponding POS for each word. </li></ul></ul></ul><ul><ul><li>Word sense disambiguation </li></ul></ul><ul><ul><ul><li>Context based or proximity based </li></ul></ul></ul><ul><ul><li>Parsing </li></ul></ul><ul><ul><ul><li>Generates a parse tree (graph) for each sentence </li></ul></ul></ul><ul><ul><ul><li>Each sentence is a stand-alone graph </li></ul></ul></ul>
  15. 15. Text Mining Process <ul><li>Feature Generation </li></ul><ul><ul><li>Text document is represented by the words it contains (and their occurrences) </li></ul></ul><ul><ul><ul><li>Order of words is not that important for certain applications (Bag of words) </li></ul></ul></ul><ul><ul><li>Stemming: identifies a word by its root </li></ul></ul><ul><ul><ul><li>Reduce dimensionality </li></ul></ul></ul><ul><ul><li>Stop words: The common words unlikely to help text mining </li></ul></ul>
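
The feature generation steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the stop-word list is a tiny hand-picked assumption, and `naive_stem` is a hypothetical crude suffix stripper standing in for a real stemmer such as Porter's.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list, not a standard one.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in"}

def naive_stem(word: str) -> str:
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def bag_of_words(text: str) -> Counter:
    """Lowercase, tokenize, drop stop words, stem, and count occurrences."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(naive_stem(t) for t in tokens if t not in STOP_WORDS)

bow = bag_of_words("The miner mines text and the mining of texts is mined")
print(bow)
```

Word order is discarded, and "mines", "mining" and "mined" collapse into one feature, which is the dimensionality reduction the slide refers to.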
  16. 16. Text Mining Process <ul><li>Feature Selection </li></ul><ul><ul><li>Reduce dimensionality </li></ul></ul><ul><ul><ul><li>Learners have difficulty addressing tasks with high dimensionality </li></ul></ul></ul><ul><ul><ul><li>Only interested in the information relevant to what is being analyzed </li></ul></ul></ul><ul><ul><li>Irrelevant features </li></ul></ul><ul><ul><ul><li>Not all features help </li></ul></ul></ul>
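
One simple way to reduce dimensionality is to filter terms by document frequency. The sketch below is an assumption about how "simple counting" might look in practice; the function name `select_features` and the thresholds `min_df` and `max_df_ratio` are illustrative choices, not values from the slides.

```python
from collections import Counter

def select_features(docs, min_df=2, max_df_ratio=0.9):
    """Keep terms found in at least min_df documents and in at most
    max_df_ratio of all documents (both thresholds are illustrative)."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    n = len(docs)
    return sorted(t for t, c in df.items() if c >= min_df and c / n <= max_df_ratio)

docs = [
    ["text", "mining", "data"],
    ["text", "mining", "nlp"],
    ["text", "clustering", "data"],
    ["text", "spam"],
]
vocab = select_features(docs)
print(vocab)
```

Here "text" is dropped because it appears in every document and so cannot discriminate between them, while terms seen only once are dropped as too rare to generalize.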
  17. 17. Text Mining Process <ul><li>Text Mining: Classification definition </li></ul><ul><ul><li>Given: a collection of labeled records (training set) </li></ul></ul><ul><ul><ul><li>Each record contains a set of features (attributes), and the true class (label) </li></ul></ul></ul><ul><ul><li>Find: a model for the class as a function of the values of the features </li></ul></ul><ul><ul><li>Goal: previously unseen records should be assigned a class as accurately as possible </li></ul></ul>
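
As one possible instance of this definition, the sketch below trains a multinomial naive Bayes classifier with add-one smoothing on labeled bag-of-words records. Naive Bayes is chosen here only as a simple example; the slides do not name a specific model, and the toy training set is invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(records):
    """records: list of (tokens, label) pairs forming the training set."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in records:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify(model, tokens):
    """Return the label with the highest log posterior for the given tokens."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # class prior
        for t in tokens:
            if t in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][t] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy labeled records (invented for illustration).
training = [
    (["win", "cash", "prize"], "spam"),
    (["cheap", "cash", "offer"], "spam"),
    (["meeting", "agenda", "notes"], "ham"),
    (["project", "meeting", "schedule"], "ham"),
]
model = train_naive_bayes(training)
print(classify(model, ["cash", "offer"]))
print(classify(model, ["meeting", "notes"]))
```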
  18. 18. Text Mining Process <ul><li>Text Mining: Clustering definition </li></ul><ul><ul><li>Given: a set of documents and a similarity measure among documents </li></ul></ul><ul><ul><li>Find: clusters such that: </li></ul></ul><ul><ul><ul><li>Documents in one cluster are more similar to one another </li></ul></ul></ul><ul><ul><ul><li>Documents in separate clusters are less similar to one another </li></ul></ul></ul><ul><ul><li>Goal: </li></ul></ul><ul><ul><ul><li>Finding a correct set of document clusters </li></ul></ul></ul>
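
To make the "similarity measure" concrete, the sketch below uses cosine similarity over bag-of-words counts and a single-pass threshold clustering. Both are assumptions chosen for brevity: the slides do not prescribe a measure or an algorithm, and the `threshold` value is illustrative.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def single_pass_clusters(docs, threshold=0.5):
    """Put each document in the first cluster whose seed document is at least
    `threshold` similar; otherwise start a new cluster with it as the seed."""
    clusters = []  # lists of document indices; the first index is the seed
    for i, doc in enumerate(docs):
        for cluster in clusters:
            if cosine(docs[cluster[0]], doc) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

docs = [
    Counter("text mining finds patterns in text".split()),
    Counter("stock prices fell sharply today".split()),
    Counter("mining text for patterns".split()),
    Counter("stock prices rose today".split()),
]
clusters = single_pass_clusters(docs)
print(clusters)
```

No labels are supplied; the two groups emerge only from shared vocabulary, which is what makes the task unsupervised.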
  19. 19. Text Mining Process <ul><li>Supervised learning (classification) </li></ul><ul><ul><li>The training data is labeled, indicating the class </li></ul></ul><ul><ul><li>New data is classified based on the training set </li></ul></ul><ul><ul><li>Correct classification: the known label of a test sample matches the class predicted by the classification model </li></ul></ul><ul><li>Unsupervised learning (clustering) </li></ul><ul><ul><li>The class labels of training data are unknown </li></ul></ul><ul><ul><li>Establish the existence of classes or clusters in the data </li></ul></ul><ul><ul><li>Good clustering method: high intra-cluster similarity and low inter-cluster similarity </li></ul></ul>
  20. 20. Text Mining Process <ul><li>Analyzing the results </li></ul><ul><ul><li>Are the results satisfactory? </li></ul></ul><ul><ul><li>Does more mining need to be done? </li></ul></ul><ul><ul><li>Does a different technique need to be used? </li></ul></ul><ul><ul><li>Does another iteration of one or more steps in the process need to be done? </li></ul></ul>
  21. 21. Text Mining Applications <ul><li>Bioinformatics </li></ul><ul><ul><li>Genomics research (DNA sequencing) </li></ul></ul><ul><li>Medical </li></ul><ul><ul><li>Mining medical records to improve care </li></ul></ul><ul><li>Business intelligence </li></ul><ul><ul><li>Risk analysis </li></ul></ul><ul><li>Research </li></ul><ul><ul><li>Analyzing research publications </li></ul></ul><ul><li>Basically anywhere there is a large amount of unstructured text data </li></ul>
  22. 22. Text Mining Application <ul><li>Classification (Categorization) </li></ul><ul><ul><li>Spam detection, Document organization </li></ul></ul><ul><li>Clustering </li></ul><ul><ul><li>Trend analysis, Topic identification </li></ul></ul><ul><li>Web Mining </li></ul><ul><ul><li>Trend analysis, Opinion mining, Ontology creation </li></ul></ul><ul><li>Classical NLP </li></ul><ul><ul><li>Text summarization, Question answering, Information extraction </li></ul></ul>
  23. 23. Text Mining Application <ul><li>Smaller scale applications </li></ul><ul><ul><li>Relationship Analysis </li></ul></ul><ul><ul><ul><li>If A is related to B, and B is related to C, there is potentially a relationship between A and C. </li></ul></ul></ul><ul><ul><li>Trend analysis </li></ul></ul><ul><ul><ul><li>Occurrences of A peak in October. </li></ul></ul></ul><ul><ul><li>Mixed applications </li></ul></ul><ul><ul><ul><li>Co-occurrence of A together with B peak in November. (Shopping Cart Analysis) </li></ul></ul></ul>
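
The relationship analysis idea (A relates to B, B relates to C, so A and C may be related) can be sketched with document-level co-occurrence counts. The function names and the toy documents are hypothetical; treating "related" as "appears in the same document" is a simplifying assumption.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(docs):
    """Count how many documents each unordered pair of terms shares."""
    pairs = Counter()
    for terms in docs:
        for a, b in combinations(sorted(set(terms)), 2):
            pairs[(a, b)] += 1
    return pairs

def candidate_links(pairs):
    """Suggest (A, B, C) triples where A and C never co-occur directly
    but both co-occur with an intermediate term B."""
    neighbors = {}
    for a, b in pairs:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    found = set()
    for b, linked in neighbors.items():
        for a, c in combinations(sorted(linked), 2):
            if (a, c) not in pairs:
                found.add((a, b, c))
    return sorted(found)

docs = [["A", "B"], ["B", "C"], ["A", "B"]]
pairs = cooccurrence(docs)
links = candidate_links(pairs)
print(pairs)
print(links)
```

Grouping the documents by month before counting would turn the same pair counts into the trend and shopping-cart style analyses listed on the slide.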
  24. 24. Challenges in Text Mining <ul><li>Remember </li></ul><ul><ul><li>Text Mining = Statistical NLP + Data Mining </li></ul></ul><ul><li>Text mining suffers from the same challenges as Statistical NLP and Data Mining </li></ul><ul><li>Add in the additional difficulties associated with the data not being structured </li></ul>
  25. 25. Challenges in Text Mining <ul><li>Statistical NLP </li></ul><ul><ul><li>Ambiguity </li></ul></ul><ul><ul><li>Context </li></ul></ul><ul><ul><li>Tokenization and sentence detection </li></ul></ul><ul><ul><li>Stemming </li></ul></ul><ul><ul><li>POS Tagging </li></ul></ul><ul><ul><li>Coreference Resolution </li></ul></ul>
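
A short sketch shows why sentence detection appears on this list. The splitter below is a deliberately naive assumption (break after sentence-ending punctuation followed by whitespace), and the example sentence is invented; it is meant to expose the failure, not to recommend the approach.

```python
import re

def naive_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

text = "Dr. Smith reviewed the records. The results were clear."
sentences = naive_sentences(text)
for s in sentences:
    print(repr(s))
```

The text contains two sentences, but the abbreviation "Dr." is treated as a sentence boundary, so the splitter returns three pieces. Handling such cases needs abbreviation lists or a trained model.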
  26. 26. Challenges in Text Mining <ul><li>Data Mining </li></ul><ul><ul><li>Data preprocessing </li></ul></ul><ul><ul><ul><li>Ability to process the data </li></ul></ul></ul><ul><ul><ul><li>Massive amounts of data </li></ul></ul></ul><ul><ul><ul><li>Determining and extracting information of interest </li></ul></ul></ul><ul><ul><li>Availability of NLP tools to work with data mining </li></ul></ul><ul><ul><li>Discovery process </li></ul></ul><ul><ul><ul><li>No training data available </li></ul></ul></ul>
  27. 27. Conclusion <ul><li>Text Mining = Statistical NLP + Data Mining </li></ul><ul><ul><li>Culmination of all the NLP techniques covered in this course </li></ul></ul><ul><li>Growing research area that will become more important as information growth (and the need to extract knowledge from that information) increases </li></ul>
  28. 28. References <ul><li>Even-Zohar, Y. Introduction to Text Mining. Supercomputing, 2002. http://alg.ncsa.uiuc.edu/do/documents/presentations </li></ul><ul><li>Treloar, N AvaQuest Inc. www.knowledgetechnologies.net/proceedings/presentations/treloar/nathantreloar.ppt </li></ul><ul><li>Witte, R. Faculty of Informatics Institute for Program Structures and Data Organization (IPD) http://www.edbt2006.de/edbt-share/IntroductionToTextMining.pdf </li></ul>
  29. 29. <ul><li>Questions ? </li></ul>
