
Text Mining



  1. Text & Web Mining
  2. Structured Data
     • So far we have focused on mining from structured data:

         Attribute      Value
         Outlook        Sunny
         Temperature    Hot
         Windy          Yes
         Humidity       High
         Play           Yes

     • Most data mining involves such data
  3. Complex Data Types
     • Increased importance of complex data:
         • Spatial data: includes geographic data and medical & satellite images
         • Multimedia data: images, audio, & video
         • Time-series data: for example, banking data and stock exchange data
         • Text data: word descriptions for objects
         • World-Wide-Web: highly unstructured text and multimedia data
     • Focus here: text data and the World-Wide-Web
  4. Text Databases
     • Many text databases exist in practice
         • News articles
         • Research papers
         • Books
         • Digital libraries
         • E-mail messages
         • Web pages
     • Growing rapidly in size and importance
  5. Semi-Structured Data
     • Text databases are often semi-structured
     • Example:
         • Structured attribute/value pairs: Title, Author, Publication_Date, Length, Category
         • Unstructured: Abstract, Content
  6. Handling Text Data
     • Modeling semi-structured data
     • Information Retrieval (IR) from unstructured documents
     • Text mining
         • Compare documents
         • Rank importance & relevance
         • Find patterns or trends across documents
  7. Information Retrieval
     • IR locates relevant documents
         • Key words
         • Similar documents
     • IR Systems
         • On-line library catalogs
         • On-line document management systems
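  8. Performance Measure
     • Two basic measures, defined over the following sets:
         • All documents
         • Retrieved documents
         • Relevant documents
         • Relevant & retrieved (the overlap of the two sets above)
     • Precision: $|\text{Relevant} \cap \text{Retrieved}| \,/\, |\text{Retrieved}|$
     • Recall: $|\text{Relevant} \cap \text{Retrieved}| \,/\, |\text{Relevant}|$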
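The formulas on this slide were images and are not preserved in the transcript. Assuming the two basic measures are the usual precision and recall over the sets named above, a minimal Python sketch could look as follows. The function name and the document ids are illustrative only.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # the "relevant & retrieved" region
    # Guard against empty sets; returning 0.0 there is a convention chosen here.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids, for illustration only.
retrieved_docs = {"d1", "d2", "d3", "d4"}
relevant_docs = {"d2", "d4", "d5"}
p, r = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={p:.2f} recall={r:.2f}")
```

Precision penalizes retrieving irrelevant documents, while recall penalizes missing relevant ones, so the two are usually reported together.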
  9. Retrieval Methods
     • Keyword-based IR
         • E.g., “data and mining”
         • Synonymy problem: a document may talk about “knowledge discovery” instead
         • Polysemy problem: “mining” can mean different things
     • Similarity-based IR
         • Set of common keywords
         • Return the degree of relevance
         • Problem: what is the similarity of “data mining” and “data analysis”?
  10. Modeling a Document
     • Set of n documents and m terms
     • Each document is a vector v in $\mathbb{R}^m$
         • The j-th coordinate of v measures the association of the j-th term with the document
         • Here r is the number of occurrences of the j-th term and R is the number of occurrences of any term
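The exact weighting formula on this slide is not preserved. As a rough sketch, assuming the simplest reading in which the j-th coordinate is the relative frequency r / R, a document can be turned into a vector like this. The whitespace tokenizer, the function name, and the toy text are assumptions made for illustration.

```python
from collections import Counter


def term_vector(text, vocabulary):
    """Map a document to a vector with one coordinate per vocabulary term."""
    counts = Counter(text.lower().split())  # naive whitespace tokenizer
    total = sum(counts.values())  # R: occurrences of any term in the document
    if total == 0:
        return [0.0] * len(vocabulary)
    # Assumed weighting r / R; the slide's own formula may differ.
    return [counts[term] / total for term in vocabulary]


vocabulary = ["association", "mining"]  # m = 2 terms
print(term_vector("Idaho Mining Association mining in Idaho", vocabulary))
```

Other weightings (for example, damped or inverse-document-frequency variants) fit the same vector model; only the per-coordinate formula changes.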
  11. Frequency Matrix
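  12. Similarity Measures
     • Cosine measure: the dot product of two document vectors divided by the product of the norms of the vectors

         $$\mathrm{sim}(v_1, v_2) = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert}$$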
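The cosine measure can be sketched in a few lines of plain Python. The two vectors below are hypothetical term counts, not values taken from the slides.

```python
import math


def cosine_similarity(v1, v2):
    """Dot product of v1 and v2 divided by the product of their norms."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same number of terms")
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm1 * norm2)


doc1 = [1, 2]  # hypothetical counts for two terms
doc2 = [2, 1]
print(cosine_similarity(doc1, doc2))
```

Because the measure depends only on the angle between the vectors, a long document and a short one with the same term proportions score as identical.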
  13. Example
     • Google search for “association mining”
     • Two of the documents retrieved:
         • Idaho Mining Association: mining in Idaho (doc 1)
         • Scalable Algorithms for Association Mining (doc 2)
     • Using only the two terms “association” and “mining”
  14. New Model
     • Add the term “data” to the document model
  15. Frequency Matrix
     • Will quickly become large
     • Singular value decomposition can be used to reduce it
  16. Association Analysis
     • Collect sets of keywords that are frequently used together and find associations among them
     • Apply any association rule algorithm to a database in the format {document_id, a_set_of_keywords}
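As a small illustration of the {document_id, a_set_of_keywords} format, the sketch below counts keyword pairs that occur together in at least a given fraction of the documents. It covers only the pair-counting step, not a full association rule algorithm such as Apriori, and the toy database is made up.

```python
from collections import Counter
from itertools import combinations


def frequent_keyword_pairs(database, min_support):
    """Return {pair: support} for keyword pairs that appear together in at
    least min_support (a fraction between 0 and 1) of the documents."""
    pair_counts = Counter()
    for _doc_id, keywords in database:
        # Sort so each pair has one canonical (alphabetical) order.
        for pair in combinations(sorted(set(keywords)), 2):
            pair_counts[pair] += 1
    n = len(database)
    return {pair: count / n
            for pair, count in pair_counts.items()
            if count / n >= min_support}


# Hypothetical database of (document_id, set_of_keywords) records.
database = [
    (1, {"data", "mining", "association"}),
    (2, {"data", "mining"}),
    (3, {"idaho", "mining"}),
    (4, {"data", "analysis"}),
]
print(frequent_keyword_pairs(database, min_support=0.5))
```

A full algorithm would extend frequent pairs to larger keyword sets and then derive rules from them.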
  17. Document Classification
     • Need already classified documents as training set
     • Induce a classification model
     • Any difference from before?
         • A set of keywords associated with a document has no fixed set of attributes or dimensions
  18. Association-Based Classification
     • Classify documents based on associated, frequently occurring text patterns
         • Extract keywords and terms with IR and simple association analysis
         • Create a concept hierarchy of terms
         • Classify training documents into class hierarchies
         • Use association mining to discover associated terms to distinguish one class from another
  19. Remember Generalized Association Rules
     • Taxonomy:
         • Clothes
             • Outerwear
                 • Jackets
                 • Ski Pants
             • Shirts
         • Footwear (ancestor of shoes and hiking boots)
             • Shoes
             • Hiking Boots
     • Generalized association rule: X ⇒ Y, where no item in Y is an ancestor of an item in X
  20. Classifiers
     • Let X be a set of terms
     • Let Anc(X) be those terms and their ancestor terms
     • Consider a rule X ⇒ C and document d
     • If X ⊆ Anc(d) then X ⇒ C covers d
     • A rule that covers d may be used to classify d (but only one can be used)
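The covering test can be sketched as follows, assuming the taxonomy is stored as a child-to-parent mapping. The `PARENT` table reuses the clothing taxonomy from the generalized association rules slide purely as a stand-in for a term hierarchy, and the function names are illustrative.

```python
# Hypothetical child -> parent taxonomy of terms.
PARENT = {
    "jackets": "outerwear",
    "ski pants": "outerwear",
    "outerwear": "clothes",
    "shirts": "clothes",
    "shoes": "footwear",
    "hiking boots": "footwear",
}


def anc(terms):
    """Return the given terms together with all of their ancestor terms."""
    closure = set()
    for term in terms:
        # Walk up the taxonomy; stop early if this chain was already added.
        while term is not None and term not in closure:
            closure.add(term)
            term = PARENT.get(term)
    return closure


def covers(rule_terms, doc_terms):
    """A rule X => C covers document d when X is a subset of Anc(d)."""
    return set(rule_terms) <= anc(doc_terms)


doc = {"jackets", "hiking boots"}
print(covers({"outerwear", "footwear"}, doc))
print(covers({"shirts"}, doc))
```

Testing against Anc(d) rather than d itself is what lets a rule stated in general terms match a document that only mentions specific ones.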
  21. Procedure
     • Step 1: Generate all generalized association rules X ⇒ C, where X is a set of terms and C is a class, that satisfy minimum support
     • Step 2: Rank the rules according to some rule ranking criterion
     • Step 3: Select rules from the list
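The slide leaves the ranking criterion and the selection strategy open. The sketch below assumes one plausible choice: rank by confidence, then support, and classify with the first rule in the ranked list that covers the document. Rule mining itself is not shown, the rules and their numbers are invented for illustration, and the document's terms are assumed to already include their ancestor terms.

```python
def rank_rules(rules, min_support):
    """Keep rules meeting min_support and rank them, best first.

    Each rule is a dict with keys 'terms', 'label', 'support', 'confidence'.
    """
    kept = [r for r in rules if r["support"] >= min_support]
    # Assumed ranking criterion: higher confidence first, then higher support.
    kept.sort(key=lambda r: (r["confidence"], r["support"]), reverse=True)
    return kept


def classify(ranked_rules, doc_terms, default=None):
    """Use only the highest-ranked rule whose terms all occur in doc_terms."""
    doc_terms = set(doc_terms)
    for rule in ranked_rules:
        if set(rule["terms"]) <= doc_terms:
            return rule["label"]
    return default


# Made-up rules, for illustration only.
rules = [
    {"terms": {"mining", "data"}, "label": "computer science",
     "support": 0.30, "confidence": 0.90},
    {"terms": {"mining"}, "label": "geology",
     "support": 0.40, "confidence": 0.60},
    {"terms": {"idaho"}, "label": "geology",
     "support": 0.05, "confidence": 0.95},
]
ranked = rank_rules(rules, min_support=0.1)
print(classify(ranked, {"data", "mining", "algorithms"}))
print(classify(ranked, {"mining", "idaho"}))
```

Using a single best rule per document matches the constraint that only one covering rule can be used for classification.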
  22. Web Mining
     • The World Wide Web may have more opportunities for data mining than any other area
     • However, there are serious challenges:
         • It is extremely large
         • Web pages are more complex than any traditional text document collection
         • It is highly dynamic
         • It has a broad diversity of users
         • Only a tiny portion of the information is truly useful
  23. Search Engines → Web Mining
     • Current technology: search engines
         • Keyword-based indices
         • Too many relevant pages
         • Synonymy and polysemy problems
     • More challenging: web mining
         • Web content mining
         • Web structure mining
         • Web usage mining
  24. Web Content Mining
  25. Example: Classification of Web Documents
     • Assign a class to each document based on predefined topic categories
     • E.g., use Yahoo!’s taxonomy and associated documents for training
     • Keyword-based document classification
     • Keyword-based association analysis
  26. Web Structure Mining
  27. Authoritative Web Pages
     • High-quality, relevant Web pages are termed authoritative
     • Explore linkages (hyperlinks)
         • Linking to a Web page can be considered an endorsement of that page
         • Pages that are linked to frequently are considered authoritative
         • (This idea has its roots in IR methods based on journal citations)
  28. Structure via Hubs
     • A hub is a set of Web pages containing collections of links to authorities
     • There is a wide variety of hubs:
         • Simple list of recommended links on a person’s home page
         • Professional resource lists on commercial sites
  29. HITS
     • Hyperlink-Induced Topic Search (HITS)
         • Form a root set of pages using the query terms in an index-based search (200 pages)
         • Expand into a base set by including all pages the root set links to (1000-5000 pages)
         • Go into an iterative process to determine hubs and authorities
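  30. Calculating Weights
     • Authority weight: $a_p = \sum_{q \to p} h_q$, summing the hub weights of all pages q such that page p is pointed to by page q
     • Hub weight: $h_p = \sum_{p \to q} a_q$, summing the authority weights of all pages q that page p points to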
  31. Adjacency Matrix
     • Let’s number the pages {1, 2, …, n}
     • The adjacency matrix A is defined by $A_{ij} = 1$ if page i links to page j, and $A_{ij} = 0$ otherwise
     • By writing the authority and hub weights as vectors a and h we have $h = A a$ and $a = A^{T} h$
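  32. Recursive Calculations
     • With A the adjacency matrix, and a and h the authority and hub weight vectors, we now have $h = A A^{T} h$ and $a = A^{T} A a$
     • By linear algebra theory, iterating these updates (with normalization) converges to the principal eigenvectors of the two matrices $A A^{T}$ and $A^{T} A$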
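The iterative process can be sketched as a plain power iteration over a link graph. This is a minimal illustration and not the Clever Project's implementation: the fixed iteration count, the unit-length normalization, and the tiny example graph are assumptions made here.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) weights for a link graph.

    links maps each page to the set of pages it links to.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: a_p = sum of h_q over pages q that link to p.
        auth = {p: 0.0 for p in pages}
        for q, targets in links.items():
            for p in targets:
                auth[p] += hub[q]
        # Hub update: h_p = sum of a_q over pages q that p links to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize to unit length so the weights stay bounded.
        for weights in (auth, hub):
            norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
            for p in weights:
                weights[p] /= norm
    return hub, auth


# Hypothetical four-page graph.
links = {
    "hub1": {"authA", "authB"},
    "hub2": {"authA"},
    "authA": set(),
    "authB": set(),
}
hub, auth = hits(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

Without the normalization step the weights would grow without bound; with it, the vectors settle toward the principal eigenvector directions described above.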
  33. Output
     • The HITS algorithm finally outputs
         • Short list of pages with high hub weights
         • Short list of pages with high authority weights
     • These weights do not account for context
  34. Applications
     • The Clever Project at IBM’s Almaden Labs
         • Developed the HITS algorithm
     • Google
         • Developed at Stanford
         • Uses algorithms similar to HITS (PageRank)
         • On-line version
  35. Web Usage Mining
  36. Complex Data Types Summary
     • Emerging areas of mining complex data types:
         • Text mining can be done quite effectively, especially if the documents are semi-structured
         • Web mining is more difficult due to lack of such structure
             • Data includes text documents, hypertext documents, link structure, and logs
             • Need to rely on unsupervised learning, sometimes followed up with supervised learning such as classification