Text Mining



  1. 1. Text & Web Mining
  2. 2. Structured Data <ul><li>So far we have focused on mining from structured data: </li></ul>Attribute/value pairs: Outlook = Sunny, Temperature = Hot, Windy = Yes, Humidity = High, Play = Yes. Most data mining involves such data.
  3. 3. Complex Data Types <ul><li>Increased importance of complex data: </li></ul><ul><ul><li>Spatial data : includes geographic data and medical & satellite images </li></ul></ul><ul><ul><li>Multimedia data : images, audio, & video </li></ul></ul><ul><ul><li>Time-series data : for example banking data and stock exchange data </li></ul></ul><ul><ul><li>Text data : word descriptions for objects </li></ul></ul><ul><ul><li>World-Wide-Web : highly unstructured text and multimedia data </li></ul></ul>Focus here: text data and the World-Wide-Web.
  4. 4. Text Databases <ul><li>Many text databases exist in practice </li></ul><ul><ul><li>News articles </li></ul></ul><ul><ul><li>Research papers </li></ul></ul><ul><ul><li>Books </li></ul></ul><ul><ul><li>Digital libraries </li></ul></ul><ul><ul><li>E-mail messages </li></ul></ul><ul><ul><li>Web pages </li></ul></ul><ul><li>Growing rapidly in size and importance </li></ul>
  5. 5. Semi-Structured Data <ul><li>Text databases are often semi-structured </li></ul><ul><li>Example: </li></ul><ul><ul><li>Title </li></ul></ul><ul><ul><li>Author </li></ul></ul><ul><ul><li>Publication_Date </li></ul></ul><ul><ul><li>Length </li></ul></ul><ul><ul><li>Category </li></ul></ul><ul><ul><li>Abstract </li></ul></ul><ul><ul><li>Content </li></ul></ul>Title through Category are structured attribute/value pairs; Abstract and Content are unstructured.
  6. 6. Handling Text Data <ul><li>Modeling semi-structured data </li></ul><ul><li>Information Retrieval (IR) from unstructured documents </li></ul><ul><li>Text mining </li></ul><ul><ul><li>Compare documents </li></ul></ul><ul><ul><li>Rank importance & relevance </li></ul></ul><ul><ul><li>Find patterns or trends across documents </li></ul></ul>
  7. 7. Information Retrieval <ul><li>IR locates relevant documents </li></ul><ul><ul><li>Key words </li></ul></ul><ul><ul><li>Similar documents </li></ul></ul><ul><li>IR Systems </li></ul><ul><ul><li>On-line library catalogs </li></ul></ul><ul><ul><li>On-line document management systems </li></ul></ul>
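  8. 8. Performance Measure <ul><li>Two basic measures, defined over the set of all documents: </li></ul><ul><ul><li>Precision: $|\text{Relevant} \cap \text{Retrieved}| / |\text{Retrieved}|$, the fraction of retrieved documents that are relevant </li></ul></ul><ul><ul><li>Recall: $|\text{Relevant} \cap \text{Retrieved}| / |\text{Relevant}|$, the fraction of relevant documents that are retrieved </li></ul></ul>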
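A minimal sketch of the two measures, assuming documents are identified by simple string IDs. The function name `precision_recall` and the toy ID sets are illustrative and do not come from the slides.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document IDs."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # relevant AND retrieved
    # Guard against empty sets to avoid division by zero.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy data: 4 relevant documents, 3 retrieved, 2 in common.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d3", "d4", "d5"}
p, r = precision_recall(relevant, retrieved)
print(p, r)
```

Here two of the three retrieved documents are relevant, and two of the four relevant documents were retrieved.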
  9. 9. Retrieval Methods <ul><li>Keyword-based IR </li></ul><ul><ul><li>E.g., “data and mining” </li></ul></ul><ul><ul><li>Synonymy problem : a document may talk about “knowledge discovery” instead </li></ul></ul><ul><ul><li>Polysemy problem : “mining” can mean different things </li></ul></ul><ul><li>Similarity-based IR </li></ul><ul><ul><li>Set of common keywords </li></ul></ul><ul><ul><li>Return the degree of relevance </li></ul></ul><ul><ul><li>Problem: what is the similarity of “data mining” and “data analysis”? </li></ul></ul>
  10. 10. Modeling a Document <ul><li>Set of n documents and m terms </li></ul><ul><li>Each document is a vector $v$ in $\mathbb{R}^m$ </li></ul><ul><ul><li>The j -th coordinate of v measures the association of the j -th term </li></ul></ul><ul><ul><li>Here r is the number of occurrences of the j -th term and R is the number of occurrences of any term. </li></ul></ul>
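The formula for the j-th coordinate is not preserved in this transcript. The sketch below assumes the relative term frequency r / R, which matches the definitions of r and R given on the slide; treat that weighting as an assumption. The whitespace tokenizer, the function name `document_vector`, and the sample sentence are also illustrative.

```python
from collections import Counter


def document_vector(text, terms):
    """Map a document to a vector with one coordinate per term.

    Assumption: coordinate j is r / R, where r is the count of term j
    and R is the total number of term occurrences in the document.
    """
    tokens = text.lower().split()  # naive whitespace tokenizer
    counts = Counter(tokens)
    total = len(tokens)  # R
    return [counts[t] / total if total else 0.0 for t in terms]


terms = ["data", "mining"]
v = document_vector("data mining finds patterns in data", terms)
print(v)
```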
  11. 11. Frequency Matrix
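The matrix itself was an image on this slide. As a rough illustration, the sketch below builds a frequency matrix with one row per document and one column per term, holding raw counts. The orientation, the function name `frequency_matrix`, and the toy documents are assumptions.

```python
from collections import Counter


def frequency_matrix(docs, terms):
    """Return a list of rows; row i holds the counts of each term in docs[i]."""
    matrix = []
    for text in docs:
        counts = Counter(text.lower().split())  # naive whitespace tokenizer
        matrix.append([counts[t] for t in terms])
    return matrix


# Hypothetical toy collection.
docs = ["data mining", "mining data mining", "text retrieval"]
terms = ["data", "mining", "text"]
F = frequency_matrix(docs, terms)
for row in F:
    print(row)
```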
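  12. 12. Similarity Measures <ul><li>Cosine measure: $\mathrm{sim}(v_1, v_2) = \dfrac{v_1 \cdot v_2}{|v_1|\,|v_2|}$, the dot product of the two document vectors divided by the product of their norms </li></ul>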
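A minimal sketch of the cosine measure for two term vectors of equal length. Returning 0.0 for a zero vector is a convention chosen here, not something the slides specify.

```python
import math


def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumed convention for empty documents
    return dot / (norm1 * norm2)


sim_same = cosine_similarity([1, 2], [2, 4])  # same direction
sim_orth = cosine_similarity([1, 0], [0, 1])  # no shared terms
print(sim_same, sim_orth)
```

Vectors pointing in the same direction score close to 1 regardless of document length, while documents with no terms in common score 0.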
  13. 13. Example <ul><li>Google search for “association mining” </li></ul><ul><li>Two of the documents retrieved: </li></ul><ul><ul><li>Idaho Mining Association: mining in Idaho (doc 1) </li></ul></ul><ul><ul><li>Scalable Algorithms for Association mining (doc 2) </li></ul></ul><ul><li>Using only the two terms </li></ul>
  14. 14. New Model <ul><li>Add the term “data” to the document model </li></ul>
  15. 15. Frequency Matrix <ul><li>The frequency matrix will quickly become large </li></ul><ul><li>Singular value decomposition can be used to reduce its dimensionality </li></ul>
  16. 16. Association Analysis <ul><li>Collect set of keywords frequently used together and find association among them </li></ul><ul><li>Apply any association rule algorithm to a database in the format </li></ul><ul><li>{document_id, a_set_of_keywords} </li></ul>
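The slide leaves the choice of association rule algorithm open. The sketch below shows only the first step that such algorithms share, counting how often keyword pairs occur together, on a database in the {document_id, a_set_of_keywords} format. It is not a full rule miner, and the function name, toy database, and threshold are illustrative.

```python
from collections import Counter
from itertools import combinations


def frequent_keyword_pairs(database, min_support):
    """Return {pair: support} for keyword pairs meeting min_support.

    database maps document_id -> set of keywords.
    Support is the fraction of documents containing both keywords.
    """
    counts = Counter()
    for keywords in database.values():
        # Sort so each pair has one canonical ordering.
        for pair in combinations(sorted(keywords), 2):
            counts[pair] += 1
    n = len(database)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


# Hypothetical toy database.
database = {
    1: {"data", "mining", "association"},
    2: {"data", "mining"},
    3: {"mining", "idaho"},
    4: {"data", "mining", "rules"},
}
pairs = frequent_keyword_pairs(database, min_support=0.5)
print(pairs)
```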
  17. 17. Document Classification <ul><li>Need already classified documents as training set </li></ul><ul><li>Induce a classification model </li></ul><ul><li>Any difference from before? </li></ul>A set of keywords associated with a document has no fixed set of attributes or dimensions
  18. 18. Association-Based Classification <ul><li>Classify documents based on associated, frequently occurring text patterns </li></ul><ul><ul><li>Extract keywords and terms with IR and simple association analysis </li></ul></ul><ul><ul><li>Create a concept hierarchy of terms </li></ul></ul><ul><ul><li>Classify training documents into class hierarchies </li></ul></ul><ul><ul><li>Use association mining to discover associated terms to distinguish one class from another </li></ul></ul>
  19. 19. Remember Generalized Association Rules <ul><li>Taxonomy: </li></ul><ul><ul><li>Clothes: Outerwear (Jackets, Ski Pants), Shirts </li></ul></ul><ul><ul><li>Footwear (ancestor of Shoes and Hiking Boots): Shoes, Hiking Boots </li></ul></ul><ul><li>Generalized association rule $X \Rightarrow Y$, where no item in Y is an ancestor of an item in X </li></ul>
  20. 20. Classifiers <ul><li>Let X be a set of terms </li></ul><ul><li>Let Anc ( X ) be those terms and their ancestor terms </li></ul><ul><li>Consider a rule $X \Rightarrow C$ and document d </li></ul><ul><li>If $X \subseteq \mathrm{Anc}(d)$ then $X \Rightarrow C$ covers d </li></ul><ul><li>A rule that covers d may be used to classify d (but only one can be used) </li></ul>
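A small sketch of the coverage test, assuming the term taxonomy is stored as a child-to-parent map. The `PARENT` map reuses the clothing terms from the earlier taxonomy slide, but its exact shape here is an assumption, as are the function names `anc` and `covers`.

```python
# Assumed taxonomy, stored as child -> parent.
PARENT = {
    "jackets": "outerwear",
    "ski pants": "outerwear",
    "outerwear": "clothes",
    "shirts": "clothes",
    "shoes": "footwear",
    "hiking boots": "footwear",
}


def anc(terms):
    """Return the given terms together with all of their ancestors."""
    closure = set()
    for term in terms:
        # Walk up the taxonomy until the root or an already-seen term.
        while term is not None and term not in closure:
            closure.add(term)
            term = PARENT.get(term)
    return closure


def covers(rule_terms, doc_terms):
    """A rule X => C covers document d when X is a subset of Anc(d)."""
    return set(rule_terms) <= anc(doc_terms)


print(anc({"jackets"}))
print(covers({"outerwear"}, {"jackets", "shoes"}))
print(covers({"shirts"}, {"jackets"}))
```

A document mentioning only "jackets" is covered by a rule on "outerwear" because the taxonomy generalizes the term, but not by a rule on "shirts".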
  21. 21. Procedure <ul><li>Step 1: Generate all generalized association rules $X \Rightarrow C$, where X is a set of terms and C is a class, that satisfy minimum support. </li></ul><ul><li>Step 2: Rank the rules according to some rule ranking criterion </li></ul><ul><li>Step 3: Select rules from the list </li></ul>
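Steps 2 and 3 can be sketched as follows. Step 1 is not shown: the rules are hand-written, hypothetical examples. The ranking criterion (confidence, then support) and the selection policy (first covering rule wins) are assumptions, since the slide leaves both open.

```python
# Assumed taxonomy, stored as child -> parent.
PARENT = {"jackets": "outerwear", "outerwear": "clothes", "shoes": "footwear"}


def anc(terms):
    """Return the given terms together with all of their ancestors."""
    closure = set()
    for term in terms:
        while term is not None and term not in closure:
            closure.add(term)
            term = PARENT.get(term)
    return closure


def classify(doc_terms, rules, default=None):
    """Classify a document with the highest-ranked rule that covers it."""
    expanded = anc(doc_terms)
    # Step 2: rank by confidence, breaking ties by support (assumed criterion).
    ranked = sorted(
        rules, key=lambda r: (r["confidence"], r["support"]), reverse=True
    )
    # Step 3: use only the first rule whose term set X is within Anc(d).
    for rule in ranked:
        if rule["X"] <= expanded:
            return rule["C"]
    return default


# Hypothetical rules standing in for the output of Step 1.
rules = [
    {"X": {"clothes"}, "C": "apparel", "support": 0.30, "confidence": 0.70},
    {"X": {"outerwear", "footwear"}, "C": "outdoor", "support": 0.10, "confidence": 0.90},
]
label_a = classify({"jackets", "shoes"}, rules)
label_b = classify({"jackets"}, rules)
print(label_a, label_b)
```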
  22. 22. Web Mining <ul><li>The World Wide Web may have more opportunities for data mining than any other area </li></ul><ul><li>However, there are serious challenges: </li></ul><ul><ul><li>It is too huge </li></ul></ul><ul><ul><li>Complexity of Web pages is greater than any traditional text document collection </li></ul></ul><ul><ul><li>It is highly dynamic </li></ul></ul><ul><ul><li>It has a broad diversity of users </li></ul></ul><ul><ul><li>Only a tiny portion of the information is truly useful </li></ul></ul>
  23. 23. Search Engines → Web Mining <ul><li>Current technology: search engines </li></ul><ul><ul><li>Keyword-based indices </li></ul></ul><ul><ul><li>Too many relevant pages </li></ul></ul><ul><ul><li>Synonymy and polysemy problems </li></ul></ul><ul><li>More challenging: web mining </li></ul><ul><ul><li>Web content mining </li></ul></ul><ul><ul><li>Web structure mining </li></ul></ul><ul><ul><li>Web usage mining </li></ul></ul>
  24. 24. Web Content Mining
  25. 25. Example: Classification of Web Documents <ul><li>Assign a class to each document based on predefined topic categories </li></ul><ul><li>E.g., use Yahoo!’s taxonomy and associated documents for training </li></ul><ul><li>Keyword-based document classification </li></ul><ul><li>Keyword-based association analysis </li></ul>
  26. 26. Web Structure Mining
  27. 27. Authoritative Web Pages <ul><li>High-quality, relevant Web pages are termed authoritative </li></ul><ul><li>Explore linkages (hyperlinks) </li></ul><ul><ul><li>Linking to a Web page can be considered an endorsement of that page </li></ul></ul><ul><ul><li>Pages that are linked to frequently are considered authoritative </li></ul></ul><ul><ul><li>(This idea has its roots in IR methods based on journal citations) </li></ul></ul>
  28. 28. Structure via Hubs <ul><li>A hub is a set of Web pages containing collections of links to authorities </li></ul><ul><li>There is a wide variety of hubs: </li></ul><ul><ul><li>Simple list of recommended links on a person’s home page </li></ul></ul><ul><ul><li>Professional resource lists on commercial sites </li></ul></ul>
  29. 29. HITS <ul><li>Hyperlink-Induced Topic Search (HITS) </li></ul><ul><ul><li>Form a root set of pages using the query terms in an index-based search (200 pages) </li></ul></ul><ul><ul><li>Expand into a base set by including all pages the root set links to (1000-5000 pages) </li></ul></ul><ul><ul><li>Go into an iterative process to determine hubs and authorities </li></ul></ul>
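The expansion from root set to base set, as described on this slide, can be sketched like this. The link graph is a hypothetical dictionary from each page to the pages it links to, and the function name `expand_base_set` is illustrative. Only outgoing links are followed, matching the wording of the slide.

```python
def expand_base_set(root_set, out_links):
    """Return the root set plus every page that a root page links to."""
    base = set(root_set)
    for page in root_set:
        base.update(out_links.get(page, ()))
    return base


# Hypothetical link graph: page -> pages it links to.
out_links = {"a": ["b", "c"], "b": ["d"], "x": ["a"]}
base = expand_base_set({"a"}, out_links)
print(sorted(base))
```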
  30. 30. Calculating Weights <ul><li>Authority weight: $a_p = \sum_{q \to p} h_q$ </li></ul><ul><li>Hub weight: $h_p = \sum_{p \to q} a_q$ </li></ul>Here $q \to p$ means that page p is pointed to by page q.
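  31. 31. Adjacency Matrix <ul><li>Let's number the pages {1,2,…, n } </li></ul><ul><li>The adjacency matrix A is defined by $A_{ij} = 1$ if page i links to page j, and $A_{ij} = 0$ otherwise </li></ul><ul><li>By writing the authority and hub weights as vectors a and h we have $a = A^{T} h$ and $h = A a$ </li></ul>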
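  32. 32. Recursive Calculations <ul><li>We now have $a^{(k)} = A^{T} A\, a^{(k-1)}$ and $h^{(k)} = A A^{T} h^{(k-1)}$ (up to normalization), where A is the adjacency matrix </li></ul><ul><li>By linear algebra theory, this converges to the principal eigenvectors of the two matrices $A^{T} A$ and $A A^{T}$ </li></ul>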
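A compact sketch of the iterative hub and authority calculation on a link graph. It works directly on link lists rather than an explicit matrix, and it normalizes both weight vectors to unit length each round so the values stay bounded. The fixed iteration count, the function names, and the toy graph are assumptions.

```python
import math


def normalize(weights):
    """Scale a weight dictionary to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)
    return {p: w / norm for p, w in weights.items()}


def hits(links, iterations=50):
    """Return (authority, hub) weights for a graph given as page -> link targets."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of p: sum of hub weights of the pages q that point to p.
        new_auth = {p: 0.0 for p in pages}
        for q, targets in links.items():
            for p in targets:
                new_auth[p] += hub[q]
        # Hub weight of p: sum of authority weights of the pages p points to.
        new_hub = {p: sum(new_auth[t] for t in links.get(p, ())) for p in pages}
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub


# Hypothetical toy graph: two hub-like pages linking to two targets.
links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
auth, hub = hits(links)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
print(best_authority, best_hub)
```

In this toy graph, a1 receives links from both hub pages and ends up with the highest authority weight, while h1 links to both targets and ends up with the highest hub weight.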
  33. 33. Output <ul><li>The HITS algorithm finally outputs </li></ul><ul><ul><li>Short list of pages with high hub weights </li></ul></ul><ul><ul><li>Short list of pages with high authority weights </li></ul></ul><ul><li>Have not accounted for context </li></ul>
  34. 34. Applications <ul><li>The Clever Project at IBM’s Almaden Labs </li></ul><ul><ul><li>Developed the HITS algorithm </li></ul></ul><ul><li>Google </li></ul><ul><ul><li>Developed at Stanford </li></ul></ul><ul><ul><li>Uses algorithms similar to HITS (PageRank) </li></ul></ul><ul><ul><li>On-line version </li></ul></ul>
  35. 35. Web Usage Mining
  36. 36. Complex Data Types Summary <ul><li>Emerging areas of mining complex data types: </li></ul><ul><ul><li>Text mining can be done quite effectively, especially if the documents are semi-structured </li></ul></ul><ul><ul><li>Web mining is more difficult due to lack of such structure </li></ul></ul><ul><ul><ul><li>Data includes text documents, hypertext documents, link structure, and logs </li></ul></ul></ul><ul><ul><ul><li>Need to rely on unsupervised learning, sometimes followed up with supervised learning such as classification </li></ul></ul></ul>