  1. Data Mining with Unstructured Data
     A Study and Implementation of Industry Product(s)
     Samrat Sen
  2. Goals
     - Issues in text mining with unstructured data
     - Analysis of data mining products
     - Study of a real-life classification problem
     - Strategy for solving the problem
  3. Issues in Text Mining
     - Different from KDD and data mining techniques for structured databases
     - Problems with the classical techniques:
       1. They are concerned with predefined fields
       2. They are based on learning from attribute-value databases
     - Example on the next slide
  4. Issues in Text Mining
     Induced rules:

       If Married(Person, Spouse) and Income(Person) >= 25,000
       then Potential-Customer(Spouse)

       If Married(Person, Spouse) and Potential-Customer(Person)
       then Potential-Customer(Spouse)

     Potential Customer table:

       Person   Age   Sex   Income   Customer
       Ann S    32    F     10,000   yes
       Jane G   53    F     20,000   no
       Sri S    35    M     65,000   yes
       Egor     25    M     10,000   yes

     Married-to table:

       Husband   Wife
       Egor      Ann S
       Sri H     Jane
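In code, the slide's rule-induction example amounts to a small forward-chaining evaluation. A minimal Python sketch, assuming marriage is symmetric, that the Customer column seeds the Potential-Customer facts, and that "Sri S"/"Sri H" name the same person (all assumptions on our part, not stated on the slide):

```python
# Data mirroring the slide's two tables (names normalized).
income = {"Ann": 10_000, "Jane": 20_000, "Sri": 65_000, "Egor": 10_000}
customer = {"Ann": True, "Jane": False, "Sri": True, "Egor": True}
married = [("Egor", "Ann"), ("Sri", "Jane")]

def potential_customers():
    # Treat marriage as symmetric and seed from the Customer column.
    pairs = married + [(w, h) for h, w in married]
    potential = {p for p, is_cust in customer.items() if is_cust}
    changed = True
    while changed:  # iterate the rules to a fixpoint
        changed = False
        for person, spouse in pairs:
            # Rule 1: Married(P, S) and Income(P) >= 25,000 -> Potential(S)
            # Rule 2: Married(P, S) and Potential(P)        -> Potential(S)
            if (income[person] >= 25_000 or person in potential) \
                    and spouse not in potential:
                potential.add(spouse)
                changed = True
    return potential
```

Note that Jane is marked Customer = no in the table, yet the first rule promotes her to a potential customer through Sri's income; that is exactly the kind of inference the slide's induced rules express.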
  5. Issues in Text Mining
     - Algorithmic techniques are needed, such as association extraction
       from indexed data and prototypical document extraction from full text
     - Industry-standard data mining tools cannot be used directly
     - e.g., a typical process has to have a text transformer, a text
       analyzer, and a summary generator
  6. Issues in Text Mining
     - The input and output interfaces and the file formats may cost time
       and money
     - Exhaustive domains have to be set up for classification
     - Costs and benefits have to be weighed before model selection:
       1. Gain from a correct positive prediction
       2. Loss from an incorrect positive prediction (false positive)
       3. Benefit from a correct negative prediction
       4. Cost of an incorrect negative prediction (false negative)
       5. Cost of project time (a better product/algorithm may come up)
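The cost/benefit weighing in the numbered list above can be made concrete as an expected-value calculation over a model's confusion counts. A sketch with made-up numbers (none of the figures come from the presentation):

```python
# Net value of deploying a classifier, combining the slide's four
# per-outcome quantities with the model's confusion-matrix counts.

def net_value(tp, fp, tn, fn,
              gain_tp, loss_fp, benefit_tn, cost_fn):
    """Net value = gains from correct predictions minus losses from errors."""
    return (tp * gain_tp        # 1. gain from a correct positive prediction
            - fp * loss_fp      # 2. loss from a false positive
            + tn * benefit_tn   # 3. benefit from a correct negative prediction
            - fn * cost_fn)     # 4. cost of a false negative

# Illustrative counts and dollar values (all hypothetical):
value = net_value(tp=80, fp=20, tn=890, fn=10,
                  gain_tp=10.0, loss_fp=5.0, benefit_tn=0.5, cost_fn=8.0)
```

A model is worth selecting only if this net value (less the project-time cost, item 5) beats the alternatives.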
  7. Data Mining Products/Tools
     - DARWIN, from Oracle
     - Intelligent Miner, from IBM
     - interMedia Text, with the Oracle database, with a context query
       feature (theme-based document retrieval)
     For more info:
       ip /analyze/warehouse/datamining/
       http://www-4.ibm.com/software/data/iminer/
  8. Data Mining Products/Tools
     - A new specification for a data mining API is being proposed by Sun
     - SQL Server 2000, with data mining and English query features
     - Verity Knowledge Organizer
  9. DARWIN
     Functions:
     - Prediction (from known values)
     - Classification (into categories)
     - Forecasting (future predictions)
     Approach:
     - Plan
     - Prepare the dataset
     - Build and use models
 10. DARWIN
     - The problem is defined in terms of data fields and data records
     - The fields are classified as follows:
       - Categorical and ordered fields
       - Predictive fields
       - Target fields
     - A DARWIN dataset file has to be created containing all the records
       in the problem domain (using a descriptor file)
 11. DARWIN - Models
     - Tree model: based on a classification and regression tree algorithm
     - Net model: a feed-forward multilayer neural network
     - Match model: a memory-based reasoning model, using a
       k-nearest-neighbor algorithm
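DARWIN's internals are not public, so the following is only a generic illustration of the memory-based reasoning idea behind the Match model: classify a new record by majority vote among its k nearest stored training records.

```python
# Generic k-nearest-neighbor classification sketch (not DARWIN's actual
# implementation): keep the training records in memory and vote.

from collections import Counter
import math

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label) pairs; query: feature vector."""
    # Sort training records by Euclidean distance to the query record.
    by_dist = sorted(train, key=lambda rec: math.dist(rec[0], query))
    # Majority vote among the k closest records.
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

# Tiny illustrative dataset (hypothetical):
train = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((6, 5), "b")]
label = knn_predict(train, (5, 6), k=3)
```

DARWIN's "optimize match weights" step would correspond to learning per-attribute weights for the distance function, which this sketch omits.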
 12. DARWIN - Tree Model
     Training data -> Create Tree
       -> Test/Evaluate Tree (reports error rates of pruned sub-trees)
       -> Predict with Tree (using the selected sub-tree, on an input
          prediction dataset)
       -> Analyze Results (merged input and output prediction dataset)
 13. DARWIN - Net Model
     Training dataset -> Create Net (a neural network model)
       -> Train Net (produces a trained neural network)
       -> Predict with Net (on an input prediction dataset)
       -> Analyze Results (merged input and output prediction dataset)
 14. DARWIN - Match Model
     Training data -> Create Match Model
       -> Optimize match weights
       -> Predict with Match (on an input prediction dataset)
       -> Analyze Results (merged input and output prediction dataset)
 15. DARWIN - Analyzing
     - Evaluate: evaluates the performance of a given model on a given
       dataset, when working on known data for test or evaluation purposes
     - Summarize Data: provides a statistical summary of the values taken
       by the data in the specified fields of a dataset
     - Frequency Count: provides information on the frequency with which
       particular data values appear in a dataset
 16. DARWIN - Analyzing
     - Performance Matrix: can be used to compare simple fields or simple
       functions of fields
     - Sensitivity: provides a report showing the relative importance of
       the attributes used in building a model
 17. DARWIN - Code Generation
     - DARWIN can generate C, C++, or Java code for a Tree or Net model,
       so that a prediction function can be called from an application
       program
     - Java code can also be generated to embed a model in a web applet
 18. DARWIN - For More Info
     - Oracle Data Mining data sheet
     - Oracle Data Mining Solutions
     - Managing Unstructured Data with Oracle8
     - Product manuals
 19. DARWIN (screenshot)
 20. Oracle interMedia Text
     - A ranking technique called theme proving is used
     - Documents are grouped into categories and subcategories
     - Integrated with the Oracle 8 database
     - No training or tuning is required
 21. Oracle interMedia Text - Lexical Knowledge Base
     - 200,000 concepts from very broad domains
     - 2,000 major categories
     - Concepts are mapped to one or more words/phrases in canonical form
     - For each of these, alternate inflectional variations, acronyms, and
       synonyms are stored
     - Total vocabulary of 450,000 terms
     - Each entry has other parameters, such as part of speech
 22. Oracle interMedia Text - Theme Extraction
     - Themes are assigned initial ranks based on the structure of the
       document and the frequency of the theme
     - All ancestor themes are also included in the result
     - Theme proving is done before the final ranking
     Queries:
     - Direct match, phrase search ('contains'), case-sensitive query,
       misspellings and fuzzy match, inflections ('about'), compound
       queries, Boolean operators, natural-language query
 23. Oracle interMedia Text - Oracle at TREC-8
     (Eighth Text REtrieval Conference)
     - Recall at 1000:                    71.57% (3384/4728)
     - Average precision:                 41.30%
     - Initial precision (at recall 0.0): 92.79%
     - Final precision (at recall 1.0):    7.91%
 24. interMedia Text Model (diagram)
 25. Interface Options (diagram)
 26. Language Selection
     - Java for the robot
     - PL/SQL for data retrieval
 27. Code Execution (diagram)
 28. Overview of the System (diagram)
     Components: customer browser and client browser -> web server
     (listening at port 80, with a tag stripper) -> JDBC -> Oracle 8i
     interMedia Text server process
 29. interMedia Text
     Steps for building an application:
     1. Load the documents
     2. Index the documents
     3. Issue queries
     4. Present the documents that satisfy each query
 30. Loading Methods
     - INSERT statements
     - SQL*Loader
     - ctxsrv: a server daemon process that builds the index at regular
       intervals
     - The ctxload utility, used for:
       - Thesaurus import/export
       - Text loading
       - Document updating/exporting
 31. Create and Populate a Simple Table

       CREATE TABLE quick (
         quick_id NUMBER CONSTRAINT quick_pk PRIMARY KEY,
         text     VARCHAR2(80)
       );

       INSERT INTO quick VALUES ( 1, 'The cat sat on the mat' );
       INSERT INTO quick VALUES ( 2, 'The fox jumped over the dog' );
       INSERT INTO quick VALUES ( 3, 'The dog barked like a dog' );
       COMMIT;
 32. Run a Text Query

       SELECT text FROM quick
       WHERE CONTAINS ( text, 'sat on the mat' ) > 0;

       DRG-10599: column is not indexed

     - You must have a Text index on a column before you can run a
       CONTAINS query on it
 33. Create the Text Index

       CREATE INDEX quick_text ON quick ( text )
       INDEXTYPE IS CTXSYS.CONTEXT;

     - CTXSYS is the system user for interMedia Text
     - The INDEXTYPE keyword is a feature of the extensible indexing
       framework
 34. Run a Text Query

       SELECT text FROM quick
       WHERE CONTAINS ( text, 'sat on the mat' ) > 0;

       TEXT
       -----------------------
       The cat sat on the mat

     - Regard the CONTAINS function as Boolean in meaning
     - It is implemented as a number, since SQL does not have a Boolean
       datatype
     - The only sensible way to use it is with > 0
 35. Run a Text Query

       SELECT SCORE(42) s, text FROM quick
       WHERE CONTAINS ( text, 'dog', 42 ) >= 0  /* just for teaching purposes! */
       ORDER BY s DESC;

        S TEXT
       -- ---------------------------
        7 The dog barked like a dog
        4 The fox jumped over the dog

     - The better the match, the higher the score
     - The value can be used in ORDER BY but has no absolute significance
     - The score is zero when the query is not matched
 36. interMedia Text - Indexing Pipeline
     Datastore -> Filter -> Sectioner -> Lexer -> Database Engine
     (column data -> doc data -> filtered doc text -> plain text and
     section offsets -> tokens -> index data)
     - The first step is creating an index
     - Datastore: reads the data out of the table (the URL datastore
       performs a GET)
 37. interMedia Text - Indexing Pipeline
     - Filter: transforms the data to some text type; this is needed
       because some formats are binary, as when storing DOC, PDF, or HTML
       documents
     - Sectioner: converts to plain text, removing tags and invisible
       information
     - Lexer: splits the text into discrete tokens
     - Engine: takes the tokens from the lexer, the offsets from the
       sectioner, and a stoplist of words, and builds the index
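The stages above can be mimicked with a toy pipeline. This Python sketch stands in for the sectioner, lexer, and engine only (the real interMedia components are far richer; the regular expressions and stoplist here are arbitrary choices, and the datastore/filter stages are collapsed into the raw input string):

```python
import re

def sectioner(doc_text):
    """Strip markup, leaving plain text (a stand-in for the Sectioner)."""
    return re.sub(r"<[^>]+>", " ", doc_text)

def lexer(plain_text):
    """Split plain text into lowercase tokens (a stand-in for the Lexer)."""
    return re.findall(r"[a-z0-9]+", plain_text.lower())

def engine(tokens, stoplist=frozenset({"the", "a", "of"})):
    """Drop stopwords and keep (token, position) pairs ready for indexing."""
    return [(tok, pos) for pos, tok in enumerate(tokens, start=1)
            if tok not in stoplist]
```

Chaining the three functions turns raw HTML into the token/position stream the engine would index.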
 38. interMedia Text - Indexing Pipeline
     Example of index creation. The statements

       INSERT INTO docs VALUES ( 1, 'first document' );
       INSERT INTO docs VALUES ( 2, 'second document' );

     produce the index

       DOCUMENT -> doc 1 position 2, doc 2 position 2
       FIRST    -> doc 1 position 1
       SECOND   -> doc 2 position 1
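The same index can be reproduced in plain Python as a positional inverted index; the structure below is only a sketch of the idea, not interMedia's actual storage format.

```python
from collections import defaultdict

def build_index(docs):
    """docs: dict of doc_id -> text. Returns term -> [(doc_id, pos), ...]."""
    index = defaultdict(list)
    for doc_id, text in sorted(docs.items()):
        # Uppercase terms to match the slide's presentation of the index.
        for pos, term in enumerate(text.upper().split(), start=1):
            index[term].append((doc_id, pos))
    return dict(index)

index = build_index({1: "first document", 2: "second document"})
# DOCUMENT appears at position 2 of both documents, as on the slide.
```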
 39. Testing Procedure
     - Document set from newsgroups: 122 documents from a text mining
       site, loaded using INSERT statements; the file datastore was used
     - HTML documents from browsing: 20 documents, loaded from the server
       process; the URL datastore was used
 40. Newsgroup Results
     1. Religion, atheism (15): on the Bible, Islam, religious beliefs
     2. comp.os.ms-windows.misc (17): on operating systems, protocols,
        installation
     3. (27): on hardware and software for computer graphics
     4. Ice hockey (18)
     5. Computer hardware (12): on installation of different peripheral
        devices
     6. Mideast politics (14): on political developments in the Mideast
     7. (19): on various space programs, devices, theories
 41. Newsgroup Results

     Group                        Retrieved  Wrong  Not Retrieved  Recall  Precision
     Science and technology            120     16         1          99%      78%
     Computer hardware industry         12      0         5          71%     100%
     Government                        103     26         8          90%      74%
 42. Newsgroup Results (continued)

     Politics                           17      3         0         100%      82%
     Military                            5      1         0          80%      80%
     Social environment                 48      2        14          77%      96%
     Religion                           22      3         2          90%      86%
     Islam                               4      0         0         100%     100%
     Leisure/recreation                 22      4         5          78%      82%
 43. Newsgroup Results (continued)

     Sports                             21      1         0          90%      90%
     Hockey                             18      0         0         100%     100%

     Recall    = (# of correct positive predictions) / (# of positive examples)
     Precision = (# of correct positive predictions) / (# of positive predictions)
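The recall and precision definitions above can be applied directly to a row of the results table; here is the Government row (103 retrieved, 26 wrong, 8 not retrieved) worked through in Python:

```python
def recall_precision(retrieved, wrong, not_retrieved):
    correct = retrieved - wrong                     # correct positive predictions
    recall = correct / (correct + not_retrieved)    # over positive examples
    precision = correct / retrieved                 # over positive predictions
    return recall, precision

r, p = recall_precision(retrieved=103, wrong=26, not_retrieved=8)
# r is roughly 0.906 and p roughly 0.748, matching the table's 90% / 74%.
```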
 44. Query
     Syntax of the binary operators:

       AND     &    cat & dog
       OR      |    cat | dog
       EQUIV   =    cat = dog
       MINUS   -    cat - dog
       NOT     ~    cat ~ dog
       ACCUM   ,    cat , dog
 45. Semantics: Binary Operators
     - The semantics of all the binary operators are defined in terms of
       SCORE
     - However, the score for even the simplest query expression, a single
       word, is calculated by a subtle rule:
       - The score is higher for a document where the query word occurs
         more frequently than for one where it occurs less frequently
       - But when "word1" occurs N times in document D, its score is lower
         than when "word2" occurs N times in document D, if "word1" occurs
         more often in the whole document set than "word2"
 46. The Salton Algorithm
     - interMedia Text uses an algorithm similar to the Salton algorithm,
       which is widely used in text retrieval products
     - The score for a word is proportional to f * (1 + log(N/n)), where:
       - f is the frequency of the search term in the document
       - N is the total number of documents
       - n is the number of documents that contain the search term
     - The score is converted into an integer in the range 0-100
 47. The Salton Algorithm
     Assumption: inverse frequency scoring assumes that frequently
     occurring terms in a document set are noise terms, and so these terms
     are scored lower. For a document to score high, the query term must
     occur frequently in the document but infrequently in the document set
     as a whole.
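The scoring formula quoted above is easy to check numerically. A sketch of the raw (unscaled) score; the exact mapping interMedia applies to land in the 0-100 range is not given here, so this returns only the raw value:

```python
import math

def salton_raw_score(f, N, n):
    """f: term frequency in the doc; N: total docs; n: docs with the term.
    Raw score proportional to f * (1 + log(N/n)), per the slide's formula."""
    return f * (1 + math.log(N / n))

# At equal frequency, a rare term outscores a common one, which is the
# inverse-frequency assumption stated above:
rare = salton_raw_score(f=3, N=1000, n=10)     # term in 1% of the docs
common = salton_raw_score(f=3, N=1000, n=500)  # term in half the docs
```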
 48. The Salton Algorithm
     This table assumes that only one document in the set contains the
     query term.

       # of documents in set   Occurrences of term in document
                               needed to score 100
                     1                   34
                     5                   20
                    10                   17
                    50                   13
                   100                   12
                   500                   10
                 1,000                    9
                10,000                    7
               100,000                    5
             1,000,000                    4
 49. Summary of Operators
     - Binary operators: & | = - ~ ,
     - Built-in expansion: ? $ !
     - Thesaurus: BT, BTG, BTP, BTI, NT, NTG, NTP, NTI, PT, RT, SYN, TR,
       TRSYN, TT
 50. Summary of Operators
     - Stored query expression: SQE
     - Grouping and escaping: () {}
     - Special: NEAR, WITHIN, ABOUT
 51. Application Details - Customer Profile Analyzer
     The HTTP server (for user web-page caching) is started; the Oracle
     web server is also started.
 52. Log-in Screen - Customer and User
     The log-in screen is used both by the customer and the users. The
     Oracle web server takes care of the secure connections, while for the
     HTTP server the user ID is common for the session; no user can invoke
     a document from the server without a user ID.
 53. Customer Interface - HTTP Server
     The user uses the interface provided by the custom HTTP server.
 54. Main User Screen
     The user can choose the type of data to be analyzed. Two types of
     data exist:
     1. Newsgroups
     2. User-browsed URLs
 55. Selection of Category and Options
     The user chooses a category and other options, such as:
     - Generating the theme
     - Generating the gist
     - Generating marked-up text
     - Date range
 56. Results Page - Gist Generation
     This page can be used for drilling down to the actual document, which
     opens up in the browser (generated by the filter option). The theme
     and gist can be generated from this screen.
 57. Search Screen
     The search screen has advanced options such as fuzzy search and
     'about' search. A chain of expressions can be used, with conjunctions
     (such as 'not', 'or', 'and') joining the statements.
 58. Conclusion
     - New estimation methods are trying to find more meaning in text
     - Industry has strong text mining products and is constantly
       improving the technology
     - Unstructured data mining still has a long way to go