Automated categorization of document content: the ...

  1. GammaWare Technology. June 2002. Yiftach Ravid, VP R&D, GammaSite Inc., yiftach@GammaSite.com
  2. Overview
       • The challenge
       • Taxonomies
       • Classification
       • Categorization
       • Focused Crawler
       • Q&A
  3. The challenge: generate structured taxonomies of text repositories
       [Diagram of text sources; labels: Internal, DB, Information, Word, Application, Web, Forms, XML, Services, Catalogues, Mail, Domino]
       • Generate a structured taxonomy of huge text repositories
  4. Taxonomy
  5. What is a Taxonomy
       • Taxonomy
           • Taxis = arrangement or division
           • Nomos = law
       • The science of classification according to a predetermined system
       • The best-known use of taxonomy is in biology: taxonomies of animals and plants
  6. Web Taxonomy
       • Best-known use of taxonomies: Web portals or directories
       • Internet sites classified into hierarchical topics
       • General:
           • Yahoo! http://www.yahoo.com/
           • Open Directory http://www.dmoz.org/
           • LookSmart http://www.looksmart.com/r?country=uk
       • Topical:
           • Business.Com http://www.business.com/
           • HealthWeb http://www.healthweb.org/
           • Education Planet http://www.educationplanet.com/
  7. Taxonomy - Sample
  8. Taxonomy vs. Thesaurus

       Criteria    Taxonomy                                   Thesaurus
       Focus       Documents and their organization           Terms used in the organization
       Usage       Classification of documents                Indexing documents
                   (classified into categories/terms)         (terms are attached to documents)
       Retrieval   Mainly browsing                            Keyword queries
       Size        Restricted to the necessary terms          Very large (terms may be added freely)
  9. Classification
  10. What is a Classifier
       • Concept (topic, subject): an abstract or generic idea generalized from particular instances [Merriam-Webster]
       • Classifier:
           • A function of a concept (category) and an object (document)
           • Returns a number between 0 and 1, called the confidence rate
       • Confidence rate: a measure of the confidence that the object (document) belongs to (should be classified under) the concept (category)
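
The classifier interface described on this slide can be sketched as a plain function. This is a minimal illustration, assuming a toy keyword-overlap score; the names `Category` and `keyword_classifier` are hypothetical and are not part of GammaWare.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Category:
    """A concept in the taxonomy, described here by a set of keywords (assumption)."""
    name: str
    keywords: frozenset


def keyword_classifier(category: Category, document: str) -> float:
    """Return a confidence rate in [0, 1] that `document` belongs to `category`.

    Toy scoring: the fraction of the category's keywords found in the document.
    """
    if not category.keywords:
        return 0.0
    words = set(document.lower().split())
    hits = len(category.keywords & words)
    return hits / len(category.keywords)


tea = Category("Herbal tea", frozenset({"tea", "herbal", "infusion", "chamomile"}))
confidence = keyword_classifier(tea, "Chamomile is a popular herbal tea")
print(confidence)  # 3 of the 4 keywords occur, so 0.75
```

A real classifier would learn its scoring from examples rather than from a hand-written keyword set; only the signature (category, document) -> confidence is taken from the slide.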
  11. Methods for Automatic Classification
       • Rule based
           • Predefined set of rules
           • Advantage: incorporates prior knowledge
           • Disadvantages: extreme reliance on man-made rules; costly in terms of man-hours
       • Linguistics
           • Uses morphology, syntax and semantics
           • Not multilingual; demands many training examples
       • Machine Learning
  12. What is Machine Learning
       Machine Learning is the study of computer algorithms that automatically improve performance through “experience”.
  13. Sample for Machine Learning
       [Image labels: DOGS, CATS]
  14. Discriminating Features
       Q1: Who is this person?
       Q2: What are the most discriminating features?
  15. Discriminating Features
       Answer:
       • Lips
       • Eyes
  16. Discriminating Features
       The “Margaret Thatcher effect”
  17. Supervised Inductive Learning
       A process where:
       • A learning algorithm is provided with a set of labeled instances, positive and negative examples (a training set)
       • Using the training set, the learning algorithm generates a classifier
       • The quality of the classifier is measured by its ability to perform well on novel instances (a test set)
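
The train-then-test process on this slide can be sketched in a few lines. This is a toy word-count learner chosen for brevity, not the algorithm used by GammaWare; the function `train` and the dog/cat sentences are made up for illustration.

```python
from collections import Counter


def train(training_set):
    """Learn a classifier from (text, label) pairs; label True marks a positive example."""
    pos, neg = Counter(), Counter()
    for text, label in training_set:
        (pos if label else neg).update(text.lower().split())

    def classify(text):
        # Each word votes +1 if it was seen more often in positive examples,
        # -1 if it was seen more often in negative ones, and 0 otherwise.
        score = sum((pos[w] > neg[w]) - (neg[w] > pos[w]) for w in text.lower().split())
        return score > 0

    return classify


training_set = [
    ("the dog barks at the postman", True),
    ("a loyal dog fetches the ball", True),
    ("the cat purrs on the sofa", False),
    ("a sleepy cat ignores the ball", False),
]
test_set = [
    ("my dog barks all night", True),
    ("the cat sleeps on the sofa", False),
]

classifier = train(training_set)
correct = sum(classifier(text) == label for text, label in test_set)
print(f"{correct} of {len(test_set)} test instances classified correctly")
```

The test set is kept separate from the training set so that the score reflects performance on novel instances.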
  18. Supervised Inductive Learning Example
       [Figure labels: Training, Test, errors, correct]
  19. Evaluating a Classifier
       [Figure labels: Category, Classifier]
  20. Recall and Precision
       Use a confusion matrix to count:

                             True label
                             Yes     No    Total
       Classified   Good      70     50      120
                    Bad       30    150      180
       Total                 100    200      300

       Here GY = classified Good with true label Yes, GN = Good/No, BY = Bad/Yes, BN = Bad/No.

       Precision (P) = GY / (GY + GN) = 70 / (70 + 50) = 0.58
       Recall (R) = GY / (GY + BY) = 70 / (70 + 30) = 0.70
       Accuracy (A) = (GY + BN) / (GY + GN + BY + BN) = 220 / 300 = 0.73
       F-measure (F) = 2 / (1/P + 1/R) = 2*GY / (GY + GN + GY + BY) = 2*70 / (120 + 100) = 0.64
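
The four measures can be computed directly from the confusion-matrix counts. The sketch below assumes the cell naming used on the slide (first letter: Good/Bad classification, second letter: true label Yes/No); the function name `metrics` is illustrative.

```python
def metrics(gy, gn, by, bn):
    """Compute precision, recall, accuracy and F-measure from confusion-matrix counts.

    gy: classified Good, true label Yes   gn: classified Good, true label No
    by: classified Bad,  true label Yes   bn: classified Bad,  true label No
    """
    precision = gy / (gy + gn)
    recall = gy / (gy + by)
    accuracy = (gy + bn) / (gy + gn + by + bn)
    # Harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f_measure


p, r, a, f = metrics(gy=70, gn=50, by=30, bn=150)
print(f"P={p:.2f} R={r:.2f} A={a:.2f} F={f:.2f}")
```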
  21. Supervised Statistical Machine Learning
       • A supervised inductive learning method based on statistics obtained from the training set
       • Benefits
           • Generality and flexibility: successfully applied across a broad spectrum of problems
           • Multilingual
           • Low labor costs
  22. How to Classify Documents
       • Predefined fields (structured data)
           • Author
           • Title
           • Date
       • Content (unstructured data)
           • From title, main text, emphasized text
           • All words
           • All 2-word sequences, all 3-word sequences, etc.
           • Phrases, synonyms, etc.
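
The "all words, all 2 words, all 3 words" features can be read as word n-grams. The sketch below assumes that reading; the function `extract_features` and the sample document fields are illustrative, not taken from GammaWare.

```python
def extract_features(text, max_n=3):
    """Return all word n-grams of length 1..max_n found in `text`."""
    words = text.lower().split()
    features = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            features.append(" ".join(words[i:i + n]))
    return features


document = {
    # Structured data: predefined fields (values are illustrative)
    "author": "Yiftach Ravid",
    "title": "Focused crawling",
    "date": "2002-06",
    # Unstructured data: free text
    "content": "focused crawling of hyperlinked networks",
}
features = extract_features(document["content"])
print(features)
```

For the five-word sample text this yields 5 single words, 4 two-word sequences and 3 three-word sequences.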
  23. Getting Started
  24. GammaWare Work Flow
       [Workflow diagram; steps: Requirements, Design the Taxonomy, Seeding Process, Check Seed, Train Classifiers, Catalogue Documents, Improve Classifiers, Ready]
  25. Requirements
       Initial parameters and decisions:
       • Level of percolation, which affects:
           • Recall
           • Precision
       • Multi-label
           • Maximum number of categories into which a document can be classified
       • Types of training documents
           • Full text, keywords
           • Different types per category
       • List of stop words
           • Common words in the language used and also in the topic
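
Two of these parameters, the multi-label limit and the stop-word list, can be sketched as follows. The slide does not define "level of percolation"; the sketch uses a plain confidence threshold as a stand-in for a setting that trades recall against precision, which is an assumption. All names and values are illustrative.

```python
# Illustrative only: a real list is specific to the language and the topic.
STOP_WORDS = {"the", "a", "of", "and"}


def remove_stop_words(text):
    """Drop stop words before classification."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def assign_categories(confidences, threshold=0.5, max_labels=2):
    """Keep at most `max_labels` categories whose confidence reaches `threshold`.

    A higher threshold tends to favor precision; a lower one tends to favor recall.
    """
    ranked = sorted(confidences.items(), key=lambda item: item[1], reverse=True)
    return [name for name, conf in ranked if conf >= threshold][:max_labels]


confidences = {"Tea": 0.9, "Health": 0.6, "Business": 0.55, "Sports": 0.1}
default_labels = assign_categories(confidences)                # ['Tea', 'Health']
strict_labels = assign_categories(confidences, threshold=0.8)  # ['Tea']
tokens = remove_stop_words("The science of classification")    # ['science', 'classification']
print(default_labels, strict_labels, tokens)
```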
  26. Taxonomy
       • A taxonomy is constructed according to:
           • User/business needs: who will be using the taxonomy
           • Data: content of the documents to be classified
       • A good taxonomy:
           • Requires critical attention to both the definition and application of categories and their labels
           • Is simple and intuitive
       • How: using the Expert Tool
  27. Seeding Process
       • Seeding process: each category within the taxonomy needs to be given a few examples of relevant documents of the same type that the user seeks to catalog
       • An average of 3-6 relevant documents per category
       • Seeds can be either “positive seeds” or “negative seeds” for each category
       • For better results, training documents should have a structure similar to that of the documents to be classified
       • How: using the Expert Tool
  28. Check Seed
       • Check seed: classify the seeds into the taxonomy
       • Output: an HTML page (browsed with the Expert Tool) that shows, for each category, the cataloging results for all the relevant seeds
       • Why: helps locate seeding problems:
           • Seeds that are multi-labeled
           • Problems in the taxonomy structure
       • How: using the GammaWare Manager
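
The idea behind the seed check can be sketched as classifying each seed against every category and reporting the seeds that do not land cleanly in their intended category. This is a hypothetical illustration with a toy keyword classifier; the actual tool produces an HTML report, and none of the names below come from GammaWare.

```python
KEYWORDS = {  # hypothetical categories, each described by a few keywords
    "Tea": {"tea", "infusion"},
    "Coffee": {"coffee", "espresso"},
}


def classify(category, text):
    """Toy classifier: 1.0 if any category keyword occurs in the text, else 0.0."""
    return 1.0 if KEYWORDS[category] & set(text.lower().split()) else 0.0


def check_seeds(seeds, threshold=0.5):
    """Return {seed text: matched categories} for seeds that are multi-labeled
    or that miss the category they were seeded into."""
    problems = {}
    for text, intended in seeds:
        matched = [c for c in KEYWORDS if classify(c, text) >= threshold]
        if matched != [intended]:
            problems[text] = matched
    return problems


seeds = [
    ("green tea infusion guide", "Tea"),
    ("espresso brewing basics", "Coffee"),
    ("tea and coffee tasting notes", "Tea"),  # ambiguous seed
]
problems = check_seeds(seeds)
print(problems)
```

Only the ambiguous third seed is reported, because it matches both categories.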
  29. Train Classifiers
       • Train: train classifiers for all categories
       • Output: a classifier file (gcl extension) for each category
       • Why: the classifiers are used for categorization
       • How: using the GammaWare Manager
  30. Classify Documents
       • Categorization: catalogue documents into a taxonomy
       • Output: a table in a database
       • Why: this is why we are here
       • How: using the GammaWare Manager
  31. Improve Classifiers
       Methods to improve classification results using the Expert Tool:
       • Redesign the taxonomy
       • Seed problems
           • More examples
           • Add new seeds
           • Drag and drop documents from the classification view
           • Negative “seeds”
       • Modify categorization and training parameters
  32. Categorization
  33. Hierarchical Categorization
       • Goal: classify a document into the appropriate sub-topic(s) in the taxonomy
       • Difficulties:
           • Many sub-topics
           • A document may fall into several sub-topics
           • Classifiers are not perfect
           • Must control “recall” and “precision” according to the client’s needs
  34. Hierarchical Categorization
       • Divide-and-conquer solution:
           • Solve the problem level by level
           • At each level, decompose the problem into several smaller classification sub-problems
       • Note: ignoring interactions between sub-problems can yield poor results
       • Patent pending on categorization
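
A level-by-level descent through a taxonomy can be sketched as a short recursive function. This is a naive illustration that treats each sub-problem independently, which the slide warns can yield poor results; it is not the patent-pending GammaSite method. The taxonomy, keywords and function names are hypothetical.

```python
# Hypothetical taxonomy as nested dicts: category name -> sub-categories.
TAXONOMY = {"Food": {"Tea": {}, "Coffee": {}}, "Sports": {"Tennis": {}}}
KEYWORDS = {
    "Food": {"tea", "coffee", "recipe"},
    "Tea": {"tea"},
    "Coffee": {"coffee"},
    "Sports": {"tennis", "match"},
    "Tennis": {"tennis"},
}


def classify(category, document):
    """Toy classifier: 1.0 if any category keyword occurs in the document, else 0.0."""
    return 1.0 if KEYWORDS[category] & set(document.lower().split()) else 0.0


def categorize(document, node, threshold=0.5, path=()):
    """Descend the taxonomy level by level, following every child whose confidence
    reaches `threshold`; return the paths of the deepest matching categories."""
    matches = []
    for child_name, child in node.items():
        if classify(child_name, document) >= threshold:
            matches.extend(categorize(document, child, threshold, path + (child_name,)))
    # No child matched: the document stays at the current node (if any).
    return matches or ([path] if path else [])


deep = categorize("a recipe for tea and coffee", TAXONOMY)
shallow = categorize("a recipe for soup", TAXONOMY)
print(deep)     # [('Food', 'Tea'), ('Food', 'Coffee')]
print(shallow)  # [('Food',)]
```

Each level only runs the classifiers for the children of categories already accepted, so the number of classifier calls stays small even in a large taxonomy.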
  35. Focused Crawler
  36. Topic-Specific Crawling
       • Retrieve all documents that are relevant to a specific topic of interest
       • Hyperlinked networks (intranet, Internet)
       • Two options:
           • Crawl the network, then apply classification schemes to filter relevant documents
           • Use classification schemes while crawling the network, teaching the crawler to imitate (intelligent) human surfing strategies
  37. Simple Crawling
       • The network is huge:
           • Storage
           • Network
           • Time
       • Good for general-purpose search engines
       • Crawling: the process of retrieving documents from the net
       [Diagram label: Starting Document]
  38. Focused Crawling via Link Classifiers
       • Analyze the context of the link
       • Link classifier: decides according to the context of the link
       [Diagram: the link “Herbal tea specialist” passes through the link classifier, which retrieves the URL; the link “My brother’s newborn child” passes through the link classifier, which finds the link irrelevant]
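
A focused crawl that follows only the links accepted by a link classifier can be sketched on a small in-memory network. Everything here is hypothetical: the `WEB` dictionary stands in for real HTTP fetching, both classifiers are toy keyword tests, and the learning feedback between the two classifiers is omitted.

```python
from collections import deque

# Hypothetical in-memory "network": URL -> (page text, [(anchor text, target URL), ...])
WEB = {
    "start": ("tea portal", [("herbal tea specialist", "tea-shop"),
                             ("my brother's newborn child", "family")]),
    "tea-shop": ("herbal tea blends and infusions", [("green tea guide", "green-tea")]),
    "green-tea": ("green tea brewing guide", []),
    "family": ("baby photos", []),
}
TOPIC_WORDS = {"tea", "herbal", "infusions"}


def link_is_relevant(anchor_text):
    """Link classifier: decide from the link's context (here, only its anchor text)."""
    return bool(TOPIC_WORDS & set(anchor_text.lower().split()))


def document_is_relevant(text):
    """Document classifier: decide whether a fetched document is on topic."""
    return bool(TOPIC_WORDS & set(text.lower().split()))


def focused_crawl(start_url):
    """Breadth-first crawl that only follows links accepted by the link classifier."""
    queue, seen, relevant = deque([start_url]), {start_url}, []
    while queue:
        url = queue.popleft()
        text, links = WEB[url]
        if document_is_relevant(text):
            relevant.append(url)
        for anchor, target in links:
            if target not in seen and link_is_relevant(anchor):
                seen.add(target)
                queue.append(target)
    return relevant


crawled = focused_crawl("start")
print(crawled)  # ['start', 'tea-shop', 'green-tea']
```

The "family" page is never fetched, which is the saving in storage, network and time that a focused crawler aims for compared with crawling everything and filtering afterwards.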
  39. Focused Crawler – The Learning Process
       • Crawler classifier: checks whether the document is good for crawling
       [Diagram: for the link “Herbal tea specialist”, the link classifier retrieves the content of the link; the crawler classifier then sends an acknowledgment to the “link classifier” (the learning process)]
  40. GammaWare API
  41. Architecture - Basic
       [Architecture diagram; labels: Proxy Client, GammaWare Proxy, CORBA, CORBA API, GammaWare Software, GW File System, Customer Client, ODBC, Relational Database, Web, File System, Outlook, Notes, Domino, Document Management System]
  42. Multiple Servers
       • Scalability and availability
       [Diagram labels: Client, GammaWare Proxy (x2), GammaWare Server, GammaWare Server 2, GammaWare Server 3, GammaWare Server 4, Database (x2)]
  43. Q&A
