GammaWare Technology June 2002 Yiftach Ravid, VP R&D GammaSite Inc. [email_address]
    1. GammaWare Technology, June 2002. Yiftach Ravid, VP R&D, GammaSite Inc. [email_address]
    2. Overview
       - The challenge
       - Taxonomies
       - Classification
       - Focused Crawler
       - Q&A
       - Categorization
    3. The challenge: Generate Structured Taxonomies of text repositories
       - Generate a structured taxonomy of huge text repositories
       [Diagram labels: XML, Word, Domino, Web, Catalogues, Forms, Mail; Structured Data, Unstructured Data; Internal DB; Business, Relevant Content; Information; Application Services]
    4. Taxonomy
    5. What is a Taxonomy
       - Taxonomy
         - Taxis = arrangement or division
         - Nomos = law
       - The science of classification according to a pre-determined system
       - Best-known use of taxonomy is in Biology
         - taxonomies of animals and plants
    6. Web Taxonomy
       - Best-known use of taxonomies:
         - Web portals or Directories
         - Internet sites classified into hierarchical topics
         - General:
           - Yahoo! http://www.yahoo.com/
           - Open Directory http://www.dmoz.org/
           - LookSmart http://www.looksmart.com/r?country=uk
         - Topical:
           - Business.Com http://www.business.com/
           - HealthWeb http://www.healthweb.org/
           - Education Planet http://www.educationplanet.com/
    7. Taxonomy - Sample
    8. Taxonomy vs. Thesaurus
       - Focus
         - Taxonomy: documents and their organization
         - Thesaurus: terms used in the organization
       - Usage
         - Taxonomy: classification of documents; documents are classified into categories/terms
         - Thesaurus: indexing documents; terms are attached to documents
       - Retrieval
         - Taxonomy: mainly browsing
         - Thesaurus: keyword queries
       - Size
         - Taxonomy: restricted to the necessary terms
         - Thesaurus: very large (terms may be added freely)
    9. Classification
    10. What is a Classifier
        - Concept (Topic, Subject):
          - An abstract or generic idea generalized from particular instances [Merriam Webster]
        - Classifier:
          - A function on a concept (category) and on an object (document)
          - Returns a number between 0 and 1 called the confidence rate
          - Confidence rate: measures the confidence that the object (document) belongs to (should be classified under) the concept (category)
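The classifier definition above can be written down as a function signature. The following is a minimal Python sketch, not GammaWare's implementation: the keyword table, the category names, and the overlap-based scoring rule are all invented for illustration. Only the shape of the function (category and document in, a number between 0 and 1 out) comes from the slide.

```python
from typing import Callable, Dict, Set

# A classifier as defined on the slide: (category, document) -> confidence in [0, 1].
Classifier = Callable[[str, str], float]

# Hypothetical keyword lists, invented purely for this illustration.
CATEGORY_KEYWORDS: Dict[str, Set[str]] = {
    "tea": {"tea", "herbal", "brew", "leaf"},
    "pets": {"dog", "cat", "puppy", "kitten"},
}


def keyword_classifier(category: str, document: str) -> float:
    """Toy confidence rate: fraction of the category's keywords found in the document."""
    keywords = CATEGORY_KEYWORDS.get(category, set())
    if not keywords:
        return 0.0  # unknown category: no confidence at all
    words = set(document.lower().split())
    return len(keywords & words) / len(keywords)


confidence = keyword_classifier("tea", "Herbal tea from a fresh leaf")
```

A learned classifier would replace the hand-written keyword table with statistics gathered from training documents, but it would keep the same signature.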
    11. Methods for Automatic Classification
        - Rule based
          - Pre-defined set of rules
          - Advantage:
            - incorporating prior knowledge
          - Disadvantages:
            - extreme reliance on man-made rules
            - costly in terms of man-hours
        - Linguistics
          - Use of morphology, syntax and semantics
          - Not multilingual, demands many training examples
        - Machine Learning
    12. What is Machine Learning
        - Machine Learning is the study of computer algorithms that automatically improve performance through "experience"
    13. Sample for Machine Learning
        [Figure labels: DOGS, CATS]
    14. Discriminating Features
        - Q1: Who is this person?
        - Q2: What are the most discriminating features?
    15. Discriminating Features
        - Answer:
          - Lips
          - Eyes
    16. Discriminating Features
        - The "Margaret Thatcher effect"
    17. Supervised Inductive Learning
        - A process where:
          - A learning algorithm is provided with a set of labeled instances, positive and negative examples (a training set)
          - Using the training set, the learning algorithm generates a classifier
          - The quality of the classifier is measured by its ability to perform well on novel instances (a test set)
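The train-then-test process described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the word-voting learner, the function names `train` and `evaluate`, and the tiny dogs-versus-cats data set are all made up here and do not describe the algorithm GammaWare uses.

```python
from collections import Counter
from typing import Callable, List, Tuple

Example = Tuple[str, bool]  # (document text, True if it belongs to the category)


def train(training_set: List[Example]) -> Callable[[str], bool]:
    """Learn one weight per word: +1 for each positive document, -1 for each negative one."""
    weights: Counter = Counter()
    for text, label in training_set:
        for word in set(text.lower().split()):
            weights[word] += 1 if label else -1

    def classify(text: str) -> bool:
        # A document is accepted when its words vote positive overall.
        return sum(weights[w] for w in set(text.lower().split())) > 0

    return classify


def evaluate(classifier: Callable[[str], bool], test_set: List[Example]) -> float:
    """Fraction of novel (held-out) instances that the classifier labels correctly."""
    correct = sum(classifier(text) == label for text, label in test_set)
    return correct / len(test_set)


# Hypothetical labeled data for a "dogs" category.
training_set = [
    ("dogs bark and fetch", True),
    ("a dog chases the ball", True),
    ("cats purr and nap", False),
    ("a cat climbs the tree", False),
]
test_set = [("dogs fetch the stick", True), ("cats nap all day", False)]

classifier = train(training_set)
accuracy = evaluate(classifier, test_set)
```

The test set is kept separate from the training set on purpose: measuring on the training documents themselves would overstate how well the classifier handles new documents.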
    18. Supervised Inductive Learning Example
        [Figure labels: Training, Test, errors, correct]
    19. Evaluating a Classifier
        [Figure labels: Category, Classifier]
    20. Recall and Precision
        Use a confusion matrix to count (rows: Classified; columns: True Label):

                    Yes    No   Total
            Good     70    50     120
            Bad      30   150     180
            Total   100   200     300

        With GY = 70, GN = 50, BY = 30, BN = 150:
        - Precision (P) = GY / (GY + GN) = 70 / (70 + 50) = 0.58
        - Recall (R) = GY / (GY + BY) = 70 / (70 + 30) = 0.70
        - F-measure (F) = 2 / (1/P + 1/R) = 2*GY / (GY + GN + GY + BY) = 2*70 / (120 + 100) = 0.64
        - Accuracy (A) = (GY + BN) / (GY + GN + BY + BN) = 220 / 300 = 0.73
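The four measures can be checked with a short Python sketch. The function name `evaluate_counts` is made up for this example; the cell names GY, GN, BY and BN and the counts are taken from the slide, where the formulas imply that GY are correct acceptances, GN wrong acceptances, BY wrong rejections, and BN correct rejections. The sketch assumes no denominator is zero.

```python
from typing import Dict


def evaluate_counts(gy: int, gn: int, by: int, bn: int) -> Dict[str, float]:
    """Compute the slide's four measures from the confusion-matrix cells."""
    precision = gy / (gy + gn)               # share of accepted documents that are right
    recall = gy / (gy + by)                  # share of relevant documents that were accepted
    f_measure = 2 * gy / (2 * gy + gn + by)  # harmonic mean of precision and recall
    accuracy = (gy + bn) / (gy + gn + by + bn)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


scores = evaluate_counts(gy=70, gn=50, by=30, bn=150)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

With the slide's counts this gives a precision of about 0.583, a recall of 0.700, an F-measure of about 0.636, and an accuracy of about 0.733.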
    21. Supervised Statistical Machine Learning
        - A Supervised Inductive Learning method that is based on statistics obtained from the training set
        - Benefits
          - Generality and flexibility
            - Successfully applied across a broad spectrum of problems
          - Multilingual
          - Low labor costs
    22. How to Classify documents
        - Pre-defined fields (Structured data)
          - Author
          - Title
          - Date
        - Content (Unstructured data)
          - From title, main text, emphasized text
          - All words
          - All 2 words, All 3 words, etc.
          - Phrases, Synonyms, etc.
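Reading "all words, all 2 words, all 3 words" as word n-grams (an interpretation, since the slide does not spell it out), the content features could be extracted roughly as follows. The function name `word_ngrams`, the whitespace tokenization, and the lowercasing are assumptions made for this sketch.

```python
from collections import Counter


def word_ngrams(text: str, max_n: int = 3) -> Counter:
    """Count every run of 1..max_n consecutive words in the text."""
    tokens = text.lower().split()
    features: Counter = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of n words across the token list.
        for start in range(len(tokens) - n + 1):
            features[" ".join(tokens[start:start + n])] += 1
    return features


features = word_ngrams("herbal tea specialist")
# Keys: the 3 single words, the 2 word pairs, and the 1 word triple.
```

Longer n-grams capture phrases such as "herbal tea" that single words miss, at the cost of a much larger feature set.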
    23. Getting Started
    24. GammaWare Work Flow
        [Diagram steps: Requirements, Design the Taxonomy, Seeding Process, Check Seed, Train Classifiers, Catalogue Documents, Improve Classifiers, Ready]
    25. Requirements
        - Initial parameters and decisions:
          - Level of percolation, which affects:
            - Recall
            - Precision
          - Multi label
            - Maximum number of categories into which a document can be classified
          - Types of training documents
            - Full text, Keywords
            - Different types per category
          - List of Stop Words
            - Words that are common in the language used and in the topic
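Two of these decisions, the stop-word list and the multi-label limit, are easy to illustrate in Python. The sketch below is hypothetical throughout: the stop-word list, the function names, and in particular the `threshold` parameter are assumptions. The threshold is a generic confidence cut-off used to show how one setting can trade recall against precision; it is not claimed to be what GammaWare calls the level of percolation.

```python
from typing import Dict, List, Set

# Hypothetical stop-word list; a real one depends on the language and the topic.
STOP_WORDS: Set[str] = {"the", "a", "of", "and"}


def remove_stop_words(text: str, stop_words: Set[str] = STOP_WORDS) -> List[str]:
    """Drop words that are too common to help discriminate between categories."""
    return [word for word in text.lower().split() if word not in stop_words]


def select_categories(confidences: Dict[str, float], threshold: float, max_labels: int) -> List[str]:
    """Keep the categories at or above the threshold, best first, up to max_labels."""
    accepted = [(cat, score) for cat, score in confidences.items() if score >= threshold]
    accepted.sort(key=lambda pair: pair[1], reverse=True)
    return [cat for cat, _ in accepted[:max_labels]]


confidences = {"tea": 0.9, "health": 0.6, "pets": 0.1}
single = select_categories(confidences, threshold=0.5, max_labels=1)
multi = select_categories(confidences, threshold=0.5, max_labels=3)
```

Lowering the threshold accepts more categories per document, which tends to raise recall and lower precision; raising it does the opposite.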
    26. Taxonomy
        - A Taxonomy is constructed according to:
          - User/Business needs
            - who will be using the taxonomy
          - Data
            - content of documents for classification
        - Good taxonomy:
          - requires critical attention to both the definition and application of categories and their labels
          - simple and intuitive
        - How: Using the Expert Tool
    27. Seeding process
        - Seeding process: each category within the taxonomy needs to be given a few examples of relevant documents of the same type that the user seeks to catalog
          - An average of 3-6 relevant documents per category
          - Seeds can either be "positive seeds" or "negative seeds" for each category
        - For better results, training documents should be in a similar structure as the documents for classification
        - How: Using the Expert Tool
    28. Check Seed
        - Check seed: Classify the seeds into the taxonomy
        - Output: An HTML page (browsed by the Expert tool)
          - For each category shows the cataloging results for all the relevant seeds.
        - Why: Help in locating seeding problems:
          - Seeds that are multi labeled
          - Problems in taxonomy structure
        - How: Using the GammaWare Manager
    29. Train Classifiers
        - Train: Train classifiers for all categories
        - Output: A classifier file (gcl extension) for each category
        - Why: The classifiers are used for categorization.
        - How: Using the GammaWare Manager
    30. Classify Documents
        - Categorization: Catalogue documents into a Taxonomy
        - Output: A table in a database
        - Why: This is why we are here.
        - How: Using the GammaWare Manager
    31. Improve Classifiers
        - Methods to improve classification results using the Expert Tool.
          - Re-design the taxonomy
          - Seed problems
            - More examples
            - Add new seeds
              - drag and drop documents from classification view
            - Negative "seeds"
          - Modify Categorization and Train parameters
    32. Categorization
    33. Hierarchical Categorization
        - Goal: Classify a document into the appropriate sub-topic(s) in the taxonomy
        - Difficulties:
          - Many sub-topics
          - A document may fall into several sub-topics
          - Classifiers are not perfect
          - Must control "Recall" and "Precision" according to the client's needs
    34. Hierarchical Categorization
        - Divide and Conquer solution:
          - Solve the problem Level by Level
          - At each level decompose the problem into several, smaller sized classification sub-problems
          - Note: ignoring interactions between sub-problems can yield poor results
        - Patent Pending on Categorization
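A plain level-by-level descent can be sketched in Python as below. This is a generic top-down scheme written for illustration, not the patent-pending GammaWare method, and it deliberately ignores the interactions between sub-problems that the slide warns about. The taxonomy, the `name_match_classifier` stand-in, and the fixed threshold are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# A taxonomy as nested dictionaries: category name -> sub-taxonomy (empty dict for a leaf).
TAXONOMY: Dict[str, dict] = {
    "health": {"tea": {}, "fitness": {}},
    "pets": {"dogs": {}, "cats": {}},
}


def name_match_classifier(category: str, document: str) -> float:
    """Stand-in classifier: full confidence if the category name appears in the document."""
    return 1.0 if category in document.lower().split() else 0.0


def categorize(
    document: str,
    taxonomy: Dict[str, dict],
    classifier: Callable[[str, str], float],
    threshold: float = 0.5,
    path: Tuple[str, ...] = (),
) -> List[str]:
    """Descend one level at a time, following only the categories that accept the document."""
    results: List[str] = []
    for category, children in taxonomy.items():
        if classifier(category, document) >= threshold:
            here = path + (category,)
            deeper = categorize(document, children, classifier, threshold, here)
            # Keep the deepest accepted categories; stay here if no child accepts.
            results.extend(deeper if deeper else ["/".join(here)])
    return results


specific = categorize("herbal tea and health advice", TAXONOMY, name_match_classifier)
general = categorize("health news", TAXONOMY, name_match_classifier)
```

Each level only has to separate a handful of sibling categories, which is the divide-and-conquer benefit. The cost is that a wrong rejection near the root hides every sub-topic beneath it.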
    35. Focused Crawler
    36. Topic Specific Crawling
        - Hyper-linked networks (Intranet, Internet)
        - Two options:
          - Crawl the network. Then apply classification schemes to filter relevant documents.
          - Using classification schemes crawl the network while teaching the crawler to imitate (intelligent) human surfing strategies
        - Retrieve all documents that are relevant to a specific topic of interest
    37. Simple Crawling
        - Crawling: The process of retrieving documents from the net
          - The Network is huge
            - Storage
          - Network
            - Time
          - Good for general-purpose search engines
        [Figure label: Starting Document]
    38. Focused Crawling via Link Classifiers
        - Link classifier: Decision according to the context of the link
        - Analyze the context of the link
        [Figure labels: "My brother new born child": Link Classifier: Link is irrelevant; "Herbal tea specialist": Link Classifier: Retrieve the URL]
    39. Focused Crawler - The Learning Process
        - Crawler Classifier: Checks if the document is good for Crawling
        [Figure labels: Herbal tea specialist; Link Classifier; Retrieve the content of the link; Crawler Classifier; Send acknowledgment to the "link classifier" - Learning Process]
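The interplay of the link classifier and the crawler classifier can be sketched in Python on a small in-memory "web", so that nothing is fetched over a network. Everything concrete here is an assumption made for illustration: the `WEB` dictionary, the word-weight link scoring, the initial weight for "tea", the 0.5 feedback step, and the rule that a page is relevant when it mentions "tea". The slides only state that links are judged by their context and that the crawler classifier sends an acknowledgment back to the link classifier.

```python
import heapq
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical web: url -> (page text, [(anchor text, target url), ...])
WEB: Dict[str, Tuple[str, List[Tuple[str, str]]]] = {
    "start": ("links page", [("herbal tea specialist", "tea-shop"),
                             ("my brother new born child", "family")]),
    "tea-shop": ("herbal tea blends and brewing advice", [("green tea guide", "tea-guide")]),
    "family": ("photos of the baby", []),
    "tea-guide": ("how to brew green tea", []),
}


def page_is_relevant(text: str) -> bool:
    """Stand-in crawler classifier: a page is on topic if it mentions tea."""
    return "tea" in text.lower().split()


def focused_crawl(start: str, budget: int) -> Tuple[List[str], List[str]]:
    """Return (relevant urls, fetched urls), following only links the link classifier likes."""
    link_weights = defaultdict(float, {"tea": 1.0})  # assumed starting knowledge

    def link_score(anchor: str) -> float:
        return sum(link_weights[word] for word in anchor.lower().split())

    frontier = [(0.0, start, "")]  # min-heap of (negated score, url, anchor text)
    seen = {start}
    relevant: List[str] = []
    fetched: List[str] = []
    while frontier and len(fetched) < budget:
        _, url, anchor = heapq.heappop(frontier)
        text, links = WEB[url]
        fetched.append(url)
        good = page_is_relevant(text)
        if good:
            relevant.append(url)
        # Acknowledgment: reward or penalize the anchor words that led to this page.
        for word in anchor.lower().split():
            link_weights[word] += 0.5 if good else -0.5
        for link_anchor, target in links:
            score = link_score(link_anchor)
            if target not in seen and score > 0:  # non-positive score: link is irrelevant
                seen.add(target)
                heapq.heappush(frontier, (-score, target, link_anchor))
    return relevant, fetched


relevant, fetched = focused_crawl("start", budget=10)
```

In this toy run the "my brother new born child" link is never followed, so the family page costs no storage, bandwidth, or time, which is the saving over simple crawling followed by filtering.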
    40. GammaWare API
    41. Architecture - Basic
        [Diagram labels: Relational Database, Customer Client, GammaWare API, CORBA, GammaWare Proxy, File System, Relational Database, GammaWare Software, Proxy Client, ODBC, CORBA, GW, File System, Document Management, Web, File System, Notes, Domino, Outlook]
    42. Multiple Servers
        - Scalability and Availability
        [Diagram labels: GammaWare Server, GammaWare Server 2, GammaWare Server 3, GammaWare Server 4, Database, GammaWare Proxy, GammaWare Proxy Client, Database]
    43. Q & A
