This document summarizes a webinar presentation about using taxonomies to improve search. The presentation covers how search works, measuring search accuracy, theoretical foundations of search, and how taxonomies can enhance search. It emphasizes doing taxonomy work after analyzing data to support how the data will be accessed and presented. The goal is to leverage semantic relationships and metadata to connect users to relevant information.
Taxonomies in Search: Leveraging Content Semantically
1. Taxonomies in Search: Leveraging your content semantically. An SLA Webinar, Aug 10, 1:00pm-2:00pm EST. Marjorie Hlava, President, Access Innovations, Inc. mhlava@accessinn.com www.accessinn.com
2. Agenda How search works Measuring accuracy in search Precision Recall Relevance Search theoretical basis Bayes, Boole and the rest of the guys The taxonomy effect
3. How does search work? Many parts Search software – of course Computer network Parsing of text Well formed or structured text CLEAN DATA Computer software – network Computer hardware Telecommunications connection Training sets for statistical systems
4. Technical parts of search Search technology Ranking algorithms Query language Federators Cache Inverted index Other enhancements Presentation Layer
5. My Main Frustration: Select hardware, select software, design system, try to load the data, add the taxonomy. That’s BACKWARDS
6. Data First! What are you building the system for? Assess the data Do the design Decide what else needs to be added Taxonomy terms Other controls Find a system that will work with your data
7. Access Innovations – Complex Farm With Perfect Search Query Federators Query Servers Search Harmony Presentation Layer Deploy Hub Index Builders Cleanup, etc. Repository XIS (cache) Cache Builders Source Data
8. FAST Search example: Core Architectural Components [architecture diagram]. Labels: Custom Connector, Email Connector, Database Connector, File Traverser, Web Crawler; Management API, Query API, Content API, Data Harmony Governance API; Search Server, Filter Server, Query Processor, Document Processor; Pipeline, Query Pipeline, Index DB, Agent DB; Web Content, Files and Documents, Databases, Email and Groupware; Portals, Vertical Applications, Custom Front-Ends, Mobile Devices, Custom Applications; Results, Alerts, Content Push; Administrator’s Dashboard; Search Harmony; MAIstro
10. Relevance How well a set of returned documents answers the information need “Accuracy” Related to objective of search Different user communities Information resources Tension of user needs and context available A confidence “guesstimate”
11. The formulas
$$\text{Recall} = \frac{\text{Number of relevant items retrieved}}{\text{Number of relevant items in the collection}}$$
$$\text{Precision} = \frac{\text{Number of relevant items retrieved}}{\text{Number of items retrieved}}$$
Relevance = Germane (Precision), Pertinent (Recall)
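As a minimal sketch of the two formulas on this slide, the function below computes precision and recall for a single query. The document IDs and the relevance judgments are made up for illustration; in practice, as the next slide notes, someone has to decide which items are relevant.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: IDs the search engine returned
    relevant:  IDs a person judged relevant in the whole collection
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant items retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents returned, 3 of them relevant,
# and 6 relevant documents exist in the collection.
p, r = precision_recall(
    ["d1", "d2", "d3", "d9"],
    ["d1", "d2", "d3", "d4", "d5", "d6"],
)
print(p, r)  # 0.75 0.5
```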
12. Measuring Relevance Concepts Context Age of documents Completeness (recall) Quality Statistically determined? Nope, it is subjective. Someone has to determine the rightness of the item. A confidence factor = canard!
13. Kinds of search Bayesian: FAST, Lucene, Autonomy / Verity. Boolean: Dialog, Endeca, Perfect Search. Ranking algorithms: Google
14. Search Theoretical Basis: Those Famous Guys Boole Bayes Bayesian Techniques Turney Turney algorithm Enriched structured data Marco Dorigo Ant Colony This is only a sample of a large body of research
15. George Boole and Boolean algebra George Boole Mathematician 1815-1864 Boolean algebra An algebraic system of logic AND, OR, NOT, ANDNOT, Dialog, BRS, Stairs
16. Boolean representation Venn diagram showing the intersection of sets A AND B (in violet), the union of sets A OR B (all the colored regions), and set A XOR B (all the colored regions except the violet). The "universe" is represented by the rectangular frame.
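The Boolean operators in the Venn diagram map directly onto set operations over the documents that match each term. The sketch below uses Python sets with made-up document IDs; the variable names are illustrative only.

```python
# Documents matching each search term (hypothetical IDs).
a = {"doc1", "doc2", "doc3"}  # documents containing term A
b = {"doc3", "doc4"}          # documents containing term B

a_and_b = a & b      # intersection: both terms
a_or_b = a | b       # union: either term
a_xor_b = a ^ b      # exactly one of the two terms
a_andnot_b = a - b   # A but not B (ANDNOT)

print(sorted(a_and_b))     # ['doc3']
print(sorted(a_or_b))      # ['doc1', 'doc2', 'doc3', 'doc4']
print(sorted(a_xor_b))     # ['doc1', 'doc2', 'doc4']
print(sorted(a_andnot_b))  # ['doc1', 'doc2']
```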
17. Bayes and Bayes’ Theorem Thomas Bayes Mathematician 1702 - 1761 Bayesian theorem Uses probability inductively Established a mathematical basis for probability inference WHAT? A means of calculating, from the number of times an event has not occurred, the probability that it will occur in future trials
18. Bayesian methods - Cautions A user might wish to change the distribution of probabilities. A user will make a novel request for information in a previously unanticipated way. The computational difficulty of exploring a previously unknown network. The quality and extent of the prior beliefs used in Bayesian inference processing.
19. Bayesian cautions (cont.) A Bayesian network is only as useful as the prior knowledge is reliable. An optimistic or pessimistic expectation of the quality of these prior beliefs will distort the entire network and invalidate the results. Must ensure the selection of the statistical distribution induced in modeling the data. Must have the proper distribution model to describe the data. That is, you have to constantly train and retrain the system on the data.
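To illustrate why the quality of the prior matters, here is a small sketch of Bayes' theorem applied to a single question: how likely is a document to be relevant, given that it contains a query term? All probabilities are invented for the example, and the function name `posterior` is hypothetical rather than part of any search product.

```python
def posterior(prior, p_term_given_rel, p_term_given_nonrel):
    """P(relevant | term present) by Bayes' theorem.

    prior:               P(relevant) before seeing the term
    p_term_given_rel:    P(term present | relevant)
    p_term_given_nonrel: P(term present | not relevant)
    """
    evidence = p_term_given_rel * prior + p_term_given_nonrel * (1 - prior)
    return p_term_given_rel * prior / evidence


# Same evidence, two different prior beliefs (made-up numbers).
cautious = posterior(0.10, 0.8, 0.2)    # about 0.31
optimistic = posterior(0.50, 0.8, 0.2)  # about 0.80
print(round(cautious, 2), round(optimistic, 2))
```

With identical evidence, the estimate moves from roughly 0.31 to 0.80 purely because the prior changed, which is the distortion the cautions above describe.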
20. Peter Turney and the Turney Algorithm Peter D. Turney, Canada, present Learning algorithms for keyphrase extraction Tree Induction Algorithm Lexical Semantics GenEx – with human input 80% acceptable Extraction vs. generation and sentiment of words
$$\log_2 \frac{\text{hits}(\textit{word} \text{ AND "excellent"}) \cdot \text{hits}(\text{"poor"})}{\text{hits}(\textit{word} \text{ AND "poor"}) \cdot \text{hits}(\text{"excellent"})}$$
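The log-ratio on this slide can be sketched in a few lines. The hit counts below are invented; in a real setting they would come from a search engine's result counts, and zero counts would need some smoothing, which this sketch does not handle.

```python
import math


def semantic_orientation(hits_word_excellent, hits_word_poor,
                         hits_excellent, hits_poor):
    """log2 of the ratio shown on the slide.

    Positive values suggest the word co-occurs more with "excellent",
    negative values suggest it co-occurs more with "poor".
    """
    ratio = (hits_word_excellent * hits_poor) / (hits_word_poor * hits_excellent)
    return math.log2(ratio)


# Hypothetical counts: the word appears with "excellent" 200 times and
# with "poor" 50 times; both reference words have 1000 hits overall.
score = semantic_orientation(200, 50, 1000, 1000)
print(score)  # 2.0
```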
21. Marco Dorigo and Ant Colony Optimization Marco Dorigo Research director for the Belgian Fonds de la Recherche Scientifique Research director of the IRIDIA lab at the Université Libre de Bruxelles Ant Colony Optimization metaheuristic for combinatorial optimization problems Swarm intelligence Value importance vs. heuristic importance Useful in search prediction
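One reading of "value importance vs. heuristic importance" is the pair of exponents in the commonly cited Ant Colony Optimization transition rule, where an ant picks its next step with probability proportional to pheromone^alpha times heuristic^beta. The sketch below assumes that reading; the parameter names `alpha` and `beta` and the sample numbers are not taken from the presentation.

```python
def transition_probabilities(pheromone, heuristic, alpha=1.0, beta=2.0):
    """Probability of choosing each candidate step.

    pheromone: learned trail value per candidate
    heuristic: problem-specific desirability per candidate
    alpha:     weight of the pheromone value
    beta:      weight of the heuristic
    """
    weights = [(t ** alpha) * (h ** beta) for t, h in zip(pheromone, heuristic)]
    total = sum(weights)
    return [w / total for w in weights]


# Two candidates with equal pheromone; the second has a 3x better heuristic.
probs = transition_probabilities([1.0, 1.0], [1.0, 3.0], alpha=1.0, beta=1.0)
print(probs)  # [0.25, 0.75]
```

Raising `beta` relative to `alpha` makes the choice lean harder on the heuristic, while raising `alpha` favors what earlier ants have reinforced.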
22. Natural Language Processing Syntactic Semantic Morphological Phraseological Lemmatization (stemming) Statistical Grammatical Common Sense
23. Basic areas of Automatic Language Processing (ALP) Auto Translation Auto Indexing Auto Abstracting Artificial Intelligence Searching Spell Checking Semantic Web Natural Language Processes (NLP) Computational Linguistics
25. Inverted Files and Boolean are basic to all search Searchable Index Inverted File Index Taxonomy Thesaurus Hierarchical Display
28. Complex Inverted File Index Example 1
& - Stop
1 - Stop
2 - Stop
3 - Stop
4 - Stop
construction - L7, P2, SH
costs - L6, P1, H
define - L2, P1, H
features - L4, P1, SH
functions - L5, P1, SH
key - L2, P2, H
of - Stop
outline - L1, P1, T
presentation - L1, P3, T
terminology - L2, P3, H
thesaurus - (1) L3, P1, H; (2) L7, P1, SH; (3) L8, P1, SH
tools - (1) L3, P2, H; (2) L8, P2, SH
when - L9, P3, H
why - L9, P1, H
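A minimal sketch of how such an index could be built, assuming that L and P in the example stand for line number and word position. The three sample lines and the stop list are reconstructed guesses for illustration, and the field codes (T, H, SH) are left out.

```python
from collections import defaultdict

STOP_WORDS = {"of", "the", "a", "and", "&"}  # assumed stop list


def build_inverted_index(lines):
    """Map each non-stop word to a list of (line_id, position) postings."""
    index = defaultdict(list)
    for line_id, text in lines.items():
        for position, word in enumerate(text.lower().split(), start=1):
            if word not in STOP_WORDS:
                index[word].append((line_id, position))
    return index


lines = {
    "L1": "Outline of presentation",
    "L2": "Define key terminology",
    "L3": "Thesaurus tools",
}
index = build_inverted_index(lines)


def lines_with(term):
    """Set of line IDs whose text contains the term."""
    return {line_id for line_id, _ in index.get(term, [])}


print(index["presentation"])                           # [('L1', 3)]
print(lines_with("thesaurus") & lines_with("tools"))   # {'L3'}
```

A Boolean AND query is then just an intersection of the posting sets, which is why the inverted file and Boolean logic sit underneath most search engines.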
29. Word and Term Parsing Stemming -ing, -ed, -es, -’s, -s’, etc. Depluralization Truncation Left and right Wild cards Organi*ation Variant Spellings Centre, center Hyphens
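The parsing steps on this slide can be sketched with the standard library alone. The suffix list follows the slide; the variant-spelling table and the sample vocabulary are assumptions, and the stemmer is deliberately crude (a production system would use a proper stemmer or lemmatizer).

```python
import fnmatch

SUFFIXES = ("ing", "ed", "es", "'s", "s'", "s")
VARIANTS = {"centre": "center"}  # hypothetical variant-spelling table


def naive_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(word):
    """Stem the word, then map variant spellings to one preferred form."""
    stem = naive_stem(word)
    return VARIANTS.get(stem, stem)


# Internal wildcard: one pattern covers both spellings.
vocabulary = ["organization", "organisation", "organic"]
matches = fnmatch.filter(vocabulary, "organi*ation")

print(naive_stem("indexing"), naive_stem("searched"))  # index search
print(normalize("Centre"))                             # center
print(matches)                                         # ['organization', 'organisation']
```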
30. The taxonomy effect Where do the terms go? How are they used in search? What other ways can I use the taxonomy in search?
31. Site search Search of 53 crawled sites including journals, books, web site, conference sites, etc. Navigation Bookstore search Search database for Journals and pubs For search all publications
32. Navigate the full taxonomy “tree” BROWSE Auto-completion using the taxonomy Guide the user Taxonomy Driven Search Presentation
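Auto-completion from a taxonomy can be as simple as a prefix lookup over the sorted list of preferred terms. The sketch below uses a made-up term list and a hypothetical `autocomplete` function; it is not how any particular product implements the feature.

```python
import bisect

# Hypothetical taxonomy terms, lowercased and sorted for binary search.
TERMS = sorted([
    "search engines",
    "semantic web",
    "taxonomies",
    "text mining",
    "thesauri",
])


def autocomplete(prefix, terms=TERMS, limit=5):
    """Return up to `limit` taxonomy terms starting with `prefix`."""
    prefix = prefix.lower()
    start = bisect.bisect_left(terms, prefix)  # first term >= prefix
    suggestions = []
    for term in terms[start:]:
        if not term.startswith(prefix):
            break
        suggestions.append(term)
        if len(suggestions) == limit:
            break
    return suggestions


print(autocomplete("se"))  # ['search engines', 'semantic web']
print(autocomplete("T"))   # ['taxonomies', 'text mining', 'thesauri']
```

Because suggestions come only from the controlled vocabulary, the user is guided toward terms that are guaranteed to match indexed content.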
41. Where does the subject metadata go? Apply to content itself Use meta name field in HTML header Connect search to the keywords in the SQL or other database tables
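For the HTML header option, a small sketch of generating the tag from assigned taxonomy terms. Using `name="keywords"` is an assumption, since the slide only says "meta name field"; the function name and the sample terms are also illustrative.

```python
from html import escape


def keywords_meta_tag(terms):
    """Build a <meta> tag carrying the subject terms for one page."""
    content = ", ".join(terms)
    # Escape &, <, > and quotes so the terms cannot break the attribute.
    return f'<meta name="keywords" content="{escape(content, quote=True)}">'


tag = keywords_meta_tag(["Taxonomies", "Search & retrieval"])
print(tag)
# <meta name="keywords" content="Taxonomies, Search &amp; retrieval">
```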
46. Integrate taxonomy to enhance findability Browsable categories of a directory Browsable faceted navigation Smart search for term equivalents Taxonomy terms (original or modified) as labels Navigation aids incorporate taxonomy terms and relationships
47. More Taxonomy Enrichment Spelling alternatives and correction Related concepts Statistical information about the metadata Navigation or drill downs Search refinement Recursive sets Concept linking Dictionary lookup (in taxonomy glossary)
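"Smart search for term equivalents" and "related concepts" amount to expanding the user's query with relationships stored in the thesaurus. The sketch below assumes a tiny in-memory thesaurus with invented entries and emits a Boolean OR query string; the structure and names are hypothetical.

```python
# Hypothetical thesaurus fragment keyed by preferred term.
THESAURUS = {
    "taxonomies": {
        "equivalents": ["classification schemes"],
        "narrower": ["faceted taxonomies"],
        "related": ["thesauri"],
    },
}


def expand_query(term, include=("equivalents",)):
    """Return a Boolean OR query covering the term and chosen relations."""
    key = term.lower()
    entry = THESAURUS.get(key, {})
    expanded = [key]
    for relation in include:
        expanded.extend(entry.get(relation, []))
    return " OR ".join(f'"{t}"' for t in expanded)


print(expand_query("Taxonomies"))
# "taxonomies" OR "classification schemes"
print(expand_query("Taxonomies", include=("equivalents", "narrower")))
# "taxonomies" OR "classification schemes" OR "faceted taxonomies"
```

Adding equivalents tends to raise recall, while offering narrower terms as refinements gives the user a way to raise precision.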
49. Raw Full text data feeds Data Base Plus Search Workflow XIS Creation SQL for ecommerce Printed source materials Add metadata Data Crawls on 53+ sources XIS repository Taxonomy terms Load to Perfect Search MAI Concept Extractor Taxonomy Thesaurus Master MAI Rule Base Search Harmony Display Search Save data to search and repositories at the same time
50. Raw Full text data feeds Data Base Plus Search Workflow XIS Creation SQL for ecommerce Printed source materials XIS repository Data Crawls on data sources Add metadata Load to Search MAI Concept Extractor MAI Rule Base Search Harmony Display Search Taxonomy Thesaurus Master Source data Taxonomy terms Search data Clean and enhance data
51. Client Data Full Text HTML, PDF, Data Feeds, etc. Taxonomy In Sharepoint Automatic Summarization Search Presentation:90% accuracy Browse by Subject Auto-completion Broader Terms Narrower Terms Related Terms Machine Aided Indexer (M.A.I.™) Repository Search Software Inline Tagging Client taxonomy Client Taxonomy Metadata and Entity Extractor Thesaurus Master
52. What we covered How search works Measuring accuracy in search Search theoretical basis Bayes, Boole and the rest of the guys The taxonomy effect
53. Do the data FIRST What do you have? What does it need? How would you LIKE to access it? Look at the data BEFORE you create the specifications. A DTD built without data is not going to work. Then choose the system that will support your data.
54. Next Month Same time, same station Solving the Challenge of Connecting People and Author Networks Jay Ven Eman, Ph.D. September 14 As online digital publishing continues to grow, taxonomies can be increasingly useful in connecting people with author networks through directory creation with author disambiguation and subject metadata tagging to increase the usefulness of information for researchers and community-building.