
Taxonomies in Search


Published on

Presented by Marjorie Hlava, president of Access Innovations, Inc. on August 10, 2011. Part two of the Special Libraries Association's Leveraging Your Taxonomy series.

Published in: Technology, Education


  1. 1. Taxonomies in Search<br />An SLA Webinar<br />Aug 10, 1:00pm-2:00pm EST<br />Marjorie Hlava, President<br /><br />Access Innovations, Inc. <br /><br />Leveraging your content semantically<br />
  2. 2. Agenda<br />How search works<br />Measuring accuracy in search<br />Precision<br />Recall<br />Relevance<br />Search theoretical basis<br />Bayes, Boole and the rest of the guys<br />The taxonomy effect<br />
  3. 3. How does search work?<br />Many parts<br />Search software – of course<br />Computer network<br />Parsing of text<br />Well formed or structured text<br />CLEAN DATA<br />Computer software – network<br />Computer hardware<br />Telecommunications connection<br />Training sets for statistical systems<br />
  4. 4. Technical parts of search<br />Search technology<br />Ranking algorithms<br />Query language<br />Federators<br />Cache<br />Inverted index<br />Other enhancements<br />Presentation Layer<br />
  5. 5. My Main Frustration<br />Select hardware<br />Select software<br />Design system<br />Try to load the data<br />Add the taxonomy<br />That’s BACKWARDS<br />
  6. 6. Data First!<br />What are you building the system for?<br />Assess the data<br />Do the design<br />Decide what else needs to be added<br />Taxonomy terms<br />Other controls<br />Find a system that will work with your data<br />
  7. 7. Access Innovations – Complex Farm With Perfect Search<br />Query<br />Federators<br />Query Servers<br />Search Harmony Presentation <br />Layer<br />Deploy<br />Hub<br />Index <br />Builders<br />Cleanup, etc.<br />Repository XIS (cache)<br />Cache <br />Builders<br />Source<br />Data<br />
  8. 8. CUSTOM<br />CONNECTOR<br />EMAIL<br />CONNECTOR<br />DATABASE<br />CONNECTOR<br />FILE<br />TRAVERSER<br />WEB<br />CRAWLER<br />MANAGEMENT API<br />QUERY API<br />CONTENT API<br />Data Harmony Governance API<br />SEARCH<br />SERVER<br />FILTERSERVER<br />FAST Search example<br />Core Architectural Components<br />Administrator’s<br />Dashboard<br />Web<br />Content<br />Vertical<br />Applications<br />Pipeline<br />Query<br />Pipeline<br />Files,<br />Documents<br />QUERY<br />PROCESSOR<br />Portals<br />Index DB<br />Databases<br />DOCUMENT<br />PROCESSOR<br />Results<br />Custom<br />Front-Ends<br />Alerts<br />Email, <br />Groupware<br />Search harmony<br />Mobile<br />Devices<br />Custom<br />Applications<br />Content<br />Push<br />MAIstro<br />Agent DB<br />
  9. 9. Measuring accuracy in search<br />Relevance<br />Recall<br />Precision<br />Accuracy – hits, misses, noise<br />Ranking<br />Linguistics<br />Query Processing<br />Results Processing<br />Display<br />Search refinement<br />Usability<br />Business Rules<br />
  10. 10. Relevance<br />How well a set of returned documents answers the information need<br />“Accuracy”<br />Related to objective of search<br />Different user communities<br />Information resources<br />Tension between user needs and the context available<br />A confidence “guesstimate”<br />
  11. 11. The formulas<br />Recall = (Number of relevant items retrieved) / (Number of relevant items in the collection)<br />Precision = (Number of relevant items retrieved) / (Number of items retrieved)<br />Relevance = Germane (Precision)<br /> Pertinent (Recall)<br />
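The two ratios on this slide can be sketched in a few lines of Python. This is only an illustration: the document ids and the relevance judgments are hypothetical, and the function name `precision_recall` is not from the presentation.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: ids of the items the search returned
    relevant:  ids of the items judged relevant in the whole collection
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant items that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 items returned, 3 of them relevant,
# and 6 relevant items exist in the collection.
P, R = precision_recall(
    {"d1", "d2", "d3", "d9"},
    {"d1", "d2", "d3", "d4", "d5", "d6"},
)
```

Here precision is 3 of 4 retrieved items and recall is 3 of 6 relevant items. Deciding which items count as relevant still takes a human judgment, as the next slide points out.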
  12. 12. Measuring Relevance<br />Concepts <br />Context<br />Age of documents <br />Completeness (recall) <br />Quality<br />Statistically determined ?<br />Nope, it is subjective <br />Someone has to determine the rightness of the item<br />A confidence factor = canard!<br />
  13. 13. Kinds of search<br />Bayesian – <br />FAST<br />Lucene<br />Autonomy / Verity<br />Boolean<br />Dialog<br />Endeca<br />Perfect Search<br />Ranking algorithms<br />Google<br />
  14. 14. Search Theoretical Basis<br />Those Famous Guys<br />Boole<br />Bayes<br />Bayesian Techniques<br />Turney<br />Turney algorithm<br />Enriched structured data<br />Marco Dorigo<br />Ant Colony<br />This is only a sample <br />of a large body of research<br />
  15. 15. George Boole and Boolean algebra<br />George Boole<br />Mathematician<br />1815-1864<br />Boolean algebra<br />An algebraic system of logic <br />AND, OR, NOT, ANDNOT<br />Dialog, BRS, Stairs<br />
  16. 16. Boolean representation<br />Venn diagram showing the intersection of sets A AND B (in violet), <br />The union of sets A OR B (all the colored regions), <br />And set A XOR B (all the colored regions except the violet). <br />The "universe" is represented by the rectangular frame.<br />
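The Venn diagram regions map directly onto set operations. As a minimal sketch, assume each search term has a set of matching document ids. The sets `A`, `B` and `UNIVERSE` below are hypothetical.

```python
# Hypothetical posting sets: ids of the documents that contain each term.
A = {1, 2, 3, 5}               # documents containing term A
B = {2, 3, 4}                  # documents containing term B
UNIVERSE = {1, 2, 3, 4, 5, 6}  # every document in the collection (the frame)

a_and_b = A & B       # intersection: documents with both terms
a_or_b = A | B        # union: documents with either term
a_xor_b = A ^ B       # documents with exactly one of the two terms
a_andnot_b = A - B    # documents with A but not B
not_a = UNIVERSE - A  # NOT needs the universe to be defined
```

NOT only makes sense relative to the whole collection, which is why the diagram draws the universe as a frame.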
  17. 17. Bayes and Bayes’ Theorem<br />Thomas Bayes<br />Mathematician<br />1702 - 1761<br />Bayes’ theorem <br />Uses probability inductively <br />Established a mathematical basis for probability inference <br />WHAT?<br />A means of calculating, <br />from the number of times an event has and has not occurred, <br />the probability that it will occur in future trials<br />
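A small worked example may make the "WHAT?" more concrete. The sketch below applies Bayes' theorem to one hypothetical question: given that a document contains a term, how likely is it to be about a topic? All three input probabilities are invented for illustration, and the function name `posterior` is not from the presentation.

```python
def posterior(prior, likelihood, false_alarm):
    """P(H | E) by Bayes' theorem.

    prior       = P(H): share of documents that are on topic
    likelihood  = P(E | H): share of on-topic documents containing the term
    false_alarm = P(E | not H): share of off-topic documents containing it
    """
    evidence = likelihood * prior + false_alarm * (1.0 - prior)  # P(E)
    return likelihood * prior / evidence


# Hypothetical counts from a training set: 10% of documents are on topic,
# the term appears in 80% of those and in 5% of everything else.
p_on_topic = posterior(prior=0.10, likelihood=0.80, false_alarm=0.05)
```

With these numbers the posterior is 0.08 / 0.125 = 0.64. The result depends entirely on the prior and the training counts, which is the point of the cautions on the following slides.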
  18. 18. Bayesian methods - Cautions<br />A user might wish to change the distribution of probabilities. <br />A user will make a novel request for information in a previously unanticipated way.<br />The computational difficulty of exploring a previously unknown network. <br />The quality and extent of the prior beliefs used in Bayesian inference processing. <br />
  19. 19. Bayesian cautions (cont.)<br />A Bayesian network is only as useful as the prior knowledge is reliable. <br />An optimistic or pessimistic expectation of the quality of these prior beliefs will distort the entire network and invalidate the results. <br />Must take care in selecting the statistical distribution used in modeling the data. <br />Must have the proper distribution model to describe the data.<br />That is, you have to constantly train and retrain on the data<br />
  20. 20. Peter Turney and the Turney Algorithm<br />Peter D. Turney, Canada, present<br />Learning algorithms for keyphrase extraction<br />Tree Induction Algorithm<br />Lexical Semantics<br />GenEx – with human input<br />80% acceptable<br />Extraction vs. generation and sentiment of words<br />log2( (hits(word AND "excellent") × hits("poor")) / (hits(word AND "poor") × hits("excellent")) )<br />
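The log-ratio on this slide can be computed directly from four hit counts. The sketch below is an illustration under assumptions: the counts are hypothetical, the function name is invented, and the small `smoothing` constant is added here only to avoid dividing by zero when a word never co-occurs with "poor".

```python
import math


def semantic_orientation(hits_word_excellent, hits_word_poor,
                         hits_excellent, hits_poor, smoothing=0.01):
    """log2 of the co-occurrence ratio shown on the slide.

    Positive values lean toward "excellent", negative toward "poor".
    hits_excellent must be greater than zero.
    """
    numerator = (hits_word_excellent + smoothing) * hits_poor
    denominator = (hits_word_poor + smoothing) * hits_excellent
    return math.log2(numerator / denominator)


# Hypothetical counts: the word co-occurs with "excellent" 80 times and
# with "poor" 10 times; "excellent" has 1000 hits overall, "poor" has 500.
so_exact = semantic_orientation(80, 10, 1000, 500, smoothing=0)
so_flipped = semantic_orientation(10, 80, 1000, 500, smoothing=0)
```

With smoothing turned off, the first call gives log2(40000 / 10000) = 2, a positive orientation. Swapping the co-occurrence counts gives a negative score.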
  21. 21. Marco Dorigo and Ant Colony Optimization<br />Marco Dorigo<br />Research director for the Belgian Fonds de la Recherche Scientifique<br />Research director of the IRIDIA lab at the Université Libre de Bruxelles<br />Ant Colony Optimization <br />Metaheuristic for combinatorial optimization problems<br />Swarm intelligence<br />Value importance vs. heuristic importance<br />Useful in search prediction<br />
  22. 22. Natural Language Processing<br />Syntactic<br />Semantic<br />Morphological<br />Phraseological<br />Lemmatization (stemming)<br />Statistical<br />Grammatical<br />Common Sense<br />
  23. 23. Basic areas of Automatic Language Processing (ALP)<br />Auto Translation<br />Auto Indexing<br />Auto Abstracting<br />Artificial Intelligence<br />Searching<br />Spell Checking<br />Semantic Web<br />Natural Language Processes (NLP)<br />Computational Linguistics<br />
  24. 24. Statistical Search <br />Cluster analysis<br />Neural networks<br />Co-occurrence<br />Bayesian inference<br />Latent Semantic <br />Etc.<br />
  25. 25. Inverted Files and Boolean <br />are basic to all search <br />Searchable Index<br />Inverted<br />File<br />Index<br />Taxonomy<br />Thesaurus<br />Hierarchical Display<br />
  26. 26. Sample Slide for Inverted File Index Demonstration<br />Outline of Presentation<br /><ul><li>Define key terminology
  27. 27. Thesaurus tools </li></ul>Features<br />Functions<br /><ul><li>Costs </li></ul>Thesaurus construction<br />Thesaurus tools<br /><ul><li>Why & when?</li></li></ul><li>Simple Inverted File Index<br />key <br />of<br />outline<br />presentation<br />terminology<br />thesaurus<br />tools<br />when<br />why<br />&<br />1<br />2<br />3<br />4<br />construction<br />costs<br />define<br />features<br />functions<br />
  28. 28. Complex Inverted File Index<br />Example 1<br />key - L2, P2, H<br />of - Stop<br />outline - L1, P1, T<br />presentation - L1, P3, T<br />terminology - L2, P3, H<br />thesaurus - (1) - L3, P1, H<br /> (2) - L7, P1, SH<br /> (3) - L8, P1, SH<br />tools - (1) - L3, P2, H<br /> (2) - L8, P2, SH<br />when - L9, P3, H<br />why - L9, P1, H<br />& - Stop<br />1 - Stop<br />2 - Stop<br />3 - Stop<br />4 - Stop<br />construction - L7, P2, SH <br />costs - L6, P1, H<br />define - L2, P1, H<br />features - L4, P1, SH<br />functions - L5, P1, SH<br />
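The line and position postings in the complex index can be reproduced with a short script. This is a minimal sketch, not the indexing software used in the presentation. The heading-level codes (T, H, SH) are left out, and the stop list contains only the words from the sample text that the slide marks as Stop.

```python
from collections import defaultdict

# The nine lines of the sample slide, in order.
SLIDE_LINES = [
    "Outline of Presentation",
    "Define key terminology",
    "Thesaurus tools",
    "Features",
    "Functions",
    "Costs",
    "Thesaurus construction",
    "Thesaurus tools",
    "Why & when?",
]
STOP_WORDS = {"of", "&"}  # assumption: just the stop entries that occur above


def build_inverted_index(lines, stop_words):
    """Map each term to a list of (line number, word position) postings."""
    index = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        for position, raw in enumerate(line.split(), start=1):
            token = raw.strip("?.,;:!").lower()
            # Stop words still use up a position; they just get no posting.
            if token and token not in stop_words:
                index[token].append((line_no, position))
    return dict(index)


INDEX = build_inverted_index(SLIDE_LINES, STOP_WORDS)
```

Because stop words still occupy a position, "when" lands at line 9, position 3, matching the slide. Keeping positions is what allows phrase and proximity searching on top of plain Boolean lookup.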
  29. 29. Word and Term Parsing<br />Stemming<br />-ing, -ed, -es, -’s, -s’, etc. <br />Depluralization<br />Truncation<br />Left and right<br />Wild cards<br />Organi*ation<br />Variant Spellings<br />Centre, center<br />Hyphens <br />
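Wild cards and truncation can be illustrated with Python's standard `fnmatch` module. This is a rough sketch: the vocabulary list is hypothetical, and real search engines usually resolve wild cards against the inverted index instead of scanning a list.

```python
from fnmatch import fnmatchcase

# Hypothetical index vocabulary.
VOCABULARY = ["organization", "organisation", "organic", "centre", "center"]


def wildcard_match(pattern, vocabulary):
    """Return the vocabulary terms that match a pattern, ignoring case."""
    pattern = pattern.lower()
    return [term for term in vocabulary if fnmatchcase(term.lower(), pattern)]


INTERNAL_WILDCARD = wildcard_match("Organi*ation", VOCABULARY)
RIGHT_TRUNCATION = wildcard_match("organi*", VOCABULARY)
LEFT_TRUNCATION = wildcard_match("*ation", VOCABULARY)
```

Note that `*` matches any run of characters, not just one, so a pattern like `organi*ation` can pick up more than the z/s spelling variants in a larger vocabulary.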
  30. 30. The taxonomy effect<br />Where do the terms go?<br />How are they used in search?<br />What other ways can I use the taxonomy in search?<br />
  31. 31. Site search<br />Search of 53 crawled sites including journals, books, web site, conference sites, etc.<br />Navigation <br />Bookstore search <br />Search database for Journals and pubs<br />Search across all publications<br />
  32. 32. Navigate the full taxonomy “tree”<br />BROWSE<br />Auto-completion using the taxonomy<br />Guide the user<br />Taxonomy Driven Search Presentation<br />
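Auto-completion from a taxonomy amounts to prefix lookup over the term list. The sketch below assumes a small hypothetical term list and uses binary search over a sorted copy. The function name and the `limit` parameter are illustrative, not taken from any product shown in the slides.

```python
from bisect import bisect_left

# Hypothetical taxonomy terms, lower-cased and sorted once up front.
TAXONOMY_TERMS = sorted(
    term.lower()
    for term in ["Thesaurus construction", "Thesaurus tools",
                 "Taxonomy", "Terminology", "Indexing"]
)


def autocomplete(prefix, sorted_terms, limit=5):
    """Return up to `limit` terms that start with what the user has typed."""
    prefix = prefix.lower()
    start = bisect_left(sorted_terms, prefix)  # first term >= prefix
    suggestions = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix) or len(suggestions) == limit:
            break
        suggestions.append(term)
    return suggestions


SUGGEST_THE = autocomplete("The", TAXONOMY_TERMS)
SUGGEST_T = autocomplete("t", TAXONOMY_TERMS)
```

Since every suggestion is a valid taxonomy term, the user is steered toward queries that are guaranteed to match indexed content.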
  33. 33. A quick look behind the scenes<br />Database<br />Management<br />System<br /><ul><li>Search thesaurus
  34. 34. Validate term entry
  35. 35. Block invalid terms
  36. 36. Record candidates
  37. 37. Establish rules for </li></ul> term use<br /><ul><li>Suggest indexing </li></ul> terms<br />Thesaurus<br />tool<br />Indexing<br />tool<br /><ul><li>Validate terms
  38. 38. Add terms and rules
  39. 39. Change terms and rules
  40. 40. Delete terms and rules</li></li></ul><li>Thesaurus<br />Term Record<br />view<br />Taxonomy<br />view<br />
  41. 41. Where does the subject metadata go?<br />Apply to content itself<br />Use meta name field in HTML header<br />Connect search to the keywords in the SQL or other database tables<br />
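One of the options above, the meta name field in the HTML header, can be sketched as follows. The helper name `keywords_meta_tag` and the sample terms are assumptions for illustration. The only library call used is `html.escape` from the Python standard library.

```python
from html import escape


def keywords_meta_tag(terms):
    """Render assigned taxonomy terms as an HTML keywords meta element."""
    content = ", ".join(terms)
    # Escape &, <, > and quotes so a term cannot break the attribute value.
    return f'<meta name="keywords" content="{escape(content, quote=True)}">'


# Hypothetical terms assigned to one page.
TAG = keywords_meta_tag(["Taxonomies", "Search & retrieval"])
```

The resulting element goes inside the page's `<head>`, where a crawler or site search can pick it up as a separate, weighted field.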
  42. 42. HTML Header<br />
  43. 43. RDBMS Connection<br />Taxonomy term table<br />
  44. 44. Suggested taxonomy descriptors<br />
  45. 45.
  46. 46. Integrate taxonomy to enhance findability<br />Browsable categories of a directory<br />Browsable faceted navigation<br />Smart search for term equivalents<br />Taxonomy terms (original or modified) as labels<br />Navigation aids incorporate taxonomy terms and relationships<br />
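"Smart search for term equivalents" can be approximated by expanding the user's term with thesaurus relationships before the query runs. This is a minimal sketch: the thesaurus fragment is hypothetical, and the structure (one dictionary for non-preferred terms, one for narrower terms) is an assumption and not the Data Harmony format.

```python
# Hypothetical thesaurus fragment.
USE_FOR = {"cars": "automobiles", "autos": "automobiles"}  # non-preferred -> preferred
NARROWER = {"automobiles": ["electric automobiles", "sports cars"]}


def expand_query(term, use_for, narrower):
    """Return the preferred term, its equivalents, and its narrower terms."""
    term = term.lower()
    preferred = use_for.get(term, term)
    equivalents = sorted(v for v, p in use_for.items() if p == preferred)
    return [preferred] + equivalents + narrower.get(preferred, [])


EXPANDED = expand_query("Cars", USE_FOR, NARROWER)
```

The expanded list can then be combined with OR in a Boolean query, so a search on a non-preferred term also finds content tagged with the preferred term or its narrower terms.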
  47. 47. More Taxonomy Enrichment<br />Spelling alternatives and correction<br />Related concepts<br />Statistical information about the metadata<br />Navigation or drill downs<br />Search refinement<br />Recursive sets<br />Concept linking<br />Dictionary lookup (in taxonomy glossary)<br />
  48. 48. Brand is repeated in several spots and tied to search as well<br />
  49. 49. Raw Full text data feeds <br />Data Base Plus Search Workflow <br />XIS Creation<br />SQL for ecommerce<br />Printed source materials<br />Add metadata<br />Data Crawls on 53+ sources<br />XIS repository <br />Taxonomy terms <br />Load to<br />Perfect Search<br />MAI Concept Extractor<br />Taxonomy Thesaurus Master<br />MAI Rule Base<br />Search Harmony Display Search <br />Save data to search and repositories at the same time<br />
  50. 50. Raw Full text data feeds <br />Data Base Plus Search Workflow <br />XIS Creation<br />SQL for ecommerce<br />Printed source materials<br />XIS repository <br />Data Crawls on data sources<br />Add metadata<br />Load to<br />Search<br />MAI Concept Extractor<br />MAI Rule Base<br />Search Harmony Display Search <br />Taxonomy Thesaurus Master<br />Source data<br />Taxonomy terms <br />Search data<br />Clean and enhance data<br />
  51. 51. Client Data<br />Full Text<br />HTML, PDF,<br />Data Feeds, etc.<br />Taxonomy In Sharepoint<br />Automatic Summarization<br />Search<br />Presentation:90% accuracy<br />Browse by Subject<br />Auto-completion<br />Broader Terms<br />Narrower Terms<br />Related Terms<br />Machine Aided Indexer (M.A.I.™)<br />Repository<br />Search<br />Software<br />Inline Tagging<br />Client taxonomy<br />Client Taxonomy<br />Metadata and Entity Extractor<br />Thesaurus Master<br />
  52. 52. What we covered <br />How search works<br />Measuring accuracy in search<br />Search theoretical basis<br />Bayes, Boole and the rest of the guys<br />The taxonomy effect<br />
  53. 53. Do the data FIRST<br />What do you have?<br />What does it need?<br />How would you LIKE to access it?<br />Look at the data BEFORE you create the specifications<br />A DTD built without data is not going to work<br />Then choose the system that will support your data<br />
  54. 54. Next Month<br />Same time, same station<br />Solving the Challenge of Connecting People and Author Networks<br />Jay Ven Eman, Ph.D.<br />September 14<br />As online digital publishing continues to grow, taxonomies can be increasingly useful in connecting people with author networks through directory creation with author disambiguation and subject metadata tagging to increase the usefulness of information for researchers and community-building.<br />
  55. 55. About Access Innovations<br />Access Innovations are experts in content creation, enrichment and conversion services. We provide services to semantically enrich and tag raw text, turning it into highly structured data. We deliver clean, well-formed, metadata-enriched data so our clients can reuse, repurpose, store, and find their knowledge assets. We go beyond the standards to build taxonomies and other data control structures as a solid foundation for data. <br />Our services and software allow organizations to use and present their information to both internal and external constituents by leveraging search, presentation, and e-commerce. We change search to found!<br />Quick Facts<br /><ul><li>Founded in 1978
  56. 56. Headquartered in Albuquerque
  57. 57. Privately held
  58. 58. Delivered more than 2000 engagements </li></li></ul><li>Thank you for your attention!<br />Slides will be available on SLA Taxonomy Division and Access Web sites tomorrow<br />Taxonomies in Search:<br />Marjorie M. K. Hlava<br />Access Innovations / Data Harmony<br /><br />+505.998.0800<br />