Taxonomies in Search
An SLA Webinar
Aug 10, 1:00pm-2:00pm EST
Marjorie Hlava, President
mhlava@accessinn.com
Access Innovations, Inc.
www.accessinn.com
Leveraging your content semantically
Agenda
- How search works
- Measuring accuracy in search
  - Precision
  - Recall
  - Relevance
- Search theoretical basis
  - Bayes, Boole and the rest of the guys
- The taxonomy effect
How does search work?
Many parts
- Search software – of course
- Computer network
- Parsing of text
- Well-formed or structured text
- CLEAN DATA
- Computer software – network
- Computer hardware
- Telecommunications connection
- Training sets for statistical systems
Technical parts of search
- Search technology
- Ranking algorithms
- Query language
- Federators
- Cache
- Inverted index
- Other enhancements
- Presentation layer
My Main Frustration
- Select hardware
- Select software
- Design system
- Try to load the data
- Add the taxonomy
That’s BACKWARDS
Data First!
- What are you building the system for?
- Assess the data
- Do the design
- Decide what else needs to be added
  - Taxonomy terms
  - Other controls
- Find a system that will work with your data
Access Innovations – Complex Farm With Perfect Search
[Diagram labels]
- Query
- Federators
- Query Servers
- Search Harmony Presentation Layer
- Deploy Hub
- Index Builders
- Cleanup, etc.
- Repository XIS (cache)
- Cache Builders
- Source Data
FAST Search example: Core Architectural Components
[Diagram labels]
- Connectors: custom connector, email connector, database connector, file traverser, web crawler
- APIs: Management API, Query API, Content API, Data Harmony Governance API
- Servers and processors: search server, filter server, document processor, query processor
- Pipelines: pipeline, query pipeline
- Content sources: web content; files, documents; databases; email, groupware
- Front ends and outputs: portals, custom front-ends, mobile devices, custom applications, vertical applications, alerts, results
- Other components: Administrator’s Dashboard, Index DB, Agent DB, Content Push, Search Harmony, MAIstro
Measuring accuracy in search
- Relevance
- Recall
- Precision
- Accuracy – hits, misses, noise
- Ranking
- Linguistics
- Query processing
- Results processing
- Display
- Search refinement
- Usability
- Business rules
Relevance
- How well a set of returned documents answers the information need
- “Accuracy”
- Related to the objective of the search
- Different user communities
- Information resources
- Tension between user needs and the context available
- A confidence “guesstimate”
The formulas
- Recall = (Number of relevant items retrieved) / (Number of relevant items in the collection)
- Precision = (Number of relevant items retrieved) / (Number of items retrieved)
- Relevance: germane (precision) and pertinent (recall)
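The two ratios above, together with the "hits, misses, noise" vocabulary from the accuracy slide, can be sketched in a few lines of Python. The document IDs and the function name `precision_recall` are made up for illustration; someone still has to supply the set of relevant items by judgment.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall, hits, noise, misses) for one query.

    retrieved: IDs the search system returned.
    relevant:  IDs a human judged relevant in the whole collection.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant      # relevant items that were retrieved
    noise = retrieved - relevant     # retrieved items that are not relevant
    misses = relevant - retrieved    # relevant items that were not retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall, hits, noise, misses


# Hypothetical example: 4 items returned, 5 relevant items exist.
precision, recall, hits, noise, misses = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d5", "d6", "d7"],
)
print(precision, recall)  # 0.5 0.4
```

Here two of the four returned items are hits (precision 0.5), and two of the five relevant items were found (recall 0.4).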
Measuring Relevance
- Concepts
- Context
- Age of documents
- Completeness (recall)
- Quality
- Statistically determined?
  - Nope, it is subjective
  - Someone has to determine the rightness of the item
- A confidence factor = canard!
Kinds of search
- Bayesian – FAST
- Lucene
- Autonomy / Verity
- Boolean
- Dialog
- Endeca
- Perfect Search
- Ranking algorithms
- Google
Search Theoretical Basis
Those Famous Guys
- Boole
- Bayes
- Bayesian techniques
- Turney
- Turney algorithm
- Enriched structured data
- Marco Dorigo
- Ant Colony
This is only a sample of a large body of research
George Boole and Boolean algebra
- George Boole
  - Mathematician
  - 1815-1864
- Boolean algebra
  - An algebraic system of logic
  - AND, OR, NOT, ANDNOT
  - Dialog, BRS, STAIRS
Boolean representation
Venn diagram showing the intersection of sets A AND B (in violet), the union of sets A OR B (all the colored regions), and the set A XOR B (all the colored regions except the violet). The "universe" is represented by the rectangular frame.
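The Venn regions map directly onto set operations. The following is a minimal sketch using Python's built-in `set` type; the document IDs and the names `A`, `B`, and `UNIVERSE` are invented for illustration and stand for the documents matching two query terms and the whole collection.

```python
# Hypothetical posting sets: IDs of documents containing each term.
UNIVERSE = {1, 2, 3, 4, 5, 6}   # every document in the collection
A = {1, 2, 3, 4}                # documents matching term A
B = {3, 4, 5}                   # documents matching term B

a_and_b = A & B                 # intersection (the violet region)
a_or_b = A | B                  # union (all colored regions)
a_xor_b = A ^ B                 # colored regions except the violet one
a_andnot_b = A - B              # A ANDNOT B
not_a = UNIVERSE - A            # NOT A, relative to the rectangular frame

print(sorted(a_and_b))     # [3, 4]
print(sorted(a_or_b))      # [1, 2, 3, 4, 5]
print(sorted(a_xor_b))     # [1, 2, 5]
print(sorted(a_andnot_b))  # [1, 2]
print(sorted(not_a))       # [5, 6]
```

Note that NOT only makes sense relative to a universe, which is why the frame appears in the diagram.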
Bayes and Bayes’ Theorem
- Thomas Bayes
  - Mathematician
  - 1702-1761
- Bayes’ theorem
  - Uses probability inductively
  - Established a mathematical basis for probability inference
- WHAT?
  - A means of calculating, from the number of times an event has not occurred, the probability that it will occur in future trials
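One classical way to make the "WHAT?" concrete is Laplace's rule of succession: with a uniform prior, after s occurrences in n trials, the estimated probability of an occurrence on the next trial is (s + 1) / (n + 2). This is a sketch of that one special case, not of how any particular search engine applies Bayesian inference; the function name is invented here.

```python
from fractions import Fraction


def rule_of_succession(occurrences, trials):
    """Estimate P(event on the next trial) under a uniform prior.

    occurrences: number of trials in which the event occurred.
    trials:      total number of trials observed so far.
    """
    if not 0 <= occurrences <= trials:
        raise ValueError("occurrences must be between 0 and trials")
    return Fraction(occurrences + 1, trials + 2)


# The event has not occurred in 10 trials: the estimate is small but not zero.
print(rule_of_succession(0, 10))   # 1/12
# The event occurred in 7 of 10 trials.
print(rule_of_succession(7, 10))   # 2/3
```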
Bayesian methods - Cautions
- A user might wish to change the distribution of probabilities.
- A user will make a novel request for information in a previously unanticipated way.
- The computational difficulty of exploring a previously unknown network.
- The quality and extent of the prior beliefs used in Bayesian inference processing.
Bayesian cautions (cont.)
- A Bayesian network is only as useful as the prior knowledge is reliable.
- An optimistic or pessimistic expectation of the quality of these prior beliefs will distort the entire network and invalidate the results.
- Must take care in selecting the statistical distribution used in modeling the data.
- Must have the proper distribution model to describe the data.
- That is, you have to constantly train and retrain on the data.
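The caution about optimistic or pessimistic priors can be illustrated with a small sketch. It assumes a Beta prior over a yes/no outcome (for example, "is this document relevant?"), which is an assumption made here for illustration; the prior parameters and counts are invented.

```python
def posterior_mean(successes, failures, prior_a, prior_b):
    """Posterior mean of a Beta(prior_a, prior_b) prior after observing
    the given numbers of successes and failures (Beta-Binomial update)."""
    return (prior_a + successes) / (prior_a + prior_b + successes + failures)


# The same evidence (3 successes, 7 failures) under three different priors.
evidence = {"successes": 3, "failures": 7}
uniform = posterior_mean(prior_a=1, prior_b=1, **evidence)       # no strong belief
optimistic = posterior_mean(prior_a=20, prior_b=2, **evidence)   # expects success
pessimistic = posterior_mean(prior_a=2, prior_b=20, **evidence)  # expects failure

print(round(uniform, 3), round(optimistic, 3), round(pessimistic, 3))
# 0.333 0.719 0.156
```

With only ten observations, the prior dominates the result, which is why the training data has to be refreshed as more evidence arrives.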
Peter Turney and the Turney Algorithm
- Peter D. Turney, Canada, present
- Learning algorithms for keyphrase extraction
- Tree Induction Algorithm
- Lexical Semantics
- GenEx – with human input, 80% acceptable
- Extraction vs. generation and sentiment of words

log2 [ (hits(word AND "excellent") × hits("poor")) / (hits(word AND "poor") × hits("excellent")) ]
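The log-ratio on this slide scores a word as positive when it co-occurs more with "excellent" than with "poor". A minimal sketch follows; the hit counts are invented, and the `smoothing` parameter is an assumption added here only to avoid division by zero when a word never co-occurs with one of the reference words.

```python
import math


def semantic_orientation(hits_with_excellent, hits_with_poor,
                         hits_excellent, hits_poor, smoothing=0.01):
    """log2 of (hits(word AND "excellent") * hits("poor"))
    over (hits(word AND "poor") * hits("excellent")).

    hits_excellent and hits_poor must be positive.
    """
    numerator = (hits_with_excellent + smoothing) * hits_poor
    denominator = (hits_with_poor + smoothing) * hits_excellent
    return math.log2(numerator / denominator)


# Hypothetical counts: the word appears 4x as often with "excellent".
score = semantic_orientation(80, 20, hits_excellent=1000, hits_poor=1000,
                             smoothing=0)
print(score)  # 2.0
```

A positive score suggests a favorable word, a negative score an unfavorable one, and a score near zero a neutral one.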
Marco Dorigo and Ant Colony Optimization
- Marco Dorigo
  - Research director for the Belgian Fonds de la Recherche Scientifique
  - Research director of the IRIDIA lab at the Université Libre de Bruxelles
- Ant Colony Optimization metaheuristic for combinatorial optimization problems
- Swarm intelligence
- Value importance vs. heuristic importance
- Useful in search prediction
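In the standard Ant System formulation, an ant picks its next step with probability proportional to pheromone^alpha × heuristic^beta, where alpha and beta weight the two kinds of importance. The sketch below assumes that the slide's "value importance" corresponds to the pheromone weight alpha; the candidate names and numbers are invented.

```python
def transition_probabilities(pheromone, heuristic, alpha=1.0, beta=2.0):
    """Probability of choosing each candidate next step.

    pheromone: dict candidate -> learned trail strength (tau).
    heuristic: dict candidate -> prior desirability (eta), e.g. 1/distance.
    alpha:     weight of the learned trail (assumed "value importance").
    beta:      weight of the heuristic ("heuristic importance").
    """
    weights = {c: (pheromone[c] ** alpha) * (heuristic[c] ** beta)
               for c in pheromone}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


# Equal pheromone, but candidate "b" looks twice as good heuristically.
probs = transition_probabilities(
    pheromone={"a": 1.0, "b": 1.0},
    heuristic={"a": 1.0, "b": 2.0},
    alpha=1.0, beta=1.0,
)
print(probs)
```

Raising beta makes the choice greedier toward the heuristic; raising alpha makes ants follow previously reinforced paths more closely.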
Natural Language Processing
- Syntactic
- Semantic
- Morphological
- Phraseological
- Lemmatization (stemming)
- Statistical
- Grammatical
- Common sense
Basic areas of Automatic Language Processing (ALP)
- Auto translation
- Auto indexing
- Auto abstracting
- Artificial intelligence
- Searching
- Spell checking
- Semantic Web
- Natural Language Processing (NLP)
- Computational linguistics
Statistical Search
- Cluster analysis
- Neural networks
- Co-occurrence
- Bayesian inference
- Latent semantic
- Etc.
Inverted Files and Boolean are basic to all search
[Diagram labels]
- Searchable index
- Inverted file index
- Taxonomy
- Thesaurus
- Hierarchical display
Sample Slide for Inverted File Index Demonstration
- Outline of Presentation
- Define key terminology
- Thesaurus tools
- Features
- Functions
- Costs
- Thesaurus construction
- Thesaurus tools
- Why & when?

Simple Inverted File Index
- key
- of
- outline
- presentation
- terminology
- thesaurus
- tools
- when
- why
- &
- 1
- 2
- 3
- 4
- construction
- costs
- define
- features
- functions
Complex Inverted File Index
Example 1
- key - L2, P2, H
- of - Stop
- outline - L1, P1, T
- presentation - L1, P3, T
- terminology - L2, P3, H
- thesaurus - (1) - L3, P1, H
            (2) - L7, P1, SH
            (3) - L8, P1, SH
- tools - (1) - L3, P2, H
        (2) - L8, P2, SH
- when - L9, P3, H
- why - L9, P1, H
- & - Stop
- 1 - Stop
- 2 - Stop
- 3 - Stop
- 4 - Stop
- construction - L7, P2, SH
- costs - L6, P1, H
- define - L2, P1, H
- features - L4, P1, SH
- functions - L5, P1, SH
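The index above can be rebuilt from the sample slide with a short sketch. It assumes that L is the line number and P is the word position within the line, and it leaves out the T/H/SH markers because they depend on slide layout. The function names and the stop list are choices made here for illustration.

```python
from collections import defaultdict

# Stop entries from the example; the slide's numerals are also stops
# but do not occur in the text lines below.
STOP_WORDS = {"of", "&"}

SLIDE_LINES = [
    "Outline of Presentation",
    "Define key terminology",
    "Thesaurus tools",
    "Features",
    "Functions",
    "Costs",
    "Thesaurus construction",
    "Thesaurus tools",
    "Why & when?",
]


def build_positional_index(lines, stop_words=STOP_WORDS):
    """Map each term to a list of (line number, word position) postings."""
    index = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        for position, raw in enumerate(line.split(), start=1):
            token = raw.strip("?.,;:!").lower()
            if token and token not in stop_words:
                index[token].append((line_no, position))
    return dict(index)


def boolean_and(index, *terms):
    """Return the line numbers that contain every one of the terms."""
    line_sets = [{line for line, _ in index.get(t.lower(), [])} for t in terms]
    return sorted(set.intersection(*line_sets)) if line_sets else []


index = build_positional_index(SLIDE_LINES)
print(index["thesaurus"])                        # [(3, 1), (7, 1), (8, 1)]
print(boolean_and(index, "thesaurus", "tools"))  # [3, 8]
```

Stop words still consume a position (so "presentation" is at position 3 on line 1), which is what makes phrase and proximity searching possible on top of the same index.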
Word and Term Parsing
- Stemming
  - -ing, -ed, -es, -’s, -s’, etc.
- Depluralization
- Truncation
  - Left and right
- Wild cards
  - Organi*ation
- Variant spellings
  - Centre, center
- Hyphens
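Three of these techniques can be sketched with the Python standard library. The suffix stripper below is deliberately naive (it is not a real stemmer such as Porter's), and the suffix list, the variant table, and the function names are illustrative assumptions.

```python
from fnmatch import fnmatchcase

SUFFIXES = ("ing", "ed", "es", "'s", "s'", "s")   # checked in this order
VARIANT_SPELLINGS = {"centre": "center"}          # variant -> preferred form


def strip_suffix(word, min_stem=3):
    """Naive stemming: remove the first matching suffix, keeping a stem
    of at least min_stem characters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem:
            return w[: -len(suffix)]
    return w


def normalize_spelling(word):
    """Map a known variant spelling to its preferred form."""
    w = word.lower()
    return VARIANT_SPELLINGS.get(w, w)


def wildcard_match(pattern, vocabulary):
    """Return vocabulary terms matching a pattern where * is a wild card."""
    return [t for t in vocabulary if fnmatchcase(t.lower(), pattern.lower())]


print(strip_suffix("indexing"))                       # index
print(normalize_spelling("Centre"))                   # center
print(wildcard_match("organi*ation",
                     ["organization", "organisation", "organic"]))
# ['organization', 'organisation']
```

The same `wildcard_match` helper covers right truncation (`thesaur*`) and left truncation (`*ology`).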
The taxonomy effect
- Where do the terms go?
- How are they used in search?
- What other ways can I use the taxonomy in search?
Site search
- Search of 53 crawled sites, including journals, books, web sites, conference sites, etc.
- Navigation
- Bookstore search
- Search database for journals and pubs
- Search of all publications
Taxonomy Driven Search Presentation
- Navigate the full taxonomy “tree”
- BROWSE
- Auto-completion using the taxonomy
- Guide the user
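Auto-completion from a taxonomy can be as simple as a prefix lookup over the sorted term list. The sketch below uses a binary search to find the first candidate; the term list and the function name are invented for illustration.

```python
from bisect import bisect_left

# Hypothetical taxonomy labels, lowercased and sorted once up front.
TAXONOMY_TERMS = sorted(t.lower() for t in [
    "Taxonomy", "Thesaurus construction", "Thesaurus tools",
    "Search", "Indexing",
])


def autocomplete(prefix, sorted_terms=TAXONOMY_TERMS, limit=5):
    """Return up to `limit` terms that start with the typed prefix."""
    p = prefix.lower()
    start = bisect_left(sorted_terms, p)   # first term >= prefix
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(p):
            break                          # sorted, so no later term matches
        matches.append(term)
        if len(matches) == limit:
            break
    return matches


print(autocomplete("thes"))  # ['thesaurus construction', 'thesaurus tools']
```

Because suggestions come only from the controlled vocabulary, the user is guided toward terms that are guaranteed to retrieve indexed content.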
A quick look behind the scenes
[Diagram labels]
Database Management System
Search thesaurus
Validate term entry
Block invalid terms
Record candidates
Establish rules for term use
Suggest indexing terms
Thesaurus tool
Indexing tool
Validate terms
Add terms and rules
Change terms and rules
Delete terms and rules
Thesaurus Term Record view
Taxonomy view
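The "validate term entry", "block invalid terms", and "record candidates" steps named in this diagram can be sketched as a small gatekeeper class. The class name, its methods, and the sample terms are hypothetical; a real thesaurus tool would also handle rules, relationships, and editorial review.

```python
class TermGate:
    """Accept only thesaurus terms; log everything else as a candidate."""

    def __init__(self, valid_terms):
        # Case-insensitive lookup that returns the preferred spelling.
        self._valid = {t.lower(): t for t in valid_terms}
        self.candidates = []   # rejected entries, kept for editorial review

    def submit(self, term):
        """Return the preferred term if valid; otherwise block it
        (return None) and record it as a candidate term."""
        key = term.strip().lower()
        if key in self._valid:
            return self._valid[key]
        if key and key not in self.candidates:
            self.candidates.append(key)
        return None


gate = TermGate(["Taxonomies", "Inverted files", "Boolean search"])
print(gate.submit("inverted files"))   # Inverted files
print(gate.submit("folksonomies"))     # None
print(gate.candidates)                 # ['folksonomies']
```

Recording blocked entries rather than discarding them gives the taxonomy editor evidence for which terms to add next.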
Where does the subject metadata go?
- Apply to content itself
- Use meta name field in HTML header
- Connect search to the keywords in the SQL or other database tables
HTML Header
RDBMS Connection
- Taxonomy term table
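Both placements, the HTML header and a database term table, can be sketched together with the standard library. The table name `taxonomy_term`, its columns, and the sample rows are assumptions made for illustration and do not describe any particular product's schema.

```python
import sqlite3
from html import escape


def load_terms(conn, rows):
    """Create a simple document-to-term table and insert (doc_id, term) rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS taxonomy_term (doc_id TEXT, term TEXT)"
    )
    conn.executemany("INSERT INTO taxonomy_term VALUES (?, ?)", rows)


def keywords_meta_tag(conn, doc_id):
    """Build an HTML meta keywords tag from the terms stored for a document."""
    cursor = conn.execute(
        "SELECT term FROM taxonomy_term WHERE doc_id = ? ORDER BY term",
        (doc_id,),
    )
    terms = [row[0] for row in cursor]
    content = escape(", ".join(terms), quote=True)
    return '<meta name="keywords" content="{}">'.format(content)


conn = sqlite3.connect(":memory:")
load_terms(conn, [
    ("doc-1", "Taxonomies"),
    ("doc-1", "Inverted files"),
    ("doc-2", "Bayesian inference"),
])
print(keywords_meta_tag(conn, "doc-1"))
# <meta name="keywords" content="Inverted files, Taxonomies">
```

Keeping the terms in a table and generating the header from it means the search engine and the page markup read from one source.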
Suggested taxonomy descriptors
Integrate taxonomy to enhance findability
- Browsable categories of a directory
- Browsable faceted navigation
- Smart search for term equivalents
- Taxonomy terms (original or modified) as labels
- Navigation aids incorporate taxonomy terms and relationships
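"Smart search for term equivalents" usually means expanding the user's query with the synonyms recorded in the thesaurus. A minimal sketch follows; the equivalence table and the function name are invented, and the output uses a generic quoted OR syntax rather than any specific engine's query language.

```python
# Hypothetical equivalence table: preferred term -> non-preferred synonyms.
EQUIVALENTS = {
    "taxonomies": ["controlled vocabularies", "classification schemes"],
    "automobiles": ["cars", "motor vehicles"],
}


def expand_query(query, equivalents=EQUIVALENTS):
    """Return a Boolean OR query covering the term and its equivalents."""
    key = query.strip().lower()
    for preferred, synonyms in equivalents.items():
        group = [preferred] + synonyms
        if key in group:
            return " OR ".join('"{}"'.format(term) for term in group)
    return '"{}"'.format(key)   # no equivalents known: search as typed


print(expand_query("cars"))
# "automobiles" OR "cars" OR "motor vehicles"
```

The user types whichever form they know, and the search covers content indexed under any of the equivalent terms.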
More Taxonomy Enrichment
- Spelling alternatives and correction
- Related concepts
- Statistical information about the metadata
- Navigation or drill downs
- Search refinement
- Recursive sets
- Concept linking
- Dictionary lookup (in taxonomy glossary)
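Spelling correction against the taxonomy can be sketched with `difflib.get_close_matches` from the Python standard library, which ranks vocabulary entries by string similarity. The vocabulary, the cutoff value, and the function name are illustrative choices.

```python
from difflib import get_close_matches

# Hypothetical taxonomy vocabulary used as the spelling dictionary.
VOCABULARY = ["taxonomy", "thesaurus", "indexing", "metadata"]


def suggest_terms(typed, vocabulary=VOCABULARY, limit=3, cutoff=0.8):
    """Suggest taxonomy terms that closely resemble a possibly misspelled word."""
    return get_close_matches(typed.lower(), vocabulary, n=limit, cutoff=cutoff)


print(suggest_terms("taxonmy"))   # ['taxonomy']
print(suggest_terms("zzzz"))      # []
```

Limiting suggestions to taxonomy terms means every "did you mean" option leads to indexed content.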
Brand is repeated in several spots and tied to search as well
Data Base Plus Search Workflow
[Diagram labels]
- Raw full-text data feeds
- XIS creation
- SQL for ecommerce
- Printed source materials
- Add metadata
- Data crawls on 53+ sources
- XIS repository
- Taxonomy terms
- Load to Perfect Search
- MAI Concept Extractor
- Taxonomy Thesaurus Master
- MAI Rule Base
- Search Harmony
- Display
- Search
- Save data to search and repositories at the same time
Data Base Plus Search Workflow
[Diagram labels]
- Raw full-text data feeds
- XIS creation
- SQL for ecommerce
- Printed source materials
- XIS repository
- Data crawls on data sources
- Add metadata
- Load to Search
- MAI Concept Extractor
- MAI Rule Base
- Search Harmony
- Display
- Search
- Taxonomy Thesaurus Master
- Source data
- Taxonomy terms
- Search data
- Clean and enhance data
