Concept Based Search
SySearch vs. Traditional Search
Simplified Version for External
Yu Wang, Engineering Manager
Agenda
• How do we search differently?
  • SySearch vs. Traditional Search
• Search Technology Explained
  • IR & its Applications
  • IR Model, Key Metrics and Core Technologies
  • Relevance Ranking
  • Relevance Ranking Example
• SySearch Functional Overview
SySearch vs. Traditional Search

Scenario: A user receives an e-mail message that says: "Following the incident close to Watford railway station in July, we need to assess the damage being done by tree branches tangling in overhead power lines or falling onto the tracks." The user then wants to locate documents matching the e-mail message.

Traditional search
• Find keywords in the mail: (tree) branches, (power) lines, tracks, …
• Combine keywords with Boolean operators: branches AND lines AND tracks
• Assumptions:
  • The user's need can be expressed in a few words
  • There is no ambiguous wording in the query
  • Vocabulary is precise in the target documents

SySearch: free-text search based on concepts
• Isolate concepts from the mail:
  • Damage being done by tree branches
  • Tangling of overhead power lines
  • Falling trees and tree branches
  • Obstruction or damage to tracks
• Group concepts into free text
• Query by Example (even simpler): find similar documents using the original document
• Enhancement: Query Expansion (covers variation of vocabulary within a concept)
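To make the third assumption concrete, here is a minimal sketch of why a strict Boolean AND query depends on precise vocabulary, and how query expansion relaxes that. The synonym table, the two sample documents, and the function names are all hypothetical, invented for illustration; this is not how SySearch itself is implemented.

```python
import re

# Hypothetical synonym table, invented for this sketch (not SySearch data).
SYNONYMS = {
    "branches": {"branches", "limbs", "boughs"},
    "lines": {"lines", "cables", "wires"},
    "tracks": {"tracks", "rails"},
}

def tokenize(text):
    """Lower-case the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def boolean_and(query_terms, document):
    """Strict Boolean AND: every query term must appear verbatim."""
    tokens = tokenize(document)
    return all(term in tokens for term in query_terms)

def expanded_and(query_terms, document):
    """AND over expanded terms: each term may match any of its synonyms."""
    tokens = tokenize(document)
    return all(bool(tokens & SYNONYMS.get(term, {term})) for term in query_terms)

QUERY = ["branches", "lines", "tracks"]
# Two made-up documents describing the same kind of incident.
DOC_SAME_WORDS = "Tree branches tangled in overhead power lines and fell onto the tracks."
DOC_OTHER_WORDS = "Fallen limbs damaged overhead cables and blocked the rails near the station."

if __name__ == "__main__":
    for name, doc in [("same words", DOC_SAME_WORDS), ("other words", DOC_OTHER_WORDS)]:
        print(name, boolean_and(QUERY, doc), expanded_and(QUERY, doc))
```

The second document is about the same concepts but shares none of the three keywords, so the plain AND query misses it while the expanded query still matches.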
Information Retrieval
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
Applications:
• Web search
• Enterprise, institutional, and domain-specific search
• Personal information retrieval
IR Model, Key Metrics and Core Technologies
[Diagram labels: User Query; Query Rep. {Term}; Document Rep. {Index, Term}; Original Documents; Relevance Ranking]
• Precision: What fraction of the returned results are relevant to the information need?
• Recall: What fraction of the relevant documents in the collection were returned by the system?
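The two metrics can be computed directly from the set of returned documents and the set of relevant documents. The sketch below assumes documents are identified by simple string IDs; the IDs and the relevance judgments are made up for illustration.

```python
def precision(returned, relevant):
    """Fraction of the returned results that are relevant."""
    returned, relevant = set(returned), set(relevant)
    if not returned:
        return 0.0
    return len(returned & relevant) / len(returned)

def recall(returned, relevant):
    """Fraction of the relevant documents that were returned."""
    returned, relevant = set(returned), set(relevant)
    if not relevant:
        return 0.0
    return len(returned & relevant) / len(relevant)

# Hypothetical result set and relevance judgments.
RETURNED = {"d1", "d2", "d3", "d4"}
RELEVANT = {"d2", "d4", "d5", "d6", "d7"}

if __name__ == "__main__":
    print("precision:", precision(RETURNED, RELEVANT))  # 2 of 4 returned are relevant
    print("recall:", recall(RETURNED, RELEVANT))        # 2 of 5 relevant were returned
```

Returning fewer, safer results tends to raise precision at the cost of recall, which is why the two are reported together.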
Relevance Ranking: Traditional Keyword and Boolean Search
Boolean Model
• Based on term matching {1,0}; high precision
• Simple to understand but hard to use
• Hard to support long queries
• Hard to support natural language queries
• Cannot rank relevance effectively because every match scores either 1 or 0
Possible Enhancements
• Page ranking (a great success at Google, but limited to the web)
• Term weighting
Relevance Ranking: SySearch Concept Based Search
Bayesian Probability
• Probability Ranking Principle, BIM (Binary Independence Model)
• Relevance feedback, term weighting
• Supports natural language queries inherently
Query by Example
• Find Similar (hyperlinking)
• Query Expansion
Text Classification
• Category
Example: Relevance Ranking (1)
A user wants to find an article that introduces both apple and orange. He enters the query {Apple, Orange} and the system finds two matching documents. Both documents contain both query terms, but in D1 Apple appears three times as often as Orange, while in D2 Apple and Orange appear equally often. Is D1 or D2 more relevant to the user's need?
[Diagram labels: Query Terms {Apple, Orange}; Relevance Ranking; Search Server; Document Server; D1; D2]
• D1: Apple : Orange = 3 : 1
• D2: Apple : Orange = 1 : 1
• D1 > D2? D2 > D1?
Example: Relevance Ranking (2)

Bayesian Probability (Simplified)
Solution:
• Variable D: relevancy of document {D1, D2}
• Evidence Q: terms Apple (1) and Orange (1) found in the document
• P(Q|D1) = 2 * (3/4) * (1/4) = 3/8
• P(Q|D2) = 2 * (1/2) * (1/2) = 1/2
  (the factor 2 counts the two orders in which one Apple and one Orange can be drawn)
• P(D1) = P(D2) = 1/2
• P(D1|Q) / P(D2|Q) = [P(Q|D1) P(D1)] / [P(Q|D2) P(D2)] = (3/8) / (1/2) = 0.75
Results: D2 should be returned to the user before D1, as it appears to discuss both terms instead of focusing on one of them.
Conclusion: This approach tries to "understand" queries and documents by using statistical methods to establish "concepts".

Boolean Model
Solution:
• Q = Apple ^ Orange
• D1 = Apple ^ Orange ^ Apple ^ Apple …
• D2 = Apple ^ Orange ^ Apple ^ Orange …
• D1 -> (contains) Q => Relevance = 1
• D2 -> (contains) Q => Relevance = 1
Results: D1 and D2 have the same relevance score, so the user has to check for himself which is more relevant.
Conclusion: This approach returns exact matches but fails to rank their relevance effectively.
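The arithmetic on this slide can be reproduced with a short sketch. It assumes each document is a bag of exactly four terms matching the stated Apple : Orange ratios (the slide elides the rest of each document), uses a multinomial query likelihood, and gives both documents the same prior of 1/2. The function names are hypothetical, and this is a simplified illustration rather than the SySearch Bayesian Engine.

```python
from collections import Counter
from fractions import Fraction
from math import factorial

def query_likelihood(query_terms, doc_terms):
    """P(Q|D): multinomial probability of drawing the query from the document."""
    total = len(doc_terms)
    if total == 0:
        return Fraction(0)
    doc_counts = Counter(doc_terms)
    query_counts = Counter(query_terms)
    # Multinomial coefficient: number of orderings of the query terms.
    coefficient = factorial(len(query_terms))
    for count in query_counts.values():
        coefficient //= factorial(count)
    likelihood = Fraction(coefficient)
    for term, count in query_counts.items():
        likelihood *= Fraction(doc_counts[term], total) ** count
    return likelihood

def boolean_relevance(query_terms, doc_terms):
    """Boolean model: 1 if the document contains every query term, else 0."""
    return int(set(query_terms) <= set(doc_terms))

QUERY = ["apple", "orange"]
D1 = ["apple", "orange", "apple", "apple"]   # Apple : Orange = 3 : 1 (assumed length 4)
D2 = ["apple", "orange", "apple", "orange"]  # Apple : Orange = 1 : 1 (assumed length 4)
PRIOR = Fraction(1, 2)                       # P(D1) = P(D2) = 1/2

if __name__ == "__main__":
    l1, l2 = query_likelihood(QUERY, D1), query_likelihood(QUERY, D2)
    print("P(Q|D1) =", l1, " P(Q|D2) =", l2)
    print("P(D1|Q)/P(D2|Q) =", (l1 * PRIOR) / (l2 * PRIOR))
    print("Boolean:", boolean_relevance(QUERY, D1), boolean_relevance(QUERY, D2))
```

Because the priors are equal, they cancel in the ratio, so the ranking is decided entirely by the likelihoods; the Boolean scores tie at 1 and give no ordering.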
SySearch Functional Architecture
[Architecture diagram. Component labels:]
• SySearch Web GUI/API
• System Administration, Security/CSI, Event Framework, Log Management
• Category, Document Group, Meta-Data, Scheduler
• Query: Bayesian Engine, Augmentation, Expansion
• Document Import (FS, DB, EM)
• Document Filtering, Content Add-on, Meta-data Parser
• Text Processing, Term Lexicon
• Index (FS): MPF Index, Meta-data Index, Term Index
• Index (DB): MPF Index, Meta-data Index, Term Index
SySearch Functionalities: Indexing
• Document Import Manager
  • File System, Database, Web (Spider)
  • Common import file format: XOI
• Document Filtering
  • Content add-ons (Stellent)
  • Meta-Data Parser
• Text Processing and Term Lexicon
• Index Manager
  • Position Indexes
SySearch Functionalities: Search/Query
• Search/Query
  • Document, Paragraph
  • Meta-data
  • Boolean (), Proximity (Near, Phrase)
• Natural Language Search
  • Lexicon (tokenize, stem, stop word); defaults to English
• Relevance Ranking
  • Bayesian Engine
  • Custom Term Weighting
• Query Expansion, Query by Example (Find Similar), Categorization
• Filtering
  • Category, Meta-data, Document Group
• Query Enhancement (Synonym & Acronym)
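As a rough illustration of how the lexicon steps (tokenize, stem, stop word) and a position index can support a Near query, here is a minimal sketch. The stop-word list, the deliberately naive suffix-stripping stemmer, the sample documents, and all function names are assumptions made for this example; they do not describe the actual SySearch lexicon or index format.

```python
import re
from collections import defaultdict

# Hypothetical, tiny English stop-word list for this sketch.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "or", "by", "to", "and", "being"}

def naive_stem(token):
    """Very rough suffix stripping; a real system would use a proper stemmer."""
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]
    if token.endswith(("ches", "shes", "xes", "sses")):
        return token[:-2]
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token

def analyze(text):
    """Tokenize, drop stop words, then stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

def build_positional_index(docs):
    """Map term -> doc_id -> positions (counted after stop-word removal)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(analyze(text)):
            index[term][doc_id].append(position)
    return index

def near(index, term_a, term_b, max_distance):
    """Return IDs of documents where the two terms occur within max_distance."""
    postings_a = index.get(naive_stem(term_a.lower()), {})
    postings_b = index.get(naive_stem(term_b.lower()), {})
    hits = set()
    for doc_id in postings_a.keys() & postings_b.keys():
        if any(abs(pa - pb) <= max_distance
               for pa in postings_a[doc_id] for pb in postings_b[doc_id]):
            hits.add(doc_id)
    return hits

# Made-up documents; d1 reuses wording from the e-mail example.
DOCS = {
    "d1": "Damage being done by tree branches tangling in overhead power lines",
    "d2": "Power was restored after the lines near the tracks were inspected",
}
INDEX = build_positional_index(DOCS)

if __name__ == "__main__":
    print(near(INDEX, "power", "lines", 1))
    print(near(INDEX, "lines", "tracks", 2))
```

Storing positions rather than just document IDs is what makes proximity and phrase matching possible; a plain term index could only answer the Boolean question of whether both terms occur.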
Distributed SySearch Deployment
Q&A
