Concept Based Search SySearch vs. Traditional Search Simplified Version for External Yu Wang, Engineering Manager
Concept Based Search

An introduction to Concept based Search Technology

Published in: Business, Technology
  • Concept Based Search

    1. Concept Based Search: SySearch vs. Traditional Search. Simplified Version for External. Yu Wang, Engineering Manager
    2. Agenda
         • How do we search differently?
             • SySearch vs. Traditional Search
         • Search Technology Explained
             • IR & its Applications
             • IR Model, Key Metrics and Core Technologies
             • Relevance Ranking
             • Relevance Ranking Example
         • SySearch Functional Overview
    3. SySearch vs. Traditional Search
       Scenario: A user receives an e-mail message that says: "Following the incident close to Watford railway station in July, we need to assess the damage being done by tree branches tangling in overhead power lines or falling onto the tracks." The user then wants to locate documents matching the e-mail message.
       Traditional search:
         • Find keywords in the mail
             • (tree) branches
             • (power) lines
             • tracks …
         • Combine keywords with Boolean operators
             • branches AND lines AND tracks
         • Assumptions:
             • The user's need can be expressed in a few words
             • No ambiguous wording in the query
             • Vocabulary is precise in the target documents
       SySearch:
         • Free-text search based on concepts
             • Isolate concepts from the mail
                 • Damage being done by tree branches
                 • Tangling of overhead power lines
                 • Falling trees and tree branches
                 • Obstruction or damage to tracks
             • Group concepts into free text
         • Query by Example (even simpler)
             • Find similar documents using the original document
         • Enhancement: Query Expansion
             • Variation of vocabulary within a concept
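The "Query Expansion" enhancement on slide 3 can be illustrated with a minimal sketch. The slides do not describe how SySearch expands queries, so the `expand_query` helper and the synonym table below are hypothetical and only show the general idea of adding vocabulary variants for each term.

```python
def expand_query(terms, synonyms):
    """Return the query terms plus any known variants, without duplicates.

    `synonyms` is a hypothetical hand-written table; a real system would
    derive variants from a lexicon or from the indexed collection.
    """
    expanded = []
    for term in terms:
        # Keep the original term first, then append its variants.
        for candidate in [term, *synonyms.get(term, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


# Illustrative vocabulary only; not taken from the slides.
SYNONYMS = {"branches": ["limbs", "boughs"], "tracks": ["rails"]}

if __name__ == "__main__":
    print(expand_query(["branches", "tracks"], SYNONYMS))
```

The expanded list can then be matched against documents that describe the same concept with different words.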
    4. Agenda (repeated from slide 2)
    5. Information Retrieval
         • Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
         • Applications:
             • Web search
             • Enterprise, institutional, and domain-specific search
             • Personal information retrieval
    6. IR Model, Key Metrics and Core Technologies
       [Diagram labels: User Query; Query Rep. {Term}; Document Rep. {Index, Term}; Original Documents; Relevance Ranking]
         • Precision: What fraction of the returned results are relevant to the information need?
         • Recall: What fraction of the relevant documents in the collection were returned by the system?
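The two metrics on slide 6 can be computed directly from a result list and a set of known-relevant documents. The sketch below is illustrative; the document ids are made up.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for returned and relevant document ids."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # relevant documents that were actually returned
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # 4 documents returned, 3 relevant in the collection, 2 in common.
    p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
    print(f"precision={p:.2f} recall={r:.2f}")
```

Here half of the returned documents are relevant (precision 0.5), and two of the three relevant documents were found (recall about 0.67).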
    7. Relevance Ranking: Traditional
         • Keyword and Boolean Search
             • Boolean Model
                 • Based on term matching {1,0}; high precision
                 • Simple to understand but hard to use
                     • Hard to support long queries
                     • Hard to support natural language queries
                 • Can't rank relevance effectively because of its {1,0} nature
             • Possible Enhancements
                 • Page Ranking (great success for Google, but limited to the web)
                 • Term Weighting
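The {1,0} nature of the Boolean model on slide 7 can be seen in a minimal sketch of an AND query. Whitespace tokenization and the sample documents are simplifying assumptions.

```python
def boolean_and(query_terms, documents):
    """Return ids of documents that contain every query term.

    A document either matches (1) or does not (0); there is no score
    that could order the matching documents.
    """
    wanted = {term.lower() for term in query_terms}
    return [
        doc_id
        for doc_id, text in documents.items()
        if wanted <= set(text.lower().split())
    ]


# Illustrative documents only.
DOCS = {
    "D1": "apple orange apple apple",
    "D2": "apple orange apple orange",
    "D3": "apple pear",
}

if __name__ == "__main__":
    print(boolean_and(["Apple", "Orange"], DOCS))
```

D1 and D2 both match and come back as equals, which is the ranking limitation the slide points out.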
    8. Relevance Ranking: SySearch
         • Concept Based Search
             • Bayesian Probability
                 • Probability Ranking Principle, BIM
                 • Relevance Feedback, Term Weighting
                 • Supports natural language queries inherently
             • Query by Example
                 • Find Similar (Hyperlinking)
             • Query Expansion
             • Text Classification
                 • Category
    9. Example: Relevance Ranking (1)
       A user wants to find an article that introduces both apples and oranges. He enters the query {Apple, Orange}, and the system finds 2 matching documents. Both documents contain both query terms, but in D1 Apple appears 3 times as often as Orange, while in D2 Apple and Orange appear equally often. Is D1 or D2 more relevant to the user's need?
       [Diagram labels: Query Terms {Apple, Orange}; Relevance Ranking; Search Server; Document Server; D1 (Apple : Orange = 3 : 1); D2 (Apple : Orange = 1 : 1); D1 > D2? D2 > D1?]
    10. Example: Relevance Ranking (2)
        Bayesian Probability (Simplified)
          • Solution:
              • Variable D: relevancy of document {D1, D2}
              • Evidence Q: terms Apple (1) and Orange (1) found in the document
              • P(Q|D1) = 2 × (3/4) × (1/4) = 3/8
              • P(Q|D2) = 2 × (1/2) × (1/2) = 1/2
              • P(D1) = P(D2) = 1/2
              • P(D1|Q) / P(D2|Q) = [P(Q|D1) P(D1)] / [P(Q|D2) P(D2)]
              • P(D1|Q) / P(D2|Q) = 0.75
          • Results:
              • D2 should be returned to the user before D1, as it appears to discuss both terms instead of focusing on one of them
          • Conclusion:
              • This approach tries to "understand" queries and documents by using statistical methods to establish "concepts"
        Boolean Model
          • Solution:
              • Q = Apple ^ Orange
              • D1 = Apple ^ Orange ^ Apple ^ Apple …
              • D2 = Apple ^ Orange ^ Apple ^ Orange …
              • D1 -> (contains) Q => Relevance = 1
              • D2 -> (contains) Q => Relevance = 1
          • Results:
              • D1 and D2 have the same relevance score; the user has to check for himself which is more relevant
          • Conclusion:
              • This approach returns exact matches but fails to rank their relevance effectively
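The arithmetic in the Bayesian column of slide 10 can be reproduced with a short sketch. It assumes the slide's numbers come from a multinomial likelihood over each document's relative term frequencies, so that the factor 2 is the coefficient 2!/(1!·1!); the slides do not say how SySearch actually computes this. The function name is hypothetical.

```python
from fractions import Fraction
from math import factorial


def query_likelihood(query_counts, doc_counts):
    """P(Q|D) under a multinomial model (an assumed reading of the slide).

    query_counts: term -> occurrences in the query
    doc_counts:   term -> occurrences in the document
    """
    total = sum(doc_counts.values())
    coeff = factorial(sum(query_counts.values()))
    likelihood = Fraction(1)
    for term, k in query_counts.items():
        coeff //= factorial(k)  # builds n! / (k1! * k2! * ...)
        likelihood *= Fraction(doc_counts.get(term, 0), total) ** k
    return coeff * likelihood


QUERY = {"apple": 1, "orange": 1}
D1 = {"apple": 3, "orange": 1}  # Apple : Orange = 3 : 1
D2 = {"apple": 1, "orange": 1}  # Apple : Orange = 1 : 1

if __name__ == "__main__":
    p1, p2 = query_likelihood(QUERY, D1), query_likelihood(QUERY, D2)
    prior = Fraction(1, 2)  # P(D1) = P(D2) = 1/2, as on the slide
    print(p1, p2, (p1 * prior) / (p2 * prior))
```

With equal priors the posterior ratio reduces to the likelihood ratio, (3/8) / (1/2) = 3/4, so D2 ranks ahead of D1.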
    11. Agenda (repeated from slide 2)
    12. SySearch Functional Architecture
        [Architecture diagram labels: SySearch Web GUI/API; System Administration; Security/CSI; Event Framework; Log Management; Category; Document Group; Meta-Data; Scheduler; Query; Bayesian Engine; Augmentation; Expansion; FS; DB; EM; Document Import; Document Filtering; Content Add-on; Meta-data Parser; Text Processing; Term Lexicon; Index; MPF Index; Meta-data Index; Term Index; FS; MPF Index; Meta-data Index; Term Index; DB]
    13. SySearch Functionalities
          • Indexing
              • Document Import Manager
                  • File System, Database, Web (Spider)
                  • Common Import File Format – XOI
              • Document Filtering
                  • Content add-ons (Stellent)
                  • Meta-Data Parser
              • Text Processing and Term Lexicon
              • Index Manager
                  • Position Indexes
    14. SySearch Functionalities
          • Search/Query
              • Document, Paragraph
              • Meta-data
              • Boolean (), Proximity (Near, Phrase)
              • Natural Language Search
                  • Lexicon (tokenize, stem, stop words), defaults to English
              • Relevance Ranking
                  • Bayesian Engine
                  • Custom Term Weighting
                  • Query Expansion, Query by Example (Find Similar), Categorization
              • Filtering
                  • Category, Meta-data, Document Group
              • Query Enhancement (Synonym & Acronym)
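Proximity search, listed on slide 14, is commonly built on the kind of position index mentioned on slide 13. The sketch below shows the general technique only; the function names, whitespace tokenization, and sample documents are assumptions, not SySearch's implementation.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map each term to {doc_id: [token positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, token in enumerate(text.lower().split()):
            index[token][doc_id].append(position)
    return index


def near(index, term_a, term_b, max_distance):
    """Return ids of documents where the two terms occur within max_distance tokens."""
    postings_a = index.get(term_a, {})
    postings_b = index.get(term_b, {})
    hits = []
    for doc_id in postings_a:
        if doc_id in postings_b and any(
            abs(pos_a - pos_b) <= max_distance
            for pos_a in postings_a[doc_id]
            for pos_b in postings_b[doc_id]
        ):
            hits.append(doc_id)
    return hits


# Illustrative documents only.
DOCS = {
    "a": "tree branches fell onto the tracks",
    "b": "branches of the bank closed while tracks were repaired elsewhere",
}

if __name__ == "__main__":
    index = build_positional_index(DOCS)
    print(near(index, "branches", "tracks", max_distance=4))
```

Both documents contain both terms, but only in document "a" do they fall within four tokens of each other. A phrase query is the stricter case where the second term must sit exactly one position after the first.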
    15. Distributed SySearch Deployment
    16. Q&A
