Task-Specific Query Expansion for Genomics (MultiText Experiments for TREC 2003)


A presentation given at TREC (Text REtrieval Conference) 2003, based on the paper "Task-Specific Query Expansion (MultiText Experiments for TREC 2003)" by myself, Charles Clarke, Gordon Cormack, Thomas Lynam, and Egidio Terra.

The research presented in this talk formed the basis of my Master's (MMath) thesis in computer science.


  1. University of Waterloo: MultiText for Genomics
     Task-Specific Query Expansion for Genomics (MultiText Experiments for TREC 2003)
     David L. Yeung, University of Waterloo, Waterloo, Ontario, Canada
     Nov. 20, 2003
     TREC 2003 Genomics Track: University of Waterloo MultiText Project
  2. The MultiText Project
     • What is MultiText?
       • A collection of IR tools developed at U of Waterloo.
     • What is MultiText for Genomics?
       • Based on MultiText.
       • No external databases or domain-specific knowledge.
       • A combination of techniques...
  3. MultiText for Genomics
     • What is MultiText for Genomics?
     [System diagram: Topic, Query Formulation (Okapi), Query Tiering (metadata), Feedback (Query expansion), Documents]
  4. Query Formulation (Okapi)
     • Two interesting facts:
       • Gene name type didn't matter
       • Spacing and punctuation affected performance
     • Example (training topic 5):
       • glycine receptor, alpha 1
       • Glycine-receptor, alpha1
       • Alpha 1 Glycine Receptor
       • glycine receptors... alpha receptor... alpha 1
       • And so on...
  5. Okapi Search Term Sets
     • Generate multiple search term sets:
       • Okapi 1 (higher precision, lower recall)
         • Treat gene names as phrases, except for punctuation.
         • “glycine_receptor_alpha_1”
       • Okapi 2
         • Heuristics for guessing role of punctuation; also guess plurals.
       • Okapi 3 (lower precision, higher recall)
         • All pairs of tokens from gene names (bigrams).
         • “glycine[_]receptor”, “receptor[_]alpha”, “alpha[_]1”, etc.
       • Okapi Fusion
         • Take the product of the 3 scores.
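
The term sets on slide 5 can be illustrated with a small Python sketch. It is not the MultiText implementation: the function names are hypothetical, the tokenizer (lowercase, split on anything that is not a letter or digit) is an assumption, the `[_]` separator is copied from the slide only as notation, and restricting the product fusion to documents scored by every run is also an assumption. The Okapi 2 heuristics are not described in the slides and are left out.

```python
import re
from math import prod


def tokenize(gene_name):
    """Lowercase and split on punctuation/whitespace (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", gene_name.lower())


def phrase_term(gene_name):
    """Okapi 1 style: the whole gene name as one phrase, punctuation dropped."""
    return "_".join(tokenize(gene_name))


def bigram_terms(gene_name):
    """Okapi 3 style: one term per pair of adjacent tokens."""
    tokens = tokenize(gene_name)
    return [f"{a}[_]{b}" for a, b in zip(tokens, tokens[1:])]


def fuse_by_product(*score_maps):
    """Okapi Fusion style: multiply the per-run scores of each document.

    Assumption: only documents scored by every run are kept.
    """
    common = set(score_maps[0]).intersection(*score_maps[1:])
    return {doc: prod(scores[doc] for scores in score_maps) for doc in common}


if __name__ == "__main__":
    name = "glycine receptor, alpha 1"
    print(phrase_term(name))
    print(bigram_terms(name))
```

The phrase form favours precision because every token must appear together, while the bigram form matches documents that contain only part of the name, which is consistent with the precision/recall trade-off stated on the slide.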
  6. Results of Okapi Experiments
     [Bar chart: Mean Average Precision (MAP), axis from 0 to 0.35, for Okapi 1, Okapi 2, Okapi 3, and Okapi Fusion on the training and test data]
     • Two interesting points:
       • The trend in MAP is reversed between the training and test data.
       • Recall (from most to least): Okapi Fusion/Okapi 3, Okapi 2, Okapi 1.
  7. MultiText for Genomics
     • Next: Query Tiering
     [System diagram: Topic, Query Formulation (Okapi), Query Tiering (metadata), Feedback (Query expansion), Documents]
  8. Query Tiering (metadata)
     • Use metadata tags in data:
       (“<TagName>”..“</TagName>”) > “search_terms”
     • Order them by correlation to relevance:
       • chemical list (RN)
       • title (TI)
       • abstract (AB)
       • MeSH headings (MH)
       • PubMed ID (PMID)...
  9. The Query Tiers
     • 6 Query Tiers:
       • Tier 1:
         • Almost exact match in the “chemical list” metadata field.
         • “glycine receptor, alpha 1” → “glycine receptor alpha1”
       • Tier 2:
         • As above, but allow for additional terms.
         • “RAC1” → “rac1 GTP-Binding Protein”
       • Tier 3:
         • Gene name is weakened until a match is made.
         • “estrogen receptor 1” → “Receptors, Estrogen”
  10. The Query Tiers
     • 6 Query Tiers (continued):
       • Tier 4:
         • Boolean expression in the “title” metadata field.
         • “tyrosyl-tRNA synthetase” → “tyrosyl”^“trna”^“synthetase”
       • Tier 5:
         • Boolean expression in the “chemical list” metadata field.
       • Tier 6:
         • Boolean expression in the “abstract” metadata field.
  11. Using the Query Tiers
     • Can retrieve documents using:
       • All Tiers (AT)
         • The tiers are executed in order.
       • Best Tier (BT)
         • Once a tier has retrieved non-zero documents, ignore the rest.
     ... then fuse with results of Okapi experiment.
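
The difference between All Tiers and Best Tier on slide 11 can be sketched in Python as follows. This is only an illustration: each tier is modelled as a hypothetical callable that returns a list of document IDs, and the choice to rank documents from earlier tiers first and to drop duplicates in All Tiers is an assumption, since the slides say only that the tiers are executed in order.

```python
def all_tiers(tiers, query):
    """All Tiers (AT): run every tier in order and concatenate the results.

    Assumption: documents found by an earlier tier keep their earlier
    position, and later duplicates are dropped.
    """
    ranked, seen = [], set()
    for tier in tiers:
        for doc in tier(query):
            if doc not in seen:
                seen.add(doc)
                ranked.append(doc)
    return ranked


def best_tier(tiers, query):
    """Best Tier (BT): stop at the first tier that retrieves any document."""
    for tier in tiers:
        docs = tier(query)
        if docs:
            return list(docs)
    return []


if __name__ == "__main__":
    # Hypothetical tiers, standing in for the six metadata queries.
    tiers = [
        lambda q: [],            # strictest tier finds nothing
        lambda q: ["a", "b"],    # first tier with hits
        lambda q: ["b", "c"],    # weaker tier adds one more document
    ]
    print(all_tiers(tiers, "RAC1"))
    print(best_tier(tiers, "RAC1"))
```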
  12. Using the Query Tiers
     [System diagram: Topic, Query Formulation (Okapi), Query Tiering (metadata), Feedback (Query expansion), Documents]
     • Fusing with Okapi:
       • Rank Fusion (-R)
         • Document's score based on weighted sum of (reverse) rank.
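
One possible reading of the rank fusion on slide 12 is sketched below in Python. The slides do not give the weights or the exact definition of reverse rank, so both are assumptions here: a document at 0-based position `rank` in a run of length `n` contributes `weight * (n - rank)`, documents missing from a run contribute nothing, and ties are broken by document ID.

```python
def rank_fusion(runs, weights):
    """Fuse ranked lists by a weighted sum of reverse ranks.

    runs    -- list of ranked document-ID lists, best document first
    weights -- one weight per run (hypothetical values, not from the slides)
    """
    scores = {}
    for run, weight in zip(runs, weights):
        n = len(run)
        for rank, doc in enumerate(run):
            # Reverse rank: the top document gets n, the last one gets 1.
            scores[doc] = scores.get(doc, 0.0) + weight * (n - rank)
    # Highest fused score first; document ID breaks ties deterministically.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


if __name__ == "__main__":
    okapi_run = ["a", "b", "c"]
    tier_run = ["b", "a"]
    print(rank_fusion([okapi_run, tier_run], [1.0, 2.0]))
```

Using ranks rather than raw scores avoids having to normalize scores from methods whose score scales are unrelated, such as a BM25 run and a tiered Boolean match.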
  13. MultiText for Genomics
     • Next: Feedback
     [System diagram: Topic, Query Formulation (Okapi), Query Tiering (metadata), Feedback (Query expansion), Documents]
  14. Feedback (Query expansion)
     • Learn “most relevant” chemical:
       • Using pseudo-relevance feedback
       • Only if document not matched in Tier 1
       • Assign score to chemicals using Tf-Idf scoring scheme:
         $w_i = R_i \times \left( \log \frac{N}{f_i} \right)^{\alpha}$
     • Example (training topic 27):
       • cholinergic receptor, muscarinic 3
         • Receptors, Muscarinic (29880.980020675546)
         • Muscarinic Antagonists (20430.84754342255)
         • muscarinic receptor M2 (13976.522895229124)
         • muscarinic receptor M3 (11159.997636110056)
         • Carbachol (11101.760218985524)
         • ... etc.
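
The scoring step on slide 14 can be sketched in Python, reading the slide's formula as w_i = R_i × (log(N / f_i))^α. The slide does not define its symbols, so the interpretation here is an assumption: `r_i` is how often chemical i appears among the top-ranked (pseudo-relevant) documents, `f_i` is how many documents in the collection list it, `n_docs` is the collection size, and `alpha` is a tuning exponent. All names and the sample counts are hypothetical.

```python
import math


def chemical_weight(r_i, f_i, n_docs, alpha=1.0):
    """Tf-Idf style weight: r_i * (log(n_docs / f_i)) ** alpha.

    The meaning of each symbol is assumed, not taken from the slides.
    """
    return r_i * math.log(n_docs / f_i) ** alpha


def rank_chemicals(feedback_counts, collection_counts, n_docs, alpha=1.0):
    """Score every chemical seen in the feedback documents, best first."""
    weights = {
        chem: chemical_weight(r_i, collection_counts[chem], n_docs, alpha)
        for chem, r_i in feedback_counts.items()
    }
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up counts for illustration only.
    feedback = {"Receptors, Muscarinic": 12, "Carbachol": 12}
    collection = {"Receptors, Muscarinic": 400, "Carbachol": 4000}
    for chem, weight in rank_chemicals(feedback, collection, n_docs=500000):
        print(f"{chem}: {weight:.2f}")
```

With equal feedback counts, the chemical that is rarer in the collection receives the higher weight, which is the usual effect of the inverse document frequency factor.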
  15. Complete MTG System - Runs
     [System diagram: Topic, Query Formulation (Okapi), Query Tiering (metadata), Feedback (Query expansion), Documents]
     • Complete runs: Okapi Fusion, ATR, BTR, ATRF, BTRF
       • Fusion with Okapi: Rank Fusion (-R)
       • Query Tiering: All Tiers (AT), Best Tier (BT)
       • Feedback: (-F)
  16. Complete MTG System - Results
     [Bar chart: Mean Average Precision (MAP), axis from 0 to 0.5, for Okapi Fusion, ATR, BTR, ATRF*, and BTRF* on the training and test data]
     • Complete runs: Okapi Fusion, ATR, BTR, ATRF*, BTRF*
       • Fusion with Okapi: Rank Fusion (-R)
       • Query Tiering: All Tiers (AT), Best Tier (BT)
       • Feedback: (-F)
       • * denotes an official submission
  17. Conclusions
     • MultiText supports a variety of standard and non-standard techniques:
       • Okapi BM25 implementation
       • Query Tiering and Fusion
       • Pseudo-relevance Feedback
     • Possible to improve performance in genomics domain even without domain-specific knowledge:
       • Characteristics of corpus (SSR, metadata)
       • Merging results of multiple independent methods
     • For more information, please see our paper!
