Task-Specific Query Expansion for Genomics (MultiText Experiments for TREC 2003)
 


A presentation given at TREC (Text REtrieval Conference) 2003, based on the paper "Task-Specific Query Expansion (MultiText Experiments for TREC 2003)" by myself, Charles Clarke, Gordon Cormack, Thomas Lynam, and Egidio Terra.

The research presented in this talk formed the basis of my Master's (MMath) thesis in computer science.


Usage Rights

CC Attribution-NonCommercial-ShareAlike License

    Presentation Transcript

    • Task-Specific Query Expansion for Genomics (MultiText Experiments for TREC 2003)
      • University of Waterloo, MultiText for Genomics
      • David L. Yeung, University of Waterloo, Waterloo, Ontario, Canada
      • Nov. 20, 2003
      • TREC 2003 Genomics Track: University of Waterloo MultiText Project
    • The MultiText Project
      • What is MultiText?
        • A collection of IR tools developed at the University of Waterloo.
      • What is MultiText for Genomics?
        • Based on MultiText.
        • No external databases or domain-specific knowledge.
        • A combination of techniques...
    • MultiText for Genomics
      • What is MultiText for Genomics?
      • [Diagram labels: Topic, Query Formulation (Okapi), Query Tiering (metadata), Feedback (Query expansion), Documents]
    • Query Formulation (Okapi)
      • Two interesting facts:
        • Gene name type didn't matter
        • Spacing and punctuation affected performance
      • Example (training topic 5):
        • glycine receptor, alpha 1
        • Glycine-receptor, alpha1
        • Alpha 1 Glycine Receptor
        • glycine receptors... alpha receptor... alpha 1
        • And so on...
    • Okapi Search Term Sets
      • Generate multiple search term sets:
        • Okapi 1 (higher precision, lower recall)
          • Treat gene names as phrases, except for punctuation.
          • “glycine_receptor_alpha_1”
        • Okapi 2
          • Heuristics for guessing role of punctuation; also guess plurals.
        • Okapi 3 (lower precision, higher recall)
          • All pairs of tokens from gene names (bigrams).
          • “glycine[_]receptor”, “receptor[_]alpha”, “alpha[_]1”, etc.
        • Okapi Fusion
          • Take the product of the 3 scores.
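The term sets above might be generated along the following lines. This is a minimal sketch, not the MultiText implementation: the tokenizer, the function names, the use of a plain underscore as the joiner, and the reading of "all pairs of tokens" as adjacent pairs (following the slide's own examples) are all assumptions. Okapi 2 is left out because the slides do not spell out its punctuation and plural heuristics.

```python
import re
from math import prod


def tokenize(gene_name):
    """Lowercase and split on anything that is not a letter or digit (assumed tokenizer)."""
    return [t for t in re.split(r"[^a-z0-9]+", gene_name.lower()) if t]


def okapi1_phrase(gene_name):
    """Okapi 1 style: the whole gene name as one phrase, punctuation ignored."""
    return "_".join(tokenize(gene_name))


def okapi3_bigrams(gene_name):
    """Okapi 3 style: adjacent token pairs from the gene name (assumed reading of 'bigrams')."""
    tokens = tokenize(gene_name)
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


def product_fusion(scores):
    """Okapi Fusion style: multiply one document's scores from the individual runs."""
    return prod(scores)


print(okapi1_phrase("glycine receptor, alpha 1"))   # glycine_receptor_alpha_1
print(okapi3_bigrams("glycine receptor, alpha 1"))  # ['glycine_receptor', 'receptor_alpha', 'alpha_1']
```

With a product, a document that scores zero in any one run scores zero overall; how the original system handled documents missing from a run is not stated in the slides.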
    • Results of Okapi Experiments
      • [Bar chart: Mean Average Precision (MAP) on the training and test data for Okapi 1, Okapi 2, Okapi 3, and Okapi Fusion; axis from 0 to 0.35]
      • Two interesting points:
        • The trend in MAP is reversed between the training and test data.
        • Recall (from most to least): Okapi Fusion/Okapi 3, Okapi 2, Okapi 1.
    • MultiText for Genomics
      • Next: Query Tiering
      • [Diagram labels: Topic, Query Formulation (Okapi), Query Tiering (metadata), Feedback (Query expansion), Documents]
    • Query Tiering (metadata)
      • Use metadata tags in data: (“<TagName>”..“</TagName>”) > “search_terms”
      • Order them by correlation to relevance:
        • [Diagram: fields arranged along a "Relevance" axis: chemical list (RN), title (TI), abstract (AB), MeSH headings (MH), PubMed ID (PMID)...]
    • The Query Tiers
      • 6 Query Tiers:
        • Tier 1:
          • Almost exact match in the “chemical list” metadata field.
          • “glycine receptor, alpha 1” → “glycine receptor alpha1”
        • Tier 2:
          • As above, but allow for additional terms.
          • “RAC1” → “rac1 GTP-Binding Protein”
        • Tier 3:
          • Gene name is weakened until a match is made.
          • “estrogen receptor 1” → “Receptors, Estrogen”
    • The Query Tiers
      • 6 Query Tiers (continued):
        • Tier 4:
          • Boolean expression in the “title” metadata field.
          • “tyrosyl-tRNA synthetase” → “tyrosyl”^“trna”^“synthetase”
        • Tier 5:
          • Boolean expression in the “chemical list” metadata field.
        • Tier 6:
          • Boolean expression in the “abstract” metadata field.
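The Boolean tiers (4 to 6) can be illustrated with a small sketch. Everything here is hypothetical scaffolding: the tokenizer, the function names, and the representation of a record as a dictionary keyed by field tag (TI, RN, AB) are assumptions, and the AND is evaluated directly in Python rather than by the MultiText query engine.

```python
import re


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit (assumed tokenizer)."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def boolean_and_expression(gene_name):
    """Render a gene name as a conjunction of its tokens, in the slide's '^' notation."""
    return "^".join(f'"{t}"' for t in tokenize(gene_name))


def matches_field(gene_name, record, field):
    """True if every token of the gene name occurs somewhere in the given field of the record."""
    needed = set(tokenize(gene_name))
    present = set(tokenize(record.get(field, "")))
    return bool(needed) and needed <= present


# Hypothetical record; the field tags follow the slides (TI = title, AB = abstract).
record = {
    "TI": "Crystal structure of human tyrosyl-tRNA synthetase",
    "AB": "We report the structure of an aminoacyl-tRNA synthetase.",
}

print(boolean_and_expression("tyrosyl-tRNA synthetase"))       # "tyrosyl"^"trna"^"synthetase"
print(matches_field("tyrosyl-tRNA synthetase", record, "TI"))  # True
print(matches_field("tyrosyl-tRNA synthetase", record, "AB"))  # False
```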
    • Using the Query Tiers
      • Can retrieve documents using:
        • All Tiers (AT)
          • The tiers are executed in order.
        • Best Tier (BT)
          • Once a tier has retrieved at least one document, ignore the rest.
      • ... then fuse with results of Okapi experiment.
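The difference between the two strategies can be sketched as follows. This is an illustration under assumptions: each tier is modeled as a hypothetical function from a topic to a list of document IDs, and for All Tiers it is assumed that documents from earlier tiers rank ahead of later ones and that a document is kept only at the first tier that retrieves it. The slides do not specify either detail.

```python
def all_tiers(tiers, topic):
    """All Tiers (AT): run every tier in order, keeping each document where it first appears."""
    ranked, seen = [], set()
    for tier in tiers:
        for doc in tier(topic):
            if doc not in seen:
                seen.add(doc)
                ranked.append(doc)
    return ranked


def best_tier(tiers, topic):
    """Best Tier (BT): return the results of the first tier that retrieves anything."""
    for tier in tiers:
        docs = tier(topic)
        if docs:
            return list(docs)
    return []


# Hypothetical tiers: the first finds nothing, the later ones find overlapping documents.
tiers = [
    lambda topic: [],
    lambda topic: ["d3", "d1"],
    lambda topic: ["d1", "d7"],
]

print(all_tiers(tiers, "glycine receptor, alpha 1"))  # ['d3', 'd1', 'd7']
print(best_tier(tiers, "glycine receptor, alpha 1"))  # ['d3', 'd1']
```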
    • Using the Query Tiers
      • [Diagram labels: Topic, Query Formulation (Okapi), Query Tiering (metadata), Feedback (Query expansion), Documents]
      • Fusing with Okapi:
        • Rank Fusion (-R)
          • Document's score based on weighted sum of (reverse) rank.
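A weighted sum of reverse ranks could look like the sketch below. The slides give no weights or cutoff, so the weights, the shared depth (the length of the longest list), the zero contribution for documents missing from a list, and the tie-break on document ID are all assumptions made for illustration.

```python
from collections import defaultdict


def rank_fusion(rankings, weights):
    """Fuse ranked lists: each list adds weight * reverse rank to a document's score.

    Reverse rank is depth - position, so the top document of a list scores `depth`.
    A document absent from a list receives nothing from that list (assumption).
    """
    depth = max((len(r) for r in rankings), default=0)
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for position, doc in enumerate(ranking):
            scores[doc] += weight * (depth - position)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


tier_run = ["a", "b", "c"]   # hypothetical query-tier ranking
okapi_run = ["b", "a"]       # hypothetical Okapi ranking
print(rank_fusion([tier_run, okapi_run], weights=[1.0, 2.0]))  # ['b', 'a', 'c']
```

In this example "a" scores 3 + 2×2 = 7, "b" scores 2 + 2×3 = 8, and "c" scores 1, so the more heavily weighted Okapi list decides the top position.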
    • MultiText for Genomics
      • Next: Feedback
      • [Diagram labels: Topic, Query Formulation (Okapi), Query Tiering (metadata), Feedback (Query expansion), Documents]
    • Feedback (Query expansion)
      • Learn “most relevant” chemical:
        • Using pseudo-relevance feedback
        • Only if document not matched in Tier 1
        • Assign score to chemicals using Tf-Idf scoring scheme: $w_i = R_i \times \left( \log \frac{N}{f_i} \right)^{\alpha}$
      • Example (training topic 27):
        • cholinergic receptor, muscarinic 3
          • Receptors, Muscarinic (29880.980020675546)
          • Muscarinic Antagonists (20430.84754342255)
          • muscarinic receptor M2 (13976.522895229124)
          • muscarinic receptor M3 (11159.997636110056)
          • Carbachol (11101.760218985524)
          • ... etc.
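The scoring step can be sketched in a few lines, reading the slide's formula as w_i = R_i × (log(N / f_i))^α. The slides do not define the symbols, so the interpretation used here is an assumption: R_i is the number of top-ranked (pseudo-relevant) documents listing chemical i, N is the collection size, f_i is the number of documents in the collection listing chemical i, and α is a tuning exponent. The function name and the toy data are hypothetical.

```python
import math
from collections import Counter


def score_chemicals(feedback_docs, doc_freq, corpus_size, alpha=1.0):
    """Score each chemical as R_i * (log(N / f_i)) ** alpha.

    feedback_docs: chemical lists of the top-ranked documents (assumed pseudo-relevant).
    doc_freq:      chemical -> number of documents in the whole collection listing it (f_i).
    corpus_size:   total number of documents in the collection (N).
    """
    # R_i: in how many feedback documents chemical i appears (assumed meaning of R_i).
    r = Counter(chem for doc in feedback_docs for chem in set(doc))
    return {
        chem: r_i * math.log(corpus_size / doc_freq[chem]) ** alpha
        for chem, r_i in r.items()
    }


# Toy data, not taken from the TREC collection.
feedback_docs = [
    ["Receptors, Muscarinic", "Carbachol"],
    ["Receptors, Muscarinic"],
]
doc_freq = {"Receptors, Muscarinic": 10, "Carbachol": 100}

scores = score_chemicals(feedback_docs, doc_freq, corpus_size=1000)
for chem in sorted(scores, key=scores.get, reverse=True):
    print(chem, scores[chem])
```

Here the chemical that is both frequent in the feedback documents and rare in the collection ranks first, which is the behaviour the slide's example list suggests.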
    • Complete MTG System - Runs
      • [Diagram labels: Topic, Query Formulation (Okapi), Query Tiering (metadata), Feedback (Query expansion), Documents]
      • Complete runs: Okapi Fusion, ATR, BTR, ATRF, BTRF
        • Fusion with Okapi: Rank Fusion (-R)
        • Query Tiering: All Tiers (AT), Best Tier (BT)
        • Feedback: (-F)
    • Complete MTG System - Results
      • [Bar chart: Mean Average Precision (MAP) on the training and test data for Okapi Fusion, ATR, BTR, ATRF*, and BTRF*; axis from 0 to 0.5]
      • Complete runs: Okapi Fusion, ATR, BTR, ATRF*, BTRF*
        • Fusion with Okapi: Rank Fusion (-R)
        • Query Tiering: All Tiers (AT), Best Tier (BT)
        • Feedback: (-F)
        • * denotes an official submission
    • Conclusions
      • MultiText supports a variety of standard and non-standard techniques:
        • Okapi BM25 implementation
        • Query Tiering and Fusion
        • Pseudo-relevance Feedback
      • Possible to improve performance in genomics domain even without domain-specific knowledge:
        • Characteristics of corpus (SSR, metadata)
        • Merging results of multiple independent methods
      • For more information, please see our paper!