Cec2010 araujo pereziglesias


Presentation at the CEC 2010 conference of the paper "Training a Classifier for the selection of Good Query Expansion Terms with a Genetic Algorithm"


  1. 1. <ul><li>Training a Classifier for Good Query Expansion Terms with a Genetic Algorithm </li></ul><ul>Lourdes Araujo UNED Spain Joaquín Pérez-Iglesias UNED Spain </ul>
  2. 2. <ul>The problem </ul><ul><li>Terms in user queries differ from the document terms that refer to the same concept
  3. 3. User queries are usually too short and ambiguous
  4. 4. Several studies have concluded that the average query length is between 2 and 3 terms. </li></ul>
  5. 5. <ul>The problem </ul><ul><li>Query example: </li></ul><ul>modify query technique </ul>
  6. 6. <ul>The problem </ul><ul><li>User information need: </li></ul><ul>learning how to modify search queries to improve the results of their searches </ul>
  7. 7. <ul>The problem </ul>
  8. 8. <ul>The problem </ul><ul>Terms that are too general: </ul><ul><li>The number of documents retrieved is extremely large
  9. 9. Inappropriate order </li></ul>
  10. 10. <ul>The problem </ul><ul>Adding other related terms can improve the results: modification, method, transformation, “query expansion”, … New query: modify query technique “query expansion” </ul>
  11. 11. <ul>The problem </ul>
  12. 12. <ul>The problem </ul><ul><li>Results improve </li></ul><ul><li>The first documents retrieved are relevant </li></ul><ul><li>The goal: to find expansion terms automatically , but... </li></ul>
  13. 13. <ul>The problem </ul><ul>Beware! </ul><ul><li>Adding too many terms can prevent the system from returning any answer
  14. 14. Performance drops for some queries
  15. 15. Expansion terms must be carefully chosen </li></ul>
  16. 16. <ul>Approaches to query expansion </ul><ul><li>Relevance feedback (user supervision): </li><ul><li>The user selects the relevant documents from those retrieved with the original query
  17. 17. The system extracts expansion terms from the set of relevant documents. </li></ul></ul>
  18. 18. <ul>Approaches to query expansion </ul><ul><li>Pseudo-relevance feedback (without user supervision): </li><ul><li>The system takes the top k retrieved documents and assumes they are relevant
  19. 19. The system extracts expansion terms from the top k documents. </li></ul></ul>
  20. 20. <ul>The expansion Process </ul><ul><li>Retrieve top k relevant documents for original query
  21. 21. Extract terminology from top k
  22. 22. Select the expansion terms
  23. 23. Reweight expansion terms
  24. 24. Expanded query: original + expansion terms </li></ul>
  25. 25. <ul>Terminology extraction: Kullback-Leibler divergence </ul><ul><li>Extracts terms that appear frequently in the top k documents but have low frequency in the whole collection. </li></ul><ul>$P_F$: probability of term t in the top k docs. </ul>$P_C$: probability of term t in the whole collection.
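The scoring idea on this slide can be sketched as follows. The slide's own formula did not survive in this transcript, so the sketch assumes the commonly used per-term KLD contribution, score(t) = P_F(t) · log(P_F(t) / P_C(t)). The function name `kld_scores`, the whitespace tokenisation, and the toy documents are illustrative assumptions, not part of the presentation.

```python
import math
from collections import Counter


def kld_scores(feedback_docs, collection_docs):
    """Score each feedback term by P_F(t) * log(P_F(t) / P_C(t)).

    Assumes feedback_docs is a subset of collection_docs, so every
    feedback term has a non-zero collection probability.
    """
    feedback_counts = Counter(t for doc in feedback_docs for t in doc.split())
    collection_counts = Counter(t for doc in collection_docs for t in doc.split())
    feedback_total = sum(feedback_counts.values())
    collection_total = sum(collection_counts.values())

    scores = {}
    for term, count in feedback_counts.items():
        p_f = count / feedback_total                          # P_F(t)
        p_c = collection_counts[term] / collection_total     # P_C(t)
        scores[term] = p_f * math.log(p_f / p_c)
    return scores


# Toy collection (hypothetical); the first document plays the role of the top k.
collection = [
    "the query expansion improves the query results",
    "the weather is nice",
    "the cat sat on the mat",
    "the dog chased the cat",
]
scores = kld_scores(collection[:1], collection)
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy data, a term concentrated in the feedback document ("query") ranks first, while a term that is common across the whole collection ("the") receives a negative score and ranks last.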
  26. 26. <ul>Selecting Expansion Terms </ul><ul><li>We can use a GA to select the combination of terms that optimizes the quality of the expanded query
  27. 27. We have explored several fitness functions to approximate a measure of the query quality </li><ul><li>Cosine between query and docs </li></ul></ul>
  28. 28. <ul>Selecting Expansion Terms </ul><ul><li>We have not found a good approximation to the measure of the query quality!
  29. 29. Measure of the query quality: Average Precision
  30. 30. Requires the user relevance judgements
  31. 31. Only known for some document collections
  32. 32. A different approach! </li></ul>
  33. 33. <ul>Selecting Expansion Terms </ul><ul><li>Design a GA with a perfect fitness function:
  34. 34. Average Precision
  35. 35. Train a term classifier with the combination of terms selected by the GA
  36. 36. The classifier can select expansion terms from docs without the user relevance judgements </li></ul>
  37. 37. <ul>System Scheme: Training </ul>[Diagram components: Query; Search; Top k retrieved docs.; Term extraction; Candidate terms; GA; User relevance judgements; Quality terms; Classification features; Classifier training]
  42. 42. <ul>System Scheme </ul>[Diagram components: New query; Search; Top k retrieved docs.; Term extraction; Candidate terms; Classification features; FINAL CLASSIFIER; Expansion terms; Reweight; Expanded query]
  46. 46. <ul>Genetic Algorithm </ul><ul><li>Individuals: binary strings
  47. 47. Each position corresponds to a candidate query term
  48. 48. A 1 means the term is in the query; a 0 means it is not
  49. 49. Initial population: randomly generated
  50. 50. Initial query terms are always included </li></ul>
  51. 51. <ul>Genetic Algorithm </ul><ul><li>Selection mechanism: roulette wheel
  52. 52. One-point crossover
  53. 53. Mutation: one bit flips
  54. 54. Elitism </li></ul>
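The two Genetic Algorithm slides above can be sketched as follows: binary individuals, roulette-wheel selection, one-point crossover, one-bit mutation, elitism, and initial query terms forced to 1. This is only an illustration under stated assumptions. The presentation uses Average Precision as fitness, which needs a search engine and relevance judgements, so a hypothetical `toy_fitness` stands in for it here. The population size, number of generations, mutation rate, and the name `run_ga` are also assumptions.

```python
import random


def run_ga(num_terms, forced, fitness, pop_size=20, generations=30,
           mutation_rate=0.1, seed=0):
    """Evolve a binary term-selection vector; `fitness` must be non-negative."""
    rng = random.Random(seed)

    def repair(individual):
        # Initial query terms are always included.
        for i in forced:
            individual[i] = 1
        return individual

    population = [repair([rng.randint(0, 1) for _ in range(num_terms)])
                  for _ in range(pop_size)]
    best = max(population, key=fitness)

    for _ in range(generations):
        # Roulette wheel: selection probability proportional to fitness.
        weights = [fitness(ind) + 1e-9 for ind in population]
        next_population = [best[:]]  # elitism: keep the best individual
        while len(next_population) < pop_size:
            parent_a, parent_b = rng.choices(population, weights=weights, k=2)
            cut = rng.randint(1, num_terms - 1)  # one-point crossover
            child = parent_a[:cut] + parent_b[cut:]
            if rng.random() < mutation_rate:
                pos = rng.randrange(num_terms)   # mutation: flip one bit
                child[pos] = 1 - child[pos]
            next_population.append(repair(child))
        population = next_population
        best = max(population, key=fitness)
    return best


def toy_fitness(individual):
    # Hypothetical stand-in for Average Precision:
    # positions 2 and 3 help the query, positions 4 and 5 hurt it.
    return (1 + individual[2] + individual[3]) / (1 + individual[4] + individual[5])


# Positions 0 and 1 are the original query terms.
best = run_ga(num_terms=6, forced=(0, 1), fitness=toy_fitness)
```

Because the best individual is copied unchanged into each new generation, the best fitness never decreases from one generation to the next.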
  55. 55. <ul>Fitness function </ul><ul><li>Precision(d) for a given document d is the fraction of relevant documents among the retrieved documents ranked at or above d (including d).
  56. 56. Average Precision for a set of relevant documents $Rel = [d_1, \ldots, d_n]$ is the mean precision of all these documents: $AP = \frac{1}{n}\sum_{i=1}^{n} \mathrm{Precision}(d_i)$ </li></ul>
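The two definitions above can be illustrated with a short sketch. The function name `average_precision` and the document identifiers are hypothetical. Counting relevant documents that were never retrieved as contributing a precision of 0 is an assumption (it is the usual convention in TREC-style evaluation), not something the slide states.

```python
def average_precision(ranking, relevant):
    """Mean of Precision(d) over the relevant documents.

    ranking:  document ids in retrieval order (best first)
    relevant: ids judged relevant by the user
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            # Precision(d): relevant documents ranked at or above d, over rank.
            precision_sum += hits / rank
    # Assumption: unretrieved relevant documents contribute precision 0.
    return precision_sum / len(relevant)


ap_late = average_precision(["d3", "d1", "d7", "d2"], {"d1", "d2"})
ap_early = average_precision(["d1", "d9", "d2"], {"d1", "d2"})
```

Here `ap_early` is larger than `ap_late` because the relevant documents appear nearer the top of the ranking, which is the behaviour the GA fitness rewards.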
  57. 57. <ul>The term classifier </ul><ul><li>WEKA software for the classifier.
  58. 58. Features for term classification: </li><ul><li>Probability of the candidate term t (within the top feedback docs and the collection) </li></ul></ul><ul><ul><li>Co-occurrence of t and one or more query terms (within the top feedback docs and the collection) and at different distances
  59. 59. Inverse document frequency of t </li></ul></ul>
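As an illustration of two of the listed features, the sketch below computes an inverse document frequency and a windowed co-occurrence count. The slides do not give exact formulas, so the plain log(N / df) variant of IDF, the token-window notion of distance, and the function names `idf` and `cooccurrence` are all assumptions.

```python
import math


def idf(term, docs):
    """Plain IDF variant: log(N / df). Returns 0.0 for unseen terms."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df) if df else 0.0


def cooccurrence(term, query_terms, docs, window):
    """Count occurrences of `term` with a query term within `window` tokens."""
    count = 0
    for doc in docs:
        for i, token in enumerate(doc):
            if token != term:
                continue
            lo, hi = max(0, i - window), i + window + 1
            if any(q in doc[lo:hi] for q in query_terms):
                count += 1
    return count


# Hypothetical tokenised documents.
docs = [
    ["query", "expansion", "technique"],
    ["modify", "the", "search", "query"],
    ["weather", "report"],
]
features = {
    "idf": idf("expansion", docs),
    "cooc_w1": cooccurrence("expansion", ["query"], docs, window=1),
    "cooc_w3": cooccurrence("modify", ["query"], docs, window=3),
}
```

Computing the co-occurrence count at several window sizes gives one feature per distance, which is one possible reading of "at different distances".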
  60. 60. <ul>Evaluation collection </ul><ul><li>Test set: TREC Disks 4 and 5
  61. 61. First 150 available queries
  62. 62. User relevance judgements are provided </li></ul>
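  63. 63. <ul>Classifier results </ul>
<table>
<tr><th>Class</th><th>TP Rate</th><th>FP Rate</th><th>Precision</th><th>Recall</th><th>F-measure</th></tr>
<tr><td>Good</td><td>0.416</td><td>0.193</td><td>0.687</td><td>0.416</td><td>0.518</td></tr>
<tr><td>Bad</td><td>0.807</td><td>0.584</td><td>0.576</td><td>0.807</td><td>0.673</td></tr>
</table>
61% of terms correctly classified. TP: true positive; FP: false positive.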
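As a quick consistency check on the table, the F-measure column can be recomputed from the Precision and Recall columns, assuming the standard F1 definition (the harmonic mean of precision and recall). The function name `f_measure` is illustrative.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    return 2 * precision * recall / (precision + recall)


f_good = f_measure(0.687, 0.416)  # "Good" row
f_bad = f_measure(0.576, 0.807)   # "Bad" row
```

Both recomputed values agree with the reported F-measures to within the rounding of the three-decimal inputs.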
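  64. 64. <ul>Improvement of relevance </ul>
<table>
<tr><th>System</th><th>MAP</th><th>Δ MAP</th></tr>
<tr><td>BM25</td><td>0.2130</td><td></td></tr>
<tr><td>GA-Oracle</td><td>0.4518</td><td>+112.1%</td></tr>
<tr><td>KLD</td><td>0.2352</td><td>+10.4%</td></tr>
<tr><td>Classifier</td><td>0.2534</td><td>+18.9%</td></tr>
</table>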
  65. 65. <ul>Conclusions </ul><ul><li>The GA with a perfect fitness function shows that there is ample room for improvement
  66. 66. The GA has shown that more than half of the candidate expansion terms have no effect on the query results or make them worse
  67. 67. The classifier approach has achieved an improvement of more than 18%. </li></ul>
  68. 68. <ul>Future work </ul><ul><li>Extend the set of features of the classifier: </li></ul><ul>Statistics of combinations of terms beyond co-occurrence (distance, etc.) </ul><ul><li>Optimize the classifier training (WEKA parameters) </li></ul>
  69. 69. <ul>Thank you! </ul><ul>Any questions? </ul>