Efficient Diversity-Aware Search

Yet another class presentation.

Published in: Technology, Career

  1. Efficient Diversity-Aware Search
     Dacong (Tony) Yan, May 4, 2011
  6. Background & Motivation
     What is search?
       1. A user U initiates a query Q.
       2. A list of documents D, sorted by relevance R w.r.t. Q, is returned.
     User satisfaction sat(U, Q)
       • It's all about relevance between D and Q!
       • User U has their own perspective on relevance, $R_U$.
       • Roughly speaking, $sat(U, Q) \propto \frac{1}{diff(R_U, R)}$.
     Problem: $R_U$ is difficult to capture, and usually ignored!
     Symptoms of ignoring $R_U$
       • Redundant documents are included in the result set.
       • The most relevant documents in terms of $R_U$ are excluded from the result set.
     Solution: diversity-aware search!
     CSE 788, Dacong (Tony) Yan, Efficient Diversity-Aware Search
  7. Agenda
       • Background & Motivation
       • Diversity-Aware Search
       • DivGen Approach
       • Evaluation
       • Conclusion
  9. Diversity-Aware Search
     Intuitively: relevance + dissimilarity.
     Formally, a content-based diversification perspective:
       • Data Model
       • User Behavior Model
       • Answer Quality
  11. Data Model
      Vector Space Model: documents as weighted sets of features.
      Each document d is represented as a vector $d = (d^1, d^2, \ldots)$, where $d^i \ge 0$ is the weight of feature i in document d.
      Examples
        • Textual documents: features can be keywords, weighted in a tf.idf manner.
        • Graph "documents": features can be paths in the corpus graph.
        • Recommender-system scenario: features can be the set of users who recommend a document.
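A minimal sketch of this data model, assuming documents are stored as sparse Python dicts and that sim is cosine similarity. The slides do not name a similarity function; cosine is an assumption chosen here because it stays in [0, 1] for non-negative weights. The feature names and weights are made up for illustration.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity of two sparse, non-negative feature vectors (dicts)."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical documents: feature -> weight (for text, e.g. tf.idf scores).
doc_a = {"search": 0.8, "diversity": 0.6}
doc_b = {"search": 0.6, "ranking": 0.8}
query = {"search": 1.0}

print(cosine_sim(doc_a, query))  # relevance of doc_a to the query
print(cosine_sim(doc_a, doc_b))  # content overlap between the two documents
```

The same function serves for both sim(d, q) and sim(d, d_i), since queries and documents live in the same feature space under this model.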
  17. User Behavior Model
      Assumption: the user examines the results in their order of presentation.
      Usefulness of a document d: the probability that d is useful.
        • Relevance: the probability that d is relevant.
        • Novelty: the probability that d's content is not redundant.
      For a document d preceded by $d_1, d_2, \ldots, d_m$ in the results for a query q, usefulness is defined as:
        $use(d \mid \{d_1, \ldots, d_m\}, q) = rel(d \mid q) \cdot (1 - red(d \mid \{d_1, \ldots, d_m\}, q))$
        $\Downarrow$
        $use(d \mid \{d_1, \ldots, d_m\}, q) = sim(d, q) \cdot \prod_{i=1}^{m} (1 - red(d \mid d_i, q))$
      $red(d \mid d_i, q)$ can be decomposed further:
        • $sim(d, d_i)$: the probability that the content of d is similar to, or contained in, that of $d_i$;
        • $f_q$: the estimated probability that, given a query q, a document whose content is similar to, or contained in, that of a previously emitted document is redundant.
      $red(d \mid d_i, q) = sim(d, d_i) \cdot f_q$
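The usefulness formula can be written out directly. This is a small sketch, assuming the similarities are already available as numbers in [0, 1]; the function and argument names are illustrative and do not come from the slides.

```python
def redundancy(sim_to_previous, f_q):
    """red(d | d_i, q) = sim(d, d_i) * f_q."""
    return sim_to_previous * f_q


def usefulness(relevance, sims_to_previous, f_q):
    """use(d | {d_1..d_m}, q) = sim(d, q) * prod_i (1 - red(d | d_i, q)).

    relevance        -- sim(d, q)
    sims_to_previous -- [sim(d, d_1), ..., sim(d, d_m)]
    f_q              -- focus parameter for this query
    """
    novelty = 1.0
    for s in sims_to_previous:
        novelty *= 1.0 - redundancy(s, f_q)
    return relevance * novelty


# A document with relevance 0.9 that overlaps two earlier results.
print(usefulness(0.9, [0.5, 0.2], f_q=0.5))
```

With no preceding documents the product is empty, so usefulness reduces to plain relevance.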
  19. User Behavior Model (Cont.)
      Focus parameter $f_q$
        • $f_q$ is the main tunable parameter in $red(d \mid d_i, q) = sim(d, d_i) \cdot f_q$.
        • It is defined on a per-query basis, and denotes the amount of desired diversification.
        • Smaller $f_q$ favors relevance over diversity.
        • Larger $f_q$ favors diversity over relevance.
      Probabilistic interpretation: "How likely is a relevant document to be useful to the user, given that they have already examined a document with similar content?"
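To make the effect of $f_q$ concrete, here is a small worked example with made-up numbers (none of them come from the slides). Suppose one document has already been shown, a candidate d has $sim(d, q) = 0.9$ and similarity 0.8 to the shown document, and a competing candidate e has $sim(e, q) = 0.5$ and no overlap with it.

```latex
\begin{align*}
use(d) &= 0.9 \cdot (1 - 0.8 \cdot f_q), \qquad use(e) = 0.5 \cdot (1 - 0 \cdot f_q) = 0.5 \\
f_q = 0:   &\quad use(d) = 0.9 > 0.5 \\
f_q = 0.5: &\quad use(d) = 0.9 \cdot 0.6 = 0.54 > 0.5 \\
f_q = 1:   &\quad use(d) = 0.9 \cdot 0.2 = 0.18 < 0.5
\end{align*}
```

With these numbers the near-duplicate d keeps its lead until $f_q$ reaches roughly 0.56; beyond that point the novel document e is ranked first.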
  22. Answer Quality
        • Quantification properties
        • Tractable instantiation
        • An optimal answer under strict order dominance semantics can be found by greedily identifying the best result at positions 1, 2, ..., k.
  23. The DivGen Approach
  25. A First Stab at DAS (Diversity-Aware Search)
      Steps:
        1. Compute the relevance of each document to the query;
        2. Identify the highest-scoring document d, and update the usefulness of all other documents based on their similarity to d;
        3. Repeat step 2 until k documents have been selected.
      Problems:
        • It requires access to the entire corpus.
        • It is too inefficient even for a moderately large set of documents.
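The three steps can be sketched as a short greedy loop. This is an illustrative sketch, assuming relevance scores and pairwise similarities are already computed; the names `greedy_das`, `rel`, and `pair_sim` and all the numbers are hypothetical.

```python
def greedy_das(rel, sim, k, f_q):
    """Naive diversity-aware top-k.

    rel -- dict: doc id -> relevance to the query (step 1, precomputed)
    sim -- function (doc id, doc id) -> similarity in [0, 1]
    """
    use = dict(rel)  # current usefulness of every unselected document
    ranking = []
    while use and len(ranking) < k:
        best = max(use, key=use.get)  # step 2: highest-scoring document
        ranking.append(best)
        del use[best]
        for d in use:  # discount everything similar to the document just picked
            use[d] *= 1.0 - f_q * sim(d, best)
    return ranking


# Toy corpus: "a" and "b" are near-duplicates, "c" is less relevant but different.
rel = {"a": 0.98, "b": 0.97, "c": 0.60}
pair_sim = {
    frozenset(("a", "b")): 0.99,
    frozenset(("a", "c")): 0.59,
    frozenset(("b", "c")): 0.58,
}


def sim(x, y):
    return pair_sim[frozenset((x, y))]


print(greedy_das(rel, sim, k=2, f_q=0.0))  # ['a', 'b']: pure relevance ranking
print(greedy_das(rel, sim, k=2, f_q=1.0))  # ['a', 'c']: the near-duplicate is pushed down
```

Every iteration touches every remaining document, which is the inefficiency the slide points out: the loop needs relevance and similarity information for the whole corpus.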
  29. A Threshold Algorithm for DAS
      Generate-Filter idea:
        • Incrementally compute documents in descending order of relevance;
        • Maintain upper and lower bounds on the relevance of every encountered document;
        • Rerank the documents with diversity taken into account.
      Data access primitives
        • Sequential Access (SA): retrieve the id of the document with the next-highest weight for a specified feature i.
        • Random Access (RA): retrieve the exact weight of feature i in document d.
      Drawbacks
        • The relevance of each document must be fully computed, and its entire content retrieved;
        • I/O effort is wasted, and much of this I/O is not sequential in nature;
        • Hardly any early pruning is possible.
      DivGen: making Generate aware of diversity!
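For orientation, here is a sketch of a classic threshold-style top-k over sorted per-feature lists, using only SA and RA. It covers the relevance-only Generate step, not the diversity-aware reranking, and it is not taken from the slides: scoring by dot product, the in-memory `index`, and all names are assumptions.

```python
import heapq


def build_index(docs):
    """feature -> list of (weight, doc_id), sorted by descending weight."""
    index = {}
    for doc_id, vec in docs.items():
        for f, w in vec.items():
            index.setdefault(f, []).append((w, doc_id))
    for postings in index.values():
        postings.sort(reverse=True)
    return index


def threshold_topk(index, docs, query, k):
    """Top-k documents by dot product with the query."""
    features = [f for f in query if f in index]
    positions = {f: 0 for f in features}
    scores = {}  # doc_id -> exact score, filled in by random access
    while True:
        progressed = False
        for f in features:
            pos = positions[f]
            if pos < len(index[f]):
                _, doc_id = index[f][pos]  # SA: next-highest weight for f
                positions[f] = pos + 1
                progressed = True
                if doc_id not in scores:  # RA: exact weights of this document
                    scores[doc_id] = sum(
                        qw * docs[doc_id].get(g, 0.0) for g, qw in query.items()
                    )
        # Upper bound on the score of any document not yet seen.
        threshold = 0.0
        for f in features:
            pos = positions[f]
            if 0 < pos < len(index[f]):
                threshold += query[f] * index[f][pos - 1][0]
        top = heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
        if not progressed or (len(top) == k and top[-1][1] >= threshold):
            return top


docs = {
    "d1": {"x": 0.9, "y": 0.1},
    "d2": {"x": 0.8, "y": 0.7},
    "d3": {"x": 0.1, "y": 0.9},
    "d4": {"x": 0.05, "y": 0.05},
}
query = {"x": 1.0, "y": 1.0}
index = build_index(docs)
print(threshold_topk(index, docs, query, k=1))
```

The loop can stop before reading every posting once the k-th best exact score reaches the threshold. A filter step that reranks for diversity afterwards cannot feed anything back into this stopping rule, which is the limitation the slide's last line addresses.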
  32. The DivGen Algorithm
      Idea: maintain a set of candidate documents with bounds on their usefulness.
      Novel data access primitives
        • Bound Access (BA): retrieve the features with the highest weights in d, as well as an upper bound w on the weight of any other feature of d.
        • Batch Sequential Access (BSA): retrieve the documents with the highest weights for non-query feature i, as well as an upper bound w on the weight of i in any other document.
        • Document Random Access (DocRA): retrieve all the features with nonzero weight in d, along with their exact weights.
      Advantages of BA, BSA, and DocRA
        • Existing indexing techniques can easily be leveraged to support these primitives.
        • These primitives enable a set of early prunings that make the algorithm more efficient.
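The three primitives can be mimicked over in-memory structures to show what each one returns. This is only an interface sketch under assumed signatures; the slides do not specify parameters such as the batch size `n`. The helper `dot_upper_bound` illustrates one way a BA result could bound a similarity without fetching the whole document. It is not presented as DivGen's actual pruning rule.

```python
def bound_access(doc, n):
    """BA: the n highest-weighted features of doc, plus an upper bound w
    on the weight of any feature of doc that was not returned."""
    ranked = sorted(doc.items(), key=lambda fw: fw[1], reverse=True)
    top = dict(ranked[:n])
    w = ranked[n][1] if len(ranked) > n else 0.0
    return top, w


def batch_sequential_access(postings, start, n):
    """BSA: the next n (weight, doc_id) postings of one feature, plus an
    upper bound w on that feature's weight in any document not yet returned."""
    batch = postings[start:start + n]
    rest = postings[start + n:]
    w = rest[0][0] if rest else 0.0
    return batch, w


def doc_random_access(docs, doc_id):
    """DocRA: all features with nonzero weight in the document."""
    return {f: w for f, w in docs[doc_id].items() if w > 0.0}


def dot_upper_bound(top, w, other):
    """Upper bound on dot(d, other) when only BA(d) = (top, w) is known."""
    known = sum(wt * other.get(f, 0.0) for f, wt in top.items())
    unknown = w * sum(wt for f, wt in other.items() if f not in top)
    return known + unknown


# Hypothetical data.
d = {"a": 0.9, "b": 0.5, "c": 0.1}
e = {"a": 0.2, "c": 0.7, "z": 0.4}
top, w = bound_access(d, 1)
exact = sum(wt * e.get(f, 0.0) for f, wt in d.items())
print(top, w)
print(dot_upper_bound(top, w, e), ">=", exact)
```

If such an upper bound already shows that a candidate cannot beat the current best, the candidate can be discarded without a DocRA call.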
  33. Algorithm Pseudo-code
  34. Revisit Data Access Primitives
  36. An Execution Example
  39. Evaluation
      Experimental setup
        • Java 6, Oracle BerkeleyDB Java Edition v3.3.74
        • Ubuntu Linux 8.04, Intel Core2 X6800 2.93GHz CPU, 1GB memory
        • ext3 filesystem with a page size of 4KB
      Datasets
        • Real data: taken from Grapevine, a tool for distilling knowledge from social media.
        • Synthetic data: Zipfian distribution across documents, and normal distribution within each document.
      How to synthesize?
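One possible reading of the synthetic setup, offered as a guess rather than the procedure used in the experiments: feature popularity across documents follows a Zipf-like law, and the weights inside each document are drawn from a normal distribution clipped at zero. Every parameter below (`zipf_s`, `mu`, `sigma`, the corpus sizes) is an assumption.

```python
import random


def synth_corpus(num_docs, num_features, feats_per_doc,
                 zipf_s=1.0, mu=0.5, sigma=0.2, seed=0):
    """Generate a list of sparse documents (feature id -> weight)."""
    rng = random.Random(seed)
    features = list(range(num_features))
    # Zipf-like popularity: the feature at rank r is drawn with weight 1 / r^s.
    popularity = [1.0 / (rank + 1) ** zipf_s for rank in features]
    corpus = []
    for _ in range(num_docs):
        # Sampling with replacement, so a document may end up with fewer features.
        chosen = set(rng.choices(features, weights=popularity, k=feats_per_doc))
        # Normally distributed weights, clipped so that every weight is >= 0.
        corpus.append({f: max(0.0, rng.gauss(mu, sigma)) for f in chosen})
    return corpus


corpus = synth_corpus(num_docs=1000, num_features=50, feats_per_doc=5)
doc_freq = [sum(1 for doc in corpus if f in doc) for f in range(50)]
print(doc_freq[:5], doc_freq[-5:])  # popular features appear in far more documents
```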
  40. Evaluation (Cont. I)
  41. Evaluation (Cont. II)
  42. Conclusion
      This paper:
        • formally studied the diversity-aware search (DAS) problem;
        • proposed a set of novel data access primitives to solve DAS efficiently;
        • performed experimental studies demonstrating the usefulness of DivGen.
  43. Thank you!
