Ismail Sengor Altingovde, "Efficiency Issues for Search Engines"



Scientific and technical seminar by RuSSIR 2012 speakers Chirag Shah and Ismail Sengor Altingovde at the Yandex office in Moscow, August 3, 2012.

Ismail Sengor Altingovde is a senior researcher at the L3S Research Center in Hannover, Germany.


Ismail Sengor Altingovde, "Efficiency Issues for Search Engines": slide transcript

  1. Efficiency Issues for Web Search Engines: How to Make a Search Engine Return Results in a Hundred Milliseconds or Less?
     Ismail Sengor Altingovde, L3S Research Center
  2. Research interests
     • Efficiency and scalability issues for Web IR
     • Social Web: sentiment analysis, social ranking
     • Domain-specific search engines & focused crawling
     • Web information extraction
     • XML querying & searching
     • Recommendation systems
     • OCR & IR
     • Web databases
  3. Research interests: Today
     • Efficiency and scalability issues for Web IR
       – Caching
         • Cost-aware result caching techniques [TWEB 2011, IPM 2012]
         • Alternative result cache organizations [ECIR 2011]
         • Cache freshness [SIGIR 2011, ECIR 2012]
         • Regionalization & caching [SPIRE 2012]
       – Static index pruning
         • Query views in pruning algorithms [TOIS 2012]
         • Correctness guarantees for pruning [in progress]
  4. Search: what really happens is...
     [Diagram: User, Broker, Indexer-1 … Indexer-N (inverted index), and Doc server-1, Doc server-2, … (documents); numbered arrows 1 to 8 trace a query and its result through these components.]
  5. Data items related to search
     • Data items
       – Posting lists (fetched)
       – Intersections of posting lists (computed)
       – Query results as doc-ids (computed)
       – Documents (fetched)
       – Query results as pages (computed)
     Cache 'em all!
  6. Search: where do caches come in?
     [Diagram: the same search architecture (User, Broker, indexers, doc servers) with a list cache, result caches, and a document cache added.]
  7. Caching for Web search
     • Cache content is chosen according to
       – Frequency
       – Recency
     • Cache types
       – Static (for longer-term access patterns)
       – Dynamic (for shorter-term access patterns)
       – Hybrid
  8. How about costs?
     • Miss costs are not "uniform"!
     • Costs are inter-related!
     • Both the caching strategies and the evaluation should consider costs!
  9. Search: caches and costs
     [Diagram: the search architecture with a result cache, annotated with the cost incurred at each stage: C_list and C_rank at the indexers, C_doc and C_snip at the doc servers.]
     Cost(q) = C_list + C_rank + C_doc + C_snip
  10. Our contribution
      • Cost-aware caching for Web search
        – Single-level (result) caches
        – Multi-level caches
      • Costs are computed on-the-fly or simulated
      • Gain of caching an item
        – Time cost to produce or fetch (C_item)
        – Storage space (S_item)
  11. Motivating scenario: Single-level
      • Result cache R
        – Capacity(R) = 1 page
      • Result pages for queries A and B
        – Freq(A) = 10, Freq(B) = 20
        – Cache result "B" for higher hit rate
      • What if:
        – Cost(A) = 100 ms, Cost(B) = 10 ms?
        – Cache "A" for higher processing efficiency
      Take costs into account while caching (and evaluating)!
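The arithmetic behind this scenario can be sketched in a few lines of Python. This is only an illustration: it assumes that the backend time saved by caching a result is Freq × Cost (every request for a cached query is a hit), and the names `expected_saving` and `queries` are hypothetical, not from the talk.

```python
def expected_saving(freq: int, cost_ms: float) -> float:
    """Backend time avoided if all `freq` requests are served from the cache."""
    return freq * cost_ms


# name -> (Freq, Cost in ms), the values from the slide
queries = {"A": (10, 100.0), "B": (20, 10.0)}

# A one-slot cache chosen by frequency alone versus by saved processing time.
by_hit_rate = max(queries, key=lambda q: queries[q][0])
by_saving = max(queries, key=lambda q: expected_saving(*queries[q]))

print("best for hit rate:", by_hit_rate)
print("best for saved time:", by_saving)
```

Under this assumption, caching B maximizes the hit count (20 hits versus 10), while caching A avoids 1000 ms of backend work versus 200 ms for B.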
  12. Motivating scenario: Multi-level
      • Assume
        – Freq(A) = 10, Freq(B) = 20
        – Cost(A) = 100 ms, Cost(B) = 10 ms
        – an additional list cache L
        – all terms in query A are cached in L
        – new Cost(A) = 10 ms (dropped from 100 ms!)
        – now it is better to cache result B
      Take cost interdependencies into account in multi-level caches.
  13. Cost update dependencies
  14. Cost-aware result caching (RC)
      • Key idea: Embed query processing cost into the result caching strategies
      • Static caching
        – Knapsack problem
          • Query results have values and sizes
          • Greedy solution: order by Value/Size
        – MostFreq strategy (baseline)
          • Value(q) = Freq(q)
          • Unit space per result
  15. Cost-aware static RC
      Static cost-aware caching strategies:
      • FreqThenCost
        – Sort first by Freq(q) and then by Cost(q)
      • StabilityThenCost
        – Stability of the frequency in succeeding time intervals
      • Freq&Cost
        – Value(q) = Cost(q) × Freq(q)^K, K > 1
        – Why Freq(q)^K?
          • Queries with very low frequencies may disappear in the future
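As a rough sketch of the Freq&Cost strategy, the Python below fills a static cache greedily by Value(q) = Cost(q) × Freq(q)^K. It assumes unit space per result (so ordering by Value/Size reduces to ordering by Value); the value K = 2.5, the sample statistics, and the names `QueryStats` and `select_static_cache` are illustrative assumptions, not taken from the slides.

```python
from typing import List, NamedTuple


class QueryStats(NamedTuple):
    query: str
    freq: int        # occurrences in the training log
    cost_ms: float   # processing cost of a cache miss


def select_static_cache(stats: List[QueryStats], capacity: int, k: float = 2.5) -> List[str]:
    """Greedy knapsack fill with unit-size items: keep the highest-valued queries."""
    ranked = sorted(stats, key=lambda s: s.cost_ms * s.freq ** k, reverse=True)
    return [s.query for s in ranked[:capacity]]


# Made-up statistics: C is expensive but was seen only once.
stats = [
    QueryStats("A", 10, 100.0),
    QueryStats("B", 20, 10.0),
    QueryStats("C", 1, 5000.0),
]

print(select_static_cache(stats, capacity=2, k=1.0))
print(select_static_cache(stats, capacity=2, k=2.5))
```

With K = 1 the one-off query C takes a cache slot purely because of its cost; raising K above 1 pushes it out in favor of queries that are likely to recur, which is the motivation given on the slide.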
  16. Cost-aware dynamic RC
      • Baselines: LRU, LFU
      • LCU: Least costly cached item is evicted
      • LFCU_K: Least Frequently and Costly Used
        – Cost(q) × Freq(q)^K, K > 1
      • GDS: Greedy Dual Size [Cao & Irani, 1997]
        – H = Cost(q) + L, where L is the age, set to the H value of the evicted item
      • GDSF_K: Greedy Dual Size Frequency [Arlitt et al., 2000]
        – Cost(q) × Freq(q)^K + L
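The GDSF_K eviction rule can be sketched as a small Python class. This is a simplified illustration under stated assumptions: every result occupies unit space, the victim is found by a linear scan (a priority queue would be the usual choice in practice), and the class name `GDSFKCache` and its methods are hypothetical rather than the authors' implementation.

```python
class GDSFKCache:
    """Dynamic result cache with priority H = Cost(q) * Freq(q)**K + L."""

    def __init__(self, capacity: int, k: float = 2.5):
        self.capacity = capacity  # number of unit-size results; assumed >= 1
        self.k = k
        self.age = 0.0            # L: H value of the most recently evicted item
        self.entries = {}         # query -> {"result", "freq", "cost", "h"}

    def _priority(self, freq: int, cost: float) -> float:
        return cost * freq ** self.k + self.age

    def get(self, query):
        entry = self.entries.get(query)
        if entry is None:
            return None           # miss: the caller computes the result and calls put()
        entry["freq"] += 1
        entry["h"] = self._priority(entry["freq"], entry["cost"])
        return entry["result"]

    def put(self, query, result, cost: float):
        freq = self.entries[query]["freq"] if query in self.entries else 1
        if query not in self.entries and len(self.entries) >= self.capacity:
            victim = min(self.entries, key=lambda q: self.entries[q]["h"])
            self.age = self.entries[victim]["h"]   # age the cache
            del self.entries[victim]
        self.entries[query] = {
            "result": result, "freq": freq, "cost": cost,
            "h": self._priority(freq, cost),
        }
```

Because L only grows, entries inserted or hit recently start from a higher baseline than old ones, so an expensive result that is never requested again is eventually evicted.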
  17. Performance: Static RC
      Gains up to 3%!
  18. Performance: Dynamic RC
      Gains up to 6%!
  19. Today’s talk
      • Cost-aware caching strategies
      • Cache invalidation
  20. Motivation
      • Higher cache capacity improves hit rate
      • But results become stale [Cambazoglu et al., WWW 2010]
  21. Solutions from the literature
      • Decoupled: Time-to-live (TTL)
        – Refresh stale results when the backend is idle [Cambazoglu et al., WWW 2010]
      [Diagram: a result cache whose entries (q1, R1), (q2, R2), …, (qk, Rk) each carry a TTL value TTL(q1), TTL(q2), …, TTL(qk).]
  22. Solutions from the literature
      • Coupled: Cache invalidation policy (CIP) [Blanco et al., SIGIR 2010]
        – Incremental index update
        – Content changes sent to CIP module to invalidate queries (offline)
      [Diagram: the CIP module receives all queries in the cache(s) and all changes in the backend index.]
  23. Our contribution
      • Devise a new invalidation mechanism
        – better than TTL and close to CIP in detecting stale results
        – better than CIP and close to TTL in efficiency and practicality
  24. Timestamp-based Invalidation
      • The value of the timestamp (TS) on an item shows the last time the item was updated
      • The timestamp-based invalidation framework (TIF) has two components:
        – Offline (indexing time): Decide on term and document timestamps
        – Online (query time): Decide on the staleness of the query result
  25. TIF Architecture
      [Diagram: the result cache stores entries <q1, R1, TS(q1)> … <qk, Rk, TS(qk)> and sends <qi, Ri, TS(qi)> to the search node. The search node keeps document timestamps TS(d1) … TS(dD) and term timestamps TS(t1) … TS(tT); its invalidation logic returns a 0/1 decision, and results are returned for misses and stale entries. A document parser processes the documents assigned to the node and issues index updates, document TS updates, and term TS updates.]
  26. TS Update Policies: Documents
      • For a newly added document d
        – TS(d) = now()
      • For a deleted document d
        – TS(d) = ∞
      • For an updated document d
        – if diff(d_new, d_old) > L, then TS(d) = now()
        – diff(d_i, d_j) = |length(d_i) − length(d_j)|
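The three document rules can be sketched in Python as follows. Several details are assumptions made only for illustration: document length is measured in whitespace-separated tokens, each update is compared against the immediately preceding version, timestamps are plain numbers passed in by the caller, and the class name `DocumentTimestamps` is hypothetical.

```python
import math


class DocumentTimestamps:
    """Tracks TS(d) for the documents assigned to one search node."""

    def __init__(self, length_threshold: int):
        self.length_threshold = length_threshold  # L in the slide
        self.ts = {}       # doc_id -> TS(d)
        self.length = {}   # doc_id -> length of the current version

    def add(self, doc_id, text: str, now: float) -> None:
        self.ts[doc_id] = now
        self.length[doc_id] = len(text.split())

    def delete(self, doc_id) -> None:
        # An infinite timestamp is newer than any cached result.
        self.ts[doc_id] = math.inf
        self.length.pop(doc_id, None)

    def update(self, doc_id, new_text: str, now: float) -> None:
        new_len = len(new_text.split())
        # diff(d_new, d_old) = |length(d_new) - length(d_old)|
        if abs(new_len - self.length[doc_id]) > self.length_threshold:
            self.ts[doc_id] = now
        self.length[doc_id] = new_len
```

Small edits leave TS(d) untouched, so cached results that contain the document are not invalidated by minor revisions.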
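  27. TS Update Policies: Terms
      • Frequency-based update
        – If the number of added postings > F × PLL_TS, then TS(t) = now()
      [Figure: a term t starts with TS(t) = T0 and PLL_TS = 5; once the number of added postings exceeds F × PLL_TS, TS(t) = now() and PLL_TS = 6.]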
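A possible reading of the frequency-based rule, sketched in Python. It assumes that PLL_TS is the posting list length recorded when TS(t) was last set, and that the count of added postings restarts after each timestamp update; the slide does not spell out either point. The class name `TermTimestamp` and the sample value of F are hypothetical.

```python
class TermTimestamp:
    """Frequency-based timestamp for a single term t."""

    def __init__(self, posting_list_len: int, now: float, fraction: float):
        self.ts = now                    # TS(t)
        self.pll_ts = posting_list_len   # PLL_TS: list length when TS(t) was set (assumed)
        self.fraction = fraction         # F
        self.added = 0                   # postings added since TS(t) was set

    def add_posting(self, now: float) -> bool:
        """Register one new posting; return True if TS(t) was refreshed."""
        self.added += 1
        if self.added > self.fraction * self.pll_ts:
            self.ts = now
            self.pll_ts += self.added
            self.added = 0
            return True
        return False


term = TermTimestamp(posting_list_len=5, now=0, fraction=0.1)
refreshed = term.add_posting(now=7)
print(refreshed, term.ts, term.pll_ts)
```

With F = 0.1 a single added posting already exceeds F × PLL_TS = 0.5, which matches the slide's example of PLL_TS moving from 5 to 6; a larger F makes the term timestamp less sensitive to index growth.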
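  28. TS Update Policies: Terms
      • Score-based update
        – If the score of an added posting > STS, then TS(t) = now() and STS is recomputed after re-sorting
      [Figure: the postings p1 … p5 of term t are sorted w.r.t. the scoring function as p4, p3, p2, p5, p1, with TS(t) = T0 and STS = Score(p3); when a posting p6 is added and its score > STS, TS(t) = now() and STS is recomputed after re-sorting.]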
  29. Result Invalidation Policy
      • A search node deems a result stale if:
        – C1: ∃d ∈ R such that TS(d) > TS(q) (d was deleted or revised after the query result was generated), or
        – C2: ∀t ∈ q, TS(t) > TS(q) (all query terms appeared in new documents after the query result was generated)
      • Also apply TTL to avoid accumulation of stale results
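Conditions C1 and C2, plus the TTL fallback, can be sketched as a single Python predicate. The function name `is_stale`, its parameters, and the treatment of unknown documents and terms (as never updated) are assumptions made for this illustration.

```python
def is_stale(query_terms, result_docs, ts_query, doc_ts, term_ts, now=None, ttl=None):
    """Return True if a cached result should be treated as stale.

    query_terms -- terms of the query q
    result_docs -- doc ids in the cached result R
    ts_query    -- TS(q), the time the cached result was generated
    doc_ts      -- mapping doc id -> TS(d)
    term_ts     -- mapping term -> TS(t)
    now, ttl    -- optional; when both are given, a TTL check is applied as well
    """
    never = float("-inf")  # unknown items are treated as never updated (assumption)

    # C1: some document in the result was deleted or revised after TS(q).
    c1 = any(doc_ts.get(d, never) > ts_query for d in result_docs)

    # C2: every query term received new postings after TS(q).
    c2 = bool(query_terms) and all(term_ts.get(t, never) > ts_query for t in query_terms)

    # TTL fallback so that undetected stale results do not accumulate.
    expired = ttl is not None and now is not None and (now - ts_query) > ttl

    return c1 or c2 or expired
```

C2 requires all terms to be newer than the result because, for a conjunctive query, a new document can enter the result only if it contains every query term.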
  30. Simulation setup
      • Data: English Wikipedia dump
        – Snapshot at Jan 1, 2006 ≈ 1 million pages
        – All additions, deletions, and updates for the following 30 days
      • Queries: 10,000 from the AOL log
  31. Simulation setup
      • Evaluation metrics [Blanco et al., SIGIR 2010]
        – A query result counts as updated if the two top-10 lists are not exactly the same
        – False Positive Ratio = (Redundant query executions) / (Number of unique queries)
        – Stale Traffic Ratio = (Stale results returned) / (Number of query occurrences)
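The two metrics and the top-10 comparison are simple enough to state directly in Python. The function names are hypothetical and the numbers in the example are made up for illustration; they are not results from the talk.

```python
def result_changed(old_top10, new_top10) -> bool:
    """A result counts as updated if the two top-10 lists differ in content or order."""
    return list(old_top10) != list(new_top10)


def false_positive_ratio(redundant_executions: int, unique_queries: int) -> float:
    """Share of unique queries re-executed although their result had not changed."""
    return redundant_executions / unique_queries


def stale_traffic_ratio(stale_results_returned: int, query_occurrences: int) -> float:
    """Share of all query occurrences that were answered with a stale result."""
    return stale_results_returned / query_occurrences


# Made-up counts, only to show the calculation.
print(false_positive_ratio(50, 1000))
print(stale_traffic_ratio(30, 10000))
print(result_changed([1, 2, 3], [1, 3, 2]))
```

The two ratios pull in opposite directions: invalidating aggressively lowers the stale traffic ratio but raises the false positive ratio, so an invalidation scheme is judged on both together.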
  32. Performance: all queries
      [Plots: two panels, "Frequency-based term TS update" and "Score-based term TS update".]
  33. Discussion
                                TIF                                       CIP
      Data transfer             Send <q, R, TS(q)> to the search nodes    Send all <q, R> to CIP; send all docs to CIP
      Invalidation operations   Compare TS values                         Traverse the query index for every document
  34. Conclusion & Future work
      • Data on the Web is growing continuously
        – Search efficiency is crucial!
        – We present strategies for improving the performance of Web search engines
          • Cost-aware strategies improve efficiency
          • Practical invalidation methods with good accuracy
      • Upcoming work on efficiency:
        – New strategies for cache freshness & index pruning!
  35. Thank you! Questions?
  36. References: Our work
      • [SIGIR 2011] Sadiye Alici, Ismail Sengör Altingövde, Rifat Ozcan, Berkant Barla Cambazoglu, Özgür Ulusoy: Timestamp-based result cache invalidation for web search engines. SIGIR 2011: 973-982
      • [ECIR 2012] Sadiye Alici, Ismail Sengör Altingövde, Rifat Ozcan, Berkant Barla Cambazoglu, Özgür Ulusoy: Adaptive Time-to-Live Strategies for Query Result Caching in Web Search Engines. ECIR 2012: 401-412
      • [TOIS 2012] Ismail Sengör Altingövde, Rifat Ozcan, Özgür Ulusoy: Static index pruning in web search engines: Combining term and document popularities with query views. ACM Trans. Inf. Syst. 30(1): 2 (2012)
      • [TWEB 2011] Rifat Ozcan, Ismail Sengör Altingövde, Özgür Ulusoy: Cost-Aware Strategies for Query Result Caching in Web Search Engines. TWEB 5(2): 9 (2011)
      • [SPIRE 2012] B. Barla Cambazoglu, Ismail Sengör Altingövde: Impact of Regionalization on Performance of Web Search Engine Result Caches. (to appear)
      • [IPM 2012] R. Ozcan, I. S. Altingovde, B. B. Cambazoglu, F. P. Junqueira, Ö. Ulusoy: Five-level Static Cache Architecture for Web Search Engines. IPM (to appear)
      • [ECIR 2011] Ismail Sengör Altingövde, Rifat Ozcan, Berkant Barla Cambazoglu, Özgür Ulusoy: Second Chance: A Hybrid Approach for Dynamic Result Caching in Search Engines. ECIR 2011: 510-516
  37. Other References
      • [Blanco et al., SIGIR 2010] Roi Blanco, Edward Bortnikov, Flavio Junqueira, Ronny Lempel, Luca Telloli, Hugo Zaragoza: Caching search engine results over incremental indices. SIGIR 2010: 82-89
      • [Cambazoglu et al., WWW 2010] Berkant Barla Cambazoglu, Flavio Paiva Junqueira, Vassilis Plachouras, Scott A. Banachowski, Baoqiu Cui, Swee Lim, Bill Bridge: A refreshing perspective of search engine caching. WWW 2010: 181-190