How to build the next 1000 search engines?!

Opening keynote at the ECIR 2012 Industry Day: How to build the next 1000 search engines?! Includes preview of new minimal indexing approach.

Speaker notes:
  • Does “Entity-based ranking” make sense?
  • NOTE: MATERIALIZED VIEWs, where supported (not in MonetDB), can be used instead of TABLEs when stored relations (index) are expected to get updates.
    1. How to build the next 1000 search engines?! Arjen P. de Vries, arjen@acm.org. Centrum Wiskunde & Informatica, Delft University of Technology, Spinque B.V.
    2. Search is everywhere
    3. Search is everywhere. Yet it only works well on the web…
    4. Complications. Heterogeneous data sources: WWW, Wikipedia, news, e-mail, patents, Twitter, personal information, … Varying result types: "documents", tweets, courses, people, experts, gene expressions, temperatures, … Multiple dimensions of relevance: topicality, recency, reading level, …
    5. Complications. Many search tasks require a mix within these dimensions: news and patents; companies and their CEOs; recent and on topic. Many search tasks also require a mix across these dimensions: patents assigned to our top 3 competitors in market segments mentioned in the recent press releases issued by our top 10 clients.
    6. System's internal information representation: linguistic annotations (named entities, sentiment, dependencies, …); knowledge resources (Wikipedia, Freebase, ICD9, IPTC, …); links to related documents (citations, URLs); anchors that describe the URI (anchor text); queries that lead to clicks on the URI (session, user, dwell-time, …); tweets that mention the URI (time, location, user, …); other social media that describe the URI (user, rating; tags, organisation of 'folksonomy'). + UNCERTAINTY ALL OVER!
    7. What goes in the black box? [Diagram: a document collection with anchors, entity types, sentiment, tweets and cited documents feeds ranking models (BM25, BM25F, LM, RM, VSM, DFR, QIR?, learning to rank?), together with the user and context, producing a ranked list of answers. ECIR / CIKM / SIGIR / ICTIR / WSDM papers!]
    8. Rarely & scarcely addressed… Student: "How do I build it?" Professor: "Who will build it for me?" Last session of the conference…
    9. Search System
    10. A Parameterised Search System (Cornacchia & De Vries, ECIR 2007)
    11. A Parameterised Search System: can't we 'remove' this IR engineer (or scientist!) from the loop, like DBMS software removes the data engineer from the loop? (Cornacchia & De Vries, ECIR 2007.) And three (four?) children, a startup and 5 years later: a PhD defense!
    12. Search by Strategy: visually construct search strategies by connecting building blocks
    13. Search by Strategy: visually construct search strategies by connecting building blocks. Each block describes either data or actions upon that data. Connection points ("pins") are typed: doc / sec / term / ne (named entity) / tuple. Actions are expressed as scripts (later more).
    14. Strategy Builder
    15. From Patent to Inventor
    16. Reports / Visits
    17. Generate Search Engine! Or, really, generate a REST API from the strategy specification!
    18. Demo (Showed demo of children's search engine)
    19. How Strategies Help. Strategies improve communication between search intermediary and user: they encapsulate domain expert knowledge, give an abstract representation of search expert knowledge, and allow analysis of the information seeking process at any stage. Strategies facilitate knowledge management: store / share / publish / refine. Strategies mix exact (DB) and ranked (IR) searches: they avoid the need for "human (probabilistic) joins".
    20. Search Intermediaries (ordered along a task-complexity axis): travel agency, real estate agents, recruiters, librarians, archivists, digital forensics detectives, patent information specialists
    21. Exploratory Search. Search & (faceted) browsing: help discover schema, ontology, etc.; help discover the relevant sources, within a collection (by year/location, by type, …) and across multiple collections (by source).
    22. Probabilistic faceted browsing. Both sides show the same facets: Price (100K-200K, 200K-300K, 300K-400K), Rooms (3, 4, 5), Size (100-150 m2, 150-200 m2, 200-250 m2). Traditional (boolean filters): good when the user knows exactly which filters to apply; will see perfect-match results, but won't see "interesting" results. Probabilistic: good for exploratory search; will see perfect-match results and also "interesting" results.
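The probabilistic column can be illustrated with a small sketch (toy data and function names invented for illustration, not Spinque's code): each result carries a relevance probability, and a facet bucket is scored by the probability that at least one relevant result falls inside it.

```python
from math import prod

# Hypothetical result set: (price, relevance probability)
results = [(120_000, 0.9), (180_000, 0.4), (250_000, 0.7), (310_000, 0.1)]

def boolean_bucket(results, lo, hi):
    """Traditional filter: exact membership, relevance probabilities ignored."""
    return [r for r in results if lo <= r[0] < hi]

def probabilistic_bucket_score(results, lo, hi):
    """P(at least one relevant result in bucket) = 1 - prod(1 - p),
    assuming independence of the per-result relevance events."""
    probs = [p for price, p in results if lo <= price < hi]
    return 1 - prod(1 - p for p in probs)

print(len(boolean_bucket(results, 100_000, 200_000)))                   # 2
print(round(probabilistic_bucket_score(results, 100_000, 200_000), 2))  # 0.94
```

A bucket with one near-certain hit thus outranks a bucket with many marginal ones, which is what makes the facet useful for exploratory search rather than pure filtering.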
    23. Dynamic facets, same example facets as above. Pre-indexed: pre-defined ad-hoc indices intersected with the result set; challenge: many indices to maintain. Dynamic: facets decided from the result set; challenges: dynamically adapt the granularity (different price ranges for villa/garage!) and heavy concurrent queries to the DB.
    24. Demo (Showed Spinque's real-estate search demo)
    25. Limitations of Search & Browse. Faceted exploration does not include joins: you cannot construct new data sources from existing ones! Only the pre-defined paths through the information space can actually be traversed.
    26. Who needs a Join? You!!! … whenever 'relevance cues' are typed: people (e.g., inventors), companies (e.g., assignees), categories (e.g., IPTC), time (e.g., expiry date), location (e.g., country) … or whenever multiple sources are to be combined, e.g., patents & news, patents & Wikipedia, …
    27. Patents on X by Y(y)
    28. Interactive Information Access. Feedback: interaction improves the information representation. Faceted browsing: interaction can let the user take over where the machine would fail. Search by Strategy: interaction can let the user take over where the system designer would fail.
    29. Conclusion: "No idealized one-shot search engine". Empower the user!
    30. Under the Hood
    31. From Strategies to DB Queries. A strategy is a data flow of building blocks, e.g. BB1(in1, in2, in3, u1, u2) → out and BB2(in1) → out. A query is the strategy made operational: Spinque translates the strategy into PRA, and PRA into SQL (CREATE VIEW a AS SELECT …; CREATE VIEW b AS SELECT …; CREATE VIEW c AS SELECT …) on a relational DBMS (MonetDB).
    32. Probabilistic Relational Algebra. PRA: probabilistic relational algebra (Fuhr and Roelleke, TOIS 2001). Strategy: x = Project DISTINCT [$1,$3](y); translated into SQL with explicit probabilities: CREATE VIEW x AS SELECT a1, a3, 1-prod(1-prob) AS prob FROM y GROUP BY a1, a3; executed on the relational DB.
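A minimal Python rendition of this probabilistic DISTINCT projection, assuming tuple independence as in Fuhr and Roelleke's model; the relation `y` and the helper name are invented for illustration:

```python
from collections import defaultdict
from math import prod

def project_distinct(tuples, positions):
    """PRA Project DISTINCT: tuples are (values..., prob); positions are
    0-based attribute indexes. Duplicates collapse with
    prob = 1 - prod(1 - p_i), the probability that at least one
    supporting tuple holds (independence assumed)."""
    grouped = defaultdict(list)
    for *values, p in tuples:
        key = tuple(values[i] for i in positions)
        grouped[key].append(p)
    return {key: 1 - prod(1 - p for p in probs)
            for key, probs in grouped.items()}

# Toy relation y(doc, term, entity, prob); [$1,$3] on the slide is 1-based,
# so it corresponds to positions [0, 2] here.
y = [("d1", "t1", "e1", 0.5), ("d1", "t2", "e1", 0.5), ("d2", "t1", "e2", 0.8)]
print({k: round(v, 2) for k, v in project_distinct(y, [0, 2]).items()})
# {('d1', 'e1'): 0.75, ('d2', 'e2'): 0.8}
```

The `1-prod(1-prob) … GROUP BY` SQL on the slide is exactly this aggregation pushed into the DBMS.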
    33. What's in the DB? Text-based ranking: term-doc-freq relations T(t, d, f), i.e. an inverted file, e.g. (t0, d3, 3), (t0, d5, 10), (t1, d2, 4); one per language / stemming / section; domain-independent, click and index. Entity ranking: probabilistic triples (subj, pred/attr, obj/value, p), e.g. (Arjen, speaks_to, you, 0.95), (you, follow, Arjen, 0.5), (speech, minutes, 45, 0.8); domain-aware; needs supervised indexing. Content-based (MM) retrieval: feature vectors (img_id, f1 … fN); click and index.
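A toy illustration of the T(term, doc, freq) relation described above, with invented documents and naive whitespace tokenization (the real system keeps one relation per language / stemming variant):

```python
from collections import Counter

# Hypothetical two-document collection
docs = {"d1": "patents cite patents", "d2": "news about patents"}

def term_doc_freq(docs):
    """Build the inverted file as plain (term, doc, freq) tuples,
    ready to be bulk-loaded into a relational table."""
    rows = []
    for doc_id, text in docs.items():
        for term, freq in Counter(text.lower().split()).items():
            rows.append((term, doc_id, freq))
    return sorted(rows)

for row in term_doc_freq(docs):
    print(row)
# ('about', 'd2', 1), ('cite', 'd1', 1), ('news', 'd2', 1),
# ('patents', 'd1', 2), ('patents', 'd2', 1)
```

Storing the index as an ordinary relation is what lets the strategy compiler treat ranking as just another sequence of relational views.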
    34. VIEWs and TABLEs. A building block's content is a sequence of VIEW definitions, e.g.: CREATE TABLE a AS SELECT … FROM term-doc …; CREATE VIEW b AS SELECT … FROM a WHERE a.x = u1 (u1 is a user parameter); CREATE TABLE c AS SELECT … FROM a WHERE a.x = 42 (no user parameter); CREATE VIEW d AS SELECT … FROM b …. A VIEW is pre-computable when all the relations it addresses are pre-computable or stored, and it has no dependency on user parameters. Pre-computable VIEWs can become TABLEs (or MATERIALIZED VIEWs): query-independent computations are performed only once, then read from TABLEs at each query; recognition of these patterns is fully automatic; this extends MonetDB's per-session caching to across-sessions caching.
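The pre-computability rule on this slide can be sketched as a small recursive check (the view names mirror the slide's a/b/c/d example; the dependency encoding is invented for illustration):

```python
# view name -> (relations it reads, user parameters it references)
views = {
    "a": ({"term_doc"}, set()),   # reads only a stored relation -> TABLE
    "b": ({"a"}, {"u1"}),         # depends on user parameter -> stays a VIEW
    "c": ({"a"}, set()),          # constant filter on a -> can become a TABLE
    "d": ({"b"}, set()),          # reads b, which is not pre-computable
}
stored = {"term_doc"}             # base relations already on disk

def precomputable(name, views, stored):
    """A relation is pre-computable iff it is stored, or it references no
    user parameters and all relations it reads are pre-computable."""
    if name in stored:
        return True
    deps, params = views[name]
    if params:
        return False
    return all(precomputable(d, views, stored) for d in deps)

print({v: precomputable(v, views, stored) for v in views})
# {'a': True, 'b': False, 'c': True, 'd': False}
```

So a and c would be materialized as TABLEs once, while b and d stay query-time VIEWs, matching the slide's example.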
    35. What Next?
    36. Current Situation: index … (schema definition); repeat { specify …; retrieve … } (search & explore) until …
    37. Traditional Indexing. Preprocessing determines to a large extent how a search request will be processed, especially regarding tokenization, stemming, etc. Fast and scalable, but inflexible: e.g., entity search hard-coded on top of the engine, advertisements matched on different data, etc.
    38. Search by Strategy. Flexible: generate an arbitrary engine on the fly. Not as fast as highly optimized and very well engineered inverted-file-based systems.
    39. Desirable Situation (mixed initiative): repeat { index … (schema definition); specify …; retrieve … } (search & explore) until …
    40. Non-Indexed Search. Grep: very flexible (use it all the time on my mh mail folders when gmail fails me!), but not scalable, and little or no structure.
    41. Minimal Indexing. How to reduce the pre-processing necessary to create a search engine over a new collection? Can we do without a keyword index? Can we avoid hardwired decisions for tokenization, language detection, stemming, …?
    42. Suffix Array. Pros: provides many core search functions (term statistics, keyword search, phrase search); no upfront tokenization needed (access at the character level); no upfront language detection needed. Cons: difficult to build for large corpora; expensive w.r.t. disk space.
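A minimal suffix-array sketch (invented toy text; sort-based construction, which only works for small inputs and is exactly the "difficult to build for large corpora" con above). Phrase search and term statistics fall out of binary search over the sorted suffixes, with no tokenizer or stemmer in sight:

```python
def build_suffix_array(text):
    """Indexes of all suffixes of text, sorted lexicographically.
    O(n^2 log n) worst case; production systems use linear-time builds."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def count_occurrences(text, sa, pattern):
    """Occurrences of pattern via two binary searches on the suffix array."""
    m = len(pattern)
    def bound(strict):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + m]
            if prefix < pattern or (strict and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return bound(strict=True) - bound(strict=False)

text = "search everywhere, search well"
sa = build_suffix_array(text)
print(count_occurrences(text, sa, "search"))  # 2
```

Because matching is character-level, the same structure answers phrase queries ("search well") and substring queries alike, which is what makes it attractive for minimal indexing.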
    43. Demo (Showed patent search demo)
    44. "Real Code"
    45. Patents on X by Y(y)
    46. PRA:
        s__STRATEGY___filter_DOC_with_NE_nes =
          Project [$2,$3](
            Join [$1 = $2](
              s__STRATEGY___clef_ip_patents_DATA_result,
              Project [$1,$3](
                Select [$2 = "ipcr-classification"](
                  s__STRATEGY___clef_ip_patents_DATA_ne_doc ))));
    47. Generated SQL:
        CREATE TABLE s__STRATEGY___filter_DOC_with_NE_nes AS
        SELECT tmp_1814091754.a2 AS a1,
               tmp_1814091754.a3 AS a2,
               tmp_1814091754.prob AS prob
        FROM (
          SELECT s__STRATEGY___clef_ip_patents_DATA_result.a1 AS a1,
                 tmp__1652836708.a1 AS a2,
                 tmp__1652836708.a2 AS a3,
                 s__STRATEGY___clef_ip_patents_DATA_result.prob * tmp__1652836708.prob AS prob
          FROM s__STRATEGY___clef_ip_patents_DATA_result, (
            SELECT tmp_1444787941.a1 AS a1,
                   tmp_1444787941.a3 AS a2,
                   tmp_1444787941.prob AS prob
            FROM (
              SELECT s__STRATEGY___clef_ip_patents_DATA_ne_doc.a1 AS a1,
                     s__STRATEGY___clef_ip_patents_DATA_ne_doc.a2 AS a2,
                     s__STRATEGY___clef_ip_patents_DATA_ne_doc.a3 AS a3,
                     s__STRATEGY___clef_ip_patents_DATA_ne_doc.prob AS prob
              FROM s__STRATEGY___clef_ip_patents_DATA_ne_doc
              WHERE s__STRATEGY___clef_ip_patents_DATA_ne_doc.a2 = 'ipcr-classification'
            ) AS tmp_1444787941
          ) AS tmp__1652836708
          WHERE s__STRATEGY___clef_ip_patents_DATA_result.a1 = tmp__1652836708.a2
        ) AS tmp_1814091754
        ORDER BY a1
        WITH DATA;
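A tiny runnable rendition of what this generated query does, using sqlite3 and invented toy data (the real relations live in MonetDB): select the 'ipcr-classification' annotations from the ne_doc relation, join them with the ranked result relation on the document id, and multiply the probabilities.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE result(a1 TEXT, prob REAL);                    -- ranked documents
CREATE TABLE ne_doc(a1 TEXT, a2 TEXT, a3 TEXT, prob REAL);  -- (value, type, doc)
INSERT INTO result VALUES ('d1', 0.9), ('d2', 0.5);
INSERT INTO ne_doc VALUES
  ('A61K', 'ipcr-classification', 'd1', 1.0),
  ('H04L', 'ipcr-classification', 'd2', 0.8),
  ('ACME', 'assignee',            'd1', 1.0);
""")
rows = con.execute("""
SELECT ne.a1, r.prob * ne.prob AS prob
FROM result r
JOIN (SELECT a1, a3, prob FROM ne_doc
      WHERE a2 = 'ipcr-classification') ne
  ON r.a1 = ne.a3
ORDER BY prob DESC
""").fetchall()
print(rows)  # [('A61K', 0.9), ('H04L', 0.4)]
```

Apart from the mechanical temporary-table aliases, this is the whole content of the machine-generated query above: a select, a join, and a probability product.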
    48. info@spinque.com | www.spinque.com | facebook.com/spinque
