How to build the next 1000 search engines?!




Opening keynote at the ECIR 2012 Industry Day: How to build the next 1000 search engines?! Includes preview of new minimal indexing approach.





Usage Rights

© All Rights Reserved

  • Does “Entity-based ranking” make sense?
  • NOTE: MATERIALIZED VIEWs, where supported (not in MonetDB), can be used instead of TABLEs when stored relations (index) are expected to get updates.

How to build the next 1000 search engines?! Presentation Transcript

  • How to build the next 1000 search engines?! Arjen P. de Vries (arjen@acm.org), Centrum Wiskunde & Informatica, Delft University of Technology, Spinque B.V.
  • Search is everywhere
  • Search is everywhere Yet it only works well on the web…
  • Complications Heterogeneous data sources  WWW, wikipedia, news, e- mail, patents, twitter, personal information, … Varying result types  “Documents”, tweets, courses, people, expert s, gene expressions, temperatures, … Multiple dimensions of relevance  Topicality, recency, reading level, …
  • Complications Many search tasks require a mix within these dimensions:  News and patents  Companies and their CEOs  Recent and on topic Many search tasks also require a mix across these dimensions:  Patents assigned to our top 3 competitors in market segments mentioned in the recent press releases issued by our top 10 clients
  • System’s internal information representation
    • Linguistic annotations: named entities, sentiment, dependencies, …
    • Knowledge resources: Wikipedia, Freebase, ICD-9, IPTC, …
    • Links to related documents: citations, URLs
    • Anchors that describe the URI: anchor text
    • Queries that lead to clicks on the URI: session, user, dwell-time, …
    • Tweets that mention the URI: time, location, user, …
    • Other social media that describe the URI: user, rating; tag, organisation of ‘folksonomy’
    • + UNCERTAINTY ALL OVER!
  • What goes in the black box? [Diagram: a document collection (anchors, entity types, sentiment, tweets, cited documents, …) and the user’s context go in; a ranked list of answers comes out. Candidates for the box: BM25, BM25F, LM, RM, VSM, DFR, QIR?, learning to rank? In short: ECIR / CIKM / SIGIR / ICTIR / WSDM papers!]
  • Rarely & scarcely addressed…
    • Student: How do I build it?
    • Professor: Who will build it for me?
    • Last session of the conference…
  • Search System
  • Parameterised Search System (Cornacchia, De Vries, ECIR 2007: “A Parameterised Search System”)
  • Parameterised Search System: can’t we ‘remove’ this IR engineer (or scientist!) from the loop, like DBMS software removes the data engineer from the loop? (Cornacchia, De Vries, ECIR 2007: “A Parameterised Search System”) And three (four?) children, a startup and 5 years later, a PhD defense!
  • Search by Strategy Visually construct search strategies by connecting building blocks
  • Search by Strategy Visually construct search strategies by connecting building blocks Each block describes either data or actions upon that data  Connection points (“pins”) are typed: doc / sec / term / ne (named entity) / tuple  Actions are expressed as scripts (later more)
  • Strategy Builder
  • From Patent to Inventor
  • Reports Visits
  • Generate Search Engine! Or, really, generate a REST API from the strategy specification!
  • Demo (showed demo of children’s search engine)
  • How Strategies Help
    • Strategies improve communication between search intermediary and user
      • Encapsulate domain expert knowledge
      • Abstract representation of search expert knowledge
      • Analyze information seeking process at any stage
    • Strategies facilitate knowledge management
      • Store / share / publish / refine
    • Strategies mix exact (DB) and ranked (IR) searches
      • Avoid the need for “human (probabilistic) joins”
  • Search Intermediaries (arranged along a “task complexity” axis): travel agency, real estate agents, recruiters, librarians, archivists, digital forensics detectives, patent information specialists
  • Exploratory Search: Search & (Faceted) Browsing
    • Help discover schema, ontology, etc.
    • Help discover the relevant sources
    • Within-collection (by year/location, by type, …)
    • Across multiple collections (by source)
  • Probabilistic faceted browsing. Both variants show the same facets: Price (100K - 200K, 200K - 300K, 300K - 400K), Rooms (3, 4, 5), Size (100 - 150 m², 150 - 200 m², 200 - 250 m²)
    • Traditional (boolean filters)
      • Good when user knows exactly which filters to apply
      • Will see perfect-match results
      • Won’t see “interesting” results
    • Probabilistic
      • Good for exploratory search
      • Will see perfect-match results
      • Will also see “interesting” results
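The contrast between boolean and probabilistic facets can be sketched in Python. This is an illustration under assumptions, not the system shown in the talk: the listings, the linear `soft_match` decay, and the `slack` values are all hypothetical. The boolean filter keeps only exact matches, while the probabilistic variant also keeps near misses with a lower score.

```python
# Hypothetical real-estate listings.
listings = [
    {"id": "A", "price": 180_000, "rooms": 4},
    {"id": "B", "price": 205_000, "rooms": 4},
    {"id": "C", "price": 390_000, "rooms": 2},
]


def in_range(value, low, high):
    return low <= value <= high


def soft_match(value, low, high, slack):
    """1.0 inside [low, high], decaying linearly to 0.0 at `slack` outside the range."""
    if in_range(value, low, high):
        return 1.0
    distance = low - value if value < low else value - high
    return max(0.0, 1.0 - distance / slack)


def boolean_filter(items):
    """Traditional facets: a listing is either in or out."""
    return [x["id"] for x in items
            if in_range(x["price"], 100_000, 200_000) and in_range(x["rooms"], 4, 4)]


def probabilistic_rank(items):
    """Probabilistic facets: multiply per-facet match scores and rank by the product."""
    scored = []
    for x in items:
        p = (soft_match(x["price"], 100_000, 200_000, slack=50_000)
             * soft_match(x["rooms"], 4, 4, slack=2))
        if p > 0:
            scored.append((x["id"], round(p, 2)))
    return sorted(scored, key=lambda pair: -pair[1])


exact = boolean_filter(listings)       # only the perfect match
ranked = probabilistic_rank(listings)  # perfect match first, near miss second
```

With these toy values, listing B (5K over the price range) is dropped by the boolean filter but still appears, ranked below A, in the probabilistic result.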
  • Dynamic facets. Both variants show the same facets: Price (100K - 200K, 200K - 300K, 300K - 400K), Rooms (3, 4, 5), Size (100 - 150 m², 150 - 200 m², 200 - 250 m²)
    • Pre-indexed
      • Pre-defined ad-hoc indices intersected with result set
      • Challenge: many indices to maintain
    • Dynamic
      • Facets decided from result set
      • Challenge: dynamically adapt granularity (different price ranges for villa/garage!)
      • Challenge: heavy concurrent queries to DB
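A minimal sketch of dynamic facets, assuming the simplest possible policy: split the value range of the current result set into equal-width buckets. The function name `dynamic_facets` and the sample prices are hypothetical. The point is that the bucket boundaries follow the result set, so villas and garages get different price ranges.

```python
def dynamic_facets(values, n_buckets=3):
    """Derive (low, high, count) buckets from the values in the current result set."""
    low, high = min(values), max(values)
    if low == high:
        return [(low, high, len(values))]
    span = high - low
    counts = [0] * n_buckets
    for v in values:
        # Map the value to a bucket; the maximum value falls into the last bucket.
        index = min((v - low) * n_buckets // span, n_buckets - 1)
        counts[index] += 1
    return [(low + i * span // n_buckets, low + (i + 1) * span // n_buckets, counts[i])
            for i in range(n_buckets)]


villa_facets = dynamic_facets([300_000, 450_000, 600_000])
garage_facets = dynamic_facets([10_000, 12_000, 25_000])
```

A production system would probably prefer rounded, human-friendly boundaries and uneven bucket widths. That is the granularity challenge the slide mentions.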
  • Demo (showed Spinque’s real-estate search demo)
  • Limitations Search & Browse: faceted exploration does not include joins
    • Cannot construct new data sources from existing ones!
    • Only the pre-defined paths through the information space can actually be traversed
  • Who needs a Join? You!!!
    • … whenever ‘relevance cues’ are typed:
      • People (e.g., inventors)
      • Companies (e.g., assignees)
      • Categories (e.g., IPTC)
      • Time (e.g., expiry date)
      • Location (e.g., country)
    • … or whenever multiple sources are to be combined
      • E.g., patents & news, patents & Wikipedia, …
  • Patents on X by Y(y)
  • Interactive Information Access
    • Feedback: interaction improves information representation
    • Faceted Browsing: interaction can let user take over where machine would fail
    • Search by Strategy: interaction can let user take over where system designer would fail
  • Conclusion: “No idealized one-shot search engine”. Empower the user!
  • Under the Hood
  • From Strategies to DB Queries
    • Strategy (Spinque: strategy): data flow between building blocks, e.g. BB1(in1, in2, in3, u1, u2) → out and BB2(in1) → out
    • Query (Spinque: PRA): the strategy made operational as a sequence of views: CREATE VIEW a AS SELECT .. ; CREATE VIEW b AS SELECT .. ; CREATE VIEW c AS SELECT ..
    • Database (Spinque: RDBMS, MonetDB): relational DB
  • Probabilistic Relational Algebra
    • Strategy
    • PRA: probabilistic relational algebra (Fuhr and Roelleke, TOIS 2001)
      x = Project DISTINCT [$1,$3](y);
    • SQL: explicit probabilities
      CREATE VIEW x AS SELECT a1, a3, 1-prod(1-prob) AS prob FROM y GROUP BY a1, a3;
    • Relational DB
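The aggregate `1-prod(1-prob)` on this slide can be worked through on toy data. The Python sketch below is an illustration, not Spinque's PRA engine: the function `project_distinct` and the sample relation `y` are hypothetical. It groups tuples on the projected columns and combines the duplicates as if they were independent events, which is what the formula expresses.

```python
from collections import defaultdict
from math import prod


def project_distinct(rows, columns):
    """rows: list of (values, prob); columns: 1-based positions, like PRA's $1, $3."""
    groups = defaultdict(list)
    for values, prob in rows:
        key = tuple(values[c - 1] for c in columns)
        groups[key].append(prob)
    # Probability that at least one of the grouped tuples holds, assuming independence.
    return {key: 1 - prod(1 - p for p in probs) for key, probs in groups.items()}


# Hypothetical relation y with three attributes and a probability per tuple.
y = [
    (("d1", "x", "nl"), 0.5),
    (("d1", "y", "nl"), 0.5),
    (("d2", "x", "en"), 0.8),
]

x = project_distinct(y, [1, 3])
```

The two `("d1", "nl")` tuples collapse into one with probability 1 - 0.5 × 0.5 = 0.75, while `("d2", "en")` keeps its 0.8.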
  • What’s in the DB?
    • Text-based ranking
      • Term-doc-freq relations (inverted file)
      • One per language, stemming, section
      • Domain-independent, click and index
      • Example relation (T, D, f): (t0, d3, 3), (t0, d5, 10), (t1, d2, 4)
    • Entity ranking
      • Probabilistic triples
      • Domain-aware
      • Needs supervised indexing
      • Example relation (subj, pred/attr, obj/value, p): (Arjen, speaks_to, you, 0.95), (you, follow, Arjen, 0.5), (speech, minutes, 45, 0.8)
    • Content-based (MM) retrieval
      • Feature vectors, click and index
      • Example relation (Img_id, f1, …, fN): (0, 0.12, …, 0.84), (1, 0.54, …, 0.31), (2, 0.23, …, 0.1), …
  • VIEWS and TABLES
    • BB content: sequence of VIEW definitions
      CREATE [VIEW → TABLE] a AS SELECT … FROM term-doc … ;  (term-doc is a stored relation)
      CREATE VIEW b AS SELECT … FROM a WHERE a.x = u1 ;  (u1 is a user parameter)
      CREATE [VIEW → TABLE] c AS SELECT … FROM a WHERE a.x = 42 ;  (no user parameter, pre-computable relation)
      CREATE VIEW d AS SELECT … FROM b … ;
    • A VIEW is pre-computable when
      • All the relations addressed are pre-computable / stored
      • No dependency on user parameters
    • Pre-computable VIEWs can become TABLEs (or MATERIALIZED VIEWs)
      • Query-independent computations are performed only once, then read from TABLEs at each query
      • Recognition of these patterns is fully automatic
      • Extends MonetDB’s per-session caching to across-sessions caching
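The two pre-computability conditions can be sketched as a single pass over the view definitions. This is an illustration of the rule as stated on the slide, not the actual recognition code. The function `precomputable_views` and the dictionary layout are hypothetical, and the underscore in `term_doc` is only there to make a valid identifier.

```python
def precomputable_views(views, stored):
    """views: name -> (relations read, user parameters used), in definition order.

    A view is pre-computable when it uses no user parameters and every relation
    it reads is either stored or itself pre-computable.
    """
    ready = set(stored)
    result = []
    for name, (reads, params) in views.items():
        if not params and reads <= ready:
            ready.add(name)
            result.append(name)
    return result


# The four definitions from the slide, reduced to their dependencies.
views = {
    "a": ({"term_doc"}, set()),
    "b": ({"a"}, {"u1"}),
    "c": ({"a"}, set()),
    "d": ({"b"}, set()),
}

tables = precomputable_views(views, stored={"term_doc"})
```

Here `a` and `c` qualify and can become TABLEs. `b` depends on the user parameter `u1`, and `d` reads `b`, so both stay VIEWs.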
  • What Next?
  • Current Situation
    • index ; (schema definition)
    • repeat { specify ; retrieve } until ☺ (search & explore)
  • Traditional Indexing
    • Preprocessing determines to a large extent how a search request will be processed
      • Especially regarding tokenization, stemming, etc.
    • Fast and scalable, but inflexible
      • E.g., entity search hard-coded on top of engine, advertisements matched on different data, etc.
  • Search by Strategy
    • Flexible: generate arbitrary engine on the fly
    • Not as fast as highly optimized and very well engineered inverted-file-based systems
  • Desirable Situation (mixed initiative)
    • repeat { index ; specify ; retrieve } until ☺ (schema definition and search & explore in the same loop)
  • Non-Indexed Search: Grep
    • Very flexible
    • Use it all the time on my mh mail folders when gmail fails me!
    • Not scalable, little or no structure
  • Minimal Indexing: how to reduce the pre-processing necessary to create a search engine over a new collection?
    • Can we do without a keyword index?
    • Can we avoid hardwired decisions for tokenization, language detection, stemming, …?
  • Suffix Array
    • Pros:
      • Provides many core search functions: term statistics, keyword search, phrase search
      • No upfront tokenization needed (access at character level)
      • No upfront language detection needed
    • Cons:
      • Difficult to build for large corpora
      • Expensive w.r.t. disk space
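The pros listed for suffix arrays can be made concrete with a toy example. The Python sketch below is an assumption-laden illustration, not the minimal-indexing prototype from the talk: the naive construction sorts whole suffixes, which is only reasonable for tiny inputs and is exactly the scaling problem listed under the cons. Keyword search, phrase search, and occurrence counts all reduce to the same binary search at character level, with no tokenization.

```python
def build_suffix_array(text):
    """Naive construction: sort all suffix start positions (toy inputs only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted start positions of `pattern` via two binary searches."""
    m = len(pattern)

    def prefix(k):
        start = suffix_array[k]
        return text[start:start + m]

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


text = "search engines search patents"
sa = build_suffix_array(text)

keyword_hits = find_occurrences(text, sa, "search")         # keyword search
phrase_hits = find_occurrences(text, sa, "search patents")  # phrase search
term_count = len(find_occurrences(text, sa, "en"))          # term statistics
```

Because matching happens on raw characters, the same index answers queries for substrings such as "en" that no tokenizer would have produced.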
  • Demo (showed patent search demo)
  • “Real Code”
  • Patents on X by Y(y)
  • PRA
    s__STRATEGY___filter_DOC_with_NE_nes =
      Project [$2,$3](
        Join [$1 = $2](
          s__STRATEGY___clef_ip_patents_DATA_result,
          Project [$1,$3](
            Select [$2 = "ipcr-classification"](
              s__STRATEGY___clef_ip_patents_DATA_ne_doc
            )
          )
        )
      );
  • CREATE TABLE s__STRATEGY___filter_DOC_with_NE_nes AS
    SELECT tmp_1814091754.a2 AS a1,
           tmp_1814091754.a3 AS a2,
           tmp_1814091754.prob AS prob
    FROM (
      SELECT s__STRATEGY___clef_ip_patents_DATA_result.a1 AS a1,
             tmp__1652836708.a1 AS a2,
             tmp__1652836708.a2 AS a3,
             s__STRATEGY___clef_ip_patents_DATA_result.prob * tmp__1652836708.prob AS prob
      FROM s__STRATEGY___clef_ip_patents_DATA_result,
           (
             SELECT tmp_1444787941.a1 AS a1,
                    tmp_1444787941.a3 AS a2,
                    tmp_1444787941.prob AS prob
             FROM (
               SELECT s__STRATEGY___clef_ip_patents_DATA_ne_doc.a1 AS a1,
                      s__STRATEGY___clef_ip_patents_DATA_ne_doc.a2 AS a2,
                      s__STRATEGY___clef_ip_patents_DATA_ne_doc.a3 AS a3,
                      s__STRATEGY___clef_ip_patents_DATA_ne_doc.prob AS prob
               FROM s__STRATEGY___clef_ip_patents_DATA_ne_doc
               WHERE s__STRATEGY___clef_ip_patents_DATA_ne_doc.a2 = 'ipcr-classification'
             ) AS tmp_1444787941
           ) AS tmp__1652836708
      WHERE s__STRATEGY___clef_ip_patents_DATA_result.a1 = tmp__1652836708.a2
    ) AS tmp_1814091754
    ORDER BY a1
    WITH DATA;
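To see what the generated query computes, it helps to run a hand-flattened equivalent on toy data. The sketch below uses Python's `sqlite3` module instead of MonetDB, so the `WITH DATA` clause is dropped. The short table names `result` and `ne_doc`, the sample rows, and the reading of the `ne_doc` columns as (entity, entity type, document) are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed shapes: result(doc, prob) and ne_doc(entity, entity type, doc, prob).
    CREATE TABLE result (a1 TEXT, prob REAL);
    CREATE TABLE ne_doc (a1 TEXT, a2 TEXT, a3 TEXT, prob REAL);
    INSERT INTO result VALUES ('EP1', 0.9), ('EP2', 0.5);
    INSERT INTO ne_doc VALUES
        ('H04L', 'ipcr-classification', 'EP1', 1.0),
        ('G06F', 'ipcr-classification', 'EP2', 0.8),
        ('G06F', 'ipcr-classification', 'EP3', 1.0),
        ('Philips', 'assignee', 'EP1', 1.0);
""")

# Flattened form of the nested query: select on entity type, join on the document,
# keep (entity, document), and multiply the probabilities of the joined tuples.
rows = conn.execute("""
    SELECT n.a1 AS a1, n.a3 AS a2, r.prob * n.prob AS prob
    FROM result AS r, ne_doc AS n
    WHERE n.a2 = 'ipcr-classification'
      AND r.a1 = n.a3
    ORDER BY a1
""").fetchall()
conn.close()
```

Only classifications of documents present in `result` survive, and each output probability is the product of the document's probability and the annotation's probability, as in the `prob` expression of the generated SQL.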
  • info@spinque.com, www.spinque.com, facebook.com/spinque