What to do when one size does not fit all?!


Keynote talk about "Search by Strategy" at the ESAIR 2011 workshop, held at CIKM 2011.

Published in: Technology
  • Viewing a BB as a function can be used later to sketch SpinQL.
  • Does “Entity-based ranking” make sense?
  • NOTE: MATERIALIZED VIEWs, where supported (not in MonetDB), can be used instead of TABLEs when stored relations (index) are expected to get updates.
  • This is how it should be done. How it is done at the moment: always append (like filters). Up/down means: upvote/downvote the selected bucket
  • Transcript of "What to do when one size does not fit all?!"

    1. 1. What to do when one size does not fit all?! Arjen P. de Vries [email_address] Centrum Wiskunde & Informatica Delft University of Technology Spinque B.V.
    2. 2. Core Questions <ul><li>How to represent information? </li></ul><ul><ul><li>The information need and search requests </li></ul></ul><ul><ul><li>The objects to be shown in response to an information request </li></ul></ul><ul><li>How to match information representations </li></ul><ul><ul><li>(Deductive) data retrieval, (inductive) information retrieval, or a mix?! </li></ul></ul>
    3. 3. Complications <ul><li>Heterogeneous data sources </li></ul><ul><ul><li>WWW, Wikipedia, news, e-mail, patents, Twitter, personal information, … </li></ul></ul><ul><li>Varying result types </li></ul><ul><ul><li>“Documents”, tweets, courses, people, experts, gene expressions, temperatures, … </li></ul></ul><ul><li>Multiple dimensions of relevance </li></ul><ul><ul><li>Topicality, recency, reading level, … </li></ul></ul>
    4. 4. Complications <ul><li>Many search tasks require a mix within these dimensions: </li></ul><ul><ul><li>News and patents </li></ul></ul><ul><ul><li>Companies and their CEOs </li></ul></ul><ul><ul><li>Recent and on topic </li></ul></ul><ul><li>Many search tasks also require a mix across these dimensions: </li></ul><ul><ul><li>Patents assigned to our top 3 competitors in market segments mentioned in the recent press releases issued by our top 10 clients </li></ul></ul>
    5. 5. Complications <ul><li>System’s internal information representation </li></ul><ul><ul><li>Linguistic annotations </li></ul></ul><ul><ul><ul><li>Named entities, sentiment, dependencies, … </li></ul></ul></ul><ul><ul><li>Knowledge resources </li></ul></ul><ul><ul><ul><li>Wikipedia, Freebase, ICD9, IPTC, … </li></ul></ul></ul><ul><ul><li>Links to related documents </li></ul></ul><ul><ul><ul><li>Citations, URLs </li></ul></ul></ul>
    6. 6. Complications <ul><li>Anchors that describe the URI </li></ul><ul><ul><li>Anchor text </li></ul></ul><ul><li>Queries that lead to clicks on the URI </li></ul><ul><ul><li>Session, user, dwell-time, … </li></ul></ul><ul><li>Tweets that mention the URI </li></ul><ul><ul><li>Time, location, user, … </li></ul></ul><ul><li>Other social media that describe the URI </li></ul><ul><ul><li>User, rating </li></ul></ul><ul><ul><li>Tag, organisation of ‘folksonomy’ </li></ul></ul>
    7. 7. Tweets about blip.tv <ul><li>E.g.: http://blip.tv/file/2168377 </li></ul><ul><ul><li>Amazing </li></ul></ul><ul><ul><li>Watching “World’s most realistic 3D city models?” </li></ul></ul><ul><ul><li>Google Earth/Maps killer </li></ul></ul><ul><ul><li>Ludvig Emgard shows how maps/satellite pics on web is done (learn Google and MS!) </li></ul></ul><ul><ul><ul><ul><ul><li>and ~120 more Tweets </li></ul></ul></ul></ul></ul>
    8. 8. Even More Complications <ul><li>Uncertainty in matching process </li></ul><ul><ul><li>Vocabulary mismatch </li></ul></ul><ul><ul><li>Incomplete relevance information </li></ul></ul><ul><li>Imperfect and noisy representations of both documents and information need </li></ul><ul><ul><li>OCR, multimedia analysis, NE taggers, HTML table extraction, … </li></ul></ul>
    9. 9. The one-size-fits-all “semantically enhanced retrieval model”? [Diagram: the document collection (anchors, entity types, sentiment, tweets, cited documents, …), the context and the user feed a single retrieval model (BM25, BM25F, LM, RM, VSM, DFR, QIR? Learning to rank?), which produces a ranked list of answers]
    10. 10. http://www.hellokids.com/c_19938/coloring-page/holiday-coloring-pages/easter-coloring-pages/jesus-coloring-pages/the-holy-grail-coloring-page
    11. 11. Parameterised Search System: Can we not ‘remove’ this IR engineer from the loop, just as DBMS software removes the data engineer from the loop? (Cornacchia, De Vries, ECIR 2007, A Parametrised Search System)
    12. 12. Search by Strategy <ul><li>Visually construct search strategies by connecting building blocks </li></ul>
    13. 15. Generate Search Engine!
    14. 16. Search by Strategy <ul><li>Visually construct search strategies by connecting building blocks </li></ul><ul><li>Each block describes either data or actions upon that data </li></ul>
    15. 17. Strategy Builder
    16. 18. From Patent to Inventor
    17. 19. Reports Visits
    18. 21. BBs and typed pins <ul><li>N input pins, 1 output pin </li></ul><ul><ul><li>Pins represent data / result sets </li></ul></ul><ul><li>M user-parameters (u) </li></ul><ul><ul><li>instantiated at query-time </li></ul></ul><ul><li>A BB can be viewed as a function </li></ul><ul><ul><li>out = BB(in_1, …, in_N, u_1, …, u_M) </li></ul></ul><ul><li>Pins are typed </li></ul><ul><ul><li>doc / sec / term / ne (named entity) / tuple </li></ul></ul><ul><ul><li>only pins of the same type can be connected </li></ul></ul><ul><li>No assumption on the underlying data store / data API </li></ul><ul><li>The content of the BB respects the type contract. </li></ul>[Diagram: BB_1(in_1, in_2, in_3, u_1, u_2) with input pins in_1, in_2, in_3 and one output pin; BB_2(in_1) with input pin in_1 and one output pin]
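The "BB as a function" view with typed pins can be sketched in a few lines of Python. This is only an illustration, not Spinque's implementation: the `Pin` and `BuildingBlock` classes, the `top_k` block and its `k` parameter are hypothetical names, and a result set is assumed to be a list of (id, probability) pairs.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Pin types as listed on the slide.
PIN_TYPES = {"doc", "sec", "term", "ne", "tuple"}

Rows = List[Tuple[str, float]]  # assumed shape of a result set: (id, probability)


@dataclass
class Pin:
    """A typed result set travelling between building blocks (hypothetical)."""
    type: str
    rows: Rows

    def __post_init__(self):
        if self.type not in PIN_TYPES:
            raise ValueError(f"unknown pin type: {self.type}")


@dataclass
class BuildingBlock:
    """N typed input pins, one typed output pin, plus query-time user parameters."""
    name: str
    in_types: List[str]
    out_type: str
    fn: Callable[..., Rows]

    def __call__(self, *inputs: Pin, **user_params) -> Pin:
        if len(inputs) != len(self.in_types):
            raise TypeError(f"{self.name} expects {len(self.in_types)} input pins")
        # Only pins of the same type can be connected.
        for pin, expected in zip(inputs, self.in_types):
            if pin.type != expected:
                raise TypeError(f"{self.name}: expected '{expected}' pin, got '{pin.type}'")
        return Pin(self.out_type, self.fn(*[p.rows for p in inputs], **user_params))


def top_k(rows: Rows, k: int = 10) -> Rows:
    """Hypothetical block body: keep the k most probable rows."""
    return sorted(rows, key=lambda r: r[1], reverse=True)[:k]


top_k_block = BuildingBlock("top_k", ["doc"], "doc", top_k)
docs = Pin("doc", [("d1", 0.2), ("d2", 0.9), ("d3", 0.5)])
result = top_k_block(docs, k=2)  # user parameter k is supplied at query time
```

Connecting a `term` pin to `top_k_block` raises a `TypeError`, which is the type contract the slide describes.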
    19. 22. From Strategies to DB Queries <ul><li>Database; in Spinque: RDBMS (MonetDB) </li></ul><ul><li>Data flow; in Spinque: strategy </li></ul><ul><li>Query: strategy made operational; in Spinque: PRA </li></ul>[Diagram: a strategy built from BB_1(in_1, in_2, in_3, u_1, u_2) and BB_2(in_1) is translated for the relational DB into CREATE VIEW a AS SELECT .. ; CREATE VIEW b AS SELECT .. ; CREATE VIEW c AS SELECT .. ]
    20. 23. Probabilistic Relational Algebra <ul><li>SQL: explicit probabilities </li></ul><ul><ul><li>CREATE VIEW x AS </li></ul></ul><ul><ul><li>SELECT a1, a3, </li></ul></ul><ul><ul><li>1-prod(1-prob) AS prob </li></ul></ul><ul><ul><li>FROM y </li></ul></ul><ul><ul><li>GROUP BY a1, a3; </li></ul></ul><ul><li>PRA: probabilistic relational algebra (Fuhr and Roelleke, TOIS 2001) </li></ul><ul><ul><li>x = Project DISTINCT [$1,$3](y); </li></ul></ul>[Diagram labels: Strategy, Relational DB]
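The aggregate `1-prod(1-prob)` in the SQL above computes the probability that at least one of the grouped tuples holds, if the tuples are treated as independent events. A minimal Python sketch of that projection follows; the function name `project_distinct`, the 0-based column indices and the sample relation `y` are illustrative assumptions, not part of PRA or Spinque.

```python
from collections import defaultdict


def project_distinct(rows, keep):
    """Project onto the columns in `keep` and merge duplicates.

    rows: iterable of (attributes_tuple, prob)
    keep: 0-based column indices ($1,$3 on the slide would be [0, 2])
    Duplicates are combined as 1 - prod(1 - prob), assuming independence.
    """
    none_holds = defaultdict(lambda: 1.0)  # running prod(1 - prob) per group
    for attrs, prob in rows:
        key = tuple(attrs[i] for i in keep)
        none_holds[key] *= 1.0 - prob
    return {key: 1.0 - q for key, q in none_holds.items()}


# Hypothetical relation y(a1, a2, a3, prob).
y = [
    (("a", "x", "c"), 0.5),
    (("a", "y", "c"), 0.5),
    (("b", "x", "d"), 0.3),
]
x = project_distinct(y, keep=[0, 2])
```

Here the two tuples that collapse onto ("a", "c") combine to 1 - 0.5 * 0.5 = 0.75, while ("b", "d") keeps its single probability of 0.3.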
    21. 24. SpinQL, the sneak preview <ul><li>PRA is still too low-level; who writes algebraic plans?! </li></ul><ul><li>SpinQL: “See objects, generate SQL under the hood” </li></ul><ul><ul><li>Understands Spinque data types (doc, sec, term, named-entity, tuple) </li></ul></ul><ul><ul><li>Allows building levels of abstraction, to describe: </li></ul></ul><ul><ul><li>access to probabilistic relations </li></ul></ul><ul><ul><li>domain-unaware typed data streams (e.g. person.name() ) </li></ul></ul><ul><ul><li>domain-aware data streams (e.g. person.inventor_of() ) </li></ul></ul><ul><ul><li>building blocks (building blocks are functions) </li></ul></ul><ul><ul><li>strategies (if building blocks are functions, strategies are as well) </li></ul></ul><ul><li>SpinQL is not contained in a strategy; SpinQL is a strategy specification </li></ul><ul><ul><li>SpinQL describes everything; the editor shows the desired granularity / expertise level </li></ul></ul><ul><ul><li>E.g. show patent-strategy, zoom in on “inventors”, then “persons”, “ne”, raw data access </li></ul></ul>
    22. 25. What’s in the DB? <ul><li>Text-based ranking </li></ul><ul><ul><li>term-doc-freq relations (inverted file) </li></ul></ul><ul><ul><ul><li>One per language, stemming, section </li></ul></ul></ul><ul><ul><li>Domain-independent, click and index </li></ul></ul><ul><li>Entity ranking </li></ul><ul><ul><li>Probabilistic triples </li></ul></ul><ul><ul><li>Domain-aware </li></ul></ul><ul><ul><ul><li>Needs supervised indexing </li></ul></ul></ul><ul><li>Content-based (MM) retrieval </li></ul><ul><ul><li>Feature vectors, click and index </li></ul></ul>[Example tables. Term-doc-freq (T, D, f): (t_0, d_3, 3), (t_0, d_5, 10), (t_1, d_2, 4). Probabilistic triples (subj, pred/attr, obj/value, p): (Arjen, speaks_to, you, 0.95), (you, follow, Arjen, 0.5), (speech, minutes, 45, 0.8). Feature vectors (Img_id, f_1, …, f_N): (0, 0.12, …, 0.84), (1, 0.54, …, 0.31), (2, 0.23, …, 0.1)]
    23. 26. VIEWS and TABLES <ul><li>BB content: sequence of VIEW definitions </li></ul><ul><li>A VIEW is pre-computable when </li></ul><ul><ul><li>All the relations addressed are pre-computable / stored </li></ul></ul><ul><ul><li>No dependency on user parameters </li></ul></ul><ul><li>Pre-computable VIEWs can become TABLEs (or MATERIALIZED VIEWs) </li></ul><ul><ul><li>Query-independent computations are performed only once, then read from TABLEs at each query </li></ul></ul><ul><ul><li>Recognition of these patterns is fully automatic </li></ul></ul><ul><ul><li>Extends MonetDB’s per-session caching to across-sessions caching </li></ul></ul>[Example, as written: CREATE VIEW a AS SELECT … FROM term-doc … ; CREATE VIEW b AS SELECT … FROM a WHERE a.x = u_1 ; CREATE VIEW c AS SELECT … FROM a WHERE a.x = 42 ; CREATE VIEW d AS SELECT … FROM b … ; After rewriting: CREATE TABLE a AS SELECT … FROM term-doc … ; CREATE VIEW b AS SELECT … FROM a WHERE a.x = u_1 ; CREATE TABLE c AS SELECT … FROM a WHERE a.x = 42 ; CREATE VIEW d AS SELECT … FROM b … ; Annotations: term-doc is a stored relation; a is a pre-computable relation; u_1 is a user parameter; 42 involves no user parameter]
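The rule for recognising pre-computable VIEWs can be sketched as a single pass over the view definitions. This is a simplified illustration, not Spinque's actual analysis: it assumes the views are listed in dependency order and that each view is summarised by a hypothetical triple (name, source relations, uses a user parameter).

```python
def precomputable_views(views, stored):
    """Return the names of views that can be materialised as TABLEs.

    views:  list of (name, sources, uses_user_param), in dependency order
    stored: names of stored relations (the index)
    A view qualifies when it uses no user parameter and every relation it
    reads is either stored or itself pre-computable.
    """
    available = set(stored)
    result = []
    for name, sources, uses_user_param in views:
        if not uses_user_param and all(s in available for s in sources):
            available.add(name)
            result.append(name)
    return result


# The four views from the slide: b depends on user parameter u_1,
# and d reads from b, so neither can be pre-computed.
views = [
    ("a", ["term_doc"], False),
    ("b", ["a"], True),
    ("c", ["a"], False),
    ("d", ["b"], False),
]
materialise = precomputable_views(views, stored={"term_doc"})
```

For the slide's example this selects `a` and `c`, the two definitions that change from CREATE VIEW to CREATE TABLE.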
    24. 27. Exploratory Search <ul><li>Search & (Faceted) Browsing </li></ul><ul><ul><li>Help discover schema, ontology, etc. </li></ul></ul><ul><ul><li>Help discover the relevant sources </li></ul></ul><ul><ul><ul><li>Within-collection (by year/location, by type, …) </li></ul></ul></ul><ul><ul><ul><li>Across multiple collections (by source) </li></ul></ul></ul>
    25. 28. Probabilistic faceted browsing <ul><li>Traditional (boolean filters) </li></ul><ul><ul><li>Good when user knows exactly which filters to apply </li></ul></ul><ul><ul><li>Will see perfect-match results </li></ul></ul><ul><ul><li>Won’t see “interesting” results </li></ul></ul><ul><li>Probabilistic </li></ul><ul><ul><li>Good for exploratory search </li></ul></ul><ul><ul><li>Will see perfect-match results </li></ul></ul><ul><ul><li>Will also see “interesting” results </li></ul></ul>[Figure: both variants show the same facets: Price (100K - 200K, 200K - 300K, 300K - 400K), Rooms (3, 4, 5), Size (100 - 150 m², 150 - 200 m², 200 - 250 m²)]
    26. 29. Dynamic facets <ul><li>Pre-indexed </li></ul><ul><ul><li>Pre-defined ad-hoc indices intersected with result set </li></ul></ul><ul><ul><li>Challenge: many indices to maintain </li></ul></ul><ul><li>Dynamic </li></ul><ul><ul><li>Facets decided from result set </li></ul></ul><ul><ul><li>Challenge: dynamically adapt granularity </li></ul></ul><ul><ul><ul><li>Different price ranges for villa/garage! </li></ul></ul></ul><ul><ul><li>Challenge: heavy concurrent queries to DB </li></ul></ul>[Figure: both variants show the same facets: Price (100K - 200K, 200K - 300K, 300K - 400K), Rooms (3, 4, 5), Size (100 - 150 m², 150 - 200 m², 200 - 250 m²)]
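The contrast between a boolean filter and a probabilistic facet can be made concrete with a toy example. The scoring function below is purely an assumption for illustration (full score inside the selected range, linear decay outside it); the slides do not say how Spinque scores facet values, and `tolerance`, `houses` and both function names are hypothetical.

```python
def boolean_filter(items, lo, hi):
    """Traditional facet: keep only items whose value lies inside [lo, hi]."""
    return [name for name, value in items if lo <= value <= hi]


def soft_facet(items, lo, hi, tolerance):
    """Probabilistic facet (assumed scoring): 1.0 inside [lo, hi], decaying
    linearly to 0 once the value is `tolerance` away from the range."""
    scored = []
    for name, value in items:
        distance = max(lo - value, 0, value - hi)
        score = max(0.0, 1.0 - distance / tolerance)
        if score > 0:
            scored.append((name, score))
    return sorted(scored, key=lambda r: r[1], reverse=True)


# Hypothetical listings: (name, price).
houses = [("A", 150_000), ("B", 210_000), ("C", 390_000)]
strict = boolean_filter(houses, 100_000, 200_000)
ranked = soft_facet(houses, 100_000, 200_000, tolerance=50_000)
```

The boolean filter returns only house A. The soft facet still ranks A first but also keeps B, which is slightly over budget: the kind of "interesting" result the probabilistic variant is meant to surface.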
    27. 30. Probabilistic facets and strategies (current) <ul><li>Filter on Size (in/out) </li></ul><ul><ul><li>no ranking! </li></ul></ul><ul><li>Re-rank on </li></ul><ul><ul><li>Rooms (up/down) </li></ul></ul><ul><ul><li>Price (up/down) </li></ul></ul><ul><ul><li>no filter! </li></ul></ul><ul><li>BAD: Order of Rooms/Price matters! </li></ul><ul><li>BAD: Ranking function of Rooms/Price internally smoothed with previous ranking (done for efficiency) </li></ul><ul><li>BAD: possible to weight each facet, but not consistently with others </li></ul>[Diagram: Original strategy with facet blocks Size, Price and Rooms. Facets: Price (100K - 200K, 200K - 300K, 300K - 400K), Rooms (3, 4, 5), Size (100 - 150 m², 150 - 200 m², 200 - 250 m²)]
    28. 31. Probabilistic facets and strategies (better) <ul><li>Filter on Size (in/out) </li></ul><ul><ul><li>no ranking! </li></ul></ul><ul><li>Re-rank on Rooms (up/down) </li></ul><ul><ul><li>no filter! </li></ul></ul><ul><li>Re-rank on Price (up/down) </li></ul><ul><ul><li>no filter! </li></ul></ul><ul><li>Mix 3 different rankings: </li></ul><ul><ul><li>Rooms </li></ul></ul><ul><ul><li>original (always present) </li></ul></ul><ul><ul><li>Price </li></ul></ul><ul><li>Change coefficients to explore </li></ul><ul><li>Challenge: use algebraic SpinQL representation for rewritings </li></ul><ul><ul><li>e.g. push filters up </li></ul></ul>[Diagram: Original strategy with facet blocks Size, Price and Rooms, combined by a Mix block with weights 20%, 50%, 30%. Facets: Price (100K - 200K, 200K - 300K, 300K - 400K), Rooms (3, 4, 5), Size (100 - 150 m², 150 - 200 m², 200 - 250 m²)]
    29. 32. Mixing probabilistic data streams <ul><li>N input streams $S_1, \dots, S_N$, 1 output </li></ul><ul><li>All streams: (id, p), with the same id type (e.g. docs) </li></ul><ul><li>Linear combination: $p = \alpha_1 p_1 + \dots + \alpha_N p_N$ </li></ul><ul><li>$p_1, \dots, p_N$ must be comparable, i.e. on the same scale </li></ul><ul><ul><li>Take care in scripting the blocks </li></ul></ul><ul><li>Expensive operation in relational algebra </li></ul>[Diagram: Mix block with weights 20%, 50%, 30%]
    30. 33. Mixing probabilistic data streams in RA <ul><li>$\mathrm{Sum}(p, \mathrm{GroupBy}(id, \mathrm{Union}(\alpha_1 S_1, \dots, \alpha_N S_N)))$ </li></ul><ul><ul><ul><li>GroupBy is optimised for few large groups; here we have the opposite </li></ul></ul></ul><ul><ul><ul><li>Example shows 3 ids, 2 streams: 3 groups (could be millions), each of size at most 2 (usually < 5) </li></ul></ul></ul>[Worked example, Mix with weights 20% / 80%. Inputs: S_1 = (id 0, 0.1), (id 1, 0.7), (id 2, 0.9); S_2 = (id 0, 0.2), (id 2, 1.0). Union of the weighted streams: from α_1 S_1: (id 0, 0.2*0.1 = 0.02), (id 1, 0.2*0.7 = 0.14), (id 2, 0.2*0.9 = 0.18); from α_2 S_2: (id 0, 0.8*0.2 = 0.16), (id 2, 0.8*1.0 = 0.8). Sum of p grouped by id: id 0: 0.02 + 0.16 = 0.18; id 1: 0.14; id 2: 0.18 + 0.8 = 0.98]
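The Union/GroupBy/Sum plan can be mimicked in Python to check the arithmetic of the example. This is a sketch of the relational plan's semantics only, not of how MonetDB executes it; the function name `mix_union_groupby` and the string ids are illustrative.

```python
from collections import defaultdict


def mix_union_groupby(streams, weights):
    """Linear mix of (id, p) streams: weight each stream, union all tuples,
    then sum p per id (the GroupBy/Sum step)."""
    totals = defaultdict(float)
    for stream, alpha in zip(streams, weights):
        for item_id, p in stream:
            totals[item_id] += alpha * p
    return dict(totals)


# The two input streams and the 20% / 80% weights from the slide.
s1 = [("id0", 0.1), ("id1", 0.7), ("id2", 0.9)]
s2 = [("id0", 0.2), ("id2", 1.0)]
mixed = mix_union_groupby([s1, s2], [0.2, 0.8])
```

Up to floating-point rounding, `mixed` holds 0.18 for id 0, 0.14 for id 1 and 0.2*0.9 + 0.8*1.0 = 0.98 for id 2.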
    31. 34. Mixing probabilistic data streams in RA <ul><li>$\mathrm{Project}(\alpha_1 p_1 + \dots + \alpha_N p_N, \mathrm{OuterJoin}(id_1 = \dots = id_N, (S_1, \dots, S_N)))$ </li></ul><ul><ul><ul><li>Explicit summation of a few values is more efficient than aggregation </li></ul></ul></ul><ul><ul><ul><li>Example omits handling NULLs from OuterJoin, which is not free </li></ul></ul></ul><ul><ul><ul><li>Super-fast if streams are ordered on id, which is impossible with Union/Group </li></ul></ul></ul>[Worked example, Mix with weights 20% / 80%. Inputs: S_1 = (id 0, 0.1), (id 1, 0.7), (id 2, 0.9); S_2 = (id 0, 0.2), (id 2, 1.0). OuterJoin on id gives (id, p_1, p_2): (id 0, 0.1, 0.2), (id 1, 0.7, NULL), (id 2, 0.9, 1.0). Project: id 0: 0.2*0.1 + 0.8*0.2 = 0.18; id 1: 0.2*0.7 = 0.14; id 2: 0.2*0.9 + 0.8*1.0 = 0.98]
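When both streams are ordered on id, the outer join followed by the projection reduces to a single merge pass. The sketch below handles two streams only and treats a missing probability (the NULL produced by the outer join) as 0; both choices are simplifying assumptions, and `mix_sorted_merge` is an illustrative name.

```python
def mix_sorted_merge(s1, s2, a1, a2):
    """Mix two (id, p) streams sorted on id in one pass.

    Equivalent to Project(a1*p1 + a2*p2, OuterJoin(id1 = id2, (S1, S2)))
    with a missing p (NULL from the outer join) counted as 0.
    """
    out = []
    i = j = 0
    while i < len(s1) and j < len(s2):
        id1, p1 = s1[i]
        id2, p2 = s2[j]
        if id1 == id2:          # id present in both streams
            out.append((id1, a1 * p1 + a2 * p2))
            i += 1
            j += 1
        elif id1 < id2:         # id only in s1
            out.append((id1, a1 * p1))
            i += 1
        else:                   # id only in s2
            out.append((id2, a2 * p2))
            j += 1
    # Drain whichever stream still has tuples left.
    out.extend((item_id, a1 * p) for item_id, p in s1[i:])
    out.extend((item_id, a2 * p) for item_id, p in s2[j:])
    return out


s1 = [("id0", 0.1), ("id1", 0.7), ("id2", 0.9)]
s2 = [("id0", 0.2), ("id2", 1.0)]
merged = mix_sorted_merge(s1, s2, 0.2, 0.8)
```

The output stays ordered on id, so it can feed another merge-based block without re-sorting, which the Union/GroupBy plan cannot guarantee.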
    32. 35. Limitations Search & Browse <ul><li>Faceted exploration does not include joins </li></ul><ul><ul><li>Cannot construct new data sources from existing ones! </li></ul></ul><ul><ul><li>Only the pre-defined paths through the information space can actually be traversed </li></ul></ul>
    33. 36. Who needs a Join? <ul><li>You!!! … whenever ‘relevance cues’ are typed: </li></ul><ul><ul><li>People (e.g., inventors) </li></ul></ul><ul><ul><li>Companies (e.g., assignees) </li></ul></ul><ul><ul><li>Categories (e.g., IPTC) </li></ul></ul><ul><ul><li>Time (e.g., expiry date) </li></ul></ul><ul><ul><li>Location (e.g., country) </li></ul></ul><ul><li>… or whenever multiple sources are to be combined </li></ul><ul><ul><li>E.g., patents & news, patents & Wikipedia, … </li></ul></ul>
    34. 37. Patents on X by Y(y) by Y(y)
    35. 38. Real-life patent search example: Which researchers associated with universities and colleges should our Human Resources manager know about in order to hire the right people on time? 1. Which universities/colleges hold patents? 2. Who are the inventors named in those patents? 3. Which inventors are active in the area of our company?
    36. 39. How Strategies Help <ul><li>Strategies improve communication between search intermediary and user </li></ul><ul><ul><li>Encapsulate domain expert knowledge </li></ul></ul><ul><ul><li>Abstract representation of search expert knowledge </li></ul></ul><ul><ul><li>Analyze information seeking process at any stage </li></ul></ul><ul><li>Strategies facilitate knowledge management </li></ul><ul><ul><li>Store / share / publish / refine </li></ul></ul><ul><li>Strategies mix exact (DB) and ranked (IR) searches </li></ul><ul><ul><li>Avoid the need for “human (probabilistic) joins” </li></ul></ul>
    37. 41. Conclusion <ul><li>“No idealized one-shot search engine” </li></ul><ul><li>Empower the user! </li></ul>
    38. 42. Search Intermediaries <ul><li>Travel agency </li></ul><ul><li>Real estate agents </li></ul><ul><li>Recruiters </li></ul><ul><li>Librarians </li></ul><ul><li>Archivists </li></ul><ul><li>Digital forensics detectives </li></ul><ul><li>Patent information specialists </li></ul>Task complexity
    39. 44. Research Opportunities <ul><li>Help users make the most of their increased level of control </li></ul><ul><ul><li>Integrate usage data from the live system to help improve or adapt strategies </li></ul></ul><ul><li>Handle “even larger” scale data </li></ul><ul><ul><li>Patent demo runs fine on ~17GB of semi-structured data (i.e., Fairview Research’s Green Energy collection), without specific optimizations, even with fairly large strategies </li></ul></ul><ul><li>Formalism </li></ul><ul><ul><li>Score normalization </li></ul></ul><ul><li>Close the loop! </li></ul>
    40. 45. Current Situation <ul><li>index ; </li></ul><ul><li>repeat { </li></ul><ul><li>specify ; </li></ul><ul><li>retrieve </li></ul><ul><li>} until  </li></ul>Search & explore Schema definition
    41. 46. Desirable Situation <ul><li>repeat { </li></ul><ul><li>index ; </li></ul><ul><li>specify ; </li></ul><ul><li>retrieve </li></ul><ul><li>} until  </li></ul>Mixed Initiative Schema definition Search & explore
    42. 47. Interactive Information Access <ul><li>Feedback: </li></ul><ul><ul><li>Interaction improves information representation </li></ul></ul><ul><li>Faceted Browsing: </li></ul><ul><ul><li>Interaction can let user take over where machine would fail </li></ul></ul><ul><li>Search by Strategy: </li></ul><ul><ul><li>Interaction can let user take over where system designer would fail </li></ul></ul>