Search as Communication: Lessons from a Personal Journey

Search as Communication: Lessons from a Personal Journey
by Daniel Tunkelang (Head of Query Understanding, LinkedIn)

Presented at Etsy's Code as Craft Series on May 21, 2013

When I tell people I spent a decade studying computer science at MIT and CMU, most assume that I focused on information retrieval; after all, I've spent most of my professional life working on search.

But that’s not how it happened. I learned about information extraction as a summer intern at IBM Research, where I worked on visual query reformulation. I learned how search engines work by building one at Endeca. It was only after I’d hacked my way through the problem for a few years that I started to catch up on the rich scholarly literature of the past few decades.

As a result, I developed a point of view about search without the benefit of academic conventional wisdom. Specifically, I came to see search not so much as a ranking problem as a communication problem.

In this talk, I’ll explain my communication-centric view of search, offering examples, general techniques, and open problems.

--

Daniel Tunkelang is Head of Query Understanding at LinkedIn. Educated at MIT and CMU, he has spent his career working on big data, addressing key challenges in search, data mining, user interfaces, and network analysis. He co-founded enterprise search and business intelligence pioneer Endeca, where he spent a decade as its Chief Scientist; in 2011, Endeca was acquired by Oracle for over $1B. Before LinkedIn, he led a team at Google working on local search quality. Daniel has authored fifteen patents, written a textbook on faceted search, and created the annual symposium on human-computer interaction and information retrieval.



  1. Search as Communication: Lessons from a Personal Journey. Daniel Tunkelang, Head of Query Understanding, LinkedIn
  2. These are great textbooks on information retrieval.
  3. Unfortunately, I never read them in school. But I did study graphs and stuff.
  4. I found myself developing a search engine.
  5. And the next thing I knew, I was a search guy.
  6. So what did I learn along the way?
  7. Search isn't a ranking problem. It's a communication problem.
  8. Outline: 1. Lessons from Library Science. 2. Adventures with Information Extraction. 3. A Moment of Clarity.
  9. 1. Lessons from Library Science
  10. A bird's-eye view of how search engines work. USER: information need → query → select from results. SYSTEM: rank using an IR model (tf-idf, PageRank).
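To make the "rank using an IR model" step concrete, here is a minimal tf-idf scorer, sketched in Python (illustrative only, not from the deck; a real engine blends many such signals, including link-based ones like PageRank):

    import math
    from collections import Counter

    def tfidf_score(query_terms, doc_terms, doc_freq, num_docs):
        # doc_freq maps term -> number of documents containing that term
        tf = Counter(doc_terms)
        score = 0.0
        for term in query_terms:
            if tf[term] > 0:
                # rarer terms (lower document frequency) get higher weight
                idf = math.log(num_docs / doc_freq.get(term, 1))
                score += tf[term] * idf
        return score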
  11. Old-school search: ask a librarian.
  12. Search lives in an information-seeking context. [Pirolli and Card, 2005]
  13. Recognize ambiguity and ask for clarification.
  14. Clarify, then refine. (Screenshot: facets such as Computers, Books.)
  15. Faceted search. It's not just for e-commerce.
  16. Give users transparency, guidance, and control.
  17. Take-away for search engine developers: Act like a librarian. Communicate with your user.
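One way to operationalize slides 15-17: after each query, show how the current results distribute over each facet, and let the user refine by selecting a value. A minimal Python sketch (the dict-based document shape is an assumption):

    from collections import Counter

    def facet_counts(results, facet_fields):
        # e.g., {"category": Counter({"Books": 12, "Computers": 5})}
        counts = {field: Counter() for field in facet_fields}
        for doc in results:
            for field in facet_fields:
                if field in doc:
                    counts[field][doc[field]] += 1
        return counts

    def refine(results, field, value):
        # a facet click narrows the current result set rather than
        # forcing the user to start a new search
        return [doc for doc in results if doc.get(field) == value]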
  18. 2. Adventures with Information Extraction
  19. String matching is great but has limits.
  20. People search for entities. Recognize them! The slide's query-segmentation dynamic program, cleaned up:
      for i in [1..n]
          s ← w1 w2 … wi
          if Pc(s) > 0
              a ← new Segment(); a.segs ← {s}; a.prob ← Pc(s)
              B[i] ← {a}
          for j in [1..i-1]
              for b in B[j]
                  s ← wj+1 wj+2 … wi
                  if Pc(s) > 0
                      a ← new Segment(); a.segs ← b.segs ∪ {s}; a.prob ← b.prob × Pc(s)
                      B[i] ← B[i] ∪ {a}
          sort B[i] by prob
          truncate B[i] to size k
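A runnable Python version of that dynamic program, as a sketch. The phrase-scoring function pc is an assumption here; in practice it might be estimated from query logs or web n-gram counts:

    def segmentations(words, pc, k=10):
        # B[i] holds up to k of the most probable segmentations of
        # words[0:i], each as a (tuple_of_segments, probability) pair
        n = len(words)
        B = [[] for _ in range(n + 1)]
        B[0] = [((), 1.0)]
        for i in range(1, n + 1):
            candidates = []
            for j in range(i):
                s = " ".join(words[j:i])
                p = pc(s)
                if p > 0:
                    for segs, prob in B[j]:
                        candidates.append((segs + (s,), prob * p))
            candidates.sort(key=lambda c: c[1], reverse=True)
            B[i] = candidates[:k]  # beam: keep only the top k
        return B[n]

    # toy usage with made-up phrase probabilities
    probs = {"new": 0.3, "york": 0.2, "times": 0.4,
             "new york": 0.6, "new york times": 0.5}
    print(segmentations("new york times".split(), lambda s: probs.get(s, 0.0), k=3))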
  21. Named entity recognition is free, as in free beer.
  22. Problem: entity detection systems process each document separately. Why not take advantage of corpus features?
  23. Give your documents the right to vote! Use a high-recall method to collect candidates (e.g., all title-case spans of words, other than a single word beginning a sentence). Process each document separately: each candidate is assigned an entity type, or no type at all. If a candidate is mostly assigned a single entity type, extrapolate that type to all of its occurrences.
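A minimal sketch of that voting step in Python (the input shape and the 80% majority threshold are assumptions, not from the deck):

    from collections import Counter, defaultdict

    def vote_entity_types(labeled_candidates, min_share=0.8):
        # labeled_candidates: (candidate_string, entity_type_or_None)
        # pairs pooled across all documents
        votes = defaultdict(Counter)
        for candidate, label in labeled_candidates:
            if label is not None:
                votes[candidate][label] += 1
        winners = {}
        for candidate, counter in votes.items():
            label, count = counter.most_common(1)[0]
            # if a candidate is mostly assigned one type, extrapolate
            # that type to all of its occurrences
            if count / sum(counter.values()) >= min_share:
                winners[candidate] = label
        return winners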
  24. Looking for topics? Use idf, and its cousin ridf. Inverse document frequency (idf): too low? Probably a stop word. Too high? Could be noise. Residual inverse document frequency (ridf): predict idf using a Poisson model; ridf is the difference between observed idf and predicted idf. "A good keyword is far from Poisson." [Church and Gale, 1995]
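For concreteness, here is a Python sketch of ridf following the Church and Gale formulation (observed idf minus the idf a Poisson model would predict):

    import math

    def ridf(df, cf, num_docs):
        # df: documents containing the term; cf: total occurrences
        idf = -math.log2(df / num_docs)
        # under a Poisson model with rate cf/num_docs, the expected share
        # of documents containing the term at least once is 1 - e^(-rate)
        rate = cf / num_docs
        predicted_idf = -math.log2(1 - math.exp(-rate))
        return idf - predicted_idf  # "far from Poisson" means high ridf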
  25. Terminology extraction? Try data recycling.
  26. Obtain entities by any means necessary.
  27. Take-away for search engine developers: Entity detection is crucial. And it isn't that hard.
  28. 3. A Moment of Clarity
  29. Let's go back to our pigeons for a moment. (USER: information need → query → select from results. SYSTEM: rank using an IR model: tf-idf, PageRank.)
  30. What does this process look like to the system?
  31. And here's what it looks like to the user: one experience is good, the other not so good. But can the system tell the difference?
  32. User experience should reflect system confidence.
  33. Searches reflect a variety of information needs. Derived from [Jansen et al., 2007]. http://searchengineland.com/getting-organized-paid-search-user-intent-the-search-funnel-116312
  34. We can segment the information need from the query (using the same segmentation dynamic program as slide 20).
  35. We can learn from analyzing user behavior.
  36. And we can look at our relevance scores. (Score distributions shown for navigational vs. exploratory queries.)
  37. There are many pre- and post-retrieval signals. [Claudia Hauff, Query Difficulty for Digital Libraries, 2009]
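As one concrete example of a pre-retrieval signal (one of many that Hauff surveys; the choice of average idf here is illustrative, not from the deck):

    import math

    def avg_idf(query_terms, doc_freq, num_docs):
        # a simple pre-retrieval difficulty signal: queries made of
        # common, low-idf terms tend to be vaguer and harder to satisfy
        idfs = [math.log2(num_docs / doc_freq.get(t, 1))
                for t in query_terms]
        return sum(idfs) / len(idfs)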
  38. Take-away for search engine developers: Queries vary in difficulty. Recognize and adapt.
  39. Review. 1. Lessons from Library Science: act like a librarian; communicate with users. 2. Adventures with Information Extraction: entity detection is crucial, and it isn't that hard. 3. A Moment of Clarity: queries vary in difficulty; recognize and adapt.
  40. Conclusion: Read the textbooks. But treat search as a communication problem.
  41. WE'RE HIRING! http://data.linkedin.com/search Contact me: dtunkelang@linkedin.com http://linkedin.com/in/dtunkelang
