Books and Webs: Pulling the Down Rows

Elementary explanation of the difficulties of combining indexes for web pages and books, and means by which book index data can optimize general web searches at scale.

Published in: Technology, News & Politics

  1. Peter Brantley, Internet Archive, The Presidio, 11.09
  2. Essential premise: combining web search with book search is an engineering challenge.
  3. I. Presenting combined search
  4. For several years, I served the University of California as the Director of Technology for the California Digital Library (the digital library group for the UC system).
  5. We held various conversations over time with Google engineers in similar spaces ... grappling with the indexing, search, and user interface issues with combined but disparate content pools (books, journals, web, image, video). (An important issue for digital libraries.)
  6. In academic info markets, “metasearch” (distributed queries with central resolution) contested for primacy with search over aggregated content. To an extent, only LANL and commercial search pursued aggregation at scale. Aggregation wins.
  7. “Google is undertaking the most radical change to its search results ever, introducing a "Universal Search" system that will blend listings from its news, video, images, local and book search engines among those it gathers from crawling web pages.” “With Universal Search, Google will hit a range of its vertical search engines, then decide if the relevancy of a result from book search is higher than a match from web page search.” (Danny Sullivan, “Google 2.0”, May 16 2007, Search Engine Land)
  8. Simple search box ... but user search intentionality for books vs. web can differ: “mark twain hawai’i”
  9. Google Scholar is a vertical search engine: an explicit opt-in discovery service for STM journal content, utilized in HE academia. There are many concerns with combining the Scholar product with Big Daddy. User search goals differ; the content is distinct; the indexing is different.
  10. From 2007 to early 2009, I was the Director of the Digital Library Federation. I made a request of Google to update members on GBS status at DLF’s Fall Forum, Nov. 2008. They issued an explicit request for HE CS/EE attention to the problem of integrating book and web search. Paraphrasing: “Not a well solved problem”.
  11. Some comparisons between web pages and books.
  12. web: short doc (web page) length
      books: long doc (book) length
  13. web: high data density (per doc size)
      books: highly variant data density (e.g. fiction vs. non-fiction)
  14. web: trillions of unique web pages
      books: (low) millions of unique books
  15. web: many complex media types
      books: text and image media
  16. web: dynamic over time (avg. TTL of web pages is short)
      books: static over time (print books permanently fixed)
  17. web: single instances (web pages)
      books: duplicate instances (copies), similar instances (editions), in multiple languages
  18. web: hyperlinked in/out (useful in relevance)
      books: normally quiescent (sometimes citations)
  19. web: designed component structure {page hierarchy > web site}
      books: artificial component structure {page images > book}
  20. Bibliographic data cf. full text (book) data: The Melvyl Recommender Project, Full Text Extension (Supplementary Report), California Digital Library, October 2006. Funded by the Andrew W. Mellon Foundation.
  21. Project Lead: Peter Brantley, Director of Technology.
      Implementation Team: Kirk Hastings, Text Systems Designer; Martin Haye, Programmer (Contractor); Steve Toub, Web Design Manager; Colleen Whitney, Programmer and Coordinator.
      Assessment Team: Jane Lee, Assessment Analyst; Felicia Poe, Assessment Coordinator; Lisa Schiff, Digital Ingest Programmer.
  22. There are often many different editions of popular books, which can easily and artificially boost search scores (n_copies). E.g. “Moby Dick” has been published 100s of times (and in many languages). Depending on publication date, a book is either public domain (dep. on country) or in-copyright (out-of-print or in-print).
  23. In CDL tests, for texts vs. bib records: search scoring for full text documents was typically 10 - 100 times larger than for metadata-only records. (Probably an approximate magnitude cf. representative web pages.)
  24. Easy for a single work to overwhelm web pages in relevance for a well-fitting query. E.g. “English working class labor industrial”: The making of the English working class. Author: E P Thompson. Publisher: New York, Pantheon Books [1964, ©1963].
  25. Books are long strings of many words, split into n_sized chunks for parsing. Term indexing is based on overlapping and variant length “word vectors”:
      “battle” “of” “britain”
      “battle of” “britain”
      “battle” “of britain”
      “battle of britain”
  26. {Search Term} and {Document} weights
      1. How often is a search term found within a given sized chunk of text?
      2. How many chunks of text is the term found within?
      3. How many chunks of text does the document contain?
  27. Which is better?
      1. Adequate matches over many fields,
      2. Better matches in fewer fields.
      Metrics vary between books and web. One learns from one’s mistakes. More books, more mistakes.
  28. 1. Books are sooo much longer than web pages.
      2. Books produce 1000’s more chunks than web.
      3. Term weighting is very complex for long docs.
      4. Indexes must be integrated for web and books.
      5. But source term indexes are biased differently.
  29. II. What you get from books
  30. The dialectic between books and web provides benefits from their integration (no matter the pain). Books enrich general web search, not just via the data within books, but also by books-as-data.
  31. All search is made smarter by analysis.
      1. structure
      2. contextualization
      3. relatedness
      4. normalization
      5. association
  32. Because of digitization, books have complications cf. web pages; a result of OCR.
      1. Language detection
      2. Determining which words get indexed (minus stop words like “of” “a” “the” etc.)
      3. OCR mistakes hamper word recognition
  33. Common OCR traps: embedded languages; Latin or archaic spelling; complex scripts (e.g. captions); hyphenated words.
  34. ricain, ricanant, ricaine, ricanante, ricaines, ricane, ricana, ricamente, ricanai, ricanement, ricains, ricanements, rical, rican, rically, ricanes, ricals, ricans
  35. More words from more books, more spelling mistakes. This is a good thing! Leads to improved spelling correction (in multiple languages) and more sensitive translation.
  36. “Our understanding of language is, in large part, built inductively from statistical analysis of large samples of language as used ‘in the wild,’ and the larger the sample, the better our understanding.” - Hank Bromley, IA
  37. “Before the 1930’s, and even 40’s or 50’s in some parts, at harvest time, a horse or mule drawn wagon would go through the field, straddling two rows of corn. Adults working on each side of the wagon would pull the corn from the standing corn stalks and toss it into the wagon. The unfortunate younger ones would have to pull corn from the down rows – stoop labor in its worst form.” - JDB
  38. Statistical analysis (of which terms tend to appear in the vicinity of which others) is useful not only for context-sensitive OCR, but more significantly, for building semantic maps and other kinds of knowledge representation. E.g. “dead as a door nail”: the term “door nail” is not commonly found elsewhere.
  39. Analysis via co-occurrence enables one to construct a better general search engine by enhancing the ability to distinguish among multiple meanings of a given word based on the context in which the word occurs.
  40. LSA (latent semantic analysis) is a CS term referring to a technique in “natural language processing ... of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms.” - Wikipedia.org
  41. (LSI = LSA in context of info retrieval (IR).) “Clustering is a way to group documents based on their conceptual similarity to each other ... . This is very useful when dealing with an unknown collection of unstructured text.”
  42. “Because it uses a strictly mathematical approach, LSI is inherently independent of language. This enables LSI to elicit the semantic content of information written in any language without requiring the use of auxiliary structures, such as dictionaries and thesauri.”
  43. “[Q]ueries can be made in one language, such as English, and conceptually similar results will be returned even if they are composed of an entirely different language or of multiple languages.”
  44. “LSI automatically adapts to new and changing terminology, and it has been shown to be very tolerant of noise (i.e., misspelled words, typographical errors, unreadable characters, etc.).” “This is especially important for applications using text derived from Optical Character Recognition (OCR) ...” - Wikipedia.org
  45. The More Data, The Better ... The More Books, The Better Web Search.
  46. Contact information: peter brantley, internet archive; @naypinya (twitter); peter @ archive.org
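
Slides 25 and 26 describe splitting a book into fixed-size chunks, indexing overlapping word n-grams of varying length, and collecting three per-document statistics. The following is a minimal sketch of that idea, not the CDL or Google implementation. The function names, the whitespace tokenizer, and the choice of a maximum n-gram length of 3 are all assumptions made for illustration.

```python
from collections import Counter


def word_ngrams(text, max_n=3):
    """Return every contiguous word n-gram of length 1..max_n (hypothetical helper)."""
    words = text.lower().split()  # naive whitespace tokenizer, an assumption
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams


def chunk_words(text, chunk_size):
    """Split a long text into consecutive chunks of chunk_size words."""
    words = text.lower().split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]


def chunk_statistics(term, chunks, max_n=3):
    """Collect the three quantities listed on slide 26 for one document."""
    # 1. How often the term is found within each chunk.
    counts_per_chunk = [Counter(word_ngrams(c, max_n))[term] for c in chunks]
    return {
        "counts_per_chunk": counts_per_chunk,
        # 2. How many chunks the term is found within.
        "chunks_containing_term": sum(1 for c in counts_per_chunk if c > 0),
        # 3. How many chunks the document contains.
        "total_chunks": len(chunks),
    }
```

In this sketch n-grams never span a chunk boundary, so a phrase split across two chunks is not counted. A book yields far more chunks than a web page, which is one reason the same statistics end up on very different scales for the two kinds of document.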

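Slides 38 and 39 rely on counting which terms appear near which others and using those counts to separate the meanings of a word. A rough sketch of both steps follows. The window size, the function names, and the simple overlap score are assumptions for illustration; the slides do not specify a particular method.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(sentences, window=2):
    """Count how often each word appears within `window` words of another."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()  # naive tokenizer, an assumption
        for i, word in enumerate(words):
            lo = max(0, i - window)
            hi = min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][words[j]] += 1
    return counts


def pick_sense(context_words, sense_neighbours):
    """Return the sense label whose typical neighbours best overlap the context.

    sense_neighbours maps a hypothetical sense label to a Counter of words
    that tend to co-occur with that sense.
    """
    def overlap(label):
        return sum(sense_neighbours[label][w] for w in context_words)
    return max(sense_neighbours, key=overlap)
```

With a larger corpus, such as a collection of digitized books, the neighbour counts become less sparse, which is the sense in which more books can help general search.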