Books and Webs: Pulling the Down Rows
 

Elementary explanation of the difficulties of combining indexes for web pages and books, and means by which book index data can optimize general web searches at scale.



Presentation Transcript

    • Peter Brantley, Internet Archive. The Presidio, 11.09
    • Essential premise: combining web search with book search is an engineering challenge.
    • I. Presenting combined search
    • For several years, I served the University of California as the Director of Technology for the California Digital Library (the digital library group for the UC system).
    • We held various conversations over time with Google engineers in similar spaces ... grappling with the indexing, search, and user-interface issues of combined but disparate content pools (books, journals, web, image, video). (An important issue for digital libraries.)
    • In academic info markets, "metasearch" (distributed queries with central resolution) contested for primacy with search over aggregated content. To an extent, only LANL and commercial search pursued aggregation at scale. Aggregation wins.
    • "Google is undertaking the most radical change to its search results ever, introducing a 'Universal Search' system that will blend listings from its news, video, images, local and book search engines among those it gathers from crawling web pages." "With Universal Search, Google will hit a range of its vertical search engines, then decide if the relevancy of a result from book search is higher than a match from web page search." - Danny Sullivan, "Google 2.0", May 16, 2007, Search Engine Land
    • Simple search box ... but user search intentionality for books vs. the web can differ: "mark twain hawai'i"
    • Google Scholar is a vertical search engine: an explicit opt-in discovery service for STM journal content, used in higher-education academia. Many concerns with combining the Scholar product with Big Daddy: user search goals differ, content is distinct, indexing is different.
    • From 2007 to early 2009, I was the Director of the Digital Library Federation. I made a request of Google to update members on GBS status at DLF's Fall Forum, Nov. 2008. They issued an explicit request for higher-ed CS/EE attention to the problem of integrating book and web search. Paraphrasing: "Not a well-solved problem."
    • Some comparisons between web pages and books.
    • web: short doc (web page) length; books: long doc (book) length
    • web: high data density (per doc size); books: highly variant data density (e.g. fiction vs. non-fiction)
    • web: trillions of unique web pages; books: (low) millions of unique books
    • web: many complex media types; books: text and image media
    • web: dynamic over time (avg. TTL of web pages is short); books: static over time (print books permanently fixed)
    • web: single instances (web pages); books: duplicate instances (copies), similar instances (editions), in multiple languages
    • web: hyperlinked in/out (useful in relevance); books: normally quiescent (sometimes citations)
    • web: designed component structure {page hierarchy > web site}; books: artificial component structure {page images > book}
    • Bibliographic data cf. full-text (book) data: The Melvyl Recommender Project, Full Text Extension (Supplementary Report). California Digital Library, October 2006. Funded by the Andrew W. Mellon Foundation.
    • Project Lead: Peter Brantley, Director of Technology. Implementation Team: Kirk Hastings, Text Systems Designer; Martin Haye, Programmer (Contractor); Steve Toub, Web Design Manager; Colleen Whitney, Programmer and Coordinator. Assessment Team: Jane Lee, Assessment Analyst; Felicia Poe, Assessment Coordinator; Lisa Schiff, Digital Ingest Programmer.
    • Often many different editions of popular books. Can easily artificially boost search (n_copies). E.g. "Moby Dick" published 100s of times (and in many languages). Depending on publication date: either public domain (dep. on country) or in-copyright (out-of-print or in-print).
    • In CDL tests, for texts vs. bib records: search scoring for full-text documents was typically 10-100 times larger than for metadata-only records. (Probably the approximate magnitude cf. representative web pages.)
    • Easy for a single work to overwhelm web pages in relevance for a well-fitting query. E.g. "English working class labor industrial": The Making of the English Working Class. Author: E. P. Thompson. Publisher: New York, Pantheon Books [1964, ©1963].
    • Books are long strings of many words, split into n_sized chunks for parsing. Term indexing based on overlapping and variant-length "word vectors": "battle" "of" "britain" | "battle of" "britain" | "battle" "of britain" | "battle of britain"
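The overlapping, variant-length "word vectors" on the slide are plain word n-grams. A minimal sketch of generating them (my own illustration; the function name and max_n parameter are assumptions, not the indexing code discussed in the talk):

```python
def word_ngrams(text, max_n=3):
    """Generate all overlapping word n-grams of length 1..max_n."""
    words = text.lower().split()
    grams = []
    for n in range(1, max_n + 1):            # n = 1, 2, 3
        for i in range(len(words) - n + 1):  # slide a window of width n
            grams.append(" ".join(words[i:i + n]))
    return grams

# Produces every sub-phrase shown on the slide:
# ['battle', 'of', 'britain', 'battle of', 'of britain', 'battle of britain']
print(word_ngrams("battle of britain"))
```

In a real book index these n-grams would be generated per chunk rather than over the whole text, which is what makes the chunking decisions above matter.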
    • {Search Term} and {Document} weights: 1. How often is a search term found within a given-sized chunk of text? 2. How many chunks of text is the term found within? 3. How many chunks of text does the document contain?
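The three quantities above correspond to classic term-frequency statistics, with the chunk as the unit of retrieval. A hedged sketch (the function and variable names are mine, not CDL's or Google's):

```python
from collections import Counter

def chunk_stats(chunks, term):
    """chunks: list of token lists, one per fixed-size chunk of a book.
    Returns (1) term frequency per chunk, (2) number of chunks containing
    the term, (3) total number of chunks in the document."""
    tf_per_chunk = [Counter(chunk)[term] for chunk in chunks]
    chunks_with_term = sum(1 for tf in tf_per_chunk if tf > 0)
    return tf_per_chunk, chunks_with_term, len(chunks)

doc = [["battle", "of", "britain"], ["the", "battle", "raged"], ["peace", "at", "last"]]
print(chunk_stats(doc, "battle"))  # ([1, 1, 0], 2, 3)
```

A long book inflates quantity (3) enormously relative to a web page, which is the root of the scoring bias between the two corpora.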
    • Which is better? 1. Adequate matches over many fields, or 2. Better matches in fewer fields? Metrics vary between books and the web. One learns from one's mistakes. More books, more mistakes.
    • 1. Books are sooo much longer than web pages. 2. Books produce 1,000s more chunks than the web. 3. Term weighting is very complex for long docs. 4. Indexes must be integrated for web and books. 5. But source term indexes are biased differently.
    • II. What you get from books
    • The dialectic between books and the web provides benefits from their integration (no matter the pain). Books enrich general web search, not just via the data within books, but also by books-as-data.
    • All search is made smarter by analysis: 1. structure, 2. contextualization, 3. relatedness, 4. normalization, 5. association.
    • Because of digitization, books have complications cf. web pages, a result of OCR: 1. language detection; 2. determining which words get indexed (minus stop words like "of", "a", "the", etc.); 3. OCR mistakes hamper word recognition.
    • Common OCR traps: embedded languages; Latin or archaic spelling; complex scripts (e.g. captions); hyphenated words.
    • ricain, ricanant, ricaine, ricanante, ricaines, ricane, ricana, ricamente, ricanai, ricanement, ricains, ricanements, rical, rican, rically, ricanes, ricals, ricans
    • More words from more books, more spelling mistakes. This is a good thing! Leads to improved spelling correction (in multiple languages) and more sensitive translation.
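One standard way a large corpus drives spelling correction is the noisy-channel approach: generate candidate spellings one edit away and rank them by corpus frequency. A toy Norvig-style sketch (the corpus and names are assumptions for illustration, not the Archive's engine):

```python
from collections import Counter

CORPUS = Counter("the battle of britain began the battle raged over britain".split())
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Most frequent in-corpus candidate; fall back to the word itself."""
    candidates = [w for w in edits1(word) | {word} if w in CORPUS]
    return max(candidates, key=CORPUS.get) if candidates else word

print(correct("brittain"))  # → 'britain' (a delete edit found in the corpus)
```

More books mean larger, multilingual frequency tables, so even noisy OCR text is a net win for the corrector.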
    • "Our understanding of language is, in large part, built inductively from statistical analysis of large samples of language as used 'in the wild,' and the larger the sample, the better our understanding." - Hank Bromley, IA
    • "Before the 1930s, and even the 40s or 50s in some parts, at harvest time a horse- or mule-drawn wagon would go through the field, straddling two rows of corn. Adults working on each side of the wagon would pull the corn from the standing corn stalks and toss it into the wagon. The unfortunate younger ones would have to pull corn from the down rows: stoop labor in its worst form." - JDB
    • Statistical analysis of which terms tend to appear in the vicinity of which others is useful not only for context-sensitive OCR but, more significantly, for building semantic maps and other kinds of knowledge representation. "Dead as a door nail": the term "door nail" is not commonly found elsewhere.
    • Analysis via co-occurrence enables one to construct a better general search engine by enhancing the ability to distinguish among multiple meanings of a given word based on the context in which the word occurs.
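Co-occurrence statistics of this kind can be sketched as counting term pairs that appear within a small window of each other (a toy illustration; the window size and names are my assumptions):

```python
from collections import defaultdict

def cooccurrence(tokens, window=2):
    """Count unordered term pairs appearing within `window` positions."""
    counts = defaultdict(int)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            counts[tuple(sorted((w, tokens[j])))] += 1
    return counts

tokens = "dead as a door nail dead as a door nail".split()
counts = cooccurrence(tokens)
print(counts[("door", "nail")])  # → 2: "door" and "nail" co-occur tightly
```

At book scale, these counts feed both context-sensitive OCR correction and the semantic maps the previous slide mentions.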
    • LSA is a CS term referring to a technique in "natural language processing ... of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms." - Wikipedia.org
    • (LSI = LSA in the context of information retrieval (IR).) "Clustering is a way to group documents based on their conceptual similarity to each other ... . This is very useful when dealing with an unknown collection of unstructured text."
    • "Because it uses a strictly mathematical approach, LSI is inherently independent of language. This enables LSI to elicit the semantic content of information written in any language without requiring the use of auxiliary structures, such as dictionaries and thesauri."
    • "[Q]ueries can be made in one language, such as English, and conceptually similar results will be returned even if they are composed of an entirely different language or of multiple languages."
    • "LSI automatically adapts to new and changing terminology, and it has been shown to be very tolerant of noise (i.e., misspelled words, typographical errors, unreadable characters, etc.). This is especially important for applications using text derived from Optical Character Recognition (OCR) ..." - Wikipedia.org
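The "strictly mathematical approach" behind LSI is a truncated singular value decomposition of the term-document matrix: documents land in a low-dimensional concept space where conceptually similar ones cluster, regardless of surface vocabulary. A minimal numpy sketch over toy counts (the matrix values are invented for illustration):

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
# Docs 0 and 1 share "book"/"search" vocabulary; doc 2 is about corn.
A = np.array([
    [2.0, 1.0, 0.0],   # "book"
    [1.0, 2.0, 0.0],   # "search"
    [0.0, 0.0, 4.0],   # "corn"
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                # keep the top-k latent concepts
docs = (np.diag(s[:k]) @ Vt[:k]).T   # each row: a document in concept space

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Docs 0 and 1 collapse onto the same concept; doc 2 stays orthogonal.
print(round(cos(docs[0], docs[1]), 3), round(abs(cos(docs[0], docs[2])), 3))  # → 1.0 0.0
```

Dropping the smallest singular value is what merges the two "book search" documents into one concept; that dimension carried only their surface-level wording difference.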
    • The More Data, The Better ... The More Books, The Better Web Search.
    • Contact information: peter brantley, internet archive. @naypinya (twitter). peter @ archive.org