Synaptica Proquest Talk Taxonomy Boot Camp 2009

Presentation given by Dave Clarke, CEO, Synaptica, LLC and Paula McCoy, ProQuest, on Machine vs. Human Indexing at Taxonomy Boot Camp in San Jose, 2009.

Transcript

  • 1. Taxonomies: Tools or People? by Dave Clarke & Paula McCoy
      When would one favor human indexing over machine indexing? An example of the human indexing effort is presented along with tools that can help with the process. An example of autocategorization is illustrated with a discussion of the reciprocal flow of information between the taxonomy management tool and the autocategorization tool. Speakers then discuss how structured vocabularies help refine categorizers and how feedback from the categorizer tool to the human editorial team contributes to the continual improvement of the vocabularies.
      TBC; Taxonomies: Tools or People? Copyright © Synaptica, LLC, 2009 11/25/09 By Dave Clarke & Paula McCoy www.synapticasoftware.com Slide 1
  • 2. HUMAN VS. MACHINE & THE HUMAN OPTION
      Dave Clarke, CEO, Synaptica, LLC
      dave.clarke@synapticasoftware.com
  • 3. Humans will invent almost anything to save time
  • 4. Human or machine indexing – depends on the data and the user
      • very high volume
      • very quick turnaround
      • noisy or incomplete results tolerable
      • non-textual, e.g. images, sounds
      • subtle & abstract concepts
      • mission-critical precision & recall
      • highly structured
      • homogeneous topics
  • 5. Human indexing – the process
  • 6. Human indexing – a wish list of time-saving tools
  • 7. Human indexing – a wish list of time-saving tools
  • 8. Human indexing – Synaptica’s “IMS” Toolbox
  • 9. Human indexing – IMS Workflow Detail
  • 10. Human indexing – profile setup screenshot
  • 11. Human indexing – examples
      1. A national library could use IMS to human index digital images and multimedia assets against a set of authority files.
      2. A professional services corporation could use IMS to human index mission-critical legal documents against a taxonomy of compliance terminology.
      3. A multinational electronics company could use IMS to human index product data according to product lines and families, hardware assets and other product-based keyword groups.
  • 12. Human indexing – conclusions
      1. Like everything else in life, if we can possibly pass the task on to machines, we’d like to.
      2. There are some situations where machines are the only solution and there are others where human indexing is required (non-machine-readable data sets, subtle/abstract concepts, mission-critical precision-recall requirements, etc.).
      3. If human indexing is required, there are tools that can help speed up the process and help attain indexing consistency.
      4. The Synaptica “wish list” represents those time-saving tools requested by our user base over the past ten years.
  • 13. AUTOCATEGORIZATION: A CASE STUDY USING SYNAPTICA
      Paula McCoy, Manager, Taxonomy Development, ProQuest
      paula.mccoy@proquest.com
      TBC; Taxonomies: Tools or People? Copyright © Proquest, Inc., 2009 11/25/09 By Dave Clarke & Paula McCoy www.proquest.com Slide 13
  • 14. ProQuest at a glance
      • Information aggregator & database producer, with content ranging from newspapers to academic/scholarly publications, in topics spanning business and management, STM (scientific, technical, medical), humanities, social science, general reference
      • Abstracts/indexes more than 6,000 periodicals and newspapers
      • Daily ingest of more than 60,000 new newspaper and newswire articles
      • Customer base: Public and academic libraries
      • End users: Academic and student researchers
  • 15. ProQuest Search Interface
      The Mandate: To promote discovery of all content relevant to the user’s search query
      The Solution: Index and abstract as much content as possible in order to maximize the number of “entry points” to an article.
      – Indexing provided for different parts of an article:
          • SUBJECTS
          • COMPANIES
          • PEOPLE
          • LOCATIONS
      – Abstracts provided for all articles of minimum length
  • 16. ProQuest Search Interface
      A Growing Challenge: How to abstract and index (A&I) hundreds of thousands of new articles every day?
      The Only Answer: Autocategorization, or auto-indexing: machine-based application of index terms to a document or other object
  • 17. The Autocategorization Solution
      Basic Tenets of Autocategorization:
      1. Must have a controlled vocabulary in place
      2. Must have other controlled lists if you want to index companies, people, locations, etc.
      3. Must have a way to manage your vocabularies
      4. Must have a way to manage the results of the autocat, because no automated indexing method is perfect
      Autocat success rests upon the existence of a strong controlled vocabulary with a history of usage from which the automation software can learn.
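As a rough illustration of the first tenet, the sketch below applies index terms to text by phrase matching against a controlled vocabulary. It is a toy under stated assumptions: the `VOCABULARY` contents and the `auto_index` function are hypothetical, and simple string matching stands in for the trained, semantic engine described in the talk.

```python
import re

# Hypothetical controlled vocabulary: preferred term -> variant phrases.
VOCABULARY = {
    "Mergers & acquisitions": ["merger", "acquisition", "takeover"],
    "Interest rates": ["interest rate", "prime rate"],
}


def auto_index(text, vocabulary):
    """Return the sorted preferred terms whose name or variants occur in text."""
    matched = set()
    for preferred, variants in vocabulary.items():
        for phrase in [preferred] + variants:
            # Case-insensitive whole-phrase match, tolerating a plural "s".
            pattern = r"\b" + re.escape(phrase) + r"s?\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                matched.add(preferred)
                break  # one hit is enough to assign this preferred term
    return sorted(matched)
```

Even a matcher this simple only works if the vocabulary and its variants are maintained somewhere, which is the point of the third tenet.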
  • 18. The ProQuest Approach
      1) Implement Synaptica thesaurus management solution to manage 11,300+-term subject thesaurus and authority files for companies, people, and locations
      2) Purchase Nstein Technologies’ Text Mining Engine (TME) solution to automate abstracting and indexing of subject and other terms
      3) Train the TME to understand the usage of ProQuest thesaurus terms (3-month collaborative process)
      4) Implement Nstein’s Knowledge Base Manager (TME Manager) to manage subject terms rules base
      [Diagram labels: Synaptica, Taxonomy Manager, Nstein]
  • 19. Thesaurus and Autocat Management
      Synaptica Thesaurus Management:
      • New terms added, hierarchies revised, Scope Notes added/revised
      • Use For (non-preferred) terms added frequently to reflect variant usages in the indexed literature and provide additional cross-references
      Nstein Autocat Management:
      • Nstein TME Manager tool used to manage indexing rules base for all thesaurus terms
      • Autocat rules supplement and complement the underlying concept training
      • Autocat rules can be added, deleted, revised
      • Autocat rules enable autocat indexing to keep up with changes in term usages so that new variants can be added and rules created based on current topics in the literature or in the news
  • 20. Synaptica-TME Interaction
      • Thesaurus management informs two levels of indexing: manual and automated
      • The thesaurus as represented in Synaptica must display all cross-references (mainly Use refs) required by manual indexers
      • The thesaurus as represented in Nstein must contain rules reflecting those Use references
      • Term updates made in Synaptica are duplicated in Nstein via indexing rules
      • Use references in Synaptica point human indexers to the right term
      • Use references in Nstein rules base point the automated indexer to the right term
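The two roles of a Use reference described on this slide can be sketched in a few lines: one lookup that points a human indexer from a variant to the preferred term, and one derived rule list that an automated indexer could consume. The `Term` class, the rule dictionary shape, and both function names are assumptions made for illustration; they do not reflect the actual Synaptica or Nstein data models.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    preferred: str
    use_for: list = field(default_factory=list)  # non-preferred variants


def use_reference_index(terms):
    """Map every term and variant (lower-cased) to its preferred term."""
    index = {}
    for term in terms:
        index[term.preferred.lower()] = term.preferred
        for variant in term.use_for:
            index[variant.lower()] = term.preferred
    return index


def derive_rules(terms):
    """Build one hypothetical rule per variant: if the phrase matches, assign the term."""
    return [
        {"match": variant, "assign": term.preferred}
        for term in terms
        for variant in term.use_for
    ]
```

Deriving both structures from the same term records is one way to keep the manual and automated views from drifting apart when a variant is added.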
  • 21. Synaptica & Autocat: Benefits
      • A semantic-based autocat solution indexes as well as it’s been trained → that training is most successful if based on years of manual indexing using a controlled subject vocabulary → combined with a rules base, autocat can produce intelligent and informed indexing
      • Reviewing the results of good autocat leads to comparison with ongoing manual indexing → questions about term usages rise to the surface → human indexing can improve by becoming more flexible and adaptable to changes in terminology → revised term usages are reflected in Synaptica
      • Human indexers raise issues of new term variants and need for new terms → Synaptica is updated → the rules base is updated to allow autocat to capture terms better
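The review step in this feedback loop, comparing automated indexing with manual indexing for the same article, might look like the sketch below. The function name and the returned keys are hypothetical, and treating the human indexing as the reference for precision and recall is an assumption of this example, not something the slides state.

```python
def compare_indexing(human_terms, autocat_terms):
    """Summarise agreement between human and automated index terms for one article."""
    human, auto = set(human_terms), set(autocat_terms)
    agreed = human & auto
    return {
        "agreed": sorted(agreed),
        # Terms only a human assigned: candidates for new variants or rules.
        "human_only": sorted(human - auto),
        # Terms only the autocat assigned: possible errors, or terms indexers overlooked.
        "autocat_only": sorted(auto - human),
        # Assumption: human indexing is the reference set.
        "precision": len(agreed) / len(auto) if auto else None,
        "recall": len(agreed) / len(human) if human else None,
    }
```

The `human_only` and `autocat_only` lists are the parts an editorial team would actually read, since each disagreement is a question about term usage.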
  • 22. Benefits for Synaptica Thesaurus Control
      • Day-to-day review of automated indexing highlights correct and incorrect term usages, leading to greater discipline in Synaptica thesaurus management to ensure human indexers remain aware of terms and their proper usage.
      • The need for precision in subject terms means terms must be exact and descriptive: automated indexing will not work with vague, ambiguous terms or one-word terms with multiple meanings, like “Apologies,” “Affect,” “Articulation.” The result is a more robust and controlled subject vocabulary.
      • Automated indexing will use terms in the thesaurus that human indexers may have forgotten about, leading again to revised hierarchies in Synaptica, new Scope Notes, and instant feedback to indexers.
  • 23. dave.clarke@synapticasoftware.com
      paula.mccoy@proquest.com