Synaptica Proquest Talk Taxonomy Boot Camp 2009


PowerPoint presentation given by Dave Clarke, CEO, Synaptica, LLC, and Paula McCoy of ProQuest at Taxonomy Boot Camp 2009 in San Jose, California.

Published in: Technology, Business
  • People will go to extraordinary lengths to invent tools to save time. All of us would like to be in a place where machines can take over information indexing for us. But when is this possible, and when should it be avoided?
  • This chart examines when it is appropriate to let machines or people perform indexing. It is not an exact science: individual circumstances require an evaluation of all these factors, plus business factors such as ease of access to human indexers, IT resources and funds. Broadly speaking, however, certain factors steer one toward certain solutions.
    Factors that lean toward machine indexing: If the data set is so large that humans could not possibly process it, machine indexing may be the only solution, regardless of any qualitative factors. If the data set is fast moving and access to it is time-sensitive, machine indexing can also be the preferred solution, although small sets of fast-moving data may still be processed by humans. If the users are generalists, or are pursuing information for recreational purposes, they are likely to be more tolerant of noisy or incomplete results.
    Factors that lean toward human indexing: If the data is not at all machine-readable, human indexing may be the only solution. For example, photographs and video without any metadata or embedded speech may require human review. If the data contains subtle or abstract concepts, these may elude even the most finely tuned machines. For example, the ideas behind the "To be or not to be" soliloquy in Hamlet are too subtle to be identified from textual analysis alone. If the users are experts for whom data is a mission-critical resource, they may require exceedingly high precision and recall, which would demand either human indexing or an extremely high degree of human training and QC of the machine process.
    Factors that benefit either indexing method: If data is well structured within identifiable fields or metadata attributes, this structure provides context that will greatly assist machine indexing and also help with human indexing. If data is on a homogeneous topic, such as a database of articles all about nuclear physics, it will be easier to index than a database covering all disciplines and topics.
  • The human indexing process essentially involves three simple steps: (1) review the content one article / record at a time; (2) search the controlled vocabularies to find the terms that best describe the content; (3) either tag the content directly by adding the terms as metadata values within the CMS, or assign the indexing terms to the content item by using a separate index table / interface.
  • Most of our user-base create their taxonomies in Synaptica and then integrate them with third-party automatic indexing tools. Others have determined that they need to perform human indexing and over the years they have developed a wish list of time-saving tools. (see bullets for wish list)
  • Ten years ago the Synaptica software team productized this wish list and bundled all these features into a Synaptica package called IMS. IMS – the Indexing Management System – acts as an integration toolset between the taxonomy management system and content management system. It provides ready-made GUI screens, and also a suite of web services components that allow indexing functionality to be custom crafted into the CMS screenflow.
  • This slide illustrates the workflow for IMS as a component that sits between a CMS and a taxonomy management system to assist the human indexing process.
  • This screen shot illustrates how indexing profiles can be created to streamline the indexing operation for particular sets of content. Many parameters can be configured such as user-access permissions, term expansion, access to particular vocabularies and facets, even the selection of individual sub-branches within a hierarchy.
  • We are actively working with a number of clients who are performing human indexing for selected data sets. Following are three “hypothetical” but realistic examples.
  • Conclusions: (see bullets for conclusions)
  • Questions?
  • Synaptica Proquest Talk Taxonomy Boot Camp 2009
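
The three-step human indexing process described in the notes above, together with the wish-list idea of substituting variants with their preferred terms, can be sketched in a few lines of Python. This is only an illustrative sketch and not the IMS implementation: the `Vocabulary` class, the `index_item` function, the sample terms and the item ID are all hypothetical names and data invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Vocabulary:
    """A toy controlled vocabulary (hypothetical, not a Synaptica API)."""
    preferred: set = field(default_factory=set)
    # Maps each lowercased variant (non-preferred term) to its preferred term.
    variants: dict = field(default_factory=dict)

    def add_term(self, preferred, variants=()):
        self.preferred.add(preferred)
        for variant in variants:
            self.variants[variant.lower()] = preferred

    def resolve(self, entry):
        """Return the preferred term for an indexer's entry, or None."""
        for term in self.preferred:
            if term.lower() == entry.lower():
                return term
        return self.variants.get(entry.lower())


def index_item(item_id, entries, vocab, index_table):
    """Step 3: assign resolved terms to the item in a separate index table.

    Entries with no match are returned so they can be submitted
    as candidate terms instead of being silently dropped.
    """
    candidates = []
    for entry in entries:
        term = vocab.resolve(entry)
        if term is None:
            candidates.append(entry)
        else:
            index_table.setdefault(item_id, set()).add(term)
    return candidates


# Step 2: a tiny vocabulary with one preferred term and two variants.
vocab = Vocabulary()
vocab.add_term("Motor vehicles", ["cars", "automobiles"])

# Steps 1 and 3: the indexer reviews "article-1" and enters three terms.
index_table = {}
unmatched = index_item(
    "article-1", ["Cars", "motor vehicles", "hovercraft"], vocab, index_table
)
```

Here both "Cars" and "motor vehicles" collapse onto the single preferred term, while "hovercraft" is returned as a candidate term for the vocabulary editors to review.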

    1. Taxonomies: Tools or People? by Dave Clarke & Paula McCoy
       When would one favor human indexing over machine indexing? An example of the human indexing effort is presented along with tools that can help with the process. An example of autocategorization is illustrated with a discussion of the reciprocal flow of information between the taxonomy management tool and the autocategorization tool. Speakers then discuss how structured vocabularies help refine categorizers and how feedback from the categorizer tool to the human editorial team contributes to the continual improvement of the vocabularies.
       TBC; Taxonomies: Tools or People? By Dave Clarke & Paula McCoy Copyright © Synaptica, LLC, 2009 www.synapticasoftware.com 12/09/09
    2. HUMAN VS. MACHINE & THE HUMAN OPTION
       • Dave Clarke
       • CEO
       • Synaptica, LLC
       • [email_address]
    3. Humans will invent almost anything to save time
    4. Human or machine indexing – depends on the data and the user
       • Factors favoring machines: Size (very high volume); Time-sensitivity (very quick turnaround); Generalist users (noisy or incomplete results tolerable)
       • Factors favoring humans: Machine-readability (non-textual, e.g. images, sounds); Conceptual-abstraction (subtle & abstract concepts); Expert users (mission-critical precision & recall)
       • Factors that benefit either: Data-structure (highly structured); Homogeneity (homogeneous topics)
    5. Human indexing – the process
       • Data Set: 1. Review the content
       • Controlled Vocabularies: 2. Consult the vocabularies
       • Index Table: 3. Either tag the content item or build an index table
    6. Human indexing – a wish list of time-saving tools
       • Minimize switching between screens: integrate vocabulary search / browse with content interface
       • Filter specific metadata elements to restrict lookup to relevant vocabularies or subsets of vocabularies
       • Search-as-you-type access to controlled vocabularies
       • Tree-browse as an alternative to search
       • Redirect queries at any time by exploring semantic relationships
       • Inline definitional and indexer notes
    7. Human indexing – a wish list of time-saving tools
       • Self-correcting substitution of variants with their preferred terms
       • Optional pre-population of possible target terms based on text matches
       • In-line submission of candidate terms where no appropriate term is identified
       • Optional automatic expansion of tag-set to include variants, parents, children, associations, language equivalents and crosswalk schema equivalents
       • Profile templates to save user- and content-based indexing preferences
    8. Human indexing – Synaptica’s “IMS” Toolbox
       • Minimize switching between screens: integrate vocabulary search / browse with content interface
       • Filter specific metadata elements to restrict lookup to relevant vocabularies
       • Search-as-you-type access to controlled vocabularies
       • Tree-browse and drop-down pick-list alternatives to search
       • Redirect queries at any time by exploring semantic relationships
       • Inline definitional and indexer notes
       • Self-correcting substitution of variants with their preferred terms
       • Optional pre-population of possible target terms based on text matches
       • In-line submission of candidate terms where no appropriate term is identified
       • Optional automatic expansion of tag-set to include variants, parents, children, associations, language equivalents and crosswalk schema equivalents
       • Profile templates to save user- and content-based indexing preferences
    9. Human indexing – IMS Workflow Detail
    10. Human indexing – profile set up screen shot
    11. Human indexing – examples
        • A national library could use IMS to human index digital images and multimedia assets against a set of authority files.
        • A professional services corporation could use IMS to human index mission-critical legal documents against a taxonomy of compliance terminology.
        • A multinational electronics company could use IMS to human index product data according to product lines and families, hardware assets and other product-based keyword groups.
    12. Human indexing – conclusions
        • Like everything else in life, if we can possibly pass the task on to machines, we’d like to
        • There are some situations where machines are the only solution and there are others where human indexing is required (non-machine-readable data sets, subtle/abstract concepts, mission-critical precision-recall requirements, etc.)
        • If human indexing is required there are tools that can help speed up the process and help attain indexing consistency
        • The Synaptica “wish list” represents those time-saving tools requested by our user base over the past ten years
    13. AUTOCATEGORIZATION: A CASE STUDY USING SYNAPTICA
        • Paula McCoy
        • Manager, Taxonomy Development
        • ProQuest
        • [email_address]
        TBC; Taxonomies: Tools or People? By Dave Clarke & Paula McCoy Copyright © Proquest, Inc., 2009 www.proquest.com 12/09/09
    14. • Information aggregator & database producer, with content ranging from newspapers to academic/scholarly publications, in topics spanning business and management, STM (scientific, technical, medical), humanities, social science, general reference
        • Abstracts/indexes more than 6,000 periodicals and newspapers
        • Daily ingest of more than 60,000 new newspaper and newswire articles
        • Customer base: Public and academic libraries
        • End users: Academic and student researchers
    15. • The Mandate:
        • To promote discovery of all content relevant to the user’s search query
        • The Solution:
        • Index and abstract as much content as possible in order to maximize the number of “entry points” to an article.
            • Indexing provided for different parts of an article:
                • SUBJECTS
                • COMPANIES
                • PEOPLE
                • LOCATIONS
            • Abstracts provided for all articles of minimum length
        ProQuest Search Interface
    16. • A Growing Challenge:
        • How to A&I hundreds of thousands of new articles every day?
        • The Only Answer:
        • Autocategorization, or auto-indexing:
        • Machine-based application of index terms to a document or other object
        ProQuest Search Interface
    17. The Autocategorization Solution
        • Basic Tenets of Autocategorization:
        • Must have a controlled vocabulary in place
        • Must have other controlled lists if you want to index companies, people, locations, etc.
        • Must have a way to manage your vocabularies
        • Must have a way to manage the results of the autocat: no automated indexing method is perfect
        • Autocat success rests upon the existence of a strong controlled vocabulary with a history of usage from which the automation software can learn.
    18. The ProQuest Approach
        • Implement Synaptica thesaurus management solution to manage 11,300+-term subject thesaurus and authority files for companies, people, and locations
        • Purchase Nstein Technologies’ Text Mining Engine (TME) solution to automate abstracting and indexing of subject and other terms
        • Train the TME to understand the usage of ProQuest thesaurus terms (3-month collaborative process)
        • Implement Nstein’s Knowledge Base Manager (TME Manager) to manage subject terms rules base
        • Synaptica Taxonomy Manager Nstein
    19. Thesaurus and Autocat Management
        • Synaptica Thesaurus Management:
        • New terms added, hierarchies revised, Scope Notes added/revised
        • Use For (non-preferred) terms added frequently to reflect variant usages in the indexed literature and provide additional cross-references
        • Nstein Autocat Management:
        • Nstein TME Manager tool used to manage indexing rules base for all thesaurus terms
        • Autocat rules supplement and complement the underlying concept training
        • Autocat rules can be added, deleted, revised
        • Autocat rules enable autocat indexing to keep up with changes in term usages so that new variants can be added and rules created based on current topics in the literature or in the news
    20. Synaptica-TME Interaction
        • Thesaurus management informs 2 levels of indexing: manual and automated
        • The thesaurus as represented in Synaptica must display all cross-references (mainly Use refs) required by manual indexers
        • The thesaurus as represented in Nstein must contain rules reflecting those Use references
        • Term updates made in Synaptica are duplicated in Nstein via indexing rules
        • Use references in Synaptica point human indexers to the right term
        • Use references in Nstein rules base point the automated indexer to the right term
    21. Synaptica & Autocat: Benefits
        • A semantic-based autocat solution indexes as well as it’s been trained → that training is most successful if based on years of manual indexing using a controlled subject vocabulary → combined with a rules base, autocat can produce intelligent and informed indexing
        • Reviewing the results of good autocat leads to comparison with ongoing manual indexing → questions about term usages rise to the surface → human indexing can improve by becoming more flexible and adaptable to changes in terminology → revised term usages are reflected in Synaptica
        • Human indexers raise issues of new term variants and need for new terms → Synaptica is updated → the rules base is updated to allow autocat to capture terms better
    22. Benefits for Synaptica Thesaurus Control
        • Day-to-day review of automated indexing highlights correct and incorrect term usages, leading to greater discipline in Synaptica thesaurus management to ensure human indexers remain aware of terms and their proper usage.
        • The need for precision in subject terms means terms must be exact and descriptive; automated indexing will not work with vague, ambiguous terms or one-word terms with multiple meanings, like “Apologies,” “Affect,” “Articulation.” The result is a more robust and controlled subject vocabulary.
        • Automated indexing will use terms in the thesaurus that human indexers may have forgotten about, leading again to revised hierarchies in Synaptica, new Scope Notes, and instant feedback to indexers.
    23. Questions? [email_address] [email_address]
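
The Synaptica-TME interaction described in slides 19 and 20, where Use references maintained in the thesaurus are mirrored as rules that point the automated indexer to the right term, can be illustrated with a small Python sketch. This is an assumption-laden toy, not Nstein's TME or its rules syntax: the `thesaurus` dictionary, the `build_rules` and `auto_index` functions, and the sample terms are all hypothetical, and plain regular-expression matching stands in for whatever semantic training and rules the real engine uses.

```python
import re

# Hypothetical thesaurus export: preferred term -> Use For (non-preferred) variants.
thesaurus = {
    "Motor vehicles": ["cars", "automobiles"],
    "Climate change": ["global warming"],
}


def build_rules(thesaurus):
    """Derive one whole-phrase matching rule per preferred term and per variant."""
    rules = {}
    for preferred, variants in thesaurus.items():
        for phrase in [preferred, *variants]:
            pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
            rules[phrase.lower()] = (pattern, preferred)
    return rules


def auto_index(text, rules):
    """Return the sorted preferred terms whose rules match the text."""
    return sorted(
        {preferred for pattern, preferred in rules.values() if pattern.search(text)}
    )


rules = build_rules(thesaurus)
terms = auto_index("Automobiles are a major factor in global warming.", rules)

# A new variant surfaces in the news: before the thesaurus update it is missed.
missed = auto_index("The climate crisis deepens.", rules)

# The editor adds the variant to the thesaurus, and the rules base is rebuilt from it.
thesaurus["Climate change"].append("climate crisis")
rules = build_rules(thesaurus)
updated_terms = auto_index("The climate crisis deepens.", rules)
```

Rebuilding the rules from the thesaurus after each change keeps a single source of truth for variants, which is the point the slides make about term updates in the thesaurus being duplicated in the rules base.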