Taxonomy Assessments -
                                 Part Two
                                 February 9, 2012




                                  Access Innovations, Inc.
             Leveraging Your Content Semantically
                                             Jay Ven Eman, Ph.D., CEO
                                                  j_ven_eman@accessinn.com
                                                      www.accessinn.com
                                                     www.dataharmony.com
                                                        +1.505.998.0800
                                                       Albuquerque, NM




© 2012. Access Innovations, Inc. All rights reserved.
Indexing
     Subject term assignment
     Permanent metadata attached to the indexed object
     Used for retrieval and evaluation
     Processes
      •     Manual
            •     Publisher
            •     3rd party aggregators
            •     Authors
      •     Automated methods


Integration / workflow
     [Workflow diagram]
     Inputs: Author Submission System, Books, Conference Proceedings, Web Sites, etc.
     Classification System: Data Harmony MAIstro Server, with Thesaurus Master and M.A.I.
     Outputs: Content Repository “A” or Intermediate Processes; Content Repository “B”, etc.
     Connections: APIs, Client/Server, Web Services, HTTP-TCP/IP

Select the document collection
                                                                 CMS



                               Please select the database and the document directory to load




CMS




Sample unstructured document




Run the documents through a metadata extraction
process to create well-formed, rich XML




                                                       • Automatic (per doc template)
                                                       • E.g. Dublin Core Metadata
                                                       • Bibliographic citation




Automatically add the taxonomy
terms




                                                    Entity extraction: People,
                                                      Places, Things
                                                    Conceptual indexing: using the
                                                      taxonomy




Classification Process or Assigned Indexing
     Unstructured
          09-14-11
          “Solving the Challenge”
          By Jay Ven Eman
          The process of indexing a content object begins with…

     Structured
          <Anchor><Date>09-14-11</Date>
          <TI>“Solving the Challenge”</TI>
          <BLH>By</BLH>
          <Author>
          <AU_FN>Jay</AU_FN>
          <AU_MI></AU_MI>
          <AU_LN>Ven Eman</AU_LN>
          </Author>
          <Body>The process of indexing a content object begins with…</Body>
          <Subject>Indexing</Subject>
          <Subject>Thesauri</Subject>
          <Subject>Standards</Subject>
          <Subject>Classification</Subject>
          </Anchor>

     Classification System: Data Harmony MAIstro Server, with Thesaurus Master and M.A.I.
     Content Repository, e.g. Database
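
The step from unstructured text to a structured record can be sketched in a few lines of Python. This is only an illustration using the standard library and the element names shown on the slide; the function build_record and its parameters are hypothetical and are not part of Data Harmony or M.A.I.

```python
import xml.etree.ElementTree as ET

def build_record(date, title, first, middle, last, body, subjects):
    """Wrap extracted fields and assigned subject terms in the slide's element names.

    Hypothetical helper for illustration only; not a Data Harmony API.
    """
    anchor = ET.Element("Anchor")
    ET.SubElement(anchor, "Date").text = date
    ET.SubElement(anchor, "TI").text = title
    ET.SubElement(anchor, "BLH").text = "By"
    author = ET.SubElement(anchor, "Author")
    ET.SubElement(author, "AU_FN").text = first
    ET.SubElement(author, "AU_MI").text = middle
    ET.SubElement(author, "AU_LN").text = last
    ET.SubElement(anchor, "Body").text = body
    # One <Subject> element per assigned taxonomy term.
    for term in subjects:
        ET.SubElement(anchor, "Subject").text = term
    return anchor

record = build_record(
    "09-14-11", "Solving the Challenge", "Jay", "", "Ven Eman",
    "The process of indexing a content object begins with...",
    ["Indexing", "Thesauri", "Standards", "Classification"],
)

# Serialize, then parse again to confirm the record is well-formed XML.
xml_text = ET.tostring(record, encoding="unicode")
parsed = ET.fromstring(xml_text)
subject_terms = [s.text for s in parsed.findall("Subject")]
print(subject_terms)
```

Keeping each subject term in its own element means a repository or search layer can filter on terms without parsing the body text.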
Indexing
     Indexing measures
      •     Indexing experts
      •     Subject matter experts (SME)
      •     Hits, misses, & noise
      •     85% hits
     In conjunction with taxonomy measures
      •     Over- & under-used terms
      •     Over- & under-indexed content
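
The two taxonomy measures above can be approximated with simple counts. The sketch below is an assumption about how such a report might be built, not a description of any product feature; the thresholds (low, high, min_terms, max_terms) and the sample data are invented for illustration.

```python
from collections import Counter

def term_usage_report(assignments, taxonomy_terms, low=1, high=3):
    """Return (under_used, over_used) taxonomy terms by assignment count.

    The thresholds are illustrative assumptions, not recommended values.
    """
    counts = Counter(term for terms in assignments.values() for term in terms)
    under_used = sorted(t for t in taxonomy_terms if counts[t] < low)
    over_used = sorted(t for t in taxonomy_terms if counts[t] > high)
    return under_used, over_used

def indexing_depth_report(assignments, min_terms=2, max_terms=4):
    """Return (under_indexed, over_indexed) document ids by number of terms."""
    under = sorted(d for d, terms in assignments.items() if len(terms) < min_terms)
    over = sorted(d for d, terms in assignments.items() if len(terms) > max_terms)
    return under, over

# Invented sample: document id -> set of assigned terms.
assignments = {
    "doc1": {"Indexing", "Thesauri"},
    "doc2": {"Indexing"},
    "doc3": {"Indexing", "Standards", "Classification", "Thesauri", "Taxonomies"},
    "doc4": {"Indexing", "Standards"},
}
taxonomy = {"Indexing", "Thesauri", "Standards", "Classification",
            "Taxonomies", "Ontologies"}

under_used, over_used = term_usage_report(assignments, taxonomy)
under_indexed, over_indexed = indexing_depth_report(assignments)
print(under_used, over_used, under_indexed, over_indexed)
```

A term that is never assigned may be a candidate for removal or for better rules; a term assigned to nearly everything may need narrower terms beneath it.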



Indexing & Search Metrics
     Hit, Miss, Noise
     Subjective
      •     Relevance
      •     Aboutness
     Statistical
      •     Precision
      •     Recall
      •     Level of effort



Hit, Miss, Noise
     Hit – exactly what a human indexer would use
     Miss – human indexer would use, but system
      did not assign
     Noise – system assigned, but human did not
      •     Relevant noise – could have been assigned
      •     Irrelevant noise – just plain wrong
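
As a rough sketch, hits, misses, and noise fall out of a set comparison between the terms a human indexer chose and the terms the system assigned. Mapping them to precision and recall as below treats the human indexer as the reference standard, which is an assumption made here for illustration; the slides do not define the formulas. Separating relevant from irrelevant noise still needs a human judgment and is not attempted.

```python
def hit_miss_noise(human_terms, system_terms):
    """Compare system-assigned terms against a human indexer's terms."""
    human, system = set(human_terms), set(system_terms)
    hits = human & system      # both human and system assigned
    misses = human - system    # human would use, system did not assign
    noise = system - human     # system assigned, human did not
    return hits, misses, noise

def precision_recall(hits, misses, noise):
    """Assumed mapping: precision = hits / assigned, recall = hits / expected."""
    assigned = len(hits) + len(noise)
    expected = len(hits) + len(misses)
    precision = len(hits) / assigned if assigned else 0.0
    recall = len(hits) / expected if expected else 0.0
    return precision, recall

# Invented sample data for one document.
human = ["Indexing", "Thesauri", "Standards", "Classification"]
system = ["Indexing", "Thesauri", "Standards", "Taxonomies"]

hits, misses, noise = hit_miss_noise(human, system)
precision, recall = precision_recall(hits, misses, noise)
print(sorted(hits), sorted(misses), sorted(noise), precision, recall)
```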




Subjective
     Relevance
      •     Reflects how akin it is to the user's request
     “Aboutness”
      •     Reflects the topical match between the document
            content and the term
      •     How well the topic describes what the document is
            about
     Varies with level of conceptual terms vs. factual
      terms in the thesaurus




Indexing
     All content types & sources
      •     Inventory control
      •     Everything in, everything out
     Document types
      •     Articles
      •     Proceedings
      •     Corporate




Link to Community Resources
(Source: Helen Atkins, AACR)
     [Diagram: resources linked to one another by shared Topic A]
      •     Journal Article on Topic A
      •     Other Journal Articles on Topic A
      •     CME Activity on Topic A
      •     Upcoming Conference on Topic A
      •     Job Posting for Expert on Topic A
      •     Podcast Interview with Researcher Working on Topic A
      •     Grant Available for Researchers Working on Topic A
      •     Author Networks, Social Networking, SME – Topic A

Indexing with Data Harmony® M.A.I.™
     Rule base development
      •     80/20 rule
      •     Indexing objectives
     GUI
     Time-to-market
      •     Level of effort to build
      •     Level of effort to maintain
      •     Less than all other alternatives when
            indexing for high precision & recall


Updating Rule Base
     Automatic for matching rules when using
      Data Harmony MAIstro™
     80/20 rule
     Re-index when 5% to 10% of the taxonomy has
      changed – arbitrary ranges:
      •     Monthly with small databases – 5k to 20k
      •     Quarterly with medium – 20k to 1 million
      •     Annual with large – greater than 1 million
     Depends on search software, too
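
The rules of thumb above could be encoded as a small helper. The sketch assumes the sizes on the slide are record counts (the slide gives only bare numbers), treats anything under 20k as "small", and resolves the shared boundaries (20k, 1 million) toward the medium range; all of these are assumptions, as are the function names.

```python
def reindex_interval(record_count):
    """Map database size to the re-indexing cadence listed on the slide.

    Assumes sizes are record counts. The slide does not cover databases
    below 5k; they are treated as "small" here.
    """
    if record_count > 1_000_000:
        return "annual"
    if record_count >= 20_000:
        return "quarterly"
    return "monthly"

def needs_reindex(changed_terms, total_terms, threshold=0.05):
    """True once the share of changed taxonomy terms reaches the threshold.

    The default uses the lower end of the slide's 5% to 10% range.
    """
    return total_terms > 0 and changed_terms / total_terms >= threshold

print(reindex_interval(10_000), reindex_interval(500_000), reindex_interval(2_000_000))
print(needs_reindex(60, 1000), needs_reindex(10, 1000))
```

Because the slide itself calls these ranges arbitrary, both the cadence and the threshold are best kept as configurable values rather than fixed constants.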

NAMES




What’s in a name?
     Juliet:
"What's in a name? That which
      we call a rose
     By any other name would smell as
      sweet."
     Romeo and Juliet (II, ii, 1-2)




Magnitude of the Problem:
Facebook - 700 Million Users Projected for 2011 (Open-First)




         700 Million Names

        How will your boss, peers,
        anyone ever find you?


What’s in a name?
     My name
      •     Jay Ven Eman
      •     Ven Eman, Jay
      •     <First_Name>Jay</First_Name>
      •     <Last_Name>Ven Eman</Last_Name>
     Name variants
      •     Jay Von Eman
      •     Jay Van Eman
      •     Jay van Eman
      •     Jay ven Eman
      •     Jay Veneman
      •     Jay Venema
     Aliases
      •     William Henry McCarty
      •     Henry Antrim
      •     William H. Bonney
      •     Billy the Kid
     National & Cultural Conventions
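
The difference between variants and aliases shows up clearly in code. The sketch below uses Python's standard difflib to catch spelling variants by string similarity, and an explicit lookup table for aliases, which no similarity measure will find. The cutoff of 0.85, the identifier "person-0001", and the helper names are assumptions for illustration only.

```python
import difflib
import re

def normalize(name):
    """Reorder 'Last, First', lowercase, and keep only the letters a-z.

    Simplification: accented and non-Latin letters are dropped, so real
    data would need proper Unicode handling.
    """
    if "," in name:
        last, first = name.split(",", 1)
        name = f"{first} {last}"
    return re.sub(r"[^a-z]", "", name.lower())

def is_variant(candidate, canonical, cutoff=0.85):
    """Treat two names as variants when their normalized forms are similar."""
    matcher = difflib.SequenceMatcher(None, normalize(candidate), normalize(canonical))
    return matcher.ratio() >= cutoff

variants = ["Ven Eman, Jay", "Jay Von Eman", "Jay Van Eman", "Jay van Eman",
            "Jay ven Eman", "Jay Veneman", "Jay Venema"]
matched = [v for v in variants if is_variant(v, "Jay Ven Eman")]

# Aliases share no spelling, so they need an editorially maintained table.
alias_group = ["William Henry McCarty", "Henry Antrim",
               "William H. Bonney", "Billy the Kid"]
alias_index = {normalize(a): "person-0001" for a in alias_group}  # hypothetical id

print(len(matched), alias_index.get(normalize("Bonney, William H.")))
```

A similarity cutoff loose enough to catch every variant will also produce false matches between different people in a large name file, which is one reason the editorial effort stays high.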
Names
     Computationally & editorially intense
     Author submissions
     Membership records & the like
     Industry initiatives – ORCID, VIVO
     Subject term disambiguation
     Inventory control basics apply here, too
     Difficulty level is high
     Constant maintenance needed


Taxonomy Assessments -
                                 Part Two
                                 February 9, 2012


                                 Thank you! Questions?
                                  Access Innovations, Inc.
             Leveraging Your Content Semantically
                                             Jay Ven Eman, Ph.D., CEO
                                                  j_ven_eman@accessinn.com
                                                      www.accessinn.com
                                                     www.dataharmony.com
                                                        +1.505.998.0800
                                                       Albuquerque, NM






Editor's Notes

  • #7 PDF
  • #10 Post processing. “Labels” content item, but also classifies author.
  • #16 Thanks to Helen Atkins of AACR for this illustration. The real power of this is that the links can all go in all directions, so we take advantage of having the user’s attention regardless of how they step into our “web”. Continuing Medical Education (CME).
  • #21 Johnny Carson