How to Apply Your Taxonomy to Your Content Automatically

1,593 views

Published on

Presentation by Marjorie M.K. Hlava at the 2013 SLA Annual conference in San Diego, CA, Monday, June 10, 2013.

Published in: Health & Medicine, Technology
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
1,593
On SlideShare
0
From Embeds
0
Number of Embeds
39
Actions
Shares
0
Downloads
21
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide
  • NICEM, the National Information Center for Educational Media, is a database of over 640,000 audio and visual items in all subject areas that apply to learning, from preschool through professional. Access Innovations applies subject terms from the NICEM taxonomy to each bibliographic record. In addition to that backend data work, we also created a search and presentation layer for the website, which we call Search Harmony. Here are some of the user-friendly features that are included in Search Harmony: [Click] Users can navigate the site by browsing the full taxonomy, and see the number of records tagged with each subject [CLICK] Auto-completion of search terms, which is a common feature of many search engines, but in this case the user is assisted in formulating a search by seeing a pick list of terms from the taxonomy, including all synonyms – even if they appear in the middle of a phrase. [CLICK] The user is also guided to expand the topic with broader terms or related terms from the taxonomy, or narrow the search to find more precise information. [CLICK] The resulting site has been recognized as an indispensable resource for educators.
  • Thanks to Helen Atkins of AACR for this illustration. The real power of this is that the links can all go in all directions, so we take advantage of having the user’s attention regardless of how they step into our “web”
  • How to Apply Your Taxonomy to Your Content Automatically

    1. 1. How to apply yourHow to apply your taxonomy to yourtaxonomy to your contentcontent AUTOMATICALLYAUTOMATICALLY Marjorie M.K. Hlava President Access Innovations / Data Harmony mhlava@accessinn.com www.dataharmony.com www.accessinn.com 505-998-0800
    2. 2. Three things to cover today 1. Operationalizing Auto Indexing (MAI) 2. Where is MAI using a taxonomy used now 3. How does the MAI work?
    3. 3. Operationalizing the taxonomy • Expected organizational change • How is it put into place - Getting it applied Getting people to understand • Getting the organization to embrace the taxonomy • Two areas to consider – Technical aspects of implementation – Cultural Aspects of changes use patterns
    4. 4. Expected organizational change • Find and discover ability improved! • Automatic implementation • Tagging everything • Document management plan • Make it part of the infrastructure • Thinking in outlines • Everyone uses the same outline - taxonomy
    5. 5. Getting the organization to embrace the taxonomy  • Use the MAI as an aid –Indexing = applying the taxonomy terms –Automatic or background indexing –Index on Save or submit –Automatically index the old documents whenever they are seen or requested again • Make it easy for the users – auto in the background
    6. 6. Assumptions • Taxonomy has been created • Tested on the documents, pages etc. • Agreed to by the major stakeholders • Is ready to apply to the records – There is a field to put this subject metadata in – The use is established – • Where the terms will be used • How the terms will be used • Maintenance is established – Search log feeds – Feedback from users – Wired into new product development needs
    7. 7. Where is it put into place? • Tag / index content – Metadata records – Full text inline • Web navigation – Browse – navigation tree – NavBar – Search • Product catalogs • Customer Service – find the answers quickly
    8. 8. Inline Tagging Show the exact point where the concept is mentioned Mouse-over to view the term record Statistical summary, showing the number of times each term is mentioned in the article Show the exact point where the concept is mentioned Mouse-over to view the term record Statistical summary, showing the number of times each term is mentioned in the article Copyright © 2013 Access Innovations, Inc.
    9. 9. Taxonomy DrivenTaxonomy Driven Search PresentationSearch Presentation Copyright © 2013 Access Innovations, Inc.
    10. 10. HTML Headers META NAME KEYWORD Use the taxonomy here
    11. 11. Taxonomies in new ways • Mashups • Visualization of your content • Trend analysis • Content segmentation / repackaging • Author / people profiles • Place and people disambiguation • Data cross walks • In Sharepoint
    12. 12. Authors at a placeMASHUP locations to a GPS grid of an area Two data points GPS Coordinates Taxonomy description of the Copyright © 2013 Access Innovations, Inc.
    13. 13. 15 Visualization Strategies Matrix Visualization Software Copyright © 2013 Access Innovations, Inc.
    14. 14. All Data Up-posted To The Top Level Copyright © 2013 Access Innovations, Inc.
    15. 15. Copyright © 2013 Access Innovations, Inc.
    16. 16. Pattern Analysis Domain Associations Copyright © 2013 Access Innovations, Inc.
    17. 17. More Like This - Recommender Cancer Epidemiology Biomarkers & Prevention Vol. 12, 161-164, February 2003 © 2003 American Association for Cancer Research Short Communications Alcohol, Folate, Methionine, and Risk of Incident Breast Cancer in the American Cancer Society Cancer Prevention Study II Nutrition Cohort Heather Spencer Feigelson1 , Carolyn R. Jonas, Andreas S. Robertson, Marjorie L. McCullough, Michael J. Thun and Eugenia E. Calle Department of Epidemiology and Surveillance Research, American Cancer Society, National Home Office, Atlanta, Georgia 30329-4251 Recent studies suggest that the increased risk of breast cancer associated with alcohol consumption may be reduced by adequate folate intake. We examined this question among 66,561 postmenopausal women in the American Cancer Society Cancer Prevention Study II Nutrition Cohort. Related Press Releases •How What and How Much We Eat (And Drink) Affects Our Risk •Novel COX-2 Combination Treatment May Reduce Colon Cance Combination Regimen of COX-2 Inhibitor and Fish Oil Causes C •COX-2 Levels Are Elevated in Smokers Related AACR Workshops and Conferences •Frontiers in Cancer Prevention Research •Continuing Medical Education (CME) •Molecular Targets and Cancer Therapeutics Related Meeting Abstracts •Association between dietary folate intake, alcohol intake, and m •Folate, folate cofactor, and alcohol intakes and risk for colorect •Dietary folate intake and risk of prostate cancer in a large prosp Related Working Groups •Finance •Charter •Molecular Epidemiology Related Education Book Content Oral Contraceptives, Postmenopausal Hormones, and Physical Activity and Cancer Hormonal Interventions: From Adjuvant Therapy to Breast Cancer Prevention Related Awards •AACR-GlaxoSmithKline Clinical Cancer Research Scholar Awards •ACS Award •Weinstein Distinguished Lecture Webcasts Related Webcasts Think Tank Report Related Think Tank Report Content Copyright © 2013 Access Innovations, Inc.
    18. 18. Link to Society Resources Journal Article on Topic A Other Journal Articles on Topic A Upcoming Conference on Topic A Podcast Interview with Researcher Working on Topic A Grant Available for Researchers Working on Topic A CME Activity on Topic A Job Posting for Expert on Topic A Copyright © 2013 Access Innovations, Inc.
    19. 19. Thesaurus MasterThesaurus Master Machine Aided Indexer (M.A.I.™) Reposito ry Term Store Reposito ry Term Store Search Presentation: 90% accuracy Browse by Subject Auto-completion Broader Terms Narrower Terms Related Terms Search Presentation: 90% accuracy Browse by Subject Auto-completion Broader Terms Narrower Terms Related Terms Inline Tagging Metadata and Entity Extractor Automatic Summarization Search Software Search Software Client taxonomy Client taxonomy Taxonomy In Sharepoint Copyright © 2013 Access Innovations, Inc.
    20. 20. Editorial Workflow Integration Author Submission Module The author fills in the data to the document template, attaching images and graphs as necessary. An API calls Data Harmony and generates a list of indexing terms based on the content. Copyright © 2013 Access Innovations, Inc.
    21. 21. A Quick Look Behind the Scenes Database Management System Thesaurus tool Indexing Tool MAI •Validate terms •Add terms and rules •Change terms and rules •Delete terms and rules •Search thesaurus •Validate term entry •Block invalid terms •Record candidates •Establish rules for term use •Auto Suggest indexing terms Copyright © 2013 Access Innovations, Inc.
    22. 22. Database Management System Add Metadata using MAI Search Inverted File Aadvark Alligator Apple Advantage …. Zebra Record locator Accessinn.com/12345/demofile/recid15 Database records Each with many elements Portal Searching
    23. 23. Data Feed Portal page Thesaurus / taxonomy system add Metadata Data Base Management System Automatic Metadata Suggestions Search
    24. 24. Data Base Management System Add Metadata using MAI Search Use taxonomy to search
    25. 25. • Consistency – Machine Aided Indexing consistently suggests the same term under the same conditions. – Initial tests on units from client proved to be 20% more consistent than manual indexing. • Deeper Indexing • Machine Aided Indexing may suggest more terms per unit than manual indexers working under a per unit quota. • Faster Processing • Human verification of suggested indexing terms can be seven (7) times as fast as manual indexing. • Fully automatic indexing can run 8 million articles in 2 hours in the Amazon Cloud Service • Full text indexing article(30 page articles) at less than a minute per article07/25/13 Copyright (c) 1998 Access Innovations, Inc. Advantages of MAI
    26. 26. 07/25/13 Copyright (c) 1998 Access Innovations, Inc. Use of Thesaurus • Faster Rule Building • Determines Depth of Indexing • Increases Measurable Consistency • Provides Base for Statistical Computations
    27. 27. 07/25/13 Copyright (c) 1998 Access Innovations, Inc. MAI System Overview -- System Components 1. Thesaurus / Taxonomy 2. MAI Program a. Rule Builder b. MAI Engine (matches text to terms) c. Statistics Computation 3. Knowledge Base / Rule base 4. Editorial Evaluation
    28. 28. How does auto indexing work? • The Modules • The syntax • The process • How do you know if it’s good? Statistics / evaluation • Rules of thumb
    29. 29. 07/25/13 Copyright (c) 1998 Access Innovations, Inc. Knowledge Base Creative Loop
    30. 30. 07/25/13 Copyright (c) 1998 Access Innovations, Inc. Edit • Interactive text editor for building rules • Provides insertion, deletion, copying of characters or marked blocks Reports • Creates Knowledge Base listings • Can output all rules, or selected rules MAI - Rule Builder
    31. 31. 07/25/13 Copyright (c) 1998 Access Innovations, Inc. MAI – Concept Extractor Input • Reads client’s electronic file format • Creates internal format for scanning • Modified project by project Interpreter • Scans reformatted input, selects phrases • Looks up text phrases in Knowledge Base • Interprets rule(s) associated with matched phrase Output • Produces electronic file of scanned units with suggested index terms for production of print file • Optionally produces data file for statistics program
    32. 32. 07/25/13 Copyright (c) 1998 Access Innovations, Inc. MAI - Statistics Computation How do you know if indexing is accurate? • Accepts optional data file created by MAI engine while processing units with existing index terms • Computes MAI ‘Hit’, ‘Miss’, ‘Noise’ figures Reports • Produces listing of terms with unit numbers categorized as ‘Hit’, ‘Miss’, or ‘Noise’
    33. 33. Rules Type • Simple Condition (80%) • Identity • synonym • Complex Rules: • Compound Condition • Multiple Condition Explanation • If (condition is true) use term ‘A’ • Use term ‘B’ where ‘A’ (matched text) is synonymous with, but not identical to, ‘B’ • If (condition 1 is true and/or condition 2 is true…and/or condition n is true) use term ‘A’ • If (condition 1 is true…) use term ‘A’ else if (condition 2 is true…) use term ‘B’ else if (condition n is true…) use term ‘C’ 07/25/13 Copyright (c) 1998 Access Innovations, Inc. Rule Types
    34. 34. • Condition Type • Proximity • Location • Format • Truncation • Explanation • Helps to determine context of a phrase by its relative location to another indicative phrase. • Determines where in the unit a phrase occurs • Checks the matched text for certain conditions such as capitalization, format (italic codes) or affixes (prefix or suffix to root word). • Left or right word stem isolation07/25/13 Copyright (c) 1998 Access Innovations, Inc. More options
    35. 35. • NEAR “string” – If the matched text is within 3 words before or after another “string” in the same sentence. • WITH “string” – If the matched text is in the same sentence as another “string” • MENTIONS “string” – If the matched text is in the same abstract as another “string” • AROUND “string” – If the matched text is in the same text as another “string” 07/25/13 Proximity Conditions
    36. 36. • in title – If the matched text is in “field” • in abstract – If the matched text is in “field” • begin sentence – If the matched text is located at the beginning of the sentence • end sentence – If the matched text is located at the end of the sentence 07/25/13 Copyright (c) 1998 Access Innovations, Inc. Location Conditions
    37. 37. • all caps – If the matched text is all caps • initial caps – If the matched text begins with a capital letter • match “string 1”, – If the matched text begins with, ends “string 2” with, or is included within specific character strings(s), such as “<i>”, “</i>” or quotation sets 07/25/13 Format Conditions
    38. 38. • Affix “string1”, – • If the matched text is a root word in“string2” the Knowledge Base which is prefixed (string1) or suffixed (string2) by specific character string(s), such as “pseudo” or “ization”. • Either string1 or string2 may be null(“”). • Additionally, it will allow wild card characters such as “*”. 07/25/13 Format Conditions (
    39. 39. 07/25/13 Simple Rules: Identity text: artifact rule: use artifact text: accelerator mass spectrometry rule: use accelerator mass spectrometry text: National Socialism rule: use National Socialism use fascism
    40. 40. 07/25/13 Simple Rules: Synonym text: testing rule: use analysis text: the history of rule: use history text: micrographs rule: use scanning electron microscopy text: casoni rule: use adobe
    41. 41. 07/25/13 Complex Rules: Simple Condition Automatically created text: cave rule: if (near “Sandia”) use Sandia cave site text: packing rule: if (near “shipping”) use packing and shipping text: chin rule: if (all caps)use Canadian Heritage Information Network
    42. 42. 07/25/13 Complex Rules: Compound Condition text: congress rule: if (initial caps and not begin sentence) and (with “u.s.” or with “united states”)) use U.S. Congress text: dye rule: if (in title and (near “natural” or with “vegetable”)) use natural dyeing text: x-ray rule: if (near “energy dispersive” or mentions “(xrd)” or (near “fluorescence” and with “spectroscopy”)) use energy-dispersive x-ray fluorescence spectroscopy
    43. 43. 07/25/13 Copyright (c) 1998 Access Innovations, Inc. Complex MAI Rule Rule: survey 12804: IF (near "geological" OR with "geophysics" OR with "oil recovery") USE geophysical prospecting ENDIF IF (mentions "interstellar" OR mentions "star" OR mentions "stars" OR mentions "Universe" OR mentions "stellar" OR mentions "galaxies" OR mentions "protostars" OR mentions "protostar" OR mentions "galactic" OR mentions "galaxy" OR mentions "radio source" OR mentions "HII regions" OR mentions "nebula" OR mentions "nova" OR mentions "supernova" OR mentions "nebulae" OR mentions "novae") USE astronomy and astrophysics ENDIF IF (mentions "discusses" OR mentions "presents" OR mentions "the author" OR mentions "this article" OR mentions "this book" OR mentions "this paper" OR mentions "the article" OR mentions "the book" OR mentions "the paper") USE reviews ENDIF
    44. 44. • HIT – When the MAI engine generates an indexing term which is identical to an index term which would have been assigned by a human indexer • MISS – When the MAI engine fails to generate an indexing term which would have been assigned by a human indexer. • NOISE – When the MAI engine generates an indexing term which is incorrect, out of context, or illogical. 07/25/13 MAI Statistics Terms
    45. 45. • CONSISTENCY – When topical or descriptive indexing terms are assigned in the same way from article to article over time. • DEPTH – A qualitative measure of the level and range of detail of the indexing (ie, from broader terms to narrower terms of a single concept). • BREADTH – A qualitative measure of the number or range of concepts indexed in an article. 07/25/13 Copyright (c) 1998 Access Innovations, Inc. MAI Statistics Terms
    46. 46. 07/25/13 Copyright (c) 1998 Access Innovations, Inc.
    47. 47. • Aim for 85% of higher accuracy – Hits of misses and noise • Maintain monthly – Thesaurus / taxonomy updates – From search logs – From new sources covered – From new initiatives and term usage • Maintain MAI with term changes and additions – 80% of rules are automatic match rules – Should create 6 – 10 rules per hour • Do regular random sampling to improve the indexing quality 1% • Reindex when taxonomy changes reach 5% of the term base • Hire people good at logic and word games 07/25/13 Copyright (c) 1998 Access Innovations, Inc. Rules of thumb
    48. 48. Cost items • Software – Two main types – • Bayesian / statistical / co-occurrence • Rules based • Educating the system – Training the system based on collection of records where the terms are used correctly – 20% of terms may need disambiguation • Implementing the system – Setting the vectors (programmers) – Applying the rules (editors / taxonomists) • Ongoing maintenance – Resetting the vectors (programmers) – Updating the terms and rules (editors / taxonomists)
    49. 49. • A. MAI System Components • Thesaurus • Knowledge Base • MAI Program • Editorial Evaluation • B. Use of Thesaurus • Word Standardization • Faster Rule Building • Increased Measurable Consistency 07/25/13 Copyright (c) 1998 Access Innovations, Inc. MAI Summary
    50. 50. • C. Knowledge Base Components • Simple Rules • Complex Rules • D. MAI Program Modules • Rule Builder Module • MAI Engine Module • Statistics Computation • E. Advantages of Rules based MAI • Consistency • Deeper Indexing • Faster Thru Put 07/25/13 Copyright (c) 1998 Access Innovations, Inc. MAI Summary
    51. 51. Taxonomy Is core to the system
    52. 52. Thank youThank you Marjorie M.K. Hlava President Access Innovations / Data Harmony mhlava@accessinn.com www.dataharmony.com www.accessinn.com 505-998-0800

    ×