Machine Aided Indexer



Detailed information on the operation of the Data Harmony Machine Aided Indexer module from Access Innovations, Inc. Presented by Alice Redmond-Neal and Jack Bruce at the 2012 Data Harmony User Group meeting on February 7, 2012, at the Access Innovations, Inc. offices.

Published in: Business, Technology

Machine Aided Indexer

  1. 1. Machine Aided Indexer™
  2. 2. Machine Aided Indexer™ is available as a stand-alone version or as part of MAIstro™ (integrated with Thesaurus Master™). M.A.I.™ creates a simple rulebase from your thesaurus terms to use for categorizing documents. You can fine-tune the rulebase to reflect editorial knowledge and judgment, specifying when thesaurus terms should be used. Your result: Precision Indexing
  3. 3. M.A.I. under the hood. Concept Extractor™: compares text to Knowledge Base rules to present suggested index terms. Statistics Collector™: gathers and stores the indexing experience of the system, sorting into Hits / Misses / Noise; prioritizes terms needing rule fine-tuning to improve indexing accuracy. Rule Builder™: human editor creates, edits, and reviews rules for indexing terms.
  4. 4. [Diagram: M.A.I. workflow] IN: Text → Concept Extractor (consults the MAI Knowledge Base) → list of suggested terms from controlled vocabulary → Human review, which results in Hits (selected terms), Misses (added terms), and Noise (rejected terms) → OUT: Database, indexed set of documents. Rule Builder: editor manages the Knowledge Base. Statistics Collector: improves the Knowledge Base.
  5. 5. Objective in indexing: apply indexing terms with accuracy, speed, depth (specificity), breadth (exhaustivity), and consistency. Objective in M.A.I. rule-building: make rules reflect human thinking for optimal categorization.
  6. 6. How? Formulate standard rules for interpreting text and for applying thesaurus terms as subject metadata to index/categorize documents. 2/14/2012
  7. 7. Why use rules for indexing? Rules provide consistent direction for interpreting text and applying indexing terms. Accurate indexing results in precise information retrieval.
  8. 8. M.A.I.’s starter rulebase. M.A.I. automatically generates rules. Starter rules match exactly to words in text: identity rules for thesaurus terms; synonym rules for established NonPreferred terms. Success out of the box depends on: taxonomy term expression of concepts; writer’s creative expression of concepts.
  9. 9. Fine-tuned by editors, rules enable: context clues to pinpoint word meaning; “reading between the lines”; natural language processing; greater accuracy over simple rule indexing. Use M.A.I.’s Rule Builder module to fine-tune rules for applying terms.
  10. 10. Indexing and rule-building: two processes. Indexing: read and interpret document text; decide on indexing term. Rule-building: identify prompt word(s). What brought the indexing term to mind? This text to match in the document is the starting point for rule-building.
  11. 11. Indexer reads the document text: “Indian leaders are asking the government…”
  12. 12. Indexer considers indexing terms. “government”: State government? Federal government? City government? “Indians”: in India? Indigenous people? Native Americans?
  13. 13. Indexer selects indexing terms: “Indian leaders are asking the government to prevent a repeat of the 1990 census undercount that missed nearly 3000 Indians in New Mexico.”
  14. 14. M.A.I. term suggestions: Government, New Mexico. Use your knowledge to select the best terms: from M.A.I. suggested terms; from the thesaurus. Decide on indexing terms and apply them to the document.
  15. 15. Indexing done, rule-building begins. The rule-building editor’s question: What words in the text prompted selection of those terms? This word (or words) is the starting point for building a rule with M.A.I.: the “gatekeeper.”
  16. 16. Choose the MAI Rule Builder tab. A rule has two parts: Text to Match and the rule body. Viewing options: font style, size.
  17. 17. M.A.I. rule starts with Text to Match. The prompt word (or word part or phrase) in the document, whatever made the indexer think of a specific indexing term, becomes the Text to Match of a rule.
  18. 18. Importance of Text to Match. TTM opens the door to the rulebase: without a word or phrase to match, the knowledge of the rulebase is unavailable. The M.A.I. system programmatically creates a starter rulebase: identity rules (exact match to thesaurus term); synonym rules (exact match to NonPref term). This is the starting point for a rulebase, ready for fine-tuning.
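The starter rulebase described above can be pictured with a minimal Python sketch. This is not M.A.I.'s implementation: the function name `build_starter_rulebase`, the dictionary layout, and the lowercase matching are all assumptions made for illustration. The terms come from the synonym and identity examples later in this deck.

```python
from typing import Dict, List

def build_starter_rulebase(thesaurus: Dict[str, List[str]]) -> Dict[str, str]:
    """Map each text to match (lowercased) to the preferred term it suggests."""
    rules: Dict[str, str] = {}
    for preferred, non_preferred in thesaurus.items():
        # Identity rule: the thesaurus term itself is the text to match.
        rules[preferred.lower()] = preferred
        # Synonym rules: each non-preferred term points to the preferred term.
        for synonym in non_preferred:
            rules[synonym.lower()] = preferred
    return rules

# Hypothetical thesaurus fragment: preferred term -> non-preferred terms.
thesaurus = {
    "Unemployment": ["jobless"],
    "Aquaculture": ["fish farm"],
    "Irrigation": [],
}
starter = build_starter_rulebase(thesaurus)
print(starter["jobless"])     # Unemployment
print(starter["irrigation"])  # Irrigation
```

Every rule here is unconditional, which is why such a rulebase is only a starting point for fine-tuning.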
  19. 19. M.A.I. out of the box: estimated 60% accuracy. Success depends on: style of thesaurus terms; writing style of documents; addition of synonyms.
  20. 20. If only… document authors wrote using the language of thesaurus terms, then the starter rulebase would be sufficient… but...
  21. 21. Editors make M.A.I. rules smarter: 1. Modify the Text to Match 2. Modify the rule body
  22. 22. 1. Modify the Text to Match
      Words with the same root: crystal ~ crystallize ~ crystalline ~ crystallization ~ crystal-forming. Text to match: crystal*
      Words in inverted sequence: Power, Solar = Solar power. Text to match: solar
      Phrases with same meaning, different syntax: Pollution control = Control of pollution. Text to match: pollution
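A small Python sketch of how a truncated Text to Match such as crystal* could be tested against words. The helper `ttm_to_regex` is hypothetical, and treating a hyphenated form as one word is an assumption; M.A.I.'s own matching logic is not shown in the slides.

```python
import re

def ttm_to_regex(ttm: str):
    """Compile a text to match; a trailing * stands for any word ending."""
    if ttm.endswith("*"):
        # Assumption: hyphenated forms such as crystal-forming count as one word.
        body = re.escape(ttm[:-1]) + r"[\w-]*"
    else:
        body = re.escape(ttm) + r"\b"
    return re.compile(r"\b" + body, re.IGNORECASE)

pattern = ttm_to_regex("crystal*")
words = ["crystal", "crystallize", "crystalline",
         "crystallization", "crystal-forming", "chrysalis"]
matched = [w for w in words if pattern.fullmatch(w)]
print(matched)  # the five crystal... forms; "chrysalis" is not matched
```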
  23. 23. 2. Modify the rule body. Starter rules (identity and synonym) specify the term to be used: no ifs, ands, or buts. You can establish conditions or limits on the suggestion of the indexing term(s), or direct M.A.I. to ignore a word or phrase in text (NULL rule).
  24. 24. Two basic types of rules: 1. Simple rules (starter rules): no conditions to limit the use of the indexing term. 2. Condition rules: where rules get interesting!
  25. 25. Simple rules: how they work. The prompt word in the text suggests the same indexing term every time that word occurs. No IFs qualify the use of the indexing term. Text to Match in the document → USE Indexing term
  26. 26. 3 types of simple rules: 1. Identity rules 2. Synonym rules 3. NULL rules
  27. 27. Simple rules: identity rule. Text to Match is identical to the thesaurus term in the rule body. No conditions.
  28. 28. Simple rules: identity
      Text to match: irrigation
        USE Irrigation
      Text to match: Lake Michigan
        USE Lake Michigan
      Text to match: marriage and divorce records
        USE Marriage and divorce records
  29. 29. Identity rules are created programmatically
  30. 30. Simple rules: synonym rule. Show term equivalents (Use/Used for)
      Text to match: jobless
        USE Unemployment
      Text to match: fish farm
        USE Aquaculture
      Text to match: Y2K
        USE Y2K issue
      Text to match: parish
        USE County
      Text to match: e-business
        USE Ecommerce
  31. 31. Simple rules: synonym rule. Simplify morphological, punctuation, spelling, and sequencing variations
      Text to match: worker’s compensation, workman’s compensation, workmen’s compensation, work* comp*
        USE Worker’s compensation
      Text to match: e-commerce
        USE Ecommerce
  32. 32. A synonym rule for the Text to Match “jobless” suggests … USE Unemployment. When M.A.I. is integrated with Thesaurus Master, synonym rules for NonPreferred terms are generated programmatically.
  33. 33. Simple rules: synonym rule. Separate out compound terms
      Text to match: fishing
        USE Fishing and hunting
      Text to match: hunting
        USE Fishing and hunting
      Text to match: adoption
        USE Adoption and foster care
      Text to match: divorce
        USE Marriage and divorce records
      TIP: Trim TTM down to one core element
  34. 34. Simple rules: NULL. Ignore a thesaurus word that occurs as part of an irrelevant phrase (“physician’s orders”) or as part of an idiom (“in light of…”, “a bird in hand”, “looking back...”). Text to match: in light of. Rule: NULL
  35. 35. NULL rule: do not index with the thesaurus term “Light” in this instance.
  36. 36. Two basic types of rules: 1. Simple rules (starter rules): no conditions limit the use of the indexing term. 2. Condition rules: where rules get interesting!
  37. 37. Dealing with ambiguity
  38. 38. Jay Leno’s headlines: Police Begin Campaign to Run Down Jaywalkers; Local High School Dropouts Cut in Half; Red Tape Holds Up New Bridges; Include Your Children When Baking Cookies; Kids Make Nutritious Snacks; Iraqi Head Loses Arm
  39. 39. How would you disambiguate… bush: what other words and/or conditions should lead to using the term Shrubs, or U.S. presidents? balloon: Aerostatic aviation, or Party supplies? will(s): Jurisprudence, Last will and testament, Living wills, or (auxiliary verb)?
  40. 40. Example: routing. vehicles (direction); work (workflow); people, data, stuff (distribute, disperse); the other team (overwhelming defeat); wood (using power tool)
  41. 41. Example: Technology. Need conditions?
      Top term
      Narrow terms: Engineering; Information technology; Medical technology; Technology transfer; Radio frequency identification technology
      Scope note: The practical use of scientific knowledge in industry and everyday life; the scientific method and material used to achieve a commercial or industrial objective
      Related terms: Technology assessment; Technology research
      Set conditions on using term Technology? “new fangled technology”; “cooking technology”; “report from the Massachusetts Institute of Technology”
  42. 42. When the prompt word is ambiguous. Could the prompt word be interpreted differently? Indian leaders are asking the government…; balloon; bush; bridge; adoption. Under what conditions would another interpretation be correct?
  43. 43. Thinking conditionally: let the IFs begin... Convergent thinking: What other words in text would confirm your interpretation of the text-to-match meaning and your proposed indexing term? Divergent thinking: What words in text would contradict your interpretation?
  44. 44. Condition rules: IF rules. For ambiguous word meanings, the editor can set IF conditions that must be met for rules to suggest an indexing term. Rules can incorporate conditions from Scope Notes. The editor can set one or more conditions, joined with Boolean operators AND, OR, and NOT.
  45. 45. Example: Sniffer. BT: Malicious code. SN: A program that intercepts routed data and examines each packet in search of specified information, such as passwords transmitted in clear text. M.A.I. rule: TTM: sniffer USE Sniffer. “Customs used a sniffer dog to identify the contraband …”
  46. 46. In a botany taxonomy, “bushes” is a NonPref Term that prompts the preferred term “Shrubs,” even if the text is about (former) President Bush. When a simple rule won’t do, set conditions in the rule to increase precision Hits and decrease Noise.
  47. 47. Simplify the TTM – then add conditions in the rule body
  48. 48. 4 types of conditions: 1. Proximity of rule’s TTM to quoted word from document text (4 levels of proximity) 2. Capitalization of TTM 3. Exact MATCH of TTM to word in text 4. TTM begins or ends a sentence. Mix and match conditions with Boolean operators: AND, OR, NOT
  49. 49. Condition rules: Proximity
      Text to match: safety
      IF (NEAR "security")        [within 3 words]
        USE Crime prevention
      ENDIF
      IF (WITH "community")       [within sentence]
        USE Public safety
      ENDIF
      IF (AROUND "product")       [within 50 words]
        USE Product safety
      ENDIF
      IF (MENTIONS "food")        [within 250 words]
        USE Food handling and safety
      ENDIF
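The four proximity levels can be illustrated with a rough Python sketch. The word windows (3, 50, 250) and the same-sentence meaning of WITH come from the slide; everything else is an assumption: the function names, the naive tokenizer and sentence splitter, and the choice to let NEAR, AROUND, and MENTIONS count words across sentence boundaries.

```python
import re
from typing import List

# Word distances taken from the slide; WITH is handled as "same sentence".
WINDOW = {"NEAR": 3, "AROUND": 50, "MENTIONS": 250}

def sentences(text: str) -> List[List[str]]:
    """Split text into sentences, each a list of lowercased word tokens."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [re.findall(r"[a-z0-9'-]+", p.lower()) for p in parts if p.strip()]

def proximity(op: str, ttm: str, cue: str, text: str) -> bool:
    """True if `cue` occurs within the required distance of some `ttm` occurrence."""
    sents = sentences(text)
    if op == "WITH":
        return any(ttm in s and cue in s for s in sents)
    words = [w for s in sents for w in s]
    ttm_pos = [i for i, w in enumerate(words) if w == ttm]
    cue_pos = [i for i, w in enumerate(words) if w == cue]
    return any(abs(i - j) <= WINDOW[op] for i in ttm_pos for j in cue_pos)

text = "Product recalls rose last year. Regulators cited safety concerns."
print(proximity("AROUND", "safety", "product", text))  # True: 7 words apart
print(proximity("WITH", "safety", "product", text))    # False: different sentences
print(proximity("NEAR", "safety", "product", text))    # False: more than 3 words apart
```

In this sample only the AROUND condition is met, so of the four rules on the slide only Product safety would be suggested.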
  50. 50. Condition rules: Proximity
      Text to match: bear
      IF (NEAR "Chicago" OR WITH "football")
        USE Chicago Bears
      ENDIF
      IF (NEAR "market" OR AROUND "stock")
        USE Stock market
      ENDIF
      IF (MENTIONS "forest" OR MENTIONS "woods")
        USE Wild animals
      ENDIF
  51. 51. Example: Documentation
      Text to match: documentation
        USE Documentation
      Identity rule created problems. Add conditions for greater precision:
      IF (AROUND "software" OR WITH "application" OR AROUND "hardware" OR WITH "instruction")
        USE Documentation
      ENDIF
  52. 52. Condition rules: Negation
      Text to match: wages
      IF (NOT WITH "war")
        USE Wages and salaries
      ENDIF
      Text to match: web
      IF (NOT WITH "spin*")
        USE Internet
      ENDIF
      (“spider” no longer differentiates internet from arachnids)
  53. 53. Condition rules: Case
      Text to match: aids
      IF (ALL CAPS)
        USE AIDS and HIV
      ENDIF
      Text to match: masters
      IF (INITIAL CAPS AND MENTIONS "poet*")
        USE Edgar Lee Masters
      ENDIF
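As an illustration of the case conditions, here is a short Python sketch. The helper names are hypothetical, and the exact definitions M.A.I. uses for ALL CAPS and INITIAL CAPS are assumed rather than documented in the slides.

```python
from typing import Optional

def is_all_caps(word: str) -> bool:
    """Assumed meaning of ALL CAPS: every letter is upper case."""
    return word.isalpha() and word.isupper()

def is_initial_caps(word: str) -> bool:
    """Assumed meaning of INITIAL CAPS: first letter upper case, the rest lower case."""
    return word[:1].isupper() and word[1:].islower()

def suggest_for_aids(word: str) -> Optional[str]:
    # Mirrors the slide's rule: Text to match: aids, IF (ALL CAPS) USE AIDS and HIV.
    if word.lower() == "aids" and is_all_caps(word):
        return "AIDS and HIV"
    return None

print(suggest_for_aids("AIDS"))    # AIDS and HIV
print(suggest_for_aids("aids"))    # None
print(is_initial_caps("Masters"))  # True
```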
  54. 54. Condition rules: Match
      Text to match: employ*
      IF (MATCH "employment")
        USE Employment
      ENDIF
      IF (MATCH "employee" AND (WITH "municipal" OR WITH "city" OR WITH "town"))
        USE Municipal employees
      ENDIF
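The MATCH rule above can be paraphrased in Python as a rough sketch. It assumes that MATCH means the matched word equals the quoted word exactly and that WITH means "in the same sentence"; the function name and the tokenizer are illustrative only.

```python
import re
from typing import List

CITY_WORDS = {"municipal", "city", "town"}

def suggest_for_employ(sentence: str) -> List[str]:
    """Apply the slide's employ* rule to a single sentence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    suggestions: List[str] = []
    for word in words:
        if not word.startswith("employ"):   # Text to match: employ*
            continue
        if word == "employment":            # IF (MATCH "employment")
            term = "Employment"
        elif word == "employee" and CITY_WORDS & set(words):
            term = "Municipal employees"    # MATCH "employee" AND (WITH ...)
        else:
            continue                        # e.g. "employers": no condition met
        if term not in suggestions:
            suggestions.append(term)
    return suggestions

print(suggest_for_employ("The city hired a new employee."))
print(suggest_for_employ("Employment rose while employers cut hours."))
```

Under this strict reading, the plural "employees" would not satisfy MATCH "employee"; whether it does in M.A.I. presumably depends on the singular/plural configuration option mentioned later in the deck.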
  55. 55. Condition rules: Sentence position. IF (BEGIN SENTENCE); IF (END SENTENCE)
  56. 56. Conditions in rules help increase precision Hits and decrease Noise for more precise information retrieval. Conditions depend on human logic.
  57. 57. M.A.I. can save illogical statements → bad results. M.A.I. cannot save a rule with incorrect syntax. Rule Check and Save check the syntax of a rule. Error warning: explains syntax problems and shows line location (for example, “Closing parenthesis missing”).
  58. 58. Mind your IFs and ( )s: they come in 2s. IF starts the system thinking about a condition; ENDIF completes the thought. Every IF condition goes in ( )s. Every ( must close with ); multiple ( )s are OK. Every IF condition must close with an ENDIF. Every “ must close with ”. Function words must be spelled correctly.
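The pairing rules above lend themselves to a simple check. The Python sketch below is not M.A.I.'s Rule Check; `check_rule` and all messages except “Closing parenthesis missing” are invented for illustration, and it assumes straight double quotes and that quoted words contain no parentheses or keywords.

```python
import re
from typing import List

def check_rule(rule: str) -> List[str]:
    """Return a list of syntax problems; an empty list means these checks passed."""
    problems: List[str] = []
    depth = 0
    for ch in rule:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                problems.append("Closing parenthesis without opening parenthesis")
                depth = 0
    if depth > 0:
        problems.append("Closing parenthesis missing")
    if rule.count('"') % 2:
        problems.append("Quotation mark missing")
    # \bIF\b does not match inside ENDIF, but does count the IF in ELSE IF.
    ifs = len(re.findall(r"\bIF\b", rule))
    endifs = len(re.findall(r"\bENDIF\b", rule))
    if ifs != endifs:
        problems.append("IF and ENDIF counts differ")
    return problems

print(check_rule('IF (AROUND "public") USE Public housing ENDIF'))  # []
print(check_rule('IF (AROUND "public" USE Public housing ENDIF'))
```

A check like this only catches unbalanced syntax; as the previous slide notes, a rule can be syntactically valid and still illogical.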
  59. 59. Kicking rules up a notch. Rules can express: multiple concepts; alternative concepts; contingent concepts.
  60. 60. Condition rules: IF-IF
      Independent conditions:
      Text to match: housing
      IF (AROUND "afford*")
        USE Affordable housing
      ENDIF
      IF (AROUND "public")
        USE Public housing
      ENDIF
      IS DIFFERENT FROM
      Contingent conditions:
      Text to match: housing
      IF (AROUND "afford*")
        USE Affordable housing
        IF (AROUND "public")
          USE Public housing
        ENDIF
      ENDIF
  61. 61. Condition rules: IF-IF
      Text to Match: agricultur*
      IF (WITH "products")
        USE Agricultural products
        IF (WITH "programs")
          USE Agricultural programs
        ENDIF
      ENDIF
      Agricultural programs is available ONLY IF the Agricultural products condition is met.
      Text to Match: agricultur*
      IF (WITH "products")
        USE Agricultural products
      ENDIF
      IF (WITH "programs")
        USE Agricultural programs
      ENDIF
      BOTH terms may be used: they are independent.
  62. 62. Condition rules: IF-IF
      Text to Match: agricultur*
      IF (WITH "products")
        USE Agricultural products
        IF (WITH "programs")
          USE Agricultural programs
        ENDIF
      ENDIF
      Indentation emphasizes contingent condition
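The difference between independent and contingent conditions maps directly onto sequential versus nested `if` statements. The Python sketch below restates the agricultur* rules; the function names and the use of a word set to stand in for the WITH condition are assumptions for illustration.

```python
from typing import List, Set

def independent(words: Set[str]) -> List[str]:
    """Two sibling IF ... ENDIF blocks: each condition is tested on its own."""
    terms: List[str] = []
    if "products" in words:
        terms.append("Agricultural products")
    if "programs" in words:
        terms.append("Agricultural programs")
    return terms

def contingent(words: Set[str]) -> List[str]:
    """Nested IF: the inner condition is only tested when the outer one is met."""
    terms: List[str] = []
    if "products" in words:
        terms.append("Agricultural products")
        if "programs" in words:
            terms.append("Agricultural programs")
    return terms

sentence = {"agricultural", "programs", "expanded"}
print(independent(sentence))  # ['Agricultural programs']
print(contingent(sentence))   # []
```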
  63. 63. Condition rules: IF-ELSE 1. IF-ELSE provides further options in rules: a default if the first condition is not met. ELSE may be used without a condition.
      Text to match: technology
      IF (AROUND "transfer*")
        USE Technology transfer
      ELSE
        USE Technology
      ENDIF
  64. 64. Condition rules: IF-ELSE 2
      Text to match: norwegian
      IF (AROUND "language" OR WITH "speak*")
        USE Norwegian language
      ELSE
        USE Norway
      ENDIF
  65. 65. Condition rules: IF-ELSE IF. Or add extra conditions with IF - ELSE IF:
      Text to match: norwegian
      IF (MENTIONS "language")
        USE Norwegian language
      ELSE IF (MENTIONS "country")
        USE Norway
        ENDIF
      ENDIF
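In Python terms, IF-ELSE behaves like `if`/`else` with an unconditional default, while IF-ELSE IF behaves like `if`/`elif` and may suggest nothing at all. The sketch below restates the technology and norwegian rules; the function names are hypothetical, and the proximity tests are reduced to simple membership in a set of nearby words.

```python
from typing import Optional, Set

def technology_rule(nearby: Set[str]) -> str:
    # IF (AROUND "transfer*") USE Technology transfer ELSE USE Technology ENDIF
    if any(word.startswith("transfer") for word in nearby):
        return "Technology transfer"
    return "Technology"  # unconditional default

def norwegian_rule(doc_words: Set[str]) -> Optional[str]:
    # IF (MENTIONS "language") ... ELSE IF (MENTIONS "country") ... ENDIF ENDIF
    if "language" in doc_words:
        return "Norwegian language"
    elif "country" in doc_words:
        return "Norway"
    return None  # neither condition met: no suggestion

print(technology_rule({"transferred", "to", "industry"}))  # Technology transfer
print(technology_rule({"new", "ideas"}))                   # Technology
print(norwegian_rule({"fjords", "travel"}))                # None
```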
  66. 66. You can... truncate a single word with *, e.g. agri*; use * as a wild card between words, e.g. drinking * driving; truncate in the text to match and/or in the rule body.
  67. 67. And you can... include multiple conditions in a rule, starting from a single text to match.
      Text to match: tax*
      IF (WITH "business") USE Business taxes
      IF (WITH "income") USE Income taxes
      IF (WITH "sales") USE Sales taxes
      IF (AROUND "forms") USE Tax forms
      IF (AROUND "law*" OR AROUND "legis*" OR AROUND "legal") USE Tax laws
  68. 68. And you can... use multiple Boolean operators in rules; embed clauses within clauses using Boolean operators.
      Text to match: activit*
      IF (WITH "extracurricular" OR (WITH "school" AND (WITH "after" OR WITH "before" OR WITH "outside")))
        USE Extracurricular activities
      ENDIF
      Watch the ( )s!
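The nested condition in the activit* rule reads the same way as a parenthesized Boolean expression in Python. This sketch assumes WITH means "in the same sentence"; the function name and tokenizer are illustrative.

```python
import re

def extracurricular_condition(sentence: str) -> bool:
    """Evaluate the slide's nested condition against one sentence."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))

    def with_(cue: str) -> bool:
        return cue in words

    # The parentheses mirror the grouping in the rule.
    return with_("extracurricular") or (
        with_("school")
        and (with_("after") or with_("before") or with_("outside"))
    )

print(extracurricular_condition("Activities after school include chess."))  # True
print(extracurricular_condition("School activities were cancelled."))       # False
```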
  69. 69. M.A.I. in action: (105 ILCS 45/1-20) Sec. 1-20. Enrollment. If the parents or guardians of a homeless child or youth choose to enroll the child in a school other than the school of origin, that school immediately shall enroll the homeless child or youth even if the child or youth is unable to produce records normally required for enrollment, such as previous academic records, medical records, proof of residency, or other documentation. Nothing in this subsection shall prohibit school districts from requiring parents or guardians of a homeless child to submit an address or such other contact information as the district may require from parents or guardians of nonhomeless children. It shall be the duty of the enrolling school to immediately contact the school last attended by the child or youth to obtain relevant academic and other records. If the child or youth must obtain immunizations, it shall be the duty of the enrolling school to promptly refer the child or youth for those immunizations. (Source: P.A. 88-634, eff. 1-1-95; 88-686, eff. 1-24-95.)
  70. 70. Original identity rule for “Children and youth” Modify rule for “Children and youth” to Text to Match: child*
  71. 71. Reading M.A.I. results
      Indexing terms     | Document words match TTM
      Children and youth | (15) child* (9), youth (6)
      Schools            | (7) school* (7)
      Homeless people    | (3) homeless* (3)
      Immunizations      | (2) immuniz* (2)
  72. 72. M.A.I. Statistics let you track performance as you fine-tune the Knowledge Base. M.A.I.’s Statistics Collector gathers and stores indexing experience. Statistics compare editor’s indexing results to M.A.I.’s suggestions: Hits, Misses, Noise. Statistics prioritize the terms for which rules need fine-tuning.
  73. 73. M.A.I. statistics. Hits: System suggests indexing terms that are chosen by the editor (good!). Misses: System misses terms editor uses. Noise: System suggests terms not used by editor. Misses and Noise … need more rule-building.
  74. 74. Open Misses to reveal thesaurus terms used by an editor but not suggested by M.A.I. Buddhism was used by editors for indexing 3 records, but was not suggested by M.A.I.
  75. 75. Open the key beside the term to see the list of records where the term was used... The file name, record number and editor’s name are stored with each record.
  76. 76. Click to highlight any record line on the left. The full record appears on the right, with M.A.I.’s Suggested Terms and the editor’s Used Terms.
  77. 77. In this record, M.A.I. interprets “devotion” and suggests the indexing term “Prayer”: Hit. The editor used “Buddhism” though M.A.I. did not suggest the term: Miss. M.A.I. suggested “Libraries” and “Religions” though the terms were not used: Noise. For this record, M.A.I. scored 3 Hits (Prayer, Sri Lanka, Religious beliefs), 1 Miss (Buddhism), and 2 Noise (Religions, Libraries).
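The scoring of this record amounts to three set operations on the suggested and used terms. The Python sketch below uses the terms from the slide; the `score` function and its dictionary layout are illustrative and do not describe the Statistics Collector's internals.

```python
from typing import Dict, Iterable, List

def score(suggested: Iterable[str], used: Iterable[str]) -> Dict[str, List[str]]:
    """Compare M.A.I. suggestions with the editor's terms for one record."""
    suggested_set, used_set = set(suggested), set(used)
    return {
        "hits": sorted(suggested_set & used_set),    # suggested and used
        "misses": sorted(used_set - suggested_set),  # used but not suggested
        "noise": sorted(suggested_set - used_set),   # suggested but not used
    }

suggested_terms = ["Prayer", "Sri Lanka", "Religious beliefs", "Religions", "Libraries"]
used_terms = ["Prayer", "Sri Lanka", "Religious beliefs", "Buddhism"]
result = score(suggested_terms, used_terms)
print(len(result["hits"]), len(result["misses"]), len(result["noise"]))  # 3 1 2
```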
  78. 78. The word “Buddhism” does not appear in the record, although “Buddha” does. The editor’s use of the thesaurus term Buddhism to index the record is appropriate. M.A.I.’s Knowledge Base can be fine-tuned to reflect human knowledge and interpretation of the text.
  79. 79. Search the Knowledge Base for rules for Buddha. (Truncate buddh* to widen the search.) Click Search; results appear.
  80. 80. Rules exist for “Buddhism” and “Buddhist” but not for “Buddha,” which is in the text. You can easily create a new rule …
      Text to Match: Buddha
      IF (MENTIONS "religion")
        USE Buddhism
      ENDIF
      If “buddha” and “religion” are both in the text, M.A.I. suggests the indexing term Buddhism.
  81. 81. Enter a rule for Text to Match: buddha ... Better yet: combine all 3 rules by using Text to Match: buddh*
  82. 82. Click Save, OK to verify, and then Retry...
  83. 83. The new rule Text to Match: buddha prompts Buddhism in Suggested Terms for indexing.
  84. 84. At any time, you can: modify a rule; check the rule for syntax; save the rule; see the rule’s history; add an editorial note; find a word; clear the screen; delete the rule.
  85. 85. Each rule in the Knowledge Base that the editor fine-tunes increases M.A.I.’s ability to recognize synonyms, find connections between non-contiguous words, interpret idioms, make sense of allusions, and “read between the lines.” Over time, statistics for Hits increase, while Misses and Noise decrease.
  86. 86. M.A.I.’s Statistics Report summarizes Hit/Miss/Noise figures over time
  87. 87. When to make rules. Before processing documents: proactive rule building provides a head start and increases hits from the start. After processing documents: the statistics report lets the indexer see what rules need fine-tuning to improve Hits, avoid Misses, and decrease Noise, based on comparison of M.A.I. suggestions with the editor’s indexing. Rule-building is an ongoing process: frequency diminishes, results improve.
  88. 88. Custom configure M.A.I. How many term suggestions? Limit use of a term to n documents? How much text to scan? Treat singular the same as plural? Ignore stopwords? Quote marks? Plural=Singular? Most specific term only? Suggest Candidates?
  89. 89. M.A.I. measurably improves indexing results. Consistency: same term suggested under same text conditions. Indexing coverage: terms reflect full range of indexable concepts in data. Indexing depth: terms reflect the granularity and precision of deeper levels of thesaurus. Faster throughput: nearly 7 times faster indexing.
  90. 90. M.A.I. mines the full depth of your thesaurus, suggesting the most specific and appropriate indexing term. M.A.I. can also filter indexing terms, displaying more general Broad Terms, while retaining the more precise indexing terms stored with the document.
  91. 91. Pairing Machine Aided Indexer with Thesaurus Master as MAIstro provides: simple thesaurus construction and maintenance; faster indexing; deeper indexing; greater concept coverage; more consistent indexing. Efficiency and Economy in document storage and retrieval.