Beyond Boolean - Enterprise Search Technologies


Presentation I gave at a local Boston conference in 1994 about enterprise search. Some predictions panned out; some did not (at least not yet).

Published in: Technology

  1. Surviving the Information Glut
     Presentation by Bob Boeri, Factory Mutual Engineering & Research
     October 7, 1994
     bbtitle 9/23/94
  2. Roots of the Problem
     • Storage: increasing
     • Access: faster
     • Document complexity: more
     • Information quantity: increasing exponentially
     "'A cat may look at a king,' said Alice. 'I've read that in some book, but I don't remember where.'" -- Alice in Wonderland
  3. Document Complexity
     • word processor types
     • fonts
     • rich layouts
     • tables
     • graphics/photos
     • video, sound, hypertext
     • SGML
     • ... what isn't a document?
     "'And what is the use of a book,' thought Alice, 'without pictures or conversations?'" -- Alice in Wonderland
  4. How to Find What You Are Looking For
     • how large is your collection of documents?
     • how complex are they?
     • how complex are the searches?
     • who will search?
       ◦ individuals working by themselves
       ◦ members of a corporate organization
  5. Searching a Very Small Collection
     • a few dozen documents
     • simple structure (e.g., memo or e-mail)
     • written consistently (e.g., you, one author)
     Sample search: Find that note about an inexpensive, simple word processor that never needs upgrading and will let you add simple graphics to your writing. Runs under MS-Windows.
  6. Trivial Search Techniques
     • browse through each one
     • use a word processor "list files"
     • simple search system, simple Boolean search
  7. First Search Barrier
     • somewhere between a few dozen documents and several hundred
     • can't remember exactly the words to search; begin searching synonyms or using wild cards
     "She went on, rather surprised at not being able to think of the word. 'I mean to get under the -- under the -- under THIS, you know!' putting her hand on the trunk of a tree." -- Alice in Wonderland
  8. First-Level Search Techniques
     • Range searches: "word processor" <sentence> "inexpensive"
     • only 2 hits (probably missed something)
     • forgot to ask about graphics support
  9. First-Level Search Techniques
     • Wild cards: word process* <sentence> basic
     • 99 hits; unusable
     • maybe asking about graphics support too will reduce the number of hits
  10. Searching Gets Complex
      • (basic <sentence> word process*) <paragraph> (support* <sentence> graphic*)
      • complex expression
      • system searches a long time
      • finds nothing useful
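The `<sentence>` operator and the `*` wild card used on these slides can be approximated in a few lines. This is a minimal sketch, not the behavior of any particular 1994 product: the function names (`split_sentences`, `phrase_in`, `same_sentence`) are hypothetical, the sentence splitter is deliberately naive, and wild cards are handled with Python's standard `fnmatch` module.

```python
import fnmatch
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def phrase_in(pattern, sentence):
    """True if a (possibly wildcarded) phrase occurs as consecutive words."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    terms = pattern.lower().split()
    for i in range(len(words) - len(terms) + 1):
        if all(fnmatch.fnmatchcase(words[i + j], t) for j, t in enumerate(terms)):
            return True
    return False


def same_sentence(text, *patterns):
    """Emulates `a <sentence> b`: all patterns must match inside one sentence."""
    return any(
        all(phrase_in(p, s) for p in patterns) for s in split_sentences(text)
    )


doc = ("Visual Basic contains a rudimentary word processor. "
       "Graphic support is limited.")
print(same_sentence(doc, "word process*", "basic"))     # a hit, but not a useful one
print(same_sentence(doc, "word process*", "graphic*"))  # terms sit in different sentences
```

The first query matches because "Basic" in a product name satisfies the term `basic`, which is the kind of irrelevant hit a purely Boolean proximity search produces.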
  11. Sample Hits from 1st-Level Complex Search
      • "Although Visual Basic contains a rudimentary word processor... graphic support is really limited to OLE and DDE."
      • "Basic word processing skills can sometimes be transferred to... programs which allow you to create graphic effects."
  12. Need to Break the 1st-Level Search Barrier
      • reduce hits to most relevant
      • get hits when simpler searches fail
      • additional techniques beyond Boolean
      • new ways to divide and conquer
      • richer, easier search aids
      • richer reporting of results
  13. Combine Structured and Full Text Queries
      • Apply search to portion of library ("form queries")
      • Requires knowledge of the library
      • Requires "catalog card" for each document (e.g., date, subject)
      • Smart system might construct catalog card
        ◦ Requires highly regular documents
        ◦ Risk of catalog errors
  14. Combine Structured and Full Text Queries
      • Could design as a form for users to fill out
      • Example:
        DATE: after 1/1/94
        (inexpensive <sentence> "word processor" <sentence> "windows")
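A form query of this kind filters on the "catalog card" fields first and runs the full-text test only on what remains. The sketch below assumes a hypothetical `Document` record and `form_query` function; the text test is plain substring matching, so it ignores the `<sentence>` proximity constraint in the slide's example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Document:
    title: str
    published: date  # "catalog card" field
    subject: str     # "catalog card" field
    text: str        # full text


def form_query(docs, after=None, subject=None, all_terms=()):
    """Filter on structured fields, then require every term in the text."""
    hits = []
    for d in docs:
        if after is not None and d.published <= after:
            continue
        if subject is not None and d.subject != subject:
            continue
        body = d.text.lower()
        if all(t.lower() in body for t in all_terms):
            hits.append(d)
    return hits


library = [
    Document("A", date(1994, 3, 1), "word processors",
             "An inexpensive word processor for Windows."),
    Document("B", date(1993, 6, 1), "word processors",
             "An inexpensive word processor for Windows."),
    Document("C", date(1994, 5, 1), "spreadsheets",
             "An inexpensive spreadsheet for Windows."),
]
hits = form_query(library, after=date(1994, 1, 1),
                  all_terms=("inexpensive", "word processor", "windows"))
print([d.title for d in hits])
```

Only document A survives: B fails the date field and C fails the text test, which illustrates how the structured part narrows the library before any text is scanned.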
  15. Relevancy Ranking
      • Puts most likely hits at the top of the list
      • Requires understanding of what's most important
        ◦ number of hits per document
        ◦ weighting certain hits (e.g., exact matches) more than others
        ◦ weighting other criteria (such as date or other structured fields)
        ◦ let users say what's most important to them
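The first, second, and fourth criteria above (hit counts, extra weight for exact matches, user-supplied weights) can be combined into one score. This is an illustrative sketch only: `score`, `rank`, the default `phrase_weight` of 5.0, and the `term_weights` mapping are all assumptions, not values taken from the presentation.

```python
import re


def score(text, terms, phrase=None, phrase_weight=5.0, term_weights=None):
    """Hit count per term times its weight, plus a bonus for exact phrase hits."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    term_weights = term_weights or {}
    total = 0.0
    for t in terms:
        total += words.count(t.lower()) * term_weights.get(t, 1.0)
    if phrase:
        total += text.lower().count(phrase.lower()) * phrase_weight
    return total


def rank(docs, terms, **options):
    """Return document names, best score first; zero-score documents are dropped."""
    scored = [(score(text, terms, **options), name) for name, text in docs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for s, name in scored if s > 0]


docs = {
    "a": "word processor for windows, a cheap word processor",
    "b": "processor chips and windows",
    "c": "gardening",
}
print(rank(docs, ["word", "processor"], phrase="word processor"))
```

Passing `term_weights={"processor": 3.0}` is one way to let users say which term matters most to them.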
  16. Thesauruses
      • General
      • Specific
        ◦ medical
        ◦ legal
        ◦ scientific
      • user-modifiable
      "I don't know the meaning of half those long words, and what's more, I don't believe you do either!" -- Alice in Wonderland
  17. Linguistic Helps
      • Automatic search for parts of speech
        ◦ "sprinkle" also searches for "sprinkled," "sprinkling," etc.
      • Fuzzy search
        ◦ "sprinkle" also searches for "sparkle"
        ◦ helps overcome some OCR errors
        ◦ user-specifiable (how many letters to make "fuzzy")
        ◦ gets words you would have missed
        ◦ gets words that make no sense at all
      • Natural language queries ("Find me cheap reliable easy Windows word processors")
      "Language is worth a thousand pounds a word." -- Through the Looking Glass
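Both helps can be illustrated with small stand-ins. The sketch below uses a crude suffix stripper for word forms and Levenshtein edit distance for fuzzy matching; neither is claimed to be what any search product of the time used, and the names `crude_stem`, `edit_distance`, `fuzzy_matches`, and `max_edits` are hypothetical. `max_edits` plays the role of the user-specifiable "how many letters to make fuzzy" setting.

```python
def crude_stem(word):
    """Strip one common English suffix so related word forms compare equal."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def edit_distance(a, b):
    """Levenshtein distance: fewest single-letter inserts, deletes, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def fuzzy_matches(query, vocabulary, max_edits=1):
    """Words within max_edits of the query, in vocabulary order."""
    q = query.lower()
    return [w for w in vocabulary if edit_distance(q, w.lower()) <= max_edits]


vocab = ["sprinkle", "sprink1e", "sparkle", "spreadsheet"]
print(crude_stem("sprinkled") == crude_stem("sprinkling"))
print(fuzzy_matches("sprinkle", vocab, max_edits=1))  # catches the OCR error "sprink1e"
print(fuzzy_matches("sprinkle", vocab, max_edits=3))  # now also pulls in "sparkle"
```

Raising the threshold recovers more misspellings but also admits unrelated words, which is the trade-off between "words you would have missed" and "words that make no sense at all".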
  18. Complex and Modular Queries
      • Create, debug, save queries
      • Use queries as models for new queries
      • If modular ("Legos")
        ◦ assemble large search queries by plugging together smaller ones
        ◦ fine-tune searches (adjusting rankings of search criteria)
        ◦ build libraries of modular searches
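One way to picture the "Lego" idea is to treat each saved query as a small function and build bigger queries by combining them. This is a sketch under that assumption; the combinator names (`term`, `all_of`, `any_of`, `not_`) and the saved queries are hypothetical and do not reproduce any vendor's query language.

```python
import re


def tokens(text):
    """Lowercase word tokens; hyphenated words such as 'ms-windows' stay whole."""
    return re.findall(r"[a-z0-9-]+", text.lower())


def term(word):
    w = word.lower()
    return lambda text: w in tokens(text)


def all_of(*queries):
    return lambda text: all(q(text) for q in queries)


def any_of(*queries):
    return lambda text: any(q(text) for q in queries)


def not_(query):
    return lambda text: not query(text)


# A small library of reusable building blocks.
cheap = any_of(term("cheap"), term("inexpensive"))
word_processor = all_of(term("word"), term("processor"))
windows = any_of(term("windows"), term("ms-windows"))

# A larger query assembled by plugging the smaller ones together.
cheap_windows_wp = all_of(cheap, word_processor, windows)

print(cheap_windows_wp("An inexpensive word processor for MS-Windows."))
print(cheap_windows_wp("A cheap spreadsheet for Windows"))
```

Because each block is an ordinary value, it can be saved, tested on its own, and reused as a model for new queries.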
  19. Fuzzy Searches
      • use neural network technology
      • like sophisticated wildcard searches
      • help overcome OCR errors
      • find good matches and irrelevant ones
      • can distort relevancy rankings by hit count
  20. SGML Usage
      • "Zone" searches
        ◦ Confine searches to paragraph headings, chapter titles, etc.
      • Use SGML DTDs directly:
        ◦ Full, arbitrary (all DTDs)
          - exploits full capabilities of your tag set
          - performance and/or size penalties
        ◦ Specific DTDs only
          - "Any color Ford you want as long as it's black."
          - May be tuned for better use
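A zone search restricts matching to the text inside one kind of element. The sketch below is a rough approximation that assumes every element has an explicit end tag and that zones of the same kind do not nest; real SGML allows omitted end tags and needs a DTD-aware parser. The tag names and the functions `zone_text` and `zone_search` are hypothetical.

```python
import re


def zone_text(doc, tag):
    """Contents of every <tag>...</tag> element, assuming explicit end tags."""
    pattern = re.compile(
        r"<{0}\b[^>]*>(.*?)</{0}>".format(re.escape(tag)),
        re.IGNORECASE | re.DOTALL,
    )
    return [m.strip() for m in pattern.findall(doc)]


def zone_search(doc, tag, word):
    """Zones of the given kind whose text contains the word."""
    w = word.lower()
    return [z for z in zone_text(doc, tag) if w in z.lower()]


doc = (
    "<chapter><title>Word Processors</title>"
    "<para>Graphics support varies.</para></chapter>"
    "<chapter><title>Graphics</title>"
    "<para>Word processors can embed pictures.</para></chapter>"
)
print(zone_search(doc, "title", "graphics"))  # only the chapter titled "Graphics"
print(zone_search(doc, "para", "graphics"))   # only the paragraph that mentions it
```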
  21. SGML Usage
      • Filter (convert) SGML tags to application-specific codes
        ◦ Not authentic SGML use
        ◦ May give better performance than authentic SGML
      • Best when documents are themselves highly structured
      • One-way (from SGML to proprietary); loses an important SGML benefit
      • Few vendors support SGML well
      • Those who do may skimp on other search facilities
  22. Interest Profiling
      • Profile determined by any number of means
      • "I like these documents. Find me more like this."
        ◦ simple
        ◦ unexpected results
        ◦ electronic highlighter improves search
      • The more search tools the better
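"Find me more like this" can be sketched by turning the liked documents into a word-count profile and ranking candidates by cosine similarity to it. This is one common approach chosen for illustration, not a method named in the presentation; `vector`, `cosine`, and `more_like` are hypothetical names.

```python
import math
import re
from collections import Counter


def vector(text):
    """Bag-of-words term counts for one document."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity of two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def more_like(liked_texts, candidates, top_n=3):
    """Rank candidate documents by similarity to the combined liked profile."""
    profile = Counter()
    for text in liked_texts:
        profile.update(vector(text))
    scored = sorted(
        ((cosine(profile, vector(text)), name) for name, text in candidates.items()),
        key=lambda pair: (-pair[0], pair[1]),
    )
    return [name for s, name in scored[:top_n] if s > 0]


liked = ["cheap word processor for windows"]
candidates = {
    "wp": "a simple word processor that runs under windows",
    "car": "low mileage saab for sale",
    "db": "database server for unix",
}
print(more_like(liked, candidates, top_n=1))
```

The unrelated documents still score above zero because they share the word "for", a small instance of the "unexpected results" noted above; a stop-word list or highlighted passages would reduce that noise.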
  23. Information Agents
      • passive
        ◦ computed once, updated periodically
        ◦ use when you choose (whenever new CD-ROM title appears)
      • active
        ◦ information gobots
        ◦ always on the lookout for anything relevant
        ◦ inform you with results or email notification
        ◦ on-line or jukeboxes
      Looking in classifieds for a low-mileage Saab, prefer beige or red, one-owner, automatic, 1993 or newer, less than $10,000.
      Looking in PC literature for Windows word processor, easy to use, never needs upgrades, can handle graphics, bug-free, uses 1MB disk, less than $29.95.
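The passive and active modes can both be built on a saved standing query that remembers what it has already reported. The sketch below is a toy under stated assumptions: items are plain dictionaries with hypothetical `id`, `text`, and `price` keys, and `StandingQuery`, `run_agent`, and the `notify` callback are invented names standing in for whatever delivery mechanism (results list, email) a real agent would use.

```python
import re


class StandingQuery:
    """A saved search that reports each matching item only once."""

    def __init__(self, name, required, max_price=None):
        self.name = name
        self.required = [w.lower() for w in required]
        self.max_price = max_price
        self.seen = set()

    def matches(self, item):
        words = set(re.findall(r"[a-z0-9]+", item["text"].lower()))
        if not all(w in words for w in self.required):
            return False
        return self.max_price is None or item["price"] <= self.max_price

    def check(self, items):
        """Passive use: run on demand and return hits not reported before."""
        new = [it for it in items if it["id"] not in self.seen and self.matches(it)]
        self.seen.update(it["id"] for it in new)
        return new


def run_agent(query, feed, notify):
    """Active use: watch successive batches and notify on every new hit."""
    for batch in feed:
        for hit in query.check(batch):
            notify(query.name, hit)


ads = [
    {"id": 1, "text": "1993 Saab 900, automatic, one owner", "price": 9500},
    {"id": 2, "text": "1994 Saab 9000 automatic", "price": 18000},
    {"id": 3, "text": "Volvo wagon", "price": 4000},
]
saab = StandingQuery("saab", ["saab", "automatic"], max_price=10000)
run_agent(saab, [ads, ads], lambda name, hit: print(name, "->", hit["text"]))
```

Feeding the same batch twice produces a single notification, since the query tracks the ids it has already reported.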
  24. Collateral Issues: Authoring and Using
      • Authoring
        ◦ Populating the system
        ◦ Subject areas and forms
        ◦ Document size
        ◦ Legacy documents
  25. Populating the System
      • Security: does everyone have identical access?
      • Easy way to get documents into system?
      • Form per document for form queries?
        ◦ date, subject area, sub-type?
        ◦ subject area (e.g., word processors)?
        ◦ sub-types within areas (e.g., character-based, GUI)
      • Easy way to retract documents? Re-file documents? "See also" subject areas?
      • QA of forms and documents
        ◦ Form field info correct?
        ◦ Complex document objects (e.g., tables)
  26. Document Size
      • Whole documents or chunks?
      • What's appropriate to users?
        ◦ Effort to build collection
        ◦ Precision of hits
        ◦ Size of hit list
        ◦ What's natural and expected
      "'What size do you want to be?' the Caterpillar asked. 'Oh, I'm not so particular as to size,' Alice hastily replied. 'Only one doesn't like changing so often, you know.'" -- Alice in Wonderland
  27. Legacy Documents
      • Paper
        ◦ size, number, quality
        ◦ OCR
        ◦ Ability to attach page images
        ◦ At least name file for faxing
      • Electronic
        ◦ document type
        ◦ quality of author practices
        ◦ fonts...
        ◦ command launch when possible
        ◦ what about form queries/document?
      "These words were followed by a very long silence, broken only by an occasional exclamation of 'Hjckrrh!' from the Gryphon." -- Alice in Wonderland
  28. Collateral Issues: Using
      • Pi fonts
      • Non-English characters
      • Equations
      • Font fidelity, size on-screen
        ◦ letter "o" and zero
        ◦ the digit one "1", the letter el "l", and capital i "I"
      "The White Queen whispered, 'I can read words of one letter!... However, don't be discouraged. You'll come to it in time.'" -- Through the Looking Glass
  29. Collateral Issues: Using
      • Navigation within documents
      • Viewers
      • Launching when viewers inadequate
      • CD-ROM performance
      • Exporting information for reuse
      • Printing
      "... the books are something like our books, only the words go the wrong way." -- Through the Looking Glass
  30. Collateral Issues: Using
      • Interactive searches
      • Batch searches ("go do this later and tell me what you found")
      • Autonomous information agents
        ◦ Continuous monitoring
        ◦ Urgent, routine notification
        ◦ Empower agents to "Ring a bell"; "Push a button"
        ◦ Active documents: "Go find me more like yourself"
  31. Adobe Acrobat version 2.0
      • Powerful searching
      • CD-ROM performance
      • Font problem disappears
      • SGML promised
  32. And What of Our Original Search... Perfect Word Processor, Saab for a Song
      "Alice laughed. 'There's no use trying,' she said: 'one CAN'T believe impossible things.' 'I daresay you haven't had much practice,' said the Queen." -- Through the Looking Glass
      Even the best searching system can't find what isn't there. But the best ones will keep on trying.