Taxonomies in Search

Presented by Marjorie Hlava, president of Access Innovations, Inc., on August 10, 2011. Part two of the Special Libraries Association's Leveraging Your Taxonomy series.

Transcript

  • 1. Taxonomies in Search: An SLA Webinar
    Aug 10, 1:00pm-2:00pm EST
    Marjorie Hlava, President
    mhlava@accessinn.com
    Access Innovations, Inc.
    www.accessinn.com
    Leveraging your content semantically
  • 2. Agenda
    How search works
    Measuring accuracy in search
    Precision
    Recall
    Relevance
    Search theoretical basis
    Bayes, Boole and the rest of the guys
    The taxonomy effect
  • 3. How does search work?
    Many parts
    Search software – of course
    Computer network
    Parsing of text
    Well formed or structured text
    CLEAN DATA
    Computer software – network
    Computer hardware
    Telecommunications connection
    Training sets for statistical systems
  • 4. Technical parts of search
    Search technology
    Ranking algorithms
    Query language
    Federators
    Cache
    Inverted index
    Other enhancements
    Presentation Layer
  • 5. My Main Frustration
    Select hardware
    Select software
    Design system
    Try to load the data
    Add the taxonomy
    That’s BACKWARDS
  • 6. Data First!
    What are you building the system for?
    Assess the data
    Do the design
    Decide what else needs to be added
    Taxonomy terms
    Other controls
    Find a system that will work with your data
  • 7. Access Innovations – Complex Farm With Perfect Search
    Query
    Federators
    Query Servers
    Search Harmony Presentation Layer
    Deploy Hub
    Index Builders
    Cleanup, etc.
    Repository XIS (cache)
    Cache Builders
    Source Data
  • 8. FAST Search example: Core Architectural Components
    Diagram labels, grouped by type:
    Content sources: Web Content; Files, Documents; Databases; Email, Groupware; Custom Applications
    Connectors: Web Crawler, File Traverser, Database Connector, Email Connector, Custom Connector
    APIs: Content API, Query API, Management API, Data Harmony Governance API
    Processing: Content Push, Pipeline, Document Processor, Index DB, Search Server, Filter Server, Query Processor, Query Pipeline, Agent DB
    Output: Results, Alerts
    Front ends: Vertical Applications, Portals, Custom Front-Ends, Mobile Devices
    Other: Administrator’s Dashboard, Search Harmony, MAIstro
  • 9. Measuring accuracy in search
    Relevance
    Recall
    Precision
    Accuracy – hits, misses, noise
    Ranking
    Linguistics
    Query Processing
    Results Processing
    Display
    Search refinement
    Usability
    Business Rules
  • 10. Relevance
    How well a set of returned documents answers the information need
    “Accuracy”
    Related to objective of search
    Different user communities
    Information resources
    Tension of user needs and context available
    A confidence “guesstimate”
  • 11. The formulas
    Recall = (number of relevant items retrieved) / (number of relevant items in the collection)
    Precision = (number of relevant items retrieved) / (number of items retrieved)
    Relevance = Germane (Precision)
    Pertinent (Recall)
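A minimal sketch of the two formulas above, assuming each query's results and the relevance judgments are available as sets of document IDs. The function name, the IDs, and the counts are hypothetical. The sketch also reports the "hits, misses, noise" breakdown mentioned on the accuracy slide.

```python
def precision_recall(retrieved, relevant):
    """Return precision, recall and the hit/miss/noise sets for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant      # relevant items that were retrieved
    misses = relevant - retrieved    # relevant items the search failed to return
    noise = retrieved - relevant     # returned items that are not relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return {"precision": precision, "recall": recall,
            "hits": hits, "misses": misses, "noise": noise}


# Hypothetical query: 4 documents returned, 5 relevant documents in the collection.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d5", "d6", "d7"}
scores = precision_recall(retrieved, relevant)
print(f"precision={scores['precision']:.2f} recall={scores['recall']:.2f}")
```

With these made-up sets, two of the four returned documents are relevant (precision 0.50) and two of the five relevant documents were found (recall 0.40). Deciding which documents count as relevant is still the subjective step.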
  • 12. Measuring Relevance
    Concepts
    Context
    Age of documents
    Completeness (recall)
    Quality
    Statistically determined?
    Nope, it is subjective
    Someone has to determine the rightness of the item
    A confidence factor = canard!
  • 13. Kinds of search
    Bayesian
    FAST
    Lucene
    Autonomy / Verity
    Boolean
    Dialog
    Endeca
    Perfect Search
    Ranking algorithms
    Google
  • 14. Search Theoretical Basis: Those Famous Guys
    Boole
    Bayes
    Bayesian Techniques
    Turney
    Turney algorithm
    Enriched structured data
    Marco Dorigo
    Ant Colony
    This is only a sample
    of a large body of research
  • 15. George Boole and Boolean algebra
    George Boole
    Mathematician
    1815-1864
    Boolean algebra
    An algebraic system of logic
    AND, OR, NOT, ANDNOT
    Dialog, BRS, STAIRS
  • 16. Boolean representation
    Venn diagram showing the intersection of sets A AND B (in violet),
    the union of sets A OR B (all the colored regions),
    and set A XOR B (all the colored regions except the violet).
    The "universe" is represented by the rectangular frame.
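The Venn diagram operations map directly onto set operations. The sketch below assumes each term's postings are held as a set of document IDs; the IDs and the "universe" of six documents are made up for illustration.

```python
# Hypothetical postings: the set of document IDs that contain each term.
universe = {1, 2, 3, 4, 5, 6}   # every document in the collection
a = {1, 2, 3}                   # documents containing term A
b = {3, 4}                      # documents containing term B

a_and_b = a & b        # intersection: documents with both terms
a_or_b = a | b         # union: documents with either term
a_xor_b = a ^ b        # documents with exactly one of the two terms
a_andnot_b = a - b     # documents with A but without B
not_a = universe - a   # NOT is only defined relative to the universe

print(a_and_b, a_or_b, a_xor_b, a_andnot_b, not_a)
```

Here A AND B is {3}, A OR B is {1, 2, 3, 4}, A XOR B is {1, 2, 4}, A ANDNOT B is {1, 2}, and NOT A is {4, 5, 6}.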
  • 17. Bayes and Bayes’ Theorem
    Thomas Bayes
    Mathematician
    1702 - 1761
    Bayes’ theorem
    Uses probability inductively
    Established a mathematical basis for probability inference
    WHAT?
    A means of calculating,
    from the number of times an event has and has not occurred in past trials,
    the probability that it will occur in future trials
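One classical way to turn that description into arithmetic is Laplace's rule of succession, which is Bayesian updating under a uniform prior. The sketch below is an illustration under that assumption, not the method of any particular search engine; the function name and the trial counts are hypothetical.

```python
from fractions import Fraction


def probability_next_occurrence(occurred, not_occurred, prior_a=1, prior_b=1):
    """Posterior mean of a Beta(prior_a, prior_b) prior after the observed trials.

    With the default uniform prior (1, 1) this is Laplace's rule of succession:
    (occurred + 1) / (total trials + 2).
    """
    total = occurred + not_occurred
    return Fraction(occurred + prior_a, total + prior_a + prior_b)


# Hypothetical history: the event occurred in 7 trials and did not occur in 3.
p = probability_next_occurrence(occurred=7, not_occurred=3)
print(p)
```

For 7 occurrences in 10 trials the estimate is 8/12, or 2/3, slightly below the raw frequency of 0.7 because the prior still carries some weight.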
  • 18. Bayesian methods - Cautions
    A user might wish to change the distribution of probabilities.
    A user will make a novel request for information in a previously unanticipated way.
    The computational difficulty of exploring a previously unknown network.
    The quality and extent of the prior beliefs used in Bayesian inference processing.
  • 19. Bayesian cautions (cont.)
    A Bayesian network is only as useful as the prior knowledge is reliable.
    An optimistic or pessimistic expectation of the quality of these prior beliefs will distort the entire network and invalidate the results.
    Must ensure the statistical distribution selected for modeling the data is appropriate.
    Must have the proper distribution model to describe the data.
    That is, you have to constantly train and retrain the system on the data.
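The caution about prior beliefs can be made concrete with a small sketch. It assumes a Beta prior over the probability that a document matching some term is relevant; the priors, the evidence counts, and the function name are all hypothetical.

```python
def posterior_mean(successes, failures, prior_a, prior_b):
    """Mean of the Beta posterior after observing the given trial outcomes."""
    return (successes + prior_a) / (successes + failures + prior_a + prior_b)


# Hypothetical evidence: 3 relevant and 7 irrelevant documents for a term.
evidence = {"successes": 3, "failures": 7}

# Three hypothetical prior beliefs, written as Beta(a, b) parameters.
priors = {"optimistic": (9, 1), "neutral": (1, 1), "pessimistic": (1, 9)}

estimates = {name: posterior_mean(prior_a=a, prior_b=b, **evidence)
             for name, (a, b) in priors.items()}
for name, value in estimates.items():
    print(f"{name:12s} {value:.2f}")
```

The same ten observations give estimates of 0.60, 0.33, and 0.20 depending on the prior, which is why the prior has to be justified and the model retrained as more data arrives.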
  • 20. Peter Turney and the Turney Algorithm
    Peter D. Turney, Canada, present
    Learning algorithms for keyphrase extraction
    Tree Induction Algorithm
    Lexical Semantics
    GenEx – with human input
    80% acceptable
    Extraction vs. generation and sentiment of words
    log2( (hits(word AND "excellent") × hits("poor")) / (hits(word AND "poor") × hits("excellent")) )
  • 21. Marco Dorigo and Ant Colony Optimization
    Marco Dorigo
    Research director for the Belgian Fonds de la Recherche Scientifique
    Research director of the IRIDIA lab at the Université Libre de Bruxelles
    Ant Colony Optimization
    Metaheuristic for combinatorial optimization problems
    Swarm intelligence
    Value importance vs. heuristic importance
    Useful in search prediction
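In the classic Ant System formulation, an ant picks its next step with probability proportional to pheromone raised to a power alpha times heuristic desirability raised to a power beta. Reading the slide's "value importance vs. heuristic importance" as these two exponents is an assumption, and the candidate names and numbers below are hypothetical.

```python
def transition_probabilities(pheromone, heuristic, alpha=1.0, beta=2.0):
    """Random-proportional rule: weight = pheromone**alpha * heuristic**beta."""
    weights = {c: (pheromone[c] ** alpha) * (heuristic[c] ** beta)
               for c in pheromone}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


# Hypothetical candidates: B has more pheromone (learned value),
# A has the better heuristic score (for example, a shorter path).
pheromone = {"A": 1.0, "B": 2.0}
heuristic = {"A": 2.0, "B": 1.0}

favor_heuristic = transition_probabilities(pheromone, heuristic, alpha=1.0, beta=2.0)
favor_pheromone = transition_probabilities(pheromone, heuristic, alpha=2.0, beta=1.0)
print(favor_heuristic, favor_pheromone)
```

With beta larger than alpha, candidate A is chosen two times out of three; swapping the exponents gives the same odds to B instead.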
  • 22. Natural Language Processing
    Syntactic
    Semantic
    Morphological
    Phraseological
    Lemmatization (stemming)
    Statistical
    Grammatical
    Common Sense
  • 23. Basic areas of Automatic Language Processing (ALP)
    Auto Translation
    Auto Indexing
    Auto Abstracting
    Artificial Intelligence
    Searching
    Spell Checking
    Semantic Web
    Natural Language Processing (NLP)
    Computational Linguistics
  • 24. Statistical Search
    Cluster analysis
    Neural networks
    Co-occurrence
    Bayesian inference
    Latent Semantic
    Etc.
  • 25. Inverted Files and Boolean are basic to all search
    Searchable Index
    Inverted File Index
    Taxonomy
    Thesaurus
    Hierarchical Display
  • 26. Sample Slide for Inverted File Index Demonstration
    Outline of Presentation
    • Define key terminology
    • Thesaurus tools
        Features
        Functions
    • Costs
        Thesaurus construction
        Thesaurus tools
    • Why & when?
  • 27. Simple Inverted File Index
    &
    1
    2
    3
    4
    construction
    costs
    define
    features
    functions
    key
    of
    outline
    presentation
    terminology
    thesaurus
    tools
    when
    why
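A simple inverted file for the sample slide can be sketched as follows. The nine lines are taken from the sample slide; the tokenizer is a deliberately naive assumption, and the numerals in the slide's index (presumably list numbering) are left out.

```python
from collections import defaultdict

# The nine lines of the sample slide, in order.
lines = [
    "Outline of Presentation",
    "Define key terminology",
    "Thesaurus tools",
    "Features",
    "Functions",
    "Costs",
    "Thesaurus construction",
    "Thesaurus tools",
    "Why & when?",
]


def tokenize(text):
    """Split on whitespace, trim simple punctuation, and lowercase."""
    return [token.strip("?.,").lower() for token in text.split()]


# Map each word to the set of line numbers where it occurs.
index = defaultdict(set)
for line_no, line in enumerate(lines, start=1):
    for token in tokenize(line):
        index[token].add(line_no)

vocabulary = sorted(index)
both = sorted(index["thesaurus"] & index["tools"])   # Boolean AND on postings
print(vocabulary)
print(both)
```

The Boolean AND of "thesaurus" and "tools" returns lines 3 and 8, the two places where the sample slide says "Thesaurus tools".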
  • 28. Complex Inverted File Index
    Example 1
    & - Stop
    1 - Stop
    2 - Stop
    3 - Stop
    4 - Stop
    construction - L7, P2, SH
    costs - L6, P1, H
    define - L2, P1, H
    features - L4, P1, SH
    functions - L5, P1, SH
    key - L2, P2, H
    of - Stop
    outline - L1, P1, T
    presentation - L1, P3, T
    terminology - L2, P3, H
    thesaurus - (1) - L3, P1, H
                (2) - L7, P1, SH
                (3) - L8, P1, SH
    tools - (1) - L3, P2, H
            (2) - L8, P2, SH
    when - L9, P3, H
    why - L9, P1, H
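The complex index adds a line number, a word position, and a field code to every posting. The sketch below rebuilds those postings from the sample slide. Reading L as line, P as position, and T, H, SH as title, heading, and subheading is an assumption, as is the rule that stop words keep their position slot without being indexed.

```python
from collections import defaultdict

# (text, field code) for each line of the sample slide.
# Field codes are assumed to mean: T = title, H = heading, SH = subheading.
lines = [
    ("Outline of Presentation", "T"),
    ("Define key terminology", "H"),
    ("Thesaurus tools", "H"),
    ("Features", "SH"),
    ("Functions", "SH"),
    ("Costs", "H"),
    ("Thesaurus construction", "SH"),
    ("Thesaurus tools", "SH"),
    ("Why & when?", "H"),
]
STOP_WORDS = {"of", "&"}

# Map each word to a list of (line, position, field) postings.
postings = defaultdict(list)
for line_no, (text, field) in enumerate(lines, start=1):
    for position, raw in enumerate(text.split(), start=1):
        token = raw.strip("?.,").lower()
        if token in STOP_WORDS:
            continue   # the position is counted, but the word is not indexed
        postings[token].append((line_no, position, field))


def phrase_lines(first, second):
    """Lines where `second` appears in the position right after `first`."""
    following = {(line, pos) for line, pos, _ in postings[second]}
    return [line for line, pos, _ in postings[first] if (line, pos + 1) in following]


print(postings["thesaurus"])
print(phrase_lines("thesaurus", "tools"))
```

The postings for "thesaurus" come out as (3, 1, H), (7, 1, SH), (8, 1, SH), matching the slide. Positions are what make phrase and proximity searching possible, and field codes allow a title match to be weighted differently from a subheading match.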
  • 29. Word and Term Parsing
    Stemming
    -ing, -ed, -es, -’s, -s’, etc.
    Depluralization
    Truncation
    Left and right
    Wild cards
    Organi*ation
    Variant Spellings
    Centre, center
    Hyphens
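A rough sketch of these parsing steps, using only the standard library. The suffix list, the variant-spelling table, and the hyphen policy are simplified assumptions; production systems use a proper stemmer or lemmatizer.

```python
import fnmatch

# Crude suffix list; longer suffixes are tried first.
SUFFIXES = ("ing", "ed", "es", "'s", "s'", "s")

# Hypothetical variant-spelling table.
VARIANTS = {"centre": "center"}


def naive_stem(word):
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(word):
    """Lowercase, drop hyphens (one possible policy), map variants, then stem."""
    word = word.lower().replace("-", "")
    word = VARIANTS.get(word, word)
    return naive_stem(word)


vocabulary = ["organization", "organisation", "organic", "centre", "thesaurus"]
wildcard_hits = fnmatch.filter(vocabulary, "organi*ation")   # internal wild card
right_truncation = fnmatch.filter(vocabulary, "organ*")      # right truncation
normalized = [normalize(w) for w in ["Centre", "indexing", "indexed", "tools", "e-mail"]]
print(wildcard_hits, right_truncation, normalized)
```

The wild card query matches both "organization" and "organisation" but not "organic", and the five sample words normalize to center, index, index, tool, and email. Blind suffix stripping also produces bad stems (it would turn "thesaurus" into "thesauru"), which is one reason to pair it with a controlled vocabulary.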
  • 30. The taxonomy effect
    Where do the terms go?
    How are they used in search?
    What other ways can I use the taxonomy in search?
  • 31. Site search
    Search of 53 crawled sites including journals, books, web site, conference sites, etc.
    Navigation
    Bookstore search
    Search database for Journals and pubs
    To search all publications
  • 32. Navigate the full taxonomy “tree”
    BROWSE
    Auto-completion using the taxonomy
    Guide the user
    Taxonomy Driven Search Presentation
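Auto-completion from a taxonomy can be as simple as a prefix lookup over the sorted list of terms. The term list below is hypothetical, and the sketch ignores synonyms and ranking.

```python
import bisect

# Hypothetical taxonomy terms, kept sorted so prefixes can be found by bisection.
TERMS = sorted([
    "taxonomies",
    "taxonomy construction",
    "text mining",
    "thesauri",
    "thesaurus tools",
])


def autocomplete(prefix, terms=TERMS, limit=5):
    """Return up to `limit` taxonomy terms that start with `prefix`."""
    prefix = prefix.lower()
    start = bisect.bisect_left(terms, prefix)   # first term >= prefix
    matches = []
    for term in terms[start:]:
        if not term.startswith(prefix):
            break                               # sorted, so no later match exists
        matches.append(term)
        if len(matches) == limit:
            break
    return matches


print(autocomplete("tax"))
print(autocomplete("thes"))
```

Typing "tax" suggests "taxonomies" and "taxonomy construction". Because every suggestion is a valid taxonomy term, the user is guided toward queries that the indexed content can answer.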
  • 33. A quick look behind the scenes
    Database Management System
    • Search thesaurus
    • Validate term entry
    • Block invalid terms
    • Record candidates
    • Establish rules for term use
    • Suggest indexing terms
    Thesaurus tool
    Indexing tool
    • Validate terms
    • Add terms and rules
    • Change terms and rules
    • Delete terms and rules
  • Thesaurus Term Record view
    Taxonomy view
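The "validate term entry", "block invalid terms", and "record candidates" steps listed on the behind-the-scenes slide can be sketched as a lookup against the controlled vocabulary. Everything here is hypothetical (the vocabulary, the USE references, and the function name); it does not describe how any particular thesaurus tool implements these steps.

```python
# Hypothetical controlled vocabulary: preferred terms plus non-preferred entry
# terms that point to a preferred term (USE references).
PREFERRED = {"taxonomies", "thesauri", "indexing"}
USE_REFERENCES = {
    "controlled vocabularies": "taxonomies",
    "thesauruses": "thesauri",
}

# Terms that indexers tried but the thesaurus does not know yet.
candidates = []


def validate_term(entry):
    """Return the preferred term for `entry`, or None if the entry is blocked."""
    term = entry.strip().lower()
    if term in PREFERRED:
        return term
    if term in USE_REFERENCES:
        return USE_REFERENCES[term]   # switch to the preferred term
    candidates.append(term)           # record for the taxonomy editor to review
    return None


results = [validate_term(t) for t in ["Indexing", "thesauruses", "folksonomies"]]
print(results, candidates)
```

The first entry is accepted, the second is switched to "thesauri", and the third is blocked and recorded as a candidate term.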
  • 41. Where does the subject metadata go?
    Apply to content itself
    Use meta name field in HTML header
    Connect search to the keywords in the SQL or other database tables
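For the HTML header option, the taxonomy terms assigned to a page can be written into a meta element. The sketch assumes the standard "keywords" meta name; which meta names a given search engine actually reads is an assumption to verify for that engine. The function name and terms are hypothetical.

```python
from html import escape


def subject_meta_tag(terms, name="keywords"):
    """Build a <meta> element carrying the taxonomy terms for a page."""
    content = ", ".join(terms)
    # escape() protects quotes and ampersands inside attribute values.
    return f'<meta name="{escape(name)}" content="{escape(content)}">'


tag = subject_meta_tag(["Taxonomies", "Search & retrieval"])
print(tag)
```

This prints `<meta name="keywords" content="Taxonomies, Search &amp; retrieval">`, ready to be placed inside the page's head element.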
  • 42. HTML Header
  • 43. RDBMS Connection
    Taxonomy term table
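A taxonomy term table is typically linked to content records through a join table. The sketch below uses SQLite from the standard library; the schema, table names, and rows are hypothetical and are not taken from the presentation's system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE document (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE taxonomy_term (id INTEGER PRIMARY KEY, label TEXT UNIQUE);
    -- Join table: one row per (document, assigned taxonomy term).
    CREATE TABLE document_term (
        document_id INTEGER REFERENCES document(id),
        term_id INTEGER REFERENCES taxonomy_term(id),
        PRIMARY KEY (document_id, term_id)
    );
    INSERT INTO document VALUES (1, 'Building a thesaurus'), (2, 'Search ranking basics');
    INSERT INTO taxonomy_term VALUES (10, 'Thesauri'), (20, 'Search engines');
    INSERT INTO document_term VALUES (1, 10), (2, 20), (1, 20);
""")


def documents_for_term(label):
    """Titles of all documents tagged with the given taxonomy term."""
    rows = conn.execute(
        """SELECT d.title
           FROM document d
           JOIN document_term dt ON dt.document_id = d.id
           JOIN taxonomy_term t ON t.id = dt.term_id
           WHERE t.label = ?
           ORDER BY d.id""",
        (label,),
    )
    return [title for (title,) in rows]


tagged = documents_for_term("Search engines")
print(tagged)
```

Searching on the term "Search engines" returns both documents, including the one whose title never mentions search, because the match is made on the assigned term rather than on the text.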
  • 44. Suggested taxonomy descriptors
  • 45.
  • 46. Integrate taxonomy to enhance findability
    Browsable categories of a directory
    Browsable faceted navigation
    Smart search for term equivalents
    Taxonomy terms (original or modified) as labels
    Navigation aids incorporate taxonomy terms and relationships
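"Smart search for term equivalents" usually means expanding the user's word into the preferred term and its synonyms before the query runs. The thesaurus fragment, function name, and output query syntax below are hypothetical.

```python
# Hypothetical thesaurus fragment keyed by preferred term.
THESAURUS = {
    "automobiles": {
        "synonyms": ["cars", "motor vehicles"],
        "narrower": ["electric vehicles"],
    },
}

# Reverse lookup from each synonym to its preferred term.
ENTRY_TERMS = {synonym: preferred
               for preferred, record in THESAURUS.items()
               for synonym in record["synonyms"]}


def expand_query(term, include_narrower=False):
    """Rewrite a user term as an OR query over its taxonomy equivalents."""
    term = term.lower()
    preferred = ENTRY_TERMS.get(term, term)
    record = THESAURUS.get(preferred, {})
    expanded = [preferred] + record.get("synonyms", [])
    if include_narrower:
        expanded += record.get("narrower", [])
    return " OR ".join(f'"{t}"' for t in expanded)


basic = expand_query("cars")
broad = expand_query("cars", include_narrower=True)
print(basic)
print(broad)
```

A search for "cars" becomes `"automobiles" OR "cars" OR "motor vehicles"`, which raises recall without asking the user to know the preferred term. Adding narrower terms widens the net further at some cost to precision.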
  • 47. More Taxonomy Enrichment
    Spelling alternatives and correction
    Related concepts
    Statistical information about the metadata
    Navigation or drill downs
    Search refinement
    Recursive sets
    Concept linking
    Dictionary lookup (in taxonomy glossary)
  • 48. Brand is repeated in several spots and tied to search as well
  • 49. Data Base Plus Search Workflow
    Raw full text data feeds
    XIS Creation
    SQL for ecommerce
    Printed source materials
    Add metadata
    Data Crawls on 53+ sources
    XIS repository
    Taxonomy terms
    Load to Perfect Search
    MAI Concept Extractor
    Taxonomy Thesaurus Master
    MAI Rule Base
    Search Harmony Display Search
    Save data to search and repositories at the same time
  • 50. Data Base Plus Search Workflow
    Raw full text data feeds
    XIS Creation
    SQL for ecommerce
    Printed source materials
    XIS repository
    Data Crawls on data sources
    Add metadata
    Load to Search
    MAI Concept Extractor
    MAI Rule Base
    Search Harmony Display Search
    Taxonomy Thesaurus Master
    Source data
    Taxonomy terms
    Search data
    Clean and enhance data
  • 51. Client Data
    Full Text
    HTML, PDF, Data Feeds, etc.
    Taxonomy in SharePoint
    Automatic Summarization
    Search
    Presentation: 90% accuracy
    Browse by Subject
    Auto-completion
    Broader Terms
    Narrower Terms
    Related Terms
    Machine Aided Indexer (M.A.I.™)
    Repository
    Search
    Software
    Inline Tagging
    Client taxonomy
    Client Taxonomy
    Metadata and Entity Extractor
    Thesaurus Master
  • 52. What we covered
    How search works
    Measuring accuracy in search
    Search theoretical basis
    Bayes, Boole and the rest of the guys
    The taxonomy effect
  • 53. Do the data FIRST
    What do you have?
    What does it need?
    How would you LIKE to access it?
    Look at the data BEFORE you create the specifications
    A DTD built without data is not going to work
    Then choose the system that will support your data
  • 54. Next Month
    Same time, same station
    Solving the Challenge of Connecting People and Author Networks
    Jay Ven Eman, Ph.D.
    September 14
    As online digital publishing continues to grow, taxonomies can be increasingly useful in connecting people with author networks through directory creation with author disambiguation and subject metadata tagging to increase the usefulness of information for researchers and community-building.
  • 55. About Access Innovations
    Access Innovations is an expert in content creation, enrichment, and conversion services. We provide services to semantically enrich and tag raw text, turning it into highly structured data. We deliver clean, well-formed, metadata-enriched data so our clients can reuse, repurpose, store, and find their knowledge assets. We go beyond the standards to build taxonomies and other data control structures as a solid foundation for data.
    Our services and software allow organizations to use and present their information to both internal and external constituents by leveraging search, presentation, and e-commerce. We change search to found!
    Quick Facts
    • Founded in 1978
    • Headquartered in Albuquerque
    • Privately held
    • Delivered more than 2000 engagements
  • Thank you for your attention!
    Slides will be available on SLA Taxonomy Division and Access Web sites tomorrow
    Taxonomies in Search: http://www.accessinn.com/library/presentations/sla-taxonomies-in-search-aug10-2011.pptx
    Marjorie M. K. Hlava
    Access Innovations / Data Harmony
    mhlava@accessinn.com
    +505.998.0800