Enterprise Architecture Topics


Published in: Technology, Education

  1. 1. Ontologies, Taxonomies and Search Denise Bedford Special Libraries Association Baltimore, Maryland June, 2006
  2. 2. Presentation Overview <ul><li>I will talk today from the ‘end game’ perspective of search – placing taxonomies and ontologies in that context </li></ul><ul><ul><li>Search – User Needs </li></ul></ul><ul><ul><li>Search Functional Architecture </li></ul></ul><ul><ul><li>Metadata & Categorization </li></ul></ul><ul><ul><li>Semantic Technologies </li></ul></ul>
  3. 3. Search – User Needs
  4. 4. Basic Assumptions About Search Systems <ul><li>Search systems are not WYSIWYG – what you see is not what you get </li></ul><ul><li>Search systems are more like icebergs – most of the search system is below the surface – you can’t see it, you don’t know how it has been configured or what components it has </li></ul><ul><li>It is helpful to have a framework against which you can judge the suitability of any search tool to your environment </li></ul><ul><li>Assess the search environment before you decide what kind of a search system you need </li></ul>
  5. 5. The Environment <ul><li>Like other libraries, we acquire, create, license, and access information in 30 subject domains – ranging from Law & Justice, to Transport, to Environment, to Agriculture, to Education, Health & Nutrition, etc. </li></ul><ul><li>We have 500 different kinds of content/document types </li></ul><ul><li>We have a set of business processes which represent the way we do our work </li></ul><ul><li>We have six working languages – but actually work in many more </li></ul><ul><li>We have a rich history – content dating back to the 1940s </li></ul><ul><li>Our priorities change over time </li></ul><ul><li>Everyone in the Bank is a recognized international expert </li></ul>
  6. 6. Underlying Search Architecture Silos of Information = Multiple search engines and user interfaces
  7. 7. Search Environment <ul><li>It is very challenging to design a search system which meets the needs of this complex environment </li></ul><ul><li>Internal experts do ‘known item’ searching and look for other experts to talk to </li></ul><ul><li>The general public looks for information ‘about’ something </li></ul><ul><li>Most people don’t know all the dimensions of the Bank </li></ul><ul><li>People need to search in languages other than English and find content written in other languages </li></ul><ul><li>Our new Knowledge and Learning initiative is moving us towards a 3W working environment – what information I need, when I need it, where I need it </li></ul><ul><li>We need a search architecture that supports the user’s needs </li></ul>
  8. 8. Information Architecture: Moving Towards Contextualization Anonymous Access (Context Free) Group Access (Context Sensitive – Group) Individualized Access (Context Sensitive – Individual) Customized Access to Information Content – Information Assets Structured Less structured Intranet External Web Operations Portal myPortal* myPortal* Client Connections People Basic Information “ Customer” Information 3W Information Model What I Want Where I Want When I Want
  9. 9. Search Use Case Analysis Why Search Wasn’t Working
  10. 10. Search Has Always Been Problematic <ul><li>Despite any reports you may read in the general literature, buying a fast crawler will not solve your search problems </li></ul><ul><li>Implementing a fast crawler simply surfaces information management and data quality challenges directly to the users </li></ul><ul><li>The crawler approach provides high hit rates, and very low relevancy rates </li></ul><ul><li>It was a challenge to understand the search problem from the system side </li></ul><ul><li>Try to understand the search problem from the user’s side </li></ul>
  11. 11. Consultant is looking for the CAS for the Congo. Steps 1 - 2 DEC Consultant F Intranet Search ImageBank Search Steps 4-6 F
  12. 12. Adviser wants to find social indicators, relevant quotes and Bank’s position on racism, as well as learn what the Bank has done in this area in order to prepare a 30 min. speech for a WB Managing Director for conference. Steps 1-4 HD Adviser F WB Intranet Search Google Internet Steps 7-12 PS External NGO Colleague Bank Colleagues LAC Group Steps 15-16 S Steps 5-6 S PS
  13. 13. What We Learned <ul><ul><li>The absolute success rate per search task was low at 43.18% </li></ul></ul><ul><ul><li>The Search experience generally consists of multiple steps and multiple searches within and across sources </li></ul></ul><ul><ul><li>The source with the fewest number of steps was a Colleague or Personal Contact. The source with the greatest number of steps was External Web Search Browse </li></ul></ul><ul><ul><li>Each source has its own behavior, business rules, functional architectures – users need to learn each system </li></ul></ul><ul><ul><li>The Intranet search was selected as a logical place to look for information in over 65% of the use cases. However, the success rate for the Intranet Google search was only 35% </li></ul></ul>
  14. 14. What We Learned <ul><li>You need to understand search from the searcher’s point of view </li></ul><ul><li>Until you can see it this way, you are simply guessing at what problems need to be solved and how to solve them </li></ul><ul><li>This is a challenge for an organization whose users are the public, but it is possible to design an architecture that suits the general public and the internal staff, and the important stakeholders </li></ul><ul><li>The critical success factor is to design an open architecture for agility and flexibility </li></ul>
  15. 15. Enterprise Search Evolving Search Architecture
  16. 16. Search Engine Parts Contextualization, Personalization, Recommender, Content Syndication, Q&A Systems, Intelligent Search What the user sees Search System Inputs Metadata Repository for Enterprise Search Index Architecture & Construction Vocabulary Support (thesaurus) Classification Schemes (taxonomy) Query Transformation & Matching Boolean matching Exact phrase matching Fuzzy matching Term matching + synonym expansion Term matching + cross language expansion Term weighted matching Query term root matching & dictionary expansion Wild card matching Term proximity matching Neural network expansion Genetic algorithm expansion Search Outputs Query Manipulation Simple search Fielded Search Query Processing Algorithms Search term assistance (thesaurus, dictionaries) Search language selection Sources selection Display Results Integrated search results Search results sorting Search results contextualization Search results relevancy ranking The basic search programs Foundation or Baseline of The Search System <ul><li>Query Processing </li></ul><ul><li>Transforming the query to add tags, </li></ul><ul><li>Synonyms, etc. </li></ul>
  17. 17. Enterprise Search Proposed Solution <ul><li>What Enterprise Search is and what it is not </li></ul><ul><li>Distinguish between the web services platform for accessing Enterprise Search and the enterprise’s full store of content (it’s not only electronic, and it’s not all in HTML format!) </li></ul><ul><li>What kind of an architecture do you need to support enterprise search? </li></ul><ul><li>What kind of tools do you need to support enterprise search? It’s much more than just a web crawler. </li></ul><ul><li>What do you need to support true concept level cross-language searching? </li></ul><ul><li>How does Enterprise Search respect security classifications with/without restricting knowledge of what exists? </li></ul>
  18. 19. Taxonomies and Ontologies in Search
  19. 20. Using Taxonomies in Search <ul><li>Controlled vocabularies – guiding users through the search process (flat taxonomies) </li></ul><ul><li>Classification schemes – browsing, navigation, syndication, contextualization (hierarchical taxonomies) </li></ul><ul><li>Metadata – supporting fielded search (faceted taxonomies) </li></ul><ul><li>Thesauri – supporting knowledge discovery, preventing Zero results, supporting cross-language searching (network taxonomies) </li></ul><ul><li>Synonym Rings – improving relevancy, reducing information scattering, managing recall (ring taxonomies) </li></ul>
  20. 21. Flat taxonomy Hierarchical taxonomy Ring taxonomy Ring taxonomy Fielded Search = Faceted Taxonomy Vision of An Enterprise Advanced Search
  21. 22. Ring Taxonomy Network Taxonomy Metadata
  22. 23. Data Driving the Enterprise Architecture 30 Second Description of Taxonomies and Ontologies
  23. 24. Taxonomies & Ontologies <ul><li>I just mentioned several kinds of taxonomies that support search </li></ul><ul><ul><li>Faceted taxonomies (metadata schemes) </li></ul></ul><ul><ul><li>Flat taxonomies (controlled vocabularies, picklists) </li></ul></ul><ul><ul><li>Hierarchical taxonomies (classification schemes) </li></ul></ul><ul><ul><li>Network taxonomies (thesauri, semantic networks) </li></ul></ul><ul><ul><li>Ring taxonomies – to handle synonym rings and synonym searching </li></ul></ul><ul><li>Let’s walk through these kinds of taxonomies </li></ul><ul><li>Then let’s see how they all make up an ontology </li></ul>
  24. 25. Facet Taxonomies Faceted taxonomy represented as a star data structure. Each node in the star structure is linked to the center focus. Any node can be linked to other nodes in other stars. Appears simple, but becomes complex quickly.
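
The star structure described on this slide can be sketched in a few lines of Python. This is only an illustration: the class name `ContentItem`, the function `fielded_search`, and the facet names and sample records are assumptions, not part of the original deck. The content item sits at the center of the star and each facet value is one of its points; fielded search then filters on those points.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ContentItem:
    """Center of the star: one content item with facet values as its points."""
    title: str
    facets: Dict[str, Set[str]] = field(default_factory=dict)


def fielded_search(items: List[ContentItem], **criteria: str) -> List[ContentItem]:
    """Return the items whose facets contain every requested facet value."""
    return [
        item
        for item in items
        if all(value in item.facets.get(name, set()) for name, value in criteria.items())
    ]


# Hypothetical records; facet names and values are illustrative only.
ITEMS = [
    ContentItem("Rural roads appraisal", {"topic": {"Transport"}, "language": {"English"}}),
    ContentItem("Evaluation des routes rurales", {"topic": {"Transport"}, "language": {"French"}}),
    ContentItem("Primary schooling review", {"topic": {"Education"}, "language": {"English"}}),
]

matches = fielded_search(ITEMS, topic="Transport", language="English")
print([item.title for item in matches])
```

Each additional facet adds one more point to every star, which is where the "becomes complex quickly" warning comes from.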
  25. 26. Core Metadata Strategy <ul><li>What is a core metadata strategy? </li></ul><ul><li>What’s the process you use to discover your organization’s core metadata strategy? </li></ul><ul><li>Libraries have many core metadata standards – COSATI, Dublin Core, MARC, MODS, AIIM TR48 </li></ul><ul><li>The important question for metadata in search is how are you using your metadata to support search? </li></ul><ul><li>Too often we put a ‘dumb’ search engine on top of ‘smart metadata’ and do nothing more with it than ‘publish’ it </li></ul><ul><li>It’s time to think smarter about how we use our metadata </li></ul>
  26. 27. Purpose of Core Metadata Identification/ Distinction Search & Browse Use Management Compliant Document Management
  27. 28. Capturing Core Metadata Identification/ Distinction Use Management Compliant Document Management Human Creation Programmatic Capture Inherit from System Context Extrapolate from Business Rules Search & Browse
  28. 29. Flat Taxonomy Structure Energy Environment Education Economics Transport Trade Labor Agriculture
  29. 30. Hierarchical Taxonomy A hierarchical taxonomy is represented as a tree data structure in a database application. The tree data structure consists of nodes and links. In an RDBMS environment, the relationships become associations. In a hierarchical taxonomy, a node can have only one parent.
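
A minimal sketch of the single-parent rule, assuming a hypothetical `HierarchicalTaxonomy` class and invented subtopics under "Transport" (neither comes from the slides). Storing one parent per term is what makes the structure a tree: a term that already has a place cannot be added again under a second parent.

```python
class HierarchicalTaxonomy:
    """Tree of terms in which every node has at most one parent."""

    def __init__(self):
        self._parent = {}  # term -> parent term (None for a root)

    def add(self, term, parent=None):
        """Place a term in the tree; reject a second placement of the same term."""
        if term in self._parent:
            raise ValueError(f"{term!r} already has a place in the tree")
        if parent is not None and parent not in self._parent:
            raise KeyError(f"unknown parent {parent!r}")
        self._parent[term] = parent

    def path(self, term):
        """Return the chain of terms from the root down to `term`."""
        chain = []
        while term is not None:
            chain.append(term)
            term = self._parent[term]
        return list(reversed(chain))

    def children(self, term):
        """Return the direct children of `term` in alphabetical order."""
        return sorted(t for t, p in self._parent.items() if p == term)


# Hypothetical classification scheme, for illustration only.
tree = HierarchicalTaxonomy()
tree.add("Transport")
tree.add("Roads", parent="Transport")
tree.add("Railways", parent="Transport")
tree.add("Rural roads", parent="Roads")

print(tree.path("Rural roads"))
print(tree.children("Transport"))
```

In a relational database the same rule is usually a single nullable parent column on the term table.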
  30. 31. Hierarchical Taxonomies – Classification Schemes <ul><li>Ranganathan is the supreme authority on how to create a well-designed classification scheme </li></ul><ul><li>Most of what we teach in graduate school, though, is how to use a classification scheme, not how to create one </li></ul><ul><li>Classification schemes are the controlling reference source for metadata attributes </li></ul><ul><li>For example, the Enterprise Topic Classification Scheme is the reference source for the attribute Topic </li></ul><ul><li>We use tools to help us discover what the scheme should be and also to help us classify content </li></ul>
  31. 32. Network Taxonomies A network taxonomy is a plex data structure. Each node can have more than one parent. Any item in a plex structure can be linked to any other item. In plex structures, links can be meaningful & different.
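
A plex structure with typed links can be sketched as follows. The relationship codes BT (broader term), NT (narrower term) and RT (related term) follow common thesaurus practice; the class name `NetworkTaxonomy` and the sample terms are assumptions made for illustration. Unlike the tree, a term here may have more than one broader term.

```python
from collections import defaultdict

# Reciprocal relationship types in the style of thesaurus standards:
# BT = broader term, NT = narrower term, RT = related term.
RECIPROCAL = {"BT": "NT", "NT": "BT", "RT": "RT"}


class NetworkTaxonomy:
    """Plex structure: any term can link to any other, and links carry a type."""

    def __init__(self):
        self._links = defaultdict(set)  # term -> {(relation, other term)}

    def link(self, term, relation, other):
        """Record a typed link and its reciprocal on the other term."""
        self._links[term].add((relation, other))
        self._links[other].add((RECIPROCAL[relation], term))

    def related(self, term, relation):
        """Return the terms linked to `term` by the given relation type."""
        return sorted(o for r, o in self._links.get(term, ()) if r == relation)


# Hypothetical thesaurus entries, for illustration only.
net = NetworkTaxonomy()
net.link("Microfinance", "BT", "Finance")
net.link("Microfinance", "BT", "Poverty reduction")  # a second parent is allowed
net.link("Microfinance", "RT", "Rural credit")

print(net.related("Microfinance", "BT"))
print(net.related("Finance", "NT"))
```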
  32. 33. Ring Taxonomy Ring members: Poverty mitigation, Poverty alleviation, Poverty elimination, Poverty reduction, Poverty eradication, Poverty abatement, Poverty prevention, Poverty reducation. Rings can include all kinds of synonyms – true, misspellings, predecessors, abbreviations
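
The ring on this slide can be used for query expansion roughly as follows. The ring members are taken from the slide, including the deliberate misspelling "reducation"; the function name `expand_query` and the matching logic are an assumed sketch, not the search engine's actual behavior. Any member of the ring retrieves the whole ring, which is how rings reduce scattering and manage recall.

```python
# Ring members as listed on the slide, lowercased; "poverty reducation" is a
# deliberate misspelling kept in the ring so that mistyped queries still match.
POVERTY_RING = frozenset({
    "poverty mitigation",
    "poverty alleviation",
    "poverty elimination",
    "poverty reduction",
    "poverty eradication",
    "poverty abatement",
    "poverty prevention",
    "poverty reducation",
})

RINGS = [POVERTY_RING]


def expand_query(query, rings=RINGS):
    """Return the phrases to match: the query plus its whole ring, if it has one."""
    normalized = " ".join(query.lower().split())
    for ring in rings:
        if normalized in ring:
            return set(ring)
    return {normalized}


print(sorted(expand_query("Poverty  Reduction")))
print(expand_query("trade"))
```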
  33. 34. Ontology Design from 50,000 Feet [Diagram labels. Entities: Content Entity, Content Elements, Content Parts, Content Metadata, User, Profile, Business Rule, Contextual Matrix & Sensing. Reference sources: Topic Class Scheme, Business Process Scheme, Thesaurus, Country Names, Region Names, Skill Sets/Competencies, Standard Statistical Variables. Link labels: Has, Has values, Has Metadata, Contains, uses, Understood, Has relationship to, Has Meaning in Context. Status legend: Not in progress, In progress, In Use.]
  34. 35. Building & Maintaining Taxonomies Moving Towards Automated Metadata Capture
  35. 36. Relationships across data classes 3 Oracle Data classes Topic data class Subtopic Data Class
  36. 37. Building and Maintaining Taxonomies <ul><li>Moving towards automated metadata generation means that catalogers shift their effort to reviewing the metadata generated and to maintaining LCSH and LC classification as part of a suite of categorization tools </li></ul><ul><li>Level of effort shifts to training and developing the tools and away from original cataloging and metadata capture </li></ul><ul><li>Continue to work closely with subject experts to define the controlled vocabularies and classification schemes </li></ul>
  37. 38. The problem with metadata <ul><li>Metadata…. </li></ul><ul><ul><li>Is expensive and time consuming to create </li></ul></ul><ul><ul><li>Is sometimes subjective and not granular enough </li></ul></ul><ul><ul><li>Doesn’t always address the ways that users and systems think about the information it describes </li></ul></ul><ul><ul><li>May not tell us enough about the information to trust it </li></ul></ul><ul><ul><li>May address only one context – the context for which it is created </li></ul></ul><ul><ul><li>May live in the source application where it was created </li></ul></ul><ul><ul><li>May not be as accessible as the information object </li></ul></ul><ul><li>The future depends on metadata so we have to resolve these problems </li></ul><ul><li>We are resolving the problem using automated metadata creation and extraction technologies </li></ul><ul><li>Metadata extraction uses rules, classifiers and grammar engines </li></ul><ul><li>Metadata creation uses categorization and rules engines </li></ul>
  38. 39. Smart Use of Technologies <ul><li>Sample structure – Bank Topics Classification Scheme (hierarchical taxonomy) </li></ul><ul><ul><li>Oracle data classes used to represent Topic Classification scheme </li></ul></ul><ul><ul><ul><li>hierarchical taxonomy as reference source for the attribute – Topic </li></ul></ul></ul><ul><ul><ul><li>used for Browse, Search, Content Syndication, Personalization </li></ul></ul></ul><ul><ul><li>1 st challenge is to architect the hierarchy correctly </li></ul></ul><ul><ul><ul><li>3 distinct data classes, not a tree structure with inheritance </li></ul></ul></ul><ul><ul><ul><li>Allows you to use the three data classes for distinct functions across systems but still enforce relationships across the classes </li></ul></ul></ul>
  39. 40. What is Teragram? <ul><li>Semantic analysis tools which support concept extraction, categorization, summarization and pattern matching rules engines </li></ul><ul><li>Teragram works in 23 languages </li></ul><ul><li>Use categorization to capture Topics, Business Activities, Regions, Sectors, Themes, etc. </li></ul><ul><li>Use Concept Extraction to capture keywords </li></ul><ul><li>Use Rules Engine to capture Loan #, Credit #, Project ID, Trust Fund #, etc. </li></ul><ul><li>Use Summarization to generate a ‘gist’ of the content </li></ul>
  40. 41. How does semantic analysis work?
  41. 42. Automatically Generated Metadata Metadata is generated using a categorization profile – reflects the way people organize.
  42. 43. Automatically Extracted Metadata <?xml version="1.0" encoding="UTF-8"?> <Proper_Noun_Concept> <Source><Source_Type>file</Source_Type> <Source_Name>W:/Concept Extraction/Media Monitoring Negative Training Set/ 001B950F2EE8D0B4452570B4003FF816.txt</Source_Name> </Source><Profile_Name>PEOPLE_ORG</Profile_Name> <keywords> Abdul Salam Syed, Aruna Roy, Arundhati Roy, Arvind Kesarival, Bharat Dogra, Kwazulu Natal, Madhu Bhaduri , </keywords><keyword_count>7</keyword_count> </Proper_Noun_Concept> Proper Noun Profile for People Names uses grammars to find and extract the names of people referenced in the document.
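
A downstream system could read this XML-wrapped output along the following lines. The sketch uses Python's standard `xml.etree.ElementTree`; the sample is an abbreviated copy of the slide's output (the `Source` element and XML declaration are left out), and the function name `parse_concepts` and the returned dictionary shape are assumptions.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the output shown on the slide (Source element omitted).
SAMPLE = """<Proper_Noun_Concept>
  <Profile_Name>PEOPLE_ORG</Profile_Name>
  <keywords> Abdul Salam Syed, Aruna Roy, Arundhati Roy, Arvind Kesarival, Bharat Dogra, Kwazulu Natal, Madhu Bhaduri , </keywords>
  <keyword_count>7</keyword_count>
</Proper_Noun_Concept>"""


def parse_concepts(xml_text):
    """Pull the profile name and the cleaned keyword list out of one result."""
    root = ET.fromstring(xml_text)
    raw = root.findtext("keywords", default="")
    # Split on commas and drop the empty entry left by the trailing comma.
    keywords = [k.strip() for k in raw.split(",") if k.strip()]
    declared = int(root.findtext("keyword_count", default="0"))
    return {
        "profile": root.findtext("Profile_Name"),
        "keywords": keywords,
        "count_matches": declared == len(keywords),
    }


result = parse_concepts(SAMPLE)
print(result["profile"], len(result["keywords"]), result["count_matches"])
```

Comparing the declared `keyword_count` with the parsed list is a cheap sanity check before the metadata is loaded into a repository.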
  43. 44. Semantic Analysis Basics <ul><li>Once you have made some sense of the sentence, reconstruct entities for information extraction (compose) </li></ul><ul><ul><li>Identify names and other fixed form expressions – people, organizations, conferences </li></ul></ul><ul><ul><li>Identify basic noun groups, verb groups, prepositions, other grammatical elements </li></ul></ul><ul><ul><li>Construct complex noun groups and verb groups </li></ul></ul><ul><ul><li>Identify event structures </li></ul></ul><ul><ul><li>Identify common elements and associate </li></ul></ul>
  44. 45. Categorizing Content <ul><li>Let’s look at how we’re categorizing our content to this structure automatically </li></ul><ul><li>Topic classification, geographical region assignment, keywording examples </li></ul><ul><li>Can apply this approach to any kind of content </li></ul><ul><li>Enables us to build a robust metadata repository model, with strong metadata quality, to move towards semantic interoperability (SI) at the functional level </li></ul><ul><li>Also note that we can do this across many languages </li></ul>
  45. 46. Leveraging the Structure <ul><li>Each subtopic is a knowledge domain (hierarchical taxonomy) and each subtopic has an extensive concept level definition (1,000 – 5,000+ concepts) </li></ul><ul><li>Concepts are controlled vocabularies in their raw form (flat taxonomy) </li></ul><ul><li>Concepts with relationships (extensive per new Z39.19 standard) comprise a semantic network (network taxonomy) </li></ul><ul><li>Categorization tools work with topic structure & concept definitions to categorize and index content </li></ul><ul><li>The following screen illustrates how that same structure is embedded into a Teragram profile to support categorization </li></ul>
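
To make the idea of categorizing against concept definitions concrete, here is a deliberately simple sketch. It is not how Teragram works internally; it only counts occurrences of each subtopic's concepts and assigns topics that reach a threshold. The tiny `PROFILE`, the `min_hits` parameter, and the sample sentence are invented, and a real profile would hold thousands of concepts per subtopic, as the slide notes.

```python
import re
from collections import Counter

# Hypothetical profile: each topic is defined by a small set of concepts.
PROFILE = {
    "Transport": {"road", "railway", "port", "freight"},
    "Education": {"school", "teacher", "curriculum", "literacy"},
}


def categorize(text, profile=PROFILE, min_hits=2):
    """Return topics whose concepts occur at least `min_hits` times, best first."""
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {
        topic: sum(tokens[concept] for concept in concepts)
        for topic, concepts in profile.items()
    }
    qualifying = [topic for topic, score in scores.items() if score >= min_hits]
    return sorted(qualifying, key=lambda topic: (-scores[topic], topic))


DOC = ("The project rehabilitates a rural road and upgrades the railway "
       "freight terminal near the school.")
print(categorize(DOC))
```

The threshold keeps a single passing mention ("school" above) from assigning a topic, which is one simple way to trade recall for precision.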
  46. 47. Subtopics Domain concepts or controlled vocabulary
  47. 48. Extensive operators allow us to write grammatical rules to manage typical semantic problems
  48. 49. Concept based rules engine allows us to define patterns to capture other kinds of data
  49. 50. Example of use of Authority Control to capture country names but extract ‘authorized’ version of country name Example of use of a gazetteer + concept extraction + rules engine to support semantic interoperability
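
Authority control of this kind can be sketched as a lookup from name variants to one authorized form. The authorized forms and variants below are illustrative assumptions, not the Bank's actual authority file, and `extract_countries` is a hypothetical helper. Whatever variant appears in the text, the metadata records the authorized version.

```python
import re

# Hypothetical authority file: authorized form -> accepted variants.
AUTHORITY = {
    "Congo, Democratic Republic of": ["Democratic Republic of the Congo", "DR Congo", "Zaire"],
    "Cote d'Ivoire": ["Ivory Coast"],
}


def build_lookup(authority):
    """Map every lowercased name, authorized or variant, to its authorized form."""
    lookup = {}
    for authorized, variants in authority.items():
        for name in [authorized, *variants]:
            lookup[name.lower()] = authorized
    return lookup


LOOKUP = build_lookup(AUTHORITY)

# Longest names first so that a long name wins over any shorter name it contains.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(v) for v in sorted(LOOKUP, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)


def extract_countries(text):
    """Return authorized country names in order of first mention, without repeats."""
    found = []
    for match in PATTERN.finditer(text):
        authorized = LOOKUP[match.group(1).lower()]
        if authorized not in found:
            found.append(authorized)
    return found


print(extract_countries("Roads in Zaire and the Ivory Coast, later known as DR Congo."))
```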
  50. 51. Use of concept extraction + rules engine to capture Loan #, Credit #, Project ID#
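
Rule-based capture of identifiers usually comes down to pattern matching. The sketch below uses Python regular expressions; the identifier formats (a "P" followed by six digits, four- or five-digit loan and credit numbers) are assumptions chosen for illustration, and the real rules would be written in the tool's own rule language.

```python
import re

# Hypothetical identifier formats, for illustration only.
RULES = {
    "project_id": re.compile(r"\bP\d{6}\b"),
    "loan_number": re.compile(r"\bLoan\s+(?:No\.?|#)\s*(\d{4,5})\b", re.IGNORECASE),
    "credit_number": re.compile(r"\bCredit\s+(?:No\.?|#)\s*(\d{4,5})\b", re.IGNORECASE),
}


def capture(text):
    """Apply every rule and return the captured values per metadata field."""
    return {name: rule.findall(text) for name, rule in RULES.items()}


SENTENCE = "Project P075941 is financed under Loan No. 4521 and Credit # 3310."
print(capture(SENTENCE))
```

Anchoring the loan and credit rules on the words "Loan" and "Credit" is what keeps a bare four-digit number, such as a year, from being captured.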
  51. 52. Enterprise Profile Development & Maintenance <ul><li>Enterprise Metadata Profile </li></ul><ul><li>Concept Extraction Technology </li></ul><ul><li>Country </li></ul><ul><li>Organization Name </li></ul><ul><li>People Name </li></ul><ul><li>Series Name/Collection Title </li></ul><ul><li>Author/Creator </li></ul><ul><li>Title </li></ul><ul><li>Publisher </li></ul><ul><li>Standard Statistical Variable </li></ul><ul><li>Version/Edition </li></ul><ul><li>Categorization Technology </li></ul><ul><li>Topic Categorization </li></ul><ul><li>Business Function Categorization </li></ul><ul><li>Region Categorization </li></ul><ul><li>Sector Categorization </li></ul><ul><li>Theme Categorization </li></ul><ul><li>Rule-Based Capture </li></ul><ul><li>Project ID </li></ul><ul><li>Trust Fund # </li></ul><ul><li>Loan # </li></ul><ul><li>Credit # </li></ul><ul><li>Series # </li></ul><ul><li>Publication Date </li></ul><ul><li>Language </li></ul><ul><li>Summarization </li></ul>e-CDS Reference Sources for Country, Region, Topics Business Function, Keywords, Project ID, People, Organization Data Governance Process for Topics, Business Function, Country, Region, Keywords, People, Organizations, Project ID Teragram Team TK240 Client System 1 System 2 System 3 System 4 System 5 Enterprise Profile Creation and Maintenance UCM Service Requests Update & Change Requests
  52. 53. ImageBank Integration Content Capture ISP Integration Enterprise Profile Development & Maintenance XML Wrapped Metadata Dedicated Server – Teragram Semantic Engine – Concept Extraction, Categorization, Clustering, Rule Based Engine, Language Detection APIs & Integration APIs & Integration Content Capture XML Wrapped Metadata Factiva Metadata Database IRIS Integration APIs & Integration Enterprise Metadata Capture Strategy TK240 Client XML Output Reference Sources APIs & Technical Integration Content Owners Content Owners Business Analyst Indexers Librarians Functional Team Enterprise Metadata Capture – Functional Reference Model
  53. 54. Caution Regarding Tools <ul><li>Not all tools will do what we are describing here </li></ul><ul><li>You need to have an underlying semantic engine which can perform semantic analysis </li></ul><ul><li>You need to have a semantic engine in multiple languages – semantics vary by language </li></ul><ul><li>You need to have access to the programs through a user-friendly interface so you can adapt them to your environment without having to have programming knowledge </li></ul><ul><li>You need to have several different kinds of technologies to do what I’m describing here </li></ul><ul><li>Not all the tools on the market today support this work </li></ul>
  54. 55. Impacts & Outcomes <ul><li>Information Access impacts </li></ul><ul><ul><li>Increased precision of search </li></ul></ul><ul><ul><li>Better control over recall </li></ul></ul><ul><ul><li>Searching like we talk </li></ul></ul><ul><ul><li>Exact match searching – known item searching will work better </li></ul></ul><ul><ul><li>Metadata based searching now begins to resemble full-text searching but with all the advantages of structure & context, and a significant reduction in the amount of noise </li></ul></ul><ul><li>Productivity Improvements </li></ul><ul><ul><li>Can now assign deep metadata to all kinds of content </li></ul></ul><ul><ul><li>Remove the human review aspect from the metadata capture </li></ul></ul><ul><ul><li>Reduce unit times where human review is still used </li></ul></ul><ul><li>Information Quality impacts </li></ul><ul><ul><li>Apply quality metrics at the metadata level to eliminate need to build ‘fuzzy search architectures’ – these rarely scale or improve in performance </li></ul></ul><ul><ul><li>Use the technologies to identify and fix problems with data </li></ul></ul>
  55. 56. Thank You. Questions & Discussions Contact Information: [email_address] [email_address]