Implementing Semantic Search in the Enterprise
  Paul Wlodarczyk, Director of Consulting Services, Earley & Associates
  Amber Swope
Questions we will answer today
  What is Semantic Search?
  How is Enterprise Search different from Internet Search?
  Why Semantic Enterprise Search?
  How do you implement enterprise semantic search? Examine people, process, technology, and content.
  How do I prepare my content to enable semantic search?
  What technologies are there and how do they differ?
  What can I search?
What is Semantic Search?
  semantic adj. Of or relating to meaning in language or communications.
  Semantic search uses language processing to assess the “meaning” of content (documents or web pages) and the “meaning” of search queries to return more relevant results (better matches in meaning).
  Key concepts: Taxonomy, Named Entity, Ontology, Tag
Key concept: Taxonomy
  taxonomy n. A categorization scheme for content, often hierarchical. Example: the animal kingdom.
  Most often, taxonomies show “is a” relationships. Example:
    A mammal is a vertebrate
    A rodent is a mammal
    A kangaroo rat is a rodent
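A minimal sketch of how an "is a" hierarchy might be stored and queried. The dictionary layout and the function names are assumptions made for illustration (the slides do not prescribe a data structure); "kangaroo rat" is borrowed from the ontology slide as the leaf term.

```python
# Each term points to its parent category ("is a" relationship).
# Hypothetical structure for illustration only.
PARENT = {
    "mammal": "vertebrate",
    "rodent": "mammal",
    "kangaroo rat": "rodent",
}


def ancestors(term):
    """Return every broader category of `term`, nearest first."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


def is_a(term, category):
    """True if `category` is a (possibly indirect) parent of `term`."""
    return category in ancestors(term)


if __name__ == "__main__":
    print(ancestors("kangaroo rat"))
    print(is_a("kangaroo rat", "vertebrate"))
```

Because the check walks up the hierarchy, a document tagged with a narrow term can still be found by a search on any broader category.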
Key concept: Named Entity
  named entity n. A person, organization, place, thing, or event identified in a body of text.
  Entities are distinct from terms in that they are unambiguous. For example, “Washington” is an ambiguous term: it could refer to any of several entities (the first President, the city, the state, the US Government, the monument).
  A tagged named entity is unambiguous.
Example: Named Entities
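As a rough illustration of the "Washington" example, the sketch below resolves an ambiguous term to one entity by counting context words shared with each candidate. The entity identifiers, the cue-word lists, and the scoring rule are all hypothetical; real entity extractors use far richer linguistic evidence.

```python
import re

# Hypothetical candidate entities for one ambiguous term, each with
# a few context words that hint at that reading.
CANDIDATES = {
    "washington": {
        "person:George_Washington": {"president", "general", "first"},
        "place:Washington_DC": {"city", "capital", "district"},
        "place:Washington_State": {"state", "seattle", "pacific"},
    }
}


def tag_entity(term, sentence):
    """Return the candidate entity whose cue words best match the sentence,
    or None when no candidate has any supporting context."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    best, best_score = None, 0
    for entity_id, cues in CANDIDATES.get(term.lower(), {}).items():
        score = len(cues & words)
        if score > best_score:
            best, best_score = entity_id, score
    return best


if __name__ == "__main__":
    print(tag_entity("Washington", "Washington was the first president"))
    print(tag_entity("Washington", "We flew to Seattle in Washington state"))
```

Returning `None` when there is no supporting context keeps the term untagged rather than guessing, which matches the idea that only a tagged entity is unambiguous.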
Key concept: Ontology
  ontology n. A set of relationships between entities. Often these are in subject-predicate-object (triple) format. Often ontologies relate entities that exist in multiple taxonomies.
  Example: A food chain is a set of relationships (predator/prey) between entities (animals, plants) that exist in different taxonomies (kingdoms). The relationships are triples:
    Rodents eat seeds of grasses.
    Fox eats rodents.
    Kangaroo rat is a rodent.
    Rye is a grass.
    Etc.
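The food-chain triples can be sketched as a tiny in-memory triple store with one inference step. This is an assumption-laden toy: the tuple layout, the predicate names `eats` and `is_a`, and the two inference rules are chosen for illustration and are not taken from any particular triple store or inference engine.

```python
# Triples from the food-chain example, as (subject, predicate, object).
TRIPLES = {
    ("rodent", "eats", "seeds of grasses"),
    ("fox", "eats", "rodent"),
    ("kangaroo rat", "is_a", "rodent"),
    ("rye", "is_a", "grass"),
}


def match(triples, s=None, p=None, o=None):
    """Return triples matching the given pattern; None is a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def infer_eats(triples):
    """Apply two illustrative rules until nothing new is derived:
    a member of a group is prey for the group's predators, and
    a member of a group shares the group's diet."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        for member, _, group in match(facts, p="is_a"):
            new_facts = set()
            for predator, _, _ in match(facts, p="eats", o=group):
                new_facts.add((predator, "eats", member))
            for _, _, food in match(facts, p="eats", s=group):
                new_facts.add((member, "eats", food))
            if not new_facts <= facts:
                facts |= new_facts
                changed = True
    return facts


if __name__ == "__main__":
    for triple in sorted(match(infer_eats(TRIPLES), p="eats")):
        print(triple)
```

The derived fact that a fox eats a kangaroo rat appears in none of the original statements; surfacing such implied relationships is what an inference engine adds on top of stored triples.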
How does semantic search work?
  Assess meaning of documents
    Identify named entities and relationships (triples), OR
    Categorize documents to taxonomies, OR
    Score each document with a “signature” or “graph”
  “Tag” documents for meaning (categories, entities, triples, semantic signatures, graphs, etc.)
  Index the documents
  Assess meaning of search terms
  Match documents to search terms via common meaning
  [Diagram: search term and documents linked through shared meaning]
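The steps above can be sketched end to end with a deliberately simple notion of "meaning": each known word maps to a concept, documents are indexed by concept, and the query is mapped the same way. The word-to-concept table, document texts, and function names are invented for this sketch; production systems would use categorizers or entity extractors instead of a lookup table.

```python
import re
from collections import defaultdict

# Hypothetical word -> concept mapping standing in for semantic analysis.
CONCEPTS = {
    "car": "vehicle",
    "automobile": "vehicle",
    "truck": "vehicle",
    "invoice": "billing",
    "bill": "billing",
}


def tag(text):
    """Assess meaning: return the set of concepts mentioned in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return {CONCEPTS[w] for w in words if w in CONCEPTS}


def build_index(docs):
    """Tag each document and index document ids by concept."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for concept in tag(text):
            index[concept].add(doc_id)
    return index


def search(index, query):
    """Map the query to concepts and return documents sharing any of them."""
    hits = set()
    for concept in tag(query):
        hits |= index.get(concept, set())
    return sorted(hits)


DOCS = {
    "d1": "Automobile maintenance schedule",
    "d2": "How to dispute a bill",
    "d3": "Holiday calendar",
}

if __name__ == "__main__":
    index = build_index(DOCS)
    print(search(index, "car"))
    print(search(index, "invoice"))
```

Note that the query "car" finds the document that only says "automobile": the match happens on the shared concept, not on the keyword.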
Enterprise Search vs. Web Search
  Search corpus
    Web Search: Every public webpage (the whole internet)
    Enterprise Search: Public documents in the enterprise, departmental docs, plus local docs (My Documents)
  Context
    Web Search: Generic: shopping or seeking news and information
    Enterprise Search: Company-specific: executing a role in a business process
  Taxonomies / categories
    Web Search: Generic (Open Directory Project, Wikipedia, News, etc.)
    Enterprise Search: Domain-specific (customers, organization, products, technologies, processes)
  Info Security
    Web Search: Information is public
    Enterprise Search: Information is secure with role-based access controls
  Search algorithms
    Web Search: Keyword and link-based. Links = relevancy. Popularity = relevancy. Professionally tagged.
    Enterprise Search: Keyword & tag-based. No links! No traffic! Inconsistent metadata tags!
  Perfect result
    Web Search: Most popular content
    Enterprise Search: Highest quality content
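One practical consequence of role-based access controls is that enterprise search results are typically filtered against the searcher's roles before display. The following is a minimal sketch of that filtering; the document ids, role names, and the rule that a document with no access entry is hidden are all assumptions for illustration.

```python
# Hypothetical access list: document id -> roles allowed to read it.
DOC_ROLES = {
    "q3-forecast": {"finance"},
    "handbook": {"finance", "engineering", "sales"},
    "design-spec": {"engineering"},
}


def trim_results(ranked_hits, user_roles):
    """Keep only hits the user may read, preserving the ranking order.
    Documents without an access entry are hidden (assumed default)."""
    return [
        doc_id for doc_id in ranked_hits
        if DOC_ROLES.get(doc_id, set()) & set(user_roles)
    ]


if __name__ == "__main__":
    hits = ["design-spec", "handbook", "q3-forecast"]
    print(trim_results(hits, {"sales"}))
    print(trim_results(hits, {"finance"}))
```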
Why Semantic Enterprise Search?
  Semantic analysis can provide the context, relevancy, and consistency that are lacking in enterprise content creation and search:
    Enterprise content lacks the connectedness that internet search exploits
    “Traffic” is not a clue to relevancy in enterprise search
    Enterprise users do not consistently tag content with metadata
Another key difference in Enterprise Search: Social Context
  In enterprise search, we know a lot more about “who” is searching and “who” has authored “what”
  We understand the community a lot better in the enterprise
Roadmap for implementing semantic search
  Implement Enterprise Content Management (ECM)
  Implement Enterprise Search
  Layer in semantic analysis to improve search relevancy
  Semantic search isn’t a replacement for ECM and enterprise search. It’s a “sweetener.”
  [Diagram: Implement ECM, then Implement Enterprise Search, then Exploit Semantic Search]
ECM and Enterprise Search Roll-out
  People
    Strategy & Plan: Use cases and User Experience
    Implement: Job Redesign, Communities
    Deploy: Training
    Maintain: Incentives for participation
  Process
    Strategy & Plan: Content Lifecycle Analysis
    Implement: Workflow, bus. rules, process redesign
    Deploy: Governance
    Maintain: Evergreen process for maintaining IA
  Technology
    Strategy & Plan: Business & system req’ts, technical architecture
    Implement: ECM and Search Implementation
    Deploy: Desktop integration (classification, search)
    Maintain: Social tech (ratings, tags, bookmarks)
  Content
    Strategy & Plan: Content Analysis, Information Architecture, Taxonomy dev’t
    Implement: Content migration
    Deploy: Content classification tools, search tools
    Maintain: Taxonomy maintenance, folksonomy
Layer-in Semantic Enterprise Search
  Semantic technologies play a role in content classification (from defining taxonomies and ontologies, to tagging documents, to improving search terms and hits) as well as in search and discovery.
  People
    Strategy & Plan: Use cases and User Experience
    Implement: Job Redesign
    Deploy: Training
    Maintain: Incentives for participation
  Process
    Strategy & Plan: Content Lifecycle Analysis
    Implement: Workflow, bus. rules, process redesign
    Deploy: Governance
    Maintain: Evergreen process for maintaining IA
  Technology
    Strategy & Plan: Business & system req’ts, technical architecture
    Implement: ECM and Search Implementation, Semantic search implementation
    Deploy: Desktop integration (classification, search)
    Maintain: Social tech (ratings, tags, bookmarks), machine learning
  Content
    Strategy & Plan: Content Analysis, Information Architecture, Taxonomy dev’t
    Implement: Content migration, build triple stores, semantic training sets
    Deploy: Content classification tools, search tools
    Maintain: Taxonomy maintenance, folksonomy
Classify, Navigate, Search, Retrieve Content within the Enterprise
  [Diagram labels: Content Author; Check-in & Classify Document or Content Object; Retrieve Document or Content Object; Retrieve Unformatted Content; Retrieve Formatted Content; Retrieve Document; End User]
Strategy and Plan: Key Activities
  Business Objectives: Understand the key business problems that must be solved
  People: Understand actors, roles, and use cases (who creates, who files, who searches, etc.)
  Process: Understand content lifecycle: how you create, maintain, reuse, and publish content
  Technology: Understand existing technology and new requirements for all use cases
  Content: Understand existing content, classification, policies, reuse, multichannel, etc.
Strategy and Plan: Deliverables
  Business Objectives: Define the ROI in terms of the key metrics and how they will trend
  People: Actors, roles, and Use Cases elaborated into System and Business Requirements
  Process: Desired state Content Lifecycle defined
  Technology: Systems Architecture completed and new technology modules defined, integration points with existing technology defined
  Content: Information Architecture: how content will be structured, classified, managed, reused, and searched
Strategy and Plan: Semantic Search Considerations: Technology
  Semantic technologies need to be considered and evaluated as part of the technical architecture, including:
    Categorizers (for auto-tagging, clustering)
    Entity extraction
    Triple stores and inference engine
    Tag servers
    Desktop integration (expose UX into authoring and search tools)
Strategy and Plan: Semantic Search Considerations: Content
  Semantic tools can aid content analysis activities including taxonomy, ontology, and name directory development
  Knowing which semantic approaches will be used for navigation, search, and retrieval (taxonomy, named entity, ontology) will inform the information architecture analysis and content classification
Preparing Content for Semantic Search
  [Phase graphic: Strategy & Plan, Implement, Deploy, Maintain]
Analyze existing content
  Know what you have
    Number of retrievable units?
    Size of each retrievable unit?
    Current retrieval method?
  Understand its use
    Who retrieves it?
    When do they need it?
    How do they find it?
    How often do they need it?
  Determine the relationships between retrievable units
Key Considerations
  Search Objectives
    Who is searching for what?
    How do they search?
    How do they expect to see results?
    How do they rank quality and relevance?
  Content
    Where is it? Federation?
    What types of documents?
    Security issues?
    Are XML or other special content types involved?
    Component documents or content reuse?
  User Experience (UX)
    What is the right balance between user expectations and an effective UI design?
    Are you involving users in the design?
    How can you embed the UX into daily tools (mail, desktop, browser, CMS)?
Define content structure
  Define authoring units
    Size?
    File format?
  Define storage units
    Size?
    Relationships between units?
  Define retrieval units
    Documents
    Components
    Topics/chunks
Classify content
  Define terms and thesauri
  Develop taxonomies
    How many?
    Relationship between them?
    Where/how stored?
  Apply taxonomy values to content
    When are values applied?
    Who is responsible for applying/reviewing?
    What can be automated?
  Develop ontologies (if using triples)
Define metadata
  Identify what data is needed
  Define the values
    How used?
    Where/how stored?
  Apply metadata values to content
    When are values applied?
    Who is responsible for applying/reviewing?
    What can be automated?
Control content
  Identify relationship between storage, retrieval and display mechanisms
    Same?
    Different?
    Relationship between them?
  Define storage strategy
    Where is content stored?
    Where is metadata stored?
    Where are deliverables stored (if generated)?
    How many repositories?
    Who needs access to each one?
Information Architecture for Semantic Search
  Information Architecture
    Structure content for retrieval
    Apply retrieval support at appropriate level
What technology does semantic search implementation require?
  Semantic Tagging Technology
    “Train” a system to auto-categorize documents; taxonomy server
    Named entity extraction; directory server
    Analyze against “triples”; triple stores plus inference engines
    Augment automatic tags with user tags and refinements
  Semantic Search Technology
    Disambiguate search terms to their meaning
    Map “meaning” of search term to “meaning” of document
    Refine “meaning” of search terms (clustering / similarity: “more like this”)
  Integration Technology
    User experience for check-in, classification, and navigation, search, and retrieval (NS&R)
    Desktop integration with browsers, email, and authoring tools
    Integration frameworks to tie semantic services with existing enterprise search and content management
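A “more like this” refinement can be approximated by comparing the semantic tags of documents. The sketch below uses Jaccard similarity over tag sets; this measure is one common choice swapped in for illustration (the slides do not name a similarity method), and the documents and tags are made up.

```python
def jaccard(a, b):
    """Share of tags the two sets have in common (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def more_like_this(doc_id, tags_by_doc, top_n=3):
    """Rank other documents by tag overlap with `doc_id`, best first.
    Documents sharing no tags are left out; ties break on document id."""
    target = tags_by_doc[doc_id]
    scored = [
        (jaccard(target, tags), other)
        for other, tags in tags_by_doc.items()
        if other != doc_id
    ]
    scored = [(score, other) for score, other in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:top_n]]


# Hypothetical documents with their semantic tags.
TAGS = {
    "a": {"rodent", "diet", "desert"},
    "b": {"rodent", "diet"},
    "c": {"fox", "diet"},
    "d": {"billing"},
}

if __name__ == "__main__":
    print(more_like_this("a", TAGS))
```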
What can I search?
  Content in ECM
    By using semantic tags in ECM metadata
  Content on your desktop
    By semantically tagging and indexing
  Content on the web
    By searching semantic metadata (e.g. RDF, linked data URIs)
  Databases
    By using XML Data Stores to make relational data available as a “document” that can be tagged
Standards
  Resource Description Framework (RDF)
    Make statements about resources in triples format
  W3C Semantic Web Standards (“linked data”)
    Use URIs to point to data in the web
    Turn web pages into databases
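To make the RDF idea concrete, here is a small sketch that writes triples in the line-based N-Triples syntax, where each statement is a subject, predicate, and object followed by a period. The `http://example.org/` resource URIs are placeholders invented for this example; the `rdf:type` and `rdfs:label` URIs are the standard W3C ones. The helper functions are hand-rolled for illustration and handle only simple string literals.

```python
# Placeholder namespace for illustration; not a real vocabulary.
EX = "http://example.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def iri(value):
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(text):
    """Quote a plain string literal, escaping backslashes, quotes, newlines."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def to_ntriples(triples):
    """Serialize (subject, predicate, object) terms, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


STATEMENTS = [
    (iri(EX + "kangaroo_rat"), iri(RDF_TYPE), iri(EX + "Rodent")),
    (iri(EX + "kangaroo_rat"), iri(RDFS_LABEL), literal("Kangaroo rat")),
    (iri(EX + "fox"), iri(EX + "eats"), iri(EX + "Rodent")),
]

if __name__ == "__main__":
    print(to_ntriples(STATEMENTS))
```

Because every subject and predicate is a URI, statements published by different sources about the same resource can be combined, which is the basis of linked data.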
Recap
  Semantic search improves search relevance by matching meaning of search terms to meaning of documents
  Semantic technologies include categorizers, entity extractors, and linguistic analysis of relationships between entities (triples)
  Semantic technologies are available as plug-ins to enterprise systems, or “baked in” to enterprise systems
  Semantic search requires extra steps along the way in implementing ECM and enterprise search
