Leveraging enterprise information is no easy task, especially when unstructured information represents more than 80% of enterprise content. Meaningfully structuring content is critical for companies, Natural Language Processing and Semantic Enrichment is becoming increasingly important to improve the quality of tasks related to information retrieval.
With the Semantic Web moving towards full realisation thanks to the Linked Data initiative and with the interest of major search engines in structured data, the enterprise search world is finding it more attractive to make its information machine readable and exploit that information to improve search over its content.
In this scenario, three trends are transforming the face of search:
Entity-oriented search. Searching not by keyword, but by entities that represent specific concepts in a certain domain.
Knowledge graphs. Leveraging relationships amongst entities: Linked Data datasets (Freebase, DbPedia….) or custom companies’ knowledge bases.
Search assistance. Autocomplete and spellchecking are now common features, but making use of semantic data makes it possible to offer smarter features, guiding the users to what they want, in a natural way.
Sometimes, the proper resources for building such features are not easy to obtain. In order to generate these, our approach includes a number of unstructured data processing mechanisms the goal of which is to automatically extract semantic information:
Extract content from heterogeneous data sources
Extract domain information and enrich the content through different NLP processes like Named Entity Recognition, Coreference Resolution, Entity Linking and Disambiguation, and Topic Annotation
Create specialised indexes to store the semantic information extracted
Currently there are a number of well developed uses of semantic extracted information such as faceting and concept indexing, however further methods of exploiting semantic extracted information are presenting themselves in the industry:
Smart Autocomplete
The target of this feature is to automatically complete users’ phrase with entity names and properties, helping them to find the desired documents through exploration of the domain Knowledge Graph. As the user keys in the phrase, the system will propose a set of named entities and/or a set of entity types. As the user accepts a suggestion, the system will dynamically adapt following suggestions to the chosen context.
The accuracy delivered by entity driven search brings increased satisfaction among users. They will see documents that are about a specific semantic concept, with concrete properties, and not about a keyword that can be ambiguously interpreted.
Semantic More Like This
A feature to find documents similar to one that is input, based on the underlying knowledge in the documents, instead of words.
Implementing a semantic distance function, we can provide a grade of similarity between documents based
ECIR-2014: Multilanguage Content Discovery Through Entity Driven Search
1. ECIR 2014 Industry Day
Content Discovery Through Entity Driven Search
Alessandro Benedetti
http://uk.linkedin.com/in/alexbenedetti
Antonio David Perez Morales
http://es.linkedin.com/in/adperezmorales
16th
April 2014
2. • Experienced at building and delivering a wide range of enterprise
solutions across the whole information life cycle
• Alfresco & Ephesoft certified Platinum Partner
• Red Hat Enterprise Linux Ready Partner
• Crafter & Varnish Gold Partners
• Search Solutions Consultant
Alfresco Partner of the Year 2012 and
2013
3. Working effectively together
Who We Are
3
Antonio David Pérez Morales
- R&D Senior Engineer
- Master in Engineering and Technology
Software
- Digital Identity and Security expert
- Enterprise Search Background
- Semantic, NLP, ML Technologies and
Information Retrieval lover
- Apache Stanbol Committer
- Apache contributor
@adperezmorales
http://es.linkedin.com/in/adperezmorales/
Alessandro Benedetti
- R&D Senior Engineer
- Master in Computer Science
- Information Retrieval background
-- Enterprise Search specialist
- Semantic, NLP, ML Technologies
and Information Retrieval lover
@AlexBenedetti
http://uk.linkedin.com/in/alexbenedetti
6. Working effectively together
Zaizi R&D Department
6
•Giving sense to the content
• Enriching it semantically
•Adding value to ECM/CMS
• More structured content, easy to manage, link and search,
•Improving search
• Across different domains, data sources, User Experience
• Machine Learning applied research
• Content Organization – Recommendation Systems
8. Working effectively together
Enterprise Search Problems
8
Challenge :
Search within Big and Heterogeneus Repositories
• Heterogeneus Data Sources
• Filesystem, DB, ECM/CMS, Email, …
• Unstructured Content
• PDFs, text plain, Word, …
• Documents not linked between each other
• Federated Search needed
• Search across data sources
• Different permissions
• Centralized endpoint
9. Working effectively together
Current Enterprise Search Weaknesses
9
• Keyword based
• Low precision
• Ambiguous terms not in context
• Not accurate weighting when keywords are combined
in a query
11. Working effectively together
Entity Driven Search
11
• Moves from keywords to Entities
•More understandable to a Human
• Process the unstructured text
• Enrich it
• Build specific indexes
• Use entities and concepts in searches
12. Working effectively together
Sensefy
12
• Semantic Enterprise Search Engine
• Federated Search
• Evolved User Experience
• Based on cutting-edge Open Source Frameworks
14. Working effectively together
RedLink
14
• Semantic Cloud platform
• Providing Software as a Service
• Manage unstructured data
• Extract knowledge and intelligence
• Make sense of information
• Feed into business processes
• Open-Source based components
• Entity Linking using Knowledge Bases
15. Working effectively together
NLP & Semantic Enrichment
15
• From unstructured to structured
• NLP Analysis. POS Tagging
• Named Entities Recognition
• Linked Data
• Entity Linking using Knowledge Bases
• Disambiguation
• Indexing in Solr
16. Working effectively together
Smart Autocomplete
16
• Multi Phase suggestions
• Closer to natural language query formulation
• Named Entities infix
• Entity types infix
• Multi Language entity type support
• Properties driven query approach
17. Working effectively together
Smart Autocomplete
Configuration
17
• Entity type properties
•Interesting to our use case and scenario
• Properties inheritance through type hierarchy
• Enhance type information from external resource
•Freebase, DbPedia , Custom Data Set
18. Working effectively together
Semantic Search
18
• Search by Named Entity
• Search by Entity Type
• Search by Entity Type properties
• Grouping Results by Sense
• Contextualize Results Using Semantic Information
19. Working effectively together
Semantic More Like This
19
• Search for Similar Documents based on Entities and Entities’
categories
• Similarity Function based on Documents’ Sense
• Not based on text tokens
• Entity Frequency /
Inverted Document Frequency
• Entity Type Frequency /
Inverted Document Frequency
22. Working effectively together
Future Work
22
• Semantic More Like This new approach (Graph
relations)
• Machine Learning components: Classification, Topic
annotation, Clustering
• Semantic facets
• Secured Entity Search
• Image and Media searches