Multi-language Content Discovery Through Entity Driven Search

This talk describes the implementation of a Semantic Search Engine based on
Solr.
Meaningfully structuring content is critical, and Natural Language Processing
and Semantic Enrichment are becoming increasingly important for improving the
quality of Solr search results.
Our solution is based on three advanced features:
Entity-oriented search - Searching not by keyword but by entity (a concept
in a certain domain).
Knowledge graphs - Leveraging relationships among entities through Linked Data
datasets (Freebase, DBpedia, Custom ...).
Search assistance - Autocomplete and spellchecking are now common features,
but semantic data makes smarter features possible, guiding users to build
queries in a natural way.
The approach includes unstructured data processing mechanisms integrated with
Solr to automatically index semantic and multi-language information.
Smart Autocomplete completes the user's query with entity names and
properties from the domain knowledge graph. As the user types, the system
proposes a set of named entities and/or entity types across different
languages. When the user accepts a suggestion, the system dynamically adapts
the subsequent suggestions and returns relevant documents.
Semantic More Like This finds documents similar to a seed document, based on
the underlying knowledge in the documents rather than on tokens.

Published in: Technology

  1. Multi-language Content Discovery Through Entity Driven Search
     Alessandro Benedetti, Search Consultant and R&D Software Engineer, Zaizi
     http://uk.linkedin.com/in/alexbenedetti

  2. Who I am
     Alessandro Benedetti
     • Apache ManifoldCF committer
     • Search Consultant
     • R&D Software Engineer
     • Master in Computer Science
     • Information Retrieval background
     • Semantic, NLP and Machine Learning technologies enthusiast
     • Beach volleyball player & snowboarder

  3. ZAIZI

  4. ZAIZI
     • Experienced at building and delivering a wide range of enterprise solutions across the whole information life cycle
     • Alfresco & Ephesoft certified Platinum Partner
     • Red Hat Enterprise Linux Ready Partner
     • R&D department specialising in Open Source Search Solutions
     Alfresco Partner of the Year 2012 and 2013

  5. Agenda
     • Context
     • Problem
     • Solution
     • Demo
     • What's upcoming

  6. Zaizi R&D Department
     • Giving sense to the content
     • Enriching it semantically
     • Adding value to ECM/CMS
     • More structured content, easy to manage, link and search
     • Improving search
     • Across different domains, data sources, User Experience
     • Machine Learning applied research
     • Content Organization, Recommendation Systems

  7. Enterprise Search Problems
     Challenge: search within big and heterogeneous repositories
     • Heterogeneous data sources
     • Filesystems, DB, ECM/CMS, Email, …
     • Unstructured content in different formats
     • PDF, plain text, Word …
     • Documents not linked to each other
     • Federated Search
     • across data sources
     • preserving permissions
     • centralized endpoint

  8. Sensefy
     • Semantic Enterprise Search Engine
     • Federated Search
     • Evolved User Experience
     • Based on cutting-edge Open Source frameworks

  9. Architecture

  10. Entity Driven Search
     • Moving from keywords to Entities
     • More understandable to humans
     • Process the unstructured text at indexing time
     • Enrich it
     • Build specific indexes
     • Use entities and concepts in searches
       – Trying to foresee the concepts the user wants to express

  11. What is an Entity in our domain?
     • Real-world concepts
     • Linked Data resources
     • RDF (XML) structured data
       – Unique identifier + properties
     • Stored in a Knowledge Base (Freebase, DBpedia, Custom Dataset)

  12. Redlink
     • Semantic Cloud platform
     • Providing Software as a Service
     • Text analysis and Entity Linking using Knowledge Bases
     • Linked Data Publishing
     • Enterprise Data Linking
     • Open-Source based components

  13. Indexing - NLP & Semantic Enrichment
     • Apache ManifoldCF custom processors/output connectors
     • From unstructured to structured
     • NLP Analysis, POS Tagging
     • Named Entity Recognition
     • Entity Linking using Knowledge Bases
     • Disambiguation
     • Indexing in specific Solr Collections
       – Primary Index (documents)
       – Entity Index
       – Entity Types
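
The flow from unstructured text to three separate indexes can be sketched in a few lines. This is a toy under stated assumptions: a plain dictionary lookup (`GAZETTEER`) stands in for the real NLP analysis, entity linking and disambiguation, Python lists and counters stand in for the Solr collections, and all identifiers and field names are made up for the example.

```python
from collections import Counter

# Hypothetical gazetteer: surface form -> (entity id, entity type id).
GAZETTEER = {
    "cristiano ronaldo": ("ex:Cristiano_Ronaldo", "ex:FootballPlayer"),
    "real madrid": ("ex:Real_Madrid", "ex:FootballTeam"),
}


def enrich(doc_id, text):
    """Turn raw text into a structured record with entities and types."""
    lowered = text.lower()
    entities, types = [], []
    for surface, (entity_id, type_id) in GAZETTEER.items():
        if surface in lowered:
            entities.append(entity_id)
            types.append(type_id)
    return {"id": doc_id, "text": text,
            "entities": entities, "entity_types": types}


def build_indexes(docs):
    """Build the primary, entity and entity type 'indexes'."""
    primary = [enrich(doc_id, text) for doc_id, text in docs.items()]
    # Occurrence counts per entity and per type across all documents.
    entity_index = Counter(e for d in primary for e in d["entities"])
    type_index = Counter(t for d in primary for t in d["entity_types"])
    return primary, entity_index, type_index


docs = {
    "d1": "Cristiano Ronaldo joined Real Madrid.",
    "d2": "Real Madrid won the match.",
}
primary, entity_index, type_index = build_indexes(docs)
```

The occurrence counts kept per entity and per type are the kind of signal a suggester can later use for ranking.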

  14. Search - Smart Autocomplete
     • Multi-phase suggestions
     • Closer to natural language query formulation
     • Named Entities
     • Entity Types
     • Document Titles

  15. Smart Autocomplete – Named Entities
     • Infix suggestion (ron → Cristiano Ronaldo)
     • Fuzzy suggestion (cristinao → Cristiano Ronaldo)
     • Brief description of the suggested entity
     • Specific Solr index for the entities
       – Schema (label, notable_type, occurrences...)
       – Label field filtered with an edge n-gram token filter
       – Fuzzy queries with variable distance / classic queries to the label suggestion field
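
How edge n-grams give infix suggestions, and how a fuzzy fallback rescues typos, can be sketched without Solr. The code below is an in-memory approximation, not the Sensefy implementation: the function names, the minimum gram length and the "distance 1 for short queries, 2 otherwise" rule are assumptions chosen for the example.

```python
def edge_ngrams(token, min_len=2):
    """All prefixes of a token, e.g. 'ron', 'rona', ... for 'ronaldo'."""
    return [token[:n] for n in range(min_len, len(token) + 1)]


def levenshtein(a, b):
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def build_index(labels):
    """Map every edge n-gram of every label token to its labels."""
    index = {}
    for label in labels:
        for token in label.lower().split():
            for gram in edge_ngrams(token):
                index.setdefault(gram, set()).add(label)
    return index


def suggest(index, query):
    """Exact n-gram lookup first, fuzzy lookup as a fallback."""
    q = query.lower()
    if q in index:
        return sorted(index[q])
    max_dist = 1 if len(q) <= 5 else 2   # variable distance (assumed rule)
    hits = set()
    for gram, labels in index.items():
        if abs(len(gram) - len(q)) <= max_dist and levenshtein(q, gram) <= max_dist:
            hits |= labels
    return sorted(hits)


index = build_index(["Cristiano Ronaldo", "Christian Bale", "Ronaldinho"])
print(suggest(index, "ron"))        # matches a word inside the label
print(suggest(index, "cristinao"))  # typo, resolved by the fuzzy fallback
```

Because each word of the label is n-grammed separately, "ron" matches the second word of "Cristiano Ronaldo", which is what makes the suggestion infix rather than prefix-only.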

  16. Smart Autocomplete – Entity Types
     • Infix suggestion (play → Football Player)
     • Fuzzy suggestion (foobtall → Football Team)
     • Multi-language (calcia → Calciatore[it] (Football Player)[en])
     • Multi-phase suggestion through properties (ital → football player nationality italian)
     • Specific Solr collection for the entity types
       – Each SolrDocument is an entity type (type, occurrences, attributes, type hierarchy...)
       – Type field filtered with an edge n-gram token filter
       – Multi-language suggestion highlight
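
The multi-language behaviour can be illustrated with type records that carry one label per language. This is a sketch under assumptions: the record layout, the `ex:` identifiers, the occurrence counts and the output format are invented for the example, and the multi-phase step through properties is not covered.

```python
# Hypothetical entity type records with per-language labels.
ENTITY_TYPES = [
    {"type": "ex:FootballPlayer", "occurrences": 12,
     "labels": {"en": "Football Player", "it": "Calciatore"}},
    {"type": "ex:FootballTeam", "occurrences": 5,
     "labels": {"en": "Football Team", "it": "Squadra di calcio"}},
]


def suggest_types(prefix, display_lang="en"):
    """Match the prefix against any word of any language label."""
    needle = prefix.lower()
    hits = []
    for record in ENTITY_TYPES:
        for lang, label in record["labels"].items():
            if any(tok.startswith(needle) for tok in label.lower().split()):
                hits.append((record["occurrences"], lang, label,
                             record["labels"][display_lang]))
                break  # one suggestion per type is enough
    hits.sort(key=lambda h: -h[0])  # most frequent types first
    suggestions = []
    for _, lang, label, shown in hits:
        if lang == display_lang:
            suggestions.append(f"{label} [{lang}]")
        else:
            # Show the matched label plus its translation.
            suggestions.append(f"{label} [{lang}] ({shown} [{display_lang}])")
    return suggestions


print(suggest_types("calcia"))
print(suggest_types("play"))
```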

  17. Smart Autocomplete – configuration
     • Knowledge base for entity linking and dereference
     • DBpedia, Freebase, Custom Dataset
     • Properties
     • For each entity type of interest
     • LDPath is used to identify the property in the graph
     • Hierarchy
     • All the sub-types of a type automatically inherit their parent's properties, to ease the configuration
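
Property inheritance along the type hierarchy can be sketched as a walk from a type up to its root. The hierarchy, the type identifiers and the property names below are hypothetical, and plain property names stand in for the LDPath expressions a real configuration would hold.

```python
# Hypothetical type hierarchy: child type -> parent type.
TYPE_PARENT = {
    "ex:FootballPlayer": "ex:Athlete",
    "ex:Athlete": "ex:Person",
    "ex:Person": None,
}

# Properties configured directly on each type (illustrative names).
TYPE_PROPERTIES = {
    "ex:Person": ["nationality", "birthDate"],
    "ex:Athlete": ["sport"],
    "ex:FootballPlayer": ["team"],
}


def resolved_properties(type_id):
    """Own properties first, then everything inherited from ancestors."""
    props = []
    current = type_id
    while current is not None:
        for prop in TYPE_PROPERTIES.get(current, []):
            if prop not in props:
                props.append(prop)
        current = TYPE_PARENT.get(current)
    return props


print(resolved_properties("ex:FootballPlayer"))
```

With this arrangement a property such as `nationality` is configured once on the parent type and becomes available to every descendant.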

  18. Semantic Search
     • Search by Named Entity
     • Ex. Give me all the documents related to Christian Bale
     • Search by Entity Type
     • Ex. Give me all the documents about football players
     • Search by Entity Type + properties
     • Ex. Give me all the documents about football players whose nationality is British
     • Query-time join: Entity / Entity Type collection → primary index
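
The query-time join can be pictured as two steps: filter the entity collection, then keep the primary documents that mention any surviving entity. The sketch below does this in memory; the records and field names are illustrative assumptions. In Solr itself a cross-collection join of this kind is typically expressed with the join query parser (`{!join from=... to=... fromIndex=...}`), but the exact query used by Sensefy is not given in the slides.

```python
# Illustrative entity collection.
ENTITIES = [
    {"id": "ex:Christian_Bale", "type": "ex:Actor", "nationality": "British"},
    {"id": "ex:Wayne_Rooney", "type": "ex:FootballPlayer", "nationality": "British"},
    {"id": "ex:Cristiano_Ronaldo", "type": "ex:FootballPlayer", "nationality": "Portuguese"},
]

# Illustrative primary index: each document lists the entities it mentions.
DOCUMENTS = [
    {"id": "d1", "entities": ["ex:Wayne_Rooney"]},
    {"id": "d2", "entities": ["ex:Cristiano_Ronaldo", "ex:Wayne_Rooney"]},
    {"id": "d3", "entities": ["ex:Christian_Bale"]},
    {"id": "d4", "entities": ["ex:Cristiano_Ronaldo"]},
]


def join_search(entity_filter):
    """Entity collection -> matching entity ids -> primary documents."""
    matching = {
        e["id"] for e in ENTITIES
        if all(e.get(key) == value for key, value in entity_filter.items())
    }
    return [d["id"] for d in DOCUMENTS if matching.intersection(d["entities"])]


# "Documents about football players whose nationality is British"
print(join_search({"type": "ex:FootballPlayer", "nationality": "British"}))
```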

  19. Semantic Facets
     • Dynamically calculated semantic facets based on types and entities from documents
     • Improve the navigation of results
     • Allow refined search through semantic information
     • Configurable custom layer on top of the Solr faceting component
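
Semantic facets amount to counting, over the current result set, how many documents mention each entity or entity type. A minimal in-memory sketch, assuming each result carries hypothetical `entities` and `entity_types` fields (Solr's own faceting component would do the counting in the real system):

```python
from collections import Counter


def semantic_facets(results, field, limit=10):
    """Count in how many result documents each value of `field` occurs."""
    counts = Counter()
    for doc in results:
        # set() so a value repeated inside one document counts once
        counts.update(set(doc.get(field, [])))
    return counts.most_common(limit)


results = [
    {"id": "d1", "entity_types": ["ex:FootballPlayer", "ex:FootballTeam"]},
    {"id": "d2", "entity_types": ["ex:FootballPlayer"]},
    {"id": "d3", "entity_types": ["ex:Actor"]},
]
print(semantic_facets(results, "entity_types", limit=1))
```

Selecting a facet value then becomes an ordinary filter on the same field, which is how the result list gets refined.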

  20. Semantic More Like This
     • Search for similar documents based on Entities and Entity Types
     • Similarity function based on document meaning
     • Multi-language / based not on text tokens but on concepts
     • Solr More Like This on custom fields
     • Entity Frequency / Inverse Document Frequency
     • Entity Type Frequency / Inverse Document Frequency
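
The idea of weighting entities by frequency and inverse document frequency can be shown with a small similarity function. Note the swap: the slides rely on Solr's More Like This over custom fields, whereas this sketch uses cosine similarity over TF-IDF entity vectors as a stand-in. The IDF formula, document ids and entity ids are assumptions for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Entity frequency x inverse document frequency for each document."""
    n = len(docs)
    df = Counter(e for entities in docs.values() for e in set(entities))
    vectors = {}
    for doc_id, entities in docs.items():
        tf = Counter(entities)
        vectors[doc_id] = {e: tf[e] * math.log(1 + n / df[e]) for e in tf}
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(e, 0.0) for e, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def more_like_this(docs, seed_id, top_n=3):
    """Rank other documents by entity overlap with the seed document."""
    vectors = tfidf_vectors(docs)
    seed = vectors[seed_id]
    scored = [(cosine(seed, vec), doc_id)
              for doc_id, vec in vectors.items() if doc_id != seed_id]
    scored = [(score, doc_id) for score, doc_id in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_n]]


docs = {
    "en1": ["ex:Cristiano_Ronaldo", "ex:Real_Madrid"],                # English text
    "it1": ["ex:Cristiano_Ronaldo", "ex:Real_Madrid", "ex:Serie_A"],  # Italian text
    "en2": ["ex:Christian_Bale"],
}
print(more_like_this(docs, "en1"))
```

Because the vectors hold entity identifiers rather than words, the English and Italian documents are comparable even though they share no text tokens.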

  21. Live Demo
     • Context
     • Problem
     • Solution
     • Demo
     • What's upcoming

  22. What's upcoming
     • Machine Learning components:
       – Classification
       – Topic annotation
       – Clustering
     • Secured Entity Search
     • Image and Media searches
     • Advanced Geo-search
     • Personalized/collaborative search
     • Recommendations
     • Q&A
     • Advanced configurable Admin Dashboard

  23. Any Questions?
     Alessandro Benedetti, Search Consultant and R&D Software Engineer, Zaizi
     Email: abenedetti@zaizi.com
     Twitter: @Zaizi
