Multi-language Content Discovery Through Entity Driven Search

Alessandro Benedetti
Search Consultant and
R&D Software Engineer
Zaizi
http://uk.linkedin.com/in/alexbenedetti

Who I am

Apache ManifoldCF committer

Search Consultant


Master in Computer Science

Information Retrieval Background

Semantic, NLP, Machine Learning Technologies Enthusiast

Beach Volleyball Player & Snowboarder

ZAIZI

Experienced at building and delivering a wide range of enterprise solutions across the whole information life cycle

Alfresco & Ephesoft certified Platinum Partner

Red Hat Enterprise Linux Ready Partner

R&D department specialising in Open Source Search Solutions

Alfresco Partner of the Year 2012 and 2013

Agenda

Context

Problem

Solution

Demo

What's upcoming

Zaizi R&D Department

Giving sense to the content

Enriching it semantically

Adding value to ECM/CMS

More structured content, easy to manage, link and search

Improving search

Across different domains, data sources, User Experience

Machine Learning applied research

Content Organization – Recommendation Systems

Enterprise Search Problems
Challenge:
Search within Big and Heterogeneous Repositories

Heterogeneous data sources

Filesystems, DB, ECM/CMS, Email, …

Unstructured content in different formats

PDF, plain text, Word, …

Documents not linked to each other

Federated Search

across data sources

preserving permissions

centralized endpoint

Sensefy

Semantic Enterprise Search Engine

Federated Search

Evolved User Experience

Based on cutting-edge Open Source Frameworks

Entity Driven Search

Moving from keywords to Entities

More understandable to Humans

Process the unstructured text at indexing time

Enrich it

Build specific indexes

Use entities and concepts in searches
• Trying to foresee the concepts the user wants to express

What is an Entity in our domain?

Real world concepts

Linked Data resources

RDF (XML) structured data
• Unique identifier + properties

Stored in a Knowledge Base (Freebase, DBpedia, Custom Dataset)
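
As a rough illustration of "unique identifier + properties", an entity could be modelled as below. This is only a sketch: the `Entity` class, the `kb:` identifier scheme and the property names are assumptions made for the example, not the Sensefy data model.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """One knowledge-base resource: an identifier plus a bag of properties."""
    uri: str                      # unique identifier (hypothetical "kb:" scheme)
    label: str                    # human-readable name
    types: tuple = ()             # entity types the resource belongs to
    properties: dict = field(default_factory=dict)


# Illustrative instance; the values are example data only.
bale = Entity(
    uri="kb:Christian_Bale",
    label="Christian Bale",
    types=("Actor",),
    properties={"nationality": "British"},
)
```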

Redlink

Semantic Cloud platform

Providing Software as a Service

Text analysis and Entity Linking using Knowledge Bases

Linked Data Publishing

Enterprise Data Linking

Open-Source based components

Indexing - NLP & Semantic Enrichment

Apache ManifoldCF custom processors/output connectors

From unstructured to structured

NLP Analysis, POS Tagging

Named Entity Recognition

Entity Linking using Knowledge Bases

Disambiguation

Indexing in specific Solr Collections
• Primary Index (documents)
• Entity Index
• Entity Types
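
The enrichment steps above can be sketched roughly as follows. This is a toy illustration rather than the actual ManifoldCF processor: the in-memory knowledge base, the `link_entities` and `enrich` helpers and all field names are assumptions, and a plain dictionary lookup stands in for real NER, entity linking and disambiguation.

```python
from collections import Counter

# Hypothetical mini knowledge base: surface form -> (entity id, entity type).
KNOWLEDGE_BASE = {
    "cristiano ronaldo": ("kb:Cristiano_Ronaldo", "FootballPlayer"),
    "real madrid": ("kb:Real_Madrid", "FootballTeam"),
}


def link_entities(text):
    """Return (id, type) pairs for every known surface form found in text."""
    lowered = text.lower()
    return [entity for label, entity in KNOWLEDGE_BASE.items() if label in lowered]


def enrich(doc_id, text):
    """Build the records destined for the three indexes for one document."""
    entities = link_entities(text)
    # Primary index: the document itself, annotated with what it mentions.
    primary = {
        "id": doc_id,
        "content": text,
        "entity_ids": [uri for uri, _ in entities],
        "entity_types": sorted({etype for _, etype in entities}),
    }
    # Entity index: one record per linked entity.
    entity_records = [{"id": uri, "type": etype} for uri, etype in entities]
    # Entity type index: occurrence counts per type.
    type_counts = Counter(etype for _, etype in entities)
    return primary, entity_records, type_counts
```

Calling `enrich("doc1", "Cristiano Ronaldo joined Real Madrid.")` yields one primary record carrying two entity ids, two entity records and one occurrence for each of the two types.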

Search - Smart Autocomplete

Multi Phase suggestions

Closer to natural language query formulation

Named Entities

Entity Types

Document Titles

Smart Autocomplete – Named Entities

Infix suggestion (ron → Cristiano Ronaldo)

Fuzzy suggestion (cristinao → Cristiano Ronaldo)

Brief description of the suggested entity

Specific Solr index for the entities
• Schema (label, notable_type, occurrences, ...)
• Edge n-gram token-filtered label field
• Fuzzy queries with variable distance / classic queries to the label suggestion field
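
The two matching strategies can be illustrated in memory, without Solr. In this sketch edge n-grams are generated per token (so a prefix of any word in the label matches) and fuzzy matching uses plain Levenshtein distance; the function names and the `max_edits=2` default are assumptions for the example, and in Solr the same roles would be played by an edge n-gram token filter on the label field and by fuzzy term queries.

```python
def edge_ngrams(token, min_len=2):
    """All prefixes of token with at least min_len characters."""
    return {token[:i] for i in range(min_len, len(token) + 1)}


def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def suggest(query, labels, max_edits=2):
    """Return labels having a token that starts with query or is within max_edits of it."""
    q = query.lower()
    hits = []
    for label in labels:
        tokens = label.lower().split()
        prefix_match = any(q in edge_ngrams(t) for t in tokens)
        fuzzy_match = any(levenshtein(q, t) <= max_edits for t in tokens)
        if prefix_match or fuzzy_match:
            hits.append(label)
    return hits


labels = ["Cristiano Ronaldo", "Christian Bale"]
```

With these labels, `suggest("ron", labels)` matches through the n-grams of the token "ronaldo", while `suggest("cristinao", labels)` matches "cristiano" at edit distance 2.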

Smart Autocomplete – Entity Types

Infix suggestion (play → Football Player)

Fuzzy suggestion (foobtall → Football Team)

Multi-language (calcia → Calciatore [it] (Football Player [en]))

Multi-phase suggestion through properties (ital → football player nationality italian)

Specific Solr collection for the entity types
• Each SolrDocument is an entity type (type, occurrences, attributes, type hierarchy, ...)
• Edge n-gram token-filtered type
• Multi-language suggestion highlight

Smart Autocomplete – configuration

Knowledge base for entity linking and dereference

DBpedia, Freebase, Custom Dataset

Properties

For each entity type of interest

LDPath will be used to identify the property in the graph

Hierarchy

All the sub-instances of a type automatically inherit their parent's properties, to ease the configuration
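
A minimal sketch of how such inheritance could be resolved, assuming a configuration that records one parent per type. The `TYPE_CONFIG` structure, the type names and the property names are hypothetical; in a real setup each property would map to an LDPath expression rather than a bare name.

```python
# Hypothetical configuration: each type names its parent and its own properties.
TYPE_CONFIG = {
    "Person": {"parent": None, "properties": ["nationality", "birth_place"]},
    "Athlete": {"parent": "Person", "properties": ["sport"]},
    "FootballPlayer": {"parent": "Athlete", "properties": ["team", "position"]},
}


def resolve_properties(type_name, config=TYPE_CONFIG):
    """Collect a type's own properties plus everything inherited from ancestors."""
    collected = []
    current = type_name
    while current is not None:
        node = config[current]
        # Prepend, so the most general (ancestor) properties lead the list.
        collected = node["properties"] + collected
        current = node["parent"]
    return collected
```

Only the properties specific to each type need to be configured; `resolve_properties("FootballPlayer")` also returns those declared on `Athlete` and `Person`.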

Semantic Search

Search by Named Entity

Ex. Give me all the documents related to Christian Bale

Search by Entity Type

Ex. Give me all the documents about football players

Search by Entity Type + properties

Ex. Give me all the documents about football players whose nationality is British

Query-time join:
Entity / Entity Type collection → Primary Index
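
One way to express such a join is Solr's join query parser, `{!join from=... to=... fromIndex=...}`. The sketch below only builds the request parameters. The collection name `entities` and the fields `id`, `entity_ids`, `type` and `nationality` are assumptions for the example, not the actual Sensefy schema.

```python
from urllib.parse import urlencode


def entity_join_params(entity_type, properties=None, from_index="entities"):
    """Build Solr request parameters that join matching entities to documents."""
    clauses = [f'type:"{entity_type}"']
    for field_name, value in (properties or {}).items():
        clauses.append(f'{field_name}:"{value}"')
    inner = " AND ".join(clauses)
    # Entities matching `inner` are selected in the entity collection; their
    # `id` values are then matched against `entity_ids` on the documents.
    q = f"{{!join from=id to=entity_ids fromIndex={from_index}}}{inner}"
    return {"q": q, "wt": "json"}


# "Documents about football players whose nationality is British"
params = entity_join_params("FootballPlayer", {"nationality": "British"})
query_string = urlencode(params)
```

The resulting `q` parameter would be sent to the primary index, so the response contains documents rather than entities.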

Semantic Facets

Dynamically calculated semantic facets based on types and entities from documents

Improve the navigation of results

Allow refined search through semantic information

Configurable custom layer on top of Solr faceting component
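
Conceptually, a semantic facet is a count of how many result documents mention each entity or entity type. The sketch below computes that over an in-memory result list; the `entity_types` field name and the `semantic_facets` helper are assumptions, and in Solr the counting itself would be delegated to the faceting component.

```python
from collections import Counter


def semantic_facets(results, field="entity_types", limit=5):
    """Count how many result documents carry each value of `field`."""
    counts = Counter()
    for doc in results:
        # set() so a value repeated within one document is counted once.
        counts.update(set(doc.get(field, [])))
    return counts.most_common(limit)


results = [
    {"id": "d1", "entity_types": ["FootballPlayer", "FootballTeam"]},
    {"id": "d2", "entity_types": ["FootballPlayer"]},
    {"id": "d3"},
]
facets = semantic_facets(results)
```

Each returned pair (value, count) can be rendered as a clickable filter that narrows the result set to documents mentioning that type.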

Semantic More Like This

Search for similar documents based on Entities and Entity Types

Similarity function based on document meaning

Multi Language / Not based on text tokens but concepts

Solr More Like This on custom fields

Entity Frequency / Inverse Document Frequency

Entity Type Frequency / Inverse Document Frequency
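
The idea of scoring similarity on entities instead of text tokens can be sketched as a TF-IDF style weighting over entity identifiers, compared with cosine similarity. This is an illustration under assumptions, not Solr's More Like This scoring: the weighting formula, the function names and the `kb:` identifiers are all chosen for the example.

```python
import math
from collections import Counter


def entity_vectors(docs):
    """Map each doc id to a {entity: frequency * idf} vector."""
    n = len(docs)
    doc_freq = Counter(e for entities in docs.values() for e in set(entities))
    vectors = {}
    for doc_id, entities in docs.items():
        freq = Counter(entities)
        vectors[doc_id] = {
            e: count * math.log(1 + n / doc_freq[e]) for e, count in freq.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(e, 0.0) for e, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def more_like_this(doc_id, docs):
    """Rank all other documents by entity-based similarity to doc_id."""
    vectors = entity_vectors(docs)
    target = vectors[doc_id]
    scored = [(other, cosine(target, vec)) for other, vec in vectors.items() if other != doc_id]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# An English and an Italian document share entity ids, so they can match
# even though their text tokens differ.
docs = {
    "en": ["kb:Ronaldo", "kb:Real_Madrid"],
    "it": ["kb:Ronaldo", "kb:Real_Madrid", "kb:Serie_A"],
    "other": ["kb:Christian_Bale"],
}
ranked = more_like_this("en", docs)
```

Because the vectors are keyed by entity identifier rather than by word, the comparison is independent of the language each document is written in.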

Live Demo

What's upcoming

Machine Learning components:
– Classification
– Topic annotation
– Clustering

Secured Entity Search

Image and Media searches

Advanced Geo-search

Personalized/collaborative search

Recommendations

Advanced configurable Admin Dashboard

Q&A

Any Questions?
Alessandro Benedetti
Search Consultant and R&D Software Engineer
Zaizi
Email: abenedetti@zaizi.com
Twitter: @Zaizi
