Multi-language
Content Discovery
Through Entity Driven
Search
Alessandro Benedetti
Search Consultant and
R&D Software Engineer
Zaizi
http://uk.linkedin.com/in/alexbenedetti
Who I am
Alessandro Benedetti

Apache ManifoldCF committer

Search Consultant

R&D Software Engineer

Master in Computer Science

Information Retrieval Background

Semantic, NLP, Machine Learning Technologies Enthusiast

Beach Volleyball Player & Snowboarder
ZAIZI

Experienced at building and delivering a wide range of enterprise solutions across the
whole information life cycle

Alfresco & Ephesoft certified Platinum Partner

Red Hat Enterprise Linux Ready Partner

R&D department specialising in Open Source
Search Solutions
Alfresco Partner of the Year 2012 and
2013
Agenda

Context

Problem

Solution

Demo

What's upcoming
Zaizi R&D Department

Giving sense to the content

Enriching it semantically

Adding value to ECM/CMS

More structured content, easy to manage,
link and search

Improving search

Across different domains, data sources, User
Experience

Machine Learning applied research

Content Organization – Recommendation Systems
Enterprise Search Problems
Challenge:
Search within Big and Heterogeneous Repositories

Heterogeneous data sources

Filesystems, DB, ECM/CMS, Email, …

Unstructured content in different formats

PDF, plain text, Word, …

Documents not linked to each other

Federated Search

across data sources

preserving permissions

centralized endpoint
Sensefy

Semantic Enterprise Search Engine

Federated Search

Evolved User Experience

Based on cutting-edge Open Source Frameworks
Architecture
Entity Driven Search

Moving from keywords to Entities

More understandable to Humans

Process the unstructured text at indexing time

Enrich it

Build specific indexes

Use entities and concepts in searches
• Trying to foresee the concepts the user wants to express
What is an Entity in our domain?

Real world concepts

Linked Data resources

RDF (XML) structured data
• Unique identifier + properties

Stored in a Knowledge Base (Freebase, DBpedia, custom dataset)
Redlink

Semantic Cloud platform

Providing Software as a Service

Text analysis and Entity Linking using Knowledge Bases

Linked Data Publishing

Enterprise Data Linking

Open-Source based components
Indexing - NLP & Semantic Enrichment

Apache ManifoldCF custom processors/output connectors

From unstructured to structured

NLP analysis, POS tagging

Named Entity Recognition

Entity Linking using Knowledge Bases

Disambiguation

Indexing in specific Solr Collections
• Primary Index (documents)
• Entity Index
• Entity Types
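The split into three collections can be sketched as follows. This is only an illustration: the shape of `linked_entities` and the field names `entity_ids` and `entity_types` are assumptions, while `label`, `notable_type` and `occurrences` follow the entity schema mentioned later in the deck.

```python
def build_index_documents(doc_id, title, text, linked_entities):
    """Split one enriched document into records for the three collections.

    linked_entities is a list of dicts with 'uri', 'label' and 'type' keys,
    i.e. the assumed output of the entity linking step.
    """
    # One record per distinct entity, keyed by URI, counting its mentions.
    entities = {}
    for e in linked_entities:
        rec = entities.setdefault(
            e["uri"],
            {"id": e["uri"], "label": e["label"],
             "notable_type": e["type"], "occurrences": 0},
        )
        rec["occurrences"] += 1

    entity_types = sorted({e["type"] for e in linked_entities})

    # Primary index record: the document plus references to its entities.
    primary = {
        "id": doc_id,
        "title": title,
        "text": text,
        "entity_ids": sorted(entities),
        "entity_types": entity_types,
    }
    # Entity type records (hypothetical minimal shape).
    type_docs = [{"id": t, "type": t} for t in entity_types]
    return primary, list(entities.values()), type_docs
```

In a real pipeline each of the three results would be sent to its own Solr collection by the output connector; here they are plain dictionaries.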
Search - Smart Autocomplete

Multi Phase suggestions

Closer to natural language query formulation

Named Entities

Entity Types

Document Titles
Smart Autocomplete – Named Entities

Infix suggestion (ron → Cristiano Ronaldo)

Fuzzy suggestion (cristinao → Cristiano Ronaldo)

Brief description of the suggested entity

Specific Solr index for the entities
• Schema (label, notable_type, occurrences, ...)
• Edge n-gram token-filtered label field
• Fuzzy queries with variable distance / classic queries to the label suggestion field
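A minimal, Solr-free sketch of the two matching modes above: edge n-grams on each label token give the infix behaviour, and an edit-distance check gives the fuzzy fallback. The function names and the `max_distance=2` default are assumptions for illustration. The sketch uses plain Levenshtein distance, whereas a search engine's fuzzy query may count a transposition as a single edit.

```python
def edge_ngrams(token, min_len=1, max_len=20):
    """Return the leading substrings of a token, as an edge n-gram filter would."""
    return [token[:n] for n in range(min_len, min(len(token), max_len) + 1)]


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def suggest(query, labels, max_distance=2):
    """Return labels matching the query by token prefix first, then by fuzzy distance."""
    q = query.lower()
    exact, fuzzy = [], []
    for label in labels:
        tokens = label.lower().split()
        if any(q in edge_ngrams(t) for t in tokens):
            exact.append(label)
        elif any(levenshtein(q, t) <= max_distance for t in tokens):
            fuzzy.append(label)
    return exact + fuzzy
```

Because the n-grams are built per token rather than per label, a query such as `ron` can match the second word of a label, which is what makes the suggestion "infix" from the user's point of view.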
Smart Autocomplete – Entity Types

Infix suggestion (play → Football Player)

Fuzzy suggestion (foobtall → Football Team)

Multi-language (calcia → Calciatore [it] → Football Player [en])

Multi-phase suggestion through properties (ital → football player nationality italian)

Specific Solr collection for the entity types
• Each SolrDocument is an entity type (type, occurrences, attributes, type hierarchy, ...)
• Edge n-gram token-filtered type field
• Multi-language suggestion highlight
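The multi-language case can be illustrated with a small sketch in which every entity type carries one label per language and a prefix typed in any language resolves to the same type. The data structure and the Italian label for "Football Team" are hypothetical; only the Calciatore / Football Player pair comes from the slide.

```python
# Hypothetical entity type records with one label per language.
ENTITY_TYPES = [
    {"labels": {"en": "Football Player", "it": "Calciatore"}},
    {"labels": {"en": "Football Team", "it": "Squadra di calcio"}},
]


def suggest_types(prefix, types, display_lang="en"):
    """Match the prefix against the tokens of every language label.

    Returns (matched label, its language, label in the display language).
    """
    p = prefix.lower()
    results = []
    for t in types:
        for lang, label in t["labels"].items():
            if any(tok.startswith(p) for tok in label.lower().split()):
                results.append((label, lang, t["labels"][display_lang]))
    return results
```

Returning the matched label together with its language is what would let a user interface highlight the Italian suggestion while still showing the English type name.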
Smart Autocomplete – configuration

Knowledge base for entity linking and dereference

DBpedia, Freebase, custom dataset

Properties

For each entity type of interest

LDPath will be used to identify the property
in the graph

Hierarchy

All the sub-instances of a type
will automatically inherit their parent properties
to ease the configuration
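The inheritance rule can be sketched as a walk up the type hierarchy that collects the properties configured at each level. The hierarchy and the property names below are made up for illustration; in the configuration described above each property would be an LDPath expression rather than a plain name.

```python
# Hypothetical type hierarchy: child -> parent (None marks the root).
TYPE_PARENT = {
    "Football Player": "Athlete",
    "Athlete": "Person",
    "Person": None,
}

# Hypothetical properties configured directly on each type.
TYPE_PROPERTIES = {
    "Person": ["nationality"],
    "Athlete": ["sport"],
    "Football Player": ["team"],
}


def properties_for(entity_type):
    """Return the type's own properties plus those inherited from its ancestors."""
    props = []
    current = entity_type
    while current is not None:
        # Prepend so that properties of the most general type come first.
        props = TYPE_PROPERTIES.get(current, []) + props
        current = TYPE_PARENT.get(current)
    return props
```

With this layout a property such as `nationality` is configured once on the parent type and becomes available to every sub-type without repeating it.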
Semantic Search

Search by Named Entity

Ex. Give me all the documents related to
Christian Bale

Search by Entity Type

Ex. Give me all the documents about football players

Search by Entity Type + properties

Ex. Give me all the documents about football players whose nationality is British

Query-time join:
Entity / Entity Type collections → primary index
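One way to express such a query-time join is Solr's join query parser, which takes the form `{!join from=... to=... fromIndex=...}`. The helper below only builds the query string; the collection name `entities` and the field names `id`, `entity_ids`, `type` and `nationality` are assumptions, not the actual Sensefy schema.

```python
def build_join_query(entity_type, properties, from_index="entities",
                     from_field="id", to_field="entity_ids"):
    """Build a join query: select entities by type and properties,
    then join their ids onto the documents of the primary index."""
    clauses = ['type:"%s"' % entity_type]
    # Sort the properties so the generated query is deterministic.
    clauses += ['%s:"%s"' % (k, v) for k, v in sorted(properties.items())]
    inner = " AND ".join(clauses)
    return "{!join from=%s to=%s fromIndex=%s}%s" % (
        from_field, to_field, from_index, inner)


# "All documents about football players whose nationality is British"
query = build_join_query("Football Player", {"nationality": "British"})
```

The resulting string would be sent as the `q` (or `fq`) parameter to the primary index. Cross-collection joins have deployment constraints in Solr, so this should be read as the shape of the query rather than a drop-in configuration.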
Semantic Facets

Dynamically calculated semantic facets based on
types and entities from documents

Improve the navigation of results

Allow refined search through semantic information

Configurable custom layer on top of Solr faceting component
Semantic More Like This

Search for similar documents based on Entities
and Entity Types

Similarity function based on document meaning

Multi-language: based on concepts, not on text tokens

Solr More Like This on custom fields

Entity Frequency /
Inverse Document Frequency

Entity Type Frequency /
Inverse Document Frequency
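The weighting above is the familiar TF-IDF scheme applied to entity identifiers instead of text tokens. The sketch below is a simplified stand-in, not Solr's More Like This implementation: it weights each entity by frequency times a smoothed inverse document frequency and compares documents by cosine similarity. The smoothing formula and the entity identifiers are assumptions.

```python
import math
from collections import Counter


def ef_idf(doc_entities, corpus):
    """Weight each entity by its frequency in the document times a smoothed IDF."""
    n_docs = len(corpus)
    weights = {}
    for entity, freq in Counter(doc_entities).items():
        df = sum(1 for d in corpus if entity in d)
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed, always positive
        weights[entity] = freq * idf
    return weights


def cosine(a, b):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Entity ids per document; the documents could be written in different languages.
corpus = [
    ["dbpedia:Cristiano_Ronaldo", "dbpedia:Real_Madrid_CF"],
    ["dbpedia:Cristiano_Ronaldo", "dbpedia:Real_Madrid_CF", "dbpedia:Juventus_FC"],
    ["dbpedia:Christian_Bale"],
]
vectors = [ef_idf(d, corpus) for d in corpus]
```

Since the vectors are keyed by entity identifier, two documents about the same entities score as similar even when they share no words, which is the point of the multi-language claim.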
Live Demo

What's upcoming

Machine Learning components:
– Classification
– Topic annotation
– Clustering

Secured Entity Search

Image and Media searches

Advanced Geo-search

Personalized/collaborative search

Recommendations

Q&A

Advanced configurable Admin Dashboard
Any Questions?
Alessandro Benedetti
Search Consultant and
R&D Software Engineer
Zaizi
Email: abenedetti@zaizi.com
Twitter: @Zaizi

Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 

Multi-language Content Discovery Through Entity Driven Search

  • 2. Multi-language Content Discovery Through Entity Driven Search
      Alessandro Benedetti
      Search Consultant and R&D Software Engineer, Zaizi
      http://uk.linkedin.com/in/alexbenedetti
  • 3. Who I am
      Alessandro Benedetti
      - Apache ManifoldCF committer
      - Search Consultant
      - R&D Software Engineer
      - Master in Computer Science
      - Information Retrieval Background
      - Semantic, NLP, Machine Learning Technologies Enthusiast
      - Beach Volleyball Player & Snowboarder
  • 5. ZAIZI
      - Experienced at building and delivering a wide range of enterprise solutions across the whole information life cycle
      - Alfresco & Ephesoft certified Platinum Partner
      - Red Hat Enterprise Linux Ready Partner
      - R&D department specialising in Open Source Search Solutions
      Alfresco Partner of the Year 2012 and 2013
  • 7. Zaizi R&D Department
      - Giving sense to the content
      - Enriching it semantically
      - Adding value to ECM/CMS
      - More structured content, easy to manage, link and search
      - Improving search
      - Across different domains, data sources, User Experience
      - Machine Learning applied research
      - Content Organization – Recommendation Systems
  • 8. Enterprise Search Problems
      Challenge: Search within Big and Heterogeneous Repositories
      - Heterogeneous data sources
      - Filesystems, DB, ECM/CMS, Email, …
      - Unstructured content in different formats
      - PDF, plain text, Word …
      - Documents not linked to each other
      - Federated Search
      - across data sources
      - preserving permissions
      - centralized endpoint
  • 9. Sensefy
      - Semantic Enterprise Search Engine
      - Federated Search
      - Evolved User Experience
      - Based on cutting-edge Open Source Frameworks
  • 11. Entity Driven Search
      - Moving from keywords to Entities
      - More understandable to Humans
      - Process the unstructured text at indexing time
      - Enrich it
      - Build specific indexes
      - Use entities and concepts in searches
          - Trying to foresee the concepts the user wants to express
  • 12. What is an Entity in our domain?
      - Real world concepts
      - Linked Data resources
      - RDF (XML) structured data
          - Unique identifier + properties
      - Stored in a Knowledge Base (Freebase, DBpedia, Custom Dataset)
  • 13. Redlink
      - Semantic Cloud platform
      - Providing Software as a Service
      - Text analysis and Entity Linking using Knowledge Bases
      - Linked Data Publishing
      - Enterprise Data Linking
      - Open-Source based components
  • 14. Indexing - NLP & Semantic Enrichment
      - Apache ManifoldCF custom processors/output connectors
      - From unstructured to structured
      - NLP Analysis. POS Tagging
      - Named Entity Recognition
      - Entity Linking using Knowledge Bases
      - Disambiguation
      - Indexing in specific Solr Collections
          - Primary Index (documents)
          - Entity Index
          - Entity Types
  • 15. Search - Smart Autocomplete
      - Multi Phase suggestions
      - Closer to natural language query formulation
      - Named Entities
      - Entity Types
      - Document Titles
  • 16. Smart Autocomplete – Named Entities
      - Infix suggestion (ron → Cristiano Ronaldo)
      - Fuzzy suggestion (cristinao → Cristiano Ronaldo)
      - Brief description of the suggested entity
      - Specific Solr index for the entities
          - Schema (label, notable_type, occurrences...)
          - Edge-Ngram token filtered label field
          - Fuzzy queries with variable distance / classic queries to the label suggestion field
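
The infix and fuzzy suggestions on this slide can be illustrated outside Solr with a small sketch. It is not the Sensefy implementation: the function names (`edge_ngrams`, `edit_distance`, `suggest`) and the default distance of 2 are assumptions chosen for illustration, and the edge n-grams are computed in plain Python rather than by a Solr token filter.

```python
def edge_ngrams(token, min_len=1, max_len=20):
    """Leading-edge n-grams of a token, e.g. 'ron' -> ['r', 'ro', 'ron']."""
    upper = min(len(token), max_len)
    return [token[:n] for n in range(min_len, upper + 1)]


def edit_distance(a, b):
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def suggest(user_input, labels, max_distance=2):
    """Return labels with a token starting with the input, then fuzzy matches.

    max_distance is a hypothetical default, not a value from the slides.
    """
    user_input = user_input.lower()
    exact, fuzzy = [], []
    for label in labels:
        tokens = label.lower().split()
        # "Infix" at label level: the input is a prefix of any token in the label.
        if any(user_input in edge_ngrams(t) for t in tokens):
            exact.append(label)
        # Fuzzy fallback: the whole input is within max_distance of a whole token.
        elif any(edit_distance(user_input, t) <= max_distance for t in tokens):
            fuzzy.append(label)
    return exact + fuzzy


labels = ["Cristiano Ronaldo", "Christian Bale"]
print(suggest("ron", labels))
print(suggest("cristinao", labels))
```

Indexing the n-grams ahead of time is what makes the prefix lookup cheap at query time; the fuzzy branch here compares complete tokens only, which is a simplification.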
  • 17. Smart Autocomplete – Entity Types
      - Infix suggestion (play → Football Player)
      - Fuzzy suggestion (foobtall → Football Team)
      - Multi Language (calcia → Calciatore [it] (Football Player [en]))
      - Multi phase suggestion through properties (ital → football player nationality italian)
      - Specific Solr collection for the entity types
          - SolrDocument is an entity type (type, occurrences, attributes, type hierarchy...)
          - EdgeNgram token filtered type
          - Multi-language suggestion highlight
  • 18. Smart Autocomplete – configuration
      - Knowledge base for entity linking and dereferencing
      - DBpedia, Freebase, Custom Dataset
      - Properties
      - For each entity type of interest
      - LDPath will be used to identify the property in the graph
      - Hierarchy
      - All the sub-instances of a type will automatically inherit their parent properties to ease the configuration
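
The hierarchy rule on this slide (sub-types inherit the properties configured on their parents) can be sketched as a walk up a parent map. The type names, property names and the dictionary layout below are hypothetical; the slides do not show the actual configuration format.

```python
# Hypothetical type hierarchy: each type points to its parent (None for the root).
TYPE_PARENT = {
    "football_player": "athlete",
    "athlete": "person",
    "person": None,
}

# Hypothetical properties configured directly on each type.
TYPE_PROPERTIES = {
    "person": ["nationality", "birth_date"],
    "athlete": ["sport"],
    "football_player": ["team"],
}


def inherited_properties(entity_type, parents, properties):
    """Collect a type's own properties followed by those of its ancestors."""
    result = []
    seen = set()  # guards against accidental cycles in the configuration
    current = entity_type
    while current is not None and current not in seen:
        seen.add(current)
        for prop in properties.get(current, []):
            if prop not in result:
                result.append(prop)
        current = parents.get(current)
    return result


print(inherited_properties("football_player", TYPE_PARENT, TYPE_PROPERTIES))
```

With this layout, only the property that is specific to a sub-type needs to be configured on it.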
  • 19. Semantic Search
      - Search by Named Entity
      - Ex. Give me all the documents related to Christian Bale
      - Search by Entity Type
      - Ex. Give me all the documents about football players
      - Search by Entity Type + properties
      - Ex. Give me all the documents about football players whose nationality is British
      - Query-time join: Entity-Entity Type collection → primary Index
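
A query-time join from the entity collection to the primary index could be expressed with Solr's join query parser (`{!join from=... to=... fromIndex=...}`). The sketch below only builds the request parameters. The field names (`id`, `entity_ids`, `type`, `nationality`) and the collection name `entities` are assumptions, since the slides do not give the real schema.

```python
from urllib.parse import urlencode


def build_join_query(entity_filter, from_field="id", to_field="entity_ids",
                     from_index="entities"):
    """Build Solr request parameters for a join from the entity collection
    to the collection being queried. All field and collection names here are
    hypothetical placeholders."""
    local_params = f"{{!join from={from_field} to={to_field} fromIndex={from_index}}}"
    return {"q": local_params + entity_filter, "wt": "json"}


# "Documents about football players whose nationality is British"
params = build_join_query("type:football_player AND nationality:british")
print(params["q"])
print(urlencode(params))
```

The inner query selects entities by type and property; the join then returns primary-index documents whose `entity_ids` field references any of those entities.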
  • 20. Semantic Facets
      - Dynamically calculated semantic facets based on types and entities from documents
      - Improve the navigation of results
      - Allow refined search through semantic information
      - Configurable custom layer on top of Solr faceting component
  • 21. Semantic More Like This
      - Search for similar documents based on Entities and Entity Types
      - Similarity function based on document meaning
      - Multi Language / Not based on text tokens but concepts
      - Solr More Like This on custom fields
      - Entity Frequency / Inverse Document Frequency
      - Entity Type Frequency / Inverse Document Frequency
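
The idea of scoring similarity on entities rather than text tokens can be sketched with entity frequency and inverse document frequency weights combined by cosine similarity. This is a simplified stand-in, not Solr's More Like This scoring; the smoothed IDF formula, the document ids and the entity identifiers are assumptions for illustration.

```python
import math
from collections import Counter


def entity_tfidf(doc_entities, corpus):
    """Weight each entity by its frequency in the document times a smoothed IDF."""
    n_docs = len(corpus)
    df = Counter()
    for entities in corpus.values():
        df.update(set(entities))  # document frequency: count each entity once per doc
    tf = Counter(doc_entities)
    return {e: count * (math.log((1 + n_docs) / (1 + df[e])) + 1.0)
            for e, count in tf.items()}


def cosine(a, b):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(w * b.get(e, 0.0) for e, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def more_like_this(doc_id, corpus, top_n=3):
    """Rank the other documents by entity-based similarity to doc_id."""
    vectors = {d: entity_tfidf(ents, corpus) for d, ents in corpus.items()}
    target = vectors[doc_id]
    scored = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != doc_id]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical corpus: an English and an Italian document share an entity id,
# so they match even though their text tokens would differ.
corpus = {
    "doc_en": ["dbpedia:Cristiano_Ronaldo", "dbpedia:Real_Madrid"],
    "doc_it": ["dbpedia:Cristiano_Ronaldo", "dbpedia:Juventus"],
    "doc_film": ["dbpedia:Christian_Bale"],
}
print(more_like_this("doc_en", corpus))
```

Because the vectors are keyed by entity identifiers, the comparison is independent of the language the documents were written in.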
  • 23. What's upcoming
      - Machine Learning components:
          - Classification
          - Topic annotation
          - Clustering
      - Secured Entity Search
      - Image and Media searches
      - Advanced Geo-search
      - Personalized/collaborative search
      - Recommendations
      - Q&A
      - Advanced configurable Admin Dashboard
  • 24. Any Questions?
      Alessandro Benedetti
      Search Consultant and R&D Software Engineer, Zaizi
      Email: abenedetti@zaizi.com
      Twitter: @Zaizi