Annotated text
Searching for “things not strings”
aka: Entity oriented search
https://www.springer.com/la/book/9783319939339
Text strings = unstructured data
.. he visited JFK later that day
Entities = structured data
https://en.wikipedia.org/wiki/John_F._Kennedy
https://en.wikipedia.org/wiki/John_F._Kennedy_International_Airport
Strings vs things
35th U.S. President
John Fitzgerald Kennedy
JFK
• Synonymy: thing has multiple string forms (bad for recall)
• Polysemy: string has multiple meanings (bad for precision)
Strings vs things (it gets worse...)
• Query and document text strings are tokenized (matching all Johns is bad for precision)
• Tokens: 35th, U.S, President, John, Fitzgerald, Kennedy, JFK
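The two failure modes above can be reproduced with a toy matcher. This is only an illustrative sketch: the three documents, the `tokenize` helper, and the "any shared token" matching rule are assumptions made for the example, not how Elasticsearch analyzes or scores text.

```python
import re

def tokenize(text):
    """Lowercase and split on non-alphanumerics (a rough stand-in for an analyzer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

# Hypothetical documents: d1 and d2 are about the president, d3 is not.
docs = {
    "d1": "John F. Kennedy was the 35th U.S. President",
    "d2": "He visited JFK later that day",
    "d3": "John Lennon played in New York",
}

def match_any(query, docs):
    """Return ids of documents sharing at least one token with the query."""
    query_tokens = tokenize(query)
    return sorted(doc_id for doc_id, text in docs.items()
                  if query_tokens & tokenize(text))

# Tokenization: "john" alone is enough to pull in the unrelated d3 (hurts precision).
# Synonymy: d2 says "JFK", so the full-name query misses it (hurts recall).
print(match_any("John F. Kennedy", docs))
print(match_any("JFK", docs))
```

The first query returns d1 and d3 but not d2; the second returns only d2.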
Precision and Recall
(Diagrams: search results overlaid on relevant docs, within all docs)
• Search for John F. Kennedy: high recall, low precision
• Search for “35th US President”: low recall, high precision
• Search for JFK: high recall, low precision
• Search for https://en.wikipedia.org/wiki/John_F._Kennedy: high recall, high precision
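For reference, precision is the share of returned documents that are relevant, and recall is the share of relevant documents that are returned. A minimal sketch, using made-up document ids rather than data from the slides:

```python
def precision_recall(results, relevant):
    """Return (precision, recall) for a set of results against the relevant set."""
    results, relevant = set(results), set(relevant)
    hits = results & relevant
    precision = len(hits) / len(results) if results else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {1, 2, 3, 4}                   # hypothetical relevant docs

broad = {1, 2, 3, 4, 5, 6, 7, 8}          # finds everything, plus noise
narrow = {1}                              # exact phrase: clean but misses most
by_id = {1, 2, 3, 4}                      # entity ID match: the ideal case

print(precision_recall(broad, relevant))   # (0.5, 1.0)
print(precision_recall(narrow, relevant))  # (1.0, 0.25)
print(precision_recall(by_id, relevant))   # (1.0, 1.0)
```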
Why use Entity IDs?
https://en.wikipedia.org/wiki/John_F._Kennedy
• One canonical ID
• One meaning
The ideal match:
Documents index entity IDs
Searches use entity IDs
== ==
https://en.wikipedia.org/wiki/John_F._Kennedy_International_Airport
How to rewrite free-text searches?
Documents use structured info
Searches use structured criteria
== ==
https://en.wikipedia.org/wiki/John_F._Kennedy_International_Airport
“JFK”
How to structure free text in docs?
Documents use structured info
Searches use structured criteria
== ==
https://en.wikipedia.org/wiki/John_F._Kennedy_International_Airport
Annotated text = text + entity IDs
.. he visited JFK later that day
https://en.wikipedia.org/wiki/John_F._Kennedy_International_Airport
How do I enable it?
• New field type in Elasticsearch 6.5
• Enabled by installing a plugin
sudo bin/elasticsearch-plugin install mapper-annotated-text
New field type:
"mappings": {
  "_doc": {
    "properties": {
      "my_rich_text_field": {
        "type": "annotated_text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
Markdown-like syntax:
{
  "my_rich_text_field": "Today [elastic](Elastic+Inc.) announced"
}
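Producing this markup from plain text plus a known entity span is mostly string work. The helper below is a hypothetical sketch, not part of the plugin: `annotate` and its character offsets are assumptions, and it does not escape square brackets that might already occur in the text. Values are URL-encoded and multiple values are joined with `&`, as in the examples on these slides.

```python
from urllib.parse import quote_plus

def annotate(text, start, end, entity_ids):
    """Wrap text[start:end] as [anchor](value1&value2...) with URL-encoded values."""
    values = "&".join(quote_plus(entity_id) for entity_id in entity_ids)
    return f"{text[:start]}[{text[start:end]}]({values}){text[end:]}"

plain = "Today elastic announced"

# Characters 6..13 cover the word "elastic".
print(annotate(plain, 6, 13, ["Elastic Inc."]))
print(annotate(plain, 6, 13, ["Elastic Inc.", "Company"]))
```

The first call prints the same string as the slide above; the second adds a type value after `&`.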
What gets indexed?
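POST wikipediaannotated/_analyze
{
  "field": "text",
  "text": "Today [elastic](Elastic+Inc.&Company) announced"
}
Tokens by position (annotation values are indexed as-is, at the same position as the text they annotate):
0: today
1: elastic, Elastic Inc., Company
2: announced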
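The token layout can be approximated in a few lines of Python. This is a rough simulation under stated assumptions, not the plugin's code: plain words are lowercased and split with `\w+` as a stand-in for the field's analyzer, and each annotation value is URL-decoded and placed, unmodified, at the position of the first word of its anchor text.

```python
import re
from urllib.parse import unquote_plus

ANNOTATION = re.compile(r"\[([^\[\]]*)\]\(([^()]*)\)")
WORD = re.compile(r"\w+")

def tokens_with_positions(annotated):
    """Return (position, token) pairs for annotated text."""
    tokens = []
    position = 0
    cursor = 0

    def add_words(chunk):
        nonlocal position
        for word in WORD.findall(chunk.lower()):
            tokens.append((position, word))
            position += 1

    for match in ANNOTATION.finditer(annotated):
        add_words(annotated[cursor:match.start()])   # text before the annotation
        anchor_position = position
        add_words(match.group(1))                     # the anchor text itself
        for value in match.group(2).split("&"):       # each annotation value
            if value:
                tokens.append((anchor_position, unquote_plus(value)))
        cursor = match.end()
    add_words(annotated[cursor:])                     # trailing text
    return tokens

for position, token in tokens_with_positions(
        "Today [elastic](Elastic+Inc.&Company) announced"):
    print(position, token)
```

The order of tokens within one position may differ from what `_analyze` reports; only the positions are the point here.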
Example use: show entities in context
{
  "person": ["Lee Harvey Oswald", "John F. Kennedy"],
  "text": "...the [35th President of the United States](John%20F.%20Kennedy) ..."
}
Document
{
  "query": {
    "term": {"person": "John F. Kennedy"}
  },
  "highlight": {
    "fragment_size": 200,
    "require_field_match": false,
    "type": "annotated",
    "fields": { "text": {} }
  }
}
Query
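The point of the example is that a search for the entity can surface a passage that never spells out the name. The sketch below imitates that effect on the client side. It is not the plugin's `annotated` highlighter: the `mentions` helper, the `<em>` tags, and the `width` parameter are assumptions for illustration, and the surrounding snippet is cut by raw character count, so it may split other markup.

```python
import re
from urllib.parse import unquote_plus

ANNOTATION = re.compile(r"\[([^\[\]]*)\]\(([^()]*)\)")

def mentions(annotated, entity, width=20):
    """Return (anchor_text, snippet) for each annotation carrying `entity`."""
    found = []
    for match in ANNOTATION.finditer(annotated):
        values = [unquote_plus(v) for v in match.group(2).split("&")]
        if entity in values:
            left = annotated[max(0, match.start() - width):match.start()]
            right = annotated[match.end():match.end() + width]
            anchor = match.group(1)
            found.append((anchor, f"{left}<em>{anchor}</em>{right}"))
    return found

text = "...the [35th President of the United States](John%20F.%20Kennedy) ..."
for anchor, snippet in mentions(text, "John F. Kennedy"):
    print(anchor)
    print(snippet)
```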
Detour: DBpedia ontology
• Things
  • Living things
    • People
      • Office Holder
        • John F. Kennedy
        • Dwight D. Eisenhower
• DBpedia ontology provides classes for Wikipedia entries
• Can be used to enrich entity IDs with type
See http://mappings.dbpedia.org/server/ontology/classes/
{
  "text": ".. General [Dwight D. Eisenhower](Dwight%20D.%20Eisenhower&OfficeHolder)…"
}
Example use: find things near strings
{
  "text": ".. photographs of [Abraham Lincoln](Abraham%20Lincoln&OfficeHolder)..."
}
Document
{
  "query": {
    "span_near": {
      "slop": 1,
      "in_order": true,
      "clauses": [
        {"span_term": {"text": {"value": "photographs"}}},
        {"span_term": {"text": {"value": "of"}}},
        {"span_term": {"text": {"value": "OfficeHolder"}}}
      ]
    }
  }
}
Query
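Queries of this shape are easy to generate for any mix of words and entity types. A minimal sketch, where `span_near_body` is a hypothetical helper that only builds the request body and sends nothing: because `span_term` does not analyze its input, plain words should be given as they were indexed (lowercased, assuming a lowercasing analyzer on the field), while annotation values such as `OfficeHolder` are given verbatim.

```python
import json

def span_near_body(field, terms, slop=1, in_order=True):
    """Build a search body matching `terms` near each other on `field`."""
    clauses = [{"span_term": {field: {"value": term}}} for term in terms]
    return {
        "query": {
            "span_near": {
                "slop": slop,
                "in_order": in_order,
                "clauses": clauses,
            }
        }
    }

# Two plain words followed by an entity type injected via annotation.
body = span_near_body("text", ["photographs", "of", "OfficeHolder"])
print(json.dumps(body, indent=2))
```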
Further reading
https://www.elastic.co/guide/en/elasticsearch/plugins/6.6/mapper-annotated-text-usage.html
https://www.elastic.co/blog/search-for-things-not-strings-with-the-annotated-text-plugin

Annotated Text feature in Elasticsearch
