CONTENT INTELLIGENCE
Statistical Entity Linking
Laurie Lugrin, R&D NLP engineer
idio, content marketing
for marketing & sales teams
• content insight
• user interest profile
• recommendation
Demo: content model, user model, topic performance chart
idio, content marketing
• focus on interests, not socio-demographics / firmographics
• automatic text analysis, model building and recommendation
Content analysis at idio
• Python
• build automation (Luigi)
• web services
• Scala
• entity linking, i.e. finding topics in texts
How most discussions start at conferences
• me: “I work on Natural Language Processing.”
• other: “So you’re in the field of deep learning?”
• However, our topic extractor is based on statistical analysis.
Outline
• entity linking task
• method (inspired by DBpedia-Spotlight)
• data pre-processing
• adaptation
Entity Linking task
• Find topics in a text written in a natural language.
• For us: 1 topic = 1 URI of a Wikipedia article.
• Sometimes called “wikification”.
Demo: DBpedia Spotlight demo
Entity Linking jargon
Surface Form (SF): a word or phrase that refers to a topic. For example, “hoverboard” and “hover board” are surface forms for the topic “Hoverboard”, the fictional levitating board used for personal transportation.
Context: the words surrounding a surface form.
I'm on cloud 9 whenever I write Python code.
Challenges
• ambiguous words
• multi-word expressions
• different possible splits
I'm on cloud 9 whenever I write Python code.
Method
• Build an annotation model: statistics about words and topics
• Apply it to the given input text
Build an annotation model
Data: Wikipedia
• 1 link = 1 annotation
Demo: edit mode of a Wikipedia page
Algorithm overview:
• find all potential Surface Forms
• decide to annotate or not
• decide which topic
So, what statistics do we need?
Demo: DBpedia Spotlight demo, candidates view
Extract stats
Steps:
• Identify all known SFs
• Decide to annotate or not
• Decide which topic
Statistics:
‣ Set of SFs we’ve seen annotated
‣ P(annotation | SF)
‣ Number of annotations for each SF
‣ P(topic | SF)
‣ P(topic | annotation)
‣ P(topic | context)
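As a rough illustration of how two of these probabilities could be derived from raw counts, here is a minimal Python sketch. The counts, the topic name `Pythonidae` and the function names are hypothetical toy values chosen for the example, not figures from the talk or from Wikipedia.

```python
from collections import Counter

# Hypothetical toy counts; real values would come from counting a Wikipedia dump.
sf_annotated = Counter({"Python": 80, "write": 2})   # times the SF was a link
sf_total = Counter({"Python": 100, "write": 1000})   # times the SF appeared at all
sf_topic = Counter({
    ("Python", "Python_(programming_language)"): 60,
    ("Python", "Pythonidae"): 20,
})


def p_annotation_given_sf(sf: str) -> float:
    """Fraction of all occurrences of `sf` that were annotated (linked)."""
    total = sf_total[sf]
    return sf_annotated[sf] / total if total else 0.0


def p_topic_given_sf(topic: str, sf: str) -> float:
    """Fraction of annotated occurrences of `sf` that point to `topic`."""
    annotated = sf_annotated[sf]
    return sf_topic[(sf, topic)] / annotated if annotated else 0.0
```

With these toy numbers, “Python” is annotated in most of its occurrences while “write” almost never is, which is the kind of signal the "annotate or not" decision relies on.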
Model
surface forms
• annotated count, total count
topics
• number of annotations
• context, i.e. surrounding words, with number of occurrences
surface form <-> topic associations
• number of annotations
Model: Example
SciPy is an [[open source]]
[[Python (programming language)|Python]] library
used by scientists, analysts, and engineers doing
[[scientific computing]] and technical computing.
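To make "1 link = 1 annotation" concrete, the following Python sketch pulls (surface form, topic) pairs out of the wikitext above with a regular expression. It is a simplification under stated assumptions: the `WIKILINK` pattern and `extract_annotations` are illustrative names, and real wikitext (templates, file links, section anchors) needs more careful handling.

```python
import re

# Matches [[target]] and [[target|label]]; nested markup is not handled.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def extract_annotations(wikitext: str) -> list[tuple[str, str]]:
    """Return one (surface form, topic) pair per wiki link."""
    pairs = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        if not target:
            continue
        label = (match.group(2) or target).strip()
        # Article titles use underscores and a capitalised first letter.
        topic = target[0].upper() + target[1:].replace(" ", "_")
        pairs.append((label, topic))
    return pairs


text = (
    "SciPy is an [[open source]] "
    "[[Python (programming language)|Python]] library used by scientists, "
    "analysts, and engineers doing [[scientific computing]] "
    "and technical computing."
)
annotations = extract_annotations(text)
```

When a link has a pipe, the part after it is the surface form and the part before it is the topic; without a pipe, the same text serves as both.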
Model
surface forms: annotated count, total count
• “is” / “an” / “and” are skipped because too common (not informative)
SF                 anno   total
SciPy              0      1
Python             1      1
open               0      1
open source        1      1
engineers doing    0      1
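The table can be reproduced with a short counting routine. The Python sketch below is one possible reading of the slide, assuming that candidates are all word n-grams up to length 2 and that n-grams made only of skipped words are dropped; `count_surface_forms` and `max_len` are hypothetical names.

```python
import re
from collections import Counter

STOP_WORDS = {"is", "an", "and"}  # the words the slide says are skipped
# Captures the visible label of [[target|label]] or [[label]].
LINK = re.compile(r"\[\[(?:[^\[\]|]+\|)?([^\[\]|]+)\]\]")


def count_surface_forms(wikitext: str, max_len: int = 2) -> dict[str, tuple[int, int]]:
    """Map each candidate n-gram to (annotated count, total count)."""
    annotated = Counter(m.group(1) for m in LINK.finditer(wikitext))
    plain = LINK.sub(lambda m: m.group(1), wikitext)  # drop the link markup
    tokens = re.findall(r"\w+", plain)
    total = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if all(t.lower() in STOP_WORDS for t in gram):
                continue  # skip n-grams made only of stop words
            total[" ".join(gram)] += 1
    return {sf: (annotated[sf], count) for sf, count in total.items()}


text = (
    "SciPy is an [[open source]] "
    "[[Python (programming language)|Python]] library used by scientists, "
    "analysts, and engineers doing [[scientific computing]] "
    "and technical computing."
)
counts = count_surface_forms(text)
```

Here `counts["open source"]` is `(1, 1)` because the phrase occurs once and is linked once, while `counts["open"]` is `(0, 1)` because the single word is never a link on its own.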
Model
topics: number of annotations, context
topic                            num. annotations   context
Open_source                      1                  SciPy, Python, library, engineer, technical, computing, use, science x2, analyst
Python_(programming_language)    1                  (same)
Scientific_computing             1                  (same)
Model
surface form <-> topic associations: number of annotations
SF                     topic                            num. annotations
open source            Open_source                      1
Python                 Python_(programming_language)    1
scientific computing   Scientific_computing             1
Text annotation
Model: SF, topic and association statistics
Input text: I'm on cloud 9 whenever I write Python code.
Known SFs: “cloud”, “9”, “cloud 9”, “write”, “Python”, “code”
Annotate?
• discard “9” and “write” because low anno probability
• “cloud” vs “cloud 9” overlap: keep higher anno probability
• keep “Python” and “code”
Which topic?
• SF “Python” is ambiguous: animal or programming language?
• The context supports the programming language.
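The three decisions on this slide can be sketched end to end in Python. Everything in the tables below is a hypothetical toy model (probabilities, context words, and the topic names `Cloud_nine`, `Cloud`, `Pythonidae`), and the context score used here, prior times one plus the number of shared context words, is a simple stand-in rather than the formula used by DBpedia Spotlight or idio.

```python
import re

# Hypothetical toy model; real numbers would come from Wikipedia counts.
P_ANNOTATION = {"cloud": 0.10, "9": 0.01, "cloud 9": 0.40,
                "write": 0.01, "Python": 0.80, "code": 0.30}
CANDIDATES = {  # P(topic | SF)
    "cloud": {"Cloud": 1.0},
    "cloud 9": {"Cloud_nine": 1.0},
    "Python": {"Pythonidae": 0.6, "Python_(programming_language)": 0.4},
    "code": {"Source_code": 1.0},
}
CONTEXT = {  # words seen around each topic
    "Python_(programming_language)": {"code", "write", "library"},
    "Pythonidae": {"snake", "species"},
}


def spot(tokens, max_len=2):
    """Yield (start, end, surface form) for every known n-gram."""
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            sf = " ".join(tokens[start:end])
            if sf in P_ANNOTATION:
                yield start, end, sf


def select(spots, threshold=0.05):
    """Drop unlikely spots; on overlap keep the higher annotation probability."""
    likely = [s for s in spots if P_ANNOTATION[s[2]] >= threshold]
    kept = []
    for start, end, sf in sorted(likely, key=lambda s: -P_ANNOTATION[s[2]]):
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, sf))
    return sorted(kept)


def disambiguate(sf, context_words):
    """Pick the topic whose prior, boosted by context overlap, is highest."""
    def score(topic):
        overlap = len(CONTEXT.get(topic, set()) & context_words)
        return CANDIDATES[sf][topic] * (1 + overlap)
    return max(CANDIDATES[sf], key=score)


def annotate(text):
    tokens = re.findall(r"\w+", text)
    result = []
    for start, end, sf in select(spot(tokens)):
        context = set(tokens[:start] + tokens[end:])
        result.append((sf, disambiguate(sf, context)))
    return result


result = annotate("I'm on cloud 9 whenever I write Python code.")
```

In this toy model the animal has the higher prior for “Python”, yet the surrounding words “write” and “code” tip the decision towards the programming language, mirroring the behaviour described on the slide.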
Data preparation
extract stats: “It’s just counting.”
• stemming: reduce words to their stem, e.g. “laptops” -> “laptop”, “worked” -> “work”
• skip words that are too common to be informative (e.g. “is”, “an”, “and”)
• The Wikipedia dump is a single XML file of ~50 GB.

(See our blog post on “idio’s Wikipedia toolkits”)
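A file of that size cannot be loaded into memory in one go, so it has to be streamed. The Python sketch below shows one way to do that with the standard library's `xml.etree.ElementTree.iterparse`; it is not idio's toolkit. The element names (`page`, `title`, `text`) follow the MediaWiki export format, and the tiny in-memory sample is made up for the example.

```python
import io
import xml.etree.ElementTree as ET


def local(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) one page at a time from an XML dump stream."""
    title, text = None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            # Release the finished page's children; the emptied element
            # itself stays attached to the root.
            elem.clear()


# Made-up two-page sample standing in for the real dump file.
sample = io.BytesIO(
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    b"<page><title>SciPy</title><revision>"
    b"<text>SciPy is an [[open source]] library.</text>"
    b"</revision></page>"
    b"<page><title>Empty</title><revision><text></text></revision></page>"
    b"</mediawiki>"
)
pages = list(iter_pages(sample))
```

For a real dump, the stream would be an open file handle instead of `io.BytesIO`, and each yielded page would be fed straight into the counting step rather than collected in a list.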
Challenges
Wikipedia is not representative of our clients’ articles.
• Wikipedia contains errors, but they are not the worst problem.
• Unwanted SF-topic associations
• Virtually all surface forms are ambiguous.
• Bad priors: “yesterday” -> Beatles song (even in lower case)
• Annotation of a sub-part of an SF
How to fix?
• Whenever possible, find a pattern, understand the underlying issue and change the algorithm: give a boost to some SFs, some topics, or some SF-topic associations, e.g. capitalised phrases.
• Tweak the model/probabilities for isolated issues.
• We made a model editor for Spotlight; see our GitHub.
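Both kinds of fix can be sketched in a few lines of Python. This is only an illustration of the idea: the boost factor of 1.5, the function names and the `OVERRIDES` table (including the topic name `Yesterday_(song)`) are assumptions made for the example, not values from idio's system.

```python
def boosted_annotation_probability(sf: str, base: float, boost: float = 1.5) -> float:
    """Algorithmic fix: raise P(annotation | SF) for capitalised phrases, capped at 1."""
    words = sf.split()
    capitalised = bool(words) and all(w[0].isupper() for w in words)
    return min(1.0, base * boost) if capitalised else base


# Isolated fix: hand-edited probabilities that replace the learned ones.
OVERRIDES = {("yesterday", "Yesterday_(song)"): 0.0}


def adjusted_p_topic_given_sf(sf: str, topic: str, learned: float) -> float:
    """Return the manual override if one exists, else the learned probability."""
    return OVERRIDES.get((sf, topic), learned)
```

The boost changes behaviour for a whole class of surface forms, whereas the override table touches a single SF-topic pair, which is the distinction the first two bullets draw.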
Takeaways
• Wikipedia is an awesome resource for NLP.
• Statistics can solve some NLP tasks.
• Adapt the formula with rules and boosts to make up for the differences between the learning data set and the output we want.
Links and resources
DBpedia
• http://dl.acm.org/citation.cfm?id=2002592
• http://dbpedia-spotlight.github.io/demo/
• https://github.com/dbpedia-spotlight/dbpedia-spotlight
idio
• idioplatform.com
• http://engineering.idioplatform.com/
• github.com/idio/
Thank you
