EDRAK: Entity-centric Data Resource for Arabic Knowledge
Mohamed H. Gad-Elrab, Mohamed Amir Yosef, Gerhard Weikum
Max-Planck-Institut für Informatik
Saarbrücken, Germany
30th July 2015
If only we had a Comprehensive Arabic Data Resource!
Outline
• Resources Use-cases
• Related Work
• EDRAK Resource
• EDRAK Creation
• EDRAK in Numbers
• Evaluation
Resources Use-cases
• Entity Linking / Named Entity Disambiguation
Example: ميركل تطالب البرلمان الألماني بدعم خطة إنقاذ اليونان
(Merkel calls on the German parliament to support the Greece bailout plan)
Entity: Angela_Merkel
Context: ألمانيا - السياسية - انتخابات - الحزب الديمقراطي الاجتماعي – Germany – Politics – Elections ... etc.
Names: أنجيلا دوروتيا ميركل - أنغيلا - أنجيلا ميركل - المستشارة الألمانية – Angela Merkel – Merkel ... etc.
Types: Person, German_Politician, ... etc.
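The entity linking use-case can be illustrated with a minimal sketch, assuming a names dictionary that maps a mention to candidate entities and a keyphrases dictionary that supplies context for each candidate. The function `link_mention`, the entity `Other_Merkel`, and all keyphrase sets below are hypothetical and only serve the illustration; a real linker would use weighted keyphrases rather than plain token overlap.

```python
def link_mention(mention, context_tokens, names_dictionary, keyphrases_dictionary):
    """Pick the candidate entity whose keyphrases overlap the context the most."""
    context = set(context_tokens)
    best_entity, best_score = None, -1
    # Sort candidates so that ties are broken deterministically.
    for entity in sorted(names_dictionary.get(mention, ())):
        score = len(keyphrases_dictionary.get(entity, set()) & context)
        if score > best_score:
            best_entity, best_score = entity, score
    return best_entity


# Toy data: "Angela_Merkel" and the mention come from the slide;
# "Other_Merkel" and every keyphrase set are hypothetical.
names_dictionary = {"ميركل": {"Angela_Merkel", "Other_Merkel"}}
keyphrases_dictionary = {
    "Angela_Merkel": {"ألمانيا", "انتخابات", "البرلمان"},
    "Other_Merkel": {"رياضة"},
}

sentence = "ميركل تطالب البرلمان الألماني بدعم خطة إنقاذ اليونان"
entity = link_mention("ميركل", sentence.split(), names_dictionary, keyphrases_dictionary)
print(entity)
```

A mention that is missing from the names dictionary yields `None`, which is why coverage of Arabic names matters for this use-case.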
Resources Use-cases
• Entity Linking / Named Entity Disambiguation
• Dictionary-based NER
• Entity Summarization
• Question Answering
• Fine-grained Semantic Type Classifier
• ….
Existing Resources
Resource Name | Entity-Aware? | Building Source | Arabic Names Size | Context Info?
JRC-Names (Steinberger et al. 2011) | No | Wikipedia + News | 17K | No
Arabic Lexical NEs (Attia et al. 2010) | No | Wikipedia + WordNet | 45K | should
CMUQ-Arabic-NET (Azab et al. 2013) | No | Wikipedia + News | 60K | No
Google-Word-To-Concept (Spitkovsky and Chang, 2012) | Yes | Wikipedia + Web | 800K | No
BabelNet (Navigli and Ponzetto, 2012) | Yes | Wikipedia + Concepts Translation | NA | Yes
AIDArabic (Yosef et al. 2014) | Yes | Wikipedia (Eng & Ar) | 495K | Yes
EDRAK
• Entity Catalog (2.4M Entities)
• Names Dictionary
• Keyphrases Dictionary
• Weights
• Semantic Types
• Entity-Entity Similarity
EDRAK Creation
Yago3
(English & Arabic)
Culture Specific
Prominent Entities
Entity Catalog
EDRAK Creation
• Manually from the Arabic Wikipedia
• Page Titles
• Anchor Text
• Redirects
• Disambiguation Pages
Names Dictionary
EDRAK Creation Names Dictionary
• Limitation
Missing Arabic Names
Have Arabic Names
EDRAK Creation Names Dictionary
Populating Arabic names for entities that exist only in the English Wikipedia, and compiling more names for entities in the Arabic Wikipedia
[Figure: En. entity names → External Resources / Named Entity Translation / Transliteration → generated Ar. names → Names Dictionary]
EDRAK Creation
• Approach 1: External Resources
• Entity-aware: Google Word to Concept (GW2C)
• Web Hypertext Anchors to Wikipedia
• Name-Dictionaries:
• JRC-Names
• CMUQ-Arabic-NET (Azab et al. 2013)
Names Dictionary
[Figure: Web pages]
EDRAK Creation
• Approach 2: Entity Names Translation
• Statistical Machine Translation (SMT)
• Services target full text
• Named entities are mistranslated
• E.g. “Nolan North is an American actor”
• E.g. “Robert Green”
• SMT Systems do not consider types
Names Dictionary
EDRAK Creation
• Approach 2: Entity Names Translation
• Entity-Names SMT
Names Dictionary
Christian Dior → كريستيان ديور
Eric Schmidt → إريك إشميت
Christian Schmidt → ?
EDRAK Creation
• Approach 2: Entity Names Translation
• Type-Aware Entity Names SMT
• Wikipedia Cross-Language links + CMUQ-Arabic-NET
• Persons, Non-persons and Fallback models
Names Dictionary
[Figure: English entity → "Is person?" → yes: Persons Translation Model (trained on parallel PERSON names); no: Non-Persons Translation Model (trained on parallel NON-PERSON names) → pick top-k → Arabic entity names]
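The type-aware routing can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' system: `make_lookup_model` is a hypothetical word-level lookup standing in for a trained SMT model, and the rule "use the fallback model when the type-specific model returns nothing" is an assumed reading of the slide. The parallel name pairs are the ones shown on the previous slide.

```python
def make_lookup_model(parallel_names):
    """Build a toy word-level 'translation model' from (english, arabic) name pairs."""
    table = {}
    for english, arabic in parallel_names:
        en_words, ar_words = english.split(), arabic.split()
        # Only learn from pairs whose words align one-to-one.
        if len(en_words) == len(ar_words):
            for en_word, ar_word in zip(en_words, ar_words):
                table.setdefault(en_word.lower(), ar_word)

    def model(name):
        words = [table.get(word.lower()) for word in name.split()]
        # Return a ranked candidate list; empty if any word is unknown.
        return [" ".join(words)] if words and all(words) else []

    return model


def translate_name(name, is_person, person_model, non_person_model, fallback_model, k=3):
    """Route the name to the model for its type, then keep the top-k candidates."""
    primary = person_model if is_person else non_person_model
    candidates = primary(name) or fallback_model(name)
    return candidates[:k]


person_pairs = [("Christian Dior", "كريستيان ديور"), ("Eric Schmidt", "إريك إشميت")]
person_model = make_lookup_model(person_pairs)
non_person_model = make_lookup_model([])       # empty on purpose (toy data)
fallback_model = make_lookup_model(person_pairs)

print(translate_name("Christian Schmidt", True, person_model, non_person_model, fallback_model))
```

With real models, each call would return several scored hypotheses, and `k` would control how many Arabic names are added to the names dictionary per entity.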
EDRAK Creation
• Approach 3: Person Names Transliteration
• Person Names
• Unseen Names
• Capturing several Arabic possibilities
• Ex: Tony (توني / طوني)
• Transliteration as Character-Level SMT
• Training: En-Ar Person Names
Names Dictionary
A n g e l a SPACE M e r k e l → أ ن ج ي ل ا SPACE م ي ر ك ل
A l b e r t SPACE E i n s t e i n → أ ل ب ر ت SPACE أ ي ن ش ت ا ي ن
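The character-level format shown in these training examples can be produced with a small preprocessing step. The sketch below is an assumption about how such data might be prepared; the function names `to_char_tokens` and `from_char_tokens` are hypothetical. Each character becomes one token and word boundaries become an explicit `SPACE` token, so a phrase-based SMT system can treat characters as words.

```python
def to_char_tokens(name):
    """Split a name into space-separated characters, marking word breaks as SPACE."""
    tokens = []
    for index, word in enumerate(name.split()):
        if index > 0:
            tokens.append("SPACE")
        tokens.extend(word)  # a string extends the list one character at a time
    return " ".join(tokens)


def from_char_tokens(line):
    """Invert to_char_tokens: join characters and turn SPACE back into a blank."""
    return "".join(" " if token == "SPACE" else token for token in line.split())


english = to_char_tokens("Angela Merkel")
arabic = to_char_tokens("أنجيلا ميركل")
print(english)
print(arabic)
```

The same two functions work on both sides of the parallel data, and `from_char_tokens` restores a readable name from the decoder output.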
EDRAK Creation
• Manually from the Arabic Wikipedia
• In-link Page Titles
• Anchor Texts
• Categories
• Citations
Keyphrases Dictionary
EDRAK Creation
• Arabic Keyphrases Generation
Keyphrases Dictionary
[Figure: En. in-link titles → Named Entity Translation + Transliteration → Keyphrases Dictionary; En. categories → Named Entity Translation → Keyphrases Dictionary]
EDRAK in Numbers
• AIDArabic Resource (Yosef et al. 2014) vs EDRAK
[Figure: log-scale bar chart comparing AIDArabic and EDRAK on Entities Count, Unique Names, Entity-Name Entries, Unique Keyphrases, and Entity-Keyphrase Entries]
EDRAK in Numbers
• Entities per Semantic Type
• PERSON: 1,220,032 (52%)
• LOCATION: 360,108 (16%)
• ARTIFACT: 359,071 (15%)
• EVENT: 199,846 (9%)
• ORGANIZATION: 196,305 (8%)
EDRAK in Numbers
• Example
Evaluation
• Manual Assessment
• 55 Native Arabic Speakers
• Distributed over many areas
• 150 names per annotator
• Fairly distributed sample
Evaluation
• Manual Assessment
• Precision @1
[Figure: bar chart of Precision@1 (0-100) per name source: First/Last Names, Labels PERSON, Labels NON-PERSON, Redirects PERSON, Redirects NON-PERSON, Categories; series: Type-Aware, Combined, Transliterated, Categories Translation]
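Precision@1 per name source can be computed from the annotators' judgments roughly as follows. This is a minimal sketch: the function name `precision_at_1` and the toy judgments are hypothetical and are not the study's data. Each judgment records the source of a name and whether its top-ranked Arabic candidate was marked correct.

```python
from collections import defaultdict


def precision_at_1(judgments):
    """Return, per source, the fraction of names whose top candidate was judged correct."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for source, top_candidate_is_correct in judgments:
        total[source] += 1
        correct[source] += bool(top_candidate_is_correct)
    return {source: correct[source] / total[source] for source in total}


# Hypothetical judgments, for illustration only.
toy_judgments = [
    ("First/Last Names", True),
    ("First/Last Names", True),
    ("Redirects NON-PERSON", True),
    ("Redirects NON-PERSON", False),
]
scores = precision_at_1(toy_judgments)
print(scores)
```

Grouping by source is what allows the comparison across name sources that the chart reports.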
Evaluation
• Manual Assessment
• Precision varies with the source of the names.
• Highest: First/Last Names
• Lowest: Redirects (NON-PERSON)
• No real difference between Type-Aware and Combined SMT.
• Confusion in transliterated names
• Ex. Johannes, Friedrich
Conclusion
• EDRAK offers 2.4M entities with potential names, contextual keyphrases, and semantic types.
• EDRAK is not limited to the Arabic Wikipedia
• External Resources
• Type-Aware Entity Names Translation
• Person Names Transliteration
Download EDRAK
http://www.mpi-inf.mpg.de/yago-naga/aida/
Thank you!
28

More Related Content

What's hot

Event-based archival descriptions
Event-based archival descriptionsEvent-based archival descriptions
Event-based archival descriptionsAthanasios Velios
 
Managing RDF data with graph databases
Managing RDF data with graph databasesManaging RDF data with graph databases
Managing RDF data with graph databasesGraph-TA
 
Scalable Text Mining
Scalable Text MiningScalable Text Mining
Scalable Text MiningJee-Hyub Kim
 
Deriving an Emergent Relational Schema from RDF Data
Deriving an Emergent Relational Schema from RDF DataDeriving an Emergent Relational Schema from RDF Data
Deriving an Emergent Relational Schema from RDF DataGraph-TA
 
Introduction of semantic technology for SAS programmers
Introduction of semantic technology for SAS programmersIntroduction of semantic technology for SAS programmers
Introduction of semantic technology for SAS programmersKevin Lee
 
Semantic Web And Coldfusion
Semantic Web And ColdfusionSemantic Web And Coldfusion
Semantic Web And Coldfusionwilliam_greenly
 
Federated data stores using semantic web technology
Federated data stores using semantic web technologyFederated data stores using semantic web technology
Federated data stores using semantic web technologySteve Ray
 
Theory behind Image Compression and Semantic Search
Theory behind Image Compression and Semantic SearchTheory behind Image Compression and Semantic Search
Theory behind Image Compression and Semantic SearchSanti Adavani
 
Week5
Week5Week5
Week5Aethe
 
NCBO SPARQL Endpoint
NCBO SPARQL EndpointNCBO SPARQL Endpoint
NCBO SPARQL EndpointTrish Whetzel
 
Tips menggunakakan Google
Tips menggunakakan GoogleTips menggunakakan Google
Tips menggunakakan GoogleAkhmad Nasir
 
Large-Scale Semantic Search
Large-Scale Semantic SearchLarge-Scale Semantic Search
Large-Scale Semantic SearchRoi Blanco
 
IETF105 Regext RDAP JCard Profile
IETF105 Regext RDAP JCard ProfileIETF105 Regext RDAP JCard Profile
IETF105 Regext RDAP JCard ProfileAPNIC
 
Publishing and Using Linked Open Data - Day 5
Publishing and Using Linked Open Data - Day 5Publishing and Using Linked Open Data - Day 5
Publishing and Using Linked Open Data - Day 5Richard Urban
 
Big Data LDN 2017: Machine Learning on Structured Data. Why Is Learning Rules...
Big Data LDN 2017: Machine Learning on Structured Data. Why Is Learning Rules...Big Data LDN 2017: Machine Learning on Structured Data. Why Is Learning Rules...
Big Data LDN 2017: Machine Learning on Structured Data. Why Is Learning Rules...Matt Stubbs
 
PhD thesis defense: Large-scale multilingual knowledge extraction, publishin...
PhD thesis defense:  Large-scale multilingual knowledge extraction, publishin...PhD thesis defense:  Large-scale multilingual knowledge extraction, publishin...
PhD thesis defense: Large-scale multilingual knowledge extraction, publishin...Dimitris Kontokostas
 

What's hot (20)

Event-based archival descriptions
Event-based archival descriptionsEvent-based archival descriptions
Event-based archival descriptions
 
Managing RDF data with graph databases
Managing RDF data with graph databasesManaging RDF data with graph databases
Managing RDF data with graph databases
 
Scalable Text Mining
Scalable Text MiningScalable Text Mining
Scalable Text Mining
 
Digital Twin: jSON-LD, RDF
Digital Twin: jSON-LD, RDFDigital Twin: jSON-LD, RDF
Digital Twin: jSON-LD, RDF
 
Deriving an Emergent Relational Schema from RDF Data
Deriving an Emergent Relational Schema from RDF DataDeriving an Emergent Relational Schema from RDF Data
Deriving an Emergent Relational Schema from RDF Data
 
Introduction of semantic technology for SAS programmers
Introduction of semantic technology for SAS programmersIntroduction of semantic technology for SAS programmers
Introduction of semantic technology for SAS programmers
 
Semantic Web - Linked Data - RDF
Semantic Web - Linked Data - RDFSemantic Web - Linked Data - RDF
Semantic Web - Linked Data - RDF
 
Semantic Web And Coldfusion
Semantic Web And ColdfusionSemantic Web And Coldfusion
Semantic Web And Coldfusion
 
Federated data stores using semantic web technology
Federated data stores using semantic web technologyFederated data stores using semantic web technology
Federated data stores using semantic web technology
 
Theory behind Image Compression and Semantic Search
Theory behind Image Compression and Semantic SearchTheory behind Image Compression and Semantic Search
Theory behind Image Compression and Semantic Search
 
Linked Open Data
Linked Open DataLinked Open Data
Linked Open Data
 
Week5
Week5Week5
Week5
 
NCBO SPARQL Endpoint
NCBO SPARQL EndpointNCBO SPARQL Endpoint
NCBO SPARQL Endpoint
 
469 talk
469 talk469 talk
469 talk
 
Tips menggunakakan Google
Tips menggunakakan GoogleTips menggunakakan Google
Tips menggunakakan Google
 
Large-Scale Semantic Search
Large-Scale Semantic SearchLarge-Scale Semantic Search
Large-Scale Semantic Search
 
IETF105 Regext RDAP JCard Profile
IETF105 Regext RDAP JCard ProfileIETF105 Regext RDAP JCard Profile
IETF105 Regext RDAP JCard Profile
 
Publishing and Using Linked Open Data - Day 5
Publishing and Using Linked Open Data - Day 5Publishing and Using Linked Open Data - Day 5
Publishing and Using Linked Open Data - Day 5
 
Big Data LDN 2017: Machine Learning on Structured Data. Why Is Learning Rules...
Big Data LDN 2017: Machine Learning on Structured Data. Why Is Learning Rules...Big Data LDN 2017: Machine Learning on Structured Data. Why Is Learning Rules...
Big Data LDN 2017: Machine Learning on Structured Data. Why Is Learning Rules...
 
PhD thesis defense: Large-scale multilingual knowledge extraction, publishin...
PhD thesis defense:  Large-scale multilingual knowledge extraction, publishin...PhD thesis defense:  Large-scale multilingual knowledge extraction, publishin...
PhD thesis defense: Large-scale multilingual knowledge extraction, publishin...
 

Similar to EDRAK: Entity-centric Data Resource for Arabic Knowledge

10 Sourcing Tips with Ryan Gillis - SourceCon DC Webinar 8-29-19
10 Sourcing Tips with Ryan Gillis - SourceCon DC Webinar 8-29-1910 Sourcing Tips with Ryan Gillis - SourceCon DC Webinar 8-29-19
10 Sourcing Tips with Ryan Gillis - SourceCon DC Webinar 8-29-19rgillis
 
Knowledge Technologies: Opportunities and Challenges
Knowledge Technologies: Opportunities and ChallengesKnowledge Technologies: Opportunities and Challenges
Knowledge Technologies: Opportunities and ChallengesFariz Darari
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Rahul Jain
 
Identifying The Benefit of Linked Data
Identifying The Benefit of Linked DataIdentifying The Benefit of Linked Data
Identifying The Benefit of Linked DataRichard Wallis
 
Temporal Entity Random Indexing
Temporal Entity Random IndexingTemporal Entity Random Indexing
Temporal Entity Random IndexingAnnalina Caputo
 
Evolution of Search
Evolution of SearchEvolution of Search
Evolution of SearchBill Slawski
 
Schema.org - An Extending Influence
Schema.org - An Extending InfluenceSchema.org - An Extending Influence
Schema.org - An Extending InfluenceRichard Wallis
 
WTF is Semantic Web?
WTF is Semantic Web?WTF is Semantic Web?
WTF is Semantic Web?milesw
 
05. EDT 513 Week 5 2023 Searching the Internet.pptx
05. EDT 513 Week 5 2023 Searching the Internet.pptx05. EDT 513 Week 5 2023 Searching the Internet.pptx
05. EDT 513 Week 5 2023 Searching the Internet.pptxGambari Amosa Isiaka
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information RetrievalCarsten Eickhoff
 
Finding information on the Web - methodology
Finding information on the Web - methodologyFinding information on the Web - methodology
Finding information on the Web - methodologyPhilippe Scheimann
 
Search Analytics for Content Strategists
Search Analytics for Content StrategistsSearch Analytics for Content Strategists
Search Analytics for Content StrategistsLouis Rosenfeld
 
Semantic Integration with Apache Jena and Stanbol
Semantic Integration with Apache Jena and StanbolSemantic Integration with Apache Jena and Stanbol
Semantic Integration with Apache Jena and StanbolAll Things Open
 
Mining Web content for Enhanced Search
Mining Web content for Enhanced Search Mining Web content for Enhanced Search
Mining Web content for Enhanced Search Roi Blanco
 
Query Design for Digital Methods by Richard Rogers
Query Design for Digital Methods by Richard RogersQuery Design for Digital Methods by Richard Rogers
Query Design for Digital Methods by Richard RogersDigital Methods Initiative
 
Harnessing diversity in crowds and machines for better ner performance
Harnessing diversity in crowds and machines for better ner performanceHarnessing diversity in crowds and machines for better ner performance
Harnessing diversity in crowds and machines for better ner performanceoanainel
 
Internet Research Presentation
Internet Research PresentationInternet Research Presentation
Internet Research Presentationadeason
 
Sourcing Basics & Cool Tools
Sourcing Basics & Cool ToolsSourcing Basics & Cool Tools
Sourcing Basics & Cool ToolsMarianthe Verver
 

Similar to EDRAK: Entity-centric Data Resource for Arabic Knowledge (20)

K Search
K SearchK Search
K Search
 
10 Sourcing Tips with Ryan Gillis - SourceCon DC Webinar 8-29-19
10 Sourcing Tips with Ryan Gillis - SourceCon DC Webinar 8-29-1910 Sourcing Tips with Ryan Gillis - SourceCon DC Webinar 8-29-19
10 Sourcing Tips with Ryan Gillis - SourceCon DC Webinar 8-29-19
 
Knowledge Technologies: Opportunities and Challenges
Knowledge Technologies: Opportunities and ChallengesKnowledge Technologies: Opportunities and Challenges
Knowledge Technologies: Opportunities and Challenges
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )
 
Identifying The Benefit of Linked Data
Identifying The Benefit of Linked DataIdentifying The Benefit of Linked Data
Identifying The Benefit of Linked Data
 
Temporal Entity Random Indexing
Temporal Entity Random IndexingTemporal Entity Random Indexing
Temporal Entity Random Indexing
 
Evolution of Search
Evolution of SearchEvolution of Search
Evolution of Search
 
Schema.org - An Extending Influence
Schema.org - An Extending InfluenceSchema.org - An Extending Influence
Schema.org - An Extending Influence
 
WTF is Semantic Web?
WTF is Semantic Web?WTF is Semantic Web?
WTF is Semantic Web?
 
05. EDT 513 Week 5 2023 Searching the Internet.pptx
05. EDT 513 Week 5 2023 Searching the Internet.pptx05. EDT 513 Week 5 2023 Searching the Internet.pptx
05. EDT 513 Week 5 2023 Searching the Internet.pptx
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information Retrieval
 
Finding information on the Web - methodology
Finding information on the Web - methodologyFinding information on the Web - methodology
Finding information on the Web - methodology
 
Search Analytics for Content Strategists
Search Analytics for Content StrategistsSearch Analytics for Content Strategists
Search Analytics for Content Strategists
 
Search engine ppt
Search engine pptSearch engine ppt
Search engine ppt
 
Semantic Integration with Apache Jena and Stanbol
Semantic Integration with Apache Jena and StanbolSemantic Integration with Apache Jena and Stanbol
Semantic Integration with Apache Jena and Stanbol
 
Mining Web content for Enhanced Search
Mining Web content for Enhanced Search Mining Web content for Enhanced Search
Mining Web content for Enhanced Search
 
Query Design for Digital Methods by Richard Rogers
Query Design for Digital Methods by Richard RogersQuery Design for Digital Methods by Richard Rogers
Query Design for Digital Methods by Richard Rogers
 
Harnessing diversity in crowds and machines for better ner performance
Harnessing diversity in crowds and machines for better ner performanceHarnessing diversity in crowds and machines for better ner performance
Harnessing diversity in crowds and machines for better ner performance
 
Internet Research Presentation
Internet Research PresentationInternet Research Presentation
Internet Research Presentation
 
Sourcing Basics & Cool Tools
Sourcing Basics & Cool ToolsSourcing Basics & Cool Tools
Sourcing Basics & Cool Tools
 

Recently uploaded

Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfCionsystems
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
Clustering techniques data mining book ....
Clustering techniques data mining book ....Clustering techniques data mining book ....
Clustering techniques data mining book ....ShaimaaMohamedGalal
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 

Recently uploaded (20)

Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdf
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Clustering techniques data mining book ....
Clustering techniques data mining book ....Clustering techniques data mining book ....
Clustering techniques data mining book ....
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 

EDRAK: Entity-centric Data Resource for Arabic Knowledge

  • 1. EDRAK: Entity-centric Data Resource for Arabic Knowledge Mohamed H. Gad-Elrab Mohamed Amir Yosef Gerhard Weikum Max-Planck-Institut für Informatik Saarbrücken, Germany 30th July 2015
  • 2. Comprehensive Arabic Data Resource! 2 If only we have a
  • 3. Outline • Resources Use-cases • Related Work • EDRAK Resource • EDRAK Creation • EDRAK in Numbers • Evaluation 3
  • 4. Resources Use-cases • Entity Linking / Named Entity Disambiguation 4 Angela_Merkel ‫ميركل‬‫تطالب‬‫البرلمان‬‫األلماني‬‫بدعم‬‫اليونان‬ ‫إنقاذ‬ ‫خطة‬ ‫ألمانيا‬ - ‫السياسية‬ - ‫انتخابات‬ - ‫الحزب‬‫الديمقراطي‬‫االجتماعي‬ – Germany – Politics – Elections .. etc Context ‫دوروتيا‬ ‫أنجيال‬‫ميركل‬ - ‫أنغيال‬ - ‫أنجيال‬ ‫ميركل‬ ‫األلمانية‬ ‫المستشارة‬ – Angela Merkel – Merkel … etc Names Person, German_Politician, ..etc Types Merkel calls the German parliament to support the Greece bailout plan
  • 5. Resources Use-cases • Entity Linking / Named Entity Disambiguation • Dictionary-based NER • Entity Summarization • Question Answering • Fine-grained Semantic Type Classifier • …. 5
  • 6. Existing Resources Resource Name Entity- Aware? Building source Arabic Names Size Contex t info? JRC-Names (Steinberger et al. 2011) No Wikipedia + News 17K No Arabic Lexical NEs (Attia et al. 2010) No Wikipedia + WordNet 45K should CMUQ-Arabic-NET (Azab et. al. 2013) No Wikipedia + News 60K No Google-Word-To-Concept (Spitkovsky and Chang, 2012) Yes Wikipedia + Web 800K No BabelNet (Navigli and Ponzetto, 2012) Yes Wikipedia + Concepts Translation NA Yes AIDArabic (Yosef et. Al 2014) Yes Wikipedia (Eng & Ar) 495K Yes 6
  • 7. EDRAK 7 Entity Catalog (2.4M Entities) Names Dictionary Keyphrases Dictionary Weights Semantic Types Entity-Entity Similarity
  • 8. EDRAK Creation Yago3 (English & Arabic) Culture Specific Prominent Entities Entity Catalog 8
  • 9. EDRAK Creation • Manually from the Arabic Wikipedia • Page Titles • Anchor Text • Redirects • Disambiguation Pages Names Dictionary 9
  • 10. EDRAK Creation Names Dictionary 10 • Limitation Missing Arabic Names Have Arabic Names
  • 11. EDRAK Creation Names Dictionary 11 Populating Arabic names for Entities that exist only in the English Wikipedia, and compile more names for entities in the Arabic Wikipedia External Resources Named Entity Translation Transliteration En. Entity Names Generated Ar. Names Names Dictionary
  • 12. EDRAK Creation • Approach 1: External Resources • Entity-aware: Google Word to Concept (GW2C) • Web Hypertext Anchors to Wikipedia • Name-Dictionaries: • JRC-Names • CMUQ-Arabic-NET (Azab et al. 2013) Names Dictionary ------ ------ ---- ----- ---- --- --- ----- ----- - ----- ----- -------- - -------- ------ ----- -- -- ----- ---- --- -- -- ---- -------------- --- - --- ------- -------- -- ------ ------------- - ---- --- ---- --- ---- ---- --- -- --- ---- --- - ---- ----- ---- ------ ------ ---- --------- -- - --- ----- ------- ---- ----- -- ------ ---- --- ---- --- ---------- ---- ---- ----- --- --- ----- ---- -- ----- ----- -------- ---- -------- --------- -- ------ ------------- - ---- --- ---- --- ---- ---- --- -- --- ---- --- - ---- ----- ---- ------ -------- ----- ---- --- -- ---- --- -- ------ -- ---- --------- ---- --- ---- --- ----- ------ ----- ---- --- -- ------ ----- --------- ---- --- ---- --- ---- ---- ------ ------ ---- ----- ---- --- --- ----- ----- - ----- ----- ---------- - ---- --- -- ------ --- ----------- ---- --- -- ----- - Web pages 12
  • 13. EDRAK Creation: Names Dictionary • Approach 2: Entity Names Translation • Statistical Machine Translation (SMT) • Services target full text • Named entities are mistranslated • E.g. “Nolan North is an American actor” • E.g. “Robert Green” • SMT systems do not consider types
  • 14. EDRAK Creation: Names Dictionary • Approach 2: Entity Names Translation • Entity-Names SMT • Ex: Christian Dior → كريستيان ديور • Eric Schmidt → إريك إشميت • Christian Schmidt → ?
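The slide's example suggests that a system trained on name pairs can compose a translation for an unseen name from parts it has already seen. The toy sketch below shows that idea with a simple token table. It assumes one-to-one, same-order token alignment, which is a stand-in for the statistical alignment a real SMT system would learn, and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple


def learn_token_table(pairs: List[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count English-token -> Arabic-token co-occurrences by position."""
    table: Dict[str, Counter] = defaultdict(Counter)
    for english, arabic in pairs:
        en_tokens, ar_tokens = english.split(), arabic.split()
        if len(en_tokens) != len(ar_tokens):
            continue  # toy assumption: skip pairs that do not align 1:1
        for en_tok, ar_tok in zip(en_tokens, ar_tokens):
            table[en_tok][ar_tok] += 1
    return table


def translate_name(name: str, table: Dict[str, Counter]) -> Optional[str]:
    """Translate token by token; return None if any token is unseen."""
    output = []
    for token in name.split():
        if token not in table:
            return None
        output.append(table[token].most_common(1)[0][0])
    return " ".join(output)


# The two training pairs shown on the slide.
training_pairs = [
    ("Christian Dior", "كريستيان ديور"),
    ("Eric Schmidt", "إريك إشميت"),
]
token_table = learn_token_table(training_pairs)
christian_schmidt = translate_name("Christian Schmidt", token_table)
```

Here `christian_schmidt` is built from the first token of one training pair and the second token of the other.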
  • 15. EDRAK Creation: Names Dictionary • Approach 2: Entity Names Translation • Type-Aware Entity Names SMT • Training data: Wikipedia cross-language links + CMUQ-Arabic-NET • Models: Persons, Non-persons, and Fallback [Diagram: English entity → is Person? → yes: Persons Translation Model (trained on parallel PERSON names); no: Non-Persons Translation Model (trained on parallel NON-PERSON names) → pick top-k Arabic entity names]
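The routing in the diagram can be sketched as a small dispatcher. The translation models are represented as plain callables that return ranked candidate lists. The function name, the default `k`, and the rule for when the fallback model is consulted are assumptions, since the slide only names the three models.

```python
from typing import Callable, List

Model = Callable[[str], List[str]]  # returns ranked Arabic candidates


def translate_entity_name(
    name: str,
    is_person: bool,
    persons_model: Model,
    non_persons_model: Model,
    fallback_model: Model,
    k: int = 3,
) -> List[str]:
    """Route a name to the type-specific model and keep the top-k candidates."""
    primary = persons_model if is_person else non_persons_model
    candidates = primary(name)
    if not candidates:
        # Assumption: the fallback model is used only when the
        # type-specific model produces nothing.
        candidates = fallback_model(name)
    return candidates[:k]


# Toy stand-ins for trained models (simple lookups, for illustration only).
def toy_persons_model(name: str) -> List[str]:
    return {"Eric Schmidt": ["إريك إشميت"]}.get(name, [])


def toy_non_persons_model(name: str) -> List[str]:
    return {"Christian Dior": ["كريستيان ديور"]}.get(name, [])


def toy_fallback_model(name: str) -> List[str]:
    return ["<fallback>"]


person_result = translate_entity_name(
    "Eric Schmidt", True,
    toy_persons_model, toy_non_persons_model, toy_fallback_model,
)
unseen_result = translate_entity_name(
    "Nolan North", True,
    toy_persons_model, toy_non_persons_model, toy_fallback_model,
)
```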
  • 16. EDRAK Creation: Names Dictionary • Approach 3: Persons Names Transliteration • Persons Names • Unseen Names • Capturing several Arabic possibilities • Ex: Tony (توني / طوني) • Transliteration as Character-Level SMT • Training: En-Ar Persons Names • Ex: A n g e l a SPACE M e r k e l → أ ن ج ي ل ا SPACE م ي ر ك ل • A l b e r t SPACE E i n s t e i n → أ ل ب ر ت SPACE أ ي ن ش ت ا ي ن
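Treating transliteration as character-level SMT means each name is rewritten as a sequence of single-character tokens, with a marker for word boundaries, as in the examples on the slide. A minimal sketch of that preprocessing and its inverse follows. The function names are hypothetical, and the `SPACE` marker is taken from the slide.

```python
def to_char_tokens(name: str, space_token: str = "SPACE") -> str:
    """Split a name into space-separated characters, marking word boundaries."""
    tokens = []
    for index, word in enumerate(name.split()):
        if index > 0:
            tokens.append(space_token)
        tokens.extend(word)  # extends with the individual characters
    return " ".join(tokens)


def from_char_tokens(sequence: str, space_token: str = "SPACE") -> str:
    """Rebuild a name from a character-token sequence."""
    return "".join(
        " " if token == space_token else token for token in sequence.split()
    )


english_sequence = to_char_tokens("Angela Merkel")
arabic_name = from_char_tokens("أ ن ج ي ل ا SPACE م ي ر ك ل")
```

The same two functions work for both sides of a training pair, so the parallel name list can be converted before being handed to an SMT toolkit.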
  • 17. EDRAK Creation: Keyphrases Dictionary • Manually from the Arabic Wikipedia • In-link Page Titles • Anchor Texts • Categories • Citations
  • 18. EDRAK Creation: Keyphrases Dictionary • Arabic Keyphrases Generation [Diagram: En. In-link Titles → Named Entity Translation, Transliteration → Keyphrases Dictionary; En. Categories → Named Entity Translation → Keyphrases Dictionary]
  • 19. EDRAK in Numbers • AIDArabic Resource (Yosef et al. 2014) vs EDRAK [Bar chart, log scale from 1 to 1E+09; groups: Entities Count, Unique Names, Entity-Name Entries, Unique Keyphrases, Entity-Keyphrase Entries; series: AIDArabic, EDRAK]
  • 20. EDRAK in Numbers • Entities per Semantic Type [Pie chart; slices: 1,220,032 (52%), 360,108 (16%), 359,071 (15%), 199,846 (9%), 196,305 (8%); legend: PERSON, LOCATION, ARTIFACT, EVENT, ORGANIZATION]
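The shares can be recomputed from the slice counts as a quick check. The sketch below uses only the five counts from the slide and deliberately does not assign them to semantic types, because the transcript does not preserve which slice belongs to which legend entry. Exact shares may differ by a point from the rounded labels on the chart.

```python
# Slice counts as they appear on the slide, in transcript order.
slice_counts = [1_220_032, 360_108, 359_071, 199_846, 196_305]

total_entities = sum(slice_counts)

# Share of each slice, in percent of the total.
slice_shares = [100 * count / total_entities for count in slice_counts]

for count, share in zip(slice_counts, slice_shares):
    print(f"{count:>9,}  {share:5.1f}%")
```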
  • 21. EDRAK in Numbers • Example
  • 22. EDRAK in Numbers • Example
  • 23. Evaluation • Manual Assessment • 55 native Arabic speakers • Distributed over many areas • 150 names per annotator • Fairly distributed sample
  • 24. Evaluation • Manual Assessment • Precision@1 [Bar chart, axis 0 to 100; labels: First/Last Names, labels PERSON, labels NON-PERSON, Redirects PERSON, Redirects NON-PERSON, Categories, Type-Aware, Combined, Transliterated, Categories Translation]
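Precision@1 is the fraction of evaluated items whose top-ranked candidate was judged correct. A minimal sketch of the computation follows, assuming each item comes with annotator judgments in rank order. The function name, the data layout, and the toy judgments are hypothetical and are not the study's data.

```python
from typing import List


def precision_at_1(ranked_judgments: List[List[bool]]) -> float:
    """Percentage of items whose first-ranked candidate was judged correct.

    Each inner list holds one item's judgments in rank order.
    Items with no candidates are skipped.
    """
    evaluated = [item for item in ranked_judgments if item]
    if not evaluated:
        return 0.0
    correct_at_top = sum(1 for item in evaluated if item[0])
    return 100.0 * correct_at_top / len(evaluated)


# Toy judgments for four names: three have a correct top candidate.
toy_judgments = [
    [True, False],
    [True],
    [False, True],
    [True, True, False],
]
toy_precision = precision_at_1(toy_judgments)
```

The third item shows why the metric is strict: a correct candidate at rank 2 does not count.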
  • 25. Evaluation • Manual Assessment • Precision varies with the source of the names. • Highest: First/Last Names • Lowest: Redirects (NON-PERSON) • No real difference between Type-Aware and Combined SMT. • Confusion over transliterated names • Ex. Johannes, Friedrich
  • 26. Conclusion • EDRAK offers 2.4M entities with potential names, contextual keyphrases, and semantic types. • EDRAK is not limited to the Arabic Wikipedia: • External Resources • Type-Aware Entity Names Translation • Person Names Transliteration