EDRAK is an entity-centric data resource for Arabic knowledge created by researchers at the Max Planck Institute for Informatics. It contains over 2.4 million entities with Arabic names, keyphrases and semantic types. The researchers built EDRAK by extracting data from several sources, including the Arabic and English Wikipedias. They also generated additional Arabic names through type-aware entity name translation and transliteration of person names. A manual evaluation found that precision varies with the source of a name: it was highest for first/last names and lowest for names taken from redirects of non-person entities. EDRAK is intended to support tasks such as entity linking and question answering in Arabic.
EDRAK: Entity-centric Data Resource for Arabic Knowledge
1. EDRAK: Entity-centric Data Resource for Arabic Knowledge
Mohamed H. Gad-Elrab Mohamed Amir Yosef Gerhard Weikum
Max-Planck-Institut für Informatik
Saarbrücken, Germany
30th July 2015
4. Resources Use-cases
• Entity Linking / Named Entity Disambiguation
• Example: ميركل تطالب البرلمان الألماني بدعم خطة إنقاذ اليونان
  (Merkel calls on the German parliament to support the Greece bailout plan)
• Entity: Angela_Merkel
• Context: ألمانيا - السياسية - انتخابات - الحزب الديمقراطي الاجتماعي - Germany - Politics - Elections, etc.
• Names: أنجيلا دوروتيا ميركل - أنغيلا - أنجيلا ميركل - المستشارة الألمانية - Angela Merkel - Merkel, etc.
• Types: Person, German_Politician, etc.
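The Names, Context and Types shown for Angela_Merkel can be pictured as one record per entity, with a name index on top for candidate lookup during entity linking. The sketch below is only an illustration: `EntityRecord` and `build_name_index` are hypothetical names, and EDRAK's actual storage format is not described on these slides.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class EntityRecord:
    """Hypothetical in-memory view of one entity entry (not EDRAK's real schema)."""
    entity_id: str
    names: list = field(default_factory=list)
    keyphrases: list = field(default_factory=list)
    types: list = field(default_factory=list)


def build_name_index(records):
    """Map each surface name to the ids of the entities that carry it."""
    index = defaultdict(list)
    for record in records:
        for name in record.names:
            index[name].append(record.entity_id)
    return index


# Values copied from the slide example above.
merkel = EntityRecord(
    entity_id="Angela_Merkel",
    names=["ميركل", "أنجيلا ميركل", "Angela Merkel", "Merkel"],
    keyphrases=["ألمانيا", "انتخابات", "Germany", "Politics", "Elections"],
    types=["Person", "German_Politician"],
)

name_index = build_name_index([merkel])

# A mention found in text is looked up to obtain candidate entities.
candidates = name_index.get("ميركل", [])
print(candidates)
```

A real linker would then rank the candidates, for example by comparing the mention's surrounding words with each candidate's keyphrases.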
5. Resources Use-cases
• Entity Linking / Named Entity Disambiguation
• Dictionary-based NER
• Entity Summarization
• Question Answering
• Fine-grained Semantic Type Classifier
• ….
6. Existing Resources
Resource Name | Entity-Aware? | Building source | Arabic Names Size | Context info?
JRC-Names (Steinberger et al. 2011) | No | Wikipedia + News | 17K | No
Arabic Lexical NEs (Attia et al. 2010) | No | Wikipedia + WordNet | 45K | should
CMUQ-Arabic-NET (Azab et al. 2013) | No | Wikipedia + News | 60K | No
Google-Word-To-Concept (Spitkovsky and Chang, 2012) | Yes | Wikipedia + Web | 800K | No
BabelNet (Navigli and Ponzetto, 2012) | Yes | Wikipedia + Concepts Translation | NA | Yes
AIDArabic (Yosef et al. 2014) | Yes | Wikipedia (Eng & Ar) | 495K | Yes
11. EDRAK Creation: Names Dictionary
Populate Arabic names for entities that exist only in the English Wikipedia, and compile more names for entities in the Arabic Wikipedia
• External Resources
• Named Entity Translation
• Transliteration
[Diagram: En. Entity Names → External Resources / Named Entity Translation / Transliteration → Generated Ar. Names → Names Dictionary]
13. EDRAK Creation: Names Dictionary
• Approach 2: Entity Names Translation
• Statistical Machine Translation (SMT)
• Services target full text
• Named entities are mistranslated
• E.g. “Nolan North is an American actor”
• E.g. “Robert Green”
• SMT systems do not consider types
14. EDRAK Creation: Names Dictionary
• Approach 2: Entity Names Translation
• Entity-Names SMT
• Christian Dior → كريستيان ديور
• Eric Schmidt → إريك إشميت
• Christian Schmidt → ?
15. EDRAK Creation: Names Dictionary
• Approach 2: Entity Names Translation
• Type-Aware Entity Names SMT
• Wikipedia cross-language links + CMUQ-Arabic-NET
• Persons, Non-persons and Fallback
[Diagram: English entity → "Is person?" → yes: Persons translation model (trained on parallel PERSON names); no: Non-persons translation model (trained on parallel NON-PERSON names) → pick top-k → Arabic entity name]
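The routing in the diagram can be sketched as follows. This is a minimal illustration, not the authors' implementation: the three "models" are stand-in lookup tables rather than trained SMT systems, `translate_name` is a hypothetical helper, and using the fallback model only when the type-specific model returns nothing is an assumption about how the fallback is applied.

```python
# Toy stand-ins for trained translation models: each maps an English name
# to a ranked list of Arabic candidates. Real models would be SMT systems
# trained on parallel PERSON / NON-PERSON name pairs.
PERSON_TABLE = {"Eric Schmidt": ["إريك إشميت"]}
NON_PERSON_TABLE = {"Germany": ["ألمانيا"]}
FALLBACK_TABLE = {"Christian Dior": ["كريستيان ديور"]}


def person_model(name):
    return PERSON_TABLE.get(name, [])


def non_person_model(name):
    return NON_PERSON_TABLE.get(name, [])


def fallback_model(name):
    return FALLBACK_TABLE.get(name, [])


def translate_name(english_name, is_person, k=1):
    """Route the name to the model matching its type and keep the top-k results."""
    primary = person_model if is_person else non_person_model
    candidates = primary(english_name)
    if not candidates:
        # Assumption: the fallback model is consulted only when the
        # type-specific model produces no candidate.
        candidates = fallback_model(english_name)
    return candidates[:k]


print(translate_name("Eric Schmidt", is_person=True))
print(translate_name("Germany", is_person=False))
print(translate_name("Christian Dior", is_person=True))
```

Keeping the person and non-person models separate means a person name is never translated word by word as if it were a common phrase.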
16. EDRAK Creation: Names Dictionary
• Approach 3: Persons Names Transliteration
• Persons Names
• Unseen Names
• Capturing several Arabic possibilities
• Ex: Tony (توني / طوني)
• Transliteration as Character-Level SMT
• Training: En-Ar Persons Names
• A n g e l a SPACE M e r k e l → أنجيلا SPACE ميركل
• A l b e r t SPACE E i n s t e i n → ألبرت SPACE أينشتاين
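Treating transliteration as character-level SMT requires rewriting each name so that characters play the role of words and the original word boundary becomes an explicit SPACE token, as in the two training examples above. A minimal sketch of that preprocessing, with hypothetical helper names `to_char_tokens` and `from_char_tokens`:

```python
def to_char_tokens(name, space_token="SPACE"):
    """Split a name into space-separated characters, marking word boundaries."""
    tokens = []
    for word_index, word in enumerate(name.split()):
        if word_index > 0:
            tokens.append(space_token)
        tokens.extend(word)  # one token per character
    return " ".join(tokens)


def from_char_tokens(tokenized, space_token="SPACE"):
    """Invert to_char_tokens: rejoin characters into words."""
    words = tokenized.split(f" {space_token} ")
    return " ".join(word.replace(" ", "") for word in words)


print(to_char_tokens("Angela Merkel"))
# prints: A n g e l a SPACE M e r k e l
print(from_char_tokens("A l b e r t SPACE E i n s t e i n"))
# prints: Albert Einstein
```

The same function applies to the Arabic side of each training pair, and `from_char_tokens` turns the decoder's character output back into a readable name.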
17. EDRAK Creation: Keyphrases Dictionary
• Manually from the Arabic Wikipedia
• In-link Pages Titles
• Anchor Texts
• Categories
• Citations
23. Evaluation
• Manual Assessment
• 55 Native Arabic Speakers
• Distributed over many areas
• 150 names per annotator
• Fairly distributed sample
24. Evaluation
• Manual Assessment
• Precision @1
[Bar chart: Precision@1 (0 to 100) by name source: First/Last Names, Labels PERSON, Labels NON-PERSON, Redirects PERSON, Redirects NON-PERSON, Categories, Type-Aware, Combined, Transliterated, Categories Translation]
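Precision@1 per source can be computed from the annotators' judgments as the share of top-ranked names judged correct. The sketch below assumes each judgment is a (source, is_correct) pair; the helper name `precision_at_1` and the toy judgments are illustrative only and are not EDRAK results.

```python
from collections import defaultdict


def precision_at_1(judgments):
    """Return, per source, the percentage of top-ranked names judged correct."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for source, is_correct in judgments:
        total[source] += 1
        if is_correct:
            correct[source] += 1
    return {source: 100.0 * correct[source] / total[source] for source in total}


# Made-up judgments for two placeholder sources (not real evaluation data).
toy_judgments = [
    ("source_a", True),
    ("source_a", False),
    ("source_b", True),
]
print(precision_at_1(toy_judgments))
```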
25. Evaluation
• Manual Assessment
• Precision varies with the source of the names.
• Highest: First/Last Names
• Lowest: Redirects (NON-PERSON)
• No real difference between Type-Aware and Combined SMT.
• Transliterated names cause confusion
• Ex. Johannes, Friedrich
26. Conclusion
• EDRAK offers 2.4M entities with potential names, contextual keyphrases and semantic types.
• EDRAK is not limited to the Arabic Wikipedia
• External Resources
• Type-Aware Entity Names Translation
• Person Names Transliteration