Arabic syntactic parsing

By Amena Helmy

POS tagging
• Gives information about the individual words
Example: قرأ محمد كتابا كبيرا ("Muhammad read a big book")
• verb (فعل): قرأ
• proper noun (اسم علم): محمد
• noun (اسم): كتابا
• adjective (صفة): كبيرا

Syntactic parsing
• Looks at the whole sentence
• Gives the overall structure of each sentence (a tree)
• Shows the way the words relate to each other
Example: قرأ (VBD) محمد (NNP) كتابا (NN) كبيرا (JJ)
Relations shown between the words: SBJ, OBJ, amod

Constituency parsing
• Words are organized into constituents
• Constituents are groups of words that can act as single units
Example: قرأ (VBD) محمد (NNP) كتابا (NN) كبيرا (JJ)
[Tree figure: the tagged words are grouped into two NPs and a VP under the sentence node S]

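Constituency tests
• Substitution: a constituent can be replaced by a single word such as a pronoun
  لا أعلم شيئا عن الرجل غريب الأطوار ذو النظارة السوداء ("I know nothing about the strange man with the black glasses")
  لا أعلم عنه شيئا ("I know nothing about him")
• Conjunction: a constituent can be conjoined with another constituent, e.g. by adding وصديقه ("and his friend")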

Head of constituents (phrases)
Constituents are usually named as phrases after the word that heads them (the most important word):
• VP, head يكون: قد يكون أمرا هاما ("it may be an important matter")
• NP, head الرجل: الرجل الذي اشترى البيت ("the man who bought the house")
• PP, head على: على ضفتي النهر ("on the two banks of the river")

Dependency structure
• All nodes are words.
• Relations between words are shown through directed arcs that go from the head to the dependent.
• The type of relation is shown through arc labels.
Example: قرأ (VBD) محمد (NNP) كتابا (NN) كبيرا (JJ)
• SBJ: قرأ → محمد
• OBJ: قرأ → كتابا
• amod: كتابا → كبيرا

Parsing importance
• Parsing is the task of uncovering the syntactic structure of language. It is often viewed as an important prerequisite for building systems capable of understanding language.
• Syntactic structure is a necessary first step for some NLP tasks.
[Figure labels: English, Chinese]

1. Real sentences are long
• Example: a single sentence of more than 200 tokens, about regulating irrigation and protecting the Damour River basin:
و خلص الى ان جدول اعمال اللقاء الحالي يتضمن : العمل على تنظيم الري في شكل مدروس يهدف الي الافادة الى اقصى حد من الثروة المائية التي يوفر ها نهر الدامور لري كل الاراضي الزراعية التي تنتفع بالري من مياهه , علما ان الطول الاجمالي للمجاري الدائمة للنهر هو حوالى 76 كيلومترا و متوسط صبيبها نحو 4.71 امتار مكعبة بالثانية و أن عدد ايام الهطول التقريبي هو 70 يوما في السنة و بالتالي فإن عملية حسابية بسيطة تظهر ان متوسط كمية المياه التي يوفرها النهر هي 28 مليون متر مكعب من المياه سنويا , الامر الذي يظهر اهمية هذا النهر و مقدار الطاقة المائية التي يمكن الافادة منها , اضافة الى حل مشكلة المؤسسات السياحية التي انشئت من دون رخص قانونية و إعداد لوائح بالتعديات و العمل على إزالتها و منع استخدام الكهرباء بهدف الصيد في مياه النهر محافظة على الاسماك التي تتوالد و تتكاثر فيه و منع اقامة شبكات صرف صحي تلوث مجرى النهر , العمل على معالجة مياه المصانع على اختلافها قبل ان تصب في النهر ، و توعية المزارعين على موضوع المبيدات الزراعية و معالجة مشكلة البناء العشوائي للحد منه على ضفتي النهر و وضع تخطيط عمراني جديد للحوض بالتنسيق مع مديرية التنظيم المدني ووزارة البيئة , وأخيرا اقامة اتفاق حسن جوار بين كل البلديات التي يشملها حوض نهر الدامور تكون بمثابة لجنة متابعة دائمة بإشراف سعادة القائمقام بغية العمل معا على توفير الحماية اللازمة لهذا الشريان المائي المهم و المحافظة على حقوق المنتفعين بمياهه " .

2. Ambiguous sentences
• "I saw the man with binoculars": either I used the binoculars to see the man, or the man I saw had the binoculars.

Arabic complex morphology
• Analyzing such input morphologically is not easy, but it must be done correctly before moving on to the next step, which is processing the input text syntactically.
• Example: وستشاهدونها ("and you will watch it"), a single word made of و (and) + س (will) + تشاهدون (you [plural] watch) + ها (it).

Arabic is a pro-drop language
• The subject of a verb may be implicitly encoded in the verb morphology.

Free word order
Language   Order   Example
English    SVO     The boy ate the food.
Arabic     SVO     الولد أكل الطعام
Arabic     VSO     أكل الولد الطعام

The parsing process
[Diagram: sentences and a grammar are fed to the PARSER, which outputs a constituency or a dependency structure]

Constituency grammar
Phrase structure grammar
Context free grammar (CFG)

Context-free grammar
• A context-free grammar consists of a set of rules or productions, each expressing the ways the symbols of the language can be grouped together, and a lexicon of words.

1. Σ: a set of terminal symbols (words)
يذهب ("goes"), الطالب ("the student"), إلى ("to"), المدرسة ("the school"), صباحا ("in the morning")

2. N: a set of non-terminal symbols.

3. S: a designated start symbol.

4. R: a set of rules or productions, each of the form (A → β)
Contextual rules
S → VP
VP → V NP PP NP
NP → DN
NP → N
PP → P NP

Lexical rules
N → صباحا
DN → الطالب
DN → المدرسة
V → يذهب
P → إلى
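The grammar above can be tried out with a small top-down parser. The following is a minimal sketch, not the parser used by the author: the names `RULES`, `LEXICON`, `parse` and `parse_sentence` are illustrative, and the preposition is given the category P (an assumption) so that the rule PP → P NP can apply. It only works for grammars without left recursion, which holds for this one.

```python
from typing import Dict, Iterator, List, Tuple

# Contextual rules from the slide: left-hand side -> list of right-hand sides.
RULES: Dict[str, List[Tuple[str, ...]]] = {
    "S": [("VP",)],
    "VP": [("V", "NP", "PP", "NP")],
    "NP": [("DN",), ("N",)],
    "PP": [("P", "NP")],
}

# Lexical rules: category -> words. "P" for the preposition is an assumption.
LEXICON: Dict[str, List[str]] = {
    "N": ["صباحا"],
    "DN": ["الطالب", "المدرسة"],
    "V": ["يذهب"],
    "P": ["إلى"],
}


def parse(symbol: str, tokens: List[str], start: int) -> Iterator[tuple]:
    """Yield (tree, next_position) for each way `symbol` can start at `start`."""
    if symbol in LEXICON:
        # Lexical category: consume exactly one matching word.
        if start < len(tokens) and tokens[start] in LEXICON[symbol]:
            yield (symbol, tokens[start]), start + 1
        return
    for rhs in RULES.get(symbol, []):
        # Expand the right-hand side left to right, keeping every partial analysis.
        partial = [([], start)]
        for child in rhs:
            partial = [
                (kids + [tree], nxt)
                for kids, pos in partial
                for tree, nxt in parse(child, tokens, pos)
            ]
        for kids, end in partial:
            yield (symbol, *kids), end


def parse_sentence(tokens: List[str]) -> List[tuple]:
    """Return every tree rooted in S that covers the whole token list."""
    return [tree for tree, end in parse("S", tokens, 0) if end == len(tokens)]


# "The student goes to the school in the morning", in verb-first order.
SENTENCE = ["يذهب", "الطالب", "إلى", "المدرسة", "صباحا"]
TREES = parse_sentence(SENTENCE)
```

With these rules the example sentence receives exactly one analysis, while a fragment such as يذهب الطالب receives none, which illustrates the coverage problem of small hand-written grammars.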

Classical NLP parsing (pre-1990 era)
• Approach: build a CFG grammar for each language
• Parsers had poor coverage
• Even quite simple sentences had many possible analyses

Solutions
• Rules with high coverage
• Mechanisms that allow us to find the most likely parse(s)

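Use annotated data
• The appearance of treebanks
• Building a treebank seems a lot slower and less useful than building a grammar, but it supplies annotated data from which rule probabilities can be estimated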

Statistical parsing
• Contextual rules
S → VP [0.4]
S → VP NP [0.6]
VP → V NP PP NP [0.3]
VP → V NP [0.7]
NP → DN [0.5]
NP → N [0.5]
PP → P NP [1.0]
• Lexical rules
N → صباحا [0.4]
DN → الطالب [0.3]
DN → المدرسة [0.3]
V → يذهب [1.0]
P → إلى [1.0]
P is the set of probabilities associated with the rules, P(A → β).
We need mechanisms that allow us to find the most likely parse(s).

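How to calculate the probability of rules
• $P(X \mid Y) = \frac{\mathrm{Count}(X, Y)}{\mathrm{Count}(Y)}$
Example counts: VP → V occurs 113 times and VP → V NP occurs 70 times, so Count(VP) = 183.
$P(VP \to V) = \frac{\mathrm{Count}(VP \to V)}{\mathrm{Count}(VP)} = \frac{113}{183} \approx 0.6$
$P(VP \to V\ NP) = \frac{\mathrm{Count}(VP \to V\ NP)}{\mathrm{Count}(VP)} = \frac{70}{183} \approx 0.4$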
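The same relative-frequency estimate can be written as a short function. This is a sketch under the assumption that rule counts have already been collected (for example from a treebank); the function name `estimate_rule_probabilities` and the dictionary layout are illustrative, and only the two counts given on the slide are used.

```python
from collections import defaultdict
from typing import Dict, Tuple

Rule = Tuple[str, Tuple[str, ...]]  # (left-hand side, right-hand side)


def estimate_rule_probabilities(rule_counts: Dict[Rule, int]) -> Dict[Rule, float]:
    """Divide each rule count by the total count of its left-hand side."""
    lhs_totals: Dict[str, int] = defaultdict(int)
    for (lhs, _rhs), count in rule_counts.items():
        lhs_totals[lhs] += count
    return {
        rule: count / lhs_totals[rule[0]]
        for rule, count in rule_counts.items()
    }


# Counts from the slide: VP -> V seen 113 times, VP -> V NP seen 70 times.
COUNTS: Dict[Rule, int] = {
    ("VP", ("V",)): 113,
    ("VP", ("V", "NP")): 70,
}
PROBS = estimate_rule_probabilities(COUNTS)
```

Rounded to one decimal place, the two estimates are the 0.6 and 0.4 shown on the slide, and the probabilities of all VP rules sum to 1.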

Probabilistic CFG
• P is the set of probabilities associated with the rules, P(A → β).
• Adding P to a CFG gives a PCFG.

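Ambiguous sentences
• لعب عمر وابن جارهم المؤدب ("Omar and the son of their polite neighbour played"): the adjective المؤدب ("polite") can describe either the son or the neighbour.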

How to calculate the probability of a tree
• The probability of an entire tree is the product of the probabilities of the individual rules used to build it.
[Figure: parse tree T1 with each node annotated with its rule probability]
P(T1) = P(S → VP NP) * P(VP → V) * P(V → لعب) *
P(NP → NP CONJ NP) * P(NP → DET+NN) *
P(DET+NN → الطفل) * P(CONJ → و) *
P(NP → NN ADJP) * P(NN → ابن) *
P(ADJP → NP DET+ADJ) * P(NP → NN PRON) *
P(NN → جار) * P(PRON → هم) *
P(DET+ADJ → المؤدب)
P(T1) = 1.0 * 1.0 * 1.0 * 0.1 * 0.3 * 1.0 * 1.0 * 0.1 * 0.6 * 0.6 * 0.1 * 0.4 * 1.0 * 1.0 = 0.0000432

Cont.
[Figure: parse tree T2 with each node annotated with its rule probability]
P(T2) = P(S → VP NP) * P(VP → V) * P(V → لعب) *
P(NP → NP CONJ NP) * P(NP → DET+NN) *
P(DET+NN → الطفل) * P(CONJ → و) *
P(NP → NN ADJP) * P(NN → ابن) *
P(ADJP → NP DET+ADJ) * P(NP → NN PRON) *
P(NN → جار) * P(PRON → هم) *
P(DET+ADJ → المؤدب)
P(T2) = 1.0 * 1.0 * 1.0 * 0.3 * 0.3 * 1.0 * 1.0 * 0.6 * 0.1 * 0.6 * 0.1 * 0.4 * 1.0 * 1.0 = 0.0001296
P(T1) = 0.0000432 < P(T2) = 0.0001296, so T2 is the more likely parse.
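The two products can be checked with a few lines of Python. This is a minimal sketch: the factor lists are copied from the numeric products on the slides, in the order they are multiplied, and the names `tree_probability` and `most_likely` are illustrative.

```python
import math
from typing import Dict, List

# Rule probabilities for each tree, in the order used in the slide's products.
T1_FACTORS: List[float] = [1.0, 1.0, 1.0, 0.1, 0.3, 1.0, 1.0,
                           0.1, 0.6, 0.6, 0.1, 0.4, 1.0, 1.0]
T2_FACTORS: List[float] = [1.0, 1.0, 1.0, 0.3, 0.3, 1.0, 1.0,
                           0.6, 0.1, 0.6, 0.1, 0.4, 1.0, 1.0]


def tree_probability(rule_probs: List[float]) -> float:
    """Probability of a tree: the product of the probabilities of its rules."""
    return math.prod(rule_probs)


def most_likely(trees: Dict[str, List[float]]) -> str:
    """Return the name of the tree with the highest probability."""
    return max(trees, key=lambda name: tree_probability(trees[name]))


TREES = {"T1": T1_FACTORS, "T2": T2_FACTORS}
BEST = most_likely(TREES)
```

For long sentences the product becomes very small, so implementations commonly add log probabilities instead of multiplying raw probabilities; the ranking of the trees stays the same.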

The Probabilistic Context-Free Grammar (PCFG), or Stochastic Context-Free Grammar (SCFG)
• A context-free grammar G is defined by four parameters (N, Σ, R, S). A probabilistic context-free grammar is defined by the same four parameters, with a slight augmentation to each of the rules in R:
• N: a set of non-terminal symbols.
• Σ: a set of terminal symbols.
• R: a set of rules or productions, each of the form (A → β).
• S: a designated start symbol.
• P: the set of probabilities associated with the rules, P(A → β); the probabilities of all rules that share the same left-hand side A sum to 1.
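One simple way to represent P is a mapping from rules to probabilities, together with a check that the rules of each left-hand side sum to 1. The sketch below is illustrative only: it uses the contextual rule probabilities from the statistical parsing slide, and the names `PCFG`, `lhs_totals` and `is_proper` are assumptions rather than part of any library.

```python
import math
from collections import defaultdict
from typing import Dict, Tuple

Rule = Tuple[str, Tuple[str, ...]]  # (left-hand side A, right-hand side beta)

# Contextual rules and their probabilities from the statistical parsing slide.
PCFG: Dict[Rule, float] = {
    ("S", ("VP",)): 0.4,
    ("S", ("VP", "NP")): 0.6,
    ("VP", ("V", "NP", "PP", "NP")): 0.3,
    ("VP", ("V", "NP")): 0.7,
    ("NP", ("DN",)): 0.5,
    ("NP", ("N",)): 0.5,
    ("PP", ("P", "NP")): 1.0,
}


def lhs_totals(pcfg: Dict[Rule, float]) -> Dict[str, float]:
    """Sum the rule probabilities for every left-hand side symbol."""
    totals: Dict[str, float] = defaultdict(float)
    for (lhs, _rhs), prob in pcfg.items():
        totals[lhs] += prob
    return dict(totals)


def is_proper(pcfg: Dict[Rule, float], tol: float = 1e-9) -> bool:
    """True if the probabilities of each left-hand side sum to 1 (within tol)."""
    return all(
        math.isclose(total, 1.0, abs_tol=tol)
        for total in lhs_totals(pcfg).values()
    )


TOTALS = lhs_totals(PCFG)
PROPER = is_proper(PCFG)
```

A check like this is useful after estimating probabilities from counts, since a missing or mistyped rule shows up as a left-hand side whose total differs from 1.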
