[DCSB] Torsten Roeder (BBAW), Yury Arzhanov (Ruhr-Universität Bochum) "The Glossarium Graeco-Arabicum. Linguistic Research and Database Design in Polyalphabetic Environments"

  1. The Glossarium Graeco-Arabicum: Linguistic Research and Database Design in Polyalphabetic Environments. Torsten Roeder (BBAW), Yury Arzhanov (Ruhr-Universität Bochum)
  2. Ms. Paris BnF 5847, f. 5: Muslim scholars in discussion.
  3. Arabic translation of Dioscurides’ Materia medica (Ibn alalal-adwiya wa-l-aghdhiya, 1–4. 1291 H.)
  4. Filecards for the Greek and Arabic Lexicon (GALex)
  5. GALex
  6. The Database Glossarium Græco-Arabicum
  7. The Glossarium Graeco-Arabicum makes available information in the following fields of research: • the vocabulary and syntax of Classical and Middle Arabic; • the development of a scientific and technical vocabulary in Arabic; • the vocabulary of Classical and Middle Greek; • the chronology and nature of the translation movement into Arabic; • the establishment of the texts of Greek works and their Arabic translations.
  8. The Glossarium Graeco-Arabicum online:
  9. November 2013 Telota Glossarium Graeco-Arabicum BERLIN-BRANDENBURGISCHE AKADEMIE DER WISSENSCHAFTEN
  10. OUTLINE. I Technical Challenges → polyalphabetic environment; II Scholarly Requirements → linguistic database; III Technical vs. Scholarly → concluding discussion
  11. I. TECHNICAL CHALLENGES. 1 Languages Used in the GlossGA Interface; 2 Unicode Character Corpus; 3 Areas of Technical Challenges; 4 Examples
  12. I.1. LANGUAGES. Languages used within the project: Ancient Greek (Greek alphabet, 3 layers of diacritics, LTR, i.e. left to right); Medieval Arabic (Arabic alphabet, optional vowel signs, RTL, i.e. right to left); Modern English (Latin alphabet, 1 layer of diacritics, LTR)
  13. I.2. UNICODE. Unicode charts used (name, range: description): C0 Controls and Basic Latin, 0000-007F: Latin alphabet; Latin Extended-A, 0100-017F: transliteration symbols; Latin Extended Additional, 1E00-1EFF: transliteration symbols; Greek and Coptic, 0370-03FF: Greek alphabet; Greek Extended, 1F00-1FFF: Greek diacritics; Arabic, 0600-06FF: Arabic alphabet; Arabic Supplement, 0750-077F: Arabic alphabet; Spacing Modifier Letters, 02B0-02FF: special Arabic characters → in total: about 450 different characters from eight different charts
  14. I.3. REQUIREMENTS. 1. Data input in all three alphabets with all vowels and diacritics → How to implement a comfortable interface? 2. Simultaneous display of texts in three alphabets and two directions → How to implement concurrent writing directions? 3. Search for terms, insensitive to diacritics or vowels → How to implement queries with different collation sets?
  15. I.4. EXAMPLES. a Data Input; b Writing Directions; c Search; d Search Terms
  16. I.4.a. DATA INPUT. ʾ˒ʿ˓
  17. I.4.a. DATA INPUT. [ʾ] U+02BE MODIFIER LETTER RIGHT HALF RING: transliteration of Arabic hamza; [˒] U+02D2 MODIFIER LETTER CENTRED RIGHT HALF RING: more rounded articulation; [ʿ] U+02BF MODIFIER LETTER LEFT HALF RING: transliteration of Arabic ain; [˓] U+02D3 MODIFIER LETTER CENTRED LEFT HALF RING: less rounded articulation
  18. I.4.a. DATA INPUT. Problem: Appearance vs. Encoding. Users normally choose characters not because of their Unicode description but because of their appearance. How to bring Unicode to the user?
  19. I.4.a. DATA INPUT. Solutions: – restrict the characters accepted by the database → safe, but requires validation methods; – provide a virtual (on-screen) keyboard → user-friendly. Alternative method: – Beta Code → less recommendable from a Unicode point of view, but widely used
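The first solution above (restricting the accepted characters) could be sketched as follows. This is a minimal illustration, not the project's actual validation code: it assumes the whitelist is simply the eight Unicode blocks named on the "I.2. UNICODE" slide, whereas the project corpus of about 450 characters would be a narrower, character-by-character list. The names `ALLOWED_RANGES` and `rejected_characters` are hypothetical.

```python
import unicodedata

# Assumption: whole blocks from the "I.2. UNICODE" slide stand in for the
# project's real character corpus, which is smaller (about 450 characters).
ALLOWED_RANGES = [
    (0x0000, 0x007F),  # C0 Controls and Basic Latin
    (0x0100, 0x017F),  # Latin Extended-A
    (0x1E00, 0x1EFF),  # Latin Extended Additional
    (0x0370, 0x03FF),  # Greek and Coptic
    (0x1F00, 0x1FFF),  # Greek Extended
    (0x0600, 0x06FF),  # Arabic
    (0x0750, 0x077F),  # Arabic Supplement
    (0x02B0, 0x02FF),  # Spacing Modifier Letters
]


def rejected_characters(text):
    """Return (character, code point, Unicode name) for each character
    that lies outside every allowed range."""
    rejected = []
    for ch in text:
        cp = ord(ch)
        if not any(low <= cp <= high for low, high in ALLOWED_RANGES):
            rejected.append((ch, f"U+{cp:04X}", unicodedata.name(ch, "UNNAMED")))
    return rejected


if __name__ == "__main__":
    # A typographic apostrophe (U+2019) looks similar to the hamza sign
    # U+02BE but is a different character, so it is reported.
    print(rejected_characters("\u2019amr"))
    print(rejected_characters("\u02beamr"))
```

A block-level check like this still lets through look-alikes from the same block (U+02D2 instead of U+02BE, for example), which is why a whitelist of individual characters would be the stricter choice.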
  20. I.4.b. WRITING DIRECTIONS. Phenomenon: the breadcrumb line Home (THEN) Arabic Glossary (THEN) ص (THEN) صحة becomes Home > Arabic Glossary > ص > صحة, in which the two Arabic items and the separator between them are laid out right to left, so the last two items appear in reverse order.
  21. I.4.b. WRITING DIRECTIONS. Problem: Strong vs. Weak Characters. In Unicode, alphabetic characters are usually STRONG CHARACTERS, which determine the writing direction, while punctuation characters are usually WEAK or NEUTRAL CHARACTERS, which do not set a writing direction themselves but take it from the surrounding text. → relevant in: comma-separated lists, bibliographic references, breadcrumb lines, table alignments …
  22. I.4.b. WRITING DIRECTIONS. Solutions: – insert a “strong whitespace”: Unicode characters U+200E (left-to-right mark) or U+200F (right-to-left mark); – or, if in HTML, set the writing direction directly: <span dir="ltr">…</span>
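The "strong whitespace" solution can be sketched like this. It is only an illustration under assumptions: the function names `breadcrumb` and `bidi_classes` are made up for this example, and placing one U+200E before each separator is one possible way to apply the mark. The bidirectional classes come from Python's standard `unicodedata` module.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK, an invisible strong left-to-right character


def breadcrumb(items, separator=" > "):
    """Join breadcrumb items, placing an LRM before each separator.

    The separator then stands between a strong left-to-right character
    (the mark) and the next item, so it keeps the left-to-right direction
    of the line even when both neighbouring items are Arabic.
    """
    return (LRM + separator).join(items)


def bidi_classes(text):
    """Return the Unicode bidirectional class of every character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


if __name__ == "__main__":
    path = breadcrumb(["Home", "Arabic Glossary", "\u0635", "\u0635\u062d\u0629"])
    print(path)
    # "AL" is a strong Arabic letter, "ON" a neutral, "L" strong left-to-right.
    print(bidi_classes("\u0635>" + LRM))
```

In HTML output, wrapping each item in its own element with an explicit `dir` attribute, as the slide suggests, avoids storing invisible characters in the data.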
  23. I.4.c. SEARCH. Problem: Distinction vs. Collation. Greek: diacritics not distinct; requirement: η also finds ἠ ἦ ἥ. Arabic: vowel signs not distinct; requirement: سبب also finds the same word written with vowel signs. English: diacritics distinct; requirement: d does not find ḏ.
  24. I.4.c. SEARCH. Solution: Greek → Greek collation; Arabic → Arabic collation; English → Latin collation. Collation Charts: <http://unicode.org/charts/uca/> Restrictions: – does not work for mixed texts → data needs to be separated; – some environments do not support Arabic vowel collation → e.g. MySQL <6.0
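Where the database cannot supply a suitable collation, the three search requirements can be approximated in application code by storing a normalized search key per language. The sketch below is a swapped-in technique, not the collation-based solution described on the slide, and `search_key` is a hypothetical helper. For Arabic it removes the signs U+064B to U+0652, which also includes shadda and sukun; whether those should be ignored is an assumption.

```python
import unicodedata

# Assumption: these are the signs to ignore in Arabic searches
# (fathatan through sukun, U+064B..U+0652).
ARABIC_SIGNS = {chr(cp) for cp in range(0x064B, 0x0653)}


def search_key(text, language):
    """Build a comparison key following the per-language requirements."""
    if language == "greek":
        # Decompose, drop all combining marks (breathings, accents,
        # iota subscript), then recompose.
        decomposed = unicodedata.normalize("NFD", text)
        stripped = "".join(
            ch for ch in decomposed if unicodedata.category(ch) != "Mn"
        )
        return unicodedata.normalize("NFC", stripped)
    if language == "arabic":
        # Remove only the optional vowel signs; letters stay untouched.
        return "".join(ch for ch in text if ch not in ARABIC_SIGNS)
    # Latin transliteration: diacritics are distinctive, so keep them.
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    print(search_key("\u1f20 \u1f26 \u1f25", "greek"))
    print(search_key("\u0633\u064e\u0628\u064e\u0628", "arabic"))
    print(search_key("\u1e0f", "latin") == search_key("d", "latin"))
```

Because the rules differ per language, each field has to be tagged with its language, which matches the slide's remark that mixed texts need to be separated.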
  25. I.4.d. SEARCH TERMS. Phenomenon: – user searches for Arabic words starting with مل; – the truncation symbol (asterisk) appears on the wrong side: مل*. Problem: Neutral Writing Direction. – the standard asterisk is a NEUTRAL CHARACTER; – it adopts the main writing direction
  26. I.4.d. SEARCH TERMS. Solution: Unicode Arabic asterisk (U+066D, ARABIC FIVE POINTED STAR), right-to-left: مل٭
  27. I.4.d. SEARCH TERMS. Challenges for the Developer: – Unicode does not provide general truncation or wildcard (joker) symbols; – different asterisk and wildcard signs must be processed; – no standard solution available
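One way to process the different truncation signs is to map them all to a single wildcard before the query is built. The following sketch assumes the search is run as an SQL `LIKE` query with `%` as wildcard and a backslash as escape character; `to_like_pattern` and the two character sets are hypothetical names, and the set of accepted signs would have to be extended per project.

```python
# Assumption: these two signs are the truncation symbols users may type.
WILDCARDS = {"*", "\u066d"}  # ASCII asterisk, ARABIC FIVE POINTED STAR

# Invisible directional marks that may be pasted in with a search term.
BIDI_CONTROLS = {
    "\u200e", "\u200f",                                # LRM, RLM
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # embeddings, overrides
}


def to_like_pattern(term, escape="\\"):
    """Translate a user search term into an SQL LIKE pattern."""
    out = []
    for ch in term:
        if ch in BIDI_CONTROLS:
            continue  # affects display only, not the stored text
        if ch in WILDCARDS:
            out.append("%")
        elif ch in ("%", "_", escape):
            out.append(escape + ch)  # literal characters must be escaped
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # Both spellings of the truncated term yield the same pattern.
    print(to_like_pattern("\u0645\u0644*"))
    print(to_like_pattern("\u0645\u0644\u066d"))
```

The pattern is meant to be passed as a bound parameter of a query such as `... WHERE word LIKE ? ESCAPE '\'`, never concatenated into the SQL string.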
  28. SUMMARY OF I. Technical Recommendations for Polyalphabetic Environments: – use software components that support Unicode throughout; – compose a project corpus of Unicode characters; – provide input methods to make the characters easily available; – consider Unicode writing directions and collations; – make sure that all characters not only appear correctly but are also encoded correctly
  29. II. SCHOLARLY REQUIREMENTS. 1 Corpus → How to deal with a database of 70,000+ words? 2 Translation movements → How to visualize transformations of language structures? 3 Single Lexemes → How to transform the database into a dictionary?
  30. II.1. CORPUS. How to deal with a database of 70,000+ words? – search form → user needs to know exactly what he/she is looking for; – browsing (e.g. by sources and words in alphabetical order) → user needs to know roughly what he/she is looking for; – visualization → statistical and/or graphical approach → user can explore the corpus
  31. II.1.a. CORPUS TREEMAP. Distribution of sources in the GlossGA corpus. Area size corresponds to number of words → Which sources constitute the major/minor parts of the corpus?
  32. II.1.b. SOURCE TREEMAP. Distribution of words in one source. Area size corresponds to number of words → What kind of vocabulary constitutes the source?
  33. II.2. TRANSLATION MOVEMENTS. How to visualize transformation of language structures? → compare parts of speech in diagrams (experimental)
  34. II.2.a. TRANSLATION MOVEMENTS. Compared Parts of Speech. Blue: Greek parts of speech. Red: Arabic parts of speech. Bar length: number of words of the respective part of speech
  35. II.2.b. TRANSLATION MOVEMENTS. Compared Parts of Speech. X-axis: Greek parts of speech. Y-axis: Arabic parts of speech. Intersections: dot size represents the number of words transferred from a Greek PoS into an Arabic PoS
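The numbers behind both diagrams can be derived from one table of (Greek PoS, Arabic PoS) pairs. The sketch below assumes each glossary entry has been reduced to such a pair; the sample pairs are invented for illustration and are not GlossGA data, and `pos_crosstab` and `marginals` are hypothetical names.

```python
from collections import Counter


def pos_crosstab(pairs):
    """Count how often each (Greek PoS, Arabic PoS) combination occurs.

    Each count corresponds to one dot size in the X/Y diagram.
    """
    return Counter(pairs)


def marginals(crosstab):
    """Sum the cross table per language.

    The two results correspond to the bar lengths of the bar chart.
    """
    greek, arabic = Counter(), Counter()
    for (greek_pos, arabic_pos), count in crosstab.items():
        greek[greek_pos] += count
        arabic[arabic_pos] += count
    return greek, arabic


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    sample = [
        ("noun", "noun"), ("noun", "noun"), ("adjective", "noun"),
        ("verb", "verb"), ("participle", "verb"),
    ]
    table = pos_crosstab(sample)
    print(table)
    print(marginals(table))
```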
  36. II.3.a. SINGLE LEXEMES. How to transform the database into a dictionary? Experimental preview: → collation of all entries of a Greek lexeme → ordered by Arabic lexeme → output with source and context
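The three steps of the experimental preview can be sketched as a grouping operation. The record layout (keys `greek`, `arabic`, `source`, `context`) is an assumption, not the GlossGA schema, and the sample records are placeholders. Sorting here is by plain code point order, which is only a rough stand-in for a real Arabic collation.

```python
from collections import defaultdict


def dictionary_entry(records, greek_lexeme):
    """Collect all records of one Greek lexeme, grouped by Arabic lexeme.

    Returns a list of (arabic_lexeme, [(source, context), ...]) tuples,
    ordered by Arabic lexeme (code point order, an approximation).
    """
    grouped = defaultdict(list)
    for record in records:
        if record["greek"] == greek_lexeme:
            grouped[record["arabic"]].append(
                (record["source"], record["context"])
            )
    return sorted(grouped.items())


if __name__ == "__main__":
    # Placeholder records, for illustration only.
    records = [
        {"greek": "\u03bb\u03cc\u03b3\u03bf\u03c2", "arabic": "\u0643\u0644\u0627\u0645",
         "source": "Source A", "context": "context 1"},
        {"greek": "\u03bb\u03cc\u03b3\u03bf\u03c2", "arabic": "\u0642\u0648\u0644",
         "source": "Source B", "context": "context 2"},
        {"greek": "\u03c6\u03cd\u03c3\u03b9\u03c2", "arabic": "\u0637\u0628\u064a\u0639\u0629",
         "source": "Source A", "context": "context 3"},
    ]
    for arabic, attestations in dictionary_entry(records, "\u03bb\u03cc\u03b3\u03bf\u03c2"):
        print(arabic, attestations)
```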
  37. II.3.b. SINGLE LEXEMES. Export function via email:
  38. SUMMARY OF II. Recommendations: 1 provide multiple access methods → support various user scenarios; 2 invent statistical and visual evaluation methods → profit from electronic data processing; 3 provide conventional scholarly formats → correspond to the community’s needs
  39. LAST BUT ONE SLIDE. Situation: Technical vs. Scholarly Requirements. – which one goes first? → technical requirements as necessary basis → scholarly requirements as superior objective; – both need attention from scholars; – both need attention from techies → mutual understanding → team competence
  40. Thanks to you for your attention! Project Website: http://telota.bbaw.de/glossga Contact: Yury Arzhanov | yury.arzhanov@rub.de; Torsten Roeder | roeder@bbaw.de
