[DCSB] Amir Zeldes (HU, Berlin) "Towards Digital Coptic: Searching and Visualizing Coptic Manuscript Data"
1. Towards Digital Coptic
Searching and Visualizing
Coptic Manuscript Data
Caroline T. Schroeder,
University of the Pacific
cschroeder@pacific.edu
Amir Zeldes,
Humboldt-Universität zu Berlin
amir.zeldes@rz.hu-berlin.de
Berlin Digital Classicist Seminar, 14.1.2014
2. Plan
Introduction
Coptic data
Annotations so far: normalizing, tokenizing and tagging
Search architecture
Searching through multiple segmentations: ANNIS
Dealing with corpus formats: TEI, SaltNPepper
Visualization
Dedicated visualizations
A reusable generic approach
Conclusion and outlook
Schroeder & Zeldes / Towards Digital Coptic
Berlin, 14.1.2014
1/37
3. Who are these people?
Prof. Caroline T. Schroeder –
Religious and Classical Studies /
Humanities Center Director
University of the Pacific
Dr. Amir Zeldes –
Korpuslinguistik /
SFB 632 Information Structure
(from March: eHumanities group KOMeT)
Humboldt-Universität zu Berlin
Cooperation Coptic SCRIPTORIUM, established at the 2012
NEH summer institute on "Text in a Digital Age" (Tufts):
http://coptic.pacific.edu/
4. Why Coptic?
Last stage of Ancient Egyptian Language (starting 2nd Century)
Mediterranean in 1st millennium
Hellenistic period
Unique language
Longest continuous documentation
Contact language (with Greek)
Religious significance
Early Christianity
Rise of monasticism
Gnosticism
...
(Figure: Coptic dialects)
5. The data
Lots of material (thanks to the Egyptian desert)
Relatively little online, nothing like Greek and Latin
(Perseus)
Lots of things you may want are not available:
New Testament (online, not normalized/lemmatized/annotated)
Old Testament
The Rule of St. Pachomius
Works of Shenoute of Atripe
Apophthegmata patrum
...
But some have been digitized at some point!
6. A word about the texts in this talk
So far we've concentrated on Shenoute's sermon Abraham our
Father
"As for us, brethren, let us live by the truth so that we are upstanding in
all our works, and so that the prophets, apostles and all the saints might
dwell among us, ..."
Apophthegmata Patrum (sayings of the desert fathers)
"They said about the blessed Sarah the virgin that she spent sixty years
living at the top of the river and she never set foot outside to see the
river."
New Testament, esp. Gospel of Mark
see http://coptic.pacific.edu/ for corpora and tools
7. Getting from raw text to annotated corpora
Making the data searchable starts
with:
Encoding manuscripts (EpiDoc TEI)
Segmentation of "word forms"
Normalization
Segmentation of morphemes
Part-of-speech tagging
More annotations...
Brief recap: Detailed talk in Leipzig
last month (slides on my page)
8. Normalization
Automatic normalization, manual correction
handling of known diacritics, abbreviations
closed, growing list of known variants
9. Tokenization
Identifying morphemes non-trivial (agglutinative language,
different conventions; we follow Layton 2004)
ϫⲓⲛⲧⲁⲓⲣ̅ⲙⲟⲛⲁⲭⲟⲥ
'Since I became a monk'
since-that-PAST-1sg-do-monk
ⲉⲛⲧⲁϥⲧⲣⲉⲛⲣⲡϣⲁ
'he who made us keep the ceremony'
REL-PAST-3sgM-CAUS-1pl-do-the-observance
Word level segmentation: manual (no scriptio continua)
Morph segmentation: automatic (accuracy: 84% - 94%)
ⲛ̄ⲟⲩϣⲏⲣⲉ` ⲛ̄ⲁⲃⲣⲁϩⲁⲙ`
of-a-son of-Abraham
ⲛ ⲟⲩ ϣⲏⲣⲉ ⲛ ⲁⲃⲣⲁϩⲁⲙ
of a son of Abraham
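The slides do not say how the automatic morph segmentation works. Purely as an illustration, a dictionary-based greedy longest-match segmenter could look like the Python sketch below. The tiny lexicon and the function name `segment` are invented for this example and are not part of the project's tools; the input is assumed to be already normalized (no diacritics).

```python
# Toy lexicon of normalized morphs (hypothetical; a real lexicon is far larger).
LEXICON = {"ⲛ", "ⲟⲩ", "ϣⲏⲣⲉ", "ⲁⲃⲣⲁϩⲁⲙ"}
MAX_LEN = max(len(morph) for morph in LEXICON)


def segment(word):
    """Split a normalized word form into morphs by greedy longest match.

    Falls back to a single-character morph when no lexicon entry matches.
    """
    morphs = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, then shrink.
        for size in range(min(MAX_LEN, len(word) - i), 0, -1):
            piece = word[i:i + size]
            if piece in LEXICON or size == 1:
                morphs.append(piece)
                i += size
                break
    return morphs


print(segment("ⲛⲟⲩϣⲏⲣⲉ"))
print(segment("ⲛⲁⲃⲣⲁϩⲁⲙ"))
```

A greedy matcher like this cannot resolve ambiguous splits, which is one plausible reason why real segmentation accuracy stays below 100% and manual correction is still needed.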
10. Part-of-speech tagging
POS tagging using TreeTagger (Schmid 1994) and a lexicon from the
CMCL project (courtesy of Prof. Tito Orlandi)
Two tag sets:
fine grained (45 tags) and coarse (22 tags)
(see http://coptic.pacific.edu/ for documentation)
Interannotator agreement: 94.19% agreement, kappa = 0.9367
(considers chance agreement, cf. Artstein & Poesio 2008)
Accuracy:
In domain, 10-fold cross-validation: 94.04% (fine)
Out of domain (test with papyri.info): 79.6% (fine) / 87.7% (coarse)
Main difficulties: open classes (N/V),
disambiguating homonyms (ⲉ can have 6 different tags!)
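To make the difference between raw agreement and kappa concrete, here is a small Python sketch. It assumes Cohen's kappa for two annotators (the slide does not say which kappa variant was used), and the tag sequences are invented toy data, not project annotations.

```python
from collections import Counter


def cohen_kappa(tags_a, tags_b):
    """Chance-corrected agreement between two annotators (Cohen's kappa)."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("need two non-empty tag sequences of equal length")
    n = len(tags_a)
    # Observed agreement: share of items with identical tags.
    p_observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    # Expected agreement: chance of both picking the same tag,
    # given each annotator's own tag distribution.
    count_a, count_b = Counter(tags_a), Counter(tags_b)
    p_expected = sum(count_a[t] * count_b[t] for t in count_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used one and the same tag throughout
    return (p_observed - p_expected) / (1 - p_expected)


annotator_1 = ["N", "V", "N", "PREP"]
annotator_2 = ["N", "V", "V", "PREP"]
print(cohen_kappa(annotator_1, annotator_2))
```

In this toy case the raw agreement is 75%, but kappa is lower because some of that agreement would be expected by chance.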
11. Further annotations
Many other layers are done manually:
Translation
Language of origin
Coreference
Entity tagging (people, places...)
Parallel alignment (with Greek)
Syntax trees (very preliminary tests)
12. Representing data – how to look at all this stuff?
We now have a lot of data to represent:
Diplomatic transcriptions (including character rendering!)
Normalization
Segmentation into words, morphemes, sometimes letters
Annotations
How do we encode this data for search and
visualization?
13. The first challenge: minimal units
Minimal units, or tokens, are critical for searching:
Find all words preceding the word "God"
Give me any mentions of Saint Paphnutius, ±10 words
Search for the glosses father and son within 20 words
Two problems:
The concept of words is complex in Coptic
Annotations overlap parts of words:
individual letters, line breaks...
tokens are smaller than words!
ⲡⲉϪⲁϥ ϫⲉ ⲉⲓ̇ⲥ ϣ
ⲙⲟⲩⲛ ⲛ̇ⲣⲟⲙⲡⲉ ⲻ
Ⲡⲉϫⲉ ⲡ̇ϩⲗ̇ⲗⲟ ⲛⲁϥ
he sAid "it's been e
ight years" –
The old man told him
14. Solution: segmentation layers in ANNIS
We use the open source ANNIS platform as a search
interface (Zeldes et al. 2009)
Any annotation layer can be defined as a segmentation
defining alternative views on:
Adjacency
(in words, morphemes, etc.)
Proximity
(in words, morphemes, etc.)
Context size
(in words, morphemes, etc.)
But which segmentation layer do you want to see?
Remember, diplomatic and normalized layers don't match
Any segmentation layer is usable as "base text"
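The idea of segmentation layers over minimal tokens can be sketched in Python as follows. This is a simplified model for illustration, not the ANNIS data model: the layer names, the `(start, end, value)` span format and the function names are assumptions, and the English example imitates a diplomatic line break inside "bandits".

```python
# Minimal tokens: the smallest units any layer needs to refer to.
TOKENS = ["band", "of", "ban", "dits", "and", "found", "them"]

# Each layer groups minimal tokens into units: (start, end, value), end exclusive.
LAYERS = {
    "norm": [(0, 1, "band"), (1, 2, "of"), (2, 4, "bandits"),
             (4, 5, "and"), (5, 6, "found"), (6, 7, "them")],
    "dipl": [(0, 1, "band"), (1, 3, "ofban"), (3, 4, "dits"),
             (4, 5, "and"), (5, 7, "foundthem")],
}


def unit_index(layer, token_index):
    """Return the position of the unit in `layer` that covers a minimal token."""
    if not 0 <= token_index < len(TOKENS):
        raise IndexError("no such minimal token")
    for i, (start, end, _) in enumerate(LAYERS[layer]):
        if start <= token_index < end:
            return i
    raise ValueError("token not covered by this layer")


def context(layer, token_index, size):
    """Show the hit plus `size` units of context on each side, counted in `layer`."""
    units = LAYERS[layer]
    i = unit_index(layer, token_index)
    return [value for _, _, value in units[max(0, i - size):i + size + 1]]


# Same hit (minimal token "ban"), two different base texts.
print(context("norm", 2, 1))
print(context("dipl", 2, 1))
```

The same hit yields different context depending on which layer is chosen as the base text, which is why the choice of segmentation has to be made explicit.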
17. Searching with AQL
(see http://www.sfb632.uni-potsdam.de/annis/ )
Basic principle of ANNIS Query Language (AQL):
search for some annotations (#1, #2, #3...)
stipulate relationships between them (operators)
Example: verbs of Greek origin
pos="V" &
source_lang="Greek" &
#1 _=_ #2
(_=_ is the identical coverage operator)
Example hits: "The head bandit repented", "I have faith in God"
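What the query asks for can be emulated outside ANNIS with a few lines of Python. This is only a sketch of the logic, not how ANNIS evaluates AQL: the `Span` type, the function `same_coverage` and the sample annotations (including the value "Egyptian") are invented for illustration.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int   # first covered token
    end: int     # one past the last covered token
    name: str    # annotation name, e.g. "pos"
    value: str   # annotation value, e.g. "V"


# Invented sample data: three tokens, each with two annotations.
ANNOTATIONS = [
    Span(0, 1, "pos", "V"), Span(0, 1, "source_lang", "Greek"),
    Span(1, 2, "pos", "N"), Span(1, 2, "source_lang", "Greek"),
    Span(2, 3, "pos", "V"), Span(2, 3, "source_lang", "Egyptian"),
]


def same_coverage(annotations, first, second):
    """Find pairs (a, b) where a matches `first`, b matches `second`,
    and both cover exactly the same tokens (the role of _=_ in the query)."""
    hits = []
    for a in annotations:
        if (a.name, a.value) != first:
            continue
        for b in annotations:
            if (b.name, b.value) == second and (a.start, a.end) == (b.start, b.end):
                hits.append((a, b))
    return hits


hits = same_coverage(ANNOTATIONS, ("pos", "V"), ("source_lang", "Greek"))
for verb, origin in hits:
    print(verb.start, verb.end, verb.value, origin.value)
```

Only the first token qualifies: the second is Greek but not a verb, and the third is a verb but not Greek.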
18. Referencing segmentations
There are many operators
. (adjacent), _i_ (inclusion), _o_ (overlap), _l_ (left aligned)...
> (dominance), -> (pointing relation), >@l (left child)...
...
Possible to use segmentations in queries:
#1 . #2
- one followed by two
#1 .word #2
- two is the next word after one
#1 .norm,1,10 #2
- within 1 to 10 norm units
...
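A simplified reading of the segmentation-aware precedence operators can be sketched in Python. The sketch assumes that distance is counted in units of the named segmentation and that direct adjacency is distance 1; the data layout and the function names are invented here and are not ANNIS internals.

```python
# Units are (start, end) ranges over minimal tokens, end exclusive.
# Minimal tokens (indices 0..4): ⲛ ⲟⲩ ϣⲏⲣⲉ ⲛ ⲁⲃⲣⲁϩⲁⲙ
SEGMENTATIONS = {
    "norm": [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)],  # one unit per morph
    "word": [(0, 3), (3, 5)],                          # two word forms
}


def unit_index(units, token_index):
    """Return the position of the unit that covers a minimal token."""
    for i, (start, end) in enumerate(units):
        if start <= token_index < end:
            return i
    raise ValueError(f"token {token_index} is not covered")


def precedes(seg, first, second, min_dist=1, max_dist=1):
    """True if `second` starts between min_dist and max_dist units of `seg`
    after the unit in which `first` ends."""
    units = SEGMENTATIONS[seg]
    last_of_first = unit_index(units, first[1] - 1)
    first_of_second = unit_index(units, second[0])
    return min_dist <= first_of_second - last_of_first <= max_dist


son = (2, 3)      # ϣⲏⲣⲉ
abraham = (4, 5)  # ⲁⲃⲣⲁϩⲁⲙ
print(precedes("word", son, abraham))         # in the spirit of #1 .word #2
print(precedes("norm", son, abraham))         # in the spirit of #1 .norm #2
print(precedes("norm", son, abraham, 1, 10))  # in the spirit of #1 .norm,1,10 #2
```

The two morphs sit in adjacent words but are two norm units apart, so the answer depends on the segmentation named in the operator.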
19. Adding metadata
Metadata is like any other constraint, with meta::
prefix
Can use regular expressions and negation
pos!="V" & source_lang="Greek" &
#1 _=_ #2 & meta::msName=/MONB.*/
For metadata names and values we use TEI/EpiDoc as
a guideline
More information on AQL:
http://www.sfb632.uni-potsdam.de/annis/
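The effect of a regular-expression metadata constraint can be illustrated with a short Python sketch. It assumes that the pattern has to match the whole metadata value; the document records and manuscript names below are invented placeholders, not real corpus metadata.

```python
import re

# Invented document records; only the msName key mirrors the slide.
DOCUMENTS = [
    {"name": "doc_a", "meta": {"msName": "MONB.AA"}},
    {"name": "doc_b", "meta": {"msName": "MONB.BB"}},
    {"name": "doc_c", "meta": {"msName": "OTHER.01"}},
]


def filter_by_meta(documents, key, pattern):
    """Keep documents whose metadata value for `key` fully matches `pattern`."""
    regex = re.compile(pattern)
    return [doc["name"] for doc in documents
            if regex.fullmatch(doc["meta"].get(key, ""))]


print(filter_by_meta(DOCUMENTS, "msName", r"MONB.*"))
```

Search hits would then be restricted to the documents that pass this filter.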
20. Architecture and formats
Different formats are suitable for different parts of the
data
TEI ideal for manuscript structure, metadata
Linguistic formats for computational corpus linguistics:
tagging, parsing, coreference
Convert and merge data using SaltNPepper
(Zipser & Romary 2010)
21. SaltNPepper (Zipser & Romary 2010)
Metamodel Salt for
multiformat conversion
Work on extending
TEI support: 2014-15
Salt as internal representation
in ANNIS
22. How can we view the data?
Even if we can query everything at once:
people who are indirect objects of the verb "show" aligned
with Greek neuters...
Can we also look at everything at once?
Excerpt from a Salt graph view of two words:
23. Breaking it down
Different annotations require different visualizations
Two conflicting requirements:
Ideal representation for each layer (syntax -> trees)
Stay generic and minimize the number of visualizations
How can we avoid programming new visualizations
with each new annotation layer?
24. Generic versus dedicated
For some purposes, dedicated visualizations cannot be
avoided
Special interactive functionality
Special layout algorithms
For other purposes, we can reuse visualizations by
making them flexible and configurable
Need to take segmentations into account
25. Some dedicated examples
Syntax trees
Coreference view (interactive)
26. Taking segmentations into account
Visualizations must be configurable to be aware of different
base texts
Syntax tree is based on normalized "word"-internal morphs
Sometimes one syntactic unit has multiple tokens
(Figure: syntax tree over the normalized words "came upon a band of bandits"; in the diplomatic text, "band ofban / 15 dits and foundthem / drinking . [...]", a line break splits "bandits" into the two tokens "ban" and "dits".)
27. Reusing dedicated visualizers?
In some cases, some creative uses can be found for
existing visualizations
Using the coreference visualizer for parallel alignment:
apophthegmata patrum
28. Generic visualizations
Two main generic visualizers:
Annotation grid:
just mark borders of annotations
good for flat information
HTML visualizer:
generates HTML elements based
on annotations
defined using two simple stylesheets
can look like (almost) anything
29. Multiple grids
All annotations in one grid can lead to visual overload
Often better to separate groups of annotations:
30. The HTML visualizer
Any specific visualization is configured by two style sheets:
a config file and a CSS file
norm.config
p        p
word     span; style="word"
norm     span; style="norm"         value
trans    t:title; style="trans"     value
norm.css
div.htmlvis {
  font-family: Antinoou, sans-serif;
  width: 500px;
  white-space: normal !important;
}
.trans:hover{color: red}
.word:after{content: " ";}
31. Result
Abraham our Father
<p>
<t class="translation"
title="Abraham our father wished to
have children with Sarah.">
<span class="word">
<span class="norm">
ⲁⲃⲣⲁϩⲁⲙ
</span>
</span>
<span class="word">
<span class="norm">
ⲡⲉⲛ
</span>
<span class="norm">
ⲉⲓⲱⲧ
</span>
</span>
</t>
...
</p>
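How a configuration of this kind can drive HTML generation is sketched below in Python. This is not the ANNIS implementation: the `RULES` dictionary is a hand-made stand-in for the config file, annotations are modelled as a simple tree of `(name, value, children)` tuples rather than spans over tokens, and the class name "trans" follows the config shown earlier.

```python
from html import escape

# Hypothetical in-memory form of the config: annotation name -> output rule.
RULES = {
    "p": {"element": "p"},
    "trans": {"element": "t", "attr": "title", "css": "trans"},
    "word": {"element": "span", "css": "word"},
    "norm": {"element": "span", "css": "norm", "text": True},
}


def render(node):
    """Turn one (name, value, children) annotation node into an HTML string."""
    name, value, children = node
    rule = RULES[name]
    tag = rule["element"]
    attrs = ""
    css = rule.get("css")
    if css:
        attrs += f' class="{css}"'
    attr = rule.get("attr")
    if attr:
        # The annotation value goes into an attribute (e.g. a hover title).
        attrs += f' {attr}="{escape(value)}"'
    if rule.get("text"):
        # The annotation value is shown as the element's text.
        inner = escape(value)
    else:
        inner = "".join(render(child) for child in children)
    return f"<{tag}{attrs}>{inner}</{tag}>"


document = ("p", None, [
    ("trans", "Abraham our father wished to have children with Sarah.", [
        ("word", None, [("norm", "ⲁⲃⲣⲁϩⲁⲙ", [])]),
        ("word", None, [("norm", "ⲡⲉⲛ", []), ("norm", "ⲉⲓⲱⲧ", [])]),
    ]),
])
print(render(document))
```

Because the mapping lives in data rather than code, a different view of the same corpus only needs a different rule set and stylesheet.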
32. Reusing the HTML visualizer
dipl.config
tok        span                         value
lb         div; style="line"
pb         table:title; style="pb"      value
pb         tr
cb         td; style="cb"
hi_rend    hi_rend:rend                 value
34. Aggregate visualizations
Latest version of ANNIS offers basic frequency analysis
Open question: How much more should we build?
35. Aggregate visualizations
Other visualizations are currently done e.g. in R:
(Figure: R plot of vocabulary from Gospel of Mark 1 and 11 apophthegmata patrum, with regions labelled "Egyptian vocabulary" and "Greek vocabulary"; plotted items carry English glosses such as Jesus, John, Gospel, baptism, synagogue, impure, Holy Ghost, Abba, old man, wine, eat, said.)
36. Conclusion
Annotation projects should not be limited by corpus
architectures:
annotate whatever you want, however often you want
link anything to anything
Why annotate all of these things in the corpus?
(and not just in a separate spreadsheet)
Plots of just the verbs? Proper names? -> POS tagging
Highlight, search and link place-names? -> Entity tagging
Collapse inflected variants? -> Lemmatization
Collapse prominent referents? -> Coreference annotation
Dispersion of any of the above, alignment ... and much more
37. Conclusion
Anything can be made queryable with more layers:
typical constructions and objects of verbs?
Greek vs. native verbs -> add language of origin layer
Translation behavior -> add alignment layer
...
Fitting visualization facilities
should be easy to re-use
optimized to the task, display relevant portions of information
for many purposes, they must be sensitive to segmentations
38. Outlook
This March: BMBF-funded young researcher group on
eHumanities at HU Berlin
KOMeT:
KOrpuslinguistische Methoden für ePhilologie mit TEI
(corpus-linguistic methods for e-philology with TEI)
Focus on marrying TEI resources with computational linguistics methods
and formats
Developing NLP tools, search and visualization for ancient world textual
resources
Pilot phase (2014, approved): Coptic
Main phase (2015-2019, pending): Other languages as well
Currently looking for a student assistant (60h/month)
Stay tuned for more!
40. References
Artstein, Ron & Massimo Poesio (2008), Inter-Coder Agreement for
Computational Linguistics. Computational Linguistics 34(4), 556–596.
Layton, Bentley (2004), A Coptic Grammar. Second Edition, Revised and
Expanded. (Porta linguarum orientalium 20.) Wiesbaden: Harrassowitz.
Schmid, Helmut (1994), Probabilistic Part-of-Speech Tagging Using Decision
Trees. In: Proceedings of the Conference on New Methods in Language
Processing. Manchester, UK, 44–49. Available at: http://www.ims.uni-stuttgart.de/ftp/pub/corpora/tree-tagger1.pdf.
Zeldes, Amir, Julia Ritz, Anke Lüdeling & Christian Chiarcos (2009), ANNIS: A
Search Tool for Multi-Layer Annotated Corpora. In: Proceedings of Corpus
Linguistics 2009. Liverpool, UK.
Zipser, Florian & Laurent Romary (2010), A Model Oriented Approach to the
Mapping of Annotation Formats using Standards. In: Proceedings of the
Workshop on Language Resource and Language Technology Standards,
LREC-2010. Valletta, Malta, 7–18.