[DCSB] Amir Zeldes (HU, Berlin) "Towards Digital Coptic: Searching and Visualizing Coptic Manuscript Data"

Towards Digital Coptic
Searching and Visualizing
Coptic Manuscript Data
Caroline T. Schroeder,
University of the Pacific

cschroeder@pacific.edu

Amir Zeldes,
Humboldt-Universität zu Berlin

amir.zeldes@rz.hu-berlin.de

Berlin Digital Classicist Seminar, 14.1.2014

Plan
 Introduction
 Coptic data
 Annotations so far: normalizing, tokenizing and tagging

 Search architecture
 Searching through multiple segmentations: ANNIS
 Dealing with corpus formats: TEI, SaltNPepper

 Visualization
 Dedicated visualizations
 A reusable generic approach

 Conclusion and outlook

Schroeder & Zeldes / Towards Digital Coptic

Berlin, 14.1.2014

1/37

Who are these people?
 Prof. Caroline T. Schroeder –
Religious and Classical Studies /
Humanities Center Director
University of the Pacific
 Dr. Amir Zeldes –
Korpuslinguistik /
SFB 632 Information Structure
(from March: eHumanities group KOMeT)
Humboldt-Universität zu Berlin
 Cooperation: Coptic SCRIPTORIUM, established at the 2012
NEH summer institute on "Text in a Digital Age" (Tufts):
http://coptic.pacific.edu/



Why Coptic?
 Last stage of the Ancient Egyptian language (starting 2nd century)
 Mediterranean in the 1st millennium
 Hellenistic period

 Unique language
 Longest continuous documentation
 Contact language (with Greek)

 Religious significance
 Early Christianity

 Rise of monasticism
 Gnosticism
 ...

[Figure: Coptic dialects]

The data
 Lots of material (thanks to the Egyptian desert)
 Relatively little online, nothing like Greek and Latin
(Perseus)
 Lots of things you may want are not available:
 New Testament (online, not normalized/lemmatized/annotated)
 Old Testament
 The Rule of St. Pachomius
 Works of Shenoute of Atripe
 Apophthegmata patrum
 ...

 But some have been digitized at some point!


A word about the texts in this talk
 So far we've concentrated on Shenoute's sermon Abraham our
Father
 "As for us, brethren, let us live by the truth so that we are upstanding in
all our works, and so that the prophets, apostles and all the saints might
dwell among us, ..."

 Apophthegmata Patrum (sayings of the desert fathers)
 "They said about the blessed Sarah the virgin that she spent sixty years
living at the top of the river and she never set foot outside to see the
river."

 New Testament, esp. Gospel of Mark
see http://coptic.pacific.edu/ for corpora and tools


Getting from raw text to annotated corpora
 Making the data searchable starts with:
 Encoding manuscripts (EpiDoc TEI)
 Segmentation of "word forms"

 Normalization
 Segmentation of morphemes
 Part-of-speech tagging

 More annotations...

 Brief recap: Detailed talk in Leipzig
last month (slides on my page)


Normalization
 Automatic normalization, manual correction
 handling of known diacritics, abbreviations

 closed, growing list of known variants



Tokenization
 Identifying morphemes non-trivial (agglutinative language,
different conventions; we follow Layton 2004)
 ϫⲓⲛⲧⲁⲓⲣ̅ⲙⲟⲛⲁⲭⲟⲥ
'Since I became a monk'
since-that-PAST-1sg-do-monk
 ⲉⲛⲧⲁϥⲧⲣⲉⲛⲣⲡϣⲁ
'he who made us keep the ceremony'
REL-PAST-3sgM-CAUS-1pl-do-the-observance

 Word level segmentation: manual (no scriptio continua)
 Morph segmentation: automatic (accuracy: 84% - 94%)
ⲛ̄ⲟⲩϣⲏⲣⲉ` ⲛ̄ⲁⲃⲣⲁϩⲁⲙ`
of-a-son of-Abraham
ⲛ ⲟⲩ ϣⲏⲣⲉ ⲛ ⲁⲃⲣⲁϩⲁⲙ
of a son of Abraham


Part-of-speech tagging
 POS tagging using TreeTagger (Schmid 1994) and a lexicon from the
CMCL project (courtesy of Prof. Tito Orlandi)
 Two tag sets:
 fine grained (45 tags) and coarse (22 tags)
(see http://coptic.pacific.edu/ for documentation)
 Interannotator agreement: 94.19% agreement, kappa = 0.9367
(kappa corrects for chance agreement, cf. Artstein & Poesio 2008)

 Accuracy:
 In domain, 10-fold cross-validation: 94.04% (fine)
 Out of domain (test with papyri.info): 79.6% (fine) / 87.7% (coarse)

 Main difficulties: open classes (N/V),
disambiguating homonyms (ⲉ can have 6 different tags!)
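
The interannotator figures on this slide pair raw agreement with a chance-corrected coefficient. The following is a minimal sketch of how the two relate, assuming the coefficient is Cohen's kappa for two annotators (the slide does not say which variant was used). The function name and the tag sequences are invented for illustration and are not project data.

```python
from collections import Counter

def cohen_kappa(tags_a, tags_b):
    """Return (observed agreement, Cohen's kappa) for two equal-length tag lists."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("need two non-empty tag lists of equal length")
    n = len(tags_a)
    # Share of tokens on which both annotators chose the same tag.
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    # Agreement expected by chance, from each annotator's own tag distribution.
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    expected = sum(freq_a[t] * freq_b[t] for t in freq_a) / (n * n)
    if expected == 1.0:
        return observed, 1.0  # both annotators used one and the same tag throughout
    return observed, (observed - expected) / (1 - expected)

# Hypothetical toy annotations (assumption, not from the corpus).
annotator_1 = ["N", "V", "PREP", "N", "ART", "N", "V", "N"]
annotator_2 = ["N", "V", "PREP", "V", "ART", "N", "V", "N"]
observed, kappa = cohen_kappa(annotator_1, annotator_2)
print(f"agreement={observed:.4f} kappa={kappa:.4f}")
```

With many tags and a fairly even distribution, chance agreement is small, which is why kappa can stay close to the raw agreement rate.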



Further annotations
 Many other layers are done manually:
 Translation
 Language of origin
 Coreference

 Entity tagging (people, places...)
 Parallel alignment (with Greek)
 Syntax trees (very preliminary tests)



Representing data – how to look at all this stuff?
 We now have a lot of data to represent:
 Diplomatic transcriptions (including character rendering!)
 Normalization
 Segmentation into words, morphemes, sometimes letters

 Annotations

 How do we encode this data for search and
visualization?



The first challenge: minimal units
 Minimal units, or tokens, are critical for searching:
 Find all words preceding the word "God"
 Give me any mentions of Saint Paphnutius, ±10 words
 Search for the glosses father and son within 20 words

 Two problems:
 The concept of words is complex in Coptic
 Annotations overlap parts of words:
individual letters, line breaks...
 tokens are smaller than words!

ⲡⲉϪⲁϥ ϫⲉ ⲉⲓ̇ⲥ ϣ
ⲙⲟⲩⲛ ⲛ̇ⲣⲟⲙⲡⲉ ⲻ
Ⲡⲉϫⲉ ⲡ̇ϩⲗ̇ⲗⲟ ⲛⲁϥ
he sAid "it's been e
ight years" –
The old man told him

Solution: segmentation layers in ANNIS
 We use the open source ANNIS platform as a search
interface (Zeldes et al. 2009)
 Any annotation layer can be defined as a segmentation
defining alternative views on:
 Adjacency (in words, morphemes, etc.)
 Proximity
 Context size
 But which segmentation layer do you want to see?
 Remember, diplomatic and normalized layers don't match
 Any segmentation layer is usable as "base text"


Switching segmentations in ANNIS



Different contexts
 Example search: entity="person"

 Hit: Abba Antonius
 Some options:

Ⲁⲩϭⲱⲗⲡ̇
5 ⲉ̇ⲃⲟⲗ ⲛⲁⲡⲁ ⲁ̇ⲛ
ⲧⲱⲛⲓ̇ⲟⲥ ϩⲓ̇ ⲡ̇ϫⲁⲓ̇ⲉ̇ ·
ϫⲉ ⲟⲩⲛ ⲟⲩⲁ̇ ⲉ̇ϥⲉⲓⲛⲉ̇

 ±5 words, diplomatic: (fewer than 5 preceding words found, since the hit is near the start of the text)
Ⲁⲩϭⲱⲗⲡ̇ ⲉ̇ⲃⲟⲗ ⲛⲁⲡⲁ ⲁ̇ⲛⲧⲱⲛⲓ̇ⲟⲥ ϩⲓ̇ⲡ̇ϫⲁⲓ̇ⲉ̇ · ϫⲉⲟⲩⲛⲟⲩⲁ̇ ⲉ̇ϥⲉⲓⲛⲉ̇ ⲙ̇ⲙⲟⲕ

 ±10 morphs, normalized:
ⲁ ⲩ ϭⲱⲗⲡ ⲉⲃⲟⲗ ⲛ ⲁⲡⲁ ⲁⲛⲧⲱⲛⲓⲟⲥ ϩⲓ ⲡ ϫⲁⲓⲉ · ϫⲉ ⲟⲩⲛ ⲟⲩⲁ ⲉ ϥ ⲉⲓⲛⲉ ⲙⲙⲟ ⲕ

 ±5 tokens:
Ⲁ ⲩ ϭⲱⲗⲡ̇ ⲉ̇ⲃⲟⲗ ⲛ ⲁⲡⲁ ⲁ̇ⲛ ⲧⲱⲛⲓ̇ⲟⲥ ϩⲓ̇ ⲡ̇ ϫⲁⲓ̇ⲉ̇ · ϫⲉ



Searching with AQL
(see http://www.sfb632.uni-potsdam.de/annis/ )

 Basic principle of ANNIS Query Language (AQL):
 search for some annotations (#1, #2, #3...)
 stipulate relationships between them (operators)

 Example: verbs of Greek origin
pos="V" &
source_lang="Greek" &
#1 _=_ #2
(_=_ is the identical coverage operator)
Example hits (in translation): "The head bandit repented", "I have faith in God"
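
To make the semantics of the query concrete, here is a minimal sketch of identical coverage as a join over annotation spans. The `Anno` type, the function name and the toy annotations are assumptions made for illustration; this is not how ANNIS evaluates AQL internally.

```python
from typing import NamedTuple

class Anno(NamedTuple):
    name: str
    value: str
    start: int  # first covered token (inclusive)
    end: int    # end of covered tokens (exclusive)

def identical_coverage(annos, name1, value1, name2, value2):
    """Pairs (#1, #2): #1 is name1=value1, #2 is name2=value2, both cover the same tokens."""
    by_span = {}
    for a in annos:
        if a.name == name2 and a.value == value2:
            by_span.setdefault((a.start, a.end), []).append(a)
    return [(a, b) for a in annos
            if a.name == name1 and a.value == value1
            for b in by_span.get((a.start, a.end), [])]

# Hypothetical annotations over three tokens.
annos = [
    Anno("pos", "V", 0, 1), Anno("source_lang", "Greek", 0, 1),  # Greek-origin verb
    Anno("pos", "N", 1, 2), Anno("source_lang", "Greek", 1, 2),  # Greek-origin noun
    Anno("pos", "V", 2, 3),                                      # verb, no origin given
]
hits = identical_coverage(annos, "pos", "V", "source_lang", "Greek")
print(len(hits), hits[0][0].start)
```

Only the first token satisfies both constraints over exactly the same span, so it is the single hit.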


Referencing segmentations
 There are many operators
 . (adjacent), _i_ (inclusion), _o_ (overlap), _l_ (left aligned)...
 > (dominance), -> (pointing relation), >@l (left child)...
 ...

 Possible to use segmentations in queries:
 #1 . #2 - one followed by two
 #1 .word #2 - two is the next word after one
 #1 .norm,1,10 #2 - within 1 to 10 norm units
 ...


Adding metadata
 Metadata is like any other constraint, with meta::
prefix
 Can use regular expressions and negation
pos!="V" & source_lang="Greek" &
#1 _=_ #2 & meta::msName=/MONB.*/

 For metadata names and values we use TEI/EpiDoc as
a guideline

 More information on AQL:
http://www.sfb632.uni-potsdam.de/annis/
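
The effect of a metadata constraint with a regular expression can be sketched as a document filter. This is a minimal illustration, assuming the pattern has to match the whole metadata value. The function name, document names and metadata values are hypothetical.

```python
import re

def filter_docs(docs, key, pattern, negate=False):
    """Names of documents whose metadata value for key fully matches pattern (or not, if negate)."""
    rx = re.compile(pattern)
    return [name for name, meta in docs.items()
            if (rx.fullmatch(meta.get(key, "")) is not None) != negate]

# Hypothetical corpus metadata (assumption, not the real corpus).
docs = {
    "doc1": {"msName": "MONB.AA"},
    "doc2": {"msName": "MONB.BB"},
    "doc3": {"msName": "OTHER.CC"},
}
print(filter_docs(docs, "msName", r"MONB.*"))
print(filter_docs(docs, "msName", r"MONB.*", negate=True))
```

Search hits would then be restricted to the documents that pass the filter, in the same way any other constraint narrows the result set.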



Architecture and formats
 Different formats are suitable for different parts of the
data
 TEI ideal for manuscript structure, metadata
 Linguistic formats for computational corpus linguistics:
tagging, parsing, coreference
 Convert and merge data using SaltNPepper
(Zipser & Romary 2010)



SaltNPepper (Zipser & Romary 2010)
 Metamodel Salt for
multiformat conversion
 Work on extending
TEI support: 2014-15

 Salt as internal representation
in ANNIS


How can we view the data?
 Even if we can query everything at once:
 people who are indirect objects of the verb "show" aligned
with Greek neuters...

 Can we also look at everything at once?

 Excerpt from a Salt graph view of two words:



Breaking it down
 Different annotations require different visualizations

 Two conflicting requirements:
 Ideal representation for each layer (syntax -> trees)
 Stay generic and minimize the number of visualizations

 How can we avoid programming new visualizations
with each new annotation layer?



Generic versus dedicated
 For some purposes, dedicated visualizations cannot be
avoided
 Special interactive functionality
 Special layout algorithms

 For other purposes, we can reuse visualizations by
making them flexible and configurable
 Need to take segmentations into account



Some dedicated examples
 Syntax trees

 Coreference view (interactive)



Taking segmentations into account
 Visualizations must be configurable to be aware of different
base texts
 Syntax tree is based on normalized "word"-internal morphs
 Sometimes one syntactic unit has multiple tokens

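[Figure: syntax tree over the normalized morphs of "came upon a band of bandits", shown next to the diplomatic text, where "bandits" is split across a line break: "band ofban / 15 dits and foundthem drinking . [...]"]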

Reusing dedicated visualizers?
 In some cases, some creative uses can be found for
existing visualizations
 Using the coreference visualizer for parallel alignment:

[Screenshot: coreference visualizer showing parallel alignment, Apophthegmata Patrum]



Generic visualizations
Two main generic visualizers:

 Annotation grid:
 just mark borders of annotations
 good for flat information

 HTML visualizer:
 generates HTML elements based
on annotations

 defined using two simple stylesheets
 can look like (almost) anything



Multiple grids
 All annotations in one grid can lead to visual overload

 Often better to separate groups of annotations:



The HTML visualizer
 Any specific visualization is configured by two style sheets:
a config file and a CSS file
norm.config
p      p
word   span; style="word"
norm   span; style="norm"       value
trans  t:title; style="trans"   value

norm.css
div.htmlvis {
font-family: Antinoou, sans-serif;
width: 500px;
white-space: normal !important;
}
.trans:hover{color: red}
.word:after{content: " ";}



Result

Abraham our Father

<t class="translation"
title="Abraham our father wished to
have children with Sarah.">
ⲁⲃⲣⲁϩⲁⲙ
ⲡⲉⲛ
ⲉⲓⲱⲧ
</t>
...
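
The step from a config to generated markup can be sketched as follows. This is a minimal illustration, assuming a config has already been parsed into a dictionary and that all annotation spans nest properly. The `CONFIG` structure, the `render` function and the grouping of the three normalized forms into two words are assumptions for the example, not the ANNIS implementation.

```python
from html import escape

# Hypothetical parsed config:
# annotation name -> (HTML element, CSS class, attribute for the value, print value as text?)
CONFIG = {
    "trans": ("t", "trans", "title", False),
    "word": ("span", "word", None, False),
    "norm": ("span", "norm", None, True),
}

def render(annos, config):
    """annos: (name, value, start, end) tuples over token positions; returns an HTML string."""
    order = {name: i for i, name in enumerate(config)}
    events = []
    for name, value, start, end in annos:
        if name not in config:
            continue  # layers without a config line are not displayed
        element, css, attr, show = config[name]
        attrs = f' class="{escape(css)}"'
        if attr:
            attrs += f' {attr}="{escape(value)}"'
        text = escape(value) if show else ""
        # At one position: close before open; open wider spans first, close narrower first.
        events.append((start, 1, -end, order[name], f"<{element}{attrs}>{text}"))
        events.append((end, 0, -start, -order[name], f"</{element}>"))
    return "".join(event[-1] for event in sorted(events))

annos = [
    ("trans", "Abraham our father wished to have children with Sarah.", 0, 3),
    ("word", "", 0, 1), ("word", "", 1, 3),  # word grouping assumed for the example
    ("norm", "ⲁⲃⲣⲁϩⲁⲙ", 0, 1), ("norm", "ⲡⲉⲛ", 1, 2), ("norm", "ⲉⲓⲱⲧ", 2, 3),
]
html = render(annos, CONFIG)
print(html)
```

Because the element names and classes come only from the config, pointing the same function at a different mapping and stylesheet would give a different view of the same annotations.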




Reusing the HTML visualizer
dipl.config
tok      span                     value
lb       div; style="line"
pb       table:title; style="pb"  value
pb       tr
cb       td; style="cb"
hi_rend  hi_rend:rend             value


Visualizing TEI @rend attributes
dipl.css
div.line{display: block;
height: 22px;
counter-increment: linecount;}
div.line:nth-of-type(5n):before{
content: counter(linecount)" "}
...

.pb{border-style:solid;}
.cb{counter-reset: linecount 0;
width: 160px;
min-width: 160px}

...
hi_rend[rend*=superscript]
{vertical-align: super; font-size: 80%}
hi_rend[rend*=red] {color: red}
hi_rend[rend*=tall] {font-size: 120%}

hi_rend[rend*=extralarge] {font-size: 160%}



Aggregate visualizations
 Latest version of ANNIS offers basic frequency analysis

 Open question: How much more should we build?


Aggregate visualizations
 Other visualizations are currently done e.g. in R:

[Figure: R plot of Coptic vocabulary items for "Gospel of Mark 1" and "11 apophthegmata patrum", with regions labeled "Egyptian vocabulary" and "Greek vocabulary"; English glosses shown in the plot include Jesus, John, Gospel, baptism, synagogue, impure, Holy Ghost, Abba, old man, wine, eat, said, I/me, you.SG.M]

Conclusion
 Annotation projects should not be limited by corpus
architectures:
 annotate whatever you want, however often you want
 link anything to anything

 Why annotate all of these things in the corpus?
(and not just in a separate spreadsheet)
 Plots of just the verbs? Proper names? -> POS tagging
 Highlight, search and link place-names? -> Entity tagging
 Collapse inflected variants? -> Lemmatization
 Collapse prominent referents? -> Coreference annotation
 Dispersion of any of the above, alignment ... and much more



Conclusion
 Anything can be made queryable with more layers:
 typical constructions and objects of verbs?
 Greek vs. native verbs -> add language of origin layer
 Translation behavior -> add alignment layer

 ...

 Fitting visualization facilities
 should be easy to re-use

 optimized to the task, display relevant portions of information
 for many purposes, they must be sensitive to segmentations


Outlook
 This March: BMBF funded young researcher group on
eHumanities at HU Berlin
 KOMeT:
KOrpuslinguistische Methoden für ePhilologie mit TEI
 Focus on marrying TEI resources with computational linguistics methods
and formats
 Developing NLP tools, search and visualization for ancient world textual
resources
 Pilot phase (2014, approved): Coptic
 Main phase (2015-2019, pending): Other languages as well
 Currently looking for a student assistant (60h/month)

 Stay tuned for more!



Ⲙⲓⲱⲧⲛ ⲧⲱⲛⲟⲩ!
well-being+your.PL greatly
=>
Thanks!

References
 Artstein, Ron & Massimo Poesio (2008), Inter-Coder Agreement for
Computational Linguistics. Computational Linguistics 34(4), 556–596.
 Layton, Bentley (2004), A Coptic Grammar. Second Edition, Revised and
Expanded. (Porta linguarum orientalium 20.) Wiesbaden: Harrassowitz.
 Schmid, Helmut (1994), Probabilistic Part-of-Speech Tagging Using Decision
Trees. In: Proceedings of the Conference on New Methods in Language
Processing. Manchester, UK, 44–49. Available at:
http://www.ims.uni-stuttgart.de/ftp/pub/corpora/tree-tagger1.pdf.
 Zeldes, Amir, Julia Ritz, Anke Lüdeling & Christian Chiarcos (2009), ANNIS: A
Search Tool for Multi-Layer Annotated Corpora. In: Proceedings of Corpus
Linguistics 2009. Liverpool, UK.
 Zipser, Florian & Laurent Romary (2010), A Model Oriented Approach to the
Mapping of Annotation Formats using Standards. In: Proceedings of the
Workshop on Language Resource and Language Technology Standards,
LREC-2010. Valletta, Malta, 7–18.

Links
 Coptic SCRIPTORIUM: http://coptic.pacific.edu/
 ANNIS: http://www.sfb632.uni-potsdam.de/annis/
 Search engine for our corpora:
https://korpling.german.hu-berlin.de/annis3/scriptorium

 Papyri.info: http://papyri.info/
 CMCL: http://cmcl.let.uniroma1.it/
