Annotating the Hebrew Bible

High Precision Philology in a Digital Space
Leipzig 2016-02-15/16
Dirk Roorda
דְּבַ֥ר

Text. What is it?
bᵊrēšˈîṯ bārˈā ʔᵉlōhˈîm ʔˌēṯ haššāmˈayim wᵊʔˌēṯ hāʔˈāreṣ .
Genesis 1:1
In the beginning God created the heavens and the earth.

A string of words ...

... separated by spaces?

A string of letters ...
... in which alefbet?
phonetic
bᵊrēšˈîṯ bārˈā ʔᵉlōhˈîm ʔˌēṯ haššāmˈayim wᵊʔˌēṯ hāʔˈāreṣ .

hebrew with vowels and accents
בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃

hebrew consonantal
בראשית ברא אלהים את השמים ואת הארץ׃

etcbc transcription (full)
B.:- R;>CI73JT B.@R@74> >:ELOHI92JM >;71T HA- C.@MA73JIM W:- >;71T H@- >@75REY00

etcbc transcription (consonantal)
B R>CJT BR> >LHJM >T H CMJM W >T H >RY
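
The consonantal transcription above can be turned back into Hebrew letters with a simple lookup table. The sketch below is only an illustration: the mapping reflects the commonly documented ETCBC consonant codes, while the function names and the handling of final letter forms at the end of a token are assumptions of this example, not something taken from the slides.

```python
import unicodedata

# ETCBC consonant codes mapped to Unicode letter names (F and C both become
# an undotted shin in purely consonantal text).
CONSONANTS = {
    ">": "ALEF", "B": "BET", "G": "GIMEL", "D": "DALET", "H": "HE",
    "W": "VAV", "Z": "ZAYIN", "X": "HET", "V": "TET", "J": "YOD",
    "K": "KAF", "L": "LAMED", "M": "MEM", "N": "NUN", "S": "SAMEKH",
    "<": "AYIN", "P": "PE", "Y": "TSADI", "Q": "QOF", "R": "RESH",
    "F": "SHIN", "C": "SHIN", "T": "TAV",
}
# Letters that take a final form at the end of a token (assumption of this sketch).
FINALS = {"K", "M", "N", "P", "Y"}


def word_to_hebrew(word):
    """Convert one transcribed token to Hebrew consonants."""
    letters = []
    for i, code in enumerate(word):
        name = CONSONANTS[code]
        if i == len(word) - 1 and code in FINALS:
            name = "FINAL " + name
        letters.append(unicodedata.lookup("HEBREW LETTER " + name))
    return "".join(letters)


def to_hebrew(text):
    """Convert a space-separated transcription, token by token."""
    return " ".join(word_to_hebrew(token) for token in text.split())


print(to_hebrew("B R>CJT BR> >LHJM >T H CMJM W >T H >RY"))
```

Note that the transcription separates prefixes (B, H, W) from the words they attach to, so the output keeps those spaces; joining them again would need the word boundaries from the database.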

layers of annotation
1. The Text itself (representations)
2. Linguistics (feature structures)
3. "Manual" (really manual or software-generated)
4. Queries (exegetical search)

A text is ... in abstracto:
a sequence of objects with a notion of embedding
1994 Crist-Jan Doedens. Text Databases: One Database Model and Several Retrieval Languages.
Ph.D. thesis. In Language and Computers, Amsterdam. See Google Books.

... in concreto:
all objects are sets of monads (the smallest elements)
all objects participate in spatial relationships
sequence - embedding - overlap - gap
all objects can carry unlimited features
a representation of a word is just a feature
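
A minimal sketch of this model, assuming nothing beyond the bullet points above: every object is a set of monads (here integers), the spatial relationships are plain set operations, and features are an open-ended dictionary. The class and function names and the sample objects are hypothetical, and this is not the Emdros implementation.

```python
from dataclasses import dataclass, field


@dataclass
class TextObject:
    otype: str                      # e.g. "word", "phrase", "clause"
    monads: frozenset               # the smallest elements the object covers
    features: dict = field(default_factory=dict)


def precedes(a, b):
    """Sequence: all of a comes before all of b."""
    return max(a.monads) < min(b.monads)


def embeds(outer, inner):
    """Embedding: every monad of inner also belongs to outer."""
    return inner.monads <= outer.monads


def overlaps(a, b):
    """Overlap: shared monads, but neither object contains the other."""
    return bool(a.monads & b.monads) and not (embeds(a, b) or embeds(b, a))


def gaps(obj):
    """Gap: monads inside the object's span that it does not cover."""
    span = set(range(min(obj.monads), max(obj.monads) + 1))
    return sorted(span - obj.monads)


# Toy objects over the 11 word slots of Genesis 1:1.
word = TextObject("word", frozenset({2}), {"text": "R>CJT"})
phrase = TextObject("phrase", frozenset(range(5, 12)), {"phrase_function": "Objc"})
clause = TextObject("clause", frozenset(range(1, 12)))
gapped = TextObject("example", frozenset({1, 2, 4}))   # made-up object with a gap

print(embeds(clause, phrase), precedes(word, phrase), gaps(gapped))
```

The representation of a word ("text" above) sits in the same dictionary as any linguistic feature, which is the point of the last bullet.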

... in practice:
this model has been implemented in a mature system
2002-2014 Ulrik Petersen. Emdros.
Text database engine for storage and retrieval of analyzed or annotated text.
Open Source Software. See COLING paper 2004.

... for convenience:
an ISO standard captures a lot of this model
2012 Nancy Ide and Laurent Romary. Linguistic Annotation Framework (LAF). ISO Standard 24612.

Linguistic Annotation Framework
<node xml:id="n_88917">
<link targets="r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 r11"/>
</node>
<edge xml:id="e1" from="n88917" to="n84383"/>
<a xml:id="ae1" label="parents" ref="e1" as="link"/>
<region xml:id="r_2" anchors="6 23"/>
<node xml:id="n_3"><link targets="r_2"/></node>
<a xml:id="a_3" label="word" ref="n_3" as="monads"/>
<a xml:id="af22" label="ft" ref="n3" as="utf8"><fs>
<f name="lexeme_utf8" value="ראשׁית"/>
<f name="surface_consonants_utf8" value="ראשׁית"/>
</fs></a>
[Diagram: the primary data (the Hebrew text of Genesis 1:1) is divided into regions (r9, r10, r11, ...); nodes (n2, n3, ...) link to regions; labeled edges (parents, mother) connect nodes; annotations attach to nodes and edges.]
annotations (empty): word, sentence, phrase, subphrase, clause
annotations (features) on a phrase: determination=determined, phrase_function=Objc, phrase_type=PP
annotations (features) on a clause atom: clause_atom_number=1, clause_atom_relation=0, clause_atom_type=xQtl, indentation=0
annotations (features) on a word: lexeme_utf8=ראשׁית, surface_consonants_utf8=ראשׁית

Let's go LAF
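
To make the node, region and annotation triangle concrete, here is a small sketch that reads a simplified LAF-like fragment with Python's standard library and resolves a node to the stretches of primary data it covers. The fragment, the toy primary text and all identifiers are assumptions made for this example; real LAF files carry a namespace and headers that are left out here, and this is not how LAF-Fabric itself loads data.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:id attribute under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

PRIMARY = "B R>CJT BR> >LHJM"   # toy primary data (consonantal transcription)

LAF = """
<graph>
  <region xml:id="r_1" anchors="0 1"/>
  <region xml:id="r_2" anchors="2 7"/>
  <region xml:id="r_3" anchors="8 11"/>
  <node xml:id="n_2"><link targets="r_2"/></node>
  <node xml:id="n_9"><link targets="r_1 r_2"/></node>
  <a xml:id="a_2" label="word" ref="n_2" as="toy"/>
  <a xml:id="a_9" label="phrase" ref="n_9" as="toy"/>
</graph>
"""


def load(xml_text):
    """Return (regions, nodes, labels) as plain dictionaries."""
    root = ET.fromstring(xml_text)
    regions = {}
    for region in root.iter("region"):
        start, end = map(int, region.get("anchors").split())
        regions[region.get(XML_ID)] = (start, end)
    nodes = {}
    for node in root.iter("node"):
        link = node.find("link")
        nodes[node.get(XML_ID)] = link.get("targets").split() if link is not None else []
    labels = {}
    for annot in root.iter("a"):
        labels.setdefault(annot.get("ref"), []).append(annot.get("label"))
    return regions, nodes, labels


def node_text(node_id, regions, nodes, primary):
    """Slice the primary data for every region the node links to."""
    return [primary[start:end] for start, end in (regions[r] for r in nodes[node_id])]


regions, nodes, labels = load(LAF)
print(labels["n_9"], node_text("n_9", regions, nodes, PRIMARY))
```

The annotations never touch the text directly: they point at nodes, nodes point at regions, and only regions carry character positions.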

LAF from the outside
dirk:~/SURFdrive/laf-fabric-data/etcbc4b/laf > ls -lh
total 3195648
-rw-r--r-- 1 dirk staff 14K May 4 15:20 etcbc4b.hdr
-rw-r--r-- 1 dirk staff 12M May 4 15:08 etcbc4b.lst
-rw-r--r-- 1 dirk staff 5.1M May 4 15:08 etcbc4b.txt
-rw-r--r-- 1 dirk staff 1.6K May 4 15:20 etcbc4b.txt.hdr
-rw-r--r-- 1 dirk staff 106M May 4 15:09 etcbc4b_lingo.c.xml
-rw-r--r-- 1 dirk staff 107M May 4 15:09 etcbc4b_lingo.p.xml
-rw-r--r-- 1 dirk staff 148M May 4 15:09 etcbc4b_lingo.pa.xml
-rw-r--r-- 1 dirk staff 21M May 4 15:09 etcbc4b_lingo.s.xml
-rw-r--r-- 1 dirk staff 23M May 4 15:09 etcbc4b_lingo.sp.xml
-rw-r--r-- 1 dirk staff 298M May 4 15:09 etcbc4b_lingo.xml
-rw-r--r-- 1 dirk staff 642M May 4 15:08 etcbc4b_monads.lex.xml
-rw-r--r-- 1 dirk staff 125M May 4 15:08 etcbc4b_monads.xml
-rw-r--r-- 1 dirk staff 37M May 4 15:08 etcbc4b_regions.xml
-rw-r--r-- 1 dirk staff 36M May 4 15:08 etcbc4b_sections.xml
dirk:~/SURFdrive/laf-fabric-data/etcbc4b/laf > du -d1 -h
1.5G .
dirk:~/SURFdrive/laf-fabric-data/etcbc4b/laf >

LAF statistics
OPENED AT:2015-06-29T05-20-29
0.00s PARSING ANNOTATION FILES
8m 24s INFO: END PARSING
800,607 regions
1,437,355 nodes
2,223,873 edges
5,029,354 annots
30,757,007 features
9,491,189 distinct xml identifiers
8m 24s MODELING RESULT FILES
9m 36s WRITING RESULT FILES for m
CLOSED AT:2015-06-29T05-30-49

Metadata
Two headers:
• for the LAF resource
• for the text data
useful but utterly boring ... for now

Annotation metadata
just bookkeeping

Feature structures
TEI
ISOcat: metadata registry
- the ideal of machine interoperability
- added bureaucracy
- local copy needed anyway

Digging into LAF data
700,000 regions

LAF data: words
400,000 words with identification information

LAF data: linguistic props
1,400,000 nodes
2,300,000 edges

The ETCBC database
Eep Talstra 197?-2015
Wido van Peursen 2013-
Constantijn Sikkel
Janet Dyk
Reinoud Oosting
Oliver Glanz
...
2012 Eep Talstra, Constantijn Sikkel, Oliver Glanz, Reinoud Oosting, and Janet Dyk: Text database of the Hebrew Bible. DOI 10.17026/dans-x8h-y2bv. Restricted Access.
2014 Wido van Peursen, Eep Talstra, Constantijn Sikkel, Janet Dyk, Oliver Glanz, Reinoud Oosting, Gino Kalkman and Dirk Roorda: Hebrew Text Database ETCBC4. DOI 10.17026/dans-2z3-arxf. Open Access (CC BY-NC).
2015 Wido van Peursen, Constantijn Sikkel and Dirk Roorda: Hebrew Text Database ETCBC4b. DOI 10.17026/dans-z6y-skyh. Open Access (CC BY-NC).
Versions 3, 4 and 4b: archived!
Version 4s: to come

Parallel Passages
Work with Martijn Naaijer (ETCBC)
on historical linguistic variation.
See parallel on shebanq.

Verbal Valency
Work with Janet Dyk (ETCBC)
on verbal semantics.
See valence on shebanq.
[Flowchart for the verb נתן:
clause has complement? clause has direct object? clause has another direct object?
complement is indirect object? complement is locative?
determine primary object and secondary objects (pdo, sdos)]
Sense labels and glosses:
00  (act of) producing; yielding; giving (in itself)
0c  produce; yield; give
10  produce, yield, give
1i  give
1c  give ?or? place
1l  place
2   make {pdo} (to be (as)/to become/to do) {sdos}
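
The decision steps of the flowchart can be sketched as a small function. Everything here is a hedged reading of the slide, not the actual SHEBANQ or ETCBC code: the pairing of labels with glosses, the interpretation of the labels (first character is the number of direct objects, second character the kind of complement), and the function names are assumptions.

```python
# Glosses for the verb "give", keyed by sense label (pairing assumed from the slide).
SENSES = {
    "00": "(act of) producing; yielding; giving (in itself)",
    "0c": "produce; yield; give",
    "10": "produce, yield, give",
    "1i": "give",
    "1c": "give ?or? place",
    "1l": "place",
    "2": "make {pdo} (to be (as)/to become/to do) {sdos}",
}


def sense_label(n_direct_objects, complement=None):
    """Pick a label from the object count and the complement kind.

    complement is None, "indirect", "locative", or any other string
    for a complement of unspecified kind (hypothetical encoding).
    """
    if n_direct_objects >= 2:
        return "2"
    if complement is None:
        return f"{n_direct_objects}0"
    if n_direct_objects == 0:
        return "0c"
    return {"indirect": "1i", "locative": "1l"}.get(complement, "1c")


def gloss(n_direct_objects, complement=None, pdo="", sdos=""):
    """Look up the gloss and fill in primary and secondary objects if present."""
    return SENSES[sense_label(n_direct_objects, complement)].format(pdo=pdo, sdos=sdos)


print(sense_label(1, "indirect"), "->", gloss(1, "indirect"))
```

Keeping the label computation separate from the gloss table means the same flowchart logic could be reused for other verbs by swapping in a different table.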

SHEBANQ
System for HEBrew text:
ANnotations for Queries and Markup
Giving it to the users:
readers
language scholars
historical linguists
computational linguists
exegetes-hermeneutes
bible translators
publishers
computer scientists

LAF-Fabric
API for LAF-processing
explore - analyze - visualize
readthedocs
github.com/ETCBC/laf-fabric

Agility with formats
• Emdros: MQL format (text-database)
• LAF: XML format (graph)
• LAF-Fabric: binary Python datastructures
(dictionaries and arrays)
• R: binary rds format (data frame)
• Pandas: binary table format (data frame)
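
As a rough illustration of moving between these shapes, the sketch below holds node features in dictionaries and an array (the general idea named above, not LAF-Fabric's actual binary layout) and flattens them into a tab-separated table of the kind that R and Pandas can load as a data frame. The feature values and function names are made up for the example.

```python
import csv
import io
from array import array

# Graph-style storage: one dictionary per feature, keyed by node number.
features = {
    "otype": {1: "word", 2: "word", 3: "phrase"},
    "phrase_function": {3: "Objc"},
}
nodes = array("I", [1, 2, 3])   # nodes in a fixed order


def to_rows(nodes, features):
    """Yield one table row per node; missing feature values become empty cells."""
    names = sorted(features)
    for node in nodes:
        yield {"node": node, **{name: features[name].get(node, "") for name in names}}


def to_tsv(nodes, features):
    """Serialize the rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["node", *sorted(features)],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(to_rows(nodes, features))
    return buffer.getvalue()


print(to_tsv(nodes, features))
```

A table like this trades the graph structure for convenience: each row is self-contained, which suits statistics, while embedding and edges have to be encoded as extra columns if they are needed.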

Before I go ...
we really try to communicate research activity
when it happens
feeds !

dirk.roorda@dans.knaw.nl
https://shebanq.ancient-data.org
thank you
