Annotating the Hebrew
Bible
High Precision Philology in a Digital Space
Leipzig 2016-02-15/16
Dirk Roorda
דְּבַ֥ר
Text. What is it?
bᵊrēšˈîṯ bārˈā ʔᵉlōhˈîm ʔˌēṯ haššāmˈayim wᵊʔˌēṯ hāʔˈāreṣ .
Genesis 1:1
In the beginning God created the heavens and the earth.
A string of words ...
bᵊrēšˈîṯ bārˈā ʔᵉlōhˈîm ʔˌēṯ haššāmˈayim wᵊʔˌēṯ hāʔˈāreṣ .
... separated by spaces?
bᵊrēšˈîṯ bārˈā ʔᵉlōhˈîm ʔˌēṯ haššāmˈayim wᵊʔˌēṯ hāʔˈāreṣ .
bᵊrēšˈîṯ bārˈā ʔᵉlōhˈîm ʔˌēṯ haššāmˈayim wᵊʔˌēṯ hāʔˈāreṣ .
A string of letters ...
... in which alefbet?
bᵊrēšˈîṯ bārˈā ʔᵉlōhˈîm
ʔˌēṯ haššāmˈayim wᵊʔ
ˌēṯ hāʔˈāreṣ .
phonetic
hebrew with vowels and accents
בְּרֵאשִׁ֖ית בָּרָ֣א
אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם
וְאֵ֥ת הָאָֽרֶץ׃
hebrew consonantal
בראשית ברא
אלהים את השמים
ואת הארץ׃
etcbc transcription (full)
B.:- R;>CI73JT
B.@R@74>
>:ELOHI92JM >;71T
HA- C.@MA73JIM W:-
>;71T H@- >@75REY00
etcbc transcription (consonantal)
B R>CJT BR> >LHJM
>T H CMJM W >T H
>RY
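The consonantal transcription above maps each Hebrew consonant to one ASCII character. The following is a minimal sketch of converting it back to Hebrew script. The mapping table is an assumption based on the transcription shown on these slides and should be checked against the ETCBC documentation; `CONSONANTS`, `FINAL_FORMS` and `to_hebrew` are hypothetical names, not part of any ETCBC tool.

```python
# Assumed consonant table for the ETCBC consonantal transcription.
# Both F (sin) and C (shin) map to an undotted ש in consonantal script.
CONSONANTS = {
    ">": "א", "B": "ב", "G": "ג", "D": "ד", "H": "ה", "W": "ו",
    "Z": "ז", "X": "ח", "V": "ט", "J": "י", "K": "כ", "L": "ל",
    "M": "מ", "N": "נ", "S": "ס", "<": "ע", "P": "פ", "Y": "צ",
    "Q": "ק", "R": "ר", "F": "ש", "C": "ש", "T": "ת",
}

# Letters that take a different shape at the end of a word.
FINAL_FORMS = {"כ": "ך", "מ": "ם", "נ": "ן", "פ": "ף", "צ": "ץ"}


def to_hebrew(word: str) -> str:
    """Convert one transcribed word to Hebrew consonantal script."""
    letters = [CONSONANTS[ch] for ch in word]
    if letters and letters[-1] in FINAL_FORMS:
        letters[-1] = FINAL_FORMS[letters[-1]]
    return "".join(letters)


# The slide writes prefixes (B, H, W) as separate morphemes;
# here they are joined to the following word before conversion.
words = ["B" + "R>CJT", "BR>", ">LHJM", ">T", "H" + "CMJM", "W" + ">T", "H" + ">RY"]
print(" ".join(to_hebrew(w) for w in words))
```

Applying the final forms only at the end of a joined word is a design choice of this sketch: the transcription itself uses the same letter (`M`, `Y`) in every position.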
1. The Text itself (representations)
2. Linguistics (feature structures)
3. "Manual" (really manual or software-generated)
4. Queries (exegetical search)
layers of annotation
A text is ... in abstracto:
a sequence of objects with a notion of embedding
1994 Crist-Jan Doedens.
Text Databases.
One Database Model and Several
Retrieval Languages.
Ph.D. thesis. In Language and
Computers, Amsterdam.
See Google Books.
... in concreto:
all objects are sets of monads (the smallest elements)
all objects participate in spatial relationships
sequence - embedding - overlap - gap
all objects can carry unlimited features
a representation of a word is just a feature
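The monad model can be illustrated with a small Python sketch. It is not the Emdros implementation; `TextObject` and the four relation functions are hypothetical names, and numbering the eleven morphemes of Genesis 1:1 as monads 1 to 11 is an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TextObject:
    """A text object: a type, a set of monads, and free-form features."""
    otype: str
    monads: frozenset
    features: dict = field(default_factory=dict)


def embeds(outer: TextObject, inner: TextObject) -> bool:
    """Embedding: every monad of inner also belongs to outer."""
    return inner.monads <= outer.monads


def overlaps(a: TextObject, b: TextObject) -> bool:
    """Overlap: the objects share monads, but neither contains the other."""
    shared = a.monads & b.monads
    return bool(shared) and not (a.monads <= b.monads or b.monads <= a.monads)


def precedes(a: TextObject, b: TextObject) -> bool:
    """Sequence: a ends before b starts."""
    return max(a.monads) < min(b.monads)


def has_gap(obj: TextObject) -> bool:
    """Gap: the monads of the object are not one consecutive run."""
    return max(obj.monads) - min(obj.monads) + 1 != len(obj.monads)


# Assumed numbering: the 11 morphemes of Genesis 1:1 are monads 1..11.
clause = TextObject("clause", frozenset(range(1, 12)))
phrase = TextObject("phrase", frozenset(range(5, 12)), {"phrase_function": "Objc"})
word = TextObject("word", frozenset({2}), {"transcription": "R>CJT"})

print(embeds(clause, phrase), precedes(word, phrase), has_gap(clause))
```

In this sketch a representation of a word is indeed just a feature: the transcription sits in the `features` dictionary next to any linguistic property.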
... in practice:
this model has been implemented in a mature system
2002-2014 Ulrik Petersen.
Emdros.
Text database engine for storage and
retrieval of analyzed or annotated text.
Open Source Software.
See COLING paper 2004.
... for convenience:
an ISO standard captures a lot of this model
2012 Nancy Ide and Laurent Romary.
Linguistic Annotation Framework
(LAF).
ISO Standard 24612.
1. The Text itself (representations)
2. Linguistics (feature structures)
3. "Manual" (really manual or software-generated)
4. Queries (exegetical search)
layers of annotation
<node xml:id="n_88917">
<link targets="r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 r11"/>
</node>
<edge xml:id="e1" from="n88917" to="n84383"/>
<a xml:id="ae1" label="parents" ref="e1" as="link"/>
<region xml:id="r_2" anchors="6 23"/>
<node xml:id="n_3"><link targets="r_2"/></node>
<a xml:id="a_3" label="word" ref="n_3" as="monads"/>
<a xml:id="af22" label="ft" ref="n3" as="utf8"><fs>
<f name="lexeme_utf8" value="ראשׁית"/>
<f name="surface_consonants_utf8" value="ראשׁית"/>
</fs></a>
[Diagram: the primary data (Genesis 1:1 in Hebrew) is divided into regions (e.g. r9, r10, r11); nodes (e.g. n2, n3) link to regions; labeled edges (parents, mother) connect nodes; empty annotations label nodes as word, subphrase, phrase, clause, sentence; feature annotations attach values such as lexeme_utf8=ראשׁית, surface_consonants_utf8=ראשׁית, determination=determined, phrase_function=Objc, phrase_type=PP, clause_atom_number=1, clause_atom_relation=0, clause_atom_type=xQtl, indentation=0.]
Linguistic Annotation Framework
Let's go LAF
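To make the chain from annotation to node to region to primary data concrete, here is a small sketch using Python's standard `xml.etree.ElementTree`. It assumes a simplified fragment: a made-up `<graph>` wrapper without the namespace a real LAF file would declare, invented anchors, and the transcription of the first words as stand-in primary data. The reading of anchors as character offsets is also an assumption, and `node_texts` is a hypothetical helper, not part of LAF-Fabric.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is predefined, so ElementTree expands xml:id to this key.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Stand-in primary data and a simplified fragment (anchors are invented).
PRIMARY = "B R>CJT BR> >LHJM"
FRAGMENT = """
<graph>
  <region xml:id="r_2" anchors="2 7"/>
  <node xml:id="n_3"><link targets="r_2"/></node>
  <a xml:id="a_3" label="word" ref="n_3" as="monads"/>
</graph>
"""


def node_texts(xml_text: str, primary: str) -> dict:
    """Map each node id to (annotation label, text covered by its regions)."""
    root = ET.fromstring(xml_text)

    # Regions: id -> (start, end), read here as character offsets.
    regions = {}
    for region in root.iter("region"):
        start, end = (int(a) for a in region.get("anchors").split())
        regions[region.get(XML_ID)] = (start, end)

    # Empty annotations: the label names the kind of object a node is.
    labels = {a.get("ref"): a.get("label") for a in root.iter("a")}

    result = {}
    for node in root.iter("node"):
        node_id = node.get(XML_ID)
        link = node.find("link")
        targets = link.get("targets").split() if link is not None else []
        text = " ".join(primary[slice(*regions[t])] for t in targets)
        result[node_id] = (labels.get(node_id), text)
    return result


print(node_texts(FRAGMENT, PRIMARY))
```

For files of the sizes in this corpus, an incremental parser such as `ET.iterparse` would probably be a better fit than loading a whole document at once.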
LAF from the outside
dirk:~/SURFdrive/laf-fabric-data/etcbc4b/laf > ls -lh
total 3195648
-rw-r--r-- 1 dirk staff 14K May 4 15:20 etcbc4b.hdr
-rw-r--r-- 1 dirk staff 12M May 4 15:08 etcbc4b.lst
-rw-r--r-- 1 dirk staff 5.1M May 4 15:08 etcbc4b.txt
-rw-r--r-- 1 dirk staff 1.6K May 4 15:20 etcbc4b.txt.hdr
-rw-r--r-- 1 dirk staff 106M May 4 15:09 etcbc4b_lingo.c.xml
-rw-r--r-- 1 dirk staff 107M May 4 15:09 etcbc4b_lingo.p.xml
-rw-r--r-- 1 dirk staff 148M May 4 15:09 etcbc4b_lingo.pa.xml
-rw-r--r-- 1 dirk staff 21M May 4 15:09 etcbc4b_lingo.s.xml
-rw-r--r-- 1 dirk staff 23M May 4 15:09 etcbc4b_lingo.sp.xml
-rw-r--r-- 1 dirk staff 298M May 4 15:09 etcbc4b_lingo.xml
-rw-r--r-- 1 dirk staff 642M May 4 15:08 etcbc4b_monads.lex.xml
-rw-r--r-- 1 dirk staff 125M May 4 15:08 etcbc4b_monads.xml
-rw-r--r-- 1 dirk staff 37M May 4 15:08 etcbc4b_regions.xml
-rw-r--r-- 1 dirk staff 36M May 4 15:08 etcbc4b_sections.xml
dirk:~/SURFdrive/laf-fabric-data/etcbc4b/laf > du -d1 -h
1.5G .
dirk:~/SURFdrive/laf-fabric-data/etcbc4b/laf >
LAF statistics
OPENED AT:2015-06-29T05-20-29
0.00s PARSING ANNOTATION FILES
8m 24s INFO: END PARSING
800,607 regions
1,437,355 nodes
2,223,873 edges
5,029,354 annots
30,757,007 features
9,491,189 distinct xml identifiers
8m 24s MODELING RESULT FILES
9m 36s WRITING RESULT FILES for m
CLOSED AT:2015-06-29T05-30-49
Metadata
Two headers:
• for the LAF resource
• for the text data
useful but utterly boring ... for now
Annotation metadata
just bookkeeping
Feature structures
TEI
ISOcat: metadata registry
- the ideal of machine interoperability
- added bureaucracy
- local copy needed anyway
docs for
real people
Digging into LAF data
700,000 regions
LAF data: words
400,000 words with identification information
LAF data:
word features
LAF data: hierarchies
LAF data: linguistic props
1,400,000 nodes
2,300,000 edges
The ETCBC database
Eep Talstra 197?-2015
Wido van Peursen 2013-
Constantijn Sikkel
Janet Dyk
Reinoud Oosting
Oliver Glanz
...
2012 Eep Talstra, Constantijn Sikkel, Oliver Glanz,
Reinoud Oosting, and Janet Dyk: Text database of the
Hebrew Bible. DOI 10.17026/dans-x8h-y2bv. Restricted
Access.
2014 Wido van Peursen, Eep Talstra, Constantijn
Sikkel, Janet Dyk, Oliver Glanz, Reinoud Oosting, Gino
Kalkman and Dirk Roorda: Hebrew Text Database
ETCBC4. DOI 10.17026/dans-2z3-arxf. Open Access
(CC BY-NC)
2015 Wido van Peursen, Constantijn Sikkel and Dirk
Roorda: Hebrew Text Database ETCBC4b. DOI
10.17026/dans-z6y-skyh. Open Access (CC BY-NC)
archived!
4
3
4b
4s to come
1. The Text itself (representations)
2. Linguistics (feature structures)
3. "Manual" (really manual or software-generated)
4. Queries (exegetical search)
layers of annotation
Parallel Passages
Work with Martijn Naaijer (ETCBC)
on historical linguistic variation.
See parallel on shebanq.
Verbal Valency
Work with Janet Dyk (ETCBC)
on verbal semantics.
See valence on shebanq.
[Flowchart for the verb נתן]
Decision steps:
- clause has complement
- clause has direct object
- clause has another direct object
- complement is indirect object
- complement is locative
- determine primary object and secondary objects (pdo, sdos)
Sense labels and glosses:
00  (act of) producing; yielding; giving (in itself)
0c  produce; yield; give
10  produce, yield, give
1i  give
1c  give ?or? place
1l  place
2   make {pdo} (to be (as)/to become/to do) {sdos}
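One possible reading of this flowchart, sketched in Python: the sense label is derived from the number of direct objects and the kind of complement in the clause. The pairing of labels with conditions is an interpretation of the slide, not the authors' actual code, and `sense_label` with its parameter values is hypothetical.

```python
from typing import Optional


def sense_label(n_direct_objects: int, complement: Optional[str]) -> str:
    """Return a sense label for a clause with the verb under study.

    complement is None, "indirect_object", "locative" or "other".
    This mapping is an assumed reading of the flowchart on the slide.
    """
    if n_direct_objects >= 2:
        return "2"   # primary object plus secondary object(s)
    if complement is None:
        return f"{n_direct_objects}0"   # "00" or "10"
    if n_direct_objects == 0:
        return "0c"  # complement without a direct object
    if complement == "indirect_object":
        return "1i"
    if complement == "locative":
        return "1l"
    return "1c"      # some other complement


print(sense_label(1, "indirect_object"))  # the "give" reading
print(sense_label(1, "locative"))         # the "place" reading
```

Keeping the decision in one small function makes it easy to compare the labels against manually checked clauses.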
1. The Text itself (representations)
2. Linguistics (feature structures)
3. "Manual" (really manual or software-generated)
4. Queries (exegetical search)
layers of annotation
Queries
SHEBANQ
System for HEBrew text:
ANnotations for Queries and Markup
Giving it to the users:
readers
language scholars
historical linguists
computational linguists
exegetes-hermeneutes
bible translators
publishers
computer scientists
LAF-Fabric
API for LAF-processing
explore - analyze - visualize
readthedocs
github.com/ETCBC/laf-fabric
Agility with formats
• Emdros: MQL format (text-database)
• LAF: XML format (graph)
• LAF-Fabric: binary Python datastructures
(dictionaries and arrays)
• R: binary rds format (data frame)
• Pandas: binary table format (data frame)
The power of R
R: Getting the text back
R: Bigrams
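The talk showed this in R. As a rough Python counterpart, the sketch below counts bigrams over the consonantal transcription of Genesis 1:1 given earlier, using only the standard library. Treating each space-separated morpheme as a token is an assumption of the example.

```python
from collections import Counter

# Consonantal ETCBC transcription of Genesis 1:1, one token per morpheme.
tokens = "B R>CJT BR> >LHJM >T H CMJM W >T H >RY".split()

# A bigram is a pair of adjacent tokens.
bigrams = Counter(zip(tokens, tokens[1:]))

for (first, second), count in bigrams.most_common(3):
    print(first, second, count)
```

On a whole corpus the same `Counter` approach works if tokens are fed per sentence or per verse, so that bigrams do not cross boundaries.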
Before I go ...
we really try to communicate research activity
when it happens
feeds !
visuals
dirk.roorda@dans.knaw.nl
https://shebanq.ancient-data.org
thank you

Annotating the Hebrew Bible

  • 1. Annotating the Hebrew Bible High Precision Philology in a Digital Space Leipzig 2016-02-15/16 Dirk Roorda ‫֥ר‬ ַ‫ב‬ ְ‫דּ‬
  • 2. Text. What is it? bᵊrēšˈîṯ bārˈā ʔᵉlōhˈîm ʔˌēṯ haššāmˈayim wᵊʔˌēṯ hāʔˈāreṣ . Genesis 1:1 In the beginning God created the heavens and the earth.
  • 3. A string of words ... bᵊrēšˈîṯ bārˈā ʔᵉlōhˈîm ʔˌēṯ haššāmˈayim wᵊʔˌēṯ hāʔˈāreṣ .
  • 4. ... separated by spaces? bᵊrēšˈîṯ bārˈā ʔᵉlōhˈîm ʔˌēṯ haššāmˈayim wᵊʔˌēṯ hāʔˈāreṣ . bᵊrēšˈîṯ bārˈā ʔᵉlōhˈîm ʔˌēṯ haššāmˈayim wᵊʔˌēṯ hāʔˈāreṣ .
  • 5. A string of letters ... ... in which alefbet? bᵊrēšˈîṯ bārˈā ʔᵉlōhˈîm ʔˌēṯ haššāmˈayim wᵊʔ ˌēṯ hāʔˈāreṣ . phonetic
  • 6. hebrew with vowels and accents ‫֣א‬ ָ‫ר‬ ָ‫בּ‬ ‫ית‬ ֖ ִ‫אשׁ‬ ֵ‫ר‬ ְ‫בּ‬ ‫ם‬ִ‫֖י‬ ַ‫מ‬ָּ‫שׁ‬ ַ‫ה‬ ‫֥ת‬ ֵ‫א‬ ‫֑ים‬ ִ‫ֱֹלה‬‫א‬ ‫ץ׃‬ ֶ‫ר‬ ָֽ‫א‬ ָ‫ה‬ ‫֥ת‬ ֵ‫א‬ ְ‫ו‬
  • 7. hebrew consonantal ‫ברא‬ ‫בראשית‬ ‫השמים‬ ‫את‬ ‫אלהים‬ ‫הארץ׃‬ ‫ואת‬
  • 8. etcbc transcription (full) B.:- R;>CI73JT B.@R@74> >:ELOHI92JM >;71T HA- C.@MA73JIM W:- >;71T H@- >@75REY00
  • 9. etcbc transcription (consonantal) B R>CJT BR> >LHJM >T H CMJM W >T H >RY
  • 10. 1. The Text itself (representations) 2. Linguistics (feature structures) 3. "Manual" (really manual or software-generated) 4. Queries (exegetical search) layers of annotation
  • 11. A text is ... in abstracto: a sequence of objects with a notion of embedding 1994 Crist-Jan Doedens. Text Databases. One Database Model and Several Retrieval Languages. Ph.D. thesis. In Language and Computers, Amsterdam. See Google Books.
  • 12. ... in concreto: all objects are sets of monads (the smallest elements) all objects participate in spatial relationships sequence - embedding - overlap - gap all objects can carry unlimited features a representation of a word is just a feature
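A minimal sketch of the model on slide 12, assuming monads are plain integers and objects are frozen sets of them. The function names and the sample objects are hypothetical; Emdros itself expresses these relationships in its own query language, not in Python.

```python
def embeds(outer, inner):
    """Embedding: every monad of inner also belongs to outer."""
    return inner <= outer

def overlaps(a, b):
    """Overlap: shared monads, but neither object embeds the other."""
    return bool(a & b) and not a <= b and not b <= a

def precedes(a, b):
    """Sequence: a ends before b starts."""
    return max(a) < min(b)

def has_gap(obj):
    """Gap: the monads of the object are not contiguous."""
    return len(obj) != max(obj) - min(obj) + 1

# Hypothetical objects over monads 1..11 (one monad per word).
clause = frozenset(range(1, 12))
phrase = frozenset({8, 9, 10, 11})
word = frozenset({9})
split_phrase = frozenset({2, 3, 6})   # discontinuous object

# Features hang off objects; a written form is just one more feature.
features = {word: {"otype": "word", "surface": "H"}}

print(embeds(clause, phrase), has_gap(split_phrase))
```

Because objects are sets rather than spans, discontinuous phrases need no special treatment.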
  • 13. ... in practice: this model has been implemented in a mature system 2002-2014 Ulrik Petersen. Emdros. Text database engine for storage and retrieval of analyzed or annotated text. Open Source Software. See COLING paper 2004.
  • 14. ... for convenience: an ISO standard captures a lot of this model 2012 Nancy Ide and Laurent Romary. Linguistic Annotation Framework (LAF). ISO Standard 24612.
  • 15. layers of annotation: 1. The Text itself (representations) 2. Linguistics (feature structures) 3. "Manual" (really manual or software-generated) 4. Queries (exegetical search)
  • 16. Let's go LAF (Linguistic Annotation Framework). Building blocks shown on Genesis 1:1: primary data, regions, nodes (link to regions), labeled edges, annotations (features), annotations (empty).
    Nodes and edges: <node xml:id="n_88917"> <link targets="r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 r11"/> </node> <edge xml:id="e1" from="n88917" to="n84383"/> <a xml:id="ae1" label="parents" ref="e1" as="link"/>
    Regions and word nodes: <region xml:id="r_2" anchors="6 23"/> <node xml:id="n_3"><link targets="r_2"/></node> <a xml:id="a_3" label="word" ref="n_3" as="monads"/>
    Feature annotation: <a xml:id="af22" label="ft" ref="n3" as="utf8"><fs> <f name="lexeme_utf8" value="ראשׁית"/> <f name="surface_consonants_utf8" value="ראשׁית"/> </fs></a>
    Diagram labels: object types word, phrase, subphrase, clause, sentence; edge labels parents, mother; phrase features determination=determined, phrase_function=Objc, phrase_type=PP; clause atom features clause_atom_number=1, clause_atom_relation=0, clause_atom_type=xQtl, indentation=0; region anchors and identifiers 0-56-2392 72-91, r9 r10 r11, n2 n3.
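To make the region, node and annotation elements on slide 16 concrete, here is a small sketch that resolves nodes to stretches of primary data with Python's standard `xml.etree.ElementTree`. Everything in it is an assumption for illustration: the `<graph>` wrapper is simplified (no namespace), the primary text, anchors and identifiers are made up, and `resolve` is a hypothetical helper rather than the LAF-Fabric API.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical primary data and stand-off annotations in the slide's style.
PRIMARY = "BR>CJT BR> >LHJM"
LAF = """<graph>
  <region xml:id="r_1" anchors="0 6"/>
  <region xml:id="r_2" anchors="7 10"/>
  <node xml:id="n_1"><link targets="r_1"/></node>
  <node xml:id="n_2"><link targets="r_2"/></node>
  <node xml:id="n_3"><link targets="r_1 r_2"/></node>
  <a xml:id="a_1" label="word" ref="n_1" as="monads"/>
  <a xml:id="a_2" label="word" ref="n_2" as="monads"/>
  <a xml:id="a_3" label="phrase" ref="n_3" as="monads"/>
</graph>"""

def resolve(xml_text, primary):
    """Map each node id to (annotation label, text of its regions)."""
    root = ET.fromstring(xml_text)
    regions = {}
    for region in root.iter("region"):
        start, end = (int(x) for x in region.get("anchors").split())
        regions[region.get(XML_ID)] = (start, end)
    labels = {a.get("ref"): a.get("label") for a in root.iter("a")}
    result = {}
    for node in root.iter("node"):
        targets = node.find("link").get("targets").split()
        text = " ".join(primary[slice(*regions[t])] for t in targets)
        result[node.get(XML_ID)] = (labels.get(node.get(XML_ID)), text)
    return result

resolved = resolve(LAF, PRIMARY)
print(resolved)
```

The primary data is never modified: all structure lives in the stand-off elements that point into it.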
  • 17. LAF from the outside
    dirk:~/SURFdrive/laf-fabric-data/etcbc4b/laf > ls -lh
    total 3195648
    -rw-r--r-- 1 dirk staff  14K May 4 15:20 etcbc4b.hdr
    -rw-r--r-- 1 dirk staff  12M May 4 15:08 etcbc4b.lst
    -rw-r--r-- 1 dirk staff 5.1M May 4 15:08 etcbc4b.txt
    -rw-r--r-- 1 dirk staff 1.6K May 4 15:20 etcbc4b.txt.hdr
    -rw-r--r-- 1 dirk staff 106M May 4 15:09 etcbc4b_lingo.c.xml
    -rw-r--r-- 1 dirk staff 107M May 4 15:09 etcbc4b_lingo.p.xml
    -rw-r--r-- 1 dirk staff 148M May 4 15:09 etcbc4b_lingo.pa.xml
    -rw-r--r-- 1 dirk staff  21M May 4 15:09 etcbc4b_lingo.s.xml
    -rw-r--r-- 1 dirk staff  23M May 4 15:09 etcbc4b_lingo.sp.xml
    -rw-r--r-- 1 dirk staff 298M May 4 15:09 etcbc4b_lingo.xml
    -rw-r--r-- 1 dirk staff 642M May 4 15:08 etcbc4b_monads.lex.xml
    -rw-r--r-- 1 dirk staff 125M May 4 15:08 etcbc4b_monads.xml
    -rw-r--r-- 1 dirk staff  37M May 4 15:08 etcbc4b_regions.xml
    -rw-r--r-- 1 dirk staff  36M May 4 15:08 etcbc4b_sections.xml
    dirk:~/SURFdrive/laf-fabric-data/etcbc4b/laf > du -d1 -h
    1.5G .
    dirk:~/SURFdrive/laf-fabric-data/etcbc4b/laf >
  • 18. LAF statistics
    OPENED AT:2015-06-29T05-20-29
    0.00s PARSING ANNOTATION FILES
    8m 24s INFO: END PARSING
        800,607 regions
      1,437,355 nodes
      2,223,873 edges
      5,029,354 annots
     30,757,007 features
      9,491,189 distinct xml identifiers
    8m 24s MODELING RESULT FILES
    9m 36s WRITING RESULT FILES for m
    CLOSED AT:2015-06-29T05-30-49
  • 19. Metadata. Two headers: • for the LAF resource • for the text data. Useful but utterly boring ... for now
  • 21. Feature structures TEI ISOcat: metadata registry - the ideal of machine interoperability - added bureaucracy - local copy needed anyway
  • 23. Digging into LAF data 700,000 regions
  • 24. LAF data: words 400,000 words with identification information
  • 27. LAF data: linguistic props 1,400,000 nodes 2,300,000 edges
  • 28. The ETCBC database. Eep Talstra 197?-2015, Wido van Peursen 2013-, Constantijn Sikkel, Janet Dyk, Reinoud Oosting, Oliver Glanz ...
    2012 Eep Talstra, Constantijn Sikkel, Oliver Glanz, Reinoud Oosting, and Janet Dyk: Text database of the Hebrew Bible. DOI 10.17026/dans-x8h-y2bv. Restricted Access.
    2014 Wido van Peursen, Eep Talstra, Constantijn Sikkel, Janet Dyk, Oliver Glanz, Reinoud Oosting, Gino Kalkman and Dirk Roorda: Hebrew Text Database ETCBC4. DOI 10.17026/dans-2z3-arxf. Open Access (CC-BY NC).
    2015 Wido van Peursen, Constantijn Sikkel and Dirk Roorda: Hebrew Text Database ETCBC4b. DOI 10.17026/dans-z6y-skyh. Open Access (CC-BY NC).
    Archived versions: 3, 4, 4b; 4s to come.
  • 29. layers of annotation: 1. The Text itself (representations) 2. Linguistics (feature structures) 3. "Manual" (really manual or software-generated) 4. Queries (exegetical search)
  • 30. Parallel Passages Work with Martijn Naaijer (ETCBC) on historical linguistic variation. See parallel on shebanq.
  • 32. Verbal Valency. Work with Janet Dyk (ETCBC) on verbal semantics. See valence on shebanq. Example verb: נתן.
    Decision flowchart: clause has complement; has direct object; has another direct object; complement is indirect object; complement is locative; determine primary object and secondary objects (pdo, sdos).
    Senses: (act of) producing; yielding; giving (in itself) / produce; yield; give / produce, yield, give / give / give ?or? place / place / make {pdo} (to be (as)/to become/to do) {sdos}.
    Sense codes: 00 0c 10 1i 1c 1l 2
  • 34. layers of annotation: 1. The Text itself (representations) 2. Linguistics (feature structures) 3. "Manual" (really manual or software-generated) 4. Queries (exegetical search)
  • 36. SHEBANQ: System for HEBrew text: ANnotations for Queries and Markup. Giving it to the users: readers, language scholars, historical linguists, computational linguists, exegetes-hermeneutes, bible translators, publishers, computer scientists
  • 37. LAF-Fabric: API for LAF-processing. explore - analyze - visualize. Documentation on readthedocs; code at github.com/ETCBC/laf-fabric
  • 38. Agility with formats • Emdros: MQL format (text-database) • LAF: XML format (graph) • LAF-Fabric: binary Python datastructures (dictionaries and arrays) • R: binary rds format (data frame) • Pandas: binary table format (data frame)
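As a rough illustration of moving between the formats on slide 38, the sketch below holds a few words as compact Python arrays and dictionaries and then flattens them into table rows of the kind a data frame would store. The layout, the feature names and the values are hypothetical; this is not how LAF-Fabric actually organizes its binary data.

```python
from array import array

# Hypothetical compact storage: one array slot per word node.
starts = array("I", [0, 2, 8])     # start anchors in the primary data
ends = array("I", [1, 7, 11])      # end anchors in the primary data
features = {
    "surface": {0: "B", 1: "R>CJT", 2: "BR>"},
    "sp": {0: "prep", 1: "subs", 2: "verb"},
}

def to_rows(starts, ends, features):
    """Flatten arrays plus feature dictionaries into one row per node."""
    names = sorted(features)
    rows = []
    for node in range(len(starts)):
        row = {"node": node, "start": starts[node], "end": ends[node]}
        for name in names:
            row[name] = features[name].get(node)   # None if feature absent
        rows.append(row)
    return rows

rows = to_rows(starts, ends, features)
for row in rows:
    print(row)
```

A list of such rows is the shape that tabular tools such as R or Pandas expect as input.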
  • 40. R: Getting the text back
  • 42. Before I go ... we really try to communicate research activity when it happens: feeds!