When Printed Hypertexts Go Digital:
Information Extraction from the Parsing of Indices

M. Romanello, M. Berti, A. Babeu, G. Crane - The Perseus Project, Tufts University, Medford MA (USA)

Modern critical editions of ancient Classical works (i.e. Greek and Latin) generally include manually created indices of other sources quoted in the text. Our main assumption is that such indices can be considered a form of domain-specific language. We present a parsing-based approach to the problem of extracting information from them to support the creation of digital collections of those ancient texts. In particular, we show preliminary results of applying a fuzzy parser to the OCR transcription of an index of quotations in order to extract information from potentially noisy input.


Introduction: Indices of quotations

In recent years, mass digitization initiatives have made accessible the page images of an increasing number of modern editions. We can now access not only the text but also the paratextual apparatus of each digital edition, namely prefaces, notes, critical apparatuses and indices. The topic of converting printed scholarly materials to digital hypertexts is not new and has a long research history [3]. This paper proposes the automatic parsing of manually created indices scriptorum (i.e. indices of quotations) as a way to reuse the effort that scholars have spent over decades on identifying and indexing citations inside texts, in order to create new digital tools.

    Figure 1: Excerpt of the index of quotations drawn from Kaibel's edition of Athenaeus' Deipnosophistae (1887-1890), available on Google Books.

The indices of quotations found in many modern critical editions of classical authors can be thought of as the hypertext through which an editor creates internal links to those passages in the edited work that contain quotations from other ancient sources. These indices also provide outward links to the entire body of classical literature by listing quotations of other surviving works.


INDEX PARSING AND INFORMATION EXTRACTION

An index of quotations is mainly structured as a series of entries, where each entry corresponds to an author and the name of the author constitutes the so-called lemma, namely the headword. For each author the editor lists every work of that author that is cited in the text.

The main assumption for building a parser of printed indices is that an index constitutes a domain-specific language and that the syntactic disposition of its lexical components is subject to a grammar of rules that can be defined in advance (cf. [1]). After the lexical elements of the index were identified, a grammar of syntactic rules was specified, defining for each rule the corresponding sequence of tokens to be matched (see List. 1).

Listing 1: Example of parser code

    grammar G;
    options {
        tokenVocab=L; language=Java; backtrack=true; output=AST;
    }
    [...]
    index
        : entry+
        ;
    entry
        : noise? author_name epiclesis* work_entry+ crossreftoentry*
          { System.out.println("*** NEW ENTRY BEGINS ***"); }
          -> ^(ENTRY ^(NAME author_name epiclesis*) work_entry+)
        ;
    author_name
        :
        // (a=CAPITAL_LATIN_WORD { System.out.println($a.text + "{" + $a.line + "," + $a.pos + "}"); })+
          CAPITAL_LATIN_WORD+ { System.out.println("Author name: " + $author_name.text); }
        | WRONG_CAPITAL_LATIN_WORD+ { System.out.println("Author name: " + $author_name.text); }
        ;
    work_entry
        : worktitle workoccurrence+ -> ^(WORK worktitle workoccurrence+)
        | workoccurrence+ -> ^(WORK workoccurrence+)
        [...]
        ;

This grammar, expressed in the Extended Backus-Naur Form (EBNF), was then transformed into a Java parser by using the ANTLR parser generator [2].

The developed framework consists of 1) an OCR transcription of the index to be parsed, with an accuracy of 96.73%, produced from a PDF by an installation of Google's OCRopus specifically trained on Ancient Greek; 2) the Index Parser, consisting of both a lexer and a set of grammar rules; 3) a set of external knowledge sources (such as name lists) that can be used for error correction.

    Figure 2: Overall framework


Results and conclusions

The result of parsing the index of quotations is a machine-actionable tree (see List. 2) representing its hierarchical structure, where each author and work quoted in the text is associated with a reference indicating its position in the text. Starting from this tree it is possible to tag the quotations in the text automatically and thus to reconstruct the hypertextual links between the index and the text.

Listing 2: Example of XML output of the index parsing process

    <?xml version="1.0" encoding="UTF-8"?>
    <index>
        <entry id="N65541">
            <name>
                <text id="N65546" length="8" line="1" startchar="0" type="capitallatinword">AETHLIVS</text>
                <s>AETHLIUS</s>
            </name>
            <work id="N65563">
                <title id="N65566">
                    <trasncript>[some greek text]</trasncript>
                </title>
                <occurrence id="N65594">
                    <edstatm id="N65596">
                        <text id="N65599" length="7" line="2" startchar="24" type="annotation">(fr. 2)</text>
                    </edstatm>
                    <intref id="N65605">
                        <target type="singlevalue" value="653f">653f</target>
                    </intref>
                </occurrence>
            </work>
        </entry>
        [...]
    </index>

In conclusion, indices of quotations are worth parsing because the information they contain allows us to 1) reconstruct internal and external links between different texts and 2) extract semantic information such as lists of names, epithets of authors, titles of works and the canonical citations used by scholars.


References

[1] F. Boschetti. Methods to extend Greek and Latin corpora with variants and conjectures: Mapping critical apparatuses onto reference text. In Proceedings of the Corpus Linguistics Conference (CL2007), 2007.

[2] T. J. Parr and R. W. Quong. ANTLR: a predicated-LL(k) parser generator. Software Practice and Experience, 25:789–810, 1995.

[3] D. R. Raymond and F. W. Tompa. Hypertext and the New Oxford English Dictionary. In Proceedings of the ACM Conference on Hypertext, pages 143–153, Chapel Hill, North Carolina, United States, 1987. ACM.
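The poster does not say how the external name lists (item 3 of the framework) are applied during error correction. As a minimal sketch, assuming that a misread author name is snapped to the nearest entry of a name list by Levenshtein edit distance, the idea could look as follows. The class `NameCorrector`, its methods, the sample names and the distance threshold are all hypothetical and are not part of the authors' framework.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical helper: maps an OCR'd author name to the closest known name. */
final class NameCorrector {

    /** Levenshtein distance computed with two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Returns the closest name from the list, or empty when nothing lies
     * within maxDistance edits (assumed threshold, to avoid wild guesses).
     */
    static Optional<String> correct(String ocrName, List<String> knownNames, int maxDistance) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : knownNames) {
            int d = editDistance(ocrName, candidate);
            if (d < bestDistance) {
                best = candidate;
                bestDistance = d;
            }
        }
        return bestDistance <= maxDistance ? Optional.of(best) : Optional.empty();
    }

    public static void main(String[] args) {
        List<String> names = List.of("AESCHYLUS", "AETHLIUS", "ALCAEUS"); // sample list
        System.out.println(correct("AETHLIVS", names, 2)); // Optional[AETHLIUS]
        System.out.println(correct("XENOPHON", names, 2)); // Optional.empty
    }
}
```

A threshold keeps the corrector from forcing every unknown token onto some listed name; a real system would probably also need to weigh typical OCR confusions differently from arbitrary substitutions.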

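Once the parse tree has been serialized as in Listing 2, the links from each author to the cited passages can be read back with standard XML tooling. The sketch below assumes the element names visible in Listing 2 (`entry`, `s` for the normalized author name, `target` with a `value` attribute); the class `IndexLinks` and the "author -> reference" output format are illustrative assumptions, not the authors' code.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical reader for XML shaped like Listing 2. */
final class IndexLinks {

    /** Returns one "AUTHOR -> reference" string per target element. */
    static List<String> extractLinks(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse index XML", e);
        }
        List<String> links = new ArrayList<>();
        NodeList entries = doc.getElementsByTagName("entry");
        for (int i = 0; i < entries.getLength(); i++) {
            Element entry = (Element) entries.item(i);
            // <s> carries the normalized author name in Listing 2.
            Node normalized = entry.getElementsByTagName("s").item(0);
            if (normalized == null) {
                continue; // skip entries without a normalized name
            }
            String author = normalized.getTextContent().trim();
            NodeList targets = entry.getElementsByTagName("target");
            for (int j = 0; j < targets.getLength(); j++) {
                Element target = (Element) targets.item(j);
                links.add(author + " -> " + target.getAttribute("value"));
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String xml = "<index><entry id=\"N65541\">"
                + "<name><text>AETHLIVS</text><s>AETHLIUS</s></name>"
                + "<work><occurrence><intref>"
                + "<target type=\"singlevalue\" value=\"653f\">653f</target>"
                + "</intref></occurrence></work>"
                + "</entry></index>";
        System.out.println(extractLinks(xml)); // [AETHLIUS -> 653f]
    }
}
```

Each resulting pair corresponds to one hypertextual link of the kind described in the results: a quoted author on one side and a position in the edited text on the other.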