Text::Perfide::BookCleaner,                                           a Perl module to clean and normalize plain text books...
Upcoming SlideShare
Loading in …5
×

Poster - Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text books

584 views

Published on

Poster for Text::Perfide::Bookcleaner presented at SEPLN2011 - http://goo.gl/iMNML

Published in: Technology, Education
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
584
On SlideShare
0
From Embeds
0
Number of Embeds
4
Actions
Shares
0
Downloads
1
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Poster - Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text books

  1. 1. Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text books Andr´ Santos, Jos´ Jo˜o Almeida e e a Table 2: Example excerpt before and after processed with sections Abstract 1 PRIMEIRA PARTE 1 _sec+N:part=1_ PRIMEIRA PARTE Several tasks which include text documents processing, such as text alignment, format 2 2 3 FANTINE conversions or information retrieval are heavily affected by the state of the original docu- 4 3 FANTINE 4 ments. 5 5 _sec+N:book=1_ LIVRO PRIMEIRO 6 ^L LIVRO PRIMEIRO Text::Perfide::BookCleaner is a Perl module to pre-process plain text books and 6 7 ⇒7 UM JUSTO clean them, getting them prepared for further arbitrary use. 8 UM JUSTO 8 Cleaning tasks include removing page breaks, page numbers, headers and footers; find- 9 9 O abade Myriel 10 O abade Myriel ing section titles and boundaries; removing footnotes and normalizing paragraph notation 11 10 11 Em 1815, era bispo de Digne, o and Unicode characters. The process is guided with the help of declarative objects such as 12 Em 1815, era bispo de Digne, o 12 reverendo Carlos Francisco Bemvindo 13 reverendo Carlos Francisco Bemvindo ontologies and configuration files. 2.3 Paragraphs1 Motivation Several possible notations for paragraphs: Text processing (and particularly book alignment) are tasks whose perfor- new-line separated, blank-line separated, Algorithmmance is often negatively affected by: indented or not. Direct speech can also be Function paragraphs(text){ Calculate text’s word and delimited with quotation marks or preceded line metrics • Page structure • Paragraphs with a dash. Find paragraph notation • Section titles and boundaries • Character encoding etc. Some metrics based on the number of Normalize paragraph notation • Footnotes words, white spaces, empty and non-empty } lines can be used to infer the type of no- tation used, which can then be normalized.2 Book cleaner 2.4 Footnotes Footnotes are formed by a call mark in the middle of the text, and the matching foot- Algorithm note expansion, which generally appears at Function footnotes(text) lines = lines from text the end of the same page. fnpattern = <<int>>, [int], ^int, ... Footnote expansions can often be found for ( x ∈ dom(lines) at the bottom of the page or, sometimes, at : x begins with fnpattern the end of the document. ∧ x is immediately before or after page break ){ Footnote call marks appear in the middle Replace x in text with custom mark of the text, usually consisting of some kind Add x to standoff file of numeration highlighted by some special } symbol. for ( y ∈ text : y ∈ fnpattern ){ Replace y in text with custom mark Add y to standoff file } } Table 3: Example excerpt before and after processed with footnotes 1 aupr`s de la rue Saint-Antoine, e2.1 Pages 2 a la porte des Tournelles[1]. ` 1 aupr`s de la rue Saint-Antoine, a e ` Structural elements left from a document’s previous page division: page breaks, page num- 3 [1] La Bastille, qui fut prise par le 2 la porte des Tournelles_fnr29_.bers, headers and footers. 4 peuple de Paris, le 14 juillet 1789, ⇒3 _fne8_ Algorithm 5 puis d´molie. B. e 4 ^L Quel etait en chemin l’´tonnement ´ e Function pages(text){ 5 de l’Ing´nu! je vous le laisse a e ` page breaks = { page break characters ∪ page numbers } 6 ^L Quel etait en chemin l’´tonnement ´ e for ( x ∈ { page breaks }){ 7 de l’Ing´nu! je vous le laisse a e ` prec ← lines preceding x ; footers = footers ∪ { prec } foll ← lines following x ; headers = headers ∪ { foll } } Ignore headers/footers with number of occurrences < threshold 2.5 Words and characters for ( y ∈ { headers ∪ footers }){ Sanitizing elements such as: Replace headers/footers with custom mark Add headers/footers to standoff file • Unicode characters • Translineation } } • Text encoding • Transpagination • Glyphs Table 1: Example excerpt before and after processed with pages 1 le nabab, et assez audacieux 2.6 Commit 2 pour s’emparer de sa personne. Final step, which irreversibly removes all marks left behind by the previous steps and 3 1 le nabab, et assez audacieux outputs the clean version of the text. 4 Page 3 2 pour s’emparer de sa personne. _pb2_ 5 ^L La maison a vapeur Jules Verne ` ⇒3 Le faquir, - evidemment le seul ´ 6 4 entre tous que ne surexcitˆt pas a 3 Sections Ontology 7 Le faquir, - evidemment le seul ´ An ontology is used to store information relevant to the sectioning process: types of sec- 8 entre tous que ne surexcitˆt pas a tions and their relations, common section names, and ordinal and cardinal numbers. Each element in the ontology is translated to several different languages.2.2 Sections Table 4: Excerpts from the sections ontology. Delimiting sections allows to: cap cena end 3 Algorithm PT cap´tulo, cap, ı PT cena PT fim PT terceiro, • automatically generate tables of con- capitulo FR sc`ne e FR fin terceira, Function sections(text,ontology){ tents; lines = lines from text FR chapitre, chap EN scene EN the_end trˆs, tres e • identify sections from one version of a for ( x ∈ lines: x ∈ ontology ){ EN chapter, chap BT act BT _alone EN third, three book that are missing on another ver- Add mark to text (before x) RU глава FR troisi`me, e sion; } NT sec troisieme • matching and comparing the sections } BT _numeral of two books before aligning them • ontology is being used to directly create the Perl code containing the complex data struc- (alignability); tures which are used in the process of sectioning; • perform additional processing on specific • the code is more readable and easier to maintain; sections. • the ontology can be used to discuss details with people with no programming experience. http://per-fide.di.uminho.pt/

×