Text::Perfide::BookCleaner, a Perl module to clean and                       normalize plain text booksText::Perfide::BookCl...
that may negatively affect book alignment               guide the alignment process by pairingare:                         ...
ing the user to understand the problems de-     tected in the book, and what was performed     in order to solve them.    ...
mark. If not, it tries to find occurrences of             formally represent the hierarchy of divisions    page numbers. Th...
the same paragraph are separated only by the              2.2.4 Footnotespunctuation mark and a space.                    ...
After removing the footnote, the example          by a dash, a newline and more word charac-    given above would look lik...
The recognition of these elements is            the ontology will allow to establish hierar-    language-dependent. As suc...
Translation units                                 demonstrated an increase in the number ofThe alignment of texts resulted...
Evert, S. 2001. The CQP query language  tutorial. IMS Stuttgart, 13.Gupta, R. and S. Ahmed. 2007. Project Pro-  posal Apac...
Upcoming SlideShare
Loading in...5

Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text books


Published on

Published in: Technology
  • Be the first to comment

  • Be the first to like this

No Downloads
Total Views
On Slideshare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text books

  1. 1. Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text booksText::Perfide::BookCleaner, un m´dulo Perl para limpieza de libros em texto o plano Andr´ Santos e Jos´ Jo˜o Almeida e a Departamento de Inform´tica a Departamento de Inform´tica a Universidade do Minho Universidade do Minho Braga, Portugal Braga, Portugal pg15973@alunos.uminho.pt jj@di.uminho.pt Resumen: En este trabajo se presenta Text::Perfide::BookCleaner, un m´dulo Perl o para pre-procesamiento de libros en formato de texto plano para posterior alineamiento u otros usos. Tareas de limpieza incluyen: eliminaci´n de saltos de p´gina, n´meros de o a u p´gina, encabezados y pies de p´gina; encontrar t´ a a ıtulos y fronteras de las secciones; elimi- naci´n de las notas de pie de p´gina y normalizaci´n de la notaci´n de p´rrafo y caracteres o a o o a Unicode. El proceso es guiado con la ayuda de objetos declarativos como ontolog´ y ıas ficheros de configuraci´n. Se realiz´ una evaluaci´n comparativa de las alineaciones con y o o o sin Text::Perfide::BookCleaner sobre la cual se presentan los resultados y conclusiones. Palabras clave: corpora paralelos, alineamiento de documentos, herramientas de PLN Abstract: This paper presents Text::Perfide::BookCleaner, an application to pre- process plain text books and clean them for any further arbitrary use, e.g., text alignment, format conversion or information retrieval. Cleaning tasks include removing page breaks, page numbers, headers and footers; finding section titles and boundaries; removing footnotes and normalizing paragraph notation and Unicode characters. The process is guided with the help of declarative objects such as ontologies and configuration files. A comparative evaluation of alignments with and without Text::Perfide::BookCleaner was performed, and the results and conclusions are presented. Keywords: parallel corpora, document alignment, NLP tools1 Introduction 1.1 ContextIn many tasks related to natural language The work presented in this paper has beenprocessing, text is the starting point, and the developed within Project Per-Fide, a projectquality of results that can be obtained de- which aims to build a large parallel cor-pends strongly on the text quality itself. pora (Ara´jo et al., 2010). This process u Many kinds of noise may be present in involves gathering, preparing, aligning andtexts, resulting in a wrong interpretation of: making available for query thousands of doc- uments in several languages. • individual characters, The documents gathered come from a wide range of sources, and are obtained in several • words, phrases, sentences, different formats and conditions. Some of these documents present specific issues which • paragraph frontiers, prevent them to be successfully aligned. • pages, headers, footers and footnotes, 1.2 Motivation • titles and section frontiers. The alignment of text documents is a pro- cess which depends heavily on the format The variety of existing texts is huge, and and conditions of the documents to bemany of the problems are related to specific aligned (V´ronis, 2000). When processing etypes of text. documents of literary type (i.e. books), there In this document, the central concern is are specific issues which give origin to addi-the task of cleaning text books, with a special tional difficulties.focus on making them ready for alignment. The main characteristics of the documents
  2. 2. that may negatively affect book alignment guide the alignment process by pairingare: the sections first (structural alignment).File format: Documents available in PDF These problems are often enough to derail or Microsoft Word formats (or, gen- the alignment process, producing unaccept- erally, any structured or unstructured able results, and are found frequently enough format other than plain text) need to justify the development of a tool to pre- to be converted to plain text be- process the books, cleaning and preparing fore being processed, using tools such them for further processing. as pdftotext (Noonburg, 2001) or There are other tools available which Apache Tika (Gupta and Ahmed, 2007; also address the issue of preparing corpora Mattmann and Zitting, 2011). This con- for alignment, such as the one described version often leads to loss of informa- by (Okita, 2009); or text cleaning, such as tion (text structure, text formatting, im- TextSoap, a program for cleaning text in sev- ages, etc) and introduction of noise (bad eral formats (UnmarkedSoftware, 2011). conversion of mathematical expressions, diacritics, tables and other) (Robinson, 2 Book Cleaner 2001); In order to solve the problems described in the previous section, the first approach wasText encoding format: There are many to implement a Perl script which, with the different available text encoding for- help of UNIX utilities like grep and some reg- mats. For example, Western-European ular expressions, attempted to find the un- languages can be encoded as ISO-8859- desired elements and normalize, replace or 1 (also known as Latin1), Windows CP delete them. As we started to have a bet- 1252, UTF-8 or others. Discovering the ter grasp on the task, it became clear that encoding format used in a given docu- such a naive approach was not enough. ment is frequently not a straightforward It was then decided to develop a task, and Dealing with a text while as- Perl module, Text::Perfide::BookCleaner suming the wrong encoding may have a (T::P::BC), divided in separate functions, negative impact on the results; each one dealing with a specific problem.Structural residues: Some texts, despite being in plain text format, still contain 2.1 Design Goals structural elements from their previous T::P::BC was developed with several design format (for example, it is common to find goals in mind. Each component of the module page numbers, headers and footers in the meets the following three requirements: middle of the text of documents which Optional: depending on the conditions of have been converted from PDF to plain the books to be cleaned, some steps of text); this process may not be desired or neces-Unpaired sections: Sometimes one of the sary. Thus, each step may be performed documents to be aligned contains one independently from the others, or not be or more sections which are not present performed at all; in the other document. It is a quite Chainable: the functions can be used in se- common occurrence with forewords to a quence, with the output of one function given edition, preambles, etc. passing as input to the next, with no in-Sectioning notation: There are endless termediary steps required; ways to represent and numerate the Reversible: as much as possible, no infor- several divisions and subdivisions of mation is lost (everything removed from a document (parts, chapters, sections, the original document is kept in separate etc). Several of these notations are files). This means that, at any time, it is language-dependent (e.g. Part One or possible to revert the book to a previous Premi`re Part). As such, aligning sec- e (or even the original) state. tion titles is not a trivial task. Identi- fying these marks can not only help to Additionally, after the cleaning process, a prevent this, but it can even be used to diagnostic report is generated, aimed at help-
  3. 3. ing the user to understand the problems de- tected in the book, and what was performed in order to solve them. Below is presented an example of the di- agnostic produced after cleaning a version of La maison ` vapeur from Jules Verne: a1 footers = [’( Page _NUM_ ) = 240’];2 headers=$VAR1 =3 ["(La maison x{e0} vapeur Jules Verne) = 241"];4 ctrL=1;5 pagnum_ctrL=241;6 sectionsO=2;7 sectionsN=30;8 word_per_emptyline=3107.86842105263;9 emptylines=37;10 word_per_line=11.5658603466849;11 lines=10211; Figure 1: Pipeline of T::P::BC.12 To_be_indented=1;13 words=118099;14 word_per_indent=52.2793271359008; 2.2.1 Pages15 lines_w_pont=3263; Page breaks, headers and footers are struc- tural elements of pages that must be dealt In this diagnostic, we can see, among other with care when automatically processing things, that this document had headers com- texts. In the particular case of bitext align- posed by the book title and author name, ment, the presence of these elements is often footers contained the word Page followed by enough to confuse the aligner and decrease its a number (presumably the page number) and performance. The headers and footers usu- that the 241 pages were separated with ˆL . ally contain information about the book au- thor, book name and section title. Despite The process is also configurable with the the fact that the information provided by this help of an ontology, and the use of configura- elements might be relevant for some tasks, tion files written in domain-specific languages they should be identified, marked and possi- is also planned. Both are described in sec- bly removed before further processing. tion 2.3. Below is presented an extract of a plain text version of La Maison ` Vapeur, from a 2.2 Pipeline Jules Verne, containing, from line 3 to 6, a We decided to tackle the problem in five dif- page break (a footer, a page break character ferent steps: pages, sections, paragraphs, foot- and a header, surrounded by blank lines). notes, and words and characters. Each of 1 rencontrer le nabab, et assez audacieux pour these stages deals with a specific type of prob- 2 s’emparer de sa personne. lems commonly found in plain text books. A 3 commit option is also available, a final step 4 Page 3 which irreversibly removes all the markup 5 ^L La maison ` vapeur a Jules Verne 6 added along the cleaning process. The dif- 7 Le faquir, - ´videmment le seul entre tous e ferent steps of this process are implemented 8 que ne surexcitˆt pas l’espoir de gagner la a as functions in T::P::BC. Figure 1 presents the pipeline of the The 0xC character (also known as T::P::BC module. Control+L, ^L or f), indicates, in plain text, T::P::BC accepts as input documents in a page break. When present, these provide plain text format, with UTF-8 text encoding a reliable way to delimit pages. Page num- (other encoding formats are also supported). bers, headers and footers should also be found The final output is a summary report, and near these breaks. There are books, however, the cleaned version of the original document, which do not have their pages separated with which may then be used in several processes, this character. such as alignment (Varga et al., 2005; Ev- Our algorithm starts by looking for page ert, 2001), ebook generation (Lee, Gutten- break characters. If it finds them, it immedi- berg, and McCrary, 2002), or others. ately replaces them with a custom page break
  4. 4. mark. If not, it tries to find occurrences of formally represent the hierarchy of divisions page numbers. These are typically lines with of a book (chapters, sections, etc). As such, only three or less digits (four digits would oc- the notation used to write the titles of the casionally get year references confused with several divisions is dictated by the personal page numbers), and a preceding and a fol- choices of whoever transcribed the book to lowing blank lines. electronic format, or by their previous format After finding the page breaks, the algo- and the tool used to perform the conversion rithm attempts to identify headers and foot- to plain text. Additionally, the nomenclature ers: it considers the lines which are near each used is often language-dependent (e.g. Part page break as candidates to headers or footers One and Premi`re Part or Chapter II and e (depending on whether they are after or be- Cap´ıtulo 2 ). fore the break, respectively). Then the num- Below is presented the beginning of the ber of occurrences of each candidate (normal- first section of a Portuguese version of Les ized so that changes in numbers are ignored) Miserables, from Vitor Hugo: is calculated, and those which occur more 1 PRIMEIRA PARTE than a given threshold are considered headers 2 or footers. 3 FANTINE 4 At the end, the headers and footers are 5 moved to a standoff file, and only the page 6 ^L LIVRO PRIMEIRO break mark is left in the original text. 7 A cleaned up version of the previous ex- 8 UM JUSTO ample is presented below: 9 10 O abade Myriel 111 est vrai qu’il fallait ˆtre assez chanceux pour e 12 Em 1815, era bispo de Digne, o reverendo Carlos2 rencontrer le nabab, et assez audacieux pour 13 Francisco Bemvindo Myriel, o qual contava setenta3 s’emparer de sa personne. _pb2_4 Le faquir, - ´videmment le seul entre tous e5 que ne surexcitˆt pas l’espoir de gagner la a In order to help in the sectioning pro-6 prime, - filait au milieu des groupes, s’arrˆtant e cess, an ontology was built (see section 2.3.1), which includes several section types and their 2.2.2 Sections relation, names of common sections and ordi- There are several reason that justify the im- nal and cardinal numbers. portance of delimiting sections in a book: The algorithm tries to find lines contain- ing section names, pages or lines containing • automatically generate tables of con- only numbers, or lines with Roman numer- tents; als. Then a custom section mark is added • identify sections from one version of a containing a normalized version of the origi- book that are missing on another ver- nal title (e.g. Roman numerals are converted sion (or example, it is common for some to Arabic numbers). editions of a book to have an exclusive The processed version of the example preamble or afterword which cannot be shown above follows: found in the other editions); 1 _sec+N:part=1_ PRIMEIRA PARTE • matching and comparing the sections of 2 two books before aligning them. This 3 FANTINE 4 allows to assess the books alignability 5 _sec+N:book=1_ LIVRO PRIMEIRO (compatibility for alignment), predict 6 the results of the alignment and even to 7 UM JUSTO 8 simply discard the worst sections; 9 O abade Myriel 10 • in areas like biomedical text mining, be- 11 Em 1815, era bispo de Digne, o reverendo Carlos ing able to detect the different sections of 12 Francisco Bemvindo Myriel, o qual contava setenta a document allows to search for specific information in specific sections (Agarwal 2.2.3 Paragraphs and Yu, 2009); When reading a book, we often assume that Being an unstructured format, plain text a new line, specially if indented, represents by itself does not have the necessary means to a new paragraph, and that sentences within
  5. 5. the same paragraph are separated only by the 2.2.4 Footnotespunctuation mark and a space. Footnotes are formed by a call mark inserted However, when dealing with plain text in the middle of the text (often appended tobooks, several other representations are pos- the end of a word or phrase), and the footnotesible: some books break sentences with a expansion (i.e. the text of the footnote). Thenew line and paragraphs with a blank line placement of the footnote expansion shifts(two consecutive new line characters); others from book to book, but generally appears ei-use indentation to represent paragraphs; and ther at the end of the same page where its callthere are even books where new line charac- mark can be found, or at a dedicated sectionters are used to wrap text. There are also at the end of the document.books where direct speech (from a charac- Depending on the notation used to repre-ter of the story) is represented with a trailing sent footnote call marks, these can introducedash; others use quotation marks to the same noise in the alignment (for example, if theend. aligner confuses them with sentence breaks). The detection of paragraph and sentence At best, the alignment might work well, butboundaries is of the utmost importance in the the resulting corpora will be polluted with thealignment process, and a wrong delimitation call marks, interfering with its further use.is often responsible for a bad alignment. Footnote expansions are even more likely to In order to be able to detect how para- disturb the process, as it is very unlikely thatgraphs and sentences are represented, our al- the matching footnotes are found at the samegorithm starts by measuring several parame- place in both books (the page divisions wouldters, such as the number of words, the num- have to be exactly the same).ber of lines, the number of empty lines and Below is presented an example of a pagethe number of indented lines. containing footnote call marks and, at the Then several metrics are calculated based end, footnote expansions, followed by the be-on these parameters, which are used to under- ginning of another page:stand how paragraphs are represented, andthe text is fixed accordingly. 1 roi Charles V, fils de Jean II, aupr`s de la rue e 2 Saint-Antoine, ` la porte des Tournelles[1]. a 3 4 [1] La Bastille, qui fut prise par le peuple de Algorithm 1: Paragraph 5 Paris, le 14 juillet 1789, puis d´molie. B. eInput: txt:text of the book 6Output: txt:text with improved paragraph 7 ^L Quel ´tait en chemin l’´tonnement de l’Ing´nu! e e e 8 je vous le laisse ` penser. Il crut d’abord que aelines ← ...calculate empty lineslines ← ...calculate number of lineswords ← ...calculate number of words The removal of footnotes is performed inplines ← ...number of lines with punctuation two steps: first the expansions are removed,indenti ← ...calculate indentation distrib then the call marks. The devised algorithmplr ← plines // punctuated lines ratio lines searches for lines starting with a plausibleforall the i ∈ dom(indent) do pattern (e.g. <<1>>, [2] or ^3), and followed if i > 10 ∨ indenti < 12 then by blank lines. These are considered footnote remove(indenti ) // remove false indent expansions, are consequently replaced by a custom footnote mark and moved to a stand-wpl ← words lines // word per line off file. wordswpel ← 1+elines // word per empty line Once the footnote expansions have been wordswpi ← indent // word per indent removed, the remaining marks (following theif wpel > 150 then same patterns) are likely to be call marks, and if wpi ∈ [10..100] then as such, removed and replaced by a custom Separate parag. by indentation normalized mark1 if wpl ∈ [10..100] ∧ plr > 0.6 then Separate parag. by new lines 1 The ideal would be to be able to establish the correspondence between the footnote call marks andelse their respective expansion. However, that feature is Separate parag. by empty lines not specially relevant to the focus of this work, as the effort it would require is not compensated by its practical results.
  6. 6. After removing the footnote, the example by a dash, a newline and more word charac- given above would look like the following: ters). When this pattern is found, the new- line is removed and appended to the end of1 roi Charles V, fils de Jean II, aupr`s de la rue e the word instead.2 Saint-Antoine, ` la porte des Tournelles_fnr29_. a3 _fne8_ Dealing with Unicode-characters is per-4 ^L Quel ´tait en chemin l’´tonnement de l’Ing´nu! e e e formed with a simple search and replace.5 je vous le laisse ` penser. Il crut d’abord que a Characters which have a corresponding char- acter in ASCII are directly replaced, while 2.2.5 Words and characters stranger characters are replaced with a nor- malized custom mark. At word and character level, there are sev- eral problems that can affect the alignment 2.2.6 Commit of two books: Unicode characters, translin- This last step removes the custom marks in- eations and transpaginations: troduced by the previous functions and out- puts a cleaned document ready to be pro- Unicode characters: often, Unicode-only cessed. The only marks left are section versions of previously existing ASCII marks, which can be helpful to guide the characters are used. For example, Uni- alignment. code has several types of dashes (e.g. ‘-’, There are situations where elements such ‘–’ and ‘ ’), where ASCII only has one as page numbers are important (for example, (i.e. ‘-’). While these do not represent having page numbers in a book is convenient the exact same character, they are close when making citations). However, removing enough, and their normalization allows these marks makes the previous steps irre- to produce more similar results. versible. As such, this step is optional, and it Glyphs: Some documents use glyphs (Uni- is meant to be used only when a clean docu- code characters) to represent combina- ment is required for further processing. tions of some specific characters, like fi or ff. 2.3 Declarative objects The first attempt at writing a program to Translineation: translineation happens clean books was a self-contained Perl script – when a given word is split across two all the patterns and logical steps were hard- lines in order to keep the line length coded into the program, with only a few constant. If the two parts of translin- command-line flags allowing to fine tune the eated words are not rejoined before process. When this solution was discarded aligning, the aligner may see them as for not being powerful enough to allow more two separate words. complex processing, several higher level con- Transpagination: transpagination happens figuration objects were created: an ontol- when a translineation occurs at the last ogy to describe section types and names, and line of a page, resulting in a word split domain-specific languages to describe config- across pages. However, the concept of uration files. page is removed by the first function, These elements open the possibility to dis- pages, which removes headers and foot- cuss the subject with people who have no pro- ers and concatenates the pages. This gramming experience, allowing them to di- means that transpagination occurrences rectly contribute with their knowledge to the get reduced to translineations. development of the software. The algorithm for dealing with translin- 2.3.1 Sections Ontology eations and transpaginations is available as The information about sections relevant to an optional feature. This is mainly be- the sectioning process is being stored in an cause different languages have different stan- ontology file (Uschold and Gruninger, 1996). dard ways of dealing with translineations, Currently it contains several section types and many documents use incorrect forms of and their relation (e.g. a Section is part translineation. As such, the user will have the of a Chapter, an Act contains one or more option to turn this feature on or off. Scenes), names of common sections (e.g. In- The algorithm works by looking for words troduction, Index, Foreword ) and ordinal and split across lines (word characters, followed cardinal numbers.
  7. 7. The recognition of these elements is the ontology will allow to establish hierar- language-dependent. As such, mechanisms to chies, and, for example, search for Scenes af- deal with several different languages had to be ter finding an Act, or even to classify a text implemented. Some of these languages, even as play in the presence of either. use a different alphabet (e.g. Russian), which Taking advantage of Perl being a dynamic lead to interesting challenges. programming language (i.e. is able to load Several extracts of the sections ontology new code at runtime), the ontology is being are presented below. The first example shows used to directly create the Perl code contain- several translations and ways of writing chap- ter, and NT sec indicates section as a nar- ing the complex data structures which are rower term of chapter : used in the process of sectioning. The on- tology is in the ISO Thesaurus format, and1 cap the Perl module Biblio::Thesaurus (Sim˜es o2 PT cap´tulo, cap, capitulo ı and Almeida, 2002) is being used to manipu-3 FR chapitre, chap4 EN chapter, chap late the ontology.5 RU глава6 NT sec 2.3.2 Domain Specific Languages The use of domain-specific languages (DSLs) In the next example, chapter is indicated will allow to have configuration files which as a broader term of sections: control the flow of the process. Despite1 cena not yet implemented, it is planned to have2 PT cena T::P::BC reading a file describing external fil-3 FR sc`ne e ters to be used at specific stages of the process4 EN scene5 BT act (before or after any step). This will allow to have calls to third-party BT _alone allows to indicate that this tools, such as PDF to text converters or aux- term must appear alone in a line: iliary cleaning scripts specific for a given type of files.1 end2 PT fim3 FR fin Both the sections ontology and DSLs help to4 EN the_end achieve a better organization, allowing to ab-5 BT _alone stract the domain layer from the the code and implementation details. The number 3 and several variations. BT _numeral identifies this term as a number, and consequently might be used to numerate 3 Evaluation some of the other terms: The evaluation of a tool such as T::P::BC1 3 might be performed by comparing the results2 PT terceiro, terceira, trˆs, tres e3 EN third, three of alignments of texts before and after being4 FR troisi`me, troisieme e cleaned with T::P::BC.5 BT _numeral In order to test T::P::BC, a set of 20 pairs of books from Shakespeare, both in Por- The typification of sections allows to de- tuguese and English, was selected. From the fine different rules and patterns to recognize 40 books, half (the Portuguese ones) con- each type. For example, a line beginning tained both headers and footers. with Chapter One may be almost undoubt- The books were aligned using cwb-align, edly considered a section title, even if more an aligner tool bundled with IMS Open Cor- text follows in the same line (which in many pus Workbench (IMS Corpus Workbench, cases is the title of the chapter). On the other 1994-2002). hand, a line beginning with a Roman numeral may also represent a chapter beginning, but Pages, headers and footers only if nothing follows in the same line (or From a total of 1093 footers present in the else century references, for example, would be documents, 1077 were found and removed frequently mistaken for chapter divisions). (98.5%), and 1183 headers were detected and Despite not yet being fully implemented, removed in a similar amount of total occur- the relations between sections described in rences.
  8. 8. Translation units demonstrated an increase in the number ofThe alignment of texts resulted in the 1:1 translation units.creation of translation memories, files in The use of ontologies for describing andthe TMX format (Translation Memory eX- storing the notation of text units ( chapter,change (Oscar, )). Another way to evaluate section, etc,) in a declarative human-readableT::P::BC is by comparing the number of each notation was very important for maintenancetype of translation units in each file – either and for asking collaboration from translator1:1, 1:0 or 0:1, or 2:2 (the numbers represent experts.how many segments were included in each While processing the books from Shake-variant of the translation unit). Table 1 sum- speare, we felt the need to have a theater-marizes the results obtained in the alignment specific filter which would normalize, by ex-of the 20 pairs of Shakespeare books. ample, the stage directions and the notation used in the characters name before each line. Alignm. Type Original Cleaned ∆% This is a practical example of the need to 0:1 or 1:0 8333 6570 -21.2 allow external filters to be applied at specific 1:1 18802 23433 +24.6 2:1 or 1:2 15673 11365 -27.5 stages of the cleaning process. This feature 2:2 5413 4297 -20.6 will be implemented soon in T::P::BC. Total Seg. PT 54864 53204 -3.0 A special effort will be devoted to the Total Seg. EN 59744 51515 -13.8 recognition of tables of contents, which is still one of the main T::P::BC error causes.Table 1: Number of translation units ob- Additionally, it is under consideration thetained for each type of alignment, with and possibility of using stand off annotation inwithout using T::P::BC2 . the intermediary steps. That would mean to index the document contents before pro- The total number of 1:1 alignments has cessing, and have all the information aboutincreased; in fact, in the original docu- the changes to be made to the document inments, 39.0% of the translation units were of a separate file. This would allow to havethe 1:1 type. On the other hand, the trans- the original file untouched during the wholelation memory created after cleaning the files process, and still be able to commit thecontained 51.3% of 1:1 translation units. changes and produce a cleaned version. The total number of segments decreased inthe cleaned version; some of the books used Text::Perfide::BookCleaner will be madea specific theatrical notation which artificially available at CPAN3 in a very near future.increased the number of segments. This nota-tion was normalized during the cleaning pro- Acknowledgmentscess with theatre_norm, a small Perl script Andr´ Santos has a scholarship from ecreated to normalize these undesired specific Funda¸˜o para a Computa¸˜o Cient´ ca ca ıfica Na-features, which lead to the smaller amount of cional and the work reported here hassegments. been partially funded by Funda¸˜o para a ca The number of alignments considered bad Ciˆncia e Tecnologia through project Per- eby the aligner also decreased, from a total of Fide PTDC/CLE-LLI/108948/2008.10 “bad alignment” cases in the original doc-uments to 4 in the cleaned ones. References Agarwal, S. and H. Yu. 2009. Automatically classifying sentences in full-text biomed- ical articles into Introduction, Methods,4 Conclusions and Future Work Results and Discussion. Bioinformatics,Working with real situations, 25(23):3174.Text::Perfide::BookCleaner proved Ara´jo, S., J.J. Almeida, I. Dias, and uto be a useful tool to reduce text noise. A. Sim˜es. 2010. Apresenta¸˜o do pro- o ca The possibility of using a set of configu- jecto Per-Fide: Paralelizando o Portuguˆs eration options helped to obtain the necessary com seis outras l´ınguas. Linguam´tica, aadaptations to different uses. page 71. Using T::P::BC, the translation memo- 3ries produced in the alignment process have http://www.cpan.org
  9. 9. Evert, S. 2001. The CQP query language tutorial. IMS Stuttgart, 13.Gupta, R. and S. Ahmed. 2007. Project Pro- posal Apache Tika.IMS Corpus Workbench. 1994- 2002. http://www.ims.uni- stuttgart.de/projekte/CorpusWorkbench/.Lee, K.H., N. Guttenberg, and V. McCrary. 2002. Standardization aspects of eBook content formats. Computer Standards & Interfaces, 24(3):227–239.Mattmann, Chris A. and Jukka L. Zitting. 2011. Tika in Action. Manning Publica- tions Co., 1st edition.Noonburg, D. 2001. xpdf: A C++ library for accessing PDF.Okita, T. 2009. Data cleaning for word align- ment. In Proceedings of the ACL-IJCNLP 2009 Student Research Workshop, pages 72–80. Association for Computational Lin- guistics.Oscar, TMX. Lisa (2000) translation memory exchange.Robinson, N. 2001. A Comparison of Utilities for converting from Postscript or Portable Document Format to Text. Geneva: CERN, 31.Sim˜es, A. and J.J. Almeida. 2002. Li- o brary::*: a toolkit for digital libraries.UnmarkedSoftware. 2011. TextSoap: For people working with other people’s text.Uschold, M. and M. Gruninger. 1996. On- tologies: Principles, methods and appli- cations. The Knowledge Engineering Re- view, 11(02):93–136.Varga, D., P. Hal´csy, A. Kornai, V. Nagy, a L. N´meth, and V. Tr´n. 2005. Paral- e o lel corpora for medium density languages. Recent Advances in Natural Language Pro- cessing IV: Selected Papers from RANLP 2005.V´ronis, J. 2000. From the Rosetta stone to e the information society. pages 1–24.