
  1. 1. Corpus Review: CORDE Miren Estibaliz Vivanco 3rd Course English Philology University of Deusto
  2. 2. What is a text corpus? <ul><li>A corpus ( corpora , in plural) or text corpus is a large and structured set of texts . </li></ul><ul><li>They are used to do statistical analysis and hypothesis testing , checking occurrences or validating linguistic rules on a specific universe. </li></ul><ul><li>Nowadays, they are usually electronically structured and processed. </li></ul><ul><li>A corpus may contain texts in a single language ( monolingual corpus ) or text data in multiple languages ( multilingual corpus ). </li></ul>
  3. 3. What is a text corpus? <ul><li>Corpora are often subjected to a process known as annotation . </li></ul><ul><li>This makes the corpus more useful for linguistic research . </li></ul><ul><li>An example of annotating a corpus is part-of-speech tagging , or POS-tagging , in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. </li></ul>
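As a minimal sketch of what POS annotation adds to a text, the snippet below attaches a tag to each word using a tiny hand-made lookup table. `TOY_LEXICON`, `annotate` and `to_tagged_text` are hypothetical names invented for this illustration; real corpora are tagged with trained taggers and much richer tag sets.

```python
# Toy illustration of POS annotation: each token gets a tag attached.
# TOY_LEXICON is a hypothetical hand-made lookup, not a real tagger.
TOY_LEXICON = {
    "the": "DET",
    "nation": "NOUN",
    "approved": "VERB",
    "a": "DET",
    "new": "ADJ",
    "constitution": "NOUN",
}


def annotate(sentence):
    """Return (word, tag) pairs; words missing from the lexicon get 'UNK'."""
    tokens = sentence.lower().split()
    return [(tok, TOY_LEXICON.get(tok, "UNK")) for tok in tokens]


def to_tagged_text(pairs):
    """Render the pairs in the common inline word_TAG format."""
    return " ".join(f"{word}_{tag}" for word, tag in pairs)


tagged = annotate("The nation approved a new constitution")
print(to_tagged_text(tagged))
```

Once the tags are stored alongside the words, a search can ask for "nation" only when it is used as a noun, which is what makes an annotated corpus more useful than raw text.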
  4. 4. Information about CORDE <ul><li>The Corpus diacrónico del español (CORDE) is a textual corpus covering all the periods and places in which the Spanish language has been spoken, from the origins of the language up to 1974 . </li></ul><ul><li>Aim : CORDE is designed to extract information to study words and their meanings , as well as grammar and its use over time. </li></ul><ul><li>It started to be used in 1994 , when the RAE raised the possibility of applying new information technologies. </li></ul><ul><li>Target : to create a data bank that would improve the quality of their working materials and make data access easier . </li></ul><ul><li>There is another corpus named CREA ( Corpus de referencia del español actual), which covers the language from 1975 onwards. </li></ul>
  5. 5. Corpus Diacrónico del Español
  6. 6. What does the CORDE consist of? <ul><li>The corpus collects written texts of very different kinds : narrative, dramatic, lyrical, scientific, technical… </li></ul><ul><li>The aim is to cover all geographical, historical and genre varieties so that the whole is sufficiently representative. </li></ul><ul><li>CORDE is a necessary tool for any diachronic study related to the Spanish language. </li></ul><ul><li>One of the most important objectives of the diachronic corpus is to serve as basic material for the production of the Nuevo diccionario histórico. </li></ul>
  7. 7. Text Acquisition of the CORDE <ul><li>The texts that reach CORDE come from diverse sources: </li></ul><ul><ul><li>Books scanned with optical character recognition software. </li></ul></ul><ul><ul><li>Other books obtained in electronic format . </li></ul></ul><ul><ul><li>Some are typed in by hand in digital format , because no modern edition existed of certain works that were selected for inclusion because of the peculiarity of their language. </li></ul></ul>
  8. 8. Let’s search the term “nación” in CORDE
  9. 9. Statistics <ul><li>Absolute percentages of the obtained cases, classified according to subject , chronological or geographical criteria: </li></ul><ul><ul><li>The term “nación” appears most in documents of “historical prose” . </li></ul></ul><ul><ul><li>Most documents containing the word “nación” are from the year 1820 (9502 cases). </li></ul></ul><ul><ul><li>Most of the texts are from Spain . </li></ul></ul>
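The kind of breakdown CORDE reports (occurrences grouped by year, country or genre) can be sketched with a simple counter. The records in `DOCS` are made up for illustration and are not CORDE data; `count_by` is a hypothetical helper, and the sketch matches whole whitespace-separated words only.

```python
from collections import Counter

# Hypothetical mini-corpus: (year, country, genre, text).
# These records are invented for illustration; they are not CORDE figures.
DOCS = [
    (1820, "Spain", "historical prose", "la nación española es libre"),
    (1820, "Spain", "press", "la nación y la constitución de la nación"),
    (1812, "Spain", "legal prose", "la soberanía reside en la nación"),
    (1850, "Mexico", "historical prose", "el pueblo y su historia"),
]


def count_by(docs, term, field):
    """Count occurrences of term, grouped by a field index
    (0 = year, 1 = country, 2 = genre)."""
    counts = Counter()
    for doc in docs:
        hits = doc[3].split().count(term)
        if hits:
            counts[doc[field]] += hits
    return counts


by_year = count_by(DOCS, "nación", 0)
by_country = count_by(DOCS, "nación", 1)
print(by_year.most_common(1))     # year with the most occurrences
print(by_country.most_common(1))  # country with the most occurrences
```

Dividing each count by the total number of hits would give the percentage view that the corpus interface shows next to the absolute figures.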
  10. 10. Looking for examples in the CORDE
  11. 11. Results of examples containing “nación”
  12. 12. Placing the term in a historical context <ul><li>Most of the examples come from the book “Sátiras y panfletos del Trienio Constitucional (1820-1823)” by Sebastián de Miñano . </li></ul><ul><li>That most of the texts containing the term “nación” date from the year 1820 makes sense for various reasons: </li></ul><ul><ul><li>The “Trienio Liberal” or “Trienio Constitucional” spanned the three years between 1820 and 1823 . </li></ul></ul><ul><ul><li>It fell within the reign of Fernando VII “El Deseado”. </li></ul></ul><ul><ul><li>On the 1st of January 1820 the “pronunciamiento” of Colonel Rafael del Riego took place in the Sevillian town of Las Cabezas de San Juan. </li></ul></ul>Rafael del Riego
  13. 13. Placing the term in a historical context <ul><li>Despite little success at the beginning, Riego immediately proclaimed the restoration of the Cadiz Constitution (1812, La Pepa ) and the re-establishment of constitutional authorities . </li></ul><ul><li>Support for the small military coup grew stronger and sustained the uprising until March 10 . </li></ul><ul><li>On that date, Fernando VII published a manifesto pledging to respect the Cadiz Constitution, which established a parliamentary monarchy . </li></ul>Cadiz Constitution
  14. 14. Text sample
  15. 15. Let’s search the term “nation” in BNC
  16. 16. What is BNC? <ul><li>The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the later part of the 20th century , both spoken and written. </li></ul><ul><li>The  written part  of the BNC ( 90% ) includes, for example, extracts from regional and national newspapers, specialist periodicals and journals for all ages and interests, academic books and popular fiction, published and unpublished letters and memoranda, school and university essays, among many other kinds of text. </li></ul><ul><li>The   spoken part  ( 10% ) consists of orthographic transcriptions of unscripted informal conversations (recorded by volunteers selected from different age, region and social classes in a demographically balanced way) and spoken language collected in different contexts, ranging from formal business or government meetings to radio shows and phone-ins . </li></ul>
  17. 17. Purposes of BNC corpus <ul><li>The purpose of a language corpus is to provide language workers with evidence of how language is really used, evidence that can then be used to inform and substantiate individual theories about what words might or should mean. </li></ul><ul><li>Traditional grammars and dictionaries tell us what a word  ought to mean , but only experience can tell us what a word  is used to mean . </li></ul><ul><li>This is why dictionary publishers, grammar writers, language teachers, and developers of natural language processing software alike have been turning to corpus evidence as a means of extending and organizing that experience. </li></ul>
  18. 18. Selection Criteria for the BNC <ul><li>Domain </li></ul><ul><ul><li>The  domain  of a text indicates the kind of writing it contains. </li></ul></ul><ul><ul><ul><li>75% of the written texts were to be chosen from informative writings : of which roughly equal quantities should be chosen from the fields of applied sciences, arts, belief & thought, commerce & finance, leisure, natural & pure science, social science, world affairs. </li></ul></ul></ul><ul><ul><ul><li>25% of the written texts were to be imaginative , that is, literary and creative works. </li></ul></ul></ul>
  19. 19. Selection Criteria for the BNC <ul><li>Medium </li></ul><ul><ul><li>The medium of a text indicates the kind of publication in which it occurs. The classification used is quite broad. </li></ul></ul><ul><ul><ul><li>60% of written texts were to be books </li></ul></ul></ul><ul><ul><ul><li>25% were to be periodicals (newspapers etc.) </li></ul></ul></ul><ul><ul><ul><li>Between 5 and 10% should come from other kinds of miscellaneous published material (brochures, advertising leaflets, etc.) </li></ul></ul></ul><ul><ul><ul><li>Between 5 and 10% should come from unpublished written material such as personal letters and diaries, essays and memoranda, etc. </li></ul></ul></ul><ul><ul><ul><li>A small amount (less than 5%) should come from material written to be spoken (for example, political speeches, play texts, broadcast scripts, etc.) </li></ul></ul></ul>
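Combined with the 90% written share given for the BNC, the domain and medium percentages can be turned into rough size targets. The sketch below assumes, purely for illustration, that the total is exactly 100 million words and that the percentages apply to word counts (the slides state them as shares of texts); `share` is a hypothetical helper.

```python
# Rough size targets derived from the BNC design percentages.
# Assumption: percentages are applied to word counts, and the corpus
# is exactly 100 million words. Both are simplifications.
TOTAL_WORDS = 100_000_000


def share(amount, percent):
    """Integer share of amount for a whole-number percentage."""
    return amount * percent // 100


written = share(TOTAL_WORDS, 90)   # written part of the corpus
spoken = share(TOTAL_WORDS, 10)    # spoken part of the corpus

informative = share(written, 75)   # domain: informative writing
imaginative = share(written, 25)   # domain: imaginative writing

books = share(written, 60)         # medium: books
periodicals = share(written, 25)   # medium: periodicals

for label, value in [
    ("written", written),
    ("spoken", spoken),
    ("informative", informative),
    ("imaginative", imaginative),
    ("books", books),
    ("periodicals", periodicals),
]:
    print(f"{label:12s} {value:>12,d}")
```

Domain and medium are independent classifications of the same written texts, so the two pairs of figures describe the written part in two different ways rather than adding up together.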
  20. 20. Looking for examples in BNC <ul><li>The corpus gives a random selection of 50 solutions among all the results of “nation”. </li></ul><ul><li>Unlike the CORDE, it does not show any statistical charts and it does not give the option to specify authors or dates. You just enter a word or phrase. </li></ul>
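Returning a random selection of 50 results can be sketched as sampling without replacement from the full list of hits. This is only an illustration of the idea, not the BNC's actual implementation; `sample_hits` and the `all_hits` list are hypothetical.

```python
import random


def sample_hits(hits, k=50, seed=None):
    """Return up to k hits chosen at random without replacement.

    If there are fewer than k hits, all of them are returned
    (in random order).
    """
    rng = random.Random(seed)
    return rng.sample(hits, min(k, len(hits)))


# Hypothetical list of concordance lines standing in for real results.
all_hits = [f"... nation ... (hit {i})" for i in range(1, 501)]

shown = sample_hits(all_hits, k=50, seed=2010)
print(len(shown))
print(shown[0])
```

Because the selection is random, the 50 lines shown carry no chronological order, which is why a sample like this is hard to place in a historical context.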
  21. 21. Results for the term “nation”
  22. 22. Conclusion <ul><li>I did not find any relevant information about the term “nation” in the BNC, because the results are shown at random and are not organized chronologically. </li></ul><ul><li>For example, the first result was from the book “The Tragedy of Belief”, by John Fulton, about which I did not find any relevant information , apart from the fact that it is a text about Irish politics from the year 1991. </li></ul><ul><li>In contrast, the CORDE allowed me to research the term “nación” quite thoroughly and showed me why the results for the term were so abundant in the year 1820. </li></ul>
  23. 23. Bibliography <ul><li>http :// / rae /gestores/gespub000019. nsf / voTodosporId /B4E26FC2520104D8C125716400455C06? OpenDocument&i=1   </li></ul><ul><li>REAL ACADEMIA ESPAÑOLA: Banco de datos (CORDE) [en línea].  Corpus diacrónico del español.  <> </li></ul><ul><li>http :// / ayuda_c.htm </li></ul><ul><li>http :// / wiki / Corpus_linguistics </li></ul><ul><li>http :// / wiki / La_Pepa </li></ul><ul><li>http :// / </li></ul><ul><li>All information retrieved 21:06, May 4, 2010 </li></ul>