Corpus Review: CORDE. Miren Estibaliz Vivanco, 3rd Course, English Philology, University of Deusto
What is a text corpus? A corpus (plural: corpora), or text corpus, is a large and structured set of texts. Corpora are used for statistical analysis and hypothesis testing: checking occurrences or validating linguistic rules within a specific universe of texts. Nowadays they are usually stored and processed electronically. A corpus may contain texts in a single language (monolingual corpus) or in several languages (multilingual corpus).
What is a text corpus? Corpora are often subjected to a process known as annotation, which makes them more useful for linguistic research. An example is part-of-speech tagging (POS tagging), in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags.
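To make the idea of annotation concrete, here is a minimal Python sketch of POS tagging by dictionary lookup. The lexicon, the tag names and the `word_TAG` output format are assumptions chosen for illustration; real corpora such as the BNC use their own tagsets and statistical taggers rather than a hand-made word list.

```python
# Toy lexicon mapping word forms to part-of-speech tags.
# Hypothetical and tiny on purpose: it only covers the example sentence.
TOY_LEXICON = {
    "the": "DET",
    "nation": "NOUN",
    "adopted": "VERB",
    "a": "DET",
    "new": "ADJ",
    "constitution": "NOUN",
}


def tag_tokens(tokens, lexicon=TOY_LEXICON, unknown_tag="UNK"):
    """Return (token, tag) pairs, looking up each lowercased token in the lexicon."""
    return [(tok, lexicon.get(tok.lower(), unknown_tag)) for tok in tokens]


def to_annotated_text(tagged):
    """Render tagged tokens inline as word_TAG, one common way of storing tags."""
    return " ".join(f"{tok}_{tag}" for tok, tag in tagged)


sentence = "The nation adopted a new constitution"
tagged = tag_tokens(sentence.split())
print(to_annotated_text(tagged))
# The_DET nation_NOUN adopted_VERB a_DET new_ADJ constitution_NOUN
```

Once every word carries a tag, a researcher can search for grammatical patterns (for example, every adjective that precedes "nation") instead of plain strings.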
Information about CORDE The Corpus diacrónico del español (CORDE) is a textual corpus covering all the periods and places in which Spanish has been spoken, from the beginnings of the language up to 1974. Aim: CORDE is designed for extracting information to study words and their meanings, as well as grammar and its use over time. It has been in use since 1994, when the RAE raised the possibility of applying new information technologies to its work. Target: to create a data bank that would improve the quality of its working materials and make data access easier. There is also a companion corpus, CREA (Corpus de referencia del español actual), which covers the language from 1975 onwards.
Corpus Diacrónico del Español
What does the CORDE consist of? The corpus collects written texts of very different kinds: narrative, dramatic, lyrical, scientific, technical… The aim is to cover all geographical, historical and genre varieties so that the whole is sufficiently representative. CORDE is a necessary tool for any diachronic study of the Spanish language. One of its most important objectives is to serve as basic material for the production of the Nuevo diccionario histórico.
Text Acquisition of the CORDE The texts that reach CORDE come from diverse sources: books that are scanned and processed with an optical character recognition program; other books obtained in electronic format; and some works typed in by hand, because no modern edition existed of certain pieces that were selected for inclusion on account of the peculiarity of their language.
Let’s search for the term “nación” in CORDE
Statistics Counts and percentages of the cases obtained, classified by subject, chronological or geographical criteria: the term “nación” appears most often in documents of “historical prose”; most documents containing the word “nación” are from the year 1820 (9502 cases); most of the texts are from Spain.
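The kind of breakdown described on this slide can be sketched in a few lines of Python. The records below are invented for illustration and are not CORDE data; the function name `distribution` and the record layout (year, genre, country) are likewise assumptions, since CORDE computes these statistics itself through its web interface.

```python
from collections import Counter

# Hypothetical hit records: (year, genre, country). Invented values, not CORDE data.
hits = [
    (1820, "historical prose", "Spain"),
    (1820, "historical prose", "Spain"),
    (1820, "legal prose", "Spain"),
    (1812, "historical prose", "Spain"),
    (1825, "narrative prose", "Mexico"),
]


def distribution(records, field_index):
    """Return {value: (absolute count, percentage)} for one field, most frequent first."""
    counts = Counter(rec[field_index] for rec in records)
    total = sum(counts.values())
    return {value: (n, round(100 * n / total, 1)) for value, n in counts.most_common()}


by_year = distribution(hits, 0)      # chronological criterion
by_genre = distribution(hits, 1)     # subject criterion
by_country = distribution(hits, 2)   # geographical criterion

print(by_year)
# {1820: (3, 60.0), 1812: (1, 20.0), 1825: (1, 20.0)}
```

Reporting the absolute count next to the percentage matters because a high percentage over a handful of cases says much less than the same percentage over thousands.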
Looking for examples in the CORDE
Results of examples containing “nación”
Placing the term in a historical context Most of the examples come from the book “Sátiras y panfletos del Trienio Constitucional (1820-1823)” by Sebastián de Miñano. That most of the texts containing the term “nación” date from 1820 makes sense for several reasons: The “Trienio Liberal” or “Trienio Constitucional” spanned the three years between 1820 and 1823, during the reign of Fernando VII, “El Deseado”. On 1 January 1820 the “pronunciamiento” of Colonel Rafael del Riego took place in the Sevillian town of Las Cabezas de San Juan. Rafael del Riego
Placing the term in a historical context Although it met with little success at first, Riego immediately proclaimed the restoration of the Cadiz Constitution (1812, “La Pepa”) and the re-establishment of the constitutional authorities. Support for the small military coup grew stronger and sustained the uprising until March 10. On that date, Fernando VII published a manifesto accepting the Cadiz Constitution, which established a parliamentary monarchy. Cadiz Constitution
Text sample
Let’s search for the term “nation” in BNC
What is BNC? The British National Corpus (BNC) is a 100-million-word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the later part of the 20th century. The written part of the BNC (90%) includes, for example, extracts from regional and national newspapers, specialist periodicals and journals for all ages and interests, academic books and popular fiction, published and unpublished letters and memoranda, and school and university essays, among many other kinds of text. The spoken part (10%) consists of orthographic transcriptions of unscripted informal conversations (recorded by volunteers selected from different age groups, regions and social classes in a demographically balanced way) and spoken language collected in different contexts, ranging from formal business or government meetings to radio shows and phone-ins.
Purposes of the BNC corpus The purpose of a language corpus is to provide language workers with evidence of how language is really used, evidence that can then be used to inform and substantiate individual theories about what words might or should mean. Traditional grammars and dictionaries tell us what a word ought to mean, but only experience can tell us what a word is used to mean. This is why dictionary publishers, grammar writers, language teachers, and developers of natural language processing software alike have been turning to corpus evidence as a means of extending and organizing that experience.
Selection Criteria for the BNC: Domain The domain of a text indicates the kind of writing it contains. 75% of the written texts were to be informative writing, drawn in roughly equal quantities from the fields of applied sciences, arts, belief & thought, commerce & finance, leisure, natural & pure science, social science and world affairs; 25% of the written texts were to be imaginative, that is, literary and creative works.
Selection Criteria for the BNC: Medium The medium of a text indicates the kind of publication in which it occurs. The classification used is quite broad: 60% of written texts were to be books; 25% were to be periodicals (newspapers, etc.); between 5 and 10% should come from other kinds of miscellaneous published material (brochures, advertising leaflets, etc.); between 5 and 10% should come from unpublished written material such as personal letters and diaries, essays and memoranda; and a small amount (less than 5%) should come from material written to be spoken (for example, political speeches, play texts and broadcast scripts).
Looking for examples in BNC The corpus returns a random selection of 50 solutions from all the results for “nation”. Unlike CORDE, it does not show any statistical charts and does not offer the option of specifying authors or dates. You simply enter a word or phrase.
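The behaviour described here, a capped random selection from all matches, can be imitated with a short Python sketch. This is not how the BNC search service is implemented; the function names `kwic` and `random_selection`, the context width and the sample text are all assumptions made for illustration.

```python
import random


def kwic(tokens, keyword, width=3):
    """Return keyword-in-context lines: up to `width` tokens on each side of every match."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


def random_selection(hits, limit=50, seed=None):
    """Return at most `limit` hits chosen at random, without repetition."""
    rng = random.Random(seed)
    return rng.sample(hits, min(limit, len(hits)))


# Invented sample text standing in for a corpus.
text = ("the nation was divided but the nation endured "
        "and every nation has its own history").split()

all_hits = kwic(text, "nation")
print(len(all_hits))                                # 3 matches in total
print(random_selection(all_hits, limit=2, seed=0))  # two of them, picked at random
```

Because the selection is random, the order of the lines carries no chronological information, which is why results obtained this way cannot be read as a timeline.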
Results for the term “nation”
Conclusion I did not find any relevant information about the term “nation” in the BNC, because the results are shown at random and are not organized chronologically. For example, the first result came from the book “The Tragedy of Belief”, by John Fulton, about which I could find nothing beyond the fact that it is a text about Irish politics from the year 1991. CORDE, by contrast, allowed me to carry out fairly complete research on the term “nación” and showed me why occurrences of the term were so abundant in the year 1820.
Bibliography http://www.rae.es/rae/gestores/gespub000019.nsf/voTodosporId/B4E26FC2520104D8C125716400455C06?OpenDocument&i=1 REAL ACADEMIA ESPAÑOLA: Banco de datos (CORDE) [en línea]. Corpus diacrónico del español. <http://www.rae.es> http://corpus.rae.es/ayuda_c.htm http://en.wikipedia.org/wiki/Corpus_linguistics http://en.wikipedia.org/wiki/La_Pepa http://www.natcorp.ox.ac.uk/ All information retrieved 21:06, May 4, 2010
