Towards The Manding Corpus: Texts Selection Principles and Metatext Markup


© Artem Davydov


  1. 1. Towards The Manding Corpus: Texts Selection Principles and Metatext Markup
       Artem Davydov
       St. Petersburg State University, Faculty of Oriental and African Studies
       Paper presented at the AfLaT workshop, Malta, May 18, 2010
  2. 3. Text selection principles
       • Two criteria proposed by Sinclair (2005):
           • external criterion (takes into consideration the communicative function of a text)
           • internal criterion (reflects details of the language of the text)
       • Sinclair’s conclusion: the contents of a corpus should be selected without regard for the language they contain
  3. 4. Oral texts
       • Multilingualism:
           • Bamana-French code-switching
           • unadapted loanwords
       • Two groups of oral texts are to be distinguished:
           • spontaneously generated texts;
           • texts that were first written and then read aloud.
  4. 5. Written texts
       • published folklore texts
       • fiction books
       • the press
       • educational and religious literature
       • N’ko publications
  5. 6. Fiction books
       • the language they contain is often highly “unnatural”
       • a working method typical of Malian writers: writing first in French, then translating into Bamana
  6. 7. N’ko publications
       • are mostly in Maninka, but Bamana texts are also available
       • tones are already indicated
       • can be automatically converted to Roman script
       • the written language has diverged significantly from the spoken language
  7. 8. The Web
       • the language is predominantly oral
       • only a few resources are available online
       • the quality of online resources is low
  8. 9. Some preliminary conclusions
       • Certain types of texts raise doubts, but under existing conditions it would be too wasteful to reject any text data.
       • Every text must be provided with detailed metadata so that the user can single out sub-corpora.
  9. 10. Metatext Markup
       • Groups of parameters:
           • text metadata
           • data about the author
           • writing system
           • genre
           • technical metadata
  10. 11. Text metadata
       • title of the text
       • date of creation
       • original text or translation
       • for translations: the language of the original
       • channel: written, oral, Internet
  11. 12. Data about the author
       • name
       • gender
       • age
       • native language(s)
       • places of predominant linguistic socialization
  12. 13. Writing system
       • Old Malian Bamana orthography
       • New Malian Bamana orthography
       • French-based spontaneous orthography (“colonial”): ou = /u/, gn = /ɲ/, acute accents on final vowels, etc.
       • The Adjami writing system, based on the Arabic alphabet
       • The N’ko writing system
       • Other Roman-based orthographies
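The two "colonial" correspondences named on this slide (ou = /u/, gn = /ɲ/) can be illustrated with a minimal sketch. The function name `normalize_colonial` and the mapping table are hypothetical; the table covers only the two digraphs stated on the slide, works on lowercased input, and ignores the accent conventions, which the slide does not spell out. It is not a full converter between orthographies.

```python
# Hypothetical mapping table: only the two correspondences given on the slide.
COLONIAL_TO_PHONEMIC = [
    ("ou", "u"),  # ou = /u/
    ("gn", "ɲ"),  # gn = /ɲ/
]


def normalize_colonial(word: str) -> str:
    """Rewrite French-based digraphs as single letters (lowercases the input)."""
    result = word.lower()
    for digraph, letter in COLONIAL_TO_PHONEMIC:
        result = result.replace(digraph, letter)
    return result


# Illustrative inputs, chosen only to exercise the two rules.
print(normalize_colonial("Bougouni"))
print(normalize_colonial("gnama"))
```

A plain string replacement like this will also rewrite sequences that are not digraphs in a given word, so a real tool would need word lists or context rules.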
  13. 14. Genres/Subject matters
       • Folklore: Popular Tales, Anecdotes, Epics, Proverbs, Traditional Songs, Koteba (Traditional Theatre), Riddles.
       • Fiction: Prose, Movie Scripts, Theatre, Modern Poetry, Popular Song Lyrics.
       • Religious: Christian, Islamic, Other.
       • Educational: Formal Education, Vulgarization, Popular Science.
       • Academic Writings: History, Linguistics.
       • Personal communication: Personal Records, Correspondence, Dialogs.
       • Information: Advertising, News, Column, Narrative, Interview, Public Speech.
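The genre list on this slide is a two-level taxonomy, which could be stored as a simple lookup table so that genre labels are validated when a text is added. The sketch below is one possible encoding; the names `GENRES` and `is_valid_genre` are assumptions, not part of the presentation.

```python
# Two-level genre taxonomy copied from the slide; the structure is an assumption.
GENRES = {
    "Folklore": ["Popular Tales", "Anecdotes", "Epics", "Proverbs",
                 "Traditional Songs", "Koteba", "Riddles"],
    "Fiction": ["Prose", "Movie Scripts", "Theatre", "Modern Poetry",
                "Popular Song Lyrics"],
    "Religious": ["Christian", "Islamic", "Other"],
    "Educational": ["Formal Education", "Vulgarization", "Popular Science"],
    "Academic Writings": ["History", "Linguistics"],
    "Personal Communication": ["Personal Records", "Correspondence", "Dialogs"],
    "Information": ["Advertising", "News", "Column", "Narrative", "Interview",
                    "Public Speech"],
}


def is_valid_genre(genre: str, subgenre: str) -> bool:
    """Return True if the subgenre is listed under the given top-level genre."""
    return subgenre in GENRES.get(genre, [])


print(is_valid_genre("Folklore", "Proverbs"))
print(is_valid_genre("Fiction", "Proverbs"))
```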
  14. 15. Technical metadata
       • Finally, to track updates to the corpus, each text is to be provided with the following information:
           • the name of the project member who added the text to the corpus;
           • the date the text was added to the corpus.
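Taken together, the five groups of parameters (text metadata, author data, writing system, genre, technical metadata) describe one record per text. A minimal sketch of such a record follows. All class names, field names, and sample values are hypothetical, and the presentation does not say which storage format the project uses.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Channel values as listed on the "Text metadata" slide.
CHANNELS = {"written", "oral", "Internet"}


@dataclass
class Author:
    name: str
    gender: Optional[str] = None
    age: Optional[int] = None
    native_languages: List[str] = field(default_factory=list)
    socialization_places: List[str] = field(default_factory=list)


@dataclass
class TextRecord:
    # Text metadata
    title: str
    date_created: Optional[str]
    is_translation: bool
    source_language: Optional[str]  # only meaningful for translations
    channel: str
    # Data about the author
    author: Author
    # Writing system and genre
    writing_system: str
    genre: str
    subgenre: str
    # Technical metadata
    added_by: str
    date_added: str

    def __post_init__(self) -> None:
        if self.channel not in CHANNELS:
            raise ValueError(f"unknown channel: {self.channel!r}")
        if self.is_translation and not self.source_language:
            raise ValueError("a translation needs the language of the original")


# Placeholder values for illustration only; not taken from the corpus.
example = TextRecord(
    title="Sample text",
    date_created=None,
    is_translation=True,
    source_language="French",
    channel="written",
    author=Author(name="A. N. Author", native_languages=["Bamana"]),
    writing_system="New Malian Bamana orthography",
    genre="Fiction",
    subgenre="Prose",
    added_by="project member",
    date_added="2010-05-18",
)
print(example.channel, example.author.native_languages)
```

Validating the channel and the translation fields at creation time is a design choice of this sketch; it keeps incomplete records from entering the corpus unnoticed.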
  15. 16. Conclusion
       • The suggested metatext markup system will let users create sub-corpora with specified parameters. It will also help to control how the corpus is filled with new text data and to estimate the balance of the corpus.
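The two uses named in the conclusion, building sub-corpora from metadata and estimating the balance of the corpus, can be sketched as a filter and a share-by-token-count calculation. The records, field names, and token counts below are invented for illustration, and measuring balance by token share is an assumption; the presentation does not define a balance measure.

```python
from collections import Counter

# Invented sample records; real entries would carry the full metadata set.
corpus = [
    {"title": "text-1", "channel": "written", "genre": "Folklore", "tokens": 1000},
    {"title": "text-2", "channel": "written", "genre": "Information", "tokens": 3000},
    {"title": "text-3", "channel": "oral", "genre": "Folklore", "tokens": 2000},
    {"title": "text-4", "channel": "written", "genre": "Fiction", "tokens": 2000},
]


def select(texts, **criteria):
    """Return the texts whose metadata match every given key/value pair."""
    return [t for t in texts if all(t.get(k) == v for k, v in criteria.items())]


def balance(texts, parameter):
    """Return each parameter value's share of the total token count."""
    totals = Counter()
    for t in texts:
        totals[t[parameter]] += t["tokens"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {value: count / grand_total for value, count in totals.items()}


written_folklore = select(corpus, channel="written", genre="Folklore")
print([t["title"] for t in written_folklore])
print(balance(corpus, "channel"))
```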
  16. 17. Thank you!