Towards The Manding Corpus: Texts Selection Principles and Metatext Markup
Artem Davydov
St. Petersburg State University, Faculty of Oriental and African Studies
Paper presented at the AfLaT workshop, Malta, May 18, 2010
 
Text selection principles
Two criteria proposed by Sinclair (2005):
  • external criterion (takes into consideration the communicative function of a text)
  • internal criterion (reflects details of the language)
Sinclair’s conclusion: the contents of a corpus should be selected without regard for the language they contain
Oral texts
Multilingualism:
  • Bamana-French code-switching
  • unadapted loanwords
Two groups of oral texts are to be distinguished:
  • spontaneously generated texts;
  • texts which were first written and then read aloud.
Written texts
  • published folklore texts
  • fiction books
  • the press
  • educational and religious literature
  • Nko publications
Fiction books
  • the language they contain is often highly “unnatural”
  • working method typical for Malian writers: first writing in French, then translating into Bamana
Nko publications
  • are mostly in Maninka, but Bamana texts are also available
  • tones are already indicated
  • can be automatically converted to Roman script
  • the written language has diverged significantly from the spoken language
The Web
  • a predominantly oral language
  • only a few resources are available online
  • the quality of online resources is low
Some preliminary conclusions
  • Certain types of texts raise doubts, but under existing conditions it would be too wasteful to reject any text data.
  • Every text must be provided with detailed metadata in order to allow the user to single out sub-corpora.
Metatext Markup
Groups of parameters:
  • text metadata
  • data about the author
  • writing system
  • genre
  • technical metadata
Text metadata
  • title of the text
  • date of creation
  • original text or translation
  • for translations: the language of the original
  • channel: written, oral, Internet
Data about the author
  • name
  • gender
  • age
  • native language(s)
  • places of predominant linguistic socialization
Writing system
  • Old Malian Bamana orthography
  • New Malian Bamana orthography
  • French-based spontaneous orthography (“colonial”): ou = /u/, gn = /ɲ/, acute accents on the final vowels, etc.
  • The Adjami writing system based on the Arabic alphabet
  • The N’ko writing system
  • Other Roman-based orthographies
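The French-based conventions named on this slide suggest a simple normalization step that would make texts in "colonial" spelling easier to compare with texts in other Roman-based orthographies. The following is only a minimal Python sketch: it covers just the two digraphs given on the slide (ou = /u/, gn = /ɲ/) and the acute accent on a word-final vowel. The function name and the choice of "u" and "ɲ" as target letters are illustrative assumptions, and real texts would need a much fuller rule set.

```python
import unicodedata

# Digraph correspondences taken from the slide; the target letters are assumed.
DIGRAPHS = {"ou": "u", "gn": "ɲ"}

# Vowels carrying an acute accent, in precomposed (NFC) form.
ACUTE_VOWELS = "áéíóú"


def normalize_colonial(word: str) -> str:
    """Rewrite one word in French-based spelling using the assumed target letters."""
    word = unicodedata.normalize("NFC", word.lower())
    for source, target in DIGRAPHS.items():
        word = word.replace(source, target)
    # Drop the acute accent on a word-final vowel only (e.g. final "é" becomes "e").
    if word and word[-1] in ACUTE_VOWELS:
        base_vowel = unicodedata.normalize("NFD", word[-1])[0]
        word = word[:-1] + base_vowel
    return word
```

Keeping the original spelling alongside the normalized form, rather than overwriting it, would preserve the writing-system information that the metadata is meant to record.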
Genres/Subject matters
  • Folklore: Popular Tales, Anecdotes, Epics, Proverbs, Traditional Songs, Koteba (Traditional Theatre), Riddles.
  • Fiction: Prose, Movie Scripts, Theatre, Modern Poetry, Popular Song Lyrics.
  • Religious: Christian, Islamic, Other.
  • Educational: Formal Education, Vulgarization, Popular Science.
  • Academic Writings: History, Linguistics.
  • Personal communication: Personal Records, Correspondence, Dialogs.
  • Information: Advertising, News, Column, Narrative, Interview, Public Speech.
Technical metadata
Finally, to track corpus updates, each text is to be provided with the following information:
  • the name of the project member who added the text to the corpus;
  • the date the text was added to the corpus.
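The five groups of parameters (text metadata, author, writing system, genre, technical metadata) could be gathered into one record per text. The sketch below shows one possible shape in Python. All class names, field names, and the sample values are hypothetical, since the slides do not prescribe a storage format; only the parameters themselves come from the slides.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional

# Channel values listed on the "Text metadata" slide.
CHANNELS = {"written", "oral", "internet"}


@dataclass
class Author:
    name: str
    gender: Optional[str] = None
    age: Optional[int] = None
    native_languages: List[str] = field(default_factory=list)
    socialization_places: List[str] = field(default_factory=list)


@dataclass
class TextRecord:
    # Text metadata
    title: str
    created: Optional[str]          # date of creation, kept as free text
    is_translation: bool
    source_language: Optional[str]  # filled in only for translations
    channel: str
    # Author, writing system, genre
    author: Author
    writing_system: str
    genre: str
    subgenre: str
    # Technical metadata
    added_by: str                   # project member who added the text
    added_on: str                   # date the text was added

    def __post_init__(self) -> None:
        if self.channel not in CHANNELS:
            raise ValueError(f"unknown channel: {self.channel!r}")
        if self.is_translation and not self.source_language:
            raise ValueError("a translation needs the language of the original")


# Invented sample record, for illustration only.
example = TextRecord(
    title="Example tale", created="1985", is_translation=False,
    source_language=None, channel="oral",
    author=Author(name="Speaker A", native_languages=["Bamana"]),
    writing_system="New Malian Bamana orthography",
    genre="Folklore", subgenre="Popular Tales",
    added_by="project member A", added_on="2010-05-18",
)
```

Calling `asdict(example)` turns the record, including the nested author, into plain dictionaries that could be serialized to whatever header format the corpus eventually uses.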
Conclusion
  • The suggested metatext markup system will provide a user with the ability to create sub-corpora with the specified parameters.
  • It will also help to control the process of filling the corpus with new text data and to estimate the balance of the corpus.
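Both uses named in the conclusion, selecting a sub-corpus by parameters and estimating balance, reduce to simple operations over per-text metadata. The Python sketch below assumes that each text is described by a flat dictionary; the function names, the keys, and the four sample records are invented for illustration. Balance is measured here by number of texts, whereas a real estimate would more likely weigh texts by token count.

```python
from collections import Counter
from typing import Dict, Iterable, List

Record = Dict[str, str]


def select_subcorpus(records: Iterable[Record], **criteria: str) -> List[Record]:
    """Keep the records whose metadata match every given key=value pair."""
    return [
        record for record in records
        if all(record.get(key) == value for key, value in criteria.items())
    ]


def balance(records: Iterable[Record], key: str) -> Dict[str, float]:
    """Share of texts per value of one metadata field (counted by text)."""
    counts = Counter(record.get(key, "unknown") for record in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: n / total for value, n in counts.items()}


# Invented sample metadata, for illustration only.
SAMPLE = [
    {"title": "text A", "channel": "oral", "genre": "Folklore"},
    {"title": "text B", "channel": "written", "genre": "Fiction"},
    {"title": "text C", "channel": "written", "genre": "Folklore"},
    {"title": "text D", "channel": "written", "genre": "Information"},
]

written_folklore = select_subcorpus(SAMPLE, channel="written", genre="Folklore")
channel_shares = balance(SAMPLE, "channel")
```

With the sample above, `written_folklore` contains only "text C", and `channel_shares` shows that three quarters of the texts are written, which is the kind of imbalance the metadata is meant to make visible.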
Thank you!
