Spanish in the U.S.: Developing an open linguistic corpus

Presentation by project directors Barbara E. Bullock and Almeida Jacqueline Toribio at the 24th Conference on Spanish in the United States, March 2013 in McAllen, Texas.

  1. Spanish in the U.S.: Developing an open linguistic corpus
     Barbara E. Bullock & Almeida Jacqueline Toribio
     24th Conference on Spanish in the United States
     9th Conference on Spanish in Contact with Other Languages
     March 6-9, 2013, McAllen, Texas
  2. Spanish in Texas Corpus Project
     • Purpose: to make publicly available authentic data about variation in Spanish as spoken in Texas
       – for education
       – for research
  3. Motivation
     • Document continuity and variation
     • Understand variation in its local context
     • Overcome the challenges of studying naturalistic data
       – Cost: gathering, transcribing, and coding data
       – Accountability: corpora upon which studies are based are rarely made available to the public
     • Encourage teachers/students/public to view local varieties as a resource
  4. Inspiration
     • Garland Bills, Vivian Cook, Lourdes Ortega, Ricardo Otheguy, Bonnie Urciuoli, Guadalupe Valdés, Walt Wolfram, Ana Celia Zentella, …
     • language variation ‘in the public interest’
     • an empirical turn in thinking about contact varieties
     • Ornstein-Galicia’s (1981) call: investigate Spanish varieties in your own backyard, share resources, create concordances of usage
  5. Impetus
     • What is needed
       – large, representative samples of oral Spanish in the U.S.
       – metadata about the speakers
       – a context and protocols for sharing architecture, scripts, analytical techniques, and data … as well as findings
  6. Why open?
     • To facilitate access
       – attract as many eyes as possible to the same data
       – accelerate the production of findings, which is particularly important for the study of U.S. Spanish
     • To reduce costs in terms of time and money, especially for those who can least afford it
  7. But…
     • Large corpora are of limited utility to untrained end users
       – Teachers need short videos that are appropriate for classroom use
       – And teachers need tools
         • to easily search videos,
         • to author materials,
         • to curate their own collections
  8. Two-pronged approach
     • Spanish in Texas Corpus Project
       – Video interviews that provide rich content
     • SpinTX: Corpus-to-Classroom
       – Collection of pre-selected, corrected, annotated clips from the larger corpus
       – Open-source, pedagogically friendly search and authoring tools
  9. Goals of this talk
     • Document our efforts to develop an open corpus of U.S. Spanish, using open-source tools
     • Define ‘open’
     • Describe the protocols that we are using to convert Spanish in TX interviews into a pedagogically useful corpus
     • Showcase materials and tools that we have for use
     • Share our work with others who may be interested in developing open Spanish in X corpora
     • Look ahead to an open sociolinguistic/computational research corpus of the full interviews of Spanish in TX
  10. Origins of the project
     • Language Resource Center [LRC], 2010-2013
     • Center for Open Educational Resources for Language Learning [COERLL]
  11. Open Educational Resources [OER]
     Educational material offered freely for anyone to use, typically involving some permission to remix, improve, and redistribute
     creativecommons.org
  12. Spanish in Texas Media License
     • Attribution Required
     • Non-Commercial
     • Share-Alike
  13. Spanish in Texas Corpus Project
  14. Spanish in Texas Corpus Project
     • Spanish in Texas is our first collection of video interviews
       – provides content for SpinTX Corpus-to-Classroom
     • Additional collections
       – Spanish in Texas CS collection
       – Hindi-English CS collection
  15. Spanish in Texas Corpus Project
     • Ideally, serve as a reference corpus for oral Spanish in Texas
       – large (1 million+ words), representative of variation, fully open
       – currently 134 interviews; approx. 600,000 words
     • This will help establish a better “baseline” for heritage language research and teaching than the traditionally assumed monolingual one
  16. From Corpus to Classroom
  17. SpinTX: Corpus-to-Classroom
     • Aims
       – develop a pedagogically friendly interface for using the corpus
       – involve teachers and learners, via crowd-sourcing, social networking, and workshops, in the development of open educational resources
       – create a model for using open-source tools and a pedagogical interface that can be adapted for any language corpus collection
  18. Funding
     • Department of Education, Title VI
     • College of Liberal Arts
     • Longhorn Innovation Fund for Technology [LIFT]
  19. Our team
     • Directors: Barbara E. Bullock & Almeida J. Toribio
     • Project Manager and Web Architect: Rachael Gilg
     • Consultants
       – Graphic Designer: Nathalie Steinfeld Childre
       – Computational Linguist: Martí Quixal
       – Digital Media Producer: Scott Zúñiga
       – Educational Technologist: Arthur Wendorf
       – Outreach Coordinator: Jeffrey Michno
       – Content Manager, Intern Coordinator: Jacqueline Larsen Serigos
       – Materials Development: Jesse Abing, Joshua Frank
     • Undergraduate Interns
       – 2011-present: 12
  20. Our team
     • Collaborators
       – University of Texas Pan American
         • José Esteban Hernández
         • Stephanie Brock
         • José Flores
         • Viridiana Gallegos
         • Rossy Limas
         • Michelle Madrid
       – Texas A&M International University
         • Patricia González
         • Conchita Hickey
         • Lisa Flores
       – Others
         • Daniel Villa, New Mexico State University
         • María Irene Moyna, Texas A&M University
         • MaryEllen García, University of Texas, San Antonio
         • Jens Clegg, Indiana University-Purdue University
         • Abby Dings, Southwestern University
  21. “La gente puede ser pobre, pero compra su coca-cola”
  22. From Community to Corpus
  23. Recruit ‘locally’
     • Recruit and train interns
       – Institutional Review Board [IRB] training
       – Video shooting and audio recording
       – Practice interviews on site
     • Recruit family, friends, acquaintances
       – Any Spanish-speaking resident of TX
     • Conduct interviews in their home communities
  24. Video Production Protocol
     • HD video cameras
     • Professional-quality condenser microphones
       – interviewer and interviewee are each recorded on a separate channel
     • Interviewer wears headphones to monitor audio
  25. Interview protocol
     • Sampling of a large set of questions (~75)
       – from NPR StoryCorps (Historias)
       – biographical information
     • Average length: 30-45 min.
     • Language: Spanish and mixed
     • Consent form and talent release
     • Metadata on speaker and interviewer
       – Google Docs
  26. Interview Metadata
  27. Processing the Videos
     • Intake interview materials
       – create unique ID for video and forms
       – archive raw video and remove from camera
     • Video and transcript preparation
       – Final Cut Pro
       – upload to Automatic Sync (3-5 day turnaround)
       – convert transcript to UTF-8, upload to Google Drive collection
       – upload to YouTube to create synced caption file (SRT)
  28. From Transcript to Corpus
  29. Original Transcript (from Automatic Sync)
  30. Review Transcript in Google Docs
  31. Prepare Transcript for TreeTagger
  32. Run Transcript through TreeTagger (Spanish and English)
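The tagging step can be illustrated with a minimal sketch that reads tagger output back into records. It assumes the common three-column TreeTagger layout (word, tag, lemma, separated by tabs); the function name, the sample lines, and the tag labels are illustrative assumptions, not the project's own scripts or tagset.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    pos: str
    lemma: str


def parse_tagger_output(text: str) -> list[Token]:
    """Parse 'word<TAB>tag<TAB>lemma' lines into Token records.

    The three-column layout is an assumption about how the tagger was run.
    """
    tokens = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) != 3:
            raise ValueError(f"expected 3 tab-separated fields, got: {line!r}")
        tokens.append(Token(*parts))
    return tokens


# Hypothetical sample; the tag names are placeholders for illustration.
SAMPLE = "lo\tART\tel\nbueno\tADJ\tbueno\nde\tPREP\tde\n"

if __name__ == "__main__":
    for token in parse_tagger_output(SAMPLE):
        print(token.word, token.pos, token.lemma)
```

Keeping each token as a small record makes it straightforward to attach timing information or further tags in later steps.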
  33. Upload Video and Transcript to YouTube
  34. Download SRT File
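For readers unfamiliar with the caption format, here is a small sketch of reading an SRT file into timed records. It assumes the standard SubRip layout (a counter line, a `start --> end` line, then caption text, with blank lines between blocks); the names `Caption`, `to_seconds`, and `parse_srt` and the sample text are hypothetical.

```python
import re
from typing import NamedTuple


class Caption(NamedTuple):
    index: int
    start: float  # seconds from the beginning of the video
    end: float
    text: str


TIME_RE = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def to_seconds(stamp: str) -> float:
    """Convert an 'HH:MM:SS,mmm' timestamp to seconds."""
    match = TIME_RE.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"bad timestamp: {stamp!r}")
    hours, minutes, seconds, millis = map(int, match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(text: str) -> list[Caption]:
    """Split SRT text into blocks and parse each block into a Caption."""
    captions = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if len(lines) < 2:
            continue  # not a complete caption block
        start, end = lines[1].split("-->")
        captions.append(
            Caption(int(lines[0].strip()), to_seconds(start), to_seconds(end),
                    " ".join(lines[2:]))
        )
    return captions


SAMPLE_SRT = (
    "1\n00:00:01,000 --> 00:00:03,500\nlo bueno de esta clase\n\n"
    "2\n00:00:03,500 --> 00:00:05,250\nes la maestra\n"
)
```

Storing times as seconds rather than strings simplifies the later step of shifting timings when clips are cut.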
  35. Combine Data from SRT File and TreeTagger File, and add additional Tags
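One way such a merge could work is sketched below: each tagged token is assigned to the caption whose text it spells out, and the result is written as CSV. This is an assumption about the approach, not the project's actual script; the alignment heuristic (concatenating tokens until they match the caption text with whitespace removed), the column names, and the function names are all hypothetical.

```python
import csv
import io


def combine(captions, tokens):
    """Attach caption timing to each tagged token.

    captions: iterable of (index, start, end, text)
    tokens:   list of (word, pos, lemma) in transcript order
    Assumes tokens, joined without spaces, reproduce each caption's text.
    """
    rows = []
    position = 0
    for index, start, end, text in captions:
        target = "".join(text.split())  # caption text without whitespace
        built = ""
        while built != target:
            if position >= len(tokens) or not target.startswith(built):
                raise ValueError(f"tokens do not line up with caption {index}")
            word, tag, lemma = tokens[position]
            built += word
            position += 1
            rows.append((index, start, end, word, tag, lemma))
    return rows


def to_csv(rows):
    """Serialize combined rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["caption", "start", "end", "word", "pos", "lemma"])
    writer.writerows(rows)
    return buffer.getvalue()
```

Matching on concatenated characters tolerates a tagger that splits punctuation into separate tokens, but it would fail if the tagger normalizes or rewrites words; a production pipeline would need a more forgiving alignment.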
  36. Divide CSV Files and Videos into Clips and adjust Timings and Numberings
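The timing and numbering adjustment can be sketched as follows: captions that fall inside a clip window are kept, their times are shifted so the clip starts at zero, and they are renumbered from 1. The function name `cut_clip` and the tuple layout are assumptions made for illustration.

```python
def cut_clip(captions, clip_start, clip_end):
    """Return captions inside [clip_start, clip_end], re-based to the clip.

    captions: iterable of (index, start, end, text), times in seconds.
    Captions that only partly overlap the window are dropped in this sketch.
    """
    clip = []
    for _, start, end, text in captions:
        if start >= clip_start and end <= clip_end:
            clip.append(
                (
                    len(clip) + 1,                 # renumber from 1
                    round(start - clip_start, 3),  # shift so clip begins at 0
                    round(end - clip_start, 3),
                    text,
                )
            )
    return clip


if __name__ == "__main__":
    full = [(1, 0.0, 2.0, "a"), (2, 10.0, 12.5, "b"), (3, 12.5, 15.0, "c")]
    for row in cut_clip(full, 10.0, 15.0):
        print(row)
```

Dropping partially overlapping captions is a design choice for simplicity; trimming them to the window boundaries would be an equally reasonable alternative.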
  37. From Corpus to Clips Archive
  38. Selecting Clips
  39. Topic List (Manual)
     • Abuelos
     • Amigos
     • Amor y relaciones
     • Comida
     • Criando Hijos
     • Cultura
     • Escuela
     • Familia
     • Futuro
     • Herencia familiar
     • Identidad
     • Idioma
     • La infancia
     • Matrimonio
     • Padres
     • Religión
     • Texas
     • Trabajo
  40. Automatic Pedagogical Annotations
  41. Manual Coding for Complex Cases
     • Annotation of ‘lo’ as an article that allows for the elision of nouns, as in “lo bueno de esta clase es…”
       – The rule requires two words: “lo” followed by an adjective, possibly with other words in between (in practice only adjective modifiers, such as adverbs, since the BARRIER operator tells the scanning process to stop if a typical NP boundary is crossed).
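The logic of the rule described above can be imitated in a few lines of Python. This is only a sketch of the scanning behavior (find “lo”, look rightward for an adjective, give up when a barrier tag is reached); it is not the project's actual rule, and the tag names and the contents of `BARRIER_TAGS` are illustrative assumptions.

```python
# Hypothetical tags treated as NP boundaries; a real tagset would differ.
BARRIER_TAGS = {"NC", "NP", "ART", "DM", "PREP", "VLfin", "VSfin", "CM", "FS"}


def find_nominalizing_lo(tagged, adj_tag="ADJ", barriers=BARRIER_TAGS):
    """Return (lo_index, adjective_index) pairs for 'lo ... ADJ' sequences.

    tagged: list of (word, tag) pairs for one sentence.
    The rightward scan skips non-barrier tags (e.g. adverbs) and stops
    without a match as soon as a barrier tag is encountered.
    """
    hits = []
    for i, (word, _) in enumerate(tagged):
        if word.lower() != "lo":
            continue
        for j in range(i + 1, len(tagged)):
            next_tag = tagged[j][1]
            if next_tag == adj_tag:
                hits.append((i, j))
                break
            if next_tag in barriers:
                break  # crossed a typical NP boundary: no match
    return hits


if __name__ == "__main__":
    sentence = [("lo", "ART"), ("más", "ADV"), ("bueno", "ADJ"),
                ("de", "PREP"), ("esta", "DM"), ("clase", "NC")]
    print(find_nominalizing_lo(sentence))
```

In the example sentence the adverb “más” is skipped and the adjective is found; in a sentence such as “lo vi cansado”, a finite-verb tag in the barrier set would stop the scan before the adjective is reached.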
  42. Automatic annotation levels for clips
     • Grammatical
       – aggregated from textbooks
     • Functional
       – greetings, asking for help, expressing opinions
     • Pragmatic
       – discourse markers, placeholders (“este”), attenuators
     • Bilingual forms
       – CS, loans, loan translations
  43. Keyword Search
  44. Filtering by Tags and Metadata
  45. Video Page
  46. Annotation Highlighter
  47. Cloze Test Generator
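A cloze exercise can be generated from a POS-tagged transcript by blanking out every word that carries a chosen tag. The sketch below shows that idea only; the function name `make_cloze`, the blank format, and the tags are assumptions, and the SpinTX tool itself may work differently.

```python
def make_cloze(tagged, target_tag, blank="_____"):
    """Replace every word tagged target_tag with a numbered blank.

    tagged: list of (word, tag) pairs.
    Returns the exercise text and the list of removed words (answer key).
    """
    words, answers = [], []
    for word, tag in tagged:
        if tag == target_tag:
            answers.append(word)
            words.append(f"({len(answers)}) {blank}")
        else:
            words.append(word)
    return " ".join(words), answers


if __name__ == "__main__":
    # Tags here are illustrative placeholders.
    clip = [("La", "ART"), ("gente", "NC"), ("puede", "VMfin"),
            ("ser", "VSinf"), ("pobre", "ADJ")]
    text, key = make_cloze(clip, "ADJ")
    print(text)
    print(key)
```

Because the blanks are driven by tags, the same function could target any annotation level (for example discourse markers) once those annotations are present in the token records.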
  48. Word cloud visualization tool
  49. OER Materials
     • Spanish in Texas searchable clip corpus available this spring
       – approximately 500 clips and growing
     • All custom scripts are available now through GitHub
     • IRB, talent release, Google metadata survey template, etc. available
  50. OER Materials
     • In the spirit of OER, please share-alike
     • Add to the repository any pedagogical materials you or your students might develop from the Spanish in Texas clip corpus
  51. Classroom and Community
     • We are designing the corpus and tools with the end users
       – using locally relevant language samples to illustrate every aspect of Spanish
     • Users model their own language for pedagogical purposes
     • The corpus is the textbook
  52. SpinTX: Corpus-to-Scholarship (the future)
  53. SpinTX: Corpus-to-Scholarship
     • Full interviews (videotaped, captioned, POS-tagged) will be made available
     • Syntactically parsed corpora
     • Additional public protocols, open-source search tools
  54. Corpus-to-Scholarship: Share-alike
     • When you use the corpus, share-alike
       – crowd-sourcing approach to additional annotation levels (e.g., Praat TextGrids)
         • we’ll use stand-off annotation
       – sociolinguists would ideally share data coding
       – corpus linguists would ideally share scripts
     • Any users could contribute their collections: video, transcript, and metadata
       – we’ll run them through SpinTX processing
  55. Archive
     • Spanish in Texas Corpus to be archived at the Nettie Lee Benson Latin American Collection, University of Texas Libraries
  56. Websites
     Project website: http://sites.la.utexas.edu/spanishtx/
     Corpus-to-Classroom Blog: http://sites.la.utexas.edu/corpus-to-classroom/
     Facebook page: https://www.facebook.com/spanish.in.texas
  57. Thanks
     • To all of our collaborators
     • Especially to our students and their friends, neighbors, and families who shared their time and their language with us
