
Spanish in the U.S.: Developing an open linguistic corpus



Presentation by project directors Barbara E. Bullock and Almeida Jacqueline Toribio at the 24th Conference on Spanish in the United States, March 2013 in McAllen, Texas.




  • 1. Spanish in the US: Developing an open linguistic corpus • Barbara E. Bullock & Almeida Jacqueline Toribio • 24th Conference on Spanish in the United States / 9th Conference on Spanish in Contact with Other Languages • March 6-9, 2013, McAllen, Texas
  • 2. Spanish in Texas Corpus Project • Purpose: to make publicly available authentic data about variation in Spanish as spoken in Texas – for education – for research
  • 3. Motivation • Document continuity and variation • Understand variation in its local context • Overcome the challenges of studying naturalistic data – Cost: gathering, transcribing, and coding data – Accountability: the corpora on which studies are based are rarely made available to the public • Encourage teachers, students, and the public to view local varieties as a resource
  • 4. Inspiration • Garland Bills, Vivian Cook, Lourdes Ortega, Ricardo Otheguy, Bonnie Urciuoli, Guadalupe Valdés, Walt Wolfram, Ana Celia Zentella, … • language variation ‘in the public interest’ • an empirical turn in thinking about contact varieties • Ornstein-Galicia’s (1981) call: investigate Spanish varieties in your own backyard, share resources, create concordances of usage
  • 5. Impetus • What is needed – large, representative samples of oral Spanish in the U.S. – metadata about the speakers – a context and protocols for sharing architecture, scripts, analytical techniques, and data … as well as findings
  • 6. Why open? • To facilitate access – attract as many eyes as possible to the same data – accelerate the production of findings, which is particularly important for the study of U.S. Spanish • To reduce costs in terms of time and money, especially for those who can least afford it
  • 7. But… • Large corpora are of limited utility to untrained end users – Teachers need short videos that are appropriate for classroom use – And teachers need tools • to easily search videos, • to author materials, • to curate their own collections
  • 8. Two-pronged approach • Spanish in Texas Corpus Project – Video interviews that provide rich content • SpinTX: Corpus-to-Classroom – Collection of pre-selected, corrected, annotated clips from the larger corpus – Open-source, pedagogically friendly search and authoring tools
  • 9. Goals of this talk • Document our efforts to develop an open corpus of U.S. Spanish, using open-source tools • Define ‘open’ • Describe the protocols that we are using to convert Spanish in TX interviews into a pedagogically useful corpus • Showcase materials and tools that we have available for use • Share our work with others who may be interested in developing open Spanish in X corpora • Look ahead to an open sociolinguistic/computational research corpus of the full interviews of Spanish in TX
  • 10. Origins of the project • Language Resource Center [LRC], 2010-2013 • Center for Open Educational Resources for Language Learning [COERLL]
  • 11. Open Educational Resources [OER] • Educational material offered freely for anyone to use, typically involving some permission to remix, improve, and redistribute
  • 12. Spanish in Texas Media License • Attribution Required • Non-Commercial • Share-Alike
  • 13. Spanish in Texas Corpus Project
  • 14. Spanish in Texas Corpus Project • Spanish in Texas is our first collection of video interviews – provides content for SpinTX Corpus-to-Classroom • Additional collections – Spanish in Texas CS collection – Hindi-English CS collection
  • 15. Spanish in Texas Corpus Project • Ideally, the corpus will serve as a reference corpus for oral Spanish in Texas – large (1 million+ words), representative of variation, fully open – currently 134 interviews; approx. 600,000 words • This will help establish a better “baseline” for heritage language research and teaching than the traditionally assumed monolingual one
  • 16. From Corpus to Classroom
  • 17. SpinTX: Corpus-to-Classroom • Aims – develop a pedagogically friendly interface for using the corpus – involve teachers and learners, via crowd-sourcing, social networking, and workshops, in the development of open educational resources – create a model for using open-source tools and a pedagogical interface that can be adapted for any language corpus collection
  • 18. Funding • Department of Education, Title VI • College of Liberal Arts • Longhorn Innovation Fund for Technology [LIFT]
  • 19. Our team • Directors: Barbara E. Bullock & Almeida J. Toribio • Project Manager and Web Architect: Rachael Gilg • Consultants – Graphic Designer: Nathalie Steinfeld Childre – Computational Linguist: Martí Quixal – Digital Media Producer: Scott Zúñiga – Educational Technologist: Arthur Wendorf – Outreach Coordinator: Jeffrey Michno – Content Manager, Intern Coordinator: Jacqueline Larsen Serigos – Materials Development: Jesse Abing, Joshua Frank • Undergraduate Interns – 2011-present: 12
  • 20. Our team • Collaborators – University of Texas Pan American • José Esteban Hernández • Stephanie Brock • José Flores • Viridiana Gallegos • Rossy Limas • Michelle Madrid – Texas A&M International University • Patricia González • Conchita Hickey • Lisa Flores – Others • Daniel Villa, New Mexico State University • María Irene Moyna, Texas A&M University • MaryEllen García, University of Texas, San Antonio • Jens Clegg, Indiana University-Purdue University • Abby Dings, Southwestern University
  • 21. “La gente puede ser pobre, pero compra su coca-cola” [“People may be poor, but they buy their Coca-Cola”]
  • 22. From Community to Corpus
  • 23. Recruit ‘locally’ • Recruit and train interns – Internal Review Board training – Video shooting and audio recording – Practice interviews on site • Recruit family, friends, acquaintances – Any Spanish-speaking resident of TX • Conduct interviews in their home communities
  • 24. Video Production Protocol • HD video cameras • Professional-quality condenser microphones – interviewer and interviewee are each recorded into a separate channel • Interviewer wears headphones to monitor audio
  • 25. Interview protocol • Sampling of a large set of questions (~75) – from NPR StoryCorps (Historias) – biographical information • Average length: 30-45 min. • Language: Spanish and mixed • Consent form and talent release • Metadata on speaker and interviewer – Google Docs
  • 26. Interview Metadata
  • 27. Processing the Videos • Intake interview materials – create unique ID for video and forms – archive raw video and remove from camera • Video and transcript preparation – Final Cut Pro – upload to Automatic Sync (3-5 day turnaround) – convert transcript to UTF-8, upload to Google Drive collection – upload to YouTube to create synced caption file (SRT)
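The "convert transcript to UTF-8" step on slide 27 could look roughly like the Python sketch below. The slides do not say which encoding the delivered transcripts use, so the `cp1252` fallback, the function names, and the file handling are assumptions for illustration, not the project's actual script.

```python
from pathlib import Path


def to_utf8(raw: bytes, fallback: str = "cp1252") -> str:
    """Decode transcript bytes, trying UTF-8 first and a legacy encoding second.

    The fallback encoding is an assumption; the slides do not name one.
    """
    try:
        # "utf-8-sig" also strips a leading byte-order mark if present.
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        return raw.decode(fallback, errors="replace")


def convert_file(src: Path, dst: Path) -> None:
    """Rewrite a transcript file as UTF-8 (hypothetical helper)."""
    dst.write_text(to_utf8(src.read_bytes()), encoding="utf-8")
```

Trying UTF-8 first matters for Spanish text: accented characters such as "ñ" are invalid UTF-8 when stored in a single-byte legacy encoding, so the failed decode is a usable signal that the fallback is needed.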
  • 28. From Transcript to Corpus
  • 29. Original Transcript (from Automatic Sync)
  • 30. Review Transcript in Google Docs
  • 31. Prepare Transcript for TreeTagger
  • 32. Run Transcript through TreeTagger (Spanish, English)
  • 33. Upload Video and Transcript to YouTube
  • 34. Download SRT File
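Once the SRT caption file has been downloaded, it has to be read into a structure that later steps can use. The sketch below parses the standard SRT layout (a cue number, a `start --> end` timing line, then one or more text lines, with blank lines between cues). The `Cue` type and function names are illustrative assumptions, not taken from the project's code.

```python
import re
from typing import NamedTuple


class Cue(NamedTuple):
    index: int
    start: float  # seconds from the start of the video
    end: float
    text: str


_TIME = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def _seconds(stamp: str) -> float:
    """Convert an 'HH:MM:SS,mmm' timestamp to seconds."""
    h, m, s, ms = map(int, _TIME.fullmatch(stamp.strip()).groups())
    return h * 3600 + m * 60 + s + ms / 1000


def parse_srt(srt: str) -> list[Cue]:
    """Parse SRT text into cues, skipping blocks that lack a timing line."""
    cues = []
    cleaned = srt.lstrip("\ufeff").replace("\r\n", "\n").strip()
    for block in re.split(r"\n\s*\n", cleaned):
        lines = block.splitlines()
        if len(lines) < 2 or "-->" not in lines[1]:
            continue
        start, end = lines[1].split("-->")
        text = " ".join(line.strip() for line in lines[2:])
        cues.append(Cue(int(lines[0]), _seconds(start), _seconds(end), text))
    return cues
```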
  • 35. Combine Data from SRT File and TreeTagger File, and add additional Tags
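One way to combine the two files is to give every tagged token the timing of the caption cue it came from. The sketch below assumes TreeTagger's usual three-column, tab-separated output (token, part-of-speech tag, lemma) and assumes the tagger only splits the caption text into tokens without changing characters, so tokens can be assigned to cues by counting characters. The function names and CSV columns are hypothetical; the slides do not show the project's own merge script.

```python
import csv
import io


def parse_treetagger(output: str) -> list[tuple[str, str, str]]:
    """Read tab-separated 'token, POS, lemma' lines; skip anything else."""
    rows = []
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) == 3:
            rows.append((parts[0], parts[1], parts[2]))
    return rows


def align(cues, tagged):
    """Attach cue timings to tokens.

    cues: list of (start, end, text); tagged: list of (token, pos, lemma).
    Tokens are consumed in order until they cover the cue's non-space characters.
    """
    rows, i = [], 0
    for start, end, text in cues:
        remaining = len("".join(text.split()))
        while remaining > 0 and i < len(tagged):
            token, pos, lemma = tagged[i]
            rows.append((start, end, token, pos, lemma))
            remaining -= len(token)
            i += 1
    return rows


def to_csv(rows) -> str:
    """Serialize aligned rows as CSV with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["start", "end", "token", "pos", "lemma"])
    writer.writerows(rows)
    return buf.getvalue()
```

Character counting is a simplification: if the tagger normalizes or rewrites characters, a more robust alignment would be needed. Additional tag columns could be appended to each row in the same way.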
  • 36. Divide CSV Files and Videos into Clips and adjust Timings and Numberings
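Cutting a clip out of a full interview means the captions inside it need new timings (relative to the start of the clip) and new cue numbers (starting again at 1). A minimal sketch of that adjustment, with hypothetical function names and a plain tuple layout chosen for illustration:

```python
def cut_clip(cues, clip_start, clip_end):
    """Return the cues overlapping [clip_start, clip_end), re-timed and renumbered.

    cues: list of (index, start, end, text) with times in seconds.
    Cues that straddle a clip boundary are trimmed to the clip.
    """
    clip = []
    for _, start, end, text in cues:
        if end <= clip_start or start >= clip_end:
            continue  # cue lies entirely outside the clip
        new_start = max(start, clip_start) - clip_start
        new_end = min(end, clip_end) - clip_start
        clip.append((len(clip) + 1, new_start, new_end, text))
    return clip


def stamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, 'HH:MM:SS,mmm'."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
```

The same offset would be applied to the timing columns of the per-token CSV so that the clip's video, captions, and annotations stay in sync.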
  • 37. From Corpus to Clips Archive
  • 38. Selecting Clips
  • 39. Topic List (Manual) • Abuelos • Amigos • Amor y relaciones • Comida • Criando Hijos • Cultura • Escuela • Familia • Futuro • Herencia familiar • Identidad • Idioma • La infancia • Matrimonio • Padres • Religión • Texas • Trabajo
  • 40. Automatic Pedagogical Annotations
  • 41. Manual Coding for Complex Cases • Annotation of ‘lo’ as an article that allows for the elision of nouns, as in “lo bueno de esta clase es…” – The rule requires a sequence of two words: “lo” followed by an adjective, possibly with some words in between (in fact only adjective modifiers, such as adverbs, since the BARRIER operator tells the scanning process to stop if a typical NP boundary is crossed).
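The logic of the ‘lo’ rule can be approximated in plain Python over part-of-speech-tagged tokens: starting at “lo”, scan rightward, skip only adjective modifiers, and give up as soon as any other tag is met, which plays the role of the BARRIER. This is a sketch of the idea only, not the project's rule or its rule syntax; the tag names `ADJ` and `ADV` and the function name are assumptions.

```python
def find_nominalizing_lo(tagged, adjective="ADJ", modifiers=frozenset({"ADV"})):
    """Find 'lo' + (modifiers)* + adjective sequences.

    tagged: list of (token, pos) pairs.
    Returns (lo_position, adjective_position) pairs. Any tag that is neither
    a modifier nor the adjective acts as a barrier and ends the scan.
    """
    hits = []
    for i, (token, _) in enumerate(tagged):
        if token.lower() != "lo":
            continue
        j = i + 1
        while j < len(tagged) and tagged[j][1] in modifiers:
            j += 1  # skip adjective modifiers such as adverbs
        if j < len(tagged) and tagged[j][1] == adjective:
            hits.append((i, j))
    return hits
```

With this logic “lo bueno de esta clase” and “lo más importante” match, while a clitic “lo” followed by a verb (“lo veo bonito”) does not, because the verb stops the scan before the adjective is reached.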
  • 42. Automatic annotation levels for clips • Grammatical – aggregated from textbooks • Functional – greetings, ask for help, express opinions • Pragmatics – discourse markers, place holders (“este”), attenuators • Bilingual forms – CS, loans, loan translations
  • 43. Keyword Search
  • 44. Filtering by Tags and Metadata
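Keyword search and filtering by tags and metadata can be thought of as successive conditions applied to each clip record. The sketch below is an in-memory illustration only; the `Clip` fields, the metadata keys, and the `search` function are hypothetical and say nothing about how the SpinTX site implements its search.

```python
from dataclasses import dataclass, field


@dataclass
class Clip:
    clip_id: str
    transcript: str
    tags: set = field(default_factory=set)       # e.g. topic labels
    metadata: dict = field(default_factory=dict)  # e.g. speaker information


def search(clips, keyword=None, tags=(), **metadata):
    """Return clips matching the keyword, all requested tags, and all metadata values."""
    results = []
    for clip in clips:
        if keyword and keyword.lower() not in clip.transcript.lower():
            continue
        if not set(tags) <= clip.tags:
            continue
        if any(clip.metadata.get(k) != v for k, v in metadata.items()):
            continue
        results.append(clip)
    return results
```

For example, `search(clips, keyword="abuela", tags=["Comida"])` would return only clips whose transcript contains “abuela” and that carry the topic tag “Comida”.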
  • 45. Video Page
  • 46. Annotation Highlighter
  • 47. Cloze Test Generator
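A cloze test can be generated from a tagged transcript by blanking out every token whose tag belongs to a target set, for instance all verbs. The function below is a minimal sketch under that assumption; its name, the blank format, and the tag names in the usage example are illustrative and not taken from the SpinTX tool.

```python
def make_cloze(tagged, target_pos, blank="_____"):
    """Replace tokens with a target POS tag by numbered blanks.

    tagged: list of (token, pos) pairs.
    Returns (exercise_text, answer_key). Tokens are joined with single
    spaces, so punctuation spacing is left unpolished in this sketch.
    """
    words, answers = [], []
    for token, pos in tagged:
        if pos in target_pos:
            answers.append(token)
            words.append(f"({len(answers)}) {blank}")
        else:
            words.append(token)
    return " ".join(words), answers
```

Applied to the first half of the sentence from slide 21 with verb tags as targets, this yields “La gente (1) _____ (2) _____ pobre” with the answer key “puede”, “ser”.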
  • 48. Word cloud visualization tool
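A word cloud is driven by word frequencies, so the data behind such a visualization can be sketched as a count of the words in a clip's transcript after removing very common function words. The stopword list below is a tiny illustrative sample and the function name is hypothetical; the slides do not describe how the actual tool computes its counts.

```python
import re
from collections import Counter

# A deliberately small, illustrative stopword list (an assumption, not the tool's list).
STOPWORDS = frozenset(
    {"el", "la", "los", "las", "de", "que", "y", "a", "en", "un", "una", "es", "su", "pero"}
)


def word_frequencies(text, top=10, stopwords=STOPWORDS):
    """Return the `top` most frequent non-stopword words as (word, count) pairs."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(w for w in words if w not in stopwords).most_common(top)
```

The resulting counts would then be mapped to font sizes by whatever rendering library draws the cloud.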
  • 49. OER Materials • Spanish in Texas searchable clip corpus available this spring – approximately 500 clips and growing • All specially created code scripts are available now through GitHub • IRB, talent release, Google metadata survey template, etc. available
  • 50. OER Materials • In the spirit of OER, please share-alike • Add to the repository any pedagogical materials you or your students might develop from the Spanish in Texas clip corpus
  • 51. Classroom and Community • We are designing the corpus and tools with the end-users – using locally-relevant language samples to illustrate every aspect of Spanish • Users model their own language for pedagogical purposes • The corpus is the textbook
  • 52. SPinTX: Corpus-to-Scholarship (the future)
  • 53. SPinTX: Corpus-to-Scholarship • Full interviews, video-taped, captioned, POS tagged will be made available • Syntactically-parsed corpora • Additional public protocols, open-source search tools
  • 54. Corpus-to-Scholarship: Share-alike • When you use the corpus, share-alike – crowd-sourcing approach to additional annotation levels (e.g., PRAAT text grids) • we’ll use stand-off annotation – sociolinguists would ideally share data coding – corpus linguists would ideally share scripts • Any users could contribute their collections: video, transcript, and metadata – we’ll run it through SpinTX processing
  • 55. Archive • Spanish in Texas Corpus to be archived at the Nettie Lee Benson Latin American Collection, University of Texas Libraries
  • 56. Websites • Project website: • Blog: • page:
  • 57. Thanks • To all of our collaborators • Especially to our students and their friends, neighbors, and families who shared their time and their language with us