Spanish in the U.S.:
Developing an open linguistic corpus

     Barbara E. Bullock & Almeida Jacqueline Toribio


            24th Conference on Spanish in the United States
      9th Conference on Spanish in Contact with Other Languages
                   March 6-9, 2013, McAllen, Texas
Spanish in Texas Corpus Project
• Purpose: to make publicly available
  authentic data about variation in Spanish as
  spoken in Texas
  – for education
  – for research




                                                 2
Motivation
• Document continuity and variation
• Understand variation in its local context
• Overcome the challenges of studying
  naturalistic data
  – Cost: gathering, transcribing and coding data
  – Accountability: corpora upon which studies are
    based are rarely made available to the public
• Encourage teachers/students/public to view
  local varieties as a resource
                                                     3
Inspiration
• Garland Bills, Vivian Cook, Lourdes Ortega,
  Ricardo Otheguy, Bonnie Urciuoli, Guadalupe
  Valdés, Walt Wolfram, Ana Celia Zentella, …
     • language variation ‘in the public interest’
     • an empirical turn in thinking about contact varieties
     • Ornstein-Galicia’s (1981) call: investigate Spanish
       varieties in your own backyard, share resources, create
       concordances of usage



                                                                 4
Impetus
• What is needed
  – large, representative samples of oral Spanish in
    the U.S.
  – metadata about the speakers
  – a context and protocols for sharing architecture,
    scripts, analytical techniques, and data
    … as well as findings



                                                        5
Why open?
• To facilitate access
  – attract as many eyes as possible to the same data
  – accelerate the production of findings, which is
    particularly important for the study of U.S.
    Spanish
• To reduce costs in terms of time and money,
  especially for those who can least afford it


                                                        6
But…
• Large corpora are of limited utility to
  untrained end users
  – Teachers need short videos that are appropriate
    for classroom use
  – And teachers need tools
     • to easily search videos,
     • to author materials,
     • to curate their own collections



                                                      7
Two-pronged approach
• Spanish in Texas Corpus Project
  – Video interviews that provide rich content


• SpinTX: Corpus-to-Classroom
  – Collection of pre-selected, corrected, annotated
    clips from the larger corpus
  – Open-source, pedagogically-friendly search and
    authoring tools

                                                       8
Goals of this talk
• Document our efforts to develop an open corpus of
  U.S. Spanish, using open-source tools
• Define ‘open’
• Describe the protocols that we are using to convert
  Spanish in TX interviews into a pedagogically useful corpus
• Showcase the materials and tools that are available for use
• Share our work with others who may be interested in
  developing open Spanish in X corpora
• Look ahead to an open sociolinguistic/computational
  research corpus of the full Spanish in TX interviews

                                                            9
Origins of the project
• Language Resource Center
  [LRC], 2010-2013
• Center for Open Educational Resources for
  Language Learning [COERLL]




                                              10
Open Educational Resources [OER]
 Educational material offered freely for anyone to
 use, typically involving some permission to remix,
 improve, and redistribute

 creativecommons.org




                                                      11
Spanish in Texas Media License




  • Attribution Required
  • Non-Commercial
  • Share-Alike
                                 12
Spanish in Texas Corpus Project




                                  13
Spanish in Texas Corpus Project
• Spanish in Texas is our first collection of video
  interviews
  – provides content for SpinTX Corpus-to-Classroom

• Additional collections
  – Spanish in Texas code-switching (CS) collection
  – Hindi-English CS collection



                                                      14
Spanish in Texas Corpus Project
• Ideally serve as a reference corpus for oral
  Spanish in Texas
  – large (1 million + words), representative of
    variation, fully open
  – currently 134 interviews; approx. 600,000 words
• This will help establish a better “baseline” for
  heritage language research and teaching than
  the traditionally assumed monolingual one

                                                      15
From Corpus to Classroom




                           16
SpinTX: Corpus-to-Classroom
• Aims
  – develop a pedagogically friendly interface for
    using the corpus
  – involve teachers and learners, via crowd-sourcing,
    social networking, and workshops, in the
    development of open educational resources
  – create a model for using open source tools and a
    pedagogical interface that can be adapted for any
    language corpus collection

                                                         17
Funding
• Department of Education, Title VI
• College of Liberal Arts
• Longhorn Innovation Fund for Technology
  [LIFT]




                                            18
Our team
• Directors: Barbara E. Bullock & Almeida J. Toribio
• Project Manager and Web Architect: Rachael Gilg
• Consultants
   –   Graphic Designer: Nathalie Steinfeld Childre
   –   Computational Linguist: Martí Quixal
   –   Digital Media Producer: Scott Zúñiga
   –   Educational Technologist: Arthur Wendorf
   –   Outreach Coordinator: Jeffrey Michno
   –   Content Manager, Intern Coordinator: Jacqueline Larsen Serigos
   –   Materials Development: Jesse Abing, Joshua Frank
• Undergraduate Interns
   – 2011-present: 12


                                                                        19
Our team
• Collaborators
   – University of Texas Pan American
       •   José Esteban Hernández
       •   Stephanie Brock
       •   José Flores
       •   Viridiana Gallegos
       •   Rossy Limas
       •   Michelle Madrid
   – Texas A&M International University
       • Patricia González
       • Conchita Hickey
       • Lisa Flores
   – Others
       •   Daniel Villa, New Mexico State University
       •   María Irene Moyna, Texas A&M University
       •   MaryEllen García, University of Texas, San Antonio
       •   Jens Clegg, Indiana University-Purdue University
       •   Abby Dings, Southwestern University


                                                                20
“La gente puede ser pobre, pero
     compra su coca-cola”




                                  21
From Community to Corpus
Recruit ‘locally’
• Recruit and train interns
  – Institutional Review Board (IRB) training
  – Video shooting and audio recording
  – Practice interviews on site
• Recruit family, friends, acquaintances
  – Any Spanish-speaking resident of TX
• Conduct interviews in their home
  communities

                                           23
Video Production Protocol
• HD video cameras
• Professional quality condenser microphones
  – interviewer and interviewee are each recorded
    into a separate channel
• Interviewer wears headphones to monitor
  audio



                                                    24
Interview protocol
• Sampling of a large set of questions (~75)
    – from NPR StoryCorps (Historias)
    – biographical information
•   Average Length: 30-45 min.
•   Language: Spanish and mixed
•   Consent form and talent release
•   Metadata on speaker and interviewer
    – Google Docs
                                               25
Interview Metadata




                     26
Processing the Videos
• Intake interview materials
  – create unique ID for video and forms
  – archive raw video and remove from camera
• Video and transcript preparation
  – Final Cut Pro
  – Upload to Automatic Sync (3-5 day turnaround)
  – convert transcript to UTF-8, upload to Google Drive
    collection
  – upload to YouTube to create synced caption file (SRT)
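
The UTF-8 conversion step can be sketched as follows. This is a minimal illustration, not the project's actual script: the source encoding (`cp1252`) and the function names are assumptions, since the slides do not say what encoding the delivered transcripts use.

```python
def to_utf8(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode bytes from an assumed legacy encoding and re-encode them as UTF-8."""
    text = raw.decode(source_encoding)
    # Normalize Windows line endings so later tools see plain "\n".
    text = text.replace("\r\n", "\n")
    return text.encode("utf-8")


def convert_file(src_path: str, dst_path: str, source_encoding: str = "cp1252") -> None:
    """Read a transcript file, convert it, and write the UTF-8 copy."""
    with open(src_path, "rb") as src:
        converted = to_utf8(src.read(), source_encoding)
    with open(dst_path, "wb") as dst:
        dst.write(converted)
```

Converting before any review or tagging keeps accented characters such as ú, ñ, and ¿ intact in every later step.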

                                                            27
From Transcript to Corpus
Original Transcript (from Automatic Sync)




                                        29
Review Transcript in Google Docs




                                   30
Prepare Transcript for TreeTagger




                                    31
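
Preparing a transcript for a tagger usually means putting one token on each line. The sketch below shows one way to do that. The regular expression and the function name are illustrative assumptions, not the project's own preparation script.

```python
import re

# A word (letters/digits, optionally joined by hyphens or apostrophes),
# or any single character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def to_one_token_per_line(text: str) -> str:
    """Split running text into tokens and join them with newlines."""
    return "\n".join(TOKEN_RE.findall(text))
```

Punctuation such as ¿ and ? becomes its own token, while hyphenated forms like "coca-cola" stay together.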
Run Transcript through TreeTagger
Spanish          English




                                    32
Upload Video and Transcript to YouTube




                                         33
Download SRT File




                    34
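
An SRT caption file is a series of blocks, each with a number, a time range, and one or more lines of text. The sketch below reads such a file into simple records. The `Caption` type and the function names are hypothetical and chosen only for illustration.

```python
import re
from typing import List, NamedTuple


class Caption(NamedTuple):
    index: int
    start: float  # seconds from the beginning of the video
    end: float
    text: str


TIME_RE = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def to_seconds(stamp: str) -> float:
    """Convert an 'HH:MM:SS,mmm' timestamp to seconds."""
    hours, minutes, seconds, millis = map(int, TIME_RE.fullmatch(stamp.strip()).groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(srt: str) -> List[Caption]:
    """Parse SRT text into Caption records, skipping malformed blocks."""
    captions = []
    normalized = srt.replace("\r\n", "\n").strip()
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.splitlines()
        if len(lines) < 2 or "-->" not in lines[1]:
            continue
        start, end = lines[1].split("-->")
        captions.append(
            Caption(int(lines[0]), to_seconds(start), to_seconds(end), " ".join(lines[2:]))
        )
    return captions
```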
Combine Data from SRT File and
TreeTagger File, and add additional Tags




                                           35
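
One way to combine the two files is to walk through the captions in order and attach the time range of each caption to the tagged tokens it contains. The sketch below assumes tagger output in a three-column, tab-separated layout (token, tag, lemma) and captions given as (start, end, text) tuples. The column names, the tokenizing pattern, and the example tags are illustrative assumptions, not the project's actual format.

```python
import csv
import io
import re

TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def parse_tagger_output(output):
    """Parse 'token<TAB>tag<TAB>lemma' lines into tuples, skipping other lines."""
    rows = []
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) == 3:
            rows.append((parts[0], parts[1], parts[2]))
    return rows


def combine(captions, tagged):
    """Attach caption number and timing to each tagged token, in order."""
    rows = []
    position = 0
    for number, (start, end, text) in enumerate(captions, start=1):
        for word in TOKEN_RE.findall(text):
            if position >= len(tagged) or tagged[position][0] != word:
                raise ValueError(f"token mismatch at caption {number}: {word!r}")
            token, tag, lemma = tagged[position]
            rows.append([number, start, end, token, tag, lemma])
            position += 1
    return rows


def to_csv(rows):
    """Serialize combined rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["caption", "start", "end", "token", "pos", "lemma"])
    writer.writerows(rows)
    return buffer.getvalue()
```

Raising an error on a mismatch makes it obvious when the caption text and the tagged transcript have drifted apart, for example after a manual correction to only one of them. Additional tag columns could be appended to each row in the same pass.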
Divide CSV Files and Videos into Clips and
     adjust Timings and Numberings




                                         36
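
Cutting a clip out of a full interview means keeping only the captions inside the clip, restarting their numbering at 1, and shifting their times so the clip begins at zero. The sketch below shows that adjustment. The dictionary keys and the function name are assumptions made for illustration.

```python
def cut_clip(captions, clip_start, clip_end):
    """Return the captions that fall fully inside a clip, renumbered and re-timed.

    captions: list of dicts with 'index', 'start', 'end' (seconds) and 'text'.
    """
    clip = []
    for caption in captions:
        if caption["start"] >= clip_start and caption["end"] <= clip_end:
            clip.append(
                {
                    "index": len(clip) + 1,  # numbering restarts at 1 in each clip
                    "start": round(caption["start"] - clip_start, 3),
                    "end": round(caption["end"] - clip_start, 3),
                    "text": caption["text"],
                }
            )
    return clip
```

This version drops captions that straddle a clip boundary. Trimming them instead would be an equally reasonable choice.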
From Corpus to Clips Archive
Selecting Clips




                  38
Topic List (Manual)
•   Abuelos             •   Herencia familiar
•   Amigos              •   Identidad
•   Amor y relaciones   •   Idioma
•   Comida              •   La infancia
•   Criando Hijos       •   Matrimonio
•   Cultura             •   Padres
•   Escuela             •   Religión
•   Familia             •   Texas
•   Futuro              •   Trabajo

                                                39
Automatic Pedagogical Annotations




                                    40
Manual Coding for Complex Cases
• Annotation of ‘lo’ as an article that allows for the
  elision of nouns as in “lo bueno de esta clase es…”



   – The rule requires a sequence of two words: “lo”
     followed by an adjective, possibly with some words in
     between (in fact only adjective modifiers, such as adverbs,
     since the BARRIER operator tells the scanning process to
     stop if a typical NP boundary is crossed).
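
The scanning behavior of the rule above can be imitated in a few lines of Python. This is only a sketch of the logic, not the project's rule or its rule formalism: the tag names (`ADJ`, `ADV`) and the function name are assumptions, and anything that is not an adjective modifier is treated as the barrier.

```python
def find_nominalizing_lo(tagged, target="ADJ", modifiers=frozenset({"ADV"})):
    """Return the positions of 'lo' tokens that are followed by an adjective.

    tagged: list of (word, pos) pairs. The scan to the right of 'lo' may pass
    over adjective modifiers, but stops at any other tag (the barrier).
    """
    hits = []
    for i, (word, _) in enumerate(tagged):
        if word.lower() != "lo":
            continue
        for _, next_pos in tagged[i + 1:]:
            if next_pos == target:
                hits.append(i)
                break
            if next_pos not in modifiers:
                break  # barrier reached: stop scanning
    return hits
```

With this logic, "lo bueno" and "lo más importante" are matched, while "lo compra barato" is not, because the verb blocks the scan before the adjective is reached.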

                                                                41
Automatic annotation levels for clips
• Grammatical
  – aggregated from textbooks
• Functional
  – greetings, asking for help, expressing opinions
• Pragmatic
  – discourse markers, placeholders (“este”),
    attenuators
• Bilingual forms
  – CS, loans, loan translations

                                                 42
Keyword Search




                 43
Filtering by Tags and Metadata




                                 44
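
Filtering by tags and metadata can be thought of as keeping only the clips that carry every requested tag and whose speaker record matches every requested value. The sketch below illustrates the idea. The record layout (`tags`, `speaker`) and the field names are hypothetical, not the project's schema.

```python
def filter_clips(clips, tags=(), **speaker_fields):
    """Return clips that carry every requested tag and match every speaker field."""
    wanted = set(tags)
    selected = []
    for clip in clips:
        has_tags = wanted <= set(clip.get("tags", ()))
        speaker = clip.get("speaker", {})
        matches = all(speaker.get(key) == value for key, value in speaker_fields.items())
        if has_tags and matches:
            selected.append(clip)
    return selected
```

A call such as `filter_clips(clips, tags=["Comida"], city="McAllen")` would combine a topic tag with a speaker attribute.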
Video Page




             45
Annotation Highlighter




                         46
Cloze Test Generator




                       47
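
A cloze exercise can be generated from a tagged transcript by blanking out the words of a chosen category and keeping them as the answer key. The sketch below shows the idea. The tag names and the function name are assumptions, and it is not the project's generator.

```python
def make_cloze(tagged, blank_pos):
    """Replace words whose tag is in blank_pos with numbered blanks.

    tagged: list of (word, pos) pairs.
    Returns the exercise text and the list of removed words (the answer key).
    """
    parts = []
    answers = []
    for word, pos in tagged:
        if pos in blank_pos:
            answers.append(word)
            parts.append(f"({len(answers)}) ____")
        else:
            parts.append(word)
    return " ".join(parts), answers
```

Choosing the blanked category from the annotation levels (for example, all adjectives or all discourse markers) lets a teacher target one feature per exercise.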
Word cloud visualization tool




                                48
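
A word cloud sizes each word by how often it occurs. The sketch below computes such relative weights from a transcript. The stopword list, the scaling, and the function name are illustrative assumptions.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real tool would use a fuller one.
STOPWORDS = frozenset({"a", "de", "el", "en", "la", "pero", "que", "su", "y"})


def word_weights(text, top=20, stopwords=STOPWORDS):
    """Return the most frequent non-stopwords, scaled so the top word has weight 1.0."""
    words = [w for w in re.findall(r"\w+", text.lower()) if w not in stopwords]
    counts = Counter(words).most_common(top)
    if not counts:
        return {}
    highest = counts[0][1]
    return {word: count / highest for word, count in counts}
```

The weights can then be mapped to font sizes by whatever drawing tool renders the cloud.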
OER Materials
• Spanish in Texas searchable clip corpus
  available this spring
  – approximately 500 clips and growing
• All specially created code scripts are available
  now through GitHub
• IRB forms, talent release, Google metadata survey
  template, etc. are available


                                                     49
OER Materials
• In the spirit of OER, please share alike
• Add to the repository any pedagogical materials
  you or your students develop from the
  Spanish in Texas clip corpus




                                                50
Classroom and Community
• We are designing the corpus and tools with
  the end-users
  – using locally-relevant language samples to
    illustrate every aspect of Spanish
• Users model their own language for
  pedagogical purposes
• The corpus is the textbook


                                                 51
SpinTX: Corpus-to-Scholarship
         (the future)




                                52
SpinTX: Corpus-to-Scholarship
• Full interviews (videotaped, captioned, and POS-tagged)
  will be made available
• Syntactically-parsed corpora
• Additional public protocols, open-source
  search tools




                                                 53
Corpus-to-Scholarship: Share-alike
• When you use the corpus, share-alike
  – crowd-sourcing approach to additional annotation
    levels (e.g., Praat TextGrids)
     • we’ll use stand-off annotation
  – sociolinguists would ideally share data coding
  – corpus linguists would ideally share scripts
• Any users could contribute their collections:
  video, transcript, and metadata
  – we’ll run it through SpinTX processing

                                                       54
Archive
• Spanish in Texas Corpus to be archived at the
  Nettie Lee Benson Latin American Collection,
  University of Texas Libraries




                                                  55
Websites
Project website:
http://sites.la.utexas.edu/spanishtx/

Corpus-to-Classroom Blog:
http://sites.la.utexas.edu/corpus-to-classroom/

Facebook page:
https://www.facebook.com/spanish.in.texas

                                                  56
Thanks
• To all of our collaborators
• Especially to our students and their friends,
  neighbors, and families who shared their time
  and their language with us




                                                  57
