Spanish in the U.S.: Developing an open linguistic corpus

Spanish in the US:
Developing an open linguistic corpus

Barbara E. Bullock & Almeida Jacqueline Toribio

24th Conference on Spanish in the United States
9th Conference on Spanish in Contact with Other Languages
March 6-9, 2013, McAllen, Texas

Spanish in Texas Corpus Project
• Purpose: to make publicly available
authentic data about variation in Spanish as
spoken in Texas
– for education
– for research

2

Motivation
• Document continuity and variation
• Understand variation in its local context
• Overcome the challenges of studying
naturalistic data
– Cost: gathering, transcribing and coding data
– Accountability: corpora upon which studies are
based are rarely made available to the public
• Encourage teachers/students/public to view
local varieties as a resource
3

Inspiration
• Garland Bills, Vivian Cook, Lourdes Ortega,
Ricardo Otheguy, Bonnie Urciuoli, Guadalupe
Valdés, Walt Wolfram, Ana Celia Zentella, …
• language variation ‘in the public interest’
• an empirical turn in thinking about contact varieties
• Ornstein-Galicia’s (1981) call: investigate Spanish
varieties in your own backyard, share resources, create
concordances of usage

4

Impetus
• What is needed
– large, representative samples of oral Spanish in
the U.S.
– metadata about the speakers
– a context and protocols for sharing architecture,
scripts, analytical techniques, and data
… as well as findings

5

Why open?
• To facilitate access
– attract as many eyes as possible to the same data
– accelerate the production of findings, which is
particularly important for the study of U.S.
Spanish
• To reduce costs in terms of time and money,
especially for those who can least afford it

6

But…
• Large corpora are of limited utility to
untrained end users
– Teachers need short videos that are appropriate
for classroom use
– And teachers need tools
• to easily search videos,
• to author materials,
• to curate their own collections

7

Two-pronged approach
• Spanish in Texas Corpus Project
– Video interviews that provide rich content

• SpinTX: Corpus-to-Classroom
– Collection of pre-selected, corrected, annotated
clips from the larger corpus
– Open-source, pedagogically-friendly search and
authoring tools

8

Goals of this talk
• Document our efforts to develop an open corpus of
U.S. Spanish, using open-source tools
• Define ‘open’
• Describe the protocols that we are using to convert
Spanish in TX interviews into a pedagogically useful corpus
• Showcase materials and tools that are available for use
• Share our work with others who may be interested in
developing open Spanish in X corpora
• Look ahead to an open sociolinguistic/computational
research corpus of the full interviews of Spanish in TX

9

Origins of the project
• Language Resource Center
[LRC], 2010-2013
• Center for Open Educational Resources for
Language Learning [COERLL]

10

Open Educational Resources [OER]
Educational material offered freely for anyone to
use, typically involving some permission to remix,
improve, and redistribute

creativecommons.org

11

Spanish in Texas Media License

• Attribution Required
• Non-Commercial
• Share-Alike
12


13

• Spanish in Texas is our first collection of video
interviews
– provides content for SpinTX Corpus-to-Classroom

• Additional collections
– Spanish in Texas CS collection
– Hindi-English CS collection

14

• Ideally serve as a reference corpus for oral
Spanish in Texas
– large (1 million + words), representative of
variation, fully open
– currently 134 interviews; approx. 600,000 words
• This will help establish a better “baseline” for
heritage language research and teaching than
the traditionally assumed monolingual one

15

From Corpus to Classroom

16

SpinTX: Corpus-to-Classroom
• Aims
– develop a pedagogically friendly interface for
using the corpus
– involve teachers and learners, via crowd-sourcing,
social networking, and workshops, in the
development of open educational resources
– create a model for using open source tools and a
pedagogical interface that can be adapted for any
language corpus collection

17

Funding
• Department of Education, Title VI
• College of Liberal Arts
• Longhorn Innovation Fund for Technology
[LIFT]

18

Our team
• Directors: Barbara E. Bullock & Almeida J. Toribio
• Project Manager and Web Architect: Rachael Gilg
• Consultants
– Graphic Designer: Nathalie Steinfeld Childre
– Computational Linguist: Martí Quixal
– Digital Media Producer: Scott Zúñiga
– Educational Technologist: Arthur Wendorf
– Outreach Coordinator: Jeffrey Michno
– Content Manager, Intern Coordinator: Jacqueline Larsen Serigos
– Materials Development: Jesse Abing, Joshua Frank
• Undergraduate Interns
– 2011-present: 12

19

Our team
• Collaborators
– University of Texas Pan American
• José Esteban Hernández
• Stephanie Brock
• José Flores
• Viridiana Gallegos
• Rossy Limas
• Michelle Madrid
– Texas A&M International University
• Patricia González
• Conchita Hickey
• Lisa Flores
– Others
• Daniel Villa, New Mexico State University
• María Irene Moyna, Texas A&M University
• MaryEllen García, University of Texas, San Antonio
• Jens Clegg, Indiana University-Purdue University
• Abby Dings, Southwestern University

20

“La gente puede ser pobre, pero
compra su coca-cola”

21

Recruit ‘locally’
• Recruit and train interns
– Institutional Review Board training
– Video shooting and audio recording
– Practice interviews on site
• Recruit family, friends, acquaintances
– Any Spanish-speaking resident of TX
• Conduct interviews in their home
communities

23

Video Production Protocol
• HD video cameras
• Professional quality condenser microphones
– interviewer and interviewee are each recorded
into a separate channel
• Interviewer wears headphones to monitor
audio

24

Interview protocol
• Sampling of a large set of questions (~75)
– from NPR StoryCorps (Historias)
– biographical information
• Average Length: 30-45 min.
• Language: Spanish and mixed
• Consent form and talent release
• Metadata on speaker and interviewer
– Google Docs
25

Interview Metadata

26

Processing the Videos
• Intake interview materials
– create unique ID for video and forms
– archive raw video and remove from camera
• Video and transcript preparation
– Final Cut Pro
– Upload to Automatic Sync (3-5 day turnaround)
– convert transcript to UTF-8, upload to Google Drive
collection
– upload to YouTube to create synced caption file (SRT)
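
The UTF-8 conversion step can be sketched as below. This is a minimal illustration, not the project's script: the slides do not say which encoding the transcripts arrive in, so the candidate encodings and the function name `to_utf8` are assumptions.

```python
def to_utf8(raw, candidates=("utf-8-sig", "cp1252")):
    """Decode bytes with the first candidate encoding that fits, then re-encode as UTF-8.

    The candidate list is an assumption; adjust it to the encodings
    actually seen in the delivered transcripts.
    """
    for encoding in candidates:
        try:
            text = raw.decode(encoding)
            break
        except UnicodeDecodeError:
            continue
    else:
        # Latin-1 maps every byte to a character, so this fallback cannot fail.
        text = raw.decode("latin-1")
    return text.encode("utf-8")


# Usage: a transcript saved in Windows-1252 is normalized to UTF-8 bytes.
utf8_bytes = to_utf8("señor".encode("cp1252"))
```

Trying UTF-8 first matters because a file that is already valid UTF-8 would otherwise be mis-decoded as a single-byte encoding and accented characters would be corrupted.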

27

Original Transcript (from Automatic Sync)

29

Review Transcript in Google Docs

30

Prepare Transcript for TreeTagger

31

Run Transcript through TreeTagger
Spanish English
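
TreeTagger, when run with token and lemma output enabled, writes one token per line as three tab-separated fields: word form, part-of-speech tag, lemma. A minimal sketch of reading that output follows; the function name `parse_treetagger` and the sample lines are illustrative assumptions, not taken from the project.

```python
def parse_treetagger(output):
    """Turn TreeTagger's tab-separated output into a list of dicts.

    Lines that do not have exactly three fields (for example SGML
    tags passed through the tagger) are skipped.
    """
    rows = []
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        token, pos, lemma = parts
        rows.append({"token": token, "pos": pos, "lemma": lemma})
    return rows


# Usage with illustrative lines; the tags shown are examples only.
sample = "la\tART\tel\ngente\tNC\tgente\ncompra\tVLfin\tcomprar\n"
tagged = parse_treetagger(sample)
```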

32

Upload Video and Transcript to YouTube

33

Download SRT File
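
An SRT file is a sequence of blocks separated by blank lines: a cue number, a timing line of the form `HH:MM:SS,mmm --> HH:MM:SS,mmm`, and one or more lines of caption text. The sketch below reads such a file into Python dictionaries; the function names and the dictionary keys are assumptions made for illustration.

```python
import re

TIME = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def to_seconds(stamp):
    """Convert an SRT timestamp such as 00:01:02,500 to seconds."""
    h, m, s, ms = map(int, TIME.fullmatch(stamp.strip()).groups())
    return h * 3600 + m * 60 + s + ms / 1000


def parse_srt(text):
    """Return a list of cues with index, start, end (in seconds) and text."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if len(lines) < 2 or "-->" not in lines[1]:
            continue  # skip malformed blocks
        start, end = lines[1].split("-->")
        cues.append({
            "index": int(lines[0]),
            "start": to_seconds(start),
            "end": to_seconds(end),
            "text": " ".join(lines[2:]),
        })
    return cues


# Usage with a two-cue example.
sample = (
    "1\n00:00:01,000 --> 00:00:03,500\nLa gente puede ser pobre,\n\n"
    "2\n00:00:03,500 --> 00:00:05,250\npero compra su coca-cola\n"
)
cues = parse_srt(sample)
```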

34

Combine Data from SRT File and
TreeTagger File, and add additional Tags
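
One way to combine the two files is to walk through the tagged tokens in order and attach each one to the caption cue whose text it belongs to, so every token carries a start and end time. The sketch below is a hypothetical illustration of that idea, not the project's script. It assumes the tagger saw exactly the text that appears in the captions, and the field names are invented for the example.

```python
def combine(cues, tokens):
    """Attach cue timings to tagged tokens.

    cues:   list of dicts with "index", "start", "end", "text"
    tokens: list of dicts with "token", "pos", "lemma", in text order
    Each cue's text is consumed token by token, ignoring whitespace.
    """
    rows, i = [], 0
    for cue in cues:
        remaining = "".join(cue["text"].split())
        while i < len(tokens) and remaining.startswith(tokens[i]["token"]):
            remaining = remaining[len(tokens[i]["token"]):]
            rows.append({
                "cue": cue["index"],
                "start": cue["start"],
                "end": cue["end"],
                **tokens[i],
            })
            i += 1
    return rows


# Usage: the rows can then be written out with csv.DictWriter.
cues = [{"index": 1, "start": 1.0, "end": 3.5, "text": "La gente"}]
tokens = [
    {"token": "La", "pos": "ART", "lemma": "el"},
    {"token": "gente", "pos": "NC", "lemma": "gente"},
]
rows = combine(cues, tokens)
```

If the transcript is corrected after tagging, the two token streams drift apart and the alignment stops early, so a real pipeline would need a check that every token was placed.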

35

Divide CSV Files and Videos into Clips and
adjust Timings and Numberings
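
When a clip is cut from a full interview, its captions have to be renumbered from 1 and their timings shifted so that the clip starts at zero. A minimal sketch of that adjustment, assuming cues are stored as dictionaries with times in seconds (the function name `make_clip` and the keys are hypothetical):

```python
def make_clip(cues, clip_start, clip_end):
    """Select the cues that fall inside a clip and re-base them.

    Cues that are not fully inside [clip_start, clip_end] are left out.
    Kept cues are renumbered from 1 and their times are made relative
    to the start of the clip.
    """
    clip = []
    for cue in cues:
        if cue["start"] >= clip_start and cue["end"] <= clip_end:
            clip.append({
                "index": len(clip) + 1,
                "start": round(cue["start"] - clip_start, 3),
                "end": round(cue["end"] - clip_start, 3),
                "text": cue["text"],
            })
    return clip


# Usage: cut a clip that runs from 12.0 s to 20.0 s of the interview.
interview = [
    {"index": 7, "start": 10.0, "end": 12.5, "text": "..."},
    {"index": 8, "start": 12.5, "end": 15.0, "text": "..."},
]
clip = make_clip(interview, 12.0, 20.0)
```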

36

Topic List (Manual)
• Abuelos • Herencia familiar
• Amigos • Identidad
• Amor y relaciones • Idioma
• Comida • La infancia
• Criando Hijos • Matrimonio
• Cultura • Padres
• Escuela • Religión
• Familia • Texas
• Futuro • Trabajo

39

Automatic Pedagogical Annotations

40

Manual Coding for Complex Cases
• Annotation of ‘lo’ as an article that allows for the
elision of nouns as in “lo bueno de esta clase es…”

– The rule matches a two-word sequence: “lo” followed by an
adjective, possibly with other words in between (in practice
only adjective modifiers such as adverbs, since the BARRIER
operator tells the scanning process to stop if a typical NP
boundary is crossed)
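
The logic of the rule described above can be imitated in a few lines of Python. This is only an illustration of the scan-with-barrier idea, not the rule itself: the tag names and the contents of the barrier set are placeholders chosen for the example.

```python
# Placeholder tags that stand in for "a typical NP boundary".
BARRIER = {"NOUN", "VERB", "DET", "PREP", "PUNCT"}


def find_nominalizing_lo(tagged):
    """Return (lo_position, adjective_position) pairs.

    tagged is a list of (word, tag) pairs. From each "lo" the scan moves
    right until it finds an adjective (a match) or hits a barrier tag
    (no match). Tags outside the barrier set, such as adverbs, are
    skipped over.
    """
    hits = []
    for i, (word, _tag) in enumerate(tagged):
        if word.lower() != "lo":
            continue
        for j in range(i + 1, len(tagged)):
            next_tag = tagged[j][1]
            if next_tag == "ADJ":
                hits.append((i, j))
                break
            if next_tag in BARRIER:
                break
    return hits


# Usage: "lo bueno de esta clase es"
sentence = [("lo", "DET"), ("bueno", "ADJ"), ("de", "PREP"),
            ("esta", "DET"), ("clase", "NOUN"), ("es", "VERB")]
matches = find_nominalizing_lo(sentence)
```

The barrier is what keeps the object pronoun in a sequence like "lo vi ayer contento" from being annotated: the verb stops the scan before the adjective is reached.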

41

Automatic annotation levels for clips
• Grammatical
– aggregated from textbooks
• Functional
– greetings, ask for help, express opinions
• Pragmatics
– discourse markers, place holders (“este”),
attenuators
• Bilingual forms
– CS, loans, loan translations

42

Filtering by Tags and Metadata
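
Filtering amounts to keeping the clips that carry every requested tag and whose speaker metadata match the requested values. The sketch below assumes each clip is a dictionary with a `tags` collection and a `metadata` dictionary. That layout and the field names in the usage example are hypothetical, not the project's schema.

```python
def filter_clips(clips, required_tags=(), **metadata):
    """Keep clips that have all required tags and matching metadata values."""
    required = set(required_tags)
    return [
        clip for clip in clips
        if required <= set(clip["tags"])
        and all(clip["metadata"].get(key) == value
                for key, value in metadata.items())
    ]


# Usage with invented clips and an invented "region" field.
clips = [
    {"id": "clip-a", "tags": {"Comida", "Familia"}, "metadata": {"region": "South Texas"}},
    {"id": "clip-b", "tags": {"Comida"}, "metadata": {"region": "Central Texas"}},
]
selected = filter_clips(clips, required_tags=["Comida"], region="South Texas")
```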

44

Annotation Highlighter

46

Cloze Test Generator
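
A cloze exercise can be generated from a tagged transcript by blanking every word that carries a chosen tag and collecting the removed words as an answer key. A minimal sketch, with a hypothetical function name and placeholder tags:

```python
def make_cloze(tagged, target_tag, blank="_____"):
    """Replace every word with the target tag by a numbered blank.

    tagged is a list of (word, tag) pairs. Returns the exercise text
    and the list of removed words in order.
    """
    words, answers = [], []
    for word, tag in tagged:
        if tag == target_tag:
            answers.append(word)
            words.append(f"({len(answers)}) {blank}")
        else:
            words.append(word)
    return " ".join(words), answers


# Usage: blank out the verbs in a short tagged sentence.
tagged = [("La", "DET"), ("gente", "NOUN"), ("compra", "VERB"),
          ("su", "DET"), ("coca-cola", "NOUN")]
exercise, key = make_cloze(tagged, "VERB")
```

The same function works with any annotation level, so a teacher could blank discourse markers or loanwords instead of a part of speech.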

47

Word cloud visualization tool
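
Behind a word cloud is a frequency count: each word is drawn at a size proportional to how often it occurs once very common function words are set aside. The sketch below computes those counts. The stopword list is a tiny illustrative sample and the function name is an assumption.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"la", "el", "de", "su", "pero", "ser", "y", "que"}


def word_weights(text, top_n=5):
    """Return the top_n most frequent non-stopword words with their counts."""
    # Match runs of letters (including accented ones), allowing internal hyphens.
    words = re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", text.lower())
    counts = Counter(word for word in words if word not in STOPWORDS)
    return counts.most_common(top_n)


# Usage on a short transcript fragment.
weights = word_weights("La gente puede ser pobre, pero compra su coca-cola. La gente compra.")
```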

48

OER Materials
• Spanish in Texas searchable clip corpus
available this spring
– approximately 500 clips and growing
• All specially created code scripts are available
now through GitHub
• IRB, talent release, Google metadata survey
template, etc. available

49

OER Materials
• In the spirit of OER, please share-alike
• Add to repository any pedagogical materials
you or your students might develop from
Spanish in Texas clip corpus

50

Classroom and Community
• We are designing the corpus and tools with
the end-users
– using locally-relevant language samples to
illustrate every aspect of Spanish
• Users model their own language for
pedagogical purposes
• The corpus is the textbook

51

SPinTX: Corpus-to-Scholarship
(the future)

52

SPinTX: Corpus-to-Scholarship
• Full interviews (videotaped, captioned, and
POS-tagged) will be made available
• Syntactically-parsed corpora
• Additional public protocols, open-source
search tools

53

Corpus-to-Scholarship: Share-alike
• When you use the corpus, share-alike
– crowd-sourcing approach to additional annotation
levels (e.g., Praat TextGrids)
• we’ll use stand-off annotation
– sociolinguists would ideally share data coding
– corpus linguists would ideally share scripts
• Any users could contribute their collections:
video, transcript, and metadata
– we’ll run it through SpinTX processing

54

Archive
• Spanish in Texas Corpus to be archived at the
Nettie Lee Benson Latin American Collection,
University of Texas Libraries

55

Websites
Project website:
http://sites.la.utexas.edu/spanishtx/

Corpus-to-Classroom Blog:
http://sites.la.utexas.edu/corpus-to-classroom/

Facebook page:
https://www.facebook.com/spanish.in.texas

56

Thanks
• To all of our collaborators
• Especially to our students and their friends,
neighbors, and families who shared their time
and their language with us

57
