Quantitative Research Design
Corpus Study
Bikash Chandra Taly
PhD in English Language
bikashchndrataly@gmail.com
There are three questions you may answer while listening to this
presentation.
1. Why do we need corpus study in language study?
2. What are the key features of corpus linguistics design?
3. Why do we need corpus study in quantitative research?
Corpus Study
 A corpus is a collection of texts assumed to be representative of a given language, put together so
that it can be used for linguistic analysis.
 It is used to verify a hypothesis about language, for example, to determine how the usage of a
particular sound, word, or syntactic construction varies.
 “The analysis of naturally occurring language on the basis of computerized corpora and the analysis
is performed with the help of the computer” (Nesselhauf, 2005).
 “A systematic collection of naturally occurring texts (of both written and spoken language)”
(Nesselhauf, 2000).
 The analysis is performed with specialized software, and takes into account the frequency of the
phenomena investigated.
Collocation Analysis
Why Corpus Linguistics
 More reliable than intuition.
 Language patterns are easily identified.
 Deconstruct texts to discover patterns.
 Test hypotheses about specific language features empirically.
 Draw conclusions from large amounts of linguistic data.
 Shows frequency (what actually occurs) rather than mere possibility.
 Gain insights into changes related to language development, in both first and
second language situations.
 See how language is used and how it varies in different situations.
 Gain insight into underlying discourse.
Corpus Linguistics: Why in Language Study
 Insights into the internal workings of real language. This knowledge is in turn used in
other fields of enquiry: planning, syllabus design, compiling and tagging.
 It proposes that reliable language analysis is more feasible with corpora collected in the field,
in their natural context, and with minimal experimental interference.
 It allows us to see how language is used in different contexts and enables us to teach language
more effectively according to learners' specific purposes.
 Corpus-based dictionaries and grammars show how lexis and grammar are “really” used,
e.g. the Collins COBUILD Learner's Dictionary and the Longman Grammar of
Spoken and Written English.
Branches of Corpus Linguistics
 Phonetics
 Morphology
 Syntax
 Semantics
 Pragmatics
 Lexicography
 Dialects
 Minority languages
 Synchronic and diachronic variation
Types of Corpus in Linguistics
 General corpora
 A general corpus contains written texts, spoken texts, or both, and very often represents a
national, regional or sub-variety of a language.
 General corpora are often very large, more than 10 million words, and contain a
variety of language so that findings from them may be somewhat generalized.
 Sizes range from approximately a million words, such as the Lancaster-Oslo-Bergen (LOB)
written corpus, to much bigger corpora that include both written and spoken texts,
such as the Corpus of Contemporary American English (COCA), with over 450 million
words.
 A specialized corpus targets one text type (or genre), say, political
speeches, newspaper editorials, master’s theses, or business letters.
Because of its narrow text focus, a specialized corpus is usually smaller
than a general one.
 Specialized corpora can be large or small and are often created to answer very specific
questions. Examples include the Michigan Corpus of Academic Spoken English (MICASE),
which contains only spoken language from a university setting; the CHILDES Corpus
(MacWhinney, 1992), which contains language used by children; the Michigan Corpus of
Upper-level Student Papers (MICUSP), a collection of papers from a range of university
disciplines; and a medical corpus containing language used by nurses and hospital
staff. Specialized corpora are often used in ESP settings.
Learner Corpora
 Learner corpora aim to represent the language produced by learners of a language; they include
spoken or written language samples produced by non-native speakers.
 They are used to identify differences among learners in word frequency and types of mistakes.
 A learner corpus is one kind of specialized corpus: it contains written texts and/or spoken
transcripts of language used by students who are currently acquiring the language.
 A well-known learner corpus is the International Corpus of Learner English (ICLE)
(Granger, 2003), which contains essays written by English language learners with 14
different native languages.
Pedagogic Corpora
 Contain language used in classroom settings.
 Include academic textbooks, transcripts of classroom interactions, or any other written text or
spoken transcript that learners encounter in an educational setting.
 Pedagogic corpora can be used to ensure students are learning useful language, to examine
teacher-student dynamics, or as a self-reflective tool for teacher development.
Corpus Analysis for Quantitative Research
 Quantitative data shows what occurs frequently and what occurs rarely in the language.
 If you wanted to compare the language use of patterns for the words big and large, you
would need to know how many times each word occurs in the corpus, how many different
words co-occur with each of these adjectives (the collocations), and how common each of
those collocations is. These are all quantitative measurements. . . .
 "A crucial part of the corpus-based approach is going beyond the quantitative patterns to
propose functional interpretations explaining why the patterns exist. As a result, a large
amount of effort in corpus-based studies is devoted to explaining and exemplifying
quantitative patterns." (Douglas Biber, Susan Conrad, and Randi Reppen, Corpus
Linguistics: Investigating Language Structure and Use, Cambridge University Press, 2004)
 Determining to what extent a feature is used, or how common one feature is relative to
another, requires quantitative evidence.
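The big/large comparison above can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular corpus tool: the three-sentence corpus is invented, and the helper name `right_collocates` is hypothetical. It counts only the word immediately to the right of the node word, which is one simple way to approximate collocates.

```python
import re
from collections import Counter

# Toy corpus invented for illustration; a real study would load many texts.
corpus = (
    "A large number of students read a big book. "
    "The big dog chased a large number of birds. "
    "A large amount of data needs a big effort."
)

# Lower-case the text and split it into word tokens.
tokens = re.findall(r"[a-z']+", corpus.lower())


def right_collocates(tokens, node):
    """Count the word immediately following each occurrence of `node`."""
    return Counter(
        tokens[i + 1]
        for i, tok in enumerate(tokens[:-1])
        if tok == node
    )


for node in ("big", "large"):
    freq = tokens.count(node)                   # how often the word occurs
    collocates = right_collocates(tokens, node)
    distinct = len(collocates)                  # how many different collocates
    # Print frequency, number of distinct collocates, and the commonest ones.
    print(node, freq, distinct, collocates.most_common(2))
```

The three printed figures per word correspond to the three quantitative measurements named above: word frequency, number of different collocates, and how common each collocate is.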
Written Corpora
 Obtaining/creating, Storing, Organizing
 Discursive and typically at least several pages long
 Integral
 Conscious product of a unified authorial effort
 Stylistically homogeneous
 Materials required: scanner, OCR software
 Process: paper document into electronic text file
 Types: newspapers, academic articles, published correspondence, periodicals; small specialized
corpora; informal writings (travel diaries, e-mail, discussion, blogs, news groups)
Spoken Corpora
 Speech corpora: sound recordings (started in previous corpora); Spoken English
corpus (e.g., MICASE); detailed description of spoken phenomena: phonology, prosody
(stress, tone units…), etc.
 Deciding on a transcription system (the MS Office package is used for transcription and, via
MS Excel, partial data analysis; WordSmith Tools is common for data analysis).
 Prosodic/non-prosodic
 Representing interactional characteristics of speech (overlapping speech, non-verbal
contextual events) across settings: an informal face-to-face conversation, a telephone
conversation, a lecture, a meeting, an interview, a debate, interactional corpora (classroom
interactions, academic presentations, authentic interactions).
 Permission to use data
 Ensuring anonymity
 Avoiding impracticality of data
Programs for Corpus
 Corpus linguistics involves a variety of independent analysis methods, including frequency
lists, keyword lists, concordance analysis, cluster/n-gram analysis, and lists of collocates and
collocational analysis.
 Butler (1998, pp. 217-220) mentions some main programs which are often used:
 Word frequency: shows which words occur most frequently in the text(s), listed in ascending
order of frequency or alphabetically.
 Concordances: show what sorts of words tend to occur in the immediate environment of a
given word.
 Distribution: shows how sets of words are distributed through the various parts of the text(s).
 Collocations: show which particular words or sets of words enter into combination with one another.
 Keywords: a comparison with another body of text taken as a norm.
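Two of the program types listed above, the word frequency list and the concordance, are simple enough to sketch directly. The example below is an illustration only and does not reproduce any of the programs Butler describes: the sample text is invented, the function name `kwic` and its `window` parameter are hypothetical, and the frequency list is sorted most frequent first as one possible ordering.

```python
import re
from collections import Counter

# Invented sample text standing in for a corpus.
text = (
    "The corpus shows how a word is used. "
    "A corpus is a collection of texts. "
    "Each text in the corpus is a sample of language."
)
tokens = re.findall(r"[a-z]+", text.lower())

# Word frequency list: most frequent first, ties broken alphabetically.
freq_list = sorted(Counter(tokens).items(), key=lambda kv: (-kv[1], kv[0]))


def kwic(tokens, node, window=3):
    """Return keyword-in-context lines: `window` tokens either side of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            # Right-align the left context so the node words line up.
            lines.append(f"{left:>25} [{tok}] {right}")
    return lines


print(freq_list[:5])
for line in kwic(tokens, "corpus"):
    print(line)
```

Because the tokenizer discards punctuation, the context windows here run across sentence boundaries; a fuller implementation would probably want to keep sentence breaks.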
Corpus Design
 Corpus linguistics is a methodology for obtaining and analyzing language data either
quantitatively or qualitatively.
 Design decisions include the sizes of the text samples to be included, the range of language
varieties (synchronic) and the time period (diachronic) to be sampled, whether to include
writing and speech, and the approximate level of encoding detail to be recorded in electronic form.
 The primary stages are:
 Specifications and design
 Selection of sources
 Obtaining copyright permissions
 Data capture and encoding/markup
 Corpus processing
Programs Used for Corpus
 Concordance lines
 A useful tool for investigating corpora, but their use is limited by the ability of the
human observer to process information.
 Frequency and keyword lists
 Comparing the frequency lists for two corpora can give interesting information about
the differences between the two (e.g., Kennedy, 1998).
 Keywords
 These can be lexical items which reflect the topic of a particular text, but also
grammatical words which convey more subtle information.
 Collocation
 Statistical measurements of collocation are more reliable than intuition, and for this
reason a corpus is essential.
 Annotation
 Examples: the annotation of a spoken corpus for prosodic features; the annotation of a
corpus of learner English for types of error.
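As one example of a statistical measurement of collocation, the sketch below computes pointwise mutual information (PMI) for adjacent word pairs. PMI is a choice made here for illustration; the slides do not say which measure is meant, and corpus tools offer several others (for example t-score or log-likelihood). The token sequence is invented and the function name `pmi` is hypothetical.

```python
import math
from collections import Counter

# Invented token sequence; sentence boundaries are ignored for simplicity.
tokens = (
    "strong tea and strong coffee powerful computer and powerful engine "
    "strong tea again a powerful computer tea and coffee"
).split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n_tokens = len(tokens)
n_bigrams = len(tokens) - 1


def pmi(w1, w2):
    """PMI (base 2) of the adjacent pair (w1, w2); -inf if the pair never occurs."""
    pair_count = bigrams[(w1, w2)]
    if pair_count == 0:
        return float("-inf")
    p_pair = pair_count / n_bigrams
    p_w1 = unigrams[w1] / n_tokens
    p_w2 = unigrams[w2] / n_tokens
    # Compare observed co-occurrence with what independence would predict.
    return math.log2(p_pair / (p_w1 * p_w2))


for pair in [("strong", "tea"), ("strong", "coffee"), ("strong", "engine")]:
    print(pair, pmi(*pair))
```

A higher score means the pair co-occurs more often than its individual frequencies would predict. With a corpus this small the numbers are not meaningful; the point is only the shape of the calculation.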
Text Analysis Tools
 WordSmith Tools (Scott, 2013): most widely used corpus software. Any text can be
subjected to the same process of analysis that official corpora undergo: concordance
lines, word lists, ...
 MicroConcord: suitable not only for language researchers but also for teachers, for
‘data-driven learning’ in the language classroom.
 Oxford Concordance Program (OCP): very flexible but rather slow.
 TACTweb: operates in two stages: the production of a database from a given
text, and subsequent use of the database for particular analyses.
 WordCruncher: consists of programs for indexing texts, and one for generating
concordances.
Academic Word List (Coxhead, 2000)
 Research purpose:
 To develop and evaluate a new academic word list
 Factors considered in building the Academic Corpus
 Procedures:
 Representation
 Organization
 Size
 Word selection
Academic Word List (Coxhead, 2000)
 Representation
 Not only textbooks, but also a range of academic texts
 158 journal articles (print)
 51 edited journal articles (online)
 43 complete university textbooks or course books
 42 texts from the Learned and Scientific section of the Wellington
Corpus of Written English (Bauer, 1993)
Academic Word List (Coxhead, 2000)
 Organization
 4 disciplines
 arts, commerce, law, science
 28 subject areas
Academic Word List (Coxhead, 2000)
 Size
 3.5 million running words
 so as to identify 100 occurrences of a word family
 Coxhead referred to the data from the Brown Corpus (Francis &
Kucera, 1982)
Academic Word List (Coxhead, 2000)
(Coxhead, 2000, p. 220)
Academic Word List (Coxhead, 2000)
 Word selection
 What a word is
 Morphologically different words (e.g., -s and -ed)
 word types
 word families
 “[a] word family was defined as a stem plus all closely related
affixed forms…” (Coxhead, 2000, p. 218)
Academic Word List (Coxhead, 2000)
 Methods:
 Range (Heatley & Nation, 1996)
 Criteria for a member of a word family
 Specialized Occurrence
 excluding the 2,000 most frequent words
 Range
 occurs at least 10 times in each discipline
 occurs in 15 or more subject areas (out of 28)
 Frequency
 occurs at least 100 times in the Academic Corpus
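The three selection criteria listed above can be expressed as a simple filter. The sketch below is an illustration under stated assumptions, not Coxhead's procedure or the Range program: all counts are invented, `HIGH_FREQUENCY` is a three-word stand-in for the list of the 2,000 most frequent words, the class and function names are hypothetical, and checking only the headword against the high-frequency list is a simplification.

```python
from dataclasses import dataclass

DISCIPLINES = ("arts", "commerce", "law", "science")

# Stand-in for the 2,000 most frequent words (assumption for the example).
HIGH_FREQUENCY = {"the", "of", "people"}


@dataclass
class FamilyCounts:
    headword: str
    per_discipline: dict  # discipline -> occurrences of the word family
    subject_areas: int    # number of subject areas (out of 28) it occurs in


def meets_criteria(fam, high_frequency_words):
    """Apply specialised occurrence, range, and frequency to one word family."""
    specialised = fam.headword not in high_frequency_words
    in_range = (
        all(fam.per_discipline.get(d, 0) >= 10 for d in DISCIPLINES)
        and fam.subject_areas >= 15
    )
    frequent = sum(fam.per_discipline.values()) >= 100
    return specialised and in_range and frequent


# Invented counts, chosen so that each candidate fails a different criterion.
candidates = [
    FamilyCounts("analyse", {"arts": 40, "commerce": 55, "law": 30, "science": 60}, 24),
    FamilyCounts("people", {"arts": 300, "commerce": 250, "law": 200, "science": 150}, 28),
    FamilyCounts("plaintiff", {"arts": 0, "commerce": 4, "law": 160, "science": 0}, 3),
    FamilyCounts("niche", {"arts": 12, "commerce": 11, "law": 10, "science": 14}, 16),
]

selected = [f.headword for f in candidates if meets_criteria(f, HIGH_FREQUENCY)]
print(selected)
```

In this toy data, "people" is rejected as a high-frequency word, "plaintiff" fails the range criterion, and "niche" fails the frequency criterion.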
Taking “analyze” as an example
 regular inflections
analysed, analysing, analyses
 derivations
analyser, analysers, analysis, analyst, analysts, analytic, analytical, analytically
 American spelling
analyze, analyzed, analyzes, analyzing
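Counting by word family rather than by word type means mapping every member form to one headword before counting. The sketch below shows that step using the forms listed above. The choice of "analyse" as the headword, the sample sentence, and the function name `family_counts` are assumptions made for the example.

```python
from collections import Counter

# Member forms copied from the slide; "analyse" as headword is an assumption.
ANALYSE_FAMILY = {
    "analyse", "analysed", "analysing", "analyses",               # inflections
    "analyser", "analysers", "analysis", "analyst", "analysts",   # derivations
    "analytic", "analytical", "analytically",
    "analyze", "analyzed", "analyzes", "analyzing",               # US spelling
}

# Map each form to its family headword.
form_to_family = {form: "analyse" for form in ANALYSE_FAMILY}


def family_counts(tokens, form_to_family):
    """Count occurrences per word family, ignoring tokens outside any family."""
    return Counter(form_to_family[t] for t in tokens if t in form_to_family)


# Invented sample sentence.
tokens = "the analyst analysed the data and the analysis was analytical".split()
print(family_counts(tokens, form_to_family))
```

Four different word types in the sample sentence are counted as four occurrences of a single family.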
Academic Word List (Coxhead, 2000)
 Results
 570 word families
 12% word coverage for commerce
 9.3% word coverage for arts
 9.4% word coverage for law
 9.1% word coverage for science
 Average 10% word coverage for academic texts
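Word coverage, as reported above, is the percentage of running words in a text that belong to the word list. A minimal sketch of the calculation follows. The two-word list is a toy stand-in and not the real Academic Word List, the sample sentence is invented, and the function name `coverage` is hypothetical.

```python
def coverage(tokens, word_list):
    """Percentage of running tokens that appear in `word_list`."""
    if not tokens:
        return 0.0
    covered = sum(1 for t in tokens if t in word_list)
    return 100.0 * covered / len(tokens)


# Toy word list and invented 20-token sample (assumptions for the example).
toy_list = {"analyse", "interpret"}
sample = (
    "researchers analyse the data and then they interpret the results "
    "of the study in the light of what is known"
).split()

print(f"{coverage(sample, toy_list):.1f}% of running words covered")
```

A real calculation would match every member of each word family, not just the headwords, and would run over a whole corpus or sub-corpus rather than one sentence.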
Text Analytical Tools
Feature of Program
References
