This document discusses corpus linguistics and what can be learned from corpora. It defines corpus linguistics as using large computerized collections of natural language texts to explore patterns of language use. Corpora can be general collections covering a wide range of language or specialized for a particular domain. Basic tools allow analyzing word frequencies, while more advanced tools like concordancers examine collocations and provide context for search words. Overall, corpora provide both quantitative and qualitative insights into real language patterns.
Corpus linguistics is the study of language based on large collections of real-world language samples stored electronically. A corpus is a large collection of written or spoken language samples, stored electronically, that can be analyzed using specialized software. Computer-assisted analysis of such collections allows reliable, accurate, and replicable study of language usage at a scale, and in ways, not previously possible.
Applied linguistics uses knowledge about language, how it is learned, and how it is used to solve real-world problems. It includes areas like second language teaching, literacy, speech pathology, and translation. Applied linguistics has developed over the 20th century through different language teaching methods like the direct method, grammar translation, and audiolingualism. More recently, it views language in holistic and integrative ways rather than discrete skills, and considers the language learner's perspective. It also takes new approaches to teaching the four language skills of listening, speaking, reading and writing. Applied linguistics often lacks definitive answers because language occurs between people and in the mind.
This document provides an introduction to corpus linguistics, including definitions of key terms and concepts. It discusses the history and development of corpus linguistics from early generational corpora in the 1960s-1990s to current applications. The document outlines different types of corpora as well as software tools used for corpus analysis. Both advantages and limitations of using corpora are presented. Finally, examples of famous corpora and applications to translation studies and research methods in corpus linguistics are briefly mentioned.
Corpus linguistics involves the collection and analysis of large bodies of text, known as corpora. It uses empirical methods to discover patterns of language use by examining naturally occurring texts. Key aspects of corpus linguistics include using large, representative text samples; analyzing frequency, collocation, and concordance; and applying findings to fields like lexicography, language teaching, and theoretical linguistics. Corpus analysis tools allow researchers to investigate features of syntax, semantics, and pragmatics that traditional intuition-based methods cannot.
Discourse analysis (Schmitt's book, chapter 4) by Samira Rahmdel
The document discusses discourse analysis and its approaches. It covers conversational analysis, ethnography, speech act theory, structural functional linguistics, and systemic functional linguistics. Conversational analysis examines patterns in turn-taking, adjacency pairs, and back-channel responses in natural conversations. Ethnography uses the speaking grid and analyzes speech events and genres. Structural functional linguistics developed models to analyze classroom discourse with transactions, exchanges, moves, and acts.
Introductory lecture on Corpus Linguistics. Contents: Corpus linguistics: past and present, What is a corpus?, Why use computers to study language? Corpus-based vs. Intuition-based approach, Theory vs. Methodology.
This lecture was based on McEnery et al. (2006), Corpus-Based Language Studies: An Advanced Resource Book, Routledge.
This document provides an overview of discourse analysis and different approaches to analyzing discourse. It discusses how discourse analysis examines both spoken and written language in their social contexts. Several key approaches are described, including conversation analysis, variation theory, systemic functional linguistics, and critical discourse analysis. The document also compares differences between spoken and written language at the levels of grammar and vocabulary choice.
This document provides an overview of contrastive analysis between English and Arabic. It begins with the objectives of familiarizing trainee teachers with contrastive analysis and its pedagogical implications. The document then defines contrastive analysis and outlines its emergence. Key points of contrast between English and Arabic phonology, grammar, and other linguistic features are described. Finally, the interference of an Arabic mother tongue on learning English is discussed through case studies of errors related to redundancy, prepositions, syntax, and other areas. The document aims to help teachers address challenges English learners face due to their native language.
The document discusses corpus linguistics and different types of corpora. It defines corpus linguistics as the study of language based on large collections of electronic texts, known as corpora. It describes general corpora, specialized corpora, historical/diachronic corpora, regional corpora, learner corpora, multilingual corpora, comparable corpora, and parallel corpora. It also discusses corpus annotation, concordancing, frequency and keyword lists, collocation, and software used for corpus analysis.
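The concordancing and collocation analysis mentioned above can be sketched in plain Python. This is a minimal keyword-in-context (KWIC) display plus a windowed co-occurrence count; the sample text is invented for illustration, and real work would use a dedicated tool such as a concordancer over a full corpus:

```python
import re
from collections import Counter

def kwic(text, keyword, width=3):
    """Return keyword-in-context lines: `width` words either side of each hit."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append(f"{left} [{keyword}] {right}")
    return hits

def collocates(text, keyword, window=2):
    """Count words co-occurring with `keyword` within `window` tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            span = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(span)
    return counts

# Invented sample text standing in for a real corpus.
sample = ("The corpus provides evidence. The corpus reveals patterns. "
          "A small corpus still provides useful evidence.")
print(kwic(sample, "corpus", width=2))
print(collocates(sample, "corpus").most_common(2))
```

The same two operations (show every hit in context, count the company a word keeps) underlie the concordance and collocation features of full corpus-analysis software.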
Applied linguistics is the interdisciplinary study of language and its applications in real world contexts. It draws on linguistic theories and research to solve practical language-related problems. Key areas include second language acquisition, teaching methodology, testing, and the relationships between language and society, technology, and other fields. Throughout the 20th century, applied linguistics influenced the development of language teaching methods, shifting the focus from grammar translation to more communicative, meaning-based approaches grounded in theories of language acquisition and use.
This presentation answers questions such as: How are languages planned in multilingual countries? What is the role of the TDK in Turkish language reform? What are the processes of language planning? Language planning in Switzerland, Canada, India, and the USA is also discussed.
The document discusses four major theories of second language acquisition:
1) The behaviorist perspective which focuses on habit formation through practice and reinforcement.
2) The innatist perspective which posits that humans have an innate Universal Grammar that facilitates language learning.
3) The cognitive/developmental perspective which explains language learning through general theories of learning like information processing and interaction.
4) The sociocultural perspective which views language development as arising through social interaction, such as interacting within one's Zone of Proximal Development.
Corpus linguistics is the analysis of large collections of machine-readable texts called corpora. It utilizes computers to analyze patterns of language use in natural texts. Corpus linguistics is an empirical approach that uses quantitative and qualitative techniques on representative text samples to study topics like lexicography, grammar, dialects and language acquisition. It provides consistent, reliable analyses of complex language patterns not possible through manual analysis alone.
This document discusses discourse structure and conversation analysis. It defines conversation as a less formal type of discourse involving small numbers of participants where turns are short. Conversation analysis examines patterns in natural conversation data and how participants negotiate turn-taking through linguistic and non-linguistic signals. Turn-taking involves adjacency pairs, insertion sequences where other topics are briefly discussed, and repairs to clarify meaning. The document presents discourse as a process that is constructed through participant interaction and turn-taking signals.
The document discusses applied linguistics and interdisciplines. It defines applied linguistics as using linguistic theories and methods to solve language problems in other fields. The history of applied linguistics is discussed, along with its aims to study language learning and teaching and solve related problems. Interdisciplines that applied linguistics interacts with are sociolinguistics, psycholinguistics, neurolinguistics, and various applied areas like education, speech therapy, computing, and international relations.
The document discusses language standardization, including how and why languages become standardized. It notes that standardization is a prescriptive process that develops a standard variety of a language. Languages typically become standardized through resources like dictionaries, grammars, and pronunciation guides from linguistic institutions; constitutional status as an official language; use in public domains like courts and schools; literary works; and popularity and acceptance in the community. Establishing a standard variety aims to promote national cohesion. The standard variety often reflects the language of higher socioeconomic groups. Examples are given of standardization processes and debates in countries like Brazil, Angola, Mozambique, and Cape Verde, along with related papers and books on topics such as the politics of standardization.
1. Corpus linguistics is the study of language using large collections of electronic texts called corpora.
2. A teacher conducted a corpus analysis of student writing to determine the most frequent words. The three most common words were "the", "for", and "it".
3. Corpora come in many types including speech, text, monolingual, parallel, and learner corpora. They are used for various linguistic analyses.
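The kind of frequency analysis the teacher in point 2 performed is a short script. The essays below are hypothetical stand-ins (a real analysis would read student files from disk), chosen so the top three words echo the ones reported above:

```python
import re
from collections import Counter

# Hypothetical student essays; a real analysis would read files from disk.
essays = [
    "The teacher asked for the report, and the class wrote it for homework.",
    "It was useful, for it showed the gaps in the writing.",
]

# Tokenize: lowercase, keep alphabetic words (and internal apostrophes).
tokens = re.findall(r"[a-z']+", " ".join(essays).lower())
freq = Counter(tokens)
print(freq.most_common(3))  # three most frequent words with counts
```

High-frequency function words like these dominate almost any frequency list, which is why keyword analysis (comparing against a reference corpus) is usually the next step.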
Applied linguistics is an interdisciplinary field that identifies and solves real-world language problems. It applies the knowledge of linguistics to improve practical tasks involving language. Some related fields are education, psychology, communication research, anthropology, and sociology. Applied linguistics investigates language learning and teaching problems, the role of language in culture and society, and finds solutions to language issues linguistics cannot solve alone. It covers domains like computational linguistics, sociolinguistics, psycholinguistics, neurolinguistics, and others.
Discourse analysis involves studying language beyond the sentence level, including conversations and written texts. There are various approaches to discourse analysis from different fields like sociology, linguistics, and philosophy. Sociological approaches include conversational analysis which examines turn-taking, openings/closings of conversations. Systemic functional linguistics views language as evolving based on its social functions and analyzes texts in relation to social contexts. Critical discourse analysis considers how power and social domination are reproduced through language.
This document discusses the definition and key aspects of a corpus in linguistics. It defines a corpus as a large collection of text samples that are selected and organized according to linguistic criteria. The corpus aims to represent a given language, dialect, or subset of language. It should contain a diverse range of authentic texts and be large enough to characterize different varieties and uses of the language. Important qualities of a corpus include quantity, quality, representativeness, simplicity, equality, retrievability, verifiability, augmentation, documentation, and management.
This document discusses language contact in sociolinguistics. It describes how bilingual speakers may use a third language, mix languages, or switch between languages when communicating. This can lead to lingua francas, pidgins, or code-switching. Pidgins develop between languages for trade purposes with simplified grammar and vocabulary. If passed to new generations, a pidgin may become a creole language. Code-switching refers to mixing or alternating between two languages in speech. It helps with expression or identifies mixed cultural identity. Pidgins integrate languages extensively while code-switching shifts are restricted to vocabulary within sentences.
Relationship between language, culture, and identity by Cool Chaandni
This document discusses the relationship between language, culture, and identity. It argues that language and culture influence each other mutually - language is shaped by culture but also shapes culture. Membership in a cultural group influences one's identity. The levels of identification conveyed through language include nationality, social class, gender, generation, and profession. Language determines ways of thinking through influencing cognition as proposed in the Sapir-Whorf hypothesis. Overall, the document presents language, culture and identity as intricately interconnected and constantly evolving.
Contrastive analysis is the systematic study of two languages to identify their structural differences and similarities. It was originally used to establish language families but was later applied to second language acquisition in the 1960s. The contrastive analysis hypothesis claimed that elements similar between a learner's first and second language would be easier to acquire, while differences would be more difficult. However, empirical evidence showed this could not predict all errors, and some uniform errors occurred regardless of first language. This led to the development of error analysis and the concept of interlanguage, seeing second language acquisition as its own rule-governed linguistic system rather than an imperfect version of the target language.
This document provides an overview of applied linguistics and how knowledge of linguistics can help teachers support English learners. It defines applied linguistics as investigating and addressing language-related problems in both first and second language acquisition. The document outlines key aspects of linguistics including phonology, morphology, syntax, semantics, and pragmatics. It explains that while teachers do not need the same depth of knowledge as applied linguistics experts, they should understand language acquisition theories and how knowledge of linguistics can help them teach English, support communication skills, evaluate students appropriately considering their backgrounds, and socialize students into the school culture.
This document provides an overview of language planning. It defines language planning as efforts to influence and modify a language's structure and function. It discusses key aspects of language planning including its goals, processes, types (status and corpus planning), ideologies, and issues. The summary focuses on language planning's aim to alter a language's role and how it is implemented through selection, codification, elaboration, and acceptance of a standardized variety.
This document defines and summarizes key terms in corpus linguistics. It discusses bootstrapping, the Brill tagger, the competence-performance dichotomy, computational linguistics, computer-assisted language learning, corpus linguistics, Extensible Markup Language (XML), the Penn Treebank, the Kolhapur Corpus, the Hyderabad Corpus, the Text Encoding Initiative, Unicode, the Linguistic Data Consortium, and alignment.
This document defines and describes various terms and concepts related to corpus linguistics and natural language processing. It defines acronyms for various corpora and projects. It also defines key concepts like alignment, annotation, ambiguity, balanced corpora, concordancing, part-of-speech tagging, and probabilistic tagging using n-grams.
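The probabilistic n-gram tagging mentioned above can be sketched as a bigram tagger with a unigram backoff: pick the tag most often seen for (previous tag, word), and fall back to the word's most frequent tag overall. The tiny hand-tagged training set here is invented; real taggers train on large annotated corpora such as the Penn Treebank:

```python
from collections import Counter, defaultdict

# Tiny hand-tagged training set (hypothetical; real taggers train on
# large annotated corpora).
tagged = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]

unigram = defaultdict(Counter)  # word -> tag counts (backoff model)
bigram = defaultdict(Counter)   # (previous tag, word) -> tag counts

for sent in tagged:
    prev = "<S>"  # sentence-start marker
    for word, tag_ in sent:
        unigram[word][tag_] += 1
        bigram[(prev, word)][tag_] += 1
        prev = tag_

def tag(sentence):
    """Tag a token list: prefer the bigram model, back off to unigram."""
    out, prev = [], "<S>"
    for word in sentence:
        if (prev, word) in bigram:
            t = bigram[(prev, word)].most_common(1)[0][0]
        elif word in unigram:
            t = unigram[word].most_common(1)[0][0]
        else:
            t = "NOUN"  # crude default for unknown words
        out.append((word, t))
        prev = t
    return out

print(tag(["a", "cat", "barks"]))
```

Note that "cat barks" never occurs in training, yet the tagger resolves it correctly because the bigram model conditions on the previous *tag*, not the previous word.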
This document discusses how to design, acquire, and process a collection of linguistic data to form the raw material for a dictionary. It explains that a reliable dictionary requires evidence from language used in real communicative acts. While introspection and informant testing provide some evidence, they are limited due to subjectivity. Therefore, observation of language in use through large text corpora is indispensable. Key considerations in corpus design include size, inclusion of different text types and styles to avoid bias, and ensuring representativeness.
This document discusses corpus linguistics and quantitative research design. It defines a corpus as a large collection of texts used for linguistic analysis. Corpus linguistics allows researchers to empirically test hypotheses about language patterns and features based on large amounts of real-world data. Quantitative analysis of corpus data shows how frequently certain words, constructions, and patterns are used. Specialized corpora can focus on particular text types, languages, or learner language. Various software tools are used to analyze corpora through frequency lists, keyword lists, collocation analysis, and other methods.
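The keyword lists mentioned above are typically produced by comparing word frequencies in a target corpus against a reference corpus. A common statistic for this is Dunning's log-likelihood (G2); the sketch below uses two invented miniature "corpora" purely for illustration:

```python
import math
import re
from collections import Counter

def log_likelihood(a, b, total_a, total_b):
    """Dunning's log-likelihood (G2) keyness score for one word:
    a/b = word counts in target/reference, total_a/total_b = corpus sizes."""
    e1 = total_a * (a + b) / (total_a + total_b)  # expected count in target
    e2 = total_b * (a + b) / (total_a + total_b)  # expected count in reference
    g2 = 0.0
    if a:
        g2 += 2 * a * math.log(a / e1)
    if b:
        g2 += 2 * b * math.log(b / e2)
    return g2

def keywords(target, reference):
    """Rank target-corpus words by keyness against the reference corpus."""
    t = Counter(re.findall(r"[a-z']+", target.lower()))
    r = Counter(re.findall(r"[a-z']+", reference.lower()))
    nt, nr = sum(t.values()), sum(r.values())
    scored = {w: log_likelihood(t[w], r.get(w, 0), nt, nr) for w in t}
    return sorted(scored, key=scored.get, reverse=True)

# Invented miniature corpora.
target = "corpus corpus corpus tools analyse the corpus data"
reference = "the data and the tools in general use"
print(keywords(target, reference)[:2])
```

Words that are unusually frequent in the target relative to the reference rise to the top, which is how corpus tools surface the vocabulary that characterizes a specialized text type.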
The document discusses text corpora and provides information about the CORDE and BNC corpora. It summarizes a study comparing searches for the term "nación" in the CORDE and BNC corpora. The CORDE allowed for more in-depth analysis by providing statistics and contextual information, while the BNC only showed random results without additional data.
This document discusses corpus linguistics and methods of corpus analysis. It defines corpus linguistics as the study of language using large samples of authentic texts. It outlines the history of corpus linguistics from early manually created corpora to current large electronically stored corpora. It also discusses different types of annotation that can be applied to corpora, including part-of-speech tagging, syntactic analysis, semantic tagging, and discourse-level annotation. The document contrasts the corpus linguistics approach, which focuses on descriptive adequacy based on empirical data, with the generative grammar approach, which prioritizes explanatory adequacy through abstract principles.
1) The document discusses using blogs and other structured web data to develop linguistic corpora for research. It argues that structured web data provides large amounts of naturally occurring language data in various genres and languages.
2) Examples are given of how blog data in particular is well-structured with metadata like authorship, dates, and semantics. This structured data can be extracted and analyzed to study linguistic patterns and variation across different authors, registers, and languages.
3) One research example analyzed the distribution of future tense expressions ("will" vs. "be going to") in three English language blogs and found patterns relating to subject type that confirm theoretical assumptions.
We present a framework that combines machine learnt classifiers and taxonomies of topics to enable a more conceptual analysis of a corpus than can be accomplished using Vector Space Models and Latent Dirichlet Allocation based topic models which represent documents purely
in terms of words. Given a corpus and a taxonomy of topics, we learn a classifier per topic and annotate each document with the topics covered by it. The distribution of topics in the corpus can then be visualized as a function of the attributes of the documents. We apply this framework to the US State of the Union and presidential election speeches to observe how topics such as jobs and employment have evolved from being relatively unimportant to being the most discussed topic. We show that our framework is better than Vector Space Models and an Latent Dirichlet Allocation based topic model for performing certain kinds of analysis.
The document discusses the field of computational linguistics. It defines computational linguistics as the scientific study of language from a computational perspective, involving linguists, computer scientists, and others. The history of computational linguistics is closely tied to the development of digital computers and early applications included machine translation. Computational linguistics research includes areas like speech recognition, natural language processing, and machine translation.
This document discusses computational linguistics, including its origins, main application areas, and approaches. Computational linguistics originated from efforts in the 1950s to use computers for machine translation between languages. It aims to develop natural language processing applications like machine translation, speech recognition, and grammar checking. Research employs various approaches including developmental, structural, production-focused, and comprehension-focused methods. The field involves both computer science and linguistics.
STRUCTURED AND QUANTITATIVE PROPERTIES OF ARABIC SMS-BASED CLASSIFIED ADS SUB...ijnlc
In this paper we will present our work in studying the sublanguage of Arabic SMS-based classified ads.
This study is presented from the developer's point of view. We will use the corpus collected from an
operational system, CATS. We also compare the SMS-based and the Web-based messages. We also discuss
some quantitative properties of the studied text.
AUTOMATIC DETECTION AND LANGUAGE IDENTIFICATION OF MULTILINGUAL DOCUMENTSIRJET Journal
The document describes a study that aims to handle the task of detecting offensive language in multilingual documents using machine learning models. The proposed framework consists of three phases: preprocessing text, representing text using BERT models, and classifying text into offensive and non-offensive classes. The study examines different strategies for handling multilingualism, such as creating one classification model for multiple languages or using translation to convert all texts to one language before classification. Experimental results on a bilingual dataset show that the translation-based approach using Arabic BERT achieves over 93% F1-score and 91% precision for offensive language detection in multilingual texts.
The document discusses the British National Corpus (BNC), a 100 million word collection of samples of written and spoken British English from the late 20th century. It provides details on the BNC, including that it consists of 90% written texts and 10% transcripts of speech. Famous corpora that preceded the BNC are also mentioned, such as the Brown Corpus, Lancaster-Oslo/Bergen Corpus, and London-Lund Corpus. The origins and properties of the BNC are outlined, including its editions, spoken and written components, treatment of abbreviations, and focus on modern British English.
This document discusses syntactic variation research and provides examples of syntactic doubling. It notes four major developments in the field: 1) increased focus on dialect data; 2) incorporation of sociolinguistic methodology; 3) improved data accessibility through technology; 4) the Minimalist hypothesis that syntactic variation arises through interaction between universal principles and extra-syntactic factors. Large dialect syntax databases now allow investigation of correlations and patterns across languages. Examples are given of syntactic doubling phenomena like Wh-doubling and negative concord in various dialects.
Introduction to automated text analyses in the Political SciencesChristianRauh2
This document provides an introduction and overview of automated text analysis methods for political science research. It discusses the promises and pitfalls of automated analysis and outlines some common text analysis approaches, including corpus construction, dictionary-based analysis, text scaling, and briefly touches on topic modeling and machine learning. The document uses debates around climate change at the United Nations as a running example to demonstrate how these various methods can be applied to a research question and corpus of documents. It emphasizes that automated analyses require validation and should augment rather than replace human interpretation of texts.
Synonymy is an important yet intricate linguistic feature in the field of lexical semantics. Using the 100 million-word British National Corpus (BNC) as data and the software Sketch Engine (SkE) as an analyzing tool, this paper explores the collocational behavior and semantic prosodies of near synonyms in virtue of, owing to, thanks to, as a result of, due to and because of. The results show that these near synonyms differ in their collocational behavior and semantic prosodies. The pedagogical implications of the findings are also discussed.
The British National Corpus (BNC) is a large text corpus of samples of written and spoken British English from the late 20th century. It was created between 1991 and 1994 by an academic consortium and contains over 100 million words. The BNC consists of both written texts that make up 90% of the corpus, as well as spoken texts that comprise the remaining 10%. It provides information about English usage and helps researchers study collocations and how words are used in different contexts.
This document summarizes key aspects of learner characteristics, styles, strategies and motivation from Schmitt's 2013 book on applied linguistics. It discusses how age, gender and language aptitude can impact learning. It describes different learning style preferences including sensory, cognitive and personality-related styles. It also outlines cognitive, metacognitive, affective and social learning strategies and how they relate to skill areas. Finally, it discusses motivation as a dynamic process and models of L2 motivation including integrative and instrumental orientation and the L2 Motivational Self System.
Applied linguistics has evolved over time from focusing primarily on teaching English as a foreign language to incorporating various subfields like second language acquisition, corpus linguistics, and critical applied linguistics. It now grapples with real-world issues and seeks to balance serving practical needs with intellectual inquiry, though there are criticisms of overly grand or narrowly practical approaches. The future of applied linguistics may require more interdisciplinary work and mediating various stakeholder interests.
This document discusses formal approaches to second language acquisition (SLA) with a focus on Universal Grammar (UG). It covers:
1) UG theory which posits that languages are constrained by innate, universal principles and parameters that vary across languages.
2) Evidence that SLA is constrained by UG principles like structure dependence and subjacency, suggesting the innate language faculty remains available.
3) Debate around the initial state in SLA, specifically whether the first language and UG are available, and how parameters can be reset, with studies examining properties of the pro-drop parameter.
The document discusses the relationship between semantics and pragmatics in language. It explores how pragmatic inferences contribute to meaning, including implicatures identified by Grice. While semantic meaning comes from words alone, pragmatic meaning requires context. Some implicatures, like those from words like "but" and "therefore", affect truth conditions. Other implicatures identified by Grice, like sequential uses of "and", are argued by Relevance Theory to be explicatures that determine truth value. Numerals are also discussed as having distinct semantic meanings of "at least" and "exactly". The document concludes that the boundary between semantics and pragmatics is blurred, with pragmatic inferences contributing to conventional and truth-conditional meaning.
This document provides an overview of two theories of meaning components: Jackendoff's theory of semantic primitives and conceptual structures, and Pustejovsky's Generative Lexicon theory. It discusses Jackendoff's use of conceptual primitives like EVENT and STATE to investigate the relationship between semantics and grammar. It also outlines Pustejovsky's four levels of semantic representation for lexical items, including event structure and qualia structure. For event structure, it explains Pustejovsky's classification of states, processes, and transitions, and how event structure is modified during semantic combination.
This document provides an overview of vowels and vowel-like articulations from Ladefoged's A Course in Phonetics. It discusses cardinal vowels as reference points for describing vowel quality, including their definition based on tongue height, backness, and lip rounding. It also covers secondary cardinal vowels, advanced tongue root, rhotacized vowels, nasalization, and semivowels. Finally, it discusses four types of secondary articulatory gestures that can be added to vowels: palatalization, velarization, pharyngealization, and labialization.
The document discusses three Yankunytjatjara words - pika, mirpan, and kuya - that can be used to translate the phrase "He got angry at me" into English. Pika refers to feelings of hostility and aggression. Mirpan describes an aggrieved or offended emotional state. Kuya indicates resentment and a lack of desire to oblige. Each word is explained through its roots, suffixes, and example usages to demonstrate how they differ in their nuances of expressing anger.
This document provides an overview of sociolinguistic concepts including social factors that correlate with language variation such as gender, age, audience, and social networks. It discusses methods for collecting and analyzing sociolinguistic data, including elicitation techniques. As an example, it summarizes a sociolinguistic study of /r/ variation in Middlesbrough, England which found evidence of dialect leveling and diffusion of new variants across age and gender groups. Finally, it outlines some applications of sociolinguistics, such as informing language education policy and training for film actors.
Forensic linguistics involves three overlapping areas: 1) investigative linguistics such as authorship analysis, 2) study of written legal language including readability and interpretation, and 3) communication in legal processes like interviews and courtrooms. Investigative linguistics analyzes disputed texts using both quantitative and qualitative methods to identify idiosyncratic linguistic features and determine authorship. The study of written legal language focuses on improving comprehension through plain language reforms. Communication in legal processes examines discourse in settings like police questioning and trials.
Cognitive semantics holds that language reflects human cognitive abilities and how people conceive of the world. It focuses on how language is acquired, contextual, and based on general cognitive resources. A key theory is Conceptual Metaphor Theory, which proposes that metaphors allow understanding one domain in terms of another through systematic mappings. Metaphors exhibit features like conventionality, asymmetry, systematicity, and abstraction. They also influence linguistic behaviors and the development of language over time. Similarly, metonymy connects concepts within a domain through associated features.
This document discusses semantic roles and valency. Semantic roles refer to the number of arguments a predicate takes, while valency refers to the number of arguments a predicate has. Predicates can have valency of one or two arguments. One-argument predicates include intransitive verbs and one-place adjectives. Two-argument predicates include transitive verbs and two-place adjectives or adverbs. The semantic roles of arguments depend on whether the predicate expresses an action, relationship, or other linking roles between the arguments. The meaning of predicates is determined partly by their valency and the semantic roles of their arguments.
This document discusses the concepts of linguistic competence and communicative competence in language learning. It defines linguistic competence as knowledge of a language's formal rules of grammar and phonology. Communicative competence goes beyond this to include social and cultural knowledge needed for effective communication. It identifies four components of communicative competence: possibility, feasibility, appropriateness, and attestedness. The notion of communicative competence has influenced applied linguistics, shifting approaches to teaching English and other languages from a focus solely on mechanics to developing broader communication abilities.
The document provides an overview of mental spaces theory and conceptual integration theory as proposed by Gilles Fauconnier. Some key points:
- Mental spaces are conceptual structures that describe how language users assign and manipulate references. Meaning arises from complex cognitive processes, not language itself.
- Connections between mental spaces include identification, metonymy, and belief contexts. Diagrams are used to illustrate relationships between spaces.
- Referential opacity and presupposition are explained in relation to belief contexts and how information is shared or blocked between spaces.
- Conceptual integration theory describes how analogies are created by combining elements from different input spaces into an integrated blended space. Principles of the theory are
The document discusses semantic roles and propositions in language. It explains that a sentence expresses a proposition, which can be expressed through different sentences. A proposition contains a predicate and arguments that play semantic roles, such as an agent that performs an action or an affected entity that undergoes it. Predicates have a valency that specifies how many arguments they can take and the roles of those arguments. Some predicates have a valency of zero and do not require any explicit arguments, as in sentences using an impersonal "it" as a placeholder subject.
This document discusses different approaches to understanding emotions from a semantic perspective. It describes the debate around whether emotions are essentially physical, physiological, or depend on cognitive processes. It also outlines work analyzing the semantic components of emotion words in Russian by Iordanskaja and cross-linguistically using semantic templates by Wierzbicka. Key components identified for emotion word definitions include the emotional state, its valence, and the reason or thoughts triggering the emotion.
This document discusses the history and evolution of different approaches to teaching English as a foreign language. It begins by explaining how applied linguistics and TEFL were initially considered the same field. It then outlines four main approaches chronologically: 1) Grammar translation focused on rules and vocabulary translation with little speaking practice. 2) The direct method banned first language use and focused on immersion. 3) The natural approach emphasized meaningful input with no error correction. 4) The communicative approach shifted focus to real-world tasks and communication over grammatical forms. Each brought benefits but also limitations for developing language skills.
This document discusses different types of grammar including prescriptive, descriptive, and pedagogical grammars. It addresses issues in describing grammar such as which rules to describe, varieties of language, and the relationship between form and function. The document also covers limitations of grammatical descriptions, including the interdependence of grammar and lexis. Finally, it discusses how grammar is learned and approaches to teaching grammar, such as input flooding, guided participation, and feedback.
This document provides an overview of key concepts in sociolinguistics. It discusses how sociolinguistics studies language variation and change in relation to social factors. Some key points covered include:
- Sociolinguistics examines how social factors like region, age, and gender correlate with linguistic differences.
- Languages have standardized and non-standard varieties, and sociolinguists look at issues of prestige and stigmatization.
- Researchers describe language variation through concepts like idiolects, sociolects, and linguistic variables.
- Phonological, grammatical, and lexical variation are all studied using descriptive tools from the different levels of language.
Survey research designs are procedures used in quantitative research where investigators administer surveys to describe populations. There are two main types of survey designs - cross-sectional and longitudinal. Cross-sectional designs collect data at one point in time to measure current attitudes or practices, while longitudinal designs collect data about trends over time within the same population. Key characteristics of survey research include sampling from a population, collecting data through questionnaires or interviews, designing instruments, and obtaining a high response rate.
This document discusses speaking and pronunciation from a discourse perspective. It addresses key questions about teaching these areas:
- Both sentences and texts have value, but a discourse focus helps students notice authentic language use and better prepares them for real communication.
- Classroom activities can raise awareness of discourse features like genre, exchange structure, and conversational moves to sensitize students.
- A variety of authentic and semi-authentic spoken materials on a continuum from sentences to natural speech can be used based on availability.
2. TABLE OF CONTENTS
01 What is corpus linguistics?
02 Corpus design and compilation
03 What can a corpus tell us?
04 Overview of different types of corpus studies
05 How can corpora inform language teaching?
(Schmitt, 2020)
4. ‘Corpus linguistics’ has enjoyed much greater popularity, both as a means to explore actual patterns of language use and as a tool for developing materials for classroom language instruction.
Corpus linguistics uses large collections of both spoken and written natural texts that are stored on computers.
One of the major contributions of corpus linguistics is in the area of exploring patterns of language use.
Today, corpus linguistics and the term ‘corpus’ are synonymous with computerized corpora and methods, but this was not always the case.
What is corpus linguistics?
5. An empirical approach to linguistic analysis is based on naturally occurring spoken or written data.
Advances in technology have brought a number of advantages for corpus linguists, including the collection of larger language samples and faster, more efficient text processing.
Characteristics of corpus-based analyses of language:
o It is empirical, analysing the actual patterns of use in natural texts.
o It utilizes a large and principled collection of natural texts.
o It makes extensive use of computers for analysis, using both automatic and interactive techniques.
o It depends on both quantitative and qualitative analytical techniques.
What is corpus linguistics?
6. A corpus is a large, principled collection of natural texts.
The use of natural texts means that the language has been collected from naturally occurring sources.
Examples of well-known corpora:
o The British National Corpus (BNC)
o The Corpus of Contemporary American English (COCA)
o The Brown Corpus
The text collection process for building a corpus needs to be principled to ensure representativeness and balance.
What is corpus linguistics?
7. The linguistic features or research questions being investigated will shape the collection of texts used in creating the corpus.
Although computers make possible a wide range of sophisticated statistical techniques, human analysts are still needed to decide what information is worth searching for, to extract that information from the corpus, and to interpret the findings.
Corpus linguistics brings together aspects of quantitative and qualitative techniques.
The quantitative analyses provide an accurate view of macro-level characteristics, whereas the qualitative analyses provide the complementary micro-level perspective.
What is corpus linguistics?
9. Although there is no minimum size for a text collection to be considered a corpus, an early standard size set by the creators of the Brown Corpus was one million words.
A number of well-known specialized corpora are much smaller than that, but there is a general assumption that, for most tasks within corpus linguistics, larger corpora are more valuable.
Many modern corpora are made available to other researchers free of charge.
They enable researchers all over the world to access the same sets of data, which encourages a higher degree of accountability in data analysis and permits collaborative studies by different researchers.
Corpus design and compilation
11. A. General corpora
o The BNC contains 100 million words and COCA 560 million words.
o Brown and LOB contain a mere one million words each.
General corpora are designed to be balanced and to include language samples from a wide range of registers or genres.
Most of the early general corpora were limited to written language, because written texts are vastly easier and cheaper to compile than transcripts of speech.
A few corpora are dedicated to spoken discourse:
o The Cambridge and Nottingham Corpus of Discourse in English (CANCODE).
Types of corpora
12. B. Specialized corpora
Specialized corpora are designed with more specific research goals in mind, and they are considered the most crucial ‘growth area’ for corpus linguistics.
Specialized corpora may include both spoken and written components:
o The International Corpus of English (ICE)
o The TOEFL-2000 Spoken and Written Academic Language Corpus
A specialized corpus focuses on a particular spoken or written variety of language:
• Historical corpora, such as the ARCHER Corpus (two million words of British and American English dating from 1650 to 1990).
• Learner corpora (spoken or written language samples produced by non-native speakers).
Types of corpora
14. One of the most important factors in corpus linguistics is the design of the corpus.
The design of the corpus impacts all of the analyses and results.
The composition of the corpus should reflect the anticipated research goals.
For example, to compare patterns of language found in spoken and written discourse:
• The corpus has to include a range of possible spoken and written texts.
• The information derived from the corpus must accurately reflect the variation possible in the patterns being compared across the two registers.
Issues in corpus design
15. A well-designed corpus should aim to be representative of the types of language included in it.
There are many different ways to conceive of and justify representativeness:
a. Representativeness of different registers (fiction, casual conversation) and topics (national vs. local news).
b. Representativeness of the demographics of the speakers or writers (nationality, gender, education level).
c. Representativeness based on production or reception (e-mail messages, newspapers).
All these issues must be weighed when deciding how much of each category to include.
In thinking about the research goals of a corpus, compilers must also bear in mind the intended distribution of the corpus.
Issues in corpus design
17. When creating a corpus, data collection involves obtaining or creating electronic versions of the target texts, and storing and organizing them.
Data collection for a written corpus may mean using a scanner and optical character recognition (OCR) software to turn paper documents into electronic text files.
OCR is not error-free, so manual proofreading and error correction are necessary.
Data collection for a spoken corpus is long and expensive, requiring:
• a transcription system (usually an orthographic transcription system);
• representation of the interactional characteristics of the speech in the transcripts.
An important issue for both spoken and written corpora during data collection is obtaining permission to use the data in the corpus.
Corpus compilation
19. A simple corpus could consist of raw text, with no additional information provided about the origins, authors, speakers, structure or contents of the texts themselves.
Encoding some of this information in the form of markup makes the corpus more useful.
Structural markup refers to the use of codes in the texts to identify structural features of the text:
o in a written corpus: titles, authors, chapters;
o in a spoken corpus: speakers, paralinguistic features.
Many corpora provide information about the contents and creation of each text in what is called a header.
Headers include classifications of the text into categories such as register, genre and topic domain.
Markup and annotation
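To make the idea of a header concrete, here is a small sketch in Python using an invented XML fragment; the element names (register, genre, topic) are illustrative only and not the actual TEI or BNC header scheme, which is far richer:

```python
import xml.etree.ElementTree as ET

# Hypothetical header for one text in a corpus; real header
# schemes record far more (authors, dates, permissions, etc.).
doc = """<text>
  <header>
    <register>newspaper</register>
    <genre>editorial</genre>
    <topic>politics</topic>
  </header>
  <body>Raw text of the sample goes here.</body>
</text>"""

root = ET.fromstring(doc)
# Collect the header fields into a dictionary for filtering texts.
header = {child.tag: child.text for child in root.find("header")}
print(header)
```

A corpus tool can use such headers to restrict a search to, say, only newspaper texts, which is what makes markup more useful than raw text.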
20. Some corpora are also encoded with certain types of linguistic annotation.
There are different kinds of linguistic processing or annotation:
A. Part-of-speech tagging, which involves assigning a grammatical category tag to each word in the
corpus.
o ‘A goat can eat shoes’ → A (indefinite article) goat (noun, singular) can (modal) eat (main verb) shoes (noun,
plural).
B. Prosodic and phonetic annotation, which are not uncommon.
C. Syntactic parsing, which is much less common.
A tagged corpus allows researchers to answer different types of questions, explore the
frequency of lexical items and grammatical structures, and address the problem of words that
have multiple meanings or functions.
Markup and annotation
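The ‘A goat can eat shoes’ example above can be sketched as a toy part-of-speech tagger: each word is simply looked up in a tiny hand-made lexicon. Real taggers use far larger lexicons plus contextual rules or statistical models; the lexicon and tag labels here are purely illustrative.

```python
# Toy POS tagger: look each word up in a tiny hand-made lexicon.
LEXICON = {
    "a": "indefinite article",
    "goat": "noun, singular",
    "can": "modal",
    "eat": "main verb",
    "shoes": "noun, plural",
}

def tag(sentence):
    """Return (word, tag) pairs for each word in the sentence."""
    return [(w, LEXICON.get(w.lower(), "unknown")) for w in sentence.split()]

print(tag("A goat can eat shoes"))
```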
21. What can a corpus tell us?
23. There are many levels of information that can be gathered from a corpus.
These levels range from simple word lists to complex grammatical structures and interactive
analyses.
Analyses can explore individual lexical or linguistic features or identify clusters.
The tools that are used for these analyses range from basic to complex computer programs.
The most basic information that we can get from a corpus is frequency-of-occurrence
information.
o Tools such as MonoConc, WordSmith Tools and AntConc provide it.
A word list is a list of all the words that occur in the corpus, arranged in alphabetical or
frequency order.
Word counts and basic corpus tools
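The kind of frequency word list described above can be sketched in a few lines of Python. The mini-corpus is invented for illustration; dedicated tools such as AntConc do far more:

```python
from collections import Counter
import re

# A one-sentence "corpus", invented for illustration.
corpus = "The goat ate the shoes because the shoes were there"
words = re.findall(r"[a-z']+", corpus.lower())

freq_order = Counter(words).most_common()  # word list in frequency order
alpha_order = sorted(set(words))           # word list in alphabetical order

print(freq_order)
print(alpha_order)
```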
26. Concordancing packages can provide additional information about lexical co-occurrence
patterns.
Once the search word is selected, the program searches the texts in the corpus and provides a
list of each occurrence of the target word in context; this is called ‘key word in context’ (KWIC).
A concordance program can also provide information about words that tend to occur together in
the corpus, called ‘collocates’, and the resulting sets of words are called ‘collocations’.
An analysis of collocations provides important information about grammatical and semantic
patterns of use for individual lexical items.
Corpus analysis can uncover patterns of use that previously went unnoticed.
For example, although the synonymous verbs begin and start have the same grammatical potential,
corpus analysis reveals differences in how they are actually used.
Word counts and basic corpus tools
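A bare-bones KWIC display and collocate count can be sketched as follows. The mini-corpus is invented for illustration; concordancing packages such as MonoConc or AntConc provide much richer output:

```python
from collections import Counter

# Tiny invented corpus, pre-tokenized for simplicity.
corpus = ("we begin the lesson today . "
          "they begin the meeting early . "
          "we start the engine now").split()

def kwic(tokens, node, span=2):
    """List each occurrence of `node` with `span` words of context."""
    hits = []
    for i, w in enumerate(tokens):
        if w == node:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            hits.append(f"{left} [{node}] {right}")
    return hits

def collocates(tokens, node, span=2):
    """Count words occurring within `span` words of `node`."""
    counts = Counter()
    for i, w in enumerate(tokens):
        if w == node:
            counts.update(tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span])
    return counts

for line in kwic(corpus, "begin"):
    print(line)
print(collocates(corpus, "begin").most_common(2))
```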
29. In order to carry out more sophisticated types of corpus analyses, it is often necessary to have
a tagged corpus.
When a corpus is tagged, each word in the corpus is given a grammatical label.
The process of assigning grammatical labels to words is complex.
For example, ‘can’ falls into two grammatical categories.
o It can be a modal ‘I can reach the book’.
o It can be used as a noun ‘Put the paper in the can’.
Computers can accurately identify the grammatical labels for many words, but certain features
remain elusive; in such cases the program brings the problematic item to the screen for the user
to select the correct classification.
Once texts have been tagged, they provide a fuller picture of the language of a register.
Working with tagged texts
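The ‘can’ ambiguity above can be illustrated with a toy contextual rule: after a determiner such as ‘the’, tag ‘can’ as a noun, otherwise as a modal. Real taggers use much richer contextual rules or probabilities; this heuristic is only a sketch.

```python
# Toy contextual disambiguation of 'can': noun after a determiner, else modal.
DETERMINERS = {"the", "a", "an", "this", "that"}

def tag_can(tokens):
    tags = []
    for i, w in enumerate(tokens):
        if w.lower() == "can":
            prev = tokens[i - 1].lower() if i > 0 else ""
            tags.append("noun" if prev in DETERMINERS else "modal")
        else:
            tags.append(None)  # other words are left untagged in this sketch
    return tags

print(tag_can("I can reach the book".split()))
print(tag_can("Put the paper in the can".split()))
```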
31. Over the years, corpora have been used to address a number of interesting issues such as the
question of language change.
The area of historical linguistics looks at how language has changed over the centuries.
Scholars have also looked into language development, in first and second language situations.
Corpora have also been used to explore similarities or differences across different national or
regional varieties of English (Australian English, American English, Indian English).
There are also studies that explore the differences between spoken and written language.
Before corpus linguistics it was difficult to note patterns of use, since observing and tracking
use patterns was a huge task.
Overview of different types of corpus studies
33. Corpus linguistic studies have had an impact on classroom language teaching practices.
Corpus-based studies of particular language features, such as The Longman Grammar of Spoken and
Written English, serve language teachers by providing a basis for deciding which language
features and structures are important.
Teachers and materials writers can have a basis for selecting the material that is being
presented.
Rather than basing pedagogical decisions on intuitions, these decisions can now be grounded
in actual patterns of language use in various situations.
How can corpora inform language teaching?
35. Corpus-based information can be brought to bear on language teaching in two ways:
1. Teachers can shape instruction based on corpus-based information.
• They can consult corpus studies to gain information about the features that they are teaching.
• For example:
o ‘Conversational English’ teachers could read corpus investigations on spoken language
to determine which features and grammatical structures are characteristic of conversational
English.
Bringing corpora into the language classroom
36. 2. Learners interact with corpora.
This can take place in one of two ways:
A. If computer facilities are adequate, learners can be actively involved in exploring corpora.
B. If adequate facilities do not exist, teachers can bring the results of corpus searches into
the classroom.
The use of concordancing tasks in the classroom is a matter of some controversy.
• It is strongly advocated by those who favour an inductive or data-driven approach to learning.
• It is criticized by others, who argue that it is difficult to guide students appropriately in the
analysis of vast numbers of linguistic examples.
Bringing corpora into the language classroom
38. The creation of appropriate, corpus-based teaching materials takes time, careful planning and
access to a few basic tools and resources.
The activities require access to a computer, texts and a concordancing package.
Several vocabulary activities can be generated through simple frequency lists and concordance
output.
The vocabulary frequency list can be used to identify vocabulary words that need to be taught.
Frequency lists can also be a starting point for students to group words by grammatical
category (verbs, nouns, etc.) or by semantic category.
Examples of corpus-based classroom activities
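The grouping activity described above can be sketched as follows. Both the frequency list and the word-class lexicon are hand-made for illustration; in practice the list would come from a concordancing package and the grouping would be done by students:

```python
# Group a (hand-made) frequency list by grammatical category.
freq_list = ["run", "teacher", "eat", "book", "walk", "student"]
WORD_CLASS = {
    "run": "verb", "eat": "verb", "walk": "verb",
    "teacher": "noun", "book": "noun", "student": "noun",
}

groups = {}
for word in freq_list:
    groups.setdefault(WORD_CLASS.get(word, "other"), []).append(word)

print(groups)
```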
39. Concordances of target words can be used to better understand those words’ meanings and
usage.
The way a word is used and its patterning characteristics also contribute to its meaning senses.
For example, words are often seen as synonymous when their actual use is not synonymous.
Dictionaries often list the ‘resulting copulas’ become, turn, go and come as synonyms, with
meanings like ‘to become’, ‘to get to be’, ‘to result’, ‘to turn out’.
Most dictionaries provide no clues to how these four words might differ in meaning.
Corpus research shows that these words differ dramatically in their typical contexts of use.
Examples of corpus-based classroom activities
40. o ‘turn’ describes a change of colour or physical appearance.
(The water turned grey)
o ‘go’ describes a change to a negative state.
(go crazy, go bad, go wrong)
o ‘come’ describes a change to a more active state.
(come awake, come alive)
If corpus activities are coupled with dictionary activities, they can provide a much richer
language-learning environment for students.
The patterns of language use that can be discovered through corpus linguistics will continue to
reshape the way we think of language.
Examples of corpus-based classroom activities
41. Schmitt, N. (2020). An introduction to applied linguistics. Routledge.
RESOURCES