CORPUS LINGUISTICS
Corpus linguistics as Applied Linguistics
& Teaching / Language corpora
(based on P. Rayson’s & D. Archer’s course for H & S linguistics
Lancaster University & U. of Central Lancashire)
A corpus can be defined as a collection of texts
assumed to be representative of a given language
put together so that it can be used for linguistic
analysis. Usually the assumption is that the
language stored in a corpus is naturally-occurring,
that it is gathered according to explicit design
criteria, with a specific purpose in mind, and with
a claim to represent natural chunks of language
selected according to a specific typology.
Tognini-Bonelli (2001:2)
“nowadays the term 'corpus' nearly
always implies the additional
feature of 'machine-readable'”.
McEnery & Wilson, Corpus Linguistics. Online manual.
Types of (electronic)
corpora
Based on: http://www.uow.edu.au/~dlee/CBLLinks.htm
ENGLISH CORPORA: GENERAL LANGUAGE
CORPORA
The 1980s:
-Bank of English
-monitor corpus
-both spoken and written text
-different regional varieties of English
-British National Corpus (BNC)
-90 million written words
-10 million spoken words
-freely accessible (search interface and KWIC)
The 1990s and 2000s
-speech corpora:
-sound recordings (started in previous corpora)
-SPOKEN ENGLISH CORPUS (e.g., MICASE)
-detailed description of spoken phenomena: phonology,
prosody (stress, tone units…), etc
-multimedia corpora:
-transcripts synchronised with audio/video recordings
-TALKBANK Website: SANTA BARBARA CORPUS
OF SPOKEN AMERICAN ENGLISH (SBCSAE)
The 1990s and 2000s
-parsed corpora:
-syntactically analysed
-SURFACE AND UNDERLYING STRUCTURAL
ANALYSES AND NATURALISTIC ENGLISH CORPUS
(SUSANNE)
-historical / diachronic corpora:
-English of different periods (e.g., Helsinki)
-may cover specific historical periods or genres (e.g.,
COCA)
-track and describe how language has evolved
-A REPRESENTATIVE CORPUS OF HISTORICAL
ENGLISH REGISTERS (ARCHER)
The 1990s and 2000s
-specialised corpora:
-focus on concrete genres/domains
-BUSINESS LETTERS CORPUS (BLC) (e.g., Someya)
-lingua franca corpora:
-ENGLISH AS A LINGUA FRANCA IN ACADEMIC
SETTINGS (ELFA) CORPUS
-intercultural exchanges among speakers who use
English as a lingua franca (also, e.g., ICLE)
The 1990s and 2000s
-developmental language corpora:
-non-adult English native speakers' output (e.g.,
CHILDES)
-output less proficient than that found in adult native-speaker corpora
-POLYTECHNIC OF WALES (POW) CORPUS
-ESL/EFL learner corpora:
-output of learners of English
-one and the same L1 background or different mother
tongues
-JAPANESE EFL LEARNER CORPUS (JEFLL)
The 2010s
On-line corpora + management tools on-line
(e.g., Sketch Engine)
Corpora builders on-line
N-gram and conc-gram builders
Literature and corpora
Other uses: Socio-cultural studies,
historical, Translation, Law, (?)
WORDSMITH: how to manage corpora
-Computer program which permits users to compile their
own corpus
-Texts must be in .txt (doc, htm) format
-Any text can be subjected to the same process of
analysis that official corpora undergo: concordance
lines, word lists, etc
-No need to pre-process such texts in advance (e.g., XML
coding or exhaustive tagging)
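The compile-then-list workflow described above can be sketched in Python. This is only an illustration under stated assumptions, not WordSmith's own procedure: it assumes a folder of UTF-8 .txt files, and the names `tokenize`, `load_corpus` and `word_list`, as well as the simplified tokenising pattern, are hypothetical.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path

# Simplified tokeniser (an assumption): lowercase runs of letters,
# optionally joined by an apostrophe or hyphen (e.g. "didn't").
TOKEN_RE = re.compile(r"[a-z]+(?:['’-][a-z]+)*")


def tokenize(text):
    """Split raw text into lowercase word tokens."""
    return TOKEN_RE.findall(text.lower())


def load_corpus(folder):
    """Read every .txt file in `folder` into a dict {file name: token list}."""
    corpus = {}
    for path in sorted(Path(folder).glob("*.txt")):
        corpus[path.name] = tokenize(path.read_text(encoding="utf-8"))
    return corpus


def word_list(corpus):
    """Frequency list over the whole corpus, most frequent word first."""
    counts = Counter()
    for tokens in corpus.values():
        counts.update(tokens)
    return counts.most_common()


if __name__ == "__main__":
    # Build a tiny throwaway corpus so the sketch runs on its own.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "a.txt").write_text("The cat sat on the mat.", encoding="utf-8")
        Path(tmp, "b.txt").write_text("The dog didn't sit.", encoding="utf-8")
        for word, freq in word_list(load_corpus(tmp))[:3]:
            print(word, freq)
```

No tagging or XML mark-up is involved here, which mirrors the point that plain texts can be analysed without pre-processing.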
Corpus linguistics: Why applied linguistics?
-Insights into the internal workings of real language
-Knowledge in turn also used in other fields of enquiry
-Planning, designing, compiling and tagging
-Frequency lists and concordance lines (+further analysis)
-Sinclair’s (2003) “degeneralisation”:
-sceptical about 'received' descriptions
-patterns found in the data: more precise or alternative descriptions
-Corpus-based dictionaries and grammars
-how lexis and grammar are “really” used
-COLLINS COBUILD LEARNER'S DICTIONARY
-THE LONGMAN GRAMMAR OF SPOKEN AND WRITTEN
ENGLISH
Underlying assumption
Intuition is not enough to study language
…
Reaction to Noam Chomsky’s focus
on introspection in 1950s/60s
Empirical observation of naturally
occurring data versus theory of how
human language processing is actually undertaken
REPRESENTATIVENESS is Key:
i.e. a balanced sample of a language or a
particular variety of language (cf.
national corpora: British, American,
Czech, Polish …)
Reasoning?
Helps to remove intuitive bias
Helps us to find common/rare phenomena
AND SIZE IS ALSO IMPORTANT
-Brown/LOB (1960s): 1 million words
-BNC (1990s): 100 million words
-Web (present day): ? billion words
-Birmingham corpus (1980): 10 million words
-Collins Bank of English, Cambridge International Corpus, Oxford English Corpus (2006): 600 million – 1 billion words
-Web (future): ? billions of words
So what is corpus linguistics?
= the “study of language using corpora”
= empirical methodology
= a useful means of exploring:
Synchronic and diachronic variation
Syntax, semantics, pragmatics
Lexicography
Dialects, minority languages
Not just English
Corpus techniques we utilise
Retrieval
Frequency profiling
Concordancing
Collocations
Key words
Key domains
Annotation
POS tagging
Semantic tagging
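As an illustration of the retrieval techniques listed above, a key-word-in-context (KWIC) concordance could be sketched as follows. The function name `kwic`, the window size and the simple tokenising pattern are assumptions made for this example, not features of any particular concordancer.

```python
import re

# Simplified tokeniser (an assumption): runs of letters, optionally joined
# by an apostrophe or hyphen.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:['’-][A-Za-z]+)*")


def kwic(text, node, width=3):
    """Return (left context, node, right context) for every hit of `node`.

    Matching is case-insensitive; `width` is the number of tokens shown
    on each side of the node word.
    """
    tokens = TOKEN_RE.findall(text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


if __name__ == "__main__":
    sample = ("The news was good. No news is good news, "
              "they say, but the evening news disagreed.")
    # Align the node word in a central column, as concordancers do.
    for left, node, right in kwic(sample, "news"):
        print(f"{left:>25}  {node}  {right}")
```

Sorting the returned lines by their right or left context would bring recurrent patterns together, which is how concordance lines are usually read.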
Annotation
-Part of speech tagging
-Semantic field tagging
Retrieval
-Frequency lists
-Concordances
Key words
[Diagram labels: Text; Text or reference corpus; Keywords]
What are “key words”?
And why are they so useful?
Key words
Word Clouds
If we compare text A with text B, we can discover the most
significant items within text A, and not only the frequent items
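The comparison of text A with text B can be sketched with a keyness statistic. The slides do not name one; log-likelihood is used here as an assumption because it is a common choice for this purpose. The function names and toy texts are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness score for one word.

    a, b: frequency of the word in text A and in text B
    c, d: total number of tokens in text A and in text B
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in A
    e2 = d * (a + b) / (c + d)  # expected frequency in B
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def key_words(tokens_a, tokens_b, top=10):
    """Words relatively more frequent in A than in B, ranked by keyness."""
    fa, fb = Counter(tokens_a), Counter(tokens_b)
    c, d = len(tokens_a), len(tokens_b)
    scored = []
    for word, a in fa.items():
        b = fb.get(word, 0)
        if a / c > b / d:  # keep only words overused in A
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]


if __name__ == "__main__":
    text_a = "the corpus shows the corpus data".split()
    text_b = "the cat saw the dog and the bird".split()
    for word, score in key_words(text_a, text_b):
        print(f"{word:10} {score:.2f}")
```

In this toy comparison "the" is as frequent as any word in text A, yet it is not key, because it is just as common in text B; "corpus" ranks first.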
Collocations
Collocation = a relationship between words that tend to
occur together in texts
Words that tend to occur near word X are the collocates
of word X (consider “fish and XXXXX”)
Based on frequency (how frequent the words are separately vs. how
frequent they are together: t-score & mutual information)
The company a word keeps: implicit associations or
assumptions (also “semantic prosody”, cf. Sinclair, 2002)
e.g., Bachelor: eligible, flat, life, days
Spinster: elderly, widows, sisters, parish
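The two measures named above can be sketched with their common textbook definitions: MI = log2(O / E) and t = (O - E) / sqrt(O), where O is the observed co-occurrence frequency and E the frequency expected by chance. As a simplifying assumption, this example counts only the word immediately following the node and takes E = f(x) · f(y) / N; real tools usually work with a wider window. The function name and toy data are hypothetical.

```python
import math
from collections import Counter


def collocation_scores(tokens, node, min_freq=1):
    """Score each word that directly follows `node`.

    Returns {collocate: (observed frequency, MI, t-score)}.
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    results = {}
    for (w1, w2), observed in bigrams.items():
        if w1 != node or observed < min_freq:
            continue
        # Frequency expected if the two words were independent.
        expected = unigrams[w1] * unigrams[w2] / n
        mi = math.log2(observed / expected)
        t = (observed - expected) / math.sqrt(observed)
        results[w2] = (observed, round(mi, 2), round(t, 2))
    return results


if __name__ == "__main__":
    tokens = "fish and chips fish and chips fish and rice".split()
    for word, (obs, mi, t) in collocation_scores(tokens, "and").items():
        print(f"and {word:6} O={obs}  MI={mi}  t={t}")
```

In the toy data "chips" and "rice" receive the same MI score, but "chips" has the higher t-score because it co-occurs with the node more often. This reflects the usual observation that MI favours rare pairings while the t-score favours frequent ones.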
Corpus software
Our scenario
[Diagram labels: SEM TAGGER; POS TAGGER; Book Search; Other texts – not compiled for corpus linguistics]
Linguistic analysis
[Diagram labels: Natural language processing & Computational linguistics; Corpus; Empirical evidence to inform linguist…; Statistical and rule-based language models; Corpus Linguistics; Theories on language / communication]
Corpus linguistics for the analysis of real linguistic /
paralinguistic / extra-linguistic information
[Screenshot labels: feedback; space for our own annotation; some mark-up for context; audiovisual element]
Task for you:
What corpus steps / resources would you have to
go through in order to…?
**Find the most frequent word in spoken English?
++Identify different collocates with “news” in
written English
>>Determine the most frequent verb tense used in
Business letters
// Determine the key errors that your students make
by comparison with other students’ errors
CORPORA and LANGUAGE TEACHING /
LEARNING
-Mixture of instructional and naturalistic language learning (LL)
-Fulfilment of both the input and output hypotheses
-“Scaffolding” (though loosely speaking)
-insights concerning English culture(s)
-Student-centred and related to constructivism:
mastering corpora = learning autonomy
CORPUS-BASED ESL/EFL ACTIVITIES
-Focus on lexis, grammar and register
-introductory notions concerning collocation,
colligation, and formal vs. informal
-For already motivated students: corpora on-line
Activity one:
contractions, formal or
informal?
spoken or written?
Quotation marks!
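For learners who prefer to check the formal/informal question on their own texts, contraction rates could be compared with a short script. This is a rough sketch, not part of the original activity: the regular expressions and the function name are assumptions, and the pattern also counts possessive 's (e.g. "John's"), so the figures are only indicative.

```python
import re

# Assumption: a contraction is a word ending in an apostrophe plus one of
# a small set of clitics. Possessive 's is matched too (a known limitation).
CONTRACTION_RE = re.compile(r"\b[A-Za-z]+['’](?:s|t|re|ve|ll|d|m)\b", re.IGNORECASE)
WORD_RE = re.compile(r"[A-Za-z]+(?:['’][A-Za-z]+)?")


def contractions_per_thousand(text):
    """Contractions per 1,000 words, so texts of different length compare fairly."""
    words = WORD_RE.findall(text)
    if not words:
        return 0.0
    hits = CONTRACTION_RE.findall(text)
    return 1000 * len(hits) / len(words)


if __name__ == "__main__":
    informal = "I don't know, it's fine, we'll see."
    formal = "I do not know whether it is acceptable."
    print("informal:", contractions_per_thousand(informal))
    print("formal:  ", contractions_per_thousand(formal))
```

Normalising per 1,000 words matters because spoken and written samples are rarely the same size.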
Activity two:
Corpora as a source of
knowledge concerning
collocation and colligation
[v*] mistakes
powerful, not strong!!![aj*]
Activity three:
meaning via collocations and
co-text
Activity four: Lexical descriptions
**Contact with the English language:
input (at least lexis-wise)
[Diagram labels: author corpus; reference corpus]
Select the text you want a list of
Save both lists to compare them with Keyword
[Diagram labels: author corpus list; reference corpus list]
Some concluding remarks = Task for you:
See what kinds of corpus things you would have
to do to:
>> Demonstrate that the use of some verbs is
incorrect
>> Investigate real lexical levels needed for
different learning stages (e.g., beginner versus
intermediate)
** Determine the main grammatical structures in
conversation that learners may need
Adolphs, Svenja (2006). Introducing electronic text analysis: a practical
guide for language and literary studies. London: Routledge.
Biber, Douglas, Conrad, Susan, & Reppen, Randi. (1998). Corpus
Linguistics: Investigating language structure and use. Cambridge: CUP.
Hockey, Susan. (2000). Electronic texts in the humanities: Principles and
practice. Oxford: Oxford University Press.
Hunston, Susan. (2002). Corpora in applied linguistics. Cambridge:
Cambridge University Press.
Kennedy, Graeme D. (1998). An introduction to corpus linguistics.
London: Longman.
McEnery, Tony, & Wilson, Andrew. (2001). Corpus linguistics (2nd
ed.). Edinburgh: Edinburgh University Press.
Meyer, Charles. (2002). English corpus linguistics: An introduction.
Cambridge: Cambridge University Press.
Ooi, Vincent B.Y. (1998). Computer Corpus Lexicography. Cambridge
Sampson, Geoffrey & McCarthy, Diana (Eds.). (2004). Corpus
linguistics. London: Continuum.
Scott, Mike, & Tribble, Chris. (2006). Textual patterns: Keyword and
corpus analysis in language education. Amsterdam: Benjamins.
Sinclair, John (1991) Corpus, Concordance, Collocation. Oxford
UCREL website (many others … 1991 – 1999 / 2000 – 2005 / (…) ?


Corpus linguistics intro

  • 1. CORPUS LINGUISTICS Corpus linguistics as Applied Linguistics & Teaching / Language corpora (based on P. Rayson’s & D. Archer’s course for H & S linguistics Lancaster University & U. of Central Lanscashire)
  • 2. A corpus can be defined as a collection of texts assumed to be representative of a given language put together so that it can be used for linguistic analysis. Usually the assumption is that the language stored in a corpus is naturally-occurring, that is gathered according to explicit design criteria, with a specific purpose in mind, and with a claim to represent natural chunks of language selected according to specific typology Tognini-Bonelli (2001:2)
  • 3. “nowadays the term 'corpus' nearly always implies the additional feature of 'machine-readable'”. McEnery & Wilson, Corpus Linguistics. Online manual.
  • 4. Types of (electronic) corpora Based on: http://www.uow.edu.au/~dlee/CBLLinks.htm
  • 5. ENGLISH CORPORA: GENERAL LANGUAGE CORPORA The 1980s: -Bank of English -monitor corpus -both spoken and written text -different regional varieties of English -British National Corpus (BNC) -90 million written words -10 million spoken words -freely accessible (search interface and KWIC)
  • 6. The 1990s and 2000s -speech corpora: -sound recordings (started in previous corpora) -SPOKEN ENGLISH CORPUS (e.g., MICASE) -detailed description of spoken phenomena: phonology, prosody (stress, tone units…), etc -multimedia corpora: -transcripts synchronised audio/video recordings -TALKBANK Website: SANTA BARBARACORPUS OF SPOKEN AMERICAN ENGLISH (SBCSAE)
  • 7. The 1990s and 2000s -parsed corpora: -syntactically analysed -SURFACE AND UNDERLYING STRUCTURAL ANALYSES AND NATURALISTIC ENGLISH CORPUS (SUSANNE) -historical / diachronic corpora: -English of different periods (e.g., Helsinki) -may cover specific historical periods or genres (e.g., COCA) -track and describe how language has evolved -A REPRESENTATIVE CORPUS OF HISTORICAL ENGLISH REGISTERS (ARCHER)
  • 8. The 1990s and 2000s -specialised corpora: -focus on concrete genres/domains -BUSINESS LETTERS CORPUS (BLC) (e.g., Someya) -lingua franca corpora: -ENGLISH AS A LINGUA FRANCA IN ACADEMIC SETTINGS (ELFA) CORPUS -intercultural exchanges among speakers who use English as a lingua franca (also, e.g., ICLE)
  • 9. The 1990s and 2000s -developmental language corpora: -non-adult English native speakers' output (e.g., CHILDES) -not as proficient as native-speaker corpora -POLYTECHNIC OF WALES (POW) CORPUS -ESL/EFL learner corpora: -learners of English's output -one and the same L1 background or different mother tongues -JAPANESE EFL LEARNER CORPUS (JEFLL)
  • 10. The 2010s On-line corpora + management tools on- line (e.g., Sketch Engine) Corpora builders on-line N-gram and conc-gram builders Literature and corpora Other uses: Socio-cultural studies, historical, Translation, Law, (?)
  • 11. WORDSMITH: how to manage corpora -Computer program which permits users to compile their own corpus -Texts must be in .txt (doc, htm) format -Any text can be subjected to the same process of analysis that official corpora undergo: concordance lines, word lists, etc -No need to pre-process such texts in advance (e.g., XML coding or exhaustive tagging)
  • 12. Corpus linguistics: Why applied linguistics? -Insights into the internal workings of real language -Knowledge in turn also used in other fields of enquiry -Planning, designing, compiling and tagging -Frequency lists and concordance lines (+further analysis) -Sinclair’s (2003) “degeneralisation”: -sceptical about 'received' descriptions -patterns found in the data: more precise or alternative descriptions -Corpus-based dictionaries and grammars -how lexis and grammar are “really” used -COLLINS COBUILD LEARNER'S DICTIONARY -THE LONGMAN GRAMMAR OF SPOKEN AND WRITTEN ENGLISH
  • 13. Underlying assumption Intuition is not enough to study language … Reaction to Noam Chomsky’s focus on introspection in the 1950s/60s Empirical observation of naturally occurring data versus theory of how human language processing is actually undertaken
  • 14. REPRESENTATIVENESS is Key: i.e. a balanced sample of a language or a particular variety of language --- cf. national corpora (British, American, Czech, Polish …) Reasoning? Helps to remove intuitive bias Helps us to find common/rare phenomena
  • 15. AND SIZE IS ALSO IMPORTANT Brown/LOB (1960s): 1 million words; BNC (1990s): 100 million words; Web (present day): ? billion words
  • 16. Birmingham corpus (1980): 10 million; Collins Bank of English, Cambridge International Corpus, Oxford English Corpus (2006): 600 million – 1 billion; Web (future): ? billions
  • 17. So what is corpus linguistics? = the “study of language using corpora” = empirical methodology = a useful means of exploring: Synchronic and diachronic variation Syntax, semantics, pragmatics Lexicography Dialects, minority languages Not just English
  • 18. Corpus techniques we utilise Retrieval Frequency profiling Concordancing Collocations Key words Key domains Annotation POS tagging Semantic tagging
  • 19. Annotation Part of speech tagging Semantic field tagging Retrieval Frequency lists Concordances
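The two retrieval techniques named above, frequency lists and concordances, can be sketched in a few lines of Python. This is only an illustrative sketch, not how WordSmith or any other concordancer works internally: the function names (`tokenize`, `frequency_list`, `kwic`), the crude regular-expression tokenizer and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(tokens, top=5):
    """Return the `top` most frequent tokens as (word, count) pairs."""
    return Counter(tokens).most_common(top)

def kwic(tokens, node, span=3):
    """Return key-word-in-context lines: `span` tokens either side of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            # Right-align the left context so the node word lines up in a column.
            lines.append(f"{left:>25}  [{tok}]  {right}")
    return lines

# Hypothetical sample text, invented for the example.
sample = "The news is good. The good news is that the bad news was wrong."
tokens = tokenize(sample)

print(frequency_list(tokens, 3))
for line in kwic(tokens, "news"):
    print(line)
```

A real study would read the tokens from corpus files and would usually sort the concordance lines by left or right context before inspecting them.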
  • 20. Key words Text Keywords Text or reference corpus What are “key words”? And why are they so useful?
  • 21. Key words Word Clouds If we compare text A … with text B … we can discover the most significant items within text A … and not only the frequent items
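One way to make "most significant items" concrete is to score each word of text A against text B with a keyness statistic. The sketch below assumes the log-likelihood statistic, which is widely used for keyword analysis; the slides do not say which statistic is intended, so that choice, the function names and the toy texts are assumptions.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Log-likelihood keyness score.

    a, b: frequency of the word in text A and in text B
    c, d: total number of tokens in text A and in text B
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in A
    e2 = d * (a + b) / (c + d)  # expected frequency in B
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(tokens_a, tokens_b, top=5):
    """Rank words that are relatively more frequent in A than in B."""
    fa, fb = Counter(tokens_a), Counter(tokens_b)
    na, nb = len(tokens_a), len(tokens_b)
    scored = []
    for word, a in fa.items():
        b = fb.get(word, 0)
        if a / na > b / nb:  # keep positive keywords only
            scored.append((word, log_likelihood(a, b, na, nb)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]

# Hypothetical toy texts, invented for the example.
text_a = "the corpus shows the corpus data the corpus".split()
text_b = "the cat sat on the mat the end".split()
print(keywords(text_a, text_b))
```

In this toy comparison "the" is frequent in text A but equally frequent in text B, so it is not a keyword, while "corpus" is. That is the sense in which keywords are significant items and not merely frequent ones.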
  • 22. Collocations Collocation = a relationship between words that tend to occur together in texts Words that tend to occur near word X are the collocates of word X (consider “fish and XXXXX”) Based on frequency (how frequent separate vs. how frequent together = T-SCORE & mutual information) The company a word keeps: implicit associations or assumptions (also “semantic prosody”, cf. Sinclair, 2002) e.g., Bachelor: eligible, flat, life, days Spinster: elderly, widows, sisters, parish
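The two association measures mentioned above compare how often two words occur together (observed, O) with how often they would be expected to co-occur if they were unrelated (expected, E). A minimal sketch, assuming the common textbook definitions MI = log2(O / E) and t = (O - E) / sqrt(O) with E = f(x) * f(y) / N, and ignoring the window-size adjustment that some tools apply to E. The function names and the sample sentence are invented for the example.

```python
import math

def expected(fx, fy, n):
    """Expected co-occurrence count if the two words were independent."""
    return fx * fy / n

def mutual_information(o, fx, fy, n):
    """MI score: log2 of observed over expected co-occurrences (requires o > 0)."""
    return math.log2(o / expected(fx, fy, n))

def t_score(o, fx, fy, n):
    """T-score: (observed - expected) / sqrt(observed) (requires o > 0)."""
    return (o - expected(fx, fy, n)) / math.sqrt(o)

def cooccurrences(tokens, node, collocate, span=2):
    """Count occurrences of `collocate` within `span` tokens either side of `node`."""
    count = 0
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            count += window.count(collocate)
    return count

# Hypothetical sample text, invented for the example.
tokens = ("we ate fish and chips then more fish and chips "
          "but no rice and no fish").split()
n = len(tokens)
o = cooccurrences(tokens, "fish", "chips", span=2)
fx, fy = tokens.count("fish"), tokens.count("chips")

print("observed:", o)
print("MI:", round(mutual_information(o, fx, fy, n), 3))
print("t-score:", round(t_score(o, fx, fy, n), 3))
```

MI tends to favour rare but exclusive pairings, whereas the t-score favours frequent ones, which is one reason the two are often reported side by side.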
  • 25. Book Search Other texts – not compiled for corpus linguistics
  • 26. Linguistic analysis; Natural language processing & Computational linguistics; Corpus; Empirical evidence to inform linguist; Statistical and rule-based language models; Corpus Linguistics; Theories on language / communication; Corpus linguistics for the analysis of real linguistic / paralinguistic / extra-linguistic information; feedback
  • 27. space for our own annotation; some mark-up for context; audiovisual element
  • 28. Task for you: What corpus steps / resources would you have to go through in order to…? -Find the most frequent word in spoken English -Identify different collocates of “news” in written English -Determine the most frequent verb tense used in business letters -Determine the key errors that your students make by comparison with other students’ errors
  • 29. CORPORA and LANGUAGE TEACHING / LEARNING -Mixture between instructional and naturalistic LL -Fulfilment of both the input and output hypotheses -“Scaffolding” (though loosely speaking) -Insights concerning English culture(s) -Student-centred and related to constructivism: mastering corpora = learning autonomy
  • 30. CORPUS-BASED ESL/EFL ACTIVITIES -Focus on lexis, grammar and register -introductory notions concerning collocation, colligation, and formal vs. informal -For already motivated students: corpora on-line
  • 31. Activity one: contractions, formal or informal? spoken or written?
  • 34. Activity two: Corpora as a source of knowledge concerning collocation and colligation
  • 38. Activity three: meaning via collocations and co-text
  • 40. Activity four: Lexical descriptions **Contact with the English language: input (at least lexis-wise)
  • 42. Select the text you want a list of
  • 43. Save both lists to compare them with Keyword
  • 47. Some concluding remarks = Task for you: Consider which corpus procedures you would need in order to: -Demonstrate that the use of some verbs is incorrect -Investigate real lexical levels needed for different learning stages (e.g., beginner versus intermediate) -Determine the main grammatical structures in conversation that learners may need
  • 48. Adolphs, Svenja. (2006). Introducing electronic text analysis: A practical guide for language and literary studies. London: Routledge. Biber, Douglas, Conrad, Susan, & Reppen, Randi. (1998). Corpus linguistics: Investigating language structure and use. Cambridge: Cambridge University Press. Hockey, Susan. (2000). Electronic texts in the humanities: Principles and practice. Oxford: Oxford University Press. Hunston, Susan. (2002). Corpora in applied linguistics. Cambridge: Cambridge University Press. Kennedy, Graeme D. (1998). An introduction to corpus linguistics. London: Longman. McEnery, Tony, & Wilson, Andrew. (2001). Corpus linguistics (2nd ed.). Edinburgh: Edinburgh University Press. Meyer, Charles. (2002). English corpus linguistics: An introduction. Cambridge: Cambridge University Press. Ooi, Vincent B.Y. (1998). Computer corpus lexicography. Cambridge. Sampson, Geoffrey, & McCarthy, Diana (Eds.). (2004). Corpus linguistics. London: Continuum. Scott, Mike, & Tribble, Chris. (2006). Textual patterns: Keyword and corpus analysis in language education. Amsterdam: Benjamins. Sinclair, John. (1991). Corpus, concordance, collocation. Oxford. UCREL website (many others … 1991 – 1999 / 2000 – 2005 / (…) ?