2. What is a Corpus?
• Definition
• Why are they used?
• What are they considered to be?
– Method vs Theory
• Types of corpora?
– Monolingual Vs. Multilingual
– Parallel Vs. Translated
3. Corpus Linguistics
• CL history
– 1960 1st generation e.g. Brown
– 1975 2nd generation e.g. Cobuild
– 1990 3rd generation e.g. BoE (Bank of English)
• Roots
– CL and Linguistics
• Comparative linguistics
• Syntax and semantics
– Chomskyan revolution
• Technology and the progress of CL
• Benefits of CL
• Problems of CL
4. Building the Corpora
• General corpora
– E.g. BNC, The Brown Corpus
• Specialized corpora
– How corpora are used (written – spoken)
– Materials for creating the corpora (newspapers – books – documents, etc.)
– General (social – science – art, etc.)
• Multilingual corpora – Parallel corpora
• Learner corpora (e.g. International Corpus of Learner English)
• Monitor Corpus (The Bank of English)
• Historical Corpus
5. Advantages and Disadvantages
• More reliable than intuition
• Language patterns are easily identified
• Deconstruct texts to discover patterns
• Track the development of specific features in the
history of English
• Test hypotheses on specific language features
empirically
• Follow language acquisition properly
• Draw conclusions from large amounts of linguistic data
• Not always a complete picture
• Shows frequency (what occurs) rather than possibility (what could occur)
6. CL terminology
• Concordance
– Where and in what context?
– Frequency
• Annotation
– Mark-up
• Tagging
– POS tagging
– Syntactic Treebank
– Semantic tagging
• Coding
• Metadata
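The concordance and frequency terms above can be illustrated with a minimal sketch. It is not how any particular concordancer works; the tokenizer is deliberately naive, and the names `tokenize`, `concordance`, `sample` and the context `width` are assumptions made only for this example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and pull out word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def concordance(tokens, node, width=3):
    """Return key-word-in-context lines: each hit of `node` with up to
    `width` tokens of context on either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines

# Hypothetical two-sentence "corpus" used only for illustration.
sample = "The cat sat on the mat. The dog chased the cat across the garden."
tokens = tokenize(sample)

freq = Counter(tokens)             # frequency list: how often each word occurs
kwic = concordance(tokens, "cat")  # concordance: where and in what context

for line in kwic:
    print(line)
print(freq.most_common(3))
```

A real corpus would add annotation (e.g. POS tags) and metadata to each token or text, so that searches can be restricted by word class, genre or speaker.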
8. Corpora and Translation
• Corpus translation studies (CTS)
• Descriptive translation
• Equivalence
• Corpus-based translation
• The process Vs the product
• The third code
• Simplification Vs normalization
9. Methods of Research in CL
• Quantitative
• Qualitative
– Context
• Quantitative and Qualitative
10. Corpus Software
• AntConc: a freeware concordance program
• MICASE: Michigan Corpus of Academic
Spoken English
• TACT: Text Analysis Computing Tools
• TACTWeb: a concordance program based on
TACT but for the Web
• SARA: the concordance program which is
specifically written for the British National
Corpus
11. Corpus Software Continued
• BNCweb
– BNCweb is a web-based client program for searching and retrieving lexical, grammatical and textual data from the
British National Corpus (BNC). It relies on the Corpus Query Processor (CQP) of the IMS Open Corpus Workbench
to provide a convenient interface between the user and the rich variety of annotated text in the 100-million word
BNC in its most recent incarnation, the XML version.
• BNC Web Index
– This is the web front end to David Lee's BNC Index spreadsheet. For an introduction to BNC Index, please
see David's web site.
• CLAWS
– Part-of-speech tagging software for English.
• Clustertool
– Clustertool allows you to perform Hierarchical Agglomerative Cluster Analysis on your own data.
• CQPweb
– An extension of BNCweb but designed for use with any corpus.
• LL Calculator
– This calculates log-likelihood values from a 2x2 contingency table. LL is a more reliable alternative to the
standard Pearson's chi-squared test; see Dunning (1993).
• LWAC
– LWAC is a tool for constructing corpora from web data.
• Sentrick
– Stream-oriented Java library and a set of command-line tools for high-quality sentence boundary detection
(sentence segmentation / splitting / disambiguation). Currently has one model for German (trained on general text
and Wikipedia lynx dumps).
• SigTest
– Flexible Significance Test System: chi-squared test, log-likelihood test and Fisher exact test for any kind of
contingency table, using R.
• USAS
– Semantic tagger developed for English and extended to Finnish and Russian.
• VARD
– Variant Detector software that facilitates the pre-processing of corpora for normalisation of spelling variation (e.g.
Early Modern English).
• Wmatrix
– A corpus comparison and annotation tool incorporating CLAWS and USAS in a web front end.
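The log-likelihood calculation mentioned for the LL Calculator can be sketched as follows. This is a minimal illustration of the general G2 statistic over all four cells of a 2x2 table, not the tool's actual implementation; the function name `log_likelihood` and the example counts are assumptions made for this sketch.

```python
import math

def log_likelihood(a, b, c, d):
    """G2 statistic for the 2x2 contingency table [[a, b], [c, d]].

    G2 = 2 * sum(O * ln(O / E)) over the four cells, where each expected
    value E is (row total * column total) / grand total.
    """
    n = a + b + c + d
    if n == 0:
        return 0.0
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a cell with O = 0 contributes nothing to the sum
                e = row_totals[i] * col_totals[j] / n
                g2 += o * math.log(o / e)
    return 2 * g2

# Hypothetical counts: a word occurs 120 times in a 10,000-token corpus
# and 60 times in a 12,000-token corpus.
# Rows = corpora, columns = (occurrences of the word, all other tokens).
ll = log_likelihood(120, 10_000 - 120, 60, 12_000 - 60)
print(round(ll, 2))
```

With one degree of freedom, a value above 3.84 corresponds to p < 0.05, so larger values indicate a frequency difference that is less likely to be due to chance.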
12. Additional Resources
• University of Lancaster Centre for Computer Corpus Research on Language (Summer
School) http://ucrel.lancs.ac.uk/
• McEnery, Tony, and Wilson, Andrew. Corpus Linguistics, 2nd ed. Edinburgh University Press,
2001.
• ESRC Centre for Corpus Approaches to Social Science (CASS) University of Lancaster
• Aston, Guy and Burnard, Lou. The BNC handbook: exploring the British National Corpus
with SARA. Edinburgh University Press, 1998.
• Biber, Douglas, Conrad, Susan, and Reppen, Randi. Corpus Linguistics: Investigating
Language Structure and Use. CUP, 1998.