A corpus is a body of knowledge or information of some kind. In linguistics, it usually means a collection of texts taken to represent some aspect of language, for example fictional writing, radio broadcasts or editorials. By carrying out research on the corpus (or corpora), the researcher hopes to make generalizations about that aspect of language as a whole.
Many of the ideas we have about our own language, which are based on linguistic intuitions, are not correct.
* “Language users cannot accurately report language usage, even their own” [Sinclair, J. (1987) Introduction, in the Collins Cobuild English Language Dictionary, London: Collins]
* “There are many facts about language that cannot be discovered by just thinking about it, or even reading and listening very intently” [Sinclair, J. (1995) Introduction, in the Collins Cobuild English Dictionary, London: HarperCollins]
* “Using a language is a skill that most people are not conscious of; they cannot examine it in detail, but simply use it to communicate” [Sinclair, J. (1995) Introduction, in the Collins Cobuild English Dictionary, London: HarperCollins]
People used to think the Earth was flat and that it was the centre of the Solar System. Galileo’s discovery of the moons of Jupiter, made possible by better technology (a telescope), forced astronomers and other scientists to think again about their theories and assumptions. We can think of a corpus as a telescope that gives a more clearly focused view of the language we are investigating.
Corpora may compensate for a lack of exposure to sufficiently varied examples by providing a variety of examples in a concentrated form. They often offer more motivating, interesting or exciting approaches to teaching and learning foreign languages.
Corpus design is an art in itself. However, you can build useful corpora in the classroom. At a minimum, you need a collection of writings or transcriptions saved as simple text files, and some concordancing software for analysis.
You also need to consider copyright and the type of language investigation you want to carry out.
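To make the idea of concordancing concrete, here is a minimal sketch in Python of what a concordancer does: it finds every occurrence of a search word and shows it with a few words of context on either side (the KWIC, or key-word-in-context, format). It is an illustration only, not a substitute for dedicated concordancing software. The function names `tokenize` and `kwic`, the very simple tokenizing rule, and the sample sentence are all assumptions made for this example.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())


def kwic(tokens, keyword, width=4):
    """Return (left context, keyword, right context) for each occurrence."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# Hypothetical sample text standing in for the contents of a corpus file.
sample = "The cat sat on the mat. The dog sat by the door."

for left, word, right in kwic(tokenize(sample), "sat", width=3):
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>20}  [{word}]  {right}")
```

With a real classroom corpus, the sample string would be replaced by the text read from your own plain text files, and the tokenizing rule would probably need adjusting for the language under investigation.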
Wichmann, Fligelstone, McEnery and Knowles, Teaching and Language Corpora (Longman, 1997), is a good starting point for most of the further questions you may have.
HarperCollins: http://www.collins.co.uk/Corpus/CorpusSearch.aspx [Collins; English only; search written, spoken, British, American separately; KWIC format; max 40 examples per search; can specify word class and a few other features; collocation lists also available; 56 million words]
Oxford-BNC (British National Corpus): http://sara.natcorp.ox.ac.uk/lookup.html [British English only; from c. 1994; sentence-length examples only, not KWIC format; max 50 examples per search; 100 million words]
Brigham Young University-BNC: http://corpus.byu.edu/bnc/x.asp [a better place to access BNC; KWIC format concordances, etc]
Corpus of American English: http://www.americancorpus.org/
David Lee’s Corpora Bookmarks: http://devoted.to/corpora
Tim Johns’ website has many exercises and useful links, including his “CONTEXTS” program: http://www.eisu2.bham.ac.uk/johnstf/index.html