Online Corpus
Literacy Teachers’ Best Friend
Dominik Lukeš
dlukes@dyslexiaaction.org.uk @techczech
Outline
http://www.flickr.com/photos/adactio/3563832656
What is a corpus
Answering questions with a corpus
The language of corpus searches
The corpus and the classroom
Practice
Corpus / Corpora
????
of about
language
knowledge
http://www.flickr.com/photos/missturner/3029700617/
Prescriptivism
… how language should be used
Descriptivism
… how language is used
v
“Most of the prescriptive
rules of the language
mavens make no sense on
any level. They are bits of
folklore that originated for
screwball reasons several
hundred years ago… For as
long as they have existed,
speakers have flouted
them…”
“intellectual abdication”
“should be ashamed”
“current around 1900”
“a perversion of
grammatical education”
“blind to textual evidence
even when he himself
exhibits it”
“dishonest and stupid”
“vile little compendium
of tripe about style”
Grammarian
Geoffrey K Pullum on …
“More passives in
Orwell's pompous essay
with the warning about
how you mustn't use
them than in any
periodical you can lay
your hands on!”
This usage stuff is not straightforward and
easy. If ever someone tells you that the rules
of English grammar are simple and logical
and you should just learn them and obey
them, walk away, because you're getting
advice from a fool.
http://languagelog.ldc.upenn.edu/nll/?p=2790
Corpus
Key modern tool for finding out
about how language works…
Corpus
… is a large database of
representative language
samples …
Corpus
… 100s of millions of words
from (mostly) written language
in different genres in small
samples (~2000 words) …
Corpus
… used for linguistic research,
making dictionaries, writing
grammars, …
Corpora available for teachers
http://corpus.byu.edu
BYU corpora available
COCA (contemporary Am English)
COHA (historical Am English)
GloWbE (global web English)
Wikipedia
Google Books (BrEng/AmEng)
BNC (British National Corpus)
Hansard (British parliamentary speeches)
Spanish/Portuguese
Access to COCA and related BYU
corpora is free…
…but free registration
required for more than
~10 queries a day
Other resources derived
from BYU corpora
WordFrequency.info
WordAndPhrase.info
AcademicWords.info
Collocates.info
http://www.webcorp.org.uk
http://corpus.leeds.ac.uk
http://www.flickr.com/photos/atoach/3900591006/
Searching a corpus early on in the
process of making a generalization
can save you a lot of unpleasant
surprises later.
How do we use the word
dyslexia?
We speak more often of dyslexic
children than adults.
We speak more often of dyslexia than
any other dys- word.
Concordance
BNC:
dyslexic [n*]
COCA:
dyslexic [n*]
http://www.americancorpus.org/
http://corpus.byu.edu/bnc
COCA:
dys*
Suffixing
rules
*yed
*ied
played
stayed
portrayed
enjoyed
unemployed
surveyed
died
tried
married
worried
identified
applied
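The two hit lists above suggest a pattern: -yed follows a vowel, -ied follows a consonant. A minimal sketch of how one might check that, assuming the twelve words are simply typed in by hand from this slide rather than pulled from a corpus; the names WORDS, preceding_letter_type and group_by_ending are made up for the illustration.

```python
from collections import defaultdict

VOWELS = set("aeiou")

# Hand-copied from the slide; a real check would use the corpus hit list.
WORDS = ["played", "stayed", "portrayed", "enjoyed", "unemployed", "surveyed",
         "died", "tried", "married", "worried", "identified", "applied"]

def preceding_letter_type(word):
    """Return 'vowel' or 'consonant' for the letter before -yed / -ied."""
    if len(word) < 4 or word[-3:] not in ("yed", "ied"):
        raise ValueError(f"{word!r} does not end in -yed or -ied")
    return "vowel" if word[-4] in VOWELS else "consonant"

def group_by_ending(words):
    """Map (ending, type of preceding letter) to the words that fit it."""
    groups = defaultdict(list)
    for w in words:
        groups[(w[-3:], preceding_letter_type(w))].append(w)
    return dict(groups)

if __name__ == "__main__":
    for key, members in sorted(group_by_ending(WORDS).items()):
        print(key, members)
```

If a longer hit list produced extra groups, such as -yed after a consonant, those would be the exceptions worth checking before teaching the rule. Note that died comes from die, so it fits the spelling pattern without being a y-to-i change.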
The Corpus Magic
* Any number of characters (incl. 0)
? Any one character
[ ] Lemma (all inflectional forms of a word)
[n*] Part of speech tags (e.g. nouns)
Different corpora use slightly different codes. Read the manual.
*each    each, reach, beach, teach, outreach, …, impeach, …
teach*   teachers, teaching, …, teachable, teacher-librarians, …
t*ch     touch, teach, tech, torch, trench, twitch, …, three-inch, …
teach *  teach the, teach us, teach students, …
?each    reach, beach, teach, peach, leach, keach, …
each?    each- (1), each# (1) [i.e. nothing]
?each?   peachy, bleachy, teacha, reachs (2) [i.e. spelling error], …
t?ch     tech, tach, toch, tuch, tsch, tich
t??ch    touch, teach, torch, tisch, …
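For anyone who wants to try these wildcards on their own word list, here is a rough sketch that turns a single-word pattern into a regular expression. It is an approximation of the behaviour shown above, not the corpus site's own implementation; the function names are invented, and multi-word patterns such as "teach *", lemmas and part of speech tags are not handled.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a single-word wildcard pattern into a compiled regex.

    '*' matches any number of characters (including none) and '?' matches
    exactly one character; everything else is matched literally.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(r"\S*")
        elif ch == "?":
            parts.append(r"\S")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def search(pattern, words):
    """Return the words that match the whole pattern, in their original order."""
    rx = wildcard_to_regex(pattern)
    return [w for w in words if rx.fullmatch(w)]

if __name__ == "__main__":
    sample = ["each", "reach", "teach", "teachers", "touch", "tech", "peachy"]
    for p in ("*each", "teach*", "t?ch", "t??ch", "?each?"):
        print(p, search(p, sample))
```

Because the whole word has to match, "?each" finds reach but not each, which mirrors the difference between ? and * on the slide.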
[Lemma]
Part of speech tags
[run].[n*]
[run] [n*]
Common tags
[n*] noun
[NN2] plural nouns
[v*] verb
[VVD] verb past tense
[aj*] (BNC) / [j*](COCA) adjective
[av*] (BNC) / [r*](COCA) adverb
Help
You can also
cats and dogs    search for idioms
?each*s          combine wildcards
[=pretty]        search for synonyms
car|bike|horse   search for alternatives
used -car        exclude searches
For more details see:
Concordance + KWIC
*ies.[N*]
KWIC – Key-Word In Context
*ies.[N*]
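A KWIC display is easy to imitate on any plain text. The sketch below is only an illustration under some assumptions: it uses an ordinary regular expression in place of the corpus query, it cannot apply the .[N*] noun filter because plain text has no part of speech tags, and the function names are invented here.

```python
import re

def kwic_rows(text, pattern, width=30):
    """Return (left context, keyword, right context) for each match.

    `pattern` is an ordinary regular expression; `width` is the number of
    characters of context kept on each side.
    """
    text = " ".join(text.split())  # collapse line breaks and repeated spaces
    rows = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        rows.append((left, m.group(0), right))
    return rows

def format_kwic(rows, width=30):
    """Line the keywords up in a single column, concordance-style."""
    return [f"{left:>{width}}  {kw}  {right}" for left, kw, right in rows]

if __name__ == "__main__":
    sample = "The ladies told stories. He carries two copies of the stories."
    for line in format_kwic(kwic_rows(sample, r"\b\w*ies\b", width=20), width=20):
        print(line)
```

Reading down the keyword column is what makes it quick to spot hits that are not relevant, for example the verb carries among the nouns.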
Limit searches by genre
Other questions corpus can answer
Are there more nouns or verbs ending in -ies?
*ies.[V*] vs. *ies.[N*]
Are there four-letter verbs ending in -ed in the present
tense? ??ed.[VVB]
What are the most common adjectives describing students
vs. pupils? [j*] [student] vs. [j*] [pupil]
What do we say teachers do most often?
[teacher] [vvb]
Corpus, rules, and regularity
http://www.flickr.com/photos/51505078@N00/352492687
pre*
*ed
*ies.[V*]
Collocations
Limits on variability
See also Kennedy, p. 80-23
Collocations (cont)
[teacher] must [v*]
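The idea behind a collocation search like this one can be sketched on plain text by counting whatever word follows the search phrase. This is a simplified stand-in, assuming untagged text: it cannot restrict the next word to verbs the way [v*] does, and following_words is a hypothetical helper name.

```python
import re
from collections import Counter

def following_words(text, node_regex):
    """Count the word that immediately follows each match of node_regex."""
    pattern = re.compile(rf"\b(?:{node_regex})\s+(\w+)", re.IGNORECASE)
    return Counter(m.group(1).lower() for m in pattern.finditer(text))

if __name__ == "__main__":
    sample = ("Teachers must know their pupils. A teacher must be patient. "
              "Teachers must be fair.")
    # 'teachers?' stands in for the lemma [teacher] (singular and plural).
    for word, count in following_words(sample, r"teachers?\s+must").most_common():
        print(word, count)
```

Sorting by frequency is the important step: it shows which combinations are typical and which are one-offs.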
Idioms and set phrases
275 results
359 results
Google as a Corpus
"put the search text in quotes"
use * for the search item
training.dyslexiaaction.org.uk
Google as a Corpus
PRO: rare, low frequency usage,
up-to-date usage
CON: no sampling, no frequency
sort, no genre limit, no part
of speech tags
Google results counts are only
rough estimates…
http://searchengineland.com/why-google-cant-count-results-properly-53559
Different people searching in different geographic
locations can get different numbers
Sometimes searching for A gives fewer results
than searching for A without B
…but Google fights can be fun
WebCorp makes Google search
results linguist-friendly
Avoid Common
Corpus Errors
http://www.flickr.com/photos/andreassolberg/433734311
Be aware of limitations:
sampling, coverage, size,
presence of typos and errors,
bad part of speech tagging
Beware of low frequency
results
Beware of homographs
Check results come from
multiple sources
Check KWIC to confirm
relevance
Limit search by genre
Check examples and sources
Always check low frequency results
must [v*] [n*]
…sometimes they come from the
same source
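If the hits can be copied out together with their source, the check can be done mechanically. A small sketch, assuming each hit is recorded as a (phrase, source) pair; the sample phrases and source labels in it are invented purely for illustration and are not corpus results.

```python
from collections import defaultdict

def source_spread(hits):
    """Map each phrase to (number of hits, number of distinct sources)."""
    counts = defaultdict(int)
    sources = defaultdict(set)
    for phrase, source in hits:
        counts[phrase] += 1
        sources[phrase].add(source)
    return {p: (counts[p], len(sources[p])) for p in counts}

def suspicious(hits, min_sources=2):
    """Phrases whose hits come from fewer than min_sources different sources."""
    spread = source_spread(hits)
    return sorted(p for p, (_, n) in spread.items() if n < min_sources)

if __name__ == "__main__":
    # Invented example data: (phrase, source label).
    hits = [
        ("must take care", "text A"),
        ("must take care", "text B"),
        ("must pay attention", "text C"),
        ("must pay attention", "text C"),
        ("must pay attention", "text C"),
    ]
    print(source_spread(hits))
    print(suspicious(hits))
```

In the invented data the second phrase has more hits than the first, yet all of them come from one text, so it says little about general usage.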
False roots
http://etymonline.com
corner, silly, preface,
cockroach, protest, stable …
Make your own corpus with
TextSTAT
http://neon.niederlandistik.fu-berlin.de/en/textstat
Make your own corpus with
AntConc
http://www.antlab.sci.waseda.ac.jp/software.html
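The core of a home-made corpus is a word frequency list built from your own texts. The sketch below shows that idea in a few lines; it does not reproduce how TextSTAT or AntConc work internally, and the folder name, the UTF-8 assumption and the function names are all assumptions made for the example.

```python
import re
from collections import Counter
from pathlib import Path

def tokenize(text):
    """Split text into lower-case words, keeping internal hyphens and apostrophes."""
    return re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())

def frequency_list(texts):
    """Count every word across a list of text strings."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts

def load_folder(folder):
    """Read every .txt file in a folder (files are assumed to be UTF-8)."""
    return [p.read_text(encoding="utf-8") for p in sorted(Path(folder).glob("*.txt"))]

if __name__ == "__main__":
    # 'my_corpus' is a hypothetical folder of plain-text files.
    texts = load_folder("my_corpus")
    for word, count in frequency_list(texts).most_common(20):
        print(f"{count:6d}  {word}")
```

The dedicated tools add concordances, collocation lists and file management on top of this, so they remain the better choice for regular use.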
Corpus in the
classroom
teacher preparation
student discovery
Teacher preparation
find relevant, common examples
prepare worksheets
check for exceptions
find out answers to student
questions about rules and usage
Student discovery
show search results to students to
work out rules or word meanings
teach students how to search for
questions
ask students to give each other
puzzles for searching
For heavy classroom use…
register for
group access
to prevent
spam lock out
Corpus v dictionary
Non-classroom corpus
use
supplement dictionary
crossword puzzles
check typical usage
when writing
Where to go next?
http://www.corpora4learning.net
Thank you
Contact dlukes@dyslexiaaction.org.uk

Using online corpus for literacy teachers

Editor's Notes

  • #5 An important dichotomy (one of many) in the study of language
  • #24 Confirms hypothesis that children more than adults and boys more than girls; how about the dyslexia v. dystopia