Quantitative Individuated Corpus Linguistics - Presentation Transcript
Quantitative Individuated Corpus Linguistics: A Speaker-Centric Approach to Variation. Cornelius Puschmann, Universität Osnabrück, 5 June 2007
Preliminaries
A totalizing view of language
competence, performance, production, investigation
but... whose competence? whose performance?
social function, learning, cognitive basis, cultural transmission
Variation or overlap?
Speaker A: competence, performance
Speaker B: competence, performance
Speaker C: competence, performance
Observations:
1. A contrastive comparison of performance should give us some insight into shared competence.
2. Speaker-level granularity is preferable to higher levels of segmentation (by gender, social class etc.).
3. Instead of generalizing from the outset, we can reach general conclusions after observing the degree of variation or overlap in language production.
So how do we do this?
How corpora treat language data
any sentence is as good as any other sentence (the data is flat)
a corpus should be a well-balanced mix of different genres, modes and sources (representativeness)
textual and compositional coherence cannot be taken into account
contextual information (who said what, when, where, why, how and to whom) is largely unavailable
Corpora and traditions of text production
corpora largely consist of “well-established genres”
the material they contain is produced by “language professionals” (journalists, writers, politicians)
texts are long and stylistically distant from everyday communication in their level of formality, complexity and elaborateness
compositional integrity (text structure) is very important but largely ignored
the text (=collection of words) takes precedence over the speaker
A different view of language data
language data and sources of variation...
... vs. speakers and their natural attributes
Blogs as data sources
A new kind of resource
estimated 100 million active bloggers in 2007
split evenly among genders
all age groups are represented
many bloggers provide personal information (age, gender, location)
use web feeds (Atom and RSS formats) to syndicate blog entries in XML (ideal for building modern corpora)
“clean” data with minimal interference
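The feed-based collection step can be sketched as follows. This is only a minimal illustration, assuming an RSS 2.0 feed with plain `item` elements; the sample feed, the `Post` record and the `parse_rss` helper are hypothetical names, not part of the talk.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical sample feed; a real feed would be fetched over HTTP.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Jane's blog</title>
    <item>
      <title>What I had for breakfast this morning</title>
      <author>Jane Smith</author>
      <pubDate>Mon, 01 Jan 2007 08:00:00 GMT</pubDate>
      <description>Toast and coffee, as usual.</description>
    </item>
    <item>
      <title>Second post</title>
      <author>Jane Smith</author>
      <pubDate>Tue, 02 Jan 2007 09:30:00 GMT</pubDate>
      <description>More text here.</description>
    </item>
  </channel>
</rss>"""


@dataclass
class Post:
    """One blog entry with the metadata the feed exposes."""
    author: str
    date: str
    title: str
    text: str


def parse_rss(xml_text: str) -> list[Post]:
    """Turn every <item> of an RSS 2.0 document into a Post record."""
    root = ET.fromstring(xml_text)
    posts = []
    for item in root.iter("item"):
        posts.append(Post(
            author=item.findtext("author", default=""),
            date=item.findtext("pubDate", default=""),
            title=item.findtext("title", default=""),
            text=item.findtext("description", default=""),
        ))
    return posts


for post in parse_rss(SAMPLE_RSS):
    print(post.author, "|", post.date, "|", post.title)
```

Each record keeps speaker and time of writing next to the text, which is the property that makes feeds attractive as corpus input. Atom feeds use different element names and XML namespaces, so they would need their own parser.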
Blogs as corpus data: Pros
very large bodies of data can be automatically assembled
data is naturally segmented by
speaker (+gender, +age, +location, ...)
length and time of writing
often include additional meta-data
produced by a large and growing variety of individuals using it for a wide spectrum of purposes
Blogs as corpus data: Cons
only one genre (?)
CMC as a singular mode (?)
sampling of speakers not representative (?)
Granularity and natural segmentation of data in a blog-based corpus
Modes of investigation:
1. Degree of internal variation among all posts by the same blogger
2. Variation between bloggers
3. Variation between groups (gender, age etc.)
[Figure: a sample blog post ("What I had for breakfast this morning", posted 01/01/2007 by Jane Smith) shown as one unit in a sequence of posts: post 1, post 2, post 3, post 4, ...]
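The three modes of investigation can be sketched on toy data. This is a rough illustration, not the method used in the talk: the posts are invented, the feature (relative frequency of the token "i") is an arbitrary choice, and the population standard deviation is just one possible measure of spread.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical toy data: (blogger, group label, post text).
POSTS = [
    ("jane", "f", "I think I like my tea"),
    ("jane", "f", "the tea was cold and I left"),
    ("john", "m", "the server and the software failed"),
    ("john", "m", "I fixed the server"),
]


def feature_rate(text: str, feature: str = "i") -> float:
    """Relative frequency of one token in a post (case-insensitive)."""
    tokens = text.lower().split()
    return tokens.count(feature) / len(tokens) if tokens else 0.0


def rates_by(key_index: int) -> dict[str, list[float]]:
    """Group per-post feature rates by blogger (0) or by group label (1)."""
    grouped = defaultdict(list)
    for row in POSTS:
        grouped[row[key_index]].append(feature_rate(row[2]))
    return dict(grouped)


per_blogger = rates_by(0)

# Mode 1: internal variation among all posts by the same blogger.
internal = {blogger: pstdev(rates) for blogger, rates in per_blogger.items()}

# Mode 2: variation between bloggers (spread of the per-blogger means).
between_bloggers = pstdev([mean(rates) for rates in per_blogger.values()])

# Mode 3: variation between groups (here: mean rate per group label).
per_group = {group: mean(rates) for group, rates in rates_by(1).items()}

print(internal, between_bloggers, per_group)
```

Because every post carries its author, all three comparisons run over the same data without re-segmenting the corpus.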
An example of a blog-based corpus
self-built corpus for my research project on corporate blogging
web feeds (RSS and Atom formats) used to retrieve, store and analyze language data
used TreeTagger for automated part-of-speech tagging
156 sources
25,769 posts
6.6 million words
Application
Individual variation: word class distribution
Heather Hamilton (Microsoft)
Individual variation: word class distribution
Irving Wladawsky-Berger (IBM)
Individual variation: pronoun use
Heather Hamilton (Microsoft)
rank token tag frequency
1 the DT 2787
2 I PP 2723
3 to TO 2088
4 a DT 1440
5 of IN 1324
6 and CC 1254
7 It PP 1097
8 you PP 854
9 in IN 818
10 that IN 776
11 my PP$ 757
12 is VBZ 739
13 For IN 580
14 n't RB 540
15 's VBZ 530
16 on IN 498
17 are VBP 475
18 me PP 450
19 with IN 431
20 this DT 424
Irving Wladawsky-Berger (IBM)
rank token tag frequency
1 the DT 2788
2 and CC 1931
3 of IN 1571
4 to TO 1562
5 in IN 1291
6 a DT 1047
7 is VBZ 695
8 I PP 560
9 that IN 439
10 For IN 434
11 It PP 417
12 with IN 401
13 as IN 390
14 are VBP 380
15 we PP 359
16 on IN 331
17 our PP$ 259
18 have VHP 253
19 that WDT 248
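Ranked lists like the two above, and the word class distributions shown earlier, can be derived from tagger output along these lines. This is a sketch under assumptions: it reads TreeTagger's tab-separated output (token, tag, lemma), the sample lines are invented, and `read_tagged`, `top_pairs` and `tag_distribution` are hypothetical helper names.

```python
from collections import Counter

# Hypothetical excerpt in TreeTagger's output format: token<TAB>tag<TAB>lemma.
TAGGED_SAMPLE = "\n".join([
    "I\tPP\tI",
    "read\tVVP\tread",
    "the\tDT\tthe",
    "blog\tNN\tblog",
    "and\tCC\tand",
    "I\tPP\tI",
    "like\tVVP\tlike",
    "the\tDT\tthe",
    "posts\tNNS\tpost",
])


def read_tagged(text: str) -> list[tuple[str, str]]:
    """Extract (token, tag) pairs, skipping lines without a tag column."""
    pairs = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) >= 2:
            pairs.append((parts[0], parts[1]))
    return pairs


def top_pairs(pairs: list[tuple[str, str]], n: int = 20):
    """The n most frequent (token, tag) pairs with their counts."""
    return Counter(pairs).most_common(n)


def tag_distribution(pairs: list[tuple[str, str]]) -> dict[str, float]:
    """Share of each part-of-speech tag among all tagged tokens."""
    tag_counts = Counter(tag for _, tag in pairs)
    total = sum(tag_counts.values())
    return {tag: count / total for tag, count in tag_counts.items()}


pairs = read_tagged(TAGGED_SAMPLE)
for rank, ((token, tag), count) in enumerate(top_pairs(pairs, 3), start=1):
    print(rank, token, tag, count)
print(tag_distribution(pairs))
```

Counting is case-sensitive here, so "I" and "i" would be separate entries. Running the same functions once per blogger gives lists that can be compared side by side.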
Individual variation: collocates preceding instances of believe
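Counting the words that directly precede a node word such as believe can be sketched as below. The sample sentence, the `preceding_collocates` helper and the one-word window are illustrative assumptions, not the procedure behind the slide.

```python
from collections import Counter


def preceding_collocates(tokens: list[str], node: str = "believe",
                         window: int = 1) -> Counter:
    """Count tokens within `window` positions to the left of each node hit."""
    lowered = [token.lower() for token in tokens]
    counts = Counter()
    for i, token in enumerate(lowered):
        if token == node:
            counts.update(lowered[max(0, i - window):i])
    return counts


# Hypothetical sample text, split on whitespace.
sample = "I believe that we believe what I believe".split()
print(preceding_collocates(sample).most_common())
```

Comparing these counts between bloggers shows whether a speaker favours, for example, "I believe" over "we believe".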
Gender, age and variation: Schler et al.
article: Effects of Age and Gender on Blogging (AAAI 2006)
all blogs accessible from blogger.com on one day in August 2004
downloaded each blog that included an author-provided indication of gender and at least 200 appearances of common English words
the full corpus thus obtained included over 71,000 blogs and over 300 million tokens
used to predict age and gender of bloggers
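The selection criterion described above (an author-provided gender plus at least 200 occurrences of common English words) can be sketched as a filter. The word list, the `Blog` record and the helper names are hypothetical; the actual list of common words used by Schler et al. is not given in these slides.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for a list of common English words.
COMMON_WORDS = {"the", "of", "and", "to", "a", "in", "is", "that"}


@dataclass
class Blog:
    author: str
    gender: Optional[str]  # None if the author gave no indication
    text: str


def count_common(text: str) -> int:
    """Number of tokens in the text that appear in COMMON_WORDS."""
    return sum(1 for token in text.lower().split() if token in COMMON_WORDS)


def keep(blog: Blog, min_common: int = 200) -> bool:
    """Apply both inclusion criteria: gender given and enough common words."""
    return blog.gender is not None and count_common(blog.text) >= min_common


# Hypothetical candidates: only the first satisfies both criteria.
candidates = [
    Blog("a", "f", "the " * 200),
    Blog("b", None, "the " * 500),
    Blog("c", "m", "the " * 199),
]
selected = [blog.author for blog in candidates if keep(blog)]
print(selected)
```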
Gender, age and variation: common words males
token male female
linux 0.53±0.04 0.03±0.01
microsoft 0.63±0.05 0.08±0.01
gaming 0.25±0.02 0.04±0.00
server 0.76±0.05 0.13±0.01
software 0.99±0.05 0.17±0.02
gb 0.27±0.02 0.05±0.01
programming 0.36±0.02 0.08±0.01
google 0.90±0.04 0.19±0.02
data 0.62±0.03 0.14±0.01
graphics 0.27±0.02 0.06±0.01
india 0.62±0.04 0.15±0.01
nations 0.25±0.01 0.06±0.01
democracy 0.23±0.01 0.06±0.01
users 0.45±0.02 0.11±0.01
economic 0.26±0.01 0.07±0.01
Gender, age and variation: common words females
token male female
shopping 0.66±0.02 1.48±0.03
mom 2.07±0.05 4.69±0.08
cried 0.31±0.01 0.72±0.02
freaked 0.08±0.01 0.21±0.01
pink 0.33±0.02 0.85±0.03
cute 0.83±0.03 2.32±0.04
gosh 0.17±0.01 0.47±0.02
kisses 0.08±0.01 0.28±0.01
yummy 0.10±0.01 0.36±0.01
mommy 0.08±0.01 0.31±0.02
boyfriend 0.41±0.02 1.73±0.04
skirt 0.06±0.01 0.26±0.01
adorable 0.05±0.00 0.23±0.01
husband 0.28±0.01 1.38±0.04
hubby 0.01±0.00 0.30±0.02
Gender, age and variation: common words by age
token teens twenties thirties
maths 1.05±0.06 0.03±0.00 0.02±0.01
homework 1.37±0.06 0.18±0.01 0.15±0.02
bored 3.84±0.27 1.11±0.14 0.47±0.04
sis 0.74±0.04 0.26±0.03 0.10±0.02
boring 3.69±0.10 1.02±0.04 0.63±0.05
awesome 2.92±0.08 1.28±0.04 0.57±0.04
mum 1.25±0.06 0.41±0.04 0.23±0.04
mad 2.16±0.07 0.80±0.03 0.53±0.04
dumb 0.89±0.04 0.45±0.03 0.22±0.03
semester 0.22±0.02 0.44±0.03 0.18±0.04
apartment 0.18±0.02 1.23±0.05 0.55±0.05
drunk 0.77±0.04 0.88±0.03 0.41±0.05
beer 0.32±0.02 1.15±0.05 0.70±0.05
student 0.65±0.04 0.98±0.05 0.61±0.06
album 0.64±0.05 0.84±0.06 0.56±0.08
college 1.51±0.07 1.92±0.07 1.31±0.09
someday 0.35±0.02 0.40±0.02 0.28±0.03
dating 0.31±0.02 0.52±0.03 0.37±0.04
Gender, age and variation: common words by age (ii)
token teens twenties thirties
marriage 0.27±0.03 0.83±0.05 1.41±0.13
development 0.16±0.02 0.50±0.03 0.82±0.10
campaign 0.14±0.02 0.38±0.03 0.70±0.07
tax 0.14±0.02 0.38±0.03 0.72±0.11
local 0.38±0.02 1.18±0.04 1.85±0.10
democratic 0.13±0.02 0.29±0.02 0.59±0.05
son 0.51±0.03 0.92±0.05 2.37±0.16
systems 0.12±0.01 0.36±0.03 0.55±0.06
provide 0.15±0.01 0.54±0.03 0.69±0.05
workers 0.10±0.01 0.35±0.02 0.46±0.04
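Figures of the form "value ± error" like those in the tables above could be produced along the following lines. This is a sketch resting on assumptions the slides do not state: that each value is a mean per-blog frequency per 10,000 words and that the ± term is the standard error of that mean. The toy blogs and helper names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def rate_per_10k(tokens: list[str], word: str) -> float:
    """Occurrences of `word` per 10,000 tokens in one blog."""
    return 10_000 * tokens.count(word) / len(tokens)


def mean_and_se(rates: list[float]) -> tuple[float, float]:
    """Mean rate over blogs and the standard error of that mean."""
    return mean(rates), stdev(rates) / sqrt(len(rates))


# Hypothetical toy group: two blogs of 100 tokens each.
group = [
    ["linux"] * 1 + ["x"] * 99,
    ["linux"] * 3 + ["x"] * 97,
]
rates = [rate_per_10k(tokens, "linux") for tokens in group]
avg, se = mean_and_se(rates)
print(f"linux {avg:.2f}±{se:.2f}")
```

Computing the same pair of numbers per gender or age group and placing them in adjacent columns yields a table in the layout shown above.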
Observations
How an individuated approach to corpus linguistics can benefit the field
allows us to take into account individual stylistic preference as a source of variation when making generalizations (syntax, semantics, ...)
allows us to observe specificities of individual production before making blanket statements about groups (based on gender, social standing etc.)
inverts the idea of system and variation (“how much overlap is there in language use?” vs. “how much variation can our theories account for?”)
Research possibilities?
“personal grammar?”, “personal semantics?”
Construction Grammar (to what degree are constructions individual?)
variation over the lifetime
weighing genre, mode and individual variation
practical applications for forensic linguistics / language profiling
Thank you for listening!