Quantitative Individuated Corpus Linguistics
 

Held at the Linguistic Colloquium, University of Osnabrueck, June 5, 2007.

Usage rights: CC Attribution License

    Quantitative Individuated Corpus Linguistics: Presentation Transcript

    • Quantitative Individuated Corpus Linguistics: A Speaker-Centric Approach to Variation. Cornelius Puschmann, Universität Osnabrück, 5 June 2007
    • Preliminaries
    • A totalizing view of language
      • competence, performance, production, investigation
      • but... whose competence? whose performance?
      • social function, learning, cognitive basis, cultural transmission
    • Variation or overlap?
      • Speaker A: competence, performance
      • Speaker B: competence, performance
      • Speaker C: competence, performance
      • Observations:
        • 1. A contrastive comparison of performance should give us some insight into shared competence.
        • 2. Speaker-level granularity is preferable to higher levels of segmentation (by gender, social class etc.).
        • 3. Instead of generalizing from the outset, we can reach general conclusions after observing the degree of variation or overlap in language production.
      • So how do we do this?
    • How corpora treat language data
      • any sentence is as good as any other sentence (the data is flat)
      • a corpus should be a well-balanced mix of different genres, modes and sources (representativeness)
      • textual and compositional coherence cannot be taken into account
      • contextual information (who said what, when, where, why, how and to whom) is largely unavailable
    • Corpora and traditions of text production
      • corpora largely consist of “well-established genres”
      • the material they contain is produced by “language professionals” (journalists, writers, politicians)
      • texts are long and stylistically distant from everyday communication in their level of formality, complexity and elaborateness
      • compositional integrity (text structure) is very important but largely ignored
      • the text (= collection of words) takes precedence over the speaker
    • A different view of language data
      • language data and sources of variation...
      • ... vs. speakers and their natural attributes
    • Blogs as data sources
    • A new kind of resource
      • estimated 100 million active bloggers in 2007
      • split evenly among genders
      • all age groups are represented
      • many bloggers provide personal information (age, gender, location)
      • use web feeds (Atom and RSS formats) to syndicate blog entries in XML (ideal for building modern corpora)
      • “clean” data with minimal interference
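
To illustrate why XML web feeds suit corpus building, here is a minimal sketch that turns an Atom feed into per-post records with author and date attached. The sample feed, the `parse_atom` helper and the record fields are assumptions made for illustration; the slides do not show how the feeds were actually processed.

```python
import xml.etree.ElementTree as ET

# Namespace used by Atom 1.0 feeds.
ATOM = "{http://www.w3.org/2005/Atom}"

# Invented sample feed; a real corpus would download feeds from blog URLs.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example blog</title>
  <entry>
    <title>What I had for breakfast this morning</title>
    <author><name>Jane Smith</name></author>
    <published>2007-01-01T08:00:00Z</published>
    <content type="text">Toast and coffee, as usual.</content>
  </entry>
  <entry>
    <title>Second post</title>
    <author><name>Jane Smith</name></author>
    <published>2007-01-02T09:30:00Z</published>
    <content type="text">Nothing much happened today.</content>
  </entry>
</feed>"""


def parse_atom(xml_text):
    """Return one dict per feed entry, keeping speaker and time metadata."""
    root = ET.fromstring(xml_text)
    posts = []
    for entry in root.iter(ATOM + "entry"):
        posts.append({
            "author": entry.findtext(f"{ATOM}author/{ATOM}name", default=""),
            "published": entry.findtext(ATOM + "published", default=""),
            "title": entry.findtext(ATOM + "title", default=""),
            "text": entry.findtext(ATOM + "content", default=""),
        })
    return posts


posts = parse_atom(SAMPLE_FEED)
for post in posts:
    print(post["author"], post["published"], len(post["text"].split()), "words")
```

Because every entry already carries its author and timestamp, the segmentation by speaker and time comes for free. RSS feeds use different element names and would need a second, analogous parser.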
    • Blogs as corpus data: Pros
      • very large bodies of data can be automatically assembled
      • data is naturally segmented by
        • speaker (+gender, +age, +location, ...)
        • length and time of writing
      • often include additional meta-data
      • produced by a large and growing variety of individuals using it for a wide spectrum of purposes
    • Blogs as corpus data: Cons
      • only one genre (?)
      • CMC as a singular mode (?)
      • sampling of speakers not representative (?)
    • Granularity and natural segmentation of data in a blog-based corpus
      • Modes of investigation:
        • 1. Degree of internal variation among all posts by the same blogger
        • 2. Variation between bloggers
        • 3. Variation between groups (gender, age etc.)
      • Example post: “What I had for breakfast this morning” (placeholder body text), posted 01/01/2007 by Jane Smith
      • post 1, post 2, post 3, post 4, ...
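
The three modes of investigation can be sketched on toy data. Assume each post has already been reduced to a single number, for example the share of its tokens that are first-person pronouns. The blogger names, group labels, values and the choice of standard deviation as the measure of variation are all invented for illustration.

```python
from statistics import mean, pstdev

# Hypothetical data: one feature value per post, grouped by blogger.
corpus = {
    "blogger_a": {"group": "f", "posts": [0.08, 0.10, 0.09]},
    "blogger_b": {"group": "m", "posts": [0.02, 0.03, 0.01]},
    "blogger_c": {"group": "m", "posts": [0.05, 0.04, 0.06]},
}


def within_blogger(corpus):
    """Mode 1: spread of the feature across each blogger's own posts."""
    return {name: pstdev(d["posts"]) for name, d in corpus.items()}


def between_bloggers(corpus):
    """Mode 2: spread of the per-blogger means."""
    return pstdev([mean(d["posts"]) for d in corpus.values()])


def group_means(corpus):
    """Mode 3: mean of the per-blogger means within each group."""
    by_group = {}
    for d in corpus.values():
        by_group.setdefault(d["group"], []).append(mean(d["posts"]))
    return {group: mean(values) for group, values in by_group.items()}


print(within_blogger(corpus))
print(between_bloggers(corpus))
print(group_means(corpus))
```

In this toy data the spread between bloggers is larger than the spread within any one blogger. That is the kind of pattern a speaker-level corpus makes visible before any grouping by gender or age is applied.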
    • An example for a blog-based corpus
      • self-built corpus for my research project on corporate blogging
      • web feeds (RSS and Atom formats) used to retrieve, store and analyze language data
      • implemented TreeTagger for automated part-of-speech tagging
      • 156 sources
      • 25,769 posts
      • 6.6 million words
    • Application
    • Individual variation: word class distribution
        • Heather Hamilton (Microsoft)
    • Individual variation: word class distribution
        • Irving Wladawsky-Berger (IBM)
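
A word class distribution of this kind can be derived from tagger output. The sketch below assumes TreeTagger's usual three-column output (token, tag, lemma, separated by tabs) and collapses tags into coarse classes by prefix. The sample sentences and the prefix mapping are illustrative assumptions, not the categories used in the original charts.

```python
from collections import Counter

# Hypothetical mapping from tag prefixes to coarse word classes.
COARSE = [
    ("PP", "pronoun"),     # PP, PP$
    ("V", "verb"),         # VV*, VB*, VH*
    ("N", "noun"),         # NN, NNS, NP, NPS
    ("JJ", "adjective"),
    ("DT", "determiner"),
]

# Invented sample in token<TAB>tag<TAB>lemma format.
SAMPLE = "\n".join([
    "I\tPP\tI",
    "believe\tVVP\tbelieve",
    "the\tDT\tthe",
    "team\tNN\tteam",
    "is\tVBZ\tbe",
    "great\tJJ\tgreat",
    ".\tSENT\t.",
    "We\tPP\twe",
    "have\tVHP\thave",
    "ideas\tNNS\tidea",
    ".\tSENT\t.",
])


def coarse_class(tag):
    """Map a fine-grained tag to a coarse class, or 'other' if unmapped."""
    for prefix, label in COARSE:
        if tag.startswith(prefix):
            return label
    return "other"


def word_class_distribution(tagged_text):
    """Return the share of tokens falling into each coarse word class."""
    counts = Counter()
    for line in tagged_text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip malformed or empty lines
        counts[coarse_class(parts[1])] += 1
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


dist = word_class_distribution(SAMPLE)
for label, share in sorted(dist.items()):
    print(f"{label:12s} {share:.2f}")
```

Running the same function over all posts by one blogger, and then over another's, gives two distributions that can be compared side by side.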
    • Individual variation: pronoun use
      • Heather Hamilton (Microsoft)
      • rank token tag count
      • 1 the DT 2787
      • 2 I PP 2723
      • 3 to TO 2088
      • 4 a DT 1440
      • 5 of IN 1324
      • 6 and CC 1254
      • 7 It PP 1097
      • 8 you PP 854
      • 9 in IN 818
      • 10 that IN 776
      • 11 my PP$ 757
      • 12 is VBZ 739
      • 13 For IN 580
      • 14 n't RB 540
      • 15 's VBZ 530
      • 16 on IN 498
      • 17 are VBP 475
      • 18 me PP 450
      • 19 with IN 431
      • 20 this DT 424
      • Irving Wladawsky-Berger (IBM)
      • rank token tag count
      • 1 the DT 2788
      • 2 and CC 1931
      • 3 of IN 1571
      • 4 to TO 1562
      • 5 in IN 1291
      • 6 a DT 1047
      • 7 is VBZ 695
      • 8 I PP 560
      • 9 that IN 439
      • 10 For IN 434
      • 11 It PP 417
      • 12 with IN 401
      • 13 as IN 390
      • 14 are VBP 380
      • 15 we PP 359
      • 16 on IN 331
      • 17 our PP$ 259
      • 18 have VHP 253
      • 19 that WDT 248
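
The two lists give raw counts without corpus totals, so the counts cannot be turned into rates directly. One rough workaround is to normalize against "the", whose count is nearly identical for both bloggers (2787 vs. 2788). The sketch below uses only counts taken from the lists above. Normalizing by "the" is an illustrative assumption, not the method used in the talk.

```python
# Counts copied from the two frequency lists above.
counts = {
    "Heather Hamilton": {"the": 2787, "I": 2723},
    "Irving Wladawsky-Berger": {"the": 2788, "I": 560},
}


def per_hundred(counts, token, baseline="the"):
    """Occurrences of `token` per 100 occurrences of `baseline`, per speaker."""
    return {
        speaker: 100 * c[token] / c[baseline]
        for speaker, c in counts.items()
    }


ratios = per_hundred(counts, "I")
for speaker, value in ratios.items():
    print(f"{speaker}: {value:.1f} 'I' per 100 'the'")
```

On this measure the first blogger uses "I" almost as often as "the", the second about a fifth as often. The comparison only holds if "the" behaves as a stable baseline, which would need checking against real token totals.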
    • Individual variation: collocates preceding instances of believe
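
Counting the words that immediately precede a target word can be sketched as follows. The tokenizer, the sample text and the `preceding_collocates` helper are simplified assumptions for illustration; the original slide showed the results for individual bloggers as a figure.

```python
import re
from collections import Counter


def preceding_collocates(text, target="believe", window=1):
    """Count the `window` tokens directly before each instance of `target`.

    Tokenization is deliberately crude (lowercased letter runs), and
    sentence boundaries are ignored.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == target and i >= window:
            counts[" ".join(tokens[i - window:i])] += 1
    return counts


# Invented sample text standing in for one blogger's posts.
sample = (
    "I believe this works. We believe in open standards. "
    "I really believe it. I believe so."
)

collocates = preceding_collocates(sample)
print(collocates.most_common())
```

Running this per blogger rather than over the pooled corpus is what makes the comparison speaker-centric: one author may favor "I believe", another "we believe".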
    • Gender, age and variation: Schler et al.
      • article: “Effects of Age and Gender on Blogging” (AAAI 2006)
      • all blogs accessible from blogger.com on one day in August 2004
      • downloaded each blog that included author-provided indication of gender and at least 200 appearances of common English words
      • the full corpus thus obtained included over 71,000 blogs and over 300 million tokens
      • used to predict age and gender of bloggers
    • Gender, age and variation: common words males
      • token male female
      • linux 0.53±0.04 0.03±0.01
      • microsoft 0.63±0.05 0.08±0.01
      • gaming 0.25±0.02 0.04±0.00
      • server 0.76±0.05 0.13±0.01
      • software 0.99±0.05 0.17±0.02
      • gb 0.27±0.02 0.05±0.01
      • programming 0.36±0.02 0.08±0.01
      • google 0.90±0.04 0.19±0.02
      • data 0.62±0.03 0.14±0.01
      • graphics 0.27±0.02 0.06±0.01
      • india 0.62±0.04 0.15±0.01
      • nations 0.25±0.01 0.06±0.01
      • democracy 0.23±0.01 0.06±0.01
      • users 0.45±0.02 0.11±0.01
      • economic 0.26±0.01 0.07±0.01
    • Gender, age and variation: common words females
      • token male female
      • shopping 0.66±0.02 1.48±0.03
      • mom 2.07±0.05 4.69±0.08
      • cried 0.31±0.01 0.72±0.02
      • freaked 0.08±0.01 0.21±0.01
      • pink 0.33±0.02 0.85±0.03
      • cute 0.83±0.03 2.32±0.04
      • gosh 0.17±0.01 0.47±0.02
      • kisses 0.08±0.01 0.28±0.01
      • yummy 0.10±0.01 0.36±0.01
      • mommy 0.08±0.01 0.31±0.02
      • boyfriend 0.41±0.02 1.73±0.04
      • skirt 0.06±0.01 0.26±0.01
      • adorable 0.05±0.00 0.23±0.01
      • husband 0.28±0.01 1.38±0.04
      • hubby 0.01±0.00 0.30±0.02
    • Gender, age and variation: common words by age
      • token teens twenties thirties
      • maths 1.05±0.06 0.03±0.00 0.02±0.01
      • homework 1.37±0.06 0.18±0.01 0.15±0.02
      • bored 3.84±0.27 1.11±0.14 0.47±0.04
      • sis 0.74±0.04 0.26±0.03 0.10±0.02
      • boring 3.69±0.10 1.02±0.04 0.63±0.05
      • awesome 2.92±0.08 1.28±0.04 0.57±0.04
      • mum 1.25±0.06 0.41±0.04 0.23±0.04
      • mad 2.16±0.07 0.80±0.03 0.53±0.04
      • dumb 0.89±0.04 0.45±0.03 0.22±0.03
      • semester 0.22±0.02 0.44±0.03 0.18±0.04
      • apartment 0.18±0.02 1.23±0.05 0.55±0.05
      • drunk 0.77±0.04 0.88±0.03 0.41±0.05
      • beer 0.32±0.02 1.15±0.05 0.70±0.05
      • student 0.65±0.04 0.98±0.05 0.61±0.06
      • album 0.64±0.05 0.84±0.06 0.56±0.08
      • college 1.51±0.07 1.92±0.07 1.31±0.09
      • someday 0.35±0.02 0.40±0.02 0.28±0.03
      • dating 0.31±0.02 0.52±0.03 0.37±0.04
    • Gender, age and variation: common words by age (ii)
      • token teens twenties thirties
      • marriage 0.27±0.03 0.83±0.05 1.41±0.13
      • development 0.16±0.02 0.50±0.03 0.82±0.10
      • campaign 0.14±0.02 0.38±0.03 0.70±0.07
      • tax 0.14±0.02 0.38±0.03 0.72±0.11
      • local 0.38±0.02 1.18±0.04 1.85±0.10
      • democratic 0.13±0.02 0.29±0.02 0.59±0.05
      • son 0.51±0.03 0.92±0.05 2.37±0.16
      • systems 0.12±0.01 0.36±0.03 0.55±0.06
      • provide 0.15±0.01 0.54±0.03 0.69±0.05
      • workers 0.10±0.01 0.35±0.02 0.46±0.04
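
The slides do not state the unit of these figures or what the ± values represent. A minimal sketch under one plausible reading (a normalized frequency per 10,000 tokens, averaged over blogs, with the standard error of that mean) is shown below. The per-blog counts and both helper functions are invented for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def per_10k(count, total_tokens):
    """Occurrences per 10,000 tokens (assumed unit, not stated in the slides)."""
    return 10_000 * count / total_tokens


def mean_and_se(rates):
    """Mean of per-blog rates and the standard error of that mean."""
    return mean(rates), stdev(rates) / sqrt(len(rates))


# Hypothetical data: (occurrences of one token, total tokens) for each blog.
blogs_by_group = {
    "teens": [(4, 20_000), (6, 30_000), (1, 10_000)],
    "thirties": [(1, 20_000), (0, 10_000), (2, 40_000)],
}

summary = {}
for group, blogs in blogs_by_group.items():
    rates = [per_10k(count, total) for count, total in blogs]
    summary[group] = mean_and_se(rates)

for group, (m, se) in summary.items():
    print(f"{group}: {m:.2f}±{se:.2f}")
```

Averaging per blog rather than pooling all tokens keeps a single prolific blogger from dominating a group's figure, which fits the speaker-level view argued for here.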
    • Observations
    • How an individuated approach to corpus linguistics can benefit the field
      • allows us to take into account individual stylistic preference as a source of variation when making generalizations (syntax, semantics, ...)
      • allows us to observe the specifics of individual production before making blanket statements about groups (based on gender, social standing etc.)
      • inverts the idea of system and variation (“how much overlap is there in language use?” vs. “how much variation can our theories account for?”)
    • Research possibilities?
      • “personal grammar?”, “personal semantics?”
      • Construction Grammar (to what degree are constructions individual?)
      • variation over the lifetime
      • weighing genre, mode and individual variation
      • practical applications for forensic linguistics / language profiling
    • Thank you for listening!