Quantitative Individuated Corpus Linguistics

From coffee001, 2 years ago

Held at the Linguistic Colloquium, University of Osnabrueck, June 5


CC Attribution-NonCommercial License

Slideshow transcript

Slide 1: Quantitative Individuated Corpus Linguistics: A Speaker-Centric Approach to Variation
Cornelius Puschmann
Universität Osnabrück
June 5, 2007

Slide 2: Preliminaries

Slide 3: A totalizing view of language
[Diagram labels: cognitive basis, learning, production, performance, competence, investigation]
but... social function, cultural transmission
whose competence? whose performance?

Slide 4: Variation or overlap?
[Diagram: Speaker A, Speaker B and Speaker C, each with their own performance and competence]
Observations:
1. A contrastive comparison of performance should give us some insight into shared competence.
2. Speaker-level granularity is preferable to higher levels of segmentation (by gender, social class etc.).
3. Instead of generalizing from the outset, we can reach general conclusions after observing the degree of variation or overlap in language production.
So how do we do this?

Slide 5: How corpora treat language data
● any sentence is as good as any other sentence (the data is flat)
● a corpus should be a well-balanced mix of different genres, modes and sources (representativeness)
● textual and compositional coherence cannot be taken into account
● contextual information (who said what, when, where, why, how and to whom) is largely unavailable

Slide 6: Corpora and traditions of text production
● corpora largely consist of “well-established genres”
● the material they contain is produced by “language professionals” (journalists, writers, politicians)
● texts are long and stylistically distant from everyday communication in their level of formality, complexity and elaborateness
● compositional integrity (text structure) is very important but largely ignored
the text (=collection of words) takes precedence over the speaker

Slide 7: A different view of language data
language data and sources of variation... vs. speakers and their natural attributes...
language data: mode, register, dialect
speaker: age, gender, situation*

Slide 8: Blogs as data sources

Slide 9: A new kind of resource
● estimated 100 million active bloggers in 2007
● split evenly among genders
● all age groups are represented
● many bloggers provide personal information (age, gender, location)
● use web feeds (Atom and RSS formats) to syndicate blog entries in XML (ideal for building modern corpora)
● “clean” data with minimal interference
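To illustrate why XML feeds suit corpus building, here is a minimal sketch of turning an RSS feed into speaker-attributed records. It is not the method used in the talk. The sample feed, the field names in the returned dictionaries and the function name `parse_rss` are assumptions made for illustration; only the standard-library `xml.etree.ElementTree` module is used.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 snippet standing in for a real blog feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Jane's Blog</title>
    <item>
      <title>What I had for breakfast this morning</title>
      <author>Jane Smith</author>
      <pubDate>Mon, 01 Jan 2007 09:00:00 GMT</pubDate>
      <description>Toast and coffee, as usual.</description>
    </item>
    <item>
      <title>A second post</title>
      <author>Jane Smith</author>
      <pubDate>Tue, 02 Jan 2007 09:00:00 GMT</pubDate>
      <description>Nothing much happened today.</description>
    </item>
  </channel>
</rss>"""


def parse_rss(feed_xml):
    """Return one dict per <item>, keeping author and date next to the text."""
    root = ET.fromstring(feed_xml)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "title": item.findtext("title", default=""),
            "author": item.findtext("author", default=""),
            "date": item.findtext("pubDate", default=""),
            "text": item.findtext("description", default=""),
        })
    return posts


posts = parse_rss(SAMPLE_FEED)
for post in posts:
    print(post["author"], "|", post["date"], "|", post["text"])
```

Each record keeps the metadata (who wrote it, when) together with the text, which is the property that makes feed data attractive for a speaker-centric corpus. Atom feeds use different element names and XML namespaces, so they would need a separate parsing branch.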

Slide 10: Blogs as corpus data: Pros
● very large bodies of data can be automatically assembled
● data is naturally segmented by
  – speaker (+gender, +age, +location, ...)
  – length and time of writing
● often include additional meta-data
● produced by a large and growing variety of individuals using it for a wide spectrum of purposes

Slide 11: Blogs as corpus data: Cons
● only one genre (?)
● CMC as a singular mode (?)
● sampling of speakers not representative (?)

Slide 12: Granularity and natural segmentation of data in a blog-based corpus
[Diagram: a blog made up of post 1, post 2, post 3, post 4, ...; example post "What I had for breakfast this morning", posted 01/01/2007 by Jane Smith]
Modes of investigation:
1. Degree of internal variation among all posts by the same blogger
2. Variation between bloggers
3. Variation between groups (gender, age etc.)
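The first two modes of investigation can be sketched in a few lines. This is only an illustration under simple assumptions: the measured feature is the relative frequency of a single word, tokenization is plain whitespace splitting, and the bloggers and posts are invented. The talk does not specify which measures were used.

```python
from collections import defaultdict
from statistics import mean, pstdev


def relative_frequency(text, word):
    """Share of whitespace-separated tokens in `text` equal to `word`."""
    tokens = text.lower().split()
    return tokens.count(word) / len(tokens) if tokens else 0.0


def internal_variation(posts, word):
    """Mode 1: per blogger, mean and spread of the word's rate across posts."""
    rates = defaultdict(list)
    for blogger, text in posts:
        rates[blogger].append(relative_frequency(text, word))
    return {b: (mean(r), pstdev(r)) for b, r in rates.items()}


def between_variation(per_blogger):
    """Mode 2: spread of the per-blogger means."""
    return pstdev([m for m, _ in per_blogger.values()])


# Invented (blogger, post text) pairs for illustration.
posts = [
    ("jane", "I think I know"),
    ("jane", "we think so"),
    ("john", "the data is flat"),
    ("john", "the data is big"),
]

per_blogger = internal_variation(posts, "i")
spread = between_variation(per_blogger)
print(per_blogger)
print(spread)
```

The third mode (variation between groups) would follow the same pattern, with the grouping key taken from blogger metadata such as gender or age instead of the blogger's name.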

Slide 13: An example for a blog-based corpus
● self-built corpus for my research project on corporate blogging
● web feeds (RSS and Atom protocols) used to retrieve, store and analyze language data
● implemented TreeTagger for automated part-of-speech tagging
● 156 sources
● 25,769 posts
● 6.6 million words

Slide 14: Application

Slide 15: Individual variation: word class distribution Heather Hamilton (Microsoft)

Slide 16: Individual variation: word class distribution Irving Wladawsky-Berger (IBM)

Slide 17: Individual variation: pronoun use

Heather Hamilton (Microsoft)      Irving Wladawsky-Berger (IBM)
rank  token  tag  count           rank  token  tag  count
1     the    DT   2787            1     the    DT   2788
2     I      PP   2723            2     and    CC   1931
3     to     TO   2088            3     of     IN   1571
4     a      DT   1440            4     to     TO   1562
5     of     IN   1324            5     in     IN   1291
6     and    CC   1254            6     a      DT   1047
7     It     PP   1097            7     is     VBZ  695
8     you    PP   854             8     I      PP   560
9     in     IN   818             9     that   IN   439
10    that   IN   776             10    For    IN   434
11    my     PP$  757             11    It     PP   417
12    is     VBZ  739             12    with   IN   401
13    For    IN   580             13    as     IN   390
14    n't    RB   540             14    are    VBP  380
15    's     VBZ  530             15    we     PP   359
16    on     IN   498             16    on     IN   331
17    are    VBP  475             17    our    PP$  259
18    me     PP   450             18    have   VHP  253
19    with   IN   431             19    that   WDT  248
20    this   DT   424
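A ranked list of token/tag pairs like the one on this slide can be produced with a simple counter. The sketch below assumes tagger output with one token per line and tab-separated columns (token, tag, lemma), which is how TreeTagger output is commonly laid out. The sample lines and the function name `ranked_token_tags` are invented for illustration, and this is not necessarily how the original lists were computed.

```python
from collections import Counter


def ranked_token_tags(tagged_lines, top_n=20):
    """Count (token, tag) pairs and return the top_n most frequent."""
    counts = Counter()
    for line in tagged_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        token, tag = parts[0], parts[1]
        counts[(token, tag)] += 1
    return counts.most_common(top_n)


# Invented tagger output: token <TAB> tag <TAB> lemma.
sample = [
    "I\tPP\tI",
    "believe\tVVP\tbelieve",
    "I\tPP\tI",
    "the\tDT\tthe",
]

ranking = ranked_token_tags(sample)
for rank, ((token, tag), count) in enumerate(ranking, start=1):
    print(rank, token, tag, count)
```

Running this once per blogger gives one ranked list per speaker, so that the lists can be compared side by side.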

Slide 18: Individual variation: collocates preceding instances of believe
[Bar chart: collocates preceding "believe", split into "I" and "(other)", for Heather, Irving, Jonathan and the BNC; y-axis from 0 to 20]
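Counting what precedes each instance of a word is a small bigram count. The following sketch assumes a pre-tokenized text and collapses the result into "I" versus "(other)", mirroring the two categories in the chart. The example sentence is invented and the function name is hypothetical.

```python
from collections import Counter


def preceding_collocates(tokens, target="believe"):
    """Count the token immediately before each occurrence of `target`."""
    counts = Counter()
    for prev, cur in zip(tokens, tokens[1:]):
        if cur.lower() == target:
            counts[prev] += 1
    return counts


tokens = "I believe that we believe what I believe".split()
counts = preceding_collocates(tokens)

# Collapse into the two categories shown in the chart.
first_person = counts["I"]
other = sum(counts.values()) - first_person
print({"I": first_person, "(other)": other})
```

An occurrence of the target at the very start of the text has no preceding token and is therefore not counted.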

Slide 19: Gender, age and variation: Schler et al.
● article Effects of Age and Gender on Blogging (AAAI 2006)
● all blogs accessible from blogger.com one day in August 2004
● downloaded each blog that included author-provided indication of gender and at least 200 appearances of common English words
● the full corpus thus obtained included over 71,000 blogs and over 300 million tokens
● used to predict age and gender of bloggers

Slide 20: Gender, age and variation: common words males

token        male       female
linux        0.53±0.04  0.03±0.01
microsoft    0.63±0.05  0.08±0.01
gaming       0.25±0.02  0.04±0.00
server       0.76±0.05  0.13±0.01
software     0.99±0.05  0.17±0.02
gb           0.27±0.02  0.05±0.01
programming  0.36±0.02  0.08±0.01
google       0.90±0.04  0.19±0.02
data         0.62±0.03  0.14±0.01
graphics     0.27±0.02  0.06±0.01
india        0.62±0.04  0.15±0.01
nations      0.25±0.01  0.06±0.01
democracy    0.23±0.01  0.06±0.01
users        0.45±0.02  0.11±0.01
economic     0.26±0.01  0.07±0.01

Slide 21: Gender, age and variation: common words females

token      male       female
shopping   0.66±0.02  1.48±0.03
mom        2.07±0.05  4.69±0.08
cried      0.31±0.01  0.72±0.02
freaked    0.08±0.01  0.21±0.01
pink       0.33±0.02  0.85±0.03
cute       0.83±0.03  2.32±0.04
gosh       0.17±0.01  0.47±0.02
kisses     0.08±0.01  0.28±0.01
yummy      0.10±0.01  0.36±0.01
mommy      0.08±0.01  0.31±0.02
boyfriend  0.41±0.02  1.73±0.04
skirt      0.06±0.01  0.26±0.01
adorable   0.05±0.00  0.23±0.01
husband    0.28±0.01  1.38±0.04
hubby      0.01±0.00  0.30±0.02

Slide 22: Gender, age and variation: common words by age

token      teens      twenties   thirties
maths      1.05±0.06  0.03±0.00  0.02±0.01
homework   1.37±0.06  0.18±0.01  0.15±0.02
bored      3.84±0.27  1.11±0.14  0.47±0.04
sis        0.74±0.04  0.26±0.03  0.10±0.02
boring     3.69±0.10  1.02±0.04  0.63±0.05
awesome    2.92±0.08  1.28±0.04  0.57±0.04
mum        1.25±0.06  0.41±0.04  0.23±0.04
mad        2.16±0.07  0.80±0.03  0.53±0.04
dumb       0.89±0.04  0.45±0.03  0.22±0.03
semester   0.22±0.02  0.44±0.03  0.18±0.04
apartment  0.18±0.02  1.23±0.05  0.55±0.05
drunk      0.77±0.04  0.88±0.03  0.41±0.05
beer       0.32±0.02  1.15±0.05  0.70±0.05
student    0.65±0.04  0.98±0.05  0.61±0.06
album      0.64±0.05  0.84±0.06  0.56±0.08
college    1.51±0.07  1.92±0.07  1.31±0.09
someday    0.35±0.02  0.40±0.02  0.28±0.03
dating     0.31±0.02  0.52±0.03  0.37±0.04

Slide 23: Gender, age and variation: common words by age (ii)

token        teens      twenties   thirties
marriage     0.27±0.03  0.83±0.05  1.41±0.13
development  0.16±0.02  0.50±0.03  0.82±0.10
campaign     0.14±0.02  0.38±0.03  0.70±0.07
tax          0.14±0.02  0.38±0.03  0.72±0.11
local        0.38±0.02  1.18±0.04  1.85±0.10
democratic   0.13±0.02  0.29±0.02  0.59±0.05
son          0.51±0.03  0.92±0.05  2.37±0.16
systems      0.12±0.01  0.36±0.03  0.55±0.06
provide      0.15±0.01  0.54±0.03  0.69±0.05
workers      0.10±0.01  0.35±0.02  0.46±0.04
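The slides do not say how the "value ± value" figures were derived. As one possible reading, the sketch below treats each figure as a word's frequency per 10,000 tokens, averaged over the bloggers in a group, with the ± part being the standard error of that mean. Both the unit and this interpretation are assumptions, and the token lists are invented.

```python
from math import sqrt
from statistics import mean, stdev


def per_10k(tokens, word):
    """Frequency of `word` per 10,000 tokens in one blogger's text."""
    return 10000 * tokens.count(word) / len(tokens)


def mean_and_standard_error(rates):
    """Mean of per-blogger rates and its standard error (needs >= 2 rates)."""
    return mean(rates), stdev(rates) / sqrt(len(rates))


# Invented token lists, one per blogger in a hypothetical group.
group = [
    ["homework"] * 1 + ["x"] * 9999,
    ["homework"] * 2 + ["x"] * 9998,
    ["homework"] * 3 + ["x"] * 9997,
]

rates = [per_10k(tokens, "homework") for tokens in group]
avg, se = mean_and_standard_error(rates)
print(f"homework: {avg:.2f}±{se:.2f}")
```

Computing this per group (gender or age band) for each word would yield tables with the same layout as the ones above.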

Slide 24: Observations

Slide 25: How an individuated approach to corpus linguistics can benefit the field
● allows us to take into account individual stylistic preference as a source of variation when making generalizations (syntax, semantics, ...)
● allows us to observe specificities of individual production before making blanket statements about groups (based on gender, social standing etc.)
● inverts the idea of system and variation (“how much overlap is there in language use?” vs. “how much variation can our theories account for?”)

Slide 26: Research possibilities?
● “personal grammar?”, “personal semantics?”
● Construction Grammar (to what degree are constructions individual?)
● variation over the lifetime
● weighing genre, mode and individual variation
● practical applications for forensic linguistics / language profiling

Slide 27: Thank you for listening!

Slide 28: Quantitative Individuated Corpus Linguistics: A Speaker-Centric Approach to Variation
Cornelius Puschmann
Universität Osnabrück
June 5, 2007