Corpora, Blogs and Linguistic Variation (Paderborn)


This presentation was held as a guest lecture on corpus linguistics at the University of Paderborn, Germany, on 8 November 2007. I'd like to thank my colleague Anette Rosenbach for inviting me as part of her "Web as Corpus" seminar.

Published in: Economy & Finance, Technology

  1. 1. Corpora, Blogs and Linguistic Variation: Arguments for Using Structured Web Data in Corpus Development Cornelius Puschmann University of Düsseldorf [email_address] University of Paderborn 8 November 2007
  2. 2. Contents of this presentation <ul><li>What counts as evidence in linguistics? </li></ul><ul><li>System, use and the individual </li></ul><ul><li>Using the Web for corpus investigations </li></ul><ul><li>Structured vs. unstructured language data </li></ul><ul><li>A research example </li></ul>
  3. 3. What counts as evidence in linguistics?
  4. 4. Four central questions a researcher must answer <ul><li>What is my question? </li></ul><ul><li>What data can I use? </li></ul><ul><li>What methods are at my disposal? </li></ul><ul><li>What are my findings? </li></ul><ul><li>*0) What are my preliminary assumptions? </li></ul>
  5. 5. Different schools of thought (CogSci, SocioLing, Functionalism) ...all have different questions and assumptions!
  6. 6. What role does corpus data play? <ul><li>If my aim is to figure out what we know when we know language, then corpus data is just one type of evidence among many. </li></ul><ul><li>If my aim is to describe the social function of language in a speech community, then I'll be interested in some (specific) natural language data. </li></ul><ul><li>If my aim is to systematically describe, classify and manipulate natural language data for practical reasons, it is likely to be the only thing to qualify as data to me. </li></ul><ul><li>The relevance of corpus data can range from “somewhat interesting” to “what else is there?”, depending on my perspective. </li></ul>
  7. 7. System, use and the individual
  8. 8. A totalizing view of language production investigation [diagram labels: system, social function, cognitive mechanism, genetic disposition, use, cultural transmission] but... whose system? whose use?
  9. 9. Margin vs. center: what is shared vs. what varies [diagram labels: shared, recurring & patterned language; individual & varying language] If we're interested in variation, corpus analysis is the way to go.
  10. 10. A (slightly) different way of looking at language <ul><li>While variation in language use is recognized by linguists, system is generally believed to be an abstract and essentially universal category. </li></ul><ul><li>Alternatively, it is possible to regard system as the sum of all recurring and patterned instances of language use by a single speaker, some of which are shared with other speakers, while others are not. </li></ul><ul><li>How could similarities and differences between speakers be accurately captured and measured? </li></ul><ul><li>They could be captured with corpora that allow us to systematically compare different speakers. </li></ul>
  11. 11. Using the Web for corpus investigations
  12. 12. The ultimate corpus <ul><li>the indexable Web has more than 11.5 billion pages (2005 study) </li></ul><ul><li>virtually all (written) languages are represented </li></ul><ul><li>established forms of writing (fiction, official documents, personal communication) and new genres (blogs, message boards) can be found on the Web </li></ul>
  13. 13. Web as Corpus (WaC) <ul><li>WaC treats the vast amount of language data on the WWW as a corpus </li></ul><ul><li>search engines are used to query this corpus </li></ul><ul><li>they can be commercial (e.g. Google), or specialized tools for linguistic research (e.g. WebCorp, LSE) </li></ul><ul><li>but: specialized linguistic search engines limited to post-processing (?) </li></ul><ul><li>pros: large volume of data, no data acquisition necessary, easy to use </li></ul><ul><li>cons: no knowledge of the makeup of the dataset, no control over the dataset or the query engine, no tagging, parsing or other specialized linguistic tools (with commercial engines) </li></ul>
  14. 14. Broader issues with WaC <ul><li>When we're looking for qualitative data, WaC is relatively unproblematic, but when we're comparing frequencies (i.e. taking a quantitative approach) it has serious issues. </li></ul><ul><li>The fact that Google controls the data means that both the data and how it is processed are unknown and may change at any time. </li></ul><ul><li>...and most importantly: Google doesn't care! </li></ul>
  15. 15. Web for Corpus <ul><li>Web for Corpus (WfC) extracts data from the Web and stores it locally </li></ul><ul><li>it is closer to traditional corpus development in the sense that data is consciously added by the researcher following certain criteria (randomness, balance etc) </li></ul><ul><li>pros: control over makeup of the dataset, precise knowledge of its size, ability to annotate, reuse, share and publish the dataset, ability to parse, tag ... </li></ul><ul><li>cons: corpus construction is technically more challenging, corpus is smaller than the Web in its entirety (though still larger than traditional corpora) </li></ul>
  16. 16. Constructing a corpus using web data <ul><li>Pick a data source (Wikipedia, Project Gutenberg, the entire WWW as indexed by Google, ...) </li></ul><ul><li>Retrieve the raw HTML data ( spidering or crawling ) </li></ul><ul><li>Process the HTML data, i.e. separate natural language from formatting instructions. For example <h1> Dear diary </h1> I am <b> really </b> bored &amp; tired right now... <br/> becomes Dear diary I am really bored & tired right now </li></ul><ul><li>Tag, parse and store </li></ul>
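The HTML-processing step on this slide can be sketched in a few lines of Python. This is a minimal illustration using only the standard library's `html.parser`, not the tooling used for the lecture; the names `TextExtractor` and `html_to_text` are made up for this example. It drops every tag and keeps the text nodes, which is enough for the "Dear diary" sample but would also keep script or navigation text on a real page.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes and ignore all tags (deliberately simple)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before handle_data
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(raw_html):
    """Return the natural-language content of raw_html as one line of text."""
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()
    # Join the text nodes and collapse runs of whitespace into single spaces
    return " ".join("".join(parser.chunks).split())


sample = "<h1> Dear diary </h1> I am <b> really </b> bored &amp; tired right now... <br/>"
print(html_to_text(sample))
```

Unlike the slide's shortened result, this sketch keeps the trailing ellipsis, since it is part of the text rather than the markup.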
  17. 17. Things to consider ...not a whole lot of language data!
  18. 18. Things to consider ... better, but we need to take register into account
  19. 19. <ul><ul><li>Structured vs. unstructured language data </li></ul></ul>
  20. 20. A blog is... <ul><li>Blog (n) : a website where entries are written in chronological order and displayed in reverse chronological order (“web log”). </li></ul><ul><li>blog (v) : to maintain or add content to a blog. </li></ul><ul><li>Blogs are written by a variety of people (“bloggers”) with a variety of purposes. Every text in a blog can be directly linked to its author and usually has other meta-information (date, keywords etc). </li></ul>
  21. 21. A few facts <ul><li>Technorati (blog search engine) tracks about 70 million blogs worldwide </li></ul><ul><li>120,000 new blogs are created each day – that's about 1.4 every second </li></ul><ul><li>1.5 million entries (posts) are published every day </li></ul><ul><li>virtually every (written) language on the planet is represented in blogs </li></ul><ul><li>blog data is well-structured in the sense that it doesn't contain visual markup </li></ul><ul><li>The markup of a blog is semantic, meaning it contains meta-information about the content that a machine can understand. </li></ul>
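To make "semantic markup" concrete, here is a small sketch of reading author, date, keywords and text from a blog feed with Python's standard `xml.etree.ElementTree`. The feed below is invented for illustration (RSS 2.0 with a Dublin Core `dc:creator` element), and `read_posts` is a hypothetical helper; real feeds differ in which elements they provide.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed, invented for this example; not taken from the lecture.
FEED = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Example blog</title>
    <item>
      <title>Dear diary</title>
      <dc:creator>Jane Doe</dc:creator>
      <pubDate>Thu, 08 Nov 2007 10:00:00 GMT</pubDate>
      <category>diary</category>
      <description>I am really bored and tired right now...</description>
    </item>
  </channel>
</rss>"""

# Namespace prefix map for the Dublin Core author element
NS = {"dc": "http://purl.org/dc/elements/1.1/"}


def read_posts(feed_xml):
    """Return one dict per post with its text and meta-information."""
    root = ET.fromstring(feed_xml)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "author": item.findtext("dc:creator", default="", namespaces=NS),
            "date": item.findtext("pubDate", default=""),
            "keywords": [c.text for c in item.findall("category")],
            "text": item.findtext("description", default=""),
        })
    return posts


for post in read_posts(FEED):
    print(post["author"], "|", post["date"], "|", post["text"])
```

Because every post arrives with its author attached, texts can be grouped by speaker without any guesswork, which is what makes per-speaker comparison feasible.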
  22. 22. An example
  23. 23. A research example
  24. 24. Expression of futurity in English: will vs. be going to <ul><li>will originated as a transitive lexical verb (OE willan) with a meaning similar to German wollen; it has been grammaticalized to express futurity </li></ul><ul><li>be going to can be combined with +NP (movement) or +Inf.V (future); movement “away from the speaker, towards a goal” (Perez) </li></ul><ul><li>“The notion of satisfying a condition forms one of the major distinctions between the two future expressions will and be going to. A sentence with will relies on a condition evident in the context that will enable the proposition to take place.” (Perez) </li></ul>
  25. 25. Distribution of will vs. be going to in three blogs (chart legend: light blue = will, dark blue = going to)
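A count like the one charted on this slide could be approximated with regular expressions. The sketch below is an assumption about how such counts might be obtained, not the method used for the chart; `count_futures` and the two sample blogs are hypothetical (the `blog_a` sentences come from the examples on slide 27, the `blog_b` sentences are invented). The patterns are rough: the noun "will" is counted too, contracted forms such as "'ll", "won't" and "gonna" are not, and "be going to" + NP (movement) is not separated from the future use.

```python
import re

# Rough surface patterns; see the caveats in the text above.
WILL = re.compile(r"\bwill\b", re.IGNORECASE)
GOING_TO = re.compile(
    r"(?:\b(?:am|is|are|was|were|be|been)|'(?:m|s|re))"  # a form of "be"
    r"\s+(?:\w+\s+)?"                                     # optional adverb, e.g. "still"
    r"going\s+to\b",
    re.IGNORECASE,
)


def count_futures(text):
    """Count surface matches of 'will' and 'be going to' in text."""
    return {
        "will": len(WILL.findall(text)),
        "going to": len(GOING_TO.findall(text)),
    }


blogs = {
    "blog_a": "my car is going to get pretty crusty. "
              "the work is still going to be there the next day.",
    "blog_b": "I will post again tomorrow. I'm going to write about managers. "
              "It will rain.",
}

for name, text in blogs.items():
    print(name, count_futures(text))
```

For a comparison across blogs of different sizes, the raw counts would need to be normalized, for example per 1,000 tokens.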
  26. 26. Personal pronoun frequency in the same blogs (rank, token, POS tag, frequency) <ul><li>THOMPSON <ul><li>01 the DT 2638 </li><li>02 and CC 1604 </li><li>03 to TO 1272 </li><li>04 of IN 1152 </li><li>05 a DT 1034 </li><li>06 in IN 821 </li><li>07 For IN 680 </li><li>08 on IN 509 </li><li>09 is VBZ 465 </li><li>10 with IN 375 </li></ul></li><li>J.SCHWARTZ <ul><li>01 the DT 3077 </li><li>02 and CC 2120 </li><li>03 to TO 2072 </li><li>04 a DT 1862 </li><li>05 of IN 1486 </li><li>06 in IN 1008 </li><li>07 we PP 988 </li><li>08 I PP 818 </li><li>09 our PP$ 723 </li><li>10 It PP 642 </li></ul></li><li>H.HAMILTON <ul><li>01 the DT 3802 </li><li>02 I PP 3613 </li><li>03 to TO 2789 </li><li>04 a DT 2045 </li><li>05 of IN 1795 </li><li>06 and CC 1788 </li><li>07 It PP 1519 </li><li>08 you PP 1205 </li><li>09 in IN 1096 </li><li>10 that IN 1077 </li></ul></li></ul>
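Ranked lists of this kind can be produced from POS-tagged text with a simple counter. The sketch below assumes the tagging has already been done and that tokens arrive as `(word, tag)` pairs; the sample data is invented, the function names are hypothetical, and treating `PP` and `PP$` as the pronoun tags is an assumption read off the lists on this slide.

```python
from collections import Counter


def top_tokens(tagged_tokens, n=10):
    """Return the n most frequent (word, tag) pairs as (rank, word, tag, freq)."""
    counts = Counter(tagged_tokens)
    return [
        (rank, word, tag, freq)
        for rank, ((word, tag), freq) in enumerate(counts.most_common(n), start=1)
    ]


def pronoun_share(tagged_tokens, pronoun_tags=("PP", "PP$")):
    """Return the proportion of tokens whose tag marks a personal pronoun."""
    if not tagged_tokens:
        return 0.0
    hits = sum(1 for _, tag in tagged_tokens if tag in pronoun_tags)
    return hits / len(tagged_tokens)


# Invented sample; a real run would use the tagger's output for one blog.
tagged = [
    ("I", "PP"), ("am", "VBP"), ("bored", "JJ"), ("and", "CC"),
    ("I", "PP"), ("am", "VBP"), ("tired", "JJ"),
]

for rank, word, tag, freq in top_tokens(tagged, n=3):
    print(f"{rank:02d} {word} {tag} {freq}")
print(f"pronoun share: {pronoun_share(tagged):.2f}")
```

A relative measure such as the pronoun share is easier to compare across blogs than raw ranks, since the three blogs differ in size.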
  27. 27. Inanimate subjects with be going to <ul><li>(1) my car is going to get pretty crusty </li></ul><ul><li>(2) the exhibit is going to be in Seattle </li></ul><ul><li>(3) the things that are left to do in this house are going to cost some money </li></ul><ul><li>(4) the work is still going to be there the next day </li></ul><ul><li>(5) my first post is going to be about troublesome managers </li></ul>
  28. 28. Observations <ul><li>The assumption that be going to is more frequent with certain subjects is confirmed by the data (1st pers. -> 2nd pers. -> human -> animate -> inanimate). </li></ul><ul><li>This appears to be the strongest conditioning factor – degree of certainty seems less significant. </li></ul><ul><li>However, where subjects are inanimate, the course of action can always be described as certain. </li></ul><ul><li>These observations could be tested against a large number of sources (blogs), factoring in individual variation in addition to other factors. </li></ul>
  29. 29. Thanks for listening!