Corpora, Blogs and Linguistic Variation (Paderborn)


This presentation was held as a guest lecture on corpus linguistics at the University of Paderborn, Germany, on 8 November 2007. I'd like to thank my colleague Anette Rosenbach for inviting me as part of her "Web as Corpus" seminar.

Published in: Economy & Finance, Technology

  1. Corpora, Blogs and Linguistic Variation: Arguments for Using Structured Web Data in Corpus Development. Cornelius Puschmann, University of Düsseldorf, [email_address]. University of Paderborn, 8 November 2007
  2. Contents of this presentation
     - What counts as evidence in linguistics?
     - System, use and the individual
     - Using the Web for corpus investigations
     - Structured vs. unstructured language data
     - A research example
  3. What counts as evidence in linguistics?
  4. Four central questions a researcher must answer
     - What is my question?
     - What data can I use?
     - What methods are at my disposal?
     - What are my findings?
     - *0) What are my preliminary assumptions?
  5. Different schools of thought ...all have different questions and assumptions! (Diagram labels: CogSci, SocioLing, Functionalism)
  6. What role does corpus data play?
     - If my aim is to figure out what we know when we know language, then corpus data is just one type of evidence among many.
     - If my aim is to describe the social function of language in a speech community, then I'll be interested in some (specific) natural language data.
     - If my aim is to systematically describe, classify and manipulate natural language data for practical reasons, it is likely to be the only thing that qualifies as data to me.
     - The relevance of corpus data can range from “somewhat interesting” to “what else is there?”, depending on my perspective.
  7. System, use and the individual
  8. A totalizing view of language production but... whose system? whose use? (Diagram labels: investigation, system, use, social function, cognitive mechanism, genetic disposition, cultural transmission)
  9. Margin vs. center: what is shared vs. what varies
     - If we're interested in variation, corpus analysis is the way to go.
     - (Diagram labels: shared, recurring & patterned language; individual & varying language)
  10. A (slightly) different way of looking at language
     - While variation in language use is recognized by linguists, system is generally believed to be an abstract and essentially universal category.
     - Alternatively, it is possible to regard system as the sum of all recurring and patterned instances of language use by a single speaker, some of which are shared with other speakers, while others are not.
     - How could similarities and differences between speakers be accurately captured and measured?
     - They could be captured with corpora that allow us to systematically compare different speakers.
  11. Using the Web for corpus investigations
  12. The ultimate corpus
     - the indexable Web has more than 11.5 billion pages (2005 study)
     - virtually all (written) languages are represented
     - both established forms of writing (fiction, official documents, personal communication) and new genres (blogs, message boards) can be found on the Web
  13. Web as Corpus (WaC)
     - WaC treats the vast amount of language data on the WWW as a corpus
     - search engines are used to query this corpus
     - they can be commercial (e.g. Google), or specialized tools for linguistic research (e.g. WebCorp, LSE)
     - but: specialized linguistic search engines are limited to post-processing (?)
     - pros: large volume of data, no data acquisition necessary, easy to use
     - cons: no knowledge of the makeup of the dataset, no control over the dataset or the query engine, no tagging, parsing or other specialized linguistic tools (with commercial engines)
  14. Broader issues with WaC
     - When we're looking for qualitative data, WaC is relatively unproblematic, but when we're comparing frequencies (i.e. taking a quantitative approach) it has serious issues.
     - The fact that Google controls the data means that both the data and how it is processed are unknown and may change at any time.
     - ...and most importantly: Google doesn't care!
  15. Web for Corpus
     - Web for Corpus (WfC) extracts data from the Web and stores it locally
     - it is closer to traditional corpus development in the sense that data is consciously added by the researcher following certain criteria (randomness, balance etc.)
     - pros: control over makeup of the dataset, precise knowledge of its size, ability to annotate, reuse, share and publish the dataset, ability to parse, tag ...
     - cons: corpus construction is technically more challenging, corpus is smaller than the Web in its entirety (though still larger than traditional corpora)
  16. Constructing a corpus using web data
     - Pick a data source (Wikipedia, Project Gutenberg, blogger.com, the entire WWW as indexed by Google, ...)
     - Retrieve the raw HTML data (spidering or crawling)
     - Process the HTML data, i.e. separate natural language from formatting instructions. For example, <h1> Dear diary </h1> I am <b> really </b> bored &amp; tired right now... <br/> becomes: Dear diary I am really bored & tired right now
     - Tag, parse and store
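The processing step on slide 16 (separating natural language from formatting instructions) can be sketched roughly as follows. This is a minimal illustration using only Python's standard library, not the tooling used for the talk; the names `TextExtractor` and `strip_markup` are hypothetical. Real blog pages would also need navigation menus, scripts and comment sections removed, which this sketch does not attempt.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes and silently discard all tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into "&".
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.chunks.append(data)


def strip_markup(html_text):
    """Return the plain text of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    # Removing tags leaves stray spaces behind; normalize them.
    return " ".join("".join(parser.chunks).split())


# The HTML fragment from slide 16.
raw = "<h1> Dear diary </h1> I am <b> really </b> bored &amp; tired right now... <br/>"
print(strip_markup(raw))
# prints: Dear diary I am really bored & tired right now...
```

Unlike the slide, the sketch keeps the trailing ellipsis, since it is part of the text rather than the markup.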
  17. Things to consider ...not a whole lot of language data!
  18. Things to consider ...better, but we need to take register into account
  19. Structured vs. unstructured language data
  20. A blog is...
     - blog (n): a website where entries are written in chronological order and displayed in reverse chronological order (“web log”).
     - blog (v): to maintain or add content to a blog.
     - Blogs are written by a variety of people (“bloggers”) with a variety of purposes. Every text in a blog can be directly linked to its author and usually has other meta-information (date, keywords etc.).
  21. A few facts
     - Technorati (blog search engine) tracks about 70 million blogs worldwide
     - 120,000 new blogs are created each day, which is about 1.4 every second
     - 1.5 million entries (posts) are published every day
     - virtually every (written) language on the planet is represented in blogs
     - blog data is well-structured in the sense that it doesn't contain visual markup
     - The markup of a blog is semantic, meaning it contains meta-information about the content that a machine can understand.
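To illustrate what machine-readable meta-information (author, date, keywords) could look like in practice, here is a small sketch that reads a blog feed. It assumes the Atom syndication format as one common example of semantic blog markup; the slides do not name a format, and the feed content, the author name and the function `read_entries` are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Atom elements live in this XML namespace.
ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical feed with a single entry (not real data).
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example blog</title>
  <entry>
    <title>Dear diary</title>
    <author><name>Jane Doe</name></author>
    <published>2007-11-08T10:00:00Z</published>
    <category term="diary"/>
    <content type="text">I am really bored &amp; tired right now...</content>
  </entry>
</feed>"""


def read_entries(feed_xml):
    """Return one dict per feed entry with its author, date, keywords and text."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.iter(ATOM + "entry"):
        entries.append({
            "author": entry.findtext(f"{ATOM}author/{ATOM}name"),
            "published": entry.findtext(ATOM + "published"),
            "keywords": [c.get("term") for c in entry.findall(ATOM + "category")],
            "text": entry.findtext(ATOM + "content"),
        })
    return entries


for post in read_entries(SAMPLE_FEED):
    print(post["author"], post["published"], post["keywords"])
    print(post["text"])
```

Because each text arrives already linked to an author and a date, posts can be grouped by speaker without any guesswork about page layout.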
  22. An example
  23. A research example
  24. Expression of futurity in English: will vs. be going to
     - will: originated as a transitive lexical verb (OE willan) with a meaning similar to German wollen; has since been grammaticalized to express futurity
     - be going to: can be combined with +NP (movement) or +Inf.V (future); movement “away from the speaker, towards a goal” (Perez)
     - “The notion of satisfying a condition forms one of the major distinctions between the two future expressions will and be going to. A sentence with will relies on a condition evident in the context that will enable the proposition to take place.” (Perez)
  25. Distribution of will vs. be going to in three blogs (chart legend: light blue = will, dark blue = going to)
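A per-blog comparison of this kind could be approximated with a simple pattern count, as in the sketch below. It is not the procedure used for the chart. The patterns are deliberately crude assumptions: they also count the noun "will" and the movement sense of "going to" (+NP), they only recognize straight apostrophes, and a tagged or parsed corpus would be needed to separate those cases. The function names and the second and third sample sentences are invented.

```python
import re

# Crude surface patterns (assumptions, see the caveats above).
WILL = re.compile(r"\bwill\b|\bwon't\b|(?<=\w)'ll\b", re.IGNORECASE)
GOING_TO = re.compile(r"\bgoing\s+to\b|\bgonna\b", re.IGNORECASE)


def count_futures(text):
    """Count candidate 'will' and 'be going to' futures in one text."""
    return {
        "will": len(WILL.findall(text)),
        "going_to": len(GOING_TO.findall(text)),
    }


def going_to_share(counts):
    """Share of 'going to' among both variants, or None if neither occurs."""
    total = counts["will"] + counts["going_to"]
    return counts["going_to"] / total if total else None


# First sentence is example (1) from slide 27; the others are invented.
sample = (
    "My car is going to get pretty crusty. "
    "I will call you tomorrow. "
    "We'll see, but it won't rain."
)
counts = count_futures(sample)
print(counts, going_to_share(counts))
```

Running `count_futures` once per blog and comparing the shares, rather than the raw counts, keeps blogs of different sizes comparable.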
  26. Personal pronoun frequency in the same blogs (ten most frequent tokens per blog: token, POS tag, count)

      Rank  THOMPSON        J.SCHWARTZ      H.HAMILTON
      01    the  DT  2638   the  DT  3077   the  DT  3802
      02    and  CC  1604   and  CC  2120   I    PP  3613
      03    to   TO  1272   to   TO  2072   to   TO  2789
      04    of   IN  1152   a    DT  1862   a    DT  2045
      05    a    DT  1034   of   IN  1486   of   IN  1795
      06    in   IN   821   in   IN  1008   and  CC  1788
      07    For  IN   680   we   PP   988   It   PP  1519
      08    on   IN   509   I    PP   818   you  PP  1205
      09    is   VBZ  465   our  PP$  723   in   IN  1096
      10    with IN   375   It   PP   642   that IN  1077
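The contrast between the three blogs becomes easier to see when the pronoun counts in each top-ten list are summed. The sketch below does this with the figures from slide 26. It assumes that the tags PP and PP$ mark personal and possessive pronouns; that reading of the tagset, and the names `TOP10` and `pronoun_tokens`, are assumptions rather than something the slides state.

```python
# Top-ten lists from slide 26 as (token, POS tag, count).
TOP10 = {
    "THOMPSON": [
        ("the", "DT", 2638), ("and", "CC", 1604), ("to", "TO", 1272),
        ("of", "IN", 1152), ("a", "DT", 1034), ("in", "IN", 821),
        ("For", "IN", 680), ("on", "IN", 509), ("is", "VBZ", 465),
        ("with", "IN", 375),
    ],
    "J.SCHWARTZ": [
        ("the", "DT", 3077), ("and", "CC", 2120), ("to", "TO", 2072),
        ("a", "DT", 1862), ("of", "IN", 1486), ("in", "IN", 1008),
        ("we", "PP", 988), ("I", "PP", 818), ("our", "PP$", 723),
        ("It", "PP", 642),
    ],
    "H.HAMILTON": [
        ("the", "DT", 3802), ("I", "PP", 3613), ("to", "TO", 2789),
        ("a", "DT", 2045), ("of", "IN", 1795), ("and", "CC", 1788),
        ("It", "PP", 1519), ("you", "PP", 1205), ("in", "IN", 1096),
        ("that", "IN", 1077),
    ],
}


def pronoun_tokens(rows):
    """Sum the counts of rows tagged PP or PP$ (assumed pronoun tags)."""
    return sum(count for _, tag, count in rows if tag.startswith("PP"))


for blog, rows in TOP10.items():
    top10_total = sum(count for _, _, count in rows)
    pronouns = pronoun_tokens(rows)
    print(f"{blog}: {pronouns} pronoun tokens, "
          f"{pronouns / top10_total:.1%} of top-ten tokens")
```

These sums cover only the ten most frequent tokens per blog, so they indicate a tendency rather than the overall pronoun frequency of each author.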
  27. Inanimate subjects with be going to
     - (1) my car is going to get pretty crusty
     - (2) the exhibit is going to be in Seattle
     - (3) the things that are left to do in this house are going to cost some money
     - (4) the work is still going to be there the next day
     - (5) my first post is going to be about troublesome managers
  28. Observations
     - The assumption that be going to is more frequent with certain subjects is confirmed by the data (1st pers. -> 2nd pers. -> human -> animate -> inanimate).
     - This appears to be the strongest conditioning factor; degree of certainty seems less significant.
     - However, where subjects are inanimate, the course of action can always be described as certain.
     - These observations could be tested against a large number of sources (blogs), factoring in individual variation in addition to other factors.
  29. Thanks for listening!
