Uppsala uni 4march2011


  • KRYS I is PDF; 7-webgenre collection

    1. 1. Marina Santini Artificial Solutions, KYH Agile Web Development Stockholm Uppsala University Department of Linguistics and Philology, Seminar Series Fri 4 March 2011
    2. 2. Genres on the Web GoWeb
    3. 3. Outline  What is genre? What is web genre?  What is the difference between genre and web genre?  Why is (web) genre important?  Automatic web genre identification  The very beginning: Biber and Karlgren & Cutting  Sharoff  Kim & Ross  Santini  Stein et al.  Web genre identification by humans  Karlgren  Rosso & Haas  Crowston et al.  Future directions
    4. 4. What is genre? The beginning…  Aristotle (4th century BC): drama, lyric, epic  Drama: tragedy, comedy, satyr play  Literary theory and literary genres  Library classification  Library classification is also used in online bookshops (e.g. Amazon)  Music genres (jazz, rock, etc.), film genres (thriller, drama, western, etc.)
    5. 5. More recently…  Genre in academic contexts, in workplace and professional contexts, in public contexts, in pedagogy (teaching writing), etc. (research articles, essays, emails, memos, etc.)
    6. 6. Recent Genre Definitions: 2008-2010
    7. 7. Genre & Corpus Linguistics  Surprisingly, no explicit definition of what genre is…  Brown corpus (1961): 15 genres  Stockholm-Umeå Corpus (SUC) (1990s)  British National Corpus (1990s)  etc.
    8. 8. David Lee and the BNC Jungle
    9. 9. Why is genre important?  It is a context carrier: being based on recurrent conventions and predictable expectations, genre provides the communicative context and the communicative purpose for which a text has been produced. Think of what happens in your mind when you come across a specific genre, e.g. FAQs, reviews, interviews, academic papers, reportages…
    10. 10. Benefits (I) Being a context carrier…  Complexity reduction: a text receives its identity through belonging to a certain genre;  Predictivity: genre reduces information overload;  Findability: genre helps find web documents “relevant” to our information needs.
    11. 11. Benefits (II)  Genre competence increases information understanding:  genre competence increases self-protection against digital crimes (phishing, hoaxes, cyberbullying) because it can help us spot genre anomalies and, consequently, malicious intentions;  Genre competence helps implement democracy:  some educational programs (e.g. in Australia) teach genre from primary school onwards, because those who leave school after the primary level without genre competence become socially disadvantaged in the structure of power.
    12. 12. What is webgenre?  All types of genres that are on the web…  Paper genres that have been uploaded in any format + genres that do not have any counterpart in the paper world:  e.g. home page, About Us, FAQs, webzine, personal blog, corporate weblogs…
    13. 13. How is webgenre different from paper genre?  On the web, there are new communicative settings and new communicative contexts, so new genres are spawned  On the web, the new communicative settings have been spurred by a proliferation of new technologies that ease, foster and model our communication, e.g. chats, blogs, and social networks such as Facebook, Twitter, LinkedIn…
    14. 14. Then, a written text is not only topic…  There are many dimensions of variation: domain, topic, register, sentiment, level of complexity or difficulty or specialisation, trustworthiness and credibility, etc.  … genre is a dimension of variation. Genre gives us a topic packaged in a certain way. From the package, we are able to identify the communicative purpose of the text and the communicative context that has spawned such a text.
    15. 15. A step back…  Biber (1988)  Genre  Text types  66 linguistically-motivated features  Multi-Dimensional Analysis  Ad-hoc corpus  Karlgren & Cutting (1994)  Genre  20 shallow features  Brown Corpus
    16. 16. Biberian Text Types Biber (1988) Biber (1989) Biber (1993) Biber (1995) Biber (2004a) Biber (2004b) Biber et al. (2005) etc. Genres/Registers vs. Text Types External Features vs. Internal Features “I have used the term ‘genre’ (or ‘register’) for text varieties that are readily recognized and ‘named’ within a culture (e.g. letters, press editorials, sermon, conversation), while I have used the term ‘text type’ for varieties that are defined linguistically (rather than perceptually)” (Biber, 1993).
    17. 17. Multi-Dimensional Analysis  Factor Analysis, Factor Scores (Biber, 1988)  Cluster Analysis (Biber, 1989)  Additional Statistical Tests (Biber, 2004a; 2004b, etc.)  Text types: 1. intimate interpersonal interaction 2. informational interaction 3. scientific exposition 4. learned exposition 5. imaginative narrative 6. general narrative exposition 7. situated reportage 8. involved persuasion  Figures: Cluster Analysis - Biber (1989); Factor 2 - Biber (1988)  Criticism: Lee (1999)
    18. 18. From Biber’s text types to genres of electronic corpora: Karlgren and Cutting (1994)
    19. 19. Karlgren and Cutting (1994): Recognizing Text Genres with Simple Metrics Using Discriminant Analysis  20 features  Discriminant analysis  Brown corpus
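A minimal sketch of what "shallow features" can look like in practice. The four features below are illustrative assumptions, not the 20 features used by Karlgren and Cutting (1994), and the function name `shallow_features` is hypothetical. The point is only that such counts need no parsing and can be computed from raw text.

```python
import re

FIRST_PERSON = {"i", "me", "my", "we", "our"}  # assumed word list, for illustration only

def shallow_features(text):
    """Return a few cheap surface measurements for one text."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lowered = [w.lower() for w in words]
    n_words = max(len(words), 1)      # guard against empty input
    n_sents = max(len(sentences), 1)
    return {
        "words_per_sentence": len(words) / n_sents,
        "chars_per_word": sum(len(w) for w in words) / n_words,
        "type_token_ratio": len(set(lowered)) / n_words,
        "first_person_rate": sum(w in FIRST_PERSON for w in lowered) / n_words,
    }

features = shallow_features("I think we should go. Do you agree?")
```

Each text becomes a small dictionary of numbers, which a statistical method such as discriminant analysis can then use to separate genre classes.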
    20. 20. POSs & SUC
    21. 21. More than 15 years later…  Grieve, Biber et al.: “We define a genre in a very similar manner to how we define register – i.e. as a variety of language defined by the external situation in which it is produced. However, while a register is characterized by pervasive linguistic features, a genre is characterized by conventionalized linguistic features”  Karlgren: “Genre is a vague but well-established notion, and genres are explicitly identified and discussed by language users even while they may be difficult to encode and put into practical use” GoWeb
    22. 22. The concept of genre is beneficial… but difficult to pin down and to agree upon GoWeb In the book, we do not propose a single and unified definition of genre. Authors give their different views on genre.
    23. 23. Do we really need a definition?  After all…  … once we are convinced that genre is useful, we could just say that genre is a classificatory principle based on a number of attributes.  The web is immense; we cannot think of classifying web documents by genre manually, can we? Let’s just focus on AUTOMATIC web GENRE CLASSIFICATION!
    24. 24. What do we need for Automatic webGenre Identification (AGI)?  We need:  a genre taxonomy (palette) and a corpus  measurable attributes (features) that can be extracted automatically  an automatic classifier, i.e. a computational model that does the classification for us
    25. 25. Vector representation & supervised machine learning algorithms (esp. SVM)
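A minimal sketch of the idea on this slide: each document is a vector of feature values, and a supervised learner finds a boundary between genre classes. To stay dependency-free, this trains a linear SVM with plain subgradient descent on the hinge loss, which is a simplification of what an SVM library does. The feature values, the two genre labels, and the function names are made-up assumptions for illustration.

```python
def train_linear_svm(X, y, epochs=200, lr=0.1, reg=0.01):
    """Fit weights w and bias b by subgradient descent on the hinge loss.

    X: list of feature vectors (lists of floats); y: labels in {-1, +1}.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # L2 regularisation shrinks the weights a little on every step.
            w = [wj - lr * reg * wj for wj in w]
            if margin < 1:
                # Example is misclassified or inside the margin: move towards it.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def predict(w, b, x):
    """Return +1 or -1 depending on the side of the learned hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Toy vectors (invented): [scaled sentence length, scaled first-person rate]
X = [[0.8, 2.0], [0.9, 2.5], [1.0, 1.8], [2.4, 0.2], [2.8, 0.1], [2.2, 0.3]]
y = [1, 1, 1, -1, -1, -1]  # +1 = "blog", -1 = "academic paper" (hypothetical labels)

w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
```

Real experiments use thousands of features and many genre classes, typically handled with a multi-class SVM implementation rather than a hand-written loop.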
    26. 26. Models for AGI: Scenarios  Serge Sharoff  Kim & Ross  Santini  Stein et al.  Others… GoWeb
    27. 27. Morphology & the Linguist  Aim: find a genre palette allowing comparison among corpora (Web As Corpus initiative) and across languages  A functional genre palette inspired by J. Sinclair  Many corpora: English and Russian  Classifier: SVM  Features: POS trigrams (577 for Russian; 593 for English)  Example of a POS trigram: ADV ADJ NOUN Sharoff  GoWeb
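A minimal sketch of how POS trigram features can be counted once a text has been tagged. The tag sequence is hand-written for illustration, the tagger itself is assumed to exist elsewhere, and the function names are hypothetical.

```python
from collections import Counter

def pos_trigrams(tags):
    """Count every run of three consecutive part-of-speech tags."""
    return Counter(zip(tags, tags[1:], tags[2:]))

def to_vector(counts, vocabulary):
    """Turn trigram counts into relative frequencies over a fixed trigram list."""
    total = sum(counts.values()) or 1
    return [counts[trigram] / total for trigram in vocabulary]

# Hand-tagged toy sentence: "very old books are usually quite heavy objects"
tags = ["ADV", "ADJ", "NOUN", "VERB", "ADV", "ADV", "ADJ", "NOUN"]
counts = pos_trigrams(tags)

# A fixed vocabulary keeps vectors comparable across documents.
vocabulary = [("ADV", "ADJ", "NOUN"), ("NOUN", "NOUN", "NOUN")]
vector = to_vector(counts, vocabulary)
```

With a vocabulary of several hundred trigrams, as on the slide, each document becomes a vector of that length that can be passed to a classifier.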
    28. 28. The expert (the linguist) decides:
    29. 29. Results
    30. 30. KRYS I and Harmonic Descriptor Representation (HDR)  Information studies, Digital Libraries: semantic concept  Features: HDR = FP, LP or AP (between 1 and T/(N x MP))  Number of features: 7431  Classifier: SVM  KRYS I + 7-webgenre collection (total: 24 + 7 genre classes, 3452 documents) Kim & Ross  GoWeb 2477 words
    31. 31. KRYS I & 7-webgenre collection
    32. 32. Accuracies
    33. 33. What about morphology & syntax? What about noise?  Collection: 7-webgenre collection + others  Features: 100 facets  Genre palette: 7 webgenres + other  Classifier: inferential model (subjective Bayesian method) Santini  GoWeb
    34. 34. 7-webgenre collection  Balanced (200 web pages per genre class)  Genre palette  Not annotated manually  Built following 2 principles:  Objective sources  Consistent genre granularity
    35. 35. 100 Facets
    36. 36. Inferential model  It is a simple probabilistic model based on rules.  It allows some “reasoning” through the use of weights (closer to artificial intelligence than to machine learning)
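A minimal sketch of rule-based probabilistic reasoning with weights. The slides do not give the formulas, so this assumes the odds-likelihood form of Bayes' theorem that is commonly associated with the subjective Bayesian method: each rule carries one weight applied when its evidence is present and another when it is absent. The facet names, weights, and prior are invented for illustration and are not taken from the talk.

```python
def to_odds(probability):
    return probability / (1.0 - probability)

def to_probability(odds):
    return odds / (1.0 + odds)

def update_belief(prior, rules, observed):
    """Multiply the prior odds by one weight per rule.

    rules: {facet: (weight_if_present, weight_if_absent)}
    observed: {facet: bool}; facets missing from it count as absent.
    """
    odds = to_odds(prior)
    for facet, (if_present, if_absent) in rules.items():
        odds *= if_present if observed.get(facet, False) else if_absent
    return to_probability(odds)

# Hypothetical rules for the genre "personal blog".
rules = {
    "first_person_pronouns": (4.0, 0.5),   # presence raises the odds
    "passive_constructions": (0.25, 1.5),  # presence lowers the odds
}
observed = {"first_person_pronouns": True, "passive_constructions": False}
posterior = update_belief(0.5, rules, observed)
```

Weights above 1 support the genre and weights below 1 count against it, so the expert's knowledge is encoded directly in the rules instead of being learned from labelled data.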
    37. 37. Comparisons (I)
    38. 38. Different types of noise!
    39. 39. Results
    40. 40. Three experimental settings, three different genre needs…. 1. Genre comparison across corpora 2. Digital libraries, where documents can be more easily monitored 3. The wild web, where everything is uncertain and noisy WEGA prototype: a retrieval model for genre-enabled web search
    41. 41. Genre retrieval model  Genre collection and palette: KI-04 corpus: 8 webgenres  Firefox add-on  Model: “lightweight GenreRich model” (linear discriminant analysis)  Features: HTML, link features, character features, vocabulary concentration features (< 100 features) Stein, Meyer zu Eissen, Lipka GoWeb
    42. 42. WEGA (WEb Genre Analysis)
    43. 43. KI-04 genre collection: 8 webgenres
    44. 44. Genre Classes & Human Recognition  How can we decide on the most representative genre classes? Let’s ask users… yes indeed, but how?  1) questionnaires (Karlgren)  2) card sorting (Rosso & Haas)  3) task-oriented studies (Crowston et al.)  4) others…
    45. 45. Questionnaires: “what genres are available on the internet?”
    46. 46. User Warrant  Collecting genre terminology in the users’ own words (3 participants)  Making the users classify web pages and create piles (rationale?)  Users choose the best of the collected genre terminology (102 participants)  User validation of the genre palette (257 participants)  Genres’ usefulness for web search (32 participants) GoWeb: Rosso & Haas
    47. 47. Final Genre palette: 18 genres
    48. 48. Genres & Tasks  3 groups of respondents: teachers, journalists, engineers  Respondents were asked to carry out a web search for a real task of their own choice  What is your search goal?  What type of web page would you call this?  What is it about the page that makes you call it that?  Was this page useful to you? GoWeb: Crowston et al.
    49. 49. What type of web page would you call this?  522 unique terms  about 300
    50. 50. Syracuse corpus & AGI ACL 2010 (Uppsala): Fine-Grained Genre Classification Using Structural Learning Algorithms, Zhili Wu, Katja Markert and Serge Sharoff  The whole corpus: 3027 annotated webpages divided into 292 genres.  Focussing on genres containing 15 or more examples, the corpus comprises about 2293 examples and 52 genres.
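A minimal sketch of the filtering step described on this slide: keep only the genres that have at least 15 annotated examples. The label list is invented toy data, not the Syracuse corpus, and the function name is hypothetical.

```python
from collections import Counter

def keep_frequent_genres(labels, min_examples=15):
    """Return the indices of documents whose genre has enough examples,
    together with the set of genres that survived the cut."""
    counts = Counter(labels)
    kept = {genre for genre, n in counts.items() if n >= min_examples}
    indices = [i for i, genre in enumerate(labels) if genre in kept]
    return indices, kept

# Toy annotation: one genre label per web page.
labels = ["blog"] * 20 + ["faq"] * 15 + ["recipe"] * 3 + ["poem"] * 14
indices, kept = keep_frequent_genres(labels)
```

In the toy data, two of the four genres survive and 35 of the 52 pages remain, which mirrors on a small scale the drop from 292 to 52 genres reported on the slide.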
    51. 51. Conclusions (I): Do we really need a definition of genre? 1. Take a number of web pages belonging to different web genres (e.g. blogs, home pages, news stories, FAQs, etc.) 2. Identify and extract genre-revealing features 3. Feed an automatic classifier Where is the problem?
    52. 52. Conclusions (II)  The problem with this approach is that, without a theoretical definition and characterization of the concept of genre, it is not clear:  how to create a genre taxonomy whose classes both humans and automatic classifiers can easily tell apart  how to select a representative corpus for the genre classes in the taxonomy, since there is a lot of variation in users’ assessment…  how to identify the optimal genre-revealing features
    53. 53. Future Work Genre is a high-level concept: we NEED a theoretical definition of genre for computational and empirical purposes. Without a theoretical definition:  genres become lifeless texts, merely characterized by formal attributes, and the communicative context, i.e. the thing that makes genre important, is completely stripped out  Although in some restricted experimental settings this “formalistic” approach is quite rewarding (more than 95% success rate), we can hardly generalize from it.
    54. 54. Future directions: AGI is a fertile land for research and development… Now that basic explorations have been carried out, we should concentrate more on the correlation and interrelation of the following variables:  Human agreement  Representation of genre classes  Number of genre classes  Nature of genre classes  Size of the whole corpus  Structured and unstructured noise  Genre-revealing features that account for the context that genres carry with them  New computational models and algorithms…
    55. 55. Certainties….  Genre is a useful concept in many disciplines  Automatic genre classification is feasible, and there is ample space for improvement  I am interested in your views on (web) genre:  send me your impressions, ideas, gut feelings and your genre classes:  Facebook page: www.facebook.com/genresontheweb  Genre blog: www.forum.santini.se  Webrider’s Short proposal to EU: www.webrider.se
    56. 56. Thank you for your attention!
    57. 57. References (I)  Bateman, John (2008) Multimodality and Genre, Palgrave Macmillan  Bawarshi, Anis S. and Reiff, Mary Jo (eds) (2010) Genre: An Introduction to History, Theory, Research, and Pedagogy (free book); http://wac.colostate.edu/books/bawarshi_reiff/genre.pdf  Bruce, Ian (2008) Academic Writing and Genre, Continuum  Dorgeloh, Heidrun and Wanner, Anja (2010) Syntactic Variation and Genre, De Gruyter Mouton
    58. 58. References (II)  Giltrow, Janet and Stein, Dieter (eds) (2009) Genres in the Internet, John Benjamins Publishing Company  Heyd, Theresa (2008) Email Hoaxes: Form, function, genre ecology, John Benjamins Publishing Company  Lee, David (2001) Genres, Registers, Text Types, Domains, and Styles: Clarifying the Concepts and Navigating a Path through the BNC Jungle, Language Learning & Technology, September 2001, Vol. 5, Num. 3, pp. 37-72, http://llt.msu.edu/vol5num3/pdf/lee.pdf
    59. 59. References (III)  Luzón, María José, Ruiz-Madrid, María Noelia and Villanueva, María Luisa (eds) (2010) Digital Genres, New Literacies and Autonomy in Language Learning, Cambridge Scholars Publishing  Martin, James and Rose, David (2008) Genre Relations: Mapping Culture, Equinox  Puschmann, Cornelius (2010) The corporate blog as an emerging genre of computer-mediated communication: features, constraints, discourse situation, Universitätsverlag Göttingen  WEGA prototype download, documentation and references: http://www.uni-weimar.de/cms/medien/webis/research/projects/wega.html