Uppsala uni 4march2011






Uppsala uni 4march2011 Presentation Transcript

  • Marina Santini, Artificial Solutions, KYH Agile Web Development, Stockholm. Uppsala University, Department of Linguistics and Philology, Seminar Series, Fri 4 March 2011
  • Genres on the Web GoWeb
  • Outline: What is genre? What is web genre? What is the difference between genre and web genre? Why is (web) genre important? Automatic web genre identification: the very beginning (Biber; Karlgren & Cutting), Sharoff, Kim & Ross, Santini, Stein et al. Web genre identification by humans: Karlgren, Rosso & Haas, Crowston et al. Future directions.
  • What is genre? The beginning… Aristotle (4th cent. BC): drama, lyric, epic. Drama: tragedy, comedy, satyr play. Literary theory and literary genres. Library classification, also used in online bookshops (e.g. Amazon). Music genres (jazz, rock, etc.), film genres (thriller, drama, western, etc.)
  • More recently… Genre in academic contexts, in workplace and professional contexts, in public contexts, in pedagogy (teaching writing), etc. (research articles, essays, emails, memos, etc.)
  • Recent Genre Definitions: 2008-2010
  • Genre & Corpus Linguistics: Surprisingly, no explicit definition of what genre is… Brown corpus (1961): 15 genres; Stockholm-Umeå Corpus (SUC) (1990s); British National Corpus (1990s); etc.
  • David Lee and the BNC Jungle
  • Why is genre important? It is a context carrier: being based on recurrent conventions and predictable expectations, genre provides the communicative context and the communicative purpose for which a text has been produced. Think of what happens in your mind when you come across a specific genre, e.g. FAQs, reviews, interviews, academic papers, reportages…
  • Benefits (I): Being a context carrier… Complexity reduction: a text receives identity through belonging to a certain genre. Predictivity: genre reduces information overload. Findability: genre helps find web documents “relevant” to our information needs.
  • Benefits (II): Genre competence increases information understanding: it strengthens self-protection against digital crimes (phishing, hoaxes, cyberbullying) because it helps us spot genre anomalies and, consequently, malicious intentions. Genre competence helps implement democracy: some educational programs (e.g. in Australia) teach genre from primary school onwards, because those who leave school after the primary level without genre competence become socially disadvantaged in the structure of power.
  • What is webgenre? All types of genres that are on the web… Paper genres that have been uploaded in any format + genres that do not have any counterpart in the paper world, e.g. home page, About Us, FAQs, webzine, personal blog, corporate weblogs…
  • How is webgenre different from papergenre? On the web, there are new communicative settings and new communicative contexts, so new genres are spawned. On the web, the new communicative settings have been spurred by a proliferation of new technologies that ease, foster and model our communication, e.g. chats, blogs, social networks like Facebook, Twitter, LinkedIn…
  • Then, a written text is not only topic… There are many dimensions of variation: domain, topic, register, sentiment, level of complexity or difficulty or specialisation, trustworthiness and credibility, etc. … genre is a dimension of variation. Genre gives us a topic packaged in a certain way. From the package, we are able to identify the communicative purpose of the text and the communicative context that has spawned such a text.
  • A step back… Biber (1988): genre, text types, 66 linguistically-motivated features, Multi-Dimensional Analysis, ad-hoc corpus. Karlgren & Cutting (1994): genre, 20 shallow features, Brown Corpus.
  • Biberian Genres/Registers vs. Text Types. Text Types: Biber (1988) Biber (1989) Biber (1993) Biber (1995) Biber (2004a) External Features Biber (2004b) vs. Biber et al. (2005) Internal Features etc. “I have used the term ‘genre’ (or ‘register’) for text varieties that are readily recognized and ‘named’ within a culture (e.g. letters, press editorials, sermon, conversation), while I have used the term ‘text type’ for varieties that are defined linguistically (rather than perceptually)” (Biber, 1993).
  • Multi-Dimensional Analysis: Factor Analysis, Factor Scores (Biber, 1988); Cluster Analysis (Biber, 1989); Additional Statistical Tests (Biber, 2004a; 2004b, etc.). Factor 2: Biber (1988). Cluster Analysis: Biber (1989): 1. intimate interpersonal interaction 2. informational interaction 3. scientific exposition 4. learned exposition 5. imaginative narrative 6. general narrative exposition 7. situated reportage 8. involved persuasion. Criticism: Lee (1999)
  • From Biber’s text types to genres of electronic corpora: Karlgren and Cutting (1994)
  • Karlgren and Cutting (1994): Recognizing Text Genres with Simple Metrics Using Discriminant Analysis. 20 features. Discriminant analysis. Brown corpus.
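To make "simple metrics" concrete, here is a minimal sketch of shallow feature extraction. The three features chosen (average word length, type/token ratio, first-person pronoun rate) are illustrative assumptions and are not claimed to be among the 20 features of Karlgren and Cutting; the function name `shallow_features` is hypothetical.

```python
import re

# Assumed, illustrative pronoun list; not taken from the original paper.
FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}

def shallow_features(text):
    """Return a few cheap, countable features of a text."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words:
        return {"avg_word_len": 0.0, "type_token_ratio": 0.0,
                "first_person_rate": 0.0}
    return {
        # Mean number of characters per word.
        "avg_word_len": sum(len(w) for w in words) / len(words),
        # Distinct words divided by all words.
        "type_token_ratio": len(set(words)) / len(words),
        # Share of words that are first-person pronouns.
        "first_person_rate": sum(w in FIRST_PERSON for w in words) / len(words),
    }

feats = shallow_features("I think we should go. I really do.")
```

Features of this kind need no parser or tagger, which is what makes them cheap to compute over a large corpus.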
  • POSs & SUC
  • GoWeb: More than 15 years later… Grieve, Biber et al.: “We define a genre in a very similar manner to how we define register – i.e. as a variety of language defined by the external situation in which it is produced. However, while a register is characterized by pervasive linguistic features, a genre is characterized by conventionalized linguistic features”. Karlgren: “Genre is a vague but well-established notion, and genres are explicitly identified and discussed by language users even while they may be difficult to encode and put into practical use”
  • The concept of genre is beneficial… but difficult to pin down and to agree upon. In the book, we do not propose a single and unified definition of genre. Authors give their different views on genre. GoWeb
  • Do we really need a definition? After all… once we are convinced that genre is useful, we could just say that genre is a classificatory principle based on a number of attributes. The web is immense; we cannot think of classifying web documents by genre manually, can we? Let’s just focus on AUTOMATIC web GENRE CLASSIFICATION!
  • What do we need for Automatic web Genre Identification (AGI)? We need: a genre taxonomy (palette) and a corpus; measurable attributes (features) that can be extracted automatically; an automatic classifier, i.e. a computational model that does the classification for us
  • Vector representation & supervised machine learning algorithms (esp. SVM)
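The three ingredients (a labelled corpus, automatically extractable features, a classifier) can be sketched in a few lines. This is a toy illustration under stated assumptions: the vocabulary, the training texts and the genre labels are invented, and a nearest-centroid classifier is swapped in for the SVM named on the slide so that the example needs only the standard library.

```python
from collections import Counter
import math

def vectorize(text, vocabulary):
    """Map a text to relative frequencies over a fixed vocabulary."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[term] / total for term in vocabulary]

def train_centroids(labelled_texts, vocabulary):
    """Average the vectors of each genre class (nearest-centroid training)."""
    sums, sizes = {}, Counter()
    for text, genre in labelled_texts:
        acc = sums.setdefault(genre, [0.0] * len(vocabulary))
        for i, value in enumerate(vectorize(text, vocabulary)):
            acc[i] += value
        sizes[genre] += 1
    return {g: [v / sizes[g] for v in acc] for g, acc in sums.items()}

def classify(text, centroids, vocabulary):
    """Return the genre whose centroid is closest in Euclidean distance."""
    vec = vectorize(text, vocabulary)
    return min(centroids, key=lambda g: math.dist(vec, centroids[g]))

# Hypothetical vocabulary and training data, for illustration only.
VOCAB = ["i", "my", "question", "answer", "how"]
TRAIN = [
    ("today i walked my dog and i wrote my diary", "blog"),
    ("i think my week was long", "blog"),
    ("question how do i reset answer click reset", "faq"),
    ("question how much answer it is free", "faq"),
]
centroids = train_centroids(TRAIN, VOCAB)
predicted = classify("how do you pay question answer by card", centroids, VOCAB)
```

A real experiment would replace the centroid step with an SVM implementation and the five-word vocabulary with hundreds or thousands of features, but the data flow (text, vector, label) stays the same.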
  • Models for AGI: Scenarios Serge Sharoff Kim & Ross Santini Stein et al. Others… GoWeb
  • Morphology & the Linguist. Sharoff, GoWeb. Aim: find a genre palette allowing comparison among corpora (Web As Corpus initiative) and across languages. A functional genre palette inspired by J. Sinclair. Many corpora: English and Russian. Classifier: SVM. Features: POS trigrams (577 for Russian; 593 for English). Example of a POS trigram: ADV ADJ NOUN
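A POS trigram is simply a run of three consecutive part-of-speech tags. The sketch below counts them for a text that is assumed to be already tagged; the tag sequence is invented, and the helper name `pos_trigrams` is hypothetical.

```python
from collections import Counter

def pos_trigrams(tags):
    """Count every run of three consecutive POS tags."""
    return Counter(zip(tags, tags[1:], tags[2:]))

# Hypothetical tagger output for a short text.
tags = ["ADV", "ADJ", "NOUN", "VERB", "ADV", "ADJ", "NOUN"]
counts = pos_trigrams(tags)
```

Each distinct trigram becomes one dimension of the feature vector, which is how a few hundred features per language can arise.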
  • The expert (the linguist) decides:
  • Results
  • KRYS I and Harmonic Descriptor Representation (HDR). 2477 words. Kim & Ross, GoWeb. Information studies, Digital Libraries: semantic concept. Features: HDR = FP, LP or AP (between 1 and T/(N x MP)). Number of features: 7431. Classifier: SVM. KRYS I + 7-webgenre collection (total: 24 + 7 genre classes, 3452 documents)
  • KRYS I & 7-webgenre collection
  • Accuracies
  • What about morphology & syntax? What about noise? Santini, GoWeb. Collection: 7-webgenre collection + others. Features: 100 facets. Genre palette: 7 webgenres + other. Classifier: inferential model, subjective Bayesian method
  • 7-webgenre collection: Balanced (200 web pages per genre class). Genre palette. Not annotated manually. Built following 2 principles: objective sources; consistent genre granularity
  • 100 Facets
  • Inferential model: It is a simple probabilistic model based on rules. It allows some “reasoning” through the use of weights (closer to artificial intelligence than machine learning)
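One common reading of a rule-based "subjective Bayesian" model is odds-likelihood updating, in which each rule multiplies the prior odds of a hypothesis by a weight depending on whether its evidence is present. The sketch below assumes that reading; it is not the model used in the talk, and the prior, the facet names and the weights are invented for illustration.

```python
def to_odds(probability):
    return probability / (1.0 - probability)

def to_prob(odds):
    return odds / (1.0 + odds)

def update(prior, rules, observed):
    """Multiply the prior odds by one weight per rule.

    rules maps a facet name to (weight_if_present, weight_if_absent).
    """
    odds = to_odds(prior)
    for facet, (if_present, if_absent) in rules.items():
        odds *= if_present if facet in observed else if_absent
    return to_prob(odds)

# Hypothetical rules for the hypothesis "this page is a blog".
BLOG_RULES = {"first_person": (4.0, 0.5), "dated_entries": (3.0, 0.5)}
p_blog = update(0.2, BLOG_RULES, {"first_person", "dated_entries"})
p_not_blog_like = update(0.2, BLOG_RULES, set())
```

Because the weights are set by hand rather than learned, a model of this kind encodes expert judgement directly, which fits the slide's remark that it sits closer to artificial intelligence than to machine learning.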
  • Comparisons (I)
  • Different types of noise!
  • Results
  • Three experimental settings, three different genre needs… 1. Genre comparison across corpora. 2. Digital libraries, where documents can be more easily monitored. 3. The wild web, where everything is uncertain and noisy. WEGA prototype: a retrieval model for genre-enabled web search
  • Genre retrieval model. Stein, Meyer zu Eissen, Lipka, GoWeb. Genre collection and palette: KI-04 corpus, 8 webgenres. Firefox add-on. Model: “lightweight GenreRich model” (linear discriminant analysis). Features: HTML, link features, character features, vocabulary concentration features (< 100 features)
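As an illustration of what HTML and link features might look like, the sketch below counts start tags, hyperlinks and visible characters with Python's standard `html.parser`. The three features and all names are assumptions made for this example; they are not the feature set of the GenreRich model.

```python
from html.parser import HTMLParser

class PageStats(HTMLParser):
    """Collect a few structural counts while parsing a page."""

    def __init__(self):
        super().__init__()
        self.tags = 0
        self.links = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        self.tags += 1
        # Count only anchors that actually carry an href attribute.
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.links += 1

    def handle_data(self, data):
        self.text_chars += len(data.strip())

def html_features(html):
    stats = PageStats()
    stats.feed(html)
    stats.close()
    return {
        "tag_count": stats.tags,
        "link_count": stats.links,
        "chars_per_tag": stats.text_chars / max(stats.tags, 1),
    }

page = '<html><body><h1>FAQ</h1><p>See <a href="/help">help</a>.</p></body></html>'
features = html_features(page)
```

Presentation-level features like these are available for every web page without any linguistic processing, which is one reason they suit a lightweight browser add-on.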
  • WEGA (WEb Genre Analysis)
  • KI-04 genre collection: 8 webgenres
  • Genre Classes & Human Recognition: How can we decide on the most representative genre classes? Let’s ask users… yes indeed, but how? 1) questionnaires (Karlgren) 2) card sorting (Rosso & Haas) 3) task-oriented studies (Crowston et al.) 4) others…
  • Questionnaires: “what genres are available on the internet?”
  • User Warrant. GoWeb: Rosso & Haas. Collecting genre terminology in the users’ own words (3 participants): make the users classify web pages and create piles (rationale?). Users choose the best of the collected genre terminology (102 participants). User validation of the genre palette (257 participants). Usefulness of genres for web search (32 participants)
  • Final genre palette: 18 genres
  • Genres & Tasks. GoWeb: Crowston et al. 3 groups of respondents: teachers, journalists, engineers. Respondents were asked to carry out a web search for a real task of their own choice: What is your search goal? What type of web page would you call this? What is it about the page that makes you call it that? Was this page useful to you?
  • What type of web page would you call this? 522 unique terms  about 300
  • Syracuse corpus & AGI. ACL 2010 (Uppsala): FINE-GRAINED GENRE CLASSIFICATION USING STRUCTURAL LEARNING ALGORITHMS, Zhili Wu, Katja Markert and Serge Sharoff. The whole corpus: 3027 annotated webpages divided into 292 genres. Restricted to genres containing 15 or more examples, the corpus has about 2293 examples and 52 genres.
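Restricting a corpus to genres with at least 15 examples is a simple filtering step. The sketch below shows one way to do it on a list of (page, genre) pairs; the data and the function name `keep_frequent_genres` are hypothetical and do not reproduce the Syracuse corpus.

```python
from collections import Counter

def keep_frequent_genres(labelled_pages, min_examples=15):
    """Drop pages whose genre label has fewer than min_examples pages."""
    sizes = Counter(genre for _, genre in labelled_pages)
    return [(page, genre) for page, genre in labelled_pages
            if sizes[genre] >= min_examples]

# Toy data: one well-populated genre and one sparse genre.
pages = ([(f"blog-{i}", "blog") for i in range(20)]
         + [(f"poem-{i}", "poem") for i in range(3)])
kept = keep_frequent_genres(pages)
```

The threshold trades coverage for reliability: sparse genres give a classifier too few examples to learn from, but dropping them also removes pages and labels from the evaluation.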
  • Conclusions (I): Do we really need a definition of genre? 1. Take a number of web pages belonging to different web genres (e.g. blogs, home pages, news stories, FAQs, etc.) 2. Identify and extract genre-revealing features 3. Feed an automatic classifier. Where is the problem?
  • Conclusions (II): The problem with this approach is that, without a theoretical definition and characterization of the concept of genre, it is not clear: how to create a genre taxonomy whose classes both humans and automatic classifiers can easily discriminate between; how to select a representative corpus for the genre classes in the taxonomy, since there is a lot of variation in users’ assessment; how to identify the optimal genre-revealing features
  • Future Work: Genre is a high-level concept: we NEED a theoretical definition of genre for computational and empirical purposes. Without a theoretical definition, genres become lifeless texts, merely characterized by formal attributes, and the communicative context, i.e. the thing that makes genre important, is completely stripped out. Although in some restricted experimental settings this “formalistic” approach is quite rewarding (more than 95% success rate), we can hardly generalize from it.
  • Future directions: AGI is a fertile land for research and development… Now that basic explorations have been carried out, we should concentrate more on the correlation and interrelation of the following variables: human agreement; representation of genre classes; number of genre classes; nature of genre classes; size of the whole corpus; structured and unstructured noise; genre-revealing features that account for the context that genres carry with them; new computational models and algorithms…
  • Certainties… Genre is a useful concept in many disciplines. Automatic genre classification is feasible, and there is ample space for improvement. I am interested in your views on (web) genre: send me your impressions, ideas, gut feelings and your genre classes. Facebook page: www.facebook.com/genresontheweb Genre blog: www.forum.santini.se Webrider’s Short proposal to EU: www.webrider.se
  • Thank you for your attention!
  • References (I): Bateman, John (2008) Multimodality and Genre, Palgrave Macmillan. Bawarshi, Anis S. and Reiff, Mary Jo (eds) (2010) Genre: An Introduction to History, Theory, Research, and Pedagogy (free book), http://wac.colostate.edu/books/bawarshi_reiff/genre.pdf. Bruce, Ian (2008) Academic Writing and Genre, Continuum. Dorgeloh, Heidrun and Wanner, Anja (2010) Syntactic Variation and Genre, De Gruyter Mouton.
  • References (II): Giltrow, Janet and Stein, Dieter (eds) (2009) Genres in the Internet, John Benjamins Publishing Company. Heyd, Theresa (2008) Email Hoaxes: Form, function, genre ecology, John Benjamins Publishing Company. Lee, David (2001) Genres, Registers, Text Types, Domains, and Styles: Clarifying the Concepts and Navigating a Path through the BNC Jungle, Language Learning & Technology, September 2001, Vol. 5, Num. 3, pp. 37-72, http://llt.msu.edu/vol5num3/pdf/lee.pdf
  • References (III): Luzón, María José, Ruiz-Madrid, María Noelia and Villanueva, María Luisa (eds) (2010) Digital Genres, New Literacies and Autonomy in Language Learning, Cambridge Scholars Publishing. Martin, James and Rose, David (2008) Genre Relations: Mapping Culture, Equinox. Puschmann, Cornelius (2010) The corporate blog as an emerging genre of computer-mediated communication: features, constraints, discourse situation, Universitätsverlag Göttingen. WEGA prototype download, documentation and references: http://www.uni-weimar.de/cms/medien/webis/research/projects/wega.html