What is genre? What is web genre?
What is the difference between genre and web genre?
Why is (web) genre important?
Automatic web genre identification
The very beginning: Biber and Karlgren&Cutting
Kim & Ross
Stein et al.
Web genre identification by Humans
Rosso & Haas
Crowston et al.
What is genre? The beginning…
Aristotle (4th century BC): drama, lyric, epic
Drama: tragedy, comedy, satyr play
Literary theory and literary genres
Library classification used also in online bookshops (e.g
Music genres (jazz, rock, etc.), film genres (thriller,
drama, western etc.)
Genre in academic contexts, in
workplace and professional
contexts, public contexts, in
pedagogy (teaching writing), etc
(research articles, essays, emails,
Why is genre important?
It is a context carrier: being based on recurrent
conventions and predictable expectations, genre
provides the communicative context and the
communicative purpose for which a text has been
Think of what happens in your mind when you come
across a specific genre, e.g. FAQs, reviews,
interviews, academic papers, reportage…
Being a context carrier…
Complexity reduction: a text receives its identity
through belonging to a certain genre;
Predictivity: genre reduces information overload.
Findability: genre helps find web documents
”relevant” to our information needs;
Genre competence increases information
Genre competence increases self-protection against
digital crimes (phishing, hoaxes, cyberbullying) because it
can help us spot genre anomalies and consequently
Genre competence helps implement democracy:
some educational programs (e.g. in Australia) focus on
teaching genre from primary school onwards, because those
who leave school after the primary level without genre
competence become socially disadvantaged in the
structure of power.
What is webgenre?
All types of genres that are on the web…
Paper genres that have been uploaded in any format
+ genres that do not have any counterpart in the
ex: home page, About Us, FAQs, webzine,
personal blog, corporate weblogs …
How is webgenre different from paper
On the web, there are new communicative settings,
and new communicative contexts, so new genres are
On the web, the new communicative settings have
been spurred by a proliferation of new technologies
that ease, foster and model our communication: ex:
chats, blogs, social networks, like Facebook, Twitter,
Then, a written text is not only
There are many dimensions of variation: domain,
topic, register, sentiment, level of complexity or
difficulty or specialisation, trustworthiness and
… genre is a dimension of variation. Genre gives us
a topic packaged in a certain way. From the package,
we are able to identify the communicative purpose of
the text and the communicative context that has
spawned such a text.
A step back…
66 linguistically-motivated features
Karlgren & Cutting (1994)
20 shallow features
Biber et al. (2005)
“I have used the term ‘genre’ (or ‘register’) for text varieties that are readily
recognized and ‘named’ within a culture (e.g. letters, press editorials, sermon,
conversation), while I have used the term ‘text type’ for varieties that are defined
linguistically (rather than perceptually)” (Biber, 1993).
More than 15 years later…
Grieve, Biber et al.: ”We define a genre in a very similar
manner to how we define register – i.e. as a variety of
language defined by the external situation in which it is
produced. However, while a register is characterized by
pervasive linguistic features, a genre is characterized by
conventionalized linguistic features”
Karlgren: ”Genre is a vague but well-established
notion, and genres are explicitly identified and
discussed by language users even while they may be
difficult to encode and put into practical use”
The concept of genre is beneficial…
but difficult to pin down and to
In the book, we do not
propose a single and
unified definition of
genre. Authors give
their different views on
Do we really need a definition?
… once we are convinced that genre is useful, we could just
say that: genre is a classificatory principle based on a
number of attributes.
The web is immense, we cannot think of classifying web
documents by genre manually, can we? Let’s just focus on
AUTOMATIC web GENRE CLASSIFICATION!
What do we need for Automatic
webGenre Identification (AGI)?
a genre taxonomy (palette) and a corpus
measurable attributes (features) that can be extracted
an automatic classifier, i.e. a computational model that
does the classification for us
Models for AGI: Scenarios
Kim & Ross
Stein et al.
Morphology & the Linguist
Aim: Find a genre palette allowing comparison among
corpora (Web As Corpus initiative) and across
A functional genre palette inspired by J. Sinclair
Many corpora: English and Russian
Features: POS trigrams (577 for Russian; 593 for
Example of a POS trigram: ADV ADJ NOUN
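To make the feature concrete: a POS trigram is any run of three consecutive part-of-speech tags, and a document can be described by how often each trigram occurs. The snippet below is only a minimal sketch of that counting step. The tag sequence is a made-up example, and in practice the tags would come from a POS tagger, which is not shown here.

```python
from collections import Counter

def pos_trigrams(tags):
    """Count every run of three consecutive POS tags in a tagged text."""
    return Counter(zip(tags, tags[1:], tags[2:]))

# Hypothetical tag sequence (assumption: produced by some POS tagger).
tags = ["ADV", "ADJ", "NOUN", "VERB", "ADV", "ADJ", "NOUN"]
counts = pos_trigrams(tags)

# Relative frequencies, so that long and short documents are comparable.
total = sum(counts.values())
frequencies = {trigram: n / total for trigram, n in counts.items()}
```

Each distinct trigram becomes one feature, which is why the feature set runs to several hundred dimensions.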
KRYS I and Harmonic Descriptor
Information studies, Digital Libraries:
Features: HDR = FP, LP or AP (between 1 and
T/(N x MP))
Number of features: 7431
KRYS I + 7 webgenre collection (total: 24 +
7 genre classes, 3452 documents)
Kim & Ross GoWeb
Three experimental settings, three
different genre needs….
1. Genre comparison across corpora
2. Digital libraries, where documents can be more easily
3. The wild web, where everything is uncertain and
a retrieval model for genre-enabled web search
Genre retrieval model
Genre collection and palette: KI-04 corpus: 8 webgenres
Model: ”lightweight GenreRich model” (linear discriminant
Features: HTML, link features, character features,
vocabulary concentration features (< 100 features)
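As an illustration of what such a small feature set might look like, here is a minimal sketch that derives a handful of HTML, link, character and vocabulary features from a page. The feature names and definitions are assumptions made for this example (the type/token ratio stands in for "vocabulary concentration"). They are not the features actually used in the model above.

```python
from html.parser import HTMLParser

class _PageScanner(HTMLParser):
    """Collects tag counts, link counts and visible text from one page."""

    def __init__(self):
        super().__init__()
        self.tags = 0
        self.links = 0
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags += 1
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.links += 1

    def handle_data(self, data):
        self.text.append(data)

def lightweight_features(html):
    """Return a few cheap, hypothetical genre features for an HTML page."""
    scanner = _PageScanner()
    scanner.feed(html)
    scanner.close()
    text = " ".join(scanner.text)
    tokens = text.lower().split()
    chars = [c for c in text if not c.isspace()]
    n_tokens = max(len(tokens), 1)   # avoid division by zero on empty pages
    n_chars = max(len(chars), 1)
    return {
        "tags_per_token": scanner.tags / n_tokens,               # HTML feature
        "links_per_token": scanner.links / n_tokens,             # link feature
        "digit_ratio": sum(c.isdigit() for c in chars) / n_chars,  # character feature
        "type_token_ratio": len(set(tokens)) / n_tokens,         # vocabulary feature
    }

page = ('<html><body><h1>FAQ</h1><p>What is genre? '
        'See <a href="/about">about us</a>.</p></body></html>')
features = lightweight_features(page)
```

All of these can be computed in a single pass over the page, which is the point of keeping the feature set small.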
Stein, Meyer zu Eissen, Lipka GoWeb
Genre Classes & Human
How can we decide on the most representative genre
classes? Let’s ask users… yes indeed, but how?
1) questionnaires (Karlgren)
2) card sorting (Rosso & Haas)
3) task-oriented studies (Crowston et al.)
Questionnaires: ”what genres are
available on the internet?”
Collecting genre terminology in the users’ own words
Make the users classify web pages and create piles
Users choose the best of the collected genre
terminology (102 participants)
User validation of the genre palette (257 participants)
Usefulness of genres for web search (32 participants)
GoWeb: Rosso & Haas
Genres & Tasks
3 groups of respondents: teachers, journalists, engineers,
Respondents were asked to carry out a web search for a
real task of their own choice
What is your search goal?
What type of web page would you call this?
What is it about the page that makes you call it that?
Was this page useful to you?
GoWeb: Crowston et al.
What type of web page would you call this?
522 unique terms about 300
Syracuse corpus & AGI
ACL 2010 (Uppsala):
FINE-GRAINED GENRE CLASSIFICATION USING
STRUCTURAL LEARNING ALGORITHMS
Zhili Wu, Katja Markert and Serge Sharoff
The whole corpus: 3027 annotated webpages divided
into 292 genres.
Focussing on genres containing 15 or more examples,
the corpus comprises about 2293 examples and 52 genres.
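The restriction to genres with 15 or more examples is a simple frequency filter over the genre labels. A minimal sketch follows, with toy labels and a hypothetical helper name. It does not reproduce the Syracuse data.

```python
from collections import Counter

def keep_frequent_genres(labels, min_examples=15):
    """Return the genres with enough examples and the indices of their pages."""
    counts = Counter(labels)
    kept = {genre for genre, n in counts.items() if n >= min_examples}
    indices = [i for i, genre in enumerate(labels) if genre in kept]
    return kept, indices

# Toy annotation: one label per web page (invented for illustration).
labels = ["blog"] * 20 + ["faq"] * 15 + ["recipe"] * 3
kept, indices = keep_frequent_genres(labels)
```

A filter of this kind trades coverage for reliability: rare genres are dropped because too few examples are available to train or evaluate on.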
Conclusions (I): Do we really need
a definition of genre?
1. Take a number of web pages belonging to different
web genres (e.g. blogs, home pages, news stories,
2. Identify and extract genre-revealing features
3. Feed an automatic classifier
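The three steps can be sketched end to end in a few lines. Everything below is an assumption made for illustration: the four toy pages, the two shallow features, and the nearest-centroid classifier are stand-ins chosen for brevity, not the corpora, features or models discussed in this talk.

```python
from collections import defaultdict
import math

FIRST_PERSON = {"i", "my", "me"}

def extract_features(text):
    """Step 2 (toy version): two shallow, hypothetical features."""
    tokens = text.split()
    n = max(len(tokens), 1)
    question_density = text.count("?") / n
    first_person = sum(t.strip(".,!?").lower() in FIRST_PERSON for t in tokens) / n
    return [question_density, first_person]

def train_centroids(pages):
    """Step 3, training: average the feature vectors of each genre."""
    grouped = defaultdict(list)
    for text, genre in pages:
        grouped[genre].append(extract_features(text))
    return {
        genre: [sum(column) / len(vectors) for column in zip(*vectors)]
        for genre, vectors in grouped.items()
    }

def classify(text, centroids):
    """Step 3, prediction: pick the genre whose centroid is closest."""
    vector = extract_features(text)
    return min(centroids, key=lambda genre: math.dist(vector, centroids[genre]))

# Step 1: a tiny labelled collection (invented examples).
pages = [
    ("How do I reset my password? Where is my invoice?", "faq"),
    ("What is genre? Why is genre important?", "faq"),
    ("Today I walked my dog and then I cooked dinner for me", "blog"),
    ("I think my garden looks lovely and I love it", "blog"),
]
centroids = train_centroids(pages)
prediction = classify("Who can help me? When does it open?", centroids)
```

Note that nothing in this sketch says what a genre is. The classifier only sees labels and numbers, which is exactly the situation the next slides question.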
Where is the problem?
The problem with this approach is that without a
theoretical definition and characterization of the
concept of genre, it is not clear:
how to create a genre taxonomy whose classes both humans and
automatic classifiers can easily discriminate
how to select a representative corpus for the genre classes
in the taxonomy, since there is a lot of variation in users’
how to identify the optimal genre-revealing features
Genre is a high-level concept: we NEED a theoretical
definition of genre for computational and empirical
Without a theoretical definition:
genres become lifeless texts, merely characterized by
formal attributes, and the communicative context, i.e.
the thing that makes genre important, is completely
Although this ”formalistic” approach is quite rewarding in
some restricted experimental settings (more than 95%
success rate), we can hardly generalize from it.
Future directions: AGI is a fertile land
for research and development…
Now that basic explorations have been carried out, we
should concentrate more on the correlation and
interrelation of the following variables:
Representation of genre classes
Number of genre classes
Nature of genre classes
Size of the whole corpus
Structured and unstructured noise
Genre-revealing features that account for the context that
genres carry with them
New computational models and algorithms…
Genre is a useful concept in many disciplines
Automatic genre classification is feasible, and there is ample
space for improvement
I am interested in your views on (web) genre:
send me your impressions, ideas, gut feelings and your genre
Facebook page: www.facebook.com/genresontheweb
Genre blog: www.forum.santini.se
Webrider’s Short proposal to EU: www.webrider.se
Bateman, John (2008) Multimodality and Genre,
Bawarshi, Anis S. and Reiff, Mary Jo (eds) (2010) Genre:
An Introduction to History, Theory, Research, and
Pedagogy (free book);
Bruce, Ian (2008) Academic Writing and Genre,
Dorgeloh, Heidrun and Wanner, Anja (2010) Syntactic
Variation and Genre, De Gruyter Mouton
Giltrow, Janet and Stein, Dieter (eds) (2009) Genres in
the Internet, John Benjamins Publishing Company
Heyd, Theresa (2008) Email Hoaxes: Form, function,
genre ecology, John Benjamins Publishing Company
Lee, David (2001) Genres, Registers, Text Types,
Domains, and Styles: Clarifying the Concepts and
Navigating a Path through the BNC Jungle, Language
Learning & Technology, September 2001, Vol. 5, Num. 3,
pp. 37-72, http://llt.msu.edu/vol5num3/pdf/lee.pdf
Luzón, María José, Ruiz-Madrid, María Noelia and
Villanueva, María Luisa (eds) (2010) Digital Genres,
New Literacies and Autonomy in Language
Learning, Cambridge Scholars Publishing
Martin, James and Rose, David (2008) Genre
Relations: Mapping Culture, Equinox
Puschmann, Cornelius (2010) The corporate blog as
an emerging genre of computer-mediated
communication: features, constraints, discourse
situation, Universitätsverlag Göttingen
WEGA prototype download, documentation and