We present a preliminary study that explores whether text features used for readability assessment are reliable genre-revealing features. We empirically explore the difference between genre and domain. We carry out two sets of experiments with both supervised and unsupervised methods. Findings on the Swedish national corpus (the SUC) show that readability cues are good indicators of genre variation.
An Exploratory Study on Genre Classification using Readability Features
Johan Falkenjack, Marina Santini, Arne Jönsson
SICS East Swedish ICT
SUC's Text Category                   Genre/Domain
A  Press, Reportage                   Genre
B  Press, Editorials                  Genre
C  Press, Reviews                     Genre
E  Skills, Trades, Hobbies            Domain
F  Popular lore                       Domain
G  Biographies, essays                Genre
H  Miscellaneous                      Mixed
J  Learned and scientific writing     Genre
K  Imaginative prose                  Genre
SLTC 2016, UMEÅ, SWEDEN
Confusion Matrix: clusters evaluated against 6 SUC genres (Exp4)
Research questions:
1. Are there any empirical differences between the notions of genre and domain?
2. Are readability assessment features reliable genre-revealing features?
Theoretical distinction:
Domain = subject field
Genre = conventionalized textual pattern
118 readability assessment features: lexical, morphological, and syntactic features (e.g. average sentence length, frequent lemmas, and average dependency distance) and 13 combined readability measures (e.g. LIX and OVIX).
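As an illustration of the combined measures named above, the sketch below computes average sentence length, LIX, and OVIX for a short text. It uses the commonly cited definitions (LIX = words per sentence plus the percentage of words longer than six characters; OVIX = log(tokens) / log(2 - log(types) / log(tokens))). The regex tokenizer and the function names are assumptions made for this example and are not taken from the study's feature extraction pipeline.

```python
import math
import re


def tokenize(text):
    """Naive splitting into sentences and lowercase word tokens (an assumption)."""
    sentences = [s for s in re.split(r"[.!?:]+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    return sentences, words


def avg_sentence_length(text):
    """Average number of word tokens per sentence."""
    sentences, words = tokenize(text)
    return len(words) / len(sentences)


def lix(text):
    """LIX: words per sentence + percentage of words longer than 6 characters."""
    _, words = tokenize(text)
    long_words = [w for w in words if len(w) > 6]
    return avg_sentence_length(text) + 100 * len(long_words) / len(words)


def ovix(text):
    """OVIX: a word variation index based on token and type counts."""
    _, words = tokenize(text)
    n_tokens, n_types = len(words), len(set(words))
    # The formula is undefined when every token is unique or the text is tiny.
    if n_tokens < 2 or n_types == n_tokens:
        return math.inf
    return math.log(n_tokens) / math.log(2 - math.log(n_types) / math.log(n_tokens))


sample = "Katten sover. Hunden springer efter katten i parken."
print(avg_sentence_length(sample), lix(sample), ovix(sample))
```

On very short texts OVIX is unstable, which is why the sketch returns infinity when all tokens are unique instead of dividing by zero.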
Conclusion
Findings on the SUC show that readability cues are good indicators of genre variation (H1), but work less efficiently on domain distinctions. Arguably, these results confirm H2 and show empirically the existence of a theoretical divide between genres and domains. Future work includes explorations of genre and domains in the Brown corpus and other text collections.
H1: Agglomerative Hierarchical Clustering with Ward's Linkage (AHCW)
Readability assessment features show some degree of robustness in the identification of SUC genres even when used with an unsupervised method such as AHCW.
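To make the clustering step concrete, here is a minimal pure-Python sketch of agglomerative clustering with Ward's criterion: at each step it merges the two clusters whose union gives the smallest increase in within-cluster sum of squares. It is a naive illustration, not the implementation used in the experiments, and the two-dimensional feature vectors are toy values invented for the example.

```python
def ward_clusters(points, k):
    """Greedily merge clusters until k remain, using Ward's merge cost."""
    k = max(k, 1)
    # Each cluster is a pair: (member indices, centroid).
    clusters = [([i], list(p)) for i, p in enumerate(points)]

    def merge_cost(a, b):
        # Increase in within-cluster sum of squares if a and b are merged.
        (ia, ca), (ib, cb) = a, b
        sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
        return len(ia) * len(ib) / (len(ia) + len(ib)) * sq_dist

    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]))
        (ia, ca), (ib, cb) = clusters[i], clusters[j]
        n = len(ia) + len(ib)
        # The merged centroid is the size-weighted mean of the two centroids.
        centroid = [(len(ia) * x + len(ib) * y) / n for x, y in zip(ca, cb)]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append((ia + ib, centroid))

    labels = [0] * len(points)
    for label, (members, _) in enumerate(clusters):
        for m in members:
            labels[m] = label
    return labels


# Toy vectors (hypothetical): [average sentence length, LIX] per text.
texts = [(10, 25), (11, 27), (12, 26), (24, 50), (25, 52), (23, 51)]
print(ward_clusters(texts, 2))
```

In practice the features would normally be standardized first, since Ward's criterion relies on Euclidean distances and is sensitive to feature scale.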
H2: Naive Bayes & Support Vector Machines
Domain and genre are two different notions that are NOT represented by the same type of features. Supervised classification (NB and SVM) shows that readability assessment features work better on genres and less efficiently on domains.
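The supervised setup can be sketched with a small Gaussian Naive Bayes classifier over numeric readability features. The poster does not state which Naive Bayes variant was used, so the Gaussian form is an assumption, the SVM counterpart is omitted, and the training values below are invented toy data labeled with two SUC category codes.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate a log prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small variance floor avoids division by zero for constant features.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (math.log(n / len(X)), means, variances)
    return model


def predict(model, row):
    """Return the class with the highest log posterior for one feature vector."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances)
        )
        return log_prior + log_likelihood
    return max(model, key=log_posterior)


# Toy data (hypothetical): [average sentence length, LIX] per text.
X = [(10, 25), (11, 27), (12, 26), (24, 50), (25, 52), (23, 51)]
y = ["K", "K", "K", "J", "J", "J"]  # K = imaginative prose, J = learned writing
model = fit_gaussian_nb(X, y)
print(predict(model, (11, 26)), predict(model, (24, 51)))
```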
Overall Results: F-scores and Accuracy (Supervised)
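For reference, the two evaluation measures named here can be computed from gold and predicted labels as in the following sketch. The label sequences are invented for illustration and do not reproduce any result from the study; whether the reported F-scores are per class or averaged is not stated on the poster, so the function returns per-class values.

```python
def accuracy(gold, pred):
    """Fraction of items whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f_scores(gold, pred):
    """Per-class F1: harmonic mean of precision and recall for each label."""
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


# Hypothetical gold and predicted SUC category codes.
gold = ["A", "A", "B", "B", "K", "K"]
pred = ["A", "B", "B", "B", "K", "A"]
print(accuracy(gold, pred))
print(f_scores(gold, pred))
```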