An Exploratory Study on Genre Classification using Readability Features

Johan Falkenjack, Marina Santini, Arne Jönsson
SICS East Swedish ICT

SUC's Text Category                     Genre/Domain
A   Press, Reportage                    Genre
B   Press, Editorials                   Genre
C   Press, Reviews                      Genre
E   Skills, Trades, Hobbies             Domain
F   Popular lore                        Domain
G   Biographies, essays                 Genre
H   Miscellaneous                       Mixed
J   Learned and scientific writing      Genre
K   Imaginative prose                   Genre

SLTC 2016, Umeå, Sweden

[Figure: Confusion Matrix: clusters evaluated against 6 SUC genres (Exp4)]

Research questions:
1. Are there any empirical differences between the notions of genre and domain?
2. Are readability assessment features reliable genre-revealing features?

Theoretical distinction:
Domain = subject field
Genre = conventionalized textual pattern

118 readability assessment features: lexical, morphological, syntactic features (e.g. average sentence length, frequent lemmas, and average dependency distance) and 13 combined readability measures (e.g. LIX and OVIX).
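
As an illustration of the two combined measures named above, here is a minimal sketch using their commonly cited definitions: LIX as average sentence length plus the percentage of words longer than six characters, and OVIX as a log-based word variation index. The tokenizer, the sentence splitter, and the function names are simplifying assumptions for this sketch, not the feature extraction used in the study.

```python
import math
import re


def tokenize(text):
    """Rough tokenizer (an assumption): every run of word characters is a word."""
    return re.findall(r"\w+", text.lower())


def count_sentences(text):
    """Rough sentence count (an assumption): each run of . ! or ? ends a sentence."""
    return max(1, len(re.findall(r"[.!?]+", text)))


def lix(text):
    """LIX = words per sentence + percentage of words longer than six characters."""
    words = tokenize(text)
    if not words:
        raise ValueError("text contains no words")
    long_words = sum(1 for w in words if len(w) > 6)
    return len(words) / count_sentences(text) + 100 * long_words / len(words)


def ovix(text):
    """OVIX = log(tokens) / log(2 - log(types) / log(tokens))."""
    words = tokenize(text)
    n_tokens, n_types = len(words), len(set(words))
    if n_tokens < 2:
        raise ValueError("OVIX needs at least two tokens")
    if n_types == n_tokens:
        # Every word is unique: the denominator is log(1) = 0.
        return math.inf
    return math.log(n_tokens) / math.log(2 - math.log(n_types) / math.log(n_tokens))


sample = "Katten sover. Hunden springer fort i parken."
print(round(lix(sample), 2))   # 7 words, 2 sentences, 1 long word
print(ovix("a b a b"))         # 4 tokens, 2 types
```

Both measures depend heavily on tokenization, so values from this sketch are only comparable with each other, not with published figures.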
  
Conclusion
Findings on the SUC show that readability cues are good indicators of genre variation (H1), but work less efficiently on domain distinctions. Arguably, these results confirm H2 and show empirically the existence of a theoretical divide between genres and domains. Future work includes explorations of genre and domains in the Brown corpus and other text collections.

H1: Agglomerative Hierarchical Clustering with Ward's Linkage (AHCW)
Readability assessment features show some degree of robustness in the identification of SUC genres even when used with an unsupervised method such as AHCW.
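
To make the clustering step concrete, the following is a minimal pure-Python sketch of agglomerative clustering with Ward's criterion: at each step it merges the pair of clusters whose union gives the smallest increase in within-cluster sum of squares. The function names and the toy two-feature vectors are hypothetical; the poster does not state which toolkit or settings were used, and a real run would operate on the full (ideally standardized) feature vectors.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b are merged."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def ward_clustering(vectors, n_clusters):
    """Greedy bottom-up merging until n_clusters remain; returns lists of indices."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None  # (cost, i, j) of the cheapest merge found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = ward_cost([vectors[k] for k in clusters[i]],
                                 [vectors[k] for k in clusters[j]])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Hypothetical documents described by two made-up feature values each.
docs = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0], [5.0, 50.0]]
print(ward_clustering(docs, 3))
```

This naive version recomputes every pairwise cost at each step, which is fine for a sketch but too slow for a corpus-sized experiment.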
  
H2: Naive Bayes & Support Vector Machines
Domain and genre are two different notions that are NOT represented by the same type of features. Supervised classification (NB and SVM) shows that readability assessment features work better on genres and less efficiently on domains.
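
The supervised setup can be sketched with a small Gaussian Naive Bayes classifier over numeric feature vectors. The Gaussian variant is an assumption made here because readability features are numeric; the poster does not say which Naive Bayes variant was used, and the SVM is not covered by this sketch. The feature values and function names are invented for illustration; only the labels J and K are taken from the SUC category table.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate a log prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # var_floor keeps the variance strictly positive for constant features.
        variances = [sum((v - m) ** 2 for v in col) / n + var_floor
                     for col, m in zip(zip(*rows), means)]
        model[label] = (math.log(n / len(X)), means, variances)
    return model


def predict(model, row):
    """Return the class with the highest log posterior for one feature vector."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances))
        return log_prior + log_likelihood
    return max(model, key=log_posterior)


# Made-up (average sentence length, LIX) pairs for two SUC genre labels.
X = [[12.0, 30.0], [13.0, 32.0], [25.0, 55.0], [27.0, 58.0]]
y = ["K", "K", "J", "J"]
model = fit_gaussian_nb(X, y)
print(predict(model, [26.0, 56.0]))
```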
  	
  
[Figure: Overall Results: F-scores]
[Figure: Accuracy (Supervised)]
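
For reference, accuracy and per-class F-scores of the kind reported for the supervised runs can be computed from gold and predicted labels as sketched below. This assumes the balanced F1 variant; the poster only says "F-scores". The label sequences are invented and are not the study's results.

```python
def accuracy(gold, predicted):
    """Share of items whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def f_scores(gold, predicted):
    """Per-class F1: harmonic mean of precision and recall for each label."""
    pairs = list(zip(gold, predicted))
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


# Invented gold and predicted SUC category labels for six documents.
gold = ["A", "A", "B", "B", "K", "K"]
pred = ["A", "B", "B", "B", "K", "A"]
print(accuracy(gold, pred))
print(f_scores(gold, pred))
```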
  
