An Exploratory Study on Genre Classification using Readability Features

We present a preliminary study that explores whether text features used for readability assessment are reliable genre-revealing features. We empirically explore the difference between genre and domain. We carry out two sets of experiments with both supervised and unsupervised methods. Findings on the Swedish national corpus (the SUC) show that readability cues are good indicators of genre variation.

Johan Falkenjack, Marina Santini, Arne Jönsson

SICS East Swedish ICT
SUC's Text Category                  Genre/Domain
A  Press, Reportage                  Genre
B  Press, Editorials                 Genre
C  Press, Reviews                    Genre
E  Skills, Trades, Hobbies           Domain
F  Popular lore                      Domain
G  Biographies, essays               Genre
H  Miscellaneous                     Mixed
J  Learned and scientific writing    Genre
K  Imaginative prose                 Genre

SLTC 2016, Umeå, Sweden

[Figure: Confusion matrix, clusters evaluated against 6 SUC genres (Exp4)]

Research questions:

1. Are there any empirical differences between the notions of genre and domain?
2. Are readability assessment features reliable genre-revealing features?

Theoretical distinction:

Domain = subject field
Genre = conventionalized textual pattern

118 readability assessment features: lexical, morphological, syntactic features (e.g. average sentence length, frequent lemmas, and average dependency distance) and 13 combined readability measures (e.g. LIX and OVIX).
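As a rough illustration of two of the combined measures named above, the sketch below computes LIX and OVIX using their commonly cited definitions: LIX is the average sentence length plus the percentage of words longer than six letters, and OVIX is log(tokens) / log(2 - log(types) / log(tokens)). The regex tokenizer and sentence splitter are simplifying assumptions for this example and are not the feature extraction pipeline used in the study.

```python
import math
import re


def tokenize(text):
    """Lowercased word tokens from a naive regex (an assumption for this sketch)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def split_sentences(text):
    """Naive split on ., ! and ? (an assumption, not a real sentence segmenter)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def lix(text):
    """LIX: words per sentence plus percentage of words longer than six letters."""
    words = tokenize(text)
    sentences = split_sentences(text)
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)


def ovix(text):
    """OVIX: word variation index based on token and type counts."""
    words = tokenize(text)
    n_tokens = len(words)
    n_types = len(set(words))
    if n_types == n_tokens:
        # No repeated word: the denominator log(1) is zero, so OVIX is unbounded.
        return math.inf
    return math.log(n_tokens) / math.log(2 - math.log(n_types) / math.log(n_tokens))


sample = "The cat sleeps. The cat investigates everything."
print(round(lix(sample), 2), round(ovix(sample), 2))
```

Both functions operate on raw text here. In practice such measures are usually computed on tokenized and sentence-segmented corpus annotations rather than on regex output.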

Conclusion

Findings on the SUC show that readability cues are good indicators of genre variation (H1), but work less efficiently on domain distinctions. Arguably, these results confirm H2 and show empirically the existence of a theoretical divide between genres and domains. Future work includes explorations of genres and domains in the Brown corpus and other text collections.

H1: Agglomerative Hierarchical Clustering with Ward's Linkage (AHCW)

Readability assessment features show some degree of robustness in the identification of SUC genres even when used with an unsupervised method such as AHCW.
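The unsupervised setup can be sketched as follows, assuming SciPy's `linkage` with `method="ward"` as the clustering implementation (the poster does not say which toolkit was used). The feature matrix is synthetic stand-in data with three hypothetical genre labels, not SUC data, so the resulting table only illustrates how clusters are cross-tabulated against known genres.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 3 "genres", 20 documents each, 5 features.
centers = np.array(
    [[0, 0, 0, 0, 0], [5, 5, 5, 5, 5], [-5, 5, -5, 5, -5]], dtype=float
)
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 5)) for c in centers])
genres = np.repeat(["A", "B", "C"], 20)

# Standardize each feature so that no single feature dominates the distances.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Agglomerative hierarchical clustering with Ward's linkage.
Z = linkage(X_std, method="ward")

# Cut the dendrogram into at most 3 flat clusters.
clusters = fcluster(Z, t=3, criterion="maxclust")

# Cross-tabulate genres (rows) against clusters (columns).
genre_names, g_idx = np.unique(genres, return_inverse=True)
cluster_ids, c_idx = np.unique(clusters, return_inverse=True)
table = np.zeros((len(genre_names), len(cluster_ids)), dtype=int)
np.add.at(table, (g_idx, c_idx), 1)

# Purity: share of documents that fall in their cluster's majority genre.
purity = table.max(axis=0).sum() / table.sum()
print(table)
print(purity)
```

Purity is used here only as one simple way to summarize such a table. Which evaluation measure the study applied to its clusters is not stated in the text.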

H2: Naive Bayes & Support Vector Machines

Domain and genre are two different notions that are NOT represented by the same type of features. Supervised classification (NB and SVM) shows that readability assessment features work better on genres and less efficiently on domains.
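A minimal sketch of the supervised comparison, assuming scikit-learn with `GaussianNB` and a linear-kernel `SVC` (the poster names only "NB and SVM", so the variant, kernel, and 5-fold cross-validation are assumptions). The helper `evaluate` is hypothetical, and the data is synthetic. To compare genres with domains, the same function would be called once with genre labels and once with domain labels over the same feature matrix.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def evaluate(X, y, folds=5):
    """Return mean cross-validated accuracy and macro F-score per classifier."""
    models = {
        "naive_bayes": GaussianNB(),
        # Scaling inside the pipeline keeps test folds out of the fit statistics.
        "svm": make_pipeline(StandardScaler(), SVC(kernel="linear")),
    }
    results = {}
    for name, model in models.items():
        scores = cross_validate(
            model, X, y, cv=folds, scoring=["accuracy", "f1_macro"]
        )
        results[name] = {
            "accuracy": float(scores["test_accuracy"].mean()),
            "f1_macro": float(scores["test_f1_macro"].mean()),
        }
    return results


# Synthetic stand-in data: 3 classes, 40 documents each, 6 features.
# Class k is shifted by 3*k in every feature, so it is easy by construction.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 40)
X = rng.normal(size=(120, 6)) + 3.0 * y[:, None]

results = evaluate(X, y)
for name, scores in results.items():
    print(name, scores)
```

The scores printed for this toy data say nothing about the SUC. The sketch shows only the shape of the evaluation loop.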

[Figures: Overall results (F-scores); Accuracy (supervised)]
