The CW Corpus
A new resource for evaluating the
identification of complex words
Matthew Shardlow
The University of Manchester

http://lexicalsimplification.blogspot.co.uk

Lexical Simplification
● Complex Word Identification
– He profoundly changed.
● Substitution Generation
– Profoundly: extremely, very, deeply, acutely
● Word Sense Disambiguation
– Profoundly: extremely, very, deeply, acutely
● Synonym Ranking
– #1) deeply
– #2) extremely
– #3) acutely

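The four stages of the pipeline can be sketched as a chain of small functions. This is only an illustration, not the system behind the slides: every function name, lookup table and score below is a hypothetical stand-in, and word sense disambiguation is reduced to a hand-written filter.

```python
# Toy sketch of a four-stage lexical simplification pipeline.
# All tables and scores are made-up illustrations, not real data.

COMPLEX_WORDS = {"profoundly"}  # hypothetical output of an identifier
THESAURUS = {"profoundly": ["extremely", "very", "deeply", "acutely"]}
# Hypothetical "fits this context" judgements standing in for real WSD.
FITS_CONTEXT = {("profoundly", "changed"): {"extremely", "deeply", "acutely"}}
# Hypothetical simplicity scores (higher means simpler).
SIMPLICITY = {"deeply": 3, "extremely": 2, "acutely": 1, "very": 4}


def identify(tokens):
    """Stage 1: return the indices of words judged complex."""
    return [i for i, t in enumerate(tokens) if t.lower() in COMPLEX_WORDS]


def generate(word):
    """Stage 2: propose candidate substitutions."""
    return THESAURUS.get(word.lower(), [])


def disambiguate(word, context_word, candidates):
    """Stage 3: keep only candidates that fit the context."""
    key = (word.lower(), context_word.lower())
    allowed = FITS_CONTEXT.get(key, set(candidates))
    return [c for c in candidates if c in allowed]


def rank(candidates):
    """Stage 4: order candidates from simplest to least simple."""
    return sorted(candidates, key=lambda c: SIMPLICITY.get(c, 0), reverse=True)


def simplify(sentence):
    """Run all four stages and substitute the top-ranked candidate."""
    tokens = sentence.rstrip(".").split()
    for i in identify(tokens):
        context = tokens[i + 1] if i + 1 < len(tokens) else ""
        ranked = rank(disambiguate(tokens[i], context, generate(tokens[i])))
        if ranked:
            tokens[i] = ranked[0]
    return " ".join(tokens) + "."


print(simplify("He profoundly changed."))  # He deeply changed.
```

Because each stage consumes the output of the previous one, a mistake in `identify` cannot be repaired later in the chain.
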
Complex Words
● How do we define a Complex Word?
● Manual Definition
– Any word which impedes a reader's comprehension of a text.
● Heuristic Features
– Frequency
– Familiarity
– Length
– Context

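Two of the heuristic features, frequency and length, are easy to compute directly. The sketch below assumes a tiny made-up reference text and an arbitrary frequency threshold; the function names are hypothetical, and familiarity and context are left out because they need external resources.

```python
from collections import Counter


def build_frequencies(reference_text):
    """Count lower-cased word occurrences in a reference text."""
    return Counter(reference_text.lower().split())


def features(word, freqs):
    """Return two heuristic features: reference frequency and length."""
    return {"frequency": freqs[word.lower()], "length": len(word)}


def is_complex(word, freqs, max_freq=1):
    """Flag a word as complex when it is rare in the reference text.

    max_freq is an arbitrary illustrative threshold, not a tuned value.
    """
    return features(word, freqs)["frequency"] <= max_freq


# Tiny made-up reference text, for illustration only.
REFERENCE = "he changed a lot and he changed very deeply he changed profoundly"
FREQS = build_frequencies(REFERENCE)

print(features("profoundly", FREQS))  # {'frequency': 1, 'length': 10}
print(is_complex("profoundly", FREQS), is_complex("changed", FREQS))  # True False
```

With a realistic reference corpus the threshold would be chosen on held-out data rather than fixed by hand.
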
Complex Word Identification
● Important to get it right: propagation errors
– Correct: He profoundly changed → He deeply changed
– Incorrect: He profoundly changed → He profoundly turned
● No evaluation data.
● Gold standard data required.

Gold Standard Data
● Criteria for corpus entries:
– Annotated sentences.
– Coherent English.
– One complex word per sentence.
● Difficult to generate automatically.
● Expensive to generate manually.
● So, we mine Simple Wikipedia edit histories.

Simple Wikipedia Edit Histories
● Simple Wikipedia is:
– An online encyclopedia.
– Written in simplified English.
– Collaboratively edited.
– Available to download in XML format.
● Changes to articles are recorded in edit histories.
● Some changes are simplifications.

Simple Wikipedia Edit Histories
● Advantages:
– Fully automated
– High throughput
– Cost-effective
● Disadvantages:
– Content quality
– Sparsity of simplifications
– Data exhaustion

Mining – Extract Likely Candidates
● There are 2 stages to the mining process.
● Stage 1:
– 2 adjacent revisions are selected.
– A similarity score (TF-IDF) is calculated at sentence level.
– High scoring pairs are passed on.
– All other pairs are discarded.

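Stage 1 can be sketched in plain Python as follows. The slides only say that a TF-IDF similarity score is computed at sentence level; the use of cosine similarity, the smoothed IDF formula, the 0.6 threshold and the choice to skip unchanged sentences are all assumptions made for this illustration.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Build one TF-IDF vector (a dict of term weights) per sentence."""
    docs = [Counter(s.lower().split()) for s in sentences]
    n = len(docs)
    # Document frequency: in how many sentences each term occurs.
    df = Counter(term for d in docs for term in d)
    # Smoothed IDF, so terms found in every sentence keep a non-zero weight.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    return [{t: tf * idf[t] for t, tf in d.items()} for d in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def candidate_pairs(old_sents, new_sents, threshold=0.6):
    """Pair sentences from two adjacent revisions that are similar but not equal."""
    vecs = tfidf_vectors(old_sents + new_sents)
    old_v, new_v = vecs[: len(old_sents)], vecs[len(old_sents):]
    pairs = []
    for i, u in enumerate(old_v):
        for j, v in enumerate(new_v):
            score = cosine(u, v)
            # Identical sentences carry no edit, so they are skipped.
            if score >= threshold and old_sents[i] != new_sents[j]:
                pairs.append((old_sents[i], new_sents[j], score))
    return pairs


old = ["He profoundly changed the town .", "It was a cold evening ."]
new = ["He deeply changed the town .", "It was a cold evening ."]
for before, after, score in candidate_pairs(old, new):
    print(before, "->", after)
```

Only the edited sentence pair survives here: the unchanged sentence is skipped, and unrelated sentences score far below the threshold.
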
Mining – Validate Candidates
● There are 2 stages to the mining process.
● Stage 2: A series of checks
– One word difference.
– Real words. (not: spam / vandalism / nonsense)
– Different stems.
– Synonyms.
– Simplifying.

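The Stage 2 checks can be read as a sequence of filters applied in order. The sketch below is one possible reading, not the implementation behind the corpus: the word list, synonym table, frequency counts and the crude suffix stripper are hypothetical stand-ins for a real dictionary, thesaurus, frequency list and stemmer, and "simplifying" is approximated as "the new word is more frequent".

```python
# Hypothetical resources; none of these come from the slides.
DICTIONARY = {"cool", "cooler", "long", "calm", "chilly"}
SYNONYMS = {frozenset(("chilly", "cool")), frozenset(("calm", "cool"))}
FREQUENCY = {"cool": 50, "calm": 60, "chilly": 5}


def crude_stem(word):
    """Very rough suffix stripper, used only to illustrate the stem check."""
    for suffix in ("er", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def one_word_difference(old, new):
    """Return (old_word, new_word) if exactly one token differs, else None."""
    a, b = old.lower().split(), new.lower().split()
    if len(a) != len(b):
        return None
    diffs = [(x, y) for x, y in zip(a, b) if x != y]
    return diffs[0] if len(diffs) == 1 else None


def validate(old, new):
    """Return 'accepted' or the name of the first failed check."""
    pair = one_word_difference(old, new)
    if pair is None:
        return "one word difference"
    before, after = pair
    if before not in DICTIONARY or after not in DICTIONARY:
        return "real words"
    if crude_stem(before) == crude_stem(after):
        return "different stems"
    if frozenset(pair) not in SYNONYMS:
        return "synonyms"
    if FREQUENCY.get(after, 0) <= FREQUENCY.get(before, 0):
        return "simplifying"
    return "accepted"


template = "It was a {} evening."
for word in ("cuol", "cooler", "long", "calm", "chilly"):
    print(word, "->", validate(template.format(word), template.format("cool")))
```

Running the checks in this order means the cheap tests (token difference, dictionary lookup) discard most noise before the thesaurus and frequency lookups are needed.
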
Analysis
● Six Annotators
● Each given a 70 instance sample.
– 50 examples from the corpus (different for each).
– 20 common examples as a validation set.
● 2 annotators ruled out by validation set.
● Final corpus accuracy of: 97.5%.

Experiments
● Several experiments performed so far.
● Presented at ACL Student Research Workshop.
● 3 techniques for identification were compared.
● Sophisticated strategies gave little or no improvement over a baseline.

Summary
● Identifying Complex Words is important.
● The CW Corpus lets us evaluate methods.
● Preliminary results: sophisticated strategies give little improvement over a baseline.

References
● Corpus: http://tinyurl.com/cwcorpus
S. Devlin and J. Tait. The use of a psycholinguistic database in the simplification of text for aphasic readers. Linguistic Databases, p 161–173, 1998.
M. Yatskar, B. Pang, C. Danescu-Niculescu-Mizil, and L. Lee. For the sake of simplicity: unsupervised extraction of lexical simplifications from Wikipedia. In HLT ’10 NAACL, p 365–368, Stroudsburg, PA, USA, 2010.

Any Questions?
● Corpus: http://tinyurl.com/cwcorpus

Annotator Agreement
Annotator Index   Kappa   Sample Accuracy
1                 1       98%
2                 1       96%
3                 0.4     70%
4                 1       100%
5                 0.6     84%
6                 1       96%

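The reported corpus accuracy of 97.5% is consistent with averaging the sample accuracy of the four annotators whose kappa on the validation set was 1. That reading is an inference from the agreement figures, not something the slides state: the kappa cut-off below, and the assumption that annotators 3 and 5 are the two who were ruled out, are both guesses.

```python
# (annotator index, kappa on the validation set, sample accuracy in percent)
ANNOTATORS = [
    (1, 1.0, 98),
    (2, 1.0, 96),
    (3, 0.4, 70),
    (4, 1.0, 100),
    (5, 0.6, 84),
    (6, 1.0, 96),
]


def retained(rows, min_kappa=1.0):
    """Keep annotators whose kappa reaches the (assumed) cut-off."""
    return [row for row in rows if row[1] >= min_kappa]


def mean_accuracy(rows):
    """Average the sample accuracy over the given annotators."""
    return sum(acc for _, _, acc in rows) / len(rows)


kept = retained(ANNOTATORS)
print([index for index, _, _ in kept])  # [1, 2, 4, 6]
print(mean_accuracy(kept))              # 97.5
```
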
Example Discarded Pairs
● It was a _____ evening.
● Nonsense Words (spelling correction)
– Cuol → Cool
● Different Stems (sense correction)
– Cooler → Cool
● Synonymy (meaning change)
– Long → Cool
● Simplifying
– Calm → Cool

The CW Corpus PITR2013