English Spellchecker
Project
Lviv Data Science Summer School
2016
Input data
Datasets for training: 800 000 paragraphs (2 million words)
Datasets for testing: from the CoNLL corpus
We want to achieve - Part 1
1. Spellchecker focused on detecting and correcting Mec errors
Target error
Mec error:
● description: spelling, punctuation, capitalization
● examples:
○ This knowledge maybe relevant to them.
○ To tell his or her ralatives…
○ ...their altitudes will be easily changed.
We want to achieve - Part 2
● Implement several language models and choose the best one for Mec mistakes
● Practice Hidden Markov Models and Python
● Have fun working as a team :)
Methods & Solution used
● Unigram model
○ gave a better understanding of the errors and of the sensitivity of each parameter
○ the preceding and following context of a word is important
○ part-of-speech tags could help
● Custom N-gram Model
○ N-gram models based on the preceding and the following context of a word
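A minimal sketch of how a model can use both the preceding and the following word to pick a correction. It is not the team's implementation: the bigram counts, the add-alpha smoothing, the tiny corpus, and the names `train_bigrams`, `context_score`, and `best_candidate` are all assumptions made for illustration.

```python
from collections import Counter

def train_bigrams(sentences):
    """Count words and adjacent word pairs in tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]  # sentence boundary markers
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def context_score(prev, word, nxt, unigrams, bigrams, alpha=1.0):
    """Score a candidate as P(word | prev) * P(nxt | word), add-alpha smoothed."""
    vocab = len(unigrams)
    left = (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab)
    right = (bigrams[(word, nxt)] + alpha) / (unigrams[word] + alpha * vocab)
    return left * right

def best_candidate(prev, candidates, nxt, unigrams, bigrams):
    """Return the candidate that fits best between prev and nxt."""
    return max(candidates,
               key=lambda w: context_score(prev, w, nxt, unigrams, bigrams))

# Hypothetical toy corpus; the real model was trained on far more text.
corpus = [
    ["their", "attitudes", "will", "change"],
    ["their", "attitudes", "will", "be", "changed"],
    ["the", "altitudes", "are", "high"],
]
unigrams, bigrams = train_bigrams(corpus)
choice = best_candidate("their", ["altitudes", "attitudes"], "will",
                        unigrams, bigrams)
```

Both candidates are valid English words, so only the surrounding words can separate them. On this toy corpus the sketch prefers "attitudes" between "their" and "will".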
Tested Methods
● Custom N-gram Model with Part-of-Speech tagging
○ part-of-speech tagging did not improve the results
● Conditional Random Fields
○ too slow to train
● Unigram model
○ performs well and is really simple
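To illustrate why a unigram model is simple, here is a sketch of one common design: generate every string one edit away from the input and keep the most frequent known word. This is an assumed design rather than the one used in the project, and `edits1`, `correct`, and the toy word counts are hypothetical.

```python
from collections import Counter
import string

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word, counts):
    """Keep known words; otherwise pick the most frequent known neighbour."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    if not candidates:
        return word  # nothing known within one edit: leave the word unchanged
    return max(candidates, key=counts.__getitem__)

# Hypothetical toy counts standing in for the training data.
counts = Counter("to tell his or her relatives and other relatives".split())
fixed = correct("ralatives", counts)
```

A model like this repairs non-words such as "ralatives", but it leaves "altitudes" untouched because that is itself a valid word, which fits the observation that context matters.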
Results
● Custom N-gram Model
○ Precision: 30%
○ Recall: 41%
TYPE        AMU   CAMB   CUUI   IITB    IPN   NARA   NTHU    PKU   POST    RAC   SJTU    UFC    UMC
Vt        11.61  20.00   5.79   1.90   0.98  16.18  12.90  14.16   3.31  29.17   4.59   0.00  17.60
ArtOrDet  18.75  54.74  67.38   1.81   0.36  54.42  37.96   9.65  59.41   0.66  14.63   0.00  33.42
Mec       31.56  30.67  17.47   1.13   4.79  37.28   7.17  31.69  37.88  45.82   1.10   0.00  22.31
Recall (in %) for each error type with alternative answers, indicating how well each team performs against a particular error type
Results
2nd position in the CoNLL benchmark in terms of recall, with average precision!
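The ranking claim can be checked against the Mec row of the table with a short sketch. It assumes the team's 41% recall is directly comparable to the published figures, which the slides do not confirm. The helper names and the error counts passed to `precision_recall` are hypothetical, chosen only to reproduce 30% and 41%.

```python
# Mec recall (%) per team, copied from the table in the slides.
MEC_RECALL = {
    "AMU": 31.56, "CAMB": 30.67, "CUUI": 17.47, "IITB": 1.13, "IPN": 4.79,
    "NARA": 37.28, "NTHU": 7.17, "PKU": 31.69, "POST": 37.88, "RAC": 45.82,
    "SJTU": 1.10, "UFC": 0.00, "UMC": 22.31,
}

def rank_of(score, others):
    """1-based rank of score among the other scores (higher is better)."""
    return 1 + sum(1 for value in others.values() if value > score)

def precision_recall(true_pos, false_pos, false_neg):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return (true_pos / (true_pos + false_pos),
            true_pos / (true_pos + false_neg))

our_rank = rank_of(41.0, MEC_RECALL)

# Hypothetical counts, NOT from the slides: 123 correct edits,
# 287 spurious edits, 177 missed errors.
precision, recall = precision_recall(123, 287, 177)
```

Only RAC (45.82) has a higher Mec recall than 41, so the sketch places the model second. The slides give no precision table, so the "average precision" part of the claim cannot be checked the same way.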
OUR TEAM
Jordi Carrera Ventura
Charlotte Rudnik
Kateryna Aloshkina
Igor Kraynikov
Thank you, Jordi!
Thank you, UCU!
