University Politehnica of Bucharest
Faculty of Automatic Control and Computers
Computer Science Department
FILLING THE GAPS USING
GOOGLE 5-GRAMS CORPUS
Costin-Gabriel Chiru - costin.chiru@cs.pub.ro
Andrei Hanganu
Traian Rebedea
Stefan Trausan-Matu
Contents
• Problem presentation
• Our solution
• Assumptions
• Methodology
• Candidates filtering heuristics
• Experiments and results
• Conclusions
23.07.2010 ICSOFT 2010 1
The Problem
• Many projects attempt to digitize the content of publications:
– Project Gutenberg (http://www.gutenberg.org/wiki/Main_Page);
– The Million Book Project (http://www.rr.cs.cmu.edu/mbdl.htm);
– The Runeberg Project (http://runeberg.org/);
– Google Book Search (http://books.google.com/);
– Many others.
• Problems:
– Very old documents;
– Partially damaged paper;
– Cheap (poor quality) paper.
• Result: OCR systems are unable to
fully recognize the content of some
documents!
Our Solution
• A probabilistic method for text recovery that
tries to identify the words missing
from the digital form of the document.
• Based on:
– “Web 1T 5-gram Version 1” corpus – n-gram
corpus provided by Google (used to generate
candidates)
Gaps
• We focus on reconstructing damaged documents
by predicting the most plausible word sets for
filling the missing areas that result from
conversion to digital form – we call them gaps.
• A very important property of a gap is its
dimension – the number of characters or words
that can be placed inside the gap.
Assumptions
• Our method is based on two assumptions:
– Intra-document similarity. The document model
has 2 components:
• The style model – the structure of the text;
• The language model – the vocabulary used by the
author (n-grams and their frequencies).
– The Google corpus is large enough to
subsume most of the language models of the
documents posted on the Internet:
• Any word that does not appear in this corpus should
not be considered as a possible candidate to fill in the
gaps.
Methodology (1)
• The style model of the document → dimension of the gap.
• 2 heuristics:
• Estimated character count ([min_chars, max_chars]) –
from the document format: margins and indentation;
• Estimated word count ([min_words, max_words]) –
uses the previous heuristic and the distributions of word
length (in characters) and of the number of words
per phrase.
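The word-count heuristic can be illustrated with a minimal sketch. The slides do not give the exact formula, so the version below assumes the simplest possible bound: the gap holds the fewest words when filled with the longest words seen in the document and the most words when filled with the shortest ones, with one space between words. The function name and the use of minimum and maximum lengths are assumptions made for illustration only.

```python
import math

def estimate_word_count(min_chars, max_chars, word_lengths):
    """Return (min_words, max_words) for a gap of min_chars..max_chars characters.

    word_lengths: lengths of words observed in the undamaged text.
    Assumption: n words occupy sum(lengths) + (n - 1) separating spaces.
    """
    shortest = min(word_lengths)
    longest = max(word_lengths)
    # Fewest words: the gap is filled with the longest observed words.
    min_words = max(1, math.ceil((min_chars + 1) / (longest + 1)))
    # Most words: the gap is filled with the shortest observed words.
    max_words = max(min_words, (max_chars + 1) // (shortest + 1))
    return min_words, max_words

# A gap of 14 to 20 characters in a document whose words are 2 to 8 letters long.
print(estimate_word_count(14, 20, [2, 4, 5, 8]))
```

A real implementation would likely use the full length distribution (for example percentiles) rather than the extremes, which give a deliberately loose interval.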
Methodology (2)
• The language model of the document → detect the missing words.
1. Start from the partial words at the beginning or
at the end of the gap.
2. Use both the n-gram corpus and the words that
have been correctly identified before and after
the gap in order to identify the first and last word
of the gap:
• Use the last 4 words before the gap and the first 4 after
it to detect the most probable first and last
word of the gap using the n-grams from the corpus
(since the maximum order of n-grams in the corpus is 5);
Methodology (3)
• If there is no such 5-gram, then the order of the n-gram
is decreased repeatedly down to bigrams, where we
consider only the word immediately before the gap and
the word immediately after it;
• The same thing happens when the gap is near the start
or the end of a phrase.
3. The possible candidates are stored and the
process is restarted for each of these candidates in
order to find the rest of the words in the gap.
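The back-off lookup described in steps 2 and 3 can be sketched as follows. This is only an illustration: the n-gram store is a small in-memory dictionary with made-up counts, standing in for the Web 1T corpus, and the function name is hypothetical. It handles the left side of the gap; the right side would be symmetric, matching on the words that follow the gap.

```python
def next_word_candidates(left_context, ngram_counts, max_order=5):
    """Return (order, {word: count}) for words that may follow left_context.

    ngram_counts maps word tuples to counts (a toy stand-in for the corpus).
    The search starts at the highest order and backs off down to bigrams.
    """
    for order in range(max_order, 1, -1):
        history = tuple(left_context[-(order - 1):])
        if len(history) < order - 1:
            continue  # not enough context, e.g. the gap is near a phrase start
        found = {
            ngram[-1]: count
            for ngram, count in ngram_counts.items()
            if len(ngram) == order and ngram[:-1] == history
        }
        if found:
            return order, found
    return None, {}  # no continuation even at bigram order

# Toy counts, invented for this example.
toy_counts = {
    ("more", "narrow", "interpretation"): 5,
    ("more", "narrow", "approach"): 3,
    ("narrow", "focus"): 7,
}
order, words = next_word_candidates(["an", "even", "more", "narrow"], toy_counts)
print(order, words)
```

Each returned word would then be appended to the context and the lookup repeated, which is what produces the branches discussed on the following slides.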
Methodology (4)
4. The process ends when one of the following
situations is reached:
• The number of words or characters exceeds the estimated
word or character count → the branches are too long to be
valid → they can be discarded;
• A left-side branch matches a right-side branch at some
point, identifying a valid candidate for the missing words;
• The left-side branch has reached an end-of-sentence mark-up
(</S>) AND the right-side one has reached a beginning-of-
sentence mark-up (<S>). At this point a “partial match” has
been obtained, which contains a possibly unrecoverable gap
inside it.
– If the added size of the branches fits in the estimated character
and word count → a valid candidate.
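The three stopping conditions can be expressed as a small classifier over a pair of branches. The sketch below is an interpretation, not the authors' code: it assumes that "matching" means a suffix of the left branch equals a prefix of the right branch, that sentence markers do not count toward the gap size, and that only the upper bounds of the estimates are checked. All names are hypothetical.

```python
END, START = "</S>", "<S>"

def fits(words, max_words, max_chars):
    """True if the words, joined by single spaces, fit the estimated gap size."""
    real = [w for w in words if w not in (START, END)]
    chars = sum(len(w) for w in real) + max(len(real) - 1, 0)
    return len(real) <= max_words and chars <= max_chars

def classify(left, right, max_words, max_chars):
    """Return (status, words) for a left-side and a right-side branch."""
    if not fits(left, max_words, max_chars) or not fits(right, max_words, max_chars):
        return "discard", None  # a branch is already too long
    # Full match: the branches overlap on k words where they meet.
    for k in range(min(len(left), len(right)), 0, -1):
        if left[-k:] == right[:k]:
            merged = left + right[k:]
            if fits(merged, max_words, max_chars):
                return "match", merged
            return "discard", None
    # Partial match: both branches hit a sentence boundary before meeting.
    if left and right and left[-1] == END and right[0] == START:
        if fits(left + right, max_words, max_chars):
            return "partial", left + right
        return "discard", None
    return "continue", None  # keep expanding both branches

print(classify(["interpretation", "is"], ["is", "that"], 3, 25))
```

Returning "continue" is what triggers another round of n-gram lookups on each side.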
Encountered Problems
• No continuation possibility for a branch:
– Decrease the order of n-grams;
– If already at bigram order, the branch is
discarded.
• A very large number of candidates is
generated for each possible word (nomin
candidates are generated)
– The candidates have to be filtered!
Candidates Filtering Heuristics (1)
• POS-based: a heuristic that predicts the POS of the words
and discards the words that do not have the predicted
POS (TreeTagger).
• Semantics-based: discard the branches that do not
contain words related to the rest of the document
(based on lexical chains built using WordNet).
• Frequency-based: prefer the branches with higher
scores for the n-grams in the corpus.
• Considering these heuristics, some scores are
computed for every word added to a branch.
Candidates Filtering Heuristics (2)
• These values are then combined in order to provide a
general score of the branch.
• A heuristic is used – the distance to the nearest end of the
gap – to determine the importance of each word's score
(the error propagates from the ends to the middle
of the gap).
• Finally, the obtained scores are normalized with respect
to the number of words in the branch and the
results are ordered according to this final score.
• The branch with the highest score is used.
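A possible shape for the branch score is sketched below. The slides do not state how the per-word scores are combined or how distance is turned into a weight, so both choices here are assumptions: each word is taken to have a single combined score already, and its weight is 1 / (1 + distance to the nearest end of the gap), so that words in the middle, where more error has accumulated, count less.

```python
def branch_score(word_scores):
    """Score a branch from per-word scores (one combined score per word)."""
    n = len(word_scores)
    if n == 0:
        return 0.0
    total = 0.0
    for i, score in enumerate(word_scores):
        distance = min(i, n - 1 - i)     # distance to the nearest end of the gap
        total += score / (1 + distance)  # assumed weighting: middle words count less
    return total / n                     # normalize by the number of words

def best_branch(branches):
    """Return the name of the branch with the highest normalized score."""
    return max(branches, key=lambda name: branch_score(branches[name]))

# Two hypothetical branches with invented per-word scores.
print(best_branch({"a": [1.0], "b": [0.5, 0.5]}))
```

Normalizing by length keeps long branches from winning only because they accumulate more terms.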
Experiments (1)
• Start from full documents and remove some
parts in order to simulate the gaps
(http://en.wikipedia.org/wiki/Literature).
• “An even more narrow interpretation is that
(<gap>) text have a physical form, ...”
• TreeTagger (word, POS, lemma):
– An DT an
– even RB even
– more RBR more
– narrow JJ narrow
– <gap> NN <unknown>
– text NN text
– have VBP have
– a DT a
– physical JJ physical
– form NN form
– , , ,
Experiments (2)
• The estimated word count was established to be 3.
• The 5-grams starting with “an even more narrow” are investigated
– none found. 4-grams and then trigrams are investigated
considering “more narrow” – 168 hits are found.
• The results containing symbols, punctuation marks or words with
fewer than 256 appearances in the corpus have been filtered out – 22
results remain. Top 6 are:
– [3] and [4816] [ CC : 0.527744] [-1]
– [3] approach [399] [ NN : 0.885605] [5]
– [3] as [372] [ IN : 0.829617] [-1]
– [3] definition [1934] [ NN : 1.221063] [1]
– [3] focus [2276] [ NN : 1.057171] [11]
– [3] interpretation [583] [ NN : 1.221063] [4]
• Fields of each entry, in order: [number of remaining words] word
[n-gram frequency] [POS : probability of the n-gram of POS]
[semantic relevance; -1 = not filtered out by this criterion].
Experiments (3)
• Thresholds: frequency: 308, POS score: 0.883849 and semantic
relevance: 4.
• Remaining candidates: “approach”, “focus”, “interpretation”,
“range”, “sense”, and “view”.
• The process continues with each of them until no n-grams
are found to continue, the maximum depth has been reached, or
a possible solution is encountered.
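The threshold step can be replayed on the six candidates listed on the previous slide. The values below are copied from that slide; the comparison direction (a candidate passes when it is at or above each threshold) is an assumption, chosen because “interpretation” has semantic relevance exactly 4 and is among the remaining candidates.

```python
# (word, n-gram frequency, POS tag, POS score, semantic relevance) from the slide
CANDIDATES = [
    ("and", 4816, "CC", 0.527744, -1),
    ("approach", 399, "NN", 0.885605, 5),
    ("as", 372, "IN", 0.829617, -1),
    ("definition", 1934, "NN", 1.221063, 1),
    ("focus", 2276, "NN", 1.057171, 11),
    ("interpretation", 583, "NN", 1.221063, 4),
]

def passes(freq, pos_score, relevance, min_freq=308, min_pos=0.883849, min_rel=4):
    """Apply the three thresholds; -1 means the semantic criterion is skipped."""
    semantic_ok = relevance == -1 or relevance >= min_rel
    return freq >= min_freq and pos_score >= min_pos and semantic_ok

survivors = [word for word, freq, _tag, pos, rel in CANDIDATES if passes(freq, pos, rel)]
print(survivors)
```

Under these assumptions the survivors among the six shown are “approach”, “focus” and “interpretation”; the other remaining candidates (“range”, “sense”, “view”) come from the results not listed in the top 6.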
Results (1)
• For the presented gap (interpretation is that), the results were:
• An even more narrow <gap> is that text have a physical form
– Missing word(s): interpretation.
– Results: approach [399][NN], view [754][NN], focus [2276][NN],
interpretation [583][NN] and sense [1346][NN].
Results (2)
• “for scientific instruction, yet <gap> remain too technical to sit well
in most programmes”
– Missing word(s): they.
– Results: still [210782][RB] and they [418129][PP].
• “and often have a primarily utilitarian purpose: <gap> data or
convey immediate information.”
– Missing word(s): to record.
– Results: over 50 results, the closest results being: to [62786][TO] -
present [6934][JJ], to [62786][TO] - share [5828][NN], to [62786][TO] -
gain [7704][NN], to [62786][TO] - study [5423][NN], to [62786][TO] -
test [3854][NN], to [62786][TO] - order [4641][NN], to [62786][TO] -
move [8527][NN], to [62786][TO] - process [3899][NN], to [62786]
[TO] - control [4081][NN] and to [62786][TO] - access [3631][NN].
Conclusions
• The application didn’t achieve the expected results.
• N-grams: not very helpful – coverage rates: 5-grams:
15%, 4-grams: 30%, trigrams: 60%, bigrams: 90% (our
assumption regarding the corpus was wrong).
• Small variations of the thresholds of each of the
considered heuristics can lead to massive filtering.
• The best heuristic seems to be the one based on POS.
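The slides do not define how the coverage rates were measured. One plausible reading, sketched below as an assumption, is the fraction of a document's n-grams of a given order that also occur in the corpus. The function name and the set-based corpus are hypothetical.

```python
def coverage(tokens, corpus_ngrams, order):
    """Fraction of the document's n-grams of the given order found in corpus_ngrams."""
    grams = [tuple(tokens[i:i + order]) for i in range(len(tokens) - order + 1)]
    if not grams:
        return 0.0  # document shorter than the requested order
    return sum(gram in corpus_ngrams for gram in grams) / len(grams)

# Toy document and toy corpus, invented for this example.
tokens = "a b c d".split()
toy_corpus = {("a", "b"), ("b", "c")}
print(coverage(tokens, toy_corpus, 2))
```

Under this reading, a 15% rate for 5-grams would mean that most 4-word contexts around a gap have no continuation in the corpus, which forces the back-off to lower orders.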
Q&A
Thank you for your time!
23.07.2010 ICSOFT 2010

More Related Content

What's hot

Dissertation defense slides on "Semantic Analysis for Improved Multi-document...
Dissertation defense slides on "Semantic Analysis for Improved Multi-document...Dissertation defense slides on "Semantic Analysis for Improved Multi-document...
Dissertation defense slides on "Semantic Analysis for Improved Multi-document...Quinsulon Israel
 
58903240-SentiMatrix-Multilingual-Sentiment-Analysis-Service
58903240-SentiMatrix-Multilingual-Sentiment-Analysis-Service58903240-SentiMatrix-Multilingual-Sentiment-Analysis-Service
58903240-SentiMatrix-Multilingual-Sentiment-Analysis-ServiceMarius Corici
 
Detecting and Describing Historical Periods in a Large Corpora
Detecting and Describing Historical Periods in a Large CorporaDetecting and Describing Historical Periods in a Large Corpora
Detecting and Describing Historical Periods in a Large CorporaTraian Rebedea
 
Scalable Discovery Of Hidden Emails From Large Folders
Scalable Discovery Of Hidden Emails From Large FoldersScalable Discovery Of Hidden Emails From Large Folders
Scalable Discovery Of Hidden Emails From Large Foldersfeiwin
 
Weakly Supervised Machine Reading
Weakly Supervised Machine ReadingWeakly Supervised Machine Reading
Weakly Supervised Machine ReadingIsabelle Augenstein
 
Quick Tour of Text Mining
Quick Tour of Text MiningQuick Tour of Text Mining
Quick Tour of Text MiningYi-Shin Chen
 
A general method applicable to the search for anglicisms in russian social ne...
A general method applicable to the search for anglicisms in russian social ne...A general method applicable to the search for anglicisms in russian social ne...
A general method applicable to the search for anglicisms in russian social ne...Ilia Karpov
 
Query Translation for Data Sources with Heterogeneous Content Semantics
Query Translation for Data Sources with Heterogeneous Content Semantics Query Translation for Data Sources with Heterogeneous Content Semantics
Query Translation for Data Sources with Heterogeneous Content Semantics Jie Bao
 
14. Michael Oakes (UoW) Natural Language Processing for Translation
14. Michael Oakes (UoW) Natural Language Processing for Translation14. Michael Oakes (UoW) Natural Language Processing for Translation
14. Michael Oakes (UoW) Natural Language Processing for TranslationRIILP
 
Lecture 2: Computational Semantics
Lecture 2: Computational SemanticsLecture 2: Computational Semantics
Lecture 2: Computational SemanticsMarina Santini
 
Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...
Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...
Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...Polytechnic University of Bari
 

What's hot (20)

Dissertation defense slides on "Semantic Analysis for Improved Multi-document...
Dissertation defense slides on "Semantic Analysis for Improved Multi-document...Dissertation defense slides on "Semantic Analysis for Improved Multi-document...
Dissertation defense slides on "Semantic Analysis for Improved Multi-document...
 
AINL 2016: Kravchenko
AINL 2016: KravchenkoAINL 2016: Kravchenko
AINL 2016: Kravchenko
 
Ontology learning
Ontology learningOntology learning
Ontology learning
 
Using lexical chains for text summarization
Using lexical chains for text summarizationUsing lexical chains for text summarization
Using lexical chains for text summarization
 
58903240-SentiMatrix-Multilingual-Sentiment-Analysis-Service
58903240-SentiMatrix-Multilingual-Sentiment-Analysis-Service58903240-SentiMatrix-Multilingual-Sentiment-Analysis-Service
58903240-SentiMatrix-Multilingual-Sentiment-Analysis-Service
 
Detecting and Describing Historical Periods in a Large Corpora
Detecting and Describing Historical Periods in a Large CorporaDetecting and Describing Historical Periods in a Large Corpora
Detecting and Describing Historical Periods in a Large Corpora
 
Some Information Retrieval Models and Our Experiments for TREC KBA
Some Information Retrieval Models and Our Experiments for TREC KBASome Information Retrieval Models and Our Experiments for TREC KBA
Some Information Retrieval Models and Our Experiments for TREC KBA
 
Scalable Discovery Of Hidden Emails From Large Folders
Scalable Discovery Of Hidden Emails From Large FoldersScalable Discovery Of Hidden Emails From Large Folders
Scalable Discovery Of Hidden Emails From Large Folders
 
Weakly Supervised Machine Reading
Weakly Supervised Machine ReadingWeakly Supervised Machine Reading
Weakly Supervised Machine Reading
 
Aq35241246
Aq35241246Aq35241246
Aq35241246
 
Quick Tour of Text Mining
Quick Tour of Text MiningQuick Tour of Text Mining
Quick Tour of Text Mining
 
AINL 2016: Malykh
AINL 2016: MalykhAINL 2016: Malykh
AINL 2016: Malykh
 
A general method applicable to the search for anglicisms in russian social ne...
A general method applicable to the search for anglicisms in russian social ne...A general method applicable to the search for anglicisms in russian social ne...
A general method applicable to the search for anglicisms in russian social ne...
 
Text Summarization
Text SummarizationText Summarization
Text Summarization
 
Query Translation for Data Sources with Heterogeneous Content Semantics
Query Translation for Data Sources with Heterogeneous Content Semantics Query Translation for Data Sources with Heterogeneous Content Semantics
Query Translation for Data Sources with Heterogeneous Content Semantics
 
Recommender Systems and Linked Open Data
Recommender Systems and Linked Open DataRecommender Systems and Linked Open Data
Recommender Systems and Linked Open Data
 
14. Michael Oakes (UoW) Natural Language Processing for Translation
14. Michael Oakes (UoW) Natural Language Processing for Translation14. Michael Oakes (UoW) Natural Language Processing for Translation
14. Michael Oakes (UoW) Natural Language Processing for Translation
 
Ir 03
Ir   03Ir   03
Ir 03
 
Lecture 2: Computational Semantics
Lecture 2: Computational SemanticsLecture 2: Computational Semantics
Lecture 2: Computational Semantics
 
Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...
Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...
Tutorial - Recommender systems meet linked open data - ICWE 2016 - Lugano - 0...
 

Similar to Filling the gaps

2010 INTERSPEECH
2010 INTERSPEECH 2010 INTERSPEECH
2010 INTERSPEECH WarNik Chow
 
OpenEssayist: Extractive Summarisation and Formative Assessment (DCLA13)
OpenEssayist: Extractive Summarisation and Formative Assessment (DCLA13)OpenEssayist: Extractive Summarisation and Formative Assessment (DCLA13)
OpenEssayist: Extractive Summarisation and Formative Assessment (DCLA13)Nicolas Van Labeke
 
6_Big Data Sources part3-Day 3_A_text_mining.pptx
6_Big Data Sources part3-Day 3_A_text_mining.pptx6_Big Data Sources part3-Day 3_A_text_mining.pptx
6_Big Data Sources part3-Day 3_A_text_mining.pptxShowravDuttaAnkur
 
Using a keyword extraction pipeline to understand concepts in future work sec...
Using a keyword extraction pipeline to understand concepts in future work sec...Using a keyword extraction pipeline to understand concepts in future work sec...
Using a keyword extraction pipeline to understand concepts in future work sec...Kai Li
 
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.Lifeng (Aaron) Han
 
Query Understanding
Query UnderstandingQuery Understanding
Query UnderstandingMatt Corkum
 
Enriching Word Vectors with Subword Information
Enriching Word Vectors with Subword InformationEnriching Word Vectors with Subword Information
Enriching Word Vectors with Subword InformationSeonghyun Kim
 
Introduction to Qualitative Research - Syllabus Spring 2014
Introduction to Qualitative Research - Syllabus Spring 2014Introduction to Qualitative Research - Syllabus Spring 2014
Introduction to Qualitative Research - Syllabus Spring 2014Joan E. Hughes, Ph.D.
 
Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language ...
Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language ...Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language ...
Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language ...Lviv Data Science Summer School
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersYoung Seok Kim
 
NLP Bootcamp 2018 : Representation Learning of text for NLP
NLP Bootcamp 2018 : Representation Learning of text for NLPNLP Bootcamp 2018 : Representation Learning of text for NLP
NLP Bootcamp 2018 : Representation Learning of text for NLPAnuj Gupta
 
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesDeep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesMatthew Lease
 
Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...RajkiranVeluri
 
Software tools to facilitate materials science research
Software tools to facilitate materials science researchSoftware tools to facilitate materials science research
Software tools to facilitate materials science researchAnubhav Jain
 
VOC real world enterprise needs
VOC real world enterprise needsVOC real world enterprise needs
VOC real world enterprise needsIvan Berlocher
 

Similar to Filling the gaps (20)

How to supervise a thesis in NLP in the ChatGPT era? By Laure Soulier
How to supervise a thesis in NLP in the ChatGPT era? By Laure SoulierHow to supervise a thesis in NLP in the ChatGPT era? By Laure Soulier
How to supervise a thesis in NLP in the ChatGPT era? By Laure Soulier
 
Chat adapted pos tagger for romanian language
Chat adapted pos tagger for romanian languageChat adapted pos tagger for romanian language
Chat adapted pos tagger for romanian language
 
Examining reading
Examining readingExamining reading
Examining reading
 
2010 INTERSPEECH
2010 INTERSPEECH 2010 INTERSPEECH
2010 INTERSPEECH
 
OpenEssayist: Extractive Summarisation and Formative Assessment (DCLA13)
OpenEssayist: Extractive Summarisation and Formative Assessment (DCLA13)OpenEssayist: Extractive Summarisation and Formative Assessment (DCLA13)
OpenEssayist: Extractive Summarisation and Formative Assessment (DCLA13)
 
6_Big Data Sources part3-Day 3_A_text_mining.pptx
6_Big Data Sources part3-Day 3_A_text_mining.pptx6_Big Data Sources part3-Day 3_A_text_mining.pptx
6_Big Data Sources part3-Day 3_A_text_mining.pptx
 
Using a keyword extraction pipeline to understand concepts in future work sec...
Using a keyword extraction pipeline to understand concepts in future work sec...Using a keyword extraction pipeline to understand concepts in future work sec...
Using a keyword extraction pipeline to understand concepts in future work sec...
 
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
 
Natural Language Processing using Java
Natural Language Processing using JavaNatural Language Processing using Java
Natural Language Processing using Java
 
Query Understanding
Query UnderstandingQuery Understanding
Query Understanding
 
Enriching Word Vectors with Subword Information
Enriching Word Vectors with Subword InformationEnriching Word Vectors with Subword Information
Enriching Word Vectors with Subword Information
 
Linguistic Evaluation of Support Verb Construction Translations by OpenLogos ...
Linguistic Evaluation of Support Verb Construction Translations by OpenLogos ...Linguistic Evaluation of Support Verb Construction Translations by OpenLogos ...
Linguistic Evaluation of Support Verb Construction Translations by OpenLogos ...
 
Introduction to Qualitative Research - Syllabus Spring 2014
Introduction to Qualitative Research - Syllabus Spring 2014Introduction to Qualitative Research - Syllabus Spring 2014
Introduction to Qualitative Research - Syllabus Spring 2014
 
Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language ...
Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language ...Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language ...
Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language ...
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask Learners
 
NLP Bootcamp 2018 : Representation Learning of text for NLP
NLP Bootcamp 2018 : Representation Learning of text for NLPNLP Bootcamp 2018 : Representation Learning of text for NLP
NLP Bootcamp 2018 : Representation Learning of text for NLP
 
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesDeep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
 
Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...
 
Software tools to facilitate materials science research
Software tools to facilitate materials science researchSoftware tools to facilitate materials science research
Software tools to facilitate materials science research
 
VOC real world enterprise needs
VOC real world enterprise needsVOC real world enterprise needs
VOC real world enterprise needs
 

More from University Politehnica Bucharest

PhD Thesis - Influence of Repetitions on Discourse and Semantic Analysis
PhD Thesis - Influence of Repetitions on Discourse and Semantic AnalysisPhD Thesis - Influence of Repetitions on Discourse and Semantic Analysis
PhD Thesis - Influence of Repetitions on Discourse and Semantic AnalysisUniversity Politehnica Bucharest
 
Identification and Classification of the Most Important Moments in Students’ ...
Identification and Classification of the Most Important Moments in Students’ ...Identification and Classification of the Most Important Moments in Students’ ...
Identification and Classification of the Most Important Moments in Students’ ...University Politehnica Bucharest
 
Digital Services Development Using Statistics Tools to Emphasize Pollution Ph...
Digital Services Development Using Statistics Tools to Emphasize Pollution Ph...Digital Services Development Using Statistics Tools to Emphasize Pollution Ph...
Digital Services Development Using Statistics Tools to Emphasize Pollution Ph...University Politehnica Bucharest
 
Determine the time period when a text was written using time series analysis
Determine the time period when a text was written using time series analysisDetermine the time period when a text was written using time series analysis
Determine the time period when a text was written using time series analysisUniversity Politehnica Bucharest
 
Using machine learning to generate predictions based on the information extra...
Using machine learning to generate predictions based on the information extra...Using machine learning to generate predictions based on the information extra...
Using machine learning to generate predictions based on the information extra...University Politehnica Bucharest
 
Hearthstone helper using optical character recognition techniques for cards d...
Hearthstone helper using optical character recognition techniques for cards d...Hearthstone helper using optical character recognition techniques for cards d...
Hearthstone helper using optical character recognition techniques for cards d...University Politehnica Bucharest
 
Movie recommender system using the user's psychological profile
Movie recommender system using the user's psychological profileMovie recommender system using the user's psychological profile
Movie recommender system using the user's psychological profileUniversity Politehnica Bucharest
 
Tracing the paths between concepts in large bio medical corpora
Tracing the paths between concepts in large bio medical corporaTracing the paths between concepts in large bio medical corpora
Tracing the paths between concepts in large bio medical corporaUniversity Politehnica Bucharest
 
The collection and analysis of public data - Bucharest case study
The collection and analysis of public data - Bucharest case studyThe collection and analysis of public data - Bucharest case study
The collection and analysis of public data - Bucharest case studyUniversity Politehnica Bucharest
 
Unsupervised system for automatic grading of bachelor and master thesis
Unsupervised system for automatic grading of bachelor and master thesisUnsupervised system for automatic grading of bachelor and master thesis
Unsupervised system for automatic grading of bachelor and master thesisUniversity Politehnica Bucharest
 
Tweets topic modelling across different countries prezentarea
Tweets topic modelling across different countries   prezentareaTweets topic modelling across different countries   prezentarea
Tweets topic modelling across different countries prezentareaUniversity Politehnica Bucharest
 
Nlp based heuristics for assessing participants in cscl chats
Nlp based heuristics for assessing participants in cscl chatsNlp based heuristics for assessing participants in cscl chats
Nlp based heuristics for assessing participants in cscl chatsUniversity Politehnica Bucharest
 

More from University Politehnica Bucharest (20)

PhD Thesis - Influence of Repetitions on Discourse and Semantic Analysis
PhD Thesis - Influence of Repetitions on Discourse and Semantic AnalysisPhD Thesis - Influence of Repetitions on Discourse and Semantic Analysis
PhD Thesis - Influence of Repetitions on Discourse and Semantic Analysis
 
Time series analysis for sales prediction
Time series analysis for sales predictionTime series analysis for sales prediction
Time series analysis for sales prediction
 
Identification and Classification of the Most Important Moments in Students’ ...
Identification and Classification of the Most Important Moments in Students’ ...Identification and Classification of the Most Important Moments in Students’ ...
Identification and Classification of the Most Important Moments in Students’ ...
 
Digital Services Development Using Statistics Tools to Emphasize Pollution Ph...
Digital Services Development Using Statistics Tools to Emphasize Pollution Ph...Digital Services Development Using Statistics Tools to Emphasize Pollution Ph...
Digital Services Development Using Statistics Tools to Emphasize Pollution Ph...
 
Identifying cyclic words with the help of google
Identifying cyclic words with the help of googleIdentifying cyclic words with the help of google
Identifying cyclic words with the help of google
 
Expression of Political Opinions in Press
Expression of Political Opinions in PressExpression of Political Opinions in Press
Expression of Political Opinions in Press
 
Determine the time period when a text was written using time series analysis
Determine the time period when a text was written using time series analysisDetermine the time period when a text was written using time series analysis
Determine the time period when a text was written using time series analysis
 
Using machine learning to generate predictions based on the information extra...
Using machine learning to generate predictions based on the information extra...Using machine learning to generate predictions based on the information extra...
Using machine learning to generate predictions based on the information extra...
 
Hearthstone helper using optical character recognition techniques for cards d...
Hearthstone helper using optical character recognition techniques for cards d...Hearthstone helper using optical character recognition techniques for cards d...
Hearthstone helper using optical character recognition techniques for cards d...
 
Movie recommender system using the user's psychological profile
Filling the gaps

  • 1. FILLING THE GAPS USING GOOGLE 5-GRAMS CORPUS
       Costin-Gabriel Chiru (costin.chiru@cs.pub.ro), Andrei Hanganu, Traian Rebedea, Stefan Trausan-Matu
       Universitatea Politehnica București, Faculty of Automatic Control and Computers, Computer Science Department
  • 2. Contents
       • Problem presentation
       • Our solution
       • Assumptions
       • Methodology
       • Candidate filtering heuristics
       • Experiments and results
       • Conclusions
       ICSOFT 2010, 23.07.2010
  • 3. The Problem
       • Many projects attempt to digitize the content of publications:
         – Gutenberg Project (http://www.gutenberg.org/wiki/Main_Page);
         – The Million Book Project (http://www.rr.cs.cmu.edu/mbdl.htm);
         – The Runeberg Project (http://runeberg.org/);
         – Google Book Search (http://books.google.com/);
         – Many others.
       • Problems:
         – Very old documents;
         – Partially damaged paper;
         – Cheap (poor quality) paper.
       • Result: OCR systems are unable to fully recognize the content of some documents.
  • 4. Our Solution
       • A probabilistic method for text recovery that tries to identify the words missing from the digital form of the document.
       • Based on:
         – “Web 1T 5-gram Version 1” corpus: an n-gram corpus provided by Google (used to generate candidates)
  • 5. Gaps
       • We focus on reconstructing damaged documents by predicting the most plausible word sets for filling the areas left missing after conversion to digital form; we call these areas gaps.
       • The most important property of a gap is its dimension: the number of characters or words that can be placed inside the gap.
  • 6. Assumptions
       • Our method is based on two assumptions:
         – Intra-document similarity. The document model has 2 components:
           • The style model: the structure of the text;
           • The language model: the vocabulary used by the author (n-grams and their frequencies).
         – The Google corpus is large enough to subsume most of the language models of the documents posted on the Internet:
           • Any word that does not appear in this corpus should not be considered as a possible candidate for filling the gaps.
  • 7. Methodology (1)
       • The style model of the document → the dimension of the gap.
       • 2 heuristics:
         – Estimated character count ([min_chars, max_chars]): derived from the document format (margins and indentation);
         – Estimated word count ([min_words, max_words]): uses the previous heuristic together with the distribution of word lengths (in characters) and of the number of words per phrase.
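The slides do not give the formula that turns the character range into a word range. The following is a minimal sketch of one way it could be done, assuming each word takes its length plus one separating space and that "typical" short and long words lie one standard deviation around the mean word length of the document. The function name `estimate_word_count` and the one-standard-deviation rule are illustrative assumptions, and the words-per-phrase distribution mentioned on the slide is not modelled.

```python
import math
from statistics import mean, pstdev


def estimate_word_count(min_chars, max_chars, word_lengths):
    """Return a hypothetical (min_words, max_words) estimate for a gap.

    word_lengths: lengths (in characters) of the words already recognized
    in the document; this stands in for the style model's word-length
    distribution. The +1 accounts for the space after each word.
    """
    avg = mean(word_lengths)
    spread = pstdev(word_lengths)
    long_word = avg + spread + 1              # assumed size of a long word slot
    short_word = max(1.0, avg - spread) + 1   # assumed size of a short word slot
    # Fewest words: the gap is filled with long words.
    min_words = max(1, math.floor(min_chars / long_word))
    # Most words: the gap is filled with short words.
    max_words = max(min_words, math.ceil(max_chars / short_word))
    return min_words, max_words
```

For a gap of 10 to 20 characters in a document whose words are all 4 characters long, this sketch yields a range of 2 to 4 words.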
  • 8. Methodology (2)
       • The language model of the document → detect the missing words.
       1. Start from the partial words at the beginning or at the end of the gap.
       2. Use both the n-gram corpus and the words that have been correctly identified before and after the gap in order to identify the first and last word of the gap:
         – Use the last 4 words before the gap and the first 4 after it to detect the most probable first and last word of the gap using the n-grams from the corpus (since the maximum order of n-grams in the corpus is 5);
  • 9. Methodology (3)
         – If there is no such 5-gram, the order of the n-gram is decreased repeatedly down to bigrams, where we consider only the word immediately before the gap and the word immediately after it;
         – The same thing happens when the gap is near the start or the end of a phrase.
       3. The possible candidates are stored and the process is restarted for each of these candidates in order to find the rest of the words of the gap.
  • 10. Methodology (4)
       4. The process ends when one of the following situations is reached:
         – The number of words or characters exceeds the estimated word or character count → the branches are too long to be valid → they can be discarded;
         – A left-side branch matches a right-side branch at some point, identifying a valid candidate for the missing words;
         – The left-side branch has reached an end-of-sentence mark-up (</S>) AND the right-side one has reached a beginning-of-sentence mark-up (<S>). At this point a “partial match” has been obtained, which contains a possibly unrecoverable gap inside it.
           • If the combined size of the branches fits the estimated character and word count → a valid candidate.
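The slides do not define what it means for a left-side branch to "match" a right-side branch. One plausible reading, sketched below, is that the tail of the left branch overlaps the head of the right branch, and the merged sequence must still fit the estimated word count. The function name `join_branches` and the overlap rule are assumptions made for illustration.

```python
def join_branches(left, right, max_words):
    """Try to merge a left-side and a right-side branch into one candidate.

    left:  words grown rightwards from the start of the gap.
    right: words grown leftwards from the end of the gap, in reading order.
    Returns the merged word list, or None when the branches do not overlap
    or the merged candidate exceeds the estimated word count.
    """
    # Prefer the largest overlap: it gives the shortest merged candidate.
    for k in range(min(len(left), len(right)), 0, -1):
        if left[-k:] == right[:k]:
            merged = left + right[k:]
            return merged if len(merged) <= max_words else None
    return None
```

With `left = ["to", "record"]` and `right = ["record"]`, the branches overlap on "record" and the candidate `["to", "record"]` is returned. Branches without a shared word yield `None` and would keep growing or be discarded.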
  • 11. Encountered Problems
       • No continuation possibility for a branch:
         – Decrease the order of the n-grams;
         – If already at bigram order, the branch is discarded.
       • A very large number of candidates is generated for each possible word (nomin candidates are generated)
         – The candidates have to be filtered out!
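The back-off behaviour described in the methodology and in the first problem above can be sketched as follows. This is a toy illustration for extending a left-side branch only: the corpus is a small in-memory dictionary with made-up counts rather than the Web 1T files, and the name `next_word_candidates` is hypothetical.

```python
def next_word_candidates(left_context, ngram_counts, max_order=5):
    """Find candidate next words, backing off from 5-grams to bigrams.

    left_context: words already known to the left of the position to fill.
    ngram_counts: dict mapping word tuples (n-grams) to counts (toy corpus).
    Returns (order_used, {word: count}); (None, {}) means that even bigrams
    gave no continuation, so the branch would be discarded.
    """
    start_order = min(max_order, len(left_context) + 1)
    for order in range(start_order, 1, -1):
        prefix = tuple(left_context[-(order - 1):])
        found = {
            ngram[-1]: count
            for ngram, count in ngram_counts.items()
            if len(ngram) == order and ngram[:-1] == prefix
        }
        if found:
            return order, found
    return None, {}


# Toy corpus; the counts are invented for the example.
TOY_NGRAMS = {
    ("more", "narrow", "interpretation"): 7,
    ("more", "narrow", "approach"): 5,
    ("narrow", "view"): 3,
}

order, words = next_word_candidates(["an", "even", "more", "narrow"], TOY_NGRAMS)
print(order, sorted(words))
```

In this toy run no 5-gram or 4-gram continues the context, so the search falls back to trigrams beginning with "more narrow", mirroring the experiment reported later in the slides.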
  • 12. Candidate Filtering Heuristics (1)
       • POS-based: a heuristic that predicts the POS of the words and discards the words that do not have the predicted POS (TreeTagger).
       • Semantics-based: discard the branches that do not contain words related to the rest of the document (based on lexical chains built using WordNet).
       • Frequency-based: prefer the branches with higher scores for the n-grams in the corpus.
       • Based on these heuristics, scores are computed for every word added to a branch.
  • 13. Candidate Filtering Heuristics (2)
       • These values are then combined in order to provide a general score of the branch.
       • A heuristic is used, the distance to the nearest end of the gap, to determine the importance of the score of each word (the error is propagated from the ends to the middle of the gap).
       • Finally, the obtained scores are normalized with respect to the number of words in the branch and the results are ordered according to this final score.
       • The branch with the highest score is used.
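The slides do not state how the per-word scores are weighted or combined. A minimal sketch under stated assumptions: each word already has a single combined score, words closer to either end of the gap are trusted more, the weight is taken as 1 / (1 + distance to the nearest end), and the weighted sum is divided by the number of words. The weighting formula and the name `branch_score` are illustrative choices, not the authors' definition.

```python
def branch_score(word_scores):
    """Combine per-word scores into one normalized branch score.

    word_scores: one combined score per word of the branch, in reading order.
    Words near an end of the gap get a larger weight because, per the slides,
    the error propagates from the ends towards the middle of the gap.
    """
    n = len(word_scores)
    if n == 0:
        return 0.0
    total = 0.0
    for i, score in enumerate(word_scores):
        distance = min(i, n - 1 - i)    # distance to the nearest end of the gap
        weight = 1.0 / (1 + distance)   # assumed decay, not from the slides
        total += weight * score
    return total / n                    # normalize by the number of words
```

Branches can then be ranked with `sorted(branches, key=branch_score, reverse=True)`, and the first one corresponds to "the branch with the highest score".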
  • 14. Experiments (1)
       • Start from full documents and remove some parts in order to simulate the gaps (http://en.wikipedia.org/wiki/Literature).
       • “An even more narrow interpretation is that (<gap>) text have a physical form, ...”
       • TreeTagger (word, POS, lemma):
         – “An DT an even RB even more RBR more narrow JJ narrow <gap> NN <unknown> text NN text have VBP have a DT a physical JJ physical form NN form , , ,”
  • 15. Experiments (2)
       • The estimated word count was established to be 3.
       • The 5-grams starting with “an even more narrow” are investigated: none found. 4-grams and then trigrams are investigated considering “more narrow”: 168 hits are found.
       • The results containing symbols, punctuation marks or words with fewer than 256 appearances in the corpus are filtered out, leaving 22 results. The top 6 are:
         – [3] and [4816] [ CC : 0.527744] [-1]
         – [3] approach [399] [ NN : 0.885605] [5]
         – [3] as [372] [ IN : 0.829617] [-1]
         – [3] definition [1934] [ NN : 1.221063] [1]
         – [3] focus [2276] [ NN : 1.057171] [11]
         – [3] interpretation [583] [ NN : 1.221063] [4]
       • Entry format: [number of remaining words] word [n-gram frequency] [POS : probability of the POS n-gram] [semantic relevance; -1 = not filtered out by this criterion]
  • 16. Experiments (3)
       • Thresholds: frequency: 308, POS score: 0.883849 and semantic relevance: 4.
       • Remaining candidates: “approach”, “focus”, “interpretation”, “range”, “sense”, and “view”.
       • The process continues with each of them until no n-grams are found to continue, the maximum depth has been reached, or a possible solution is encountered.
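The thresholds above can be applied to the six candidates listed on the previous slide. The sketch below assumes that a candidate survives when each value is greater than or equal to its threshold, and that a semantic relevance of -1 means the semantic criterion does not apply. The comparison direction and the names `Candidate` and `passes` are assumptions; the numbers are copied from the slides.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    word: str
    frequency: int     # n-gram frequency
    pos_score: float   # probability of the POS n-gram
    semantic: int      # semantic relevance, -1 = criterion not applied


# The six entries shown on the "Experiments (2)" slide.
TOP6 = [
    Candidate("and", 4816, 0.527744, -1),
    Candidate("approach", 399, 0.885605, 5),
    Candidate("as", 372, 0.829617, -1),
    Candidate("definition", 1934, 1.221063, 1),
    Candidate("focus", 2276, 1.057171, 11),
    Candidate("interpretation", 583, 1.221063, 4),
]


def passes(c, min_freq=308, min_pos=0.883849, min_sem=4):
    """Assumed rule: every applicable value must reach its threshold."""
    if c.frequency < min_freq or c.pos_score < min_pos:
        return False
    if c.semantic != -1 and c.semantic < min_sem:
        return False
    return True


survivors = [c.word for c in TOP6 if passes(c)]
print(survivors)  # ['approach', 'focus', 'interpretation']
```

Under these assumptions "and" and "as" fail the POS threshold and "definition" fails the semantic one, which agrees with the remaining candidates reported on the slide. The other three ("range", "sense", "view") come from the results not listed among the top 6.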
  • 17. Results (1)
       • For the presented gap (interpretation is that), the results were:
       • An even more narrow <gap> is that text have a physical form
         – Missing word(s): interpretation.
         – Results: approach [399][NN], view [754][NN], focus [2276][NN], interpretation [583][NN] and sense [1346][NN].
  • 18. Results (2)
       • “for scientific instruction, yet <gap> remain too technical to sit well in most programmes”
         – Missing word(s): they.
         – Results: still [210782][RB] and they [418129][PP].
       • “and often have a primarily utilitarian purpose: <gap> data or convey immediate information.”
         – Missing word(s): to record.
         – Results: over 50 results, the closest being:
           • to [62786][TO] - present [6934][JJ]
           • to [62786][TO] - share [5828][NN]
           • to [62786][TO] - gain [7704][NN]
           • to [62786][TO] - study [5423][NN]
           • to [62786][TO] - test [3854][NN]
           • to [62786][TO] - order [4641][NN]
           • to [62786][TO] - move [8527][NN]
           • to [62786][TO] - process [3899][NN]
           • to [62786][TO] - control [4081][NN]
           • to [62786][TO] - access [3631][NN]
  • 19. Conclusions
       • The application did not achieve the expected results.
       • N-grams: not very helpful. Coverage rates: 5-grams: 15%, 4-grams: 30%, trigrams: 60%, bigrams: 90% (our assumption regarding the corpus was wrong).
       • Small variations in the thresholds of the considered heuristics can lead to massive filtering.
       • The best heuristic seems to be the one based on POS.
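The slides do not say how the coverage rates were measured. One plausible definition, sketched here as an assumption, is the fraction of the n-grams of a document, for a given order, that also appear in the corpus. The function name `ngram_coverage` and the set-based toy corpus are illustrative.

```python
def ngram_coverage(tokens, corpus_ngrams, order):
    """Fraction of the document's n-grams (of the given order) found in the corpus.

    tokens: the document as a list of words.
    corpus_ngrams: a set of word tuples standing in for the n-gram corpus.
    """
    grams = [tuple(tokens[i:i + order]) for i in range(len(tokens) - order + 1)]
    if not grams:
        return 0.0
    hits = sum(1 for gram in grams if gram in corpus_ngrams)
    return hits / len(grams)
```

Under this definition, a document "a b c d" checked against a corpus holding only the bigrams (a, b) and (b, c) has a bigram coverage of 2/3. Coverage would be expected to drop as the order grows, which is the trend the reported rates show.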
  • 20. Q&A
       Thank you for your time!

Editor's Notes

  1. No of words, frequency in corpus, probability of n-gram of POS and