SlideShare a Scribd company logo
1 of 21
Minimum Edit Distance
K.A.S.H. Kulathilake
B.Sc.(Sp.Hons.)IT, MCS, Mphil, SEDA(UK)
Introduction
• Much of natural language processing is concerned with
measuring how similar two strings are.
• For example in spelling correction, the user typed
some erroneous string—let’s say graffe–and we want
to know what the user meant.
• The user probably intended a word that is similar to
graffe.
• Among candidate similar words, the word giraffe,
which differs by only one letter from graffe, seems
intuitively to be more similar than, say grail or graf,
which differ in more letters.
Introduction (Cont…)
• Edit distance gives us a way to quantify both
of these intuitions about string similarity.
• More formally, the minimum edit distance
between two strings is defined as the
minimum number of editing operations
(operations like insertion, deletion,
substitution) needed to transform one string
into another.
Introduction (Cont…)
• The gap between intention and execution, for
example, is 5 (delete an i, substitute e for n,
substitute x for t, insert c, substitute u for n).
• It’s much easier to see this by looking at the most
important visualization for string distances, an
alignment between the two strings.
Levenshtein Distance
• We can also assign a particular cost or weight to each of these
operations.
• The Levenshtein distance between two sequences is the simplest
weighting factor in which each of the three operations has a cost of
1 (Levenshtein, 1966)—we assume that the substitution of a letter
for itself, for example, t for t, has zero cost.
• The Levenshtein distance between intention and execution is 5.
• Levenshtein also proposed an alternative version of his metric in
which each insertion or deletion has a cost of 1 and substitutions
are not allowed. (This is equivalent to allowing substitution, but
giving each substitution a cost of 2 since any substitution can be
represented by one insertion and one deletion).
• Using this version, the Levenshtein distance between intention and
execution is 8.
Levenshtein distance - example
• distance(“William Cohen”, “Willliam Cohon”)
W I L L I A M _ C O H E N
W I L L L I A M _ C O H O N
C C C C I C C C C C C C S C
0 0 0 0 1 1 1 1 1 1 1 1 2 2
s
t
op
cost
Levenshtein distance - example
• distance(“William Cohen”, “Willliam Cohon”)
W I L L I A M _ C O H E N
W I L L L I A M _ C O H O N
C C C C I C C C C C C C S C
0 0 0 0 1 1 1 1 1 1 1 1 2 2
s
t
op
cost
gap
Dynamic Programming
• In order to find the edit distance we can use
dynamic programming.
• Dynamic programming is the name for a class
of algorithms, first introduced by Bellman
(1957), that apply a table-driven method to
solve problems by combining solutions to sub-
problems.
Edit Distance of Two Strings
• Let’s first define the minimum edit distance
between two strings.
• Given two strings, the source string X of length
n, and target string Y of length m, we’ll define
D(i, j) as the edit distance between X[1::i] and
Y[1:: j], i.e., the first i characters of X and the
first j characters of Y.
• The edit distance between X and Y is thus
D(n,m).
Edit Distance of Two Strings (Cont…)
• We’ll use dynamic programming to compute D(n,m)
bottom up, combining solutions to sub-problems.
• In the base case, with a source substring of length i but
an empty target string, going from i characters to 0
requires i deletes.
• With a target substring of length j but an empty source
going from 0 characters to j characters requires j
inserts.
• Having computed D(i, j) for small i, j we then compute
larger D(i, j) based on previously computed smaller
values.
Edit Distance of Two Strings (Cont…)
• The value of D(i, j) is computed by taking the
minimum of the three possible paths through
the matrix which arrive there:
Edit Distance of Two Strings (Cont…)
• If we assume the version of Levenshtein
distance in which the insertions and deletions
each have a cost of 1 (ins-cost() = del-cost() =
1), and substitutions have a cost of 2 (except
substitution of identical letters have zero
cost), the computation for D(i, j) becomes:
Edit Distance of Two Strings (Cont…)
Edit Distance of Two Strings (Cont…)
• Knowing the minimum edit distance is useful for algorithms
like finding potential spelling error corrections.
• But the edit distance algorithm is important in another
way; with a small change, it can also provide the minimum
cost alignment between two strings.
• Aligning two strings is useful throughout speech and
language processing.
• In speech recognition, minimum edit distance alignment is
used to compute the word error rate.
• Alignment plays a role in machine translation, in which
sentences in a parallel corpus (a corpus with a text in two
languages) need to be matched to each other.
Edit Distance of Two Strings (Cont…)
• To extend the edit distance algorithm to produce an
alignment, we can start by visualizing an alignment as a
path through the edit distance matrix.
• Following matrix shows this path with the boldfaced
cell.
• Each boldfaced cell represents an alignment of a pair of
letters in the two strings.
• If two boldfaced cells occur in the same row, there will
be an insertion in going from the source to the target;
two boldfaced cells in the same column indicate a
deletion.
Edit Distance of Two Strings (Cont…)
Edit Distance of Two Strings (Cont…)
• How to compute alignment path?
– The computation proceeds in two steps.
– In the first step, we augment the minimum edit
distance algorithm to store back pointers in each cell.
– The back pointer from a cell points to the previous cell
(or cells) that we came from in entering the current
cell.
– Some cells have multiple back pointers because the
minimum extension could have come from multiple
previous cells.
Edit Distance of Two Strings (Cont…)
– In the second step, we perform a back trace.
– In a back trace, we start from the last cell (at the
final row and column), and follow the pointers
back through the dynamic programming matrix.
– Each complete path between the final cell and the
initial cell is a minimum distance alignment.
Edit Distance of Two Strings (Cont…)
Later What We Do?
• Instead of computing the “minimum edit
distance” between two strings, Viterbi
computes the “maximum probability
alignment” of one string with another.
Question?
• How to handle transposition?
• E.g the < -- > hte

More Related Content

What's hot

2_1 Edit Distance.pptx
2_1 Edit Distance.pptx2_1 Edit Distance.pptx
2_1 Edit Distance.pptx
tanishamahajan11
 
Machine Learning: Generative and Discriminative Models
Machine Learning: Generative and Discriminative ModelsMachine Learning: Generative and Discriminative Models
Machine Learning: Generative and Discriminative Models
butest
 
Text clustering
Text clusteringText clustering
Text clustering
KU Leuven
 

What's hot (20)

Principles of programming languages. Detail notes
Principles of programming languages. Detail notesPrinciples of programming languages. Detail notes
Principles of programming languages. Detail notes
 
Natural Language processing Parts of speech tagging, its classes, and how to ...
Natural Language processing Parts of speech tagging, its classes, and how to ...Natural Language processing Parts of speech tagging, its classes, and how to ...
Natural Language processing Parts of speech tagging, its classes, and how to ...
 
Introduction to Named Entity Recognition
Introduction to Named Entity RecognitionIntroduction to Named Entity Recognition
Introduction to Named Entity Recognition
 
2_1 Edit Distance.pptx
2_1 Edit Distance.pptx2_1 Edit Distance.pptx
2_1 Edit Distance.pptx
 
NLP_KASHK:Evaluating Language Model
NLP_KASHK:Evaluating Language ModelNLP_KASHK:Evaluating Language Model
NLP_KASHK:Evaluating Language Model
 
Directed Acyclic Graph Representation of basic blocks
Directed Acyclic Graph Representation of basic blocksDirected Acyclic Graph Representation of basic blocks
Directed Acyclic Graph Representation of basic blocks
 
Word2Vec
Word2VecWord2Vec
Word2Vec
 
Machine Learning: Generative and Discriminative Models
Machine Learning: Generative and Discriminative ModelsMachine Learning: Generative and Discriminative Models
Machine Learning: Generative and Discriminative Models
 
Presentation on Text Classification
Presentation on Text ClassificationPresentation on Text Classification
Presentation on Text Classification
 
Input-Buffering
Input-BufferingInput-Buffering
Input-Buffering
 
Chapter 5 Syntax Directed Translation
Chapter 5   Syntax Directed TranslationChapter 5   Syntax Directed Translation
Chapter 5 Syntax Directed Translation
 
Introduction to prolog
Introduction to prologIntroduction to prolog
Introduction to prolog
 
NLP_KASHK:Smoothing N-gram Models
NLP_KASHK:Smoothing N-gram ModelsNLP_KASHK:Smoothing N-gram Models
NLP_KASHK:Smoothing N-gram Models
 
Clustering in Data Mining
Clustering in Data MiningClustering in Data Mining
Clustering in Data Mining
 
Lecture 1: Semantic Analysis in Language Technology
Lecture 1: Semantic Analysis in Language TechnologyLecture 1: Semantic Analysis in Language Technology
Lecture 1: Semantic Analysis in Language Technology
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
NLP State of the Art | BERT
NLP State of the Art | BERTNLP State of the Art | BERT
NLP State of the Art | BERT
 
Dag representation of basic blocks
Dag representation of basic blocksDag representation of basic blocks
Dag representation of basic blocks
 
Chapter 6 intermediate code generation
Chapter 6   intermediate code generationChapter 6   intermediate code generation
Chapter 6 intermediate code generation
 
Text clustering
Text clusteringText clustering
Text clustering
 

Similar to NLP_KASHK:Minimum Edit Distance

Design and analysis of Algorithms - Lecture 08 (1).ppt
Design and analysis of Algorithms - Lecture 08 (1).pptDesign and analysis of Algorithms - Lecture 08 (1).ppt
Design and analysis of Algorithms - Lecture 08 (1).ppt
ZeenaJaba
 
Dd 160506122947-160630175555-160701121726
Dd 160506122947-160630175555-160701121726Dd 160506122947-160630175555-160701121726
Dd 160506122947-160630175555-160701121726
marangburu42
 

Similar to NLP_KASHK:Minimum Edit Distance (20)

Design and analysis of Algorithms - Lecture 08 (1).ppt
Design and analysis of Algorithms - Lecture 08 (1).pptDesign and analysis of Algorithms - Lecture 08 (1).ppt
Design and analysis of Algorithms - Lecture 08 (1).ppt
 
Quantum Noise and Error Correction
Quantum Noise and Error CorrectionQuantum Noise and Error Correction
Quantum Noise and Error Correction
 
Dynamic time wrapping (dtw), vector quantization(vq), linear predictive codin...
Dynamic time wrapping (dtw), vector quantization(vq), linear predictive codin...Dynamic time wrapping (dtw), vector quantization(vq), linear predictive codin...
Dynamic time wrapping (dtw), vector quantization(vq), linear predictive codin...
 
Floyd aaaaaa
Floyd aaaaaaFloyd aaaaaa
Floyd aaaaaa
 
Dd 160506122947-160630175555-160701121726
Dd 160506122947-160630175555-160701121726Dd 160506122947-160630175555-160701121726
Dd 160506122947-160630175555-160701121726
 
Multimedia lossy compression algorithms
Multimedia lossy compression algorithmsMultimedia lossy compression algorithms
Multimedia lossy compression algorithms
 
2_EditDistance_Jan_08_2020.pptx
2_EditDistance_Jan_08_2020.pptx2_EditDistance_Jan_08_2020.pptx
2_EditDistance_Jan_08_2020.pptx
 
7 convolutional codes
7 convolutional codes7 convolutional codes
7 convolutional codes
 
M017327378
M017327378M017327378
M017327378
 
Scheduling Using Multi Objective Genetic Algorithm
Scheduling Using Multi Objective Genetic AlgorithmScheduling Using Multi Objective Genetic Algorithm
Scheduling Using Multi Objective Genetic Algorithm
 
Lossy
LossyLossy
Lossy
 
Mathematics Research Paper - Mathematics of Computer Networking - Final Draft
Mathematics Research Paper - Mathematics of Computer Networking - Final DraftMathematics Research Paper - Mathematics of Computer Networking - Final Draft
Mathematics Research Paper - Mathematics of Computer Networking - Final Draft
 
L06 stemmer and edit distance
L06 stemmer and edit distanceL06 stemmer and edit distance
L06 stemmer and edit distance
 
MS Project
MS ProjectMS Project
MS Project
 
poster
posterposter
poster
 
Non standard size image compression with reversible embedded wavelets
Non standard size image compression with reversible embedded waveletsNon standard size image compression with reversible embedded wavelets
Non standard size image compression with reversible embedded wavelets
 
Non standard size image compression with reversible embedded wavelets
Non standard size image compression with reversible embedded waveletsNon standard size image compression with reversible embedded wavelets
Non standard size image compression with reversible embedded wavelets
 
Wavelet Based Image Compression Using FPGA
Wavelet Based Image Compression Using FPGAWavelet Based Image Compression Using FPGA
Wavelet Based Image Compression Using FPGA
 
Parallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel applicationParallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel application
 
Introduction to machine learning
Introduction to machine learningIntroduction to machine learning
Introduction to machine learning
 

More from Hemantha Kulathilake

More from Hemantha Kulathilake (20)

NLP_KASHK:Parsing with Context-Free Grammar
NLP_KASHK:Parsing with Context-Free Grammar NLP_KASHK:Parsing with Context-Free Grammar
NLP_KASHK:Parsing with Context-Free Grammar
 
NLP_KASHK:Context-Free Grammar for English
NLP_KASHK:Context-Free Grammar for EnglishNLP_KASHK:Context-Free Grammar for English
NLP_KASHK:Context-Free Grammar for English
 
NLP_KASHK:POS Tagging
NLP_KASHK:POS TaggingNLP_KASHK:POS Tagging
NLP_KASHK:POS Tagging
 
NLP_KASHK:Markov Models
NLP_KASHK:Markov ModelsNLP_KASHK:Markov Models
NLP_KASHK:Markov Models
 
NLP_KASHK:N-Grams
NLP_KASHK:N-GramsNLP_KASHK:N-Grams
NLP_KASHK:N-Grams
 
NLP_KASHK:Morphology
NLP_KASHK:MorphologyNLP_KASHK:Morphology
NLP_KASHK:Morphology
 
NLP_KASHK:Text Normalization
NLP_KASHK:Text NormalizationNLP_KASHK:Text Normalization
NLP_KASHK:Text Normalization
 
NLP_KASHK:Finite-State Automata
NLP_KASHK:Finite-State AutomataNLP_KASHK:Finite-State Automata
NLP_KASHK:Finite-State Automata
 
NLP_KASHK:Regular Expressions
NLP_KASHK:Regular Expressions NLP_KASHK:Regular Expressions
NLP_KASHK:Regular Expressions
 
NLP_KASHK: Introduction
NLP_KASHK: Introduction NLP_KASHK: Introduction
NLP_KASHK: Introduction
 
COM1407: File Processing
COM1407: File Processing COM1407: File Processing
COM1407: File Processing
 
COm1407: Character & Strings
COm1407: Character & StringsCOm1407: Character & Strings
COm1407: Character & Strings
 
COM1407: Structures, Unions & Dynamic Memory Allocation
COM1407: Structures, Unions & Dynamic Memory Allocation COM1407: Structures, Unions & Dynamic Memory Allocation
COM1407: Structures, Unions & Dynamic Memory Allocation
 
COM1407: Input/ Output Functions
COM1407: Input/ Output FunctionsCOM1407: Input/ Output Functions
COM1407: Input/ Output Functions
 
COM1407: Working with Pointers
COM1407: Working with PointersCOM1407: Working with Pointers
COM1407: Working with Pointers
 
COM1407: Arrays
COM1407: ArraysCOM1407: Arrays
COM1407: Arrays
 
COM1407: Program Control Structures – Repetition and Loops
COM1407: Program Control Structures – Repetition and Loops COM1407: Program Control Structures – Repetition and Loops
COM1407: Program Control Structures – Repetition and Loops
 
COM1407: Program Control Structures – Decision Making & Branching
COM1407: Program Control Structures – Decision Making & BranchingCOM1407: Program Control Structures – Decision Making & Branching
COM1407: Program Control Structures – Decision Making & Branching
 
COM1407: C Operators
COM1407: C OperatorsCOM1407: C Operators
COM1407: C Operators
 
COM1407: Type Casting, Command Line Arguments and Defining Constants
COM1407: Type Casting, Command Line Arguments and Defining Constants COM1407: Type Casting, Command Line Arguments and Defining Constants
COM1407: Type Casting, Command Line Arguments and Defining Constants
 

Recently uploaded

1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
AldoGarca30
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power Play
Epec Engineered Technologies
 
Digital Communication Essentials: DPCM, DM, and ADM .pptx
Digital Communication Essentials: DPCM, DM, and ADM .pptxDigital Communication Essentials: DPCM, DM, and ADM .pptx
Digital Communication Essentials: DPCM, DM, and ADM .pptx
pritamlangde
 
Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
Neometrix_Engineering_Pvt_Ltd
 

Recently uploaded (20)

1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
 
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptxOrlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
Orlando’s Arnold Palmer Hospital Layout Strategy-1.pptx
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
Signal Processing and Linear System Analysis
Signal Processing and Linear System AnalysisSignal Processing and Linear System Analysis
Signal Processing and Linear System Analysis
 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power Play
 
Digital Communication Essentials: DPCM, DM, and ADM .pptx
Digital Communication Essentials: DPCM, DM, and ADM .pptxDigital Communication Essentials: DPCM, DM, and ADM .pptx
Digital Communication Essentials: DPCM, DM, and ADM .pptx
 
Computer Networks Basics of Network Devices
Computer Networks  Basics of Network DevicesComputer Networks  Basics of Network Devices
Computer Networks Basics of Network Devices
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.
 
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxHOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
 
Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdf
 
Jaipur ❤CALL GIRL 0000000000❤CALL GIRLS IN Jaipur ESCORT SERVICE❤CALL GIRL IN...
Jaipur ❤CALL GIRL 0000000000❤CALL GIRLS IN Jaipur ESCORT SERVICE❤CALL GIRL IN...Jaipur ❤CALL GIRL 0000000000❤CALL GIRLS IN Jaipur ESCORT SERVICE❤CALL GIRL IN...
Jaipur ❤CALL GIRL 0000000000❤CALL GIRLS IN Jaipur ESCORT SERVICE❤CALL GIRL IN...
 
A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna Municipality
 
Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
 
UNIT 4 PTRP final Convergence in probability.pptx
UNIT 4 PTRP final Convergence in probability.pptxUNIT 4 PTRP final Convergence in probability.pptx
UNIT 4 PTRP final Convergence in probability.pptx
 
Ground Improvement Technique: Earth Reinforcement
Ground Improvement Technique: Earth ReinforcementGround Improvement Technique: Earth Reinforcement
Ground Improvement Technique: Earth Reinforcement
 
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced LoadsFEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
 
Ghuma $ Russian Call Girls Ahmedabad ₹7.5k Pick Up & Drop With Cash Payment 8...
Ghuma $ Russian Call Girls Ahmedabad ₹7.5k Pick Up & Drop With Cash Payment 8...Ghuma $ Russian Call Girls Ahmedabad ₹7.5k Pick Up & Drop With Cash Payment 8...
Ghuma $ Russian Call Girls Ahmedabad ₹7.5k Pick Up & Drop With Cash Payment 8...
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdf
 
Max. shear stress theory-Maximum Shear Stress Theory ​ Maximum Distortional ...
Max. shear stress theory-Maximum Shear Stress Theory ​  Maximum Distortional ...Max. shear stress theory-Maximum Shear Stress Theory ​  Maximum Distortional ...
Max. shear stress theory-Maximum Shear Stress Theory ​ Maximum Distortional ...
 

NLP_KASHK:Minimum Edit Distance

  • 1. Minimum Edit Distance K.A.S.H. Kulathilake B.Sc.(Sp.Hons.)IT, MCS, Mphil, SEDA(UK)
  • 2. Introduction • Much of natural language processing is concerned with measuring how similar two strings are. • For example in spelling correction, the user typed some erroneous string—let’s say graffe–and we want to know what the user meant. • The user probably intended a word that is similar to graffe. • Among candidate similar words, the word giraffe, which differs by only one letter from graffe, seems intuitively to be more similar than, say grail or graf, which differ in more letters.
  • 3. Introduction (Cont…) • Edit distance gives us a way to quantify both of these intuitions about string similarity. • More formally, the minimum edit distance between two strings is defined as the minimum number of editing operations (operations like insertion, deletion, substitution) needed to transform one string into another.
  • 4. Introduction (Cont…) • The gap between intention and execution, for example, is 5 (delete an i, substitute e for n, substitute x for t, insert c, substitute u for n). • It’s much easier to see this by looking at the most important visualization for string distances, an alignment between the two strings.
  • 5. Levenshtein Distance • We can also assign a particular cost or weight to each of these operations. • The Levenshtein distance between two sequences is the simplest weighting factor in which each of the three operations has a cost of 1 (Levenshtein, 1966)—we assume that the substitution of a letter for itself, for example, t for t, has zero cost. • The Levenshtein distance between intention and execution is 5. • Levenshtein also proposed an alternative version of his metric in which each insertion or deletion has a cost of 1 and substitutions are not allowed. (This is equivalent to allowing substitution, but giving each substitution a cost of 2 since any substitution can be represented by one insertion and one deletion). • Using this version, the Levenshtein distance between intention and execution is 8.
  • 6. Levenshtein distance - example • distance(“William Cohen”, “Willliam Cohon”) W I L L I A M _ C O H E N W I L L L I A M _ C O H O N C C C C I C C C C C C C S C 0 0 0 0 1 1 1 1 1 1 1 1 2 2 s t op cost
  • 7. Levenshtein distance - example • distance(“William Cohen”, “Willliam Cohon”) W I L L I A M _ C O H E N W I L L L I A M _ C O H O N C C C C I C C C C C C C S C 0 0 0 0 1 1 1 1 1 1 1 1 2 2 s t op cost gap
  • 8. Dynamic Programming • In order to find the edit distance we can use dynamic programming. • Dynamic programming is the name for a class of algorithms, first introduced by Bellman (1957), that apply a table-driven method to solve problems by combining solutions to sub- problems.
  • 9. Edit Distance of Two Strings • Let’s first define the minimum edit distance between two strings. • Given two strings, the source string X of length n, and target string Y of length m, we’ll define D(i, j) as the edit distance between X[1::i] and Y[1:: j], i.e., the first i characters of X and the first j characters of Y. • The edit distance between X and Y is thus D(n,m).
  • 10. Edit Distance of Two Strings (Cont…) • We’ll use dynamic programming to compute D(n,m) bottom up, combining solutions to sub-problems. • In the base case, with a source substring of length i but an empty target string, going from i characters to 0 requires i deletes. • With a target substring of length j but an empty source going from 0 characters to j characters requires j inserts. • Having computed D(i, j) for small i, j we then compute larger D(i, j) based on previously computed smaller values.
  • 11. Edit Distance of Two Strings (Cont…) • The value of D(i, j) is computed by taking the minimum of the three possible paths through the matrix which arrive there:
  • 12. Edit Distance of Two Strings (Cont…) • If we assume the version of Levenshtein distance in which the insertions and deletions each have a cost of 1 (ins-cost() = del-cost() = 1), and substitutions have a cost of 2 (except substitution of identical letters have zero cost), the computation for D(i, j) becomes:
  • 13. Edit Distance of Two Strings (Cont…)
  • 14. Edit Distance of Two Strings (Cont…) • Knowing the minimum edit distance is useful for algorithms like finding potential spelling error corrections. • But the edit distance algorithm is important in another way; with a small change, it can also provide the minimum cost alignment between two strings. • Aligning two strings is useful throughout speech and language processing. • In speech recognition, minimum edit distance alignment is used to compute the word error rate. • Alignment plays a role in machine translation, in which sentences in a parallel corpus (a corpus with a text in two languages) need to be matched to each other.
  • 15. Edit Distance of Two Strings (Cont…) • To extend the edit distance algorithm to produce an alignment, we can start by visualizing an alignment as a path through the edit distance matrix. • Following matrix shows this path with the boldfaced cell. • Each boldfaced cell represents an alignment of a pair of letters in the two strings. • If two boldfaced cells occur in the same row, there will be an insertion in going from the source to the target; two boldfaced cells in the same column indicate a deletion.
  • 16. Edit Distance of Two Strings (Cont…)
  • 17. Edit Distance of Two Strings (Cont…) • How to compute alignment path? – The computation proceeds in two steps. – In the first step, we augment the minimum edit distance algorithm to store back pointers in each cell. – The back pointer from a cell points to the previous cell (or cells) that we came from in entering the current cell. – Some cells have multiple back pointers because the minimum extension could have come from multiple previous cells.
  • 18. Edit Distance of Two Strings (Cont…) – In the second step, we perform a back trace. – In a back trace, we start from the last cell (at the final row and column), and follow the pointers back through the dynamic programming matrix. – Each complete path between the final cell and the initial cell is a minimum distance alignment.
  • 19. Edit Distance of Two Strings (Cont…)
  • 20. Later What We Do? • Instead of computing the “minimum edit distance” between two strings, Viterbi computes the “maximum probability alignment” of one string with another.
  • 21. Question? • How to handle transposition? • E.g the < -- > hte