King’s College London, University of London
MSc in Advanced Software Engineering
Approximate Indexing: Gapped Suffix Array
KyungHoon Park
Agenda
• Research Objective
• Gapped suffix array
• Application
• Going beyond gSA
• Q&A
Research Objective
Main questions
1. Given an already constructed suffix array, can a gapped
suffix array be built in O(n) time?
2. What are the limitations of the gapped suffix array?
How can they be overcome?
Research aims
1. To fully understand and implement the suffix array
and the LCP array.
2. To implement a gapped suffix array from the suffix
array in O(n) time.
3. To study and implement the gapped suffix array
paper.
4. If the approach can be extended to multiple
gapped suffix arrays, to research the remaining limitations.
Gapped Suffix Array
Main questions
1. Given an already constructed suffix array, can a
gapped suffix array be built in O(n)
time?
2. What are the limitations of the gapped suffix array?
How can they be overcome?
Definitions
T = t1 t2 … tn and P = p1 p2 … pm are strings of symbols over a
finite alphabet
m = length of the search string P
n = length of the text T
k = number of mistakes allowed (Hamming distance)
Suffix Array
i    T[i..]        SA[i]   T[SA[i]..]    LCP[i]
0    mississippi   10      i             0
1    ississippi    7       ippi          1
2    ssissippi     4       issippi       1
3    sissippi      1       ississippi    4
4    issippi       0       mississippi   0
5    ssippi        9       pi            0
6    sippi         8       ppi           1
7    ippi          6       sippi         0
8    ppi           3       sissippi      2
9    pi            5       ssippi        1
10   i             2       ssissippi     3
T = mississippi
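The table can be reproduced with a short Python sketch. It is an illustration only: it sorts the suffixes directly instead of using the skew algorithm named later in these slides, so it is not linear time, and the function names `build_suffix_array` and `build_lcp` are hypothetical. The LCP pass follows Kasai's algorithm.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes in lexicographic order.

    Plain comparison sort, used here for clarity only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_lcp(text, sa):
    """Return lcp, where lcp[r] is the longest common prefix length of the
    suffixes at ranks r-1 and r (lcp[0] is 0)."""
    n = len(text)
    rank = [0] * n
    for r, p in enumerate(sa):
        rank[p] = r
    lcp = [0] * n
    h = 0
    for p in range(n):
        r = rank[p]
        if r == 0:
            h = 0
            continue
        q = sa[r - 1]  # suffix that precedes suffix p in sorted order
        while p + h < n and q + h < n and text[p + h] == text[q + h]:
            h += 1
        lcp[r] = h
        if h > 0:
            h -= 1  # the next suffix shares at least h-1 characters
    return lcp


sa = build_suffix_array("mississippi")
lcp = build_lcp("mississippi", sa)
print(sa)
print(lcp)
```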
Gapped Suffix Array
1. First introduced by Crochemore and Tischler
(2010).
2. Constructed from the SA, after the SA has been built.
3. A suffix array whose suffixes have a gap over a specific range
of positions, which provides an approximate index.
4. The range of the gap must be defined before constructing
the gapped suffix array.
Gapped Suffix Array
T = mississippi, (1, 2)-gSA(3, 1)
i    suffix (text order)   SA   gSA   gapped suffix at gSA[i]
1    mississippi           10   10    i#
2    ississippi            7    7     i#pi
3    ssissippi             4    4     i#sippi
4    sissippi              1    1     i#sissippi
5    issippi               0    0     m#ssissippi
6    ssippi                9    9     p#
7    sippi                 8    8     p#i
8    ippi                  6    5     s#ippi
9    ppi                   3    2     s#issippi
10   pi                    5    6     s#ppi
11   i                     2    3     s#ssippi
Rows are numbered from 1; the text positions stored in SA and gSA count from 0.
Definition
(g0, g1)-gSA(m, k)
gSA = gapped suffix array
g0 = start position of the gap
g1 = end position of the gap (the first position after it)
m = length of the search string
k = Hamming distance
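To make the definition concrete, the following Python sketch builds the (1, 2)-gSA of mississippi by sorting the gapped suffixes directly. It is a naive reference, not the linear-time construction, and the helper names `gapped_suffix` and `naive_gapped_suffix_array` are hypothetical. It assumes g1 is the first position after the gap.

```python
def gapped_suffix(text, start, g0, g1, gap_char="#"):
    """Return the suffix starting at `start` with offsets g0..g1-1 masked."""
    return text[start:start + g0] + gap_char * (g1 - g0) + text[start + g1:]


def naive_gapped_suffix_array(text, g0, g1):
    """Sort suffix start positions by (part before the gap, part after it)."""
    return sorted(range(len(text)),
                  key=lambda p: (text[p:p + g0], text[p + g1:]))


text = "mississippi"
gsa = naive_gapped_suffix_array(text, 1, 2)
for p in gsa:
    print(p, gapped_suffix(text, p, 1, 2))
```

The printed positions should agree with the gSA column of the table, which makes this a convenient cross-check for a faster implementation.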
Flow of constructing the gSA
1. Constructing the SA
   • Skew algorithm
2. Defining the limitations
   • Value of k (the k-mistake)
   • Range of the gap
3. Constructing the gSA
   • Sorting based on GRANK & HRANK
Limitations of gSA
1. The Hamming distance, the pattern length and the gap
range must be defined before construction.
2. A gSA cannot cover every approximate match
allowed by the chosen k.
e.g. k = 2, gap = (1,3)
coat -> c##t, ##at, co## (supported)
#o#t, c#a# (not supported)
3. A gSA cannot support multiple gaps.
e.g. coach -> c#a#h
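The split between the two groups in the example can be illustrated with a small Python sketch that enumerates every way of masking k positions of a pattern and separates the maskings that form one contiguous gap from those that need several gaps. The function names are hypothetical, and "one contiguous gap" is used here only as a stand-in for what a single gap range can express.

```python
from itertools import combinations


def masked_patterns(pattern, k, gap_char="#"):
    """Yield (masked pattern, masked positions) for every choice of k positions."""
    for positions in combinations(range(len(pattern)), k):
        chars = list(pattern)
        for i in positions:
            chars[i] = gap_char
        yield "".join(chars), positions


def is_single_gap(positions):
    """True when the masked positions are consecutive, i.e. form one gap."""
    return positions[-1] - positions[0] == len(positions) - 1


single = [p for p, pos in masked_patterns("coat", 2) if is_single_gap(pos)]
multiple = [p for p, pos in masked_patterns("coat", 2) if not is_single_gap(pos)]
print(single)    # one contiguous gap
print(multiple)  # gaps split across the pattern
```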
Constructing gSA - #1. GRANK
For example, m = 3, gap range = (1,2):
i       0  1  2  3  4  5  6  7  8  9  10
T[i]    m  i  s  s  i  s  s  i  p  p  i
GRANK   5  1  8  8  1  8  8  1  6  6  1
GRANK contains the ranks of the factors of T with
length up to g0, that is, the rank obtained by cutting
each suffix just before the beginning of the gap at
position g0.
Constructing gSA - #2. HRANK
HRANK contains the ranks of the suffixes that start
at the end of the gap.
Because the suffix array has already been built
before the gapped suffix array is constructed, the rank of
the suffix that starts where the gap ends can be looked up directly:
HRANK[r] = ISA[SA[r]+g1]
GRANK & HRANK
For example, the fourth suffix, sissippi, is split into its
GRANK part, gap and HRANK part as shown
below.
s        i      ssippi
GRANK    Gap    HRANK
If we radix sort the suffixes on the combined
(GRANK, HRANK) keys created in this way, the gSA can be
built in linear time.
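The construction can be sketched in Python as follows. This is a minimal illustration under several assumptions: the suffix array is built by plain sorting rather than the skew algorithm, ranks count from 1 with 0 meaning "past the end of the text", GRANK is derived by comparing prefixes of length g0 (cheap only when g0 is small), and the function name `build_gsa_from_ranks` is hypothetical. The two stable bucket passes play the role of the radix sort on (GRANK, HRANK).

```python
def build_gsa_from_ranks(text, g0, g1):
    """Build the (g0, g1)-gSA from an existing suffix array via rank pairs.

    Returns (gsa, grank, hrank); grank and hrank are indexed by text position.
    """
    n = len(text)
    sa = sorted(range(n), key=lambda p: text[p:])  # stand-in for a linear-time SA

    # Inverse suffix array with 1-based ranks.
    isa = [0] * n
    for r, p in enumerate(sa):
        isa[p] = r + 1

    # GRANK: rank of the first suffix sharing the same prefix of length g0.
    grank = [0] * n
    current = 0
    for r, p in enumerate(sa):
        if r == 0 or text[p:p + g0] != text[sa[r - 1]:sa[r - 1] + g0]:
            current = r + 1
        grank[p] = current

    # HRANK: rank of the suffix that starts right after the gap (0 if none).
    hrank = [isa[p + g1] if p + g1 < n else 0 for p in range(n)]

    def stable_pass(order, key):
        """One stable bucket pass; keys lie in the range 0..n."""
        buckets = [[] for _ in range(n + 1)]
        for p in order:
            buckets[key[p]].append(p)
        return [p for bucket in buckets for p in bucket]

    # Least significant key first, as in an LSD radix sort.
    order = stable_pass(range(n), hrank)
    gsa = stable_pass(order, grank)
    return gsa, grank, hrank


gsa, grank, hrank = build_gsa_from_ranks("mississippi", 1, 2)
print(gsa)
print(grank)
```

Each pass touches every suffix once and uses n + 1 buckets, so the sorting step itself is linear in n once the ranks are available.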
Example of (1,2)-gSA(3,1)
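i    suffix (text order)   SA   gSA   gapped suffix at gSA[i]   GRANK   HRANK
1    mississippi           10   10    i#                        5       0
2    ississippi            7    7     i#pi                      1       6
3    ssissippi             4    4     i#sippi                   8       8
4    sissippi              1    1     i#sissippi                8       9
5    issippi               0    0     m#ssissippi               1       11
6    ssippi                9    9     p#                        8       0
7    sippi                 8    8     p#i                       8       1
8    ippi                  6    5     s#ippi                    1       7
9    ppi                   3    2     s#issippi                 6       10
10   pi                    5    6     s#ppi                     6       2
11   i                     2    3     s#ssippi                  1       3
GRANK is listed for the suffix in the second column; HRANK is listed for the suffix at SA[i].
Ranks count from 1, and 0 marks a suffix that would start beyond the end of T.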
Search in (1,2)-gSA(3,1)
For example, if P = mis (p0, p1, p2), three searches
are needed:
- search mi (p0, p1) in the SA
- search is (p1, p2) in the SA
- search m#s (p0, p2) in the gSA
P = cot
(1,2)-gSA(3,1)    c#t                     #ot         co#
Searching array   in the (1,2)-gSA(3,1)   in the SA   in the SA
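The three searches can be sketched in Python for a pattern of length 3 and k = 1. This is an illustration under assumptions: both arrays are built by plain sorting, the sorted keys are materialised in lists so that the standard `bisect` functions can be used (a real index would compare against the text in place), and the names `equal_range` and `search_one_mismatch` are hypothetical.

```python
from bisect import bisect_left, bisect_right


def equal_range(keys, positions, query):
    """Return the text positions whose sorted key equals `query`."""
    return positions[bisect_left(keys, query):bisect_right(keys, query)]


def search_one_mismatch(text, pattern):
    """Return the starts of length-3 windows within Hamming distance 1."""
    assert len(pattern) == 3
    n = len(text)
    sa = sorted(range(n), key=lambda p: text[p:])
    gsa = sorted(range(n), key=lambda p: (text[p:p + 1], text[p + 2:]))
    sa_keys = [text[p:p + 2] for p in sa]                          # 2-char prefixes
    gsa_keys = [(text[p:p + 1], text[p + 2:p + 3]) for p in gsa]   # (head, tail)

    hits = set()
    # "co#": first two characters, looked up in the SA.
    for p in equal_range(sa_keys, sa, pattern[:2]):
        if p + 3 <= n:
            hits.add(p)
    # "#ot": last two characters, looked up in the SA, shifted back by one.
    for p in equal_range(sa_keys, sa, pattern[1:]):
        if p >= 1:
            hits.add(p - 1)
    # "c#t": first and last characters, looked up in the (1,2)-gSA.
    hits.update(equal_range(gsa_keys, gsa, (pattern[0], pattern[2])))
    return sorted(hits)


print(search_one_mismatch("mississippi", "mis"))
```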
Application
Platform and Language
1. Language: C#
2. Platform: Microsoft .NET (.NET Framework 4.0)
Algorithms
1. Construction of suffix array with LCP
- Radix sort
- Skew algorithm
2. Construction of gapped suffix array with gLCP
- Radix sort
3. Approximate string search
- pattern analysis
- binary search with LCP
Gapped Suffix Array
Going beyond gSA
Main questions
1. Given an already constructed suffix array, can a gapped
suffix array be built in O(n) time?
2. What are the limitations of the gapped
suffix array? How can they be
overcome?
Limitation of gSA
P = coat
(2,3)-gSA(4,1)    #oat   c#at             co#t       coa#
Searching array   SA     Cannot support   gSA(4,1)   SA
P = coast
(3,4)-gSA(5,1)    #oast   c#ast            co#st            coa#t      coas#
Searching array   SA      Cannot support   Cannot support   gSA(5,1)   SA
This assumes that k is 1 and that the gap ends at position m-1.
Countermeasure
P = coat
(2,3)-gSA(4,1)    #oat   c#at       co#t       coa#
Searching array   SA     gSA(3,1)   gSA(4,1)   SA
P = coast
(3,4)-gSA(5,1)    #oast   c#ast      co#st      coa#t      coas#
Searching array   SA      gSA(3,1)   gSA(4,1)   gSA(5,1)   SA
Countermeasure
P = cot: c#t, #ot, co#
gSA(3, 1) -> SA, gSA(3, 1)
P = coat: #oat, c#at, co#t, coa#
gSA(4, 1) -> SA, gSA(3, 1), gSA(4, 1)
P = coast: #oast, c#ast, co#st, coa#t, coas#
gSA(5, 1) -> SA, gSA(3, 1), gSA(4, 1), gSA(5, 1)
P = coasts: #oasts, c#asts, co#sts, coa#ts, coas#s, coast#
gSA(6, 1) -> SA, gSA(3, 1), gSA(4, 1), gSA(5, 1), gSA(6, 1)
gSA(m, 1) -> SA, gSA(3, 1), …, gSA(m-2, 1), gSA(m-1, 1), gSA(m, 1)
Theorem. If the length of the gap is 1, the number of
gSAs required is m - 2 (for m ≥ 3), and both
construction and search can be performed in linear
time.
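The count m - 2 can be read off a small Python sketch that maps each single-gap variant of a pattern to the index that would answer it, following the countermeasure tables. The function name `plan_single_gap_searches` and the rule "gap at 0-based position j uses gSA(j+2,1)" are assumptions inferred from those tables, not a stated part of the method.

```python
def plan_single_gap_searches(pattern, gap_char="#"):
    """Return (variant, index name) pairs for every single-character gap."""
    m = len(pattern)
    plan = []
    for j in range(m):
        variant = pattern[:j] + gap_char + pattern[j + 1:]
        if j == 0 or j == m - 1:
            index = "SA"                # gap at either end: plain suffix array
        else:
            index = f"gSA({j + 2},1)"   # gap closes a prefix of length j + 2
        plan.append((variant, index))
    return plan


plan = plan_single_gap_searches("coast")
for variant, index in plan:
    print(variant, "->", index)

# Distinct gapped suffix arrays needed for a pattern of length m = 5.
gapped_indexes = sorted({index for _, index in plan if index != "SA"})
print(len(gapped_indexes))
```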
Total count of required gSAs
gSA(m, p)   Required gapped suffix arrays
gSA(3,1) -> SA, gSA(3,1)
gSA(4,1) -> SA, gSA(3,1), gSA(4,1)
gSA(4,2) -> SA, gSA(3,1), gSA(4,2)
gSA(5,1) -> SA, gSA(3,1), gSA(4,1), gSA(5,1)
gSA(5,2) -> SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,2)
gSA(5,3) -> SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,3)
gSA(6,1) -> SA, gSA(3,1), gSA(4,1), gSA(5,1), gSA(6,1)
gSA(6,2) -> SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(6,2)
gSA(6,3) -> SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), gSA(6,3)
gSA(6,4) -> SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), gSA(6,4)
gSA(7,1) -> SA, gSA(3,1), gSA(4,1), gSA(5,1), gSA(6,1), gSA(7,1)
gSA(7,2) -> SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(6,1), gSA(6,2), gSA(7,2)
gSA(7,3) -> SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), gSA(6,1), gSA(6,2), gSA(6,3), gSA(7,3)
gSA(7,4) -> SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), gSA(6,1), gSA(6,2), gSA(6,3), gSA(6,4), gSA(7,4)
gSA(7,5) -> SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), …
gC = total count of required gSAs
\[
gC = \sum_{i=1}^{p-1}
\begin{cases}
k - i & \text{if } k - i > 0 \\
1 & \text{otherwise}
\end{cases}
\]
Multiple gaps for various m
P = coat: ##at, #o#t, #oa#, c##t, c#a#, co##
gSA(4,2) -> SA, gSA(3,1), gSA(4,2)
P = coast: ##ast, #o#st, #oa#t, #oas#, c##st, c#a#t, c#as#, co##t, co#s#, coa##
gSA(5,2) -> SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,2), (1,2)(3,4)-gSA(5,2)
P = coasts: ##asts, #o#sts, #oa#ts, #oas#s, #oast#, c##sts, c#a#ts, c#as#s, c#ast#, co##ts, co#s#s, co#st#, coa##s, coa#t#, coas##
gSA(6,2) -> SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), (1,2)(3,4)-gSA(5,2), gSA(6,2), (1,2)(4,5)-gSA(6,2), (2,3)(4,5)-gSA(6,2)
P = coasts: ###sts, ##a#ts, ##as#s, ##ast#, #o##ts, #o#s#s, #o#st#, #oa##s, #oa#t#, #oas##, c###ts, c##s#s, c##st#, c#a##s, c#a#t#, c#as##, co###s, co##t#, co#s##, coa###
gSA(6,3) -> SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), (1,2)(3,4)-gSA(5,2), gSA(5,3), gSA(6,3), (1,3)(4,5)-gSA(6,3), (1,2)(3,5)-gSA(6,3)
Two approaches to support
multiple gaps
The first is to search the index only up to the first gap
of the search pattern, and then to compare every remaining character
individually.
The second is to keep creating additional multiple gapped
suffix arrays, as in the method above.
First approach
P = c # a # t
r = gSA[i](3,1), T[r]
T[r+2] T[r+3] T[r+4]
P = c # a s # s
r = gSA[i](3,1), T[r]
T[r+3] T[r+4] T[r+5]
Worst case for searching with the first approach
Let fm be the length of the first fragment.
Binary search for the first fragment with gLCP: O(log n + fm)
Comparing the rest of the pattern: O((m - fm) n)
Total: O((m - fm) n + log n + fm)
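The first approach can be sketched in Python as follows. It is a simplified illustration: the first fragment is located in a plain suffix array built by sorting (the slides start from gSA(3,1) instead), the sorted prefixes are materialised in a list so that `bisect` can be used, and the function name `search_with_gaps` is hypothetical. After the binary search, every remaining pattern character is compared individually, which is the O((m - fm) n) term above.

```python
from bisect import bisect_left, bisect_right


def search_with_gaps(text, pattern, gap_char="#"):
    """Return the start positions where `pattern` matches, gaps matching anything."""
    n, m = len(text), len(pattern)
    sa = sorted(range(n), key=lambda p: text[p:])

    # First fragment: everything before the first gap (length fm).
    first_gap = pattern.find(gap_char)
    fm = m if first_gap == -1 else first_gap
    fragment = pattern[:fm]

    # Binary search for the range of suffixes that start with the fragment.
    keys = [text[p:p + fm] for p in sa]
    lo, hi = bisect_left(keys, fragment), bisect_right(keys, fragment)

    hits = []
    for p in sa[lo:hi]:
        if p + m > n:
            continue  # pattern would run past the end of the text
        # Character-by-character check of the rest of the pattern.
        if all(pattern[j] == gap_char or text[p + j] == pattern[j]
               for j in range(fm, m)):
            hits.append(p)
    return sorted(hits)


print(search_with_gaps("mississippi", "s#s#i"))
print(search_with_gaps("mississippi", "is#i"))
```

The shorter the first fragment, the more candidates survive the binary search and the more work falls on the per-character comparison, which is why the worst case grows with m - fm.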
Summary
Further work
A gapped suffix array only supports searching for specific
patterns.
Supporting approximate indexing in all situations
will require more research and development on
multiple gapped suffix arrays.
A future task is to study multiple gapped suffix arrays and
their efficiency.
Conclusion
The result of Crochemore and Tischler that a gSA can be built in linear
time has been put into practice and confirmed.
In addition, the potential of
multiple gSAs was examined, with the conclusion that
this area requires more research.
Q&A

More Related Content

Similar to Approximate Indexing: Gapped Suffix Array

A taxonomy of suffix array construction algorithms
A taxonomy of suffix array construction algorithmsA taxonomy of suffix array construction algorithms
A taxonomy of suffix array construction algorithmsunyil96
 
SPACE-EFFICIENT K-MER ALGORITHM FOR GENERALISED SUFFIX TREE
SPACE-EFFICIENT K-MER ALGORITHM FOR GENERALISED SUFFIX TREESPACE-EFFICIENT K-MER ALGORITHM FOR GENERALISED SUFFIX TREE
SPACE-EFFICIENT K-MER ALGORITHM FOR GENERALISED SUFFIX TREEijitcs
 
String kmp
String kmpString kmp
String kmpthinkphp
 
Parallel random projection using R high performance computing for planted mot...
Parallel random projection using R high performance computing for planted mot...Parallel random projection using R high performance computing for planted mot...
Parallel random projection using R high performance computing for planted mot...TELKOMNIKA JOURNAL
 
Combining text and pattern preprocessing in an adaptive dna pattern matcher
Combining text and pattern preprocessing in an adaptive dna pattern matcherCombining text and pattern preprocessing in an adaptive dna pattern matcher
Combining text and pattern preprocessing in an adaptive dna pattern matcherIAEME Publication
 
A Matching Approach Based on Term Clusters for eRecruitment
A Matching Approach Based on Term Clusters for eRecruitmentA Matching Approach Based on Term Clusters for eRecruitment
A Matching Approach Based on Term Clusters for eRecruitmentKemal Can Kara
 
prolog-coolPrograms-flora.ppt
prolog-coolPrograms-flora.pptprolog-coolPrograms-flora.ppt
prolog-coolPrograms-flora.pptdatapro2
 
Deconstructing Columnar Transposition Ciphers
Deconstructing Columnar Transposition CiphersDeconstructing Columnar Transposition Ciphers
Deconstructing Columnar Transposition CiphersRobert Talbert
 
Graph Summarization with Quality Guarantees
Graph Summarization with Quality GuaranteesGraph Summarization with Quality Guarantees
Graph Summarization with Quality GuaranteesTwo Sigma
 
32 -longest-common-prefix
32 -longest-common-prefix32 -longest-common-prefix
32 -longest-common-prefixSanjeev Gupta
 
Point Placement Algorithms: An Experimental Study
Point Placement Algorithms: An Experimental StudyPoint Placement Algorithms: An Experimental Study
Point Placement Algorithms: An Experimental StudyCSCJournals
 
Langford sequences through a product of labeled digraphs
Langford sequences through a product of labeled digraphsLangford sequences through a product of labeled digraphs
Langford sequences through a product of labeled digraphsGraph-TA
 
Msa & rooted/unrooted tree
Msa & rooted/unrooted treeMsa & rooted/unrooted tree
Msa & rooted/unrooted treeSamiul Ehsan
 
A Quest for Subexponential Time Parameterized Algorithms for Planar-k-Path: F...
A Quest for Subexponential Time Parameterized Algorithms for Planar-k-Path: F...A Quest for Subexponential Time Parameterized Algorithms for Planar-k-Path: F...
A Quest for Subexponential Time Parameterized Algorithms for Planar-k-Path: F...cseiitgn
 

Similar to Approximate Indexing: Gapped Suffix Array (18)

A taxonomy of suffix array construction algorithms
A taxonomy of suffix array construction algorithmsA taxonomy of suffix array construction algorithms
A taxonomy of suffix array construction algorithms
 
SPACE-EFFICIENT K-MER ALGORITHM FOR GENERALISED SUFFIX TREE
SPACE-EFFICIENT K-MER ALGORITHM FOR GENERALISED SUFFIX TREESPACE-EFFICIENT K-MER ALGORITHM FOR GENERALISED SUFFIX TREE
SPACE-EFFICIENT K-MER ALGORITHM FOR GENERALISED SUFFIX TREE
 
String kmp
String kmpString kmp
String kmp
 
Parallel random projection using R high performance computing for planted mot...
Parallel random projection using R high performance computing for planted mot...Parallel random projection using R high performance computing for planted mot...
Parallel random projection using R high performance computing for planted mot...
 
poster
posterposter
poster
 
Combining text and pattern preprocessing in an adaptive dna pattern matcher
Combining text and pattern preprocessing in an adaptive dna pattern matcherCombining text and pattern preprocessing in an adaptive dna pattern matcher
Combining text and pattern preprocessing in an adaptive dna pattern matcher
 
A Matching Approach Based on Term Clusters for eRecruitment
A Matching Approach Based on Term Clusters for eRecruitmentA Matching Approach Based on Term Clusters for eRecruitment
A Matching Approach Based on Term Clusters for eRecruitment
 
prolog-coolPrograms-flora.ppt
prolog-coolPrograms-flora.pptprolog-coolPrograms-flora.ppt
prolog-coolPrograms-flora.ppt
 
Deconstructing Columnar Transposition Ciphers
Deconstructing Columnar Transposition CiphersDeconstructing Columnar Transposition Ciphers
Deconstructing Columnar Transposition Ciphers
 
Presentation 2
Presentation 2Presentation 2
Presentation 2
 
Graph Summarization with Quality Guarantees
Graph Summarization with Quality GuaranteesGraph Summarization with Quality Guarantees
Graph Summarization with Quality Guarantees
 
32 -longest-common-prefix
32 -longest-common-prefix32 -longest-common-prefix
32 -longest-common-prefix
 
2nd Semester M Tech: Structural Engineering (June-2015) Question Papers
2nd  Semester M Tech: Structural Engineering  (June-2015) Question Papers2nd  Semester M Tech: Structural Engineering  (June-2015) Question Papers
2nd Semester M Tech: Structural Engineering (June-2015) Question Papers
 
Point Placement Algorithms: An Experimental Study
Point Placement Algorithms: An Experimental StudyPoint Placement Algorithms: An Experimental Study
Point Placement Algorithms: An Experimental Study
 
Ch06 multalign
Ch06 multalignCh06 multalign
Ch06 multalign
 
Langford sequences through a product of labeled digraphs
Langford sequences through a product of labeled digraphsLangford sequences through a product of labeled digraphs
Langford sequences through a product of labeled digraphs
 
Msa & rooted/unrooted tree
Msa & rooted/unrooted treeMsa & rooted/unrooted tree
Msa & rooted/unrooted tree
 
A Quest for Subexponential Time Parameterized Algorithms for Planar-k-Path: F...
A Quest for Subexponential Time Parameterized Algorithms for Planar-k-Path: F...A Quest for Subexponential Time Parameterized Algorithms for Planar-k-Path: F...
A Quest for Subexponential Time Parameterized Algorithms for Planar-k-Path: F...
 

Recently uploaded

QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...itnewsafrica
 

Recently uploaded (20)

QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 

Approximate Indexing: Gapped Suffix Array

  • 1. King’s College London, University of London MSc in Advanced Software Engineering Approximate Indexing: Gapped Suffix Array KyungHoon Park
  • 2. King’s College London, University of London Agenda  Research Objective  Gapped suffix array  Application  Going beyond gSA  Q&A
  • 3. King’s College London, University of London Research Objective
  • 4. King’s College London, University of London Main questions 1. Using the developed suffix array, can gapped suffix array be developed in O(n) time? 2. What are the limitations of gapped suffix array? How can these can be overcome?
  • 5. King’s College London, University of London Research aims 1. To fully understand and implement suffix array and LCP. 2. Implement a gapped suffix array from the suffix array in O(n) time. 3. To study and implement the paper gapped suffix array. 4. If there are possibilities to develop to multiple gapped suffix array, to research other limitations.
  • 6. King’s College London, University of London Gapped Suffix Array
  • 7. King’s College London, University of London Main questions 1. Using the developed suffix array, can gapped suffix array be developed in O(n) time? 2. 2. What are the limitations of gapped suffix array? How can these can be overcome?
  • 8. King’s College London, University of London Definitions T = t1t2 … tn, P = p1 p2 … pn , strings of symbols in finite alphabet m = length of search string n = length of text k = k-mistake (Hamming distance)
  • 9. King’s College London, University of London Suffix Array i T[i] SA T[SA[i]] LCP 0 mississippi 10 i 0 1 ississippi 7 ippi 1 2 ssissippi 4 issippi 1 3 sissippi 1 ississippi 4 4 issippi 0 mississippi 0 5 ssippi 9 pi 0 6 sippi 8 ppi 1 7 ippi 6 sippi 0 8 ppi 3 sissippi 2 9 pi 5 ssippi 1 T = mississippi
  • 10. King’s College London, University of London Gapped Suffix Array 1. First introduced by Crochemore and Tischler (2010) 2. Constructed after SA 3. SA that has a Gap within a specific range to provide approximate index. 4. The range of gap defined before constructing the gapped suffix array.
  • 11. King’s College London, University of London Gapped Suffix Array T = mississippi, (1, 2)-gSA (3,1) i T[i] SA gSA (1, 2)- gSA(3,1) 1 mississippi 10 10 i# 2 ississippi 7 7 i#pi 3 ssissippi 4 4 i#sippi 4 sissippi 1 1 i#sissippi 5 issippi 0 0 m#ssissippi 6 Ssippi 9 9 p# 7 Sippi 8 8 p#i 8 Ippi 6 5 s#ppi 9 ppi 3 2 s#ssippi 10 pi 5 6 s#ippi 11 i 2 3 s#issippi Definition (g0, g1)-gSA (m, k) gSA = Gapped suffix array g0 = start cursor of the gap g1 = end cursor of the gap m = length of search string k = Hamming distance
  • 12. King’s College London, University of London Flow of constructing the gSA • Skew Algorithm 1. Constructing the SA • Figure of the k-mistake • Range of gap 2. Defining the limitations • Sorting based on GRANK & HRANK 3. Constructing the gSA
  • 13. King’s College London, University of London Limitations of gSA 1. Hamming distance, length of pattern and gap range should define prior to constructing. 2. gSA cannot cover all of approximate string matching based on defined k-mistake. ex) k = 2, gap=(1,3) coat -> c##t, ##at, co## (support) #o#t, c#a# (cannot support) 3. gSA cannot support multiple gaps EX) coach -> c#a#h
  • 14. King’s College London, University of London Constructing gSA - #1. GRANK i 0 1 2 3 4 5 6 7 8 9 10 T[i] m i s s i s s i p p i GRANK 5 1 8 8 1 8 8 1 6 6 1 GRANK contains the ranks of factors of y with length up to g0. That is, rank created by cutting the characters before the beginning of the gap at position g0 For Example, m = 3, gap range = (1,2)
  • 15. King’s College London, University of London Constructing gSA - #2. HRANK HRANK contains the RANKs of the suffixes that are at the end of the gap. As we have now already created the suffix array before constructing the gapped suffix, it is possible to easily bring the suffix of where the gap ends. HRANK[r] = ISA[SA[r]+g1]
  • 16. King’s College London, University of London GRANK & HRANK For example, the structure of the GRANK and HRANK of the fourth suffix sissippi is constructed as below. s i s s i p p i GRANK Gap HRANK If we perform the radix sort by combining both GRANK and HRANK created in this way, it is possible to create gSA in linear time.
  • 17. Example of (1,2)-gSA(3,1)
    i   T[i]         SA  gSA  (1, 2)-gSA   GRANK  HRANK
    1   mississippi  10  10   i#           5      0
    2   ississippi   7   7    i#pi         1      6
    3   ssissippi    4   4    i#sippi      8      8
    4   sissippi     1   1    i#sissippi   8      9
    5   issippi      0   0    m#ssissippi  1      11
    6   ssippi       9   9    p#           8      0
    7   sippi        8   8    p#i          8      1
    8   ippi         6   5    s#ppi        1      7
    9   ppi          3   2    s#ssippi     6      10
    10  pi           5   6    s#ippi       6      2
    11  i            2   3    s#issippi    1      3
  • 18. Search in (1,2)-gSA(3,1)
    For example, if P = mis (p0 p1 p2), three searches are needed:
    - search mi (p0, p1) in the SA
    - search is (p1, p2) in the SA
    - search ms (p0, p2) in the gSA
    P = cot, (1,2)-gSA(3,1)
    Pattern  Searching array
    c#t      (1,2)-gSA(3,1)
    #ot      SA
    co#      SA
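The three lookups can be sketched in Python as follows. This is a simplified illustration with assumptions: both arrays are built by naive sorting, plain binary search replaces the LCP-assisted search, the pattern length is fixed at 3 with gap (1,2), and the function names are hypothetical.

```python
def equal_range(order, key, target):
    """Entries of `order` (sorted by `key`) whose key equals `target`."""
    lo, hi = 0, len(order)
    while lo < hi:  # first index with key >= target
        mid = (lo + hi) // 2
        if key(order[mid]) < target:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(order)
    while lo < hi:  # first index with key > target
        mid = (lo + hi) // 2
        if key(order[mid]) <= target:
            lo = mid + 1
        else:
            hi = mid
    return order[first:lo]


def search_one_mismatch(text, pattern):
    if len(pattern) != 3:
        raise ValueError("this sketch only handles m = 3 with gap (1,2)")
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])
    gsa = sorted(range(n), key=lambda i: (text[i:i + 1], text[i + 2:]))
    hits = set()
    # co#: p0 p1 looked up in the SA
    hits.update(i for i in equal_range(sa, lambda i: text[i:i + 2], pattern[:2])
                if i + 3 <= n)
    # #ot: p1 p2 looked up in the SA, match starts one position earlier
    hits.update(i - 1 for i in equal_range(sa, lambda i: text[i:i + 2], pattern[1:])
                if i >= 1)
    # c#t: p0 and p2 looked up in the gSA, skipping the gap character
    hits.update(equal_range(gsa, lambda i: (text[i:i + 1], text[i + 2:i + 3]),
                            (pattern[0], pattern[2])))
    return sorted(hits)


print(search_one_mismatch("mississippi", "mis"))  # [0, 3]
```

Position 0 is the exact match mis and position 3 is sis, which differs from the pattern in one character.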
  • 19. Application
  • 20. Platform and Language
    1. Language: C#
    2. Platform: Microsoft .NET (.NET Framework v4.0)
  • 21. Algorithms
    1. Construction of suffix array with LCP
       - Radix sort
       - Skew algorithm
    2. Construction of gapped suffix array with gLCP
       - Radix sort
    3. Approximate string search
       - Pattern analysis
       - Binary search with LCP
  • 22. Gapped Suffix Array
  • 23. Going beyond gSA
  • 24. Main questions
    1. Using the developed suffix array, can a gapped suffix array be developed in O(n) time?
    2. What are the limitations of the gapped suffix array? How can these be overcome?
  • 25. Limitation of gSA
    Assuming k = 1 and the gap ends at m - 1:
    P = coat, (2,3)-gSA(4,1)
    Pattern  Searching array
    #oat     SA
    c#at     cannot support
    co#t     gSA(4,1)
    coa#     SA
    P = coast, (3,4)-gSA(5,1)
    Pattern  Searching array
    #oast    SA
    c#ast    cannot support
    co#st    cannot support
    coa#t    gSA(5,1)
    coas#    SA
  • 26. Countermeasure
    P = coat, (2,3)-gSA(4,1)
    Pattern  Searching array
    #oat     SA
    c#at     gSA(3,1)
    co#t     gSA(4,1)
    coa#     SA
    P = coast, (3,4)-gSA(5,1)
    Pattern  Searching array
    #oast    SA
    c#ast    gSA(3,1)
    co#st    gSA(4,1)
    coa#t    gSA(5,1)
    coas#    SA
  • 27. Countermeasure
    P = cot: c#t, #ot, co#
    gSA(3, 1) → SA, gSA(3, 1)
    P = coat: #oat, c#at, co#t, coa#
    gSA(4, 1) → SA, gSA(3, 1), gSA(4, 1)
    P = coast: #oast, c#ast, co#st, coa#t, coas#
    gSA(5, 1) → SA, gSA(3, 1), gSA(4, 1), gSA(5, 1)
    P = coasts: #oasts, c#asts, co#sts, coa#ts, coas#s, coast#
    gSA(6, 1) → SA, gSA(3, 1), gSA(4, 1), gSA(5, 1), gSA(6, 1)
    gSA(m, 1) → SA, gSA(3, 1) … gSA(m-2, 1), gSA(m-1, 1), gSA(m, 1)
  • 28. Theorem
    If the length of the gap is 1, the number of required gSAs is |m - 2|, and both construction and search can be performed in linear time.
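The count in the theorem can be checked with a short Python sketch that assigns each single-mismatch pattern to a structure, following the countermeasure tables (c#at → gSA(3,1), co#t → gSA(4,1), and so on). The function name `required_structures` and the string labels are assumptions made for this illustration.

```python
def required_structures(pattern):
    """Map every one-gap variant of `pattern` to the array that would serve it."""
    m = len(pattern)
    plan = []
    for j in range(m):
        masked = pattern[:j] + "#" + pattern[j + 1:]
        if j == 0 or j == m - 1:
            # A gap at either end reduces to an exact search in the plain SA.
            structure = "SA"
        else:
            # An interior gap at position j needs the gSA whose gap ends the
            # (j + 2)-character prefix, written gSA(j + 2, 1) in the slides.
            structure = f"gSA({j + 2}, 1)"
        plan.append((masked, structure))
    return plan


for masked, structure in required_structures("coast"):
    print(masked, "->", structure)

# Distinct gSAs needed for m = 5: gSA(3, 1), gSA(4, 1), gSA(5, 1), i.e. m - 2.
needed = {s for _, s in required_structures("coast") if s != "SA"}
print(len(needed))  # 3
```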
  • 29. Total count of required gSAs
    gSA(m, p)  Required gapped suffix arrays
    gSA(3,1) → SA, gSA(3,1)
    gSA(4,1) → SA, gSA(3,1), gSA(4,1)
    gSA(4,2) → SA, gSA(3,1), gSA(4,2)
    gSA(5,1) → SA, gSA(3,1), gSA(4,1), gSA(5,1)
    gSA(5,2) → SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,2)
    gSA(5,3) → SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,3)
    gSA(6,1) → SA, gSA(3,1), gSA(4,1), gSA(5,1), gSA(6,1)
    gSA(6,2) → SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(6,2)
    gSA(6,3) → SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), gSA(6,3)
    gSA(6,4) → SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), gSA(6,4)
    gSA(7,1) → SA, gSA(3,1), gSA(4,1), gSA(5,1), gSA(6,1), gSA(7,1)
    gSA(7,2) → SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(6,1), gSA(6,2), gSA(7,2)
    gSA(7,3) → SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), gSA(6,1), gSA(6,2), gSA(6,3), gSA(7,3)
    gSA(7,4) → SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), gSA(6,1), gSA(6,2), gSA(6,3), gSA(6,4), gSA(7,4)
    gSA(7,5) → SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), gS
    gC = total count of required gSAs
    $gC = \sum_{i=1}^{p-1} \begin{cases} k - i & \text{if } k - i > 0 \\ 1 & \text{otherwise} \end{cases}$
  • 30. Multiple gaps, varying m
    P = coat: ##at, #o#t, #oa#, c##t, c#a#, co##
    gSA(4,2) → SA, gSA(3,1), gSA(4,2)
    P = coast: ##ast, #o#st, #oa#t, #oas#, c##st, c#a#t, c#as#, co##t, co#s#, coa##
    gSA(5,2) → SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,2), (1,2)(3,4)-gSA(5,2)
    P = coasts: ##asts, #o#sts, #oa#ts, #oas#s, #oast#, c##sts, c#a#ts, c#as#s, c#ast#, co##ts, co#s#s, co#st#, coa##s, coa#t#, coas##
    gSA(6,2) → SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), (1,2)(3,4)-gSA(5,2), gSA(6,2), (1,2)(4,5)-gSA(6,2), (2,3)(4,5)-gSA(6,2)
    P = coasts: ###sts, ##a#ts, ##as#s, ##ast#, #o##ts, #o#s#s, #o#st#, #oa##s, #oa#t#, #oas##, c###ts, c##s#s, c##st#, c#a##s, c#a#t#, c#as##, co###s, co##t#, co#s##, coa###
    gSA(6,3) → SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), (1,2)(3,4)-gSA(5,2), gSA(5,3), gSA(6,3), (1,3)(4,5)-gSA(6,3), (1,2)(3,5)-gSA(6,3)
  • 31. Two approaches to support multiple gaps
    First: search the index only up to the first gap of the search pattern, then compare every remaining character individually.
    Second: keep constructing additional multiple-gapped suffix arrays, following the method above.
  • 32. First approach
    c # a # t:    r = gSA[i](3,1), T[r]  T[r+2] T[r+3] T[r+4]
    c # a s # s:  r = gSA[i](3,1), T[r]  T[r+3] T[r+4] T[r+5]
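The first approach can be sketched in Python as below. This is a simplified illustration, not the implementation from the slides: it indexes the first fragment with a naively built plain SA rather than a gSA with gLCP, and the name `search_masked` is hypothetical. The fragment before the first gap is located by binary search, and every remaining position is then compared character by character.

```python
def search_masked(text, masked):
    """Find all positions where `masked` matches `text`; '#' matches any character."""
    n, m = len(text), len(masked)
    sa = sorted(range(n), key=lambda i: text[i:])  # naive SA, illustration only
    first = masked.split("#", 1)[0]  # fragment before the first gap
    fm = len(first)

    # Binary search for the block of suffixes that start with the first fragment.
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + fm] < first:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, n
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + fm] <= first:
            lo = mid + 1
        else:
            hi = mid

    # Verify the rest of the pattern one character at a time.
    hits = [r for r in sa[start:lo]
            if r + m <= n
            and all(c == "#" or text[r + j] == c
                    for j, c in enumerate(masked) if j >= fm)]
    return sorted(hits)


print(search_masked("mississippi", "s#s#i"))  # [3]
```

The verification loop is what makes the remaining fragments cost up to (m - fm) comparisons per candidate, so a short first fragment leaves many candidates to check.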
  • 33. Worst case for searching with the first approach
    Let fm be the length of the first fragment.
    Binary search for the first fragment with gLCP: O(log n + fm)
    Search of the remaining fragments: O((m - fm)n)
    Total: O((m - fm)n + log n + fm)
  • 34. Summary
  • 35. Further work
    The gapped suffix array only supports searching for specific patterns. Supporting approximate indexing in all situations will require more research and development on multiple-gapped suffix arrays.
    The future task is to study the multiple-gapped suffix array and its efficiency.
  • 36. Conclusion
    The result of Crochemore and Tischler that a gSA can be constructed in linear time has been put into practice and confirmed.
    In addition, the potential of multiple gSAs was examined, with the conclusion that this area requires more research.
  • 38. Q&A