SlideShare a Scribd company logo
1 of 29
Download to read offline
Certified Bit-Coded Regular Expression Parsing
Rodrigo Ribeiro1 Andr´e Rauber Du Bois2
1Departament of Computer Science
Universidade Federal de Ouro Preto
2Departament of Computer Science
Universidade Federal de Pelotas
September 21, 2017
Introduction
Parsing is pervasive in computing
String search tools, lexical analysers...
Binary data files like images, videos ...
Our focus: Regular Languages (RLs)
Languages denoted by Regular Expressions (REs) and
equivalent formalisms
Introduction
Is RE parsing a yes / no question?.
No! Better to produce evidence: parse trees.
Why use bit-codes for parse trees?
Memory compact representation of parsing evidence.
Easy serialization of parsing results.
Contributions
We provide fully certified proofs of a derivative based
algorithm that produces a bit representation of a parse tree.
We mechanize results about the relation between RE parse
trees and bit-codes.
We provide sound and complete decision procedures for prefix
and substring matching of RE.
Coded included in a RE search tool developed by us —
verigrep.
All results formalized in Agda version 2.5.2.
Regular Expression Syntax
Definition of RE over a finite alphabet Σ.
e ::= ∅ | | a | e e | e + e | e
Agda code
data Regex : Set where
∅ : Regex
: Regex
$ : Char → Regex
• : Regex → Regex → Regex
+ : Regex → Regex → Regex
: Regex → Regex
Regular Expression Semantics
∈
a ∈ Σ
a ∈ a
s ∈ e s ∈ e
ss ∈ ee
s ∈ e
s ∈ e + e
s ∈ e
s ∈ e + e
s ∈ + ee
s ∈ e
Regular Expression Semantics — Agda code
data ∈ : List Char → Regex → Set where
: [ ] ∈
$ : (c : Char) → $ c ∈ $ c
• : s ∈ l →
s ∈ r →
(s ++ s ) ∈ l • r
+ L : s ∈ l → s ∈ l + r
-- some code omitted...
Parse trees for REs
We interpret RE as types and parse tree as terms.
Informally:
leafs: empty string and character.
concatenation: pair of parse trees.
choice: just the branch of chosen RE.
Kleene star: list of parse trees.
Parse trees for RE — Example
star− ::
inr
•
$ a $ b
star− ::
inr
•
$ a $ b
star[]
Figure: Parse tree for RE: (c + ab) and the string w = abab.
Parse trees for REs
data Tree : Regex → Set where
: Tree
$ : (c : Char) → Tree ($ c)
inl : Tree l → Tree (l + r)
inr : Tree r → Tree (l + r)
• : Tree l → Tree r → Tree (l • r)
star[] : Tree (l )
star− :: : Tree l → Tree (l ) → Tree (l )
Relating parse trees and RE semantics
Using function flat.
Property: Let t be a parse tree for a RE e and a string s.
Then, flat(t) = s and s ∈ e .
flat( ) = []
flat($ a) = [a]
flat(inl t) = flat(t)
flat(inr t) = flat(t)
flat(t • t ) = flat(t)++flat(t )
flat(star[]) = []
flat(star− :: t ts) = flat(t)++flat(ts)
Relating parse trees and RE semantics
flat type ensure its correctness property!
flat : Tree e → ∃ (λ xs → xs ∈ e )
flat = [ ] ,
flat ($ c) = [ c ] , ($ c)
flat (inl r t) with flat t
...| xs , prf = , r +L prf
flat (inr l t) with flat t
...| xs , prf = , l +R prf
flat (t • t ) with flat t | flat t
...| xs , prf | ys , prf = , (prf • prf )
-- some code omitted
Bit-codes for parse trees
Bit-codes mark...
which branch of choice was chosen during parsing.
matchings done by the Kleene star operator.
Predicate relating bit-codes to its RE.
data IsCode : List Bit → Regex → Set where
: [ ] IsCode
$ : (c : Char) → [ ] IsCode ($ c)
inl : xs IsCode l → (0b :: xs) IsCode (l + r)
inr : xs IsCode r → (1b :: xs) IsCode (l + r)
• : xs IsCode l → ys IsCode r → (xs ++ ys) IsCode (l • r)
star[] : [ 1b ] IsCode (l )
star− :: : xs IsCode l → xss IsCode (l ) →
(0b :: xs ++ xss) IsCode (l )
How to relate bit-codes and parse trees?
Function code builds bit-codes for parse trees.
code : Tree e → ∃ (λ bs → bs IsCode e)
code ($ c) = [ ] , ($ c)
code (inl r t) with code t
...| ys , pr = 0b :: ys , inl r pr
code (inr l t) with code t
...| ys , pr = 1b :: ys , inr l pr
code star[] = 1b :: [ ] , star[]
code (star− :: t ts) with code t | code ts
...| xs , pr | xss , prs = (0b :: xs ++ xss) , star− :: pr prs
-- some code omitted
How to relate bit-codes and parse-trees?
Function decode parses a bit string for w.r.t. a RE.
Property: forall t, decode (code t) ≡ t
decode : ∃ (λ bs → bs IsCode e) → Tree e
decode ( , ($ c)) = $ c
decode ( , (inl r pr)) = inl r (decode ( , pr))
decode ( , (inr l pr)) = inr l (decode ( , pr))
decode star[] = ( +L )
decode (star− :: pr pr ) with decode ( , pr) | decode ( , pr )
...| pr1 | pr2 = ( +R (pr1 • pr2))
-- some code omitted
Bit-codes for RE parse trees
Building a parse tree for compute bit-codes is expansive.
Better idea: build bit-codes during parsing, instead of
computing parse trees.
How? Just attach bit-codes to RE.
data BitRegex : Set where
empty : BitRegex
eps : List Bit → BitRegex
char : List Bit → Char → BitRegex
choice : List Bit → BitRegex → BitRegex → BitRegex
cat : List Bit → BitRegex → BitRegex → BitRegex
star : List Bit → BitRegex → BitRegex
Relating REs and BREs
Function internalize converts a RE into a BRE.
internalize : Regex → BitRegex
internalize ∅ = empty
internalize = eps [ ]
internalize ($ x) = char [ ] x
internalize (e • e ) = cat [ ] (internalize e) (internalize e )
internalize (e + e )
= choice [ ] (fuse [ 0b ] (internalize e))
(fuse [ 1b ] (internalize e ))
internalize (e ) = star [ ] (internalize e)
Relating REs and BREs
Function erase converts a BRE into a RE.
Property: for all e, erase (internalize e) ≡ e.
erase : BitRegex → Regex
erase empty = ∅
erase (eps x) =
erase (char x c) = $ c
erase (choice x e e ) = erase e + (erase e )
erase (cat x e e ) = erase e • (erase e )
erase (star x e) = (erase e)
Semantics of BREs
Same as RE semantics, but includes the bit-codes.
Property: s ∈ e iff s ∈ internalize e .
data ∈ : List Char → BitRegex → Set where
eps : [ ] ∈ eps bs
char : (c : Char) → [ c ] ∈ char bs c
inl : s ∈ l → s ∈ choice bs l r
inr : s ∈ r → s ∈ choice bs l r
cat : s ∈ l → s ∈ r → (s ++ s ) ∈ cat bs l r
-- some code omitted
Derivatives in a nutshell
What is the derivative of an (B)RE?
Derivatives are defined w.r.t. an alphabet symbol.
The derivative of a (B)RE e w.r.t. a, ∂(e, a), is another
(B)RE that denotes all strings in e language with the leading
a removed.
∂a(e) = {s | as ∈ e }
Property: s ∈ ∂[ e , x ] holds iff (x :: s) ∈ e holds.
Derivatives for BREs
Follows the definition of Brzozowski’s derivatives.
∂[ , ] : BitRegex → Char → BitRegex
∂[ eps bs , c ] = eps bs
∂[ cat bs e e , c ] with ν[ e ]
∂[ cat bs e e , c ] | yes pr
= choice bs (cat bs ∂[ e , c ] e ) (fuse (mkEps pr) ∂[ e , c ])
∂[ cat bs e e , c ] | no pr = cat bs ∂[ e , c ] e
∂[ star bs e , c ]
= cat bs (fuse [ 0b ] ∂[ e , c ]) (star [ ] e)
-- some code omitted
Nullability test for BREs
Checks if ∈ e , for some (B)RE e.
Agda code: decision procedure for [ ] ∈ e .
ν(∅) = ∅
ν( ) =
ν(a) = ∅
ν(e e ) =
if ν(e) = ν(e ) =
∅ otherwise
ν(e + e ) =
if ν(e) = or ν(e ) =
∅ otherwise
ν(e ) =
Parsing with derivatives
RE-based text search tools parse prefixes and substrings.
Types IsPrefix and IsSubstr are proofs that a string is a prefix
/ substring of an input BRE.
Parsing algorithms defined as proofs of decidability of IsPrefix
and IsSubstr. Proof by induction on the input string, using
properties of derivative operation.
data IsPrefix (a : List Char) (e : BitRegex) : Set where
Prefix : a ≡ b ++ c → b ∈ e → IsPrefix a e
data IsSubstr (a : List Char) (e : BitRegex) : Set where
Substr : a ≡ b ++ c ++ d → c ∈ e → IsSubstr a e
Experimental Results
Figure: Results of experiment 1.
Future Work
How to improve efficiency?
How intrinsic verification affects generated code efficiency?
Currently porting code to use extrinsic verification (proofs
separated from program code).
Experiment with alternative formalization: BRE semantics
defined by erase: s ∈ e = s ∈ erase e .
How to measure memory consumption, without compiler
support?
No profiling support in Agda compiler.
Agda compiles to Haskell, but there’s no direct correspondence
between Agda source code and Haskell generated code.
Conclusion
We build a certified algorithm for BRE parsing in Agda.
We certify several previous results about bit-coded parse trees
and their relationship with RE semantics.
Algorithm included in verigrep tool for RE text search.
Relating REs and BREs
Function fuse attach a bit-string into a BRE.
fuse : List Bit → BitRegex → BitRegex
fuse bs empty = empty
fuse bs (eps x) = eps (bs ++ x)
fuse bs (char x c) = char (bs ++ x) c
fuse bs (choice x e e ) = choice (bs ++ x) e e
fuse bs (cat x e e ) = cat (bs ++ x) e e
fuse bs (star x e) = star (bs ++ x) e
Building bit-codes for
mkEps : [ ] ∈ t → List Bit
mkEps (eps bs) = bs
mkEps (inl br bs pr) = bs ++ mkEps pr
mkEps (inr bl bs pr) = bs ++ mkEps pr
mkEps (cat bs pr pr eq) = bs ++ mkEps pr ++ mkEps pr
mkEps (star[] bs) = bs ++ [ 1b ]
mkEps (star− :: bs pr pr x) = bs ++ [ 1b ]
Relating parse trees and RE semantics
unflat builds parse trees out of RE semantics evidence.
Functions flat and unflat are inverses.
unflat : xs ∈ e → Tree e
unflat =
unflat ($ c) = $ c
unflat (prf • prf ) = unflat prf • unflat prf
unflat (r +L prf ) = inl r (unflat prf )
unflat (l +R prf ) = inr l (unflat prf )
-- some code omitted

More Related Content

What's hot

Regular expressions
Regular expressionsRegular expressions
Regular expressionsBrij Kishore
 
Why functional programming and category theory strongly matters
Why functional programming and category theory strongly mattersWhy functional programming and category theory strongly matters
Why functional programming and category theory strongly mattersPiotr Paradziński
 
Pure and Declarative Syntax Definition: Paradise Lost and Regained
Pure and Declarative Syntax Definition: Paradise Lost and RegainedPure and Declarative Syntax Definition: Paradise Lost and Regained
Pure and Declarative Syntax Definition: Paradise Lost and RegainedGuido Wachsmuth
 
Babar: Knowledge Recognition, Extraction and Representation
Babar: Knowledge Recognition, Extraction and RepresentationBabar: Knowledge Recognition, Extraction and Representation
Babar: Knowledge Recognition, Extraction and RepresentationPierre de Lacaze
 
Declarative Syntax Definition - Grammars and Trees
Declarative Syntax Definition - Grammars and TreesDeclarative Syntax Definition - Grammars and Trees
Declarative Syntax Definition - Grammars and TreesGuido Wachsmuth
 
Advance LISP (Artificial Intelligence)
Advance LISP (Artificial Intelligence) Advance LISP (Artificial Intelligence)
Advance LISP (Artificial Intelligence) wahab khan
 
Python 培训讲义
Python 培训讲义Python 培训讲义
Python 培训讲义leejd
 
Beyond tf idf why, what & how
Beyond tf idf why, what & howBeyond tf idf why, what & how
Beyond tf idf why, what & howlucenerevolution
 
Finaal application on regular expression
Finaal application on regular expressionFinaal application on regular expression
Finaal application on regular expressionGagan019
 
High-Performance Haskell
High-Performance HaskellHigh-Performance Haskell
High-Performance HaskellJohan Tibell
 
Real World Haskell: Lecture 7
Real World Haskell: Lecture 7Real World Haskell: Lecture 7
Real World Haskell: Lecture 7Bryan O'Sullivan
 
Declarative Semantics Definition - Static Analysis and Error Checking
Declarative Semantics Definition - Static Analysis and Error CheckingDeclarative Semantics Definition - Static Analysis and Error Checking
Declarative Semantics Definition - Static Analysis and Error CheckingGuido Wachsmuth
 
Learning notes of r for python programmer (Temp1)
Learning notes of r for python programmer (Temp1)Learning notes of r for python programmer (Temp1)
Learning notes of r for python programmer (Temp1)Chia-Chi Chang
 
Compiler Components and their Generators - Traditional Parsing Algorithms
Compiler Components and their Generators - Traditional Parsing AlgorithmsCompiler Components and their Generators - Traditional Parsing Algorithms
Compiler Components and their Generators - Traditional Parsing AlgorithmsGuido Wachsmuth
 
Static name resolution
Static name resolutionStatic name resolution
Static name resolutionEelco Visser
 
Real World Haskell: Lecture 5
Real World Haskell: Lecture 5Real World Haskell: Lecture 5
Real World Haskell: Lecture 5Bryan O'Sullivan
 

What's hot (20)

Regular expressions
Regular expressionsRegular expressions
Regular expressions
 
Why functional programming and category theory strongly matters
Why functional programming and category theory strongly mattersWhy functional programming and category theory strongly matters
Why functional programming and category theory strongly matters
 
Term Rewriting
Term RewritingTerm Rewriting
Term Rewriting
 
Pure and Declarative Syntax Definition: Paradise Lost and Regained
Pure and Declarative Syntax Definition: Paradise Lost and RegainedPure and Declarative Syntax Definition: Paradise Lost and Regained
Pure and Declarative Syntax Definition: Paradise Lost and Regained
 
Babar: Knowledge Recognition, Extraction and Representation
Babar: Knowledge Recognition, Extraction and RepresentationBabar: Knowledge Recognition, Extraction and Representation
Babar: Knowledge Recognition, Extraction and Representation
 
Declarative Syntax Definition - Grammars and Trees
Declarative Syntax Definition - Grammars and TreesDeclarative Syntax Definition - Grammars and Trees
Declarative Syntax Definition - Grammars and Trees
 
Syntax Definition
Syntax DefinitionSyntax Definition
Syntax Definition
 
Advance LISP (Artificial Intelligence)
Advance LISP (Artificial Intelligence) Advance LISP (Artificial Intelligence)
Advance LISP (Artificial Intelligence)
 
Python 培训讲义
Python 培训讲义Python 培训讲义
Python 培训讲义
 
Beyond tf idf why, what & how
Beyond tf idf why, what & howBeyond tf idf why, what & how
Beyond tf idf why, what & how
 
Finaal application on regular expression
Finaal application on regular expressionFinaal application on regular expression
Finaal application on regular expression
 
High-Performance Haskell
High-Performance HaskellHigh-Performance Haskell
High-Performance Haskell
 
Real World Haskell: Lecture 7
Real World Haskell: Lecture 7Real World Haskell: Lecture 7
Real World Haskell: Lecture 7
 
Declarative Semantics Definition - Static Analysis and Error Checking
Declarative Semantics Definition - Static Analysis and Error CheckingDeclarative Semantics Definition - Static Analysis and Error Checking
Declarative Semantics Definition - Static Analysis and Error Checking
 
Learning notes of r for python programmer (Temp1)
Learning notes of r for python programmer (Temp1)Learning notes of r for python programmer (Temp1)
Learning notes of r for python programmer (Temp1)
 
Compiler Components and their Generators - Traditional Parsing Algorithms
Compiler Components and their Generators - Traditional Parsing AlgorithmsCompiler Components and their Generators - Traditional Parsing Algorithms
Compiler Components and their Generators - Traditional Parsing Algorithms
 
Static name resolution
Static name resolutionStatic name resolution
Static name resolution
 
Syntax Definition
Syntax DefinitionSyntax Definition
Syntax Definition
 
Real World Haskell: Lecture 5
Real World Haskell: Lecture 5Real World Haskell: Lecture 5
Real World Haskell: Lecture 5
 
Formal Grammars
Formal GrammarsFormal Grammars
Formal Grammars
 

Similar to Certified bit coded regular expression parsing

Computer Software: Compiler Construction Lecture 05.ppt
Computer Software: Compiler Construction Lecture 05.pptComputer Software: Compiler Construction Lecture 05.ppt
Computer Software: Compiler Construction Lecture 05.pptishaquet82
 
An overview of Python 2.7
An overview of Python 2.7An overview of Python 2.7
An overview of Python 2.7decoupled
 
Power of functions in a typed world
Power of functions in a typed worldPower of functions in a typed world
Power of functions in a typed worldDebasish Ghosh
 
ANSI C REFERENCE CARD
ANSI C REFERENCE CARDANSI C REFERENCE CARD
ANSI C REFERENCE CARDTia Ricci
 
Intermediate code generation in Compiler Design
Intermediate code generation in Compiler DesignIntermediate code generation in Compiler Design
Intermediate code generation in Compiler DesignKuppusamy P
 
C# Cheat Sheet
C# Cheat SheetC# Cheat Sheet
C# Cheat SheetGlowTouch
 
Chapter Two(1)
Chapter Two(1)Chapter Two(1)
Chapter Two(1)bolovv
 
Query Optimization - Brandon Latronica
Query Optimization - Brandon LatronicaQuery Optimization - Brandon Latronica
Query Optimization - Brandon Latronica"FENG "GEORGE"" YU
 
Hierarchical Clustering - Text Mining/NLP
Hierarchical Clustering - Text Mining/NLPHierarchical Clustering - Text Mining/NLP
Hierarchical Clustering - Text Mining/NLPRupak Roy
 
Intermediate code generation1
Intermediate code generation1Intermediate code generation1
Intermediate code generation1Shashwat Shriparv
 

Similar to Certified bit coded regular expression parsing (20)

Ch04
Ch04Ch04
Ch04
 
Computer Software: Compiler Construction Lecture 05.ppt
Computer Software: Compiler Construction Lecture 05.pptComputer Software: Compiler Construction Lecture 05.ppt
Computer Software: Compiler Construction Lecture 05.ppt
 
An overview of Python 2.7
An overview of Python 2.7An overview of Python 2.7
An overview of Python 2.7
 
A tour of Python
A tour of PythonA tour of Python
A tour of Python
 
Power of functions in a typed world
Power of functions in a typed worldPower of functions in a typed world
Power of functions in a typed world
 
ANSI C REFERENCE CARD
ANSI C REFERENCE CARDANSI C REFERENCE CARD
ANSI C REFERENCE CARD
 
Ch3
Ch3Ch3
Ch3
 
Intermediate code generation in Compiler Design
Intermediate code generation in Compiler DesignIntermediate code generation in Compiler Design
Intermediate code generation in Compiler Design
 
Python lecture 07
Python lecture 07Python lecture 07
Python lecture 07
 
C# Cheat Sheet
C# Cheat SheetC# Cheat Sheet
C# Cheat Sheet
 
Chapter Two(1)
Chapter Two(1)Chapter Two(1)
Chapter Two(1)
 
Query Optimization - Brandon Latronica
Query Optimization - Brandon LatronicaQuery Optimization - Brandon Latronica
Query Optimization - Brandon Latronica
 
Ch8a
Ch8aCh8a
Ch8a
 
scripting in Python
scripting in Pythonscripting in Python
scripting in Python
 
7645347.ppt
7645347.ppt7645347.ppt
7645347.ppt
 
Hierarchical Clustering - Text Mining/NLP
Hierarchical Clustering - Text Mining/NLPHierarchical Clustering - Text Mining/NLP
Hierarchical Clustering - Text Mining/NLP
 
Intermediate code generation1
Intermediate code generation1Intermediate code generation1
Intermediate code generation1
 
Arrays
ArraysArrays
Arrays
 
Sql server lab_2
Sql server lab_2Sql server lab_2
Sql server lab_2
 
Antlr V3
Antlr V3Antlr V3
Antlr V3
 

Recently uploaded

Terpineol and it's characterization pptx
Terpineol and it's characterization pptxTerpineol and it's characterization pptx
Terpineol and it's characterization pptxMuhammadRazzaq31
 
FAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceFAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceAlex Henderson
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professormuralinath2
 
Plasmid: types, structure and functions.
Plasmid: types, structure and functions.Plasmid: types, structure and functions.
Plasmid: types, structure and functions.Cherry
 
Role of AI in seed science Predictive modelling and Beyond.pptx
Role of AI in seed science  Predictive modelling and  Beyond.pptxRole of AI in seed science  Predictive modelling and  Beyond.pptx
Role of AI in seed science Predictive modelling and Beyond.pptxArvind Kumar
 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...Scintica Instrumentation
 
COMPOSTING : types of compost, merits and demerits
COMPOSTING : types of compost, merits and demeritsCOMPOSTING : types of compost, merits and demerits
COMPOSTING : types of compost, merits and demeritsCherry
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryAlex Henderson
 
ONLINE VOTING SYSTEM SE Project for vote
ONLINE VOTING SYSTEM SE Project for voteONLINE VOTING SYSTEM SE Project for vote
ONLINE VOTING SYSTEM SE Project for voteRaunakRastogi4
 
GBSN - Microbiology (Unit 4) Concept of Asepsis
GBSN - Microbiology (Unit 4) Concept of AsepsisGBSN - Microbiology (Unit 4) Concept of Asepsis
GBSN - Microbiology (Unit 4) Concept of AsepsisAreesha Ahmad
 
GBSN - Microbiology (Unit 3)Defense Mechanism of the body
GBSN - Microbiology (Unit 3)Defense Mechanism of the body GBSN - Microbiology (Unit 3)Defense Mechanism of the body
GBSN - Microbiology (Unit 3)Defense Mechanism of the body Areesha Ahmad
 
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIACURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIADr. TATHAGAT KHOBRAGADE
 
Site specific recombination and transposition.........pdf
Site specific recombination and transposition.........pdfSite specific recombination and transposition.........pdf
Site specific recombination and transposition.........pdfCherry
 
Concept of gene and Complementation test.pdf
Concept of gene and Complementation test.pdfConcept of gene and Complementation test.pdf
Concept of gene and Complementation test.pdfCherry
 
FS P2 COMBO MSTA LAST PUSH past exam papers.
FS P2 COMBO MSTA LAST PUSH past exam papers.FS P2 COMBO MSTA LAST PUSH past exam papers.
FS P2 COMBO MSTA LAST PUSH past exam papers.takadzanijustinmaime
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusNazaninKarimi6
 
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry Areesha Ahmad
 
GBSN - Biochemistry (Unit 3) Metabolism
GBSN - Biochemistry (Unit 3) MetabolismGBSN - Biochemistry (Unit 3) Metabolism
GBSN - Biochemistry (Unit 3) MetabolismAreesha Ahmad
 
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptxClimate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptxDiariAli
 

Recently uploaded (20)

Terpineol and it's characterization pptx
Terpineol and it's characterization pptxTerpineol and it's characterization pptx
Terpineol and it's characterization pptx
 
FAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceFAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical Science
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
 
Plasmid: types, structure and functions.
Plasmid: types, structure and functions.Plasmid: types, structure and functions.
Plasmid: types, structure and functions.
 
Role of AI in seed science Predictive modelling and Beyond.pptx
Role of AI in seed science  Predictive modelling and  Beyond.pptxRole of AI in seed science  Predictive modelling and  Beyond.pptx
Role of AI in seed science Predictive modelling and Beyond.pptx
 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
 
COMPOSTING : types of compost, merits and demerits
COMPOSTING : types of compost, merits and demeritsCOMPOSTING : types of compost, merits and demerits
COMPOSTING : types of compost, merits and demerits
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
ONLINE VOTING SYSTEM SE Project for vote
ONLINE VOTING SYSTEM SE Project for voteONLINE VOTING SYSTEM SE Project for vote
ONLINE VOTING SYSTEM SE Project for vote
 
GBSN - Microbiology (Unit 4) Concept of Asepsis
GBSN - Microbiology (Unit 4) Concept of AsepsisGBSN - Microbiology (Unit 4) Concept of Asepsis
GBSN - Microbiology (Unit 4) Concept of Asepsis
 
GBSN - Microbiology (Unit 3)Defense Mechanism of the body
GBSN - Microbiology (Unit 3)Defense Mechanism of the body GBSN - Microbiology (Unit 3)Defense Mechanism of the body
GBSN - Microbiology (Unit 3)Defense Mechanism of the body
 
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIACURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
 
Site specific recombination and transposition.........pdf
Site specific recombination and transposition.........pdfSite specific recombination and transposition.........pdf
Site specific recombination and transposition.........pdf
 
Concept of gene and Complementation test.pdf
Concept of gene and Complementation test.pdfConcept of gene and Complementation test.pdf
Concept of gene and Complementation test.pdf
 
FS P2 COMBO MSTA LAST PUSH past exam papers.
FS P2 COMBO MSTA LAST PUSH past exam papers.FS P2 COMBO MSTA LAST PUSH past exam papers.
FS P2 COMBO MSTA LAST PUSH past exam papers.
 
ABHISHEK ANTIBIOTICS PPT MICROBIOLOGY // USES OF ANTIOBIOTICS TYPES OF ANTIB...
ABHISHEK ANTIBIOTICS PPT MICROBIOLOGY  // USES OF ANTIOBIOTICS TYPES OF ANTIB...ABHISHEK ANTIBIOTICS PPT MICROBIOLOGY  // USES OF ANTIOBIOTICS TYPES OF ANTIB...
ABHISHEK ANTIBIOTICS PPT MICROBIOLOGY // USES OF ANTIOBIOTICS TYPES OF ANTIB...
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
 
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry
 
GBSN - Biochemistry (Unit 3) Metabolism
GBSN - Biochemistry (Unit 3) MetabolismGBSN - Biochemistry (Unit 3) Metabolism
GBSN - Biochemistry (Unit 3) Metabolism
 
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptxClimate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
 

Certified bit coded regular expression parsing

  • 1. Certified Bit-Coded Regular Expression Parsing Rodrigo Ribeiro1 Andr´e Rauber Du Bois2 1Departament of Computer Science Universidade Federal de Ouro Preto 2Departament of Computer Science Universidade Federal de Pelotas September 21, 2017
  • 2. Introduction Parsing is pervasive in computing String search tools, lexical analysers... Binary data files like images, videos ... Our focus: Regular Languages (RLs) Languages denoted by Regular Expressions (REs) and equivalent formalisms
  • 3. Introduction Is RE parsing a yes / no question?. No! Better to produce evidence: parse trees. Why use bit-codes for parse trees? Memory compact representation of parsing evidence. Easy serialization of parsing results.
  • 4. Contributions We provide fully certified proofs of a derivative based algorithm that produces a bit representation of a parse tree. We mechanize results about the relation between RE parse trees and bit-codes. We provide sound and complete decision procedures for prefix and substring matching of RE. Coded included in a RE search tool developed by us — verigrep. All results formalized in Agda version 2.5.2.
  • 5. Regular Expression Syntax Definition of RE over a finite alphabet Σ. e ::= ∅ | | a | e e | e + e | e Agda code data Regex : Set where ∅ : Regex : Regex $ : Char → Regex • : Regex → Regex → Regex + : Regex → Regex → Regex : Regex → Regex
  • 6. Regular Expression Semantics ∈ a ∈ Σ a ∈ a s ∈ e s ∈ e ss ∈ ee s ∈ e s ∈ e + e s ∈ e s ∈ e + e s ∈ + ee s ∈ e
  • 7. Regular Expression Semantics — Agda code data ∈ : List Char → Regex → Set where : [ ] ∈ $ : (c : Char) → $ c ∈ $ c • : s ∈ l → s ∈ r → (s ++ s ) ∈ l • r + L : s ∈ l → s ∈ l + r -- some code omitted...
  • 8. Parse trees for REs We interpret RE as types and parse tree as terms. Informally: leafs: empty string and character. concatenation: pair of parse trees. choice: just the branch of chosen RE. Kleene star: list of parse trees.
  • 9. Parse trees for RE — Example star− :: inr • $ a $ b star− :: inr • $ a $ b star[] Figure: Parse tree for RE: (c + ab) and the string w = abab.
  • 10. Parse trees for REs data Tree : Regex → Set where : Tree $ : (c : Char) → Tree ($ c) inl : Tree l → Tree (l + r) inr : Tree r → Tree (l + r) • : Tree l → Tree r → Tree (l • r) star[] : Tree (l ) star− :: : Tree l → Tree (l ) → Tree (l )
  • 11. Relating parse trees and RE semantics Using function flat. Property: Let t be a parse tree for a RE e and a string s. Then, flat(t) = s and s ∈ e . flat( ) = [] flat($ a) = [a] flat(inl t) = flat(t) flat(inr t) = flat(t) flat(t • t ) = flat(t)++flat(t ) flat(star[]) = [] flat(star− :: t ts) = flat(t)++flat(ts)
  • 12. Relating parse trees and RE semantics flat type ensure its correctness property! flat : Tree e → ∃ (λ xs → xs ∈ e ) flat = [ ] , flat ($ c) = [ c ] , ($ c) flat (inl r t) with flat t ...| xs , prf = , r +L prf flat (inr l t) with flat t ...| xs , prf = , l +R prf flat (t • t ) with flat t | flat t ...| xs , prf | ys , prf = , (prf • prf ) -- some code omitted
  • 13. Bit-codes for parse trees Bit-codes mark... which branch of choice was chosen during parsing. matchings done by the Kleene star operator. Predicate relating bit-codes to its RE. data IsCode : List Bit → Regex → Set where : [ ] IsCode $ : (c : Char) → [ ] IsCode ($ c) inl : xs IsCode l → (0b :: xs) IsCode (l + r) inr : xs IsCode r → (1b :: xs) IsCode (l + r) • : xs IsCode l → ys IsCode r → (xs ++ ys) IsCode (l • r) star[] : [ 1b ] IsCode (l ) star− :: : xs IsCode l → xss IsCode (l ) → (0b :: xs ++ xss) IsCode (l )
  • 14. How to relate bit-codes and parse trees? Function code builds bit-codes for parse trees. code : Tree e → ∃ (λ bs → bs IsCode e) code ($ c) = [ ] , ($ c) code (inl r t) with code t ...| ys , pr = 0b :: ys , inl r pr code (inr l t) with code t ...| ys , pr = 1b :: ys , inr l pr code star[] = 1b :: [ ] , star[] code (star− :: t ts) with code t | code ts ...| xs , pr | xss , prs = (0b :: xs ++ xss) , star− :: pr prs -- some code omitted
  • 15. How to relate bit-codes and parse-trees? Function decode parses a bit string for w.r.t. a RE. Property: forall t, decode (code t) ≡ t decode : ∃ (λ bs → bs IsCode e) → Tree e decode ( , ($ c)) = $ c decode ( , (inl r pr)) = inl r (decode ( , pr)) decode ( , (inr l pr)) = inr l (decode ( , pr)) decode star[] = ( +L ) decode (star− :: pr pr ) with decode ( , pr) | decode ( , pr ) ...| pr1 | pr2 = ( +R (pr1 • pr2)) -- some code omitted
  • 16. Bit-codes for RE parse trees Building a parse tree for compute bit-codes is expansive. Better idea: build bit-codes during parsing, instead of computing parse trees. How? Just attach bit-codes to RE. data BitRegex : Set where empty : BitRegex eps : List Bit → BitRegex char : List Bit → Char → BitRegex choice : List Bit → BitRegex → BitRegex → BitRegex cat : List Bit → BitRegex → BitRegex → BitRegex star : List Bit → BitRegex → BitRegex
  • 17. Relating REs and BREs Function internalize converts a RE into a BRE. internalize : Regex → BitRegex internalize ∅ = empty internalize = eps [ ] internalize ($ x) = char [ ] x internalize (e • e ) = cat [ ] (internalize e) (internalize e ) internalize (e + e ) = choice [ ] (fuse [ 0b ] (internalize e)) (fuse [ 1b ] (internalize e )) internalize (e ) = star [ ] (internalize e)
  • 18. Relating REs and BREs Function erase converts a BRE into a RE. Property: for all e, erase (internalize e) ≡ e. erase : BitRegex → Regex erase empty = ∅ erase (eps x) = erase (char x c) = $ c erase (choice x e e ) = erase e + (erase e ) erase (cat x e e ) = erase e • (erase e ) erase (star x e) = (erase e)
  • 19. Semantics of BREs Same as RE semantics, but includes the bit-codes. Property: s ∈ e iff s ∈ internalize e . data ∈ : List Char → BitRegex → Set where eps : [ ] ∈ eps bs char : (c : Char) → [ c ] ∈ char bs c inl : s ∈ l → s ∈ choice bs l r inr : s ∈ r → s ∈ choice bs l r cat : s ∈ l → s ∈ r → (s ++ s ) ∈ cat bs l r -- some code omitted
  • 20. Derivatives in a nutshell What is the derivative of an (B)RE? Derivatives are defined w.r.t. an alphabet symbol. The derivative of a (B)RE e w.r.t. a, ∂(e, a), is another (B)RE that denotes all strings in e language with the leading a removed. ∂a(e) = {s | as ∈ e } Property: s ∈ ∂[ e , x ] holds iff (x :: s) ∈ e holds.
  • 21. Derivatives for BREs Follows the definition of Brzozowski’s derivatives. ∂[ , ] : BitRegex → Char → BitRegex ∂[ eps bs , c ] = eps bs ∂[ cat bs e e , c ] with ν[ e ] ∂[ cat bs e e , c ] | yes pr = choice bs (cat bs ∂[ e , c ] e ) (fuse (mkEps pr) ∂[ e , c ]) ∂[ cat bs e e , c ] | no pr = cat bs ∂[ e , c ] e ∂[ star bs e , c ] = cat bs (fuse [ 0b ] ∂[ e , c ]) (star [ ] e) -- some code omitted
  • 22. Nullability test for BREs Checks if ∈ e , for some (B)RE e. Agda code: decision procedure for [ ] ∈ e . ν(∅) = ∅ ν( ) = ν(a) = ∅ ν(e e ) = if ν(e) = ν(e ) = ∅ otherwise ν(e + e ) = if ν(e) = or ν(e ) = ∅ otherwise ν(e ) =
  • 23. Parsing with derivatives RE-based text search tools parse prefixes and substrings. Types IsPrefix and IsSubstr are proofs that a string is a prefix / substring of an input BRE. Parsing algorithms defined as proofs of decidability of IsPrefix and IsSubstr. Proof by induction on the input string, using properties of derivative operation. data IsPrefix (a : List Char) (e : BitRegex) : Set where Prefix : a ≡ b ++ c → b ∈ e → IsPrefix a e data IsSubstr (a : List Char) (e : BitRegex) : Set where Substr : a ≡ b ++ c ++ d → c ∈ e → IsSubstr a e
  • 25. Future Work How to improve efficiency? How intrinsic verification affects generated code efficiency? Currently porting code to use extrinsic verification (proofs separated from program code). Experiment with alternative formalization: BRE semantics defined by erase: s ∈ e = s ∈ erase e . How to measure memory consumption, without compiler support? No profiling support in Agda compiler. Agda compiles to Haskell, but there’s no direct correspondence between Agda source code and Haskell generated code.
  • 26. Conclusion We build a certified algorithm for BRE parsing in Agda. We certify several previous results about bit-coded parse trees and their relationship with RE semantics. Algorithm included in verigrep tool for RE text search.
  • 27. Relating REs and BREs Function fuse attach a bit-string into a BRE. fuse : List Bit → BitRegex → BitRegex fuse bs empty = empty fuse bs (eps x) = eps (bs ++ x) fuse bs (char x c) = char (bs ++ x) c fuse bs (choice x e e ) = choice (bs ++ x) e e fuse bs (cat x e e ) = cat (bs ++ x) e e fuse bs (star x e) = star (bs ++ x) e
  • 28. Building bit-codes for mkEps : [ ] ∈ t → List Bit mkEps (eps bs) = bs mkEps (inl br bs pr) = bs ++ mkEps pr mkEps (inr bl bs pr) = bs ++ mkEps pr mkEps (cat bs pr pr eq) = bs ++ mkEps pr ++ mkEps pr mkEps (star[] bs) = bs ++ [ 1b ] mkEps (star− :: bs pr pr x) = bs ++ [ 1b ]
  • 29. Relating parse trees and RE semantics unflat builds parse trees out of RE semantics evidence. Functions flat and unflat are inverses. unflat : xs ∈ e → Tree e unflat = unflat ($ c) = $ c unflat (prf • prf ) = unflat prf • unflat prf unflat (r +L prf ) = inl r (unflat prf ) unflat (l +R prf ) = inr l (unflat prf ) -- some code omitted