Certified Bit-Coded Regular Expression Parsing
Rodrigo Ribeiro¹, André Rauber Du Bois²
¹Department of Computer Science
Universidade Federal de Ouro Preto
²Department of Computer Science
Universidade Federal de Pelotas
September 21, 2017
Introduction
Parsing is pervasive in computing
String search tools, lexical analysers...
Binary data files like images, videos ...
Our focus: Regular Languages (RLs)
Languages denoted by Regular Expressions (REs) and
equivalent formalisms
Introduction
Is RE parsing a yes / no question?
No! Better to produce evidence: parse trees.
Why use bit-codes for parse trees?
Memory-compact representation of parsing evidence.
Easy serialization of parsing results.
Contributions
We provide fully certified proofs of a derivative-based
algorithm that produces a bit representation of a parse tree.
We mechanize results about the relation between RE parse
trees and bit-codes.
We provide sound and complete decision procedures for prefix
and substring matching of REs.
Code included in verigrep, an RE search tool developed by us.
All results formalized in Agda version 2.5.2.
Regular Expression Syntax
Definition of RE over a finite alphabet Σ.
e ::= ∅ | ε | a | e e | e + e | e⋆
Agda code
data Regex : Set where
  ∅ : Regex
  ε : Regex
  $_ : Char → Regex
  _•_ : Regex → Regex → Regex
  _+_ : Regex → Regex → Regex
  _⋆ : Regex → Regex
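As a rough, untyped analogue of the Regex datatype, the sketch below encodes REs as tagged Python tuples. The tag names, the helper constructors (`chr_`, `cat`, `alt`, `star`) and the printer `show` are illustrative assumptions, not part of the Agda development.

```python
# Hypothetical tuple encoding of the Regex datatype (all names are illustrative).
EMPTY = ("empty",)  # ∅
EPS = ("eps",)      # ε

def chr_(c):
    return ("chr", c)      # $ c

def cat(l, r):
    return ("cat", l, r)   # l • r

def alt(l, r):
    return ("alt", l, r)   # l + r

def star(e):
    return ("star", e)     # e ⋆

def show(e):
    """Render a regex in the concrete syntax used on the slide."""
    tag = e[0]
    if tag == "empty":
        return "∅"
    if tag == "eps":
        return "ε"
    if tag == "chr":
        return e[1]
    if tag == "cat":
        return show(e[1]) + show(e[2])
    if tag == "alt":
        return "(" + show(e[1]) + " + " + show(e[2]) + ")"
    if tag == "star":
        inner = show(e[1])
        # Atoms and already-parenthesised choices need no extra parentheses.
        if e[1][0] not in ("empty", "eps", "chr", "alt"):
            inner = "(" + inner + ")"
        return inner + "⋆"
    raise ValueError(f"not a regex: {e!r}")

example = star(alt(chr_("c"), cat(chr_("a"), chr_("b"))))
print(show(example))  # (c + ab)⋆
```

Unlike the Agda version, nothing here stops a caller from building an ill-formed tuple; the typed datatype rules that out by construction.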
Regular Expression Semantics
Inference rules (premises ⟹ conclusion):
ε ∈ ⟦ε⟧
a ∈ Σ ⟹ a ∈ ⟦a⟧
s ∈ ⟦e⟧, s' ∈ ⟦e'⟧ ⟹ ss' ∈ ⟦ee'⟧
s ∈ ⟦e⟧ ⟹ s ∈ ⟦e + e'⟧
s ∈ ⟦e'⟧ ⟹ s ∈ ⟦e + e'⟧
s ∈ ⟦ε + ee⋆⟧ ⟹ s ∈ ⟦e⋆⟧
Regular Expression Semantics — Agda code
data _∈⟦_⟧ : List Char → Regex → Set where
  ε : [] ∈⟦ ε ⟧
  $_ : (c : Char) → [ c ] ∈⟦ $ c ⟧
  _•_ : s ∈⟦ l ⟧ →
        s' ∈⟦ r ⟧ →
        (s ++ s') ∈⟦ l • r ⟧
  _+L_ : (r : Regex) → s ∈⟦ l ⟧ → s ∈⟦ l + r ⟧
  -- some code omitted...
Parse trees for REs
We interpret REs as types and parse trees as terms.
Informally:
leaves: empty string and character.
concatenation: pair of parse trees.
choice: just the branch of the chosen RE.
Kleene star: list of parse trees.
Parse trees for RE — Example
star-::
  inr ($ a • $ b)
  star-::
    inr ($ a • $ b)
    star[]
Figure: Parse tree for the RE (c + ab)⋆ and the string w = abab.
Parse trees for REs
data Tree : Regex → Set where
  ε : Tree ε
  $_ : (c : Char) → Tree ($ c)
  inl : Tree l → Tree (l + r)
  inr : Tree r → Tree (l + r)
  _•_ : Tree l → Tree r → Tree (l • r)
  star[] : Tree (l ⋆)
  star-:: : Tree l → Tree (l ⋆) → Tree (l ⋆)
Relating parse trees and RE semantics
Using function flat.
Property: Let t be a parse tree for a RE e and a string s.
Then, flat(t) = s and s ∈ ⟦e⟧.
flat(ε) = []
flat($ a) = [a]
flat(inl t) = flat(t)
flat(inr t) = flat(t)
flat(t • t') = flat(t) ++ flat(t')
flat(star[]) = []
flat(star-:: t ts) = flat(t) ++ flat(ts)
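The equations for flat translate almost line by line into an untyped sketch. The tuple encoding of parse trees (tags `eps`, `chr`, `inl`, `inr`, `pair`, `nil`, `cons`) is an assumption made for illustration; it stands in for the Tree constructors ε, $, inl, inr, •, star[] and star-::.

```python
def flat(t):
    """Flatten a parse tree (tagged tuple) back into the string it parses."""
    tag = t[0]
    if tag == "eps":                 # ε
        return ""
    if tag == "chr":                 # $ a
        return t[1]
    if tag in ("inl", "inr"):        # choice: only the chosen branch is stored
        return flat(t[1])
    if tag == "pair":                # t • t'
        return flat(t[1]) + flat(t[2])
    if tag == "nil":                 # star[]
        return ""
    if tag == "cons":                # star-:: t ts
        return flat(t[1]) + flat(t[2])
    raise ValueError(f"not a parse tree: {t!r}")

# Parse tree of w = abab for the RE (c + ab)⋆, as in the example figure.
ab = ("pair", ("chr", "a"), ("chr", "b"))
tree = ("cons", ("inr", ab), ("cons", ("inr", ab), ("nil",)))
print(flat(tree))  # abab
```

The Python version only returns the string; the Agda function additionally returns the proof that the string is in the language of the RE.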
Relating parse trees and RE semantics
The type of flat ensures its correctness property!
flat : Tree e → ∃ (λ xs → xs ∈⟦ e ⟧)
flat ε = [] , ε
flat ($ c) = [ c ] , ($ c)
flat (inl r t) with flat t
...| xs , prf = _ , r +L prf
flat (inr l t) with flat t
...| xs , prf = _ , l +R prf
flat (t • t') with flat t | flat t'
...| xs , prf | ys , prf' = _ , (prf • prf')
-- some code omitted
Bit-codes for parse trees
Bit-codes mark...
which branch of choice was chosen during parsing.
matchings done by the Kleene star operator.
Predicate relating a bit-code to its RE.
data _IsCode_ : List Bit → Regex → Set where
  ε : [] IsCode ε
  $_ : (c : Char) → [] IsCode ($ c)
  inl : xs IsCode l → (0b :: xs) IsCode (l + r)
  inr : xs IsCode r → (1b :: xs) IsCode (l + r)
  _•_ : xs IsCode l → ys IsCode r → (xs ++ ys) IsCode (l • r)
  star[] : [ 1b ] IsCode (l ⋆)
  star-:: : xs IsCode l → xss IsCode (l ⋆) →
            (0b :: xs ++ xss) IsCode (l ⋆)
How to relate bit-codes and parse trees?
Function code builds bit-codes for parse trees.
code : Tree e → ∃ (λ bs → bs IsCode e)
code ($ c) = [ ] , ($ c)
code (inl r t) with code t
...| ys , pr = 0b :: ys , inl r pr
code (inr l t) with code t
...| ys , pr = 1b :: ys , inr l pr
code star[] = 1b :: [ ] , star[]
code (star-:: t ts) with code t | code ts
...| xs , pr | xss , prs = (0b :: xs ++ xss) , star-:: pr prs
-- some code omitted
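A minimal untyped sketch of code, assuming parse trees are tagged tuples (`eps`, `chr`, `inl`, `inr`, `pair`, `nil`, `cons`) and bits are the integers 0 and 1. This encoding is hypothetical, and the IsCode evidence returned by the Agda function has no counterpart here.

```python
def code(t):
    """Compute the bit-code of a parse tree given as a tagged tuple."""
    tag = t[0]
    if tag in ("eps", "chr"):        # leaves carry no choice information
        return []
    if tag == "inl":                 # left branch of a choice: emit 0
        return [0] + code(t[1])
    if tag == "inr":                 # right branch of a choice: emit 1
        return [1] + code(t[1])
    if tag == "pair":                # concatenation: codes are appended
        return code(t[1]) + code(t[2])
    if tag == "nil":                 # star[]: 1 marks the end of the iteration
        return [1]
    if tag == "cons":                # star-:: : 0 marks one more iteration
        return [0] + code(t[1]) + code(t[2])
    raise ValueError(f"not a parse tree: {t!r}")

# Parse tree of abab for (c + ab)⋆.
ab = ("pair", ("chr", "a"), ("chr", "b"))
tree = ("cons", ("inr", ab), ("cons", ("inr", ab), ("nil",)))
print(code(tree))  # [0, 1, 0, 1, 1]
```

Five bits are enough to record the whole tree because the RE itself supplies everything that is not a choice.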
How to relate bit-codes and parse-trees?
Function decode parses a bit string w.r.t. a RE.
Property: for all t, decode (code t) ≡ t.
decode : ∃ (λ bs → bs IsCode e) → Tree e
decode (_ , ($ c)) = $ c
decode (_ , (inl r pr)) = inl r (decode (_ , pr))
decode (_ , (inr l pr)) = inr l (decode (_ , pr))
decode (_ , star[]) = star[]
decode (_ , (star-:: pr pr')) with decode (_ , pr) | decode (_ , pr')
...| t | ts = star-:: t ts
-- some code omitted
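The Agda decode consumes IsCode evidence; an untyped version has to be driven by the RE instead. The sketch below assumes REs and parse trees are tagged tuples and bits are the integers 0 and 1; the helper `decode_prefix` and its error handling are illustrative additions.

```python
def decode_prefix(e, bs):
    """Decode a tree for regex e from the front of bs; return (tree, unused bits)."""
    tag = e[0]
    if tag == "eps":
        return ("eps",), bs
    if tag == "chr":
        return ("chr", e[1]), bs
    if tag == "alt":
        if not bs:
            raise ValueError("ran out of bits at a choice")
        side, sub = ("inl", e[1]) if bs[0] == 0 else ("inr", e[2])
        t, rest = decode_prefix(sub, bs[1:])
        return (side, t), rest
    if tag == "cat":
        t1, rest = decode_prefix(e[1], bs)
        t2, rest = decode_prefix(e[2], rest)
        return ("pair", t1, t2), rest
    if tag == "star":
        if not bs:
            raise ValueError("ran out of bits at a star")
        if bs[0] == 1:               # 1 closes the iteration
            return ("nil",), bs[1:]
        t, rest = decode_prefix(e[1], bs[1:])
        ts, rest = decode_prefix(e, rest)
        return ("cons", t, ts), rest
    raise ValueError("∅ has no parse trees")

def decode(e, bs):
    """Decode a complete bit-code; leftover bits are an error."""
    t, rest = decode_prefix(e, bs)
    if rest:
        raise ValueError("trailing bits")
    return t

regex = ("star", ("alt", ("chr", "c"), ("cat", ("chr", "a"), ("chr", "b"))))
print(decode(regex, [0, 1, 0, 1, 1]))
```

For the bits `[0, 1, 0, 1, 1]` and the RE (c + ab)⋆ this rebuilds the parse tree of abab from the example figure. Ill-formed input raises an exception here, whereas the Agda function cannot be applied to it at all.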
Bit-codes for RE parse trees
Building a parse tree just to compute bit-codes is expensive.
Better idea: build bit-codes during parsing, instead of
computing parse trees.
How? Just attach bit-codes to REs.
data BitRegex : Set where
  empty : BitRegex
  eps : List Bit → BitRegex
  char : List Bit → Char → BitRegex
  choice : List Bit → BitRegex → BitRegex → BitRegex
  cat : List Bit → BitRegex → BitRegex → BitRegex
  star : List Bit → BitRegex → BitRegex
Relating REs and BREs
Function internalize converts a RE into a BRE.
internalize : Regex → BitRegex
internalize ∅ = empty
internalize ε = eps []
internalize ($ x) = char [] x
internalize (e • e') = cat [] (internalize e) (internalize e')
internalize (e + e')
  = choice [] (fuse [ 0b ] (internalize e))
              (fuse [ 1b ] (internalize e'))
internalize (e ⋆) = star [] (internalize e)
Relating REs and BREs
Function erase converts a BRE into a RE.
Property: for all e, erase (internalize e) ≡ e.
erase : BitRegex → Regex
erase empty = ∅
erase (eps x) = ε
erase (char x c) = $ c
erase (choice x e e') = erase e + (erase e')
erase (cat x e e') = erase e • (erase e')
erase (star x e) = (erase e) ⋆
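The round-trip property erase (internalize e) ≡ e can be checked on examples with an untyped sketch. It assumes REs are tagged tuples and BREs are the same tuples with a list of bits in second position; both encodings and the Python function names are illustrative, and `fuse` follows the definition given on the fuse slide.

```python
def fuse(bs, b):
    """Prepend the bits bs to the top-level annotation of the BRE b."""
    if b[0] == "empty":
        return b
    return (b[0], bs + b[1]) + b[2:]

def internalize(e):
    """Turn a plain RE into a BRE, tagging the two sides of each choice with 0 / 1."""
    tag = e[0]
    if tag == "empty":
        return ("empty",)
    if tag == "eps":
        return ("eps", [])
    if tag == "chr":
        return ("chr", [], e[1])
    if tag == "cat":
        return ("cat", [], internalize(e[1]), internalize(e[2]))
    if tag == "alt":
        return ("alt", [],
                fuse([0], internalize(e[1])),
                fuse([1], internalize(e[2])))
    if tag == "star":
        return ("star", [], internalize(e[1]))
    raise ValueError(f"not a regex: {e!r}")

def erase(b):
    """Forget all bit annotations of a BRE."""
    tag = b[0]
    if tag == "empty":
        return ("empty",)
    if tag == "eps":
        return ("eps",)
    if tag == "chr":
        return ("chr", b[2])
    if tag in ("cat", "alt"):
        return (tag, erase(b[2]), erase(b[3]))
    if tag == "star":
        return ("star", erase(b[2]))
    raise ValueError(f"not a bit-coded regex: {b!r}")

regex = ("star", ("alt", ("chr", "c"), ("cat", ("chr", "a"), ("chr", "b"))))
print(erase(internalize(regex)) == regex)  # True
```

A test on examples is of course much weaker than the mechanized proof, which covers every RE.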
Semantics of BREs
Same as RE semantics, but includes the bit-codes.
Property: s ∈ ⟦e⟧ iff s ∈ ⟦internalize e⟧.
data _∈⟦_⟧ : List Char → BitRegex → Set where
  eps : [] ∈⟦ eps bs ⟧
  char : (c : Char) → [ c ] ∈⟦ char bs c ⟧
  inl : s ∈⟦ l ⟧ → s ∈⟦ choice bs l r ⟧
  inr : s ∈⟦ r ⟧ → s ∈⟦ choice bs l r ⟧
  cat : s ∈⟦ l ⟧ → s' ∈⟦ r ⟧ → (s ++ s') ∈⟦ cat bs l r ⟧
  -- some code omitted
Derivatives in a nutshell
What is the derivative of a (B)RE?
Derivatives are defined w.r.t. an alphabet symbol.
The derivative of a (B)RE e w.r.t. a, ∂(e, a), is another
(B)RE that denotes the strings of e's language that start with a,
with that leading a removed.
⟦∂(e, a)⟧ = { s | as ∈ ⟦e⟧ }
Property: s ∈⟦ ∂[ e , x ] ⟧ holds iff (x :: s) ∈⟦ e ⟧ holds.
Derivatives for BREs
Follows the definition of Brzozowski’s derivatives.
∂[_,_] : BitRegex → Char → BitRegex
∂[ eps bs , c ] = empty
∂[ cat bs e e' , c ] with ν[ e ]
∂[ cat bs e e' , c ] | yes pr
  = choice bs (cat bs ∂[ e , c ] e') (fuse (mkEps pr) ∂[ e' , c ])
∂[ cat bs e e' , c ] | no pr = cat bs ∂[ e , c ] e'
∂[ star bs e , c ]
  = cat bs (fuse [ 0b ] ∂[ e , c ]) (star [] e)
-- some code omitted
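To see the derivative produce bit-codes end to end, here is an untyped sketch of bit-coded parsing in the style of Sulzmann and Lu. BREs are assumed to be tagged tuples with a list of bits in second position, and all Python names are illustrative. One deliberate difference from the slide: in the nullable concatenation case the inner cat carries no bits, so that bs is emitted only once. That choice follows the textbook formulation and is an assumption about what the certified code does.

```python
def fuse(bs, b):
    return b if b[0] == "empty" else (b[0], bs + b[1]) + b[2:]

def internalize(e):
    tag = e[0]
    if tag == "empty":
        return ("empty",)
    if tag == "eps":
        return ("eps", [])
    if tag == "chr":
        return ("chr", [], e[1])
    if tag == "cat":
        return ("cat", [], internalize(e[1]), internalize(e[2]))
    if tag == "alt":
        return ("alt", [], fuse([0], internalize(e[1])), fuse([1], internalize(e[2])))
    return ("star", [], internalize(e[1]))

def nullable(b):
    tag = b[0]
    if tag in ("eps", "star"):
        return True
    if tag == "alt":
        return nullable(b[2]) or nullable(b[3])
    if tag == "cat":
        return nullable(b[2]) and nullable(b[3])
    return False                     # empty and chr

def mk_eps(b):
    """Bits of an empty-string parse; only meaningful when nullable(b)."""
    tag = b[0]
    if tag == "eps":
        return b[1]
    if tag == "alt":
        return b[1] + (mk_eps(b[2]) if nullable(b[2]) else mk_eps(b[3]))
    if tag == "cat":
        return b[1] + mk_eps(b[2]) + mk_eps(b[3])
    if tag == "star":
        return b[1] + [1]
    raise ValueError("not nullable")

def der(b, c):
    """Derivative of the BRE b with respect to the character c."""
    tag = b[0]
    if tag in ("empty", "eps"):
        return ("empty",)
    if tag == "chr":
        return ("eps", b[1]) if b[2] == c else ("empty",)
    if tag == "alt":
        return ("alt", b[1], der(b[2], c), der(b[3], c))
    if tag == "cat":
        bs, l, r = b[1], b[2], b[3]
        if nullable(l):
            # Inner cat has no bits of its own (assumption, see the lead-in).
            return ("alt", bs, ("cat", [], der(l, c), r), fuse(mk_eps(l), der(r, c)))
        return ("cat", bs, der(l, c), r)
    if tag == "star":
        return ("cat", b[1], fuse([0], der(b[2], c)), ("star", [], b[2]))
    raise ValueError(f"not a bit-coded regex: {b!r}")

def parse(e, s):
    """Return the bit-code of a parse of s by the plain RE e, or None if s does not match."""
    b = internalize(e)
    for c in s:
        b = der(b, c)
    return mk_eps(b) if nullable(b) else None

regex = ("star", ("alt", ("chr", "c"), ("cat", ("chr", "a"), ("chr", "b"))))
print(parse(regex, "abab"))  # [0, 1, 0, 1, 1]
```

For (c + ab)⋆ and abab the result coincides with the bit-code obtained by first building the parse tree and then applying code, but no tree is ever constructed.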
Nullability test for BREs
Checks if ε ∈ ⟦e⟧, for some (B)RE e.
Agda code: decision procedure for [] ∈⟦ e ⟧.
ν(∅) = ∅
ν(ε) = ε
ν(a) = ∅
ν(e e') = ε if ν(e) = ν(e') = ε, ∅ otherwise
ν(e + e') = ε if ν(e) = ε or ν(e') = ε, ∅ otherwise
ν(e⋆) = ε
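The ν equations amount to a structural recursion returning a yes / no answer. A small sketch over plain REs, assuming the tagged-tuple encoding used for illustration and returning a Python bool where the slide writes ε / ∅:

```python
def nullable(e):
    """True iff the empty string belongs to the language of the regex e."""
    tag = e[0]
    if tag in ("eps", "star"):       # ν(ε) = ε and ν(e⋆) = ε
        return True
    if tag == "cat":                 # both parts must accept the empty string
        return nullable(e[1]) and nullable(e[2])
    if tag == "alt":                 # one side accepting it is enough
        return nullable(e[1]) or nullable(e[2])
    return False                     # ν(∅) = ∅ and ν(a) = ∅

a = ("chr", "a")
print(nullable(("cat", ("star", a), ("eps",))))  # True
print(nullable(("alt", a, ("empty",))))          # False
```

The Agda decision procedure returns more than a boolean: a proof of [] ∈⟦ e ⟧ in the positive case, or a refutation otherwise, which is what mkEps later consumes.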
Parsing with derivatives
RE-based text search tools parse prefixes and substrings.
Types IsPrefix and IsSubstr are proofs that some prefix /
substring of the input string matches a BRE.
Parsing algorithms are defined as proofs of decidability of IsPrefix
and IsSubstr. Proof by induction on the input string, using
properties of the derivative operation.
data IsPrefix (a : List Char) (e : BitRegex) : Set where
  Prefix : a ≡ b ++ c → b ∈⟦ e ⟧ → IsPrefix a e
data IsSubstr (a : List Char) (e : BitRegex) : Set where
  Substr : a ≡ b ++ c ++ d → c ∈⟦ e ⟧ → IsSubstr a e
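The induction on the input string can be pictured as a loop that takes one derivative per character and tests nullability at each step. The sketch below uses plain REs (no bit-codes) as tagged tuples and returns the shortest matching prefix; both the encoding and the shortest-match policy are assumptions, since the slides do not say which witness the certified procedures pick.

```python
def nullable(e):
    tag = e[0]
    if tag in ("eps", "star"):
        return True
    if tag == "cat":
        return nullable(e[1]) and nullable(e[2])
    if tag == "alt":
        return nullable(e[1]) or nullable(e[2])
    return False

def der(e, c):
    """Brzozowski derivative of the plain regex e with respect to c."""
    tag = e[0]
    if tag in ("empty", "eps"):
        return ("empty",)
    if tag == "chr":
        return ("eps",) if e[1] == c else ("empty",)
    if tag == "alt":
        return ("alt", der(e[1], c), der(e[2], c))
    if tag == "cat":
        left = ("cat", der(e[1], c), e[2])
        return ("alt", left, der(e[2], c)) if nullable(e[1]) else left
    if tag == "star":
        return ("cat", der(e[1], c), e)
    raise ValueError(f"not a regex: {e!r}")

def match_prefix(e, s):
    """Return (b, c) with s == b + c and b in the language of e, or None."""
    for i in range(len(s) + 1):
        if nullable(e):              # the i characters read so far match e
            return s[:i], s[i:]
        if i < len(s):
            e = der(e, s[i])
    return None

def match_substr(e, s):
    """Return (b, c, d) with s == b + c + d and c in the language of e, or None."""
    for i in range(len(s) + 1):
        found = match_prefix(e, s[i:])
        if found is not None:
            return s[:i], found[0], found[1]
    return None

ab = ("cat", ("chr", "a"), ("chr", "b"))
print(match_prefix(ab, "abc"))     # ('ab', 'c')
print(match_substr(ab, "xxabyy"))  # ('xx', 'ab', 'yy')
```

The returned tuples play the role of the decompositions a ≡ b ++ c and a ≡ b ++ c ++ d in the Prefix and Substr constructors, and `None` corresponds to the negative answer of the decision procedure.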
Experimental Results
Figure: Results of experiment 1.
Future Work
How to improve efficiency?
How does intrinsic verification affect the efficiency of the generated code?
Currently porting code to use extrinsic verification (proofs
separated from program code).
Experiment with an alternative formalization: BRE semantics
defined by erase: s ∈ ⟦e⟧ = s ∈ ⟦erase e⟧.
How to measure memory consumption without compiler
support?
No profiling support in the Agda compiler.
Agda compiles to Haskell, but there's no direct correspondence
between the Agda source code and the generated Haskell code.
Conclusion
We build a certified algorithm for BRE parsing in Agda.
We certify several previous results about bit-coded parse trees
and their relationship with RE semantics.
Algorithm included in verigrep tool for RE text search.
Relating REs and BREs
Function fuse attaches a bit-string to a BRE.
fuse : List Bit → BitRegex → BitRegex
fuse bs empty = empty
fuse bs (eps x) = eps (bs ++ x)
fuse bs (char x c) = char (bs ++ x) c
fuse bs (choice x e e') = choice (bs ++ x) e e'
fuse bs (cat x e e') = cat (bs ++ x) e e'
fuse bs (star x e) = star (bs ++ x) e
Building bit-codes for ε
mkEps : [] ∈⟦ t ⟧ → List Bit
mkEps (eps bs) = bs
mkEps (inl br bs pr) = bs ++ mkEps pr
mkEps (inr bl bs pr) = bs ++ mkEps pr
mkEps (cat bs pr pr' eq) = bs ++ mkEps pr ++ mkEps pr'
mkEps (star[] bs) = bs ++ [ 1b ]
mkEps (star-:: bs pr pr' x) = bs ++ [ 1b ]
Relating parse trees and RE semantics
unflat builds parse trees out of RE semantics evidence.
Functions flat and unflat are inverses.
unflat : xs ∈⟦ e ⟧ → Tree e
unflat ε = ε
unflat ($ c) = $ c
unflat (prf • prf') = unflat prf • unflat prf'
unflat (r +L prf) = inl r (unflat prf)
unflat (l +R prf) = inr l (unflat prf)
-- some code omitted

More Related Content

What's hot

Regular expressions
Regular expressionsRegular expressions
Regular expressions
Brij Kishore
 
Why functional programming and category theory strongly matters
Why functional programming and category theory strongly mattersWhy functional programming and category theory strongly matters
Why functional programming and category theory strongly matters
Piotr Paradziński
 
Term Rewriting
Term RewritingTerm Rewriting
Term Rewriting
Eelco Visser
 
Pure and Declarative Syntax Definition: Paradise Lost and Regained
Pure and Declarative Syntax Definition: Paradise Lost and RegainedPure and Declarative Syntax Definition: Paradise Lost and Regained
Pure and Declarative Syntax Definition: Paradise Lost and Regained
Guido Wachsmuth
 
Babar: Knowledge Recognition, Extraction and Representation
Babar: Knowledge Recognition, Extraction and RepresentationBabar: Knowledge Recognition, Extraction and Representation
Babar: Knowledge Recognition, Extraction and Representation
Pierre de Lacaze
 
Declarative Syntax Definition - Grammars and Trees
Declarative Syntax Definition - Grammars and TreesDeclarative Syntax Definition - Grammars and Trees
Declarative Syntax Definition - Grammars and Trees
Guido Wachsmuth
 
Syntax Definition
Syntax DefinitionSyntax Definition
Syntax Definition
Guido Wachsmuth
 
Advance LISP (Artificial Intelligence)
Advance LISP (Artificial Intelligence) Advance LISP (Artificial Intelligence)
Advance LISP (Artificial Intelligence)
wahab khan
 
Python 培训讲义
Python 培训讲义Python 培训讲义
Python 培训讲义
leejd
 
Beyond tf idf why, what & how
Beyond tf idf why, what & howBeyond tf idf why, what & how
Beyond tf idf why, what & how
lucenerevolution
 
Finaal application on regular expression
Finaal application on regular expressionFinaal application on regular expression
Finaal application on regular expression
Gagan019
 
High-Performance Haskell
High-Performance HaskellHigh-Performance Haskell
High-Performance Haskell
Johan Tibell
 
Real World Haskell: Lecture 7
Real World Haskell: Lecture 7Real World Haskell: Lecture 7
Real World Haskell: Lecture 7
Bryan O'Sullivan
 
Declarative Semantics Definition - Static Analysis and Error Checking
Declarative Semantics Definition - Static Analysis and Error CheckingDeclarative Semantics Definition - Static Analysis and Error Checking
Declarative Semantics Definition - Static Analysis and Error Checking
Guido Wachsmuth
 
Learning notes of r for python programmer (Temp1)
Learning notes of r for python programmer (Temp1)Learning notes of r for python programmer (Temp1)
Learning notes of r for python programmer (Temp1)
Chia-Chi Chang
 
Compiler Components and their Generators - Traditional Parsing Algorithms
Compiler Components and their Generators - Traditional Parsing AlgorithmsCompiler Components and their Generators - Traditional Parsing Algorithms
Compiler Components and their Generators - Traditional Parsing Algorithms
Guido Wachsmuth
 
Static name resolution
Static name resolutionStatic name resolution
Static name resolution
Eelco Visser
 
Syntax Definition
Syntax DefinitionSyntax Definition
Syntax Definition
Eelco Visser
 
Real World Haskell: Lecture 5
Real World Haskell: Lecture 5Real World Haskell: Lecture 5
Real World Haskell: Lecture 5
Bryan O'Sullivan
 
Formal Grammars
Formal GrammarsFormal Grammars
Formal Grammars
Eelco Visser
 

What's hot (20)

Regular expressions
Regular expressionsRegular expressions
Regular expressions
 
Why functional programming and category theory strongly matters
Why functional programming and category theory strongly mattersWhy functional programming and category theory strongly matters
Why functional programming and category theory strongly matters
 
Term Rewriting
Term RewritingTerm Rewriting
Term Rewriting
 
Pure and Declarative Syntax Definition: Paradise Lost and Regained
Pure and Declarative Syntax Definition: Paradise Lost and RegainedPure and Declarative Syntax Definition: Paradise Lost and Regained
Pure and Declarative Syntax Definition: Paradise Lost and Regained
 
Babar: Knowledge Recognition, Extraction and Representation
Babar: Knowledge Recognition, Extraction and RepresentationBabar: Knowledge Recognition, Extraction and Representation
Babar: Knowledge Recognition, Extraction and Representation
 
Declarative Syntax Definition - Grammars and Trees
Declarative Syntax Definition - Grammars and TreesDeclarative Syntax Definition - Grammars and Trees
Declarative Syntax Definition - Grammars and Trees
 
Syntax Definition
Syntax DefinitionSyntax Definition
Syntax Definition
 
Advance LISP (Artificial Intelligence)
Advance LISP (Artificial Intelligence) Advance LISP (Artificial Intelligence)
Advance LISP (Artificial Intelligence)
 
Python 培训讲义
Python 培训讲义Python 培训讲义
Python 培训讲义
 
Beyond tf idf why, what & how
Beyond tf idf why, what & howBeyond tf idf why, what & how
Beyond tf idf why, what & how
 
Finaal application on regular expression
Finaal application on regular expressionFinaal application on regular expression
Finaal application on regular expression
 
High-Performance Haskell
High-Performance HaskellHigh-Performance Haskell
High-Performance Haskell
 
Real World Haskell: Lecture 7
Real World Haskell: Lecture 7Real World Haskell: Lecture 7
Real World Haskell: Lecture 7
 
Declarative Semantics Definition - Static Analysis and Error Checking
Declarative Semantics Definition - Static Analysis and Error CheckingDeclarative Semantics Definition - Static Analysis and Error Checking
Declarative Semantics Definition - Static Analysis and Error Checking
 
Learning notes of r for python programmer (Temp1)
Learning notes of r for python programmer (Temp1)Learning notes of r for python programmer (Temp1)
Learning notes of r for python programmer (Temp1)
 
Compiler Components and their Generators - Traditional Parsing Algorithms
Compiler Components and their Generators - Traditional Parsing AlgorithmsCompiler Components and their Generators - Traditional Parsing Algorithms
Compiler Components and their Generators - Traditional Parsing Algorithms
 
Static name resolution
Static name resolutionStatic name resolution
Static name resolution
 
Syntax Definition
Syntax DefinitionSyntax Definition
Syntax Definition
 
Real World Haskell: Lecture 5
Real World Haskell: Lecture 5Real World Haskell: Lecture 5
Real World Haskell: Lecture 5
 
Formal Grammars
Formal GrammarsFormal Grammars
Formal Grammars
 

Similar to Certified bit coded regular expression parsing

Ch04
Ch04Ch04
Ch04
Hankyo
 
Computer Software: Compiler Construction Lecture 05.ppt
Computer Software: Compiler Construction Lecture 05.pptComputer Software: Compiler Construction Lecture 05.ppt
Computer Software: Compiler Construction Lecture 05.ppt
ishaquet82
 
An overview of Python 2.7
An overview of Python 2.7An overview of Python 2.7
An overview of Python 2.7
decoupled
 
A tour of Python
A tour of PythonA tour of Python
A tour of Python
Aleksandar Veselinovic
 
Power of functions in a typed world
Power of functions in a typed worldPower of functions in a typed world
Power of functions in a typed world
Debasish Ghosh
 
ANSI C REFERENCE CARD
ANSI C REFERENCE CARDANSI C REFERENCE CARD
ANSI C REFERENCE CARD
Tia Ricci
 
Ch3
Ch3Ch3
Intermediate code generation in Compiler Design
Intermediate code generation in Compiler DesignIntermediate code generation in Compiler Design
Intermediate code generation in Compiler Design
Kuppusamy P
 
Python lecture 07
Python lecture 07Python lecture 07
Python lecture 07
Tanwir Zaman
 
C# Cheat Sheet
C# Cheat SheetC# Cheat Sheet
C# Cheat Sheet
GlowTouch
 
Chapter Two(1)
Chapter Two(1)Chapter Two(1)
Chapter Two(1)
bolovv
 
Query Optimization - Brandon Latronica
Query Optimization - Brandon LatronicaQuery Optimization - Brandon Latronica
Query Optimization - Brandon Latronica
"FENG "GEORGE"" YU
 
Ch8a
Ch8aCh8a
scripting in Python
scripting in Pythonscripting in Python
scripting in Python
Team-VLSI-ITMU
 
7645347.ppt
7645347.ppt7645347.ppt
7645347.ppt
jeronimored
 
Hierarchical Clustering - Text Mining/NLP
Hierarchical Clustering - Text Mining/NLPHierarchical Clustering - Text Mining/NLP
Hierarchical Clustering - Text Mining/NLP
Rupak Roy
 
Intermediate code generation1
Intermediate code generation1Intermediate code generation1
Intermediate code generation1
Shashwat Shriparv
 
Arrays
ArraysArrays
Sql server lab_2
Sql server lab_2Sql server lab_2
Sql server lab_2
vijay venkatash
 
Antlr V3
Antlr V3Antlr V3
Antlr V3
guest5024494
 

Similar to Certified bit coded regular expression parsing (20)

Ch04
Ch04Ch04
Ch04
 
Computer Software: Compiler Construction Lecture 05.ppt
Computer Software: Compiler Construction Lecture 05.pptComputer Software: Compiler Construction Lecture 05.ppt
Computer Software: Compiler Construction Lecture 05.ppt
 
An overview of Python 2.7
An overview of Python 2.7An overview of Python 2.7
An overview of Python 2.7
 
A tour of Python
A tour of PythonA tour of Python
A tour of Python
 
Power of functions in a typed world
Power of functions in a typed worldPower of functions in a typed world
Power of functions in a typed world
 
ANSI C REFERENCE CARD
ANSI C REFERENCE CARDANSI C REFERENCE CARD
ANSI C REFERENCE CARD
 
Ch3
Ch3Ch3
Ch3
 
Intermediate code generation in Compiler Design
Intermediate code generation in Compiler DesignIntermediate code generation in Compiler Design
Intermediate code generation in Compiler Design
 
Python lecture 07
Python lecture 07Python lecture 07
Python lecture 07
 
C# Cheat Sheet
C# Cheat SheetC# Cheat Sheet
C# Cheat Sheet
 
Chapter Two(1)
Chapter Two(1)Chapter Two(1)
Chapter Two(1)
 
Query Optimization - Brandon Latronica
Query Optimization - Brandon LatronicaQuery Optimization - Brandon Latronica
Query Optimization - Brandon Latronica
 
Ch8a
Ch8aCh8a
Ch8a
 
scripting in Python
scripting in Pythonscripting in Python
scripting in Python
 
7645347.ppt
7645347.ppt7645347.ppt
7645347.ppt
 
Hierarchical Clustering - Text Mining/NLP
Hierarchical Clustering - Text Mining/NLPHierarchical Clustering - Text Mining/NLP
Hierarchical Clustering - Text Mining/NLP
 
Intermediate code generation1
Intermediate code generation1Intermediate code generation1
Intermediate code generation1
 
Arrays
ArraysArrays
Arrays
 
Sql server lab_2
Sql server lab_2Sql server lab_2
Sql server lab_2
 
Antlr V3
Antlr V3Antlr V3
Antlr V3
 

Recently uploaded

Basics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different formsBasics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different forms
MaheshaNanjegowda
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
pablovgd
 
The binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defectsThe binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defects
Sérgio Sacani
 
Oedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptxOedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptx
muralinath2
 
20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx
Sharon Liu
 
Medical Orthopedic PowerPoint Templates.pptx
Medical Orthopedic PowerPoint Templates.pptxMedical Orthopedic PowerPoint Templates.pptx
Medical Orthopedic PowerPoint Templates.pptx
terusbelajar5
 
aziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobelaziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobel
İsa Badur
 
Eukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptxEukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptx
RitabrataSarkar3
 
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdfwaterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
LengamoLAppostilic
 
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
hozt8xgk
 
Direct Seeded Rice - Climate Smart Agriculture
Direct Seeded Rice - Climate Smart AgricultureDirect Seeded Rice - Climate Smart Agriculture
Direct Seeded Rice - Climate Smart Agriculture
International Food Policy Research Institute- South Asia Office
 
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
Sérgio Sacani
 
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
Advanced-Concepts-Team
 
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
AbdullaAlAsif1
 
The debris of the ‘last major merger’ is dynamically young
The debris of the ‘last major merger’ is dynamically youngThe debris of the ‘last major merger’ is dynamically young
The debris of the ‘last major merger’ is dynamically young
Sérgio Sacani
 
Sharlene Leurig - Enabling Onsite Water Use with Net Zero Water
Sharlene Leurig - Enabling Onsite Water Use with Net Zero WaterSharlene Leurig - Enabling Onsite Water Use with Net Zero Water
Sharlene Leurig - Enabling Onsite Water Use with Net Zero Water
Texas Alliance of Groundwater Districts
 
Randomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNERandomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNE
University of Maribor
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
KrushnaDarade1
 
Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.
Aditi Bajpai
 
Immersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths ForwardImmersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths Forward
Leonel Morgado
 

Recently uploaded (20)

Basics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different formsBasics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different forms
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
 
The binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defectsThe binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defects
 
Oedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptxOedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptx
 
20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx
 
Medical Orthopedic PowerPoint Templates.pptx
Medical Orthopedic PowerPoint Templates.pptxMedical Orthopedic PowerPoint Templates.pptx
Medical Orthopedic PowerPoint Templates.pptx
 
aziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobelaziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobel
 
Eukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptxEukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptx
 
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdfwaterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
 
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
 
Direct Seeded Rice - Climate Smart Agriculture
Direct Seeded Rice - Climate Smart AgricultureDirect Seeded Rice - Climate Smart Agriculture
Direct Seeded Rice - Climate Smart Agriculture
 
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
 
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
 
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
 
The debris of the ‘last major merger’ is dynamically young
The debris of the ‘last major merger’ is dynamically youngThe debris of the ‘last major merger’ is dynamically young
The debris of the ‘last major merger’ is dynamically young
 
Sharlene Leurig - Enabling Onsite Water Use with Net Zero Water
Sharlene Leurig - Enabling Onsite Water Use with Net Zero WaterSharlene Leurig - Enabling Onsite Water Use with Net Zero Water
Sharlene Leurig - Enabling Onsite Water Use with Net Zero Water
 
Randomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNERandomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNE
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
 
Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.
 
Immersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths ForwardImmersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths Forward
 

Certified bit coded regular expression parsing

  • 1. Certified Bit-Coded Regular Expression Parsing Rodrigo Ribeiro1 Andr´e Rauber Du Bois2 1Departament of Computer Science Universidade Federal de Ouro Preto 2Departament of Computer Science Universidade Federal de Pelotas September 21, 2017
  • 2. Introduction Parsing is pervasive in computing String search tools, lexical analysers... Binary data files like images, videos ... Our focus: Regular Languages (RLs) Languages denoted by Regular Expressions (REs) and equivalent formalisms
  • 3. Introduction Is RE parsing a yes / no question?. No! Better to produce evidence: parse trees. Why use bit-codes for parse trees? Memory compact representation of parsing evidence. Easy serialization of parsing results.
  • 4. Contributions We provide fully certified proofs of a derivative based algorithm that produces a bit representation of a parse tree. We mechanize results about the relation between RE parse trees and bit-codes. We provide sound and complete decision procedures for prefix and substring matching of RE. Coded included in a RE search tool developed by us — verigrep. All results formalized in Agda version 2.5.2.
  • 5. Regular Expression Syntax Definition of RE over a finite alphabet Σ. e ::= ∅ | | a | e e | e + e | e Agda code data Regex : Set where ∅ : Regex : Regex $ : Char → Regex • : Regex → Regex → Regex + : Regex → Regex → Regex : Regex → Regex
  • 6. Regular Expression Semantics ∈ a ∈ Σ a ∈ a s ∈ e s ∈ e ss ∈ ee s ∈ e s ∈ e + e s ∈ e s ∈ e + e s ∈ + ee s ∈ e
  • 7. Regular Expression Semantics — Agda code data ∈ : List Char → Regex → Set where : [ ] ∈ $ : (c : Char) → $ c ∈ $ c • : s ∈ l → s ∈ r → (s ++ s ) ∈ l • r + L : s ∈ l → s ∈ l + r -- some code omitted...
  • 8. Parse trees for REs We interpret RE as types and parse tree as terms. Informally: leafs: empty string and character. concatenation: pair of parse trees. choice: just the branch of chosen RE. Kleene star: list of parse trees.
  • 9. Parse trees for RE — Example star− :: inr • $ a $ b star− :: inr • $ a $ b star[] Figure: Parse tree for RE: (c + ab) and the string w = abab.
  • 10. Parse trees for REs data Tree : Regex → Set where : Tree $ : (c : Char) → Tree ($ c) inl : Tree l → Tree (l + r) inr : Tree r → Tree (l + r) • : Tree l → Tree r → Tree (l • r) star[] : Tree (l ) star− :: : Tree l → Tree (l ) → Tree (l )
  • 11. Relating parse trees and RE semantics Using function flat. Property: Let t be a parse tree for a RE e and a string s. Then, flat(t) = s and s ∈ e . flat( ) = [] flat($ a) = [a] flat(inl t) = flat(t) flat(inr t) = flat(t) flat(t • t ) = flat(t)++flat(t ) flat(star[]) = [] flat(star− :: t ts) = flat(t)++flat(ts)
  • 12. Relating parse trees and RE semantics flat type ensure its correctness property! flat : Tree e → ∃ (λ xs → xs ∈ e ) flat = [ ] , flat ($ c) = [ c ] , ($ c) flat (inl r t) with flat t ...| xs , prf = , r +L prf flat (inr l t) with flat t ...| xs , prf = , l +R prf flat (t • t ) with flat t | flat t ...| xs , prf | ys , prf = , (prf • prf ) -- some code omitted
  • 13. Bit-codes for parse trees Bit-codes mark... which branch of choice was chosen during parsing. matchings done by the Kleene star operator. Predicate relating bit-codes to its RE. data IsCode : List Bit → Regex → Set where : [ ] IsCode $ : (c : Char) → [ ] IsCode ($ c) inl : xs IsCode l → (0b :: xs) IsCode (l + r) inr : xs IsCode r → (1b :: xs) IsCode (l + r) • : xs IsCode l → ys IsCode r → (xs ++ ys) IsCode (l • r) star[] : [ 1b ] IsCode (l ) star− :: : xs IsCode l → xss IsCode (l ) → (0b :: xs ++ xss) IsCode (l )
  • 14. How to relate bit-codes and parse trees?
      Function code builds bit-codes for parse trees.
      code : Tree e → ∃ (λ bs → bs IsCode e)
      code ($ c) = [ ] , ($ c)
      code (inl r t) with code t
      ...| ys , pr = 0b :: ys , inl r pr
      code (inr l t) with code t
      ...| ys , pr = 1b :: ys , inr l pr
      code star[] = 1b :: [ ] , star[]
      code (star-:: t ts) with code t | code ts
      ...| xs , pr | xss , prs = (0b :: xs ++ xss) , star-:: pr prs
      -- some code omitted
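The running example from slide 9 can be replayed against unverified versions of flat and code. The following is a minimal sketch, not the authors' development: it drops the proof components (flat returns only the string, code only the bits), and the prefix constructor names chr, seq, alt, rep, leafε, leaf$ and pair are assumptions standing in for the slides' mixfix names. It imports only Agda.Builtin modules and has not been checked against the Agda 2.5.2 release used in the slides.

```agda
module TreeCodeSketch where

open import Agda.Builtin.Char     using (Char)
open import Agda.Builtin.List     using (List; []; _∷_)
open import Agda.Builtin.Equality using (_≡_; refl)

data Bit : Set where
  0b 1b : Bit

infixr 5 _++_
_++_ : {A : Set} → List A → List A → List A
[]       ++ ys = ys
(x ∷ xs) ++ ys = x ∷ (xs ++ ys)

-- Prefix names (assumed) stand in for the slides' $, •, + and ⋆.
data Regex : Set where
  ∅ ε     : Regex
  chr     : Char → Regex
  seq alt : Regex → Regex → Regex
  rep     : Regex → Regex

data Tree : Regex → Set where
  leafε  : Tree ε
  leaf$  : (c : Char) → Tree (chr c)
  inl    : ∀ {l r} → Tree l → Tree (alt l r)
  inr    : ∀ {l r} → Tree r → Tree (alt l r)
  pair   : ∀ {l r} → Tree l → Tree r → Tree (seq l r)
  star[] : ∀ {l} → Tree (rep l)
  star-∷ : ∀ {l} → Tree l → Tree (rep l) → Tree (rep l)

-- Unverified flat: only the string, without the membership proof.
flat : ∀ {e} → Tree e → List Char
flat leafε         = []
flat (leaf$ c)     = c ∷ []
flat (inl t)       = flat t
flat (inr t)       = flat t
flat (pair t u)    = flat t ++ flat u
flat star[]        = []
flat (star-∷ t ts) = flat t ++ flat ts

-- Unverified code: only the bits, without the IsCode evidence.
code : ∀ {e} → Tree e → List Bit
code leafε         = []
code (leaf$ c)     = []
code (inl t)       = 0b ∷ code t
code (inr t)       = 1b ∷ code t
code (pair t u)    = code t ++ code u
code star[]        = 1b ∷ []
code (star-∷ t ts) = 0b ∷ (code t ++ code ts)

-- Parse tree of w = abab for (c + ab)⋆.
example : Tree (rep (alt (chr 'c') (seq (chr 'a') (chr 'b'))))
example = star-∷ (inr (pair (leaf$ 'a') (leaf$ 'b')))
         (star-∷ (inr (pair (leaf$ 'a') (leaf$ 'b'))) star[])

flat-example : flat example ≡ 'a' ∷ 'b' ∷ 'a' ∷ 'b' ∷ []
flat-example = refl

code-example : code example ≡ 0b ∷ 1b ∷ 0b ∷ 1b ∷ 1b ∷ []
code-example = refl
```

Each star iteration contributes a 0b, each right branch of the choice a 1b, and the final 1b closes the star, so the four-character input is recorded in five bits.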
  • 15. How to relate bit-codes and parse-trees?
      Function decode parses a bit string w.r.t. a RE.
      Property: for all t, decode (code t) ≡ t
      decode : ∃ (λ bs → bs IsCode e) → Tree e
      decode (_ , ($ c)) = $ c
      decode (_ , (inl r pr)) = inl r (decode (_ , pr))
      decode (_ , (inr l pr)) = inr l (decode (_ , pr))
      decode star[] = ( +L )
      decode (star-:: pr pr') with decode (_ , pr) | decode (_ , pr')
      ...| pr1 | pr2 = ( +R (pr1 • pr2))
      -- some code omitted
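The slide's decode consumes IsCode evidence. To illustrate the same idea on raw bit lists, here is a hedged sketch of a decoder that reads bits directed by the RE and returns the tree together with the unread bits. It is not the authors' function: the fuel argument is an assumption added only to satisfy the termination checker, and the helper bind, the pair type _×_ and the prefix constructor names are likewise hypothetical.

```agda
module DecodeSketch where

open import Agda.Builtin.Char     using (Char)
open import Agda.Builtin.List     using (List; []; _∷_)
open import Agda.Builtin.Nat      using (Nat; zero; suc)
open import Agda.Builtin.Equality using (_≡_; refl)

data Bit : Set where
  0b 1b : Bit

data Maybe (A : Set) : Set where
  nothing : Maybe A
  just    : A → Maybe A

infixr 4 _,_
data _×_ (A B : Set) : Set where
  _,_ : A → B → A × B

data Regex : Set where
  ∅ ε     : Regex
  chr     : Char → Regex
  seq alt : Regex → Regex → Regex
  rep     : Regex → Regex

data Tree : Regex → Set where
  leafε  : Tree ε
  leaf$  : (c : Char) → Tree (chr c)
  inl    : ∀ {l r} → Tree l → Tree (alt l r)
  inr    : ∀ {l r} → Tree r → Tree (alt l r)
  pair   : ∀ {l r} → Tree l → Tree r → Tree (seq l r)
  star[] : ∀ {l} → Tree (rep l)
  star-∷ : ∀ {l} → Tree l → Tree (rep l) → Tree (rep l)

-- Sequence two decoding steps, threading the unread bits.
bind : {A C : Set} → Maybe (A × List Bit) → (A → List Bit → Maybe C) → Maybe C
bind nothing         k = nothing
bind (just (x , bs)) k = k x bs

-- Fuel-bounded decoder (the fuel is an assumption of this sketch).
decode : Nat → (e : Regex) → List Bit → Maybe (Tree e × List Bit)
decode zero    _         _         = nothing
decode (suc n) ∅         _         = nothing
decode (suc n) ε         bs        = just (leafε , bs)
decode (suc n) (chr c)   bs        = just (leaf$ c , bs)
decode (suc n) (alt l r) []        = nothing
decode (suc n) (alt l r) (0b ∷ bs) =
  bind (decode n l bs) (λ t rest → just (inl t , rest))
decode (suc n) (alt l r) (1b ∷ bs) =
  bind (decode n r bs) (λ t rest → just (inr t , rest))
decode (suc n) (seq l r) bs        =
  bind (decode n l bs) (λ t rest →
  bind (decode n r rest) (λ u rest' → just (pair t u , rest')))
decode (suc n) (rep e)   []        = nothing
decode (suc n) (rep e)   (1b ∷ bs) = just (star[] , bs)
decode (suc n) (rep e)   (0b ∷ bs) =
  bind (decode n e bs) (λ t rest →
  bind (decode n (rep e) rest) (λ ts rest' → just (star-∷ t ts , rest')))

re : Regex
re = rep (alt (chr 'c') (seq (chr 'a') (chr 'b')))

-- Decoding the code of abab rebuilds the tree and leaves no bits unread.
round-trip : decode 10 re (0b ∷ 1b ∷ 0b ∷ 1b ∷ 1b ∷ [])
           ≡ just (star-∷ (inr (pair (leaf$ 'a') (leaf$ 'b')))
                  (star-∷ (inr (pair (leaf$ 'a') (leaf$ 'b'))) star[]) , [])
round-trip = refl
```

Indexing decode by IsCode evidence, as on the slide, removes both the Maybe and the fuel, because the evidence already fixes the shape of the tree.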
  • 16. Bit-codes for RE parse trees
      Building a parse tree just to compute its bit-code is expensive.
      Better idea: build bit-codes during parsing, instead of computing parse trees.
      How? Just attach bit-codes to REs.
      data BitRegex : Set where
        empty  : BitRegex
        eps    : List Bit → BitRegex
        char   : List Bit → Char → BitRegex
        choice : List Bit → BitRegex → BitRegex → BitRegex
        cat    : List Bit → BitRegex → BitRegex → BitRegex
        star   : List Bit → BitRegex → BitRegex
  • 17. Relating REs and BREs
      Function internalize converts a RE into a BRE.
      internalize : Regex → BitRegex
      internalize ∅ = empty
      internalize ε = eps [ ]
      internalize ($ x) = char [ ] x
      internalize (e • e') = cat [ ] (internalize e) (internalize e')
      internalize (e + e') = choice [ ] (fuse [ 0b ] (internalize e)) (fuse [ 1b ] (internalize e'))
      internalize (e ⋆) = star [ ] (internalize e)
  • 18. Relating REs and BREs
      Function erase converts a BRE into a RE.
      Property: for all e, erase (internalize e) ≡ e.
      erase : BitRegex → Regex
      erase empty = ∅
      erase (eps x) = ε
      erase (char x c) = $ c
      erase (choice x e e') = erase e + (erase e')
      erase (cat x e e') = erase e • (erase e')
      erase (star x e) = (erase e) ⋆
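The stated property, erase (internalize e) ≡ e, needs one auxiliary fact: fuse changes only the top-level annotation, so erase ignores it. The following self-contained sketch shows one way the proof could go. It is not taken from the authors' development; the lemma names erase-fuse and erase-internalize and the prefix Regex constructors chr, seq, alt and rep are assumptions made for this illustration.

```agda
module EraseSketch where

open import Agda.Builtin.Char     using (Char)
open import Agda.Builtin.List     using (List; []; _∷_)
open import Agda.Builtin.Equality using (_≡_; refl)

data Bit : Set where
  0b 1b : Bit

infixr 5 _++_
_++_ : {A : Set} → List A → List A → List A
[]       ++ ys = ys
(x ∷ xs) ++ ys = x ∷ (xs ++ ys)

data Regex : Set where
  ∅ ε     : Regex
  chr     : Char → Regex
  seq alt : Regex → Regex → Regex
  rep     : Regex → Regex

data BitRegex : Set where
  empty  : BitRegex
  eps    : List Bit → BitRegex
  char   : List Bit → Char → BitRegex
  choice : List Bit → BitRegex → BitRegex → BitRegex
  cat    : List Bit → BitRegex → BitRegex → BitRegex
  star   : List Bit → BitRegex → BitRegex

fuse : List Bit → BitRegex → BitRegex
fuse bs empty          = empty
fuse bs (eps x)        = eps (bs ++ x)
fuse bs (char x c)     = char (bs ++ x) c
fuse bs (choice x l r) = choice (bs ++ x) l r
fuse bs (cat x l r)    = cat (bs ++ x) l r
fuse bs (star x e)     = star (bs ++ x) e

internalize : Regex → BitRegex
internalize ∅          = empty
internalize ε          = eps []
internalize (chr c)    = char [] c
internalize (seq e e') = cat [] (internalize e) (internalize e')
internalize (alt e e') = choice [] (fuse (0b ∷ []) (internalize e))
                                   (fuse (1b ∷ []) (internalize e'))
internalize (rep e)    = star [] (internalize e)

erase : BitRegex → Regex
erase empty          = ∅
erase (eps x)        = ε
erase (char x c)     = chr c
erase (choice x l r) = alt (erase l) (erase r)
erase (cat x l r)    = seq (erase l) (erase r)
erase (star x e)     = rep (erase e)

-- fuse touches only the outermost annotation, which erase discards.
erase-fuse : ∀ bs e → erase (fuse bs e) ≡ erase e
erase-fuse bs empty          = refl
erase-fuse bs (eps x)        = refl
erase-fuse bs (char x c)     = refl
erase-fuse bs (choice x l r) = refl
erase-fuse bs (cat x l r)    = refl
erase-fuse bs (star x e)     = refl

erase-internalize : ∀ e → erase (internalize e) ≡ e
erase-internalize ∅       = refl
erase-internalize ε       = refl
erase-internalize (chr c) = refl
erase-internalize (seq e e')
  rewrite erase-internalize e | erase-internalize e' = refl
erase-internalize (alt e e')
  rewrite erase-fuse (0b ∷ []) (internalize e)
        | erase-fuse (1b ∷ []) (internalize e')
        | erase-internalize e
        | erase-internalize e' = refl
erase-internalize (rep e)
  rewrite erase-internalize e = refl
```

Only the choice case needs the lemma, since it is the only place where internalize calls fuse.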
  • 19. Semantics of BREs
      Same as the RE semantics, but includes the bit-codes.
      Property: s ∈ e iff s ∈ internalize e.
      data _∈_ : List Char → BitRegex → Set where
        eps  : [ ] ∈ eps bs
        char : (c : Char) → [ c ] ∈ char bs c
        inl  : s ∈ l → s ∈ choice bs l r
        inr  : s ∈ r → s ∈ choice bs l r
        cat  : s ∈ l → s' ∈ r → (s ++ s') ∈ cat bs l r
        -- some code omitted
  • 20. Derivatives in a nutshell
      What is the derivative of a (B)RE?
      Derivatives are defined w.r.t. an alphabet symbol.
      The derivative of a (B)RE e w.r.t. a, ∂(e, a), is another (B)RE that denotes the strings of e's language that start with a, with that leading a removed.
        ∂(e, a) = { s | as ∈ e }
      Property: s ∈ ∂[ e , x ] holds iff (x :: s) ∈ e holds.
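For plain REs the definition above yields a matcher directly: take one derivative per input character and test nullability at the end. The sketch below uses Boolean nullability and no proofs, so it only illustrates the computation that the certified version justifies. The names matches, chr, seq, alt and rep are assumptions, and character equality is done through the primCharToNat primitive.

```agda
module DerivativeSketch where

open import Agda.Builtin.Bool     using (Bool; true; false)
open import Agda.Builtin.Char     using (Char; primCharToNat)
open import Agda.Builtin.List     using (List; []; _∷_)
open import Agda.Builtin.Nat      using (_==_)
open import Agda.Builtin.Equality using (_≡_; refl)

infix  0 if_then_else_
infixr 6 _∧_
infixr 5 _∨_

if_then_else_ : {A : Set} → Bool → A → A → A
if true  then x else y = x
if false then x else y = y

_∧_ : Bool → Bool → Bool
true  ∧ b = b
false ∧ b = false

_∨_ : Bool → Bool → Bool
true  ∨ b = true
false ∨ b = b

data Regex : Set where
  ∅ ε     : Regex
  chr     : Char → Regex
  seq alt : Regex → Regex → Regex
  rep     : Regex → Regex

-- Boolean nullability: does the RE accept the empty string?
ν : Regex → Bool
ν ∅          = false
ν ε          = true
ν (chr _)    = false
ν (seq e e') = ν e ∧ ν e'
ν (alt e e') = ν e ∨ ν e'
ν (rep _)    = true

-- Brzozowski derivative of a RE with respect to one character.
∂ : Regex → Char → Regex
∂ ∅          c = ∅
∂ ε          c = ∅
∂ (chr x)    c = if primCharToNat x == primCharToNat c then ε else ∅
∂ (seq e e') c = if ν e then alt (seq (∂ e c) e') (∂ e' c)
                        else seq (∂ e c) e'
∂ (alt e e') c = alt (∂ e c) (∂ e' c)
∂ (rep e)    c = seq (∂ e c) (rep e)

-- A string matches iff the iterated derivative is nullable.
matches : Regex → List Char → Bool
matches e []       = ν e
matches e (c ∷ cs) = matches (∂ e c) cs

re : Regex
re = rep (alt (chr 'c') (seq (chr 'a') (chr 'b')))

accepts-abab : matches re ('a' ∷ 'b' ∷ 'a' ∷ 'b' ∷ []) ≡ true
accepts-abab = refl

rejects-a : matches re ('a' ∷ []) ≡ false
rejects-a = refl
```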
  • 21. Derivatives for BREs
      Follows the definition of Brzozowski's derivatives.
      ∂[_,_] : BitRegex → Char → BitRegex
      ∂[ eps bs , c ] = empty
      ∂[ cat bs e e' , c ] with ν[ e ]
      ∂[ cat bs e e' , c ] | yes pr = choice bs (cat bs ∂[ e , c ] e') (fuse (mkEps pr) ∂[ e' , c ])
      ∂[ cat bs e e' , c ] | no pr = cat bs ∂[ e , c ] e'
      ∂[ star bs e , c ] = cat bs (fuse [ 0b ] ∂[ e , c ]) (star [ ] e)
      -- some code omitted
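To see how the bit-coded derivative produces a code without ever building a parse tree, here is an unverified sketch that runs it on the (c + ab)⋆ example. Several points are assumptions of this sketch rather than the authors' code: nullability is a Bool, mkEps takes the BRE itself instead of a proof and returns [] on non-nullable input, the function parse is hypothetical, and the nullable cat case follows the Sulzmann and Lu formulation, where the inner cat carries an empty annotation (the slide shows bs in both positions).

```agda
module BitDerivSketch where

open import Agda.Builtin.Bool     using (Bool; true; false)
open import Agda.Builtin.Char     using (Char; primCharToNat)
open import Agda.Builtin.List     using (List; []; _∷_)
open import Agda.Builtin.Nat      using (_==_)
open import Agda.Builtin.Equality using (_≡_; refl)

data Bit : Set where
  0b 1b : Bit

data Maybe (A : Set) : Set where
  nothing : Maybe A
  just    : A → Maybe A

infix  0 if_then_else_
infixr 6 _∧_
infixr 5 _∨_ _++_

if_then_else_ : {A : Set} → Bool → A → A → A
if true  then x else y = x
if false then x else y = y

_∧_ : Bool → Bool → Bool
true  ∧ b = b
false ∧ b = false

_∨_ : Bool → Bool → Bool
true  ∨ b = true
false ∨ b = b

_++_ : {A : Set} → List A → List A → List A
[]       ++ ys = ys
(x ∷ xs) ++ ys = x ∷ (xs ++ ys)

data BitRegex : Set where
  empty  : BitRegex
  eps    : List Bit → BitRegex
  char   : List Bit → Char → BitRegex
  choice : List Bit → BitRegex → BitRegex → BitRegex
  cat    : List Bit → BitRegex → BitRegex → BitRegex
  star   : List Bit → BitRegex → BitRegex

fuse : List Bit → BitRegex → BitRegex
fuse bs empty          = empty
fuse bs (eps x)        = eps (bs ++ x)
fuse bs (char x c)     = char (bs ++ x) c
fuse bs (choice x l r) = choice (bs ++ x) l r
fuse bs (cat x l r)    = cat (bs ++ x) l r
fuse bs (star x e)     = star (bs ++ x) e

ν : BitRegex → Bool
ν empty          = false
ν (eps _)        = true
ν (char _ _)     = false
ν (choice _ l r) = ν l ∨ ν r
ν (cat _ l r)    = ν l ∧ ν r
ν (star _ _)     = true

-- Bits of an empty-string match; [] on non-nullable input (assumption).
mkEps : BitRegex → List Bit
mkEps empty           = []
mkEps (eps bs)        = bs
mkEps (char _ _)      = []
mkEps (choice bs l r) = if ν l then bs ++ mkEps l else bs ++ mkEps r
mkEps (cat bs l r)    = bs ++ (mkEps l ++ mkEps r)
mkEps (star bs _)     = bs ++ (1b ∷ [])

∂ : BitRegex → Char → BitRegex
∂ empty           c = empty
∂ (eps bs)        c = empty
∂ (char bs x)     c = if primCharToNat x == primCharToNat c then eps bs else empty
∂ (choice bs l r) c = choice bs (∂ l c) (∂ r c)
∂ (cat bs l r)    c = if ν l then choice bs (cat [] (∂ l c) r) (fuse (mkEps l) (∂ r c))
                             else cat bs (∂ l c) r
∂ (star bs e)     c = cat bs (fuse (0b ∷ []) (∂ e c)) (star [] e)

parse : BitRegex → List Char → Maybe (List Bit)
parse e []       = if ν e then just (mkEps e) else nothing
parse e (c ∷ cs) = parse (∂ e c) cs

-- internalize applied by hand to (c + ab)⋆.
re : BitRegex
re = star [] (choice [] (char (0b ∷ []) 'c')
                        (cat (1b ∷ []) (char [] 'a') (char [] 'b')))

parse-abab : parse re ('a' ∷ 'b' ∷ 'a' ∷ 'b' ∷ []) ≡ just (0b ∷ 1b ∷ 0b ∷ 1b ∷ 1b ∷ [])
parse-abab = refl

parse-a : parse re ('a' ∷ []) ≡ nothing
parse-a = refl
```

The bits returned for abab coincide with the code of the parse tree shown in the figure for (c + ab)⋆, which is the behaviour the certified algorithm guarantees in general.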
  • 22. Nullability test for BREs
      Checks whether ε ∈ e, for some (B)RE e.
      Agda code: decision procedure for [ ] ∈ e.
        ν(∅)      = ∅
        ν(ε)      = ε
        ν(a)      = ∅
        ν(e e')   = ε if ν(e) = ν(e') = ε, ∅ otherwise
        ν(e + e') = ε if ν(e) = ε or ν(e') = ε, ∅ otherwise
        ν(e⋆)     = ε
  • 23. Parsing with derivatives
      RE-based text search tools parse prefixes and substrings.
      Types IsPrefix and IsSubstr are proofs that some prefix / substring of an input string matches a BRE.
      Parsing algorithms are defined as proofs of decidability of IsPrefix and IsSubstr.
      Proof by induction on the input string, using properties of the derivative operation.
      data IsPrefix (a : List Char) (e : BitRegex) : Set where
        Prefix : a ≡ b ++ c → b ∈ e → IsPrefix a e
      data IsSubstr (a : List Char) (e : BitRegex) : Set where
        Substr : a ≡ b ++ c ++ d → c ∈ e → IsSubstr a e
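The computational content of these decision procedures can be approximated by two Boolean functions over plain REs: a prefix exists if the RE becomes nullable after consuming some initial segment, and a substring exists if some suffix of the input has a matching prefix. This is a sketch under stated assumptions, not the authors' proofs: hasPrefix and hasSubstr are hypothetical names, they return only a Bool instead of IsPrefix / IsSubstr evidence, and they work on Regex rather than BitRegex.

```agda
module SearchSketch where

open import Agda.Builtin.Bool     using (Bool; true; false)
open import Agda.Builtin.Char     using (Char; primCharToNat)
open import Agda.Builtin.List     using (List; []; _∷_)
open import Agda.Builtin.Nat      using (_==_)
open import Agda.Builtin.Equality using (_≡_; refl)

infix  0 if_then_else_
infixr 6 _∧_
infixr 5 _∨_

if_then_else_ : {A : Set} → Bool → A → A → A
if true  then x else y = x
if false then x else y = y

_∧_ : Bool → Bool → Bool
true  ∧ b = b
false ∧ b = false

_∨_ : Bool → Bool → Bool
true  ∨ b = true
false ∨ b = b

data Regex : Set where
  ∅ ε     : Regex
  chr     : Char → Regex
  seq alt : Regex → Regex → Regex
  rep     : Regex → Regex

ν : Regex → Bool
ν ∅          = false
ν ε          = true
ν (chr _)    = false
ν (seq e e') = ν e ∧ ν e'
ν (alt e e') = ν e ∨ ν e'
ν (rep _)    = true

∂ : Regex → Char → Regex
∂ ∅          c = ∅
∂ ε          c = ∅
∂ (chr x)    c = if primCharToNat x == primCharToNat c then ε else ∅
∂ (seq e e') c = if ν e then alt (seq (∂ e c) e') (∂ e' c)
                        else seq (∂ e c) e'
∂ (alt e e') c = alt (∂ e c) (∂ e' c)
∂ (rep e)    c = seq (∂ e c) (rep e)

-- Some prefix of the input matches e.
hasPrefix : Regex → List Char → Bool
hasPrefix e []       = ν e
hasPrefix e (c ∷ cs) = ν e ∨ hasPrefix (∂ e c) cs

-- Some substring matches e: try a prefix match at every position.
hasSubstr : Regex → List Char → Bool
hasSubstr e []       = ν e
hasSubstr e (c ∷ cs) = hasPrefix e (c ∷ cs) ∨ hasSubstr e cs

ab : Regex
ab = seq (chr 'a') (chr 'b')

prefix-abc : hasPrefix ab ('a' ∷ 'b' ∷ 'c' ∷ []) ≡ true
prefix-abc = refl

no-prefix-cab : hasPrefix ab ('c' ∷ 'a' ∷ 'b' ∷ []) ≡ false
no-prefix-cab = refl

substr-cab : hasSubstr ab ('c' ∷ 'a' ∷ 'b' ∷ []) ≡ true
substr-cab = refl

no-substr-cba : hasSubstr ab ('c' ∷ 'b' ∷ 'a' ∷ []) ≡ false
no-substr-cba = refl
```

The certified versions return the split of the input (b ++ c, or b ++ c ++ d) together with the membership proof, which is what a search tool needs in order to report the matched text.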
  • 25. Future Work
      How to improve efficiency?
        How does intrinsic verification affect the efficiency of the generated code?
        We are currently porting the code to extrinsic verification (proofs separated from program code).
        Experiment with an alternative formalization in which BRE semantics is defined through erase: s ∈ e holds iff s ∈ erase e.
      How to measure memory consumption without compiler support?
        The Agda compiler has no profiling support.
        Agda compiles to Haskell, but there is no direct correspondence between the Agda source code and the generated Haskell code.
  • 26. Conclusion
      We built a certified algorithm for BRE parsing in Agda.
      We certified several previous results about bit-coded parse trees and their relationship with RE semantics.
      The algorithm is included in verigrep, our tool for RE text search.
  • 27. Relating REs and BREs
      Function fuse attaches a bit-string to a BRE.
      fuse : List Bit → BitRegex → BitRegex
      fuse bs empty = empty
      fuse bs (eps x) = eps (bs ++ x)
      fuse bs (char x c) = char (bs ++ x) c
      fuse bs (choice x e e') = choice (bs ++ x) e e'
      fuse bs (cat x e e') = cat (bs ++ x) e e'
      fuse bs (star x e) = star (bs ++ x) e
  • 28. Building bit-codes for ε
      mkEps : [ ] ∈ t → List Bit
      mkEps (eps bs) = bs
      mkEps (inl br bs pr) = bs ++ mkEps pr
      mkEps (inr bl bs pr) = bs ++ mkEps pr
      mkEps (cat bs pr pr' eq) = bs ++ mkEps pr ++ mkEps pr'
      mkEps (star[] bs) = bs ++ [ 1b ]
      mkEps (star-:: bs pr pr' x) = bs ++ [ 1b ]
  • 29. Relating parse trees and RE semantics
      unflat builds parse trees out of RE semantics evidence.
      Functions flat and unflat are inverses.
      unflat : xs ∈ e → Tree e
      unflat ε = ε
      unflat ($ c) = $ c
      unflat (prf • prf') = unflat prf • unflat prf'
      unflat (r +L prf) = inl r (unflat prf)
      unflat (l +R prf) = inr l (unflat prf)
      -- some code omitted