/regexp?/
The mechanics of Expression Processing
with some PCRE2 references
Meetup PUG
25 July 2017
#AperiTech
I AM DAVIDE DELL’ERBA
Research & Development @
INDEX
◇ Regex engine types
◇ Two all-encompassing rules
◇ NFA vs DFA
◇ Backtracking
◇ PCRE2
REGEX ENGINE TYPES
◇ DFA
◇ Traditional NFA
◇ POSIX NFA
◇ Hybrid NFA/DFA
REGEX ENGINE TYPES
Engine type | Programs
DFA | awk (most versions), egrep (most versions), flex, lex, MySQL, Procmail
Traditional NFA | GNU Emacs, Java, grep (most versions), less, more, .NET languages, PCRE library, Perl, PHP, Python, Ruby, sed (most versions), vi
POSIX NFA | mawk, Mortice Kern Systems’ utilities, GNU Emacs (when requested)
Hybrid NFA/DFA | GNU awk, GNU grep / egrep, Tcl
REGEX ENGINE TYPES IN PHP
Text processing extension | Description
PCRE | Regular Expressions (Perl-Compatible)
POSIX Regex | Regular Expression (POSIX Extended); deprecated since PHP 5.3, removed in PHP 7.0
TESTING THE ENGINE TYPES
Traditional NFA or not?
「nfa|nfa not」 applied to “nfa not”
Traditional NFA: matches “nfa”
DFA, POSIX NFA: match “nfa not”
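A quick way to run this test, sketched in Python because the engine table lists Python among the traditional NFA engines. Only the standard `re` module is used; the variable names are illustrative.

```python
import re

# Ordered alternation: a traditional NFA stops at the first alternative
# that lets the overall match succeed, even if a longer one exists.
match = re.search(r"nfa|nfa not", "nfa not")

engine_guess = "traditional NFA" if match.group() == "nfa" else "DFA or POSIX NFA"
print(match.group(), "->", engine_guess)
```

An engine that reports “nfa not” for the same input follows the leftmost-longest rule instead.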
TESTING THE ENGINE TYPES
DFA or POSIX NFA?
「X(.+)+X」 applied to “=XX=====================”
No match!
POSIX NFA: takes a long time to give up
DFA: gives up quickly
TWO ALL-ENCOMPASSING RULES
1. The match that begins earliest
(leftmost) wins
2. The standard quantifiers are
greedy (「*」,「+」,「?」,「{m,n}」)
THE MATCH THAT BEGINS EARLIEST WINS
◇ “a match” instead of “the match”
◇ the first attempt starts at the beginning
of the string
◇ if all permutations are exhausted
without a match, retry from the next
character
THE MATCH THAT BEGINS EARLIEST WINS
「cat」 “The dragging belly indicates your cat is too fat”
matches the “cat” inside “indicates”
「fat|cat|belly|your」 “The dragging belly indicates your cat is too fat”
matches “belly”
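The leftmost rule can be checked with a short Python sketch. It uses the standard `re` module; the variable names are illustrative.

```python
import re

text = "The dragging belly indicates your cat is too fat"

first_cat = re.search(r"cat", text)
first_word = re.search(r"fat|cat|belly|your", text)

# "cat" is found inside "indicates", before the standalone word "cat".
print(first_cat.start(), text.index("indicates") + 4)

# "belly" wins because it starts earliest, although it is listed third.
print(first_word.group())
```

The order of the alternatives only matters among candidates that start at the same position.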
THE STANDARD QUANTIFIERS ARE GREEDY
◇ minimum number of matches
that are required before it can
be considered successful
◇ maximum number that it will
ever attempt to match
THE STANDARD QUANTIFIERS ARE GREEDY
「.*(\d+)」 “Copyright - 05 March 2016”
「.*(\d*)」 “Copyright - 05 March 2016”
「\d+(?!.*\d)」 “Copyright - 05 March 2016”
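These three patterns can be tried with Python's `re`, taking the digit class to be `\d`. This is a small sketch; the variable names are illustrative.

```python
import re

subject = "Copyright - 05 March 2016"

# The greedy .* grabs everything, then gives back just enough for \d+ to match.
greedy_plus = re.search(r".*(\d+)", subject).group(1)

# \d* is happy with zero digits, so .* never has to give anything back.
greedy_star = re.search(r".*(\d*)", subject).group(1)

# The negative lookahead only accepts digits with no digit after them.
last_number = re.search(r"\d+(?!.*\d)", subject).group()

print(repr(greedy_plus), repr(greedy_star), repr(last_number))
```

The first pattern captures only the final digit, the second captures nothing, and the third isolates the last number as a whole.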
REGEX-DIRECTION VS TEXT-DIRECTION
◇ NFA engine is Regex-Directed
◇ DFA engine is Text-Directed
NFA ENGINE: REGEX-DIRECTED
「to(nite|knight|night)」 “tonight”
The regex drives the match: 「t」 and 「o」 match, then each alternative is tried in turn from the same text position. 「nite」 fails at “g”, 「knight」 fails at “n”, and 「night」 matches.
DFA ENGINE: TEXT-DIRECTED
「to(nite|knight|night)」 “tonight”
The text drives the match: each character is read once while every alternative still in play is tracked. After “ton” only 「nite」 and 「night」 remain; “g” rules out 「nite」, and “ht” completes 「night」.
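The two strategies can be contrasted with a toy sketch in Python. It only handles a literal prefix followed by literal alternatives, and the function names `regex_directed` and `text_directed` are hypothetical helpers, not part of any library.

```python
def regex_directed(prefix, alternatives, text):
    """Try each alternative in order, restarting from the same text position."""
    if not text.startswith(prefix):
        return None, []
    tried = []
    for alt in alternatives:
        tried.append(alt)
        if text.startswith(alt, len(prefix)):
            return prefix + alt, tried
    return None, tried


def text_directed(prefix, alternatives, text):
    """Read each character once, keeping every alternative that still fits."""
    if not text.startswith(prefix):
        return None, []
    alive = list(alternatives)
    history = []
    longest = None
    for i, ch in enumerate(text[len(prefix):]):
        alive = [alt for alt in alive if i < len(alt) and alt[i] == ch]
        history.append((ch, list(alive)))
        for alt in alive:
            if len(alt) == i + 1:      # this alternative is now complete
                longest = prefix + alt
        if not alive:
            break
    return longest, history


alts = ["nite", "knight", "night"]
print(regex_directed("to", alts, "tonight"))
print(text_directed("to", alts, "tonight"))
```

The first function revisits the same text once per alternative; the second never looks at a character twice, at the cost of tracking several candidates at once.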
NFA VS DFA
Property | DFA | NFA
Time | Fast | Slow
Space | Less | More
Type | Deterministic | Non-deterministic
Result | Consistent | Unpredictable
Backtracking | ✗ | ✓
Construction | DFA ⊂ NFA | NFA ⊃ DFA
Pre-compile | Slower and more memory | Faster and less memory
Then? | Is boring | Is fun
BACKTRACKING
◇ Consider each subexpression or
component in turn
◇ If it must decide between two (or more)
equally viable options:
○ it selects one
○ it remembers the others
◇ If the selected option succeeds (and the rest
of the regex succeeds too)
○ the match is finished
◇ Otherwise it backtracks to where it chose
the first option and tries another
TWO IMPORTANT POINTS ON BACKTRACKING
◇ When faced with multiple choices, which
should be tried first?
The engine always makes the attempt first for
greedy quantifiers and skips it first for lazy ones.
◇ When forced to backtrack, which saved
choice should the engine use?
The most recently saved option is the one
used (LIFO: Last In First Out)
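The first point is easy to observe with Python's `re`, a traditional NFA. This is a minimal sketch; the variable names are illustrative.

```python
import re

# Greedy: the optional "b" is attempted first, so it ends up in the match.
greedy = re.match(r"ab?", "abc").group()

# Lazy: the optional "b" is skipped first, and nothing forces a retry.
lazy = re.match(r"ab??", "abc").group()

print(greedy, lazy)
```

Both patterns succeed on their first choice, so neither saved state is ever revisited.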
SAVED STATES
◇ A match without backtracking
「ab?c」 “abc”
「a」 matches “a”; 「b?」 saves the “skip b” state and matches “b”; 「c」 matches “c”. The saved state is never used.
SAVED STATES
◇ A match with backtracking
「ab?c」 “ac”
「a」 matches “a”; 「b?」 saves the “skip b” state and tries 「b」 against “c” ✗; the engine backtracks to the saved state and 「c」 matches “c”.
SAVED STATES
◇ A lazy match with backtracking
「ab??c」 “abc”
「a」 matches “a”; 「b??」 saves the “try b” state and skips 「b」; 「c」 fails against “b” ✗; the engine backtracks to the saved state, 「b」 matches “b” and 「c」 matches “c”.
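The saved-state mechanism can be sketched as a toy backtracking matcher in Python. It is an illustration only, not how PCRE2 is implemented: it supports literals with an optional `?` or `??`, matches anchored at the start of the text, and `compile_pattern` and `run` are hypothetical names.

```python
def compile_pattern(pattern):
    """Compile literals with optional ? / ?? into a tiny instruction list."""
    program = []
    i = 0
    while i < len(pattern):
        ch, pc = pattern[i], len(program)
        if pattern.startswith("??", i + 1):      # lazy: skip first, save "try"
            program += [("split", pc + 2, pc + 1), ("char", ch)]
            i += 3
        elif pattern.startswith("?", i + 1):     # greedy: try first, save "skip"
            program += [("split", pc + 1, pc + 2), ("char", ch)]
            i += 2
        else:
            program.append(("char", ch))
            i += 1
    return program


def run(pattern, text):
    """Return (matched, backtracks) for a match anchored at the start of text."""
    program = compile_pattern(pattern)
    saved = []                      # LIFO stack of (instruction, text position)
    pc = pos = backtracks = 0
    while pc < len(program):
        op = program[pc]
        if op[0] == "split":
            saved.append((op[2], pos))          # remember the other option
            pc = op[1]
        elif pos < len(text) and text[pos] == op[1]:
            pc, pos = pc + 1, pos + 1
        elif saved:
            pc, pos = saved.pop()               # most recently saved state
            backtracks += 1
        else:
            return False, backtracks
    return True, backtracks


for pattern, text in [("ab?c", "abc"), ("ab?c", "ac"), ("ab??c", "abc")]:
    print(pattern, text, run(pattern, text))
```

The three calls correspond to the three SAVED STATES cases: no backtracking, one backtrack for the greedy 「b?」 on “ac”, and one backtrack for the lazy 「b??」 on “abc”.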
POSIX NFA
◇ A POSIX NFA does not stop at the
first match it finds, but continues to
try any saved states that remain
◇ Each time it reaches the end of the
regex, it has another plausible
match
◇ Eventually, all options are exhausted
and the longest match found is reported
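A loose emulation of this "keep going and remember the longest" behaviour, sketched in Python. It is not a POSIX engine: it simply tries each top-level alternative at the start of the text and keeps the longest result. `posix_style` is a hypothetical helper name.

```python
import re


def posix_style(alternatives, text):
    """Try every alternative at the start of text and keep the longest match."""
    best = None
    for alt in alternatives:
        m = re.match(alt, text)
        if m and (best is None or len(m.group()) > len(best)):
            best = m.group()
    return best


traditional = re.match(r"nfa|nfa not", "nfa not").group()
longest = posix_style(["nfa", "nfa not"], "nfa not")
print(traditional, "|", longest)
```

The traditional engine stops at the first plausible match, while the emulation only answers after every option has been examined.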
PCRE2: PERL COMPATIBLE REGULAR EXPRESSIONS
The PCRE library is a set of functions that
implement regular expression pattern
matching using the same syntax and
semantics as Perl 5.
Why a Perl regex clone?
Perl regex is a de facto standard for the
web age.
PCRE2 vs PCRE
◇ The new API does not have any
user-visible C structures
◇ Function calls are used as the means
of interacting with the library
◇ JIT compilation has been moved into a
separate function
◇ It contains no static or global variables
◇ It introduces the idea of contexts in which
PCRE2 functions are called
BASE PROCESS IS EASY
pcre2_compile() → pcre2_match() → results...
PCRE2_COMPILE()
re = pcre2_compile(
pattern, /* the pattern */
PCRE2_ZERO_TERMINATED, /* indicates pattern is zero-terminated */
0, /* default options */
&errornumber, /* for error number */
&erroroffset, /* for error offset */
NULL); /* use default compile context */
match_data = pcre2_match_data_create_from_pattern(re, NULL);
rc = pcre2_match(
re, /* the compiled pattern */
subject, /* the subject string */
subject_length, /* the length of the subject */
0, /* start at offset 0 in the subject */
0, /* default options */
match_data, /* block for storing the result */
NULL); /* use default match context */
ovector = pcre2_get_ovector_pointer(match_data);
PCRE2_COMPILE STRUCTURE
typedef struct pcre2_real_code {
pcre2_memctl memctl; /* Memory control fields */
const uint8_t *tables; /* The character tables */
void *executable_jit; /* Pointer to JIT code */
uint8_t start_bitmap[32]; /* Bitmap for starting code unit < 256 */
CODE_BLOCKSIZE_TYPE blocksize; /* Total (bytes) that was malloc-ed */
uint32_t magic_number; /* Paranoid and endianness check */
uint32_t compile_options; /* Options passed to pcre2_compile() */
uint32_t overall_options; /* Options after processing the pattern */
uint32_t flags; /* Various state flags */
uint32_t limit_heap; /* Limit set in the pattern */
uint32_t limit_match; /* Limit set in the pattern */
uint32_t limit_depth; /* Limit set in the pattern */
uint32_t first_codeunit; /* Starting code unit */
uint32_t last_codeunit; /* This codeunit must be seen */
uint16_t bsr_convention; /* What \R matches */
uint16_t newline_convention; /* What is a newline? */
uint16_t max_lookbehind; /* Longest lookbehind (characters) */
uint16_t minlength; /* Minimum length of match */
uint16_t top_bracket; /* Highest numbered group */
uint16_t top_backref; /* Highest numbered back reference */
uint16_t name_entry_size; /* Size (code units) of table entries */
uint16_t name_count; /* Number of name entries in the table */
} pcre2_real_code;
PCRE2_MATCH()
match_data = pcre2_match_data_create_from_pattern(re, NULL);
rc = pcre2_match(
re, /* the compiled pattern */
subject, /* the subject string */
subject_length, /* the length of the subject */
0, /* start at offset 0 in the subject */
0, /* default options */
match_data, /* block for storing the result */
NULL); /* use default match context */
MATCH RESULT STRUCTURE
typedef struct pcre2_real_match_data {
pcre2_memctl memctl;
const pcre2_real_code *code; /* The pattern used for the match */
PCRE2_SPTR subject; /* The subject that was matched */
PCRE2_SPTR mark; /* Pointer to last mark */
PCRE2_SIZE leftchar; /* Offset to leftmost code unit */
PCRE2_SIZE rightchar; /* Offset to rightmost code unit */
PCRE2_SIZE startchar; /* Offset to starting code unit */
uint16_t matchedby; /* Type of match (normal, JIT, DFA) */
uint16_t oveccount; /* Number of pairs */
int rc; /* The return code from the match */
PCRE2_SIZE ovector[10000];/* The first field */
} pcre2_real_match_data;
OVECTOR
ovector = pcre2_get_ovector_pointer(match_data);
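In PCRE2 the ovector is a flat array of offset pairs: the first pair delimits the whole match and each following pair delimits one capture group. A rough Python analogue built from `re` spans is sketched below; it is not the PCRE2 API, and `ovector_like` is a hypothetical helper name.

```python
import re


def ovector_like(pattern, subject):
    """Flatten match spans into [start0, end0, start1, end1, ...]."""
    m = re.search(pattern, subject)
    if m is None:
        return None
    flat = []
    for group in range(m.re.groups + 1):   # group 0 is the whole match
        start, end = m.span(group)
        flat.extend((start, end))
    return flat


subject = "Copyright - 05 March 2016"
offsets = ovector_like(r"(\d+) (\w+)", subject)
print(offsets)

# Pair i gives the substring for group i, as with a real ovector.
print(subject[offsets[2]:offsets[3]], subject[offsets[4]:offsets[5]])
```

One difference to keep in mind: Python reports an unmatched group as (-1, -1), whereas PCRE2 marks it with the `PCRE2_UNSET` value.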
TRY IT YOURSELF!
docker pull delda/pcre2
docker run -it delda/pcre2 bash
delda/pcre2 is a Docker image based on Debian Jessie
with a checkout of the PCRE2 source code. The library is
built with the UTF-8, debug and JIT options enabled.
NFA EXAMPLE
root@1cf6d9ffdc9b:/src/pcre2# ./pcre2test
PCRE2 version 10.30-DEV 2017-03-05
re> /nfa|nfa not/
data> nfa not
0: nfa
data>
NFA EXAMPLE
root@1cf6d9ffdc9b:/src/pcre2# ./pcre2test
PCRE2 version 10.30-DEV 2017-03-05
re> /nfa|nfa not/auto_callout
data> nfa not
--->nfa not
+0 ^ n
+1 ^^ f
+2 ^ ^ a
+3 ^ ^ |
0: nfa
data>
NFA EXAMPLE
root@1cf6d9ffdc9b:/src/pcre2# ./pcre2test -d
PCRE2 version 10.30-DEV 2017-03-05
re> /nfa|nfa not/
-----------------------------------------------
0 9 Bra
3 nfa
9 17 Alt
12 nfa not
26 26 Ket
29 End
-----------------------------------------------
Capturing subpattern count = 0
First code unit = 'n'
Last code unit = 'a'
Subject length lower bound = 3
data>
DFA EXAMPLE
root@1cf6d9ffdc9b:/src/pcre2# ./pcre2test -dfa
PCRE2 version 10.30-DEV 2017-03-05
re> /nfa|nfa not/auto_callout
data>
DFA EXAMPLE
data> nfa not
--->nfa not
+0 ^ n
+4 ^ n
+1 ^^ f
+5 ^^ f
+2 ^ ^ a
+6 ^ ^ a
+3 ^ ^ |
+7 ^ ^
+8 ^ ^ n
+9 ^ ^ o
+10 ^ ^ t
+11 ^ ^
0: nfa not
1: nfa
data> ^C

Regular Expression (RegExp)

ANY QUESTIONS?
You can find me at
@delda80
github.com/delda
info@davidedellerba.it