RegExp

The mechanics of Expression Processing with some PCRE2 referral

Published in: Technology

RegExp

  1. 1. /regexp?/ The mechanics of Expression Processing with some PCRE2 referral Meetup PUG 25 July 2017 #AperiTech
  2. 2. I AM DAVIDE DELL’ERBA Research & Development @
  3. 3. INDEX ◇ Regex engine types ◇ Two all-encompassing rules ◇ NFA vs DFA ◇ Backtracking ◇ PCRE2
  4. 4. REGEX ENGINE TYPES ◇ DFA ◇ Traditional NFA ◇ POSIX NFA ◇ Hybrid NFA/DFA
  5. 5. REGEX ENGINE TYPES
     Engine type: Programs
     DFA: awk (most versions), egrep (most versions), flex, lex, MySQL, Procmail
     Traditional NFA: GNU Emacs, Java, grep (most versions), less, more, .NET languages, PCRE library, Perl, PHP, Python, Ruby, sed (most versions), vi
     POSIX NFA: mawk, Mortice Kern Systems’ utilities, GNU Emacs (when requested)
     Hybrid NFA/DFA: GNU awk, GNU grep / egrep, Tcl
  6. 6. REGEX ENGINE TYPES IN PHP
     Text processing: Programs
     PCRE: Regular Expressions (Perl-Compatible)
     POSIX Regex: Regular Expression (POSIX Extended); deprecated as of PHP 5.3, removed in PHP 7.0
  7. 7. TESTING THE ENGINE TYPES Traditional NFA or not? Apply 「nfa|nfa not」 to “nfa not”. A Traditional NFA matches only “nfa”; a DFA or a POSIX NFA matches the whole “nfa not”.
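  8. 8. TESTING THE ENGINE TYPES DFA or POSIX NFA? Apply 「X(.+)+X」 to “=XX=====================”. There is no match either way, but a POSIX NFA takes a very long time to report it, while a DFA answers almost immediately.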
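
The first test can be tried from any language whose engine appears in the table on slide 5. The sketch below uses Python's standard `re` module, which that table lists among the Traditional NFA tools; the variable names are only illustrative. Swapping the order of the alternatives changes the result, which is the telltale sign of an ordered, regex-directed engine.

```python
import re

subject = "nfa not"

# A traditional NFA tries the alternatives left to right and stops at the
# first one that lets the overall match succeed.
first_wins = re.search(r"nfa|nfa not", subject).group(0)

# With the longer alternative listed first, the same engine returns it.
longer_first = re.search(r"nfa not|nfa", subject).group(0)

print(repr(first_wins))
print(repr(longer_first))
```

The second test, 「X(.+)+X」, is deliberately left out of the sketch: on a backtracking engine its running time roughly doubles with every extra “=” in the subject.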
  9. 9. TWO ALL-ENCOMPASSING RULES 1. The match that begins earliest (leftmost) wins 2. The standard quantifiers are greedy (「*」,「+」,「?」,「{m,n}」)
  10. 10. THE MATCH THAT BEGINS EARLIEST WINS ◇ “a match” rather than “the match” ◇ the engine first attempts a match at the beginning of the string ◇ if all permutations are exhausted without a match, it retries from the next character
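  11. 11. THE MATCH THAT BEGINS EARLIEST WINS 「cat」 applied to “The dragging belly indicates your cat is too fat” matches the “cat” inside “indicates”, not the word “cat”. 「fat|cat|belly|your」 applied to the same string matches “belly”, the alternative that begins earliest in the text.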
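
A quick way to check the leftmost rule is to run both expressions from the slide through a backtracking engine. This is a minimal sketch with Python's `re` module; the variable names are illustrative.

```python
import re

text = "The dragging belly indicates your cat is too fat"

# The earliest position where "cat" can match is inside "indicates".
cat = re.search(r"cat", text)
inside_indicates = cat.start() == text.index("indicates") + 4

# The order of the alternatives does not matter here: "belly" starts
# earlier in the text than "your", "cat" or "fat".
earliest = re.search(r"fat|cat|belly|your", text).group(0)

print(inside_indicates, earliest)
```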
  12. 12. THE STANDARD QUANTIFIERS ARE GREEDY ◇ a quantifier has a minimum number of matches that are required before it can be considered successful ◇ and a maximum number that it will ever attempt to match
  13. 13. THE STANDARD QUANTIFIERS ARE GREEDY 「.*(\d+)」 applied to “Copyright - 05 March 2016” captures only “6” ◇ 「.*(\d*)」 applied to the same string captures the empty string ◇ 「\d+(?!.*\d)」 matches “2016”
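
The three expressions on this slide can be checked with any backtracking engine. The sketch below uses Python's `re` module and assumes the digit class is written `\d` (the backslashes are missing from the slide text); the variable names are illustrative.

```python
import re

subject = "Copyright - 05 March 2016"

# Greedy .* swallows the whole line, then gives back just enough for \d+.
one_digit = re.search(r".*(\d+)", subject).group(1)

# \d* is satisfied by zero digits, so .* never has to give anything back.
nothing = re.search(r".*(\d*)", subject).group(1)

# A run of digits that is not followed by any other digit.
last_number = re.search(r"\d+(?!.*\d)", subject).group(0)

print(repr(one_digit), repr(nothing), repr(last_number))
```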
  14. 14. REGEX-DIRECTION VS TEXT-DIRECTION ◇ NFA engine is Regex-Directed ◇ DFA engine is Text-Directed
  15. 15. NFA ENGINE: REGEX-DIRECTED 「to(nite|knight|night)」 applied to “tonight”: the engine walks through the regex one component at a time, trying each alternative in turn against the text.
  16. 16. DFA ENGINE: TEXT-DIRECTED 「to(nite|knight|night)」 applied to “tonight”: the engine scans the text one character at a time, keeping track of every alternative that is still in play.
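
The difference between the two strategies can be sketched for the 「to(nite|knight|night)」 example. The code below is a toy model, not a real regex engine: it handles only a literal prefix followed by literal alternatives, and the function names `regex_directed` and `text_directed` are made up for illustration.

```python
def regex_directed(prefix, alternatives, text):
    """Walk the regex: try each alternative in order, rescanning the text."""
    if not text.startswith(prefix):
        return None, []
    rest = text[len(prefix):]
    tried = []
    for alt in alternatives:
        tried.append(alt)
        if rest.startswith(alt):
            return prefix + alt, tried
    return None, tried


def text_directed(prefix, alternatives, text):
    """Walk the text: read each character once, keep every viable literal."""
    alive = {prefix + alt for alt in alternatives}
    history = []
    for i, ch in enumerate(text):
        alive = {c for c in alive if i < len(c) and c[i] == ch}
        history.append(sorted(alive))
        done = sorted(c for c in alive if len(c) == i + 1)
        if done:
            # Simplification: stop at the first completed literal.
            return done[0], history
        if not alive:
            return None, history
    return None, history


alts = ["nite", "knight", "night"]
nfa_result, tried = regex_directed("to", alts, "tonight")
dfa_result, history = text_directed("to", alts, "tonight")
print(nfa_result, tried)
print(dfa_result, history)
```

In the regex-directed run, two alternatives fail before 「night」 succeeds; in the text-directed run, “tonite” stays alive until the “g” is read and no character is ever examined twice.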
  17. 17. NFA VS DFA
      Criterion: DFA | NFA
      Time: Fast | Slow
      Space: Less | More
      Type: Deterministic | Non-deterministic
      Result: Consistent | Unpredictable
      Backtracking: ✗ | ✓
      Construction: DFA ⊂ NFA | NFA ⊃ DFA
      Pre-compile: Slower and more memory | Faster and less memory
      Then?: Is boring | Is fun
  18. 18. BACKTRACKING ◇ The engine considers each subexpression or component in turn ◇ If it has to decide between two (or more) equally viable options: ○ it selects one ○ it remembers the others ◇ If the chosen option is successful (and the rest of the regex is also successful) ○ the match is finished ◇ Otherwise it backtracks to where it made the choice and tries another option
  19. 19. TWO IMPORTANT POINTS ON BACKTRACKING ◇ When faced with multiple choices, which should be tried first? For a greedy quantifier the engine first makes the attempt; for a lazy one it first skips the attempt. ◇ When forced to backtrack, which saved choice should the engine use? The most recently saved option is the one used (LIFO: Last In First Out)
  20. 20. SAVED STATES ◇ A match without backtracking: 「ab?c」 applied to “abc”. The engine saves a state at 「b?」 (the option to skip the “b”), but never needs it, because 「b」 and then 「c」 both match.
  21. 21. SAVED STATES ◇ A match with backtracking: 「ab?c」 applied to “ac”. The attempt to match 「b」 against “c” fails (✗), so the engine backtracks to the saved state, skips 「b?」, and matches 「c」.
  22. 22. SAVED STATES ◇ A lazy match with backtracking: 「ab??c」 applied to “abc”. The lazy 「b??」 is skipped first; 「c」 then fails against “b” (✗), so the engine backtracks to the saved state, matches 「b」, and then matches 「c」.
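
The three saved-state walkthroughs can be reproduced with a toy backtracking matcher. The sketch below is an illustration only, not how PCRE2 is implemented: it supports just literal characters optionally followed by 「?」 (greedy) or 「??」 (lazy), it anchors the match at the start of the text, and the names `parse` and `match_at_start` are invented for this example.

```python
def parse(pattern):
    """Split a pattern into (char, kind) items: 'one', 'greedy' or 'lazy'."""
    items, i = [], 0
    while i < len(pattern):
        ch = pattern[i]
        if pattern.startswith("??", i + 1):
            items.append((ch, "lazy"))
            i += 3
        elif pattern.startswith("?", i + 1):
            items.append((ch, "greedy"))
            i += 2
        else:
            items.append((ch, "one"))
            i += 1
    return items


def match_at_start(pattern, text):
    """Return (matched text or None, number of backtracks)."""
    items = parse(pattern)
    saved = []            # LIFO stack of untried options
    backtracks = 0
    pi, ti, forced = 0, 0, None
    while pi < len(items):
        ch, kind = items[pi]
        if forced is not None:          # resuming from a saved state
            action, forced = forced, None
        elif kind == "one":
            action = "take"
        elif kind == "greedy":          # attempt first, remember the skip
            saved.append((pi, ti, "skip"))
            action = "take"
        else:                           # lazy: skip first, remember the attempt
            saved.append((pi, ti, "take"))
            action = "skip"
        if action == "skip":
            pi += 1
        elif ti < len(text) and text[ti] == ch:
            pi, ti = pi + 1, ti + 1
        elif saved:                     # local failure: most recent state wins
            pi, ti, forced = saved.pop()
            backtracks += 1
        else:
            return None, backtracks
    return text[:ti], backtracks


print(match_at_start("ab?c", "abc"))
print(match_at_start("ab?c", "ac"))
print(match_at_start("ab??c", "abc"))
```

Each call returns the matched text together with the number of times a saved state had to be popped, which mirrors the ✗ marks on the slides.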
  23. 23. POSIX NFA ◇ A POSIX NFA does not stop at the first match it finds, but continues to try the saved states that remain ◇ Each time it reaches the end of the regex, it has another plausible match ◇ Eventually all options are exhausted, and the longest of the leftmost matches is reported
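
The contrast with a Traditional NFA can be sketched by trying every top-level alternative instead of stopping at the first success. This is a simplification under stated assumptions: it only looks at top-level alternatives anchored at the start of the text, it uses Python's `re` module to test each one, and the helper names `traditional_first` and `posix_longest` are hypothetical.

```python
import re


def traditional_first(alternatives, text):
    """Stop at the first alternative that matches, in the order written."""
    for alt in alternatives:
        m = re.match(alt, text)
        if m:
            return m.group(0)
    return None


def posix_longest(alternatives, text):
    """Try every alternative and keep the longest match found."""
    best = None
    for alt in alternatives:
        m = re.match(alt, text)
        if m and (best is None or len(m.group(0)) > len(best)):
            best = m.group(0)
    return best


alts = ["nfa", "nfa not"]
print(repr(traditional_first(alts, "nfa not")))
print(repr(posix_longest(alts, "nfa not")))
```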
  24. 24. PCRE2: PERL COMPATIBLE REGULAR EXPRESSIONS The PCRE library is a set of functions that implement regular expression pattern matching using the same syntax and semantics as Perl 5 ◇ Why a Perl regex clone? Perl regex is the de facto standard of the web age.
  25. 25. PCRE2 vs PCRE ◇ The new API does not have any user-visible C structures ◇ Function calls are the means of interacting with the library ◇ JIT compilation has been moved into a separate function ◇ It contains no static or global variables ◇ It introduces the idea of contexts in which PCRE2 functions are called
  26. 26. BASE PROCESS IS EASY pcre2_compile() → pcre2_match() → results...
  27. 27. PCRE2_COMPILE()
      re = pcre2_compile(
        pattern,               /* the pattern */
        PCRE2_ZERO_TERMINATED, /* indicates pattern is zero-terminated */
        0,                     /* default options */
        &errornumber,          /* for error number */
        &erroroffset,          /* for error offset */
        NULL);                 /* use default compile context */
      match_data = pcre2_match_data_create_from_pattern(re, NULL);
      rc = pcre2_match(
        re,                    /* the compiled pattern */
        subject,               /* the subject string */
        subject_length,        /* the length of the subject */
        0,                     /* start at offset 0 in the subject */
        0,                     /* default options */
        match_data,            /* block for storing the result */
        NULL);                 /* use default match context */
      ovector = pcre2_get_ovector_pointer(match_data);
  28. 28. PCRE2_COMPILE STRUCTURE
      typedef struct pcre2_real_code {
        pcre2_memctl memctl;           /* Memory control fields */
        const uint8_t *tables;         /* The character tables */
        void *executable_jit;          /* Pointer to JIT code */
        uint8_t start_bitmap[32];      /* Bitmap for starting code unit < 256 */
        CODE_BLOCKSIZE_TYPE blocksize; /* Total (bytes) that was malloc-ed */
        uint32_t magic_number;         /* Paranoid and endianness check */
        uint32_t compile_options;      /* Options passed to pcre2_compile() */
        uint32_t overall_options;      /* Options after processing the pattern */
        uint32_t flags;                /* Various state flags */
        uint32_t limit_heap;           /* Limit set in the pattern */
        uint32_t limit_match;          /* Limit set in the pattern */
        uint32_t limit_depth;          /* Limit set in the pattern */
        uint32_t first_codeunit;       /* Starting code unit */
        uint32_t last_codeunit;        /* This codeunit must be seen */
        uint16_t bsr_convention;       /* What \R matches */
        uint16_t newline_convention;   /* What is a newline? */
        uint16_t max_lookbehind;       /* Longest lookbehind (characters) */
        uint16_t minlength;            /* Minimum length of match */
        uint16_t top_bracket;          /* Highest numbered group */
        uint16_t top_backref;          /* Highest numbered back reference */
        uint16_t name_entry_size;      /* Size (code units) of table entries */
        uint16_t name_count;           /* Number of name entries in the table */
      } pcre2_real_code;
  29. 29. PCRE2_MATCH() Same listing as slide 27; the relevant calls are:
      match_data = pcre2_match_data_create_from_pattern(re, NULL);
      rc = pcre2_match(
        re,                    /* the compiled pattern */
        subject,               /* the subject string */
        subject_length,        /* the length of the subject */
        0,                     /* start at offset 0 in the subject */
        0,                     /* default options */
        match_data,            /* block for storing the result */
        NULL);                 /* use default match context */
  30. 30. MATCH RESULT STRUCTURE
      typedef struct pcre2_real_match_data {
        pcre2_memctl memctl;
        const pcre2_real_code *code;  /* The pattern used for the match */
        PCRE2_SPTR subject;           /* The subject that was matched */
        PCRE2_SPTR mark;              /* Pointer to last mark */
        PCRE2_SIZE leftchar;          /* Offset to leftmost code unit */
        PCRE2_SIZE rightchar;         /* Offset to rightmost code unit */
        PCRE2_SIZE startchar;         /* Offset to starting code unit */
        uint16_t matchedby;           /* Type of match (normal, JIT, DFA) */
        uint16_t oveccount;           /* Number of pairs */
        int rc;                       /* The return code from the match */
        PCRE2_SIZE ovector[10000];    /* The first field */
      } pcre2_real_match_data;
  31. 31. OVECTOR Same listing as slide 27; the relevant call is:
      ovector = pcre2_get_ovector_pointer(match_data);
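
In PCRE2 the ovector is a flat array of offset pairs: elements 2*i and 2*i+1 hold the start and end of group i, with group 0 being the whole match. As a rough analogy only, the sketch below walks through the same compile, match, read-offsets sequence with Python's `re` module rather than PCRE2; the helper `ovector_like` and the sample pattern are made up for illustration.

```python
import re


def ovector_like(match):
    """Flatten a Python match object into PCRE2-style offset pairs."""
    flat = []
    for group in range(match.re.groups + 1):
        start, end = match.span(group)
        flat.extend([start, end])
    return flat


pattern = re.compile(r"(\d+) (\w+) (\d+)")   # compile step
subject = "Copyright - 05 March 2016"
m = pattern.search(subject)                  # match step
ov = ovector_like(m)                         # offsets, as pairs

whole = subject[ov[0]:ov[1]]                 # group 0: the whole match
year = subject[ov[6]:ov[7]]                  # group 3: pair number 3
print(ov, repr(whole), repr(year))
```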
  32. 32. TRY IT YOURSELF!
      docker pull delda/pcre2
      docker run -it delda/pcre2 bash
      delda/pcre2 is a Docker image based on Debian Jessie with a checkout of the PCRE2 source code. The library is installed with the UTF-8, debug, and JIT options active.
  33. 33. NFA EXAMPLE
      root@1cf6d9ffdc9b:/src/pcre2# ./pcre2test
      PCRE2 version 10.30-DEV 2017-03-05
      re> /nfa|nfa not/
      data> nfa not
      0: nfa
      data>
  34. 34. NFA EXAMPLE
      root@1cf6d9ffdc9b:/src/pcre2# ./pcre2test
      PCRE2 version 10.30-DEV 2017-03-05
      re> /nfa|nfa not/auto_callout
      data> nfa not
      --->nfa not
      +0 ^ n
      +1 ^^ f
      +2 ^ ^ a
      +3 ^ ^ |
      0: nfa
      data>
  35. 35. NFA EXAMPLE
      root@1cf6d9ffdc9b:/src/pcre2# ./pcre2test -d
      PCRE2 version 10.30-DEV 2017-03-05
      re> /nfa|nfa not/
      -----------------------------------------------
      0 9 Bra
      3 nfa
      9 17 Alt
      12 nfa not
      26 26 Ket
      29 End
      -----------------------------------------------
      Capturing subpattern count = 0
      First code unit = 'n'
      Last code unit = 'a'
      Subject length lower bound = 3
      data>
  36. 36. DFA EXAMPLE root@1cf6d9ffdc9b:/src/pcre2# ./pcre2test -dfa PCRE2 version 10.30-DEV 2017-03-05 re> /nfa|nfa not/auto_callout data>
  37. 37. DFA EXAMPLE
      data> nfa not
      --->nfa not
      +0 ^ n
      +4 ^ n
      +1 ^^ f
      +5 ^^ f
      +2 ^ ^ a
      +6 ^ ^ a
      +3 ^ ^ |
      +7 ^ ^
      +8 ^ ^ n
      +9 ^ ^ o
      +10 ^ ^ t
      +11 ^ ^
      0: nfa not
      1: nfa
      data> ^C
  38. 38. ANY QUESTIONS? You can find me at @delda80 github.com/delda info@davidedellerba.it
