The document discusses regular expressions (regex or regexp). It begins with definitions of regex from a mathematical and programming perspective. It then covers topics like Unicode support, nested parentheses, modifiers, and libraries like Regexp::Common that provide common regex patterns.
2. Table of Contents
• regexp? what is it?
• $supported_by ~~ @most_major_languages;
• but how (much)??
• Unicode support?
• assertions?
• modifiers?
• Irregular expressions
• qr{([A-Za-z_]\w*\s*(\(((?:(?>[^()]+)|(?2))*)\)))}
• use CPAN;
• Regexp::Assemble;
• Regexp::Common;
• (ir)?regular questions (?:from|by) the audience
3. regexp? what is it?
Mathematically speaking[*]
• The empty language Ø is a regular language.
• For each a ∈ Σ (a belongs to Σ), the singleton language {a} is a regular language.
• If A is a regular language, A* (Kleene star) is a regular language. Due to this, the empty string language {ε} is also regular.
• If A and B are regular languages, then A ∪ B (union) and A • B (concatenation) are regular languages.
• No other languages over Σ are regular.
4. regexp? what is it?
In our language
• 0 or more of… (quantifier)
• '' # empty string
• 'string' # any string
• '(?:string|文字列)' # any alternation of strings
• That's it!
• ? # {0,1}
• + # {1,}
• [0-9] # (?:0|1|2|3|4|5|6|7|8|9)
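The claim that `?`, `+`, and `[0-9]` are only shorthand can be checked mechanically. The following is a minimal sketch in JavaScript; the sample strings and the helper name `agree` are made up for illustration and are not part of the talk.

```javascript
// Each pair: [shorthand pattern, spelled-out equivalent]
const pairs = [
  [/^a?$/, /^a{0,1}$/],
  [/^a+$/, /^a{1,}$/],
  [/^a*$/, /^a{0,}$/],
  [/^[0-9]$/, /^(?:0|1|2|3|4|5|6|7|8|9)$/],
];

// Arbitrary sample inputs (an assumption, not an exhaustive proof)
const samples = ["", "a", "aa", "aaa", "0", "7", "42", "x"];

// True when both patterns accept and reject exactly the same samples
function agree(shorthand, spelledOut) {
  return samples.every((s) => shorthand.test(s) === spelledOut.test(s));
}

const allAgree = pairs.every(([a, b]) => agree(a, b));
console.log(allAgree);
```

Agreement on a handful of samples is evidence rather than proof, but it shows that the shorthand forms add convenience, not expressive power.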
5. regexp? what is it?
((?:ir)?reg(?:ular )?exp(?:ressions?)?)
Visualized by: regexper.com
6. regexp? what is it?
(?:[\x00-\x7F]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|
\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})
Excerpt from: https://www.w3.org/International/questions/qa-forms-utf-8
Visualized by: regexper.com
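A pattern like this describes one well-formed UTF-8 byte sequence, so anchoring and repeating it gives a whole-input validity check. The sketch below assumes Node.js (for `Buffer`) and decodes the bytes as latin1 so that each byte becomes one code unit of the same value; the function name `isWellFormedUtf8` is hypothetical.

```javascript
// One well-formed UTF-8 byte sequence, written over byte values 0x00-0xFF
const utf8Char = /(?:[\x00-\x7F]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})/;

// Anchor and repeat it so the entire input must be well formed
const utf8String = new RegExp(`^${utf8Char.source}*$`);

// Hypothetical helper: accepts a Buffer or an array of byte values
function isWellFormedUtf8(bytes) {
  // latin1 maps every byte to the code unit with the same numeric value
  const asCodeUnits = Buffer.from(bytes).toString("latin1");
  return utf8String.test(asCodeUnits);
}

console.log(isWellFormedUtf8(Buffer.from("🦏文字列", "utf8"))); // true
console.log(isWellFormedUtf8([0xc0, 0x80])); // false: overlong encoding
console.log(isWellFormedUtf8([0xed, 0xa0, 0x80])); // false: encoded surrogate
```

Because every alternative starts with a distinct lead-byte range, the match never needs to backtrack between alternatives.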
7. regexp? what is it?
(?:[+-]?)(?:0x[0-9a-fA-F]+(?:\.[0-9a-fA-F]+)?(?:[pP][+-]?[0-9]+)|(?:[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?|0(?:\.0+|(?:\.0+)?(?:[eE][+-]?[0-9]+))|(?:[Nn]a[Nn]|[Ii]nf(?:inity)?))
Excerpt from: https://github.com/dankogai/js-sion/blob/main/sion.ts
Visualized by: regexper.com
9. Irregular expressions
/^(11+?)\1+$/ # is this a regular expression?
$ seq 2 100 | perl -nlE 'say $_ if (1x$_) !~ /^(11+?)\1+$/'
2
3
5
7
…
79
83
89
97
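The same unary primality trick carries over to JavaScript, whose regexps also support lazy quantifiers and backreferences. This is a sketch of the idea rather than a practical primality test; the names `composite` and `isPrime` are illustrative.

```javascript
// n is composite exactly when "1" repeated n times can be cut into
// two or more equal chunks of length >= 2: group 1 is the chunk,
// \1+ demands at least one more copy of it.
const composite = /^(11+?)\1+$/;

function isPrime(n) {
  return n >= 2 && !composite.test("1".repeat(n));
}

const primes = [];
for (let n = 2; n <= 100; n++) {
  if (isPrime(n)) primes.push(n);
}
console.log(primes.join(" "));
```

The work is done entirely by backtracking: the engine tries chunk lengths 2, 3, 4, … until one divides n or none is left.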
10. Irregular expressions
/^(11+?)\1+$/ # is NOT EXACTLY a regular expression!
• The problem is \1
• It is the result of the preceding capture
• In other words, this expression is self-modifying.
• So it is not mathematically a regular expression
• Regexp ≠ Regular Expression
• Regexp ⊇ Regular Expression
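A smaller illustration of the same point, sketched in JavaScript: a backreference lets a regexp accept "doubled" strings ww, a language that no regular expression in the mathematical sense can describe. The pattern name `doubled` is illustrative.

```javascript
// \1 must repeat exactly what group 1 captured on this attempt,
// so the pattern accepts strings of the form ww.
const doubled = /^(.+)\1$/;

console.log(doubled.test("abcabc")); // true:  w = "abc"
console.log(doubled.test("abcabd")); // false: second half differs
console.log(doubled.test("aa"));     // true:  w = "a"
```

What `\1` matches is decided only at match time, which is the sense in which the expression is self-modifying.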
14. Unicode Support
What is a character?
• String is /.*/ but . =
• [\x00-\xff] # legacy world of bytes
• [\u0000-\uFFFF] # prematurely modern
• [\u{0000}-\u{10FFFF}] # correctly modern
15. Unicode Support
What is a character?
• String is /.*/ but . =
• [\x00-\xff] # Perl < 5.7
• [\u0000-\uFFFF] # Java(Script)?, Python2, …
• [\u{0000}-\u{10FFFF}] # Perl, Ruby, Python3, …
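The three notions of "character" give three different lengths for the same string. A minimal sketch, assuming Node.js (for `Buffer`):

```javascript
const s = "🦏"; // U+1F98F, outside the Basic Multilingual Plane

const bytes = Buffer.byteLength(s, "utf8"); // UTF-8 bytes
const codeUnits = s.length;                 // UTF-16 code units
const codePoints = [...s].length;           // Unicode code points

console.log(bytes, codeUnits, codePoints); // 4 2 1

// "." follows the same split: one code unit by default,
// one code point with the u flag.
console.log(/^.$/.test(s));  // false
console.log(/^.$/u.test(s)); // true
```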
16. Unicode Support?
What will the following say?
$ perl -Mutf8 -MData::Dumper -E \
'my@m=("🦏🐪🐘🐍💎⚙" =~ /(.)/g); say Dumper([@m])'
17. Unicode Support?
What will the following say?
$ perl -Mutf8 -MData::Dumper -E \
'my@m=("🦏🐪🐘🐍💎⚙" =~ /(.)/g); say Dumper([@m])'
$VAR1 = [
"\x{1f98f}",
"\x{1f42a}",
"\x{1f418}",
"\x{1f40d}",
"\x{1f48e}",
"\x{2699}"
];
19. Unicode Support?
What will the following say?
$ node -e \
'console.log("🦏🐪🐘🐍💎⚙".match(/(.)/g))'
[
'\ud83e', '\udd8f', '\ud83d', '\udc2a',
'\ud83d', '\udc18', '\ud83d', '\udc0d',
'\ud83d', '\udc8e', '⚙'
]
20. Unicode Support?
What will the following say?
$ node -e \
'console.log("🦏🐪🐘🐍💎⚙".match(/(.)/ug))'
[ '🦏', '🐪', '🐘', '🐍', '💎', '⚙' ]
21. Unicode Support?
What will the following say?
$ perl -Mutf8 -MData::Dumper -E \
'my@m=("🇯🇵🇺🇦" =~ /(.)/g); say Dumper([@m])'
22. Unicode Support?
What will the following say?
$ perl -Mutf8 -MData::Dumper -E \
'my@m=("🇯🇵🇺🇦" =~ /(.)/g); say Dumper([@m])'
$VAR1 = [
"\x{1f1ef}",
"\x{1f1f5}",
"\x{1f1fa}",
"\x{1f1e6}"
];
23. Unicode Support?
What will the following say?
$ perl -Mutf8 -MData::Dumper -E \
'my@m=("🇯🇵🇺🇦" =~ /(.)/g); say Dumper([@m])'
$VAR1 = [
"\x{1f1ef}", # REGIONAL INDICATOR SYMBOL LETTER J
"\x{1f1f5}", # REGIONAL INDICATOR SYMBOL LETTER P
"\x{1f1fa}", # REGIONAL INDICATOR SYMBOL LETTER U
"\x{1f1e6}" # REGIONAL INDICATOR SYMBOL LETTER A
];
24. Unicode Support?
What will the following say?
$ perl -Mutf8 -MData::Dumper -E \
'my@m=("🇯🇵🇺🇦" =~ /(\X)/g); say Dumper([@m])'
$VAR1 = [
"\x{1f1ef}\x{1f1f5}",
"\x{1f1fa}\x{1f1e6}"
];
26. Unicode Support?
What will the following say?
$ node -e \
'console.log("🇯🇵🇺🇦".match(/(.)/ug))'
[ '🇯', '🇵', '🇺', '🇦' ]
27. Unicode Support?
What will the following say?
$ node -e \
'console.log("🇯🇵🇺🇦".match(/(\X)/ug))'
🙅 [ '🇯🇵','🇺🇦' ]
🙆 SyntaxError: Invalid regular expression: /(\X)/: Invalid escape
at [eval]:1:24
at Script.runInThisContext (node:vm:129:12)
at Object.runInThisContext (node:vm:305:38)
at node:internal/process/execution:75:19
at [eval]-wrapper:6:22
at evalScript (node:internal/process/execution:74:60)
at node:internal/main/eval_string:27:3
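Since JavaScript regexps reject `\X`, grapheme clusters have to come from somewhere else. One possible route, not shown in the talk, is `Intl.Segmenter`; this sketch assumes a runtime that ships it with ICU data (recent Node.js releases do), and the helper name `graphemes` is hypothetical.

```javascript
// Split on extended grapheme cluster boundaries instead of code points.
// The "en" locale is an arbitrary choice for this example.
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });

function graphemes(text) {
  // segment() returns an iterable of { segment, index, input } records
  return Array.from(segmenter.segment(text), (part) => part.segment);
}

console.log(graphemes("🇯🇵🇺🇦")); // [ '🇯🇵', '🇺🇦' ]
```

Each flag is a pair of regional indicator code points, which the segmenter keeps together, matching what Perl's `\X` returns.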