Regex & Automata
Decoded
By Dildar Jakhro
Foundations
100 Regex Patterns
Descriptive Forms
Automata View
Practical Usage
Limits & Beyond
Foundations
01
Foundations: Alphabets, Strings & Languages
The building blocks of formal language theory.
Alphabet (Σ)
A finite, non-empty set of symbols. Ex: Σ
= {a, b}, Σ = {0, 1}.
String (w)
A finite sequence of symbols from an
alphabet. Ex: "abb", "101".
Language (L)
A set of strings. Can be finite or infinite.
Ex: L = {w | w ends with 'b'}.
Key Concepts & Notation
Empty String (ε): The unique string of length zero.
Length (|w|): The number of symbols in a string.
Concatenation: Joining two strings end-to-end.
Kleene Star (Σ*): The set of all possible strings over Σ, including ε.
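The notation above can be made concrete with a short sketch. The helper name `stringsUpTo` is an assumption introduced here for illustration; it lists the finite part of Σ* up to a chosen length and then filters it down to the example language of strings ending with 'b'.

```javascript
// Enumerate every string over `alphabet` with length <= maxLen,
// i.e. a finite slice of Σ*, shortest strings first.
function stringsUpTo(alphabet, maxLen) {
  let level = [""]; // "" plays the role of ε, the string of length zero
  const all = [""];
  for (let len = 1; len <= maxLen; len++) {
    const next = [];
    for (const w of level) {
      for (const symbol of alphabet) {
        next.push(w + symbol); // concatenation: join end-to-end
      }
    }
    all.push(...next);
    level = next;
  }
  return all;
}

const sigma = ["a", "b"];
const words = stringsUpTo(sigma, 2);
// The language L = { w | w ends with 'b' }, restricted to |w| <= 2.
const endsWithB = words.filter((w) => w.endsWith("b"));

console.log(words);     // [ '', 'a', 'b', 'aa', 'ab', 'ba', 'bb' ]
console.log(endsWithB); // [ 'b', 'ab', 'bb' ]
```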
The Power of Regular Languages
A language is regular if and only if some regular expression describes it. Regular languages form the foundation of text processing.
Chomsky Hierarchy
Type 0: Recursively Enumerable
Type 1: Context-Sensitive
Type 2: Context-Free
Type 3: Regular Languages
Closure Properties
Union (L₁ ∪ L₂)
Concatenation (L₁L₂)
Kleene Star (L*)
Complement (Lᶜ)
Intersection (L₁ ∩ L₂)
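Two of these closure properties can be sketched directly on DFAs. The object layout (`start`, `accepting`, `delta`) and the two sample automata are assumptions made for this illustration, and both DFAs are assumed to be total over the same alphabet {a, b}. Complement flips the accepting states; intersection uses the product construction, whose states are pairs.

```javascript
// DFA shape (assumed for this sketch):
// { start, accepting: Set of states, delta: { state: { symbol: nextState } } }
const evenLength = {
  start: "even",
  accepting: new Set(["even"]),
  delta: { even: { a: "odd", b: "odd" }, odd: { a: "even", b: "even" } },
};
const endsWithBDfa = {
  start: "no",
  accepting: new Set(["yes"]),
  delta: { no: { a: "no", b: "yes" }, yes: { a: "no", b: "yes" } },
};

function accepts(dfa, input) {
  let state = dfa.start;
  for (const symbol of input) {
    state = dfa.delta[state][symbol];
    if (state === undefined) return false; // symbol outside the alphabet
  }
  return dfa.accepting.has(state);
}

// Complement: same transitions, accepting and non-accepting states swapped.
function complement(dfa) {
  const states = Object.keys(dfa.delta);
  const accepting = new Set(states.filter((s) => !dfa.accepting.has(s)));
  return { start: dfa.start, accepting, delta: dfa.delta };
}

// Intersection: product construction, one state per pair "p|q".
function intersect(d1, d2) {
  const delta = {};
  const accepting = new Set();
  for (const p of Object.keys(d1.delta)) {
    for (const q of Object.keys(d2.delta)) {
      const name = `${p}|${q}`;
      delta[name] = {};
      for (const symbol of Object.keys(d1.delta[p])) {
        delta[name][symbol] = `${d1.delta[p][symbol]}|${d2.delta[q][symbol]}`;
      }
      if (d1.accepting.has(p) && d2.accepting.has(q)) accepting.add(name);
    }
  }
  return { start: `${d1.start}|${d2.start}`, accepting, delta };
}

const evenAndEndsWithB = intersect(evenLength, endsWithBDfa);
console.log(accepts(evenAndEndsWithB, "ab"));        // true
console.log(accepts(evenAndEndsWithB, "aab"));       // false (odd length)
console.log(accepts(complement(endsWithBDfa), "a")); // true
```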
100 Regex Patterns
02
Regex Patterns: Digits, Letters & Identifiers
Digits & Letters
[0-9] any digit
[a-zA-Z] any letter
\d digit
\w word char
\s whitespace
Negations
\D non-digit
\W non-word
\S non-space
[^aeiou] non-vowel
Quantifiers
x? optional
x* zero or more
x+ one or more
x{n,m} n to m
Anchors & Boundaries
^ start
$ end
\b word boundary
\B non-boundary
Cases & Concatenation
[A-Z] uppercase
[a-z] lowercase
. any char
xy concatenation
x|y union
Identifiers
[a-zA-Z0-9_]
Matches any character valid in an identifier.
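The character class above describes a single character. A minimal sketch of a whole-identifier check follows, assuming the common rule that an identifier may not start with a digit; the function name `isIdentifier` is introduced here for illustration. The anchors `^` and `$` force the pattern to cover the entire input.

```javascript
// First character: letter or underscore. Remaining characters: the
// identifier class from the slide, repeated zero or more times.
const IDENTIFIER = /^[a-zA-Z_][a-zA-Z0-9_]*$/;

function isIdentifier(text) {
  return IDENTIFIER.test(text);
}

console.log(isIdentifier("total_2")); // true
console.log(isIdentifier("2total"));  // false (starts with a digit)
console.log(isIdentifier(""));        // false (at least one character needed)
```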
Regex Patterns: Numbers, Ranges & Scientific Notation
Integers
[1-9][0-9]* // Positive
-?[0-9]+ // Signed
0[xX][0-9a-fA-F]+ // Hex
0[0-7]* // Octal
0b[01]+ // Binary
Floating-Point
[0-9]+\.[0-9]+ // Fixed
(\+|-)?([0-9]+\.?[0-9]*|\.[0-9]+) // Decimal
[0-9]+(e|E)(\+|-)?[0-9]+ // Scientific
Formatted Numbers
[0-9]{1,3}(,[0-9]{3})* // Thousands
Date & Time
(0?[1-9]|[12][0-9]|3[01]) // Day
(0?[1-9]|1[0-2]) // Month
(19|20)[0-9]{2} // Year
([01][0-9]|2[0-3]):[0-5][0-9] // HH:MM
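As a rough illustration of how the numeric patterns can be combined, the sketch below anchors several of them and reports the first one that matches a whole string. The rule order and the name `classifyNumber` are assumptions of this example, and literal dots and plus signs are escaped.

```javascript
// Ordered list: the first pattern that matches the whole input wins.
const NUMBER_PATTERNS = [
  ["hex", /^0[xX][0-9a-fA-F]+$/],
  ["binary", /^0b[01]+$/],
  ["scientific", /^[0-9]+(e|E)(\+|-)?[0-9]+$/],
  ["fixed", /^[0-9]+\.[0-9]+$/],
  ["octal", /^0[0-7]*$/],
  ["positive", /^[1-9][0-9]*$/],
];

function classifyNumber(text) {
  for (const [name, pattern] of NUMBER_PATTERNS) {
    if (pattern.test(text)) return name;
  }
  return null; // no pattern matched
}

console.log(classifyNumber("0x1F")); // hex
console.log(classifyNumber("3.14")); // fixed
console.log(classifyNumber("1e-9")); // scientific
console.log(classifyNumber("017"));  // octal
console.log(classifyNumber("42"));   // positive
```

Note that "0" on its own is reported as octal, because `0[0-7]*` also matches a single zero.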
Regex Patterns: Emails, URLs & Network Addresses
Email & URLs
[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
Email
https?://[^/\s]+
URL
(ftp|http|https)://[^/\s]+
Protocol URL
www\.[^/\s]+
Web Address
IP Addresses
\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b
IPv4
([0-9A-Fa-f]{1,4}:){7}[0-9A-Fa-f]{1,4}
IPv6
\b[A-Z]{2}\b
Country Code
\b\d{4}\b
Port
Query & Path
[?&][a-zA-Z_][0-9a-zA-Z_]*=
Query Key
/[^/\s]+
Path
File Types
\.(jpg|jpeg|png|gif)
Image
\.(pdf|doc|docx)
Document
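The IPv4 pattern checks only the shape of an address, so a string such as 999.1.1.1 still matches. One possible refinement, sketched below, keeps the regex for the shape and checks the numeric range of each octet in code. The name `isIPv4` and the choice to allow leading zeros are assumptions of this example.

```javascript
// Four groups of one to three digits separated by literal dots.
const IPV4_SHAPE = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/;

function isIPv4(text) {
  const match = IPV4_SHAPE.exec(text);
  if (!match) return false;
  // match[1..4] hold the captured octets; each must be at most 255.
  return match.slice(1).every((octet) => Number(octet) <= 255);
}

console.log(isIPv4("192.168.0.1")); // true
console.log(isIPv4("999.1.1.1"));   // false (octet out of range)
console.log(isIPv4("1.2.3"));       // false (only three octets)
```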
Regex Patterns: Text, Markup & Code
Markdown & HTML
\*\*[^*]+\*\*
Bold (**text**)
`[^`]+`
Inline Code
^#+\s+
Heading
\[([^\]]+)\]\(([^)]+)\)
Link
<[^>]+>
HTML Tag
Programming
#.*$
Python Comment
//.*$
C++ Comment
[a-z]+(_[a-z]+)*
snake_case
\b[A-Z][a-z]+[A-Z][a-z]*
PascalCase (upper camel case)
Dates & Days
\b(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun)\b
Weekday
\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\b
Month
\d{4}-\d{2}-\d{2}
Date (yyyy-mm-dd)
Logs & CS Terms
ERROR|WARN|INFO
Log Level
\b(stack|heap|queue)\b
CS Terms
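Markup patterns are typically used for extraction rather than validation. The sketch below applies a Markdown link pattern of the form `[text](url)` with the global flag and `String.prototype.matchAll`; the function name `extractLinks` and the sample text are assumptions of this example.

```javascript
// Group 1 captures the link text, group 2 captures the target.
const MARKDOWN_LINK = /\[([^\]]+)\]\(([^)]+)\)/g;

function extractLinks(markdown) {
  // matchAll yields one match object per occurrence.
  return Array.from(markdown.matchAll(MARKDOWN_LINK), (m) => ({
    text: m[1],
    url: m[2],
  }));
}

const links = extractLinks("See [docs](https://example.com) and [home](/index).");
console.log(links.length);  // 2
console.log(links[0].text); // docs
console.log(links[1].url);  // /index
```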
Descriptive Forms
03
Descriptive Expressions (1/5)
Translating English into formal patterns.
1. All strings over {a,b} that end with the symbol 'b'.
2. Strings that contain an even number of 'a's.
3. Strings that do not contain the substring "ab".
4. Strings where every 'b' is immediately followed by an 'a'.
5. Strings that contain at least three 'a's.
6. Strings whose length is a multiple of 3.
7. Strings that contain the substring "aaa".
8. Strings with an equal number of 'a's and 'b's.
9. Strings that start and end with the same symbol.
10. Strings with no consecutive identical letters.
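A few of these descriptions translate directly into patterns over {a, b}. The regexes below are one possible translation each, not the only one, and the object name `DESCRIPTIONS` is introduced for this sketch. Description 8 (equal numbers of 'a's and 'b's) is left out on purpose, since that language is not regular.

```javascript
const DESCRIPTIONS = {
  endsWithB: /^[ab]*b$/,                       // 1: ends with 'b'
  noAb: /^b*a*$/,                              // 3: no "ab", so all b's come first
  atLeastThreeAs: /^[ab]*a[ab]*a[ab]*a[ab]*$/, // 5: at least three 'a's
  lengthMultipleOf3: /^([ab]{3})*$/,           // 6: length is 0, 3, 6, ...
  containsAaa: /^[ab]*aaa[ab]*$/,              // 7: contains "aaa"
};

console.log(DESCRIPTIONS.endsWithB.test("aab"));          // true
console.log(DESCRIPTIONS.noAb.test("bbaa"));              // true
console.log(DESCRIPTIONS.noAb.test("aba"));               // false
console.log(DESCRIPTIONS.lengthMultipleOf3.test("abab")); // false
```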
Descriptive Expressions (3/5)
Exploring patterns across different alphabets and structures.
21. All decimal strings without leading zeros.
22. Strings over {a,b,c} where all 'a's appear before all 'b's.
23. Strings that contain at least one double letter (e.g., "aa").
24. Strings where no letter is repeated.
25. Strings that form a palindrome (read the same forwards and backwards).
26. Strings composed of vowels only.
27. Strings composed of consonants only.
28. Strings where the length equals the number of vowels.
29. Strings that contain exactly two words (separated by spaces).
30. Strings of balanced parentheses over { (, ) }.
Descriptive Expressions (2/5)
Focusing on binary patterns and numerical properties.
11. Binary strings representing numbers divisible by 3.
12. Binary strings that do not contain "00" as a substring.
13. Strings with an odd number of 0s and an even number of 1s.
14. Strings whose third symbol from the start is a '1'.
15. Strings that have a '0' in every odd position.
16. Strings that end with "11".
17. Strings that contain "010" as a substring.
18. Strings where 0s and 1s alternate (e.g., 0101...).
19. Strings with more 0s than 1s.
20. Strings where every 0 is immediately followed by a 1.
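Description 11 is a good example of a language that looks arithmetic but is regular. A sketch under the usual approach: keep the remainder modulo 3 as the state, and update it as (2 × remainder + bit) mod 3 for each bit read. The function name `isDivisibleBy3` is an assumption of this example, and the regex is the expression obtained from that three-state automaton.

```javascript
// DFA view: the state is the value read so far, modulo 3.
function isDivisibleBy3(bits) {
  if (!/^[01]+$/.test(bits)) return false; // not a binary string
  let remainder = 0;
  for (const bit of bits) {
    // Appending a bit doubles the value and adds the bit.
    remainder = (remainder * 2 + Number(bit)) % 3;
  }
  return remainder === 0;
}

// Regex view of the same automaton: loops that leave remainder 0
// and return to it are "0" and "1 (0 1* 0)* 1".
const DIVISIBLE_BY_3 = /^(0|1(01*0)*1)+$/;

console.log(isDivisibleBy3("110"));       // true  (6)
console.log(isDivisibleBy3("111"));       // false (7)
console.log(DIVISIBLE_BY_3.test("1001")); // true  (9)
```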
Descriptive Expressions (4/5)
Defining patterns over a custom {x, y} alphabet.
31. Strings over {x,y} where every 'y' occurs in pairs.
32. Strings that contain the substring "xyx".
33. Strings where the number of 'x's is divisible by 4.
34. Strings that end with "xy".
35. Strings that do not contain "yy" as a substring.
36. Strings where "xy" appears exactly twice.
37. Strings with alternating 'x' and 'y' (e.g., xyxy...).
38. Strings where every 'x' occurs before any 'y'.
39. Strings with an equal number of runs of 'x's and 'y's.
40. Strings with no 'x' in an even position.
Descriptive Expressions (5/5)
Specifying real-world data formats in plain English.
41. All strings that look like an email (local@domain).
42. URLs that start with an optional "https://".
43. Floating-point numbers with an optional sign.
44. Dates in the format "yyyy-mm-dd".
45. Time strings in the format "hh:mm:ss".
46. MAC addresses with colon-separated hex pairs.
47. Credit-card numbers in four groups of four digits.
48. Hashtags starting with '#' followed by alphanumeric chars.
49. File paths that end with a ".txt" extension.
50. Strings with exactly one space between words.
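Several of these formats can be sketched as anchored patterns. The specific choices below are assumptions of this example rather than fixed definitions: a 24-hour clock for times, a single space as the separator between card groups, and the helper name `matchesFormat`.

```javascript
const FORMATS = {
  isoDate: /^\d{4}-\d{2}-\d{2}$/,                  // 44: yyyy-mm-dd (shape only)
  time: /^([01]\d|2[0-3]):[0-5]\d:[0-5]\d$/,       // 45: hh:mm:ss, 24-hour
  mac: /^[0-9A-Fa-f]{2}(:[0-9A-Fa-f]{2}){5}$/,     // 46: six hex pairs
  cardGroups: /^\d{4}( \d{4}){3}$/,                // 47: four groups of four
  hashtag: /^#[A-Za-z0-9]+$/,                      // 48: '#' then alphanumerics
  txtPath: /^.+\.txt$/,                            // 49: ends with ".txt"
};

function matchesFormat(name, text) {
  const pattern = FORMATS[name];
  if (!pattern) throw new Error(`Unknown format: ${name}`);
  return pattern.test(text);
}

console.log(matchesFormat("time", "23:59:59"));         // true
console.log(matchesFormat("time", "24:00:00"));         // false
console.log(matchesFormat("mac", "00:1A:2b:3C:4d:5E")); // true
```

The date pattern only checks the shape, so a string like 2025-13-40 still matches; range checks would need extra alternation or code.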
Automata View
04
The Trinity: Regex, NFA & DFA
Kleene's Theorem states that Regular Expressions, NFAs, and DFAs have the same
expressive power. They all describe the same class of languages: the Regular Languages.
Regex
≡
NFA
≡
DFA
This equivalence allows us to convert freely between these representations for design, verification, and optimization.
From English Description to DFA
A systematic workflow for translating informal specs into formal automata.
1. Parse Spec
Identify atomic conditions
and memory needs.
2. Design States
Assign states to track
progress and memory.
3. Add Transitions
Define transitions for each
symbol in each state.
4. Minimize
Use table-filling to obtain
the minimal DFA (fewest states).
Example: "Even number of a's" results in a 2-state DFA (tracking even/odd count).
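The even/odd example can be written out as a transition table and compared against an equivalent regular expression. The state names, the function name `hasEvenAs`, and the particular regex are assumptions chosen for this sketch; other equivalent regexes exist.

```javascript
// Two states track the parity of the number of 'a's seen so far.
const TRANSITIONS = {
  even: { a: "odd", b: "even" },
  odd: { a: "even", b: "odd" },
};

function hasEvenAs(input) {
  let state = "even"; // zero 'a's so far, which is even
  for (const symbol of input) {
    const next = TRANSITIONS[state][symbol];
    if (next === undefined) return false; // symbol outside {a, b}
    state = next;
  }
  return state === "even"; // "even" is the only accepting state
}

// An equivalent regex: 'a's come in pairs, with any number of 'b's around them.
const EVEN_AS_REGEX = /^b*(ab*ab*)*$/;

console.log(hasEvenAs("abba"));          // true  (two 'a's)
console.log(hasEvenAs("ab"));            // false (one 'a')
console.log(EVEN_AS_REGEX.test("abba")); // true
```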
Practical Usage
05
Practical Use: Tokenizing with Regex
How lexical analyzers (lexers) use regex to break source code into meaningful tokens.
Source Code
if (x > 10)
Regex Patterns
if|while|for
[a-zA-Z_][a-zA-Z0-9_]*
[0-9]+
Tokens
(KEYWORD, "if")
(IDENTIFIER, "x")
(NUMBER, "10")
The patterns are compiled into a single NFA (Thompson's construction) and usually converted to a DFA, so all patterns are tried in parallel and the scan runs in O(n) time.
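A minimal tokenizer for the example above might look like the following. It is a sketch only: it uses JavaScript's built-in backtracking engine with the sticky flag `y` rather than a compiled DFA, and the WHITESPACE and SYMBOL rules, the rule order, and the name `tokenize` are assumptions added so the sample input can be scanned completely.

```javascript
// Rules are tried in order at the current position; keywords come before
// identifiers so that "if" is not reported as an identifier.
const TOKEN_RULES = [
  ["WHITESPACE", /\s+/y],
  ["KEYWORD", /(?:if|while|for)\b/y],
  ["IDENTIFIER", /[a-zA-Z_][a-zA-Z0-9_]*/y],
  ["NUMBER", /[0-9]+/y],
  ["SYMBOL", /[()<>]/y],
];

function tokenize(source) {
  const tokens = [];
  let position = 0;
  while (position < source.length) {
    let matched = false;
    for (const [type, pattern] of TOKEN_RULES) {
      pattern.lastIndex = position; // sticky: match must start exactly here
      const match = pattern.exec(source);
      if (match) {
        if (type !== "WHITESPACE") tokens.push([type, match[0]]);
        position = pattern.lastIndex; // exec advanced it past the match
        matched = true;
        break;
      }
    }
    if (!matched) {
      throw new SyntaxError(`Unexpected character at ${position}: ${source[position]}`);
    }
  }
  return tokens;
}

console.log(tokenize("if (x > 10)"));
// [ ['KEYWORD','if'], ['SYMBOL','('], ['IDENTIFIER','x'],
//   ['SYMBOL','>'], ['NUMBER','10'], ['SYMBOL',')'] ]
```

The `\b` after the keyword alternatives keeps a name such as `iffy` from being split into a keyword and an identifier.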
Practical Use: Validation & Extraction
Core techniques for applying regex in real-world scenarios.
Validation
Ensure input conforms to an expected pattern to reject
malformed data.
if (/^\d{5}$/.test(zipCode)) { /* ... */ }
Extraction
Use capturing groups to isolate and retrieve specific parts of
a string.
const match = text.match(/(\d{3})-(\d{4})/);
Choose between greedy and lazy quantifiers deliberately (e.g., `.*` vs `.*?`): they change how much text is matched, and unbounded wildcards of either kind can still cause heavy backtracking.
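The two techniques, plus the greedy versus lazy distinction, can be sketched together. The function names, the returned field names `prefix` and `line`, and the sample HTML string are assumptions of this example.

```javascript
// Validation: anchors make the pattern cover the whole input.
function isZipCode(text) {
  return /^\d{5}$/.test(text);
}

// Extraction: capturing groups are available as match[1], match[2], ...
function extractPhoneParts(text) {
  const match = text.match(/(\d{3})-(\d{4})/);
  return match ? { prefix: match[1], line: match[2] } : null;
}

// Greedy `.*` takes as much as possible; lazy `.*?` takes as little as possible.
const html = "<b>one</b><b>two</b>";
const greedy = html.match(/<b>.*<\/b>/)[0];
const lazy = html.match(/<b>.*?<\/b>/)[0];

console.log(isZipCode("90210"));                  // true
console.log(extractPhoneParts("Call 555-0199.")); // { prefix: '555', line: '0199' }
console.log(greedy);                              // <b>one</b><b>two</b>
console.log(lazy);                                // <b>one</b>
```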
Limits & Beyond
06
The Limits: Non-Regular Languages
Not all languages are regular. The Pumping Lemma is a tool to prove this.
The Pumping Lemma
For any regular language L, there exists a "pumping length"
p such that any string w in L with |w| ≥ p can be divided
into three parts, w = xyz, satisfying:
|xy| ≤ p
|y| ≥ 1
xyⁱz ∈ L for all i ≥ 0
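Example: L = {aⁿbⁿ | n ≥ 0}
Choose w = aᵖbᵖ. Since |xy| ≤ p, y consists only of a's, so xy²z has more a's than b's and is not in L. No split satisfies the lemma, so L is not regular.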
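The argument can be checked mechanically for one concrete value of p. The sketch below tries every split w = xyz allowed by the lemma and looks for one that survives pumping; the names `inL` and `findPumpableSplit`, the choice p = 5, and the range i = 0..3 are assumptions of this illustration. It demonstrates the proof idea for a fixed p and is not itself a proof.

```javascript
// Membership test for L = { a^n b^n | n >= 0 }.
function inL(w) {
  const match = /^(a*)(b*)$/.exec(w);
  return match !== null && match[1].length === match[2].length;
}

// Try every split with |xy| <= p and |y| >= 1. Return a split whose pumped
// strings x y^i z all stay in L for i = 0..3, or null if none exists.
function findPumpableSplit(w, p) {
  for (let xLen = 0; xLen < p; xLen++) {
    for (let yLen = 1; xLen + yLen <= p; yLen++) {
      const x = w.slice(0, xLen);
      const y = w.slice(xLen, xLen + yLen);
      const z = w.slice(xLen + yLen);
      const survives = [0, 1, 2, 3].every((i) => inL(x + y.repeat(i) + z));
      if (survives) return { x, y, z };
    }
  }
  return null;
}

const p = 5;
const w = "a".repeat(p) + "b".repeat(p);
console.log(inL(w));                  // true
console.log(findPumpableSplit(w, p)); // null: every allowed split breaks under pumping
```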
Example: Palindromes
Also non-regular, requiring a more powerful model (Context-Free Grammar).
Performance Traps & How to Avoid Them
Regex can be powerful, but poorly written patterns can lead to catastrophic backtracking.
The Trap: Exponential Backtracking
Nested quantifiers like `(a+)+b` on a long string of 'a's can
cause the engine to explore an exponential number of
paths.
/(a+)+b/.test('a'.repeat(40)) // Can freeze!
The Solution: Linear Time
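Use possessive quantifiers (e.g., `a++`) where the engine supports them.
Use atomic groups (e.g., `(?>...)`) where the engine supports them.
Remove nested quantifiers (e.g., `a+b` instead of `(a+)+b`); lazy quantifiers (e.g., `.*?`) alone do not prevent backtracking.
Use a DFA-based regex engine when possible.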
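One fix that needs no special engine support is to rewrite the pattern so that there is only one way to match a run of 'a's. The sketch below assumes the flat pattern `a+b` as the replacement for `(a+)+b` and checks on short inputs that both give the same verdict; the names `NESTED`, `FLAT`, and `sameVerdict` are introduced for this example. Only short strings are used, so the nested pattern stays cheap to run.

```javascript
const NESTED = /(a+)+b/; // many ways to split a run of 'a's between iterations
const FLAT = /a+b/;      // one way to match the same run

// Both patterns accept exactly the inputs that contain an 'a' run followed by 'b'.
function sameVerdict(input) {
  return NESTED.test(input) === FLAT.test(input);
}

for (const sample of ["", "b", "ab", "aaab", "aaaa", "baab", "abba"]) {
  console.log(JSON.stringify(sample), FLAT.test(sample), sameVerdict(sample));
}
```

Because the two patterns accept the same strings, the flat form can replace the nested one wherever only the overall match result is needed, though the captured group is lost.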
Key Takeaways & Roadmap
100 Patterns
A library of common regex for
digits, text, networks, and code.
50 Descriptions
Practice translating English specs
into formal patterns.
Automata Theory
The foundation for verification,
optimization, and understanding
limits.
Disciplined Usage
Balance power with performance
to avoid common traps.
From here, explore Context-Free Languages and modern parsing algorithms!
By Dildar Jakhro
2025/10/07

Master 100 Regex Patterns & Automata Theory in One Compact Slide Deck
