2. What are we talking about?
The Regular expression Denial of Service (ReDoS) is a Denial of Service attack, that
exploits the fact that most Regular Expression implementations may reach extreme
situations that cause them to work very slowly (exponentially related to input size). An
attacker can then cause a program using a Regular Expression to enter these extreme
situations and then hang for a very long time.
- OWASP Definition
4. Regex Interpreter in a Nutshell
Regular expression engines are implemented as finite state machines (FSM).
The pattern you supply (the Regular Expression) is compiled into a data structure that
represents this state machine.
When you match a string against this pattern, the regex engine takes each character and
decides the state transition within the FSM. If there are no valid state transitions for an
input character the match fails.
One of the states in the FSM is a terminating/end state. If the regex engine gets there it
reports success.
Regex engines can arrange the state machines in two different ways: DFA and NFA.
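The FSM idea can be sketched with a hand-built DFA for the simple pattern ab*c (a minimal illustration, not how a real engine compiles patterns; the transition table and state numbering are our own):

```python
# Hand-built DFA (transition table) for the pattern ab*c.
# States: 0 = start, 1 = after 'a' (b* loops here), 2 = terminating/end state.
TRANSITIONS = {
    (0, 'a'): 1,
    (1, 'b'): 1,   # b* loops on state 1
    (1, 'c'): 2,
}

def dfa_match(s: str) -> bool:
    state = 0
    for ch in s:
        state = TRANSITIONS.get((state, ch))
        if state is None:   # no valid state transition: the match fails
            return False
    return state == 2       # success only if we reached the end state
```

For example, dfa_match('abbc') reaches the end state, while dfa_match('abx') fails on the first invalid transition.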
5. Regex Engines - DFA vs NFA
Deterministic Finite Automaton (DFA)
Called deterministic because it can always choose the next state for a given input
character; if it cannot go to a valid state, the input string does not match the regex
pattern.
Non-Deterministic Finite Automaton (NFA)
Called non-deterministic because there are cases where the regex engine has to guess
which state to go to next. If it guesses wrong, it has to go back to a previous state and
try a different transition.
This process is called backtracking.
The NFA will have to try all possible routes through the state machine until it finds the
terminating state, the possible routes are exhausted, or there are no more input
characters.
7. Which one do we choose? And why NFA?
DFA regex engines don't need to backtrack, and they are faster.
NFA regex engines do need to backtrack, and it is possible to structure your pattern in
such a way that the backtracking will cause nearly infinite loops on certain input
sequences. On the other side, this design allows for features such as capture groups.
For this reason, what you see in practice is that most modern languages implement their
regex engines as NFAs.
8. Did you notice the Security Problem?
if you didn’t, please read again slide 5,7 with a focus on red enlightened parts
In some cases a not so well structured regular Expression can degenerate in a sort of
infinite loop NFA because of its naïve nature:
“The algorithm tries one by one all the possible paths (if needed) until a match is found (or
all the paths are tried and fail). ”
The Backtracking Problem
An attacker might use the above knowledge to look for applications that use Regular
Expressions containing an Evil Regex, and send well-crafted input that will hang
the system.
10. Quantifiers
To specify the number of times a token should be matched by the regex engine, you can
choose one of the following quantifiers:
? — match the token zero times (not at all) or exactly once
* — match the token zero or more times
+ — match the token one or more times
{m,n} — match the token between m and n times (both inclusive), where m and n are
natural numbers and n ≥ m.
By default, quantifiers are greedy: they tell the engine to match as many instances of the
quantified token or subpattern as possible.
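These quantifiers can be exercised directly with Python's re module (the example strings are our own):

```python
import re

# ? : zero or one occurrence
assert re.fullmatch(r'colou?r', 'color')
assert re.fullmatch(r'colou?r', 'colour')
# * : zero or more occurrences
assert re.fullmatch(r'ab*', 'a')
assert re.fullmatch(r'ab*', 'abbb')
# + : one or more occurrences
assert not re.fullmatch(r'ab+', 'a')
# {m,n} : between m and n occurrences, inclusive
assert re.fullmatch(r'a{2,3}', 'aaa')
assert not re.fullmatch(r'a{2,3}', 'aaaa')
# Greedy by default: (.*) grabs as much as it can, leaving only the last 'b'
assert re.match(r'(.*)b', 'abab').group(1) == 'aba'
```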
11. Why are Greedy Quantifiers so Heavy?
A greedy quantifier will try to match as much as it possibly can. Every time the engine
greedily consumes one more input token, it has to remember that it made that choice. It
therefore persists its current state and stores it so it can come back to it later in the
backtracking process. When the regular expression engine backtracks, it performs another
match attempt at a different position in the pattern.
Storing this backtracking position doesn’t come for free, and neither does the actual
backtracking process.
12. Why are Nested Greedy Quantifiers so Evil? - Part 1
Examples of Evil Patterns:
● (a+)+
● ([a-zA-Z]+)*
● (a|aa)+
● (a|a?)+
● (.*a){x}, for x > 10
These patterns have exponential complexity, O(2^n).
This occurs because each quantifier adds a layer of alternative steps to the paths that the NFA has to try before it
can tell with certainty that there is no match (a fail situation).
Whenever you see that a quantifier applies to a token that is already quantified, there is potential for the number of
steps to explode.
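The explosion can be made concrete with a toy backtracking matcher for the evil pattern (a+)+b, counting every chunk the engine tries (our own sketch of how an NFA engine behaves, not the internals of any real library). On a string of n a's with no b, the count comes out to 2^n − 1: each extra character doubles the work.

```python
def evil_match(s: str) -> tuple[bool, int]:
    """Backtracking match of (a+)+b against s, counting attempted steps."""
    steps = 0
    n = len(s)

    def group_plus(i: int) -> bool:
        """Match one or more (a+) groups starting at i, then a literal 'b'."""
        nonlocal steps
        run = i
        while run < n and s[run] == 'a':   # longest run of a's available
            run += 1
        for j in range(run, i, -1):        # greedy: try the longest chunk first
            steps += 1
            if j < n and s[j] == 'b':      # close the group and match the 'b'
                return True
            if group_plus(j):              # or start another (a+) iteration
                return True
        return False                       # all paths exhausted: no match

    return group_plus(0), steps
```

evil_match('aaab') succeeds in a single step, while evil_match('a' * 20) takes 2^20 − 1 = 1,048,575 steps just to report failure.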
13. Why are Nested Greedy Quantifiers so Evil? - Part 2
Example of Evil Regex: /(x+x+)+y/
The above regex turns ugly when the “y” is missing from the subject string.
For an input of just 21 x's, the debugger bows out at 2.8 million steps, diagnosing a bad case
of catastrophic backtracking, just to find out that there is no “y”.
15. A simple WAF Rule deployment
On July 2, 2019, Cloudflare deployed a new rule in the
WAF that caused CPUs to become exhausted.
The update contained a regular expression that
backtracked enormously and exhausted every
CPU core that handles HTTP/HTTPS traffic on
the Cloudflare network worldwide; this brought
down Cloudflare’s core proxying, CDN and WAF
functionality.
(Graph: CPUs dedicated to serving HTTP/HTTPS traffic spiking to nearly 100% usage
across the servers in Cloudflare’s network.)
16. The regular expression that was at the heart of the outage:
/(?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol
|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))/
The rule causing the outage was targeting Cross-site
scripting (XSS) attacks.
The critical part is the highlighted .*(?:.*=.*). The (?: and matching ) form a non-capturing group
(the parser uses it to match the text, but ignores it later in the final result).
For the purposes of the discussion of why this pattern causes CPU exhaustion we can safely
ignore it and treat the pattern as .*.*=.*.
17. “Any real-world expression that asks the engine to ‘match anything followed by anything’ can
lead to catastrophic backtracking.”
Needed Steps:
The indicted regex takes 23 steps to match the string
“x=x”.
With 20 x’s after the = the engine takes 555
steps to match. That’s not linear.
Worst Case:
If the x= was missing, so the string was just
20 x’s, the engine would take 4,067 steps to
find out that the pattern doesn’t match. That
is because of the naive nature of NFA regex
engines: they have to try all the possible paths,
whose number is exploded by the greedy quantifiers.
19. Lazy Quantifiers (quantifier?)
In contrast to the standard greedy quantifier, which eats up as many instances of the
quantified token as possible, a lazy quantifier tells the engine to match as few of the quantified
tokens as needed.
A lazy quantifier gives you the shortest match.
If the quantified token has matched so few characters that the rest of the pattern cannot
match, the engine backtracks to the quantified token and makes it expand its match, one step
at a time. After matching each new character or subexpression, the engine tries once again to
match the rest of the pattern (this behavior is called “helpful”).
But they are expensive too
Using lazy rather than greedy matches helps control the amount of
backtracking that occurs in some cases, as in the Cloudflare regex:
/.*?.*?=.*?/ matches “x=x” in 11 steps instead of 23, and so does matching
“x=xxxxxxxxxxxxxxxxxx”.
From a computing standpoint, however, this process of matching one item, advancing,
failing, backtracking and expanding is itself expensive in some situations.
In conclusion, lazy quantifiers don’t fix every backtracking problem, but they
can be a possible way to reduce it.
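The difference between greedy and lazy is easy to see in Python's re (the HTML-like example string is our own):

```python
import re

s = '<b>bold</b> and <i>italic</i>'

# Greedy: .* grabs everything up to the LAST '>', overshooting badly
assert re.search(r'<.*>', s).group() == '<b>bold</b> and <i>italic</i>'

# Lazy: .*? stops at the FIRST '>' that lets the pattern succeed
assert re.search(r'<.*?>', s).group() == '<b>'
```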
20. Possessive Quantifiers (quantifier+)
In contrast to the standard docile quantifier, which gives up characters if needed in order to allow the rest of
the pattern to match, a possessive quantifier tells the engine that even if what follows in the pattern fails to
match, it will hang on to its characters.
Possessive quantifiers match fragments of string as solid blocks that cannot be backtracked into: it's all or
nothing. This behavior is particularly useful when you know there is no valid reason why the engine should
ever backtrack into a section of matched text, as you can save the engine a lot of needless work.
Useful, but you need to be sure that no backtracking is needed in your search pattern.
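Python's re gained possessive quantifiers (a*+, a++) only in version 3.11; on any version, the same all-or-nothing behavior can be emulated with the classic atomic lookahead trick (?=(a*))\1, because lookarounds are never backtracked into (the example patterns are our own):

```python
import re

# Greedy a* gives back one 'a' so the trailing literal 'a' can still match.
assert re.fullmatch(r'a*a', 'aaaa')

# Possessive-style: the lookahead captures the whole run of a's, and the
# backreference \1 consumes it as a solid block that cannot be backtracked
# into -- no 'a' is left for the trailing literal, so the match fails.
assert re.fullmatch(r'(?=(a*))\1a', 'aaaa') is None
```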
21. The only real solution: fully re-writing the pattern to be more
specific, and moving away from a regular
expression engine that backtracks
when a partially successful search path
fails.
This means having a DFA, in which the
algorithm executes in time linear in the
size of the string being matched
against.
(Figure: NFA version of the indicted Cloudflare regex)
(Figure: DFA version of the indicted Cloudflare regex, converted with
https://cyberzhg.github.io/toolbox/nfa2dfa)
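For the simplified Cloudflare pattern this is easy to see: .*.*=.* accepts exactly the strings that contain an '=' (ignoring the newline subtleties of .), and the equivalent DFA decides that in one linear pass with no backtracking (a minimal sketch, with our own state numbering):

```python
def dfa_contains_eq(s: str) -> bool:
    """Linear-time DFA equivalent of .*.*=.* ('does the string contain =?')."""
    state = 0                  # 0 = no '=' seen yet, 1 = accepting state
    for ch in s:
        if state == 0 and ch == '=':
            state = 1
    return state == 1          # one pass, O(len(s)), no backtracking
```

dfa_contains_eq('x' * 20) is rejected in 20 steps; compare the 4,067 steps the backtracking NFA needs on the same input.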
22. End
Sitography:
https://www.rexegg.com/regex-explosive-quantifiers.html - The Explosive Quantifier Trap
http://wstoop.co.za/wregex.php - How Regular Expression Engines works
https://mariusschulz.com/blog/why-using-the-greedy-in-regular-expressions-is-almost-never-what-you-actually-want#bad-performance-and-incorrect-matches - Why Using the .* Greedy Quantifier is Almost Never What You Actually Want
https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS - OWASP ReDoS
https://www.regular-expressions.info/catastrophic.html - Catastrophic Backtracking
https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/ - Details of the 2 July Outage by Cloudflare