Avoid Stack Overflows By Writing Regexes Properly

Written by sonarsource | Published 2021/02/23
Tech Story Tags: java | regex | regular-expressions | stackoverflow | static-analysis | programming | coding | good-comapny

TLDR Regular expressions are a powerful mechanism for matching patterns in strings. But matching a pattern may - depending on the pattern - require stack space proportional to the length of the input. This means large inputs could cause the program to crash with a stack overflowException. To prevent this, we have a rule that warns you about issues like this: Regular expressions should not overflow the stack. The rule tries to estimate how much stack space a regular expression will consume relative to the input size. It only raises issues on regular expressions whose stack consumption exceeds a configurable threshold.via the TL;DR App

We've been working recently on adding rules to help write better regular expressions in Java.
I've talked already about rules to find errors in character classes and with the use of boundary markers and overly complex regular expressions. Those rules help you make sure your regular expressions are accurate and maintainable. Today I will show you how to make sure that the regular expression won’t crash your program. 
As a brief reminder: regular expressions are a terse and powerful mechanism for matching patterns in strings. Part of the power of regular expressions is that a pattern can concisely describe a string that might be much larger than the original pattern. Sometimes much, much larger. Sometimes enough to overflow the stack and crash your application.
Due to the way regular expression matching is implemented in Java (and many other languages/libraries), matching a pattern may - depending on the regex - require stack space proportional to the length of the input. This means large inputs could cause the program to crash with a 
StackOverflowException
 when you try to use the regex.
We ran into this problem in our own code when analysis tried  to find comments containing at least one word with more than a given number of characters. For this we used a regex similar to
 #(.|\n)*\w{3,}
, which would cause the analysis to crash on source files containing sufficiently long comments. To prevent problems like this for our future selves as well as for you, dear users, we implemented a rule which can detect problems like this:
In theory any regex is susceptible to stack overflow if it contains a repetition that contains some sort of branching, such as another repetition or an alternative. However, for some regular expressions the length of input required to make them crash is much larger than for others.
To account for this, our rule tries to estimate how much stack space a regular expression will consume relative to the input size and only raises issues on regular expressions whose stack consumption exceeds a configurable threshold.
One way to avoid stack consumption in regular expressions is to use possessive quantifiers. Possessive quantifiers are created by adding a 
+
 to a quantifier (e.g.
 x*+
 instead of 
x*
). Doing so disables backtracking. In addition to avoiding issues with catastrophic backtracking, making a quantifier possessive allows a pattern to be matched without consuming stack space. The problem is that sometimes backtracking is necessary and using possessive quantifiers to disable it may leave you with a regex that can never match any input.
Consider for example the regex I cited above: 
#(.|\n)*\w{3,}
: If we try to fix the stack overflow by making the quantifier possessive, we end up with 
#(.|\n)*+\w{3,}
. This regex can never match anything because any input that could be matched by 
\w
 will already have been matched by 
(.|\n)*+
 and, being possessive, it won’t give it back. Luckily we have a rule that warns you about issues like this:
So how would one properly fix this issue? In our case, we decided to get rid of the regex altogether since there wasn’t really a good reason to use a regex here. But the regex could also be salvaged easily enough by getting rid of the alternation. The intent of 
.|\n
 was to match any character, including line breaks, because . does not match line breaks by default. The proper way to match any character would be to enable the 
DOTALL
 flag, which can also be enabled from within the regex using 
(?s)
. So either 
Pattern.compile(“#.*\\w{3,}”, Pattern.DOTALL)
 or 
”(?s)#.*\\{3,}”
 would work (the latter of which also works when using methods that don’t involve 
Pattern.compile
).
Another regex susceptible to stack overflows would be something as simple as 
(a|b)*
. Here the alternation can be removed by using a character class instead, which doesn’t require stack space to match: 
[ab]*
. Luckily we have a rule that finds alternations that can be replaced by character classes and suggests just that replacement like in the following Spring Boot code:
Regular expressions are a truly powerful feature. Because great power comes with great responsibility (and great opportunities for screwing things up, or confusing your colleagues and future-you), we now offer a total of 24 rules targeting regular expressions.
They are all available today in SonarCloud. These three rules around avoiding stack overflows are available starting from SonarQube 8.7, but the rest are already released. You can see them all at rules.sonarsource.com.
Finally, note that some of the rules perform in-depth automata-based analyses on your regular expressions to identify issues as accurately as possible. This is an extremely promising approach and we will probably discuss this feature in a future blog post because we truly believe it’s amazing. This will for sure bring A LOT to regex code quality in general, so stay tuned!

Written by sonarsource | We build world-class Code Quality & Security tools: SonarQube, SonarLint and SonarCloud
Published by HackerNoon on 2021/02/23