3 Kinds of Parentheses: Are you a RegEx Master?

Written by manishv | Published 2018/05/03


If you’ve spent any time writing code you’ve no doubt abused regular expressions until they were an inscrutable character jumble that could give a real parser a run for its money. Even so, I was still surprised when I learned that there are 3 different kinds of parentheses in regular expressions, not just 2.

And no, the 2 aren’t left and right, wise guy.

The 3 types of parentheses are Literal, Capturing, and Non-Capturing. You probably know about capturing parentheses. You’ll recognize literal parentheses too. It’s the non-capturing parentheses that’ll throw most folks, along with the semantics around multiple and nested capturing parentheses. (True RegEx masters, please hold the “But wait, there’s more!” for the conclusion.)

Literal Parentheses

Literal parentheses are just that: literal text that you want to match. Suppose you want to match U.S. phone numbers of the form (xxx)yyy-zzzz. You could write the regular expression as /\(\d{3}\)\d{3}-\d{4}/. Notice that we had to type \( and \) instead of just a naked ( or ). That’s because a raw parenthesis starts or ends a capturing or non-capturing group. If we want to match a literal parenthesis in the text, we have to escape it with \.
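To try the escaping rule out, here is a minimal sketch in JavaScript. The helper name isPhoneNumber and the ^ and $ anchors are my own additions for illustration; they are assumptions, not part of the expression discussed above.

```javascript
// Hypothetical helper: true when the whole string looks like (xxx)yyy-zzzz.
// The ^ and $ anchors are an added assumption so that partial matches are rejected.
function isPhoneNumber(text) {
  // \( and \) match literal parentheses; unescaped ones would open and close a group.
  return /^\(\d{3}\)\d{3}-\d{4}$/.test(text);
}

console.log(isPhoneNumber("(303)555-1212")); // true
console.log(isPhoneNumber("303-555-1212"));  // false
```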

Capturing Parentheses

You’ve probably written some capturing parentheses too, whether you meant to capture or not. These parentheses aren’t used to match literal () in the text. Instead, they group characters together in a regular expression so that we can apply other operators like +, *, ?, or {n} to the whole group.

For example, if we want to match just the strings can or can’t, we can write /can('t)?/. We need the parentheses here because /can't?/ would match only the strings can’ and can’t, which is not quite what we had in mind.
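The difference between the two expressions is easy to check. This is only a small sketch: the ^ and $ anchors are added by me (an assumption for this demo) so that each test has to match the whole word, and the variable names are made up.

```javascript
// Anchored copies of the two expressions (the anchors are an addition for this demo).
const grouped = /^can('t)?$/;  // ? applies to the whole group 't
const ungrouped = /^can't?$/;  // ? applies only to the final t

const words = ["can", "can't", "can'"];
const results = words.map((word) => ({
  word,
  grouped: grouped.test(word),
  ungrouped: ungrouped.test(word),
}));

for (const row of results) {
  console.log(row.word, row.grouped, row.ungrouped);
}
// can   true  false
// can't true  true
// can'  false true
```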

Photo by David Clode on Unsplash

However, there’s something else going on here. These are called capturing parentheses for a reason — namely they capture anything that matches the expression they contain for later use by your program. Continuing the can/can’t example, in JavaScript we get:

    const match = /can('t)?/.exec("We can't do it!");
    console.log(match[0]); // prints the match "can't"
    console.log(match[1]); // prints captured "'t"

Here, match[1] contains the text captured by the parentheses. This is somewhat uninteresting, because we don’t really care about the 't separately from the word can’t.

The phone number example gets more interesting. In JavaScript, we can extract the area code of a U.S. style phone number as follows:

    const match = /\((\d{3})\)\d{3}-\d{4}/.exec("(303)555-1212");
    console.log(match[0]); // (303)555-1212
    console.log(match[1]); // 303

Let’s take a closer look at what is going on in that regular expression, /\((\d{3})\)\d{3}-\d{4}/. It is almost identical to the expression we used in the literal parentheses example, but this time I added a set of capturing parentheses inside the pair of literal parentheses. This tells the regular expression engine to remember the part of the match that is inside the capturing parentheses. This captured match is what we find in match[1]. Notice that the entire phone number match is in match[0]. This little example shows the power of capturing parentheses. Above, we used it to extract an area code from a phone number. We can use it to extract all kinds of text — a poor man’s parser.

“Exploits of a Mom” from XKCD

As another quick example, we can use capturing parentheses to extract first name and last name via /(\D+) (\D+)/. match[1] will have the first name and match[2] will have the last name, assuming you’re not matching Bobby Tables’ given name (see comic), or have extra spaces to deal with.
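Here is a rough sketch of that name example. The helper splitName and the sample names are hypothetical, chosen only for illustration. Because the first \D+ is greedy, the split happens at the last space, which is one way the "extra spaces" caveat shows up.

```javascript
// Hypothetical helper that splits "First Last" using two capturing groups.
function splitName(fullName) {
  const match = /(\D+) (\D+)/.exec(fullName);
  if (match === null) return null; // exec returns null when nothing matches
  return { first: match[1], last: match[2] };
}

console.log(splitName("Ada Lovelace"));   // { first: 'Ada', last: 'Lovelace' }
console.log(splitName("Mary Ann Smith")); // { first: 'Mary Ann', last: 'Smith' }
console.log(splitName("12345"));          // null
```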

Non-capturing Parentheses

Now, we get to the third kind of parenthesis — non-capturing parentheses. There are times when you need to group things together in a regular expression, but you don’t want to capture the match, as in the can/can’t example above. To avoid capturing the 't, we write /can(?:'t)?/. Here, all we get is the full match, with no sub-matches.
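One way to see this is to compare the arrays that exec returns for the two versions. A quick sketch, with variable names of my own choosing:

```javascript
const text = "We can't do it!";
const capturing = /can('t)?/.exec(text);
const nonCapturing = /can(?:'t)?/.exec(text);

// The result array holds the full match plus one slot per capturing group.
console.log(capturing.length);    // 2
console.log(nonCapturing.length); // 1
console.log(nonCapturing[0]);     // can't
```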

The (?: is a special sequence that starts a parenthesized group, just like (, but it tells the regular expression engine not to bother capturing the match in the group and to use it only for grouping. Let’s look at a more complex example where ignoring a parenthesized group is useful.

Let’s extend that phone number regular expression to allow a prefix of mobile or office. With only capturing parentheses, this looks like match = /((mobile|office) )?\((\d{3})\)\d{3}-\d{4}/.exec(...). (Is this inscrutable yet?). The problem is that the area code we want to extract is in match[3]. (I’ll leave it as an exercise to the reader as to why.) This is confusing and unnecessary since we don’t care about the annotation or anything other than the area code in this example. To capture only the area code, we can do:

    const re = /(?:(?:mobile|office) )?\((\d{3})\)\d{3}-\d{4}/;
    const match = re.exec('mobile (303)555-1212');
    console.log(match[0]); // mobile (303)555-1212
    console.log(match[1]); // 303

Notice the two sets of non-capturing parentheses (?: around the annotation, but the use of regular capturing parentheses around the area code.
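For comparison, here is a sketch of what the capturing-only version hands back. Groups are numbered by the position of their opening parenthesis, left to right, which is why the area code lands in the third slot. The variable names are mine.

```javascript
const capturingRe = /((mobile|office) )?\((\d{3})\)\d{3}-\d{4}/;

const withPrefix = capturingRe.exec("mobile (303)555-1212");
console.log(withPrefix[1]); // "mobile " (outer group, including the trailing space)
console.log(withPrefix[2]); // "mobile"
console.log(withPrefix[3]); // "303"

// Without a prefix, the optional groups did not participate and are undefined.
const withoutPrefix = capturingRe.exec("(303)555-1212");
console.log(withoutPrefix[1]); // undefined
console.log(withoutPrefix[3]); // "303"
```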

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

-Jamie Zawinski

But Wait, There’s More!

And there you have it, 3 kinds of parentheses, literal, capturing, and non-capturing — \(, (, (?:. We should probably use (?: more than we do, but I find it hard to read, so as long as ( doesn’t cause any performance issues or semantic changes to an existing regular expression (by changing the index needed to find relevant group matches), I’ll skip the extra ?:. I’m not sure if this is the best practice, but let’s face it, regular expressions are hard enough to read as it is.

True RegEx masters know that there are other types of parentheses that use the (? syntax as well, such as lookahead assertions and named groups. Alas, I’m not actually a RegEx master, so I’ll leave you to search other sources to learn about those. Support varies between regular expression libraries; JavaScript, for instance, has long had lookahead but only gained named groups with ES2018. Named regular expression groups are among the most useful of these.
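As a taste of named groups, here is a minimal sketch that assumes a JavaScript engine implementing the ES2018 named capture group syntax, which current Node.js and browsers do. The group name areaCode is my own choice.

```javascript
// Named groups use (?<name>...) and show up on the match's groups property.
const phone = /\((?<areaCode>\d{3})\)\d{3}-\d{4}/;
const phoneMatch = phone.exec("(303)555-1212");

console.log(phoneMatch.groups.areaCode); // 303
console.log(phoneMatch[1]);              // 303 (named groups are still numbered)
```

Reading phoneMatch.groups.areaCode avoids counting parentheses to work out which index holds the area code.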


Published by HackerNoon on 2018/05/03