A Programmer's Guide to Regex or Regular Expressions

Everybody talks about regular expressions, everybody hates them, and yet everybody ends up using them!

So what is a regular expression? To answer that, we need to go a little deeper. Let’s dive into the building blocks of regex, starting with a short intro.

Regular expression:

A regular expression (also called a rational expression) is a sequence of characters that describes a pattern of text. It lets us search for specific patterns, and it helps us match, locate, and manage text. Regular expressions can look pretty complicated, but they are very powerful: you can create a regex for almost any pattern of text you can think of.

Building blocks of regular expressions:

Metacharacters are the building blocks of regular expressions. Characters in regex are understood to be either a metacharacter with a special meaning or a regular character with a literal meaning.

Reserved meta-characters:

Meta characters that are reserved and need to be escaped:

.[{()\^$|?*+

We will see an example of escaping later.

Other common meta characters are:

Caret (^):

^ matches the start of the string, and in multiline mode also matches immediately after each newline.

example:

^\d{3} will match "456" in "456-112-112".

Dollar ($):

$ matches the end of the string or just before the newline at the end of the string, and in multiline mode also matches before a newline.

example:

\d{3}$ will match the last "112" in "456-112-112".
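Here is a minimal sketch of both anchors in Python, assuming the standard re module. The two-line sample text is made up for illustration.

```python
import re

phone = "456-112-112"

# ^ anchors the match to the start of the string.
start = re.search(r"^\d{3}", phone)
# $ anchors the match to the end of the string.
end = re.search(r"\d{3}$", phone)

print(start.group())  # 456
print(end.group())    # 112
print(end.span())     # (8, 11), i.e. the last "112", not the middle one

# In multiline mode, ^ also matches right after each newline.
lines = "456-112\n789-334"  # made-up sample text
print(re.findall(r"^\d{3}", lines))                # ['456']
print(re.findall(r"^\d{3}", lines, re.MULTILINE))  # ['456', '789']
```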

\d:

\d matches a single digit (0–9). The number of \d tokens determines the number of digits our regex will match, i.e. \d means a single digit, \d\d means two digits, and so on.

example:

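\d\d\d will match 327, 123, and 787, but not 1223, as there are 4 digits in “1223” and our regex matches exactly 3 digits. (In these examples the pattern has to cover the whole string; a search inside a longer text would still find “122” within “1223”.)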

\d = 1

\d\d = 12

\d\d\d ≠ 473847, as the regex matches 3 digits but 473847 contains 6 digits.

\d\d\d ≠ cat, because it matches only digits but cat contains letters.
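The difference between "the whole string is three digits" and "three digits appear somewhere" is easy to check in Python. This is only a sketch using the standard re module; the variable name three_digits is just for illustration.

```python
import re

three_digits = re.compile(r"\d\d\d")

# fullmatch succeeds only if the whole string fits the pattern.
for text in ["327", "1223", "473847", "cat"]:
    print(text, bool(three_digits.fullmatch(text)))
# 327 True
# 1223 False
# 473847 False
# cat False

# search looks for the pattern anywhere, so it finds the first three digits.
print(three_digits.search("1223").group())  # 122
```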

\D:

The reverse of \d. Matches any character except a digit.

example:

\D\D = AB

\D\D = xy

\D\D ≠ 12

as \D won’t match numeric characters.

\w:

Matches any alphanumeric (word) character: a letter, a digit, or the underscore.

example:

\w\w\w = 467

\w\w\w\w = Crow

\w\w\w ≠ python

\w\w\w doesn’t match python because python contains 6 characters.

\W:

Just as \D is the reverse of \d, \W is the reverse of \w, i.e. it matches anything but alphanumeric characters.

example:

\W\W = ,, or !! or @#

\W\W\W = !@#

\W\W\W\W ≠ Titanic2

as every character in it is alphanumeric.
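The \D, \w and \W examples above can be tried out with a small helper. This is a sketch assuming Python's standard re module; the helper name whole is hypothetical and simply checks that the pattern covers the entire text.

```python
import re

def whole(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

print(whole(r"\D\D", "AB"))        # True
print(whole(r"\D\D", "12"))        # False
print(whole(r"\w\w\w", "467"))     # True
print(whole(r"\w\w\w\w", "Crow"))  # True
print(whole(r"\w\w\w", "python"))  # False
print(whole(r"\W\W\W", "!@#"))     # True
print(whole(r"\w", "_"))           # True, the underscore counts as a word character
```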

\s:

Matches any whitespace character, such as a space or a tab.

For example, in a text made of two words separated by a space, the regex \s will match only the space between the two words and ignore everything else.

\S:

Matches any non-whitespace character, the reverse of \s.
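A quick way to see the two side by side, sketched with Python's re module. The sample text "red car" is made up for illustration.

```python
import re

text = "red car"  # made-up sample text

# \s picks out only the whitespace.
spaces = re.findall(r"\s", text)
# \S picks out everything that is not whitespace.
non_spaces = re.findall(r"\S", text)

print(spaces)      # [' ']
print(non_spaces)  # ['r', 'e', 'd', 'c', 'a', 'r']
```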

Repeaters (*, + and { }):

*, + and { } are called repeaters, as they denote that the preceding character is to be repeated.

Asterisk symbol ( * ):

The asterisk matches when the character preceding * occurs 0 or more times, i.e. it tells the regex engine to match the preceding character (or set of characters) 0 or more times, with no upper limit.

example:

Gre*n = Green (e is found 2 times), Grn (e is found 0 times), Greeeeen (e is found 5 times)

and so on.

tre* ≠ trees

as there is an “s” following “ee” that the pattern does not cover.

Plus symbol ( + ):

The + sign matches when the character preceding + occurs one or more times, with no upper limit.

example:

Gre+n = Green, Greeeen, Gren

and so on.

Gre+n ≠ Grn

as “e” is absent here.
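To compare * and + directly, here is a small sketch in Python (standard re module; the names star and plus are just for illustration). Each word is checked as a whole against both patterns.

```python
import re

star = re.compile(r"Gre*n")  # zero or more "e"
plus = re.compile(r"Gre+n")  # one or more "e"

for word in ["Grn", "Gren", "Green", "Greeeeen"]:
    print(word, bool(star.fullmatch(word)), bool(plus.fullmatch(word)))
# Grn True False
# Gren True True
# Green True True
# Greeeeen True True

# tre* does not cover the trailing "s", so the whole word is not matched,
# although a search still finds "tree" inside it.
print(re.fullmatch(r"tre*", "trees"))       # None
print(re.search(r"tre*", "trees").group())  # tree
```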

Dot(.):

The period matches any single character (letter, digit, or symbol), except a newline by default. Because it can take the place of any other character, it is called the wildcard.

example:

Gre. = Gree, Gren, Gre1

and so on.

Gre. ≠ Green

as . by itself matches only a single character, here in the 4th position of the term. n is the 5th character and is not accounted for in the regex.

But Gre.* will match Green, as it tells the engine to match any character any number of times.
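The wildcard behaviour can be sketched in Python like this, assuming the standard re module. The last two lines show the newline exception and the re.DOTALL flag that lifts it.

```python
import re

print(bool(re.fullmatch(r"Gre.", "Gren")))    # True
print(bool(re.fullmatch(r"Gre.", "Gre1")))    # True
print(bool(re.fullmatch(r"Gre.", "Green")))   # False, one character too many
print(bool(re.fullmatch(r"Gre.*", "Green")))  # True

# By default the dot does not match a newline; re.DOTALL changes that.
print(bool(re.fullmatch(r"Gre.", "Gre\n")))             # False
print(bool(re.fullmatch(r"Gre.", "Gre\n", re.DOTALL)))  # True
```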

Alternation (|):

Allows for alternate matches. | works like the Boolean OR.

example:

A|B creates a regular expression that will match either A or B.

H(i!|ey!) will match either Hi! or Hey!

M(s|r|rs)\.?\s[A-Z]\w+ will match any name that starts with Ms, Mr or Mrs.
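Here is a rough Python sketch of the alternation examples, using the standard re module. The sample names are made up for illustration, and find_title is a hypothetical helper.

```python
import re

greeting = re.compile(r"H(i!|ey!)")
print(bool(greeting.fullmatch("Hi!")))   # True
print(bool(greeting.fullmatch("Hey!")))  # True
print(bool(greeting.fullmatch("Ho!")))   # False

title = re.compile(r"M(s|r|rs)\.?\s[A-Z]\w+")

def find_title(name):
    """Return the matched text, or None if the name has no Ms/Mr/Mrs title."""
    m = title.search(name)
    return m.group() if m else None

for name in ["Ms Jones", "Mr. Brown", "Mrs. Smith", "Dr. Who"]:
    print(name, "->", find_title(name))
# Ms Jones -> Ms Jones
# Mr. Brown -> Mr. Brown
# Mrs. Smith -> Mrs. Smith
# Dr. Who -> None
```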

Question mark (?):

Matches when the character preceding ? occurs 0 or 1 time only, making the character match optional.

example:

Favou?rite = Favourite (u is found 1 time)

Favou?rite = Favorite (u is found 0 times)
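The optional character is easy to verify in Python. This is a minimal sketch with the standard re module; the third, misspelled word is made up to show that ? allows at most one occurrence.

```python
import re

spelling = re.compile(r"Favou?rite")

print(bool(spelling.fullmatch("Favourite")))   # True, u appears once
print(bool(spelling.fullmatch("Favorite")))    # True, u appears zero times
print(bool(spelling.fullmatch("Favouurite")))  # False, u appears twice
```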

Character set ([]):

[] is used to indicate a set of characters. In a set, characters can be listed individually, e.g. [cat] will match 'c', 'a', or 't'. Ranges of characters can be indicated by giving two characters and separating them with a '-'.

example:

[A-Z] will match any uppercase ASCII letter.

[0-9] will match any digit from 0 to 9.

[0-3][0-3] will match any two-digit string in which each digit is between 0 and 3 (for example 00, 13, or 33, but not 04 or 29).

[0-9A-Fa-f] will match any hexadecimal digit.

If - is escaped (e.g. [A\-Z]) or if it’s placed as the first or last character (e.g. [A-]), it will match a literal '-'.

The order of the characters does not matter.
Special characters lose their special meaning inside sets. For example, [(+*)] will match any of the literal characters '(', '+', '*', or ')'.

To match a literal ']' inside a set, precede it with a backslash, or place it at the beginning of the set. For example, both [()[\]{}] and []()[{}] will match a right bracket, as well as a left bracket, braces, and parentheses.
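A few of these set rules, sketched in Python with the standard re module. All the sample strings are made up for illustration.

```python
import re

# Characters listed individually: each of c, a, t matches on its own.
print(re.findall(r"[cat]", "tackle"))  # ['t', 'a', 'c']

# Two sets in a row: each digit must be between 0 and 3.
print(bool(re.fullmatch(r"[0-3][0-3]", "13")))  # True
print(bool(re.fullmatch(r"[0-3][0-3]", "29")))  # False

# Hexadecimal digits only ("x" is not one of them).
print(re.findall(r"[0-9A-Fa-f]", "0x1F"))  # ['0', '1', 'F']

# An escaped "-" is a literal hyphen, not a range.
print(re.findall(r"[A\-Z]", "A-Z B"))  # ['A', '-', 'Z']

# Special characters are literal inside a set.
print(re.findall(r"[(+*)]", "(a+b)*c"))  # ['(', '+', ')', '*']

# An escaped "]" inside a set matches a literal right bracket.
print(re.findall(r"[()[\]{}]", "f(x)[0]{}"))  # ['(', ')', '[', ']', '{', '}']
```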

Character group ():

A character group, indicated by (), matches the enclosed characters in exact order.

example:

(abc) = abc

not

acb

(123) = 123

not

https?://(www\.)?(\w+)(\.\w+) will match simple URLs such as https://www.google.com. There are 3 groups here.

1st group: the optional www.

2nd group: the domain name (google, facebook, etc.)

3rd group: the top-level domain (.com, .net, .org)

There is another, implicit group: group 0. Group 0 is everything that the whole pattern matched, in our case the entire URL.
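Here is how the groups could be read back in Python, as a sketch with the standard re module. The sample sentences and the names url_pattern, first and second are made up for illustration.

```python
import re

url_pattern = re.compile(r"https?://(www\.)?(\w+)(\.\w+)")

first = url_pattern.search("Visit https://www.google.com today")
print(first.group(0))  # https://www.google.com  (the whole match)
print(first.group(1))  # www.
print(first.group(2))  # google
print(first.group(3))  # .com

# When the optional group does not take part in the match, it is None.
second = url_pattern.search("http://facebook.net")
print(second.groups())  # (None, 'facebook', '.net')
```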

Quantifiers:

Regex uses quantifiers to specify how many times the preceding character or group may repeat. We can use multiple quantifiers in one pattern. The quantifiers are:

{n}:

Matches when the preceding character, or character group, occurs n times exactly.

example:

\d{3} = 123

pand[ora]{2} = pandar, pandoo

pand[ora]{2} ≠ pandora, as the quantifier {2} only allows for 2 letters from the character set [ora].

{n,m}:

Matches when the preceding character, or character group, occurs at least n times, and at most m times.

example:

\d{2,6} = 430973, 4303, 38238

\d{2,6} ≠ 3

3 does not match because it is only 1 digit, which is outside the allowed range of 2 to 6 digits.
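Both quantifier forms can be checked in Python. This is a sketch with the standard re module; the 7-digit string is an extra made-up case that shows the upper bound.

```python
import re

# {n}: exactly n repetitions.
print(bool(re.fullmatch(r"\d{3}", "123")))             # True
print(bool(re.fullmatch(r"pand[ora]{2}", "pandar")))   # True
print(bool(re.fullmatch(r"pand[ora]{2}", "pandoo")))   # True
print(bool(re.fullmatch(r"pand[ora]{2}", "pandora")))  # False, 3 letters from the set

# {n,m}: at least n and at most m repetitions.
two_to_six = re.compile(r"\d{2,6}")
for digits in ["3", "4303", "430973", "1234567"]:
    print(digits, bool(two_to_six.fullmatch(digits)))
# 3 False
# 4303 True
# 430973 True
# 1234567 False
```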

Escaping Metacharacters:

To search for a character that is a reserved metacharacter (any of .[{()\^$|?*+), we can use the backslash \ to escape it so that it is treated as a literal character.

Example:

The regex below will match most common email addresses. Here we’ve used \ to escape reserved characters.

^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
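To see what escaping changes, here is a small Python sketch using the standard re module. The sample strings and the address jane.doe@example.com are made up, and the pattern is written with plain ASCII hyphens in its ranges.

```python
import re

# Unescaped, the dot is a wildcard; escaped, it is a literal period.
print(bool(re.fullmatch(r"3.14", "3x14")))   # True
print(bool(re.fullmatch(r"3\.14", "3x14")))  # False
print(bool(re.fullmatch(r"3\.14", "3.14")))  # True

# re.escape adds the backslashes for us.
print(re.escape("a+b"))  # a\+b

email = re.compile(r"^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$")

good = email.match("jane.doe@example.com")
print(good.groups())                # ('jane.doe', 'example', 'com')
print(email.match("not-an-email"))  # None
```

This pattern is a simplification: it accepts the usual name@domain.tld shape but does not cover every address that real mail systems allow.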

Congratulations! Now you know the very basics of regex, and that’s already plenty for one day!

In my upcoming article we will practice regex with Python.