Learning RegEx Basics in Ruby

Finding patterns in strings is a common problem for developers. One of the main scenarios where you might need this is form validation: does the email have the right structure? Does the user's first name contain any invalid characters? That's where regular expressions come in.

Let's Begin

In Ruby, regular expressions are defined between two slashes. Let me show you a simple example of a regex in Ruby:

def find_match(text)
  return true if text=~/hello/

  false
end

puts find_match("hello world") #true
puts find_match("nice to see you") #false

The find_match function only checks whether the text contains hello.
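
As a side note, the =~ operator does not return true or false by itself: it returns the index where the match starts, or nil when there is no match. The following is a minimal sketch of that behavior, next to String#match?, which returns a plain boolean. The helper name contains_hello? is made up for this illustration.

```ruby
# =~ returns the index of the first match, or nil when nothing matches.
puts ("hello world" =~ /hello/).inspect     # 0
puts ("say hello" =~ /hello/).inspect       # 4
puts ("nice to see you" =~ /hello/).inspect # nil

# String#match? returns true or false directly (available since Ruby 2.4).
# contains_hello? is a hypothetical helper used only in this sketch.
def contains_hello?(text)
  text.match?(/hello/)
end

puts contains_hello?("hello world")     # true
puts contains_hello?("nice to see you") # false
```

Either form works for validations; match? simply saves the explicit true/false branches.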

Match at least one of the options

If you want to check if a text contains a character from a particular group of characters, you can use square brackets to accomplish this.

def match_range(text)
  return true if text=~/[0123456789]/

  false
end 

puts match_range("number 1") #true
puts match_range("number one") #false

While /abc/ indicates "a followed by b followed by c", /[abc]/ indicates "a or b or c". Instead of /[0123456789]/ you can use /[0-9]/, which generates the same result. Another common range is /[a-z]/, which matches any lowercase letter of the English alphabet.
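
Ranges can also be combined inside the same pair of square brackets. Here is a small sketch of that idea; the method names has_letter? and has_hex_digit? are invented for this example.

```ruby
# Several ranges can live inside one pair of square brackets.
# has_letter? and has_hex_digit? are hypothetical names for this sketch.
def has_letter?(text)
  text.match?(/[a-zA-Z]/) # any lowercase or uppercase English letter
end

def has_hex_digit?(text)
  text.match?(/[0-9a-f]/) # a digit, or a letter from a to f
end

puts has_letter?("1234")    # false
puts has_letter?("12A4")    # true
puts has_hex_digit?("xyz")  # false
puts has_hex_digit?("xyz7") # true
```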

Shorthand for ranges

Ruby provides a shorthand syntax for some of the ranges we frequently need. Let's see the most common:

\d for [0-9]
\s for whitespace: spaces, tabs, or newlines
\w for [0-9], [a-z], [A-Z], or an underscore
\D for everything except [0-9]
\S for everything except spaces, tabs, or newlines
\W for everything except [0-9], [a-z], [A-Z], and the underscore

def word_match(text)
  return true if text=~/\w/

  false
end 

puts word_match("!&$%") #false
puts word_match("a!#$%") #true
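
To make the shorthand list a bit more concrete, here is a quick sketch that checks a few of them one by one. It uses String#match? so each line prints true or false directly.

```ruby
# \w covers letters, digits and also the underscore.
puts "_".match?(/\w/)    # true
puts "-".match?(/\w/)    # false

# \s matches spaces, tabs and newlines; \S is its opposite.
puts "a\tb".match?(/\s/) # true
puts "   ".match?(/\S/)  # false

# \d and \D split characters into digits and non-digits.
puts "abc".match?(/\d/)  # false
puts "abc".match?(/\D/)  # true
```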

Checking the first and last character of the string

So far, we have looked for a match anywhere in the string, but if we want to check just the beginning, the end, or the whole string, we need to add something else. Let's see an example:

def begin_end_match(text)
  return true if text=~/^\w/ || text=~/\w$/

  false
end 

puts begin_end_match("acd#$%") #true
puts begin_end_match("#$%acd") #true
puts begin_end_match("#$%acd#$%") #false

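Use ^ at the beginning of the regex to indicate that you want to match at the first character of the string. Use $ at the end of the regex to check the match against the last character of the string. Strictly speaking, in Ruby ^ and $ anchor to the start and end of a line; with single-line strings like the ones above, that is the same as the start and end of the string.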
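
When the text can contain newlines, the difference between a line and the whole string starts to matter. Ruby also offers \A and \z, which anchor to the very start and very end of the string. The sketch below compares both pairs of anchors; the constant MULTI_LINE is just sample data made up for this example.

```ruby
# Sample text with two lines (made up for this sketch).
MULTI_LINE = "first line\nsecond line"

# ^ and $ anchor to the start and end of each line.
puts MULTI_LINE.match?(/^second/)       # true
puts MULTI_LINE.match?(/first line$/)   # true

# \A and \z anchor to the start and end of the whole string.
puts MULTI_LINE.match?(/\Asecond/)      # false
puts MULTI_LINE.match?(/\Afirst/)       # true
puts MULTI_LINE.match?(/first line\z/)  # false
puts MULTI_LINE.match?(/second line\z/) # true
```

For validating user input, \A and \z are usually the safer choice, since a stray newline cannot sneak extra content past the check.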

Using + for 1 or more matches

What if we want to match a pattern against the whole text? It isn't enough to use ^ and $ around a single character class, since that only works if your text has just one character. If you want to check for one or more characters, you need to use +.

def whole_match(text)
  return true if text=~/^\d+$/

  false
end 

puts whole_match("1") #true
puts whole_match("123") #true
puts whole_match("1s3") #false

In the example, the match succeeds as long as all the characters are digits, no matter how many digits there are.
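
Another way to see what + does is to extract matches instead of only testing for them. This sketch uses String#scan, which returns every match as an array; the sample sentence is made up.

```ruby
# Sample sentence invented for this sketch.
SENTENCE = "order 66 has 3 items"

# Without +, \d matches one digit at a time.
puts SENTENCE.scan(/\d/).inspect  # ["6", "6", "3"]

# With +, consecutive digits are grouped into a single match.
puts SENTENCE.scan(/\d+/).inspect # ["66", "3"]
```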

Using ? for 0 or 1 match

By using "?" we can specify that a pattern may appear at most once. Let's see an example that returns true if the given text has at most one digit at the beginning, followed by at least one non-digit character.

def zero_or_one_match(text)
  return true if text=~/^\d?\D+$/

  false
end 

puts zero_or_one_match("1qwe") #true
puts zero_or_one_match("asd") #true
puts zero_or_one_match("12swdwe") #false

Notice that to combine different patterns you only need to concatenate them in the same expression. First we have "^" to anchor the pattern at the start of the string, then "\d?" to ask for one digit or none, followed by "\D+", which looks for at least one non-digit character. Finally, the pattern finishes with "$", which indicates that the string must end right after the previous pattern.
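
A typical use of "?" is an optional prefix. As a small sketch, the method below accepts whole numbers with an optional leading minus sign. The name signed_integer? is hypothetical.

```ruby
# signed_integer? is a hypothetical helper for this sketch.
# -?  : an optional minus sign (zero or one time)
# \d+ : one or more digits
def signed_integer?(text)
  text.match?(/^-?\d+$/)
end

puts signed_integer?("42")   # true
puts signed_integer?("-42")  # true
puts signed_integer?("--42") # false
puts signed_integer?("-")    # false
```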

Using {}

A frequent use of regex is validating a phone number, where you want to check that its length is between 7 and 10 digits. To handle this scenario you can use the following match.

def phone_match(phone)
  return true if phone=~/^\d{7,10}$/

  false
end 

puts phone_match("1234567") #true
puts phone_match("123456") #false
puts phone_match("3117654321") #true

You can also use the curly brackets to ask for a specific length. In the previous example, if you use {7} instead of {7,10} then only the first test will return true.
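
The curly brackets also accept an open upper bound: {7,} means "seven or more". Here is a short sketch comparing an exact length with a minimum length; both method names are invented for the example.

```ruby
# Hypothetical helpers comparing two forms of the {} quantifier.
def exactly_seven_digits?(text)
  text.match?(/^\d{7}$/)  # exactly 7 digits
end

def at_least_seven_digits?(text)
  text.match?(/^\d{7,}$/) # 7 digits or more
end

puts exactly_seven_digits?("1234567")     # true
puts exactly_seven_digits?("3117654321")  # false
puts at_least_seven_digits?("3117654321") # true
puts at_least_seven_digits?("123456")     # false
```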

Using () to capture groups

Now, let's think about a compound pattern that combines different rules and needs to be matched several times. A good example is the pattern of an IP address.

def check_if_ip(ip_number)
  return true if ip_number=~/^(\d{1,3}[.]){3}\d{1,3}$/

  false
end 

puts check_if_ip("255.255.255.0") #true
puts check_if_ip("128.11.1.0") #true
puts check_if_ip("191.0.0") #false
puts check_if_ip("191.0.0.0.") #false

Here the pattern consists of one to three digits followed by a dot. This group is repeated three times and is then followed by one to three digits with no dot.
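
Parentheses do more than group: they also capture the text they matched, so you can inspect it afterwards. The sketch below is one possible way to use that, assuming you also want to reject numbers above 255, which a pattern based only on digit counts would accept (for instance "999.1.1.1"). The names IP_PATTERN and valid_ip? are made up for this illustration.

```ruby
# One capture group per part of the address (hypothetical constant name).
IP_PATTERN = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/

# valid_ip? is a hypothetical helper for this sketch.
def valid_ip?(text)
  match = IP_PATTERN.match(text) # MatchData, or nil when there is no match
  return false unless match

  # captures returns the text of each group as an array of strings.
  match.captures.all? { |part| part.to_i <= 255 }
end

puts IP_PATTERN.match("128.11.1.0").captures.inspect # ["128", "11", "1", "0"]
puts valid_ip?("255.255.255.0") # true
puts valid_ip?("999.1.1.1")     # false
puts valid_ip?("191.0.0")       # false
```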

Using * for zero or more

Suppose you want to allow a specific pattern in some part of the string without requiring it. You can use "*" to say that the pattern may appear once, many times, or not at all. For example, for an integer that uses a dot as a thousands separator, you want to allow a group made of a dot and three digits to repeat at the end of the string.

def check_if_integer(number)
  # The dot is escaped so it matches a literal "." and not any character
  return true if number=~/^\d{1,3}(\.\d{3})*$/

  false
end 

puts check_if_integer("23") #true
puts check_if_integer("1.234") #true
puts check_if_integer("5.734.123") #true
puts check_if_integer("24.18") #false
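
One detail worth checking in patterns like this is the dot. In a regex, an unescaped dot matches any character except a newline, so a literal dot should be written as "\." (or "[.]"). The sketch below compares the two versions; the constant names are made up.

```ruby
# Hypothetical constants: the same pattern with and without escaping the dot.
UNESCAPED_DOT = /^\d{1,3}(.\d{3})*$/
ESCAPED_DOT   = /^\d{1,3}(\.\d{3})*$/

# With an unescaped dot, any character works as a separator.
puts "1x234".match?(UNESCAPED_DOT) # true
# With an escaped dot, only a real dot is accepted.
puts "1x234".match?(ESCAPED_DOT)   # false
puts "1.234".match?(ESCAPED_DOT)   # true
puts "24.18".match?(ESCAPED_DOT)   # false
```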

Accepting any character

Now let's suppose you want to accept any character in some part of your string. For example, to make a very simple validation of a URL, you check that it always begins with "www." and ends with ".com". You can do it this way:

def url_regex(url)
  # Literal text goes outside square brackets, and each dot is escaped
  return true if url=~/^www\.[\d\D]+\.com$/

  false
end 

puts url_regex("google.com") #false
puts url_regex("www.google") #false
puts url_regex("www.google.com") #true

With "[\d\D]" (which means either a digit or a non-digit) you can accept any character. You can also get the same result with "[\w\W]".
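
You may wonder why not simply use a dot, which also stands for "any character". The difference is the newline: by default a dot does not match it, while "[\d\D]" does. A short sketch of that difference follows; the /m flag shown at the end is Ruby's way of letting the dot match newlines too.

```ruby
# Sample text with a newline between the two letters (made up for this sketch).
TWO_LINES = "a\nb"

# A dot does not match a newline by default.
puts TWO_LINES.match?(/a.b/)      # false
# [\d\D] matches every character, newlines included.
puts TWO_LINES.match?(/a[\d\D]b/) # true
# With the m flag, the dot matches newlines as well.
puts TWO_LINES.match?(/a.b/m)     # true
```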

Escaping special characters in Ruby Regex

What if I want to use a special character in my regex? For example, I want to validate that the string begins with a slash and also contains the character "[" somewhere in the text.

def escaping_regex(text)
  return true if text =~ %r{^/[\w\W]*\[[\w\W]*$}

  false
end 

puts escaping_regex('/[') #true
puts escaping_regex("qwer") #false
puts escaping_regex('/exit[') #true
puts escaping_regex('exit/[') #false

To escape a special character in a regex, use a backslash "\". For instance, you can use "\[" to check whether the "[" character is in your string. Also, in the example I replaced the opening and closing slashes with %r{}. Both forms, "/.../" and "%r{...}", are equivalent, but the second one lets me use slashes in the regex without escaping them.
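
When the text you want to match comes from a variable, escaping by hand is not practical. Ruby's Regexp.escape adds the backslashes for you. Here is a minimal sketch; the helper name contains_literal? is hypothetical.

```ruby
# Regexp.escape puts a backslash before every special character.
puts Regexp.escape("[draft]") # \[draft\]

# contains_literal? is a hypothetical helper: it checks whether text
# contains the given string exactly, with no special regex meaning.
def contains_literal?(text, literal)
  text.match?(Regexp.new(Regexp.escape(literal)))
end

puts contains_literal?("notes [draft] v1", "[draft]") # true
puts contains_literal?("notes draft v1", "[draft]")   # false

# Without escaping, [draft] is a character class: d, r, a, f or t.
puts "notes draft v1".match?(/[draft]/)               # true
```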

In conclusion

Regex can be confusing at first. It is not exactly one of the first two or three topics you learn in an "introduction to programming" course, and there are a lot of concepts to learn. But with the basic notions explained here, you can write basic validations (and even some complex ones). To practice more, try building a chatbot: you will need many of these concepts and plenty of others.