
Regular Expression Techniques: Simplified Examples


Regular expressions provide a short and flexible means of identifying particular text: characters, words, or patterns of characters. You can use them to extract emails, proxies, IP addresses, phone numbers, home addresses, HTML tags, URLs, links, dates, and much more.

Regular expressions are a language of their own. Whatever programming languages you have already learnt will help very little as you learn them. Learning regular expressions is both simple and tough: an expression is easy to compose when the use case is simple, and hard to compose when the requirement involves complex scenarios.

There’s a wealth of information available in the form of eBooks, articles, and websites that explain regular expressions from a myriad of angles. So instead of writing just another little primer I’d prefer to go straight to more practical examples. It is presumed that you already know the basics of regular expression syntax and are familiar with at least a handful of short examples.

So let’s plunge into some of its techniques by employing them in pretty useful use cases.


Look for Numbers With or Without Decimal Values

Here’s the regular expression.

^[-+]?[0-9]+(\.[0-9]+)?$

This expression looks for

  • an optional sign, + or - (because of the ? quantifier), then
  • one or more digits (because of the + quantifier), then
  • optionally (because of ?) a decimal point followed by one or more digits (because of +)

Hence it will match numbers like 32, -3.324, and +98.6.
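
To see the expression in action, here is a minimal sketch in JavaScript. It assumes the decimal point is escaped (\.) so that it matches only a literal dot; the variable names and sample inputs are made up for illustration.

```javascript
// Optional sign, one or more digits, optional fractional part.
// The dot is escaped so it matches a literal "." only.
const numberPattern = /^[-+]?[0-9]+(\.[0-9]+)?$/;

// Hypothetical sample inputs, chosen for illustration.
const numberSamples = ["32", "-3.324", "+98.6", "3.", ".5", "12a", "1x5"];

for (const sample of numberSamples) {
    console.log(sample, numberPattern.test(sample));
}
```

The first three samples are accepted. The rest are rejected: "3." has no digits after the dot, ".5" has no digits before it, and "12a" and "1x5" contain characters the pattern does not allow.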


Split a String into an Array of Words at Each Capital Letter

We can use the following regex to split the string AppleOrangeBananaStrawberryPeach into a list or array of fruits.

(?<=[a-z])(?=[A-Z])

How does it work? The lookbehind (?<=[a-z]) asserts that the character immediately preceding the current position is a lowercase letter, and the lookahead (?=[A-Z]) asserts that the character immediately following the current position is an uppercase letter. Remember that lookahead and lookbehind never consume characters or move the current position; they just look around (ahead or behind) to check the neighbouring characters against the pattern. The match is therefore zero-width: it marks the boundary between two words.

Let’s look at how it is achieved with a little JavaScript code.

var regex = /(?<=[a-z])(?=[A-Z])/gm;
var str = "AppleOrangeBananaStrawberryPeach";
var m;
var startIndex = 0;

while ((m = regex.exec(str)) !== null) {
    /* This is required to avoid infinite loops with zero-width matches */
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }

    // Print the word that ends at this boundary; the next word starts here.
    console.log(str.substring(startIndex, m.index));
    startIndex = m.index;
}

// At this point we are still left with the last fruit, that is Peach.
console.log(str.substring(startIndex));

Here the result will be seen in the console window as below:

Apple
Orange
Banana
Strawberry
Peach

If you don’t have access to a JavaScript editor, visit this link to write and test your JavaScript code: https://js.do
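
When only the resulting array is needed, the manual loop is not strictly necessary. As a minimal sketch, assuming a JavaScript engine that supports lookbehind (recent versions of Node.js and modern browsers do), the same zero-width pattern can be handed to split. The variable names are illustrative.

```javascript
// Zero-width boundary: a lowercase letter behind, an uppercase letter ahead.
const wordBoundary = /(?<=[a-z])(?=[A-Z])/;

// split() cuts the string at every position where the boundary matches.
const fruits = "AppleOrangeBananaStrawberryPeach".split(wordBoundary);

console.log(fruits);
```

Because the pattern consumes no characters, nothing is removed from the string; it is only cut at the boundaries.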


IP Address Locator

Creating a regular expression to look for an IP address seems quite simple at first, but it’s not that easy. There are a few variations out there on the internet, but the following is a pretty clean and straightforward version.

Here’s the expression.

^((25[0-5]|2[0-4][0-9]|[1]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[1]?[0-9][0-9]?)$

Let’s break it down as it sounds a little complex in the beginning:

“^” indicates the start of the string, so that we are sure the IP address is not part of something else.
Regular expressions have no way to compare one number with another, so the allowed range has to be spelled out as digit patterns. Let’s see how.

The expression

((25[0-5]|2[0-4][0-9]|[1]?[0-9][0-9]?)\.)

allows for a number in the range 250-255, 200-249, or 0-199. In essence, it allows the number to be between 0 and 255, and that’s what we want. After the valid number we need a “.”, which is matched literally by escaping it (\.). We need this same expression three times, so the expression becomes:

((25[0-5]|2[0-4][0-9]|[1]?[0-9][0-9]?)\.){3}

So far we are able to validate three octets, each followed by a “.”, like 123.0.12.

Remember that no octet can exceed 255, or else the expression will (and should) fail to match. Now we need the fourth octet, but without a trailing “.”. So our final expression will be:

^((25[0-5]|2[0-4][0-9]|[1]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[1]?[0-9][0-9]?)$

The $ indicates the end of string.
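
As a quick check, the expression can be wrapped in a small JavaScript helper. This is only a sketch; the function name isIPv4 and the sample addresses are assumptions made for illustration.

```javascript
// Three octets followed by a dot, then a fourth octet; each octet is 0-255.
const ipv4Pattern = /^((25[0-5]|2[0-4][0-9]|[1]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[1]?[0-9][0-9]?)$/;

// Hypothetical helper: returns true when the whole string is an IPv4 address.
function isIPv4(text) {
    return ipv4Pattern.test(text);
}

// Hypothetical sample inputs.
for (const candidate of ["192.168.1.1", "255.255.255.255", "256.1.1.1", "123.0.12.", "1.2.3"]) {
    console.log(candidate, isIPv4(candidate));
}
```

The first two samples are accepted and the last three are rejected. Note that the octet pattern also accepts a two-digit octet with a leading zero, such as 01.2.3.4; tighten it if that is not wanted.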


Negative Lookahead

With a negative lookahead we look ahead in the string to check that a given pattern does not match at the current position, without consuming any characters.
The syntax for a negative lookahead is as follows:

(?!x)	where x is the pattern that must not match at this point; if x does match, the overall match fails at this position.

Let’s say we wish to identify numbers greater than 3000 but less than 4000.
We can set up our regular expression as follows:

\b3(?!000)\d\d\d\b

Here we start from the digit 3 and then take a peek at the next three digits (the lookahead) to figure out whether they are 000 (making the number 3000, which is not wanted). If they are, the match fails at this position. Otherwise the next three digits are consumed, and the resulting number is certainly greater than 3000 and less than 4000.
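
A short JavaScript sketch shows which numbers the expression picks out of a piece of text. The sample text and variable names are made up for illustration.

```javascript
// A 3 followed by three digits, unless those three digits are 000.
const between3000And4000 = /\b3(?!000)\d\d\d\b/g;

// Hypothetical sample text.
const numberText = "2999 3000 3001 3500 3999 4000 30000";

const inRange = numberText.match(between3000And4000);
console.log(inRange);
```

Only 3001, 3500 and 3999 are reported. 3000 is rejected by the lookahead, and 30000 is rejected as well because the three digits after its leading 3 are 000.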

Positive Lookahead

A positive lookahead works in the same way as negative lookahead, but the characters inside the lookahead have to match rather than not match. The syntax for a positive lookahead is as follows:

(?=x)	where x is the pattern you want to match and return true so that the expression adjacent to it gets evaluated.
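
Here is a small sketch of a positive lookahead in JavaScript. The scenario (picking out numbers that are followed by the unit px) and all names in it are hypothetical and serve only to illustrate the syntax.

```javascript
// One or more digits, but only when "px" follows; "px" itself is not consumed.
const pixelValues = /\d+(?=px)/g;

// Hypothetical sample text.
const styleText = "margin: 12px; line-height: 3em; width: 40px";

const pixelNumbers = styleText.match(pixelValues);
console.log(pixelNumbers);
```

The numbers 12 and 40 are returned without their unit, while 3 is skipped because it is followed by em rather than px.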

Simple Password Validation using Lookahead

Let’s dig deeper into the more interesting usage with an expression that validates a password.

Our password must meet these five conditions before it gets validated:

  1. The password must be between eight and fifteen characters long (.{8,15})
  2. It must include at least one lowercase character ([a-z])
  3. It must include at least three uppercase characters ([A-Z]{3})
  4. It must include at least one digit (\d)
  5. It must include at least one non-alphanumeric character, e.g., !@#$ ([^a-zA-Z0-9])

We’ll use the Lookahead technique to validate conditions 2 to 5. As we have mentioned above the lookahead looks ahead and sees if the given text matches the pattern. Note that it does not move the pointer ahead.

Let’s start with condition 1. We don’t use a lookahead for this; instead we match the whole input and require that whatever the user has entered is 8 to 15 characters long. Pretty simple!

The expression .{8,15} fulfills this condition. Remember a “.” means anything except a newline character and this is what we want.

So we’ll start our expression with the first condition in place as follows:

^.{8,15}$

Remember ^ indicates start of string and $ indicates end of string.

Let’s move on to condition 2. From here onward we use a positive lookahead (?=) for every condition, so that each condition is checked while the pointer stays at its current position. Condition 2 dictates: it must include at least one lowercase character, and we are going to use [a-z] to validate it. A technique called contrast is quite handy in this scenario. Contrast means first matching zero or more characters other than the one we want and then immediately matching the character we are interested in; the two character classes are mutually exclusive. So the expression

(?=[^a-z]*[a-z])

will look ahead for 0 or more characters other than lowercase letters and then one lowercase letter. Excellent, isn’t it?

So our expression will become:

^(?=[^a-z]*[a-z]).{8,15}$

If you’re able to grasp this, the rest of the expression is just a replication of the similar conditions.

Ok now let’s move on to condition 3. The condition dictates: It must include at least three uppercase characters, and we are going to use [A-Z] to validate it. Here again we’ll use contrast and the expression will be like:

(?=(?:[^A-Z]*[A-Z]){3})

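Since we want at least 3 uppercase characters, and they may or may not be adjacent, the expression differs somewhat from the one above: the contrast pair [^A-Z]*[A-Z] is wrapped in a group and repeated three times with {3}. Note that (?: ) is a non-capturing group, not a lookahead; it only groups the pair so that the quantifier applies to it as a whole, without storing the matched text.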

So our expression will further grow to:

^(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3}).{8,15}$

So far our regular expression covers condition 1, 2 and 3, where it uses positive lookahead for the conditions 2 and 3.

Now let’s move on to condition 4. This condition dictates: It must include at least one digit, and we are going to use pretty simple \d character to validate it. Here again we’ll use contrast and the lookahead expression will be like:

(?=\D*\d)

So our expression will become:

^(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})(?=\D*\d).{8,15}$

Finally we are left with condition 5, which dictates: it must include at least one non-alphanumeric character (e.g., !@#$), and here’s what we use for this purpose:

(?=[a-zA-Z0-9]*[^a-zA-Z0-9])

Again this expression is self-explanatory, first look for zero or more alpha-numeric characters and then one non-alphanumeric one.

So our final expression will be like:

^(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})(?=\D*\d)(?=[a-zA-Z0-9]*[^a-zA-Z0-9]).{8,15}$

At first glance it looks like an ugly sequence of weird characters, but it is pretty intuitive (I bet) if you have gone through it carefully from the beginning.
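
To try the final expression out, here is a minimal JavaScript sketch. The function name isValidPassword and the sample passwords are hypothetical and chosen so that each one violates at most one condition.

```javascript
// One lookahead per condition 2-5, then the 8-15 length check on the whole input.
const passwordPattern = /^(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})(?=\D*\d)(?=[a-zA-Z0-9]*[^a-zA-Z0-9]).{8,15}$/;

// Hypothetical helper: true when the password meets all five conditions.
function isValidPassword(password) {
    return passwordPattern.test(password);
}

console.log(isValidPassword("ABCdef1!"));   // meets all five conditions
console.log(isValidPassword("AbCdEf1!x"));  // uppercase letters need not be adjacent
console.log(isValidPassword("abcdef1!"));   // no uppercase letters
console.log(isValidPassword("ABCdef1x"));   // no non-alphanumeric character
console.log(isValidPassword("ABCdef!!"));   // no digit
console.log(isValidPassword("ABc1!"));      // too short
```

The first two calls print true and the remaining four print false.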


Matching a URL the Perfect Way

Here’s the expression:

(http|https|ftp):[\/]{2}([a-zA-Z0-9\-]+\.)+([a-zA-Z]{2,6})(:[0-9]+)?\/?([a-zA-Z0-9\-\._\?\,\'\/\\\+&%\$#\=~]*)

You may have a panic-stricken face once again after looking at this long stream of weird characters. But I’m sure it won’t be too difficult to comprehend once it is broken down.

So here’s the explanation on how it is constructed.

  • (http|https|ftp):[\/]{2} Should start with http, https or ftp followed by ://
  • ([a-zA-Z0-9\-]+\.)+ Should match one or more domain/subdomain labels, each followed by a dot, without the TLD, like www.yahoo., google., mail.yahoo.
  • ([a-zA-Z]{2,6}) The part after the last dot of the domain, that is the Top Level Domain (TLD), e.g., com, net, org, co etc.
  • (:[0-9]+)? May contain a port specification like http://www.mywebsite.com:3082. Note that the ? quantifier dictates that the port may or may not exist.
  • \/?([a-zA-Z0-9\-\._\?\,\'\/\\\+&%\$#\=~]*) An optional forward slash “/”, followed by any number of characters from the listed set: digits, letters, dots, hyphens, forward slashes and so on. The expression sets no limit on their count (though there’s still a limit on the length of a URL).

Here’s the JavaScript flavor of this regular expression. Give it a whirl with several variations and see how it validates them.

<script>
var regex = /(http|https|ftp):[\/]{2}([a-zA-Z0-9\-]+\.)+([a-zA-Z]{2,6})(:[0-9]+)?\/?([a-zA-Z0-9\-\._\?\,\'\/\\\+&%\$#\=~]*)/gmi;
var str = "http://www.google.com";
//var str = "https://www.google.com";
//var str = "https://google.com/helloworld";
//var str = "ftp://google.com";
//var str = "smtp://www.myweb.com/thisstringshouldnotmatch"

var m;
m = regex.exec(str);
console.log(m);

if(m!==null){
	alert("matched");
}else{
	alert("not matched");
}	

</script>

Validating Email Format

^([a-z0-9_\.]+)(?<!\.)@([\da-z\-]+\.)+([a-z]{2,6})$

Here’s a little elaboration about how it is constructed.

  • ([a-z0-9_\.]+) The user id, which can contain letters, digits, underscore “_” and dot “.”. The dot is needed to allow user names like abc.xyz. Please note that some other characters are also allowed in email addresses; you can easily add them to this character class.
  • (?<!\.) This is a negative lookbehind assertion. In the expression above the character “@” is placed right after it, so it asserts that there is no dot right before the @ character. You may drop this check if you want to allow a dot at the end of the user id.
  • @ The familiar @ sign, which is a mandatory part of an email address.
  • ([\da-z\-]+\.)+ A domain or subdomain name, e.g., “yahoo.”, “abc.pqr.xyz.”.
  • ([a-z]{2,6}) The part after the last dot of the domain name (TLD), e.g., com, net, org etc. The {2,6} quantifier allows TLDs of two to six letters; longer TLDs need a larger upper bound.

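This expression validates most of the email addresses you are likely to encounter. However, as mentioned above, it does not cover all allowed characters, particularly those that are rarely used in an email address. You may add them if required. Note also that the character classes contain only lowercase letters, so use a case-insensitive flag if uppercase input must be accepted.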
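
The following JavaScript sketch exercises the expression. It adds the i flag so that uppercase input is accepted, which is an assumption on top of the original pattern; the function name isEmail and the sample addresses are also made up for illustration. It needs an engine with lookbehind support.

```javascript
// Same pattern as above; the "i" flag (an assumption) makes it case-insensitive.
const emailPattern = /^([a-z0-9_\.]+)(?<!\.)@([\da-z\-]+\.)+([a-z]{2,6})$/i;

// Hypothetical helper: true when the whole string has the expected email format.
function isEmail(text) {
    return emailPattern.test(text);
}

console.log(isEmail("abc.xyz@yahoo.com"));     // dot inside the user id is fine
console.log(isEmail("User@Mail.Example.ORG")); // accepted because of the i flag
console.log(isEmail("abc.@yahoo.com"));        // dot right before @ is rejected
console.log(isEmail("user@example"));          // no TLD
console.log(isEmail("user@example.c"));        // TLD shorter than two letters
```

The first two calls print true and the last three print false.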

 

Some Postal Codes Validations

Let’s do something relatively simple now. Let’s validate some prominent postal code formats using regular expressions.

US Postal Code:

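We need to validate a US postal code, allowing both the five-digit and nine-digit (ZIP+4) formats. The regex should match 12345 and 12345-6784, but not 1234, 123456, 123456789, or 1234-56789.

Here’s the regular expression. There’s not much to explain for this particular case: five digits, optionally followed by a hyphen and four more digits. Note that the word boundaries (\b) suit searching inside longer text; a search would still find 56789 inside 1234-56789, so when validating a whole field replace them with ^ and $.

\b[0-9]{5}(-[0-9]{4})?\b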
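
The sketch below checks whole values in JavaScript. It assumes the word boundaries are swapped for the anchors ^ and $, since a search with \b can still find a five-digit run inside a longer string; the names and samples are illustrative.

```javascript
// Anchored variant (an assumption): the whole string must be a ZIP or ZIP+4 code.
const zipPattern = /^[0-9]{5}(-[0-9]{4})?$/;

// Word-boundary variant: finds codes inside longer text.
const zipSearchPattern = /\b[0-9]{5}(-[0-9]{4})?\b/;

// Hypothetical sample inputs.
for (const zip of ["12345", "12345-6784", "1234", "123456", "123456789", "1234-56789"]) {
    console.log(zip, zipPattern.test(zip));
}

// The search variant still finds "56789" inside this invalid value.
const partialMatch = zipSearchPattern.exec("1234-56789");
console.log(partialMatch[0]);
```

With the anchors in place only 12345 and 12345-6784 are accepted.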

Canadian Postal Codes:

We’re searching for all Canadian postal codes in a column:

'\b[ABCEGHJKLMNPRSTVXY][0-9][A-Z] [0-9][A-Z][0-9]\b'

Let’s try to comprehend it. The first letter must not be one of D, F, I, O, Q, U, W and Z (so we leave those letters out of the character class), then comes a single digit, then a single letter with no restriction, then a space, then a single digit, then again any letter from A to Z, and at last a single digit. Quite simple to understand and to build.
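
Here is a minimal JavaScript sketch that searches a piece of text with this expression. The sample sentence and variable names are hypothetical.

```javascript
// First letter from the allowed set, then digit, letter, space, digit, letter, digit.
const canadianPostalCode = /\b[ABCEGHJKLMNPRSTVXY][0-9][A-Z] [0-9][A-Z][0-9]\b/g;

// Hypothetical sample text with two valid and two invalid codes.
const addressText = "Ship to K1A 0B1 or M5V 3L9, not D1A 0B1 or W2B 4C5.";

const postalCodes = addressText.match(canadianPostalCode);
console.log(postalCodes);
```

Only K1A 0B1 and M5V 3L9 are found; the other two start with D and W, which are not in the allowed set.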

UK Postal Codes:

Building a complex but short regular expression for UK Postal codes validation is nearly impossible because UK Postal codes assume one of the following 6 different formats.

Format     Coverage                        Example
AA9A 9AA   WC, EC1-EC4, NW1W, SE1P, SW1    EC1A 1BB
A9A 9AA    E1W, N1C, N1P                   W1A 0AX
A9 9AA     B, E, G, L, M, N, S, W          M1 1AE
A99 9AA    B, E, G, L, M, N, S, W          B33 8TH
AA9 9AA    Rest of the Postal Codes        CR2 6XH
AA99 9AA   Rest of the Postal Codes        DN55 1PT

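So we have to build a separate chunk of expression for each format and “or” them all together using the “|” operator. The result is the following fairly large but easy-to-follow regular expression. It has five alternatives rather than six because the second one, with [A-Za-z\d] in its second position, covers both the A99 9AA and the AA9 9AA formats.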

(?:[A-Za-z]\d ?\d[A-Za-z]{2})|(?:[A-Za-z][A-Za-z\d]\d ?\d[A-Za-z]{2})|(?:[A-Za-z]{2}\d{2} ?\d[A-Za-z]{2})|(?:[A-Za-z]\d[A-Za-z] ?\d[A-Za-z]{2})|(?:[A-Za-z]{2}\d[A-Za-z] ?\d[A-Za-z]{2})
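
As a sketch of how this could be used for validation in JavaScript, the five alternatives are kept as separate patterns below and joined inside ^(?: ... )$ so that the whole input has to match. The anchoring, the function name isUkPostcode and the way the pattern is assembled are assumptions made for illustration.

```javascript
// The five alternatives from the expression above, one per entry.
const postcodeFormats = [
    /[A-Za-z]\d ?\d[A-Za-z]{2}/,            // A9 9AA
    /[A-Za-z][A-Za-z\d]\d ?\d[A-Za-z]{2}/,  // A99 9AA and AA9 9AA
    /[A-Za-z]{2}\d{2} ?\d[A-Za-z]{2}/,      // AA99 9AA
    /[A-Za-z]\d[A-Za-z] ?\d[A-Za-z]{2}/,    // A9A 9AA
    /[A-Za-z]{2}\d[A-Za-z] ?\d[A-Za-z]{2}/, // AA9A 9AA
];

// Join the alternatives with "|" and anchor the result to the whole string.
const ukPostcodePattern = new RegExp(
    "^(?:" + postcodeFormats.map((format) => format.source).join("|") + ")$"
);

// Hypothetical helper: true when the whole string looks like a UK postcode.
function isUkPostcode(text) {
    return ukPostcodePattern.test(text);
}

// The example column of the table above.
for (const code of ["EC1A 1BB", "W1A 0AX", "M1 1AE", "B33 8TH", "CR2 6XH", "DN55 1PT"]) {
    console.log(code, isUkPostcode(code));
}
```

All six examples from the table are accepted. The pattern checks only the shape of the code, not whether the letters form a real postcode area.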

