Regular Expressions

A regular expression is a pattern that is matched against a subject string, from left to right. Some common uses of regular expressions (this list is not exhaustive) are:

  • Replacing text within a string

  • Capturing groups of information from a string

  • Validating data, like a user name with multiple constraints

  • much more…

A Trivial Example

As just mentioned, a common use case for regex is validating user input against a set of constraints. Taking the user name validation example, let’s look at how we might enforce the following constraints on user input:

  • Must begin with a capital letter

  • Must be at least 10 characters in length

  • Must end with a number

  • Can only be alphanumeric

This is a simple example, and its only purpose is to serve as an introduction to regular expressions. Let’s have a look at how we would implement such a scenario in Python and explain step by step what each part is doing. Don’t worry too much about understanding the following, as in this article we will be breaking down the core fundamentals of regular expressions into digestible chunks. The aim is that, by the end of it, you should be able to piece together complex expressions for matching an assortment of scenarios:

import re  # Import Python's regular expressions module

# For demonstration purposes, we will build the pattern over multiple steps
# for ease of understanding
pattern = r""  # starting point; empty raw string
pattern += r"[A-Z]{1}"  # First character MUST be an uppercase A -> Z character
pattern += r"[a-zA-Z0-9]{8,}"  # Must then contain AT LEAST 8 additional alphanumeric characters
# So far we have guaranteed an uppercase first character followed by at least 8 alphanumeric characters.
pattern += r"[0-9]{1}"  # Must end with a number

# Putting it all together then, pattern is:
pattern = r'[A-Z]{1}[a-zA-Z0-9]{8,}[0-9]{1}'
re.match(pattern, "ValidPassword2")  # <re.Match object; span=(0, 14), match='ValidPassword2'>
re.match(pattern, "invalidPassword5")  # None (no match due to missing initial capital)
re.match(pattern, "InvalidPassword")  # None (no match due to missing ending digit)
re.match(pattern, "Invalid5")  # None, too short!

If you are experienced in regular expressions, you may be screaming that there are other or better ways to do exactly this. With regular expressions there are often many ways to skin a cat, but for simplicity and to serve as an introduction, this is a decent enough example. Again, if this is all new to you, focus on trying to understand it rather than memorize it; we will be going in-depth shortly.

Note: Here we are reusing the pattern string. When reusing a pattern, it is advisable to compile it into a re.Pattern object using re.compile(pattern).
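A minimal sketch of that advice, assuming the same user name pattern as above. The names USERNAME_RE and is_valid_username are illustrative and not part of the original example. The sketch uses fullmatch rather than match, because match only anchors at the start of the string and would accept extra characters after the final digit:

```python
import re

# Compile once and reuse; USERNAME_RE and is_valid_username are hypothetical names
USERNAME_RE = re.compile(r"[A-Z][a-zA-Z0-9]{8,}[0-9]")


def is_valid_username(candidate: str) -> bool:
    """Return True only if the whole string satisfies the pattern."""
    # fullmatch requires the pattern to cover the entire string
    return USERNAME_RE.fullmatch(candidate) is not None


print(is_valid_username("ValidPassword2"))    # True
print(is_valid_username("ValidPassword2!!"))  # False
print(USERNAME_RE.match("ValidPassword2!!") is not None)  # True (prefix match only)
```

Whether to use match or fullmatch depends on whether trailing characters should be tolerated; for validation, checking the whole string is usually the safer choice.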

Regular Expr: Simple Matchers

In its simplest form, a regular expression is just a sequence of characters that we use to perform a search in a string. For each snippet in this article we will share an example of the syntax in action, as well as an interactive link so you can dabble and view it yourself.

Simple Matcher

Pattern    Subject String               Expected Match
example    This is a trivial example    example
bar        Foo bar                      bar

Try Simple Matcher: https://regex101.com/r/tTZsZN/1

Typically regular expressions are case sensitive (unless the i flag is used; more on that towards the end of the article under the flags section).

import re
re.match("foo", "Foo will not match")  # None (lowercase f does not match uppercase F)
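As a small preview of the flags section, the sketch below contrasts the default behaviour with Python's re.IGNORECASE flag. The subject strings are made up for illustration:

```python
import re

# Default behaviour: matching is case sensitive
print(re.match("foo", "Foo will not match"))  # None

# re.IGNORECASE (short form re.I) makes the match case insensitive
m = re.match("foo", "Foo will match now", flags=re.IGNORECASE)
print(m.group())  # Foo
```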

Regular Expr: Meta Characters

Meta characters are the bread and butter of regular expressions, and understanding them goes a long way towards demystifying a daunting regular expression. Here is a brief summary of the core meta characters:

Regex Meta Characters

Meta Character    Description
.                 Period. Matches any single character except a line break character, e.g. \n.
[]                Character classes. Match any character contained within the brackets.
[^]               Negated character classes. Match any character NOT contained within the brackets.
?                 Makes the preceding symbol optional.
+                 Matches one or more of the preceding symbol.
*                 Matches zero or more of the preceding symbol.
{i,j}             Braces. Match at least i but no more than j repetitions of the preceding symbol.
(foo)             Character group. Matches the characters foo in exactly that order.
|                 Alternation. Matches the characters either before or after the symbol.
\                 Escapes the next character. This allows using meta characters (and others) in their literal sense.
^                 Caret. Matches the beginning of the input (also used in negated character classes).
$                 Dollar sign. Matches the end of the input, e.g. ^foo$.
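A few of the entries in the table (alternation, escaping, anchors and braces) can be sketched quickly in Python. The subject strings here are invented purely for illustration:

```python
import re

# Alternation: either side of the | may match
print(re.findall(r"cat|dog", "cat, dog, bird"))  # ['cat', 'dog']

# Escaping: \. is a literal period rather than "any character"
print(re.findall(r"3\.14", "3.14 and 3x14"))  # ['3.14']

# Anchors: ^ and $ pin the match to the start and end of the input
print(re.search(r"^foo$", "foo") is not None)      # True
print(re.search(r"^foo$", "foo bar") is not None)  # False

# Braces: between 2 and 3 repetitions of the preceding symbol
print(re.findall(r"a{2,3}", "a aa aaa"))  # ['aa', 'aaa']
```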

Regular Expr: Meta -> .

The meta character . is used to indicate any single character. By default it excludes line breaks, but some regex implementations provide a flag that allows it to match line breaks as well. We will discuss that here using Python's DOTALL flag.

Meta Full Stop

Pattern    Subject String           Expected Match
.at        I put a hat on my cat    hat, cat
foo.       foo1 with foo2           foo1, foo2

Try Full Stop: https://regex101.com/r/Ii7Bj9/1

import re
pattern = r"foo."
re.findall(pattern, "foo1 with foo2")
# ["foo1", "foo2"]

Line breaks and Python's DOTALL flag example:

import re
foo = "foo\n"
re.match("foo.", foo)
# No match, as `.` does not match the newline
re.match("foo.", foo, flags=re.DOTALL)  # Match line breaks too!
# <re.Match object; span=(0, 4), match='foo\n'>

Regular Expr: Character Classes -> […]

Inside a character class most meta characters are treated as literal values, so they do not need to be escaped (the exceptions are ], \, a leading ^ and a - between two characters). Hyphens can be used inside character classes to signify a range, just like we used in the initial example (username validation). Character classes are denoted by square brackets [ ]. Order inside character classes does not matter:

Meta Character Classes

Pattern       Subject String             Expected Match
[Tt]he .at    The cat                    The cat
[sMc]at       The cat, sat on the Mat    cat, sat, Mat

Try Character Classes: https://regex101.com/r/8iSKB8/1

import re
pattern = re.compile(r"[sMc]at")
re.findall(pattern, "The cat sat on the Mat")
# ['cat', 'sat', 'Mat']
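The points about literal meta characters, ranges and ordering can be checked with a short sketch. The subject strings are assumptions chosen for illustration:

```python
import re

# Inside brackets, . and + lose their special meaning
print(re.findall(r"[.+]", "1.5 + 2"))  # ['.', '+']

# A hyphen between two characters denotes a range
print(re.findall(r"[a-c]", "abcdef"))  # ['a', 'b', 'c']

# Order inside the class does not matter: [cMs] behaves like [sMc]
print(re.findall(r"[cMs]at", "The cat sat on the Mat"))  # ['cat', 'sat', 'Mat']
```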

Regular Expr: Negated Character Classes -> [^…]

Similar to the character classes outlined previously, the negated character class matches anything except what is defined inside the square brackets. We mentioned previously how the caret ^ symbol can denote the start of the string; its additional use case is here, as the first character inside the brackets (negation in lookarounds is written differently; more on that later). Here we will find any words that do NOT start with a letter:

Meta Negated Character Classes

Pattern        Subject String    Expected Match
[^a-zA-Z].*    NoMatch           <no match>
[^a-zA-Z].*    5Matched          5Matched

Try Negated Character Classes: https://regex101.com/r/meqZgw/1

import re

pattern = re.compile(r"[^a-zA-Z].*")
re.match(pattern, "failed")  # None (starts with a letter)
re.match(pattern, "5Passed")  # <re.Match object; span=(0, 7), match='5Passed'>
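To make the two roles of the caret concrete, here is a small sketch with made-up subject strings. Outside the brackets ^ anchors to the start of the input, while as the first character inside the brackets it negates the class:

```python
import re

# [^0-9]+ matches runs of characters that are NOT digits
print(re.findall(r"[^0-9]+", "abc123def"))  # ['abc', 'def']

# First ^ anchors to the start; second ^ negates the class (not a vowel)
print(re.findall(r"^[^aeiou]", "regex"))  # ['r']
print(re.findall(r"^[^aeiou]", "apple"))  # []
```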

Note: There are some shorthand tricks with regex, which we will discuss later, things like \d and \w, but for simplicity, bear with me for now. You will also notice various functions of the Python re module here; the difference between re.search, re.match and re.findall will be outlined later on as well.

Regular Expr: Question Mark -> ?

The meta character ? indicates an optional preceding character (or group). This matches zero or one of the preceding character.

Meta Optional Repetition (?)

Pattern    Subject String    Expected Match
[Tt]?he    he                he
[Tt]?he    The               The

Try Optional Repetition (?): https://regex101.com/r/KQSs7f/1

import re
pattern = re.compile(r"[TS]?he")  # no | needed: inside brackets it would be a literal pipe
re.match(pattern, "The")  # <re.Match object; span=(0, 3), match='The'>
re.match(pattern, "She")  # <re.Match object; span=(0, 3), match='She'>
re.match(pattern, "he")  # <re.Match object; span=(0, 2), match='he'>
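A further sketch of ? applied to a single character and to a group. The (?:...) syntax is a non-capturing group, which this article has not introduced yet, and the subject strings are assumptions for illustration:

```python
import re

# u is optional, so both spellings match
print(re.findall(r"colou?r", "color and colour"))  # ['color', 'colour']

# Applied to a group, the whole group becomes optional
pattern = re.compile(r"(?:un)?happy")
print(pattern.findall("happy or unhappy"))  # ['happy', 'unhappy']
```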

Regular Expr: Plus -> +

The meta character + indicates one or more repetitions of the preceding character. Unlike *, there must be at least one occurrence. If used after a character class or capture group, it applies to the whole class or group. So for example:

Meta Required Repetition (+)

Pattern    Subject String                  Expected Match
a+bc       aaaaaaaaaaaaaaaaaaaaaaaaaabc    aaaaaaaaaaaaaaaaaaaaaaaaaabc
a+bc       bc                              <No Match>

Try Required Repetition (+): https://regex101.com/r/sH0Bmf/1

import re

pattern = re.compile(r"a+bc.*")
re.match(pattern, "abcdef")  # <re.Match object; span=(0, 6), match='abcdef'>
re.match(pattern, "abc")  # <re.Match object; span=(0, 3), match='abc'>
re.match(pattern, "bc")  # None
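The behaviour of + after a character class or group can be sketched as follows. The (?:...) non-capturing group and the subject strings are assumptions used only for illustration:

```python
import re

# + after a character class: one or more digits in a row
print(re.findall(r"[0-9]+", "10 apples, 200 pears"))  # ['10', '200']

# + after a group: one or more repetitions of the whole group
m = re.match(r"(?:ab)+", "ababab!")
print(m.group())  # ababab

# At least one repetition is required, so this does not match
print(re.match(r"(?:ab)+", "ba"))  # None
```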

Regular Expr: Star -> *

In a similar sense to the + repetition meta character, * indicates that the preceding character may appear zero or more times, so it can be absent or repeated any number of times. If used after a character class or capture group, it applies to the whole class or group.

Meta Optional Repetition (*)

Pattern    Subject String                  Expected Match
a*bc       aaaaaaaaaaaaaaaaaaaaaaaaaabc    aaaaaaaaaaaaaaaaaaaaaaaaaabc
a*bc       bc                              bc

As you can see above, the core difference between + and * here is that the pattern a*bc will match whether a exists or not. A simple demonstration of that is outlined below:

Try Optional Repetition (*): https://regex101.com/r/sH0Bmf/1

import re

star = r"a*bc"
plus = r"a+bc"
text = "bc"
re.search(plus, text)  # None (no match!)
re.search(star, text) # "bc" <re.Match object; span=(0, 2), match='bc'>
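The same idea extends to character classes and groups. One consequence worth noting is that a pattern built only from * can match the empty string. This is a sketch with illustrative inputs, again using the (?:...) non-capturing group that the article has not covered yet:

```python
import re

# * after a character class: any number of lowercase letters, including none
print(re.fullmatch(r"[a-z]*", "hello") is not None)  # True
print(re.fullmatch(r"[a-z]*", "") is not None)       # True

# * after a group: the whole group may repeat or be absent
print(re.match(r"(?:ab)*c", "ababc").group())  # ababc
print(re.match(r"(?:ab)*c", "c").group())      # c
```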

Regular Expr: Braces -> { }

Braces (also known as quantifiers) are used to apply constraints to the number of repetitions of the previous character or group of characters. Let’s say we wanted to write some