Regular Expressions

Regular Expressions are search patterns for “Regular Text”. They are used by many different tools and languages, including the Linux grep command, the Windows findstr command, less, vi/vim, sed, awk, Perl, Python, and many others.

Why Use Regular Expressions?

Regular Expressions can be a little daunting to learn: they often look like someone was bashing their head against the keyboard (or like a cat was lying on it). But they are very powerful: a well-written regular expression can replace many pages of code in a programming language such as C or C++, so it is worth investing some time to understand them.

The Seven Basic Elements of Regular Expressions

Characters

In a regular expression (regexp), any character that doesn't otherwise have a special meaning matches that character. So the digit "5", for example, matches the digit "5"; similarly "cat" matches the letters "c", "a", and "t" in sequence.

A backslash can be used to remove any special meaning which a character has. The period character "." is a type of wildcard (see below), so to search for a literal period, we place a backslash in front of it: "\."
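The two rules above can be tried out with a minimal sketch. It uses Python's re module, which is an assumption made for illustration (the text above is not tied to one tool), and the helper name and sample strings are invented for the example.

```python
import re

def found(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

# Ordinary characters match themselves, in sequence.
print(found(r"cat", "concatenate"))   # True
print(found(r"cat", "c a t"))         # False

# An unescaped period is a wildcard, so "3.14" also matches "3x14".
print(found(r"3.14", "3x14"))         # True

# With a backslash, the period only matches a literal period.
print(found(r"3\.14", "3x14"))        # False
print(found(r"3\.14", "pi is 3.14"))  # True
```

The raw-string prefix (r"...") keeps Python itself from interpreting the backslash before the regular expression engine sees it.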

Wildcards

A period "." will match any single character. Similarly, three periods "..." will match any three characters.
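A short sketch of the wildcard, again assuming Python's re module purely for illustration. The helper whole_line is a hypothetical name; it uses re.fullmatch so that the pattern must account for every character in the string.

```python
import re

def whole_line(pattern, line):
    """Return True if the pattern matches the entire line."""
    return re.fullmatch(pattern, line) is not None

print(whole_line(r"c.t", "cat"))  # True
print(whole_line(r"c.t", "c-t"))  # True: "." matches any single character
print(whole_line(r"c.t", "ct"))   # False: "." must consume exactly one character
print(whole_line(r"...", "abc"))  # True
print(whole_line(r"...", "ab"))   # False: only two characters
```

One dialect difference to be aware of: in Python, "." does not match a newline character unless the re.DOTALL flag is given.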

Bracket Expressions / Character Classes

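Bracket Expressions or Character Classes are contained in square brackets "[ ]" and match any single character from the enclosed set; for example, "[AEIOUYaeiouy]" matches one vowel in either case.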

Repetition

Alternation

Grouping

Anchors

Examples

Each example below lists a description, the regexp (GNU Extended Grep dialect, “grep -E”), sample lines that match, sample lines that do not match, and comments.

Description: A specific word
Regexp: Hello
Matches:
  Hello
  Hello there!
  Hello, World!
  He said, “Hello James”, in a very threatening tone
Does not match:
  Hi there
  Hell of a Day
  h el lo

Description: A specific word with nothing else on the line
Regexp: ^Hello$
Matches:
  Hello
Does not match:
  Hello there!
  Hello, World!
  He said, “Hello James”, in a very threatening tone
  Hi there
  Hell of a Day
  h el lo
Comments: This will match “Hello” only when it is the entire line, and will not permit any variations, such as spaces in the word or UPPER-/lower-case changes.

Description: 5-character line
Regexp: ^.....$
Matches:
  rouge
  green
  Ho-ho
Does not match:
  Yellow
  long line
  tiny
  12-45-78
Comments: The anchors ^ and $ prevent any extra characters from appearing before or after the five characters on the line.

Description: Lines that start with a vowel
Regexp: ^[AEIOUYaeiouy]
Matches:
  Allo
  Everything
  Energy
  Under
  Yellow
  everything
Does not match:
  Hello
  White
  4164915050
  Grinch
Comments: The character class includes both UPPERCASE and lowercase letters. You could instead use the option (specific to the tool you're using) to ignore case; for example, -i for grep or /I for findstr.

Description: Lines that end in a punctuation mark
Regexp: [[:punct:]]$
Matches:
  Hello there!
  Thanks.
  What do you think?
Does not match:
  Hello there
  416-491-5050
  New Year greetings

Description: An integer
Regexp: ^[-+]?[[:digit:]]+$
Matches:
  +15
  -2
  720
  1440
  1280
  1920
  000
  012
Does not match:
  + 4
  3.14
  0x47
  $1.13
  $4
  123,456
Comments: This looks for lines that start with a + or - (optional), then contain only digits.

Description: A decimal number
Regexp: ^[-+]?[[:digit:]]+\.?[[:digit:]]*$
Matches:
  +3.14
  42
  -1000.0
  +212
  +36.7
  42.00
  3.333333333
  0.976
Does not match:
  .976
  +-200
  1.1.1.1
  13.4.7
Comments: This will match lines that start with + or - (optional), then contain one or more digits, then optionally contain a decimal point followed by zero or more additional digits.

Description: A Canadian Postal Code
Regexp: ^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ] ?[0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]$
Matches:
  H0H 0H0
  M3C 1L2
  K1A 0A2
  T2G 0P3
  V8W 9W2
  R3B 0N2
  M2J2X5
  M5S 2C6
Does not match:
  POB 1L0
  90210
  MN4 2R6
Comments: A Canadian postal code alternates between letters and digits: A9A 9A9. The first letter must be one of ABCEGHJKLMNPRSTVXY and the remaining letters must be one of ABCEGHJKLMNPRSTVWXYZ.

Description: Phone Numbers (Canada/US)
Regexp: ^[^+[:digit:]]*(\+?1)?[^+[:digit:]]*[2-9]([^+[:digit:]]*[0-9]){9}[^+[:digit:]]*$
Matches:
  (416) 967-1111
  +1 416-736-3636
  416-439-0000
Does not match:
  +65 6896 2391
  555-1212
Comments: A Canadian/US phone number consists of a 3-digit Area Code (which may not start with 0 or 1) and a 7-digit local number consisting of an exchange (3 digits) and a line (4 digits). The country code for Canada and the US is 1, so the number may be preceded by +1 or 1. Area codes are sometimes contained in parentheses, and dashes or spaces are sometimes used as separators.

Description: IP Address (IPv4 dotted quad)
Regexp: ^((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])$
Matches:
  1.1.1.1
  4.4.8.8
  8.8.8.8
  7.12.9.43
  10.106.32.109
  172.16.97.1
  192.168.0.1
  255.255.255.0
Does not match:
  IP=67.69.105.143
  1.10.100.1000
  IP=100.150.200.250
  103.271.92.16
  1O.10.10.10
Comments: An IPv4 address in “dotted quad” notation consists of four numbers in the range 0-255 separated by periods. The numbers are called “octets” (which means a collection of eight bits, an alternate way of saying “byte”).

Description: Private IP Address
Regexp: ^(10\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])|192\.168|172\.(1[6-9]|2[0-9]|3[0-1]))\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])$
Matches:
  10.4.72.13
  172.16.97.1
  192.168.0.1
Does not match:
  IP=192.168.113.42
  1.1.1.1
  4.4.8.8
  192.169.12.6
  192.168.400.37
  Address is 1 . 2 . 3 . 4
Comments: Private IP addresses are defined as: valid IPv4 dotted quad addresses with a first octet of 10; or first two octets of 192.168; or first octet of 172 followed by a second octet in the range 16-31.
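Several of the examples above can be checked mechanically. The sketch below is one way to do that in Python, chosen here only for illustration. Python's re module does not support POSIX named classes such as [[:digit:]], so [0-9] is substituted as an assumption; the constant names and the split_lines helper are hypothetical.

```python
import re

# Patterns from the examples above, with [[:digit:]] written as [0-9].
INTEGER = r"^[-+]?[0-9]+$"
DECIMAL = r"^[-+]?[0-9]+\.?[0-9]*$"
POSTAL_CODE = (r"^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ] ?"
               r"[0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]$")
OCTET = r"(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])"
IPV4 = r"^(" + OCTET + r"\.){3}" + OCTET + r"$"

def split_lines(pattern, lines):
    """Return (matching, non_matching) lists, roughly like grep -E and grep -Ev."""
    regex = re.compile(pattern)
    matching = [line for line in lines if regex.search(line)]
    non_matching = [line for line in lines if not regex.search(line)]
    return matching, non_matching

print(split_lines(INTEGER, ["+15", "000", "+ 4", "3.14", "123,456"]))
print(split_lines(DECIMAL, ["+3.14", "42", ".976", "+-200", "1.1.1.1"]))
print(split_lines(POSTAL_CODE, ["H0H 0H0", "M2J2X5", "POB 1L0", "90210"]))
print(split_lines(IPV4, ["1.1.1.1", "255.255.255.0", "IP=67.69.105.143",
                         "1.10.100.1000", "103.271.92.16", "1O.10.10.10"]))
```

Building the IPv4 pattern from a named OCTET piece keeps the long alternation readable and makes it harder to mistype one of the four repetitions.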

Regular Expression Dialects

Regular expressions have evolved over the years, and the various tools that handle regular expressions have different capabilities and slightly different syntax.

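In particular, the original Unix search tool grep came in three varieties: grep (basic regular expressions), egrep (extended regular expressions), and fgrep (fixed strings, with no regular expressions).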

The GNU project originally shipped all three commands, but fgrep and egrep were never fully standardized, so they were removed from the POSIX standard in 2001. More recently, GNU grep has also deprecated them: current releases print a warning recommending grep -E and grep -F instead.

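Unlike the original Unix grep, the GNU grep can handle the full extended regular expression syntax, in either of two ways: by using the -E option, or by placing a backslash in front of certain extended meta-characters (for example \+, \?, \|, \( and \)) in the default basic mode.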

Other tools, such as sed, similarly require backslashes in front of some of the extended regexp meta-characters (or, if you're using a GNU version of sed, you can use the -E option to enable extended regular expressions, just like GNU grep).

The Perl language introduced one of the most powerful and consistent versions of the regular expression language. There has been increasing consensus around “Perl-Compatible Regular Expressions” (aka PCRE) and that dialect is available in many tools (including GNU grep via the -P option, as well as the PCRE/PCRE2 library for C and C++ programs, which is used in many software packages including Safari and Apache httpd).

Using Regular Expressions

Regular expressions can be used in many places:

Windows findstr and Regular Expressions

The Windows findstr command accepts regular expressions or literal expressions. It will guess what you're using, and may guess incorrectly, so it's best to use the /R and /L options to directly specify whether your search pattern is a regexp or a literal.

Findstr permits multiple search patterns in a quoted string, separated by a space; this acts like a type of alternation. However, this makes it impossible to use a literal space in a search pattern. If you wish to include a space in your search pattern, prepend /C: to your search string. You can use multiple /C: search strings.

For example, FINDSTR /R /C:"red" /C:"blue" INPUTFILE is roughly equivalent to grep -E "red|blue" INPUTFILE.
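The equivalence described above, where several separate patterns behave like one alternation, can be modelled with a small sketch. This is Python standing in for both tools, not findstr or grep themselves; the function names and sample lines are made up for the illustration.

```python
import re

def alternation_search(line):
    """Model of a single pattern using alternation, like grep -E "red|blue"."""
    return re.search(r"red|blue", line) is not None

def multi_pattern_search(line, patterns=("red", "blue")):
    """Model of several separate patterns: the line is reported if any one matches."""
    return any(re.search(pattern, line) for pattern in patterns)

samples = ["a red door", "blue sky", "green field", "infrared", "Red Cross"]
for line in samples:
    # Both approaches agree on every sample line.
    assert alternation_search(line) == multi_pattern_search(line)

print([line for line in samples if multi_pattern_search(line)])
# ['a red door', 'blue sky', 'infrared']
```

"Red Cross" is not reported because the match is case-sensitive unless an ignore-case option is used.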

Findstr is also limited to (approximately) 127 characters in the regular expression.

For information on findstr's regular expression dialect, see help findstr. In particular, the findstr command does not support alternation with the | symbol, repetition other than with the * symbol, named character classes [[:name:]], or grouping ( ).
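Given those restrictions, a pattern written for grep -E often has to be rewritten using only characters, bracket expressions, anchors and the * symbol. The sketch below shows one such rewrite for the integer example and checks, in Python, that the restricted patterns select the same lines as the original. It models the idea only: it does not run findstr, and whether findstr accepts these exact patterns is an assumption that has not been verified here.

```python
import re

# Extended pattern, as used with grep -E (with [0-9] in place of [[:digit:]]).
ERE_INTEGER = r"^[-+]?[0-9]+$"

# Restricted rewrites: no "+", no "?", no "|", and no "( )".
RESTRICTED_INTEGER = (
    r"^[0-9][0-9]*$",      # unsigned: "x+" rewritten as "xx*"
    r"^[-+][0-9][0-9]*$",  # signed: the optional sign becomes a second pattern
)

def ere_match(line):
    return re.search(ERE_INTEGER, line) is not None

def restricted_match(line):
    # Several patterns act as alternation: any one of them may match.
    return any(re.search(pattern, line) for pattern in RESTRICTED_INTEGER)

samples = ["+15", "-2", "720", "000", "+ 4", "3.14", "0x47", "$4", "123,456", ""]
for line in samples:
    assert ere_match(line) == restricted_match(line)

print([line for line in samples if restricted_match(line)])
# ['+15', '-2', '720', '000']
```

The same two tricks (repeating an element before * and splitting optional parts into separate patterns) cover many simple cases, at the cost of longer patterns.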