ops102:regular_expressions [2024/04/01 09:28] – [Windows findstr and Regular Expressions] chris | ops102:regular_expressions [2025/03/14 02:05] (current) – [Examples] chris |
---|
| |
**Regular Expressions** are search patterns for "Regular Text". They are used by many different tools and languages, including the Linux grep command, the Windows findstr command, less, vi/vim, sed, awk, perl, python, and many others. | **Regular Expressions** are search patterns for "Regular Text". They are used by many different tools and languages, including the Linux grep command, the Windows findstr command, less, vi/vim, sed, awk, perl, python, and many others. |
| |
===== Video ===== | |
| |
* [[https://seneca-my.sharepoint.com/:v:/g/personal/chris_tyler_senecapolytechnic_ca/EUGN0BHIlzlCmrjXwZgYdSQBoJvWjX9wwfDZKFKS9sGXVg|Video Lecture on Regular Expressions]] (This is an extended version of the lecture given in class on March 27/28 made available for review). | |
| |
===== Why Use Regular Expressions? ===== | ===== Why Use Regular Expressions? ===== |
* A repeat count can be placed in curly brackets. It applies to the previous element: ''<nowiki>"x{3}"</nowiki>'' matches ''<nowiki>"xxx"</nowiki>'' | * A repeat count can be placed in curly brackets. It applies to the previous element: ''<nowiki>"x{3}"</nowiki>'' matches ''<nowiki>"xxx"</nowiki>'' |
* A repeat can be a range, written as min,max in curly brackets: ''<nowiki>"x{2,5}"</nowiki>'' will match ''<nowiki>"xx"</nowiki>'', ''<nowiki>"xxx"</nowiki>'', ''<nowiki>"xxxx"</nowiki>'', or ''<nowiki>"xxxxx"</nowiki>'' | * A repeat can be a range, written as min,max in curly brackets: ''<nowiki>"x{2,5}"</nowiki>'' will match ''<nowiki>"xx"</nowiki>'', ''<nowiki>"xxx"</nowiki>'', ''<nowiki>"xxxx"</nowiki>'', or ''<nowiki>"xxxxx"</nowiki>'' |
* The maximum value in a range can be omitted: ''<nowiki>"x{2,}"</nowiki>'' will two or more ''<nowiki>"x"</nowiki>'' characters in a row | * The maximum value in a range can be omitted: ''<nowiki>"x{2,}"</nowiki>'' will match two or more ''<nowiki>"x"</nowiki>'' characters in a row |
* There are short forms for some commonly-used ranges: | * There are short forms for some commonly-used ranges: |
* ''<nowiki>"*"</nowiki>'' is the same as ''<nowiki>"{0,}"</nowiki>'' (zero or more) | * ''<nowiki>"*"</nowiki>'' is the same as ''<nowiki>"{0,}"</nowiki>'' (zero or more) |
* A caret symbol will match the start of a line: ''<nowiki>"^[[:upper:]]"</nowiki>'' will match lines that start with an uppercase letter. | * A caret symbol will match the start of a line: ''<nowiki>"^[[:upper:]]"</nowiki>'' will match lines that start with an uppercase letter. |
* A dollar sign will match the end of a line: ''<nowiki>"[[:punct:]]$"</nowiki>'' will match lines that end with a punctuation mark. | * A dollar sign will match the end of a line: ''<nowiki>"[[:punct:]]$"</nowiki>'' will match lines that end with a punctuation mark. |
* The two characters may be used together: ''<nowiki>"cat"</nowiki>'' will match the word ''<nowiki>"cat"</nowiki>'' anywhere on a line, but ''<nowiki>"^cat$"</nowiki>'' will only match lines that contain //only// the word ''<nowiki>"cat"</nowiki>''. Likewise, ''<nowiki>"^[0-9.]+$"</nowiki>'' will match lines that are made up of only digits and dot characters. | * The two anchors may be used together: ''<nowiki>"cat"</nowiki>'' will match the word ''<nowiki>"cat"</nowiki>'' anywhere on a line, but ''<nowiki>"^cat$"</nowiki>'' will only match lines that contain //just// the word ''<nowiki>"cat"</nowiki>'' and nothing else. Similarly, ''<nowiki>"^[0-9.]+$"</nowiki>'' will match lines that are made up of only digits and dot characters, and ''<nowiki>"^...$"</nowiki>'' or ''<nowiki>"^.{3}$"</nowiki>'' will only match lines that contain exactly three characters. |
| |
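The repeat counts and anchors described above can be tried out with a short Python sketch. The helper name ''matching_lines'' and the sample lines are made up for illustration; the function returns every line in which the pattern is found, roughly the way ''grep -E'' selects lines. Python's ''re'' module does not support POSIX named classes such as ''<nowiki>[[:upper:]]</nowiki>'', so the sketch uses the range ''[A-Z]'' instead.

```python
import re


def matching_lines(pattern, lines):
    """Return the lines in which the pattern is found (similar to grep -E)."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


# Hypothetical sample input, one string per line of text.
lines = ["x", "xx", "xxxxx", "cat", "concatenate", "abc", "Hello."]

print(matching_lines(r"x{3}", lines))    # ['xxxxx']
print(matching_lines(r"x{2,5}", lines))  # ['xx', 'xxxxx']
print(matching_lines(r"cat", lines))     # ['cat', 'concatenate']
print(matching_lines(r"^cat$", lines))   # ['cat']
print(matching_lines(r"^.{3}$", lines))  # ['cat', 'abc']
print(matching_lines(r"^[A-Z]", lines))  # ['Hello.']
```

Without anchors, a pattern is found anywhere on the line, which is why ''x{3}'' selects the five-character line ''xxxxx''; adding ''^'' and ''$'' forces the pattern to account for the whole line.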
===== Examples ===== | ===== Examples ===== |
| |
| |
^Description^Regexp (GNU Extended Grep dialect - "grep -E")^Matches^Does not match^Comments^ | ^Description^Regexp (GNU Extended Grep dialect - "grep -E")^Matches these lines^Does not match these lines^Comments^ |
|A specific word|''<nowiki>Hello</nowiki>''|Hello\\ Hello there!\\ Hello, World!\\ He said, "Hello James", in a very threatening tone|Hi there\\ Hell of a Day\\ h el lo| | |A specific word|''<nowiki>Hello</nowiki>''|Hello\\ Hello there!\\ Hello, World!\\ He said, "Hello James", in a very threatening tone|Hi there\\ Hell of a Day\\ h el lo|This will match "Hello" anywhere on the line, but not permit any variations, such as spaces in the word or UPPER-/lower-case changes.| |
|A specific word with nothing else on the line|''<nowiki>^Hello$</nowiki>''|Hello|Hello there!\\ Hello, World!\\ He said, "Hello James", in a very threatening tone\\ Hi there\\ Hell of a Day\\ h el lo|This will match "Hello" anywhere on the line, but not permit any variations, such as spaces in the word or UPPER-/lower-case changes.| | |A specific word with nothing else on the line|''<nowiki>^Hello$</nowiki>''|Hello|Hello there!\\ Hello, World!\\ He said, "Hello James", in a very threatening tone\\ Hi there\\ Hell of a Day\\ h el lo| | |
|5-character line|''<nowiki>^.....$</nowiki>''|rouge\\ green\\ Ho-ho\\ |Yellow\\ long line\\ tiny\\ 12-45-78|The anchor characters prevent extra characters from existing between the five characters and the start and end of the line.| | |5-character line|''<nowiki>^.....$</nowiki>''|rouge\\ green\\ Ho-ho\\ |Yellow\\ long line\\ tiny\\ 12-45-78|The anchor characters prevent extra characters from existing between the five characters and the start and end of the line.| |
|Lines that start with a vowel|''<nowiki>^[AEIOUYaeiouy]</nowiki>''|Allo\\ Everything\\ Energy\\ Under\\ Yellow\\ everything|Hello\\ White\\ 4164915050\\ Grinch|The character class includes both UPPERCASE and lowercase letters. You could instead use the option (specific to the tool you're using) to ignore case; for example, ''-i'' for grep or ''/I'' for findstr.| | |Lines that start with a vowel|''<nowiki>^[AEIOUYaeiouy]</nowiki>''|Allo\\ Everything\\ Energy\\ Under\\ Yellow\\ everything|Hello\\ White\\ 4164915050\\ Grinch|The character class includes both UPPERCASE and lowercase letters. You could instead use the option (specific to the tool you're using) to ignore case; for example, ''-i'' for grep or ''/I'' for findstr.| |
|An integer|''<nowiki>^[-+]?[[:digit:]]+$</nowiki>''|+15\\ -2\\ 720\\ 1440\\ 1280\\ 1920\\ 000\\ 012|+ 4\\ 3.14\\ 0x47\\ $1.13\\ $4\\ 123,456|This looks for lines that start with a + or - (optional), then contain digits.| | |An integer|''<nowiki>^[-+]?[[:digit:]]+$</nowiki>''|+15\\ -2\\ 720\\ 1440\\ 1280\\ 1920\\ 000\\ 012|+ 4\\ 3.14\\ 0x47\\ $1.13\\ $4\\ 123,456|This looks for lines that start with a + or - (optional), then contain digits.| |
|A decimal number|''<nowiki>^[-+]?[[:digit:]]+\.?[[:digit:]]*$</nowiki>''|+3.14\\ 42\\ -1000.0\\ +212\\ +36.7\\ 42.00\\ 3.333333333\\ 0.976|.976\\ +-200\\ 1.1.1.1\\ 13.4.7|This will match lines that start with + or - (optional), then contain digits, then optionally contain a decimal point followed by zero or more additional digits.| | |A decimal number|''<nowiki>^[-+]?[[:digit:]]+\.?[[:digit:]]*$</nowiki>''|+3.14\\ 42\\ -1000.0\\ +212\\ +36.7\\ 42.00\\ 3.333333333\\ 0.976|.976\\ +-200\\ 1.1.1.1\\ 13.4.7|This will match lines that start with + or - (optional), then contain digits, then optionally contain a decimal point followed by zero or more additional digits.| |
|A Canadian Postal Code|''<nowiki>^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ] ?[0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]$</nowiki>''|H0H 0H0\\ M3C 1L2\\ K1A 0A2\\ T2G 0P3\\ V8W 9W2\\ R3B 0N2\\ M2J2X5\\ M5S 2C6|POB 1L0\\ 90210\\ MN4 2R6|A Canadian postal code alternates between letters and digits: A9A 9A9. The first letter must be of of ABCEGHJKLMNPRSTVXY and the remaining letters must be one of ABCEGHJKLMNPRSTVWXYZ.| | |A Canadian Postal Code|''<nowiki>^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ] ?[0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]$</nowiki>''|H0H 0H0\\ M3C 1L2\\ K1A 0A2\\ T2G 0P3\\ V8W 9W2\\ R3B 0N2\\ M2J2X5\\ M5S 2C6|POB 1L0\\ 90210\\ MN4 2R6|A Canadian postal code alternates between letters and digits: A9A 9A9. The first letter must be one of ABCEGHJKLMNPRSTVXY and the remaining letters must be one of ABCEGHJKLMNPRSTVWXYZ.| |
|Phone Numbers (Canada/US)|''<nowiki>^[^+[:digit:]]*(\+?1)?[^+[:digit:]]*[2-9]([^+[:digit:]]*[0-9]){9}[^+[:digit:]]*$</nowiki>''|(416) 967-1111\\ +1 416-736-3636\\ 416-439-0000|+65 6896 2391\\ 555-1212|A Canadian/US phone number consists of a 3-digit Area Code (which may not start with 0 or 1) and a 7-digit local number consisting of an exchange (3 digits) and a line (4 digits). The country code for Canada and the US is 1, so the number may be preceded by +1 or 1. Area codes are sometimes contained in parentheses, and dashes or spaces are sometimes used as separators.| | |Phone Numbers (Canada/US)|''<nowiki>^[^+[:digit:]]*(\+?1)?[^+[:digit:]]*[2-9]([^+[:digit:]]*[0-9]){9}[^+[:digit:]]*$</nowiki>''|(416) 967-1111\\ +1 416-736-3636\\ 416-439-0000|+65 6896 2391\\ 555-1212|A Canadian/US phone number consists of a 3-digit Area Code (which may not start with 0 or 1) and a 7-digit local number consisting of an exchange (3 digits) and a line (4 digits). The country code for Canada and the US is 1, so the number may be preceded by +1 or 1. Area codes are sometimes contained in parentheses, and dashes or spaces are sometimes used as separators.| |
|IP Address (IPv4 dotted quad)|''<nowiki>^((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])$</nowiki>''|1.1.1.1\\ 4.4.8.8\\ 8.8.8.8\\ 7.12.9.43\\ 10.106.32.109\\ 172.16.97.1\\ 192.168.0.1\\ |IP=67.69.105.143\\ 1.10.100.1000\\ 255.255.255.0\\ IP=100.150.200.250\\ 103.271.92.16\\ 1O.10.10.10|An IPv4 address in "dotted quad" notation consists of four numbers in the range 0-255 separated by periods. The numbers are called "octets" (which means a collection of eight bits, an alternate way of saying "byte").| | |IP Address (IPv4 dotted quad)|''<nowiki>^((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])$</nowiki>''|1.1.1.1\\ 4.4.8.8\\ 8.8.8.8\\ 7.12.9.43\\ 10.106.32.109\\ 172.16.97.1\\ 192.168.0.1\\ |IP=67.69.105.143\\ 1.10.100.1000\\ IP=100.150.200.250\\ 103.271.92.16\\ 1O.10.10.10|An IPv4 address in "dotted quad" notation consists of four numbers in the range 0-255 separated by periods. The numbers are called "octets" (which means a collection of eight bits, an alternate way of saying "byte").| |
|Private IP Address|''<nowiki>^(10\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])|192\.168|172\.(1[6-9]|2[0-9]|3[0-1]))\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])$</nowiki>''|10.4.72.13\\ 172.16.97.1\\ 192.168.0.1|IP=192.168.113.42\\ 1.1.1.1\\ 4.4.8.8\\ 192.169.12.6\\ 192.168.400.37\\ Address is 1 . 2 . 3 . 4|Private IP addresses are defined as: valid IPv4 dotted quad addresses with a first octet of 10; or first two octets of 192.168; or first octet of 172 followed by a second octet in the range 16-31.| | |Private IP Address|''<nowiki>^(10\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])|192\.168|172\.(1[6-9]|2[0-9]|3[0-1]))\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])$</nowiki>''|10.4.72.13\\ 172.16.97.1\\ 192.168.0.1|IP=192.168.113.42\\ 1.1.1.1\\ 4.4.8.8\\ 192.169.12.6\\ 192.168.400.37\\ Address is 1 . 2 . 3 . 4|Private IP addresses are defined as: valid IPv4 dotted quad addresses with a first octet of 10; or first two octets of 192.168; or first octet of 172 followed by a second octet in the range 16-31.| |
| |
| |
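One way to check rows of the Examples table is to run each pattern against its own "matches" and "does not match" lines. The following is a minimal Python sketch under a few assumptions: the helper name ''check'' is hypothetical, the sample lines are copied from the current revision of the table, and each ''<nowiki>[[:digit:]]</nowiki>'' has been rewritten as ''[0-9]'' because Python's ''re'' module does not support POSIX named classes.

```python
import re

# Patterns from the Examples table, with [[:digit:]] rewritten as [0-9].
INTEGER = r"^[-+]?[0-9]+$"
DECIMAL = r"^[-+]?[0-9]+\.?[0-9]*$"
OCTET = r"(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])"
IPV4 = r"^(" + OCTET + r"\.){3}" + OCTET + r"$"


def check(pattern, should_match, should_not_match):
    """Return the lines whose result disagrees with the table's expectation."""
    regex = re.compile(pattern)
    wrong = [line for line in should_match if not regex.search(line)]
    wrong += [line for line in should_not_match if regex.search(line)]
    return wrong


print(check(INTEGER,
            ["+15", "-2", "720", "000", "012"],
            ["+ 4", "3.14", "0x47", "$1.13", "$4", "123,456"]))  # []
print(check(DECIMAL,
            ["+3.14", "42", "-1000.0", "42.00", "0.976"],
            [".976", "+-200", "1.1.1.1", "13.4.7"]))  # []
print(check(IPV4,
            ["1.1.1.1", "10.106.32.109", "172.16.97.1", "192.168.0.1"],
            ["IP=67.69.105.143", "1.10.100.1000", "103.271.92.16",
             "1O.10.10.10"]))  # []
```

An empty list means every sample line behaved as the table says. Building the IPv4 pattern from a single ''OCTET'' string keeps the four octets identical and makes the repetition easier to read than the fully expanded form.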
* Windows | * Windows |
* findstr /R | * findstr /R (see notes below) |
| |
* Languages | * Programming Languages (Cross-Platform) |
* Powershell | * Cross-platform Shells (Powershell, zsh, bash) |
* Python | * Python |
* JavaScript | * JavaScript |
* Perl | * Perl |
* C / C++ via [[https://www.pcre.org/|PCRE/PCRE2 library]] | * C / C++ via [[https://www.pcre.org/|PCRE/PCRE2 library]] |
* ...and many others! | * ...and many others! |
Findstr is also limited to (approximately) 127 characters in the regular expression. | Findstr is also limited to (approximately) 127 characters in the regular expression. |
| |
For information on findstr's regular expression dialect, see ''help findstr''. In particular, the findstr command does not support alternation with the ''|'' symbol, repetition other than with the ''*'' symbol, or grouping using ''( )''. | For information on findstr's regular expression dialect, see ''help findstr''. In particular, the findstr command does not support alternation with the ''|'' symbol, repetition other than with the ''*'' symbol, named character classes ''<nowiki>[[:</nowiki>//name//<nowiki>:]]</nowiki>'', or grouping ''( )''. |
| |
| |
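The findstr limitations listed above can be turned into a rough pre-check before a pattern written for ''grep -E'' is handed to ''findstr /R''. The sketch below is a heuristic only, not a parser for either dialect: the function name ''findstr_problems'' and the wording of its messages are invented for illustration, and the 127-character default simply reflects the approximate limit mentioned above.

```python
def findstr_problems(pattern, max_length=127):
    """List features in an extended regexp that findstr is said not to support.

    Heuristic sketch: escaped characters are skipped, and characters inside
    a bracket expression are treated as literals.
    """
    problems = []

    def note(problem):
        if problem not in problems:
            problems.append(problem)

    if len(pattern) > max_length:
        note("longer than about %d characters" % max_length)
    if "[:" in pattern:
        note("named character class")

    i = 0
    in_brackets = False
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if in_brackets:
            if pattern.startswith("[:", i):
                # Jump over a named class such as [:digit:].
                end = pattern.find(":]", i + 2)
                i = end + 2 if end != -1 else len(pattern)
                continue
            if ch == "]":
                in_brackets = False
        elif ch == "[":
            in_brackets = True
            # A leading ^ and a leading ] are part of the bracket expression.
            if pattern.startswith("^", i + 1):
                i += 1
            if pattern.startswith("]", i + 1):
                i += 1
        elif ch == "|":
            note("alternation")
        elif ch in "+?{":
            note("repetition other than *")
        elif ch in "()":
            note("grouping")
        i += 1
    return problems


print(findstr_problems(r"^Hello$"))              # []
print(findstr_problems(r"^[-+]?[[:digit:]]+$"))
# ['named character class', 'repetition other than *']
```

A pattern that comes back with an empty list is not guaranteed to work in findstr; the check only flags the specific limitations named above, so ''help findstr'' remains the reference for what the dialect accepts.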