ops102:regular_expressions [2024/03/27 16:22] – chris | ops102:regular_expressions [2025/03/14 02:05] (current) – [Examples] chris |
---|
==== Wildcards ==== | ==== Wildcards ==== |
| |
A period ''<nowiki>"."</nowiki>'' will match **any** single character. Similarly, three periods ''<nowiki>"..."</nowiki>'' will match any three characters. | A period ''<nowiki>"."</nowiki>'' will match **any** single character. Similarly, three periods ''<nowiki>"..."</nowiki>'' will match **any three** characters. |
| |
==== Bracket Expressions / Character Classes ==== | ==== Bracket Expressions / Character Classes ==== |
| |
Bracket Expressions or Character Classes are contained in square brackets ''<nowiki>"[ ]"</nowiki>'' | Bracket Expressions or Character Classes are contained in square brackets ''<nowiki>"[ ]"</nowiki>'' |
* A list of characters in square brackets will match any //one// character from the list of characters: ''<nowiki>"[[abc]]"</nowiki>'' will match ''<nowiki>"a"</nowiki>'', ''<nowiki>"b"</nowiki>'', or ''<nowiki>"c"</nowiki>'' | * A list of characters in square brackets will match any //one// character from the list of characters: ''<nowiki>"[abc]"</nowiki>'' will match ''<nowiki>"a"</nowiki>'', ''<nowiki>"b"</nowiki>'', or ''<nowiki>"c"</nowiki>'' |
* A range of characters in square brackets, written as a starting character, a dash, and an ending character, will match any character in that range: ''<nowiki>"[[0-9]]"</nowiki>'' will match any one digit. | * A range of characters in square brackets, written as a starting character, a dash, and an ending character, will match any character in that range: ''<nowiki>"[0-9]"</nowiki>'' will match any one digit. |
* There are some pre-defined named character classes. These are selected by specifying the name of the character class surrounded by colons and square brackets, placed within outer square brackets, like ''<nowiki>"[[:digit:]]"</nowiki>''. The available names are: | * There are some pre-defined named character classes. These are selected by specifying the name of the character class surrounded by colons and square brackets, placed within outer square brackets, like ''<nowiki>"[[:digit:]]"</nowiki>''. The available names are: |
* alnum - alphanumeric | * alnum - alphanumeric |
* lower - lowercase letters | * lower - lowercase letters |
* xdigit - hexadecimal digits (digits plus a-f and A-F) | * xdigit - hexadecimal digits (digits plus a-f and A-F) |
* Ranges, lists, and named character classes may be combined - e.g., ''<nowiki>"[[[:digit:]]+-.,]"</nowiki>'' ''<nowiki>"[[[:digit:]][:punct:]]"</nowiki>'' ''<nowiki>"[[0-9_*]]"</nowiki>'' | * Ranges, lists, and named character classes may be combined - e.g., ''<nowiki>"[[:digit:]+-.,]"</nowiki>'' ''<nowiki>"[[:digit:][:punct:]]"</nowiki>'' ''<nowiki>"[0-9_*]"</nowiki>'' |
* To invert a character class, add a caret ^ character as the first character after the opening square bracket: ''<nowiki>"[^[:digit:]]"</nowiki>'' matches any non-digit character, and ''<nowiki>"[[^:]]"</nowiki>'' matches any character that is not a colon. | * To invert a character class, add a caret ^ character as the first character after the opening square bracket: ''<nowiki>"[^[:digit:]]"</nowiki>'' matches any non-digit character, and ''<nowiki>"[^:]"</nowiki>'' matches any character that is not a colon. |
* To include a literal caret, place it at the end of the character class. To include a literal dash or closing square bracket, place it at the start of the character class. | * To include a literal caret, place it at the end of the character class. To include a literal dash or closing square bracket, place it at the start of the character class. |
| |
* A repeat count can be placed in curly brackets. It applies to the previous element: ''<nowiki>"x{3}"</nowiki>'' matches ''<nowiki>"xxx"</nowiki>'' | * A repeat count can be placed in curly brackets. It applies to the previous element: ''<nowiki>"x{3}"</nowiki>'' matches ''<nowiki>"xxx"</nowiki>'' |
* A repeat can be a range, written as min,max in curly brackets: ''<nowiki>"x{2,5}"</nowiki>'' will match ''<nowiki>"xx"</nowiki>'', ''<nowiki>"xxx"</nowiki>'', ''<nowiki>"xxxx"</nowiki>'', or ''<nowiki>"xxxxx"</nowiki>'' | * A repeat can be a range, written as min,max in curly brackets: ''<nowiki>"x{2,5}"</nowiki>'' will match ''<nowiki>"xx"</nowiki>'', ''<nowiki>"xxx"</nowiki>'', ''<nowiki>"xxxx"</nowiki>'', or ''<nowiki>"xxxxx"</nowiki>'' |
* The maximum value in a range can be omitted: ''<nowiki>"x{2,}"</nowiki>'' will two or more ''<nowiki>"x"</nowiki>'' characters in a row | * The maximum value in a range can be omitted: ''<nowiki>"x{2,}"</nowiki>'' will match two or more ''<nowiki>"x"</nowiki>'' characters in a row |
* There are short forms for some commonly-used ranges: | * There are short forms for some commonly-used ranges: |
* ''<nowiki>"*"</nowiki>'' is the same as ''<nowiki>"{0,}"</nowiki>'' (zero or more) | * ''<nowiki>"*"</nowiki>'' is the same as ''<nowiki>"{0,}"</nowiki>'' (zero or more) |
* A caret symbol will match the start of a line: ''<nowiki>"^[[:upper:]]"</nowiki>'' will match lines that start with an uppercase letter. | * A caret symbol will match the start of a line: ''<nowiki>"^[[:upper:]]"</nowiki>'' will match lines that start with an uppercase letter. |
* A dollar sign will match the end of a line: ''<nowiki>"[[:punct:]]$"</nowiki>'' will match lines that end with a punctuation mark. | * A dollar sign will match the end of a line: ''<nowiki>"[[:punct:]]$"</nowiki>'' will match lines that end with a punctuation mark. |
* The two characters may be used together: ''<nowiki>"cat"</nowiki>'' will match the word ''<nowiki>"cat"</nowiki>'' anywhere on a line, but ''<nowiki>"^cat$"</nowiki>'' will only match lines that contain //only// the word ''<nowiki>"cat"</nowiki>''. Likewise, ''<nowiki>"^[[0-9.]]$"</nowiki>'' will match lines that are made up of only digits and dot characters. | * The two anchors may be used together: ''<nowiki>"cat"</nowiki>'' will match the word ''<nowiki>"cat"</nowiki>'' anywhere on a line, but ''<nowiki>"^cat$"</nowiki>'' will only match lines that contain //just// the word ''<nowiki>"cat"</nowiki>'' and nothing else. Similarly, ''<nowiki>"^[0-9.]+$"</nowiki>'' will match lines that are made up of only digits and dot characters, and ''<nowiki>"^...$"</nowiki>'' or ''<nowiki>"^.{3}$"</nowiki>'' will only match lines that contain exactly three characters. |
| |
===== Examples ===== | ===== Examples ===== |
| |
| |
^Description^Regexp (GNU Extended Grep dialect - "grep -E")^Matches^Does not match^Comments^ | ^Description^Regexp (GNU Extended Grep dialect - "grep -E")^Matches these lines^Does not match these lines^Comments^ |
|A specific word|''<nowiki>Hello</nowiki>''|Hello\\ Hello there!\\ Hello, World!\\ He said, "Hello James", in a very threatening tone|Hi there\\ Hell of a Day\\ h el lo| | |A specific word|''<nowiki>Hello</nowiki>''|Hello\\ Hello there!\\ Hello, World!\\ He said, "Hello James", in a very threatening tone|Hi there\\ Hell of a Day\\ h el lo|This will match "Hello" anywhere on the line, but not permit any variations, such as spaces in the word or UPPER-/lower-case changes.| |
|A specific word with nothing else on the line|''<nowiki>^Hello$</nowiki>''|Hello|Hello there!\\ Hello, World!\\ He said, "Hello James", in a very threatening tone\\ Hi there\\ Hell of a Day\\ h el lo|This will match "Hello" anywhere on the line, but not permit any variations, such as spaces in the word or UPPER-/lower-case changes.| | |A specific word with nothing else on the line|''<nowiki>^Hello$</nowiki>''|Hello|Hello there!\\ Hello, World!\\ He said, "Hello James", in a very threatening tone\\ Hi there\\ Hell of a Day\\ h el lo| | |
|5-character line|''<nowiki>^.....$</nowiki>''|rouge\\ green\\ Ho-ho\\ |Yellow\\ long line\\ tiny\\ 12-45-78|The anchor characters prevent extra characters from existing between the five characters and the start and end of the line.| | |5-character line|''<nowiki>^.....$</nowiki>''|rouge\\ green\\ Ho-ho\\ |Yellow\\ long line\\ tiny\\ 12-45-78|The anchor characters prevent extra characters from existing between the five characters and the start and end of the line.| |
|Lines that start with a vowel|''<nowiki>^[[AEIOUYaeiouy]]</nowiki>''|Allo\\ Everyhing\\ Energy\\ Under\\ Yellow|Hello\\ White\\ 4164915050\\ Grinch|The character class includes both UPPERCASE and lowercase letters. You could instead use the option (specific to the tool you're using) to ignore case; for example, ''-i'' for grep or ''/I'' for findstr.| | |Lines that start with a vowel|''<nowiki>^[AEIOUYaeiouy]</nowiki>''|Allo\\ Everything\\ Energy\\ Under\\ Yellow\\ everything|Hello\\ White\\ 4164915050\\ Grinch|The character class includes both UPPERCASE and lowercase letters. You could instead use the option (specific to the tool you're using) to ignore case; for example, ''-i'' for grep or ''/I'' for findstr.| |
|Lines that end in a punctuation mark|''<nowiki>[[:punct:]]$</nowiki>''|Hello there!\\ Thanks.\\ What do you think?|Hello there\\ 416-491-5050\\ New Year greetings| | | |Lines that end in a punctuation mark|''<nowiki>[[:punct:]]$</nowiki>''|Hello there!\\ Thanks.\\ What do you think?|Hello there\\ 416-491-5050\\ New Year greetings| | |
|An integer|''<nowiki>^[-+]?[[:digit:]]+$</nowiki>''|+15\\ -2\\ 720\\ 1440\\ 1280\\ 1920\\ 000\\ 012|+ 4\\ 3.14\\ 0x47\\ $1.13\\ $4\\ 123,456|This looks for lines that start with a + or - (optional), then contain digits.| | |An integer|''<nowiki>^[-+]?[[:digit:]]+$</nowiki>''|+15\\ -2\\ 720\\ 1440\\ 1280\\ 1920\\ 000\\ 012|+ 4\\ 3.14\\ 0x47\\ $1.13\\ $4\\ 123,456|This looks for lines that start with a + or - (optional), then contain digits.| |
|A decimal number|''<nowiki>^[-+]?[[:digit:]]+\.[[:digit:]]*$</nowiki>''|+3.14\\ 42\\ -1000.0\\ +212\\ +36.7\\ 42.00\\ 3.333333333\\ 0.976|.976\\ +-200\\ 1.1.1.1\\ 13.4.7|This will match lines that start with + or - (optional), then contain digits, then optionally contain a decimal point followed by zero or more additional digits.| | |A decimal number|''<nowiki>^[-+]?[[:digit:]]+\.?[[:digit:]]*$</nowiki>''|+3.14\\ 42\\ -1000.0\\ +212\\ +36.7\\ 42.00\\ 3.333333333\\ 0.976|.976\\ +-200\\ 1.1.1.1\\ 13.4.7|This will match lines that start with + or - (optional), then contain digits, then optionally contain a decimal point followed by zero or more additional digits.| |
|A Canadian Postal Code|''<nowiki>^[[ABCEGHJKLMNPRSTVXY]][0-9][[ABCEGHJKLMNPRSTVWXYZ]] ?[[0-9]][ABCEGHJKLMNPRSTVWXYZ][[0-9]]$</nowiki>''|H0H 0H0\\ M3C 1L2\\ K1A 0A2\\ T2G 0P3\\ V8W 9W2\\ R3B 0N2\\ M2J2X5\\ M5S 2C6|POB 1L0\\ 90210\\ MN4 2R6|A Canadian postal code alternates between letters and digits: A9A 9A9. The first letter must be of of ABCEGHJKLMNPRSTVXY and the remaining letters must be one of ABCEGHJKLMNPRSTVWXYZ.| | |A Canadian Postal Code|''<nowiki>^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ] ?[0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]$</nowiki>''|H0H 0H0\\ M3C 1L2\\ K1A 0A2\\ T2G 0P3\\ V8W 9W2\\ R3B 0N2\\ M2J2X5\\ M5S 2C6|POB 1L0\\ 90210\\ MN4 2R6|A Canadian postal code alternates between letters and digits: A9A 9A9. The first letter must be one of ABCEGHJKLMNPRSTVXY and the remaining letters must be one of ABCEGHJKLMNPRSTVWXYZ.| |
|Phone Numbers (Canada/US)|''<nowiki>^[^+[:digit:]]*(\+?1)?[^+[:digit:]]*[2-9]([^+[:digit:]]*[0-9]){9}[^+[:digit:]]*$</nowiki>''|(416) 967-1111\\ +1 416-736-3636\\ 416-439-0000|+65 6896 2391\\ 555-1212|A Canadian/US phone number consists of a 3-digit Area Code (which may not start with 0 or 1) and a 7-digit local number consisting of an exchange (3 digits) and a line (4 digits). The country code for Canada and the US is 1, so the number may be preceded by +1 or 1. Area codes are sometimes contained in parentheses, and dashes or spaces are sometimes used as separators.| | |Phone Numbers (Canada/US)|''<nowiki>^[^+[:digit:]]*(\+?1)?[^+[:digit:]]*[2-9]([^+[:digit:]]*[0-9]){9}[^+[:digit:]]*$</nowiki>''|(416) 967-1111\\ +1 416-736-3636\\ 416-439-0000|+65 6896 2391\\ 555-1212|A Canadian/US phone number consists of a 3-digit Area Code (which may not start with 0 or 1) and a 7-digit local number consisting of an exchange (3 digits) and a line (4 digits). The country code for Canada and the US is 1, so the number may be preceded by +1 or 1. Area codes are sometimes contained in parentheses, and dashes or spaces are sometimes used as separators.| |
|IP Address (IPv4 dotted quad)|''<nowiki>^(((25[[0-5]]|2[[0-4]][0-9]|1[[0-9]][0-9]|[[1-9]][0-9]|[[0-9]]))\.){3}(25[[0-5]]|2[[0-4]][0-9]|1[[0-9]][0-9]|[[1-9]][0-9]|[[0-9]])$</nowiki>''|1.1.1.1\\ 4.4.8.8\\ 8.8.8.8\\ 7.12.9.43\\ 10.106.32.109\\ 172.16.97.1\\ 192.168.0.1\\ |IP=67.69.105.143\\ 1.10.100.1000\\ 255.255.255.0\\ IP=100.150.200.250\\ 103.271.92.16\\ 1O.10.10.10|An IPv4 address in "dotted quad" notation consists of four numbers in the range 0-255 separated by periods. The numbers are called "octets" (which means a collection of eight bits, an alternate way of saying "byte").| | |IP Address (IPv4 dotted quad)|''<nowiki>^((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])$</nowiki>''|1.1.1.1\\ 4.4.8.8\\ 8.8.8.8\\ 7.12.9.43\\ 10.106.32.109\\ 172.16.97.1\\ 192.168.0.1\\ |IP=67.69.105.143\\ 1.10.100.1000\\ IP=100.150.200.250\\ 103.271.92.16\\ 1O.10.10.10|An IPv4 address in "dotted quad" notation consists of four numbers in the range 0-255 separated by periods. The numbers are called "octets" (which means a collection of eight bits, an alternate way of saying "byte").| |
|Private IP Address|''<nowiki>^(10\.((25[[0-5]]|2[[0-4]][0-9]|1[[0-9]][0-9]|[[1-9]][0-9]|[[0-9]]))|192\.168|172\.(1[[6-9]]|2[[0-9]]|3[[0-1]]))\.((25[[0-5]]|2[[0-4]][0-9]|1[[0-9]][0-9]|[[1-9]][0-9]|[[0-9]]))\.((25[[0-5]]|2[[0-4]][0-9]|1[[0-9]][0-9]|[[1-9]][0-9]|[[0-9]]))</nowiki>''|10.4.72.13\\ 172.16.97.1\\ 192.168.0.1|IP=192.168.113.42\\ 1.1.1.1\\ 4.4.8.8\\ 192.169.12.6\\ 192.168.400.37\\ Address is 1 . 2 . 3 . 4|Private IP addresses are defined as: valid IPv4 dotted quad addresses with a first octet of 10; or first two octets of 192.168; or first octet of 172 followed by a second octet in the range 16-31.| | |Private IP Address|''<nowiki>^(10\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])|192\.168|172\.(1[6-9]|2[0-9]|3[0-1]))\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])$</nowiki>''|10.4.72.13\\ 172.16.97.1\\ 192.168.0.1|IP=192.168.113.42\\ 1.1.1.1\\ 4.4.8.8\\ 192.169.12.6\\ 192.168.400.37\\ Address is 1 . 2 . 3 . 4|Private IP addresses are defined as: valid IPv4 dotted quad addresses with a first octet of 10; or first two octets of 192.168; or first octet of 172 followed by a second octet in the range 16-31.| |
| |
===== Regular Expression Dialects ===== | ===== Regular Expression Dialects ===== |
Unlike the original Unix grep, the GNU grep can handle the full extended regular expression syntax, in either of two ways: | Unlike the original Unix grep, the GNU grep can handle the full extended regular expression syntax, in either of two ways: |
* To use the special characters (called "meta-characters") ?, +, {, |, (, and ), precede them with a backslash. In other words, while a backslash makes special characters like . or * //ordinary//, it also makes //ordinary// characters ? + { | } into special characters. | * To use the special characters (called "meta-characters") ?, +, {, |, (, and ), precede them with a backslash. In other words, while a backslash makes special characters like . or * //ordinary//, it also makes //ordinary// characters ? + { | } into special characters. |
* Alternately, use the ''-E'' option to make grep understand extended regular expressions, which causes ? + { ( | ) to become special characters. | * Alternately, use the ''-E'' option to make grep understand extended regular expressions, which causes ? + { ( | ) } to become special characters. |
| |
Other tools, such as sed, similarly require backslashes in front of some of the extended regexp meta-characters (or, if you're using a GNU version of sed, you can use the -E option to enable extended regular expressions, just like GNU grep). | Other tools, such as sed, similarly require backslashes in front of some of the extended regexp meta-characters (or, if you're using a GNU version of sed, you can use the -E option to enable extended regular expressions, just like GNU grep). |
* GNU grep | * GNU grep |
* The bash test command ''<nowiki>[[ "string" =~ regexp ]]</nowiki>'' | * The bash test command ''<nowiki>[[ "string" =~ regexp ]]</nowiki>'' |
| * Note that the regular expression is __not__ quoted |
| * Example: ''<nowiki>X="ABC"; if [[ "$X" =~ ^[[:upper:]]{3}$ ]]; then echo "MATCH" ; else echo "NO MATCH" ; fi</nowiki>'' |
* The less command, using the / and ? keystrokes for searching forward and backward | * The less command, using the / and ? keystrokes for searching forward and backward |
* The vi/vim editor, also using the / and ? keystrokes for searching forward and backward | * The vi/vim editor, also using the / and ? keystrokes for searching forward and backward |
| |
* Windows | * Windows |
* findstr /R | * findstr /R (see notes below) |
| |
* Languages | * Programming Languages (Cross-Platform) |
* PowerShell | * Cross-platform Shells (PowerShell, zsh, bash) |
* Python | * Python |
* JavaScript | * JavaScript |
* Perl | * Perl |
| * C / C++ via [[https://www.pcre.org/|PCRE/PCRE2 library]] |
* ...and many others! | * ...and many others! |
| |
Findstr is also limited to (approximately) 127 characters in the regular expression. | Findstr is also limited to (approximately) 127 characters in the regular expression. |
| |
For information on findstr's regular expression dialect, see ''help findstr'' | For information on findstr's regular expression dialect, see ''help findstr''. In particular, the findstr command does not support alternation with the ''|'' symbol, repetition other than with the ''*'' symbol, named character classes ''<nowiki>[[:</nowiki>//name//<nowiki>:]]</nowiki>'', or grouping ''( )''. |
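
The following is a minimal sketch, not part of the original page, of how a few of the patterns from the Examples table could be tried out with ''grep -E''. The helper names ''is_integer'', ''is_decimal'', and ''is_ipv4'' are hypothetical, chosen only for illustration, and the sketch assumes a grep that supports the ''-E'' and ''-q'' options (such as GNU grep).

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not from the page).
# Each one feeds its argument to grep -E as a single line and succeeds
# (exit status 0) only if the pattern matches that line.

# "An integer" pattern from the Examples table.
is_integer() {
    printf '%s\n' "$1" | grep -Eq '^[-+]?[[:digit:]]+$'
}

# "A decimal number" pattern from the Examples table.
is_decimal() {
    printf '%s\n' "$1" | grep -Eq '^[-+]?[[:digit:]]+\.?[[:digit:]]*$'
}

# "IP Address (IPv4 dotted quad)" pattern, with the repeated octet
# alternation stored in a variable to keep the expression readable.
is_ipv4() {
    octet='(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])'
    printf '%s\n' "$1" | grep -Eq "^(${octet}\.){3}${octet}\$"
}

# Try each sample line against each pattern and report what matched.
for line in "+15" "3.14" "192.168.0.1" "103.271.92.16"; do
    if is_integer "$line"; then echo "$line: integer"; fi
    if is_decimal "$line"; then echo "$line: decimal number"; fi
    if is_ipv4 "$line"; then echo "$line: IPv4 address"; fi
done
```

Single quotes keep the shell from interpreting characters such as ''$'' and ''\'' before grep sees them. In ''is_ipv4'' the pattern is double-quoted so that the ''octet'' variable expands, which is why the final ''$'' anchor is written as ''\$''.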
| |
| |