--- ops102:regular_expressions [2024/03/27 16:22] – chris
+++ ops102:regular_expressions [2025/03/14 02:05] (current) – [Examples] chris
@@ Line 17: / Line 17: @@
 ====  Wildcards  ====
-A period ''<nowiki>"."</nowiki>'' will match **any** single character. Similarly, three periods ''<nowiki>"..."</nowiki>'' will match any three characters.
+A period ''<nowiki>"."</nowiki>'' will match **any** single character. Similarly, three periods ''<nowiki>"..."</nowiki>'' will match **any three** characters.
 ====  Bracket Expressions / Character Classes  ====
 Bracket Expressions or Character Classes are contained in square brackets ''<nowiki>"[[|]]"</nowiki>''
-  *  A list of characters in square brackets will match any //one// character from the list of characters: ''<nowiki>"[[abc]]"</nowiki>'' will match ''<nowiki>"a"</nowiki>'', ''<nowiki>"b"</nowiki>'', or ''<nowiki>"c"</nowiki>''
+  *  A list of characters in square brackets will match any //one// character from the list of characters: ''<nowiki>"[abc]"</nowiki>'' will match ''<nowiki>"a"</nowiki>'', ''<nowiki>"b"</nowiki>'', or ''<nowiki>"c"</nowiki>''
-  *  A range of characters in square brackets, written as a starting character, a dash, and an ending character, will match any character in that range: ''<nowiki>"[[0-9]]"</nowiki>'' will match any one digit.
+  *  A range of characters in square brackets, written as a starting character, a dash, and an ending character, will match any character in that range: ''<nowiki>"[0-9]"</nowiki>'' will match any one digit.
   *  There are some pre-defined named character classes. These are selected by specifying the name of the character class surrounded by colons and square brackets, placed within outer square brackets, like ''<nowiki>"[[:digit:]]"</nowiki>''. The available names are:
     *  alnum - alphanumeric
@@ Line 37: / Line 37: @@
     *  lower - lowercase letters
     *  xdigit - hexadecimal digits (digits plus a-f and A-F)
-  *  Ranges, lists, and named character classes may be combined - e.g., ''<nowiki>"[[[:digit:]]+-.,]"</nowiki>'' ''<nowiki>"[[[:digit:]][:punct:]]"</nowiki>'' ''<nowiki>"[[0-9_*]]"</nowiki>''
+  *  Ranges, lists, and named character classes may be combined - e.g., ''<nowiki>"[[:digit:]+-.,]"</nowiki>'' ''<nowiki>"[[:digit:][:punct:]]"</nowiki>'' ''<nowiki>"[0-9_*]"</nowiki>''
-  *  To invert a character class, add a carat ^ character as the first character after the opening square bracket: ''<nowiki>"[^[:digit:]]"</nowiki>'' matches any non-digit character, and ''<nowiki>"[[^:]]"</nowiki>'' matches any character that is not a colon.
+  *  To invert a character class, add a caret ^ character as the first character after the opening square bracket: ''<nowiki>"[^[:digit:]]"</nowiki>'' matches any non-digit character, and ''<nowiki>"[^:]"</nowiki>'' matches any character that is not a colon.
   *  To include a literal caret, place it at the end of the character class. To include a literal dash or closing square bracket, place it at the start of the character class.
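The bracket-expression rules above can be tried out with a short sketch like the following. It assumes a POSIX shell and a grep that accepts ''-E''; the sample lines and the variable names are made up for illustration.

```shell
# Hypothetical sample input: one candidate per line.
sample=$(printf 'a1\n42\nx:y\n-7\nZed\n')

# A list: lines that start with a, b, or c.
list=$(printf '%s\n' "$sample" | grep -E '^[abc]')

# A named class: lines that contain at least one digit.
digit=$(printf '%s\n' "$sample" | grep -E '[[:digit:]]')

# An inverted class: lines that contain at least one non-digit character.
nondigit=$(printf '%s\n' "$sample" | grep -E '[^[:digit:]]')

# A literal dash placed first in the class: lines that start with - or +.
signed=$(printf '%s\n' "$sample" | grep -E '^[-+]')

printf 'list:\n%s\n\ndigit:\n%s\n\nnondigit:\n%s\n\nsigned:\n%s\n' \
    "$list" "$digit" "$nondigit" "$signed"
```

Note that ''42'' is missing from the inverted-class result because every character on that line is a digit.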
@@ Line 45: / Line 45: @@
   *  A repeat count can be placed in curly brackets. It applies to the previous element:  ''<nowiki>"x{3}"</nowiki>'' matches ''<nowiki>"xxx"</nowiki>''
   *  A repeat can be a range, written as min,max in curly brackets: ''<nowiki>"x{2,5}"</nowiki>'' will match ''<nowiki>"xx"</nowiki>'', ''<nowiki>"xxx"</nowiki>'', ''<nowiki>"xxxx"</nowiki>'', or ''<nowiki>"xxxxx"</nowiki>''
-  *  The maximum value in a range can be omitted: ''<nowiki>"x{2,}"</nowiki>'' will two or more ''<nowiki>"x"</nowiki>'' characters in a row
+  *  The maximum value in a range can be omitted: ''<nowiki>"x{2,}"</nowiki>'' will match two or more ''<nowiki>"x"</nowiki>'' characters in a row
   *  There are short forms for some commonly-used ranges:
     *  ''<nowiki>"*"</nowiki>'' is the same as ''<nowiki>"{0,}"</nowiki>'' (zero or more)
@@ Line 65: / Line 65: @@
   *  A caret symbol will match the start of a line: ''<nowiki>"^[[:upper:]]"</nowiki>'' will match lines that start with an uppercase letter.
   *  A dollar sign will match the end of a line: ''<nowiki>"[[:punct:]]$"</nowiki>'' will match lines that end with a punctuation mark.
-  *  The two characters may be used together: ''<nowiki>"cat"</nowiki>'' will match the word ''<nowiki>"cat"</nowiki>'' anywhere on a line, but ''<nowiki>"^cat$"</nowiki>'' will only match lines that contain //only// the word ''<nowiki>"cat"</nowiki>''. Likewise, ''<nowiki>"^[[0-9.]]$"</nowiki>'' will match lines that are made up of only digits and dot characters.
+  *  The two anchors may be used together: ''<nowiki>"cat"</nowiki>'' will match the word ''<nowiki>"cat"</nowiki>'' anywhere on a line, but ''<nowiki>"^cat$"</nowiki>'' will only match lines that contain //just// the word ''<nowiki>"cat"</nowiki>'' and nothing else. Similarly, ''<nowiki>"^[0-9.]+$"</nowiki>'' will match lines that are made up of only digits and dot characters, and ''<nowiki>"^...$"</nowiki>'' or ''<nowiki>"^.{3}$"</nowiki>'' will only match lines that contain exactly three characters.
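As a rough illustration of how anchors change what matches, the sketch below runs four patterns over the same made-up input. It assumes a POSIX shell and ''grep -E''; the sample lines and variable names are hypothetical.

```shell
# Hypothetical sample input: one candidate per line.
sample=$(printf 'cat\nconcatenate\nthe cat sat\n3.14\nabc\n12a\n')

# Unanchored: "cat" may appear anywhere on the line.
anywhere=$(printf '%s\n' "$sample" | grep -E 'cat')

# Anchored at both ends: the line must be exactly "cat".
exact=$(printf '%s\n' "$sample" | grep -E '^cat$')

# Exactly three characters of any kind.
three=$(printf '%s\n' "$sample" | grep -E '^.{3}$')

# One or more characters, all of them digits or dots.
numeric=$(printf '%s\n' "$sample" | grep -E '^[0-9.]+$')

printf 'anywhere:\n%s\n\nexact:\n%s\n\nthree:\n%s\n\nnumeric:\n%s\n' \
    "$anywhere" "$exact" "$three" "$numeric"
```

The unanchored pattern also reports ''concatenate'' and ''the cat sat'', while the anchored one reports only the line ''cat''.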
 =====  Examples  =====
@@ Line 71: / Line 71: @@
-^Description^Regexp (GNU Extended Grep dialect - "grep -E")^Matches^Does not match^Comments^
+^Description^Regexp (GNU Extended Grep dialect - "grep -E")^Matches these lines^Does not match these lines^Comments^
-|A specific word|''<nowiki>Hello</nowiki>''|Hello\\ Hello there!\\ Hello, World!\\ He said, "Hello James", in a very threatening tone|Hi there\\ Hell of a Day\\ h el lo|
+|A specific word|''<nowiki>Hello</nowiki>''|Hello\\ Hello there!\\ Hello, World!\\ He said, "Hello James", in a very threatening tone|Hi there\\ Hell of a Day\\ h el lo|This will match "Hello" anywhere on the line, but will not permit any variations, such as spaces in the word or UPPER-/lower-case changes.|
-|A specific word with nothing else on the line|''<nowiki>^Hello$</nowiki>''|Hello|Hello there!\\ Hello, World!\\ He said, "Hello James", in a very threatening tone\\ Hi there\\ Hell of a Day\\ h el lo|This will match "Hello" anywhere on the line, but not permit any variations, such as spaces in the word or UPPER-/lower-case changes.|
+|A specific word with nothing else on the line|''<nowiki>^Hello$</nowiki>''|Hello|Hello there!\\ Hello, World!\\ He said, "Hello James", in a very threatening tone\\ Hi there\\ Hell of a Day\\ h el lo| |
 |5-character line|''<nowiki>^.....$</nowiki>''|rouge\\ green\\ Ho-ho\\ |Yellow\\ long line\\ tiny\\ 12-45-78|The anchor characters prevent extra characters from existing between the five characters and the start and end of the line.|
-|Lines that start with a vowel|''<nowiki>^[[AEIOUYaeiouy]]</nowiki>''|Allo\\ Everyhing\\ Energy\\ Under\\ Yellow|Hello\\ White\\ 4164915050\\ Grinch|The character class includes both UPPERCASE and lowercase letters. You could instead use the option (specific to the tool you're using) to ignore case; for example, ''-i'' for grep or ''/I'' for findstr.|
+|Lines that start with a vowel|''<nowiki>^[AEIOUYaeiouy]</nowiki>''|Allo\\ Everything\\ Energy\\ Under\\ Yellow\\ everything|Hello\\ White\\ 4164915050\\ Grinch|The character class includes both UPPERCASE and lowercase letters. You could instead use the option (specific to the tool you're using) to ignore case; for example, ''-i'' for grep or ''/I'' for findstr.|
 |Lines that end in a punctuation mark|''<nowiki>[[:punct:]]$</nowiki>''|Hello there!\\ Thanks.\\ What do you think?|Hello there\\ 416-491-5050\\ New Year greetings| |
 |An integer|''<nowiki>^[-+]?[[:digit:]]+$</nowiki>''|+15\\ -2\\ 720\\ 1440\\ 1280\\ 1920\\ 000\\ 012|+ 4\\ 3.14\\ 0x47\\ $1.13\\ $4\\ 123,456|This looks for lines that start with a + or - (optional), then contain digits.|
-|A decimal number|''<nowiki>^[-+]?[[:digit:]]+\.[[:digit:]]*$</nowiki>''|+3.14\\ 42\\ -1000.0\\ +212\\ +36.7\\ 42.00\\ 3.333333333\\ 0.976|.976\\ +-200\\ 1.1.1.1\\ 13.4.7|This will match lines that start with + or - (optional), then contain digits, then optionally contain a decimal point followed by zero or more additional digits.|
+|A decimal number|''<nowiki>^[-+]?[[:digit:]]+\.?[[:digit:]]*$</nowiki>''|+3.14\\ 42\\ -1000.0\\ +212\\ +36.7\\ 42.00\\ 3.333333333\\ 0.976|.976\\ +-200\\ 1.1.1.1\\ 13.4.7|This will match lines that start with + or - (optional), then contain digits, then optionally contain a decimal point followed by zero or more additional digits.|
-|A Canadian Postal Code|''<nowiki>^[[ABCEGHJKLMNPRSTVXY]][0-9][[ABCEGHJKLMNPRSTVWXYZ]] ?[[0-9]][ABCEGHJKLMNPRSTVWXYZ][[0-9]]$</nowiki>''|H0H 0H0\\ M3C 1L2\\ K1A 0A2\\ T2G 0P3\\ V8W 9W2\\ R3B 0N2\\ M2J2X5\\ M5S 2C6|POB 1L0\\ 90210\\ MN4 2R6|A Canadian postal code alternates between letters and digits: A9A 9A9. The first letter must be of of ABCEGHJKLMNPRSTVXY and the remaining letters must be one of ABCEGHJKLMNPRSTVXY.|
+|A Canadian Postal Code|''<nowiki>^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ] ?[0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]$</nowiki>''|H0H 0H0\\ M3C 1L2\\ K1A 0A2\\ T2G 0P3\\ V8W 9W2\\ R3B 0N2\\ M2J2X5\\ M5S 2C6|POB 1L0\\ 90210\\ MN4 2R6|A Canadian postal code alternates between letters and digits: A9A 9A9. The first letter must be one of ABCEGHJKLMNPRSTVXY and the remaining letters must be one of ABCEGHJKLMNPRSTVWXYZ.|
 |Phone Numbers (Canada/US)|''<nowiki>^[^+[:digit:]]*(\+?1)?[^+[:digit:]]*[2-9]([^+[:digit:]]*[0-9]){9}[^+[:digit:]]*$</nowiki>''|(416) 967-1111\\ +1 416-736-3636\\ 416-439-0000|+65 6896 2391\\ 555-1212|A Canadian/US phone number consists of a 3-digit Area Code (which may not start with 0 or 1) and a 7-digit local number consisting of an exchange (3 digits) and a line (4 digits). The country code for Canada and the US is 1, so the number may be preceded by +1 or 1. Area codes are sometimes contained in parentheses, and dashes or spaces are sometimes used as separators.|
-|IP Address (IPv4 dotted quad)|''<nowiki>^(((25[[0-5]]|2[[0-4]][0-9]|1[[0-9]][0-9]|[[1-9]][0-9]|[[0-9]]))\.){3}(25[[0-5]]|2[[0-4]][0-9]|1[[0-9]][0-9]|[[1-9]][0-9]|[[0-9]])$</nowiki>''|1.1.1.1\\ 4.4.8.8\\ 8.8.8.8\\ 7.12.9.43\\ 10.106.32.109\\ 172.16.97.1\\ 192.168.0.1\\ |IP=67.69.105.143\\ 1.10.100.1000\\ 255.255.255.0\\ IP=100.150.200.250\\ 103.271.92.16\\ 1O.10.10.10|An IPv4 address in "dotted quad" notation consists of four numbers in the range 0-255 separated by periods. The numbers are called "octets" (which means a collection of eight bits, an alternate way of saying "byte").|
+|IP Address (IPv4 dotted quad)|''<nowiki>^((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])$</nowiki>''|1.1.1.1\\ 4.4.8.8\\ 8.8.8.8\\ 7.12.9.43\\ 10.106.32.109\\ 172.16.97.1\\ 192.168.0.1\\ |IP=67.69.105.143\\ 1.10.100.1000\\ IP=100.150.200.250\\ 103.271.92.16\\ 1O.10.10.10|An IPv4 address in "dotted quad" notation consists of four numbers in the range 0-255 separated by periods. The numbers are called "octets" (which means a collection of eight bits, an alternate way of saying "byte").|
-|Private IP Address|''<nowiki>^(10\.((25[[0-5]]|2[[0-4]][0-9]|1[[0-9]][0-9]|[[1-9]][0-9]|[[0-9]]))|192\.168|172\.(1[[6-9]]|2[[0-9]]|3[[0-1]]))\.((25[[0-5]]|2[[0-4]][0-9]|1[[0-9]][0-9]|[[1-9]][0-9]|[[0-9]]))\.((25[[0-5]]|2[[0-4]][0-9]|1[[0-9]][0-9]|[[1-9]][0-9]|[[0-9]]))</nowiki>''|10.4.72.13\\ 172.16.97.1\\ 192.168.0.1|IP=192.168.113.42\\ 1.1.1.1\\ 4.4.8.8\\ 192.169.12.6\\ 192.168.400.37\\ Address is 1 . 2 . 3 . 4|Private IP addresses are defined as: valid IPv4 dotted quad addresses with a first octet of 10; or first two octets of 192.168; or first octet of 172 followed by a second octet in the range 16-31.|
+|Private IP Address|''<nowiki>^(10\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])|192\.168|172\.(1[6-9]|2[0-9]|3[0-1]))\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])$</nowiki>''|10.4.72.13\\ 172.16.97.1\\ 192.168.0.1|IP=192.168.113.42\\ 1.1.1.1\\ 4.4.8.8\\ 192.169.12.6\\ 192.168.400.37\\ Address is 1 . 2 . 3 . 4|Private IP addresses are defined as: valid IPv4 dotted quad addresses with a first octet of 10; or first two octets of 192.168; or first octet of 172 followed by a second octet in the range 16-31.|
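A few rows of the table can be spot-checked with a small script along these lines. This is only a sketch: ''check'' is a hypothetical helper, and it assumes a POSIX shell with a grep that supports ''-E'' and ''-q''.

```shell
# Patterns copied from the table above.
integer='^[-+]?[[:digit:]]+$'
decimal='^[-+]?[[:digit:]]+\.?[[:digit:]]*$'
postal='^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ] ?[0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]$'

# Hypothetical helper: check LABEL REGEX LINE
# Reports whether LINE matches REGEX, without printing the line via grep.
check() {
    if printf '%s\n' "$3" | grep -Eq -- "$2"; then
        echo "$1: match: $3"
    else
        echo "$1: no match: $3"
    fi
}

check integer "$integer" '+15'
check integer "$integer" '3.14'
check decimal "$decimal" '3.14'
check decimal "$decimal" '.976'
check postal "$postal" 'M2J2X5'
check postal "$postal" 'MN4 2R6'
```

Keeping each pattern in a single-quoted variable stops the shell from interpreting the backslash, ''$'', and ''?'' characters before grep sees them.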
 =====  Regular Expression Dialects  =====
@@ Line 97: / Line 97: @@
 Unlike the original Unix grep, the GNU grep can handle the full extended regular expression syntax, in either of two ways:
   *  To use the special characters (called "meta-characters") ?, +, {, |, (, and ), precede them with a backslash. In other words, while a backslash makes special characters like . or * //ordinary//, it also makes //ordinary// characters ? + { | } into special characters.
-  *  Alternately, use the ''-E'' option to make grep understand extended regular expressions, which causes ? + { ( | ) to become special characters.
+  *  Alternately, use the ''-E'' option to make grep understand extended regular expressions, which causes ? + { ( | ) } to become special characters.
 Other tools, such as sed, similarly require backslashes in front of some of the extended regexp meta-characters (or, if you're using a GNU version of sed, you can use the -E option to enable extended regular expressions, just like GNU grep).
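The difference between the two approaches can be seen in a sketch like this one, which assumes a POSIX shell and GNU grep as described above; the sample lines and variable names are made up.

```shell
# Hypothetical sample input: one candidate per line.
sample=$(printf 'x\nxx\nxxxx\n')

# Basic syntax: the repeat braces need backslashes to become special.
basic=$(printf '%s\n' "$sample" | grep 'x\{2,\}')

# Extended syntax (-E): the same braces are special without backslashes.
extended=$(printf '%s\n' "$sample" | grep -E 'x{2,}')

# Without -E and without backslashes the braces are ordinary characters,
# so grep looks for the literal text "x{2,}" and counts no matching lines.
# The "|| true" only hides grep's "no match" exit status.
literal=$(printf '%s\n' "$sample" | grep -c 'x{2,}' || true)

printf 'basic:\n%s\n\nextended:\n%s\n\nliteral count: %s\n' \
    "$basic" "$extended" "$literal"
```

Both the backslashed form and the ''-E'' form select the lines ''xx'' and ''xxxx''.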
@@ Line 110: / Line 110: @@
     *  GNU grep
     *  The bash test command ''<nowiki>[[ "string" =~ regexp ]]</nowiki>''
+      * Note that the regular expression is __not__ quoted
+      * Example: ''<nowiki>X="ABC"; if [[ "$X" =~ ^[[:upper:]]{3}$ ]]; then echo "MATCH" ; else echo "NO MATCH" ; fi</nowiki>''
     *  The less command, using the / and ? keystrokes for searching forward and backward
     *  The vi/vim editor, also using the / and ? keystrokes for searching forward and backward
@@ Line 115: / Line 117: @@
   *  Windows
-    *  findstr /R
+    *  findstr /R (see notes below)
-  *  Languages
+  *  Programming Languages (Cross-Platform)
-    *  Powershell
+    * Cross-platform Shells (PowerShell, zsh, bash)
-    *  Python
+    * Python
-    *  JavaScript
+    * JavaScript
-    *  Perl
+    * Perl
+    * C / C++ via [[https://www.pcre.org/|PCRE/PCRE2 library]]
     *  ...and many others!
@@ Line 134: / Line 137: @@
 Findstr is also limited to (approximately) 127 characters in the regular expression.
-For information on findstr's regular expression dialect, see ''help findstr''
+For information on findstr's regular expression dialect, see ''help findstr''. In particular, the findstr command does not support alternation with the ''|'' symbol, repetition other than with the ''*'' symbol, named character classes ''<nowiki>[[:</nowiki>//name//<nowiki>:]]</nowiki>'', or grouping ''( )''.