ops102:regular_expressions

Differences

This shows you the differences between two versions of the page.

ops102:regular_expressions [2024/04/01 09:28] – [Windows findstr and Regular Expressions] chris
ops102:regular_expressions [2025/03/14 02:05] (current) – [Examples] chris
Line 2: Line 2:
  
 **Regular Expressions** are search patterns for "Regular Text". They are used by many different tools and languages, including the Linux grep command, the Windows findstr command, less, vi/vim, sed, awk, perl, python, and many others.
- 
-===== Video ===== 
- 
-  * [[https://seneca-my.sharepoint.com/:v:/g/personal/chris_tyler_senecapolytechnic_ca/EUGN0BHIlzlCmrjXwZgYdSQBoJvWjX9wwfDZKFKS9sGXVg|Video Lecture on Regular Expressions]] (This is an extended version of the lecture given in class on March 27/28 made available for review). 
  
 =====  Why Use Regular Expressions?  =====
Line 49: Line 45:
   *  A repeat count can be placed in curly brackets. It applies to the previous element:  ''<nowiki>"x{3}"</nowiki>'' matches ''<nowiki>"xxx"</nowiki>''
   *  A repeat can be a range, written as min,max in curly brackets: ''<nowiki>"x{2,5}"</nowiki>'' will match ''<nowiki>"xx"</nowiki>'', ''<nowiki>"xxx"</nowiki>'', ''<nowiki>"xxxx"</nowiki>'', or ''<nowiki>"xxxxx"</nowiki>''
-  *  The maximum value in a range can be omitted: ''<nowiki>"x{2,}"</nowiki>'' will two or more ''<nowiki>"x"</nowiki>'' characters in a row
+  *  The maximum value in a range can be omitted: ''<nowiki>"x{2,}"</nowiki>'' will match two or more ''<nowiki>"x"</nowiki>'' characters in a row
   *  There are short forms for some commonly-used ranges:
     *  ''<nowiki>"*"</nowiki>'' is the same as ''<nowiki>"{0,}"</nowiki>'' (zero or more)
Line 69: Line 65:
   *  A caret symbol will match the start of a line: ''<nowiki>"^[[:upper:]]"</nowiki>'' will match lines that start with an uppercase letter.
   *  A dollar sign will match the end of a line: ''<nowiki>"[[:punct:]]$"</nowiki>'' will match lines that end with a punctuation mark.
-  *  The two characters may be used together: ''<nowiki>"cat"</nowiki>'' will match the word ''<nowiki>"cat"</nowiki>'' anywhere on a line, but ''<nowiki>"^cat$"</nowiki>'' will only match lines that contain //only// the word ''<nowiki>"cat"</nowiki>''. Likewise, ''<nowiki>"^[[0-9.]]$"</nowiki>'' will match lines that are made up of only digits and dot characters.
+  *  The two anchors may be used together: ''<nowiki>"cat"</nowiki>'' will match the word ''<nowiki>"cat"</nowiki>'' anywhere on a line, but ''<nowiki>"^cat$"</nowiki>'' will only match lines that contain //just// the word ''<nowiki>"cat"</nowiki>'' and nothing else. Similarly, ''<nowiki>"^[0-9.]+$"</nowiki>'' will match lines that are made up of only digits and dot characters, and ''<nowiki>"^...$"</nowiki>'' or ''<nowiki>"^.{3}$"</nowiki>'' will only match lines that contain exactly three characters.
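
As a rough illustration of the repeat counts and anchors described in this list, the following sketch uses Python's ''re'' module as a stand-in for ''grep -E''. The helper name ''matches'' is made up for this example, and POSIX named classes such as ''<nowiki>[[:upper:]]</nowiki>'' are left out because Python's ''re'' module does not support them.

```python
import re


def matches(pattern, line):
    """Return True if pattern is found anywhere in line, as grep would report."""
    return re.search(pattern, line) is not None


# Repeat counts apply to the previous element.
print(matches(r"x{3}", "wxxxy"))   # three x characters in a row
print(matches(r"x{2,}", "axbx"))   # the x characters are not adjacent

# Without anchors, "cat" is found inside a longer word.
print(matches(r"cat", "concatenate"))
# With both anchors, the whole line must be exactly "cat".
print(matches(r"^cat$", "concatenate"))
print(matches(r"^cat$", "cat"))

# Exactly three characters on the line, written two equivalent ways.
print(matches(r"^...$", "dog"), matches(r"^.{3}$", "dog"))
print(matches(r"^...$", "doge"), matches(r"^.{3}$", "doge"))
```

Using ''re.search'' rather than ''re.match'' mirrors how grep looks for the pattern anywhere on the line unless anchors say otherwise.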
  
 =====  Examples  =====
Line 75: Line 71:
  
  
-^Description^Regexp (GNU Extended Grep dialect - "grep -E")^Matches^Does not match^Comments^
-|A specific word|''<nowiki>Hello</nowiki>''|Hello\\ Hello there!\\ Hello, World!\\ He said, "Hello James", in a very threatening tone|Hi there\\ Hell of a Day\\ h el lo|
-|A specific word with nothing else on the line|''<nowiki>^Hello$</nowiki>''|Hello|Hello there!\\ Hello, World!\\ He said, "Hello James", in a very threatening tone\\ Hi there\\ Hell of a Day\\ h el lo|This will match "Hello" anywhere on the line, but not permit any variations, such as spaces in the word or UPPER-/lower-case changes.|
+^Description^Regexp (GNU Extended Grep dialect - "grep -E")^Matches these lines^Does not match these lines^Comments^
+|A specific word|''<nowiki>Hello</nowiki>''|Hello\\ Hello there!\\ Hello, World!\\ He said, "Hello James", in a very threatening tone|Hi there\\ Hell of a Day\\ h el lo|This will match "Hello" anywhere on the line, but does not permit any variations, such as spaces in the word or UPPER-/lower-case changes.|
+|A specific word with nothing else on the line|''<nowiki>^Hello$</nowiki>''|Hello|Hello there!\\ Hello, World!\\ He said, "Hello James", in a very threatening tone\\ Hi there\\ Hell of a Day\\ h el lo| |
 |5-character line|''<nowiki>^.....$</nowiki>''|rouge\\ green\\ Ho-ho\\ |Yellow\\ long line\\ tiny\\ 12-45-78|The anchor characters prevent extra characters from existing between the five characters and the start and end of the line.|
 |Lines that start with a vowel|''<nowiki>^[AEIOUYaeiouy]</nowiki>''|Allo\\ Everything\\ Energy\\ Under\\ Yellow\\ everything|Hello\\ White\\ 4164915050\\ Grinch|The character class includes both UPPERCASE and lowercase letters. You could instead use the option (specific to the tool you're using) to ignore case; for example, ''-i'' for grep or ''/I'' for findstr.|
Line 83: Line 79:
 |An integer|''<nowiki>^[-+]?[[:digit:]]+$</nowiki>''|+15\\ -2\\ 720\\ 1440\\ 1280\\ 1920\\ 000\\ 012|+ 4\\ 3.14\\ 0x47\\ $1.13\\ $4\\ 123,456|This looks for lines that start with a + or - (optional), then contain digits.|
 |A decimal number|''<nowiki>^[-+]?[[:digit:]]+\.?[[:digit:]]*$</nowiki>''|+3.14\\ 42\\ -1000.0\\ +212\\ +36.7\\ 42.00\\ 3.333333333\\ 0.976|.976\\ +-200\\ 1.1.1.1\\ 13.4.7|This will match lines that start with + or - (optional), then contain digits, then optionally contain a decimal point followed by zero or more additional digits.|
-|A Canadian Postal Code|''<nowiki>^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ] ?[0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]$</nowiki>''|H0H 0H0\\ M3C 1L2\\ K1A 0A2\\ T2G 0P3\\ V8W 9W2\\ R3B 0N2\\ M2J2X5\\ M5S 2C6|POB 1L0\\ 90210\\ MN4 2R6|A Canadian postal code alternates between letters and digits: A9A 9A9. The first letter must be of of ABCEGHJKLMNPRSTVXY and the remaining letters must be one of ABCEGHJKLMNPRSTVXY.|
+|A Canadian Postal Code|''<nowiki>^[ABCEGHJKLMNPRSTVXY][0-9][ABCEGHJKLMNPRSTVWXYZ] ?[0-9][ABCEGHJKLMNPRSTVWXYZ][0-9]$</nowiki>''|H0H 0H0\\ M3C 1L2\\ K1A 0A2\\ T2G 0P3\\ V8W 9W2\\ R3B 0N2\\ M2J2X5\\ M5S 2C6|POB 1L0\\ 90210\\ MN4 2R6|A Canadian postal code alternates between letters and digits: A9A 9A9. The first letter must be one of ABCEGHJKLMNPRSTVXY and the remaining letters must be one of ABCEGHJKLMNPRSTVWXYZ.|
 |Phone Numbers (Canada/US)|''<nowiki>^[^+[:digit:]]*(\+?1)?[^+[:digit:]]*[2-9]([^+[:digit:]]*[0-9]){9}[^+[:digit:]]*$</nowiki>''|(416) 967-1111\\ +1 416-736-3636\\ 416-439-0000|+65 6896 2391\\ 555-1212|A Canadian/US phone number consists of a 3-digit Area Code (which may not start with 0 or 1) and a 7-digit local number consisting of an exchange (3 digits) and a line (4 digits). The country code for Canada and the US is 1, so the number may be preceded by +1 or 1. Area codes are sometimes enclosed in parentheses, and dashes or spaces are sometimes used as separators.|
-|IP Address (IPv4 dotted quad)|''<nowiki>^((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])$</nowiki>''|1.1.1.1\\ 4.4.8.8\\ 8.8.8.8\\ 7.12.9.43\\ 10.106.32.109\\ 172.16.97.1\\ 192.168.0.1\\ |IP=67.69.105.143\\ 1.10.100.1000\\ 255.255.255.0\\ IP=100.150.200.250\\ 103.271.92.16\\ 1O.10.10.10|An IPv4 address in "dotted quad" notation consists of four numbers in the range 0-255 separated by periods. The numbers are called "octets" (which means a collection of eight bits, an alternate way of saying "byte").|
+|IP Address (IPv4 dotted quad)|''<nowiki>^((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])$</nowiki>''|1.1.1.1\\ 4.4.8.8\\ 8.8.8.8\\ 7.12.9.43\\ 10.106.32.109\\ 172.16.97.1\\ 192.168.0.1\\ |IP=67.69.105.143\\ 1.10.100.1000\\ IP=100.150.200.250\\ 103.271.92.16\\ 1O.10.10.10|An IPv4 address in "dotted quad" notation consists of four numbers in the range 0-255 separated by periods. The numbers are called "octets" (which means a collection of eight bits, an alternate way of saying "byte").|
 |Private IP Address|''<nowiki>^(10\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])|192\.168|172\.(1[6-9]|2[0-9]|3[0-1]))\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])$</nowiki>''|10.4.72.13\\ 172.16.97.1\\ 192.168.0.1|IP=192.168.113.42\\ 1.1.1.1\\ 4.4.8.8\\ 192.169.12.6\\ 192.168.400.37\\ Address is 1 . 2 . 3 . 4|Private IP addresses are defined as: valid IPv4 dotted quad addresses with a first octet of 10; or first two octets of 192.168; or first octet of 172 followed by a second octet in the range 16-31.|
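
A few of the table's patterns can be checked against their sample lines with a short script. This is only a sketch: it uses Python's ''re'' module rather than ''grep -E'', writes ''<nowiki>[[:digit:]]</nowiki>'' as ''<nowiki>[0-9]</nowiki>'' because Python has no POSIX named classes, and the names ''OCTET'', ''PATTERNS'', and ''line_matches'' are made up for illustration.

```python
import re

# One octet of an IPv4 address: a number in the range 0-255.
OCTET = r"(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])"

# Patterns adapted from the table; [[:digit:]] is written as [0-9]
# because Python's re module has no POSIX named classes.
PATTERNS = {
    "integer": r"^[-+]?[0-9]+$",
    "decimal": r"^[-+]?[0-9]+\.?[0-9]*$",
    "ipv4": r"^(" + OCTET + r"\.){3}" + OCTET + r"$",
}


def line_matches(name, line):
    """Return True if the named pattern matches the line."""
    return re.search(PATTERNS[name], line) is not None


# Sample lines taken from the table's two example columns.
samples = {
    "integer": ["+15", "012", "+ 4", "3.14"],
    "decimal": ["-1000.0", "42", ".976", "1.1.1.1"],
    "ipv4": ["192.168.0.1", "1.10.100.1000", "103.271.92.16"],
}

for name, lines in samples.items():
    for line in lines:
        result = "matches" if line_matches(name, line) else "does not match"
        print(f"{name:8} {line!r:18} {result}")
```

Building the IPv4 pattern from a single ''OCTET'' string keeps the 0-255 alternation in one place, which makes the long expression easier to read and to reuse.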
  
Line 121: Line 117:
  
   *  Windows
-    *  findstr /R
+    *  findstr /R (see notes below)
  
-  *  Languages
-    *  Powershell
-    *  Python
-    *  JavaScript
-    *  Perl
+  *  Programming Languages (Cross-Platform)
+    * Cross-platform Shells (PowerShell, zsh, bash)
+    * Python
+    * JavaScript
+    * Perl
     * C / C++ via [[https://www.pcre.org/|PCRE/PCRE2 library]]
     *  ...and many others!
Line 141: Line 137:
 Findstr is also limited to (approximately) 127 characters in the regular expression.
  
-For information on findstr's regular expression dialect, see ''help findstr''. In particular, the findstr command does not support alternation with the ''|'' symbol, repetition other than with the ''*'' symbol, or grouping using ''( )''.
+For information on findstr's regular expression dialect, see ''help findstr''. In particular, the findstr command does not support alternation with the ''|'' symbol, repetition other than with the ''*'' symbol, named character classes ''<nowiki>[[:</nowiki>//name//<nowiki>:]]</nowiki>'', or grouping ''( )''.
  
  
ops102/regular_expressions.1711963722.txt.gz · Last modified: 2024/04/16 18:10 (external edit)
