CMPS 215 - Lab 7

Lab 7 - Basic Regular Expressions

Material Covered

man page on grep
Appendix A in the book

Part I: Basic Regular Expressions in grep

grep is a tool that searches for text which matches a pattern. For today's lab, we will be using basic regular expressions (BREs) to express the pattern. Next week, we will expand this with extended regular expressions (EREs).

Regular expressions are used by many Unix/Linux tools (and several programming languages) to represent a pattern that we wish to find in a string. The grep utility lets us search for matching strings in either a file or standard input. Regular expressions are powerful tools that let us compactly describe the text for which we are searching. The basic idea behind regular expressions is to define a sequence of special symbols and printable characters that will match the desired text. The simplest regular expression is just a word or portion of a word. For example, we have previously used grep as follows:

ps aux | grep mysqld

This would scan each line of the ps aux output looking for the string mysqld. Only the lines which contain that string would be printed to the screen.

The real power in regular expressions comes from using metacharacters to represent wildcards in the text. For example, the metacharacter ^ means to match starting at the beginning of the line. There are two basic types of metacharacters: those which represent one character and those which modify the matching behavior.

The period symbol (.) matches any single character, be it a space, letter, number or so on. For example, the regular expression ".at" would match "cat", "bat", "rat", "1at" and so on.
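As a quick check of the period metacharacter, the sketch below feeds grep a few made-up words, one per line. The variable names and sample words are only illustrations and are not part of the lab files.

```shell
# Five sample lines, one word per line (made up for this sketch).
words=$(printf '%s\n' cat bat 1at at boat)

# ".at" needs exactly one character, of any kind, right before "at".
dot_matches=$(printf '%s\n' "$words" | grep ".at")
printf '%s\n' "$dot_matches"
# prints: cat, bat, 1at and boat (one per line)
```

The line "at" is not printed because there is no character in front of "at" for the period to match. The line "boat" is printed because grep only needs the pattern to match somewhere within the line, here "oat".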

There is another pair of metacharacters which will match any of the characters given: the square brackets [ ]. The BRE square bracket metacharacters are similar to the filename square bracket metacharacters. You can specify specific printable characters or a range of printable characters. For example, the BRE "[ab12]" will match "a", "b", "1" or "2". The BRE "[1-3]" will match "1", "2" or "3". You can also combine multiple ranges or mix ranges and literal characters. The BRE "[a-z0-9]" will match any lowercase letter or digit. The BRE "[0-9abc]" will match any digit or the letters "a", "b" or "c".
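The bracket forms above can be tried on a small made-up input, as in this sketch. The sample lines and variable names are assumptions for illustration. Setting LC_ALL=C is also a choice made here so that a range such as a-c means plain ASCII a, b, c regardless of the system locale.

```shell
# Use the C locale so ranges like a-c are plain ASCII ranges.
export LC_ALL=C

# Made-up sample lines for this sketch.
sample=$(printf '%s\n' a1 b7 c2 z9 B3)

# A line matches if it contains a, b, 1 or 2 anywhere.
set_matches=$(printf '%s\n' "$sample" | grep "[ab12]")
# A lowercase letter a-c followed immediately by a digit 1-3.
range_matches=$(printf '%s\n' "$sample" | grep "[a-c][1-3]")

printf 'set:   %s\n' $set_matches     # a1, b7, c2
printf 'range: %s\n' $range_matches   # a1, c2
```

Each pair of brackets stands for exactly one character, so "[a-c][1-3]" describes a two-character sequence.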

There is also a variation on the square brackets which EXCLUDES the given characters from the match: [^ ]. For example, the BRE "[^0-9]" would match any character that was NOT a digit. This is not to be confused with the beginning of line metacharacter ^. If the ^ comes as the first character within the square brackets, this is interpreted as "exclude". If it comes outside of square brackets, it is interpreted as "match at start of line". Thus the BRE "[^a-z]" would exclude lowercase letters from matching while the BRE "^[a-z]" would match a lowercase letter that came at the start of the line.
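The two meanings of ^ are easy to mix up, so here is a small sketch that runs both forms, and their combination, against the same made-up lines. The sample data and variable names are illustrative only.

```shell
# Made-up sample lines for this sketch.
lines=$(printf '%s\n' abc 123 a1 1a)

# [^0-9]  : the line contains at least one character that is not a digit.
has_non_digit=$(printf '%s\n' "$lines" | grep "[^0-9]")
# ^[0-9]  : the first character of the line is a digit.
starts_with_digit=$(printf '%s\n' "$lines" | grep "^[0-9]")
# ^[^0-9] : both uses together, the first character is NOT a digit.
starts_with_non_digit=$(printf '%s\n' "$lines" | grep "^[^0-9]")

printf 'has non-digit:         %s\n' $has_non_digit          # abc, a1, 1a
printf 'starts with digit:     %s\n' $starts_with_digit      # 123, 1a
printf 'starts with non-digit: %s\n' $starts_with_non_digit  # abc, a1
```

Note that "[^0-9]" does not mean "lines without digits". The lines a1 and 1a both contain a digit, but they also contain a non-digit character, which is all the pattern asks for.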

Along with ^ for matching at the start of the line, there is another metacharacter to match at the end of the line: $. For example, the BRE "^cat$" would only match lines that contain JUST the word cat since both the start of line and end of line metacharacters are used.
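To see how each anchor narrows the match, this sketch applies "^cat", "cat$" and "^cat$" to the same made-up lines. The sample lines and variable names are assumptions for illustration.

```shell
# Made-up sample lines for this sketch.
animals=$(printf '%s\n' cat concat catalog "the cat")

# Anchored at the start of the line only.
starts=$(printf '%s\n' "$animals" | grep "^cat")
# Anchored at the end of the line only.
ends=$(printf '%s\n' "$animals" | grep "cat$")
# Anchored at both ends: the line must be exactly "cat".
whole=$(printf '%s\n' "$animals" | grep "^cat$")

printf '%s\n' "$starts"   # cat, catalog
printf '%s\n' "$ends"     # cat, concat, the cat
printf '%s\n' "$whole"    # cat
```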

The final metacharacter used in most BRE systems is *, which modifies the preceding regular expression. This metacharacter means "match the preceding regular expression 0 or more times". For example, the BRE ".*" means match 0 or more characters. That is because * (match 0 or more) modifies . (match any). The matching routine will find the longest possible match. For example, if you give the BRE "[0-9]*" it will look for 0 or more digits and match the longest string of digits it finds. The BRE "ab*" would match "a", "ab", "abbbbb" and so on. Note that with BREs, if you want to look for 1 or more matches, you have to write the regular expression twice and apply * only to the second copy. For example, the BRE "[0-9][0-9]*" would match one or more digits.
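The behavior of * can be observed with a short sketch like the one below. It uses the -o option (print only the matching portion) to make the longest-match rule visible. The sample strings and variable names are made up for illustration.

```shell
# "ab*" is an "a" followed by zero or more "b" characters.
star_lines=$(printf '%s\n' a ab abbbbb b xyz | grep "ab*")
printf '%s\n' "$star_lines"   # a, ab, abbbbb

# -o shows the matched text: grep takes as many b's as it can.
longest=$(printf 'abbbc\n' | grep -o "ab*")
printf '%s\n' "$longest"      # abbb

# One or more digits: the bracket expression written twice,
# with * applied only to the second copy.
numbers=$(printf 'room 101, year 2024\n' | grep -o "[0-9][0-9]*")
printf '%s\n' "$numbers"      # 101, then 2024

# "[0-9]*" on its own can match zero digits, so every line matches.
every_line=$(printf '%s\n' abc 123 | grep -c "[0-9]*")
printf '%s\n' "$every_line"   # 2
```

The last command is the reason for the "write it twice" rule: a pattern that is allowed to match nothing will select every line, including lines with no digits at all.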

Metacharacters and literal characters can be combined to form a more complex regular expression. For example, to look for the word cat followed later in the line by the word dog, with any characters in between them, you would use the BRE "cat.*dog". When forming a regular expression, first think of a phrase in English that describes the pattern, for example "A line starting with 1 or more digits". Then start thinking of the metacharacters you will need for that phrase. In this case, we'll need ^ to match the start of line, the [] to match digits and * to repeat the matching. So we'd end up with the BRE "^[0-9][0-9]*".
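The two combined patterns from this paragraph can be tried out as in the following sketch. The sample lines and variable names are made up for illustration.

```shell
# Made-up sample lines for this sketch.
pets=$(printf '%s\n' "cat chases dog" "dog chases cat" "catdog" "42 cats" "cats 42")

# "cat", then any number of characters (possibly none), then "dog".
cat_then_dog=$(printf '%s\n' "$pets" | grep "cat.*dog")
# A line starting with one or more digits.
leading_digits=$(printf '%s\n' "$pets" | grep "^[0-9][0-9]*")

printf '%s\n' "$cat_then_dog"     # cat chases dog, catdog
printf '%s\n' "$leading_digits"   # 42 cats
```

The line "dog chases cat" is not printed because the order matters: "cat" must come before "dog". The line "cats 42" is not printed because its digits are not at the start of the line.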

Part II: grep Syntax

The basic grep syntax, as seen from "man grep" is:

grep [OPTIONS] <PATTERN> [FILE_LIST]

The PATTERN is a regular expression which is typically enclosed in double quotes, as the BRE examples above were. The FILE_LIST is a space separated list of 0 or more files that grep needs to search. If there are no files in FILE_LIST, grep will search standard input. There are many OPTIONS to choose from. The following are some common options. Look at the man page for a complete list:

-f <file>     Read the regular expression from the specified file
-i            Ignore the case of any letters when matching, so "b" would match "b" or "B"
-v            Invert the match. Only print lines which DON'T match PATTERN
-c            Print a count of the lines that matched, not the lines themselves
-o            Print only the matching portion of the line, not the whole line
-n            Precede each line in the output with its line number in the original file
-R            Treat each entry in FILE_LIST as a directory and search all files and subdirectories in it recursively
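The effect of several of these options can be compared side by side with a sketch like the one below, which searches the same made-up input each time. The sample lines and variable names are assumptions for illustration. The input comes from standard input, so no files are needed.

```shell
# Made-up sample lines for this sketch.
log=$(printf '%s\n' "Error: disk full" "ok" "error: retry" "warning")

ci=$(printf '%s\n' "$log" | grep -i "error")           # both error lines
inverted=$(printf '%s\n' "$log" | grep -v -i "error")  # ok, warning
count=$(printf '%s\n' "$log" | grep -c -i "error")     # 2
only=$(printf '%s\n' "$log" | grep -o -i "error")      # Error, error
numbered=$(printf '%s\n' "$log" | grep -n "error")     # 3:error: retry

printf '%s\n' "$ci" "$inverted" "$count" "$only" "$numbered"
```

Options can be combined, as with -v -i above. Without -i the last command only finds the lowercase "error" on line 3.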

Some examples of running grep:

ps -ef | grep -v "^root"             Print all processes not owned by root
grep "^[^ ]" lab1.txt                Print all non-empty lines that do NOT start with a space
grep -i "cat" lab1.txt               Find all lines with cat regardless of case
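The difference between giving grep a FILE_LIST and letting it read standard input can be seen with a sketch like this one. The file names one.txt and two.txt, their contents, and the temporary directory are all made up for illustration.

```shell
# Create two small throwaway files in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' alpha beta > "$tmpdir/one.txt"
printf '%s\n' gamma alphabet > "$tmpdir/two.txt"

# One file in FILE_LIST: only the matching lines are printed.
from_file=$(grep "alpha" "$tmpdir/one.txt")
# Empty FILE_LIST: grep reads standard input instead.
from_stdin=$(grep "alpha" < "$tmpdir/two.txt")
# Two files in FILE_LIST: each match is prefixed with its file name.
from_both=$(cd "$tmpdir" && grep "alpha" one.txt two.txt)

printf '%s\n' "$from_file"    # alpha
printf '%s\n' "$from_stdin"   # alphabet
printf '%s\n' "$from_both"    # one.txt:alpha, two.txt:alphabet

# Clean up the throwaway files.
rm -r "$tmpdir"
```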

Lab Writeup

Answer the following questions by logging in to Moodle and clicking "Edit my submission" for Lab 7. Copy and paste the questions into the space provided, then answer the questions. You may save and re-edit the assignment on Moodle up until the due date.

What are regular expressions used for?
Assume you have a file that contains these lines:
```
a3b
adcbcbc
12c5a36
a5
3c
```
1. In the above file which line(s) will match the BRE "[a3][5c]" ?
2. In the above file which line(s) will the BRE "^.3.*b" match?
You have the BRE "wx*yz". Which of the following will it match: wyyz, wxxz, wyz?
Write a BRE that will match any of these strings: 1a2, 1b4, 1c3, 1a3.
Write a BRE that will match empty lines (i.e. lines with nothing on them).
Give the grep command to search a file called 'stuff.dat' for lines that start with the word "zip" (regardless of case, so also match "ZIP" or "Zip") followed by 1 or more digits.
Give the grep command to remove all lines which start with the characters // and all blank lines from a file called 'code.cpp'.