grep is a tool that searches for text which matches a pattern.
For today's lab, we will be using basic regular expressions (BREs) to express
the pattern. Next week, we will expand this with extended regular expressions
(EREs).
Regular expressions are used by many Unix/Linux tools (and several programming languages) to represent a pattern that we wish to find in a string. The grep utility lets us search for matching strings in either a file or standard input. Regular expressions are powerful tools that let us compactly indicate the text for which we are searching. The basic idea behind regular expressions is to define a sequence of special symbols and printable characters that will match the desired text. The simplest regular expression is just a word or portion of a word. For example, we have previously used grep as follows:
    ps aux | grep mysqld
This would parse each line of the ps aux output looking for the word mysqld.
Only the lines which contain that word would be printed to the screen.
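To try this without depending on which processes happen to be running, the sketch below feeds grep a few made-up lines in place of real ps aux output. The user names, process IDs and paths are illustrative assumptions, not actual output.

```shell
# Hypothetical stand-in for `ps aux` output: three made-up lines.
sample='root      101  /usr/sbin/sshd
mysql     202  /usr/sbin/mysqld
alice     303  vim notes.txt'

# Keep only the lines that contain the text mysqld.
matches=$(printf '%s\n' "$sample" | grep "mysqld")
printf '%s\n' "$matches"
```

Only the second line should be printed, since it is the only one containing mysqld.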
The real power in regular expressions comes from using metacharacters to
represent wildcards in the text. For example, the metacharacter ^
means "match starting at the beginning of the line". There are two basic types
of metacharacters: those which represent one character and those which modify
the matching behavior.
The period symbol (.) matches any single character, be it a space, letter, number or so on. For example, the regular expression ".at" would match "cat", "bat", "rat", "1at" and so on.
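A quick way to check this is to pipe a handful of test lines into grep. The word list below is made up for illustration. Note that the last line, "at", has no character in front of it, so the period has nothing to match.

```shell
# Five made-up input lines; the last one has nothing before "at".
words='cat
bat
1at
dog
at'

# .at needs exactly one character (any character) followed by "at".
dot_matches=$(printf '%s\n' "$words" | grep ".at")
printf '%s\n' "$dot_matches"
```

This should print cat, bat and 1at, but not dog or at.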
There is another pair of metacharacters which will match any of the characters given: the square brackets [ ]. The BRE square bracket metacharacters are similar to the filename square bracket metacharacters. You can specify specific printable characters or a range of printable characters. For example, the BRE "[ab12]" will match "a", "b", "1" or "2". The BRE "[1-3]" will match "1", "2" or "3". You can also combine multiple ranges or mix ranges and literal characters. The BRE "[a-z0-9]" will match any lowercase letter or digit. The BRE "[0-9abc]" will match any digit or the letters "a", "b" or "c".
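The bracket expressions above can be tried on a few one-character lines. This is a minimal sketch with made-up input. Setting LC_ALL=C is an assumption added here so that ranges such as a-z follow plain ASCII order regardless of the system's language settings.

```shell
# Assumption: force the C locale so a-z means exactly the ASCII lowercase letters.
export LC_ALL=C

# Made-up input: one character per line.
items='a
c
2
7
Z'

# [ab12] selects lines containing a, b, 1 or 2.
set_matches=$(printf '%s\n' "$items" | grep "[ab12]")
# [a-z0-9] selects lines containing a lowercase letter or a digit.
range_matches=$(printf '%s\n' "$items" | grep "[a-z0-9]")

printf 'set:   %s\n' "$(printf '%s' "$set_matches" | tr '\n' ' ')"
printf 'range: %s\n' "$(printf '%s' "$range_matches" | tr '\n' ' ')"
```

The first pattern should select a and 2. The second should select everything except Z.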
There is also a variation on the square brackets which EXCLUDES the given characters from the match: [^ ]. For example, the BRE "[^0-9]" would match any character that was NOT a digit. This is not to be confused with the beginning of line metacharacter ^. If the ^ comes as the first character within the square brackets, it is interpreted as "exclude". If it comes outside of square brackets, it is interpreted as "match at start of line". Thus the BRE "[^a-z]" would exclude lowercase letters from matching while the BRE "^[a-z]" would match a lowercase letter that came at the start of the line.
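The difference between the two uses of ^ is easier to see side by side. The sketch below uses made-up lines and, as an assumption, sets LC_ALL=C so that a-z covers only ASCII lowercase letters. Keep in mind that grep prints a line if the pattern matches anywhere in it, so "[^a-z]" selects any line containing at least one character that is not a lowercase letter.

```shell
# Assumption: force the C locale so a-z means exactly the ASCII lowercase letters.
export LC_ALL=C

# Made-up input lines.
lines='abc
123
Hello
hello 5'

# [^a-z]: the line contains at least one character that is NOT a lowercase letter.
not_lower=$(printf '%s\n' "$lines" | grep "[^a-z]")
# ^[a-z]: the line STARTS with a lowercase letter.
starts_lower=$(printf '%s\n' "$lines" | grep "^[a-z]")

printf '%s\n' "--- [^a-z]" "$not_lower" "--- ^[a-z]" "$starts_lower"
```

The line abc is the only one rejected by "[^a-z]", because every character in it is a lowercase letter. The lines abc and hello 5 are the only ones accepted by "^[a-z]".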
Along with ^ for matching at the start of the line, there is another metacharacter to match at the end of the line: $. For example, the BRE "^cat$" would only match lines that contain JUST the word cat since both the start of line and end of line metacharacters are used.
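As a small illustration of the two anchors, the following sketch runs three patterns over the same made-up lines: one anchored at the start, one at the end, and one at both.

```shell
# Made-up input lines.
pets='cat
cats
a cat
tomcat'

# ^cat  : line begins with cat.
starts=$(printf '%s\n' "$pets" | grep "^cat")
# cat$  : line ends with cat.
ends=$(printf '%s\n' "$pets" | grep 'cat$')
# ^cat$ : line is exactly cat.
exact=$(printf '%s\n' "$pets" | grep '^cat$')

printf '%s\n' "--- ^cat" "$starts" "--- cat\$" "$ends" "--- ^cat\$" "$exact"
```

The patterns containing $ are written in single quotes here. Double quotes also work when $ is the last character, but single quotes stop the shell from ever treating $ as the start of a variable name.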
The final metacharacter used in most BRE systems is *, which modifies the preceding regular expression. This metacharacter means "match the preceding regular expression 0 or more times". For example, the BRE ".*" means match 0 or more characters. That is because * (match 0 or more) modifies . (match any). The matching routine will find the longest possible match. For example, if you give the BRE "[0-9]*" it will look for 0 or more digits and match the longest string of digits it finds. The BRE "ab*" would match "a", "ab", "abbbbb" and so on. Note that with BREs, if you want to look for 1 or more matches, you have to write the regular expression twice and apply * only to the second copy. For example, the BRE "[0-9][0-9]*" would match one or more digits.
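The "0 or more" behaviour has a consequence that is easy to miss: a pattern such as "[0-9]*" is satisfied by zero digits, so it matches every line. The sketch below shows this on made-up input. It uses the grep options -c (print a count of matching lines) and -o (print only the matched text), and sets LC_ALL=C as an assumption to keep the range 0-9 unambiguous.

```shell
# Assumption: force the C locale so ranges follow ASCII order.
export LC_ALL=C

# Made-up input lines.
data='abc
42
x7y'

# [0-9]* can match zero digits, so all three lines count as matches.
zero_or_more=$(printf '%s\n' "$data" | grep -c "[0-9]*")
# [0-9][0-9]* needs at least one digit, so abc is rejected.
one_or_more=$(printf '%s\n' "$data" | grep "[0-9][0-9]*")
# -o prints just the matched text: the whole run of digits, not only the first one.
longest=$(printf 'abc12345def\n' | grep -o "[0-9][0-9]*")
# ab* matches an a followed by as many b characters as are available.
ab_runs=$(printf 'a ab abbbbb\n' | grep -o "ab*")

printf 'lines matching [0-9]*: %s\n' "$zero_or_more"
printf '%s\n' "$one_or_more" "$longest" "$ab_runs"
```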
Metacharacters and literal characters can be combined to form a more complex regular expression. For example, to look for the word cat followed later in the line by the word dog, with any characters in between them, you would use the BRE "cat.*dog". When forming a regular expression, first think of a phrase in English that describes the pattern, for example "A line starting with 1 or more digits". Then start thinking of the metacharacters you will need for that phrase. In this case, we'll need ^ to match the start of line, the [] to match digits and * to repeat the matching. So we'd end up with the BRE "^[0-9][0-9]*".
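Both combined patterns from the paragraph above can be checked against a few test sentences. The input lines are made up for illustration, and LC_ALL=C is set as an assumption so that the range 0-9 behaves the same on every system.

```shell
# Assumption: force the C locale so ranges follow ASCII order.
export LC_ALL=C

# Made-up input lines.
text='the cat chased the dog
the dog chased the cat
42 is the answer
answer: 42
7'

# cat, then anything, then dog later on the same line.
cat_then_dog=$(printf '%s\n' "$text" | grep "cat.*dog")
# A line starting with 1 or more digits.
leading_digits=$(printf '%s\n' "$text" | grep "^[0-9][0-9]*")

printf '%s\n' "--- cat.*dog" "$cat_then_dog" "--- ^[0-9][0-9]*" "$leading_digits"
```

The second line is rejected by "cat.*dog" because dog appears before cat, not after it. The line "answer: 42" is rejected by "^[0-9][0-9]*" because its digits are not at the start.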
    grep [OPTIONS] <PATTERN> [FILE_LIST]
The PATTERN is a regular expression which is typically contained inside double quotes, as all the examples above were. The FILE_LIST is a space separated list of 0 or more files that grep needs to search. If there are no files in FILE_LIST, grep will search standard input. There are many OPTIONS to choose from. The following are some common options. Look at the man page for a complete list:
    -f <file>  Read the regular expression from the specified file
    -i         Ignore the case of any letters when matching, so "b" would match "b" or "B"
    -v         Invert the match. Only print lines which DON'T match PATTERN
    -c         Print a count of the lines that matched, not the lines themselves
    -o         Print only the matching portion of the line, not the whole line
    -n         Precede each line in the output with its line number in the original file
    -R         FILE_LIST is a directory, perform match recursively against all files and subdirectories
Some examples of running grep:
    ps -ef | grep -v "^root"      Print all processes not owned by root
    grep "^[^ ]" lab1.txt         Print all lines that do NOT start with a space
    grep -i "cat" lab1.txt        Find all lines with cat regardless of case
a3b adcbcbc 12c5a36 a5 3c
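Several of the options listed above can be tried on a small throwaway file. This is only a sketch: the file contents are made up, and the use of mktemp to create the temporary file is an assumption about the system (it is available on most Linux distributions).

```shell
# Create a temporary file with four made-up lines (the second starts with a space).
tmp=$(mktemp)
printf 'Cat food\n dog food\ncat toy\nbird seed\n' > "$tmp"

count_cs=$(grep -c "cat" "$tmp")     # count lines containing lowercase cat
count_ci=$(grep -c -i "cat" "$tmp")  # same count, ignoring case
numbered=$(grep -n "food" "$tmp")    # matching lines prefixed with their line numbers
inverted=$(grep -v "food" "$tmp")    # lines that do NOT contain food
no_space=$(grep "^[^ ]" "$tmp")      # lines whose first character is not a space

printf 'cat: %s  cat (any case): %s\n' "$count_cs" "$count_ci"
printf '%s\n' "$numbered" "$inverted" "$no_space"

# Remove the temporary file.
rm -f "$tmp"
```

Options can be combined, as with -c and -i here, so the count reflects the case-insensitive match.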