
Filters
- Filters and the Unix philosophy
- grep
- Regular expressions
- egrep regular expressions
- Fun? with regular expressions
- Other filters
- sed
- Advanced grep options and patterns
Reading:
The Unix Programming Environment, Chapter 4
Filters and the Unix philosophy
- Unix has a philosophy of using small programs that have a specific purpose
- These programs are then combined to produce the result you want
- By giving you a set of "building blocks," Unix lets you handle just about any situation
- Many of these "building blocks" are "filters"
- They take some input, do something to it, and produce some output
- We'll cover a few of these in this section
grep
- Generally speaking, grep searches for patterns in files
- Or in stdin, if no files are given
- The patterns are a class of patterns called regular expressions
- The name grep comes from the ed command g/re/p: globally search for a regular expression and print the matching lines
- Variants of grep, called egrep and fgrep, are usually also available as grep -E and grep -F
- egrep extends the regular expression syntax
- fgrep does a "fast" search using fixed strings instead of regular expressions
- Some of the most useful options:
- grep -v prints lines that do not match the pattern
- grep -i is case-insensitive
- grep -n prints out the line number before the line (and the file name, if more than one file is searched)
- grep -f filename reads the patterns from a file (maybe only for fgrep and egrep on some systems)
- grep -l only prints out the names of files that contain a match (very useful on command lines: sort `grep -l …` | …)
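The options above can be tried on a couple of small files. This is only a sketch: the scratch directory and the file names and contents (fruits.txt, veggies.txt, patterns) are invented for illustration.

```shell
# Hypothetical sample data in a scratch directory.
dir=$(mktemp -d)
printf 'apple\nBanana\ncherry\n' > "$dir/fruits.txt"
printf 'carrot\nbanana pepper\n' > "$dir/veggies.txt"
printf 'apple\ncarrot\n' > "$dir/patterns"

grep -i banana "$dir/fruits.txt"             # Banana
grep -v an "$dir/fruits.txt"                 # apple, cherry
grep -n cherry "$dir/fruits.txt"             # 3:cherry
grep -l carrot "$dir"/*.txt                  # path of veggies.txt only
grep -f "$dir/patterns" "$dir/fruits.txt"    # apple
printf 'a.c\nabc\n' | grep -F 'a.c'          # a.c (the dot is literal)
printf 'a.c\nabc\n' | grep 'a.c'             # a.c, abc (the dot matches any character)
```

The last two lines show the difference between a fixed-string search and a regular-expression search on the same input.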
Regular expressions
- Regular expressions are basically mini-algorithms that specify how to match text
- Regular expressions look similar to shell patterns, but are quite a bit different
- The simplest regular expression is a single letter, which matches that letter
- a matches a, abcde, or supercalifragilisticexpialidocious
- A sequence of letters matches that sequence
- cat matches cat, caterpillar, or scatological
- The character . (a dot) matches any single character
- The character * indicates zero or more occurrences of the preceding character
- car* matches cat, carry, or carolina (cat matches because r* can match zero r's)
- ar*a matches sarah, saab, or marrrrrrrrrrrrra, but not marrrrrrrrrtha
- ^ matches the beginning of a line
- $ matches the end of a line
- So ^$ matches an empty line
- [....] matches any one of the characters given, and ranges can be specified
- [0-9] matches any digit
- [0-9]* matches zero or more digits
- [^....] matches any one character other than those listed, and ranges can be specified
- [^0-9] matches any non-digit
- Note that * doesn't match anything itself. It just modifies the meaning of the previous character
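A quick way to check patterns like these is to feed grep one candidate word per line and see which lines come back. The word lists below are made up for this sketch.

```shell
# One word per line, so grep shows exactly which words match.
printf 'sarah\nsaab\nmartha\n' | grep 'ar*a'       # sarah, saab
printf 'cat\ncarry\ncot\n'     | grep 'car*'       # cat, carry
printf 'cat\ncot\nct\n'        | grep 'c.t'        # cat, cot
printf 'one\n\ntwo\n'          | grep -c '^$'      # 1 (one empty line)
printf 'abc\n123\na1\n'        | grep '^[0-9]*$'   # 123
printf 'abc\n123\na1\n'        | grep '[^0-9]'     # abc, a1
```

Anchoring with ^ and $ matters in the fifth line: without the anchors, [0-9]* would match every line, because zero digits can be found anywhere.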
egrep regular expressions
- egrep (or grep -E) adds a few more
- The character + matches one or more of the previous character
- car+ matches car, carr, or carrrrr, but not ca
- The character ? matches zero or one of the previous character
- car?pet matches capet and carpet, but not ca or carrpet
- (expression1|expression2) matches either expression1 or expression2
- Note that ? and + don't match anything themselves. They just modify the meaning of the previous character
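The extended operators can be checked the same way with grep -E. The input words are invented for this sketch.

```shell
printf 'ca\ncar\ncarrrrr\n'       | grep -E 'car+'        # car, carrrrr
printf 'capet\ncarpet\ncarrpet\n' | grep -E 'car?pet'     # capet, carpet
printf 'cat\ndog\nbird\n'         | grep -E '(cat|dog)'   # cat, dog
```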
Fun? with regular expressions
- The book offers a couple of interesting regular expressions. If you understand them, you could be considered to have a good understanding of regular expressions.
- ^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$
- ^a?b?c?d?e?f?g?h?i?j?k?l?m?n?o?p?q?r?s?t?u?v?w?x?y?z?$
- The book offers a "thought exercise" in Exercise 4-2 on p. 105:
- How would things be different if grep could match newlines?
- (Perl makes this possible.)
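The two expressions quoted from the book can be tried against a short word list. The list is an assumption made for this sketch (the book runs them against a dictionary file), and the comments give one reading of what each pattern selects.

```shell
# Hypothetical word list, one word per line.
words='facetious
abstemious
education
almost
biopsy
banana'

# Each of the five vowels exactly once, in alphabetical order.
printf '%s\n' "$words" |
  grep '^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$'
# prints: facetious, abstemious

# Letters in strictly alphabetical order (? needs grep -E).
printf '%s\n' "$words" |
  grep -E '^a?b?c?d?e?f?g?h?i?j?k?l?m?n?o?p?q?r?s?t?u?v?w?x?y?z?$'
# prints: almost, biopsy
```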
Other filters
sort
- Can be case sensitive or case insensitive
- Can sort numerically
- Can sort ascending or descending
- Can sort based on part of the line
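A short sketch of these sorting modes using the standard sort options -n, -r, -f, and -k. The input lines are invented, and LC_ALL=C is set in the case example only so that the result does not depend on the local collation rules.

```shell
printf '10\n9\n100\n' | sort               # 10, 100, 9 (compared as text)
printf '10\n9\n100\n' | sort -n            # 9, 10, 100 (numeric)
printf '10\n9\n100\n' | sort -nr           # 100, 10, 9 (descending)
printf 'b 2\na 3\nc 1\n' | sort -k 2,2n    # c 1, b 2, a 3 (second field only)
printf 'b\nA\nC\n' | LC_ALL=C sort         # A, C, b (case sensitive)
printf 'b\nA\nC\n' | LC_ALL=C sort -f      # A, b, C (case folded)
```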
uniq
- Note the spelling!
- Discards adjacent duplicate lines (so the input is usually sorted first)
- Can include a count of the number of times each line appears
- Can print only the duplicated lines, or only the unique lines
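A minimal sketch of these uniq modes on made-up input. Because uniq only compares neighboring lines, the input is piped through sort first.

```shell
# Hypothetical input with repeated lines.
lines='b
a
b
c
a
b'

printf '%s\n' "$lines" | sort | uniq       # a, b, c
printf '%s\n' "$lines" | sort | uniq -c    # counts: 2 a, 3 b, 1 c
printf '%s\n' "$lines" | sort | uniq -d    # a, b (lines that repeat)
printf '%s\n' "$lines" | sort | uniq -u    # c (lines that appear once)
```

The exact column padding printed by uniq -c varies between implementations.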
comm
- I've actually never used this one
- diff and cmp are more commonly used, and more useful, I think
tr
- Translates one set of characters into another
- Can use ranges, just like character classes in regular expressions
- Examples
- Make something 31337 ("eleet")
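Two small tr sketches, one using ranges and one doing the "eleet" substitution. The particular letter-to-digit mapping (e to 3, l to 1, t to 7) is an assumption chosen so that "eleet" comes out as 31337.

```shell
printf 'hello world\n' | tr a-z A-Z    # HELLO WORLD
printf 'eleet\n' | tr elt 317          # 31337
```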
dd
- Copies bits from one place to another
- Can do various transformations on the data (ASCII <-> EBCDIC)
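A hedged sketch of a plain copy and two conversions, assuming a dd that supports the standard conv=ucase, conv=ebcdic, and conv=ascii operands. The file names are invented, and dd's transfer statistics on stderr are discarded.

```shell
dir=$(mktemp -d)
printf 'some bytes\n' > "$dir/src"

# Plain copy from one file to another.
dd if="$dir/src" of="$dir/dst" 2>/dev/null
cmp "$dir/src" "$dir/dst" && echo identical

# Copy stdin to stdout, converting to upper case on the way.
printf 'hello\n' | dd conv=ucase 2>/dev/null    # HELLO

# ASCII to EBCDIC and back again; the text is expected to survive the round trip.
printf 'hello\n' | dd conv=ebcdic 2>/dev/null | dd conv=ascii 2>/dev/null
```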
- cat $* |
tr -sc A-Za-z '\012' |
sort |
uniq -c |
sort -n |
more
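The pipeline above counts how often each word occurs. Here is the same chain run on a made-up sentence instead of the files named in $*, with more dropped so it can run non-interactively, and with a comment on what each stage contributes.

```shell
# Hypothetical sample text standing in for the input files.
text='the cat and the dog and the bird'

printf '%s\n' "$text" |
  tr -sc A-Za-z '\012' |   # turn every run of non-letters into one newline
  sort |                   # bring identical words together
  uniq -c |                # count each distinct word
  sort -n                  # least frequent first, most frequent last
```

The last line of output is the most frequent word, here "the" with a count of 3.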
sed
- sed is a version of ed that's designed to be used as a filter
- While ed is no longer useful, sed is still quite useful
- sed does not alter any named files; the modified version is printed on stdout
- So, how do you edit a file with sed?
- Usually with something like:
- sed [commands] filename >filename.new
mv filename filename.old
mv filename.new filename
- By far, the most common usage of sed is to replace one thing with another
- sed 's/foo/bar/g' replaces all occurrences of "foo" with "bar"
- "foo" is a regular expression
- You can delete whatever matches a regular expression by using an empty string as the replacement
- See the text for other examples and note that grep turns out to be a special case of sed
- The book makes a "newer" command with sed, which is of interest for how they do the quoting, but the find command does a much easier version of "newer" (thought question...)
Advanced grep options and patterns
Here are some more options for the grep family that you will be responsible for:
-C NUM, --context=NUM
    Print NUM lines of output context. Places a line containing -- between contiguous groups of matches.
-R, -r, --recursive
    Read all files under each directory, recursively; this is equivalent to the -d recurse option.
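A small sketch of both options, assuming a grep that supports -C and -r (GNU and BSD grep do; they are not part of the POSIX minimum). The directory layout and file contents are invented.

```shell
# Context: one line before and after each match; GNU grep prints -- between groups.
printf '%s\n' 1 2 3 4 5 6 7 8 9 | grep -C 1 -e 2 -e 8
# prints: 1 2 3 -- 7 8 9 (one item per line)

# Recursion: search every file below a directory.
dir=$(mktemp -d)
mkdir "$dir/sub"
printf 'one\ntwo\n' > "$dir/top.txt"
printf 'two again\n' > "$dir/sub/deep.txt"
grep -rl two "$dir"    # lists top.txt and sub/deep.txt (order may vary)
```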
And here are some more grep pattern "primitives" you are responsible for learning (we have covered the first few already):
A regular expression may be followed by one of several repetition operators:
?      The preceding item is optional and matched at most once.
*      The preceding item will be matched zero or more times.
+      The preceding item will be matched one or more times.
{n}    The preceding item is matched exactly n times.
{n,}   The preceding item is matched n or more times.
{n,m}  The preceding item is matched at least n times, but not more than m times.
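The counted repetition operators can be checked with grep -E on a few made-up strings. The patterns are anchored with ^ and $ so that the counts apply to the whole line.

```shell
# Hypothetical test strings: one to four a's followed by b.
words='ab
aab
aaab
aaaab'

printf '%s\n' "$words" | grep -E '^a?b$'        # no line: every string has at least one a
printf '%s\n' "$words" | grep -E '^a{2}b$'      # aab
printf '%s\n' "$words" | grep -E '^a{2,}b$'     # aab, aaab, aaaab
printf '%s\n' "$words" | grep -E '^a{2,3}b$'    # aab, aaab
printf '%s\n' "$words" | grep -E '^a{1}b$'      # ab
```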
This is
getting VERY close to the following Holy Grail I wished for earlier:
ssh-server% egrep (pattern)x-y