
Filters
- Filters and the Unix philosophy
- grep
- Regular expressions
- egrep regular expressions
- Fun? with regular expressions
- Other filters
- sed
- Advanced grep options and patterns
Reading: The Unix Programming Environment, Chapter 4
- Unix has a philosophy of
using small programs that have a specific purpose
- These programs are then
combined to produce the result you want
- By giving you a set of
"building blocks," Unix lets you handle just about any situation
- Many of these "building
blocks" are "filters"
- They take some input,
do something to it, and produce some output
- We'll cover a few of these in
this section
- Generally speaking, grep searches for patterns in files
- Or in stdin, if no
files are given
- The patterns belong to a class called regular expressions
- grep stands for “global regular expression print,” after the ed command g/re/p
- Variants of grep, called egrep and fgrep, are also usually available as grep -E and grep -F
- egrep extends the regular expression
syntax
- fgrep searches for fixed strings rather than regular expressions, which makes it fast
- Some of the most useful options:
- grep -v prints lines that do not match the pattern
- grep -i is case-insensitive
- grep -n prints the line number before each line (and the file name if more than one file is searched)
- grep -f filename reads the patterns from a file (maybe only for fgrep and egrep on some systems)
- grep -l prints only the names of files that contain a match (very useful on command lines: sort `grep -l …` | …)
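The options above can be tried on a couple of throwaway files. This is only a sketch: the directory, file names, and contents are made up for illustration.

```shell
# Scratch directory with two sample files (hypothetical names and contents).
dir=$(mktemp -d)
printf 'Apple pie\nbanana split\napple tart\n' > "$dir/desserts.txt"
printf 'carrot soup\nbeet salad\n' > "$dir/savory.txt"

# -v: lines that do NOT contain "apple" (case matters, so "Apple pie" stays)
inverted=$(grep -v 'apple' "$dir/desserts.txt")

# -i: ignore case, so both "Apple" and "apple" match
anycase=$(grep -i 'apple' "$dir/desserts.txt")

# -n: prefix each matching line with its line number
numbered=$(grep -n 'apple' "$dir/desserts.txt")

# -l: print only the names of the files that contain a match
listed=$(cd "$dir" && grep -l 'soup' desserts.txt savory.txt)

printf '%s\n' "-v:" "$inverted" "-i:" "$anycase" "-n:" "$numbered" "-l:" "$listed"
rm -rf "$dir"
```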
- Regular expressions are
basically mini-algorithms that specify how to match text
- Regular expressions
look similar to shell patterns, but are quite a bit different
- The simplest regular expression is a single letter, which matches that letter
- a matches a, abcde,
or supercalifragilisticexpialidocious
- A sequence of letters matches
that sequence
- cat matches cat, caterpillar, or scatological
- The character . (a dot) matches any character
- The character * indicates zero or more occurrences of the preceding character
- car* matches cat, carry, or carolina
- ar*a matches sarah, saab, or marrrrrrrrrrrrra,
but not marrrrrrrrrtha
- ^ matches the beginning of a line
- $ matches the end of a line
- So ^$ matches a blank line
- [....] matches any of the characters given, and ranges
can be specified
- [0-9] matches any digit
- [0-9]* matches zero or more digits
- [^....] matches any character other than those listed,
and ranges can be specified
- [^0-9] matches any non-digit
- Note that * doesn't match anything itself. It
just modifies the meaning of the previous character
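The basic patterns above can be checked with grep on a small word list. A sketch only; the word list and sample lines are made up.

```shell
# Sample words, one per line (hypothetical test data).
words='cat
carry
carolina
cot
sarah
saab
martha'

# car* is "ca" followed by zero or more r's, so "cat" matches too
star=$(printf '%s\n' "$words" | grep 'car*')

# ar*a needs an a, any number of r's, then another a
ara=$(printf '%s\n' "$words" | grep 'ar*a')

# ^$ matches only empty lines; -c counts the matching lines
blank_count=$(printf 'one\n\ntwo\n' | grep -c '^$')

# [0-9] matches a line containing any digit
digits=$(printf 'room 101\nno digits\n' | grep '[0-9]')

# ^[^0-9]*$ matches a line made up entirely of non-digits
nodigits=$(printf 'room 101\nno digits\n' | grep '^[^0-9]*$')

printf '%s\n' "$star" "$ara" "$blank_count" "$digits" "$nodigits"
```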
- egrep (or grep -E) adds a few more operators
- The character + matches one or more of the
previous character
- car+ matches car, carr, or carrrrr,
but not ca
- The character ? matches zero or one of the
previous character
- car?pet matches capet and carpet, but not ca or carrpet
- (expression1|expression2) matches either expression1
or expression2
- Note that ? and +, like *, don't match anything themselves. They just modify the meaning of the previous character
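The extended operators can be tried the same way with grep -E. A sketch; the input words are made up, apart from the ones taken from the bullets above.

```shell
# + : one or more of the previous character, so "ca" alone does not match
plus=$(printf 'ca\ncar\ncarrrrr\n' | grep -E 'car+')

# ? : zero or one of the previous character
optional=$(printf 'capet\ncarpet\ncarrpet\nca\n' | grep -E 'car?pet')

# (a|b) : either alternative
either=$(printf 'hotdog\ncatnip\nbird\n' | grep -E '(cat|dog)')

printf '%s\n' "$plus" "$optional" "$either"
```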
- The book offers a couple of
interesting regular expressions. If you understand them, you could be
considered to have a good understanding of regular expressions.
- ^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$
- ^a?b?c?d?e?f?g?h?i?j?k?l?m?n?o?p?q?r?s?t?u?v?w?x?y?z?$
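One way to build intuition for these two patterns is to run them against a few words. A sketch, with a hand-picked word list (the book runs them against a dictionary file): the first pattern selects words whose vowels are exactly a, e, i, o, u in that order, and the second selects words whose letters are in alphabetical order with no letter repeated.

```shell
# Words containing each vowel exactly once, in order
vowels=$(printf '%s\n' abstemious facetious education sequoia |
    grep '^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$')

# Words whose letters appear in alphabetical order (needs -E for ?)
alpha=$(printf '%s\n' almost biopsy banana zebra |
    grep -E '^a?b?c?d?e?f?g?h?i?j?k?l?m?n?o?p?q?r?s?t?u?v?w?x?y?z?$')

printf '%s\n' "$vowels" "$alpha"
```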
- The book offers a
"thought exercise" in Exercise 4-2 on p. 105:
- How would things be
different if grep could match newlines?
- (Perl makes this
possible.)
- sort sorts its input line by line
- Case sensitive by default
- Case insensitive with -f
- Can sort numerically (-n)
- Can sort ascending or descending (-r reverses the order)
- Can sort based on part of the line (-k)
- uniq (note the spelling!)
- Discards adjacent duplicate lines, so the input is usually sorted first
- Can include a count of the number of times each line appears (-c)
- Can print only the duplicated lines (-d), or only the unique lines (-u)
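A short sketch of sort and uniq in action, assuming the usual POSIX flags (-n, -r, -f, -k for sort; -c, -d, -u for uniq). The sample data is made up, and LC_ALL=C is set so the ordering does not depend on the local collation rules.

```shell
export LC_ALL=C   # plain byte-order collation for predictable results

# -n compares numerically (a plain sort would put 10 and 100 before 9)
numeric=$(printf '10\n9\n100\n' | sort -n)

# -r reverses the order; combined with -n this gives descending numbers
descending=$(printf '10\n9\n100\n' | sort -nr)

# -f folds case, so "apple" sorts before "Banana"
folded=$(printf 'cherry\nBanana\napple\n' | sort -f)

# -k 2 sorts on the second whitespace-separated field onward
by_field=$(printf 'x pear\ny apple\nz fig\n' | sort -k 2)

# uniq only collapses adjacent duplicates, so sort first
sorted=$(printf 'b\na\nb\nc\na\nb\n' | sort)
distinct=$(printf '%s\n' "$sorted" | uniq)      # one copy of each line
dups=$(printf '%s\n' "$sorted" | uniq -d)       # only lines that repeat
singles=$(printf '%s\n' "$sorted" | uniq -u)    # only lines that do not repeat

printf '%s\n' "$sorted" | uniq -c               # each line prefixed by its count
```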
- comm compares two sorted files line by line
- I've actually never used this one
- diff and cmp are more commonly used, and more useful, I think
- tr translates one set of characters into another
- Can use ranges, just like character classes in regular expressions
- Examples
- Make something 31337 ("eleet")
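Two small tr examples, as a sketch: one uses ranges, and one spells out the 31337 translation. The particular letter-to-digit mapping (e to 3, l to 1, t to 7) is an assumption chosen to produce that result.

```shell
export LC_ALL=C   # keep the a-z and A-Z ranges to plain ASCII letters

# Ranges: map every lowercase letter to the corresponding uppercase letter
upper=$(printf 'hello\n' | tr 'a-z' 'A-Z')

# Character-by-character mapping: e->3, l->1, t->7
leet=$(printf 'eleet\n' | tr 'elt' '317')

printf '%s\n' "$upper" "$leet"
```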
- dd copies bits from one place to another
- Can do various transformations on the data (ASCII <-> EBCDIC)
- cat $* |
tr -sc A-Za-z '\012' |
sort |
uniq -c |
sort -n |
more
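The pipeline above counts how often each word occurs and lists the most frequent words last. Here is a sketch of the same idea wrapped in a function so it can be run on a sample sentence; the function name wordfreq and the sample text are made up, "$@" stands in for $*, and the final pager is left off.

```shell
export LC_ALL=C

# wordfreq: hypothetical wrapper around the pipeline shown above
wordfreq() {
    cat "$@" |                      # named files, or stdin if none are given
    tr -sc 'A-Za-z' '\012' |        # turn every run of non-letters into one newline
    sort |                          # bring identical words together
    uniq -c |                       # count each distinct word
    sort -n                         # least frequent first, most frequent last
}

printf 'the cat and the hat and the bat\n' | wordfreq

# The most frequent word is on the last line; strip uniq's leading padding
top=$(printf 'the cat and the hat and the bat\n' | wordfreq | tail -n 1 | sed 's/^ *//')
```

Each stage is a filter that knows nothing about the others, which is the "building blocks" idea from the start of these notes.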
- sed is a version of ed that's
designed to be used as a filter
- While ed is no longer useful,
sed is still quite useful
- sed does not alter any
named files; the modified version is printed on stdout
- So, how do you edit a
file with sed?
- Usually with something
like:
- sed [commands] filename >filename.new
mv filename filename.old
mv filename.new filename
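The same three steps, run on a throwaway file. A sketch; the file name and contents are made up, and the && chaining is an addition so that the original is only replaced if sed succeeded.

```shell
dir=$(mktemp -d)
file="$dir/notes.txt"                 # hypothetical file to edit
printf 'foo one\nfoo two\n' > "$file"

# Write the edited copy, keep the original as .old, then move the new copy into place
sed 's/foo/bar/' "$file" > "$file.new" &&
    mv "$file" "$file.old" &&
    mv "$file.new" "$file"

result=$(cat "$file")
backup=$(cat "$file.old")
printf '%s\n' "$result" "$backup"
rm -rf "$dir"
```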
- By far, the most
common usage of sed is to replace one thing with another
- sed 's/foo/bar/g' replaces all occurrences of "foo" with "bar"
- "foo" is a regular expression
- You can delete text that matches a regular expression by giving an empty string as the replacement
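A few substitutions on made-up input, as a sketch of the points above: the effect of the g flag, and deletion with an empty replacement.

```shell
# Without g, only the first match on each line is replaced
first=$(printf 'foo foo\n' | sed 's/foo/bar/')

# With g, every match on the line is replaced
all=$(printf 'foo foo\n' | sed 's/foo/bar/g')

# An empty replacement deletes whatever the pattern matched
deleted=$(printf 'hello cruel world\n' | sed 's/cruel //')

printf '%s\n' "$first" "$all" "$deleted"
```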
- See the text for other
examples and note that grep turns out to be a special case of sed
- The book makes a
"newer" command with sed, which is of interest for how they do
the quoting, but the find command does a much easier version of
"newer" (thought question...)
Q: With what we now know about sed, is it possible to do something like this:
“Replace all occurrences of ‘P’ followed by any capital letter with ‘M’ followed by that same capital letter”?
Why or why not? If not, what kind of primitive or capability would we need?
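One possible answer, as a sketch: the missing capability is a way to remember part of what was matched and reuse it in the replacement. This example assumes sed's tagged sub-expressions, \( ... \) in the pattern and \1 in the replacement, which these notes have not introduced; the sample input is made up.

```shell
export LC_ALL=C   # keep [A-Z] to plain ASCII capitals

# \([A-Z]\) remembers the capital letter after P; \1 puts it back after M
renamed=$(printf 'PA PB Pc XP\n' | sed 's/P\([A-Z]\)/M\1/g')

printf '%s\n' "$renamed"
```

"Pc" is left alone because c is not a capital, and the final P is left alone because nothing follows it.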
Advanced grep options and patterns
Here are
some more options for the grep family that you will be responsible for:
-C NUM, --context=NUM
Print NUM lines of output context. Places a line containing -- between contiguous groups of matches.
-R, -r, --recursive
Read all files under each directory, recursively; this is equivalent to the -d recurse option.
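A sketch of both options on a scratch directory tree. The directory layout and file contents are made up, and -l is added to the recursive search only to keep the output short.

```shell
export LC_ALL=C
dir=$(mktemp -d)
mkdir "$dir/sub"
printf 'one\ntwo\nthree\nfour\nfive\n' > "$dir/top.txt"
printf 'alpha\nthree again\n' > "$dir/sub/deep.txt"

# -C 1: show one line of context before and after each match
context=$(grep -C 1 'three' "$dir/top.txt")

# -r: search every file under the current directory; sort makes the order stable
found=$(cd "$dir" && grep -rl 'three' . | sort)

printf '%s\n' "$context" "$found"
rm -rf "$dir"
```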
And here
are some more grep pattern
"primitives" you are responsible for learning (we have covered the
first few already):
A regular expression may be
followed by one of several repetition operators:
? The preceding item is optional and matched at most once.
* The preceding item will be matched zero or more times.
+ The preceding item will be matched one or more times.
{n} The preceding item is matched exactly n times.
{n,} The preceding item is matched n or more times.
{n,m} The preceding item is matched at least n times, but not more than m times.
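The counted repetitions can be checked with grep -E on a few made-up strings. A sketch; the patterns are anchored with ^ and $ so that the counts apply to the whole line.

```shell
samples='ab
aab
aaab
aaaab'

# {2}: exactly two a's before the b
exact=$(printf '%s\n' "$samples" | grep -E '^a{2}b$')

# {2,}: two or more a's
atleast=$(printf '%s\n' "$samples" | grep -E '^a{2,}b$')

# {1,3}: between one and three a's
between=$(printf '%s\n' "$samples" | grep -E '^a{1,3}b$')

printf '%s\n' "$exact" "$atleast" "$between"
```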
This is
getting VERY close to the following Holy Grail I wished for earlier:
ssh-server% egrep (pattern)x-y