Filters

Filters
        Filters and the Unix philosophy
        grep
        Regular expressions
        egrep regular expressions
        Fun? with regular expressions
        Other filters
        sed

Reading: The Unix Programming Environment, Chapter 4

Filters and the Unix philosophy

Unix has a philosophy of using small programs that have a specific purpose
These programs are then combined to produce the result you want
By giving you a set of "building blocks," Unix lets you handle just about any situation
Many of these "building blocks" are "filters"

They take some input, do something to it, and produce some output

We'll cover a few of these in this section

grep

Generally speaking, grep searches for patterns in files

Or in stdin, if no files are given

The patterns are a class of patterns called regular expressions

grep stands for “global regular expression print,” after the ed command g/re/p

Variants of grep, called egrep and fgrep, are also usually available as grep -E and grep -F

egrep extends the regular expression syntax
fgrep does a "fast" search using fixed strings
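
As a rough illustration of how the three variants differ, the sketch below runs the same kind of search with plain grep, grep -F, and grep -E. The sample lines and variable names are made up for the example.

```shell
# Sample input (hypothetical): only one line contains a literal "a.c".
sample=$(printf 'abc\na.c\naxc\n')

# Plain grep treats "." as "any character", so all three lines match.
basic=$(printf '%s\n' "$sample" | grep 'a.c')

# grep -F (fgrep) treats the pattern as a fixed string, so only "a.c" matches.
fixed=$(printf '%s\n' "$sample" | grep -F 'a.c')

# grep -E (egrep) enables the extended syntax, such as alternation with |.
extended=$(printf '%s\n' "$sample" | grep -E 'ab|ax')

printf 'basic:\n%s\nfixed:\n%s\nextended:\n%s\n' "$basic" "$fixed" "$extended"
```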

Some of the most useful options:

grep -v prints lines that do not match the pattern
grep -i is case-insensitive
grep -n prints the line number before each matching line (and the file name, if more than one file is searched)
grep -f filename reads the patterns from a file (maybe only for fgrep and egrep on some systems)
grep -l prints only the names of files that contain a match (very useful on command lines: sort `grep -l …` | …)
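
The options above can be tried out with a sketch like the following. The file names and contents are invented for the example, and everything happens in a temporary directory.

```shell
# Work in a throwaway directory so nothing real is touched.
dir=$(mktemp -d)
printf 'Apple pie\nbanana split\napple sauce\n' > "$dir/menu1.txt"
printf 'cherry tart\n' > "$dir/menu2.txt"

# -v: lines that do NOT match (the search is case-sensitive by default)
no_apple=$(grep -v 'apple' "$dir/menu1.txt")

# -i: ignore case, so "Apple" matches too
any_apple=$(grep -i 'apple' "$dir/menu1.txt")

# -n: prefix each matching line with its line number
numbered=$(grep -n 'apple' "$dir/menu1.txt")

# -l: print only the names of the files that contain a match
files=$(cd "$dir" && grep -l 'apple' menu1.txt menu2.txt)

# -l inside a command substitution: sort only the files that mention "apple"
sorted=$(cd "$dir" && sort $(grep -l 'apple' menu1.txt menu2.txt))

printf '%s\n' "$no_apple" "$any_apple" "$numbered" "$files" "$sorted"
rm -rf "$dir"
```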

Regular expressions

Regular expressions are basically mini-algorithms that specify how to match text

Regular expressions look similar to shell patterns, but are quite a bit different

The simplest regular expression is a single letter, which matches that letter

a matches a, abcde, or supercalifragilisticexpialidocious

A sequence of letters matches that sequence

cat matches cat, caterpillar, or scatological

The character . (a dot) matches any character
The character * indicates zero or more occurrences of the preceding character

car* matches cat, carry, or carolina
ar*a matches sarah, saab, or marrrrrrrrrrrrra, but not marrrrrrrrrtha

^ matches the beginning of a line
$ matches the end of a line

So ^$ matches an empty line (one with no characters at all)

[....] matches any of the characters given, and ranges can be specified

[0-9] matches any digit
[0-9]* matches zero or more digits

[^....] matches any character other than those listed, and ranges can be specified

[^0-9] matches any non-digit

Note that * doesn't match anything itself. It just modifies the meaning of the previous character
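
One way to check the basic patterns above is to run them through grep against a small word list. The list and the matches helper below are hypothetical, made up only for this sketch.

```shell
# Hypothetical sample: eight lines, one of them empty.
words=$(printf 'cat\ncarry\ncarolina\nsarah\nsaab\nmartha\n\ndog42\n')

# Hypothetical helper: list the sample lines that match a basic regular expression.
matches() { printf '%s\n' "$words" | grep -e "$1"; }

matches 'c.t'     # cat
matches 'car*'    # cat, carry, carolina ("ca" followed by zero or more r's)
matches 'ar*a'    # sarah, saab (but not martha)
matches '[0-9]'   # dog42

# ^$ matches only the empty line, so the count is 1.
printf '%s\n' "$words" | grep -c '^$'

# [0-9]* can match zero digits, so every one of the 8 lines matches.
printf '%s\n' "$words" | grep -c '[0-9]*'
```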

egrep regular expressions

egrep (or grep -E) adds a few more operators

The character + matches one or more of the previous character

car+ matches car, carr, or carrrrr, but not ca

The character ? matches zero or one of the previous character

car?pet matches capet and carpet, but not ca or carrpet

(expression1|expression2) matches either expression1 or expression2
Note that ? and + don't match anything themselves. They just modify the meaning of the previous character
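
The extended operators can be checked the same way with grep -E. The word list and the ematches helper are assumptions made for this sketch.

```shell
# Hypothetical sample words, one per line.
words=$(printf 'ca\ncar\ncarrrrr\ncapet\ncarpet\ncarrpet\ndog\n')

# Hypothetical helper: list the sample lines that match an extended regular expression.
ematches() { printf '%s\n' "$words" | grep -E -e "$1"; }

ematches 'car+'        # car, carrrrr, carpet, carrpet (anything containing "ca" plus at least one r)
ematches 'car?pet'     # capet, carpet
ematches '^(ca|dog)$'  # ca, dog
```

Note that grep looks for the pattern anywhere in the line, which is why car+ also finds carpet and carrpet; anchor with ^ and $ to match whole lines only.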

Fun? with regular expressions

The book offers a couple of interesting regular expressions. If you understand them, you could be considered to have a good understanding of regular expressions.
^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$
^a?b?c?d?e?f?g?h?i?j?k?l?m?n?o?p?q?r?s?t?u?v?w?x?y?z?$
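
A quick way to see what the two expressions above do is to run them over a few words. The word list is invented for this sketch; grep -E is used because the second expression relies on ?, which is an extended operator.

```shell
# Hypothetical sample words, one per line.
words=$(printf 'facetious\nabstemious\neducation\nalmost\nbiopsy\nzebra\n')

# First expression: each vowel appears exactly once, in the order a e i o u.
vowels_in_order=$(printf '%s\n' "$words" |
  grep -E '^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$')

# Second expression: the letters appear in alphabetical order, none repeated.
alphabetical=$(printf '%s\n' "$words" |
  grep -E '^a?b?c?d?e?f?g?h?i?j?k?l?m?n?o?p?q?r?s?t?u?v?w?x?y?z?$')

printf 'vowels in order:\n%s\nalphabetical:\n%s\n' "$vowels_in_order" "$alphabetical"
```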
The book offers a "thought exercise" in Exercise 4-2 on p. 105:

How would things be different if grep could match newlines?
(Perl makes this possible.)

GNU Grep 3.0 Goodies

GNU Grep 3.0 adds character classes and other goodies that can save you time if you do a lot of grepping. (For now, focus on the baseline grep material above; I will announce later which parts of this are testable.)

Other filters

sort

Can sort alphabetically

Case sensitive
Case insensitive

Can sort numerically
Can sort ascending or descending
Can sort based on part of the line
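
The sort behaviors listed above might be exercised as follows. The data is made up, and LC_ALL=C is an assumption added so that the ordering is byte-based and the same on every system.

```shell
# Hypothetical data: a name and a number on each line.
data=$(printf 'banana 3\nCherry 10\napple 20\n')

# Case-sensitive alphabetical sort: in the C locale, uppercase sorts before lowercase.
by_bytes=$(printf '%s\n' "$data" | LC_ALL=C sort)

# -f folds lowercase to uppercase before comparing (case-insensitive).
folded=$(printf '%s\n' "$data" | LC_ALL=C sort -f)

# -k 2,2n sorts numerically on the second field only (part of the line).
by_number=$(printf '%s\n' "$data" | LC_ALL=C sort -k 2,2n)

# Adding r reverses the order (descending).
descending=$(printf '%s\n' "$data" | LC_ALL=C sort -k 2,2nr)

printf '%s\n---\n%s\n---\n%s\n---\n%s\n' "$by_bytes" "$folded" "$by_number" "$descending"
```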

uniq

Note the spelling!
Discards duplicate lines
Can include a count of the number of times each line appears
Can print only the duplicated lines, or only the unique lines
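
A small sketch of the uniq options above, on invented input. uniq only compares adjacent lines, so the input is sorted first; the sed step merely strips the leading padding that uniq -c puts before each count.

```shell
# Hypothetical input, sorted so that duplicate lines are adjacent: a a b b b c
lines=$(printf 'b\na\nb\nc\nb\na\n' | LC_ALL=C sort)

deduped=$(printf '%s\n' "$lines" | uniq)       # one copy of each line
dups=$(printf '%s\n' "$lines" | uniq -d)       # only lines that were repeated
singles=$(printf '%s\n' "$lines" | uniq -u)    # only lines that appeared once
counts=$(printf '%s\n' "$lines" | uniq -c | sed 's/^ *//')   # count, then the line

printf '%s\n---\n%s\n---\n%s\n---\n%s\n' "$deduped" "$dups" "$singles" "$counts"
```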

comm

I've actually never used this one
diff and cmp are more commonly used, and more useful, I think

tr

Translates one set of characters into another
Can use ranges, just like character classes in regular expressions
Examples

tr a-z A-Z

Converts every lowercase letter to uppercase

tr aeiot 43107

Make something 31337 ("eleet")
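
Both tr examples can be run directly on a line of text. The input strings here are made up for illustration.

```shell
# Map each lowercase letter to the uppercase letter in the same position of the range.
upper=$(printf 'hello world\n' | tr a-z A-Z)

# Map a->4, e->3, i->1, o->0, t->7; every other character passes through unchanged.
leet=$(printf 'leet speak\n' | tr aeiot 43107)

printf '%s\n%s\n' "$upper" "$leet"
```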

dd

Copies bits from one place to another
Can do various transformations on the data (ASCII ↔ EBCDIC)
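
As a minimal sketch of dd used as a filter, the commands below read stdin and write stdout, applying a conversion on the way. The conv=ucase, conv=ebcdic, and conv=ascii operands are standard; the sample text is invented. dd reports record counts on stderr, so that is discarded here.

```shell
# Convert lowercase to uppercase while copying.
shout=$(printf 'hello\n' | dd conv=ucase 2>/dev/null)

# Convert ASCII to EBCDIC and straight back again; the text should survive the round trip.
round_trip=$(printf 'hello\n' | dd conv=ebcdic 2>/dev/null | dd conv=ascii 2>/dev/null)

printf '%s\n%s\n' "$shout" "$round_trip"
```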

Combining things

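cat "$@" |                  # concatenate the named files (or read stdin)
tr -sc A-Za-z '\012' |      # turn each run of non-letters into one newline
sort |                      # bring identical words together
uniq -c |                   # count each distinct word
sort -n |                   # order by count, smallest first
more                        # page through the result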
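
To see what the pipeline above produces, here is a sketch that feeds it one line of made-up text instead of files and drops the pager. The trailing sed only strips the padding that uniq -c adds, and LC_ALL=C is an assumption added to keep the sort order predictable.

```shell
# Hypothetical input text.
text='the cat and the hat and the bat'

freq=$(printf '%s\n' "$text" |
  tr -sc A-Za-z '\012' |   # one word per line
  LC_ALL=C sort |          # identical words become adjacent
  uniq -c |                # count each word
  LC_ALL=C sort -n |       # least frequent first
  sed 's/^ *//')           # strip leading spaces for readability

printf '%s\n' "$freq"

# The most frequent word ends up on the last line.
top=$(printf '%s\n' "$freq" | tail -n 1)
printf 'most frequent: %s\n' "$top"
```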

sed

sed is a version of ed that's designed to be used as a filter
While ed is no longer useful, sed is still quite useful

sed does not alter any named files; the modified version is printed on stdout

So, how do you edit a file with sed?
Usually with something like:

sed [commands] filename >filename.new
mv filename filename.old
mv filename.new filename
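
The three steps above might look like this on a real file. The file name notes.txt and its contents are hypothetical, and the sketch runs in a temporary directory.

```shell
# Set up a throwaway file to edit.
dir=$(mktemp -d)
printf 'foo one\nfoo two\n' > "$dir/notes.txt"

# 1. Write the edited text to a new file; the original is not modified.
sed 's/foo/bar/' "$dir/notes.txt" > "$dir/notes.txt.new"

# 2. Keep the original as a backup.
mv "$dir/notes.txt" "$dir/notes.txt.old"

# 3. Move the edited version into place.
mv "$dir/notes.txt.new" "$dir/notes.txt"

edited=$(cat "$dir/notes.txt")
backup=$(cat "$dir/notes.txt.old")
printf 'edited:\n%s\nbackup:\n%s\n' "$edited" "$backup"
rm -rf "$dir"
```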

Common usage

By far, the most common usage of sed is to replace one thing with another

sed 's/foo/bar/g' replaces all occurrences of "foo" with "bar"
"foo" is a regular expression
You can delete whatever matches a regular expression by giving an empty replacement string

See the text for other examples and note that grep turns out to be a special case of sed
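
The common sed uses above can be sketched on a few invented lines. The last command shows one way sed can behave like grep: -n suppresses the default output and the p command prints only the lines that match.

```shell
# Replace every occurrence on each line (g); without g, only the first one is replaced.
replaced=$(printf 'foo food\n' | sed 's/foo/bar/g')
first_only=$(printf 'foo food\n' | sed 's/foo/bar/')

# An empty replacement deletes whatever the regular expression matches.
no_digits=$(printf 'a1b22c\n' | sed 's/[0-9]//g')

# sed -n '/re/p' prints the same lines that grep 're' would.
via_sed=$(printf 'cat\ndog\ncatalog\n' | sed -n '/cat/p')
via_grep=$(printf 'cat\ndog\ncatalog\n' | grep 'cat')

printf '%s\n%s\n%s\n%s\n' "$replaced" "$first_only" "$no_digits" "$via_sed"
```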

The book makes a "newer" command with sed, which is of interest for how they do the quoting, but the find command does a much easier version of "newer"
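
For comparison, a sketch of the find approach: the -newer test selects files modified more recently than a reference file. The file names and timestamps are invented, and touch -t is used only to give the files known modification times.

```shell
# Create three files with known modification times (format: CCYYMMDDhhmm).
dir=$(mktemp -d)
touch -t 202001010000 "$dir/old.txt"
touch -t 202101010000 "$dir/ref.txt"
touch -t 202201010000 "$dir/new.txt"

# List regular files modified more recently than ref.txt.
newer=$(cd "$dir" && find . -type f -newer ref.txt)

printf '%s\n' "$newer"
rm -rf "$dir"
```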