
Filters


Filters
        Filters and the Unix philosophy
        grep
        Regular expressions
        egrep regular expressions
        Fun? with regular expressions
        Other filters
        sed

        Advanced grep options and patterns


Reading: The Unix Programming Environment, Chapter 4

Filters and the Unix philosophy

  • Unix has a philosophy of using small programs that have a specific purpose
  • These programs are then combined to produce the result you want
  • By giving you a set of "building blocks," Unix lets you handle just about any situation
  • Many of these "building blocks" are "filters"
    • They take some input, do something to it, and produce some output
  • We'll cover a few of these in this section

grep

  • Generally speaking, grep searches for patterns in files
    • Or in stdin, if no files are given
  • The patterns are a class of patterns called regular expressions
    • The name grep comes from the ed command g/re/p: globally search for a regular expression and print the matching lines
  • Variants of grep, called egrep and fgrep, are also usually available as grep -E and grep -F
    • egrep extends the regular expression syntax
    • fgrep searches for fixed strings instead of regular expressions, which is often faster
  • Some of the most useful options:
    • grep -v prints lines that do not match the pattern
    • grep -i is case-insensitive
    • grep -n prints the line number before each line (and the filename, if more than one file is searched)
    • grep -f filename reads the patterns from a file (on some systems, possibly only fgrep and egrep support this)
    • grep -l prints only the names of files that contain a match (very useful on command lines: sort `grep -l …` | …)
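
The options above can be tried on a couple of throwaway files. This is a minimal sketch: the scratch directory, the file names menu.txt and other.txt, and their contents are made up for illustration.

```shell
# Work in a scratch directory so no real files are touched.
dir=$(mktemp -d)
printf 'Apple pie\nbanana bread\napple tart\n' > "$dir/menu.txt"
printf 'cherry cake\n' > "$dir/other.txt"

grep apple "$dir/menu.txt"       # only "apple tart": matching is case-sensitive
grep -i apple "$dir/menu.txt"    # "Apple pie" and "apple tart"
grep -v apple "$dir/menu.txt"    # "Apple pie" and "banana bread"
grep -n bread "$dir/menu.txt"    # "2:banana bread"
grep -l apple "$dir"/*.txt       # prints only the path of menu.txt
```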

Regular expressions

  • Regular expressions are basically mini-algorithms that specify how to match text
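    • Regular expressions look similar to shell patterns, but behave quite differently (in the shell, * by itself matches any string; in a regular expression, it modifies the preceding character)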
  • The simplest regular expression is a single letter, which matches that letter
    • a matches a, abcde, or supercalifragilisticexpialidocious
  • A sequence of letters matches that sequence
    • cat matches cat, caterpillar, or scatological
  • The character . (a dot) matches any character
  • The character * indicates zero or more occurrences of the preceding character
    • car* matches cat, carry, or carolina
    • ar*a matches sarah, saab, or marrrrrrrrrrrrra, but not marrrrrrrrrtha
  • ^ matches the beginning of a line
  • $ matches the end of a line
    • So ^$ matches a blank line
  • [....] matches any of the characters given, and ranges can be specified
    • [0-9] matches any digit
    • [0-9]* matches zero or more digits
  • [^....] matches any character other than those listed, and ranges can be specified
    • [^0-9] matches any non-digit
  • Note that * doesn't match anything itself. It just modifies the meaning of the previous character
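
A quick way to check these rules is to feed grep a few lines and see which ones survive. In this sketch, sample is a hypothetical helper that prints a small made-up word list, including one blank line.

```shell
# Hypothetical test input: seven lines, the fifth one blank.
sample() { printf 'cat\ncarry\nsarah\nmartha\n\n42\nroom 101\n'; }

sample | grep 'car*'             # cat and carry: "ca" followed by zero or more r's
sample | grep 'ar*a'             # sarah only
sample | grep -c '^$'            # 1, the single blank line
sample | grep '^[0-9]*$'         # the blank line and 42: zero digits also counts
sample | grep '^[0-9][0-9]*$'    # 42 only: at least one digit is required
```

Note how ^[0-9]*$ also matches the blank line, because * allows zero occurrences.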

egrep regular expressions

  • egrep (or grep -E) adds a few more
    • The character + matches one or more of the previous character
      • car+ matches car, carr, or carrrrr, but not ca
    • The character ? matches zero or one of the previous character
      • car?pet matches capet and carpet, but not ca or carrpet
  • (expression1|expression2) matches either expression1 or expression2
  • Note that ? and + don't match anything themselves. They just modify the meaning of the previous character
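
The extended operators can be checked the same way with grep -E. The sample helper and its word list are made up for this sketch.

```shell
# Hypothetical test input.
sample() { printf 'ca\ncar\ncarrrrr\ncapet\ncarpet\ncarrpet\ndog\n'; }

sample | grep -E 'car+'           # car, carrrrr, carpet, carrpet
sample | grep -E '^car?pet$'      # capet and carpet
sample | grep -E '^(dog|car)$'    # car and dog
```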

Fun? with regular expressions

  • The book offers a couple of interesting regular expressions. If you can work out what they match, you have a good grasp of regular expressions.
  • ^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$
  • ^a?b?c?d?e?f?g?h?i?j?k?l?m?n?o?p?q?r?s?t?u?v?w?x?y?z?$
  • The book offers a "thought exercise" in Exercise 4-2 on p. 105:
    • How would things be different if grep could match newlines?
    • (Perl makes this possible.)
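
One way to study the two expressions is to run them over a few words. This sketch uses a made-up word list (the book runs them against a dictionary file). The first pattern accepts lines that contain each of the five vowels exactly once, in alphabetical order; the second accepts lines whose letters appear in alphabetical order with no letter repeated.

```shell
vowels='^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$'
alpha='^a?b?c?d?e?f?g?h?i?j?k?l?m?n?o?p?q?r?s?t?u?v?w?x?y?z?$'

# Hypothetical word list.
words() { printf 'facetious\nabstemious\neducation\nalmost\nbiopsy\nhello\n'; }

words | grep "$vowels"      # facetious and abstemious
words | grep -E "$alpha"    # almost and biopsy (? needs grep -E)
```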

Other filters

  • sort
    • Can sort alphabetically
      • Case sensitive
      • Case insensitive
    • Can sort numerically
    • Can sort ascending or descending
    • Can sort based on part of the line
  • uniq
    • Note the spelling!
    • Discards duplicate lines
    • Can include a count of the number of times each line appears
    • Can print only the duplicated lines, or only the unique lines
  • comm
    • I've actually never used this one
    • diff and cmp are more commonly used, and more useful, I think
  • tr
    • Translates one set of characters into another
    • Can use ranges, just like character classes in regular expressions
    • Examples
      • tr a-z A-Z
        • Capitalizes everything
      • tr aeiot 43107
        • Make something 31337 ("eleet")
  • dd
    • Copies bits from one place to another
    • Can do various transformations on the data (ASCII ↔ EBCDIC)
  • Combining things
    • cat "$@" |
      tr -sc A-Za-z '\012' |
      sort |
      uniq -c |
      sort -n |
      more
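
The pipeline above counts how often each word occurs in its input. Here is a sketch of the same idea wrapped in a function so it can be fed sample text directly; the name wordfreq and the sample sentences are made up, and the leading cat and the trailing more (an interactive pager) are left out.

```shell
wordfreq() {
  tr -sc A-Za-z '\012' |   # squeeze each run of non-letters into one newline
    sort |                 # bring identical words together
    uniq -c |              # collapse duplicates, prefixing each word with a count
    sort -n                # order by count, smallest first
}

printf 'the cat and the hat\nthe cat sat\n' | wordfreq

# The two tr examples from the list above:
printf 'hello\n' | tr a-z A-Z        # HELLO
printf 'elite\n' | tr aeiot 43107    # 3l173
```

The sort before uniq matters: uniq only collapses duplicate lines that are adjacent.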

sed

  • sed is a version of ed that's designed to be used as a filter
  • While ed is rarely used today, sed is still quite useful
    • sed does not alter any named files; the modified version is printed on stdout
      • So, how do you edit a file with sed?
      • Usually with something like:
        • sed [commands] filename >filename.new
          mv filename filename.old
          mv filename.new filename
  • Common usage
    • By far, the most common usage of sed is to replace one thing with another
      • sed 's/foo/bar/g' replaces all occurrences of "foo" with "bar"
      • Here "foo" is a regular expression, not just a literal string
      • You can delete regular expressions by putting a null string for the replacement
    • See the text for other examples and note that grep turns out to be a special case of sed
  • The book builds a "newer" command with sed, which is interesting for how it handles the quoting, but the find command offers a much easier way to do the same thing (thought question...)
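
The common sed usages above can be sketched as follows. The sample text, the scratch directory, and the file name notes are made up for illustration.

```shell
printf 'foo bar foo\nbaz\n' | sed 's/foo/bar/g'    # "bar bar bar" and "baz"
printf 'foo bar foo\nbaz\n' | sed 's/foo//g'       # " bar " and "baz": empty replacement deletes
printf 'foo bar foo\nbaz\n' | sed -n '/baz/p'      # "baz": behaves like grep baz

# Editing a file: write the result to a new file, then swap the names.
dir=$(mktemp -d)
printf 'foo\n' > "$dir/notes"
sed 's/foo/bar/g' "$dir/notes" > "$dir/notes.new"
mv "$dir/notes" "$dir/notes.old"
mv "$dir/notes.new" "$dir/notes"
```

The -n option suppresses sed's normal output, so only lines selected by the p command are printed, which is why grep can be seen as a special case of sed.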

Advanced grep options and patterns

Here are some more options for the grep family that you will be responsible for:

-C NUM, --context=NUM

              Print NUM lines of output context. Places a line containing -- between contiguous groups of matches.

-R, -r, --recursive

              Read all files under each directory, recursively; this is equivalent to the -d recurse option.
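
A small sketch of both options, assuming a grep that supports -C and -r (GNU grep does). The directory layout and file contents are made up.

```shell
dir=$(mktemp -d)
mkdir "$dir/sub"
printf 'one\ntwo\nthree\nfour\nfive\n' > "$dir/top.txt"
printf 'three again\n' > "$dir/sub/deep.txt"

grep -C 1 three "$dir/top.txt"    # two, three, four: one line of context on each side
grep -r -l three "$dir"           # lists both top.txt and sub/deep.txt
```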

And here are some more grep pattern "primitives" you are responsible for learning (we have covered the first few already):

A regular expression may be followed by one of several repetition operators:

?      The preceding item is optional and matched at most once.

*      The preceding item will be matched zero or more times.

+      The preceding item will be matched one or more times.

{n}    The preceding item is matched exactly n times.

{n,}   The preceding item is matched n or more times.

{n,m}  The preceding item is matched at least n times, but not more than m times.
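
The brace forms can be tried with grep -E. The sample helper and its input lines are made up for this sketch.

```shell
# Hypothetical test input: one to four a's followed by b.
sample() { printf 'ab\naab\naaab\naaaab\n'; }

sample | grep -E '^a{2}b$'      # aab
sample | grep -E '^a{2,}b$'     # aab, aaab, aaaab
sample | grep -E '^a{1,3}b$'    # ab, aab, aaab
```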

This is getting VERY close to the following Holy Grail I wished for earlier:

ssh-server% egrep (pattern)x-y