Filters


Filters
        Filters and the Unix philosophy
        grep
        Regular expressions
        egrep regular expressions
        Fun? with regular expressions
        Other filters
        sed


Reading: The Unix Programming Environment, Chapter 4

Filters and the Unix philosophy

  • Unix has a philosophy of using small programs that have a specific purpose
  • These programs are then combined to produce the result you want
  • By giving you a set of "building blocks," Unix lets you handle just about any situation
  • Many of these "building blocks" are "filters"
    • They take some input, do something to it, and produce some output
  • We'll cover a few of these in this section

grep

  • Generally speaking, grep searches for patterns in files
    • Or in stdin, if no files are given
  • The patterns are a class of patterns called regular expressions
    • The name grep comes from the ed command g/re/p: "global regular expression print"
  • Variants of grep, called egrep and fgrep, are also usually available as grep -E and grep -F
    • egrep extends the regular expression syntax
    • fgrep does a "fast" search using fixed strings
  • Some of the most useful options:
    • grep -v prints lines that do not match the pattern
    • grep -i is case-insensitive
    • grep -n prints the line number before each matching line (and the filename, if more than one file is searched)
    • grep -f filename reads the patterns from a file (on some systems, maybe only for fgrep and egrep)
    • grep -l prints only the names of files that contain a match (very useful on command lines: sort `grep -l …` | …)
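The options above can be tried on a couple of throwaway files. This is only a sketch: the file names and their contents are made up for illustration, and the scratch directory is removed at the end.

```shell
# Hypothetical sample data in a scratch directory (names are illustrative only).
tmpdir=$(mktemp -d)
printf 'Apple pie\nbanana split\napple tart\n' > "$tmpdir/desserts.txt"
printf 'carrot soup\n' > "$tmpdir/soups.txt"

# -i: match regardless of case (finds both "Apple pie" and "apple tart")
insensitive=$(grep -i 'apple' "$tmpdir/desserts.txt")

# -v: print the lines that do NOT match the pattern
inverted=$(grep -v 'apple' "$tmpdir/desserts.txt")

# -n: prefix each matching line with its line number
numbered=$(grep -n 'apple' "$tmpdir/desserts.txt")

# -l: print only the names of the files that contain a match
matching_files=$(cd "$tmpdir" && grep -l 'apple' desserts.txt soups.txt)

printf '%s\n' "$insensitive" "$inverted" "$numbered" "$matching_files"
rm -rf "$tmpdir"
```

Because grep -l prints bare filenames, its output can be handed straight to another command, which is what the backquote example in the last bullet is doing.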

Regular expressions

  • Regular expressions are basically mini-algorithms that specify how to match text
    • Regular expressions look similar to shell patterns, but are quite a bit different
  • The simplest regular expression is a single letter, which matches that letter
    • a matches a, abcde, or supercalifragilisticexpialidocious
  • A sequence of letters matches that sequence
    • cat matches cat, caterpillar, or scatological
  • The character . (a dot) matches any character
  • The character * indicates zero or more occurrences of the preceding character
    • car* matches cat, carry, or carolina
    • ar*a matches sarah, saab, or marrrrrrrrrrrrra, but not marrrrrrrrrtha
  • ^ matches the beginning of a line
  • $ matches the end of a line
    • So ^$ matches a blank line
  • [....] matches any of the characters given, and ranges can be specified
    • [0-9] matches any digit
    • [0-9]* matches zero or more digits
  • [^....] matches any character other than those listed, and ranges can be specified
    • [^0-9] matches any non-digit
  • Note that * doesn't match anything itself. It just modifies the meaning of the previous character
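A quick way to check these rules is to feed grep a small word list on stdin and see which lines survive. The sketch below assumes a made-up list (the sample function is hypothetical, not part of grep); the last entry is an empty line so that ^$ has something to match.

```shell
# Hypothetical word list, one entry per line; the final '' produces a blank line.
sample() { printf '%s\n' cat carry saab sarah martha 42 ''; }

# ar*a: an "a", zero or more "r"s, then another "a"
ara=$(sample | grep 'ar*a')

# ^ca: lines that begin with "ca"
starts_ca=$(sample | grep '^ca')

# c.t: "c", any single character, then "t"
dot=$(sample | grep 'c.t')

# ^[0-9][0-9]*$: lines made up of one or more digits and nothing else
digits=$(sample | grep '^[0-9][0-9]*$')

# ^$: blank lines (-c prints a count instead of the lines)
blank_count=$(sample | grep -c '^$')

# [0-9]* can match zero digits, so on its own it matches every line
every_line=$(sample | grep -c '[0-9]*')

printf '%s\n' "$ara" "$starts_ca" "$dot" "$digits" "$blank_count" "$every_line"
```

The last command is the practical consequence of "zero or more": a pattern that can match nothing matches everywhere, so it usually needs an anchor or a required character next to it.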

egrep regular expressions

  • egrep (or grep -E) adds a few more operators
    • The character + matches one or more of the previous character
      • car+ matches car, carr, or carrrrr, but not ca
    • The character ? matches zero or one of the previous character
      • car?pet matches capet and carpet, but not ca or carrpet
  • (expression1|expression2) matches either expression1 or expression2
  • Note that ? and + don't match anything themselves. They just modify the meaning of the previous character
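The extended operators can be tested the same way with grep -E. This is a sketch over a made-up word list (the sample function is hypothetical); the ^ and $ anchors are added so that each pattern has to match the whole line.

```shell
# Hypothetical word list, one entry per line.
sample() { printf '%s\n' ca car carrrr capet carpet carrpet dog; }

# car+: "ca" followed by one or more "r"s, anywhere in the line
plus=$(sample | grep -E 'car+')

# ^car?pet$: the "r" is optional, but at most one is allowed
optional=$(sample | grep -E '^car?pet$')

# ^(ca|dog)$: the whole line is either "ca" or "dog"
either=$(sample | grep -E '^(ca|dog)$')

printf '%s\n' "$plus" "$optional" "$either"
```

Without the anchors, car+ also picks up carpet and carrpet, since grep only needs the pattern to match somewhere in the line.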

Fun? with regular expressions

  • The book offers a couple of interesting regular expressions. If you understand them, you could be considered to have a good understanding of regular expressions.
  • ^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$
  • ^a?b?c?d?e?f?g?h?i?j?k?l?m?n?o?p?q?r?s?t?u?v?w?x?y?z?$
  • The book offers a "thought exercise" in Exercise 4-2 on p. 105:
    • How would things be different if grep could match newlines?
    • (Perl makes this possible.)
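One way to get a feel for the two expressions is to run them over a few words. The sketch below uses a short made-up list instead of a real dictionary file (the sample function and the variable names are hypothetical). The first pattern uses only basic syntax, while the second relies on ?, so it needs grep -E.

```shell
# The two patterns from the list above, stored in (hypothetical) variables.
vowels_in_order='^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$'
letters_in_order='^a?b?c?d?e?f?g?h?i?j?k?l?m?n?o?p?q?r?s?t?u?v?w?x?y?z?$'

# Hypothetical word list standing in for a dictionary file.
sample() { printf '%s\n' facetious abstemious education almost biopsy zebra; }

# Words containing each of the five vowels exactly once, in alphabetical order
five_vowels=$(sample | grep "$vowels_in_order")

# Words whose letters appear in alphabetical order, with no letter repeated
alphabetical=$(sample | grep -E "$letters_in_order")

printf '%s\n' "$five_vowels" "$alphabetical"
```

Note that "education" has all five vowels but fails the first pattern, because the vowels must appear in the order a, e, i, o, u.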

GNU Grep 3.0 Goodies

GNU Grep 3.0 adds character classes and other goodies that can save you time if you do a lot of grepping. Link.  (For now, focus on the baseline grep stuff above: I will announce later how much of this is testable.)

Other filters

  • sort
    • Can sort alphabetically
      • Case sensitive
      • Case insensitive
    • Can sort numerically
    • Can sort ascending or descending
    • Can sort based on part of the line
  • uniq
    • Note the spelling!
    • Discards duplicate lines
    • Can include a count of the number of times each line appears
    • Can print only the duplicated lines, or only the unique lines
  • comm
    • I've actually never used this one
    • diff and cmp are more commonly used, and more useful, I think
  • tr
    • Translates one set of characters into another
    • Can use ranges, just like character classes in regular expressions
    • Examples
      • tr a-z A-Z
        • Capitalizes everything
      • tr aeiot 43107
        • Make something 31337 ("eleet")
  • dd
    • Copies bits from one place to another
    • Can do various transformations on the data (ASCII ↔ EBCDIC)
  • Combining things
    • cat $* |
      tr -sc A-Za-z '\012' |
      sort |
      uniq -c |
      sort -n |
      more
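The filters above can be tried on small inputs. This is a sketch with made-up input text; the word-count pipeline has the same shape as the one above, with more replaced by a sed command that strips the leading spaces uniq -c adds, so the result is easier to compare.

```shell
# tr with ranges: upper-case everything
upper=$(printf 'hello world\n' | tr 'a-z' 'A-Z')

# tr with explicit sets: a->4, e->3, i->1, o->0, t->7
leet=$(printf 'elite\n' | tr 'aeiot' '43107')

# sort -n compares numerically, so 9 sorts before 10 and 100
numeric=$(printf '%s\n' 10 9 100 | sort -n)

# uniq only notices adjacent duplicates, so sort first; -d prints only repeated lines
dups=$(printf '%s\n' b a b | sort | uniq -d)

# Word frequencies: turn every run of non-letters into one newline,
# sort so equal words are adjacent, count them, then sort by the count.
text='the cat and the dog and the bird'   # hypothetical input
counts=$(printf '%s\n' "$text" |
  tr -sc 'A-Za-z' '\012' |
  sort |
  uniq -c |
  sort -n |
  sed 's/^ *//')
most_common=$(printf '%s\n' "$counts" | tail -n 1)

printf '%s\n' "$upper" "$leet" "$numeric" "$dups" "$counts"
```

Each stage does one small job, and the order matters: uniq -c would undercount if the words were not sorted first.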

sed

  • sed is a version of ed that's designed to be used as a filter
  • While ed is no longer useful, sed is still quite useful
    • sed does not alter any named files; the modified version is printed on stdout
      • So, how do you edit a file with sed?
      • Usually with something like:
        • sed [commands] filename >filename.new
          mv filename filename.old
          mv filename.new filename
  • Common usage
    • By far, the most common usage of sed is to replace one thing with another
      • sed 's/foo/bar/g' replaces all occurrences of "foo" with "bar"
      • "foo" is a regular expression
      • You can delete whatever matches a regular expression by giving an empty string as the replacement
    • See the text for other examples and note that grep turns out to be a special case of sed
  • The book makes a "newer" command with sed, which is of interest for how they do the quoting, but the find command does a much easier version of "newer"
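The common sed usages above can be sketched as follows. The file name and its contents are made up for illustration, and everything happens in a scratch directory that is removed at the end.

```shell
# Hypothetical file in a scratch directory.
tmpdir=$(mktemp -d)
file="$tmpdir/notes.txt"
printf 'foo one\ntwo foo foo\nthree\n' > "$file"

# Replace every "foo" with "bar"; the result goes to stdout, the file is untouched
replaced=$(sed 's/foo/bar/g' "$file")

# An empty replacement deletes whatever the pattern matches
deleted=$(printf 'xfoox\n' | sed 's/foo//g')

# grep as a special case of sed: -n turns off the default printing,
# and /foo/p prints only the lines that match
grepped=$(sed -n '/foo/p' "$file")

# "Editing" the file: write a new copy, keep the old one, then swap them
sed 's/foo/bar/g' "$file" > "$file.new"
mv "$file" "$file.old"
mv "$file.new" "$file"
edited=$(cat "$file")
original=$(cat "$file.old")

printf '%s\n' "$replaced" "$deleted" "$grepped" "$edited"
rm -rf "$tmpdir"
```

Writing to a separate file first matters: redirecting sed's output onto the file it is still reading would truncate the input before sed sees it.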