September 10

* Regular Expressions

  Recall definitions of language, string
  Recall operations on strings


   - A palindrome is a string that reads the same forward and backward.
     A formal definition follows.
	1. e is a palindrome
	2. If a is any symbol, then the string a is a palindrome
	3. If a is any symbol and x is a palindrome, then axa is a palindrome

     Prove by induction that the two definitions are equivalent
	(x = x^R iff x is a palindrome).

     1.  x is palindrome => fits rules
     Clearly every string satisfying the second definition reads the same
     forward and backward (thus if follows rules then x = x^R).  Suppose
     x reads the same forward and backward.
     
     2.  fits rules => x is palindrome
     We prove by induction on the length of x that x's being a palindrome
     follows from rules (1) through (3).
     
     Base Case:  If |x| <= 1, then x is either e or a single symbol a
        and rule (1) or (2) applies.

     Inductive Hypothesis:  Assume true if |x|<=n
	
     Inductive Step:  Prove true for |x|=n+1
     If |x| > 1, then x begins and ends with some symbol a.
     Thus x = awa, where w reads the same forward and backward and is shorter
     than x (|w| = n-1).  By the inductive hypothesis, rules (1) through (3)
     imply that w is a palindrome.  Thus by rule (3), w = awa is a palindrome.


  Recall definition of concatenation
     xy = string x followed by string y (can apply to characters and languages)
     w^i = string w concatenated with itself i times
     a* = character a concatenated with itself 0 or more times
	  (any number of times)
     w* = string w concatenated with itself 0 or more times
	  (any number of times)
     L* = language L concategnated with itself 0 or more times
	  (any number of times
     a+, w+ = character (string) concatenated with itself 1 or more times

  A regular expression is a pattern that describes a set of strings
     (a language).
  Regular expressions over alphabet Sigma are strings of the alphabet
     Sigma v {(, ), phi, v, *} s.t.

     1.  phi is a regular expression denoting language {}
     2.  e is a regular expression denoting language {e}
     3.  For all a in Sigma, a is a regular expression denoting {a}
     4.  If r and s are regular expressions denoting languages R and S
	 respectively, then (r v s) - also written as (r + s),
	 (rs), (r*) are regular expressions denoting languages RvS, RS, and
	 R*, respectively.


   In writing regular expressions, we can omit parentheses by assuming
the precedence (*, ., v).

   L(r) = language denoted by r.

   Examples

   - (0 v 1)* denotes Sigma* = {e, 0, 1, 00, 01, 10, 11, 000, 001, ...}
   - 0* v (0*10*10*10*)* is #1s divisible by 3 = {e, 0, 111, 0101010, ...}
   - integer = (+ v - v e)[(1-9)(0-9)*] v 0
   - (0 + e) (1 + 10)* is no two consecutive 0s
   - (1 + 01)*(0 + e) is no two consecutive 0s

   - A re for set of strings over {a,b} that contain exactly two "b"s
     This must explicitly ensure the presence of two "b"s.
     Any number of "a"s may occur before, between, and after the "b"s.

     a*ba*ba*

   - A re for set of strings over {a,b} containing two or more "b"s
      a*ba*b(a v b)*
      (a v b)*ba*ba*
      (a v b)*b(a v b)*b(a v b)*

   Two expressions that represent the same set are "equivalent" expressions.
   For example, (0 v 1 v e)* is equivalent to (0 v 1)*

   - A re for set of strings over {a,b} that do not contain the substring aa.
     A string in this set may contain a prefix of any number of "b"s.
     All "a"s must be followed by at least one "b" or terminate the string.
     The re b*(ab+) v b*(ab+)*a generates the desired set.
	= b*(ab+)*(e v a)
	= b*(abb*)*(e v a)
	= (b v ab)*(e v a)



* CS applications
  - grep, egrep, text searches
    Look at grep and egrep man pages

   Example:  search for strings in following text:

  SOFTSHARE Order Number: 970403-0008PROCURE
  A -- NEXT GENERATION UNMANNED VEHICLES SOL DAAE07-97-R-X064 DUE 060397 
  POC Shannon Tighe (810) 574-7230 The U.S. Army Tank-automotive and 
  Armaments Command is seeking proposals from sources interested in the 
* Next Generation Unmanned Vehicles (NGUV) Robotics Program. The NGUV, 
* under the direction of the Robotics Office at TARDEC in Warren, MI, and 
  in cooperation with both the Army Research Labs (ARL) and the Missile 
  Command (MICOM), is an effort to select, develop and integrate 
  autonomous mobility and navigation technologies onto a vehicle platform. 
  This four year effort will be known as DEMO III. DEMO III is the initial 
  and primary thrust of the UGVTEE Program (FY97-01), an initiative to 
  develop and demonstrate a new, smaller unmanned ground vehicle platform. 
  DEMO III grows directly out of the DEMO II Program. It concentrates on 
  exploiting the technology achievements of the DEMO II Program, which are 
  intended to serve as evolutionary steps to support a wide range of 
  future applications. The fundamental objective of DEMO III, for the 
  technology perspective, is to develop and demonstrate a range of 
  critical technologies, which, when combined, provides a significant, 
  step-function improvement in current UGV capabilities. The technology 
  focus will be on mobility, intelligent system architectures, human 
  interface and planning, reconnaissance, surveillance and Target 
  Acquisition (RSTA), and simulation, emulation, and testing. The main 
  mission of DEMO III is to NGUV Platform integration and demonstration of 
* four robotics vehicles for use in the Army Battle Lab Experiments (FY98) 
  and extended user appraisal technology Demonstration (FY00). Options may 
  be included for additional contract support during user appraisal. 
  Solicitation issuance is estimated for 21 April 97 with proposals due 
  approximately 45 days later on 3 June 1997. Full and open competition 
  will be utilized for this procurement. See Note: 26. All responsible 
  sources may submit an offer which will be considered. The offer due date 
  is on or about 060397 (See solicitation for actual offer due date). 

  >egrep 'DEMO I+|UGV' text

  This four year effort will be known as DEMO III. DEMO III is the initial 
  and primary thrust of the UGVTEE Program (FY97-01), an initiative to 
  DEMO III grows directly out of the DEMO II Program. It concentrates on 
  exploiting the technology achievements of the DEMO II Program, which are 
  future applications. The fundamental objective of DEMO III, for the 
  step-function improvement in current UGV capabilities. The technology 
  mission of DEMO III is to NGUV Platform integration and demonstration of 

  - net searches?

    Use Boolean expressions, subset of regular expressions

    WebCrawler (AND, OR, NOT)

    Yahoo (AND, OR, * for wildcard matches,
       use a plus sign in front of word to require word, - for not)

    Infoseek (use a plus sign in front of a word that must appear)


  - LEX uses regular expressions, YACC uses context-free grammars

     lex generates programs to be used in simple lexical analysis
     of text.  Each filename (the standard input by default) con-
     tains regular expressions to search for, and actions written
     in C to be executed when expressions are found.

     A C source program, lex.yy.c is generated, to be compiled as
     follows:

          cc lex.yy.c -ll

     This program, when run, copies unrecognized portions of  the
     input  to  the  output, and executes the associated C action
     for each regular expression that is recognized.  The  actual
     string  matched  is  left  in  yytext, an external character
     array.


     The following:

          %% [A-Z]     putchar (yytext[0]+'a'-'A'); [ ]+$     ; [
          ]+ putchar(' ');

     is an example of a lex program.  It converts upper  case  to
     lower, removes blanks at the end of lines, and replaces mul-
     tiple blanks by single blanks.

          D       [0-9]
          %%
          if      printf("IF statement\n");
          [a-z]+  printf("tag, value %s\n",yytext);
          0{D}+   printf("octal number %s\n",yytext);
          {D}+    printf("decimal number %s\n",yytext);
          "++"    printf("unary op\n");
          "+"     printf("binary op\n");
          "/*"    {       loop:
                          while (input() != '*');
                          switch (input())
                                  {
                                  case '/': break;
                                  case '*': unput('*');
                                  default: go to loop;
                                  }
                          }

   The class of regular languages is fairly constraining.  For example,
   we will show later that {0^n 1^n: n>=1} is not regular and cannot be
   described using a regular language.

   Sometimes we need a way of determining whether a given string w has the
   property P.  An "algorithm" can do this - an algorithm is a finite sequence
   of instructions that will always terminate with the correct answer to a
   corresponding question.

   Q:  What is m * n?
   There exists an algorithm that will terminate in finite time with answer.

   An algorithm that is designed to answer questions of the form
   "Is string w a member of language L?" is a "language recognition device".
   Such an algorithm will say yes or no to a string being a member of
      L = {w in {a,b}*:  w does not have the substring bbb}.

   Regular expressions are "language generators", they provide a mechanism to
   list elements of a language.  We will next consider a mechanism that is a
   "language recognition device" for regular languages.