```September 10

* Regular Expressions

Recall definitions of language, string
Recall operations on strings

- A palindrome is a string that reads the same forward and backward.
A formal definition follows.
1. e is a palindrome
2. If a is any symbol, then the string a is a palindrome
3. If a is any symbol and x is a palindrome, then axa is a palindrome

Prove by induction that the two definitions are equivalent
(x = x^R iff x is a palindrome).

1.  x is palindrome => fits rules
Clearly every string satisfying the second definition reads the same
forward and backward (thus if follows rules then x = x^R).  Suppose
x reads the same forward and backward.

2.  fits rules => x is palindrome
We prove by induction on the length of x that x's being a palindrome
follows from rules (1) through (3).

Base Case:  If |x| <= 1, then x is either e or a single symbol a
and rule (1) or (2) applies.

Inductive Hypothesis:  Assume true if |x|<=n

Inductive Step:  Prove true for |x|=n+1
If |x| > 1, then x begins and ends with some symbol a.
Thus x = awa, where w reads the same forward and backward and is shorter
than x (|w| = n-1).  By the inductive hypothesis, rules (1) through (3)
imply that w is a palindrome.  Thus by rule (3), w = awa is a palindrome.

Recall definition of concatenation
xy = string x followed by string y (can apply to characters and languages)
w^i = string w concatenated with itself i times
a* = character a concatenated with itself 0 or more times
(any number of times)
w* = string w concatenated with itself 0 or more times
(any number of times)
L* = language L concategnated with itself 0 or more times
(any number of times
a+, w+ = character (string) concatenated with itself 1 or more times

A regular expression is a pattern that describes a set of strings
(a language).
Regular expressions over alphabet Sigma are strings of the alphabet
Sigma v {(, ), phi, v, *} s.t.

1.  phi is a regular expression denoting language {}
2.  e is a regular expression denoting language {e}
3.  For all a in Sigma, a is a regular expression denoting {a}
4.  If r and s are regular expressions denoting languages R and S
respectively, then (r v s) - also written as (r + s),
(rs), (r*) are regular expressions denoting languages RvS, RS, and
R*, respectively.

In writing regular expressions, we can omit parentheses by assuming
the precedence (*, ., v).

L(r) = language denoted by r.

Examples

- (0 v 1)* denotes Sigma* = {e, 0, 1, 00, 01, 10, 11, 000, 001, ...}
- 0* v (0*10*10*10*)* is #1s divisible by 3 = {e, 0, 111, 0101010, ...}
- integer = (+ v - v e)[(1-9)(0-9)*] v 0
- (0 + e) (1 + 10)* is no two consecutive 0s
- (1 + 01)*(0 + e) is no two consecutive 0s

- A re for set of strings over {a,b} that contain exactly two "b"s
This must explicitly ensure the presence of two "b"s.
Any number of "a"s may occur before, between, and after the "b"s.

a*ba*ba*

- A re for set of strings over {a,b} containing two or more "b"s
a*ba*b(a v b)*
(a v b)*ba*ba*
(a v b)*b(a v b)*b(a v b)*

Two expressions that represent the same set are "equivalent" expressions.
For example, (0 v 1 v e)* is equivalent to (0 v 1)*

- A re for set of strings over {a,b} that do not contain the substring aa.
A string in this set may contain a prefix of any number of "b"s.
All "a"s must be followed by at least one "b" or terminate the string.
The re b*(ab+) v b*(ab+)*a generates the desired set.
= b*(ab+)*(e v a)
= b*(abb*)*(e v a)
= (b v ab)*(e v a)

* CS applications
- grep, egrep, text searches
Look at grep and egrep man pages

Example:  search for strings in following text:

SOFTSHARE Order Number: 970403-0008PROCURE
A -- NEXT GENERATION UNMANNED VEHICLES SOL DAAE07-97-R-X064 DUE 060397
POC Shannon Tighe (810) 574-7230 The U.S. Army Tank-automotive and
Armaments Command is seeking proposals from sources interested in the
* Next Generation Unmanned Vehicles (NGUV) Robotics Program. The NGUV,
* under the direction of the Robotics Office at TARDEC in Warren, MI, and
in cooperation with both the Army Research Labs (ARL) and the Missile
Command (MICOM), is an effort to select, develop and integrate
autonomous mobility and navigation technologies onto a vehicle platform.
This four year effort will be known as DEMO III. DEMO III is the initial
and primary thrust of the UGVTEE Program (FY97-01), an initiative to
develop and demonstrate a new, smaller unmanned ground vehicle platform.
DEMO III grows directly out of the DEMO II Program. It concentrates on
exploiting the technology achievements of the DEMO II Program, which are
intended to serve as evolutionary steps to support a wide range of
future applications. The fundamental objective of DEMO III, for the
technology perspective, is to develop and demonstrate a range of
critical technologies, which, when combined, provides a significant,
step-function improvement in current UGV capabilities. The technology
focus will be on mobility, intelligent system architectures, human
interface and planning, reconnaissance, surveillance and Target
Acquisition (RSTA), and simulation, emulation, and testing. The main
mission of DEMO III is to NGUV Platform integration and demonstration of
* four robotics vehicles for use in the Army Battle Lab Experiments (FY98)
and extended user appraisal technology Demonstration (FY00). Options may
be included for additional contract support during user appraisal.
Solicitation issuance is estimated for 21 April 97 with proposals due
approximately 45 days later on 3 June 1997. Full and open competition
will be utilized for this procurement. See Note: 26. All responsible
sources may submit an offer which will be considered. The offer due date
is on or about 060397 (See solicitation for actual offer due date).

>egrep 'DEMO I+|UGV' text

This four year effort will be known as DEMO III. DEMO III is the initial
and primary thrust of the UGVTEE Program (FY97-01), an initiative to
DEMO III grows directly out of the DEMO II Program. It concentrates on
exploiting the technology achievements of the DEMO II Program, which are
future applications. The fundamental objective of DEMO III, for the
step-function improvement in current UGV capabilities. The technology
mission of DEMO III is to NGUV Platform integration and demonstration of

- net searches?

Use Boolean expressions, subset of regular expressions

WebCrawler (AND, OR, NOT)

Yahoo (AND, OR, * for wildcard matches,
use a plus sign in front of word to require word, - for not)

Infoseek (use a plus sign in front of a word that must appear)

- LEX uses regular expressions, YACC uses context-free grammars

lex generates programs to be used in simple lexical analysis
of text.  Each filename (the standard input by default) con-
tains regular expressions to search for, and actions written
in C to be executed when expressions are found.

A C source program, lex.yy.c is generated, to be compiled as
follows:

cc lex.yy.c -ll

This program, when run, copies unrecognized portions of  the
input  to  the  output, and executes the associated C action
for each regular expression that is recognized.  The  actual
string  matched  is  left  in  yytext, an external character
array.

The following:

%% [A-Z]     putchar (yytext[0]+'a'-'A'); [ ]+\$     ; [
]+ putchar(' ');

is an example of a lex program.  It converts upper  case  to
lower, removes blanks at the end of lines, and replaces mul-
tiple blanks by single blanks.

D       [0-9]
%%
if      printf("IF statement\n");
[a-z]+  printf("tag, value %s\n",yytext);
0{D}+   printf("octal number %s\n",yytext);
{D}+    printf("decimal number %s\n",yytext);
"++"    printf("unary op\n");
"+"     printf("binary op\n");
"/*"    {       loop:
while (input() != '*');
switch (input())
{
case '/': break;
case '*': unput('*');
default: go to loop;
}
}

The class of regular languages is fairly constraining.  For example,
we will show later that {0^n 1^n: n>=1} is not regular and cannot be
described using a regular language.

Sometimes we need a way of determining whether a given string w has the
property P.  An "algorithm" can do this - an algorithm is a finite sequence
of instructions that will always terminate with the correct answer to a
corresponding question.

Q:  What is m * n?
There exists an algorithm that will terminate in finite time with answer.

An algorithm that is designed to answer questions of the form
"Is string w a member of language L?" is a "language recognition device".
Such an algorithm will say yes or no to a string being a member of
L = {w in {a,b}*:  w does not have the substring bbb}.

Regular expressions are "language generators", they provide a mechanism to
list elements of a language.  We will next consider a mechanism that is a
"language recognition device" for regular languages.
```