```October 15

* Properties of Context-Free Languages

We want to start proving whether a given language is a CFL.
We will start with the pumping lemma for CFL.  In order to prove
the pumping lemma, we need to be familiar with converting grammars to
"Chomsky Normal Form".

* Chomsky Normal Form

Any CFL (with e) has a grammar with all productions of the forms

A -> BC (two nonterminals)
A -> a  (one terminal)

If CFL contains e, can make it "almost" CNF by making a new start state S'
and add S' -> S | e

1.  Get rid of e-productions, unit productions, and useless symbols
2.  Convert right sides of all existing rules to use just nonterminals
3.  Make right sides with nonterminals have length 2

1.  1) Eliminate e-productions

First assume e not in L(G) (can add S -> e later if e in L(G)).

A symbol A is "nullable" if A =>* e
(if A->e or if A->alpha and all symbols of alpha are nullable).

To identify nullable symbols, build from groun up.

Nullable = {A: A->e is a production}

Repeat until nullable does not change:

If A is not nullable and A->alpha s.t. all symbols of alpha
are nullable,
Then nullable = nullable union {A}

Now, to eliminate e-productions once we know nullables, consider
each production

A -> x1 x2 ... xn (xi in V)

Create productions {A -> alpha1 alpha2 ... alphan}

where * alphai = e or xi       if xi nullable
* alphai = xi            if xi not nullable
* not all alphai are e

Example

A -> BCdE with C, E nullable

Eliminate A -> BCdE and replace with productions of the form

+- -+   +- -+
| e |   | e |
A -> B | C | d | E |
+- -+   +- -+

Now A -> Bd | BCd | BdE | BCdE.

Note that if A -> BC with both B,C nullable, we would add
A -> B | C | BC in place of A -> BC but would not add A -> e
because e is not in the language).

There exists a proof that the language does not change in Hopcroft
and Ullman's book.

2) Eliminate unit productions

Assume L(G) is not empty and G as no e-productions.

Construct a new grammar G'.
G' contains all non-unit productions from G.

Forall A, B

If A =>* B then add A -> B1 | B2 | B3 | ... | Bn
where B -> B1 | B2 | ... | B2 are the non-unit productions of B.

Now eliminate all unit productions.

Example

S -> xAy
A -> B | xC
B -> C
C -> D | z
D -> zz

Keep all non-unit productions and make first level replacements.

S -> xAy
A -> C | xC
C -> zz | z

Make second level replacements.

S -> xAy
A -> zz | z | xC
C -> zz | z

3) Eliminate useless symbols

Note that we want to apply these three steps in this order so that
we do not reintroduce e-productions or unit productions.

A symbol is useless if it cannot be reached or if it never generates
a string.

A symbol A is useful iff
(1) exists alpha, beta   S =>* alpha A beta
(2) exists x in language and A =>* x

If either (1) or (2) is not true, then A is useless.

We must check condition 2 first and then condition 1.
Why?  We may accept A because
S -> AB
A -> a
However, because B is useless A should be useless too.
Checking condition 2 first and then condition 1 will catch this case.

(2)  Given G = (V, T, P, S), L(G) not empty

Construct G' s.t. L(G) = L(G') and G' contains no symbols which
do not pass step (2).

1.  Forall productions A -> x, x in T*, put A into V' because
A can generate a string of terminals.

2.  Iterate until no change in V':

Forall productions A -> beta where beta in T union V',
put A in V'.

S -> xSy | vA
A -> xB | xDE | xC | x
B -> B | BB
C -> DC
E -> xy | F
F -> xyz

V' = A, E, F
V' = S, A, E, F

New productions are
S -> xSy | vA
A -> x
E -> xy | F
F -> xyz

(1)  Given G = (V, T, P, S)
Construct G' s.t. L(G) = L(G') and G' contains no symbols which
do not pass step (1).

1.  Put S in V'.

2.  Iterate until no change in V':

If A in V' and A -> alpha1 | alpha2 | ... | alphan
variables in alpha1, alpha2, ..., alphan to V'
terminals in alpha1, alpha2, ..., alphan to V' and T'.

Let P' = productions using only V' symbols.

Simple induciton shows correctness.

S -> xSy | vA
A -> x
E -> xy | F
F -> xyz

V' = S
V' = S, A

New productions are
S -> xSy | vA
A -> x

2.  Convert right sides of all existing rules to use just nonterminals

Create new variables X_a, X_b, ..., for all terminals a, b in T
and add productions X_a -> a, X_b -> b, etc.

Then replace all terminals on right side of existing rules with
corresponding new variable.

A -> BbCdeF becomes

A -> B X_b C X_d X_e F

X_b -> b
X_d -> d
X_e -> e

Now all productions are of the form A -> a or A -> B1 B2 ... Bn, Bi in NT

3.  Make right sides with nonterminals have length 2

For A -> B1 B2 ... Bn create new variables D2, D3, ... D{n-1}
and replace the production with

A      -> B1 D2
D2     -> B2 D3
D3     -> B3 D4
...
D{n-2} -> B{n-2} Dn
D{n-1} -> B{n-1} Bn

Example

S -> aA | bB | b
A -> Baa | ba
B -> bAAb | ab

Step 1.  No e-productions, unit productions, or useless symbols.

Step 2.

S   -> X_a A | X_b B | b
A   -> B X_a X_a | X_b X_a
B   -> X_b A A X_b | X_a X_b
X_a -> a
X_b -> b

Step 3.

S   -> X_a A | X_b B | b
A   -> B D1 | X_b X_a
D1  -> X_a X_a
B   -> X_b D2
D2  -> A D3
D3  -> A X_b
X_a -> a
X_b -> b
```