October 15

* Properties of Context-Free Languages

  We want to start proving whether a given language is a CFL.
  We will start with the pumping lemma for CFL.  In order to prove
  the pumping lemma, we need to be familiar with converting grammars to
  "Chomsky Normal Form".

* Chomsky Normal Form

   Any CFL (with e) has a grammar with all productions of the forms



      A -> BC (two nonterminals)
      A -> a  (one terminal)

   If CFL contains e, can make it "almost" CNF by making a new start state S'
and add S' -> S | e

   1.  Get rid of e-productions, unit productions, and useless symbols
   2.  Convert right sides of all existing rules to use just nonterminals
   3.  Make right sides with nonterminals have length 2


   1.  1) Eliminate e-productions

          First assume e not in L(G) (can add S -> e later if e in L(G)).

          A symbol A is "nullable" if A =>* e
	    (if A->e or if A->alpha and all symbols of alpha are nullable).

	  To identify nullable symbols, build from groun up.

	  Nullable = {A: A->e is a production}


	  Repeat until nullable does not change:

	     If A is not nullable and A->alpha s.t. all symbols of alpha
		are nullable,
	     Then nullable = nullable union {A}

      
	  Now, to eliminate e-productions once we know nullables, consider
	  each production


	  A -> x1 x2 ... xn (xi in V)

	  Create productions {A -> alpha1 alpha2 ... alphan}

	     where * alphai = e or xi       if xi nullable
		   * alphai = xi            if xi not nullable
		   * not all alphai are e


	  Example

	     A -> BCdE with C, E nullable

	     Eliminate A -> BCdE and replace with productions of the form

                    +- -+   +- -+
                    | e |   | e |
	     A -> B | C | d | E |
		    +- -+   +- -+

	     Now A -> Bd | BCd | BdE | BCdE.

	     Note that if A -> BC with both B,C nullable, we would add
	     A -> B | C | BC in place of A -> BC but would not add A -> e
	     because e is not in the language).


	     There exists a proof that the language does not change in Hopcroft
	     and Ullman's book.


       2) Eliminate unit productions

	  Assume L(G) is not empty and G as no e-productions.

	  Construct a new grammar G'.
	  G' contains all non-unit productions from G.

	  Forall A, B

	  If A =>* B then add A -> B1 | B2 | B3 | ... | Bn
	     where B -> B1 | B2 | ... | B2 are the non-unit productions of B.

	  Now eliminate all unit productions.


	  Example

	     S -> xAy
	     A -> B | xC
	     B -> C
	     C -> D | z
	     D -> zz

	     Keep all non-unit productions and make first level replacements.


	     S -> xAy
	     A -> C | xC
	     C -> zz | z

	     Make second level replacements.

	     S -> xAy
	     A -> zz | z | xC
	     C -> zz | z


       3) Eliminate useless symbols

	  Note that we want to apply these three steps in this order so that
	  we do not reintroduce e-productions or unit productions.


	  A symbol is useless if it cannot be reached or if it never generates
	  a string.

	  A symbol A is useful iff
	     (1) exists alpha, beta   S =>* alpha A beta
	     (2) exists x in language and A =>* x

	  If either (1) or (2) is not true, then A is useless.

	  We must check condition 2 first and then condition 1.
	  Why?  We may accept A because
	     S -> AB
	     A -> a
	  However, because B is useless A should be useless too.
	  Checking condition 2 first and then condition 1 will catch this case.


	  (2)  Given G = (V, T, P, S), L(G) not empty

	       Construct G' s.t. L(G) = L(G') and G' contains no symbols which
	       do not pass step (2).

	       1.  Forall productions A -> x, x in T*, put A into V' because
		   A can generate a string of terminals.

	       2.  Iterate until no change in V':

		   Forall productions A -> beta where beta in T union V',
		      put A in V'.


	       S -> xSy | vA
	       A -> xB | xDE | xC | x
	       B -> B | BB
	       C -> DC
	       E -> xy | F
	       F -> xyz


	       V' = A, E, F
	       V' = S, A, E, F

	       New productions are
		  S -> xSy | vA
		  A -> x
		  E -> xy | F
		  F -> xyz

	  (1)  Given G = (V, T, P, S)
	       Construct G' s.t. L(G) = L(G') and G' contains no symbols which
	       do not pass step (1).

	       1.  Put S in V'.

	       2.  Iterate until no change in V':

		   If A in V' and A -> alpha1 | alpha2 | ... | alphan
		   Then add all
		      variables in alpha1, alpha2, ..., alphan to V'
		      terminals in alpha1, alpha2, ..., alphan to V' and T'.

		   Let P' = productions using only V' symbols.

		   Simple induciton shows correctness.


		  S -> xSy | vA
		  A -> x
		  E -> xy | F
		  F -> xyz

		  V' = S
		  V' = S, A

		  New productions are
		  S -> xSy | vA
		  A -> x


   2.  Convert right sides of all existing rules to use just nonterminals

       Create new variables X_a, X_b, ..., for all terminals a, b in T
       and add productions X_a -> a, X_b -> b, etc.

       Then replace all terminals on right side of existing rules with
	  corresponding new variable.

       A -> BbCdeF becomes

       A -> B X_b C X_d X_e F

       Add productions

       X_b -> b
       X_d -> d
       X_e -> e

       Now all productions are of the form A -> a or A -> B1 B2 ... Bn, Bi in NT


   3.  Make right sides with nonterminals have length 2

       For A -> B1 B2 ... Bn create new variables D2, D3, ... D{n-1}
       and replace the production with


       A      -> B1 D2
       D2     -> B2 D3
       D3     -> B3 D4
       ...
       D{n-2} -> B{n-2} Dn
       D{n-1} -> B{n-1} Bn


   Example

      S -> aA | bB | b
      A -> Baa | ba
      B -> bAAb | ab

      Step 1.  No e-productions, unit productions, or useless symbols.

      Step 2.

	 S   -> X_a A | X_b B | b
	 A   -> B X_a X_a | X_b X_a
	 B   -> X_b A A X_b | X_a X_b
	 X_a -> a
	 X_b -> b

      Step 3.

	 S   -> X_a A | X_b B | b
	 A   -> B D1 | X_b X_a
	 D1  -> X_a X_a
	 B   -> X_b D2
	 D2  -> A D3
	 D3  -> A X_b
	 X_a -> a
	 X_b -> b