October 15 * Properties of Context-Free Languages We want to start proving whether a given language is a CFL. We will start with the pumping lemma for CFL. In order to prove the pumping lemma, we need to be familiar with converting grammars to "Chomsky Normal Form". * Chomsky Normal Form Any CFL (with e) has a grammar with all productions of the forms A -> BC (two nonterminals) A -> a (one terminal) If CFL contains e, can make it "almost" CNF by making a new start state S' and add S' -> S | e 1. Get rid of e-productions, unit productions, and useless symbols 2. Convert right sides of all existing rules to use just nonterminals 3. Make right sides with nonterminals have length 2 1. 1) Eliminate e-productions First assume e not in L(G) (can add S -> e later if e in L(G)). A symbol A is "nullable" if A =>* e (if A->e or if A->alpha and all symbols of alpha are nullable). To identify nullable symbols, build from groun up. Nullable = {A: A->e is a production} Repeat until nullable does not change: If A is not nullable and A->alpha s.t. all symbols of alpha are nullable, Then nullable = nullable union {A} Now, to eliminate e-productions once we know nullables, consider each production A -> x1 x2 ... xn (xi in V) Create productions {A -> alpha1 alpha2 ... alphan} where * alphai = e or xi if xi nullable * alphai = xi if xi not nullable * not all alphai are e Example A -> BCdE with C, E nullable Eliminate A -> BCdE and replace with productions of the form +- -+ +- -+ | e | | e | A -> B | C | d | E | +- -+ +- -+ Now A -> Bd | BCd | BdE | BCdE. Note that if A -> BC with both B,C nullable, we would add A -> B | C | BC in place of A -> BC but would not add A -> e because e is not in the language). There exists a proof that the language does not change in Hopcroft and Ullman's book. 2) Eliminate unit productions Assume L(G) is not empty and G as no e-productions. Construct a new grammar G'. G' contains all non-unit productions from G. Forall A, B If A =>* B then add A -> B1 | B2 | B3 | ... | Bn where B -> B1 | B2 | ... | B2 are the non-unit productions of B. Now eliminate all unit productions. Example S -> xAy A -> B | xC B -> C C -> D | z D -> zz Keep all non-unit productions and make first level replacements. S -> xAy A -> C | xC C -> zz | z Make second level replacements. S -> xAy A -> zz | z | xC C -> zz | z 3) Eliminate useless symbols Note that we want to apply these three steps in this order so that we do not reintroduce e-productions or unit productions. A symbol is useless if it cannot be reached or if it never generates a string. A symbol A is useful iff (1) exists alpha, beta S =>* alpha A beta (2) exists x in language and A =>* x If either (1) or (2) is not true, then A is useless. We must check condition 2 first and then condition 1. Why? We may accept A because S -> AB A -> a However, because B is useless A should be useless too. Checking condition 2 first and then condition 1 will catch this case. (2) Given G = (V, T, P, S), L(G) not empty Construct G' s.t. L(G) = L(G') and G' contains no symbols which do not pass step (2). 1. Forall productions A -> x, x in T*, put A into V' because A can generate a string of terminals. 2. Iterate until no change in V': Forall productions A -> beta where beta in T union V', put A in V'. S -> xSy | vA A -> xB | xDE | xC | x B -> B | BB C -> DC E -> xy | F F -> xyz V' = A, E, F V' = S, A, E, F New productions are S -> xSy | vA A -> x E -> xy | F F -> xyz (1) Given G = (V, T, P, S) Construct G' s.t. L(G) = L(G') and G' contains no symbols which do not pass step (1). 1. Put S in V'. 2. Iterate until no change in V': If A in V' and A -> alpha1 | alpha2 | ... | alphan Then add all variables in alpha1, alpha2, ..., alphan to V' terminals in alpha1, alpha2, ..., alphan to V' and T'. Let P' = productions using only V' symbols. Simple induciton shows correctness. S -> xSy | vA A -> x E -> xy | F F -> xyz V' = S V' = S, A New productions are S -> xSy | vA A -> x 2. Convert right sides of all existing rules to use just nonterminals Create new variables X_a, X_b, ..., for all terminals a, b in T and add productions X_a -> a, X_b -> b, etc. Then replace all terminals on right side of existing rules with corresponding new variable. A -> BbCdeF becomes A -> B X_b C X_d X_e F Add productions X_b -> b X_d -> d X_e -> e Now all productions are of the form A -> a or A -> B1 B2 ... Bn, Bi in NT 3. Make right sides with nonterminals have length 2 For A -> B1 B2 ... Bn create new variables D2, D3, ... D{n-1} and replace the production with A -> B1 D2 D2 -> B2 D3 D3 -> B3 D4 ... D{n-2} -> B{n-2} Dn D{n-1} -> B{n-1} Bn Example S -> aA | bB | b A -> Baa | ba B -> bAAb | ab Step 1. No e-productions, unit productions, or useless symbols. Step 2. S -> X_a A | X_b B | b A -> B X_a X_a | X_b X_a B -> X_b A A X_b | X_a X_b X_a -> a X_b -> b Step 3. S -> X_a A | X_b B | b A -> B D1 | X_b X_a D1 -> X_a X_a B -> X_b D2 D2 -> A D3 D3 -> A X_b X_a -> a X_b -> b