Syntax
CptS 355 - Programming Language Design
Washington State University

Thought question

  • How do you know what are well-formed programs in a language?

Introduction

Languages are described by a
  • syntax - "form", a specification of the well-formed constructs in a language
  • semantics - "function", a specification of the meaning of the constructs in a language
Describing syntax is easier than describing semantics.

Formal Descriptions of Syntax

Formally, a language, represented as L, is a set of strings, usually called sentences, over some alphabet, represented as Σ. A grammar, G, is a set of rules that describes the legal sentences of a language. From a grammar we can automatically construct the following machines:
  • a language recognizer - A recognizer is a machine that takes a string and determines if it belongs to the language described by the grammar.
  • a language generator - A generator is a machine that produces only legal sentences in the language. It produces random sentences.

To simplify the specification of a grammar, the syntax of a programming language is usually specified with respect to an alphabet of token classes. (Contrast this with CptS 317, where alphabets are typically collections of single characters.) A token is a sequence of characters that is treated as a unit in subsequent processing. Basing the language description on tokens rather than individual characters simplifies both the description of a language and its implementation. Example tokens (literally, what appears in the program) and their classes are:

  • begin : class "begin keyword"
  • } : class "right brace"
  • "asdfas" : class "string literal"
  • 7 : class "integer literal"
  • 3.54 : class "floating point literal"
  • foo i j : class "identifier"
  • + - : class "addition operator"
  • * / % : class "multiplication operator"
So let's look at a sentence in a programming language:
  sum = x + y;
The tokens are (from left to right): sum, =, x, +, y, ; and the corresponding classes are: identifier, assignment operator, identifier, addition operator, identifier, semicolon.
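
Here is a small sketch of how a program might split that sentence into tokens and their classes. The token classes and regular expressions below are assumptions made for illustration; a real language defines many more classes and handles many more cases.

```python
import re

# Hypothetical token classes, each paired with a regular expression.
# Order matters: the first alternative that matches wins.
TOKEN_SPEC = [
    ("identifier",          r"[A-Za-z_][A-Za-z_0-9]*"),
    ("integer literal",     r"\d+"),
    ("assignment operator", r"="),
    ("addition operator",   r"[+-]"),
    ("semicolon",           r";"),
    ("whitespace",          r"\s+"),
]

# One big pattern with a named group g0, g1, ... per token class.
MASTER = re.compile(
    "|".join(f"(?P<g{i}>{pat})" for i, (_, pat) in enumerate(TOKEN_SPEC))
)

def tokenize(text):
    """Return a list of (token, class) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        token_class = TOKEN_SPEC[int(m.lastgroup[1:])][0]
        if token_class != "whitespace":
            tokens.append((m.group(), token_class))
        pos = m.end()
    return tokens

for token, token_class in tokenize("sum = x + y;"):
    print(f"{token!r:6} {token_class}")
```

The grammar that follows only ever sees the second column (the classes), which is why it can be so much smaller than a grammar written over individual characters.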

BNF

Work by Turing and Chomsky in the 1940s and 1950s identified four categories of languages of increasing power and complexity: regular, context-free, context-sensitive, and recursively enumerable. The syntax of a programming language is usually described by a context-free grammar. The first programming language to have a formally specified grammar was ALGOL 60. The formal description was written in a "metalanguage" called Backus-Naur Form (BNF). A metalanguage is a language used to describe other languages. The components of BNF include the following.
  • Terminals - These are the token classes. Terminals will be represented as the name of the token class, e.g., begin means the token corresponding to the reserved word begin, identifier means any identifier, etc.
  • Nonterminals - These are represented in angle brackets, e.g., <stmt> means the nonterminal statement.
  • Rules or productions of the form
      nonterminal -> body
    The body of the rule consists of a list of
    • terminals,
    • nonterminals,
    • | (meaning "or").
    BNF is usually extended by a few convenience notations, which I will use in these notes but which are not used in the book:
    • a pair of brackets, [], enclosing an optional clause,
    • a pair of braces, {}, enclosing a repeating clause and followed by * (meaning zero or more) or + (meaning one or more),
    • e or E, representing the empty string.
    The extensions do not increase the expressive power of BNF but make for shorter grammars.

    The interpretation of a rule is that the syntax of the nonterminal, sometimes called the head or left-hand side (LHS), is described by the body, sometimes called the right-hand side (RHS). For example, the following rule describes the syntax of an if statement.
    <if_stmt> -> if <predicate> then <stmt>
    | if <predicate> then <stmt> else <stmt>

    Note that a nonterminal may be the LHS of several rules. The rule given above is the same as the pair of rules given below.
    <if_stmt> -> if <predicate> then <stmt>
    <if_stmt> -> if <predicate> then <stmt> else <stmt>

    Another equivalent formulation is as an optional clause.
    <if_stmt> -> if <predicate> then <stmt> [ else <stmt> ]

  • A start symbol - By default the start symbol is the nonterminal on the LHS of the first rule.
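
The claim that the extensions add no expressive power can be made concrete with a short sketch. It takes a rule body containing optional clauses and expands it into the equivalent plain-BNF alternatives. The representation is an assumption made for illustration: a body is a Python list, and the tuple ("opt", sub_body) stands for [ sub_body ].

```python
def expand_optional(body):
    """Return every plain-BNF body described by a body with [ ... ] clauses.

    Items of body are strings (terminals or nonterminals) or a tuple
    ("opt", sub_body) standing for an optional clause.
    """
    alternatives = [[]]
    for item in body:
        if isinstance(item, tuple) and item[0] == "opt":
            expansions = expand_optional(item[1])
            # Each alternative either skips the clause or includes one expansion.
            alternatives = [alt + extra
                            for alt in alternatives
                            for extra in [[]] + expansions]
        else:
            alternatives = [alt + [item] for alt in alternatives]
    return alternatives

# <if_stmt> -> if <predicate> then <stmt> [ else <stmt> ]
if_stmt = ["if", "<predicate>", "then", "<stmt>", ("opt", ["else", "<stmt>"])]

for alt in expand_optional(if_stmt):
    print("<if_stmt> ->", " ".join(alt))
```

The two lines printed are the pair of rules given earlier for the if statement, so the bracket notation is only shorthand.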

Derivations

Let's look at how a grammar can be used to generate a sentence in the language. The process is called derivation. The idea is that if we can somehow derive the sentence from the start symbol, then the sentence is part of the language described by the grammar. Derivation proceeds by replacing a nonterminal with its body. Consider the following simple grammar.
<stmt_list> -> <stmt>
| <stmt> ; <stmt_list>
<stmt> -> <var> = <expr>
<expr> -> <expr> - <expr>
| <expr> * <expr>
| <var>
<var> -> X
| Y
and consider the sentence X = Y - Y * X. Let's try to derive it.
<stmt_list>
=> <stmt>
=> <var> = <expr>
=> X = <expr>
=> X = <expr> * <expr>
=> X = <expr> * <var>
=> X = <expr> * X
=> X = <expr> - <expr> * X
=> X = <var> - <expr> * X
=> X = Y - <expr> * X
=> X = Y - <var> * X
=> X = Y - Y * X
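
The derivation above can be replayed mechanically. The sketch below stores the grammar as a dictionary and applies one replacement per step, checking that each replacement really is a rule of the grammar. The function names and the (index, body) encoding of a step are assumptions made for this example.

```python
GRAMMAR = {
    "<stmt_list>": [["<stmt>"], ["<stmt>", ";", "<stmt_list>"]],
    "<stmt>": [["<var>", "=", "<expr>"]],
    "<expr>": [["<expr>", "-", "<expr>"], ["<expr>", "*", "<expr>"], ["<var>"]],
    "<var>": [["X"], ["Y"]],
}

def apply_step(form, index, body):
    """Replace the nonterminal at form[index] with body, if that is a rule."""
    head = form[index]
    if body not in GRAMMAR.get(head, []):
        raise ValueError(f"no rule {head} -> {' '.join(body)}")
    return form[:index] + body + form[index + 1:]

def derive(start, steps):
    """Return the list of sentential forms produced by the given steps."""
    form = [start]
    forms = [form]
    for index, body in steps:
        form = apply_step(form, index, body)
        forms.append(form)
    return forms

# Each step: (position of the nonterminal to replace, body to replace it with).
STEPS = [
    (0, ["<stmt>"]),
    (0, ["<var>", "=", "<expr>"]),
    (0, ["X"]),
    (2, ["<expr>", "*", "<expr>"]),
    (4, ["<var>"]),
    (4, ["X"]),
    (2, ["<expr>", "-", "<expr>"]),
    (2, ["<var>"]),
    (2, ["Y"]),
    (4, ["<var>"]),
    (4, ["Y"]),
]

for form in derive("<stmt_list>", STEPS):
    print(" ".join(form))
```

Changing any step to a body that is not in the grammar raises an error, which is exactly the sense in which a derivation proves that a sentence belongs to the language.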

Derivation Tree

Each derivation creates a derivation tree.
  • The root of the tree is the start symbol.
  • Every interior node is a nonterminal.
  • Every leaf is a terminal.
  • There is one child for each nonterminal or terminal in the body of a production that is used in the derivation.
Here is the derivation tree for the above derivation:
                      stmt_list
                          |
                        stmt
                       / | \
                    var  =  expr
                     |      / | \
                     X   expr *  expr 
                        /  | \      |
                    expr   -  expr  X
                     |         | 
                    var       var
                     |         |      
                     Y         Y      
A grammar is ambiguous if there are two or more derivation trees for some sentence. This grammar is ambiguous since there is more than one possible derivation tree for the sentence above. Here is a second derivation and its corresponding tree.
<stmt_list>
=> <stmt>
=> <var> = <expr>
=> X = <expr>
=> X = <expr> - <expr>
=> X = <var> - <expr>
=> X = Y - <expr>
=> X = Y - <expr> * <expr>
=> X = Y - <expr> * <var>
=> X = Y - <expr> * X
=> X = Y - <var> * X
=> X = Y - Y * X
                      stmt_list
                          |
                        stmt
                       / | \
                    var  =  expr
                     |      / | \
                     X   expr -  expr 
                           |     / | \
                           Y  expr *  expr
                               |       | 
                              var     var
                               |       | 
                               Y       X 
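
Ambiguity can also be checked by counting. The following sketch counts the distinct derivation trees that the <expr> rules of this grammar allow for a list of tokens; it is a minimal illustration written for this particular grammar, not a general parsing algorithm.

```python
from functools import lru_cache

def count_expr_trees(tokens):
    """Count the <expr> derivation trees for tokens under the rules
    <expr> -> <expr> - <expr> | <expr> * <expr> | <var>,  <var> -> X | Y."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(lo, hi):
        # Number of trees deriving tokens[lo:hi].
        if hi - lo == 1:
            return 1 if tokens[lo] in ("X", "Y") else 0
        total = 0
        # Try every operator position as the root of the tree.
        for mid in range(lo + 1, hi - 1):
            if tokens[mid] in ("-", "*"):
                total += count(lo, mid) * count(mid + 1, hi)
        return total

    return count(0, len(tokens))

print(count_expr_trees("Y - Y * X".split()))  # prints 2
```

A count greater than one for any sentence is enough to show that the grammar is ambiguous; here the two trees are the ones drawn above.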

Parsing

Derivation starts with the start symbol and proceeds by replacing nonterminals. Parsing is the inverse process: starting with a string purportedly in the language, it attempts to find a derivation tree, which is now called a parse tree. For our purposes, informal approaches to parsing will be sufficient. Parsing is examined more rigorously in the Compilers course, CptS 452.

Relationship between Grammar, Associativity and Precedence

Specifying the right grammar for a language can help to control associativity and precedence.

Associativity refers to a "direction" in which (binary) operators associate. In mathematical notation, subtraction is left-associative, meaning that

  7 - 3 - 4 
is interpreted to mean
  (7 - 3) - 4 
rather than
  7 - (3 - 4)
which has a very different meaning!

Precedence refers to which operations are executed prior to others. Multiplication typically has higher precedence than addition and subtraction, meaning it should be done first, so

  7 + 3 * 4
evaluates to 19 and not to 40. Most (but not all) programming languages respect these mathematical conventions.
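
To see how much the grouping matters, here is a small sketch that evaluates expression trees written as nested tuples. The tuple encoding (left, operator, right) is an assumption made for this example.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(tree):
    """Evaluate a tree that is either a number or a tuple (left, op, right)."""
    if isinstance(tree, tuple):
        left, op, right = tree
        return OPS[op](evaluate(left), evaluate(right))
    return tree

print(evaluate(((7, "-", 3), "-", 4)))  # left-associative reading: 0
print(evaluate((7, "-", (3, "-", 4))))  # right-associative reading: 8
print(evaluate((7, "+", (3, "*", 4))))  # multiplication first: 19
print(evaluate(((7, "+", 3), "*", 4)))  # addition first: 40
```

The same tokens give different answers depending only on the shape of the tree, which is why the grammar has to pin the shape down.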

A parse tree implicitly says which operations' results are input to other operations. For example, in the second parse tree given above, if we assume X = 3 and Y = 4, then the result is 4 - (4 * 3) or -8, because the result of the multiplication is the second operand of the subtraction.

Question: what is the result if we use the first parse tree instead?

WARNING: the following material is not in the book but you are responsible for it nevertheless!

Associativity and precedence can be specified in a grammar by altering whether recursion is done on the right or left sides of rules, and by altering the derivation order of the grammar rules.

To specify precedence, the trick is to split the production where the precedence is ambiguous into two (or more) productions. Notice that this moves multiplication down the parse tree, so that a multiplication can never be the ancestor of a subtraction.
<stmt_list> -> <stmt>
| <stmt> ; <stmt_list>
<stmt> -> <var> = <expr>
<expr> -> <expr> - <expr>
| <term>
<term> -> <term> * <term>
| <var>
<var> -> X
| Y

Question: what might we do to the grammar so that the result of a subtraction might be an operand of a multiplication? How would you do it in mathematical notation?

We can specify associativity in the grammar by giving a direction to the parse, that is, by recursing on only the left side (or right side, but not both) of an operation. Let's make subtraction left-associative and make multiplication right-associative (just as an illustration).
<expr> -> <expr> - <term>
| <term>
<term> -> <var> * <term>
| <var>
<var> -> X
| Y

Exercise: convince yourself that the above grammar gives multiplication precedence over subtraction, that subtraction associates to the left, and that multiplication associates to the right, by creating parse trees for several expressions.
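
One way to check your hand-drawn trees is with a small recursive-descent parser for the <expr>, <term> and <var> rules above. This is a sketch under some assumptions: the input is already a list of tokens, the left-recursive <expr> rule is implemented as a loop (a common technique that yields the same left-leaning trees), and the result is a simplified tree of nested tuples that omits the single-child <expr>, <term> and <var> nodes.

```python
def parse_expr(tokens):
    """Parse tokens with the grammar above; return a nested-tuple tree."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def parse_var():
        # <var> -> X | Y
        nonlocal pos
        if peek() not in ("X", "Y"):
            raise SyntaxError(f"expected X or Y at position {pos}")
        pos += 1
        return tokens[pos - 1]

    def parse_term():
        # <term> -> <var> * <term> | <var>   (recursion on the right)
        nonlocal pos
        left = parse_var()
        if peek() == "*":
            pos += 1
            return (left, "*", parse_term())
        return left

    def parse_expression():
        # <expr> -> <expr> - <term> | <term>   (recursion on the left, as a loop)
        nonlocal pos
        tree = parse_term()
        while peek() == "-":
            pos += 1
            tree = (tree, "-", parse_term())
        return tree

    tree = parse_expression()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r} at position {pos}")
    return tree

print(parse_expr("X - Y - X".split()))  # (('X', '-', 'Y'), '-', 'X')
print(parse_expr("X * Y * X".split()))  # ('X', '*', ('Y', '*', 'X'))
print(parse_expr("X - Y * X".split()))  # ('X', '-', ('Y', '*', 'X'))
```

In the output, subtraction nests to the left, multiplication nests to the right, and a multiplication always appears below a subtraction, never above it.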

End of warning

So now for a nasty little secret about programming languages and context-free grammars. The grammar for a PL typically does *not* specify the acceptable programs of the language. Consider

   int *c;
   c = 17;

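This fragment is syntactically well-formed, yet it is not an acceptable program as written: it assigns an integer to a pointer variable, which the type checker flags. Some aspects of the language, such as type checking, are difficult (in the sense that they make the grammar blow up in size) or impossible to express using CFGs. These aspects usually go by the name of static semantics. We will take up the issue of types and type checking later in the semester.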
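
To give a feel for why such checks live outside the grammar, here is a deliberately oversimplified sketch of a static check for the fragment above. It needs a table of declarations, which is context that a context-free rule cannot carry along. The function name, the table, and the rule that the two types must match exactly are all assumptions made for illustration; they are not the actual typing rules of C.

```python
def check_assignment(declarations, name, value_type):
    """Return None if the assignment is well-typed, else an error message.

    declarations maps variable names to type names (hypothetical model).
    """
    if name not in declarations:
        return f"{name} is not declared"
    declared = declarations[name]
    if declared != value_type:
        return f"cannot assign {value_type} to {name} of type {declared}"
    return None

declarations = {"c": "int *"}                      # from: int *c;
print(check_assignment(declarations, "c", "int"))  # from: c = 17;
```

The check succeeds or fails depending on a declaration that may appear arbitrarily far from the assignment, which is the kind of dependency that CFGs handle badly.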
                                                                                                                                                                                                                                                                                                                                             

  (c) 2003 Curtis Dyreson, (c) 2004 Carl H. Hauser           E-mail questions or comments to Prof. Carl Hauser