Chapter 3: Bisonc++ concepts

This chapter introduces many of the basic concepts without which the details of bisonc++ do not make sense. If you do not already know how to use bisonc++, bison++ or bison, it is advised to start by reading this chapter carefully.

3.1: Languages and Context-Free Grammars

In order for bisonc++ to parse a language, it must be described by a context-free grammar. This means that you specify one or more syntactic groupings and give rules for constructing them from their parts. For example, in the C language, one kind of grouping is called an `expression'. One rule for making an expression might be, "An expression can be made of a minus sign and another expression". Another would be, "An expression can be an integer". As you can see, rules are often recursive, but there must be at least one rule which leads out of the recursion.

The most common formal system for presenting such rules for humans to read is Backus-Naur Form or `BNF', which was developed in order to specify the language Algol 60. Any grammar expressed in BNF is a context-free grammar. The input to bisonc++ is essentially machine-readable BNF.

Not all context-free languages can be handled by bisonc++, only those that are LALR(1). In brief, this means that it must be possible to tell how to parse any portion of an input string with just a single token of look-ahead. Strictly speaking, that is a description of an LR(1) grammar, and LALR(1) involves additional restrictions that are hard to explain simply; but it is rare in actual practice to find an LR(1) grammar that fails to be LALR(1). See section 7.5 for more information on this.

In the formal grammatical rules for a language, each kind of syntactic unit or grouping is named by a symbol. Those which are built by grouping smaller constructs according to grammatical rules are called nonterminal symbols; those which can't be subdivided are called terminal symbols or token types. We call a piece of input corresponding to a single terminal symbol a token, and a piece corresponding to a single nonterminal symbol a grouping.

We can use the C++ language as an example of what symbols, terminal and nonterminal, mean. The tokens of C++ are identifiers, constants (numeric and string), and the various keywords, arithmetic operators and punctuation marks. So the terminal symbols of a grammar for C++ include `identifier', `number', `string', plus one symbol for each keyword, operator or punctuation mark: `if', `return', `const', `static', `int', `char', `plus-sign', `open-brace', `close-brace', `comma' and many more. (These tokens can be subdivided into characters, but that is a matter of lexicography, not grammar.)

Here is a simple C++ function subdivided into tokens:

 
    int square(int x)   // keyword `int', identifier, open-paren, 
                        // keyword `int', identifier, close-paren
    {                   // open-brace 
        return x * x;   // keyword `return', identifier,
                        // asterisk, identifier, semicolon
    }                   // close-brace
        

The syntactic groupings of C++ include the expression, the statement, the declaration, and the function definition. These are represented in the grammar of C++ by nonterminal symbols `expression', `statement', `declaration' and `function definition'. The full grammar uses dozens of additional language constructs, each with its own nonterminal symbol, in order to express the meanings of these four. The example above is a function definition; it contains one declaration, and one statement. In the statement, each `x' is an expression and so is `x * x'.

Each nonterminal symbol must have grammatical rules showing how it is made out of simpler constructs. For example, one kind of C++ statement is the return statement; this would be described with a grammar rule which reads informally as follows:

A `statement' can be made of a `return' keyword, an `expression' and a `semicolon'.

There would be many other rules for `statement', one for each kind of statement in C++.

One nonterminal symbol must be distinguished as the special one which defines a complete utterance in the language. It is called the start symbol. In a compiler, this means a complete input program. In the C++ language, the nonterminal symbol `sequence of definitions and declarations' plays this role.

For example, `1 + 2' is a valid C++ expression--a valid part of a C++ program--but it is not valid as an entire C++ program. In the context-free grammar of C++, this follows from the fact that `expression' is not the start symbol.

The bisonc++ parser reads a sequence of tokens as its input, and groups the tokens using the grammar rules. If the input is valid, the end result is that the entire token sequence reduces to a single grouping whose symbol is the grammar's start symbol. If we use a grammar for C++, the entire input must be a `sequence of definitions and declarations'. If not, the parser reports a syntax error.

3.2: From Formal Rules to Bisonc++ Input

A formal grammar is a mathematical construct. To define the language for Bisonc++, you must write a file expressing the grammar in bisonc++ syntax: a Bisonc++ grammar file. See chapter 5.

A nonterminal symbol in the formal grammar is represented in bisonc++ input as an identifier, like an identifier in C++. By convention, it should be in lower case, such as expr, stmt or declaration.

The bisonc++ representation for a terminal symbol is also called a token type. Token types as well can be represented as C++-like identifiers. By convention, these identifiers should be upper case to distinguish them from nonterminals: for example, INTEGER, IDENTIFIER, IF or RETURN. A terminal symbol that stands for a particular keyword in the language should be named after that keyword converted to upper case. The terminal symbol error is reserved for error recovery. See section 5.2, which also describes the current restrictions on the names of terminal symbols.

A terminal symbol can also be represented as a character literal, just like a C++ character constant. You should do this whenever a token is just a single character (parenthesis, plus-sign, etc.): use that same character in a literal as the terminal symbol for that token.

The grammar rules also have an expression in bisonc++ syntax. For example, here is the bisonc++ rule for a C++ return statement. The semicolon in quotes is a literal character token, representing part of the C++ syntax for the statement; the naked semicolon, and the colon, are bisonc++ punctuation used in every rule.

    stmt:   
        RETURN expr ';'
    ;
        
See section 5.3.

3.3: Semantic Values

A formal grammar selects tokens only by their classifications: for example, if a rule mentions the terminal symbol `integer constant', it means that any integer constant is grammatically valid in that position. The precise value of the constant is irrelevant to how to parse the input: if `x + 4' is grammatical then `x + 1' or `x + 3989' is equally grammatical.

But the precise value is very important for what the input means once it is parsed. A compiler is useless if it fails to distinguish between 4, 1 and 3989 as constants in the program! Therefore, each token in a bisonc++ grammar has both a token type and a semantic value. See section 5.6 for details.

The token type is a terminal symbol defined in the grammar, such as INTEGER, IDENTIFIER or ','. It tells everything you need to know to decide where the token may validly appear and how to group it with other tokens. The grammar rules know nothing about tokens except their types.

The semantic value has all the rest of the information about the meaning of the token, such as the value of an integer, or the name of an identifier. (A token such as ',' which is just punctuation doesn't need to have any semantic value.)

For example, an input token might be classified as token type INTEGER and have the semantic value 4. Another input token might have the same token type INTEGER but value 3989. When a grammar rule says that INTEGER is allowed, either of these tokens is acceptable because each is an INTEGER. When the parser accepts the token, it keeps track of the token's semantic value.

Each grouping can also have a semantic value as well as its nonterminal symbol. For example, in a calculator, an expression typically has a semantic value that is a number. In a compiler for a programming language, an expression typically has a semantic value that is a tree structure describing the meaning of the expression.

3.4: Semantic Actions

In order to be useful, a program must do more than parse input; it must also produce some output based on the input. In a bisonc++ grammar, a grammar rule can have an action made up of C++ statements. Each time the parser recognizes a match for that rule, the action is executed. See section 5.6.4.

Most of the time, the purpose of an action is to compute the semantic value of the whole construct from the semantic values of its parts. For example, suppose we have a rule which says an expression can be the sum of two expressions. When the parser recognizes such a sum, each of the subexpressions has a semantic value which describes how it was built up. The action for this rule should create a similar sort of value for the newly recognized larger expression.

For example, here is a rule that says an expression can be the sum of two subexpressions:

    expr: expr '+' expr   
        { 
            $$ = $1 + $3; 
        }
    ;
        
The action says how to produce the semantic value of the sum expression from the values of the two subexpressions.

3.5: Bisonc++ output: the Parser class

When you run bisonc++, you give it a bisonc++ grammar file as input. The output, however, defines a C++ class, in which several members have already been defined. Therefore, the output of bisonc++ consists of header files and a C++ source file, defining a member (parse()) that parses the language described by the grammar. The class and its implementation is called a bisonc++ parser class. Keep in mind that the Bisonc++ utility and the bisonc++ parser class are two distinct pieces of software: the bisonc++ utility is a program whose output is the bisonc++ parser class that becomes part of your program.

More specifically, bisonc++ generates the following files from a bisonc++ grammar file:

The job of the bisonc++ parsing member is to group tokens into groupings according to the grammar rules--for example, to build identifiers and operators into expressions. As it does this, it runs the actions for the grammar rules it uses.

In C++ the tokens should be produced by an object called the lexical analyzer or lexical scanner that you must supply in some fashion (such as by writing it in C++). The bisonc++ parsing member requests the next token from the lexical analyzer each time it wants a new token. The parser itself doesn't know what is "inside" the tokens (though their semantic values may reflect this). Typically the lexical analyzer makes the tokens by parsing characters of text, but bisonc++ does not depend on this. See section 6.3.1.

The bisonc++ parsing function is C++ code defining a member function named parse() which implements that grammar. This parsing function nor the parser object for which it is called does make a complete C++ program: you must supply some additional details. One `detail' to be supplied is is the lexical analyzer. The parser class itself declares several more members which must be defined when used. One of these additional members is an error-reporting function which the parser calls to report an error. Simple default, yet sensible, implementations for these additional members may be generated by bisonc++. Having constructed a parser class and a lexical scanner class, objects of these classes must be defined in a complete C++ program. Usually such objects are defined in a function called main(); you have to provide this, and arrange for it to call the parser's parse() function, or the parser will never run. See chapter 6.

Note that, different from conventions used by Bison and Bison++, there is no special name convention requirement anymore imposed by bisonc++. In particular, there is no need to begin all variable and function names used in the bisonc++ parser with `yy' or `YY' anymore. However, some name restrictions on symbolic tokens exist. See section 5.5.24.1 for details.

3.5.1: Bisonc++: an optionally reentrant Parser

A computer program or routine is described as reentrant if it can be safely called recursively and concurrently from multiple processes. To be reentrant, a function must hold no static data, must not return a pointer to static data, must work only on the data provided to it by the caller, and must not call non-reentrant functions (Source: http://en.wikipedia.org/wiki/Reentrant).

Currently, bisonc++ generates a parsing member which may or may not be reentrant, depending on whether or not the option --thread-safe is specified.

The source file generated by bisonc++ containing the parsing member function not only contains this function, but also various tables (e.g., state transition tables) defined in the anonymous name space. When the option --thread-safe is provided, these tables are const tables: their elements are not changed by the parsing function and so the parsing function, as it only manipulates its own local data, becomes reentrant.

3.6: Stages in Using Bisonc++

The actual language-design process using bisonc++, from grammar specification to a working compiler or interpreter, has these parts: Once the software has been implemented, the following steps are required to create the final program:

3.7: The Overall Layout of a Bisonc++ Grammar File

The input file for the bisonc++ utility is a bisonc++ grammar file. Different from Bison++ and Bison grammar files, bisonc++ grammar file consist of only two sections. The general form of a bisonc++ grammar file is as follows:
    Bisonc++ directives
    %%
    Grammar rules
        
Readers familiar with Bison may note that there is no C declaration section and no section to define Additional C code. With bisonc++ these sections are superfluous since, due to the fact that a bisonc++ parser is a class, all additional code required for the parser's implementation can be incorporated into the parser class itself. Also, C++ classes normally only require declarations that can be defined in the classes' header files, so also the `additional C code' section could be omitted from the bisonc++ grammar file.

The `%%' is a punctuation that appears in every bisonc++ grammar file to separate the two sections.

The bisonc++ directives section is used to declare the names of the terminal and nonterminal symbols, and may also describe operator precedence and the data types of semantic values of various symbols. Furthermore, this section is also used to specify bisonc++ directives. These bisonc++ directives are used to define, e.g., the name of the generated parser class and a namespace in which the parser class will be defined.

The grammar rules define how to construct each nonterminal symbol from its parts.

One special directive is availble that may be used in the directives section and in the grammar rules section. This directive is %include, allowing you to split long grammar specification files into smaller, more comprehensible and accessible chunks. The %include directive is discussed in more detail in section 5.5.7.