Chapter 5: Bisonc++ grammar files

Bisonc++ takes as input a context-free grammar specification and produces a C++ class offering various predefined members, among which the member parse(), that recognizes correct instances of the grammar.

In this chapter the organization and specification of such a grammar file is discussed in detail.

Having read this chapter you should be able to define a grammar for which Bisonc++ can generate a class, containing a member that will recognize correctly formulated (in terms of your grammar) input, using all the features and facilities offered by bisonc++ to specify a grammar. In principle this grammar will be in the class of LALR(1) grammars (see, e.g., Aho, Sethi & Ullman, 2003 (Addison-Wesley)).

5.1: Outline of a Bisonc++ Grammar File

The input file for the bisonc++ utility is a bisonc++ grammar file. Different from Bison++ and Bison grammar files, bisonc++ grammar file consist of only two sections. The general form of a bisonc++ grammar file is as follows:


    Bisonc++ directives
    %%
    Grammar rules

Readers familiar with Bison may note that there is no C declaration section and no section to define Additional C code. With bisonc++ these sections are superfluous since, due to the fact that a bisonc++ parser is a class, all additional code required for the parser's implementation can be incorporated into the parser class itself. Also, C++ classes normally only require declarations that can be defined in the classes' header files, so also the `additional C code' section could be omitted from the Bisonc++ grammar file.

The `%%' is a punctuation that appears in every bisonc++ grammar file to separate the two sections.

The bisonc++ directives section is used to declare the names of the terminal and nonterminal symbols, and may also describe operator precedence and the data types of semantic values of various symbols. Furthermore, this section is also used to specify bisonc++ directives. These bisonc++ directives are used to define, e.g., the name of the generated parser class and a namespace in which the parser class is defined. All bisonc++ directives are covered in section 5.5.

The grammar rules define how to construct nonterminal symbols from their parts. The grammar rules section contains one or more bisonc++ grammar rules, and nothing else. See section 5.3, covering the syntax of grammar rules.

There must always be at least one grammar rule, and the first `%%' (which precedes the grammar rules) may never be omitted even if it is the first thing in the file.

Bisonc++'s grammar file may be split into several files. Each file may be given a suggestive name. This allows quick identification of where a particular section or rule is found, and improves readability of the designed grammar. The %include-directive (see section 5.5.7) can be used to include a partial grammar specification file into another specification file.

5.2: Symbols, Terminal and Nonterminal Symbols

Symbols in bisonc++ grammars represent the grammatical classifications of the language.

A terminal symbol (also known as a token type) represents a class of syntacticly equivalent tokens. You use the symbol in grammar rules to mean that a token in that class is allowed. The symbol is represented in the Bisonc++ parser by a numeric code, and the parser's lex() member function returns a token type code to indicate what kind of token has been read. You don't need to know what the code value is; you can use the symbol to stand for it.

A nonterminal symbol stands for a class of syntactically equivalent groupings. The symbol name is used in writing grammar rules. By convention, it should be all lower case.

Symbol names can contain letters, digits (not at the beginning), and underscores. Bisonc++ currently does not support periods in symbol names (Users familiar with Bison may observe that Bison does support periods in symbol names, but the Bison user guide remarks that `Periods make sense only in nonterminals'. Even so, it appears that periods in symbols are hardly ever used).

There are two ways to write terminal symbols in the grammar:

A named token type is written with an identifier, like an identifier in C++. By convention, it should be all upper case. Each such name must be defined with a bisonc++ directive such as %token. See section 5.5.24.
A character token type (or literal character token) is written in the grammar using the same syntax used in C++ for character constants; for example, '+' is a character token type. A character token type doesn't need to be declared unless you need to specify its semantic value data type (see section 5.6.1), associativity, or precedence (see section 5.5.8).

By convention, a character token type is used only to represent a token that consists of that particular character. Thus, the token type '+' is used to represent the character `+' as a token. Nothing enforces this convention, but if you depart from it, your program will likely confuse other readers.

All the usual escape sequences used in character literals in C++ can be used in bisonc++ as well, but you must not use the 0 character as a character literal because its ASCII code, zero, is the code lex() must return for end-of-input (see section 6.3.1). If your program must be able to return 0-byte characters, define a special token (e.g., ZERO_BYTE) and return that token instead.

Note that literal string tokens, formally supported in Bison, is not supported by bisonc++. Again, such tokens are hardly ever encountered, and the dominant lexical scanner generators (like flex(1)) do not support them. Common practice is to define a symbolic name for a literal string token. So, a token like EQ may be defined in the grammar file, with the lexical scanner returning EQ when it matches ==.

How you choose to write a terminal symbol has no effect on its grammatical meaning. That depends only on where it appears in rules and on when the parser function returns that symbol.

The value returned by the lex() member is always one of the terminal symbols (or 0 for end-of-input). Whichever way you write the token type in the grammar rules, you write it the same way in the definition of yylex. The numeric code for a character token type is simply the ASCII code for the character, so lex() can use the identical character constant to generate the requisite code. Each named token type becomes a C++ enumeration value in the parser base-class header file, so lex() can use the corresponding enumeration identifiers. When using an externally (to the parser) defined lexical scanner, the lexical scanner should include the parser's base class header file, returning the required enumeration identifiers as defined in the parser class. So, if (%token NUM) is defined in the parser class Parser, then the externally defined lexical scanner may return Parser::NUM.

The symbol `error' is a terminal symbol reserved for error recovery (see chapter 8). The error symbol should not be used for any other purpose. In particular, the parser's member function lex() should never return this value. Several other identifiers should not be used as terminal symbols. See section 5.5.24.1 for a description.

5.3: Syntax of Grammar Rules

A bisonc++ grammar rule has the following general form:

 
    result: 
        components
        ...
    ;

where result is the nonterminal symbol that this rule describes and components are various terminal and nonterminal symbols that are put together by this rule (see section 5.2). With respect to the way rules are defined, note the following:

The construction:
```
    exp:    
        exp '+' exp
    ;
        
```
means that two groupings of type exp, with a `+' token in between, can be combined into a larger grouping of type exp.
Whitespace in rules is significant only to separate symbols. You can add extra whitespace as you wish.
Scattered among the components can be actions that determine the semantics of the rule. An action looks like this:
```
    {
        C++ statements
    }
        
```
Usually there is only one action and it follows the components. See section 5.6.4.
Multiple rules for the same result can be written separately or can be joined with the vertical-bar character `|' as follows:
```
    result: 
        rule1-components
        ...
    | 
        rule2-components...
        ...
    ;
        
```
They are still considered distinct rules even when joined in this way.
Alternatively, multiple rules of the same nonterminal can be defined. E.g., the previous definition of result: could also have been defined as:
```
    result: 
        rule1-components
        ...
    ;

    result:
        rule2-components...
        ...
    ;
        
```
However, this is a potentially dangerous practice, since one of the two result rules could also have used a misspelled rule-name (e.g., the second result) should have been results. Therefore, bisonc++ generates a warning if the same nonterminal is used repeatedly when defining production rules.
If components in a rule is empty, it means that result can match the empty string. Such a alternative is called an empty production rule. For example, here is how to define a comma-separated sequence of zero or more exp groupings:
```
    expseq:   
        expseq1
    | 
        // empty 
    ;

    expseq1:  
        expseq1 ',' exp
    | 
        exp
    ;
        
```
Convention calls for a comment `// empty' in each empty production rule.

5.4: Writing recursive rules

A rule is called recursive when its result nonterminal appears also on its right hand side. Nearly all bisonc++ grammars need to use recursion, because that is the only way to define a sequence of any number of somethings. Consider this recursive definition of a comma-separated sequence of one or more expressions:


    expseq1:  
            expseq1 ',' exp
    | 
            exp
    ;

Since the recursive use of expseq1 is the leftmost symbol in the right hand side, we call this left recursion. By contrast, here the same construct is defined using right recursion:


    expseq1:  
            exp ',' expseq1
    | 
            exp
    ;

Any kind of sequence can be defined using either left recursion or right recursion, but you should always use left recursion, because it can parse a sequence of any number of elements with bounded stack space. Right recursion uses up space on the bisonc++ stack in proportion to the number of elements in the sequence, because all the elements must be shifted onto the stack before the rule can be applied even once. See chapter 7 for further explanation of this.

Indirect or mutual recursion occurs when the result of the rule does not appear directly on its right hand side, but does appear in rules for other nonterminals which do appear on its right hand side. For example:


    expr:     
        primary '+' primary
    |
        primary
        ;

    primary:  
        constant
    | 
        '(' expr ')'
    ;

defines two mutually-recursive nonterminals, since each refers to the other.

5.5: Bisonc++ Directives

The bisonc++ declarations section of a bisonc++ grammar defines the symbols used in formulating the grammar and the data types of semantic values. See section 5.2.

All token type names (but not single-character literal tokens such as '+' and '*') must be declared. If you need to specify which data type to use for the semantic value (see section 5.6.2) of nonterminal symbols, these symbols must be declared as well.

The first rule in the file by default specifies the start symbol. If you want some other symbol to be the start symbol, you must use an explicit %start directive (see section 3.1).

In this section all of bisonc++'s declarations are discussed. Some of the declarations have already been mentioned, but several more are available. Some declarations define how the grammar parses its input (like %left, %right); other declarations are available, defining, e.g., the name of the parsing function (by default parse()), or the name(s) of the files generated by bisonc++.

In particular readers familiar with Bison (or Bison++) should read this section thoroughly, since bisonc++'s directives are more extensive and different from the `declarations' offered by Bison, and the macros offered by Bison++.

Several directives expect file- or path-name arguments. File- or path-names must be specified on the same line as the directive itself, and they start at the first non-blank character following the directive. File- or path-names may contain escape sequences (e.g., if you must: use `\ ' to include a blank in a filename) and continue until the first blank character thereafter. Alternatively, file- or path-names may be surrounded by double quotes ("...") or pointed brackets (<...>). Pointed brackets surrounding file- or path-names merely function to delimit filenames. They do not refer to, e.g., C++'s include path. No escape sequences are required for blanks within delimited file- or path-names.

Directives accepting a `filename' do not accept path names, i.e., they cannot contain directory separators (/); options accepting a 'pathname' may contain directory separators.

Sometimes directives have analogous command-line options. In those cases command-line options take priority over directives.

Some directives may generate errors. This happens when an directive conflicts with the contents of a file which bisonc++ cannot modify (e.g., a parser class header file exists, but doesn't define a name space, but a %namespace directive was provided).

To solve such errore the offending directive could be omitted, the existing file could be removed, or the existing file could be hand-edited according to the directive's specification.

5.5.1: %baseclass-preinclude: specifying a header included by the baseclass

Syntax: %baseclass-preinclude pathname
Pathname defines the path to the file preincluded in the parser's base-class header. See the description of the --baseclass-preinclude option for details about this directive. By default `filename' is surrounded by double quotes; it's OK, however, to provide them yourself. When the argument is surrounded by pointed brackets

#include
            <header>

is used.

5.5.2: %class-name: defining the name of the parser class

Syntax: %class-name parser-class-name
By default, bisonc++ generates a parser-class by the name of Parser. The default can be changed using this directive which defines the name of the C++ class that will be generated. It may be defined only once and parser-class-name must be a C++ identifier.

If you're familiar with the Bison++ program, please note:

This directive replaces the %name directive previously used by Bison++.
Contrary to Bison++'s %name directive, %class-name may appear anywhere in the directive section of the grammar specification file.

It is an error if this directive is used and an already existing parser-class header file does not define class `className' and/or if an already existing implementation header file does not define members of the class `className'.

5.5.3: %debug: adding debugging code to the `parse()' member

Syntax: %debug
Provide parse() and its support functions with debugging code, showing the actual parsing process on the standard output stream. When included, the debugging output is active by default, but its activity may be controlled using the setDebug(bool on-off) member. Note that no #ifdef DEBUG macros are used anymore. Rerun bisonc++ without the --debug option to generate an equivalent parser not containing the debugging code.

5.5.4: %error-verbose: dumping the parser's state stack

Syntax: %error-verbose
The parser's state stack is dumped to the standard error stream when an error is detected by the parse() member function. Following a call of the error() function, the stack is dumped from the top of the stack (highest offset) down to its bottom (offset 0). Each stack element is prefixed by the stack element's index.

5.5.5: %expect: suppressing conflict warnings

Syntax: %expect number
Bisonc++ normally warns if there are any conflicts in the grammar (see section 7.1), but many real grammars have harmless shift/reduce conflicts which are resolved in a predictable way and would be difficult to eliminate. It is desirable to suppress the warning about these conflicts unless the number of conflicts changes. You can do this with the %expect declaration.

The argument number is a decimal integer. The declaration says there should be no warning if there are number shift/reduce conflicts and no reduce/reduce conflicts. The usual warning is given if there are either more or fewer conflicts, or if there are any reduce/reduce conflicts.

In general, using %expect involves these steps:

Compile your grammar without %expect. Use the `-V' option to get a verbose list of where the conflicts occur. Bisonc++ will also print the number of conflicts.
Check each of the conflicts to make sure that bisonc++'s default resolution is what you really want. If not, rewrite the grammar and go back to the beginning.
Add an %expect declaration, copying the number of (shift-reduce) conflict printed by bisonc++.

Now bisonc++ will stop annoying you about the conflicts you have checked, but it will warn you again if changes in the grammar result in another number or type of conflicts.

5.5.6: %flex: using the traditional `flex++' interface

Syntax: %flex
When provided, the scanner matched text function is called as d_scanner.YYText(), and the scanner token function is called as d_scanner.yylex(). This directive is only interpreted if the %scanner directive is also provided.

5.5.7: %include: splitting the input file

Syntax: %include pathname
This directive is used to switch to pathname while processing a grammar specification. Unless pathname defines an absolute file-path, pathname is searched relative to the location of bisonc++'s main grammar specification file (i.e., the grammar file that was specified as bisonc++'s command-line option). This directive can be used to split long grammar specification files in shorter, meaningful units. After processing pathname processing continues beyond the %include pathname directive.

Bisonc++'s main grammar specification file could be:


    %include spec/declarations.gr
    %%
    %include spec/rules.gr

where spec/declarations.gr contains declarations and spec/rules.gr contains the rules. Each of the files included using %include may itself use %include directives (which are then processed relative to their locations). The default nesting limit for %include directives is 10, but the option --max-inclusion-depth can be used to change this default.

%include directives should be specified on a line of their own.

5.5.8: %left, %right, %nonassoc: defining operator precedence

Syntax:

%left [ <type> ] terminal(s)
%nonassoc [ <type> ] terminal(s)
%right [ <type> ] terminal(s)

These directives are called precedence directives (see also section 5.5.8 for general information on operator precedence).

The %left, %right or %nonassoc directive are used to declare tokens and to specify their precedence and associativity, all at once.

The associativity of an operator op determines how repeated uses of the operator nest: whether `x op y op z' is parsed by grouping x with y first or by grouping y with z first. %left specifies left-associativity (grouping x with y first) and %right specifies right-associativity (grouping y with z first). %nonassoc specifies no associativity, which means that `x op y op z' is not a defined operation, and could be considered an error.
The precedence of an operator determines how it nests with other operators. All the tokens declared in a single precedence directive have equal precedence and nest together according to their associativity. When two tokens declared in different precedence directives associate, the one declared later has the higher precedence and is grouped first.

The <type> specification is optional, and specifies the type of the semantic value when a token specified to the right of a <type> specification is received. The pointed arrows are part of the type specification; the type itself must be a field of a %union specification (see section 5.5.26).

When multiple tokens are listed they must be separated by whitespace or by commas. Note that the precedence directives also serve to define token names: symbolic tokens mentioned with these directives should not be defined using %token directives.

5.5.9: %locationstruct: specifying a dedicated location struct

Syntax: %locationstruct struct-definition
Defines the organization of the location-struct data type LTYPE__. This struct should be specified analogously to the way the parser's stacktype is defined using %union (see below). The location struct type is named LTYPE__. If neither locationstruct nor LTYPE__ is specified, the default LTYPE__ struct is used.

5.5.10: %lsp-needed: using the default location type

Syntax: %lsp-needed
Defining this causes bisonc++ to include code into the generated parser using the standard location stack. The token-location type defaults to the following struct, defined in the parser's base class when this directive is specified:


    struct LTYPE__
    {
        int timestamp;
        int first_line;
        int first_column;
        int last_line;
        int last_column;
        char *text;
    };

Note that defining this struct type does not imply that its field are also assigned. Some form of communication with the lexical scanner is probably required to initialize the fields of this struct properly.

5.5.11: %ltype: using an existing location type

%ltype typename
Specifies a user-defined token location type. If %ltype is used, typename should be the name of an alternate (predefined) type (e.g., size_t). It should not be used together with a %locationstruct specification. From within the parser class, this type may be used as LTYPE__.

Any text following %ltype up to the end of the line, up to the first of a series of trailing blanks or tabs or up to a comment-token (// or /*) becomes part of the type definition. Be sure not to end a %ltype definition in a semicolon.

5.5.12: %namespace: using a namespace

Syntax: %namespace namespace
Defines all of the code generated by bisonc++ in the namespace namespace. By default no namespace is defined.

If this options is used the implementation header will contain a commented out using namespace directive for the requested namespace.

In addition, the parser and parser base class header files also use the specified namespace to define their include guard directives.

It is an error if this directive is used and an already existing parser-class header file and/or implementation header file does not define namespace identifier.

5.5.13: %negative-dollar-indices: using constructions like $-1

Syntax: %negative-dollar-indices
Accept (do not generate warnings) zero- or negative dollar-indices in the grammar's action blocks. Zero or negative dollar-indices are commonly used to implement inherited attributes and should normally be avoided. When used they can be specified like $-1, or like $<type>-1, where type is empty; an STYPE__ tag; or a %union field-name. See also the sections 5.6.4 and 6.6.

In combination with the %polymorphic directive (see below) only the $-i format can be used (see also section 5.6.3).

5.5.14: %no-lines: suppressing `#line' directives

Syntax: %no-lines
Do not put #line preprocessor directives in the file containing the parser's parse() function. By default #line preprocessor directives are inserted just before action blocks in the generated parse.cc file.

The #line directives allow compilers and debuggers to associate errors with lines in your grammar specification file, rather than with the source file containing the parse function itself.

5.5.15: %prec: overruling default precedences

Syntax: %prec token
The construction %prec token may be used in production rules to overrule the actual precendence of an operator in a particular production rule. Well known is the construction


    expression:
        '-' expression %prec UMINUS
        {
            ...
        }

Here, the default priority and precedence of the `-' token as the subtraction operator is overruled by the precedence and priority of the UMINUS token, which is frequently defined as:


    %right UMINUS

E.g., a list of arithmetic operators could consists of:


    %left '+' '-'
    %left '*' '/' '%'
    %right UMINUS

giving '*' '/' and '%' a higher priority than '+' and '-', ensuring at the same time that UMINUS is given both the highest priority and a right-associativity.

In the above production rule the operator order would cause the construction


    '-' expression

to be evaluated from right to left, having a higher precedence than either the multiplication or the addition operators.

5.5.16: %polymorphic: using polymorphism to define multiple semantic values

Syntax: %polymorphic polymorphic-specification(s)

The %polymorphic directive results in defining and using a polymorphic semantic value class, which can be used as a (preferred) alternative to the (traditional) a union specification.

Refer to section 5.6.3 for a detailed description of the specification, characteristics, and use of polymorphic semantic values as defined by bisonc++.

As a quick reference: to define multiple semantic values using a polymorphic semantic value class offering either an int, a std::string or a std::vector<double> specify:


    %polymorphic INT: int; STRING: std::string; 
                 VECT: std::vector<double>

and use %type specifications (cf. section 5.5.25) to associate (non-)terminals with specific semantic values.

5.5.17: %print-tokens: displaying tokens and matched text

Syntax: %print-tokens

The %print-tokens directive provides an implementation of the Parser class's print__ function displaying the current token value and the text matched by the lexical scanner as received by the generated parse function.

The print__ function is also implemented if the --print command-line option is provided.

5.5.18: %required-tokens: defining the minimum number of tokens between error reports

Syntax: %required-tokens ntokens
Whenever a syntactic error is detected during the parsing process the next few tokens that are received by the parsing function may easily cause yet another (spurious) syntactic error. In this situation error recovery in fact produces an avalanche of additional errors. If this happens the recovery process may benefit from a slight modification. Rather than reporting every syntactic error encountered by the parsing function, the parsing function may wait for a series of successfully processed tokens before reporting the next error.

The directive %required-tokens can be used to specify this number. E.g., the specification %required-tokens 10 requires the parsing function to process successfully a series of 10 tokens before another syntactic error is reported (and counted). If a syntactic error is encountered before processing 10 tokens then the counter counting the number of successfully processed tokens is reset to zero, no error is reported, but the error recoery procedure continues as usual. The number of required tokens can also be set using the option --required-tokens. By default the number of required tokens is initialized to 0.

5.5.19: %scanner: using a standard scanner interface

Syntax: %scanner header
Use header as the pathname of a file to include in the parser's class header. See the description of the --scanner option for details about this option. This directive also implies the automatic definition of a composed Scanner d_scanner data member into the generated parser, as well as a predefined int lex() member, returning d_scanner.lex().

By specifying the %flex directive the function d_scanner.YYText() is called.

The specfied header file will be surrounded by double quotes if no delimiters were provided. If pointed brackets (<...>) are used, they are kept.

It is an error if this directive is used and an already existing parser-class header file does not include `header'.

5.5.20: %scanner-matched-text-function: define the name of the scanner's member returning the matched texttoken

Syntax: %scanner-matched-text-function function-call

The %scanner-matched-text-function directive defines the scanner function returning the text matching the previously returned token. By default this is d_scanner.matched().

A complete function call expression should be provided (including a scanner object, if used). This option overrules the d_scanner.matched() call used by default when the %scanner directive is specified. Example:


    %scanner-matched-text-function myScanner.matchedText()

If the function call expression contains white space then the function-call specification should be surrounded by double quotes ("). This directive is overruled by the --scanner-matched-text-function command-line option.

5.5.21: %scanner-token-function: define the name of the scanner's token function

Syntax: %scanner-token-function function-call

The scanner function returning the next token, called from the generated parser's lex function. A complete function call expression should be provided (including a scanner object, if used). Example:


    %scanner-token-function d_scanner.lex()

If the function call contains white space scanner-token-function should be surrounded by double quotes.

It is an error if this directive is used and the specified scanner token function is not called from the code in an already existing implementation header.

5.5.22: %start: defining the start rule

Syntax: %start nonterminal symbol

By default bisonc++ uses the the LHS of the first rule in a grammar specification file as the start symbol. I.e., the parser tries to recognize that nonterminal when parsing input.

This default behavior may be overriden using the %start directive. The nonterminal symbol specifies a LHS that may be defined anywhere in the rules section of the grammar specification file. This LHS becomes the grammar's start symbol.

5.5.23: %stype: specifying the semantic stack type

Syntax: %stype typename
The type of the semantic value of tokens. The specification typename should be the name of an unstructured type (e.g., size_t). By default it is int. See YYSTYPE in bison. It should not be used if a %union specification is used. Within the parser class, this type may be used as STYPE__.

Any text following %stype up to the end of the line, up to the first of a series of trailing blanks or tabs or up to a comment-token (// or /*) becomes part of the type definition. Be sure not to end a %stype definition in a semicolon.

5.5.24: %token: defining token names

Syntax:

%token terminal token(s)
%token [ <type> ] terminal token(s)

The %token directive is used to define one or more symbolic terminal tokens. When multiple tokens are listed they must be separated by whitespace or by commas.

bisonc++ converts symbolic tokens (including those defined by the precedence directives (cf. section 5.5.8)) into Parser::Tokens enumeration values (where `Parser' is the name of the generated parser class, see section 5.5.2). This allows the lexical scanner member function lex() to return these token values by name directly, and it allows externally defined lexical scanners (called by lex()) to return token values as Parser::name.

When an externally defined lexical scanner is used, it should include Parserbase.h, the parser's base class header file, in order to be able to use the Parser::Tokens enumeration type.

Although it is possible to specify explicitly the numeric code for a token type by appending an integer value in the field immediately following the token name (as in %token NUM 300) this practice is deprecated. It is generally best to let bisonc++ choose the numeric values for all token types. Bisonc++ automatically selects values that don't conflict with each other or with ASCII character values.

5.5.24.1: Improper token names

Several identifiers cannot be used as token names as their use would collide with identifiers that are defined in the parser's base class.

In particular,

no token should end in two underscores (__).
some identifiers are reserved and cannot be used as tokens. They are:
```
    ABORT, ACCEPT, ERROR, clearin, debug, error, setDebug
        
```
Except for error, which is a predefined terminal token, these identifiers are the names of functions traditionally defined by bisonc++. The restriction on the above identifers could be lifted, but then the resulting generated parser would no longer be backward compatible with versions before Bisonc++ 2.0.0. It appears that imposing minimal restrictions on the names of tokens is a small penalty to pay for keeping backward compatibility.

5.5.25: %type: associating semantic values to (non)terminals

Syntax: %type <type> symbol(s)
When %polymorphic is used to specify multiple semantic value types, (non-)terminals can be associated with one of the semantic value types specified with the %polymorphic directive.

When %union is used to specify multiple semantic value types, (non-)terminals can be associated with one of the union fields specified with the %union directive.

To associate (non-)terminals with specific semantic value types the %type directive is used.

With this directive, symbol(s) represents of one or more blank or comma delimited grammatical symbols (i.e., terminal and/or nonterminal symbols); type is either a polymorphic type-identifier or a field name defined in the %union specification. The specified nonterminal(s) are automatically associated with the indicate semantic type. The pointed arrows are part of the type specification.

When the semantic value type of a terminal symbol is defined the lexical scanner rather than the parser's actions must assign the appropriate semantic value to d_val__ just prior to returning the token. To associate terminal symbols with semantic values, terminal symbols can also be specified in a %type directive.

5.5.26: %union: using a 'union' to define multiple semantic values

Syntax: %union union-definition body
The %union directive specifies the entire collection of possible data types for semantic values. The keyword %union is followed by a pair of braces containing the same thing that goes inside a union in C++. For example:


    %union {
      double u_val;
      symrec *u_tptr;
    };

This says that the two alternative types are double and

symrec
*

. They are given names u_val and u_tptr; these names are used in the %token and %type directives to pick one of the types for a terminal or nonterminal symbol (see section 5.5.25).

Notes:

The semicolon following the closing brace is optional.
C++-11 does allow class types to be used in union definitions, so they can also be used in %union directives. When a class type variant is required, all required constructors, the destructor and other members (like overloaded assignment operators) must be able to handle the actual class type data fields properly. A discussion of how to use unrestricted unions is beyon this manual's scope, but can be found, e.g., in the C++ Annotations. See also section 5.6.2.

5.5.27: %weak-tags: %polymorphic declaring 'enum Tag__'

By default, the %polymorphic directive declares a strongly typed enum: enum class Tag__, and code generated by bisonc++ always uses the Tag__ scope when referring to tag identifiers. It is often possible (by pre-associating tokens with tags, using %type directives) to avoid having to use tags in user-code.

If tags are explicitly used, then they must be prefixed with the Tag__ scope. Before the arrival of the C++-11 standard strongly typed enumerations didn't exist, and explicit enum-type scope prefixes were usually omitted. This works fine, as long as there are no name-conflicts: parser-tokens, other enumerations or variables could use identifiers also used by enum Tag__. This results in compilation errors that can simply be prevented using strongly typed enumerations.

The %weak-tags directive can be specified when the Tag__ enum should not be declared as a strongly typed enum. When in doubt, don't use this directive and stick to using the strongly typed Tag__ enum. When using %weak-tags be prepared for compilation errors caused by name collisions.

5.5.28: Directives controlling the names of generated files

Unless otherwise specified, bisonc++ uses the name of the parser-class to derive the names of most of the files it may generate. Below <CLASS> should be interpreted as the name of the parser's class, Parser by default, but configurable using %class-name (see section 5.5.2).

The parser's base class header: <Class>base.h, configurable using %baseclass-header (see section 5.5.28.1) or %filenames (see section 5.5.28.3);
The parser's class header: <Class>.h, configurable using %class-header (see section 5.5.28.2) or %filenames (see section 5.5.28.3);
The parser's implementation header file: <Class>.ih, configurable using %implementation-header (see section 5.5.28.4) or %filenames (see section 5.5.28.3);

5.5.28.1: %baseclass-header: defining the parser's base class header

Syntax: %baseclass-header filename
Filename defines the name of the file to contain the parser's base class. This class defines, e.g., the parser's symbolic tokens. Defaults to the name of the parser class plus the suffix base.h. It is always generated, unless (re)writing is suppressed by the --no-baseclass-header and --dont-rewrite-baseclass-header options. This directive is overruled by the --baseclass-header (-b) command-line option.

It is an error if this directive is used and an already existing parser class header file does not contain #include "filename".

5.5.28.2: %class-header: defining the parser's class header

Syntax: %class-header filename
Filename defines the name of the file to contain the parser class. Defaults to the name of the parser class plus the suffix .h This directive is overruled by the --class-header (-c) command-line option.

5.5.28.3: %filenames: specifying a generic filename

Syntax: %filenames filename
Filename is a generic filename that is used for all header files generated by bisonc++. Options defining specific filenames are also available (which then, in turn, overrule the name specified by this option). This directive is overruled by the --filenames (-f) command-line option.

5.5.28.4: %implementation-header: defining the implementation header

Syntax: %implementation-header filename
Filename defines the name of the file to contain the implementation header. It defaults to the name of the generated parser class plus the suffix .ih.
The implementation header should contain all directives and declarations only used by the implementations of the parser's member functions. It is the only header file that is included by the source file containing parse's implementation. User defined implementation of other class members may use the same convention, thus concentrating all directives and declarations that are required for the compilation of other source files belonging to the parser class in one header file.
This directive is overruled by the --implementation-header (-i) command-line option.

5.5.28.5: %parsefun-source: defining the parse() function's sourcefile

Syntax: %parsefun-source filename
Filename defines the name of the source file to contain the parser member function parse. Defaults to parse.cc. This directive is overruled by the --parse-source (-p) command-line option.

5.5.28.6: %target-directory: defining the directory where files must be written

Syntax: %target-directory pathname
Pathname defines the directory where generated files should be written. By default this is the directory where bisonc++ is called. This directive is overruled by the --target-directory command-line option.

5.6: Defining Language Semantics

The grammar rules for a language determine only the syntax. The semantics are determined by the semantic values associated with various tokens and groupings, and by the actions taken when various groupings are recognized.

For example, the calculator calculates properly because the value associated with each expression is the proper number; it adds properly because the action for the grouping `x + y' is to add the numbers associated with x and y.

In this section defining the semantics of a language is addressed, covering the following topics:

Specifying one data type for all semantic values;
Specifying several alternative data types;
Using Polymorphism to specify several data types;
Specifying Actions (an action is the semantic definition of a grammar rule);
Specifying data types for actions to operate on;
Specifying when and how to put actions in the middle of a rule (most actions go at the end of a rule. In some situations it may be desirable to put an action in in the middle of a rule).

5.6.1: Data Types of Semantic Values

In a simple program it may be sufficient to use the same data type for the semantic values of all language constructs. This was true in the rpn and infix calculator examples (see, e.g., sections 4.1 and 4.2).

Bisonc++'s default is to use type int for all semantic values. To specify some other type, the directive %stype must be used, like this:


    %stype double

This directive must go in the directives section of the grammar file (see section 5.1). As a result of this, the parser class defines a private type STYPE__ as double: Semantic values of language constructs always have the type STYPE__, and (assuming the parser class is named Parser) an internally used data member d_val that could be used by the lexical scanner to associate a semantic value with a returned token is defined as:


    Parser::STYPE__ d_val;

5.6.2: More Than One Value Type

In many programs, different kinds of data types are used in combination with different kinds of terminal and non-terminal tokens. For example, a numeric constant may need type int or double, while a string needs type std::string, and an identifier might need a pointer to an entry in a symbol table.

To use more than one data type for semantic values in one parser, bisonc++ offers the following feature:

Define polymorphic semantic values, associating (non)terminals with their proper semantic types (cf section 5.6.3), and associate (non-)terminal tokens with their appropriate semantic values;
Specify the entire collection of possible data types, using a %union directive (see section 5.5.26), and associate (non-)terminal tokens with their appropriate semantic values;
Define your own class handling the various semantic values, and associate that class with the parser's semantic value type using the %stype directive. The association of (non-)terminal tokens and specific value types is handled by your own class.

The first approach (and to a lesser extent, the second approach) has the advantage that bisonc++ is able to enforce the correct association between semantic types and rules and/or tokens, and that bisonc++ is able to check the type-correctness of assignments to rule results.

5.6.3: Polymorphism and multiple semantic values: `%polymorphic'

Bisonc++ may generate code using polymorphic semantic values. The approach discussed here is a direct result of a suggestion originally made by Dallas A. Clement in September 2007. All sources of the example discussed in this section can be retrieved from the poly directory.

One may wonder why a union is still used by bisonc++ as C++ offers inherently superior approaches to combine multiple types in one type. The C++ way to do so is by defining a polymorphic base class and a series of derived classes implementing the various exclusive data types. The union approach is still supported by bisonc++ since it is supported by bison(1) and bison++; dropping the union would needlessly impede backward compatibility.

The (preferred) alternative to using a union, however, is using a polymorphic base class. Although it is possible to define your own polymorphic semantic value classes, bisonc++ makes life easy by offering the %polymorphic directive.

The example program (cf. poly) implements a polymorphic base class, and derived classes containing either an int or a std::string semantic value. These types are asociated with tags (resp. INT and TEXT using the %polymorphic directive, which is discussed next.

5.6.3.1: The %polymorphic directive

The %polymorphic directive results in bisonc++ generating a parser using polymorphic semantic values. Each semantic value specification consists of a tag, which is a C++ identifier, and a C++ type definition.

Tags and type definitions are separated from each other by a colon, and multiple semantic values specifications are separated by semicolons. The semicolon trailing the final semantic value specification is optional.

A grammar specification file may contain only one %polymorphic directive, and the %polymorphic, %stype and %union directives are mutually exclusive.

Here is an example, defining three semantic values types: an int, a std::string and a std::vector<double>:


    %polymorphic INT: int; STRING: std::string; 
                 VECT: std::vector<double>

The identifier to the left of the colon is called the type-identifier, and the type definition to the right of the colon is called the type-definition. Types specified at the %polymorphic type-definitions must be built-in types or class-type declarations. Class types mentioned at the %polymorphic directive must offer default constructors.

If type declarations refer to types declared in header files that were not already included by the parser's base class header, then these header s must be inserted using the %baseclass-preinclude directive as these types are referred to in the generated parserbase.h header file.

5.6.3.2: The %polymorphic and %type: associating semantic values with (non-)terminals

The %type directive is used to associate (non-)terminals with semantic value types.

Semantic values may also be associated with terminal tokens. In that case it is the lexical scanner's responsibility to assign a properly typed value to the parser's STYPE__ d_val__ data member.

Non-terminals may automatically be associated with polymorphic semantic values using %type directives. E.g., after:


    %polymorphic INT: int; TEXT: std::string
    %type <INT> expr

the expr non-terminal returns int semantic values. In this case, a rule like:


    expr:
        expr '+' expr
        {
            $$ = $1 + $3;
        }

automatically associates $$, $1 and $3 with int values. $$ is an lvalue (representing the semantic value associated with the expr: rule), while $1 and $3 represent the int semantic value associated with the expr non-terminal in the production rule '-' expr (rvalues).

When negative dollar indices (like $-1) are used, pre-defined associations between non-terminals and semantic types are ignored. With positive indices or in combination with the production rule's return value $$, however, semantic value types can explicitly be specified using the common `$<type>$' or `$<type>1' syntax. (In this and following examples index number 1 represents any valid positive index; -1 represents any valid negative index).

The type-overruling syntax does not allow blanks to be used (so $<INT>$ is OK, $< INT >$ isn't).

Various combinations of type-associations and type specifications may be encountered:

$-1: %type associations are ignored, and the semantic value type STYPE__ is used instead. A warning is issued unless the %negative-dollar-indices directive was specified.
$<tag>-1: error: <tag> specifications are not allowed for negative dollar indices.

$$ or $1 specifications

%type<TAG>	$<tag>	action:

absent	no <tag>	STYPE__ is used

	$<id>	tag-override

	$<>	STYPE__ is used

	$<STYPE__>	STYPE__ is used

STYPE__	no <tag>	STYPE__ is used

	$<id>	tag-override

	$<>	STYPE__ is used

	$<STYPE__>	STYPE__ is used

(existing) tag	no <tag>	auto-tag

	$<id>	tag-override

	$<>	STYPE__ is used

	$<STYPE__>	STYPE__ is used

(undefined) tag	no <tag>	tag-error

	$<id>	tag-override

	$<>	STYPE__ is used

	$<STYPE__>	STYPE__ is used

auto-tag: $$ and $1 represent, respectively, $$.get<tag>() and $1.get<tag>();

tag-error: error: tag undefined;

tag-override: if id is a defined tag, then $<tag>$ and $<tag>1 represent the tag's type. Otherwise: error (using undefined tag id).

When using `$$.' or `$1.' default tags are ignored. A warning is issued that the default tag is ignored. This syntax allows members of the semantic value type (STYPE__) to be called explicitly. The default tag is only ignored if there are no additional characters (e.g., blanks, closing parentheses) between the dollar-expressions and the member selector operator (e.g., no tags are used with $1.member(), but tags are used with ($1).member()). The opposite, overriding default tag associations, is accomplished using constructions like $<STYPE__>$ and $<STYPE__>1.

When negative dollar indices are used, the appropriate tag must explicitly be specified. The next example shows how this is realized in the grammar specification file itself:


    %polymorphic INT: int
    %type <INT> ident
    %%
    
    type:
        ident arg
    ;
    
    arg:
        {
            call($-1->get<Tag__::INT>());
        }
    ;

In this example call may define an int or int & parameter.

It is also possible to delegate specification of the semantic value to the function call itself, as shown next:


    %polymorphic INT: int
    %type <INT> ident
    %%
    
    type:
        ident arg
    ;
    
    arg:
        {
            call($-1);
        }
    ;

Here, the function call could be implemented like this:


    void call(STYPE__ &st)
    {
        st->get<Tag__::INT>() = 5;
    }

5.6.3.3: Code generated by %polymorphic

The parser using polymorphic semantic values adds several classes to the generated files. The majority of these are class templates, defined in parserbase.h. In addition, some of the additionally implemented code is added to the parse.cc source file.

To minimize namespace pollution most of the additional code is contained in a namespace of its own: Meta__. If the %namespace directive was used then Meta__ is nested under the namespace declared by that directive. The name Meta__ hints at the fact that much of the code implementing polymorphic semantic values uses template meta programming.

The enumeration 'enum class Tag__'

One notable exception is the enumeration Tag__. It is declared outside of Meta__ (but inside the %namespace namespace, if provided) to simplify its use. Its identifiers are the tags declared by the %polymorphic directive. This is a strongly typed enumeration. The %weak-tags directive can be used to declare a pre C++-11 standard enum Tag__.

The namespace Meta__

Below, DataType refers to the semantic value's data type that is associated with a Tag__ identifier; ReturnType equals DataType if DataType is a built-in type (like int, double, etc.), while it is equal to DataType const & otherwise (for, e.g., class-type data types).

The important elements of the namespace Meta__ are:

The polymorphic semantic value's base class Base.
Its public interface offers the following members:
- Tag__ tag() const: returns the semantic value type's tag.
- ReturnType get<Tag__>() const: accesses the (non-modifiable) data element of the type matching the tag. the data element of the type matching the tag (also see below at the description of the class SType).
- DataType &get<Tag__>() const: accesses the (modifiable) data element of the type matching the tag.
The semantic value classes Semantic<Tag__::ID>: public Base.
Semantic<Tag__::ID> classes are derived for each of the tag identifiers ID declared at the %polymorphic directive. Semantic<Tag__::ID> classes contain a mutable DataType data member. Their public interfaces offer the following members:
- A default constructor;
- A Semantic(DataType const &) constructor;
- A Semantic(DataType &&) constructor;
- An operator ReturnType() const conversion operator;
- An operator DataType &()t conversion operator.
Semantic<Tag__::ID> objects are usually not explicitly used. Rather, their use is implied by the actual semantic value class SType and by several support functions (see below).
De semantic value class SType: public std::shared_ptr<Base> provides access to the various semantic value types. The SType__ class becomes the parser's STYPE__ type, and explicitly accessing Semantic<Tag__::ID> should never be necessary.
SType's public interface offers the following members:
- Constructors: default, copy and move constructors.
  Since the parser's semantic value and semantic value stack is completely controlled by the parser, and since the actual semantic data values are unknown at construction time of the semantic value (d_val__) and of the semantic value stack, no constructors expecting DataType values are provided.
- Assignment operators.
  The standard overloaded assignment operator (copy and move variants) as well as copy and move assignment operators for the types declared at the %polymorphic directive are provided. Assigning a value using operator= allocates a Semantic<Tag__::tag> object for the tag matching the right-hand side's data type, and resets the SType's shared_pointer to this new Semantic<Tag__::tag> object. Be aware that this may break the default association of the semantic value as declared by a %type directive. When breaking the default association make sure that explicit tags are used (as in $<Tag__::tag>1), overriding the default association with the currently active association. In most cases, however, the assignment is not used to break the default association but simply to assign a value to $$. By default the SType's shared pointer is zero, and the assignment initializes the semantic value to a value of the proper type. Assuming a lexical scanner may return a NR token, offering an int number() const accessor, then part of expr rule could be:
```
        expr:
            NR
            {
                $$ = d_scanner.number();
            }
        ...
            
```
  after which expr's semantic value has been initialized to a Semantic<Tag__::INT>.
- returns the semantic value stored inside Semantic<Tag__>. If the type-name is a built-in type a copy of the value is returned, otherwise a reference to a constant object is returned;
  This member checks for 0-pointers and for Tag__ mismatches between the requested and actual Tag__, throwing a std::logic_error if found.
- returns a reference to the (modifiable) semantic value stored inside Semantic<Tag__>.
  This member checks for 0-pointers and for Tag__ mismatches between the requested and actual Tag__, in that case replacing the current Semantic object pointed to by a new Semantic<Tag__> object of the requested Tag__.
- refers to the semantic value stored inside Semantic<Tag__>. If the type-name is a built-in type a copy of the value is returned, otherwise a reference to a constant object is returned;
  This is a (partially) unchecking variant of the corresponing get member, resulting in a Segfault if used when the shared_ptr holds a 0-pointer, and throwing a std::bad_cast in case of a mismatch between the requested and actual Tag__.
- returns a reference to the (modifiable) semantic value stored inside Semantic<Tag__>.
  This is a (partially) unchecking variant of the corresponing get member, resulting in a Segfault if used when the shared_ptr holds a 0-pointer, and throwing a std::bad_cast in case of a mismatch between the requested and actual Tag__.

When an incorrect tag is specified (e.g., with

get<Tag__::tag>(),
$<Tag__::tag>$

, or $<Tag__::tag>1), the generated code correctly copiles, but the program will throw a std::bad_cast exception once the offending code is executed.

Additional Headers

When using %polymorphic three additional header files are included by parserbase.h:

memory, required for std::shared_ptr;
stdexcept, required for std::logic_error;
type_traits, required for the implementation of one of SType's overloaded assignment operators.

5.6.3.4: A parser using a polymorphic semantic value type

In this section a parser is developed using polymorphic semantic values. Its %polymorphic directive looks like this:


    %polymorphic INT: int; TEXT: std::string;

Furthermore, the grammar declares tokens INT and IDENTIFIER, and pre-associates the TEXT tag with the identifier non-terminal, associates the INT tag with the int non-terminal, and associates STYPE__, the generic polymorphic value with the non=terminal combi:


    %type <TEXT>    identifier
    %type <INT>     int
    %type <STYPE__> combi

For this example a simple grammar was developed, expecting an optional number of input lines, formatted according to the following rule production rules:


    rule:
        identifier '(' identifier ')' '\n'
    |
        identifier '=' int '\n'
    |
        combi '\n'
    ;

The rules for identifier and int assign, respectively, text and an int value to the parser's semantic value stack:


    identifier:
        IDENTIFIER
        {
            $$ = d_scanner.matched();
        }
    ;
    int:
        INT
        {
            $$ = d_scanner.intValue();
        }
    ;

These simple assignments can be used as int is pre-associated with the INT tag and identifier is asociated with the TEXT tag.

As the combi rule is not associated with a specific semantic value, its semantic value could be either INT or TEXT. Irrespective of what is actually returned by combi, its semantic value can be passed on to a function (process(STYPE__ const &), responsible for the semantic value's further processing. Here are the definition of the combi non-terminal and action blocks for the rule non-terminal:


    combi:
        int
    |
        identifier;
    ;
        
    rule:
        identifier '(' identifier ')' '\n'
        {
            cout << $1 << " " << $3 << '\n';
        }
    |
        identifier '=' int '\n'
        {
            cout << $1 << " " << $3 << '\n';
        }
    |
        combi '\n'
        {
            process($1);
        }
    ;

Since identifier has been associated with TEXT and int with INT, the $-references to these elements in the production rules already return, respectively, a std::string const & and an int.

For combi the situation is slightly more complex, as combi could either return an int (via its int production rule) or a std::string const & (via its identifier production rule).

Fortunately, process can find out by inspecting the semantic value's Tag__:


    void process(SSTYPE__ const &semVal)
    {
        if (semVal.tag() == Tag__::INT)
            cout << "Saw an int-value: " << semVal.get<Tag__::INT>() << '\n';
        else
            cout << "Saw text: " << semVal.get<Tag__::TEXT>() << '\n';
    }

5.6.3.5: A scanner using a polymorphic semantic value type

The scanner recognizes input patterns, and returns Parser tokens (e.g., Parser::INT) matching the recognized input.

It is easily created by flexc++(1) processing the following simple specification file.

%interactive
%filenames scanner

%%

[ \t]+                  // skip white space

[0-9]+                  return Parser::NR;

[a-zA-Z_][a-zA-Z0-9_]*  return Parser::IDENTIFIER;

.|\n                    return matched()[0];

The reader may refer to flexc++(1) documentation for details about flexc++(1) specification files.

5.6.4: Actions

An action accompanies a syntactic rule and contains C++ code to be executed each time an instance of that rule is recognized. The task of most actions is to compute a semantic value for the grouping built by the rule from the semantic values associated with tokens or smaller groupings.

An action consists of C++ statements surrounded by braces, much like a compound statement in C++. It can be placed at any position in the rule; it is executed at that position. Most rules have just one action at the end of the rule, following all the components. Actions in the middle of a rule are tricky and should be used only for special purposes (see section 5.6.6).

The C++ code in an action can refer to the semantic values of the components matched by the rule with the construct $n, which stands for the value of the nth component. The semantic value for the grouping being constructed is $$. (Bisonc++ translates both of these constructs into array element references when it copies the actions into the parser file.)

Here is a typical example:


    exp:
        ...
    |
        exp '+' exp
        { 
            $$ = $1 + $3; 
        }
    |
        ...

This rule constructs an exp from two smaller exp groupings connected by a plus-sign token. In the action, $1 and $3 refer to the semantic values of the two component exp groupings, which are the first and third symbols on the right hand side of the rule. The sum is stored into $$ so that it becomes the semantic value of the addition-expression just recognized by the rule. If there were a useful semantic value associated with the `+' token, it could be referred to as $2.

If you don't specify an action for a rule, bisonc++ supplies a default: $$ = $1. Thus, the value of the first symbol in the rule becomes the value of the whole rule. Of course, the default rule is valid only if the two data types match. There is no meaningful default action for an empty rule; every empty rule must have an explicit action unless the rule's value does not matter. Note that the default $$ value is assigned at the beginning of an action block. Any changes to $1 are therefore not automatically propagated to $$. E.g., assuming that $1 == 3 at the beginning of the following action block, then $$ will still be equal to 3 after executing the statement in the action block:


    {               // assume: $1 == 3
        $1 += 12;   // $1 now is 15, $$ remains 3
    }

If $$ should receive the value of the modified $1, then $$ must explicitly be assigned to $$. E.g.,


    {               // assume: $1 == 3
        $1 += 12;   // $1 now is 15, $$ remains 3
        $$ = $1;    // now $$ == 15 as well.
    }

Using $n with n equal to zero or a negative value is allowed for reference to tokens and groupings on the stack before those that match the current rule. This is a very risky practice, and to use it reliably you must be certain of the context in which the rule is applied. Here is a case in which you can use this reliably:


    foo:      
        expr bar '+' expr  
        { ... }
    | 
        expr bar '-' expr  
        { ... }
    ;

    bar:
        // empty
    |
        { 
            previous_expr = $0; 
        }
    ;

As long as bar is used only in the fashion shown here, $0 always refers to the expr which precedes bar in the definition of foo. But as mentioned: it's a risky practice, which should be avoided if at all possible. See also section 6.6.

All $-type variables used in action blocks can be modified. All numbered $-variables are deleted when a production rule has been recognized. Unless an action explicitly assigns a value to $$, the (possibly modified) $1 value is assigned to $$ when a production rule has been recognized.

5.6.5: Data Types of Values in Actions

If you have chosen a single data type for semantic values, the $$ and $n constructs always have that data type.

If you have used a %union directive to specify a variety of data types, then you must declare a choice among these types for each terminal or nonterminal symbol that can have a semantic value. Then each time you use $$ or $n, its data type is determined by which symbol it refers to in the rule. In this example,


    exp:    
        ...
    | 
        exp '+' exp
        { 
            $$ = $1 + $3; 
        }

$1 and $3 refer to instances of exp, so they all have the data type declared for the nonterminal symbol exp. If $2 were used, it would have the data type declared for the terminal symbol '+', whatever that might be.

Alternatively, you can specify the data type when you refer to the value, by inserting `<type>' after the `$' at the beginning of the reference. For example, if you have defined types as shown here:


    %union 
    {
        int u_int;
        double u_double;
    };

then you can write $<u_int>1 to refer to the first subunit of the rule as an integer, or $<u_double>1 to refer to it as a double.

5.6.6: Actions in Mid-Rule

Occasionally it is useful to put an action in the middle of a rule. These actions are written just like usual end-of-rule actions, but they are executed before the parser recognizes the components that follow them.

A mid-rule action may refer to the components preceding it using $n, but it may not (cannot) refer to subsequent components because it is executed before they are parsed.

The mid-rule action itself counts as one of the components of the rule. This makes a difference when there is another action later in the same rule (and usually there is another at the end): you have to count the actions along with the symbols when working out which number n to use in $n.

The mid-rule action can also have a semantic value. The action can set its value with an assignment to $$, and actions later in the rule can refer to the value using $n. Since there is no symbol to name the action, there is no way to declare a data type for the value in advance, so you must use the `$<...>' construct to specify a data type each time you refer to this value.

There is no way to set the value of the entire rule with a mid-rule action, because assignments to $$ do not have that effect. The only way to set the value for the entire rule is with an ordinary action at the end of the rule.

Here is an example from a hypothetical compiler, handling a let statement that looks like ``let (variable) statement' and serves to create a variable named variable temporarily for the duration of statement. To parse this construct, we must put variable into the symbol table while statement is parsed, then remove it afterward. Here is how it is done:


    stmt:   
        LET '(' var ')'
        {
            $<u_context>$ = pushSymtab();
            temporaryVariable($3); 
        }
        stmt    
        { 
            $$ = $6;
            popSymtab($<u_context>5); 
        }

As soon as `let (variable)' has been recognized, the first action is executed. It saves a copy of the current symbol table as its semantic value, using alternative u_context in the data-type union. Then it uses temporaryVariable() to add the new variable (using, e.g., a name that cannot normally be used in the parsed language) to the current symbol table. Once the first action is finished, the embedded statement (stmt) can be parsed. Note that the mid-rule action is component number 5, so `stmt' is component number 6.

Once the embedded statement is parsed, its semantic value becomes the value of the entire let-statement. Then the semantic value from the earlier action is used to restore the former symbol table. This removes the temporary let-variable from the list so that it won't appear to exist while the rest of the program is parsed.

Taking action before a rule is completely recognized often leads to conflicts since the parser must commit to a parse in order to execute the action. For example, the following two rules, without mid-rule actions, can coexist in a working parser because the parser can shift the open-brace token and look at what follows before deciding whether there is a declaration or not:


    compound: 
        '{' declarations statements '}'
    | 
        '{' statements '}'
    ;

But when we add a mid-rule action as follows, the rules become nonfunctional:


    compound: 
        { 
            prepareForLocalVariables(); 
        }
        '{' declarations statements '}'
    | 
        '{' statements '}'
    ;

Now the parser is forced to decide whether to execute the mid-rule action when it has read no farther than the open-brace. In other words, it must commit to using one rule or the other, without sufficient information to do it correctly. (The open-brace token is what is called the look-ahead token at this time, since the parser is still deciding what to do about it. See section 7.0.3.

You might think that the problem can be solved by putting identical actions into the two rules, like this:


        { 
            prepareForLocalVariables(); 
        }
        '{' declarations statements '}'
    | 
        { 
            prepareForLocalVariables(); 
        }
        '{' statements '}'
    ;

But this does not help, because bisonc++ never parses the contents of actions, and so it does not realize that the two actions are identical.

If the grammar is such that a declaration can be distinguished from a statement by the first token (which is true in C, but not in C++, which allows statements and declarations to be mixed)), then one solution is to put the action after the open-brace, like this:


    compound: 
        '{'
        { 
            prepareForLocalVariables(); 
        }
        declarations statements '}'
    | 
        '{' statements '}'
    ;

Now the next token following a recognized '{' token would be either the first declarations token or the first statements token, which would in any case tell bisonc++ which rule to use, thus solving the problem.

Another (much used) solution is to bury the action inside a support non-terminal symbol which recognizes the first block-open brace and performs the required preparations:


    openblock:
        '{'
        { 
            prepareForLocalVariables(); 
        }
    ;

    compound: 
            openblock declarations statements '}'
    | 
            openblock statements '}'
    ;

Now bisonc++ can execute the action in the rule for subroutine without deciding which rule for compound it eventually uses. Note that the action is now at the end of its rule. Any mid-rule action can be converted to an end-of-rule action in this way, and this is what bisonc++ actually does to implement mid-rule actions.

By the way, note that in a language like C++ the above construction is obsolete anyway, since C++ allows mid-block variable- and object declarations. In C++ a compound statement could be defined, e.g., as follows:


    stmnt_or_decl:
        declarations
    |
        pure_stmnt      // among which: compound_stmnt
    ;

    statements:
        // empty
    |
        statements stmnt_or_decl
    ;

    compound_stmnt:                 
        open_block statements '}'
    ;

Here, the compound_stmnt would begin with the necessary preparations for local declarations, which would then have been completed by the time they would really be needed by declarations.

5.7: Basic Grammatical Constructions

In the following sections several basic grammatical constructions are presented in their prototypical and generic forms. When these basic constructions are used to construct a grammar, the resulting grammar is usually accepted by bisonc++. Moreover, these basic constructions are frequently encountered in programming languages. When designing your own grammar, try to stick as closely as possible to the following basic grammatical constructions.

5.7.1: Plain Alternatives

Simple alternatives can be specified using the vertical bar (|). Each alternative should begin with a unique identifying terminal token. The terminal token may actually be hidden in a non-terminal rule, in which case that nonterminal can be used as an alias for the non-terminal. In fact identical terminal tokens may be used if at some point the terminal tokens differ over different alternatives. Here are some examples:


    // Example 1:  plain terminal distinguishing tokens
    expr:
        ID        
    |
        NUMBER
    ;

    // Example 2:  nested terminal distinguishing tokens
    expr:
        id
    |
        number
    ;

    id:
        ID
    ;

    number:
        NUMBER
    ;

    // Example 3:  eventually diverting routes
    expr:
        ID
        id
    |
        ID
        number
    ;

    id:
        ID
    ;

    number:
        NUMBER
    ;

5.7.2: One Or More Alternatives, No Separators

A series of elements normally use left-recursion. For example, C++ supports string concatenation: series of double quote delimited ASCII characters define a string, and multiple white-space delimited strings are handled as one single string:


    "hello"         // multiple ws-delimited strings
    " " 
    "world"

    "hello world"   // same thing

Usually a parser is responsible for concatenating the individual string-parts, receiving one or more STRING tokens from the lexical scanner. A string rule handles one or more incoming STRING tokens:


    string:
        STRING
    |
        string STRING

The above rule can be used as a prototype for recognizing a series of elements. The token STRING may of course be embedded in another rule. The generic form of this rule could be formulated as follows:


    series:
        unit
    |
        series unit

Note that the single element is first recognized, whereafter the left-recursive alternative may be recognized repeatedly.

5.7.3: Zero Or More Alternatives, No Separators

An optional series of elements also use left-recursion, but the single element alternative remains empty. For example, in C++ a compound statement may contain statements or declarations, but it may also be empty:


    opt_statements:
        // empty
    |
        opt_statements statements

The above rule can be used as a prototype for recognizing a series of elements: the generic form of this rule could be formulated as follows:


    opt_series:
        // empty
    |
        opt_series unit

Note that the empty element is recognized first, even though it is empty, whereafter the left-recursive alternative may be recognized repeatedly. In practice this means that an action block may be defined at the empty alternative, which is then executed prior to the left-recursive alternative. Such an action block could be used to perform initializations necessary for the proper handling of the left-recursive alternative.

5.7.4: One Or More Alternatives, Using Separators

A series of elements which are separated from each other using some delimiter again normally uses left-recursion. For example, a C++ variable definition list consists of one or more identifiers, separated by comma's. If there is only one identifier no comma is used. Here is the rule defining a list using separators:


    variables:
        IDENTIFIER
    |
        variables ',' IDENTIFIER

The above rule can be used as a prototype for recognizing a series of delimited elements. The generic form of this rule could be formulated as follows:


    series:
        unit
    |
        series delimiter unit

Note that the single element is first recognized, whereafter the left-recursive alternative may be recognized repeatedly. In fact, this rule is not really different from the standard rule for a series, which does not hold true for the rule to recognize an optional series of delimited elements, covered in the next section.

5.7.5: Zero Or More Alternatives, Using Separators

An optional series of elements, separated from each other using delimiters occurs frequently in programming languages. For example, C++ functions have parameter lists which may or may not require arguments. Since a parameter list may be defined empty, an empty alternative is required. However, a simple generalization of the optional series construction (section 5.7.3) won't work, since that would imply that the first argument is preceded by a separator, which is clearly not the intention. So, the following construction is wrong:


    opt_parlist:
        // empty
    |
        opt_parlist ',' parameter

To define an optional series of delimited elements two rules are required: one rule handling the optional part, the other the delimited series of elements. So, the correct definition is as follows:


    opt_parlist:
        // empty
    |
        parlist
    ;

    parlist:
        parameter
    |
        parlist ',' parameter
    ;

Again, the above rule pair can be used as a prototype for recognizing an optional series of delimited elements. The generic form of these rules could be formulated as follows:


    opt_series:
        // empty
    |
        series
    ;

    series:
        element
    |
        series delimiter element

Note that the opt_series rules neatly distinguishes the no-element case from the case were elements are present. Usually these two cases need to be handled quite differently, and the opt_series rules empty alternative easily allows us to recognize the no-elements case.

5.7.6: Nested Blocks

Finally, we add the nested rule to our bag of rule-tricks. Again, nested rules appear frequently: parenthesized expressions and compound statements are two very well known examples. These kind of rules are characterized by the fact that the nested variant is itself an example of the element appearing in the nested variant. The definition of a statement is actually a bit more complex than the definition of an expression, since the statement appearing in the compound statement is in fact an optional series of elements. Let's first have a look at the nested expression rule. Here it is, in a basic form:


    expr:
        NUMBER
    |
        ID
    |
        expr '+' expr
    |
        ...
    |
        '(' expr ')'
    ;

This definition is simply characterized that the non-terminal expr appears within a set of parentheses, which is not too complex.

The definition of opt_statements, however, is a bit more complex. But acknowledging the fact that a statement contains among other elements a compound statement, and that a compound statement, in turn, contains opt_statements an opt_statements construction can be formulated accordingly:


    opt_statements:         // define an optional series
        // empty
    |
        opt_statements statement
    ;
        
    statement:              // define alternatives for `statement'
        expr_statement
    |
        if_statement
    |
        ...
    |
        compound_statement
    ;

    compound_statement:     // define the compound statement itself
        '{' opt_statements '}'
    ;

5.8: Multiple Parsers in the Same Program

Most programs that use bisonc++ parse only one language and therefore contain only one bisonc++ parser. But what if you want to parse more than one language with the same program? Since bisonc++ constructs class rather than a parsing function, this problem can easily be solved: simply define your second (third, fourth, ...) parser class, each having its own unique class-name, using the %class-name directive, and construct parser objects of each of the defined classes.