Chapter 4: Examples

Now we show and explain three sample programs written using bisonc++: a reverse polish notation calculator, an algebraic (infix) notation calculator, and a multi-function calculator. All three have been tested under Linux (kernel 2.4.24 and above); each produces a usable, though limited, interactive desk-top calculator.

These examples are simple, but bisonc++ grammars for real programming languages are written the same way. You can copy these examples from this dument into source files to try them yourself. Also, the bisonc++ package contains the various source files ready for use.

4.1: rpn: a Reverse Polish Notation Calculator

The first example is that of a simple double-precision reverse polish notation calculator (a calculator using postfix operators). This example provides a good starting point, since operator precedence is not an issue. The second example illustrates how operator precedence is handled.

All sources for this calculator are found in the examples/rpn/ directory.

4.1.1: Declarations for the `rpn' calculator

Here are the C++ and bisonc++ directives for the reverse polish notation calculator. As in C++, end-of-line comments may be used.
%baseclass-preinclude cmath

%token NUM
%stype double

The directive section provides information to bisonc++ about the token types (see section The bisonc++ Declarations Section). Each terminal symbol that is not a single-character literal must be declared here (Single-character literals normally don't need to be declared). In this example, all the arithmetic operators are designated by single-character literals, so the only terminal symbol that needs to be declared is NUM, the token type for numeric constants. Since bisonc++ uses the type int as the default semantic value type, one additional directive is required to inform bisonc++ about the fact that we are using double values. The %stype (semantic value type) directive is used to do so.

4.1.2: Grammar rules for the `rpn' calculator

Here are the grammar rules for the reverse polish notation calculator.
input:    
        // empty 
| 
        input line
;

line:   
        '\n'
| 
        exp '\n'  
        { 
            std::cout << "\t" << $1 << std::endl;
        }
;

exp:      
        NUM             
| 
        exp exp '+'     
        { 
            $$ = $1 + $2;    
        }
| 
        exp exp '-'     
        { 
            $$ = $1 - $2;    
        }
| 
        exp exp '*'     
        { 
            $$ = $1 * $2;    
        }
| 
        exp exp '/'     
        { 
            $$ = $1 / $2;    
        }
|
        // Exponentiation:
        exp exp '^'     
        { 
            $$ = pow($1, $2); 
        }
|
        // Unary minus:
        exp 'n'
        { 
            $$ = -$1;        
        }
;

The groupings of the rpn `language' defined here are the expression (given the name exp), the line of input (line), and the complete input transcript (input). Each of these nonterminal symbols has several alternate rules, joined by the `|' punctuator which is read as the logical or. The following sections explain what these rules mean.

The semantics of the language is determined by the actions taken when a grouping is recognized. The actions are the C++ code that appears inside braces. See section 5.6.4.

You must specify these actions in C++, but bisonc++ provides the means for passing semantic values between the rules. In each action, the pseudo-variable $$ represents the semantic value for the grouping that the rule is going to construct. Assigning a value to $$ is the main job of most actions. The semantic values of the components of the rule are referred to as $1 (the first component of a rule), $2 (the second component), and so on.

4.1.2.1: Explanation of `input'

Consider the definition of input:
    input:    
            // empty 
    | 
            input line
    ;
        
This definition reads as follows: A complete input is either an empty string, or a complete input followed by an input line. Notice that `complete input' is defined in terms of itself. This definition is said to be left recursive since input appears always as the leftmost symbol in the sequence. See section 5.4.

The first alternative is empty because there are no symbols between the colon and the first `|'; this means that input can match an empty string of input (no tokens). We write the rules this way because it is legitimate to type Ctrl-d right after you start the calculator. It's conventional to put an empty alternative first and write the comment `// empty' in it.

The second alternate rule (input line) handles all nontrivial input. It means After reading any number of lines, read one more line if possible. The left recursion makes this rule into a loop. Since the first alternative matches empty input, the loop can be executed zero or more times.

The parser's parsing function (parse()) continues to process input until a grammatical error is seen or the lexical analyzer says there are no more input tokens, which occurs at end of file.

4.1.2.2: Explanation of `line'

Now consider the definition of line:
    line:   
            '\n'
    | 
            exp '\n'  
            { 
                cout << "\t" << $1 << endl;
            }
    ;
        
The first alternative is a token which is a newline character; this means that rpn accepts a blank line (and ignores it, since there is no action). The second alternative is an expression followed by a newline. This is the alternative that makes rpn useful. The semantic value of the exp grouping is the value of $1 because the exp in question is the first symbol in the rule's alternative. The action prints this value, which is the result of the computation the user asked for.

This action is unusual because it does not assign a value to $$. As a consequence, the semantic value associated with the line is not initialized (so its value will be unpredictable). This would be a bug if that value were ever used, but we don't use it: once rpn has printed the value of the user's input line, that value is no longer needed.

4.1.2.3: Explanation of `expr'

The exp grouping has several rules, one for each kind of expression. The first rule handles the simplest expressions: those that are just numbers. The second handles an addition-expression, which looks like two expressions followed by a plus-sign. The third handles subtraction, and so on.
    exp:      
            NUM             
    | 
            exp exp '+'     
            { 
                $$ = $1 + $2;    
            }
    | 
            exp exp '-'     
            { 
                $$ = $1 - $2;    
            }
    ...
        
It is customary to use `|' to join all the rules for exp, but the rules could equally well have been written separately:
    exp:      
            NUM             
    ;

    exp:
            exp exp '+'     
            { 
                $$ = $1 + $2;    
            }
    ;

    exp:
            exp exp '-'     
            { 
                $$ = $1 - $2;    
            }
    ;

    ...
        
Most of the rules have actions that compute the value of the expression in terms of the values of its parts. For example, in the rule for addition, $1 refers to the first component exp and $2 refers to the second one. The third component, '+', has no meaningful associated semantic value, but if it had one you could refer to it as $3. When the parser's parsing function parse() recognizes a sum expression using this rule, the sum of the two subexpressions' values is produced as the value of the entire expression. See section 5.6.4.

You don't have to give an action for every rule. When a rule has no action, Bison by default copies the value of $1 into $$. This is what happens in the first rule (the one that uses NUM).

The formatting shown here is the recommended convention, but Bison does not require it. You can add or change whitespace as much as you wish.

4.1.3: The Lexical Scanner used by `rpn'

The lexical analyzer's job is low-level parsing: converting characters or sequences of characters into tokens. The bisonc++ parser gets its tokens by calling the lexical analyzer lex(), which is a predeclared member of the parser class. See section 6.3.1.

Only a simple lexical analyzer is needed for rpn. This lexical analyzer skips blanks and tabs, then reads in numbers as double and returns them as NUM tokens. Any other character that isn't part of a number is a separate token. Note that the token-code for such a single-character token is the character itself.

The return value of the lexical analyzer function is a numeric code which represents a token type. The same text used in bisonc++ rules to stand for this token type is also a C++ expression for the numeric code for the type. This works in two ways. If the token type is a character literal, then its numeric code is the ASCII code for that character; you can use the same character literal in the lexical analyzer to express the number. If the token type is an identifier, that identifier is defined by bisonc++ as a C++ enumeration value. In this example, therefore, NUM becomes an enumeration value for lex() to return.

The semantic value of the token (if it has one) is stored into the parser's data member d_val (comparable to the variable yylval used by, e.g., Bison). This data member has int as its default type, but by specifying %stype in the directive section this default type can be modified (to, e.g., double).

A token value of zero is returned once end-of-file is encountered. (Bisonc++ recognizes any nonpositive value as indicating the end of the input).

Here is the lexical scanner's implementation:

#include "Parser.ih"

/*
    Lexical scanner returns a double floating point 
    number on the stack and the token NUM, or the ASCII
    character read if not a number.  Skips all blanks
    and tabs, returns 0 for EOF.
*/

int Parser::lex()
{
    char c;
                                    // get the next non-ws character
    while (std::cin.get(c) && c == ' ' || c == '\t')            
        ;

    if (!std::cin)                   // no characters were obtained
        return 0;                   // indicate End Of Input  
    
    if (c == '.' || isdigit (c))    // if a digit char was found
    {
        std::cin.putback(c);        // return the character
        std::cin >> d_val;          // extract a number
        return NUM;                 // return the NUM token
    }

    return c;                       // otherwise return the extracted char.
}

4.1.4: The Controlling Function `main()'

In keeping with the spirit of this example, the controlling function main() is kept to the bare minimum. The only requirement is that it constructs a parser object and then calls its parsing function parse() to start the process of parsing:
/*
                              rpn.cc
*/

#include "rpn.h"

int main()
{
    Parser parser;

    parser.parse();
    return 0;
}

4.1.5: The error reporting member `error()'

When parse() detects a syntax error, it calls the error reporting member function error() to print an error message (usually but not always parse error). It is up to the programmer to supply an implementation, but a very bland and simple in-line implementation is provided by bisonc++ in the class header file (see chapter 6). This default implementation is acceptable for rpn.

Once error() returns, the bisonc++ parser may recover from the error and continue parsing if the grammar contains a suitable error rule (see chapter 8). Otherwise, the parsing function parse() returns nonzero. Not any error rules were included in this example, so any invalid input causes the calculator program to exit. This is not clean behavior for a real calculator, but it is adequate for this first example.

4.1.6: Running Bisonc++ to generate the Parser

Before running bisonc++ to produce a parser class, we need to decide how to arrange all the source code in one or more source files. Even though the example is fairly simple, all user-defined functions should be defined in source files of their own. For rpn this means that a source file rpn.cc is constructed holding main(), and a file parser/lex.cc holding the lexical scanner's implementation. Note that I've put all the parser's files in a separate directory as well (also see section 3.7).

In rpn's parser directory the file grammar holds the grammar specification. Bisonc++ constructs a parser class and a parsing member function from this file after issuing the command:

    b() grammar
        
From this, bisonc++ produced the following files: By default, Parserbase.h and parse.cc will be re-created each time bisonc++ is re-run. Parser.h and Parser.ih may safely be modified by the programmer, e.g., to add new members to to the parser class. These two files will not be overwritten by bisonc++, unless explicitly instructed to do so.

4.1.7: Constructing and running `rpn'

Here is how to compile and run the parser file:
    # List files (recursively) in the (current) examples/rpn  directory.
    % ls -R 
    .:
    build  parser  rpn.cc  rpn.h
    
    ./parser:
    grammar  lex.cc
    
    # Create `rpn' using the `build' script:
    % ./build rpn
    
    # List files again, ./rpn is the constructed program
    % ls -R
    .:
    build  parser  rpn  rpn.cc  rpn.h
    
    ./parser:
    Parser.h  Parser.ih  Parserbase.h  grammar  lex.cc  parse.cc  parse.output
        
Here is an example session using rpn:
    % rpn
    4 9 +
            13
    3 7 + 3 4 5 *+-
            -13
    3 7 + 3 4 5 * + - n              Note the unary minus, `n'
            13
    5 6 / 4 n +
            -3.16667
    3 4 ^                            Exponentiation
            81
    ^D                               End-of-file indicator
    %
        

4.2: `calc': an Infix Notation Calculator

We now modify rpn to handle infix operators instead of postfix. Infix notation involves the concept of operator precedence and the need for parentheses nested to arbitrary depth. Here is the bisonc++ grammar specification for calc, an infix desk-top calculator:
%baseclass-preinclude   cmath
%stype double

%token NUM
%left '-' '+'
%left '*' '/'
%left NEG     // negation--unary minus 
%right '^'    // exponentiation        

%%

input:    
        // empty 
| 
        input line
;

line:   
        '\n'
| 
        exp '\n'  
        { 
            std::cout << "\t" << $1 << '\n';
        }
;

exp:      
        NUM             
| 
        exp '+' exp 
        { 
            $$ = $1 + $3;
        }
| 
        exp '-' exp 
        { 
            $$ = $1 - $3;
        }
| 
        exp '*' exp 
        { 
            $$ = $1 * $3;
        }
| 
        exp '/' exp 
        { 
            $$ = $1 / $3;
        }
| 
        '-' exp %prec NEG
        { 
            $$ = -$2;
        }
|
        // Exponentiation:
        exp '^' exp 
        { 
            $$ = pow($1, $3);
        }
|
        '(' exp ')'
        { 
            $$ = $2;
        }
;

The functions lex(), error() and main() can be the same as used with rpn.

There are two important new features shown in this code.

In the second section (Bisonc++ directives), %left declares token types and says they are left-associative operators. The directives %left and %right (right associativity) take the place of %token which is used to declare a token type name without associativity. (These tokens are single-character literals, which ordinarily don't need to be declared. We declare them here to specify the associativity.)

Operator precedence is determined by the line ordering of the directives; the higher the line number of the directive (lower on the page or screen), the higher the precedence. Hence, exponentiation has the highest precedence, unary minus (NEG) is next, followed by `*' and `/', and so on. See section 5.5.8.

The other important new feature is the %prec in the grammar section for the unary minus operator. The %prec simply instructs bisonc++ that the rule `| '-' exp' has the same precedence as NEG (in this case the next-to-highest). See section 7.3.

Here is a sample run of calc:

    % calc
    4 + 4.5 - (34/(8*3+-3))
            6.88095
    -56 + 2
            -54
    3 ^ 2
            9
        

4.3: Basic Error Recovery

Up to this point, this manual has not addressed the issue of error recovery, i.e., how to continue parsing after the parser detects a syntax error. All that's been handled so far is error reporting using the error() member function with yyerror. Recall that by default parse() returns after calling error(). This means that an erroneous input line causes the calculator program to exit. Now we show how to rectify this deficiency.

The bisonc++ language itself includes the reserved word error, which may be included in the grammar rules. In the example below it has been added as one more alternatives for line:

line:   
        '\n'
| 
        exp '\n'  
        { 
            std::cout << "\t" << $1 << '\n';
        }
| 
        error '\n'
;

This addition to the grammar allows for simple error recovery in the event of a parse error. If an expression that cannot be evaluated is read, the error is recognized by the third rule for line, and parsing continues (The error() member function is still called upon to print its message as well). Different from the implementation of error in Bison, bisonc++ proceeds on the assumption that whenever error is used in a rule it is the grammar constructor's intention to have the parser continue parsing. Therefore, a statement like `yyerrok;' seen in Bison grammars is superfluous in bisonc++ grammars. The reserved keyword error itself causes the parsing function to skip all subsequent input until a possible token following error is seen. In the above implementation that token would be the newline character `\n' (see chapter 8).

This form of error recovery deals with syntax errors. There are other kinds of errors; for example, divisions by zero, which raises an exception signal that is normally fatal. A real calculator program must handle this signal and use whatever it takes to discard the rest of the current line of input and resume parsing thereafter. This extensive error handling is not discussed here, as it is not specific to bisonc++ programs.

4.4: `mfcalc': a Multi-Function Calculator

Now that the basics of bisonc++ have been discussed, it is time to move on to a more advanced problem. The above calculators provided only five functions, `+', `-', `*', `/' and `^'. It would be nice to have a calculator that provides other mathematical functions such as sin, cos, etc..

It is easy to add new operators to the infix calculator as long as they are only single-character literals. The parser's member lex() passes back all non-number characters as tokens, so new grammar rules suffice for adding a new operator. But we want something more flexible: built-in functions whose syntaxis is as follows:

    function_name (argument)
        
At the same time, we add memory to the calculator, thus allowing you to create named variables, store values in them, and use them later. Here is a sample session with the multi-function calculator:
	pi = 3.141592653589
	        3.14159
	sin(pi)
	        7.93266e-13
	alpha = beta1 = 2.3
	        2.3
	alpha
	        2.3
	ln(alpha)
	        0.832909
	exp(ln(beta1))
	        2.3
        
Note that multiple assignment and nested function calls are permitted.

4.4.1: The Declaration Section for `mfcalc'

The grammar specification file for the mfcalc calculator allows us to introduce several new features. Here is the bisonc++ directive section for the mfcalc multi-function calculator (line numbers were added for referential purposes, they are not part of the declaraction section as used in the actual grammar file):
    1   %union
    2   {
    3       double u_val;
    4       double *u_symbol;
    5       double (*u_fun)(double);
    6   }
    7
    8   %token <u_val>  NUM         // Simple double precision number
    9   %token <u_symbol> VAR       // Variable
   10   %token <u_fun>  FNCT        // Function
   11   %type  <u_val>  exp
   12
   13   %right '='
   14   %left '-' '+'
   15   %left '*' '/'
   16   %left NEG                   // negation--unary minus 
   17   %right '^'                  // exponentiation        
        
The above grammar introduces only two new features of the Bison language. These features allow semantic values to have various data types

The %union directive given in lines 1 until 6 allow semantic values to have various data types (see section 5.6.2). The %union directive is used instead of %stype, and defines the type Parser::STYPE as the indicated union: all semantic values now have this Parser::STYPE type. As defined here the allowable types are now

Since values can now have various types, it is necessary to associate a type with each grammar symbol whose semantic value is used. These symbols are NUM, VAR, FNCT, and exp. Their declarations are augmented with information about their data type (placed between angle brackets).

The Bison construct %type (line 12) is used for declaring nonterminal symbols, just as %token is used for declaring token types. We have not used %type before because nonterminal symbols are normally declared implicitly by the rules that define them. But exp must be declared explicitly so we can specify its value type. See also section 5.5.25.

Finally note the right associative operator `=', defined in line 13: by making the assignment operator right-associative we can allow sequential assignments of the form a = b = c = expression.

4.4.2: Grammar Rules for `mfcalc'

Here are the grammar rules for the multi-function calculator. Most of them are copied directly from calc. Three rules, those which mention VAR or FNCT, are new:
input:    
        // empty 
| 
        input line
;

line:   
        '\n'
| 
        exp '\n'  
        { 
            cout << "\t" << $1 << endl;
        }
| 
        error '\n'
;

exp:      
        NUM             
| 
        VAR                
        { 
            $$ = *$1;
        }
| 
        VAR '=' exp        
        { 
            $$ = *$1 = $3;
        }
| 
        FNCT '(' exp ')'   
        { 
            $$ = (*$1)($3); 
        }
| 
        exp '+' exp 
        { 
            $$ = $1 + $3;
        }
| 
        exp '-' exp 
        { 
            $$ = $1 - $3;
        }
| 
        exp '*' exp 
        { 
            $$ = $1 * $3;
        }
| 
        exp '/' exp 
        { 
            $$ = $1 / $3;
        }
| 
        '-' exp %prec NEG
        { 
            $$ = -$2;
        }
|
        // Exponentiation:
        exp '^' exp 
        { 
            $$ = pow($1, $3);
        }
|
        '(' exp ')'
        { 
            $$ = $2;
        }
;

4.4.3: The `mfcalc' Symbol- and Function Tables

The multi-function calculator requires a symbol table to keep track of the names and meanings of variables and functions. This doesn't affect the grammar rules (except for the actions) or the bisonc++ directives, but it requires some additional C++ types for support as well as several additional data members for the parser class.

The symbol table itself varies in size and contents once mfcalc is used, and if a program uses multiple parser objects (well...) each parser will require its own symbol table. Therefore it is defined as a data member d_symbols in the Parser's header file. In contrast, the function table has a fixed size and contents. Because of this, multiple parser objects (if defined) could share the same function table, and so the function table is defined as a static data member. Both tables profitably use the std::map container data type that is available in C++: their keys are std::string objects, their values, respecively, doubles and double (*)(double)s. Here is the declaration of d_symbols and s_functions as used in mfcalc's parser:

    std::map<std::string, double> d_symbols;

    static std::map<std::string, double (*)(double)> s_functions;
        
As s_functions is a static member, it can be initialized compile time from an array of pairs. To ease the definition of such an array a private typedef
    typedef std::pair<char const *, double (*)(double)> FunctionPair;
        
is added to the parser class, as well as a private array
    static FunctionPair s_funTab[];
        
These definitions allow us to initialize s_functions in a separate source file (data.cc):
#include "Parser.ih"

Parser::FunctionPair Parser::s_funTab[] =
{
    FunctionPair("sin",  sin),
    FunctionPair("cos",  cos),
    FunctionPair("atan", atan),
    FunctionPair("ln",   log),
    FunctionPair("exp",  exp),
    FunctionPair("sqrt", sqrt),
};

map<string, double (*)(double)> Parser::s_functions
(
    Parser::s_funTab, 
    Parser::s_funTab + sizeof(Parser::s_funTab) / sizeof(Parser::FunctionPair)
);



By simply editing the definition of s_funTab, additional functions can be added to the calculator.

4.4.4: The revised `lex()' member

In mfcalc, the parser's member function lex() must now recognize variables, function names, numeric values, and the single-character arithmetic operators. Strings of alphanumeric characters with a leading nondigit are recognized as either variables or functions depending on the table in which they are found. By arranging lex()'s logic such that the function table is searched first it is simple to ensure that no variable can ever have the name of a predefined function. The currently implemented approach, in which two different tables are used for the arithmetic functions and the variable symbols is appealing because it's simple to implement. However, it also has the drawback of being difficult to scale to more generic calculators, using, e.g., different data types and different types of functions. In such situations a single symbol table is more preferable, where the keys are the identifiers (variables, function names, predefined constants, etc.) while the values are objects describing their characteristics. A re-implementation of mfcalc using an integrated symbol table is suggested in one of the exercises of the upcoming section 4.5.

The parser's lex() member uses the following approach:

Here is mfcalc's parser's lex() member function:
#include "Parser.ih"

/*
    Lexical scanner returns a double floating point 
    number on the stack and the token NUM, or the ASCII
    character read if not a number.  Skips all blanks
    and tabs, returns 0 for EOF.
*/

int Parser::lex()
{
    char c;
                                    // get the next non-ws character
    while (cin.get(c) && c == ' ' || c == '\t')            
        ;

    if (!cin)                   // no characters were obtained
        return 0;                   // indicate End Of Input  
    
    if (c == '.' || isdigit (c))    // if a digit char was found
    {
        cin.putback(c);        // return the character
        cin >> d_val.u_val;    // extract a number
        return NUM;                 // return the NUM token
    }
    
    if (!isalpha(c))                // c doesn't start an identifier: 
        return c;                   // return a single character token.

    // in all other cases, an ident is entered. Recognize a var or function

    string word;           // store the name in a string object

    while (true)                // process all alphanumerics:
    {
        word += c;              // add 'm to `word'
        if (!cin.get(c))   // no more chars? then done here
            break;

        if (!isalnum(c))        // not an alphanumeric: put it back and done.
        {
            cin.putback(c);
            break;
        }
    }
                                // Now lookup the name as a function's name
    map<string, double (*)(double)>::iterator function = 
                                                s_functions.find(word);

                                // Got it, so return FPTR 
    if (function != s_functions.end())  
    {
        d_val.u_fun = function->second;
        return FNCT;
    }
                                // no function, so return a VAR. Set
                                // u_symbol to the symbol's address in the
                                // d_symbol map. The map will add the
                                // symbol if not yet there.
    d_val.u_symbol = &d_symbols[word];
    return VAR;            
}

4.4.5: Constructing `mfcalc'

In order to construct mfcalc, the following steps are suggested:

4.5: Exercises

Here are some suggestions for you to consider to improve mfcalc's implementation and operating mode: