These examples are simple, but bisonc++ grammars for real programming languages are written the same way. You can copy these examples from this dument into source files to try them yourself. Also, the bisonc++ package contains the various source files ready for use.
All sources for this calculator are found in the examples/rpn/ directory.
%baseclass-preinclude cmath %token NUM %stype double
The directive section provides information to bisonc++ about the token
types (see section The bisonc++ Declarations Section). Each terminal symbol
that is not a single-character literal must be declared here (Single-character
literals normally don't need to be declared). In this example, all the
arithmetic operators are designated by single-character literals, so the only
terminal symbol that needs to be declared is NUM, the token type for
numeric constants. Since bisonc++ uses the type int as the default
semantic value type, one additional directive is required to inform bisonc++
about the fact that we are using double values. The %stype
(semantic value type) directive is used to do so.
input:
// empty
|
input line
;
line:
'\n'
|
exp '\n'
{
std::cout << "\t" << $1 << std::endl;
}
;
exp:
NUM
|
exp exp '+'
{
$$ = $1 + $2;
}
|
exp exp '-'
{
$$ = $1 - $2;
}
|
exp exp '*'
{
$$ = $1 * $2;
}
|
exp exp '/'
{
$$ = $1 / $2;
}
|
// Exponentiation:
exp exp '^'
{
$$ = pow($1, $2);
}
|
// Unary minus:
exp 'n'
{
$$ = -$1;
}
;
The groupings of the rpn `language' defined here are the expression
(given the name exp), the line of input (line), and the complete input
transcript (input). Each of these nonterminal symbols has several
alternate rules, joined by the `|' punctuator which is read as the logical
or. The following sections explain what these rules mean.
The semantics of the language is determined by the actions taken when a grouping is recognized. The actions are the C++ code that appears inside braces. See section 5.6.4.
You must specify these actions in C++, but bisonc++ provides the means for
passing semantic values between the rules. In each action, the pseudo-variable
$$ represents the semantic value for the grouping that the rule is going
to construct. Assigning a value to $$ is the main job of most actions. The
semantic values of the components of the rule are referred to as $1 (the
first component of a rule), $2 (the second component), and so on.
input:
input:
// empty
|
input line
;
This definition reads as follows: A complete input is either an empty
string, or a complete input followed by an input line. Notice that `complete
input' is defined in terms of itself. This definition is said to be left
recursive since input appears always as the leftmost symbol in the
sequence. See section 5.4.
The first alternative is empty because there are no symbols between the colon
and the first `|'; this means that input can match an empty string of
input (no tokens). We write the rules this way because it is legitimate to
type Ctrl-d right after you start the calculator. It's conventional to put
an empty alternative first and write the comment `// empty' in it.
The second alternate rule (input line) handles all nontrivial input. It
means After reading any number of lines, read one more line if
possible. The left recursion makes this rule into a loop. Since the first
alternative matches empty input, the loop can be executed zero or more times.
The parser's parsing function (parse()) continues to process input until a
grammatical error is seen or the lexical analyzer says there are no more input
tokens, which occurs at end of file.
line:
'\n'
|
exp '\n'
{
cout << "\t" << $1 << endl;
}
;
The first alternative is a token which is a newline character; this means
that rpn accepts a blank line (and ignores it, since there is no
action). The second alternative is an expression followed by a newline. This
is the alternative that makes rpn useful. The semantic value of the
exp grouping is the value of $1 because the exp in question is the
first symbol in the rule's alternative. The action prints this value, which is
the result of the computation the user asked for.
This action is unusual because it does not assign a value to $$. As a
consequence, the semantic value associated with the line is not initialized
(so its value will be unpredictable). This would be a bug if that value were
ever used, but we don't use it: once rpn has printed the value of the
user's input line, that value is no longer needed.
exp grouping has several rules, one for each kind of expression. The
first rule handles the simplest expressions: those that are just numbers. The
second handles an addition-expression, which looks like two expressions
followed by a plus-sign. The third handles subtraction, and so on.
exp:
NUM
|
exp exp '+'
{
$$ = $1 + $2;
}
|
exp exp '-'
{
$$ = $1 - $2;
}
...
It is customary to use `|' to join all the rules for exp, but the
rules could equally well have been written separately:
exp:
NUM
;
exp:
exp exp '+'
{
$$ = $1 + $2;
}
;
exp:
exp exp '-'
{
$$ = $1 - $2;
}
;
...
Most of the rules have actions that compute the value of the expression in
terms of the values of its parts. For example, in the rule for addition,
$1 refers to the first component exp and $2 refers to the second
one. The third component, '+', has no meaningful associated semantic
value, but if it had one you could refer to it as $3. When the parser's
parsing function parse() recognizes a sum expression using this rule, the
sum of the two subexpressions' values is produced as the value of the entire
expression. See section 5.6.4.
You don't have to give an action for every rule. When a rule has no action,
Bison by default copies the value of $1 into $$. This is what happens
in the first rule (the one that uses NUM).
The formatting shown here is the recommended convention, but Bison does not require it. You can add or change whitespace as much as you wish.
lex(), which is a predeclared member of the
parser class. See section 6.3.1.
Only a simple lexical analyzer is needed for rpn. This lexical analyzer
skips blanks and tabs, then reads in numbers as double and returns them as
NUM tokens. Any other character that isn't part of a number is a separate
token. Note that the token-code for such a single-character token is the
character itself.
The return value of the lexical analyzer function is a numeric code which
represents a token type. The same text used in bisonc++ rules to stand for this
token type is also a C++ expression for the numeric code for the type. This
works in two ways. If the token type is a character literal, then its numeric
code is the ASCII code for that character; you can use the same character
literal in the lexical analyzer to express the number. If the token type is an
identifier, that identifier is defined by bisonc++ as a C++ enumeration
value. In this example, therefore, NUM becomes an enumeration value
for lex() to return.
The semantic value of the token (if it has one) is stored into the parser's
data member d_val (comparable to the variable yylval used by, e.g.,
Bison). This data member has int as its default type, but by specifying
%stype in the directive section this default type can be modified (to,
e.g., double).
A token value of zero is returned once end-of-file is encountered. (Bisonc++ recognizes any nonpositive value as indicating the end of the input).
Here is the lexical scanner's implementation:
#include "Parser.ih"
/*
Lexical scanner returns a double floating point
number on the stack and the token NUM, or the ASCII
character read if not a number. Skips all blanks
and tabs, returns 0 for EOF.
*/
int Parser::lex()
{
char c;
// get the next non-ws character
while (std::cin.get(c) && c == ' ' || c == '\t')
;
if (!std::cin) // no characters were obtained
return 0; // indicate End Of Input
if (c == '.' || isdigit (c)) // if a digit char was found
{
std::cin.putback(c); // return the character
std::cin >> d_val; // extract a number
return NUM; // return the NUM token
}
return c; // otherwise return the extracted char.
}
main() is kept to the bare minimum. The only requirement is that it
constructs a parser object and then calls its parsing function parse() to
start the process of parsing:
/*
rpn.cc
*/
#include "rpn.h"
int main()
{
Parser parser;
parser.parse();
return 0;
}
parse() detects a syntax error, it calls the error reporting
member function error() to print an error message (usually but not always
parse error). It is up to the programmer to supply an implementation, but
a very bland and simple in-line implementation is provided by bisonc++ in the class
header file (see chapter 6). This default implementation is
acceptable for rpn.
Once error() returns, the bisonc++ parser may recover from the error and
continue parsing if the grammar contains a suitable error rule (see chapter
8). Otherwise, the parsing function parse() returns
nonzero. Not any error rules were included in this example, so any invalid
input causes the calculator program to exit. This is not clean behavior
for a real calculator, but it is adequate for this first example.
rpn this means that a source file
rpn.cc is constructed holding main(), and a file parser/lex.cc
holding the lexical scanner's implementation. Note that I've put all the
parser's files in a separate directory as well (also see section 3.7).
In rpn's parser directory the file grammar
holds the grammar specification. Bisonc++ constructs a parser class and a
parsing member function from this file after issuing the command:
b() grammar
From this, bisonc++ produced the following files:
Parser.h, the parser class definition;
Parserbase.h, the parser's base class definition, defining, among
other, the grammatical tokens to be used by externally defined lexical
scanners;
Parser.ih, the internal header file, to be included by all
implementations of the parser class' members;
parse.cc, the parsing member function.
Parserbase.h and parse.cc will be re-created each
time bisonc++ is re-run. Parser.h and Parser.ih may safely be modified
by the programmer, e.g., to add new members to to the parser class. These two
files will not be overwritten by bisonc++, unless explicitly instructed to do
so.
# List files (recursively) in the (current) examples/rpn directory.
% ls -R
.:
build parser rpn.cc rpn.h
./parser:
grammar lex.cc
# Create `rpn' using the `build' script:
% ./build rpn
# List files again, ./rpn is the constructed program
% ls -R
.:
build parser rpn rpn.cc rpn.h
./parser:
Parser.h Parser.ih Parserbase.h grammar lex.cc parse.cc parse.output
Here is an example session using rpn:
% rpn
4 9 +
13
3 7 + 3 4 5 *+-
-13
3 7 + 3 4 5 * + - n Note the unary minus, `n'
13
5 6 / 4 n +
-3.16667
3 4 ^ Exponentiation
81
^D End-of-file indicator
%
rpn to handle infix operators instead of postfix. Infix
notation involves the concept of operator precedence and the need for
parentheses nested to arbitrary depth. Here is the bisonc++ grammar
specification for calc, an infix desk-top calculator:
%baseclass-preinclude cmath
%stype double
%token NUM
%left '-' '+'
%left '*' '/'
%left NEG // negation--unary minus
%right '^' // exponentiation
%%
input:
// empty
|
input line
;
line:
'\n'
|
exp '\n'
{
std::cout << "\t" << $1 << '\n';
}
;
exp:
NUM
|
exp '+' exp
{
$$ = $1 + $3;
}
|
exp '-' exp
{
$$ = $1 - $3;
}
|
exp '*' exp
{
$$ = $1 * $3;
}
|
exp '/' exp
{
$$ = $1 / $3;
}
|
'-' exp %prec NEG
{
$$ = -$2;
}
|
// Exponentiation:
exp '^' exp
{
$$ = pow($1, $3);
}
|
'(' exp ')'
{
$$ = $2;
}
;
The functions lex(), error() and main() can be the same as
used with rpn.
There are two important new features shown in this code.
In the second section (Bisonc++ directives), %left declares token types
and says they are left-associative operators. The directives %left and
%right (right associativity) take the place of %token which is used to
declare a token type name without associativity. (These tokens are
single-character literals, which ordinarily don't need to be declared. We
declare them here to specify the associativity.)
Operator precedence is determined by the line ordering of the directives;
the higher the line number of the directive (lower on the page or screen),
the higher the precedence. Hence, exponentiation has the highest precedence,
unary minus (NEG) is next, followed by `*' and `/', and so on. See
section 5.5.8.
The other important new feature is the %prec in the grammar section for
the unary minus operator. The %prec simply instructs bisonc++ that the
rule `| '-' exp' has the same precedence as NEG (in this case the
next-to-highest). See section 7.3.
Here is a sample run of calc:
% calc
4 + 4.5 - (34/(8*3+-3))
6.88095
-56 + 2
-54
3 ^ 2
9
error() member
function with yyerror. Recall that by default parse() returns after
calling error(). This means that an erroneous input line causes the
calculator program to exit. Now we show how to rectify this deficiency.
The bisonc++ language itself includes the reserved word error, which may
be included in the grammar rules. In the example below it has been added as
one more alternatives for line:
line:
'\n'
|
exp '\n'
{
std::cout << "\t" << $1 << '\n';
}
|
error '\n'
;
This addition to the grammar allows for simple error recovery in the event
of a parse error. If an expression that cannot be evaluated is read, the error
is recognized by the third rule for line, and parsing continues (The
error() member function is still called upon to print its message as
well). Different from the implementation of error in Bison, bisonc++ proceeds
on the assumption that whenever error is used in a rule it is the grammar
constructor's intention to have the parser continue parsing. Therefore, a
statement like `yyerrok;' seen in Bison grammars is superfluous in bisonc++
grammars. The reserved keyword error itself causes the parsing function to
skip all subsequent input until a possible token following error is
seen. In the above implementation that token would be the newline character
`\n' (see chapter 8).
This form of error recovery deals with syntax errors. There are other kinds of errors; for example, divisions by zero, which raises an exception signal that is normally fatal. A real calculator program must handle this signal and use whatever it takes to discard the rest of the current line of input and resume parsing thereafter. This extensive error handling is not discussed here, as it is not specific to bisonc++ programs.
+', `-', `*', `/' and `^'. It would be nice to have a
calculator that provides other mathematical functions such as sin,
cos, etc..
It is easy to add new operators to the infix calculator as long as they are
only single-character literals. The parser's member lex() passes back all
non-number characters as tokens, so new grammar rules suffice for adding a new
operator. But we want something more flexible: built-in functions whose
syntaxis is as follows:
function_name (argument)
At the same time, we add memory to the calculator, thus allowing you to
create named variables, store values in them, and use them later. Here is a
sample session with the multi-function calculator:
pi = 3.141592653589
3.14159
sin(pi)
7.93266e-13
alpha = beta1 = 2.3
2.3
alpha
2.3
ln(alpha)
0.832909
exp(ln(beta1))
2.3
Note that multiple assignment and nested function calls are permitted.
mfcalc calculator allows us to
introduce several new features. Here is the bisonc++ directive section for
the mfcalc multi-function calculator (line numbers were added for
referential purposes, they are not part of the declaraction section as used in
the actual grammar file):
1 %union
2 {
3 double u_val;
4 double *u_symbol;
5 double (*u_fun)(double);
6 }
7
8 %token <u_val> NUM // Simple double precision number
9 %token <u_symbol> VAR // Variable
10 %token <u_fun> FNCT // Function
11 %type <u_val> exp
12
13 %right '='
14 %left '-' '+'
15 %left '*' '/'
16 %left NEG // negation--unary minus
17 %right '^' // exponentiation
The above grammar introduces only two new features of the Bison
language. These features allow semantic values to have various data types
The %union directive given in lines 1 until 6 allow semantic values
to have various data types (see section 5.6.2).
The %union directive is used instead of %stype, and defines the
type Parser::STYPE as the indicated union: all semantic values now
have this Parser::STYPE type. As defined here the allowable types are now
double (for exp and NUM);
double, being a pointer to entries in
mfcalc's symbol table, used with VAR tokens (see section
5.5.26).
double argument and
returning a double value, used with FNCT tokens.
NUM, VAR, FNCT, and exp. Their declarations are augmented with
information about their data type (placed between angle brackets).
The Bison construct %type (line 12) is used for declaring nonterminal
symbols, just as %token is used for declaring token types. We have not
used %type before because nonterminal symbols are normally declared
implicitly by the rules that define them. But exp must be declared
explicitly so we can specify its value type. See also section
5.5.25.
Finally note the right associative operator `=', defined in line 13:
by making the assignment operator right-associative we can allow sequential
assignments of the form a = b = c = expression.
calc. Three rules, those which mention VAR or
FNCT, are new:
input:
// empty
|
input line
;
line:
'\n'
|
exp '\n'
{
cout << "\t" << $1 << endl;
}
|
error '\n'
;
exp:
NUM
|
VAR
{
$$ = *$1;
}
|
VAR '=' exp
{
$$ = *$1 = $3;
}
|
FNCT '(' exp ')'
{
$$ = (*$1)($3);
}
|
exp '+' exp
{
$$ = $1 + $3;
}
|
exp '-' exp
{
$$ = $1 - $3;
}
|
exp '*' exp
{
$$ = $1 * $3;
}
|
exp '/' exp
{
$$ = $1 / $3;
}
|
'-' exp %prec NEG
{
$$ = -$2;
}
|
// Exponentiation:
exp '^' exp
{
$$ = pow($1, $3);
}
|
'(' exp ')'
{
$$ = $2;
}
;
The symbol table itself varies in size and contents once mfcalc is
used, and if a program uses multiple parser objects (well...) each parser will
require its own symbol table. Therefore it is defined as a data member
d_symbols in the Parser's header file. In contrast, the function table
has a fixed size and contents. Because of this, multiple parser objects
(if defined) could share the same function table, and so the function table is
defined as a static data member. Both tables profitably use the
std::map container data type that is available in C++: their keys are
std::string objects, their values, respecively, doubles and double
(*)(double)s. Here is the declaration of d_symbols and s_functions as
used in mfcalc's parser:
std::map<std::string, double> d_symbols;
static std::map<std::string, double (*)(double)> s_functions;
As s_functions is a static member, it can be initialized compile
time from an array of pairs. To ease the definition of such an array a
private typedef
typedef std::pair<char const *, double (*)(double)> FunctionPair;
is added to the parser class, as well as a private array
static FunctionPair s_funTab[];
These definitions allow us to initialize s_functions in a separate
source file (data.cc):
#include "Parser.ih"
Parser::FunctionPair Parser::s_funTab[] =
{
FunctionPair("sin", sin),
FunctionPair("cos", cos),
FunctionPair("atan", atan),
FunctionPair("ln", log),
FunctionPair("exp", exp),
FunctionPair("sqrt", sqrt),
};
map<string, double (*)(double)> Parser::s_functions
(
Parser::s_funTab,
Parser::s_funTab + sizeof(Parser::s_funTab) / sizeof(Parser::FunctionPair)
);
By simply editing the definition of s_funTab, additional
functions can be added to the calculator.
mfcalc, the parser's member function lex() must now recognize
variables, function names, numeric values, and the single-character arithmetic
operators. Strings of alphanumeric characters with a leading nondigit are
recognized as either variables or functions depending on the table in which
they are found. By arranging lex()'s logic such that the function table is
searched first it is simple to ensure that no variable can ever have the name
of a predefined function. The currently implemented approach, in which two
different tables are used for the arithmetic functions and the variable
symbols is appealing because it's simple to implement. However, it also has
the drawback of being difficult to scale to more generic calculators, using,
e.g., different data types and different types of functions. In such
situations a single symbol table is more preferable, where the keys are the
identifiers (variables, function names, predefined constants, etc.) while the
values are objects describing their characteristics. A re-implementation of
mfcalc using an integrated symbol table is suggested in one of the
exercises of the upcoming section 4.5.
The parser's lex() member uses the following approach:
mfcalc's parser (d_val) is itself also a union,
the numerical value can be extracted into d_val.u_val, and a
NUM token can be returned.
s_functions map. If found, d_val.u_fun is given the function's
address, found as the value of the s_functions map element
corresponding to the read identifier, and token FNCT is returned.
If the symbol is not found in s_functions the address of the
value ofn d_symbols associated with the received identifier is
assigned to d_val.u_symbol and token VAR is returned. Note
that this automatically defines newly used variables, since
d_symbols[name] automatically inserts a new element in a map if
d_symbol[name] wasn't already there.
mfcalc's parser's lex() member function:
#include "Parser.ih"
/*
Lexical scanner returns a double floating point
number on the stack and the token NUM, or the ASCII
character read if not a number. Skips all blanks
and tabs, returns 0 for EOF.
*/
int Parser::lex()
{
char c;
// get the next non-ws character
while (cin.get(c) && c == ' ' || c == '\t')
;
if (!cin) // no characters were obtained
return 0; // indicate End Of Input
if (c == '.' || isdigit (c)) // if a digit char was found
{
cin.putback(c); // return the character
cin >> d_val.u_val; // extract a number
return NUM; // return the NUM token
}
if (!isalpha(c)) // c doesn't start an identifier:
return c; // return a single character token.
// in all other cases, an ident is entered. Recognize a var or function
string word; // store the name in a string object
while (true) // process all alphanumerics:
{
word += c; // add 'm to `word'
if (!cin.get(c)) // no more chars? then done here
break;
if (!isalnum(c)) // not an alphanumeric: put it back and done.
{
cin.putback(c);
break;
}
}
// Now lookup the name as a function's name
map<string, double (*)(double)>::iterator function =
s_functions.find(word);
// Got it, so return FPTR
if (function != s_functions.end())
{
d_val.u_fun = function->second;
return FNCT;
}
// no function, so return a VAR. Set
// u_symbol to the symbol's address in the
// d_symbol map. The map will add the
// symbol if not yet there.
d_val.u_symbol = &d_symbols[word];
return VAR;
}
mfcalc, the following steps are suggested:
mfcalc.cc. Actually, it is already available,
since all implementations of main() used so far are identical to
each other.
parser:
grammar;
bisonc++ grammar to produce the files Parser.h,
Parserbase.h, Parser.ih and parse.cc;
Parser.h so as to include FunctionPair,
s_functions, s_funTab and d_symbols;
Parser.ih so as to include cmath and optionally
`using namespace std', which is commented out by default;
data.cc and lex.cc to initialize the static
data and to contain the lexical scanner, respectively.
mfcalc in mfcalc.cc's directory using the
following command:
g++ -o mfcalc *.cc parser/*.cc
mfcalc's
implementation and operating mode:
cmath' to the
Parser::s_functions;
Symbol in which the symbol type, and an
appropriate value for the symbol is stored. Define only one map d_symbols
in the Parser, and provide the Symbol class with means to obtain the
appropriate values for the various token types.
%union directive, and change it into %stype
Symbol. Hint: use the %preinclude-header directive to make
Symbol known to the parser's base class.
CONST for numerical constants (like PI, (E)),
and pre-define some numerical constants;
get() and set() member pair in Symbol, and use the appropriate
member in the appropriate expr rule; use ERROR() to initiate error
recovery.