These examples are simple, but bisonc++ grammars for real programming languages are written the same way. You can copy these examples from this dument into source files to try them yourself. Also, the bisonc++ package contains the various source files ready for use.
All sources for this calculator are found in the examples/rpn/ directory.
%baseclass-preinclude cmath %token NUM %stype double
The directive section provides information to bisonc++ about the token
types (see section The bisonc++ Declarations Section). Each terminal symbol
that is not a single-character literal must be declared here (Single-character
literals normally don't need to be declared). In this example, all the
arithmetic operators are designated by single-character literals, so the only
terminal symbol that needs to be declared is NUM
, the token type for
numeric constants. Since bisonc++ uses the type int
as the default
semantic value type, one additional directive is required to inform bisonc++
about the fact that we are using double
values. The %stype
(semantic value type) directive is used to do so.
input: // empty | input line ; line: '\n' | exp '\n' { std::cout << "\t" << $1 << std::endl; } ; exp: NUM | exp exp '+' { $$ = $1 + $2; } | exp exp '-' { $$ = $1 - $2; } | exp exp '*' { $$ = $1 * $2; } | exp exp '/' { $$ = $1 / $2; } | // Exponentiation: exp exp '^' { $$ = pow($1, $2); } | // Unary minus: exp 'n' { $$ = -$1; } ;
The groupings of the rpn
`language' defined here are the expression
(given the name exp
), the line of input (line
), and the complete input
transcript (input
). Each of these nonterminal symbols has several
alternate rules, joined by the `|
' punctuator which is read as the logical
or. The following sections explain what these rules mean.
The semantics of the language is determined by the actions taken when a grouping is recognized. The actions are the C++ code that appears inside braces. See section 5.6.4.
You must specify these actions in C++, but bisonc++ provides the means for
passing semantic values between the rules. In each action, the pseudo-variable
$$
represents the semantic value for the grouping that the rule is going
to construct. Assigning a value to $$
is the main job of most actions. The
semantic values of the components of the rule are referred to as $1
(the
first component of a rule), $2
(the second component), and so on.
input
:
input: // empty | input line ;This definition reads as follows: A complete input is either an empty string, or a complete input followed by an input line. Notice that `complete input' is defined in terms of itself. This definition is said to be left recursive since input appears always as the leftmost symbol in the sequence. See section 5.4.
The first alternative is empty because there are no symbols between the colon
and the first `|
'; this means that input can match an empty string of
input (no tokens). We write the rules this way because it is legitimate to
type Ctrl-d
right after you start the calculator. It's conventional to put
an empty alternative first and write the comment `// empty
' in it.
The second alternate rule (input line
) handles all nontrivial input. It
means After reading any number of lines, read one more line if
possible. The left recursion makes this rule into a loop. Since the first
alternative matches empty input, the loop can be executed zero or more times.
The parser's parsing function (parse()
) continues to process input until a
grammatical error is seen or the lexical analyzer says there are no more input
tokens, which occurs at end of file.
line: '\n' | exp '\n' { cout << "\t" << $1 << endl; } ;The first alternative is a token which is a newline character; this means that
rpn
accepts a blank line (and ignores it, since there is no
action). The second alternative is an expression followed by a newline. This
is the alternative that makes rpn
useful. The semantic value of the
exp
grouping is the value of $1
because the exp
in question is the
first symbol in the rule's alternative. The action prints this value, which is
the result of the computation the user asked for.
This action is unusual because it does not assign a value to $$
. As a
consequence, the semantic value associated with the line is not initialized
(so its value will be unpredictable). This would be a bug if that value were
ever used, but we don't use it: once rpn
has printed the value of the
user's input line, that value is no longer needed.
exp
grouping has several rules, one for each kind of expression. The
first rule handles the simplest expressions: those that are just numbers. The
second handles an addition-expression, which looks like two expressions
followed by a plus-sign. The third handles subtraction, and so on.
exp: NUM | exp exp '+' { $$ = $1 + $2; } | exp exp '-' { $$ = $1 - $2; } ...It is customary to use `
|
' to join all the rules for exp, but the
rules could equally well have been written separately:
exp: NUM ; exp: exp exp '+' { $$ = $1 + $2; } ; exp: exp exp '-' { $$ = $1 - $2; } ; ...Most of the rules have actions that compute the value of the expression in terms of the values of its parts. For example, in the rule for addition,
$1
refers to the first component exp
and $2
refers to the second
one. The third component, '+
', has no meaningful associated semantic
value, but if it had one you could refer to it as $3
. When the parser's
parsing function parse()
recognizes a sum expression using this rule, the
sum of the two subexpressions' values is produced as the value of the entire
expression. See section 5.6.4.
You don't have to give an action for every rule. When a rule has no action,
Bison by default copies the value of $1
into $$
. This is what happens
in the first rule (the one that uses NUM
).
The formatting shown here is the recommended convention, but Bison does not require it. You can add or change whitespace as much as you wish.
lex()
, which is a predeclared member of the
parser class. See section 6.3.1.
Only a simple lexical analyzer is needed for rpn
. This lexical analyzer
skips blanks and tabs, then reads in numbers as double and returns them as
NUM
tokens. Any other character that isn't part of a number is a separate
token. Note that the token-code for such a single-character token is the
character itself.
The return value of the lexical analyzer function is a numeric code which
represents a token type. The same text used in bisonc++ rules to stand for this
token type is also a C++ expression for the numeric code for the type. This
works in two ways. If the token type is a character literal, then its numeric
code is the ASCII
code for that character; you can use the same character
literal in the lexical analyzer to express the number. If the token type is an
identifier, that identifier is defined by bisonc++ as a C++ enumeration
value. In this example, therefore, NUM
becomes an enumeration value
for lex()
to return.
The semantic value of the token (if it has one) is stored into the parser's
data member d_val
(comparable to the variable yylval
used by, e.g.,
Bison). This data member has int
as its default type, but by specifying
%stype
in the directive section this default type can be modified (to,
e.g., double
).
A token value of zero is returned once end-of-file is encountered. (Bisonc++ recognizes any nonpositive value as indicating the end of the input).
Here is the lexical scanner's implementation:
#include "Parser.ih" /* Lexical scanner returns a double floating point number on the stack and the token NUM, or the ASCII character read if not a number. Skips all blanks and tabs, returns 0 for EOF. */ int Parser::lex() { char c; // get the next non-ws character while (std::cin.get(c) && c == ' ' || c == '\t') ; if (!std::cin) // no characters were obtained return 0; // indicate End Of Input if (c == '.' || isdigit (c)) // if a digit char was found { std::cin.putback(c); // return the character std::cin >> d_val; // extract a number return NUM; // return the NUM token } return c; // otherwise return the extracted char. }
main()
is kept to the bare minimum. The only requirement is that it
constructs a parser object and then calls its parsing function parse()
to
start the process of parsing:
/* rpn.cc */ #include "rpn.h" int main() { Parser parser; parser.parse(); return 0; }
parse()
detects a syntax error, it calls the error reporting
member function error()
to print an error message (usually but not always
parse error). It is up to the programmer to supply an implementation, but
a very bland and simple in-line implementation is provided by bisonc++ in the class
header file (see chapter 6). This default implementation is
acceptable for rpn
.
Once error()
returns, the bisonc++ parser may recover from the error and
continue parsing if the grammar contains a suitable error rule (see chapter
8). Otherwise, the parsing function parse()
returns
nonzero. Not any error rules were included in this example, so any invalid
input causes the calculator program to exit. This is not clean behavior
for a real calculator, but it is adequate for this first example.
rpn
this means that a source file
rpn.cc
is constructed holding main()
, and a file parser/lex.cc
holding the lexical scanner's implementation. Note that I've put all the
parser's files in a separate directory as well (also see section 3.7).
In rpn's parser directory the file grammar
holds the grammar specification. Bisonc++ constructs a parser class and a
parsing member function from this file after issuing the command:
b() grammarFrom this, bisonc++ produced the following files:
Parser.h
, the parser class definition;
Parserbase.h
, the parser's base class definition, defining, among
other, the grammatical tokens to be used by externally defined lexical
scanners;
Parser.ih
, the internal header file, to be included by all
implementations of the parser class' members;
parse.cc
, the parsing member function.
Parserbase.h
and parse.cc
will be re-created each
time bisonc++ is re-run. Parser.h
and Parser.ih
may safely be modified
by the programmer, e.g., to add new members to to the parser class. These two
files will not be overwritten by bisonc++, unless explicitly instructed to do
so.
# List files (recursively) in the (current) examples/rpn directory. % ls -R .: build parser rpn.cc rpn.h ./parser: grammar lex.cc # Create `rpn' using the `build' script: % ./build rpn # List files again, ./rpn is the constructed program % ls -R .: build parser rpn rpn.cc rpn.h ./parser: Parser.h Parser.ih Parserbase.h grammar lex.cc parse.cc parse.outputHere is an example session using
rpn
:
% rpn 4 9 + 13 3 7 + 3 4 5 *+- -13 3 7 + 3 4 5 * + - n Note the unary minus, `n' 13 5 6 / 4 n + -3.16667 3 4 ^ Exponentiation 81 ^D End-of-file indicator %
rpn
to handle infix operators instead of postfix. Infix
notation involves the concept of operator precedence and the need for
parentheses nested to arbitrary depth. Here is the bisonc++ grammar
specification for calc
, an infix desk-top calculator:
%baseclass-preinclude cmath %stype double %token NUM %left '-' '+' %left '*' '/' %left NEG // negation--unary minus %right '^' // exponentiation %% input: // empty | input line ; line: '\n' | exp '\n' { std::cout << "\t" << $1 << '\n'; } ; exp: NUM | exp '+' exp { $$ = $1 + $3; } | exp '-' exp { $$ = $1 - $3; } | exp '*' exp { $$ = $1 * $3; } | exp '/' exp { $$ = $1 / $3; } | '-' exp %prec NEG { $$ = -$2; } | // Exponentiation: exp '^' exp { $$ = pow($1, $3); } | '(' exp ')' { $$ = $2; } ;The functions
lex()
, error()
and main()
can be the same as
used with rpn
.
There are two important new features shown in this code.
In the second section (Bisonc++ directives), %left
declares token types
and says they are left-associative operators. The directives %left
and
%right
(right associativity) take the place of %token
which is used to
declare a token type name without associativity. (These tokens are
single-character literals, which ordinarily don't need to be declared. We
declare them here to specify the associativity.)
Operator precedence is determined by the line ordering of the directives;
the higher the line number of the directive (lower on the page or screen),
the higher the precedence. Hence, exponentiation has the highest precedence,
unary minus (NEG
) is next, followed by `*
' and `/
', and so on. See
section 5.5.8.
The other important new feature is the %prec
in the grammar section for
the unary minus operator. The %prec
simply instructs bisonc++ that the
rule `| '-' exp
' has the same precedence as NEG
(in this case the
next-to-highest). See section 7.3.
Here is a sample run of calc
:
% calc 4 + 4.5 - (34/(8*3+-3)) 6.88095 -56 + 2 -54 3 ^ 2 9
error()
member
function with yyerror. Recall that by default parse()
returns after
calling error()
. This means that an erroneous input line causes the
calculator program to exit. Now we show how to rectify this deficiency.
The bisonc++ language itself includes the reserved word error
, which may
be included in the grammar rules. In the example below it has been added as
one more alternatives for line:
line: '\n' | exp '\n' { std::cout << "\t" << $1 << '\n'; } | error '\n' ;
This addition to the grammar allows for simple error recovery in the event
of a parse error. If an expression that cannot be evaluated is read, the error
is recognized by the third rule for line, and parsing continues (The
error()
member function is still called upon to print its message as
well). Different from the implementation of error
in Bison, bisonc++ proceeds
on the assumption that whenever error
is used in a rule it is the grammar
constructor's intention to have the parser continue parsing. Therefore, a
statement like `yyerrok;
' seen in Bison grammars is superfluous in bisonc++
grammars. The reserved keyword error
itself causes the parsing function to
skip all subsequent input until a possible token following error
is
seen. In the above implementation that token would be the newline character
`\n
' (see chapter 8).
This form of error recovery deals with syntax errors. There are other kinds of errors; for example, divisions by zero, which raises an exception signal that is normally fatal. A real calculator program must handle this signal and use whatever it takes to discard the rest of the current line of input and resume parsing thereafter. This extensive error handling is not discussed here, as it is not specific to bisonc++ programs.
+
', `-
', `*
', `/
' and `^
'. It would be nice to have a
calculator that provides other mathematical functions such as sin
,
cos
, etc..
It is easy to add new operators to the infix calculator as long as they are
only single-character literals. The parser's member lex()
passes back all
non-number characters as tokens, so new grammar rules suffice for adding a new
operator. But we want something more flexible: built-in functions whose
syntaxis is as follows:
function_name (argument)At the same time, we add memory to the calculator, thus allowing you to create named variables, store values in them, and use them later. Here is a sample session with the multi-function calculator:
pi = 3.141592653589 3.14159 sin(pi) 7.93266e-13 alpha = beta1 = 2.3 2.3 alpha 2.3 ln(alpha) 0.832909 exp(ln(beta1)) 2.3Note that multiple assignment and nested function calls are permitted.
mfcalc
calculator allows us to
introduce several new features. Here is the bisonc++ directive section for
the mfcalc
multi-function calculator (line numbers were added for
referential purposes, they are not part of the declaraction section as used in
the actual grammar file):
1 %union 2 { 3 double u_val; 4 double *u_symbol; 5 double (*u_fun)(double); 6 } 7 8 %token <u_val> NUM // Simple double precision number 9 %token <u_symbol> VAR // Variable 10 %token <u_fun> FNCT // Function 11 %type <u_val> exp 12 13 %right '=' 14 %left '-' '+' 15 %left '*' '/' 16 %left NEG // negation--unary minus 17 %right '^' // exponentiationThe above grammar introduces only two new features of the Bison language. These features allow semantic values to have various data types
The %union
directive given in lines 1 until 6 allow semantic values
to have various data types (see section 5.6.2).
The %union
directive is used instead of %stype
, and defines the
type Parser::STYPE
as the indicated union: all semantic values now
have this Parser::STYPE
type. As defined here the allowable types are now
double
(for exp
and NUM
);
double
, being a pointer to entries in
mfcalc
's symbol table, used with VAR
tokens (see section
5.5.26).
double
argument and
returning a double
value, used with FNCT
tokens.
NUM
, VAR
, FNCT
, and exp
. Their declarations are augmented with
information about their data type (placed between angle brackets).
The Bison construct %type
(line 12) is used for declaring nonterminal
symbols, just as %token
is used for declaring token types. We have not
used %type
before because nonterminal symbols are normally declared
implicitly by the rules that define them. But exp
must be declared
explicitly so we can specify its value type. See also section
5.5.25.
Finally note the right associative operator `=
', defined in line 13:
by making the assignment operator right-associative we can allow sequential
assignments of the form a = b = c = expression
.
calc
. Three rules, those which mention VAR
or
FNCT
, are new:
input: // empty | input line ; line: '\n' | exp '\n' { cout << "\t" << $1 << endl; } | error '\n' ; exp: NUM | VAR { $$ = *$1; } | VAR '=' exp { $$ = *$1 = $3; } | FNCT '(' exp ')' { $$ = (*$1)($3); } | exp '+' exp { $$ = $1 + $3; } | exp '-' exp { $$ = $1 - $3; } | exp '*' exp { $$ = $1 * $3; } | exp '/' exp { $$ = $1 / $3; } | '-' exp %prec NEG { $$ = -$2; } | // Exponentiation: exp '^' exp { $$ = pow($1, $3); } | '(' exp ')' { $$ = $2; } ;
The symbol table itself varies in size and contents once mfcalc
is
used, and if a program uses multiple parser objects (well...) each parser will
require its own symbol table. Therefore it is defined as a data member
d_symbols
in the Parser's header file. In contrast, the function table
has a fixed size and contents. Because of this, multiple parser objects
(if defined) could share the same function table, and so the function table is
defined as a static data member. Both tables profitably use the
std::map
container data type that is available in C++: their keys are
std::string
objects, their values, respecively, double
s and double
(*)(double)
s. Here is the declaration of d_symbols
and s_functions
as
used in mfcalc
's parser:
std::map<std::string, double> d_symbols; static std::map<std::string, double (*)(double)> s_functions;As
s_functions
is a static member, it can be initialized compile
time from an array of pairs. To ease the definition of such an array a
private typedef
typedef std::pair<char const *, double (*)(double)> FunctionPair;is added to the parser class, as well as a private array
static FunctionPair s_funTab[];These definitions allow us to initialize
s_functions
in a separate
source file (data.cc
):
#include "Parser.ih" Parser::FunctionPair Parser::s_funTab[] = { FunctionPair("sin", sin), FunctionPair("cos", cos), FunctionPair("atan", atan), FunctionPair("ln", log), FunctionPair("exp", exp), FunctionPair("sqrt", sqrt), }; map<string, double (*)(double)> Parser::s_functions ( Parser::s_funTab, Parser::s_funTab + sizeof(Parser::s_funTab) / sizeof(Parser::FunctionPair) );By simply editing the definition of
s_funTab
, additional
functions can be added to the calculator.
mfcalc
, the parser's member function lex()
must now recognize
variables, function names, numeric values, and the single-character arithmetic
operators. Strings of alphanumeric characters with a leading nondigit are
recognized as either variables or functions depending on the table in which
they are found. By arranging lex()
's logic such that the function table is
searched first it is simple to ensure that no variable can ever have the name
of a predefined function. The currently implemented approach, in which two
different tables are used for the arithmetic functions and the variable
symbols is appealing because it's simple to implement. However, it also has
the drawback of being difficult to scale to more generic calculators, using,
e.g., different data types and different types of functions. In such
situations a single symbol table is more preferable, where the keys are the
identifiers (variables, function names, predefined constants, etc.) while the
values are objects describing their characteristics. A re-implementation of
mfcalc
using an integrated symbol table is suggested in one of the
exercises of the upcoming section 4.5.
The parser's lex()
member uses the following approach:
mfcalc
's parser (d_val
) is itself also a union
,
the numerical value can be extracted into d_val.u_val
, and a
NUM
token can be returned.
s_functions
map. If found, d_val.u_fun
is given the function's
address, found as the value of the s_functions
map element
corresponding to the read identifier, and token FNCT
is returned.
If the symbol is not found in s_functions
the address of the
value ofn d_symbols
associated with the received identifier is
assigned to d_val.u_symbol
and token VAR
is returned. Note
that this automatically defines newly used variables, since
d_symbols[name]
automatically inserts a new element in a map if
d_symbol[name]
wasn't already there.
mfcalc
's parser's lex()
member function:
#include "Parser.ih" /* Lexical scanner returns a double floating point number on the stack and the token NUM, or the ASCII character read if not a number. Skips all blanks and tabs, returns 0 for EOF. */ int Parser::lex() { char c; // get the next non-ws character while (cin.get(c) && c == ' ' || c == '\t') ; if (!cin) // no characters were obtained return 0; // indicate End Of Input if (c == '.' || isdigit (c)) // if a digit char was found { cin.putback(c); // return the character cin >> d_val.u_val; // extract a number return NUM; // return the NUM token } if (!isalpha(c)) // c doesn't start an identifier: return c; // return a single character token. // in all other cases, an ident is entered. Recognize a var or function string word; // store the name in a string object while (true) // process all alphanumerics: { word += c; // add 'm to `word' if (!cin.get(c)) // no more chars? then done here break; if (!isalnum(c)) // not an alphanumeric: put it back and done. { cin.putback(c); break; } } // Now lookup the name as a function's name map<string, double (*)(double)>::iterator function = s_functions.find(word); // Got it, so return FPTR if (function != s_functions.end()) { d_val.u_fun = function->second; return FNCT; } // no function, so return a VAR. Set // u_symbol to the symbol's address in the // d_symbol map. The map will add the // symbol if not yet there. d_val.u_symbol = &d_symbols[word]; return VAR; }
mfcalc
, the following steps are suggested:
mfcalc.cc
. Actually, it is already available,
since all implementations of main()
used so far are identical to
each other.
parser
:
grammar
;
bisonc++ grammar
to produce the files Parser.h
,
Parserbase.h
, Parser.ih
and parse.cc
;
Parser.h
so as to include FunctionPair,
s_functions, s_funTab
and d_symbols
;
Parser.ih
so as to include cmath
and optionally
`using namespace std
', which is commented out by default;
data.cc
and lex.cc
to initialize the static
data and to contain the lexical scanner, respectively.
mfcalc
in mfcalc.cc
's directory using the
following command:
g++ -o mfcalc *.cc parser/*.cc
mfcalc
's
implementation and operating mode:
cmath
' to the
Parser::s_functions
;
Symbol
in which the symbol type, and an
appropriate value for the symbol is stored. Define only one map d_symbols
in the Parser, and provide the Symbol
class with means to obtain the
appropriate values for the various token types.
%union
directive, and change it into %stype
Symbol
. Hint: use the %preinclude-header
directive to make
Symbol
known to the parser's base class.
CONST
for numerical constants (like PI
, (E)),
and pre-define some numerical constants;
get()
and set()
member pair in Symbol
, and use the appropriate
member in the appropriate expr
rule; use ERROR()
to initiate error
recovery.