The most common formal system for presenting such rules for humans to read is Backus-Naur Form or `BNF', which was developed in order to specify the language Algol 60. Any grammar expressed in BNF is a context-free grammar. The input to bisonc++ is essentially machine-readable BNF.
Not all context-free languages can be handled by bisonc++, only those that are LALR(1). In brief, this means that it must be possible to tell how to parse any portion of an input string with just a single token of look-ahead. Strictly speaking, that is a description of an LR(1) grammar, and LALR(1) involves additional restrictions that are hard to explain simply; but it is rare in actual practice to find an LR(1) grammar that fails to be LALR(1). See section 7.5 for more information on this.
In the formal grammatical rules for a language, each kind of syntactic unit or grouping is named by a symbol. Those which are built by grouping smaller constructs according to grammatical rules are called nonterminal symbols; those which can't be subdivided are called terminal symbols or token types. We call a piece of input corresponding to a single terminal symbol a token, and a piece corresponding to a single nonterminal symbol a grouping.
We can use the C++ language as an example of what symbols, terminal and nonterminal, mean. The tokens of C++ are identifiers, constants (numeric and string), and the various keywords, arithmetic operators and punctuation marks. So the terminal symbols of a grammar for C++ include `identifier', `number', `string', plus one symbol for each keyword, operator or punctuation mark: `if', `return', `const', `static', `int', `char', `plus-sign', `open-brace', `close-brace', `comma' and many more. (These tokens can be subdivided into characters, but that is a matter of lexicography, not grammar.)
Here is a simple C++ function subdivided into tokens:
int square(int x) // keyword `int', identifier, open-paren, // keyword `int', identifier, close-paren { // open-brace return x * x; // keyword `return', identifier, // asterisk, identifier, semicolon } // close-brace
The syntactic groupings of C++ include the expression, the statement, the declaration, and the function definition. These are represented in the grammar of C++ by nonterminal symbols `expression', `statement', `declaration' and `function definition'. The full grammar uses dozens of additional language constructs, each with its own nonterminal symbol, in order to express the meanings of these four. The example above is a function definition; it contains one declaration, and one statement. In the statement, each `x' is an expression and so is `x * x'.
Each nonterminal symbol must have grammatical rules showing how it is made out of simpler constructs. For example, one kind of C++ statement is the return statement; this would be described with a grammar rule which reads informally as follows:
A `statement' can be made of a `return' keyword, an `expression' and a `semicolon'.
There would be many other rules for `statement', one for each kind of statement in C++.
One nonterminal symbol must be distinguished as the special one which defines a complete utterance in the language. It is called the start symbol. In a compiler, this means a complete input program. In the C++ language, the nonterminal symbol `sequence of definitions and declarations' plays this role.
For example, `1 + 2' is a valid C++ expression--a valid part of a C++ program--but it is not valid as an entire C++ program. In the context-free grammar of C++, this follows from the fact that `expression' is not the start symbol.
The bisonc++ parser reads a sequence of tokens as its input, and groups the tokens using the grammar rules. If the input is valid, the end result is that the entire token sequence reduces to a single grouping whose symbol is the grammar's start symbol. If we use a grammar for C++, the entire input must be a `sequence of definitions and declarations'. If not, the parser reports a syntax error.
A nonterminal symbol in the formal grammar is represented in bisonc++ input as
an identifier, like an identifier in C++. By convention, it should be in
lower case, such as expr
, stmt
or declaration
.
The bisonc++ representation for a terminal symbol is also called a token
type. Token types as well can be represented as C++-like identifiers. By
convention, these identifiers should be upper case to distinguish them from
nonterminals: for example, INTEGER
, IDENTIFIER
, IF
or
RETURN
. A terminal symbol that stands for a particular keyword in the
language should be named after that keyword converted to upper case. The
terminal symbol error
is reserved for error recovery. See section
5.2, which also describes the current restrictions on the names of
terminal symbols.
A terminal symbol can also be represented as a character literal, just like a C++ character constant. You should do this whenever a token is just a single character (parenthesis, plus-sign, etc.): use that same character in a literal as the terminal symbol for that token.
The grammar rules also have an expression in bisonc++ syntax. For example, here is the bisonc++ rule for a C++ return statement. The semicolon in quotes is a literal character token, representing part of the C++ syntax for the statement; the naked semicolon, and the colon, are bisonc++ punctuation used in every rule.
stmt: RETURN expr ';' ;See section 5.3.
x + 4
' is
grammatical then `x + 1
' or `x + 3989
' is equally grammatical.
But the precise value is very important for what the input means once it is parsed. A compiler is useless if it fails to distinguish between 4, 1 and 3989 as constants in the program! Therefore, each token in a bisonc++ grammar has both a token type and a semantic value. See section 5.6 for details.
The token type is a terminal symbol defined in the grammar, such as
INTEGER
, IDENTIFIER
or ',
'. It tells everything you need to know
to decide where the token may validly appear and how to group it with other
tokens. The grammar rules know nothing about tokens except their types.
The semantic value has all the rest of the information about the meaning of
the token, such as the value of an integer, or the name of an identifier. (A
token such as ',
' which is just punctuation doesn't need to have any
semantic value.)
For example, an input token might be classified as token type INTEGER
and
have the semantic value 4. Another input token might have the same token type
INTEGER
but value 3989. When a grammar rule says that INTEGER
is
allowed, either of these tokens is acceptable because each is an
INTEGER
. When the parser accepts the token, it keeps track of the token's
semantic value.
Each grouping can also have a semantic value as well as its nonterminal symbol. For example, in a calculator, an expression typically has a semantic value that is a number. In a compiler for a programming language, an expression typically has a semantic value that is a tree structure describing the meaning of the expression.
Most of the time, the purpose of an action is to compute the semantic value of the whole construct from the semantic values of its parts. For example, suppose we have a rule which says an expression can be the sum of two expressions. When the parser recognizes such a sum, each of the subexpressions has a semantic value which describes how it was built up. The action for this rule should create a similar sort of value for the newly recognized larger expression.
For example, here is a rule that says an expression can be the sum of two subexpressions:
expr: expr '+' expr { $$ = $1 + $3; } ;The action says how to produce the semantic value of the sum expression from the values of the two subexpressions.
parse()
)
that parses the language described by the grammar. The class and its
implementation is called a bisonc++ parser class. Keep in mind that the
Bisonc++ utility and the bisonc++ parser class are two distinct pieces of
software: the bisonc++ utility is a program whose output is the bisonc++
parser class that becomes part of your program.
More specifically, bisonc++ generates the following files from a bisonc++ grammar file:
The job of the bisonc++ parsing member is to group tokens into groupings according to the grammar rules--for example, to build identifiers and operators into expressions. As it does this, it runs the actions for the grammar rules it uses.
In C++ the tokens should be produced by an object called the lexical analyzer or lexical scanner that you must supply in some fashion (such as by writing it in C++). The bisonc++ parsing member requests the next token from the lexical analyzer each time it wants a new token. The parser itself doesn't know what is "inside" the tokens (though their semantic values may reflect this). Typically the lexical analyzer makes the tokens by parsing characters of text, but bisonc++ does not depend on this. See section 6.3.1.
The bisonc++ parsing function is C++ code defining a member function named
parse()
which implements that grammar. This parsing function nor the
parser object for which it is called does make a complete C++ program: you
must supply some additional details. One `detail' to be supplied is is the
lexical analyzer. The parser class itself declares several more members which
must be defined when used. One of these additional members is an
error-reporting function which the parser calls to report an error. Simple
default, yet sensible, implementations for these additional members may be
generated by bisonc++. Having constructed a parser class and a lexical scanner
class, objects of these classes must be defined in a complete C++
program. Usually such objects are defined in a function called main()
; you
have to provide this, and arrange for it to call the parser's parse()
function, or the parser will never run. See chapter 6.
Note that, different from conventions used by Bison and Bison++, there is no special name convention requirement anymore imposed by bisonc++. In particular, there is no need to begin all variable and function names used in the bisonc++ parser with `yy' or `YY' anymore. However, some name restrictions on symbolic tokens exist. See section 5.5.24.1 for details.
Currently, bisonc++ generates a parsing member which may or may not be reentrant, depending on whether or not the option --thread-safe is specified.
The source file generated by bisonc++ containing the parsing member function not
only contains this function, but also various tables (e.g., state transition
tables) defined in the anonymous name space. When the option --thread-safe
is provided, these tables are const
tables: their elements are not changed
by the parsing function and so the parsing function, as it only manipulates
its own local data, becomes reentrant.
parse()
) and its
support functions must be implemented by the programmer. Of course, additional
member functions should also be declared in the parser class' header. At the
very least the member int lex()
calling the lexecal scanner to produce the
next available token must be implemented (although a standardized
implementation can also be generated by bisonc++). The member lex()
is
called by parse()
(support functions) to obtain the next available
token. The member function void error(char const *msg)
may also be
re-implemented by the programmer, but a basic in-line implementation is
provided by default. The member function error()
is called when
parse()
detects (syntactic) errors.
int main() { Parser parser; return parser.parse(); }
Bisonc++ directives %% Grammar rulesReaders familiar with Bison may note that there is no C declaration section and no section to define Additional C code. With bisonc++ these sections are superfluous since, due to the fact that a bisonc++ parser is a class, all additional code required for the parser's implementation can be incorporated into the parser class itself. Also, C++ classes normally only require declarations that can be defined in the classes' header files, so also the `additional C code' section could be omitted from the bisonc++ grammar file.
The `%%' is a punctuation that appears in every bisonc++ grammar file to separate the two sections.
The bisonc++ directives section is used to declare the names of the terminal and nonterminal symbols, and may also describe operator precedence and the data types of semantic values of various symbols. Furthermore, this section is also used to specify bisonc++ directives. These bisonc++ directives are used to define, e.g., the name of the generated parser class and a namespace in which the parser class will be defined.
The grammar rules define how to construct each nonterminal symbol from its parts.
One special directive is availble that may be used in the directives
section and in the grammar rules section. This directive is %include
,
allowing you to split long grammar specification files into smaller, more
comprehensible and accessible chunks. The %include
directive is discussed
in more detail in section 5.5.7.