diff --git a/doc/ply.md b/doc/ply.md
new file mode 100644
index 0000000..29a16de
--- /dev/null
+++ b/doc/ply.md
@@ -0,0 +1,2587 @@
+# PLY (Python Lex-Yacc)
+
+This document provides an overview of lexing and parsing with PLY. Given
+the intrinsic complexity of parsing, I strongly advise that you read (or
+at least skim) this entire document before jumping into a big
+development project with PLY.
+
+The current version requires Python 3.6 or newer. If you're using an
+older version of Python, use one of the historical releases.
+
+## Introduction
+
+PLY is a pure-Python implementation of the compiler construction tools
+lex and yacc. The main goal of PLY is to stay fairly faithful to the way
+in which traditional lex/yacc tools work. This includes supporting
+LALR(1) parsing as well as providing extensive input validation, error
+reporting, and diagnostics. Thus, if you've used yacc in another
+programming language, it should be relatively straightforward to use
+PLY.
+
+Early versions of PLY were developed to support an Introduction to
+Compilers Course I taught in 2001 at the University of Chicago. Since
+PLY was primarily developed as an instructional tool, you will find it
+to be fairly picky about token and grammar rule specification. In part,
+this added formality is meant to catch common programming mistakes made
+by novice users. However, advanced users will also find such features to
+be useful when building complicated grammars for real programming
+languages. It should also be noted that PLY does not provide much in the
+way of bells and whistles (e.g., automatic construction of abstract
+syntax trees, tree traversal, etc.). Nor would I consider it to be a
+parsing framework. Instead, you will find a bare-bones, yet fully
+capable lex/yacc implementation written entirely in Python.
+
+The rest of this document assumes that you are somewhat familiar with
+parsing theory, syntax directed translation, and the use of compiler
+construction tools such as lex and yacc in other programming languages.
+If you are unfamiliar with these topics, you will probably want to
+consult an introductory text such as "Compilers: Principles,
+Techniques, and Tools" by Aho, Sethi, and Ullman. O'Reilly's "Lex
+and Yacc" by John Levine may also be handy. In fact, the O'Reilly book
+can be used as a reference for PLY as the concepts are virtually
+identical.
+
+## PLY Overview
+
+PLY consists of two separate modules: `lex.py` and `yacc.py`, both of
+which are found in a Python package called `ply`. The `lex.py` module is
+used to break input text into a collection of tokens specified by a
+collection of regular expression rules. `yacc.py` is used to recognize
+language syntax that has been specified in the form of a context free
+grammar.
+
+The two tools are meant to work together. Specifically, `lex.py`
+provides an interface to produce tokens. `yacc.py` uses this interface to
+retrieve tokens and invoke grammar rules. The output of `yacc.py` is often an
+Abstract Syntax Tree (AST). However, this is entirely up to the user. If
+desired, `yacc.py` can also be used to implement simple one-pass
+compilers.
+
+Like its Unix counterpart, `yacc.py` provides most of the features you
+expect including extensive error checking, grammar validation, support
+for empty productions, error tokens, and ambiguity resolution via
+precedence rules. In fact, almost everything that is possible in
+traditional yacc should be supported in PLY.
+
+The primary difference between `yacc.py` and Unix `yacc` is that
+`yacc.py` doesn't involve a separate code-generation process. Instead,
+PLY relies on reflection (introspection) to build its lexers and
+parsers. Unlike traditional lex/yacc which require a special input file
+that is converted into a separate source file, the specifications given
+to PLY *are* valid Python programs. This means that there are no extra
+source files nor is there a special compiler construction step (e.g.,
+running yacc to generate Python code for the compiler).
+
+## Lex
+
+`lex.py` is used to tokenize an input string. For example, suppose
+you're writing a programming language and a user supplied the following
+input string:
+
+ x = 3 + 42 * (s - t)
+
+A tokenizer splits the string into individual tokens:
+
+ 'x','=', '3', '+', '42', '*', '(', 's', '-', 't', ')'
+
+Tokens are usually given names to indicate what they are. For example:
+
+ 'ID','EQUALS','NUMBER','PLUS','NUMBER','TIMES',
+ 'LPAREN','ID','MINUS','ID','RPAREN'
+
+More specifically, the input is broken into pairs of token types and
+values. For example:
+
+ ('ID','x'), ('EQUALS','='), ('NUMBER','3'),
+ ('PLUS','+'), ('NUMBER','42'), ('TIMES','*'),
+ ('LPAREN','('), ('ID','s'), ('MINUS','-'),
+ ('ID','t'), ('RPAREN',')'
+
+The specification of tokens is done by writing a series of regular
+expression rules. The next section shows how this is done using
+`lex.py`.
+
+### Lex Example
+
+The following example shows how `lex.py` is used to write a simple
+tokenizer:
+
+ # ------------------------------------------------------------
+ # calclex.py
+ #
+ # tokenizer for a simple expression evaluator for
+ # numbers and +,-,*,/
+ # ------------------------------------------------------------
+ import ply.lex as lex
+
+ # List of token names. This is always required
+ tokens = (
+ 'NUMBER',
+ 'PLUS',
+ 'MINUS',
+ 'TIMES',
+ 'DIVIDE',
+ 'LPAREN',
+ 'RPAREN',
+ )
+
+ # Regular expression rules for simple tokens
+ t_PLUS = r'\+'
+ t_MINUS = r'-'
+ t_TIMES = r'\*'
+ t_DIVIDE = r'/'
+ t_LPAREN = r'\('
+ t_RPAREN = r'\)'
+
+ # A regular expression rule with some action code
+ def t_NUMBER(t):
+ r'\d+'
+ t.value = int(t.value)
+ return t
+
+ # Define a rule so we can track line numbers
+ def t_newline(t):
+ r'\n+'
+ t.lexer.lineno += len(t.value)
+
+ # A string containing ignored characters (spaces and tabs)
+ t_ignore = ' \t'
+
+ # Error handling rule
+ def t_error(t):
+ print("Illegal character '%s'" % t.value[0])
+ t.lexer.skip(1)
+
+ # Build the lexer
+ lexer = lex.lex()
+
+To use the lexer, you first need to feed it some input text using its
+`input()` method. After that, repeated calls to `token()` produce
+tokens. The following code shows how this works:
+
+ # Test it out
+ data = '''
+ 3 + 4 * 10
+ + -20 *2
+ '''
+
+ # Give the lexer some input
+ lexer.input(data)
+
+ # Tokenize
+ while True:
+ tok = lexer.token()
+ if not tok:
+ break # No more input
+ print(tok)
+
+When executed, the example will produce the following output:
+
+ $ python example.py
+ LexToken(NUMBER,3,2,1)
+ LexToken(PLUS,'+',2,3)
+ LexToken(NUMBER,4,2,5)
+ LexToken(TIMES,'*',2,7)
+ LexToken(NUMBER,10,2,10)
+ LexToken(PLUS,'+',3,14)
+ LexToken(MINUS,'-',3,16)
+ LexToken(NUMBER,20,3,18)
+ LexToken(TIMES,'*',3,20)
+ LexToken(NUMBER,2,3,21)
+
+Lexers also support the iteration protocol. So, you can write the above
+loop as follows:
+
+ for tok in lexer:
+ print(tok)
+
+The tokens returned by `lexer.token()` are instances of `LexToken`. This
+object has attributes `type`, `value`, `lineno`, and `lexpos`. The
+following code shows an example of accessing these attributes:
+
+ # Tokenize
+ while True:
+ tok = lexer.token()
+ if not tok:
+ break # No more input
+ print(tok.type, tok.value, tok.lineno, tok.lexpos)
+
+The `type` and `value` attributes contain the type and value of the
+token itself. `lineno` and `lexpos` contain information about the
+location of the token. `lexpos` is the index of the token relative to
+the start of the input text.
+
+### The tokens list
+
+All lexers must provide a list `tokens` that defines all of the possible
+token names that can be produced by the lexer. This list is always
+required and is used to perform a variety of validation checks. The
+tokens list is also used by the `yacc.py` module to identify terminals.
+
+In the example, the following code specified the token names:
+
+ tokens = (
+ 'NUMBER',
+ 'PLUS',
+ 'MINUS',
+ 'TIMES',
+ 'DIVIDE',
+ 'LPAREN',
+ 'RPAREN',
+ )
+
+### Specification of tokens
+
+Each token is specified by writing a regular expression rule compatible
+with Python's `re` module. Each of these rules is defined by making a
+declaration with the special prefix `t_` to indicate that it defines a
+token. For simple tokens, the regular expression can be specified as
+strings such as this (note: Python raw strings are used since they are
+the most convenient way to write regular expression strings):
+
+ t_PLUS = r'\+'
+
+In this case, the name following the `t_` must exactly match one of the
+names supplied in `tokens`. If some kind of action needs to be
+performed, a token rule can be specified as a function. For example,
+this rule matches numbers and converts the string into a Python integer:
+
+ def t_NUMBER(t):
+ r'\d+'
+ t.value = int(t.value)
+ return t
+
+When a function is used, the regular expression rule is specified in the
+function documentation string. The function always takes a single
+argument which is an instance of `LexToken`. This object has attributes
+of `type` which is the token type (as a string), `value` which is the
+lexeme (the actual text matched), `lineno` which is the current line
+number, and `lexpos` which is the position of the token relative to the
+beginning of the input text. By default, `type` is set to the name
+following the `t_` prefix. The action function can modify the contents
+of the `LexToken` object as appropriate. However, when it is done, the
+resulting token should be returned. If no value is returned by the
+action function, the token is discarded and the next token read.
+
+Internally, `lex.py` uses the `re` module to do its pattern matching.
+Patterns are compiled using the `re.VERBOSE` flag which can be used to
+help readability. However, be aware that unescaped whitespace is ignored
+and comments are allowed in this mode. If your pattern involves
+whitespace, make sure you use `\s`. If you need to match the `#`
+character, use `[#]`.
+
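+A short illustration of this point (the `t_RANGE` rule below is hypothetical,
+not part of the calculator example, and would also need a `RANGE` entry in
+`tokens`): under `re.VERBOSE`, a bare space in the pattern is simply dropped,
+so whitespace has to be written with `\s`:
+
+    # Matches things like "1 .. 10"; a plain ' ' in the pattern would be
+    # ignored under re.VERBOSE, which is why \s* is used instead.
+    def t_RANGE(t):
+        r'\d+\s*\.\.\s*\d+'
+        return t
+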
+When building the master regular expression, rules are added in the
+following order:
+
+1. All tokens defined by functions are added in the same order as they
+ appear in the lexer file.
+2. Tokens defined by strings are added next by sorting them in order of
+ decreasing regular expression length (longer expressions are added
+ first).
+
+Without this ordering, it can be difficult to correctly match certain
+types of tokens. For example, if you wanted to have separate tokens for
+\"=\" and \"==\", you need to make sure that \"==\" is checked first. By
+sorting regular expressions in order of decreasing length, this problem
+is solved for rules defined as strings. For functions, the order can be
+explicitly controlled since rules appearing first are checked first.
+
+To handle reserved words, you should write a single rule to match an
+identifier and do a special name lookup in a function like this:
+
+ reserved = {
+ 'if' : 'IF',
+ 'then' : 'THEN',
+ 'else' : 'ELSE',
+ 'while' : 'WHILE',
+ ...
+ }
+
+ tokens = ['LPAREN','RPAREN',...,'ID'] + list(reserved.values())
+
+ def t_ID(t):
+ r'[a-zA-Z_][a-zA-Z_0-9]*'
+ t.type = reserved.get(t.value,'ID') # Check for reserved words
+ return t
+
+This approach greatly reduces the number of regular expression rules and
+is likely to make things a little faster.
+
+Note: You should avoid writing individual rules for reserved words. For
+example, if you write rules like this:
+
+ t_FOR = r'for'
+ t_PRINT = r'print'
+
+those rules will be triggered for identifiers that include those words
+as a prefix, such as "forget" or "printed". This is probably not what
+you want.
+
+### Token values
+
+When tokens are returned by lex, they have a value that is stored in the
+`value` attribute. Normally, the value is the text that was matched.
+However, the value can be assigned to any Python object. For instance,
+when lexing identifiers, you may want to return both the identifier name
+and information from some sort of symbol table. To do this, you might
+write a rule like this:
+
+ def t_ID(t):
+ ...
+ # Look up symbol table information and return a tuple
+ t.value = (t.value, symbol_lookup(t.value))
+ ...
+ return t
+
+It is important to note that storing data in other attribute names is
+*not* recommended. The `yacc.py` module only exposes the contents of the
+`value` attribute. Thus, accessing other attributes may be unnecessarily
+awkward. If you need to store multiple values on a token, assign a
+tuple, dictionary, or instance to `value`.
+
+### Discarded tokens
+
+To discard a token, such as a comment, define a token rule that returns
+no value. For example:
+
+ def t_COMMENT(t):
+ r'\#.*'
+ pass
+ # No return value. Token discarded
+
+Alternatively, you can include the prefix `ignore_` in the token
+declaration to force a token to be ignored. For example:
+
+ t_ignore_COMMENT = r'\#.*'
+
+Be advised that if you are ignoring many different kinds of text, you
+may still want to use functions since these provide more precise control
+over the order in which regular expressions are matched (i.e., functions
+are matched in order of specification whereas strings are sorted by
+regular expression length).
+
+### Line numbers and positional information
+
+By default, `lex.py` knows nothing about line numbers. This is because
+`lex.py` doesn't know anything about what constitutes a "line" of
+input (e.g., the newline character or even if the input is textual
+data). To update this information, you need to write a special rule. In
+the example, the `t_newline()` rule shows how to do this:
+
+ # Define a rule so we can track line numbers
+ def t_newline(t):
+ r'\n+'
+ t.lexer.lineno += len(t.value)
+
+Within the rule, the `lineno` attribute of the underlying lexer
+`t.lexer` is updated. After the line number is updated, the token is
+discarded since nothing is returned.
+
+`lex.py` does not perform any kind of automatic column tracking.
+However, it does record positional information related to each token in
+the `lexpos` attribute. Using this, it is usually possible to compute
+column information as a separate step. For instance, just count
+backwards until you reach a newline:
+
+ # Compute column.
+ # input is the input text string
+ # token is a token instance
+ def find_column(input, token):
+ line_start = input.rfind('\n', 0, token.lexpos) + 1
+ return (token.lexpos - line_start) + 1
+
+Since column information is often only useful in the context of error
+handling, calculating the column position can be performed when needed
+as opposed to doing it for each token. Note: If you're parsing a
+language where whitespace matters (e.g., Python), it's probably better
+to match whitespace as a token instead of ignoring it.
+
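+As a brief sketch of that last point (the `WS` token is an assumption and
+would also have to be added to the `tokens` list), an indentation-sensitive
+language might return whitespace instead of ignoring it:
+
+    # Return runs of spaces/tabs as their own token so the parser can see
+    # indentation, rather than listing them in t_ignore.
+    def t_WS(t):
+        r'[ \t]+'
+        return t
+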
+### Ignored characters
+
+The special `t_ignore` rule is reserved by `lex.py` for characters that
+should be completely ignored in the input stream. Usually this is used
+to skip over whitespace and other non-essential characters. Although it
+is possible to define a regular expression rule for whitespace in a
+manner similar to `t_newline()`, the use of `t_ignore` provides
+substantially better lexing performance because it is handled as a
+special case and is checked in a much more efficient manner than the
+normal regular expression rules.
+
+The characters given in `t_ignore` are not ignored when such characters
+are part of other regular expression patterns. For example, if you had a
+rule to capture quoted text, that pattern can include the ignored
+characters (which will be captured in the normal way). The main purpose
+of `t_ignore` is to ignore whitespace and other padding between the
+tokens that you actually want to parse.
+
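+For example, with `t_ignore = ' \t'` in effect, a hypothetical quoted-string
+rule still captures any spaces or tabs appearing inside the quotes:
+
+    # The blanks inside the quotes are part of the match and end up in
+    # t.value; t_ignore only applies between tokens, not within a pattern.
+    def t_STRING(t):
+        r'"[^"\n]*"'
+        return t
+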
+### Literal characters
+
+Literal characters can be specified by defining a variable `literals` in
+your lexing module. For example:
+
+ literals = [ '+','-','*','/' ]
+
+or alternatively:
+
+ literals = "+-*/"
+
+A literal character is a single character that is returned "as is"
+when encountered by the lexer. Literals are checked after all of the
+defined regular expression rules. Thus, if a rule starts with one of the
+literal characters, it will always take precedence.
+
+When a literal token is returned, both its `type` and `value` attributes
+are set to the character itself. For example, `'+'`.
+
+It's possible to write token functions that perform additional actions
+when literals are matched. However, you'll need to set the token type
+appropriately. For example:
+
+ literals = [ '{', '}' ]
+
+ def t_lbrace(t):
+ r'\{'
+ t.type = '{' # Set token type to the expected literal
+ return t
+
+ def t_rbrace(t):
+ r'\}'
+ t.type = '}' # Set token type to the expected literal
+ return t
+
+### Error handling
+
+The `t_error()` function is used to handle lexing errors that occur when
+illegal characters are detected. In this case, the `t.value` attribute
+contains the rest of the input string that has not been tokenized. In
+the example, the error function was defined as follows:
+
+ # Error handling rule
+ def t_error(t):
+ print("Illegal character '%s'" % t.value[0])
+ t.lexer.skip(1)
+
+In this case, we print the offending character and skip ahead one
+character by calling `t.lexer.skip(1)`.
+
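+If you want richer diagnostics, the handler can also report where the error
+occurred. This is just a sketch that assumes the `find_column()` helper from
+the earlier section on line numbers and positional information:
+
+    def t_error(t):
+        # Report the offending character with its line and column, then
+        # resynchronize by skipping a single character.
+        col = find_column(t.lexer.lexdata, t)
+        print("Illegal character %r at line %d column %d" % (t.value[0], t.lineno, col))
+        t.lexer.skip(1)
+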
+### EOF Handling
+
+The `t_eof()` function is used to handle an end-of-file (EOF) condition
+in the input. As input, it receives a token of type `'eof'` with the
+`lineno` and `lexpos` attributes set appropriately. The main use of this
+function is to provide more input to the lexer so that it can continue
+to parse. Here is an example of how this works:
+
+ # EOF handling rule
+ def t_eof(t):
+ # Get more input (Example)
+ more = input('... ')
+ if more:
+ t.lexer.input(more)
+ return t.lexer.token()
+ return None
+
+The EOF function should return the next available token (by calling
+`t.lexer.token()`) or `None` to indicate no more data. Be aware that
+setting more input with the `t.lexer.input()` method does NOT reset
+the lexer state or the `lineno` attribute used for position tracking.
+The `lexpos` attribute is reset, so be aware of that if you're using it
+in error reporting.
+
+### Building and using the lexer
+
+To build the lexer, the function `lex.lex()` is used. For example:
+
+ lexer = lex.lex()
+
+This function uses Python reflection (or introspection) to read the
+regular expression rules out of the calling context and build the lexer.
+Once the lexer has been built, two methods can be used to control the
+lexer.
+
+`lexer.input(data)`. Reset the lexer and store a new input string.
+
+`lexer.token()`. Return the next token. Returns a special `LexToken`
+instance on success or None if the end of the input text has been
+reached.
+
+### The @TOKEN decorator
+
+In some applications, you may want to define tokens as a series of more
+complex regular expression rules. For example:
+
+ digit = r'([0-9])'
+ nondigit = r'([_A-Za-z])'
+ identifier = r'(' + nondigit + r'(' + digit + r'|' + nondigit + r')*)'
+
+ def t_ID(t):
+ # want docstring to be identifier above. ?????
+ ...
+
+In this case, we want the regular expression rule for `ID` to be one of
+the variables above. However, there is no way to directly specify this
+using a normal documentation string. To solve this problem, you can use
+the `@TOKEN` decorator. For example:
+
+ from ply.lex import TOKEN
+
+ @TOKEN(identifier)
+ def t_ID(t):
+ ...
+
+This will attach `identifier` to the docstring for `t_ID()` allowing
+`lex.py` to work normally. Naturally, you could use `@TOKEN` on all
+functions as an alternative to using docstrings.
+
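+Since `@TOKEN` accepts an ordinary string, you can also pass the pattern
+directly rather than building it up in a separate variable. For example, the
+following is equivalent to the earlier docstring-based `t_NUMBER` rule:
+
+    from ply.lex import TOKEN
+
+    @TOKEN(r'\d+')
+    def t_NUMBER(t):
+        t.value = int(t.value)
+        return t
+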
+### Debugging
+
+For the purpose of debugging, you can run `lex()` in a debugging mode as
+follows:
+
+ lexer = lex.lex(debug=True)
+
+This will produce various sorts of debugging information including all
+of the added rules, the master regular expressions used by the lexer,
+and tokens generated during lexing.
+
+In addition, `lex.py` comes with a simple main function which will
+either tokenize input read from standard input or from a file specified
+on the command line. To use it, put this in your lexer:
+
+ if __name__ == '__main__':
+ lex.runmain()
+
+Please refer to the "Debugging" section near the end for some more
+advanced details of debugging.
+
+### Alternative specification of lexers
+
+As shown in the example, lexers are specified all within one Python
+module. If you want to put token rules in a different module from the
+one in which you invoke `lex()`, use the `module` keyword argument.
+
+For example, you might have a dedicated module that just contains the
+token rules:
+
+ # module: tokrules.py
+ # This module just contains the lexing rules
+
+ # List of token names. This is always required
+ tokens = (
+ 'NUMBER',
+ 'PLUS',
+ 'MINUS',
+ 'TIMES',
+ 'DIVIDE',
+ 'LPAREN',
+ 'RPAREN',
+ )
+
+ # Regular expression rules for simple tokens
+ t_PLUS = r'\+'
+ t_MINUS = r'-'
+ t_TIMES = r'\*'
+ t_DIVIDE = r'/'
+ t_LPAREN = r'\('
+ t_RPAREN = r'\)'
+
+ # A regular expression rule with some action code
+ def t_NUMBER(t):
+ r'\d+'
+ t.value = int(t.value)
+ return t
+
+ # Define a rule so we can track line numbers
+ def t_newline(t):
+ r'\n+'
+ t.lexer.lineno += len(t.value)
+
+ # A string containing ignored characters (spaces and tabs)
+ t_ignore = ' \t'
+
+ # Error handling rule
+ def t_error(t):
+ print("Illegal character '%s'" % t.value[0])
+ t.lexer.skip(1)
+
+Now, if you wanted to build a tokenizer from these rules from within a
+different module, you would do the following (shown for Python
+interactive mode):
+
+ >>> import tokrules
+ >>> lexer = lex.lex(module=tokrules)
+ >>> lexer.input("3 + 4")
+ >>> lexer.token()
+ LexToken(NUMBER,3,1,0)
+ >>> lexer.token()
+ LexToken(PLUS,'+',1,2)
+ >>> lexer.token()
+ LexToken(NUMBER,4,1,4)
+ >>> lexer.token()
+ None
+ >>>
+
+The `module` option can also be used to define lexers from instances of
+a class. For example:
+
+ import ply.lex as lex
+
+ class MyLexer(object):
+ # List of token names. This is always required
+ tokens = (
+ 'NUMBER',
+ 'PLUS',
+ 'MINUS',
+ 'TIMES',
+ 'DIVIDE',
+ 'LPAREN',
+ 'RPAREN',
+ )
+
+ # Regular expression rules for simple tokens
+ t_PLUS = r'\+'
+ t_MINUS = r'-'
+ t_TIMES = r'\*'
+ t_DIVIDE = r'/'
+ t_LPAREN = r'\('
+ t_RPAREN = r'\)'
+
+ # A regular expression rule with some action code
+ # Note addition of self parameter since we're in a class
+ def t_NUMBER(self,t):
+ r'\d+'
+ t.value = int(t.value)
+ return t
+
+ # Define a rule so we can track line numbers
+ def t_newline(self,t):
+ r'\n+'
+ t.lexer.lineno += len(t.value)
+
+ # A string containing ignored characters (spaces and tabs)
+ t_ignore = ' \t'
+
+ # Error handling rule
+ def t_error(self,t):
+ print("Illegal character '%s'" % t.value[0])
+ t.lexer.skip(1)
+
+ # Build the lexer
+ def build(self,**kwargs):
+ self.lexer = lex.lex(module=self, **kwargs)
+
+ # Test it out
+ def test(self,data):
+ self.lexer.input(data)
+ while True:
+ tok = self.lexer.token()
+ if not tok:
+ break
+ print(tok)
+
+ # Build the lexer and try it out
+ m = MyLexer()
+ m.build() # Build the lexer
+ m.test("3 + 4") # Test it
+
+When building a lexer from a class, *you should construct the lexer from
+an instance of the class*, not the class object itself. This is because
+PLY only works properly if the lexer actions are defined by bound
+methods.
+
+When using the `module` option to `lex()`, PLY collects symbols from the
+underlying object using the `dir()` function. There is no direct access
+to the `__dict__` attribute of the object supplied as a module value.
+
+Finally, if you want to keep things nicely encapsulated, but don't want
+to use a full-fledged class definition, lexers can be defined using
+closures. For example:
+
+ import ply.lex as lex
+
+ # List of token names. This is always required
+ tokens = (
+ 'NUMBER',
+ 'PLUS',
+ 'MINUS',
+ 'TIMES',
+ 'DIVIDE',
+ 'LPAREN',
+ 'RPAREN',
+ )
+
+ def MyLexer():
+ # Regular expression rules for simple tokens
+ t_PLUS = r'\+'
+ t_MINUS = r'-'
+ t_TIMES = r'\*'
+ t_DIVIDE = r'/'
+ t_LPAREN = r'\('
+ t_RPAREN = r'\)'
+
+ # A regular expression rule with some action code
+ def t_NUMBER(t):
+ r'\d+'
+ t.value = int(t.value)
+ return t
+
+ # Define a rule so we can track line numbers
+ def t_newline(t):
+ r'\n+'
+ t.lexer.lineno += len(t.value)
+
+ # A string containing ignored characters (spaces and tabs)
+ t_ignore = ' \t'
+
+ # Error handling rule
+ def t_error(t):
+ print("Illegal character '%s'" % t.value[0])
+ t.lexer.skip(1)
+
+ # Build the lexer from my environment and return it
+ return lex.lex()
+
+Important note: If you are defining a lexer using a class or closure, be
+aware that PLY still requires you to only define a single lexer per
+module (source file). There are extensive validation/error-checking
+parts of PLY that may falsely report error messages if you don't
+follow this rule.
+
+### Maintaining state
+
+In your lexer, you may want to maintain a variety of state information.
+This might include mode settings, symbol tables, and other details. As
+an example, suppose that you wanted to keep track of how many NUMBER
+tokens had been encountered.
+
+One way to do this is to keep a set of global variables in the module
+where you created the lexer. For example:
+
+ num_count = 0
+ def t_NUMBER(t):
+ r'\d+'
+ global num_count
+ num_count += 1
+ t.value = int(t.value)
+ return t
+
+If you don't like the use of a global variable, another place to store
+information is inside the Lexer object created by `lex()`. To do this,
+you can use the `lexer` attribute of tokens passed to the various
+rules. For example:
+
+ def t_NUMBER(t):
+ r'\d+'
+ t.lexer.num_count += 1 # Note use of lexer attribute
+ t.value = int(t.value)
+ return t
+
+ lexer = lex.lex()
+ lexer.num_count = 0 # Set the initial count
+
+This latter approach has the advantage of being simple and working
+correctly in applications where multiple instantiations of a given lexer
+exist in the same application. However, this might also feel like a
+gross violation of encapsulation to OO purists. Just to put your mind at
+some ease, all internal attributes of the lexer (with the exception of
+`lineno`) have names that are prefixed by `lex` (e.g.,
+`lexdata`, `lexpos`, etc.). Thus, it is perfectly safe to store
+attributes in the lexer that don't have names starting with that prefix
+or a name that conflicts with one of the predefined methods (e.g.,
+`input()`, `token()`, etc.).
+
+If you don't like assigning values on the lexer object, you can define
+your lexer as a class as shown in the previous section:
+
+ class MyLexer:
+ ...
+ def t_NUMBER(self,t):
+ r'\d+'
+ self.num_count += 1
+ t.value = int(t.value)
+ return t
+
+ def build(self, **kwargs):
+ self.lexer = lex.lex(object=self,**kwargs)
+
+ def __init__(self):
+ self.num_count = 0
+
+The class approach may be the easiest to manage if your application is
+going to be creating multiple instances of the same lexer and you need
+to manage a lot of state.
+
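+For example, assuming the `MyLexer` sketch above is filled out with the
+`build()` method and token rules from the earlier class-based example, each
+instance keeps its own counter and the two lexers operate independently:
+
+    a = MyLexer()
+    b = MyLexer()
+    a.build()
+    b.build()
+
+    a.lexer.input("1 2 3")
+    while a.lexer.token():
+        pass
+    print(a.num_count)    # 3
+    print(b.num_count)    # 0 -- b was never given any input
+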
+State can also be managed through closures. For example:
+
+ def MyLexer():
+ num_count = 0
+ ...
+ def t_NUMBER(t):
+ r'\d+'
+ nonlocal num_count
+ num_count += 1
+ t.value = int(t.value)
+ return t
+ ...
+
+### Lexer cloning
+
+If necessary, a lexer object can be duplicated by invoking its `clone()`
+method. For example:
+
+ lexer = lex.lex()
+ ...
+ newlexer = lexer.clone()
+
+When a lexer is cloned, the copy is exactly identical to the original
+lexer including any input text and internal state. However, the clone
+allows a different set of input text to be supplied which may be
+processed separately. This may be useful in situations when you are
+writing a parser/compiler that involves recursive or reentrant
+processing. For instance, if you needed to scan ahead in the input for
+some reason, you could create a clone and use it to look ahead. Or, if
+you were implementing some kind of preprocessor, cloned lexers could be
+used to handle different input files.
+
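+A minimal sketch of the look-ahead idea, using the calclex rules from the
+first example: because the clone starts out at the same position as the
+original, you can advance it freely without disturbing the lexer you are
+actually parsing with:
+
+    lexer = lex.lex()
+    lexer.input("3 + 4 * 10")
+    lexer.token()                # consume NUMBER(3)
+
+    peek = lexer.clone()         # same tables, same current position
+    print(peek.token().type)     # PLUS -- read ahead on the clone
+    print(lexer.token().type)    # PLUS -- the original is unaffected
+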
+Creating a clone is different than calling `lex.lex()` in that PLY
+doesn't regenerate any of the internal tables or regular expressions.
+
+Special considerations need to be made when cloning lexers that also
+maintain their own internal state using classes or closures. Namely, you
+need to be aware that the newly created lexers will share all of this
+state with the original lexer. For example, if you defined a lexer as a
+class and did this:
+
+ m = MyLexer()
+ a = lex.lex(object=m) # Create a lexer
+
+ b = a.clone() # Clone the lexer
+
+Then both `a` and `b` are going to be bound to the same object `m` and
+any changes to `m` will be reflected in both lexers. It's important to
+emphasize that `clone()` is only meant to create a new lexer that reuses
+the regular expressions and environment of another lexer. If you need to
+make a totally new copy of a lexer, then call `lex()` again.
+
+### Internal lexer state
+
+A Lexer object `lexer` has a number of internal attributes that may be
+useful in certain situations.
+
+`lexer.lexpos`
+
+: This attribute is an integer that contains the current position
+ within the input text. If you modify the value, it will change the
+ result of the next call to `token()`. Within token rule functions,
+ this points to the first character *after* the matched text. If the
+ value is modified within a rule, the next returned token will be
+ matched at the new position.
+
+`lexer.lineno`
+
+: The current value of the line number attribute stored in the lexer.
+ PLY only specifies that the attribute exists---it never sets,
+ updates, or performs any processing with it. If you want to track
+ line numbers, you will need to add code yourself (see the section on
+ line numbers and positional information).
+
+`lexer.lexdata`
+
+: The current input text stored in the lexer. This is the string
+ passed with the `input()` method. It would probably be a bad idea to
+ modify this unless you really know what you're doing.
+
+`lexer.lexmatch`
+
+: This is the raw `Match` object returned by the Python `re.match()`
+ function (used internally by PLY) for the current token. If you have
+ written a regular expression that contains named groups, you can use
+ this to retrieve those values. Note: This attribute is only updated
+ when tokens are defined and processed by functions.
+
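+A small sketch of the named-group case (the `ASSIGN` token and its pattern
+are made up for illustration):
+
+    def t_ASSIGN(t):
+        r'(?P<name>[a-zA-Z_][a-zA-Z_0-9]*)\s*='
+        # lexmatch is the re Match object for this token, so named groups
+        # written into the pattern can be pulled back out of it.
+        t.value = t.lexer.lexmatch.group('name')
+        return t
+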
+### Conditional lexing and start conditions
+
+In advanced parsing applications, it may be useful to have different
+lexing states. For instance, you may want the occurrence of a certain
+token or syntactic construct to trigger a different kind of lexing. PLY
+supports a feature that allows the underlying lexer to be put into a
+series of different states. Each state can have its own tokens, lexing
+rules, and so forth. The implementation is based largely on the "start
+condition" feature of GNU flex. Details of this can be found at
+<http://flex.sourceforge.net/manual/Start-Conditions.html>.
+
+To define a new lexing state, it must first be declared. This is done by
+including a "states" declaration in your lex file. For example:
+
+ states = (
+ ('foo','exclusive'),
+ ('bar','inclusive'),
+ )
+
+This declaration declares two states, `'foo'` and `'bar'`. States may be
+of two types: `'exclusive'` and `'inclusive'`. An exclusive state
+completely overrides the default behavior of the lexer. That is, lex
+will only return tokens and apply rules defined specifically for that
+state. An inclusive state adds additional tokens and rules to the
+default set of rules. Thus, lex will return both the tokens defined by
+default in addition to those defined for the inclusive state.
+
+Once a state has been declared, tokens and rules are declared by
+including the state name in the token/rule declaration. For example:
+
+ t_foo_NUMBER = r'\d+' # Token 'NUMBER' in state 'foo'
+ t_bar_ID = r'[a-zA-Z_][a-zA-Z0-9_]*' # Token 'ID' in state 'bar'
+
+ def t_foo_newline(t):
+ r'\n'
+ t.lexer.lineno += 1
+
+A token can be declared in multiple states by including multiple state
+names in the declaration. For example:
+
+ t_foo_bar_NUMBER = r'\d+' # Defines token 'NUMBER' in both state 'foo' and 'bar'
+
+Alternatively, a token can be declared in all states by using 'ANY' in
+the name:
+
+ t_ANY_NUMBER = r'\d+' # Defines a token 'NUMBER' in all states
+
+If no state name is supplied, as is normally the case, the token is
+associated with a special state `'INITIAL'`. For example, these two
+declarations are identical:
+
+ t_NUMBER = r'\d+'
+ t_INITIAL_NUMBER = r'\d+'
+
+States are also associated with the special `t_ignore`, `t_error()`, and
+`t_eof()` declarations. For example, if a state treats these
+differently, you can declare:
+
+ t_foo_ignore = " \t\n" # Ignored characters for state 'foo'
+
+ def t_bar_error(t): # Special error handler for state 'bar'
+ pass
+
+By default, lexing operates in the `'INITIAL'` state. This state
+includes all of the normally defined tokens. For users who aren't using
+different states, this fact is completely transparent. If, during lexing
+or parsing, you want to change the lexing state, use the `begin()`
+method. For example:
+
+ def t_begin_foo(t):
+ r'start_foo'
+ t.lexer.begin('foo') # Starts 'foo' state
+
+To get out of a state, you use `begin()` to switch back to the initial
+state. For example:
+
+ def t_foo_end(t):
+ r'end_foo'
+ t.lexer.begin('INITIAL') # Back to the initial state
+
+The management of states can also be done with a stack. For example:
+
+ def t_begin_foo(t):
+ r'start_foo'
+ t.lexer.push_state('foo') # Starts 'foo' state
+
+ def t_foo_end(t):
+ r'end_foo'
+ t.lexer.pop_state() # Back to the previous state
+
+The use of a stack would be useful in situations where there are many
+ways of entering a new lexing state and you merely want to go back to
+the previous state afterwards.
+
+An example might help clarify. Suppose you were writing a parser and you
+wanted to grab sections of arbitrary C code enclosed by curly braces.
+That is, whenever you encounter a starting brace `'{'`, you want to read
+all of the enclosed code up to the ending brace `'}'` and return it as a
+string. Doing this with a normal regular expression rule is nearly (if
+not actually) impossible. This is because braces can be nested and can
+be included in comments and strings. Thus, matching up to the first
+matching `'}'` character isn't good enough. Here is how you might use
+lexer states to do this:
+
+ # Declare the state
+ states = (
+ ('ccode','exclusive'),
+ )
+
+ # Match the first {. Enter ccode state.
+ def t_ccode(t):
+ r'\{'
+ t.lexer.code_start = t.lexer.lexpos # Record the starting position
+ t.lexer.level = 1 # Initial brace level
+ t.lexer.begin('ccode') # Enter 'ccode' state
+
+ # Rules for the ccode state
+ def t_ccode_lbrace(t):
+ r'\{'
+ t.lexer.level +=1
+
+ def t_ccode_rbrace(t):
+ r'\}'
+ t.lexer.level -=1
+
+ # If closing brace, return the code fragment
+ if t.lexer.level == 0:
+ t.value = t.lexer.lexdata[t.lexer.code_start:t.lexer.lexpos+1]
+ t.type = "CCODE"
+ t.lexer.lineno += t.value.count('\n')
+ t.lexer.begin('INITIAL')
+ return t
+
+ # C or C++ comment (ignore)
+ def t_ccode_comment(t):
+ r'(/\*(.|\n)*?\*/)|(//.*)'
+ pass
+
+ # C string
+ def t_ccode_string(t):
+ r'\"([^\\\n]|(\\.))*?\"'
+
+ # C character literal
+ def t_ccode_char(t):
+ r'\'([^\\\n]|(\\.))*?\''
+
+ # Any sequence of non-whitespace characters (not braces, strings)
+ def t_ccode_nonspace(t):
+ r'[^\s\{\}\'\"]+'
+
+ # Ignored characters (whitespace)
+ t_ccode_ignore = " \t\n"
+
+ # For bad characters, we just skip over it
+ def t_ccode_error(t):
+ t.lexer.skip(1)
+
+In this example, the occurrence of the first '{' causes the lexer to
+record the starting position and enter a new state `'ccode'`. A
+collection of rules then match various parts of the input that follow
+(comments, strings, etc.). All of these rules merely discard the token
+(by not returning a value). However, if the closing right brace is
+encountered, the rule `t_ccode_rbrace` collects all of the code (using
+the earlier recorded starting position), stores it, and returns a token
+'CCODE' containing all of that text. When returning the token, the
+lexing state is restored back to its initial state.
+
+### Miscellaneous Issues
+
+- The lexer requires input to be supplied as a single input string.
+ Since most machines have more than enough memory, this rarely
+ presents a performance concern. However, it means that the lexer
+ currently can't be used with streaming data such as open files or
+ sockets. This limitation is primarily a side-effect of using the
+ `re` module. You might be able to work around this by implementing
+ an appropriate `def t_eof()` end-of-file handling rule. The main
+ complication here is that you'll probably need to ensure that data
+ is fed to the lexer in a way so that it doesn't split in the
+ middle of a token.
+
+- If you need to supply optional flags to the `re.compile()` function,
+ use the `reflags` option to `lex()`. For example:
+
+ lex.lex(reflags=re.UNICODE | re.VERBOSE)
+
+ Note: by default, `reflags` is set to `re.VERBOSE`. If you provide
+ your own flags, you may need to include this for PLY to preserve its
+ normal behavior.
+
+- If you are going to create a hand-written lexer and you plan to use
+ it with `yacc.py`, it only needs to conform to the following
+ requirements:
+
+ 1. It must provide a `token()` method that returns the next token
+ or `None` if no more tokens are available.
+ 2. The `token()` method must return an object `tok` that has `type`
+ and `value` attributes. If line number tracking is being used,
+ then the token should also define a `lineno` attribute.
+
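+For instance, here is a minimal sketch of such a hand-written lexer (the class
+and helper names are made up, not part of PLY):
+
+    class SimpleToken:
+        def __init__(self, type, value, lineno=1, lexpos=0):
+            self.type = type
+            self.value = value
+            self.lineno = lineno
+            self.lexpos = lexpos
+
+    class HandWrittenLexer:
+        def __init__(self, toks):
+            self._toks = iter(toks)
+
+        def token(self):
+            # Return the next token, or None when there are no more.
+            return next(self._toks, None)
+
+    # A parser built with yacc.yacc() could then be driven like this:
+    # parser.parse(lexer=HandWrittenLexer([SimpleToken('NUMBER', 3), ...]))
+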
+## Parsing basics
+
+`yacc.py` is used to parse language syntax. Before showing an example,
+there are a few important bits of background that must be mentioned.
+First, *syntax* is usually specified in terms of a BNF grammar. For
+example, if you wanted to parse simple arithmetic expressions, you might
+first write an unambiguous grammar specification like this:
+
+ expression : expression + term
+ | expression - term
+ | term
+
+ term : term * factor
+ | term / factor
+ | factor
+
+ factor : NUMBER
+ | ( expression )
+
+In the grammar, symbols such as `NUMBER`, `+`, `-`, `*`, and `/` are
+known as *terminals* and correspond to input tokens. Identifiers such as
+`term` and `factor` refer to grammar rules comprised of a collection of
+terminals and other rules. These identifiers are known as
+*non-terminals*.
+
+The semantic behavior of a language is often specified using a technique
+known as syntax directed translation. In syntax directed translation,
+attributes are attached to each symbol in a given grammar rule along
+with an action. Whenever a particular grammar rule is recognized, the
+action describes what to do. For example, given the expression grammar
+above, you might write the specification for a simple calculator like
+this:
+
+    Grammar                              Action
+    --------------------------------     --------------------------------------------
+    expression0 : expression1 + term     expression0.val = expression1.val + term.val
+                | expression1 - term     expression0.val = expression1.val - term.val
+                | term                   expression0.val = term.val
+
+    term0 : term1 * factor               term0.val = term1.val * factor.val
+          | term1 / factor               term0.val = term1.val / factor.val
+          | factor                       term0.val = factor.val
+
+    factor : NUMBER                      factor.val = int(NUMBER.lexval)
+           | ( expression )              factor.val = expression.val
+
+A good way to think about syntax directed translation is to view each
+symbol in the grammar as a kind of object. Associated with each symbol
+is a value representing its "state" (for example, the `val` attribute
+above). Semantic actions are then expressed as a collection of functions
+or methods that operate on the symbols and associated values.
+
+Yacc uses a parsing technique known as LR-parsing or shift-reduce
+parsing. LR parsing is a bottom up technique that tries to recognize the
+right-hand-side of various grammar rules. Whenever a valid
+right-hand-side is found in the input, the appropriate action code is
+triggered and the grammar symbols are replaced by the grammar symbol on
+the left-hand-side.
+
+LR parsing is commonly implemented by shifting grammar symbols onto a
+stack and looking at the stack and the next input token for patterns
+that match one of the grammar rules. The details of the algorithm can be
+found in a compiler textbook, but the following example illustrates the
+steps that are performed if you wanted to parse the expression
+`3 + 5 * (10 - 20)` using the grammar defined above. In the example, the
+special symbol `$` represents the end of input:
+
+    Step Symbol Stack                   Input Tokens            Action
+    ---- ------------------------------ ----------------------- -------------------------------
+    1                                   3 + 5 * ( 10 - 20 )$    Shift 3
+    2    3                              + 5 * ( 10 - 20 )$      Reduce factor : NUMBER
+    3    factor                         + 5 * ( 10 - 20 )$      Reduce term : factor
+    4    term                           + 5 * ( 10 - 20 )$      Reduce expr : term
+    5    expr                           + 5 * ( 10 - 20 )$      Shift +
+    6    expr +                         5 * ( 10 - 20 )$        Shift 5
+    7    expr + 5                       * ( 10 - 20 )$          Reduce factor : NUMBER
+    8    expr + factor                  * ( 10 - 20 )$          Reduce term : factor
+    9    expr + term                    * ( 10 - 20 )$          Shift *
+    10   expr + term *                  ( 10 - 20 )$            Shift (
+    11   expr + term * (                10 - 20 )$              Shift 10
+    12   expr + term * ( 10             - 20 )$                 Reduce factor : NUMBER
+    13   expr + term * ( factor         - 20 )$                 Reduce term : factor
+    14   expr + term * ( term           - 20 )$                 Reduce expr : term
+    15   expr + term * ( expr           - 20 )$                 Shift -
+    16   expr + term * ( expr -         20 )$                   Shift 20
+    17   expr + term * ( expr - 20      )$                      Reduce factor : NUMBER
+    18   expr + term * ( expr - factor  )$                      Reduce term : factor
+    19   expr + term * ( expr - term    )$                      Reduce expr : expr - term
+    20   expr + term * ( expr           )$                      Shift )
+    21   expr + term * ( expr )         $                       Reduce factor : (expr)
+    22   expr + term * factor           $                       Reduce term : term * factor
+    23   expr + term                    $                       Reduce expr : expr + term
+    24   expr                           $                       Reduce expr
+    25                                  $                       Success!
+
+When parsing the expression, an underlying state machine and the current
+input token determine what happens next. If the next token looks like
+part of a valid grammar rule (based on other items on the stack), it is
+generally shifted onto the stack. If the top of the stack contains a
+valid right-hand-side of a grammar rule, it is usually "reduced" and
+the symbols replaced with the symbol on the left-hand-side. When this
+reduction occurs, the appropriate action is triggered (if defined). If
+the input token can't be shifted and the top of stack doesn't match
+any grammar rules, a syntax error has occurred and the parser must take
+some kind of recovery step (or bail out). A parse is only successful if
+the parser reaches a state where the symbol stack is empty and there are
+no more input tokens.
+
+It is important to note that the underlying implementation is built
+around a large finite-state machine that is encoded in a collection of
+tables. The construction of these tables is non-trivial and beyond the
+scope of this discussion. However, subtle details of this process
+explain why, in the example above, the parser chooses to shift a token
+onto the stack in step 9 rather than reducing the rule
+`expr : expr + term`.
+
+## Yacc
+
+The `ply.yacc` module implements the parsing component of PLY. The name
+\"yacc\" stands for \"Yet Another Compiler Compiler\" and is borrowed
+from the Unix tool of the same name.
+
+### An example
+
+Suppose you wanted to make a grammar for simple arithmetic expressions
+as previously described. Here is how you would do it with `yacc.py`:
+
+ # Yacc example
+
+ import ply.yacc as yacc
+
+ # Get the token map from the lexer. This is required.
+ from calclex import tokens
+
+ def p_expression_plus(p):
+ 'expression : expression PLUS term'
+ p[0] = p[1] + p[3]
+
+ def p_expression_minus(p):
+ 'expression : expression MINUS term'
+ p[0] = p[1] - p[3]
+
+ def p_expression_term(p):
+ 'expression : term'
+ p[0] = p[1]
+
+ def p_term_times(p):
+ 'term : term TIMES factor'
+ p[0] = p[1] * p[3]
+
+ def p_term_div(p):
+ 'term : term DIVIDE factor'
+ p[0] = p[1] / p[3]
+
+ def p_term_factor(p):
+ 'term : factor'
+ p[0] = p[1]
+
+ def p_factor_num(p):
+ 'factor : NUMBER'
+ p[0] = p[1]
+
+ def p_factor_expr(p):
+ 'factor : LPAREN expression RPAREN'
+ p[0] = p[2]
+
+ # Error rule for syntax errors
+ def p_error(p):
+ print("Syntax error in input!")
+
+ # Build the parser
+ parser = yacc.yacc()
+
+ while True:
+ try:
+ s = input('calc > ')
+ except EOFError:
+ break
+ if not s: continue
+ result = parser.parse(s)
+ print(result)
+
+In this example, each grammar rule is defined by a Python function where
+the docstring to that function contains the appropriate context-free
+grammar specification. The statements that make up the function body
+implement the semantic actions of the rule. Each function accepts a
+single argument `p` that is a sequence containing the values of each
+grammar symbol in the corresponding rule. The values of `p[i]` are
+mapped to grammar symbols as shown here:
+
+ def p_expression_plus(p):
+ 'expression : expression PLUS term'
+ # ^ ^ ^ ^
+ # p[0] p[1] p[2] p[3]
+
+ p[0] = p[1] + p[3]
+
+For tokens, the "value" of the corresponding `p[i]` is the *same* as
+the `p.value` attribute assigned in the lexer module. For non-terminals,
+the value is determined by whatever is placed in `p[0]` when rules are
+reduced. This value can be anything at all. However, it is probably most
+common for the value to be a simple Python type, a tuple, or an
+instance. In this example, we are relying on the fact that the `NUMBER`
+token stores an integer value in its value field. All of the other rules
+perform various types of integer operations and propagate the result.
+
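+For instance, instead of computing a numeric result, a rule can build a small
+tree node and hand it upward through `p[0]`. The tuple layout below is just
+one possible convention, not something PLY mandates:
+
+    def p_expression_plus(p):
+        'expression : expression PLUS term'
+        # Build a tuple-based AST node instead of evaluating immediately.
+        p[0] = ('+', p[1], p[3])
+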
+Note: The use of negative indices has a special meaning in
+yacc---specifically, `p[-1]` does not have the same value as `p[3]` in
+this example. Please see the section on "Embedded Actions" for further
+details.
+
+The first rule defined in the yacc specification determines the starting
+grammar symbol (in this case, a rule for `expression` appears first).
+Whenever the starting rule is reduced by the parser and no more input is
+available, parsing stops and the final value is returned (this value
+will be whatever the top-most rule placed in `p[0]`). Note: an
+alternative starting symbol can be specified using the `start` keyword
+argument to `yacc()`.
+
+The `p_error(p)` rule is defined to catch syntax errors. See the error
+handling section below for more detail.
+
+To build the parser, call the `yacc.yacc()` function. This function
+looks at the module and attempts to construct all of the LR parsing
+tables for the grammar you have specified.
+
+If any errors are detected in your grammar specification, `yacc.py` will
+produce diagnostic messages and possibly raise an exception. Some of the
+errors that can be detected include:
+
+- Duplicated function names (if more than one rule function has the
+ same name in the grammar file).
+- Shift/reduce and reduce/reduce conflicts generated by ambiguous
+ grammars.
+- Badly specified grammar rules.
+- Infinite recursion (rules that can never terminate).
+- Unused rules and tokens
+- Undefined rules and tokens
+
+The next few sections discuss grammar specification in more detail.
+
+The final part of the example shows how to actually run the parser
+created by `yacc()`. To run the parser, you have to call `parse()`
+with a string of input text. This will run all of the grammar rules and
+return the result of the entire parse. This result is the value
+assigned to `p[0]` in the starting grammar rule.
+
+### Combining Grammar Rule Functions
+
+When grammar rules are similar, they can be combined into a single
+function. For example, consider the two rules in our earlier example:
+
+ def p_expression_plus(p):
+ 'expression : expression PLUS term'
+ p[0] = p[1] + p[3]
+
+ def p_expression_minus(p):
+ 'expression : expression MINUS term'
+ p[0] = p[1] - p[3]
+
+Instead of writing two functions, you might write a single function like
+this:
+
+ def p_expression(p):
+ '''expression : expression PLUS term
+ | expression MINUS term'''
+ if p[2] == '+':
+ p[0] = p[1] + p[3]
+ elif p[2] == '-':
+ p[0] = p[1] - p[3]
+
+In general, the docstring for any given function can contain multiple
+grammar rules. So, it would have also been legal (although possibly
+confusing) to write this:
+
+ def p_binary_operators(p):
+ '''expression : expression PLUS term
+ | expression MINUS term
+ term : term TIMES factor
+ | term DIVIDE factor'''
+ if p[2] == '+':
+ p[0] = p[1] + p[3]
+ elif p[2] == '-':
+ p[0] = p[1] - p[3]
+ elif p[2] == '*':
+ p[0] = p[1] * p[3]
+ elif p[2] == '/':
+ p[0] = p[1] / p[3]
+
+When combining grammar rules into a single function, it is usually a
+good idea for all of the rules to have a similar structure (e.g., the
+same number of terms). Otherwise, the corresponding action code may be
+more complicated than necessary. However, it is possible to handle
+simple cases using `len()`. For example:
+
+ def p_expressions(p):
+ '''expression : expression MINUS expression
+ | MINUS expression'''
+ if (len(p) == 4):
+ p[0] = p[1] - p[3]
+ elif (len(p) == 3):
+ p[0] = -p[2]
+
+If parsing performance is a concern, you should resist the urge to put
+too much conditional processing into a single grammar rule as shown in
+these examples. When you add checks to see which grammar rule is being
+handled, you are actually duplicating the work that the parser has
+already performed (i.e., the parser already knows exactly what rule it
+matched). You can eliminate this overhead by using a separate `p_rule()`
+function for each grammar rule.
+
+### Character Literals
+
+If desired, a grammar may contain tokens defined as single character
+literals. For example:
+
+ def p_binary_operators(p):
+ '''expression : expression '+' term
+ | expression '-' term
+ term : term '*' factor
+ | term '/' factor'''
+ if p[2] == '+':
+ p[0] = p[1] + p[3]
+ elif p[2] == '-':
+ p[0] = p[1] - p[3]
+ elif p[2] == '*':
+ p[0] = p[1] * p[3]
+ elif p[2] == '/':
+ p[0] = p[1] / p[3]
+
+A character literal must be enclosed in quotes such as `'+'`. In
+addition, if literals are used, they must be declared in the
+corresponding `lex` file through the use of a special `literals`
+declaration:
+
+ # Literals. Should be placed in module given to lex()
+ literals = ['+','-','*','/' ]
+
+Character literals are limited to a single character. Thus, it is not
+legal to specify literals such as `'<='` or `'=='`. For this, use the
+normal lexing rules (e.g., define a rule such as `t_EQ = r'=='`).
+
+### Empty Productions
+
+`yacc.py` can handle empty productions by defining a rule like this:
+
+ def p_empty(p):
+ 'empty :'
+ pass
+
+Now to use the empty production, use 'empty' as a symbol. For example:
+
+    def p_optitem(p):
+        '''optitem : item
+                   | empty'''
+        ...
+
+Note: You can write empty rules anywhere by specifying an empty right
+hand side. However, I personally find that writing an "empty" rule and
+using "empty" to denote an empty production is easier to read and more
+clearly states your intentions.
+
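+For instance, either of the following forms (you would pick one, not both)
+makes a trailing semicolon optional; the second spells the empty alternative
+out explicitly. The `opt_semi` and `SEMI` symbols are hypothetical:
+
+    def p_opt_semi(p):
+        '''opt_semi : SEMI
+                    | '''
+
+    def p_opt_semi(p):
+        '''opt_semi : SEMI
+                    | empty'''
+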
+### Changing the starting symbol
+
+Normally, the first rule found in a yacc specification defines the
+starting grammar rule (top level rule). To change this, supply a `start`
+specifier in your file. For example:
+
+ start = 'foo'
+
+ def p_bar(p):
+ 'bar : A B'
+
+ # This is the starting rule due to the start specifier above
+ def p_foo(p):
+ 'foo : bar X'
+ ...
+
+The use of a `start` specifier may be useful during debugging since you
+can use it to have yacc build a subset of a larger grammar. For this
+purpose, it is also possible to specify a starting symbol as an argument
+to `yacc()`. For example:
+
+ parser = yacc.yacc(start='foo')
+
+### Dealing With Ambiguous Grammars
+
+The expression grammar given in the earlier example has been written in
+a special format to eliminate ambiguity. However, in many situations, it
+is extremely difficult or awkward to write grammars in this format. A
+much more natural way to express the grammar is in a more compact form
+like this:
+
+ expression : expression PLUS expression
+ | expression MINUS expression
+ | expression TIMES expression
+ | expression DIVIDE expression
+ | LPAREN expression RPAREN
+ | NUMBER
+
+Unfortunately, this grammar specification is ambiguous. For example, if
+you are parsing the string "3 * 4 + 5", there is no way to tell how
+the operators are supposed to be grouped. For example, does the
+expression mean "(3 * 4) + 5" or is it "3 * (4 + 5)"?
+
+When an ambiguous grammar is given to `yacc.py` it will print messages
+about "shift/reduce conflicts" or "reduce/reduce conflicts". A
+shift/reduce conflict is caused when the parser generator can't decide
+whether to reduce a rule or shift a symbol onto the parsing stack.
+For example, consider the string "3 * 4 + 5" and the internal parsing
+stack:
+
+    Step Symbol Stack           Input Tokens            Action
+    ---- --------------------- ----------------------- -------------------------------
+    1    $                      3 * 4 + 5$              Shift 3
+    2    $ 3                    * 4 + 5$                Reduce expression : NUMBER
+    3    $ expr                 * 4 + 5$                Shift *
+    4    $ expr *               4 + 5$                  Shift 4
+    5    $ expr * 4             + 5$                    Reduce expression : NUMBER
+    6    $ expr * expr          + 5$                    SHIFT/REDUCE CONFLICT ????
+
+In this case, when the parser reaches step 6, it has two options. One is
+to reduce the rule `expr : expr * expr` on the stack. The other option
+is to shift the token `+` onto the stack. Both options are perfectly
+legal according to the rules of the context-free grammar.
+
+By default, all shift/reduce conflicts are resolved in favor of
+shifting. Therefore, in the above example, the parser will always shift
+the `+` instead of reducing. Although this strategy works in many cases
+(for example, the case of "if-then" versus "if-then-else"), it is
+not enough for arithmetic expressions. In fact, in the above example,
+the decision to shift `+` is completely wrong---we should have reduced
+`expr * expr` since multiplication has higher mathematical precedence
+than addition.
+
+To resolve ambiguity, especially in expression grammars, `yacc.py`
+allows individual tokens to be assigned a precedence level and
+associativity. This is done by adding a variable `precedence` to the
+grammar file like this:
+
+ precedence = (
+ ('left', 'PLUS', 'MINUS'),
+ ('left', 'TIMES', 'DIVIDE'),
+ )
+
+This declaration specifies that `PLUS`/`MINUS` have the same precedence
+level and are left-associative and that `TIMES`/`DIVIDE` have the same
+precedence and are left-associative. Within the `precedence`
+declaration, tokens are ordered from lowest to highest precedence. Thus,
+this declaration specifies that `TIMES`/`DIVIDE` have higher precedence
+than `PLUS`/`MINUS` (since they appear later in the precedence
+specification).
+
+The precedence specification works by associating a numerical precedence
+level value and associativity direction to the listed tokens. For
+example, in the above example you get:
+
+ PLUS : level = 1, assoc = 'left'
+ MINUS : level = 1, assoc = 'left'
+ TIMES : level = 2, assoc = 'left'
+ DIVIDE : level = 2, assoc = 'left'
+
+These values are then used to attach a numerical precedence value and
+associativity direction to each grammar rule. *This is always determined
+by looking at the precedence of the right-most terminal symbol.* For
+example:
+
+ expression : expression PLUS expression # level = 1, left
+ | expression MINUS expression # level = 1, left
+ | expression TIMES expression # level = 2, left
+ | expression DIVIDE expression # level = 2, left
+ | LPAREN expression RPAREN # level = None (not specified)
+ | NUMBER # level = None (not specified)
+
+When shift/reduce conflicts are encountered, the parser generator
+resolves the conflict by looking at the precedence rules and
+associativity specifiers.
+
+1. If the current token has higher precedence than the rule on the
+ stack, it is shifted.
+2. If the grammar rule on the stack has higher precedence, the rule is
+ reduced.
+3. If the current token and the grammar rule have the same precedence,
+ the rule is reduced for left associativity, whereas the token is
+ shifted for right associativity.
+4. If nothing is known about the precedence, shift/reduce conflicts are
+ resolved in favor of shifting (the default).
+
+For example, if "expression PLUS expression" has been parsed and the
+next token is "TIMES", the action is going to be a shift because
+"TIMES" has a higher precedence level than "PLUS". On the other
+hand, if "expression TIMES expression" has been parsed and the next
+token is "PLUS", the action is going to be reduce because "PLUS" has
+a lower precedence than "TIMES".
+
+When shift/reduce conflicts are resolved using the first three
+techniques (with the help of precedence rules), `yacc.py` will report no
+errors or conflicts in the grammar (although it will print some
+information in the `parser.out` debugging file).
+
+One problem with the precedence specifier technique is that it is
+sometimes necessary to change the precedence of an operator in certain
+contexts. For example, consider a unary-minus operator in \"3 + 4 \*
+-5\". Mathematically, the unary minus is normally given a very high
+precedence\--being evaluated before the multiply. However, in our
+precedence specifier, MINUS has a lower precedence than TIMES. To deal
+with this, precedence rules can be given for so-called \"fictitious
+tokens\" like this:
+
+ precedence = (
+ ('left', 'PLUS', 'MINUS'),
+ ('left', 'TIMES', 'DIVIDE'),
+ ('right', 'UMINUS'), # Unary minus operator
+ )
+
+Now, in the grammar file, we can write our unary minus rule like this:
+
+ def p_expr_uminus(p):
+ 'expression : MINUS expression %prec UMINUS'
+ p[0] = -p[2]
+
+In this case, `%prec UMINUS` overrides the default rule
+precedence\--setting it to that of UMINUS in the precedence specifier.
+
+At first, the use of UMINUS in this example may appear very confusing.
+UMINUS is not an input token or a grammar rule. Instead, you should
+think of it as the name of a special marker in the precedence table.
+When you use the `%prec` qualifier, you\'re telling yacc that you want
+the precedence of the expression to be the same as for this special
+marker instead of the usual precedence.
+
+It is also possible to specify non-associativity in the `precedence`
+table. This would be used when you *don\'t* want operations to chain
+together. For example, suppose you wanted to support comparison
+operators like `<` and `>` but you didn\'t want to allow combinations
+like `a < b < c`. To do this, specify a rule like this:
+
+ precedence = (
+ ('nonassoc', 'LESSTHAN', 'GREATERTHAN'), # Nonassociative operators
+ ('left', 'PLUS', 'MINUS'),
+ ('left', 'TIMES', 'DIVIDE'),
+ ('right', 'UMINUS'), # Unary minus operator
+ )
+
+If you do this, the occurrence of input text such as `a < b < c` will
+result in a syntax error. However, simple expressions such as `a < b`
+will still be fine.
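+
+For illustration, the grammar rule that uses these comparison tokens
+might look like the following sketch (the `LESSTHAN` and `GREATERTHAN`
+token names match the precedence table above; the rest of the grammar is
+assumed):
+
+    def p_expression_compare(p):
+        '''expression : expression LESSTHAN expression
+                      | expression GREATERTHAN expression'''
+        # With 'nonassoc' precedence, input such as a < b < c produces a
+        # syntax error instead of chaining the comparisons.
+        p[0] = ('compare', p[2], p[1], p[3])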
+
+Reduce/reduce conflicts are caused when there are multiple grammar rules
+that can be applied to the same sequence of symbols. This kind of conflict
+is almost always bad and is resolved by picking the rule that appears
+first in the grammar file. Reduce/reduce conflicts typically arise when
+different sets of grammar rules somehow generate the same set of symbols.
+For example:
+
+ assignment : ID EQUALS NUMBER
+ | ID EQUALS expression
+
+ expression : expression PLUS expression
+ | expression MINUS expression
+ | expression TIMES expression
+ | expression DIVIDE expression
+ | LPAREN expression RPAREN
+ | NUMBER
+
+In this case, a reduce/reduce conflict exists between these two rules:
+
+ assignment : ID EQUALS NUMBER
+ expression : NUMBER
+
+For example, if you wrote \"a = 5\", the parser can\'t figure out if
+this is supposed to be reduced as `assignment : ID EQUALS NUMBER` or
+whether it\'s supposed to reduce the 5 as an expression and then reduce
+the rule `assignment : ID EQUALS expression`.
+
+It should be noted that reduce/reduce conflicts are notoriously
+difficult to spot just by looking at the input grammar. When a reduce/reduce
+conflict occurs, `yacc()` will try to help by printing a warning message
+such as this:
+
+ WARNING: 1 reduce/reduce conflict
+ WARNING: reduce/reduce conflict in state 15 resolved using rule (assignment -> ID EQUALS NUMBER)
+ WARNING: rejected rule (expression -> NUMBER)
+
+This message identifies the two rules that are in conflict. However, it
+may not tell you how the parser arrived at such a state. To try and
+figure it out, you\'ll probably have to look at your grammar and the
+contents of the `parser.out` debugging file with an appropriately high
+level of caffeination.
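+
+In the example above, one way to eliminate the conflict is to drop the
+redundant rule and let numbers reach `assignment` only through
+`expression`. A sketch:
+
+    def p_assignment(p):
+        'assignment : ID EQUALS expression'
+        ...
+
+    # The separate rule 'assignment : ID EQUALS NUMBER' is dropped;
+    # a bare NUMBER is already reachable through 'expression : NUMBER'.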
+
+### The parser.out file
+
+Tracking down shift/reduce and reduce/reduce conflicts is one of the
+finer pleasures of using an LR parsing algorithm. To assist in
+debugging, `yacc.py` can create a debugging file called \'parser.out\'.
+To create this file, use `yacc.yacc(debug=True)`. The contents of this
+file look like the following:
+
+ Unused terminals:
+
+
+ Grammar
+
+ Rule 1 expression -> expression PLUS expression
+ Rule 2 expression -> expression MINUS expression
+ Rule 3 expression -> expression TIMES expression
+ Rule 4 expression -> expression DIVIDE expression
+ Rule 5 expression -> NUMBER
+ Rule 6 expression -> LPAREN expression RPAREN
+
+ Terminals, with rules where they appear
+
+ TIMES : 3
+ error :
+ MINUS : 2
+ RPAREN : 6
+ LPAREN : 6
+ DIVIDE : 4
+ PLUS : 1
+ NUMBER : 5
+
+ Nonterminals, with rules where they appear
+
+ expression : 1 1 2 2 3 3 4 4 6 0
+
+
+ Parsing method: LALR
+
+
+ state 0
+
+ S' -> . expression
+ expression -> . expression PLUS expression
+ expression -> . expression MINUS expression
+ expression -> . expression TIMES expression
+ expression -> . expression DIVIDE expression
+ expression -> . NUMBER
+ expression -> . LPAREN expression RPAREN
+
+ NUMBER shift and go to state 3
+ LPAREN shift and go to state 2
+
+
+ state 1
+
+ S' -> expression .
+ expression -> expression . PLUS expression
+ expression -> expression . MINUS expression
+ expression -> expression . TIMES expression
+ expression -> expression . DIVIDE expression
+
+ PLUS shift and go to state 6
+ MINUS shift and go to state 5
+ TIMES shift and go to state 4
+ DIVIDE shift and go to state 7
+
+
+ state 2
+
+ expression -> LPAREN . expression RPAREN
+ expression -> . expression PLUS expression
+ expression -> . expression MINUS expression
+ expression -> . expression TIMES expression
+ expression -> . expression DIVIDE expression
+ expression -> . NUMBER
+ expression -> . LPAREN expression RPAREN
+
+ NUMBER shift and go to state 3
+ LPAREN shift and go to state 2
+
+
+ state 3
+
+ expression -> NUMBER .
+
+ $ reduce using rule 5
+ PLUS reduce using rule 5
+ MINUS reduce using rule 5
+ TIMES reduce using rule 5
+ DIVIDE reduce using rule 5
+ RPAREN reduce using rule 5
+
+
+ state 4
+
+ expression -> expression TIMES . expression
+ expression -> . expression PLUS expression
+ expression -> . expression MINUS expression
+ expression -> . expression TIMES expression
+ expression -> . expression DIVIDE expression
+ expression -> . NUMBER
+ expression -> . LPAREN expression RPAREN
+
+ NUMBER shift and go to state 3
+ LPAREN shift and go to state 2
+
+
+ state 5
+
+ expression -> expression MINUS . expression
+ expression -> . expression PLUS expression
+ expression -> . expression MINUS expression
+ expression -> . expression TIMES expression
+ expression -> . expression DIVIDE expression
+ expression -> . NUMBER
+ expression -> . LPAREN expression RPAREN
+
+ NUMBER shift and go to state 3
+ LPAREN shift and go to state 2
+
+
+ state 6
+
+ expression -> expression PLUS . expression
+ expression -> . expression PLUS expression
+ expression -> . expression MINUS expression
+ expression -> . expression TIMES expression
+ expression -> . expression DIVIDE expression
+ expression -> . NUMBER
+ expression -> . LPAREN expression RPAREN
+
+ NUMBER shift and go to state 3
+ LPAREN shift and go to state 2
+
+
+ state 7
+
+ expression -> expression DIVIDE . expression
+ expression -> . expression PLUS expression
+ expression -> . expression MINUS expression
+ expression -> . expression TIMES expression
+ expression -> . expression DIVIDE expression
+ expression -> . NUMBER
+ expression -> . LPAREN expression RPAREN
+
+ NUMBER shift and go to state 3
+ LPAREN shift and go to state 2
+
+
+ state 8
+
+ expression -> LPAREN expression . RPAREN
+ expression -> expression . PLUS expression
+ expression -> expression . MINUS expression
+ expression -> expression . TIMES expression
+ expression -> expression . DIVIDE expression
+
+ RPAREN shift and go to state 13
+ PLUS shift and go to state 6
+ MINUS shift and go to state 5
+ TIMES shift and go to state 4
+ DIVIDE shift and go to state 7
+
+
+ state 9
+
+ expression -> expression TIMES expression .
+ expression -> expression . PLUS expression
+ expression -> expression . MINUS expression
+ expression -> expression . TIMES expression
+ expression -> expression . DIVIDE expression
+
+ $ reduce using rule 3
+ PLUS reduce using rule 3
+ MINUS reduce using rule 3
+ TIMES reduce using rule 3
+ DIVIDE reduce using rule 3
+ RPAREN reduce using rule 3
+
+ ! PLUS [ shift and go to state 6 ]
+ ! MINUS [ shift and go to state 5 ]
+ ! TIMES [ shift and go to state 4 ]
+ ! DIVIDE [ shift and go to state 7 ]
+
+ state 10
+
+ expression -> expression MINUS expression .
+ expression -> expression . PLUS expression
+ expression -> expression . MINUS expression
+ expression -> expression . TIMES expression
+ expression -> expression . DIVIDE expression
+
+ $ reduce using rule 2
+ PLUS reduce using rule 2
+ MINUS reduce using rule 2
+ RPAREN reduce using rule 2
+ TIMES shift and go to state 4
+ DIVIDE shift and go to state 7
+
+ ! TIMES [ reduce using rule 2 ]
+ ! DIVIDE [ reduce using rule 2 ]
+ ! PLUS [ shift and go to state 6 ]
+ ! MINUS [ shift and go to state 5 ]
+
+ state 11
+
+ expression -> expression PLUS expression .
+ expression -> expression . PLUS expression
+ expression -> expression . MINUS expression
+ expression -> expression . TIMES expression
+ expression -> expression . DIVIDE expression
+
+ $ reduce using rule 1
+ PLUS reduce using rule 1
+ MINUS reduce using rule 1
+ RPAREN reduce using rule 1
+ TIMES shift and go to state 4
+ DIVIDE shift and go to state 7
+
+ ! TIMES [ reduce using rule 1 ]
+ ! DIVIDE [ reduce using rule 1 ]
+ ! PLUS [ shift and go to state 6 ]
+ ! MINUS [ shift and go to state 5 ]
+
+ state 12
+
+ expression -> expression DIVIDE expression .
+ expression -> expression . PLUS expression
+ expression -> expression . MINUS expression
+ expression -> expression . TIMES expression
+ expression -> expression . DIVIDE expression
+
+ $ reduce using rule 4
+ PLUS reduce using rule 4
+ MINUS reduce using rule 4
+ TIMES reduce using rule 4
+ DIVIDE reduce using rule 4
+ RPAREN reduce using rule 4
+
+ ! PLUS [ shift and go to state 6 ]
+ ! MINUS [ shift and go to state 5 ]
+ ! TIMES [ shift and go to state 4 ]
+ ! DIVIDE [ shift and go to state 7 ]
+
+ state 13
+
+ expression -> LPAREN expression RPAREN .
+
+ $ reduce using rule 6
+ PLUS reduce using rule 6
+ MINUS reduce using rule 6
+ TIMES reduce using rule 6
+ DIVIDE reduce using rule 6
+ RPAREN reduce using rule 6
+
+The different states that appear in this file are a representation of
+every possible sequence of valid input tokens allowed by the grammar.
+When receiving input tokens, the parser is building up a stack and
+looking for matching rules. Each state keeps track of the grammar rules
+that might be in the process of being matched at that point. Within each
+rule, the \".\" character indicates the current location of the parse
+within that rule. In addition, the actions for each valid input token
+are listed. When a shift/reduce or reduce/reduce conflict arises, rules
+*not* selected are prefixed with an !. For example:
+
+ ! TIMES [ reduce using rule 2 ]
+ ! DIVIDE [ reduce using rule 2 ]
+ ! PLUS [ shift and go to state 6 ]
+ ! MINUS [ shift and go to state 5 ]
+
+By looking at these rules (and with a little practice), you can usually
+track down the source of most parsing conflicts. It should also be
+stressed that not all shift-reduce conflicts are bad. However, the only
+way to be sure that they are resolved correctly is to look at
+`parser.out`.
+
+### Syntax Error Handling
+
+If you are creating a parser for production use, the handling of syntax
+errors is important. As a general rule, you don\'t want a parser to
+throw up its hands and stop at the first sign of trouble. Instead, you
+want it to report the error, recover if possible, and continue parsing
+so that all of the errors in the input get reported to the user at once.
+This is the standard behavior found in compilers for languages such as
+C, C++, and Java.
+
+In PLY, when a syntax error occurs during parsing, the error is
+immediately detected (i.e., the parser does not read any more tokens
+beyond the source of the error). However, at this point, the parser
+enters a recovery mode that can be used to try and continue further
+parsing. As a general rule, error recovery in LR parsers is a delicate
+topic that involves ancient rituals and black magic. The recovery
+mechanism provided by `yacc.py` is comparable to that of Unix yacc, so you
+may want to consult a book like O\'Reilly\'s \"Lex and Yacc\" for some of the
+finer details.
+
+When a syntax error occurs, `yacc.py` performs the following steps:
+
+1. On the first occurrence of an error, the user-defined `p_error()`
+ function is called with the offending token as an argument. However,
+ if the syntax error is due to reaching the end-of-file, `p_error()`
+ is called with an argument of `None`. Afterwards, the parser enters
+ an \"error-recovery\" mode in which it will not make future calls to
+ `p_error()` until it has successfully shifted at least 3 tokens onto
+ the parsing stack.
+2. If no recovery action is taken in `p_error()`, the offending
+ lookahead token is replaced with a special `error` token.
+3. If the offending lookahead token is already set to `error`, the top
+ item of the parsing stack is deleted.
+4. If the entire parsing stack is unwound, the parser enters a restart
+ state and attempts to start parsing from its initial state.
+5. If a grammar rule accepts `error` as a token, it will be shifted
+ onto the parsing stack.
+6. If the top item of the parsing stack is `error`, lookahead tokens
+ will be discarded until the parser can successfully shift a new
+ symbol or reduce a rule involving `error`.
+
+#### Recovery and resynchronization with error rules
+
+The most well-behaved approach for handling syntax errors is to write
+grammar rules that include the `error` token. For example, suppose your
+language had a grammar rule for a print statement like this:
+
+ def p_statement_print(p):
+ 'statement : PRINT expr SEMI'
+ ...
+
+To account for the possibility of a bad expression, you might write an
+additional grammar rule like this:
+
+ def p_statement_print_error(p):
+ 'statement : PRINT error SEMI'
+ print("Syntax error in print statement. Bad expression")
+
+In this case, the `error` token will match any sequence of tokens that
+might appear up to the first semicolon that is encountered. Once the
+semicolon is reached, the rule will be invoked and the `error` token
+will go away.
+
+This type of recovery is sometimes known as parser resynchronization.
+The `error` token acts as a wildcard for any bad input text and the
+token immediately following `error` acts as a synchronization token.
+
+It is important to note that the `error` token usually should not appear
+as the last token on the right in an error rule. For example, a rule like
+this is problematic:
+
+ def p_statement_print_error(p):
+ 'statement : PRINT error'
+ print("Syntax error in print statement. Bad expression")
+
+This is because the first bad token encountered will cause the rule to
+be reduced\--which may make it difficult to recover if more bad tokens
+immediately follow.
+
+#### Panic mode recovery
+
+An alternative error recovery scheme is to enter a panic mode recovery
+in which tokens are discarded to a point where the parser might be able
+to recover in some sensible manner.
+
+Panic mode recovery is implemented entirely in the `p_error()` function.
+For example, this function starts discarding tokens until it reaches a
+closing \'}\'. Then, it restarts the parser in its initial state:
+
+ def p_error(p):
+ print("Whoa. You are seriously hosed.")
+ if not p:
+ print("End of File!")
+ return
+
+ # Read ahead looking for a closing '}'
+ while True:
+ tok = parser.token() # Get the next token
+ if not tok or tok.type == 'RBRACE':
+ break
+ parser.restart()
+
+Alternatively, this function simply discards the bad token and tells the
+parser that the error was okay:
+
+ def p_error(p):
+ if p:
+ print("Syntax error at token", p.type)
+ # Just discard the token and tell the parser it's okay.
+ parser.errok()
+ else:
+ print("Syntax error at EOF")
+
+More information on these methods is as follows:
+
+`parser.errok()`
+
+: This resets the parser state so it doesn\'t think it\'s in
+ error-recovery mode. This will prevent an `error` token from being
+ generated and will reset the internal error counters so that the
+ next syntax error will call `p_error()` again.
+
+`parser.token()`
+
+: This returns the next token on the input stream.
+
+`parser.restart()`
+
+: This discards the entire parsing stack and resets the parser to its
+ initial state.
+
+To supply the next lookahead token to the parser, `p_error()` can return
+a token. This might be useful if trying to synchronize on special
+characters. For example:
+
+ def p_error(p):
+ # Read ahead looking for a terminating ";"
+ while True:
+ tok = parser.token() # Get the next token
+ if not tok or tok.type == 'SEMI': break
+ parser.errok()
+
+ # Return SEMI to the parser as the next lookahead token
+ return tok
+
+Keep in mind that in the above error handling functions, `parser` is an
+instance of the parser created by `yacc()`. You\'ll need to save this
+instance someplace in your code so that you can refer to it during error
+handling.
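+
+For example, one common arrangement (a sketch) is to build the parser at
+module level once all of the rules, including `p_error()`, have been
+defined, and to refer to that global from within `p_error()`:
+
+    import ply.yacc as yacc
+
+    def p_error(p):
+        if p:
+            print("Syntax error at token", p.type)
+            parser.errok()       # 'parser' is the module-level instance below
+        else:
+            print("Syntax error at EOF")
+
+    # Built after all grammar rules are defined; p_error() finds it by
+    # name at call time.
+    parser = yacc.yacc()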
+
+#### Signalling an error from a production
+
+If necessary, a production rule can manually force the parser to enter
+error recovery. This is done by raising the `SyntaxError` exception like
+this:
+
+ def p_production(p):
+ 'production : some production ...'
+ raise SyntaxError
+
+The effect of raising `SyntaxError` is the same as if the last symbol
+shifted onto the parsing stack was actually a syntax error. Thus, when
+you do this, the last symbol shifted is popped off of the parsing stack
+and the current lookahead token is set to an `error` token. The parser
+then enters error-recovery mode where it tries to reduce rules that can
+accept `error` tokens. The steps that follow from this point are exactly
+the same as if a syntax error were detected and `p_error()` were called.
+
+One important aspect of manually setting an error is that the
+`p_error()` function will NOT be called in this case. If you need to
+issue an error message, make sure you do it in the production that
+raises `SyntaxError`.
+
+Note: This feature of PLY is meant to mimic the behavior of the YYERROR
+macro in yacc.
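+
+For instance, a production might perform a semantic check and force error
+recovery when the check fails. A sketch (the rule, the `exprlist`
+nonterminal, and the `known_functions` table are hypothetical and not part
+of PLY):
+
+    def p_funccall(p):
+        'funccall : ID LPAREN exprlist RPAREN'
+        if p[1] not in known_functions:     # hypothetical lookup table
+            print("Unknown function %r at line %d" % (p[1], p.lineno(1)))
+            raise SyntaxError     # enter error recovery; p_error() is not called
+        p[0] = ('call', p[1], p[3])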
+
+#### When Do Syntax Errors Get Reported?
+
+In most cases, yacc will handle errors as soon as a bad input token is
+detected on the input. However, be aware that yacc may choose to delay
+error handling until after it has reduced one or more grammar rules
+first. This behavior might be unexpected, but it\'s related to special
+states in the underlying parsing table known as \"defaulted states.\" A
+defaulted state is a parsing condition where the same grammar rule will be
+reduced regardless of what *valid* token comes next on the input. For
+such states, yacc chooses to go ahead and reduce the grammar rule
+*without reading the next input token*. If the next token is bad, yacc
+will eventually get around to reading it and report a syntax error.
+It\'s just a little unusual in that you might see some of your grammar
+rules firing immediately prior to the syntax error.
+
+Usually, the delayed error reporting with defaulted states is harmless
+(and there are other reasons for wanting PLY to behave in this way).
+However, if you need to turn this behavior off for some reason, you can
+clear the defaulted states table like this:
+
+ parser = yacc.yacc()
+ parser.defaulted_states = {}
+
+Disabling defaulted states is not recommended if your grammar makes use
+of embedded actions as described in the Embedded Actions section below.
+
+#### General comments on error handling
+
+For normal types of languages, error recovery with error rules and
+resynchronization characters is probably the most reliable technique.
+This is because you can instrument the grammar to catch errors at
+selected places where it is relatively easy to recover and continue
+parsing. Panic mode recovery is really only useful in certain
+specialized applications where you might want to discard huge portions
+of the input text to find a valid restart point.
+
+### Line Number and Position Tracking
+
+Position tracking is often a tricky problem when writing compilers. By
+default, PLY tracks the line number and position of all tokens. This
+information is available using the following functions:
+
+`p.lineno(num)`. Return the line number for symbol *num*
+
+`p.lexpos(num)`. Return the lexing position for symbol *num*
+
+For example:
+
+ def p_expression(p):
+ 'expression : expression PLUS expression'
+ line = p.lineno(2) # line number of the PLUS token
+ index = p.lexpos(2) # Position of the PLUS token
+
+As an optional feature, `yacc.py` can automatically track line numbers
+and positions for all of the grammar symbols as well. However, this
+extra tracking requires extra processing and can significantly slow down
+parsing. Therefore, it must be enabled by passing the `tracking=True`
+option to `yacc.parse()`. For example:
+
+ yacc.parse(data,tracking=True)
+
+Once enabled, the `lineno()` and `lexpos()` methods work for all grammar
+symbols. In addition, two additional methods can be used:
+
+`p.linespan(num)`. Return a tuple (startline,endline) with the starting
+and ending line number for symbol *num*.
+
+`p.lexspan(num)`. Return a tuple (start,end) with the starting and
+ending positions for symbol *num*.
+
+For example:
+
+ def p_expression(p):
+ 'expression : expression PLUS expression'
+ p.lineno(1) # Line number of the left expression
+ p.lineno(2) # line number of the PLUS operator
+ p.lineno(3) # line number of the right expression
+ ...
+ start,end = p.linespan(3) # Start,end lines of the right expression
+ starti,endi = p.lexspan(3) # Start,end positions of right expression
+
+Note: The `lexspan()` function only returns the range of values up to
+the start of the last grammar symbol.
+
+Although it may be convenient for PLY to track position information on
+all grammar symbols, this is often unnecessary. For example, if you are
+merely using line number information in an error message, you can often
+just key off of a specific token in the grammar rule. For example:
+
+ def p_bad_func(p):
+ 'funccall : fname LPAREN error RPAREN'
+ # Line number reported from LPAREN token
+ print("Bad function call at line", p.lineno(2))
+
+Similarly, you may get better parsing performance if you only
+selectively propagate line number information where it\'s needed using
+the `p.set_lineno()` method. For example:
+
+ def p_fname(p):
+ 'fname : ID'
+ p[0] = p[1]
+ p.set_lineno(0,p.lineno(1))
+
+PLY doesn\'t retain line number information from rules that have already
+been parsed. If you are building an abstract syntax tree and need to
+have line numbers, you should make sure that the line numbers appear in
+the tree itself.
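+
+For example, a sketch that records the line number of the `PLUS` token in
+each tuple node at the moment the rule fires:
+
+    def p_expression_binop(p):
+        'expression : expression PLUS expression'
+        # Record the line number of the PLUS token directly in the AST node
+        p[0] = ('binop', '+', p[1], p[3], p.lineno(2))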
+
+### AST Construction
+
+`yacc.py` provides no special functions for constructing an abstract
+syntax tree. However, such construction is easy enough to do on your
+own.
+
+A minimal way to construct a tree is to create and propagate a tuple or
+list in each grammar rule function. There are many possible ways to do
+this, but one example would be something like this:
+
+ def p_expression_binop(p):
+ '''expression : expression PLUS expression
+ | expression MINUS expression
+ | expression TIMES expression
+ | expression DIVIDE expression'''
+
+ p[0] = ('binary-expression',p[2],p[1],p[3])
+
+ def p_expression_group(p):
+ 'expression : LPAREN expression RPAREN'
+ p[0] = ('group-expression',p[2])
+
+ def p_expression_number(p):
+ 'expression : NUMBER'
+ p[0] = ('number-expression',p[1])
+
+Another approach is to create a set of data structures for different
+kinds of abstract syntax tree nodes and assign nodes to `p[0]` in each
+rule. For example:
+
+ class Expr: pass
+
+ class BinOp(Expr):
+ def __init__(self,left,op,right):
+ self.left = left
+ self.right = right
+ self.op = op
+
+ class Number(Expr):
+ def __init__(self,value):
+ self.value = value
+
+ def p_expression_binop(p):
+ '''expression : expression PLUS expression
+ | expression MINUS expression
+ | expression TIMES expression
+ | expression DIVIDE expression'''
+
+ p[0] = BinOp(p[1],p[2],p[3])
+
+ def p_expression_group(p):
+ 'expression : LPAREN expression RPAREN'
+ p[0] = p[2]
+
+ def p_expression_number(p):
+ 'expression : NUMBER'
+ p[0] = Number(p[1])
+
+The advantage to this approach is that it may make it easier to attach
+more complicated semantics, type checking, code generation, and other
+features to the node classes.
+
+To simplify tree traversal, it may make sense to pick a very generic
+tree structure for your parse tree nodes. For example:
+
+ class Node:
+ def __init__(self,type,children=None,leaf=None):
+ self.type = type
+ if children:
+ self.children = children
+ else:
+ self.children = [ ]
+ self.leaf = leaf
+
+ def p_expression_binop(p):
+ '''expression : expression PLUS expression
+ | expression MINUS expression
+ | expression TIMES expression
+ | expression DIVIDE expression'''
+
+ p[0] = Node("binop", [p[1],p[3]], p[2])
+
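+A uniform node structure like this also makes generic traversal
+straightforward. For example, a simple preorder walk might look like the
+following sketch:
+
+    def preorder(node, indent=0):
+        # Print the type and leaf value of this node, then recurse into children
+        print(' ' * indent + node.type, node.leaf if node.leaf is not None else '')
+        for child in node.children:
+            if isinstance(child, Node):
+                preorder(child, indent + 4)
+            else:
+                print(' ' * (indent + 4) + repr(child))   # raw leaf value
+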
+### Embedded Actions
+
+The parsing technique used by yacc only allows actions to be executed at
+the end of a rule. For example, suppose you have a rule like this:
+
+ def p_foo(p):
+ "foo : A B C D"
+ print("Parsed a foo", p[1],p[2],p[3],p[4])
+
+In this case, the supplied action code only executes after all of the
+symbols `A`, `B`, `C`, and `D` have been parsed. Sometimes, however, it
+is useful to execute small code fragments during intermediate stages of
+parsing. For example, suppose you wanted to perform some action
+immediately after `A` has been parsed. To do this, write an empty rule
+like this:
+
+ def p_foo(p):
+ "foo : A seen_A B C D"
+ print("Parsed a foo", p[1],p[3],p[4],p[5])
+ print("seen_A returned", p[2])
+
+ def p_seen_A(p):
+ "seen_A :"
+ print("Saw an A = ", p[-1]) # Access grammar symbol to left
+ p[0] = some_value # Assign value to seen_A
+
+In this example, the empty `seen_A` rule executes immediately after `A`
+is shifted onto the parsing stack. Within this rule, `p[-1]` refers to
+the symbol on the stack that appears immediately to the left of the
+`seen_A` symbol. In this case, it would be the value of `A` in the `foo`
+rule immediately above. Like other rules, a value can be returned from
+an embedded action by assigning it to `p[0]`.
+
+The use of embedded actions can sometimes introduce extra shift/reduce
+conflicts. For example, this grammar has no conflicts:
+
+ def p_foo(p):
+ """foo : abcd
+ | abcx"""
+
+ def p_abcd(p):
+ "abcd : A B C D"
+
+ def p_abcx(p):
+ "abcx : A B C X"
+
+However, if you insert an embedded action into one of the rules like
+this:
+
+ def p_foo(p):
+ """foo : abcd
+ | abcx"""
+
+ def p_abcd(p):
+ "abcd : A B C D"
+
+ def p_abcx(p):
+ "abcx : A B seen_AB C X"
+
+ def p_seen_AB(p):
+ "seen_AB :"
+
+an extra shift-reduce conflict will be introduced. This conflict is
+caused by the fact that the same symbol `C` appears next in both the
+`abcd` and `abcx` rules. The parser can either shift the symbol (`abcd`
+rule) or reduce the empty rule `seen_AB` (`abcx` rule).
+
+A common use of embedded rules is to control other aspects of parsing
+such as scoping of local variables. For example, if you were parsing C
+code, you might write code like this:
+
+ def p_statements_block(p):
+        "statements : LBRACE new_scope statements RBRACE"
+ # Action code
+ ...
+ pop_scope() # Return to previous scope
+
+ def p_new_scope(p):
+ "new_scope :"
+ # Create a new scope for local variables
+ s = new_scope()
+ push_scope(s)
+ ...
+
+In this case, the embedded action `new_scope` executes immediately after
+a `LBRACE` (`{`) symbol is parsed. This might adjust internal symbol
+tables and other aspects of the parser. Upon completion of the rule in
+`p_statements_block()`, the action code might undo the operations performed
+in the embedded action (e.g., `pop_scope()`).
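+
+The scope-management functions used above (`new_scope()`, `push_scope()`,
+and `pop_scope()`) are not part of PLY. A minimal sketch might maintain a
+stack of dictionaries:
+
+    _scope_stack = [{}]                 # global scope at the bottom
+
+    def new_scope():
+        return {}
+
+    def push_scope(s):
+        _scope_stack.append(s)
+
+    def pop_scope():
+        return _scope_stack.pop()
+
+    def declare(name, value):
+        _scope_stack[-1][name] = value  # declare in the innermost scope
+
+    def lookup(name):
+        for scope in reversed(_scope_stack):
+            if name in scope:
+                return scope[name]
+        raise KeyError("undeclared name %r" % name)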
+
+### Miscellaneous Yacc Notes
+
+1. By default, `yacc.py` relies on `lex.py` for tokenizing. However, an
+ alternative tokenizer can be supplied as follows:
+
+        parser.parse(lexer=x)
+
+    In this case, `x` must be a Lexer object that minimally has an
+    `x.token()` method for retrieving the next token. If an input string
+    is given to `parse()`, the lexer must also have an `x.input()` method.
+    A minimal hand-written tokenizer that satisfies this interface is
+    sketched after this list.
+
+2. To print copious amounts of debugging during parsing, use:
+
+ parser.parse(input_text, debug=True)
+
+3. Since LR parsing is driven by tables, the performance of the parser
+ is largely independent of the size of the grammar. The biggest
+ bottlenecks will be the lexer and the complexity of the code in your
+ grammar rules.
+
+4. `yacc()` also allows parsers to be defined as classes and as
+ closures (see the section on alternative specification of lexers).
+ However, be aware that only one parser may be defined in a single
+ module (source file). There are various error checks and validation
+ steps that may issue confusing error messages if you try to define
+ multiple parsers in the same source file.
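+
+As mentioned in the first note above, any object with a suitable `token()`
+method can drive the parser. A minimal hand-written tokenizer might look
+like this sketch (the token names and the `Tok` class are assumptions, not
+part of PLY):
+
+    class Tok:
+        def __init__(self, type, value, lineno=0, lexpos=0):
+            self.type, self.value = type, value
+            self.lineno, self.lexpos = lineno, lexpos
+
+    class MyLexer:
+        def input(self, text):
+            # Trivial whitespace-based splitting; a real tokenizer does more
+            self._words = iter(text.split())
+
+        def token(self):
+            try:
+                word = next(self._words)
+            except StopIteration:
+                return None                   # None signals end of input
+            if word.isdigit():
+                return Tok('NUMBER', int(word))
+            elif word == '+':
+                return Tok('PLUS', word)
+            raise SyntaxError("Unknown token %r" % word)
+
+    result = parser.parse("3 + 4", lexer=MyLexer())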
+
+## Multiple Parsers and Lexers
+
+In advanced parsing applications, you may want to have multiple parsers
+and lexers.
+
+As a general rule, this isn\'t a problem. However, to make it work, you
+need to carefully make sure everything gets hooked up correctly. First,
+make sure you save the objects returned by `lex()` and `yacc()`. For
+example:
+
+ lexer = lex.lex() # Return lexer object
+ parser = yacc.yacc() # Return parser object
+
+Next, when parsing, make sure you give the `parse()` function a
+reference to the lexer it should be using. For example:
+
+ parser.parse(text,lexer=lexer)
+
+If you forget to do this, the parser will use the last lexer
+created\--which is not always what you want.
+
+Within lexer and parser rule functions, these objects are also
+available. In the lexer, the \"lexer\" attribute of a token refers to
+the lexer object that triggered the rule. For example:
+
+ def t_NUMBER(t):
+ r'\d+'
+ ...
+        print(t.lexer)          # Show lexer object
+
+In the parser, the \"lexer\" and \"parser\" attributes refer to the
+lexer and parser objects respectively:
+
+ def p_expr_plus(p):
+ 'expr : expr PLUS expr'
+ ...
+ print(p.parser) # Show parser object
+ print(p.lexer) # Show lexer object
+
+If necessary, arbitrary attributes can be attached to the lexer or
+parser object. For example, if you wanted to have different parsing
+modes, you could attach a mode attribute to the parser object and look
+at it later.
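+
+For example, a sketch where a user-defined `mode` attribute switches on
+extra checking (the attribute name and the check are assumptions; PLY
+itself ignores them):
+
+    def p_statement(p):
+        'statement : expr SEMI'
+        # The attribute attached below is reachable through p.parser
+        if getattr(p.parser, 'mode', None) == 'strict':
+            ...                             # perform extra checking in strict mode
+        p[0] = p[1]
+
+    parser = yacc.yacc()
+    parser.mode = 'strict'                  # purely user-defined; ignored by PLY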
+
+## Advanced Debugging
+
+Debugging a compiler is typically not an easy task. PLY provides some
+diagnostic capabilities through the use of Python\'s `logging` module.
+The next two sections describe this:
+
+### Debugging the lex() and yacc() commands
+
+Both the `lex()` and `yacc()` commands have a debugging mode that can be
+enabled using the `debug` flag. For example:
+
+ lex.lex(debug=True)
+ yacc.yacc(debug=True)
+
+Normally, the output produced by debugging is routed to either standard
+error or, in the case of `yacc()`, to a file `parser.out`. This output
+can be more carefully controlled by supplying a logging object. Here is
+an example that adds information about where different debugging
+messages are coming from:
+
+ # Set up a logging object
+ import logging
+ logging.basicConfig(
+ level = logging.DEBUG,
+ filename = "parselog.txt",
+ filemode = "w",
+ format = "%(filename)10s:%(lineno)4d:%(message)s"
+ )
+ log = logging.getLogger()
+
+ lex.lex(debug=True,debuglog=log)
+ yacc.yacc(debug=True,debuglog=log)
+
+If you supply a custom logger, the amount of debugging information
+produced can be controlled by setting the logging level. Typically,
+debugging messages are either issued at the `DEBUG`, `INFO`, or
+`WARNING` levels.
+
+PLY\'s error messages and warnings are also produced using the logging
+interface. This can be controlled by passing a logging object using the
+`errorlog` parameter:
+
+ lex.lex(errorlog=log)
+ yacc.yacc(errorlog=log)
+
+If you want to completely silence warnings, you can either pass in a
+logging object with an appropriate filter level or use the `NullLogger`
+object defined in either `lex` or `yacc`. For example:
+
+ yacc.yacc(errorlog=yacc.NullLogger())
+
+### Run-time Debugging
+
+To enable run-time debugging of a parser, use the `debug` option to
+parse. This option can either be an integer (which turns debugging on or
+off) or an instance of a logger object. For example:
+
+ log = logging.getLogger()
+ parser.parse(input,debug=log)
+
+If a logging object is passed, you can use its filtering level to
+control how much output gets generated. The `INFO` level is used to
+produce information about rule reductions. The `DEBUG` level will show
+information about the parsing stack, token shifts, and other details.
+The `ERROR` level shows information related to parsing errors.
+
+For very complicated problems, you should pass in a logging object that
+redirects to a file where you can more easily inspect the output after
+execution.
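+
+For example, a sketch that sends the run-time trace to a dedicated file
+(the logger and file names here are arbitrary):
+
+    import logging
+
+    tracelog = logging.getLogger('parsetrace')
+    tracelog.setLevel(logging.DEBUG)
+    tracelog.addHandler(logging.FileHandler("parsetrace.txt", mode="w"))
+
+    result = parser.parse(data, debug=tracelog)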
+
+## Using Python -OO Mode
+
+Because of PLY\'s reliance on docstrings, it is not compatible with
+the `-OO` mode of the interpreter (which strips docstrings). If
+you want to support this, you\'ll need to write a decorator or some
+other tool to attach docstrings to functions. For example:
+
+ def _(doc):
+ def decorate(func):
+ func.__doc__ = doc
+ return func
+ return decorate
+
+ @_("assignment : expr PLUS expr")
+ def p_assignment(p):
+ ...
+
+PLY does not provide such a decorator by default.
+
+## Where to go from here?
+
+The `examples` directory of the PLY distribution contains several simple
+examples. Please consult a compilers textbook for the theory and
+underlying implementation details of LR parsing.