summaryrefslogtreecommitdiff
path: root/doc/ragel/ragel-guide.txt
diff options
context:
space:
mode:
Diffstat (limited to 'doc/ragel/ragel-guide.txt')
-rw-r--r--doc/ragel/ragel-guide.txt3603
1 files changed, 0 insertions, 3603 deletions
diff --git a/doc/ragel/ragel-guide.txt b/doc/ragel/ragel-guide.txt
deleted file mode 100644
index 592c3f2c..00000000
--- a/doc/ragel/ragel-guide.txt
+++ /dev/null
@@ -1,3603 +0,0 @@
-= Ragel State Machine Compiler User Guide
-Adrian Thurston <thurston@colm.net>
-Ragel Version 7.0
-:toc:
-:toclevels: 3
-:numbered:
-
-.License
-
- Permission is hereby granted, free of charge, to any person obtaining a copy
- of this software and associated documentation files (the "Software"), to
- deal in the Software without restriction, including without limitation the
- rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
- sell copies of the Software, and to permit persons to whom the Software is
- furnished to do so, subject to the following conditions:
-
- The above copyright notice and this permission notice shall be included in all
- copies or substantial portions of the Software.
-
- THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
- IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
- FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
- AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
- LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
- OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
- SOFTWARE.
-
-== Introduction
-
-=== Abstract
-
-In today's computing landscape, regular expressions are used heavily for the
-purpose of specifying parsers. They are normally treated as black boxes, linked
-together with program logic. User code is executed in between invocations of
-the regular expression engine. To add code before a pattern terminates, the
-programmer is required to break patterns and paste them back together with
-program logic. The more inline code needed, the less the advantages of regular
-expressions are seen.
-
-Ragel is a software development tool that allows user code to be embedded into
-the transitions of a regular expression's corresponding state machine,
-eliminating the need to switch from the regular expression engine to the user
-code execution environment, and then back again. As a result, expressions can
-be maximally continuous. One is free to specify an entire parser using a single
-regular expression. The single-expression model affords concise and elegant
-descriptions of languages and the generation of very simple, fast and robust
-code. Ragel compiles executable finite state machines from a high level regular
-language notation. Ragel targets C, C++, Objective-C, D, Go, GNU ASM x86-64,
-Java, Ruby, C#, OCaml, Crack, Rust, Julia and JavaScript
-
-In addition to building state machines from regular expressions, Ragel allows
-the programmer to directly specify state machines with state charts. These two
-notations may be freely combined. There are also facilities for controlling
-nondeterminism in the resulting machines and building scanners using patterns
-that themselves have embedded actions. Ragel can produce code that is small and
-runs very fast. Ragel can handle integer-sized alphabets and can compile very
-large state machines.
-
-=== Motivation
-
-When a programmer is faced with the task of producing a parser for a
-context-free language, there are many tools to choose from. It is quite common
-to generate useful and efficient parsers for programming languages from a
-formal grammar. It is also quite common for programmers to avoid such tools
-when making parsers for simple computer languages, such as file formats and
-communication protocols. Such languages are often regular, and tools for
-processing the context-free languages are viewed as too heavyweight for the
-purpose of parsing regular languages. The extra run-time effort required for
-supporting the recursive nature of context-free languages is wasted.
-
-When we turn to the regular expression-based parsing tools, such as Lex, Re2C,
-and scripting languages such as Sed, Awk and Perl we find that they are split
-into two levels: a regular expression matching engine and some kind of program
-logic for linking patterns together. For example, a Lex program is composed of
-sets of regular expressions. The implied program logic repeatedly attempts to
-match a pattern in the current set. When a match is found, the associated user
-code executed. It requires the user to consider a language as a sequence of
-independent tokens. Scripting languages and regular expression libraries allow
-one to link patterns together using arbitrary program code. This is very
-flexible and powerful; however, we can be more concise and clear if we avoid
-gluing together regular expressions with if statements and while loops.
-
-This model of execution, where the runtime alternates between regular
-expression matching and user code execution places restrictions on when
-action code may be executed. Since action code can only be associated with
-complete patterns, any action code that must be executed before an entire
-pattern is matched requires that the pattern be broken into smaller units.
-Instead of being forced to disrupt the regular expression syntax and write
-smaller expressions, it is desirable to retain a single expression and embed
-code for performing actions directly into the transitions that move over the
-characters. After all, capable programmers are astutely aware of the machinery
-underlying their programs, so why not provide them with access to that
-machinery? To achieve this, we require an action execution model for associating
-code with the sub-expressions of a regular expression in a way that does not
-disrupt its syntax.
-
-The primary goal of Ragel is to provide developers with an ability to embed
-actions into the transitions and states of a regular expression's state machine
-in support of the definition of entire parsers or large sections of parsers
-using a single regular expression. From the regular expression we gain a clear
-and concise statement of our language. From the state machine we obtain a very
-fast and robust executable that lends itself to many kinds of analysis and
-visualization.
-
-=== Overview
-
-Ragel is a language for specifying state machines. The Ragel program is a
-compiler that assembles a state machine definition to executable code. Ragel
-is based on the principle that any regular language can be converted to a
-deterministic finite state automaton. Since every regular language has a state
-machine representation and vice versa, the terms regular language and state
-machine (or just machine) will be used interchangeably in this document.
-
-Ragel outputs machines to C, C++, Objective-C, D, Go, GNU ASM x86-64, Java,
-Ruby, C#, OCaml, Crack, Rust, Julia and JavaScript code. The output is
-designed to be generic and is not bound to any particular input or processing
-method. A Ragel machine expects to have data passed to it in buffer blocks.
-When there is no more input, the machine can be queried for acceptance. In
-this way, a Ragel machine can be used to simply recognize a regular language
-like a regular expression library. By embedding code into the regular language,
-a Ragel machine can also be used to parse input.
-
-The Ragel language has many operators for constructing and manipulating
-machines. Machines are built up from smaller machines, to bigger ones, to the
-final machine representing the language that needs to be recognized or parsed.
-
-The core state machine construction operators are those found in most theory
-of computation textbooks. They date back to the 1950s and are widely studied.
-They are based on set operations and permit one to think of languages as a set
-of strings. They are Union, Intersection, Difference, Concatenation and Kleene
-Star. Put together, these operators make up what most people know as regular
-expressions. Ragel also provides a scanner construction operator
-and provides operators for explicitly constructing machines
-using a state chart method. In the state chart method, one joins machines
-together without any implied transitions and then explicitly specifies where
-epsilon transitions should be drawn.
-
-The state machine manipulation operators are specific to Ragel. They allow the
-programmer to access the states and transitions of regular language's
-corresponding machine. There are two uses of the manipulation operators. The
-first and primary use is to embed code into transitions and states, allowing
-the programmer to specify the actions of the state machine.
-
-Ragel attempts to make the action embedding facility as intuitive as possible.
-To do so, a number of issues need to be addressed. For example, when making a
-nondeterministic specification into a DFA using machines that have embedded
-actions, new transitions are often made that have the combined actions of
-several source transitions. Ragel ensures that multiple actions associated with
-a single transition are ordered consistently with respect to the order of
-reference and the natural ordering implied by the construction operators.
-
-The second use of the manipulation operators is to assign priorities to
-transitions. Priorities provide a convenient way of controlling any
-nondeterminism introduced by the construction operators. Suppose two
-transitions leave from the same state and go to distinct target states on the
-same character. If these transitions are assigned conflicting priorities, then
-during the determinization process the transition with the higher priority will
-take precedence over the transition with the lower priority. The lower priority
-transition gets abandoned. The transitions would otherwise be combined into a new
-transition that goes to a new state that is a combination of the original
-target states. Priorities are often required for segmenting machines. The most
-common uses of priorities have been encoded into a set of simple operators
-that should be used instead of priority embeddings whenever possible.
-
-For the purposes of embedding, Ragel divides transitions and states into
-different classes. There are four operators for embedding actions and
-priorities into the transitions of a state machine. It is possible to embed
-into entering transitions, finishing transitions, all transitions and leaving
-transitions. The embedding into leaving transitions is a special case.
-These transition embeddings get stored in the final states of a machine. They
-are transferred to any transitions that are made going out of the machine by
-future concatenation or kleene star operations.
-
-There are several more operators for embedding actions into states. Like the
-transition embeddings, there are various different classes of states that the
-embedding operators access. For example, one can access start states, final
-states or all states, among others. Unlike the transition embeddings, there are
-several different types of state action embeddings. These are executed at
-various different times during the processing of input. It is possible to embed
-actions that are executed on transitions into a state, on transitions out of a
-state, on transitions taken on the error event, or on transitions taken on the
-EOF event.
-
-Within actions, it is possible to influence the behaviour of the state machine.
-The user can write action code that jumps or calls to another portion of the
-machine, changes the current character being processed, or breaks out of the
-processing loop. With the state machine calling feature Ragel can be used to
-parse languages that are not regular. For example, one can parse balanced
-parentheses by calling into a parser when an open parenthesis character is seen
-and returning to the state on the top of the stack when the corresponding
-closing parenthesis character is seen. More complicated context-free languages
-such as expressions in C are out of the scope of Ragel.
-
-Ragel also provides a scanner construction operator that can be used to build
-scanners much the same way that Lex is used. The Ragel generated code, which
-relies on user-defined variables for backtracking, repeatedly tries to match
-patterns to the input, favouring longer patterns over shorter ones and patterns
-that appear ahead of others when the lengths of the possible matches are
-identical. When a pattern is matched the associated action is executed.
-
-The key distinguishing feature between scanners in Ragel and scanners in Lex is
-that Ragel patterns may be arbitrary Ragel expressions and can therefore
-contain embedded code. With a Ragel-based scanner the user need not wait until
-the end of a pattern before user code can be executed.
-
-Scanners do take Ragel out of the domain of pure state machines and require the
-user to maintain the backtracking related variables. However, scanners
-integrate well with regular state machine instantiations. They can be called to
-or jumped to only when needed, or they can be called out of or jumped out of
-when a simpler, pure state machine model is appropriate.
-
-Two types of output code style are available. Ragel can produce a table-driven
-machine or a directly executable machine. The directly executable machine is
-much faster than the table-driven. On the other hand, the table-driven machine
-is more compact and less demanding on the host language compiler. It is better
-suited to compiling large state machines.
-
-=== Related Work
-
-==== Lex
-
-Lex is perhaps the best-known tool for constructing parsers from regular
-expressions. In the Lex processing model, generated code attempts to match one
-of the user's regular expression patterns, favouring longer matches over
-shorter ones. Once a match is made it then executes the code associated with
-the pattern and consumes the matching string. This process is repeated until
-the input is fully consumed.
-
-Through the use of start conditions, related sets of patterns may be defined.
-The active set may be changed at any time. This allows the user to define
-different lexical regions. It also allows the user to link patterns together by
-requiring that some patterns come before others. This is quite like a
-concatenation operation. However, use of Lex for languages that require a
-considerable amount of pattern concatenation is inappropriate. In such cases a
-Lex program deteriorates into a manually specified state machine, where start
-conditions define the states and pattern actions define the transitions. Lex
-is therefore best suited to parsing tasks where the language to be parsed can
-be described in terms of regions of tokens.
-
-Lex is useful in many scenarios and has undoubtedly stood the test of time.
-There are, however, several drawbacks to using Lex. Lex can impose too much
-overhead for parsing applications where buffering is not required because all
-the characters are available in a single string. In these cases there is
-structure to the language to be parsed and a parser specification tool can
-help, but employing a heavyweight processing loop that imposes a stream
-pull model and dynamic input buffer allocation is inappropriate. An
-example of this kind of scenario is the conversion of floating point numbers
-contained in a string to their corresponding numerical values.
-
-Another drawback is the very issue that Ragel attempts to solve.
-It is not possible to execute a user action while
-matching a character contained inside a pattern. For example, if scanning a
-programming language and string literals can contain newlines which must be
-counted, a Lex user must break up a string literal pattern so as to associate
-an action with newlines. This forces the definition of a new start condition.
-Alternatively the user can reprocess the text of the matched string literal to
-count newlines.
-
-/////////////////////////////////
-How ragel is different from Lex.
-
-Like Re2c, Ragel provides a simple execution model that does not make any
-assumptions as to how the input is collected. Also, Ragel does not do any
-buffering in the generated code. Consequently there are no dependencies on
-external functions such as `malloc`.
-
-If buffering is required it can be manually implemented by embedding actions
-that copy the current character to a buffer, or data can be passed to the
-parser using known block boundaries. If the longest-match operator is used,
-Ragel requires the user to ensure that the ending portion of the input buffer
-is preserved when the buffer is exhausted before a token is fully matched. The
-user should move the token prefix to a new memory location, such as back to the
-beginning of the input buffer, then place the subsequently read input
-immediately after the prefix.
-
-These properties of Ragel make it more work to write a program that requires
-the longest-match operator or buffering of input, however they make Ragel a
-more flexible tool that can produce very simple and fast-running programs under
-a variety of input acquisition arrangements.
-
-In Ragel, it is not necessary
-to introduce start conditions to concatenate tokens and retain action
-execution. Ragel allows one to structure a parser as a series of tokens, but
-does not require it.
-
-Like Lex and Re2C, Ragel is able to process input using a longest-match
-execution model, however the core of the Ragel language specifies parsers at a
-much lower level. This core is built around a pure state machine model. When
-building basic machines there is no implied algorithm for processing input
-other than to move from state to state on the transitions of the machine. This
-core of pure state machine operations makes Ragel well suited to handling
-parsing problems not based on token scanning. Should one need to use a
-longest-match model, the functionality is available and the lower level state
-machine construction facilities can be used to specify the patterns of a
-longest-match machine.
-
-This is not possible in Ragel. One can only program
-a longest-match instantiation with a fixed set of rules. One can jump to
-another longest-match machine that employs the same machine definitions in the
-construction of its rules, however no states will be shared.
-
-In Ragel, input may be re-parsed using a
-different machine, but since the action to be executed is associated with
-transitions of the compiled state machine, the longest-match construction does
-not permit a single rule to be excluded from the active set. It cannot be done
-ahead of time nor in the excluded rule's action.
-/////////////////////////////////
-
-==== Re2C
-
-The Re2C program defines an input processing model similar to that of Lex.
-Re2C focuses on making generated state machines run very fast and
-integrate easily into any program, free of dependencies. Re2C generates
-directly executable code and is able to claim that generated parsers run nearly
-as fast as their hand-coded equivalents. This is very important for user
-adoption, as programmers are reluctant to use a tool when a faster alternative
-exists. A consideration to ease of use is also important because developers
-need the freedom to integrate the generated code as they see fit.
-
-==== Regular Expression Libraries
-
-Many scripting languages provide ways of composing parsers by linking regular
-expressions using program logic. For example, Sed and Awk are two established
-Unix scripting tools that allow the programmer to exploit regular expressions
-for the purpose of locating and extracting text of interest. High-level
-programming languages such as Perl, Python, PHP and Ruby all provide regular
-expression libraries that allow the user to combine regular expressions with
-arbitrary code.
-
-In addition to supporting the linking of regular expressions with arbitrary
-program logic, the Perl programming language permits the embedding of code into
-regular expressions. Perl embeddings do not translate into the embedding of
-code into deterministic state machines. Perl regular expressions are in fact
-not fully compiled to deterministic machines when embedded code is involved.
-They are instead interpreted and involve backtracking. This is shown by the
-following Perl program. When it is fed the input +abcd+ the interpreter
-attempts to match the first alternative, printing +a1 b1+. When this
-possibility fails it backtracks and tries the second possibility, printing
-+a2 b2+, at which point it succeeds.
-
------------------
-print "YES\n" if ( <STDIN> =~
- /( a (?{ print "a1 "; }) b (?{ print "b1 "; }) cX ) |
- ( a (?{ print "a2 "; }) b (?{ print "b2 "; }) cd )/x )
------------------
-
-In Ragel there is no regular expression interpreter. Aside from the scanner
-operator, all Ragel expressions are made into deterministic machines and the
-run time simply moves from state to state as it consumes input. An equivalent
-parser expressed in Ragel would attempt both of the alternatives concurrently,
-printing +a1 a2 b1 b2+.
-
-//////////////////////
-=== Development Status
-
-Ragel is a relatively new tool and is under continuous development. As a rough
-release guide, minor revision number changes are for implementation
-improvements and feature additions. Major revision number changes are for
-implementation and language changes that do not preserve backwards
-compatibility. Though in the past this has not always held true: changes that
-break code have crept into minor version number changes. Typically, the
-documentation lags behind the development in the interest of documenting only
-the lasting features. The latest changes are always documented in the ChangeLog
-file.
-//////////////////////
-
-== Constructing State Machines
-
-=== Ragel State Machine Specifications
-
-A Ragel input file consists of a program in the host language that contains
-embedded machine specifications. Ragel normally passes input straight to
-output. When it sees a machine specification it stops to read the Ragel
-statements and possibly generate code in place of the specification. Afterwards
-it continues to pass input through. There can be any number of FSM
-specifications in an input file. A multi-line FSM spec starts with +%%{+ and
-ends with +}%%+. A single-line FSM spec starts with +%%+ and ends at the
-first newline.
-
-While Ragel is looking for FSM specifications it does basic lexical analysis on
-the surrounding input. It interprets literal strings and comments so a
-+%%+ sequence in either of those will not trigger the parsing of an FSM
-specification. Ragel does not pass the input through any preprocessor nor does it
-interpret preprocessor directives itself so includes, defines and ifdef logic
-cannot be used to alter the parse of a Ragel input file. It is therefore not
-possible to use an +#if 0+ directive to comment out a machine as is
-commonly done in C code. As an alternative, a machine can be prevented from
-causing any generated output by commenting out write statements.
-
-In the example below, a multi-line specification is used to define the machine
-and single line specifications are used to trigger the writing of the machine
-data and execution code. This example shows parsing of a command line argument.
-
-.Parsing Command Line Args
--------------------------
-#include <string.h>
-#include <stdio.h>
-
-%%{
- machine foo;
- main :=
- ( 'foo' | 'bar' )
- 0 @{ res = 1; };
-}%%
-
-%% write data;
-
-int main( int argc, char **argv )
-{
- int cs, res = 0;
- if ( argc > 1 ) {
- char *p = argv[1];
- char *pe = p + strlen(p) + 1;
- %% write init;
- %% write exec;
- }
- printf("result = %i\n", res );
- return 0;
-}
--------------------------
-
-==== Naming Ragel Blocks
-
-------------
-machine fsm_name;
-------------
-
-The +machine+ statement gives the name of the FSM. If present in a
-specification, this statement must appear first. If a machine specification
-does not have a name then Ragel uses the previous specification name. If no
-previous specification name exists then this is an error. Because FSM
-specifications persist in memory, a machine's statements can be spread across
-multiple machine specifications. This allows one to break up a machine across
-several files or draw in statements that are common to multiple machines using
-the +include+ statement.
-
-[[definition]]
-==== Machine Definition
-
---------------
-<name> = <expression>;
---------------
-
-The machine definition statement associates an FSM expression with a name. Machine
-expressions assigned to names can later be referenced in other expressions. A
-definition statement on its own does not cause any states to be generated. It is simply a
-description of a machine to be used later. States are generated only when a definition is
-instantiated, which happens when a definition is referenced in an instantiated
-expression.
-
-[[instantiation]]
-==== Machine Instantiation
-
---------------
-<name> := <expression>;
---------------
-
-The machine instantiation statement generates a set of states representing an
-expression. Each instantiation generates a distinct set of states. The
-starting state of the instantiation is written in the data section of the
-generated code using the instantiation name. If a machine named +main+ is
-instantiated, its start state is used as the specification's start state and is
-assigned to the `cs` variable by the +write init+ command. If no +main+
-machine is given, the start state of the last machine instantiation to appear
-is used as the specification's start state.
-
-From outside the execution loop, control may be passed to any machine by
-assigning the entry point to the `cs` variable. From inside the execution
-loop, control may be passed to any machine instantiation using `fcall`,
-+fgoto+ or +fnext+ statements.
-
-==== Including Ragel Code
-
---------------
-include FsmName "inputfile.rl";
---------------
-
-The +include+ statement can be used to draw in the statements of another FSM
-specification. Both the name and input file are optional, however at least one
-must be given. Without an FSM name, the given input file is searched for an FSM
-of the same name as the current specification. Without an input file, the
-current file is searched for a machine of the given name. If both are present,
-the given input file is searched for a machine of the given name.
-
-Ragel searches for included files from the location of the current file.
-Additional directories can be added to the search path using the +-I+ option.
-
-[[import]]
-==== Importing Definitions
-
---------------
-import "inputfile.h";
---------------
-
-The +import+ statement scrapes a file for sequences of tokens that match the
-following forms. Ragel treats these forms as state machine definitions.
-
-* name '=' number
-* name '=' lit_string
-* 'define' name number
-* 'define' name lit_string
-
-If the input file is a Ragel program then tokens inside any Ragel
-specifications are ignored. See the section on <<export, Write Exports>> for a
-description of exporting machine definitions.
-
-Ragel searches for imported files from the location of the current file.
-Additional directories can be added to the search path using the +-I+ option.
-
-[[lexing]]
-=== Lexical Analysis of a Ragel Block
-
-Within a machine specification the following lexical rules apply to the input.
-
-* The +#+ symbol begins a comment that terminates at the next newline.
-* The symbols +""+, +''+, +//+, and +[]+ behave as the
-delimiters of literal strings. Within them, the following escape sequences
-are interpreted:
-+
- \0 \a \b \t \n \v \f \r
-+
-A backslash at the end of a line joins the following line onto the current. A
-backslash preceding any other character removes special meaning. This applies
-to terminating characters and to special characters in regular expression
-literals. As an exception, regular expression literals do not support escape
-sequences as the operands of a range within a list. See the bullet on regular
-expressions in <<basic, Basic Machines>>.
-
-* The symbols +{}+ delimit a block of host language code that will be
-embedded into the machine as an action. Within the block of host language
-code, basic lexical analysis of comments and strings is done in order to
-correctly find the closing brace of the block. With the exception of FSM
-commands embedded in code blocks, the entire block is preserved as is for
-identical reproduction in the output code.
-
-* The pattern `[+-]?[0-9]+` denotes an integer in decimal format.
-Integers used for specifying machines may be negative only if the alphabet type
-is signed. Integers used for specifying priorities may be positive or negative.
-
-* The pattern `0x[0-9A-Fa-f]+` denotes an integer in hexadecimal
-format.
-
-* The keywords are
-+access+, +action+, +alphtype+, +eof+, +err+, +export+, +from+, +getkey+,
-+inwhen+, +lerr+, +machine+, +nfapostpop+, +nfaprepush+, +outwhen+, +postpop+,
-+prepush+, +to+, +variable+, +when+ and +write+.
-
-* The pattern +[a-zA-Z_][a-zA-Z_0-9]*+ denotes an identifier.
-
-///////////
-.The allowable symbols are:
-------------
-( ) ! ^ * ? + : -> - | & . , := = ; > @ $ %
->/ $/ %/ </ @/ <>/ >! $! %! <! @! <>!
->^ $^ %^ <^ @^ <>^ >~ $~ %~ <~ @~ <>~
->* $* %* <* @* <>*
-------------
-///////////
-
-* Any amount of whitespace may separate tokens.
-
-////////////
-=== Parse of an FSM Specification
-
-The following statements are possible within an FSM specification. The
-requirements for trailing semicolons loosely follow that of C. A block
-specifying code does not require a trailing semicolon. An expression statement
-does require a trailing semicolon.
-////////////
-
-[[basic]]
-=== Basic Machines
-
-The basic machines are the base operands of regular language expressions. They
-are the smallest unit to which machine construction and manipulation operators
-can be applied.
-
-* `'hello'` -- Concatenation Literal. Produces a machine that matches
-the sequence of characters in the quoted string. If there are 5 characters
-there will be 6 states chained together with the characters in the string. See
-the section <<lexing,Lexical Analysis>> for information on valid escape
-sequences.
-+
-image::bmconcat.png[align="left"]
-+
-It is possible to make a concatenation literal case-insensitive by appending an
-+i+ to the string, for example `'cmd'i`.
-
-///////////////
-% GENERATE: bmconcat
-% OPT: -p
-% %%{
-% machine bmconcat;
-.verbatim
-main := 'hello';
-.end verbatim
-% }%%
-% END GENERATE
-///////////////
-
-* `"hello"` -- Identical to the single quoted version.
-
-* `[hello]` -- Or Expression. Produces a union of characters. There
-will be two states with a transition for each unique character between the two
-states. The `[]` delimiters behave like the quotes of a literal string. For
-example, `[ \t]` means tab or space. The `or` expression supports
-character ranges with the `-` symbol as a separator. The meaning of the union
-can be negated using an initial +^+ character as in standard regular
-expressions. See <<lexing,Lexical Analysis>> for information on valid escape
-sequences in `or` expressions.
-+
-image::bmor.png[align="left"]
-
-/////////
-% GENERATE: bmor
-% OPT: -p
-% %%{
-% machine bmor;
-.verbatim
-main := [hello];
-.end verbatim
-% }%%
-% END GENERATE
-/////////
-
-* `''`, `""`, and `[]` -- Zero Length Machine. Produces a machine
-that matches the zero length string. Zero length machines have one state that
-is both a start state and a final state.
-+
-image::bmnull.png[align="left"]
-
-///////
-% GENERATE: bmnull
-% OPT: -p
-% %%{
-% machine bmnull;
-.verbatim
-main := '';
-.end verbatim
-% }%%
-% END GENERATE
-///////
-
-//% FIXME: More on the range of values here.
-* `42` -- Numerical Literal. Produces a two state machine with one
-transition on the given number. The number may be in decimal or hexadecimal
-format and should be in the range allowed by the alphabet type. The minimum and
-maximum values permitted are defined by the host machine that Ragel is compiled
-on. For example, numbers in a `short` alphabet on an i386 machine should be in
-the range `-32768` to `32767`.
-+
-image::bmnum.png[align="left"]
-
-///////////
-% GENERATE: bmnum
-% %%{
-% machine bmnum;
-.verbatim
-main := 42;
-.end verbatim
-% }%%
-% END GENERATE
-///////////
-
-* `/simple_regex/` -- Regular Expression. Regular expressions are
-parsed as a series of expressions that are concatenated together. Each
-concatenated expression may be a literal character, the `any` character
-specified by the `.` symbol, or a union of characters specified by the `[]`
-delimiters. If the first character of a union is `^` then it matches any
-character not in the list. Within a union, a range of characters can be given
-by separating the first and last characters of the range with the `-` symbol.
-Each concatenated machine may have repetition specified by following it with
-the `*` symbol. The standard escape sequences described in <<lexing, Lexical
-Analysis>> are supported everywhere in regular expressions except as the
-operands of a range within in a list. This notation also supports the `i`
-trailing option. Use it to produce case-insensitive machines, as in `/GET/i`.
-+
-Ragel does not support very complex regular expressions because the desired
-results can always be achieved using the more general machine construction
-operators listed in <<machconst, Regular Language Operators>>. The
-following diagram shows the result of compiling `/ab*[c-z].*[123]/`. Note that
-in the diagram, `DEF` represents the default transition, which is taken if no
-other transition can be taken.
-+
-image::bmregex.png[align="left"]
-
-//////////////////
-% GENERATE: bmregex
-% OPT: -p
-% %%{
-% machine bmregex;
-.verbatim
-main := /ab*[c-z].*[123]/;
-.end verbatim
-% }%%
-% END GENERATE
-//////////////////
-
-* `'a' .. 'z'` -- Range. Produces a machine that matches any
-characters in the specified range. Allowable upper and lower bounds of the
-range are concatenation literals of length one and numerical literals. For
-example, `0x10..0x20`, `0..63`, and `'a'..'z'` are valid ranges.
-The bounds should be in the range allowed by the alphabet type.
-+
-image::bmrange.png[align="left"]
-
-///////////////
-% GENERATE: bmrange
-% OPT: -p
-% %%{
-% machine bmrange;
-.verbatim
-main := 'a' .. 'z';
-.end verbatim
-% }%%
-% END GENERATE
-///////////////
-
-* `variable_name` -- Lookup the machine definition assigned to the
-variable name given and use an instance of it. See <<definition, Machine
-Definition>> for an important note on what it means to reference a variable
-name.
-
-* `builtin_machine` -- There are several built-in machines available
-for use. They are all two-state machines for the purpose of matching common
-classes of characters. They are:
-
-** `any` -- Any character in the alphabet.
-
-** `ascii` -- Ascii characters. `0..127`
-
-** `extend` -- Ascii extended characters. This is the range
-`-128..127` for signed alphabets and the range `0..255` for unsigned
-alphabets.
-
-** `alpha` -- Alphabetic characters. `[A-Za-z]`
-
-** `digit` -- Digits. `[0-9]`
-
-** `alnum` -- Alpha numerics. `[0-9A-Za-z]`
-
-** `lower` -- Lowercase characters. `[a-z]`
-
-** `upper` -- Uppercase characters. `[A-Z]`
-
-** `xdigit` -- Hexadecimal digits. `[0-9A-Fa-f]`
-
-** `cntrl` -- Control characters. `0..31`, `127`
-
-** `graph` -- Graphical characters. `[!-~]`
-
-** `print` -- Printable characters. `[ -~]`
-
-** `punct` -- Punctuation. Graphical characters that are not alphanumerics.
-+
- [!-/:-@\[-`{-~]
-
-** `space` -- Whitespace. `[\t\v\f\n\r ]`
-
-** `zlen` -- Zero length string. `""`
-
-** `empty` -- Empty set. Matches nothing. `^any`
-
-=== Operator Precedence
-
-The following table shows operator precedence from lowest to highest. Operators
-in the same precedence group are evaluated from left to right.
-
-|================================================================================
- ^|1 |`,` | Join
- ^|2 |`\| & - --` | Union, Intersection and Subtraction
- ^|3 |`. <: :> :>>` | Concatenation
- ^|4 |`:` | Label
- ^|5 |`->` | Epsilon Transition
-1.6+^.^|6 |`> @ $ %` | Transitions Actions and Priorities
- |`>/ $/ %/ </ @/ <>/` | EOF Actions
- |`>! $! %! <! @! <>!` | Global Error Actions
- |`>^ $^ %^ <^ @^ <>^` | Local Error Actions
- |`>~ $~ %~ <~ @~ <>~` | To-State Actions
- |`>* $* %* <* @* <>*` | From-State Action
- ^|7 |`* ** ? + {n} {,n} {n,} {n,m}` | Repetition
- ^|8 |`! ^` | Negation and Character-Level Negation
- ^|9 |`( <expr> )` | Grouping
-|================================================================================
-
-[[machconst]]
-=== Regular Language Operators
-
-When using Ragel it is helpful to have a sense of how it constructs machines.
-The determinization process can produce results that seem unusual to someone
-not familiar with the NFA to DFA conversion algorithm. In this section we
-describe Ragel's state machine operators. Though the operators are defined
-using epsilon transitions, it should be noted that this is for discussion only.
-The epsilon transitions described in this section do not persist, but are
-immediately removed by the determinization process which is executed at every
-operation. Ragel does not make use of any nondeterministic intermediate state
-machines.
-
-To create an epsilon transition between two states `x` and `y` is to
-copy all of the properties of `y` into `x`. This involves drawing in
-all of `y`'s to-state actions, EOF actions, etc., in addition to its
-transitions. If `x` and `y` both have a transition out on the same
-character, then the transitions must be combined. During transition
-combination a new transition is made that goes to a new state that is the
-combination of both target states. The new combination state is created using
-the same epsilon transition method. The new state has an epsilon transition
-drawn to all the states that compose it. Since the creation of new epsilon
-transitions may be triggered every time an epsilon transition is drawn, the
-process of drawing epsilon transitions is repeated until there are no more
-epsilon transitions to be made.
-
-A very common error that is made when using Ragel is to make machines that do
-too much. That is, to create machines that have unintentional
-nondeterministic properties. This usually results from being unaware of the common strings
-between machines that are combined together using the regular language
-operators. This can involve never leaving a machine, causing its actions to be
-propagated through all the following states. Or it can involve an alternation
-where both branches are unintentionally taken simultaneously.
-
-This problem forces one to think hard about the language that needs to be
-matched. To guard against this kind of problem one must ensure that the machine
-specification is divided up using boundaries that do not allow ambiguities from
-one portion of the machine to the next. See the chapter on
-<<controlling_nondeterminism, Controlling Nondeterminism>> for more on this
-problem and how to solve it.
-
-The Graphviz tool is an immense help when debugging improperly compiled
-machines or otherwise learning how to use Ragel. Graphviz Dot files can be
-generated from Ragel programs using the `-V` option. See the section on
-<<visualization, Visualization>> for more information.
-
-==== Union
-
---------------
-expr | expr
---------------
-
-The union operation produces a machine that matches any string in machine one
-or machine two. The operation first creates a new start state. Epsilon
-transitions are drawn from the new start state to the start states of both
-input machines. The resulting machine has a final state set equivalent to the
-union of the final state sets of both input machines. In this operation, there
-is the opportunity for nondeterminism among both branches. If there are
-strings, or prefixes of strings that are matched by both machines then the new
-machine will follow both parts of the alternation at once. The union operation is
-shown below.
-
-image::opor.png[align="left"]
-
-The following example demonstrates the union of three machines representing
-common tokens.
-
-///////////////////
-% GENERATE: exor
-% OPT: -p
-% %%{
-% machine exor;
-///////////////////
----------------
-# Hex digits, decimal digits, or identifiers
-main := '0x' xdigit+ | digit+ | alpha alnum*;
----------------
-///////////////////
-% }%%
-% END GENERATE
-///////////////////
-
-image::exor.png[align="left"]
-
-==== Intersection
-
---------------
-expr & expr
---------------
-
-Intersection produces a machine that matches any string that is in both machine
-one and machine two. To achieve intersection, a union is performed on the two
-machines. After the result has been made deterministic, any final state that is
-not a combination of final states from both machines has its final state status
-revoked. To complete the operation, paths that do not lead to a final state are
-pruned from the machine. Therefore, if there are any such paths in either of
-the expressions they will be removed by the intersection operator.
-Intersection can be used to require that two independent patterns be
-simultaneously satisfied as in the following example.
-
-////////////////////
-% GENERATE: exinter
-% OPT: -p
-% %%{
-% machine exinter;
-////////////////////
-----------------
-# Match lines four characters wide that contain
-# words separated by whitespace.
-main :=
- /[^\n][^\n][^\n][^\n]\n/* &
- (/[a-z][a-z]*/ | [ \n])**;
-----------------
-////////////////////
-% }%%
-% END GENERATE
-////////////////////
-
-image::exinter.png[align="left"]
-
-==== Difference
-
-----------------
-expr - expr
-----------------
-
-The difference operation produces a machine that matches
-strings that are in machine one but are not in machine two. To achieve subtraction,
-a union is performed on the two machines. After the result has been made
-deterministic, any final state that came from machine two or is a combination
-of states involving a final state from machine two has its final state status
-revoked. As with intersection, the operation is completed by pruning any path
-that does not lead to a final state. The following example demonstrates the
-use of subtraction to exclude specific cases from a set.
-
-//////////////////////
-% GENERATE: exsubtr
-% OPT: -p
-% %%{
-% machine exsubtr;
-//////////////////////
----------------------------
-# Subtract keywords from identifiers.
-main := /[a-z][a-z]*/ - ( 'for' | 'int' );
----------------------------
-/////////////////
-% }%%
-% END GENERATE
-/////////////////
-
-image::exsubtr.png[align="left"]
-
-[[strong_difference]]
-==== Strong Difference
-
----------------
-expr -- expr
----------------
-
-Strong difference produces a machine that matches any string of the first
-machine that does not have any string of the second machine as a substring. In
-the following example, strong subtraction is used to excluded `CRLF` from
-a sequence. In the corresponding visualization, the label `DEF` is short
-for default. The default transition is taken if no other transition can be
-taken.
-
-/////////////////
-% GENERATE: exstrongsubtr
-% OPT: -p
-% %%{
-% machine exstrongsubtr;
-/////////////////
------------------
-crlf = '\r\n';
-main := [a-z]+ ':' ( any* -- crlf ) crlf;
------------------
-/////////////////
-% }%%
-% END GENERATE
-/////////////////
-
-image::exstrongsubtr.png[align="left"]
-
-This operator is equivalent to the following.
-
----------------
-expr - ( any* expr any* )
----------------
-
-==== Concatenation
-
---------------
-expr . expr
---------------
-
-Concatenation produces a machine that matches all the strings in machine one
-followed by all the strings in machine two. Concatenation draws epsilon
-transitions from the final states of the first machine to the start state of
-the second machine. The final states of the first machine lose their final
-state status, unless the start state of the second machine is final as well.
-Concatenation is the default operator. Two machines next to each other with no
-operator between them results in concatenation.
-
-image::opconcat.png[align="left"]
-
-The opportunity for nondeterministic behaviour results from the possibility of
-the final states of the first machine accepting a string that is also accepted
-by the start state of the second machine.
-The most common scenario in which this happens is the
-concatenation of a machine that repeats some pattern with a machine that gives
-a terminating string, but the repetition machine does not exclude the
-terminating string. The example in <<strong_difference, Strong Difference>>
-guards against this. Another example is the expression `("'" any* "'")`.
-When executed the thread of control will
-never leave the `any*` machine. This is a problem especially if actions
-are embedded to process the characters of the `any*` component.
-
-In the following example, the first machine is always active due to the
-nondeterministic nature of concatenation. This particular nondeterminism is intended,
-however, because we wish to permit EOF strings before the end of the input.
-
-////////////////////////
-% GENERATE: exconcat
-% OPT: -p
-% %%{
-% machine exconcat;
-////////////////////////
-----------------------
-# Require an eof marker on the last line.
-main := /[^\n]*\n/* . 'EOF\n';
-----------------------
-////////////////////
-% }%%
-% END GENERATE
-////////////////////
-
-image::exconcat.png[align="left"]
-
-There is a language ambiguity involving concatenation and subtraction. Because
-concatenation is the default operator for two adjacent machines there is an
-ambiguity between subtraction of a positive numerical literal and concatenation
-of a negative numerical literal. For example, `(x-7)` could be interpreted as
-`(x . -7)` or `(x - 7)`. In the Ragel language, the subtraction operator always
-takes precedence over concatenation of a negative literal. We adhere to the
-rule that the default concatenation operator takes effect only when there are
-no other operators between two machines. Beware of writing machines such as
-`(any -1)` when what is desired is a concatenation of `any` and `-1`. Instead
-write `(any . -1)` or `(any (-1))`. If in doubt of the meaning of your program
-do not rely on the default concatenation operator; always use the `.` symbol.
-
-==== Kleene Star
-
----------------
-expr*
----------------
-
-The machine resulting from the Kleene Star operator will match zero or more
-repetitions of the machine it is applied to.
-It creates a new start state and an additional final
-state. Epsilon transitions are drawn between the new start state and the old start
-state, between the new start state and the new final state, and
-between the final states of the machine and the new start state. After the
-machine is made deterministic, the final states get all the
-transitions of the start state.
-
-image::opstar.png[align="left"]
-
-The possibility for nondeterministic behaviour arises if the final states have
-transitions on any of the same characters as the start state. This is common
-when applying kleene star to an alternation of tokens. Like the other problems
-arising from nondeterministic behavior, this is discussed in more detail in the chapter on
-<<controlling_nondeterminism, Controlling Nondeterminism>>. This particular
-problem can also be solved using the longest-match construction discussed in the section
-on <<generating_scanners, Scanners>>.
-
-In this example, there is no nondeterminism introduced by the exterior kleene
-star due to the newline at the end of the regular expression. Without the
-newline the exterior kleene star would be redundant and there would be
-ambiguity between repeating the inner range of the regular expression and the
-entire regular expression. Though it would not cause a problem in this case,
-unnecessary nondeterminism in the kleene star operator often causes undesired
-results for new Ragel users and must be guarded against.
-
-///////////////
-% GENERATE: exstar
-% OPT: -p
-% %%{
-% machine exstar;
-///////////////
-------------------
-# Match any number of lines with only lowercase letters.
-main := /[a-z]*\n/*;
-------------------
-///////////////
-% }%%
-% END GENERATE
-///////////////
-
-image::exstar.png[align="left"]
-
-==== One Or More Repetition
-
------------
-expr+
------------
-
-This operator produces the concatenation of the machine with the kleene star of
-itself. The result will match one or more repetitions of the machine. The plus
-operator is equivalent to `(expr . expr*)`.
-
-/////////////////////
-% GENERATE: explus
-% OPT: -p
-% %%{
-% machine explus;
-/////////////////////
-----------------------
-# Match alpha-numeric words.
-main := alnum+;
-----------------------
-////////////////
-% }%%
-% END GENERATE
-////////////////
-
-image::explus.png[align="left"]
-
-==== Optional
-
----------------
-expr?
----------------
-
-The _optional_ operator produces a machine that accepts the machine
-given or the zero length string. The optional operator is equivalent to
-`(expr | '' )`. In the following example the optional operator is used to
-possibly extend a token.
-
-///////////////
-% GENERATE: exoption
-% OPT: -p
-% %%{
-% machine exoption;
-///////////////
----------------
-# Match integers or floats.
-main := digit+ ('.' digit+)?;
----------------
-/////////////////
-% }%%
-% END GENERATE
-/////////////////
-
-image::exoption.png[align="left"]
-
-==== Repetition
-
-* `expr {n}` -- Exactly N copies of expr.
-* `expr {,n}` -- Zero to N copies of expr.
-* `expr {n,}` -- N or more copies of expr.
-* `expr {n,m}` -- N to M copies of expr.
-
-==== Negation
-
---------------
-!expr
---------------
-
-Negation produces a machine that matches any string not matched by the given
-machine. Negation is equivalent to `(any* - expr)`.
-
-//////////////////////
-% GENERATE: exnegate
-% OPT: -p
-% %%{
-% machine exnegate;
-//////////////////////
---------------------
-# Accept anything but a string beginning with a digit.
-main := ! ( digit any* );
---------------------
-//////////////////////
-% }%%
-% END GENERATE
-//////////////////////
-
-image::exnegate.png[align="left"]
-
-==== Character-Level Negation
-
---------------
-^expr
---------------
-
-Character-level negation produces a machine that matches any single character
-not matched by the given machine. Character-Level Negation is equivalent to
-`(any - expr)`. It must be applied only to machines that match strings of
-length one.
-
-=== State Machine Minimization
-
-State machine minimization is the process of finding the minimal equivalent FSM accepting
-the language. Minimization reduces the number of states in machines
-by merging equivalent states. It does not change the behaviour of the machine
-in any way. It will cause some states to be merged into one because they are
-functionally equivalent. State minimization is on by default. It can be turned
-off with the `-n` option.
-
-The algorithm implemented is similar to Hopcroft's state minimization
-algorithm. Hopcroft's algorithm assumes a finite alphabet that can be listed in
-memory, whereas Ragel supports arbitrary integer alphabets that cannot be
-listed in memory. Though exact analysis is very difficult, Ragel minimization
-runs close to O(n * log(n)) and requires O(n) temporary storage where
-$n$ is the number of states.
-
-[[visualization]]
-=== Visualization
-
-/////////
-%In many cases, practical
-%parsing programs will be too large to completely visualize with Graphviz. The
-%proper approach is to reduce the language to the smallest subset possible that
-%still exhibits the characteristics that one wishes to learn about or to fix.
-%This can be done without modifying the source code using the `-M` and
-%`-S` options. If a machine cannot be easily reduced,
-%embeddings of unique actions can be very useful for tracing a
-%particular component of a larger machine specification, since action names are
-%written out on transition labels.
-/////////
-
-Ragel is able to emit compiled state machines in Graphviz's Dot file format.
-This is done using the `-V` option.
-Graphviz support allows users to perform
-incremental visualization of their parsers. User actions are displayed on
-transition labels of the graph.
-
-If the final graph is too large to be
-meaningful, or even drawn, the user is able to inspect portions of the parser
-by naming particular regular expression definitions with the `-S` and
-`-M` options to the `ragel` program. Use of Graphviz greatly
-improves the Ragel programming experience. It allows users to learn Ragel by
-experimentation and also to track down bugs caused by unintended
-nondeterminism.
-
-Ragel has another option to help debugging. The `-x` option causes Ragel
-to emit the compiled machine in an XML format.
-
-== User Actions
-
-Ragel permits the user to embed actions into the transitions of a regular
-expression's corresponding state machine. These actions are executed when the
-generated code moves over a transition. Like the regular expression operators,
-the action embedding operators are fully compositional. They take a state
-machine and an action as input, embed the action and yield a new state machine
-that can be used in the construction of other machines. Due to the
-compositional nature of embeddings, the user has complete freedom in the
-placement of actions.
-
-A machine's transitions are categorized into four classes. The action embedding
-operators access the transitions defined by these classes. The
-_entering transition_ operator `>` isolates the start state, then embeds an action
-into all transitions leaving it. The _finishing transition_ operator
-`@` embeds an action into all transitions going into a final state. The
-_all transition_ operator `$` embeds an action into all transitions of
-an expression. The _leaving transition_ operator `%` provides access
-to the yet-unmade transitions moving out of the machine via the final states.
-
-=== Embedding Actions
-
----------------
-action ActionName {
- /* Code an action here. */
- count += 1;
-}
----------------
-
-The action statement defines a block of code that can be embedded into an FSM.
-Action names can be referenced by the action embedding operators in
-expressions. Though actions need not be named in this way (literal blocks
-of code can be embedded directly when building machines), defining reusable
-blocks of code whenever possible is good practice because it potentially increases the
-degree to which the machine can be minimized.
-
-Within an action some Ragel expressions and statements are parsed and
-translated. These allow the user to interact with the machine from action code.
-See <<vals, Values and Statements>> for a complete list of values and
-statements available in code blocks.
-
-==== Entering Action
-
-----------------------
-expr > action
-----------------------
-
-The entering action operator embeds an action into all transitions
-that enter into the machine from the start state. If the start state is final,
-then the action is also embedded into the start state as a leaving action. This
-means that if a machine accepts the zero-length string and control passes
-through the start state then the entering action is executed. Note
-that this can happen on both a following character and on the EOF event.
-
-In some machines, the start state has transtions coming in from within the
-machine. In these cases the start state is first isolated from the rest of the
-machine ensuring that the entering actions are executed once only.
-
-////////////////
-% GENERATE: exstact
-% OPT: -p
-% %%{
-% machine exstact;
-////////////////
---------------------
-# Execute A at the beginning of a string of alpha.
-action A {}
-main := ( lower* >A ) . ' ';
---------------------
-//////////////////
-% }%%
-% END GENERATE
-//////////////////
-
-image::exstact.png[align="left"]
-
-==== Finishing Action
-
-------------
-expr @ action
-------------
-
-The finishing action operator embeds an action into any transitions that move
-the machine into a final state. Further input may move the machine out of the
-final state, but keep it in the machine. Therefore, finishing actions may be
-executed more than once if a machine has any internal transitions out of a
-final state. In the following example, the final state has no transitions out
-and the finishing action is executed only once.
-
-////////////////////////
-% GENERATE: exdoneact
-% OPT: -p
-% %%{
-% machine exdoneact;
-% action A {}
-////////////////////////
---------------------
-# Execute A when the trailing space is seen.
-main := ( lower* ' ' ) @A;
---------------------
-////////////////////////
-% }%%
-% END GENERATE
-////////////////////////
-
-image::exdoneact.png[align="left"]
-
-==== All Transition Action
-
-------------
-expr $ action
-------------
-
-The all transition operator embeds an action into all transitions of a machine.
-The action is executed whenever a transition of the machine is taken. In the
-following example, A is executed on every character matched.
-
-///////////////////
-% GENERATE: exallact
-% OPT: -p
-% %%{
-% machine exallact;
-% action A {}
-///////////////////
----------------------
-# Execute A on any characters of the machine.
-main := ( 'm1' | 'm2' ) $A;
----------------------
-/////////////////////
-% }%%
-% END GENERATE
-/////////////////////
-
-image::exallact.png[align="left"]
-
-[[out_actions]]
-==== Leaving Actions
-
----------------
-expr % action
----------------
-
-The leaving action operator queues an action for embedding into the transitions
-that go out of a machine via a final state. The action is first stored in
-the machine's final states and is later transferred to any transitions that are
-made going out of the machine by a kleene star or concatenation operation.
-
-If a final state of the machine is still final when compilation is complete
-then the leaving action is also embedded as an EOF action. Therefore, leaving
-the machine is defined as either leaving on a character or as state machine
-acceptance.
-
-This operator allows one to associate an action with the termination of a
-sequence, without being concerned about what particular character terminates
-the sequence. In the following example, A is executed when leaving the alpha
-machine on the newline character.
-
-//////////////
-% GENERATE: exoutact1
-% OPT: -p
-% %%{
-% machine exoutact1;
-% action A {}
-//////////////
-----------------
-# Match a word followed by a newline. Execute A when
-# finishing the word.
-main := ( lower+ %A ) . '\n';
-----------------
-//////////////
-% }%%
-% END GENERATE
-//////////////
-
-image::exoutact1.png[align="left"]
-
-In the following example, the `term_word` action could be used to register
-the appearance of a word and to clear the buffer that the `lower` action used
-to store the text of it.
-
-////////////////////
-% GENERATE: exoutact2
-% OPT: -p
-% %%{
-% machine exoutact2;
-% action lower {}
-% action space {}
-% action term_word {}
-% action newline {}
-////////////////////
---------------------
-word = ( [a-z] @lower )+ %term_word;
-main := word ( ' ' @space word )* '\n' @newline;
---------------------
-////////////////
-% }%%
-% END GENERATE
-////////////////
-
-image::exoutact2.png[align="left"]
-
-In this final example of the action embedding operators, A is executed upon entering
-the alpha machine, B is executed on all transitions of the
-alpha machine, C is executed when the alpha machine is exited by moving into the
-newline machine and N is executed when the newline machine moves into a final
-state.
-
-////////////////////
-% GENERATE: exaction
-% OPT: -p
-% %%{
-% machine exaction;
-% action A {}
-% action B {}
-% action C {}
-% action N {}
-////////////////////
-----------------------
-# Execute A on starting the alpha machine, B on every transition
-# moving through it and C upon finishing. Execute N on the newline.
-main := ( lower* >A $B %C ) . '\n' @N;
-----------------------
-////////////////////
-% }%%
-% END GENERATE
-////////////////////
-
-image::exaction.png[align="left"]
-
-=== State Action Embedding Operators
-
-The state embedding operators allow one to embed actions into states. Like the
-transition embedding operators, there are several different classes of states
-that the operators access. The meanings of the symbols are similar to the
-meanings of the symbols used for the transition embedding operators. The design
-of the state selections was driven by a need to cover the states of an
-expression with exactly one error action.
-
-Unlike the transition embedding operators, the state embedding operators are
-also distinguished by the different kinds of events that embedded actions can
-be associated with. Therefore the state embedding operators have two
-components. The first, which is the first one or two characters, specifies the
-class of states that the action will be embedded into. The second component
-specifies the type of event the action will be executed on. The symbols of the
-second component also have equivalent keywords.
-
-The different classes of states are:
-
-* `>` -- the start state
-* `<` -- any state except the start state
-* `$` -- all states
-* `%` -- final states
-* `@` -- any state except final states
-* `<>` -- any except start and final (middle)
-
-The different kinds of embeddings are:
-
-* `~` -- to-state actions (`to`)
-* `*` -- from-state actions (`from`)
-* `/` -- EOF actions (`eof`)
-* `!` -- error actions (`err`)
-* `^` -- local error actions (`lerr`)
-
-==== To-State and From-State Actions
-
-===== To-State Actions
-
-* `>~action >to(name) >to{...}` -- the start state
-* `<~action <to(name) <to{...}` -- any state except the start state
-* `$~action $to(name) $to{...}` -- all states
-* `%~action %to(name) %to{...}` -- final states
-* `@~action @to(name) @to{...}` -- any state except final states
-* `<>~action <>to(name) <>to{...}` -- any except start and final (middle)
-
-
-To-state actions are executed whenever the state machine moves into the
-specified state, either by a natural movement over a transition or by an
-action-based transfer of control such as `fgoto`. They are executed after the
-in-transition's actions but before the current character is advanced and
-tested against the end of the input block. To-state embeddings stay with the
-state. They are irrespective of the state's current set of transitions and any
-future transitions that may be added in or out of the state.
-
-Note that the setting of the current state variable `cs` outside of the
-execute code is not considered by Ragel as moving into a state and consequently
-the to-state actions of the new current state are not executed. This includes
-the initialization of the current state when the machine begins. This is
-because the entry point into the machine execution code is after the execution
-of to-state actions.
-
-===== From-State Actions
-
-* `>*action >from(name) >from{...}` -- the start state
-* `<*action <from(name) <from{...}` -- any state except the start state
-* `$*action $from(name) $from{...}` -- all states
-* `%*action %from(name) %from{...}` -- final states
-* `@*action @from(name) @from{...}` -- any state except final states
-* `<>*action <>from(name) <>from{...}` -- any except start and final (middle)
-
-From-state actions are executed whenever the state machine takes a transition from a
-state, either to itself or to some other state. These actions are executed
-immediately after the current character is tested against the input block end
-marker and before the transition to take is sought based on the current
-character. From-state actions are therefore executed even if a transition
-cannot be found and the machine moves into the error state. Like to-state
-embeddings, from-state embeddings stay with the state.
-
-==== EOF Actions
-
-* `>/action >eof(name) >eof{...}` -- the start state
-* `</action <eof(name) <eof{...}` -- any state except the start state
-* `$/action $eof(name) $eof{...}` -- all states
-* `%/action %eof(name) %eof{...}` -- final states
-* `@/action @eof(name) @eof{...}` -- any state except final states
-* `<>/action <>eof(name) <>eof{...}` -- any except start and final (middle)
-
-The EOF action embedding operators enable the user to embed actions that are
-executed at the end of the input stream. EOF actions are stored in states and
-generated in the `write exec` block. They are run when `p == pe == eof`
-as the execute block is finishing. EOF actions are free to adjust `p`
-and jump to another part of the machine to restart execution.
-
-==== Handling Errors
-
-In many applications it is useful to be able to react to parsing errors. The
-user may wish to print an error message that depends on the context. It may
-also be desirable to consume input in an attempt to return the input stream to
-some known state and resume parsing. To support error handling and recovery,
-Ragel provides error action embedding operators. There are two kinds of error
-actions: global error actions and local error actions. Error actions can be
-used to simply report errors, or by jumping to a machine instantiation that
-consumes input, can attempt to recover from errors.
-
-===== Global Error Actions
-
-* `>!action >err(name) >err{...}` -- the start state
-* `<!action <err(name) <err{...}` -- any state except the start state
-* `$!action $err(name) $err{...}` -- all states
-* `%!action %err(name) %err{...}` -- final states
-* `@!action @err(name) @err{...}` -- any state except final states
-* `<>!action <>err(name) <>err{...}` -- any except start and final (middle)
-
-Global error actions are stored in the states they are embedded into until
-compilation is complete. They are then transferred to the transitions that move
-into the error state. These transitions are taken on all input characters that
-are not already covered by the state's transitions. If a state with an error
-action is not final when compilation is complete, then the action is also
-embedded as an EOF action.
-
-Error actions can be used to recover from errors by jumping back into the
-machine with `fgoto` and optionally altering `p`.
-
-===== Local Error Actions
-
-* `>^action >lerr(name) >lerr{...}` -- the start state
-* `<^action <lerr(name) <lerr{...}` -- any state except the start state
-* `$^action $lerr(name) $lerr{...}` -- all states
-* `%^action %lerr(name) %lerr{...}` -- final states
-* `@^action @lerr(name) @lerr{...}` -- any state except final states
-* `<>^action <>lerr(name) <>lerr{...}` -- any except start and final (middle)
-
-Like global error actions, local error actions are also stored in the states
-they are embedded into until a transfer point. The transfer point is different
-however. Each local error action embedding is associated with a name. When a
-machine definition has been fully constructed, all local error action
-embeddings associated with the same name as the machine definition are
-transferred to the error transitions. At this time they are also embedded as
-EOF actions in the case of non-final states.
-
-Local error actions can be used to specify an action to take when a particular
-section of a larger state machine fails to match. A particular machine
-definition's ``thread'' may die and the local error actions executed, however
-the machine as a whole may continue to match input.
-
-There are two forms of local error action embeddings. In the first form the
-name defaults to the current machine. In the second form the machine name can
-be specified. This is useful when it is more convenient to specify the local
-error action in a sub-definition that is used to construct the machine
-definition that the local error action is associated with. To embed local error
-actions and explicitly state the machine definition on which the transfer is to
-happen use `(name, action)` as the action.
-
-===== Example
-
-The following example uses error actions to report an error and jump to a
-machine that consumes the remainder of the line when parsing fails. After
-consuming the line, the error recovery machine returns to the main loop.
-
-//////////////////////////
-% GENERATE: erract
-% %%{
-% machine erract;
-% ws = ' ';
-% address = 'foo AT bar..com';
-% date = 'Monday May 12';
-//////////////////////////
-----------------------------
-action cmd_err {
- printf( "command error\n" );
- fhold; fgoto line;
-}
-action from_err {
- printf( "from error\n" );
- fhold; fgoto line;
-}
-action to_err {
- printf( "to error\n" );
- fhold; fgoto line;
-}
-
-line := [^\n]* '\n' @{ fgoto main; };
-
-main := (
- (
- 'from' @err(cmd_err)
- ( ws+ address ws+ date '\n' ) $err(from_err) |
- 'to' @err(cmd_err)
- ( ws+ address '\n' ) $err(to_err)
- )
-)*;
-----------------------------
-//////////////////////
-% }%%
-% %% write data;
-% void f()
-% {
-% %% write init;
-% %% write exec;
-% }
-% END GENERATE
-//////////////////////
-
-
-=== Action Ordering and Duplicates
-
-When combining expressions that have embedded actions it is often the case that
-a number of actions must be executed on a single input character. For example,
-following a concatenation the leaving action of the left expression and the
-entering action of the right expression will be embedded into one transition.
-This requires a method of ordering actions that is intuitive and
-predictable for the user, and repeatable for the compiler.
-
-We associate with the embedding of each action a unique timestamp that is
-used to order actions that appear together on a single transition in the final
-state machine. To accomplish this, we recursively traverse the parse tree of
-regular expressions and assign timestamps to action embeddings. References to
-machine definitions are followed in the traversal. When we visit a
-parse tree node, we assign timestamps to all _entering_ action embeddings,
-recurse on the parse tree, then assign timestamps to the remaining _all_,
-_finishing_, and _leaving_ embeddings in the order in which they
-appear.
-
-By default Ragel does not permit a single action to appear multiple times in an action
-list. When the final machine has been created, actions that appear more than
-once in a single transition, to-state, from-state or EOF action list have their
-duplicates removed.
-The first appearance of the action is preserved. This is useful in a number of
-scenarios. First, it allows us to union machines with common prefixes without
-worrying about the action embeddings in the prefix being duplicated. Second, it
-prevents leaving actions from being transferred multiple times. This can
-happen when a machine is repeated, then followed with another machine that
-begins with a common character. For example:
-
-----------------
-word = [a-z]+ %act;
-main := word ( '\n' word )* '\n\n';
-----------------
-
-Note that Ragel does not compare action bodies to determine if they have
-identical program text. It simply checks for duplicates using each action
-block's unique location in the program.
-
-The removal of duplicates can be turned off using the `-d` option.
-
-[[vals]]
-=== Values and Statements Available in Code Blocks
-
-The following values are available in code blocks:
-
-* `fpc` -- A pointer to the current character. This is equivalent to
-accessing the `p` variable.
-
-* `fc` -- The current character. This is equivalent to the expression `(*p)`.
-
-* `fcurs` -- An integer value representing the current state. This
-value should only be read from. To move to a different place in the machine
-from action code use the `fgoto`, `fnext` or `fcall` statements. Outside of the
-machine execution code the `cs` variable may be modified.
-
-* `ftargs` -- An integer value representing the target state. This
-value should only be read from. Again, `fgoto`, `fnext` and
-`fcall` can be used to move to a specific entry point.
-
-* `fentry(<label>)` -- Retrieve an integer value representing the
-entry point `label`. The integer value returned will be a compile time
-constant. This number is suitable for later use in control flow transfer
-statements that take an expression. This value should not be compared against
-the current state because any given label can have multiple states representing
-it. The value returned by `fentry` can be any one of the multiple states that
-it represents.
-
-The following statements are available in code blocks:
-
-* `fhold;` -- Do not advance over the current character. If processing
-data in multiple buffer blocks, the `fhold` statement should only be used
-once in the set of actions executed on a character. Multiple calls may result
-in backing up over the beginning of the buffer block. The `fhold`
-statement does not imply any transfer of control. It is equivalent to the
-`p--;` statement.
-
-* `fexec <expr>;` -- Set the next character to process. This can be
-used to backtrack to previous input or advance ahead. Unlike `fhold`, which can
-be used anywhere, `fexec` requires the user to ensure that the target of the
-backtrack is in the current buffer block or is known to be somewhere ahead of
-it. The machine will continue iterating forward until `pe` is arrived at,
-`fbreak` is called or the machine moves into the error state. In actions
-embedded into transitions, the `fexec` statement is equivalent to setting `p`
-to one position ahead of the next character to process. If the user also
-modifies `pe`, it is possible to change the buffer block entirely.
-
-* `fgoto <label>;` -- Jump to an entry point defined by
-`<label>`. The `fgoto` statement immediately transfers control to
-the destination state.
-
-* `fgoto *<expr>;` -- Jump to an entry point given by `<expr>`.
-The expression must evaluate to an integer value representing a state.
-
-* `fnext <label>;` -- Set the next state to be the entry point defined
-by `label`. The `fnext` statement does not immediately jump to the specified
-state. Any action code following the statement is executed.
-
-* `fnext *<expr>;` -- Set the next state to be the entry point given
-by `<expr>`. The expression must evaluate to an integer value representing
-a state.
-
-* `fcall <label>;` -- Push the target state and jump to the entry
-point defined by `<label>`. The next `fret` will jump to the target
-of the transition on which the call was made. Use of `fcall` requires
-the declaration of a call stack. An array of integers named `stack` and a
-single integer named `top` must be declared. With the `fcall`
-construct, control is immediately transferred to the destination state.
-See <<modularization, Parser Modularization>> for more information.
-
-* `fcall *<expr>;` -- Push the current state and jump to the entry
-point given by `<expr>`. The expression must evaluate to an integer value
-representing a state.
-
-* `fret;` -- Return to the target state of the transition on which the
-last `fcall` was made. Use of `fret` requires the declaration of a
-call stack. Control is immediately transferred to the destination state.
-
-* `fbreak;` -- Advance `p`, save the target state to `cs`
-and immediately break out of the execute loop. This statement is useful in
-conjunction with the `noend` write option. Rather than process input until `pe`
-is arrived at, the `fbreak` statement can be used to stop processing from an
-action. After an `fbreak` statement the `p` variable will point to the next
-character in the input. The current state will be the target of the current
-transition. Note that `fbreak` causes the target state's to-state actions to be
-skipped.
-
-Once actions with control-flow commands are embedded into a
-machine, the user must exercise caution when using the machine as the operand
-to other machine construction operators. If an action jumps to another state
-then unioning any transition that executes that action with another transition
-that follows some other path will cause that other path to be lost. Using
-commands that manually jump around a machine takes us out of the domain of
-regular languages because transitions that the
-machine construction operators are not aware of are introduced. These
-commands should therefore be used with caution.
-
-
-[[controlling_nondeterminism]]
-== Controlling Nondeterminism
-
-Along with the flexibility of arbitrary action embeddings comes a need to
-control nondeterminism in regular expressions. If a regular expression is
-ambiguous, then sub-components of a parser other than the intended parts may become
-active. This means that actions that are irrelevant to the
-current subset of the parser may be executed, causing problems for the
-programmer.
-
-Tools that are based on regular expression engines and that are used for
-recognition tasks will usually function as intended regardless of the presence
-of ambiguities. It is quite common for users of scripting languages to write
-regular expressions that are heavily ambiguous and it generally does not
-matter. As long as one of the potential matches is recognized, there can be any
-number of other matches present. In some parsing systems the run-time engine
-can employ a strategy for resolving ambiguities, for example always pursuing
-the longest possible match and discarding others.
-
-In Ragel, there is no regular expression run-time engine, just a simple state
-machine execution model. When we begin to embed actions and face the
-possibility of spurious action execution, it becomes clear that controlling
-nondeterminism at the machine construction level is very important. Consider
-the following example.
-
-////////////////////////
-% GENERATE: lines1
-% OPT: -p
-% %%{
-% machine lines1;
-% action first {}
-% action tail {}
-% word = [a-z]+;
-////////////////////////
----------------------------
-ws = [\n\t ];
-line = word $first ( ws word $tail )* '\n';
-lines = line*;
----------------------------
-////////////////////////
-% main := lines;
-% }%%
-% END GENERATE
-////////////////////////
-
-image::lines1.png[align="left"]
-
-Since the `ws` expression includes the newline character, we will
-not finish the `line` expression when a newline character is seen. We will
-simultaneously pursue the possibility of matching further words on the same
-line and the possibility of matching a second line. Evidence of this fact is
-in the state tables. On several transitions both the `first` and
-`tail` actions are executed. The solution here is simple: exclude
-the newline character from the `ws` expression.
-
-///////////////////////////
-% GENERATE: lines2
-% OPT: -p
-% %%{
-% machine lines2;
-% action first {}
-% action tail {}
-% word = [a-z]+;
-///////////////////////////
-----------------------
-ws = [\t ];
-line = word $first ( ws word $tail )* '\n';
-lines = line*;
-----------------------
-///////////////////////////
-% main := lines;
-% }%%
-% END GENERATE
-///////////////////////////
-
-image::lines2.png[align="left"]
-
-Solving this kind of problem is straightforward when the ambiguity is created
-by strings that are a single character long. When the ambiguity is created by
-strings that are multiple characters long we have a more difficult problem. The
-following example is an incorrect attempt at a regular expression for C
-language comments.
-
-/////////////////////////////
-% GENERATE: comments1
-% OPT: -p
-% %%{
-% machine comments1;
-% action comm {}
-/////////////////////////////
--------------------------
-comment = '/*' ( any @comm )* '*/';
-main := comment ' ';
--------------------------
-/////////////////////////////
-% }%%
-% END GENERATE
-/////////////////////////////
-
-image::comments1.png[align="left"]
-
-Using standard concatenation, we will never leave the `any*` expression.
-We will forever entertain the possibility that a `'*/'` string that we see
-is contained in a longer comment and that, simultaneously, the comment has
-ended. The concatenation of the `comment` machine with `SP` is done
-to show this. When we match space, we are also still matching the comment body.
-
-One way to approach the problem is to exclude the terminating string from the
-`any*` expression using set difference. We must be careful to exclude not just
-the terminating string, but any string that contains it as a substring. A
-verbose, but proper specification of a C comment parser is given by the
-following regular expression.
-
-////////////////////////
-% GENERATE: comments2
-% OPT: -p
-% %%{
-% machine comments2;
-% action comm {}
-////////////////////////
--------------------
-comment = '/*' ( ( any @comm )* - ( any* '*/' any* ) ) '*/';
--------------------
-////////////////////////
-% main := comment;
-% }%%
-% END GENERATE
-////////////////////////
-
-image::comments2.png[align="left"]
-
-Note that Ragel's strong subtraction operator `--` can also be used here.
-In doing this subtraction we have phrased the problem of controlling
-non-determinism in terms of excluding strings common to two expressions that
-interact when combined. We can also phrase the problem in terms of the
-transitions of the state machines that implement these expressions. During the
-concatenation of `any*` and `'*/'` we will be making transitions that are
-composed of both the loop of the first expression and the final character of
-the second. At this time we want the transition on the `'/'` character to take
-precedence over and disallow the transition that originated in the `any*` loop.
-
-In another parsing problem, we wish to implement a lightweight tokenizer that
-we can utilize in the composition of a larger machine. For example, some HTTP
-headers have a token stream as a sub-language. The following example is an
-attempt at a regular expression-based tokenizer that does not function
-correctly due to unintended nondeterminism.
-
-/////////////////////////////
-% GENERATE: smallscanner
-% OPT: -p
-% %%{
-% machine smallscanner;
-% action start_str {}
-% action on_char {}
-% action finish_str {}
-/////////////////////////////
---------------------------
-header_contents = (
- lower+ >start_str $on_char %finish_str |
- ' '
-)*;
---------------------------
-/////////////////////////////
-% main := header_contents;
-% }%%
-% END GENERATE
-/////////////////////////////
-
-image::smallscanner.png[align="left"]
-
-In this case, the problem with using a standard kleene star operation is that
-there is an ambiguity between extending a token and wrapping around the machine
-to begin a new token. Using the standard operator, we get an undesirable
-nondeterministic behaviour. Evidence of this can be seen on the transition out
-of state one, back to itself. The transition extends the string, and
-simultaneously, finishes the string only to immediately begin a new one. What
-is required is for the transitions that represent an extension of a token to
-take precedence over the transitions that represent the beginning of a new
-token. For this problem there is no simple solution that uses standard regular
-expression operators.
-
-=== Priorities
-
-A priority mechanism was devised and built into the determinization process,
-specifically for the purpose of allowing the user to control nondeterminism.
-Priorities are integer values embedded into transitions. When the
-determinization process is combining transitions that have different
-priorities, the transition with the higher priority is preserved and the
-transition with the lower priority is dropped.
-
-Unfortunately, priorities can have unintended side effects because their
-operation requires that they linger in transitions indefinitely. They must
-linger because the Ragel program cannot know when the user is finished with a
-priority embedding. A solution whereby they are explicitly deleted after use
-is conceivable; however this is not very user-friendly. Priorities were
-therefore made into named entities. Only priorities with the same name are
-allowed to interact. This allows any number of priorities to coexist in one
-machine for the purpose of controlling various different regular expression
-operations and eliminates the need to ever delete them. Such a scheme allows
-the user to choose a unique name, embed two different priority values using
-that name and be confident that the priority embedding will be free of any side
-effects.
-
-In the first form of priority embedding, the name defaults to the name of the
-machine definition that the priority is assigned in. In this sense priorities
-are by default local to the current machine definition or instantiation. Beware
-of using this form in a longest-match machine, since there is only one name for
-the entire set of longest match patterns. In the second form the priority's
-name can be specified, allowing priority interaction across machine definition
-boundaries.
-
-* `expr > int` -- Sets starting transitions to have priority int.
-* `expr @ int` -- Sets transitions that go into a final state to have priority int.
-* `expr $ int` -- Sets all transitions to have priority int.
-* `expr % int` -- Sets leaving transitions to
-have priority int. When a transition is made going out of the machine (either
-by concatenation or kleene star) its priority is immediately set to the leaving
-priority.
-
-The second form of priority assignment allows the programmer to specify the
-name to which the priority is assigned.
-
-* `expr > (name, int)` -- Starting transitions.
-* `expr @ (name, int)` -- Finishing transitions (into a final state).
-* `expr $ (name, int)` -- All transitions.
-* `expr % (name, int)` -- Leaving transitions.
-
-=== Guarded Operators that Encapsulate Priorities
-
-Priority embeddings are a very expressive mechanism. At the same time they
-can be very confusing for the user. They force the user to imagine
-the transitions inside two interacting expressions and work out the precise
-effects of the operations between them. When we consider
-that this problem is worsened by the
-potential for side effects caused by unintended priority name collisions, we
-see that exposing the user to priorities is undesirable.
-
-Fortunately, in practice the use of priorities has been necessary only in a
-small number of scenarios. This allows us to encapsulate their functionality
-into a small set of operators and fully hide them from the user. This is
-advantageous from a language design point of view because it greatly simplifies
-the design.
-
-Going back to the C comment example, we can now properly specify it using a
-guarded concatenation operator which we call _finish-guarded concatenation_.
-From the user's point of view, this operator terminates the first machine when
-the second machine moves into a final state. It chooses a unique name and uses
-it to embed a low priority into all transitions of the first machine. A higher
-priority is then embedded into the transitions of the second machine that enter
-into a final state. The following example yields a machine identical to the
-example in the section on <<controlling_nondeterminism,Controlling
-Nondeterminism>>.
-
-----------------
-comment = '/*' ( any @comm )* :>> '*/';
-----------------
-
-image::comments2.png[align="left"]
-
-Another guarded operator is _left-guarded concatenation_, given by the
-`<:` compound symbol. This operator places a higher priority on all
-transitions of the first machine. This is useful if one must forcibly separate
-two lists that contain common elements. For example, one may need to tokenize a
-stream, but first consume leading whitespace.
-
-Ragel also includes a _longest-match kleene star_ operator, given by the `**`
-compound symbol. This guarded operator embeds a high priority into all
-transitions of the machine. A lower priority is then embedded into the leaving
-transitions. When the kleene star operator makes the epsilon transitions from
-the final states into the new start state, the lower priority will be
-transferred to the epsilon transitions. In cases where following an epsilon
-transition out of a final state conflicts with an existing transition out of a
-final state, the epsilon transition will be dropped.
-
-Other guarded operators are conceivable, such as guards on union that cause one
-alternative to take precedence over another. These may be implemented when it
-is clear they constitute a frequently used operation. In the next section we
-discuss the explicit specification of state machines using state charts.
-
-==== Entry-Guarded Concatenation
-
---------------------
-expr :> expr
---------------------
-
-This operator concatenates two machines, but first assigns a low
-priority to all transitions
-of the first machine and a high priority to the starting transitions of the
-second machine. This operator is useful if from the final states of the first
-machine it is possible to accept the characters in the entering transitions of
-the second machine. This operator effectively terminates the first machine
-immediately upon starting the second machine, where otherwise they would be
-pursued concurrently. In the following example, entry-guarded concatenation is
-used to move out of a machine that matches everything at the first sign of an
-end-of-input marker.
-
-/////////////////////////////
-% GENERATE: entryguard
-% OPT: -p
-% %%{
-% machine entryguard;
-.code
-/////////////////////////////
---------------------
-# Leave the catch-all machine on the first character of FIN.
-main := any* :> 'FIN';
---------------------
-/////////////////////////////
-% }%%
-% END GENERATE
-/////////////////////////////
-
-image::entryguard.png[align="left"]
-
-Entry-guarded concatenation is equivalent to the following:
-
------------------
-expr $(unique_name,0) . expr >(unique_name,1)
------------------
-
-==== Finish-Guarded Concatenation
-
--------------------
-expr :>> expr
--------------------
-
-This operator is
-like the previous operator, except the higher priority is placed on the final
-transitions of the second machine. This is useful if one wishes to entertain
-the possibility of continuing to match the first machine right up until the
-second machine enters a final state. In other words, it terminates the first
-machine only when the second accepts. In the following example, finish-guarded
-concatenation causes the move out of the machine that matches everything to be
-delayed until the full end-of-input marker has been matched.
-
-////////////////////////
-% GENERATE: finguard
-% OPT: -p
-% %%{
-% machine finguard;
--------------------------
-# Leave the catch-all machine on the last character of FIN.
-main := any* :>> 'FIN';
--------------------------
-% }%%
-% END GENERATE
-////////////////////////
-
-image::finguard.png[align="left"]
-
-Finish-guarded concatenation is equivalent to the following, with one
-exception. If the right machine's start state is final, the higher priority is
-also embedded into it as a leaving priority. This prevents the left machine
-from persisting via the zero-length string.
-
--------------------
-expr $(unique_name,0) . expr @(unique_name,1)
--------------------
-
-==== Left-Guarded Concatenation
-
--------------------
-expr <: expr
--------------------
-
-This operator places a higher priority on the left expression. It is useful if
-you want to prefix a sequence with another sequence composed of some of the
-same characters. For example, one can consume leading whitespace before
-tokenizing a sequence of whitespace-separated words as in:
-
-////////////////////////////
-% GENERATE: leftguard
-% OPT: -p
-% %%{
-% machine leftguard;
-% action alpha {}
-% action ws {}
-% action start {}
-% action fin {}
-////////////////////////////
---------------------------
-main := ( ' '* >start %fin ) <: ( ' ' $ws | [a-z] $alpha )*;
---------------------------
-////////////////////////////
-% }%%
-% END GENERATE
-////////////////////////////
-
-image::leftguard.png[align="left"]
-
-Left-guarded concatenation is equivalent to the following:
-
-----------------
-expr $(unique_name,1) . expr >(unique_name,0)
-----------------
-
-[[longest_match_kleene_star]]
-==== Longest-Match Kleene Star
-
---------
-expr**
---------
-
-This version of kleene star puts a higher priority on staying in the
-machine versus wrapping around and starting over. The LM kleene star is useful
-when writing simple tokenizers. These machines are built by applying the
-longest-match kleene star to an alternation of token patterns, as in the
-following.
-
-//////////////////////
-% GENERATE: lmkleene
-% OPT: -p
-% %%{
-% machine exfinpri;
-% action A {}
-% action B {}
-//////////////////////
-----------------------
-# Repeat tokens, but make sure to get the longest match.
-main := (
- lower ( lower | digit )* %A |
- digit+ %B |
- ' '
-)**;
-----------------------
-//////////////////////
-% }%%
-% END GENERATE
-//////////////////////
-
-image::lmkleene.png[align="left"]
-
-If a regular kleene star were used the machine above would not be able to
-distinguish between extending a word and beginning a new one. This operator is
-equivalent to:
-
-------------
-( expr $(unique_name,1) %(unique_name,0) )*
-------------
-
-When the kleene star is applied, transitions that go out of the machine and
-back into it are made. These are assigned a priority of zero by the leaving
-transition mechanism. This is less than the priority of one assigned to the
-transitions leaving the final states but not leaving the machine. When
-these transitions clash on the same character, the
-transition that stays in the machine takes precedence. The transition
-that wraps around is dropped.
-
-Note that this operator does not build a scanner in the traditional sense
-because there is never any backtracking. To build a scanner with backtracking
-use the Longest-Match machine construction described in
-<<generating_scanners, Generating Scanners>>.
-
-Interface to Host Program
--------------------------
-
-The Ragel code generator is very flexible. The generated code has no
-dependencies and can be inserted in any function, perhaps inside a loop if
-desired. The user is responsible for declaring and initializing a number of
-required variables, including the current state and the pointer to the input
-stream. These can live in any scope. Control of the input processing loop is
-also possible: the user may break out of the processing loop and return to it
-at any time.
-
-In the case of the C, D, Go and OCaml host languages, Ragel is able to generate very
-fast-running code that implements state machines as directly executable code.
-Since very large files strain the host language compiler, table-based code
-generation is also supported. In the future, we hope to provide a partitioned,
-directly executable format that is able to reduce the burden on the host
-compiler by splitting large machines across multiple functions.
-
-In the case of Java and Ruby, table-based code generation is the only code
-style supported. In the future, this may be expanded to include other code
-styles.
-
-Ragel can be used to parse input in one block, or it can be used to parse input
-in a sequence of blocks as it arrives from a file or socket. Parsing the input
-in a sequence of blocks brings with it a few responsibilities. If the parser
-utilizes a scanner, care must be taken to not break the input stream anywhere
-but token boundaries. If pointers to the input stream are taken during
-parsing, care must be taken to not use a pointer that has been invalidated by
-movement to a subsequent block. If the current input data pointer is moved
-backwards it must not be moved past the beginning of the current block.
-
-The following example shows a simple Ragel program that does not have any
-actions. The example tests the first argument of the program against a number
-pattern and then prints the machine's acceptance status.
-
-.A basic Ragel example without any actions.
-----------------------
-#include <stdio.h>
-#include <string.h>
-%%{
- machine foo;
- write data;
-}%%
-int main( int argc, char **argv )
-{
- int cs;
- if ( argc > 1 ) {
- char *p = argv[1];
- char *pe = p + strlen( p );
- %%{
- main := [0-9]+ ( '.' [0-9]+ )?;
-
- write init;
- write exec;
- }%%
- }
- printf("result = %i\n", cs >= foo_first_final );
- return 0;
-}
-----------------------
-
-[[variables_used_by_ragel]]
-=== Variables Used by Ragel
-
-There are a number of variables that Ragel expects the user to declare. At a
-very minimum the `cs`, `p` and `pe` variables must be declared. In Go, Java,
-Ruby and OCaml code the `data` variable must also be declared. If EOF
-actions are used then the `eof` variable is required. If stack-based state
-machine control flow statements are used then the `stack` and `top` variables
-are required. If a scanner is declared then the `act`, `ts` and `te` variables
-must be declared.
-
-* `cs` - Current state. This must be an integer and it should persist
-across invocations of the machine when the data is broken into blocks that are
-processed independently. This variable may be modified from outside the
-execution loop, but not from within.
-
-* `p` - Data pointer. In C/C++/D code this variable is expected to be a
-pointer to the character data to process. It should be initialized to the
-beginning of the data block on every run of the machine. In C++ it is possible
-to use an object for the data pointers. The object should support comparison,
-dereferencing, and pre/post increment/decrement operations. In Go, Java, Ruby
-and OCaml it is used as an offset to `data` and must be an integer. In this
-case it should be initialized to zero on every run of the machine.
-
-* `pe` - Data end pointer. This should be initialized to `p` plus
-the data length on every run of the machine. In C++ this can be defined as one
-increment operation past the last valid value. Or in the case of a zero-length
-input buffer the initial value of `p`. A In Go, Java, Ruby and OCaml code this
-should be initialized to the data length.
-
-* `eof` - End of file pointer. This should be set to `pe` when
-the buffer block being processed is the last one, otherwise it should be set to
-null. In Go, Java, Ruby and OCaml code `-1` must be used instead of null. If the EOF
-event can be known only after the final buffer block has been processed, then
-it is possible to set `p = pe = eof` and run the execute block.
-
-* `data` - This variable is only required in Go, Java, Ruby and OCaml code. It
-must be an array containing the data to process.
-
-* `stack` - This must be an array of integers. It is used to store
-integer values representing states. If the stack must resize dynamically the
-<<prepush, Pre-push>> and <<postpop, Post-Pop>> statements can be used to do
-this.
-
-* `top` - This must be an integer value and will be used as an offset
-to `stack`, giving the next available spot on the top of the stack.
-
-* `act` - This must be an integer value. It is a variable sometimes
-used by scanner code to keep track of the most recent successful pattern match.
-
-* `ts` - This must be a pointer to character data. In Go, Java, Ruby and
-OCaml code this must be an integer. See <<generating_scanners, Generating
-Scanners>> for more information.
-
-* `te` - Also a pointer to character data.
-
-* `nfa_bp` - NFA backtrack points. Only necessary if NFA features are used.
-This var must be an array of records containing at least a state and alphtype
-pointer.
-+
----------
-struct nfa_bp_rec
-{
- long state;
- char *p;
-};
-
-struct nfa_bp_rec nfa_bp[1024];
----------
-+
-This array can be guarded against overflow or dynamically resized using
-`nfaprepush` and `nfapostpop` code blocks. If the `:nfa()` construct is being
-used, the record also needs to contain an `long pop;` field, and `long q_N`
-fields for each instance of the `:nfa()` construct, where N corresponds to the
-id specified in the `:nfa()` construct.
-
-* `nfa_len` - Number of active nfa records. Must be initialized to zero.
-
-* `nfa_count` - Number of NFA pops that have occurred. This is only for
-tracking purposes.
-
-=== Alphtype Statement
-
-------------------
-alphtype unsigned int;
-------------------
-
-The alphtype statement specifies the alphabet data type that the machine
-operates on. During the compilation of the machine, integer literals are
-expected to be in the range of possible values of the alphtype. The default is
-`char` for all languages except Go where the default is `byte` and OCaml where
-the default is `int`.
-
-C/C++/Objective-C:
-----------------------
- char unsigned char
- short unsigned short
- int unsigned int
- long unsigned long
-----------------------
-
-Go:
-----------------------
- byte
- int8 uint8
- int16 uint16
- int32 uint32
- int
-----------------------
-
-Ruby:
-----------------------
- char
- int
-----------------------
-
-
-Java:
-----------------------
- char
- byte
- short
- int
-----------------------
-
-D:
-----------------------
- char
- byte ubyte
- short ushort
- wchar
- int uint
- dchar
-----------------------
-
-OCaml:
-----------------------
- int
-----------------------
-
-=== Getkey Statement
-
-------------------
-getkey fpc->id;
-------------------
-
-This statement specifies to Ragel how to retrieve the current character from
-from the pointer to the current element (`p`). Any expression that returns
-a value of the alphabet type
-may be used. The getkey statement may be used for looking into element
-structures or for translating the character to process. The getkey expression
-defaults to `(*p)`. In goto-driven machines the getkey expression may be
-evaluated more than once per element processed, therefore it should not incur a
-large cost nor preclude optimization.
-
-=== Access Statement
-
-----------------
-access fsm->;
-----------------
-
-The access statement specifies how the generated code should access the machine
-data that is persistent across processing buffer blocks. This applies to all
-variables except `p`, `pe` and `eof`. This includes `cs`, `top`, `stack`, `ts`,
-`te` and `act`. The access statement is useful if a machine is to be
-encapsulated inside a structure in C code. It can be used to give the name of a
-pointer to the structure.
-
-=== Variable Statement
-
-------------------
-variable p fsm->p;
-------------------
-
-The variable statement specifies how to access a specific variable. All of the
-variables that are declared by the user and used by Ragel can be changed. This
-includes `p`, `pe`, `eof`, `cs`, `top`, `stack`, `ts`, `te` and `act`. In Go,
-Ruby, Java and OCaml code generation the `data` variable can also be changed.
-
-[[prepush]]
-=== Pre-Push Statement
-
-------------------
-prepush {
- /* stack growing code */
-}
-------------------
-
-The prepush statement allows the user to supply stack management code that is
-written out during the generation of fcall, immediately before the current
-state is pushed to the stack. This statement can be used to test the number of
-available spaces and dynamically grow the stack if necessary.
-
-[[postpop]]
-=== Post-Pop Statement
-
-------------------
-postpop {
- /* stack shrinking code */
-}
-------------------
-
-The postpop statement allows the user to supply stack management code that is
-written out during the generation of fret, immediately after the next state is
-popped from the stack. This statement can be used to dynamically shrink the
-stack.
-
-[[write_statement]]
-=== Write Statement
-
-------------------
-write <component> [options];
-------------------
-
-The write statement is used to generate parts of the machine.
-There are seven
-components that can be generated by a write statement. These components make up the
-state machine's data, initialization code, execution code, and export definitions.
-A write statement may appear before a machine is fully defined.
-This allows one to write out the data first then later define the machine where
-it is used. See the <<fbreak_example, fbreak Example>>.
-
-==== Write Data
-
-------------------
-write data [options];
-------------------
-
-The write data statement causes Ragel to emit the constant static data needed
-by the machine. In table-driven output styles (see
-<<genout,Generated Output Style>>) this is a collection of arrays that
-represent the states and transitions of the machine. In goto-driven machines
-much less data is emitted. At the very minimum a start state `name_start`
-is generated. All variables written out in machine data have both the
-`static` and `const` properties and are prefixed with the name of the
-machine and an underscore. The data can be placed inside a class, inside a
-function, or it can be defined as global data.
-
-Two variables are written that may be used to test the state of the machine
-after a buffer block has been processed. The `name_error` variable gives
-the id of the state that the machine moves into when it cannot find a valid
-transition to take. The machine immediately breaks out of the processing loop when
-it finds itself in the error state. The error variable can be compared to the
-current state to determine if the machine has failed to parse the input. If the
-machine is complete, that is from every state there is a transition to a proper
-state on every possible character of the alphabet, then no error state is required
-and this variable will be set to -1.
-
-The `name_first_final` variable stores the id of the first final state.
-All of the machine's states are sorted by their final state status before
-having their ids assigned. Checking if the machine has accepted its input can
-then be done by checking if the current state is greater-than or equal to the
-first final state.
-
-Data generation has several options:
-
-* `noerror` - Do not generate the integer
-variable that gives the id of the error state.
-
-* `nofinal` - Do not generate the integer variable
-that gives the id of the first final state.
-
-* `noprefix` - Do not prefix the variable names with the name of the machine.
-
-[[fbreak_example]]
-.Use of `noend` write option and the `fbreak` statement for processing a string.
-------------------------
-#include <stdio.h>
-%% machine foo;
-%% write data;
-int main( int argc, char **argv )
-{
- int cs, res = 0;
- if ( argc > 1 ) {
- char *p = argv[1];
- %%{
- main :=
- [a-z]+
- 0 @{ res = 1; fbreak; };
- write init;
- write exec noend;
- }%%
- }
- printf("execute = %i\n", res );
- return 0;
-}
-------------------------
-
-==== Write Start, First Final and Error
-
-------------------
-write start;
-write first_final;
-write error;
-------------------
-
-These three write statements provide an alternative means of accessing the
-`start`, `first_final` and `error` states. If there are many different machine
-specifications in one file it is easy to get the prefix for these wrong. This
-is especially true if the state machine boilerplate is frequently made by a
-copy-paste-edit process. These write statements allow the problem to be
-avoided. They can be used as follows:
-
--------------------
-/* Did parsing succeed? */
-if ( cs < %%{ write first_final; }%% ) {
- result = ERR_PARSE_ERROR;
- goto fail;
-}
--------------------
-
-==== Write Init
-
--------------------
-write init [options];
--------------------
-
-The write init statement causes Ragel to emit initialization code. This should
-be executed once before the machine is started. At a very minimum this sets the
-current state to the start state. If other variables are needed by the
-generated code, such as call stack variables or scanner management
-variables, they are also initialized here.
-
-The `nocs` option to the write init statement will cause ragel to skip
-intialization of the cs variable. This is useful if the user wishes to use
-custom logic to decide which state the specification should start in.
-
-==== Write Exec
-
----------------------------
-write exec [options];
----------------------------
-
-The write exec statement causes Ragel to emit the state machine's execution
-code. Ragel expects several variables to be available to this code. At a very
-minimum, the generated code needs access to the current character position `p`,
-the ending position `pe` and the current state `cs` (though `pe` can be omitted
-using the `noend` write option). The `p` variable is the cursor that the
-execute code will used to traverse the input. The `pe` variable should be set
-up to point to one position past the last valid character in the buffer.
-
-Other variables are needed when certain features are used. For example using
-the `fcall` or `fret` statements requires `stack` and `top` variables to be
-defined. If a longest-match construction is used, variables for managing
-backtracking are required.
-
-The write exec statement has one option. The `noend` option tells Ragel to
-generate code that ignores the end position `pe`. In this case the user must
-explicitly break out of the processing loop using `fbreak`, otherwise the
-machine will continue to process characters until it moves into the error
-state. This option is useful if one wishes to process a null terminated string.
-Rather than traverse the string to discover then length before processing the
-input, the user can break out when the null character is seen. The
-<<fbreak_example, fbreak Example>> shows the use of the `noend` write option and the
-`fbreak` statement for processing a string.
-
-[[export,Write Exports]]
-==== Write Exports
-
--------------------
-write exports;
--------------------
-
-The export feature can be used to export simple machine definitions. Machine definitions
-are marked for export using the `export` keyword.
-
--------------------
-export machine_to_export = 0x44;
--------------------
-
-When the write exports statement is used these machines are written out in the
-generated code. Defines are used for C and constant integers are used for D,
-Java, Ruby and OCaml. See <<import, Importing Definitions>> for a description
-of the import statement.
-
-=== Maintaining Pointers to Input Data
-
-In the creation of any parser it is not uncommon to require the collection of
-the data being parsed. It is always possible to collect data into a growable
-buffer as the machine moves over it, however the copying of data is a somewhat
-wasteful use of processor cycles. The most efficient way to collect data from
-the parser is to set pointers into the input then later reference them. This
-poses a problem for uses of Ragel where the input data arrives in blocks, such
-as over a socket or from a file. If a pointer is set in one buffer block but
-must be used while parsing a following buffer block, some extra consideration
-to correctness must be made.
-
-The scanner constructions exhibit this problem, requiring the maintenance code
-described in <<generating_scanners, Generating Scanners>>. If a longest-match
-construction has been used somewhere in the machine then it is possible to take
-advantage of the required prefix maintenance code in the driver program to
-ensure pointers to the input are always valid. If laying down a pointer one can
-set `ts` at the same spot or ahead of it. When data is shifted in between loops
-the user must also shift the pointer. In this way it is possible to maintain
-pointers to the input that will always be consistent.
-
-In general, there are two approaches for guaranteeing the consistency of
-pointers to input data. The first approach is the one just described; lay down
-a marker from an action, then later ensure that the data the marker points to
-is preserved ahead of the buffer on the next execute invocation. This approach
-is good because it allows the parser to decide on the pointer-use boundaries,
-which can be arbitrarily complex parsing conditions. A downside is that it
-requires any pointers that are set to be corrected in between execute
-invocations.
-
-The alternative is to find the pointer-use boundaries before invoking the execute
-routine, then pass in the data using these boundaries. For example, if the
-program must perform line-oriented processing, the user can scan backwards from
-the end of an input block that has just been read in and process only up to the
-first found newline. On the next input read, the new data is placed after the
-partially read line and processing continues from the beginning of the line.
-An example of line-oriented processing is given below.
-
-[[line_oriented]]
-.An example of line-oriented processing.
-------------------
- int have = 0;
- while ( 1 ) {
- char *p, *pe, *data = buf + have;
- int len, space = BUFSIZE - have;
-
- if ( space == 0 ) {
- fprintf(stderr, "BUFFER OUT OF SPACE\n");
- exit(1);
- }
-
- len = fread( data, 1, space, stdin );
- if ( len == 0 )
- break;
-
- /* Find the last newline by searching backwards. */
- p = buf;
- pe = data + len - 1;
- while ( *pe != '\n' && pe >= buf )
- pe--;
- pe += 1;
-
- %% write exec;
-
- /* How much is still in the buffer? */
- have = data + len - pe;
- if ( have > 0 )
- memmove( buf, pe, have );
-
- if ( len < space )
- break;
- }
----------------
-
-=== Specifying the Host Language
-
-The `ragel` program has a number of options for specifying the host
-language. The host-language options are:
-
-* `-C` for C/C++/Objective-C code (default)
-* `--asm` or `--gas-x86-64-sys-v` for GNU AS, x86_64, System V ABI.
-* `-D` for D code.
-* `-Z` for Go code.
-* `-J` for Java code.
-* `-R` for Ruby code.
-* `-A` for C# code.
-* `-O` for OCaml code.
-* `-U` for Rust
-* `-Y` for Julia
-* `-K` for Crack
-* `-P` for JavaScript
-
-[[genout]]
-=== Choosing a Generated Code Style
-
-There are three styles of code output to choose from. Code style affects the
-size and speed of the compiled binary. Changing code style does not require any
-change to the Ragel program. There are two table-driven formats and a goto
-driven format.
-
-In addition to choosing a style to emit, there are various levels of action
-code reuse to choose from. The maximum reuse levels (`-T0`, `-F0`
-and `-G0`) ensure that no FSM action code is ever duplicated by encoding
-each transition's action list as static data and iterating
-through the lists on every transition. This will normally result in a smaller
-binary. The less action reuse options (`-T1`, `-F1` and `-G1`)
-will usually produce faster running code by expanding each transition's action
-list into a single block of code, eliminating the need to iterate through the
-lists. This duplicates action code instead of generating the logic necessary
-for reuse. Consequently the binary will be larger. However, this tradeoff applies to
-machines with moderate to dense action lists only. If a machine's transitions
-frequently have less than two actions then the less reuse options will actually
-produce both a smaller and a faster running binary due to less action sharing
-overhead. The best way to choose the appropriate code style for your
-application is to perform your own tests.
-
-The table-driven FSM represents the state machine as constant static data. There are
-tables of states, transitions, indices and actions. The current state is
-stored in a variable. The execution is simply a loop that looks up the current
-state, looks up the transition to take, executes any actions and moves to the
-target state. In general, the table-driven FSM can handle any machine, produces
-a smaller binary and requires a less expensive host language compile, but
-results in slower running code. Since the table-driven format is the most
-flexible it is the default code style.
-
-The flat table-driven machine is a table-based machine that is optimized for
-small alphabets. Where the regular table machine uses the current character as
-the key in a binary search for the transition to take, the flat table machine
-uses the current character as an index into an array of transitions. This is
-faster in general, however is only suitable if the span of possible characters
-is small.
-
-The goto-driven FSM represents the state machine using goto and switch
-statements. The execution is a flat code block where the transition to take is
-computed using switch statements and directly executable binary searches. In
-general, the goto FSM produces faster code but results in a larger binary and a
-more expensive host language compile.
-
-The goto-driven format has an additional action reuse level (`-G2`) that writes
-actions directly into the state transitioning logic rather than putting all the
-actions together into a single switch. Generally this produces faster running
-code because it allows the machine to encode the current state using the
-processor's instruction pointer. Again, sparse machines may actually compile to
-smaller binaries when `-G2` is used due to less state and action management
-overhead. For many parsing applications `-G2` is the preferred output format.
-
-Code Output Style Options
-
-* `-T0` - binary search table-driven
-* `-T1` - binary search, expanded actions
-* `-F0` - flat table-driven
-* `-F1` - flat table, expanded actions
-* `-G0` - goto-driven
-* `-G1` - goto, expanded actions
-* `-G2` - goto, in-place actions
-
-Due to limitations of the host languages, not all styles are supported for all
-host languages. The following table shows what is supported.
-
-[width="50%"]
-|===========================================
-| C | `-T0 -T1 -F0 -F1 -G0 -G1 -G2`
-| ASM | `-G2`
-| D | `-T0 -T1 -F0 -F1 -G0 -G1 -G2`
-| Go | `-T0 -T1 -F0 -F1 -G0 -G1 -G2`
-| C# | `-T0 -T1 -F0 -F1 -G0 -G1`
-| Java | `-T0 -T1 -F0 -F1`
-| Ruby | `-T0 -T1 -F0 -F1`
-| OCaml | `-T0 -T1 -F0 -F1`
-| Rust | `-T0 -T1 -F0 -F1`
-| Julia | `-T0 -T1 -F0 -F1`
-| Crack | `-T0 -T1 -F0 -F1`
-| JavaScript | `-T0 -T1 -F0 -F1`
-|===========================================
-
-Beyond the Basic Model
-----------------------
-
-[[modularization]]
-=== Parser Modularization
-
-It is possible to use Ragel's machine construction and action embedding
-operators to specify an entire parser using a single regular expression. In
-many cases this is the desired way to specify a parser in Ragel. However, in
-some scenarios the language to parse may be so large that it is difficult to
-think about it as a single regular expression. It may also shift between distinct
-parsing strategies, in which case modularization into several coherent blocks
-of the language may be appropriate.
-
-It may also be the case that patterns that compile to a large number of states
-must be used in a number of different contexts and referencing them in each
-context results in a very large state machine. In this case, an ability to reuse
-parsers would reduce code size.
-
-To address this, distinct regular expressions may be instantiated and linked
-together by means of a jumping and calling mechanism. This mechanism is
-analogous to the jumping to and calling of processor instructions. A jump
-command, given in action code, causes control to be immediately passed to
-another portion of the machine by way of setting the current state variable. A
-call command causes the target state of the current transition to be pushed to
-a state stack before control is transferred. Later on, the original location
-may be returned to with a return statement. In the following example, distinct
-state machines are used to handle the parsing of two types of headers.
-
-///////////////////////
-% GENERATE: call
-% %%{
-% machine call;
-///////////////////////
-----------------------
-action return { fret; }
-action call_date { fcall date; }
-action call_name { fcall name; }
-
-# A parser for date strings.
-date := [0-9][0-9] '/'
- [0-9][0-9] '/'
- [0-9][0-9][0-9][0-9] '\n' @return;
-
-# A parser for name strings.
-name := ( [a-zA-Z]+ | ' ' )** '\n' @return;
-
-# The main parser.
-headers =
- ( 'from' | 'to' ) ':' @call_name |
- ( 'departed' | 'arrived' ) ':' @call_date;
-
-main := headers*;
-----------------------
-//////////////////////
-% }%%
-% %% write data;
-% void f()
-% {
-% %% write init;
-% %% write exec;
-% }
-% END GENERATE
-//////////////////////
-
-Calling and jumping should be used carefully as they are operations that take
-one out of the domain of regular languages. A machine that contains a call or
-jump statement in one of its actions should be used as an argument to a machine
-construction operator only with considerable care. Since DFA transitions may
-actually represent several NFA transitions, a call or jump embedded in one
-machine can inadvertently terminate another machine that it shares prefixes
-with. Despite this danger, theses statements have proven useful for tying
-together sub-parsers of a language into a parser for the full language,
-especially for the purpose of modularizing code and reducing the number of
-states when the machine contains frequently recurring patterns.
-
-The section on <<vals, Embedded Values and Statements>> describes the jump and
-call statements that are used to transfer control. These statements make use of
-two variables that must be declared by the user, `stack` and `top`. The `stack`
-variable must be an array of integers and `top` must be a single integer, which
-will point to the next available space in `stack`. The <<prepush, Pre-Push>>
-and <<postpop, Post-Pop>> statements which can be used to implement a
-dynamically resizable array.
-
-[[labels]]
-=== Referencing Names
-
-This section describes how to reference names in epsilon transitions
-(<<state_charts, State Charts>>) and action-based control-flow statements such
-as `fgoto`. There is a hierarchy of names implied in a Ragel specification. At
-the top level are the machine instantiations. Beneath the instantiations are
-labels and references to machine definitions. Beneath those are more labels and
-references to definitions, and so on.
-
-Any name reference may contain multiple components separated with the `::`
-compound symbol. The search for the first component of a name reference is
-rooted at the join expression that the epsilon transition or action embedding
-is contained in. If the name reference is not contained in a join,
-the search is rooted at the machine definition that the epsilon transition or
-action embedding is contained in. Each component after the first is searched
-for beginning at the location in the name tree that the previous reference
-component refers to.
-
-In the case of action-based references, if the action is embedded more than
-once, the local search is performed for each embedding and the result is the
-union of all the searches. If no result is found for action-based references then
-the search is repeated at the root of the name tree. Any action-based name
-search may be forced into a strictly global search by prefixing the name
-reference with `::`.
-
-The final component of the name reference must resolve to a unique entry point.
-If a name is unique in the entire name tree it can be referenced as is. If it
-is not unique it can be specified by qualifying it with names above it in the
-name tree. However, it can always be renamed.
-
-///////////////
-% FIXME: Should fit this in somewhere.
-% Some kinds of name references are illegal. Cannot call into longest-match
-% machine, can only call its start state. Cannot make a call to anywhere from
-% any part of a longest-match machine except a rule's action. This would result
-% in an eventual return to some point inside a longest-match other than the
-% start state. This is banned for the same reason a call into the LM machine is
-% banned.
-///////////////
-
-[[generating_scanners]]
-=== Scanners
-
-Scanners are very much intertwined with regular-languages and their
-corresponding processors. For this reason Ragel supports the definition of
-scanners. The generated code will repeatedly attempt to match patterns from a
-list, favouring longer patterns over shorter patterns. In the case of
-equal-length matches, the generated code will favour patterns that appear ahead
-of others. When a scanner makes a match it executes the user code associated
-with the match, consumes the input then resumes scanning.
-
-------------------
-<machine_name> := |*
- pattern1 => action1;
- pattern2 => action2;
- ...
- *|;
-------------------
-
-On the surface, Ragel scanners are similar to those defined by Lex. Though
-there is a key distinguishing feature: patterns may be arbitrary Ragel
-expressions and can therefore contain embedded code. With a Ragel-based scanner
-the user need not wait until the end of a pattern before user code can be
-executed.
-
-Scanners can be used to process sub-languages, as well as for tokenizing
-programming languages. In the following example a scanner is used to tokenize
-the contents of a header field.
-
-------------------------
-word = [a-z]+;
-head_name = 'Header';
-
-header := |*
- word;
- ' ';
- '\n' => { fret; };
-*|;
-
-main := ( head_name ':' @{ fcall header; } )*;
-------------------------
-
-The scanner construction has a purpose similar to the longest-match kleene star
-operator `**`. The key
-difference is that a scanner is able to backtrack to match a previously matched
-shorter string when the pursuit of a longer string fails. For this reason the
-scanner construction operator is not a pure state machine construction
-operator. It relies on several variables that enable it to backtrack and make
-pointers to the matched input text available to the user. For this reason
-scanners must be immediately instantiated. They cannot be defined inline or
-referenced by another expression. Scanners must be jumped to or called.
-
-Scanners rely on the `ts`, `te` and `act` variables to be present so that they
-can backtrack and make pointers to the matched text available to the user. If
-input is processed using multiple calls to the execute code then the user must
-ensure that when a token is only partially matched that the prefix is preserved
-on the subsequent invocation of the execute code.
-
-The `ts` variable must be defined as a pointer to the input data.
-It is used for recording where the current token match begins. This variable
-may be used in action code for retrieving the text of the current match. Ragel
-ensures that in between tokens and outside of the longest-match machines that
-this pointer is set to null. In between calls to the execute code the user must
-check if `ts` is set and if so, ensure that the data it points to is
-preserved ahead of the next buffer block. This is described in more detail
-below.
-
-The `te` variable must also be defined as a pointer to the input data.
-It is used for recording where a match ends and where scanning of the next
-token should begin. This can also be used in action code for retrieving the
-text of the current match.
-
-The `act` variable must be defined as an integer type. It is used for recording
-the identity of the last pattern matched when the scanner must go past a
-matched pattern in an attempt to make a longer match. If the longer match fails
-it may need to consult the `act` variable. In some cases, use of the `act`
-variable can be avoided because the value of the current state is enough
-information to determine which token to accept, however in other cases this is
-not enough and so the `act` variable is used.
-
-When the longest-match operator is in use, the user's driver code must take on
-some buffer management functions. The following algorithm gives an overview of
-the steps that should be taken to properly use the longest-match operator.
-
-* Read a block of input data.
-* Run the execute code.
-
-* If `ts` is set, the execute code will expect the incomplete
-token to be preserved ahead of the buffer on the next invocation of the execute
-code.
-
-** Shift the data beginning at `ts` and ending at `pe` to the
-beginning of the input buffer.
-
-** Reset `ts` to the beginning of the buffer.
-
-** Shift `te` by the distance from the old value of `ts`
-to the new value. The `te` variable may or may not be valid. There is
-no way to know if it holds a meaningful value because it is not kept at null
-when it is not in use. It can be shifted regardless.
-
-* Read another block of data into the buffer, immediately following any
-preserved data.
-
-* Run the scanner on the new data.
-
-The following listing shows the required handling of an input stream in which a
-token is broken by the input block boundaries. After processing up to and
-including the ``t'' of ``characters'', the prefix of the string token must be
-retained and processing should resume at the ``e'' on the next iteration of the
-execute code.
-
-----------------
- a) A stream "of characters" to be scanned.
- | | |
- p ts pe
-
- b) "of characters" to be scanned.
- | | |
- ts p pe
-----------------
-
-If one uses a large input buffer for collecting input then the number of times
-the shifting must be done will be small. Furthermore, if one takes care not to
-define tokens that are allowed to be very long and instead processes these
-items using pure state machines or sub-scanners, then only a small amount of
-data will ever need to be shifted.
-
-Following an invocation of the execute code there may be a partially matched
-token (a). The data of the partially matched token must be preserved ahead of
-the new data on the next invocation (b).
-
-Since scanners attempt to make the longest possible match of input, patterns
-such as identifiers require one character of lookahead in order to trigger a
-match. In the case of the last token in the input stream the user must ensure
-that the `eof` variable is set so that the final token is flushed out.
-
-The following is an an example scanner processing loop.
-
-.A processing loop for a scanner.
------------------
- int have = 0;
- bool done = false;
- while ( !done ) {
- /* How much space is in the buffer? */
- int space = BUFSIZE - have;
- if ( space == 0 ) {
- /* Buffer is full. */
- cerr << "TOKEN TOO BIG" << endl;
- exit(1);
- }
-
- /* Read in a block after any data we already have. */
- char *p = inbuf + have;
- cin.read( p, space );
- int len = cin.gcount();
-
- char *pe = p + len;
- char *eof = 0;
-
- /* If no data was read indicate EOF. */
- if ( len == 0 ) {
- eof = pe;
- done = true;
- }
-
- %% write exec;
-
- if ( cs == Scanner_error ) {
- /* Machine failed before finding a token. */
- cerr << "PARSE ERROR" << endl;
- exit(1);
- }
-
- if ( ts == 0 )
- have = 0;
- else {
- /* There is a prefix to preserve, shift it over. */
- have = pe - ts;
- memmove( inbuf, ts, have );
- te = inbuf + (te-ts);
- ts = inbuf;
- }
- }
------------------
-
-[[state_charts]]
-=== State Charts
-
-In addition to supporting the construction of state machines using regular
-languages, Ragel provides a way to manually specify state machines using
-state charts. The comma operator combines machines together without any
-implied transitions. The user can then manually link machines by specifying
-epsilon transitions with the `->` operator. Epsilon transitions are drawn
-between the final states of a machine and entry points defined by labels. This
-makes it possible to build machines using the explicit state-chart method while
-making minimal changes to the Ragel language.
-
-An interesting feature of Ragel's state chart construction method is that it
-can be mixed freely with regular expression constructions. A state chart may be
-referenced from within a regular expression, or a regular expression may be
-used in the definition of a state chart transition.
-
-==== Join
-
------------------
-expr , expr , ...
------------------
-
-Join a list of machines together without
-drawing any transitions, without setting up a start state, and without
-designating any final states. Transitions between the machines may be specified
-using labels and epsilon transitions. The start state must be explicity
-specified with the ``start'' label. Final states may be specified with an
-epsilon transition to the implicitly created ``final'' state. The join
-operation allows one to build machines using a state chart model.
-
-==== Label
-
-------------------
-label: expr
-------------------
-
-Attaches a label to an expression. Labels can be
-used as the target of epsilon transitions and explicit control transfer
-statements such as `fgoto` and `fnext` in action
-code.
-
-==== Epsilon
-
------------------
-expr -> label
------------------
-
-Draws an epsilon transition to the state defined by `label`. Epsilon
-transitions are made deterministic when join operators are evaluated. Epsilon
-transitions that are not in a join operation are made deterministic when the
-machine definition that contains the epsilon is complete. See <<labels,
-Referencing Names>> for information on referencing labels.
-
-==== Simplifying State Charts
-
-There are two benefits to providing state charts in Ragel. The first is that it
-allows us to take a state chart with a full listing of states and transitions
-and simplify it in selective places using regular expressions.
-
-The state chart method of specifying parsers is very common. It is an
-effective programming technique for producing robust code. The key disadvantage
-becomes clear when one attempts to comprehend a large parser specified in this
-way. These programs usually require many lines, causing logic to be spread out
-over large distances in the source file. Remembering the function of a large
-number of states can be difficult and organizing the parser in a sensible way
-requires discipline because branches and repetition present many file layout
-options. This kind of programming takes a specification with inherent
-structure such as looping, alternation and concatenation and expresses it in a
-flat form.
-
-If we could take an isolated component of a manually programmed state chart,
-that is, a subset of states that has only one entry point, and implement it
-using regular language operators then we could eliminate all the explicit
-naming of the states contained in it. By eliminating explicitly named states
-and replacing them with higher-level specifications we simplify a state machine
-specification.
-
-For example, sometimes chains of states are needed, with only a small number of
-possible characters appearing along the chain. These can easily be replaced
-with a concatenation of characters. Sometimes a group of common states
-implement a loop back to another single portion of the machine. Rather than
-manually duplicate all the transitions that loop back, we may be able to
-express the loop using a kleene star operator.
-
-Ragel allows one to take this state map simplification approach. We can build
-state machines using a state map model and implement portions of the state map
-using regular languages. In place of any transition in the state machine,
-entire sub-machines can be given. These can encapsulate functionality
-defined elsewhere. An important aspect of the Ragel approach is that when we
-wrap up a collection of states using a regular expression we do not lose
-access to the states and transitions. We can still execute code on the
-transitions that we have encapsulated.
-
-[[down]]
-==== Dropping Down One Level of Abstraction
-
-The second benefit of incorporating state charts into Ragel is that it permits
-us to bypass the regular language abstraction if we need to. Ragel's action
-embedding operators are sometimes insufficient for expressing certain parsing
-tasks. In the same way that is useful for C language programmers to drop down
-to assembly language programming using embedded assembler, it is sometimes
-useful for the Ragel programmer to drop down to programming with state charts.
-
-In the following example, we wish to buffer the characters of an XML CDATA
-sequence. The sequence is terminated by the string `]]>`. The challenge
-in our application is that we do not wish the terminating characters to be
-buffered. An expression of the form `any* @buffer :>> ']]>'` will not work
-because the buffer will always contain the characters `]]` on the end.
-Instead, what we need is to delay the buffering of `]`
-characters until a time when we
-abandon the terminating sequence and go back into the main loop. There is no
-easy way to express this using Ragel's regular expression and action embedding
-operators, and so an ability to drop down to the state chart method is useful.
-
-///////////////////////
-% GENERATE: dropdown
-% OPT: -p
-% %%{
-% machine dropdown;
-///////////////////////
------------------------
-action bchar { buff( fpc ); } # Buffer the current character.
-action bbrack1 { buff( "]" ); }
-action bbrack2 { buff( "]]" ); }
-
-CDATA_body =
-start: (
- ']' -> one |
- (any-']') @bchar ->start
-),
-one: (
- ']' -> two |
- [^\]] @bbrack1 @bchar ->start
-),
-two: (
- '>' -> final |
- ']' @bbrack1 -> two |
- [^>\]] @bbrack2 @bchar ->start
-);
------------------------
-//////////////////
-% main := CDATA_body;
-% }%%
-% END GENERATE
-//////////////////
-
-image::dropdown.png[align="left"]
-
-[[semantic]]
-=== Semantic Conditions
-
-==== Semantic Conditions
-
-Many communication protocols contain variable-length fields, where the length
-of the field is given ahead of the field as a value. This
-problem cannot be expressed using regular languages because of its
-context-dependent nature. The prevalence of variable-length fields in
-communication protocols motivated us to introduce semantic conditions into
-the Ragel language.
-
-A semantic condition is a block of user code that is interpreted as an
-expression and evaluated immediately
-before a transition is taken. If the code returns a value of true, the
-transition may be taken. We can now embed code that extracts the length of a
-field, then proceed to match $n$ data values.
-
-//////////////////
-% GENERATE: conds1
-% OPT: -p
-% %%{
-% machine conds1;
-% number = digit+;
-//////////////////
-----------------------
-action rec_num { i = 0; n = getnumber(); }
-action test_len { i++ < n }
-data_fields = (
- 'd'
- [0-9]+ %rec_num
- ':'
- ( [a-z] when test_len )*
-)**;
-----------------------
-///////////////////////////
-% main := data_fields;
-% }%%
-% END GENERATE
-///////////////////////////
-
-image::conds1.png[align="left"]
-
-The Ragel implementation of semantic conditions does not force us to give up the
-compositional property of Ragel definitions. For example, a machine that tests
-the length of a field using conditions can be unioned with another machine
-that accepts some of the same strings, without the two machines interfering with
-one another. The user need not be concerned about whether or not the result of the
-semantic condition will affect the matching of the second machine.
-
-To see this, first consider that when a user associates a condition with an
-existing transition, the transition's label is translated from the base character
-to its corresponding value in the space that represents ``condition $c$ true''. Should
-the determinization process combine a state that has a conditional transition
-with another state that has a transition on the same input character but
-without a condition, then the condition-less transition first has its label
-translated into two values, one to its corresponding value in the space that
-represents ``condition $c$ true'' and another to its corresponding value in the
-space that represents ``condition $c$ false''. It
-is then safe to combine the two transitions. This is shown in the following
-example. Two intersecting patterns are unioned, one with a condition and one
-without. The condition embedded in the first pattern does not affect the second
-pattern.
-
-/////////////////////////
-% GENERATE: conds2
-% OPT: -p
-% %%{
-% machine conds2;
-% number = digit+;
-/////////////////////////
--------------------------
-action test_len { i++ < n }
-action one { /* accept pattern one */ }
-action two { /* accept pattern two */ }
-patterns =
- ( [a-z] when test_len )+ %one |
- [a-z][a-z0-9]* %two;
-main := patterns '\n';
--------------------------
-//////////////////////////
-% }%%
-% END GENERATE
-//////////////////////////
-
-image::conds2.png[align="left"]
-
-There are many more potential uses for semantic conditions. The user is free to
-use arbitrary code and may therefore perform actions such as looking up names
-in dictionaries, validating input using external parsing mechanisms or
-performing checks on the semantic structure of input seen so far. In the next
-section we describe how Ragel accommodates several common parser engineering
-problems.
-
-The semantic condition feature works only with alphabet types that are smaller
-in width than the `long` type. To implement semantic conditions Ragel
-needs to be able to allocate characters from the alphabet space. Ragel uses
-these allocated characters to express "character C with condition P true" or "C
-with P false." Since internally Ragel uses longs to store characters there is
-no room left in the alphabet space unless an alphabet type smaller than long is
-used.
-
-==== Condition Based Repetition
-
-New in ragel 7 is a construct designed to simplify the use of conditions to
-control repetition. While possible to count properly using condition embedding
-operators, there is a corner case that proves difficult. If the zero-width case
-is to be accepted properly, some knowledge of what is before and after is
-necessary and tests need to be embedded there. The `:cond()` operator is
-designed to solve this problem, automatically embedding actions such that a
-zero count is possible.
-
---------------
-:cond( <expression>,
- <init>, <inc>, <min> [ , <max> ] ):
---------------
-
-* `expression` is the ragel-expression that is to be repeated.
-
-* `init` is an action that is used to initialize the state.
-
-* `inc` is an action that is used to increment the count.
-
-* `min` is a condition used to test if the minimal count has been achieved.
-
-* `max` is an optional condition used to test if the maximum count has been achieved.
-
-The `:cond` construct is a synonym for `:condstar`. Ragel also supports
-`:condplus`, which does not allow the zero-width case. There must be at least
-one item.
-
-=== NFA Features
-
-New in Ragel 7 are features for specifying non-deterministic machines. Prior to
-this, ragel always produced DFAs, so these features represent a departure
-from the standard runtime model. They allow us to cope with very large state
-machines, without having to compromise on the language matched, or devise some
-techniques outside of the ragel language.
-
-The NFA features require that the programmer make available `nfa_bp`, `nfa_len`
-and `nfa_count` variables. See the section <<variables_used_by_ragel, Varibles
-Used by Ragel>>
-
-Note that NFA features are non compatible with non-contiguous buffer blocks. It
-requires being able to set p to some previously stashed value. If the buffer is
-no longer available, ragel will attempt to backtrack into invalid data.
-
-==== NFA Union
-
-The NFA union operator allows the programmer to create a large union of
-expressions without making the construction fully deterministic. Only a prefix
-is made deterministic, in a breadth-first manner, and up to a specified depth.
-
---------------
-<name> |= ( <depth>, <group-size> ) <expression1> | <expression2> | ... ;
---------------
-
-The NFA union operator looks like a normal union, except some additional
-controls are specified. The depth value says how deep into the machine the
-NFA-to-DFA algorithm should go looking for NFA states to make deterministic.
-The group-size value allows the programmer to set a limit on the number of
-expressions that are packed into a single DFA prefix. If this value is zero,
-there is no limit, and all expressions go into a single group. If this number
-is less than the number of expressions, there will be an immediate NFA branch
-for the groups, before entering into the DFA prefixes. Roughly speaking,
-group-size allows you to control initial branch width.
-
-By making depth or groups-size smaller, you can shift cost from compile-time to
-run-time to get otherwise intractable unions to build.
-
-==== NFA Repetition
-
-The NFA repetition construct `:nfa()` is designed to allow counting of objects
-in an ambiguous context. While the `:cond()` construct is an efficent counting
-method, it does not work properly when there are ambiguities around starting,
-repeating, or exiting a repetition. Since condition tests are embedding into
-transitions, they are made deterministic, therefore the repetition could be
-simultaneously incrementing and exiting the repetition. In these cases the NFA
-repetition is useful, because it will explore all possibilites and allow use
-user to preserve and restore state. To pay for this, you must be willing to
-accept backtracking, however, and also give up non-congiguous buffer blocks.
-
-While the NFA repetition construct is designed for counting, it is useful for
-much more. It generalizes to a state capture/manipulation/test construct that
-works regardless of ambiguities. For example, if there are two possible data
-captures, the :nfa union operator can backtrack to try an alternate. It before
-backtracking it can restore relevant state, then perform the new capture.
-
---------------
-:nfa( <expression>,
- <push>, <pop>,
- <init>, <stay>, <repeat>, <exit> ):
---------------
-
-* `expression` is the ragel-expression that is to be repeated.
-
-* `push` is an action that saves state needed for counting instances of expression.
-
-* `pop` is a condition action that restores state needed for counting instances
-of expression. This is an expression due to a current limitation of the
-implementation. In the future theis may change to a plain action.
-
-* `init` is an condition that is tested before entering the repeat.
-
-* `stay` is an condition that is tested before deciding to stay in the
-expression, as opposed to wrapping around to begin a new expression, or
-exiting. If true, this allows the expression to continue to match, once a final
-state has been entered.
-
-* `repeat` is an condition that is tested before deciding to wrap around and
-attempt a new instance of expression.
-
-* `exit` is an condition that is tested before deciding to leave the repeat
-
-The following actions achieve a generalized repetition that looks for 10 to 100
-repetitions of some expression. It requires that the count variable be placed
-in the nfa_bp struct. Note that it is correct to reference nfa_len. It does not
-need to manipulated, that is done by the generated code.
-
-----------------
-action push { nfa_bp[nfa_len].count = count; }
-action pop { ({ count = nfa_bp[nfa_len].count; 1; }) }
-action init { ({ count = 0; 1; }) }
-action stay { ({ 1; }) }
-action repeat { ({ ++count < 100; }) }
-action exit { ({ ++count >= 10; }) }
-------------------
-
-The following actions, when used with expression `''`, achieve a test of a
-previously stored value (eg. backref). In the exit the stashed value is checked
-against the data at p.
-
-----------------
-action init { ({ 1; }) }
-action stay { ({ 0; }) }
-action repeat { ({ 0; }) }
-action exit
-{ ({
- int match = check_stashed( p, pe, stashed_start, stashed_end );
- if ( match )
- p += ( stashed_end - stashed_start );
- match;
-}) }
-------------------
-
-=== Implementing Lookahead
-
-There are a few strategies for implementing lookahead in Ragel programs.
-Leaving actions, which are described in <<out_actions, Leaving Actions>>, can be
-used as a form of lookahead. Ragel also provides the `fhold` directive
-which can be used in actions to prevent the machine from advancing over the
-current character. It is also possible to manually adjust the current character
-position by shifting it backwards using `fexec`, however when this is
-done, care must be taken not to overstep the beginning of the current buffer
-block. In both the use of `fhold` and `fexec` the user must be
-cautious of combining the resulting machine with another in such a way that the
-transition on which the current position is adjusted is not combined with a
-transition from the other machine.
-
-=== Parsing Recursive Language Structures
-
-In general Ragel cannot handle recursive structures because the grammar is
-interpreted as a regular language. However, depending on what needs to be
-parsed it is sometimes practical to implement the recursive parts using manual
-coding techniques. This often works in cases where the recursive structures are
-simple and easy to recognize, such as in the balancing of parentheses.
-
-One approach to parsing recursive structures is to use actions that increment
-and decrement counters or otherwise recognize the entry to and exit from
-recursive structures and then jump to the appropriate machine defnition using
-`fcall` and `fret`. Alternatively, semantic conditions can be used to
-test counter variables.
-
-A more traditional approach is to call a separate parsing function (expressed
-in the host language) when a recursive structure is entered, then later return
-when the end is recognized.