author | Eric S. Raymond <esr@thyrsus.com> | 2020-10-11 05:14:41 -0400 |
---|---|---|
committer | Eric S. Raymond <esr@thyrsus.com> | 2020-10-11 05:14:41 -0400 |
commit | 1094c2a1370defc57ea14d2398a72f15ce4f9c43 (patch) | |
tree | 9951b36955c3d41943116b4637fdda925d536ec5 /doc | |
parent | 3158c7f0721f8558a1acdcf5c2dad12be0fa0e2e (diff) | |
download | flex-git-1094c2a1370defc57ea14d2398a72f15ce4f9c43.tar.gz |
Revise Flex manual to say useful things about multilanguage support.
No specific thing can be said about a non-C/C++ backend yet, but
this patch prepares the way by explaining which features and aspects
of the Flex interface are specific to C/C++.
It also fixes one pre-ANSI prototype - that of non-reentrant yylex(),
which should be declared yylex(void) in this day and age.
These changes make one new commitment. Because the YY_INPUT
macro is impossible to port out of the C/C++ context, I have noted that
it is probably extinct in the wild (due to the later introduction of
the multi-buffer primitives, though I don't say that explicitly). The
text says that people who really need the equivalent of this capability
in a non-C/C++ back end should file an issue with the Flex maintainers.
I don't actually expect this to happen.
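As an illustration of the prototype fix described above: before C23, a declaration written `int yylex()` leaves the parameter list unspecified, while `int yylex(void)` states that the function takes no arguments, so the compiler can reject a stray call such as `yylex(42)`. The following is only a minimal sketch, assuming a hand-written stand-in for the scanner; the names `input` and `cursor` are hypothetical and are not part of Flex, and a real `yylex()` would be generated by flex rather than written by hand.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a flex-generated scanner. It is not Flex
   output; it only demonstrates the declaration style. */
static const char *input = "ab";   /* hypothetical fixed input */
static const char *cursor = NULL;  /* hypothetical read position */

/* Full prototype: the function takes no arguments. */
int yylex(void);

/* Returns the next character as a token code, or 0 at end of input
   (0 is also what a flex scanner returns by default at end-of-file). */
int yylex(void)
{
    if (cursor == NULL)
        cursor = input;        /* lazy initialization on first call */
    assert(cursor != NULL);
    if (*cursor == '\0')
        return 0;              /* "all done" */
    return (unsigned char)*cursor++;
}
```

Calling `yylex()` repeatedly from your own `main()` until it returns 0 walks through the input one character at a time.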
Diffstat (limited to 'doc')
-rw-r--r-- | doc/flex.texi | 141 |
1 files changed, 107 insertions, 34 deletions
diff --git a/doc/flex.texi b/doc/flex.texi index 836018a..90c8aa5 100644 --- a/doc/flex.texi +++ b/doc/flex.texi @@ -306,13 +306,22 @@ GitHub's issue tracking facility at @url{https://github.com/westes/flex/issues/} program which recognizes lexical patterns in text. The @code{flex} program reads the given input files, or its standard input if no file names are given, for a description of a scanner to generate. The -description is in the form of pairs of regular expressions and C code, -called @dfn{rules}. @code{flex} generates as output a C source file, -@file{lex.yy.c} by default, which defines a routine @code{yylex()}. -This file can be compiled and linked with the flex runtime library to +description is in the form of pairs of regular expressions and +fragments of source code +called @dfn{rules}. @code{flex} generates as output a source file +in your target language which defines a routine @code{yylex()}. +This file can be compiled and (if you are using the C/C++ back end) +optionally linked with the flex runtime library to produce an executable. When the executable is run, it analyzes its input for occurrences of the regular expressions. Whenever it finds -one, it executes the corresponding C code. +one, it executes the corresponding rule code. + +When your target language is C, the name of the generated scanner +@file{lex.yy.c} by default. Other languages will glue the suffix they +normally use for source-code files to the prefix @file{lex.yy}. + +The examples in this manual are in C, which is Flex's default target +language and until release 4.6.2 its only one. @node Simple Examples, Format, Introduction, Top @chapter Some Simple Examples @@ -1104,7 +1113,8 @@ a time) to its output. 
@cindex %array, use of @cindex %pointer, use of @vindex yytext -Note that @code{yytext} can be defined in two different ways: either as +Note that in languages with only fixed-extent arrays (like C/C+) +@code{yytext} can be defined in two different ways: either as a character @emph{pointer} or as a character @emph{array}. You can control which definition @code{flex} uses by including one of the special directives @code{%pointer} or @code{%array} in the first @@ -1152,12 +1162,20 @@ run-time error results. Also note that you cannot use @code{%array} with C++ scanner classes (@pxref{Cxx}). +In target langages with automatic memory allocation and arrays none of +this applies; you can expect @code{yytext} to dynamically resize +itself, calls to the @code{unput()}will not destroy the present +contents of @code{yytext}, and you will never get a run-time error +from calls to the @code{unput()} function destroys the present +contents of @code{yytext} except in the extremely unlikely case that +your scanner cannot allocate more memory. + @node Actions, Generated Scanner, Matching, Top @chapter Actions @cindex actions Each pattern in a rule has a corresponding @dfn{action}, which can be -any arbitrary C statement. The pattern ends at the first non-escaped +any arbitrary target-language statement. The pattern ends at the first non-escaped whitespace character; the remainder of the line is its action. If the action is empty, then when the pattern is matched the input token is simply discarded. For example, here is the specification for a program @@ -1379,7 +1397,9 @@ back-to-front. 
@cindex %pointer, and unput() @cindex unput(), and %pointer -An important potential problem when using @code{unput()} is that if you +An important potential problem when using @code{unput()} in C (and, +generally, in target languages with C-like manual memory management) is +that if you are using @code{%pointer} (the default), a call to @code{unput()} @emph{destroys} the contents of @code{yytext}, starting with its rightmost character and devouring one character to the left with each @@ -1452,32 +1472,35 @@ Input Buffers}) @code{yyterminate()} can be used in lieu of a return statement in an action. It terminates the scanner and returns a 0 to the scanner's caller, indicating ``all done''. By default, @code{yyterminate()} is -also called when an end-of-file is encountered. It is a macro and may -be redefined. +also called when an end-of-file is encountered. It may be redefined. + +When the target language is C, @code{yyterminate()} is a macro. +Redefining it using the C preprocessor in your definitions section +is allowed, but not recommended as doing this makes code more +difficult to port out of C. In other target languages you can define +@code{yyterminate()} as a function; Flex will notice this and not +generate a default into the scanner. @node Generated Scanner, Start Conditions, Actions, Top @chapter The Generated Scanner @cindex yylex(), in generated scanner -The output of @code{flex} is the file @file{lex.yy.c}, which contains +The output of @code{flex} is a file wity the sem name @file{lex.yy}, which contains the scanning routine @code{yylex()}, a number of tables used by it for matching tokens, and a number of auxiliary routines and macros. By -default, @code{yylex()} is declared as follows: +default in C, @code{yylex()} is declared as follows: @example @verbatim - int yylex() - { + int yylex(void) { ... various definitions and the actions in here ... 
- } + } @end verbatim @end example @cindex yylex(), overriding, %yydecl -(If your environment supports function prototypes, then it will be -@code{int yylex( void )}.) This definition may be changed with the -the @code{%yydecl} directive. For example, you could put this in -among your directives: +This definition may be changed with the the @code{%yydecl} directive. +For example, you could put this in among your directives: @cindex yylex, overriding the prototype of @example @@ -1499,6 +1522,11 @@ traditional definitions support added extra complexity in the skeleton file. For this reason, current versions of @code{flex} generate standard C99 code only, leaving K&R-style functions to the historians. +In other languages, @code{yylex()} will be generated as a reentrant +function with a scanner context argument added. This can be enabled +in C as well, and specifying your C scanners to be rrentrant is +recommended for portability. + @cindex stdin, default for yyin @cindex yyin Whenever @code{yylex()} is called, it scans tokens from the global input @@ -1512,10 +1540,9 @@ one of its actions executes a @code{return} statement. If the scanner reaches an end-of-file, subsequent calls are undefined unless either @file{yyin} is pointed at a new input file (in which case scanning continues from that file), or @code{yyrestart()} is called. -@code{yyrestart()} takes one argument, a @code{FILE *} pointer (which -can be NULL, if you've set up @code{YY_INPUT} to scan from a source other -than @code{yyin}), and initializes @file{yyin} for scanning from that -file. Essentially there is no difference between just assigning +@code{yyrestart()} takes one argument, an inoput stream, and +initializes @file{yyin} for scanning from that +stream. 
Essentially there is no difference between just assigning @file{yyin} to a new input file or using @code{yyrestart()} to do so; the latter is available for compatibility with previous versions of @code{flex}, and because it can be used to switch input files in the @@ -1525,13 +1552,17 @@ better to use @code{YY_FLUSH_BUFFER} (@pxref{Actions}). Note that @code{yyrestart()} does @emph{not} reset the start condition to @code{INITIAL} (@pxref{Start Conditions}). +In C, an input stream is a a @code{FILE *} pointer. This pointer +can be NULL, if you've set up @code{YY_INPUT} to scan from a source other +than @code{yyin}. + @cindex RETURN, within actions If @code{yylex()} stops scanning due to executing a @code{return} statement in one of the actions, the scanner may then be called again and it will resume scanning where it left off. @cindex YY_INPUT -By default (and for purposes of efficiency), the scanner uses +By default (and for purposes of efficiency), C/C++ scanners use block-reads rather than simple @code{getc()} calls to read characters from @file{yyin}. The nature of how it gets its input can be controlled by defining the @code{YY_INPUT} macro. The calling sequence for @@ -1561,6 +1592,13 @@ section of the input file): This definition will change the input processing to occur one character at a time. +YY_INPUT is not available in target languages other than C/C++. It +dates from a time in the 1970s when efficiency optimizations were a +far more pressing problem than they are today, and is probably extinct +in the wild. If lack of it poses a problem for a port you are doing, +file an issue report with the Flex mauntainers and we will attempt to +assist you. + @cindex yywrap() When the scanner receives an end-of-file indication from YY_INPUT, it then checks the @code{yywrap()} function. If @code{yywrap()} returns @@ -2023,6 +2061,13 @@ associated with the given file and large enough to hold @code{size} characters (when in doubt, use @code{YY_BUF_SIZE} for the size). 
It returns a @code{YY_BUFFER_STATE} handle, which may then be passed to other routines (see below). + +In target languages other than C/C++, this prototype will look +different. The buffer-state object may be named differently, and will +not have ``struct'' as part of its name. The input-stream type won't +be @code{FILE *}. But expect the same semamntics wxpressed in native +tytypes. + @tindex YY_BUFFER_STATE The @code{YY_BUFFER_STATE} type is a pointer to an opaque @code{struct yy_buffer_state} structure, so you may @@ -2260,7 +2305,8 @@ by doing one of the following things: @findex YY_NEW_FILE (now obsolete) assigning @file{yyin} to a new input file (in previous versions of @code{flex}, after doing the assignment you had to call the special -action @code{YY_NEW_FILE}. This is no longer necessary.) +action @code{YY_NEW_FILE}. This is no longer necessary.) It is +still supported in the C/C++ back end only. @item executing a @code{return} statement; @@ -2437,7 +2483,7 @@ scanning the same input file. @vindex yyout @item FILE *yyout -is the file to which @code{yyecho()} actions are done. It can be reassigned +is the output stream to which @code{yyecho()} actions are done. It can be reassigned by the user. @vindex YY_CURRENT_BUFFER @@ -2481,6 +2527,11 @@ is @code{TOK_NUMBER}, part of the scanner might look like: @end verbatim @end example +Bison is also retargetable to langages other than C. Outside the +C/C++ back end, it is likely that your Bison module will simply +moduke-level constants that will be make visible to your scanner by +linkage. + @node Scanner Options, Performance, Yacc, Top @chapter Scanner Options @@ -2658,7 +2709,9 @@ implementation. Note that this does not mean @emph{full} compatibility. Use of this option costs a considerable amount of performance, and it cannot be used with the @samp{--c++}, @samp{--full}, @samp{--fast}, @samp{-Cf}, or @samp{-CF} options. For details on the compatibilities it provides, see -@ref{Lex and Posix}. 
This option also results in the name +@ref{Lex and Posix}. It will usuually be a no-op on back ends other +than C/C++. +This option also results in the name @code{YY_FLEX_LEX_COMPAT} being @code{#define}'d in the generated scanner. @@ -3138,7 +3191,7 @@ array look-up per character scanned). @opindex ---read @opindex read @item -Cr, --read, @code{%option read} -causes the generated scanner to @emph{bypass} use of the standard I/O +causes scanner generated with the C/C++ back end to @emph{bypass} use of the standard I/O library (@code{stdio}) for input. Instead of calling @code{fread()} or @code{getc()}, the scanner will use the @code{read()} system call, resulting in a performance gain which varies from system to system, but @@ -3147,7 +3200,9 @@ or @samp{-CF}. Using @samp{-Cr} can cause strange behavior if, for example, you read from @file{yyin} using @code{stdio} prior to calling the scanner (because the scanner will miss whatever text your previous reads left in the @code{stdio} input buffer). @samp{-Cr} has no effect -if you define @code{YY_INPUT()} (@pxref{Generated Scanner}). +if you define @code{YY_INPUT()} (@pxref{Generated Scanner}). It may +be a no-op or enable different optimizations in back ends other than +the default C/C++ one. @end table The options @samp{-Cf} or @samp{-CF} and @samp{-Cm} do not make sense @@ -3989,6 +4044,9 @@ control. The most common use for reentrant scanners is from within multi-threaded applications. Any thread may create and execute a reentrant @code{flex} scanner without the need for synchronization with other threads. +All scanners generated by back ends other than the default C/C++ back +end are reentrant. 
+ @menu * Reentrant Uses:: * Reentrant Overview:: @@ -4137,7 +4195,7 @@ Here are the things you need to do or know to use the reentrant C API of @node Specify Reentrant, Extra Reentrant Argument, Reentrant Detail, Reentrant Detail @subsection Declaring a Scanner As Reentrant - %option reentrant (--reentrant) must be specified. +When using the default C/C++ back end %option reentrant (--reentrant) must be specified. Notice that @code{%option reentrant} is specified in the above example (@pxref{Reentrant Example}. Had this option not been specified, @@ -4146,6 +4204,8 @@ complaining. You may explicitly specify @code{%option noreentrant}, if you do @emph{not} want a reentrant scanner, although it is not necessary. The default is to generate a non-reentrant scanner. +All scanners made with other back ends are reentrant. + @node Extra Reentrant Argument, Global Replacement, Specify Reentrant, Reentrant Detail @subsection The Extra Argument @@ -4174,9 +4234,12 @@ always named @code{yyscanner}. As you may have guessed, @code{yyscanner} is a pointer to an opaque data structure encapsulating the current state of the scanner. For a list of function declarations, see @ref{Reentrant Functions}. Note that preprocessor macros, such as -@code{yyebegin()}, @code{yyecho()}, and @code{yyreject()}, do not take this +@code{yybegin()}, @code{yyecho()}, and @code{yyreject()}, do not take this additional argument. +The type name @code{yscan_t} follows C conventions. It mmay differ in +other target languages. + @node Global Replacement, Init and Destroy Functions, Extra Reentrant Argument, Reentrant Detail @subsection Global Variables Replaced By Macros @@ -4228,6 +4291,9 @@ after @code{yylex}, respectively. @end verbatim @end example +(The type name and declaration syntax will be different in target +languages other thab C/C++.) + The function @code{yylex_init} must be called before calling any other function. 
The argument to @code{yylex_init} is the address of an uninitialized pointer to be filled in by @code{yylex_init}, overwriting @@ -4325,7 +4391,7 @@ The above code may be called from within an action like this: @end verbatim @end example -You may find that @code{%option header-file} is particularly useful for generating +In C and C++, you may find that @code{%option header-file} is particularly useful for generating prototypes of all the accessor functions. @xref{option-header}. @node Extra Data, About yyscan_t, Accessor Methods, Reentrant Detail @@ -4460,7 +4526,7 @@ The following Functions are available in a reentrant scanner: There are no ``set'' functions for yytext and yyleng. This is intentional. -The following Macro shortcuts are available in actions in a reentrant +In the C/C++ back end, the following macro shortcuts are available in actions in a reentrant scanner: @example @@ -4790,7 +4856,12 @@ override the default behavior. @node The Default Memory Management, Overriding The Default Memory Management, Memory Management, Memory Management @section The Default Memory Management -Flex allocates dynamic memory during initialization, and once in a while from +This section applies only to target languages wuth manual memory +allocation, including the default C/C++ back end. If your target +language has garbage collection you can igore it. + +A Flex-generated scanner +allocates dynamic memory during initialization, and once in a while from within a call to yylex(). Initialization takes place during the first call to yylex(). Thereafter, flex may reallocate more memory if it needs to enlarge a buffer. As of version 2.5.9 Flex will clean up all memory when you call @code{yylex_destroy} @@ -4980,6 +5051,8 @@ but none of them at the same time. The serialization feature allows the tables to be loaded at runtime, before scanning begins. The tables may be discarded when scanning is finished. +Note: This feature is available only when using the C/C++ back end. 
+ @menu * Creating Serialized Tables:: * Loading and Unloading Serialized Tables:: |