Diffstat (limited to 'doc/gawktexi.in')
-rw-r--r-- | doc/gawktexi.in | 347 |
1 files changed, 335 insertions, 12 deletions
diff --git a/doc/gawktexi.in b/doc/gawktexi.in index fdb688dd..2802bccf 100644 --- a/doc/gawktexi.in +++ b/doc/gawktexi.in @@ -55,8 +55,8 @@ @c These apply across the board. @set UPDATE-MONTH April, 2023 -@set VERSION 5.2 -@set PATCHLEVEL 2 +@set VERSION 5.3 +@set PATCHLEVEL 0 @set GAWKINETTITLE TCP/IP Internetworking with @command{gawk} @set GAWKWORKFLOWTITLE Participating in @command{gawk} Development @@ -68,7 +68,7 @@ @set TITLE GAWK: Effective AWK Programming @end ifclear @set SUBTITLE A User's Guide for GNU Awk -@set EDITION 5.2 +@set EDITION 5.3 @iftex @set DOCUMENT book @@ -584,6 +584,7 @@ particular records in a file and perform operations upon them. * Regexp Field Splitting:: Using regexps as the field separator. * Single Character Fields:: Making each character a separate field. +* Comma Separated Fields:: Working with CSV files. * Command Line Field Separator:: Setting @code{FS} from the command line. * Full Line Fields:: Making the full line be a single @@ -651,6 +652,7 @@ particular records in a file and perform operations upon them. Pipes. * Close Return Value:: Using the return value from @code{close()}. +* Noflush:: Speeding Up Pipe Output. * Nonfatal:: Enabling Nonfatal Output. * Output Summary:: Output summary. * Output Exercises:: Exercises. @@ -833,6 +835,8 @@ particular records in a file and perform operations upon them. shell. * Isnumeric Function:: A function to test whether a value is numeric. +* To CSV Function:: A function to convert output to CSV + format. * Data File Management:: Functions for managing command-line data files. * Filetrans Function:: A function for handling data file @@ -1490,6 +1494,10 @@ Document minimally and release. After eight years, add another part @code{egrep} and two more parts C. Document very well and release. + +After 35 more years, add Unicode and CSV support, sprinkle lightly with +a few choice features from @command{gawk}, document very well again, +and release. 
@end sidebar @cindex Aho, Alfred @@ -4083,6 +4091,19 @@ the program. The trace is printed to standard error. Each ``op code'' is preceded by a @code{+} sign in the output. +@item @option{-k} +@itemx @option{--csv} +@cindex @option{-k} option +@cindex @option{--csv} option +@cindex comma separated values (CSV) data @subentry @option{-k} option +@cindex comma separated values (CSV) data @subentry @option{--csv} option +@cindex CSV (comma separated values) data @subentry @option{-k} option +@cindex CSV (comma separated values) data @subentry @option{--csv} option +Enable special processing for files with comma separated values +(CSV). @xref{Comma Separated Fields}. +This option cannot be used with @option{--posix}. Attempting to do so +causes a fatal error. + @item @option{-l} @var{ext} @itemx @option{--load} @var{ext} @cindex @option{-l} option @@ -5555,6 +5576,25 @@ As of @value{PVERSION} 4.2, only two digits are processed. @end quotation +@cindex @code{\} (backslash) @subentry @code{\u} escape sequence +@cindex backslash (@code{\}) @subentry @code{\u} escape sequence +@cindex common extensions @subentry @code{\u} escape sequence +@cindex extensions @subentry common @subentry @code{\u} escape sequence +@item \u@var{hh}@dots{} +The hexadecimal value @var{hh}, where @var{hh} stands for a sequence +of hexadecimal digits (@samp{0}--@samp{9}, and either @samp{A}--@samp{F} +or @samp{a}--@samp{f}). A maximum of eight digits are allowed after +the @samp{\u}. Any further hexadecimal digits are treated as simple +letters or numbers. @value{COMMONEXT} +(The @samp{\u} escape sequence is not allowed in POSIX awk.) + +This escape sequence is intended for designating a character in the +Unicode character set. @command{gawk} first converts the given digits +into an integer and then translates the given ``wide character'' +value into the current locale's multibyte encoding (even if that +is not a Unicode locale). 
If the given bytes do not represent +a valid character, the value becomes @code{"?"}. + @cindex @code{\} (backslash) @subentry @code{\/} escape sequence @cindex backslash (@code{\}) @subentry @code{\/} escape sequence @item \/ @@ -6713,6 +6753,12 @@ If @code{RS} is any single character, that character separates records. Otherwise (in @command{gawk}), @code{RS} is treated as a regular expression. This mechanism is explained in greater detail shortly. +@quotation NOTE +When @command{gawk} is invoked with the @option{--csv} option, nothing +in this @value{SECTION} applies. @xref{Comma Separated Fields}, for the +details. +@end quotation + @menu * awk split records:: How standard @command{awk} splits records. * gawk split records:: How @command{gawk} splits records. @@ -7359,6 +7405,7 @@ with a statement such as @samp{$1 = $1}, as described earlier. * Default Field Splitting:: How fields are normally separated. * Regexp Field Splitting:: Using regexps as the field separator. * Single Character Fields:: Making each character a separate field. +* Comma Separated Fields:: Working with CSV files. * Command Line Field Separator:: Setting @code{FS} from the command line. * Full Line Fields:: Making the full line be a single field. * Field Splitting Summary:: Some final points and a summary table. @@ -7401,10 +7448,10 @@ is read with the proper separator. To do this, use the special @code{BEGIN} pattern (@pxref{BEGIN/END}). For example, here we set the value of @code{FS} to the string -@code{","}: +@code{":"}: @example -awk 'BEGIN @{ FS = "," @} ; @{ print $2 @}' +awk 'BEGIN @{ FS = ":" @} ; @{ print $2 @}' @end example @cindex @code{BEGIN} pattern @@ -7412,7 +7459,7 @@ awk 'BEGIN @{ FS = "," @} ; @{ print $2 @}' Given the input line: @example -John Q. Smith, 29 Oak St., Walamazoo, MI 42139 +John Q. 
Smith: 29 Oak St.: Walamazoo: MI 42139 @end example @noindent @@ -7428,7 +7475,7 @@ person's name in the example we just used might have a title or suffix attached, such as: @example -John Q. Smith, LXIX, 29 Oak St., Walamazoo, MI 42139 +John Q. Smith: LXIX: 29 Oak St.: Walamazoo: MI 42139 @end example @noindent @@ -7615,6 +7662,93 @@ In compatibility mode if @code{FS} is the null string, then @command{gawk} also behaves this way. +@node Comma Separated Fields +@subsection Working With Comma Separated Value Files + +@cindex comma separated values (CSV) data @subentry records and fields +@cindex CSV (comma separated values) data @subentry records and fields +Many commonly-used tools use a comma to separate fields, instead of whitespace. +This is particularly true of popular spreadsheet programs. There is no +universally accepted standard for the format of these files, although +@uref{http://www.ietf.org/rfc/rfc4180, RFC 4180} lists the common +practices. + +For decades, anyone wishing to work with CSV files and @command{awk} +had to ``roll their own'' solution. +(For an example, @pxref{Splitting By Content}). +In 2023, Brian Kernighan decided to add CSV support to his version of +@command{awk}. In order to keep up, @command{gawk} too provides the same +support as his version. +To use CSV data, invoke @command{gawk} with either of the +@option{-k} or @option{--csv} options. + +Fields in CSV files are separated by commas. In order to allow a comma +to appear inside a field (i.e., as data), the field may be quoted +by beginning and ending it with double quotes. In order to allow a double +quote inside a field, the field @emph{must} be quoted, and two double quotes +represent an actual double quote. +The double quote that starts a quoted field must be the first +character after the comma. +@ref{table-csv-examples} shows some examples. 
+ +@float Table,table-csv-examples +@caption{Examples of CSV data} +@multitable @columnfractions .3 .3 +@headitem Input @tab Field Contents +@item @code{abc def} @tab @code{abc def} +@item @code{"quoted data"} @tab @code{quoted data} +@item @code{"quoted, data"} @tab @code{quoted, data} +@item @code{"She said ""Stop!""."} @tab @code{She said "Stop!".} +@end multitable +@end float + +Additionally, and here's where it gets messy, newlines are also +allowed inside double-quoted fields! +In order to deal with such things, when processing CSV files, +@command{gawk} scans the input data looking for newlines that +are not enclosed in double quotes. Thus, use of the @option{--csv} option +totally overrides normal record processing with @code{RS} (@pxref{Records}), +as well as field splitting with any of @code{FS}, @code{FIELDWIDTHS}, +or @code{FPAT}. + +@cindex Kernighan, Brian @subentry quotes +@sidebar Carriage-Return--Line-Feed Line Endings In CSV Files +@quotation +@code{\r\n} @i{is the invention of the devil.} +@author Brian Kernighan +@end quotation + +Many CSV files are imported from systems where the line terminator +for text files is a carriage-return--line-feed pair +(CR-LF, @samp{\r} followed by @samp{\n}). +For ease of use, when processing CSV files, @command{gawk} converts +CR-LF pairs into a single newline. That is, the @samp{\r} is removed. + +This occurs only when a CR is paired with an LF; a standalone CR +is left alone. This behavior is consistent with Windows systems, +which automatically convert CR-LF in files into a plain LF in memory, +and also with the commonly available @command{dos2unix} utility program. +@end sidebar + +The behavior of the @code{split()} function (not formally discussed +yet, see @ref{String Functions}) differs slightly when processing CSV +files. When called with two arguments +(@samp{split(@var{string}, @var{array})}), @code{split()} +does CSV-based splitting. Otherwise, it behaves normally. 
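The quoting rules summarized in the table above are easy to see in action. The following one-liner is a minimal sketch of the RFC 4180 unquoting rule, written in portable @command{awk}; it handles a single, already-isolated field and is in no way @command{gawk}'s internal @option{--csv} implementation:

```shell
# Sketch of the RFC 4180 unquoting rule from the table above, in
# portable awk.  This illustrates a single already-isolated field;
# it is NOT gawk's --csv implementation (which also handles record
# splitting and embedded newlines).
echo '"She said ""Stop!""."' | awk '{
    s = $0
    if (s ~ /^".*"$/) {
        s = substr(s, 2, length(s) - 2)   # strip the enclosing quotes
        gsub(/""/, "\"", s)               # a doubled quote stands for one quote
    }
    print s
}'
# prints: She said "Stop!".
```

This reproduces the last row of the table: the doubled quotes inside the quoted field collapse to single quotes in the field's contents.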
+ +If @option{--csv} has been used, @code{PROCINFO["CSV"]} will +exist. Otherwise, it will not. @xref{Auto-set}. + +Finally, if @option{--csv} has been used, assigning a value +to any of @code{FS}, @code{FIELDWIDTHS}, @code{FPAT}, or +@code{RS} generates a warning message. + +To be clear, @command{gawk} takes +@uref{http://www.ietf.org/rfc/rfc4180, RFC 4180} as its +specification for CSV input data. There are no mechanisms +for accepting nonstandard CSV data, such as files that use +a semicolon instead of a comma as the separator. + @node Command Line Field Separator @subsection Setting @code{FS} from the Command Line @cindex @option{-F} option @subentry command-line @@ -7797,10 +7931,18 @@ The following list summarizes how fields are split, based on the value of @code{FS} (@samp{==} means ``is equal to''): @table @code +@item @asis{@command{gawk} was invoked with @option{--csv}} +Field splitting follows the rules given in @ref{Comma Separated Fields}. +The value of @code{FS} is ignored. + @item FS == " " Fields are separated by runs of whitespace. Leading and trailing whitespace are ignored. This is the default. +@item FS == "," +Fields are separated by commas, with quoting of fields +and special rules involved. + @item FS == @var{any other single character} Fields are separated by each occurrence of the character. Multiple successive occurrences delimit empty fields, as do leading and @@ -8046,6 +8188,9 @@ four, and @code{$4} has the value @code{"ddd"}. @node Splitting By Content @section Defining Fields by Content +@strong{FIXME}: This whole section needs rewriting now +that @command{gawk} has built-in CSV parsing. Sigh. + @menu * More CSV:: More on CSV files. * FS versus FPAT:: A subtle difference. @@ -8067,7 +8212,7 @@ what they are, and not by what they are not. 
@cindex CSV (comma separated values) data @subentry parsing with @code{FPAT} @cindex Comma separated values (CSV) data @subentry parsing with @code{FPAT} The most notorious such case -is so-called @dfn{comma-separated values} (CSV) data. Many spreadsheet programs, +is comma-separated values (CSV) data. Many spreadsheet programs, for example, can export their data into text files, where each record is terminated with a newline, and fields are separated by commas. If commas only separated the data, there wouldn't be an issue. The problem comes when @@ -8185,6 +8330,13 @@ with @code{FS} and with @code{FIELDWIDTHS}. Finally, the @code{patsplit()} function makes the same functionality available for splitting regular strings (@pxref{String Functions}). +@quotation NOTE +Given that @command{gawk} now has built-in CSV parsing +(@pxref{Comma Separated Fields}), the examples presented here are obsolete. +Nonetheless, they remain useful as examples of what @code{FPAT}-based +field parsing can do. +@end quotation + @node More CSV @subsection More on CSV Files @@ -8303,7 +8455,9 @@ The value is @code{"FS"} if regular field splitting is being used, or @code{"FPAT"} if content-based field splitting is being used: @example -if (PROCINFO["FS"] == "FS") +if ("CSV" in PROCINFO) + @var{CSV-based field splitting} @dots{} +else if (PROCINFO["FS"] == "FS") @var{regular field splitting} @dots{} else if (PROCINFO["FS"] == "FIELDWIDTHS") @var{fixed-width field splitting} @dots{} @@ -8314,7 +8468,7 @@ else @end example This information is useful when writing a function that needs to -temporarily change @code{FS} or @code{FIELDWIDTHS}, read some records, +temporarily change @code{FS}, @code{FIELDWIDTHS}, or @code{FPAT}, read some records, and then restore the original settings (@pxref{Passwd Functions} for an example of such a function). @@ -9414,6 +9568,7 @@ and discusses the @code{close()} built-in function. @command{gawk} allows access to inherited file descriptors. 
* Close Files And Pipes:: Closing Input and Output Files and Pipes. +* Noflush:: Speeding Up Pipe Output. * Nonfatal:: Enabling Nonfatal Output. * Output Summary:: Output summary. * Output Exercises:: Exercises. @@ -10874,8 +11029,53 @@ pipes; thus, the return value cannot be used portably. In POSIX mode (@pxref{Options}), @command{gawk} just returns zero when closing a pipe. +@node Noflush +@section Speeding Up Pipe Output +@c FIXME: Add indexing + +This @value{SECTION} describes a @command{gawk}-specific feature. + +Normally, when you send data down a pipeline to a command with +@code{print} or @code{printf}, @command{gawk} @dfn{flushes} the +output down the pipe. That is, output is not buffered, but +written directly. This assures that pipeline output +intermixed with @command{gawk}'s output comes out in the +expected order: + +@example +print "something" # goes to standard output +print "something else" | "some-command" # also to standard output +print "more stuff" # and this too +@end example + +There can be a price to pay for this; flushing data down +the pipeline uses more CPU time, and in certain environments +this can become expensive. + +You can tell @command{gawk} not to flush buffered data in +one of two ways: + +@itemize @bullet +@item +Set @code{PROCINFO["BUFFERPIPE"]} to any value. When this is done, +@command{gawk} will buffer data for all pipelines. + +@item +Set @code{PROCINFO["@var{command}", "BUFFERPIPE"]} to any value. +In this case, only @var{command}'s data will be fully buffered. +@end itemize + +You @emph{must} create one or the other of these elements +in @code{PROCINFO} before the first @code{print} or +@code{printf} to the pipeline. Doing so after output has +already been sent is too late. + +Be aware that using this feature may change the output behavior of +your programs, so exercise caution. + @node Nonfatal @section Enabling Nonfatal Output +@c FIXME: Add indexing This @value{SECTION} describes a @command{gawk}-specific feature. 
@@ -11162,7 +11362,8 @@ $ @kbd{gawk 'BEGIN @{ print "hello, \} In POSIX mode (@pxref{Options}), @command{gawk} does not allow escaped newlines. Otherwise, it behaves as just described. -BWK @command{awk} and BusyBox @command{awk} +BWK @command{awk}@footnote{In all examples throughout this @value{DOCUMENT}, +@command{nawk} is BWK @command{awk}.} and BusyBox @command{awk} remove the backslash but leave the newline intact, as part of the string: @@ -15815,6 +16016,14 @@ to test for these elements The following elements allow you to change @command{gawk}'s behavior: @table @code +@item PROCINFO["BUFFERPIPE"] +If this element exists, all output to pipelines becomes buffered. +@xref{Noflush}. + +@item PROCINFO["@var{command}", "BUFFERPIPE"] +Make output to @var{command} buffered. +@xref{Noflush}. + @item PROCINFO["NONFATAL"] If this element exists, then I/O errors for all redirections become nonfatal. @xref{Nonfatal}. @@ -18447,6 +18656,13 @@ seps[2] = "-" @noindent The value returned by this call to @code{split()} is three. +If @command{gawk} is invoked with @option{--csv}, then a two-argument +call to @code{split()} splits the string using the CSV parsing rules as +described in @ref{Comma Separated Fields}. With three and four arguments, +@code{split()} works as just described. The four-argument call makes +no sense, since each element of @var{seps} would simply consist of a +string containing a comma. + @cindex differences in @command{awk} and @command{gawk} @subentry @code{split()} function As with input field-splitting, when the value of @var{fieldsep} is @w{@code{" "}}, leading and trailing whitespace is ignored in values assigned to @@ -21018,7 +21234,7 @@ $ @kbd{nawk -v A=1 -f funky.awk} @end example Or @command{awk} could wait until runtime to set the type of @code{a}. 
-In this case, since @code{a} was never assigned used before being +In this case, since @code{a} was never used before being passed to the function, how the function uses it forces the type to be resolved to either scalar or array. @command{gawk} and the MKS @command{awk} do this: @@ -21739,6 +21955,7 @@ programming use. * Readfile Function:: A function to read an entire file at once. * Shell Quoting:: A function to quote strings for the shell. * Isnumeric Function:: A function to test whether a value is numeric. +* To CSV Function:: A function to convert output to CSV format. @end menu @node Strtonum Function @@ -22570,6 +22787,88 @@ the original string. On the other hand, it uses the @code{typeof()} function (@pxref{Type Functions}), which is specific to @command{gawk}. +@node To CSV Function +@subsection Producing CSV Data + +@cindex comma separated values (CSV) data @subentry generating CSV data +@cindex CSV (comma separated values) data @subentry generating CSV data +@command{gawk}'s @option{--csv} option causes @command{gawk} +to process CSV data (@pxref{Comma Separated Fields}). + +But what if you have regular data that you want to output +in CSV format? This @value{SECTION} provides functions for +doing that. + +The first function, @code{tocsv()}, takes an array of data +fields as input. The array should be indexed starting from one. +The optional second parameter is the separator to use. If none +is supplied, the default is a comma. + +The function takes care to quote fields that contain double +quotes, newlines, or the separator character. It then builds +up the final CSV record and returns it. 
+ +@cindex @code{tocsv()} user-defined function +@example +@c file eg/lib/tocsv.awk +# tocsv.awk --- convert data to CSV format +@c endfile +@ignore +@c file eg/lib/tocsv.awk +# +# Arnold Robbins, arnold@@skeeve.com, Public Domain +# April 2023 +@c endfile +@end ignore +@c file eg/lib/tocsv.awk + +function tocsv(fields, sep, i, j, nfields, result) +@{ + if (length(fields) == 0) + return "" + + if (sep == "") + sep = "," + delete nfields + for (i = 1; i in fields; i++) @{ + nfields[i] = fields[i] + if (nfields[i] ~ /["\n]/ || index(nfields[i], sep) != 0) @{ + gsub(/"/, "\"\"", nfields[i]) # double up quotes + nfields[i] = "\"" nfields[i] "\"" # wrap in quotes + @} + @} + + result = nfields[1] + j = length(nfields) + for (i = 2; i <= j; i++) + result = result sep nfields[i] + + return result +@} +@c endfile +@end example + +The next function, @code{tocsv_rec()}, is a wrapper around +@code{tocsv()}. Its intended use is for when you want to convert the +current input record to CSV format. The function itself simply copies +the fields into an array to pass to @code{tocsv()}, which does the work. +It accepts an optional separator character as its first parameter, +which it simply passes on to @code{tocsv()}. + +@cindex @code{tocsv_rec()} user-defined function +@example +@c file eg/lib/tocsv.awk +function tocsv_rec(sep, i, fields) +@{ + delete fields + for (i = 1; i <= NF; i++) + fields[i] = $i + + return tocsv(fields, sep) +@} +@c endfile +@end example + @node Data File Management @section @value{DDF} Management @@ -41095,6 +41394,28 @@ in (@pxref{Auto-set}). @end itemize +Version 5.3 added the following features: + +@itemize +@item +Comma separated value (CSV) field splitting +(@pxref{Comma Separated Fields}). + +@item +The ability to make @command{gawk} buffer output to pipes +(@pxref{Noflush}). + +@item +The @samp{\u} escape sequence +(@pxref{Escape Sequences}). + +@item +The need for GNU @code{libsigsegv} was removed from @command{gawk}. 
+The value-add was never very much and it caused problems in some +environments. + +@end itemize + @c XXX ADD MORE STUFF HERE @end ifclear @@ -41112,10 +41433,12 @@ the three most widely used freely available versions of @command{awk} @headitem Feature @tab BWK @command{awk} @tab @command{mawk} @tab @command{gawk} @tab Now standard @item @code{**} and @code{**=} operators @tab X @tab @tab X @tab @item @samp{\x} escape sequence @tab X @tab X @tab X @tab +@item @samp{\u} escape sequence @tab X @tab @tab X @tab @item @file{/dev/stdin} special file @tab X @tab X @tab X @tab @item @file{/dev/stdout} special file @tab X @tab X @tab X @tab @item @file{/dev/stderr} special file @tab X @tab X @tab X @tab @item @code{BINMODE} variable @tab @tab X @tab X @tab +@item CSV support @tab X @tab @tab X @tab @item @code{FS} as null string @tab X @tab X @tab X @tab @item @code{delete} without subscript @tab X @tab X @tab X @tab X @item @code{fflush()} function @tab X @tab X @tab X @tab X |