Diffstat (limited to 'doc/gawktexi.in')
-rw-r--r-- | doc/gawktexi.in | 347 |
1 files changed, 335 insertions, 12 deletions
diff --git a/doc/gawktexi.in b/doc/gawktexi.in index fdb688dd..2802bccf 100644 --- a/doc/gawktexi.in +++ b/doc/gawktexi.in @@ -55,8 +55,8 @@ @c These apply across the board. @set UPDATE-MONTH April, 2023 -@set VERSION 5.2 -@set PATCHLEVEL 2 +@set VERSION 5.3 +@set PATCHLEVEL 0 @set GAWKINETTITLE TCP/IP Internetworking with @command{gawk} @set GAWKWORKFLOWTITLE Participating in @command{gawk} Development @@ -68,7 +68,7 @@ @set TITLE GAWK: Effective AWK Programming @end ifclear @set SUBTITLE A User's Guide for GNU Awk -@set EDITION 5.2 +@set EDITION 5.3 @iftex @set DOCUMENT book @@ -584,6 +584,7 @@ particular records in a file and perform operations upon them. * Regexp Field Splitting:: Using regexps as the field separator. * Single Character Fields:: Making each character a separate field. +* Comma Separated Fields:: Working with CSV files. * Command Line Field Separator:: Setting @code{FS} from the command line. * Full Line Fields:: Making the full line be a single @@ -651,6 +652,7 @@ particular records in a file and perform operations upon them. Pipes. * Close Return Value:: Using the return value from @code{close()}. +* Noflush:: Speeding Up Pipe Output. * Nonfatal:: Enabling Nonfatal Output. * Output Summary:: Output summary. * Output Exercises:: Exercises. @@ -833,6 +835,8 @@ particular records in a file and perform operations upon them. shell. * Isnumeric Function:: A function to test whether a value is numeric. +* To CSV Function:: A function to convert output to CSV + format. * Data File Management:: Functions for managing command-line data files. * Filetrans Function:: A function for handling data file @@ -1490,6 +1494,10 @@ Document minimally and release. After eight years, add another part @code{egrep} and two more parts C. Document very well and release. + +After 35 more years, add Unicode and CSV support, sprinkle lightly with +a few choice features from @command{gawk}, document very well again, +and release. 
@end sidebar @cindex Aho, Alfred @@ -4083,6 +4091,19 @@ the program. The trace is printed to standard error. Each ``op code'' is preceded by a @code{+} sign in the output. +@item @option{-k} +@itemx @option{--csv} +@cindex @option{-k} option +@cindex @option{--csv} option +@cindex comma separated values (CSV) data @subentry @option{-k} option +@cindex comma separated values (CSV) data @subentry @option{--csv} option +@cindex CSV (comma separated values) data @subentry @option{-k} option +@cindex CSV (comma separated values) data @subentry @option{--csv} option +Enable special processing for files with comma separated values +(CSV). @xref{Comma Separated Fields}. +This option cannot be used with @option{--posix}. Attempting to do so +causes a fatal error. + @item @option{-l} @var{ext} @itemx @option{--load} @var{ext} @cindex @option{-l} option @@ -5555,6 +5576,25 @@ As of @value{PVERSION} 4.2, only two digits are processed. @end quotation +@cindex @code{\} (backslash) @subentry @code{\u} escape sequence +@cindex backslash (@code{\}) @subentry @code{\u} escape sequence +@cindex common extensions @subentry @code{\u} escape sequence +@cindex extensions @subentry common @subentry @code{\u} escape sequence +@item \u@var{hh}@dots{} +The hexadecimal value @var{hh}, where @var{hh} stands for a sequence +of hexadecimal digits (@samp{0}--@samp{9}, and either @samp{A}--@samp{F} +or @samp{a}--@samp{f}). A maximum of eight digits are allowed after +the @samp{\u}. Any further hexadecimal digits are treated as simple +letters or numbers. @value{COMMONEXT} +(The @samp{\u} escape sequence is not allowed in POSIX awk.) + +This escape sequence is intended for designating a character in the +Unicode character set. @command{gawk} first converts the given digits +into an integer and then translates the given ``wide character'' +value into the current locale's multibyte encoding (even if that +is not a Unicode locale). 
If the given bytes do not represent +a valid character, the value becomes @code{"?"}. + @cindex @code{\} (backslash) @subentry @code{\/} escape sequence @cindex backslash (@code{\}) @subentry @code{\/} escape sequence @item \/ @@ -6713,6 +6753,12 @@ If @code{RS} is any single character, that character separates records. Otherwise (in @command{gawk}), @code{RS} is treated as a regular expression. This mechanism is explained in greater detail shortly. +@quotation NOTE +When @command{gawk} is invoked with the @option{--csv} option, nothing +in this @value{SECTION} applies. @xref{Comma Separated Fields}, for the +details. +@end quotation + @menu * awk split records:: How standard @command{awk} splits records. * gawk split records:: How @command{gawk} splits records. @@ -7359,6 +7405,7 @@ with a statement such as @samp{$1 = $1}, as described earlier. * Default Field Splitting:: How fields are normally separated. * Regexp Field Splitting:: Using regexps as the field separator. * Single Character Fields:: Making each character a separate field. +* Comma Separated Fields:: Working with CSV files. * Command Line Field Separator:: Setting @code{FS} from the command line. * Full Line Fields:: Making the full line be a single field. * Field Splitting Summary:: Some final points and a summary table. @@ -7401,10 +7448,10 @@ is read with the proper separator. To do this, use the special @code{BEGIN} pattern (@pxref{BEGIN/END}). For example, here we set the value of @code{FS} to the string -@code{","}: +@code{":"}: @example -awk 'BEGIN @{ FS = "," @} ; @{ print $2 @}' +awk 'BEGIN @{ FS = ":" @} ; @{ print $2 @}' @end example @cindex @code{BEGIN} pattern @@ -7412,7 +7459,7 @@ awk 'BEGIN @{ FS = "," @} ; @{ print $2 @}' Given the input line: @example -John Q. Smith, 29 Oak St., Walamazoo, MI 42139 +John Q. 
Smith: 29 Oak St.: Walamazoo: MI 42139 @end example @noindent @@ -7428,7 +7475,7 @@ person's name in the example we just used might have a title or suffix attached, such as: @example -John Q. Smith, LXIX, 29 Oak St., Walamazoo, MI 42139 +John Q. Smith: LXIX: 29 Oak St.: Walamazoo: MI 42139 @end example @noindent @@ -7615,6 +7662,93 @@ In compatibility mode if @code{FS} is the null string, then @command{gawk} also behaves this way. +@node Comma Separated Fields +@subsection Working With Comma Separated Value Files + +@cindex comma separated values (CSV) data @subentry records and fields +@cindex CSV (comma separated values) data @subentry records and fields +Many commonly-used tools use a comma to separate fields, instead of whitespace. +This is particularly true of popular spreadsheet programs. There is no +universally accepted standard for the format of these files, although +@uref{http://www.ietf.org/rfc/rfc4180, RFC 4180} lists the common +practices. + +For decades, anyone wishing to work with CSV files and @command{awk} +had to ``roll their own'' solution. +(For an example, @pxref{Splitting By Content}). +In 2023, Brian Kernighan decided to add CSV support to his version of +@command{awk}. In order to keep up, @command{gawk} too provides the same +support as his version. +To use CSV data, invoke @command{gawk} with either of the +@option{-k} or @option{--csv} options. + +Fields in CSV files are separated by commas. In order to allow a comma +to appear inside a field (i.e., as data), the field may be quoted +by beginning and ending it with double quotes. In order to allow a double +quote inside a field, the field @emph{must} be quoted, and two double quotes +represent an actual double quote. +The double quote that starts a quoted field must be the first +character after the comma. +@ref{table-csv-examples} shows some examples. 
+ +@float Table,table-csv-examples +@caption{Examples of CSV data} +@multitable @columnfractions .3 .3 +@headitem Input @tab Field Contents +@item @code{abc def} @tab @code{abc def} +@item @code{"quoted data"} @tab @code{quoted data} +@item @code{"quoted, data"} @tab @code{quoted, data} +@item @code{"She said ""Stop!""."} @tab @code{She said "Stop!".} +@end multitable +@end float + +Additionally, and here's where it gets messy, newlines are also +allowed inside double-quoted fields! +In order to deal with such things, when processing CSV files, +@command{gawk} scans the input data looking for newlines that +are not enclosed in double quotes. Thus, use of the @option{--csv} option +totally overrides normal record processing with @code{RS} (@pxref{Records}), +as well as field splitting with any of @code{FS}, @code{FIELDWIDTHS}, +or @code{FPAT}. + +@cindex Kernighan, Brian @subentry quotes +@sidebar Carriage-Return--Line-Feed Line Endings In CSV Files +@quotation +@code{\r\n} @i{is the invention of the devil.} +@author Brian Kernighan +@end quotation + +Many CSV files are imported from systems where the line terminator +for text files is a carriage-return--line-feed pair +(CR-LF, @samp{\r} followed by @samp{\n}). +For ease of use, when processing CSV files, @command{gawk} converts +CR-LF pairs into a single newline. That is, the @samp{\r} is removed. + +This occurs only when a CR is paired with an LF; a standalone CR +is left alone. This behavior is consistent with Windows systems, +which automatically convert CR-LF in files into a plain LF in memory, +and also with the commonly available @command{dos2unix} utility program. +@end sidebar + +The behavior of the @code{split()} function (not formally discussed +yet, see @ref{String Functions}) differs slightly when processing CSV +files. When called with two arguments +(@samp{split(@var{string}, @var{array})}), @code{split()} +does CSV-based splitting. Otherwise, it behaves normally. 
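The quoting rules summarized in the table above are easy to see in action. The following one-liner is a minimal sketch of the RFC 4180 unquoting rule, written in portable @command{awk}; it handles a single, already-isolated field and is in no way @command{gawk}'s internal @option{--csv} implementation:

```shell
# Sketch of the RFC 4180 unquoting rule from the table above, in
# portable awk.  This illustrates a single already-isolated field;
# it is NOT gawk's --csv implementation (which also handles record
# splitting and embedded newlines).
echo '"She said ""Stop!""."' | awk '{
    s = $0
    if (s ~ /^".*"$/) {
        s = substr(s, 2, length(s) - 2)   # strip the enclosing quotes
        gsub(/""/, "\"", s)               # a doubled quote stands for one quote
    }
    print s
}'
# prints: She said "Stop!".
```

This reproduces the last row of the table: the doubled quotes inside the quoted field collapse to single quotes in the field's contents.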
+ +If @option{--csv} has been used, @code{PROCINFO["CSV"]} will +exist. Otherwise, it will not. @xref{Auto-set}. + +Finally, if @option{--csv} has been used, assigning a value +to any of @code{FS}, @code{FIELDWIDTHS}, @code{FPAT}, or +@code{RS} generates a warning message. + +To be clear, @command{gawk} takes +@uref{http://www.ietf.org/rfc/rfc4180, RFC 4180} as its +specification for CSV input data. There are no mechanisms +for accepting nonstandard CSV data, such as files that use +a semicolon instead of a comma as the separator. + @node Command Line Field Separator @subsection Setting @code{FS} from the Command Line @cindex @option{-F} option @subentry command-line @@ -7797,10 +7931,18 @@ The following list summarizes how fields are split, based on the value of @code{FS} (@samp{==} means ``is equal to''): @table @code +@item @asis{@command{gawk} was invoked with @option{--csv}} +Field splitting follows the rules given in @ref{Comma Separated Fields}. +The value of @code{FS} is ignored. + @item FS == " " Fields are separated by runs of whitespace. Leading and trailing whitespace are ignored. This is the default. +@item FS == "," +Fields are separated by commas, with quoting of fields +and special rules involved. + @item FS == @var{any other single character} Fields are separated by each occurrence of the character. Multiple successive occurrences delimit empty fields, as do leading and @@ -8046,6 +8188,9 @@ four, and @code{$4} has the value @code{"ddd"}. @node Splitting By Content @section Defining Fields by Content +@strong{FIXME}: This whole section needs rewriting now +that @command{gawk} has built-in CSV parsing. Sigh. + @menu * More CSV:: More on CSV files. * FS versus FPAT:: A subtle difference. @@ -8067,7 +8212,7 @@ what they are, and not by what they are not. 
@cindex CSV (comma separated values) data @subentry parsing with @code{FPAT} @cindex Comma separated values (CSV) data @subentry parsing with @code{FPAT} The most notorious such case -is so-called @dfn{comma-separated values} (CSV) data. Many spreadsheet programs, +is comma-separated values (CSV) data. Many spreadsheet programs, for example, can export their data into text files, where each record is terminated with a newline, and fields are separated by commas. If commas only separated the data, there wouldn't be an issue. The problem comes when @@ -8185,6 +8330,13 @@ with @code{FS} and with @code{FIELDWIDTHS}. Finally, the @code{patsplit()} function makes the same functionality available for splitting regular strings (@pxref{String Functions}). +@quotation NOTE +Given that @command{gawk} now has built-in CSV parsing +(@pxref{Comma Separated Fields}), the examples presented here are obsolete. +Nonetheless, they remain useful as examples of what @code{FPAT}-based +field parsing can do. +@end quotation + @node More CSV @subsection More on CSV Files @@ -8303,7 +8455,9 @@ The value is @code{"FS"} if regular field splitting is being used, or @code{"FPAT"} if content-based field splitting is being used: @example -if (PROCINFO["FS"] == "FS") +if ("CSV" in PROCINFO) + @var{CSV-based field splitting} @dots{} +else if (PROCINFO["FS"] == "FS") @var{regular field splitting} @dots{} else if (PROCINFO["FS"] == "FIELDWIDTHS") @var{fixed-width field splitting} @dots{} @@ -8314,7 +8468,7 @@ else @end example This information is useful when writing a function that needs to -temporarily change @code{FS} or @code{FIELDWIDTHS}, read some records, +temporarily change @code{FS}, @code{FIELDWIDTHS}, or @code{FPAT}, read some records, and then restore the original settings (@pxref{Passwd Functions} for an example of such a function). @@ -9414,6 +9568,7 @@ and discusses the @code{close()} built-in function. @command{gawk} allows access to inherited file descriptors. 
* Close Files And Pipes:: Closing Input and Output Files and Pipes. +* Noflush:: Speeding Up Pipe Output. * Nonfatal:: Enabling Nonfatal Output. * Output Summary:: Output summary. * Output Exercises:: Exercises. @@ -10874,8 +11029,53 @@ pipes; thus, the return value cannot be used portably. In POSIX mode (@pxref{Options}), @command{gawk} just returns zero when closing a pipe. +@node Noflush +@section Speeding Up Pipe Output +@c FIXME: Add indexing + +This @value{SECTION} describes a @command{gawk}-specific feature. + +Normally, when you send data down a pipeline to a command with +@code{print} or @code{printf}, @command{gawk} @dfn{flushes} the +output down the pipe. That is, output is not buffered, but +written directly. This assures that pipeline output +intermixed with @command{gawk}'s output comes out in the +expected order: + +@example +print "something" # goes to standard output +print "something else" | "some-command" # also to standard output +print "more stuff" # and this too +@end example + +There can be a price to pay for this; flushing data down +the pipeline uses more CPU time, and in certain environments +this can become expensive. + +You can tell @command{gawk} not to flush buffered data in +one of two ways: + +@itemize @bullet +@item +Set @code{PROCINFO["BUFFERPIPE"]} to any value. When this is done, +@command{gawk} will buffer data for all pipelines. + +@item +Set @code{PROCINFO["@var{command}", "BUFFERPIPE"]} to any value. +In this case, only @var{command}'s data will be fully buffered. +@end itemize + +You @emph{must} create one or the other of these elements +in @code{PROCINFO} before the first @code{print} or +@code{printf} to the pipeline. Doing so after output has +already been sent is too late. + +Be aware that using this feature may change the output behavior of +your programs, so exercise caution. + @node Nonfatal @section Enabling Nonfatal Output +@c FIXME: Add indexing This @value{SECTION} describes a @command{gawk}-specific feature. 
@@ -11162,7 +11362,8 @@ $ @kbd{gawk 'BEGIN @{ print "hello, \} In POSIX mode (@pxref{Options}), @command{gawk} does not allow escaped newlines. Otherwise, it behaves as just described. -BWK @command{awk} and BusyBox @command{awk} +BWK @command{awk}@footnote{In all examples throughout this @value{DOCUMENT}, +@command{nawk} is BWK @command{awk}.} and BusyBox @command{awk} remove the backslash but leave the newline intact, as part of the string: @@ -15815,6 +16016,14 @@ to test for these elements The following elements allow you to change @command{gawk}'s behavior: @table @code +@item PROCINFO["BUFFERPIPE"] +If this element exists, all output to pipelines becomes buffered. +@xref{Noflush}. + +@item PROCINFO["@var{command}", "BUFFERPIPE"] +Make output to @var{command} buffered. +@xref{Noflush}. + @item PROCINFO["NONFATAL"] If this element exists, then I/O errors for all redirections become nonfatal. @xref{Nonfatal}. @@ -18447,6 +18656,13 @@ seps[2] = "-" @noindent The value returned by this call to @code{split()} is three. +If @command{gawk} is invoked with @option{--csv}, then a two-argument +call to @code{split()} splits the string using the CSV parsing rules as +described in @ref{Comma Separated Fields}. With three and four arguments, +@code{split()} works as just described. The four-argument call makes +no sense, since each element of @var{seps} would simply consist of a +string containing a comma. + @cindex differences in @command{awk} and @command{gawk} @subentry @code{split()} function As with input field-splitting, when the value of @var{fieldsep} is @w{@code{" "}}, leading and trailing whitespace is ignored in values assigned to @@ -21018,7 +21234,7 @@ $ @kbd{nawk -v A=1 -f funky.awk} @end example Or @command{awk} could wait until runtime to set the type of @code{a}. 
-In this case, since @code{a} was never assigned used before being +In this case, since @code{a} was never used before being passed to the function, how the function uses it forces the type to be resolved to either scalar or array. @command{gawk} and the MKS @command{awk} do this: @@ -21739,6 +21955,7 @@ programming use. * Readfile Function:: A function to read an entire file at once. * Shell Quoting:: A function to quote strings for the shell. * Isnumeric Function:: A function to test whether a value is numeric. +* To CSV Function:: A function to convert output to CSV format. @end menu @node Strtonum Function @@ -22570,6 +22787,88 @@ the original string. On the other hand, it uses the @code{typeof()} function (@pxref{Type Functions}), which is specific to @command{gawk}. +@node To CSV Function +@subsection Producing CSV Data + +@cindex comma separated values (CSV) data @subentry generating CSV data +@cindex CSV (comma separated values) data @subentry generating CSV data +@command{gawk}'s @option{--csv} option causes @command{gawk} +to process CSV data (@pxref{Comma Separated Fields}). + +But what if you have regular data that you want to output +in CSV format? This @value{SECTION} provides functions for +doing that. + +The first function, @code{tocsv()}, takes an array of data +fields as input. The array should be indexed starting from one. +The optional second parameter is the separator to use. If none +is supplied, the default is a comma. + +The function takes care to quote fields that contain double +quotes, newlines, or the separator character. It then builds +up the final CSV record and returns it. 
+ +@cindex @code{tocsv()} user-defined function +@example +@c file eg/lib/tocsv.awk +# tocsv.awk --- convert data to CSV format +@c endfile +@ignore +@c file eg/lib/tocsv.awk +# +# Arnold Robbins, arnold@@skeeve.com, Public Domain +# April 2023 +@c endfile +@end ignore +@c file eg/lib/tocsv.awk + +function tocsv(fields, sep, i, j, nfields, result) +@{ + if (length(fields) == 0) + return "" + + if (sep == "") + sep = "," + delete nfields + for (i = 1; i in fields; i++) @{ + nfields[i] = fields[i] + if (nfields[i] ~ /["\n]/ || index(nfields[i], sep) != 0) @{ + gsub(/"/, "\"\"", nfields[i]) # double up quotes + nfields[i] = "\"" nfields[i] "\"" # wrap in quotes + @} + @} + + result = nfields[1] + j = length(nfields) + for (i = 2; i <= j; i++) + result = result sep nfields[i] + + return result +@} +@c endfile +@end example + +The next function, @code{tocsv_rec()}, is a wrapper around +@code{tocsv()}. Its intended use is for when you want to convert the +current input record to CSV format. The function itself simply copies +the fields into an array to pass to @code{tocsv()}, which does the work. +It accepts an optional separator character as its first parameter, +which it simply passes on to @code{tocsv()}. + +@cindex @code{tocsv_rec()} user-defined function +@example +@c file eg/lib/tocsv.awk +function tocsv_rec(sep, i, fields) +@{ + delete fields + for (i = 1; i <= NF; i++) + fields[i] = $i + + return tocsv(fields, sep) +@} +@c endfile +@end example + @node Data File Management @section @value{DDF} Management @@ -41095,6 +41394,28 @@ in (@pxref{Auto-set}). @end itemize +Version 5.3 added the following features: + +@itemize +@item +Comma separated value (CSV) field splitting +(@pxref{Comma Separated Fields}). + +@item +The ability to make @command{gawk} buffer output to pipes +(@pxref{Noflush}). + +@item +The @samp{\u} escape sequence +(@pxref{Escape Sequences}). + +@item +The need for GNU @code{libsigsegv} was removed from @command{gawk}. 
+The value-add was never very much and it caused problems in some +environments. + +@end itemize + @c XXX ADD MORE STUFF HERE @end ifclear @@ -41112,10 +41433,12 @@ the three most widely used freely available versions of @command{awk} @headitem Feature @tab BWK @command{awk} @tab @command{mawk} @tab @command{gawk} @tab Now standard @item @code{**} and @code{**=} operators @tab X @tab @tab X @tab @item @samp{\x} escape sequence @tab X @tab X @tab X @tab +@item @samp{\u} escape sequence @tab X @tab @tab X @tab @item @file{/dev/stdin} special file @tab X @tab X @tab X @tab @item @file{/dev/stdout} special file @tab X @tab X @tab X @tab @item @file{/dev/stderr} special file @tab X @tab X @tab X @tab @item @code{BINMODE} variable @tab @tab X @tab X @tab +@item CSV support @tab X @tab @tab X @tab @item @code{FS} as null string @tab X @tab X @tab X @tab @item @code{delete} without subscript @tab X @tab X @tab X @tab X @item @code{fflush()} function @tab X @tab X @tab X @tab X |