Diffstat (limited to 'doc/gawktexi.in')
-rw-r--r--  doc/gawktexi.in  347
1 files changed, 335 insertions, 12 deletions
diff --git a/doc/gawktexi.in b/doc/gawktexi.in
index fdb688dd..2802bccf 100644
--- a/doc/gawktexi.in
+++ b/doc/gawktexi.in
@@ -55,8 +55,8 @@
@c These apply across the board.
@set UPDATE-MONTH April, 2023
-@set VERSION 5.2
-@set PATCHLEVEL 2
+@set VERSION 5.3
+@set PATCHLEVEL 0
@set GAWKINETTITLE TCP/IP Internetworking with @command{gawk}
@set GAWKWORKFLOWTITLE Participating in @command{gawk} Development
@@ -68,7 +68,7 @@
@set TITLE GAWK: Effective AWK Programming
@end ifclear
@set SUBTITLE A User's Guide for GNU Awk
-@set EDITION 5.2
+@set EDITION 5.3
@iftex
@set DOCUMENT book
@@ -584,6 +584,7 @@ particular records in a file and perform operations upon them.
* Regexp Field Splitting:: Using regexps as the field separator.
* Single Character Fields:: Making each character a separate
field.
+* Comma Separated Fields:: Working with CSV files.
* Command Line Field Separator:: Setting @code{FS} from the command
line.
* Full Line Fields:: Making the full line be a single
@@ -651,6 +652,7 @@ particular records in a file and perform operations upon them.
Pipes.
* Close Return Value:: Using the return value from
@code{close()}.
+* Noflush:: Speeding Up Pipe Output.
* Nonfatal:: Enabling Nonfatal Output.
* Output Summary:: Output summary.
* Output Exercises:: Exercises.
@@ -833,6 +835,8 @@ particular records in a file and perform operations upon them.
shell.
* Isnumeric Function:: A function to test whether a value is
numeric.
+* To CSV Function:: A function to convert output to CSV
+ format.
* Data File Management:: Functions for managing command-line
data files.
* Filetrans Function:: A function for handling data file
@@ -1490,6 +1494,10 @@ Document minimally and release.
After eight years, add another part @code{egrep} and two
more parts C. Document very well and release.
+
+After 35 more years, add Unicode and CSV support, sprinkle lightly with
+a few choice features from @command{gawk}, document very well again,
+and release.
@end sidebar
@cindex Aho, Alfred
@@ -4083,6 +4091,19 @@ the program. The trace is printed to standard error. Each ``op code''
is preceded by a @code{+}
sign in the output.
+@item @option{-k}
+@itemx @option{--csv}
+@cindex @option{-k} option
+@cindex @option{--csv} option
+@cindex comma separated values (CSV) data @subentry @option{-k} option
+@cindex comma separated values (CSV) data @subentry @option{--csv} option
+@cindex CSV (comma separated values) data @subentry @option{-k} option
+@cindex CSV (comma separated values) data @subentry @option{--csv} option
+Enable special processing for files with comma separated values
+(CSV). @xref{Comma Separated Fields}.
+This option cannot be used with @option{--posix}. Attempting to do so
+causes a fatal error.
+
@item @option{-l} @var{ext}
@itemx @option{--load} @var{ext}
@cindex @option{-l} option
@@ -5555,6 +5576,25 @@ As of @value{PVERSION} 4.2, only two digits
are processed.
@end quotation
+@cindex @code{\} (backslash) @subentry @code{\u} escape sequence
+@cindex backslash (@code{\}) @subentry @code{\u} escape sequence
+@cindex common extensions @subentry @code{\u} escape sequence
+@cindex extensions @subentry common @subentry @code{\u} escape sequence
+@item \u@var{hh}@dots{}
+The hexadecimal value @var{hh}, where @var{hh} stands for a sequence
+of hexadecimal digits (@samp{0}--@samp{9}, and either @samp{A}--@samp{F}
+or @samp{a}--@samp{f}). A maximum of eight digits are allowed after
+the @samp{\u}. Any further hexadecimal digits are treated as simple
+letters or numbers. @value{COMMONEXT}
+(The @samp{\u} escape sequence is not allowed in POSIX awk.)
+
+This escape sequence is intended for designating a character in the
+Unicode character set. @command{gawk} first converts the given digits
+into an integer and then translates the given ``wide character''
+value into the current locale's multibyte encoding (even if that
+is not a Unicode locale). If the given bytes do not represent
+a valid character, the value becomes @code{"?"}.
+
@cindex @code{\} (backslash) @subentry @code{\/} escape sequence
@cindex backslash (@code{\}) @subentry @code{\/} escape sequence
@item \/
@@ -6713,6 +6753,12 @@ If @code{RS} is any single character, that character separates records.
Otherwise (in @command{gawk}), @code{RS} is treated as a regular expression.
This mechanism is explained in greater detail shortly.
+@quotation NOTE
+When @command{gawk} is invoked with the @option{--csv} option, nothing
+in this @value{SECTION} applies. @xref{Comma Separated Fields}, for the
+details.
+@end quotation
+
@menu
* awk split records:: How standard @command{awk} splits records.
* gawk split records:: How @command{gawk} splits records.
@@ -7359,6 +7405,7 @@ with a statement such as @samp{$1 = $1}, as described earlier.
* Default Field Splitting:: How fields are normally separated.
* Regexp Field Splitting:: Using regexps as the field separator.
* Single Character Fields:: Making each character a separate field.
+* Comma Separated Fields:: Working with CSV files.
* Command Line Field Separator:: Setting @code{FS} from the command line.
* Full Line Fields:: Making the full line be a single field.
* Field Splitting Summary:: Some final points and a summary table.
@@ -7401,10 +7448,10 @@ is read with the proper separator. To do this, use the special
@code{BEGIN} pattern
(@pxref{BEGIN/END}).
For example, here we set the value of @code{FS} to the string
-@code{","}:
+@code{":"}:
@example
-awk 'BEGIN @{ FS = "," @} ; @{ print $2 @}'
+awk 'BEGIN @{ FS = ":" @} ; @{ print $2 @}'
@end example
@cindex @code{BEGIN} pattern
@@ -7412,7 +7459,7 @@ awk 'BEGIN @{ FS = "," @} ; @{ print $2 @}'
Given the input line:
@example
-John Q. Smith, 29 Oak St., Walamazoo, MI 42139
+John Q. Smith: 29 Oak St.: Walamazoo: MI 42139
@end example
@noindent
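Run against the sample line, the modified program behaves like this (a quick sketch, assuming a POSIX @command{awk} on your @env{PATH}); note that @code{$2} keeps the blank that follows the first colon:

```shell
# Split the sample line on ":" and print the second field. The leading
# space after the colon is part of the field.
echo 'John Q. Smith: 29 Oak St.: Walamazoo: MI 42139' |
    awk 'BEGIN { FS = ":" } { print $2 }'
```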
@@ -7428,7 +7475,7 @@ person's name in the example we just used might have a title or
suffix attached, such as:
@example
-John Q. Smith, LXIX, 29 Oak St., Walamazoo, MI 42139
+John Q. Smith: LXIX: 29 Oak St.: Walamazoo: MI 42139
@end example
@noindent
@@ -7615,6 +7662,93 @@ In compatibility mode
if @code{FS} is the null string, then @command{gawk} also
behaves this way.
+@node Comma Separated Fields
+@subsection Working With Comma Separated Value Files
+
+@cindex comma separated values (CSV) data @subentry records and fields
+@cindex CSV (comma separated values) data @subentry records and fields
+Many commonly-used tools use a comma to separate fields, instead of whitespace.
+This is particularly true of popular spreadsheet programs. There is no
+universally accepted standard for the format of these files, although
+@uref{http://www.ietf.org/rfc/rfc4180, RFC 4180} lists the common
+practices.
+
+For decades, anyone wishing to work with CSV files and @command{awk}
+had to ``roll their own'' solution.
+(For an example, @pxref{Splitting By Content}).
+In 2023, Brian Kernighan decided to add CSV support to his version of
+@command{awk}. To keep pace, @command{gawk} provides the same
+support as his version.
+To use CSV data, invoke @command{gawk} with either of the
+@option{-k} or @option{--csv} options.
+
+Fields in CSV files are separated by commas. In order to allow a comma
+to appear inside a field (i.e., as data), the field may be quoted
+by beginning and ending it with double quotes. In order to allow a double
+quote inside a field, the field @emph{must} be quoted, and two double quotes
+represent an actual double quote.
+The double quote that starts a quoted field must be the first
+character after the comma.
+@ref{table-csv-examples} shows some examples.
+
+@float Table,table-csv-examples
+@caption{Examples of CSV data}
+@multitable @columnfractions .3 .3
+@headitem Input @tab Field Contents
+@item @code{abc def} @tab @code{abc def}
+@item @code{"quoted data"} @tab @code{quoted data}
+@item @code{"quoted, data"} @tab @code{quoted, data}
+@item @code{"She said ""Stop!""."} @tab @code{She said "Stop!".}
+@end multitable
+@end float
+
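The doubling rule can be undone by hand on awks without @option{--csv}. The following sketch strips the enclosing quotes and un-doubles the embedded ones for a single, fully quoted field; it illustrates the quoting rules only and is not a CSV parser:

```shell
# Reverse the CSV quoting rules for one fully quoted field: drop the outer
# quotes, then turn each "" back into a single ". Not a general CSV parser.
echo '"She said ""Stop!""."' |
    awk '{ gsub(/^"|"$/, ""); gsub(/""/, "\""); print }'
```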
+Additionally, and here's where it gets messy, newlines are also
+allowed inside double-quoted fields!
+In order to deal with such things, when processing CSV files,
+@command{gawk} scans the input data looking for newlines that
+are not enclosed in double quotes. Thus, use of the @option{--csv} option
+totally overrides normal record processing with @code{RS} (@pxref{Records}),
+as well as field splitting with any of @code{FS}, @code{FIELDWIDTHS},
+or @code{FPAT}.
+
+@cindex Kernighan, Brian @subentry quotes
+@sidebar Carriage-Return--Line-Feed Line Endings In CSV Files
+@quotation
+@code{\r\n} @i{is the invention of the devil.}
+@author Brian Kernighan
+@end quotation
+
+Many CSV files are imported from systems where the line terminator
+for text files is a carriage-return--line-feed pair
+(CR-LF, @samp{\r} followed by @samp{\n}).
+For ease of use, when processing CSV files, @command{gawk} converts
+CR-LF pairs into a single newline. That is, the @samp{\r} is removed.
+
+This occurs only when a CR is paired with an LF; a standalone CR
+is left alone. This behavior is consistent with Windows systems,
+which automatically convert CR-LF in files into a plain LF in memory,
+and also with the commonly available @command{dos2unix} utility program.
+@end sidebar
+
+The behavior of the @code{split()} function (not formally discussed
+yet, see @ref{String Functions}) differs slightly when processing CSV
+files. When called with two arguments
+(@samp{split(@var{string}, @var{array})}), @code{split()}
+does CSV-based splitting. Otherwise, it behaves normally.
+
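Without @option{--csv}, the two-argument form keeps its usual meaning: the string is split according to @code{FS}. For example (a sketch using a plain POSIX @command{awk}), the quoted comma gets no special treatment:

```shell
# Two-argument split() without --csv follows FS (here the default,
# whitespace), so the double quotes and the comma are ordinary characters.
awk 'BEGIN {
    n = split("\"quoted, data\",plain", parts)
    print n           # 2: the space inside the quotes is a separator
    print parts[1]
}'
```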
+If @option{--csv} has been used, @code{PROCINFO["CSV"]} will
+exist. Otherwise, it will not. @xref{Auto-set}.
+
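This element makes a runtime feature test straightforward. The following sketch prints the second message when run without @option{--csv} (on awks where @code{PROCINFO} is just an ordinary, empty array, the test is likewise false):

```shell
# Probe PROCINFO["CSV"] to see whether CSV mode is in effect.
awk 'BEGIN {
    if ("CSV" in PROCINFO)
        print "CSV field splitting"
    else
        print "normal field splitting"
}'
```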
+Finally, if @option{--csv} has been used, assigning a value
+to any of @code{FS}, @code{FIELDWIDTHS}, @code{FPAT}, or
+@code{RS} generates a warning message.
+
+To be clear, @command{gawk} takes
+@uref{http://www.ietf.org/rfc/rfc4180, RFC 4180} as its
+specification for CSV input data. There are no mechanisms
+for accepting nonstandard CSV data, such as files that use
+a semicolon instead of a comma as the separator.
+
@node Command Line Field Separator
@subsection Setting @code{FS} from the Command Line
@cindex @option{-F} option @subentry command-line
@@ -7797,10 +7931,18 @@ The following list summarizes how fields are split, based on the value
of @code{FS} (@samp{==} means ``is equal to''):
@table @code
+@item @asis{@command{gawk} was invoked with @option{--csv}}
+Field splitting follows the rules given in @ref{Comma Separated Fields}.
+The value of @code{FS} is ignored.
+
@item FS == " "
Fields are separated by runs of whitespace. Leading and trailing
whitespace are ignored. This is the default.
+@item FS == ","
+Fields are separated by each comma, just as with any other single
+character. The CSV quoting rules described earlier apply only with
+the @option{--csv} option, not when @code{FS} is set to a comma.
+
@item FS == @var{any other single character}
Fields are separated by each occurrence of the character. Multiple
successive occurrences delimit empty fields, as do leading and
@@ -8046,6 +8188,9 @@ four, and @code{$4} has the value @code{"ddd"}.
@node Splitting By Content
@section Defining Fields by Content
+@strong{FIXME}: This whole section needs rewriting now
+that @command{gawk} has built-in CSV parsing. Sigh.
+
@menu
* More CSV:: More on CSV files.
* FS versus FPAT:: A subtle difference.
@@ -8067,7 +8212,7 @@ what they are, and not by what they are not.
@cindex CSV (comma separated values) data @subentry parsing with @code{FPAT}
@cindex Comma separated values (CSV) data @subentry parsing with @code{FPAT}
The most notorious such case
-is so-called @dfn{comma-separated values} (CSV) data. Many spreadsheet programs,
+is comma-separated values (CSV) data. Many spreadsheet programs,
for example, can export their data into text files, where each record is
terminated with a newline, and fields are separated by commas. If
commas only separated the data, there wouldn't be an issue. The problem comes when
@@ -8185,6 +8330,13 @@ with @code{FS} and with @code{FIELDWIDTHS}.
Finally, the @code{patsplit()} function makes the same functionality
available for splitting regular strings (@pxref{String Functions}).
+@quotation NOTE
+Given that @command{gawk} now has built-in CSV parsing
+(@pxref{Comma Separated Fields}), the examples presented here are obsolete.
+Nonetheless, they remain useful as examples of what @code{FPAT}-based
+field parsing can do.
+@end quotation
+
@node More CSV
@subsection More on CSV Files
@@ -8303,7 +8455,9 @@ The value is @code{"FS"} if regular field splitting is being used,
or @code{"FPAT"} if content-based field splitting is being used:
@example
-if (PROCINFO["FS"] == "FS")
+if ("CSV" in PROCINFO)
+ @var{CSV-based field splitting} @dots{}
+else if (PROCINFO["FS"] == "FS")
@var{regular field splitting} @dots{}
else if (PROCINFO["FS"] == "FIELDWIDTHS")
@var{fixed-width field splitting} @dots{}
@@ -8314,7 +8468,7 @@ else
@end example
This information is useful when writing a function that needs to
-temporarily change @code{FS} or @code{FIELDWIDTHS}, read some records,
+temporarily change @code{FS}, @code{FIELDWIDTHS}, or @code{FPAT}, read some records,
and then restore the original settings (@pxref{Passwd Functions} for an
example of such a function).
@@ -9414,6 +9568,7 @@ and discusses the @code{close()} built-in function.
@command{gawk} allows access to inherited file
descriptors.
* Close Files And Pipes:: Closing Input and Output Files and Pipes.
+* Noflush:: Speeding Up Pipe Output.
* Nonfatal:: Enabling Nonfatal Output.
* Output Summary:: Output summary.
* Output Exercises:: Exercises.
@@ -10874,8 +11029,53 @@ pipes; thus, the return value cannot be used portably.
In POSIX mode (@pxref{Options}), @command{gawk} just returns zero
when closing a pipe.
+@node Noflush
+@section Speeding Up Pipe Output
+@c FIXME: Add indexing
+
+This @value{SECTION} describes a @command{gawk}-specific feature.
+
+Normally, when you send data down a pipeline to a command with
+@code{print} or @code{printf}, @command{gawk} @dfn{flushes} the
+output down the pipe. That is, output is not buffered, but
+written directly. This assures, that pipeline output
+intermixed with @command{gawk}'s output comes out in the
+expected order:
+
+@example
+print "something" # goes to standard output
+print "something else" | "some-command" # also to standard output
+print "more stuff" # and this too
+@end example
+
+There can be a price to pay for this; flushing data down
+the pipeline uses more CPU time, and in certain environments
+this can become expensive.
+
+You can tell @command{gawk} not to flush buffered data in
+one of two ways:
+
+@itemize @bullet
+@item
+Set @code{PROCINFO["BUFFERPIPE"]} to any value. When this is done,
+@command{gawk} will buffer data for all pipelines.
+
+@item
+Set @code{PROCINFO["@var{command}", "BUFFERPIPE"]} to any value.
+In this case, only @var{command}'s data will be fully buffered.
+@end itemize
+
+You @emph{must} create one or the other of these elements
+in @code{PROCINFO} before the first @code{print} or
+@code{printf} to the pipeline. Doing so after output has
+already been sent is too late.
+
+Be aware that using this feature may change the output behavior of
+your programs, so exercise caution.
+
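For example, to buffer only the data sent to a single command (a sketch; the element name is as described above, and on versions of @command{gawk} older than 5.3, or on other awks, the @code{PROCINFO} assignment is simply inert):

```shell
# Buffer only the data written to the "sort" pipeline (gawk 5.3 and later).
# The PROCINFO element must exist before the first print to the pipe;
# close() flushes the buffer and waits for sort to finish.
awk 'BEGIN {
    PROCINFO["sort", "BUFFERPIPE"] = 1
    print "banana" | "sort"
    print "apple"  | "sort"
    close("sort")
}'
```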
@node Nonfatal
@section Enabling Nonfatal Output
+@c FIXME: Add indexing
This @value{SECTION} describes a @command{gawk}-specific feature.
@@ -11162,7 +11362,8 @@ $ @kbd{gawk 'BEGIN @{ print "hello, \}
In POSIX mode (@pxref{Options}), @command{gawk} does not
allow escaped newlines. Otherwise, it behaves as just described.
-BWK @command{awk} and BusyBox @command{awk}
+BWK @command{awk}@footnote{In all examples throughout this @value{DOCUMENT},
+@command{nawk} is BWK @command{awk}.} and BusyBox @command{awk}
remove the backslash but leave the newline
intact, as part of the string:
@@ -15815,6 +16016,14 @@ to test for these elements
The following elements allow you to change @command{gawk}'s behavior:
@table @code
+@item PROCINFO["BUFFERPIPE"]
+If this element exists, all output to pipelines becomes buffered.
+@xref{Noflush}.
+
+@item PROCINFO["@var{command}", "BUFFERPIPE"]
+Make output to @var{command} buffered.
+@xref{Noflush}.
+
@item PROCINFO["NONFATAL"]
If this element exists, then I/O errors for all redirections become nonfatal.
@xref{Nonfatal}.
@@ -18447,6 +18656,13 @@ seps[2] = "-"
@noindent
The value returned by this call to @code{split()} is three.
+If @command{gawk} is invoked with @option{--csv}, then a two-argument
+call to @code{split()} splits the string using the CSV parsing rules as
+described in @ref{Comma Separated Fields}. With three and four arguments,
+@code{split()} works as just described. The four-argument call makes
+no sense, since each element of @var{seps} would simply consist of a
+string containing a comma.
+
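The explicit-separator forms are unaffected by @option{--csv}. For instance (a sketch with a POSIX @command{awk}):

```shell
# Three-argument split() with an explicit separator works the same in
# either mode.
awk 'BEGIN {
    n = split("alpha-beta-gamma", a, "-")
    print n, a[2]
}'
```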
@cindex differences in @command{awk} and @command{gawk} @subentry @code{split()} function
As with input field-splitting, when the value of @var{fieldsep} is
@w{@code{" "}}, leading and trailing whitespace is ignored in values assigned to
@@ -21018,7 +21234,7 @@ $ @kbd{nawk -v A=1 -f funky.awk}
@end example
Or @command{awk} could wait until runtime to set the type of @code{a}.
-In this case, since @code{a} was never assigned used before being
+In this case, since @code{a} was never used before being
passed to the function, how the function uses it forces the type to
be resolved to either scalar or array. @command{gawk}
and the MKS @command{awk} do this:
@@ -21739,6 +21955,7 @@ programming use.
* Readfile Function:: A function to read an entire file at once.
* Shell Quoting:: A function to quote strings for the shell.
* Isnumeric Function:: A function to test whether a value is numeric.
+* To CSV Function:: A function to convert output to CSV format.
@end menu
@node Strtonum Function
@@ -22570,6 +22787,88 @@ the original string.
On the other hand, it uses the @code{typeof()} function
(@pxref{Type Functions}), which is specific to @command{gawk}.
+@node To CSV Function
+@subsection Producing CSV Data
+
+@cindex comma separated values (CSV) data @subentry generating CSV data
+@cindex CSV (comma separated values) data @subentry generating CSV data
+@command{gawk}'s @option{--csv} option causes @command{gawk}
+to process CSV data (@pxref{Comma Separated Fields}).
+
+But what if you have regular data that you want to output
+in CSV format? This @value{SECTION} provides functions for
+doing that.
+
+The first function, @code{tocsv()}, takes an array of data
+fields as input. The array should be indexed starting from one.
+The optional second parameter is the separator to use. If none
+is supplied, the default is a comma.
+
+The function takes care to quote fields that contain double
+quotes, newlines, or the separator character. It then builds
+up the final CSV record and returns it.
+
+@cindex @code{tocsv()} user-defined function
+@example
+@c file eg/lib/tocsv.awk
+# tocsv.awk --- convert data to CSV format
+@c endfile
+@ignore
+@c file eg/lib/tocsv.awk
+#
+# Arnold Robbins, arnold@@skeeve.com, Public Domain
+# April 2023
+@c endfile
+@end ignore
+@c file eg/lib/tocsv.awk
+
+function tocsv(fields, sep, i, j, nfields, result)
+@{
+ if (length(fields) == 0)
+ return ""
+
+ if (sep == "")
+ sep = ","
+ delete nfields
+ for (i = 1; i in fields; i++) @{
+ nfields[i] = fields[i]
+ if (nfields[i] ~ /["\n]/ || index(nfields[i], sep) != 0) @{
+ gsub(/"/, "\"\"", nfields[i]) # double up quotes
+ nfields[i] = "\"" nfields[i] "\"" # wrap in quotes
+ @}
+ @}
+
+ result = nfields[1]
+ j = length(nfields)
+ for (i = 2; i <= j; i++)
+ result = result sep nfields[i]
+
+ return result
+@}
+@c endfile
+@end example
+
+The next function, @code{tocsv_rec()}, is a wrapper around
+@code{tocsv()}. Its intended use is for when you want to convert the
+current input record to CSV format. The function itself simply copies
+the fields into an array to pass to @code{tocsv()} which does the work.
+It accepts an optional separator character as its first parameter,
+which it simply passes on to @code{tocsv()}.
+
+@cindex @code{tocsv_rec()} user-defined function
+@example
+@c file eg/lib/tocsv.awk
+function tocsv_rec(sep, i, fields)
+@{
+ delete fields
+ for (i = 1; i <= NF; i++)
+ fields[i] = $i
+
+ return tocsv(fields, sep)
+@}
+@c endfile
+@end example
+
@node Data File Management
@section @value{DDF} Management
@@ -41095,6 +41394,28 @@ in (@pxref{Auto-set}).
@end itemize
+Version 5.3 added the following features:
+
+@itemize
+@item
+Comma separated value (CSV) field splitting
+(@pxref{Comma Separated Fields}).
+
+@item
+The ability to make @command{gawk} buffer output to pipes
+(@pxref{Noflush}).
+
+@item
+The @samp{\u} escape sequence
+(@pxref{Escape Sequences}).
+
+@item
+The need for GNU @code{libsigsegv} was removed from @command{gawk}.
+The value-add was never very much and it caused problems in some
+environments.
+
+@end itemize
+
@c XXX ADD MORE STUFF HERE
@end ifclear
@@ -41112,10 +41433,12 @@ the three most widely used freely available versions of @command{awk}
@headitem Feature @tab BWK @command{awk} @tab @command{mawk} @tab @command{gawk} @tab Now standard
@item @code{**} and @code{**=} operators @tab X @tab @tab X @tab
@item @samp{\x} escape sequence @tab X @tab X @tab X @tab
+@item @samp{\u} escape sequence @tab X @tab @tab X @tab
@item @file{/dev/stdin} special file @tab X @tab X @tab X @tab
@item @file{/dev/stdout} special file @tab X @tab X @tab X @tab
@item @file{/dev/stderr} special file @tab X @tab X @tab X @tab
@item @code{BINMODE} variable @tab @tab X @tab X @tab
+@item CSV support @tab X @tab @tab X @tab
@item @code{FS} as null string @tab X @tab X @tab X @tab
@item @code{delete} without subscript @tab X @tab X @tab X @tab X
@item @code{fflush()} function @tab X @tab X @tab X @tab X