From e7be61b04cff735e0e3974103a0c2798b2b123d2 Mon Sep 17 00:00:00 2001 From: Bruno Haible Date: Wed, 29 Mar 2023 00:31:47 +0200 Subject: doc: Document string-desc and related modules. * doc/string-desc.texi: New file. * doc/gnulib.texi (Particular Modules): Include it. --- doc/gnulib.texi | 3 ++ doc/string-desc.texi | 103 +++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 106 insertions(+) create mode 100644 doc/string-desc.texi (limited to 'doc') diff --git a/doc/gnulib.texi b/doc/gnulib.texi index 12766b2d96..3af5cb21b2 100644 --- a/doc/gnulib.texi +++ b/doc/gnulib.texi @@ -6910,6 +6910,7 @@ to POSIX that it can be treated like any other Unix-like platform. * static inline:: * extern inline:: * Closed standard fds:: +* Handling strings with NUL characters:: * Container data types:: * String Functions in C Locale:: * Recognizing Option Arguments:: @@ -6949,6 +6950,8 @@ to POSIX that it can be treated like any other Unix-like platform. @include xstdopen.texi +@include string-desc.texi + @include containers.texi @include c-locale.texi diff --git a/doc/string-desc.texi b/doc/string-desc.texi new file mode 100644 index 0000000000..5813baf47a --- /dev/null +++ b/doc/string-desc.texi @@ -0,0 +1,103 @@ +@node Handling strings with NUL characters +@section Handling strings with NUL characters + +@c Copyright (C) 2023 Free Software Foundation, Inc. + +@c Permission is granted to copy, distribute and/or modify this document +@c under the terms of the GNU Free Documentation License, Version 1.3 or +@c any later version published by the Free Software Foundation; with no +@c Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A +@c copy of the license is at . + +@c Written by Bruno Haible. + +Strings in C are usually represented by a character sequence with a +terminating NUL character. A @samp{char *}, pointer to the first byte +of this character sequence, is what gets passed around as function +argument or return value. + +The major restriction of this string representation is that it cannot +handle strings that contain NUL characters: such strings will appear +shorter than they were meant to be. In most application areas, this is +not a problem, and the @code{char *} type is well usable. + +In areas where strings with embedded NUL characters need to be handled, +the common approach is to use a @code{char *ptr} pointer variable +together with a @code{size_t nbytes} variable (or an @code{idx_t nbytes} +variable, if you want to avoid problems due to integer overflow). This +works fine in code that constructs or manipulates strings with embedded +NUL characters. But when it comes to @emph{storing} them, for example +in an array or as key or value of a hash table, one needs a type that +combines these two fields. + +The Gnulib modules @code{string-desc}, @code{xstring-desc}, and +@code{string-desc-quotearg} provide such a type. We call it a +``string descriptor'' and name it @code{string_desc_t}. + +The type @code{string_desc_t} is a struct that contains a pointer to the +first byte and the number of bytes of the memory region that make up the +string. An additional terminating NUL byte, that may be present in +memory, is not included in this byte count. This type implements the +same concept as @code{std::string_view} in C++, or the @code{String} +type in Java. + +A @code{string_desc_t} can be passed to a function as an argument, or +can be the return value of a function. This is type-safe: If, by +mistake, a programmer passes a @code{string_desc_t} to a function that +expects a @code{char *} argument, or vice versa, or assigns a +@code{string_desc_t} value to a variable of type @code{char *}, or +vice versa, the compiler will report an error. + +Functions related to string descriptors are provided: +@itemize +@item +Side-effect-free operations in @code{"string-desc.h"}, +@item +Memory-allocating operations in @code{"string-desc.h"}, +@item +Memory-allocating operations with out-of-memory checking in +@code{"xstring-desc.h"}, +@item +Operations with side effects in @code{"string-desc.h"}. +@end itemize + +For outputting a string descriptor, the @code{*printf} family of +functions cannot be used directly. A format string directive such as +@code{"%.*s"} would not work: +@itemize +@item +it would stop the output at the first encountered NUL character, +@item +it would require to cast the number of bytes to @code{int}, and thus +would not work for strings longer than @code{INT_MAX} bytes. +@end itemize +@c @noindent Other format string directives don't work either, because +@c the only way to produce a NUL character in @code{*printf}'s output +@c is through a dedicated @code{%c} or @code{%lc} directive. + +Therefore Gnulib offers +@itemize +@item +a function @code{string_desc_fwrite} that outputs a string descriptor to +a @code{FILE} stream, +@item +a function @code{string_desc_write} that outputs a string descriptor to +a file descriptor, +@item +and for those applications where the NUL characters should become +visible as @samp{\0}, a family of @code{quotearg} based functions, that +allow to specify the escaping rules in detail. +@end itemize + +The functionality is thus split across three modules as follows: +@itemize +@item +The module @code{string-desc}, under LGPL, defines the type and +elementary functions. +@item +The module @code{xstring-desc}, under GPL, defines the memory-allocating +functions with out-of-memory checking. +@item +The module @code{string-desc-quotearg}, under GPL, defines the +@code{quotearg} based functions. +@end itemize -- cgit v1.2.1